2 TFLOPS on a Titan running at 784 MHz... how efficient is it? Let's see: 2 TFLOPS / (784 MHz * 14 SMs * 192 FMA/SM * 2 ops/FMA) = 47% of theoretical peak, which I consider a pretty good number. And there is certainly room for improvement here. Training could also benefit from these improvements; I plan to port these changes to the hessian calculators and updaters soon.
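To make the efficiency figure above concrete, here is a small sketch of the same arithmetic (the clock, SM count, and FMA-per-SM numbers come straight from the text; the helper name is mine):

```python
def peak_flops(clock_hz, sm_count, fma_per_sm, ops_per_fma=2):
    """Theoretical peak: each FMA counts as 2 floating-point ops."""
    return clock_hz * sm_count * fma_per_sm * ops_per_fma

# GeForce GTX Titan figures from the post: 784 MHz, 14 SMs, 192 FMA units/SM
peak = peak_flops(784e6, 14, 192)        # ~4.21 TFLOPS theoretical
efficiency = 2e12 / peak                 # achieved 2 TFLOPS vs. peak
print(f"{efficiency:.0%}")               # -> 47%
```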
Here are all the changes in this release:
- Limited C++11 support added: you can build everything except the CUDA backend - this is due to NVCC not yet supporting C++11
- Improved testing and validation (feed-forward) performance of convolutional layers in the CUDA backend for Kepler, while greatly simplifying the code at the same time
- Improved performance of the max subsampling 2D tester in the CUDA backend. The implementation is still far from optimal
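For readers unfamiliar with the term, "max subsampling 2D" is what most frameworks now call non-overlapping 2D max pooling: each pool_h x pool_w window of the input map is reduced to its maximum. A minimal reference sketch of the operation the CUDA tester computes (the function name and plain-list representation are mine, not nnForge's API):

```python
def max_subsample_2d(x, pool_h=2, pool_w=2):
    """Non-overlapping 2D max subsampling over a 2D list of numbers.

    Output height/width are the input dimensions divided by the pool size;
    any trailing rows/columns that do not fill a full window are dropped.
    """
    out_h = len(x) // pool_h
    out_w = len(x[0]) // pool_w
    return [[max(x[i * pool_h + di][j * pool_w + dj]
                 for di in range(pool_h)
                 for dj in range(pool_w))
             for j in range(out_w)]
            for i in range(out_h)]

print(max_subsample_2d([[1, 2, 3, 4],
                        [5, 6, 7, 8],
                        [9, 10, 11, 12],
                        [13, 14, 15, 16]]))  # -> [[6, 8], [14, 16]]
```

A fast CUDA version would assign one thread per output element and read its window from global (or shared) memory, which is presumably where the remaining optimization headroom lies.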