Maxim Milakov: nnForge v1.1.4

Here is new v1.1.4 release of nnForge. It contains mostly performance improvements but they are rather significant: the performance of convolutional layer in GPU (CUDA) backend improved from 1.15 TFLOPs to 2 TFLOPs for Galaxy Zoo example when running on NVIDIA GeForce GTX Titan, and the whole network performance improved from 1 TFLOPs to 1.55 TFLOPs. See the NSight VSE profiling screenshot:

2 TFLOPs on Titan running at 784 MHz... how efficient is it? Let's see: 2 TFLOPs / (784 Mhz * 14 SM * 192 fma/SM * 2 op/fma) = 47% of theoretical peak, which I consider a pretty good number. And there is certaily room for improvement here. Training could also benefit from these improvements; I plan to port these changes to hessian calculators and updaters soon.

Here are all the changes in this release:

C++11 limited support added: you can build everything except for CUDA backend - this is due to NVCC not yet supporting C++11
Improved testing and validating (feed forward) performance of convolutional layers in CUDA backend for Kepler at the same time greatly simplifying the code
Improved performance of max subsampling 2d tester for CUDA backend. The implementation is far from optimal yet

Maxim Milakov

Pages

Apr 19, 2014

nnForge v1.1.4

No comments:

Post a Comment