I recently made significant improvements to the performance of the CUDA backend in nnForge v2.0.1:
- Multiple improvements reduce total buffer sizes, which allows running larger chunks (3x larger for ImageNet):
- Taking buffer sizes into account when coloring the graph
- Maxout, ReLU, and MaxSubsampling layers consume much less memory in the CUDA backend
- Action graph is optimized to exclude unnecessary concurrency, taking device width into account
- Migrated to cuDNN v3
- Reusing CUDA streams
- Allocating a chunk of memory for fixed working buffers, which improves performance
- A few bug fixes
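
To illustrate the item about fixed working buffers, here is a minimal sketch of how several buffers can be carved out of one chunk instead of being allocated separately. It is not nnForge's actual code: the function name `plan_offsets` and the 256-byte alignment are assumptions made for the example. The idea is that a single device allocation is made for the returned total, and each buffer is addressed as base pointer plus its offset.

```python
def plan_offsets(sizes, alignment=256):
    """Lay out buffers back to back inside one chunk.

    sizes: list of buffer sizes in bytes.
    alignment: assumed alignment for each buffer start (hypothetical value).
    Returns (offsets, total_bytes) for a single allocation.
    """
    offsets = []
    cursor = 0
    for size in sizes:
        offsets.append(cursor)
        # Round each buffer up to the alignment so the next one starts aligned.
        cursor += (size + alignment - 1) // alignment * alignment
    return offsets, cursor


# Three hypothetical working buffers served by one allocation.
offsets, total_bytes = plan_offsets([100, 300, 256])
```

With this layout the allocator is called once for `total_bytes` rather than once per buffer, which is presumably where the performance gain comes from.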
See the buffer graph coloring for the optimized action graph of a VGG-A-like schema to the right. You can get this and other interesting graphs by specifying the "--debug_mode 1" option.
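
As a rough illustration of what coloring with buffer sizes in mind can mean, here is a small greedy sketch. It is an assumed heuristic, not the algorithm nnForge uses, and the names `color_buffers`, `sizes`, and `conflicts` are hypothetical. Buffers that may be live at the same time are connected by a conflict edge; buffers with the same color share one allocation, whose size is the largest buffer in that color.

```python
def color_buffers(sizes, conflicts):
    """Greedy, size-aware coloring of a buffer conflict graph.

    sizes: dict mapping buffer name -> size in bytes.
    conflicts: iterable of (a, b) pairs that must not share memory.
    Returns (coloring, total_bytes), where coloring maps name -> color index.
    """
    neighbors = {name: set() for name in sizes}
    for a, b in conflicts:
        neighbors[a].add(b)
        neighbors[b].add(a)

    groups = []    # groups[i] holds the buffers assigned to color i
    coloring = {}
    # Largest first: the first member fixes a color's allocation size,
    # so every smaller buffer that joins later costs no extra memory.
    for name in sorted(sizes, key=lambda n: (-sizes[n], n)):
        for color, members in enumerate(groups):
            if not (members & neighbors[name]):
                members.add(name)
                coloring[name] = color
                break
        else:
            groups.append({name})
            coloring[name] = len(groups) - 1

    total_bytes = sum(max(sizes[m] for m in group) for group in groups)
    return coloring, total_bytes


# Hypothetical chain of four buffers where each conflicts with the next.
sizes = {"a": 400, "b": 400, "c": 100, "d": 200}
conflicts = [("a", "b"), ("b", "c"), ("c", "d")]
coloring, total_bytes = color_buffers(sizes, conflicts)
```

A coloring that ignores sizes could pair a small buffer with a large one while leaving another large buffer on its own color; sorting by size tends to group buffers of similar size. Being greedy, it does not guarantee the minimum total.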