AlexNet Filter Groups. Amongst the seminal contributions made by Krizhevsky et al. is the use of ‘filter groups’ in the convolutional layers of a CNN. Although filter groups were introduced out of practical necessity, to divide the work of training a large network across multiple GPUs, their side effects are somewhat surprising. Specifically, the authors observe that independent filter groups learn a separation of responsibility (colour features vs. texture features) that is consistent across different random initializations. Less obvious, and not explicitly stated by the authors, is that AlexNet has approximately 57% fewer connection weights than the corresponding network without filter groups (see Fig. 2). This is because each filter in a grouped convolution spans only the input channels of its own group, which reduces the filter's input channel dimension. Despite the large difference in the number of parameters, both architectures achieve comparable error on ILSVRC; in fact, the smaller grouped network achieves ≈ 1% lower top-5 validation error.
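The effect of grouping on the weight count of a single layer can be sketched as follows. This is a minimal illustration, not the authors' code: the helper `conv_weights` is hypothetical, and the layer sizes (96 input channels, 256 filters, 5×5 kernels, 2 groups) are assumed values chosen to resemble an AlexNet-style second convolutional layer.

```python
def conv_weights(c_in, c_out, kernel, groups=1):
    """Count the weights (biases excluded) of a 2D convolution with filter groups.

    Hypothetical helper for illustration only.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each of the c_out filters spans only c_in // groups input channels.
    return c_out * (c_in // groups) * kernel * kernel


# Assumed, illustrative layer sizes.
full = conv_weights(96, 256, 5)               # no filter groups
grouped = conv_weights(96, 256, 5, groups=2)  # two filter groups

print(full, grouped, grouped / full)
```

With two groups, this one layer needs half the weights of its ungrouped counterpart. The network-wide reduction depends on which layers are grouped and how large each layer is, so this single-layer ratio should not be read as the overall figure quoted above.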

Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet Classification with Deep Convolutional Neural Networks. In: Bartlett, P.L., Pereira, F.C.N., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) NIPS. pp. 1106–1114 (2012)

Deep Roots: Improving CNN Efficiency with Hierarchical Filter Groups

We use hierarchical filter groups to allow the network itself to learn independent filters. By restricting the connectivity between filters on subsequent layers, we force the network to learn filters of limited interdependence.
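The restricted connectivity can be pictured as a block-diagonal mask between the channels of one layer and the filters of the next. The sketch below is only an illustration under assumed sizes (4 input channels, 4 filters, 2 groups), and `group_mask` is a hypothetical helper, not part of any library.

```python
def group_mask(c_in, c_out, groups):
    """Return a c_out x c_in matrix of 0/1 entries.

    Entry [o][i] is 1 when output filter o is connected to input channel i,
    i.e. when both fall in the same filter group. Hypothetical helper.
    """
    in_per_group = c_in // groups
    out_per_group = c_out // groups
    return [
        [1 if o // out_per_group == i // in_per_group else 0 for i in range(c_in)]
        for o in range(c_out)
    ]


# Assumed toy sizes: 4 input channels, 4 filters, 2 groups.
for row in group_mask(4, 4, 2):
    print(row)
```

Every zero in the mask is a connection that does not exist, so filters in one group cannot depend on the channels of another group.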

We explore the effect of using a complex hierarchical arrangement of filter groups in CNNs and show that imposing a structured decrease in the degree of filter grouping with depth, a ‘root’ (inverse tree) topology, can yield more efficient variants of state-of-the-art networks without compromising accuracy. Our method appears to be complementary to existing methods, such as low-dimensional embeddings, and can be used to train deep networks more efficiently than methods that only approximate a pre-trained model’s weights.
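As a rough sketch of how a root topology reduces cost, the snippet below counts the weights of a small stack of convolutions whose number of filter groups decreases with depth. The channel widths, the 3×3 kernel, and the group schedule `[4, 2, 1]` are assumptions made for illustration and are not configurations taken from the paper; `stack_weights` is a hypothetical helper.

```python
def stack_weights(channels, kernel, group_schedule):
    """Sum the weights (biases excluded) of consecutive grouped convolutions.

    channels:       channel width at each layer boundary, e.g. [64, 64, 64, 64]
    group_schedule: number of filter groups for each convolution, in depth order
    Hypothetical helper for illustration only.
    """
    total = 0
    layer_shapes = zip(channels, channels[1:])
    for (c_in, c_out), groups in zip(layer_shapes, group_schedule):
        # Each filter spans only c_in // groups input channels.
        total += c_out * (c_in // groups) * kernel * kernel
    return total


channels = [64, 64, 64, 64]                     # assumed widths
root = stack_weights(channels, 3, [4, 2, 1])    # grouping decreases with depth
dense = stack_weights(channels, 3, [1, 1, 1])   # no filter groups

print(root, dense, root / dense)
```

In this toy setting the early, heavily grouped layers account for most of the saving, while the final ungrouped layer can still combine features from all channels.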