Efficient CNN Architecture Design Guided by Visualization
Liangqi Zhang, Haibo Shen, Yihao Luo, Xiang Cao, Leixilan Pan, Tianjiang Wang∗, Qi Feng
School of Computer Science and Technology, Huazhong University of Science and Technology, China
{zhangliangqi, shenhaibo, luoyihao, caoxiang112, d202081082, tjwang, fengqi}@hust.edu.cn
ABSTRACT
Modern efficient Convolutional Neural Networks (CNNs) always use Depthwise Separable Convolutions (DSCs) and Neural Architecture Search (NAS) to reduce the number of parameters and the computational complexity. But some in-
Fig. 3: Kernels from the middle layers of the networks: (a) MobileNetV2: #11/1×3×3, (b) MobileNetV3-S: #7/1×5×5, (c) EfficientNet-B0: #9/1×5×5, (d) EfficientNet-B7: #29/1×5×5. Many convolution kernels are close to the identity kernel; as a result, we replace some convolution operations with identity mappings to reduce the parameters and the computational complexity.
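This observation is easy to reproduce with a short probe. The sketch below is our illustrative PyTorch check, not the paper's measurement code: the function name and the 0.9 threshold are our own choices, and `model` is assumed to be any trained network containing depthwise 3×3 layers.

```python
import torch
import torch.nn as nn

def identity_like_fraction(model, threshold=0.9):
    """Fraction of depthwise 3x3 kernels whose normalized weights are close
    (in |cosine| similarity) to the identity kernel."""
    identity = torch.zeros(3, 3)
    identity[1, 1] = 1.0
    total, identity_like = 0, 0
    for m in model.modules():
        # a depthwise conv has one 3x3 filter per input channel
        if isinstance(m, nn.Conv2d) and m.groups == m.in_channels and m.kernel_size == (3, 3):
            w = m.weight.detach().reshape(-1, 9)                 # (channels, 9)
            w = w / (w.norm(dim=1, keepdim=True) + 1e-8)         # unit-normalize each kernel
            sim = (w @ identity.reshape(9, 1)).abs().squeeze(1)  # |cosine| vs. identity kernel
            identity_like += (sim > threshold).sum().item()
            total += w.shape[0]
    return identity_like / max(total, 1)
```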
Depthwise separable convolutions use between 8 and 9 times less computation than standard convolutions, at only a small reduction in accuracy.
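This factor follows directly from the cost model in [6]. For D_K×D_K kernels, M input channels, N output channels, and a D_F×D_F feature map, the ratio of the DSC cost to the standard convolution cost is:

```latex
\frac{C_{\mathrm{DSC}}}{C_{\mathrm{std}}}
  = \frac{D_K^2 \, M \, D_F^2 + M N D_F^2}{D_K^2 \, M N D_F^2}
  = \frac{1}{N} + \frac{1}{D_K^2}
```

For 3×3 kernels (D_K = 3) and a large number of output channels N, the ratio approaches 1/9, which gives the 8 to 9 times reduction quoted above.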
MobileNetV1 [6] employs DSCs to substantially improve computational efficiency. MobileNetV2 [7] introduced the inverted residual block, which takes as input a low-dimensional compressed representation that is first expanded to a high dimension and filtered with a lightweight depthwise convolution; features are subsequently projected back to a low-dimensional representation with a linear convolution. ShuffleNetV1 [8] and V2 [9] utilize group convolutions and channel shuffle operations to further reduce the complexity. MobileNetV3 [18], RegNets [19], and EfficientNets [10] build upon the inverted residual block by introducing lightweight squeeze-and-excitation attention modules [20] into the bottleneck structure.
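A minimal PyTorch sketch of this inverted residual block follows. The expansion factor of 6 and the ReLU6 non-linearity match the common MobileNetV2 [7] configuration, but the class below is illustrative rather than the reference implementation:

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Expand -> depthwise filter -> linear projection, as in MobileNetV2 [7]."""
    def __init__(self, in_ch, out_ch, expansion=6, stride=1):
        super().__init__()
        hidden = in_ch * expansion
        self.use_residual = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            # 1x1 expansion to a high-dimensional space
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # lightweight 3x3 depthwise convolution (one filter per channel)
            nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # linear 1x1 projection back to a low-dimensional representation
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_residual else y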
Nyquist-Shannon Sampling Theorem and Shift-Invariance As reported in [21], the convolutional architecture does not give shift invariance because it ignores the classical sampling theorem: small input shifts or translations can cause drastic changes in the output. Even though blurring before subsampling is sufficient to avoid aliasing in linear systems, nonlinearities may introduce aliasing even when blurring is applied before subsampling. [22] integrates classic anti-aliasing into deep networks to improve their shift equivariance.
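The following PyTorch sketch shows the blur-before-subsample idea in the spirit of [22]; the specific 3×3 binomial kernel is one of several filter sizes proposed there, and the module name is our own:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlurPool2d(nn.Module):
    """Low-pass filter each channel, then subsample (anti-aliased downsampling, cf. [22])."""
    def __init__(self, channels, stride=2):
        super().__init__()
        self.stride = stride
        # 3x3 binomial kernel [1,2,1]^T [1,2,1] / 16 approximates a Gaussian blur
        k = torch.tensor([1., 2., 1.])
        k = torch.outer(k, k)
        k = k / k.sum()
        self.register_buffer("kernel", k.expand(channels, 1, 3, 3).clone())

    def forward(self, x):
        # depthwise convolution with the fixed blur kernel, then stride-2 sampling
        return F.conv2d(x, self.kernel, stride=self.stride, padding=1,
                        groups=x.shape[1])
```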
Feature Visualization and Attribution Visualization is a powerful tool for studying neural networks. Visualizing the features of a fully trained model [11, 12] lets us see the process of feature extraction. Activation Maximization [11, 13] and Class Maximization [14] explain what a network is looking for. Saliency Maps [15], Guided Backpropagation [16], and Grad-CAM [17] explain what part of an example is responsible for the network activating in a particular way. Network visualization can also give us many intuitive inspirations for designing new architectures.

3. CHARACTERISTICS OF CNNS AND GUIDELINES

In this section, we study three typical kinds of networks, constructed from (i) standard convolutions, such as ResNet-RS; (ii) group convolutions, such as RegNet; and (iii) depthwise separable convolutions, such as MobileNets, ShuffleNetV2, and EfficientNets. The visualizations demonstrate that the M×N×N kernels have distinctly different patterns and distributions at different stages of the networks. What follows are the characteristics of CNNs and the resulting guidelines.

3.1. CNNs can learn to satisfy the sampling theorem

Previous works [21, 22] held that convolutional neural networks ignore the classical sampling theorem, but we found that they can satisfy the sampling theorem to some extent by learning low-pass filters, especially the DSC-based networks such as MobileNetV1 and EfficientNets, as shown in Figure 2.

Standard convolutions/Group convolutions As shown in Figures 2a and 2b, there are one or more salient N×N kernels resembling blur kernels among the whole M×N×N kernels, and this
Fig. 4: The M×N×N kernels of the last convolution layers show a similar phenomenon: (a) MobileNetV2: ks=1×3×3, (b) ShuffleNetV2: ks=1×3×3, (c) RegNetX-800MF: ks=16×3×3.
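To decide whether such a kernel really acts as a blur (low-pass) filter, one can inspect its frequency response. This heuristic check is our own illustration, not a procedure from the paper:

```python
import torch

def looks_like_blur(kernel_3x3):
    """Heuristic low-pass test: for a blur kernel, the DC term (sum of weights)
    dominates the magnitude spectrum of the zero-padded 2-D FFT."""
    spec = torch.fft.fft2(kernel_3x3, s=(16, 16)).abs()
    return bool(spec[0, 0] >= 0.99 * spec.max())
```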
[Figure: building-block diagrams composed of identity mappings, 3×3 depthwise convolutions, and 1×1 convolutions with ReLU/BN.]
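Reading the block-diagram labels, a plausible reconstruction keeps part of the channels as an identity branch and processes the rest with a 3×3 depthwise plus 1×1 convolution. The sketch below is our assumption from the figure, not the paper's exact block definition:

```python
import torch
import torch.nn as nn

class HalfIdentityBlock(nn.Module):
    """Assumed reconstruction of the figure's block: half of the channels pass
    through unchanged (identity), the other half through 3x3 DW + 1x1 conv."""
    def __init__(self, channels):
        super().__init__()
        assert channels % 2 == 0
        half = channels // 2
        self.conv = nn.Sequential(
            nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False),  # 3x3 DW
            nn.Conv2d(half, half, 1, bias=False),                          # 1x1 conv
            nn.BatchNorm2d(half),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        keep, process = x.chunk(2, dim=1)   # identity branch, conv branch
        return torch.cat([keep, self.conv(process)], dim=1)
```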
Table 5: Impact of non-linearities and training epochs.

Model          SiLU   Epochs   Top-1 Acc. (%)
VGNetG-1.5MP          100      68.0
VGNetG-1.5MP   ✓      100      69.1
VGNetG-1.5MP   ✓      300      69.3

6. DISCUSSION

We demonstrated that edge detectors can take the place of learnable depthwise convolution layers in CNNs. But how

7. CONCLUSION

In this paper, we designed a parameter-efficient CNN architecture guided by visualization of the convolution kernels and feature maps. Based on these visualizations, we proposed VGNets, a new family of smaller and faster neural networks for image recognition. Our VGNets achieve better accuracy with about 30%∼50% fewer parameters and lower inference latency than previous networks. Finally, we demonstrated that fixed Gaussian kernels and edge-detection kernels can replace the learnable depthwise convolution kernels.
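To make the last claim concrete, a depthwise layer can be frozen to fixed Gaussian and Sobel kernels instead of learned weights. The kernel assignment below is a sketch under our own assumptions, not the paper's released configuration:

```python
import torch
import torch.nn as nn

def fixed_depthwise(channels):
    """Depthwise 3x3 conv whose kernels are frozen Gaussian blur and Sobel edge
    detectors instead of learned weights."""
    gaussian = torch.tensor([[1., 2., 1.], [2., 4., 2.], [1., 2., 1.]]) / 16.0
    sobel_x  = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    sobel_y  = sobel_x.t()
    bank = torch.stack([gaussian, sobel_x, sobel_y])
    conv = nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False)
    with torch.no_grad():
        for c in range(channels):
            conv.weight[c, 0] = bank[c % 3]   # cycle through the fixed kernels
    conv.weight.requires_grad_(False)         # keep the kernels fixed during training
    return conv
```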
8. REFERENCES

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, vol. 2, pp. 1097–1105.

[2] Karen Simonyan and Andrew Zisserman, "Very deep convolutional networks for large-scale image recognition," 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings, pp. 1–14, 2015.

[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[4] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger, "Densely connected convolutional networks," in Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, 2017, pp. 2261–2269.

[5] Laurent Sifre, Rigid-Motion Scattering for Image Classification, PhD thesis, 2014.

[6] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," 2017.

[7] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 4510–4520, 2018.

[8] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 6848–6856, 2018.

[9] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun, "ShuffleNet V2: Practical guidelines for efficient CNN architecture design," Lecture Notes in Computer Science, vol. 11218 LNCS, pp. 122–138, 2018.

[10] Mingxing Tan and Quoc V. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," 36th International Conference on Machine Learning, ICML 2019, pp. 10691–10700, 2019.

[11] Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent, "Visualizing higher-layer features of a deep network," Technical Report 1341, University of Montreal, pp. 1–13, 2009.

[12] Matthew D. Zeiler and Rob Fergus, "Visualizing and understanding convolutional networks," Lecture Notes in Computer Science, vol. 8689 LNCS, no. PART 1, pp. 818–833, 2014.

[13] Matthew D. Zeiler, Graham W. Taylor, and Rob Fergus, "Adaptive deconvolutional networks for mid and high level feature learning," Proceedings of the IEEE International Conference on Computer Vision, pp. 2018–2025, 2011.

[14] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson, "Understanding neural networks through deep visualization," 2015.

[15] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman, "Deep inside convolutional networks: Visualising image classification models and saliency maps," 2nd International Conference on Learning Representations, ICLR 2014 - Workshop Track Proceedings, pp. 1–8, 2014.

[16] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller, "Striving for simplicity: The all convolutional net," 3rd International Conference on Learning Representations, ICLR 2015 - Workshop Track Proceedings, pp. 1–14, 2015.

[17] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," International Journal of Computer Vision, vol. 128, no. 2, pp. 336–359, 2020.

[18] Andrew Howard, Mark Sandler, Bo Chen, Weijun Wang, Liang-Chieh Chen, Mingxing Tan, Grace Chu, Vijay Vasudevan, Yukun Zhu, Ruoming Pang, Quoc Le, and Hartwig Adam, "Searching for MobileNetV3," Proceedings of the IEEE International Conference on Computer Vision, pp. 1314–1324, 2019.

[19] Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár, "Designing network design spaces," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 10425–10433, 2020.

[20] Jie Hu, Li Shen, Samuel Albanie, Gang Sun, and Enhua Wu, "Squeeze-and-excitation networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 8, pp. 2011–2023, 2020.

[21] Aharon Azulay and Yair Weiss, "Why do deep convolutional networks generalize so poorly to small image transformations?," Journal of Machine Learning Research, vol. 20, 2019.

[22] Richard Zhang, "Making convolutional networks shift-invariant again," 36th International Conference on Machine Learning, ICML 2019, 2019.

[23] Kai Han, Yunhe Wang, Qi Tian, Jianyuan Guo, Chunjing Xu, and Chang Xu, "GhostNet: More features from cheap operations," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1577–1586, 2020.