
EFFICIENT CNN ARCHITECTURE DESIGN GUIDED BY VISUALIZATION

Liangqi Zhang, Haibo Shen, Yihao Luo, Xiang Cao, Leixilan Pan, Tianjiang Wang∗ , Qi Feng

School of Computer Science and Technology, Huazhong University of Science and Technology, China
{zhangliangqi, shenhaibo, luoyihao, caoxiang112, d202081082, tjwang, fengqi}@hust.edu.cn

arXiv:2207.10318v1 [cs.CV] 21 Jul 2022

ABSTRACT

Modern efficient Convolutional Neural Networks (CNNs) always use Depthwise Separable Convolutions (DSCs) and Neural Architecture Search (NAS) to reduce the number of parameters and the computational complexity. However, some inherent characteristics of networks are overlooked. Inspired by visualizing feature maps and N×N (N>1) convolution kernels, several guidelines are introduced in this paper to further improve parameter efficiency and inference speed. Based on these guidelines, our parameter-efficient CNN architecture, called VGNetG, achieves better accuracy and lower latency than previous networks with about 30%∼50% parameters reduction. Our VGNetG-1.0MP achieves 67.7% top-1 accuracy with 0.99M parameters and 69.2% top-1 accuracy with 1.14M parameters on the ImageNet classification dataset. Furthermore, we demonstrate that edge detectors can replace learnable depthwise convolution layers to mix features by replacing the N×N kernels with fixed edge detection kernels. Our VGNetF-1.5MP achieves 64.4% (-3.2%) top-1 accuracy, and 66.2% (-1.4%) top-1 accuracy with additional Gaussian kernels.

Index Terms— Visualization, Efficient, Edge detection, Gaussian blur

Fig. 1: Kernels and their distributions of MobileNetV2: (a) #3/s=1, (b) #7/s=2, (c) #11/s=1, (d) #16/s=1, (e) last/s=1; ks=1×3×3 in all cases. The kernels at different stages in the network show distinctly different patterns, like edge detection filter kernels, blur kernels or identity kernels. Here, in #N, N denotes the layer number, s denotes the stride and ks denotes the kernel size.

1. INTRODUCTION

Recently, Convolutional Neural Networks (CNNs) have made great progress in computer vision. Since AlexNet [1], CNN-based methods have focused on designing wider or deeper network architectures for accuracy gains, including VGGNets [2], ResNets [3], and DenseNets [4]. However, computational and storage capacity is always limited. The most prominent approaches to reduce the parameters and computational complexity are based on Depthwise Separable Convolutions [5] and Neural Architecture Search, such as MobileNets [6, 7], ShuffleNets [8, 9], and EfficientNets [10]. Although these approaches have been greatly successful, they also overlook many inherent characteristics of convolutional neural networks.

Visualization is a powerful tool to study neural networks. Visualization of features in a fully trained model [11, 12] lets us see the process of extracting features. Feature Visualization by Optimization [13, 14] explains what a network is looking for. Attribution [15–17] explains what part of an example is responsible for the network activating in a particular way. Visualization can reveal many inherent characteristics of neural networks.

In this paper, we study the characteristics of networks by visualizing the N×N kernels, their distributions, and the feature maps. As shown in Figure 1, the N×N convolution kernels show distinctly different patterns and distributions at different stages of MobileNetV2 [7].

Our VGNets, guided by these visualizations, achieve better accuracy and lower latency than previous models with about 30%∼50% parameters reduction. Specifically, our VGNetG-1.0MP achieves 67.7% top-1 accuracy with 0.997M parameters without strong regularization methods. Furthermore, we demonstrate that edge detectors can replace learnable depthwise convolutions for mixing features between different spatial locations.

∗ Corresponding Author
Fig. 2: The kernels of downsampling layers are similar to low-pass filter kernels. (a) Standard convolution kernels (ResNet-RS 50: #18/64×3×3/s=2) and (b) group convolution kernels (RegNetX-8GF: #24/120×3×3 (partial)/s=2) have one or more salient N×N kernels like blur kernels in the whole M×N×N kernel. (c)(d)(e) Depthwise convolution kernels (MobileNetV2: #4/1×3×3; ShuffleNetV2: #6/1×3×3; EfficientNet-B0: #12/1×5×5) look like blur kernels, especially Gaussian kernels.

Fig. 3: Kernels from the middle layers of networks: (a) MobileNetV2: #11/1×3×3, (b) MobileNetV3-S: #7/1×5×5, (c) EfficientNet-B0: #9/1×5×5, (d) EfficientNet-B7: #29/1×5×5. It can be seen that many convolution kernels are similar to the identity kernel. As a result, we replace some convolution operations with identity mapping to reduce the parameters and the computational complexity.

2. RELATED WORK

DSCs-based Architectures: Depthwise Separable Convolutions were introduced in [5] and subsequently used in efficient convolutional neural networks [6–10]. DSCs factorize a standard convolution into a lightweight depthwise convolution for spatial filtering and a heavier 1×1 convolution, called a pointwise convolution, for feature generation. Typical 3×3 depthwise separable convolutions use between 8 and 9 times less computation than standard convolutions at only a small reduction in accuracy.
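The factorization can be illustrated with a short, hedged PyTorch sketch (not from the paper); the channel counts and BN/ReLU placement here are placeholders. For a 3×3 kernel, the multiply-add ratio against a standard convolution is roughly 1/out_ch + 1/9, which is where the 8 to 9 times figure comes from.

```python
# Illustrative sketch (not the authors' code): a 3x3 depthwise separable convolution.
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        # Depthwise: one 3x3 filter per input channel (groups=in_ch) for spatial filtering.
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        # Pointwise: a 1x1 convolution that mixes channels and generates features.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# Cost ratio vs. a standard 3x3 convolution is roughly 1/out_ch + 1/9,
# i.e. about 8-9x fewer multiply-adds when out_ch is large.
x = torch.randn(1, 32, 56, 56)
print(DepthwiseSeparableConv(32, 64)(x).shape)  # torch.Size([1, 64, 56, 56])
```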
MobileNetV1 [6] employs DSCs to substantially improve computational efficiency. MobileNetV2 [7] introduced the inverted residual block, which takes as input a low-dimensional compressed representation that is first expanded to a high dimension and filtered with a lightweight depthwise convolution; features are subsequently projected back to a low-dimensional representation with a linear convolution. ShuffleNetV1 [8] and ShuffleNetV2 [9] utilize group convolutions and channel shuffle operations to further reduce the complexity. MobileNetV3 [18], RegNets [19] and EfficientNets [10] build upon the inverted residual block structure by introducing lightweight attention modules [20] based on squeeze and excitation into the bottleneck structure.
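For reference, the inverted residual block just described can be sketched as follows; this is a generic MobileNetV2-style rendering with an assumed expansion factor of 6, not the authors' implementation.

```python
# Illustrative sketch: a MobileNetV2-style inverted residual block
# (expand 1x1 -> depthwise 3x3 -> linear 1x1 projection).
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1, expand: int = 6):
        super().__init__()
        hidden = in_ch * expand
        self.use_residual = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),                # expand to high dimension
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,  # lightweight depthwise filtering
                      groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),               # linear projection back down
            nn.BatchNorm2d(out_ch))

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out

x = torch.randn(1, 24, 56, 56)
print(InvertedResidual(24, 24)(x).shape)  # torch.Size([1, 24, 56, 56])
```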
Nyquist-Shannon Sampling Theorem and Shift-Invariance: As reported in [21], the convolutional architecture does not give invariance, since the architecture ignores the classical sampling theorem: small input shifts or translations can cause drastic changes in the output. Even though blurring before subsampling is sufficient for avoiding aliasing in linear systems, the presence of nonlinearities may introduce aliasing even when blur is applied before subsampling. [22] integrates extra classic anti-aliasing to improve the shift equivariance of deep networks.

Feature Visualization and Attribution: Visualization is a powerful tool to study neural networks. Visualization of features in a fully trained model [11, 12] lets us see the process of extracting features. Activation Maximization [11, 13] and Class Maximization [14] explain what a network is looking for. Saliency Maps [15], Guided Backpropagation [16], and Grad-CAM [17] explain what part of an example is responsible for the network activating in a particular way. Network visualization can also give us many intuitive inspirations for designing new architectures.

3. CHARACTERISTICS OF CNNS AND GUIDELINES

In this section, we study three typical kinds of networks, constructed with (i) standard convolutions, such as ResNet-RS, (ii) group convolutions, such as RegNet, and (iii) depthwise separable convolutions, such as MobileNets, ShuffleNetV2 and EfficientNets. These visualizations demonstrate that M×N×N kernels have distinctly different patterns and distributions at different stages of networks. What follows are the characteristics of CNNs and the resulting guidelines.

3.1. CNNs can learn to satisfy the sampling theorem

Previous works [21, 22] assumed that convolutional neural networks ignore the classical sampling theorem, but we found that convolutional neural networks can satisfy the sampling theorem to some extent by learning low-pass filters. This holds especially for the DSCs-based networks such as MobileNetV1 and EfficientNets, as shown in Figure 2.
Standard convolutions / group convolutions: As shown in Figures 2a and 2b, there are one or more salient N×N kernels like blur kernels in the whole M×N×N kernels, and this phenomenon also means that the parameters of these layers are redundant. Note that the salient kernels do not necessarily look like Gaussian kernels.

Depthwise separable convolutions: The kernels of strided DSCs are usually similar to Gaussian kernels, including but not limited to MobileNetV1, MobileNetV2, MobileNetV3, ShuffleNetV2, ReXNet, and EfficientNets. In addition, the distributions of strided-DSC kernels are not Gaussian distributions but Gaussian mixture distributions.

Kernels of the last convolution layers: Modern CNNs always use global pooling layers before the classifier to reduce the dimension. Therefore, a similar phenomenon also appears in the last depthwise convolution layers, as shown in Figure 4.

Fig. 4: The M×N×N kernels of the last convolution layers show a similar phenomenon: (a) MobileNetV2: ks=1×3×3, (b) ShuffleNetV2: ks=1×3×3, (c) RegNetX-800MF: ks=16×3×3.

These visualizations indicate that we should choose depthwise convolutions rather than standard convolutions and group convolutions in the downsampling layers and the last layers. Further, we could use fixed Gaussian kernels in the downsampling layers.
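A minimal sketch of such a fixed, stride-2 Gaussian depthwise filter is shown below; the 3×3 binomial kernel values are an assumption on my part (the paper's exact kernels are given in its Appendix D).

```python
# Illustrative sketch (assumed kernel values): a stride-2 depthwise convolution
# frozen to a 3x3 Gaussian (binomial) blur, acting as a fixed low-pass downsampling filter.
import torch
import torch.nn as nn

def gaussian_downsample(channels: int) -> nn.Conv2d:
    # Standard 3x3 binomial approximation of a Gaussian kernel, normalized to sum to 1.
    k = torch.tensor([[1., 2., 1.],
                      [2., 4., 2.],
                      [1., 2., 1.]]) / 16.0
    conv = nn.Conv2d(channels, channels, kernel_size=3, stride=2,
                     padding=1, groups=channels, bias=False)
    conv.weight.data.copy_(k.expand(channels, 1, 3, 3))
    conv.weight.requires_grad_(False)  # fixed, unlearnable kernel
    return conv

x = torch.randn(1, 28, 112, 112)
print(gaussian_downsample(28)(x).shape)  # torch.Size([1, 28, 56, 56])
```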


3.2. Reuse feature maps between adjacent layers

Identity kernels and similar feature maps: As shown in Figure 3, in the middle part of networks many depthwise convolution kernels only have a large value at the center, like the identity kernel. Convolutions with identity kernels lead to feature-map duplication and computational redundancy, since the inputs are just passed to the next layer. On the other hand, Figure 6 shows that many feature maps are similar (duplicated) between adjacent layers.

Fig. 6: Similar feature maps from adjacent layers. (a) Left: the input. Right: feature maps from MobileNetV1. (b) Left: the input. Right: feature maps from RegNetX-400MF. Each row shows feature maps output from the same layer. Here, in #N, N denotes the layer number or channel number.

As a result, we could replace part of the convolutions with identity mapping. Moreover, depthwise convolutions are slow in early layers since they often cannot fully utilize modern accelerators, as reported in [9]. So this method can improve both parameter efficiency and inference time.

3.3. Edge detectors as learnable depthwise convolutions

Edge features contain important information about images. As shown in Figure 5, a large part of the kernels approximate edge detection kernels, like the Sobel filter kernels and the Laplacian filter kernels. The proportion of such kernels decreases in the later layers, while the proportion of kernels that look like blur kernels increases.

Fig. 5: Convolution kernels that are similar to edge detection kernels. (a) #8/1×3×3 and (c) #6/1×3×3 are from MobileNetV1. (b) Sobel filter kernels. (d) Laplacian filter kernels.

Fig. 7: The 8 unlearnable kernels: (a) 4 EKs, (b) 2 additional EKs and (c) 2 additional GKs, i.e., 6 edge detection kernels (EKs) and 2 Gaussian blur kernels (GKs) in total. More details can be found in Appendix D.

Therefore, the edge detectors could perhaps replace the depthwise convolutions in DSCs-based networks to mix features between different spatial locations. We will demonstrate this by replacing learnable kernels with fixed edge detection kernels.
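To make the idea concrete, here is a hedged sketch of a frozen depthwise layer built from standard Sobel, Laplacian and Gaussian kernels; the actual six EKs and two GKs of Figure 7 are defined in the paper's Appendix D and may differ from these textbook values.

```python
# Illustrative sketch (assumed kernel values): a frozen depthwise convolution that mixes
# spatial information with standard edge-detection and blur kernels, tiled across channels.
import torch
import torch.nn as nn

SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
SOBEL_Y = SOBEL_X.t()
LAPLACIAN = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])
GAUSSIAN = torch.tensor([[1., 2., 1.], [2., 4., 2.], [1., 2., 1.]]) / 16.0

def fixed_edge_detector(channels: int) -> nn.Conv2d:
    bank = torch.stack([SOBEL_X, SOBEL_Y, LAPLACIAN, GAUSSIAN])  # 4 fixed kernels
    reps = -(-channels // bank.shape[0])                          # ceil division
    weight = bank.repeat(reps, 1, 1)[:channels].unsqueeze(1)      # (channels, 1, 3, 3)
    conv = nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False)
    conv.weight.data.copy_(weight)
    conv.weight.requires_grad_(False)                             # unlearnable
    return conv

x = torch.randn(1, 56, 28, 28)
print(fixed_edge_detector(56)(x).shape)  # torch.Size([1, 56, 28, 28])
```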
Input | Operator | Type | Stride | Channels | Layers
224² | Conv2d | - | 2 | 3 | 1
112² | DownsamplingBlock | blur | 2 | 28 | 1
56² | HalfIdentityBlock | - | 1 | 56 | 3
56² | DownsamplingBlock | blur | 2 | 56 | 1
28² | HalfIdentityBlock | - | 1 | 112 | 6
28² | DownsamplingBlock | blur | 2 | 112 | 1
14² | HalfIdentityBlock | - | 1 | 224 | 12
14² | DownsamplingBlock | blur | 2 | 224 | 1
7² | HalfIdentityBlock | - | 1 | 368 | 1
7² | SharedDWConv2d, t=8 | - | 1 | 368 | 1
7² | PointwiseBlock | - | 1 | 368 | 1
7² | AvgPool2d | - | - | 368 | 1

Table 1: VGNetG-1.0MP network. HalfIdentityBlock and DownsamplingBlock are described in Figure 8. SharedDWConv2d shares the same depthwise convolution kernels t times.

Variant | DownsamplingBlock N×N kernels | HalfIdentityBlock N×N kernels
VGNetC | random & learnable | random & learnable
VGNetG | GKs | random & learnable
VGNetF | GKs | EKs & GKs

Table 2: Variants of VGNet. EKs denotes edge detection kernels; GKs denotes Gaussian kernels. The EKs and GKs are shown in Figure 7.

4. NETWORK ARCHITECTURE

Following these guidelines, we design our parameter-efficient architecture and study the function of depthwise convolutions. Note that only pointwise convolutions are followed by ReLU and batch normalization, for better accuracy and inference speed.

Fig. 8: (a) DownsamplingBlock for downsampling and expanding the channels (3×3 DW with stride 2, 1×1 Conv with ReLU/BN, identity, concat). (b) HalfIdentityBlock for increasing the depth (identity, 3×3 DW, 1×1 Conv with ReLU/BN, concat).

DownsamplingBlock: The DownsamplingBlock halves the resolution and expands the number of channels. As shown in Figure 8a, only the expanded channels are generated by the pointwise convolutions, so the existing features are reused. The kernels of its depthwise convolutions can be randomly initialized or set to fixed Gaussian kernels.

HalfIdentityBlock: As shown in Figure 8b, we replace half of the depthwise convolutions with identity mapping and remove half of the pointwise convolutions while keeping the width of the block. Note that the right half of the input channels becomes the left half of the output channels for better feature reuse.
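The following is a rough, hedged sketch of the two blocks based on the description above and Figure 8; the split/concat layout and the placement of BN/ReLU are my reading of the text, and the released architecture may differ.

```python
# Rough sketch of the two blocks as described for Figure 8 (my reading of the text;
# normalization/activation placement and other details may differ from the authors' code).
import torch
import torch.nn as nn

class DownsamplingBlock(nn.Module):
    """Halve the resolution; only the expanded channels come from the pointwise conv."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.dw = nn.Conv2d(in_ch, in_ch, 3, stride=2, padding=1, groups=in_ch, bias=False)
        self.pw = nn.Sequential(nn.Conv2d(in_ch, out_ch - in_ch, 1, bias=False),
                                nn.BatchNorm2d(out_ch - in_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        x = self.dw(x)                        # downsampled features, reused directly
        return torch.cat([x, self.pw(x)], 1)  # expanded channels from the 1x1 conv

class HalfIdentityBlock(nn.Module):
    """Half of the channels pass through unchanged; the other half are re-generated."""
    def __init__(self, ch: int):
        super().__init__()
        half = ch // 2
        self.dw = nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False)
        self.pw = nn.Sequential(nn.Conv2d(half, half, 1, bias=False),
                                nn.BatchNorm2d(half), nn.ReLU(inplace=True))

    def forward(self, x):
        left, right = x.chunk(2, dim=1)
        # The right half of the input becomes the left half of the output (feature reuse).
        return torch.cat([right, self.pw(self.dw(left))], 1)

x = torch.randn(1, 28, 112, 112)
y = DownsamplingBlock(28, 56)(x)       # -> (1, 56, 56, 56)
print(HalfIdentityBlock(56)(y).shape)  # torch.Size([1, 56, 56, 56])
```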
VGNet Architecture: Using the DownsamplingBlock and HalfIdentityBlock, we build our VGNets under a limit on the number of parameters. The overall VGNetG-1.0MP architecture is listed in Table 1.

Variants of VGNet: To further study the impact of the N×N kernels, several variants of VGNets are introduced: VGNetC, VGNetG, and VGNetF. VGNetC: all parameters are randomly initialized and learnable. VGNetG: all parameters are randomly initialized and learnable except the kernels of the DownsamplingBlock. VGNetF: all parameters of the depthwise convolutions are fixed. The details can be found in Table 2.

5. EXPERIMENTS

In this section, we present our experimental setup and the main results on ImageNet (the results on CIFAR-100 can be found in Appendix B).

5.1. ImageNet Classification

The ImageNet ILSVRC2012 dataset contains about 1.28M training images and 50,000 validation images with 1000 classes. We emphasize that VGNetG models are trained with no regularization except weight decay and label smoothing, while most networks use various enhancements, such as deep supervision, Cutout, DropPath, AutoAugment, RandAugment, and so on.

Training setup: Our ImageNet training settings are as follows: SGD optimizer with momentum 0.9; mini-batch size of 512; weight decay 1e-4; initial learning rate 0.2 with 5 warmup epochs; batch normalization with momentum 0.9; cosine learning rate decay for 300 epochs; label smoothing 0.1; the biases and the α and β in BN layers are left unregularized. All training is done at resolution 224.
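A sketch of this recipe in PyTorch is given below, assuming a generic `model`; the per-epoch scheduling granularity and the warmup start factor are assumptions, since the paper does not spell them out.

```python
# Sketch of the reported training configuration (assumes a `model`; the authors' actual
# training script is not reproduced here).
import torch
import torch.nn as nn

def build_optimizer_and_schedule(model: nn.Module, epochs: int = 300, warmup_epochs: int = 5):
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # Biases and BN affine parameters (alpha/beta) are left unregularized.
        if p.ndim == 1 or name.endswith(".bias"):
            no_decay.append(p)
        else:
            decay.append(p)
    optimizer = torch.optim.SGD(
        [{"params": decay, "weight_decay": 1e-4},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=0.2, momentum=0.9)
    # Linear warmup for the first epochs, then cosine decay to the end of training.
    warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.01,
                                               total_iters=warmup_epochs)
    cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer,
                                                        T_max=epochs - warmup_epochs)
    scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, [warmup, cosine],
                                                      milestones=[warmup_epochs])
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
    return optimizer, scheduler, criterion
```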
Results: Table 3 shows the performance comparison on ImageNet. Our VGNetGs achieve better accuracy and parameter efficiency than the MobileNet series and ShuffleNetV2, with more than 30% parameters reduction and lower inference latency. In particular, our VGNetG-1.0MP achieves 67.7% top-1 accuracy with less than 1M parameters and 69.2% top-1 accuracy with 1.14M parameters.

Model | SE | Top-1 Acc.(%) | Top-5 Acc.(%) | Params (M) | Ratio-to VGNetG | GPU Speed (batches/sec.) | Infer-time (ms)
MobileNet V2×0.5 [7] | | 63.9 | 85.1 | 1.969 | 1.97× | 209 | 4.77
MobileNet V3 Small† [18] | X | 67.4 | - | 2.543 | 1.69× | 196 | 5.09
VGNetG-1.0MP (ours) | | 66.6 | 87.1 | 0.997 | 1.00× | 228 | 4.37
VGNetG-1.0MP+SiLU (ours)† | | 67.7 | 87.9 | 0.997 | 1.00× | 226 | 4.41
VGNetG-1.0MP+SE (ours) | X | 69.2 | 88.7 | 1.143 | 1.14× | 107 | 9.31
MobileNet V2×0.75 [7] | | 69.8 | 89.6 | 2.636 | 1.75× | 194 | 5.15
ShuffleNet V2×1.0 [9] | | 69.4 | 88.0 | 2.279 | 1.52× | 162 | 6.15
VGNetG-1.5MP (ours) | | 69.3 | 88.8 | 1.502 | 1.00× | 222 | 4.50
VGNetG-1.5MP+SE (ours) | X | 71.3 | 90.1 | 1.702 | 1.13× | 104 | 9.55
MobileNet V1×1.0 [6] | | 70.6 | 88.2 | 4.232 | 2.11× | 201 | 4.97
MobileNet V2×1.0 [7] | | 72.0 | 91.0 | 3.505 | 1.75× | 170 | 5.87
MobileNet V3 Large×0.75† | X | 73.3 | - | 3.994 | 1.99× | 166 | 6.01
VGNetG-2.0MP (ours) | | 71.3 | 90.0 | 2.006 | 1.00× | 224 | 4.45
VGNetG-2.0MP+SE (ours) | X | 73.5 | 91.4 | 2.345 | 1.17× | 109 | 9.14
ShuffleNet V2×1.5 [9] | | 72.6 | - | 3.504 | 1.41× | 157 | 6.34
GhostNet×1.0† [23] | X | 73.9 | 91.4 | 5.183 | 2.08× | 114 | 8.75
RegNetX-400MF [19] | | 72.7 | - | 5.158 | 2.07× | 109 | 9.12
RegNetY-400MF [19] | X | 74.1 | - | 4.344 | 1.74× | 94 | 10.62
VGNetG-2.5MP (ours) | | 72.6 | 90.7 | 2.493 | 1.00× | 200 | 4.98
VGNetG-2.5MP+SE (ours) | X | 74.2 | 91.8 | 2.922 | 1.17× | 97 | 10.31

Table 3: Performance results on ImageNet. Our VGNetGs achieve better accuracy with about 30%∼50% parameters reduction. GPU Speed and Infer-time are measured on an RTX6000 GPU with batch size 16. Params is measured with facebookresearch/fvcore and does not include the unlearnable parameters. The SE blocks are used after every pointwise convolution except the last one. †: uses SiLU (also known as the swish function) or HardSwish.

5.2. Ablation study

In this section, we conduct ablation experiments to gain a better understanding of the impact of the depthwise convolutions. The ablation experiments are performed on ImageNet and CIFAR-100 (see Appendix B).

Kernels of downsampling layers: As shown in Table 4, VGNetG-1.5MP, which uses Gaussian blur kernels in the downsampling layers, achieves 68.0% top-1 accuracy, outperforming VGNetC-1.5MP by 0.4%.

Kernels of depthwise convolutions: As shown in Table 4, VGNetF4, which uses only 2 Sobel kernels and 2 Laplacian filter kernels instead of all the depthwise convolution kernels, has about a 3% reduction in accuracy. The result indicates that CNNs can use edge detectors as learnable depthwise convolutions to mix features. As mentioned, the last few layers have more kernels like blur kernels, so VGNetF2, which uses additional Gaussian kernels, achieves better accuracy than VGNetF4.

VGNetF1 and VGNetF3, whose last depthwise convolution kernels remain learnable, only have a 0.8% reduction in accuracy.

Non-linearities and training epochs: Table 5 compares the performance of VGNetG-1.5MP according to the number of training epochs and non-linearities. It can be seen that more epochs and the SiLU non-linearity give better performance.

Model (1.5MP) | Downsampling Kernels | N×N Kernels | Last DSC kernels | Top-1 Acc.(%)
VGNetC | learnable | learnable | learnable | 67.6
VGNetG | GKs | learnable | learnable | 68.0
VGNetF1 | GKs | 6 EKs+2 GKs | learnable | 66.8
VGNetF2 | GKs | 6 EKs+2 GKs | 6 EKs+2 GKs | 66.2
VGNetF3 | GKs | 4 EKs | learnable | 66.1
VGNetF4 | GKs | 4 EKs | 4 EKs | 64.4

Table 4: Ablation study for ImageNet classification. EK: edge detection kernel; GK: Gaussian kernel.

Model | SiLU | Epochs | Top-1 Acc.(%)
VGNetG-1.5MP | | 100 | 68.0
VGNetG-1.5MP | X | 100 | 69.1
VGNetG-1.5MP | | 300 | 69.3

Table 5: Impact of non-linearities and training epochs.

6. DISCUSSION

We demonstrated that edge detectors can take the place of learnable depthwise convolution layers in CNNs, but how these edge features are used is still unclear.

Fig. 9: MobileNetV3 Large: 3×3 kernels of the second and third depthwise convolution layers. As shown in the figure, almost half of the kernels are zero (red ones).

Moreover, as shown in Figure 9, MobileNetV3 Small and Large have almost half zero N×N kernels in the front layers. Perhaps we could reduce more parameters and computational complexity in the front layers.

7. CONCLUSION

In this paper, we designed a parameter-efficient CNN architecture guided by visualizing the convolution kernels and feature maps. Based on these visualizations, we proposed VGNets, a new family of smaller and faster neural networks for image recognition. Our VGNets achieve better accuracy with about 30%∼50% parameters reduction and lower inference latency than previous networks. Finally, we demonstrated that fixed Gaussian kernels and edge detection kernels can replace the learnable depthwise convolution kernels.
8. REFERENCES

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, vol. 2, pp. 1097–1105.

[2] Karen Simonyan and Andrew Zisserman, "Very deep convolutional networks for large-scale image recognition," 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings, pp. 1–14, 2015.

[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2016, vol. 2016-December, pp. 770–778.

[4] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger, "Densely connected convolutional networks," in Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, 2017, vol. 2017-January, pp. 2261–2269.

[5] Laurent Sifre, Rigid-Motion Scattering for Image Classification, PhD thesis, 2014.

[6] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," 2017.

[7] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang Chieh Chen, "MobileNetV2: Inverted Residuals and Linear Bottlenecks," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 4510–4520, 2018.

[8] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun, "ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 6848–6856, 2018.

[9] Ningning Ma, Xiangyu Zhang, Hai Tao Zheng, and Jian Sun, "ShuffleNet V2: Practical guidelines for efficient CNN architecture design," Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 11218 LNCS, pp. 122–138, 2018.

[10] Mingxing Tan and Quoc V. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," 36th International Conference on Machine Learning, ICML 2019, vol. 2019-June, pp. 10691–10700, 2019.

[11] Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent, "Visualizing higher-layer features of a deep network," Bernoulli, no. 1341, pp. 1–13, 2009.

[12] Matthew D. Zeiler and Rob Fergus, "Visualizing and understanding convolutional networks," Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 8689 LNCS, no. PART 1, pp. 818–833, 2014.

[13] Matthew D. Zeiler, Graham W. Taylor, and Rob Fergus, "Adaptive deconvolutional networks for mid and high level feature learning," Proceedings of the IEEE International Conference on Computer Vision, pp. 2018–2025, 2011.

[14] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson, "Understanding Neural Networks Through Deep Visualization," 2015.

[15] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman, "Deep inside convolutional networks: Visualising image classification models and saliency maps," 2nd International Conference on Learning Representations, ICLR 2014 - Workshop Track Proceedings, pp. 1–8, 2014.

[16] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller, "Striving for simplicity: The all convolutional net," 3rd International Conference on Learning Representations, ICLR 2015 - Workshop Track Proceedings, pp. 1–14, 2015.

[17] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra, "Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization," International Journal of Computer Vision, vol. 128, no. 2, pp. 336–359, 2020.

[18] Andrew Howard, Mark Sandler, Bo Chen, Weijun Wang, Liang Chieh Chen, Mingxing Tan, Grace Chu, Vijay Vasudevan, Yukun Zhu, Ruoming Pang, Quoc Le, and Hartwig Adam, "Searching for MobileNetV3," Proceedings of the IEEE International Conference on Computer Vision, vol. 2019-October, pp. 1314–1324, 2019.

[19] Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár, "Designing network design spaces," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 10425–10433, 2020.

[20] Jie Hu, Li Shen, Samuel Albanie, Gang Sun, and Enhua Wu, "Squeeze-and-Excitation Networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 8, pp. 2011–2023, 2020.

[21] Aharon Azulay and Yair Weiss, "Why do deep convolutional networks generalize so poorly to small image transformations?," Journal of Machine Learning Research, vol. 20, 2019.

[22] Richard Zhang, "Making convolutional networks shift-invariant again," 2019, vol. 2019-June.

[23] Kai Han, Yunhe Wang, Qi Tian, Jianyuan Guo, Chunjing Xu, and Chang Xu, "GhostNet: More features from cheap operations," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1577–1586, 2020.
