Going Deeper with Convolutions
Christian Szegedy1, Wei Liu2, Yangqing Jia1, Pierre Sermanet1, Scott Reed3,
Dragomir Anguelov1, Dumitru Erhan1, Vincent Vanhoucke1, Andrew Rabinovich4
1 Google Inc.  2 University of North Carolina, Chapel Hill
3 University of Michigan, Ann Arbor  4 Magic Leap Inc.
{szegedy,jiayq,sermanet,dragomir,dumitru,vanhoucke}@google.com
Abstract

We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22-layer-deep network, the quality of which is assessed in the context of classification and detection.

1. Introduction

In the last three years, our object classification and detection capabilities have dramatically improved due to advances in deep learning and convolutional networks [10]. One encouraging piece of news is that most of this progress is not just the result of more powerful hardware, larger datasets and bigger models, but mainly a consequence of new ideas, algorithms and improved network architectures. For example, the top entries in the ILSVRC 2014 competition used no new data sources beyond the classification dataset of the same competition (for detection purposes). Our GoogLeNet submission to ILSVRC 2014 actually uses 12 times fewer parameters than the winning architecture of Krizhevsky et al. [9] from two years ago, while being significantly more accurate. On the object detection front, the biggest gains have not come from the naive application of bigger and bigger deep networks, but from the synergy of deep architectures and classical computer vision, like the R-CNN algorithm by Girshick et al. [6].

Another notable factor is that with the ongoing traction of mobile and embedded computing, the efficiency of our algorithms, especially their power and memory use, gains importance. It is noteworthy that the considerations leading to the design of the deep architecture presented in this paper included this factor rather than a sheer fixation on accuracy numbers. For most of the experiments, the models were designed to keep a computational budget of 1.5 billion multiply-adds at inference time, so that they do not end up as a purely academic curiosity, but can be put to real-world use, even on large datasets, at a reasonable cost.

In this paper, we will focus on an efficient deep neural network architecture for computer vision, codenamed Inception, which derives its name from the Network in Network paper by Lin et al. [12] in conjunction with the famous "we need to go deeper" internet meme [1]. In our case, the word "deep" is used in two different meanings: first, in the sense that we introduce a new level of organization in the form of the Inception module, and also in the more direct sense of increased network depth. In general, one can view the Inception model as a logical culmination of [12], while taking inspiration and guidance from the theoretical work by Arora et al. [2]. The benefits of the architecture are experimentally verified on the ILSVRC 2014 classification and detection challenges, where it significantly outperforms the current state of the art.
2. Related Work

Starting with LeNet-5 [10], convolutional neural networks (CNNs) have typically had a standard structure: stacked convolutional layers (optionally followed by contrast normalization and max-pooling) followed by one or more fully-connected layers.
4. Architectural Details
The main idea of the Inception architecture is to consider
how an optimal local sparse structure of a convolutional vision network can be approximated and covered by readily
available dense components. Note that assuming translation
invariance means that our network will be built from convolutional building blocks. All we need is to find the optimal
local construction and to repeat it spatially. Arora et al. [2]
suggest a layer-by-layer construction in which one should analyze the correlation statistics of the last layer and cluster
them into groups of units with high correlation. These clusters form the units of the next layer and are connected to
the units in the previous layer. We assume that each unit
from an earlier layer corresponds to some region of the input image and these units are grouped into filter banks. In
the lower layers (the ones close to the input) correlated units
would concentrate in local regions. Thus, we would end up
with a lot of clusters concentrated in a single region and
they can be covered by a layer of 1×1 convolutions in the
next layer, as suggested in [12]. However, one can also
expect that there will be a smaller number of more spatially
spread out clusters that can be covered by convolutions over
larger patches, and there will be a decreasing number of
patches over larger and larger regions. In order to avoid
patch-alignment issues, current incarnations of the Inception architecture are restricted to filter sizes 1×1, 3×3 and 5×5; this decision was based more on convenience than on necessity. It also means that the suggested architecture
is a combination of all those layers with their output filter
banks concatenated into a single output vector forming the
input of the next stage. Additionally, since pooling operations have been essential for the success of current convolutional networks, adding an alternative parallel pooling path in each such stage should have an additional beneficial effect, too (see Figure 2(a)).
As these Inception modules are stacked on top of each
other, their output correlation statistics are bound to vary:
as features of higher abstraction are captured by higher layers, their spatial concentration is expected to decrease. This
suggests that the ratio of 3×3 and 5×5 convolutions should
increase as we move to higher layers.
One big problem with the above modules, at least in this naive form, is that even a modest number of 5×5 convolutions can be prohibitively expensive on top of a convolutional layer with a large number of filters. This problem becomes even more pronounced once pooling units are added to the mix: the number of output filters equals the number of filters in the previous stage, so merging the pooling output with the convolutional outputs would lead to an inevitable increase in the number of outputs from stage to stage. This leads to the second idea of the architecture: judiciously applying dimension reductions, in the form of 1×1 convolutions, before the expensive 3×3 and 5×5 convolutions (see Figure 2(b)).
[Figure 2: (a) Inception module, naive version; (b) Inception module with dimensionality reduction: 1×1 convolutions are inserted before the 3×3 and 5×5 convolutions and after the 3×3 max-pooling.]

For technical reasons (memory efficiency during training), it seemed beneficial to start using Inception modules only at higher layers while keeping
the lower layers in traditional convolutional fashion. This is
not strictly necessary, simply reflecting some infrastructural
inefficiencies in our current implementation.
A useful aspect of this architecture is that it allows for
increasing the number of units at each stage significantly
without an uncontrolled blow-up in computational complexity at later stages. This is achieved by the ubiquitous
use of dimensionality reduction prior to expensive convolutions with larger patch sizes. Furthermore, the design follows the practical intuition that visual information should
be processed at various scales and then aggregated so that
the next stage can abstract features from the different scales
simultaneously.
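To make the savings concrete, here is a back-of-the-envelope multiply-add count in Python for just the 5×5 branch of one 28×28 stage, using the channel counts from the inception (3a) row of Table 1 (a sketch for illustration, not a complete cost model):

```python
# Multiply-adds for the 5x5 branch of a 28x28 stage with 192 input
# channels (inception (3a) in Table 1: #5x5 reduce = 16, #5x5 = 32).
H = W = 28
naive_5x5 = H * W * 5 * 5 * 192 * 32      # 5x5 applied directly: ~120.4M
reduce_1x1 = H * W * 1 * 1 * 192 * 16     # 1x1 reduction to 16 channels: ~2.4M
reduced_5x5 = H * W * 5 * 5 * 16 * 32     # 5x5 on the reduced input: ~10.0M

print(f"naive:   {naive_5x5 / 1e6:.1f}M multiply-adds")
print(f"reduced: {(reduce_1x1 + reduced_5x5) / 1e6:.1f}M multiply-adds")  # ~10x cheaper
```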
The improved use of computational resources allows for
increasing both the width of each stage as well as the number of stages without getting into computational difficulties.
One can utilize the Inception architecture to create slightly inferior, but computationally cheaper versions of it. We have found that all the available knobs and levers allow for a controlled balancing of computational resources, resulting in networks that are 3–10× faster than similarly performing networks with non-Inception architectures; however, this requires careful manual design at this point.
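As an illustration, the following is a minimal sketch of such a module in PyTorch; this is a modern reimplementation for exposition (the original training code used DistBelief), and the class and argument names are ours:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Inception module with dimensionality reduction (Figure 2(b)).

    1x1 convolutions reduce the channel count before the expensive 3x3
    and 5x5 convolutions; a 1x1 projection follows the 3x3 max-pooling.
    All convolutions use rectified linear activation, as in the paper.
    """

    def __init__(self, in_ch, n1x1, n3x3red, n3x3, n5x5red, n5x5, pool_proj):
        super().__init__()
        self.branch1 = nn.Sequential(  # plain 1x1 branch
            nn.Conv2d(in_ch, n1x1, 1), nn.ReLU(inplace=True))
        self.branch2 = nn.Sequential(  # 1x1 reduce, then 3x3
            nn.Conv2d(in_ch, n3x3red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(n3x3red, n3x3, 3, padding=1), nn.ReLU(inplace=True))
        self.branch3 = nn.Sequential(  # 1x1 reduce, then 5x5
            nn.Conv2d(in_ch, n5x5red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(n5x5red, n5x5, 5, padding=2), nn.ReLU(inplace=True))
        self.branch4 = nn.Sequential(  # 3x3 max-pool, then 1x1 projection
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        branches = (self.branch1, self.branch2, self.branch3, self.branch4)
        # Filter concatenation along the channel dimension
        return torch.cat([b(x) for b in branches], dim=1)

# inception (3a): 192 input channels -> 64 + 128 + 32 + 32 = 256 channels
module = InceptionModule(192, 64, 96, 128, 16, 32, 32)
out = module(torch.randn(1, 192, 28, 28))
assert out.shape == (1, 256, 28, 28)
```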
5. GoogLeNet
By the GoogLeNet name we refer to the particular incarnation of the Inception architecture used in our submission for the ILSVRC 2014 competition. We also used one
deeper and wider Inception network with slightly superior
quality, but adding it to the ensemble seemed to improve the
results only marginally. We omit the details of that network,
as empirical evidence suggests that the influence of the exact architectural parameters is relatively minor. Table 1 illustrates the most common instance of Inception used in the
competition. This network (trained with different image-patch sampling methods) was used for 6 out of the 7 models
in our ensemble.
All the convolutions, including those inside the Inception modules, use rectified linear activation. The size of the
receptive field in our network is 224×224 in the RGB color space with zero mean. "#3×3 reduce" and "#5×5 reduce" stand for the number of 1×1 filters in the reduction layer used before the 3×3 and 5×5 convolutions. One can see the number of 1×1 filters in the projection layer after the
built-in max-pooling in the pool proj column. All these reduction/projection layers use rectified linear activation as
well.
The network was designed with computational efficiency
and practicality in mind, so that inference can be run on individual devices, including even those with limited computational resources, especially those with a low memory footprint.
| type           | patch size/stride | output size | depth | #1×1 | #3×3 reduce | #3×3 | #5×5 reduce | #5×5 | pool proj | params | ops  |
|----------------|-------------------|-------------|-------|------|-------------|------|-------------|------|-----------|--------|------|
| convolution    | 7×7/2             | 112×112×64  | 1     |      |             |      |             |      |           | 2.7K   | 34M  |
| max pool       | 3×3/2             | 56×56×64    | 0     |      |             |      |             |      |           |        |      |
| convolution    | 3×3/1             | 56×56×192   | 2     |      | 64          | 192  |             |      |           | 112K   | 360M |
| max pool       | 3×3/2             | 28×28×192   | 0     |      |             |      |             |      |           |        |      |
| inception (3a) |                   | 28×28×256   | 2     | 64   | 96          | 128  | 16          | 32   | 32        | 159K   | 128M |
| inception (3b) |                   | 28×28×480   | 2     | 128  | 128         | 192  | 32          | 96   | 64        | 380K   | 304M |
| max pool       | 3×3/2             | 14×14×480   | 0     |      |             |      |             |      |           |        |      |
| inception (4a) |                   | 14×14×512   | 2     | 192  | 96          | 208  | 16          | 48   | 64        | 364K   | 73M  |
| inception (4b) |                   | 14×14×512   | 2     | 160  | 112         | 224  | 24          | 64   | 64        | 437K   | 88M  |
| inception (4c) |                   | 14×14×512   | 2     | 128  | 128         | 256  | 24          | 64   | 64        | 463K   | 100M |
| inception (4d) |                   | 14×14×528   | 2     | 112  | 144         | 288  | 32          | 64   | 64        | 580K   | 119M |
| inception (4e) |                   | 14×14×832   | 2     | 256  | 160         | 320  | 32          | 128  | 128       | 840K   | 170M |
| max pool       | 3×3/2             | 7×7×832     | 0     |      |             |      |             |      |           |        |      |
| inception (5a) |                   | 7×7×832     | 2     | 256  | 160         | 320  | 32          | 128  | 128       | 1072K  | 54M  |
| inception (5b) |                   | 7×7×1024    | 2     | 384  | 192         | 384  | 48          | 128  | 128       | 1388K  | 71M  |
| avg pool       | 7×7/1             | 1×1×1024    | 0     |      |             |      |             |      |           |        |      |
| dropout (40%)  |                   | 1×1×1024    | 0     |      |             |      |             |      |           |        |      |
| linear         |                   | 1×1×1000    | 1     |      |             |      |             |      |           | 1000K  | 1M   |
| softmax        |                   | 1×1×1000    | 0     |      |             |      |             |      |           |        |      |

Table 1: GoogLeNet incarnation of the Inception architecture.
To encourage discrimination in the intermediate stages and to combat the vanishing gradient problem, auxiliary classifiers are attached to the outputs of the inception (4a) and (4d) modules; during training their losses are added to the total loss with a discount weight of 0.3, and at inference time these auxiliary networks are discarded. The exact structure of the extra network on the side, including the auxiliary classifier, is as follows:

• An average pooling layer with 5×5 filter size and stride 3, resulting in a 4×4×512 output for the (4a) stage and 4×4×528 for the (4d) stage.
• A 1×1 convolution with 128 filters for dimension reduction and rectified linear activation.
• A fully connected layer with 1024 units and rectified linear activation.
• A dropout layer with 70% ratio of dropped outputs.
• A linear layer with softmax loss as the classifier (predicting the same 1000 classes as the main classifier, but removed at inference time).
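A minimal sketch of one such auxiliary head in the same PyTorch style as above, assuming attachment to the 14×14×512 output of inception (4a) (inception (4d), with 528 channels, would get an identical head):

```python
import torch
import torch.nn as nn

# Auxiliary classifier head per the list above, attached to the
# 14x14x512 output of inception (4a).
aux_head = nn.Sequential(
    nn.AvgPool2d(kernel_size=5, stride=3),  # 14x14 -> 4x4
    nn.Conv2d(512, 128, kernel_size=1),     # 1x1 conv, 128 filters
    nn.ReLU(inplace=True),
    nn.Flatten(),                           # 128 * 4 * 4 = 2048 features
    nn.Linear(2048, 1024),                  # fully connected, 1024 units
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.7),                      # 70% dropout
    nn.Linear(1024, 1000),                  # 1000-way linear classifier
)

logits = aux_head(torch.randn(1, 512, 14, 14))
assert logits.shape == (1, 1000)
```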
[Figure 3: GoogLeNet network with all the bells and whistles: the full stack of convolution, max-pooling, local response normalization and Inception (DepthConcat) stages from the input to the main softmax classifier, with the two auxiliary softmax classifiers branching off the intermediate layers.]
6. Training Methodology
GoogLeNet networks were trained using the DistBelief [4] distributed machine learning system, using a modest amount of model and data parallelism. Although we used a CPU-based implementation only, a rough estimate suggests that the GoogLeNet network could be trained to convergence using a few high-end GPUs within a week, the main limitation being the memory usage. Our training used asynchronous stochastic gradient descent with 0.9 momentum [17] and a fixed learning rate schedule (decreasing the learning rate by 4% every 8 epochs). Polyak averaging [13] was used to create the final model used at inference time.
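The schedule is simple enough to state as one line; in the Python sketch below, the base learning rate is an assumption, as it is not reported here:

```python
def learning_rate(epoch, base_lr=0.01):
    """Fixed schedule: decrease the learning rate by 4% every 8 epochs.
    base_lr is a placeholder; the text does not specify its value."""
    return base_lr * 0.96 ** (epoch // 8)
```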
Image sampling methods have changed substantially over the months leading to the competition, and already converged models were trained further with other options, sometimes in conjunction with changed hyperparameters, such as the dropout ratio and the learning rate. Therefore, it is hard to give definitive guidance on the single most effective way to train these networks. To complicate matters further, some of the models were mainly trained on smaller relative crops, others on larger ones, inspired by [8]. Still, one prescription that was verified to work very well after the competition includes sampling of variously sized patches of the image whose size is distributed evenly between 8% and 100% of the image area, with aspect ratio constrained to the interval [3/4, 4/3]. Also, we found that the photometric distortions of Andrew Howard [8] were useful to combat overfitting to the imaging conditions of the training data.
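A sketch of that sampling prescription in Python; the retry loop, the fallback crop, and the final resize to the 224×224 receptive field are our assumptions, not details given above:

```python
import math
import random

def sample_patch(img_w, img_h, rng=random):
    """Sample a crop box with area uniform in [8%, 100%] of the image
    and aspect ratio uniform in [3/4, 4/3]; the crop would then be
    resized to the 224x224 network input."""
    area = img_w * img_h
    for _ in range(10):  # retry until the box fits (assumed policy)
        target_area = rng.uniform(0.08, 1.0) * area
        aspect = rng.uniform(3.0 / 4.0, 4.0 / 3.0)
        w = int(round(math.sqrt(target_area * aspect)))
        h = int(round(math.sqrt(target_area / aspect)))
        if 0 < w <= img_w and 0 < h <= img_h:
            left = rng.randint(0, img_w - w)
            top = rng.randint(0, img_h - h)
            return left, top, w, h
    side = min(img_w, img_h)  # fallback: centered square crop
    return (img_w - side) // 2, (img_h - side) // 2, side, side
```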
7. ILSVRC 2014 Classification Challenge Setup and Results

| Team        | Year | Place | Error (top-5) | Uses external data |
|-------------|------|-------|---------------|--------------------|
| SuperVision | 2012 | 1st   | 16.4%         | no                 |
| SuperVision | 2012 | 1st   | 15.3%         | ImageNet 22k       |
| Clarifai    | 2013 | 1st   | 11.7%         | no                 |
| Clarifai    | 2013 | 1st   | 11.2%         | ImageNet 22k       |
| MSRA        | 2014 | 3rd   | 7.35%         | no                 |
| VGG         | 2014 | 2nd   | 7.32%         | no                 |
| GoogLeNet   | 2014 | 1st   | 6.67%         | no                 |

Table 2: Classification performance.
| Number of models | Number of crops | Cost | Top-5 error | Compared to base |
|------------------|-----------------|------|-------------|------------------|
| 1                | 1               | 1    | 10.07%      | base             |
| 1                | 10              | 10   | 9.15%       | -0.92%           |
| 1                | 144             | 144  | 7.89%       | -2.18%           |
| 7                | 1               | 7    | 8.09%       | -1.98%           |
| 7                | 10              | 70   | 7.62%       | -2.45%           |
| 7                | 144             | 1008 | 6.67%       | -3.45%           |

Table 3: GoogLeNet classification performance breakdown.
8. ILSVRC 2014 Detection Challenge Setup and Results

| Team            | Year | Place | mAP   | External data | Ensemble | Approach       |
|-----------------|------|-------|-------|---------------|----------|----------------|
| UvA-Euvision    | 2013 | 1st   | 22.6% | none          | ?        | Fisher vectors |
| Deep Insight    | 2014 | 3rd   | 40.5% | ImageNet 1k   | 3        | CNN            |
| CUHK DeepID-Net | 2014 | 2nd   | 40.7% | ImageNet 1k   | ?        | CNN            |
| GoogLeNet       | 2014 | 1st   | 43.9% | ImageNet 1k   | 6        | CNN            |

Table 4: Comparison of detection performances. Unreported values are noted with question marks.
| Team             | mAP    | Contextual model | Bounding box regression |
|------------------|--------|------------------|-------------------------|
| Trimps-Soushen   | 31.6%  | no               | ?                       |
| Berkeley Vision  | 34.5%  | no               | yes                     |
| UvA-Euvision     | 35.4%  | ?                | ?                       |
| CUHK DeepID-Net2 | 37.7%  | no               | ?                       |
| GoogLeNet        | 38.02% | no               | no                      |
| Deep Insight     | 40.2%  | yes              | yes                     |

Table 5: Single model performance for detection. Unreported values are noted with question marks.
9. Conclusions
Our results yield solid evidence that approximating the expected optimal sparse structure by readily available dense building blocks is a viable method for improving neural networks for computer vision. The main advantage of this method is a significant quality gain at a modest increase of computational requirements compared to shallower and narrower architectures. Our object detection work was competitive despite neither utilizing context nor performing bounding box regression, which provides further evidence of the strength of the Inception architecture.
References

[1] Know your meme: We need to go deeper. http://knowyourmeme.com/memes/we-need-to-go-deeper. Accessed: 2014-09-15.
[2] S. Arora, A. Bhaskara, R. Ge, and T. Ma. Provable bounds for learning some deep representations. CoRR, abs/1310.6343, 2013.
[3] Ü. V. Çatalyürek, C. Aykanat, and B. Uçar. On two-dimensional sparse matrix partitioning: Models, methods, and a recipe. SIAM J. Sci. Comput., 32(2):656–683, Feb. 2010.
[4] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng. Large scale distributed deep networks. In P. Bartlett, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, NIPS, pages 1232–1240. 2012.
[5] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In CVPR, 2014.
[6] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition, 2014. CVPR 2014. IEEE Conference on, 2014.
[7] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.
[8] A. G. Howard. Some improvements on deep convolutional neural network based image classification. CoRR, abs/1312.5402, 2013.
[9] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012.
[10] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Comput., 1(4):541–551, Dec. 1989.
[11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[12] M. Lin, Q. Chen, and S. Yan. Network in network. CoRR, abs/1312.4400, 2013.
[13] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838–855, July 1992.
[14] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. CoRR, abs/1312.6229, 2013.
[15] T. Serre, L. Wolf, S. M. Bileschi, M. Riesenhuber, and T. Poggio. Robust object recognition with cortex-like mechanisms. IEEE Trans. Pattern Anal. Mach. Intell., 29(3):411–426, 2007.
[16] F. Song and J. Dongarra. Scaling up matrix computations on shared-memory manycore systems with 1000 CPU cores. In Proceedings of the 28th ACM International Conference on Supercomputing, ICS '14, pages 333–342, New York, NY, USA, 2014. ACM.
[17] I. Sutskever, J. Martens, G. E. Dahl, and G. E. Hinton. On the importance of initialization and momentum in deep learning. In ICML, volume 28 of JMLR Proceedings, pages 1139–1147. JMLR.org, 2013.
[18] C. Szegedy, A. Toshev, and D. Erhan. Deep neural networks for object detection. In C. J. C. Burges, L. Bottou, Z. Ghahramani, and K. Q. Weinberger, editors, NIPS, pages 2553–2561, 2013.
[19] A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. CoRR, abs/1312.4659, 2013.
[20] K. E. A. van de Sande, J. R. R. Uijlings, T. Gevers, and A. W. M. Smeulders. Segmentation as selective search for object recognition. In Proceedings of the 2011 International Conference on Computer Vision, ICCV '11, pages 1879–1886, Washington, DC, USA, 2011. IEEE Computer Society.
[21] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In D. J. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors, ECCV, volume 8689 of Lecture Notes in Computer Science, pages 818–833. Springer, 2014.