# Three-Line Summary #
- Since 2014, very deep convolutional networks have become mainstream, yielding substantial gains in various benchmarks.
- The paper explores ways to scale up networks that aim to utilize the added computation as efficiently as possible, through suitably factorized convolutions and aggressive regularization.
- Benchmarked on the ILSVRC 2012 classification challenge validation set, the methods demonstrate substantial gains over the state of the art: 21.2% top-1 and 5.6% top-5 error for single-frame evaluation, using a network with a computational cost of 5 billion multiply-adds per inference and fewer than 25 million parameters.
# Detailed Review #
1. Introduction
- Since the 2012 ImageNet competition, the winning entry "AlexNet" has been successfully applied to a wide variety of computer vision tasks.
- After that, VGGNet and GoogLeNet yielded similarly high performance in the 2014 ILSVRC classification challenge.
- The Inception architecture of GoogLeNet was designed to perform well even under strict constraints on memory and computational budget,
- unlike VGGNet, which has the compelling feature of architectural simplicity but requires a lot of computation.
- Although the complexity of the Inception architecture makes it more difficult to modify the network, the generic structure of Inception-style building blocks is used to scale up convolutional networks in efficient ways.
2. Architecture Design
- We try various architectural experiments on the original Inception module of the GoogLeNet network.
- Factorization into smaller convolutions:
- Convolutions with larger spatial filters (e.g. 5 x 5 or 7 x 7) are replaced by multi-layer stacks with fewer parameters but the same input size and output depth.
- e.g. replacing a 5 x 5 convolution with two layers of 3 x 3 convolutions costs (9 + 9)/25 = 0.72x the computation,
- resulting in a relative gain of 28% from this factorization (see the sketch below).
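A minimal sketch of this factorization (PyTorch, with illustrative channel counts; not the paper's actual code):

```python
# A 5x5 convolution vs. its factorized replacement: two stacked 3x3
# convolutions with the same receptive field, input size, and output depth.
import torch
import torch.nn as nn

in_ch, out_ch = 64, 64  # illustrative, not from the paper

# Original: one 5x5 convolution (25 * in_ch * out_ch weights per position).
conv5x5 = nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2)

# Factorized: two 3x3 convolutions ((9 + 9) * in_ch * out_ch weights),
# i.e. 18/25 = 0.72x the cost -- the 28% relative saving cited above.
factorized = nn.Sequential(
    nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),  # the paper reports ReLU between the two layers beats a linear activation
    nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
)

x = torch.randn(1, in_ch, 35, 35)
assert conv5x5(x).shape == factorized(x).shape  # identical output shape
```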
- Spatial Factorization into Asymmetric Convolutions:
- We can replace any n x n convolution by a 1 x n convolution followed by an n x 1 convolution, and the computational saving increases dramatically as n grows.
- e.g. using a 3 x 1 convolution followed by a 1 x 3 convolution is equivalent to sliding a two-layer network with the same receptive field as a 3 x 3 convolution.
- Still, the two-layer solution is 33% cheaper for the same number of output filters (see the sketch below).
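A minimal sketch of the asymmetric factorization for n = 3 (PyTorch, illustrative channel counts):

```python
# A 3x3 convolution vs. a 3x1 convolution followed by a 1x3 convolution.
# Per output position: 3*3 = 9 weights vs 3 + 3 = 6 weights -> 33% cheaper
# for the same number of output filters.
import torch
import torch.nn as nn

in_ch, out_ch = 64, 64  # illustrative, not from the paper

conv3x3 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

asym = nn.Sequential(
    nn.Conv2d(in_ch, out_ch, kernel_size=(3, 1), padding=(1, 0)),   # n x 1
    nn.ReLU(inplace=True),
    nn.Conv2d(out_ch, out_ch, kernel_size=(1, 3), padding=(0, 1)),  # 1 x n
)

x = torch.randn(1, in_ch, 17, 17)
assert conv3x3(x).shape == asym(x).shape  # same receptive field coverage and output shape
```

The paper notes this factorization does not work well in early layers, but gives very good results on medium grid sizes (feature maps of roughly 12 x 12 to 20 x 20).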
- Efficient Grid Size Reduction:
- In order to reduce the grid size while keeping computational cost low and avoiding a representational bottleneck,
- we can use two parallel stride-2 blocks: a pooling layer P and a convolution layer C,
- whose outputs are concatenated at the end (see the sketch below).
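A minimal sketch of this parallel reduction (PyTorch; the 35 x 35 x 320 → 17 x 17 x 640 shapes follow the paper's example, everything else is illustrative):

```python
# Two parallel stride-2 branches -- a convolution C and a max pooling P --
# whose outputs are concatenated along the channel axis, halving the grid
# without a representational bottleneck.
import torch
import torch.nn as nn

class GridReduction(nn.Module):
    def __init__(self, in_ch, conv_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, conv_ch, kernel_size=3, stride=2)  # branch C
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2)               # branch P

    def forward(self, x):
        # Concatenated output depth = conv_ch + in_ch.
        return torch.cat([self.conv(x), self.pool(x)], dim=1)

x = torch.randn(1, 320, 35, 35)
print(GridReduction(320, 320)(x).shape)  # torch.Size([1, 640, 17, 17])
```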
- Utility of Auxiliary Classifiers:
- we found that the auxiliary classifier did not result in improved convergence early in training
- → Not necessary early on! (the paper argues it instead acts as a regularizer near the end of training)
3. Experimental Result
- Inception-v2: we have provided several design principles to scale up convolutional networks and studied them in the context of the Inception architecture.
- (1) factorized the traditional 7 x 7 convolution into three 3 x 3 convolutions
- (2) 3 factorized inception modules as depicted in figure 5
- (3) 5 factorized inception modules as depicted in figure 6
- (4) 2 inception modules using the grid reduction technique described in figure 7
- Inception-v3: the last line of the Inception-v2 experimental results, which includes all of the cumulative changes, is what is referred to as "Inception-v3" below.
- Training methodology: TensorFlow, NVIDIA Kepler GPUs (50 replicas), batch size 32, 100 epochs, momentum optimizer (decay 0.9), learning rate 0.045 (decayed by 0.94 every 2 epochs).
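A sketch of the stated learning-rate schedule (assumed from the hyperparameters listed above, not the authors' TensorFlow code):

```python
# Exponential decay: base rate 0.045, multiplied by 0.94 every two epochs.
def learning_rate(epoch, base_lr=0.045, decay=0.94, decay_every=2):
    return base_lr * decay ** (epoch // decay_every)

for epoch in (0, 2, 4, 100):
    print(epoch, learning_rate(epoch))
```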
- Each Inception-v2 line shows the result of the cumulative changes, including the highlighted new modification plus all the earlier ones:
- 1) RMSProp Optimizer (momentum -> RMSProp)
- 2) Label smoothing regularization (LSR) (see the sketch after this list)
- 3) Factorized 7 x 7: factorizing the first 7 x 7 convolutional layer into a sequence of 3 x 3 convolutional layers.
- 4) BN-auxiliary: the fully connected layer of the auxiliary classifier is also batch normalized, not just the convolutions.
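A minimal sketch of label smoothing regularization as described in the paper: the one-hot ground truth is mixed with the uniform distribution, q'(k) = (1 - ε) δ_{k,y} + ε/K, with ε = 0.1 and K = 1000 classes for ImageNet (PyTorch; the helper name is mine):

```python
import torch
import torch.nn.functional as F

def lsr_loss(logits, target, eps=0.1):
    """Cross-entropy against the smoothed distribution
    q'(k) = (1 - eps) * [k == y] + eps / K."""
    K = logits.size(1)
    log_probs = F.log_softmax(logits, dim=1)
    smoothed = torch.full_like(log_probs, eps / K)                  # eps/K on every class
    smoothed.scatter_(1, target.unsqueeze(1), 1 - eps + eps / K)    # bulk on the true class
    return -(smoothed * log_probs).sum(dim=1).mean()

logits = torch.randn(4, 1000)           # batch of 4, 1000 ImageNet classes
target = torch.randint(0, 1000, (4,))
print(lsr_loss(logits, target))
```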
- Conclusion:
- Our highest-quality version of Inception-v3 reaches 21.2% top-1 and 5.6% top-5 error for single-crop evaluation on the ILSVRC 2012 classification benchmark.
- This is achieved with a relatively modest increase in computational cost compared to previous networks (2.5x the cost of GoogLeNet, while still much more efficient than VGGNet).
* Reference: Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.