
[Inception V3] Szegedy et al., 2016, Rethinking the Inception Architecture for Computer Vision

by 펄서까투리 2022. 4. 11.

# Three-Line Summary #

  • Since 2014 very deep convolutional networks started to become mainstream, yielding substantial gains in various benchmarks.
  • We are exploring ways to scale up networks in ways that aim at utilizing the added computation as efficiently as possible by suitably factorized convolutions and aggressive regularization.
  • We benchmark our methods on the ILSVRC 2012 classification challenge validation set, demonstrating substantial gains over the state of the art: 21.2% top-1 and 5.6% top-5 error for single-frame evaluation, using a network with a computational cost of 5 billion multiply-adds per inference and fewer than 25 million parameters.

 

# Detailed Review #

1. Introduction

  • Since the 2012 ImageNet competition, the winning entry “AlexNet” has been successfully applied to a wide variety of computer vision tasks;
    • after that, VGGNet and GoogLeNet yielded similarly high performance in the 2014 ILSVRC classification challenge.
  • The Inception architecture of GoogLeNet was designed to perform well even under strict constraints on memory and computational budget,
    • unlike VGGNet, which has the compelling feature of a simple architecture but also requires a lot of computation.
  • Although the complexity of the Inception architecture makes it more difficult to make changes to the network, we use the generic structure of the Inception-style building blocks for scaling up convolutional networks in efficient ways.

 

2. Architecture Design

  • We try various architectural experiments on the original Inception module of the GoogLeNet network.

  • Factorization into smaller convolutions:
    • A convolution with a larger spatial filter (e.g. 5 x 5 or 7 x 7) can be replaced by a multi-layer network with fewer parameters, the same input size, and the same output depth.
    • e.g. replacing a 5 x 5 convolution with two layers of 3 x 3 convolutions gives a (9 + 9)/25 reduction in computation,
    • resulting in a relative gain of 28% by this factorization (see the sketch below).
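A minimal PyTorch sketch of this factorization (the channel count C is an illustrative assumption, not from the paper): two stacked 3 x 3 convolutions cover the same 5 x 5 receptive field with roughly 28% fewer parameters.

```python
import torch
import torch.nn as nn

C = 64  # illustrative channel count; input and output depth are kept equal

# Single 5 x 5 convolution.
conv5x5 = nn.Conv2d(C, C, kernel_size=5, padding=2)

# Factorized replacement: two stacked 3 x 3 convolutions with the same
# receptive field (5 x 5) and the same output depth.
factorized = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1),
)

n5 = sum(p.numel() for p in conv5x5.parameters())
n3 = sum(p.numel() for p in factorized.parameters())
print(n5, n3, n3 / n5)  # ratio approaches (9 + 9) / 25 = 0.72 as C grows

x = torch.randn(1, C, 35, 35)
assert conv5x5(x).shape == factorized(x).shape  # same spatial size and depth
```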

  • Spatial Factorization into Asymmetric Convolutions:
    • We can replace any n x n convolution by a 1 x n convolution followed by an n x 1 convolution, and the computational cost saving increases dramatically as n grows.
    • e.g. using a 3 x 1 convolution followed by a 1 x 3 convolution is equivalent to sliding a two-layer network with the same receptive field as a 3 x 3 convolution.
    • Still, the two-layer solution is 33% cheaper for the same number of output filters (sketched below).
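A hedged PyTorch sketch of the asymmetric factorization (channel count again illustrative); ignoring biases, the parameter ratio works out to 2n/n², i.e. about 0.67 for n = 3:

```python
import torch
import torch.nn as nn

C, n = 64, 3  # illustrative channels; n = 3 as in the example above

# Standard n x n convolution.
conv_nxn = nn.Conv2d(C, C, kernel_size=n, padding=n // 2)

# Asymmetric factorization: n x 1 followed by 1 x n, same receptive field.
asymmetric = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=(n, 1), padding=(n // 2, 0)),
    nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=(1, n), padding=(0, n // 2)),
)

params = lambda m: sum(p.numel() for p in m.parameters())
print(params(asymmetric) / params(conv_nxn))  # ~ 2n / n^2 = 0.67 for n = 3

x = torch.randn(1, C, 17, 17)
assert conv_nxn(x).shape == asymmetric(x).shape  # same spatial size and depth
```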

  • Efficient Grid Size Reduction:
    • In order to reduce the computational cost while avoiding a representational bottleneck,
    • we can use two parallel stride-2 blocks: a pooling layer P and a convolution layer C,
    • whose outputs are concatenated in the final layer (see the sketch below).
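A minimal sketch of such a reduction block in PyTorch, assuming illustrative channel sizes rather than the paper's exact ones:

```python
import torch
import torch.nn as nn

class GridReduction(nn.Module):
    """Minimal sketch of the parallel stride-2 grid reduction block;
    channel sizes here are illustrative, not the paper's."""

    def __init__(self, in_ch, conv_ch):
        super().__init__()
        # Convolution branch C: stride-2 convolution expands the depth.
        self.conv = nn.Conv2d(in_ch, conv_ch, kernel_size=3, stride=2)
        # Pooling branch P: stride-2 max pooling keeps the input depth.
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2)

    def forward(self, x):
        # Both branches halve the grid; concatenating along channels avoids
        # the representational bottleneck of pooling before convolution.
        return torch.cat([self.conv(x), self.pool(x)], dim=1)

x = torch.randn(1, 64, 35, 35)
y = GridReduction(64, 96)(x)
print(y.shape)  # torch.Size([1, 160, 17, 17])
```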

  • Utility of Auxiliary Classifiers:
    • The paper found that the auxiliary classifiers did not result in improved convergence early in the training.
    • → Not necessary for faster convergence; the paper argues they instead act as regularizers near the end of training (a sketch of such a head follows).
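A hedged PyTorch sketch of an Inception-style auxiliary classifier head; the filter and unit counts are approximations rather than the paper's exact specification, and the BatchNorm1d line corresponds to the "BN-auxiliary" tweak listed in section 3 below:

```python
import torch
import torch.nn as nn

class AuxClassifier(nn.Module):
    """Illustrative auxiliary classifier branch attached to a 17x17x768
    intermediate feature map; sizes are assumptions, not the paper's."""

    def __init__(self, in_ch=768, num_classes=1000):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=5, stride=3)  # 17x17 -> 5x5
        self.conv = nn.Conv2d(in_ch, 128, kernel_size=1)   # channel reduction
        self.fc1 = nn.Linear(128 * 5 * 5, 1024)
        self.bn = nn.BatchNorm1d(1024)  # "BN-auxiliary": normalize the FC layer too
        self.fc2 = nn.Linear(1024, num_classes)

    def forward(self, x):
        x = self.conv(self.pool(x)).flatten(1)
        return self.fc2(torch.relu(self.bn(self.fc1(x))))

logits = AuxClassifier()(torch.randn(2, 768, 17, 17))
print(logits.shape)  # torch.Size([2, 1000])
```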

 

3. Experimental Result

  • Inception-v2: we have provided several design principles to scale up convolutional networks and studied them in the context of the Inception architecture.
    • (1) factorized the traditional 7 x 7 convolution into three 3 x 3 convolutions (sketched after this list)
    • (2) 3 factorized inception modules as depicted in figure 5
    • (3) 5 factorized inception modules as depicted in figure 6
    • (4) 2 inception modules using the grid reduction technique described in figure 7
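A sketch of point (1) above: the 7 x 7 stem convolution (stride 2) replaced by a sequence of three 3 x 3 convolutions. The channel sizes (32, 32, 64) follow the commonly cited Inception-v3 stem but are assumptions here, not spelled out in this review.

```python
import torch
import torch.nn as nn

# Illustrative stem: three 3 x 3 convolutions replacing one 7 x 7 / stride-2
# convolution; channel sizes are assumptions.
stem = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2),    # 299x299 -> 149x149
    nn.Conv2d(32, 32, kernel_size=3),             # 149x149 -> 147x147
    nn.Conv2d(32, 64, kernel_size=3, padding=1),  # 147x147 -> 147x147
)

print(stem(torch.randn(1, 3, 299, 299)).shape)  # torch.Size([1, 64, 147, 147])
```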

  • Inception-v3: the last line of the Inception-v2 experimental results, which includes all the cumulative changes, is what the paper refers to as “Inception-v3”.
    • Training methodology: TensorFlow, 50 replicas on NVidia Kepler GPUs, batch size 32, 100 epochs, momentum optimizer (momentum 0.9), learning rate 0.045 decayed by 0.94 every two epochs.
    • Each Inception-v2 line shows the result of the cumulative changes, including the highlighted new modification plus all the earlier ones (tweaks 1 and 2 are sketched after this list):
      • 1) RMSProp Optimizer (momentum -> RMSProp)
      • 2) Label smoothing regularization (LSR)
      • 3) Factorized 7 x 7: a change that factorizes the first 7 x 7 convolutional layer into a sequence of 3 x 3 convolutional layers.
      • 4) BN-auxiliary: the fully connected layer of the auxiliary classifier is also batch normalized, not just the convolutions.
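A minimal PyTorch sketch of tweaks 1) and 2), using a stand-in model; the hyperparameters (RMSProp decay 0.9, eps 1.0, learning rate 0.045 decayed by 0.94 every two epochs, LSR epsilon 0.1) are the paper's, and the label_smoothing argument of nn.CrossEntropyLoss requires PyTorch 1.10+:

```python
import torch
import torch.nn as nn

model = nn.Linear(2048, 1000)  # stand-in for the network; illustrative only

# 1) RMSProp optimizer with the paper's settings (decay 0.9, epsilon 1.0).
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.045, alpha=0.9, eps=1.0)

# Learning rate 0.045 decayed by 0.94 every two epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.94)

# 2) Label-smoothing regularization (LSR) with epsilon = 0.1.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

x, y = torch.randn(32, 2048), torch.randint(0, 1000, (32,))
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
scheduler.step()  # called once per epoch in a real training loop
```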

  • Conclusion:
    • Our highest quality version of Inception-v3 reaches 21.2% top-1 and 5.6% top-5 error for single-crop evaluation on the ILSVRC 2012 classification benchmark.
    • This is achieved with a relatively modest increase in computational cost compared to previous networks (about 2.5x GoogLeNet, while still being much more efficient than VGGNet).

 

* Reference: Szegedy, Christian, et al. "Rethinking the Inception Architecture for Computer Vision." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

