
[Inception V3] Szegedy et al., 2016, Rethinking the Inception Architecture for Computer Vision

by 펄서까투리 2022. 4. 11.

# Three-Line Summary #

  • Since 2014 very deep convolutional networks started to become mainstream, yielding substantial gains in various benchmarks.
  • We are exploring ways to scale up networks in ways that aim at utilizing the added computation as efficiently as possible by suitably factorized convolutions and aggressive regularization.
  • We benchmark our methods on the ILSVRC 2012 classification challenge validation set, demonstrating substantial gains over the state of the art: 21.2% top-1 and 5.6% top-5 error for single-frame evaluation, using a network with a computational cost of 5 billion multiply-adds per inference and fewer than 25 million parameters.

 

# Detailed Review #

1. Introduction

  • Since the 2012 ImageNet competition, the winning entry “AlexNet” has been successfully applied to a wide variety of computer vision tasks;
    • after that, VGGNet and GoogLeNet yielded similarly high performance in the 2014 ILSVRC classification challenge.
  • The Inception architecture of GoogLeNet was designed to perform well even under strict constraints on memory and computational budget,
    • unlike VGGNet, which has the compelling feature of a simple architecture but also requires a lot of computation.
  • Although the complexity of the Inception architecture makes it more difficult to make changes to the network, we use the generic structure of the Inception-style building blocks for scaling up convolutional networks in efficient ways.

 

2. Architecture Design

  • We try various architectural experiments on the original Inception module of the GoogLeNet network.

  • Factorization into smaller convolutions:
    • A convolution with a larger spatial filter (e.g. 5 x 5 or 7 x 7) can be replaced by a multi-layer network with fewer parameters, the same input size, and the same output depth.
    • e.g. replacing a 5 x 5 convolution with two layers of 3 x 3 convolutions gives a (9 + 9)/25 reduction in computation,
    • resulting in a relative gain of 28% by this factorization (see the sketch below).
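A minimal PyTorch sketch of this factorization (the channel count C is an illustrative assumption, not from the paper): two stacked 3 x 3 convolutions cover the same 5 x 5 receptive field with roughly 28% fewer parameters.

```python
import torch
import torch.nn as nn

C = 64  # illustrative channel count; input and output depth are kept equal

# Single 5 x 5 convolution.
conv5x5 = nn.Conv2d(C, C, kernel_size=5, padding=2)

# Factorized replacement: two stacked 3 x 3 convolutions with the same
# receptive field (5 x 5) and the same output depth.
factorized = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1),
)

n5 = sum(p.numel() for p in conv5x5.parameters())
n3 = sum(p.numel() for p in factorized.parameters())
print(n5, n3, n3 / n5)  # ratio approaches (9 + 9) / 25 = 0.72 as C grows

x = torch.randn(1, C, 35, 35)
assert conv5x5(x).shape == factorized(x).shape  # same spatial size and depth
```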

  • Spatial Factorization into Asymmetric Convolutions:
    • We can replace any n x n convolution by a 1 x n convolution followed by an n x 1 convolution, and the computational cost saving increases dramatically as n grows.
    • e.g. using a 3 x 1 convolution followed by a 1 x 3 convolution is equivalent to sliding a two-layer network with the same receptive field as a 3 x 3 convolution.
    • Still, the two-layer solution is 33% cheaper for the same number of output filters (sketched below).
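A hedged PyTorch sketch of the asymmetric factorization (channel count again illustrative); ignoring biases, the parameter ratio works out to 2n/n², i.e. about 0.67 for n = 3:

```python
import torch
import torch.nn as nn

C, n = 64, 3  # illustrative channels; n = 3 as in the example above

# Standard n x n convolution.
conv_nxn = nn.Conv2d(C, C, kernel_size=n, padding=n // 2)

# Asymmetric factorization: n x 1 followed by 1 x n, same receptive field.
asymmetric = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=(n, 1), padding=(n // 2, 0)),
    nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=(1, n), padding=(0, n // 2)),
)

params = lambda m: sum(p.numel() for p in m.parameters())
print(params(asymmetric) / params(conv_nxn))  # ~ 2n / n^2 = 0.67 for n = 3

x = torch.randn(1, C, 17, 17)
assert conv_nxn(x).shape == asymmetric(x).shape  # same spatial size and depth
```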

  • Efficient Grid Size Reduction:
    • In order to reduce the computational cost while avoiding a representational bottleneck,
    • we can use two parallel stride-2 blocks: a pooling layer P and a convolution layer C,
    • whose outputs are concatenated in the final layer (see the sketch below).
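A minimal sketch of such a reduction block in PyTorch, assuming illustrative channel sizes rather than the paper's exact ones:

```python
import torch
import torch.nn as nn

class GridReduction(nn.Module):
    """Minimal sketch of the parallel stride-2 grid reduction block;
    channel sizes here are illustrative, not the paper's."""

    def __init__(self, in_ch, conv_ch):
        super().__init__()
        # Convolution branch C: stride-2 convolution expands the depth.
        self.conv = nn.Conv2d(in_ch, conv_ch, kernel_size=3, stride=2)
        # Pooling branch P: stride-2 max pooling keeps the input depth.
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2)

    def forward(self, x):
        # Both branches halve the grid; concatenating along channels avoids
        # the representational bottleneck of pooling before convolution.
        return torch.cat([self.conv(x), self.pool(x)], dim=1)

x = torch.randn(1, 64, 35, 35)
y = GridReduction(64, 96)(x)
print(y.shape)  # torch.Size([1, 160, 17, 17])
```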

  • Utility of Auxiliary Classifiers:
    • The paper found that the auxiliary classifiers did not result in improved convergence early in the training.
    • → Not necessary for faster convergence; the paper argues they instead act as regularizers near the end of training (a sketch of such a head follows).
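A hedged PyTorch sketch of an Inception-style auxiliary classifier head; the filter and unit counts are approximations rather than the paper's exact specification, and the BatchNorm1d line corresponds to the "BN-auxiliary" tweak listed in section 3 below:

```python
import torch
import torch.nn as nn

class AuxClassifier(nn.Module):
    """Illustrative auxiliary classifier branch attached to a 17x17x768
    intermediate feature map; sizes are assumptions, not the paper's."""

    def __init__(self, in_ch=768, num_classes=1000):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=5, stride=3)  # 17x17 -> 5x5
        self.conv = nn.Conv2d(in_ch, 128, kernel_size=1)   # channel reduction
        self.fc1 = nn.Linear(128 * 5 * 5, 1024)
        self.bn = nn.BatchNorm1d(1024)  # "BN-auxiliary": normalize the FC layer too
        self.fc2 = nn.Linear(1024, num_classes)

    def forward(self, x):
        x = self.conv(self.pool(x)).flatten(1)
        return self.fc2(torch.relu(self.bn(self.fc1(x))))

logits = AuxClassifier()(torch.randn(2, 768, 17, 17))
print(logits.shape)  # torch.Size([2, 1000])
```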

 

3. Experimental Result

  • Inception-v2: we have provided several design principles to scale up convolutional networks and studied them in the context of the Inception architecture.
    • (1) factorized the traditional 7 x 7 convolution into three 3 x 3 convolutions (sketched after this list)
    • (2) 3 factorized inception modules as depicted in figure 5
    • (3) 5 factorized inception modules as depicted in figure 6
    • (4) 2 inception modules using the grid reduction technique described in figure 7
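A sketch of point (1) above: the 7 x 7 stem convolution (stride 2) replaced by a sequence of three 3 x 3 convolutions. The channel sizes (32, 32, 64) follow the commonly cited Inception-v3 stem but are assumptions here, not spelled out in this review.

```python
import torch
import torch.nn as nn

# Illustrative stem: three 3 x 3 convolutions replacing one 7 x 7 / stride-2
# convolution; channel sizes are assumptions.
stem = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2),    # 299x299 -> 149x149
    nn.Conv2d(32, 32, kernel_size=3),             # 149x149 -> 147x147
    nn.Conv2d(32, 64, kernel_size=3, padding=1),  # 147x147 -> 147x147
)

print(stem(torch.randn(1, 3, 299, 299)).shape)  # torch.Size([1, 64, 147, 147])
```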

  • Inception-v3: the last line of the Inception-v2 experimental results, which includes all the cumulative changes, is what the paper refers to as “Inception-v3”.
    • Training methodology: TensorFlow, 50 replicas on NVidia Kepler GPUs, batch size 32, 100 epochs, momentum optimizer (momentum 0.9), learning rate 0.045 decayed by 0.94 every two epochs.
    • Each Inception-v2 line shows the result of the cumulative changes, including the highlighted new modification plus all the earlier ones (tweaks 1 and 2 are sketched after this list):
      • 1) RMSProp Optimizer (momentum -> RMSProp)
      • 2) Label smoothing regularization (LSR)
      • 3) Factorized 7 x 7: a change that factorizes the first 7 x 7 convolutional layer into a sequence of 3 x 3 convolutional layers.
      • 4) BN-auxiliary: the fully connected layer of the auxiliary classifier is also batch normalized, not just the convolutions.
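A minimal PyTorch sketch of tweaks 1) and 2), using a stand-in model; the hyperparameters (RMSProp decay 0.9, eps 1.0, learning rate 0.045 decayed by 0.94 every two epochs, LSR epsilon 0.1) are the paper's, and the label_smoothing argument of nn.CrossEntropyLoss requires PyTorch 1.10+:

```python
import torch
import torch.nn as nn

model = nn.Linear(2048, 1000)  # stand-in for the network; illustrative only

# 1) RMSProp optimizer with the paper's settings (decay 0.9, epsilon 1.0).
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.045, alpha=0.9, eps=1.0)

# Learning rate 0.045 decayed by 0.94 every two epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.94)

# 2) Label-smoothing regularization (LSR) with epsilon = 0.1.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

x, y = torch.randn(32, 2048), torch.randint(0, 1000, (32,))
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
scheduler.step()  # called once per epoch in a real training loop
```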

  • Conclusion:
    • Our highest quality version of Inception-v3 reaches 21.2% top-1 and 5.6% top-5 error for single-crop evaluation on the ILSVRC 2012 classification benchmark.
    • This is achieved with a relatively modest increase in computational cost compared to previous networks (about 2.5x GoogLeNet, while still being much more efficient than VGGNet).

 

* Reference: Szegedy, Christian, et al. "Rethinking the Inception Architecture for Computer Vision." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

