[VGG] Simonyan & Zisserman, 2015, Very Deep Convolutional Networks For Large-Scale Image Recognition

# Three-Line Summary #

The authors study how the depth of a convolutional neural network (here, depth means the number of layers in the network) affects accuracy on large-scale image recognition.
The networks introduced in the paper are built from very small (3x3) convolutional filters, which makes it practical to increase depth; pushing the depth to 16-19 weight layers gives a significant improvement over prior models.
The authors' team (VGG) took 1st place in the localization track and 2nd place in the classification track of the 2014 ImageNet Large-Scale Visual Recognition Challenge (ILSVRC).

# Detailed Review #

1. We investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting.

[Introduction]

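Convolutional networks (ConvNets) have recently achieved great success on many image and video recognition problems (Krizhevsky et al., 2012; Zeiler & Fergus, 2013; Sermanet et al., 2014; Simonyan & Zisserman, 2014).
In this paper the authors focus on depth (the number of layers) in the ConvNet architecture: with the other architectural parameters fixed, they steadily increase the depth of the network by adding convolutional layers.
What sets this apart from earlier models is that the filter size of the convolutional layers is unified to the very small 3x3.
As a result, the models achieved very strong results in both the classification and localisation tracks of ILSVRC 2014, and they also perform well on other image datasets beyond ImageNet.
- More precisely, in classification they placed 2nd, behind GoogLeNet (Szegedy et al., 2014).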

2. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.

[VGG Configuration]

Like an ordinary ConvNet, the VGG model is made of convolutional layers. Its biggest difference from earlier models (Krizhevsky et al., 2012; Zeiler & Fergus, 2013; Sermanet et al., 2014) is that every layer is restricted to the same, smallest settings: a 3x3 receptive field and a convolution stride of 1.
- Padding is applied in every convolutional layer so that the image size is preserved after convolution.
Spatial pooling is carried out by 5 max-pooling layers spread through the model. Each pooling step halves the resolution, so the input image must be at least 32x32 (2^5 = 32).
After the convolution blocks (convolutional layers + a max-pooling layer), fully connected (FC) layers are attached at the end to act as the classifier.
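
To make the size bookkeeping concrete, here is a minimal sketch that traces the feature-map side length through five blocks. It assumes 3x3 convolutions with stride 1 and 1 pixel of padding, 2x2 max-pooling with stride 2, and the VGG16-style grouping of (2, 2, 3, 3, 3) convolutions per block. The function names are made up for illustration.

```python
def conv3x3_same(size):
    """Output side length of a 3x3 convolution with stride 1 and 1 pixel of padding."""
    kernel, stride, padding = 3, 1, 1
    return (size + 2 * padding - kernel) // stride + 1


def max_pool_2x2(size):
    """Output side length of a 2x2 max-pooling with stride 2 (assumed pooling setup)."""
    return size // 2


def trace_spatial_size(input_size, convs_per_block=(2, 2, 3, 3, 3)):
    """Return the feature-map side length after each of the five blocks."""
    sizes = []
    size = input_size
    for num_convs in convs_per_block:
        for _ in range(num_convs):
            size = conv3x3_same(size)  # padding keeps the size unchanged
        size = max_pool_2x2(size)      # pooling halves the size
        sizes.append(size)
    return sizes


print(trace_spatial_size(224))  # [112, 56, 28, 14, 7]
print(trace_spatial_size(32))   # [16, 8, 4, 2, 1]
```

A 224x224 input ends at 7x7 before the FC layers, while a 32x32 input shrinks all the way to 1x1, which is why 32x32 is the lower bound.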

Figure 1. Architecture of VGG16. [https://neurohive.io/en/popular-networks/vgg16/]

Table 1. VGG model configurations. The layer count that names each model is the number of convolutional layers plus the number of fully connected layers. [* Ref.: Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014)]
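
The naming rule in the caption can be checked with a short sketch. The channel lists below are my own transcription of configurations A, B, D, and E from the paper's table (A-LRN and C are left out), so treat them as an assumption rather than an authoritative copy. `CONV_CONFIGS` and `count_weight_layers` are hypothetical names.

```python
# "M" marks a max-pooling layer; each integer is the output channel count
# of one 3x3 convolutional layer. Transcribed by hand, so double-check
# against the paper before relying on it.
CONV_CONFIGS = {
    "A": [64, "M", 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],
    "B": [64, 64, "M", 128, 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],
    "D": [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
          512, 512, 512, "M", 512, 512, 512, "M"],
    "E": [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
          512, 512, 512, 512, "M", 512, 512, 512, 512, "M"],
}
NUM_FC_LAYERS = 3  # the fully connected classifier at the end


def count_weight_layers(config):
    """Count convolutional layers plus FC layers; pooling layers have no weights."""
    num_conv = sum(1 for item in config if item != "M")
    return num_conv + NUM_FC_LAYERS


for name, config in CONV_CONFIGS.items():
    print(name, count_weight_layers(config))
# A 11
# B 13
# D 16
# E 19
```

Configurations D and E are what people usually call VGG16 and VGG19.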

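Why stack three convolutional layers with a small (3x3) receptive field instead of using a single convolutional layer with a large receptive field (e.g. 7x7)?
1. Composing three non-linear layers instead of one makes the decision function more discriminative.
2. It reduces the number of parameters in the network (with C input and C output channels per layer, biases ignored).
  - Three 3x3 conv. layers = 3(3^2 * C^2) = 27C^2
  - One 7x7 conv. layer = 7^2 * C^2 = 49C^2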
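
The comparison above can be reproduced with a few lines of arithmetic. This is a minimal sketch: it assumes stride-1 convolutions with C input and C output channels and no biases, and C = 256 is an arbitrary value chosen only for illustration.

```python
def conv_weights(kernel_size, channels):
    """Weights in one conv layer with `channels` inputs and outputs (no biases)."""
    return kernel_size ** 2 * channels ** 2


def stacked_receptive_field(kernel_size, num_layers):
    """Effective receptive field of `num_layers` stacked stride-1 convolutions."""
    field = 1
    for _ in range(num_layers):
        field += kernel_size - 1  # each layer widens the view by (k - 1) pixels
    return field


C = 256  # hypothetical channel count
three_3x3 = 3 * conv_weights(3, C)  # 27 * C^2
one_7x7 = conv_weights(7, C)        # 49 * C^2

print(stacked_receptive_field(3, 3))      # 7
print(three_3x3, one_7x7)                 # 1769472 3211264
print(round(1 - three_3x3 / one_7x7, 3))  # 0.449
```

So three stacked 3x3 layers cover the same 7x7 window as a single 7x7 layer while using roughly 45% fewer weights, whatever C is.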

3. Our team secured the first and the second places in the localisation and classification tracks of ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) 2014.

[Experiment Result]

Deep learning toolbox used: C++ Caffe toolbox (Jia, 2013)
Dataset: ILSVRC-2012 dataset (the dataset used for the ILSVRC 2012-2014 challenges)
- Classes: 1000
- Dataset split
  - training: 1.3M images
  - validation: 50K images
  - test: 100K images

Table 2. Summary of ILSVRC results. The model is called VGG because the authors belong to the Visual Geometry Group (a research group at the University of Oxford). Use the last column, top-5 test error, to compare performance. [* Ref.: Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014)]

The VGG models finished 2nd overall in classification at ILSVRC.
- See the last column, top-5 test error.

The ILSVRC submission, an ensemble of 7 networks built from the configurations in Table 1, reached a top-5 test error of 7.3% (row 3: VGG (ILSVRC submission, 7 nets, dense eval.)).
The best VGG result comes from ensembling only the two best-performing configurations in Table 1 (D and E), which reached a top-5 test error of 6.8% (row 1: VGG (2 nets, multi-crop & dense eval.)).
- The best result in ILSVRC overall belongs to GoogLeNet (row 5: GoogLeNet (Szegedy et al., 2014) (7 nets)).
Among single-model results without ensembling, VGG19 (column E in Table 1) has the lowest top-5 test error of all models, 7.0% (row 2: VGG (1 net, multi-crop & dense eval.)).
- GoogLeNet's single-model error is 7.9%, about 0.9 percentage points worse than VGG (row 4: GoogLeNet (Szegedy et al., 2014) (1 net)).
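
Since every comparison here relies on top-5 error, a small sketch of how that metric is computed may help: a prediction counts as correct if the true class is among the five highest-scoring classes. The function name and the toy scores are made up for illustration and have nothing to do with the actual ILSVRC evaluation code.

```python
def top5_error(scores_per_image, true_labels):
    """Fraction of images whose true label is NOT among the 5 top-scoring classes."""
    errors = 0
    for scores, label in zip(scores_per_image, true_labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if label not in ranked[:5]:
            errors += 1
    return errors / len(true_labels)


# Toy example with 7 classes and 2 images (hypothetical scores).
scores = [
    [0.30, 0.25, 0.20, 0.10, 0.08, 0.05, 0.02],
    [0.30, 0.25, 0.20, 0.10, 0.08, 0.05, 0.02],
]
labels = [4, 6]  # class 4 is in the top 5, class 6 is not

print(top5_error(scores, labels))  # 0.5
```

With 1000 classes, a top-5 test error of 7.0% means the true class is missing from the five best guesses for about 7 out of every 100 test images.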

* Reference:

Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
