Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

2021. 4. 1. 18:26 · Paper Review

Abstract

Since the Transformer appeared, there have been many attempts to adapt it from NLP to vision, but the transfer remains challenging because of differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text.

To address this, the paper proposes a hierarchical Transformer whose representation is computed with shifted windows.

The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection.

This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size.

With this design, the model achieves strong results on several vision tasks.

1. Introduction

Modeling in computer vision has long been dominated by CNNs. Starting with AlexNet, a wide range of models have been built on CNNs.

In NLP, by contrast, progress took a different path, leading to the Transformer.

Designed for sequence modeling and transduction tasks, the Transformer is notable for its use of attention to model long-range dependencies in the data.

Seeing its remarkable success in the language domain, many researchers became interested in adapting it to vision (for example, image classification and joint vision-language modeling).

This paper aims to introduce a Transformer that can serve as a general-purpose backbone for computer vision, the way CNNs do.

Existing Transformer-based models use tokens of a fixed scale, which does not suit vision applications well.

Tasks such as semantic segmentation require prediction at the pixel level, and the computational cost of self-attention on high-resolution images makes them hard to handle with a Transformer.

Swin Transformer is proposed to solve these problems.

As the figure above shows, Swin Transformer builds hierarchical feature maps and has computational complexity linear in image size, which makes it much more efficient.

As in Figure 1, it starts from representations over small patches and gradually merges neighboring patches in deeper Transformer layers.

Self-attention is computed within windows that each contain a fixed number of patches, so the complexity is linear in image size.

A key element of the Swin Transformer design is the shift of the window partition between consecutive self-attention layers, depicted in the figure below.

The shifted windows bridge the windows of the preceding layer, and these connections play an important role in the model's power.

This approach is also efficient in terms of real-world latency. Earlier sliding-window methods use a different key set for each query pixel, which causes latency. In Swin Transformer, all query patches within a window share the same key set, which avoids that cost.

Swin Transformer outperforms the earlier ViT/DeiT on all three of image classification, object detection, and semantic segmentation.

2. Related Work

CNN and variants

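CNNs have long been regarded as the standard network model in computer vision. Although the CNN has existed for decades, it attracted little attention until AlexNet appeared. Since then, networks have grown deeper and a variety of models such as VGG and GoogLeNet have been proposed.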

Self-attention based backbone architectures

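After the Transformer appeared, there were efforts to replace the convolutions in existing models with self-attention layers. However, these approaches suffered from high computational cost and latency. This paper improves efficiency by using shifted windows instead of sliding windows.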

Self-attention/Transformers to complement CNNs

Recently, the encoder-decoder Transformer design has been widely applied to object detection and instance segmentation tasks.

Transformer based vision backbones

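The work most closely related to this paper is the Vision Transformer (ViT) and its follow-ups. The pioneering ViT work applies a Transformer architecture directly to non-overlapping medium-sized image patches for image classification.

It is fast and reaches accuracy comparable to CNNs, but it has the drawback of needing a large amount of training data.

ViT also performs well on image classification, but it is hard to use as a general backbone for dense vision tasks or high-resolution input images, because its feature maps are low-resolution and its complexity grows quadratically with image size.

A few works have applied ViT models to dense vision tasks (object detection and semantic segmentation), but with relatively poor performance.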

3. Method

3.1 Overall Architecture

Overview of the Swin Transformer architecture

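The RGB image is first split into non-overlapping patches, as in ViT. Each patch is treated as a token, like a token in NLP. With a patch size of 4 x 4, the feature dimension of each patch is 4 x 4 x 3 = 48.

A linear embedding layer then projects this raw-valued feature to an arbitrary dimension (denoted as C).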
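
The patch partition and linear embedding steps can be sketched in NumPy as below. This is only an illustration, not the authors' implementation: the function names are made up for this post, the weights are random instead of learned, and C = 96 and the 224 x 224 input are just example values.

```python
import numpy as np

def patch_partition(img, patch=4):
    """Split an (H, W, 3) image into non-overlapping patch x patch tokens."""
    H, W, ch = img.shape
    assert H % patch == 0 and W % patch == 0, "H and W must be divisible by patch"
    x = img.reshape(H // patch, patch, W // patch, patch, ch)
    x = x.transpose(0, 2, 1, 3, 4)                 # (H/4, W/4, 4, 4, 3)
    return x.reshape(H // patch, W // patch, patch * patch * ch)

def linear_embedding(tokens, weight, bias):
    """Project each raw 48-dim patch feature to C dims."""
    return tokens @ weight + bias

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))                    # example input size (assumption)
C = 96                                             # example embedding dim (assumption)
tokens = patch_partition(img)                      # (56, 56, 48)
W_embed = rng.normal(scale=0.02, size=(48, C))     # random stand-in for learned weights
b_embed = np.zeros(C)
embedded = linear_embedding(tokens, W_embed, b_embed)
print(tokens.shape, embedded.shape)
```

Each token here is simply the 48 raw pixel values of one 4 x 4 patch, flattened, before the projection maps it to C channels.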

Several Swin Transformer blocks are applied to these patch tokens. These blocks keep the number of tokens at H/4 x W/4. The linear embedding together with these Swin Transformer blocks is referred to as "Stage 1".

To produce a hierarchical representation, patch merging layers reduce the number of tokens as the network gets deeper.

The first patch merging layer concatenates the features of each group of 2 x 2 neighboring patches and applies a linear layer on the 4C-dimensional concatenated features. This reduces the number of tokens by a factor of 2 x 2 = 4 (a 2x downsampling of resolution), and the output dimension is set to 2C.
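
A rough sketch of patch merging in NumPy follows. The name `patch_merging` and the random weight are placeholders for illustration, and any normalization a real implementation might apply before the linear layer is left out.

```python
import numpy as np

def patch_merging(x, weight):
    """x: (H, W, C) tokens, weight: (4C, 2C). Returns (H/2, W/2, 2C)."""
    H, W, C = x.shape
    assert H % 2 == 0 and W % 2 == 0, "H and W must be even"
    # Pick out the four members of every 2 x 2 neighborhood.
    x0 = x[0::2, 0::2]   # top-left
    x1 = x[1::2, 0::2]   # bottom-left
    x2 = x[0::2, 1::2]   # top-right
    x3 = x[1::2, 1::2]   # bottom-right
    merged = np.concatenate([x0, x1, x2, x3], axis=-1)   # (H/2, W/2, 4C)
    return merged @ weight                                # (H/2, W/2, 2C)

rng = np.random.default_rng(0)
C = 96                                                    # example value (assumption)
x = rng.random((56, 56, C))
W_merge = rng.normal(scale=0.02, size=(4 * C, 2 * C))     # random stand-in for learned weights
y = patch_merging(x, W_merge)
print(x.shape, "->", y.shape)
```

The token count drops by 4x while the channel count only doubles, which is what the 4C to 2C linear layer is for.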

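Swin Transformer blocks are applied after the merging for feature transformation, with the resolution kept at H/8 x W/8. This first patch merging layer and the blocks that follow it form "Stage 2".

The procedure is repeated twice more, as "Stage 3" and "Stage 4", with output resolutions of H/16 x W/16 and H/32 x W/32, respectively.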
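
To make the four stages concrete, here is a small helper that lists the token grid and channel count per stage. `stage_shapes` is a made-up name, and the 224 x 224 input with C = 96 is just an example. The channel doubling at every merge (C, 2C, 4C, 8C) follows the 2C rule described above.

```python
def stage_shapes(H, W, C):
    """Return (stage, height, width, channels) for the four stages."""
    shapes = []
    stride, dim = 4, C            # Stage 1 works on H/4 x W/4 tokens with C channels
    for stage in range(1, 5):
        shapes.append((stage, H // stride, W // stride, dim))
        stride *= 2               # each patch merging halves the resolution
        dim *= 2                  # and doubles the channel dimension
    return shapes

for stage, h, w, dim in stage_shapes(224, 224, 96):
    print(f"Stage {stage}: {h} x {w} tokens, {dim} channels")
```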

With these hierarchical representations, the network produces feature maps the way typical CNNs such as VGG and ResNet do, so it can conveniently serve as the backbone in existing methods.

Swin Transformer Block

A Swin Transformer block is a Transformer block in which the standard multi-head self-attention (MSA) module is replaced by a module based on shifted windows, covered in more detail in Section 3.2. The other layers are kept the same.

3.2 Shifted Window based Self-Attention

Both the standard Transformer architecture and its adaptation for image classification conduct global self-attention, where the relationships between a token and all other tokens are computed.

Global computation leads to quadratic complexity with respect to the number of tokens. This makes it unsuitable for many dense vision tasks, which need an immense set of tokens for dense prediction or to represent a high-resolution image.

Self-attention in non-overlapped windows

For efficiency, the paper proposes computing self-attention within local windows that evenly partition the image without overlapping.
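
As a minimal sketch of the idea, the code below partitions a feature map into M x M windows and runs self-attention inside each window independently. To keep it short it uses a single head and skips the learned Q/K/V projections (the tokens themselves act as queries, keys and values), which is a simplification made for this post and not how the real module works. M = 7 and the feature map size are example values.

```python
import numpy as np

def window_partition(x, M):
    """(H, W, C) -> (num_windows, M*M, C) using non-overlapping M x M windows."""
    H, W, C = x.shape
    assert H % M == 0 and W % M == 0, "H and W must be divisible by M"
    x = x.reshape(H // M, M, W // M, M, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, M * M, C)

def window_attention(windows):
    """Single-head self-attention computed separately inside every window."""
    d = windows.shape[-1]
    scores = windows @ windows.transpose(0, 2, 1) / np.sqrt(d)   # (nW, M*M, M*M)
    scores = scores - scores.max(axis=-1, keepdims=True)         # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ windows                                        # (nW, M*M, C)

x = np.random.default_rng(0).random((56, 56, 96))
windows = window_partition(x, M=7)
out = window_attention(windows)
print(windows.shape, out.shape)
```

Every attention matrix is only M*M by M*M, no matter how large the image is. Only the number of windows grows.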

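Suppose each window contains M x M patches, and compare a window-based MSA module with a global MSA module on an image of h x w patches.

The cost of global self-attention grows quadratically with the number of patches hw, so it is generally not affordable when hw is large. Window-based self-attention, in contrast, is linear in hw when M is fixed, so it is scalable.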
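
For reference, the complexities given in the paper are Ω(MSA) = 4hwC² + 2(hw)²C for global attention and Ω(W-MSA) = 4hwC² + 2M²hwC for window-based attention. The snippet below just plugs numbers into these two formulas. The function names and the chosen values (C = 96, M = 7) are examples for this post.

```python
def msa_flops(h, w, C):
    """Global MSA: 4*h*w*C^2 + 2*(h*w)^2*C (quadratic in h*w)."""
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C

def wmsa_flops(h, w, C, M):
    """Window MSA: 4*h*w*C^2 + 2*M^2*h*w*C (linear in h*w for fixed M)."""
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C

for side in (56, 112):
    g = msa_flops(side, side, 96)
    l = wmsa_flops(side, side, 96, 7)
    print(f"{side} x {side} patches: global {g:.3e}, window {l:.3e}, ratio {g / l:.1f}")
```

Doubling the side length quadruples the number of patches. The window-based cost then grows by exactly 4x, while the attention term of the global cost grows by 16x.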

Shifted window partitioning in successive blocks

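A window-based self-attention module lacks connections across windows, which limits its modeling power. To solve this, the paper proposes a shifted window partitioning approach (see the second figure above). The first layer computes attention on a regular, evenly divided window partition, and the next layer uses a different partition.

For example, on an 8 x 8 feature map with 2 x 2 windows of size 4 x 4 (M = 4), the next layer displaces the windows by (M/2, M/2) = (2, 2) patches. Each new window then groups patches that belonged to different windows in the previous layer, and attention is computed once more, so information flows across the earlier window boundaries.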
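
A toy check of this on an 8 x 8 grid with M = 4: label each patch with the id of its regular window, displace the grid by M/2, and look at which regular windows end up together. The displacement is done here by rolling the map toward the top-left with `np.roll`, which is one convenient way to emulate the shift, so the border windows wrap around. The variable names are made up for this sketch.

```python
import numpy as np

def window_partition(x, M):
    """(H, W) -> (num_windows, M*M) using non-overlapping M x M windows."""
    H, W = x.shape
    return x.reshape(H // M, M, W // M, M).transpose(0, 2, 1, 3).reshape(-1, M * M)

H = W = 8
M = 4

# Layer l: regular partition. Each patch gets the id (0..3) of its window.
rows, cols = np.indices((H, W))
regular_id = (rows // M) * (W // M) + (cols // M)

# Layer l+1: displace the window grid by (M//2, M//2) patches.
shifted = np.roll(regular_id, shift=(-(M // 2), -(M // 2)), axis=(0, 1))

for i, win in enumerate(window_partition(shifted, M)):
    print(f"shifted window {i} contains patches from regular windows {sorted(set(win.tolist()))}")
```

Every shifted window contains patches from several of the earlier windows, which is exactly the cross-window connection the regular partition lacks.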

Result of applying the shifted window partitioning approach
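
With this approach, the paper writes two consecutive Swin Transformer blocks as follows, where W-MSA and SW-MSA are window-based self-attention with the regular and the shifted partition, LN is LayerNorm, and \hat{z}^{l} and z^{l} are the outputs of the attention module and the MLP of block l:

```latex
\begin{aligned}
\hat{z}^{l}   &= \text{W-MSA}\big(\text{LN}(z^{l-1})\big) + z^{l-1}, \\
z^{l}         &= \text{MLP}\big(\text{LN}(\hat{z}^{l})\big) + \hat{z}^{l}, \\
\hat{z}^{l+1} &= \text{SW-MSA}\big(\text{LN}(z^{l})\big) + z^{l}, \\
z^{l+1}       &= \text{MLP}\big(\text{LN}(\hat{z}^{l+1})\big) + \hat{z}^{l+1}
\end{aligned}
```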

Efficient batch computation for shifted configuration

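One problem with shifted windows is that the partition produces more windows, and some of them are smaller than M x M.

⌈h/M⌉ x ⌈w/M⌉ → (⌈h/M⌉ + 1) x (⌈w/M⌉ + 1)

To handle the windows that end up smaller than M x M, the paper uses a small trick: it cyclically shifts the feature map toward the top-left so that the number of batched windows stays the same as in the regular partition, and applies a mask so that attention is limited to patches that were actually neighbors.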
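
The masking part of the trick can be sketched as follows. After a cyclic shift, a batched window may hold several sub-windows that were not adjacent in the original feature map, so each patch gets a region label and attention is only allowed between patches with the same label. This is an illustrative reconstruction: the helper names and the blocking value of -100.0 are choices made for this sketch.

```python
import numpy as np

def window_partition(x, M):
    """(H, W) -> (num_windows, M*M) using non-overlapping M x M windows."""
    H, W = x.shape
    return x.reshape(H // M, M, W // M, M).transpose(0, 2, 1, 3).reshape(-1, M * M)

def shifted_window_mask(H, W, M, shift):
    """Additive mask (num_windows, M*M, M*M): 0 where allowed, -100 where blocked."""
    region = np.zeros((H, W), dtype=np.int64)
    # Three bands per axis: untouched area, last window minus the shift, wrapped-around strip.
    bands = (slice(0, -M), slice(-M, -shift), slice(-shift, None))
    label = 0
    for hs in bands:
        for ws in bands:
            region[hs, ws] = label
            label += 1
    win = window_partition(region, M)                   # region label of each patch per window
    same_region = win[:, :, None] == win[:, None, :]    # (nW, M*M, M*M)
    return np.where(same_region, 0.0, -100.0)           # added to scores before softmax

mask = shifted_window_mask(8, 8, M=4, shift=2)
print(mask.shape)
print((mask == 0).all(axis=(1, 2)))   # True only for windows that need no masking
```

On the 8 x 8 example the number of batched windows stays at 2 x 2 = 4 instead of growing to 3 x 3 = 9, and only the windows touching the wrapped border need any masking.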

Relative position bias

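When computing self-attention, a relative position bias is added to the attention scores of each head. The rest of the computation is the same as in standard self-attention.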
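
In the paper's notation this is Attention(Q, K, V) = SoftMax(QK^T / sqrt(d) + B) V, where B is an M² x M² bias whose entries are looked up from a smaller table with (2M - 1) x (2M - 1) entries, indexed by the relative offset between two patches. A NumPy sketch of that lookup is below. The function names are made up, the table is random instead of learned, and M = 7, d = 32 are example values.

```python
import numpy as np

def relative_position_index(M):
    """(M*M, M*M) integer index into a bias table with (2M-1)**2 entries."""
    coords = np.stack(np.meshgrid(np.arange(M), np.arange(M), indexing="ij"))  # (2, M, M)
    flat = coords.reshape(2, -1)                       # (2, M*M): row and column of each patch
    rel = flat[:, :, None] - flat[:, None, :]          # offsets in [-(M-1), M-1]
    rel = rel + (M - 1)                                # shift to [0, 2M-2]
    return rel[0] * (2 * M - 1) + rel[1]               # flatten the 2-D offset to one index

def attention_with_bias(q, k, v, bias):
    """Single-head attention: softmax(q k^T / sqrt(d) + bias) v."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d) + bias
    scores = scores - scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ v

M, d = 7, 32                                           # example values (assumption)
rng = np.random.default_rng(0)
bias_table = rng.normal(scale=0.02, size=(2 * M - 1) ** 2)   # learned in practice, random here
index = relative_position_index(M)
B = bias_table[index]                                  # (M*M, M*M)
q, k, v = (rng.random((M * M, d)) for _ in range(3))
out = attention_with_bias(q, k, v, B)
print(index.shape, B.shape, out.shape)
```

Two patch pairs with the same relative offset share the same bias entry, so the table stays small even though B itself is M² x M².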

3.3 Architecture Variants

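The base model, Swin-B, is built to have a model size and computational cost similar to ViT-B/DeiT-B.

The paper also introduces other variants, Swin-T, Swin-S and Swin-L, at roughly 0.25x, 0.5x and 2x the model size and computational cost of Swin-B.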
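
The hyperparameters of the four variants, as listed in the paper (C is the channel number of the first stage, and the layer numbers are the Swin Transformer blocks per stage), can be written down as a small config. The dictionary layout and helper name are just one way to organize them for this post.

```python
# Channel number C of Stage 1 and number of blocks in each of the four stages.
SWIN_VARIANTS = {
    "Swin-T": {"C": 96,  "depths": (2, 2, 6, 2)},
    "Swin-S": {"C": 96,  "depths": (2, 2, 18, 2)},
    "Swin-B": {"C": 128, "depths": (2, 2, 18, 2)},
    "Swin-L": {"C": 192, "depths": (2, 2, 18, 2)},
}

def stage_channels(C):
    """Channels per stage: doubled at every patch merging layer."""
    return tuple(C * 2**i for i in range(4))

for name, cfg in SWIN_VARIANTS.items():
    print(name, stage_channels(cfg["C"]), f"{sum(cfg['depths'])} blocks")
```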

4. Experiments

The model performed well in experiments across a range of tasks.

Left: image classification on ImageNet-1K / Right: object detection on COCO

5. Conclusion

This paper introduced a new kind of Transformer. Unlike earlier approaches, it produces a hierarchical feature representation and has computational complexity linear in image size.

Shifted window based self-attention appears to be the key ingredient, and it works well on vision tasks.

The authors expect that it could also be applied to NLP in the future.