
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision - Paper Review

2023. 4. 19. 22:44

Overview of this paper

VLP models, which achieve strong performance on various vision-and-language tasks, rely heavily on region supervision (object detection) and convolutional architectures (e.g., ResNet) to extract features from images. The paper identifies two problems with this, one of efficiency/speed and one of expressive power.

 

  1. Efficiency/speed: extracting the input features requires more computation than the multi-modal interaction itself.
  2. Expressive power: performance is upper-bounded by the expressive power of the visual embedder and its predefined visual vocabulary.

 

 ์ด ๋…ผ๋ฌธ์—์„œ๋Š” ์ž‘์€ ๊ทœ๋ชจ์˜ VLP model์ธ Vision-and-Language Transformer(ViLT)๋ฅผ ์†Œ๊ฐœํ•˜์˜€๋‹ค. ์ด ๋ชจ๋ธ์—์„œ ์ž…๋ ฅ์€ ํ•˜๋‚˜์˜ ๋ฉ์–ด๋ฆฌ๋กœ ๋“ค์–ด์˜ค๋Š”๋ฐ ํ…์ŠคํŠธ ์ž…๋ ฅ์„ ์ฒ˜๋ฆฌํ•  ๋•Œ convolution์„ ์‚ฌ์šฉํ•˜์ง€ ์•Š๋Š” ๊ฒƒ์ฒ˜๋Ÿผ visual input์„ ๊ฐ„๋‹จํ•˜๊ฒŒ ๋งŒ๋“ค์–ด์กŒ๋‹ค. ๊ทธ ๊ฒฐ๊ณผ ViLT๋Š” ๋‹ค๋ฅธ VLP model๋“ค๋ณด๋‹ค 10๋ฐฐ ๋” ๋น ๋ฅธ ์†๋„๋ฅผ ๋ณด์—ฌ์ฃผ๊ณ  downstream task์—์„œ ๋” ๋‚ซ๊ณ  ์œ ๋งํ•œ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์คฌ๋‹ค.

 

 

Table of Contents

1. Introduction

2. Background

   2-1. Taxonomy of Vision-and-Language Models

3. Vision-and-Language Transformer

   3-1. Model Overview

   3-2. Pre-training Objectives

   3-3. Whole Word Masking

   3-4. Image Augmentation

4. Experiments

   4-1. Classification Tasks

   4-2. Retrieval Tasks

   4-3. Ablation Study

5. Conclusion

 

 

1. Introduction

 VLP models have so far shown promising results on vision-and-language tasks. To be fed into a VLP model, image pixels must be embedded together with the language tokens, and a CNN has been considered essential for this visual embedding step.

 

 ์ง€๊ธˆ๊นŒ์ง€๋„ ๋Œ€๋ถ€๋ถ„์˜ VLP ์—ฐ๊ตฌ๋“ค์€ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ์œ„ํ•ด visual embedder์˜ ํž˜์„ ์ฆ๊ฐ€์‹œ์ผฐ๋‹ค. ๋ฌด๊ฑฐ์šด visual embedder๋ฅผ ์‚ฌ์šฉํ–ˆ์„ ๋•Œ์˜ ๋‹จ์ ์€ ํ•™์ˆ  ์‹คํ—˜์—์„œ ์ž˜ ์ƒ๊ฐ๋˜์ง€ ์•Š์•˜๋Š”๋ฐ, ์™œ๋ƒํ•˜๋ฉด ํ•™์Šต ์‹œ๊ฐ„์— region feature๋Š” ์ €์žฅ๋˜์„œ feature ์ถ”์ถœ์˜ ๋ถ€๋‹ด์„ ์ค„์—ฌ์ค€๋‹ค. ํ•˜์ง€๋งŒ, query๊ฐ€ wild ํ™˜๊ฒฝ์—์„œ ๋А๋ฆฐ ์ถ”์ถœ ํ”„๋กœ์„ธ์Šค๋ฅผ ๊ฐ€์ง„๋‹ค๋Š” ๋ช…ํ™•ํ•œ real-world ์‘์šฉ์— ํ•œ๊ณ„๊ฐ€ ์žˆ๋‹ค.

 

 ์ด๋ฅผ ์œ„ํ•˜์—ฌ, ๋…ผ๋ฌธ์—์„œ๋Š” attention์„ visual input์˜ ๋น ๋ฅธ ์ž„๋ฒ ๋”ฉ๊ณผ ๊ฐ€๋ฒผ์šด ๋ฌด๊ฒŒ๋กœ ์ „ํ™˜ํ•˜์˜€๋‹ค. ์ตœ๊ทผ์˜ ์—ฐ๊ตฌ๋“ค์€ transformer์— ํ”ฝ์…€์ด ๋“ค์–ด๊ฐ€๊ธฐ ์ „์— patch์˜ ๊ฐ„๋‹จํ•œ linear projection์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด ํšจ๊ณผ์ ์ด๋ผ๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์คฌ๋‹ค. ๋”ฐ๋ผ์„œ ๋…ผ๋ฌธ์—์„œ๋Š” ๊ธฐ์กด์— visual feature๋ฅผ ์ฒ˜๋ฆฌํ•  ๋•Œ ์‚ฌ์šฉํ•œ CNN ๋Œ€์‹ ์— text feature์„ ์‚ฌ์šฉํ•  ๋•Œ์ฒ˜๋Ÿผ ๊ฐ„๋‹จํ•œ linear projection์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์œผ๋กœ๋„ ์ถฉ๋ถ„ํ•œ ์„ฑ๋Šฅ์„ ๋‚ผ ์ˆ˜ ์žˆ๋‹ค๊ณ  ์ƒ๊ฐํ•˜์—ฌ ๋Œ€์ฒดํ•˜์˜€๋‹ค.

 

 ์ด ๋…ผ๋ฌธ์—์„œ๋Š” ํ•˜๋‚˜์˜ ํ†ตํ•ฉ๋œ ๋ฐฉ์‹์—์„œ ๋‘ ๊ฐœ์˜ modality๋ฅผ ๋‹ค๋ฃจ๋Š” VIsion-and-Language Transformer(ViLT)๋ฅผ ์†Œ๊ฐœํ•˜์˜€๋‹ค. ์ด ๋ชจ๋ธ์ด ๊ธฐ์กด์˜ VLP model๊ณผ ๋‹ค๋ฅธ ์ ์€ pixel-level input์˜ embedding์ด CNN์„ ์‚ฌ์šฉํ•˜์ง€ ์•Š์•˜๋‹ค๋Š” ์ ์ด๋‹ค. ๋‹จ์ง€ visual input์— ๋Œ€ํ•œ deep embedder๋ฅผ ์ œ๊ฑฐํ–ˆ์„ ๋ฟ์ธ๋ฐ ๋ชจ๋ธ์˜ ํฌ๊ธฐ์™€ ๋Ÿฌ๋‹ ํƒ€์ž„์ด ์ƒ๋‹นํžˆ ์ค„์—ˆ๋‹ค. ๋‹ค์Œ์˜ ๊ทธ๋ฆผ 1์€ ViLT์˜ ํŒŒ๋ผ๋ฏธํ„ฐ ํšจ์œจ์„ฑ์„ ๋ณด์—ฌ์ค€๋‹ค.

 

๊ทธ๋ฆผ 1. ๊ธฐ์กด์˜ VLP architecture๊ณผ ViLT ๊ฐ„์˜ ๋น„๊ต. ViLT๊ฐ€ ํ›จ์”ฌ ์ ์€ ๊ณ„์‚ฐ๋Ÿ‰์„ ์‚ฌ์šฉํ•จ.

 

 ๋…ผ๋ฌธ์˜ key contribution์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

  • Proposes a simple architecture: instead of a separate deep embedder, the transformer itself extracts and processes visual features, giving a noticeably shorter runtime and fewer parameters.
  • Achieves competitive vision-and-language results without using region features or a deep convolutional visual embedder.
  • Shows that whole word masking and image augmentation improve downstream performance.

 

2. Background

2-1. Taxonomy of Vision-and-Language Models

 

 ๋…ผ๋ฌธ์—์„œ๋Š” vision-and-language model์˜ ๋ถ„๋ฅ˜๋ฅผ ๋‹ค์Œ์˜ ๋‘ ๊ด€์ ์—์„œ ๊ธฐ๋ฐ˜ํ•ด์„œ ๋ถ„๋ฅ˜ํ–ˆ๋‹ค. ์ด๋ ‡๊ฒŒ ๋‚˜์˜จ 4๊ฐœ์˜ ๋ถ„๋ฅ˜๋Š” ๊ทธ๋ฆผ 2์— ๋‚˜ํƒ€๋‚˜ ์žˆ๋‹ค.

 

  1. ๋‘ ๊ฐœ์˜ modality๊ฐ€ ํŒŒ๋ผ๋ฏธํ„ฐ์™€ ๊ณ„์‚ฐ๋Ÿ‰ ์ธก๋ฉด์—์„œ ์–ด๋А ์ •๋„์˜ ํ‘œํ˜„์„ ๊ฐ€์ง€๋Š”๊ฐ€
  2. ๋‘ ๊ฐœ์˜ modality๊ฐ€ deep network์—์„œ ์ƒํ˜ธ์ž‘์šฉ์„ ํ•˜๋Š”๊ฐ€

 

Figure 2. Four categories of vision-and-language models. VE: visual embedder, TE: textual embedder, MI: modality interaction

 

 ์ด ๋…ผ๋ฌธ์—์„œ ์ œ์•ˆ๋œ ViLT๋Š” ์œ„ ๊ทธ๋ฆผ 2์—์„œ d ์œ ํ˜•์— ์†ํ•˜๋Š” ๋ชจ๋ธ์ด๋‹ค. ์—ฌ๊ธฐ์„œ raw pixel์˜ ์ž„๋ฒ ๋”ฉ ๋ ˆ์ด์–ด๋Š” ์–•๊ณ  text token์ฒ˜๋Ÿผ ๊ณ„์‚ฐ์ ์œผ๋กœ ๊ฐ€๋ณ๋‹ค. ์ด architecture๋Š” modality ์ƒํ˜ธ์ž‘์šฉ์„ ๋ชจ๋ธ๋งํ•˜๋Š”๋ฐ ๋Œ€๋ถ€๋ถ„์˜ ๊ณ„์‚ฐ์— ์ง‘์ค‘ํ•˜์˜€๋‹ค.

 

 

3. Vision-and-Language Transformer

3-1. Model Overview

 

 ViLT๋Š” VLP ๋ชจ๋ธ์— ๋น„ํ•ด ๊ฐ„๊ฒฐํ•œ architecture์ด๋‹ค. ์ตœ์†Œ์˜ visual embedding ํŒŒ์ดํ”„๋ผ์ธ๊ณผ single-stream ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•˜์˜€๋‹ค.

 

 ๋…ผ๋ฌธ์—์„œ๋Š” BERT ๋Œ€์‹ ์— pre-trained ViT๋กœ๋ถ€ํ„ฐ ์ƒํ˜ธ์ž‘์šฉ transformer ๊ฐ€์ค‘์น˜๋ฅผ ์ดˆ๊ธฐํ™”ํ•œ๋‹ค๋Š” ํ™˜๊ฒฝ์—์„œ ๋ฒ—์–ด๋‚ฌ๋‹ค. ์ด๋Ÿฌํ•œ ์ดˆ๊ธฐํ™”๋Š” ์ƒํ˜ธ ์ž‘์šฉ ๋ ˆ์ด์–ด์˜ ๊ธฐ๋Šฅ์„ ํ™œ์šฉํ•˜์—ฌ visual feature์„ ์ฒ˜๋ฆฌํ•˜๋Š” ๋™์‹œ์— ๋ณ„๋„์˜ ์‹ฌ์ธต visual embedder๊ฐ€ ๋ถ€์กฑํ•˜๋‹ค.

 

 

 ViT consists of stacked blocks, each containing a multiheaded self-attention (MSA) layer and an MLP layer. The only difference from BERT is the position of layer normalization (LN): BERT is post-norm (LN after MSA and MLP), while ViT is pre-norm (LN before MSA and MLP). The input text $t \in \mathbb{R}^{L \times |V|}$ is embedded to $\bar{t} \in \mathbb{R}^{L \times H}$ with a word embedding matrix $T \in \mathbb{R}^{|V| \times H}$ and a position embedding matrix $T^{pos} \in \mathbb{R}^{(L+1) \times H}$.

 

 ์ž…๋ ฅ ์ด๋ฏธ์ง€ $I \in \mathbb{R}^{C \times H \times W}$๋Š” ํŒจ์น˜๋กœ ์ž˜๋ผ์ง€๊ณ  $v \in \mathbb{R}^{N \times (P^{2} \cdot C)}$๋กœ ๋‚ฉ์ž‘ํ•ด์ง€๊ณ , ์—ฌ๊ธฐ์„œ $(P, P)$๋Š” ํŒจ์น˜์˜ ํ•ด์ƒ๋„์ด๊ณ  $N = HW \setminus P^{2}$. linear projection $V \in \mathbb{R}^{(P^{2} \cdot C) \times H}$์™€ position embedding $V^{pos} \in \mathbb{R}^{(N+1) \times H}$์ด๊ณ , $v$๋Š” $\bar{v} \in \mathbb{R}^{N \times H}$์œผ๋กœ ์ž„๋ฒ ๋”ฉ๋œ๋‹ค.

 

 ํ…์ŠคํŠธ ์ž„๋ฒ ๋”ฉ๊ณผ ์ด๋ฏธ์ง€ ์ž„๋ฒ ๋”ฉ์€ ํ•ด๋‹นํ•˜๋Š” modal-type ์ž„๋ฒ ๋”ฉ ๋ฒกํ„ฐ $t^{type}, v^{type} \in \mathbb{R}^{H}$๊ณผ ํ•ฉํ•ด์ง„ ๋‹ค์Œ์—, ๊ฒฐํ•ฉ๋œ ์‹œํ€€์Šค $z^{0}$์œผ๋กœ ์—ฐ๊ฒฐ๋œ๋‹ค. contextualized vector $z$๋Š” ์ตœ์ข… contextualized ์‹œํ€€์Šค $z^{D}$ ์ง์ „๊นŒ์ง€ ๊นŠ์ด $D$์˜ transformer layer์„ ํ†ตํ•ด ๋ฐ˜๋ณต์ ์œผ๋กœ ์—…๋ฐ์ดํŠธ ๋œ๋‹ค. $p$๋Š” ์ „์ฒด multi-modal ์ž…๋ ฅ์˜ pooled representation์ด๊ณ , linear projection $W_{poop} \in \mathbb{R}^{H \times H}$์™€ ํ•˜์ดํผ๋ณผ๋ฆญ ํƒ„์  ํŠธ๋ฅผ ์‹œํ€€์Šค $z^{D}$์˜ ์ฒซ ๋ฒˆ์งธ ์ธ๋ฑ์Šค์— ์ ์šฉํ•จ์œผ๋กœ์จ ์–ป์–ด์ง€๊ฒŒ ๋œ๋‹ค.

 

 ๋ชจ๋“  ์‹คํ—˜์—์„œ, ImageNet์—์„œ pre-train ๋œ ViT-B/32๋กœ๋ถ€ํ„ฐ ๊ฐ€์ค‘์น˜๊ฐ€ ์‚ฌ์šฉ๋˜๊ณ , ๋”ฐ๋ผ์„œ ์ด๋ฆ„์„ ViLT-B/32๋ผ๊ณ  ์ง€์—ˆ๋‹ค. hidden size $H$๋Š” 768์ด๊ณ , layer ๊นŠ์ด $D$๋Š” 12, ํŒจ์น˜ ์‚ฌ์ด์ฆˆ $P$๋Š” 32, MLP ์‚ฌ์ด์ฆˆ๋Š” 3,072, attention head์˜ ์ˆ˜์˜ 12์ด๋‹ค.

 

3-2. Pre-training Objectives

 

 ๋…ผ๋ฌธ์—์„œ๋Š” ViLT๋ฅผ ๋ณดํ†ต VLP model์„ ํ•™์Šต์‹œํ‚ฌ ๋•Œ ์ผ๋ฐ˜์ ์œผ๋กœ ์‚ฌ์šฉํ•˜๋Š” ๋‘ ๊ฐœ์˜ objective๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ•™์Šต์‹œ์ผฐ๋‹ค: image text matching(ITM) & masked language modeling(MLM).

 

Image Text Matching.  With a probability of 0.5, the aligned image is replaced with a different image. A single-layer ITM head projects the pooled output feature $p$ to logits over the two classes, and the negative log-likelihood is computed as the ITM loss.
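
 A minimal sketch of the ITM loss as described above (the head and names are illustrative):

```python
# Sketch of the ITM objective: a single linear head over the pooled output p,
# trained with cross-entropy (negative log-likelihood) on matched/unmatched labels.
import torch
import torch.nn as nn
import torch.nn.functional as F

itm_head = nn.Linear(768, 2)                              # single-layer ITM head

def itm_loss(p, is_matched):
    """p: (B, 768) pooled outputs; is_matched: (B,) long tensor, 1 if the pair is aligned."""
    logits = itm_head(p)                                  # (B, 2) logits over {negative, positive}
    return F.cross_entropy(logits, is_matched)            # NLL of the true class

# toy usage: half of the batch has its image swapped for a random negative
p = torch.randn(4, 768)
labels = torch.tensor([1, 0, 1, 0])
print(itm_loss(p, labels))
```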

 

 ์ถ”๊ฐ€์ ์œผ๋กœ ๋…ผ๋ฌธ์—์„œ word region alignment ๋ชฉํ‘œ์— ์˜๊ฐ์„ ๋ฐ›์•„์„œ, ๋‘ ๊ฐœ์˜ ์„œ๋ธŒ์…‹: $z^{D}|_{t}$(textual subset) & $z^{D}|_{v}$(visual subset) ๊ฐ„์˜ ์ •๋ ฌ ์ ์ˆ˜๋ฅผ ์ตœ์ ์˜ ์ „์†ก์„ ์œ„ํ•œ IPOT๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ณ„์‚ฐํ•˜๋Š” word patch alignment(WPA)๋ฅผ ์ œ์•ˆํ•˜์˜€๋‹ค.

 

Masked Language Modeling.  ์ด ๋ชฉํ‘œ๋Š” contextualized vector $z_{masked}^{D}|_{t}$๋กœ๋ถ€ํ„ฐ masked text token $t_{masked}$์˜ ์‹ค์ œ ๋ผ๋ฒจ์„ ์˜ˆ์ธกํ•˜๋Š” ๋ชฉํ‘œ์ด๋‹ค. BERT์˜ ๋งˆ์Šคํ‚น ์ „๋žต์„ ์‚ฌ์šฉํ•ด์„œ ํ™•๋ฅ  0.15๋กœ $t$๋ฅผ ๋žœ๋คํ•˜๊ฒŒ ๋งˆ์Šคํ‚นํ•˜์˜€๋‹ค.

 

 ๋…ผ๋ฌธ์—์„œ๋Š” BERT์˜ MLM ๋ชฉํ‘œ์ฒ˜๋Ÿผ ์ž…๋ ฅ์œผ๋กœ $z_{masked}^{D}|_{t}$๊ฐ€ ๋“ค์–ด์˜ค๊ณ  vocabulary์•  ๋Œ€ํ•œ logit์„ ์ถœ๋ ฅํ•˜๋Š” two-layer MLP MLM head๋ฅผ ์‚ฌ์šฉํ•˜์˜€๋‹ค. MLM loss๋Š” masked token์„ ์œ„ํ•œ negative log-liklihood loss์ฒ˜๋Ÿผ ๊ณ„์‚ฐ๋˜์—ˆ๋‹ค.

 

3-3. Whole Word Masking

 

 whole word masking์€ ์ „์ฒด ๋‹จ์–ด๋ฅผ ๊ตฌ์„ฑํ•˜๋Š” ์—ฐ์†๋˜๋Š” subword๋“ค์„ ๋ชจ๋‘ maskํ•˜๋Š” masking technique์ด๋‹ค. ์ด technique์€ ๊ธฐ์กด & Chinese BERT์„ ์ ์šฉํ•  ๋•Œ downstream task์—์„œ ํšจ๊ณผ์ ์ธ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์คฌ๋‹ค. ๋…ผ๋ฌธ์—์„œ๋Š” ๋‹ค๋ฅธ modality๋กœ๋ถ€ํ„ฐ ์ •๋ณด์˜ ์‚ฌ์šฉ์„ full๋กœ ํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” VLP๋ฅผ ์œ„ํ•œ whole word masking์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•˜๋‹ค๊ณ  ๊ฐ€์ •ํ•˜์˜€๋‹ค. ๊ทธ๋ž˜์„œ ์‹ค์ œ๋กœ WordPiece๋กœ ๋‚˜๋ˆ ์ง„ ๋ชจ๋“  ํ† ํฐ์„ masking ํ•˜์˜€๋‹ค. ๊ทธ๋ ‡์ง€ ์•Š์œผ๋ฉด ์ด๋ฏธ์ง€๋กœ๋ถ€ํ„ฐ ์ •๋ณด๋ฅผ ์‚ฌ์šฉํ•˜์ง€ ์•Š๊ณ , ์ธ์ ‘ ๋‹จ์–ด๋กœ๋งŒ ์˜ˆ์ธก์„ ์ง„ํ–‰ํ•˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.

 

3-4. Image Augmentation

 

 image augmentation์€ vision model์˜ ์ผ๋ฐ˜ํ™” ๋Šฅ๋ ฅ์„ ํ–ฅ์ƒ์‹œํ‚จ๋‹ค. ํ•˜์ง€๋งŒ, image augmentation์˜ ๋Šฅ๋ ฅ์€ VLP model์—์„œ ์•„์ง ํƒ๊ตฌ๋˜์ง€ ์•Š์•˜๋‹ค. visual feature ์ €์žฅ์€ region feature ๊ธฐ๋ฐ˜ VLP ๋ชจ๋ธ์ด image augmentation์„ ์‚ฌ์šฉํ•˜์ง€ ๋ชปํ•˜๋„๋ก ์ œํ•œํ–ˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.

 

 ์ด๋ฅผ ์œ„ํ•ด ๋…ผ๋ฌธ์—์„œ๋Š” fine-tuning ์ค‘์— RandAugment๋ฅผ ์ ์šฉํ•˜์˜€๋‹ค. ์™ ๋งŒํ•œ policy๋ฅผ ๋ชจ๋‘ ์‚ฌ์šฉํ•˜์˜€์ง€๋งŒ, ๋‘ ๊ฐ€์ง€๋งŒ์€ ์ œ์™ธํ•˜์˜€๋‹ค: color inversion โ†’ ํ…์ŠคํŠธ๋Š” ์ƒ‰๊น” ์ •๋ณด๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ๊ธฐ๋„ ํ•˜๊ธฐ ๋•Œ๋ฌธ, cutout โ†’ ์ด๋ฏธ์ง€์˜ ์กฐ๊ทธ๋งˆํ•œ ๋ถ€๋ถ„์„ ์ œ๊ฑฐํ•˜์ง€๋งŒ, ์ด ๋ถ€๋ถ„์ด ์ค‘์š”ํ•œ object์ผ ์ˆ˜๋„ ์žˆ๊ธฐ ๋•Œ๋ฌธ.

 

 

4. Experiments

4-1. Classification Tasks

 

 ๋…ผ๋ฌธ์—์„œ๋Š” ์ผ๋ฐ˜์ ์œผ๋กœ ์‚ฌ์šฉ๋˜๋Š” ๋‘ ๊ฐœ์˜ ๋ฐ์ดํ„ฐ์…‹์—์„œ ViLT-B/32๋ฅผ ํ‰๊ฐ€ํ•˜์˜€๋‹ค: VQAv2 & NLVR2. ๋…ผ๋ฌธ์—์„œ๋Š” fine-tuned downstream head ์ฒ˜๋Ÿผ ํžˆ๋“  ์‚ฌ์ด์ฆˆ 1,536์˜ two-layer MLP๋ฅผ ์‚ฌ์šฉํ•˜์˜€๋‹ค.

 

Visual Question Answering.  ViLT-B/32 is evaluated on the VQAv2 dataset. Table 1 shows the results: ViLT falls short of the VQA scores of other VLP models equipped with heavy visual embedders.

 

Natural Language for Visual Reasoning.  ViLT-B/32 is evaluated on the NLVR2 dataset.

 

 ๋‹ค์Œ์˜ ํ‘œ 2๋Š” ์ด ๋‘ task์— ๋Œ€ํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์ฃผ๊ณ  ์žˆ๋‹ค. ๊ฒฐ๊ณผ๋ฅผ ์‚ดํŽด๋ณด๋ฉด ViLT-B/32๋Š” ๋‘ ๋ฐ์ดํ„ฐ์…‹์—์„œ ์œ ๋งํ•œ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•˜์˜€๊ณ , ์ข‹์€ ์ถ”๋ก  ์†๋„๋ฅผ ์–ป์—ˆ๋‹ค. 

 

Table 1. Classification task results

 

4-2. Retrieval Tasks

 

๋…ผ๋ฌธ์—์„œ๋Š” ViLT-B/32๋ฅผ MSCOCO & F30k์˜ ๋ถ„ํ• ์—์„œ fine-tune ํ•˜์˜€๋‹ค. image-to-text & text-to-image ๊ฒ€์ƒ‰์„ ์œ„ํ•ด, ๋…ผ๋ฌธ์—์„œ๋Š” zero-shot๊ณผ fine-tuned ์„ฑ๋Šฅ์„ ๋ชจ๋‘ ๋น„๊ตํ•˜์˜€๋‹ค. ๋…ผ๋ฌธ์—์„œ๋Š” 15๊ฐœ์˜ ํ…์ŠคํŠธ๋ฅผ negative sample๋กœ ์ƒ˜ํ”Œ๋งํ•˜๊ณ  model์„ positive ์Œ์˜ score๋ฅผ ๊ทน๋Œ€ํ™”ํ•˜๋Š” cross-entropy loss๋ฅผ ์‚ฌ์šฉํ•ด์„œ tuning ํ•˜์˜€๋‹ค. 

 

 ๋‹ค์Œ์˜ ํ‘œ 2๋Š” zero-shot ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์ฃผ๊ณ  ์žˆ๊ณ , ํ‘œ 3์€ fine-tuned ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์ฃผ๊ณ  ์žˆ๋‹ค. zero-shot ๊ฒ€์ƒ‰์—์„œ ViLT-B/32๋Š” ๋”์šฑ ํฐ ๋ฐ์ดํ„ฐ์…‹์—์„œ pre-train ๋œ ImageBERT ๋ณด๋‹ค ๋” ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์คฌ๋‹ค. fine-tuned ๊ฒ€์ƒ‰์—์„œ 2 ๋ฒˆ์งธ๋กœ ๋น ๋ฅธ ๋ชจ๋ธ๋ณด๋‹ค ๋†’์€ ๋งˆ์ง„์œผ๋กœ ํฐ recall์„ ๋ณด์—ฌ์คฌ๋‹ค.

 

Table 2. Zero-shot retrieval results

 

Table 3. Fine-tuned retrieval results

 

4-3. Ablation Study

 

 ํ‘œ 4์—์„œ ์—ฌ๋Ÿฌ ablation์— ๋Œ€ํ•ด์„œ ์ˆ˜ํ–‰ํ•˜์˜€๋‹ค: โ†‘ training steps & whole word masking & image augmentation. ์ด ablation์˜ ์ด์ ์„ ํŒŒ์•…ํ•˜์˜€๋‹ค. ๋”์šฑ ๊ธด training step์—์„œ ๋ชจ๋ธ์„ ํ•™์Šต์‹œ์ผฐ์„ ๋•Œ, ์„ฑ๋Šฅ์€ ์ƒ์Šน๋˜์—ˆ๋‹ค(1์—ด~3์—ด). MLM ๋ชฉํ‘œ๋ฅผ ์œ„ํ•œ ์ „์ฒด ๋‹จ์–ด ๋งˆ์Šคํ‚น(3์—ด~4์—ด)๊ณผ augmentation์„ ์‚ฌ์šฉํ•œ fine-tuning(6์—ด)์„ ํ•œ ๊ฒฐ๊ณผ ์„ฑ๋Šฅ์ด ํ–ฅ์ƒ๋˜์—ˆ๋‹ค. 

 

Table 4. Ablation study results

 

5. Conclusion

๋…ผ๋ฌธ์—์„œ๋Š” ์ตœ์†Œํ™”๋œ VLP architecture์ธ Vision-and-Language Transformer(ViLT)๋ฅผ ์†Œ๊ฐœํ•˜์˜€๋‹ค. ViLT๋Š” visual embedding์„ ์œ„ํ•œ CNN์„ ์‚ฌ์šฉํ•˜๋Š” ๋‹ค๋ฅธ ๋ชจ๋ธ๋“ค์— ๋น„ํ•ด ์œ ๋งํ•œ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์คฌ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ViLT์˜ ์š”์†Œ๋“ค์˜ ์ค‘์š”์„ฑ์— ๋Œ€ํ•ด ์•Œ์•„๋ณด์•˜๋‹ค.

 

Scalability.  ์ ๋‹น์–‘์˜ ๋ฐ์ดํ„ฐ๊ฐ€ ์ฃผ์–ด์ง€๋ฉด pre-trained transformer์˜ ์„ฑ๋Šฅ์€ ์ž˜ scale ๋œ๋‹ค.

 

Masked Modeling for Visual Inputs.  MRM์˜ ์„ฑ๊ณต์€ visual modality๋ฅผ ์œ„ํ•œ masked modeling objective๊ฐ€ transformer์˜ ๋งˆ์ง€๋ง‰ ๋ ˆ์ด์–ด๊นŒ์ง€ ์ •๋ณด๋ฅผ ๋ณด์กดํ•จ์œผ๋กœ์จ ๋„์™€์ฃผ์—ˆ๋‹ค. 

 

Augmentation Strategies.  Using RandAugment yielded gains in downstream performance compared to simpler augmentation strategies.

 

 

 

 

Source

https://arxiv.org/abs/2102.03334

 


 
