Paper Reading 📜/multimodal models


LiT🔥: Zero-Shot Transfer with Locked-image text Tuning Paper Review

The overview of this paper This paper proposes a simple method that uses contrastive training to align image and text models while keeping the advantages of each model's pre-training. According to the paper's experiments, a locked pre-trained image model paired with an unlocked text model performs best. This form of contrastive-tuning is called 'Locked-image Tuning' (LiT). For a new task, LiT teaches only the text model to read out good representations from the pre-trained image model. On new vision tasks, LiT models achieve zero-s..
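The alignment objective behind contrastive-tuning can be illustrated with a symmetric InfoNCE loss over a batch of paired embeddings. Below is a minimal pure-Python sketch (function names and the toy batch are mine, not from the paper); in the LiT setting, gradients from this loss would update only the text tower while the image tower stays locked:

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length, as is standard before cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def contrastive_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric InfoNCE loss: matched image-text pairs sit on the diagonal
    of the similarity matrix; every other pair in the batch is a negative."""
    img = [l2_normalize(v) for v in img_embs]
    txt = [l2_normalize(v) for v in txt_embs]
    n = len(img)
    # cosine similarity matrix, scaled by temperature
    sim = [[sum(a * b for a, b in zip(img[i], txt[j])) / temperature
            for j in range(n)] for i in range(n)]

    def nll(row, target):
        # numerically stable negative log-softmax of row[target]
        m = max(row)
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        return log_z - row[target]

    # image->text and text->image directions, averaged
    loss_i2t = sum(nll(sim[i], i) for i in range(n)) / n
    loss_t2i = sum(nll([sim[i][j] for i in range(n)], j) for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2
```

Correctly paired batches should score a lower loss than shuffled ones, which is what drives the text model toward the frozen image representations.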


VinVL: Revisiting Visual Representations in Vision-Language Models Paper Review

Before starting, please note that this review was not written after reading the full paper. I originally intended to read the full paper, but since what it presents is not so much a new technique as better results obtained with a better model, I wrote this review based on the Microsoft Blog instead. The overview of this paper This paper conducts a detailed study on improving visual representations for vision-language (VL) tasks and develops an improved object detection model that provides object-centric representations of images. Compared with the most commonly used models, the paper's..


FLAVA: A Foundational Language And Vision Alignment Model Paper Review

The overview of this paper State-of-the-art vision and vision-and-language models rely on large-scale vision-linguistic pre-training to obtain good performance on a variety of downstream tasks. Generally, such models are either cross-modal (contrastive) or multi-modal (with earlier fusion), but not both; otherwise they target a specific modality or task. A promising direction going forward is to use a single universal model, a 'foundation', that handles all modalities at once. This paper introduces such a model, FLAVA, and on 35 tasks this mo..


VLMo: Unified Vision-Language Pre-training with Mixture-of-Modality-Experts Paper Review

The overview of this paper This paper presents a unified Vision-Language pretrained Model (VLMO) that jointly learns a dual encoder and a fusion encoder with a modular Transformer. In particular, it introduces the Mixture-of-Modality-Experts (MoME) Transformer, each block of which has a pool of modality-specific experts and a shared self-attention layer. Thanks to the modeling flexibility of MoME, a pretrained VLMo can be fine-tuned as a fusion encoder for vision-language classification tasks, and for efficient image-text retr..
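The routing described above can be sketched structurally: one attention layer shared by all inputs, with the feed-forward path chosen by the input's modality. A toy sketch under my own naming (the real block of course uses learned attention and MLP experts, plus residuals and layer norms omitted here):

```python
class MoMEBlock:
    """Sketch of a Mixture-of-Modality-Experts Transformer block:
    self-attention is shared across modalities, while the feed-forward
    network is a per-modality expert selected by the input's modality."""

    def __init__(self, shared_attention, experts):
        self.shared_attention = shared_attention  # callable: tokens -> tokens
        self.experts = experts                    # dict: modality name -> callable

    def __call__(self, tokens, modality):
        attended = self.shared_attention(tokens)  # same layer for every modality
        ffn = self.experts[modality]              # e.g. 'vision', 'language', 'vl'
        return [ffn(t) for t in attended]
```

Routing by modality is what lets the same pretrained weights act as a dual encoder (separate streams) or a fusion encoder (joint stream) at fine-tuning time.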


SimVLM: Simple Visual Language Model Pre-training with Weak Supervision Paper Review

The overview of this paper Existing Vision-Language Pre-training (VLP) shows impressive performance on many multi-modal downstream tasks, but its expensive annotations limit the scalability of existing models, and the introduction of various dataset-specific objectives complicates the pre-training procedure. This paper relaxes these constraints and presents a minimal pre-training framework, the Simple Visual Language Model (SimVLM). SimVLM has the following advantages: it lowers training complexity by using large-scale weak supervision; a single prefix languag..
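The single objective referred to here is prefix language modeling: the model attends bidirectionally over a prefix (e.g. the image patches plus some leading text) and predicts the remaining tokens left-to-right. A small sketch of the corresponding attention mask (my own helper, not code from the paper):

```python
def prefix_lm_mask(prefix_len, total_len):
    """Attention mask for Prefix Language Modeling (True = may attend).
    Every position can see the whole prefix (bidirectional there);
    positions after the prefix follow a standard causal mask."""
    mask = [[False] * total_len for _ in range(total_len)]
    for i in range(total_len):
        for j in range(total_len):
            if j < prefix_len:
                mask[i][j] = True   # all tokens attend to the prefix
            elif j <= i:
                mask[i][j] = True   # causal attention within the suffix
    return mask
```

With `prefix_len=0` this degenerates to an ordinary causal LM mask, which is one way to see why a single objective suffices for both understanding (encode the prefix) and generation (decode the suffix).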


BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation Paper Review

The overview of this paper Most Vision-Language Pre-training (VLP) approaches have improved performance on a variety of vision-language tasks, but most models work on only one of understanding-based or generation-based tasks. Moreover, performance gains are mostly obtained by scaling up datasets of noisy image-text pairs collected from the web, which is a suboptimal source of supervision. This paper introduces BLIP, a new VLP framework that transfers flexibly to both vision-language understanding and generation tasks. By bootstrapping captions, BLIP effectively utilizes the noisy web..
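BLIP's caption bootstrapping (CapFilt) pairs a captioner that generates synthetic captions with a filter that scores image-text matching. A hedged sketch of the data-cleaning loop, with the function names, scorer interface, and threshold being illustrative assumptions rather than the paper's API:

```python
def bootstrap_captions(pairs, captioner, itm_score, threshold=0.5):
    """CapFilt-style bootstrapping (sketch): for each (image, web_caption)
    pair, generate a synthetic caption, then keep only captions -- web or
    synthetic -- that an image-text matching scorer rates above a threshold."""
    cleaned = []
    for image, web_caption in pairs:
        candidates = [web_caption, captioner(image)]
        for caption in candidates:
            if itm_score(image, caption) >= threshold:
                cleaned.append((image, caption))
    return cleaned
```

The cleaned pairs then serve as pre-training data for the next round, which is the "bootstrapping" in the title.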


ALBEF: Vision and Language Representation Learning with Momentum Distillation Paper Review

The overview of this paper Most vision-and-language representation learning uses a transformer-based multi-modal encoder to jointly model visual tokens and word tokens. This is difficult because when the visual tokens and word tokens are unaligned, it is hard for the multi-modal model to learn image-text interactions. In this paper, to ALign the image & text representations BEfore Fusing (ALBEF) them through cross-modal attention, enabling more grounded vision & language learning, the authors introduce a contra..
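The momentum distillation in the title refers to a teacher model maintained as an exponential moving average (EMA) of the trained model; its predictions serve as soft targets that are more robust to noisy web pairs. The EMA update itself is a one-liner, sketched here over flat parameter lists (the shape of the parameters and the momentum value are illustrative):

```python
def ema_update(student_params, teacher_params, momentum=0.995):
    """Momentum-teacher update (sketch): each teacher parameter moves a
    small step toward the corresponding student parameter, so the teacher
    is a slowly varying exponential moving average of the student."""
    return [momentum * t + (1.0 - momentum) * s
            for s, t in zip(student_params, teacher_params)]
```

A high momentum keeps the teacher stable, which is what makes its pseudo-targets useful as a denoised training signal.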


ALIGN: Scaling up Visual and Vision-Language Representation with Noisy Text Supervision Paper Review

The overview of this paper Pre-trained visual and vision-language representations depend heavily on curated training datasets. For vision applications, representations have mostly been learned on datasets with explicit class labels, such as ImageNet or OpenImages. Because the data-collection process for such datasets is costly, dataset size is limited, which hinders the scaling of trained models. This paper constructs a dataset of about one billion noisy image alt-text pairs without the expensive filtering or post-processing steps used for the Conceptual Captions dataset..


ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision Paper Review

The overview of this paper VLP, which performs well on various vision-and-language tasks, relies heavily on region supervision (object detection) and convolutional architectures (ResNet) to extract image features. The paper finds this problematic in terms of both efficiency/speed and expressive power. Efficiency/speed: extracting the input features requires far more computation than the multi-modal interaction itself. Expressive power: it is upper-bounded by the expressiveness of the visual embedder and its predefined visual vocabulary. This paper presents the Vision-and-Language Transformer (ViLT), a minimal VLP model..
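ViLT removes the detector and CNN by embedding the image the way ViT does: cut it into fixed-size patches, flatten each patch, and apply a single linear projection. The patch-slicing step can be sketched in a few lines (the projection matrix is omitted; a toy single-channel image stands in for real pixel data):

```python
def patch_embed(image, patch_size):
    """ViLT/ViT-style visual tokenization (sketch): slice a 2-D image into
    non-overlapping patch_size x patch_size patches and flatten each one.
    In the real model a learned linear projection then maps every flattened
    patch to the transformer width, replacing CNN and region features."""
    rows, cols = len(image), len(image[0])
    patches = []
    for r in range(0, rows, patch_size):
        for c in range(0, cols, patch_size):
            patch = [image[r + dr][c + dc]
                     for dr in range(patch_size)
                     for dc in range(patch_size)]
            patches.append(patch)
    return patches
```

Because this tokenization is just a slice-and-project, the visual embedding step becomes orders of magnitude cheaper than running a region detector, which is the efficiency argument above.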


Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks Paper Review

The overview of this paper Large-scale pre-training methods that learn cross-modal representations from image-text pairs have become prominent for vision-language tasks. However, existing methods simply concatenate image region features and text features and take no further measures. This paper therefore introduces Oscar, a new learning method that uses object tags detected in images as anchor points, which makes the alignment much easier to learn. The method is motivated by the observation that the most salient objects in an image will be detected by an object detector, and these objects will often be mentioned in the paired text..
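Concretely, Oscar represents each example as a Word-Tag-Image triple, with the detected tags sitting in the text channel next to the caption tokens. A small sketch of assembling that input (the separator-token layout here is an assumption for illustration, not copied from the paper's code):

```python
def build_oscar_input(words, object_tags, region_features):
    """Sketch of Oscar's Word-Tag-Image triple: caption word tokens and
    detected object tags form the text channel, region features the image
    channel; the tags act as anchor points shared between modalities."""
    text_channel = ["[CLS]"] + words + ["[SEP]"] + object_tags + ["[SEP]"]
    return text_channel, region_features
```

Because a tag like 'dog' is likely to appear both as a detected tag and inside the caption, the model gets an explicit lexical bridge between the two modalities, which is the anchoring idea described above.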

Cartinoe
List of posts in the 'Paper Reading 📜/multimodal models' category