Paper Reading 📜/multimodal models

VLP: Unified Vision-Language Pre-Training for Image Captioning and VQA (Paper Review)

The overview of this paper: This paper proposes a unified Vision-Language Pre-training (VLP) model. The model is unified in two ways: (1) it can be fine-tuned for either vision-language understanding or vision-language generation, and (2) it uses a shared multi-layer Transformer for both encoding and decoding. This sets VLP apart from methods that implement the encoder and decoder as two separate models. The unified VLP model is pre-trained on a large amount of image-text pairs with two unsupervised learning objectives: bidirectional and sequence-to-sequence..
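
To make the two objectives concrete, here is a minimal sketch (my own illustration under assumed conventions, not the authors' code) of how they mainly differ in the self-attention mask applied over the joint image-text sequence; the [image regions | text tokens] layout and the helper name vlp_attention_mask are assumptions.

```python
import torch

def vlp_attention_mask(num_img: int, num_txt: int, seq2seq: bool) -> torch.Tensor:
    """Return an (L, L) mask where 1 means position i may attend to position j.
    Assumed layout of the joint sequence: [image regions | text tokens]."""
    L = num_img + num_txt
    if not seq2seq:
        # bidirectional objective: every position sees every position
        return torch.ones(L, L)
    mask = torch.zeros(L, L)
    mask[:, :num_img] = 1.0                            # all positions see the image regions
    causal = torch.tril(torch.ones(num_txt, num_txt))  # text sees only earlier text tokens
    mask[num_img:, num_img:] = causal
    return mask

bi_mask = vlp_attention_mask(num_img=4, num_txt=3, seq2seq=False)
s2s_mask = vlp_attention_mask(num_img=4, num_txt=3, seq2seq=True)
```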

VL-BERT: Pre-training of Generic Visual-Linguistic Representations (Paper Review)

The overview of this paper: This paper introduces a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT). VL-BERT adopts the simple yet powerful Transformer model as its backbone and extends it to take both visual and linguistic embedded features as input. The inputs are words taken from the sentence and RoI features taken from the input image. To make the representation fit generic tasks better, VL-BERT is pre-trained on the large-scale Conceptual Captions dataset together with a text-only corpus..
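
A rough sketch of the single-stream input described above, where word embeddings and projected RoI features are packed into one sequence for the Transformer backbone; the class name, dimensions, and segment scheme are illustrative assumptions rather than the paper's actual code.

```python
import torch
import torch.nn as nn

class SingleStreamInput(nn.Module):
    """Illustrative only: combine word tokens and detector RoI features
    into one input sequence, roughly in the spirit of VL-BERT."""
    def __init__(self, vocab_size: int = 30522, roi_dim: int = 2048, hidden: int = 768):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)
        self.roi_proj = nn.Linear(roi_dim, hidden)   # map detector features to model width
        self.segment_emb = nn.Embedding(2, hidden)   # 0 = linguistic tokens, 1 = visual tokens

    def forward(self, token_ids: torch.Tensor, roi_feats: torch.Tensor) -> torch.Tensor:
        # token_ids: (B, T) word ids; roi_feats: (B, R, roi_dim) RoI features
        text = self.word_emb(token_ids) + self.segment_emb(torch.zeros_like(token_ids))
        vis_ids = torch.ones(roi_feats.shape[:2], dtype=torch.long, device=roi_feats.device)
        vis = self.roi_proj(roi_feats) + self.segment_emb(vis_ids)
        return torch.cat([text, vis], dim=1)         # (B, T + R, hidden), fed to the Transformer
```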

LXMERT: Learning Cross-Modality Encoder Representations from Transformers (Paper Review)

The overview of this paper: Vision-and-language reasoning requires an understanding of visual concepts and language semantics and, most importantly, the alignment between these two modalities. The paper therefore proposes LXMERT (Learning Cross-Modality Encoder Representations from Transformers) to learn these vision-and-language connections. LXMERT uses a large-scale Transformer model consisting of three encoders and adopts five diverse representative pre-training tasks to connect vision and language semantics. These tasks help learn both intra-..
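
A coarse sketch of the three-encoder layout the excerpt mentions (a language encoder, an object-relationship encoder, and a cross-modality encoder on top); the layer counts, widths, and the reduced cross-attention step below are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class ThreeEncoderSketch(nn.Module):
    """Rough sketch: encode each modality on its own, then let the two
    streams attend to each other in a stack of cross-modality steps."""
    def __init__(self, hidden: int = 768, n_lang: int = 9, n_obj: int = 5, n_cross: int = 5):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
        self.language_encoder = nn.TransformerEncoder(enc_layer, num_layers=n_lang)
        self.object_encoder = nn.TransformerEncoder(enc_layer, num_layers=n_obj)
        # cross-modality encoder, reduced here to bidirectional cross-attention only
        self.lang_to_vis = nn.ModuleList(
            [nn.MultiheadAttention(hidden, 12, batch_first=True) for _ in range(n_cross)])
        self.vis_to_lang = nn.ModuleList(
            [nn.MultiheadAttention(hidden, 12, batch_first=True) for _ in range(n_cross)])

    def forward(self, lang_emb: torch.Tensor, obj_emb: torch.Tensor):
        lang = self.language_encoder(lang_emb)   # (B, T, hidden) word embeddings
        vis = self.object_encoder(obj_emb)       # (B, R, hidden) object/RoI embeddings
        for l2v, v2l in zip(self.lang_to_vis, self.vis_to_lang):
            lang_new, _ = l2v(query=lang, key=vis, value=vis)
            vis_new, _ = v2l(query=vis, key=lang, value=lang)
            lang, vis = lang + lang_new, vis + vis_new   # residual connections
        return lang, vis
```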

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

The overview of this paper: This paper proposes ViLBERT, a model for learning task-agnostic joint representations of images and language. The model extends the BERT architecture, well known in NLP, into a multi-modal two-stream model. ViLBERT therefore processes visual and linguistic inputs in separate streams that interact through co-attentional Transformer layers. With only minor additions to the existing base architecture, the paper pre-trains ViLBERT on two proxy tasks using the large, automatically collected Conceptual Captions dataset..
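
A minimal sketch of the co-attention idea described above: each stream keeps its own block, but queries from one modality attend to keys and values from the other. The class name, sizes, and activation are assumptions for illustration; the full model also interleaves these with standard Transformer blocks in each stream.

```python
import torch
import torch.nn as nn

class CoAttentionLayer(nn.Module):
    """Sketch of a co-attentional Transformer layer: the visual stream uses the
    linguistic stream as keys/values and vice versa, then each stream applies
    its own feed-forward network."""
    def __init__(self, hidden: int = 768, heads: int = 8):
        super().__init__()
        self.attn_v = nn.MultiheadAttention(hidden, heads, batch_first=True)  # vision queries -> language keys/values
        self.attn_l = nn.MultiheadAttention(hidden, heads, batch_first=True)  # language queries -> vision keys/values
        self.ffn_v = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden))
        self.ffn_l = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden))
        self.norm_v1, self.norm_v2 = nn.LayerNorm(hidden), nn.LayerNorm(hidden)
        self.norm_l1, self.norm_l2 = nn.LayerNorm(hidden), nn.LayerNorm(hidden)

    def forward(self, vis: torch.Tensor, lang: torch.Tensor):
        # vis: (B, R, hidden) image-region stream, lang: (B, T, hidden) text stream
        v_attn, _ = self.attn_v(query=vis, key=lang, value=lang)
        l_attn, _ = self.attn_l(query=lang, key=vis, value=vis)
        vis = self.norm_v1(vis + v_attn)
        lang = self.norm_l1(lang + l_attn)
        vis = self.norm_v2(vis + self.ffn_v(vis))
        lang = self.norm_l2(lang + self.ffn_l(lang))
        return vis, lang
```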

VisualBERT: A Simple and Performant Baseline for Vision and Language (Paper Review)

The overview of this paper: This paper proposes VisualBERT, a simple and flexible framework for modeling a broad range of vision-language tasks. VisualBERT consists of a stack of Transformer layers that use self-attention to align elements of the input text with regions of the associated input image. To pre-train VisualBERT on image caption data, the paper additionally proposes two visually-grounded language model objectives. Experiments on four vision-language tasks (VQA, VCR, NLVR, and Flickr30K) show that VisualBERT is simple..
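
A small sketch of one of the two visually-grounded objectives (masked language modeling with the image regions kept visible as context, per the original paper); the helper names, encoder, masking scheme, and sizes below are assumptions for illustration.

```python
import torch
import torch.nn as nn

def visually_grounded_mlm_loss(text_emb, img_emb, token_ids, encoder, lm_head,
                               mask_emb, mask_prob: float = 0.15):
    """Illustrative masked-LM-with-image objective: mask some text positions,
    encode [text | image regions] jointly, and predict the masked words."""
    B, T, H = text_emb.shape
    masked = torch.rand(B, T, device=text_emb.device) < mask_prob       # choose positions to mask
    if not masked.any():                                                # ensure at least one masked position
        masked[:, 0] = True
    text_in = torch.where(masked.unsqueeze(-1), mask_emb.expand(B, T, H), text_emb)
    joint = torch.cat([text_in, img_emb], dim=1)                        # single-stream input
    enc_out = encoder(joint)                                            # (B, T + R, H)
    logits = lm_head(enc_out[:, :T])                                    # predict only text positions
    return nn.functional.cross_entropy(logits[masked], token_ids[masked])

# example wiring with hypothetical sizes
hidden, vocab = 768, 30522
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True), num_layers=12)
lm_head = nn.Linear(hidden, vocab)
mask_emb = nn.Parameter(torch.zeros(hidden))

text_emb = torch.randn(2, 8, hidden)
img_emb = torch.randn(2, 5, hidden)
token_ids = torch.randint(0, vocab, (2, 8))
loss = visually_grounded_mlm_loss(text_emb, img_emb, token_ids, encoder, lm_head, mask_emb)
```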

Let's Take a Look at VLMs (Vision-Language Models)!

These days, people say we live in an era in which no single technology can succeed on its own. In short, 'convergence' is becoming a necessity rather than an option. GPT-4, recently released by OpenAI, also showed multimodal capability by handling image data, unlike previous GPT models. This is truly a remarkable advance, so in this post we will look at the Vision-Language Model (VLM), one kind of multimodal model! This post was written with reference to the HuggingFace blog. HuggingFace Blog: https://huggingface.co/blog/vision_language_pretraining#supporting-vision-language-models-..

PaLM-E: An Embodied Multimodal Language Model (Paper Review)

I remember posting a review of PaLM, introduced by Google, on this blog a while ago. I remember being quite surprised by its enormous number of parameters, and now PaLM has become even more multimodal. In this post we will look at PaLM-'E', a model that has gone beyond text and can now handle images as well. This post was written with reference to the paper and Google's introductory blog. The overview of PaLM-E: In recent years, machine learning has made tremendous progress. These advanced models can solve a variety of language problems, such as explaining jokes or answering visual questions. They can even generate images when given a text description! 😲 These innovations come from the use of large datasets..