์ „์ฒด ๊ธ€

Welcome! I'm a student studying deep learning (NLP) 😉 The goal of my study is to develop a competent LLM that helps people!
Paper Reading ๐Ÿ“œ/multimodal models

ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision (Paper Review)

The overview of this paper: VLP models, which perform well across various vision-and-language tasks, rely heavily on region supervision (object detection) and convolutional architectures (ResNet) to extract features from images. The authors find this problematic in terms of efficiency/speed and expressive power. Efficiency/speed: extracting the input features takes more computation than the multi-modal interaction itself. Expressive power: it is upper-bounded by the expressiveness of the visual embedder and its predefined visual vocabulary. This paper proposes Vision-and-Language Transformer (ViLT), a minimal VLP model that ..
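ViLT's remedy, as the paper describes it, is to drop the detector and the CNN entirely and embed the image with a simple linear projection of pixel patches, in the style of ViT. Below is a minimal sketch of that input pipeline in PyTorch; all sizes (patch size, hidden dim, vocab, sequence lengths) are hypothetical stand-ins, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Sketch of a ViLT-style input: image patches are linearly projected
# (no CNN backbone, no object detector) and concatenated with text token
# embeddings, then a single transformer encoder handles both modalities.
patch_size, dim, vocab = 32, 768, 30522          # hypothetical sizes
img = torch.randn(1, 3, 224, 224)                # one dummy RGB image
patches = img.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.reshape(1, 3, -1, patch_size, patch_size)
patches = patches.permute(0, 2, 1, 3, 4).flatten(2)   # (1, 49, 3*32*32)

patch_embed = nn.Linear(3 * patch_size * patch_size, dim)
text_embed = nn.Embedding(vocab, dim)
tokens = torch.randint(0, vocab, (1, 16))        # dummy caption token ids
x = torch.cat([text_embed(tokens), patch_embed(patches)], dim=1)

encoder = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
out = encoder(x)                                 # (1, 16 + 49, 768)
```

Since the per-image work is just one linear projection, almost all of the compute goes to the multi-modal interaction, which is exactly the imbalance the paper points out in detector-based VLP.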

Paper Reading ๐Ÿ“œ/multimodal models

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks (Paper Review)

The overview of this paper: Large-scale pre-training methods that learn cross-modal representations from image-text pairs have become popular for vision-language tasks. However, existing methods simply concatenate image region features and text features and take no further measures. This paper therefore introduces Oscar, a new learning method that uses object tags detected in the image as anchor points, which makes learning the alignments significantly easier. The method builds on the observation that the most salient objects in an image will be detected by an object detector, and that these objects are often mentioned in the paired text ..
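Concretely, Oscar represents each image-text pair as a (word tokens, object tags, region features) triple in one BERT-style sequence. The sketch below shows just that input construction; the 2048-dim feature size and the sequence lengths are hypothetical placeholders.

```python
import torch
import torch.nn as nn

# Sketch of Oscar's input triple: caption words, detected object tags
# (embedded as words), and region features share one sequence. The tags,
# living in both the text and image spaces, act as anchor points.
dim, vocab = 768, 30522                          # hypothetical sizes
word_embed = nn.Embedding(vocab, dim)
words = word_embed(torch.randint(0, vocab, (1, 12)))   # caption tokens
tags  = word_embed(torch.randint(0, vocab, (1, 4)))    # e.g. "dog", "couch"

region_proj = nn.Linear(2048, dim)               # detector features -> dim
regions = region_proj(torch.randn(1, 4, 2048))   # 4 detected regions

x = torch.cat([words, tags, regions], dim=1)     # (1, 12 + 4 + 4, 768)
```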

Paper Reading ๐Ÿ“œ/multimodal models

VLP: Unified Vision-Language Pre-Training for Image Captioning and VQA (Paper Review)

The overview of this paper: This paper proposes a unified Vision-Language Pre-training (VLP) model. The model is unified in two ways, which distinguishes it from methods that implement the encoder and decoder as two separate models: (1) it can be fine-tuned for either vision-language understanding or generation, and (2) it uses a shared multi-layer transformer for both encoding and decoding. The unified VLP model is pre-trained on a huge amount of image-text pairs using unsupervised learning objectives for two tasks: bidirectional & sequence-to-..
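The two objectives can share one set of transformer weights because they differ only in the attention mask. A minimal sketch of the two masks for a hypothetical 5-token sequence, using the usual additive-mask convention where -inf blocks attention:

```python
import torch

# Bidirectional objective: every position may attend to every other position.
# Sequence-to-sequence objective: each position attends only to itself and
# earlier positions, so the same shared weights can also act as a decoder.
n = 5
bidirectional_mask = torch.zeros(n, n)           # all zeros = attend anywhere
seq2seq_mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
print(seq2seq_mask)                              # -inf above the diagonal
```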

Paper Reading ๐Ÿ“œ/multimodal models

VL-BERT: Pre-training of Generic Visual-Linguistic Representations (Paper Review)

The overview of this paper: This paper introduces a new pre-trainable generic representation for visual-linguistic tasks called Visual-Linguistic BERT (VL-BERT). VL-BERT adopts the simple yet powerful Transformer model as its backbone, extending it to take both visual and linguistic embedded features as input. The inputs are words from the input sentence and RoI features from the input image. To make the representation more generic, VL-BERT is pre-trained on the large-scale Conceptual Captions dataset together with a text-only corpus ..
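In VL-BERT every input element is built by summing several embeddings before entering the single-stream backbone. The sketch below follows that scheme with hypothetical sizes (2048-dim detector features, 10 words, 5 RoIs); it omits details such as the [IMG] token and the full-image feature added at word positions.

```python
import torch
import torch.nn as nn

# Sketch of VL-BERT's summed input embeddings: token + visual feature
# projection + segment + position, for a word segment followed by RoIs.
dim, vocab, n_words, n_rois = 768, 30522, 10, 5  # hypothetical sizes
token_embed = nn.Embedding(vocab, dim)
visual_proj = nn.Linear(2048, dim)               # R-CNN RoI features -> dim
segment     = nn.Embedding(2, dim)               # 0 = text, 1 = image
position    = nn.Embedding(64, dim)

words = token_embed(torch.randint(0, vocab, (1, n_words)))
rois  = visual_proj(torch.randn(1, n_rois, 2048))
seq   = torch.cat([words, rois], dim=1)          # (1, 15, 768)
seg   = segment(torch.tensor([[0] * n_words + [1] * n_rois]))
pos   = position(torch.arange(n_words + n_rois)).unsqueeze(0)
x = seq + seg + pos                              # input to the BERT backbone
```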

Paper Reading ๐Ÿ“œ/multimodal models

LXMERT: Learning Cross-Modality Encoder Representations from Transformers (Paper Review)

The overview of this paper: Vision-and-language reasoning requires an understanding of visual concepts and language semantics and, most importantly, the alignment between these two modalities. This paper therefore proposes LXMERT (Learning Cross-Modality Encoder Representations from Transformers) to learn these vision-and-language connections. LXMERT uses a large-scale Transformer model consisting of three encoders, and connects vision and language semantics with five diverse representative pre-training tasks. These tasks intr..
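The three encoders are a language encoder, an object-relationship encoder, and a cross-modality encoder on top. Here is a rough sketch of how they compose, with hypothetical sizes; the paper uses separate attention modules per direction and per layer, which is collapsed to one shared module here for brevity.

```python
import torch
import torch.nn as nn

# Sketch of LXMERT's three-encoder layout: each modality is first encoded
# on its own, then a cross-modality step lets each side attend to the other.
dim = 768                                        # hypothetical size
lang_enc   = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
object_enc = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
cross_attn = nn.MultiheadAttention(dim, num_heads=12, batch_first=True)

lang   = lang_enc(torch.randn(1, 16, dim))       # word embeddings
vision = object_enc(torch.randn(1, 36, dim))     # 36 detected objects
lang_out, _   = cross_attn(lang, vision, vision) # language attends to vision
vision_out, _ = cross_attn(vision, lang, lang)   # vision attends to language
```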

Paper Reading ๐Ÿ“œ/multimodal models

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

The overview of this paper: This paper proposes ViLBERT, a model for learning task-agnostic joint representations of images and language. It extends the BERT architecture, famous in NLP, into a multi-modal two-stream model: ViLBERT processes visual and linguistic inputs in separate streams that interact through co-attentional transformer layers. With only minor additions to the existing base architecture, the authors pre-train ViLBERT on two proxy tasks using the huge, automatically collected Conceptual Captions dataset ..
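The co-attentional layer is the two-stream counterpart of self-attention: each stream keeps its own weights, but its queries attend over the other stream's keys and values. A minimal sketch of one such exchange, with hypothetical sizes:

```python
import torch
import torch.nn as nn

# Sketch of a ViLBERT co-attentional block: the linguistic stream queries
# the visual stream and vice versa, so information flows across modalities
# while each stream keeps separate parameters.
dim = 768                                        # hypothetical size
text_coattn   = nn.MultiheadAttention(dim, num_heads=12, batch_first=True)
visual_coattn = nn.MultiheadAttention(dim, num_heads=12, batch_first=True)

text   = torch.randn(1, 20, dim)                 # linguistic stream states
visual = torch.randn(1, 36, dim)                 # visual stream states
text_next, _   = text_coattn(text, visual, visual)   # Q: text,   K/V: visual
visual_next, _ = visual_coattn(visual, text, text)   # Q: visual, K/V: text
```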

Paper Reading ๐Ÿ“œ/multimodal models

VisualBERT: A Simple and Performant Baseline for Vision and Language (Paper Review)

The overview of this paper: This paper proposes VisualBERT, a simple and flexible framework for modeling a broad range of vision-language tasks. VisualBERT consists of a stack of Transformer layers that use self-attention to implicitly align elements of the input text with regions in the associated input image. The authors additionally propose two visually-grounded language model objectives for pre-training VisualBERT on image caption data. Experiments on four vision-language tasks, VQA, VCR, NLVR, and Flickr30K, show that VisualBERT is simpl..
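Because there is no separate fusion module, the whole model really is one sequence through one Transformer stack. A minimal single-stream sketch with hypothetical sizes:

```python
import torch
import torch.nn as nn

# Sketch of VisualBERT's single stream: text token embeddings and projected
# region features form one sequence, and a plain stack of Transformer layers
# lets self-attention align the two modalities implicitly.
dim, vocab = 768, 30522                          # hypothetical sizes
text    = nn.Embedding(vocab, dim)(torch.randint(0, vocab, (1, 14)))
regions = nn.Linear(2048, dim)(torch.randn(1, 36, 2048))

layer = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
stack = nn.TransformerEncoder(layer, num_layers=12)
out = stack(torch.cat([text, regions], dim=1))   # (1, 14 + 36, 768)
```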

Paper Reading ๐Ÿ“œ/multimodal models

Let's Learn About VLMs (Vision-Language Models)!

์š”์ฆ˜ ๋“ค์–ด์„œ๋Š” ํ•œ ๊ฐ€์ง€ ๊ธฐ์ˆ ๋กœ๋Š” ์„ฑ๊ณตํ•  ์ˆ˜ ์—†๋Š” ์‹œ๋Œ€๋ผ๊ณ  ํ•œ๋‹ค. ํ•œ ๋งˆ๋””๋กœ '์œตํ•ฉ'์ด ํ•„์š”๊ฐ€ ์•„๋‹Œ ํ•„์ˆ˜๊ฐ€ ๋˜์–ด๊ฐ€๊ณ  ์žˆ๋Š” ์„ธ์ƒ์ด๋‹ค. ์ด๋ฒˆ์— OpenAI์—์„œ ๊ณต๊ฐœํ•œ GPT-4๋„ ์ด์ „์˜ GPT ๋ชจ๋ธ๋“ค๊ณผ ๋‹ฌ๋ฆฌ ์ด๋ฏธ์ง€ ๋ฐ์ดํ„ฐ๋„ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋Š” multimodal ์„ฑ์„ ๋ณด์—ฌ์คฌ๋‹ค. ์‹ค๋กœ ์—„์ฒญ๋‚œ ๋ฐœ์ „์ด๋ผ๊ณ  ํ•  ์ˆ˜ ์žˆ๋Š”๋ฐ, ์ด๋ฒˆ ํฌ์ŠคํŠธ์—์„œ๋Š” multimodal model์˜ ํ•œ ์ข…๋ฅ˜์ธ Vision-Language Model$($VLM$)$์— ๋Œ€ํ•ด ์•Œ์•„๋ณด๋„๋ก ํ•˜๊ฒ ๋‹ค! ์ด ํฌ์ŠคํŠธ๋Š” HuggingFace์˜ ๋ธ”๋กœ๊ทธ๋ฅผ ์ฐธ๊ณ ํ•˜์—ฌ ์ž‘์„ฑ๋˜์—ˆ๋‹ค. HuggingFace Blog: https://huggingface.co/blog/vision_language_pretraining#supporting-vision-language-models-..

Paper Reading ๐Ÿ“œ/Natural Language Processing

ChatGPT์˜ hallucination, ์–ด๋–ป๊ฒŒ ํ•ด๊ฒฐํ•ด์•ผ ํ• ๊นŒ? - Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback

์ด์ œ ์ฃผ๋ณ€์—์„œ ChatGPT๋ฅผ ์จ๋ณธ ์‚ฌ๋žŒ๋ณด๋‹ค ์•ˆ ์จ๋ณธ ์‚ฌ๋žŒ์„ ์ฐพ๊ธฐ ํž˜๋“ค ์ •๋„๋กœ ์šฐ๋ฆฌ ์‚ฌํšŒ์— ๊นŠ์ด ์Šค๋ฉฐ๋“ค์—ˆ๋‹ค. ํ•„์ž๋„ ์ด ChatGPT์— ๊ด€ํ•œ ๋…ผ๋ฌธ์— ๋Œ€ํ•ด์„œ๋„ ์—ฌ๋Ÿฌ ๋ฒˆ ๋ฆฌ๋ทฐ๋ฅผ ํ–ˆ๋‹ค. ์ด๋Ÿฐ ๋…ผ๋ฌธ์„ ๋ฆฌ๋ทฐํ•  ๋•Œ๋งˆ๋‹ค ๋А๋ผ์ง€๋งŒ ChatGPT๋Š” ์ •๋ง ํ˜์‹ ์ ์ธ ๊ธฐ์ˆ ์ด๋ผ๊ณ  ์ƒ๊ฐํ•œ๋‹ค. ํ•˜์ง€๋งŒ, ์ด๋Ÿฐ ChatGPT๋„ ์—ฌ๋Ÿฌ๊ฐ€์ง€ ํ ์„ ๊ฐ€์ง€๊ณ  ์žˆ๋Š”๋ฐ, ์˜ˆ๋ฅผ ๋“ค์–ด ๊ฐ€์žฅ ํฐ ๋ฌธ์ œ์  ์ค‘ ํ•˜๋‚˜์ธ hallucination$($ํ™˜๊ฐ$)$์ด ์žˆ๋‹ค. ์ด hallucination์€ ๋ชจ๋ธ์ด ๋ชจ๋ฅด๊ณ  ์žˆ๋Š” ์ •๋ณด์— ๋Œ€ํ•œ query๊ฐ€ ๋“ค์–ด์™”์„ ๋•Œ ์ด query๋ฅผ ๋ชจ๋ฆ„์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  ์•„๋Š” ๊ฒƒ์ฒ˜๋Ÿผ ์—†๋Š” ์‚ฌ์‹ค์„ ๋งŒ๋“ค์–ด๋‚ด๋Š” ์ฆ์ƒ์„ ์˜๋ฏธํ•œ๋‹ค. ๋˜๋Š” ์‹ค์ œ๋กœ ์—†๋Š” ์ •๋ณด์ž„์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  ์žˆ๋Š” ๊ฒƒ์ฒ˜๋Ÿผ ์ฃผ์žฅํ•ด์„œ ์‚ฌ์šฉ์ž์—๊ฒŒ ์ž˜๋ชป๋œ ์ •๋ณด๋ฅผ ์ œ๊ณตํ•ด์ฃผ๋Š” ๋“ฑ์˜ ๋ฌธ์ œ๋“ค์„ ๋งํ•œ๋‹ค. ์ด๋Ÿฌ..

Paper Reading ๐Ÿ“œ/Natural Language Processing

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (Paper Review)

The overview of this paper: BERT and RoBERTa achieved new SoTA performance on sentence-pair regression tasks such as semantic textual similarity (STS). However, these tasks require both sentences to be fed into the network, which incurs substantial computational overhead: finding the most similar pair in a collection of 10,000 sentences requires about 50 million inference computations with BERT. This structure makes BERT unsuitable for unsupervised tasks like clustering as well as for semantic similarity search. This paper uses siamese & triplet networks to c..
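The 50-million figure is just the number of unordered pairs, n(n-1)/2. The sketch below checks that count and shows the siamese alternative in spirit: encode each sentence once, then compare cheap fixed-size embeddings. The random vectors and the 384-dim size are hypothetical stand-ins for real sentence embeddings.

```python
import numpy as np

# Cross-encoder cost: every pair of n sentences needs its own forward pass.
n = 10_000
pairs = n * (n - 1) // 2
print(pairs)                        # 49995000, i.e. the ~50M inferences above

# Siamese-style alternative: n forward passes to get one embedding per
# sentence, then similarity is just cosine between vectors.
emb = np.random.randn(n, 384)       # stand-in for sentence embeddings
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
sims = emb[:100] @ emb[:100].T      # cosine similarities, 100-sentence slice
```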
