Paper Reading 📜/Computer Vision

CLIP: Learning Transferable Visual Models From Natural Language Supervision Paper Review

I read and reviewed the CLIP paper released by OpenAI. I wanted to read the whole paper, but it is very long, so I wrote this post from the parts of the paper I considered important, together with the blog. OpenAI's blog post introducing CLIP is also worth consulting. Now, let's get started!! 🔥

CLIP: Connecting text and images

OpenAI introduced a new neural network called CLIP. CLIP efficiently learns visual concepts from natural language supervision. Given nothing more than the names of the visual categories to recognize, CLIP can be applied 'zero-shot', like GPT-2 and GPT-3, to any visual classification benchmark ..
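The zero-shot idea in the excerpt above can be sketched in a few lines: embed the image, embed one text per category name, and pick the category whose text embedding is closest to the image embedding. This is only a minimal sketch under stated assumptions. The function name `zero_shot_classify`, the toy 3-dimensional embeddings, and the fixed `temperature` value are all hypothetical and stand in for the outputs of real image and text encoders; nothing here calls the actual CLIP model.

```python
import numpy as np

def zero_shot_classify(image_embedding, text_embeddings, class_names, temperature=100.0):
    """Return the class whose text embedding is most similar to the image embedding.

    Hypothetical helper: the embeddings are assumed to come from some image
    encoder and text encoder that share one embedding space.
    """
    # L2-normalize so that the dot product equals cosine similarity.
    img = image_embedding / np.linalg.norm(image_embedding)
    txt = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)

    # One similarity score per class name, scaled by an assumed temperature.
    logits = temperature * (txt @ img)

    # Numerically stable softmax over the classes.
    logits = logits - logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    return class_names[int(np.argmax(probs))], probs

# Toy stand-ins for encoder outputs (assumed values, not real CLIP embeddings).
class_names = ["dog", "cat", "car"]
text_embeddings = np.array([
    [1.0, 0.0, 0.0],  # "a photo of a dog"
    [0.0, 1.0, 0.0],  # "a photo of a cat"
    [0.0, 0.0, 1.0],  # "a photo of a car"
])
image_embedding = np.array([0.1, 0.9, 0.2])

label, probs = zero_shot_classify(image_embedding, text_embeddings, class_names)
print(label)  # cat
```

No classifier is trained for the new task: changing the benchmark only means changing the list of category names that gets embedded.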

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows Paper Review

Overview of this paper

This paper proposes a new vision Transformer, the Swin Transformer, which serves as a general-purpose backbone for computer vision. Applying Transformers to computer vision has been difficult because of differences between the vision and NLP domains, such as the wide range of scales of visual features and the high resolution of images compared to text. To overcome these differences, the paper proposes a hierarchical Transformer whose representation is computed with shifted windows. The shifted windowing scheme ..

ViT - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Overview of this paper

The Transformer architecture is the dominant standard in NLP, but its use in computer vision remains very limited. Attention is either used alongside convolutional networks or used to replace some of their components, but the overall structure is never changed. The paper shows that this reliance on CNNs is unnecessary: applying a Transformer directly to image patches performs very well. When pre-trained on large amounts of image data and applied to image benchmarks, the Vision Transformer (ViT), at a lower computational cost ..
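To make "applying a Transformer directly to image patches" concrete, the following minimal sketch cuts an image into non-overlapping 16 x 16 patches and flattens each one into a vector, so the image becomes a sequence of "words". The function name `image_to_patches` and the 224 x 224 RGB input size are assumptions chosen for illustration. The learned linear projection, position embeddings, and the Transformer itself are left out.

```python
import numpy as np

def image_to_patches(image, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping square patches.

    Hypothetical helper for illustration; returns an array of shape
    (num_patches, patch_size * patch_size * C).
    """
    h, w, c = image.shape
    if h % patch_size != 0 or w % patch_size != 0:
        raise ValueError("image height and width must be divisible by patch_size")

    grid_h, grid_w = h // patch_size, w // patch_size
    # (grid_h, patch, grid_w, patch, C): separate the patch grid from in-patch pixels.
    patches = image.reshape(grid_h, patch_size, grid_w, patch_size, c)
    # (grid_h, grid_w, patch, patch, C): bring the two grid axes together.
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch, in row-major order over the grid.
    return patches.reshape(grid_h * grid_w, patch_size * patch_size * c)

# Assumed input size: a 224 x 224 RGB image.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
patches = image_to_patches(image)
print(patches.shape)  # (196, 768)
```

With these assumed sizes the image turns into a sequence of 14 x 14 = 196 tokens, each of dimension 16 x 16 x 3 = 768, which is the kind of input a standard Transformer encoder expects.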

Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization Paper Review

Table of Contents
1. Introduction
2. Grad-CAM

1. Introduction

Grad-CAM stands for Gradient-weighted Class Activation Mapping. When a CNN analyzes an image, people normally cannot see how it reaches its result, but Grad-CAM makes the CNN's behavior clearer and easier to examine in detail. Unlike earlier approaches, Grad-CAM can be applied to a wide range of CNN models without any architectural changes or retraining!! In this paper, combining Grad-CAM with fine-grained visualization produces a high-resolution, class-discriminative visualization. And this ..