
PaLM-E: An Embodied Multimodal Language Model (Paper Review)

2023. 3. 21. 22:26

I remember posting a review of Google's PaLM on this blog a while ago. I recall being quite surprised by its enormous number of parameters, and now this PaLM has become even more multimodal. In this post, we'll look at PaLM-'E', which goes beyond text and can now handle images as well. This post was written with reference to the paper and Google's introductory blog post.

 

 

The overview of PaLM-E

Machine learning has made tremendous advances over the last few years. These advanced models can solve a variety of language problems, such as explaining jokes or answering visual questions. They can even generate images given a text description! 😲 Such innovations became possible because large datasets became much more widely available and new advances allowed models to be trained on them. However, while robotic models have shown some success stories, they appear to lag behind other domains because robotics lacks datasets on the scale of large text corpora or image datasets.

 

PaLM-E is the robotic model Google introduced to address this problem: it transfers knowledge gained from diverse visual and language domains to a robotic system, and it excels across many areas. PaLM-E "embodies" the powerful LLM PaLM by complementing it with sensor data obtained from a robotic agent. This is the biggest difference from previous attempts to bring LLMs into robotics. Rather than relying only on text input, the LM is trained to directly ingest the raw stream of robot sensor data. The resulting model not only excels at robot learning, but also achieves SOTA as a general-purpose visual-language model while maintaining strong language-only task capabilities.

 

 

An Embodied language model, and also a visual-language generalist

On the one hand, PaLM-E was developed as a model for robotics, and it solves a variety of tasks on multiple types of robots and for multiple modalities. At the same time, PaLM-E is a generally-capable vision-language model. It can perform visual tasks such as describing images, recognizing objects, and classifying scenes, and it can also perform language tasks such as quoting poetry, solving math equations, or generating code.

 

PaLM-E combines PaLM, the LLM Google introduced most recently, with ViT-22B, its most advanced vision model. The largest model built this way is PaLM-E-562B, which uses PaLM-540B; without any task-specific fine-tuning it achieved a new SOTA on the visual-language OK-VQA benchmark, while essentially retaining the same language capabilities as PaLM-540B.

 

The main contributions of PaLM-E are as follows.

 

  1. Embodied data can be mixed into multimodal LLM training to train a generalist, transfer-learned, multi-embodiment decision-making agent.
  2. Current SOTA visual-language models do not handle zero-shot reasoning problems well; nevertheless, it is possible to train a competent generalist visual-language model.
  3. Novel architectural components are proposed, such as neural scene representations and entity-labeling multimodal tokens.
  4. Beyond vision and language, PaLM-E shows qualitatively promising capabilities across multiple areas.
  5. Scaling up the model size enables multimodal fine-tuning with less catastrophic forgetting.

 

How does PaLM-E work?

Technically, PaLM-E works by injecting observations into a pre-trained LM. This is realized by transforming sensor data such as images into a representation through a procedure comparable to how words of natural language are processed by an LM.

 

An LM relies on a mechanism for representing text mathematically so that a neural network can process it. This is achieved by first splitting the text into so-called tokens that encode (sub)words; each token is associated with a high-dimensional vector of numbers, the token embedding. The LM applies mathematical operations (such as matrix multiplication) to the resulting sequence of vectors to predict the most likely next word token. By feeding the newly predicted word back in as input, the LM can iteratively generate longer and longer text.
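To make this token, embedding, and next-token loop concrete, here is a minimal toy sketch in PyTorch. All sizes, module choices, and names are illustrative assumptions of mine, not PaLM's actual implementation:

```python
import torch
import torch.nn as nn

# Toy autoregressive LM: token ids -> embeddings -> transformer -> next-token logits.
VOCAB, DIM = 1000, 64                                    # illustrative sizes only

embed = nn.Embedding(VOCAB, DIM)                         # each token id gets a vector
layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=2)
lm_head = nn.Linear(DIM, VOCAB)                          # hidden vector -> vocab logits

def generate(prompt_ids: torch.Tensor, steps: int = 5) -> torch.Tensor:
    ids = prompt_ids.clone()
    for _ in range(steps):
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        h = backbone(embed(ids), mask=mask)              # contextualized token vectors
        next_id = lm_head(h[:, -1]).argmax(dim=-1)       # greedy choice of next token
        ids = torch.cat([ids, next_id[:, None]], dim=1)  # feed the prediction back in
    return ids

print(generate(torch.tensor([[1, 2, 3]])))               # 3 prompt tokens + 5 generated
```

PaLM itself is of course a decoder-only model trained at vastly larger scale; the point here is only the iterative predict-and-append loop described above.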

 

The inputs to PaLM-E are text and other modalities (images, robot states, scene embeddings, etc.) in arbitrary order, which the paper calls "multimodal sentences". For example, an input can take the form "What happened between <img_1> and <img_2>?", where <img_1> and <img_2> are two images. The output is text generated auto-regressively by PaLM-E, which can be an answer to a question or a sequence of decisions in text form.
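One hypothetical way to picture such a "multimodal sentence" is an interleaved list of text spans and image references. The snippet below is only meant to make the input format concrete; the names are made up and this is not the paper's actual API:

```python
from dataclasses import dataclass

@dataclass
class ImageRef:
    """Placeholder for an image whose encoder output gets spliced into the prompt."""
    path: str

# "What happened between <img_1> and <img_2>?" as an interleaved multimodal sentence.
multimodal_sentence = [
    "What happened between ",
    ImageRef("img_1.png"),
    " and ",
    ImageRef("img_2.png"),
    "?",
]
# Text spans become ordinary word tokens; each ImageRef is replaced by a block of
# continuous embeddings from the vision encoder before the sequence enters the LM.
```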

The architecture of the PaLM-E model, showing how PaLM-E ingests different modalities and solves tasks through multimodal language modeling.

 

The idea of PaLM-E is to train encoders that map the various inputs into the same space as the natural-language word token embeddings. These continuous inputs are mapped into something resembling "words" (although they need not form a discrete set). Because the word and image embeddings have the same dimensionality, both can be fed into the LM.
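A minimal sketch of this idea, assuming ViT-style patch features that are linearly projected into the LM's word-embedding dimension and then spliced in at the image position. All shapes and module names here are assumptions for illustration:

```python
import torch
import torch.nn as nn

LM_DIM, VIT_DIM, N_PATCHES = 64, 32, 16           # illustrative sizes only

word_embed = nn.Embedding(1000, LM_DIM)           # the LM's ordinary token embeddings
project = nn.Linear(VIT_DIM, LM_DIM)              # maps image features into that space

def build_multimodal_sequence(prefix_ids, image_feats, suffix_ids):
    """Interleave word embeddings and projected image embeddings into one sequence."""
    prefix = word_embed(prefix_ids)               # (1, T1, LM_DIM)
    image = project(image_feats)                  # (1, N_PATCHES, LM_DIM): "image words"
    suffix = word_embed(suffix_ids)               # (1, T2, LM_DIM)
    return torch.cat([prefix, image, suffix], dim=1)

seq = build_multimodal_sequence(
    torch.tensor([[5, 6, 7]]),                    # tokens before the image placeholder
    torch.randn(1, N_PATCHES, VIT_DIM),           # ViT patch features for one image
    torch.tensor([[8, 9]]),                       # tokens after the placeholder
)
print(seq.shape)  # torch.Size([1, 21, 64]): a sequence the LM can consume directly
```

Because the projected image vectors live in the same space as the word embeddings, the LM can attend over them exactly as it would over text.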

 

For training, the paper initializes PaLM-E from pre-trained models for both language (PaLM) and vision (ViT). All parameters of the model can be updated during training.
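Since everything is differentiable end to end, whether a pre-trained backbone is updated or kept frozen is essentially a single switch per module. The sketch below only shows how such a switch might look; the modules are stand-ins, not the actual PaLM/ViT checkpoints:

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Enable or disable gradient updates for every parameter in a module."""
    for p in module.parameters():
        p.requires_grad = trainable

# Stand-ins for the pre-trained components used to initialize PaLM-E.
language_model = nn.Linear(64, 64)    # stands in for PaLM
vision_encoder = nn.Linear(32, 64)    # stands in for ViT plus its projection

# Full fine-tuning, as described above: every parameter can be updated.
set_trainable(language_model, True)
set_trainable(vision_encoder, True)

# Alternative design choice: freeze the LM and adapt only the input encoders.
set_trainable(language_model, False)
set_trainable(vision_encoder, True)
```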

 

 

Transferring knowledge from large-scale training to robots

PaLM-E offers a new paradigm for training a generalist model. It achieves this by framing robot tasks and vision-language tasks together under one common representation: taking text and images as input, and outputting text. A key result is that PaLM-E attains significant positive knowledge transfer from the vision and language domains, which improves the effectiveness of robot learning.

 

Positive transfer of knowledge from general vision-language tasks enables more effective robot learning. The figure shows three different robot embodiments and domains.

 

The results show that PaLM-E can address a large set of robotics, vision, and language tasks simultaneously without performance degradation compared to training individual models on individual tasks. Moreover, the visual-language data significantly improves the performance of the robot tasks. Through this transfer, PaLM-E can learn robot tasks efficiently in terms of the number of examples needed to solve a task.

 

 

Results

The paper evaluates PaLM-E in three robotic environments, two of which involve real robots, as well as on general vision-language tasks such as visual question answering (VQA), image captioning, and general language tasks. When PaLM-E is tasked with making decisions for a robot, it is paired with a low-level language-to-action policy that translates its text into low-level robot actions.

 

The first example below shows a person asking a mobile robot to bring a bag of chips. To complete the task successfully, PaLM-E produces a plan to find and open the drawer, and then responds to changes in the world by updating its plan as it carries out the task. In the second example, the robot is asked to grab a green block. Even though that robot has not seen the block before, PaLM-E still generates a step-by-step plan that generalizes beyond the robot's training data.

 

PaLM-E controls a mobile robot operating in a kitchen environment. The capabilities shown in the clips above are facilitated by transfer learning from vision and language models.

 

In the second environment, shown below, the same PaLM-E model solves very long-horizon and precise tasks, such as "sort the blocks by colors into corners," on a different kind of robot. It looks at the images and produces a sequence of shorter textually-represented actions, for example "Push the blue cube to the bottom right corner," then "Push the blue triangle there too." Such long-horizon tasks were previously out of scope for autonomous completion. The model also demonstrates the ability to generalize to new tasks not seen during training (zero-shot generalization), such as pushing red blocks to the coffee cup.

 

PaLM-E controls a tabletop robot to successfully complete long-horizon tasks.

 

The third robot environment is inspired by the field of task and motion planning (TAMP), which studies combinatorially challenging planning tasks in which the robot faces a very large number of possible action sequences. Using only a modest amount of training data from an expert TAMP planner, PaLM-E not only solves these tasks but also leverages visual and language knowledge transfer to do so more effectively.

 

PaLM-E produces plans for a task and motion planning environment.

 

As a visual-language generalist, PaLM-E is competitive even with the best vision-language-only models. In particular, PaLM-E-562B achieves the best performance on the challenging OK-VQA dataset, a task that requires not only visual understanding but also external knowledge of the world. Moreover, this result is reached with a generalist model, without fine-tuning specifically on that task alone.

 

 

Conclusion

PaLM-E pushes the boundaries of how generally-capable models can be trained to simultaneously address vision, language, and robotics, while also transferring knowledge from vision and language into the robotics domain. The paper covers additional topics in more detail, so I recommend checking it out.

 

Beyond providing a path toward building more capable robots that benefit from other data sources, PaLM-E may also be a key enabler of a wide range of other multimodal learning applications, including the ability to unify tasks that have so far seemed separate.

 

 

 

 

References

PaLM-E: An Embodied Multimodal Language Model, https://arxiv.org/abs/2303.03378

PaLM-E: An embodied multimodal language model, Google AI Blog, https://ai.googleblog.com/2023/03/palm-e-embodied-multimodal-language.html

 
