
LIMA: Less Is More for Alignment (Paper Review)

2023. 5. 25. 19:06

Overview of this paper

 LLM์€ ๋‘ ๊ฐ€์ง€์˜ ๋‹จ๊ณ„๋กœ ํ•™์Šต๋œ๋‹ค.

 

  1. Unsupervised pre-training on raw text, to learn general-purpose representations
  2. Large-scale instruction tuning & RL, to better align the model to end tasks and user preferences

 

 ์ด ๋‘ ๊ฐ€์ง€ stage์˜ ์ค‘์š”์„ฑ์„ ์ธก์ •ํ•˜๊ธฐ ์œ„ํ•ด ์–ด๋– ํ•œ RL ๋˜๋Š” human preference modeling ์—†์ด ์˜ค์ง 1000๊ฐœ์˜ ์‹ ์ค‘ํ•˜๊ฒŒ ์„ ์ •๋œ prompt & response์—์„œ ๊ธฐ์กด supervised loss๋ฅผ ์‚ฌ์šฉํ•ด์„œ fine-tune ๋œ LLaMA-65B์ธ LIMA๋ฅผ ํ•™์Šต์‹œ์ผฐ๋‹ค. LIMA๋Š” ๋ณต์žกํ•œ ์ฟผ๋ฆฌ๋ฅผ ํฌํ•จํ•˜๋Š” training ๋ฐ์ดํ„ฐ์˜ ๋ช‡ ๊ฐ€์ง€ ์˜ˆ์—์„œ๋งŒ ํŠน์ • ์‘๋‹ต ํ˜•์‹์„ ๋”ฐ๋ฅด๋Š” ๋ฐฉ๋ฒ•์„ ํ•™์Šตํ•˜์—ฌ ๋งค์šฐ ๊ฐ•๋ ฅํ•œ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ค€๋‹ค. ๊ฒŒ๋‹ค๊ฐ€ ๋ชจ๋ธ์€ unseen task์— ๋Œ€ํ•ด ๋”์šฑ ์ž˜ ์ผ๋ฐ˜ํ™”ํ•˜๋Š” ๊ฒฝํ–ฅ์„ ๋ณด์—ฌ์คฌ๋‹ค. ์ด ๋ชจ๋“  ๊ฒƒ์„ ์ข…ํ•ฉํ•˜์—ฌ, ๋…ผ๋ฌธ์˜ ์‹คํ—˜ ๊ฒฐ๊ณผ๋Š” LLM์˜ ๊ฑฐ์˜ ๋ชจ๋“  ์ง€์‹์€ pre-training ์ค‘์— ํ•™์Šต๋œ๋‹ค๋Š” ๊ฒƒ์„ ๊ฐ•๋ ฅํ•˜๊ฒŒ ์ œ์•ˆํ•˜๊ณ  ์ œํ•œ๋œ instruction training ๋ฐ์ดํ„ฐ๋Š” high-quality output์„ ์ƒ์„ฑํ•˜๊ธฐ ์œ„ํ•ด ๋ชจ๋ธ์„ ๊ฐ€๋ฅด์น˜๊ธฐ ์œ„ํ•ด ํ•„์š”ํ•˜๋‹ค.

 

 

Table of Contents

1. Introduction

2. Alignment Data

3. Training LIMA

4. Human Evaluation

5. Why is Less More? Ablations

6. Multi-Turn Dialogue

7. Discussion

 

 

 

1. Introduction

 LM์—๊ฒŒ general-purpose representation์„ ํ•™์Šตํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•ด์ฃผ๋Š” ๊ฒƒ์€ ์–ด๋– ํ•œ language understanding & generation task๋กœ๋„ transfer๊ฐ€ ๊ฐ€๋Šฅํ•˜๋‹ค. ์ด๋ฅผ ์œ„ํ•ด instruction tuning, multi-million-example ๋ฐ์ดํ„ฐ์…‹, RLHF๊ฐ€ ์ œ์•ˆ๋˜์—ˆ๋‹ค. ํ˜„์กดํ•˜๋Š” alignment method๋Š” ChatGPT ๋ ˆ๋ฒจ์˜ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•˜๊ธฐ ์œ„ํ•ด ์ƒ๋‹นํ•œ ์–‘์˜ ๊ณ„์‚ฐ๋Ÿ‰ & ๊ตฌ์ฒด์  ๋ฐ์ดํ„ฐ๋ฅผ ํ•„์š”๋กœ ํ•œ๋‹ค. ํ•˜์ง€๋งŒ, ๋…ผ๋ฌธ์—์„œ๋Š” 1,000๊ฐœ์˜ ์—„์„ ๋œ training example์—์„œ fine-tune ๋จ์œผ๋กœ์จ ๊ฐ•๋ ฅํ•œ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ฃผ์—ˆ๋‹ค.

 

 ๋…ผ๋ฌธ์—์„œ๋Š” alignment๊ฐ€ ๋ชจ๋ธ์ด ์ด๋ฏธ pre-training ์ค‘์— ์–ป์€ ์ง€์‹๊ณผ ๋Šฅ๋ ฅ์„ ๋“œ๋Ÿฌ๋‚ด๊ธฐ ์œ„ํ•ด ๋ชจ๋ธ์ด ์‚ฌ์šฉ์ž๋“ค๊ณผ ์ƒํ˜ธ์ž‘์šฉํ•˜๊ธฐ ์œ„ํ•œ ์Šคํƒ€์ผ ๋˜๋Š” ํ˜•์‹์„ ํ•™์Šตํ•˜๋Š” ๊ฐ„๋‹จํ•œ ํ”„๋กœ์„ธ์Šค์ผ ์ˆ˜๋„ ์žˆ๋‹ค๊ณ  ๊ฐ€์ •ํ•˜์˜€๋‹ค. ์ด ๊ฐ€์„ค์„ ํ…Œ์ŠคํŠธํ•˜๊ธฐ ์œ„ํ•ด์„œ ์‹ค์ œ ์‚ฌ์šฉ์ž prompt์™€ high-quality ์‘๋‹ต์— ๊ฐ€๊นŒ์šด 1,000๊ฐœ์˜ example์„ ์—„์„ ํ•˜์˜€๋‹ค. ๋…ผ๋ฌธ์—์„œ๋Š” ํ€„๋ฆฌํ‹ฐ์™€ ๋‹ค์–‘์„ฑ์„ ์œ„ํ•ด Stack Exchange๋‚˜ wikiHow ๊ฐ™์€ ์ปค๋ฎคํ‹ฐ๋‹ˆ ํฌ๋Ÿผ์œผ๋กœ๋ถ€ํ„ฐ 750๊ฐœ์˜ top question์„ ๊ณจ๋ผ์™”๋‹ค. ์ถ”๊ฐ€์ ์œผ๋กœ ์ˆ˜๊ธฐ๋กœ ์ž‘์„ฑ๋œ 250๊ฐœ์˜ prompt์™€ response example์„ ์ž‘์„ฑํ•˜์˜€๋‹ค. ์ตœ์ข…์ ์œผ๋กœ LLaMA-65B๋ฅผ ์ด 1,000๊ฐœ์˜ demonstration์—์„œ fine-tune ํ•œ ๋ชจ๋ธ์ธ LIMA๋ฅผ ํ•™์Šต์‹œ์ผฐ๋‹ค. 

 

 300๊ฐœ์˜ ๊นŒ๋‹ค๋กœ์šด test prompt์—์„œ LIMA์™€ ๋‹ค๋ฅธ SoTA ๋ชจ๋ธ๋“ค์„ ๋น„๊ตํ•˜์˜€๋‹ค. human preference ์—ฐ๊ตฌ์—์„œ, LIMA๋Š” RLHF-trained DaVinci003์„ ๋Šฅ๊ฐ€ํ•  ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ 52,000๊ฐœ์˜ example์—์„œ ํ•™์Šต๋œ 65B Alpaca๋„ ๋Šฅ๊ฐ€ํ•˜์˜€๋‹ค. ๋น„๋ก ์‚ฌ๋žŒ๋“ค์€ GPT-4, Claude, Bard์˜ ์‘๋‹ต์„ LIMA์˜ ์‘๋‹ต๋ณด๋‹ค ๋” ์„ ํ˜ธํ•˜๊ธด ํ•˜์˜€์ง€๋งŒ, ๊ทธ๋ž˜๋„ ๊ฑฐ์˜ ๋น„์Šทํ•œ ๋น„์œจ๋กœ LIMA์˜ ์‘๋‹ต๋„ ์„ ํ˜ธ๋˜์—ˆ๋‹ค. LIMA์˜ ์‘๋‹ต์„ absolute scale์—์„œ ๋ถ„์„ํ•ด๋ณธ ๊ฒฐ๊ณผ 88%์˜ ์‘๋‹ต์ด prompt๋ฅผ ๋งŒ์กฑํ•˜์˜€๊ณ , 50%์˜ ์‘๋‹ต์ด ํ›Œ๋ฅญํ•˜์˜€๋‹ค.

 

 Ablation ์‹คํ—˜์€ ๋ฐ์ดํ„ฐ ํ’ˆ์งˆ์„ ์ตœ์ ํ™”ํ•  ๋•Œ ํฐ ์ด๋“๊ณผ ํ•จ๊ป˜ prompt ๋‹ค์–‘์„ฑ์„ ํ™•์žฅํ•˜์ง€ ์•Š๊ณ  ๋ฐ์ดํ„ฐ ์ˆ˜๋Ÿ‰์„ ํ™•์žฅํ•  ๋•Œ ์ด๋“์ด ํฌ๊ฒŒ ๊ฐ์†Œํ•จ์„ ๋ณด์—ฌ์ค€๋‹ค. ๊ฒŒ๋‹ค๊ฐ€ ๋น„๋ก 0๊ฐœ์˜ dialogue example์„ ๊ฐ€์ง์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  LIMA๋Š” ๋…ผ๋ฆฌ ์ •์—ฐํ•œ multi-tuen dialogue๋ฅผ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์Œ์„ ์•Œ์•˜๋‹ค. ์ด ๋Šฅ๋ ฅ์€ ์˜ค์ง 30๊ฐœ์˜ hand-crafted dialogue chain์„ ์ถ”๊ฐ€ํ•จ์œผ๋กœ์จ ๊ทน์ ์œผ๋กœ ๊ฐœ์„ ๋  ์ˆ˜ ์žˆ๋‹ค. ์ „๋ฐ˜์ ์œผ๋กœ ์ด๋Ÿฌํ•œ ๊ฒฐ๊ณผ๋Š” pre-training์˜ ํž˜๊ณผ ๋Œ€๊ทœ๋ชจ instruction-tuning๊ณผ RL ๋ฐฉ๋ฒ•์˜ ์ค‘์š”์„ฑ์„ ์„ค๋ช…ํ•œ๋‹ค.

 

2. Alignment Data

๋…ผ๋ฌธ์—์„œ๋Š” Superficial Alignment Hypothesis๋ฅผ ์ •์˜ํ•˜์˜€๋‹ค.

 

๋ชจ๋ธ์˜ ์ง€์‹๊ณผ ๋Šฅ๋ ฅ์€ ๊ฑฐ์˜ pre-training ์ค‘์— ํ•™์Šต๋œ๋‹ค. ๊ทธ๋ฆฌ๊ณ  alignment๋Š” ์‚ฌ์šฉ์ž์™€ ์ƒํ˜ธ์ž‘์šฉ์„

ํ•  ๋•Œ ์‚ฌ์šฉ๋˜์–ด์•ผ ํ•˜๋Š” ํฌ๋งท์˜ ํ•˜์œ„ ๋ถ„ํฌ๋ฅผ ๊ฐ€๋ฅด์นœ๋‹ค.

 

 ๋งŒ์•ฝ ์ด ๊ฐ€์„ค์ด ์‚ฌ์‹ค์ด๋ผ๋ฉด ์‚ฌ๋žŒ๋“ค์€ ๋ณด๋‹ค ์ž‘์€ example ์„ธํŠธ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ PLM์„ ์ถฉ๋ถ„ํžˆ tune ํ•  ์ˆ˜ ์žˆ์–ด์•ผ ํ•œ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ๋…ผ๋ฌธ์—์„œ๋Š” 1,000๊ฐœ์˜ prompt & response ๋ฐ์ดํ„ฐ์…‹์„ ์ˆ˜์ง‘ํ•˜์˜€๋‹ค. ์—ฌ๊ธฐ์„œ output์€ ์„œ๋กœ๋ผ๋ฆฌ ๋ฌธ์ฒด์ ์œผ๋กœ alignํ•˜์ง€๋งŒ, input์€ ๋‹ค์–‘ํ•˜๋‹ค. ํ‘œ 1์€ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ ์†Œ์Šค์— ๋Œ€ํ•œ ๊ฐœ์š”๋ฅผ ๋ณด์—ฌ์ฃผ๊ณ  ๋ช‡ ๊ฐ€์ง€ ํ†ต๊ณ„๋ฅผ ์ œ๊ณตํ•ด ์ค€๋‹ค.

 

Table 1. Sources of training prompts (inputs), responses (outputs), and test prompts. In total, the training data comprises roughly 750,000 tokens, divided over exactly 1,000 sequences.

 

2-1. Community Questions & Answers

 

 ๋…ผ๋ฌธ์—์„œ๋Š” 3๊ฐœ์˜ community Q&A ์›น์‚ฌ์ดํŠธ์—์„œ ๋ฐ์ดํ„ฐ๋ฅผ ์ˆ˜์ง‘ํ•˜์˜€๋‹ค: Stack Exchange, wikiHow, PushShift Reddit Dataset. Stack Exchange & wikiHow๋Š” well-aligned ๋˜์–ด ์žˆ์ง€๋งŒ, Reddit updated answer๋Š” ์›ƒ๊ธฐ๊ฑฐ๋‚˜ ๋‚š์‹œ์„ฑ ๊ธ€์ด ๋งŽ๊ธฐ ๋•Œ๋ฌธ์— ์ ์ ˆํ•œ ์Šคํƒ€์ผ์„ ๋”ฐ๋ฅด๋Š” ์‘๋‹ต์„ ์—„์„ ํ•˜๊ธฐ ์œ„ํ•˜ ๋”์šฑ ์ˆ˜๋™์ ์ธ ๋ฐฉ๋ฒ•์ด ํ•„์š”ํ•˜๋‹ค.

 

2-2. Manually Authored Examples

 

 ์˜จ๋ผ์ธ ์ปค๋ฎค๋‹ˆํ‹ฐ์—์„œ ์‚ฌ์šฉ์ž๋“ค์—๊ฒŒ ์˜ํ•ด ๋ฌผ์–ด๋ด์ง€๋Š” question์— ๋Œ€ํ•ด ๋…ผ๋ฌธ์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ถ”๊ฐ€์ ์œผ๋กœ ๋‹ค์–‘ํ™”์‹œํ‚ค๊ธฐ ์œ„ํ•ด author๋กœ๋ถ€ํ„ฐ prompt๋ฅผ ์ˆ˜์ง‘ํ•œ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ๊ฐ ๊ทธ๋ฃน๋‹น 250๊ฐœ์˜ prompt๋ฅผ ์ƒ์„ฑํ•˜๋Š” ์ด 2๊ฐ€์ง€ ๊ทธ๋ฃน์˜ author์„ ๋””์ž์ธํ–ˆ๋‹ค.

 

  • Group A: 200 prompts for training + 50 prompts for a held-out dev set
  • Group B: 230 prompts remaining after filtering, used for testing

 

 ๋…ผ๋ฌธ์—์„œ๋Š” author๋“ค์ด ์ง์ ‘ ์ž‘์„ฑํ•œ high-quality ์‘๋‹ต๊ณผ ํ•จ๊ป˜ 200๊ฐœ์˜ training prompt๋ฅผ ๋ณด์ถฉํ•˜์˜€๋‹ค. answer๋ฅผ author๋“ค์ด ์ž‘์„ฑํ•˜๋Š” ์ค‘์— helpful AI assistant๋ฅผ ์œ„ํ•œ ์ ์ ˆํ•œ ๊ท ์ผํ•œ ํผ์„ ์„ธํŒ…ํ•˜๊ณ ์ž ํ•˜์˜€๋‹ค. ๊ตฌ์ฒด์ ์œผ๋กœ, question์— ๋Œ€ํ•œ ์ธ์ •๊ณผ answer ์ž์ฒด๋กœ ๋งŽ์€ prompt์— ๋‹ต๋ณ€์ด ์ œ๊ณต๋œ๋‹ค. ์‚ฌ์ „ ์‹คํ—˜๋“ค์€ ์ด๋Ÿฌํ•œ ํ•œ๊ฒฐ๊ฐ™์€ ํ˜•์‹์ด ์ผ๋ฐ˜์ ์œผ๋กœ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚จ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค€๋‹ค. ๋…ผ๋ฌธ์—์„œ๋Š” ์ด๊ฒƒ์ด ๋ชจ๋ธ์ด CoT์˜ "let's think step-by-step"์™€ ์œ ์‚ฌํ•œ ๊ฒƒ์„ ํ˜•์„ฑํ•  ์ˆ˜ ์žˆ๋„๋ก ๋„์™€์ค€๋‹ค๊ณ  ๊ฐ€์ •ํ•˜์˜€๋‹ค.

 

 ๋…ผ๋ฌธ์—์„œ๋Š” 13๊ฐœ์˜ ์–ด๋А ์ •๋„ toxicity ๋˜๋Š” ์•…์˜์— ์ฐฌ training prompt๋„ ํฌํ•จํ•˜์˜€๋‹ค. ์‘๋‹ต์€ ๋ถ€๋ถ„์  ๋˜๋Š” ์™„์ „ํžˆ ๋ช…๋ น์„ ๊ฑฐ์ ˆํ•˜๋„๋ก ์‹ ์ค‘ํ•˜๊ฒŒ ์ž‘์„ฑํ•˜์˜€๊ณ , ์™œ assistant๊ฐ€ ์‘๋‹ต์„ ํ•  ์ˆ˜ ์—†๋Š”์ง€ ๋˜ํ•œ ์„ค๋ช…ํ•˜์˜€๋‹ค. ํ…Œ์ŠคํŠธ ์„ธํŠธ์—๋„ ์ด์™€ ์œ ์‚ฌํ•œ 30๊ฐœ์˜ prompt๊ฐ€ ์กด์žฌํ•œ๋‹ค.

 

 ๊ฒŒ๋‹ค๊ฐ€ author๋“ค์ด ์ž‘์„ฑํ•œ example์— SuperNI๋กœ๋ถ€ํ„ฐ 50๊ฐœ์˜ training example์„ ์ƒ˜ํ”Œ๋งํ•˜์˜€๋‹ค. ์ž ์žฌ์ ์ธ ์‚ฌ์šฉ์ž prompt์˜ ๋ถ„ํฌ๋Š” Super-Natural Instructions์˜ task ๋ถ„ํฌ์™€ ํ‹€๋ฆผ์—†์ด ๋‹ค๋ฅด์ง€๋งŒ, ์ง๊ฐ์€ ์ด ์ž‘์€ ์ƒ˜ํ”Œ์ด training example์˜ ์ „์ฒด ํ˜ผํ•ฉ์— ๋‹ค์–‘์„ฑ์„ ์ถ”๊ฐ€ํ•˜๊ณ  ์ž ์žฌ์ ์œผ๋กœ ๋ชจ๋ธ robust๋ฅผ ๋†’์ผ ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค.

 

 Hand-writing diverse prompts and high-quality responses is laborious. While recent work avoids this manual labor via various methods and focuses on optimizing quantity over quality, this paper instead investigates the effects of diversity and quality.

 

3. Training LIMA

 ๋…ผ๋ฌธ์—์„œ๋Š” ๋‹ค์Œ์˜ ํ”„๋กœํ† ์ฝœ์„ ์‚ฌ์šฉํ•ด์„œ LIMA๋ฅผ ํ•™์Šต์‹œ์ผฐ๋‹ค. LLaMA-65B์—์„œ ์‹œ์ž‘ํ•ด์„œ 1,000๊ฐœ์˜ example alignment training set์—์„œ fine-tune ํ•˜์˜€๋‹ค. ๊ฐ ํ™”์ž๋ฅผ ๋‹ฌ๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด ๊ฐ ํ‘œํ˜„์˜ ๋งˆ์ง€๋ง‰์— ์ŠคํŽ˜์…œ ํ† ํฐ EOT๋ฅผ ์ถ”๊ฐ€ํ•˜์˜€๋‹ค. ์ด ํ† ํฐ์€ ์ƒ์„ฑ์„ ๋ฉˆ์ถ”๋Š” EOS ํ† ํฐ๊ณผ ๋˜‘๊ฐ™์€ ์—ญํ• ์„ ํ•˜์ง€๋งŒ, pre-trained ๋ชจ๋ธ์ด ๊ธฐ์กด EOS ํ† ํฐ์— ์ฃผ์ž…ํ–ˆ์„ ์ˆ˜๋„ ์žˆ๋Š” ๋‹ค๋ฅธ ์˜๋ฏธ์™€์˜ ์œตํ•ฉ์„ ํ”ผํ•œ๋‹ค.

 

 ๋…ผ๋ฌธ์—์„œ๋Š” ๊ธฐ์กด์˜ fine-tuning ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๋”ฐ๋ผ์„œ ์‚ฌ์šฉํ•˜์˜€๋‹ค. ์ž์„ธํ•œ ๋‚ด์šฉ์€ ๋…ผ๋ฌธ์„ ํ™•์ธํ•ด ์ฃผ๊ธธ ๋ฐ”๋ž€๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๋…ผ๋ฌธ์—์„œ๋Š” perplexity๊ฐ€ ์ƒ์„ฑ ํ€„๋ฆฌํ‹ฐ์™€ ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์ง€ ์•Š๋Š”๋‹ค๋Š” ๊ฒƒ์„ ๋ฐœ๊ฒฌํ•˜์˜€๊ณ , held-out 50-example dev set์„ ์‚ฌ์šฉํ•ด์„œ 5๋ฒˆ์งธ์—์„œ 10๋ฒˆ์งธ epoch ๊ฐ„์— checkpoint๋ฅผ ์ˆ˜๋™์œผ๋กœ ์„ ํƒํ•˜์˜€๋‹ค. 

 

4. Human Evaluation

 ๋…ผ๋ฌธ์—์„œ๋Š” LIMA๋ฅผ SoTA ๋ชจ๋ธ๋“ค๊ณผ ๋น„๊ตํ•จ์œผ๋กœ์จ ํ‰๊ฐ€ํ•˜์˜€๊ณ , LIMA๊ฐ€ RLHF ๊ธฐ๋ฐ˜ DaVinci003๊ณผ 52,000๊ฐœ์˜ example์—์„œ ํ•™์Šต๋œ 65B Alpaca๋ฅผ ๋Šฅ๊ฐ€ํ•˜๋Š” ๋ชจ์Šต์„ ๋ณด์—ฌ์คฌ๋‹ค. ๊ทธ๋ฆฌ๊ณ  GPT-4์™€ ๋น„์Šทํ•˜๊ฑฐ๋‚˜ ๋” ๋‚˜์€ ์‘๋‹ต์„ ๋ณด์—ฌ์ฃผ๊ธฐ๋„ ํ•˜์˜€๋‹ค. LIMA์˜ ์ƒ์„ฑ์„ ๋ถ„์„ํ•ด ๋ณด๋ฉด 50%์˜ output์€ ํ›Œ๋ฅญ(excellent)ํ•˜๋‹ค๊ณ  ํŒ๋‹จํ•œ ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค. ๋ช‡ ๊ฐœ์˜ example์—์„œ์˜ ๊ฐ„๋‹จํ•œ fine-tuning์€ SoTA์™€ ๊ฒฝ์Ÿํ•˜๊ธฐ์— ์ถฉ๋ถ„ํ•˜๋‹ค๋Š” ์‚ฌ์‹ค์€ ์•ž์„œ ์„ธ์šด Superficial Alignment Hypothesis๋ฅผ ์ง€์ง€ํ•œ๋‹ค.

 

4-1. Experimental Setup

 

 To compare LIMA with other models, a single response is generated for each test prompt. Crowdworkers are then asked whether they prefer LIMA's output or the output of a given baseline, and this experiment is repeated for each baseline. A similar level of agreement was found when the crowdworkers were replaced with GPT-4.

 

Baselines.  LIMA was compared against 5 baselines: Alpaca-65B, DaVinci003, Bard, Claude, GPT-4.

 

Generation.  ๊ฐ prompt์— ๋Œ€ํ•ด nucleus sampling์„ ์‚ฌ์šฉํ•ด์„œ ๊ฐ baseline ๋ชจ๋ธ๋กœ๋ถ€ํ„ฐ ํ•˜๋‚˜์˜ ์‘๋‹ต์„ ์ƒ์„ฑ ํ•ด๋‚œ๋‹ค. repitition penalty๋ฅผ ์ ์šฉํ•˜๊ณ  maximum token length๋Š” 2,048๋กœ ์ œํ•œํ•˜์˜€๋‹ค.

 

Methodology.  ๊ฐ ์Šคํ…์—์„œ annotator์—๊ฒŒ ํ•˜๋‚˜์˜ prompt์™€ ๊ฐ€๋Šฅํ•œ 2๊ฐœ์˜ ์‘๋‹ต(์„œ๋กœ ๋‹ค๋ฅธ ๋ชจ๋ธ์—์„œ ์ƒ์„ฑ๋œ)์„ ๋ณด์—ฌ์ค€๋‹ค. ๊ทธ๋‹ค์Œ์— annotator๋Š” ๋‘ ์‘๋‹ต ์ค‘ ๋ฌด์—‡์ด ๋” ๋‚˜์€์ง€ ํ‰๊ฐ€ํ•œ๋‹ค.

 

4-2. Results

 

 ๊ทธ๋ฆผ 1(์™ผ์ชฝ)์€ human preference ์—ฐ๊ตฌ์˜ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์ฃผ๊ณ  ์žˆ๋Š” ๋ฐ˜๋ฉด์—, ๊ทธ๋ฆผ 1(์˜ค๋ฅธ์ชฝ)์€ GPT-4 preference์˜ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์ฃผ๊ณ  ์žˆ๋‹ค. ๊ฒฐ๊ตญ์— ์‚ฌ๋žŒ๊ณผ GPT-4 ๋‘˜ ๋‹ค ๋˜‘๊ฐ™์€ ์ถ”์„ธ๋ฅผ ๋ณด์ด๊ณ  ์žˆ๋‹ค. ๊ฐ ๋ชจ๋ธ๊ณผ์˜ ๋น„๊ต์— ๋Œ€ํ•œ ๊ฒฐ๊ณผ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

  • Alpaca-65B was trained on 52 times more data, yet LIMA's responses were still preferred 👍
  • DaVinci003 was trained with RLHF, considered the superior alignment method, yet LIMA still came out ahead 😲
  • Bard was preferred 42% of the time; conversely, this means LIMA's responses were as good as or better than Bard's 58% of the time 🙂
  • Claude & GPT-4 generally performed better than LIMA 📈

 

๊ทธ๋ฆผ 1. ์™ผ์ชฝ์€ human preference evaluation, ์˜ค๋ฅธ์ชฝ์€ GPT-4๋ฅผ ์‚ฌ์šฉํ•œ preference evaluation

 

4-3. Analysis

 

 LIMA์— ๋Œ€ํ•œ ํ‰๊ฐ€๋Š” SoTA baseline์— ๊ด€ํ•ด ํ‰๊ฐ€ํ•˜์˜€๋‹ค. ๊ทธ๋ ‡์ง€๋งŒ ์ด baseline๋“ค์€ training ์ค‘์— ์ˆ˜๋งŒ ๋ช…์˜ ์‹ค์ œ ์‚ฌ์šฉ์ž๊ฐ€ prompt์— ๋…ธ์ถœ๋ผ์„œ highly-tune ๋œ product์ด๋‹ค. ๋…ผ๋ฌธ์—์„œ๋Š” 50๊ฐœ์˜ ๋žœ๋ค ํ•œ example์„ ์ˆ˜๋™์œผ๋กœ ๋ถ„์„ํ•จ์œผ๋กœ์จ absolute ํ‰๊ฐ€ ์ฒด๊ณ„๋ฅผ ๋งŒ๋“ค์—ˆ๋‹ค.

 

  • Fail: ์‘๋‹ต์ด prompt์˜ ์š”๊ตฌ ์‚ฌํ•ญ์„ ๋งŒ์กฑํ•˜์ง€ ๋ชปํ•จ
  • Pass: ์‘๋‹ต์ด prompt์˜ ์š”๊ตฌ ์‚ฌํ•ญ์„ ๋งŒ์กฑํ•จ
  • Excellent: ๋ชจ๋ธ์ด prompt์— ๋Œ€ํ•ด ํ›Œ๋ฅญํ•œ ์‘๋‹ต ์ œ๊ณต

 

Results.  ๊ทธ๋ฆผ 3์€ LIMA ์‘๋‹ต์˜ 50% ์ •๋„๊ฐ€ ํ›Œ๋ฅญํ•˜๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ฃผ์—ˆ๊ณ , failure ์ผ€์ด์Šค์— ๋Œ€ํ•ด์„œ๋Š” ์–ด๋– ํ•œ ํŠธ๋ Œ๋“œ๋ฅผ ํ™•์ธํ•˜์ง€ ๋ชปํ–ˆ๋‹ค. ๊ทธ๋ฆผ 3์€ ์กฐ์–ธ์„ ํ•˜๊ฑฐ๋‚˜ ๋ ˆ์‹œํ”ผ๋ฅผ ๋งŒ๋“œ๋Š” ๊ฒƒ์— ๋Œ€ํ•œ LIMA์˜ example output์„ ๋ณด์—ฌ์ค€๋‹ค.

 

Out of Distribution.  How does LIMA perform on examples beyond its training distribution? Analyzing 20 out-of-distribution examples, 20% of responses were Fail, 35% Pass, and 45% Excellent. Although this experiment was run on a very small sample, LIMA achieved similar performance outside its training distribution, suggesting that it generalizes well. Figure 3 shows LIMA's reactions when asked to write stand-up comedy or order a pizza.

 

Safety.  ์ตœ์ข…์ ์œผ๋กœ, ๋…ผ๋ฌธ์—์„œ๋Š” safety ๊ด€๋ จ example์„ ์กฐ๊ธˆ๋งŒ ์คฌ๋Š”๋ฐ๋„ ๊ดœ์ฐฎ์€์ง€์— ๋Œ€ํ•ด์„œ ๋ถ„์„ํ•˜์˜€๋‹ค. ๊ทธ ๊ฒฐ๊ณผ, 30๊ฐœ์˜ ๋ฏผ๊ฐํ•œ prompt์— ๋Œ€ํ•ด์„œ LIMA๋Š” 80%์˜ prompt์—์„œ ์•ˆ์ „ํ•˜๊ฒŒ ์‘๋‹ตํ•˜์˜€๋‹ค. ๊ฒฝ์šฐ์— ๋”ฐ๋ผ LIMA๋Š” ์ž‘์—… ์ˆ˜ํ–‰์„ ์™„์ „ํžˆ ๊ฑฐ๋ถ€ํ•˜์ง€๋งŒ ์•…์˜์ ์ธ ์˜๋„๊ฐ€ ๋‚ดํฌ๋œ ๊ฒฝ์šฐ LIMA๋Š” ๊ทธ๋ฆผ 3์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋Š” ๊ฒƒ์ฒ˜๋Ÿผ ์•ˆ์ „ํ•˜์ง€ ์•Š์€ ์‘๋‹ต์„ ์ œ๊ณตํ•  ๊ฐ€๋Šฅ์„ฑ์ด ๋” ํฌ๋‹ค.

 

Figure 3. Model outputs from test prompts.

 

5. Why is Less More? Ablations

 ๋…ผ๋ฌธ์—์„œ๋Š” training data์˜ ๋‹ค์–‘์„ฑ, ํ€„๋ฆฌํ‹ฐ, ์–‘์˜ ํšจ๊ณผ๋ฅผ ablation์„ ํ†ตํ•ด ์กฐ์‚ฌํ•˜์˜€๋‹ค. ๊ทธ ๊ฒฐ๊ณผ alignment์˜ ๋ชฉ์ ์— ๋Œ€ํ•ด input ๋‹ค์–‘์„ฑ๊ณผ output ํ€„๋ฆฌํ‹ฐ๋ฅผ ๋Š˜๋ฆฌ๋Š” ๊ฒƒ์€ ์ƒ๋‹นํ•œ ๊ธ์ •์ ์ธ ํšจ๊ณผ๋ฅผ ๋ถˆ๋Ÿฌ์˜จ๋‹ค๋Š” ๊ฒƒ์„ ๊ด€์ฐฐํ•˜์˜€๋‹ค. ๋ฐ˜๋ฉด์— ์–‘๋งŒ ๋Š˜๋ฆฌ๋Š” ๊ฒƒ์€ ๋ณ„ ํšจ๊ณผ๊ฐ€ ์—†๋‹ค.

 

Diversity.  To test the effect of prompt diversity, the paper compared training on quality-filtered Stack Exchange data versus wikiHow data. Figure 4 shows that the more diverse Stack Exchange data yields significantly higher performance.

 

Quality.  ์‘๋‹ต ํ€„๋ฆฌํ‹ฐ์˜ ํšจ๊ณผ๋ฅผ ํ…Œ์ŠคํŠธํ•˜๊ธฐ ์œ„ํ•ด ์–ด๋– ํ•œ ํ•„ํ„ฐ๋ง๋„ ์—†๋Š” Stack Exchange๋กœ๋ถ€ํ„ฐ 2,000๊ฐœ์˜ example์„ ์ƒ˜ํ”Œ๋งํ•˜๊ณ  ์ด ๋ฐ์ดํ„ฐ์…‹๊ณผ ํ•„ํ„ฐ๋ง๋œ ๋ฐ์ดํ„ฐ์…‹์—์„œ ํ•™์Šต๋œ ๋‹ค๋ฅธ ๋ชจ๋ธ์„ ๋น„๊ตํ•˜์˜€๋‹ค. ๊ทธ ๊ฒฐ๊ณผ ๊ทธ๋ฆผ 4์—์„œ๋Š” 0.5% ์ •๋„์˜ ์ฐจ์ด๋ฅผ ๋ณด์—ฌ์คฌ๋‹ค. 

 

Figure 4. Performance of 7B models trained on 2,000 examples from different sources.

 

Qunatity.  example์˜ ์ˆ˜๋ฅผ ๋Š˜๋ฆฌ๋Š” ๊ฒƒ์€ ๋จธ์‹ ๋Ÿฌ๋‹ ์„ธํŒ…์—์„œ ์ž˜ ์•Œ๋ ค์ง„ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒํ•˜๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค. ์ด๋ฅผ ํ…Œ์ŠคํŠธํ•˜๊ธฐ ์œ„ํ•ด, Stack Exchange๋กœ๋ถ€ํ„ฐ ์ง€์ˆ˜์ ์œผ๋กœ ์ฆ๊ฐ€ํ•˜๋Š” training set๋ฅผ ์ƒ˜ํ”Œ๋งํ•˜์˜€๋‹ค. ๊ทธ๋ฆผ 5๋Š” training set๋ฅผ ๋”๋ธ”๋ง ํ•˜๋ฉด์„œ ๋Š˜๋ฆฌ๋Š” ๊ฒƒ์ด ์‘๋‹ต์˜ ํ€„๋ฆฌํ‹ฐ๋ฅผ ํ–ฅ์ƒ์‹œํ‚ค์ง€๋Š” ์•Š๋Š”๋‹ค๋Š” ๊ฒƒ์„ ๋ฐœ๊ฒฌํ•˜์˜€๋‹ค. ์ด ๊ฒฐ๊ณผ๋Š” alignment์˜ scaling law๋Š” ์–‘ ํ•˜๋‚˜์—๋งŒ ์˜ํ–ฅ์„ ๋ฐ›๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ prompt์˜ ๋‹ค์–‘์„ฑ์—๋„ ์˜ํ–ฅ์„ ๋ฐ›๋Š”๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค€๋‹ค. (high-quality ์‘๋‹ต์„ ์œ ์ง€ํ•˜๋Š” ํ•œ)

 

๊ทธ๋ฆผ 5. ์ง€์ˆ˜ํ•จ์ˆ˜์ ์œผ๋กœ ์ฆ๊ฐ€ํ•˜๋Š” ์–‘์˜ ๋ฐ์ดํ„ฐ์—์„œ ํ•™์Šต๋œ 7B ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ

 

6. Multi-Turn Dialogue

 ์˜ค์ง 1,000๊ฐœ์˜ single-turn ์ƒํ˜ธ์ž‘์šฉ์—์„œ fine-tune ๋œ ๋ชจ๋ธ์ด multi-turn dialogue์— ์‚ฌ์šฉ๋  ์ˆ˜ ์žˆ์„๊นŒ? ๋…ผ๋ฌธ์—์„œ๋Š” LIMA๋ฅผ 10๊ฐœ์˜ ๋ผ์ด๋ธŒ ๋Œ€ํ™”์—์„œ ํ…Œ์ŠคํŠธํ•˜์˜€๋‹ค. ์ด๋•Œ ๊ฐ ์‘๋‹ต์€ Fail, Pass, Excellent๋กœ ๋ผ๋ฒจ๋งํ•˜์˜€๋‹ค. LIMA์˜ ์‘๋‹ต์€ ๋Œ€ํ™”์˜ ์ด์ „ ๋‹จ๊ณ„์—์„œ ์ •๋ณด๋ฅผ ์ฐธ์กฐํ•˜๋Š” zero-shot ์ฑ—๋ด‡์— ๋Œ€ํ•ด ๋†€๋ผ์šธ ์ •๋„๋กœ ์ผ๊ด€์„ฑ์„ ๊ฐ€์ง„ ๋ชจ๋ธ์€ Out-Of-Distribution ์—์„œ๋„ ์ž‘๋™ํ•˜๋Š” ๊ฒƒ์ด ๋ถ„๋ช…ํ•˜๋‹ค. ๊ฒฐ๊ณผ์ ์œผ๋กœ 10๊ฐœ์˜ ๋Œ€ํ™” ์ค‘ 6๊ฐœ์—์„œ LIMA๋Š” 3๊ฐœ์˜ ์ƒํ˜ธ์ž‘์šฉ ๋‚ด์—์„œ prompt๋ฅผ ๋”ฐ๋ฅด์ง€ ์•Š์•˜๋‹ค.

 

 ๋Œ€ํ™” ๋Šฅ๋ ฅ ๊ฐœ์„ ์„ ์œ„ํ•ด 30๊ฐœ์˜ multi-turn dialogue chain์„ ๋ชจ์•˜๋‹ค. ์ด๋ ‡๊ฒŒ ํ•ด์„œ ์ด 1,030๊ฐœ์˜ example์—์„œ fine-tune๋œ ์ƒˆ๋กœ์šด ๋ฒ„์ „์˜ LIMA๋Š” zero-shot ๋ชจ๋ธ์—์„œ ๋˜‘๊ฐ™์ด ์‚ฌ์šฉ๋œ prompt์— ๊ธฐ๋ฐ˜ํ•ด์„œ 10๊ฐœ์˜ ๋ผ์ด๋ธŒ ๋Œ€ํ™”๋ฅผ ์ง„ํ–‰ํ•˜์˜€๋‹ค. ๊ทธ๋ฆผ 7์€ ์ด๋Ÿฌํ•œ dialogue์— ๋Œ€ํ•œ ์˜ˆ์™ธ๋ฅผ ๋ณด์—ฌ์ค€๋‹ค.

 

 ๊ทธ๋ฆผ 6์€ ์‘๋‹ต ํ€„๋ฆฌํ‹ฐ์˜ ๋ถ„ํฌ๋ฅผ ๋ณด์—ฌ์ค€๋‹ค. ๋Œ€ํ™” ๋ฐ์ดํ„ฐ๋ฅผ ์ถ”๊ฐ€ํ•˜๋Š” ๊ฒƒ์ด ์ƒ์„ฑ ํ€„๋ฆฌํ‹ฐ๋ฅผ ์ƒ๋‹นํžˆ ๊ฐœ์„ ์‹œํ‚จ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์คฌ๋‹ค. ๊ฒŒ๋‹ค๊ฐ€ failure rate๋„ zero-shot์˜ ๊ฒฝ์šฐ์—๋Š” 42 ํ„ด ๋‹น 15๋ฒˆ์˜ ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ•˜์˜€์ง€๋งŒ, fine-tuned์˜ ๊ฒฝ์šฐ์—๋Š” 46 ํ„ด ๋‹น 1๋ฒˆ ์ •๋„์˜ ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ•˜์˜€๋‹ค. fine-tuned ๋ชจ๋ธ์€ 10๊ฐœ ์ค‘ 7๊ฐœ์˜ ๋Œ€ํ™”์—์„œ ์ƒ๋‹นํžˆ ๋‚˜์€ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์คฌ๊ณ , 3๊ฐœ์—์„œ๋Š” zero-shot๊ณผ ํƒ€์ด๋ฅผ ์ด๋ค˜๋‹ค. ๋‹จ 30๊ฐœ์˜ example์—์„œ ์ด๋Ÿฌํ•œ ๋Šฅ๋ ฅ์˜ ๋„์•ฝ๊ณผ zero-shot ๋ชจ๋ธ์ด ๋Œ€ํ™”ํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์‚ฌ์‹ค์€ ์ด๋Ÿฐ ๋Šฅ๋ ฅ์ด pre-training ์ค‘์— ํ•™์Šต๋˜๊ณ  limited supervision์„ ํ†ตํ•ด ํ˜ธ์ถœ๋  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฐ€์„ค์„ ๊ฐ•ํ™”ํ•œ๋‹ค.

 

Figure 6. Analysis of dialogue turns.

 

๊ทธ๋ฆผ 7. 30๊ฐœ์˜ dialogue example์„ ์‚ฌ์šฉํ•  ๋•Œ์™€ ์‚ฌ์šฉํ•˜์ง€ ์•Š์„ ๋•Œ์˜ LIMA์˜ example dialogue

 

7. Discussion

 ๋…ผ๋ฌธ์—์„œ๋Š” 1,000๊ฐœ์˜ ์‹ ์ค‘ํ•˜๊ฒŒ ์—„์„ ๋œ example์—์„œ ๊ฐ•๋ ฅํ•œ PLM์„ fine-tune ํ•˜๋ฉด ๊ด‘๋ฒ”์œ„ํ•œ prompt์—์„œ ๋ˆˆ์— ๋„๊ณ , ๊ฒฝ์Ÿ๋ ฅ ์žˆ๋Š” ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์คฌ๋‹ค. ํ•˜์ง€๋งŒ ์ด๋Ÿฌํ•œ example์„ ์ž‘์„ฑํ•˜๋Š” ๋ฐ๋Š” ์ƒ๋‹นํ•œ mental effort๊ฐ€ ํ•„์š”ํ•ด์„œ scale up์ด ํž˜๋“ค๋‹ค. ๊ทธ๋ฆฌ๊ณ  LIMA๋Š” product-grade ๋ชจ๋ธ๋งŒํผ robust ํ•˜์ง€ ์•Š๋‹ค. ๋ฐ˜๋ฉด์— LIMA๋Š” ์ผ๋ฐ˜์ ์œผ๋กœ ์ข‹์€ ์‘๋‹ต์„ ๋ณด์—ฌ์ฃผ๋Š”๋ฐ, ๋””์ฝ”๋”ฉ ๋˜๋Š” ๊ณต๊ฒฉ์  prompt ์ค‘์— unlucky sample๋Š” ์•ฝํ•œ ์‘๋‹ต์„ ์ด๋Œ ์ˆ˜ ์žˆ๋‹ค. ์ฆ‰, ์ด ๋…ผ๋ฌธ์—์„œ ์ œ์‹œ๋œ ์ฆ๊ฑฐ๋Š” ๊ฐ„๋‹จํ•œ ์ ‘๊ทผ ๋ฐฉ์‹์œผ๋กœ alignment์˜ ๋ณต์žกํ•œ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ๋Š” ๊ฐ€๋Šฅ์„ฑ์„ ๋ณด์—ฌ์ค€๋‹ค.

 

 

 

 

Source

https://arxiv.org/abs/2305.11206

 


 
