
๐ŸฌOrca: Progressive Learning from Complex Explanation Traces of GPT-4 ๋…ผ๋ฌธ ๋ฆฌ๋ทฐ

2023. 6. 23. 13:53

Overview of this paper

 ์ตœ๊ทผ์˜ ์—ฐ๊ตฌ๋“ค์€ smaller model์˜ ์—ญ๋Ÿ‰์„ ํ–ฅ์ƒ์‹œํ‚ค๊ธฐ ์œ„ํ•ด imitation learning์„ ํ†ตํ•ด large foundation models(LFM)์— ์˜ํ•ด ์ƒ์„ฑ๋œ output๊ณผ ํ•จ๊ป˜ ํ–ฅ์ƒ์‹œํ‚ค๊ณ ์ž ํ•˜์˜€๋‹ค. ํ•˜์ง€๋งŒ ์—ฌ๊ธฐ์—๋Š” ์—ฌ๋Ÿฌ ๊ฐ€์ง€ ๋ฌธ์ œ์ ๋“ค์ด ์กด์žฌํ•œ๋‹ค. ์ด๋Ÿฌํ•œ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด Orca๋ฅผ ์†Œ๊ฐœํ•˜์˜€๋‹ค. Orca๋Š” LFM์˜ ์ถ”๋ก  ํ”„๋กœ์„ธ์Šค๋ฅผ ๋ชจ๋ฐฉํ•˜๊ธฐ ์œ„ํ•ด ํ•™์Šตํ•˜๋Š” 13B ๋ชจ๋ธ์ด๋‹ค.

 

Orca learns from rich signals from GPT-4, including explanation traces (step-by-step reasoning processes), and from other complex instructions guided by ChatGPT acting as a teacher assistant. To support this progressive learning process, the authors leverage large-scale and diverse imitation data with careful sampling and selection. Trained this way, Orca outperforms existing SoTA models! 🫢

 

 

Table of Contents

1. Introduction

2. Explanation Tuning

3. Experiment Setup

4. Evaluation for Open-ended Generation

5. Evaluation for Reasoning

6. Limitations

 

 

1. Introduction

 ๋ชจ๋ธ ์ž์‹ ์„ ์‚ฌ์šฉํ•ด์„œ ๋‹ค๋ฅธ AI ๋ชจ๋ธ์˜ ํŠน์„ฑ์„ supervise ํ•  ์ˆ˜ ์žˆ์„๊นŒ? ์ด์ „ ์—ฐ๊ตฌ์—์„œ๋Š” ์ดˆ๊ธฐ ๋ชจ๋ธ์˜ ์ถœ๋ ฅ์„ ์ƒ˜ํ”Œ๋งํ•˜๊ณ  ์ˆ˜์ •๋ณธ์„ ์ƒ์„ฑํ•œ ๋‹ค์Œ ์ด๋Ÿฌํ•œ ์ˆ˜์ •๋œ ์‘๋‹ต์„ ๊ธฐ๋ฐ˜์œผ๋กœ ์›๋ž˜ ๋ชจ๋ธ์„ fine-tune ํ•จ์œผ๋กœ์จ ๋ชจ๋ธ ๋™์ž‘์„ ๋ณด๋‹ค ํšจ๊ณผ์ ์œผ๋กœ ์ œ์–ดํ•  ์ˆ˜ ์žˆ๊ณ  ์‚ฌ๋žŒ์˜ ๋ผ๋ฒจ์„ ํ›จ์”ฌ ์ ๊ฒŒ ์‚ฌ์šฉํ•˜์—ฌ ๋ณด๋‹ค harmless ํ•˜๊ฒŒ ๋งŒ๋“ค ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ฃผ์—ˆ๋‹ค.

 

 ์ตœ๊ทผ์— ChatGPT์™€ GPT-4 ๊ฐ™์€ LFM์„ ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ์…‹์„ ์ƒ์„ฑํ•˜๊ธฐ ์œ„ํ•œ teacher๋กœ ์‚ฌ์šฉํ•˜๋ ค๋Š” ์—ฐ๊ตฌ์— ๋Œ€ํ•œ ์œ ์ž…์ด ๋Š˜์—ˆ๋‹ค. ์ด๋ ‡๊ฒŒ ๋งŒ๋“ค์–ด์ง„ ๋ชจ๋ธ๋“ค์€ teacher ๋ชจ๋ธ์˜ ์Šคํƒ€์ผ์€ ๋”ฐ๋ผ๊ฐˆ ์ˆ˜ ์žˆ์ง€๋งŒ, ์ด๋Ÿฌํ•œ ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•  ๊ฒฝ์šฐ ์ถ”๋ก ๊ณผ ์š”์•ฝ์—์„œ ๋–จ์–ด์ง€๋Š” ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์คฌ๋‹ค.

 

 ์ด๋ ‡๊ฒŒ LFM์„ teacher model๋กœ ์‚ฌ์šฉํ•ด์„œ ํ•™์Šต๋œ 13B instruction-tuned model์—๋Š” Vicuna๊ฐ€ ์žˆ๋Š”๋ฐ, Vicuna๋Š” OpenLLM & ChatArena ๋ฆฌ๋”๋ณด๋“œ์—์„œ์˜ ์„ฑ๋Šฅ์— ์˜ํ•ด ์ตœ๊ณ ์˜ ๋ชจ๋ธ ์ค‘ ํ•˜๋‚˜๋กœ ๊ฐ„์ฃผ๋˜๊ณ  ์žˆ๋‹ค.

 

Figure 1. When rated by GPT-4 on the Vicuna evaluation set, Orca outperforms a variety of foundation models, including ChatGPT

 

 ๊ทธ๋ฆผ 1์—์„œ ๋ฌ˜์‚ฌ๋˜์–ด ์žˆ๋Š” ๊ฒƒ์ฒ˜๋Ÿผ, GPT-4๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ํ‰๊ฐ€ method์—์„œ Vicuna๋Š” ChatGPT์˜ 92% ์ •๋„์— ํ•ด๋‹นํ•˜๋Š” ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ค€๋‹ค. ๐Ÿ˜ฒ ๊ทธ๋Ÿฌ๋‚˜ human label์— ๋Œ€ํ•œ ์ถ”๋ก  ๋ฒค์น˜๋งˆํฌ์— ๋Œ€ํ•œ ๋ณด๋‹ค ์„ธ์‹ฌํ•œ ํ‰๊ฐ€๋Š” Vicuna๊ฐ€ ์ „๋ฌธ ๋ฐ ํ•™์—… ์‹œํ—˜์—์„œ ChatGPT ํ’ˆ์งˆ์˜ 64%๋งŒ ์œ ์ง€ํ•œ๋‹ค๋Š” ๊ฒƒ์„ ๋ฐœ๊ฒฌํ•˜์˜€๋‹ค(๊ทธ๋ฆผ 2). ๐Ÿ˜… ๊ทธ๋ฆฌ๊ณ  BigBench-Hard ๊ฐ™์€ ๋ณต์žกํ•œ ๋ฒค์น˜๋งˆํฌ์—์„œ ์˜ค์ง ChatGPT ํ’ˆ์งˆ์˜ 48%๋งŒ ์œ ์ง€ํ•˜์˜€๋‹ค(๊ทธ๋ฆผ 3). ๐Ÿ˜“  ์ด๋Ÿฌํ•œ ํ‰๊ฐ€์˜ ๋ถˆ์ผ์น˜๋Š” smaller model์— ๋Œ€ํ•ด ํ‰๊ฐ€ํ•˜๋Š” ๊ธฐ์กด ํ‰๊ฐ€ ํ”„๋กœํ† ์ฝœ์˜ ํ•œ๊ณ„์ ์„ ๋ณด์—ฌ์ค„ ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ, ์ด๋“ค์˜ ์ถ”๋ก  ๋ฐ ์ดํ•ด ๋Šฅ๋ ฅ์—์„œ ์ƒ๋‹นํ•œ ๋ถ€์กฑ์„ ๋ฐํ˜”๋‹ค. ์ด ๋…ผ๋ฌธ์—์„œ๋Š” ์ด๋Ÿฌํ•œ ๋ฌธ์ œ์— ๋Œ€ํ•ด ๋…ผ์˜ํ•˜๊ณ , ์ด ๊ฐญ์„ ์ค„์ด๊ณ ์ž ํ•˜์˜€๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์ด๋ฅผ ํ•ด๊ฒฐํ•  ์ „๋žต๋„ ์ œ์‹œํ•˜์˜€๋‹ค.

 

Figure 2. Orca, trained with explanation tuning, narrows the gap with LFMs on a variety of professional and academic exams

 

Figure 3. On complex zero-shot reasoning tasks in BigBench-Hard, Orca achieves parity with ChatGPT

 

1-1. Challenges with Existing Methods

 

 ๊ธฐ์กด์˜ LFM์˜ output์„ ํ‰๋‚ด๋‚ด๊ธฐ ์œ„ํ•œ instruction-tuning์— ๋Œ€ํ•œ ์—ฐ๊ตฌ๋Š” task ๋‹ค์–‘์„ฑ, ์ฟผ๋ฆฌ ๋ณต์žก์„ฑ, ๋ฐ์ดํ„ฐ scaling์—์„œ ํ•œ๊ณ„์ ์„ ๊ฐ€์ง€๊ณ  ์žˆ์—ˆ๋‹ค. ์ด๋Ÿฌํ•œ ํ•œ๊ณ„์ ์„ ์ •๋ฆฌํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค:

 

  • Simple instructions with limited diversity: Queries generated with Self-Instruct are limited in diversity and complexity. To improve on Self-Instruct, WizardLM's Evol-Instruct evolves instructions, and Vicuna & Koala rely on the more human-like conversations and natural instructions of ShareGPT.
  • Task diversity and data scaling: In ShareGPT data, tasks such as creative content generation and information-seeking queries are over-represented relative to others. Models trained on such natural conversation data capture the LFM's style but not its reasoning process; Vicuna's performance in Figures 2 and 3 illustrates this. Table 1 gives an overview of the data sizes and tuning methods.
  • Limited imitation signals: Existing methods rely on imitation learning from <query, response> pairs generated by the teacher model, which provides only a very limited signal for tracing the teacher's reasoning process.
  • Evaluation: Evaluation protocols are lacking. Even the Vicuna Evaluation, the most widely used and accepted metric, has many problems: auto-evaluation overestimates the capability of smaller models relative to LFMs, and prior metrics fail to reflect their weak summarization and reasoning skills.

 

Table 1. Overview of popular models instruction-tuned with LFMs

 

 

1-2. Key Contributions

 

 ์ด ์—ฐ๊ตฌ์˜ ๋ชฉํ‘œ๋Š” ์•ž์„œ ์–ธ๊ธ‰ํ•œ challenge๋“ค์„ ํ•ด๊ฒฐํ•˜๋Š” ๊ฒƒ์ด๋‹ค:

 

  • Explanation Tuning: <query, response> pairs are augmented with detailed responses from GPT-4 that explain the teacher's reasoning process. This gives the student model the opportunity to imitate the LFM's thought process.
  • Scaling tasks and instructions: The Flan 2022 collection (FLAN-v2) is used to build a richer and more diverse training set. Instruction data from this collection is used to construct complex prompts, which are then used to query the LFM, yielding an even richer and more varied training set.
  • Evaluation: The paper evaluates Orca's generation, reasoning, and summarization abilities in several settings, and conducts case studies comparing Orca's generation and reasoning abilities with those of LFMs such as ChatGPT, GPT-4, and Vicuna:
    1. Auto-evaluation with GPT-4 on the evaluation sets of the Vicuna, Awesome, and WizardLM prompt collections
    2. Academic benchmarks: Big-Bench Hard, TruthfulQA
    3. Professional & academic exams: SAT, LSAT, GRE, GMAT from AGIEval
    4. Safety evaluation

 

 

2. Explanation Tuning

 ๊ธฐ์กด ์—ฐ๊ตฌ์˜ ๋ฌธ์ œ์  ํ•ด๊ฒฐ์„ ์œ„ํ•ด ๋ณต์žกํ•œ instruction & ํ’๋ถ€ํ•œ signal๊ณผ ํ•จ๊ป˜ augment๋œ ๋‹ค์–‘ํ•œ task๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋Œ€๊ทœ๋ชจ training data๋ฅผ ํ™œ์šฉํ•˜์˜€๋‹ค.

 

2-1. Dataset Construction

 

 ๋…ผ๋ฌธ์˜ training data๋Š” ๋‹ค์Œ์˜ 3๊ฐœ์˜ instance๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ๋‹ค: <System message, User query, LFM response>. ๊ฐ๊ฐ์ด ์˜๋ฏธํ•˜๋Š” ๋ฐ”๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

  • System Message: provides essential context, guidelines, and other relevant details
  • User Query: defines the actual task we want the LFM to perform
  • LFM Response: responses from ChatGPT were collected for 5M user queries from FLAN-v2, and responses from GPT-4 for 1M of them
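
To make the triple concrete, here is a minimal sketch of what a single training instance might look like, assuming a simple dict layout and a hypothetical prompt template (the values and the template are illustrative, not taken from the paper):

```python
# A minimal illustrative sketch (not the paper's code) of one Explanation Tuning instance.
# The field values below are made up for illustration.
instance = {
    "system_message": "You are a helpful assistant. Think step-by-step and justify your answer.",
    "user_query": "James earns $20 per hour and works 8 hours a day. How much does he earn in a 5-day week?",
    "lfm_response": "Step 1: Daily pay is 20 * 8 = $160. Step 2: Weekly pay is 160 * 5 = $800. Answer: $800.",
}

# For training, the triple is flattened into a single prompt/target pair;
# the exact template below is an assumption, not the released format.
prompt = (
    f"### System:\n{instance['system_message']}\n"
    f"### Human:\n{instance['user_query']}\n"
    f"### Assistant:\n"
)
target = instance["lfm_response"]
print(prompt + target)
```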

 

System Messages.  Sixteen system messages were hand-crafted to elicit different kinds of responses from the LFM. These system messages enable Orca to produce outputs such as:

 

  • long & short answers
  • guideline, instruction, fomat ์ค€์ˆ˜
  • ์ •๋ณด ๊ฒ€์ƒ‰ ์ฟผ๋ฆฌ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ์ฐฝ์˜์  ์ฝ˜ํ…์ธ ๋„ ์ƒ์„ฑ
  • ์„ค๋ช… ์ƒ์„ฑ & step-by-step ์ถ”๋ก  ์ƒ์„ฑ

 

 ๋…ผ๋ฌธ์—์„œ๋Š” FLAN-v2 ๋ชจ์Œ์˜ ์„œ๋กœ ๋‹ค๋ฅธ subcollection์— ๋Œ€ํ•ด ์„œ๋กœ ๋‹ค๋ฅธ system message๋ฅผ ๋งŒ๋“ค์—ˆ๋‹ค. ํ‘œ 2์— ์ด system message์˜ ์˜ˆ๊ฐ€ ๋ณด์ด๊ณ  ์žˆ๋‹ค. ๊ทธ๋ฆผ 6์€ system message์˜ ๋ถ„ํฌ๋ฅผ ๋ณด์—ฌ์ค€๋‹ค. 

 

Table 2. Collection of system instructions

 

๊ทธ๋ฆผ 4. training data์˜ ์„œ๋กœ ๋‹ค๋ฅธ ๋ชจ์Œ์—์„œ system message์˜ ์ƒ๋Œ€์  ๋นˆ๋„

 

Dataset Description and Sampling from the FLAN-v2 Collection.  FLAN-v2 consists of five sub-collections: CoT, NiV2, T0, Flan 2021, and Dialogue. To train Orca, only zero-shot queries were sampled.

 

ํ‘œ 3.&nbsp;training data์˜ ๊ตฌ์„ฑ

 

ChatGPT as Teaching Assistant.  The authors generated 5 million instructions, referred to as FLAN-5M. In addition, 1 million queries were randomly sampled from FLAN-5M to form FLAN-1M. Responses were then collected from ChatGPT for FLAN-5M and from GPT-4 for FLAN-1M.
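
As a rough, hedged sketch of what this response collection could look like with the legacy OpenAI Python client (the helper name, model identifiers, and the omitted batching/rate-limit handling are assumptions, not the authors' pipeline):

```python
import openai  # assumes the legacy (pre-1.0) OpenAI Python client

def collect_response(system_message: str, user_query: str, model: str) -> str:
    """Send one <system message, user query> pair to a teacher LFM and return its reply."""
    completion = openai.ChatCompletion.create(
        model=model,
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_query},
        ],
    )
    return completion.choices[0].message["content"]

# FLAN-5M queries -> ChatGPT responses; the FLAN-1M random subset -> GPT-4 responses.
# chatgpt_reply = collect_response(sys_msg, query, model="gpt-3.5-turbo")
# gpt4_reply    = collect_response(sys_msg, query, model="gpt-4")
```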

 

 ์ด๋ ‡๊ฒŒ ๋งŒ๋“ค์–ด์ง„ ๋ฐ์ดํ„ฐ์…‹์— ๋Œ€ํ•ด Orca๋ฅผ ๋จผ์ € FLAN-5M์—์„œ ํ•™์Šต์‹œํ‚ค๊ณ , ๋’ค๋”ฐ๋ผ์„œ FLAN-1M์—์„œ ํ•™์Šต์‹œ์ผฐ๋‹ค. ChatGPT๋ฅผ ์ค‘๊ฐ„ teacher๋กœ ์‚ฌ์šฉํ•˜๋Š” ์ด์œ ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค:

 

  • Capacity Gap: Leveraging an intermediate teacher with a smaller capability gap improves imitation learning for the smaller student in knowledge distillation; in a word, progressive learning.
  • Cost & Time: Large-scale data collection through the OpenAI API is rate-limited, and GPT-4 in particular takes much longer than ChatGPT, so 5x more data was collected from ChatGPT.

 

 Figure 5 shows the distribution of response lengths from ChatGPT and GPT-4 for the different system messages. Looking at the results, GPT-4 generates responses roughly 1.5x longer than ChatGPT, which is exactly what makes progressive learning possible!

 

Figure 5. Comparison of response-length distributions of GPT-4 and ChatGPT for different system messages

 

2-2. Training

 

 ์ด ์„น์…˜์—์„œ๋Š” Orac์— ๋Œ€ํ•œ ํ•™์Šต ํ”„๋กœ์„ธ์Šค์˜ ๊ฐœ์š”๋ฅผ ์ œ๊ณตํ•ด ์ค€๋‹ค.

 

  • Tokenization: LLaMA's Byte Pair Encoding (BPE) tokenizer is used
  • Packing: to optimize the training process and use the available compute efficiently, packing is applied, i.e., multiple input examples are concatenated into a single sequence
  • Loss: the loss is computed only on the tokens generated by the teacher, ensuring the model focuses on the most relevant and informative tokens (see the sketch after this list)
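
A minimal sketch of the packing and loss-masking idea, assuming the usual Hugging-Face-style convention of setting labels to -100 on prompt positions so that only teacher-generated tokens contribute to the cross-entropy (illustrative, not the released training code):

```python
import torch

IGNORE_INDEX = -100  # positions labeled -100 are skipped by the cross-entropy loss

def build_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Mask the prompt (system message + user query) so loss covers only the teacher response."""
    labels = input_ids.clone()
    labels[:prompt_len] = IGNORE_INDEX
    return labels

def pack_examples(examples, max_len: int):
    """Greedily concatenate tokenized examples into sequences of at most max_len tokens."""
    packed, current = [], []
    for ids in examples:              # each `ids` is a list of token ids for one example
        if current and len(current) + len(ids) > max_len:
            packed.append(current)
            current = []
        current.extend(ids)
    if current:
        packed.append(current)
    return packed
```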

 

3. Experiment Setup

3-1. Baselines

 

 Orca is compared with text-davinci-003, ChatGPT, GPT-4, and Vicuna.

 

3-2. Tasks

 

 ๋…ผ๋ฌธ์—์„œ๋Š” open-ended generation๊ณผ ๋ณต์žกํ•œ ์ถ”๋ก  task๋ฅผ ์ถ”๋ก ํ•˜๊ณ  ์ดํ•ดํ•˜๊ธฐ ์œ„ํ•œ ๋Šฅ๋ ฅ์  ์ธก๋ฉด์—์„œ Orca์˜ ๋Šฅ๋ ฅ์„ ํ‰๊ฐ€ํ•˜๋Š” ๋ฐ ์‚ฌ์šฉํ•˜์˜€๋‹ค. ํ‘œ 4๋Š” ํ‰๊ฐ€์— ์‚ฌ์šฉ๋œ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ์…‹์˜ ํ†ต๊ณ„๋ฅผ ๋ณด์—ฌ์ค€๋‹ค.

 

ํ‘œ 4. Orca ํ‰๊ฐ€ ๋ฒค์น˜๋งˆํฌ

 

Open-ended Generation Capabilities.  Experiments were run on three different prompt collections (Vicuna, Awesome, WizardLM) in the same setup as the Vicuna Evaluation.

 

Reasoning Capabilities.  Two benchmarks were used to evaluate reasoning ability:

 

  • AGIEval: LM์„ human ์ค‘์‹ฌ ์‹œํ—˜์—์„œ ํ‰๊ฐ€ํ•จ(eg. GRE, GMAT, SAT etc.)
  • BIg-Bench Hard(BBH): 23๊ฐœ์˜ ์–ด๋ ค์šด BIG-Bench tasks. LLM์˜ ๋Šฅ๋ ฅ๊ณผ ํ•œ๊ณ„๋ฅผ ์ธก์ •ํ•˜๊ธฐ ์œ„ํ•ด ๊ณ ์•ˆ๋˜์—ˆ์Œ. ์ด task๋“ค์€ ๋ชจ๋‘ ์ด์ „ LM๋“ค์ด ํ‰๊ท  human-rater๋ฅผ ๋Šฅ๊ฐ€ํ•˜์ง€ ๋ชปํ•˜๋Š” task๋“ค๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ์Œ.

 

 Orca's reasoning ability was evaluated in a zero-shot setting, without any exemplars or CoT.

 

4. Evaluation for Open-ended Generation

 Table 5 shows the performance of the candidate models with ChatGPT and GPT-4 as reference models, using GPT-4 as the rater. The results are as follows.

 

  • Orca reaches about 95% of ChatGPT's quality and about 85% of GPT-4's quality, roughly a 10-point improvement over Vicuna.
  • On the Vicuna Evaluation set, Orca performs on par with ChatGPT.
  • Orca shows strong performance on prompts spanning a wide range of generation roles, especially on the Awesome prompts dataset.

 

Table 5. Results of the Vicuna-style evaluation using several prompt collections

 

Replication Note: It was observed that GPT-4-based evaluation has a positive bias toward the response presented first.
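
Because of this first-position bias, a common mitigation (an assumption on my part, not something the paper prescribes) is to judge each pair twice with the presentation order swapped and average the two scores. A minimal sketch with a hypothetical `judge` callable:

```python
def debiased_candidate_score(judge, reference_answer: str, candidate_answer: str) -> float:
    """Average the candidate's score over both presentation orders to cancel first-position bias.

    `judge(first, second)` is assumed to return (score_for_first, score_for_second),
    e.g. GPT-4 rating both answers out of 10 as in the Vicuna evaluation protocol.
    """
    cand_first, _ = judge(candidate_answer, reference_answer)   # candidate shown first
    _, cand_second = judge(reference_answer, candidate_answer)  # candidate shown second
    return (cand_first + cand_second) / 2
```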

 

 

5. Evaluation for Reasoning

5-1. AGIEval Results

 

 ํ‘œ 6์€ Orca์™€ ๋‹ค๋ฅธ baseline ๋ชจ๋ธ ๊ฐ„์˜ AGIEval ๋ฒค์น˜๋งˆํฌ์—์„œ zero-shot ์„ฑ๋Šฅ ๋น„๊ต๋ฅผ ์ง„ํ–‰ํ•˜์˜€๋‹ค. ํ‰๊ฐ€ ์…‹์—…์€ AGIEval๊ณผ ๋˜‘๊ฐ™์•˜๊ณ , accuracy metric์„ ์‚ฌ์šฉํ•˜์—ฌ ์ธก์ •๋˜์—ˆ๋‹ค. ๊ฒฐ๊ณผ๋ฅผ ๋ถ„์„ํ•ด ๋ณด๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

  • Orca performs on par with text-davinci-003 and reaches 86% of ChatGPT, but falls significantly behind GPT-4.
  • On math-related tasks, Orca trails text-davinci-003 by about 5 points and shows a large gap to ChatGPT.
  • Orca is considerably stronger than Vicuna.
  • GPT-4 outperforms all other models, but every model still scores far below humans.
  • Orca's performance varies considerably depending on the type of system message (Table 7).

 

Table 6. Zero-shot performance comparison of text-davinci-003, ChatGPT, GPT-4, Vicuna, and Orca on the AGIEval benchmark

 

ํ‘œ 7. AGIEval ๋ฒค์น˜๋งˆํฌ์—์„œ ์„œ๋กœ ๋‹ค๋ฅธ system message๋ฅผ ์‚ฌ์šฉํ•œ Orca์˜ zero-shot ์„ฑ๋Šฅ ๋น„๊ต

 

Scaling and Teacher Assistance.  The effect of progressive learning was analyzed by comparing Orca trained only on FLAN-1M with Orca trained on FLAN-5M followed by FLAN-1M. The result: performance improved by 4.5 points. The results are shown in Table 8.

 

ํ‘œ 8. intermediate teacher์˜ ํšจ๊ณผ. FLAN-5M๊ณผ FLAN-1M์„ ๊ฐ™์ด ์‚ฌ์šฉํ–ˆ์„ ๋•Œ ์„ฑ๋Šฅ์ด ๋”์šฑ ์ข‹์•˜์Œ.

 

100๊ฐœ์˜ ๋žœ๋ค ChatGPT-beats-Orca & Orac-beats-ChatGPT ์ƒ˜ํ”Œ์˜ ๋ถ„์„

 

  • Domain Knowledge: Both models perform poorly on problems requiring specialized domain knowledge, such as a question about Tesla batteries.
  • Complex Reasoning: Both models struggle with complex reasoning.
  • Long Context: ChatGPT is better than Orca at modeling long context.
  • Geometric Reasoning: Each model is somewhat weaker at geometric reasoning, and there is a gap in geometric reasoning ability between the two models.
  • LaTeX Reasoning: Both fall short on reasoning over LaTeX-formatted input.

 

๊ทธ๋ฆผ 6. AGIEval ๋ฒค์น˜๋งˆํฌ์—์„œ Orca, ChatGPT, GPT-4์˜ ์„ฑ๋Šฅ

 

 

5-2. Big-Bench Hard Results

 

 Table 9 shows the zero-shot performance comparison of Orca and the baseline models on Big-Bench Hard with standard zero-shot prompting. Aggregated across all tasks, Orca performs marginally better than ChatGPT, falls well behind GPT-4, and considerably outperforms Vicuna.

 

 In contrast, Orca performs far below GPT-4, which, as other work has shown, is partly because GPT-4 benefits from data contamination on Big-Bench.

 

ํ‘œ 9. Big-Bench Hard์—์„œ Orca, Vicuna, ChatGPT, GPT-4์˜ zero-shot ์„ฑ๋Šฅ ๋น„๊ต

 

 Since Orca and ChatGPT show similar overall performance, the per-task differences between Orca and ChatGPT were examined in more detail:

 

  • Entailment & Semantic Understanding: Orca๋Š” entailment(Formal Fallacies)์™€ ๊ตฌ๋ฌธ ์ดํ•ด(Disambiguation QA & Snarks)๋ฅผ ๋” ์ž˜ํ•œ๋‹ค.
  • Temporal & Spatial Reasoning: Orca๋Š” ์‹œ๊ฐ„(Temporal rSequences), ๊ณต๊ฐ„์ (Navigate), ์ƒ‰๊น” ๊ธฐ๋ฐ˜(Colored Objects) ์ถ”๋ก ์—์„œ ๋” ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์คฌ๋‹ค.
  • Casual Judgement: casual judgement task์—์„œ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์คฌ๋‹ค. Orca๋Š” ChatGPT๋ณด๋‹ค 4.7% ๋” ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์คฌ๊ณ , ์ด๋Š” GPT-4์™€ ๋™๋“ฑํ•œ ์„ฑ๋Šฅ์ด๋‹ค!
  • Multilingual Understanding: Salient Translation Error Detection์—์„œ Orca์™€ ChatGPT๋Š” ์Œ์„ ์ด๋ฃจ๋Š” ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์คฌ๋‹ค.
  • World Knowledge: world knowledge๋ฅผ ํ•„์š”๋กœ ํ•˜๋Š” task(Sports Understanding, Ruin Names)์—์„œ Orca๋Š” ChatGPT ๋ณด๋‹ค ๋–จ์–ด์ง€๋Š” ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์คฌ์œผ๋‚˜, ์˜ํ™” ์ถ”์ฒœ์—์„œ๋Š” ๋” ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์คฌ๋‹ค. ์ด๋Š” Orca๊ฐ€ ChatGPT์— ๋น„ํ•ด ์ถฉ๋ถ„ํ•œ ์ง€์‹์ด ์—†์—ˆ๊ธฐ ๋•Œ๋ฌธ์ด๋ผ๊ณ  ๋ณธ๋‹ค.
  • Logical & Geometric Reasoning: ChatGPT๋Š” Orca์™€ ๋น„๊ตํ•ด์„œ ์šฐ์›”ํ•œ ๋…ผ๋ฆฌ์  ์ถ”๋ก ์„ ๋ณด์—ฌ์ค€๋‹ค. 
  • Table Understanding: ChatGPT๋Š” Orca์™€ ๋น„๊ตํ•ด์„œ ๋” ๋‚˜์€ ํ‘œ ์ดํ•ด & ์ถ”๋ก  ๋Šฅ๋ ฅ(Penguins in a Table)์„ ๊ฐ€์ง„๋‹ค. 

 

๊ทธ๋ฆผ 7. Big-Bench์—์„œ ์„œ๋กœ ๋‹ค๋ฅธ task์— ๋Œ€ํ•œ GPT-4, ChatGPT, Orca์˜ ์„ฑ๋Šฅ

 

6. Limitations

 Because Orca is built on the LLaMA model family, it not only carries many of LLaMA's constraints but also shares the general limitations of other LLMs.

 

  • Data Biases: The model may inadvertently carry over biases in the source data, and as a result can produce potentially biased or unfair outputs.
  • Lack of Contextual Understanding: The model has limited real-world understanding and can therefore produce inaccurate or nonsensical responses.
  • Lack of Transparency: Because today's models largely behave as black boxes, it is hard to examine a model's inner workings in detail.
  • Content Harms: LLMs can cause various kinds of content harm, so how such harms are mitigated matters.
  • Hallucination: Smaller models are more prone to hallucination because of their smaller size and reduced memorization capacity.
  • Potential for Misuse: If used indiscriminately without appropriate safeguards, there is a risk of misuse.

 

 ์ถ”๊ฐ€์ ์œผ๋กœ Orca์˜ ์„ฑ๋Šฅ์€ explanation tuning์— ์‚ฌ์šฉ๋œ ๋ฐ์ดํ„ฐ์— ์˜ํ•ด ์˜ํ–ฅ์„ ๋ฐ›๋Š”๋‹ค:

 

  • Zero-Shot Settings: Orca was trained only on data simulating zero-shot settings with standard prompts. It has not yet been tested on multi-turn conversation, in-context learning, few-shot learning, or CoT prompting.
  • Data Distribution: Orca's performance is strongly correlated with the distribution of the tuning data, so it is weaker in areas that are under-represented in the training data, such as math, coding, and reasoning.
  • System messages: Orca was trained with a variety of system instructions to elicit different kinds of responses.
  • GPT-4 Behavior: Since Orca is trained to imitate GPT-4, it presumably inherits both the strengths and the weaknesses of the teacher model. The paper posits that Orca benefits from the safety measures and safety guardrails used during GPT-4's training.

 

 

 

 

Source

https://arxiv.org/abs/2306.02707

 


 
