
All You Need Is Textbook-Quality Data!! 📖 - phi-1: Textbooks Are All You Need Paper Review

2023. 6. 25. 16:44

Overview of the paper

 ๋…ผ๋ฌธ์—์„œ๋Š” ๋‹ค๋ฅธ ๋ชจ๋ธ๋ณด๋‹ค ํ›จ์”ฌ ์ž‘๊ณ  code๋ฅผ ์œ„ํ•œ LLM์ธ phi-1์„ ์†Œ๊ฐœํ•˜์˜€๋‹ค. phi-1์€ 1.3B Transformer model์ด๊ณ , ์›น์œผ๋กœ๋ถ€ํ„ฐ textbook ํ€„๋ฆฌํ‹ฐ ๋ฐ์ดํ„ฐ์˜ ์„ ํƒ์  ๋ชจ์Œ๊ณผ ์ข…ํ•ฉ์ ์œผ๋กœ ์ƒ์„ฑ๋œ textbook์„ ์‚ฌ์šฉํ•˜๊ณ , GPT-3.5๋กœ ํ›ˆ๋ จ๋˜์—ˆ๋‹ค. phi-1์€ ์ž‘์€ ๊ทœ๋ชจ์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  ๋†’์€ pass@1 accuracy๋ฅผ ๋‹ฌ์„ฑํ•˜์˜€๋‹ค.

 

 

Table of Contents

1. Introduction

2. Training details and the importance of high-quality data

3. Spikes of model capability after finetuning on CodeExercises

4. Evaluation on unconventional problems with LLM grading

5. Data pruning for unbiased performance evaluation

 

 

1. Introduction

 ์ด ๋…ผ๋ฌธ์—์„œ๋Š” ์ด์ „์˜ ์—ฐ๊ตฌ๋ฅผ ๋”ฐ๋ผ์„œ ๋‹ค๋ฅธ ์ถ•(๋ฐ์ดํ„ฐ ํ€„๋ฆฌํ‹ฐ)๊ณผ ํ•จ๊ป˜ ์„ฑ๋Šฅ ๊ฐœ์„ ์ด ์–ป์–ด์งˆ ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ํƒ๊ตฌํ•˜์˜€๋‹ค. ๋†’์€ ํ€„๋ฆฌํ‹ฐ์˜ ๋ฐ์ดํ„ฐ๋Š” ๋” ๋‚˜์€ ๊ฒฐ๊ณผ๋ฅผ ์ด๋ˆ๋‹ค๋Š” ๊ฒƒ์€ ์˜ค๋žซ๋™์•ˆ ์ž˜ ์•Œ๋ ค์ง„ ์‚ฌ์‹ค์ด๊ณ , ์ด๋Š” ์–ด๋А ์ •๋„ ์ž‘์€ ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ์ด์ ์„ ์–ป๊ฑฐ๋‚˜ ๋ฐ์ดํ„ฐ์—์„œ ๋” ๋งŽ์€ ํŒจ์Šค๋ฅผ ํ—ˆ๋ฝํ•ด์ค€๋‹ค. ์ตœ๊ทผ์˜ ์—ฐ๊ตฌ์— ๋”ฐ๋ฅด๋ฉด ๋ฐ์ดํ„ฐ์˜ ํ’ˆ์งˆ์„ ๊ฐœ์„ ํ•˜๋Š” ๊ฒƒ์€ scaling law์˜ ํ˜•ํƒœ๋„ ๋ฐ”๊ฟ€ ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ์ž ์žฌ์ ์œผ๋กœ ๋Œ€๊ทœ๋ชจ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ๊ณผ ๋งž๋จน๋Š” ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ค„ ์ˆ˜ ์žˆ๋‹ค๊ณ  ๋ฐํ˜”๋‹ค. 

 

 ์ด ๋…ผ๋ฌธ์€ high-quality ๋ฐ์ดํ„ฐ๊ฐ€ LLM์˜ SoTA๋ฅผ ๊ฐœ์„ ์‹œํ‚ฌ ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ๋ฐ์ดํ„ฐ์…‹ ์‚ฌ์ด์ฆˆ์™€ ํ•™์Šต ๋น„์šฉ์„ ์ƒ๋‹นํžˆ ์ค„์ผ ์ˆ˜ ์žˆ๋‹ค๋Š” ์ฃผ์žฅ๊ณผ ํ•จ๊ผ ์ง„ํ–‰๋œ๋‹ค. ์ค‘์š”ํ•œ ์ ์€ smaller model์€ ์ ์€ ํ•™์Šต์„ ํ•„์š”๋กœ ํ•ด์„œ LLM์˜ ํ™˜๊ฒฝ์  ๋น„์šฉ์„ ์ƒ๋‹นํžˆ ์ค„์ผ ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. ๋…ผ๋ฌธ์—์„œ๋Š” LLM์ด code์— ๋Œ€ํ•ด ํ•™์Šตํ•˜๋Š” ๊ฒƒ์— ์ดˆ์ ์„ ๋‘์—ˆ๊ณ , ํ‰๊ฐ€ ๋ฒค์น˜๋งˆํฌ๋กœ๋Š” ๋„๋ฆฌ ์‚ฌ์šฉ๋˜๋Š” HumanEval์„ ์‚ฌ์šฉํ•˜์˜€๋‹ค.

 

 ๋…ผ๋ฌธ์—์„œ๋Š” high-quality ๋ฐ์ดํ„ฐ์˜ ํšจ๊ณผ๋ฅผ phi-1์ด๋ผ ๋ถ€๋ฅด๋Š” 1.3B ๋ชจ๋ธ์„ ํ•™์Šต์‹œํ‚ด์œผ๋กœ์จ ์„ค๋ช…ํ•˜์˜€๋‹ค. phi-1์€ ์›น ์†Œ์Šค๋กœ๋ถ€ํ„ฐ ์ˆ˜์ง‘๋˜๊ณ  ํ•„ํ„ฐ๋ง๋œ 'textbook quality' ๋ฐ์ดํ„ฐ์—์„œ pre-train ์‹œํ‚ค๊ณ , 'textbook-exercise-like' ๋ฐ์ดํ„ฐ์—์„œ ํ•™์Šต์‹œ์ผฐ๋‹ค. phi-1์€ ๋‹ค๋ฅธ ๋ชจ๋ธ์— ๋น„ํ•ด ์ƒ๋‹นํžˆ ์ž‘์€ ์‚ฌ์ด์ฆˆ์ž„์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ , HumanEval, MBAPP์—์„œ ์ตœ๊ณ ์˜ pass@1 accuracy๋ฅผ ๋‹ฌ์„ฑํ•˜์˜€๋‹ค. ๋˜ํ•œ ๋‹ค๋ฅธ ๋ชจ๋ธ์— ๋น„ํ•ด ๋”์šฑ ์ ์€ ํ† ํฐ์—์„œ ํ•™์Šต๋˜์—ˆ์Œ์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ , phi-1์€ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์คฌ๋‹ค. ๊ทธ๋ฆฌ๊ณ  phi-1์™€ phi-1-small์„ ๋น„๊ตํ•จ์œผ๋กœ์จ ํŒŒ๋ผ๋ฏธํ„ฐ์˜ ์ˆ˜๊ฐ€ ์ค‘์š”ํ•œ ์—ญํ• ์„ ํ•œ๋‹ค๋Š” ๊ฐ€์„ค์„ ์ž…์ฆํ•˜์˜€๋‹ค.

 

Table 1. Comparison across as many models as possible. Despite being trained at a far smaller scale, phi-1 shows promising performance compared with other models.

 

2. Training details and the importance of high-quality data

 ๋…ผ๋ฌธ์˜ ์ œ๋ชฉ์—์„œ ์–ธ๊ธ‰๋˜์–ด ์žˆ๋Š” ๊ฒƒ์ฒ˜๋Ÿผ, phi-1์˜ ์ฃผ์„ฑ๋ถ„์€ textbook quality์˜ training data์— ์˜์กดํ•œ๋‹ค. ์ด์ „์—๋Š” TheStack ๊ฐ™์€ text data๋ฅผ ์‚ฌ์šฉํ•˜๊ณ  ๋‹ค๋ฅธ ์›น ์‹œ๋ฐ˜ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์˜€์ง€๋งŒ, ์ด๋Ÿฌํ•œ ์†Œ์Šค๋Š” ๋ชจ๋ธ์—๊ฒŒ ์–ด๋–ป๊ฒŒ ๊ณ„ํš์„ ์„ธ์šฐ๊ณ  ์ถ”๋ก ์„ ํ•˜๊ฒŒ ํ•  ์ง€๋ฅผ ๊ฐ€๋ฅด์น˜๋Š” ๋ฐ ์ตœ์ ์ด ์•„๋‹ˆ๋‹ค. phi-1์˜ ๋ชจ๋ธ ์•„ํ‚คํ…์ฒ˜์™€ training method๋Š” ๋˜‘๊ฐ™์œผ๋‚˜, ๋ฐ์ดํ„ฐ๋ฅผ ์–ด๋–ป๊ฒŒ curateํ•˜๋Š”์ง€๋งŒ ๋‹ค๋ฅด๋‹ค.

 

 ๊ธฐ์กด์˜ code dataset์€ ๊ด‘๋ฒ”์œ„ํ•œ ํ† ํ”ฝ๊ณผ ์‚ฌ์šฉ ์ผ€์ด์Šค๋ฅผ ์ปค๋ฒ„ํ•˜๋Š” ํฌ๊ณ  ๋‹ค์–‘ํ•œ corpus๋ฅผ ํ˜•์„ฑํ•œ๋‹ค. ํ•˜์ง€๋งŒ, ์ด ๋ฐ์ดํ„ฐ๋“ค์€ ์ฝ”๋”ฉ์˜ ๊ธฐ๋ณธ์„ ํ•™์Šต์‹œํ‚ค๋Š”๋ฐ not instructive ํ•˜๊ณ , ๋‹ค์Œ์˜ ์—ฌ๋Ÿฌ ๊ฒฐ์ ์„ ๊ฒช๋Š”๋‹ค:

 

  • ๋งŽ์€ ์ƒ˜ํ”Œ๋“ค์ด ๋…๋ฆฝ์ ์ด์ง€ ์•Š์Œ โ†’ ๋ฐ์ดํ„ฐ์˜ ์™ธ๋ถ€์— ์žˆ๋Š” ๋‹ค๋ฅธ ๋ชจ๋“ˆ ๋˜๋Š” ํŒŒ์ผ์— ์˜์กดํ•จ
  • ์ „ํ˜•์ ์ธ example์€ ์˜๋ฏธ์žˆ๋Š” computation์„ ํฌํ•จํ•˜์ง€ ์•Š๊ณ  ์‚ฌ์†Œํ•œ code๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ์Œ
  • ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋…ผ๋ฆฌ๋ฅผ ํฌํ•จํ•˜๋Š” ์ƒ˜ํ”Œ์€ ๋ณต์žกํ•˜๊ฑฐ๋‚˜ ์ข‹์ง€ ์•Š๋Š” ๋ฌธ์„œํ™”๋œํ•จ์ˆ˜ ์•ˆ์— ์ˆจ๊ฒจ์ ธ ์žˆ์Œ โ†’ ์ด๊ฒƒ์œผ๋กœ๋ถ€ํ„ฐ์˜ ํ•™์Šต์„ ์–ด๋ ต๊ฒŒ ํ•จ
  • example ํŠน์ • ํ† ํ”ฝ ๋˜๋Š” ์‚ฌ์šฉ ์ผ€์ด์Šค์— ํŽธํ–ฅ๋ผ์„œ ์ฝ”๋”ฉ ๊ฐœ๋…๊ณผ ์Šคํ‚ฌ์˜ unbalanceํ•œ ๋ถ„ํฌ๋ฅผ ๋‚ด๋†“๊ฒŒ ๋Œ

 

 ๋…ผ๋ฌธ์—์„œ๋Š” LM๋„ ์‚ฌ๋žŒ์ด ์ข‹์€ textbook์ด๋ผ ์—ฌ๊ธธ ์ •๋„์˜ ํ€„๋ฆฌํ‹ฐ๋ฅผ ๊ฐ€์ง€๋Š” training set๋กœ๋ถ€ํ„ฐ ์ด์ ์„ ์–ป์–ด์•ผ ํ•œ๋‹ค๊ณ  ์ถ”์ธกํ•˜์˜€๋‹ค: ํˆฌ๋ช…ํ•˜๊ณ , ๋…๋ฆฝ์ ์ด๊ณ , instructiveํ•˜๊ณ  ๋ฐธ๋Ÿฐ์Šค ์žกํžŒ ๋ฐ์ดํ„ฐ. ๋…ผ๋ฌธ์—์„œ๋Š” ์ด ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ณ ์ž ์˜๋„์ ์œผ๋กœ high-quality ๋ฐ์ดํ„ฐ๋ฅผ ์ˆ˜์ง‘ํ•˜๊ณ  ์ƒ์„ฑํ•˜์˜€๋‹ค. ์ด๋ ‡๊ฒŒ ํ•ด์„œ ๋”์šฑ ์ž‘์€ ๋ชจ๋ธ๊ณผ ์ ์€ compute๋กœ๋„ code-generation task์—์„œ SoTA๋ฅผ ๋‹ฌ์„ฑํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. phi-1์˜ training์€ ๋‹ค์Œ์˜ 3๊ฐœ์˜ ์ฃผ๋œ ๋ฐ์ดํ„ฐ์…‹์— ์˜์กดํ•œ๋‹ค:

 

  • A filtered code-language dataset (The Stack & StackOverflow), selected with an LM-based classifier - about 6B tokens
  • A synthetic textbook dataset of <1B tokens of Python textbooks generated by GPT-3.5
  • A small synthetic exercises dataset of ~180M tokens of Python exercises and solutions

 

 ์œ„์˜ ๋ฐ์ดํ„ฐ์…‹์€ 7B๋ณด๋‹ค ์ ์€ ํ† ํฐ์œผ๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ๋‹ค. ๋…ผ๋ฌธ์—์„œ๋Š” filtered code-language & synthetic textbook dataset์˜ ์กฐํ•ฉ์„ 'CodeTextbook'์œผ๋กœ ๋ถ€๋ฅด๊ณ , ์ด๊ฒƒ์„ pre-training ํŽ˜์ด์ฆˆ์— ์‚ฌ์šฉํ•ด์„œ base model phi-1-base์„ ์–ป์—ˆ๋‹ค. ๊ทธ ๋‹ค์Œ์— 'CodeExercise' ๋ผ๊ณ  ๋ถˆ๋ฆฌ๋Š” 180M token์„ ํฌํ•จํ•˜๊ณ  ์žˆ๋Š” synthetic exercise ๋ฐ์ดํ„ฐ์…‹์„ ์‚ฌ์šฉํ•ด์„œ phi-1-base๋ฅผ fine-tune ํ•ด์„œ phi-1์„ ์–ป์„ ์ˆ˜ ์žˆ์—ˆ๋‹ค. 'CodeExercise'์˜ ์ž‘์€ ์‚ฌ์ด์ฆˆ์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  ์ด ๋ฐ์ดํ„ฐ์…‹์—์„œ์˜ fine-tuning์€ ์ƒ๋‹นํ•œ ๊ฐœ์„ ์„ ๋ณด์—ฌ์ค„ ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ๋งŽ์€ ํฅ๋ฏธ๋กœ์šด ๋Šฅ๋ ฅ์„ unlock ํ•˜์˜€๋‹ค.

 

Figure 1. pass@1 accuracy on HumanEval

 

2-1. Filtering of existing code datasets using a transformer-based classifier

 

 ๋…ผ๋ฌธ์—์„œ๋Š” publicly availableํ•œ TheStack๊ณผ StackOverflow์˜ ์„œ๋ธŒ์…‹์ธ Python code dataset์„ ์‚ฌ์šฉํ•ด์„œ ์‹คํ—˜์„ ์‹œ์ž‘ํ•˜์˜€๋‹ค. TheStack๊ณผ StackOverflow์˜ ํ€„๋ฆฌํ‹ฐ๋Š” GPT-4๋ฅผ ์‚ฌ์šฉํ•ด์„œ annotate ํ•˜์˜€๋‹ค.

 

 ๋…ผ๋ฌธ์—์„œ๋Š” output embedding์„ ์‚ฌ์šฉํ•ด์„œ file/sample์˜ ํ€„๋ฆฌํ‹ฐ๋ฅผ ์˜ˆ์ธกํ•˜๋Š” Random Forest Classifier๋ฅผ ํ•™์Šต์‹œํ‚ค๊ธฐ ์œ„ํ•ด annotated dataset๋ฅผ ์‚ฌ์šฉํ•˜์˜€๋‹ค. ๊ทธ๋ฆฌ๊ณ  GPT-4๋ฅผ TheStack & StackOverflow์˜ ์ž‘์€ ์„œ๋ธŒ์…‹์˜ ํ€„๋ฆฌํ‹ฐ์—์„œ annotation์„ ํ•˜๊ธฐ ์œ„ํ•ด ์ตœ์†Œํ•œ์œผ๋กœ ์‚ฌ์šฉํ•˜์˜€๋‹ค. ์ด๋Š” human effort๋ฅผ ํ”ผํ•˜๊ธฐ ์œ„ํ•œ ๋ฐฉ๋ฒ•์œผ๋กœ๋งŒ ์‚ฌ์šฉ๋˜์—ˆ๋‹ค. 

 

2-2. Creation of synthetic textbook-quality datasets

 

 code generation์„ ์œ„ํ•œ high-quality ๋ฐ์ดํ„ฐ์…‹์„ ์ƒ์„ฑํ•˜๋Š”๋ฐ ์ฃผ๋œ ์–ด๋ ค์šด ์ ์€ example์ด ๋‹ค์–‘ํ•˜๊ณ  ๋น„๋ฐ˜๋ณต์ ์ด๋ผ๋Š” ๊ฒƒ์„ ๋ณด์žฅํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ๋‹ค์–‘์„ฑ์€ ๋‹ค์œผ๋ฏ€์ด ๋ช‡ ๊ฐ€์ง€ ์ด์œ ๋กœ ์ธํ•ด์„œ ์ค‘์š”ํ•˜๋‹ค: LM์ด ๋ฌธ์ œ ํ•ด๊ฒฐ์„ ์œ„ํ•œ ์„œ๋กœ ๋‹ค๋ฅธ ๋‹ค์–‘ํ•œ ๋ฐฉ๋ฒ•์— ๋…ธ์ถœ๋˜๊ฒŒ ํ•ด์ฃผ๊ณ  overfitting์˜ ์œ„ํ—˜๊ณผ ํŠน์ • ํŒจํ„ด ๋˜๋Š” ์†”๋ฃจ์…˜์„ ๊ธฐ์–ตํ•˜๋Š” ๊ฒƒ์„ ์ค„์—ฌ์ฃผ๊ณ , ๋ชจ๋ธ์˜ ์ผ๋ฐ˜ํ™”์™€ robustness๋ฅผ ์ฆ๊ฐ€์‹œ์ผœ์ค€๋‹ค. ๊ทธ๋ž˜์„œ LM์ด ๋”์šฑ ์ฐฝ์˜์ ์ด๊ณ  ๋‹ค์–‘ํ•ด์ง€๋„๋ก ์œ ๋„ํ•˜๊ณ , example์˜ ํ€„๋ฆฌํ‹ฐ์™€ ์ผ๊ด€์„ฑ์€ ์œ ์ง€ํ•˜๋Š” ์˜ฌ๋ฐ”๋ฅธ ํŠธ๋ฆญ์„ ์ฐพ์„ ํ•„์š”๊ฐ€ ์žˆ์–ด์กŒ๋‹ค. ์ด์ „ ์—ฐ๊ตฌ์— ์˜๊ฐ์„ ๋ฐ›์•„์„œ ๋‹ค์–‘ํ•œ ๋ฐ์ดํ„ฐ์…‹ ์ƒ์„ฑ์„ ์ผ์œผํ‚ค๋Š” prompt์— ๋ฌด์ž‘์œ„์„ฑ์„ ์ฃผ์ž…ํ•˜๊ธฐ ์œ„ํ•œ ๋ฐฉ๋ฒ•์„ ์ฐพ๊ณ ์ž ํ•˜์˜€๋‹ค. 

 

The synthetic textbook dataset.  This dataset provides a source of high-quality natural-language text interleaved with relevant code snippets. The content of these textbooks is targeted at topics that promote reasoning and basic algorithmic skills, and diversity is obtained by constraining the topics and the target audience of each generated textbook. The following example illustrates synthetically generated textbook text:

 

synthetic textbook dataset

 

The CodeExercises dataset.  Each exercise is the docstring of a function that needs to be completed. The goal of this dataset is to align the model to perform function-completion tasks based on natural-language instructions. The dataset was generated by GPT-3.5, and diversity was induced by constraining the function names. Decontamination and evaluation on alternative benchmarks were carried out specifically for this dataset. The following snippet depicts a synthetically generated exercise.

 

CodeExercises dataset
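The original figure is not reproduced here; purely as an illustration of the described format (a docstring to be completed, followed by a model-written solution), a CodeExercises-style entry might look like this hypothetical example:

```python
def count_unique_vowels(sentence: str) -> int:
    """
    Return the number of distinct vowels (a, e, i, o, u) that appear
    in the given sentence, ignoring case.
    """
    vowels = set("aeiou")
    return len({ch for ch in sentence.lower() if ch in vowels})
```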

 

2-3. Model architecture & training

 

 ๋…ผ๋ฌธ์—์„œ๋Š” MHA์˜ FlashAttention์„ ์‚ฌ์šฉํ•œ decoder-only Transformer๋ฅผ ์‚ฌ์šฉํ•˜์˜€๋‹ค. ๋˜ํ•œ ๋‹ค๋ฅธ CodeGen ๋ชจ๋ธ๊ณผ ๊ฐ™์ด MHA & MLP์˜ ๋ณ‘๋ ฌ์  ๋ ˆ์ด์–ด๋ฅผ ์‚ฌ์šฉํ•˜์˜€๋‹ค. ๋˜ํ•œ rotary position embedding(RoPE)๋ฅผ ์‚ฌ์šฉํ•˜์˜€๋‹ค. tokenizer๋Š” codegen-350M-mono์™€ ๋˜‘๊ฐ™์€ tokenizer๋ฅผ ์‚ฌ์šฉํ•˜์˜€๋‹ค. 

 

For both pre-training and fine-tuning, the respective datasets are concatenated with <|endoftext|> as a separator, and the model is trained with a next-token prediction loss on sequences of length 2,048. phi-1-base was trained for 4 days on 8 A100 GPUs, and phi-1 was fine-tuned for 7 hours in the same setting.
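A minimal sketch of this packing step, assuming the documents are already tokenized and that `eot_id` is the tokenizer's id for <|endoftext|>:

```python
from typing import Iterable, List
import torch

SEQ_LEN = 2048

def pack(tokenized_docs: Iterable[List[int]], eot_id: int) -> torch.Tensor:
    """Concatenate documents separated by <|endoftext|> and slice the stream
    into fixed-length training sequences for next-token prediction."""
    stream: List[int] = []
    for doc in tokenized_docs:
        stream.extend(doc)
        stream.append(eot_id)
    n_seqs = len(stream) // SEQ_LEN
    return torch.tensor(stream[: n_seqs * SEQ_LEN], dtype=torch.long).view(n_seqs, SEQ_LEN)
```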

 

Pretraining.  phi-1-base was trained on the CodeTextbook dataset. Despite its small size and compute budget, it reaches 29% accuracy on HumanEval.

 

Finetuning.  phi-1 is obtained by fine-tuning phi-1-base on the CodeExercises dataset. Fine-tuning uses the same setup as pre-training.

 

 

3. Spikes of model capability after finetuning on CodeExercises

 ์ž‘์€ CodeExercise ๋ฐ์ดํ„ฐ์…‹์—์„œ์˜ fine-tuning์œผ๋กœ๋ถ€ํ„ฐ HumanEval์—์„œ ํฐ ์„ฑ๋Šฅ ๊ฐœ์„ ์„ ๋‚ด๋†“์•˜๋‹ค. ์ด ์„น์…˜์—์„œ๋Š” fine-tuning์„ ๊ฑฐ์นœ ๋ชจ๋ธ์€ fine-tuning ๋ฐ์ดํ„ฐ์…‹์—์„œ feature ๋˜์ง€ ์•Š์€ task๋ฅผ ์ˆ˜ํ–‰ํ•˜๋Š”๋ฐ ์ƒ๋‹นํ•œ ์„ฑ๋Šฅ ๊ฐœ์„ ์„ ๋ณด์—ฌ์ค€๋‹ค๋Š” ๊ฒƒ์„ ์„ค๋ช…ํ•œ๋‹ค. ์ด๊ฒƒ์€ phi-1์˜ fine-tuning ํ”„๋กœ์„ธ์Šค๊ฐ€ ๋ชจ๋ธ์ด pre-training ์ค‘์— ์–ป์€ ์ง€์‹์„ ์žฌ์กฐ์งํ•˜๊ณ  ๊ฐ•ํ™”ํ•˜๋Š”๋ฐ ๋„์›€์„ ์ค€๋‹ค๋Š” ๊ฒƒ์„ ์ œ์•ˆํ•œ๋‹ค.

 

3-1. Fine-tuning improves the model's understanding

 

 ๋…ผ๋ฌธ์—์„œ ๋งŒ๋“ค์–ด๋‚ธ ๊ฐ„๋‹จํ•œ Python ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•ด์„œ ๋ชจ๋ธ์ด fine-tuning์„ ๊ฑฐ์นœ instruction๊ณผ ํ•จ๊ป˜ ๋” ๋†’์€ ๋ ˆ๋ฒจ์˜ ์ดํ•ด์™€ ์ค€์ˆ˜๋ฅผ ๋ณด์—ฌ์ค€๋‹ค๋Š” ๊ฒƒ์„ ๊ด€์ฐฐํ•˜์˜€๋‹ค. phi-1-base๋Š” prompt์—์„œ ๋…ผ๋ฆฌ์  ๊ด€๊ณ„์— ๋Œ€ํ•ด ์–ด๋ ค์›€์„ ๊ฒช์—ˆ๋Š”๋ฐ, phi-1์€ question์„ ํ•ด์„ํ•˜๊ณ , answer๋ฅผ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ์ƒ์„ฑํ•˜์˜€๋‹ค. ์•„๋ž˜์˜ ์˜ˆ์‹œ์—์„œ 350M phi-1-small๋„ ์†”๋ฃจ์…˜์ด ํ‹€๋ฆฌ๊ธด ํ–ˆ์ง€๋งŒ, ์–ด๋А ์ •๋„์˜ ๋ฌธ์ œ ์ดํ•ด๋ฅผ ๋ณด์—ฌ์คฌ๋‹ค. 

 

 

3-2. Finetuning improves the model's ability to use external libraries

 

๋…ผ๋ฌธ์—์„œ๋Š” CodeExercise์—์„œ์˜ fine-tuning์€ ์˜ˆ์ธก์น˜ ๋ชปํ•˜๊ฒŒ ๋ชจ๋ธ์˜ ์™ธ๋ถ€ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ์‚ฌ์šฉ ๋Šฅ๋ ฅ์„ ๊ฐœ์„ ์‹œ์ผฐ๋‹ค. exercise์— ์ด ๋ฐ์ดํ„ฐ๋ฅผ ํฌํ•จ์‹œํ‚ค์ง€ ์•Š์•˜์Œ์—๋„ ๋ง์ด๋‹ค! ์ด๊ฒƒ์€ phi-1์˜ fine-tuning์ด ํƒ€๊นƒ์œผ๋กœ ์‚ผ๋Š” task๋ฅผ ๊ฐœ์„ ์‹œํ‚ฌ ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ pre-training์œผ๋กœ๋ถ€ํ„ฐ distillํ•˜๊ธฐ ์œ„ํ•œ ๋น„๊ด€๋ จ task๋„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ๋งŒ๋“ค์—ˆ๋‹ค. 

 

PyGame Examples.  The authors asked the models to generate code that moves a ball with PyGame. As the code below shows, phi-1 applies the PyGame functions correctly, whereas the outputs of phi-1-base & phi-1-small are syntactically plausible but semantically irrelevant.

 

PyGame example
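The generations themselves only appear as images in the post; for context, a minimal moving-ball PyGame script of the kind the prompt asks for (my own sketch, not any model's output) looks roughly like this:

```python
import pygame

# Minimal "bouncing ball" example using the real PyGame API.
pygame.init()
screen = pygame.display.set_mode((640, 480))
clock = pygame.time.Clock()
x, y, dx, dy = 320, 240, 4, 3

running = True
while running:
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            running = False
    x, y = x + dx, y + dy
    if x < 20 or x > 620:  # bounce off the left/right walls
        dx = -dx
    if y < 20 or y > 460:  # bounce off the top/bottom walls
        dy = -dy
    screen.fill((0, 0, 0))
    pygame.draw.circle(screen, (255, 255, 255), (x, y), 20)
    pygame.display.flip()
    clock.tick(60)

pygame.quit()
```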

 

TKinter Example.  The second example uses Tkinter: the models are asked to update a text field according to the user's button clicks. The results show a large gap in prompt understanding among the three models. phi-1-base and phi-1-small fail to use the proper Tkinter APIs and make up meaningless function calls, whereas phi-1 implements the GUI and all of the functions correctly.

 

TKinter example
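Similarly, a bare-bones version of the Tkinter task (a button that updates a text field when clicked), written here only to illustrate the kind of API usage being tested, not the paper's exact prompt or outputs:

```python
import tkinter as tk

root = tk.Tk()
field = tk.Entry(root, width=40)
field.pack()

def on_click() -> None:
    # Replace the field's contents each time the button is pressed.
    field.delete(0, tk.END)
    field.insert(0, "Button clicked!")

tk.Button(root, text="Update", command=on_click).pack()
root.mainloop()
```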

 

Chat model Example.  phi-1 also shows better chat ability than phi-1-base, even though chat data appears exclusively in pre-training and not in fine-tuning.

 

chat mode example

 

4. Evaluation on unconventional problems with LLM grading

 HumanEval์—์„œ phi-1์˜ ๋†€๋ผ์šธ ์ •๋„๋กœ ์ข‹์€ ์„ฑ๋Šฅ์— ๋Œ€ํ•œ ์ž ์žฌ์ ์ธ ๊ฑฑ์ •์€ CodeExercise dataset์˜ contamination์— ๊ธฐ์ธํ•˜๋Š” memorization์ด ์žˆ์„ ์ˆ˜๋„ ์žˆ๋‹ค. ๋…ผ๋ฌธ์—์„œ๋Š” ์ „ํ†ต์ ์ด์ง€ ์•Š์€ ๋ฐฉ์‹์œผ๋กœ ๊ณ ์•ˆ๋œ ์ƒˆ๋กœ์šด ํ‰๊ฐ€์™€ ํ•จ๊ป˜ ์ด๋Ÿฌํ•œ ๊ฑฑ์ •์„ ํ•ด๊ฒฐํ•˜์˜€๋‹ค.

 

The authors created 50 new problems in the same format as HumanEval, with the instruction to design problems that are unlikely to appear in real-world code bases or as coding exercises.

 

 LM์„ coding task์—์„œ ํ‰๊ฐ€ํ•˜๋Š”๋ฐ ํ•œ ๊ฐ€์ง€ ์–ด๋ ค์šด ์ ์€ ๋ชจ๋ธ์˜ output์ด ์ข…์ข… binary ํ•˜๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. ํ•˜์ง€๋งŒ ์ฝ”๋“œ๊ฐ€ test๋ฅผ ํ†ต๊ณผํ•˜๋Š”์ง€ ๊ทธ๋ ‡์ง€ ์•Š์€์ง€๋Š” ๋ชจ๋ธ ์„ฑ๋Šฅ์˜ ๋‰˜์•™์Šค๋ฅผ ์บก์ฒ˜ํ•˜์ง€ ๋ชปํ•œ๋‹ค. ๊ฑฐ์˜ ์•Œ๋งž์€ ์ฝ”๋“œ์ด์ง€๋งŒ, ์‚ฌ์†Œํ•œ ์—๋Ÿฌ๋ฅผ ๊ฐ€์ง€๋Š” ์ฝ”๋“œ๋ฅผ ์ƒ์„ฑํ•˜๊ฑฐ๋‚˜, ์ฝ”๋“œ๋Š” ์™„์ „ํžˆ ํ‹€๋ ธ์ง€๋งŒ, ์šฐ์—ฐํžˆ๋„ ๋ช‡ ๊ฐœ์˜ ํ…Œ์ŠคํŠธ๋ฅผ ํ†ต๊ณผํ•˜๊ธฐ๋„ ํ•˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ๊ทธ๋ž˜์„œ ๋ชจ๋ธ์˜ ์ฝ”๋”ฉ ์Šคํ‚ฌ์„ ๋”์šฑ ์ •๋ณด์  ๋ฐฉ์‹์œผ๋กœ ํ‰๊ฐ€ํ•˜๋Š” ๊ฒƒ์€ coding ์ธํ„ฐ๋ทฐ์—์„œ output๊ณผ ์•Œ๋งž์€ ์†”๋ฃจ์…˜์„ ๋น„๊ตํ•˜๊ณ  ์˜ˆ์ธก ๋…ผ๋ฆฌ์™€ ์–ผ๋งˆ๋‚˜ ์ž˜ ๋งค์น˜ํ•˜๋Š”์ง€์— ๊ธฐ๋ฐ˜ํ•ด์„œ ํ‰๊ฐ€ํ•˜๋Š” ๊ฒƒ์ด๋‹ค. 

 

 ๋…ผ๋ฌธ์—์„œ๋Š” ํ›„๋ณด ์†”๋ฃจ์…˜์˜ ํ‰๊ฐ€๋ฅผ ์œ„ํ•ด GPT-4๋ฅผ ์‚ฌ์šฉํ•ด์„œ ์†”๋ฃจ์…˜์— ๋“ฑ๊ธ‰์„ ๋งค๊ธฐ๋Š” ๋ฐฉ์‹์„ ์ฑ„ํƒํ•˜์˜€๋‹ค. ์—ฌ๊ธฐ์—๋Š” ๋‹ค์Œ์˜ 2๊ฐ€์ง€ ์žฅ์ ์ด ์žˆ๋‹ค.

 

  1. Using GPT-4 as the grader yields a more fine-grained and meaningful signal about the student model's coding ability
  2. It removes the need for test cases

 

 prompt๋Š” LLM์ด student์˜ ์†”๋ฃจ์…˜์„ short verbal evaluation์—์„œ ํ‰๊ฐ€ํ•˜๋„๋ก instruct ํ•˜์˜€๋‹ค. 

 

Table 2 shows the results for phi-1 and the other models. The new grading method produces the same ranking as HumanEval (see Table 1), which greatly increases confidence in the validity of phi-1's performance.

 

Table 2. Understanding scores graded by an LLM on 50 new, unconventional coding problems

 

5. Data pruning for unbiased performance evaluation

 CodeExercise์—์„œ์˜ ํ•™์Šต์€ HumanEval ๋ฒค์น˜๋งˆํฌ์—์„œ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์— ์ƒ๋‹นํ•œ ํ–ฅ์ƒ์„ ์ด๋ˆ๋‹ค. ์ด๋Ÿฌํ•œ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ์กฐ์‚ฌํ•˜๊ธฐ ์œ„ํ•ด HumanEval์˜ ํŒŒ์ผ๊ณผ ์œ ์‚ฌํ•œ ํŒŒ์ผ์„ ์ œ๊ฑฐํ•จ์œผ๋กœ์จ CodeExercise๋ฅผ pruneํ•˜๋Š” ๊ฒƒ์„ ์ œ์•ˆํ•˜์˜€๋‹ค. ๊ทธ ๋‹ค์Œ์— pruned data์—์„œ ๋ชจ๋ธ์„ ์žฌํ•™์Šต์‹œ์ผฐ์Œ์—๋„ HumanEval์—์„œ ๊ฐ•๋ ฅํ•œ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์คฌ๋‹ค.

 

 ๋…ผ๋ฌธ์—์„œ๋Š” ์ด๋Ÿฌํ•œ data pruning ์‹คํ—˜์€ ์„ฑ๋Šฅ์„ ํ‰๊ฐ€ํ•˜๊ธฐ ์œ„ํ•œ ์˜ฌ๋ฐ”๋ฅธ ๋ฐฉ๋ฒ•์œผ๋กœ ๋ฏฟ๋Š”๋‹ค. ๋˜ํ•œ ๊ธฐ์กด contamination ์‹คํ—˜์„ ํ†ตํ•ด CodeExercise๊ฐ€ HumanEval์— ์˜ํ•ด contaminate๋˜์ง€ ์•Š๋Š”๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค€๋‹ค.

 

5-1. N-gram overlap

 

 N-gram์€ ๊ณต์œ ๋œ n-word sequence์— ๊ธฐ๋ฐ˜ํ•ด์„œ text segment์˜ ์œ ์‚ฌ๋„๋ฅผ ์ธก์ •ํ•œ๋‹ค. ๋…ผ๋ฌธ์—์„œ๋Š” ๊ฐ humaneval question๊ณผ ๊ฐ exercise์˜ docstring ๊ฐ„์— n-gram overlap์„ ๊ณ„์‚ฐํ•˜์˜€๋‹ค. ๊ทธ ๊ฒฐ๊ณผ ์ตœ์†Œ ํ•˜๋‚˜์˜ ๋ฐ์ดํ„ฐ์…‹ entry์—์„œ 4๊ฐœ์˜ humaneval question์—์„œ 13-gram overlap์„ ๋ฐœ๊ฒฌํ•˜์˜€๋‹ค. ๋…ผ๋ฌธ์˜ n-gram ovelap ๋ถ„์„์€ phi-1 ๋ฐ์ดํ„ฐ์…‹์ด HumanEval๊ณผ ์ตœ์†Œํ•œ์˜ letter-by-letter overlap์„ ๊ฐ€์ง„๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค€๋‹ค.

 

n-gram overlap analysis

 

5-2. Embedding and syntax-based similarity analysis

 

 ์•ž์„œ ๋ดค๋˜ ๊ฒƒ์ฒ˜๋Ÿผ n-gram ๋ถ„์„์€ HumanEval๊ณผ CodeExercise ๊ฐ„์˜ ์œ ์‚ฌ code snipper์„ ์ฐพ๋Š”๋ฐ ์ถฉ๋ถ„ํžˆ ๊ฐœ์„ ๋˜์ง€ ์•Š์•˜๋‹ค. ๊ทธ๋ž˜์„œ ๊ทธ ๋Œ€์‹ ์— ์ž„๋ฒ ๋”ฉ๊ณผ syntax-based distance์˜ ์กฐํ•ฉ์„ ์‚ฌ์šฉํ•˜์˜€๋‹ค. embedding distance ๊ณ„์‚ฐ์„ ์œ„ํ•ด ๋…ผ๋ฌธ์—์„œ๋Š” code snippet ๊ฐ„์— L2 distance๋ฅผ ๊ณ„์‚ฐํ•˜์˜€๋‹ค. embedding distance๋Š” code ์Œ์„ ๊ฐญ์ฒ˜ํ•˜๋Š”๋ฐ ์„ฑ๊ณต์ ์ด์—ˆ๋‹ค. syntac-based distance๋ฅผ ์œ„ํ•ด์„œ ๋…ผ๋ฌธ์—์„œ๋Š” ์ฃผ์–ด์ง„ ๋‘ code snippet์˜ Abstract syntax trees(AST) ๊ฐ„์˜ edit distance๋ฅผ ๊ณ„์‚ฐํ•˜์˜€๋‹ค. AST distance๋Š” ์ฝ”๋“œ ์Œ ๊ฐ„์˜ ์˜ค๋ฒ„๋žฉ์„ ์„ฑ๊ณต์ ์œผ๋กœ ํŒ๋ณ„ํ•ด๋ƒˆ๋‹ค. CodeExercise์˜ pruning์„ ์œ„ํ•ด embedding distance๋ฅผ ์œ„ํ•œ ๊ธฐ์ค€์ ์„ ๊ณ ์ •ํ•˜๊ณ , AST distance์— ๋Œ€ํ•ด ์—ฌ๋Ÿฌ match rate๋ฅผ ํ…Œ์ŠคํŠธ ํ•˜์˜€๋‹ค.

 

 ํ‘œ 3์€ pruned dataset์—์„œ ์žฌํ•™์Šต๋œ phi-1์˜ ์„ฑ๋Šฅ๊ณผ full CodeExercise์—์„œ ํ•™์Šต๋œ ๊ธฐ์กด์˜ phi-1, StarCoder-prompted๋ฅผ ๋น„๊ตํ•ด์„œ ์š”์•ฝํ•˜์˜€๋‹ค. ๋…ผ๋ฌธ์—์„œ๋Š” HumanEval problem์„ ๊ธฐ์กด CodeExercise ๋ฐ์ดํ„ฐ์…‹ ๋‚ด๋ถ€์˜ ์ตœ์†Œ ํ•˜๋‚˜์˜ close match๋ฅผ ๊ฐ€์ง€๋Š”์ง€ ๊ทธ๋ ‡์ง€ ์•Š์€์ง€์— ๊ธฐ๋ฐ˜ํ•ด์„œ 2๊ฐœ์˜ ์„œ๋ธŒ์…‹(similar & non-similar)์œผ๋กœ ๋‚˜๋ˆด๋‹ค. ๊ทธ ๋‹ค์Œ์— HumanEval์˜ ๊ฐ ์„œ๋ธŒ์…‹์—์„œ ๋ชจ๋ธ์˜ ์ •ํ™•๋„๋ฅผ ๊ธฐ๋กํ•˜์˜€๋‹ค. ๋ฐ์ดํ„ฐ์…‹์„ ํฌ๊ฒŒ prune ํ•œ ํ›„์—๋„, phi-1์€ ์•„์ง StarCoder-Prompted๋ฅผ ํฐ ๋งˆ์ง„์œผ๋กœ ๋Šฅ๊ฐ€ํ•˜์˜€๋‹ค. ์ด๊ฒƒ์€ phi-1์˜ ์„ฑ๋Šฅ ํ–ฅ์ƒ์ด data contamination ๋•Œ๋ฌธ์ด ์•„๋‹ˆ๋ผ๋Š” ๊ฒƒ์„ ์ž…์ฆํ•œ๋‹ค.

 

Table 3. Percentage of similar vs. non-similar HumanEval problems solved correctly by different models

 


Source

https://arxiv.org/abs/2306.11644

 


 
