Insight 😎

How has the scaling law developed in NLP? 🤔

Cartinoe 2023. 7. 24. 16:54

Before Starting..

 In 2017, the Transformer was proposed, a revolutionary model that flipped the landscape of deep learning, NLP included, on its head. This post is not about the details of the Transformer, so I won't dig into the architecture itself, but to follow the rest of the post it helps to know the model's size: the original Transformer had on the order of a hundred million parameters (roughly 65M for the base model and 213M for the big one). Just three years later, however, a model came out that makes that size look tiny: GPT-3, with 175B parameters. And even larger models have kept appearing since. Why have LMs kept growing like this? The answer can be found in Kaplan et al. 2020. Is endlessly increasing model size, then, the ultimate way to improve model performance? According to follow-up studies, not quite (Hoffmann et al. 2022, Zhou et al. 2023). In this post, let's walk through how the scaling law of LMs has evolved!

 

What is the scaling law? 🤔📈

 The main topic of this post is the scaling law, and not knowing what a scaling law is would be a real problem here, so let's quickly go over it first.

 

 Taken literally, a scaling law is a 'law of how things change with scale.' The actual meaning is not far from the name: simply put, it describes how one quantity changes when you vary another. Scaling laws are used across many scientific fields, and the idea applies to computer science as well. The scaling law we care about in this post is the scaling law of LMs, which you can think of as 'the law describing how an LM's performance changes as its key factors change.'
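 In this literature the relationship is usually written as a power law. Here is a minimal sketch of the general form (the symbols are generic placeholders, not the exact notation of any particular paper):

```latex
% Generic power-law form of an LM scaling law: as a resource x
% (parameters, data, or compute) grows, the loss L falls polynomially.
% x_c and \alpha are constants fitted to empirical training runs.
\[
  L(x) = \left(\frac{x_c}{x}\right)^{\alpha}
  \quad\Longrightarrow\quad
  \log L(x) = \alpha \log x_c - \alpha \log x ,
\]
% i.e. a straight line with slope $-\alpha$ on a log-log plot, which is
% why scaling-law figures are almost always drawn on log-log axes.
```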

 

An example of LM scaling laws (source: Kaplan, Jared, et al. "Scaling Laws for Neural Language Models.")

 

 The figure above is the example of scaling laws shown in Kaplan et al. 2020. There, the scaling law is expressed as how the test loss changes as the compute budget, dataset size, and number of parameters are varied. Putting it together, the scaling law of an LM can be read as 'how the model's performance changes when factors such as dataset size and parameter count change'!
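 For a feel of what those curves say, here is a small Python sketch of the three separate power-law fits from Kaplan et al. 2020. The exponents and constants are roughly the values reported in the paper (α_N ≈ 0.076, α_D ≈ 0.095, α_C ≈ 0.050); treat them as ballpark figures for illustration rather than exact numbers.

```python
# Rough sketch of the three independent power laws from Kaplan et al. 2020:
# test loss as a function of parameters N, dataset size D (tokens), and
# compute C (PF-days), each fit with the other factors unconstrained.
# Constants below are approximate values quoted from the paper.

def loss_vs_params(n_params: float) -> float:
    # L(N) ~ (N_c / N)^alpha_N, with alpha_N ~ 0.076, N_c ~ 8.8e13
    return (8.8e13 / n_params) ** 0.076

def loss_vs_data(n_tokens: float) -> float:
    # L(D) ~ (D_c / D)^alpha_D, with alpha_D ~ 0.095, D_c ~ 5.4e13
    return (5.4e13 / n_tokens) ** 0.095

def loss_vs_compute(pf_days: float) -> float:
    # L(C_min) ~ (C_c / C_min)^alpha_C, with alpha_C ~ 0.050, C_c ~ 3.1e8
    return (3.1e8 / pf_days) ** 0.050

if __name__ == "__main__":
    # Growing any single factor keeps lowering the loss, but slowly:
    # each 10x in parameters only shaves a modest slice off the loss.
    for n in (1e8, 1e9, 1e10, 1e11):
        print(f"N={n:.0e}  predicted test loss ~ {loss_vs_params(n):.2f}")
```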

 

Parameters matter most! (2020) 💻

 The first paper to introduce and propose a scaling law for LMs was OpenAI's 2020 paper, Kaplan, Jared, et al., 'Scaling Laws for Neural Language Models.' It argues that LM performance is tied to the number of model parameters, the dataset size, and the amount of compute. To figure out which of these factors matter more and which matter less, the authors run a series of experiments and work out the LM scaling law.

 

 The scaling law this paper uncovered is shown in the figure below.

 

Given a larger compute budget, how much larger a model, how much larger a batch, and how many more training steps should be used? (source: Kaplan, Jared, et al. "Scaling Laws for Neural Language Models.")

 

 The figure above may look complicated, but there is no need to be intimidated: its message is actually simple and clear. It was drawn to answer the question of how model size, batch size, and the number of training steps should be scaled up as more compute budget becomes available. As the figure shows, model size has the biggest effect on the loss, followed by batch size, and finally the number of training steps.
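 To put rough numbers on that ordering: Kaplan et al. report that as the compute budget C grows, the optimal model size grows roughly like C^0.73, the batch size like C^0.24, and the number of serial training steps like only C^0.03. A small sketch of what that allocation rule implies, with the exponents taken as approximate values from the paper:

```python
# How a larger compute budget gets split under Kaplan et al.'s allocation rule.
# Exponents are approximate values reported in the paper:
#   model size   N ∝ C^0.73
#   batch size   B ∝ C^0.24
#   serial steps S ∝ C^0.03
# So a 10x bigger budget mostly buys a bigger model and barely any extra steps.

def scale_factors(compute_multiplier: float) -> dict:
    return {
        "model_size":  compute_multiplier ** 0.73,
        "batch_size":  compute_multiplier ** 0.24,
        "train_steps": compute_multiplier ** 0.03,
    }

if __name__ == "__main__":
    for mult in (10, 100, 1000):
        f = scale_factors(mult)
        print(f"{mult:>5}x compute -> model x{f['model_size']:.1f}, "
              f"batch x{f['batch_size']:.1f}, steps x{f['train_steps']:.2f}")
```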

 

 Said only this way, the paper's point may still not be clear, so let me quote one sentence from its Discussion section that sums it up.

 

'Big models may be more important than big data.'

 

 Just as the figure above showed that model size matters more, the conclusion this paper reaches through its experiments is that 'big models may be more important than big data.'

 

 Building on this finding, OpenAI introduced GPT-3 in May of the same year: an LM with 175 billion parameters, roughly 10x larger than the biggest previous models. For about two more years after that, the trend of ever-growing LMs continued, as the graph below shows.

 

 

The LM scaling trend (source: https://huggingface.co/blog/large-language-models)

 

Not only parameters but also data matter! (2022) 📜

 After its publication, the Kaplan et al. 2020 scaling law was used widely across many studies for the next two years, steering the LM scaling trend toward ever-larger models. But was the scaling law from Kaplan et al. 2020 really the complete picture? That doubt led to the publication of a paper proposing a new scaling law: Hoffmann, Jordan, et al., 'Training Compute-Optimal Large Language Models.' (2022). This paper uses experiments to critique the earlier, model-size-centric scaling law and proposes a better recipe. A slightly amusing aside: while the first scaling law came from OpenAI, this second one came from DeepMind. 🤣

 

 The paper investigates the optimal model size and number of training tokens for training an LM under a given compute budget. It finds that models following the Kaplan et al. 2020 scaling law are significantly under-trained. In short, growing the model alone is not the answer to better performance. So how should we scale to get a better model? To find out, the paper sets out to answer the following question:

 

'Given a fixed FLOPs budget, how should one trade off model size and the number of training tokens?'

 

 To answer this question, the paper runs experiments from several different angles, which I won't go into here (if you are curious about the details, see the Chinchilla review). In the end, the experiments produce one key graph, shown below.

 

The paper's three approaches versus Kaplan's original scaling law (source: Hoffmann, Jordan, et al. "Training Compute-Optimal Large Language Models.")

 

 The graph shows Kaplan's scaling law alongside the scaling laws from the paper's three approaches. The original scaling law has a noticeably steeper slope, reflecting how much weight it puts on growing the parameter count, while the paper's approaches show that parameters do not need to grow nearly that fast. The scaling laws the approaches actually propose are summarized in the table below.

 

How parameters and data should scale with training compute (source: Hoffmann, Jordan, et al. "Training Compute-Optimal Large Language Models.")

 

 As the table shows, Kaplan's scaling law treats the growth of the parameter count as what matters most, whereas the paper's approaches show that it is more effective to scale the parameter count and the amount of data in equal proportion instead. To verify this, the paper trained a 70B-parameter model, much smaller than the recent giants, on 1.4T tokens of data, far more than the usual ~300B, and it outperformed the 4x larger Gopher model! 😲
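 Here is a hedged back-of-the-envelope sketch of what 'equal proportion' works out to in practice. It assumes the standard approximation C ≈ 6·N·D training FLOPs and the commonly quoted rule of thumb distilled from the paper that compute-optimal training uses roughly 20 tokens per parameter; Kaplan's recipe, by contrast, would push almost all of a larger budget into parameters (N growing roughly like C^0.73).

```python
import math

# Back-of-the-envelope compute-optimal sizing in the spirit of Chinchilla.
# Assumptions (both approximations, not exact values from the paper):
#   * training compute  C ~ 6 * N * D  FLOPs (N params, D tokens)
#   * compute-optimal   D ~ 20 * N     (the "~20 tokens per parameter" rule)
# Combining them: C ~ 120 * N^2, so N_opt grows like sqrt(C) and so does
# D_opt, i.e. parameters and data scale in equal proportion with compute.

TOKENS_PER_PARAM = 20.0

def compute_optimal(flops: float) -> tuple[float, float]:
    n_opt = math.sqrt(flops / (6.0 * TOKENS_PER_PARAM))
    d_opt = TOKENS_PER_PARAM * n_opt
    return n_opt, d_opt

if __name__ == "__main__":
    # Roughly Chinchilla's training budget: 6 * 70e9 * 1.4e12 ~ 5.9e23 FLOPs.
    n, d = compute_optimal(5.9e23)
    print(f"params ~ {n / 1e9:.0f}B, tokens ~ {d / 1e12:.2f}T")
    # -> roughly 70B parameters and 1.4T tokens, i.e. the Chinchilla recipe.
```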

 

Less Is Better: LIB..?? LIMA!! (2023) 🤣

 So far we have covered the scaling laws proposed by Kaplan et al. 2020 and Hoffmann et al. 2022. The first argued that increasing the parameter count is what matters for better performance; the second showed that scaling parameters alone leaves models badly under-trained, and that parameters and data should be scaled together instead. That makes two major revisions to the LM scaling law so far. The paper I'll introduce next is, in my view, a candidate for a third. Admittedly it has a slightly different flavor: the previous work analyzed the relationship between pre-training data and model size, while this paper analyzes fine-tuning data. Still, I think it is in the same spirit, so I'm including it! 😁

 

 The paper in question is a fresh 2023 paper from Meta: Zhou, Chunting, et al., 'LIMA: Less Is More for Alignment.' As the title suggests, it argues that the amount of data does not need to be all that large, where the data in question is the data used for fine-tuning. 😁 That is not to say that arbitrarily chosen data works well; rather, data selected according to LIMA's own criteria delivers good performance. In fact, the model performs remarkably well after being fine-tuned on just 1,000 instruction examples. Let's take a look! (This is only a brief overview; if you want the specifics, check out the LIMA review.)

 

'A model's knowledge and capabilities are learned almost entirely during pre-training, while fine-tuning teaches it the sub-distribution of formats to use when interacting with users.'

 

 The LIMA paper hypothesizes that a model learns most of its knowledge and capabilities during pre-training, and that fine-tuning may be a simple process of learning the style or format for interacting with users. The authors therefore carefully curate 1,000 examples that approximate real user prompts with high-quality responses, fine-tune LLaMA-65B on them to produce 'LIMA', and compare it against several other models. The 1,000 prompt & response examples were sampled from the following sources.
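 For intuition, here is a minimal supervised fine-tuning sketch in the spirit of LIMA: a small, hand-curated set of prompt-response pairs, rendered with a simple template and trained with a plain language-modeling objective. The model name, template, and hyperparameters below are illustrative assumptions for the sketch, not the paper's exact setup (LIMA fine-tunes LLaMA-65B with its own special tokens and schedule).

```python
# Minimal supervised fine-tuning sketch on a tiny, curated instruction set.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL = "huggyllama/llama-7b"   # stand-in for the sketch; LIMA uses 65B

curated = [  # in practice: ~1,000 carefully selected prompt-response pairs
    {"prompt": "Explain the Chinchilla scaling law in one paragraph.",
     "response": "Under a fixed compute budget, model size and training "
                 "tokens should be scaled in roughly equal proportion."},
]

def to_text(ex):
    # Simple prompt/response template (illustrative, not LIMA's exact format).
    return {"text": f"{ex['prompt']}\n\n### Response:\n{ex['response']}"}

tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token

def tokenize(ex):
    return tokenizer(ex["text"], truncation=True, max_length=2048)

ds = (Dataset.from_list(curated)
      .map(to_text)
      .map(tokenize, remove_columns=["prompt", "response", "text"]))

model = AutoModelForCausalLM.from_pretrained(MODEL)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lima-style-sketch",
                           num_train_epochs=15,          # several passes over a tiny set
                           per_device_train_batch_size=1,
                           learning_rate=1e-5),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```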

 

Sources of the training prompts (inputs), responses (outputs), and test prompts (source: Zhou, Chunting, et al. "LIMA: Less Is More for Alignment.")

 

 Fine-tuning on these 1,000 examples yields LIMA. Now it is time to check whether LIMA actually performs well! LIMA goes through both human evaluation and model evaluation: in the figure below, the left panel shows the human preference evaluation and the right panel shows the GPT-4 preference evaluation.

 

LIMA results. Left: human preference evaluation / Right: GPT-4 preference evaluation (source: Zhou, Chunting, et al. "LIMA: Less Is More for Alignment.")

 

 As the figure shows, LIMA clearly beats Alpaca 65B, which starts from the same base model of the same size but is fine-tuned on far more data, and it even edges out text-davinci-003, a larger model tuned with much more data. It does fall short of the other proprietary LMs, but even then, in 43% of cases its responses are judged comparable to or better than GPT-4's!
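 As a side note, the preference evaluation itself is conceptually simple: for each test prompt, generate one response per model, ask a judge (human annotators or GPT-4) which response is better or whether they tie, and report the win/tie/loss rates. Here is a minimal sketch of that tallying logic; `judge` is a hypothetical callback standing in for the human annotators or the GPT-4 comparison prompt used in the paper.

```python
from collections import Counter
from typing import Callable, Iterable

# Minimal pairwise preference tally, in the spirit of LIMA's evaluation.
# judge(prompt, response_a, response_b) returns "A", "B", or "tie".

def preference_rates(prompts: Iterable[str],
                     model_a: Callable[[str], str],
                     model_b: Callable[[str], str],
                     judge: Callable[[str, str, str], str]) -> dict:
    counts = Counter(judge(p, model_a(p), model_b(p)) for p in prompts)
    total = sum(counts.values())
    return {k: counts[k] / total for k in ("A", "B", "tie")}

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    prompts = ["Explain scaling laws in one sentence."] * 10
    model_a = lambda p: "Loss falls as a power law in model size, data, and compute."
    model_b = lambda p: "Bigger is better."
    judge = lambda p, a, b: "A" if len(a) > len(b) else "B"   # dummy judge
    print(preference_rates(prompts, model_a, model_b, judge))
```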

 

 This shows that, contrary to the trend so far, fine-tuning on far less data, as long as it is of correspondingly high quality, can also do a great deal for a model's performance, which suggests that fine-tuning data really does not need to be that plentiful. In fact, there are already studies that build on LIMA's concept to improve model performance or to propose more efficient methods! (Chen et al. 2023, Alshikh et al. 2023)

 

In the future.. ✨

 That wraps up the major LM scaling laws proposed so far: Kaplan's scaling law, which puts parameters first; the Chinchilla scaling law, which scales data and parameters in equal proportion; and finally LIMA, which fine-tunes on less but higher-quality data. As research digs deeper over time, the weaknesses of each scaling law get exposed and a new one gets published, so I expect today's scaling law will eventually be criticized and replaced as well. Since that moment has not arrived yet, I cannot say exactly what the next, improved scaling law will look like, but my guess is that future scaling laws will be even more efficient and effective in terms of both model performance and cost. haha

 

 Thank you for reading all the way to the end! I would love to hear your thoughts, and pointing out anything wrong or odd is always welcome!! That is it for this post; I will be back with an even better one next time! 😊😊

 

 

 

 

References

Kaplan, Jared, et al. "Scaling Laws for Neural Language Models." (2020). https://arxiv.org/abs/2001.08361

Hoffmann, Jordan, et al. "Training Compute-Optimal Large Language Models." (2022). https://arxiv.org/abs/2203.15556

Zhou, Chunting, et al. "LIMA: Less Is More for Alignment." (2023). https://arxiv.org/abs/2305.11206