Insight 😎

Noise makes LLMs better! - NEFTune 😉

Cartinoe 2023. 10. 18. 16:15

What is the biggest difference between NLP and CV? 😮

 ์ด ํฌ์ŠคํŒ…์˜ ์ œ๋ชฉ๋ถ€ํ„ฐ ํ•ด์„œ ์˜์•„ํ•œ ๋ถ€๋ถ„์ด ํ•œ๋‘ ๊ฐ€์ง€๊ฐ€ ์•„๋‹ ๊ฒƒ์ด๋‹ค. ๊ฐ‘์ž๊ธฐ ๋’ค๋Œ์•„๋ด์•ผ ํ•œ๋‹ค๋А๋‹ˆ CV์™€ NLP์˜ ๊ฐ€์žฅ ํฐ ์ฐจ์ด์ ์ด ๋ฌด์—‡์ธ์ง€์— ๋Œ€ํ•ด ๋ฌป์ง€๋ฅผ ์•Š๋‚˜. ํ•˜์ง€๋งŒ ์ด๋ฒˆ ํฌ์ŠคํŒ…์—์„œ ๋งํ•˜๊ณ ์ž ํ•˜๋Š” ๋‚ด์šฉ์„ ์œ„ํ•ด์„œ๋Š” ์ด ์ฐจ์ด์ ์„ ๋˜์งš์–ด๋ณด์•„์•ผ ํ•  ํ•„์š”๊ฐ€ ์žˆ๋‹ค! ๊ทธ๋ ‡๋‹ค๋ฉด ๋จผ์ € ๋…์ž๋ถ„๋“ค๊ป˜ ์งˆ๋ฌธํ•ด ๋ณด๋„๋ก ํ•˜๊ฒ ๋‹ค. NLP๊ณผ CV์˜ ๊ฐ€์žฅ ํฐ ์ฐจ์ด์ ์€ ๋ฌด์—‡์ผ๊นŒ? ์•„๋งˆ๋„ ์ด๋ ‡๊ฒŒ ์ถ”์ƒ์ ์œผ๋กœ ์งˆ๋ฌธํ•œ๋‹ค๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋‹ต๋ณ€๋“ค์ด ๋‚˜์˜ฌ ๊ฒƒ์ด๋ผ ์ƒ๊ฐํ•œ๋‹ค. ๐Ÿ˜

 

  • ์‚ฌ์šฉ๋˜๋Š” ๋ฐ์ดํ„ฐ๊ฐ€ ๋‹ค๋ฆ„. (text & image)
  • ์‚ฌ์šฉ๋˜๋Š” ๋ชจ๋ธ๋“ค์˜ ์ฐจ์ด
  • ํ•™์Šต ๋ฐฉ์‹์˜ ์ฐจ์ด

 Of course those answers are all valid, but the biggest difference between the two fields that I want to focus on in this post is "research on regularization." My apologies for the somewhat loosely-worded question. 😅 So what does this mean? Why bring up research on regularization all of a sudden?

 

 Before talking about regularization, let's take a brief look at the research trends in each field.

 

  • Computer Vision: research on regularization and overfitting is carried out actively.
  • Natural Language Processing: research mainly aims to improve performance by training models on new & high-quality data.

 ์ด๊ฒƒ๋งŒ ๋ด๋„ ์•Œ ์ˆ˜ ์žˆ๋“ฏ์ด CV์—์„œ๋Š” regularization๊ณผ overfitting์— ๋Œ€ํ•ด์„œ ํ™œ๋ฐœํ•˜๊ฒŒ ์—ฐ๊ตฌ๋“ค์ด ์ด์–ด์ง€๊ณ  ์žˆ๋Š” ๋ฐ˜๋ฉด, NLP๋Š” ์•„์ง์€ ์ƒˆ๋กญ๊ณ  ๋”์šฑ ํ€„๋ฆฌํ‹ฐ๊ฐ€ ์ข‹์€ ๋ฐ์ดํ„ฐ๋ฅผ ํ™œ์šฉํ•ด์„œ ๋ชจ๋ธ์„ ํ•™์Šต์‹œ์ผœ์„œ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ค๊ณ ์ž ํ•˜๋Š” ์—ฐ๊ตฌ๋“ค์ด ์ฃผ๋กœ ์ด๋ฃจ์–ด์ง€๊ณ  ์žˆ๋‹ค. ์•„ ๋ฌผ๋ก  prompting, fine-tuning, RLHF ๋“ฑ์˜ ์—ฐ๊ตฌ๋“ค๋„ ํ•จ๊ป˜ ํ™œ๋ฐœํ•˜๊ฒŒ ์ด์–ด์ง€๊ณ  ์žˆ๋‹ค. ํ•˜์ง€๋งŒ, CV์— ๋น„ํ•ด์„œ NLP ๋ถ„์•ผ์—์„œ๋Š” ์•„์ง regularization๊ณผ overfitting์— ๋Œ€ํ•ด์„œ๋Š” ์ถฉ๋ถ„ํ•œ ์—ฐ๊ตฌ๊ฐ€ ์ด๋ฃจ์–ด์ง€๊ณ  ์žˆ์ง€ ์•Š๋‹ค

 

 ๋ณธ ํฌ์ŠคํŒ…์—์„œ๋Š” ์ด๋Ÿฌํ•œ NLP ๋ถ„์•ผ์˜ ํ—ˆ์ ์„ ํŒŒ๊ณ ๋“  ๋…ผ๋ฌธ์ธ "NEFTune: Noisy Embedding Improve Instruction Finetuning(Jain et al. 2023)"์— ๋Œ€ํ•ด์„œ ์†Œ๊ฐœํ•ด๋ณด๊ณ ์ž ํ•œ๋‹ค! ๐Ÿค—

 

 

NEFTune, the new paradigm of model training ✨

Introduction 

 

 ์ด ํฌ์ŠคํŒ…์„ ์ž‘์„ฑํ•˜๋Š” ์‹œ์ ์—์„œ ๋ถˆ๊ณผ ์ผ์ฃผ์ผ ์ •๋„ ์ „์— ๊ณต๊ฐœ๋œ ๋”ฐ๋ˆ๋”ฐ๋ˆํ•œ ๋…ผ๋ฌธ์ธ "NEFTune: Noisy Embedding Improve Instruction Finetuning"์—์„œ๋Š” ๊ธฐ์กด์˜ fine-tuning์— ๋งค์šฐ ๊ฐ„๋‹จํ•œ trick์„ ์‚ฌ์šฉํ•จ์œผ๋กœ์จ ๊ธฐ์กด์˜ fine-tuning๋ณด๋‹ค ํ›จ์”ฌ ๋” ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ๋‹ค๊ณ  ์ฃผ์žฅํ•œ๋‹ค. ์‹ค์ œ๋กœ NEFTune์˜ ์‹คํ—˜ ๊ฒฐ๊ณผ๋ฅผ ์‚ดํŽด๋ณด๋ฉด NEFTune์ด ์ถฉ๋ถ„ํžˆ ๋งค๋ ฅ์ ์ธ training ๋ฐฉ์‹์ฒ˜๋Ÿผ ๋ณด์ธ๋‹ค. 

 

Performance comparison of standard fine-tuning and NEFTune on AlpacaEval. When fine-tuning on the Alpaca data, NEFTune achieves a win rate a full 34.9% higher than the baseline, confirming that NEFTune is an effective training method.

 

 ์œ„์˜ ๊ทธ๋ž˜ํ”„๋ฅผ ๋ณด๋ฉด ์•Œ ์ˆ˜ ์žˆ๋“ฏ์ด ์–ผํ• ๋ณด๊ธฐ์—๋„ NEFTune์€ standard fine-tuning๋ณด๋‹ค ํ›จ์”ฌ ๋” ๊ฐœ์„ ๋œ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ค„ ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค. ๊ทธ๋ ‡๋‹ค๋ฉด NEFTune์€ ์–ด๋–ค ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•˜์˜€๊ธฐ์— ๊ฐ„๋‹จํ•œ ๋ฐฉ๋ฒ•์œผ๋กœ ์ด๋ ‡๊ฒŒ ํšจ๊ณผ์ ์œผ๋กœ ์„ฑ๋Šฅ์„ ๊ฐœ์„ ์‹œํ‚ฌ ์ˆ˜ ์žˆ์—ˆ๋˜ ๊ฒƒ์ผ๊นŒ? 

 

NEFTune

 

 ๊ธฐ์กด์˜ instruction-tuned model๋“ค์„ ์‚ดํŽด๋ณด๋ฉด ๋ณดํ†ต instruction & response ์Œ์œผ๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ๋Š” ๋ฐ์ดํ„ฐ์…‹์—์„œ ํ•™์Šต๋œ๋‹ค. NEFTune๋„ ์ด๋“ค๊ณผ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ๊ฐ ์Šคํ…์€ ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ๋ถ€ํ„ฐ instruction์„ ์ƒ˜ํ”Œ๋งํ•˜๊ณ , ์ด ํ† ํฐ์„ ์ž„๋ฒ ๋”ฉ ๋ฒกํ„ฐ๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ๊ณผ์ •์„ ํ†ตํ•ด ์‹œ์ž‘๋œ๋‹ค. ๊ทธ๋‹ค์Œ์— NEFTune์€ ์ž„๋ฒ ๋”ฉ์— random noise vector๋ฅผ ์ถ”๊ฐ€ํ•จ์œผ๋กœ์จ standard training์„ ์‹œ์ž‘ํ•œ๋‹ค. ์ด๊ฒŒ NEFTune์˜ ๋ชจ๋“  ๊ฒƒ์ด๋‹ค! ๋„ˆ๋ฌด ๊ฐ„๋‹จํ•ด์„œ ์˜์‹ฌ์ด ๋“ค ์ •๋„์ธ๋ฐ ์‹ค์ œ๋กœ ๋…ผ๋ฌธ์—์„œ๋„ NEFTune method์— ๋Œ€ํ•œ ์„ค๋ช…์€ ๋ถˆ๊ณผ 9์ค„ ์ •๋„์— ๋ถˆ๊ณผํ•  ์ •๋„๋กœ ๋งค์šฐ ๊ฐ„๋‹จํ•œ method์ด๋‹ค. ์ด ๊ฐ„๋‹จํ•œ ๋ฐฉ๋ฒ•์œผ๋กœ NEFTune์€ standard fine-tuning์„ ์••๋„ํ•˜๋Š” ์„ฑ๋Šฅ์„ ์–ป์„ ์ˆ˜ ์žˆ์—ˆ๋˜ ๊ฒƒ์ด๋‹ค! 

 

NEFTune algorithm. The part that differs from standard fine-tuning is marked with the red box.

 

 NEFTune์˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‚ดํŽด๋ณด๋ฉด standard fine-tuning์— ๋น„ํ•ด์„œ NEFTune์—์„œ ์ถ”๊ฐ€๋œ ์ ์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

  • $\epsilon \in \mathbb{R}^{B \times L \times d},\ \epsilon_{ijk} \sim \mathrm{Uniform}(-1, 1)$ : sample each entry of the noise vector iid from a uniform distribution ($B$: batch size, $L$: sequence length, $d$: embedding dimension)
  • $(\frac{\alpha}{\sqrt{Ld}})\,\epsilon$ : scale the noise vector by this factor
  • $X'_{emb} \leftarrow X_{emb} + (\frac{\alpha}{\sqrt{Ld}})\,\epsilon$ : add the scaled noise vector to the original embedding
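
 To get a feel for the scale (an illustrative calculation of my own; these values are not from the post): with $\alpha = 5$, sequence length $L = 512$, and embedding dimension $d = 4096$, each noise entry is drawn from $\mathrm{Uniform}(-\frac{5}{\sqrt{512 \cdot 4096}}, \frac{5}{\sqrt{512 \cdot 4096}}) \approx \mathrm{Uniform}(-0.0035, 0.0035)$, i.e. a tiny per-coordinate perturbation, but one applied to every token embedding at every training step.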

  ์ด๋Ÿฌํ•œ NEFTune์˜ ๊ตฌ์กฐ๋ฅผ ์ฝ”๋“œ๋กœ ๋‚˜ํƒ€๋‚ด๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

import math
import torch

def noised_embed(orig_embed, x, noise_alpha):
    embed_init = orig_embed(x)                      # token embeddings, shape (B, L, d)
    dims = embed_init.size(1) * embed_init.size(2)  # L * d
    mag_norm = noise_alpha / math.sqrt(dims)        # alpha / sqrt(L * d)
    # add iid Uniform(-mag_norm, mag_norm) noise to every embedding entry
    return embed_init + torch.zeros_like(embed_init).uniform_(-mag_norm, mag_norm)
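
 For reference, here is a minimal sketch (my own, not code from the post or the paper) of how this could be hooked into a HuggingFace causal LM so that the noise is applied only during training. The helper name add_neftune is hypothetical; get_input_embeddings() is the generic transformers accessor, and patching its forward in place is just one convenient way to do it.

import math
import torch

def add_neftune(model, noise_alpha=5.0):
    # hypothetical helper: wrap the embedding lookup of a transformers causal LM
    embed_layer = model.get_input_embeddings()
    orig_forward = embed_layer.forward

    def noisy_forward(input_ids):
        embeds = orig_forward(input_ids)            # (B, L, d)
        if model.training:                          # leave evaluation / inference untouched
            dims = embeds.size(1) * embeds.size(2)
            mag_norm = noise_alpha / math.sqrt(dims)
            embeds = embeds + torch.zeros_like(embeds).uniform_(-mag_norm, mag_norm)
        return embeds

    embed_layer.forward = noisy_forward             # monkey-patch the embedding layer
    return model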

 

 NEFTune์€ ์ด๋ฆ„ ๊ทธ๋Œ€๋กœ Noisy Embedding Fine-tuning์ธ ๊ฒƒ์ด๋‹ค. ๐Ÿ˜„ NEFTune์˜ ์ž‘๋™ ๋ฐฉ์‹์„ ๋ณด๋ฉด ๋ฌด์–ธ๊ฐ€ ๋– ์˜ค๋ฅด๋Š” ๊ฒƒ์ด ํ•˜๋‚˜ ์žˆ์ง€ ์•Š์€๊ฐ€? ๋ฐ”๋กœ Computer Vision์—์„œ ๋ชจ๋ธ์„ ํ•™์Šต์‹œํ‚ฌ ๋•Œ ์ž์ฃผ ์‚ฌ์šฉ๋˜์—ˆ๋˜ Noise Injection๊ณผ ์ƒ๋‹นํžˆ ์œ ์‚ฌํ•˜๋‹ค๊ณ  ํ•„์ž๋Š” ์ƒ๊ฐํ•œ๋‹ค. ์ƒ๊ฐํ•ด ๋ณด๋ฉด ์ด noise injection์„ ์ด๋ฏธ์ง€๊ฐ€ ์•„๋‹Œ ํ…์ŠคํŠธ ์ž„๋ฒ ๋”ฉ์— ์ ์šฉํ•œ ๊ฒƒ์ด NEFTune์ด๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค! Computer Vision์—์„œ noise injection์€ ๋ชจ๋ธ์—๊ฒŒ robustํ•จ์„ ์ค„ ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ์„ฑ๋Šฅ ๊ฐœ์„ ๋„ ์ด๋Œ ์ˆ˜ ์žˆ์—ˆ๋Š”๋ฐ NLP์—์„œ๋Š” ์ด noisy embedding์ด ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์— ์–ด๋–ป๊ฒŒ ์˜ํ–ฅ์„ ๋ฏธ์น  ์ˆ˜ ์žˆ์„๊นŒ?

 

 

Striking performance of NEFTune 🔥

 NEFTune์€ ์•ž์„œ๋„ ๋งํ–ˆ๋“ฏ์ด ์ƒ๋‹นํžˆ ๊ฐ„๋‹จํ•œ method์ด๋‹ค. ๊ทธ๋ ‡๋‹ค๋ฉด ์ด ๊ฐ„๋‹จํ•œ NEFTune์ด ์„ฑ๋Šฅ์— ๋ฏธ์น˜๋Š” ์˜ํ–ฅ์€ ์–ผ๋งˆ๋‚˜ ํฌ๊ณ  ์–ด๋–ค ์˜ํ–ฅ์„ ๋ฏธ์น  ์ˆ˜ ์žˆ์„๊นŒ? ๋…ผ๋ฌธ์—์„œ ๋ณด์—ฌ์ค€ ์‹คํ—˜ ๊ฒฐ๊ณผ๋“ค์„ ํ•˜๋‚˜ํ•˜๋‚˜ ์‚ดํŽด๋ณด๋„๋ก ํ•˜์ž!

 

  • NEFTune์€ Text Quality๋ฅผ ๊ฐœ์„ ์‹œํ‚ดโฌ†๏ธ  ์•„๋ž˜์˜ ํ‘œ๋ฅผ ๋ณด๋ฉด ์•Œ ์ˆ˜ ์žˆ๋“ฏ์ด NEFTune์„ ์‚ฌ์šฉํ•ด์„œ ๋ชจ๋ธ์„ fine-tuningํ•˜๋ฉด ๋ชจ๋ธ์˜ conversational ability & answer quality๋ฅผ ์ƒ๋‹นํžˆ ๊ฐœ์„ ์‹œํ‚จ๋‹ค๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

์—ฌ๋Ÿฌ instruction data์—์„œ NEFTune์„ ์‚ฌ์šฉํ•ด์„œ fine-tuningํ•  ๋•Œ ์ƒ๋‹นํ•œ ์„ฑ๋Šฅ ๊ฐœ์„ ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Œ.

 

  • NEFTune์€ Chat Model๋„ ๊ฐœ์„ ์‹œํ‚ฌ ์ˆ˜ ์žˆ์Œ๐Ÿ—ฃ๏ธโฌ†๏ธ  Llama-2-Chat๊ณผ ๊ฐ™์€ RLHF๋ฅผ ํ†ตํ•ด fine-tune๋œ Chat model์— ๋Œ€ํ•ด์„œ๋„ WizardLM์˜ Evol-Instruct์—์„œ ์ถ”๊ฐ€์ ์œผ๋กœ ํ•™์Šต์‹œ์ผฐ์„ ๋•Œ 3% ์ •๋„์˜ ์„ฑ๋Šฅ ๊ฐœ์„ ์„ ์ผ์œผํ‚ฌ ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  NEFTune์„ ์‚ฌ์šฉํ•ด์„œ fine-tuning์„ ํ•˜๋ฉด ๋ฌด๋ ค 10% ์ •๋„ ๋” ๊ฐœ์„ ๋œ ๊ฒฐ๊ณผ๋ฅผ ์–ป์—ˆ๋‹ค! ํ•˜์ง€๋งŒ ์ด๋ ‡๊ฒŒ ํ•™์Šต๋œ ๋ชจ๋ธ์˜ ์ผ๋ถ€ ๊ธฐ๋Šฅ์€ ์œ ํ•ดํ•œ ๋™์ž‘ ์ถœ๋ ฅ์„ ์–ต์ œํ•˜๋Š” ๊ธฐ๋Šฅ๊ณผ ๊ฐ™์€ ์˜ํ–ฅ์„ ๋ฐ›์„ ์ˆ˜ ์žˆ๋‹ค.

Chat model์— ๋Œ€ํ•œ ์ถ”๊ฐ€์ ์ธ fine-tuning๊ณผ NEFTune์€ ์„ฑ๋Šฅ ๊ฐœ์„ ์„ ๊ฐ€์ ธ์˜ด.

 

  • Benchmark์—์„œ์˜ ์„ฑ๋Šฅ ์œ ์ง€๐ŸŸฐ  NEFTune์„ ์‚ฌ์šฉํ–ˆ์„ ๋•Œ ํ™•์‹คํžˆ ๋ชจ๋ธ์˜ conversational ability๋ฅผ ๊ฐœ์„ ์‹œํ‚ฌ ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์€ ํ™•์ธํ•˜์˜€์œผ๋‚˜ conversational ability ์™ธ์—๋„ benchmark ์„ฑ๋Šฅ์„ ์œ ์ง€ํ•˜๋Š” ๊ฒƒ ๋˜ํ•œ LLM์˜ ์ƒ๋‹นํžˆ ์ค‘์š”ํ•œ ๊ณผ์ œ์ด๋‹ค. ๋”ฐ๋ผ์„œ ๋…ผ๋ฌธ์—์„œ๋Š” HuggingFace์˜ Open LLM Leaderboard์˜ ํ‰๊ฐ€์˜์—ญ์ธ ARC, HellaSwag, MMLU, TruthfulQA์— ๋Œ€ํ•ด์„œ๋„ ํ‰๊ฐ€๋ฅผ ์ง„ํ–‰ํ•˜์˜€๋‹ค. ํ•˜์ง€๋งŒ ๋‹ค์Œ์˜ ๊ทธ๋ž˜ํ”„๋ฅผ ๋ณด๋ฉด ์•Œ ์ˆ˜ ์žˆ๋“ฏ์ด NEFTune์€ Benchmark์—์„œ๋„ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ํ•˜๋ฝ์‹œํ‚ค์ง€ ์•Š๊ณ  ๋ณด์กดํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค!

On ARC, HellaSwag, MMLU, and TruthfulQA, NEFTune has little effect on performance.

 

  • NEFTune์€ QLoRA์—์„œ๋„ ์ž‘๋™ํ•จ๐Ÿ˜ฎ  NEFTune์ด ํ™•์‹คํžˆ standard fine-tuning์— ๋น„ํ•ด์„œ ์„ฑ๋Šฅ์„ ๊ฐœ์„ ์‹œํ‚ฌ ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์€ ํ™•์ธํ•˜์˜€๊ณ , ๊ทธ๋ ‡๋‹ค๋ฉด QLoRA์™€ ๊ฐ™์€ Parameter Efficient Fine-tuning์—์„œ๋„ NEFTune์€ ํšจ๊ณผ์ ์ผ๊นŒ? ๊ทธ๋ ‡๋‹ค!! ๋ฌผ๋ก  ๊ทธ ํšจ๊ณผ๊ฐ€ full-finetuning์— ๋น„ํ•ด์„œ๋Š” ์†Œ์†Œํ•˜์ง€๋งŒ ํ™•์‹คํžˆ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ๊ฐ€์ ธ๋‹ค์ค€๋‹ค.

Even when training with PEFT, NEFTune's performance gains, though small, are visible.

 

  • NEFTune์€ ๋”์šฑ ๋””ํ…Œ์ผํ•œ response๋ฅผ ์ œ๊ณต๐Ÿ“š  ์ •ํ™•ํ•œ ๋น„๊ต๋ฅผ ์œ„ํ•ด NEFTune์˜ response์™€ standard fine-tuning์˜ responser๋ฅผ ๋น„๊ตํ•œ ๊ฒฐ๊ณผ NEFTune์˜ response๊ฐ€ standard fine-tuning์˜ response์— ๋น„ํ•ด ๋”์šฑ ๊ตฌ์ฒด์ ์ด๊ณ  ๋””ํ…Œ์ผํ•œ ์ •๋ณด๋ฅผ ์ฃผ๊ณ , ์ถ”๊ฐ€์ ์ธ ์ •๋ณด๋ฅผ ๋” ์ค€๋‹ค๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. ์ž์„ธํ•œ ๋‚ด์šฉ์€ ๋…ผ๋ฌธ์˜ Appendix๋ฅผ ํ™•์ธํ•ด ๋ณผ ์ˆ˜ ์žˆ๊ธธ ๋ฐ”๋ž€๋‹ค!

 

 ์ด์™€ ๊ฐ™์ด NEFTune์„ ํ†ตํ•œ ์„ฑ๋Šฅ ํ–ฅ์ƒ ํšจ๊ณผ๋ฅผ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. ๊ทธ๋Ÿฐ๋ฐ ์—ฌ๊ธฐ์„œ ๋“œ๋Š” ์˜๋ฌธ์ ์ด ํ•˜๋‚˜ ์žˆ๋‹ค. ๋„๋Œ€์ฒด ์–ด๋–ค ์ ์ด ์ด ๊ฐ„๋‹จํ•œ NEFTune method๋ฅผ ์—ฌ๋Ÿฌ ๋ชจ๋ธ์—์„œ ์„ฑ๋Šฅ ๊ฐœ์„ ์„ ์ผ์œผํ‚ฌ ์ˆ˜ ์žˆ๋Š” ๊ฐ•๋ ฅํ•œ method๋กœ ๋งŒ๋“  ๊ฒƒ์ผ๊นŒ? ๋ถ„๋ช… NEFTune์—์„œ ํ•œ ๊ฒƒ์ด๋ผ๊ณ ๋Š” ๊ณ ์ž‘ original embedding์— noise๋ฅผ ์ถ”๊ฐ€ํ•œ ๊ฒƒ ์™ธ์—๋Š” ํ•œ ๊ฒƒ์ด ์—†๋Š”๋ฐ, ์ด๊ฒƒ์ด NEFTune์„ ํšจ๊ณผ์ ์ธ method๋กœ ๋งŒ๋“ค์–ด์ค€ ๊ฒƒ์ผ๊นŒ?

 

 

What makes NEFTune effective? 🤔

 ์•ž์„  ์„น์…˜์˜ ๋ง๋ฏธ์— ๋˜์ง„ ์งˆ๋ฌธ์„ ์ด๋ฒˆ ์„น์…˜์—์„œ ๋ฐํ˜€๋ณด๋„๋ก ํ•˜๊ฒ ๋‹ค! ๋…ผ๋ฌธ์—์„œ๋Š” NEFTune์˜ ์„ฑ๋Šฅ ํ–ฅ์ƒ ํšจ๊ณผ๊ฐ€ ์•ž์„œ ์„ธ์› ๋˜ ๊ฐ€์„ค์ฒ˜๋Ÿผ ์ž„๋ฒ ๋”ฉ์— noise๋ฅผ ์ถ”๊ฐ€ํ•˜๋Š” ๊ฒƒ์œผ๋กœ๋ถ€ํ„ฐ ์˜จ๋‹ค๊ณ  ๊ฐ€์„ค์„ ์„ธ์› ๋‹ค. ์ด noise๋ฅผ ํ†ตํ•ด์„œ ๋ชจ๋ธ์ด ์–ป์„ ์ˆ˜ ์žˆ๋Š” ์ด์ ์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

  • Overfitting ⬇️: adding noise reduces the model's overfitting to the instruction dataset, its formatting details, exact wording, and text length.
  • Better use of the pre-trained model's knowledge 🔥: adding noise keeps the model from settling on one narrow data distribution, letting it draw on a wider range of distributions and produce higher-quality responses.

 ๋…ผ๋ฌธ์—์„œ๋Š” NEFTune์ด standard fine-tuning๋ณด๋‹ค ๋œ overfit ๋˜์—ˆ๊ธฐ ๋•Œ๋ฌธ์— ๋” ๊ฐœ์„ ๋œ ์„ฑ๋Šฅ์„ ๋‚˜์„ ์ˆ˜ ์žˆ๊ฒŒ ๋˜์—ˆ๋‹ค๋Š” ๊ฐ€์„ค์„ ์„ธ์šฐ๊ณ , ์‹คํ—˜์„ ํ†ตํ•ด ์ด๋ฅผ ์ฆ๋ช…ํ•˜๊ณ ์ž ํ•˜์˜€๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ๋…ผ๋ฌธ์—์„œ๋Š” ๋ชจ๋ธ์˜ training loss์™€ test loss๋ฅผ ๋‹ค์Œ์˜ ๊ทธ๋ž˜ํ”„์™€ ๊ฐ™์ด ๋น„๊ตํ•˜์˜€๋‹ค.

 

์™ผ์ชฝ์€ Alpaca dataset์—์„œ์˜ fine-tuning์˜ loss๋ฅผ ๋ณด์—ฌ์ฃผ๊ณ , ์˜ค๋ฅธ์ชฝ์€ Evol-Instruct์—์„œ์˜ test loss๋ฅผ ๋ณด์—ฌ์ฃผ๊ณ  ์žˆ์Œ.

 

 As the graphs show, NEFTune has a higher training loss than standard fine-tuning but a slightly lower test loss. Judging from this, NEFTune clearly overfits less than standard fine-tuning does.

 

 ๋˜ํ•œ model์˜ response์™€ ground-truth answer๊ณผ์˜ ์œ ์‚ฌ๋„ ๋น„๊ต๋ฅผ ์œ„ํ•ด ROUGE-L๊ณผ BLEU๋ฅผ ์‚ฌ์šฉํ•ด์„œ ํ‰๊ฐ€ํ•ด ๋ณธ ๊ฒฐ๊ณผ, NEFTune์€ standard fine-tuning์— ๋น„ํ•ด์„œ ํ›จ์”ฌ ๋” ์ž‘์€ ROUGE-L & BLEU score๋ฅผ ๊ฐ€์ง„๋‹ค๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. ๋”ฐ๋ผ์„œ ground-truth answer๊ณผ๋„ ํฌ๊ฒŒ ๋‹ค๋ฅธ response๋ฅผ ์ƒ์„ฑํ•œ๋‹ค๋Š” ๊ฒƒ์œผ๋กœ ๋ฏธ๋ฃจ์–ด๋ณด์•„ ํ™•์‹คํžˆ NEFTune์€ ๋œ overfit ๋˜์—ˆ๋‹ค๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

 

Similarity between model responses and ground-truth answers. NEFTune generates responses that are far less similar to the ground-truth answers than those of standard fine-tuning.
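
 If you want to run this kind of similarity check yourself, here is a minimal sketch (my own, assuming HuggingFace's evaluate package; the prediction and reference strings are placeholders).

# pip install evaluate rouge_score
import evaluate

# model responses and the ground-truth answers to compare against (placeholders)
predictions = ["Paris is the capital of France, located on the Seine."]
references = ["The capital of France is Paris."]

rouge = evaluate.load("rouge")   # reports rouge1 / rouge2 / rougeL / rougeLsum
bleu = evaluate.load("bleu")

rouge_l = rouge.compute(predictions=predictions, references=references)["rougeL"]
bleu_score = bleu.compute(predictions=predictions, references=[[r] for r in references])["bleu"]

print(f"ROUGE-L: {rouge_l:.3f}, BLEU: {bleu_score:.3f}")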

 

 

NEFTune with HuggingFace TRL

 ํฌ์ŠคํŒ…์„ ์˜ฌ๋ฆฌ๋Š” ์ผ์ž์ธ 10/18 ํ•˜๋ฃจ ์ „์ธ 10/17์— HuggingFace์˜ TRL ํŒ€์—์„œ NEFTune์„ TRL์˜ SFTTrainer์— ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋„๋ก ์—…๋ฐ์ดํŠธ๋ฅผ ํ•˜์˜€๋‹ค๊ณ  ํ•œ๋‹ค! (์ฐธ๊ณ : https://www.linkedin.com/feed/update/urn:li:activity:7120085541861085185/) ์ด์ œ NEFTune์„ ๋ชจ๋ธ์˜ ๋ณต์žกํ•œ ๊ตฌ์กฐ๋ฅผ ๊ฑด๋“œ๋ฆด ํ•„์š” ์—†์ด ๋”ฑ ํ•œ ์ค„์˜ ์ฝ”๋“œ๋ฅผ SFTTrainer์˜ arg์— ์ถ”๊ฐ€ํ•˜๋Š” ๊ฒƒ์œผ๋กœ NEFTune์„ ๊ตฌํ˜„ํ•  ์ˆ˜ ์žˆ๋‹ค๊ณ  ํ•œ๋‹ค! NEFTune์„ ํ™œ์šฉํ•ด์„œ fine-tuning์„ ํ•ด๋ณด๊ณ ์ž ํ•˜๋Š” ๋…์ž๊ฐ€ ์žˆ๋‹ค๋ฉด ์ฐธ๊ณ ํ•˜๊ธธ ๋ฐ”๋ž€๋‹ค.

 

NEFTune with TRL
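
 If you want to try it right away, here is a minimal usage sketch (my own, not code from the announcement; the NEFTune argument has moved between SFTTrainer and SFTConfig across TRL versions, and the model and dataset names below are placeholders).

from datasets import load_dataset
from trl import SFTTrainer

# any instruction dataset with a text field works; this one is just an example
dataset = load_dataset("tatsu-lab/alpaca", split="train")

trainer = SFTTrainer(
    model="meta-llama/Llama-2-7b-hf",   # placeholder base model
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
    neftune_noise_alpha=5,              # the one-line NEFTune switch; alpha scales the noise
)
trainer.train()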

 

What should NLP do in the future? 🧐

 That wraps up NEFTune, a method that is simple yet delivers gains too large to ignore. Normally I would end the post here by once again thanking the researchers for such great work, but this time I want to go a little deeper. Let me start by borrowing a passage from the paper's conclusion.

 

"Unlike the computer vision, which has studied regularization and overfitting for years,

the LLM community tends to use standardized training loops that are designed for optimizer stability and generalization."

 

 I think anyone studying NLP will deeply relate to the passage above. To me, it is a wake-up call about the current direction of NLP research. Looking at today's NLP research, a great deal of work keeps piling up without the basic studies underneath it being done; models are trained without giving much thought even to overfitting! Of course, current research keeps producing genuinely impressive new findings, but I believe we should pour just as much attention into the basics. In that sense, I think NEFTune is a study that will have a big influence on the direction of future NLP research. 😊

 

 ์ด ํฌ์ŠคํŒ…์„ ์ฝ์€ ๋…์ž๋“ค์€ ํ•„์ž์™€๋Š” ๋‹ค๋ฅธ ์ƒ๊ฐ์„ ๊ฐ€์ง€๊ณ  ์žˆ์„ ์ˆ˜๋„ ์žˆ๊ฒ ์ง€๋งŒ, ํ•„์ž๋Š” ๋งŽ์€ application study๋“ค๋„ ์ข‹์ง€๋งŒ, ๊ธฐ๋ณธ์ ์ธ ํ•˜๋‚˜์˜ basicํ•œ study๊ฐ€ ์ •๋ง ์ค‘์š”ํ•˜๋‹ค๊ณ  ์ƒ๊ฐํ•œ๋‹ค. ์•ž์œผ๋กœ์˜ NLP ์—ฐ๊ตฌ ๋™ํ–ฅ์ด ์ด๋Ÿฌํ•œ ๋ฐฉํ–ฅ์œผ๋กœ ๋‚˜์•„๊ฐˆ ์ˆ˜ ์žˆ๊ธธ ๋ฐ”๋ผ๋ฉฐ ํฌ์ŠคํŒ…์„ ๋งˆ์ณ๋ณด๊ณ ์ž ํ•œ๋‹ค.