
RoBERTa: A Robustly Optimized BERT Pretraining Approach (Paper Review)

2022. 12. 7. 14:41

Overview of this paper

์ด ๋…ผ๋ฌธ์€ BERT์˜ replication study๋กœ ๋‹ค์–‘ํ•œ key parameter๋“ค๊ณผ training data์˜ ํฌ๊ธฐ์˜ ์ค‘์š”์„ฑ์— ๋Œ€ํ•ด ์•Œ์•„๋ณด์•˜๋‹ค. ๊ทธ ๊ณผ์ •์—์„œ ์—ฐ๊ตฌ์ง„๋“ค์€ BERT๋Š” ์ƒ๋‹นํžˆ undertrained ๋˜์—ˆ๋‹ค๋Š” ์‚ฌ์‹ค์„ ์•Œ์•„๋‚ด์—ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  BERT ์ดํ›„์— ์ถœ์‹œ๋œ ๋ชจ๋ธ๋“ค์— ๋Œ€ํ•ด BERT๊ฐ€ ๊ทธ์— ์›ƒ๋„๋Š” ๋˜๋Š” ๋Šฅ๊ฐ€ํ•˜๋Š” ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ค„ ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ ๋˜ํ•œ ์•Œ์•„๋ƒˆ๋‹ค. ์‹ค์ œ๋กœ๋„ GLUE, RACE, SQuAD ๊ฐ™์€ ๋ฐ์ดํ„ฐ์…‹์—์„œ SoTA๋ฅผ ์ฐจ์ง€ํ•˜๊ธฐ๋„ ํ–ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๊ฒฐ๊ณผ๊ฐ€ ๊ฐ•์กฐํ•˜๋Š” ๊ฒƒ์€ ์ด์ „์— ๊ฐ„๊ณผ๋˜์—ˆ๋˜ ๋””์ž์ธ ์„ ํƒ๊ณผ ์š”์ฆ˜์— ๋ฐœํ‘œ๋˜๋Š” ๊ฐœ์„ ์•ˆ๋“ค์˜ ๊ทผ์›์— ๋Œ€ํ•ด ์˜๋ฌธ์ ์„ ์ œ๊ธฐํ•˜์˜€๋‹ค.

 

 

Table of Contents

1. Introduction

2. Background

3. Training Procedure Analysis

   3-1. Static vs. Dynamic Masking

   3-2. Model Input Format and Next Sentence Prediction

   3-3. Training with large batches

   3-4. Text Encoding

4. RoBERTa

 

 

1. Introduction

Numerous self-training methods have been introduced, such as ELMo, GPT, BERT, XLM, and XLNet. These methods brought substantial performance gains, but the challenge is that it is hard to determine exactly which aspects of each method contribute most to its performance. Training is computationally expensive and is often done on private datasets of varying sizes, which limits the amount of tuning that can be done; as a result, it is difficult to properly assess the merits of each modeling choice.

๊ทธ๋ž˜์„œ ์ด ๋…ผ๋ฌธ์—์„œ๋Š” BERT์˜ replication study๋ฅผ ์ œ์•ˆํ•˜๋Š”๋ฐ, ์—ฌ๊ธฐ์—๋Š” hyperparameter tuning๊ณผ ํ›ˆ๋ จ ์„ธํŠธ ํฌ๊ธฐ์˜ ํšจ๊ณผ์— ๋Œ€ํ•œ ์„ธ๋ฐ€ํ•œ ํ‰๊ฐ€๊ฐ€ ํฌํ•จ๋˜์–ด ์žˆ๋‹ค. ๋…ผ๋ฌธ์—์„œ๋Š” BERT๊ฐ€ ์ƒ๋‹นํžˆ undertrained ๋˜์–ด ์žˆ๋‹ค๋Š” ์‚ฌ์‹ค์„ ๋ฐœ๊ฒฌํ•˜์˜€๊ณ , ์ด BERT๋ฅผ ๋ฐœ์ „์‹œํ‚จ RoBERTa๋ผ๋Š” ๋ชจ๋ธ์„ ์†Œ๊ฐœํ•˜์˜€๋‹ค. ๋…ผ๋ฌธ์—์„œ ์ˆ˜์ •ํ•œ ์ ์„ ๊ฐ„๋‹จํ•œ๋ฐ, ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

  1. Train the model longer, with bigger batches, over more data
  2. Remove the next sentence prediction objective
  3. Train on longer sequences
  4. Dynamically change the masking pattern applied to the training data

๊ทธ๋ฆฌ๊ณ  ๋˜ํ•œ, ์ด์ „์˜ ๋ฐ์ดํ„ฐ์…‹๋“ค๊ณผ๋Š” ๋‹ค๋ฅธ ๋” ๋งŽ์€ ์–‘์˜ ๋ฐ์ดํ„ฐ๋ฅผ ํฌํ•จํ•˜๊ณ  ์žˆ๋Š” ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ์…‹ CC-NEWS๋ฅผ ์‚ฌ์šฉํ•˜์˜€๋‹ค. ์ด์™€ ๊ฐ™์€ modification์„ ํ†ตํ•ด ์‹ค์ œ๋กœ๋„ ๋งŽ์€ ๋ถ„์•ผ์—์„œ SoTA๋ฅผ ์ฐจ์ง€ํ•˜์˜€๊ณ , ๋…ผ๋ฌธ์—์„œ ์‚ฌ์šฉ๋œ ์ˆ˜์ •๋œ masking ๋ชจ๋ธ์€ ๋”์šฑ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋Œ์–ด๋‚ด๋Š” ๋ฐ ์ƒ๋‹นํ•œ ๋„์›€์„ ์ฃผ์—ˆ๋‹ค.

์ด ๋…ผ๋ฌธ์˜ contribution์„ ์‚ดํŽด๋ณด๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

  1. It highlights the importance of BERT's design choices and training strategies, and introduces alternatives that lead to better downstream task performance
  2. It uses a novel dataset, CC-NEWS, and confirms that using more data for pretraining further improves performance
  3. The performance improvements show that, under the right design choices, masked language model (MLM) pretraining is competitive with all the new methods introduced recently

 

 

2. Background

์ด ์„น์…˜์—์„œ๋Š” BERT์˜ ๊ธฐ๋ณธ์ ์ธ ๋ฐฐ๊ฒฝ์— ๋Œ€ํ•ด์„œ ์„ค๋ช…ํ•˜๊ณ  ์žˆ๋‹ค. ๊ธฐ๋ณธ์ ์ธ setup๊ณผ ๊ตฌ์กฐ, ํ›ˆ๋ จ ๋ฐฉ๋ฒ•, ์ตœ์ ํ™” ๋ฐฉ๋ฒ• ๋“ฑ์— ๋Œ€ํ•ด์„œ ๋ง์ด๋‹ค. ์ด์— ๋Œ€ํ•ด ์ž์„ธํžˆ ํ™•์ธํ•˜๊ณ  ์‹ถ์œผ๋ฉด ์—ฌ๊ธฐ๋ฅผ ์ฐธ๊ณ ํ•˜์‹œ์˜ค.

 

 

3. Training Procedure Analysis

์ด ์„น์…˜์—์„œ๋Š” ์–ด๋– ํ•œ ์„ ํƒ์ด BERT model์„ pre-trainํ•˜๋Š”๋ฐ ์ค‘์š”ํ•œ์ง€ ํƒ๊ตฌํ•˜๊ณ , ์ธก์ •ํ•ด๋ณด์•˜๋‹ค. ๊ทธ๋Ÿฌ๊ธฐ ์œ„ํ•ด ๋ชจ๋ธ์€ ๊ณ ์ •ํ•ด๋‘๊ณ  ์ง„ํ–‰ํ–ˆ๋Š”๋ฐ, BERT_BASE ((L=12, H=768, A=12, 110M params))์™€ ๋˜‘๊ฐ™์€ BERT model๋กœ training์„ ํ•˜์˜€๋‹ค.

 

3-1. Static vs. Dynamic Masking

BERT๋Š” ๋žœ๋คํ•˜๊ฒŒ ๋งˆ์Šคํ‚น๋˜๊ณ , ํ† ํฐ์„ ์˜ˆ์ธกํ•œ๋‹ค. ๊ธฐ์กด์˜ BERT์—์„œ๋Š” data ์ „์ฒ˜๋ฆฌ๋ฅผ ์ง„ํ–‰ํ•  ๋•Œ, ํ•œ ๋ฒˆ๋งŒ masking์„ ์ˆ˜ํ–‰ํ•œ๋‹ค. ๊ทธ ๊ฒฐ๊ณผ๋กœ, single static mask๊ฐ€ ์ง„ํ–‰๋œ๋‹ค. ๋งค epoch์—์„œ ๊ฐ๊ฐ์˜ training instance์— ๋Œ€ํ•ด ๋˜‘๊ฐ™์€ mask๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์„ ํ”ผํ•˜๊ธฐ ์œ„ํ•ด, ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ๋Š” 10๋ฒˆ ๋ณต์ œ๋˜๋Š”๋ฐ, ๊ฐ ์‹œํ€€์Šค๋Š” ๊ทธ์— ๋”ฐ๋ผ 40 epoch์˜ training ์œ„์—์„œ ์„œ๋กœ ๋‹ค๋ฅด๊ฒŒ 10๋ฒˆ์”ฉ ๋งˆ์Šคํ‚น ๋œ๋‹ค. ๋”ฐ๋ผ์„œ, ๊ฐ training sequence๋Š” ํ›ˆ๋ จ ์ค‘์— ๊ฐ™์€ ๋งˆ์Šคํฌ๋กœ 4๋ฒˆ์”ฉ ๋ชฉ๊ฒฉ๋œ๋‹ค.

์ด๋ฅผ ๋ชจ๋ธ์— sequence๋ฅผ ๋„ฃ์„ ๋•Œ๋งˆ๋‹ค masking ํŒจํ„ด์„ ์ƒ์„ฑํ•˜๋Š” dynamic masking๊ณผ ๋น„๊ตํ•ด๋ณด์•˜๋‹ค. ์ด๊ฒƒ์€ ๋”์šฑ ๋งŽ์€ step๊ณผ ๋”์šฑ ํฐ ๋ฐ์ดํ„ฐ์…‹์— ๋Œ€ํ•ด pretraining์„ ํ•  ๋•Œ ๋งค์šฐ ์ค‘์š”ํ•˜๋‹ค. ๋‹ค์Œ์˜ ํ‘œ 1์€ BERT_BASE์— ๋Œ€ํ•œ static๊ณผ dynamic masking์˜ ์„ฑ๋Šฅ์„ ๋น„๊ตํ•ด์ค€๋‹ค.

 

Table 1. Static vs. dynamic masking performance

The results show that static masking holds up reasonably well, even performing slightly better on some tasks, but overall dynamic masking is comparable or better. The paper therefore uses dynamic masking.
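To make the distinction concrete, below is a minimal sketch of the two strategies, assuming BERT's standard 80/10/10 masking rule; this is a toy illustration, not the paper's actual implementation. Static masking fixes the mask at preprocessing time, while dynamic masking draws a fresh mask every time a sequence is served to the model.

```python
import random

MASK, VOCAB = "[MASK]", ["the", "cat", "sat", "on", "mat"]

def mask_tokens(tokens, mlm_prob=0.15, seed=None):
    """BERT-style masking: pick ~15% of positions; of those,
    80% -> [MASK], 10% -> random token, 10% -> left unchanged."""
    rng = random.Random(seed)
    out, labels = list(tokens), [None] * len(tokens)
    for i in range(len(tokens)):
        if rng.random() < mlm_prob:
            labels[i] = tokens[i]          # only masked positions get labels
            r = rng.random()
            if r < 0.8:
                out[i] = MASK
            elif r < 0.9:
                out[i] = rng.choice(VOCAB)
    return out, labels

sequence = ["the", "cat", "sat", "on", "the", "mat"]

# Static masking: mask once at preprocessing time; every epoch reuses it.
static = mask_tokens(sequence, seed=0)
for epoch in range(3):
    masked, labels = static                 # identical mask each epoch

# Dynamic masking: a new mask is drawn each time the sequence is fed.
for epoch in range(3):
    masked, labels = mask_tokens(sequence)  # fresh pattern every epoch
```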

 

3-2. Model Input Format and Next Sentence Prediction

In the original BERT pretraining procedure, the model takes as input two document segments, which are either contiguous sentences sampled from the same document or sentences sampled from distinct documents. In addition to the masked language modeling objective, the model is trained to predict whether the two segments come from the same or different documents via an auxiliary NSP (Next Sentence Prediction) loss.

์ด์ „์˜ ์—ฐ๊ตฌ๋“ค์—์„œ NSP loss๋Š” ๊ธฐ์กด์˜ BERT๋ฅผ ํ›ˆ๋ จํ•  ๋•Œ ๋งค์šฐ ์ค‘์š”ํ•œ ์š”์†Œ๋กœ ์—ฌ๊ฒจ์ง€๊ณ  ์žˆ์—ˆ๋‹ค. ๋งŒ์•ฝ NSP๋ฅผ ์ œ๊ฑฐํ•œ๋‹ค๋ฉด, performance์— ์ƒ๋‹นํ•œ ํ•ด๋ฅผ ๊ฐ€ํ•˜๊ฒŒ ๋œ๋‹ค๊ณ  ์•Œ๋ ค์ ธ ์žˆ์—ˆ๋‹ค. ํ•˜์ง€๋งŒ, ์ตœ๊ทผ์˜ ์—ฐ๊ตฌ์—์„œ NSP loss์˜ ํ•„์š”์„ฑ์— ๋Œ€ํ•ด์„œ ์˜๊ตฌ์‹ฌ์„ ์ œ๊ธฐํ•˜๊ธฐ ์‹œ์ž‘ํ–ˆ๋‹ค. ์ด๋ฅผ ๋” ์ž˜ ์ดํ•ดํ•˜๊ธฐ ์œ„ํ•ด ๋…ผ๋ฌธ์—์„œ๋Š” ๋‹ค์–‘ํ•œ alternative training format์„ ํ†ตํ•ด ์•Œ์•„๋ณด์•˜๋‹ค.

  • SEGMENT-PAIR + NSP: This follows BERT's original input format, with the NSP loss. Each input has a pair of segments, each of which can contain multiple natural sentences, but the total combined length must be less than 512 tokens.
  • SENTENCE-PAIR + NSP: Each input contains a pair of natural sentences, sampled either from the same document or from different documents. Since these inputs are necessarily much shorter than 512 tokens, the batch size is increased so that the total number of tokens remains similar to SEGMENT-PAIR + NSP. The NSP loss is retained.
  • FULL-SENTENCES: Each input is packed with full sentences sampled contiguously from one or more documents, such that the total length is at most 512 tokens. Inputs may therefore cross document boundaries; when the end of one document is reached, sampling continues from the next document, with an extra separator token added between documents (see the sketch after this list). The NSP loss is removed.
  • DOC-SENTENCES: Inputs are constructed similarly to FULL-SENTENCES, except that they may not cross document boundaries. Inputs sampled near the end of a document may be shorter than 512 tokens, so the batch size is dynamically increased in those cases to achieve a total token count similar to FULL-SENTENCES. The NSP loss is also removed.
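As a concrete illustration of the format the paper ultimately adopts, here is a minimal sketch of FULL-SENTENCES-style packing. The helper name and the toy pre-tokenized input are assumptions for illustration, not the paper's code: inputs are filled with consecutive sentences, document boundaries are crossed with an extra separator token, and no NSP labels are produced.

```python
SEP, MAX_LEN = "[SEP]", 512

def pack_full_sentences(documents, max_len=MAX_LEN):
    """FULL-SENTENCES packing (sketch): fill each input with consecutive
    sentences, crossing document boundaries, with a [SEP] token added
    between documents. `documents` is a list of documents, each a list
    of tokenized sentences (lists of tokens)."""
    inputs, current = [], []
    for d, doc in enumerate(documents):
        for sent in doc:
            if current and len(current) + len(sent) > max_len:
                inputs.append(current)      # emit a full input, start fresh
                current = []
            current.extend(sent)            # sentences stay contiguous
        if d < len(documents) - 1:
            current.append(SEP)             # boundary marker between docs
    if current:
        inputs.append(current)              # note: no NSP label anywhere
    return inputs

# Toy usage: two short "documents" of pre-tokenized sentences.
docs = [[["the", "cat", "sat"], ["it", "purred"]], [["dogs", "bark"]]]
print(pack_full_sentences(docs, max_len=8))
```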

Results

๋‹ค์Œ์˜ ํ‘œ 2๋Š” 4๊ฐœ์˜ ๋‹ค๋ฅธ ์„ธํŒ…์— ๋Œ€ํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์ค€๋‹ค.

 

Table 2. Performance by input format & Next Sentence Prediction

First, the original SEGMENT-PAIR input format is compared with the SENTENCE-PAIR format. Both retain the NSP loss, but the latter uses single sentences. The paper finds that using individual sentences hurts performance on downstream tasks, and hypothesizes that this is because the model cannot learn long-range dependencies.

๊ทธ ๋‹ค์Œ์— NSP loss๊ฐ€ ์—†๋Š” training๊ณผ single document์˜ ํ…์ŠคํŠธ ๋ธ”๋ก์„ ์‚ฌ์šฉํ•œ training์„ ๋น„๊ตํ•˜์˜€๋‹ค. ๋…ผ๋ฌธ์—์„œ๋Š” ์ด๋Ÿฌํ•œ ์„ธํŒ…์ด ๊ธฐ์กด์˜ BERT_BASE์˜ ๊ฒฐ๊ณผ๋ฅผ ๋Šฅ๊ฐ€ํ•˜๋Š” ๊ฒƒ์„ ํ™•์ธํ•˜์˜€๊ณ , NSP loss๋ฅผ ์ œ๊ฑฐํ•˜๋Š” ๊ฒƒ์ด downstream task์˜ ์„ฑ๋Šฅ์„ ์›ƒ๋Œ๊ฑฐ๋‚˜ ์‚ด์ง ์ƒ์Šน์‹œํ‚ด์„ ์•Œ์•„๋ƒˆ๋‹ค. ์ด๊ฒƒ์ด ๊ฐ€๋Šฅํ•œ ์ด์œ ๋Š” ๊ธฐ์กด์˜ BERT์—์„œ ์˜ค์ง loss term๋งŒ์„ ์ง€์šฐ๊ณ , SEGMENT-PAIR์˜ ์ž…๋ ฅ ํฌ๋งท์€ ์œ ์ง€ํ•˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.

Finally, restricting sequences to come from a single document (DOC-SENTENCES) performs slightly better than packing sequences from multiple documents (FULL-SENTENCES). However, because the DOC-SENTENCES format results in variable batch sizes, the paper uses FULL-SENTENCES in the remaining experiments for easier comparison with related work.

 

3-3. Training with large batches

์ด์ „์— ์ˆ˜ํ–‰๋œ ๋งŽ์€ Neural Machine Translation์—์„œ ๋ณด์—ฌ์คฌ๋“ฏ์ด learning rate๊ฐ€ ์ ์ ˆํ•˜๊ฒŒ ์ƒ์Šนํ•  ๋•Œ, ๋งค์šฐ ํฐ mini-batches์™€ ํ•จ๊ป˜ training์„ ์ง„ํ–‰ํ•˜๋ฉด, optimization ์†๋„์™€ end-task์˜ ์„ฑ๋Šฅ์ด ํ•จ๊ป˜ ์ƒ์Šนํ•˜๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ์—ˆ๋‹ค. ์ตœ๊ทผ์˜ ์—ฐ๊ตฌ์—์„œ๋„ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ, BERT ๋˜ํ•œ large batch training์ด ๊ฐ€๋Šฅํ•˜๋‹ค.

The original BERT paper trained BERT_BASE for 1,000,000 steps with a batch size of 256 sequences. In computational cost, via gradient accumulation, this is equivalent to training for 125K steps with a batch size of 2K sequences, or for 31K steps with a batch size of 8K. Table 3 below compares the perplexity and end-task performance of BERT_BASE as the batch size is increased, while controlling for the number of passes through the training data.

 

Table 3. Perplexity & end-task performance by batch size

๋…ผ๋ฌธ์—์„œ๋Š” ํฐ batch๋กœ training์„ ์ง„ํ–‰ํ•˜๋Š” ๊ฒƒ์ด MLM์„ ์œ„ํ•œ ๋ณต์žก๋„ ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ end-task์˜ ์ •ํ™•๋„๋ฅผ ํ–ฅ์ƒ์‹œํ‚จ๋‹ค๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ์—ˆ๋‹ค. ํฐ batch๋Š” ๋ถ„์‚ฐ ๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ training์„ ํ†ตํ•ด ๋ณ‘๋ ฌํ™”ํ•˜๊ธฐ๊ฐ€ ๋” ์‰ฌ์šฐ๋ฉฐ ์ดํ›„ ์‹คํ—˜์—์„œ๋Š” 8K ์‹œํ€€์Šค์˜ batch๋กœ ๊ต์œกํ•˜์˜€๋‹ค.

 

3-4. Text Encoding

Byte-Pair Encoding (BPE) is a hybrid between character- and word-level representations that makes it possible to handle the large vocabularies common in natural language corpora. Instead of full words, BPE relies on subword units, which are extracted by performing statistical analysis of the training corpus.

BPE์˜ ์–ดํœ˜ ํฌ๊ธฐ๋Š” ๋ณดํ†ต 10K์—์„œ 100K์˜ subword unit์œผ๋กœ ์ด๋ฃจ์–ด์ ธ์žˆ๋‹ค. ํ•˜์ง€๋งŒ, unicode character์€ ํฌ๊ณ  ๋‹ค์–‘ํ•œ corpora๋ฅผ ๋ชจ๋ธ๋งํ•  ๋•Œ, ์ด ์–ดํœ˜์˜ ์ƒ๋‹นํ•œ ๋ถ€๋ถ„์„ ์„ค๋ช…ํ•  ์ˆ˜ ์žˆ๋‹ค. ์ด์ „์˜ ์—ฐ๊ตฌ์—์„œ base subword unit์œผ๋กœ unicode character์„ ์‚ฌ์šฉํ•˜๋Š” ๋Œ€์‹ ์— bytes๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ํ˜„๋ช…ํ•œ BPE ์‘์šฉ ๋ฐฉ๋ฒ•์„ ์†Œ๊ฐœํ•˜์˜€๋‹ค. bytes๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์€ ์ ๋‹นํ•œ ํฌ๊ธฐ((50K units))์˜ subword ์–ดํœ˜๋ฅผ ํ•™์Šต ๊ฐ€๋Šฅํ•˜๊ฒŒ ๋งŒ๋“ค์–ด์ค€๋‹ค. 

๊ธฐ์กด์˜ BERT์—์„œ๋Š” ์ž…๋ ฅ์„ heuristic tokenization rule์— ๋”ฐ๋ผ ์ „์ฒ˜๋ฆฌ๋ฅผ ์ง„ํ–‰ํ•œ ํ›„์— ํ•™์Šต๋˜๋Š” 30K ํฌ๊ธฐ์˜ character-level BPE ์–ดํœ˜๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค. GPT-2 ๋…ผ๋ฌธ์— ๋”ฐ๋ผ, ์ž…๋ ฅ์˜ ์ถ”๊ฐ€ ์‚ฌ์ „ ์ฒ˜๋ฆฌ ๋˜๋Š” ํ† ํฐํ™”๊ฐ€ ์—†๋Š” ๋Œ€์‹ ์—, 50K subword unit์„ ํฌํ•จํ•˜๋Š” ๋” ํฐ ๋ฐ”์ดํŠธ ์ˆ˜์ค€ BPE ์–ดํœ˜๋กœ BERT ๊ต์œก์„ ๊ณ ๋ คํ•˜์˜€๋‹ค. ์ด๊ฒƒ์€ BERT_BASE์™€ BET_LARGE์— ๋Œ€ํ•ด์„œ ๊ฐ๊ฐ ์•ฝ 15,000,000๊ฐœ์™€ 20,000,000๊ฐœ์˜ ์ถ”๊ฐ€์ ์ธ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ถ”๊ฐ€ํ•˜์˜€๋‹ค.

์ด์ „์˜ ์—ฐ๊ตฌ๋“ค์— ์˜ํ•ด ์ด๋Ÿฌํ•œ ์ธ์ฝ”๋”ฉ๋“ค ์‚ฌ์ด์—๋Š” ๊ทธ์ € ์‚ด์ง ๋‹ค๋ฅธ ์ ๋งŒ์ด ์žˆ๋‹ค๋Š” ๊ฒƒ์ด ๋“œ๋Ÿฌ๋‚ฌ๊ณ , BPE๊ฐ€ ์–ด๋А task์˜ end-task performance์— ๋Œ€ํ•ด ์‚ด์ง ์ข‹์ง€ ์•Š๋Š” ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ค€๋‹ค๋Š” ๊ฒƒ์ด ์•Œ๋ ค์กŒ๋‹ค. ๊ทธ๋Ÿผ์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ , ๋ฒ”์šฉ ์ธ์ฝ”๋”ฉ ์ฒด๊ณ„์˜ ์žฅ์ ์ด ์•ฝ๊ฐ„์˜ ์„ฑ๋Šฅ ์ €ํ•˜๋ฅผ ๋Šฅ๊ฐ€ํ•œ๋‹ค๊ณ  ๋ฏฟ๊ณ  ๋‚˜๋จธ์ง€ ์‹คํ—˜์—์„œ ์ด ์ธ์ฝ”๋”ฉ์„ ์‚ฌ์šฉํ•˜์˜€๋‹ค.

 

4. RoBERTa

The previous section showed that modifying BERT's pretraining procedure improves end-task performance. Now these improvements are aggregated and their combined effect is evaluated. The paper calls the resulting model RoBERTa, for Robustly optimized BERT approach. RoBERTa is trained with the following configuration:

  • dynamic masking (Section 3-1)
  • FULL-SENTENCES without the NSP loss (Section 3-2)
  • large mini-batches (Section 3-3)
  • a larger byte-level BPE (Section 3-4)

์ถ”๊ฐ€์ ์œผ๋กœ, ๋…ผ๋ฌธ์—์„œ๋Š” ์ง€๊ธˆ๊นŒ์ง€ ๊ณผ์†Œํ‰๊ฐ€ ๋˜์—ˆ๋˜ ๋‘ ๊ฐœ์˜ ๋‹ค๋ฅธ ์ค‘์š”ํ•œ ์š”์†Œ๋“ค์— ๋Œ€ํ•ด ์กฐ์‚ฌํ•˜์˜€๋‹ค. ์ฒซ ๋ฒˆ์งธ๋Š”, pretraining์— ์‚ฌ์šฉ๋˜๋Š” ๋ฐ์ดํ„ฐ์ด๊ณ , ๋‘ ๋ฒˆ์งธ๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ํ†ตํ•œ training ํŒจ์Šค์˜ ์ˆ˜ ์ด๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, XLNet architecture๋Š” BERT์˜ 10๋ฐฐ์— ๋‹ฌํ•˜๋Š” ๋ฐ์ดํ„ฐ๋กœ pre-trained ๋˜์—ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๋˜ํ•œ ์ ˆ๋ฐ˜์˜ optimization ๋‹จ๊ณ„์—์„œ 8๋ฐฐ ๋” ํฐ batch size๋กœ ํ•™์Šต๋˜์„œ, BERT์— ๋น„ํ•ด 4๋ฐฐ ๋” ๋งŽ์€ ์–‘์˜ ์‹œํ€€์Šค๋ฅผ pretraining์— ์‚ฌ์šฉํ•œ๋‹ค.

๋‹ค๋ฅธ ๋ชจ๋ธ๋ง ์„ ํƒ์—์„œ ์ด๋Ÿฌํ•œ ์š”์†Œ์˜ ์ค‘์š”๋„๋ฅผ ๋ถ„๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด, ๋…ผ๋ฌธ์—์„œ๋Š” BERT_LARGE architecture ((L=24, H=1024, A=16, 355M parameters))์— ๋”ฐ๋ผ RoBERTa๋ฅผ ๊ต์œกํ•˜๋Š” ๊ฒƒ์œผ๋กœ ์‹œ์ž‘ํ•˜์˜€๋‹ค. ๋…ผ๋ฌธ์—์„œ๋Š” BERT์—์„œ ์‚ฌ์šฉ๋œ ๋ฐ์ดํ„ฐ์…‹์— ๋น„๊ฒฌํ•˜๋Š” ๋ฐ์ดํ„ฐ์…‹ ์œ„์—์„œ 100K steps ์ •๋„ pretrain์„ ์ง„ํ–‰ํ•˜์˜€๋‹ค. 

 

Results

์‹คํ—˜์— ๋Œ€ํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋‹ค์Œ์˜ ํ‘œ 4์— ๋‚˜ํƒ€๋‚ด์—ˆ๋‹ค. ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด ์ œ์–ด๋ฅผ ํ•  ๋•Œ, RoBERTa๊ฐ€ ๊ธฐ์กด์˜ BERT_LARGE์˜ ๊ฒฐ๊ณผ์— ๋Œ€ํ•ด ๋งŽ์ด ํ–ฅ์ƒ๋œ ๊ฒฐ๊ณผ๋ฅผ ์ œ๊ณตํ•œ๋‹ค๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ์—ˆ๋‹ค. ์ด๋กœ์จ ๋””์ž์ธ ์„ ํƒ์ด ์–ผ๋งˆ๋‚˜ ์ค‘์š”ํ•œ ์ง€ ์žฌ์ฐจ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค.

 

Table 4. RoBERTa results

๊ทธ ๋‹ค์Œ์—, ์ด ๋ฐ์ดํ„ฐ๋ฅผ ์„ธ ๊ฐœ์˜ ์ถ”๊ฐ€์ ์ธ ๋ฐ์ดํ„ฐ์…‹์„ combine ํ•˜์˜€๋‹ค. ๊ทธ๋ฆฌ๊ณ  RoBERTa๋ฅผ ์ด combined๋œ ๋ฐ์ดํ„ฐ ์œ„์—์„œ ์ด์ „๊ณผ ๋˜‘๊ฐ™์€ training step((100K))๋งŒํผ ํ•™์Šต์‹œ์ผฐ๋‹ค. ์ข…ํ•ฉ์ ์œผ๋กœ, 160GB์— ๋‹ฌํ•˜๋Š” text ๋ฐ์ดํ„ฐ ์œ„์—์„œ pretrain ๋˜์—ˆ๋‹ค. ๊ทธ ๊ฒฐ๊ณผ, ๋ชจ๋“  downstream task์— ๋Œ€ํ•ด์„œ ์„ฑ๋Šฅ์ด ๊ฐœ์„ ๋œ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์ด๊ฒƒ์ด ๋ฐ์ดํ„ฐ์˜ ํฌ๊ธฐ์™€ ๋‹ค์–‘์„ฑ์ด pretraining์—์„œ ์–ผ๋งˆ๋‚˜ ์ค‘์š”ํ•œ ์ง€ ์ž…์ฆํ•˜์˜€๋‹ค.

Finally, RoBERTa is pretrained for significantly longer, increasing the number of pretraining steps from 100K to 300K and then to 500K. This yields further gains on downstream tasks, and the 300K and 500K step models outperform XLNet_LARGE across most tasks. Even these longest-trained models do not appear to overfit, and they would likely benefit from additional training.

 

 

 

References

https://arxiv.org/abs/1907.11692

 
