
Pre-LN Transformer: On Layer Normalization in the Transformer Architecture (Paper Review)

2023. 3. 9. 09:36

The overview of this paper

 The Transformer is widely used for NLP tasks. However, training a Transformer usually requires a carefully designed learning rate warm-up stage. This warm-up stage has a large impact on the final performance, but it slows down optimization and requires more hyper-parameter tuning. This paper studies why the learning rate warm-up stage is essential and how it relates to the location of layer normalization (LN). Specifically, the authors prove with mean field theory that, for the originally designed Post-LN Transformer, which places layer normalization between the residual blocks, the expected gradients of the parameters near the output layer are large at initialization. Using a large learning rate on those gradients therefore makes training unstable, and the warm-up stage is very practical for avoiding this problem. In contrast, the theory shows that if layer normalization is placed inside the residual blocks, the gradients are well-behaved at initialization. The experiments show that the Pre-LN Transformer without the warm-up stage matches the baseline while requiring significantly less training time and hyper-parameter tuning.

 

 

Table of Contents

1. Introduction

2. Optimization for the Transformer

   2-1. Transformer with Post-Layer Normalization

   2-2. The learning rate warm-up stage

   2-3. Understanding the Transformer at initialization

3. Experiment Results

 

 

1. Introduction

 The Transformer is the most commonly used neural network architecture in NLP. LN has played an important role in the Transformer's success. The original Transformer places LN between the residual blocks; this is called the Transformer with Post-Layer Normalization (Post-LN Transformer). This architecture has achieved state-of-the-art results on many NLP tasks, and unsupervised pre-trained models based on the Post-LN Transformer have shown impressive performance on various downstream tasks.

 

 Despite this success, the optimization of the Post-LN Transformer usually has to be handled more carefully than that of CNNs or other seq2seq models. In particular, to train the model from scratch, any gradient-based optimization approach requires a learning rate warm-up stage: training starts from a very small learning rate that is gradually increased to a pre-defined maximum value over a pre-defined number of iterations. Such a warm-up stage not only slows down optimization but also requires more hyper-parameter tuning.

 

 ์ด ๋…ผ๋ฌธ์—์„œ๋Š” learning rate warm-up stage๋ฅผ ์•ˆ์ „ํ•˜๊ฒŒ ์ œ๊ฑฐํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ฐพ์Œ์œผ๋กœ์จ ๋ฌธ์ œ๋ฅผ ์™„ํ™”ํ•˜์˜€๋‹ค. ์ฒซ ๋ฒˆ์งธ ๋ฐ˜๋ณต์—์„œ warm-up stage๊ฐ€ ๋ฐœ์ƒํ•˜๋Š” ๊ฒƒ์ฒ˜๋Ÿผ, mean field ์ด๋ก ์„ ์‚ฌ์šฉํ•ด์„œ ์ดˆ๊ธฐํ™” ์ƒํƒœ์—์„œ optimization ํ–‰๋™์„ ์กฐ์‚ฌํ•˜์˜€๋‹ค. ๋…ผ๋ฌธ์˜ ์ด๋ก ์  ๋ถ„์„์— ์˜ํ•˜๋ฉด, residual block ์‚ฌ์ด์— LN์„ ๋ฐฐ์น˜ํ•  ๋•Œ, output layer ๊ทผ์ฒ˜ ํŒŒ๋ผ๋ฏธํ„ฐ์˜ ์˜ˆ์ƒ๋˜๋Š” ๊ธฐ์šธ๊ธฐ๋Š” ํฌ๋‹ค. ๊ทธ๋ž˜์„œ, warm-up stage ์—†์ด large learning rate๋ฅผ ์ด ํŒŒ๋ผ๋ฏธํ„ฐ์— ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์€ optimization ํ”„๋กœ์„ธ์Šค๋ฅผ ๋ถˆ์•ˆ์ •ํ•˜๊ฒŒ ๋งŒ๋“ ๋‹ค. warm-up stage์™€ small learning rate๋กœ model์„ ํ•™์Šต์‹œํ‚ค๋Š” ๊ฒƒ์€ ์ด ๋ฌธ์ œ๋ฅผ ์‹ค์šฉ์ ์œผ๋กœ ํ”ผํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋„์™€์ค€๋‹ค. ๊ด‘๋ฒ”์œ„ํ•œ ์‹คํ—˜์€ ๋…ผ๋ฌธ์˜ ์ด๋ก ์  ๋ฐœ๊ฒฌ์„ ์ง€์ง€ํ•˜๋Š” ๋‚ด์šฉ์„ ์ œ๊ณตํ•ด์ค€๋‹ค.

 

 ๋…ผ๋ฌธ์˜ ์ด๋ก ์€ LN์ด ๊ธฐ์šธ๊ธฐ scale์„ ์กฐ์ ˆํ•˜๋Š” ๋ฐ ์ค‘์š”ํ•œ ์—ญํ• ์„ ํ•œ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์คฌ๋‹ค. ์ด๋Š” ์—ฐ๊ตฌ์ž๋“ค์—๊ฒŒ LN์„ ์–ด๋–ค ์œ„์น˜์— ๋†“์•˜์„ ๋•Œ ์ž˜ ์ ์‘๋œ ๊ธฐ์šธ๊ธฐ๋ฅผ ์–ป๊ฒŒ ํ•˜๋Š”์ง€์— ๋Œ€ํ•œ ๋‹ค๋ฅธ ๋ฐฉ๋ฒ•์„ ์กฐ์‚ฌํ•˜๊ณ ์ž ํ•˜๊ฒŒ ํ•˜์˜€๋‹ค. ํŠนํžˆ, ๋…ผ๋ฌธ์—์„œ๋Š” ๋‹ค๋ฅธ ๋ณ€ํ˜•์ธ Pre-LN์„ ์‚ฌ์šฉํ•˜๋Š” Transformer๋ฅผ ์—ฐ๊ตฌํ•˜์˜€๋‹ค. Pre-LN Transformer๋Š” residual connection ์•ˆ์— LN์„ ๋„ฃ๊ณ , ์˜ˆ์ธก ์ „์— ์ถ”๊ฐ€์ ์ธ final-layer normalization์„ ์ถ”๊ฐ€ํ•˜์˜€๋‹ค. ๋…ผ๋ฌธ์—์„œ๋Š” ์ดˆ๊ธฐํ™” ์‹œ ๊ธฐ์šธ๊ธฐ๊ฐ€ ์ด๋ก ์ ์œผ๋กœ๋‚˜ ๊ฒฝํ—˜์ ์œผ๋กœ Pre-LN Transformer์— ๋Œ€ํ•ด ํญ๋ฐœํ•˜๊ฑฐ๋‚˜ ์‚ฌ๋ผ์ง€์ง€ ์•Š๊ณ  ์ž˜ ๋™์ž‘ํ•จ์„ ๋ณด์—ฌ์คฌ๋‹ค.

 

 ๋…ผ๋ฌธ์˜ ์‹คํ—˜ ๊ฒฐ๊ณผ๋ฅผ ์‚ดํŽด๋ณด๋ฉด, ๋ชจ๋“  task์—์„œ learning rate warm-up stage๊ฐ€ ์•ˆ์ „ํ•˜๊ฒŒ ์ œ๊ฑฐ๋˜์—ˆ๊ณ , hyper-parameter์˜ ์ˆ˜ ๋˜ํ•œ ์ค„์–ด๋“ค์—ˆ๋‹ค. ๊ฒŒ๋‹ค๊ฐ€, ๋…ผ๋ฌธ์—์„œ๋Š” Pre-LN Transformer์— ๋Œ€ํ•ด loss decay๊ฐ€ ๋” ๋นจ๋ž๋‹ค. Pre-LN Transformer๋Š” ๋น„์Šทํ•œ ์ตœ์ข… ์„ฑ๋Šฅ์„ ๋”์šฑ ์ ์€ ํ•™์Šต ์‹œ๊ฐ„์„ ์‚ฌ์šฉํ•ด์„œ ๋‹ฌ์„ฑํ•˜์˜€๋‹ค. ์ด๋Š” large-scale model์„ large-scale dataset์— ๋Œ€ํ•ด ํ•™์Šต์‹œํ‚ค๋Š” ๊ฒƒ์— ๋Œ€ํ•ด ํŠนํžˆ ์ค‘์š”ํ•˜๋‹ค.

 

 ๋…ผ๋ฌธ์˜ Contribution์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

  • ๋…ผ๋ฌธ์—์„œ๋Š” ๋‘ ๊ฐœ์˜ Transformer ๋ณ€ํ˜•์ธ, Post-LN Transformer๊ณผ Pre-LN Transformer์— ๋Œ€ํ•ด mean field ์ด๋ก ์„ ์‚ฌ์šฉํ•ด์„œ ์กฐ์‚ฌํ•˜์˜€๋‹ค. ์ดˆ๊ธฐํ™” ์‹œ์— ๊ธฐ์šธ๊ธฐ๋ฅผ ์—ฐ๊ตฌํ•˜์—ฌ, ๋…ผ๋ฌธ์—์„œ๋Š” ์™œ learning rate warm-up stage๊ฐ€ Post-LN Transformer์„ ํ•™์Šต์‹œํ‚ฌ ๋•Œ ํ•„์ˆ˜์ ์ธ์ง€์— ๋Œ€ํ•œ ์ฆ๊ฑฐ๋ฅผ ์ œ๊ณตํ•˜์˜€๋‹ค.
  • ๋…ผ๋ฌธ์—์„œ๋Š” ์ฒ˜์Œ์œผ๋กœ learning rate warm-up stage๊ฐ€ hyperparameter tuning์„ ์‰ฝ๊ฒŒ ๋งŒ๋“ค์–ด์ฃผ๋Š” Pre-LN Transformer๋ฅผ ์œ„ํ•ด ์ œ๊ฑฐ๋  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์คฌ๋‹ค. ๋…ผ๋ฌธ์—์„œ๋Š” ์ถ”๊ฐ€์ ์œผ๋กœ ์ ์ •ํ•œ learning rate scheduler๋ฅผ ์‚ฌ์šฉํ•จ์œผ๋กœ์จ training time์ด ๊ด‘๋ฒ”์œ„ํ•œ ์‘์šฉ์—์„œ ํฌ๊ฒŒ ๊ฐ์†Œ๋  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ฃผ์—ˆ๋‹ค.

 

2. Optimization for the Transformer

2-1. Transformer with Post-Layer Normalization

 

 ์ด ์„น์…˜์—์„œ๋Š” vanilla Transformer์— ๋Œ€ํ•ด์„œ ์„ค๋ช…ํ•˜๊ณ  ์žˆ๋‹ค. Transformer์— ๋Œ€ํ•œ ์ž์„ธํ•œ ์„ค๋ช…์ด ๊ถ๊ธˆํ•˜๋‹ค๋ฉด ์—ฌ๊ธฐ๋ฅผ ์ฐธ๊ณ ํ•˜๊ธธ ๋ฐ”๋ž€๋‹ค. 

 

 Different orderings of the sub-layer, the residual connection, and LN within a Transformer layer give rise to variants of the Transformer architecture. The most basic and most widely used architecture for the Transformer and BERT follows "self-attention (or FFN) sub-layer → residual connection → layer normalization". This is called the Transformer with Post-Layer Normalization (Post-LN Transformer) and is shown in Figure 1 below.

 

Figure 1. (a) Post-LN Transformer, (b) Pre-LN Transformer

 

Post-LN Transformer  Denote $x_{l, i}$ as the input of the $l$-th Transformer layer at position $i$, where $x_{l, i}$ is a real-valued vector of dimension $d$, $i = 1, 2, \ldots, n$, $l = 1, 2, \ldots, L$. $n$ is the length of the sequence and $L$ is the number of layers. For completeness, the input embedding at position $i$, which is usually a combination of the word embedding and the positional embedding, is defined as $x_{0, i}$. The computation inside the $l$-th layer consists of several steps, and superscripts on $x$ are used to denote the inputs and outputs of the different steps, as in the left column of Table 1. Here $W^{1,l}, W^{2,l}, b^{1,l}, b^{2,l}$ are the parameters of the FFN sub-layer in the $l$-th layer.

 

Table 1. Post-LN Transformer vs. Pre-LN Transformer
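
To make the "sub-layer → residual connection → layer normalization" ordering of Table 1 (left) concrete, here is a minimal PyTorch-style sketch of one Post-LN encoder layer. The class and argument names (`PostLNLayer`, `d_model`, `n_heads`, `d_ff`) and the hyper-parameter values are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class PostLNLayer(nn.Module):
    """One Post-LN Transformer encoder layer:
    sub-layer -> residual connection -> layer normalization."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),   # analogue of W^{1,l}, b^{1,l}
            nn.ReLU(),
            nn.Linear(d_ff, d_model),   # analogue of W^{2,l}, b^{2,l}
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Self-attention sub-layer, then residual connection, then LN
        attn_out, _ = self.attn(x, x, x)
        x = self.ln1(x + attn_out)
        # FFN sub-layer, then residual connection, then LN
        x = self.ln2(x + self.ffn(x))
        return x

# Example usage: a batch of 2 sequences of length 10
layer = PostLNLayer()
out = layer(torch.randn(2, 10, 512))
```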

 

 

2-2. The learning rate warm-up stage (Post-LN Transformer)

 

๋…ผ๋ฌธ์—์„œ๋Š” Post-LN Transformer์˜ optimization์—์„œ learning rate wam-up stage์— ๊ด€์‹ฌ์„ ๊ฐ€์กŒ๋‹ค. learning rate๊ฐ€ ์ƒ๋Œ€์  ํฐ ๊ฐ’์—์„œ ์‹œ์ž‘ํ•ด์„œ ๊ฐ์†Œํ•˜๋Š” ๋‹ค๋ฅธ ๋งŽ์€ architecture์˜ optimization๊ณผ ๋‹ฌ๋ฆฌ Post-LN Transformer์˜ learning rate warm-up stage๋Š” ๋งค์šฐ ์ค‘์š”ํ•ด ๋ณด์ธ๋‹ค. ๋…ผ๋ฌธ์—์„œ๋Š” $t$๋ฒˆ์งธ ๋ฐ˜๋ณต์˜ learning rate๋ฅผ $lr(t)$๋กœ ํ‘œ๊ธฐํ•˜๊ณ , ํ•™์Šต ์ค‘์— maximum learning rate๋ฅผ $lr_{max}$๋กœ ํ‘œ๊ธฐํ•˜์˜€๋‹ค. pre-defined ํƒ€์ž„ ํ”„๋ ˆ์ž„ $T_{warmup}$์ด ์ฃผ์–ด์ง€๋ฉด, ์ฒซ ๋ฒˆ์งธ $T_{warmup}$ ๋ฐ˜๋ณต์„ ์œ„ํ•œ learning rate scheduler๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ •์˜๋œ๋‹ค.

 

$lr(t) = \frac {t}{T_{warmup}}lr_{max}, t \leq T_{warmup}$

 

 After this warm-up stage, the learning rate is set by the usual learning rate scheduler. The paper shows experimentally that the learning rate warm-up stage is essential for training the Post-LN Transformer.
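
As a concrete illustration of the schedule above, here is a minimal Python sketch. The inverse-square-root decay used after the warm-up is an assumption (it is the schedule commonly paired with the original Transformer), and the default values of `lr_max` and `T_warmup` are arbitrary.

```python
def learning_rate(t, lr_max=1e-3, T_warmup=4000):
    """lr(t) = t / T_warmup * lr_max for t <= T_warmup,
    then decays as sqrt(T_warmup / t) (an assumed, commonly used choice)."""
    if t <= T_warmup:
        return lr_max * t / T_warmup
    return lr_max * (T_warmup / t) ** 0.5

# The learning rate ramps up linearly, peaks at lr_max, then decays.
for step in (1, 2000, 4000, 16000):
    print(step, learning_rate(step))
```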

 

Results and discussions  The paper records a model checkpoint every epoch during training and computes the validation loss and BLEU score. The performance of the models is shown in Figure 2(a) and Figure 2(b). The $x$-axis is the number of epochs and the $y$-axis is the BLEU score / validation loss. "w/o warm-up" means "without the warm-up stage", while "w/ warm-up" means "with the warm-up stage".

 

Figure 2. Performance of the models optimized by Adam and SGD

 

 First, the learning rate warm-up stage turns out to be essential for both optimizers. The model trained with the Adam optimizer without the warm-up stage achieved a BLEU score of only 8.45. In comparison, the model trained with the warm-up stage achieved a BLEU score of around 34. The same trend can be observed in the validation loss curves. Although the model trained with SGD performed worse than the one trained with Adam, the trend was the same: the model trained without the warm-up stage reached a BLEU score only slightly above 0 even after 15 epochs of training.

 

 Second, the optimization process is sensitive to the value of $T_{warmup}$, which means that $T_{warmup}$ is an important hyper-parameter when training the Post-LN Transformer. For example, when setting $T_{warmup} = 500$, the models trained with Adam achieved BLEU scores of only 31.16 and 2.77 for $lr_{max}=5e^{-4}$ and $1e^{-3}$, respectively.

 

 This warm-up stage has several disadvantages. First, its configuration significantly affects the final performance, so practitioners need careful hyper-parameter tuning, which is computationally expensive for large-scale NLP tasks. Second, the warm-up stage slows down optimization. Standard optimization algorithms usually start with a large learning rate for fast convergence, but with the warm-up stage the learning rate has to increase gradually from zero, which makes training inefficient. Previous work suggests that the warm-up stage reduces the undesirably large variance of Adam in the early stage of training. However, the results in this paper show that the warm-up stage also helps the training of SGD, which suggests that its benefit is not limited to a particular optimizer.

 

2-3. Understanding the Transformer at initialization (Pre-LN Transformer)

 

 ๋…ผ๋ฌธ์—์„œ ์ œ์•ˆ๋œ Pre-LN Transformer๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

Figure 3. Pre-LN Transformer

 

 Looking at Figure 3, the position of LN is changed compared with the Post-LN Transformer. This is the main difference between the Post-LN Transformer and the Pre-LN Transformer.
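
Mirroring the Post-LN sketch in Section 2-1, here is a minimal PyTorch-style sketch of the Pre-LN layer, where LN is applied inside the residual connection, together with the additional final layer normalization before the prediction. The class names and hyper-parameter values are again illustrative assumptions, not from the paper.

```python
import torch
import torch.nn as nn

class PreLNLayer(nn.Module):
    """One Pre-LN Transformer encoder layer: x + sublayer(LN(x)),
    i.e. LN sits inside the residual connection."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h)[0]   # residual around the normalized attention branch
        x = x + self.ffn(self.ln2(x))   # residual around the normalized FFN branch
        return x

class PreLNEncoder(nn.Module):
    """Stack of Pre-LN layers with the additional final LayerNorm."""
    def __init__(self, n_layers=6, d_model=512):
        super().__init__()
        self.layers = nn.ModuleList([PreLNLayer(d_model) for _ in range(n_layers)])
        self.final_ln = nn.LayerNorm(d_model)  # applied before the prediction head

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return self.final_ln(x)

# Example usage: a batch of 2 sequences of length 10
enc = PreLNEncoder()
out = enc(torch.randn(2, 10, 512))
```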

 

Gradient of Weight Parameter  Intuitively, if a random variable $Z$ is $(\epsilon, \delta)$-bounded, then with high probability its realization will not be too far away from its expectation. For the Post-LN Transformer, the gradient of the parameters of the last layer (the $L$-th layer) satisfies the following.

 

$\left\| \frac{\partial \tilde{\mathfrak{L}}}{\partial W^{2,L}} \right\|_{F} \leq O(d \sqrt{\ln d})$

 

 In contrast, the gradient of the $L$-th layer of the Pre-LN Transformer satisfies the following.

 

$\left\| \frac{\partial \tilde{\mathfrak{L}}}{\partial W^{2,L}} \right\|_{F} \leq O\left(d \sqrt{\frac{\ln d}{L}}\right)$

 

 Post-LN Transformer์˜ ๊ฒฝ์šฐ ๋งˆ์ง€๋ง‰ FFN layer์— ๋Œ€ํ•œ ๊ธฐ์šธ๊ธฐ์˜ ์Šค์ผ€์ผ์ด $L$๊ณผ ๋ฌด๊ด€ํ•œ $O(d \sqrt{ln d})$ ์ฐจ์ˆ˜์ž„์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค. Pre-LN Transformer์˜ ๊ฒฝ์šฐ์—๋Š”, ๊ธฐ์šธ๊ธฐ์˜ ์Šค์ผ€์ผ์ด ํ›จ์”ฌ ์ž‘์€ $O(d \sqrt{\frac {ln d}{L}})$์ด๋‹ค.

 

Scale of Hidden States  The scale of the hidden states in different layers is also measured, where the expectation is taken over the randomness of the input and the initialization. If $X \in \mathbb{R}^{d}$ is a Gaussian vector $X \sim N(0, \sigma^{2}I_d)$, then $\mathbb{E}(||ReLU(X)||_{2}^{2})=\frac {1}{2}\sigma^{2}d$. At initialization, $\mathbb{E}(||x_{l,i}^{post, 5}||_{2}^{2})=\frac {3}{2}d$ for the Post-LN Transformer, while $(1 + \frac {l}{2})d \leq \mathbb{E}(||x_{l,i}^{pre}||_{2}^{2}) \leq (1 + \frac {3l}{2})d$ for the Pre-LN Transformer.
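
The ReLU identity above is easy to check numerically. The following quick Monte Carlo sketch (with the assumed values $\sigma = 1$ and $d = 512$) should print two numbers close to $\frac{1}{2}\sigma^{2}d = 256$.

```python
import torch

d, sigma, n_samples = 512, 1.0, 20_000
x = sigma * torch.randn(n_samples, d)          # X ~ N(0, sigma^2 I_d)
val = torch.relu(x).pow(2).sum(dim=1).mean()   # Monte Carlo estimate of E[||ReLU(X)||_2^2]
print(val.item(), 0.5 * sigma**2 * d)          # both should be about 256
```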

 

Advantage

 

Figure 3-1. (a, b) Gradient norms of different layers in the 6-6 Transformer; (c, d) gradient norm of $W^{2,L}$ for Transformers of different depths

 

 ์œ„์˜ ๊ทธ๋ฆผ์—์„œ ๋‚˜์™€์žˆ๋Š” ๊ฒƒ์ฒ˜๋Ÿผ, ์˜ˆ์ƒ๋˜๋Š” ๊ธฐ์šธ๊ธฐ์˜ scale์€ Post-LN Transformer์— ๋Œ€ํ•œ layer index์™€ ํ•จ๊ป˜ ์„ฑ์žฅํ•œ๋‹ค. ์ด์™€๋Š” ๋ฐ˜๋Œ€๋กœ, Pre-LN Transformer์—์„œ ์„œ๋กœ ๋‹ค๋ฅธ ๋ ˆ์ด์–ด์— ๋Œ€ํ•ด scale์€ ๊ฑฐ์˜ ๋˜‘๊ฐ™์€ ๊ฐ’์„ ์œ ์ง€ํ•œ๋‹ค. ์—ฌ๊ธฐ์„œ main idea๋Š” LN์ด ๊ธฐ์šธ๊ธฐ๋ฅผ ์ •๊ทœํ™”ํ•  ๊ฒƒ์ด๋ผ๋Š” ๊ฒƒ์ด๋‹ค.

 

 Post-LN Transformer์—์„œ LN์— ๋Œ€ํ•œ ์ž…๋ ฅ์˜ ์Šค์ผ€์ผ์€ $L$๊ณผ ๋ฌด๊ด€ํ•˜๋ฏ€๋กœ ๋งˆ์ง€๋ง‰ ๊ณ„์ธต์˜ ๋งค๊ฐœ๋ณ€์ˆ˜ ๊ธฐ์šธ๊ธฐ๋Š” $L$๊ณผ ๋ฌด๊ด€ํ•˜๋‹ค. Pre-LN Transformer์— ์žˆ๋Š” ๋™์•ˆ ์ตœ์ข… LN์— ๋Œ€ํ•œ ์ž…๋ ฅ์˜ ์Šค์ผ€์ผ์€ $L$์—์„œ ์„ ํ˜•์ด๋ฏ€๋กœ ๋ชจ๋“  ๋งค๊ฐœ๋ณ€์ˆ˜์˜ ๊ธฐ์šธ๊ธฐ๋Š” $\sqrt{L}$๋กœ ์ •๊ทœํ™”๋œ๋‹ค. 

 

 

3. Experiment Results

 ์ด์ „ ์„น์…˜์—์„œ Pre-LN Transformer๋ฅผ ์œ„ํ•œ ์ดˆ๊ธฐํ™” ์‹œ์— ๊ธฐ์šธ๊ธฐ๊ฐ€ ์ž˜ ์ž‘๋™ํ•œ๋‹ค๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ์—ˆ๋‹ค. ์ด๋Ÿฌํ•œ ์‚ฌ์‹ค์„ ๊ธฐ๋ฐ˜์œผ๋กœ, Pre-LN Transformer๋ฅผ ํ•™์Šต์‹œํ‚ฌ ๋•Œ learning rate warm-up stage๊ฐ€ ์•ˆ์ „ํ•˜๊ฒŒ ์ œ๊ฑฐ๋  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์คฌ๋‹ค. ์ด๋ฒˆ ์„น์…˜์—์„œ๋Š” ์ด๋ฅผ 2๊ฐœ์˜ NLP task์— ๋Œ€ํ•ด ์‹คํ—˜์ ์œผ๋กœ ์ž…์ฆํ•˜๋„๋ก ํ•˜์˜€๋‹ค. ์ด ๋‘ ๊ฐœ์˜ NLP task๋Š” Machine Translation๊ณผ Unsupervised Pre-Training$($BERT$)$์ด๋‹ค.

 

Machine Translation  The performance of the models on this task is shown in Figure 4, panels (a) through (d). The results can be summarized as follows.

 

  1. ๋” ์ด์ƒ learning rate warm-up stage๊ฐ€ ๊ฐ•๋ ฅํ•˜์ง€ ์•Š์Œ. ๋”ฐ๋ผ์„œ Pre-LN Transformer๋„ ๊ฒฝ์Ÿ๋ ฅ์„ ๊ฐ–์ถ”๊ฒŒ ๋˜์—ˆ์Œ.
  2. Pre-LN Transformer๊ฐ€ Post-LN Transformer๋ณด๋‹ค ๋˜‘๊ฐ™์€ $lr_{max}$์— ๋Œ€ํ•ด ๋” ๋นจ๋ฆฌ ์ˆ˜๋ ดํ•จ.
  3. LN ์œ„์น˜์˜ ๋ณ€ํ™”๊ฐ€ optimizer์˜ ๋ณ€ํ™”๋ฅผ '์ง€๋ฐฐ'ํ•œ๋‹ค๋Š” ์‚ฌ์‹ค ๋ฐœ๊ฒฌ

 

Figure 4. Performance of the models on the Machine Translation task

 

Unsupervised Pre-training (BERT)  The validation loss of the models is recorded as in Figure 5(a). Similar to the machine translation task, the learning rate warm-up stage can be removed for the Pre-LN Transformer, and the Pre-LN Transformer then trains faster. The Pre-LN Transformer can also be optimized more easily with larger learning rates. In the experiments, different model checkpoints are fine-tuned on the MRPC and RTE downstream tasks. The results are shown in Figure 5(b) and (c): the Pre-LN Transformer also converges faster on the downstream tasks.

 

Figure 5. Performance of the models on unsupervised pre-training (BERT) and downstream tasks

 

 In summary, all the experiments on the different tasks show that training the Pre-LN Transformer does not rely on the learning rate warm-up stage and can be done much faster than training the Post-LN Transformer.

 

 

 

 

Source

  • Review - Pre-LN Transformer: On Layer Normalization in the Transformer Architecture (sh-tsang.medium.com): https://sh-tsang.medium.com/review-pre-ln-transformer-on-layer-normalization-in-the-transformer-architecture-b6c91a89e9ab
  • Paper: https://arxiv.org/pdf/2002.04745.pdf

 
