
Let's Verify Step by Step: Paper Review

Cartinoe 2023. 6. 20. 22:45

Overview of this paper

 Over the last few years, LLMs have improved considerably in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still produce logical errors. To train more reliable models, we can turn either to outcome supervision, which provides feedback on the final result, or to process supervision, which provides feedback on each intermediate reasoning step.

 

 ๋…ผ๋ฌธ์˜ ์‹คํ—˜์„ ํ†ตํ•ด ์–ด๋ ค์šด MATH ๋ฐ์ดํ„ฐ์…‹์˜ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด process supervision์ด outcome supervision์„ ์ƒ๋‹นํžˆ ๋Šฅ๊ฐ€ํ•˜๋Š” ๋ชจ์Šต์„ ๋ณด์—ฌ์คฌ๋‹ค. ๋˜ํ•œ active learning์ด process supervision์˜ ํšจํ—˜์„ ์ƒ๋‹นํžˆ ๊ฐœ์„ ์‹œํ‚จ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค€๋‹ค. ๊ทธ๋ฆฌ๊ณ  80๋งŒ ๊ฐœ์˜ step-level human feedback ๋ผ๋ฒจ์„ ํฌํ•จํ•˜๋Š” ์™„์„ฑ ๋ฐ์ดํ„ฐ์…‹์ธ PRM800K๋ฅผ ๊ณต๊ฐœํ•˜์˜€๋‹ค.

 

 

Table of Contents

1. Introduction

2. Methods

3. Large-Scale Supervision

4. Small-scale Synthetic Supervision

5. OOD Generalization

6. Discussion

 

 

1. Introduction

 LLM์€ step-by-step CoT ํฌ๋งท์—์„œ ์†”๋ฃจ์…˜์„ ์ƒ์„ฑํ•จ์œผ๋กœ์จ ๋ณต์žกํ•œ multi-step ์ถ”๋ก ์„ ํ•„์š”๋กœ ํ•˜๋Š” task๋ฅผ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋‹ค. ํ•˜์ง€๋งŒ ์ž˜๋ชป๋œ ์ •๋ณด๋ฅผ ์ฃผ๊ฑฐ๋‚˜ hallucination์„ ์ผ์œผํ‚ค๊ธฐ๋„ ํ•œ๋‹ค. ๋”ฐ๋ผ์„œ ์ด๋Ÿฌํ•œ hallucination์„ ํƒ์ง€ํ•˜๊ณ  ์™„ํ™”ํ•˜๋Š” ๊ฒƒ์€ ์ถ”๋ก  ๋Šฅ๋ ฅ์„ ๊ฐœ์„ ์‹œํ‚ค๋Š”๋ฐ ํ•„์ˆ˜์ ์ด๋‹ค. 

 

  ํšจ๊ณผ์ ์ธ method ์ค‘ ํ•˜๋‚˜๋Š” ๋ฐ”๋žŒ์งํ•œ output ๊ฐ„์— ๊ตฌ๋ณ„์„ ํ•˜๊ธฐ ์œ„ํ•œ RM ํ•™์Šต์„ ํฌํ•จํ•œ๋‹ค. ์ด ๊ธฐ์ˆ ์€ ์œ ์šฉํ•˜์ง€๋งŒ ๊ฒฐ๊ณผ๋กœ ๋‚˜์˜จ ์‹œ์Šคํ…œ์€ RM ์ž์‹ ์—๊ฒŒ๋งŒ ์‹ ๋ขฐ๋„๊ฐ€ ์žˆ๋‹ค. ๊ทธ๋ž˜์„œ ์–ด๋–ป๊ฒŒ ๊ฐ€์žฅ ํšจ๊ณผ์ ์œผ๋กœ ์‹ ๋ขฐ๋„ ์žˆ๋Š” RM์„ ํ•™์Šต์‹œํ‚ค๋Š”์ง€์— ๋Œ€ํ•œ ์—ฐ๊ตฌ๊ฐ€ ์ค‘์š”ํ•˜๋‹ค.

 

 ์ด์ „ ์—ฐ๊ตฌ์—์„œ RM์„ ํ•™์Šต์‹œํ‚ค๊ธฐ ์œ„ํ•œ 2๊ฐœ์˜ method๋ฅผ ์„ค๋ช…ํ•œ๋‹ค: outcome supervision & process supervision.

 

  • Outcome-supervised reward models(ORM): trained using only the final result of the model's CoT
  • Process-supervised reward models(PRM): receive feedback for each step in the CoT

 

 process supervision์„ ๋” ์„ ํ˜ธํ•˜๋Š” ์ด์œ ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค. ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ•˜๋Š” ์ •ํ™•ํ•œ ์œ„์น˜๋ฅผ ๋ช…์‹œํ•ด์„œ ๋”์šฑ ์ •ํ™•ํ•œ ํ”ผ๋“œ๋ฐฑ์„ ์ œ๊ณตํ•˜๊ณ , process supervision์€ ์ด๋Ÿฌํ•œ misaligned ํŠน์„ฑ์„ ์™„ํ™”ํ•˜๋Š” ๋ชจ์Šต์„ ๋ณด์—ฌ์ค€๋‹ค. 

 

 ํ•˜์ง€๋งŒ ์ด๋Ÿฌํ•œ ์žฅ์ ์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  outcome supervision๊ณผ process supervision์€ ๋น„์Šทํ•œ ์ตœ์ข… ์„ฑ๋Šฅ์„ ์ด๋Œ์–ด๋‚ธ๋‹ค. ๋…ผ๋ฌธ์—์„œ๋Š” 3๊ฐœ์˜ ์ฃผ๋œ ์ฐจ์ด์ ์„ ๊ฐ€์ง€๊ณ  ๋””ํ…Œ์ผํ•œ ๋น„๊ต๋ฅผ ์ง„ํ–‰ํ•˜์˜€๋‹ค.

 

  • a more capable base model
  • significantly more human feedback
  • training and testing on the more challenging MATH dataset

 

 ๋…ผ๋ฌธ์˜ contribution์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

  1. Process supervision trains significantly more reliable RMs than outcome supervision.
  2. A large RM can approximate human supervision for training smaller RMs.
  3. Active learning yields a 2.6x improvement in the data efficiency of process supervision.
  4. The PRM800K supervision dataset is released.

 

2. Methods

 The paper compares outcome & process supervision. Outcome supervision can be performed without humans by simply checking the final answer, but process supervision has no automated alternative, so it relies on human data-labelers.

 

 ๋…ผ๋ฌธ์—์„œ๋Š” 2๊ฐœ์˜ ๋ณ„๊ฐœ์˜ ์˜์—ญ์—์„œ ์‹คํ—˜์„ ์ง„ํ–‰ํ•˜์˜€๋‹ค: large-scale & small-scale. ๊ฐ๊ฐ์€ ๊ฐ๊ฐ์˜ ์žฅ์ ์„ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค. ๋…ผ๋ฌธ์—์„œ๋Š” ๊ฐ€์žฅ ์‹ ๋ขฐ๋„ ์žˆ๋Š” ORM๊ณผ PRM์„ ํ•™์Šต์‹œํ‚ด์œผ๋กœ์จ SoTA๋ฅผ ๋ฐœ์ „์‹œํ‚ค๋Š”๋ฐ ์ดˆ์ ์„ ๋งž์ท„๋‹ค. ๋ถˆํ–‰ํ•˜๊ฒŒ๋„ ์ด RM์— ๋Œ€ํ•œ training set๋Š” ์ง์ ‘์ ์œผ๋กœ ๋น„๊ต๊ฐ€ ๋ถˆ๊ฐ€๋Šฅํ•œ๋ฐ, ์ด ๊ฒฐ์ ์„ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋”์šฑ ์ง์ ‘์ ์ธ ๋น„๊ต๋ฅผ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋Š” ์†Œ๊ทœ๋ชจ์˜ ๋ชจ๋ธ ๋˜ํ•œ ํ•™์Šต์‹œ์ผฐ๋‹ค. ๋˜ํ•œ ๊ฐ’๋น„์‹ผ human feedback์— ๋Œ€ํ•œ ์˜์กด๋„๋ฅผ ์ œ๊ฑฐํ•˜๊ธฐ ์œ„ํ•ด ๋Œ€๊ทœ๋ชจ ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•ด์„œ ์†Œ๊ทœ๋ชจ training์„ ์ง„ํ–‰ํ•˜์˜€๋‹ค.

 

2-1. Scope

 

 ๊ฐ ๋ชจ๋ธ ๊ทœ๋ชจ์—์„œ, ๋ชจ๋“  ์†”๋ฃจ์…˜์„ ์ƒ์„ฑํ•˜๊ธฐ ์œ„ํ•ด ํ•˜๋‚˜์˜ ๊ณ ์ •๋œ ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•˜์˜€๋‹ค. ์ด ๋ชจ๋ธ์„ generator๋ผ๊ณ  ๋ถ€๋ฅธ๋‹ค. generator๋ฅผ ๊ฐœ์„ ์‹œํ‚ค๊ธฐ ์œ„ํ•ด ๋”ฐ๋กœ RL์€ ์‚ฌ์šฉํ•˜์ง€ ์•Š์•˜๋Š”๋ฐ, ์ด๋Š” RL์„ ์‚ฌ์šฉํ•ด์„œ ํ•™์Šต์‹œํ‚ค๋ฉด generator๊ฐ€ RM์œผ๋กœ๋ถ€ํ„ฐ ์–ด๋– ํ•œ supervision๋„ ๋ฐ›์ง€ ์•Š๊ธฐ ๋•Œ๋ฌธ์—, ์ด๋Š” ๋…ผ๋ฌธ์˜ ์ทจ์ง€์™€ ๋งž์ง€ ์•Š์•„์„œ ๋ฐฐ์ œํ•˜์˜€๋‹ค. ๊ทธ ๋Œ€์‹ ์— ์–ด๋–ป๊ฒŒ ๊ฐ€์žฅ ์‹ ๋ขฐ๋„ ์žˆ๋Š” RM์„ ํ•™์Šต์‹œํ‚ฌ ์ˆ˜ ์žˆ๋Š”์ง€์— ๋Œ€ํ•ด ์ดˆ์ ์„ ๋‘์—ˆ๋‹ค.

 

2-2. Base Models

 

 All large-scale models were fine-tuned from the base GPT-4 model, which was pre-trained only with next-token prediction and not trained with RLHF. The small-scale models are similar in design to GPT-4 but were pre-trained with roughly 200x less compute. As an additional pre-training step, all models were further fine-tuned on the MathMix dataset, which improved their mathematical reasoning ability.

 

2-3. Generator

 

 ๊ฐœ๋ณ„ step์˜ ๋ถ„์„์„ ์‰ฝ๊ฒŒ ํ•˜๊ธฐ ์œ„ํ•ด step-by-step ํ˜•์‹์œผ๋กœ ์†”๋ฃจ์…˜์„ ์ƒ์„ฑํ•˜๋„๋ก generator๋ฅผ ํ•™์Šต์‹œ์ผฐ๋‹ค. ๊ตฌ์ฒด์ ์œผ๋กœ MATH training ๋ฌธ์ œ์— ๋Œ€ํ•œ ์†”๋ฃจ์…˜์„ few-shot์œผ๋กœ ์ƒ์„ฑํ•˜๊ณ , ์•Œ๋งž์€ ์ตœ์ข… ์‘๋‹ต์— ๋„๋‹ฌํ•˜๋Š”์ง€ ํ•„ํ„ฐ๋งํ•˜๊ณ , ์ด ๋ฐ์ดํ„ฐ์…‹์—์„œ base model์„ 1 epoch์—์„œ fine-tune ํ•˜์˜€๋‹ค. ์ด ์Šคํ…์€ generator๊ฐ€ ๋ชฉํ‘œ๋กœ ํ•˜๋Š” ํ˜•์‹์œผ๋กœ ์†”๋ฃจ์…˜์„ ์ƒ์„ฑํ•˜๋„๋ก ๊ฐ€๋ฅด์นœ๋‹ค.

 

2-4. Data Collection

 

 process supervision ๋ฐ์ดํ„ฐ๋ฅผ ์ˆ˜์ง‘ํ•˜๊ธฐ ์œ„ํ•ด human data-labeler์™€ ๋Œ€๊ทœ๋ชจ generator์— ์˜ํ•ด ์ƒ˜ํ”Œ๋ง๋œ MATH problem์— ๋Œ€ํ•œ ์ด๋“ค์˜ task๋Š” ์†”๋ฃจ์…˜๊ณผ ํ•จ๊ป˜ ๋‚˜ํƒ€๋‚ฌ๋‹ค. ์ด๋“ค์˜ task๋Š” ๊ทธ๋ฆผ 1์ฒ˜๋Ÿผ ์†”๋ฃจ์…˜์—์„œ ๊ฐ ์Šคํ…์— ๋ผ๋ฒจ(positive, neutral, negative)์„ ์ง€์ •ํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ๋…ผ๋ฌธ์—์„œ neutral ๋ผ๋ฒจ์„ ํ—ˆ๋ฝํ•˜๋Š” ์ด์œ ๋Š” ์–ด๋–ป๊ฒŒ ๋ชจํ˜ธ์„ฑ์„ ๋‹ค๋ฃฐ ์ˆ˜ ์žˆ๋Š”์ง€์— ๋Œ€ํ•œ ๊ฒฐ์ •์„ ๋‹ฌ๋ฆฌํ•˜๋Š” ๊ฒƒ์„ ํ—ˆ๋ฝํ•ด ์ฃผ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.

 

Figure 1. Screenshot of the interface used to collect feedback

 

 ๋…ผ๋ฌธ์—์„œ๋Š” ์†”๋ฃจ์…˜์„ ์ œํ•œ๋œ human-data ๋ฆฌ์†Œ์Šค์˜ ๊ฐ’์„ ์ตœ๋Œ€ํ™”ํ•˜๊ธฐ ์œ„ํ•ด ๋Œ€๊ทœ๋ชจ generator๋กœ๋ถ€ํ„ฐ ๋ผ๋ฒจ๋ง ํ•˜์˜€๋‹ค. ๋…ผ๋ฌธ์—์„œ๋Š” ์ˆ˜์ง‘๋œ step-level ๋ผ๋ฒจ์˜ ์ „์ฒด ๋ฐ์ดํ„ฐ์…‹์„ PRM800K ๋ผ๊ณ  ์ง€์นญํ•˜์˜€๋‹ค. ์ด ๋ฐ์ดํ„ฐ์…‹์€ 12K ๊ฐœ์˜ ๋ฌธ์ œ์— ๋Œ€ํ•œ 75K ๊ฐœ์˜ ์†”๋ฃจ์…˜์— ๋Œ€ํ•œ 800K ๊ฐœ์˜ step-level ๋ผ๋ฒจ๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ๋‹ค. ์˜ค๋ฒ„ํ”ผํŒ…์„ ์ตœ์†Œํ™”ํ•˜๊ธฐ ์œ„ํ•ด 4.5K ๊ฐœ์˜ MATH test ๋ฌธ์ œ๋ฅผ PRM800K training set์— ํฌํ•จ์‹œ์ผฐ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๋‚จ์€ 500๊ฐœ์˜ MATH test ๋ฌธ์ œ์— ๋Œ€ํ•ด ๋ชจ๋ธ์„ ํ‰๊ฐ€ํ•˜์˜€๋‹ค.

 

 ๋ฐ์ดํ„ฐ ์ˆ˜์ง‘ ์ค‘์— ์–ด๋–ค ์†”๋ฃจ์…˜์ด ๋ฐ์ดํ„ฐ ๋ผ๋ฒจ๋Ÿฌ์—๊ฒŒ ๋ณด์—ฌ์ ธ์•ผ ํ• ์ง€ ๊ฒฐ์ •๋˜์–ด์•ผ๋งŒ ํ•˜๋Š”๋ฐ, ์ด๊ฒƒ์€ best RM์„ ์ตœ๋Œ€ํ•œ ์ž˜ ์†์ด๋Š” ์†”๋ฃจ์…˜์„ ์„ ํ˜ธํ•œ๋‹ค. ๊ทธ๋ž˜์„œ convincing wrong-answer๋ฅผ ์„ ํƒํ•˜์˜€๋‹ค. ์—ฌ๊ธฐ์„œ convincing์€ PRM์œผ๋กœ๋ถ€ํ„ฐ ๋†’๊ฒŒ ํ‰๊ฐ€๋œ ์†”๋ฃจ์…˜์ด๊ณ , wrong-answer๋Š” incorrect final answer์— ๋„๋‹ฌํ•˜๋Š” ์†”๋ฃจ์…˜์„ ๋งํ•œ๋‹ค. ๋…ผ๋ฌธ์—์„œ๋Š” ์ด ์•ฝ๊ฐ„ ์žฅํ™ฉํ•œ ํ‘œํ˜„์„ ์‚ฌ์šฉํ•˜์—ฌ ์ •ํ™•์„ฑ์ด ์ „์ ์œผ๋กœ final answer๋ฅผ ํ™•์ธํ•จ์œผ๋กœ์จ ๊ฒฐ์ •๋œ๋‹ค๋Š” ์‚ฌ์‹ค์„ ๊ฐ•์กฐํ•˜์˜€๋‹ค. ์ด ๊ณผ์ •์€ ๋•Œ๋•Œ๋กœ ์ž˜๋ชป๋œ ๋‹ต์œผ๋กœ ์ด์–ด์ง€๊ธฐ๋„ ํ•œ๋‹ค. convincing wrong answer ์†”๋ฃจ์…˜์„ ๋ผ๋ฒจ๋ง ํ•˜๋Š” ๊ฒƒ์œผ๋กœ๋ถ€ํ„ฐ ๋งŽ์€ ์ •๋ณด๋ฅผ ์–ป์„ ๊ฒƒ์ด๋ผ ์˜ˆ์ƒํ•˜์˜€๋Š”๋ฐ, ์™œ๋ƒํ•˜๋ฉด PRM์€ ์ด๋Ÿฌํ•œ ๊ฐ ์†”๋ฃจ์…˜์—์„œ ์ตœ์†Œ ํ•œ ์Šคํ…์— ๋Œ€ํ•ด ์‹ค์ˆ˜ํ•œ๋‹ค๋Š” ๊ฒƒ์„ ์•Œ๊ณ  ์žˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. 

 

 Furthermore, using this selection strategy, the PRM was iteratively retrained on the latest data at several points during the data collection process.

 

2-5. Outcome-supervised Reward Models(ORM)

 

 ๋…ผ๋ฌธ์—์„œ๋Š” generator๋กœ๋ถ€ํ„ฐ ๋ฌธ์ œ ๋‹น ๊ณ ์ •๋œ ์ˆ˜์˜ ์†”๋ฃจ์…˜์„ ๊ท ์ผํ•˜๊ฒŒ ์ƒ˜ํ”Œ๋งํ•˜๊ณ , ๊ฐ ์†”๋ฃจ์…˜์ด ๋งž๋Š”์ง€ ๋˜๋Š” ์•ˆ ๋งž๋Š”์ง€ ์˜ˆ์ธกํ•˜๋„๋ก ORM์„ ํ•™์Šต์‹œ์ผฐ๋‹ค. test time์— ORM์˜ ์˜ˆ์ธก์„ ์†”๋ฃจ์…˜์— ๋Œ€ํ•œ ์ „๋ฐ˜์ ์ธ score์˜ ์ตœ์ข… ํ† ํฐ์œผ๋กœ ์‚ฌ์šฉํ•˜์˜€๋‹ค. ๋˜ํ•œ ORM ํƒ€๊นƒ์ด ์™„๋ฒฝํžˆ ์‹ ๋ขฐ๋˜๋Š”์ง€, ๊ทธ๋ ‡์ง€ ์•Š์€์ง€๋ฅผ ์ž๋™ํ™”๋œ grading์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ฒฐ์ •ํ•˜์˜€๋‹ค: incorrect ์ถ”๋ก ์œผ๋กœ correct answer์— ๋„๋‹ฌํ•˜๋Š” false positive ์†”๋ฃจ์…˜์€ ์ž˜๋ชป grading ๋  ๊ฒƒ.

 

2-6. Process-supervised Reward Models(PRM)

 

 ๋…ผ๋ฌธ์—์„œ๋Š” ๊ฐ ์Šคํ…์—์„œ ์ตœ์ข… ํ† ํฐ ์ดํ›„์˜ ๊ฐ ์Šคํ…์˜ ์ •ํ™•๋„๋ฅผ ์˜ˆ์ธกํ•˜๊ธฐ ์œ„ํ•ด PRM์„ ํ•™์Šต์‹œ์ผฐ๋‹ค. ์ด ์˜ˆ์ธก์€ single token์˜ ํ˜•ํƒœ๋ฅผ ๊ฐ€์ง€๊ณ , training ์ค‘์— ํƒ€๊นƒ ํ† ํฐ์˜ log-likelihood๋ฅผ ์ตœ๋Œ€ํ™”์‹œํ‚จ๋‹ค. ํ…Œ์ŠคํŠธ ์‹œ์— step-level ์˜ˆ์ธก์„ ๊ฒฐ์ •ํ•˜๊ธฐ ์œ„ํ•ด PRM์ด suffix ๋ผ์„œ ์ „์ฒด ์†”๋ฃจ์…˜์— ๋Œ€ํ•ด ํ•˜๋‚˜์˜ PRM forward pass๋ฅผ ์ˆ˜ํ–‰ํ•œ๋‹ค. ๋…ผ๋ฌธ์—์„œ๋Š” ๊ทธ๋ฆผ 2์— 2๊ฐœ์˜ ์„œ๋กœ ๋‹ค๋ฅธ ์†”๋ฃจ์…˜์— ๋Œ€ํ•œ ๋Œ€๊ทœ๋ชจ PRM score๋ฅผ ์‹œ๊ฐํ™”ํ•˜์˜€๋‹ค. ๋‹ค์–‘ํ•œ ์†”๋ฃจ์…˜์„ ๋น„๊ตํ•˜๊ธฐ ์œ„ํ•ด ๊ฐ ์†”๋ฃจ์…˜์— ๋Œ€ํ•œ single score ๊ณ„์‚ฐ์€ ํ•„์ˆ˜์ ์ด๋‹ค.

 

Figure 2. Two solutions to the same problem, evaluated by the PRM. The solution on the left is correct, while the one on the right is incorrect. Green indicates a high PRM score; red indicates a low PRM score.

 

 process supervision์„ ์ œ๊ณตํ•  ๋•Œ ์˜๋„์ ์œผ๋กœ ์ฒซ ๋ฒˆ์งธ ์ž˜๋ชป๋œ ๋‹จ๊ณ„๊นŒ์ง€๋งŒ supervise ํ•˜๋„๋ก ์„ ํƒํ•˜์˜€๋‹ค. ์ด๊ฒƒ์€ outcome supervision๊ณผ process supervision ๊ฐ„์˜ ๋น„๊ต๋ฅผ ๋”์šฑ ์ง๊ด€์ ์œผ๋กœ ๋งŒ๋“ค์–ด์คฌ๋‹ค. ์•Œ๋งž์€ ์†”๋ฃจ์…˜์— ๋Œ€ํ•ด ๋‘ method๋Š” ๋ชจ๋“  ์Šคํ…์ด ์•Œ๋งž๋‹ค๋Š” ๋˜‘๊ฐ™์€ ์ •๋ณด๋ฅผ ์ œ๊ณตํ•œ๋‹ค. ์•Œ๋งž์ง€ ์•Š์€ ์†”๋ฃจ์…˜์— ๋Œ€ํ•ด 2๊ฐœ์˜ method๋Š” ์ตœ์†Œ ํ•˜๋‚˜์˜ ์‹ค์ˆ˜์˜ ์กด์žฌ๋ฅผ ๋ฐํ˜€๋‚ด๊ณ , process supervision์ด ์ œ๊ณต๋˜๋ฉด proces supervision์€ ๋”์šฑ ํฐ ์ •๋ณด์  ์žฅ์ ์„ ๊ฐ€์ง€๊ฒŒ ๋  ๊ฒƒ์ด๋‹ค. ์ด๋Ÿฌํ•œ ๊ฒฐ์ •์€ ์‚ฌ๋žŒ๊ณผ ๋น„์Šทํ•œ ๋ผ๋ฒจ๋ง ๋น„์šฉ์„ ์œ ์ง€ํ•œ๋‹ค: ํ™•์ธํ•˜๊ธฐ ์‰ฌ์šด final answer์— ์˜์กดํ•˜์ง€ ์•Š๊ณ  ์†”๋ฃจ์…˜์˜ ์ •ํ™•์„ฑ์„ ๊ฒฐ์ •ํ•˜๋Š” ๊ฒƒ์€ ์ฒซ ๋ฒˆ์งธ ์‹ค์ˆ˜๋ฅผ ์‹๋ณ„ํ•˜๋Š” ๊ฒƒ๊ณผ ๊ฐ™๋‹ค. ๋Œ€๋ถ€๋ถ„์˜ MATH problem์€ easy-to-check final anser๋ฅผ ๊ฐ€์ง€์ง€๋งŒ, ๋…ผ๋ฌธ์—์„œ๋Š” ์ด๊ฒƒ์ด ๋”์šฑ ๋ณต์žกํ•œ ๋„๋ฉ”์ธ์—์„œ๋„ ์œ ์ง€๋  ๊ฒƒ์ด๋ผ๊ณ ๋Š” ์ƒ๊ฐํ•˜์ง€ ์•Š๋Š”๋‹ค๊ณ  ์˜ˆ์ƒํ•˜์˜€๋‹ค.

 

 

3. Large-scale Supervision

 ๋…ผ๋ฌธ์—์„œ๋Š” PRM800K์˜ step-level ๋ผ๋ฒจ์„ ์‚ฌ์šฉํ•ด์„œ ๋Œ€๊ทœ๋ชจ PRM์„ ํ•™์Šต์‹œ์ผฐ๋‹ค. ๋Œ€๊ทœ๋ชจ ORM baseline์ด ๊ฐ•๋ ฅํ•˜๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์žฅํ•˜๊ธฐ ์œ„ํ•ด ๋ฌธ์ œ ๋‹น 100๊ฐœ์˜ generator๋กœ๋ถ€ํ„ฐ ์ƒ์„ฑ๋œ ๊ท ์ผํ•œ ์ƒ˜ํ”Œ์—์„œ ํ•™์Šต์‹œ์ผฐ๋‹ค. ์ด๊ฒƒ์€ PRM800K์™€ ์˜ค๋ฒ„๋žฉ์ด ์—†๋‹ค๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•œ๋‹ค. ๋น„๋ก ์ด๋Ÿฌํ•œ 2๊ฐœ์˜ training set๋Š” ์ง์ ‘์ ์œผ๋กœ ๋น„๊ตํ•  ์ˆ˜ ์—†์ง€๋งŒ, ๊ฐ๊ฐ์€ SoTA๋ฅผ ๋ฐœ์ „์‹œํ‚ค๊ธฐ ์œ„ํ•œ ์ตœ๊ณ ์˜ ์‹œ๋„๋ฅผ ๋‚˜ํƒ€๋‚ธ๋‹ค. ๋…ผ๋ฌธ์—์„œ๋Š” ORM์„ ํ•™์Šต์‹œํ‚ค๋Š” ๊ฒƒ์€ ๋ฌธ์ œ๊ฐ€ ์žˆ์„ ์ˆ˜ ์žˆ๋Š”๋ฐ, ์™œ๋ƒํ•˜๋ฉด active learning ์ „๋žต์ด wrong-anser ์†”๋ฃจ์…˜์— ๋Œ€ํ•œ ๋ฐ์ดํ„ฐ์…‹์— ํฌ๊ฒŒ ํŽธํ–ฅ๋˜์—ˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ๊ท ์ผํ•˜๊ฒŒ ์ƒ˜ํ”Œ๋ง๋œ ์†”๋ฃจ์…˜์„ ๋ฌถ์Œ์œผ๋กœ์จ PRM800K ์†”๋ฃจ์…˜์˜ superset์—์„œ ORM์˜ explore training์„ ์ˆ˜ํ–‰ํ•˜์ง€๋งŒ, ORM ์„ฑ๋Šฅ์€ ๊ฐœ์„ ๋˜์ง€ ์•Š์•˜๋‹ค.

 

 ๊ทธ๋ฆผ 3์€ $N$์˜ ํ•จ์ˆ˜๋ฅผ ๋‹ค์–‘ํ•˜๊ฒŒ ํ•จ์— ๋”ฐ๋ผ ๊ฐ RM์˜ best-of-N ์„ฑ๋Šฅ์ด ์–ด๋–ป๊ฒŒ ๋‹ฌ๋ผ์ง€๋Š”์ง€ ๋ณด์—ฌ์ฃผ๊ณ  ์žˆ๋‹ค. majority voting์€ ๊ฐ•๋ ฅํ•œ baseline์œผ๋กœ ์•Œ๋ ค์ ธ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ์ด method๋ฅผ ๋น„๊ต์˜ ํฌ์ธํŠธ๋กœ ํฌํ•จํ•˜์˜€๋‹ค. ORM์€ majority voting๋ณด๋‹ค ์‚ด์ง ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ฃผ์ง€๋งŒ, PRM์€ ์ด ๋‘˜์„ ๊ฐ•ํ•˜๊ฒŒ ๋Šฅ๊ฐ€ํ•œ๋‹ค. ์ด๊ฒƒ์€ PRM์ด ๋งŽ์€ ์ˆ˜์˜ model-generated ์†”๋ฃจ์…˜์— ๋Œ€ํ•ด ๊ฒ€์ƒ‰ํ•˜๋Š”๋ฐ ORM๊ณผ majority voting ๋ณด๋‹ค ๋” ํšจ๊ณผ์ ์ด๋ผ๋Š” ๊ฒƒ์„ ๊ฐ€๋ฆฌํ‚จ๋‹ค. PRM๊ณผ majority voting์˜ ์ด์ต์„ ๋ฌถ๊ธฐ ์œ„ํ•ด RM-weighted voting์„ ์‚ฌ์šฉํ•ด์„œ ์‹คํ—˜์„ ์ง„ํ–‰ํ•˜์˜€๋‹ค. ํ•˜์ง€๋งŒ, ์„ฑ๋Šฅ์„ ๋ˆˆ์— ๋„๊ฒŒ ๊ฐœ์„ ์‹œํ‚ค์ง€๋Š” ์•Š์•˜๋‹ค. 

 

Figure 3. Comparison between outcome-supervised & process-supervised RMs

 

 

4. Small-scale Synthetic Supervision

 PRM์€ ๋Œ€๊ทœ๋ชจ์—์„œ ORM์„ ๋Šฅ๊ฐ€ํ•˜์ง€๋งŒ, ์ด ๊ฒฐ๊ณผ ํ˜ผ์ž๋กœ๋Š” ์™„๋ฒฝํ•˜์ง€ ์•Š์€ ๊ทธ๋ฆผ์„ ๊ทธ๋ฆฐ๋‹ค. outcome & process supervision์„ ๋” ์ž˜ ๋น„๊ตํ•˜๊ธฐ ์œ„ํ•ด, ์—ฌ๊ธฐ์—๋Š” ๊ณ ๋ฆฝ๋˜์–ด์•ผ๋งŒ ํ•˜๋Š” 2๊ฐœ์˜ ํ˜ผ๋ž€์— ๋น ๋œจ๋ฆฌ๋Š” ์š”์ธ์ด ์žˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์ด ์š”์ธ๋“ค์€ ORM์˜ ์„ฑ๋Šฅ์— ํ•ด๋ฅผ ๊ฐ€ํ•  ์ˆ˜๋„ ์žˆ๋‹ค.

 

  1. The training sets for the ORM and PRM are not directly comparable.
  2. Final-answer grading provides positive labels to spurious solutions that reach the correct answer through incorrect reasoning.

 

 human feedback ์ˆ˜์ง‘์˜ ๋น„์‹ผ ๋น„์šฉ ๋•Œ๋ฌธ์— human labeler๋ฅผ ์‚ฌ์šฉํ•ด์„œ ์‰ฝ๊ฒŒ ablate ํ•  ์ˆ˜ ์—†๋‹ค. ๊ทธ ๋Œ€์‹ ์— ๋Œ€๊ทœ๋ชจ RM์„ ์‚ฌ์šฉํ•ด์„œ smaller model์„ supervise ํ•˜๋Š” ๋ฐ ์‚ฌ์šฉํ•จ์œผ๋กœ์จ ์—ฐ๊ด€๋œ ablation์„ ์ˆ˜ํ–‰ํ•˜์˜€๋‹ค. ์ด ์…‹์—…์€ ์ˆ˜์ˆ˜ํ•œ ๋น„์šฉ์—์„œ ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ ์ˆ˜์ง‘์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ด ์ค€๋‹ค.

 

4-1. Process vs. Outcome Supervision

 

 ๋…ผ๋ฌธ์—์„œ๋Š” outcome & process supervision์˜ ์ง์ ‘์ ์ธ ๋น„๊ต๋ฅผ ์ˆ˜ํ–‰ํ•˜์˜€๋‹ค. ์†Œ๊ทœ๋ชจ generator๋กœ๋ถ€ํ„ฐ ๋ฌธ์ œ ๋‹น 1๊ฐœ๋ถ€ํ„ฐ 200๊ฐœ์˜ ์†”๋ฃจ์…˜์„ ์ƒ˜ํ”Œ๋งํ•˜์˜€๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๊ฐ ๋ฐ์ดํ„ฐ์…‹์— ๋Œ€ํ•ด 3๊ฐ€์ง€ ํ˜•ํƒœ์˜ supervision์„ ์ œ๊ณตํ•˜์˜€๋‹ค.

 

  • process supervision w/ $PRM_{large}$
  • outcome supervision w/ $PRM_{large}$
  • outcome supervision w/ final-answer checking

 

 In Figure 4(a), process supervision significantly outperforms both forms of outcome supervision at all data collection scales. In Figure 4(b), outcome supervision with $PRM_{large}$ is more effective than final-answer checking; this is because $PRM_{large}$ provides better supervision for solutions that reach the correct final answer through incorrect reasoning.

 

 It is not clear whether supervision from $PRM_{large}$ or from final-answer checking represents the more appropriate baseline. Since the main weakness of final-answer supervision is false positives, outcome supervision from $PRM_{large}$ can be viewed as representing outcome supervision in domains less prone to false positives.

 

4-2. Active Learning

 

 Finally, the paper investigates the effect of active learning. A small RM, $PRM_{selector}$, was trained on a single sample per problem and used to score 1,000 samples per problem. The selected samples were then scored with $PRM_{large}$, and models were trained on those scores. The performance of this data-labeling scheme is shown in Figure 4(a). By comparing the slopes of the lines of best fit with and without active learning, active learning is estimated to be roughly 2.6x more data efficient, although the model trained on the largest active-learning dataset slightly underperforms the expected trend line. The best explanation the paper offers is that 200 samples represent a significant fraction of the overall selection pool, and this relative lack of diversity limits the possible upside from active learning.
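The selection loop above can be sketched as follows: a small selector model scores many sampled solutions per problem, and only the highest-scoring ones are sent on for expensive labeling. `select_for_labeling` and the toy `scores` dict are illustrative stand-ins for $PRM_{selector}$ scoring and the $PRM_{large}$ labeling budget.

```python
# Sketch: rank candidate solutions by a small selector model and
# keep only the top k for expensive large-model labeling.

def select_for_labeling(samples, selector_score, k=3):
    """Rank samples by the selector's score, keep the top k."""
    return sorted(samples, key=selector_score, reverse=True)[:k]

samples = [f"solution_{i}" for i in range(10)]
scores = {s: i / 10 for i, s in enumerate(samples)}  # toy selector scores
picked = select_for_labeling(samples, scores.get)
print(picked)  # ['solution_9', 'solution_8', 'solution_7']
```

The point of the scheme is that the cheap selector pass concentrates the expensive labeling budget on the most informative samples.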

 

 ๋ฐ์ดํ„ฐ ์ˆ˜์ง‘์„ ํ•˜๋Š” ๋™์•ˆ $PRM_{selector}$๋ฅผ ๋ฐ˜๋ณต์ ์œผ๋กœ ์žฌํ•™์Šต์‹œํ‚ค๋Š” ๊ฒƒ์˜ ํšจ๊ณผ์— ๋Œ€ํ•ด ์‚ฌ์ „ ์กฐ์‚ฌ๋ฅผ ์ˆ˜ํ–‰ํ•˜์˜€๋‹ค. ํ•˜์ง€๋งŒ ๋ถˆํ–‰ํ•˜๊ฒŒ๋„ ์ด ํ”„๋กœ์„ธ์Šค์—์„œ๋Š” ๋ถˆ์•ˆ์„ฑ์ด ๊ด€์ฐฐ๋˜์—ˆ๋‹ค. ๊ฒฐ๊ณผ๋กœ ๋‚˜์˜จ ๋ชจ๋ธ์€ ์œ„์—์„œ ์–ธ๊ธ‰ํ•œ ๋ชจ๋ธ๋ณด๋‹ค ์ข‹์€ ๋ชจ์Šต์„ ๋ณด์—ฌ์ฃผ์ง€๋Š” ์•Š์•˜๋‹ค.

 

 

5. OOD Generalization

 OOD ์ผ๋ฐ˜ํ™”๋ฅผ ์ธก์ •ํ•˜๊ธฐ ์œ„ํ•ด, ๋Œ€๊ทœ๋ชจ ORM๊ณผ PRM์„ 224๊ฐœ์˜ STEM question์—์„œ ํ‰๊ฐ€ํ•˜์˜€๋‹ค. ํ‘œ 1์— ORM, PRM, majority voting์˜ best-of-100 ์„ฑ๋Šฅ์„ ๊ธฐ๋กํ•˜์˜€๋‹ค. ๊ฒฐ๊ณผ๋Š” ์„น์…˜ 3๊ณผ ์œ ์‚ฌํ•˜๋‹ค.

 

Table 1. Measuring OOD generalization using the STEM test set

 

6. Discussion

6-1. Credit Assignment

 

 process supervision์€ outcome supervision ๋ณด๋‹ค ๋”์šฑ ์ •ํ™•ํ•œ ํ”ผ๋“œ๋ฐฑ์„ ์ œ๊ณตํ•ด ์ค€๋‹ค. process supervision์€ ๋ช‡ ๊ฐœ์˜ first step์˜ fact correct์ธ์ง€ ๋ช…์‹œํ•  ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ incorrect step์˜ ์ •ํ™•ํ•œ ์œ„์น˜๋ฅผ ์•Œ์•„๋‚ผ ์ˆ˜ ์žˆ๋‹ค. 

 

6-2. Alignment Impact

 

 Process supervision has several advantages for AI alignment: it produces more interpretable reasoning, and it is inherently safer because it directly rewards an aligned CoT.

 

 Moreover, although safer models can come at some cost to performance, process supervision fares better with respect to this alignment tax, which the authors believe may encourage its increased adoption.

 

 

 

 

Source

https://arxiv.org/abs/2305.20050

 
