Research & Project 🔬

How should we perform quantization effectively? 🤔

Cartinoe 2023. 9. 18. 10:39

Which quantization method is efficient & effective? 🧐

 As LLMs keep growing larger as time goes by, how can we use them easily, efficiently, and effectively? These days, the trend is to rely on quantization more than any other method. Thanks to quantization, people can now run LLMs, which used to be hard to use even on GPUs with large amounts of RAM, far more efficiently! 🤗

 

 ์ตœ์†Œํ•œ์˜ ์„ฑ๋Šฅ ๊ฐ์†Œ๋กœ ์ตœ์ ์˜ ํšจ์œจ์„ฑ์„ ๋ณด์—ฌ์ฃผ๋Š” quantization์„ ์œ„ํ•ด HuuggingFace์—์„œ๋Š” 2๊ฐ€์ง€ quantization method๋ฅผ ์ œ๊ณตํ•˜๊ณ  ์žˆ๋‹ค. ๋ฐ”๋กœ BitsAndBytes์™€ GPTQ์ด๋‹ค. ์ด๋ฅผ ํ† ๋Œ€๋กœ ๋‘ quantization method๊ฐ€ ์–ด๋–ค ์žฅ๋‹จ์ ์„ ๊ฐ€์ง€๋Š”์ง€ ์ง์ ‘ ๋น„๊ต ๋ฐ ๋ถ„์„์„ ์ง„ํ–‰ํ•œ HuggingFace์˜ ๋ธ”๋กœ๊ทธ ํฌ์ŠคํŠธ๋ฅผ ๋ณด๊ณ , ์ด ๋ธ”๋กœ๊ทธ ํฌ์ŠคํŠธ์˜ ๋งˆ์ง€๋ง‰ ๋ถ€๋ถ„์— ์ œ์‹œ๋˜์—ˆ๋˜ quantization method์˜ ํšจ์œจ์„ฑ์„ ์ง์ ‘ ์‹คํ—˜์„ ํ†ตํ•ด ์ž…์ฆํ•˜๋Š” ํ”„๋กœ์ ํŠธ๋ฅผ ์ง„ํ–‰ํ•˜์˜€๋‹ค! ํ”„๋กœ์ ํŠธ์˜ Github Repository๋Š” ๋ณธ ํฌ์ŠคํŠธ์˜ ๋งˆ์ง€๋ง‰ ๋ถ€๋ถ„์— ์žˆ๋Š” ์ถœ์ฒ˜์—์„œ ํ™•์ธํ•˜์‹œ๊ธธ ๋ฐ”๋ผ๊ฒ ์Šต๋‹ˆ๋‹ค.

 

 ํ”„๋กœ์ ํŠธ๋ฅผ ์†Œ๊ฐœํ•˜๊ธฐ์— ์•ž์„œ ๋ธ”๋กœ๊ทธ ํฌ์ŠคํŠธ์—์„œ ์†Œ๊ฐœํ•œ ๊ฐœ๋…๋“ค์— ๋Œ€ํ•ด ์ž ๊น ์งš๊ณ  ๋„˜์–ด๊ฐ€ ๋ณด๋„๋ก ํ•˜๊ฒ ๋‹ค!

 

Before we start...

 ๊ฐ quantization method์— ๋Œ€ํ•ด ์ž์„ธํžˆ ๋”์šฑ ์ž์„ธํ•˜๊ฒŒ ์•Œ์•„๋ณด๊ณ  ์‹ถ๋‹ค๋ฉด ์•„๋ž˜์˜ Resource๋ฅผ ์ฐธ๊ณ ํ•ด ์ฃผ์‹œ๊ธธ ๋ฐ”๋ผ๊ฒ ์Šต๋‹ˆ๋‹ค!

 

Resources

  • GPTQ blogpost: gives an overall overview of the GPTQ quantization method and explains how to use it.
  • bitsandbytes 4-bit quantization blogpost: explains 4-bit quantization and QLoRA, an efficient fine-tuning method.
  • bitsandbytes 8-bit quantization blogpost: explains how 8-bit quantization works with bitsandbytes.
  • Basic usage Google Colab notebook for GPTQ: shows how to quantize a Transformers model with the GPTQ method, how to run inference with it, and how to fine-tune the quantized model.
  • Basic usage Google Colab notebook for bitsandbytes: shows how to run inference with a 4-bit model, and how to run GPT-neo-X on a free Google Colab GPU.
  • Merve's blogpost on quantization: explains quantization and quantization methods in detail.

 

Pros & Cons Analysis (bitsandbytes, GPTQ) 🆚

 HuggingFace์˜ ๋ธ”๋กœ๊ทธ ํฌ์ŠคํŠธ์—์„œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๊ฐ quantization method์˜ ์žฅ๋‹จ์ ์„ ์†Œ๊ฐœํ•œ๋‹ค.

 

The benefits & rooms for improvement of bitsandbytes

 

Benefits

  • easy 😙: bitsandbytes does everything at model load time, so no post-processing or preparation step is required (a minimal sketch follows this list).
  • cross-modality interoperability 🧰: quantization works regardless of modality, so it is broadly applicable.
  • 0 performance degradation when merging adapters ✅: a trained adapter can be merged into the base model or the dequantized model without any performance degradation; merging is not supported for GPTQ.
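
 To make the "everything happens at load time" point concrete, here is a minimal sketch of 4-bit loading with bitsandbytes. The model id is a placeholder, and the config values follow the QLoRA defaults rather than anything specific to this project:

```python
# Minimal sketch: on-the-fly 4-bit quantization with bitsandbytes.
# Assumes `transformers`, `accelerate`, and `bitsandbytes` are installed;
# the model id below is just a placeholder.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize weights to 4-bit at load time
    bnb_4bit_quant_type="nf4",             # NormalFloat4 data type from the QLoRA paper
    bnb_4bit_compute_dtype=torch.float16,  # run the matmuls in fp16
)

# No calibration dataset and no post-processing: quantization happens
# entirely inside from_pretrained().
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",                   # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
```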

 

Rooms for Improvement

  • slower than GPTQ for text generation 🐢: bitsandbytes 4-bit models are slower than GPTQ at inference time.
  • 4-bit weights are not serializable 😓: at the moment, 4-bit models cannot be serialized.

 

The benefits & rooms for improvement of GPTQ

 

Benefits

  • fast for text generation ⏩: for text generation, GPTQ-quantized models are faster than bitsandbytes-quantized models.
  • n-bit support 🔢: the GPTQ algorithm can quantize models down to 2 bits; however, the recommended bit width is 4.
  • easily serializable 😊: GPTQ models support serialization for any number of bits, so an already-quantized checkpoint can be loaded directly (see the sketch after this list).
  • AMD support 💽: not only Nvidia GPUs but also AMD GPUs are supported.
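
 As a quick illustration of the serialization benefit, a GPTQ checkpoint that someone else has already quantized can be loaded straight from the Hub with no calibration step. A minimal sketch, assuming `optimum` and `auto-gptq` are installed; the repository name is only an example of a pre-quantized checkpoint, not part of this project:

```python
# Minimal sketch: loading an already-serialized GPTQ model from the Hub.
# Assumes `transformers`, `optimum`, and `auto-gptq` are installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-7b-Chat-GPTQ"  # example pre-quantized checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Quantization is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```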

 

Rooms for Improvement

  • calibration dataset 😓: GPTQ requires a calibration dataset, which discourages some users from adopting it. On top of that, quantizing a model takes quite a long time.
  • works only for language models 😢: GPTQ is built only for language models.

 

Conclusion & Final Words of Blog 🫡

 HuggingFace์˜ ๋ธ”๋กœ๊ทธ ํฌ์ŠคํŠธ์—์„œ๋Š” ๊ฐ quantization method์˜ ๋น„๊ต๋ฅผ ํ•˜๊ณ  ๋งˆ์ง€๋ง‰ ๋ถ€๋ถ„์— ์ด๋ฅผ ํ† ๋Œ€๋กœ ๋‹ค์Œ๊ณผ ๊ฐ™์ด quantization์„ ์ง„ํ–‰ํ•˜๋Š” ๊ฒƒ์ด ๊ฐ€์žฅ ํšจ์œจ์ ์ด๋ผ๋Š” ๊ฒƒ์„ ์ฃผ์žฅํ•˜์˜€๋‹ค. ํ•„์ž๊ฐ€ ์ง„ํ–‰ํ•œ ํ”„๋กœ์ ํŠธ์—์„œ๋Š” ์ด๋Ÿฌํ•œ ์ฃผ์žฅ์„ ์‹ค์ œ ์‹คํ—˜์„ ํ†ตํ•ด ํšจ์œจ์„ฑ์„ ์ž…์ฆํ•˜๊ณ ์ž ํ•œ ๊ฒƒ์ด๋‹ค.

 

Suggestion of the Blog

 

  1. Quantize the base model with bitsandbytes
  2. Add an adapter and fine-tune it
  3. Merge the trained adapter on top of the base model or the dequantized model
  4. Quantize the merged model with GPTQ and use it for inference (a code sketch of the full recipe follows below)
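
 Here is a minimal sketch of what this four-step recipe looks like with the HuggingFace stack. This is my own illustration, not the project's actual training script: the model id, adapter path, and LoRA hyperparameters are placeholders, and the training loop itself is omitted.

```python
# Minimal sketch of the blog post's suggested recipe; assumes `transformers`,
# `peft`, `bitsandbytes`, `optimum`, and `auto-gptq` are installed.
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, GPTQConfig)
from peft import LoraConfig, PeftModel, get_peft_model

base_id = "facebook/opt-350m"  # placeholder base model

# 1. Quantize the base model with bitsandbytes (4-bit)
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                                bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(base_id,
                                             quantization_config=bnb_config,
                                             device_map="auto")

# 2. Add a LoRA adapter and fine-tune (training loop omitted)
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
# ... fine-tune with your usual Trainer here, then save the adapter ...
model.save_pretrained("my-adapter")  # hypothetical adapter path

# 3. Merge the trained adapter on top of the (non-quantized) base model
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(base, "my-adapter").merge_and_unload()
merged.save_pretrained("merged-model")

# 4. Quantize the merged model with GPTQ (needs a calibration dataset)
tokenizer = AutoTokenizer.from_pretrained(base_id)
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
gptq_model = AutoModelForCausalLM.from_pretrained("merged-model",
                                                  quantization_config=gptq_config,
                                                  device_map="auto")
# gptq_model is what you then use for inference.
```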

 

Experiments 🧪

๋ณธ ํ”„๋กœ์ ํŠธ์—์„œ ์ง„ํ–‰ํ•œ ์‹คํ—˜์€ HuggingFace ๋ธ”๋กœ๊ทธ ํฌ์ŠคํŠธ์—์„œ ์ฃผ์žฅํ•œ method๊ฐ€ ์‹ค์ œ๋กœ ํšจ์œจ์  ์ผ์ง€ fine-tuning ์‹œ์˜ ํšจ์œจ์„ฑ๊ณผ inference ์‹œ์˜ ํšจ์œจ์„ฑ์„ ๋น„๊ตํ•˜๋ฉฐ ์‹คํ—˜์„ ์ง„ํ–‰ํ•˜์˜€๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ๊ธฐ๋ณธ์ ์ธ setup์€ ๋ธ”๋กœ๊ทธ ํฌ์ŠคํŠธ์—์„œ ์‚ฌ์šฉ๋œ setup์„ ๋”ฐ๋ž๊ณ , ์‹คํ—˜๊ณผ ๊ด€๋ จํ•ด์„œ ๋”์šฑ ์ž์„ธํ•œ ๋‚ด์šฉ์€ Github Repository๋ฅผ ์ฐธ๊ณ ํ•ด ์ฃผ์‹œ๊ธธ ๋ฐ”๋ผ๊ฒ ์Šต๋‹ˆ๋‹ค. baseline์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

Baselines

 

  1. fine-tune w/ bitsandbytes & inference w/ bitsandbytes
  2. fine-tune w/ auto-GPTQ & inference w/ auto-GPTQ
  3. fine-tune w/ bitsandbytes & inference w/ auto-GPTQ (proposed method)

 

Results

 

Benchmark

 

 ๊ฐ baseline์˜ ํšจ์œจ์„ฑ์„ ํ‰๊ฐ€ํ•˜๊ธฐ ์œ„ํ•ด ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์นดํ…Œ๊ณ ๋ฆฌ์—์„œ ๋ชจ๋ธ์„ ์ธก์ •ํ•˜์˜€๋‹ค.

 

  • Fine-tuning: Throughput per second (steps). This metric indicates how many steps the model processes per second during fine-tuning.
  • Inference: Average inference time (s). This metric means the time it takes to perform a single inference. (A rough sketch of how both can be measured follows this list.)
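
 As a rough sketch (my own, not taken from the project repository) of how these two numbers can be obtained: fine-tuning throughput is reported by the Trainer as `train_steps_per_second`, and average inference time can be measured by timing repeated `generate()` calls:

```python
# Rough sketch of the two measurements; function and variable names are mine.
import time

def average_inference_time(model, tokenizer, prompt, n_runs=10, max_new_tokens=32):
    """Average wall-clock seconds per generate() call."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    model.generate(**inputs, max_new_tokens=max_new_tokens)  # warm-up, not timed
    start = time.perf_counter()
    for _ in range(n_runs):
        model.generate(**inputs, max_new_tokens=max_new_tokens)
    return (time.perf_counter() - start) / n_runs

# Fine-tuning throughput comes from the Trainer's summary metrics:
# metrics = trainer.train().metrics
# print(metrics["train_steps_per_second"])
```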

 

Fine-tuning

 

 ์•„๋ž˜์˜ ํ‘œ๋ฅผ ๋ณด๋ฉด ์•Œ ์ˆ˜ ์žˆ๋“ฏ์ด, bitsandbytes๊ฐ€ GPTQ๋ณด๋‹ค ๋” ๋น ๋ฅธ fine-tuning ์†๋„๋ฅผ ๋ณด์—ฌ์ค€๋‹ค. ์ด ๊ฒฐ๊ณผ๋Š” HuggingFace ๋ธ”๋กœ๊ทธ ํฌ์ŠคํŠธ์—์„œ ์ œ์•ˆ๋˜์—ˆ๋˜ method๋ฅผ ๋’ท๋ฐ›์นจํ•ด ์ค€๋‹ค(bitsandbytes๋ฅผ ์‚ฌ์šฉํ•ด์„œ adapter๋ฅผ fine-tuning)!

 

Quantization Method | Throughput Per Second (steps) ⬆️ | Fine-tuning Time (s) ⬇️
--------------------|----------------------------------|------------------------
GPTQ                | 1.45                             | 712
bitsandbytes        | 2.18                             | 469

 

Inference

 

 ๊ฐ baseline์˜ inference speed๋ฅผ ๋น„๊ต ๊ฒฐ๊ณผ๋Š” ๋‹ค์Œ์˜ ๊ทธ๋ž˜ํ”„์™€ ๊ฐ™๋‹ค.

 

[Figure] Time per single inference as a function of batch size

 

  As the graph above shows, the method proposed in the HuggingFace blog post (bnb-gptq) runs inference faster than the other methods.

 

Final Results

 

That wraps up the fine-tuning & inference benchmarking. The following table shows the overall results for each baseline. As the table shows, 'bnb-gptq', the recipe proposed in the HuggingFace blog post, is more effective than the other baselines!

 

Method (Baseline) | Throughput Per Second (steps) ⬆️ | Average Inference Time (s) ⬇️
------------------|----------------------------------|------------------------------
bnb-bnb           | 2.18                             | 6.06
gptq-gptq         | 1.45                             | 2.04
bnb-gptq 👑       | 1.45                             | 1.31

 

Closing the post...

 ์ด ํ”„๋กœ์ ํŠธ๋Š” ์•ž์„œ๋„ ๋งํ–ˆ๋˜ ๊ฒƒ์ฒ˜๋Ÿผ HuggingFace์˜ ๋ธ”๋กœ๊ทธ ํฌ์ŠคํŠธ์ธ 'Overview of natively supported quantization schemes in ๐Ÿค— Transformers' ์—์„œ์˜ ์ฃผ์žฅ์„ ์ž…์ฆํ•˜๊ธฐ ์œ„ํ•ด ์ง„ํ–‰๋˜์—ˆ๋‹ค. ์‹ค์ œ๋กœ ์‹คํ—˜์„ ํ†ตํ•ด ๋ธ”๋กœ๊ทธ ํฌ์ŠคํŠธ์˜ ์ฃผ์žฅ์ด ํšจ์œจ์ ์ด๋ผ๋Š” ๊ฒƒ์„ ์ž…์ฆํ•˜์˜€์œผ๋‚˜, ์ž์›์˜ ๋ถ€์กฑ์œผ๋กœ ์ธํ•ด performance degradation ๋“ฑ์˜ ํšจ๊ณผ์ ์ธ ์ธก๋ฉด์€ ๊ฒ€์ฆํ•˜์ง€ ๋ชปํ–ˆ๋‹ค. ๋”ฐ๋ผ์„œ ์ด๋Ÿฌํ•œ ๋ถ€๋ถ„์€ future work๋กœ ๋‚จ๊ฒจ๋‘๋„๋ก ํ•˜๊ฒ ๋‹ค. ๋‹ค์‹œ ํ•œ๋ฒˆ 'Overview of natively supported quantization schemes in ๐Ÿค— Transformers' ์˜ ๋ชจ๋“  author ๋ถ„๋“ค๊ป˜ ๊ฐ์‚ฌ๋“œ๋ฆฐ๋‹ค๋Š” ๋ง์”€์„ ๋“œ๋ฆฌ๋ฉฐ ํฌ์ŠคํŠธ๋ฅผ ๋งˆ์น˜๊ฒ ๋‹ค!

 

 

 

 

Sources

  • Overview of natively supported quantization schemes in 🤗 Transformers (huggingface.co): https://huggingface.co/blog/overview-quantization-transformers
  • GitHub, gauss5930/Quantization: the comparison of the most popular quantization methods, BitsAndBytes and GPTQ (github.com): https://github.com/gauss5930/Quantization/tree/main