
Llama's New Challenger, Mistral LM! 😮

Cartinoe 2023. 10. 2. 12:03

A preview of Llama 3...?

 While browsing HuggingFace recently, I came across a model that has set the LLM market ablaze: Mistral LM! Appearing in the open-source LLM scene like a comet, Mistral 7B heated up the field with its debut alone. So what did Mistral 7B actually do to draw everyone's attention? Its list of achievements tells the story:

 

  • Outperforms Llama 2 13B on all benchmarks
  • Outperforms Llama 1 34B on many benchmarks (the comparison target is Llama 1 rather than Llama 2 because no 34B Llama 2 model was ever released)
  • Approaches the performance of CodeLlama 7B on code-related benchmarks while staying strong on English tasks
  • Uses grouped-query attention (GQA) for faster inference (a minimal sketch follows this list)
  • Uses Sliding Window Attention (SWA) to handle longer sequences at lower cost
  • The Mistral 7B model fine-tuned for chat outperforms Llama 2 13B Chat
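
 Grouped-query attention shares each key/value head across several query heads, which shrinks the KV cache and speeds up decoding. Below is a minimal sketch of the idea in PyTorch; the head counts and dimensions are illustrative, not Mistral 7B's actual configuration.

```python
import torch

# Minimal grouped-query attention (GQA) sketch: several query heads share
# one key/value head. Sizes here are illustrative, not Mistral's config.
batch, seq_len, n_q_heads, n_kv_heads, head_dim = 1, 16, 8, 2, 64

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)   # fewer KV heads
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)   # -> smaller KV cache

# Repeat each KV head so a whole group of query heads shares it.
group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)

attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim**0.5, dim=-1)
out = attn @ v                  # (batch, n_q_heads, seq_len, head_dim)
```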

 

 ์œ„์˜ ์—…์ ๋“ค๋งŒ ์‚ดํŽด๋ณด์•„๋„ Mistral 7B ๋ชจ๋ธ์€ ์ƒ๋‹นํžˆ ์—„์ฒญ๋‚œ ์ž ์žฌ๋ ฅ์„ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค. Mistral 7B๋Š” ๋งˆ์น˜ Llama3์˜ ํ”„๋ฆฌ๋ทฐ ๊ฐ™์€ ๋ชจ์Šต์„ ๋ณด์—ฌ์คฌ๋‹ค. ์ด๋ฒˆ ํฌ์ŠคํŒ…์—์„œ๋Š” Open-source LLM ํŒ์˜ ๋œจ๊ฑฐ์šด ๊ฐ์ž์ธ Mistral 7B์— ๋Œ€ํ•ด์„œ Mistral 7B ์†Œ๊ฐœ ๋ธ”๋กœ๊ทธ ํฌ์ŠคํŠธ๋ฅผ ํ† ๋Œ€๋กœ ์•Œ์•„๋ณด๋Š” ์‹œ๊ฐ„์„ ๊ฐ€์ ธ๋ณด๋„๋ก ํ•˜๊ฒ ๋‹ค.

 

Performance in detail

 The results of comparing Mistral 7B against the Llama 2 model family and the Llama 1 34B model across various benchmarks are as follows.

 

Benchmark performance comparison between Mistral 7B and the Llama models

 

 ์œ„์˜ ๊ทธ๋ž˜ํ”„๋ฅผ ๋ณด๋ฉด ์•Œ ์ˆ˜ ์žˆ๋“ฏ์ด Mistral 7B๋Š” ๋™ ์‚ฌ์ด์ฆˆ์˜ Llama2 7B ๋ชจ๋ธ์„ ๋ชจ๋“  ๋ฒค์น˜๋งˆํฌ์—์„œ ๋Šฅ๊ฐ€ํ•˜๊ณ , ์‚ฌ์ด์ฆˆ๊ฐ€ ๋” ํฐ Llama2 13B์™€ Llama1 34B ๋ชจ๋ธ์— ๋น„ํ•ด์„œ๋„ ๋Œ€๋ถ€๋ถ„์˜ ๋ฒค์น˜๋งˆํฌ์—์„œ ๋Šฅ๊ฐ€ํ•˜๋Š” ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ฃผ๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค. 7B model์ž„์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  ์ด ์ •๋„์˜ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ค„ ์ˆ˜๋„ ์žˆ๋‹ค๋Š” ๊ฒƒ์ด ์ •๋ง ๋†€๋ผ์šด ๊ฒƒ ๊ฐ™๋‹ค.

 

 The Mistral team also ran an interesting evaluation, "equivalent model sizes": how far would a Llama 2 model need to be scaled up to match Mistral 7B's performance? They measured this across four benchmark categories and found that on MMLU, Reasoning, and Comprehension, a Llama 2 model would need roughly 3x more parameters than Mistral 7B.

 

"equivalent model sizes" ๊ฒฐ๊ณผ

 

Sliding Window Attention (SWA)

 Mistral 7B uses the Sliding Window Attention (SWA) mechanism, in which each layer attends to the previous 4,096 hidden states. The main improvement, and the reason this method was originally investigated, is its linear compute cost of O(sliding_window × seq_len). In practice, the corresponding changes to FlashAttention and xFormers deliver a 2x speed improvement.
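
 To make the windowing concrete, here is a minimal sketch (my own illustration, not Mistral's code) of the attention mask SWA implies: each position attends only to itself and the previous W - 1 positions, so per-token cost is bounded by the window size rather than the full sequence length.

```python
import torch

# Sliding-window attention mask: position i attends to positions j with
# i - W < j <= i. W is tiny here for readability; Mistral uses W = 4096.
seq_len, W = 8, 4
i = torch.arange(seq_len).unsqueeze(1)   # query positions (column)
j = torch.arange(seq_len).unsqueeze(0)   # key positions (row)
mask = (j <= i) & (j > i - W)            # causal AND within the window
print(mask.int())
```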

 

 Sliding Window Attention exploits the stacked layers of a Transformer to attend to past context beyond the window size. Through this stacking, higher layers can access information further in the past than the attention pattern alone would suggest.
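
 Since information can propagate one window further per layer, the theoretical attention span grows with depth; per the Mistral announcement, with 32 layers and a 4,096-token window this works out to roughly 131K tokens:

```python
# Rough upper bound on the receptive field after stacking layers:
# information can travel one window of positions per layer.
num_layers, window = 32, 4096
print(num_layers * window)   # 131072, i.e. ~131K theoretical attention span
```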

 

How Sliding Window Attention works

 

 Finally, a fixed attention span means the cache can be limited to sliding_window tokens using a rotating buffer. This halves the cache memory required for inference on sequences of length 8,192, without impacting model quality.
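
 Here is a minimal sketch of the rotating buffer idea, assuming a fixed attention span: the key/value cache holds only `window` slots, and the entry for time step t overwrites slot t % window, so cache memory stays constant no matter how long generation runs.

```python
import torch

window, head_dim = 4, 8                 # toy sizes; Mistral's window is 4096
cache = torch.zeros(window, head_dim)   # fixed-size rolling KV buffer

for t in range(10):                     # decode 10 tokens with a 4-slot cache
    kv = torch.randn(head_dim)          # stand-in for this step's key/value
    cache[t % window] = kv              # overwrite the oldest slot
# After the loop the cache holds only the last `window` steps' entries.
```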

 

Fine-tuning Mistral 7B for chat

 Alongside the base model, a chat-specialized version, Mistral 7B Instruct, was also released. This Instruct model was fine-tuned without any tricks or proprietary data, using only instruction datasets publicly available on HuggingFace. The resulting Mistral 7B Instruct model outperforms all other 7B models on MT-Bench and performs comparably to 13B chat models.
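
 As a usage note, the model card documents an [INST] ... [/INST] prompt format for the Instruct model. Below is a minimal sketch of trying it with the transformers library; the generation settings are illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# [INST]...[/INST] wrapping follows the model card's instruction format.
prompt = "[INST] Explain sliding window attention in one sentence. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```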

 

MT-Bench results

 

The new paradigm of open-source LLM? 🤔

 That wraps up this brief look at Mistral 7B, the new open-source challenger to Llama. (If you want more detail, I recommend checking out Mistral AI's blog post.) Mistral 7B's benchmark results are certainly impressive, giving the feeling of a sneak peek at Llama 3. Seeing this, I believe the open-source LLM market that Llama cracked open is now being expanded by a variety of different open-source LLMs. So while open-source models built purely by fine-tuning remain important, proposing new base models, as Mistral has done, matters just as much.

 

 I hope this becomes a good opportunity to develop open-source models that are regarded as just as important as proprietary models, and with that I will wrap up this post.

 

 

References

Mistral AI announcement: https://mistral.ai/news/announcing-mistral-7b/

 


HuggingFace Mistral 7B Model: https://huggingface.co/mistralai/Mistral-7B-v0.1

 


HuggingFace Mistral 7B Instruct Model: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1

 
