
It handles API calls that even GPT-4 gets wrong?!? - Gorilla 🦍: Large Language Model Connected with Massive APIs Paper Review

Cartinoe 2023. 6. 27. 19:36

The overview of this paper

 LLMs have seen impressive advances recently, but their potential to use tools effectively via API calls remains unfulfilled. This paper introduces Gorilla 🦍, a fine-tuned LLaMA-based model that surpasses GPT-4 at writing API calls. When combined with a document retriever, Gorilla demonstrates a strong ability to adapt to test-time changes in documentation, enabling flexible user updates and version changes. It also substantially mitigates the hallucination problem commonly encountered when prompting LLMs directly. To evaluate Gorilla's capabilities, the paper also proposes APIBench, a dataset comprising HuggingFace, TorchHub, and TensorHub APIs.

 

 

Table of Contents

1. Introduction

2. Methodology

3. Evaluation

 

 

1. Introduction

 The paper explores the use of Self-Instruct fine-tuning and retrieval so that, given APIs and their documentation, an LLM can accurately select APIs from a large, overlapping, and changing tool set. It also builds APIBench, a large corpus of APIs with complex and often overlapping functionality. Using Self-Instruct, 10 user question prompts were generated per API, so each entry in the dataset is an instruction-reference API pair. To evaluate the correctness of generated API calls, a common AST sub-tree matching technique is adopted; with it, the paper measures the functional accuracy of each LLM and tracks its hallucination problem.

 

 The paper then fine-tuned a LLaMA-7B-based model with document retrieval on the APIBench dataset to obtain Gorilla. Gorilla substantially outperforms GPT-4, both in API functional accuracy and in reducing hallucination errors. Figure 1 shows an example output. In addition, Gorilla's retriever-aware training allows the model to adapt to changes in API documentation.

 

Figure 1. Example of an API call

 

Figure 2. Accuracy vs. hallucination across models

 

2. Methodology

2-1. Dataset Collection

 

 To collect the dataset, every online model card for HuggingFace's 'The Model Hub', PyTorch Hub, and TensorFlow Hub models was meticulously recorded.

 

API Documentation.  For the 1,645 API calls obtained from HuggingFace Hub, TensorFlow Hub, and Torch Hub, each model card was converted into the following fields: {domain, framework, functionality, api_name, api_call, api_arguments, environment_requirements, example_code, performance, description}. These fields were chosen so that the format generalizes beyond API calls in the ML domain to other domains, including RESTful APIs.
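
For illustration, one converted model card might look like the Python dict below. The field names come from the paper, while the specific values (a Torch Hub ResNet-50 entry) are hypothetical and only meant to show the shape of an entry:

```python
# Hypothetical example of a single converted model card (values are illustrative).
api_entry = {
    "domain": "Computer Vision Classification",
    "framework": "PyTorch",
    "functionality": "Image classification",
    "api_name": "resnet50",
    "api_call": "torch.hub.load('pytorch/vision', 'resnet50', pretrained=True)",
    "api_arguments": {"repo_or_dir": "pytorch/vision", "model": "resnet50", "pretrained": "True"},
    "environment_requirements": ["torch", "torchvision"],
    "example_code": "model = torch.hub.load('pytorch/vision', 'resnet50', pretrained=True)\nmodel.eval()",
    "performance": {"dataset": "ImageNet", "accuracy": "top-1 ~76%"},
    "description": "ResNet-50 convolutional network pre-trained on ImageNet.",
}
```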

 

Instruction Generation.  To generate synthetic instruction data, Self-Instruct was run with GPT-4. The paper provides three in-context examples together with the reference API documentation and tasks the model with generating real-world use cases that call the API. For each of the 1,645 API datapoints, 3 of the 6 corresponding hand-written instruction examples were sampled to generate a total of 10 instruction-API pairs.
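
A minimal sketch of how such a Self-Instruct prompt could be assembled is shown below; the wording of the prompt and the helper name build_selfinstruct_prompt are assumptions, not the authors' actual prompt:

```python
import json
import random

def build_selfinstruct_prompt(api_entry: dict, instruction_examples: list) -> str:
    """Hypothetical sketch: sample 3 of the 6 hand-written instruction examples
    and pair them with the reference API documentation, then ask GPT-4 for a
    real-world use case. The prompt text below is ours, not the paper's."""
    shots = random.sample(instruction_examples, 3)          # 3 in-context examples
    shot_block = "\n".join(f"- {s}" for s in shots)
    return (
        "Write a real-world user instruction that would be solved by calling the API below. "
        "Do not mention the API name in the instruction.\n\n"
        f"Example instructions:\n{shot_block}\n\n"
        f"Reference API documentation:\n{json.dumps(api_entry, indent=2)}"
    )
```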

 

Figure 3. Gorilla: a system that enables LLMs to interact with APIs

 

2-2. Gorilla

 

 Concretely, Gorilla is a LLaMA-7B model fine-tuned to be retriever-aware for API calls. As shown in Figure 3, Self-Instruct is used to generate {instruction, API} pairs. To fine-tune LLaMA, these are converted into user-agent chat-style conversations, where each data point is a single round of conversation between the user and the agent (a minimal sketch follows). Standard instruction fine-tuning is then performed on the base LLaMA-7B model. For the experiments, Gorilla was trained both with and without a retriever.
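
As a rough sketch (the role names and function name are our assumptions, not the paper's data format), each {instruction, API} pair can be wrapped into a single-round conversation like this:

```python
def to_chat_example(instruction: str, api_call: str) -> list:
    """Hypothetical sketch: one {instruction, API} pair becomes a single round
    of user-agent conversation for standard instruction fine-tuning."""
    return [
        {"role": "user", "content": instruction},
        {"role": "agent", "content": api_call},
    ]

# e.g. to_chat_example("I need an image classifier for my photo app.",
#                      "torch.hub.load('pytorch/vision', 'resnet50', pretrained=True)")
```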

 

API Call with Constraints.  API calls often come with inherent constraints. These constraints require the LLM not only to understand an API's functionality but also to categorize calls according to different constraint parameters. The LLM must understand the user's functional description and, at the same time, reason about the various constraints embedded in the request. It is therefore not enough for the model to grasp the basic functionality of an API call; it also needs the ability to navigate the complex landscape of constraints that accompany these calls. This observation shows why fine-tuning an LLM on APIs is necessary.

 

Retriever-Aware Training.  To train with a retriever, the instruction-tuned dataset additionally appends "Use this API documentation for reference: <retrieved_API_doc_JSON>" to the user prompt. Through this, the paper aims to teach the LLM to parse the second half of the question in order to answer its first half. This has the following effects (a minimal prompt-construction sketch follows the list):

 

  • Makes the model adapt to test-time changes
  • Improves performance through in-context learning
  • Reduces hallucination errors
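
A minimal sketch of this prompt augmentation, using the reference string quoted above (the function name and exact formatting are our assumptions):

```python
def retrieval_augmented_prompt(instruction: str, retrieved_api_doc_json: str) -> str:
    """Append the retrieved API documentation to the user prompt, as done for
    retriever-aware training (sketch; layout details are assumptions)."""
    return (f"{instruction}\n"
            f"Use this API documentation for reference: {retrieved_api_doc_json}")
```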

 

Gorilla Inference.  Prompts for Gorilla inference come in two modes: zero-shot & with retrieval (a dispatch sketch follows the list).

 

  • Zero-shot setting: the prompt is fed directly to the Gorilla LLM, which then returns an API call that helps accomplish the task or goal.
  • With retrieval: the retriever first fetches the most up-to-date API documentation stored in the API database, which is then concatenated with the user prompt.
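
The two modes could be dispatched roughly as follows; `retriever` and `generate` are placeholders for the document retriever and the fine-tuned Gorilla model, not the authors' code:

```python
def gorilla_inference(user_prompt: str, generate, retriever=None) -> str:
    """Hypothetical sketch of the two inference modes."""
    if retriever is not None:                       # "with retrieval" mode
        api_doc = retriever(user_prompt)            # most relevant / up-to-date API doc
        user_prompt = (f"{user_prompt}\n"
                       f"Use this API documentation for reference: {api_doc}")
    # zero-shot mode: the user prompt is passed to the model unchanged
    return generate(user_prompt)                    # returns the predicted API call
```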

 

2-3. Verifying APIs

 

 To evaluate Gorilla's performance, the collected data is used to compare the functional equivalence of API calls. To trace which API in the dataset an LLM is calling, an AST tree-matching strategy is adopted: if a candidate API call is a sub-tree of a reference API call, it is counted as using that API from the dataset.

 

 Because hallucination is hard to define, it is defined via AST tree matching: the paper defines a hallucination as an API call that is not a sub-tree of any API in the database, i.e., the model has invoked an entirely imagined tool.

 

AST Sub-Tree Matching.  To identify which API in the dataset an LLM is calling, the paper performs AST sub-tree matching (a simplified sketch follows).
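
The sketch below, built on Python's standard `ast` module, is a simplified stand-in for the paper's matching procedure rather than the authors' exact code. It accepts a generated program when some call in it uses the same dotted function name as a reference API call and every keyword argument it passes agrees with the reference (the candidate call is a sub-tree of the reference); a generated call matching no API in the database would then count as a hallucination:

```python
import ast

def matches_database_api(generated_code: str, reference_call: str) -> bool:
    """Simplified AST sub-tree matching sketch (positional arguments are ignored
    here for brevity; the paper matches on the reference API's arguments)."""

    def dotted_name(node):
        # Rebuild e.g. "torch.hub.load" from nested Attribute/Name nodes.
        parts = []
        while isinstance(node, ast.Attribute):
            parts.append(node.attr)
            node = node.value
        if isinstance(node, ast.Name):
            parts.append(node.id)
        return ".".join(reversed(parts))

    ref = ast.parse(reference_call, mode="eval").body          # reference ast.Call
    ref_name = dotted_name(ref.func)
    ref_kwargs = {kw.arg: ast.dump(kw.value) for kw in ref.keywords}

    for node in ast.walk(ast.parse(generated_code)):
        if isinstance(node, ast.Call) and dotted_name(node.func) == ref_name:
            gen_kwargs = {kw.arg: ast.dump(kw.value) for kw in node.keywords}
            # every keyword the candidate passes must agree with the reference
            if all(ref_kwargs.get(k) == v for k, v in gen_kwargs.items()):
                return True
    return False
```

For instance, a generated line like `model = torch.hub.load('pytorch/vision', 'resnet50', pretrained=True)` would match the corresponding reference call, while a call that invents a repository name or argument value would not.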

 

Figure 4. AST sub-tree matching for evaluating API calls

 

 

3. Evaluation

 The paper benchmarks Gorilla against other models on the collected dataset and explores how different retrieval methods affect the models' ability to generate API calls.

 

Baselines.  Gorilla is compared against other SoTA models in the zero-shot setting: GPT-4, GPT-3.5-Turbo, Claude, and LLaMA-7B.

 

Retrievers.  Zero-shot refers to the scenario in which no retriever is used, so the only input to the model is the user's natural language prompt. With retrieval, the user's query is used to search the index and fetch the most relevant API, which is then concatenated with the user prompt to query the LLM.
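
One of the retriever settings evaluated below is BM25; a minimal sketch of that setting using the third-party rank_bm25 package might look like the following (the toy corpus, helper name, and use of rank_bm25 are our assumptions, since the paper does not specify its implementation):

```python
from rank_bm25 import BM25Okapi  # third-party package: pip install rank-bm25

# Toy API-documentation corpus; in practice this would be the full API database.
api_docs = [
    "torch.hub.load pytorch/vision resnet50 image classification pretrained on ImageNet",
    "transformers pipeline sentiment-analysis text classification HuggingFace",
]
bm25 = BM25Okapi([doc.lower().split() for doc in api_docs])

def retrieve_and_augment(user_prompt: str) -> str:
    """Fetch the most relevant API doc and concatenate it with the user prompt."""
    best_doc = bm25.get_top_n(user_prompt.lower().split(), api_docs, n=1)[0]
    return f"{user_prompt}\nUse this API documentation for reference: {best_doc}"
```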

 

3-1. AST Accuracy on API call

 

 The paper reports AST accuracy results for the different models; the results are shown in Table 1. Each model is evaluated under different retriever settings.

 

Finetuning without Retrieval.  Table 1 shows that lightly fine-tuned Gorilla achieves SoTA zero-shot performance. At least in this regime, this shows that fine-tuning works better than retrieval.

 

 Moreover, while the ground-truth retriever only slightly reduced performance, using BM25 or GPT-Index as the retriever degraded performance considerably. These results show that plugging a sub-optimal retriever in at test time can mislead the model and produce more errors.

 

Table 1. Evaluation of LLMs on Torch Hub, HuggingFace, and TensorFlow Hub APIs

 

Finetuning with Retrieval.  The paper discusses how fine-tuning the LM together with a retriever helps performance. For this experiment, the base LLaMA was fine-tuned on the prompt, the reference API documentation, and the examples generated by GPT-4. As Table 2 shows, using a ground-truth retriever in the fine-tuning pipeline gives considerably better results than using no retriever. However, the evaluation also showed that current retrievers still have a large gap to the ground-truth retriever. Nevertheless, the conclusion is that fine-tuning with a better retriever is still the better method.

 

Table 2. Comparison of retrieval techniques

 

Figure 5. Results using the GPT retriever

 

Hallucination with LLM.  One phenomenon observed in the paper is that zero-shot prompting an LLM to invoke APIs leads to severe hallucination. Surprisingly, the paper also found that GPT-3.5 hallucinates less than GPT-4, which suggests that RLHF plays a central role in making models truthful.

 

3-2. Test-time Documentation Change

 

 The rapidly evolving landscape of API documentation often outpaces an LLM's retraining or fine-tuning schedule. This mismatch in update frequency can reduce the usefulness and reliability of the LLM. With the introduction of Gorilla's retriever-aware training, however, the model can adapt immediately to changes in API documentation. This new approach allows the model to stay up to date and relevant.

 

 For example, consider the scenario depicted in Figure 6: Gorilla's training allows it to react effectively to changes in the API. This capability ensures that the LLM remains relevant and accurate even as the underlying models and systems go through upgrades and improvements, and it reflects the model's ability to adapt to changes in API sources as organizations change their preferred model registries over time.

 

 In summary, Gorilla's ability to adapt to test-time changes in API documentation offers a variety of benefits: it maintains accuracy and relevance over time, keeps up with the rapid pace of API documentation updates, and accommodates modifications to the underlying models and systems. This makes the model a robust and reliable tool for API calls.

 

Figure 6. Gorilla's retriever-aware training enables it to respond to changes in APIs

 

3-3. API Call with Constraints

 

 The evaluation focuses on the LM's ability to understand constraints. The results are shown in Table 3 below.

 

Table 3. Evaluation of LLMs on constraint-aware API invocation

 

 Looking at the results, once constraints are added, accuracy drops across all models, with or without a retriever. Gorilla matches GPT-3.5's performance when using retrieval and achieves the best performance in the zero-shot setting. This highlights Gorilla's ability to handle APIs while accounting for the trade-offs between different constraints.

 

 

 

 

Source

https://arxiv.org/abs/2305.15334

 
