
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (Paper Review)

2023. 4. 3. 14:40

Overview of this paper

 BERT and RoBERTa have set new state-of-the-art (SoTA) performance on sentence-pair regression tasks such as semantic textual similarity (STS). However, these tasks require both sentences to be fed into the network, which causes massive computational overhead: finding the most similar pair in a collection of 10,000 sentences requires about 50 million inference computations with BERT. This construction makes BERT unsuitable for semantic similarity search as well as for unsupervised tasks such as clustering.
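The quadratic blow-up is easy to verify: a cross-encoder needs one forward pass per candidate pair, so the pass count grows as n(n-1)/2. A quick back-of-the-envelope check:

```python
# Number of BERT forward passes a cross-encoder needs to score
# every sentence pair in a collection of n sentences.
def cross_encoder_passes(n: int) -> int:
    # Each unordered pair (i, j) with i < j is one inference computation.
    return n * (n - 1) // 2

print(cross_encoder_passes(10_000))  # 49995000 -> roughly 50 million
```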

 

 ๋…ผ๋ฌธ์—์„œ๋Š” simase & triplet network๋ฅผ ์‚ฌ์šฉํ•ด์„œ cosine-similarity๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ๊ณผ ๋น„๊ตํ•  ์ˆ˜ ์žˆ๋Š” ์˜๋ฏธ์ƒ์œผ๋กœ ์˜๋ฏธ ์žˆ๋Š” sentence embedding์„ ์–ป๋Š” pre-train BERT network์— ์ˆ˜์ •์„ ๊ฐ€ํ•œ Sentence-BERT((SBERT))๋ฅผ ์„ ๋ณด์˜€๋‹ค. ์ด SBERT๋Š” BERT์™€ RoBERTa๊ฐ€ ๊ฐ€์žฅ ๋น„์Šทํ•œ ๋ฌธ์žฅ์„ ์ฐพ๋Š”๋ฐ 65์‹œ๊ฐ„์ด ๊ฑธ๋ฆฌ๋Š”๋ฐ ๋น„ํ•ด ๊ฒจ์šฐ 5์ดˆ์˜ ์‹œ๊ฐ„๋งŒ์ด ๊ฑธ๋ ธ๋‹ค!! ๊ทธ ์™€์ค‘์—๋„ BERT์˜ ์„ฑ๋Šฅ์€ ์œ ์ง€ํ•˜๋Š” ๋ชจ์Šต์„ ๋ณด์—ฌ์คฌ๋‹ค.

 

 ๋…ผ๋ฌธ์—์„œ๋Š” SBERT์™€ SRoBERTa๋ฅผ ์ผ๋ฐ˜์ ์ธ STS task์™€ ์ „์ด ํ•™์Šต task์— ๋Œ€ํ•ด์„œ ํ‰๊ฐ€ํ•˜์˜€๋Š”๋ฐ, ๋‹ค๋ฅธ SoTA sentence embedding method๋ฅผ ๋Šฅ๊ฐ€ํ•˜๋Š” ๋ชจ์Šต์„ ๋ณด์—ฌ์คฌ๋‹ค.

 

 

Table of Contents

1. Introduction

2. Model

   2-1. Training Details

3. Evaluation - Semantic Textual Similarity

4. Evaluation - SentEval

5. Computational Efficiency

 

 

1. Introduction

 ๋…ผ๋ฌธ์—์„œ๋Š” siamese & triplet network๋ฅผ BERT network์— ์ ์šฉํ•ด์„œ ์˜๋ฏธ์ƒ์œผ๋กœ ์˜๋ฏธ ์žˆ๋Š” sentence embedding์„ ์–ป์–ด๋‚ผ ์ˆ˜ ์žˆ๋Š” Sentence-BERT๋ฅผ ์†Œ๊ฐœํ•˜์˜€๋‹ค. ์ด SBERT๋Š” ์ง€๊ธˆ๊นŒ์ง€๋„ BERT๊ฐ€ ์ ์šฉ๋  ์ˆ˜ ์—†์—ˆ๋˜ ๋ถ„์•ผ์ธ ํŠน์ •์˜ ์ƒˆ๋กœ์šด task์— ๋Œ€ํ•ด ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•˜๊ฒŒ ๋งŒ๋“ค์—ˆ๋‹ค. ์ด๋Ÿฌํ•œ task์—๋Š” ํฐ ๊ทœ๋ชจ์˜ ์˜๋ฏธ ์œ ์‚ฌ๋„ ๋น„๊ต, clustering, semantic search๋ฅผ ํ†ตํ•œ ์ •๋ณด ๊ฒ€์ƒ‰์ด ์žˆ๋‹ค.

 

 BERT set new SoTA performance on various sentence classification and sentence-pair regression tasks. BERT uses a cross-encoder setup: two sentences are passed to the transformer network and the target value is predicted. However, this setup is unsuitable for various pair regression tasks due to the huge number of possible combinations.

 

 A common approach to clustering and semantic search is to map each sentence to a vector space such that semantically similar sentences end up close together. Researchers have therefore started to feed individual sentences into BERT to derive fixed-size sentence embeddings. The most commonly used methods are averaging the BERT output layer or using the output of the first token (the CLS token). As the experiments will show, these common practices yield rather bad sentence embeddings, often worse than averaged GloVe embeddings.

 

 ์ด๋Ÿฌํ•œ ๋ฌธ์ œ์ ์„ ์™„ํ™”ํ•˜๊ธฐ ์œ„ํ•ด SBERT๊ฐ€ ๊ฐœ๋ฐœ๋˜์—ˆ๋‹ค. siamese network architecture๋Š” ์ž…๋ ฅ ๋ฌธ์žฅ์œผ๋กœ ๊ณ ์ •๋œ ํฌ๊ธฐ์˜ ๋ฒกํ„ฐ๊ฐ€ ์–ป์–ด์งˆ ์ˆ˜ ์žˆ๋„๋ก ํ•ด์ฃผ์—ˆ๋‹ค. ์ฝ”์‚ฌ์ธ ์œ ์‚ฌ๋„ ๋˜๋Š” Manhatten, Euclidean distance ๊ฐ™์€ ์œ ์‚ฌ๋„ ์ธก์ • ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•˜์—ฌ ์˜๋ฏธ์ƒ์œผ๋กœ ์œ ์‚ฌํ•œ ๋ฌธ์žฅ์„ ์ฐพ์„ ์ˆ˜ ์žˆ์—ˆ๋‹ค. ์ด๋Ÿฌํ•œ ์œ ์‚ฌ๋„ ์ธก์ •์€ ํ˜„๋Œ€์˜ ํ•˜๋“œ์›จ์–ด์—์„œ ๊ทน๋„๋กœ ํšจ์œจ์ ์œผ๋กœ ์ˆ˜ํ–‰๋  ์ˆ˜ ์žˆ๋‹ค. ์ด๋Š” SBERT๊ฐ€ semantic similarity search ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ clustering์—๋„ ์‚ฌ์šฉ๋  ์ˆ˜ ์žˆ๋„๋ก ํ•ด์ฃผ์—ˆ๋‹ค. 10,000๊ฐœ์˜ ๋ฌธ์žฅ ๋ชจ์Œ์—์„œ ๊ฐ€์žฅ ๋น„์Šทํ•œ ๋ฌธ์žฅ ์Œ์„ ์ฐพ๋Š” task๋Š” BERT๋ฅผ ์‚ฌ์šฉํ–ˆ์„ ๋•Œ 65์‹œ๊ฐ„์ด ๊ฑธ๋ ธ์ง€๋งŒ, SBERT๋ฅผ ์‚ฌ์šฉํ•  ๋•Œ๋Š” ๊ฒจ์šฐ 5์ดˆ ์ •๋„์˜ ์‹œ๊ฐ„์ด ๊ฑธ๋ ธ๊ณ , ์ฝ”์‚ฌ์ธ ์œ ์‚ฌ๋„๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๋ฐ๋Š” 0.01์ดˆ ์ •๋„์˜ ์‹œ๊ฐ„์ด ๊ฑธ๋ ธ๋‹ค. ์ตœ์ ํ™”๋œ ์ธ๋ฑ์Šค ๊ตฌ์กฐ๋ฅผ ์‚ฌ์šฉํ•จ์œผ๋กœ์จ ๊ฐ€์žฅ ๋น„์Šทํ•œ Quora ์งˆ๋ฌธ์„ ์ฐพ๋Š” task๋Š” 50์‹œ๊ฐ„์—์„œ ๋ช‡ ๋ฐ€๋ฆฌ์ดˆ๋กœ ์ค„์–ด๋“ค๊ฒŒ ๋˜์—ˆ๋‹ค.

 

 ๋…ผ๋ฌธ์—์„œ๋Š” SBERT๋ฅผ NLI dataset์—์„œ fine-tune ํ•˜์˜€๋Š”๋ฐ, ์ด๊ฒƒ์€ ๊ธฐ์กด์˜ SoTA setence embedding ์ด์—ˆ๋˜ InferSent์™€ Universal Sentence Encoder๋ฅผ ์ƒ๋‹นํžˆ ๋Šฅ๊ฐ€ํ•˜๋Š” sentence embedding์„ ์ƒ์„ฑํ•œ๋‹ค. SBERT๋Š” ํŠน์ • task์— ๋Œ€ํ•ด์„œ ์ ์‘ํ•  ์ˆ˜๋„ ์žˆ๋‹ค. SBERT๋Š” ์–ด๋ ค์šด ์š”์†Œ ์œ ์‚ฌ๋„ ๋ฐ์ดํ„ฐ์…‹๊ณผ ์„œ๋กœ ๋‹ค๋ฅธ Wikipedia ๋ฌธ์„œ์—์„œ ๋‚˜์˜จ ๋ฌธ์žฅ์„ ๊ตฌ๋ถ„ํ•˜๋Š” triplet ๋ฐ์ดํ„ฐ์…‹์—์„œ ์ƒˆ๋กœ์šด SoTA performance๋ฅผ ๋‹ฌ์„ฑํ•˜์˜€๋‹ค. 

 

 

2. Model

 SBERT๋Š” ๊ณ ์ •๋œ ํฌ๊ธฐ์˜ sentence embedding์„ ์–ป๊ธฐ ์œ„ํ•ด BERT์™€ RoBERTa์˜ ์ถœ๋ ฅ์— pooling ์—ฐ์‚ฐ์„ ๊ฐ€ํ•˜์˜€๋‹ค. pooling ์—ฐ์‚ฐ์œผ๋กœ๋Š” ๋‹ค์Œ์˜ 3๊ฐ€์ง€๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์‹คํ—˜์„ ์ง„ํ–‰ํ•˜์˜€๋‹ค.

 

  • CLS-ํ† ํฐ์˜ ์ถœ๋ ฅ
  • ๋ชจ๋“  ์ถœ๋ ฅ ๋ฒกํ„ฐ์˜ ํ‰๊ท  ๊ณ„์‚ฐ((MEAN-Strategy))
  • ์ถœ๋ ฅ ๋ฒกํ„ฐ์˜ ์ตœ๋Œ€ ์‹œ๊ฐ„ ๊ณ„์‚ฐ((MAX-Strategy))

 

 To fine-tune BERT/RoBERTa, siamese and triplet networks were created to update the weights such that the produced sentence embeddings are semantically meaningful and can be compared with cosine similarity. The network structure depends on the available training data. The experiments use the following structures and objective functions.

 

Classification Objective Function.  The sentence embeddings u and v are concatenated with the element-wise difference |u - v| and multiplied with the trainable weight Wt ∈ R^(3n×k):

 

o = softmax(Wt(u, v, |u - v|))

 

 ์—ฌ๊ธฐ์„œ n์€ sentence embedding์˜ ์ฐจ์›์ด๊ณ , k๋Š” ๋ผ๋ฒจ์˜ ์ˆ˜์ด๋‹ค. ์ด ๊ฒฝ์šฐ์—๋Š” cross-entropy loss๋ฅผ ์ตœ์ ํ™”ํ•˜์˜€๋‹ค. ์ด ๊ตฌ์กฐ๋Š” ๋‹ค์Œ์˜ ๊ทธ๋ฆผ 1์— ๋ฌ˜์‚ฌ๋˜์–ด ์žˆ๋‹ค.

 

Figure 1. SBERT architecture with the classification objective function
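The classification head amounts to a single linear layer over the concatenated features (u, v, |u - v|). A sketch in numpy, with random stand-ins for the embeddings and the trainable weight:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 768, 3                            # embedding dim, number of NLI labels

u = rng.normal(size=n)                   # sentence embedding of sentence A
v = rng.normal(size=n)                   # sentence embedding of sentence B
Wt = rng.normal(size=(k, 3 * n)) * 0.01  # trainable weight Wt in R^(3n x k)

features = np.concatenate([u, v, np.abs(u - v)])   # shape (3n,)
logits = Wt @ features
o = np.exp(logits - logits.max())
o /= o.sum()                             # softmax over the k labels
print(o)
```

During training, cross-entropy between o and the gold NLI label is minimized; at inference time the head is discarded and only the embeddings u and v are used.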

 

Regression Objective Function.  The cosine similarity between the two sentence embeddings u and v is computed (Figure 2). Mean squared error is used as the objective function.

 

Figure 2. SBERT architecture with the regression objective function
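This objective scores a pair directly with cosine similarity and penalizes the squared gap to the gold similarity score. A minimal sketch (the gold label value is an arbitrary example):

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(3)
u, v = rng.normal(size=768), rng.normal(size=768)

pred = cosine(u, v)        # predicted similarity in [-1, 1]
gold = 0.8                 # example gold similarity label (rescaled)
loss = (pred - gold) ** 2  # mean squared error, averaged over a batch in practice
print(f"cosine = {pred:.3f}, loss = {loss:.3f}")
```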

 

Triplet Objective Function.  Given an anchor sentence a, a positive sentence p, and a negative sentence n, triplet loss tunes the network such that the distance between a and p is smaller than the distance between a and n. Mathematically, the following loss function is minimized:

 

max(||s_a - s_p|| - ||s_a - s_n|| + ε, 0)

 

 sx ๊ฐ๊ฐ์˜ sentence embedding a/n/p์™€ ๊ฑฐ๋ฆฌ ๊ณ„์‚ฐ ๊ณต์‹ ||โ‹…||, ๊ทธ๋ฆฌ๊ณ  ๋งˆ์ง„ ฯต์„ ์‚ฌ์šฉํ•˜์˜€๋‹ค. ๋งˆ์ง„ ฯต์€ sp๊ฐ€ sn ๋ณด๋‹ค sa์— ์ตœ์†Œํ•œ ฯต ๋” ๊ฐ€๊น๊ฒŒ ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์žฅํ•ด์ค€๋‹ค. Euclidean distance๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ฒ˜๋Ÿผ ฯต=1๋กœ ์„ค์ •ํ•˜์˜€๋‹ค.

 

2-1. Training Details

 

๋…ผ๋ฌธ์—์„œ๋Š” SBERT๋ฅผ SNLI์™€ Multi-Genre NLI ๋ฐ์ดํ„ฐ์…‹์˜ ์กฐํ•ฉ์—์„œ ํ•™์Šต์‹œ์ผฐ๋‹ค. ๊ทธ๋ฆฌ๊ณ  SBERT๋ฅผ ํ•œ ์—ํญ์—์„œ 3๊ฐ€์ง€ ๋ฐฉ๋ฒ•์˜ softmax ๋ถ„๋ฅ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ fine-tune ํ•˜์˜€๋‹ค. ๋ฐฐ์น˜ ํฌ๊ธฐ๋กœ๋Š” 16, Adam optimizer๋กœ๋Š” ํ•™์Šต๋ฅ  2e-5, ์„ ํ˜• ํ•™์Šต๋ฅ  warm-up์€ ํ•™์Šต ๋ฐ์ดํ„ฐ์˜ 10%๋กœ ํ•˜์˜€๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๊ธฐ๋ณธ pooling strategy๋Š” MEAN์ด์—ˆ๋‹ค.

 

 

3. Evaluation - Semantic Textual Similarity

 ๋…ผ๋ฌธ์—์„œ๋Š” SBERT์˜ ์„ฑ๋Šฅ์„ STS task์— ๋Œ€ํ•ด์„œ ํ‰๊ฐ€ํ•˜์˜€๋‹ค. SoTA method๋Š” ์ข…์ข… setence embedding์„ similarity score์— ๋งคํ•„ํ•˜๋Š” ํšŒ๊ท€ ํ•จ์ˆ˜๋ฅผ ํ•™์Šตํ•œ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์ด๋Ÿฌํ•œ ํšŒ๊ท€ ํ•จ์ˆ˜๋Š” ์Œ์œผ๋กœ ์ž‘๋™ํ•˜๋ฉฐ ์กฐํ•ฉ ํญ๋ฐœ๋กœ ์ธํ•ด ๋ฌธ์žฅ ๋ชจ์Œ์ด ํŠน์ • ํฌ๊ธฐ์— ๋„๋‹ฌํ•˜๋ฉด ํ™•์žฅํ•  ์ˆ˜ ์—†๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ๋‹ค. ๊ทธ ๋Œ€์‹ ์—, ๋‘ sentence embedding ๊ฐ„์˜ ์œ ์‚ฌ๋„์™€ ์ฝ”์‚ฌ์ธ ์œ ์‚ฌ๋„๋ฅผ ํ•ญ์ƒ ๋น„๊ตํ•˜์˜€๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๋˜ํ•œ negative Manhatten๊ณผ Euclidean distance์„ ์‚ฌ์šฉํ•˜์—ฌ ์‹คํ—˜์„ ์ง„ํ–‰ํ•˜์˜€๋Š”๋ฐ ๋ชจ๋“  ๋ฐฉ๋ฒ•์˜ ๊ฒฐ๊ณผ ๋น„์Šทํ•˜๊ฒŒ ๋‚˜ํƒ€๋‚ฌ๋‹ค. 

 

3-1. Unsupervised STS

 

๋…ผ๋ฌธ์—์„œ๋Š” STS ํŠน์ • ํ•™์Šต ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์ง€ ์•Š๊ณ  STS์— ๋Œ€ํ•œ SBERT์˜ ์„ฑ๋Šฅ์„ ํ‰๊ฐ€ํ•˜์˜€๋‹ค. ํ”ผ์–ด์Šจ ์ƒ๊ด€ ๊ณ„์ˆ˜ ๋ฉด์—์„œ๋Š” STS๊ฐ€ ์ž˜ ๋งž์ง€ ์•Š์Œ์„ ๋ณด์—ฌ์คฌ๋‹ค. ๊ทธ ๋Œ€์‹ ์—, sentence embedding๊ณผ gold label์˜ ์ฝ”์‚ฌ์ธ ์œ ์‚ฌ๋„ ๊ฐ„์˜ Spearman's rank ์ƒ๊ด€๋„๋ฅผ ๊ณ„์‚ฐํ•˜์˜€๋‹ค. ์ด์— ๋Œ€ํ•œ ๊ฒฐ๊ณผ๋Š” ๋‹ค์Œ์˜ ํ‘œ 1์— ๋‚˜ํƒ€๋‚˜ ์žˆ๋‹ค.

 

Table 1. Spearman rank correlation ρ between the cosine similarity of sentence representations and the gold labels for various STS tasks

 

 ์„ค๋ช…๋œ siamese network ๊ตฌ์กฐ์™€ fine-tuning ๋ฉ”์ปค๋‹ˆ์ฆ˜์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์€ ์ƒ๊ด€๋„๋ฅผ ํฌ๊ฒŒ ํ–ฅ์ƒ์‹œ์ผœ์ฃผ๊ณ  InferSent์™€ Universal Sentence Encoder๋ฅผ ํฌ๊ฒŒ ๋Šฅ๊ฐ€ํ•˜๋Š” ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์คฌ๋‹ค. RoBERTa๋Š” ์—ฌ๋Ÿฌ๊ฐ€์ง€ supervised task์— ๋Œ€ํ•ด์„œ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ์ง€๋งŒ, sentence embedding์„ ์ƒ์„ฑํ•˜๋Š” task์— ๋Œ€ํ•ด์„œ๋Š” SBERT์™€ SRoBERTa๊ฐ€ ์‚ฌ์†Œํ•œ ์ฐจ์ด๋ฅผ ๋ณด์—ฌ์คฌ๋‹ค.

 

3-2. Supervised STS

 

 STS ๋ฒค์น˜๋งˆํฌ๋Š” supervised STS ์‹œ์Šคํ…œ์„ ํ‰๊ฐ€ํ•˜๋Š” ์œ ๋ช…ํ•œ ๋ฐ์ดํ„ฐ์…‹์ด๋‹ค. BERT๋Š” ๋‘ ๊ฐœ์˜ ๋ฌธ์žฅ์„ ๋„คํŠธ์›Œํฌ์— ํ˜๋ ค๋ณด๋‚ด๊ณ  ์ถœ๋ ฅ์— ๊ฐ„๋‹จํ•œ ํšŒ๊ท€ method๋ฅผ ์‚ฌ์šฉํ•จ์œผ๋กœ์จ ์ƒˆ๋กœ์šด SoTA performance๋ฅผ ๋‹ฌ์„ฑํ•˜์˜€๋‹ค. ๋…ผ๋ฌธ์—์„œ๋Š” training set๋ฅผ ์‚ฌ์šฉํ•ด์„œ SBERT๋ฅผ fine-tune ํ•˜์˜€๋Š”๋ฐ, ์ด๋•Œ regression objective function์„ ์‚ฌ์šฉํ•˜์˜€๋‹ค. ์˜ˆ์ธก์„ ํ•  ๋•Œ, sentence embedding ๊ฐ„์˜ ์ฝ”์‚ฌ์ธ ์œ ์‚ฌ๋„๋ฅผ ๊ณ„์‚ฐํ•˜์˜€๋‹ค. ๋ชจ๋“  ์‹œ์Šคํ…œ๋“ค์€ variance๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด 10๊ฐœ์˜ ๋žœ๋คํ•œ ์‹œ๋“œ์—์„œ ํ•™์Šต๋˜์—ˆ๋‹ค. ์ด์— ๋Œ€ํ•œ ๊ฒฐ๊ณผ๋Š” ๋‹ค์Œ์˜ ํ‘œ 2์— ๋‚˜ํƒ€๋‚˜ ์žˆ๋‹ค.

 

 

Table 2. Evaluation on the test set of the supervised STS benchmark

 

4. Evaluation - SentEval

 SentEval์€ sentence embedding์˜ ํ€„๋ฆฌํ‹ฐ๋ฅผ ํ‰๊ฐ€ํ•˜๊ธฐ ์œ„ํ•œ ํˆดํ‚ท์ด๋‹ค. sentence embedding์€ logistic regression ๋ถ„๋ฅ˜๊ธฐ๋ฅผ ์œ„ํ•œ feature๋กœ ์‚ฌ์šฉ๋œ๋‹ค. logistic regression ๋ถ„๋ฅ˜๊ธฐ๋Š” 10-fold cross-validation ์…‹์—…์—์„œ ๋‹ค์–‘ํ•œ task์—์„œ ํ•™์Šต๋˜์—ˆ๊ณ  ์˜ˆ์ธก ์ •ํ™”๋„๋Š” test-fold์— ๋Œ€ํ•ด์„œ ํ•™์Šต๋˜์—ˆ๋‹ค.

 

 SBERT sentence embedding์˜ ๋ชฉ์ ์€ ๋‹ค๋ฅธ task๋ฅผ ์œ„ํ•ด ์ „์ด ํ•™์Šต์œผ๋กœ ์‚ฌ์šฉ๋˜์ง€ ์•Š๋Š” ๊ฒƒ์ด๋‹ค. ๋…ผ๋ฌธ์—์„œ๋Š” ๊ธฐ์กด BERT ๋…ผ๋ฌธ์—์„œ ์ฒ˜๋Ÿผ ์ƒˆ๋กœ์šด task์— ๋Œ€ํ•ด BERT๋ฅผ fine-tuning ํ•˜๋Š” ๊ฒƒ์€ BERT ๋„คํŠธ์›Œํฌ์˜ ๋ชจ๋“  ๋ ˆ์ด์–ด๋ฅผ ์—…๋ฐ์ดํŠธํ•˜๋Š” ๋”์šฑ ์ ํ•ฉํ•œ method์ด๋‹ค. ํ•˜์ง€๋งŒ SentEval์€ ๋‹ค์–‘ํ•œ task์— ๋Œ€ํ•ด sentence embedding์˜ ํ€„๋ฆฌํ‹ฐ์— ์ธ์ƒ์„ ์ฃผ๊ณ  ์žˆ๋‹ค.

 

 ๋…ผ๋ฌธ์—์„œ๋Š” SBERT sentence embedding์„ 7๊ฐ€์ง€์˜ SentEval transfer task์— ๋Œ€ํ•ด์„œ ๋น„๊ตํ•˜์˜€๋‹ค. ๊ฒฐ๊ณผ๋Š” ๋‹ค์Œ์˜ ํ‘œ 3์— ๋‚˜ํƒ€๋‚˜ ์žˆ๋‹ค.

 

Table 3. Evaluation of SBERT sentence embeddings using the SentEval toolkit

 

 Average BERT embeddings and the CLS-token output of the BERT network achieve bad results on the various STS tasks, worse than averaged GloVe embeddings. On SentEval, however, average BERT embeddings and the BERT CLS-token output achieve decent results, surpassing averaged GloVe embeddings. The reason is the different setup: the STS tasks measure the similarity between sentence embeddings with cosine similarity, which treats all dimensions equally. In contrast, SentEval fits a logistic regression classifier to the sentence embeddings, which allows certain dimensions to have a higher or lower impact on the classification result.

 

 

5. Computational Efficiency

 sentence embedding์€ ์ž ์žฌ์ ์œผ๋กœ ๋ช‡๋ฐฑ๋งŒ ๊ฐœ์˜ ๋ฌธ์žฅ์„ ๊ณ„์‚ฐํ•ด์•ผํ•  ํ•„์š”๊ฐ€ ์žˆ๋‹ค. ๋”ฐ๋ผ์„œ ๋น ๋ฅธ ๊ณ„์‚ฐ ์†๋„๊ฐ€ ์š”๊ตฌ๋œ๋‹ค. ์ด ์„น์…˜์—์„œ๋Š” SBERT์™€ ํ‰๊ท  GloVe embedding, InferSent, Universal Sentenve Encoder์„ ๋น„๊ตํ•˜์˜€๋‹ค. ๊ทธ์— ๋Œ€ํ•œ ๊ฒฐ๊ณผ๋Š” ๋‹ค์Œ์˜ ํ‘œ 4์™€ ๊ฐ™๋‹ค.

 

Table 4. Computation speed of sentence embedding methods. Higher is better.

 

 CPU์—์„œ๋Š” InferSent๊ฐ€ SBERT๋ณด๋‹ค ๋” ๋น ๋ฅธ ์†๋„๋ฅผ ๋ณด์—ฌ์คฌ๋‹ค. ์ด๋Š” ๋”์šฑ ๊ฐ„๋‹จํ•œ ๋„คํŠธ์›Œํฌ architecture ๋•Œ๋ฌธ์ด๋‹ค. InferSent๋Š” ํ•˜๋‚˜์˜ Bi-LSTM ๋ ˆ์ด์–ด๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๋ฐ˜๋ฉด์— BERT๋Š” 12๊ฐœ์˜ ์Œ“์—ฌ์žˆ๋Š” transformer ๋ ˆ์ด์–ด๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค. ํ•˜์ง€๋งŒ, transformer ๋„คํŠธ์›Œํฌ์˜ ๊ณ„์‚ฐ์  ์žฅ์ ์€ GPU์—์„œ ๋“œ๋Ÿฌ๋‚œ๋‹ค. ์—ฌ๊ธฐ์„œ smart batching์„ ์‚ฌ์šฉํ•œ SBERT๋Š” ๊ฐ€์žฅ ๋น ๋ฅธ ์†๋„๋ฅผ ๋ณด์—ฌ์คฌ๋‹ค. 

 

 

 

 

Source

https://arxiv.org/abs/1908.10084

 


 
