Paper Reading ๐Ÿ“œ/Natural Language Processing

Data Augmentation methods in NLP

2023. 3. 29. 21:39

 ํ˜„์žฌ ๋”ฅ๋Ÿฌ๋‹ ๋ถ„์•ผ์—์„œ๋Š” ๋ฐ์ดํ„ฐ์˜ ๋ถ€์กฑ์— ์‹œ๋‹ฌ๋ฆฌ๊ณ  ์žˆ๋‹ค. ์™œ๋ƒํ•˜๋ฉด ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ค๊ธฐ ์œ„ํ•ด์„œ๋Š” ๋” ๋งŽ์€ ๋ฐ์ดํ„ฐ๊ฐ€ ํ•„์ˆ˜์ ์ธ๋ฐ ์ด๋ฅผ ์œ„ํ•ด ํ•„์š”ํ•œ ๋ฐ์ดํ„ฐ์˜ ์–‘์€ ํ•œ์ •์ ์ด๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ๋”ฐ๋ผ์„œ ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋ฐœ๋ช…๋œ ๊ธฐ์ˆ ์ด Data Augmentation์ด๋‹ค. Data Augmentation์— ๋Œ€ํ•ด ๊ฐ„๋žตํ•˜๊ฒŒ ์„ค๋ช…ํ•˜๋ฉด ๊ธฐ์กด์— ์กด์žฌํ•˜๋Š” ๋ฐ์ดํ„ฐ์— ์•ฝ๊ฐ„์˜ ๋ณ€ํ˜• ๋˜๋Š” ์†์ƒ์„ ๊ฐ€ํ•ด์„œ ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ๋ฅผ ๋งŒ๋“œ๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค. ์ด ๋ฐฉ๋ฒ•์€ ์ฃผ๋กœ Computer VIsion ๋ถ„์•ผ์—์„œ ์‚ฌ์šฉ๋˜๋Š”๋ฐ NLP์—๋„ Data Augmentation ๊ธฐ๋ฒ•์ด ์กด์žฌํ•œ๋‹ค๋Š” ์‚ฌ์‹ค์„ ์•Œ๊ฒŒ ๋˜๊ณ  ํ•œ ๋ฒˆ ๊ณต๋ถ€ํ•ด๋ณด๋ฉด์„œ ํฌ์ŠคํŠธ๋ฅผ ์ž‘์„ฑํ•˜์˜€๋‹ค. ์ด ํฌ์ŠคํŠธ๋Š” ๋‹ค์Œ์˜ ๋ธ”๋กœ๊ทธ๋“ค์„ ์ฐธ๊ณ ํ•˜์—ฌ ์ž‘์„ฑ๋˜์—ˆ๋‹ค.

 

https://neptune.ai/blog/data-augmentation-nlp
Data Augmentation in NLP: Best Practices From a Kaggle Master - neptune.ai

https://amitness.com/2020/05/data-augmentation-for-nlp/?fbclid=IwAR11MkccCti-2cD93RYftNPHb7Wxdj7AlZG7NNG4EhPaBkmiJkcBPtdl1eo
A Visual Survey of Data Augmentation in NLP - amitness.com

 

 

How does augmentation differ between vision and NLP?

 vision์˜ augmentation๊ณผ NLP์˜ augmentation์—๋Š” ์ฐจ์ด์ ์ด ๋ถ„๋ช…ํžˆ ์กด์žฌํ•œ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด ์ƒ๊ฐํ•ด๋ณด์ž. vision์—์„œ ๊ณ ์–‘์ด ์‚ฌ์ง„์ด ์ฃผ์–ด์กŒ๋‹ค๊ณ  ํ•˜์˜€์„ ๋•Œ, ๋น„์ „์—์„œ ๊ฐ€์žฅ ํ”ํ•˜๊ฒŒ ์‚ฌ์šฉ๋˜๋Š” ๋ฐฉ๋ฒ•์ธ grayscale์ด๋‚˜ ํšŒ์ „ ๋“ฑ์„ ์ ์šฉํ•œ๋‹ค๊ณ  ํ•ด๋ณด์ž. ๊ทธ๋Ÿฌ๋ฉด ์ด์™€ ๋น„์Šทํ•˜๊ฒŒ NLP์—์„œ๋„ ๋ฌธ์žฅ๋‚ด ๋‹จ์–ด์˜ ์ˆœ์„œ๋ฅผ ์…”ํ”Œํ•˜๋ฉด augmentation ํ•˜๊ธฐ ์ด์ „์˜ ๋ฌธ์žฅ๊ณผ ๋˜‘๊ฐ™์€ ์˜๋ฏธ๋ฅผ ๊ฐ–๋Š” ๋ฌธ์žฅ์ด ๋‚˜์˜ฌ๊นŒ? ๋‹ค์Œ์˜ ๊ทธ๋ฆผ์„ ๋ณด๋ฉด ๊ทธ๋ ‡์ง€ ์•Š์Œ์„ ํ™•์—ฐํ•˜๊ฒŒ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

 

 

vision์˜ augmentation method์™€ NLP์˜ augmentation method์—๋Š” ๋ถ„๋ช…ํ•œ ์ฐจ์ด๊ฐ€ ์กด์žฌ

 

 ํ•œ ๋งˆ๋””๋กœ ๋น„์ „์—์„œ๋Š” ๋ฐ์ดํ„ฐ๊ฐ€ ์ฃผ์–ด์ง€๋ฉด data generator๋ฅผ ์‚ฌ์šฉํ•ด์„œ ์‹ ๊ฒฝ๋ง ๋„คํŠธ์›Œํฌ์— ๋“ค์–ด๊ฐ€๋Š” ๋ฐ์ดํ„ฐ ๋ฐฐ์น˜๋ฅผ ๋žœ๋คํ•˜๊ฒŒ ๋ณ€ํ˜•$($augmentation$)$ํ•˜๋ฉด ๋˜๋ฏ€๋กœ ํ•™์Šต ์ „์— ๋”ฐ๋กœ ์ค€๋น„ํ•ด์•ผํ•  ์‚ฌํ•ญ์€ ์—†๋‹ค.

 

 ํ•˜์ง€๋งŒ NLP์—์„œ๋Š” ๊ทธ๋ ‡์ง€ ์•Š๋‹ค. ๋ฌธ์žฅ์˜ ๋ฌธ๋ฒ•์ ์ธ ๊ตฌ์กฐ ๋•Œ๋ฌธ์— ๋งค์šฐ ์„ธ๋ฐ€ํ•˜๊ฒŒ data augmentation์„ ์ง„ํ–‰ํ•ด์•ผ ํ•œ๋‹ค. ์•ž์œผ๋กœ์˜ ๋‚ด์šฉ๋“ค์€ ๋ชจ๋‘ ํ•™์Šต ์ด์ „์— ์‚ฌ์šฉํ•ด์•ผ ํ•˜๋Š” method๋“ค์ด๋‹ค. ์ด์ œ๋ถ€ํ„ฐ NLP์˜ data augmentation method์— ๋Œ€ํ•ด ์•Œ์•„๋ณด๋„๋ก ํ•˜์ž!! ๐Ÿ”ฅ

 

 

NLP Data Augmentation Methods

1. Lexical Substitution

 

 Lexical Substitution์€ ํ…์ŠคํŠธ์— ๋‚˜ํƒ€๋‚˜ ์žˆ๋Š” ๋‹จ์–ด๋ฅผ ๋ฌธ์žฅ์˜ ์˜๋ฏธ๋ฅผ ํ•ด์น˜์ง€ ์•Š๋Š” ์„ ์—์„œ ๋Œ€์ฒดํ•˜๋Š” ๋ฐฉ์‹์ด๋‹ค.

 

a. Thesaurus-based substitution

 

 ์ด method ์—์„œ๋Š” ๋ฌธ์žฅ ๋‚ด์—์„œ ๋žœ๋คํ•œ ๋‹จ์–ด๋ฅผ ๊ณ ๋ฅด๊ณ  Thesaurus๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋™์˜์–ด๋กœ ๋Œ€์ฒดํ•˜์˜€๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ์˜์–ด์— ๋Œ€ํ•œ WordNet ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค๋ฅผ ์‚ฌ์šฉํ•ด์„œ ๋™์˜์–ด๋ฅผ ์ฐพ๊ณ  ๋Œ€์ฒด๋ฅผ ์ˆ˜ํ–‰ํ•œ๋‹ค. ์ด ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค๋Š” ๋‹จ์–ด๋“ค ๊ฐ„์˜ ์—ฐ๊ด€์„ฑ์„ ์ˆ˜๋™์œผ๋กœ ์ •๋ฆฌํ•ด๋†“์€ ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค์ด๋‹ค.

 

Example of thesaurus-based substitution: awesome is replaced with the similar word amazing.
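As a minimal sketch of this idea, the snippet below substitutes words using a tiny hand-made thesaurus. The word lists are illustrative assumptions standing in for a real resource such as WordNet (which could be queried through `nltk.corpus.wordnet`):

```python
import random

# Tiny hand-made thesaurus; a real implementation would query WordNet.
# The word lists here are illustrative assumptions, not real WordNet data.
THESAURUS = {
    "awesome": ["amazing", "wonderful", "fantastic"],
    "movie": ["film"],
}

def thesaurus_substitute(sentence, seed=0):
    """Replace every word that has a thesaurus entry with a random synonym."""
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        synonyms = THESAURUS.get(word.lower())
        out.append(rng.choice(synonyms) if synonyms else word)
    return " ".join(out)

print(thesaurus_substitute("The movie was awesome"))
```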

 

b. Word-Embeddings Substitution

 

 ์ด ๋ฐฉ์‹์—์„œ๋Š” Word2Vec, GloVe ๊ฐ™์€ pre-trained ์›Œ๋“œ ์ž„๋ฒ ๋”ฉ์„ ๊ฐ€์ง€๊ณ  ๋ฌธ์žฅ ๋‚ด์˜ ๋ช‡๋ช‡ ๋‹จ์–ด์— ๋Œ€ํ•œ ๋Œ€์ฒด๋ฅผ ์œ„ํ•ด ์ž„๋ฒ ๋”ฉ ๊ณต๊ฐ„์—์„œ ๊ทผ์ ‘ํ•ด ์žˆ๋Š” ๋‹จ์–ด๋“ค์„ ์‚ฌ์šฉํ•˜์˜€๋‹ค.

 

Word2Vec์—์„œ ์ธ์ ‘ํ•œ ๋‹จ์–ด๋“ค

 

For example, using the three most similar words yields three augmented variants of the text.

 

Word-Embedding Substitution์˜ ์˜ˆ์‹œ

 

c. Masked Language Model

 

Transformer models such as BERT, RoBERTa, and ALBERT are trained on vast amounts of data with the pretext task of "Masked Language Modeling" (MLM), in which the model must predict masked words from their context.

 

 ์ด๋Š” ํ…์ŠคํŠธ๋ฅผ augmentํ•˜๋Š” ๋ฐ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, pre-trained BERT model์„ ์‚ฌ์šฉํ•  ๋•Œ, ํ…์ŠคํŠธ์˜ ๋ช‡๋ช‡ ๋ถ€๋ถ„์„ maskํ•ด๋‘๊ณ  BERT model์—๊ฒŒ masked token์„ ์˜ˆ์ธกํ•˜๋„๋ก ๋ฌผ์–ด๋ณผ ์ˆ˜ ์žˆ๋‹ค.

 

BERT์˜ MLM์˜ ์ž‘๋™ ๋ฐฉ์‹

 

 ๋”ฐ๋ผ์„œ, maske prediction์„ ํ†ตํ•ด ๋‹ค์–‘ํ•œ ํ…์ŠคํŠธ๋ฅผ ์ƒ์„ฑํ•  ์ˆ˜ ์žˆ๋‹ค. ์ด์ „์˜ ๋ฐฉ์‹๋“ค๊ณผ ๋น„๊ตํ•  ๋•Œ ์ƒ์„ฑ๋œ ํ…์ŠคํŠธ๋Š” ๋ชจ๋ธ์ด ์˜ˆ์ธกํ•  ๋•Œ ์ปจํ…์ŠคํŠธ๋ฅผ ๊ณ ๋ คํ•˜๋ฏ€๋กœ ๋ฌธ๋ฒ•์ ์œผ๋กœ ๋” ์ผ๊ด€์„ฑ์ด ์žˆ๋‹ค.

 

MLM method์˜ ์˜ˆ์‹œ

 

d. TF-IDF based word replacement

 

 ์ด augmentation method์˜ ๊ธฐ๋ณธ์ ์ธ ์•„์ด๋””์–ด๋Š” ๋‚ฎ์€ TF-IDF ์ ์ˆ˜๋ฅผ ๊ฐ€์ง€๋Š” ๋‹จ์–ด๋“ค์€ ์ •๋ณด๊ฐ€ ์—†๊ณ  ๋”ฐ๋ผ์„œ, ๋ฌธ์žฅ์˜ ground-truth ๋ผ๋ฒจ์— ์˜ํ–ฅ์„ ๋ฐ›์ง€ ์•Š๊ณ  ๋Œ€์ฒด๋  ์ˆ˜ ์žˆ๋‹ค.

 

 ๊ธฐ์กด ๋‹จ์–ด์—์„œ ๋Œ€์ฒด๋˜๋Š” ๋‹จ์–ด๋Š” ์ „์ฒด ๋ฌธ์„œ์—์„œ ๋‹จ์–ด์˜ TF-IDF ์ ์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•จ์œผ๋กœ์จ ์„ ํƒ๋˜๊ณ  ๊ฐ€์žฅ ๋‚ฎ์€ ์ ์ˆ˜๋ฅผ ๊ฐ€์ง€๋Š” ๋‹จ์–ด๋ฅผ ์„ ํƒํ•œ๋‹ค.

 

2. Back Translation

 

 ์ด method์—์„œ๋Š” ํ…์ŠคํŠธ ๋ฐ์ดํ„ฐ๋ฅผ ์–ด๋–ค ์–ธ์–ด๋กœ ๋ฒˆ์—ญํ•˜๊ณ  ๋‹ค์‹œ ์›๋ž˜์˜ ์–ธ์–ด๋กœ ๋ฒˆ์—ญํ•œ๋‹ค. ์ด๋Ÿฌํ•œ ๋ฐฉ์‹์€ ๋ฌธ์žฅ์˜ ์˜๋ฏธ๋Š” ๋ณด์กดํ•˜๋Š” ๋Œ€์‹  ๋‹จ์–ด๋“ค์˜ ๊ตฌ์„ฑ์ด ๋‹ฌ๋ผ์ง„๋‹ค.

 

Back translation example: an English sentence is translated into French, then translated back into English.

 

As the example shows, the surface form of the sentence changes while its meaning does not.

 

3. Text Surface Transformation

 

 ์ด method๋Š” ์ •๊ทœ์‹์„ ์‚ฌ์šฉํ•˜์—ฌ ์ ์šฉ๋˜๋Š” ๊ฐ„๋‹จํ•œ ํŒจํ„ด ์ผ์น˜ ๋ณ€ํ™˜์ด๋‹ค. Text Surface Transformation ๋…ผ๋ฌธ์—์„œ๋Š” ์ถ•์•ฝ์—์„œ ํ™•์žฅ์œผ๋กœ ๋˜๋Š” ๊ทธ ๋ฐ˜๋Œ€๋กœ ์–ธ์–ด ํ˜•์‹์„ ๋ณ€ํ™˜ํ•˜๋Š” ์˜ˆ๋ฅผ ์ œ๊ณตํ•œ๋‹ค. ์ด๋ฅผ ์ ์šฉํ•˜์—ฌ augmented text๋ฅผ ์ƒ์„ฑํ•  ์ˆ˜ ์žˆ๋‹ค.

 

์ถ•์•ฝํ˜•๊ณผ ์ผ๋ฐ˜ํ˜•

 

 ์ถ•์•ฝํ˜•๊ณผ ์ผ๋ฐ˜ํ˜•์˜ ๋ณ€ํ™˜์€ ๋ฌธ์žฅ์˜ ์˜๋ฏธ๋ฅผ ๋ฐ”๊พธ์ง€ ์•Š๊ธฐ ๋•Œ๋ฌธ์—, ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋ชจํ˜ธํ•œ ์ƒํ™ฉ์—์„œ๋Š” ์ œ๋Œ€๋กœ ์ž‘๋™ํ•˜์ง€ ์•Š๋Š” ๋ชจ์Šต์„ ๋ณด์—ฌ์ฃผ๊ธฐ๋„ ํ•œ๋‹ค.

 

The meaning differs depending on how She's is interpreted.

 

To handle this, the paper allows ambiguous contractions but skips ambiguous expansions.

 

์ถ•์•ฝ์€ ํ—ˆ์šฉ! ํ™•์žฅ์€ ๋ถˆ๊ฐ€!

 

4. Easy Data Augmentation (EDA)

 

 Easy Data Augmentation์€ ํ†ต์ƒ์ ์œผ๋กœ ์‚ฌ์šฉ๋˜๊ณ  ๊ฐ€์žฅ ๊ฐ„๋‹จํ•œ data augmentation ๋ฐฉ๋ฒ• ์ค‘ ํ•˜๋‚˜์ด๋‹ค. EDA๋Š” 4๊ฐœ์˜ ๊ฐ„๋‹จํ•˜์ง€๋งŒ ๋งค์šฐ ์ข‹์€ ๋ชจ์Šต์„ ๋ณด์—ฌ์ฃผ๋Š” ์—ฐ์‚ฐ์œผ๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ๋‹ค. ์ด ๋ฐฉ๋ฒ•๋“ค์€ overfitting์„ ๋ฐฉ์ง€ํ•ด์ฃผ๊ณ  robust model์„ ํ•™์Šต์‹œํ‚ค๋Š”๋ฐ ๋„์›€์„ ์ค€๋‹ค. EDA๋Š” ํ•˜๋‚˜์˜ ๋…ผ๋ฌธ์ด ์žˆ๋Š”๋ฐ ๊ถ๊ธˆํ•˜๋‹ค๋ฉด ๋…ผ๋ฌธ์„ ์ฝ์–ด๋ณด๋Š” ๊ฒƒ๋„ ์ถ”์ฒœํ•œ๋‹ค. 

 

a. Synonym Replacement

 

Similar to the thesaurus-based substitution described above, this operation replaces a random word in the sentence with a synonym; in other words, it paraphrases the sentence.

 

b. Random Insertion

 

 ๋ฌธ์žฅ ๋‚ด์—์„œ ๋ถˆ์šฉ์–ด๊ฐ€ ์•„๋‹Œ ๋žœ๋คํ•œ ๋‹จ์–ด์˜ ๋žœ๋คํ•œ ๋™์˜์–ด๋ฅผ ์ฐพ๋Š”๋‹ค. ์ด ๋™์˜์–ด๋ฅผ ๋ฌธ์žฅ ๋‚ด์—์„œ ๋žœ๋คํ•œ ์œ„์น˜์— ์‚ฝ์ž…ํ•œ๋‹ค. ์ด ๊ณผ์ •์„ $n$๋ฒˆ ๋ฐ˜๋ณตํ•œ๋‹ค. 

 

Random Insertion example

 

c. Random Swap

 

 ๋ฌธ์žฅ ๋‚ด์—์„œ ๋žœ๋คํ•œ ๋‘ ๋‹จ์–ด์˜ ์œ„์น˜๋ฅผ ๋ฐ”๊พธ๋Š” ๊ฒƒ์ด๋‹ค. 

 

Random Swap example

 

d. Random Deletion

 

 ๋ฌธ์žฅ์—์„œ ๊ฐ ๋‹จ์–ด๋ฅผ ํ™•๋ฅ  $p$๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋žœ๋คํ•˜๊ฒŒ ์ œ๊ฑฐํ•œ๋‹ค.

 

Random Deletion example
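The four EDA operations can be sketched as below. The two-entry synonym table is an illustrative assumption; the EDA paper uses WordNet:

```python
import random

# Illustrative synonym table standing in for WordNet.
SYNONYMS = {"quick": ["fast", "rapid"], "happy": ["glad", "joyful"]}

def synonym_replacement(words, n, rng):
    """Replace n random words that have synonyms."""
    words = words[:]
    candidates = [i for i, w in enumerate(words) if w in SYNONYMS]
    for i in rng.sample(candidates, min(n, len(candidates))):
        words[i] = rng.choice(SYNONYMS[words[i]])
    return words

def random_insertion(words, n, rng):
    """Insert random synonyms of random words at random positions, n times."""
    words = words[:]
    for _ in range(n):
        candidates = [w for w in words if w in SYNONYMS]
        if not candidates:
            break
        syn = rng.choice(SYNONYMS[rng.choice(candidates)])
        words.insert(rng.randrange(len(words) + 1), syn)
    return words

def random_swap(words, n, rng):
    """Swap two random positions, n times."""
    words = words[:]
    for _ in range(n):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p, rng):
    """Drop each word with probability p (but never delete everything)."""
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]

rng = random.Random(42)
print(random_swap("the quick brown fox is happy".split(), 1, rng))
```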

 

5. Instance Crossover Augmentation

 

์ด method๋Š” ์œ ์ „ํ•™์—์„œ ๋ฐœ์ƒํ•˜๋Š” ์—ผ์ƒ‰์ฒด ๊ต์ฐจ์—์„œ ์˜๊ฐ์„ ๋ฐ›์•˜๋‹ค. ์—ฌ๊ธฐ์„œ๋Š” tweet์ด ๋ฐ˜์œผ๋กœ ๋‚˜๋ˆ ์ง€๊ณ  ๋˜‘๊ฐ™์€ ์–‘๊ทน$($positive/negative$)$์˜ ๋‘ ๊ฐœ์˜ ๋žœ๋คํ•œ tweet๋“ค๋ผ๋ฆฌ๋Š” ์„œ๋กœ์˜ ์ ˆ๋ฐ˜์ด ๋ฐ”๋€Œ์–ด์ง„๋‹ค. ๊ฐ€์„ค์€ ๊ฒฐ๊ณผ๊ฐ€ ๋น„๋ฌธ๋ฒ•์ ์ด๊ณ  ์˜๋ฏธ๋ก ์ ์œผ๋กœ ๋ถ€์ ์ ˆํ•˜๋”๋ผ๋„ ์ƒˆ ํ…์ŠคํŠธ๊ฐ€ ์—ฌ์ „ํžˆ sentiment๋ฅผ ๋ณด์กดํ•œ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค.

 

Instance Crossover Augmentation example

 

 ์ด method๋Š” ์ •ํ™•๋„ ์ธก๋ฉด์—์„œ ๋ณ„ ํšจ๊ณผ๊ฐ€ ์—†์ง€๋งŒ F1-score์—๋Š” ๋„์›€์ด ๋˜๋Š” ๋ชจ์Šต์„ ๋ณด์—ฌ์คฌ๋‹ค.

 

6. NLP Albumentation

 

 ์ด ํฌ์ŠคํŠธ์˜ ์•ž๋ถ€๋ถ„์—์„œ computer vision data augmentation๊ณผ NLP data augmentation์˜ ์ฐจ์ด์— ๋Œ€ํ•ด์„œ ์–˜๊ธฐํ•˜์˜€๋‹ค. ํ•˜์ง€๋งŒ ์ด ์„น์…˜์—์„œ๋Š” CV data augmentation์—์„œ ์‚ฌ์šฉ๋˜๋Š” ๋ฐฉ๋ฒ•๋“ค์„ ์–ด๋–ป๊ฒŒ NLP์— ์ ์šฉํ•  ์ˆ˜ ์žˆ์„ ์ง€์— ๋Œ€ํ•ด์„œ ์•Œ์•„๋ณผ ๊ฒƒ์ด๋‹ค. ๊ทธ๋Ÿฌ๊ธฐ ์œ„ํ•ด Albumentation ํŒจํ‚ค์ง€๋ฅผ ์‚ฌ์šฉํ• ํ…๋ฐ, ์—ฌ๊ธฐ์—๋Š” ์–ด๋– ํ•œ ๊ธฐ์ˆ ๋“ค์ด ์žˆ๋Š”์ง€ ์•Œ์•„๋ณด๋„๋ก ํ•˜์ž.

 

a. Shuffle Sentences Transform

 

 ์ด ๋ณ€ํ˜•์—์„œ๋Š” ์—ฌ๋Ÿฌ ๊ฐœ์˜ ๋ฌธ์žฅ์„ ํฌํ•จํ•˜๋Š” ํ…์ŠคํŠธ ์ƒ˜ํ”Œ์ด ์ฃผ์–ด์ง€๋ฉด ์ด ๋ฌธ์žฅ๋“ค์€ ์…”ํ”Œ๋˜์„œ ์ƒˆ๋กœ์šด ์ƒ˜ํ”Œ์„ ๋งŒ๋“ค์–ด๋‚ธ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด text = '<Sentence1>. <Sentence2>. <Sentence3>. <Sentence4>. <Sentence5>. <Sentence5>'๊ฐ€ ์ฃผ์–ด์ง€๋ฉด text = '<Sentence2>. <Sentence3>. <Sentence1>. <Sentence5>. <Sentence5>. <Sentence4>.'๋กœ ๋ณ€ํ˜•๋˜๋Š” ๊ฒƒ์ด๋‹ค.

 

b. Exclude Duplicate Transform

 

 ์ด ๋ณ€ํ˜•์—์„œ๋Š” ๋ณต์ œ๊ฐ€ ๋˜์–ด ์žˆ๋Š” ๋ฌธ์žฅ์„ ํฌํ•จํ•œ ์—ฌ๋Ÿฌ ๋ฌธ์žฅ์„ ํฌํ•จํ•œ ํ…์ŠคํŠธ ์ƒ˜ํ”Œ์ด ์ฃผ์–ด์ง€๋ฉด, ์ด ๋ณต์ œ ๋ฌธ์žฅ๋“ค์€ ์ƒˆ๋กœ์šด ์ƒ˜ํ”Œ์„ ๋งŒ๋“ค๊ธฐ ์œ„ํ•ด ์‚ญ์ œ๋œ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด text = โ€˜<Sentence1>. <Sentence2>. <Sentence4>. <Sentence4>. <Sentence5>. <Sentence5>.โ€™ ๊ฐ€ ์ฃผ์–ด์ง€๋ฉด โ€˜<Sentence1>. <Sentence2>.<Sentence4>. <Sentence5>.โ€™ ๋กœ ๋ณ€ํ˜•๋˜๋Š” ๊ฒƒ์ด๋‹ค. 

 

 ์ด ์™ธ์—๋„ Albumentation ํŒจํ‚ค์ง€์—๋Š” ๋งŽ์€ ๋ณ€ํ˜•๋“ค์ด ์กด์žฌํ•˜์ง€๋งŒ, ๋ณธ ํฌ์ŠคํŠธ์—์„œ๋Š” ์ด ์ •๋„๋งŒ ๋‹ค๋ค„๋ณด๋„๋ก ํ•˜๊ฒ ๋‹ค.

 

7. NLP Mixup

 

 Mixup์€ ๊ฐ„๋‹จํ•˜์ง€๋ง‰ ํšจ๊ณผ์ ์ธ image augmentation ๋ฐฉ๋ฒ•์ด๋‹ค. Mixup์˜ ์•„์ด๋””์–ด๋Š” ํ•™์Šต์„ ์œ„ํ•œ ํ•ฉ์„ฑ ์˜ˆ์ œ๋ฅผ ์ƒ์„ฑํ•˜๊ธฐ ์œ„ํ•ด ์ผ์ • ๋น„์œจ๋กœ ๋ฏธ๋‹ˆ ๋ฐฐ์น˜์—์„œ ๋‘ ๊ฐœ์˜ ๋ฌด์ž‘์œ„ ์ด๋ฏธ์ง€๋ฅผ ๊ฒฐํ•ฉํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ์ด๋ฏธ์ง€์— ๋Œ€ํ•ด ์ด๊ฒƒ์€ ๋‘ ๊ฐœ์˜ ์„œ๋กœ ๋‹ค๋ฅธ ํด๋ž˜์Šค์˜ ์ด๋ฏธ์ง€ ํ”ฝ์…€์„ ํ•ฉ์นœ๋‹ค๋Š” ์˜๋ฏธ์ด๋‹ค. Mixup์€ ํ•™์Šต ์ค‘์— ์ •๊ทœํ™”์˜ ํ˜•ํƒœ๋กœ ์ž‘๋™ํ•œ๋‹ค.

 

The original Mixup algorithm

 

 Mixup์˜ ์•„์ด์ด๋”๋ฅผ NLP๋กœ ๋Œ๊ณ  ์™€์„œ ํ…์ŠคํŠธ์—์„œ ์ž‘๋™ํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•˜์˜€๋‹ค. ์ด ๋ฐฉ๋ฒ•์€ Mixup์„ ํ…์ŠคํŠธ์— ์ ์šฉํ•œ ๋‘ ๊ฐ€์ง€ ์ƒˆ๋กœ์šด ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•˜์˜€๋‹ค.

 

a. wordMixup

 

 ์ด ๋ฐฉ๋ฒ•์—์„œ๋Š” ๋ฏธ๋‹ˆ ๋ฐฐ์น˜์—์„œ ๋‘ ๊ฐœ์˜ ๋žœ๋คํ•œ ๋ฌธ์žฅ๋“ค์ด ๋“ค์–ด์˜ค๊ณ  ๋˜‘๊ฐ™์€ ๊ธธ์ด๋กœ zero-padding ๋œ๋‹ค. ๊ทธ ๋‹ค์Œ์— ์ด๋“ค์˜ word embedding์€ ๋ณดํ†ต์˜ text classification์— ๋Œ€ํ•œ ํ๋ฆ„์œผ๋กœ ์ง€๋‚˜๊ฐ„๋‹ค. cross-entropy loss์€ ์ฃผ์–ด์ง„ ๋น„์œจ์—์„œ ๊ธฐ์กด ํ…์ŠคํŠธ์˜ ๋‘ ๋ผ๋ฒจ ๋ชจ๋‘์— ๋Œ€ํ•ด ๊ณ„์‚ฐ๋œ๋‹ค.

 

The wordMixup process

 

b. sentMixup

 

 ์ด ๋ฐฉ๋ฒ•์—์„œ๋Š” ๋‘ ๊ฐœ์˜ ๋ฌธ์žฅ์ด ๋“ค์–ด์˜ค๊ณ  ๋˜‘๊ฐ™์€ ๊ธธ์ด๋กœ zero-padding ๋œ๋‹ค. ๊ทธ ๋‹ค์Œ์— ์ด๋“ค์˜ word embedding์€ LSTM/CNN encoder์„ ํ†ต๊ณผํ•˜๊ณ  ๋งˆ์ง€๋ง‰ hidden state๋ฅผ sentence embedding์œผ๋กœ ๋ฐ›์•„๋“ค์ธ๋‹ค. ์ด๋Ÿฌํ•œ embedding๋“ค์€ ํŠน์ • ๋น„์œจ๋กœ ํ•ฉ์ณ์ง€๊ณ  ๋งˆ์ง€๋ง‰ ๋ถ„๋ฅ˜ ๋ ˆ์ด์–ด๋กœ ํ˜๋Ÿฌ๊ฐ„๋‹ค. cross-entropy loss๋Š” ์ฃผ์–ด์ง„ ๋น„์œจ์—์„œ ๊ธฐ์กด ๋ฌธ์žฅ์˜ ๋‘ ๋ผ๋ฒจ์— ๊ธฐ๋ฐ˜ํ•ด์„œ ๊ณ„์‚ฐ๋œ๋‹ค.

 

The sentMixup process

 

 

Conclusion

 ์ด๋ ‡๊ฒŒ ํ•ด์„œ NLP์˜ data augmentation method๋“ค์— ๋Œ€ํ•ด์„œ ์•Œ์•„๋ดค๋Š”๋ฐ ์ƒ๊ฐ๋ณด๋‹ค computer vision๋งŒํผ์ด๋‚˜ ๋‹ค์–‘ํ•ด์„œ ๋†€๋ž๋‹ค. ๐Ÿ˜ฒ data augmentation์€ ๋งค์šฐ ์ค‘์š”ํ•œ ๋ถ„์•ผ์ด๊ธฐ ๋•Œ๋ฌธ์— NLP ์™ธ์— Computer vision์„ ๊ณต๋ถ€ํ•  ๋•Œ๋„ ํ•„์š”ํ•˜๋ฏ€๋กœ ํ•œ ๋ฒˆ ์ฆˆ์Œ ๊ณต๋ถ€ํ•ด๋ณด๊ธฐ๋ฅผ ์ถ”์ฒœํ•œ๋‹ค. ๐Ÿ‘

 

 

 

 

Sources

https://amitness.com/2020/05/data-augmentation-for-nlp/?fbclid=IwAR11MkccCti-2cD93RYftNPHb7Wxdj7AlZG7NNG4EhPaBkmiJkcBPtdl1eo
A Visual Survey of Data Augmentation in NLP - amitness.com

https://neptune.ai/blog/data-augmentation-nlp
Data Augmentation in NLP: Best Practices From a Kaggle Master - neptune.ai

 

'Paper Reading ๐Ÿ“œ > Natural Language Processing' ์นดํ…Œ๊ณ ๋ฆฌ์˜ ๋‹ค๋ฅธ ๊ธ€

ChatGPT์˜ hallucination, ์–ด๋–ป๊ฒŒ ํ•ด๊ฒฐํ•ด์•ผ ํ• ๊นŒ? - Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback  (0) 2023.04.05
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks ๋…ผ๋ฌธ ๋ฆฌ๋ทฐ  (0) 2023.04.03
GPT-4 Techinal Report Review  (0) 2023.03.28
BigBird: Transformers for Longer Sequences ๋…ผ๋ฌธ ๋ฆฌ๋ทฐ  (0) 2023.03.25
Sparse Transformers: Generating Long Sequence with Sparse Transformers ๋…ผ๋ฌธ ๋ฆฌ๋ทฐ  (0) 2023.03.22
'Paper Reading ๐Ÿ“œ/Natural Language Processing' ์นดํ…Œ๊ณ ๋ฆฌ์˜ ๋‹ค๋ฅธ ๊ธ€
  • ChatGPT์˜ hallucination, ์–ด๋–ป๊ฒŒ ํ•ด๊ฒฐํ•ด์•ผ ํ• ๊นŒ? - Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback
  • Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks ๋…ผ๋ฌธ ๋ฆฌ๋ทฐ
  • GPT-4 Techinal Report Review
  • BigBird: Transformers for Longer Sequences ๋…ผ๋ฌธ ๋ฆฌ๋ทฐ
Cartinoe
Cartinoe
Welcome! I'm a student studying about deep learning(NLP) ๐Ÿ˜‰ The goal of my study is to develop a competent LLM helping people!
Cartinoe's paper reviewWelcome! I'm a student studying about deep learning(NLP) ๐Ÿ˜‰ The goal of my study is to develop a competent LLM helping people!
  • faviconinstagram
  • faviconfacebook
  • favicongithub
  • faviconLinkedIn
Cartinoe's paper review
Cartinoe
Cartinoe
Cartinoe's paper review
Cartinoe
์ „์ฒด
์˜ค๋Š˜
์–ด์ œ
  • My Posting (141)
    • Paper Reading ๐Ÿ“œ (113)
      • Natural Language Processing (67)
      • Alignment Problem of LLM (11)
      • Computer Vision (4)
      • Deep Learning (6)
      • multimodal models (17)
      • Mathematics(์„ ํ˜•๋Œ€์ˆ˜, ํ™•๋ฅ ๊ณผ ํ†ต๊ณ„, ๋ฏธ.. (8)
    • Lecture ๐Ÿง‘โ€๐Ÿซ (16)
      • Hugging Face Course (1)
      • Coursera (15)
    • Insight ๐Ÿ˜Ž (10)
    • Research & Project ๐Ÿ”ฌ (2)

์ธ๊ธฐ ๊ธ€

์ตœ๊ทผ ๊ธ€

๊ณต์ง€์‚ฌํ•ญ

  • ๋ธ”๋กœ๊ทธ ๊ณต์ง€์‚ฌํ•ญ - ๋ชจ๋ฐ”์ผ ์ˆ˜์‹ ๊นจ์ง

ํƒœ๊ทธ

  • Chinchilla
  • RLHF
  • Evaluation Metric
  • proprietary model
  • open-source model
  • Open-source
  • GPT-4
  • Vicuna
  • MT-Bench
  • transformer
  • LLM
  • closed-source
  • ChatGPT
  • LLAMA2
  • context length
  • Vicuna Evaluation
  • LM
  • closed-source model
  • scaling law
  • context window
hELLO ยท Designed By ์ •์ƒ์šฐ.
Cartinoe
Data Augmentation methods in NLP
์ƒ๋‹จ์œผ๋กœ

ํ‹ฐ์Šคํ† ๋ฆฌํˆด๋ฐ”

๊ฐœ์ธ์ •๋ณด

  • ํ‹ฐ์Šคํ† ๋ฆฌ ํ™ˆ
  • ํฌ๋Ÿผ
  • ๋กœ๊ทธ์ธ

๋‹จ์ถ•ํ‚ค

๋‚ด ๋ธ”๋กœ๊ทธ

๋‚ด ๋ธ”๋กœ๊ทธ - ๊ด€๋ฆฌ์ž ํ™ˆ ์ „ํ™˜
Q
Q
์ƒˆ ๊ธ€ ์“ฐ๊ธฐ
W
W

๋ธ”๋กœ๊ทธ ๊ฒŒ์‹œ๊ธ€

๊ธ€ ์ˆ˜์ • (๊ถŒํ•œ ์žˆ๋Š” ๊ฒฝ์šฐ)
E
E
๋Œ“๊ธ€ ์˜์—ญ์œผ๋กœ ์ด๋™
C
C

๋ชจ๋“  ์˜์—ญ

์ด ํŽ˜์ด์ง€์˜ URL ๋ณต์‚ฌ
S
S
๋งจ ์œ„๋กœ ์ด๋™
T
T
ํ‹ฐ์Šคํ† ๋ฆฌ ํ™ˆ ์ด๋™
H
H
๋‹จ์ถ•ํ‚ค ์•ˆ๋‚ด
Shift + /
โ‡ง + /

* ๋‹จ์ถ•ํ‚ค๋Š” ํ•œ๊ธ€/์˜๋ฌธ ๋Œ€์†Œ๋ฌธ์ž๋กœ ์ด์šฉ ๊ฐ€๋Šฅํ•˜๋ฉฐ, ํ‹ฐ์Šคํ† ๋ฆฌ ๊ธฐ๋ณธ ๋„๋ฉ”์ธ์—์„œ๋งŒ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค.