sml8648 / final-project-level3-nlp-05

final-project-level2-nlp-05 created by GitHub Classroom

📰 NEWS.tar

Table of contents

  • Intro: team introduction / project introduction (problem definition) / development goals
  • Dataset & Model: dataset / model / research / final applied model
  • Product Serving: architecture / implementation / demo
  • Result / Conclusion: demo video / follow-up development and research / results and discussion
  • Appendix: challenging experiments / lessons learned / anticipated Q&A

Intro

"Corporate news at a glance: NEWS.tar"

NEWS.tar groups news articles by topic and summarizes their content so that users can grasp the key news in a short time.

Motivation and Objective

✔️ News data is abundant and easy to obtain.
✔️ However, searching for news about a company before investing returns an overwhelming amount of information.
✔️ We want to cluster and summarize this news so that the topics surrounding a specific company can be grasped quickly.

  • Group news on similar topics together
  • Summarize the articles of each topic in a single sentence
  • Provide sentiment analysis for each topic
  • Provide an overall summary paragraph for the articles grouped under the same topic

Team members

김진호 신혜진 이효정 이상문 정지훈
Topic modeling, extractive summarization of article bodies
One-line generative summarization
Frontend, backend
Sentiment analysis of one-line summaries
News data collection
DB construction
One-line generative summarization
Similarity classification

Dataset & Model

⚙️ Flow overview

💾 Dataset

  • Collected news article bodies using the Naver Developer API and BigKinds news data
  • Collected a total of 660,000 articles covering 2022.11.01 to 2023.02.03
  • The collected data is preprocessed and then inserted into Elasticsearch
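
The preprocess-then-insert step can be sketched as follows. This is a minimal illustration, not the project's actual code: the field names (`title`, `body`, `date`), the index name `news`, and the HTML-stripping rule are assumptions. It only builds the newline-delimited JSON body that the Elasticsearch bulk API expects, so it runs without a cluster.

```python
import json
import re
from typing import Dict, List


def clean_text(text: str) -> str:
    """Strip HTML tags and collapse whitespace (an assumed preprocessing rule)."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", no_tags).strip()


def to_bulk_payload(articles: List[Dict[str, str]], index: str = "news") -> str:
    """Build an NDJSON body for the Elasticsearch bulk API.

    Every document contributes two lines: an action line and the document.
    """
    lines = []
    for article in articles:
        doc = {
            "title": clean_text(article["title"]),
            "body": clean_text(article["body"]),
            "date": article["date"],
        }
        lines.append(json.dumps({"index": {"_index": index}}, ensure_ascii=False))
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk API requires the request body to end with a newline.
    return "\n".join(lines) + "\n"
```

The resulting string would be sent to the `_bulk` endpoint with the `application/x-ndjson` content type; batching many articles per request is the usual reason to prefer it over one insert per document.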

🧠 Model

Topic modeling (BERTopic)

  • BERTopic passes the documents through an embedding model and its subsequent steps (dimensionality reduction and clustering), then applies class-based TF-IDF to describe each cluster, so that the documents end up grouped by topic
  • We compared several embedding models and chose Paraphrase mpnet
| Embedding Model | Silhouette Score | Speed (sec) |
|---|---|---|
| Paraphrase mpnet | 0.7585 | 7.34 |
| KR-SBERT | 0.7439 | 6.68 |
| DistilBERT | 0.7012 | 7.88 |
| Paraphrase MiniLM | 0.6994 | 5.81 |
| QA mpnet | 0.6927 | 11.16 |
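
To make the TF-IDF step of BERTopic more concrete, here is a small pure-Python sketch of class-based TF-IDF, following the formula in the BERTopic paper: W(t, c) = tf(t, c) · log(1 + A / f(t)). The whitespace tokenizer and the toy English documents are assumptions for illustration; the library itself uses a configurable vectorizer and may normalize the counts.

```python
import math
from collections import Counter
from typing import Dict, List


def c_tf_idf(docs_by_topic: Dict[int, List[str]]) -> Dict[int, Dict[str, float]]:
    """Score terms per topic with class-based TF-IDF.

    tf(t, c) is the count of term t in topic c, f(t) is its count over all
    topics, and A is the average number of words per topic.
    """
    if not docs_by_topic:
        raise ValueError("at least one topic is required")

    # Treat all documents of a topic as one long document.
    topic_counts = {
        topic: Counter(word for doc in docs for word in doc.split())
        for topic, docs in docs_by_topic.items()
    }
    total_counts: Counter = Counter()
    for counts in topic_counts.values():
        total_counts.update(counts)
    avg_words = sum(total_counts.values()) / len(topic_counts)

    return {
        topic: {
            term: tf * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
        for topic, counts in topic_counts.items()
    }


# Toy usage: terms frequent in one topic but rare overall score highest.
scores = c_tf_idf({
    0: ["chip demand rises", "chip exports rise"],
    1: ["battery plant opens", "battery demand rises"],
})
best_term_topic0 = max(scores[0], key=scores[0].get)
```

In this toy input, a word shared by both topics (such as "demand") receives a lower weight than a word that appears in only one of them, which is the property that makes the top-scoring terms usable as topic labels.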

One-line topic summary (Generative summary)

  • For each article, the title and the first two sentences of the body are concatenated; the articles clustered under the same topic are then concatenated and fed to the model as input
  • A KoBART model generates exactly one one-line summary per topic
| Embedding Model | Rouge-1 (F1) | Rouge-2 (F1) | Rouge-3 (F1) | Length | Speed (sec) |
|---|---|---|---|---|---|
| kobart-summarization | 0.495 | 0.339 | 0.413 | 115.83 | 0.46 |
| KR-SBERT | 0.495 | 0.329 | 0.385 | 201.49 | 3.19 |
| DistilBERT | 0.488 | 0.324 | 0.394 | 180.29 | 0.64 |
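
The input construction described above (title plus the first two body sentences per article, then all articles of a topic joined together) could look roughly like this. The punctuation-based sentence splitter and the `title`/`body` keys are assumptions; a real pipeline would likely use a Korean sentence splitter and would also need to truncate the result to the model's maximum input length.

```python
import re
from typing import Dict, List


def first_sentences(body: str, n: int = 2) -> str:
    """Return the first n sentences of body, using a naive punctuation split."""
    sentences = re.split(r"(?<=[.!?])\s+", body.strip())
    return " ".join(sentences[:n])


def build_topic_input(articles: List[Dict[str, str]], sep: str = " ") -> str:
    """Join 'title + first two body sentences' of every article in one topic."""
    pieces = [f"{a['title']} {first_sentences(a['body'])}" for a in articles]
    return sep.join(pieces)
```

The returned string is what would be tokenized and passed to the summarization model, one call per topic.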

Sentiment analysis

  • Each sentence generated per topic is fed into a sequence classification model
  • Sentences are classified into three classes: Positive, Neutral, Negative
  • The roberta-large model is used
| Model | Loss | AUPRC | Micro F1 | Speed (sec) | Easy data (#48) | Medium data (#22) | Hard data (#23) | Total data (#93) |
|---|---|---|---|---|---|---|---|---|
| roberta-large | 0.4667 | 88.1713 | 82.7956 | 0.7371 | 43 | 18 | 16 | 77 |
| roberta-base 1 | 0.9074 | 87.4126 | 76.3440 | 0.2793 | 42 | 17 | 12 | 71 |
| roberta-base 2 | 0.5078 | 88.6208 | 78.4946 | 0.2668 | 42 | 14 | 17 | 73 |
| KorFinASC-XLM-RoBERTa | 4.3266 | 29.8050 | 32.2580 | 0.8201 | 14 | 7 | 7 | 28 |
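
For single-label classification, micro F1 equals plain accuracy, so the Micro F1 column can be related to the per-difficulty columns. The sketch below reads those columns as counts of correctly classified sentences, which is an interpretation rather than something the table states; under that reading the three RoBERTa rows reproduce the reported Micro F1 to within rounding.

```python
from typing import Dict, Sequence, Tuple

# (easy, medium, hard) counts copied from the table above,
# interpreted here as numbers of correct predictions (an assumption).
CORRECT: Dict[str, Tuple[int, int, int]] = {
    "roberta-large": (43, 18, 16),
    "roberta-base 1": (42, 17, 12),
    "roberta-base 2": (42, 14, 17),
}
TOTAL_SAMPLES = 48 + 22 + 23  # 93 evaluation sentences


def micro_f1_percent(correct_by_split: Sequence[int]) -> float:
    """Micro F1 for single-label classification: correct / total, in percent."""
    return 100.0 * sum(correct_by_split) / TOTAL_SAMPLES


for name, counts in CORRECT.items():
    print(f"{name}: {micro_f1_percent(counts):.4f}")
```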

News summary within a topic (Extractive summary)

  • Extractive summarization model: KorBertSum
  • The pretrained Korean BERT language model provided by ETRI was trained on AIHub's extractive summarization dataset
  • Only the important sentences are extracted from the news articles clustered into one topic to produce the summary
| Model | Rouge-1 (F1) | Rouge-2 (F1) | Rouge-3 (F1) | Rouge-1 (Recall) | Rouge-2 (Recall) | Rouge-3 (Recall) |
|---|---|---|---|---|---|---|
| Etri pretrained model | 0.7550 | 0.5944 | 0.7045 | 0.7213 | 0.5661 | 0.6714 |
| AIHub data fine-tuned model | 0.7834 | 0.6365 | 0.7295 | 0.7969 | 0.6467 | 0.7421 |
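
Extractive summarization in the BertSum style scores every sentence and keeps the highest-scoring ones. The selection step can be sketched as below; the scores are placeholders standing in for model outputs, and the function name and the default `k` are hypothetical, not taken from the KorBertSum code.

```python
from typing import List


def select_summary(sentences: List[str], scores: List[float], k: int = 3) -> List[str]:
    """Keep the k highest-scoring sentences, returned in their original order."""
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:k])  # restore document order for readability
    return [sentences[i] for i in chosen]
```

Returning the chosen sentences in document order, rather than score order, keeps the extracted summary readable as a short passage.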

Product Serving

Architecture

  • All servers run on aistage servers (V100)
  • Database Server
    • Crawls Naver Developer API and BigKinds news data, preprocesses it, and inserts it into Elasticsearch
    • Visualizes the state of the data with Kibana
    • Automates processing and adding new data with Airflow
  • Frontend Server
    • Runs the frontend server with Streamlit
    • Sends the client's query to the database server
    • Forwards the database's response to the model server and receives the result
  • Model Server
    • Processes requests from the frontend server and responds
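
The request flow between the three servers can be sketched as a single function on the frontend side. Everything named here is hypothetical: `search` stands in for the call to the database server, `summarize` for the call to the model server, and the `title`/`body` fields for the indexed document schema. Only the shape of the `multi_match` query follows the Elasticsearch query DSL.

```python
from typing import Callable, Dict, List

Article = Dict[str, str]


def build_search_body(query: str, size: int = 100) -> dict:
    """Elasticsearch query body matching the query against title and body."""
    return {
        "size": size,
        "query": {"multi_match": {"query": query, "fields": ["title", "body"]}},
    }


def handle_query(
    query: str,
    search: Callable[[dict], List[Article]],
    summarize: Callable[[List[Article]], List[dict]],
) -> List[dict]:
    """Frontend flow: ask the database server, forward the hits to the model server."""
    articles = search(build_search_body(query))
    if not articles:
        # Nothing to cluster or summarize; skip the model server call.
        return []
    return summarize(articles)
```

Passing the two server calls in as functions keeps the flow testable with stubs and leaves the transport (HTTP client, URLs) outside this sketch.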

Demo

  • Takes the user's query, clusters the related news by topic, and generates a one-line summary for each topic (e.g. Samsung Electronics)

  • Clicking a one-line topic summary produces an extractive summary of the news clustered under that topic

Result / Conclusion / Appendix

Demo video

NEWS.tar

Conclusion, follow-up development & Appendix

See sections 5 and 6
Final presentation materials: presentation slides

Reference

  • Grootendorst, Maarten. "BERTopic: Neural topic modeling with a class-based TF-IDF procedure." arXiv preprint arXiv:2203.05794 (2022).
  • Malo, Pekka, et al. "Good debt or bad debt: Detecting semantic orientations in economic texts." Journal of the Association for Information Science and Technology 65.4 (2014): 782-796.
  • Lewis, Mike, et al. "Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension." arXiv preprint arXiv:1910.13461 (2019).
  • Lee, Dongyub, et al. "Reference and document aware semantic evaluation methods for Korean language summarization." arXiv preprint arXiv:2005.03510 (2020).
  • Liu, Yang, and Mirella Lapata. "Text summarization with pretrained encoders." arXiv preprint arXiv:1908.08345 (2019).


Languages

Python 98.2%, Shell 1.5%, CSS 0.3%