argon1025 / Comments-in-Korean_Dataset

About 15,000 comments collected from a Korean online community

📜 Korean community comment Dataset

A dataset of about 15,000 comments from DC Inside, a Korean online community.
Project Date 📆 2020-06-20

About 18% of the comments in the dataset are malicious (on the 2-way basis).

1.Dataset_Class

Class Description
Text The original comment text.
Malignant index The malignancy index, assigned a value from 0 to 2.

1-1.Malignant index

Malignant index Description
0 A comment that poses no problem at all if posted.
1 A comment that uses no profanity but clearly qualifies as a malicious comment.
2 A comment that uses profanity and can unambiguously be judged malicious.

The labels are written for 3-way classification, but training results show good accuracy when the dataset is used in binary classification form, so we recommend reworking the labels with the functions described below before use.

1-2.Rework malicious index(3-way -> 2-way)

There are two criteria for converting the malicious index to binary classification form.

Low level (malicious index Value Change 1 -> 0, 2 -> 1)

def Row_rework_label(data):  # Binary classification (low level)
    # Map labels in place: 2 -> 1 and 1 -> 0; label 0 is left unchanged.
    for idx, label in enumerate(data):
        if label == 2:
            data[idx] = 1
        elif label == 1:
            data[idx] = 0
    return data

This is the low-strictness conversion: a malicious index of 1 is changed to 0, so only comments with index 2 are labelled malicious.
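The function as written overwrites the values of the sequence it receives and returns that same object. A minimal sketch of that behaviour on a small hand-made list follows; the sample values are hypothetical and not rows from the dataset, and the function body is restated only so the snippet runs on its own.

```python
def Row_rework_label(data):
    # Restated low-level rule: 2 -> 1, 1 -> 0, 0 stays 0 (modifies data in place).
    for idx, label in enumerate(data):
        if label == 2:
            data[idx] = 1
        elif label == 1:
            data[idx] = 0
    return data

original = [0, 1, 2, 2, 1]                    # hypothetical 3-way labels
reworked = Row_rework_label(list(original))   # pass a copy to keep the 3-way labels

print(original)  # [0, 1, 2, 2, 1]
print(reworked)  # [0, 0, 1, 1, 0]
```

Passing a copy is a design choice for this sketch: it keeps the 3-way labels available in case the other conversion is to be tried afterwards.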

High level (malicious index Value Change 1 -> 1, 2 -> 1)

def High_rework_label(data):  # Binary classification, merged (high level)
    # Map labels in place: 2 -> 1; labels 0 and 1 are left unchanged.
    for idx, label in enumerate(data):
        if label == 2:
            data[idx] = 1
    return data

This is the high-strictness conversion: a malicious index of 1 stays 1 and 2 is changed to 1, so every comment with index 1 or 2 is labelled malicious.
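For comparison, both conversions can also be written without loops and without modifying the input. The sketch below assumes the labels fit in a NumPy array and that the high level treats index 1 and index 2 as malicious, which is what the High_rework_label code does. The helper name `to_binary` and its `level` parameter are hypothetical and not part of the repository.

```python
import numpy as np

def to_binary(labels, level="high"):
    """Hypothetical helper: return a new 0/1 array, leaving the input untouched."""
    labels = np.asarray(labels)
    if level == "high":
        # High level: index 1 and index 2 both count as malicious.
        return (labels >= 1).astype(int)
    if level == "low":
        # Low level: only index 2 counts as malicious.
        return (labels == 2).astype(int)
    raise ValueError("level must be 'high' or 'low'")

sample = np.array([0, 1, 2, 0, 2])   # hypothetical 3-way labels
print(to_binary(sample, "high"))     # [0 1 1 0 1]
print(to_binary(sample, "low"))      # [0 0 1 0 1]
```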

Dataset Load and Rework malicious index

import pandas as pd

dataset_csv = pd.read_csv('DCcomment.csv', names=['Text', 'label'])
X, Y = dataset_csv['Text'].values, dataset_csv['label'].values
# Uncomment exactly one of the following to convert Y to binary labels:
#Y = High_rework_label(Y)
#Y = Row_rework_label(Y)
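After loading, it can be useful to check the share of malicious labels that a conversion produces. The sketch below uses a small in-memory CSV as a hypothetical stand-in for DCcomment.csv, with two columns and no header row, as the `names=` argument above implies. The rows are invented for illustration, and the high-level rule is used only as an example. The README gives a figure of about 18% on the 2-way basis without saying which rule it refers to.

```python
import io
import pandas as pd

# Hypothetical stand-in for DCcomment.csv: two columns, no header row.
fake_csv = io.StringIO("comment a,0\ncomment b,2\ncomment c,1\ncomment d,0\n")

dataset_csv = pd.read_csv(fake_csv, names=["Text", "label"])
# Copy the labels so the DataFrame keeps the original 3-way values.
labels = dataset_csv["label"].to_numpy().copy()

binary = (labels >= 1).astype(int)   # high-level rule: 1 and 2 -> 1
malicious_share = binary.mean()      # fraction of comments labelled malicious

print(dataset_csv["label"].value_counts().sort_index())  # count per 3-way label
print(malicious_share)  # 0.5
```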
