❤️‍🩹 Sensai: Toxic Live Chat Dataset

Sensai is a toxic chat dataset consisting of live chats from Virtual YouTubers' live streams.

Download the dataset from Huggingface Hub (https://huggingface.co/datasets/holodata/sensai) or alternatively from Kaggle Datasets.

Join the #livechat-dataset channel on the holodata Discord for discussions.

Provenance

  • Source: YouTube Live Chat events (all streams covered by Holodex, including Hololive, Nijisanji, 774inc, etc.)
  • Temporal Coverage: From 2021-01-15T05:15:33Z
  • Update Frequency: At least once per month

Research Ideas

  • Toxic Chat Classification
  • Spam Detection
  • Sentence Transformer for Live Chats

See public notebooks for ideas.

Files

filename                 summary                                                          size
chats_flagged_%Y-%m.csv  Chats flagged as either deleted or banned by mods (3,100,000+)  ~400 MB
chats_nonflag_%Y-%m.csv  Non-flagged chats (3,100,000+)                                   ~300 MB

To make the dataset balanced, the non-flagged chats are randomly sampled so that their count matches the flagged chats, as illustrated below. Ban and deletion correspond to markChatItemsByAuthorAsDeletedAction and markChatItemAsDeletedAction, respectively.
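As an illustration, the balancing amounts to something like the following sketch (the month in the file names is a hypothetical example of the %Y-%m pattern):

import pandas as pd

flagged = pd.read_csv('chats_flagged_2021-08.csv')   # hypothetical month
nonflag = pd.read_csv('chats_nonflag_2021-08.csv')

# Randomly down-sample the non-flagged chats to the flagged count,
# yielding a 50/50 class balance
nonflag_balanced = nonflag.sample(n=len(flagged), random_state=0)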

Dataset Breakdown

Chats (chats_%Y-%m.parquet)

column           type    description
body             string  chat message
authorChannelId  string  anonymized author channel id
channelId        string  source channel id
label            string  {deleted, hidden, nonflagged}
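The Transformers example under Usage trains against a binary toxic flag rather than these three labels; a minimal mapping (assuming, as a sketch, that deleted and hidden both count as toxic) could be:

def to_toxic(label: str) -> int:
    # Assumption: chats flagged by moderators ('deleted', 'hidden') are toxic
    return 0 if label == 'nonflagged' else 1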

Usage

Pandas

import pandas as pd
from glob import glob

# Concatenate every monthly parquet file into a single DataFrame
df = pd.concat([pd.read_parquet(x) for x in glob('../input/sensai/*.parquet')], ignore_index=True)
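A quick sanity check that the classes are balanced as described above:

print(df['label'].value_counts())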

Huggingface Transformers

See the datasets loading guide: https://huggingface.co/docs/datasets/loading_datasets.html

# $ pip3 install datasets transformers
from datasets import load_dataset, Features, ClassLabel, Value
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

dataset = load_dataset("holodata/sensai",
    features=Features(
        {
            "body": Value("string"),
            "toxic": ClassLabel(num_classes=2, names=['0', '1'])
        }
    ))

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["body"], padding="max_length", truncation=True)

# Tokenize a random 50k-chat subset to keep training time reasonable
tokenized_datasets = dataset['train'].shuffle().select(range(50000)).map(tokenize_function, batched=True)
# Trainer expects the target column to be named "label"
tokenized_datasets = tokenized_datasets.rename_column("toxic", "label")
splitset = tokenized_datasets.train_test_split(0.2)
training_args = TrainingArguments("test_trainer")

trainer = Trainer(
    model=model, args=training_args, train_dataset=splitset['train'], eval_dataset=splitset['test']
)

trainer.train()
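To also report accuracy during evaluation, you can pass a metrics hook to Trainer; a minimal sketch, assuming scikit-learn is installed:

import numpy as np
from sklearn.metrics import accuracy_score

def compute_metrics(eval_pred):
    # Trainer hands over raw logits; argmax gives the predicted class
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {'accuracy': accuracy_score(labels, preds)}

trainer = Trainer(
    model=model, args=training_args,
    train_dataset=splitset['train'], eval_dataset=splitset['test'],
    compute_metrics=compute_metrics,
)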

Tangram

python3 ./examples/prepare_tangram_dataset.py
tangram train --file ./tangram_input.csv --target label
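The prepare script is not reproduced here; a plausible minimal equivalent (my sketch, not the actual examples/prepare_tangram_dataset.py) just flattens the parquet files into the CSV that tangram train consumes:

import pandas as pd
from glob import glob

# Merge the monthly parquet files and keep the columns Tangram needs
df = pd.concat([pd.read_parquet(p) for p in glob('../input/sensai/*.parquet')], ignore_index=True)
df[['body', 'label']].to_csv('./tangram_input.csv', index=False)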

Considerations

Anonymization

authorChannelId is anonymized using the SHA-1 hashing algorithm with a pinch of undisclosed salt.
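Conceptually the scheme looks like this (SALT here is hypothetical; the actual salt is undisclosed):

import hashlib

SALT = b'not-the-real-salt'  # hypothetical; the actual salt is undisclosed

def anonymize(channel_id: str) -> str:
    # Salted SHA-1 digest, hex-encoded
    return hashlib.sha1(SALT + channel_id.encode('utf-8')).hexdigest()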

Handling Custom Emojis

All custom emojis are replaced with the Unicode replacement character (U+FFFD).
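If those markers get in the way of your preprocessing, they are easy to drop; a small sketch:

# Remove the U+FFFD markers left in place of custom emojis
def strip_emoji_markers(body: str) -> str:
    return body.replace('\ufffd', '').strip()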

Citation

@misc{sensai-dataset,
 author={Yasuaki Uechi},
 title={Sensai: Toxic Chat Dataset},
 year={2021},
 month={8},
 version={31},
 url={https://github.com/holodata/sensai-dataset}
}

License

MIT License
