Pre-training Corpora

Assamese Wikipedia v1

Assamese Wikipedia documents from the 2021 Wikidump. Collected in June 2024.

Assamese - 10k

Assamese Wikipedia v2

Assamese Wikipedia documents from the 2021 Wikidump. Collected in June 2024.

Assamese - 100k

CC-100 Monolingual*

"This corpus comprises of monolingual data for 100+ languages [...] This was constructed using the urls and paragraph indices provided by the CC-Net repository by processing January-December 2018 Commoncrawl snapshots."

Assamese (7.6 MB)

*external resource

The C4 Multilingual Dataset*

"A colossal, cleaned version of Common Crawl's web crawl corpus. Based on Common Crawl dataset."

Assamese (?)

*external resource

Fine-tuning Corpora

Assamese Sentiments Dataset*

Ritik Kumar Jain, from Kaggle.com

*external resource

Assamese Wikipedia Sentences Dataset*

By Sagar Tamang; cleaned and formatted as JSONL. Assamese Wikipedia documents from the 2021 Wikidump. Collected in June 2024.

Non-chat | Non-commands | TXT "text" attrs:

*external resource

Assamese ChatGPT Generated Dataset for Fine Tuning*

Special thanks to Prasurjan Pran Borah for providing me with access to the ChatGPT API, through which this dataset was generated for fine-tuning an LLM in Assamese.