indian-nlp / assamese-dataset

Assamese dataset for developing models.


Assamese Datasets

A collection of NLP datasets in Assamese, split into two groups: pre-training corpora and fine-tuning datasets.


Pre-training Corpora

Assamese Wikipedia documents from the 2021 Wikidump, collected in June 2024.

  • Assamese - 10k

Assamese Wikipedia documents from the 2021 Wikidump, collected in June 2024.

  • Assamese - 100k

"This corpus comprises of monolingual data for 100+ languages [...] This was constructed using the urls and paragraph indices provided by the CC-Net repository by processing January-December 2018 Commoncrawl snapshots."

  • Assamese (7.6 MB)

*external resource


"A colossal, cleaned version of Common Crawl's web crawl corpus. Based on Common Crawl dataset."

  • Assamese (?)

*external resource


Fine-tuning Corpora

Ritik Kumar Jain, from Kaggle.com

*external resource


Cleaned and formatted as JSONL by Sagar Tamang. Assamese Wikipedia documents from the 2021 Wikidump, collected in June 2024.

Non-chat | Non-commands | TXT "text" attrs:

*external resource


Special thanks to Prasurjan Pran Borah for providing ChatGPT API access, through which this dataset was generated for fine-tuning an LLM in Assamese.

The words are taken from the Word split of Assamese Wikipedia v2, using 1/10 of it.

Each word is sent to ChatGPT along with a prompt to generate valid data that describes the word, making it suitable for fine-tuning.
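The per-word generation step can be sketched roughly as follows. This is a minimal illustration, not the script used to build the dataset: the prompt wording, the `build_records` and `to_jsonl` helpers, and the `fake_generate` stand-in are all hypothetical, and the real ChatGPT call is left out and represented by any function that maps a prompt string to a reply string. Only the "word" and "sentence" field names come from this README.

```python
import json
from typing import Callable, Dict, Iterable, List

# Hypothetical prompt; the wording actually used for this dataset is not documented here.
PROMPT_TEMPLATE = "Write one Assamese sentence that describes the word: {word}"


def build_records(words: Iterable[str], generate: Callable[[str], str]) -> List[Dict[str, str]]:
    """Send each word with a prompt to `generate` and collect word/sentence pairs."""
    records = []
    for word in words:
        prompt = PROMPT_TEMPLATE.format(word=word)
        sentence = generate(prompt).strip()
        if sentence:  # skip empty replies
            records.append({"word": word, "sentence": sentence})
    return records


def to_jsonl(records: Iterable[Dict[str, str]]) -> str:
    """Serialize records as JSONL, keeping Assamese characters unescaped."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def fake_generate(prompt: str) -> str:
    """Stand-in for the model call (assumption): echoes the word with a marker."""
    return "(placeholder) " + prompt.rsplit(": ", 1)[-1]


records = build_records(["অসম", "নদী"], fake_generate)
jsonl_text = to_jsonl(records)
```

Swapping `fake_generate` for a function that calls a real chat model would leave the rest of the sketch unchanged.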

Non-chat | Non-commands | JSONL "word" and "sentence":

Chat | Non-commands | JSONL "text" attrs:


Chat | Non-commands | TXT "text" attrs:

Chat | Non-commands | JSONL "text" attrs:
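For the JSONL variants, each line is expected to be one JSON object carrying either "word" and "sentence" fields or a single "text" field. A small reader might look like the sketch below; the `parse_jsonl` helper and the sample lines are illustrative assumptions, not files from this repository.

```python
import json
from typing import Dict, Iterable, Iterator


def parse_jsonl(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per non-empty line, reporting the line number on bad JSON."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no} is not valid JSON: {exc}") from exc


# Made-up sample lines using the field names listed above.
sample = [
    '{"word": "অসম", "sentence": "placeholder sentence"}',
    "",
    '{"text": "placeholder chat text"}',
]
records = list(parse_jsonl(sample))
```

To read a downloaded file instead of the in-memory sample, pass an open file handle, for example `parse_jsonl(open(path, encoding="utf-8"))`, where `path` is wherever the file was saved.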


Sani Kamal, from Kaggle.com

*external resource


TBA


License: MIT License