dsfsi / PuoData

Curated corpora for Setswana. Used to train PuoBERTa.

PuoData: A curated corpora for Setswana

Give Feedback 📑: DSFSI Resource Feedback Form

We believe that PuoData is a valuable resource for the Setswana language community. We hope that PuoData will be used to develop new and innovative applications that benefit the Setswana-speaking community.

Dataset Curation

| Dataset Name | Kind | Num. of Tokens |
|---|---|---|
| PuoData | | |
| NCHLT Setswana \cite{eiselen2014developing} | Government Documents | 1,010,147 |
| Nalibali Setswana | Children's Books | 57,654 |
| Setswana Bible | Book(s) | 879,630 |
| SA Constitution | Official Document | 56,194 |
| Leipzig Setswana Corpus BW | Curated Dataset | 219,149 |
| Leipzig Setswana Corpus ZA | Curated Dataset | 218,037 |
| SABC Dikgang tsa Setswana FB (Facebook) | News Headlines | 167,119 |
| SABC MotswedingFM FB | Online Content | 33,092 |
| Leipzig Setswana Wiki | Online Content | 230,333 |
| Setswana Wiki | Online Content | 183,168 |
| Vukuzenzele Monolingual TSN | Government News | 157,798 |
| gov-za Cabinet speeches TSN | Government Speeches | 591,920 |
| Department Basic Education TSN | Education Material | 708,965 |
| PuoData Total | 25MB on disk | 4,513,206 |
| PuoData+JW300 | | |
| JW300 Setswana | Book(s) | 19,782,122 |
| PuoData+JW300 Total | 124MB on disk | 24,295,328 |
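
The totals listed above can be cross-checked with a few lines of Python. The sketch below is illustrative and not part of the PuoData tooling: the per-source counts are copied from the table, while the names `PUODATA_SOURCES`, `JW300_TOKENS`, and `corpus_share` are hypothetical helpers introduced here.

```python
# Token counts per source, copied from the dataset table.
# The dictionary name and layout are assumptions made for this sketch.
PUODATA_SOURCES = {
    "NCHLT Setswana": 1_010_147,
    "Nalibali Setswana": 57_654,
    "Setswana Bible": 879_630,
    "SA Constitution": 56_194,
    "Leipzig Setswana Corpus BW": 219_149,
    "Leipzig Setswana Corpus ZA": 218_037,
    "SABC Dikgang tsa Setswana FB": 167_119,
    "SABC MotswedingFM FB": 33_092,
    "Leipzig Setswana Wiki": 230_333,
    "Setswana Wiki": 183_168,
    "Vukuzenzele Monolingual TSN": 157_798,
    "gov-za Cabinet speeches TSN": 591_920,
    "Department Basic Education TSN": 708_965,
}
JW300_TOKENS = 19_782_122


def corpus_share(counts):
    """Return each source's share of the total token count, in percent."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


puodata_total = sum(PUODATA_SOURCES.values())
combined_total = puodata_total + JW300_TOKENS

print(f"PuoData total:       {puodata_total:,}")   # 4,513,206
print(f"PuoData+JW300 total: {combined_total:,}")  # 24,295,328

# List the sources from largest to smallest share of PuoData.
shares = corpus_share(PUODATA_SOURCES)
for name, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:32s} {pct:5.1f}%")
```

Both sums match the totals reported in the table. The share listing is a quick way to see how much each source contributes to PuoData.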

Dataset Uses

We used this corpus to train PuoBERTa, 🤗 https://huggingface.co/dsfsi/PuoBERTa. It is also part of the corpus used for PuoBERTaJW300.

Citation Information

Bibtex Reference

@inproceedings{marivate2023puoberta,
  title   = {PuoBERTa: Training and evaluation of a curated language model for Setswana},
  author  = {Vukosi Marivate and Moseli Mots'Oehli and Valencia Wagner and Richard Lastrucci and Isheanesu Dzingirai},
  year    = {2023},
  booktitle= {SACAIR 2023 (To Appear)},
  keywords = {NLP},
  preprint_url = {https://arxiv.org/abs/2310.09141},
  dataset_url = {https://github.com/dsfsi/PuoBERTa},
  software_url = {https://huggingface.co/dsfsi/PuoBERTa}
}

License

PuoData is licensed under CC-BY-SA-4.0. The monolingual data have different licenses, depending on the license of the source news website.

Dataset Contact

For more details, reach out or check our website.

Email: vukosi.marivate@cs.up.ac.za

Enjoy exploring Setswana through AI!
