Data Science for Social Impact Research Group @ University of Pretoria's repositories
textaugment
TextAugment: Text Augmentation Library
vukuzenzele-nlp
This dataset contains editions of the South African government magazine Vuk'uzenzele. The data was scraped from PDFs obtained from the Vuk'uzenzele website; the PDFs are stored in the data/raw folder.
gov-za-multilingual
This dataset contains cabinet statements from the South African government. The data was scraped from the government's website: https://www.gov.za/cabinet-statements
Higher_Education_EDA
An exploratory data analysis (EDA) repository for education researchers and practitioners
dsfsi-datasets
Datasets made available for different small projects
embedding-eval-data
Embedding Evaluation Data for South African Languages
izindaba-zesizulu
Categorised isiZulu news. The source data is isiZulu news from SABC social media posts.
zabantu-beta
ZaBantu is a fleet of lightweight Masked Language Models for Southern Bantu Languages
healthfacilitymap
South African health facility map, created to aid covid19za responses
StatsSA-Language
StatsSA statistical language glossary in machine-readable format
za-fake-news-2020
South African disinformation [fake news] website data collected in 2020
academic-project-page-template
A project page template for academic papers. Demo at https://eliahuhorwitz.github.io/Academic-project-page-template/
bibtextomd
Convert BibTeX entries to formatted Markdown
dlindaba-2019-uber
Uber rider rating data from DLIndaba 2019
edu-assessment-llm-prompt
Educational Assessment using LLMs
thapelo-sindane-msc-public
Public repository containing MSc code
za-lid
This repository contains datasets extracted from Vuk'uzenzele, prepared to train n-gram models, traditional ML models (Naive Bayes, SVM, and logistic regression), and large pretrained multilingual models for language identification