Repositories under the training-data topic:
A system for quickly generating training data with weak supervision
Synthetic data generators for tabular and time-series data
skweak: A software toolkit for weak supervision applied to NLP tasks
Computer vision based ML training data generation tool 🚀
Pure Python, lightweight, Pillow-based solver for Amazon's text captcha.
Web application for image labeling and segmentation
Augmenty is an augmentation library based on spaCy for augmenting texts.
A repository of NSFW images to be used for machine learning/image classification purposes
Generating training data from the Carla driving simulator in the KITTI dataset format
Aubo i5 Dual Arm Collaborative Robot - RealSense D435 - 3D Object Pose Estimation - ROS
Collection of casual conversations that can be used with the Rasa Stack
COVID-19 Coughs files for training AI models
SWIM-IR is a Synthetic Wikipedia-based Multilingual Information Retrieval training set with 28 million query-passage pairs spanning 33 languages, generated using PaLM 2 and summarize-then-ask prompting.
Data Programming by Demonstration (DPBD) for Document Classification
🔎 Classification helper for sex classification feature of InstaPy
Full resources supporting the publication "A Pragmatic Guide to Geoparsing Evaluation."
Interactive Weak Supervision: Learning Useful Heuristics for Data Labeling
Machine Learning project aimed at converting images into .obj 3D models by representing them as Blender hair-type particle systems.
AI Assisted Image and Video Training Data Labeling @ Scale
A command line interface to combine text information from subtitles with voice data in the video. Provides a convenient way to generate training data for speech-recognition purposes.
Table compiling the list of biomedically related corpora available for named entity recognition (some are also suitable for association detection). The first version was published as part of the paper: Dieter Galea, Ivan Laponogov, Kirill Veselkov; Exploiting and assessing multi-source data for supervised biomedical named entity recognition, Bioinformatics, bty152, https://doi.org/10.1093/bioinformatics/bty152 . If you would like to add other (or your own) corpora, please submit a pull request and I'll happily approve it.
PyTorch reimplementation of computing Shapley values via Truncated Monte Carlo sampling from "What is your data worth? Equitable Valuation of Data" by Amirata Ghorbani and James Zou [ICML 2019]
Simple Optical Character Recognizer (english-ocr-image-to-text-recognition-sample-trainig-alphabet-photo-data-database-dataset)
Realtime Thermal Solar Plant Dataset for Machine Learning
A set of questions & answers used to train a ChatGPT model.
Training images for training self-driving cars on Udacity Nanodegree Self-driving Car Simulator
Minimal customization of Quill.js Rich Text Editor for easy annotation of text snippets for NER model training with spaCy.