Repositories under the training-data topic:
A system for quickly generating training data with weak supervision
The AI Datastore for Schemas, BLOBs, and Predictions. Use with your apps or integrate built-in Human Supervision, Data Workflow, and UI Catalog to get the most value out of your AI Data.
Synthetic data generators for tabular and time-series data
skweak: A software toolkit for weak supervision applied to NLP tasks
Computer-vision-based ML training data generation tool
A machine learning tool for automated prediction engineering. It allows you to easily structure prediction problems and generate labels for supervised learning.
Pure Python, lightweight, Pillow-based solver for Amazon's text captcha.
Web application for image labeling and segmentation
Augmentation pipeline for rendering synthetic paper printing, faxing, scanning and copy machine processes
A lightweight web application for brushing labels onto time series data; useful for building training sets.
Augmenty is an augmentation library based on spaCy for augmenting texts.
Generating training data from the Carla driving simulator in the KITTI dataset format
Aubo i5 Dual Arm Collaborative Robot - RealSense D435 - 3D Object Pose Estimation - ROS
Collection of casual conversations that can be used with the Rasa Stack
SWIM-IR is a Synthetic Wikipedia-based Multilingual Information Retrieval training set with 28 million query-passage pairs spanning 33 languages, generated using PaLM 2 and summarize-then-ask prompting.
COVID-19 Coughs files for training AI models
Data Programming by Demonstration (DPBD) for Document Classification
🔎 Classification helper for sex classification feature of InstaPy
Full resources supporting the publication "A Pragmatic Guide to Geoparsing Evaluation."
Interactive Weak Supervision: Learning Useful Heuristics for Data Labeling
Machine Learning project aimed at converting images into .obj 3D models by representing them as Blender hair-type particle systems.
Convert all files in git repository to .txt files. Useful for training LLMs on your codebase.
PyTorch reimplementation of computing Shapley values via Truncated Monte Carlo sampling from "What is your data worth? Equitable Valuation of Data" by Amirata Ghorbani and James Zou [ICML 2019]
AI Assisted Image and Video Training Data Labeling @ Scale
A command line interface to combine text information from subtitles with voice data in the video. Provides a convenient way to generate training data for speech-recognition purposes.
Table compiling the list of biomedically-related corpora available for named entity recognition (and some also suitable for association detection). The first version was published as part of the paper: Dieter Galea, Ivan Laponogov, Kirill Veselkov; Exploiting and assessing multi-source data for supervised biomedical named entity recognition, Bioinformatics, bty152, https://doi.org/10.1093/bioinformatics/bty152 . If you would like to add other (or your) corpora, please submit a pull request and I'll happily approve it.
Simple Optical Character Recognizer (english-ocr-image-to-text-recognition-sample-trainig-alphabet-photo-data-database-dataset)
A set of questions & answers used to train a ChatGPT model.
Realtime Thermal Solar Plant Dataset for Machine Learning
Training images for training self-driving cars on Udacity Nanodegree Self-driving Car Simulator
Minimal customization of Quill.js Rich Text Editor for easy annotation of text snippets for NER model training with spaCy.
Machine learning training data for Arknights