There are 258 repositories under the dataset topic.
A collective list of free APIs
Label Studio is a multi-type data labeling and annotation tool with standardized output format
pix2tex: Using a ViT to convert images of equations into LaTeX code.
Annotate better with CVAT, the industry-leading data engine for machine learning. Used and trusted by teams at any scale, for data of any scale.
An MNIST-like fashion product database. Benchmark 👇
A powerful tool for creating fine-tuning datasets for LLMs
Large Scale Chinese Corpus for NLP
Techniques for deep learning with satellite & aerial imagery
Curated list of Machine Learning, NLP, Vision, Recommender Systems Project Ideas
Documentation on how to access and use the Quick, Draw! Dataset.
Browser compatibility data for Web technologies as displayed on MDN
Awesome Pretrained Chinese NLP Models: a collection of high-quality Chinese pretrained models, large models, multimodal models, and large language models
Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
esProc SPL is a JVM-based programming language designed for structured data computation, serving as both a data analysis tool and an embedded computing engine.
TFDS is a collection of datasets ready to use with TensorFlow, JAX, ...
SQL Translator is a tool for converting natural language queries into SQL code using artificial intelligence. This project is 100% free and open source.
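Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
Chinese personal-name corpus and name generator. Covers Chinese full names, surnames, given names, forms of address, Japanese names, transliterated names, and English names. Useful for Chinese word segmentation and personal-name entity recognition.
Easily turn large sets of image URLs into an image dataset. Can download, resize and package 100M URLs in 20 hours on one machine.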
Transformer: PyTorch Implementation of "Attention Is All You Need"
CSGHub is a brand-new open-source platform for managing LLMs, developed by the OpenCSG team. It offers both open-source and on-premise/SaaS solutions, with features comparable to Hugging Face. Gain full control over the lifecycle of LLMs, datasets, and agents, with Python SDK compatibility with Hugging Face. Join us! ⭐️
📈 Currently the largest collection of industrial defect detection datasets and papers. Continually summarizes important open-source datasets and key papers in the field of surface defect research.
A synthetic data generator for text recognition
A curated list of awesome JSON datasets that don't require authentication.
We are building an open database of COVID-19 cases with chest X-ray or CT images.
Extract data from a wide range of Internet sources into a pandas DataFrame.
A quick guide to trending datasets, especially for instruction fine-tuning
An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collection, ensuring safety & robustness. 📈
Large list of handpicked color names 🌈