dataset

There are 258 repositories under the dataset topic.

public-apis / public-apis
A collective list of free APIs
api apis dataset development free list lists open-source public public-api public-apis resources software
Language:Python 365079
HumanSignal / label-studio
Label Studio is a multi-type data labeling and annotation tool with standardized output format
computer-vision deep-learning image-annotation annotation-tool annotation labeling labeling-tool image-labeling image-labelling-tool boundingbox image-classification annotations semantic-segmentation dataset datasets label-studio data-labeling text-annotation yolo mlops
Language:JavaScript 24725
joke2k / faker
Faker is a Python package that generates fake data for you.
dataset fake fake-data faker faker-generator python test-data test-data-generator testing
Language:Python 18703
lukas-blecher / LaTeX-OCR
pix2tex: Using a ViT to convert images of equations into LaTeX code.
machine-learning transformer im2latex deep-learning image2text latex dataset pytorch im2markup ocr latex-ocr vit math-ocr vision-transformer image-processing python im2text
Language:Python 15258
cvat-ai / cvat
Annotate better with CVAT, the industry-leading data engine for machine learning. Used and trusted by teams at any scale, for data of any scale.
video-annotation computer-vision computer-vision-annotation deep-learning image-annotation annotation-tool annotation labeling labeling-tool image-labeling image-labelling-tool boundingbox image-classification annotations imagenet tensorflow semantic-segmentation dataset object-detection pytorch
Language:Python 14392
zalandoresearch / fashion-mnist
A MNIST-like fashion product database. Benchmark 👇
mnist deep-learning benchmark machine-learning dataset computer-vision fashion fashion-mnist gan zalando convolutional-neural-networks
Language:Python 12427
ConardLi / easy-dataset
A powerful tool for creating fine-tuning datasets for LLMs
dataset javascript llm
Language:JavaScript 10730
doccano / doccano
Open source annotation tool for machine learning practitioners.
natural-language-processing machine-learning annotation-tool python datasets dataset data-labeling text-annotation nuxtjs vue vuejs nuxt
Language:Python 10273
brightmart / nlp_chinese_corpus
Large Scale Chinese Corpus for NLP
bert chinese chinese-corpus chinese-dataset chinese-nlp corpus dataset language-model news nlp pretrain question-answering text-classification wiki word2vec
9776
satellite-image-deep-learning / techniques
Techniques for deep learning with satellite & aerial imagery
deep-learning deep-neural-networks satellite-imagery pytorch python machine-learning sentinel satellite-images dataset remote-sensing datasets convolutional-neural-networks image-classification satellite-data earth-observation object-detection
9698
NirantK / awesome-project-ideas
Curated list of Machine Learning, NLP, Vision, Recommender Systems Project Ideas
deep-learning forecasting machine-learning classification series-forecasting image-classification awesome-list awesome dataset multi-label-classification
8637
googlecreativelab / quickdraw-dataset
Documentation on how to access and use the Quick, Draw! Dataset.
dataset quickdraw-dataset
6527
mdn / browser-compat-data
Browser compatibility data for Web technologies as displayed on MDN
compat compatibility data dataset json
Language:JSON 5396
lonePatient / awesome-pretrained-chinese-nlp-models
Awesome Pretrained Chinese NLP Models: a collection of high-quality Chinese pretrained models, large models, multimodal models, and large language models
chinese nlp pretrained-models bert roberta xlnet nezha ernie gpt gpt-2 nlu-nlg simbert pangu dataset llm large-language-models multimodel
Language:Python 5394
modelscope / data-juicer
Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
data-analysis data-science large-language-models llm data-visualization llms instruction-tuning pre-training multi-modal synthetic-data data data-pipeline data-processing foundation-models
Language:Python 5193
SPLWare / esProc
esProc SPL is a JVM-based programming language designed for structured data computation, serving as both a data analysis tool and an embedded computing engine.
cluster-computing database dataset esproc java sql
Language:Java 4670
tensorflow / datasets
TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
tensorflow machine-learning data datasets numpy jax dataset
Language:Python 4481
whoiskatrin / sql-translator
SQL Translator is a tool for converting natural language queries into SQL code using artificial intelligence. This project is 100% free and open source.
data-analysis data-engineering dataquery datascience dataset openai postgresql query sql
Language:TypeScript 4292
CLUEbenchmark / CLUE
Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
nlu benchmark chinese corpus dataset bert albert chineseglue glue roberta language-model pretrained-models transformers tensorflow pytorch
Language:Python 4188
wainshine / Chinese-Names-Corpus
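Chinese names corpus and name generator. Covers Chinese full names, surnames, given names, forms of address, Japanese names, transliterated names, and English names. Useful for Chinese word segmentation and person-name entity recognition.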
corpus dataset dict names ner
4183
rom1504 / img2dataset
Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
deep-learning dataset big-data image multimodal image-dataset download-images
Language:Python 4153
hyunwoongko / transformer
Transformer: PyTorch Implementation of "Attention Is All You Need"
pytorch transformer attention dataset
Language:Python 4043
OpenCSGs / csghub
CSGHub is a brand-new open-source platform for managing LLMs, developed by the OpenCSG team. It offers both open-source and on-premise/SaaS solutions, with features comparable to Hugging Face. Gain full control over the lifecycle of LLMs, datasets, and agents, with Python SDK compatibility with Hugging Face. Join us! ⭐️
ai huggingface llm management-system platform asset-management dataset deepseek deploy finetune git inference model prompt ray space
Language:Vue 3946
Charmve / Surface-Defect-Detection
📈 Currently the largest collection of industrial defect detection datasets and papers. Continuously summarizes open-source datasets and key papers in the field of surface defect research.
surface-detection surface-defects image-segmentation pcb-surface-defect surface-defect-detection paper defects dataset surface deep-learning charmve
Language:Python 3751
mlabonne / llm-datasets
Curated list of datasets and tools for post-training.
data dataset llm
3690
Belval / TextRecognitionDataGenerator
A synthetic data generator for text recognition
synthetic data text-recognition training-set-generator ocr dataset fake text
Language:Python 3564
pytorch / text
Models, data loaders and abstractions for language processing, powered by PyTorch
data-loader dataset deep-learning models nlp pytorch
Language:Python 3532
jdorfman / awesome-json-datasets
A curated list of awesome JSON datasets that don't require authentication.
json-dataset json awesome awesome-list list data dataset datasets
Language:JavaScript 3405
ieee8023 / covid-chestxray-dataset
We are building an open database of COVID-19 cases with chest X-ray or CT images.
covid-19 deep-learning computer-vision dataset xray computed-tomography
Language:Jupyter Notebook 3044
pydata / pandas-datareader
Extract data from a wide range of Internet sources into a pandas DataFrame.
html data-analysis data dataset stock-data finance financial-data python pydata pandas econdb fama-french economic-data fred
Language:Python 3024
Zjh-819 / LLMDataHub
A quick guide (especially) for trending instruction finetuning datasets
chatbot dataset llm chatgpt
2999
linhandev / dataset
An Index for Medical Imaging Datasets
4d-lung ct dataset grand-challenge medical-imaging mri msd qin-lung-ct qin-prostate-repeatability tcia
2993
waymo-research / waymo-open-dataset
Waymo Open Dataset
autonomous-driving dataset
Language:Python 2897
whylabs / whylogs
An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collection, ensuring safety & robustness. 📈
ai-pipelines approximate-statistics statistical-properties data-quality calculate-statistics python logging mlops dataops ml-pipelines data-pipeline dataset machine-learning data-science analytics constraints data-constraints model-performance
Language:Jupyter Notebook 2754
meodai / color-names
Large list of handpicked color names 🌈
colors color naming colour colours palette dataset rgb-color dictionary
Language:JavaScript 2652
unsplash / datasets
🎁 6,500,000+ Unsplash images made available for research and machine learning
dataset images unsplash machine-learning research data search-engine keywords photos semantics
Language:Jupyter Notebook 2595
