There are 2 repositories under the data-synthesis topic.
A list of useful data augmentation resources, including some uncommon techniques, libraries, links to GitHub repos, papers, and more.
This repository collects papers for "A Survey on Knowledge Distillation of Large Language Models". We break down KD into Knowledge Elicitation and Distillation Algorithms, and explore the Skill & Vertical Distillation of LLMs.
Flame is an open-source multimodal AI system designed to translate UI design mockups into high-quality React code. It leverages vision-language modeling, automated data synthesis, and structured training workflows to bridge the gap between design and front-end development.
Official Repository of "LLM × DATA" Survey Paper
Computer vision utils for Blender (generate instance annotation, depth, and 6D pose with one line of code)
GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation
[CVPR 2023] Label-Free Liver Tumor Segmentation
[CVPR 2024] Generalizable Tumor Synthesis - Realistic Synthetic Tumors in Liver, Pancreas, and Kidney
[ACL 2025] Code and data for OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks
[ICLR 2025] Scalable Benchmarking and Robust Learning for Noise-Free Ego-Motion and 3D Reconstruction from Noisy Video
Official repository for Montessori-Instruct: Generate Influential Training Data Tailored for Student Learning [ICLR 2025]
Repository for the results of my master's thesis on the generation and evaluation of synthetic data using GANs
Code & data for ICLR 2024 spotlight paper: 🍯MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data
Official Code for “EarthSynth: Generating Informative Earth Observation with Diffusion Models”
Source code for LDPTrace: Locally Differentially Private Trajectory Synthesis. VLDB 2023.
[Preprint] Deformation-Recovery Diffusion Model (DRDM): Instance Deformation for Image Manipulation and Synthesis
Official implementation of EMNLP 2022 paper "Generate, Discriminate, and Contrast: A Semi-Supervised Sentence Representation Learning Framework"
A data synthesizer for creating datasets of feet from a first-person perspective.
Apache NiFi Data Synthesizer
Coursera - RNN Programming Assignment: In this project, we will construct a speech dataset and implement an algorithm for trigger word detection (sometimes also called keyword detection, or wake word detection).
Data synthesis for simulation of pen-based interaction
Blender Python package for extracting internal data from Blender scenes for 3D-related data generation.
Releases for 「Synthesizing Realistic Data for Table Recognition」
The Coastal Carbon Network Data Library: An open-source database featuring carbon data from tidal wetlands around the world
A Label-Free and Data-Free Synthesis Engine and Training Framework for Vascular Segmentation of sOCT Data with PyTorch.
Code for ICCV'25 paper: Learn2Synth: Learning Optimal Data Synthesis using Hypergradients for Brain Image Segmentation
Generate Synthetic Shapes in 3D for Biomedical Image Augmentation and Synthesis.
An Improved Data Synthesis Method Driven by Large Language Models for Proactively Mining Implicit User Intentions in the Chinese Tourism Domain
Synthesize data in YOLO format from given background and object images
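
For context on the last entry, YOLO-format labels store one object per line as a class index followed by the box center, width, and height, all normalized to the image size. The sketch below is not taken from that repository; it only illustrates the label convention, and the function name `paste_to_yolo_label` and its parameters are hypothetical. It assumes the object image is pasted unscaled with its top-left corner at pixel `(x, y)` and fits entirely inside the background.

```python
def paste_to_yolo_label(class_id, x, y, obj_w, obj_h, bg_w, bg_h):
    """Build one YOLO label line for an object pasted onto a background.

    Hypothetical helper for illustration only.
    (x, y) is the top-left corner of the pasted object in pixels;
    obj_w/obj_h and bg_w/bg_h are object and background sizes in pixels.
    """
    # YOLO stores the box center, not the corner, normalized to [0, 1].
    x_center = (x + obj_w / 2) / bg_w
    y_center = (y + obj_h / 2) / bg_h
    # Width and height are normalized by the background dimensions as well.
    width = obj_w / bg_w
    height = obj_h / bg_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


if __name__ == "__main__":
    # A 200x100 object pasted at (100, 50) on an 800x400 background.
    print(paste_to_yolo_label(0, 100, 50, 200, 100, 800, 400))
```

Each synthesized image would typically get a text file with one such line per pasted object.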