djslzx / datagen

Diverse dataset generation for program synthesis

datagen

David J. Lee, 2022-2024

[Image: L-systems]

Synthesize novel and diverse datasets for low-resource program synthesis domains.

Domains

  • Lindenmayer systems
  • Mujoco/2D ant walker programs
  • Python programming puzzles
  • Regular expressions

Samples

Lindenmayer systems

[Images: Sample 1, Sample 2, Sample 3, Sample 4]

Ant walker paths

[Image: ant outputs]

Programming problems

See examples/puzzles

Framework

We treat dataset generation as a Markov chain Monte Carlo (MCMC) search over sets of candidate programs, with the following proposal distributions:

  • LLM: generate new programs by prompting an LLM with program-mutation prompts (used for the Python puzzles)
  • PCFG: fit a probabilistic context-free grammar (PCFG) to a program or set of programs using the inside-outside algorithm, then sample from the fitted grammar

and the following target distributions:

  • energy: maximize a physically motivated "energy" function over program embeddings (taken from an off-the-shelf embedding model and compressed with dimensionality reduction when necessary)
  • variance: maximize the sum of distances between each program's embedding and the mean embedding (one of several variations)
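
The search described above can be sketched roughly as follows. This is a minimal illustration, not the code in this repo: the toy grammar `TOY_PCFG`, the symbol-count `embed` function (a stand-in for an off-the-shelf embedding model), and the names `sample_pcfg`, `variance_score`, and `mcmc_diverse_set` are all hypothetical. It uses a hand-written PCFG rather than one fitted with inside-outside, and it applies the plain Metropolis acceptance rule, ignoring the proposal-probability correction that full Metropolis-Hastings would need for this non-symmetric proposal.

```python
import math
import random

# Hypothetical toy PCFG over L-system-like strings (an assumption, not the
# repo's grammar). Each nonterminal maps to (probability, right-hand side).
TOY_PCFG = {
    "S": [
        (0.30, ["F", "S"]),
        (0.10, ["+", "S"]),
        (0.10, ["-", "S"]),
        (0.15, ["[", "S", "]", "S"]),
        (0.35, ["F"]),
    ],
}


def sample_pcfg(rng, symbol="S", depth=0, max_depth=12):
    """Sample a string from TOY_PCFG by recursive top-down expansion."""
    if symbol not in TOY_PCFG:
        return symbol  # terminal symbol, emitted as-is
    if depth >= max_depth:
        return "F"  # cut off deep derivations with a terminal
    rules = TOY_PCFG[symbol]
    weights = [p for p, _ in rules]
    _, rhs = rng.choices(rules, weights=weights, k=1)[0]
    return "".join(sample_pcfg(rng, s, depth + 1, max_depth) for s in rhs)


def embed(program):
    """Stand-in embedding: counts of a few symbols (assumption)."""
    return [float(program.count(c)) for c in "F+-["]


def variance_score(programs):
    """Sum of Euclidean distances from each embedding to the mean embedding."""
    vecs = [embed(p) for p in programs]
    dim = len(vecs[0])
    mean = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    return sum(math.dist(v, mean) for v in vecs)


def mcmc_diverse_set(n_programs=8, n_steps=500, temperature=1.0, seed=0):
    """Search over program sets, targeting exp(variance_score / temperature)."""
    rng = random.Random(seed)
    current = [sample_pcfg(rng) for _ in range(n_programs)]
    score = variance_score(current)
    for _ in range(n_steps):
        # Proposal: replace one program in the set with a fresh PCFG sample.
        candidate = list(current)
        candidate[rng.randrange(n_programs)] = sample_pcfg(rng)
        new_score = variance_score(candidate)
        # Simplified Metropolis rule: always accept improvements, accept
        # worse sets with probability exp((new - old) / temperature).
        accept_prob = math.exp(min(0.0, (new_score - score) / temperature))
        if rng.random() < accept_prob:
            current, score = candidate, new_score
    return current, score


if __name__ == "__main__":
    programs, score = mcmc_diverse_set()
    print(f"score = {score:.2f}")
    for p in programs:
        print(p)
```

Swapping `variance_score` for a different function of the embeddings would give a different target, and swapping `sample_pcfg` for a call that asks an LLM to mutate an existing program would give the other proposal distribution listed above.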

Note: this is research code, so browse at your own peril. :)

Languages

  • Python: 96.5%
  • Shell: 2.0%
  • Rust: 1.5%
  • Dockerfile: 0.0%