Extracting Structured Data from Free-form Organic Synthesis Text

Organic synthesis procedures are traditionally written as free-form text. This project explores how large language models can convert such unstructured text into structured data that can be used in downstream data science or machine learning applications.

For more details, see:

The 2-min demo video by Marcus Schwarting.
Section I.C.b of the preprint arXiv:2306.06283.
The demo app on GitHub pages.

OpenAI models

Data processing and inference scripts for OpenAI models can be found in the folder models_openai. These models are fine-tuned on 300 data points and evaluated on a separate set of 50 data points.
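The split and the fine-tuning file format can be sketched as follows. This is a minimal illustration rather than the code in models_openai: the function names, the toy records, the prompt separator, and the stop marker are all assumptions. Only the 300/50 sizes come from the text above, and the prompt/completion JSONL layout is the one used by the legacy OpenAI fine-tuning endpoint.

```python
import json
import random


def split_records(records, n_train=300, n_test=50, seed=0):
    """Shuffle records and return disjoint train and test lists."""
    if len(records) < n_train + n_test:
        raise ValueError("not enough records for the requested split")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:n_train + n_test]


def to_jsonl(pairs):
    """Serialize (procedure_text, structured_dict) pairs, one JSON object per line."""
    lines = []
    for text, structured in pairs:
        record = {
            # Assumed separator marking the end of the prompt.
            "prompt": text + "\n\n###\n\n",
            # Assumed stop marker appended to the completion.
            "completion": " " + json.dumps(structured, sort_keys=True) + " END",
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)


# Hypothetical toy records standing in for real procedure/structure pairs.
records = [(f"procedure text {i}", {"id": i}) for i in range(350)]
train, test = split_records(records)
train_jsonl = to_jsonl(train)
```

Fixing the random seed keeps the test set stable across runs, so evaluation numbers stay comparable between fine-tuned models.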

Demo page: OpenAI inference

The app in demo_apps/dash_app shows inference results from fine-tuned OpenAI models. An OpenAI API key is required in the deployment script.
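One common way to supply the key without hard-coding it is to read it from an environment variable. The sketch below assumes the variable is named OPENAI_API_KEY; the helper load_api_key is hypothetical, and the actual deployment script may handle the key differently.

```python
import os


def load_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or fail with a clear message."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before starting the app")
    return key
```

Failing at startup, rather than at the first inference request, makes a missing key easier to diagnose.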

Demo page: inferences on the test set

The app in demo_apps/github_page shows precomputed inference results from an OpenAI davinci model. It is a static page generated from Dash using Epix Zhang's code, and it is synced to the github_page branch.

Synthesis procedure data

Throughout this project, organic synthesis procedures, whether free text or structured, are extracted from the Open Reaction Database. Related scripts can be found in the folder ord_data.
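As a rough illustration of how one record yields both forms, the sketch below treats a reaction as a plain dictionary whose field names follow the Open Reaction Database schema (notes.procedure_details for the free text; inputs, conditions, workups, and outcomes for the structured part). Which fields the scripts in ord_data actually keep is an assumption, and the sample reaction is made up for the example.

```python
# Assumed subset of structured fields; the project may use a different selection.
STRUCTURED_KEYS = ("inputs", "conditions", "workups", "outcomes")


def extract_pair(reaction):
    """Split a reaction dict into (free_text, structured_fields)."""
    text = reaction.get("notes", {}).get("procedure_details", "")
    structured = {k: reaction[k] for k in STRUCTURED_KEYS if k in reaction}
    return text, structured


# Hypothetical record, not taken from the database.
reaction = {
    "reaction_id": "example-0",
    "notes": {"procedure_details": "The mixture was stirred at 25 C for 2 h."},
    "conditions": {"temperature": {"setpoint": {"value": 25.0, "units": "CELSIUS"}}},
    "outcomes": [{"reaction_time": {"value": 2.0, "units": "HOUR"}}],
}
text, structured = extract_pair(reaction)
```

The free text becomes the model input and the structured fields become the target output, which is the pairing a fine-tuning or evaluation set needs.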

About

The current team (06/2023) includes:

Qianxiang Ai
Stefan Bringuier
Hassan Harb
Brenden Pelkie
Jacob N Sanders
Marcus Schwarting
Jiale Shi

This project was conceived during the LLM Hackathon on 2023/03/29. We thank Ben Blaiszik for his generous financial support of this project.

Website: https://qai222.github.io/LLM_organic_synthesis/

License: MIT License
