Organic synthesis procedures are traditionally written as free-form text. This project explores how large language models can convert such unstructured text into structured data that can be used in downstream data science or machine learning applications.
For more details, see
- The 2-min demo video by Marcus Schwarting.
- Section I.C.b of the preprint arXiv:2306.06283.
- The demo app on GitHub pages.
Data processing and inference scripts for OpenAI models can be found in the folder models_openai. These models are fine-tuned on 300 data points and evaluated on a separate set of 50 data points.
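As a rough illustration of the setup described above, the sketch below pairs each free-text procedure with its structured record and splits the pairs into 300 training and 50 evaluation examples. It is not taken from models_openai: the function names, the prompt separator, the stop marker, and the toy records are all assumptions made for illustration, and the prompt/completion JSONL layout is the one used by OpenAI's legacy fine-tuning endpoint, which may differ from what the scripts in this repository actually write.

```python
import json
import random

# Assumed markers (not from this repo): a separator that ends every prompt
# and a stop marker that ends every completion.
PROMPT_END = "\n\n###\n\n"
COMPLETION_END = " END"


def to_finetune_record(procedure_text, structured_record):
    """Pair one free-text procedure with its structured form as a JSON string."""
    return {
        "prompt": procedure_text.strip() + PROMPT_END,
        "completion": " " + json.dumps(structured_record, sort_keys=True) + COMPLETION_END,
    }


def split_records(records, n_train=300, n_test=50, seed=0):
    """Shuffle reproducibly and return disjoint training and evaluation sets."""
    if len(records) < n_train + n_test:
        raise ValueError("not enough records for the requested split")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:n_train + n_test]


def to_jsonl(records):
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(record) for record in records)


if __name__ == "__main__":
    # Hypothetical toy data standing in for procedures and structured records.
    toy_pairs = [
        (f"Procedure {i}: stir the mixture for {i} min.", {"stir_minutes": i})
        for i in range(350)
    ]
    all_records = [to_finetune_record(text, data) for text, data in toy_pairs]
    train_set, test_set = split_records(all_records)
    print(len(train_set), len(test_set))
```

Keeping the evaluation set disjoint from the training set, with a fixed shuffle seed, makes the reported results reproducible and avoids scoring the model on examples it was fine-tuned on.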
The app in demo_apps/dash_app shows inference results from the fine-tuned OpenAI models. An OpenAI API key is required in the deployment script.
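One common way to supply the key without hard-coding it is to read it from an environment variable and fail early if it is missing. The sketch below assumes the variable name OPENAI_API_KEY, which the openai Python package reads by default; the helper name load_openai_api_key is hypothetical, and the actual deployment script may handle the key differently.

```python
import os


def load_openai_api_key(env=None):
    """Return the API key from the environment, or raise if it is not set.

    `env` defaults to os.environ; passing a plain dict makes this easy to test.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("Set OPENAI_API_KEY before deploying the app.")
    return key
```

Checking for the key at startup surfaces a missing credential immediately instead of at the first inference request.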
The app in demo_apps/github_page shows precomputed inference results from an OpenAI davinci model. It is a static page generated from Dash using Epix Zhang's code, and it is synced to the github_page branch.
Throughout this project, organic synthesis procedures, free text or structured, are extracted from the Open Reaction Database. Related scripts can be found in the folder ord_data.
The current team (06/2023) includes:
- Qianxiang Ai
- Stefan Bringuier
- Hassan Harb
- Brenden Pelkie
- Jacob N Sanders
- Marcus Schwarting
- Jiale Shi
This project was conceived during the LLM Hackathon on 2023/03/29. We thank Ben Blaiszik for his generous financial support of this project.