pnnl / cactus

LLM Agent that leverages cheminformatics tools to provide informed responses.

CACTUS 🌵 | Chemistry Agent Connecting Tool Usage to Science

Introduction

CACTUS is a tool-augmented language model designed to assist researchers and chemists with chemistry-related tasks. By integrating state-of-the-art language models with a suite of cheminformatics tools, CACTUS helps users explore chemical space, predict molecular properties, and accelerate drug discovery. Just as the cactus thrives in the harsh desert, adapting to limited resources and extreme conditions, CACTUS was developed by Pacific Northwest National Laboratory (PNNL) scientists to navigate the complex landscape of chemical data and extract valuable insights.

The preprint is available on arXiv (2405.00972).

An API-only demo is available on HuggingFace Spaces.

Running Cactus 🏃

Getting started with Cactus is as simple as:

from cactus.agent import Cactus

# model_name is a Hugging Face model ID (see "Models Tested" below)
Model = Cactus(model_name="google/gemma-7b-it", model_type="vllm")
Model.run("What is the molecular weight of the smiles: OCC1OC(O)C(C(C1O)O)O")
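To ask several questions in one session, a small helper can loop over them. This is a minimal sketch, not part of the cactus package: `ask_all` is a hypothetical name, and it assumes only that the agent exposes the `run(question)` method shown above. The README does not state what `run` returns, so the helper simply stores whatever comes back.

```python
def ask_all(agent, questions):
    """Send each question to an agent that exposes .run(question).

    Returns a dict mapping each question to whatever run() returned.
    Assumption: run() returns the agent's answer; the return type is
    not documented in this README.
    """
    answers = {}
    for question in questions:
        answers[question] = agent.run(question)
    return answers


# Example usage (requires cactus to be installed):
#
#   from cactus.agent import Cactus
#   model = Cactus(model_name="google/gemma-7b-it", model_type="vllm")
#   results = ask_all(model, [
#       "What is the molecular weight of the smiles: OCC1OC(O)C(C(C1O)O)O",
#   ])
```

Because the helper takes the agent as an argument, it can be exercised with a stub object during testing, without loading a model.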

Installation 💻

To install cactus:

pip install git+https://github.com/pnnl/cactus.git

The default PyTorch build is compiled for CUDA 12.1 (or CPU on systems without CUDA). To target an older CUDA version, install from source and edit the [[tool.rye.sources]] section of pyproject.toml before installing. Be aware that vllm may not work properly with older versions of PyTorch.

Alternatively, for development, you can install in editable mode:

git clone https://gitlab.pnnl.gov/computational_data_science/cactus.git
cd cactus
python -m pip install -e .

or install using rye by running:

git clone https://gitlab.pnnl.gov/computational_data_science/cactus.git
cd cactus
rye sync
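After installing, it can help to confirm which PyTorch build ended up in the environment, since the default targets CUDA 12.1. The following is a minimal sketch: `describe_torch_build` is a hypothetical helper name and is not part of cactus. It relies only on standard PyTorch attributes and returns a placeholder result if PyTorch is not importable.

```python
def describe_torch_build():
    """Report the installed PyTorch version and the CUDA version it was built for."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return {
            "installed": False,
            "torch": None,
            "cuda_build": None,
            "cuda_available": False,
        }
    return {
        "installed": True,
        "torch": torch.__version__,
        # torch.version.cuda is None for CPU-only builds.
        "cuda_build": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }


print(describe_torch_build())
```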

Benchmarking 📊

We provide scripts for generating lists of benchmarking questions to evaluate the performance of the CACTUS agent.

These scripts are located in the benchmark directory.

To build the dataset used in the paper, we can run:

python benchmark_creation.py

This will generate a readable dataset named QuestionsChem.csv for use with the Cactus agent.
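Once generated, the file can be inspected with the standard library before it is passed to the agent. This is a sketch under stated assumptions: `load_questions` is a hypothetical helper, and it assumes the CSV has a header row. The column names are not documented here, so the code returns each row as a dict keyed by whatever headers the file contains.

```python
import csv


def load_questions(path="QuestionsChem.csv"):
    """Load the benchmark CSV as a list of dicts, one per row.

    Assumption: the first row of the file is a header. Column names are
    taken from the file itself rather than hard-coded.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Example usage, after running benchmark_creation.py:
#
#   rows = load_questions("QuestionsChem.csv")
#   print(len(rows), "questions loaded")
#   print(rows[0])
```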

Models Tested

For this application we are benchmarking the following models:

| Model | model_name |
| --- | --- |
| llama2-7b | meta-llama/Llama-2-7b-hf |
| mistral-7b | mistralai/Mistral-7B-v0.1 |
| gemma-7b | google/gemma-7b-it |
| falcon-7b | tiiuae/falcon-7b |
| MPT-7b | mosaicml/mpt-7b |
| Phi-2 | microsoft/phi-2 |
| OLMo-1b | allenai/OLMo-1B |

These models were selected based on their strong performance in natural language tasks and their potential for adaptation to domain-specific applications.
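For scripting a benchmark run across these models, the table can be captured as a lookup from short name to `model_name`. This is only a sketch: the dictionary and `resolve_model_name` are hypothetical names, not part of cactus. The values are copied from the table above.

```python
# Short name -> Hugging Face model ID, copied from the "Models Tested" table.
BENCHMARK_MODELS = {
    "llama2-7b": "meta-llama/Llama-2-7b-hf",
    "mistral-7b": "mistralai/Mistral-7B-v0.1",
    "gemma-7b": "google/gemma-7b-it",
    "falcon-7b": "tiiuae/falcon-7b",
    "MPT-7b": "mosaicml/mpt-7b",
    "Phi-2": "microsoft/phi-2",
    "OLMo-1b": "allenai/OLMo-1B",
}


def resolve_model_name(short_name):
    """Return the model_name for a short name, or raise ValueError if unknown."""
    try:
        return BENCHMARK_MODELS[short_name]
    except KeyError:
        known = ", ".join(sorted(BENCHMARK_MODELS))
        raise ValueError(f"Unknown model {short_name!r}; expected one of: {known}") from None


# Example usage (requires cactus to be installed):
#
#   from cactus.agent import Cactus
#   agent = Cactus(model_name=resolve_model_name("Phi-2"), model_type="vllm")
```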

Tools Available

For the initial release, we have simple cheminformatics tools available:

| Tool Name | Tool Usage |
| --- | --- |
| calculate_molwt | Calculate Molecular Weight |
| calculate_logp | Calculate the Partition Coefficient |
| calculate_tpsa | Calculate the Topological Polar Surface Area |
| calculate_qed | Calculate the Quantitative Estimate of Drug-likeness |
| calculate_sa | Calculate the Synthetic Accessibility |
| calculate_bbb_permeant | Calculate Blood Brain Barrier Permeance |
| calculate_gi_absorption | Calculate the Gastrointestinal Absorption |
| calculate_druglikeness | Calculate druglikeness based on Lipinski's Rule of 5 |
| brenk_filter | Calculate if molecule passes the Brenk Filter |
| pains_filter | Calculate if molecule passes the PAINS Filter |

⚠️ Notice: These tools currently expect a SMILES as input, tools for conversion between identifiers are available but not yet working as intended. Fix to come soon.

Future Directions

We are continuously working on improving CACTUS and expanding its capabilities for molecular discovery. Some of our planned features include:

🧬 Integration with physics-based models for 3D structure prediction and analysis
πŸ”§ Support for advanced machine learning techniques (e.g., graph neural networks)
🎯 Enhanced tools for target identification and virtual screening    
πŸ“œ Improved interpretability and explainability of the model's reasoning process

We welcome contributions from the community and are excited to collaborate with researchers and developers to further advance the field of AI-driven drug discovery.

Citation

If you use CACTUS in your research, please cite our preprint:

@article{mcnaughton2024cactus,
    title={CACTUS: Chemistry Agent Connecting Tool-Usage to Science},
    author={Andrew D. McNaughton and Gautham Ramalaxmi and Agustin Kruel and Carter R. Knutson and Rohith A. Varikoti and Neeraj Kumar},
    year={2024},
    eprint={2405.00972},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

License: BSD 2-Clause "Simplified" License