bertiewooster / mol_frame

Chemical Structure Handling for Pandas DataFrames

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

MolFrames

Working with Pandas DataFrames that can handle chemical structures.

[Update 30-May-2018]: The module has been rewritten again and the Dask functionality has been removed again because it was too limiting. If you want to continue using Dask, please switch to the dask branch. Large datasets which do not fit in memory can either be first parsed and pre-filtered with the pipeline module of this project or be pre-processed with a pure dask dataframe, like this:

import dask.dataframe as dd
ddf = dd.read_csv("mylargedataset.csv")
# Pre-filtering by columns that are already present in the dataset:
ddf = ddf[(ddf["LogP"] < 5.0) & (ddf["MW"] < 500.0)]
ddf = ddf.compute()

and then imported into a Mol_Frame:

from mol_frame import mol_frame as mf
molf = mf.Mol_Frame()
molf.data = ddf

Module Mol_Frame

Operations on molecules. For long operations, this displays a progress bar in the Notebook.

Output as sortable and optionally selectable HTML Tables: HTML Tables
(this is an extension of bluenote10's NimData html browser which uses JQuery Datatables.)

Interactive HoloViews plots with structure tooltips: Scatter Plot

Module Pipeline

A Pipelining Workflow using Python Generators, mainly for RDKit and large compound sets. The use of generators allows working with arbitrarily large data sets, the memory usage at any given time is low.

Example use:

from mol_frame import pipeline as p
s = p.Summary()
rd = p.start_csv_reader(test_data_b64.csv.gz", summary=s)
b64 = p.pipe_mol_from_b64(rd, summary=s)
filt = p.pipe_mol_filter(b64, "[H]c2c([H])c1ncoc1c([H])c2C(N)=O", summary=s)
p.stop_sdf_writer(filt, "test.sdf", summary=s)

or, using the pipe function:

from mol_frame import pipeline as p
s = p.Summary()
p.pipe((p.start_sdf_reader("test.sdf", {"summary": s})),
     p.pipe_keep_largest_fragment,
     (p.pipe_neutralize_mol, {"summary": s}),
     (p.pipe_keep_props, ["Ordernumber", "NP_Score"]),
     (p.stop_csv_writer, "test.csv", {"summary": s})
    )

The progress of the pipeline is displayed as a HTML table in the Notebook and can also be followed in a separate terminal with: watch -n 2 cat pipeline.log.

Currently Available Pipeline Components:

Starting Running Stopping
start_cache_reader pipe_calc_props stop_cache_writer
start_csv_reader pipe_custom_filter stop_count_records
start_mol_csv_reader pipe_custom_man stop_csv_writer
start_sdf_reader pipe_do_nothing stop_df_from_stream
start_stream_from_dict pipe_has_prop_filter stop_dict_from_stream
start_stream_from_mol_list pipe_id_filter stop_mol_list_from_stream
pipe_inspect_stream stop_sdf_writer
pipe_join_data_from_file
pipe_keep_largest_fragment
pipe_keep_props
pipe_merge_data
pipe_mol_filter
pipe_mol_from_b64
pipe_mol_from_smiles
pipe_mol_to_b64
pipe_mol_to_smiles
pipe_murcko_smiles
pipe_calc_fp_b64
pipe_neutralize_mol
pipe_remove_props
pipe_remove_dups
pipe_rename_prop
pipe_sim_filter
pipe_sleep

Limitation: unlike in other pipelining tools, because of the nature of Python generators, the pipeline can not be branched. You can use the stop_cache_writer to use results in multiple pipelines.

Module SAR

Currently implements a RandomForestClassifier and offers SimilarityMap visualization as shown in the RDKit Cookbook:

Similarity Map

Please also have a look at the sar_example notebook in the tutorials folder.

Requirements

The recommended way to install the dependencies is via conda.

The code is written and tested on Ubuntu 20.04 and Python 3.9 and is intended for use in the Jupyter Notebook.
It also works in JupyterLab, but the custom javascript progress bars are then not displayed (yes, I probably should have used tqdm, but I didn't).

The main class is the MolFrame, which is a wrapper around a Pandas dataframe, exposing all DataFrame methods and extending it with some chemical functionality from the RDKit.
The underlying DataFrame is contained in the MolFrame.data object and can always be accessed directly, if necessary.

See the accompanying Tutorial notebook for further examples.

Installation

After installing the requirements, clone this repo, then the module can be used by including the project's base directory (mol_frame) in Python's import path (I actually prefer this to using setuptools, because a simple git pull will get you the newest version). This can be achieved as follows:

Put a file with the extension .pth, e.g. my_packages.pth, into one of the site-packages directories of your Python installation (e.g. ~/anaconda3/envs/chem/lib/python3.6/site-packages/) and put the path to the base directory of this project (mol_frame) into it. (I have the path to a dedicated folder on my machine included in such a .pth file and link all my development projects to that folder. This way, I need to create / modify the .pth file only once.)

Disclaimer

This is work in progress and AFAIA currently mainly used by me, compatibility-breaking changes will happen.

About

Chemical Structure Handling for Pandas DataFrames

License:MIT License


Languages

Language:Jupyter Notebook 94.0%Language:HTML 3.8%Language:Python 2.2%