ifps with large number of protein/lig pairs

Question

abazabaaa opened this issue 2 years ago · comments

Hi --

I use a docking approach that results in the minimization of a large number of ligands (800k-1M) in a single protein and the protein is also minimized. The result is that I end up with a large list of PDBs with protein/lig pairs. I can split all of these fine and end up with PDBs of the protein and mol2 files for each ligand.

The primary question I have relates to the fact that I have a large number of distinct protein conformations along with a respective ligand. It seems that LUNA is best set up to generate IFPs for a single rigid protein and a large number of ligands.

I am happy to write my own code based on your libraries and contribute (I feel that your approach is the best I have seen so far), but I am not quite sure where to start. I would like to avoid creating projects or pickled objects and instead append the IFP onbits to a column within a dataframe (along with other descriptors) and then serialize this with apache arrow (pandas df -> arrow table -> parquet file).

I wondered if you have the time to briefly suggest some approaches where I could iteratively load protein/ligand pairs and generate IFPs. I can build this out to work with arrow.

I understand that this is probably a different use case than is intended based on your library, and you have limited time to work on it.

Thanks!

Tom

Alexandre Fassio · Answer 1 · Thu Oct 06 2022 12:34:26 GMT+0800 (China Standard Time)

Hi @abazabaaa

1 - Regarding your point that LUNA seems best set up to generate IFPs for a single rigid protein and a large number of ligands: that isn't the case. LUNA can generate IFPs for any complexes, even if the proteins and ligands are different. Our IFPs are independent of the protein/ligand structures because no structural alignment is involved. That means you can still calculate similarities between IFPs for complexes containing different entities. In the end, if the ligands have a similar binding mode, the IFPs will be able to indicate that.

2 - Regarding the preprocessing of files you mentioned, let me show you the magic!!!

From what I understood, you have multiple conformations from a single protein given by a post-docking minimization. Right?
It just wasn't clear to me if you only have PDB files (one per complex) or a mix of PDB files (for protein structures) and MOL2 files (for ligands).

If you have multiple PDB files (one per complex), don't preprocess your files. Just use MolEntry.from_file( <INPUT_FILE> ). The input file should contain one complex per line. See below how to define a complex:

Example:

# Format: <PDB_ID--or--FILENAME>:<CHAIN_ID>:<LIG_NAME>:<LIG_NUMBER>
#
# If LUNA doesn't find a PDB file with the provided PDB id (filename),
# it will try to download it from the RCSB PDB.

# Complex 1: to be downloaded from the RCSB PDB
3QQK:A:X02:497

# Complex 2: your complex
docking1:A:LIG:999
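
With close to a million complexes, the input file itself is best generated by a script. Here is a minimal sketch in plain Python (no LUNA calls) that writes one line per PDB file in the format above. The helper names build_entry_lines and write_entries_file are hypothetical, and the default chain id, ligand name, and ligand number are only placeholders taken from the example; use whatever your docking output actually contains.

```python
from pathlib import Path
from typing import Iterable, List


def build_entry_lines(pdb_files: Iterable[str], chain_id: str = "A",
                      lig_name: str = "LIG", lig_number: int = 999) -> List[str]:
    """Return one '<FILENAME>:<CHAIN_ID>:<LIG_NAME>:<LIG_NUMBER>' line per PDB file.

    Assumption: every complex uses the same chain id, ligand name, and ligand number.
    """
    lines = []
    for pdb_file in pdb_files:
        # Path.stem drops the directory and the ".pdb" extension.
        filename = Path(pdb_file).stem
        lines.append(f"{filename}:{chain_id}:{lig_name}:{lig_number}")
    return lines


def write_entries_file(lines: List[str], out_path: str) -> None:
    """Write the entry lines to a text file, one complex per line."""
    Path(out_path).write_text("\n".join(lines) + "\n")
```

For example, build_entry_lines(["poses/docking1.pdb"]) gives the single line docking1:A:LIG:999, matching "Complex 2" above.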

Now, let's say you end up with multiple MOL2 files (one per ligand) or a single multimol MOL2 file, and probably multiple PDB files (one per ligand, because of the minimization). You don't need to preprocess (split) your files in this case either. LUNA can handle it easily with MolFileEntry.

You can create a list of entries using the function MolFileEntry.from_file( <INPUT_FILE> ). It expects an input file where each line defines a complex. Note that each line should contain the protein PDB id (filename without the .pdb), the ligand name (it only makes a difference in case there is more than one ligand in the MOL2 file), the MOL2 file path, and a flag indicating if there are multiple ligands in the MOL2 file.

Example:

# Complex 1:
#
# The minimized protein is stored in a file named PDB1.pdb
# The ligand can be found in the MOL2 file "input/ligand1.mol2"
# The MOL2 file contains only one ligand, so the flag is set to False.

PDB1,ligand1,input/ligand1.mol2,False

# Complex 2:
#
# The minimized protein is stored in a file named PDB2.pdb
# The ligand can be found in the MOL2 file "input/other_ligands.mol2"
# The MOL2 file contains multiple ligands, so the flag is set to True.

PDB2,ligand2,input/other_ligands.mol2,True
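
The same kind of script can produce this comma-separated input file from a list of (protein PDB, ligand MOL2) pairs. This is a sketch in plain Python, not a LUNA function: build_mol_file_entry_lines is a hypothetical helper, and it assumes the ligand name can be taken from the MOL2 file name, which only holds for single-ligand files like "Complex 1" above.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def build_mol_file_entry_lines(pairs: Iterable[Tuple[str, str]],
                               is_multimol: bool = False) -> List[str]:
    """Return one '<PDB_ID>,<LIG_NAME>,<MOL2_PATH>,<FLAG>' line per pair.

    Assumption: the ligand name equals the MOL2 file name without extension.
    For multimol files, pass the real ligand names instead.
    """
    lines = []
    for pdb_file, mol2_file in pairs:
        pdb_id = Path(pdb_file).stem      # filename without ".pdb"
        lig_name = Path(mol2_file).stem   # filename without ".mol2"
        lines.append(f"{pdb_id},{lig_name},{mol2_file},{is_multimol}")
    return lines
```

Calling it with [("proteins/PDB1.pdb", "input/ligand1.mol2")] reproduces the first example line, PDB1,ligand1,input/ligand1.mol2,False.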

Alternatively, you could also use MolFileEntry.from_mol_file(). However, this function can be used to define only one complex. That means you'd have to loop through your list of complexes and call this function multiple times, producing the same result as the previous option. So, you can choose what works better for you.

After creating the entries, just provide them to a LocalProject object.

3 - Finally, regarding the creation of local files...

By default, LocalProject was designed to produce local files, as the name suggests. There is a backburner option designed for storing data in a database, but it is still not my priority to finish it right now.

In your case, you can run LocalProject as it is, read the IFP file it creates (a CSV file) with pandas, write the Parquet file, and then remove the project from disk.

Alternatively, you can also access all IFPs as shown below:

project_obj.ifps

This yields Fingerprint objects (one per ligand). You can then transform the bits/counts into the format you prefer.

Personally, I think the first option is better as you'll end up with a local project anyway. Then you'll just need to read the created CSV file.
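
Whichever option you pick, the last step (on-bits into a dataframe column, then pandas df -> arrow table -> parquet file) could look roughly like the sketch below. It deliberately takes a plain mapping from complex id to on-bit indices as input, because how you extract those indices from the CSV file or from the Fingerprint objects isn't shown here. The names ifps_to_rows and write_parquet and the column names are hypothetical; write_parquet needs pandas and pyarrow installed.

```python
from typing import Dict, Iterable, List


def ifps_to_rows(on_bits_by_complex: Dict[str, Iterable[int]]) -> List[dict]:
    """Turn {complex_id: on-bit indices} into one row (dict) per complex."""
    rows = []
    for complex_id, on_bits in on_bits_by_complex.items():
        rows.append({
            "complex_id": complex_id,
            # Sorted and de-duplicated so every row has a stable representation.
            "ifp_on_bits": sorted(set(int(b) for b in on_bits)),
        })
    return rows


def write_parquet(rows: List[dict], path: str) -> None:
    """pandas df -> arrow table -> parquet file."""
    # Imported here so the row-building step works without these packages.
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame(rows)
    table = pa.Table.from_pandas(df, preserve_index=False)
    pq.write_table(table, path)
```

Other descriptors can be added as extra keys in each row before calling write_parquet. For a very large number of complexes, writing in batches (one Parquet file per batch) keeps memory use bounded.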

In any case, if you really need a custom function that creates Parquet files without creating any pickled files, we'd have to define a new class that inherits from Project.

Hope I've been able to cover all your questions. If there's any other just let me know.

bests

Thomas Graham · Answer 2 · Thu Oct 27 2022 22:50:18 GMT+0800 (China Standard Time)

Thanks so much! Working on giving this a try and will let you know.

Alexandre Fassio · Answer 3 · Wed Jul 12 2023 21:18:45 GMT+0800 (China Standard Time)

Please, let me know if you need any further help.

bests