mmtfPyspark is a Python package that provides APIs and sample applications for distributed analysis and scalable mining of 3D biomacromolecular structures, such as those in the Protein Data Bank (PDB) archive. mmtfPyspark uses Big Data technologies to enable high-performance parallel processing of macromolecular structures. mmtfPyspark uses the following technology stack:
- Apache Spark: a fast and general engine for large-scale distributed data processing
- MMTF: the Macromolecular Transmission Format for compact data storage, transmission and high-performance parsing
- Hadoop Sequence File: a Big Data file format for parallel I/O
- Apache Parquet: a columnar data format to store dataframes
This project is still under development.
We strongly recommend installing Anaconda, and we require Python 3.6 or later. To check your Python version:
python --version
If Anaconda is installed and you have Python 3.6, the above command should return output similar to:
Python 3.6.4 :: Anaconda, Inc.
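The same check can be made from inside Python, which is handy at the top of a notebook or script. The following is a minimal sketch; the helper name `python_is_supported` is hypothetical and not part of mmtfPyspark.

```python
import sys

# Minimum version stated in this README.
MIN_VERSION = (3, 6)

def python_is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) version is at least `minimum`.

    Hypothetical helper for illustration; not part of mmtfPyspark.
    """
    return tuple(version_info[:2]) >= minimum

if __name__ == "__main__":
    # Report the running interpreter version and whether it meets the minimum.
    print("Python", sys.version.split()[0], "supported:", python_is_supported())
```

Comparing tuples rather than version strings avoids the pitfall where "3.10" sorts before "3.6" as text.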
Because mmtfPyspark relies on parallel computing to achieve high performance, it requires additional dependencies such as Apache Spark. Please follow the installation instructions for your OS carefully.
The MMTF Hadoop sequence files of all PDB structures can be downloaded and unpacked with:
curl -O http://mmtf.rcsb.org/v1.0/hadoopfiles/full.tar
tar -xvf full.tar
curl -O http://mmtf.rcsb.org/v1.0/hadoopfiles/reduced.tar
tar -xvf reduced.tar
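Where curl and tar are not available, the same download and unpacking steps can be approximated with the Python standard library. This is a sketch only: the function names `extract_archive` and `download_and_extract` are hypothetical, the URL prefix is taken from the commands above, and the archives are large, so expect the download to take a while.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL prefix taken from the curl commands in this README.
BASE_URL = "http://mmtf.rcsb.org/v1.0/hadoopfiles"

def extract_archive(archive_path, dest_dir="."):
    """Unpack a tar archive into dest_dir and return its member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path) as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names

def download_and_extract(name, dest_dir="."):
    """Download <name>.tar from BASE_URL into dest_dir, then unpack it there.

    `name` is expected to be "full" or "reduced", matching the commands above.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / f"{name}.tar"
    urllib.request.urlretrieve(f"{BASE_URL}/{name}.tar", archive)
    return extract_archive(archive, dest)
```

Calling `download_and_extract("full")` followed by `download_and_extract("reduced")` mirrors the four shell commands above.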
On Mac and Linux, the Hadoop sequence files can be downloaded and their locations stored in environment variables by downloading and sourcing the following script:
curl https://raw.githubusercontent.com/sbl-sdsc/mmtf-pyspark/master/bin/download_mmtf_files.sh -o download_mmtf_files.sh
. ./download_mmtf_files.sh
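Once the script has been sourced, Python code can look up the file locations from the environment. The sketch below assumes the variables are named `MMTF_FULL` and `MMTF_REDUCED`; this README does not state the names, so check the script before relying on them. The helper `mmtf_path` is hypothetical.

```python
import os

def mmtf_path(variable, environ=os.environ):
    """Return the path stored in an environment variable.

    Raises KeyError with a hint if the variable is unset or empty.
    The variable names used below are assumptions, not confirmed by this README.
    """
    path = environ.get(variable)
    if not path:
        raise KeyError(f"{variable} is not set; source download_mmtf_files.sh first")
    return path

# Example usage (assumed variable names):
# full_path = mmtf_path("MMTF_FULL")
# reduced_path = mmtf_path("MMTF_REDUCED")
```

Failing early with a clear message is usually easier to debug than passing an empty path on to Spark.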