fsimkovic / mmtf-pyspark

Methods for the parallel and distributed analysis and mining of the Protein Data Bank using MMTF and Apache Spark.

MMTF PySpark

mmtfPyspark is a Python package that provides APIs and sample applications for distributed analysis and scalable mining of 3D biomacromolecular structures, such as the Protein Data Bank (PDB) archive. mmtfPyspark uses Big Data technologies to enable high-performance parallel processing of macromolecular structures. mmtfPyspark uses the following technology stack:

  • Apache Spark: a fast and general engine for large-scale distributed data processing.
  • MMTF: the Macromolecular Transmission Format for compact data storage, transmission, and high-performance parsing.
  • Hadoop Sequence File: a Big Data file format for parallel I/O.
  • Apache Parquet: a columnar data format to store dataframes.

This project is still under development.

Installation

Python

We strongly recommend installing Anaconda, and we require Python 3.6 or later. To check your Python version:

python --version

If Anaconda is installed and you have Python 3.6, the above command should return output similar to:

Python 3.6.4 :: Anaconda, Inc.
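
The same check can be made from inside Python, which is handy at the top of a notebook or script. This is a minimal sketch, not part of mmtfPyspark: the helper name python_is_supported is hypothetical, and the only fact it encodes is the 3.6 minimum stated above.

```python
import sys

# Minimum version stated in this README.
MIN_VERSION = (3, 6)


def python_is_supported(version_info=None, minimum=MIN_VERSION):
    """Return True if the (major, minor) version meets the minimum.

    Hypothetical helper for illustration; it is not an mmtfPyspark API.
    """
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor) so that 3.10 correctly ranks above 3.6.
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    status = "supported" if python_is_supported() else "too old"
    print(sys.version.split()[0], status)
```

Comparing version tuples rather than strings avoids the common mistake of treating "3.10" as older than "3.6".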

mmtfPyspark and dependencies

Since mmtfPyspark uses parallel computing to achieve high performance, it requires additional dependencies such as Apache Spark. Therefore, please follow the installation instructions for your OS carefully:

macOS and Linux

Windows

Hadoop Sequence Files

The MMTF Hadoop sequence files of all PDB structures are available in full and reduced versions and can be downloaded with:

curl -O http://mmtf.rcsb.org/v1.0/hadoopfiles/full.tar
tar -xvf full.tar

curl -O http://mmtf.rcsb.org/v1.0/hadoopfiles/reduced.tar
tar -xvf reduced.tar
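
For systems without curl or tar (for example, a plain Windows shell), the same download and unpack steps can be sketched with the Python standard library. This is an illustrative sketch only: the function names fetch_mmtf_archive and extract_archive are hypothetical, and the base URL is taken from the curl commands above.

```python
import tarfile
import urllib.request
from pathlib import Path

# Base URL copied from the curl commands in this README.
BASE_URL = "http://mmtf.rcsb.org/v1.0/hadoopfiles"


def extract_archive(tar_path, dest_dir):
    """Unpack a tar archive into dest_dir and return its sorted member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(str(tar_path)) as archive:
        names = sorted(archive.getnames())
        archive.extractall(str(dest))
    return names


def fetch_mmtf_archive(name, dest_dir="."):
    """Download <name>.tar ('full' or 'reduced') into dest_dir and unpack it.

    Hypothetical helper; the archives are large, so expect a long download.
    """
    if name not in ("full", "reduced"):
        raise ValueError("name must be 'full' or 'reduced'")
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    tar_path = dest / (name + ".tar")
    urllib.request.urlretrieve(BASE_URL + "/" + name + ".tar", str(tar_path))
    return extract_archive(tar_path, dest)
```

Calling fetch_mmtf_archive("reduced", "mmtf_data") would mirror the second curl/tar pair above, placing the unpacked files under the (assumed) directory mmtf_data.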

On macOS and Linux, the Hadoop sequence files can be downloaded and their locations stored in environment variables by running the following commands:

curl https://raw.githubusercontent.com/sbl-sdsc/mmtf-pyspark/master/bin/download_mmtf_files.sh -o download_mmtf_files.sh
. ./download_mmtf_files.sh
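
The script is sourced (the leading ". ") rather than executed so that any variables it exports remain set in the current shell. A Python process started from that shell can then look the locations up. The sketch below assumes the variables are named MMTF_FULL and MMTF_REDUCED; that naming is an assumption not stated in this README, so check download_mmtf_files.sh for the names it actually exports. The helper mmtf_path is hypothetical.

```python
import os

# ASSUMPTION: these variable names are not given in this README.
# Verify them against download_mmtf_files.sh before relying on them.
ENV_VARS = {"full": "MMTF_FULL", "reduced": "MMTF_REDUCED"}


def mmtf_path(kind, environ=None):
    """Return the directory stored in the environment for 'full' or 'reduced'.

    Raises KeyError if the variable is missing or empty, which usually
    means the download script was run instead of sourced.
    """
    if environ is None:
        environ = os.environ
    var = ENV_VARS[kind]
    path = environ.get(var)
    if not path:
        raise KeyError(var + " is not set; source download_mmtf_files.sh first")
    return path
```

Passing the environment as an optional argument keeps the lookup easy to test without modifying the real process environment.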

License: Apache License 2.0


Languages

Python 95.1%, Jupyter Notebook 4.1%, Shell 0.8%, Batchfile 0.0%