cazy-parser

A way to extract specific information from the Carbohydrate-Active enZYmes.

License: GNU GPLv3

If you are using this tool please read and cite the paper!

RV Honorato. CAZy-parser a way to extract information from the Carbohydrate-Active enZYmes Database. The Journal of Open Source Software, 1(8), dec 2016.

doi: 10.21105/joss.00053

Also make sure to visit and cite the CAZy website

  • http://www.cazy.org/
  • Lombard V, Golaconda Ramulu H, Drula E, Coutinho PM, Henrissat B (2014) The Carbohydrate-active enzymes database (CAZy) in 2013. Nucleic Acids Res 42:D490–D495. [PMID: 24270786].

Introduction

cazy-parser is a tool that extracts information from CAZy in a more usable and readable format. First, a script reads the HTML structure and creates a mirror of the database as a tab delimited file. Second, information is extracted from this database according to user-supplied parameters and presented to the user as a set of accession codes.

Changelog

v1.1 - Fixed bug when identifying page indexes

Installation

$ pip install cazy-parser

or

Download latest source from this link

$ tar -zxvf cazy-parser-x.x.x.tar.gz
$ cd cazy-parser-x.x.x
$ python setup.py install

Usage

Please note that both steps require an internet connection.

  1. Database creation

$ create_cazy_db

(-h for help)

  • This script will parse the CAZy database website and create a comma separated table containing the following information:
    • domain
    • protein_name
    • family
    • tag (characterized status)
    • organism_code
    • EC number (ec stands for Enzyme Commission number)
    • GENBANK id
    • UNIPROT code
    • subfamily
    • organism
    • PDB code
  2. Extract sequences
  • Based on the previously generated CSV table, extract accession codes for a given protein family.

$ extract_cazy_ids --db <database> --family <family code>

(-h for help)

  • Optional:

--subfamilies Create a file for each subfamily, default = False

--characterized Create a file containing only characterized enzymes, default = False
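
The table created in step 1 can also be inspected directly from Python. The following is a minimal sketch of selecting the accession codes of one family, not the tool's own implementation. The column names (`family`, `genbank`, and so on) and the sample rows are assumptions modelled on the field list above; check the header and delimiter of your own database file before relying on them.

```python
import csv
import io

# Made-up sample data. The header names are hypothetical, modelled on the
# field list above; the real file may name or order its columns differently.
SAMPLE = """domain,protein_name,family,tag,organism_code,ec,genbank,uniprot,subfamily,organism,pdb
Bacteria,example protein A,GH43,characterized,1,3.2.1.37,AAA00001.1,P00001,1,Example organism,
Bacteria,example protein B,GH43,,2,,AAA00002.1,,2,Example organism,
Bacteria,example protein C,GT9,,3,,AAA00003.1,,,Example organism,
"""


def accessions_for_family(handle, family, delimiter=","):
    """Return the GenBank accession of every row that belongs to `family`."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return [
        row["genbank"]
        for row in reader
        if row["family"] == family and row["genbank"]  # skip rows without an accession
    ]


print(accessions_for_family(io.StringIO(SAMPLE), "GH43"))
```

To read a real file, pass an open file handle instead of the `io.StringIO` wrapper, and set `delimiter="\t"` if the table turns out to be tab delimited.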

Usage examples

  1. Extract all accession codes from family 9 of Glycosyl Transferases.

$ extract_cazy_ids --db CAZy_DB_xx-xx-xxxx.csv --family GT9

This will generate the following files:

GT9.csv
  2. Extract all accession codes from family 43 of Glycoside Hydrolases, including subfamilies.

$ extract_cazy_ids --db CAZy_DB_xx-xx-xxxx.csv --family GH43 --subfamilies

This will generate the following files:

GH43.csv
GH43_sub1.csv
GH43_sub2.csv
GH43_sub3.csv
(...)
GH43_sub37.csv
  3. Extract all accession codes from family 42 of Polysaccharide Lyases, including characterized entries.

$ extract_cazy_ids --db CAZy_DB_xx-xx-xxxx.csv --family PL42 --characterized

This will generate the following files:

PL42.fasta
PL42_characterized.fasta
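
When several families need to be extracted, the commands above can be driven from a short Python script. This is a sketch only: `build_command` and `run_extraction` are hypothetical helper names, and the only flags used are the ones documented in this README. It assumes `extract_cazy_ids` is installed and on your `PATH`.

```python
import shlex
import subprocess


def build_command(db, family, subfamilies=False, characterized=False):
    """Assemble the extract_cazy_ids command line as an argument list."""
    cmd = ["extract_cazy_ids", "--db", db, "--family", family]
    if subfamilies:
        cmd.append("--subfamilies")
    if characterized:
        cmd.append("--characterized")
    return cmd


def run_extraction(db, family, **flags):
    """Run extract_cazy_ids; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_command(db, family, **flags), check=True)


# The three usage examples above, expressed as (family, flags) pairs.
EXAMPLES = [
    ("GT9", {}),
    ("GH43", {"subfamilies": True}),
    ("PL42", {"characterized": True}),
]

# Only print the commands here; call run_extraction() to actually execute them.
for family, flags in EXAMPLES:
    print(shlex.join(build_command("CAZy_DB_xx-xx-xxxx.csv", family, **flags)))
```

Passing the arguments as a list avoids shell quoting problems with unusual file names; `shlex.join` (Python 3.8+) is used only to display the command.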

To-do and how to contribute

Please refer to CONTRIBUTE.md

Known bugs

None, yet.

Contact info

If you have any inquiries, please contact me at rvhonorato at gmail.com
