cazy-parser

A way to extract specific information from the Carbohydrate-Active enZYmes (CAZy) database.

If you are using this tool please read and cite the paper!

RV Honorato. CAZy-parser a way to extract information from the Carbohydrate-Active enZYmes Database. The Journal of Open Source Software, 1(8), dec 2016.

doi: 10.21105/joss.00053

Also make sure to visit and cite the CAZy website

http://www.cazy.org/
Lombard V, Golaconda Ramulu H, Drula E, Coutinho PM, Henrissat B (2014) The Carbohydrate-active enzymes database (CAZy) in 2013. Nucleic Acids Res 42:D490–D495. [PMID: 24270786].

Introduction

cazy-parser is a tool that extracts information from CAZy in a more usable and readable format. First, a script reads the HTML structure and creates a mirror of the database as a tab delimited file. Second, information is extracted from that mirror according to user-supplied parameters and presented to the user as a set of accession codes.

Changelog

v1.1 - Fixed bug when identifying page indexes

Installation

$ pip install cazy-parser

Alternatively, download the latest source from this link and install it manually:

$ tar -zxvf cazy-parser-x.x.x.tar.gz
$ cd cazy-parser-x.x.x
$ python setup.py install

Usage

Please note that both steps require an internet connection.

Database creation

$ create_cazy_db

(-h for help)

This script will parse the CAZy database website and create a comma separated table containing the following information:
- domain
- protein_name
- family
- tag (characterized status)
- organism_code
- EC number (EC stands for Enzyme Commission number)
- GENBANK id
- UNIPROT code
- subfamily
- organism
- PDB code
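
As a rough illustration of how the resulting table might be consumed downstream, the sketch below loads a delimited table with Python's standard `csv` module. It assumes the file has a header row and that the column names (`family`, `subfamily`, `tag`, `genbank`) resemble the list above; the function name `load_cazy_table`, the column names, and the sample rows are hypothetical and not taken from `create_cazy_db` itself. The delimiter is left as a parameter so it can be set to match the actual file.

```python
import csv
import io


def load_cazy_table(handle, delimiter=","):
    """Read a delimited table with a header row into a list of dicts.

    Assumption: the first line holds the column names. The delimiter is a
    parameter because the separator used by the real file should be checked.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    return [dict(row) for row in reader]


# Hypothetical two-row sample; real column names and values may differ.
sample = io.StringIO(
    "family,subfamily,tag,genbank\n"
    "GH43,1,characterized,ABC12345.1\n"
    "GT9,,,XYZ67890.1\n"
)
rows = load_cazy_table(sample)
print(rows[0]["family"], rows[0]["genbank"])
```

For a file on disk, the same function could be called with `open(path, newline="")` in place of the in-memory sample.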

Extract sequences

Based on the previously generated CSV table, extract accession codes for a given protein family.

$ extract_cazy_ids --db <database> --family <family code>

(-h for help)

Optional:

--subfamilies Create a file for each subfamily, default = False

--characterized Create a file containing only characterized enzymes, default = False
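
To make the effect of these options concrete, here is a minimal sketch of the kind of filtering they describe: selecting accession codes for one family, optionally keeping only characterized entries, and grouping by subfamily. It is not the implementation used by `extract_cazy_ids`; the function names, the column names (`family`, `subfamily`, `tag`, `genbank`), the tag value `"characterized"`, and the sample rows are all assumptions made for illustration.

```python
from collections import defaultdict


def select_accessions(rows, family, characterized_only=False):
    """Return accession codes for one family, optionally only characterized ones."""
    selected = []
    for row in rows:
        if row["family"] != family:
            continue
        # Assumption: characterized entries carry the tag value "characterized".
        if characterized_only and row["tag"] != "characterized":
            continue
        selected.append(row["genbank"])
    return selected


def group_by_subfamily(rows, family):
    """Map each non-empty subfamily of a family to its accession codes."""
    groups = defaultdict(list)
    for row in rows:
        if row["family"] == family and row["subfamily"]:
            groups[row["subfamily"]].append(row["genbank"])
    return dict(groups)


# Hypothetical rows standing in for the parsed database table.
rows = [
    {"family": "GH43", "subfamily": "1", "tag": "characterized", "genbank": "AAA00001.1"},
    {"family": "GH43", "subfamily": "2", "tag": "", "genbank": "AAA00002.1"},
    {"family": "GT9", "subfamily": "", "tag": "", "genbank": "BBB00001.1"},
]

print(select_accessions(rows, "GH43"))
print(select_accessions(rows, "GH43", characterized_only=True))
print(group_by_subfamily(rows, "GH43"))
```

Each returned list or group would correspond to one output file in the examples that follow.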

Usage examples

Extract all accession codes from family 9 of Glycosyl Transferases.

$ extract_cazy_ids --db CAZy_DB_xx-xx-xxxx.csv --family GT9

This will generate the following file:

GT9.csv

Extract all accession codes from family 43 of Glycoside Hydrolases, including subfamilies.

$ extract_cazy_ids --db CAZy_DB_xx-xx-xxxx.csv --family GH43 --subfamilies

This will generate the following files:

GH43.csv
GH43_sub1.csv
GH43_sub2.csv
GH43_sub3.csv
(...)
GH43_sub37.csv

Extract all accession codes from family 42 of Polysaccharide Lyases, plus a separate file containing only the characterized entries.

$ extract_cazy_ids --db CAZy_DB_xx-xx-xxxx.csv --family PL42 --characterized

This will generate the following files:

PL42.fasta
PL42_characterized.fasta

To-do and how to contribute

Please refer to CONTRIBUTE.md

Known bugs

None, yet.

Contact info

If you have any inquiries, please contact me at rvhonorato at gmail.com
