atgenomix / graphseq

String Graph Construction on Spark

GraphSeq: Accelerating String Graph Construction for De Novo Assembly on Spark

Abstract

De novo genome assembly is important both for assembling uncharacterized genomes and for identifying variants without reference bias. Unlike a de Bruijn graph, a string graph is a lossless data representation for de novo assembly. However, string graph construction is computationally intensive. We propose GraphSeq, which accelerates string graph construction by leveraging the Spark distributed computing framework.
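
To make the term concrete: in a string graph, each read is a vertex, and an edge links two reads when a suffix of one matches a prefix of the other. The toy Python sketch below shows only that overlap step, using a brute-force comparison. It is not GraphSeq's implementation, and the names `overlap_length`, `build_overlap_edges`, and `min_overlap` are hypothetical. A full string graph construction would also discard contained reads and transitive edges, which this sketch does not attempt.

```python
from collections import defaultdict


def overlap_length(a, b, min_overlap):
    """Return the length of the longest suffix of `a` that equals a prefix
    of `b`, or 0 if no such overlap reaches `min_overlap` characters."""
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def build_overlap_edges(reads, min_overlap):
    """Brute-force all-pairs comparison (quadratic in the number of reads).
    Returns {source_read_index: [(target_read_index, overlap_length), ...]}."""
    edges = defaultdict(list)
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue  # ignore self-overlaps in this toy example
            k = overlap_length(a, b, min_overlap)
            if k:
                edges[i].append((j, k))
    return dict(edges)


reads = ["ACGTAC", "GTACGG", "ACGGTT"]
edges = build_overlap_edges(reads, min_overlap=3)
print(edges)
```

The all-pairs loop is what makes a naive approach expensive on real sequencing data, which is the cost the abstract refers to.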

Workflow

Usage

$ /usr/local/spark/bin/spark-submit \
    --master spark://XXX:7077 \
    --class com.atgenomix.seqslab.cli.SparkSTMain \
    ./target/graphseq-1.0.0.jar overlap
INPUT                  : Input path (generated by Adam transform)
OUTPUT                 : Output path
-cache                 : Cache the reads in memory to speedup data processing
-h (-help, --help, -?) : Print help
-max_edges N           : Maximal number of edges per read [default = Integer.MAX_VALUE]
-max_read_length N     : Maximal read length [default = 151]
-mlcp N                : Minimal longest common prefix [default = 45]
-packing_size N        : The number of reads will be packed together [default = 100]
-pl_batch N            : Prefix length for number of batches [default=1]
-pl_partition N        : Prefix length for number of partitions [default=7]
-print_metrics         : Print metrics to the log on completion
-profiling             : Enable performance profiling and output to $OUTPUT/STATS
-rmdup                 : Remove duplication of reads
-stats                 : Enable to output statistics of String Graph to $OUTPUT/STATS
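
The option list does not explain how the prefix lengths translate into work units. One plausible reading, stated here as an assumption rather than documented behavior, is that suffixes are grouped by their leading bases, so a prefix of length n over the alphabet A, C, G, T yields up to 4^n groups. Under that reading, the default `-pl_partition 7` would give 16384 partitions and `-pl_batch 1` would give 4 batches. The Python sketch below illustrates the arithmetic and the bucketing. The function names are hypothetical, and `min_overlap` stands in for the role `-mlcp` appears to play (the shortest suffix worth indexing).

```python
from collections import defaultdict

ALPHABET = "ACGT"


def partition_count(prefix_length):
    """Number of distinct prefixes of the given length over A, C, G, T."""
    return len(ALPHABET) ** prefix_length


def prefix_to_partition(prefix):
    """Encode a prefix as a base-4 integer, used here as the partition id."""
    index = 0
    for base in prefix:
        index = index * len(ALPHABET) + ALPHABET.index(base)
    return index


def bucket_suffixes(reads, prefix_length, min_overlap):
    """Assign every suffix of at least `min_overlap` characters to the
    partition selected by its first `prefix_length` bases."""
    if prefix_length > min_overlap:
        raise ValueError("prefix_length must not exceed min_overlap")
    buckets = defaultdict(list)
    for read_id, read in enumerate(reads):
        for start in range(len(read) - min_overlap + 1):
            key = read[start:start + prefix_length]
            if all(base in ALPHABET for base in key):  # skip N and other symbols
                buckets[prefix_to_partition(key)].append((read_id, start))
    return dict(buckets)


print(partition_count(7))  # partitions implied by a prefix length of 7
print(partition_count(1))  # batches implied by a prefix length of 1
buckets = bucket_suffixes(["ACGTAC"], prefix_length=2, min_overlap=4)
print(buckets)
```

Suffixes that share a prefix land in the same bucket, so overlap candidates can be compared within a bucket without looking at the others. That independence is what would let the buckets be processed in parallel.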

Citing GraphSeq

GraphSeq is published on bioRxiv, where it is openly accessible.

@techreport{Su18,
    title={{GraphSeq}: Accelerating String Graph Construction for De Novo Assembly on Spark},
    author={Su, Chung-Tsai and Chang, Ming-Tai and Cheng, Yun-Chian and Li, Yun-Lung and Wang, Yao-Ting},
    year={2018},
    institution={Atgenomix}
}
