dcjones / gtf-parse-off

Experiments with parsing gene transfer format

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Motivation

These repository contains some rough experimentation in parsing simple file formats. The main use case I'm evaluating is parsing data in the gene transfer format (GTF). The format can be parsed by a regular expression, but it's weird enough that it takes a little work.

GTF is just a stand-in for the sort simple, but not quite trivial file formats of which there are an abundance of in scientific computing. If we can build fast parsers for all of these formats with minimal effort, it would help a lot with Julia's already increasing viability.

Benchmarks

I timed the parsing of the first 100000 lines (24MB) of Ensembl's version 71 human genome annotations. The full file is 2253155 lines (502MB), so scaling each of these numbers by about 20 gives the time needed to parse the whole thing.

These are timings are not terribly scientific. E.g. I'm not counting time spent on I/O in Julia, but am in the other methods. Also, I may or may not have been watching youtube videos while I waited for the julia PCRE benchmark to finish.

Language Method Elapsed Seconds
C Ragel table-based 0.42
C Ragel goto-based 0.05
Python Hand written 28.28
Python Regex 0.64
PyPy Regex 1.09
Ruby Ragel table-based 199.39
Julia Ragel table-based 3.52
Julia PCRE 1560.50

Notes

My julia backend for ragel at dcjones/ragel-julia.

The hand-written python parser is from bcbb.

About

Experiments with parsing gene transfer format


Languages

Language:Ragel in Ruby Host 68.7%Language:Python 17.8%Language:Julia 13.4%