ljdursi / rosalind_chapel

Rosalind exercises in Chapel


Rosalind in Chapel


Slowly implementing the Rosalind problems, a very nice set of simple, pedagogical intro-to-bioinformatics problems, in Python and Chapel to compare the two languages.

(If you're planning on working through the problems yourself, don't cheat yourself by looking at these and other answers!)


One of my complaints with many of the parallel scientific computing comparisons and examples in frequent use is that they are too trivial - a diffusion equation on a structured mesh, for instance, is so regular that it is quite easy to get decent performance using a small number of extremely well understood optimizations.
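
To make the “too trivial” point concrete: the entire inner kernel of an explicit 2D diffusion solve on a structured mesh is a single, perfectly regular stencil update. A minimal numpy sketch (illustrative only, not code from this repository) shows how little there is to it:

```python
import numpy as np

def diffuse(t, nsteps, coeff=0.25):
    """Explicit 2D diffusion: one perfectly regular stencil update per step."""
    for _ in range(nsteps):
        # the right-hand side is fully evaluated from the old values
        # before the in-place add, so this is a correct explicit step
        t[1:-1, 1:-1] += coeff * (t[:-2, 1:-1] + t[2:, 1:-1]
                                  + t[1:-1, :-2] + t[1:-1, 2:]
                                  - 4.0 * t[1:-1, 1:-1])
    return t
```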

To provide a relevant and more challenging set of problems, then, I took the full set of 105 Rosalind problems - an excellent set of pedagogical coding challenges teaching basic bioinformatics algorithms - and implemented them in both Python and Chapel, along with a representative subset in C++, to examine Chapel’s performance and productivity compared to other languages frequently used for these tasks.

As I’ve written before, one of my concerns with scientific programming languages and frameworks is that they can too often hinder the all-too-important early, still partly exploratory phase of scientific computing method development - either by requiring too much boilerplate or reimplementation of primitives before coding becomes easy, or by being so slow and memory-heavy that they limit which problems are reasonably accessible to early prototypes. With that in mind, for these tests I was interested in measuring both code size (as an admittedly crude proxy for developer effort) and performance on decent-sized example problems with substantially more complex access patterns and algorithms.
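
For the code-size proxy, something as simple as the following would do - a hypothetical sketch, not the measurement script actually used here, counting non-blank lines that aren’t full-line comments in the Python, Chapel, and C++ sources:

```python
import sys

def sloc(path):
    """Crude code-size proxy: non-blank lines, skipping full-line comments."""
    comment_starts = ("#", "//")  # Python, and Chapel/C++ line comments
    with open(path) as f:
        return sum(1 for line in f
                   if line.strip() and not line.strip().startswith(comment_starts))

if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(f"{path}: {sloc(path)} lines")
```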

A few notes on how I approached the coding - I wrote each solution as its own, self-contained module. I feel that this is important for what I’m interested in measuring here. I’m not trying to judge the languages’ suitability for implementing a nice DRY library of methods (although that would be another interesting set of experiments) but rather how naturally and efficiently they allow a (very) moderately capable scientific developer to implement a realistic range of algorithms on bioinformatic data types. Even though a decent set of bioinformatics libraries would have most of the methods in these problems already implemented, a developer trying to implement a variation on those methods, or a new method entirely, would as a first pass be cobbling together code with methods and approaches similar to these examples. This does mean that the solution base as a whole has a fair bit of repetition, but it also means that it’s easier to understand the entirety of the code for any particular problem.
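
For a flavor of the self-contained style, here is what a complete solution to the first Rosalind problem (counting nucleotides in a DNA string) might look like - a sketch of the approach, not necessarily the code as it appears in the repository:

```python
import sys
from collections import Counter

def count_bases(seq):
    """Count occurrences of each nucleotide in a DNA string."""
    return Counter(seq)

if __name__ == "__main__":
    counts = count_bases(sys.stdin.read().strip())
    print(counts["A"], counts["C"], counts["G"], counts["T"])
```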

The “as a first pass” rule applied to my approach to performance, as well. Given the input sizes required for the problems, I generally went with the first thing that seemed like it would work - e.g., avoiding quadratic algorithms for input sizes in the tens of thousands, but not going out of my way to ensure performant implementations. There were a couple of annoying exceptions to this rule, in those few cases where the 5-minute timeout of the Rosalind solution checker was actually a somewhat stringent constraint, but even that was maybe a useful exercise: a chance to examine the scope for improvement over naive implementations (and to see those cases where the things I had to do to make the Python code fast were noticeably different from the things I had to do to the Chapel code).
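
As an example of the kind of easy quadratic-avoidance I mean (a sketch, not necessarily the repository’s implementation): for the overlap-graph problem, indexing sequences by their length-k prefix in a dictionary replaces a naive all-pairs suffix/prefix comparison:

```python
from collections import defaultdict

def overlap_edges(seqs, k=3):
    """Edges (a, b) where the length-k suffix of a equals the length-k prefix of b.

    Building a prefix index makes this roughly linear in the number of
    sequences, rather than comparing every pair.
    """
    by_prefix = defaultdict(list)
    for name, seq in seqs.items():
        by_prefix[seq[:k]].append(name)
    return [(name, other)
            for name, seq in seqs.items()
            for other in by_prefix[seq[-k:]]
            if other != name]
```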

As I am much more familiar with Python, I generally approached these problems by first writing the code in Python, and then re-implementing the same method in Chapel. This has the advantage that the two implementations share a common approach - the development effort would be much harder to compare between the two otherwise - but it also means that my Chapel implementations tend to be rather Pythonic, rather than - Chapelesque? Ecclesiastical?

Finally, the Python implementations here use the standard library and numpy/scipy - what I (maybe idiosyncratically) consider to be “core scientific Python” - and nothing else. I think there’s an argument that I should have taken advantage of a broader Python toolset; obviously using Biopython would have been cheating, but there are other tools like PyPy and Numba that would be useful next steps for performance. I’ve run some experiments with each of these, which I’ll discuss below. As you might expect, they can dramatically improve performance in some cases; but those are precisely the cases I’m least interested in. The biggest changes are in the dynamic programming cases - nice, regular, rectangular array-access problems which fall under the “trivial to improve performance” heading and tell us absolutely nothing new. The key data structures for many of the more interesting bioinformatics problems are dictionaries and strings, both of which have already been optimized to the point of being almost unreasonably fast in pure CPython, and which neither Numba nor PyPy help enormously with.
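
To illustrate the “trivial to improve” dynamic-programming case: wrapping a classic O(nm) edit-distance recurrence in Numba’s @njit gives a large speedup with essentially no code change. This is a sketch under the assumption that sequences are pre-encoded as integer arrays, since nopython-mode string handling is limited:

```python
import numpy as np
from numba import njit

@njit(cache=True)
def edit_distance(a, b):
    """Classic O(nm) edit-distance DP: exactly the regular, rectangular
    array-access pattern that Numba (or PyPy) accelerates dramatically."""
    n, m = len(a), len(b)
    d = np.empty((n + 1, m + 1), dtype=np.int64)
    for i in range(n + 1):
        d[i, 0] = i
    for j in range(m + 1):
        d[0, j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,
                          d[i, j - 1] + 1,
                          d[i - 1, j - 1] + cost)
    return d[n, m]

def encode(s):
    """Pre-encode a string as an integer array for the nopython path."""
    return np.frombuffer(s.encode("ascii"), dtype=np.uint8).astype(np.int64)
```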

License

GNU General Public License v3.0


Languages

Chapel: 52.1%
Python: 46.4%
C++: 0.9%
Shell: 0.6%