kcleal / sortedintersect

Fast intersection of sorted reference intervals

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

sorted-intersect

Installation

Download from pypi:

pip install sortedintersect

To build from source:

git clone https://github.com/kcleal/sortedintersect
cd sortedintersect
pip install .

Overview

Check if a point or interval overlaps a reference interval set:

from sortedintersect import IntervalSet

# intervals without data
itv = IntervalSet()
itv.add(0, 10)
itv.search_point(1)
# >>> [(0, 10)]

# Return a boolean only (faster)
itv = IntervalSet(bool_only=True)
itv.add(0, 10)
itv.search_point(1)
# >>> True

# search intervals with additional data
itv = IntervalSet(with_data=True)
itv.add(0, 10, 'interval1')
itv.add(20, 30, {'a': 1})
itv.search_interval(1, 2)
# >>> [(0, 10, 'interval1')]
itv.search_point(20)
# >>> [(20, 30, {'a': 1})]


itv = IntervalSet()
itv.add(-1, 1)
itv.add(2, 4)
itv.add(9, 10)
assert tuple(itv.search_point(10)) == ((9, 10),)
assert tuple(itv.search_point(0)) == ((-1, 1),)

# set custom distance threshold
itv = IntervalSet(distance_threshold=500_000)

An interval tree is often used for this purpose relying on binary search. However, if the reference intervals are sorted ahead of time, and query intervals are roughly sorted, then a simple linear search can be used for a small speedup. Sorted-intersect uses a simple heuristic (a distance measure) to decide when to use binary-search and when to use a plane-sweep. When a query-point is tested, its position is recorded. When a subsequent query point is tested, if the distance to the previous point is below a threshold, then a plane sweep is used on the reference intervals instead of a binary search.

Benchmarks

sortedintersect was compared to popular python implementations based on interval trees, see tests/benchmark.py for details.

random1 - randomly generated intervals using a python script random2 - 1million random intervals of 1kb in size over chromosome 1, generated by bedtools reads+genes - reads from a 2mb region converted to bed format, then intersected with ucsc genes from hg19

alt text

library time (s) intersections test shuffle
sortedintersect 0.0116878 2947 random1 True
quicksect 0.014946 2947 random1 True
cgranges 0.0177999 2878 random1 True
ncls 0.027061 2947 random1 True
sortedintersect 0.740114 8001594 random2 True
quicksect 1.28441 8001594 random2 True
cgranges 1.00505 7993517 random2 True
ncls 1.352 8001594 random2 True
sortedintersect 0.111097 320618 reads+genes True
quicksect 0.125926 320618 reads+genes True
cgranges 0.15797 320610 reads+genes True
ncls 0.235777 320618 reads+genes True
sortedintersect 0.011116 2916 random1 False
quicksect 0.0142658 2916 random1 False
cgranges 0.0173151 2847 random1 False
ncls 0.0269301 2916 random1 False
sortedintersect 0.589765 8001594 random2 False
quicksect 0.644668 8001594 random2 False
cgranges 0.860898 7993517 random2 False
ncls 1.15578 8001594 random2 False
sortedintersect 0.0999217 320618 reads+genes False
quicksect 0.123064 320618 reads+genes False
cgranges 0.14915 320610 reads+genes False
ncls 0.230041 320618 reads+genes False

shuffle here means that the query intervals have been shuffled.

Limitations

  • Reference queries must be added in sorted order
  • Reference intervals can not be deleted

About

Fast intersection of sorted reference intervals

License:MIT License


Languages

Language:C++ 62.4%Language:Python 20.3%Language:Cython 16.5%Language:Makefile 0.8%