IlyaGrebnov / libcubwt

libcubwt is a library for GPU accelerated suffix array and burrows wheeler transform construction.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

libcubwt

The libcubwt is a library for fast (see Benchmarks below) GPU-based Burrows-Wheeler transform construction and inversion on the CUDA platform using prefix doubling + Skew/DC3 approach as described in the following papers:

  • Vitaly Osipov, Parallel Suffix Array Construction for Shared Memory Architectures, 2012
  • Leyuan Wang, Sean Baxter, and John D. Owens Fast Parallel Suffix Array on the GPU, 2015
  • Florian Büren, Daniel Jünger, Robin Kobus, Christian Hundt, Bertil Schmidt Suffix Array Construction on Multi-GPU Systems, 2019

Copyright (c) 2022-2024 Ilya Grebnov ilya.grebnov@gmail.com

Introduction

The libcubwt provides simple API to construct Burrows-Wheeler transform from a given string over constant-size alphabet using 20.5n bytes of GPU memory.

  • The libcubwt works with any CUDA compatible GPU, but I recommend SM 8.9+ Ada Lovelace (CUDA 11.8 and later) due to very large L2 cache.
  • The libcubwt is sensitive to fast GPU memory and might not be suitable for some workloads. Please benchmark yourself.

Suffix array computation

  • Starting with version 1.5.0, libcubwt algorithm was changed and no longer supports the computation of suffix arrays or inverse suffix arrays. To compute these arrays, please consider using a version prior to 1.5.0.

License

The libcubwt is released under the Apache License Version 2.0 and is considered suitable for production use. However, no warranty or fitness for a particular purpose is expressed or implied.

Changes

  • January 24, 2024 (1.6.0)
    • Inverse Burrows-Wheeler transform.
  • March 24, 2023 (1.5.0)
    • Reduced memory usage and improved performance.
  • February 10, 2023 (1.0.0)
    • Initial public release of the libcubwt.

Examples of APIs (see libcubwt.cuh for complete APIs list)

    /**
    * Allocates storage on the CUDA device that allows reusing allocated memory with each libcubwt operation.
    * @param max_length The maximum length of string to support.
    * @return LIBCUBWT_NO_ERROR if no error occurred, libcubwt error code otherwise.
    */
    int64_t libcubwt_allocate_device_storage(void ** device_storage, int64_t max_length);

    /**
    * Destroys the previously allocated storage on the CUDA device.
    * @param device_storage The previously allocated storage on the CUDA device.
    * @return LIBCUBWT_NO_ERROR if no error occurred, libcubwt error code otherwise.
    */
    int64_t libcubwt_free_device_storage(void * device_storage);

    /**
    * Constructs the Burrows-Wheeler Transform (BWT) of a given string.
    * @param device_storage The previously allocated storage on the CUDA device.
    * @param T [0..n-1] The input string.
    * @param L [0..n-1] The output string (can be T).
    * @param n The length of the input string.
    * @return The primary index if no error occurred, libcubwt error code otherwise.
    */
    int64_t libcubwt_bwt(void * device_storage, const uint8_t * T, uint8_t * L, int64_t n);

    /**
    * Reconstructs the original string from a given burrows-wheeler transformed string (BWT) with primary index.
    * @param device_storage The previously allocated storage on the CUDA device.
    * @param T [0..n-1] The input string.
    * @param U [0..n-1] The output string (can be T).
    * @param n The length of the given string.
    * @param freq [0..255] The input symbol frequency table (can be NULL).
    * @param i The primary index.
    * @return LIBCUBWT_NO_ERROR if no error occurred, libcubwt error code otherwise.
    */
    int64_t libcubwt_unbwt(void * device_storage, const uint8_t * T, uint8_t * U, int64_t n, const int32_t * freq, int32_t i);

Benchmarks

Methodology

  • Input files were capped at 352MB due to GPU memory limit.
  • All timings exclude initialization and memory allocations. However, the time for data transfer in and out of GPU is included.
  • The timings are minimum of five runs measuring multi-threading performance of Burrows-Wheeler transform construction.

Specification (x86-64 architecture)

  • OS: Microsoft Windows 10 Pro 64-Bit
  • MB: MSI MPG Z390M GAMING EDGE AC, PCIe 3.0 x16
  • CPU: Intel Core i7-9700K Processor (12M Cache, 5GHz all cores)
  • RAM: 2x8 GB dual-channel DDR4 (4133 MHz, 17-17-17-37)
  • GPU: ASUS TUF Gaming GeForce RTX 4070 Ti 12GB GDDR6X OC Edition
  • Compiler: Microsoft Visual C++ compiler v19.34.31937 for x64
  • Optimizations (CPU): /MD /DNDEBUG /O2 /GL /arch:AVX2 /openmp
  • Optimizations (GPU): -arch=native --use_fast_math

Silesia Corpus

file size libcubwt 1.5.0 (GPU) libsais 2.7.1 (CPU) divsufsort 2.0.2 (CPU)
dickens 10192446 0.012 sec ( 844.66 MB/s) 0.115 sec ( 88.27 MB/s) 0.317 sec ( 32.12 MB/s)
mozilla 51220480 0.107 sec ( 479.70 MB/s) 0.722 sec ( 70.91 MB/s) 1.296 sec ( 39.51 MB/s)
mr 9970564 0.044 sec ( 228.12 MB/s) 0.124 sec ( 80.11 MB/s) 0.283 sec ( 35.21 MB/s)
nci 33553445 0.097 sec ( 347.06 MB/s) 0.378 sec ( 88.67 MB/s) 1.141 sec ( 29.40 MB/s)
ooffice 6152192 0.007 sec ( 924.93 MB/s) 0.085 sec ( 72.66 MB/s) 0.117 sec ( 52.59 MB/s)
osdb 10085684 0.012 sec ( 841.83 MB/s) 0.152 sec ( 66.35 MB/s) 0.299 sec ( 33.72 MB/s)
reymont 6627202 0.008 sec ( 834.20 MB/s) 0.077 sec ( 86.48 MB/s) 0.199 sec ( 33.23 MB/s)
samba 21606400 0.044 sec ( 488.45 MB/s) 0.277 sec ( 77.87 MB/s) 0.502 sec ( 43.04 MB/s)
sao 7251944 0.006 sec (1275.49 MB/s) 0.145 sec ( 50.11 MB/s) 0.158 sec ( 46.04 MB/s)
webster 41458703 0.065 sec ( 640.98 MB/s) 0.603 sec ( 68.71 MB/s) 1.768 sec ( 23.46 MB/s)
x-ray 8474240 0.007 sec (1297.70 MB/s) 0.149 sec ( 56.84 MB/s) 0.198 sec ( 42.90 MB/s)
xml 5345280 0.009 sec ( 594.05 MB/s) 0.054 sec ( 98.87 MB/s) 0.117 sec ( 45.77 MB/s)

Large Canterbury Corpus

file size libcubwt 1.5.0 (GPU) libsais 2.7.1 (CPU) divsufsort 2.0.2 (CPU)
bible.txt 4047392 0.004 sec (1063.62 MB/s) 0.042 sec ( 97.15 MB/s) 0.106 sec ( 38.16 MB/s)
E.coli 4638690 0.006 sec ( 771.33 MB/s) 0.040 sec ( 116.30 MB/s) 0.166 sec ( 27.98 MB/s)
world192.txt 2473400 0.003 sec ( 885.03 MB/s) 0.027 sec ( 90.86 MB/s) 0.053 sec ( 46.41 MB/s)

Maximum Compression Corpus

file size libcubwt 1.5.0 (GPU) libsais 2.7.1 (CPU) divsufsort 2.0.2 (CPU)
A10.jpg 842468 0.001 sec (1338.95 MB/s) 0.020 sec ( 42.90 MB/s) 0.018 sec ( 45.71 MB/s)
AcroRd32.exe 3870784 0.007 sec ( 577.69 MB/s) 0.059 sec ( 65.08 MB/s) 0.070 sec ( 55.50 MB/s)
english.dic 465211 0.001 sec ( 567.47 MB/s) 0.007 sec ( 70.86 MB/s) 0.008 sec ( 54.87 MB/s)
FlashMX.pdf 4526946 0.004 sec (1104.21 MB/s) 0.099 sec ( 45.58 MB/s) 0.086 sec ( 52.37 MB/s)
FP.LOG 20617071 0.047 sec ( 437.20 MB/s) 0.261 sec ( 79.07 MB/s) 0.715 sec ( 28.82 MB/s)
MSO97.DLL 3782416 0.003 sec (1178.36 MB/s) 0.061 sec ( 62.34 MB/s) 0.072 sec ( 52.74 MB/s)
ohs.doc 4168192 0.006 sec ( 674.87 MB/s) 0.051 sec ( 81.72 MB/s) 0.094 sec ( 44.17 MB/s)
rafale.bmp 4149414 0.004 sec (1060.26 MB/s) 0.043 sec ( 96.65 MB/s) 0.083 sec ( 49.81 MB/s)
vcfiu.hlp 4121418 0.005 sec ( 797.61 MB/s) 0.046 sec ( 89.91 MB/s) 0.078 sec ( 52.78 MB/s)
world95.txt 2988578 0.004 sec ( 836.76 MB/s) 0.033 sec ( 91.77 MB/s) 0.067 sec ( 44.32 MB/s)

Large Text Compression Benchmark Corpus

file size libcubwt 1.5.0 (GPU) libsais 2.7.1 (CPU) divsufsort 2.0.2 (CPU)
enwik8 100000000 0.147 sec ( 681.07 MB/s) 1.633 sec ( 61.22 MB/s) 4.616 sec ( 21.67 MB/s)
enwik9 369098752 0.579 sec ( 637.83 MB/s) 6.363 sec ( 58.01 MB/s) 20.633 sec ( 17.89 MB/s)

The Gauntlet Corpus

file size libcubwt 1.5.0 (GPU) libsais 2.7.1 (CPU) divsufsort 2.0.2 (CPU)
abac 200000 0.004 sec ( 47.69 MB/s) 0.002 sec ( 89.59 MB/s) 0.002 sec ( 105.77 MB/s)
abba 10500596 0.038 sec ( 278.77 MB/s) 0.092 sec ( 114.41 MB/s) 0.494 sec ( 21.26 MB/s)
book1x20 15375420 0.076 sec ( 201.57 MB/s) 0.192 sec ( 80.14 MB/s) 0.535 sec ( 28.75 MB/s)
fib_s14930352 14930352 0.111 sec ( 134.06 MB/s) 0.181 sec ( 82.57 MB/s) 1.164 sec ( 12.83 MB/s)
fss10 12078908 0.073 sec ( 166.32 MB/s) 0.135 sec ( 89.26 MB/s) 0.885 sec ( 13.65 MB/s)
fss9 2851443 0.014 sec ( 205.39 MB/s) 0.027 sec ( 107.11 MB/s) 0.132 sec ( 21.53 MB/s)
houston 3839141 0.023 sec ( 164.68 MB/s) 0.020 sec ( 188.82 MB/s) 0.028 sec ( 138.15 MB/s)
paper5x80 956322 0.002 sec ( 403.97 MB/s) 0.010 sec ( 94.51 MB/s) 0.021 sec ( 46.35 MB/s)
test1 2097152 0.006 sec ( 358.13 MB/s) 0.023 sec ( 92.50 MB/s) 0.082 sec ( 25.46 MB/s)
test2 2097152 0.006 sec ( 356.80 MB/s) 0.022 sec ( 96.95 MB/s) 0.055 sec ( 37.91 MB/s)
test3 2097088 0.004 sec ( 474.53 MB/s) 0.035 sec ( 60.43 MB/s) 0.049 sec ( 42.93 MB/s)

Manzini Corpus

file size libcubwt 1.5.0 (GPU) libsais 2.7.1 (CPU) divsufsort 2.0.2 (CPU)
chr22.dna 34553758 0.093 sec ( 369.82 MB/s) 0.436 sec ( 79.18 MB/s) 1.837 sec ( 18.81 MB/s)
etext99 105277340 0.199 sec ( 528.25 MB/s) 1.780 sec ( 59.14 MB/s) 5.590 sec ( 18.83 MB/s)
gcc-3.0.tar 86630400 0.221 sec ( 392.32 MB/s) 1.215 sec ( 71.32 MB/s) 2.682 sec ( 32.30 MB/s)
howto 39422105 0.065 sec ( 603.01 MB/s) 0.565 sec ( 69.75 MB/s) 1.366 sec ( 28.86 MB/s)
jdk13c 69728899 0.174 sec ( 400.57 MB/s) 1.031 sec ( 67.66 MB/s) 2.734 sec ( 25.50 MB/s)
linux-2.4.5.tar 116254720 0.252 sec ( 461.40 MB/s) 1.698 sec ( 68.45 MB/s) 3.808 sec ( 30.53 MB/s)
rctail96 114711151 0.254 sec ( 451.11 MB/s) 1.864 sec ( 61.54 MB/s) 5.452 sec ( 21.04 MB/s)
rfc 116421901 0.228 sec ( 510.92 MB/s) 1.725 sec ( 67.48 MB/s) 4.510 sec ( 25.81 MB/s)
sprot34.dat 109617186 0.211 sec ( 519.63 MB/s) 1.756 sec ( 62.42 MB/s) 4.810 sec ( 22.79 MB/s)
w3c2 104201579 0.286 sec ( 364.16 MB/s) 1.603 sec ( 65.01 MB/s) 3.885 sec ( 26.82 MB/s)

Pizza & Chilli Corpus

file size libcubwt 1.5.0 (GPU) libsais 2.7.1 (CPU) divsufsort 2.0.2 (CPU)
dblp.xml 296135874 0.558 sec ( 530.99 MB/s) 4.860 sec ( 60.93 MB/s) 13.964 sec ( 21.21 MB/s)
dna 369098752 0.591 sec ( 624.79 MB/s) 5.837 sec ( 63.24 MB/s) 30.630 sec ( 12.05 MB/s)
english.1024MB 369098752 0.758 sec ( 486.98 MB/s) 6.794 sec ( 54.33 MB/s) 24.330 sec ( 15.17 MB/s)
pitches 55832855 0.097 sec ( 578.30 MB/s) 0.807 sec ( 69.19 MB/s) 1.521 sec ( 36.70 MB/s)
proteins 369098752 0.740 sec ( 499.06 MB/s) 7.085 sec ( 52.10 MB/s) 20.326 sec ( 18.16 MB/s)
sources 210866607 0.391 sec ( 539.50 MB/s) 3.254 sec ( 64.80 MB/s) 7.826 sec ( 26.94 MB/s)

Pizza & Chilli Repetitive Corpus

file size libcubwt 1.5.0 (GPU) libsais 2.7.1 (CPU) divsufsort 2.0.2 (CPU)
cere 369098752 1.788 sec ( 206.43 MB/s) 5.648 sec ( 65.35 MB/s) 24.792 sec ( 14.89 MB/s)
coreutils 205281778 0.892 sec ( 230.01 MB/s) 3.097 sec ( 66.29 MB/s) 8.828 sec ( 23.25 MB/s)
einstein.de.txt 92758441 0.547 sec ( 169.56 MB/s) 1.363 sec ( 68.03 MB/s) 4.038 sec ( 22.97 MB/s)
einstein.en.txt 369098752 2.195 sec ( 168.17 MB/s) 5.845 sec ( 63.15 MB/s) 21.941 sec ( 16.82 MB/s)
Escherichia_Coli 112689515 0.302 sec ( 372.84 MB/s) 1.695 sec ( 66.48 MB/s) 6.832 sec ( 16.49 MB/s)
influenza 154808555 0.527 sec ( 293.57 MB/s) 2.234 sec ( 69.31 MB/s) 8.538 sec ( 18.13 MB/s)
kernel 257961616 1.063 sec ( 242.78 MB/s) 3.856 sec ( 66.89 MB/s) 10.715 sec ( 24.08 MB/s)
para 369098752 1.335 sec ( 276.43 MB/s) 5.725 sec ( 64.47 MB/s) 25.627 sec ( 14.40 MB/s)
world_leaders 46968181 0.312 sec ( 150.45 MB/s) 0.457 sec ( 102.68 MB/s) 1.119 sec ( 41.98 MB/s)
dblp.xml.00001.1 104857600 0.424 sec ( 247.09 MB/s) 1.739 sec ( 60.31 MB/s) 5.610 sec ( 18.69 MB/s)
dblp.xml.00001.2 104857600 0.432 sec ( 242.87 MB/s) 1.729 sec ( 60.63 MB/s) 5.669 sec ( 18.50 MB/s)
dblp.xml.0001.1 104857600 0.371 sec ( 282.50 MB/s) 1.743 sec ( 60.17 MB/s) 5.600 sec ( 18.73 MB/s)
dblp.xml.0001.2 104857600 0.372 sec ( 282.22 MB/s) 1.729 sec ( 60.64 MB/s) 5.585 sec ( 18.77 MB/s)
dna.001.1 104857600 0.275 sec ( 381.66 MB/s) 1.571 sec ( 66.73 MB/s) 6.097 sec ( 17.20 MB/s)
english.001.2 104857600 0.284 sec ( 369.03 MB/s) 1.795 sec ( 58.43 MB/s) 4.912 sec ( 21.35 MB/s)
proteins.001.1 104857600 0.490 sec ( 214.02 MB/s) 1.898 sec ( 55.26 MB/s) 4.256 sec ( 24.64 MB/s)
sources.001.2 104857600 0.300 sec ( 349.51 MB/s) 1.717 sec ( 61.07 MB/s) 4.295 sec ( 24.42 MB/s)
fib41 267914296 2.480 sec ( 108.03 MB/s) 3.898 sec ( 68.73 MB/s) 40.566 sec ( 6.60 MB/s)
rs.13 216747218 2.100 sec ( 103.23 MB/s) 3.062 sec ( 70.79 MB/s) 34.326 sec ( 6.31 MB/s)
tm29 268435456 2.343 sec ( 114.58 MB/s) 4.134 sec ( 64.93 MB/s) 37.548 sec ( 7.15 MB/s)

About

libcubwt is a library for GPU accelerated suffix array and burrows wheeler transform construction.

License:Apache License 2.0


Languages

Language:Cuda 100.0%