A class project for CS508 from UIUC to implement the classic BLAST algorithm using different GPU computing techniques.
Members: Chengze Shen, Kaining Zhou, Zhuxuan Liu, Zutai Chen
- The download link to the most recent BLAST (v2.13.0) packages: https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
- For each input unmapped sequence, represent it with consecutive words of length
W
. - The matched positions are referred to as hits.
- Hits are extended to ungapped alignments, allowing for mismatches.
- Ungapped extensions that pass a certain threshold score are passed to gapped alignments.
- Gapped alignments that pass a certain threshold score are saved as "preliminary results".
- Preliminary matches are considered for ambiguities (e.g., ambiguous nucleotides) and indel locations.
- Results are then reported back to users.
- G-BLASTN by Kaiyong Zhao and Xiaowen Chu. link to software.
- at the time of publication, G-BLASTN compared to BLAST v2.2.28. In our analysis we will run the G-BLASTN and our implementation and compare to the latest BLAST version (v2.13.0).
- Scanning stage of G-BLASTN --> broken into two sub-steps, (1) "scan" and (2) "lookup" .
- "scan" goes over each database sequence in parallel and records offsets in the sequence that have >= 1 matches to a query.
- "lookup" is another kernel that re-checks all matched offsets and obtain the corresponding matched offset pairs (of database and query sequences).
- MERITS:
- Having 2 sub-steps (kernels) greatly reduces control divergence.
- Using shared memory to record offsets in the "scan" sub-step. This is similar to what we did in the BFS lab (block-queuing kernel).
- FURTHER POSSIBLE IMPROVEMENTS:
- TBD
We will use the latest version of the Homo Sapien whole genome assembly available on NCBI.
- The one used in G-BLASTN is an older version (NCBI36, #accession GCF_000001405.12).
- The new one we will use is an aggregated database including five genomes (human, mouse, cat, dog, and gorilla):
- Human: #accession GCF_000001405.40.
- Mouse: #accession GCF_000001635.27.
- Cat: #accession GCF_018350175.1.
- Dog: #accession GCF_000002285.5.
- Gorilla: #accession GCF_029281585.1.
- The combined database can be found at here.
- To create the database (blast database version 4) locally using
makeblastdb
, run the following command:
makeblastdb -in [path/to/fasta] -parse_seqids -blastdb_version 4 -title "homo_sapien_grch38" -dbtype nucl
- To use the created database when running
blastn
orgblastn
, input the db as follows (assuming the database has been created under the same directory as the fasta file):
blastn -db <path/to/fasta> ... [other parameters]
OR
gblastn -db <path/to/fasta> ... [other parameters]
We will use the same set of query sequences used in G-BLASTN to search against the target database mentioned above.
- The queries are the first 500 bacterial sequences of the study SRX338063 from the NCBI server at http://www.ncbi.nlm.nih.gov/sra/SRX338063.
- The queries are made available in
data/queries.fasta
.
NCBI BLAST+ is a mess for newer compiler and linux system to compile. Although its binaries generally work, it is quite hard to compile the codes locally.
To compile ncbi-blast-2.2.28+-src
successfully, some files need to change (see files in to_change.tar.gz).
After decompressing files from to_change.tar.gz, there will be two folders:
*. to_change/include/
*. to_change/your-configured-folder/
Assume you already downloaded and decompressed the source codes of BLAST+ and you are in the directory ncbi-blast-2.2.28+-src/c++
, and you have run ./configure
already, do the following steps:
- Copy and overwrite files (remember to make backups) in
to_change/include
to the corresponding directories inncbi-lbast-2.2.28+-src/c++/include
. - Change line 4749 of "c++/src/build-system/configure" script, add "7" or "7.*" to support GC 7.0.0 or 7+.
- After running
./configure
, you should have generated a folder underncbi-blast-2.2.28+-src/c++
, for example,ncbi-blast-2.2.28+-src/c++/GCC750-Debug64
depending on your system and compiler version.- Copy
to_change/your-configured-folder/build/Makefile.lib
and overwite (make a backup) the correspondingncbi-blast-2.2.28+-src/c++/your-configured-folder/build/Makefile.lib
.
- Copy
- Now, try running
make
ormake -j
(using all available cores for compiling) and hope everything works. - If step 3 works without errors, you should have successfully compiled binaries in directory
ncbi-blast-2.2.28+-src/c++/your-configured-folder/bin
. Then, you can install g-blastn by using this path.
- To simplify the process, all files used to compile G-BLASTN on Chengze's environment is uploaded to https://drive.google.com/file/d/1ecGFSnw80KM6_ofPhCZx-XhdFCR00oMu/view?usp=share_link. The compiled file has BLASTN compiled already, so a new user should probably remove the entire
ncbi-blast-2.2.28+-src/
folder and follow the steps above to rebuild BLASTN. - After getting the correctly compiled BLASTN, the user can try to compile G-BLASTN using either
./install
or./install.noask
, for which the user needs to change theLINE 50: installed_dir=
to the BLASTNbin/
folder. 2.2 Additionally, the user needs to change one additional file atc++/src/algo/blast/gpu_blast/makefile
. G-BLASTN requires the header files from~/NVIDIA_CUDA-{version}_Samples/common/inc
to work, so please changeLINE 18: COMMON_DIR :=
to the correct path. - If everything is done correctly, you should be able to compile G-BLASTN, but might face issues running it on test dataset (I am struggling too).
CURRENT ISSUES Although BLAST+ can compile now (with the instructions above), compilation for GPU codes still have issues that have not been resolved, which Chengze Shen is currently working on.
To run blastn
with a given database and a given set of query sequences (without WindowMasker and DUST):
blastn -db <database> -query <query> -task blastn -outfmt 7 -out <file> \
-num_threads <1|4|8>
To run gblastn
similarly:
gblastn -db <database> -query <query> -task blastn -outfmt 7 -out <file> \
-use_gpu true -mode <1|2> -num_threads <1|4|8>
- Chengze Shen - improvement of G-BLASTN, mainly on warp-queuing for the seeding step.
- Kaining Zhou -
- Zhuxuan Liu -
- Zutai Chen -