athenarc / hitmap-aligner


Introduction

To solve a sequence alignment problem with Hitmap, a user needs to follow a fixed series of steps to build the code correctly. At a high level, you generate the dataset, generate the queries to be aligned, build the index (hash table) and finally execute. The build itself is driven by a Makefile, which shows the dependencies between the .cpp files and defines the build targets. There are many targets, since an object file is built from every .cpp file, but the ones that matter for now are:

  1. make all
  2. make data_builder
  3. make query_builder
  4. make build_idx

Dependencies

  1. Before proceeding, download the latest Boost library (https://www.boost.org/users/history/version_1_83_0.html) and place it in a folder named External inside the Libraries folder.
  2. The data_builder, query_builder and build_idx targets require a folder named bin to exist in the "Aligner" directory, since the ".out" executables generated by those targets are saved in "Aligner/bin".

Data_builder

To solve a DNA sequence alignment problem, you first need a reference sequence against which the query sequences will be compared and eventually aligned.

Hitmap offers 2 ways to meet this requirement:

  1. Importing a premade dataset
  2. Creating a dataset from scratch

Both options are implemented in the same source file (dataset_creator.cpp), which uses command line arguments to let the user choose between the 2 approaches. For either creation or import, first use the data_builder target (make data_builder) to build the executable from dataset_creator.cpp. Then run the resulting '.out' file (saved in the '/bin' folder) with the '-mode' parameter on the command line. This parameter selects between the 2 modes: 'imp' imports a premade dataset and 'new' creates one.

-mode new (create a dataset)

If one would opt for dataset creation, the following arguments are essential for a successful build:

  1. '-loc': The location where the data sequence files are going to be created (required for both 'new' and 'imp' mode). You only need to specify the name of the folder to be created inside the parent folder "/bin". The DNA sequences are saved there in files with a ".d" extension. The number of ".d" files created is given by the "-dnum" value, and each file contains a data sequence of length "-dlen" over the alphabet "-ab".

    If the folder already exists, an error is raised.

    eg. "-loc new"

  2. '-dlen': The length of the data sequences to be created (required). The argument "-ab" indicates the letters used to generate the sequences, and dlen specifies the length of each sequence, that is, the number of alphabet characters it contains.

  3. '-ab': The name of the alphabet to be used (required). For a sequence alignment problem to be solved, data and query sequences must be built. These sequences may differ depending on the field of study. For DNA sequence alignment, each sequence is a valid DNA sequence made up of the bases adenine (A), cytosine (C), guanine (G) and thymine (T). Consequently, both data and query sequences should consist of the letters A, C, G and T, and this set of letters is "the alphabet".

    Before executing "data_aligner.out", the alphabet's path must be set up. To do so, create a folder "Data" with a subfolder "Aligner", forming the path "Data/Aligner". In "Data/Aligner", a file named "alphabets.conf" must describe the alphabet of the alignment problem in a specific form: the alphabet's name (e.g. "ab_name") on one line and the vocabulary (e.g. "ACGT") on the next line.

    Example of an alphabets.conf file:

    ab_name
    ACGTacgt
    

    The value of the command line argument '-ab' is the name of the alphabet in the alphabets.conf file (e.g. 'ab_name').

  4. '-dnum': The number of data sequences to be created (required). This determines the number of sequences to be generated and consequently the number of ".d" files that will be created.

-mode imp (import a dataset)

If one would opt for dataset import, the following arguments are essential for a successful build:

  1. '-df': The location of the dataset to be imported. The first experiments on Hitmap were conducted using an NCBI dataset, so it became important to be able to replicate those experiments and assess Hitmap's behaviour on newer versions of NCBI datasets. This requires a way to import an existing dataset into Hitmap and run experiments on it.

    The 'imp' mode makes this possible. The dataset must already be downloaded to the user's machine, and the path to the data file must be provided after the '-df' argument.

  2. '-chr': The chromosome to be retrieved from the dataset. NCBI datasets contain all the chromosomes of a genome (e.g. mus musculus, homo sapiens etc.), separated by lines that start with the character ">" and are followed by information about the chromosome. Given a chromosome number, data_builder scans the dataset file for text lines containing the string "chromosome <chr_num>, Primary Assembly". If such a chromosome is found, its content is copied to a file named "chromosome.txt" in the folder given by the '-loc' argument.

The datasets that Hitmap was built upon are ".fna" NCBI genome dataset files, which can be downloaded from https://www.ncbi.nlm.nih.gov/datasets/.

Query_builder

The query_builder.out executable generates the query sequences that will be compared against the reference sequences in the ".d" files. As with data_builder, query_builder takes some command line arguments, described below.

  1. '-conf': The location of the configuration file (required). As with data_builder, query_builder requires a configuration (".conf") file. This time the location and name of the file do not matter, since you only need to provide the path of an existing file on the command line. What matters is the content of the configuration file.

    More specifically, the file should have 6 parameters: dfile, qnum, qlen, qthr, meas and inum.

    1. dfile: The path, including the file name, of the ".d" file whose sequence will be used as the reference DNA sequence.

    2. qnum/qlen: Number of query sequences to be generated/length of each query sequence.

    3. qthr: The query threshold.

      There are 2 ways to generate the query sequences: one is to extract a subsequence from the reference sequence and use it as is to find alignments, and the other is to extract a subsequence and add some errors to it. The number of errors added to the subsequence equals the value of "qthr".

    4. meas: Distance function.

      The "meas" parameter is not used while building the query sequences. It is used in the last stage of the build, in "main.cpp", because both programs read "config_query.conf". As a result, "query_builder.out" simply acknowledges its existence without actually using it.

    5. inum: Number of iterations.

      The same applies to this parameter.

    The format of the file should be: <param_name> <param_value> and there should be one pair in each line.

  2. '-mode': Two modes: 'datasub' (the sequence is taken from the data sequence as is) and 'rand' (random errors are added to the sequence).

    These are the 2 methods of subsequence extraction. "datasub" extracts a plain subsequence from the reference sequence, while "rand" also adds errors to it.
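The difference between the two modes, and the role of qlen and qthr, can be sketched in C++ as follows. This is a sketch under stated assumptions rather than the query_builder implementation: the names Query and make_query are hypothetical, the start offset is assumed to be chosen uniformly at random, and errors are modelled as substitutions at distinct positions (the source does not say which kinds of errors are inserted).

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <stdexcept>
#include <string>
#include <vector>

struct Query {
    std::size_t offset;  // where the subsequence starts in the reference
    std::string seq;     // the query itself
};

// Cut a subsequence of length qlen out of ref. In "rand" mode, also replace
// exactly qthr characters with a different letter of vocab.
// (Hypothetical helper; substitution-only errors are an assumption.)
Query make_query(const std::string& ref, std::size_t qlen, std::size_t qthr,
                 const std::string& mode, const std::string& vocab, std::mt19937& rng) {
    if (qlen == 0 || qlen > ref.size()) throw std::invalid_argument("bad qlen");
    std::uniform_int_distribution<std::size_t> start(0, ref.size() - qlen);
    Query q;
    q.offset = start(rng);
    q.seq = ref.substr(q.offset, qlen);
    if (mode == "datasub") return q;
    if (mode != "rand") throw std::invalid_argument("unknown mode");
    // A substitution needs at least two distinct letters to choose from.
    if (qthr > qlen || vocab.empty() ||
        vocab.find_first_not_of(vocab[0]) == std::string::npos)
        throw std::invalid_argument("bad qthr or alphabet");

    // Choose qthr distinct positions by shuffling the list of indices.
    std::vector<std::size_t> pos(qlen);
    std::iota(pos.begin(), pos.end(), std::size_t{0});
    std::shuffle(pos.begin(), pos.end(), rng);
    std::uniform_int_distribution<std::size_t> pick(0, vocab.size() - 1);
    for (std::size_t i = 0; i < qthr; ++i) {
        char& c = q.seq[pos[i]];
        char replacement = c;
        while (replacement == c) replacement = vocab[pick(rng)];  // force a real change
        c = replacement;
    }
    return q;
}
```

With this model, a "rand" query differs from its source subsequence in exactly qthr positions, so its Hamming distance to the reference at the recorded offset is qthr.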

Build_idx

Build_idx constructs the indexer and the aligner of Hitmap, which take care of filtering, bitset operations, window operations, calculating Hamming/Edit distances and creating/storing/loading the index. It also builds a lighter aligner with fewer attributes. As with the previous executables, some arguments must be provided.

  1. '-d': The location of the data sequence file (required). Specify the exact path of a ".d" file containing a reference/data sequence.

  2. '-c': The location of the index config file (required).

Just like before, a configuration file must be defined. The format of the file should be: <param_name> <param_value>, with one pair on each line. It sets the following parameters:

  1. 'i' Index type to be constructed. Valid values are "hitmap" and "hitmap2". The latter is a lighter (memory-wise) version of a "hitmap" index.

  2. 'q' Query length

  3. 'r' or 'e' Error ratio or error threshold for the alignments. These are two alternative ways to declare how many errors a match is allowed to have.

  4. 'f' Fragments number. Given a query Q and an integer f, where 0 < f ≤ q, Q is divided into φ = floor(q/f) non-overlapping subsequences called fragments. Their length is equal to f.

  5. 'b' Hash table blocks number

  6. 's' Bitset block size

  7. 'a' Alphabet. The alphabet that was used.
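As an illustration of the "<param_name> <param_value>" file format and of the fragment definition given for 'f', here is a minimal C++ sketch. The helper names read_config and split_fragments are hypothetical and the code is not taken from Hitmap. It follows the definition above literally, treating f as the fragment length and producing floor(q/f) fragments, and it assumes that any leftover characters at the end of the query are simply not covered by a fragment.

```cpp
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Read "<param_name> <param_value>" pairs, one per line, into a map.
// Blank or malformed lines are skipped. (Hypothetical helper.)
std::map<std::string, std::string> read_config(std::istream& in) {
    std::map<std::string, std::string> params;
    std::string line, name, value;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        if (fields >> name >> value) params[name] = value;
    }
    return params;
}

// Split query into floor(q/f) non-overlapping fragments of length f,
// where q is the query length. Assumption: trailing characters that do
// not fill a whole fragment are left out.
std::vector<std::string> split_fragments(const std::string& query, std::size_t f) {
    std::vector<std::string> fragments;
    if (f == 0 || f > query.size()) return fragments;  // outside 0 < f <= q
    const std::size_t count = query.size() / f;          // integer division is the floor
    for (std::size_t i = 0; i < count; ++i)
        fragments.push_back(query.substr(i * f, f));
    return fragments;
}
```

For example, under these assumptions a query of length 10 with f = 3 yields 3 fragments of length 3, and the last character of the query belongs to no fragment.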
