Checkout and build
$ git clone https://github.com/cloudozer/BWT.git
$ cd BWT
$ git checkout master
$ ./rebar get-deps
# Set up the domain config files (bwtm.dom and bwtw.dom)
$ make
Getting DNA files
- Download an archive: https://docs.google.com/uc?id=0B2DPaltm6IwpYVFHOEZYSGpldHc&export=download
- Extract it to the BWT folder
Run test on local machine using 2 workers
$ ./scripts/start_local.sh SRR770176_1.fastq GL000193.1 2
Cluster node requirements
- Friendly Linux
- Xen
- Erlang/OTP 17
- Git
- Internet access
Master node setup
Edit the domain config file 'bwtm.dom': set the expected number of workers, the SSH port, etc.
$ make
$ sudo xl create -c bwtm.dom
Worker node setup
Edit the domain config file 'bwtw.dom': set the master IP address, etc.
$ make
$ sudo xl create -c bwtw.dom
Secure Shell connection to a Ling node
$ ssh %NODE_HOST% -p %PORT% # (password: 1)
[Disregard the information below this line]
Big file processing
These are small tools to help process large files in Erlang. In general, the strategy is to read the file into a list of Erlang binary "chunks", which may overlap. The chunks can then be processed concurrently.
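To make the chunking idea concrete, here is a minimal sketch of how a file's byte range could be split into overlapping windows. The module name chunk_layout and the function offsets/3 are hypothetical and are not part of bio_pfile. The sketch extends each window forward by the overlap, which is an assumption and may differ from how bio_pfile places its overlaps.

```erlang
-module(chunk_layout).
-export([offsets/3]).

%% offsets(FileSize, NumChunks, Overlap) -> [{Start, Length}]
%%
%% Hypothetical helper, not taken from bio_pfile.
%% Splits the byte range [0, FileSize) into at most NumChunks windows.
%% Every window except the last is extended forward by Overlap bytes, so a
%% pattern of up to Overlap + 1 bytes that straddles a boundary is still
%% fully contained in one window.
offsets(FileSize, NumChunks, Overlap)
  when FileSize > 0, NumChunks > 0, Overlap >= 0 ->
    %% Ceiling division: distance between consecutive window starts.
    Stride = (FileSize + NumChunks - 1) div NumChunks,
    %% Clamp each length so that no window runs past the end of the file.
    [{Start, min(Stride + Overlap, FileSize - Start)}
     || Start <- lists:seq(0, FileSize - 1, Stride)].
```

For example, chunk_layout:offsets(100, 4, 5) returns [{0,30},{25,30},{50,30},{75,25}]: four windows whose starts are 25 bytes apart, each sharing 5 bytes with the next.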
How to run
- download bio_pfile.erl
- launch Erlang shell
- compile: c(bio_pfile).
- run: bio_pfile:read(FileName,NumberOfChunks) or bio_pfile:read(FileName,NumberOfChunks,SizeOfOverlap). Both return a list of chunk elements of the form {{StartPos,Length},BinaryData}.
Example
1> Data = bio_pfile:read("../data/GCA_000001405.15_GRCh38_full_analysis_set.fna",10000).
[{{0,32553715},<<">chr1 AC:CM000663.2 gi:568336023 LN:248956422 rl:Chromosome M5:6aef897c3d6ff0c78aff06ac189178dd AS"...>>},
{{32543715,32563715},<<"ACCTCATAGATTGGTCATCTTTTTCTC\nCTATATTTCTCTAATATTTAATCTCTCTCTCTCTCTCTCTTTGTATGTGCATTGCCTTTGGAGAGATTTC\nC"...>>},
{{65097430,32563715},<<"AATCAAGAAAATATGTTTACCAAAA\nTGCATTGCAATTTTCCCAAACCTGAGTCTTCAAATAACAAACATGAACTTATAGGTACTGTGAACTAGAA"...>>},
{{97651145,32563715},<<"CAAGAATTGAGGTTTGGGAAACT\nCCATCTAGATTTCAGAGGATGTATGGAAATACCTGGATGTCCAGGCAGTAGTTTGCTGCAAGGGTGTG"...>>},
{{813832875,32563715},<<"TT\nT"...>>},
{{846386590,...},<<...>>},
{{...},...},
{...}|...]
2> length(Data).
10001
3> lists:nth(1,Data).
{{0,32553715},<<">chr1 AC:CM000663.2 gi:568336023 LN:248956422 rl:Chromosome M5:6aef897c3d6ff0c78aff06ac189178dd AS:GRC"...>>}
4>
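The chunk tuples shown above pair a byte range with the data read from it. As an illustration only, the following sketch builds tuples of the same {{StartPos,Length},BinaryData} shape from a list of byte ranges using file:pread/3. The module chunk_read and the function read_windows/2 are hypothetical names; this is not the bio_pfile implementation.

```erlang
-module(chunk_read).
-export([read_windows/2]).

%% read_windows(FileName, Windows) -> [{{Start, Length}, Binary}]
%%
%% Hypothetical helper, not taken from bio_pfile.
%% Windows is a list of {Start, Length} byte ranges with Length > 0 that
%% lie inside the file. Each range is read with a positioned read, so the
%% ranges may overlap and may be listed in any order.
read_windows(FileName, Windows) ->
    {ok, Fd} = file:open(FileName, [read, raw, binary]),
    try
        [read_window(Fd, Start, Length) || {Start, Length} <- Windows]
    after
        %% Always release the file descriptor, even if a read fails.
        file:close(Fd)
    end.

read_window(Fd, Start, Length) ->
    {ok, Bin} = file:pread(Fd, Start, Length),
    {{Start, Length}, Bin}.
```

Because file:pread/3 takes an explicit offset, each window is read independently of the others, which is what allows adjacent windows to share bytes.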
- run: bio_pfile:spawn_find_pattern(ChunkList,BinaryPattern), which returns a list of all the start positions where the pattern was found, each as {StartPosition,LengthOfPattern}.
Example
4> bio_pfile:spawn_find_pattern(Data,<<"TATATTCAGTCTTTCTAACACCATTTATTGAAGAGACTGTAG">>).
[{162758595,42}]
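A search of this kind can be sketched with one process per chunk and binary:matches/2. The code below is only an illustration of the approach under stated assumptions, not the bio_pfile source: the module chunk_search and the function find_pattern/2 are hypothetical, and it assumes chunks have the {{StartPos,Length},BinaryData} shape shown earlier.

```erlang
-module(chunk_search).
-export([find_pattern/2]).

%% find_pattern(Chunks, Pattern) -> [{AbsoluteStart, PatternLength}]
%%
%% Hypothetical helper, not taken from bio_pfile.
%% Spawns one worker per chunk, collects the matches, and returns them
%% sorted by absolute file position.
find_pattern(Chunks, Pattern)
  when is_binary(Pattern), byte_size(Pattern) > 0 ->
    Parent = self(),
    %% Each worker replies with a message tagged by its own unique reference.
    Refs = [begin
                Ref = make_ref(),
                spawn_link(fun() -> Parent ! {Ref, search(Chunk, Pattern)} end),
                Ref
            end || Chunk <- Chunks],
    %% Wait for every worker in turn; the order of arrival does not matter.
    Hits = lists:append([receive {Ref, Found} -> Found end || Ref <- Refs]),
    %% A match inside an overlap region is reported by two chunks;
    %% usort removes such duplicates and sorts the result.
    lists:usort(Hits).

%% binary:matches/2 reports non-overlapping occurrences relative to the
%% chunk; adding the chunk's start converts them to file positions.
search({{Start, _Length}, Bin}, Pattern) ->
    [{Start + Pos, Len} || {Pos, Len} <- binary:matches(Bin, Pattern)].
```

For instance, with Chunks = [{{0,6},<<"ACGTAC">>},{{4,6},<<"ACGTAC">>}], the call chunk_search:find_pattern(Chunks, <<"AC">>) returns [{0,2},{4,2},{8,2}]; the match at position 4 lies in the overlap and is reported once.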