About the runtime
LiShuhang-gif opened this issue · comments
Hello, I'm using last-train to determine the rates of insertion, deletion, and substitution between my reads and the genome:
last-train -P12 -Q0 mydb 17HanZZ0034.fastq > myseq.par
Unfortunately, the program has been running for 70 hours and is still going. Although I chose to prepare the genome without repeat-masking, the current run time seems too long. Is this running time normal or abnormal? And if it's abnormal, what can I do to speed it up? By the way, my fastq file (17HanZZ0034.fastq) is 72G. Thanks a lot!
I'm surprised it's that slow. If you can attach your (incomplete) myseq.par, it might show some clues. The 72G shouldn't matter, because last-train uses a fixed-size sample of the query sequences. What genome are you using?
I'm using the human genome. The above step finally completed, after nearly 78 hours. And now I'm using lastal to align DNA reads to their orthologous bases in the genome. Here is my command:
lastal -P 28 -p myseq.par mydb 17HanZZ0034.fastq | last-split > myseq.maf
So far, this step has taken 40 hours and is still running, which I think is still a lot of time.
I'll put the complete myseq.par file here. Please let me know if you have any suggestions to solve this problem and help me speed it up. Thanks.
# lastal version: 1256
# maximum percent identity: 100
# scale of score parameters: 4.5512
# scale used while training: 91.024
# lastal -j7 -S1 -P12 -Q0 -r5 -q5 -a15 -b3
# aligned letter pairs: 910388.1
# deletes: 59366.28
# inserts: 34921.5338
# delOpens: 34326.18
# insOpens: 22506.0228
# alignments: 558
# mean delete size: 1.72948
# mean insert size: 1.55165
# matchProb: 0.940698
# delOpenProb: 0.035469
# insOpenProb: 0.0232553
# delExtendProb: 0.42179
# insExtendProb: 0.355526
# substitution percent identity: 92.3214
# count matrix (query letters = columns, reference letters = rows):
# A C G T
# A 242472 4147.0099 13431.5314 5116.1765
# C 4379.9383 173498.66 2847.70253 5875.3163
# G 14191.693 2485.130373 167794.16 3758.22718
# T 4795.61218 5336.880759 3540.324008 256724.7
# probability matrix (query letters = columns, reference letters = rows):
# A C G T
# A 0.266337 0.00455518 0.0147535 0.00561973
# C 0.00481103 0.190575 0.00312799 0.00645359
# G 0.0155885 0.00272973 0.184309 0.00412813
# T 0.00526762 0.00586216 0.00388878 0.281993
# delExistCost: 280
# insExistCost: 292
# delExtendCost: 74
# insExtendCost: 90
# score matrix (query letters = columns, reference letters = rows):
# A C G T
# A 98 -239 -133 -255
# C -235 133 -242 -210
# G -129 -255 128 -252
# T -260 -218 -256 100
# lastal -j7 -S1 -P12 -Q0 -t91.0723 -p-
# aligned letter pairs: 893238.2
# deletes: 63016.81
# inserts: 40230.9481
# delOpens: 38189.95
# insOpens: 25792.2964
# alignments: 578
# mean delete size: 1.65009
# mean insert size: 1.5598
# matchProb: 0.932594
# delOpenProb: 0.0398726
# insOpenProb: 0.0269287
# delExtendProb: 0.393972
# insExtendProb: 0.358894
# substitution percent identity: 94.7581
# count matrix (query letters = columns, reference letters = rows):
# A C G T
# A 242609.71 1927.340774 12907.436195 2254.84882
# C 2184.642908 175679.33 1155.5313591 3631.06656
# G 13769.9899 942.8457702 169549.65 1537.011703
# T 2054.001752 3118.2846542 1344.2370255 258653.7
# probability matrix (query letters = columns, reference letters = rows):
# A C G T
# A 0.271582 0.0021575 0.0144488 0.00252412
# C 0.00244553 0.196659 0.00129353 0.00406469
# G 0.0154144 0.00105544 0.189797 0.00172056
# T 0.00229929 0.00349067 0.00150477 0.289542
# delExistCost: 260
# insExistCost: 280
# delExtendCost: 79
# insExtendCost: 89
# score matrix (query letters = columns, reference letters = rows):
# A C G T
# A 100 -308 -136 -328
# C -297 135 -324 -253
# G -131 -342 129 -332
# T -337 -266 -344 102
# lastal -j7 -S1 -P12 -Q0 -t91.3324 -p-
# aligned letter pairs: 886784.7
# deletes: 67261.25
# inserts: 44839.3056
# delOpens: 41196.03
# insOpens: 28310.4247
# alignments: 582
# mean delete size: 1.63271
# mean insert size: 1.58384
# matchProb: 0.926752
# delOpenProb: 0.0430527
# insOpenProb: 0.0295864
# delExtendProb: 0.387522
# insExtendProb: 0.368625
# substitution percent identity: 95.4855
# count matrix (query letters = columns, reference letters = rows):
# A C G T
# A 242742.9 1294.178206 12792.055635 1426.286513
# C 1548.539595 175712.03 675.8587138 2870.859088
# G 13583.8445 542.2332017 169457.2 925.261326
# T 1264.153482 2365.3611071 746.05339484 258854.4
# probability matrix (query letters = columns, reference letters = rows):
# A C G T
# A 0.273729 0.00145938 0.0144249 0.00160835
# C 0.00174621 0.198141 0.000762131 0.00323732
# G 0.0153178 0.000611448 0.191088 0.00104337
# T 0.00142552 0.0026673 0.000841286 0.291897
# delExistCost: 251
# insExistCost: 276
# delExtendCost: 80
# insExtendCost: 86
# score matrix (query letters = columns, reference letters = rows):
# A C G T
# A 99 -344 -137 -370
# C -328 136 -372 -274
# G -132 -392 129 -379
# T -381 -291 -397 102
# lastal -j7 -S1 -P12 -Q0 -t91.0781 -p-
# aligned letter pairs: 885006.1
# deletes: 69713.81
# inserts: 47438.5439
# delOpens: 42799.94
# insOpens: 29606.6777
# alignments: 590
# mean delete size: 1.62883
# mean insert size: 1.60229
# matchProb: 0.923802
# delOpenProb: 0.0446762
# insOpenProb: 0.0309046
# delExtendProb: 0.386062
# insExtendProb: 0.375894
# substitution percent identity: 95.8208
# count matrix (query letters = columns, reference letters = rows):
# A C G T
# A 243073.3 1023.549276 12681.13977 1056.7107291
# C 1284.40529 175990.55 482.3514595 2517.854682
# G 13448.37729 384.94506597 169807.11 667.2841279
# T 928.1245674 2001.6120012 513.58975493 259238.9
# probability matrix (query letters = columns, reference letters = rows):
# A C G T
# A 0.274628 0.00115642 0.0143274 0.00119389
# C 0.00145114 0.198837 0.000544968 0.00284471
# G 0.0151942 0.000434917 0.191851 0.000753908
# T 0.00104861 0.00226145 0.000580262 0.292892
# delExistCost: 247
# insExistCost: 275
# delExtendCost: 80
# insExtendCost: 84
# score matrix (query letters = columns, reference letters = rows):
# A C G T
# A 99 -365 -138 -397
# C -345 136 -403 -286
# G -133 -424 129 -409
# T -409 -306 -432 102
# lastal -j7 -S1 -P12 -Q0 -t91.0806 -p-
# aligned letter pairs: 884838.6
# deletes: 71372.83
# inserts: 49059.9165
# delOpens: 43780.62
# insOpens: 30388.6841
# alignments: 592
# mean delete size: 1.63024
# mean insert size: 1.61441
# matchProb: 0.92209
# delOpenProb: 0.0456238
# insOpenProb: 0.031668
# delExtendProb: 0.386593
# insExtendProb: 0.38058
# substitution percent identity: 95.9969
# count matrix (query letters = columns, reference letters = rows):
# A C G T
# A 243524.9 891.127115 12594.69529 869.2600543
# C 1161.369069 176279.56 391.22735866 2338.867059
# G 13354.0229 310.63178636 170057.83 541.6494231
# T 758.1233217 1812.1867268 400.83204362 259622.4
# probability matrix (query letters = columns, reference letters = rows):
# A C G T
# A 0.275198 0.00100703 0.0142328 0.000982316
# C 0.00131242 0.199206 0.00044211 0.00264306
# G 0.0150908 0.000351033 0.192176 0.000612096
# T 0.000856725 0.00204788 0.000452964 0.293389
# delExistCost: 245
# insExistCost: 275
# delExtendCost: 80
# insExtendCost: 83
# score matrix (query letters = columns, reference letters = rows):
# A C G T
# A 99 -378 -139 -415
# C -355 136 -422 -292
# G -134 -443 129 -428
# T -428 -315 -454 102
# lastal -j7 -S1 -P12 -Q0 -t91.0505 -p-
# aligned letter pairs: 884323.3
# deletes: 72203.83
# inserts: 49920.0574
# delOpens: 44280
# insOpens: 30773.555
# alignments: 594
# mean delete size: 1.63062
# mean insert size: 1.62217
# matchProb: 0.921197
# delOpenProb: 0.0461264
# insOpenProb: 0.0320567
# delExtendProb: 0.386736
# insExtendProb: 0.383543
# substitution percent identity: 96.0951
# count matrix (query letters = columns, reference letters = rows):
# A C G T
# A 243644.2 819.307969 12529.710094 763.011848
# C 1092.851629 176362.47 342.63519497 2249.897176
# G 13282.2126 272.90354627 170149.03 473.9446493
# T 660.3587931 1707.1879268 341.41887092 259729.5
# probability matrix (query letters = columns, reference letters = rows):
# A C G T
# A 0.275485 0.000926378 0.0141671 0.000862725
# C 0.00123567 0.19941 0.000387412 0.00254392
# G 0.015018 0.000308568 0.192385 0.000535881
# T 0.000746657 0.00193029 0.000386037 0.293672
# delExistCost: 245
# insExistCost: 274
# delExtendCost: 80
# insExtendCost: 83
# score matrix (query letters = columns, reference letters = rows):
# A C G T
# A 99 -386 -139 -427
# C -360 136 -434 -296
# G -135 -455 129 -440
# T -440 -320 -469 102
# lastal -j7 -S1 -P12 -Q0 -t90.9734 -p-
# aligned letter pairs: 883486.1
# deletes: 72534.03
# inserts: 50281.1878
# delOpens: 44420.63
# insOpens: 30980.8254
# alignments: 594
# mean delete size: 1.63289
# mean insert size: 1.62298
# matchProb: 0.920794
# delOpenProb: 0.0462964
# insOpenProb: 0.0322891
# delExtendProb: 0.387589
# insExtendProb: 0.383849
# substitution percent identity: 96.1462
# count matrix (query letters = columns, reference letters = rows):
# A C G T
# A 243508.3 779.049721 12505.212987 701.6762278
# C 1061.763832 176285.46 315.34366977 2197.380885
# G 13234.85637 252.33082239 170099.03 436.8713964
# T 605.6020096 1653.2986672 306.19728647 259598.1
# probability matrix (query letters = columns, reference letters = rows):
# A C G T
# A 0.275605 0.000881736 0.0141535 0.000794164
# C 0.00120171 0.199522 0.000356909 0.00248702
# G 0.0149793 0.000285591 0.19252 0.000494455
# T 0.000685426 0.00187122 0.000346557 0.293816
# delExistCost: 244
# insExistCost: 274
# delExtendCost: 80
# insExtendCost: 83
# score matrix (query letters = columns, reference letters = rows):
# A C G T
# A 99 -390 -140 -435
# C -363 136 -442 -298
# G -135 -462 129 -447
# T -448 -323 -479 102
# lastal -j7 -S1 -P12 -Q0 -t90.9616 -p-
# aligned letter pairs: 883227.9
# deletes: 72817.92
# inserts: 50544.8579
# delOpens: 44605.62
# insOpens: 31109.7555
# alignments: 594
# mean delete size: 1.63248
# mean insert size: 1.62473
# matchProb: 0.920472
# delOpenProb: 0.0464865
# insOpenProb: 0.0324216
# delExtendProb: 0.387436
# insExtendProb: 0.384512
# substitution percent identity: 96.1781
# count matrix (query letters = columns, reference letters = rows):
# A C G T
# A 243513 759.394831 12469.516286 662.7568431
# C 1043.019864 176281.82 298.81464313 2171.479071
# G 13218.19326 240.30829139 170109.03 415.5769021
# T 571.3877196 1622.0672939 284.77885023 259592.6
# probability matrix (query letters = columns, reference letters = rows):
# A C G T
# A 0.2757 0.00085977 0.0141177 0.000750358
# C 0.00118088 0.199582 0.000338311 0.0024585
# G 0.0149653 0.000272072 0.192594 0.000470507
# T 0.000646912 0.00183647 0.00032242 0.293905
# delExistCost: 244
# insExistCost: 274
# delExtendCost: 80
# insExtendCost: 82
# score matrix (query letters = columns, reference letters = rows):
# A C G T
# A 99 -392 -140 -440
# C -364 136 -447 -299
# G -135 -467 129 -452
# T -453 -325 -485 102
# lastal -j7 -S1 -P12 -Q0 -t91.0048 -p-
# aligned letter pairs: 882970.8
# deletes: 73078.79
# inserts: 50823.7388
# delOpens: 44724.84
# insOpens: 31230.4763
# alignments: 594
# mean delete size: 1.63396
# mean insert size: 1.62738
# matchProb: 0.92022
# delOpenProb: 0.0466116
# insOpenProb: 0.032548
# delExtendProb: 0.387992
# insExtendProb: 0.385514
# substitution percent identity: 96.2011
# count matrix (query letters = columns, reference letters = rows):
# A C G T
# A 243502.4 746.324946 12450.459489 636.1814449
# C 1032.6109 176271.1 287.44690141 2149.651302
# G 13196.64427 231.2572848 170098.63 399.4260808
# T 547.8090392 1595.0326348 271.12648913 259582.9
# probability matrix (query letters = columns, reference letters = rows):
# A C G T
# A 0.275767 0.000845216 0.0141002 0.000720478
# C 0.00116944 0.199628 0.000325535 0.00243449
# G 0.0149453 0.0002619 0.192637 0.000452352
# T 0.000620396 0.00180638 0.000307052 0.293979
# delExistCost: 244
# insExistCost: 274
# delExtendCost: 80
# insExtendCost: 82
# score matrix (query letters = columns, reference letters = rows):
# A C G T
# A 99 -394 -140 -444
# C -365 136 -450 -300
# G -135 -470 129 -455
# T -457 -327 -490 102
# lastal -j7 -S1 -P12 -Q0 -t90.9797 -p-
# aligned letter pairs: 882890.8
# deletes: 73177.41
# inserts: 50918.2687
# delOpens: 44769.16
# insOpens: 31273.6662
# alignments: 594
# mean delete size: 1.63455
# mean insert size: 1.62815
# matchProb: 0.92013
# delOpenProb: 0.0466575
# insOpenProb: 0.0325927
# delExtendProb: 0.388211
# insExtendProb: 0.385807
# substitution percent identity: 96.2125
# count matrix (query letters = columns, reference letters = rows):
# A C G T
# A 243502 737.721451 12449.957388 618.8685265
# C 1027.271887 176266.77 281.71500221 2138.53712
# G 13195.38323 226.85896076 170098.03 391.3771389
# T 532.6066046 1577.8669476 261.49345601 259581.3
# probability matrix (query letters = columns, reference letters = rows):
# A C G T
# A 0.275802 0.000835578 0.0141014 0.000700959
# C 0.00116354 0.199648 0.000319084 0.00242221
# G 0.0149457 0.000256951 0.192661 0.000443292
# T 0.000603255 0.00178717 0.00029618 0.294014
# delExistCost: 244
# insExistCost: 274
# delExtendCost: 80
# insExtendCost: 82
# score matrix (query letters = columns, reference letters = rows):
# A C G T
# A 99 -395 -140 -446
# C -366 136 -452 -301
# G -135 -472 129 -457
# T -460 -327 -493 102
# lastal -j7 -S1 -P12 -Q0 -t90.9658 -p-
# aligned letter pairs: 882773.8
# deletes: 73202.45
# inserts: 50967.3986
# delOpens: 44777.48
# insOpens: 31299.1961
# alignments: 595
# mean delete size: 1.63481
# mean insert size: 1.62839
# matchProb: 0.920087
# delOpenProb: 0.0466701
# insOpenProb: 0.0326221
# delExtendProb: 0.388306
# insExtendProb: 0.385898
# substitution percent identity: 96.2191
# count matrix (query letters = columns, reference letters = rows):
# A C G T
# A 243492 732.974624 12447.916387 610.3701628
# C 1021.845502 176262.95 277.94192676 2127.732562
# G 13195.02321 223.73471847 170091.83 386.1160161
# T 521.1742646 1577.2818045 255.73610094 259569.9
# probability matrix (query letters = columns, reference letters = rows):
# A C G T
# A 0.27582 0.000830289 0.0141006 0.000691407
# C 0.00115751 0.199665 0.000314843 0.00241022
# G 0.0149469 0.000253439 0.192674 0.000437379
# T 0.000590369 0.00178669 0.000289689 0.294032
# delExistCost: 244
# insExistCost: 274
# delExtendCost: 80
# insExtendCost: 82
# score matrix (query letters = columns, reference letters = rows):
# A C G T
# A 99 -396 -140 -447
# C -366 136 -453 -301
# G -135 -473 129 -458
# T -462 -328 -495 102
# lastal -j7 -S1 -P12 -Q0 -t90.9579 -p-
# aligned letter pairs: 882733.8
# deletes: 73263.4
# inserts: 51001.0686
# delOpens: 44803.02
# insOpens: 31316.3961
# alignments: 594
# mean delete size: 1.63523
# mean insert size: 1.62857
# matchProb: 0.920043
# delOpenProb: 0.0466967
# insOpenProb: 0.03264
# delExtendProb: 0.388467
# insExtendProb: 0.385966
# substitution percent identity: 96.2224
# count matrix (query letters = columns, reference letters = rows):
# A C G T
# A 243498.7 728.900857 12450.525286 606.1124315
# C 1022.093829 176265.34 276.13028846 2127.508449
# G 13195.8532 222.44561863 170094.63 383.4949563
# T 514.0464249 1568.8933984 252.30576189 259579.6
# probability matrix (query letters = columns, reference letters = rows):
# A C G T
# A 0.27583 0.000825682 0.0141037 0.00068659
# C 0.0011578 0.199669 0.000312794 0.00240999
# G 0.014948 0.000251981 0.192679 0.000434414
# T 0.0005823 0.00177721 0.000285806 0.294046
# delExistCost: 244
# insExistCost: 274
# delExtendCost: 80
# insExtendCost: 82
# score matrix (query letters = columns, reference letters = rows):
# A C G T
# A 99 -396 -140 -448
# C -366 136 -454 -301
# G -135 -474 129 -459
# T -463 -328 -497 102
# lastal -j7 -S1 -P12 -Q0 -t90.9542 -p-
# aligned letter pairs: 882731.8
# deletes: 73283.13
# inserts: 51020.1286
# delOpens: 44809.75
# insOpens: 31325.6361
# alignments: 594
# mean delete size: 1.63543
# mean insert size: 1.6287
# matchProb: 0.920028
# delOpenProb: 0.046703
# insOpenProb: 0.0326492
# delExtendProb: 0.388539
# insExtendProb: 0.386014
# substitution percent identity: 96.2241
# count matrix (query letters = columns, reference letters = rows):
# A C G T
# A 243499.5 728.929657 12451.305286 601.9976592
# C 1022.202506 176264.94 274.27387431 2127.528348
# G 13196.42319 220.96275348 170093.63 380.9513207
# T 510.4233786 1568.7967974 248.73720285 259578.5
# probability matrix (query letters = columns, reference letters = rows):
# A C G T
# A 0.275836 0.000825731 0.0141048 0.000681942
# C 0.00115795 0.199673 0.000310697 0.00241006
# G 0.0149489 0.000250306 0.192682 0.000431541
# T 0.000578207 0.00177713 0.000281769 0.29405
# delExistCost: 244
# insExistCost: 274
# delExtendCost: 80
# insExtendCost: 82
# score matrix (query letters = columns, reference letters = rows):
# A C G T
# A 99 -396 -140 -449
# C -366 136 -455 -301
# G -135 -474 129 -460
# T -464 -328 -498 102
# lastal -j7 -S1 -P12 -Q0 -t90.9512 -p-
# aligned letter pairs: 883620.8
# deletes: 73538.02
# inserts: 51113.6886
# delOpens: 44920.56
# insOpens: 31380.6561
# alignments: 595
# mean delete size: 1.63707
# mean insert size: 1.62883
# matchProb: 0.919942
# delOpenProb: 0.046767
# insOpenProb: 0.0326706
# delExtendProb: 0.389152
# insExtendProb: 0.386062
# substitution percent identity: 96.2214
# count matrix (query letters = columns, reference letters = rows):
# A C G T
# A 243559.2 731.154747 12465.885286 597.9639979
# C 1024.870416 176618.93 273.09495025 2136.903337
# G 13224.68318 222.89115718 170428.43 379.1039278
# T 508.1372043 1576.8193974 248.13743083 259657.8
# probability matrix (query letters = columns, reference letters = rows):
# A C G T
# A 0.275627 0.000827422 0.0141072 0.000676695
# C 0.00115981 0.199873 0.000309052 0.00241826
# G 0.0149659 0.000252238 0.192868 0.000429019
# T 0.000575041 0.00178443 0.000280808 0.293846
# delExistCost: 245
# insExistCost: 274
# delExtendCost: 79
# insExtendCost: 82
# score matrix (query letters = columns, reference letters = rows):
# A C G T
# A 99 -396 -140 -449
# C -366 136 -455 -301
# G -135 -474 128 -460
# T -464 -328 -498 102
# lastal -j7 -S1 -P12 -Q0 -t90.7452 -p-
# aligned letter pairs: 883582.8
# deletes: 73586.69
# inserts: 51156.688
# delOpens: 44827.53
# insOpens: 31367.7955
# alignments: 595
# mean delete size: 1.64155
# mean insert size: 1.63087
# matchProb: 0.92004
# delOpenProb: 0.0466772
# insOpenProb: 0.0326621
# delExtendProb: 0.39082
# insExtendProb: 0.386829
# substitution percent identity: 96.2246
# count matrix (query letters = columns, reference letters = rows):
# A C G T
# A 243560.1 728.384735 12465.855382 594.8409534
# C 1021.714178 176616.77 272.31061313 2128.647094
# G 13222.94618 222.34998721 170407.13 377.8824812
# T 505.4494746 1571.3704051 247.29998078 259657.4
# probability matrix (query letters = columns, reference letters = rows):
# A C G T
# A 0.275645 0.000824337 0.014108 0.000673201
# C 0.00115631 0.199883 0.000308183 0.00240906
# G 0.0149648 0.000251641 0.192855 0.000427662
# T 0.000572034 0.00177837 0.000279878 0.293863
# delExistCost: 245
# insExistCost: 274
# delExtendCost: 79
# insExtendCost: 82
# score matrix (query letters = columns, reference letters = rows):
# A C G T
# A 99 -396 -140 -450
# C -366 136 -456 -301
# G -135 -474 128 -461
# T -465 -328 -498 102
# lastal -j7 -S1 -P12 -Q0 -t90.7427 -p-
# aligned letter pairs: 883560.7
# deletes: 73598.7
# inserts: 51169.328
# delOpens: 44832.83
# insOpens: 31373.6055
# alignments: 595
# mean delete size: 1.64163
# mean insert size: 1.63097
# matchProb: 0.920028
# delOpenProb: 0.0466832
# insOpenProb: 0.0326685
# delExtendProb: 0.390848
# insExtendProb: 0.386867
# substitution percent identity: 96.2259
# count matrix (query letters = columns, reference letters = rows):
# A C G T
# A 243560.1 728.414824 12465.844382 590.8640321
# C 1021.732958 176615.66 270.45205909 2128.725194
# G 13223.04618 222.31203211 170405.93 375.4225772
# T 501.8593603 1571.2857741 247.34986978 259656.3
# probability matrix (query letters = columns, reference letters = rows):
# A C G T
# A 0.27565 0.000824385 0.0141083 0.000668712
# C 0.00115635 0.199885 0.000306085 0.00240919
# G 0.0149652 0.000251602 0.192857 0.000424885
# T 0.000567981 0.00177831 0.000279939 0.293867
# delExistCost: 245
# insExistCost: 274
# delExtendCost: 79
# insExtendCost: 82
# score matrix (query letters = columns, reference letters = rows):
# A C G T
# A 99 -396 -140 -450
# C -366 136 -456 -301
# G -135 -474 128 -461
# T -465 -328 -498 102
#last -Q 0
#last -t4.46385
#last -a 12
#last -A 14
#last -b 4
#last -B 4
#last -S 1
# score matrix (query letters = columns, reference letters = rows):
A C G T
A 5 -20 -7 -23
C -18 7 -23 -15
G -7 -24 6 -23
T -23 -16 -25 5
It looks like last-train has worked well (but slowly). I suggest repeat-masking: that's what we would typically do with 72G of human long reads. No need to re-run last-train. (Or you could try the -k or -uRY options mentioned here: https://gitlab.com/mcfrith/last/-/blob/main/doc/last-cookbook.rst)
Since lastal processes the reads in order, you can look at the incomplete output to figure out how far it has got. I can only guess: there might be a slowdown due to insufficient memory, or multi-threading being ineffective for some reason (maybe top or something would show that).
Hi! Thank you for your reply. The following command has been running for 70 hours:
lastal -P 28 -p myseq.par mydb 17HanZZ0034.fastq | last-split > myseq.maf
And now the size of the file myseq.maf is 67M. Can you roughly estimate the final size of myseq.maf? I wonder if it's about to finish.
And my other question is:
lastal -P 28 -p myseq.par mydb 17HanZZ0034.fastq | last-split > myseq.maf
How much memory does this step normally require to run at normal speed?
I'd expect the final output to have roughly similar size to the input fastq file. Is it really only 67M?
Memory: with repeat-masking, for the human genome, I think < 20G. Without masking, it could be quite a lot more; sorry, I'm not exactly sure.
Hi, I'm also using LAST to align my HiFi reads to genomes, and I've encountered the same problem as LiShuhang: my program is running very slowly. When I increase the threads to 48, lastal runs faster, but not by much. It has been running for about 6 hours and the maf file is 235M. Due to the runtime limit of our server (a program can run at most 120 hours), I'm afraid it won't finish before the limit. My fq.gz file is about 90G. I wonder if the alignment can be broken down into multiple sub-tasks, so that each sub-task can finish before the runtime limit.
Thanks for your reply.
Sub-tasks: you could split your fq file into parts, and align each part separately. Here's one way (not tested):
gzip -cd fq.gz | split -l200000 - myPrefix-
That puts each 200000-line chunk into a different file whose name starts with myPrefix-.
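The record granularity matters here: each fastq record is exactly 4 lines, so any multiple of 4 for split -l keeps whole reads together. A minimal sketch of the mechanism, using a tiny synthetic fastq (the file names and the 8-line chunk size are just for illustration, not from the thread):

```shell
# Each fastq record is 4 lines, so a chunk size that is a multiple
# of 4 (like 200000) never splits a read across files.
# Demo with 5 tiny synthetic reads and 8-line (2-read) chunks:
printf '@r%d\nACGT\n+\nIIII\n' 1 2 3 4 5 > tiny.fq
split -l8 tiny.fq myPrefix-
ls myPrefix-*            # three chunks: myPrefix-aa -ab -ac
wc -l < myPrefix-ac      # 4: the last chunk holds the leftover read
# Real data: gzip -cd fq.gz | split -l200000 - myPrefix-
# then run lastal on each myPrefix-* file as a separate job.
```

Each chunk is itself a valid fastq file, so it can be aligned independently and the resulting maf files concatenated afterwards.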
But I recommend trying the suggestions in the doc to go faster! (Especially for HiFi.) An easy one to try is lastal option -k64 (say).
Thanks for your suggestion. But I don't understand what the -k parameter means; could you explain it in more detail?
Besides, I found that the -i parameter can specify the memory lastal uses. If I specify more memory, will lastal run faster?
The -i parameter is explained here: https://gitlab.com/mcfrith/last/-/blob/main/doc/last-parallel.rst
You're right: it increases the memory usage, and probably increases speed (by making multi-threading more effective).
-k is explained here: https://gitlab.com/mcfrith/last/-/blob/main/doc/last-tuning.rst
So -k64 checks every 64th position in each DNA read, and if it finds a potential match to the genome, it tries extending an alignment from that position. I would expect quite high values of k to work fine for HiFi (even for non-HiFi), except that tiny rearranged fragments might be missed.
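To make the effect concrete: for a 10 kb read, sampling every 64th position leaves only a few hundred seed positions to extend from. The numbers below are plain arithmetic (not LAST output):

```shell
# Seed positions 0, 64, 128, ... within a 10000-base read:
seq 0 64 9999 | wc -l    # 157 seed positions instead of 10000
```

That is roughly a 64-fold reduction in extension attempts, which is where the speedup comes from.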
Thanks a lot! I will read the docs and try the parameters you suggested.
Hi, I have tried the -k and -i parameters with my data. Unfortunately, the speed improvement was small. At present, the result maf file grows by about 500M an hour. I checked the size of my fastq file: it is about 220G, which means I would need nearly 20 days to run lastal on this file. Besides, I have three other fastq files of about the same size as the first one. I know LAST is great software, but it is not realistic for me to run lastal on all these files.
I need to use tandem-genotypes to find changes in the length of tandem repeats in my data. So are there other solutions or programs that can generate the maf file that tandem-genotypes uses as input?
If so, please let me know. Thank you most sincerely.
Sorry it's not working great first time, or second time. I tested aligning some human HiFi reads (SRR9087598) to the human genome (hg38), with neither repeat-masking nor multi-threading. This is what I get:
lastal -k64 -p myMat myDb qry.fq | last-split             2.3G per hour
lastal -k64 -R00 -p myMat myDb qry.fq | last-split        3G per hour
lastal -k64 -m2 -p myMat myDb qry.fq | last-split         5.6G per hour
lastal -k64 -m2 -R00 -p myMat myDb qry.fq | last-split    12G per hour
I hope that's fast enough. The -R00 has no effect on the alignments: it skips detection and lowercasing of simple repeats in the reads. (But they are still detected and lowercased in the genome.)
With the -P multithreading option, you can make it a few-fold faster. Then last-split becomes the bottleneck, which can be overcome by using parallel-fastq (see https://github.com/mcfrith/last-rna/blob/master/last-long-reads.md).
The peak memory use was 15G, so your computer should have comfortably more than that. (Else you could use -uRY32 instead of -k, or repeat-masking.)
Sorry for not replying for a long time! I split my original file into smaller files; one such fastq file is 5.2G. Then I ran LAST on this file using the parameters you suggested above. Since the runtime limit for one program is 120 hours, the alignment still didn't finish before the limit. I don't know why. Maybe our hardware doesn't let LAST perform at its best.
Anyway, thanks for your patient answers.
How strange. Are you using the latest LAST (lastal --version)? The only other reason I can think of is your computer may not have enough memory (in which case I'd run lastdb with option -uRY32).
Yes, I'm using the latest LAST. Here is the output of lastal --version:
$ lastal --version
lastal 1256
You are right. The reason LAST ran so slowly is that it needs more memory. When I set -P10 -i30G -k64, lastal ran faster than before. It took about 8 hours to finish aligning a 5G fastq file.
I'm a little confused about the -i parameter. The doc (https://gitlab.com/mcfrith/last/-/blob/main/doc/last-parallel.rst) says -i can specify the batch size of the input, so does -i100 mean processing one hundred sequences at a time?
Thanks for your reply. I will try it again.
-i100 means 100 bases at a time (https://gitlab.com/mcfrith/last/-/blob/main/doc/lastal.rst). But it always does at least one whole sequence at a time, so it effectively means one sequence at a time.
With multi-threading, the default is -i8M (undocumented, bad). I would expect increasing -i to not greatly change the run time in this case: I wonder how much faster it got. Maybe this default -i should be increased...
It still seems a bit slower than expected. You could maybe try -P10 -k64 -m2 -R00.
Yes, it's still slower than expected. I tried -P20 -i100G -k64 -m5 -R00 and the speed is not ideal: it took 54 hours to generate a 23G file. Since -P20 -i100G -k64 alone needed only 34 hours to generate 23G, the -m5 -R00 seems to slow it down. That is confusing.
I'm out of ideas. It seems impossible for -m5 -R00 to make it slower...
Thanks for your considerate replies! Maybe it is the -i100G parameter that slows the speed: the more sequences it loads at once, the more slowly the result file is generated. It's just my guess.
I will try other parameter combinations, and if anything is different I will let you know.
Best wishes!
Hi, I used the parameters -P20 -k64 -m5 -R00 (with the -i parameter removed) and the result file is now generated much faster than before! It produces about 3G an hour. Maybe loading too many sequences into memory at once was what slowed things down.
While this problem is solved, I have another question. Does lastal support multiprocessing? Our computational resources have many nodes, and each node can use up to 28 threads. So can lastal be run on two or more nodes at the same time?
To run it on 2 or more nodes: one way is to split the fastq file into parts with something like split -l200000, as mentioned earlier in this issue. Then align each part on a different node.
Alternatively, it should be possible by specifying suitable GNU parallel options to parallel-fastq (https://gitlab.com/mcfrith/last/-/blob/main/doc/last-parallel.rst). Actually I've never done that, but apparently GNU parallel can do it: https://www.gnu.org/software/parallel/parallel_tutorial.html#remote-execution
I see. It seems that running this program (or others) on multiple nodes is not very common. Thanks for your suggestions. If needed I will try GNU parallel.
Thanks again!
Hi, I found the reason why lastal is running slowly on our computation server. It is the I/O speed of our computation system that limits how fast the result file is written. When there are many I/O-heavy tasks running, lastal's result file grows very slowly.
While the I/O speed is limited, is there anything I can do to speed it up? Can the priority of lastal be increased?
Aha: yes, I/O trouble could explain it. How about reducing the output size with gzip before writing it out, as mentioned here: https://github.com/mcfrith/last-rna/blob/master/last-long-reads.md
Try to figure out which parts of your file system are on fast/local disks.
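MAF output is highly redundant text and compresses well, so inserting gzip into the pipeline can cut the write volume several-fold before it hits the slow filesystem. A sketch of the mechanism (the lastal pipeline is shown only as a comment; the runnable part just exercises the gzip step on synthetic text):

```shell
# Real pipeline (not run here):
#   lastal -P28 -p myseq.par mydb reads.fq | last-split | gzip > out.maf.gz
# Mechanism demo: compress a text stream on the fly, then verify it.
seq 1 100000 > demo.maf              # stand-in for MAF-like text
gzip -c demo.maf > demo.maf.gz       # what the pipe to gzip does
gzip -t demo.maf.gz && echo "compressed stream is valid"
wc -c demo.maf demo.maf.gz           # the .gz file is much smaller
```

Downstream tools that read the maf can usually consume it via gzip -cd out.maf.gz | ..., so nothing downstream has to change except adding the decompression step.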
Thanks for your reply. I will try it.
Besides, I have another question. I used lastal to align human assemblies to a reference genome, with the parameters -P24 -k64 -m5 -R00.
After the program had been running for about 26 hours, the size of the result maf file was just 1.4k. The memory used was about 3.5G.
I don't know why it is so slow. Is LAST not suitable for aligning assemblies?
It should be suitable for aligning genome assemblies, and again I'm surprised it's not very fast, with those parameters.
One point is that the -P24 multithreading may be ineffective, and it might help to add something like -i3G, as mentioned here:
https://gitlab.com/mcfrith/last/-/blob/main/doc/last-parallel.rst
Yes, you are right. Finally I found the true reason why LAST was not very fast: a lack of disk space on our server was limiting the speed. Now that our disk space is sufficient, LAST can generate the maf file at about 3G per hour. So LAST is still great for processing our data!
Thanks for all the help you have given over the past days!
Best wishes.