Error in rule get_nt_coverage:
byee4 opened this issue
I'm hitting an error in Skipper on mouse data, but I don't know whether the mouse annotations are ultimately causing the overflow. Is there anything obviously wrong with the command or the annotations?
zcat output/reproducible_enriched_windows/x_eCLIP_CA3-26-02-2024-20-42-57-29-02-2024-21-51-06.reproducible_enriched_windows.tsv.gz | tail -n +2 | sort -k1,1 -k2,2n | awk -v OFS=" " '{print $1, $2 -37, $3+37,$4,$5,$6}' | bedtools merge -i - -s -c 6 -o distinct | awk -v OFS=" " '{for(i=$2;i< $3;i++) {print $1,i,i+1,"MW:" NR ":" i - $2,0,$4, NR} }' > output/finemapping/nt_coverage/x_eCLIP_CA3-26-02-2024-20-42-57-29-02-2024-21-51-06.nt_census.bed; samtools cat output/bams/dedup/genome/CA3_IN_1.genome.Aligned.sort.dedup.bam output/bams/dedup/genome/CA3_IN_2.genome.Aligned.sort.dedup.bam | bedtools intersect -s -wa -a - -b output/finemapping/nt_coverage/x_eCLIP_CA3-26-02-2024-20-42-57-29-02-2024-21-51-06.nt_census.bed | bedtools bamtobed -i - | awk '($1 != "chrEBV") && ($4 !~ "/2$")' | bedtools flank -s -l 1 -r 0 -g /tscc/projects/ps-yeolab3/bay001/annotations/mm10/star_2_7_6a_gencode25_sjdb/chrNameLength.txt -i - | bedtools shift -p 1 -m -1 -g /tscc/projects/ps-yeolab3/bay001/annotations/mm10/star_2_7_6a_gencode25_sjdb/chrNameLength.txt -i - | bedtools sort -i - | bedtools coverage -counts -s -a output/finemapping/nt_coverage/x_eCLIP_CA3-26-02-2024-20-42-57-29-02-2024-21-51-06.nt_census.bed -b - | awk '{print $NF}' > output/finemapping/nt_coverage/x_eCLIP_CA3-26-02-2024-20-42-57-29-02-2024-21-51-06.nt_coverage.input.counts;samtools cat output/bams/dedup/genome/CA3_IP_1.genome.Aligned.sort.dedup.bam output/bams/dedup/genome/CA3_IP_2.genome.Aligned.sort.dedup.bam | bedtools intersect -s -wa -a - -b output/finemapping/nt_coverage/x_eCLIP_CA3-26-02-2024-20-42-57-29-02-2024-21-51-06.nt_census.bed | bedtools bamtobed -i - | awk '($1 != "chrEBV") && ($4 !~ "/2$")' | bedtools flank -s -l 1 -r 0 -g /tscc/projects/ps-yeolab3/bay001/annotations/mm10/star_2_7_6a_gencode25_sjdb/chrNameLength.txt -i - | bedtools shift -p 1 -m -1 -g /tscc/projects/ps-yeolab3/bay001/annotations/mm10/star_2_7_6a_gencode25_sjdb/chrNameLength.txt -i - | bedtools sort -i - | bedtools coverage -counts -s -a 
output/finemapping/nt_coverage/x_eCLIP_CA3-26-02-2024-20-42-57-29-02-2024-21-51-06.nt_census.bed -b - | awk '{print $NF}' > output/finemapping/nt_coverage/x_eCLIP_CA3-26-02-2024-20-42-57-29-02-2024-21-51-06.nt_coverage.clip.counts;paste output/finemapping/nt_coverage/x_eCLIP_CA3-26-02-2024-20-42-57-29-02-2024-21-51-06.nt_census.bed output/finemapping/nt_coverage/x_eCLIP_CA3-26-02-2024-20-42-57-29-02-2024-21-51-06.nt_coverage.input.counts output/finemapping/nt_coverage/x_eCLIP_CA3-26-02-2024-20-42-57-29-02-2024-21-51-06.nt_coverage.clip.counts > output/finemapping/nt_coverage/x_eCLIP_CA3-26-02-2024-20-42-57-29-02-2024-21-51-06.nt_coverage.bed
Activating singularity image /tscc/projects/ps-yeolab4/software/skipper/d0055ff/singularity/3a69c84662a103b04ab9cb379236f2d6.simg
Error: Invalid record in file -. Record is
chrM -37 105 12146428 0 +
Here are the first few lines of the BAM files:
[bay001@login1 x_mouse_hippocampus_29-02-2024-21-52-22]$ samtools view output/bams/dedup/genome/CA3_IN_1.genome.Aligned.sort.dedup.bam | grep chrM | less
VH01429:45:AACHKTTHV:1:2507:34099:18228:AGCGCACTTA 16 chrM 6 255 31M * 0 0 TGTAGCTTAATAACAAAGCAAAGCACTGAAA CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC NH:i:1 HI:i:1 AS:i:30 nM:i:0 NM:i:0 MD:Z:31 jM:B:c,-1 jI:B:i,-1 RG:Z:CA3_IN_1
VH01429:45:AACHKTTHV:2:2601:21545:10674:AAAGGCAGGG 0 chrM 10 255 41M * 0 0 GCTTAATAACAAAGCAAAGCACTGAAAATGCTTAGATGGAT CCCCCCCCCCCC;CCCCCCCCCCCCCC;CCCCCCC;CCCC; NH:i:1 HI:i:1 AS:i:40 nM:i:0 NM:i:0 MD:Z:41 jM:B:c,-1 jI:B:i,-1 RG:Z:CA3_IN_1
VH01429:45:AACHKTTHV:1:1406:59151:53687:GCTTGATATC 0 chrM 14 255 37M * 0 0 AATAACAAAGCAAAGCACTGAAAATGCTTAGGGATAA CCC-CCCC-CCCCC;CCC;C;CCCC-C-CCCC-C;-C NH:i:1 HI:i:1 AS:i:26 nM:i:5 NM:i:5 MD:Z:31A0T0G0G1T0 jM:B:c,-1 jI:B:i,-1 RG:Z:CA3_IN_1
VH01429:45:AACHKTTHV:2:2211:30293:52968:AGCTCTCCAT 0 chrM 17 255 41M * 0 0 AACAAAGCAAAGCACTGAAAATGCTTAGATGGATAATTGTA CCCCCCC-CC;-CCC-CCCCCCCCCCCC;;CCCCCCCCCC; NH:i:1 HI:i:1 AS:i:40 nM:i:0 NM:i:0 MD:Z:41 jM:B:c,-1 jI:B:i,-1 RG:Z:CA3_IN_1
VH01429:45:AACHKTTHV:1:2302:52694:24741:ACAATTATCG 0 chrM 18 255 41M * 0 0 ACAAAGCAAAGCACTGAAAATGCTTAGATGGATAATTGTAT CCCC;CCCC;CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC NH:i:1 HI:i:1 AS:i:40 nM:i:0 NM:i:0 MD:Z:41 jM:B:c,-1 jI:B:i,-1 RG:Z:CA3_IN_1
VH01429:45:AACHKTTHV:1:2614:69357:10731:TCCCATCTAT 0 chrM 18 255 34M * 0 0 ACAAAGCAAAGCACTGAAAATGCTTAGATGGATA CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC NH:i:1 HI:i:1 AS:i:33 nM:i:0 NM:i:0 MD:Z:34 jM:B:c,-1 jI:B:i,-1 RG:Z:CA3_IN_1
VH01429:45:AACHKTTHV:2:1205:26961:50280:GGAGTTTCAC 0 chrM 18 255 41M * 0 0 ACAAAGCAAAGCACTGAATATGCTTAGATGGATAATTGTAT CCCC;CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC NH:i:1 HI:i:1 AS:i:38 nM:i:1 NM:i:1 MD:Z:18A22 jM:B:c,-1 jI:B:i,-1 RG:Z:CA3_IN_1
VH01429:45:AACHKTTHV:2:1210:40462:16941:GGTGGACCAC 0 chrM 18 255 41M * 0 0 ACAAAGCAAAGCACTGAAAATGCTTAGATGGATAATTGTAT CCCC;CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC NH:i:1 HI:i:1 AS:i:40 nM:i:0 NM:i:0 MD:Z:41 jM:B:c,-1 jI:B:i,-1 RG:Z:CA3_IN_1
VH01429:45:AACHKTTHV:2:1506:73902:28678:GATTTAACTC 0 chrM 18 255 34M * 0 0 ACAAAGCAAAGCACTGAAAATGCTTAGATGGATA CCCC-CCCC;CCCCCCCCCCCCCCCCCCCCCCCC NH:i:1 HI:i:1 AS:i:33 nM:i:0 NM:i:0 MD:Z:34 jM:B:c,-1 jI:B:i,-1 RG:Z:CA3_IN_1
VH01429:45:AACHKTTHV:1:2506:46350:14120:GTGGAGTTGT 0 chrM 20 255 41M * 0 0 AAAGCAAAGCACTGAAAATGCTTAGATGGATAATTGTATCC CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC NH:i:1 HI:i:1 AS:i:40 nM:i:0 NM:i:0 MD:Z:41 jM:B:c,-1 jI:B:i,-1 RG:Z:CA3_IN_1
VH01429:45:AACHKTTHV:1:1213:31941:18039:GGCCACCTGG 0 chrM 22 255 41M * 0 0 AGCAAAGCACTGAATATGCTTAGATGGATAATTGTATCCCA CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC;CCCCCCCC NH:i:1 HI:i:1 AS:i:38 nM:i:1 NM:i:1 MD:Z:14A26 jM:B:c,-1 jI:B:i,-1 RG:Z:CA3_IN_1
VH01429:45:AACHKTTHV:1:2514:49418:9463:CCCAACAAAG 0 chrM 24 255 41M * 0 0 CAAAGCACTGAAAATGCTTAGATGGATAATTGTATCCCATA CCC;CCC;CC;CCCCCCCCCCCCCCCCCCCCCCCCC-CCCC NH:i:1 HI:i:1 AS:i:40 nM:i:0 NM:i:0 MD:Z:41 jM:B:c,-1 jI:B:i,-1 RG:Z:CA3_IN_1
VH01429:45:AACHKTTHV:1:1302:10600:34755:GATACATAAC 0 chrM 34 255 37M * 0 0 AAAATGCTTAGATGGATAATTGTATCCCATAAACACC CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC-CCCC NH:i:1 HI:i:1 AS:i:34 nM:i:1 NM:i:1 MD:Z:36A0 jM:B:c,-1 jI:B:i,-1 RG:Z:CA3_IN_1
VH01429:45:AACHKTTHV:1:2304:62446:34869:AGTGTAGATG 0 chrM 34 255 38M * 0 0 AAAATGCTTAGATGGATAATTGTATCCCATAAACACCA CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC-CC;CC NH:i:1 HI:i:1 AS:i:35 nM:i:1 NM:i:1 MD:Z:36A1 jM:B:c,-1 jI:B:i,-1 RG:Z:CA3_IN_1
VH01429:45:AACHKTTHV:2:1606:60230:37614:GGCCCCTCAG 0 chrM 35 255 36M * 0 0 AAATGCTTAGATGGATAATTGTATCCCATAAACACC CCCCCC-CC;;CC;CCCC;CCCCCCCC-CCC-CCCC NH:i:1 HI:i:1 AS:i:33 nM:i:1 NM:i:1 MD:Z:35A0 jM:B:c,-1 jI:B:i,-1 RG:Z:CA3_IN_1
VH01429:45:AACHKTTHV:1:1306:9540:25744:AAGCATAAAC 0 chrM 36 255 36M * 0 0 AATGCTTAGATGGATAATTGTATCCCATAAACACCA CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC NH:i:1 HI:i:1 AS:i:33 nM:i:1 NM:i:1 MD:Z:34A1 jM:B:c,-1 jI:B:i,-1 RG:Z:CA3_IN_1
...
[bay001@login1 x_mouse_hippocampus_29-02-2024-21-52-22]$ samtools view output/bams/dedup/genome/CA3_IN_2.genome.Aligned.sort.dedup.bam | grep chrM | less
VH01429:45:AACHKTTHV:1:2210:56235:44070:AATACCCAGT 0 chrM 14 255 41M * 0 0 AATAACAACGCAAAGCACTGAAAATGCTTAGATGGATAATT CCCCCCCC-CCCCCCCCCCCCCCCCCCCCCCCCCCCCC-CC NH:i:1 HI:i:1 AS:i:38 nM:i:1 NM:i:1 MD:Z:8A32 jM:B:c,-1 jI:B:i,-1 RG:Z:CA3_IN_2
VH01429:45:AACHKTTHV:2:2505:24442:15085:GTAATGCATA 0 chrM 14 255 41M * 0 0 AATAACAAAGCAAAGCACTGAAAATGCTTAGATGGATAATT CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC NH:i:1 HI:i:1 AS:i:40 nM:i:0 NM:i:0 MD:Z:41 jM:B:c,-1 jI:B:i,-1 RG:Z:CA3_IN_2
VH01429:45:AACHKTTHV:2:1314:21091:42328:GAATATTAGT 0 chrM 16 255 41M * 0 0 TAACAAAGCAAAGCACTGAAAATGCTTAGATGGATAATTGT CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC NH:i:1 HI:i:1 AS:i:40 nM:i:0 NM:i:0 MD:Z:41 jM:B:c,-1 jI:B:i,-1 RG:Z:CA3_IN_2
VH01429:45:AACHKTTHV:2:2204:70001:7550:CACAAGGCCA 0 chrM 16 255 41M * 0 0 TAACAAAGCAAAGCACTGAAAATGCTTAGATGGATAATTGT C;CCCC-CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC NH:i:1 HI:i:1 AS:i:40 nM:i:0 NM:i:0 MD:Z:41 jM:B:c,-1 jI:B:i,-1 RG:Z:CA3_IN_2
VH01429:45:AACHKTTHV:1:2109:16716:21446:TTCTTCGAGG 0 chrM 17 255 41M * 0 0 AACAAAGCAAAGCACTGAAAATGCTTAGATGGATAATTGTA CC;CCCCCCCCC;CCCCCCCCCC;CCCCCCCCCCCCCCCCC NH:i:1 HI:i:1 AS:i:40 nM:i:0 NM:i:0 MD:Z:41 jM:B:c,-1 jI:B:i,-1 RG:Z:CA3_IN_2
VH01429:45:AACHKTTHV:1:2606:62919:8648:ATGGTATCCG 0 chrM 17 255 35M * 0 0 AACAAAGCAAAGCACTGAAAATGCTTAGATGGATA CCCCCCCCCC;CCCC;CCCCCCCCCCCCCCCCCCC NH:i:1 HI:i:1 AS:i:34 nM:i:0 NM:i:0 MD:Z:35 jM:B:c,-1 jI:B:i,-1 RG:Z:CA3_IN_2
VH01429:45:AACHKTTHV:1:1205:17985:41230:TTAACACACA 0 chrM 18 255 34M * 0 0 ACAAAGCAAAGCACTGAAAATGCTTAGATGGATA CCCC;CCCCCCCCCCCCCCCCCCCCCCCCCCCCC NH:i:1 HI:i:1 AS:i:33 nM:i:0 NM:i:0 MD:Z:34 jM:B:c,-1 jI:B:i,-1 RG:Z:CA3_IN_2
VH01429:45:AACHKTTHV:1:2113:12399:44316:GATACAATAC 0 chrM 18 255 41M * 0 0 ACAAAGCAAAGCACTGAAAATGCTTAGATGGATAATTGTAT CCCC-CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC NH:i:1 HI:i:1 AS:i:40 nM:i:0 NM:i:0 MD:Z:41 jM:B:c,-1 jI:B:i,-1 RG:Z:CA3_IN_2
VH01429:45:AACHKTTHV:2:1604:42961:53252:GGTGTGTGAA 0 chrM 18 255 41M * 0 0 ACAAAGCAAAGCACTGAAAATGCTTAGATGGATAATTGTAT CCCCCCCCCCCCCCCCCCCCCCCCCCC;CCCCCCCCCCCCC NH:i:1 HI:i:1 AS:i:40 nM:i:0 NM:i:0 MD:Z:41 jM:B:c,-1 jI:B:i,-1 RG:Z:CA3_IN_2
VH01429:45:AACHKTTHV:2:2207:43643:14934:TGCGGAGCAG 0 chrM 18 0 16M * 0 0 ACAAAGCAAAGCACTG CCCC;CCCCCCC-C;C NH:i:7 HI:i:1 AS:i:15 nM:i:0 NM:i:0 MD:Z:16 jM:B:c,-1 jI:B:i,-1 RG:Z:CA3_IN_2
VH01429:45:AACHKTTHV:2:2202:27396:30799:GTCAGTACGC 0 chrM 21 255 41M * 0 0 AAGCAAAGCACTGAAAATGCTTAGATGGATAANTGTATCCC CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC#CCCCCCCC NH:i:1 HI:i:1 AS:i:39 nM:i:0 NM:i:1 MD:Z:32T8 jM:B:c,-1 jI:B:i,-1 RG:Z:CA3_IN_2
VH01429:45:AACHKTTHV:1:2508:62332:19156:GGAAACAACT 16 chrM 29 255 41M * 0 0 CACTGAAAATGCTTAGATGGATAATTGTATCCCATAAACAC CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC NH:i:1 HI:i:1 AS:i:40 nM:i:0 NM:i:0 MD:Z:41 jM:B:c,-1 jI:B:i,-1 RG:Z:CA3_IN_2
VH01429:45:AACHKTTHV:2:1508:56216:10504:AATGCCATAC 0 chrM 33 255 19M * 0 0 GAAAATGCTTAGATGGATA CCCCCCCCCCCC-CCCCCC NH:i:1 HI:i:1 AS:i:18 nM:i:0 NM:i:0 MD:Z:19 jM:B:c,-1 jI:B:i,-1 RG:Z:CA3_IN_2
Contents of the chrom sizes file (chrNameLength.txt):
chr1 195471971
chr10 130694993
chr11 122082543
chr12 120129022
chr13 120421639
chr14 124902244
chr15 104043685
chr16 98207768
chr17 94987271
chr18 90702639
chr19 61431566
chr1_GL456210_random 169725
chr1_GL456211_random 241735
chr1_GL456212_random 153618
chr1_GL456213_random 39340
chr1_GL456221_random 206961
chr2 182113224
chr3 160039680
chr4 156508116
chr4_GL456216_random 66673
chr4_JH584292_random 14945
chr4_GL456350_random 227966
chr4_JH584293_random 207968
chr4_JH584294_random 191905
chr4_JH584295_random 1976
chr5 151834684
chr5_JH584296_random 199368
chr5_JH584297_random 205776
chr5_JH584298_random 184189
chr5_GL456354_random 195993
chr5_JH584299_random 953012
chr6 149736546
chr7 145441459
chr7_GL456219_random 175968
chr8 129401213
chr9 124595110
chrM 16299
chrX 171031299
chrX_GL456233_random 336933
chrY 91744698
chrY_JH584300_random 182347
chrY_JH584301_random 259875
chrY_JH584302_random 155838
chrY_JH584303_random 158099
chrUn_GL456239 40056
chrUn_GL456367 42057
chrUn_GL456378 31602
chrUn_GL456381 25871
chrUn_GL456382 23158
chrUn_GL456383 38659
chrUn_GL456385 35240
chrUn_GL456390 24668
chrUn_GL456392 23629
chrUn_GL456393 55711
chrUn_GL456394 24323
chrUn_GL456359 22974
chrUn_GL456360 31704
chrUn_GL456396 21240
chrUn_GL456372 28664
chrUn_GL456387 24685
chrUn_GL456389 28772
chrUn_GL456370 26764
chrUn_GL456379 72385
chrUn_GL456366 47073
chrUn_GL456368 20208
chrUn_JH584304 114452
There don't appear to be any negative start coordinates in the BAM files (these are the output/bams/dedup/genome/*genome.Aligned.sort.dedup.bam files).
My best guess is that in the above command, bedtools shift or bedtools flank is doing something strange, but I can't really tell what either one is doing. The commands themselves shouldn't be causing the overflow, as the chrom sizes file (chrNameLength.txt) is specified with -g, so I'm also confused about why I see negative values here.
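For reference, the invalid record has six columns and looks consistent with the first awk step in the command (the one feeding bedtools merge) rather than with bedtools flank or bedtools shift: `$2 -37` applied to a window that starts at 0 gives -37, and `$3+37` applied to an end of 68 gives 105. The sketch below reproduces that arithmetic on a single made-up input line and shows one possible guard that floors the start at 0. The 0 and 68 are inferred from the error record, not read from the actual windows file; the function names `pad_unclamped` and `pad_clamped` are hypothetical; and this is only an illustration, not necessarily how Skipper handles it.

```shell
# Padding as written in the rule: subtract 37 from the start, add 37 to the end.
pad_unclamped() {
  awk -v OFS="\t" '{print $1, $2 - 37, $3 + 37, $4, $5, $6}'
}

# Hypothetical guard (an assumption, not the Skipper fix): floor the start at 0.
pad_clamped() {
  awk -v OFS="\t" '{start = $2 - 37; if (start < 0) start = 0; print $1, start, $3 + 37, $4, $5, $6}'
}

# Made-up window at the very start of chrM (start and end inferred from the error record).
window="$(printf 'chrM\t0\t68\t12146428\t0\t+')"

printf '%s\n' "$window" | pad_unclamped   # chrM  -37  105  12146428  0  +
printf '%s\n' "$window" | pad_clamped     # chrM  0    105  12146428  0  +
```

Only the start is guarded here; the end could still run past the chromosome length for a window near the end of a contig. `bedtools slop -b 37 -g` with the same chrNameLength.txt may be another option, since slop restricts padded intervals to the chromosome bounds, but that has not been tried in this rule.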
Thank you! This makes sense. If you want to update GitHub, I can pull the changes into whatever branch we're currently working off of.