issues with script soloCountMatrixFromBAM.awk
marlmatos opened this issue · comments
Hi Alex,
Thanks for always being so helpful. I am trying to recreate the counts matrix of STARsolo with the soloCountMatrixFromBAM.awk that you have posted here. For context, I am filtering the AlignedSorted.out.bam for autosomes and reads that passed WASP and those that do not have the WASP SAMtAg. But i keep getting syntax errors with the awk script, and I cannot find where the error is coming from. Here is how I am using the script:
samtools view scGEX-A29.sortedByCoord.filtered.out.bam | awk -v fileWL=scGEX-A29.Solo.out/Gene/raw/barcodes.tsv \
-v fileGenes=scGEX-A29.Solo.out/Gene/raw/features.tsv \
-f soloCountMatrixFromBAM.awk | sort -k2,2n -k1,1n > scGEX-A29.matrix.mtx
And this is the error:
awk: soloCountMatrixFromBAM.awk:60: Unexpected token
I also redownloaded the script to make sure it was not a typo that I had introduced. But the error persists.
For reference, this is how I am running STAR
STAR --runMode alignReads \\
--genomeDir ${index_ch} \\
--readFilesIn ${reverse.join( "," )} ${forward.join( "," )} \\
--outFileNamePrefix ${meta.sample_ID}. \\
--soloCBwhitelist <(gzip -cdf $barcode_whitelist) \\
--soloType CB_UMI_Simple \\
--soloFeatures Gene \\
--soloUMIlen 12 \\
--outSAMtype BAM SortedByCoordinate \\
--soloCellFilter EmptyDrops_CR ${expected_cells} 0.99 10 45000 90000 500 0.01 20000 0.01 10000 \\
--outSAMattrRGline ID:${meta.sample_ID} CN:NYGC SM:${meta.sample_ID} \\
--readFilesCommand zcat \\
--runDirPerm All_RWX \\
--outWigType bedGraph \\
--twopassMode Basic \\
--waspOutputMode SAMtag \\
--varVCFfile ${varVCF} \\
--limitOutSJcollapsed 5000000 \\
--limitSjdbInsertNsj 1500000 \\
--outSAMattributes NH HI AS NM MD vA vG vW GX CB UB
and another question: Is this the right way that you would go about this?
I thought that using the raw/barcodes.tsv and raw/features.tsv would give me a raw matrix? But how do I get to the filtered /matrix.mtx? Shouuld I use the filtered/barcodes.tsv and filtered/features.tsv instead?
Since the soloCountMatrixFromBAM.awk was not working for me, I opted then to use the STARsolo mapped BAM as input again for STAR, for remapping. I am still dealing with errors on that part, but I figured if you can help me with this, I would not have to do that...
Thanks again!
Marliette
just looking at line 60: U[cb][ge][UB]++;
, I assume that you need gawk
for support for nested arrays.
samtools view scGEX-A29.sortedByCoord.filtered.out.bam | gawk -v fileWL=scGEX-A29.Solo.out/Gene/raw/barcodes.tsv \
-v fileGenes=scGEX-A29.Solo.out/Gene/raw/features.tsv \
-f soloCountMatrixFromBAM.awk | sort -k2,2n -k1,1n > scGEX-A29.matrix.mtx
This script is just an example of how to process the data. You may want to re-write it for your own needs, maybe in python.
Thanks, I will try that