`count-matrix.py` isn't for stranded RNA-seq protocols
matrs opened this issue · comments
The `count-matrix.py` script uses column 1 (second column) to get the counts, which is for unstranded RNA-seq protocols (STAR manual, section "Counting number of reads per gene"). I'm testing the pipeline with data from a single-end stranded protocol and, to my understanding, the 3rd column should be used, which is the equivalent of `htseq-count --stranded yes`. Other popular stranded protocols should use the fourth column. Are you interested in accepting a change to this script to accept user input to choose the count column(s)? If so, how would you do it (config file, others)?
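To make the column layout concrete, here is a minimal sketch of reading one STAR `ReadsPerGene.out.tab` file for a given library type. The file content, the `COLUMN_FOR` mapping, and the `read_counts` helper are made up for illustration and are not part of the workflow; the column meanings follow the STAR manual section cited above (four summary rows, then gene ID, unstranded counts, counts equivalent to `htseq-count -s yes`, and counts equivalent to `htseq-count -s reverse`).

```python
import io

import pandas as pd

# Made-up example in the ReadsPerGene.out.tab layout:
# four summary rows, then one row per gene.
EXAMPLE = (
    "N_unmapped\t10\t10\t10\n"
    "N_multimapping\t5\t5\t5\n"
    "N_noFeature\t3\t40\t4\n"
    "N_ambiguous\t2\t0\t1\n"
    "geneA\t100\t2\t98\n"
    "geneB\t50\t1\t49\n"
)

# Zero-based column index per library type (the names are illustrative).
COLUMN_FOR = {"unstranded": 1, "forward": 2, "reverse": 3}


def read_counts(handle, strandness="unstranded"):
    """Return per-gene counts from the column matching the library type."""
    idx = COLUMN_FOR[strandness]
    table = pd.read_csv(
        handle,
        sep="\t",
        header=None,
        index_col=0,
        usecols=[0, idx],  # gene ID plus the selected count column
        skiprows=4,  # skip the four N_* summary rows
    )
    # Only one data column remains after the gene ID becomes the index.
    return table.iloc[:, 0]


reverse_counts = read_counts(io.StringIO(EXAMPLE), "reverse")
print(reverse_counts.to_dict())
```

With a reverse-stranded library, picking the unstranded column here would still give plausible-looking numbers, which is why the choice has to come from the user rather than from the file itself.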
Yes of course, a PR would be great! Thanks a lot! So, I think it might make sense to have an optional column in the unit sheet, called `stranded`. If the value is 0, false, or empty, everything would be handled as it is now for that unit; otherwise the workflow should consider a stranded matrix according to the protocol name specified in that column. Does that make sense?
So I thought it would be simpler, but I've had some problems, mainly because I don't really understand a few things about `snakemake`.
- I called the new column `strandness` and not `stranded`, because there are 3 options for it. It isn't a binary answer to "is it stranded, yes or no?", but I can change that.
- Looking into `units.schema.yaml` I found that the column `fq2` isn't required, so I assumed the same for the new column `strandness`. I defined the following function to handle the optional new column `strandness`:
# takes as input the dataframe `units`, defined in the `Snakefile`
def exist_strandness(units):
    return "strandness" in units.columns
Then I defined the following function in `diffexp.smk`. The idea is to return the column index that will be used and pass that number to `count_matrix.py`.
def strandness(sample, unit):
    if exist_strandness(units):
        strandness_val = units.loc[(sample, unit), "strandness"]
        if pd.isnull(strandness_val) or strandness_val == 0:
            return 1  # non-stranded protocol
        if strandness_val == "forward":
            return 2  # 3rd column
        if strandness_val == "reverse":
            return 3  # 4th column, usually for Illumina TruSeq
        else:
            raise ValueError('"strandness" column should be empty or have the value 0, "forward" or "reverse"')
    else:
        return 1  # non-stranded, for cases where there
                  # isn't a "strandness" column in units.tsv
So I can write something like this:
idx = strandness(**wildcards)
counts = [pd.read_table(f, index_col=0, usecols=[0, idx], header=None, skiprows=4) for f in snakemake.input]
I tried all of this outside `snakemake`, that is, defining a dataframe from `units.tsv` and from there on testing a few things. I tried dataframes with and without the column `strandness` and it worked. My problem is that I'm confused about a few things when I try to use these functions within `snakemake`.
- How can I access the `units` dataframe defined in the `Snakefile` to use it within `strandness()`?
- How could I pass the wildcards to `strandness()`?
- How could I pass the returned value of `strandness()` to `count_matrix.py`?
I tried many things and obviously I don't understand a few concepts about `snakemake`, especially how to access some of its objects, so if you have some time maybe you can help me with this.
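One possible direction, sketched under assumptions rather than taken from the workflow: code in the `Snakefile` and in files it includes shares one namespace, so a helper in `diffexp.smk` can read `units` directly; a rule's `params` can hold a plain value or a function that receives `wildcards`; and a script can read those values through `snakemake.params`. Because the script loops over all of `snakemake.input`, it may be simpler to compute one column index per unit up front than to call a function per wildcard. The helper name `count_columns`, the example dataframe, and a `params` entry such as `strand` are all hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the `units` dataframe built from units.tsv.
units = pd.DataFrame(
    {
        "sample": ["A", "B", "C"],
        "unit": ["1", "1", "1"],
        "strandness": [None, "forward", "reverse"],
    }
).set_index(["sample", "unit"], drop=False)

# Zero-based count column for each stranded protocol.
COLUMN_FOR = {"forward": 2, "reverse": 3}


def count_columns(units):
    """Return one zero-based count-column index per row of `units`."""
    if "strandness" not in units.columns:
        # No column at all: treat every unit as unstranded.
        return [1] * len(units)
    indices = []
    for value in units["strandness"]:
        if pd.isnull(value) or value == 0:
            indices.append(1)  # empty or 0: unstranded
        elif value in COLUMN_FOR:
            indices.append(COLUMN_FOR[value])
        else:
            raise ValueError(
                '"strandness" should be empty, 0, "forward" or "reverse"'
            )
    return indices


print(count_columns(units))
```

In a rule this list could be passed as something like `params: strand=count_columns(units)`, and the script could then pair each file in `snakemake.input` with the matching entry of `snakemake.params.strand`, provided both are built in the same unit order.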
Tangential issue:
def is_single_end(sample, unit):
    return pd.isnull(units.loc[(sample, unit), "fq2"])
The above function fails when there isn't a column called `fq2`. It works when there is an empty column called `fq2`, but it doesn't when such a column doesn't exist. If `fq2` isn't required, maybe this is a bug; that's why I defined the function `exist_strandness` using `units.columns` and not `pd.isnull()`.
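A minimal sketch of how the same column check could be applied to `fq2`, so that a missing column is treated like an empty one. Unlike the function above, this version takes `units` as an explicit argument so the example is self-contained; that signature and the example dataframes are assumptions for illustration, not the workflow's code.

```python
import pandas as pd


def is_single_end(units, sample, unit):
    """True if the unit has no second FASTQ file, even when `fq2` is absent."""
    if "fq2" not in units.columns:
        # Looking up a missing column would raise KeyError, so check first.
        return True
    return bool(pd.isnull(units.loc[(sample, unit), "fq2"]))


# Hypothetical unit sheets: one with an `fq2` column, one without.
with_fq2 = pd.DataFrame(
    {
        "sample": ["A", "B"],
        "unit": ["1", "1"],
        "fq1": ["a_1.fq", "b_1.fq"],
        "fq2": [None, "b_2.fq"],
    }
).set_index(["sample", "unit"], drop=False)
without_fq2 = with_fq2.drop(columns="fq2")

print(is_single_end(with_fq2, "A", "1"))
print(is_single_end(with_fq2, "B", "1"))
print(is_single_end(without_fq2, "B", "1"))
```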