snakemake-workflows / rna-seq-star-deseq2

RNA-seq workflow using STAR and DESeq2

`count-matrix.py` isn't for stranded RNA-seq protocols

matrs opened this issue · comments

The count-matrix.py script uses column 1 (zero-based, i.e. the second column) to get the counts, which is the column for unstranded RNA-seq protocols (STAR manual, section "Counting number of reads per gene"). I'm testing the pipeline with data from a single-end stranded protocol and, to my understanding, the 3rd column should be used, which is equivalent to htseq-count --stranded yes. Other popular stranded protocols should use the 4th column. Are you interested in accepting a change to this script so that the user can choose the count column(s)? If so, how would you do it (config file, something else)?
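
To make the column layout above concrete, here is a minimal sketch that picks the count column by protocol. It uses only the standard library; the protocol labels, the function name `read_counts`, and the example rows are hypothetical and made up for illustration, not taken from the workflow or from real data.

```python
import csv
import io

# Zero-based column index in ReadsPerGene.out.tab for each protocol.
# The protocol labels are hypothetical names chosen for this sketch.
COLUMN_FOR_PROTOCOL = {"unstranded": 1, "forward": 2, "reverse": 3}


def read_counts(handle, protocol):
    """Return {gene_id: count} using the column that matches the protocol."""
    col = COLUMN_FOR_PROTOCOL[protocol]
    counts = {}
    for i, row in enumerate(csv.reader(handle, delimiter="\t")):
        # The first four rows are summary statistics (N_unmapped, ...), not genes.
        if i < 4:
            continue
        counts[row[0]] = int(row[col])
    return counts


# Made-up example content in the same four-column layout.
EXAMPLE = (
    "N_unmapped\t10\t10\t10\n"
    "N_multimapping\t5\t5\t5\n"
    "N_noFeature\t3\t20\t4\n"
    "N_ambiguous\t1\t0\t0\n"
    "geneA\t100\t2\t98\n"
    "geneB\t50\t49\t1\n"
)

forward_counts = read_counts(io.StringIO(EXAMPLE), "forward")
unstranded_counts = read_counts(io.StringIO(EXAMPLE), "unstranded")
```

With the made-up rows, the "forward" choice reads the 3rd column and the "unstranded" choice reads the 2nd one, which is the behaviour the script has today.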

Yes, of course, a PR would be great! Thanks a lot! I think it might make sense to have an optional column in the unit sheet, called stranded. If the value is 0, false, or empty, everything would be handled as it is now for that unit; otherwise the workflow should use the stranded counts according to the protocol name specified in that column. Does that make sense?

I thought this would be simpler, but I've had some problems, mainly because I don't really understand a few things about snakemake.

  1. I called the new column strandness and not stranded, because there are 3 options for it rather than a binary answer: it isn't just "is it stranded, yes or no?". I can change that, though.

  2. Looking into units.schema.yaml I found that the column fq2 isn't required, so I thought the same should apply to the new column strandness. I defined the following function to handle the optional new column strandness:

#takes as input the dataframe `units`, defined in the `Snakefile`
def exist_strandness(units):
    return "strandness" in units.columns

Then I defined the following function in diffexp.smk. The idea is to return the index of the column that will be used and pass that number to count-matrix.py.

def strandness(sample, unit):
    if exist_strandness(units):
        strandness_val = units.loc[(sample, unit), "strandness"]
        if pd.isnull(strandness_val) or strandness_val == 0:
            return 1 #non stranded protocol
        if strandness_val == "forward":
            return 2 #3rd column
        if strandness_val == "reverse":
            return 3 #4th column, usually for illumina truseq
        else:
            raise ValueError('"strandness" column should be empty or have the value 0, '
                             '"forward" or "reverse"')
    else:
        return 1 #non stranded for cases where there
                 #isn't a "strandness" column in units.tsv

So I can write something like this:

idx = strandness(**wildcards) 

counts = [pd.read_table(f, index_col=0, usecols=[0, idx], header=None, skiprows=4) for f in snakemake.input]

I tried all of this outside snakemake, that is, by defining a dataframe from units.tsv and testing a few things from there. I tried dataframes with and without the column strandness and it worked. My problem is that I'm confused about a few things when I try to use these functions within snakemake.

  • How can I access the units dataframe defined in the Snakefile to use it within strandness()?
  • How could I pass the wildcards to strandness()?
  • How could I pass the returned value of strandness() to count-matrix.py?
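
One possible direction for the last question, sketched in plain Python: since the count matrix is built from all units at once, the column index could be computed per unit and collected into a list. Snakemake exposes a rule's params to a script through the snakemake.params object, so a list like this could be handed over that way; whether that fits this workflow is an assumption. The function names and the unit records below are hypothetical.

```python
import math


def strandness_to_column(value):
    """Map one `strandness` cell to a zero-based count column index."""
    # A missing cell (None, empty string) or 0 is treated as unstranded.
    if value is None or value == "" or value == 0:
        return 1
    # pandas represents empty cells as NaN, which is a float.
    if isinstance(value, float) and math.isnan(value):
        return 1
    if value == "forward":
        return 2
    if value == "reverse":
        return 3
    raise ValueError(f'unexpected "strandness" value: {value!r}')


def columns_for_units(rows):
    """Return one column index per unit, in the same order as `rows`."""
    # row.get() covers unit sheets that have no strandness column at all.
    return [strandness_to_column(row.get("strandness")) for row in rows]


# Hypothetical unit records, one dict per row of units.tsv.
rows = [
    {"sample": "A", "unit": "1", "strandness": "reverse"},
    {"sample": "B", "unit": "1", "strandness": None},
    {"sample": "C", "unit": "1"},
]
column_indices = columns_for_units(rows)
```

Keeping the list in the same order as the input files would let the script pair each file with its own column instead of using a single idx for all of them.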

I tried many things and obviously I don't understand a few concepts about snakemake, especially how to access some of its objects, so if you have some time maybe you can help me with this.

Tangential issue:

def is_single_end(sample, unit):
    return pd.isnull(units.loc[(sample, unit), "fq2"])

The above function fails when there isn't a column called fq2. It works when there is an empty column called fq2, but not when the column doesn't exist at all. If fq2 isn't required, maybe this is a bug; that's why I defined the function exist_strandness using units.columns and not pd.isnull().
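
A minimal sketch of how the same units.columns check could guard is_single_end. It assumes that a missing fq2 column means every unit is single-end, and it takes units as an explicit argument instead of the global used in the workflow. The unit sheets below are hypothetical.

```python
import pandas as pd


def is_single_end(units, sample, unit):
    """Return True when the unit has no second FASTQ file."""
    # Assumption: without an fq2 column no unit can be paired-end.
    if "fq2" not in units.columns:
        return True
    return bool(pd.isnull(units.loc[(sample, unit), "fq2"]))


# Hypothetical unit sheet without an fq2 column.
units_without_fq2 = pd.DataFrame(
    {"sample": ["A", "B"], "unit": ["1", "1"], "fq1": ["A.fq.gz", "B.fq.gz"]}
).set_index(["sample", "unit"], drop=False)

# Hypothetical unit sheet where only sample B has a second read file.
units_with_fq2 = pd.DataFrame(
    {
        "sample": ["A", "B"],
        "unit": ["1", "1"],
        "fq1": ["A_1.fq.gz", "B_1.fq.gz"],
        "fq2": [None, "B_2.fq.gz"],
    }
).set_index(["sample", "unit"], drop=False)
```

With these sheets, the lookup no longer raises for the sheet that lacks fq2, and the empty fq2 cell for sample A is still reported as single-end.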