BIMSBbioinfo / pigx_rnaseq

Bulk RNA-seq Data Processing, Quality Control, and Downstream Analysis Pipeline

salmon results show no gene names - with latest guix rnaseq 0.1

smoe opened this issue

Hello, I read through #35, but even though I am using your latest Guix version, I am still affected by the missing gene names:

$ which pigx-rnaseq
.../.guix-profile/bin/pigx-rnaseq
$ pigx-rnaseq --version
PiGx RNAseq Pipeline.
Version: 0.1.0

Copyright © 2017-2021 Bora Uyar, Jona Ronen, Ricardo Wurmus.

The gene identifier in the .fa file indeed carries a version suffix that does not appear in the gene_id of the .gtf file:

$ grep ENSOCUG00000010950.3 cdna/Oryctolagus_cuniculus.OryCun2.0.cdna.all.fa
>ENSOCUT00000010949.3 cdna chromosome:OryCun2.0:3:442600:454640:-1 gene:ENSOCUG00000010950.3 gene_biotype:protein_coding transcript_biotype:protein_coding description:Oryctolagus cuniculus transmembrane p24 trafficking protein 7 (TMED7), mRNA. [Source:RefSeq mRNA;Acc:NM_001204345]
$ grep ENSOCUG00000010950.3 gtf/Oryctolagus_cuniculus.OryCun2.0.104.gtf
$ grep ENSOCUG00000010950 gtf/Oryctolagus_cuniculus.OryCun2.0.104.gtf
3       ensembl gene    442600  454640  .       -       .       gene_id "ENSOCUG00000010950"; gene_version "3"; gene_source "ensembl"; gene_biotype "protein_coding";
3       ensembl transcript      442600  454640  .       -       .       gene_id "ENSOCUG00000010950"; gene_version "3"; transcript_id "ENSOCUT00000010949"; transcript_version "3"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding";
3       ensembl exon    454267  454640  .       -       .       gene_id "ENSOCUG00000010950"; gene_version "3"; transcript_id "ENSOCUT00000010949"; transcript_version "3"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSOCUE00000127393"; exon_version "3";
...

The GTF apparently keeps that information in the separate transcript_version and gene_version attributes.
Still, when I start pigx-rnaseq, there is no warning:

$ cat .../output/.snakemake/log/2022-09-30T151230.778536.snakemake.log
Building DAG of jobs...
Nothing to be done.
Complete log: .../output/.snakemake/log/2022-09-30T151230.778536.snakemake.log
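The mismatch visible in the grep output above can also be checked programmatically. The following is a minimal sketch, not the pipeline's validation script: the function names are hypothetical, and it assumes Ensembl-style cDNA headers (a `gene:` field) and Ensembl-style GTF attributes (`gene_id`, `gene_version`). It shows that the plain gene_id never matches the versioned FASTA identifier, while gene_id joined with gene_version does.

```python
import re


def fasta_gene_ids(header_lines):
    """Collect the 'gene:' identifiers from Ensembl cDNA FASTA header lines."""
    ids = set()
    for line in header_lines:
        if line.startswith(">"):
            match = re.search(r"\bgene:(\S+)", line)
            if match:
                ids.add(match.group(1))
    return ids


def gtf_gene_ids(gtf_lines, with_version=False):
    """Collect gene_id values from GTF lines, optionally as 'id.version'."""
    ids = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip GTF header comments
        gid = re.search(r'gene_id "([^"]+)"', line)
        if not gid:
            continue
        name = gid.group(1)
        if with_version:
            version = re.search(r'gene_version "([^"]+)"', line)
            if version:
                name = f"{name}.{version.group(1)}"
        ids.add(name)
    return ids


# Shortened copies of the two records quoted in this issue.
fasta = [
    ">ENSOCUT00000010949.3 cdna chromosome:OryCun2.0:3:442600:454640:-1 "
    "gene:ENSOCUG00000010950.3 gene_biotype:protein_coding"
]
gtf = [
    '3\tensembl\tgene\t442600\t454640\t.\t-\t.\t'
    'gene_id "ENSOCUG00000010950"; gene_version "3"; gene_source "ensembl";'
]

plain_overlap = fasta_gene_ids(fasta) & gtf_gene_ids(gtf)
versioned_overlap = fasta_gene_ids(fasta) & gtf_gene_ids(gtf, with_version=True)
print("shared IDs, plain gene_id:", plain_overlap)
print("shared IDs, gene_id + gene_version:", versioned_overlap)
```

For the real files, the same functions can be fed the open file handles instead of the in-line lists.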

The hisat2 mapping apparently works just fine. Can't we fix this by contacting the salmon folks, or whoever should be addressing this?

Hi @smoe,
the check for consistency between annotations is done here: https://github.com/BIMSBbioinfo/pigx_rnaseq/blob/master/scripts/validate_input_annotation.R

Could you maybe send me the links to the annotation files so that I can check this?
There might be a bug in my validation script too.

An inconsistency doesn't kill the process, though; it should only print a warning.

Hi @borauyar,
The gtf is from http://ftp.ensembl.org/pub/release-107/gtf/oryctolagus_cuniculus/ , the cdna from http://ftp.ensembl.org/pub/release-107/fasta/oryctolagus_cuniculus/cdna/

From the validation script, the

message(date(), " Checking annotation files for potential issues")

does not appear in the (very short) log shown above, so the check was likely skipped because all outputs already existed. I would have expected the settings to be validated on every invocation, even when my runs have already completed. Can I somehow execute that check directly?

Either way, the folks at https://github.com/COMBINE-lab/salmon should address this. They have version 1.9.0 out (the regular Guix pigx-rnaseq ships salmon 1.6.0). Is there an easy way to check if salmon has changed its behaviour?

Oh, I see. If you had run this from scratch, you should have received the warning. But if you rerun it after all the outputs already exist, the validation is not triggered. The validation script is meant to be the first thing that runs, so that the pipeline fails as early as possible (although for this particular problem the pipeline wouldn't fail).

I think the problem is not with Salmon. It is up to the annotation database to have consistent nomenclature/ids between annotation files. I am sure there must be a reason why Ensembl provides such annotations.

Even if this specific issue were addressed for Ensembl, another annotation source could have a different kind of inconsistency. So I think this should be fixed upstream.

The warning likely scrolled past too quickly for me to notice. So we agree that we would prefer to see this fixed (by whomever) in salmon. I'll update the package locally and see how this goes.

Upstream is aware of that problem: COMBINE-lab/salmon#598
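
Until this is resolved upstream, one possible local workaround is to strip the version suffixes from the cDNA FASTA headers before building the salmon index, so that the transcript and gene IDs match the unversioned transcript_id and gene_id in the GTF. The sketch below only illustrates that idea: the function names and file paths are hypothetical, it is not something pigx-rnaseq or salmon does, and whether it restores the gene names in the pipeline output is an untested assumption.

```python
import re

# Matches a trailing Ensembl-style version suffix such as ".3".
_VERSION = re.compile(r"\.\d+$")


def strip_versions(line):
    """Drop '.N' suffixes from the transcript ID and the 'gene:' field of a header.

    Sequence lines (anything not starting with '>') are returned unchanged.
    """
    if not line.startswith(">"):
        return line
    fields = line[1:].split(" ")
    fields[0] = _VERSION.sub("", fields[0])  # transcript ID is the first token
    fields = [
        "gene:" + _VERSION.sub("", f[len("gene:"):]) if f.startswith("gene:") else f
        for f in fields
    ]
    return ">" + " ".join(fields)


def rewrite_fasta(src_path, dst_path):
    """Write a copy of the FASTA at src_path with unversioned header IDs."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(strip_versions(line.rstrip("\n")) + "\n")


example = (
    ">ENSOCUT00000010949.3 cdna chromosome:OryCun2.0:3:442600:454640:-1 "
    "gene:ENSOCUG00000010950.3 gene_biotype:protein_coding"
)
print(strip_versions(example))
```

The rewritten file would then be used in place of the original cDNA FASTA in the settings file; the original download stays untouched, so the change is easy to revert.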