ScienceParkStudyGroup / rna-seq-lesson

A Carpentries-style lesson on RNA-Sequencing

Home Page: https://scienceparkstudygroup.github.io/rna-seq-lesson/

Episode 7 Functional Enrichment Analysis GO/KEGG for non-model organisms

Fred-White94 opened this issue

I just wanted to point out that using clusterProfiler with OrgDb objects is not ideal for less well-annotated species, at least when the OrgDb comes from AnnotationHub.

This includes rice, for example. The problem is that the OrgDb lacks translations from Entrez IDs to GO terms: ~75% of the input Entrez IDs do not map to any GO term through this method.
Since the OrgDb object does not have an Ensembl keytype, I was forced to translate from Ensembl to Entrez IDs using biomaRt, which also loses some IDs.
A direct translation from Ensembl IDs to GO terms (using biomaRt) leaves only ~39% of genes unmapped.
I am not aware of a way to update OrgDb objects with, for example, new keytypes, but I need to look into it, as this clusterProfiler method for GSEA is unusable for less well-annotated species.
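To make percentages like these easy to reproduce, here is a minimal sketch in base R of how the unmapped fraction could be computed from a long-format gene-to-GO table. The gene IDs, the table contents, and the helper `percent_unmapped()` are all hypothetical and only illustrate the bookkeeping; the real table would come from whichever mapping route is being evaluated.

```r
# Hypothetical input IDs and mapping table, invented for illustration only.
input_ids <- c("gene01", "gene02", "gene03", "gene04", "gene05")

# Long-format gene-to-GO table: one row per (gene, GO term) pair.
# Genes without annotation are assumed to come back with NA or "".
gene2go <- data.frame(
  gene_id = c("gene01", "gene01", "gene02", "gene03", "gene04"),
  go_id   = c("GO:0005975", "GO:0016787", "", NA, "GO:0006486"),
  stringsAsFactors = FALSE
)

# Percentage of input IDs that have no usable GO term in the mapping table.
percent_unmapped <- function(ids, mapping) {
  has_go <- !is.na(mapping$go_id) & nzchar(mapping$go_id)
  mapped_ids <- unique(mapping$gene_id[has_go])
  100 * mean(!(unique(ids) %in% mapped_ids))
}

unmapped <- percent_unmapped(input_ids, gene2go)
print(unmapped)
```

Running the same helper on the table from each route (OrgDb via Entrez IDs versus direct Ensembl-to-GO) would give directly comparable numbers.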

I have not tried creating an OrgDb from NCBI, but I would not recommend using AnnotationHub, despite it being the route recommended by the authors of clusterProfiler.

Yes, I finally got a GO 😆 at it!
It is indeed not 100% trivial. First of all, most genes from non-model species do not seem to have any GO terms. I checked for tomato (annotation version ITAG4.0) on Sol Genomics. Here are the first genes, with their GO terms where available:

Solyc01g004000.1
Solyc01g004002.1
Solyc01g004004.1
Solyc01g004006.1
Solyc01g004008.1
Solyc01g005000.2	GO:0016831,GO:0019752,GO:0030170
Solyc01g005020.2
Solyc01g005040.2
Solyc01g005060.2
Solyc01g005070.2	GO:0005543
Solyc01g005080.2
Solyc01g005090.2
Solyc01g005110.2	GO:0006486
Solyc01g005120.2	GO:0004553,GO:0005975
Solyc01g005130.2
Solyc01g005140.2
Solyc01g005150.2
Solyc01g005180.1
Solyc01g005220.2
Solyc01g005253.1
Solyc01g005257.1
Solyc01g005280.2
Solyc01g005310.2
Solyc01g005320.2
Solyc01g005330.2	GO:0007010,GO:0008017
Solyc01g005340.2
Solyc01g005370.2
Solyc01g005380.2
Solyc01g005390.2	GO:0016787
Solyc01g005400.2
Solyc01g005420.2
Solyc01g005430.2
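As a rough sketch of how a listing like this could be turned into something usable, the base R snippet below parses a few of the lines above into a long-format term-to-gene table and computes the fraction of genes with any GO term. It assumes the gene ID and the GO field are separated by a tab and that GO terms are comma-separated; the variable names are my own.

```r
# A few lines copied from the listing above: gene ID, then an optional
# GO field (assumed tab-separated, with comma-separated GO terms).
annotation_lines <- c(
  "Solyc01g004000.1",
  "Solyc01g005000.2\tGO:0016831,GO:0019752,GO:0030170",
  "Solyc01g005020.2",
  "Solyc01g005070.2\tGO:0005543",
  "Solyc01g005120.2\tGO:0004553,GO:0005975"
)

# Split each line into gene ID and (possibly missing) GO field.
fields   <- strsplit(annotation_lines, "\t", fixed = TRUE)
gene_ids <- vapply(fields, function(x) x[1], character(1))
go_field <- vapply(fields, function(x) if (length(x) > 1) x[2] else "",
                   character(1))

# Keep only annotated genes and expand to one row per (term, gene) pair.
has_go   <- nzchar(go_field)
go_terms <- strsplit(go_field[has_go], ",", fixed = TRUE)
term2gene <- data.frame(
  term = unlist(go_terms),
  gene = rep(gene_ids[has_go], lengths(go_terms)),
  stringsAsFactors = FALSE
)

# Fraction of genes in this excerpt with at least one GO term.
annotated_fraction <- mean(has_go)
print(term2gene)
```

A two-column table in this shape (term first, gene second) is what `clusterProfiler::enricher()` accepts through its `TERM2GENE` argument, which might be a way to run the enrichment on the native gene IDs without going through an OrgDb at all. I have not tested that route here.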

To perform the GO enrichment with clusterProfiler, I have to:

  1. Annotate the differential genes with their NCBI Entrez IDs. In my experience, around 40% of my tomato genes have a corresponding Entrez ID.
  2. Create an OrgDb class object: ah <- AnnotationHub() followed by tomato <- ah[["AH76051"]] (since AH76051 corresponds to org.Solanum_lycopersicum.eg.sqlite).
  3. Run enrichGO with:
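# Over-representation analysis on the "BP" (biological process) ontology
ora_analysis_go_bp <- enrichGO(gene          = diff_genes_annotated$entrezgene_id,           # differential genes (Entrez IDs)
                               OrgDb         = tomato,
                               readable      = TRUE,
                               ont           = "BP",
                               universe      = all_genes_expressed_annotated$entrezgene_id,  # background: expressed genes
                               keyType       = "ENTREZID",
                               minGSSize     = 5,
                               maxGSSize     = 100,
                               pAdjustMethod = "BH",
                               qvalueCutoff  = 0.05)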
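
For readers wondering why the universe matters so much here: to my understanding, the over-representation analysis behind enrichGO comes down to a hypergeometric test per GO term, and only genes carrying at least one GO annotation take part in it. Below is a simplified sketch of that per-term test in base R. All counts are hypothetical, and the real function additionally filters terms by gene set size and applies its own cutoffs.

```r
# Hypothetical counts, invented for illustration only.
universe_size    <- 1000  # expressed genes with at least one GO annotation
term_in_universe <- 40    # universe genes annotated to the GO term
n_diff           <- 50    # differential genes with at least one GO annotation
term_in_diff     <- 8     # differential genes annotated to the GO term

# P(X >= term_in_diff) when drawing n_diff genes at random from the universe.
p_value <- phyper(
  q = term_in_diff - 1,
  m = term_in_universe,
  n = universe_size - term_in_universe,
  k = n_diff,
  lower.tail = FALSE
)

# Benjamini-Hochberg adjustment across terms
# (the other two p-values are hypothetical placeholders).
p_adjusted <- p.adjust(c(p_value, 0.20, 0.65), method = "BH")
print(p_value)
print(p_adjusted)
```

With a poorly annotated species, both `universe_size` and `n_diff` shrink to the annotated subset, so the test only speaks for the genes that made it through the ID and GO mapping.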

In my hands, this yields meaningful results.

Does this contribute to the discussion? A section for non-model organisms could definitely be added to this episode.

Hey @Fred-White94, for me this issue is closed, as I found a way for tomato. I will open a new issue because, as Lieke mentioned, the problem might be more generic.