smth wrong with enrichGO
tsoshukming opened this issue · comments
Issue: qvalueCutoff parameter in enrichGO seems not working as expected
Code Example
go_mf <- enrichGO(gene = gene_ids,
OrgDb = org.Hs.eg.db,
ont = "MF",
qvalueCutoff = 0.05,
pAdjustMethod = "BH",
readable = TRUE)
head(go_mf@result$qvalue)
args(enrichGO)
packageVersion("clusterProfiler")
packageVersion("GO.db")
go_mf1 <- enrichGO(gene = gene_ids,
OrgDb = org.Hs.eg.db,
ont = "MF",
qvalueCutoff = 0.1,
pAdjustMethod = "BH",
readable = TRUE)
nrow(go_mf@result)
nrow(go_mf1@result)
args(enrichGO)
Current Output
> go_mf <- enrichGO(gene = gene_ids,
+ OrgDb = org.Hs.eg.db,
+ ont = "MF",
+ qvalueCutoff = 0.05,
+ pAdjustMethod = "BH",
+ readable = TRUE)
> head(go_mf@result$qvalue)
[1] 0.002392067 0.069597336 0.089149123 0.089149123 0.103898592 0.103898592
> args(enrichGO)
function (gene, OrgDb, keyType = "ENTREZID", ont = "MF", pvalueCutoff = 0.05,
pAdjustMethod = "BH", universe, qvalueCutoff = 0.2, minGSSize = 10,
maxGSSize = 500, readable = FALSE, pool = FALSE)
NULL
> packageVersion("clusterProfiler")
[1] '4.12.6'
> packageVersion("GO.db")
[1] '3.19.1'
> go_mf1 <- enrichGO(gene = gene_ids,
+ OrgDb = org.Hs.eg.db,
+ ont = "MF",
+ qvalueCutoff = 0.1,
+ pAdjustMethod = "BH",
+ readable = TRUE)
> nrow(go_mf@result)
[1] 262
> nrow(go_mf1@result)
[1] 262
> args(enrichGO)
function (gene, OrgDb, keyType = "ENTREZID", ont = "MF", pvalueCutoff = 0.05,
pAdjustMethod = "BH", universe, qvalueCutoff = 0.2, minGSSize = 10,
maxGSSize = 500, readable = FALSE, pool = FALSE)
NULL
Version Information
packageVersion("clusterProfiler")
[1] '4.12.6'
Expected Behavior
Setting qvalueCutoff = 0.05
should filter results to only show entries with qvalue < 0.05. However:
- Current results show entries with qvalue > 0.05 (0.103898592)
- Changing qvalueCutoff (0.05 vs 0.1) doesn't affect number of results (both 262 rows)
This appears to contradict the parameter's intended filtering function as shown in args(enrichGO)
.
I agree that it might be somewhat counter-intuitive, but to exclusively filter on qvalue
set the argument pvalueCutoff
to 1 (pvalueCutoff = 1
). See example 3 below.
Since in your code you did not specify this argument, the default setting pvalueCutoff = 0.05
is still used,
Reason for the need to set pvalueCutoff = 1
is that the results are first filtered on p-values (column pvalue
), 'surviving' sets will then be filtered for p.adjust
(thus using the same cutoff value!), and finally, the last surviving sets will be filtered for the qvalue
cutoff (default setting is qvalueCutoff = 0.2
).
On the help page of e.g. enrichGO
:
qvalueCutoff - qvalue cutoff on enrichment tests to report as significant. Tests must pass i) pvalueCutoff on unadjusted pvalues, ii) pvalueCutoff on adjusted pvalues and iii) qvalueCutoff on qvalues to be reported.
This also becomes clear when looking at the source code used for filtering,
The code for filtering is he internal function get_enriched
: https://github.com/YuLab-SMU/DOSE/blob/4bdf261db7e2d088ca588a564f1db5a85ba522b7/R/enricher_internal.R#L213-L234
Also:
- realize that q-values are less conservative than adjusted p-values; in other words for a gene set the adjusted p-value is always (slightly) smaller than the q-value.
- to extract the significant results, use
as.data.frame()
. This will only return the gene sets filtered by the (internal) functionget_enriched
linked to above. Directly accessing the results through the@
accessor is not recommended, since it will return the unfiltered (= all) results.
To illustrate this with some code (note the use of tail to show the last 6 gene sets (+ corresponding statistics) that 'survive' filtering :
> library(clusterProfiler)
>
> ## load example data
> data(geneList, package = "DOSE")
> de <- names(geneList)[1:100]
>
Then:
enrichGO example 1.
> ## keep only GO-CC terms that have **adjusted** p-value less than 0.05.
> ## note that all qvalues are also smaller than qvalueCutoff, but filtering is in the end
> ## thus NOT based on qvalue cutoff (but on cutoff for pvalue)
> ## also note that adjusted p-values are smaller then corresponding q-values.
> ## this last observations still is evident when checking the unfiltered results (using @).
>
> y1 <- enrichGO(de, 'org.Hs.eg.db', ont = "CC", pvalueCutoff = 0.05, pAdjustMethod = "BH", qvalueCutoff = 0.2)
>
> tail(as.data.frame(y1))
ID Description
GO:0072687 GO:0072687 meiotic spindle
GO:0000307 GO:0000307 cyclin-dependent protein kinase holoenzyme complex
GO:0000939 GO:0000939 inner kinetochore
GO:0000940 GO:0000940 outer kinetochore
GO:0043073 GO:0043073 germ cell nucleus
GO:0005828 GO:0005828 kinetochore microtubule
GeneRatio BgRatio RichFactor FoldEnrichment zScore pvalue
GO:0072687 2/100 15/19894 0.13333333 26.525333 7.029155 0.002516925
GO:0000307 3/100 56/19894 0.05357143 10.657500 5.143907 0.002816259
GO:0000939 2/100 17/19894 0.11764706 23.404706 6.568582 0.003238747
GO:0000940 2/100 20/19894 0.10000000 19.894000 6.008688 0.004480550
GO:0043073 3/100 66/19894 0.04545455 9.042727 4.651780 0.004483798
GO:0005828 2/100 21/19894 0.09523810 18.946667 5.848508 0.004936032
p.adjust qvalue geneID Count
GO:0072687 0.01353932 0.009775354 259266/6790 2
GO:0000307 0.01464454 0.010573322 9133/890/983 3
GO:0000939 0.01629821 0.011767263 79019/55839 2
GO:0000940 0.02119614 0.015303554 10403/220134 2
GO:0043073 0.02119614 0.015303554 7153/9319/6790 3
GO:0005828 0.02264768 0.016351562 1062/81930 2
>
> tail(y1@result)
ID Description GeneRatio BgRatio
GO:0034702 GO:0034702 monoatomic ion channel complex 1/100 344/19894
GO:0030139 GO:0030139 endocytic vesicle 1/100 350/19894
GO:0098984 GO:0098984 neuron to neuron synapse 1/100 360/19894
GO:0097060 GO:0097060 synaptic membrane 1/100 397/19894
GO:0098978 GO:0098978 glutamatergic synapse 1/100 409/19894
GO:0031252 GO:0031252 cell leading edge 1/100 426/19894
RichFactor FoldEnrichment zScore pvalue p.adjust qvalue
GO:0034702 0.002906977 0.5783140 -0.5607618 0.8259969 0.8531455 0.6159688
GO:0030139 0.002857143 0.5684000 -0.5790180 0.8312700 0.8531455 0.6159688
GO:0098984 0.002777778 0.5526111 -0.6088695 0.8397091 0.8561740 0.6181553
GO:0097060 0.002518892 0.5011083 -0.7136758 0.8674560 0.8787217 0.6344347
GO:0098978 0.002444988 0.4864059 -0.7459588 0.8753898 0.8810374 0.6361067
GO:0031252 0.002347418 0.4669953 -0.7904211 0.8858306 0.8858306 0.6395673
geneID Count
GO:0034702 2568 1
GO:0030139 8685 1
GO:0098984 6790 1
GO:0097060 2568 1
GO:0098978 6790 1
GO:0031252 81930 1
>
enrichGO example 2.
> ## keep only GO-CC terms that have **none-adjusted** p-value less than 0.05.
> ## note that all qvalues are also smaller than qvalueCutoff; but filtering is again NOT based on qvalue cutoff...
> ## note that p-values are equal to adjusted p-values (because pAdjustMethod="none");
> ## also note that **none-adjusted** p-values are in this case (still) smaller then corresponding q-values, and
> ## since cutoff for (none-adjusted) p-values is set to 0.05, still no filtering on qvalue cutoff occurs!
> y2 <- enrichGO(de, 'org.Hs.eg.db', ont = "CC", pvalueCutoff = 0.05, pAdjustMethod = "none", qvalueCutoff = 0.2)
>
> tail(as.data.frame(y2))
ID Description GeneRatio
GO:0001673 GO:0001673 male germ cell nucleus 2/100
GO:1902554 GO:1902554 serine/threonine protein kinase complex 3/100
GO:1902911 GO:1902911 protein kinase complex 3/100
GO:0005814 GO:0005814 centriole 3/100
GO:0001939 GO:0001939 female pronucleus 1/100
GO:0070938 GO:0070938 contractile ring 1/100
BgRatio RichFactor FoldEnrichment zScore pvalue p.adjust
GO:0001673 52/19894 0.03846154 7.651538 3.413614 0.02818283 0.02818283
GO:1902554 136/19894 0.02205882 4.388382 2.818212 0.03125759 0.03125759
GO:1902911 150/19894 0.02000000 3.978800 2.602872 0.03998797 0.03998797
GO:0005814 160/19894 0.01875000 3.730125 2.464440 0.04690189 0.04690189
GO:0001939 10/19894 0.10000000 19.894000 4.247715 0.04915537 0.04915537
GO:0070938 10/19894 0.10000000 19.894000 4.247715 0.04915537 0.04915537
qvalue geneID Count
GO:0001673 0.07214266 7153/9319 2
GO:1902554 0.07823538 9133/890/983 3
GO:1902911 0.09791104 9133/890/983 3
GO:0005814 0.11239646 7153/55165/6790 3
GO:0001939 0.11298871 890 1
GO:0070938 0.11298871 9055 1
>
>
enrichGO example 3.
> ## keep only GO-CC terms that have adjusted p-value less than **1**.
> ## note that **now** filtering on qvalue cutoff occurs!
> ## also note that adjusted p-values are calculated, but again, filtering is NOW based on q-values (because pvalueCutoff=1)
> y3 <- enrichGO(de, 'org.Hs.eg.db', ont = "CC", pvalueCutoff = 1, pAdjustMethod = "BH", qvalueCutoff = 0.2)
>
> tail(as.data.frame(y3))
ID Description GeneRatio BgRatio RichFactor
GO:0005902 GO:0005902 microvillus 2/100 97/19894 0.02061856
GO:0000932 GO:0000932 P-body 2/100 100/19894 0.02000000
GO:1902711 GO:1902711 GABA-A receptor complex 1/100 19/19894 0.05263158
GO:0005881 GO:0005881 cytoplasmic microtubule 2/100 106/19894 0.01886792
GO:1902710 GO:1902710 GABA receptor complex 1/100 21/19894 0.04761905
GO:0035371 GO:0035371 microtubule plus-end 1/100 24/19894 0.04166667
FoldEnrichment zScore pvalue p.adjust qvalue geneID
GO:0005902 4.101856 2.176663 0.08569445 0.2191530 0.1582279 8140/6286
GO:0000932 3.978800 2.122550 0.09024423 0.2261892 0.1633080 9582/3669
GO:1902711 10.470526 2.935491 0.09134562 0.2261892 0.1633080 2568
GO:0005881 3.753585 2.020383 0.09955066 0.2411197 0.1740878 81930/146909
GO:1902710 9.473333 2.761313 0.10046654 0.2411197 0.1740878 2568
GO:0035371 8.289167 2.539615 0.11397823 0.2694031 0.1945083 146909
Count
GO:0005902 2
GO:0000932 2
GO:1902711 1
GO:0005881 2
GO:1902710 1
GO:0035371 1
>
>
enrichGO example 4.
> ## keep only GO-CC terms that adjusted p-value less than 0.05.
> ## but now set **stringent** qvalue cutoff, e.g. 0.01 (default = 0.2)
> ## note that now, because of its stringency, filtering on qvalue occurs!
> ## also note that adjusted p-values are calculated, but again, filtering is
> ## in the end thus based on q-values.
> y4 <- enrichGO(de, 'org.Hs.eg.db', ont = "CC", pvalueCutoff = 0.05, pAdjustMethod = "BH", qvalueCutoff = 0.01)
>
> tail(as.data.frame(y4))
ID Description GeneRatio BgRatio
GO:0097431 GO:0097431 mitotic spindle pole 3/100 38/19894
GO:0000235 GO:0000235 astral microtubule 2/100 11/19894
GO:0005818 GO:0005818 aster 2/100 11/19894
GO:0000152 GO:0000152 nuclear ubiquitin ligase complex 3/100 47/19894
GO:0010369 GO:0010369 chromocenter 2/100 14/19894
GO:0072687 GO:0072687 meiotic spindle 2/100 15/19894
RichFactor FoldEnrichment zScore pvalue p.adjust
GO:0097431 0.07894737 15.70579 6.449375 0.0009149504 0.005947178
GO:0000235 0.18181818 36.17091 8.293204 0.0013357836 0.008014701
GO:0005818 0.18181818 36.17091 8.293204 0.0013357836 0.008014701
GO:0000152 0.06382979 12.69830 5.706995 0.0017020535 0.009834087
GO:0010369 0.14285714 28.42000 7.294687 0.0021884907 0.012193020
GO:0072687 0.13333333 26.52533 7.029155 0.0025169253 0.013539322
qvalue geneID Count
GO:0097431 0.004293846 259266/9212/6790 3
GO:0000235 0.005786593 81930/146909 2
GO:0005818 0.005786593 81930/146909 2
GO:0000152 0.007100184 991/11065/27338 3
GO:0010369 0.008803327 55143/9212 2
GO:0072687 0.009775354 259266/6790 2
>
>