CarveMe doesn’t satisfy the original protein complex definitions during the building of GPRs
lazzarigioele opened this issue · comments
Using carveme v1.5.2.
My opinion reading your code is that there is no real check whether a particular protein complex, as it is defined in a particular model for a particular reaction, is really satisfied or not. This in my opinion can lead to wrong GPRs.
I report here below your code contained in carveme/reconstruction/scoring.py , converting the table 'gene_scores' in 'protein_scores', and finally converting 'protein_scores' in 'reaction_scores'. Then I try to give a practical example.
From gene to protein scores:
def merge_subunits(genes):
genes = genes.dropna()
if len(genes) == 0:
return None
else:
protein = ' and '.join(sorted(genes))
if len(genes) > 1:
return '(' + protein + ')'
else:
return protein
def merge_subunit_scores(scores):
return scores.fillna(0).mean()
protein_scores = gene_scores.groupby(['protein', 'reaction', 'model'], as_index=False) \
.agg({'query_gene': merge_subunits, 'score': merge_subunit_scores})
protein_scores.rename(columns={'query_gene': 'GPR'}, inplace=True)
From protein to reaction scores:
def merge_proteins(proteins):
proteins = set(proteins.dropna())
if not proteins:
return None
gpr_str = ' or '.join(sorted(proteins))
if len(proteins) > 1:
return '(' + gpr_str + ')'
else:
return gpr_str
def merge_protein_scores(scores):
return scores.max(skipna=True)
reaction_scores = protein_scores.groupby(['reaction'], as_index=False) \
.agg({'GPR': merge_proteins, 'score': merge_protein_scores}).dropna()
avg_score = reaction_scores.query('score > 0')['score'].median()
reaction_scores['normalized_score'] = (reaction_scores['score'] / avg_score).apply(lambda x: round(x, 1))
Suppose now this is my 'gene_scores' table:
query_gene | BiGG_gene | score | gene | protein | reaction | model |
---|---|---|---|---|---|---|
NaN | STM_v1_0.STM0970 | NaN | G_STM0970 | P_STM0970+STM0973 | R_PFL | STM_v1_0 |
gene_7500 | STM_v1_0.STM0843 | 784.0 | G_STM0843 | P_STM0843+STM0844 | R_PFL | STM_v1_0 |
gene_18818 | STM_v1_0.STM0973 | 864.0 | G_STM0973 | P_STM0970+STM0973 | R_PFL | STM_v1_0 |
NaN | STM_v1_0.STM0970 | NaN | G_STM0970 | P_STM0970+STM3241 | R_PFL | STM_v1_0 |
gene_18818 | STM_v1_0.STM3241 | 846.0 | G_STM3241 | P_STM0970+STM3241 | R_PFL | STM_v1_0 |
NaN | STM_v1_0.STM4114 | NaN | G_STM4114 | P_STM4114+STM4115 | R_PFL | STM_v1_0 |
NaN | STM_v1_0.STM4115 | NaN | G_STM4115 | P_STM4114+STM4115 | R_PFL | STM_v1_0 |
gene_13306 | STM_v1_0.STM0844 | 156.0 | G_STM0844 | P_STM0843+STM0844 | R_PFL | STM_v1_0 |
Applying the above code would result in this 'protein_score' table:
protein | reaction | model | GPR | score |
---|---|---|---|---|
P_STM0843+STM0844 | R_PFL | STM_v1_0 | (gene_13306 and gene_7500) | 470.0 |
P_STM0970+STM0973 | R_PFL | STM_v1_0 | gene_18818 | 432.0 |
P_STM0970+STM3241 | R_PFL | STM_v1_0 | gene_18818 | 423.0 |
P_STM4114+STM4115 | R_PFL | STM_v1_0 | None | 0.0 |
And finally in this 'reaction_score' table:
reaction | GPR | score | normalized_score |
---|---|---|---|
R_PFL | ((gene_13306 and gene_7500) or gene_18818) | 470.0 | 1.0 |
As you can read , the final GPR is ((gene_13306 and gene_7500) or gene_18818)
.
The member “(gene_13306 and gene_7500)” is correct, as the both the components of the complex P_STM0843+STM0844 were well matched by Diamond.
Instead, the member “gene_18818” should not appear in the final GPR in my opinion, because neither the complex P_STM0970+STM0973 nor the complex P_STM0970+STM3241 were satisfied, since Diamond didn’t really catch the gene STM_v1_0.STM0970 (score = Nan).
This simplified example pretend to work with a 'gene_scores' table having just 1 model with 1 reaction, anyway I think it is enough to highlight the issue.
With a real-life 'gene_scores' table, there is the chance that other models containing the same reaction can make the final GPR actually correct, balancing the errors. But it is just a chance...
Maybe it could be useful to have a dedicated option, like carve --strictgpr
, if the user wants the original protein complex definitions to be strictly satisfied.