Non-deterministic behavior for UniProt queries

Question

odietric opened this issue 3 years ago

I am mapping UniProt IDs to amino acid (AA) sequences and found that the same query can sometimes return different outputs. As a specific example, the query for UniProt ID P42328 usually returns one sequence, but sometimes (~1/10) it matches two sequences. After investigation, the second sequence corresponds to the ID F9XB45, which belongs to the gene MYCGRDRAFT_42328, hence the confusion. I am aware I can filter the results afterward by ID, but I believe this behavior is not expected at all.

Code to reproduce this behaviour:

    import io
    import pandas as pd
    from collections import Counter
    from bioservices import UniProt

    u = UniProt()
    n_results = []
    for i in range(100):
        res = u.search("P42328", frmt="tab", columns="id, sequence")
        # Store results in pandas dataframe
        df_res = pd.read_csv(io.StringIO(res), sep="\t")
        n_results.append(len(df_res))
    print(Counter(n_results))

This prints something like Counter({1: 85, 2: 15}); the exact counts vary between runs.

Thomas Cokelaer · Answer 1 · Tue Jan 18 2022 07:35:05 GMT+0800 (China Standard Time)

@odietric thanks for reporting this issue.

Indeed, that is quite surprising behaviour. I had never noticed it, but it looks like that is how the UniProt service works for now.

I am not sure whether this is an intended feature on the UniProt side or a regression on the server side.

For now, I recommend using the get_df function. It will return the matching entries (1 or 2), and from there you can filter on the Entry column:

    u.get_df("P42328").query("Entry=='P42328'")

Of course this is not ideal, but it could help you for now. I won't fix the issue (lack of time), but a new UniProt API may be on the way, so I will leave the issue open and look at it later. Best
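
The same exact-match filter can also be applied to the tab-separated text returned by search, as in the question. What follows is only a minimal sketch using the standard library: the helper name keep_exact_accession is hypothetical, the accession column is assumed to be called Entry (as in the get_df output above), and the sample text and its sequences are made-up placeholders rather than real UniProt output.

```python
import csv
import io


def keep_exact_accession(tsv_text, accession, column="Entry"):
    """Return only the rows whose accession column equals `accession` exactly.

    `column="Entry"` is an assumption about the header name in the
    tab-separated output; adjust it if your header differs.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if row[column] == accession]


# Placeholder text standing in for a response that matched two entries.
# The sequences below are invented for illustration only.
sample = "Entry\tSequence\nP42328\tMAAAA\nF9XB45\tMKKKK\n"

filtered = keep_exact_accession(sample, "P42328")
print(filtered)
```

Because the comparison is an exact string match on the accession, a hit that only matched through its gene name (such as MYCGRDRAFT_42328) is dropped, whether the server returns one row or two.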