biothings / mygene.info

MyGene.info: A BioThings API for gene annotations

Home Page: http://mygene.info

Genome Interval Queries Responses don't seem to be consistent

tomkp75 opened this issue

Hello,

Thanks a lot for this great API.

I noticed that the following request returns different results when the page is refreshed:
https://mygene.info/v3/query/?q=chr1:11869-14409&species=human&limit=1

In this particular case it switches between WASH7P and DDX11L1.

Thanks,

Tom

Thank you for contacting us. That's a very acute observation, most likely a result of the internal workings of the distributed database system we use. Each request may reach a different server replica, which computes the score of a query on its own, taking server-specific statistics into account, and in rare circumstances this results in inconsistent scoring. The query endpoint's normal presentation is more or less for data exploration. Once you have decided on a specific query, you can use the fetch all feature to lock onto a frozen view of the data for consistent retrieval (with pagination).

Thanks @namespacestd0. I believe your suggestion is to use the parameter fetch_all=TRUE, is that correct? In that case the issue remains.
For example: https://mygene.info/v3/query/?q=chr1:11869-14409&species=human&fetch_all=TRUE

Do you mean that the results are inconsistent across fetch all calls, or that paginating with the scroll ID provided by a fetch all call does not retrieve information consistently?

Results across the fetch all calls are inconsistent. I'm not using pagination in this example.

That's expected. What I meant was that if inconsistent scoring prevented you from getting all the results through pagination, you can use fetch all to lock onto one version and page through it. We'll leave this issue open and evaluate the cost of ensuring consistent scoring to determine whether we can introduce this additional guarantee in the future.
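
The fetch all workflow described in this thread could be sketched in Python roughly as follows. This is a minimal sketch, not the maintainers' code. It assumes that the first `fetch_all=TRUE` response carries a `_scroll_id` field, that later pages are requested with a `scroll_id` parameter, and that the end of the results is signalled by a response without `hits` (possibly delivered with an HTTP error status). The helper names `http_get_json` and `iter_all_hits` are hypothetical.

```python
import json
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import urlopen

QUERY_URL = "https://mygene.info/v3/query"


def http_get_json(url):
    """Fetch a URL and decode its JSON body.

    Assumption: an error status may still carry a JSON body (for example
    when a scroll is exhausted), so that body is returned instead of raising.
    """
    try:
        with urlopen(url) as resp:
            return json.load(resp)
    except HTTPError as err:
        return json.load(err)


def iter_all_hits(q, species="human", get_json=http_get_json):
    """Yield every hit of one fetch_all request by following its scroll ID."""
    params = {"q": q, "species": species, "fetch_all": "TRUE"}
    page = get_json(QUERY_URL + "?" + urlencode(params))
    scroll_id = None
    while page.get("hits"):
        yield from page["hits"]
        # Keep the last known scroll ID in case a later page omits the field.
        scroll_id = page.get("_scroll_id", scroll_id)
        if scroll_id is None:
            break
        page = get_json(QUERY_URL + "?" + urlencode({"scroll_id": scroll_id}))
```

Calling `list(iter_all_hits("chr1:11869-14409"))` would page through one frozen view of the results. As noted above, two separate fetch all calls may still return hits in a different order.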

I understand now. That wouldn't work for me, as I'm implementing an automated process and the top match could be wrong in the example I provided. I also see that you mentioned "The query endpoint's normal presentation is more or less for data exploration".

I see. We'll continue to explore the possibility of providing stable scoring. Meanwhile, another option is to add a custom sorting parameter:
https://mygene.info/v3/query/?q=chr1:11869-14409&species=human&sort=entrezgene
I think it could make sense for interval queries, but I understand that it may or may not be practical in your automated process, depending on other factors.

That probably makes more sense considering entrezgene is a string field, not suitable for sorting.
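
For an automated process, one way to avoid depending on server-side ordering at all is to request every hit in the interval (without `limit=1`) and order the hits locally. The following is a minimal sketch under that assumption, not something the maintainers proposed. The helper names `stable_order` and `first_stable_hit` are hypothetical, and the `_id` field is assumed to be present on each hit. Values are compared as strings, so the order is lexicographic, not numeric.

```python
def stable_order(hits, key="_id"):
    """Return hits sorted by a field that does not depend on the query score."""
    # Missing values sort first; str() keeps mixed int/str values comparable.
    return sorted(hits, key=lambda hit: str(hit.get(key, "")))


def first_stable_hit(hits, key="_id"):
    """Pick the same hit on every run, regardless of the order received."""
    ordered = stable_order(hits, key)
    return ordered[0] if ordered else None
```

This gives a reproducible choice between overlapping genes such as WASH7P and DDX11L1, though not necessarily the biologically most relevant one. Which gene should win for a given interval remains a decision for the calling process.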

The above solutions should be practical enough to address this issue.