Upgrade schema/interface for cache

Question

widdowquinn opened this issue 4 years ago

Leighton Pritchard commented 4 years ago

Summary:

Cache access is currently handled through hard-coded SQL statements. This should be reimplemented using SQLAlchemy's ORM.

Also, UniProt can return multiple EMBL entries for a single protein sequence. The current schema stores (accession, aa_query, nt_query) as a row in the seqdata table, with accession as the primary key. This permits only one aa query string and one nt query string per accession. When upgrading, we should revise the schema so that there's a one-to-many (1:*) relationship between accession and each query type.
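To make the proposed one-to-many layout concrete, here is a minimal sketch using SQLAlchemy's declarative ORM (1.4 or later). Only the seqdata table name and the accession key come from the issue; the query_term table, the QueryTerm class, and the kind, term and rank columns are assumptions for illustration, not the project's actual schema.

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()


class SeqData(Base):
    """One row per input sequence accession."""

    __tablename__ = "seqdata"

    accession = Column(String, primary_key=True)
    # One accession maps to many query terms, loaded in stored rank order.
    queries = relationship(
        "QueryTerm",
        back_populates="seqdata",
        order_by="QueryTerm.rank",
        cascade="all, delete-orphan",
    )


class QueryTerm(Base):
    """Hypothetical child table: one row per (accession, query term)."""

    __tablename__ = "query_term"

    id = Column(Integer, primary_key=True)
    accession = Column(String, ForeignKey("seqdata.accession"), nullable=False)
    kind = Column(String, nullable=False)   # assumed values: "aa" or "nt"
    term = Column(String, nullable=False)   # e.g. an EMBL/NCBI accession
    rank = Column(Integer, nullable=False)  # order in which terms are tried
    seqdata = relationship("SeqData", back_populates="queries")


def demo_terms():
    """Store two nt query terms for one record and read them back in order."""
    engine = create_engine("sqlite:///:memory:")
    Base.metadata.create_all(engine)
    record_id = "tr|A0A0A7CPC4|A0A0A7CPC4_9STRA/83-161"
    with Session(engine) as session:
        record = SeqData(accession=record_id)
        record.queries = [
            QueryTerm(kind="nt", term="KM038907", rank=0),
            QueryTerm(kind="nt", term="JNBR01001935", rank=1),
        ]
        session.add(record)
        session.commit()
        stored = session.get(SeqData, record_id)
        return [q.term for q in stored.queries]
```

A single child table with a kind column is only one option; separate tables for aa and nt queries would give the same 1:* relationship per query type.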

Leighton Pritchard · Answer 1 · Mon Jul 27 2020 21:26:11 GMT+0800 (China Standard Time)

This is a more serious problem than I first thought. Some of the cross-references lead to a record where we can recover the coding sequence, but some do not. This means that results may be inconsistent over time.

We should try each query term in order, and record whether the coding sequence is recovered. If it is recovered, we can move on to the next input sequence; if not, we try the next query term (until we run out).
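The try-in-order strategy could look roughly like the sketch below. The function name recover_cds and the fetch_cds callable are hypothetical stand-ins for whatever lookup the tool performs; fetch_cds is assumed to return None when no coding sequence can be recovered for a term.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def recover_cds(
    terms: Iterable[str],
    fetch_cds: Callable[[str], Optional[str]],
) -> Tuple[Optional[str], List[Tuple[str, bool]]]:
    """Try each query term in order until one yields a coding sequence.

    Returns the recovered sequence (or None if every term failed) together
    with a record of which terms were tried and whether each succeeded.
    """
    attempts: List[Tuple[str, bool]] = []
    for term in terms:
        cds = fetch_cds(term)
        recovered = cds is not None
        attempts.append((term, recovered))
        if recovered:
            # Stop here; the caller moves on to the next input sequence.
            return cds, attempts
    # Ran out of query terms without recovering a coding sequence.
    return None, attempts


# Usage with a stand-in lookup. The sequence below is a made-up placeholder,
# not real data from either record.
fake_records = {"JNBR01001935": "ATGAAATAA"}
cds, attempts = recover_cds(["KM038907", "JNBR01001935"], fake_records.get)
```

Keeping the per-term outcome alongside the result means the cache can store which term actually worked, so later runs need not depend on the order in which cross-references are returned.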

As a practical example:

[DEBUG] [ncbi_cds_from_protein.sequences]: Adding record tr|A0A0A7CPC4|A0A0A7CPC4_9STRA/83-161 to cache with query KM038907
[DEBUG] [ncbi_cds_from_protein.sequences]: Adding record tr|A0A0A7CPC4|A0A0A7CPC4_9STRA/83-161 to cache with query JNBR01001935
[WARNING] [ncbi_cds_from_protein.sequences]: Additional query terms found for tr|A0A0A7CPC4|A0A0A7CPC4_9STRA/83-161: JNBR01001935 (not used)

The JNBR01001935 record has a CDS which we can readily cross-reference to the input sequence, but the KM038907 record does not: