Upgrade schema/interface for cache
widdowquinn opened this issue · comments
Summary:
The current cache access is handled through hard-coded SQL statements. This should be reimplemented as an ORM using SQLAlchemy.
Also, UniProt can return multiple EMBL entries for a single protein sequence. The current schema has (accession, aa_query, nt_query) as a row in the seqdata table, with accession as primary key. This permits only one aa/nt query string per record ID. When upgrading, we should revise the schema so that there's a 1:* relationship between accession and each query type.
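As a rough illustration of the 1:* layout, here is a minimal sketch using only the standard-library sqlite3 module, so it runs without extra dependencies. It is not the SQLAlchemy ORM proposed above, though the same two tables could be mapped onto ORM models. The seqquery table, its columns, and the add_query/get_queries helpers are hypothetical names, and treating the EMBL accessions from the log as nt queries is an assumption.

```python
import sqlite3

# Hypothetical revised schema: seqdata keeps one row per accession, and
# seqquery holds any number of query terms per accession and query type.
SCHEMA = """
CREATE TABLE seqdata (
    accession TEXT PRIMARY KEY
);
CREATE TABLE seqquery (
    query_id   INTEGER PRIMARY KEY,
    accession  TEXT NOT NULL REFERENCES seqdata(accession),
    query_type TEXT NOT NULL CHECK (query_type IN ('aa', 'nt')),
    term       TEXT NOT NULL,
    UNIQUE (accession, query_type, term)
);
"""


def add_query(conn, accession, query_type, term):
    """Register a query term for an accession, creating the parent row if needed."""
    conn.execute(
        "INSERT OR IGNORE INTO seqdata (accession) VALUES (?)", (accession,)
    )
    conn.execute(
        "INSERT OR IGNORE INTO seqquery (accession, query_type, term) "
        "VALUES (?, ?, ?)",
        (accession, query_type, term),
    )


def get_queries(conn, accession, query_type):
    """Return all query terms of one type for an accession, in insertion order."""
    rows = conn.execute(
        "SELECT term FROM seqquery WHERE accession = ? AND query_type = ? "
        "ORDER BY query_id",
        (accession, query_type),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

acc = "tr|A0A0A7CPC4|A0A0A7CPC4_9STRA/83-161"
add_query(conn, acc, "nt", "KM038907")      # assumed to be an nt query
add_query(conn, acc, "nt", "JNBR01001935")  # second term is kept, not dropped
print(get_queries(conn, acc, "nt"))
```

With this layout a second query term for the same accession becomes an extra row rather than a conflict on the primary key.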
This is a more serious problem than I first thought. Some of the cross-references lead to a record where we can recover the coding sequence, but some do not. This means that results may be inconsistent over time.
We should try each query term in order, and record whether the coding sequence is recovered. If recovered we can move on to the next input sequence; if not recovered, we try the next query term (until we run out).
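The try-in-order logic could look something like the sketch below. The function name recover_cds and the fetch_cds callable are hypothetical: fetch_cds stands in for whatever lookup retrieves a coding sequence for one query term, and is assumed to return None when no CDS can be recovered.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def recover_cds(
    query_terms: Iterable[str],
    fetch_cds: Callable[[str], Optional[str]],
) -> Tuple[Optional[str], List[Tuple[str, bool]]]:
    """Try each query term in order until one yields a coding sequence.

    Returns the recovered CDS (or None if every term failed) together with
    a list of (term, recovered) pairs, so the outcome of each attempt can
    be recorded in the cache.
    """
    attempts: List[Tuple[str, bool]] = []
    for term in query_terms:
        cds = fetch_cds(term)  # hypothetical lookup; None means "no CDS"
        attempts.append((term, cds is not None))
        if cds is not None:
            # Success: stop here and move on to the next input sequence.
            return cds, attempts
    # Ran out of query terms without recovering a coding sequence.
    return None, attempts


# Stub lookup with dummy data, standing in for a real remote query.
stub_results = {"term_without_cds": None, "term_with_cds": "ATGAAATAA"}
cds, attempts = recover_cds(
    ["term_without_cds", "term_with_cds", "never_tried"], stub_results.get
)
print(cds, attempts)
```

Recording the per-term outcome, rather than only the final result, is what would let a later run skip terms already known to fail and keep results consistent over time.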
As a practical example:
[DEBUG] [ncbi_cds_from_protein.sequences]: Adding record tr|A0A0A7CPC4|A0A0A7CPC4_9STRA/83-161 to cache with query KM038907
[DEBUG] [ncbi_cds_from_protein.sequences]: Adding record tr|A0A0A7CPC4|A0A0A7CPC4_9STRA/83-161 to cache with query JNBR01001935
[WARNING] [ncbi_cds_from_protein.sequences]: Additional query terms found for tr|A0A0A7CPC4|A0A0A7CPC4_9STRA/83-161: JNBR01001935 (not used)
The JNBR01001935 record has a CDS which we can readily cross-reference to the input sequence, but the KM038907 record does not: