widdowquinn / ncfp

Program and package that retrieves nucleotide coding sequences from NCBI that correspond to a set of input protein sequences.

Home Page: https://widdowquinn.github.io/ncfp/

Upgrade schema/interface for cache

widdowquinn opened this issue

Summary:

Cache access is currently handled through hard-coded SQL statements. This should be reimplemented on top of an ORM, using SQLAlchemy.

Also, UniProt can return multiple EMBL entries for a single protein sequence. The current schema has (accession, aa_query, nt_query) as a row in the seqdata table, with accession as primary key. This permits only one aa/nt query string per record ID. When upgrading, we should revise the schema so that there's a 1:* relationship between accession and each query type.
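A minimal sketch of what the revised schema might look like with SQLAlchemy's declarative ORM (assuming SQLAlchemy 1.4 or later). Only the `seqdata` table name and the `accession` key come from the current schema; the `query` table, its columns (`query_type`, `term`, `rank`) and the class names are hypothetical and only illustrate the 1:* relationship.

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()


class SeqData(Base):
    """One row per input protein sequence, keyed by accession."""

    __tablename__ = "seqdata"

    accession = Column(String, primary_key=True)
    # 1:* relationship: one accession may own many query terms.
    queries = relationship(
        "Query",
        back_populates="seqdata",
        order_by="Query.rank",
        cascade="all, delete-orphan",
    )


class Query(Base):
    """One row per query term (hypothetical table, not in the current cache)."""

    __tablename__ = "query"

    query_id = Column(Integer, primary_key=True)
    accession = Column(String, ForeignKey("seqdata.accession"), nullable=False)
    query_type = Column(String, nullable=False)  # assumed values: "aa" or "nt"
    term = Column(String, nullable=False)
    rank = Column(Integer, nullable=False)  # order in which terms are tried
    seqdata = relationship("SeqData", back_populates="queries")


# In-memory SQLite database, for illustration only.
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    record = SeqData(accession="tr|A0A0A7CPC4|A0A0A7CPC4_9STRA/83-161")
    record.queries = [
        Query(query_type="nt", term="KM038907", rank=0),
        Query(query_type="nt", term="JNBR01001935", rank=1),
    ]
    session.add(record)
    session.commit()
    # Both query terms are now retained for the same accession.
    stored_terms = [q.term for q in session.get(SeqData, record.accession).queries]
```

With a layout along these lines, a second EMBL cross-reference becomes an extra row in the child table rather than being discarded.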

This is a more serious problem than I first thought. Some of the cross-references lead to a record where we can recover the coding sequence, but some do not. This means that results may be inconsistent over time.

We should try each query term in order, and record whether the coding sequence is recovered. If it is recovered, we can move on to the next input sequence; if not, we try the next query term (until we run out).
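The retry logic described above could be sketched roughly as follows. The function name `recover_cds` and the `fetch_cds` callable are hypothetical placeholders, not part of the existing ncfp code; `fetch_cds` stands in for whatever lookup returns a coding sequence, or `None` when none can be recovered.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def recover_cds(
    query_terms: Iterable[str],
    fetch_cds: Callable[[str], Optional[str]],
) -> Tuple[Optional[str], List[Tuple[str, bool]]]:
    """Try each query term in order until a coding sequence is recovered.

    Returns the recovered sequence (or None if every term failed) and a
    list of (term, recovered) pairs recording the outcome of each attempt.
    """
    attempts: List[Tuple[str, bool]] = []
    for term in query_terms:
        cds = fetch_cds(term)
        attempts.append((term, cds is not None))
        if cds is not None:
            # Success: stop here and move on to the next input sequence.
            return cds, attempts
    # Ran out of query terms without recovering a coding sequence.
    return None, attempts


# Stub lookup for illustration only; the sequence is a placeholder,
# not real data from either record.
stub_results = {"KM038907": None, "JNBR01001935": "ATGGCC"}
cds, attempts = recover_cds(["KM038907", "JNBR01001935"], stub_results.get)
```

Storing the `attempts` list in the cache would let later runs skip terms already known to fail, which should help keep results consistent over time.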

As a practical example:

[DEBUG] [ncbi_cds_from_protein.sequences]: Adding record tr|A0A0A7CPC4|A0A0A7CPC4_9STRA/83-161 to cache with query KM038907
[DEBUG] [ncbi_cds_from_protein.sequences]: Adding record tr|A0A0A7CPC4|A0A0A7CPC4_9STRA/83-161 to cache with query JNBR01001935
[WARNING] [ncbi_cds_from_protein.sequences]: Additional query terms found for tr|A0A0A7CPC4|A0A0A7CPC4_9STRA/83-161: JNBR01001935 (not used)

The JNBR01001935 record has a CDS which we can readily cross-reference to the input sequence, but the KM038907 record does not: