ggavelis / HGT_v_Contamination_assessor

How can we discriminate contaminants from horizontal gene transfer (HGT)? Alien indices (AI) are often used to screen out foreign sequences, but can 'overclean' by removing bona fide HGT. This script uses metadata about each DNA/AA sequence (i.e. whether it is spliced, has a polyA tail or a spliced leader) to assess the extent to which AI-based cleaning is removing legitimate HGT.

HGT_v_Contamination_assessor

This script takes existing metadata about each DNA/AA sequence, and uses that--in combination with an alien index value--to determine whether each sequence should be flagged as a contaminant.

Inputs

  1. (parameter) AI cutoff used for screening. (default AI_cutoff = 0.01)
  2. (file) FASTA file to decontaminate
  3. (file) A 'supertsv' metadata file that contains the following fields:
    (seq_id | alien_index_value | num_splice_variants | lineage_of_best_BLAST_hit | spliced_leader[True/False] | polyA_tail[True/False])
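As a rough illustration of the metadata input, the sketch below reads one record per line into a small dataclass. It assumes the 'supertsv' is tab-separated, has no header row, and lists the columns in the order shown above; none of this is confirmed by the script itself. `SeqMeta`, `parse_bool` and `read_supertsv` are hypothetical names chosen for this example.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class SeqMeta:
    """One row of the metadata file (hypothetical container, not from the script)."""
    seq_id: str
    alien_index: float
    num_splice_variants: int
    best_hit_lineage: str
    spliced_leader: bool
    polya_tail: bool


def parse_bool(text):
    # Assumption: booleans are written literally as "True" or "False".
    return text.strip().lower() == "true"


def read_supertsv(handle):
    """Parse tab-separated rows, assuming the column order listed under Inputs."""
    records = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        records.append(
            SeqMeta(
                seq_id=row[0],
                alien_index=float(row[1]),
                num_splice_variants=int(row[2]),
                best_hit_lineage=row[3],
                spliced_leader=parse_bool(row[4]),
                polya_tail=parse_bool(row[5]),
            )
        )
    return records


# Made-up rows, for illustration only.
example = (
    "seq1\t0.50\t1\tBacteria\tFalse\tFalse\n"
    "seq2\t0.50\t3\tBacteria\tFalse\tTrue\n"
)
records = read_supertsv(io.StringIO(example))
```

With a real file, the same function could be called on an open file handle instead of the in-memory string.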

Rationale

Alien indices can be used as heuristics to infer whether a sequence is likely to be native or foreign (e.g. a contaminant or HGT). But decontaminating a 'dirty' dataset based on AI alone is inadvisable, since this approach is also likely to remove bona fide HGT. To mitigate this problem of 'overcleaning,' I have broken AI cleaning into two steps.

  1. A first-pass "flagging" step that flags all sequences whose AI exceeds the cutoff.
  2. A second-pass "rescue" step that uses sequence metadata to redeem certain sequences. For example:
    A. Any sequence with a dinoflagellate spliced-leader is unflagged as native.
    B. Any sequence whose best BLAST hit is to prokaryotes is unflagged if it has:
          i. A poly-A tail
          ii. Multiple splice isoforms
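The two passes above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code. It makes several assumptions: a sequence is flagged when its AI is strictly greater than the cutoff, a prokaryotic best hit is recognised by a crude substring match on the lineage, and either a poly-A tail or more than one splice variant is enough for rescue (the text does not say whether both are required). All function and key names are hypothetical.

```python
AI_CUTOFF = 0.01  # default cutoff listed under Inputs

# Assumption: lineage strings name the domain, e.g. "Bacteria; Proteobacteria".
PROKARYOTE_TAGS = ("bacteria", "archaea")


def is_prokaryote(lineage):
    lowered = lineage.lower()
    return any(tag in lowered for tag in PROKARYOTE_TAGS)


def first_pass_flag(meta, cutoff=AI_CUTOFF):
    """Step 1: flag any sequence whose alien index exceeds the cutoff."""
    return meta["alien_index"] > cutoff


def second_pass_rescue(meta):
    """Step 2: decide whether a flagged sequence should be unflagged."""
    if meta["spliced_leader"]:
        return True  # rule A: spliced leader marks the sequence as native
    if is_prokaryote(meta["lineage"]):
        # Rule B. Assumption: either feature alone is sufficient.
        return meta["polya_tail"] or meta["num_splice_variants"] > 1
    return False


def classify(meta, cutoff=AI_CUTOFF):
    if not first_pass_flag(meta, cutoff):
        return "kept"
    if second_pass_rescue(meta):
        return "rescued"
    return "contaminant"


# Made-up records, for illustration only.
below_cutoff = {"alien_index": 0.005, "lineage": "Bacteria", "spliced_leader": False,
                "polya_tail": False, "num_splice_variants": 1}
with_leader = {"alien_index": 0.5, "lineage": "Eukaryota; Fungi", "spliced_leader": True,
               "polya_tail": False, "num_splice_variants": 1}
bare_bacterial = {"alien_index": 0.5, "lineage": "Bacteria; Proteobacteria",
                  "spliced_leader": False, "polya_tail": False, "num_splice_variants": 1}
polya_bacterial = dict(bare_bacterial, polya_tail=True)

labels = [classify(m) for m in (below_cutoff, with_leader, bare_bacterial, polya_bacterial)]
```

Keeping the flagging and rescue steps as separate functions makes it easy to count how many flagged sequences the rescue step recovers, which is the 'overcleaning' this script sets out to measure.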

This script also gathers metrics about the frequency of HGT from various groups.
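The README does not say which metrics are gathered. One plausible tally, shown here purely as an assumption, is the share of rescued (putative HGT) sequences contributed by each best-hit lineage. The function name and the `(lineage, status)` input shape are hypothetical.

```python
from collections import Counter


def hgt_frequency_by_group(classified):
    """Return the fraction of rescued sequences attributed to each lineage.

    `classified` is an iterable of (lineage, status) pairs, where status
    "rescued" marks a sequence treated as putative HGT (hypothetical labels).
    """
    rescued = Counter(lineage for lineage, status in classified if status == "rescued")
    total = sum(rescued.values())
    if total == 0:
        return {}
    return {lineage: count / total for lineage, count in rescued.items()}


# Made-up classifications, for illustration only.
classified = [
    ("Bacteria", "rescued"),
    ("Bacteria", "rescued"),
    ("Archaea", "rescued"),
    ("Fungi", "contaminant"),
]
frequencies = hgt_frequency_by_group(classified)
```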

Languages

Language:Python 100.0%