lenaschimmel / sc2rf

SARS-Cov-2 Recombinant Finder for fasta sequences

ENH: Output full internal representation of analysis result for sharing without need for recomputation

corneliusroemer opened this issue · comments

The analysis is a one-off at the moment: to see the result again, I have to rerun the whole analysis.

That's fine if analysing only 1000 samples or so, but if one were to run it on all of GISAID or even only 50k samples, the waiting would be very inconvenient.

It would be nice if the act of analysis and the act of viewing were independent.

So one could run the analysis on a server, download the results and view locally.

All that's required, I think, would be to output the internal representation of whatever you use to create the terminal output. One could start off with that simply being a pickled Python object.

Or alternatively, turn it into JSON to make the output human readable and also usable for machine analysis.

As a result, one could run sc2rf with --output sc2rf_analysis.json to get the analysis result, share that file, and view it with --precomputed sc2rf_analysis.json.
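
A minimal sketch of what the JSON route could look like, assuming the per-sample result can be flattened into plain fields. The SampleResult record, its field names, and the save_results / load_results helpers are all hypothetical and only illustrate the round trip; sc2rf's real internal representation may look quite different.

```python
import json
from dataclasses import dataclass, asdict, field


# Hypothetical per-sample record, NOT sc2rf's actual data structure.
@dataclass
class SampleResult:
    name: str
    breakpoints: int
    matched_lineages: list = field(default_factory=list)


def save_results(results, path):
    """Write a list of SampleResult objects to a human-readable JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump([asdict(r) for r in results], fh, indent=2)


def load_results(path):
    """Read the JSON file back into SampleResult objects, without recomputation."""
    with open(path, encoding="utf-8") as fh:
        return [SampleResult(**entry) for entry in json.load(fh)]
```

Unlike a pickle, such a file can be inspected in a text editor and read by other tools, at the cost of having to define an explicit schema.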

Your issue can be interpreted in two different ways, which both make sense. Given the separation of analysis and viewing, it depends on what viewing actually means.

Static output

If you just want to save the static, fixed output to a file for later viewing, you can already do this with built-in terminal features:

# analysing:
./sc2rf.py --ansi some.fasta > result.ansi

# viewing (possibly on another computer):
cat result.ansi

Here, the --ansi parameter is not strictly needed; it just makes sure that no UTF-8 characters are used, so that the result is a proper ANSI file. The terminal doesn't care, but tools like ansilove (see #9) do.

Further analysis

It might make sense to really store data structures, and not just colorful ansi text, if the later invocation with --precomputed is not just a static display, but allows further analysis, filtering, sorting, different viewing settings… What data structure is needed will depend on which steps will be taken before saving the file, and which ones afterward.

I think this partly overlaps with the existing issues, and some more issues which only live in my head and have not yet been assigned numbers :) Most importantly:

  • #19
  • #24
  • Output structured results: fasta of all sequences that match the criteria, which enables efficient multi-pass strategies
  • interactive mode, for filtering, reordering, etc.

Continued: my assumption is that even when you put in 50k samples, you would like to filter that down to a few dozen or a few hundred. If sc2rf just output a filtered fasta (or maple) file, these could be handled quite efficiently, without the need for another complicated data structure or file format, and without potential loss of any details.

(That's how I actually work with my sequences, and how I created the fastas in my shared-sequences repository. Sc2rf can't do it alone yet; I used some copy-paste action, a text editor and a modified version of this simple NodeJS script for this.)
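
To make the filtered-fasta idea concrete, here is a rough sketch of such a filtering step in Python. It is not part of sc2rf and not the NodeJS script mentioned above; read_fasta and filter_fasta are hypothetical helpers, and the sketch assumes that the sequences to keep are identified by their full header line (without the leading ">").

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_fasta(lines, keep_names):
    """Return FASTA text with only the records whose header is in keep_names."""
    kept = []
    for header, sequence in read_fasta(lines):
        if header in keep_names:
            kept.append(f">{header}\n{sequence}\n")
    return "".join(kept)
```

For example, passing an open file object as lines and a set of sample names as keep_names yields a much smaller fasta that can be fed into a second, more detailed pass.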

Do you have any use cases in mind that are not solved by outputting a filtered fasta file?