a-slide / pycoQC

pycoQC computes metrics and generates interactive QC plots from the sequencing summary report generated by Oxford Nanopore Technologies basecallers (Albacore/Guppy).

Home Page: https://a-slide.github.io/pycoQC/

pycoQC looks for 'read_len' column in sequencing_summary. It isn't there

DrOllyGomez opened this issue

Describe the bug
see traceback attached.
traceback.txt

The sequencing_summary.txt file was produced in the following way:

  • multi_to_single_fast5
  • Poreplex driving Albacore 2.3.4
  • all of that done in parallel, then the individual sequencing_summary.txt files collected and simply concatenated.

Output of 'head' command on sequencing_summary.txt attached
headseqsum.txt

Expected behavior
Expected a pycoQC report to be generated by the attached call.
call.txt

Desktop (please complete the following information):

  • OS: Ubuntu 18.04

Hi @DrOllyGomez,

It seems that the summary file was generated by Poreplex itself, which I haven't come across so far.
The format is a little different from the ONT one. I pushed a compatibility fix to the dev branch.
Would you be able to give it a try with your full file?
=> pip install git+https://github.com/a-slide/pycoQC.git@dev --upgrade
And check that you have upgraded to version 2.5.0.11.
Thanks
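
One way to confirm which version ended up installed after the upgrade is to ask Python's packaging metadata. This is only a minimal sketch: `installed_version` is a hypothetical helper name, and it assumes pycoQC was installed with pip into the same environment that runs the snippet.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed pycoQC version, or None if pycoQC is not installed
# in the current environment.
print(installed_version("pycoQC"))
```

If this prints None, the upgrade most likely went into a different environment than the one being used.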

And you don't have to concatenate the files yourself. pycoQC also works with regular expressions to match all the files

I forgot to say thanks a lot for the detailed error reporting. That's probably the best I've had so far :D

Hi, progress I believe, but not quite there yet:

Here ..
upgrade.txt
.. is the upgrade commentary from pip, showing dependencies, versions etc.

and here...
Traceback2.txt
... is the output, with apparently successful parsing, but with a problem later on.....

any help very gratefully received.
Mike

I believe there must be non-numeric entries in your file in the sequence_length column.
Could you please confirm that or send me the entire file you used so I can replicate the issue?
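
A quick way to check a summary file for non-numeric entries is to scan the column and report every value that does not parse as a number. The sketch below is an illustration only: `find_non_numeric` is a hypothetical helper, the file is assumed to be tab-separated, and the column name `sequence_length` is an assumption that may need adjusting to match the actual header.

```python
import csv
import io


def find_non_numeric(lines, column="sequence_length"):
    """Return (line_number, value) pairs where `column` does not parse as a number.

    Rows where the column is missing entirely are reported with value None.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    bad = []
    for row in reader:
        value = row.get(column)
        try:
            float(value)
        except (TypeError, ValueError):
            bad.append((reader.line_num, value))
    return bad


# Tiny in-memory example with a header line repeated mid-file.
sample = io.StringIO(
    "read_id\tsequence_length\n"
    "r1\t1200\n"
    "read_id\tsequence_length\n"
    "r2\t850\n"
)
print(find_non_numeric(sample))
```

For a real file, the same function can be given an open file handle, for example `open(path, newline="")`, instead of the in-memory sample.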

Yes, I can confirm there are non-numeric entries: there are multiple instances of the header line...... e.g.
'filename\tread_id\trun_id\tchannel\tstart_time\tduration\tnum_events....etc
...... from me concatenating the original, parallelised outputs! Arrgh! :)

I will write a script to excise these (leaving in the very first) and I imagine pycoQC will work fine on them... will report back....
Thanks and apologies
Mike
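
For reference, a script of the kind described here could look like the sketch below. It is only an illustration: `drop_repeated_headers` is a hypothetical helper, it assumes the first line of the concatenated file is the header, and it drops only lines that are exact repeats of that header.

```python
def drop_repeated_headers(lines):
    """Yield lines, keeping the first header and skipping exact repeats of it."""
    header = None
    for line in lines:
        if header is None:
            # The first line is taken to be the header and is kept.
            header = line
            yield line
        elif line.rstrip("\r\n") != header.rstrip("\r\n"):
            yield line


# Small in-memory stand-in for a concatenated summary file (made-up rows).
concatenated = [
    "filename\tread_id\trun_id\n",
    "a.fast5\tr1\trun1\n",
    "filename\tread_id\trun_id\n",
    "b.fast5\tr2\trun1\n",
]
cleaned = list(drop_repeated_headers(concatenated))
print("".join(cleaned), end="")
```

Because the function is a generator over lines, it can also be fed an open file handle and written out line by line without loading the whole file into memory.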

You don't have to. pycoQC can take multiple summary files as input.
It then merges the data from the files without the repeated headers. :D

Yep: that's nailed it: and it looks beautiful! Many thanks!!

paths crossed there! will try the method you suggest too.... :)

From the documentation:
Path to a sequencing_summary generated by Albacore 1.0.0 + (read_fast5_basecaller.py) / Guppy 2.1.3+ (guppy_basecaller). One can also pass multiple space separated file paths or a UNIX style regex matching multiple files (Required)
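
To illustrate the idea of matching several summary files with one pattern and merging them with a single header, here is a minimal sketch. It is not pycoQC's actual code: `merge_summary_files` is a hypothetical helper, the pattern is treated as a shell-style wildcard via Python's `glob` module, and the file names and columns in the demo are made up.

```python
import glob
import os
import tempfile


def merge_summary_files(pattern):
    """Merge all files matching a shell-style pattern, keeping only the first header."""
    merged = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as fh:
            lines = fh.read().splitlines()
        if not lines:
            continue
        # Keep the header from the first non-empty file only.
        merged.extend(lines if not merged else lines[1:])
    return merged


# Demo on two throwaway files (hypothetical names and columns).
with tempfile.TemporaryDirectory() as tmp:
    for name, row in [("summary_1.txt", "r1\t1200"), ("summary_2.txt", "r2\t850")]:
        with open(os.path.join(tmp, name), "w") as fh:
            fh.write("read_id\tsequence_length\n" + row + "\n")
    merged = merge_summary_files(os.path.join(tmp, "summary_*.txt"))

print(merged)
```

The sketch assumes every matched file starts with the same header line; files with differing columns would need a real table merge rather than simple line concatenation.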

I will close the issue then.
thanks

Hi,

I also ran into the same issue, but the separate sequencing_summary.txt files came as-is from the run. They still generate the same error.

pycoQC v2.5.1.dev6

I ran it from a conda environment.

pycoQC --summary_file sequencing_summary_FAR75694_601e48d8.txt sequencing_summary_FAR75694_aa9a0e51.txt sequencing_summary_FAR75694_c645dc34.txt --barcode_file barcode/barcoding_summary.txt --html_outfile pycoQC_FlevoRUN2.html

I have attached the smaller of the three files.
sequencing_summary_FAR75694_c645dc34.txt

Hi,

I realized that I had to use the sequencing_summary.txt produced after basecalling,
not the file that is generated during the run...

Now it works.

Kind regards,

T.