Generated summary file missing required column
skchronicles opened this issue
Hello @a-slide,
I hope you are having a great day! I was testing out pycoQC and ran into an issue after generating a summary file with Fast5_to_seq_summary.
Describe the bug
The summary file produced by Fast5_to_seq_summary was passed to pycoQC and produced the following error:
Traceback (most recent call last):
  File "/usr/local/bin/pycoQC", line 8, in <module>
    sys.exit(main_pycoQC())
  File "/usr/local/lib/python3.10/dist-packages/pycoQC/__main__.py", line 115, in main_pycoQC
    pycoQC (
  File "/usr/local/lib/python3.10/dist-packages/pycoQC/pycoQC.py", line 120, in pycoQC
    parser = pycoQC_parse (
  File "/usr/local/lib/python3.10/dist-packages/pycoQC/pycoQC_parse.py", line 96, in __init__
    summary_reads_df = self._parse_summary()
  File "/usr/local/lib/python3.10/dist-packages/pycoQC/pycoQC_parse.py", line 136, in _parse_summary
    df = self._select_df_columns (
  File "/usr/local/lib/python3.10/dist-packages/pycoQC/pycoQC_parse.py", line 397, in _select_df_columns
    raise pycoQCError("Column {} not found in the provided sequence_summary file".format(col))
pycoQC.common.pycoQCError: Column read_len not found in the provided sequence_summary file
To Reproduce
Steps to reproduce the behavior:
Fast5_to_seq_summary command to generate the summary file:
$ Fast5_to_seq_summary --threads 8 -f sample/fast5 -s summary.tsv --verbose 2
Here are the first few lines of the output summary.tsv file:
read_id run_id channel start_time
000a1b52-fad6-4d6f-b113-c4b24013fcf9 8d6deda632c3a7303f91016b7707e7310e0bc054 256 42618
0026ba30-0061-401d-8dc1-3cb556d71cb9 8d6deda632c3a7303f91016b7707e7310e0bc054 133 29349
000d264a-1a98-4a55-beb5-9f02dd42fce2 8d6deda632c3a7303f91016b7707e7310e0bc054 170 42809
001ddc14-ccb8-42c3-9fd3-74db3c431a75 8d6deda632c3a7303f91016b7707e7310e0bc054 110 42649
0048af85-5c18-4745-b51e-2fab957aceab 8d6deda632c3a7303f91016b7707e7310e0bc054 61 42292
00519880-3d53-4ee3-8528-7a388ad69b24 8d6deda632c3a7303f91016b7707e7310e0bc054 198 42873
As you can see here, there is no column containing sequence/read length information.
pycoQC command to generate the report:
$ pycoQC -f summary.tsv -o test.html -j test.json --verbose
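The missing length column can also be confirmed programmatically before running pycoQC. The following is a minimal sketch, assuming the summary file is tab-separated with a single header line; the function name find_length_columns and the candidate list are illustrative (the candidates are only the column names mentioned in this issue), not part of pycoQC.

```python
import csv
import io

# Column names mentioned in this issue as possible read-length columns.
# This list is an assumption for illustration, not pycoQC's own definition.
LENGTH_COLUMNS = (
    "read_len",
    "sequence_length_template",
    "sequence_length_2d",
    "sequence_length",
)


def find_length_columns(handle, candidates=LENGTH_COLUMNS):
    """Return the candidate length columns present in a tab-separated header."""
    # Read only the first row; an empty file yields an empty header.
    header = next(csv.reader(handle, delimiter="\t"), [])
    return [name for name in candidates if name in header]


# Header copied from the summary.tsv excerpt above.
sample = io.StringIO("read_id\trun_id\tchannel\tstart_time\n")
print(find_length_columns(sample))  # prints []
```

For a file on disk, the same function can be called with open("summary.tsv", newline="") in place of the in-memory sample.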
Expected behavior
I was expecting the summary file generated by Fast5_to_seq_summary to be compatible with pycoQC. I also tried re-running Fast5_to_seq_summary with the following fields option (to include everything):
--fields barcode_arrangement barcode_full_arrangement barcode_score calibration_strand_end calibration_strand_genome_template calibration_strand_identity calibration_strand_start called_events channel channel_digitisation channel_offset channel_range channel_sampling_rate device_id duration flow_cell_id mean_qscore_template protocol_run_id read_id read_number run_id sample_id sequence_length_template skip_prob start_mux start_time stay_prob step_prob strand_score
However, that did not help, and I am getting the same error message.
I can see here, in your parser, that you look for these columns, rename them, and then check that they exist. However, if I try to pass sequence_length_2d or sequence_length to the --fields option of Fast5_to_seq_summary, it errors out:
Check input data and options
Traceback (most recent call last):
  File "/usr/local/bin/Fast5_to_seq_summary", line 8, in <module>
    sys.exit(main_Fast5_to_seq_summary())
  File "/usr/local/lib/python3.10/dist-packages/pycoQC/__main__.py", line 168, in main_Fast5_to_seq_summary
    Fast5_to_seq_summary (
  File "/usr/local/lib/python3.10/dist-packages/pycoQC/Fast5_to_seq_summary.py", line 119, in __init__
    raise pycoQCError ("Field {} is not valid, please choose among the following valid fields: {}".format(field, ",".join(self.attrs_grp_dict.keys())))
pycoQC.common.pycoQCError: Field sequence_length_2d is not valid, please choose among the following valid fields: mean_qscore_template,sequence_length_template,called_events,skip_prob,stay_prob,step_prob,strand_score,read_id,start_time,duration,start_mux,read_number,channel,channel_digitisation,channel_offset,channel_range,channel_sampling_rate,run_id,sample_id,device_id,protocol_run_id,flow_cell_id,calibration_strand_genome_template,calibration_strand_end,calibration_strand_start,calibration_strand_identity,barcode_arrangement,barcode_full_arrangement,barcode_score
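To make the rename-then-check behaviour described above concrete, here is a simplified sketch of that pattern. It is not pycoQC's actual code: the alias table contains only the column names quoted in this issue, their mapping to read_len is an assumption, and normalise_columns is a hypothetical helper.

```python
# Hypothetical alias table: column names quoted in this issue, all assumed
# to be renamed to read_len by the parser.
ALIASES = {
    "sequence_length_template": "read_len",
    "sequence_length_2d": "read_len",
    "sequence_length": "read_len",
}


def normalise_columns(columns, required=("read_len",)):
    """Rename known aliases, then fail if a required column is still missing."""
    renamed = [ALIASES.get(name, name) for name in columns]
    for name in required:
        if name not in renamed:
            raise KeyError("Column {} not found".format(name))
    return renamed


# A header that carries one of the aliases passes the check.
print(normalise_columns(["read_id", "sequence_length_template"]))
# prints ['read_id', 'read_len']
```

Under these assumptions, the four-column header shown earlier (read_id, run_id, channel, start_time) raises the error because none of the aliases is present, which matches the first traceback in this report.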
Desktop:
- OS: Ubuntu 20.04
- pycoQC Version: v.2.5.2, installed from pypi
If you need anything else, please let me know.
Best Regards,
@skchronicles