nanoporetech / tombo

Tombo is a suite of tools primarily for the identification of modified nucleotides from raw nanopore sequencing data.

No basecalls found during resquiggle on RNA fast5s

pterzian opened this issue · comments

My command (with Tombo 1.5):

tombo resquiggle single_fast5/chrm1/  fasta/chrm1.fa --rna --processes 10  --num-most-common-errors 5  --overwrite  --ignore-read-locks

The error:

[15:12:07] Loading minimap2 reference.
[15:12:07] Getting file list.
******************** ERROR ********************
        Reads do not to contain basecalls. Check --basecall-group option if basecalls are stored in non-standard location or use `tombo annotate_raw_with_fastqs` to add basecalls from FASTQ files to raw FAST5 files.

This is the sequence stored in the fast5 (if I am not mistaken):

GROUP "BaseCalled_template" {
               DATASET "Fastq" {
                  DATATYPE  H5T_STRING {
                     STRSIZE 4727;
                     STRPAD H5T_STR_NULLTERM;
                     CSET H5T_CSET_ASCII;
                     CTYPE H5T_C_S1;
                  }
                  DATASPACE  SCALAR
                  DATA {
                  (0): "@a9dd4e77-37f8-4364-95ff-b855b8e9cd1e runid=0cfa50683d560fac3f644990b1fb76437eb2a00c read=88911 ch=147 start_time=2021-09-17T06:46:19Z flow_cell_id=XXX protocol_group_id=XXXX sample_id=XXXXXXX
           CAAAUCCCAUUUGGGGGAGAUCUACUUGUUUUUGGGCAGCCAGCCCUUGCAUUUUUUUUAAUAUAACCAUUGGCUUUGCAUGCCUGCCUGGCUUAGUAUUUGAAUUGAUUUUUCUUCCUUUUUUAUGGUAUAACUUCAUCGUUGAUCACUGAAAUCAUGGUUCCUUGUCACUACUUUUGUAAAUUAAUCUGAAACUUAACCUCAAGGUAACUGAGAUGAUAUGUUUCUGUAAAAGGAUUUGUUUGCCUUUGCUUUUAUUUGGUGUACCUCCUUAAUCCCUUUCCAGUAUGAUUUAUCUGUGAGAAUAAGGUGUAACCUUGUGGCUAAAGACUUGCAAAGAGUAAAGAUAGUGGGACCUUCUAUUCUCAAAACAGUGGAAAAUAUUAUCUGCAGCCACAGAUUUCACUAUCAGACUUGAGAAUAAAGAUUAUCAUCUUGGUACUCCUUAGUGGUAACAUUUGAAGUGAGCUGCAUUUCUAAUUGUACAGAUUGCCUUUCAUUGUCUAAAAAGCCGCUGGAGAGACCCUCAGGAUUUUCAGCAAUUAGAUAUACUGGUGUUUUGUUUUUUUGAAACUAUUCUUAACCAGGGUUUGUGGUGGUUGUGAUAUGGUGGGAAAAAUUUCUUGGAGCUUGGUUUUUAUUUUCUCUGUCUUUGCUAGCUACACAACACACUUGGGCAAAUCCUUACUACUUCUCUGCUCUUAUUCCCACUAUUCAUCUAUAAAAUUAGAUUAUCAGACUAGGUCACUCAUACAUUUGGACUCUUUUCAGGUUUGUGAUCCUGAACCUUAAACUGCUUUUUUAGUGACAACCCCUACAUGUUCUUCGAGGUAGUACAAACGUGUUGGGGCAGAGAGCUUGGCCAGAGAGAAGGGAGCCAGACAAACUGACACUAAAGGGCCAGCUGGAGGAGUAUGAGGUGUCAACUUAGGAGGGAAUGAAGACUGUGUGUUUUACAAACUGUUGGUGCAGCCUCACUUCUGAGUAUUCUUAGCUUCUUCUGAGAAUAAAUCAGCUUUGCUUUUAAUGCUUAUUGAAGAGGUAGUGAUGUCCUACUUCAGAGCUGAUUGUGCCAAGUAUGCUUUAUUUAAAUAACAGCUUCGCUGCUUAAAAUUAGUACCACUUUGAAAGCUAGCUGUGUUGUGUUAAGACAACCAGCACCAAAAAUGUUUUCCCCAUAUUCCCUCAGACACGUAAUUGCUGAACACUAACGUUCUGCUCAGUGAUGGAACCAGGUAGUAGUGGGGCUGUUGCCCAAAUGCUGCUCAACCCAAGAUCUGCGUAGAGUGAUAAUAGCUGGAUGUUCCUGUGUAGUAAUGGUGUGCACUUAAAUUUGGAAACGAAAUGUUUAGUGCACAUUGGCAUGCUUCCUUGGACGUAGAUCUGUUGUUUGGUUACUAUGAUAUGAAAGAGCAUUUAGGAAGGUUCCCUAGGUGGCUCAGUGGUUAACGAACCUAACAUUAGGGAUCCAUGAGGGAAGCAUCGAAGAAAAACCAUUUAAAACCUGGGUGUAUCGGUGCUGUUUUGUUGACAGAUGAUAGAACUCGGUUCUUUCGAACAAAGGUCAAGAGUGUAAACCUGGGACAGUGUGUGUAUGAGGCAGUCAGUAACCACUUGUCCUCUUACUCCUCAGUAAGUUUCAAUAGCUUAUCAUCAUGCCUGCCUUGCUCAUCUGUUGAAUCAACCUAUGGAAAGAGUGUGUUAAUGAAAUGCACUGCAGAAGAGGUUUUUUUUGAGAAAAUUAGUAUAUCAUGCAGAUGUAAUGAAAAUGCAGUUUUAUUAUUGCUUGCGCCUCAGCUGUUUAAGUGAUAUUAAAGGGCUUGGAGAGUAAAAAAUUCCCUCCCUUCCCAUCUACUACCCUACAUUUAAAAAAAAAAAAUAAAAAAAAAAAAAAAGCAUUGUUCUUUCCUCUCCAUGAGAGAGAGAGAGAGAGAGAGAGCUGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAUGACUGAG
AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACAAGAGAGACUGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAAAGAGAGACUGAGAUGAGAGAUUUGAAAUGAUAUAUCCAGAGGUGAGAGAGAGAGAGACUGACGAUAUCAUCAGUCAUUCUUUCCAUUCUCCAUUCUAAUUGCAAAU
           +

           "
                  }
               }
            }

This GROUP "BaseCalled_template" belongs to the GROUP "Basecall_1D_000", so I really don't see where this issue comes from.

Hey! I had a similar problem, and it seems the cause was FAST5 compression. Try switching to gzip compression with compress_fast5 from ont_fast5_api. Hope this helps!

Hello,

I am having a similar issue running tombo resquiggle and tombo preprocess. Here are the steps I followed:

  1. compress_fast5 -i ../fast5shGFP/ -s ../fast5_shGFP1_gzip -c gzip
  2. tombo preprocess annotate_raw_with_fastqs --fast5-basedir ./fast5_shGFP1_gzip/ --fastq-filenames ./pass/*fastq --sequencing-summary-filenames sequencing_summary.txt --overwrite
     It failed.
  3. multi_to_single_fast5 -i ./fast5_shGFP1_gzip/ -s ./single_fast5_shGFP_gzip/ --recursive
  4. tombo preprocess annotate_raw_with_fastqs --fast5-basedir ./new_single_fast5_shGFP_gzip/ --fastq-filenames ./shGFP_FASTQ_1/pass/*fastq --sequencing-summary-filenames sequencing_summary.txt --overwrite

It failed with the following error message:
[13:34:55] Getting read filenames.
[13:34:56] Parsing sequencing summary files.
******************** WARNING ********************
Some FASTQ records from sequencing summaries do not appear to have a matching file.
[13:34:57] Annotating FAST5s with sequence from FASTQs.
****** WARNING ****** Some FASTQ records contain read identifiers not found in any FAST5 files or sequencing summary files.
0it [00:07, ?it/s]
[13:35:05] Added sequences to a total of 0 reads.

Last, I tried tombo resquiggle on the multi-fast5 files (just to test whether it would work):
tombo resquiggle ./fast5_shGFP_gzip/ ./Homo_sapiens.GRCh38.cdna.all.fa --rna
But this also failed with the message:

[13:37:09] Loading minimap2 reference.
[13:37:18] Getting file list.
******************** ERROR ********************
Reads do not to contain basecalls. Check --basecall-group option if basecalls are stored in non-standard location or use tombo preprocess annotate_raw_with_fastqs to add basecalls from FASTQ files to raw FAST5 files.

Please help. I am not sure what I am missing here.
Note: It is RNA data.

The problem is that Tombo needs the fast5 files to contain the fastq sequence for your reads. Since your tombo preprocess annotate_raw_with_fastqs failed, the sequences were not added.
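
One way to see why zero reads were annotated is to compare the read IDs in the FASTQ files with the read IDs present in the FAST5 files. The following is only a minimal sketch, not part of Tombo: the function names are hypothetical, it assumes plain four-line FASTQ records (no wrapped sequences), and it assumes the FAST5 read IDs have already been collected separately (for example from the read_<id> group names that h5dump -n prints).

```python
def fastq_read_ids(lines):
    """Yield read IDs from 4-line FASTQ records.

    The header is taken by position (every 4th line) because a quality
    line may also start with '@'.
    """
    for index, line in enumerate(lines):
        if index % 4 == 0 and line.startswith("@"):
            fields = line[1:].split()
            if fields:
                # The read ID is the first whitespace-delimited token.
                yield fields[0]


def unmatched_ids(fastq_lines, fast5_read_ids):
    """Return FASTQ read IDs that are absent from the FAST5 read IDs."""
    known = set(fast5_read_ids)
    return [rid for rid in fastq_read_ids(fastq_lines) if rid not in known]
```

If every FASTQ ID comes back as unmatched, the FASTQ files and the FAST5 directory probably do not belong to the same set of reads, or the directory path is wrong.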

If you cannot get tombo preprocess annotate_raw_with_fastqs to work, you can always re-basecall your fast5 files with guppy using the --fast5_out argument.

Example:
guppy_basecaller --flowcell FLO-MIN106 --kit SQK-RNA002 \
  -i "Folder_containing_fast5_files" -r -s "Output_folder" --nested_output_folder \
  --num_callers 6 --gpu_runners_per_device 6 --chunks_per_runner 1000 --chunk_size 1000 -q 0 -v \
  -x cuda:all:100% --fast5_out

Thank you so much for the comment. I have re-basecalled the fast5 files using the following guppy command:
guppy_basecaller \
  -i /data/user/mbansal/Fast5/GFP/fast5_pass/fast5shGFP \
  -s /data/user/mbansal/Fast5/GFP/fast5_pass/shGFP_FASTQ_2 \
  -c rna_r9.4.1_70bps_hac.cfg \
  --reverse_sequence \
  --post_out \
  --trace_category_logs \
  --moves_out \
  --trim_adapters \
  --compress_fastq \
  -x "cuda:all" --num_callers 5 --gpu_runners_per_device 8 \
  --chunks_per_runner 1000 --chunk_size 1000 \
  --fast5_out

What should I do next?

First, verify that the fastq sequence was actually added to the file. You can inspect your fast5 file with HDFView, or on the command line with h5dump or a similar tool.
Also confirm which basecall group the information was added to (for me, it was "Basecall_1D_000").
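
The check can also be scripted on the text that h5dump -n prints. This is a rough sketch under two assumptions: that each object line has the form "<type> <path>", and that basecalls live under an Analyses/Basecall_* group, which is where the usual FAST5 layout puts them. The function name basecall_groups is made up for illustration.

```python
import re


def basecall_groups(listing):
    """Return the Basecall_* group names found in `h5dump -n` output."""
    groups = set()
    for line in listing.splitlines():
        parts = line.split()
        # Object lines look like "<type> <path>", optionally followed
        # by "-> <target>" for links; skip everything else.
        if len(parts) < 2 or parts[0] not in {"group", "dataset"}:
            continue
        match = re.search(r"/Analyses/(Basecall_[^/]+)", parts[1])
        if match:
            groups.add(match.group(1))
    return groups
```

For a listing that contains only Raw, channel_id, context_tags and tracking_id entries, the function returns an empty set, meaning the file holds no basecalls for Tombo to use. Any name it does return is a candidate value for --basecall-group.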

Then proceed as before: make sure the fast5 files are compressed with gzip, convert the multi-fast5 files to single-fast5 files, then run tombo resquiggle. Here was my command:

tombo resquiggle --overwrite --basecall-group Basecall_1D_000 subset_single_fast5_uncompressed ~/Software/nanom6A_2021_10_22/data/ref.fa --processes 12 --fit-global-scale --include-event-stdev --num-most-common-errors 5

Note: Depending on the downstream software you want to use, you may or may not want all of these options, but you should use --basecall-group to specify where your fastq sequence is stored.
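
The three steps can be sketched as a small helper that only builds the command lines and does not run them. The function name tombo_pipeline and its parameters are hypothetical; the flags are the ones already used in this thread. Whether you need --rna and which basecall group applies depends on your data.

```python
def tombo_pipeline(multi_dir, gzip_dir, single_dir, reference,
                   basecall_group="Basecall_1D_000", rna=True, processes=4):
    """Build (not run) the compress, split and resquiggle commands."""
    resquiggle = [
        "tombo", "resquiggle", single_dir, reference,
        "--basecall-group", basecall_group,
        "--processes", str(processes),
        "--overwrite",
    ]
    if rna:
        # Direct RNA reads need the --rna flag, as in the original command.
        resquiggle.append("--rna")
    return [
        # 1. Re-compress the FAST5 files with gzip.
        ["compress_fast5", "-i", multi_dir, "-s", gzip_dir, "-c", "gzip"],
        # 2. Split multi-read FAST5 files into single-read files.
        ["multi_to_single_fast5", "-i", gzip_dir, "-s", single_dir, "--recursive"],
        # 3. Resquiggle against the reference.
        resquiggle,
    ]
```

Each returned list can be passed to subprocess.run(cmd, check=True) in order, so that a failure in one step stops the following ones.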

Hope this helps!

I don't think --fast5_out changed the original input fast5 files, and it did not seem to produce new fast5 files either. Checking the original fast5 files with h5dump -n, here is the content of one file:
group /read_ffc65009-af19-4423-85e0-24c5f77f8a6a
group /read_ffc65009-af19-4423-85e0-24c5f77f8a6a/Raw
dataset /read_ffc65009-af19-4423-85e0-24c5f77f8a6a/Raw/Signal
group /read_ffc65009-af19-4423-85e0-24c5f77f8a6a/channel_id
group /read_ffc65009-af19-4423-85e0-24c5f77f8a6a/context_tags -> /read_0004e238-da00-4950-992e-fd5a4457f6ac/context_tags
group /read_ffc65009-af19-4423-85e0-24c5f77f8a6a/tracking_id -> /read_0004e238-da00-4950-992e-fd5a4457f6ac/tracking_id
}
I do not see basecall_id. Am I missing anything here?

The --fast5_out option should create a new directory called "workspace" in which you will find new fast5 files that contain the fastq sequence.
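
To confirm that the new files exist, something like the sketch below can list them. It assumes the "workspace" directory sits directly inside the folder given to guppy with -s, which is worth checking against your own output. The helper name workspace_fast5s is made up for illustration.

```python
from pathlib import Path


def workspace_fast5s(save_path):
    """Return the sorted .fast5 files under <save_path>/workspace, if any."""
    workspace = Path(save_path) / "workspace"
    if not workspace.is_dir():
        # No workspace directory: no FAST5 output was written there.
        return []
    # Search recursively in case the output is nested in subfolders.
    return sorted(workspace.rglob("*.fast5"))
```

An empty result suggests that no FAST5 output was written to that location. Otherwise, these are the files to pass on to multi_to_single_fast5 and tombo resquiggle instead of the original input files.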