Bonito Train missing dataset.py

Question

VBHarrisN opened this issue 7 months ago · comments

When using bonito train to train a pretrained model, we encounter two errors:
(venv3) (base) @rnascience:~~/packages/bonito$ bonito train --directory /data/training/ctc-train /data/training/rna004_130bps_sup@v3.0.1

[loading model]
[loading data]
Traceback (most recent call last):
  File "/home//packages/bonito/bonito/cli/train.py", line 58, in main
    train_loader_kwargs, valid_loader_kwargs = load_numpy(
  File "/home//packages/bonito/bonito/data.py", line 40, in load_numpy
    train_data = load_numpy_datasets(limit=limit, directory=directory)
  File "/home//packages/bonito/bonito/data.py", line 66, in load_numpy_datasets
    chunks = np.load(os.path.join(directory, "chunks.npy"), mmap_mode='r')
  File "/home//packages/bonito/venv3/lib/python3.10/site-packages/numpy/lib/npyio.py", line 405, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/data/training/ctc-train/chunks.npy'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home//packages/bonito/venv3/bin/bonito", line 8, in <module>
    sys.exit(main())
  File "/home//packages/bonito/bonito/__init__.py", line 34, in main
    args.func(args)
  File "/home//packages/bonito/bonito/cli/train.py", line 62, in main
    train_loader_kwargs, valid_loader_kwargs = load_script(
  File "/home//packages/bonito/bonito/data.py", line 31, in load_script
    spec.loader.exec_module(module)
  File "<frozen importlib._bootstrap_external>", line 879, in exec_module
  File "<frozen importlib._bootstrap_external>", line 1016, in get_code
  File "<frozen importlib._bootstrap_external>", line 1073, in get_data
FileNotFoundError: [Errno 2] No such file or directory: '/data/training/ctc-train/dataset.py'

The first error should not occur, since chunks.npy IS in the specified folder. As for the second error, I do not understand where dataset.py is supposed to come from. Is it supposed to be generated by the train function? Am I supposed to write the dataset.py script myself? Should it be included by default? Any clarity would help. Thanks!

Nate Harris · Answer 1 · Mon Feb 05 2024 23:57:56 GMT+0800 (China Standard Time)

Just for anyone encountering this specific error: the missing dataset.py is not the actual problem. The problem is in chunks.npy, which is generated by the --save-ctc flag in the basecalling step. If Bonito reports that chunks.npy cannot be found and you have verified that it is in the folder where it is supposed to be, it means that chunks.npy is 0 by 10000, which means it is empty. We ran into this when running our data through the RNA004 model; it can also happen if the data is bad.
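The two stacked tracebacks in the question are what Python prints when a fallback inside an except block fails as well. The sketch below reproduces that pattern in isolation. It is not Bonito's code: the function names are hypothetical stand-ins, and only the file names chunks.npy and dataset.py are taken from the tracebacks.

```python
import os
import tempfile


def load_arrays(directory):
    # Hypothetical stand-in for an array-based loader: it needs chunks.npy.
    with open(os.path.join(directory, "chunks.npy"), "rb") as fh:
        return fh.read()


def load_custom_script(directory):
    # Hypothetical stand-in for a script-based loader: it needs dataset.py.
    with open(os.path.join(directory, "dataset.py"), "rb") as fh:
        return fh.read()


def load_training_data(directory):
    try:
        return load_arrays(directory)
    except FileNotFoundError:
        # Fallback path. If this raises too, Python chains the two errors
        # and reports "During handling of the above exception, another
        # exception occurred", with the fallback's error printed last.
        return load_custom_script(directory)


# Run against an empty directory so that both loaders fail.
with tempfile.TemporaryDirectory() as empty_dir:
    try:
        load_training_data(empty_dir)
    except FileNotFoundError as err:
        # The error that surfaces is the fallback's (dataset.py) ...
        last_missing = os.path.basename(err.filename)
        # ... while the original failure (chunks.npy) sits in __context__.
        first_missing = os.path.basename(err.__context__.filename)

print("reported last:", last_missing)
print("original cause:", first_missing)
```

The message about dataset.py is printed last, which makes it look like the main problem, but the first traceback is the one to investigate.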

I just want to give a heads up to anyone in the future who encounters this problem!

Laura White · Answer 2 · Tue Apr 02 2024 02:40:36 GMT+0800 (China Standard Time)

Are you saying that chunks.npy is currently never generated when data is run through the RNA004 model?

Nate Harris · Answer 3 · Tue Apr 02 2024 02:46:50 GMT+0800 (China Standard Time)

When attempting to generate CTC data using the RNA004 model, Bonito will generate chunks.npy, but it is empty, with shape 10000 by 0. Therefore, when you try to use Bonito to train or fine-tune a model, there is no data to read and it throws an error.
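A quick way to confirm this before starting a training run is to load chunks.npy with NumPy and look at its shape. The following is a minimal sketch: the helper name describe_chunks is hypothetical, and the demonstration uses synthetic arrays written to a temporary directory rather than real Bonito output.

```python
import os
import tempfile

import numpy as np


def describe_chunks(directory):
    """Return (shape, is_empty) for chunks.npy inside `directory`."""
    chunks = np.load(os.path.join(directory, "chunks.npy"))
    return chunks.shape, chunks.size == 0


# Demonstration with synthetic data (not real CTC training data).
with tempfile.TemporaryDirectory() as demo_dir:
    path = os.path.join(demo_dir, "chunks.npy")

    # An array with a zero-length dimension holds no samples at all.
    np.save(path, np.zeros((10000, 0), dtype=np.float16))
    empty_shape, empty_flag = describe_chunks(demo_dir)

    # A small non-empty array for comparison.
    np.save(path, np.zeros((4, 10000), dtype=np.float16))
    good_shape, good_flag = describe_chunks(demo_dir)

print(empty_shape, "empty:", empty_flag)
print(good_shape, "empty:", good_flag)
```

To check a real training directory, call describe_chunks("/data/training/ctc-train") and make sure the reported shape has no zero dimension.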

Laura White · Answer 4 · Tue Apr 02 2024 03:30:14 GMT+0800 (China Standard Time)

Thanks, that is clear (if annoying). Hopefully @iiSeymour will provide an update on #379 and on where this falls in the development priority list, as not being able to train basecallers for the new chemistry is a real roadblock.

Chris Seymour · Answer 5 · Thu Apr 04 2024 08:41:41 GMT+0800 (China Standard Time)

The bonito --save-ctc workflow depends on reads being mappable with an existing model, and --min-accuracy-save-ctc should be lowered for RNA004 data.
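As a rough illustration of that advice, the sketch below only assembles a basecalling command line with a lowered threshold; it does not run Bonito. The reads directory, the reference file name and the value 0.9 are assumptions chosen for the example, and a suitable threshold for a given dataset has to be found by experiment.

```python
import shlex


def build_save_ctc_command(model, reads_dir, reference, min_accuracy):
    # Assemble the argument list for a --save-ctc basecalling run.
    # All argument values are supplied by the caller; nothing is executed.
    return [
        "bonito", "basecaller", model,
        "--save-ctc",
        "--reference", reference,
        "--min-accuracy-save-ctc", str(min_accuracy),
        reads_dir,
    ]


cmd = build_save_ctc_command(
    model="rna004_130bps_sup@v3.0.1",  # model named in the question
    reads_dir="/data/reads",           # assumed location of the raw reads
    reference="reference.mmi",         # assumed reference index file
    min_accuracy=0.9,                  # illustrative value, not a recommendation
)
command_line = shlex.join(cmd)
print(command_line)
```

After such a run, checking that the regenerated chunks.npy is no longer empty shows whether the threshold was lowered far enough.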