Bonito train

Question

partha434 opened this issue a year ago

Partha Sarathi Tripathy commented a year ago

While running bonito basecaller prior to training, I encountered the following error. Please help.

bonito basecaller dna_r9.4.1 --reference mtDNA.mmi --save-ctc ./data/ > calls.sam
> reading pod5
> outputting aligned sam
> loading model dna_r9.4.1
> error: failed to load dna_r9.4.1
> available models:

dna_r10.4.1_e8.2_260bps_fast@v3.5.2
dna_r10.4.1_e8.2_260bps_fast@v4.0.0
dna_r10.4.1_e8.2_260bps_fast@v4.1.0
dna_r10.4.1_e8.2_260bps_hac@v3.5.2
dna_r10.4.1_e8.2_260bps_hac@v4.0.0
dna_r10.4.1_e8.2_260bps_hac@v4.1.0
dna_r10.4.1_e8.2_260bps_sup@v3.5.2
dna_r10.4.1_e8.2_260bps_sup@v4.0.0
dna_r10.4.1_e8.2_260bps_sup@v4.1.0
dna_r10.4.1_e8.2_400bps_fast@v3.5.2
dna_r10.4.1_e8.2_400bps_fast@v4.0.0
dna_r10.4.1_e8.2_400bps_fast@v4.1.0
dna_r10.4.1_e8.2_400bps_hac@v3.5.2
dna_r10.4.1_e8.2_400bps_hac@v4.0.0
dna_r10.4.1_e8.2_400bps_hac@v4.1.0
dna_r10.4.1_e8.2_400bps_sup@v3.5.2
dna_r10.4.1_e8.2_400bps_sup@v4.0.0
dna_r10.4.1_e8.2_400bps_sup@v4.1.0
dna_r9.4.1_e8_fast@v3.4
dna_r9.4.1_e8_hac@v3.3
dna_r9.4.1_e8_sup@v3.3

Chris Seymour · Answer 1 · Mon Apr 24 2023 00:01:00 GMT+0800 (China Standard Time)

dna_r9.4.1 is not a pre-trained bonito model; you need to select one of the models from the list of available models:

dna_r9.4.1_e8_fast@v3.4
dna_r9.4.1_e8_hac@v3.3
dna_r9.4.1_e8_sup@v3.3
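Because bonito only accepts a complete model name, one quick way to turn a partial name such as dna_r9.4.1 into valid candidates is a prefix match against the list bonito prints. A minimal sketch, assuming you copy that list in by hand; AVAILABLE_MODELS and suggest_models are hypothetical names, not part of bonito's API:

```python
# Hypothetical helper: AVAILABLE_MODELS is a subset copied from the list bonito
# printed in the error above; suggest_models is not part of bonito's API.
AVAILABLE_MODELS = [
    "dna_r10.4.1_e8.2_400bps_sup@v4.1.0",
    "dna_r9.4.1_e8_fast@v3.4",
    "dna_r9.4.1_e8_hac@v3.3",
    "dna_r9.4.1_e8_sup@v3.3",
]


def suggest_models(partial: str, available: list[str]) -> list[str]:
    """Return the full model names that start with the given partial name."""
    if partial in available:
        return [partial]  # already a complete, valid name
    return [name for name in available if name.startswith(partial)]


for name in suggest_models("dna_r9.4.1", AVAILABLE_MODELS):
    print(name)
```

With the list above, dna_r9.4.1 matches only the three R9.4.1 models, any of which can then be passed to bonito basecaller.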

Partha Sarathi Tripathy · Answer 2 · Mon Apr 24 2023 00:10:38 GMT+0800 (China Standard Time)

Thanks, it worked.

But after running this one, I encountered an error with bonito train:

bonito train --epochs 1 --lr 5e-4 --pretrained dna_r9.4.1_e8_sup@v3.3 --directory ./ ./data/training/fine-tuned-model
[loading model]
[using pretrained model dna_r9.4.1_e8_sup@v3.3]
[loading data]
Traceback (most recent call last):
  File "/home/partha/anaconda3/envs/bonito/lib/python3.10/site-packages/bonito/cli/train.py", line 58, in main
    train_loader_kwargs, valid_loader_kwargs = load_numpy(
  File "/home/partha/anaconda3/envs/bonito/lib/python3.10/site-packages/bonito/data.py", line 40, in load_numpy
    train_data = load_numpy_datasets(limit=limit, directory=directory)
  File "/home/partha/anaconda3/envs/bonito/lib/python3.10/site-packages/bonito/data.py", line 66, in load_numpy_datasets
    chunks = np.load(os.path.join(directory, "chunks.npy"), mmap_mode='r')
  File "/home/partha/anaconda3/envs/bonito/lib/python3.10/site-packages/numpy/lib/npyio.py", line 405, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: './chunks.npy'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/partha/anaconda3/envs/bonito/bin/bonito", line 8, in <module>
    sys.exit(main())
  File "/home/partha/anaconda3/envs/bonito/lib/python3.10/site-packages/bonito/__init__.py", line 34, in main
    args.func(args)
  File "/home/partha/anaconda3/envs/bonito/lib/python3.10/site-packages/bonito/cli/train.py", line 62, in main
    train_loader_kwargs, valid_loader_kwargs = load_script(
  File "/home/partha/anaconda3/envs/bonito/lib/python3.10/site-packages/bonito/data.py", line 31, in load_script
    spec.loader.exec_module(module)
  File "<frozen importlib._bootstrap_external>", line 879, in exec_module
  File "<frozen importlib._bootstrap_external>", line 1016, in get_code
  File "<frozen importlib._bootstrap_external>", line 1073, in get_data
FileNotFoundError: [Errno 2] No such file or directory: '/home/partha/model_test/dataset.py'

Chris Seymour · Answer 3 · Mon Apr 24 2023 00:17:28 GMT+0800 (China Standard Time)

You passed ./data as the location to save the training chunks in the basecalling step:

bonito basecaller dna_r9.4.1 --reference mtDNA.mmi --save-ctc ./data/ > calls.sam

But you passed ./ in the training command, so just replace --directory ./ with --directory ./data:

bonito train --epochs 1 --lr 5e-4 --pretrained dna_r9.4.1_e8_sup@v3.3 --directory ./data ./data/training/fine-tuned-model

Partha Sarathi Tripathy · Answer 4 · Mon Apr 24 2023 00:22:34 GMT+0800 (China Standard Time)

./data/ contains the pod5 files from my sequencing run.

Chris Seymour · Answer 5 · Mon Apr 24 2023 00:25:25 GMT+0800 (China Standard Time)

The directory given to --save-ctc when basecalling needs to match the --directory argument when training; you need to pass the directory containing the *.npy files produced by the basecalling step with --save-ctc.
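The two tracebacks in the training error show the lookup order: bonito first tries to memory-map chunks.npy from --directory, and only when that fails does it look for a dataset.py script there, so both FileNotFoundErrors point at the same wrong directory. A pre-flight check before bonito train can catch this early. A minimal sketch; missing_ctc_files is a hypothetical helper, and only chunks.npy is confirmed by the traceback. The other two file names (references.npy, reference_lengths.npy) are an assumption about what --save-ctc writes.

```python
import os

# Hypothetical pre-flight check. Only chunks.npy is confirmed by the traceback;
# references.npy and reference_lengths.npy are ASSUMED --save-ctc outputs.
EXPECTED_FILES = ("chunks.npy", "references.npy", "reference_lengths.npy")


def missing_ctc_files(directory: str) -> list[str]:
    """Return the expected .npy files that are not present in `directory`."""
    return [
        name
        for name in EXPECTED_FILES
        if not os.path.isfile(os.path.join(directory, name))
    ]
```

For example, missing_ctc_files("./data") returning an empty list suggests ./data is the right value for --directory; a directory that only holds pod5 files would report all three names as missing.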

Partha Sarathi Tripathy · Answer 6 · Mon Apr 24 2023 00:29:07 GMT+0800 (China Standard Time)

But after basecalling I didn't get any npy files, only the calls.sam file.
Am I doing something wrong?

Partha Sarathi Tripathy · Answer 7 · Mon Apr 24 2023 00:32:25 GMT+0800 (China Standard Time)

In the basecalling step, it says "no suitable ctc data to write":

bonito basecaller dna_r9.4.1_e8_sup@v3.3 --reference mtDNA.mmi --save-ctc ./data/ > calls.sam
> reading pod5
> outputting aligned sam
> loading model dna_r9.4.1_e8_sup@v3.3
> loading reference
> no suitable ctc data to write
> completed reads: 87002
> duration: 0:05:01
> samples per second 2.9E+06
> done
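"no suitable ctc data to write" means no read passed the filters bonito applies before saving CTC training chunks; the later reply confirms that changing --min-accuracy-save-ctc was enough here. To choose a threshold, one option is to estimate per-read accuracy from the calls.sam already produced. A minimal sketch, assuming each aligned record carries an NM:i: edit-distance tag; it uses BLAST-style identity, (alignment columns minus NM) / alignment columns, which is not necessarily the exact metric bonito uses internally. read_identities and fraction_passing are hypothetical names.

```python
import re
from typing import Iterable, Iterator

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def read_identities(sam_lines: Iterable[str]) -> Iterator[float]:
    """Yield (columns - NM) / columns for each primary, mapped SAM record."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        flag, cigar = int(fields[1]), fields[5]
        if flag & (0x4 | 0x100 | 0x800) or cigar == "*":
            continue  # skip unmapped, secondary and supplementary records
        # Alignment columns: matches/mismatches (M, =, X) plus inserted and deleted bases.
        columns = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "M=XID")
        # ASSUMPTION: the edit distance is stored in an optional NM:i: tag.
        nm = next((int(f[5:]) for f in fields[11:] if f.startswith("NM:i:")), None)
        if nm is None or columns == 0:
            continue  # identity cannot be estimated for this record
        yield (columns - nm) / columns


def fraction_passing(identities: list[float], threshold: float) -> float:
    """Fraction of reads whose estimated identity is at least `threshold`."""
    if not identities:
        return 0.0
    return sum(i >= threshold for i in identities) / len(identities)


# Usage sketch:
# with open("calls.sam") as sam:
#     ids = list(read_identities(sam))
# for t in (0.8, 0.9, 0.95):
#     print(t, fraction_passing(ids, t))
```

If almost no reads clear the current threshold, that is consistent with the empty --save-ctc output above.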

Partha Sarathi Tripathy · Answer 8 · Mon Apr 24 2023 19:37:05 GMT+0800 (China Standard Time)

@iiSeymour After changing --min-accuracy-save-ctc, the basecalling worked for me. But during bonito train I am getting the following error:

bonito train --epochs 1 --lr 5e-4 --pretrained dna_r9.4.1_e8_sup@v3.3 --directory ./ ./data/training/fine-tuned-model
[loading model]
[using pretrained model dna_r9.4.1_e8_sup@v3.3]
[loading data]
[validation set not found: splitting training set]
[0/834]: 0%| | [00:01]
Traceback (most recent call last):
  File "/home/partha/anaconda3/envs/bonito/bin/bonito", line 8, in <module>
    sys.exit(main())
  File "/home/partha/anaconda3/envs/bonito/lib/python3.10/site-packages/bonito/__init__.py", line 34, in main
    args.func(args)
  File "/home/partha/anaconda3/envs/bonito/lib/python3.10/site-packages/bonito/cli/train.py", line 101, in main
    trainer.fit(workdir, args.epochs, lr)
  File "/home/partha/anaconda3/envs/bonito/lib/python3.10/site-packages/bonito/training.py", line 216, in fit
    train_loss, duration = self.train_one_epoch(loss_log, lr_scheduler)
  File "/home/partha/anaconda3/envs/bonito/lib/python3.10/site-packages/bonito/training.py", line 135, in train_one_epoch
    losses, grad_norm = self.train_one_step(batch)
  File "/home/partha/anaconda3/envs/bonito/lib/python3.10/site-packages/bonito/training.py", line 99, in train_one_step
    losses_ = self.criterion(scores_, targets_, lengths_)
  File "/home/partha/anaconda3/envs/bonito/lib/python3.10/site-packages/bonito/crf/model.py", line 189, in loss
    return self.seqdist.ctc_loss(scores.to(torch.float32), targets, target_lengths, **kwargs)
  File "/home/partha/anaconda3/envs/bonito/lib/python3.10/site-packages/bonito/crf/model.py", line 129, in ctc_loss
    logz = logZ_cu(stay_scores, move_scores, target_lengths + 1 - self.state_len)
  File "/home/partha/anaconda3/envs/bonito/lib/python3.10/site-packages/koi/ctc.py", line 115, in logZ_cu
    return LogZ.apply(stay_scores, move_scores, target_lengths, _simple_lattice_fwd_bwd_cu, S)
  File "/home/partha/anaconda3/envs/bonito/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/partha/anaconda3/envs/bonito/lib/python3.10/site-packages/koi/ctc.py", line 47, in forward
    beta_move = stay_scores.new_full((T, N, L), S.zero)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 738.00 MiB (GPU 0; 23.69 GiB total capacity; 19.39 GiB already allocated; 16.94 MiB free; 19.90 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Partha Sarathi Tripathy · Answer 9 · Mon Apr 24 2023 19:42:00 GMT+0800 (China Standard Time)

I tried the solution from this post #247 but got the same error.

Chris Seymour · Answer 10 · Mon Apr 24 2023 20:13:06 GMT+0800 (China Standard Time)

Try with --batch 32
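The allocation that fails in the traceback, stay_scores.new_full((T, N, L), S.zero), creates one dense tensor of T * N * L elements. Reading N as the batch dimension is an assumption based on the usual (time, batch, ...) layout, not something stated in the thread; under that reading the tensor grows linearly with --batch, which is why a smaller batch can fit in GPU memory. A back-of-the-envelope sketch, assuming float32 (4 bytes per element, since the loss casts scores with scores.to(torch.float32)); T and L below are made-up values and tensor_mib is a hypothetical helper.

```python
def tensor_mib(T: int, N: int, L: int, bytes_per_elem: int = 4) -> float:
    """Size in MiB of a dense (T, N, L) tensor; float32 uses 4 bytes per element."""
    return T * N * L * bytes_per_elem / 2**20


# Hypothetical T and L, chosen only to show how the size scales with batch size N.
T, L = 1000, 1024
for batch in (128, 64, 32):
    print(f"--batch {batch}: {tensor_mib(T, batch, L):.1f} MiB")
# prints 500.0, 250.0 and 125.0 MiB respectively
```

Halving the batch halves this tensor; other per-batch buffers in the loss are likely to shrink similarly, but that is not shown in the thread.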

Partha Sarathi Tripathy · Answer 11 · Tue Apr 25 2023 02:56:15 GMT+0800 (China Standard Time)

This time I got a new error. Please help, @iiSeymour:

bonito train --epochs 1 --lr 5e-4 --pretrained dna_r9.4.1_e8_sup@v3.3 --directory ./ ./data/training/fine-tuned-model -f --batch 32
[loading model]
[using pretrained model dna_r9.4.1_e8_sup@v3.3]
[loading data]
[validation set not found: splitting training set]
[834/834]: 100%|##############################################################| [00:15, loss=0.1346]
Traceback (most recent call last):
  File "/home/partha/anaconda3/envs/bonito/bin/bonito", line 8, in <module>
    sys.exit(main())
  File "/home/partha/anaconda3/envs/bonito/lib/python3.10/site-packages/bonito/__init__.py", line 34, in main
    args.func(args)
  File "/home/partha/anaconda3/envs/bonito/lib/python3.10/site-packages/bonito/cli/train.py", line 101, in main
    trainer.fit(workdir, args.epochs, lr)
  File "/home/partha/anaconda3/envs/bonito/lib/python3.10/site-packages/bonito/training.py", line 223, in fit
    val_loss, val_mean, val_median = self.validate_one_epoch()
  File "/home/partha/anaconda3/envs/bonito/lib/python3.10/site-packages/bonito/training.py", line 183, in validate_one_epoch
    seqs, refs, accs, losses = zip(*(self.validate_one_step(batch) for batch in self.valid_loader))
ValueError: not enough values to unpack (expected 4, got 0)

Chris Seymour · Answer 12 · Fri May 05 2023 18:00:55 GMT+0800 (China Standard Time)

Fixed in #341.
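The last ValueError comes from unpacking zip(*...) over the validation batches: if self.valid_loader yields no batches, zip() produces nothing, so there are zero values to unpack into seqs, refs, accs, losses. Both training runs printed "[validation set not found: splitting training set]", so the validation batches came from a split of the training chunks; why that split produced no batches is not stated in the thread. The sketch reproduces the error in plain Python and shows one way to guard against it; validate and its arguments are hypothetical stand-ins, not bonito code, and this is not necessarily how #341 fixed it.

```python
def validate(batches, validate_one_step):
    """Hypothetical stand-in for a validation loop that unpacks four result columns."""
    results = [validate_one_step(batch) for batch in batches]
    if not results:
        # With no batches, zip(*results) is empty and unpacking it into four
        # names would raise "not enough values to unpack (expected 4, got 0)".
        raise RuntimeError("validation loader produced no batches")
    seqs, refs, accs, losses = zip(*results)
    return seqs, refs, accs, losses


# Reproducing the original failure mode directly:
try:
    seqs, refs, accs, losses = zip(*[])
except ValueError as err:
    print(err)  # not enough values to unpack (expected 4, got 0)
```

Failing with an explicit message makes an empty validation split obvious instead of surfacing as an unpacking error after a full training epoch.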