jerryji1993 / DNABERT

DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome

Home Page: https://doi.org/10.1093/bioinformatics/btab083

training with different k-mer

berkuva opened this issue · comments

I am trying to train the model on my own data, which consists of 10-mers. Running part 2.2 with the following command:

cd examples

export KMER=10
export TRAIN_FILE=PATH_TO_MY_10MER_FILE.txt
export TEST_FILE=PATH_TO_MY_10MER_FILE.txt
export SOURCE=PATH_TO_DNABERT_REPO
export OUTPUT_PATH=output$KMER

python run_pretrain.py \
    --output_dir $OUTPUT_PATH \
    --model_type=dna \
    --tokenizer_name=dna$KMER \
    --config_name=$SOURCE/src/transformers/dnabert-config/bert-config-$KMER/config.json \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --mlm \
    --gradient_accumulation_steps 25 \
    --per_gpu_train_batch_size 10 \
    --per_gpu_eval_batch_size 6 \
    --save_steps 500 \
    --save_total_limit 20 \
    --max_steps 200000 \
    --evaluate_during_training \
    --logging_steps 500 \
    --line_by_line \
    --learning_rate 4e-4 \
    --block_size 512 \
    --adam_epsilon 1e-6 \
    --weight_decay 0.01 \
    --beta1 0.9 \
    --beta2 0.98 \
    --mlm_probability 0.025 \
    --warmup_steps 10000 \
    --overwrite_output_dir \
    --n_process 24

I am getting the following error:

OSError: Model name '/Users/hyunjaecho/Desktop/code/DNABERT/src/transformers/dnabert-config/bert-config-10/config.json' was not found in model name list. We assumed 'https://s3.amazonaws.com/models.huggingface.co/bert//Users/hyunjaecho/Desktop/code/DNABERT/src/transformers/dnabert-config/bert-config-10/config.json/config.json' was a path, a model identifier, or url to a configuration file named config.json or a directory containing such a file but couldn't find any such file at this path or url.

I believe this could be because, with k=10, vocab.txt cannot be found (#55 ). How do I train the model from scratch using a k other than 3, 4, 5, or 6? Do we need to create our own vocab.txt file? Thanks.

[NOT SOLVED YET]
@jerryji1993 and @Zhihan1996

1. Created vocab.txt for your chosen k-mer. For me, k=10.

Here's how to create vocab.txt for k=10. Change k below for a different k-mer size.

# source: https://stackoverflow.com/a/38202625
import itertools

k = 10
bases = 'ACTG'

# All 4**k possible k-mers, in the order given by `bases`.
vocabs = ["".join(seq) for seq in itertools.product(bases, repeat=k)]

# Write one k-mer per line.
with open('PATH_TO_vocab.txt', 'w') as f:
    for vocab in vocabs:
        f.write(vocab)
        f.write('\n')
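A quick sanity check on the list before writing it out (illustrative only, reusing the names from the snippet above):

# Illustrative check of the generated k-mers, reusing `vocabs` from above.
print(len(vocabs))            # 1048576 == 4**10
print(vocabs[0], vocabs[-1])  # AAAAAAAAAA GGGGGGGGGG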

Then add the following special tokens to it:

[PAD]
[UNK]
[CLS]
[SEP]
[MASK]

For k=10, vocab size is 4^10+5 = 1048581.
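A minimal check (same placeholder path as above) that the finished file really has 4^k + 5 lines:

# Minimal check: the finished vocab.txt should have 4**k + 5 lines
# (all k-mers plus the five special tokens). Path is a placeholder.
k = 10
with open('PATH_TO_vocab.txt') as f:
    n_lines = sum(1 for _ in f)
assert n_lines == 4**k + 5, n_lines  # 1048581 for k=10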

2. bert-config-10 folder

Besides vocab.txt, I created a bert-config-10 folder in DNABERT/src/transformers/dnabert-config/ and created config.json, special_tokens_map.json, and tokenizer_config.json inside it, along with vocab.txt.
My sequence length is 20000, so I edited vocab_size=1048581 and max_position_embeddings=20000 in config.json, and max_len=20000 in tokenizer_config.json.
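In case it helps, here is a small sketch of applying those two config.json edits programmatically; the field names are the standard BERT config keys (vocab_size, max_position_embeddings) and the path is a placeholder:

# Sketch: apply the step-2 edits to the copied config.json.
# Field names are the standard BERT config keys; adjust the path to your setup.
import json

cfg_path = 'DNABERT/src/transformers/dnabert-config/bert-config-10/config.json'
with open(cfg_path) as f:
    cfg = json.load(f)

cfg['vocab_size'] = 4**10 + 5            # 1048581
cfg['max_position_embeddings'] = 20000   # matches my 20000 bp sequences

with open(cfg_path, 'w') as f:
    json.dump(cfg, f, indent=2)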

3. Edited tokenization_dna.py

Edit VOCAB_KMER on line 54 of tokenization_dna.py:

VOCAB_KMER = {
    "69": "3",
    "261": "4",
    "1029": "5",
    "4101": "6",
    "1048581": "10",
}

4. DNABERT-XL?:

From #39:

Please use the tag --model_type=dnalong and set the --block_size as a multiple of 512. The DNABERT and DNABERT-XL use the same checkpoint (parameters).

Neither dnalong nor dnalongcat works for --model_type.
Note --block_size 20480, which I got by multiplying 512 by 40 (the first multiple of 512 greater than my sequence length of 20000).
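The block-size arithmetic as a tiny, purely illustrative helper (padded_block_size is just a name made up for this sketch); the full command follows below:

# Smallest multiple of 512 that covers a given sequence length (illustrative only).
import math

def padded_block_size(seq_len, block=512):
    return block * math.ceil(seq_len / block)

print(padded_block_size(20000))  # 20480 = 512 * 40; 512 * 39 = 19968 falls short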

cd examples

export KMER=10
export TRAIN_FILE=PATH_TO_MY_10MER_FILE.txt
export TEST_FILE=PATH_TO_MY_10MER_FILE.txt
export SOURCE=PATH_TO_DNABERT_REPO
export OUTPUT_PATH=output$KMER

python run_pretrain.py \
    --output_dir $OUTPUT_PATH \
    --model_type=dna \
    --tokenizer_name=PATH_TO_vocab.txt \
    --config_name=$SOURCE/src/transformers/dnabert-config/bert-config-$KMER/config.json \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --mlm \
    --gradient_accumulation_steps 25 \
    --per_gpu_train_batch_size 10 \
    --per_gpu_eval_batch_size 6 \
    --save_steps 500 \
    --save_total_limit 20 \
    --max_steps 200000 \
    --evaluate_during_training \
    --logging_steps 500 \
    --line_by_line \
    --learning_rate 4e-4 \
    --block_size 20480 \
    --adam_epsilon 1e-6 \
    --weight_decay 0.01 \
    --beta1 0.9 \
    --beta2 0.98 \
    --mlm_probability 0.025 \
    --warmup_steps 10000 \
    --overwrite_output_dir \
    --n_process 24

Error: KeyError: '10'

============================================================
<class 'transformers.tokenization_dna.DNATokenizer'>
08/31/2022 15:33:27 - INFO - transformers.tokenization_utils - Model name '/Users/hyunjaecho/Desktop/code/DNABERT/src/transformers/dnabert-config/bert-config-10/vocab.txt' not found in model shortcut name list (dna3, dna4, dna5, dna6, dna10). Assuming '/Users/hyunjaecho/Desktop/code/DNABERT/src/transformers/dnabert-config/bert-config-10/vocab.txt' is a path, a model identifier, or url to a directory containing tokenizer files.
08/31/2022 15:33:27 - WARNING - transformers.tokenization_utils - Calling DNATokenizer.from_pretrained() with the path to a single file or url is deprecated
08/31/2022 15:33:27 - INFO - transformers.tokenization_utils - loading file /Users/hyunjaecho/Desktop/code/DNABERT/src/transformers/dnabert-config/bert-config-10/vocab.txt
08/31/2022 15:33:28 - INFO - main - Training new model from scratch
08/31/2022 15:34:12 - INFO - main - Training/evaluation parameters Namespace(adam_epsilon=1e-06, beta1=0.9, beta2=0.98, block_size=20480, cache_dir=None, config_name='/Users/hyunjaecho/Desktop/code/DNABERT/src/transformers/dnabert-config/bert-config-10/config.json', device=device(type='cpu'), do_eval=True, do_train=True, eval_all_checkpoints=False, eval_data_file='/Users/hyunjaecho/Desktop/code/unencoded/chrY.txt', evaluate_during_training=True, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=25, learning_rate=0.0004, line_by_line=True, local_rank=-1, logging_steps=500, max_grad_norm=1.0, max_steps=200000, mlm=True, mlm_probability=0.025, model_name_or_path=None, model_type='dna', n_gpu=0, n_process=24, no_cuda=False, num_train_epochs=1.0, output_dir='output10', overwrite_cache=False, overwrite_output_dir=True, per_gpu_eval_batch_size=6, per_gpu_train_batch_size=10, save_steps=500, save_total_limit=20, seed=42, server_ip='', server_port='', should_continue=False, tokenizer_name='/Users/hyunjaecho/Desktop/code/DNABERT/src/transformers/dnabert-config/bert-config-10/vocab.txt', train_data_file='/Users/hyunjaecho/Desktop/code/unencoded/chrY.txt', warmup_steps=10000, weight_decay=0.01)
08/31/2022 15:34:12 - INFO - main - Loading features from cached file /Users/hyunjaecho/Desktop/code/unencoded/dna_cached_lm_20480_chrY.txt
08/31/2022 15:34:12 - INFO - main - ***** Running training *****
08/31/2022 15:34:12 - INFO - main - Num examples = 555
08/31/2022 15:34:12 - INFO - main - Num Epochs = 100001
08/31/2022 15:34:12 - INFO - main - Instantaneous batch size per GPU = 10
08/31/2022 15:34:12 - INFO - main - Total train batch size (w. parallel, distributed & accumulation) = 250
08/31/2022 15:34:12 - INFO - main - Gradient Accumulation steps = 25
08/31/2022 15:34:12 - INFO - main - Total optimization steps = 200000
Iteration: 0%| | 0/56 [00:00<?, ?it/s]
Epoch: 0%| | 0/100001 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "run_pretrain.py", line 888, in <module>
    main()
  File "run_pretrain.py", line 838, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "run_pretrain.py", line 421, in train
    inputs, labels = mask_tokens(batch, tokenizer, args) if args.mlm else (batch, batch)
  File "run_pretrain.py", line 254, in mask_tokens
    mask_list = MASK_LIST[tokenizer.kmer]
KeyError: '10'
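For what it's worth, the KeyError comes from the MASK_LIST dictionary in run_pretrain.py (looked up at line 254 in the traceback above), which as far as I can tell only ships with entries for k = 3 to 6, so it needs a "10" entry just like VOCAB_KMER did. A sketch; the existing entries are as I recall them from the released code, and the offsets for "10" are my assumption, following the pattern of masking the k-1 neighbouring (overlapping) k-mers:

# Sketch of the MASK_LIST edit in run_pretrain.py. The k=3..6 entries are as I
# recall them from the released code; the "10" entry is an assumption that
# extends the same pattern (mask the k-1 overlapping neighbours of each
# sampled k-mer).
MASK_LIST = {
    "3": [-1, 1],
    "4": [-1, 1, 2],
    "5": [-2, -1, 1, 2],
    "6": [-2, -1, 1, 2, 3],
    "10": [-4, -3, -2, -1, 1, 2, 3, 4, 5],  # assumed: 4 left + 5 right neighbours
}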

I want to use kmer=26. How can I prepare the vocab.txt file for that? I tried the above code for it, but the process gets killed.