Training with a different k-mer

I am trying to train the model on my own data, which consists of 10-mers. Running part 2.2 with the following command:

cd examples

export KMER=10
export OUTPUT_PATH=output$KMER

--output_dir $OUTPUT_PATH
--gradient_accumulation_steps 25
--per_gpu_train_batch_size 10
--per_gpu_eval_batch_size 6
--save_steps 500
--save_total_limit 20
--max_steps 200000
--logging_steps 500
--learning_rate 4e-4
--block_size 512
--adam_epsilon 1e-6
--weight_decay 0.01
--beta1 0.9
--beta2 0.98
--mlm_probability 0.025
--warmup_steps 10000
--n_process 24

I am getting the following error:

OSError: Model name '/Users/hyunjaecho/Desktop/code/DNABERT/src/transformers/dnabert-config/bert-config-10/config.json' was not found in model name list. We assumed '' was a path, a model identifier, or url to a configuration file named config.json or a directory containing such a file but couldn't find any such file at this path or url.

I believe this happens because vocab.txt cannot be found for k=10 (#55). How do I train the model from scratch using a k other than 3, 4, 5, or 6? Do we need to create our own vocab.txt file? Thanks.

1. Create vocab.txt for your k-mer. For me, k=10.

Here's how to create vocab.txt for k=10. For a different k, change the value of k below.

# source:
import itertools

bases = 'ACTG'
k = 10  # k-mer length

# All 4^k k-mers, in itertools.product order
vocabs = ["".join(seq) for seq in itertools.product(bases, repeat=k)]

# Write one k-mer per line
with open('PATH_TO_vocab.txt', 'w') as f:
    for vocab in vocabs:
        f.write(vocab + '\n')

Then add the following to it.


For k=10, vocab size is 4^10+5 = 1048581.
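The same file can also be written in one pass, without holding every k-mer in memory. This is a minimal sketch, not the repository's own script. It assumes the five extra entries counted in 4^10+5 are the standard BERT special tokens ([PAD], [UNK], [CLS], [SEP], [MASK]) and that they go at the top of the file; check both against one of the existing vocab.txt files before using it. `write_vocab` and `SPECIAL_TOKENS` are hypothetical names.

```python
import itertools

# Assumption: the five standard BERT special tokens, written before the k-mers.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def write_vocab(path, k, bases="ACTG", special_tokens=SPECIAL_TOKENS):
    """Write the special tokens, then every k-mer over `bases`, one per line.

    Returns the number of lines written, i.e. the vocabulary size.
    """
    n_lines = 0
    with open(path, "w") as f:
        for token in special_tokens:
            f.write(token + "\n")
            n_lines += 1
        # itertools.product is lazy, so k-mers are generated one at a time.
        for kmer in itertools.product(bases, repeat=k):
            f.write("".join(kmer) + "\n")
            n_lines += 1
    return n_lines
```

Calling `write_vocab("PATH_TO_vocab.txt", 10)` should return 1048581, matching the vocab size above.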

2. Create the bert-config-10 folder

Besides vocab.txt, I created a bert-config-10 folder in DNABERT/src/transformers/dnabert-config/ and put config.json, special_tokens_map.json, and tokenizer_config.json inside it along with vocab.txt.
My sequence length is 20000, so I set vocab_size=1048581 and max_position_embeddings=20000 in config.json, and max_len=20000 in tokenizer_config.json.
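If you would rather script these edits than make them by hand, something like the following should work. It is only a sketch: it assumes the JSON files already exist in the folder (for example, copied from one of the existing bert-config folders, which is my assumption), and `patch_json` is a hypothetical helper, not part of DNABERT.

```python
import json


def patch_json(path, **updates):
    """Load a JSON file, overwrite the given top-level keys, and save it back."""
    with open(path) as f:
        config = json.load(f)
    config.update(updates)
    with open(path, "w") as f:
        json.dump(config, f, indent=2)
    return config


# Example calls (paths are placeholders for your own bert-config-10 folder):
# patch_json("bert-config-10/config.json",
#            vocab_size=4 ** 10 + 5, max_position_embeddings=20000)
# patch_json("bert-config-10/tokenizer_config.json", max_len=20000)
```

All other keys in the files are left untouched.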

3. Edited

Edit VOCAB_KMER in line 54 in

    "69": "3",
    "261": "4",
    "1029": "5",
    "4101": "6",

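The four keys listed here are all equal to 4^k + 5, the same vocab size formula as in step 1, so the entry for a new k can be computed instead of typed by hand. A small sketch, assuming VOCAB_KMER maps the vocab size (as a string) to k (as a string), as those four entries suggest; `vocab_kmer_entry` is a hypothetical helper.

```python
def vocab_kmer_entry(k, n_bases=4, n_special=5):
    """Return the (vocab_size, k) pair as strings for a given k."""
    return str(n_bases ** k + n_special), str(k)


# Reproduce the existing entries and derive the one for k=10.
entries = dict(vocab_kmer_entry(k) for k in (3, 4, 5, 6, 10))
for size, k in entries.items():
    print(f'"{size}": "{k}",')
```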

From #39 :

Please use the tag --model_type=dnalong and set the --block_size as a multiple of 512. The DNABERT and DNABERT-XL use the same checkpoint (parameters).

Neither dnalongcat nor dnalong works for --model_type.
Note that I used --block_size 20480, which I got from multiplying 512 by 40 (the first multiple of 512 greater than my sequence length of 20000).

Error: KeyError: '10'
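For anyone with a different sequence length, the block size arithmetic used in this post can be checked with a couple of lines. This is just a sketch of rounding up to the next multiple of 512; `round_up_to_multiple` is a hypothetical helper name.

```python
import math


def round_up_to_multiple(length, base=512):
    """Smallest multiple of `base` that is greater than or equal to `length`."""
    return math.ceil(length / base) * base


print(round_up_to_multiple(20000))  # 20480, i.e. 512 * 40
```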

I want to use kmer=26. How can I prepare the vocab.txt file for that? I tried the code above, but the process was killed.
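For a sense of scale: the number of k-mers is 4^k, so k=26 means 4^26 (about 4.5 × 10^15) vocabulary entries. Building that list in memory is a likely reason for the process being killed, though I have not confirmed this on your machine. The sketch below estimates the vocab size and a rough vocab.txt size for a given k; `vocab_stats` is a hypothetical helper, and the byte count assumes one ASCII byte per base plus a newline and ignores the special token lines.

```python
def vocab_stats(k, n_bases=4, n_special=5):
    """Return (vocab_size, approximate vocab.txt size in bytes) for a given k."""
    n_kmers = n_bases ** k
    vocab_size = n_kmers + n_special
    file_bytes = n_kmers * (k + 1)  # k characters plus a newline per k-mer
    return vocab_size, file_bytes


for k in (6, 10, 26):
    size, n_bytes = vocab_stats(k)
    print(f"k={k}: vocab_size={size:,}, vocab.txt is roughly {n_bytes / 1e9:,.3f} GB")
```

Writing the file line by line would avoid the memory problem, but not the size of the file itself or of an embedding table with that many rows.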