jerryji1993 / DNABERT

DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome

Home Page: https://doi.org/10.1093/bioinformatics/btab083


error

Moy4e opened this issue · comments


The ginkgo_total.kmer file is about 60 GB. My submission script:

```bash
#!/bin/bash
#SBATCH -o out
#SBATCH -N 2          # request 2 nodes
module load anaconda/2020.11 cuda/10.2
source activate dnabert
export PYTHONUNBUFFERED=1

export KMER=6
export TRAIN_FILE=sample_data/pre/ginkgo_total.kmer
export TEST_FILE=sample_data/pre/ginkgo_total.kmer
export SOURCE=/data/home/scy/github/DNABERT-master
export OUTPUT_PATH=output$KMER

python run_pretrain.py \
    --output_dir $OUTPUT_PATH \
    --model_type=dna \
    --tokenizer_name=dna$KMER \
    --config_name=$SOURCE/src/transformers/dnabert-config/bert-config-$KMER/config.json \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --mlm \
    --gradient_accumulation_steps 25 \
    --per_gpu_train_batch_size 10 \
    --per_gpu_eval_batch_size 6 \
    --save_steps 500 \
    --save_total_limit 20 \
    --max_steps 200000 \
    --evaluate_during_training \
    --logging_steps 500 \
    --line_by_line \
    --learning_rate 4e-4 \
    --block_size 512 \
    --adam_epsilon 1e-6 \
    --weight_decay 0.01 \
    --beta1 0.9 \
    --beta2 0.98 \
    --mlm_probability 0.025 \
    --warmup_steps 10000 \
    --overwrite_output_dir \
    --n_process 24
```
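For reference, these flags imply an effective batch size of `per_gpu_train_batch_size × n_gpu × gradient_accumulation_steps` sequences per optimizer step. A quick back-of-the-envelope check (the 8 GPUs are taken from the log below, not from a script flag):

```python
# Sequences consumed per optimizer update with the flags above
per_gpu_train_batch_size = 10     # --per_gpu_train_batch_size
n_gpu = 8                         # GPUs reported in the log below (n_gpu: 8)
gradient_accumulation_steps = 25  # --gradient_accumulation_steps

effective_batch_size = per_gpu_train_batch_size * n_gpu * gradient_accumulation_steps
print(effective_batch_size)       # 2000 sequences of up to --block_size 512 tokens each
```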


```
02/23/2022 16:57:13 - WARNING - __main__ - Process rank: -1, device: cuda:0, n_gpu: 8, distributed training: False, 16-bits training: False
02/23/2022 16:57:13 - INFO - transformers.configuration_utils - loading configuration file /data/home/scy/github/DNABERT-master/src/transformers/dnabert-config/bert-config-6/config.json
02/23/2022 16:57:13 - INFO - transformers.configuration_utils - Model config BertConfig {
"architectures": [
"BertForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"do_sample": false,
"eos_token_ids": 0,
"finetuning_task": null,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"initializer_range": 0.02,
"intermediate_size": 3072,
"is_decoder": false,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"layer_norm_eps": 1e-12,
"length_penalty": 1.0,
"max_length": 20,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_beams": 1,
"num_hidden_layers": 12,
"num_labels": 2,
"num_return_sequences": 1,
"num_rnn_layer": 1,
"output_attentions": false,
"output_hidden_states": false,
"output_past": true,
"pad_token_id": 0,
"pruned_heads": {},
"repetition_penalty": 1.0,
"rnn": "lstm",
"rnn_dropout": 0.0,
"rnn_hidden": 768,
"split": 10,
"temperature": 1.0,
"top_k": 50,
"top_p": 1.0,
"torchscript": false,
"type_vocab_size": 2,
"use_bfloat16": false,
"vocab_size": 4101
}

============================================================
<class 'transformers.tokenization_dna.DNATokenizer'>
02/23/2022 16:58:34 - INFO - transformers.tokenization_utils - loading file https://raw.githubusercontent.com/jerryji1993/DNABERT/master/src/transformers/dnabert-config/bert-config-6/vocab.txt from cache at /data/home/scy/.cache/torch/transformers/ea1474aad40c1c8ed4e1cb7c11345ddda6df27a857fb29e1d4c901d9b900d32d.26f8bd5a32e49c2a8271a46950754a4a767726709b7741c68723bc1db840a87e
02/23/2022 16:58:34 - INFO - __main__ - Training new model from scratch
02/23/2022 16:58:40 - INFO - __main__ - Training/evaluation parameters Namespace(adam_epsilon=1e-06, beta1=0.9, beta2=0.98, block_size=512, cache_dir=None, config_name='/data/home/scy/github/DNABERT-master/src/transformers/dnabert-config/bert-config-6/config.json', device=device(type='cuda', index=0), do_eval=True, do_train=True, eval_all_checkpoints=False, eval_data_file='sample_data/pre/ginkgo_total.kmer', evaluate_during_training=True, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=25, learning_rate=0.0004, line_by_line=True, local_rank=-1, logging_steps=500, max_grad_norm=1.0, max_steps=200000, mlm=True, mlm_probability=0.025, model_name_or_path=None, model_type='dna', n_gpu=8, n_process=24, no_cuda=False, num_train_epochs=1.0, output_dir='output6', overwrite_cache=False, overwrite_output_dir=True, per_gpu_eval_batch_size=6, per_gpu_train_batch_size=10, save_steps=500, save_total_limit=20, seed=42, server_ip='', server_port='', should_continue=False, tokenizer_name='dna6', train_data_file='sample_data/pre/ginkgo_total.kmer', warmup_steps=10000, weight_decay=0.01)
02/23/2022 16:58:40 - INFO - __main__ - Creating features from dataset file at sample_data/pre/ginkgo_total.kmer
0 start
1 start
2 start
3 start
4 start
5 start
6 start
7 start
8 start
9 start
10 start
11 start
12 start
13 start
14 start
15 start
16 start
17 start
18 start
19 start
20 start
21 start
22 start
23 start
Traceback (most recent call last):
  File "run_pretrain.py", line 885, in <module>
    main()
  File "run_pretrain.py", line 830, in main
    train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False)
  File "run_pretrain.py", line 200, in load_and_cache_examples
    return LineByLineTextDataset(tokenizer, args, file_path=file_path, block_size=args.block_size)
  File "run_pretrain.py", line 183, in __init__
    ids = result.get()
  File "/data/home/scy/.conda/envs/dnabert/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
  File "/data/home/scy/.conda/envs/dnabert/lib/python3.6/multiprocessing/pool.py", line 424, in _handle_tasks
    put(task)
  File "/data/home/scy/.conda/envs/dnabert/lib/python3.6/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/data/home/scy/.conda/envs/dnabert/lib/python3.6/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
```
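The `struct.error` at the bottom is Python 3.6's 2 GiB limit on messages sent over multiprocessing pipes: `Connection._send_bytes` packs the pickled payload length with `struct.pack("!i", n)`, a signed 32-bit integer, and the traceback shows the failure while a chunk of input lines is dispatched to a worker (`_handle_tasks` → `put(task)`). With `--n_process 24`, each worker receives roughly 60 GB / 24 ≈ 2.5 GB of text, which exceeds 2,147,483,647 bytes. Possible workarounds (untested here, stated as assumptions): raise `--n_process` so each chunk stays well under 2 GiB, run under Python 3.8+ where this pipe limit was lifted, or pre-split the corpus and pretrain on smaller shards. A minimal sketch of the size check, assuming the paths from the report:

```python
# Rough estimate of the chunk each worker receives in LineByLineTextDataset
# when the corpus is split across --n_process workers.
import os

corpus = "sample_data/pre/ginkgo_total.kmer"  # ~60 GB file from the report
n_process = 24                                # --n_process

chunk_bytes = os.path.getsize(corpus) / n_process
limit = 2**31 - 1  # largest payload a signed 32-bit length header can describe

print(f"~{chunk_bytes / 2**30:.1f} GiB per worker chunk")
print("exceeds multiprocessing pipe limit:", chunk_bytes > limit)
```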