Currently, the finetuning is done wrt meta-llama/Llama-2-7b-chat-hf
The main entry-points are finetune.py
and train.sh
- Training and validation sets are splitted (99% train, 1% valid) randomly based on all curated data from PubMed Central Open Access Subset (see Build dataset from scratch).
- The entire dataset is roughly 6GB and 1.3b tokens.
- Both partitions are hosted on huggingface hub (https://huggingface.co/datasets/BrainGPT/train_valid_split_pmc_neuroscience_2002-2022_filtered_subset/tree/main), and can be loaded:
from datasets import load_dataset
dataset = load_dataset(
"BrainGPT/train_valid_split_pmc_neuroscience_2002-2022_filtered_subset"
)
All documents (abstract and fulltext) are concatenated and chunk to 2048 tokens in finetune.py
def tokenize(element, tokenizer, args):
outputs = tokenizer(
element["text"],
truncation=True,
max_length=args.chunk_size,
return_overflowing_tokens=True,
return_length=True,
)
output_ids = list(itertools.chain(*outputs["input_ids"]))
output_mask = list(itertools.chain(*outputs["attention_mask"]))
output_ids = [output_ids[x:x+args.chunk_size] for x in range(0, len(output_ids), args.chunk_size)]
output_mask = [output_mask[x:x+args.chunk_size] for x in range(0, len(output_mask), args.chunk_size)]
return {"input_ids": output_ids, "attention_mask": output_mask}
- Training hyperparameters can be found in
train.sh
batch_size=1
chunk_size=2048
eval_batch_size=16
learning_rate=2e-5
gradient_accumulation_steps=8
num_train_epochs=1
num_warmup_steps=0.03
weight_decay=0.001
lr_scheduler_type="cosine"
- LoRA parameters can be found in
config/lora_config.json
lora_rank=8
lora_alpha=32
lora_dropout=0.1
lora_module=["gate_proj", "up_proj", "down_proj"]
(Variant 1; Fully-connected only)lora_module=["q_proj", "v_proj", "o_proj"]
(Variant 2; Attention only)lora_module=["gate_proj", "up_proj", "down_proj", "q_proj", "v_proj", "o_proj"]
(Variant 3; Full lora)
- Accelerate parameters can be found in
config/accel_config.yaml
All regarding dataset download and curation is in data
python fetch_journal_names.py
will extract top neuroscience journal names (based on https://research.com/journals-rankings/neuroscience) intojournal_names.json
python fetch_fulltext.py
will download articles from the above journals whose full-text versions are accessible from PubMed Central Open Access Subset.python fetch_abstract.py
will download article abstracts from the above journals that are available via PubMed E-utilities API.
.
├── data
│ └── dataset
│ ├── {journal_name}
│ ├── fulltext
│ └── abstract
│ ├── fetch_journal_names.py
│ ├── fetch_fulltext.py
│ └── fetch_abstract.py
Both fulltext/
and abstract/
follow the same structure where each json file is an article named by its doi.