facebookresearch / BLINK

Entity Linker solution

A short tutorial on how to train a smaller biencoder model on a custom dataset

abhinavkulkarni opened this issue · comments

Hi,

I have seen a lot of comments asking how to create a custom dataset, how to use a smaller or different BERT base model for the biencoder, or how to modify certain hyperparameters (such as the context length), so I have decided to write a small tutorial covering these.

First of all, here's what the data directory structure looks like:

$ tree data/
data/
├── blink_format
│   ├── test.jsonl
│   ├── train.jsonl
│   └── valid.jsonl
└── documents
    └── documents.jsonl

documents.jsonl is the file containing all the candidates. Here's what it looks like:

$ cat data/documents/documents.jsonl | jq 'select((.title=="Elon Musk") or (.title=="Steve Jobs"))'
{
  "title": "Elon Musk",
  "text": "Elon Reeve Musk (; born June 28, 1971) is an entrepreneur and business magnate. He is the founder, CEO and chief engineer at SpaceX; early stage investor, CEO, and product architect of Tesla, Inc.; founder of The Boring Company; and co-founder of Neuralink and OpenAI. A centibillionaire, Musk is one of the richest people in the world.\nMusk was born to a Canadian mother and South African father and raised in Pretoria, South Africa. He briefly attended the University of Pretoria before moving to Canada aged 17 to attend Queen's University. He transferred to the University of Pennsylvania two years later, where he received bachelors' degrees in economics and physics. He moved to California in 1995 to attend Stanford University but decided instead to pursue a business career, co-founding",
  "document_id": 909036
}
{
  "title": "Steve Jobs",
  "text": "Steven Paul Jobs (; February 24, 1955 – October 5, 2011) was an American business magnate, industrial designer, investor, and media proprietor. He was the chairman, chief executive officer (CEO), and co-founder of Apple Inc.; the chairman and majority shareholder of Pixar; a member of The Walt Disney Company's board of directors following its acquisition of Pixar; and the founder, chairman, and CEO of NeXT. Jobs is widely recognized as a pioneer of the personal computer revolution of the 1970s and 1980s, along with his early business partner and fellow Apple co-founder Steve Wozniak.\nJobs was born in San Francisco, California, and put up for adoption. He was raised in the San Francisco Bay Area. He attended Reed College in 1972 before dropping out that same year, and traveled",
  "document_id": 7412236
}

document_id is an identifier for the document; in my case, it is the Wikipedia page_id.
For example, the second document refers to https://en.wikipedia.org/?curid=7412236.
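
For reference, here is a hypothetical sketch of how documents.jsonl can be written. The pages list and its field names are placeholders for whatever your entity source is; only the output keys (title, text, document_id) come from the format above.

import json

# Hypothetical input: one dict per entity from your own source (here, Wikipedia page_ids)
pages = [
    {"page_id": 909036, "title": "Elon Musk", "text": "Elon Reeve Musk (; born June 28, 1971) is an entrepreneur ..."},
    {"page_id": 7412236, "title": "Steve Jobs", "text": "Steven Paul Jobs (; February 24, 1955 - October 5, 2011) was ..."},
]

with open("data/documents/documents.jsonl", "w") as f:
    for page in pages:
        record = {"title": page["title"], "text": page["text"], "document_id": page["page_id"]}
        f.write(json.dumps(record) + "\n")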

Here's what the train/test/valid data looks like:

$ cat data/blink_format/*.jsonl | jq 'select((.label_title=="Elon Musk") or (.label_title=="Steve Jobs"))'
{
  "context_left": "3 console.\nSome users claim that Nvidia's Linux drivers impose artificial restrictions, like limiting the number of monitors that can be used at the same time, but the company has not commented on these accusations.\nIn 2014, with Maxwell GPUs, Nvidia started to require firmware by them to unlock all features of its graphics cards. Up to now, this state has not changed and makes writing open-source drivers difficult.\nDeep learning.\nNvidia GPUs are often used in deep learning, and accelerated analytics due to Nvidia's API CUDA which allows programmers to utilize the higher number of cores present in GPUs to parallelize programs. This parallelization can dramatically increase training speed of machine learning algorithms due to their extensive use of matrix and vector operations. They were included in many Tesla vehicles before",
  "context_right": "announced at Tesla Autonomy Day in 2019 that the company developed its own SoC and Full Self-Driving computer now and would stop using Nvidia hardware for their vehicles. According to \"TechRepublic\", Nvidia GPUs \"work well for deep learning tasks because they are designed for parallel computing and do well to handle the vector and matrix operations that are prevalent in deep learning\". These GPUs are used by researchers, laboratories, tech companies and enterprise companies. In 2009, Nvidia was involved in what was called the \"big bang\" of deep learning, \"as deep-learning neural networks were combined with Nvidia graphics processing units (GPUs)\". That year, the Google Brain used Nvidia GPUs to create Deep Neural Networks capable of machine learning, where Andrew Ng determined that GPUs could increase the speed",
  "mention": "Elon Musk",
  "label_title": "Elon Musk",
  "label": "Elon Reeve Musk (; born June 28, 1971) is an entrepreneur and business magnate. He is the founder, CEO and chief engineer at SpaceX; early stage investor, CEO, and product architect of Tesla, Inc.; founder of The Boring Company; and co-founder of Neuralink and OpenAI. A centibillionaire, Musk is one of the richest people in the world.\nMusk was born to a Canadian mother and South African father and raised in Pretoria, South Africa. He briefly attended the University of Pretoria before moving to Canada aged 17 to attend Queen's University. He transferred to the University of Pennsylvania two years later, where he received bachelors' degrees in economics and physics. He moved to California in 1995 to attend Stanford University but decided instead to pursue a business career, co-founding",
  "label_id": 304371
}
{
  "context_left": "career, Evans was a postal worker with the Royal Mail, in Airdrie, near Glasgow. He has a college degree in web design.\n Music career .\nEvans had been posting performances of pop and folk songs to TikTok before beginning to post sea shanties. He posted his first traditional sea shanty, \"Leave Her Johnny\", to TikTok in July 2020. In the following months, viewers of his videos continued to request more sea shanties, leading Evans to posting videos of himself singing \"The Scotsman\" and New Zealand 19th-century shanty \"Wellerman\" in December 2020.\n\"Wellerman\" quickly gained views on TikTok, inspiring many others to record more sea shanties and to imitate and remix Evans's version, including renditions by composer Andrew Lloyd Webber, comedians Jimmy Fallon and Stephen Colbert, guitarist Brian May, and entrepreneur",
  "context_right": ". As of January 22, 2021, \"Wellerman\" had eight million views on TikTok and Evans had hundreds of thousands of followers. Because of its roots on TikTok, the sea shanties trend that Evans launched has been called \"ShantyTok\". In the \"Rolling Stone\" article discussing his success, Evans cited The Albany Shantymen version of the song as inspiration.\nIn January 2021, Evans signed a three-album recording contract with Polydor Records, releasing his official version of \"Wellerman\" on January 22, 2021. A dance remix of the song created with producers 220 Kid and duo Billen Ted was released simultaneously. Evans plans to release a five-song EP of sea shanties. His growing music career led him to quit his job as a postal worker.\nIn February 2021, he signed to United Talent Agency.\nIn",
  "mention": "Elon Musk",
  "label_title": "Elon Musk",
  "label": "Elon Reeve Musk (; born June 28, 1971) is an entrepreneur and business magnate. He is the founder, CEO and chief engineer at SpaceX; early stage investor, CEO, and product architect of Tesla, Inc.; founder of The Boring Company; and co-founder of Neuralink and OpenAI. A centibillionaire, Musk is one of the richest people in the world.\nMusk was born to a Canadian mother and South African father and raised in Pretoria, South Africa. He briefly attended the University of Pretoria before moving to Canada aged 17 to attend Queen's University. He transferred to the University of Pennsylvania two years later, where he received bachelors' degrees in economics and physics. He moved to California in 1995 to attend Stanford University but decided instead to pursue a business career, co-founding",
  "label_id": 304371
}
{
  "context_left": "Inc. (later NeXT Computer, Inc. and NeXT Software, Inc.) was an American computer and software company founded in 1985 by Apple Computer co-founder",
  "context_right": ". Based in Redwood City, California, the company developed and manufactured a series of computer workstations intended for the higher education and business markets. NeXT was founded by Jobs after he was forced out of Apple, along with several co-workers. NeXT introduced the first NeXT Computer in 1988, and the smaller NeXTstation in 1990. The NeXT computers experienced relatively limited sales, with estimates of about 50,000 units shipped in total. Nevertheless, their innovative object-oriented NeXTSTEP operating system and development environment (Interface Builder) were highly influential.\nThe first major outside investment was from Ross Perot, who invested after seeing a segment about NeXT on a 1986 PBS documentary titled \"Entrepreneurs\". In 1987, he invested $20 million in exchange for 16 percent of NeXT's stock and subsequently joined the board of",
  "mention": "Steve Jobs",
  "label_title": "Steve Jobs",
  "label": "Steven Paul Jobs (; February 24, 1955 – October 5, 2011) was an American business magnate, industrial designer, investor, and media proprietor. He was the chairman, chief executive officer (CEO), and co-founder of Apple Inc.; the chairman and majority shareholder of Pixar; a member of The Walt Disney Company's board of directors following its acquisition of Pixar; and the founder, chairman, and CEO of NeXT. Jobs is widely recognized as a pioneer of the personal computer revolution of the 1970s and 1980s, along with his early business partner and fellow Apple co-founder Steve Wozniak.\nJobs was born in San Francisco, California, and put up for adoption. He was raised in the San Francisco Bay Area. He attended Reed College in 1972 before dropping out that same year, and traveled",
  "label_id": 1053346
}

The label_id corresponds to the line number (0-indexed) of the label's entry in documents.jsonl. For example,

$ sed -n 304372p data/documents/documents.jsonl | jq
{
  "title": "Elon Musk",
  "text": "Elon Reeve Musk (; born June 28, 1971) is an entrepreneur and business magnate. He is the founder, CEO and chief engineer at SpaceX; early stage investor, CEO, and product architect of Tesla, Inc.; founder of The Boring Company; and co-founder of Neuralink and OpenAI. A centibillionaire, Musk is one of the richest people in the world.\nMusk was born to a Canadian mother and South African father and raised in Pretoria, South Africa. He briefly attended the University of Pretoria before moving to Canada aged 17 to attend Queen's University. He transferred to the University of Pennsylvania two years later, where he received bachelors' degrees in economics and physics. He moved to California in 1995 to attend Stanford University but decided instead to pursue a business career, co-founding",
  "document_id": 909036
}

(note that sed uses 1-based line numbers, hence line 304372 for label_id 304371)
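
If you build your data programmatically, a quick sketch (my own, not part of BLINK) for recovering each entity's label_id is to map titles to their 0-indexed line numbers in documents.jsonl:

import json

label_id_by_title = {}
with open("data/documents/documents.jsonl") as f:
    for line_no, doc_line in enumerate(f):   # enumerate is 0-indexed, matching label_id
        doc = json.loads(doc_line)
        label_id_by_title[doc["title"]] = line_no

print(label_id_by_title["Elon Musk"])  # 304371 in my dataset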

You can train a biencoder model as follows:

$ cd BLINK
$ python3 blink/biencoder/train_biencoder.py \
  --data_path data/blink_format  \
  --output_path models/biencoder  \
  --learning_rate 3e-05  \
  --num_train_epochs 3  \
  --max_context_length 128  \
  --max_cand_length 128 \
  --train_batch_size 32  \
  --eval_batch_size 32  \
  --bert_model google/bert_uncased_L-8_H-512_A-8  \
  --type_optimization all_encoder_layers  \
  --data_parallel \
  --print_interval  100 \
  --eval_interval 2000

The training script output is fairly self-explanatory, and you should be able to verify that the model is making progress from one evaluation round to the next. I would highly recommend starting with a subset of the data and more frequent evaluation rounds to verify that training is progressing well.

Please note that the main branch uses an older transformers library called pytorch-transformers. In order to use any of the HuggingFace BERT base models (such as google/bert_uncased_L-8_H-512_A-8 above), you'll have to make minor changes to the BLINK codebase:

  1. Replace pytorch-transformers with transformers
  2. Use AutoModel and AutoTokenizer instead of BertModel and BertTokenizer, e.g. in biencoder.py:
ctxt_bert = AutoModel.from_pretrained(params["bert_model"])
cand_bert = AutoModel.from_pretrained(params['bert_model'])

and

self.tokenizer = AutoTokenizer.from_pretrained(
    params["bert_model"], do_lower_case=params["lowercase"]
)
  3. Modify how inputs are passed to the model, e.g. in ranker_base.py:
output = self.bert_model(input_ids=token_ids, token_type_ids=segment_ids, attention_mask=attention_mask)
output_bert, output_pooler = output.last_hidden_state, output.pooler_output
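
Putting these pieces together, here is a minimal, self-contained sketch of the transformers-style encoder call (my own illustration, not the exact BLINK code); the checkpoint is the one used in the training command above:

import torch
from transformers import AutoModel, AutoTokenizer

bert_model = "google/bert_uncased_L-8_H-512_A-8"
tokenizer = AutoTokenizer.from_pretrained(bert_model, do_lower_case=True)
model = AutoModel.from_pretrained(bert_model)

enc = tokenizer("Steve Jobs founded NeXT in 1985.", return_tensors="pt")
with torch.no_grad():
    output = model(
        input_ids=enc["input_ids"],
        token_type_ids=enc["token_type_ids"],
        attention_mask=enc["attention_mask"],
    )

# Per-token embeddings and the pooled [CLS] representation, as in ranker_base.py
output_bert, output_pooler = output.last_hidden_state, output.pooler_output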

There are minor issues (such as correct placement of data on CPU/GPU devices, freeing up GPU memory periodically, etc.) - please look at open pull requests and search through reported issues to fix those.

In my case, I also had to modify train/test/valid torch datasets and dataloaders. The ones in the main branch load all the data in memory, causing OOM errors. I created my own IterableDataset to read data on the fly. If you have a multi-core CPU, use multiple workers to feed data to the model on GPU.
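
For illustration, a minimal sketch of such a streaming dataset (my own code, not part of BLINK); the tokenization step is a placeholder that you would replace with BLINK's mention/candidate preprocessing:

import json
from torch.utils.data import IterableDataset, DataLoader, get_worker_info

class JsonlMentionDataset(IterableDataset):
    """Streams one JSON object per line and shards lines across dataloader workers."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        info = get_worker_info()
        num_workers = info.num_workers if info else 1
        worker_id = info.id if info else 0
        with open(self.path) as f:
            for i, line in enumerate(f):
                if i % num_workers != worker_id:
                    continue
                yield json.loads(line)  # replace with tokenized context/candidate tensors

train_loader = DataLoader(
    JsonlMentionDataset("data/blink_format/train.jsonl"),
    batch_size=32,
    num_workers=4,   # multiple workers keep the GPU fed
)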

I was able to train the much smaller google/bert_uncased_L-8_H-512_A-8 model instead of bert-large-uncased (159MB vs 1.25GB) on my custom dataset, on a much smaller, older GPU (an Nvidia GeForce GTX 1060 with 6GB of GPU memory).

After creating a FAISS index of the candidate encodings with dimensionality reduction (512 => 384, via PCA or OPQ) and coarse- and fine-grained product quantization, I am able to run the model relatively quickly on CPU with good accuracy.
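
As a rough sketch of what that index setup looks like (illustrative numbers and factory string, not my exact configuration):

import numpy as np
import faiss

d = 512  # biencoder embedding size for bert_uncased_L-8_H-512_A-8
candidate_encodings = np.random.rand(100000, d).astype("float32")  # stand-in for real encodings

# OPQ rotation + reduction to 384 dims (use "PCA384,..." for plain PCA),
# IVF coarse quantization, and 64-byte PQ codes, scored by inner product
index = faiss.index_factory(d, "OPQ64_384,IVF4096,PQ64", faiss.METRIC_INNER_PRODUCT)
index.train(candidate_encodings)
index.add(candidate_encodings)

faiss.extract_index_ivf(index).nprobe = 32  # IVF lists probed per query (speed/recall trade-off)
scores, ids = index.search(candidate_encodings[:5], 10)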

Thanks FB research team for the great effort!

In BLINK's biencoder, score_candidates is a dot product between two matrices, but the native IndexHNSWFlat only supports L2 distance.
BLINK's faiss wrapper handles this by transforming the dot-product (maximum inner product) space into an L2 space: it appends an extra dimension to each vector and applies a small mathematical transformation (see the sketch at the end of this comment).
However, BLINK's source code doesn't use PCA to reduce the dimensionality of the vectors, and I noticed you mentioned that you performed dimensionality reduction.
When I actually tried this, I ran into problems. I didn't use BLINK's approach; instead, I L2-normalized the vectors, and with that I successfully trained the faiss index and got correct search results.

import faiss

# 768-d embeddings, L2-normalized inside the index, IVF with an HNSW coarse
# quantizer, and flat (uncompressed) storage
index = faiss.index_factory(768, "L2norm,IVF16384_HNSW32,Flat")
index.train(embeddings.numpy())
index.add(embeddings.numpy())

However, when I applied PCA to reduce the vector dimensionality, I couldn't get the correct results, whether I reduced dimensions before normalization or vice versa.

index = faiss.index_factory(768, "PCA256,L2norm,IVF16384_HNSW32,Flat")

What I want to ask is: is your method the same as BLINK's? If so, adding new content would require re-indexing the entire KB, because the maximum norm might change. Or do you normalize the vectors, as I did? Where might my approach have gone wrong, leading to incorrect search results?
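
For context, the extra-dimension transformation I mentioned above is the standard MIPS-to-L2 reduction; this is my own sketch of it (not BLINK's actual indexer code), and it also shows why the maximum norm matters when new vectors are added:

import numpy as np
import faiss

def augment_for_mips(xb):
    """Append sqrt(phi - ||x||^2) to each vector so that L2 search == max inner product."""
    norms = (xb ** 2).sum(axis=1)
    phi = norms.max()
    extra = np.sqrt(phi - norms).reshape(-1, 1)
    return np.hstack([xb, extra]).astype("float32"), phi

d = 768
xb = np.random.rand(10000, d).astype("float32")   # candidate embeddings
xb_aug, phi = augment_for_mips(xb)

index = faiss.IndexHNSWFlat(d + 1, 32)            # native L2 metric
index.add(xb_aug)

xq = np.random.rand(1, d).astype("float32")       # query embedding
xq_aug = np.hstack([xq, np.zeros((1, 1), dtype="float32")])  # queries get a 0 in the extra dim
dist, ids = index.search(xq_aug, 10)              # smallest L2 distance == largest dot product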