This is the repository for replicating the experiments in our paper:
Wei Zhong, Jheng-Hong Yang, and Jimmy Lin. Evaluating Token-Level and Passage-Level Dense Retrieval Models for Math Information Retrieval.
https://arxiv.org/pdf/2203.11163v2
git clone git@github.com:approach0/math-dense-retrievers.git math-dense-retrievers
cd math-dense-retrievers
We have made our prebuilt indexes (optional), model checkpoints, and corpus files available for download:
wget https://vault.cs.uwaterloo.ca/s/AFTWLbRdKSMBpsK/download -O prebuilt-indexes.tar
# wget https://vault.cs.uwaterloo.ca/s/mAiL4AoHqiSWF8R/download -O experiments.tar.gz # for older version w/o fusion on the NTCIR-12 dataset.
wget https://vault.cs.uwaterloo.ca/s/B2ywSd9N2jNjGYj/download -O experiments.tar.gz
wget https://vault.cs.uwaterloo.ca/s/q5tFQRf8RwZr7dW/download -O corpus.tar.gz
Extract tarballs:
tar xzf corpus.tar.gz
tar xzf experiments.tar.gz
tar xf prebuilt-indexes.tar
If you want to replicate our prebuilt indexes, skip downloading the prebuilt-indexes
tarball and create an empty directory to hold the new indexes you build yourself:
mkdir prebuilt-indexes
Replication Notice: If you choose to build the indexes yourself (from our checkpoints), your replicated evaluation scores may differ slightly from ours. This is due to the non-deterministic nature of FAISS index building. These differences should be minor; in practice, scores tend to differ only in the third decimal place.
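As a quick sanity check after rebuilding (a sketch; it assumes you have already generated your own NTCIR-12 run from code/pya0 as described below), save the evaluation output and compare it against the reference scores listed later in this README:
./eval-ntcir12.sh ../../experiments/runs/search_ntcir12_dpr.run > my_scores.txt
# expect agreement with the reported bpref/P@k numbers up to roughly 3 decimal places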
Install the pya0 package (source code can be found here) and download our pyserini fork, which are used to replicate our results:
pip install pya0==0.3.4
git clone -b patch-colbert git@github.com:w32zhong/pyserini.git ./code/pyserini
wget https://vault.cs.uwaterloo.ca/s/Pbni95czxLWGzJm/download -O ./code/pyserini/pyserini/resources/jars/anserini-0.13.4-SNAPSHOT-fatjar.jar
Download the pya0 source code as well, since it contains the evaluation config file and our experiment script:
git clone -b math-dense-retrievers-replication git@github.com:approach0/pya0.git ./code/pya0
Alternatively, you can download only the files needed to run our evaluations:
wget https://raw.githubusercontent.com/approach0/pya0/math-dense-retrievers-replication/utils/transformer_eval.ini
wget https://raw.githubusercontent.com/approach0/pya0/math-dense-retrievers-replication/experiments/dense_retriever.sh
chmod +x dense_retriever.sh
After downloading and extracting everything, your local directory structure should look like this:
|-code
| |-.keep
| |-pyserini
| |-pya0
| |-conda_list.txt
|-prebuilt-indexes.tar
|-corpus.tar.gz
|-corpus
| |-arqmath2
| |-NTCIR12
|-README.md
|-experiments.tar.gz
|-experiments
| |-1ep-experiment
| |-math-colbert
| |-math-dpr
| |-tokenizers
| |-runs
|-prebuilt-indexes
| |-index-DPR-ntcir12
| |-index-DPR-ntcir12__3ep_pretrain_1ep
| |-index-DPR-arqmath2
| |-index-DPR-ntcir12__7ep_pretrain_1ep
| |-index-DPR-ntcir12__scibert_1ep
| |-index-DPR-ntcir12__vanilla_1ep
| |-index-DPR-arqmath2__3ep_pretrain_1ep
| |-index-ColBERT-ntcir12
| |-index-DPR-arqmath2__7ep_pretrain_1ep
| |-index-DPR-arqmath2__scibert_1ep
| |-index-DPR-arqmath2__vanilla_1ep
| |-index-ColBERT-arqmath2
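To sanity-check the layout after extraction, you can simply list the expected top-level directories (optional):
ls code corpus experiments prebuilt-indexes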
In the config file code/pya0/utils/transformer_eval.ini, change the following config variable to the working directory where this README file is located, for example:
store = /store2/scratch/w32zhong/math-dense-retrievers
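For example, from the repository root, a one-liner such as the following sets store to your current directory (a convenience sketch using GNU sed syntax; it assumes the variable appears as a line starting with "store =", so double-check the file afterwards):
sed -i "s|^store = .*|store = $(pwd)|" code/pya0/utils/transformer_eval.ini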
You may also need to add your own local GPU information to the default devices (each array entry specifies the CUDA device and its memory capacity in GiB):
devices = {
"cpu": ["cpu", "0"],
"titan_rtx": ["cuda:2", "24"],
"a6000_0": ["cuda:0", "48"],
"a6000_1": ["cuda:1", "48"],
"rtx2080": ["cuda:0", "11"]
}
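If you are unsure about your CUDA device indices and memory capacities, nvidia-smi can report both, for example:
nvidia-smi --query-gpu=index,name,memory.total --format=csv
# memory.total is reported in MiB; divide by 1024 for the GiB capacity used above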
In the following, you will need to run pya0 for evaluation.
For illustration, we assume you have cloned its source code and your working directory is the pya0 source code root:
cd code/pya0/
The reference experiment script is located at experiments/dense_retriever.sh; please refer to this script for running all the experiments. Remember to replace the device name with your own GPU name as configured in transformer_eval.ini.
Here is an example of indexing the NTCIR12 dataset using our DPR model:
INDEX='python -m pya0.transformer_eval index ./utils/transformer_eval.ini'
$INDEX index_ntcir12_dpr --device <your_own_device_name>
The index will be generated under the prebuilt-indexes directory if you have not downloaded our prebuilt indexes.
To prevent overwriting, the indexer aborts when an existing index is found, so you will need to delete any existing index before building a new one:
rm -rf /path/to/your/math-dense-retrievers/prebuilt-indexes/index-DPR-ntcir12
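To make re-indexing repeatable, a small guard like the following (a sketch; substitute the path of the index you are rebuilding) removes a stale index only when one exists:
IDX=/path/to/your/math-dense-retrievers/prebuilt-indexes/index-DPR-ntcir12
[ -d "$IDX" ] && rm -rf "$IDX"
$INDEX index_ntcir12_dpr --device <your_own_device_name>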
As another example, use the DPR model we trained (located at experiments/math-dpr) to generate results for the NTCIR-12 dataset:
SEARCH='python -m pya0.transformer_eval search ./utils/transformer_eval.ini'
$SEARCH search_ntcir12_dpr --device cpu
(Since the DPR searcher only needs to encode queries, feel free to use the CPU device this time.)
The pre-existing run files under the experiments/runs directory are the ones we generated for reporting our results. Be aware that, by default, all newly generated run files will overwrite files under the experiments/runs directory. For convenience, we also put the official run files from previous systems (which can be downloaded here) under experiments/runs/official.
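Since newly generated runs overwrite the shipped ones, you may want to snapshot the originals first (optional; run from the repository root):
cp -r experiments/runs experiments/runs.orig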
For NTCIR-12 run files, evaluate them by:
./eval-ntcir12.sh ../../experiments/runs/search_ntcir12_dpr.run
Fully relevant:
P_5 all 0.3200
P_10 all 0.2150
P_15 all 0.1700
P_20 all 0.1550
bpref all 0.5159
Partial relevant:
P_5 all 0.4100
P_10 all 0.3050
P_15 all 0.2567
P_20 all 0.2400
bpref all 0.4269
For ARQMath-2 run files, evaluate them with our utility scripts (which internally invoke the official evaluation script but add new statistics such as the BPref score and judge rate):
./eval-arqmath2-task1/preprocess.sh cleanup
./eval-arqmath2-task1/preprocess.sh ../../experiments/runs/search_arqmath2_dpr.run
./eval-arqmath2-task1/eval.sh
100000 ./eval-arqmath2-task1/input/search_arqmath2_dpr_run
++ sed -i 's/ /\t/g' ./eval-arqmath2-task1/input/search_arqmath2_dpr_run
++ python3 ./eval-arqmath2-task1/arqmath_to_prim_task1.py -qre topics-and-qrels/qrels.arqmath-2021-task1-official.txt -sub ./eval-arqmath2-task1/input/ -tre ./eval-arqmath2-task1/trec-output/ -pri ./eval-arqmath2-task1/prime-output/
++ python3 ./eval-arqmath2-task1/task1_get_results.py -eva trec_eval -qre topics-and-qrels/qrels.arqmath-2021-task1-official.txt -pri ./eval-arqmath2-task1/prime-output/ -res ./eval-arqmath2-task1/result.tsv
trec_eval topics-and-qrels/qrels.arqmath-2021-task1-official.txt ./eval-arqmath2-task1/prime-output/prime_search_arqmath2_dpr_run -m ndcg
trec_eval topics-and-qrels/qrels.arqmath-2021-task1-official.txt ./eval-arqmath2-task1/prime-output/prime_search_arqmath2_dpr_run -l2 -m map
trec_eval topics-and-qrels/qrels.arqmath-2021-task1-official.txt ./eval-arqmath2-task1/prime-output/prime_search_arqmath2_dpr_run -l2 -m P
trec_eval topics-and-qrels/qrels.arqmath-2021-task1-official.txt ./eval-arqmath2-task1/prime-output/prime_search_arqmath2_dpr_run -l2 -m bpref
python -m pya0.judge_rate topics-and-qrels/qrels.arqmath-2021-task1-official.txt ./eval-arqmath2-task1/trec-output/search_arqmath2_dpr_run
++ cat ./eval-arqmath2-task1/result.tsv
++ sed -e 's/[[:blank:]]/ /g'
System nDCG' mAP' p@10 BPref Judge
search_arqmath2_dpr_run 0.2700 0.0869 0.1521 0.0972 66.3
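The result table itself is tab-separated; to view it with aligned columns you can run, for example:
column -t -s $'\t' ./eval-arqmath2-task1/result.tsv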
You can also break down an ARQMath run file by topic categories:
./eval-arqmath2-task1/preprocess.sh cleanup
./eval-arqmath2-task1/preprocess.sh filter ../../experiments/runs/search_arqmath2_dpr.run
./eval-arqmath2-task1/eval.sh --nojudge
# (Omitting many outputs here)
System nDCG' mAP' p@10 BPref Judge
search_arqmath2_dpr_run-Category-Calculation 0.2785 0.1098 0.1840 0.1194 0.0
search_arqmath2_dpr_run-Category-Proof 0.2641 0.0683 0.1222 0.0746 0.0
search_arqmath2_dpr_run-Difficulty-Low 0.2741 0.0932 0.1687 0.1146 0.0
search_arqmath2_dpr_run-Dependency-Formula 0.2406 0.0629 0.1238 0.0796 0.0
search_arqmath2_dpr_run 0.2700 0.0869 0.1521 0.0972 0.0
search_arqmath2_dpr_run-Dependency-Both 0.2835 0.0956 0.1625 0.1022 0.0
search_arqmath2_dpr_run-Category-Concept 0.2673 0.0831 0.1526 0.1001 0.0
search_arqmath2_dpr_run-Difficulty-Medium 0.2886 0.0850 0.1350 0.0897 0.0
search_arqmath2_dpr_run-Difficulty-High 0.2436 0.0782 0.1421 0.0758 0.0
search_arqmath2_dpr_run-Dependency-Text 0.2780 0.1022 0.1700 0.1143 0.0
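To pull out one slice of the breakdown, a simple grep over the result table works, e.g.:
grep -- '-Category-' ./eval-arqmath2-task1/result.tsv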
For how to invoke other evaluation scripts, please refer to the experiments/dense_retriever.sh file.
If you want to train your own models, please refer to our Slurm scripts (1-epoch experiments, and fully-trained DPR or ColBERT). These scripts include the training parameters as well as the training dataset and base model checkpoints (with NextCloud IDs).
For our 1-epoch experiments, you can download the training logs here.
The training data are preprocessed into pickle files (and sentence pairs if necessary) using these scripts, and our crawled MSE+AoPS raw data (before preprocessing) can be downloaded here.
To create training data from raw data:
wget https://vault.cs.uwaterloo.ca/s/G36Mjt55HWRSNRR/download -O mse-aops-2021.tar.gz
tar xzf mse-aops-2021.tar.gz
# cd to pya0 directory
rm -f mse-aops-2021-data-v3.pkl mse-aops-2021-vocab-v3.pkl
python -m pya0.mse-aops-2021 /path/to/corpus/ --num_tokenizer_ver=3
python -m pya0.mse-aops-2021-train-data generate_sentpairs --docs_file ./mse-aops-2021-data-v3.pkl --condenser_mode=True
Then, create shards.txt and test.txt files to specify the training and testing sentence pairs.
To create shards.txt:
ls *.pairs.* > shards.txt
To create test.txt:
python -m pya0.transformer_utils pft_print ./tests/transformer_unmask.txt > test.txt
You may also need to copy the backbone model and tokenizer (e.g., bert-base-uncased and bert-tokenizer) to the data directory so that training can be performed offline (useful in some Slurm environments).
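One way to fetch and save these locally (a sketch using the Hugging Face transformers API; the data.ABC directory names simply mirror the example below):
python -c "from transformers import AutoModel; AutoModel.from_pretrained('bert-base-uncased').save_pretrained('data.ABC/bert-base-uncased')"
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('bert-base-uncased').save_pretrained('data.ABC/bert-tokenizer')"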
Assuming this data directory containing all the generated files is named data.ABC, an example command for pretraining would be:
export SLURM_JOB_ID=my_pretrain;
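# note: batch_size below is written as 38 * 3, presumably 38 examples for
# each of the 3 GPUs listed in --dev_map; scale both to your own hardware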
python ./pya0/utils/transformer.py pretrain \
data.ABC/bert-base-uncased data.ABC/bert-tokenizer data.ABC/mse-aops-2021-vocab-v3.pkl \
--test_file data.ABC/test.txt --test_cycle 100 --shards_list data.ABC/shards.txt \
--batch_size $((38 * 3)) --save_fold 1 --epochs 10 \
--cluster tcp://127.0.0.1:8912 --dev_map 3,4,5
Another example of training a condenser architecture:
python ./pya0/utils/transformer.py pretrain \
data.ABC/bert-base-uncased data.ABC/bert-tokenizer data.ABC/mse-aops-2021-vocab-v3.pkl \
--test_cycle 0 --shards_list data.ABC/shards.txt \
--batch_size $((32 * 3)) --save_fold 1 --epochs 10 \
--cluster tcp://127.0.0.1:8912 --dev_map 0,1,2 --architecture condenser