EXTRA (EXplanaTion RAnking) Datasets

Paper

Lei Li, Yongfeng Zhang, Li Chen. EXTRA: Explanation Ranking Datasets for Explainable Recommendation. SIGIR'21 Resource. [Paper]

Datasets to download

Amazon Movies & TV
TripAdvisor Hong Kong
Yelp 2019

Data format

IDs.pickle can be loaded via the pickle package as a python list, where each record is a python dict in the form of

{
	"user": "7B555091EC0818119062CF726B9EF5FF",  # str
	"item": "1068719",       # str
	"rating": 5,             # int, not important to the ranking task
	"time": "2017-05-06",    # str in the format of YYYY-MM-DD, not available on TripAdvisor
	"exp_idx": ["34", "85"], # a list of str, they are the indices of explanations after sentence grouping via LSH
	"oexp_idx": ["91", "15"] # a list of str, they are the indices of original sentences, corresponding to senID in the following
}

Open id2exp.json via a text editor, e.g., Sublime, if you are curious about what the explanation indices correspond to. Or you can load it via testing.py by updating the parameters (line 5-7).

IDs.txt and id2exp.txt are compatible with IDs.pickle and id2exp.json. It would be easier to check the content with plain text files.
Each line in IDs.txt is in the format of userID::itemID::rating::timestamp::expID:expID::senID:senID, where timestamp is not available on TripAdvisor, and expID/senID are separated by ":" when there are multiple explanations.
Each line in id2exp.txt is in the format of expID::explanation sentence.
You can load the two files via movielens_load.py by updating the paths (line 1-2).

Folders named 1, 2, 3, 4 and 5 are data splits.
Each folder contains train.index and test.index which indicate the indices of their records in the list of IDs.pickle/IDs.txt.
train.index/test.index contain a line of numbers (indices), e.g., 5 8 9 10.

Creation steps

Run the scripts in the following order:

python 01_format_amazon.py \
	--raw-path SIGIR21-EXTRA-Datasets/reviews_Movies_and_TV_5.json.gz \
	--review-path outputs/reviews.jsonl

python 02_process_sentence.py \
	--review-path outputs/reviews.jsonl \
	--sentence-path sentences.jsonl --n-processes 8

python 03_group_sentence.py \
	--sentence-path outputs/sentences.jsonl \
	--directory outputs/ \
	--sim-thresholds 0.9 \
	--shingle-size 2 \
	--group-size 5 \
	--n-processes 8

Update the paths (line 5-9) in keep_valid.py.
Update the paths (line 6-9) in movielens.py, if you want to process the data into the MovieLens format.

Friendly reminder

Run the program on a machine with sufficient memory
Creating the datasets may take some time (e.g., hours for Yelp)

Code dependency

Python 3.6
NLTK
Datasketch

Citation

@inproceedings{SIGIR21-EXTRA,
	title={EXTRA: Explanation Ranking Datasets for Explainable Recommendation},
	author={Li, Lei and Zhang, Yongfeng and Chen, Li},
	booktitle={SIGIR},
	year={2021}
}

shunk031 / EXTRA

EXTRA (EXplanaTion RAnking) Datasets

Paper

Datasets to download

Data format

Creation steps

Friendly reminder

Code dependency

Citation

About

Languages