[🐛BUG] Validation with mode "unixxx" is extremely slow compared to "full".
lukas-wegmeth opened this issue
Describe the bug
I was benchmarking many RecBole algorithms on a few data sets with "uni100" and "uni50" validation modes and noticed that validation took an unexpectedly long time.
Therefore, I have tested multiple combinations of settings to figure out if this only happens rarely, but I can reproduce it consistently.
I provide the results of my tests below. I wonder if this is a bug because, in my understanding, validation mode "unixxx" should be faster than "full". And I would expect "full" to take much longer than it does.
Please check if the data provided in the tables looks like you would expect, and if so, help me understand why "unixxx" takes so long compared to "full".
To Reproduce
Steps to reproduce the behavior:
import argparse
import json
import time
from logging import getLogger
from recbole.config import Config
from recbole.data import create_dataset, data_preparation
from recbole.utils import ModelType, get_model, get_trainer, init_seed, init_logger
import torch
if __name__ == "__main__":
    # Command-line arguments select the data set, algorithm, configuration, and fold.
    parser = argparse.ArgumentParser("Fit RecBole")
    parser.add_argument('--data_set_name', dest='data_set_name', type=str, required=True)
    parser.add_argument('--algorithm_name', dest='algorithm_name', type=str, required=True)
    parser.add_argument('--algorithm_config', dest='algorithm_config', type=int, required=True)
    parser.add_argument('--fold', dest='fold', type=int, required=True)
    args = parser.parse_args()

    print(f"CUDA available: {torch.cuda.is_available()}")
    print(f"CUDA version: {torch.version.cuda}")
    print(f"CUDNN version: {torch.backends.cudnn.version()}")
    print(f"PyTorch version: {torch.__version__}")

    config_dict = {
        "seed": 42,  # default: "2020"
        "data_path": "./data_sets/",  # default: "dataset/"
        "checkpoint_dir": f"./data_sets/{args.data_set_name}/checkpoint_{args.algorithm_name}/"
                          f"config_{args.algorithm_config}/fold_{args.fold}/",  # default: "saved/"
        "benchmark_filename": [f"train_split_fold_{args.fold}", f"valid_split_fold_{args.fold}",
                               f"test_split_fold_{args.fold}"],
        # default: None
        "field_separator": ",",  # default: "\t"
        "epochs": 50,  # default: 300
        "eval_step": 3,  # default: 1
        "stopping_step": 3,  # default: 10
        "eval_args":
            {
                "group_by": "user",  # default: "user"
                "order": "RO",  # default: "RO"
                "split":
                    {
                        # "RS": [8, 1, 1]  # default: {"RS": [8, 1, 1]}
                        "LS": "valid_and_test"
                    },
                "mode":
                    {
                        "valid": "uni50",  # default: "full"
                        "test": "full",  # default: "full"
                    },
            },
        "metrics": ["NDCG"],
        # default: ["Recall", "MRR", "NDCG", "Hit", "Precision"]
        "topk": [10],  # default: 10
        "valid_metric": "NDCG@10",  # default: "MRR@10"
        "eval_batch_size": 32768,  # default: 4096
        # misc settings
        "model": args.algorithm_name,
        "MODEL_TYPE": ModelType.GENERAL,  # default: ModelType.GENERAL
        "dataset": args.data_set_name,  # default: None
    }

    # The original snippet looked this index up in a `configurations` mapping
    # that is not part of the posted code, so only the index is printed here.
    print(f"Running algorithm {args.algorithm_name} with configuration index: {args.algorithm_config}")

    config = Config(config_dict=config_dict)
    init_seed(config['seed'], config['reproducibility'])
    init_logger(config)
    logger = getLogger()
    logger.info(config)

    # Point RecBole at the atomic files of the selected data set.
    config["data_path"] = f"./data_sets/{args.data_set_name}/atomic/"
    dataset = create_dataset(config)
    logger.info(dataset)

    train_data, valid_data, test_data = data_preparation(config, dataset)

    model = get_model(config["model"])(config, train_data.dataset).to(config['device'])
    logger.info(model)

    trainer = get_trainer(config["MODEL_TYPE"], config["model"])(config, model)

    # Time the full fit, which includes the validation passes.
    start_fit = time.time()
    best_valid_score, best_valid_result = trainer.fit(train_data, valid_data)
    end_fit = time.time()
    model_file = trainer.saved_model_file
Expected behavior
Validation modes "unixxx" should be faster than "full".
Screenshots
Data Set | Model | Eval Batch Size | Validation Mode | Epoch Training Time (seconds) | Validation Time (seconds) | Validation Score (nDCG@10) |
---|---|---|---|---|---|---|
MovieLens-100K | DGCF | 32768 | uni100 | 1.7 | 10.37 | 0.3308 |
MovieLens-100K | DGCF | 4096 | uni100 | 1.64 | 22.29 | 0.3308 |
MovieLens-100K | DGCF | 32768 | uni50 | 1.68 | 4.4 | 0.398 |
MovieLens-100K | DGCF | 4096 | uni50 | 1.63 | 19.37 | 0.398 |
MovieLens-100K | DGCF | 32768 | full | 1.68 | 0.07 | 0.2455 |
MovieLens-100K | DGCF | 4096 | full | 1.63 | 0.2 | 0.2455 |
MovieLens-100K | SpectralCF | 32768 | uni100 | 0.27 | 1.83 | 0.2228 |
MovieLens-100K | SpectralCF | 4096 | uni100 | 0.35 | 3.95 | 0.2228 |
MovieLens-100K | SpectralCF | 32768 | uni50 | 0.25 | 0.97 | 0.2741 |
MovieLens-100K | SpectralCF | 4096 | uni50 | 0.28 | 2.97 | 0.2741 |
MovieLens-100K | SpectralCF | 32768 | full | 0.27 | 0.05 | 0.1743 |
MovieLens-100K | SpectralCF | 4096 | full | 0.25 | 0.2 | 0.1743 |
MovieLens-1M | DGCF | 32768 | uni100 | 82.09 | 776.27 | 0.3654 |
MovieLens-1M | DGCF | 4096 | uni100 | 82.1 | 1103.69 | 0.3654 |
MovieLens-1M | DGCF | 32768 | uni50 | 81.55 | 773.25 | 0.4565 |
MovieLens-1M | DGCF | 4096 | uni50 | 82.01 | 867.66 | 0.4565 |
MovieLens-1M | DGCF | 32768 | full | 81.77 | 0.61 | 0.238 |
MovieLens-1M | DGCF | 4096 | full | 81.8 | 3.42 | 0.238 |
MovieLens-1M | SpectralCF | 32768 | uni100 | 8.21 | 80.14 | 0.3057 |
MovieLens-1M | SpectralCF | 4096 | uni100 | 8.1 | 108.65 | 0.3057 |
MovieLens-1M | SpectralCF | 32768 | uni50 | 8.06 | 76.84 | 0.3897 |
MovieLens-1M | SpectralCF | 4096 | uni50 | 8.27 | 87.5 | 0.3899 |
MovieLens-1M | SpectralCF | 32768 | full | 8.21 | 0.5 | 0.1959 |
MovieLens-1M | SpectralCF | 4096 | full | 8.1 | 3.3 | 0.1958 |
Desktop (please complete the following information):
- OS: Linux
- RecBole Version: 1.2.0
- Python Version: 3.10
- PyTorch Version: 2.1.1
- cudatoolkit Version: 12.1
@lukas-wegmeth Hi! The longer valid/test time for sampling-based evaluation is expected. For full evaluation, we restore (cache and reuse) all user and item embeddings to avoid repeated computation, but this is not done for negative-sampling evaluation. You can check the source code for details. Note that `predict` is used for negative-sampling evaluation and `full_sort_predict` is used for full evaluation.
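The difference described in this reply can be illustrated with a toy model. This is a minimal sketch, not RecBole code: `ToyGraphModel`, `_propagate`, and `count_propagations` are hypothetical names, and the sketch assumes that the sampled path rebuilds all embeddings on every `predict` call while the full path builds them once and reuses them.

```python
class ToyGraphModel:
    """Hypothetical stand-in for a graph-based recommender (not RecBole code)."""

    def __init__(self, n_users, n_items):
        self.n_users = n_users
        self.n_items = n_items
        self.propagations = 0  # how many times all embeddings were rebuilt
        self._cached = None    # embeddings kept between full-sort calls

    def _propagate(self):
        # Stands in for the expensive step: computing all user/item embeddings.
        self.propagations += 1
        user_e = [[(u + 1) * 0.01, 1.0] for u in range(self.n_users)]
        item_e = [[(i + 1) * 0.02, -1.0] for i in range(self.n_items)]
        return user_e, item_e

    def predict(self, users, items):
        # Sampled evaluation (assumed behavior): embeddings are rebuilt per call.
        user_e, item_e = self._propagate()
        return [sum(a * b for a, b in zip(user_e[u], item_e[i]))
                for u, i in zip(users, items)]

    def full_sort_predict(self, users):
        # Full evaluation (assumed behavior): embeddings are built once, then reused.
        if self._cached is None:
            self._cached = self._propagate()
        user_e, item_e = self._cached
        return [[sum(a * b for a, b in zip(user_e[u], e)) for e in item_e]
                for u in users]


def count_propagations(n_users, n_items, n_negatives, batch_size):
    """Return (sampled, full) propagation counts for one validation pass."""
    # Sampled mode: one positive plus n_negatives candidates per user.
    model = ToyGraphModel(n_users, n_items)
    pairs = [(u, (u + k) % n_items)
             for u in range(n_users) for k in range(n_negatives + 1)]
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]
        model.predict([u for u, _ in batch], [i for _, i in batch])
    sampled = model.propagations

    # Full mode: every user is scored against every item.
    model = ToyGraphModel(n_users, n_items)
    users_per_batch = max(1, batch_size // n_items)
    for start in range(0, n_users, users_per_batch):
        stop = min(start + users_per_batch, n_users)
        model.full_sort_predict(list(range(start, stop)))
    full = model.propagations
    return sampled, full


print(count_propagations(n_users=100, n_items=50, n_negatives=10, batch_size=128))
```

In this toy setting the sampled path scores far fewer user-item pairs than the full path, yet it repeats the expensive embedding step once per batch, and a smaller batch size means more repetitions. Whether a given RecBole model behaves this way depends on how its `predict` and `full_sort_predict` methods are written.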
@BishopLiu Thanks for replying. I have looked at the code and profiled the run time of the functions. I can see where negative-sampling evaluation spends more time, but it is still unintuitive to me why it does. I believe negative-sampling evaluation should be faster because fewer interactions must be predicted. Also, if restoring the embeddings is much quicker, why is it not done in negative-sampling evaluation? Please let me know if I misunderstood anything about this.
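For readers who want to repeat this kind of measurement, a small wall-clock timer is often enough. The following is a minimal sketch using only the standard library; the `timed` helper and the placeholder workload are hypothetical and not part of RecBole.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("dummy_validation", timings):
    # Placeholder workload. In a RecBole script this could be replaced by a
    # validation call such as trainer.evaluate(valid_data, load_best_model=False)
    # (assumed usage, not taken from the issue).
    sum(i * i for i in range(100_000))

print({label: round(seconds, 4) for label, seconds in timings.items()})
```

When the model runs on a GPU, calling `torch.cuda.synchronize()` before reading the clock is usually needed so that queued kernels are included in the measured time.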
@lukas-wegmeth Thank you for your attention to RecBole! The models in RecBole are implemented by different developers. Our first goal is to make sure the model is consistent with the original paper and runs correctly. And different developers have their own considerations. I'm sorry that I cannot answer why restoring embeddings is not done in negative sampling.
@BishopLiu I understand. Thanks for your response. Although my problem with high validation time persists, I can at least verify how this happens in the code now.