[🐛BUG] Validation with mode "unixxx" is extremely slow compared to "full".

lukas-wegmeth opened this issue 3 months ago

Describe the bug
I was benchmarking many RecBole algorithms on a few data sets with "uni100" and "uni50" validation modes and noticed that validation took an unexpectedly long time.
Therefore, I have tested multiple combinations of settings to figure out if this only happens rarely, but I can reproduce it consistently.
I provide the results of my tests below. I wonder if this is a bug because, in my understanding, a "unixxx" validation mode (e.g., "uni50" or "uni100") should be faster than "full", since fewer interactions must be predicted. I would also expect "full" to take much longer than it does.
Please check if the data provided in the tables looks like you would expect, and if so, help me understand why "unixxx" takes so long compared to "full".

To Reproduce
Steps to reproduce the behavior:

import argparse
import json
import time
from logging import getLogger

from recbole.config import Config
from recbole.data import create_dataset, data_preparation
from recbole.utils import ModelType, get_model, get_trainer, init_seed, init_logger

import torch

if __name__ == "__main__":
    parser = argparse.ArgumentParser("Fit RecBole")
    parser.add_argument('--data_set_name', dest='data_set_name', type=str, required=True)
    parser.add_argument('--algorithm_name', dest='algorithm_name', type=str, required=True)
    parser.add_argument('--algorithm_config', dest='algorithm_config', type=int, required=True)
    parser.add_argument('--fold', dest='fold', type=int, required=True)

    args = parser.parse_args()

    print(f"CUDA available: {torch.cuda.is_available()}")
    print(f"CUDA version: {torch.version.cuda}")
    print(f"CUDNN version: {torch.backends.cudnn.version()}")
    print(f"PyTorch version: {torch.__version__}")

    config_dict = {
        "seed": 42,  # default: "2020"
        "data_path": "./data_sets/",  # default: "dataset/"
        "checkpoint_dir": f"./data_sets/{args.data_set_name}/checkpoint_{args.algorithm_name}/"
                          f"config_{args.algorithm_config}/fold_{args.fold}/",  # default: "saved/"
        "benchmark_filename": [f"train_split_fold_{args.fold}", f"valid_split_fold_{args.fold}",
                               f"test_split_fold_{args.fold}"],
        # default: None
        "field_separator": ",",  # default: "\t"
        "epochs": 50,  # default: 300
        "eval_step": 3,  # default: 1
        "stopping_step": 3,  # default: 10
        "eval_args":
            {
                "group_by": "user",  # default: "user"
                "order": "RO",  # default: "RO"
                "split":
                    {
                        # "RS": [8, 1, 1] # default: {"RS": [8, 1, 1]}
                        "LS": "valid_and_test"
                    },
                "mode":
                    {
                        "valid": "uni50",  # default: "full"
                        "test": "full",  # default: "full"
                    },
            },
        "metrics": ["NDCG"],
        # default: ["Recall", "MRR", "NDCG", "Hit", "Precision"]
        "topk": [10],  # default: 10
        "valid_metric": "NDCG@10",  # default: "MRR@10"
        "eval_batch_size": 32768,  # default: 4096
        # misc settings
        "model": args.algorithm_name,
        "MODEL_TYPE": ModelType.GENERAL,  # default: ModelType.GENERAL
        "dataset": args.data_set_name,  # default: None
    }
    # `configurations` is not defined in this script, so log the configuration index instead.
    print(f"Running algorithm {args.algorithm_name} configuration index: {args.algorithm_config}")

    config = Config(config_dict=config_dict)
    init_seed(config['seed'], config['reproducibility'])
    init_logger(config)
    logger = getLogger()
    logger.info(config)

    config["data_path"] = f"./data_sets/{args.data_set_name}/atomic/"
    dataset = create_dataset(config)
    logger.info(dataset)
    train_data, valid_data, test_data = data_preparation(config, dataset)

    model = get_model(config["model"])(config, train_data.dataset).to(config['device'])
    logger.info(model)
    trainer = get_trainer(config["MODEL_TYPE"], config["model"])(config, model)
    start_fit = time.time()
    best_valid_score, best_valid_result = trainer.fit(train_data, valid_data)
    end_fit = time.time()
    model_file = trainer.saved_model_file

Expected behavior
Validation modes "unixxx" should be faster than "full".

Screenshots

Data Set	Model	Eval Batch Size	Validation Mode	Epoch Training Time (seconds)	Validation Time (seconds)	Validation Score (nDCG@10)
MovieLens-100K	DGCF	32768	uni100	1.7	10.37	0.3308
MovieLens-100K	DGCF	4096	uni100	1.64	22.29	0.3308
MovieLens-100K	DGCF	32768	uni50	1.68	4.4	0.398
MovieLens-100K	DGCF	4096	uni50	1.63	19.37	0.398
MovieLens-100K	DGCF	32768	full	1.68	0.07	0.2455
MovieLens-100K	DGCF	4096	full	1.63	0.2	0.2455
MovieLens-100K	SpectralCF	32768	uni100	0.27	1.83	0.2228
MovieLens-100K	SpectralCF	4096	uni100	0.35	3.95	0.2228
MovieLens-100K	SpectralCF	32768	uni50	0.25	0.97	0.2741
MovieLens-100K	SpectralCF	4096	uni50	0.28	2.97	0.2741
MovieLens-100K	SpectralCF	32768	full	0.27	0.05	0.1743
MovieLens-100K	SpectralCF	4096	full	0.25	0.2	0.1743
MovieLens-1M	DGCF	32768	uni100	82.09	776.27	0.3654
MovieLens-1M	DGCF	4096	uni100	82.1	1103.69	0.3654
MovieLens-1M	DGCF	32768	uni50	81.55	773.25	0.4565
MovieLens-1M	DGCF	4096	uni50	82.01	867.66	0.4565
MovieLens-1M	DGCF	32768	full	81.77	0.61	0.238
MovieLens-1M	DGCF	4096	full	81.8	3.42	0.238
MovieLens-1M	SpectralCF	32768	uni100	8.21	80.14	0.3057
MovieLens-1M	SpectralCF	4096	uni100	8.1	108.65	0.3057
MovieLens-1M	SpectralCF	32768	uni50	8.06	76.84	0.3897
MovieLens-1M	SpectralCF	4096	uni50	8.27	87.5	0.3899
MovieLens-1M	SpectralCF	32768	full	8.21	0.5	0.1959
MovieLens-1M	SpectralCF	4096	full	8.1	3.3	0.1958

Desktop (please complete the following information):

OS: Linux
RecBole Version: 1.2.0
Python Version: 3.10
PyTorch Version: 2.1.1
cudatoolkit Version: 12.1

Enze Liu · Answer 1 · Sat Mar 09 2024 14:46:08 GMT+0800 (China Standard Time)

@lukas-wegmeth Hi! The longer valid/test time for sampling-based evaluation is normal. For full evaluation, we store all the user and item embeddings and reuse them to avoid repeated computation, but this is not done for negative sampling evaluation. You can check the source code for details. Note that predict is used for negative sampling evaluation and full_sort_predict is used for full evaluation.
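
The difference described above can be illustrated with a minimal sketch. It assumes a model whose predict recomputes all embeddings on every call, while full_sort_predict computes them once and reuses the stored copy. ToyGraphModel, count_forward_calls, and all sizes are hypothetical stand-ins for illustration, not RecBole code.

```python
class ToyGraphModel:
    """Hypothetical stand-in for a graph-based recommender; not RecBole code."""

    def __init__(self, n_users, n_items):
        self.n_users = n_users
        self.n_items = n_items
        self.forward_calls = 0   # how often the expensive step ran
        self._cached = None      # embeddings kept for full evaluation

    def forward(self):
        # Stands in for the expensive part (e.g. graph propagation over
        # all users and items). Here it only builds 1-d toy embeddings.
        self.forward_calls += 1
        user_emb = [float(u + 1) for u in range(self.n_users)]
        item_emb = [float(i + 1) for i in range(self.n_items)]
        return user_emb, item_emb

    def predict(self, users, items):
        # Sampled-evaluation path: embeddings are recomputed on every call.
        user_emb, item_emb = self.forward()
        return [user_emb[u] * item_emb[i] for u, i in zip(users, items)]

    def full_sort_predict(self, users):
        # Full-evaluation path: compute embeddings once, then reuse them.
        if self._cached is None:
            self._cached = self.forward()
        user_emb, item_emb = self._cached
        return [[user_emb[u] * e for e in item_emb] for u in users]


def count_forward_calls(mode, n_users=200, n_items=500,
                        n_candidates=51, batch_size=1024):
    """Run one toy validation pass and return the number of forward() calls."""
    model = ToyGraphModel(n_users, n_items)
    if mode == "full":
        # The batch size is applied to users here purely for simplicity.
        for start in range(0, n_users, batch_size):
            stop = min(start + batch_size, n_users)
            model.full_sort_predict(range(start, stop))
    else:
        # One positive plus 50 candidates per user, similar in spirit to "uni50".
        pairs = [(u, (u + k) % n_items)
                 for u in range(n_users) for k in range(n_candidates)]
        for start in range(0, len(pairs), batch_size):
            batch = pairs[start:start + batch_size]
            model.predict([u for u, _ in batch], [i for _, i in batch])
    return model.forward_calls


print(count_forward_calls("full"))     # 1
print(count_forward_calls("sampled"))  # 10
```

In this toy setup the sampled pass triggers one expensive forward() per batch of (user, item) pairs, so a smaller batch size means more recomputation. That would be consistent with the table above, where an eval batch size of 4096 is slower than 32768 for the "uni" modes, but the actual cost in RecBole depends on each model's implementation.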

Lukas Wegmeth · Answer 2 · Sat Mar 09 2024 22:07:52 GMT+0800 (China Standard Time)

@BishopLiu Thanks for replying. I have looked at the code and profiled the run time of the functions. I can see where negative sampling evaluation spends the extra time, but it is still unintuitive to me why it has to. I believe negative sampling evaluation should be faster because fewer interactions must be predicted. Also, if reusing the stored embeddings is much quicker, why is it not done in negative sampling evaluation? Please let me know if I misunderstood anything about this.

Enze Liu · Answer 3 · Sun Mar 10 2024 12:14:35 GMT+0800 (China Standard Time)

@lukas-wegmeth Thank you for your attention to RecBole! The models in RecBole are implemented by different developers. Our first goal is to make sure the model is consistent with the original paper and runs correctly. And different developers have their own considerations. I'm sorry that I cannot answer why restoring embeddings is not done in negative sampling.

Lukas Wegmeth · Answer 4 · Thu Mar 14 2024 20:44:50 GMT+0800 (China Standard Time)

@BishopLiu I understand. Thanks for your response. Although my problem with high validation time persists, I can at least verify how this happens in the code now.