Privacy and Fairness in Recommender Systems

Instructions

clone the repository

git clone https://github.com/marinimau/privacy_and_fairness_in_recsys && cd privacy_and_fairness_in_recsys

create a virtual environment and activate it

python3 -m venv venv
source venv/bin/activate

install requirements

pip3 install -r requirements.txt

download the data

python3 setup.py

run the experiments

python3 main.py

Attributes

urls

urls = {
    "movielens1m": "https://files.grouplens.org/datasets/movielens/ml-1m.zip"
}

a dict that contains the urls required in the source, in this case we have only the url to download movielens1M

classifier attributes

classifiers = ['random-forest', 'logistic-regression']

the list of classifier to use

classifier_models = {
    "random_forest": RandomForestClassifier(),
    "logistic_regression": LogisticRegression(),
    "naive_bayes": GaussianNB()
}

a dict that contains the Sklearn model for each classifier

classifier_params = {
    'random_forest': {
        'bootstrap': [False, True],
        'max_features': ['auto'],
        'n_estimators': [50, 700, 1800]
    },
    'logistic_regression': {
        'solver': ['lbfgs'],
        'penalty': ['l2', None],
        'C': [1.0],
        'random_state': [np.random.RandomState(0)],
    },
    'naive_bayes': {
        'priors': [[0.5, 0.5]]
    }
}

a dict that contains, for each classifier, the params for the grid search

trade_off_param_range = {
    'random_forest': np.arange(200, 700, 100),
    'logistic_regression': [0.001, 0.05, 0.1, 0.5, 1.0, 10.0],
    'naive_bayes': [0.001, 0.05, 0.1, 0.5, 1.0, 10.0]
}

a dict that contains, for each classifier the x ticks for the learning and validation curves plot

trade_off_param_name = {
    'random_forest': 'n_estimators',
    'logistic_regression': 'C',
    'naive_bayes': 'priors'
}

a dict that contains, for each classifier, the x axis title for the learning and validation curves plot

classifier_evaluation_plot = False

a flag that indicates if we need to generate learning and validation curves

current_trade_off_file_name = ''

the path to save the learning and validation curves (it is edited a run time)

SHOW_PLOT = False

a flag tha indicates if we need to see plot a run time (plots are still saved)

Recommender

best = [10, 20, 50]

the list of cutoffs for the recs

recs_file_names = [
    'user-knn',
    'item-knn',
    'most-pop',
    'bprmf',
    'neumf',
    'wrmf',
    'lightgcn',
    'multidae',
]

a list with names of relevance files (recs are extracted from the relevance files)

ignore_embeddings = [
    'most-pop',
    'neumf',
    'lightgcn',
    'multidae',
]

if a relevance file name is in this list we ignore embedding experiments for that classifier. Experiment are ignored also if the file is not found.

dataset_names = [
    'observation',
    'observation-no-score',
    'most-pop_relevance',
    'most-pop_embeddings',
    'most-pop_classification_best_10',
    'most-pop_classification_best_20',
    'most-pop_classification_best_50',
    'item-knn_relevance',
    'item-knn_embeddings',
    'item-knn_classification_best_10',
    'item-knn_classification_best_20',
    'item-knn_classification_best_50',
    'user-knn_relevance',
    'user-knn_embeddings',
    'user-knn_classification_best_10',
    'user-knn_classification_best_20',
    'user-knn_classification_best_50',
    'bprmf_relevance',
    'bprmf_embeddings',
    'bprmf_classification_best_10',
    'bprmf_classification_best_20',
    'bprmf_classification_best_50',
    'neumf_relevance',
    'neumf_embeddings',
    'neumf_classification_best_10',
    'neumf_classification_best_20',
    'neumf_classification_best_50',
    'wrmf_relevance',
    'wrmf_embeddings',
    'wrmf_classification_best_10',
    'wrmf_classification_best_20',
    'wrmf_classification_best_50',
    'lightgcn_relevance',
    'lightgcn_embeddings',
    'lightgcn_classification_best_10',
    'lightgcn_classification_best_20',
    'lightgcn_classification_best_50',
    'multidae_relevance',
    'multidae_embeddings',
    'multidae_classification_best_10',
    'multidae_classification_best_20',
    'multidae_classification_best_50',
]

list that contains the name of the experiments, names ara generated automatically in the source, but we can perform a validation of the experiment name using this file. (like a test)

required_experiments = {
    'observation': False,
    'embeddings': True,
    'recs': False
}

a dict that indicates the required experiments

required_obfuscated = {
    'observation': True,
    'embeddings': True,
    'recs': True
}

a dict that indicates where we need the obfuscated data

obfuscation_path = [
    'pred/',
    'avg/',
    'filtered/'
]

a list that contains the obfuscated file paths (one for each method)

obfuscation_method_index = 2

the index of "obfuscation_path" to select the obfuscation method

obfuscated_method = obfuscation_path[obfuscation_method_index] if required_obfuscated['observation'] else ''

generates the path of the input files: '' if we not need obfuscation; or the current obfuscation_path if we need obfuscated data

Sensitive attributes inference

label_names = ['age', 'gender']

the name of the sensitive attributes

age_labels = ['1', '56', '25', '45', '50', '35', '18']

a list of age label_names for movielens 1M, (not required if we binarize age)

Classifier evaluation

metrics = ['balanced_accuracy_score', 'f1_score', 'precision_score', 'recall_score', 'accuracy_score']

a list that contains the metrics we are using

Dataset

data_root_list = ['movielens1M', 'lastfm1k']

a list that contains the dataset. The organization of the input (interactions, embeddings, recs), results etc. is the same for each dataset.

data_root = data_root_list[0]

the current dataset

Time split

time_split_cutoffs = [.25, .50, .75, 1]

a list of cutoffs for the time splitting

time_split_current_cutoff = time_split_cutoffs[3]

the selected cutoff for the experiments

time_split_cutoffs_fixed = [5, 10, 20]

a list of fixed cutoffs for the time splitting

time_split_current_cutoff_fixed = time_split_cutoffs_fixed[2]

the selected cutoffs for the experiments if we require fixed cutoffs

perform_time_splitting = False

a flag that indicates if we need to perform time splitting experiments (instead of using the entire dataset)

fixed_time_splitting = False

a flag that indicates if we need to use fixed time split data (if False we use percentage time splitting)

Classification preprocessing

balance_data = True

a flag that indicates if we need to balance the dataset

normalize_data = False

a flag that indicates if we need to normalize data (sklearn.preprocessing.normalize)

lite_dataset = False

if true we use the first only "lite_dataset_size" users for the experiments

lite_dataset_size = 100

indicates how many users use for the experiments

test_set_size = 0.3

the size of the test set, the size of the training set is 1 - test_set_size

maintain_order_in_train_test_split = False

a flag that indicates if we need to maintain order in the training and test data

Filtering method

filtering_sampling_percentages = [1, 0.8, 0.6, 0.4]

the weight for each partition (the first is the one with the most recent interactions)

n_subset_for_filtering = len(filtering_sampling_percentages)

the number of partitions (it is the length of "filtering_sampling_percentages")

General

VERBOSE = True

verbose flag

DEBUG = False

not perform classification, allows you to quickly test the paths and the execution process of all experiments

Author

Mauro Marini

marinimau / privacy_and_fairness_in_recsys