Edenhofer / Augmented-DonorsChoose.org-Dataset

Amending metadata to the DonorsChoose.org dataset to facilitate research in meta-learning for recommender systems

Code and Literature Repository for Investigating Meta-Learning Algorithms in the Context of Recommender Systems

Inspiration

The main inspiration for this research is the work of the ADAPT group at Trinity College Dublin. Of special note for this project is the research outlined in the paper One-at-a-time: A Meta-Learning Recommender-System for Recommendation-Algorithm Selection on Micro Level.

Abstract

The DonorsChoose.org dataset of past donations provides a large, feature-rich corpus of users and items. The dataset matches donors to projects in which they might be interested and hence is intrinsically about recommendations. Due to the availability of detailed item, user and transaction features, this corpus is a suitable candidate on which to test meta-learning approaches. This study aims at providing an augmented corpus with which further recommender-systems studies can test and evaluate meta-learning approaches. In the augmentation, metadata from collaborative and content-based filtering techniques is added to the corpus. It is further extended with aggregated statistics for users and transactions and with an exemplary meta-learning experiment. The performance of the learning subsystem is measured via the recall of recommended items in a Top-N test set. The augmented dataset and the source code are released into the public domain at GitHub:Edenhofer/Augmented-DonorsChoose.org-Dataset.

Corpus Features and Augmented Metadata

The following is an organized list of column names. Each entry in the itemized transaction table provides a value for the listed columns. Where a column name is not self-explanatory, a short description is appended.

Transaction

  • DonationID
  • DonationIncludedOptionalDonation
  • DonationReceivedDate
    • DonationReceivedDateDay
    • DonationReceivedDateDayOfWeek
    • DonationReceivedDateMonth
    • DonationReceivedDateTimeOfDay
    • DonationReceivedDateYear
  • DonorCartSequence
  • DonorID
  • ProjectID
  • DonationAmount
  • (Zip|City|State)IsEqual i.e. whether user and item have identical values (see the sketch below)
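
The derived date and location columns can be reproduced with pandas. A minimal sketch follows; the raw file name and the presence of the merged donor and school columns are assumptions:

import pandas as pd

# Assumed raw transaction table with donor and school attributes already merged in
donations = pd.read_csv('data/Donations.csv')

# Decompose the donation timestamp into its components
received = pd.to_datetime(donations['DonationReceivedDate'])
donations['DonationReceivedDateYear'] = received.dt.year
donations['DonationReceivedDateMonth'] = received.dt.month
donations['DonationReceivedDateDay'] = received.dt.day
donations['DonationReceivedDateDayOfWeek'] = received.dt.dayofweek
donations['DonationReceivedDateTimeOfDay'] = received.dt.hour + received.dt.minute / 60.

# Flags for identical user and item locations
for col in ('Zip', 'City', 'State'):
    donations[col + 'IsEqual'] = donations['Donor' + col] == donations['School' + col]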

User

  • DonorState
  • DonorCity
  • DonorZip
  • DonorIsTeacher
  • Concat[(DonorState|DonorCity|...)+] i.e. the concatenated values of the listed columns (see the sketch below)
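
The concatenated columns are plain string joins. A short sketch, with the exact naming of the resulting column being an assumption:

donors = pd.read_csv('data/Donors.csv')  # assumed raw user table

# Hypothetical naming: Concat[...] joins the string values of the listed columns
donors['ConcatDonorStateDonorCity'] = donors['DonorState'].astype(str) + donors['DonorCity'].astype(str)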

Item

  • SchoolID
  • ProjectGradeLevelCategory
  • ProjectResourceCategory
  • ProjectSubjectCategoryTree
  • ProjectSubjectSubcategoryTree
  • SchoolCity
  • SchoolMetroType
  • SchoolPercentageFreeLunch
  • SchoolState
  • SchoolZip

Learning Subsystem Prediction and Error

  • Collaborative Filtering Techniques
    • AbsoluteErrorSKLearn-(KNN|SVD)
    • SquareErrorSKLearn-(KNN|SVD)
    • RecallAtPositionSKLearn-(KNN|SVD)
    • PredictionSKLearn-(KNN|SVD) i.e. the prediction from the matrix decomposition or from the neighbors' interactions
  • Content-based Filtering Techniques
    • RecallAtPosition(FastText|Tfidf)
    • Prediction(FastText|Tfidf) i.e. the cosine similarity between the user profile and the recommended item
  • Collaborative recommendations for user-groups
    • AbsoluteErrorGroupBy[(DonorState|DonorCity|...)+]-SKLearn-SVD
    • SquareErrorGroupBy[(DonorState|DonorCity|...)+]-SKLearn-SVD
    • RecallAtPositionGroupBy[(DonorState|DonorCity|...)+]-SKLearn-SVD
    • PredictionGroupBy[(DonorState|DonorCity|...)+]-SKLearn-SVD
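
The RecallAtPosition columns record where the held-out item appears in the ranked Top-N test set, while the error columns compare the predicted with the actual rating. A minimal sketch, with the fallback penalty for missed items being an assumption:

import numpy as np

def recall_at_position(ranked_item_ids, true_item_id, penalty=100):
    # Zero-based rank of the held-out item in the Top-N recommendations;
    # a fixed penalty (an assumption) is returned if the item is missed
    try:
        return list(ranked_item_ids).index(true_item_id)
    except ValueError:
        return penalty

# Point-wise error columns computed from the subsystem's rating predictions
items['AbsoluteErrorSKLearn-SVD'] = (items['PredictionSKLearn-SVD'] - items['DonationAmount']).abs()
items['SquareErrorSKLearn-SVD'] = np.square(items['PredictionSKLearn-SVD'] - items['DonationAmount'])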

Statistics

  • General
    • isTest i.e. whether the entry was used for testing during the holdout split
  • Values aggregated by User
    • ValueCountsDonorID i.e. the number of transactions of the user
    • ValueCountsByUserProjectID i.e. how popular the projects are to which the user donated
    • UserMean(DonationAmount|ProjectGradeLevelCategory|...) i.e. the per-user mean of the respective column (see the sketch below)
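
The aggregated statistics can be expressed as pandas groupby-transforms on the transaction table. A sketch following the column names above:

# Number of transactions per user
items['ValueCountsDonorID'] = items.groupby('DonorID')['DonorID'].transform('count')

# Popularity of each project, aggregated per user
project_popularity = items.groupby('ProjectID')['ProjectID'].transform('count')
items['ValueCountsByUserProjectID'] = project_popularity.groupby(items['DonorID']).transform('mean')

# Per-user mean of a numerical column
items['UserMeanDonationAmount'] = items.groupby('DonorID')['DonationAmount'].transform('mean')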

Meta-Learning System

  • MetaPrediction(BaggingRg|GradientBoostingRg|...)RecallAtPosition(SKLearn-SVD|FastText|...) i.e. each meta-learner's predicted error of the individual subalgorithms in the error-prediction step
  • MetaPrediction(BaggingRg|GradientBoostingRg|...) i.e. the achieved position in the Top-N test set if the algorithm suggested via error prediction is selected
  • SubalgorithmPrediction(BaggingRg|GradientBoostingRg|...)RecallAtPosition i.e. the predicted class in the classification step
  • MetaSubalgorithmPrediction(BaggingCl|DecisionTreeCl|...) i.e. the achieved position in the Top-N test set if the algorithm suggested via classification is selected
  • SubalgorithmCategory i.e. the 'category' of the transaction, namely the name of its best-performing subalgorithm (see the sketch below)
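
SubalgorithmCategory follows directly from the RecallAtPosition columns of the learning subsystem; the set of subalgorithms is taken from the snippets below:

# Label each transaction with the column name of its best-performing subalgorithm
recall_columns = ['RecallAtPosition' + alg for alg in ('SKLearn-KNN', 'SKLearn-SVD', 'Tfidf', 'FastText')]
meta_items['SubalgorithmCategory'] = meta_items[recall_columns].idxmin(axis=1)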

Code Design

This repository is the single source of truth for the whole scientific exploration of the augmentation and evaluation of the DonorsChoose.org dataset. In addition to the actual code needed for reproduction, the repository contains all relevant status updates. The dedicated folder for documentation is appropriately named doc. The dataset may be stored in data; changes within this folder are ignored by the version control system. The main programs are learners.py and meta-learners.py, with helper functions outsourced to recsys. The first Python script is dedicated to creating a dataset augmented with results from various filtering techniques. It represents the learning subsystem and performs the computationally most expensive steps. The second Python script further augments the dataset and executes the meta-learning algorithms.

Learning Subsystem

The learning subsystem is contained in the learners.py script and is adaptable via the configuration file config.yml. Most parameters should be self-explanatory; where deemed necessary, a short explanatory string is added. Most notable is the dictionary algorithms_args, which specifies the learning-subsystem algorithms to execute on the data (see the sketch below).
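
The following sketches how such a configuration might be consumed; the layout of algorithms_args and the dispatch table are assumptions for illustration only:

import yaml
from sklearn.decomposition import NMF, TruncatedSVD

with open('config.yml') as f:
    config = yaml.safe_load(f)

# Hypothetical layout: each key of algorithms_args names an algorithm of the
# learning subsystem, its value holds the keyword arguments for the constructor
dispatch = {'SKLearn-SVD': TruncatedSVD, 'SKLearn-NMF': NMF}
models = {name: dispatch[name](**(kwargs or {}))
          for name, kwargs in config['algorithms_args'].items()
          if name in dispatch}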

Meta-Learning System

The final augmentation is performed in meta-learners.py. The program's behavior can be configured via config-meta-learners.yml. Its most important option is the dictionary describing the meta-algorithms to execute on the dataset (see the sketch below).
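
Both meta-learning strategies admit a compact sketch: error prediction trains one regressor per subalgorithm and selects the algorithm with the lowest predicted position, whereas classification predicts the best subalgorithm directly. The feature matrices X_train/X_test and the index splits train_idx/test_idx are assumptions:

import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeClassifier

# Error prediction: regress each subalgorithm's RecallAtPosition and pick the argmin
recall_columns = ['RecallAtPosition' + alg for alg in ('SKLearn-KNN', 'SKLearn-SVD', 'Tfidf', 'FastText')]
predicted = {}
for col in recall_columns:
    rg = GradientBoostingRegressor().fit(X_train, meta_items.loc[train_idx, col])
    predicted[col] = rg.predict(X_test)
selection_via_error_prediction = pd.DataFrame(predicted, index=test_idx).idxmin(axis=1)

# Classification: predict the best subalgorithm directly
cl = DecisionTreeClassifier().fit(X_train, meta_items.loc[train_idx, 'SubalgorithmCategory'])
selection_via_classification = pd.Series(cl.predict(X_test), index=test_idx)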

Code Snippets

Configuration of Visuals

  • Non-interactive plotting
import matplotlib as mpl

# Select a non-interactive backend before importing pyplot
mpl.use('cairo')

import matplotlib.pyplot as plt
import numpy as np  # NumPy is used throughout the snippets below
  • Prettify plots
import seaborn as sns

sns.set_style('whitegrid')
  • Suitable Aspect Ratio for Plots
mpl.rcParams['figure.figsize'] = 6.4, 3.2
  • Enforce text rendering via LaTeX and mimic the font of the default matplotlib text
plt.rc('text', usetex=True)
mpl.rcParams['mathtext.fontset'] = 'custom'
mpl.rcParams['mathtext.rm'] = 'Bitstream Vera Sans'
mpl.rcParams['mathtext.it'] = 'Bitstream Vera Sans:italic'
mpl.rcParams['mathtext.bf'] = 'Bitstream Vera Sans:bold'

Visualizations

  • Donated amount in bins on a logarithmic scale
items_orig = donations[['ProjectID', 'DonorID', 'DonationAmount']]

plt.figure()
plt.hist(items_orig['DonationAmount'], bins=np.logspace(np.log10(items_orig['DonationAmount'].min()), np.log10(items_orig['DonationAmount'].max()), num=28 + 1), histtype='step')
plt.gca().set_xscale('log')
plt.xlabel('Donated Amount')
plt.ylabel('#Occurrence')
plt.tight_layout()
plt.savefig('DonationAmount - Distribution of the donated amount on a logarithmic scale.pdf', bbox_inches='tight')
plt.close()
  • Donated amount in bins on a logarithmic scale for a cleaned subset
items_orig = donations.groupby(['DonorID', 'ProjectID'])['DonationAmount'].sum().reset_index()
# Perform preliminary data cleaning
items_orig = items_orig.drop(items_orig.query('0. <= DonationAmount <= 2.').index)
value_counts = items_orig['DonorID'].value_counts()
items_orig = items_orig[items_orig['DonorID'].isin(value_counts.index[value_counts >= 2])]

plt.figure()
plt.hist(items_orig['DonationAmount'], bins=np.logspace(np.log10(items_orig['DonationAmount'].min()), np.log10(items_orig['DonationAmount'].max()), num=13 + 1), density=True, histtype='step')
plt.gca().set_xscale('log')
plt.xlabel('Donated Amount')
plt.ylabel('Frequency')
plt.tight_layout()
plt.savefig('DonationAmount - Distribution of the donated amount on a logarithmic scale (for donors with at least 2 donations, excluding duplicates and low donations).pdf', bbox_inches='tight')
plt.close()
  • Distribution of ratings
# Shrink size and enlarge font
mpl.rcParams['figure.figsize'][0] /= 1.3

plt.figure()
plt.grid(b=False, axis='x')

plt.hist(items['DonationAmount'], bins=5, density=True, histtype='step')
# Relabel the five bin centers as ratings 1 through 5
plt.xticks([1.45, 2.2, 3., 3.8, 4.6], np.arange(1, 5 + 1))
plt.xlabel('Rating')
plt.ylabel('Frequency')
plt.tight_layout()
plt.savefig('DonationAmount - Distribution of ratings for logarithmic bins and excluded outliers.pdf', bbox_inches='tight')
plt.close()

mpl.rcParams['figure.figsize'][0] *= 1.3
  • Number of user donations
# Shrink size and enlarge font
mpl.rcParams['figure.figsize'][0] /= 1.3

plt.figure()
plt.grid(b=False, axis='x')

user_value_counts = items['DonorID'].value_counts()
# Disregard outliers
user_value_counts = user_value_counts[user_value_counts <= user_value_counts.mean() + user_value_counts.std()]

plt.hist(user_value_counts, bins=30, density=True, histtype='step')

plt.xlabel('Interactions per user')
plt.ylabel('Frequency')
plt.tight_layout()
plt.savefig('DonorID - Distribution of number of interactions per user.pdf', bbox_inches='tight')
plt.close()

mpl.rcParams['figure.figsize'][0] *= 1.3
  • RMSE for collaborative filtering techniques
plt.figure()
plt.grid(b=False, axis='x')

algorithms_name = ['zero', 'mean', 'random', 'SKLearn-KNN', 'SKLearn-NMF', 'SKLearn-SVD', 'SciPy-SVD']
# 'zero', 'mean' and 'random' are baselines: constant zero, the global mean and uniformly random predictions
average_rmse = [np.sqrt(np.square(np.zeros(items.shape[0]) - items['DonationAmount']).mean()),
  np.sqrt(np.square(np.full(items.shape[0], items['DonationAmount'].mean()) - items['DonationAmount']).mean()),
  np.sqrt(np.square(np.random.uniform(low=min(items['DonationAmount']), high=max(items['DonationAmount']), size=items.shape[0]) - items['DonationAmount']).mean()),
  np.sqrt(items['SquareErrorSKLearn-KNN'].mean()),
  np.sqrt(items['SquareErrorSKLearn-NMF'].mean()),
  np.sqrt(items['SquareErrorSKLearn-SVD'].mean()),
  np.sqrt(items['SquareErrorSciPy-SVD'].mean())]

plt.errorbar(np.arange(len(average_rmse)), average_rmse, xerr=0.45, markersize=0., ls='none')

plt.xticks(np.arange(len(algorithms_name)), algorithms_name)

plt.xlabel('Algorithm')
plt.ylabel('Test RMSE')

plt.gcf().autofmt_xdate()
plt.tight_layout()

plt.savefig('Collaborative Filters - RMSE for DIY algorithms and some baselines.pdf', bbox_inches='tight')
plt.close()
  • Recall@N for collaborative and content-based filters
plt.figure()
plt.grid(b=False, axis='x')

algorithms_name = ['SKLearn-KNN', 'SKLearn-NMF', 'SKLearn-SVD', 'SciPy-SVD', 'Tfidf']
algorithms_pretty_name = ['SKLearn-KNN', 'SKLearn-NMF', 'SKLearn-SVD', 'SciPy-SVD', 'SKLearn-TF-IDF']
average_recall = [items['RecallAtPosition' + alg_name].mean() for alg_name in algorithms_name]

plt.errorbar(np.arange(len(average_recall)), average_recall, xerr=0.45, markersize=0., ls='none')

plt.xticks(np.arange(len(algorithms_pretty_name)), algorithms_pretty_name)
plt.ylim(ymin=-1)

plt.xlabel('Algorithm')
plt.ylabel('Average position in Top-N test set')

plt.gcf().autofmt_xdate()
plt.tight_layout()

plt.savefig('Collaborative and Content-based Filters - Average position in Top-N test set for various algorithms.pdf', bbox_inches='tight')
plt.close()
plt.figure()
plt.grid(b=False, axis='x')

algorithms_name = ['SKLearn-KNN', 'SKLearn-SVD', 'Tfidf']
algorithms_pretty_name = ['SKLearn-KNN', 'SKLearn-SVD', 'SKLearn-TF-IDF']

plt.hist([items['RecallAtPosition' + alg_name] for alg_name in algorithms_name], bins=10, density=True, label=algorithms_pretty_name, histtype='step')

plt.legend(loc=9)
plt.xlabel('Position in Top-N test set')
plt.ylabel('Frequency')

plt.tight_layout()

plt.savefig('Collaborative and Content-based Filters - Distribution of position in Top-N test set for various algorithms.pdf', bbox_inches='tight')
plt.close()
  • Learning subsystem Recall@N performance
plt.figure()
plt.grid(b=False, axis='x')

algorithms_name = ['SKLearn-KNN', 'SKLearn-SVD', 'Tfidf', 'FastText']
recall_pos = [items['RecallAtPosition' + alg_name].values for alg_name in algorithms_name] + [items[['RecallAtPosition' + alg_name for alg_name in algorithms_name]].min(axis=1).values]
algorithms_pretty_name = ['KNN', 'SVD', 'TF-IDF', 'FastText', 'Combined']

plt.boxplot(recall_pos, positions=np.arange(len(algorithms_pretty_name)), meanline=True, showmeans=True, showfliers=False)

plt.xticks(np.arange(len(algorithms_pretty_name)), algorithms_pretty_name)
plt.ylim(ymin=-1)

plt.xlabel('Algorithm')
plt.ylabel('Position in Top-N test set')

plt.gcf().autofmt_xdate()
plt.tight_layout()

plt.savefig('Learning subsystem - Position in Top-N test set for various algorithms.pdf', bbox_inches='tight')
plt.close()
plt.figure()
plt.grid(b=False, axis='x')

algorithms_name = ['SKLearn-KNN', 'SKLearn-SVD', 'Tfidf', 'FastText']
algorithms_pretty_name = {'SKLearn-KNN': 'KNN', 'SKLearn-SVD': 'SVD', 'Tfidf': 'TF-IDF', 'FastText': 'FastText'}
algorithms_value_counts = items[['RecallAtPosition' + alg_name for alg_name in algorithms_name]].idxmin(axis=1).value_counts().rename(dict(zip(['RecallAtPosition' + alg_name for alg_name in algorithms_name], algorithms_name))).to_dict()

plt.hist([items['RecallAtPosition' + alg_name] for alg_name in algorithms_name], bins=10, density=True, label=['{:<s} ({:<2.2%} overall best)'.format(algorithms_pretty_name[alg_name], algorithms_value_counts[alg_name] / items.shape[0]) for alg_name in algorithms_name], histtype='step')

plt.legend(loc=9)
plt.xlabel('Position in Top-N test set')
plt.ylabel('Frequency')

plt.tight_layout()

plt.savefig('Learning subsystem - Distribution of position in Top-N test set for various algorithms.pdf', bbox_inches='tight')
plt.close()
  • Meta-learner performance for classification and error prediction
meta_subset = meta_items.loc[test_idx]

plt.figure()
plt.grid(b=False, axis='x')

meta_algorithms_name = [('Bagging', 'Bagging'), ('DecisionTree', 'DecisionTree'), ('BalancedDecisionTree', 'BalancedDTree'), ('GradientBoosting', 'GradientBoosting'), ('NeuralNetwork', 'NeuralNetwork')]
algorithm_selection_columns = [('MetaSubalgorithmPrediction', 'CL'), ('MetaPrediction', 'EP')]
meta_algorithms_column = np.array([[pre[0] + meta_alg_name[0] for pre in algorithm_selection_columns] for meta_alg_name in meta_algorithms_name]).flatten()
meta_algorithms_pretty_name = np.array([[pre[1] + ' ' + meta_alg_name[1] for pre in algorithm_selection_columns] for meta_alg_name in meta_algorithms_name]).flatten()
average_recall = [meta_subset[c].mean() for c in meta_algorithms_column]

plt.errorbar(np.arange(len(average_recall)), average_recall, color=np.array([[c for _ in range(len(algorithm_selection_columns))] for c in plt.rcParams['axes.prop_cycle'].by_key()['color'][:len(meta_algorithms_name)]]).flatten(), xerr=0.45, markersize=0., ls='none')
# Baseline: oracle selection, i.e. always picking the per-transaction best subalgorithm
plt.axhline(y=meta_subset.lookup(meta_subset.index, meta_subset['SubalgorithmCategory']).mean(), color='orange', linestyle='--')

plt.xticks(np.arange(len(meta_algorithms_pretty_name)), meta_algorithms_pretty_name)
plt.ylim(ymin=-1)

plt.xlabel('Algorithm')
plt.ylabel('Average position in Top-N test set')

plt.gcf().autofmt_xdate()
plt.tight_layout()

plt.savefig('Meta-learner as Classifier and Error Predictor - Average position in Top-N test set for various meta-learner algorithms.pdf', bbox_inches='tight')
plt.close()
  • Learning subsystem Recall@N performance with augmented filtering techniques
plt.figure()
plt.grid(b=False, axis='x')

algorithms_name = ['SKLearn-KNN', 'SKLearn-SVD', 'GroupByDonorStateCityZip-SKLearn-SVD', 'GroupByDonorStateCity-SKLearn-SVD', 'Tfidf', 'FastText']
recall_pos = [items['RecallAtPosition' + alg_name].values for alg_name in algorithms_name] + [items[['RecallAtPosition' + alg_name for alg_name in algorithms_name]].min(axis=1).values]
algorithms_value_counts = items[['RecallAtPosition' + alg_name for alg_name in algorithms_name]].idxmin(axis=1).value_counts().rename(dict(zip(['RecallAtPosition' + alg_name for alg_name in algorithms_name], algorithms_name))).to_dict()

algorithms_name = algorithms_name + ['Combined']
algorithms_value_counts['Combined'] = items.shape[0]
algorithms_pretty_name = {'SKLearn-KNN': 'KNN', 'SKLearn-SVD': 'SVD', 'GroupByDonorStateCityZip-SKLearn-SVD': 'SVD (State, City, Zip)', 'GroupByDonorStateCity-SKLearn-SVD': 'SVD (State, City)', 'Tfidf': 'TF-IDF', 'FastText': 'FastText', 'Combined': 'Combined'}

plt.boxplot(recall_pos, positions=np.arange(len(algorithms_pretty_name)), meanline=True, showmeans=True, showfliers=False)

# This got a little bit out of hand...
# Actually, just the percentage of each algorithm's contribution to the combined best is printed in a smaller font below the algorithm's name
plt.xticks(np.arange(len(algorithms_pretty_name)), [r'{{\fontsize{{1em}}{{3em}}\selectfont{{}}{0:<s}}}{1}{{\fontsize{{0.8em}}{{3em}}\selectfont{{}}{2:<2.2f}\%}}'.format(algorithms_pretty_name[alg_name], '\n', 100 * algorithms_value_counts[alg_name]  / items.shape[0]) for alg_name in algorithms_name])
plt.ylim(ymin=-1)

plt.xlabel('Algorithm')
plt.ylabel('Position in Top-N test set')

plt.gcf().autofmt_xdate()
plt.tight_layout()

plt.savefig('Learning subsystem - Position in Top-N test set for various algorithms with augmented filtering techniques.pdf', bbox_inches='tight')
plt.close()
  • Learning subsystem Recall@N performance, annotated with each algorithm's share of the combined best
plt.figure()
plt.grid(b=False, axis='x')

algorithms_name = ['SKLearn-KNN', 'SKLearn-SVD', 'Tfidf', 'FastText']
recall_pos = [items['RecallAtPosition' + alg_name].values for alg_name in algorithms_name]
algorithms_value_counts = items[['RecallAtPosition' + alg_name for alg_name in sorted(algorithms_name)]].idxmin(axis=1).value_counts().rename(dict(zip(['RecallAtPosition' + alg_name for alg_name in algorithms_name], algorithms_name))).to_dict()

algorithms_pretty_name = {'SKLearn-KNN': 'KNN', 'SKLearn-SVD': 'SVD', 'GroupByDonorStateCityZip-SKLearn-SVD': 'SVD (State, City, Zip)', 'GroupByDonorStateCity-SKLearn-SVD': 'SVD (State, City)', 'Tfidf': 'TF-IDF', 'FastText': 'FastText', 'Combined': 'Combined'}

plt.boxplot(recall_pos, positions=np.arange(len(algorithms_name)), meanline=True, showmeans=True, showfliers=False)

# This got a little bit out of hand...
# Actually, just the percentage of each algorithm's contribution to the combined best is printed in a smaller font below the algorithm's name
plt.xticks(np.arange(len(algorithms_name)), [r'{{\fontsize{{1em}}{{3em}}\selectfont{{}}{0:<s}}}{1}{{\fontsize{{0.8em}}{{3em}}\selectfont{{}}{2:<2.2f}\%}}'.format(algorithms_pretty_name[alg_name], '\n', 100 * algorithms_value_counts[alg_name]  / items.shape[0]) for alg_name in algorithms_name])
plt.ylim(ymin=-1)

plt.xlabel('Algorithm')
plt.ylabel('Position in Top-N test set')

plt.gcf().autofmt_xdate()
plt.tight_layout()

plt.savefig('Learning subsystem - Position in Top-N test set.pdf', bbox_inches='tight')
plt.close()
  • Meta-learner performance for classification and error prediction with augmented learning subsystem filtering techniques
meta_subset = meta_items.loc[test_idx]

plt.figure()
plt.grid(b=False, axis='x')

colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

meta_algorithms_name = [('MetaSubalgorithmPredictionBaggingCl', 'CL Bagging', colors[0]), ('MetaPredictionBaggingRg', 'EP Bagging', colors[3]), ('MetaSubalgorithmPredictionDecisionTreeCl', 'CL Decision Tree', colors[0]), ('MetaPredictionDecisionTreeRg40', 'EP Decision Tree', colors[3]), ('MetaSubalgorithmPredictionUserClusterKMeans', 'User-Clustering', colors[0]), ('MetaPredictionGradientBoostingRg', 'EP Gradient Boosting', colors[3]), ('MetaSubalgorithmPredictionStackingDecisionTree', 'Stacking DTree', colors[4])]
average_recall = [meta_subset[c].mean() for c in list(zip(*meta_algorithms_name))[0]]

plt.errorbar(np.arange(len(average_recall)), average_recall, color=list(zip(*meta_algorithms_name))[2], xerr=0.45, markersize=0., ls='none')
# Dashed baseline: always selecting the single overall most frequent best subalgorithm
plt.axhline(y=meta_subset[meta_subset['SubalgorithmCategory'].mode()[0]].mean(), color='orange', linestyle='--')
# Solid baseline: oracle selection of the per-transaction best subalgorithm
plt.axhline(y=meta_subset.lookup(meta_subset.index, meta_subset['SubalgorithmCategory']).mean(), color='orange', linestyle='-')

plt.xticks(np.arange(len(meta_algorithms_name)), list(zip(*meta_algorithms_name))[1])
plt.ylim(ymin=-1)

plt.xlabel('Algorithm')
plt.ylabel('Average position in Top-N test set')

plt.gcf().autofmt_xdate()
plt.tight_layout()

plt.savefig('Meta-learner as Classifier and Error Predictor - Average position in Top-N test set for various meta-learner algorithms with augmented learning subsystem filtering techniques.pdf', bbox_inches='tight')
plt.close()
  • Meta-learner performance
meta_subset = meta_items.loc[test_idx]

plt.figure()
plt.grid(b=False, axis='x')

colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

meta_algorithms_name = [('MetaSubalgorithmPredictionDecisionTreeRg', 'Classifying DTree', colors[0]), ('MetaPredictionGradientBoostingRg', 'GB Recall Prediction', colors[1]), ('MetaSubalgorithmPredictionUserClusterKMeans', 'K-Means Clustering', colors[2]), ('MetaSubalgorithmPredictionStackingDecisionTree', 'Stacking DTree', colors[3])]
average_recall = [meta_subset[c].mean() for c in list(zip(*meta_algorithms_name))[0]]

plt.errorbar(np.arange(len(average_recall)), average_recall, color=list(zip(*meta_algorithms_name))[2], xerr=0.45, markersize=0., ls='none')
# Baseline: always selecting the single overall most frequent best subalgorithm
plt.axhline(y=meta_subset[meta_subset['SubalgorithmCategory'].mode()[0]].mean(), color='orange', linestyle='--')

plt.xticks(np.arange(len(meta_algorithms_name)), list(zip(*meta_algorithms_name))[1])
plt.ylim(ymin=meta_subset.lookup(meta_subset.index, meta_subset['SubalgorithmCategory']).mean()-1)

plt.xlabel('Algorithm')
plt.ylabel('Average position in Top-N test set')

plt.gcf().autofmt_xdate()
plt.tight_layout()

plt.savefig('Meta-learner Performance - Average position in Top-N test set for various meta-learner algorithms with augmented learning subsystem filtering techniques.pdf', bbox_inches='tight')
plt.close()

Past Roadmap

  • Find a suitable dataset for meta-learning
  • Evaluate existing software frameworks for their applicability as meta-feature generators
    • Meta-feature algorithms should include collaborative, content-based and possibly deep-learning-based approaches
    • Suitable frameworks might be TensorFlow, scikit-learn and higher-level libraries like Keras and scikit-surprise
  • Train and compare various meta-learning models
    • Predict either the rating error or reformulate algorithm selection as a classification problem
    • Evaluate the models using appropriate metrics; possible candidates are the normalized discounted cumulative gain or the Kendall rank correlation coefficient

Outlook

  • Decaying rating based on the date of the donation (see the sketch after this list)
  • Use the algorithm with the lowest average overall error for each cluster in the user-clustering approach
  • Algorithm selection as a ranking task using meta-learning
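
A decaying rating could, for instance, weight the donated amount by an exponential factor. The following is a minimal sketch; the half-life is a free parameter and the column name DecayedDonationAmount is an assumption:

import numpy as np
import pandas as pd

# Hypothetical exponential decay with a configurable half-life in days
received = pd.to_datetime(donations['DonationReceivedDate'])
age_in_days = (received.max() - received).dt.days
half_life = 365.
donations['DecayedDonationAmount'] = donations['DonationAmount'] * np.exp(-np.log(2.) * age_in_days / half_life)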

About

License: The Unlicense

