linkerlin / cf4j

CF4J: Collaborative Filtering for Java

A Java collaborative filtering library for carrying out experiments in research on collaborative filtering based recommender systems. The library has been designed by researchers for researchers.

Index

  1. Installation
  2. Getting Started
  3. Project Structure
  4. Customize CF4J
  5. Algorithm List
  6. Examples
  7. Datasets

Installation

CF4J is available through the most popular dependency management tools for Java. To add it to your project, add the following lines to your build configuration.

For Maven:

<dependency>
  <groupId>es.upm.etsisi</groupId>
  <artifactId>cf4j</artifactId>
  <version>2.1.1</version>
</dependency>

For Gradle:

implementation group: 'es.upm.etsisi', name: 'cf4j', version: '2.1.1'

For SBT:

libraryDependencies += "es.upm.etsisi" % "cf4j" % "2.1.1"

For Ivy:

<dependency org="es.upm.etsisi" name="cf4j" rev="2.1.1"/>

For Grape:

@Grapes(
    @Grab(group='es.upm.etsisi', module='cf4j', version='2.1.1')
)

For Leiningen:

[es.upm.etsisi/cf4j "2.1.1"]

For Buildr:

'es.upm.etsisi:cf4j:jar:2.1.1'

You can find additional information about these dependencies at https://mvnrepository.com/artifact/es.upm.etsisi/cf4j

If you prefer to use the library without a dependency management tool, you must add the packaged CF4J jar to your project's classpath. For example, if you are using IntelliJ IDEA, copy the file to your project's directory, right-click the jar file and select Add as Library.

You can find the packaged CF4J jar in the releases section of the GitHub repository.

You can also package your own jar file. To do that, clone the repository using git clone git@github.com:ferortega/cf4j.git and package it with mvn package.

Getting Started

Let's code our first experiment with CF4J.

  1. First of all, we need to load the MovieLens ratings. CF4J includes preloaded versions of the most popular rating datasets. You can retrieve them using the BenchmarkDataModels class. In this experiment we will load the MovieLens 100K dataset.

    DataModel datamodel = BenchmarkDataModels.MovieLens100K();

    As you can see, the MovieLens dataset has been loaded into a DataModel. A DataModel is a high-level in-memory representation of the data structure required by collaborative filtering algorithms.

  2. Now, we need to create an object to store the results of our experiment. CF4J includes several tools to analyze experimental results. You can find them in the es.upm.etsisi.cf4j.util.plot package. In this case, we want to analyze how the Mean Squared Error (MSE) varies with the value of the regularization term in the Probabilistic Matrix Factorization (PMF) recommender, so we will use a LinePlot.

    double[] regValues = {0.000, 0.025, 0.05, 0.075, 0.100, 0.125, 0.150, 0.175, 0.200, 0.225, 0.250};
    LinePlot plot = new LinePlot(regValues, "regularization", "MSE");

  3. At this point everything is ready to perform the experiment. We add a new empty series to the plot:

    plot.addSeries("PMF");

    Then we iterate over the different regularization values, fitting a new instance of the PMF recommender for each of them, computing the MSE of the fitted recommender's predictions and adding the MSE score to the plot data. Note that the remaining hyper-parameters of the model have been fixed for this experiment (numFactors=6, numIters=50, gamma=0.01 and seed=43):

    for (double reg : regValues) {
      PMF pmf = new PMF(datamodel, 6, 50, reg, 0.01, 43);
      pmf.fit();
    
      QualityMeasure mse = new MSE(pmf);
      double mseScore = mse.getScore();
    
      plot.setValue("PMF", reg, mseScore);
    }

  4. Finally, we visualize the experimental results.

    To draw the plot we use:

    plot.draw();

    And we obtain the following chart:

    PMF regularization term experiment

    To print the plot data in the standard output console we use:

    plot.printData("0.000");

    And we obtain the following output:

    +----------------+-------+
    | regularization | PMF   |
    +----------------+-------+
    |          0.000 | 1.150 |
    +----------------+-------+
    |          0.025 | 1.070 |
    +----------------+-------+
    |          0.050 | 1.021 |
    +----------------+-------+
    |          0.075 | 0.990 |
    +----------------+-------+
    |          0.100 | 0.972 |
    +----------------+-------+
    |          0.125 | 0.966 |
    +----------------+-------+
    |          0.150 | 0.969 |
    +----------------+-------+
    |          0.175 | 0.979 |
    +----------------+-------+
    |          0.200 | 0.993 |
    +----------------+-------+
    |          0.225 | 1.009 |
    +----------------+-------+
    |          0.250 | 1.027 |
    +----------------+-------+
    

You can find the full code of this example in GettingStartedExample.
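To give an intuition of what the regularization term and the learning rate (gamma) control in the experiment above, here is a minimal, self-contained sketch of a regularized matrix factorization trained with stochastic gradient descent. It is not the CF4J PMF class: the PmfSketch name, its methods and the update rule shown are assumptions made for illustration, and the MSE here is computed over all the given ratings rather than averaged per test user.

```java
import java.util.Random;

// Hypothetical sketch for illustration only; this is NOT the CF4J PMF implementation.
class PmfSketch {
    private final double[][] p;  // user latent factors
    private final double[][] q;  // item latent factors
    private final double lambda; // regularization term
    private final double gamma;  // learning rate

    PmfSketch(int numUsers, int numItems, int numFactors,
              double lambda, double gamma, long seed) {
        this.lambda = lambda;
        this.gamma = gamma;
        Random rand = new Random(seed);
        this.p = randomMatrix(numUsers, numFactors, rand);
        this.q = randomMatrix(numItems, numFactors, rand);
    }

    private static double[][] randomMatrix(int rows, int cols, Random rand) {
        double[][] m = new double[rows][cols];
        for (double[] row : m) {
            for (int k = 0; k < cols; k++) {
                row[k] = rand.nextDouble(); // uniform in [0, 1)
            }
        }
        return m;
    }

    // The prediction is the dot product of the user and item factors.
    double predict(int user, int item) {
        double dot = 0.0;
        for (int k = 0; k < p[user].length; k++) {
            dot += p[user][k] * q[item][k];
        }
        return dot;
    }

    // Each iteration performs one gradient step per known rating.
    void fit(int[] users, int[] items, double[] ratings, int numIters) {
        for (int iter = 0; iter < numIters; iter++) {
            for (int n = 0; n < ratings.length; n++) {
                double[] pu = p[users[n]];
                double[] qi = q[items[n]];
                double error = ratings[n] - predict(users[n], items[n]);
                for (int k = 0; k < pu.length; k++) {
                    double oldPuk = pu[k];
                    // lambda shrinks the factors; gamma scales the step size
                    pu[k] += gamma * (error * qi[k] - lambda * pu[k]);
                    qi[k] += gamma * (error * oldPuk - lambda * qi[k]);
                }
            }
        }
    }

    // Mean squared error over the given ratings.
    double mse(int[] users, int[] items, double[] ratings) {
        double sum = 0.0;
        for (int n = 0; n < ratings.length; n++) {
            double diff = ratings[n] - predict(users[n], items[n]);
            sum += diff * diff;
        }
        return sum / ratings.length;
    }
}
```

A larger lambda pulls the factors towards zero, which can reduce overfitting but eventually underfits. This is consistent with the U-shaped MSE curve obtained in the experiment.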

Project Structure

The following image shows the class diagram of the whole project. The project has been divided into four main packages: data, recommender, qualityMeasure and util.

CF4J class diagram

es.upm.etsisi.cf4j.data package

This package contains all the classes that are needed to extract, transform, load and manipulate the data used by collaborative filtering algorithms. The most important classes of this package are:

  • DataSet. This interface is used to iterate over training and test ratings. Two implementations of this interface have been included: RandomSplitDataSet that randomly splits the ratings contained in a file into training and test ratings; and TrainTestFilesDataSet that loads training and test ratings from two different files.

  • DataModel. This class manages all the information related to a collaborative filtering based recommender system. A DataModel must be instantiated from a DataSet. Once created, the DataModel is composed of:

    • An array to store (training) User instances.
    • An array to store TestUser instances.
    • An array to store (training) Item instances.
    • An array to store TestItem instances.
  • User. This class represents a training user. Each User is defined by his/her index in the User array of the DataModel and a unique identifier. The User class contains the list of items rated by the user. These ratings can be retrieved using getItemAt(pos), which returns the index of the item rated at position pos, and getRatingAt(pos), which returns the rating value of the item rated at position pos. The item indexes returned by getItemAt(pos) are sorted in ascending order.

  • Item. This class represents a training item. Each Item is defined by its index in the Item array of the DataModel and a unique identifier. The Item class contains the list of users that have rated the item. These ratings can be retrieved using getUserAt(pos), which returns the index of the User instance that has rated the item at position pos, and getRatingAt(pos), which returns the rating value at position pos. The user indexes returned by getUserAt(pos) are sorted in ascending order.

  • TestUser. This class represents a test user. Every TestUser is also a User, since the TestUser class inherits from the User class. Each TestUser is defined by his/her index in the TestUser array of the DataModel. The TestUser class contains the list of items that the user has rated in the test set. These test ratings can be retrieved using getTestItemAt(pos), which returns the index of the test item rated at position pos, and getTestRatingAt(pos), which returns the test rating value of the test item rated at position pos. The test item indexes returned by getTestItemAt(pos) are sorted in ascending order.

  • TestItem. This class represents a test item. Every TestItem is also an Item, since the TestItem class inherits from the Item class. Each TestItem is defined by its index in the TestItem array of the DataModel. The TestItem class contains the list of test users that have rated the item in the test set. These test ratings can be retrieved using getTestUserAt(pos), which returns the index of the TestUser instance that has rated the test item at position pos, and getTestRatingAt(pos), which returns the test rating value at position pos. The test user indexes returned by getTestUserAt(pos) are sorted in ascending order.

  • BenchmarkDataModels. This class contains preloaded DataModel instances with the most popular datasets used in collaborative filtering research. See the Datasets section for more details.
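One practical consequence of keeping the indexes sorted, as described for User and Item above, is that a rating can be looked up with a binary search instead of a linear scan. The following sketch illustrates the idea with two parallel arrays. SortedRatingsSketch and findItemPos are hypothetical names chosen for this example and are not part of the CF4J API; only getItemAt and getRatingAt mirror the accessors mentioned above.

```java
import java.util.Arrays;

// Hypothetical sketch: ratings of one user stored as parallel arrays sorted by item index.
class SortedRatingsSketch {
    private final int[] itemIndexes; // assumed to be sorted in ascending order
    private final double[] ratings;  // ratings[pos] is the rating given to itemIndexes[pos]

    SortedRatingsSketch(int[] sortedItemIndexes, double[] ratings) {
        if (sortedItemIndexes.length != ratings.length) {
            throw new IllegalArgumentException("Both arrays must have the same length");
        }
        this.itemIndexes = sortedItemIndexes.clone();
        this.ratings = ratings.clone();
    }

    int getNumberOfRatings() {
        return itemIndexes.length;
    }

    int getItemAt(int pos) {
        return itemIndexes[pos];
    }

    double getRatingAt(int pos) {
        return ratings[pos];
    }

    // Returns the position of the item in the arrays, or -1 if the user has not rated it.
    int findItemPos(int itemIndex) {
        int pos = Arrays.binarySearch(itemIndexes, itemIndex);
        return pos >= 0 ? pos : -1;
    }
}
```

With this layout, checking whether a user has rated a given item takes logarithmic time in the number of ratings of that user.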

es.upm.etsisi.cf4j.recommender package

This package contains several implementations of collaborative filtering algorithms. You can check the full list in the Algorithm List section. Each collaborative filtering algorithm included in CF4J must extend the Recommender abstract class. This class requires subclasses to implement the following abstract methods:

  • fit(): used to estimate the parameters of the collaborative filtering recommender given the hyper-parameters, which are usually defined in the class constructor. To speed up the fitting process, most of the computations have been parallelized using the Parallelizer util.

  • predict(userIndex, itemIndex): used to estimate the rating prediction of the user with index userIndex to the item with index itemIndex.

Each Recommender must be created from a DataModel instance and will be fitted to it.
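The fit()/predict() contract can be illustrated with a very simple baseline that predicts each user's average training rating. The sketch below is self-contained and does not extend the real Recommender class: UserMeanRecommenderSketch and its array-based constructor are assumptions made so the example runs without the library.

```java
// Hypothetical sketch of the fit()/predict() contract; it does not use the CF4J classes.
class UserMeanRecommenderSketch {
    private final double[][] ratingsByUser; // ratingsByUser[u] holds the training ratings of user u
    private final double[] userMeans;
    private double globalMean = Double.NaN;

    UserMeanRecommenderSketch(double[][] ratingsByUser) {
        this.ratingsByUser = ratingsByUser;
        this.userMeans = new double[ratingsByUser.length];
    }

    // Estimates the model parameters: one mean per user plus a global mean.
    void fit() {
        double total = 0.0;
        int count = 0;
        for (int u = 0; u < ratingsByUser.length; u++) {
            double sum = 0.0;
            for (double rating : ratingsByUser[u]) {
                sum += rating;
            }
            int n = ratingsByUser[u].length;
            userMeans[u] = n > 0 ? sum / n : Double.NaN;
            total += sum;
            count += n;
        }
        globalMean = count > 0 ? total / count : Double.NaN;
    }

    // Predicts the user's mean rating; itemIndex is ignored by this baseline.
    // Users without training ratings fall back to the global mean.
    double predict(int userIndex, int itemIndex) {
        double mean = userMeans[userIndex];
        return Double.isNaN(mean) ? globalMean : mean;
    }
}
```

All the learning happens in fit(), so predict() is a cheap lookup that can be called many times when computing quality measures.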

es.upm.etsisi.cf4j.qualityMeasure package

This package contains the implementation of different quality measures for collaborative filtering based recommender systems. These quality measures are used to evaluate the performance of a Recommender instance. The included quality measures have been classified into two categories:

  • Quality measures for predictions, located in the es.upm.etsisi.cf4j.qualityMeasure.prediction package.
  • Quality measures for recommendations, located in the es.upm.etsisi.cf4j.qualityMeasure.recommendation package.

Each quality measure included in CF4J extends the QualityMeasure abstract class. This class simplifies the computation of a quality measure from the test ratings. It contains the getScore() method, which computes the score of the quality measure for each test user and returns the averaged score. The computation of the quality measure score for each test user is performed in parallel.
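The "score per test user, then average" scheme described above can be sketched with Java parallel streams. The example below computes a mean absolute error per user and averages the per-user scores. AveragedMaeSketch is a hypothetical class, and skipping users whose score is NaN is an assumption of this sketch, not a statement about how CF4J handles them.

```java
import java.util.stream.IntStream;

// Hypothetical sketch of a per-user quality measure averaged in parallel.
class AveragedMaeSketch {

    // Mean absolute error of one test user; NaN if none of the predictions is available.
    static double userScore(double[] testRatings, double[] predictions) {
        double sum = 0.0;
        int count = 0;
        for (int pos = 0; pos < testRatings.length; pos++) {
            if (!Double.isNaN(predictions[pos])) {
                sum += Math.abs(testRatings[pos] - predictions[pos]);
                count++;
            }
        }
        return count == 0 ? Double.NaN : sum / count;
    }

    // Computes each user's score in parallel and averages the scores that are not NaN.
    static double getScore(double[][] testRatings, double[][] predictions) {
        return IntStream.range(0, testRatings.length)
                .parallel()
                .mapToDouble(u -> userScore(testRatings[u], predictions[u]))
                .filter(score -> !Double.isNaN(score))
                .average()
                .orElse(Double.NaN);
    }
}
```

Note that averaging per user gives every test user the same weight, regardless of how many test ratings he/she has.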

es.upm.etsisi.cf4j.util package

This package contains different utilities designed to ease common operations used in collaborative filtering research. This package includes the following sub-packages:

  • es.upm.etsisi.cf4j.util.plot includes plotting tools designed to analyze the results obtained in collaborative filtering research. The following plot types are available:

    • LinePlot. Displays multiple data series with common numerical values on the x axis. Example:

      LinePlot example

    • XYPlot. Displays multiple data series defined by a sequence of XY points. All the points in a series must be assigned to a common plot's label. Example:

      XYPlot example

    • ScatterPlot. Displays the values of two numerical variables. Example:

      ScatterPlot example

    • HistogramPlot. Displays the histogram of a numerical variable by defining the number of bins. Example:

      HistogramPlot example

    • ColumnPlot. Displays numerical values related with a discrete variable placed on the x axis. Example:

      ColumnPlot example

  • es.upm.etsisi.cf4j.util.optimization includes optimization utils designed to tune recommenders' hyper-parameters.

  • es.upm.etsisi.cf4j.util.process includes processing utils designed to simplify the parallelization of collaborative filtering algorithms.

Read the javadoc documentation for additional information.

Customize CF4J

CF4J has been designed for the collaborative filtering research community, so extensibility has been one of the main requirements of this project. As described above, an execution with CF4J includes the following steps:

  1. Load a dataset using an implementation of the DataSet class.
  2. Create a new DataModel from the loaded DataSet.
  3. Fit a Recommender to the DataModel.
  4. Evaluate the performance of a Recommender using a QualityMeasure.

Therefore, if you want to customize CF4J, you must work with DataSet, DataModel, Recommender and QualityMeasure classes:

DataSet is an interface that contains two methods to iterate over training ratings (getRatingsIterator()) and test ratings (getTestRatingsIterator()). The iteration is carried out over DataSetEntry instances, each of which contains the user, item and value of a rating. Any class that implements this interface may be used to create a DataModel.
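As an illustration of this two-iterator design, the following sketch splits a list of ratings into training and test partitions using a seeded random number generator. It is a simplified, self-contained stand-in: RandomSplitSketch, its nested Entry class and the per-rating testRatio parameter are assumptions of this example and do not reproduce the RandomSplitDataSet class or the DataSetEntry class of the library.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of a data set exposing training and test rating iterators.
class RandomSplitSketch {

    // Minimal rating entry: user identifier, item identifier and rating value.
    static final class Entry {
        final String userId;
        final String itemId;
        final double rating;

        Entry(String userId, String itemId, double rating) {
            this.userId = userId;
            this.itemId = itemId;
            this.rating = rating;
        }
    }

    private final List<Entry> training = new ArrayList<>();
    private final List<Entry> test = new ArrayList<>();

    // Each rating goes to the test partition with probability testRatio.
    RandomSplitSketch(List<Entry> entries, double testRatio, long seed) {
        Random rand = new Random(seed); // a fixed seed makes the split reproducible
        for (Entry entry : entries) {
            if (rand.nextDouble() < testRatio) {
                test.add(entry);
            } else {
                training.add(entry);
            }
        }
    }

    Iterator<Entry> getRatingsIterator() {
        return training.iterator();
    }

    Iterator<Entry> getTestRatingsIterator() {
        return test.iterator();
    }

    int getNumberOfRatings() {
        return training.size();
    }

    int getNumberOfTestRatings() {
        return test.size();
    }
}
```

Fixing the seed is what makes experiments repeatable: two runs with the same input and the same seed evaluate the recommenders on exactly the same test ratings.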

DataModel is a class that should not be modified. It has been designed to manage the essential information required by most collaborative filtering algorithms (i.e. users, items and ratings). However, several algorithms add further information to the recommendation process, such as demographic information about the users or item descriptions. DataModel, User and Item each include a DataBank instance (see javadoc) to store and retrieve any additional information required by a custom Recommender.

The Recommender class can be extended to create your own collaborative filtering algorithm. As mentioned above, to create a new Recommender you must define the fit() and predict(userIndex, itemIndex) methods. In addition, to create a new similarity metric for kNN based collaborative filtering, you should extend UserSimilarityMetric or ItemSimilarityMetric for the user-to-user or item-to-item approaches of kNN, respectively.
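To show the kind of computation a similarity metric performs, here is a self-contained sketch of the Jaccard index between two users, computed by merging their sorted arrays of rated item indexes. It does not extend UserSimilarityMetric: JaccardSketch and its similarity method are hypothetical, and returning 0.0 when neither user has ratings is a choice made for this example.

```java
// Hypothetical sketch: Jaccard index over two ascending arrays of rated item indexes.
class JaccardSketch {

    static double similarity(int[] itemsA, int[] itemsB) {
        int i = 0;
        int j = 0;
        int common = 0;
        // Both arrays are sorted, so a single merge-like pass finds the co-rated items.
        while (i < itemsA.length && j < itemsB.length) {
            if (itemsA[i] < itemsB[j]) {
                i++;
            } else if (itemsA[i] > itemsB[j]) {
                j++;
            } else {
                common++;
                i++;
                j++;
            }
        }
        int union = itemsA.length + itemsB.length - common;
        return union == 0 ? 0.0 : (double) common / union;
    }
}
```

The single pass over both arrays is only possible because the indexes are sorted, which is why the data classes keep them in ascending order.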

The QualityMeasure class makes it easy to define new quality measures for both predictions and recommendations. This class includes an abstract method, getScore(TestUser testUser, double[] predictions), that must be implemented to compute the score of a testUser given his/her predictions.

Algorithm List

In this section we include the full list of algorithms implemented in the library.

  • Matrix factorization algorithms (es.upm.etsisi.cf4j.recommender.matrixFactorization package):

    Class       | Publication
    ------------+------------
    BiasedMF    | Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer, (8), 30-37
    BNMF        | Hernando, A., Bobadilla, J., & Ortega, F. (2016). A non negative matrix factorization for collaborative filtering recommender systems on a Bayesian probabilistic model. Knowledge-Based Systems, 97, 188-202
    CLiMF       | Shi, Y., Karatzoglou, A., Baltrunas, L., Larson, M., Oliver, N., & Hanjalic, A. (2012, September). CLiMF: learning to maximize reciprocal rank with collaborative less-is-more filtering. In Proceedings of the sixth ACM conference on Recommender systems (pp. 139-146)
    HPF         | Gopalan, P., Hofman, J. M., & Blei, D. M. (2015, July). Scalable Recommendation with Hierarchical Poisson Factorization. In UAI (pp. 326-335)
    NMF         | Lee, D. D., & Seung, H. S. (2001). Algorithms for non-negative matrix factorization. In Advances in neural information processing systems (pp. 556-562)
    PMF         | Mnih, A., & Salakhutdinov, R. R. (2008). Probabilistic matrix factorization. In Advances in neural information processing systems (pp. 1257-1264)
    SVDPlusPlus | Koren, Y. (2008, August). Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 426-434)
    URP         | Marlin, B. M. (2004). Modeling user rating profiles for collaborative filtering. In Advances in neural information processing systems (pp. 627-634)

  • Collaborative filtering based on neural networks (es.upm.etsisi.cf4j.recommender.neural package):

    Class | Publication
    ------+------------
    NCCF  | Bobadilla, J., Ortega, F., Gutiérrez, A., & Alonso, S. (2020). Classification-based Deep Neural Network Architecture for Collaborative Filtering Recommender Systems. International Journal of Interactive Multimedia & Artificial Intelligence, 6(1)

  • kNN based CF (both user-to-user and item-to-item approaches):

    • Traditional similarity metrics inspired by statistics (es.upm.etsisi.cf4j.recommender.knn.userSimilarityMetrics and es.upm.etsisi.cf4j.recommender.knn.itemSimilarityMetrics packages):

      • Pearson Correlation (Correlation)
      • Pearson Correlation Constrained (CorrelationConstrained)
      • Cosine similarity (Cosine)
      • Adjusted Cosine similarity (AdjustedCosine)
      • Jaccard index (Jaccard)
      • Mean Squared Difference (MSD)
      • Spearman Rank (SpearmanRank)
    • Similarity metrics created ad hoc for collaborative filtering algorithms (es.upm.etsisi.cf4j.recommender.knn.userSimilarityMetrics and es.upm.etsisi.cf4j.recommender.knn.itemSimilarityMetrics packages):

      Class         | Publication
      --------------+------------
      CJMSD         | Bobadilla, J., Ortega, F., Hernando, A., & Arroyo, A. (2012). A Balanced Memory-Based Collaborative Filtering Similarity Measure, International Journal of Intelligent Systems, 27, 939-946.
      JMSD          | Bobadilla, J., Serradilla, F., & Bernal, J. (2010). A new collaborative filtering metric that improves the behavior of Recommender Systems, Knowledge-Based Systems, 23 (6), 520-528.
      PIP           | Ahn, H. J. (2008). A new similarity measure for collaborative filtering to alleviate the new user cold-starting problem, Information Sciences, 178, 37-51.
      Singularities | Bobadilla, J., Ortega, F., & Hernando, A. (2012). A collaborative filtering similarity measure based on singularities, Information Processing and Management, 48 (2), 204-217.

  • Quality measures:

    • For prediction (es.upm.etsisi.cf4j.qualityMeasure.prediction package):

      • Coverage (Coverage)
      • Mean Absolute Error (MAE)
      • Max User Error (Max)
      • Mean Squared Error (MSE)
      • Mean Squared Logarithmic Error (MSLE)
      • Percentage of perfect predictions (Perfect)
      • Coefficient of determination R2 (R2)
      • Root Mean Squared Error (RMSE)
    • For recommendation (es.upm.etsisi.cf4j.qualityMeasure.recommendation package):

      • Precision (Precision)
      • Recall (Recall)
      • F1 (F1)
      • Normalized Discounted Cumulative Gain (NDCG)
      • Novelty (Novelty)
      • Discovery (Discovery)
      • Diversity (Diversity)
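For readers less familiar with the recommendation measures listed above, the sketch below shows how precision, recall and F1 can be computed for a single user from a top-N recommendation list and a set of relevant items. TopNMeasuresSketch is a hypothetical class written for this example. How the library decides which test items are relevant is not shown here, so the relevant set is simply taken as an input.

```java
import java.util.Set;

// Hypothetical sketch of precision, recall and F1 for one user's top-N recommendations.
class TopNMeasuresSketch {

    // Number of recommended items that are also relevant.
    private static int hits(int[] recommended, Set<Integer> relevant) {
        int hits = 0;
        for (int item : recommended) {
            if (relevant.contains(item)) {
                hits++;
            }
        }
        return hits;
    }

    // Fraction of the recommended items that are relevant.
    static double precision(int[] recommended, Set<Integer> relevant) {
        if (recommended.length == 0) {
            return Double.NaN;
        }
        return (double) hits(recommended, relevant) / recommended.length;
    }

    // Fraction of the relevant items that have been recommended.
    static double recall(int[] recommended, Set<Integer> relevant) {
        if (relevant.isEmpty()) {
            return Double.NaN;
        }
        return (double) hits(recommended, relevant) / relevant.size();
    }

    // Harmonic mean of precision and recall.
    static double f1(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2.0 * precision * recall / sum;
    }
}
```

For example, recommending items {10, 20, 30, 40} when the relevant items are {20, 40, 50} yields 2 hits, so precision is 2/4 and recall is 2/3.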

Examples

In src/main/java/es/upm/etsisi/cf4j/examples you can find the following examples, which show the main features of CF4J.

In examples/recommender you will find examples showing how to compare different Recommender instances:

  • MatrixFactorizationComparison compares the RMSE score for different matrix factorization models varying the number of latent factors.
  • UserKnnComparison compares the MAE, Coverage, Precision and Recall quality measure scores for different similarity metrics applied to user-to-user kNN based collaborative filtering. Each similarity metric is tested with different numbers of neighbors.
  • ItemKnnComparison compares the MSLE and NDCG quality measure scores for different similarity metrics applied to item-to-item kNN based collaborative filtering. Each similarity metric is tested with different numbers of neighbors.

In examples/plot you will find examples showing how to plot with CF4J:

  • ColumnPlotExample analyzes the rating value distribution of MovieLens 1M dataset using a ColumnPlot.
  • HistogramPlotExample analyzes the average rating of each item that belongs to MovieLens 1M dataset. It shows the results using a HistogramPlot.
  • LinePlotExample compares the F1 score of the recommendations performed by PMF and NMF recommenders. Results are included in a LinePlot that shows the number of recommendations performed on the x axis.
  • ScatterPlotExample builds a ScatterPlot comparing the number of ratings of each test user with his/her averaged prediction error using BiasedMF as recommender.
  • XYPlotExample compares the Precision score (y axis) and the Recall score (x axis) for PMF and NMF recommenders using an XYPlot.

In examples/gridSearch you will find examples showing how to use the GridSearch tool:

  • BiasedMFGridSearch tunes the hyper-parameters of the BiasedMF recommender using the GridSearch tool. The top 5 results with the lowest Mean Absolute Error (MAE) are printed.
  • UserKNNGridSearch tunes the parameters of the UserKNN recommender using the GridSearch tool. The top 5 results with the highest Precision score are printed.
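The idea behind a grid search can be sketched in a few lines: evaluate every combination of hyper-parameter values and keep the best ones. The example below is a generic, self-contained illustration for two numeric hyper-parameters and an error to minimize. GridSearchSketch, Result and search are hypothetical names, and this is not the API of the GridSearch tool shipped with the library.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.DoubleBinaryOperator;

// Hypothetical sketch of an exhaustive grid search over two hyper-parameters.
class GridSearchSketch {

    // One evaluated combination of hyper-parameter values and its error.
    static final class Result {
        final double first;
        final double second;
        final double score;

        Result(double first, double second, double score) {
            this.first = first;
            this.second = second;
            this.score = score;
        }
    }

    // Evaluates every combination and returns the topN results with the lowest error.
    static List<Result> search(double[] firstValues, double[] secondValues,
                               DoubleBinaryOperator errorFunction, int topN) {
        List<Result> results = new ArrayList<>();
        for (double a : firstValues) {
            for (double b : secondValues) {
                results.add(new Result(a, b, errorFunction.applyAsDouble(a, b)));
            }
        }
        results.sort(Comparator.comparingDouble((Result r) -> r.score));
        return new ArrayList<>(results.subList(0, Math.min(topN, results.size())));
    }
}
```

In a real experiment, errorFunction would fit a recommender with the given values and return a quality measure such as MAE. For measures that should be maximized, such as Precision, the sort order would be reversed.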

Datasets

CF4J includes the most popular datasets used in collaborative filtering research. These datasets have been preloaded into DataModel instances and can be retrieved using the BenchmarkDataModels class.

The datasets included in CF4J are:

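+---------------+-----------------+-----------------+-------------------+------------------------+--------------+
| Dataset       | Number of users | Number of items | Number of ratings | Number of test ratings | Rating scale |
+---------------+-----------------+-----------------+-------------------+------------------------+--------------+
| MovieLens100K |             943 |           1,682 |            92,026 |                  7,974 | 1 to 5       |
| MovieLens1M   |           6,040 |           3,706 |           911,031 |                 89,178 | 1 to 5       |
| MovieLens10M  |          69,878 |          10,677 |         9,104,681 |                895,373 | 0.5 to 5.0   |
| FilmTrust     |           1,508 |           2,071 |            32,675 |                  2,819 | 0.5 to 4.0   |
| BookCrossing  |          77,805 |         185,973 |           390,351 |                 43,320 | 1 to 10      |
| LibimSeTi     |         135,359 |         168,791 |        15,846,347 |              1,512,999 | 1 to 10      |
| MyAnimeList   |          69,600 |           9,927 |         5,788,207 |                549,027 | 1 to 10      |
| Jester        |          54,905 |             140 |         1,662,713 |                179,657 | -10 to 10    |
| Netflix Prize |         480,189 |          17,770 |        99,945,049 |                535,458 | 1 to 5       |
+---------------+-----------------+-----------------+-------------------+------------------------+--------------+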

License: Apache License 2.0

