A Java library for Collaborative Filtering, built to carry out experiments in research on Collaborative Filtering based Recommender Systems. The library has been designed by researchers, for researchers.
If you enjoy cf4j, please cite us:
Ortega, F., Mayor, J., López-Fernández, D., & Lara-Cabrera, R. (2021). CF4J 2.0: Adapting Collaborative Filtering for Java to new challenges of collaborative filtering based recommender systems. Knowledge-Based Systems, 215, 106629.
@article{ortega2021cf4j,
title={CF4J 2.0: Adapting Collaborative Filtering for Java to new challenges of collaborative filtering based recommender systems},
author={Ortega, Fernando and Mayor, Jes{\'u}s and L{\'o}pez-Fern{\'a}ndez, Daniel and Lara-Cabrera, Ra{\'u}l},
journal={Knowledge-Based Systems},
volume={215},
pages={106629},
year={2021},
publisher={Elsevier}
}
Ortega, F., Zhu, B., Bobadilla, J., & Hernando, A. (2018). CF4J: Collaborative filtering for Java. Knowledge-Based Systems, 152, 94-99.
@article{ortega2018cf4j,
title={CF4J: Collaborative filtering for Java},
author={Ortega, Fernando and Zhu, Bo and Bobadilla, Jes{\'u}s and Hernando, Antonio},
journal={Knowledge-Based Systems},
volume={152},
pages={94--99},
year={2018},
publisher={Elsevier}
}
CF4J is available through the most popular dependency management tools for Java. To add it to your project, add the following lines to your build configuration.
For Maven:
<dependency>
<groupId>es.upm.etsisi</groupId>
<artifactId>cf4j</artifactId>
<version>2.3.0</version>
</dependency>
For Gradle:
implementation group: 'es.upm.etsisi', name: 'cf4j', version: '2.3.0'
For SBT:
libraryDependencies += "es.upm.etsisi" % "cf4j" % "2.3.0"
For Ivy:
<dependency org="es.upm.etsisi" name="cf4j" rev="2.3.0"/>
For Grape:
@Grapes(
@Grab(group='es.upm.etsisi', module='cf4j', version='2.3.0')
)
For Leiningen:
[es.upm.etsisi/cf4j "2.3.0"]
For Buildr:
'es.upm.etsisi:cf4j:jar:2.3.0'
You can find additional information about these dependencies at https://mvnrepository.com/artifact/es.upm.etsisi/cf4j
If you prefer to use the library without a dependency management tool, you must add the `jar` packaged version of CF4J to your project's classpath. For example, if you are using IntelliJ IDEA, copy the file to your project's directory, right-click on the `jar` file and select `Add as Library`.
You can find the `jar` packaged version of CF4J in the releases section of GitHub.
You can also package your own `jar` file. To do that, clone the repository using `git clone git@github.com:ferortega/cf4j.git` and package it with `mvn package`.
Let's code our first experiment with CF4J.

- First of all, we need to load the MovieLens ratings. CF4J includes preloaded versions of the most popular rating datasets. You can retrieve them using the `BenchmarkDataModels` class. In this experiment we will load the MovieLens 100K dataset.

      DataModel datamodel = BenchmarkDataModels.MovieLens100K();

  As you can observe, the MovieLens dataset has been loaded into a `DataModel`. A `DataModel` is a high-level, in-memory representation of the data structure required by collaborative filtering algorithms.

- Now, we need to create an object to store the results of our experiment. CF4J includes tools to analyze the experimental results. You can find them in the `es.upm.etsisi.cf4j.util.plot` package. In this case, we want to analyze how the Mean Squared Error (MSE) varies according to the value of the regularization term in the Probabilistic Matrix Factorization (`PMF`) recommender, so we will use a `LinePlot`.

      double[] regValues = {0.000, 0.025, 0.05, 0.075, 0.100, 0.125, 0.150, 0.175, 0.200, 0.225, 0.250};
      LinePlot plot = new LinePlot(regValues, "regularization", "MSE");

- At this point everything is ready to perform the experiment. We add a new empty series to the plot:

      plot.addSeries("PMF");

  Then we iterate over the different regularization values, fitting a new instance of the `PMF` recommender for each of them, computing the `MSE` of the fitted recommender's predictions and adding the MSE score to the plot data. Note that the remaining hyper-parameters of the model have been fixed for this experiment (`numFactors=6`, `numIters=50`, `gamma=0.01` and `seed=43`):

      for (double reg : regValues) {
          PMF pmf = new PMF(datamodel, 6, 50, reg, 0.01, 43);
          pmf.fit();

          QualityMeasure mse = new MSE(pmf);
          double mseScore = mse.getScore();

          plot.setValue("PMF", reg, mseScore);
      }
- Finally, we visualize the experimental results.

  To draw the plot we use:

      plot.draw();

  And we obtain the following chart:

  To print the plot data to the standard output console we use:

      plot.printData("0.000");

  And we obtain the following output:

      +----------------+-------+
      | regularization |   PMF |
      +----------------+-------+
      |          0.000 | 1.150 |
      +----------------+-------+
      |          0.025 | 1.070 |
      +----------------+-------+
      |          0.050 | 1.021 |
      +----------------+-------+
      |          0.075 | 0.990 |
      +----------------+-------+
      |          0.100 | 0.972 |
      +----------------+-------+
      |          0.125 | 0.966 |
      +----------------+-------+
      |          0.150 | 0.969 |
      +----------------+-------+
      |          0.175 | 0.979 |
      +----------------+-------+
      |          0.200 | 0.993 |
      +----------------+-------+
      |          0.225 | 1.009 |
      +----------------+-------+
      |          0.250 | 1.027 |
      +----------------+-------+
You can find the full code of this example in GettingStartedExample.
The following image shows the class diagram of the whole project. The project has been divided into four main packages: `data`, `recommender`, `qualityMeasure` and `util`.
This package contains all the classes needed to extract, transform, load and manipulate the data used by collaborative filtering algorithms. The most important classes of this package are:

- `DataSet`. This interface is used to iterate over training and test ratings. Two implementations of this interface are included: `RandomSplitDataSet`, which randomly splits the ratings contained in a file into training and test ratings; and `TrainTestFilesDataSet`, which loads training and test ratings from two different files.

- `DataModel`. This class manages all the information related to a collaborative filtering based recommender system. A `DataModel` must be instantiated from a `DataSet`. Once the `DataModel` is created, it is composed of:
  - An array to store (training) `User` instances.
  - An array to store `TestUser` instances.
  - An array to store (training) `Item` instances.
  - An array to store `TestItem` instances.

- `User`. This class represents a training user. Each `User` is defined by his/her index in the `User` array of the `DataModel` and a unique identifier. The `User` class contains the list of items rated by the user. These ratings can be retrieved using `getItemAt(pos)`, which returns the index of the item rated at the `pos` position, and `getRatingAt(pos)`, which returns the rating value of the item rated at the `pos` position. Item indexes returned by `getItemAt(pos)` are sorted from lower to higher.

- `Item`. This class represents a training item. Each `Item` is defined by its index in the `Item` array of the `DataModel` and a unique identifier. The `Item` class contains the list of users that have rated the item. These ratings can be retrieved using `getUserAt(pos)`, which returns the index of the `User` that has rated the item at the `pos` position, and `getRatingAt(pos)`, which returns the rating value at the `pos` position. User indexes returned by `getUserAt(pos)` are sorted from lower to higher.

- `TestUser`. This class represents a test user. Every `TestUser` is also a `User` due to the inheritance relationship between the `User` and `TestUser` classes. Each `TestUser` is defined by his/her index in the `TestUser` array of the `DataModel`. The `TestUser` class contains the list of items rated by the user in the test set. These test ratings can be retrieved using `getTestItemAt(pos)`, which returns the index of the item rated at the `pos` position, and `getTestRatingAt(pos)`, which returns the test rating value of the test item rated at the `pos` position. Test item indexes returned by `getTestItemAt(pos)` are sorted from lower to higher.

- `TestItem`. This class represents a test item. Every `TestItem` is also an `Item` due to the inheritance relationship between the `Item` and `TestItem` classes. Each `TestItem` is defined by its index in the `TestItem` array of the `DataModel`. The `TestItem` class contains the list of test users that have rated the item in the test set. These test ratings can be retrieved using `getTestUserAt(pos)`, which returns the index of the `TestUser` that has rated the test item at the `pos` position, and `getTestRatingAt(pos)`, which returns the test rating value at the `pos` position. Test user indexes returned by `getTestUserAt(pos)` are sorted from lower to higher.

- `BenchmarkDataModels`. This class contains preloaded `DataModel` instances for the most popular datasets used in collaborative filtering research. See the Datasets section for more details.
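As noted for `User` above, the item indexes returned by `getItemAt(pos)` are sorted from lower to higher. One consequence is that checking whether a user has rated a given item can be done with a binary search rather than a linear scan. The following is a minimal sketch of that idea; `SortedRatings` and its members are hypothetical names used for illustration and are not CF4J classes.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a user's ratings: parallel arrays sorted by item index. */
final class SortedRatings {
    private final int[] itemIndexes; // sorted from lower to higher
    private final double[] ratings;  // ratings[pos] belongs to itemIndexes[pos]

    SortedRatings(int[] itemIndexes, double[] ratings) {
        this.itemIndexes = itemIndexes;
        this.ratings = ratings;
    }

    /** Returns the position of itemIndex, or -1 if the item has not been rated. */
    int findItem(int itemIndex) {
        int pos = Arrays.binarySearch(itemIndexes, itemIndex);
        return pos >= 0 ? pos : -1;
    }

    /** Returns the rating given to itemIndex, or NaN if the item has not been rated. */
    double ratingOf(int itemIndex) {
        int pos = findItem(itemIndex);
        return pos >= 0 ? ratings[pos] : Double.NaN;
    }

    public static void main(String[] args) {
        SortedRatings user = new SortedRatings(new int[] {2, 5, 9}, new double[] {4.0, 3.5, 1.0});
        System.out.println(user.ratingOf(5)); // item 5 was rated
        System.out.println(user.ratingOf(7)); // item 7 was not rated
    }
}
```

Keeping the indexes sorted costs a little at load time but makes each lookup logarithmic in the number of ratings of the user.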
This package contains several implementations of collaborative filtering algorithms. You can check the full list in the Algorithm List section. Each collaborative filtering algorithm included in CF4J must extend the `Recommender` abstract class. This class requires the following abstract methods to be implemented:

- `fit()`: used to estimate the parameters of the collaborative filtering recommender given the hyper-parameters, which are usually defined in the class constructor. To speed up the fitting process, most of the computations have been parallelized using the `Parallelizer` util.
- `predict(userIndex, itemIndex)`: used to estimate the rating prediction of the user with index `userIndex` for the item with index `itemIndex`.

Each `Recommender` must be created from a `DataModel` instance and will be fitted to it.
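The fit/predict contract described above can be pictured with a small stand-alone sketch. `SketchRecommender` and `GlobalMeanRecommender` are hypothetical classes written for illustration; they neither use nor reproduce CF4J's `Recommender`. The ratings are passed as a plain matrix in which `NaN` marks a missing rating, which is an assumption of this sketch rather than how `DataModel` stores data.

```java
/** Hypothetical contract mirroring the two methods described above. */
abstract class SketchRecommender {
    protected final double[][] ratings; // ratings[user][item], NaN = not rated

    SketchRecommender(double[][] ratings) {
        this.ratings = ratings;
    }

    /** Estimates the model parameters from the training ratings. */
    abstract void fit();

    /** Predicts the rating of the given user for the given item. */
    abstract double predict(int userIndex, int itemIndex);
}

/** Trivial baseline: always predicts the mean of all known ratings. */
final class GlobalMeanRecommender extends SketchRecommender {
    private double mean = Double.NaN;

    GlobalMeanRecommender(double[][] ratings) {
        super(ratings);
    }

    @Override
    void fit() {
        double sum = 0.0;
        int count = 0;
        for (double[] row : ratings) {
            for (double r : row) {
                if (!Double.isNaN(r)) {
                    sum += r;
                    count++;
                }
            }
        }
        mean = count == 0 ? Double.NaN : sum / count;
    }

    @Override
    double predict(int userIndex, int itemIndex) {
        return mean; // same prediction for every user and item
    }

    public static void main(String[] args) {
        double[][] ratings = {{5.0, Double.NaN}, {3.0, 4.0}};
        GlobalMeanRecommender recommender = new GlobalMeanRecommender(ratings);
        recommender.fit();
        System.out.println(recommender.predict(0, 1));
    }
}
```

Separating `fit()` from the constructor lets hyper-parameters be set once at construction while the expensive estimation step runs only when requested.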
This package contains the implementation of different quality measures for collaborative filtering based recommender systems. These quality measures are used to evaluate the performance of a `Recommender` instance. The included quality measures are classified into two categories:

- Quality measures for predictions, located in the `es.upm.etsisi.cf4j.qualityMeasure.prediction` package.
- Quality measures for recommendations, located in the `es.upm.etsisi.cf4j.qualityMeasure.recommendation` package.

Each quality measure included in CF4J extends the `QualityMeasure` abstract class. This class simplifies the computation of a quality measure from the test ratings. It contains the `getScore()` method, which computes the score of the quality measure for each test user and returns the averaged score. The computation of the quality measure score for each test user is performed in parallel.
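The averaging scheme described above (one score per test user, then the mean of those scores, with users processed in parallel) can be sketched as follows. `AveragedMse` is a hypothetical helper, not CF4J's `MSE` class. Skipping users without test ratings is an assumption of this sketch.

```java
import java.util.stream.IntStream;

/** Hypothetical helper: mean squared error averaged over test users. */
final class AveragedMse {
    /** Mean squared error of one user's predictions against that user's test ratings. */
    static double userScore(double[] testRatings, double[] predictions) {
        double sum = 0.0;
        for (int i = 0; i < testRatings.length; i++) {
            double diff = testRatings[i] - predictions[i];
            sum += diff * diff;
        }
        return sum / testRatings.length;
    }

    /** Averages the per-user scores; the per-user scores are computed in parallel. */
    static double score(double[][] testRatings, double[][] predictions) {
        return IntStream.range(0, testRatings.length)
                .parallel()
                .filter(u -> testRatings[u].length > 0) // skip users without test ratings
                .mapToDouble(u -> userScore(testRatings[u], predictions[u]))
                .average()
                .orElse(Double.NaN);
    }

    public static void main(String[] args) {
        double[][] test = {{4.0, 2.0}, {5.0}};
        double[][] predictions = {{3.0, 2.0}, {3.0}};
        System.out.println(score(test, predictions));
    }
}
```

Note that averaging per-user scores weights every user equally, so the result generally differs from an error computed over all test ratings pooled together.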
This package contains different utilities designed to ease common operations in collaborative filtering research. It includes the following sub-packages:

- `es.upm.etsisi.cf4j.util.plot` includes plotting tools designed to analyze the results obtained from collaborative filtering research. The following plot types are available:
  - `LinePlot`. Displays multiple data series with common numerical values on the x axis.
  - `XYPlot`. Displays multiple data series defined by a sequence of XY points. All the points in a series must be assigned to a common plot label.
  - `ScatterPlot`. Displays the values of two numerical variables.
  - `HistogramPlot`. Displays the histogram of a numerical variable by defining the number of bins.
  - `ColumnPlot`. Displays numerical values related to a discrete variable placed on the x axis.
- `es.upm.etsisi.cf4j.util.optimization` includes optimization utils designed to tune the hyper-parameters of recommenders.
- `es.upm.etsisi.cf4j.util.process` includes processing utils designed to simplify the parallelization of collaborative filtering algorithms.
Read the javadoc documentation for additional information.
CF4J has been designed for the collaborative filtering research community, so extensibility has been one of the main requirements of this project. As described above, an execution with CF4J includes the following steps:

- Load a dataset using an implementation of the `DataSet` interface.
- Create a new `DataModel` from the loaded `DataSet`.
- Fit a `Recommender` to the `DataModel`.
- Evaluate the performance of the `Recommender` using a `QualityMeasure`.
Therefore, if you want to customize CF4J, you must work with the `DataSet`, `DataModel`, `Recommender` and `QualityMeasure` classes:
`DataSet` is an interface that contains two methods to iterate over training ratings (`getRatingsIterator()`) and test ratings (`getTestRatingsIterator()`). The iteration is carried out over `DataSetEntry` instances, which contain the user, item and value of a rating. Any class that implements this interface may be used to create a `DataModel`.
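To make the idea of a dataset that exposes separate training and test ratings concrete, here is a minimal sketch of a seeded random split. `Rating` and `RandomSplit` are hypothetical names; they are not CF4J's `DataSetEntry` or `RandomSplitDataSet`, and the independent coin flip per rating is an assumed splitting strategy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Hypothetical rating entry: user, item and rating value. */
final class Rating {
    final String userId;
    final String itemId;
    final double value;

    Rating(String userId, String itemId, double value) {
        this.userId = userId;
        this.itemId = itemId;
        this.value = value;
    }
}

/** Hypothetical splitter: sends each rating to test with probability testProportion. */
final class RandomSplit {
    final List<Rating> training = new ArrayList<>();
    final List<Rating> test = new ArrayList<>();

    RandomSplit(List<Rating> all, double testProportion, long seed) {
        Random random = new Random(seed); // fixed seed makes the split reproducible
        for (Rating rating : all) {
            if (random.nextDouble() < testProportion) {
                test.add(rating);
            } else {
                training.add(rating);
            }
        }
    }

    public static void main(String[] args) {
        List<Rating> all = Arrays.asList(
                new Rating("u1", "i1", 4.0),
                new Rating("u1", "i2", 2.5),
                new Rating("u2", "i1", 5.0),
                new Rating("u2", "i3", 1.0));
        RandomSplit split = new RandomSplit(all, 0.25, 42L);
        System.out.println(split.training.size() + " training, " + split.test.size() + " test");
    }
}
```

A custom dataset only needs to provide the two collections of entries; where the ratings come from (a file, a database, a generator) stays hidden behind the interface.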
`DataModel` is a class that should not be modified. It has been designed to manage the essential information required by most collaborative filtering algorithms (i.e. users, items and ratings). However, several algorithms add further information to the recommendation process, such as demographic information about the users or item descriptions. `DataModel`, `User` and `Item` each include a `DataBank` instance (see javadoc) to store and retrieve any additional information required by a custom `Recommender`.
The `Recommender` class can be extended to create your own collaborative filtering algorithm. As mentioned above, to create a new `Recommender` you must define the `fit()` and `predict(userIndex, itemIndex)` methods. In addition, to create a new similarity metric for kNN based collaborative filtering, you should extend `UserSimilarityMetric` or `ItemSimilarityMetric` for the user-to-user or item-to-item approaches of kNN, respectively.
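As an illustration of what a user-to-user similarity metric computes, the sketch below evaluates the Jaccard index between two users from their sorted arrays of rated item indexes. `JaccardSketch` is a hypothetical stand-alone helper; it does not extend `UserSimilarityMetric`, whose abstract methods are documented in the javadoc, and it may differ from CF4J's own `Jaccard` class.

```java
/** Hypothetical helper: Jaccard index between two sets of rated item indexes. */
final class JaccardSketch {
    /** Both arrays must be sorted from lower to higher and contain no duplicates. */
    static double similarity(int[] itemsA, int[] itemsB) {
        int i = 0;
        int j = 0;
        int common = 0;
        // Walk both sorted arrays at once, counting the indexes present in both.
        while (i < itemsA.length && j < itemsB.length) {
            if (itemsA[i] < itemsB[j]) {
                i++;
            } else if (itemsA[i] > itemsB[j]) {
                j++;
            } else {
                common++;
                i++;
                j++;
            }
        }
        int union = itemsA.length + itemsB.length - common;
        return union == 0 ? 0.0 : (double) common / union;
    }

    public static void main(String[] args) {
        int[] userA = {1, 2, 3};
        int[] userB = {2, 3, 4};
        System.out.println(similarity(userA, userB));
    }
}
```

Because the indexes are sorted, the intersection is found in a single linear pass with no auxiliary set.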
The `QualityMeasure` class makes it easy to define new quality measures for both predictions and recommendations. This class includes an abstract method, `getScore(TestUser testUser, double[] predictions)`, that must be implemented to compute the score of a `testUser` given his/her `predictions`.
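For instance, a per-user score for a recommendation measure such as precision could be computed along the lines of the following sketch. `PrecisionSketch` is a hypothetical helper that takes plain arrays instead of a `TestUser`; the `numRecommendations` and `relevantThreshold` parameters are assumptions of this sketch, and CF4J's `Precision` class may define the measure differently.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical helper: precision of the top-N recommendations for one test user. */
final class PrecisionSketch {
    /**
     * testRatings[i] and predictions[i] refer to the same test item.
     * Returns the fraction of recommended items whose test rating reaches the threshold.
     */
    static double userScore(double[] testRatings, double[] predictions,
                            int numRecommendations, double relevantThreshold) {
        Integer[] order = new Integer[predictions.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort test item positions by predicted rating, highest first.
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -predictions[i]));

        int recommended = Math.min(numRecommendations, order.length);
        if (recommended == 0) {
            return Double.NaN; // nothing can be recommended to this user
        }
        int relevant = 0;
        for (int k = 0; k < recommended; k++) {
            if (testRatings[order[k]] >= relevantThreshold) {
                relevant++;
            }
        }
        return (double) relevant / recommended;
    }

    public static void main(String[] args) {
        double[] testRatings = {5.0, 2.0, 4.0, 1.0};
        double[] predictions = {4.5, 4.0, 3.0, 1.0};
        System.out.println(userScore(testRatings, predictions, 2, 4.0));
    }
}
```

Only the per-user computation needs to be written; averaging the scores over all test users is the part the base class is described as handling.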
In this section we include the full list of algorithms implemented in the library.
- Matrix factorization algorithms (`es.upm.etsisi.cf4j.recommender.matrixFactorization` package):

  Class | Publication |
  ---|---|
  `BiasedMF` | Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer, (8), 30-37 |
  `BNMF` | Hernando, A., Bobadilla, J., & Ortega, F. (2016). A non negative matrix factorization for collaborative filtering recommender systems on a Bayesian probabilistic model. Knowledge-Based Systems, 97, 188-202 |
  `CLiMF` | Shi, Y., Karatzoglou, A., Baltrunas, L., Larson, M., Oliver, N., & Hanjalic, A. (2012, September). CLiMF: learning to maximize reciprocal rank with collaborative less-is-more filtering. In Proceedings of the sixth ACM conference on Recommender systems (pp. 139-146) |
  `HPF` | Gopalan, P., Hofman, J. M., & Blei, D. M. (2015, July). Scalable Recommendation with Hierarchical Poisson Factorization. In UAI (pp. 326-335) |
  `NMF` | Lee, D. D., & Seung, H. S. (2001). Algorithms for non-negative matrix factorization. In Advances in neural information processing systems (pp. 556-562) |
  `PMF` | Mnih, A., & Salakhutdinov, R. R. (2008). Probabilistic matrix factorization. In Advances in neural information processing systems (pp. 1257-1264) |
  `SVDPlusPlus` | Koren, Y. (2008, August). Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 426-434) |
  `URP` | Marlin, B. M. (2004). Modeling user rating profiles for collaborative filtering. In Advances in neural information processing systems (pp. 627-634) |
  `BeMF` | Ortega, F., Lara-Cabrera, R., González-Prieto, Á., & Bobadilla, J. (2021). Providing reliability in recommender systems through Bernoulli matrix factorization. Information Sciences, 553, 110-128. |
  `DeepMF` | Lara-Cabrera, R., González-Prieto, Á., & Ortega, F. (2020). Deep matrix factorization approach for collaborative filtering recommender systems. Applied Sciences, 10(14), 4926. |
  `DirMF` | Lara-Cabrera, R., González, Á., Ortega, F., & González-Prieto, Á. (2022). Dirichlet Matrix Factorization: A Reliable Classification-Based Recommender System. Applied Sciences, 12(3), 1223. |

- Collaborative filtering based on neural networks (`es.upm.etsisi.cf4j.recommender.neural` package):

  Class | Publication |
  ---|---|
  `NCCF` | Bobadilla, J., Ortega, F., Gutiérrez, A., & Alonso, S. (2020). Classification-based Deep Neural Network Architecture for Collaborative Filtering Recommender Systems. International Journal of Interactive Multimedia & Artificial Intelligence, 6(1) |
  `GMF` | He, Xiangnan & Liao, Lizi & Zhang, Hanwang. (2017). Neural Collaborative Filtering. Proceedings of the 26th International Conference on World Wide Web. |
  `MLP` | He, Xiangnan & Liao, Lizi & Zhang, Hanwang. (2017). Neural Collaborative Filtering. Proceedings of the 26th International Conference on World Wide Web. |
  `NeuMF` | He, Xiangnan & Liao, Lizi & Zhang, Hanwang. (2017). Neural Collaborative Filtering. Proceedings of the 26th International Conference on World Wide Web. |

- kNN based CF (both user-to-user and item-to-item approaches):

  - Traditional similarity metrics inspired by statistics (`es.upm.etsisi.cf4j.recommender.knn.userSimilarityMetrics` and `es.upm.etsisi.cf4j.recommender.knn.itemSimilarityMetrics` packages):
    - Pearson Correlation (`Correlation`)
    - Pearson Correlation Constrained (`CorrelationConstrained`)
    - Cosine similarity (`Cosine`)
    - Adjusted Cosine similarity (`AdjustedCosine`)
    - Jaccard index (`Jaccard`)
    - Mean Squared Difference (`MSD`)
    - Spearman Rank (`SpearmanRank`)

  - Similarity metrics created ad hoc for collaborative filtering algorithms (`es.upm.etsisi.cf4j.recommender.knn.userSimilarityMetrics` and `es.upm.etsisi.cf4j.recommender.knn.itemSimilarityMetrics` packages):

    Class | Publication |
    ---|---|
    `CJMSD` | Bobadilla, J., Ortega, F., Hernando, A., & Arroyo, A. (2012). A Balanced Memory-Based Collaborative Filtering Similarity Measure, International Journal of Intelligent Systems, 27, 939-946. |
    `JMSD` | Bobadilla, J., Serradilla, F., & Bernal, J. (2010). A new collaborative filtering metric that improves the behavior of Recommender Systems, Knowledge-Based Systems, 23 (6), 520-528. |
    `PIP` | Ahn, H. J. (2008). A new similarity measure for collaborative filtering to alleviate the new user cold-starting problem, Information Sciences, 178, 37-51. |
    `Singularities` | Bobadilla, J., Ortega, F., & Hernando, A. (2012). A collaborative filtering similarity measure based on singularities, Information Processing and Management, 48 (2), 204-217. |

- Quality measures:

  - For prediction (`es.upm.etsisi.cf4j.qualityMeasure.prediction` package):
    - Coverage (`Coverage`)
    - Mean Absolute Error (`MAE`)
    - Max User Error (`Max`)
    - Mean Squared Error (`MSE`)
    - Mean Squared Logarithmic Error (`MSLE`)
    - Percentage of perfect predictions (`Perfect`)
    - Coefficient of determination R2 (`R2`)
    - Root Mean Squared Error (`RMSE`)

  - For recommendation (`es.upm.etsisi.cf4j.qualityMeasure.recommendation` package):
    - Precision (`Precision`)
    - Recall (`Recall`)
    - F1 (`F1`)
    - Normalized Discounted Cumulative Gain (`NDCG`)
    - Novelty (`Novelty`)
    - Discovery (`Discovery`)
    - Diversity (`Diversity`)
In `src/main/java/es/upm/etsisi/cf4j/examples` you can find the following examples, which show the main features of CF4J.
In `examples/recommender` you will find examples showing how to compare different `Recommender` instances:

- `MatrixFactorizationComparison` compares the RMSE score of different matrix factorization models, varying the number of latent factors.
- `UserKnnComparison` compares the MAE, Coverage, Precision and Recall scores of different similarity metrics applied to user-to-user kNN based collaborative filtering. Each similarity metric is tested with different numbers of neighbors.
- `ItemKnnComparison` compares the MSLE and nDCG scores of different similarity metrics applied to item-to-item kNN based collaborative filtering. Each similarity metric is tested with different numbers of neighbors.
In `examples/plot` you will find examples showing how to plot with CF4J:

- `ColumnPlotExample` analyzes the rating value distribution of the MovieLens 1M dataset using a ColumnPlot.
- `HistogramPlotExample` analyzes the average rating of each item in the MovieLens 1M dataset. It shows the results using a HistogramPlot.
- `LinePlotExample` compares the F1 score of the recommendations made by the PMF and NMF recommenders. Results are shown in a LinePlot with the number of recommendations on the x axis.
- `ScatterPlotExample` builds a ScatterPlot comparing the number of ratings of each test user with his/her average prediction error, using BiasedMF as the recommender.
- `XYPlotExample` compares the Precision score (y axis) and the Recall score (x axis) of the PMF and NMF recommenders using an XYPlot.
In `examples/gridSearch` you will find examples showing how to tune hyper-parameters with the GridSearch and RandomSearchCV tools:

- `BiasedMFGridSearch` tunes the hyper-parameters of the BiasedMF recommender using the GridSearch tool. The top 5 results with the lowest Mean Absolute Error (MAE) are printed.
- `UserKNNGridSearch` tunes the parameters of the UserKNN recommender using the GridSearch tool. The top 5 results with the highest Precision score are printed.
- `PMFRandomSearchCV` tunes the parameters of the PMF recommender using the RandomSearchCV tool. The top 10 results with the lowest Mean Squared Error (MSE) are printed.
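The idea behind a grid search, evaluating every combination of candidate hyper-parameter values and keeping the best ones, can be sketched without CF4J as follows. `GridSearchSketch`, its method names and the toy scoring function are all hypothetical; CF4J's GridSearch tool has its own API, described in the javadoc.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.DoubleBinaryOperator;

/** Hypothetical exhaustive search over two numeric hyper-parameters. */
final class GridSearchSketch {
    /** One evaluated combination of the two hyper-parameters. */
    static final class Result {
        final double first;
        final double second;
        final double score;

        Result(double first, double second, double score) {
            this.first = first;
            this.second = second;
            this.score = score;
        }
    }

    /** Evaluates every (first, second) pair and returns the topN results with the lowest score. */
    static List<Result> search(double[] firstValues, double[] secondValues,
                               DoubleBinaryOperator scorer, int topN) {
        List<Result> results = new ArrayList<>();
        for (double first : firstValues) {
            for (double second : secondValues) {
                results.add(new Result(first, second, scorer.applyAsDouble(first, second)));
            }
        }
        results.sort(Comparator.comparingDouble((Result r) -> r.score));
        return new ArrayList<>(results.subList(0, Math.min(topN, results.size())));
    }

    public static void main(String[] args) {
        // Toy error surface standing in for "fit a recommender and measure its error".
        DoubleBinaryOperator toyError = (reg, factors) ->
                (reg - 0.1) * (reg - 0.1) + (factors - 6.0) * (factors - 6.0);
        List<Result> best = search(new double[] {0.0, 0.1, 0.2}, new double[] {4, 6, 8}, toyError, 2);
        for (Result r : best) {
            System.out.println(r.first + " " + r.second + " -> " + r.score);
        }
    }
}
```

In a real experiment the scoring function would fit a recommender with the given values and return a quality measure. A random search evaluates only a sample of the combinations instead of all of them.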
CF4J includes the most popular datasets used in collaborative filtering research. These datasets have been preloaded into `DataModel` instances and can be retrieved using the `BenchmarkDataModels` class.
The datasets included in CF4J are:
Dataset | Number of users | Number of items | Number of ratings | Number of test ratings | Rating scale |
---|---|---|---|---|---|
MovieLens100K | 943 | 1,682 | 92,026 | 7,974 | 1 to 5 |
MovieLens1M | 6,040 | 3,706 | 911,031 | 89,178 | 1 to 5 |
MovieLens10M | 69,878 | 10,677 | 9,104,681 | 895,373 | 0.5 to 5.0 |
FilmTrust | 1,508 | 2,071 | 32,675 | 2,819 | 0.5 to 4.0 |
BookCrossing | 77,805 | 185,973 | 390,351 | 43,320 | 1 to 10 |
LibimSeTi | 135,359 | 168,791 | 15,846,347 | 1,512,999 | 1 to 10 |
MyAnimeList | 69,600 | 9,927 | 5,788,207 | 549,027 | 1 to 10 |
Jester | 54,905 | 140 | 1,662,713 | 179,657 | -10 to 10 |
Netflix Prize | 480,189 | 17,770 | 99,945,049 | 535,458 | 1 to 5 |
BoardGameGeek | 411,375 | 21,925 | 18,273,394 | 636,134 | 1 to 10 |
ALERT: due to security changes on the server hosting the `BenchmarkDataModels`, these will no longer be available for versions lower than `2.3.0`. If you need to continue using the `BenchmarkDataModels`, please upgrade to version `2.3.0` or higher.