cuMF / cumf_als

CUDA Matrix Factorization Library with Alternating Least Square (ALS)

Input data format?

cuihenggang opened this issue

It's not very clear what the input data format is.

main.cu appears to read at least the following input files:

"./netflix/R_test_coo.data.bin"
"./netflix/R_test_coo.row.bin"
"./netflix/R_test_coo.col.bin"
"./yahoo/yahoo_R_test_coo.data.bin"
"./yahoo/yahoo_R_test_coo.row.bin"
"./yahoo/yahoo_R_test_coo.col.bin"
"./netflix/R_train_csr.data.bin"
"./netflix/R_train_csr.indptr.bin"
"./netflix/R_train_csr.indices.bin"
"./yahoo/yahoo_R_train_csr.data.bin"
"./yahoo/yahoo_R_train_csr.indptr.bin"
"./yahoo/yahoo_R_train_csr.indices.bin"
"./netflix/R_train_csc.data.bin"
"./netflix/R_train_csc.indices.bin"
"./netflix/R_train_csc.indptr.bin"
"./yahoo/yahoo_R_train_csc.data.bin"
"./yahoo/yahoo_R_train_csc.indices.bin"
"./yahoo/yahoo_R_train_csc.indptr.bin"
"./netflix/R_train_coo.row.bin"
"./yahoo/yahoo_R_train_coo.row.bin"

Mathematically, for a given data set, e.g., netflix, you only need the training and testing rating matrices, say, netflix_mm and netflix_mme from http://www.select.cs.cmu.edu/code/graphlab/datasets/.
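For anyone who wants to inspect those two files before any conversion, here is a minimal sketch of a reader. It assumes netflix_mm and netflix_mme are plain Matrix Market coordinate files with three columns per entry (row, column, rating); that is an assumption, not something confirmed in this thread. The function name `parse_matrix_market` is hypothetical and is not part of cuMF.

```python
def parse_matrix_market(text):
    """Parse Matrix Market coordinate text into COO triplets.

    Assumes three tokens per entry (row, column, value); a file with
    extra columns raises ValueError instead of being misread.
    """
    stripped = (line.strip() for line in text.splitlines())
    # The banner and comment lines start with '%'; blank lines are skipped too.
    body = [line for line in stripped if line and not line.startswith("%")]

    # First remaining line is the size line: n_rows n_cols nnz.
    n_rows, n_cols, nnz = (int(tok) for tok in body[0].split())

    rows, cols, vals = [], [], []
    for line in body[1:]:
        i, j, v = line.split()
        rows.append(int(i) - 1)  # Matrix Market indices are 1-based
        cols.append(int(j) - 1)
        vals.append(float(v))

    if len(vals) != nnz:
        raise ValueError(f"expected {nnz} entries, found {len(vals)}")
    return n_rows, n_cols, rows, cols, vals
```

The result is the rating matrix as zero-based (row, column, value) triplets, which is the COO layout used by the R_test_coo.* files listed above.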

COO (coordinate), CSR (compressed sparse row) and CSC (compressed sparse column) are sparse-matrix layouts; the .bin files store the training and testing data in these layouts to facilitate the computation. I will add a Python script that starts from netflix_mm and netflix_mme and generates the required binary data files. Stay tuned :-)
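To make the three layouts concrete, here is a rough sketch of the conversion using only the Python standard library. It is not the script promised above. The helper names (`coo_to_csr`, `coo_to_csc`, `write_bin`) are hypothetical, and the element types that main.cu expects in each .bin file are not stated in this thread, so any typecode passed to `write_bin` is an assumption.

```python
from array import array


def coo_to_csr(n_rows, rows, cols, vals):
    """Convert zero-based COO triplets to CSR arrays (indptr, indices, data)."""
    # Count the entries in each row, then prefix-sum the counts into row pointers.
    indptr = [0] * (n_rows + 1)
    for r in rows:
        indptr[r + 1] += 1
    for i in range(n_rows):
        indptr[i + 1] += indptr[i]

    # Scatter every entry into the next free slot of its row.
    next_slot = indptr[:-1]  # slicing copies, so indptr itself is not modified
    indices = [0] * len(vals)
    data = [0.0] * len(vals)
    for r, c, v in zip(rows, cols, vals):
        k = next_slot[r]
        indices[k] = c
        data[k] = v
        next_slot[r] += 1
    return indptr, indices, data


def coo_to_csc(n_cols, rows, cols, vals):
    """CSC of a matrix is the CSR of its transpose, so swap rows and columns."""
    return coo_to_csr(n_cols, cols, rows, vals)


def write_bin(path, typecode, values):
    """Dump values as a raw binary array; the typecode is the caller's assumption."""
    with open(path, "wb") as f:
        array(typecode, values).tofile(f)
```

For a 3 x 4 matrix with entries (0,0)=5, (0,2)=3, (1,1)=4, (2,0)=1, (2,3)=2, `coo_to_csr` gives indptr [0, 2, 3, 5] and indices [0, 2, 1, 0, 3]. Each of the three returned arrays would then go to its own file, e.g. `write_bin("R_train_csr.indptr.bin", "i", indptr)`, where "i" (a C int) is only a guess at the expected type. If SciPy is available, `scipy.sparse.coo_matrix` with `.tocsr()` and `.tocsc()` performs the same conversions.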

Hengyang, please take a look at this file to prepare the needed input:
https://github.com/wei-tan/CuMF/blob/master/scripts/prepare_input.ipynb

Solved.

@cuihenggang: we have updated the code and README regarding input data. Just for your information :)