ycjuan / libffm

A Library for Field-aware Factorization Machines

Features with multiple fields in `bigdata.tr.txt`

Hydrotoast opened this issue

Hi,

It seems like bigdata.tr.txt has feature 2739 with multiple fields (5 and 13) in lines 21, 36, and 88. Shouldn't fields be unique per feature?

Also, feature 7686 appears with multiple fields in line 114.
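
A check like this can be scripted. The following is a minimal sketch, assuming each line follows the libffm text format `label field:feature:value ...`; the function name `find_multi_field_features` and the sample lines are made up for illustration and are not taken from `bigdata.tr.txt`.

```python
from collections import defaultdict


def find_multi_field_features(lines):
    """Return features that occur under more than one field.

    The result maps feature index -> (sorted fields, sorted line numbers
    on which the feature occurs at all).
    """
    fields_by_feature = defaultdict(set)
    lines_by_feature = defaultdict(set)
    for line_no, line in enumerate(lines, start=1):
        tokens = line.split()
        # tokens[0] is the label; the rest are field:feature:value triples.
        for token in tokens[1:]:
            field, feature, _value = token.split(":")
            fields_by_feature[int(feature)].add(int(field))
            lines_by_feature[int(feature)].add(line_no)
    return {
        feature: (sorted(fields), sorted(lines_by_feature[feature]))
        for feature, fields in fields_by_feature.items()
        if len(fields) > 1
    }


# Hypothetical sample lines in libffm format.
sample = [
    "1 5:2738:0.3651 13:2738:0.3651 2:10:1",
    "0 5:2738:0.3651 3:11:1",
]
print(find_multi_field_features(sample))  # {2738: ([5, 13], [1, 2])}
```

To scan the real file, pass an open file handle instead of `sample`, since the function only iterates over lines.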

@Hydrotoast, you are right, but it is feature 2738 (not 2739).
I spotted that 5:2738:0.3651 and 13:2738:0.3651 appear multiple times in the file.

According to section 3.3 of the libffm paper (Adding field information, Numerical feature), I think it could be explained as follows:
the two fields (5 and 13) have the same value, 0.3651, and after some discretization method, 0.3651 was transformed to '2738'.

This is only my guess, as I have just started looking into libffm. I hope the raw data for bigdata.tr.txt and the code for preparing the libffm input can be included in this repo.

@jenniyanjie, you can find the answer in ffm.cpp, lines 68-69 and 43-44. The feature index is field-independent.
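
One way to read "field-independent": in the FFM model, the pair term is (w[j1, f2] · w[j2, f1]) * x1 * x2, so the latent vector for feature j1 is selected by j1 and the field of the *other* feature, and j1's own field does not enter its lookup. The sketch below illustrates that formula only. It is not libffm's code, and the dictionary-based `weights` table, `make_weights`, and `ffm_phi` are hypothetical names.

```python
import random
from itertools import combinations

K = 4  # number of latent factors (illustrative value)


def make_weights(features, fields, seed=0):
    """Hypothetical weight table: one latent vector per (feature, other field)."""
    rng = random.Random(seed)
    return {
        (j, f): [rng.uniform(-0.1, 0.1) for _ in range(K)]
        for j in features
        for f in fields
    }


def ffm_phi(nodes, weights):
    """Sum (w[j1, f2] . w[j2, f1]) * x1 * x2 over all pairs of nodes.

    nodes is a list of (field, feature, value) triples from one line.
    """
    total = 0.0
    for (f1, j1, x1), (f2, j2, x2) in combinations(nodes, 2):
        w1 = weights[(j1, f2)]  # feature j1, field of the other node
        w2 = weights[(j2, f1)]  # feature j2, field of the other node
        total += sum(a * b for a, b in zip(w1, w2)) * x1 * x2
    return total


weights = make_weights(features=[10, 2738], fields=[2, 5, 13])
# Whether 2738 is listed under field 5 or field 13, its own vector when
# paired with a field-2 feature is weights[(2738, 2)] in both cases.
phi_field5 = ffm_phi([(5, 2738, 0.3651), (2, 10, 1.0)], weights)
phi_field13 = ffm_phi([(13, 2738, 0.3651), (2, 10, 1.0)], weights)
```

The two results still differ in general, because the vector chosen for feature 10 depends on the field that 2738 is listed under.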

This is actually because this tiny data set was subsampled from Kaggle's Criteo competition, where we used the hashing trick to generate features. The example you gave is due to a hash collision. We admit this is confusing, and will resolve this issue in the next release.
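
To illustrate how the hashing trick can give one feature index to values from two different fields, here is a minimal sketch. It is not the preprocessing code used for the Criteo data: the key format, the use of MD5, the bin count, and the names `hashed_index` and `first_collision` are all assumptions made for the example.

```python
import hashlib

NUM_BINS = 1000  # illustrative; a real pipeline would use far more bins


def hashed_index(field, raw_value, num_bins=NUM_BINS):
    """Map a (field, raw value) pair to a feature index in [0, num_bins)."""
    key = f"{field}-{raw_value}".encode("utf-8")
    digest = hashlib.md5(key).hexdigest()
    return int(digest, 16) % num_bins


def first_collision(field_a, field_b, num_bins=NUM_BINS):
    """Search for raw values in two different fields that share an index.

    Returns (index, value_in_field_a, value_in_field_b), or None if the
    bounded search finds nothing.
    """
    seen = {}  # index -> first raw value from field_a that hashed to it
    for i in range(num_bins):
        value = f"v{i}"
        seen.setdefault(hashed_index(field_a, value, num_bins), value)
    for i in range(100 * num_bins):
        value = f"w{i}"
        index = hashed_index(field_b, value, num_bins)
        if index in seen:
            return index, seen[index], value
    return None


collision = first_collision(5, 13)
print(collision)  # one index shared by a field-5 value and a field-13 value
```

Because the number of bins is finite, distinct (field, value) pairs can land on the same index, which is how a single feature index can show up under fields 5 and 13.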