Features with multiple fields in `bigdata.tr.txt`
Hydrotoast opened this issue
Hi,
It seems like `bigdata.tr.txt` has feature `2739` with multiple fields (`5` and `13`) in lines `21`, `36`, and `88`. Shouldn't fields be unique per feature?
Also, feature `7686` appears with multiple fields in line `114`.
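A check like this can be scripted. The following is a minimal sketch, assuming each line follows the libffm text format `label field:feature:value ...`; the function name `multi_field_features` is made up for illustration, and the sample lines are hypothetical apart from the two triples quoted in this thread.

```python
from collections import defaultdict


def multi_field_features(lines):
    """Return {feature: sorted list of fields} for every feature index
    that appears under more than one field."""
    fields_by_feature = defaultdict(set)
    for line in lines:
        tokens = line.split()
        # tokens[0] is the label; the rest are field:feature:value triples.
        for token in tokens[1:]:
            field, feature, _value = token.split(":")
            fields_by_feature[int(feature)].add(int(field))
    return {
        feature: sorted(fields)
        for feature, fields in fields_by_feature.items()
        if len(fields) > 1
    }


# Hypothetical sample lines; only the 2738 triples come from this thread.
sample = [
    "1 5:2738:0.3651 7:10:1",
    "0 13:2738:0.3651 7:11:1",
]
print(multi_field_features(sample))  # {2738: [5, 13]}
```

To run it on the real file, pass an open file handle instead of `sample`, for example `multi_field_features(open("bigdata.tr.txt"))`.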
@Hydrotoast, you are right, but it is feature 2738 (not 2739).
I spotted that 5:2738:0.3651 and 13:2738:0.3651 appear multiple times in the file.
According to Section 3.3 of the libffm paper ("Adding Field Information", numerical features), I think it could be explained as follows:
the two fields (5 and 13) have the same value, 0.3651, and after some discretization method, 0.3651 has been transformed to '2738'.
This is only my guess, as I have just started to look into libffm. I hope the raw data behind bigdata.tr.txt and the code that prepares the input for libffm can be included in this repo.
@jenniyanjie, you can find the answer in `ffm.cpp`, lines 68-69 and 43-44. The feature index is field-independent.
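To illustrate what field-independence implies, here is a toy sketch based on the FFM pairwise term from the paper, (w_{j1,f2} · w_{j2,f1}) x_{j1} x_{j2}. It is not libffm's code: the dictionary layout, the function name `pair_term`, and all weight values are assumptions made for illustration. The point is that the latent vector chosen for a feature depends on its index and on the partner's field, not on the field the feature itself is listed under.

```python
def pair_term(weights, node1, node2):
    """FFM interaction term for two (field, feature, value) nodes.

    weights maps (feature, partner_field) to a latent vector (hypothetical layout).
    """
    f1, j1, v1 = node1
    f2, j2, v2 = node2
    w1 = weights[(j1, f2)]  # feature j1's vector toward field f2; f1 is not used here
    w2 = weights[(j2, f1)]  # feature j2's vector toward field f1
    return sum(a * b for a, b in zip(w1, w2)) * v1 * v2


# Made-up weights for a tiny example.
weights = {
    (2738, 0): [0.1, 0.2],
    (7, 5): [0.3, 0.4],
    (7, 13): [0.5, 0.6],
}

partner = (0, 7, 1.0)
# Feature 2738 listed under field 5 and under field 13: both calls read
# weights[(2738, 0)], so the two occurrences share the same latent vector.
print(pair_term(weights, (5, 2738, 1.0), partner))
print(pair_term(weights, (13, 2738, 1.0), partner))
```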
This is actually because this tiny data set was subsampled from the data of Kaggle's Criteo competition, where we used the hashing trick to generate features. The example you gave is due to a hash collision. We admit this is confusing and will resolve the issue in the next release.
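For readers unfamiliar with the hashing trick, the sketch below shows how collisions across fields can arise. It is only an illustration: the thread does not say which hash function, key format, or number of bins the Criteo pipeline used, so MD5, the `field-value` key, and the tiny bin count here are all assumptions.

```python
import hashlib
from collections import defaultdict


def hashed_index(field, raw_value, nr_bins):
    """Map a (field, raw value) pair to a feature index in [0, nr_bins)."""
    key = f"{field}-{raw_value}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % nr_bins


def find_collisions(pairs, nr_bins):
    """Group (field, raw value) pairs by hashed index, keeping only
    indices shared by more than one pair."""
    groups = defaultdict(list)
    for field, raw_value in pairs:
        groups[hashed_index(field, raw_value, nr_bins)].append((field, raw_value))
    return {idx: members for idx, members in groups.items() if len(members) > 1}


# 20 pairs, each from a different field, hashed into only 8 bins:
# by the pigeonhole principle at least two fields must share an index.
pairs = [(field, "a") for field in range(20)]
collisions = find_collisions(pairs, nr_bins=8)
for idx, members in sorted(collisions.items()):
    print(idx, members)
```

With a realistic number of bins, collisions are much rarer but still possible, which would be consistent with a single feature index such as 2738 showing up under both field 5 and field 13.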