ThyrixYang / es_dfm

Hi, thanks for your work and for contributing it to the community. :) I found a typo while preparing the Criteo dataset:

es_dfm/src/data.py

Line 428 in 9183c11

"sample_ts": train_data.sample_ts,

I think "sample_ts": train_data.sample_ts should be "sample_ts": test_data.sample_ts. This bug does not affect the pre-training results; it only matters if one wants to run evaluations during the pre-training stage.
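To illustrate why the mismatch matters for evaluation, here is a minimal sketch. The `Split` class, the `labels` field, and both helper functions are hypothetical stand-ins, not the repository's actual code; only the `sample_ts` key and the `train_data` / `test_data` names come from the line quoted above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Split:
    """Hypothetical stand-in for one data split; not the repo's actual class."""
    labels: List[int]
    sample_ts: List[int]


def split_to_dict(split: Split) -> Dict[str, List[int]]:
    # Every field comes from the same object, so splits cannot be mixed up.
    return {"labels": split.labels, "sample_ts": split.sample_ts}


def is_aligned(d: Dict[str, List[int]]) -> bool:
    # All per-sample fields of one split must have the same length.
    return len({len(v) for v in d.values()}) == 1


train_data = Split(labels=[0, 1, 0], sample_ts=[10, 20, 30])
test_data = Split(labels=[1, 0], sample_ts=[40, 50])

# The typo: test labels paired with training timestamps.
buggy_test = {"labels": test_data.labels, "sample_ts": train_data.sample_ts}
# The fix: all test fields taken from test_data.
fixed_test = split_to_dict(test_data)
```

Building each dictionary from a single object, as `split_to_dict` does, is one way to make this kind of copy-paste slip harder to introduce.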

Hi @hwlza ,

We appreciate your interest in our work, and we are grateful for your observation regarding this minor typo. We have rectified this typo based on your suggestion.
We hope our code can be useful for your work. :)

Also, I think the following code snippets, which serialize the processed data, are missing one level of indentation:

es_dfm/src/data.py

Lines 413 to 415 in 8487ed4

    
    if params["data_cache_path"] != "None":
        with open(cache_path, "wb") as f:
            pickle.dump({"train": train_data, "test": test_data}, f)

and

es_dfm/src/data.py

Lines 340 to 343 in 8487ed4

    
    if params["data_cache_path"] != "None":
        with open(cache_path, "wb") as f:
            pickle.dump({"train": train_stream, "test": test_stream}, f)
    return train_stream, test_stream

In the current version, the cache file is rewritten on every run, even when the processed data has already been cached. I think this causes unnecessary overhead, especially when the raw dataset is large, although it has no other impact on the final result.
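A minimal sketch of the intended behavior, assuming the cache is loaded when the file exists and written only after fresh processing. The function `load_or_build` and its `build_fn` argument are hypothetical names for illustration; only `params["data_cache_path"]`, the "None" sentinel, `cache_path`, and the pickled dictionary layout come from the snippets above.

```python
import os
import pickle


def load_or_build(params, build_fn):
    """Return (train, test), using an optional pickle cache.

    build_fn is a hypothetical callable that does the expensive processing.
    """
    cache_path = params["data_cache_path"]
    use_cache = cache_path != "None"  # the string "None" disables caching

    # Cache hit: load and return early, so the file is never re-written.
    if use_cache and os.path.isfile(cache_path):
        with open(cache_path, "rb") as f:
            cached = pickle.load(f)
        return cached["train"], cached["test"]

    # Cache miss (or caching disabled): process the raw data once.
    train_data, test_data = build_fn()

    # Write the cache only on the miss path.
    if use_cache:
        with open(cache_path, "wb") as f:
            pickle.dump({"train": train_data, "test": test_data}, f)
    return train_data, test_data
```

With the early return on a cache hit, the dump cannot run after a successful load, which is the effect the extra level of indentation is meant to achieve.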

Hi, @hwlza

Fixed. Thank you! I think this bug was introduced in the open-source version.

Have a nice day. Thanks for your excellent work. 😄

A tiny bug in data processing