fagan2888 / D3M-Online-Retail-Dataset

Convert D3M raw dataset to D3M clean dataset with Featuretools

About

The Python script dfs_d3m.py takes a multi-table dataset as input and outputs a feature matrix and a D3M data schema.
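To make "feature matrix" concrete, here is a minimal sketch of the kind of aggregation that Featuretools automates: child rows are summarized per parent row, giving one feature row per parent. It is not the code in dfs_d3m.py. The toy tables, the build_feature_matrix helper, and the feature names are hypothetical and only loosely mimic Featuretools output.

```python
from collections import defaultdict

# Hypothetical toy tables; the real script reads the online retail data.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 20.0},
    {"order_id": 11, "customer_id": 1, "amount": 5.0},
    {"order_id": 12, "customer_id": 2, "amount": 7.5},
]


def build_feature_matrix(parents, children, key, value):
    """Aggregate child rows per parent into one feature row per parent."""
    grouped = defaultdict(list)
    for row in children:
        grouped[row[key]].append(row[value])

    matrix = []
    for parent in parents:
        values = grouped.get(parent[key], [])
        matrix.append({
            key: parent[key],
            "COUNT(%s)" % value: len(values),
            "SUM(%s)" % value: sum(values),
            # A parent with no child rows gets no mean.
            "MEAN(%s)" % value: sum(values) / len(values) if values else None,
        })
    return matrix


feature_matrix = build_feature_matrix(customers, orders, "customer_id", "amount")
```

Each entry of feature_matrix describes one customer, which is the shape a downstream model expects: one row per prediction target, one column per feature.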

How to use

  1. Install Python 2.7
  2. Install the Python requirements, Featuretools and scikit-learn:
   pip install scikit-learn
   pip install featuretools
  3. (Optional) Change LABELS_PATH in dfs_d3m.py to point to "data/purchase_sum_4_weeks_first_100.csv" to reduce run time (this creates a smaller dataset)
  4. Run:
   python dfs_d3m.py input OUTPUT_PATH

(OUTPUT_PATH is the directory where the clean dataset will be created)

The outputs are:

  • data/dataSchema.json
  • data/trainData.csv
  • data/trainTargets.csv
  • data/testData.csv
  • data/testTargets.csv
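
After a run, a quick check that all five files were written can look like the sketch below. It assumes the listed paths are relative to OUTPUT_PATH, which the text above suggests but does not state outright. The missing_outputs helper is hypothetical and not part of dfs_d3m.py.

```python
import os

# Output files listed above, assumed to be relative to OUTPUT_PATH.
EXPECTED_OUTPUTS = [
    "data/dataSchema.json",
    "data/trainData.csv",
    "data/trainTargets.csv",
    "data/testData.csv",
    "data/testTargets.csv",
]


def missing_outputs(output_path):
    """Return the expected output files that are absent under output_path."""
    return [
        rel for rel in EXPECTED_OUTPUTS
        if not os.path.isfile(os.path.join(output_path, *rel.split("/")))
    ]
```

An empty list from missing_outputs("OUTPUT_PATH") means every expected file is present.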

Coming soon

  1. This sample is currently hard-coded for the online retail dataset. In the future, it will accept an arbitrary dataset.
  2. Similarly, the code currently splits the data into train and test sets (because the online retail data comes as a single set). In the future, it will accept separate test data and transform that as well.
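
For readers unfamiliar with the split mentioned in item 2, the sketch below shows one generic way to divide a single dataset into train and test parts. How dfs_d3m.py actually splits the data, and in what proportion, is not stated here, so the split_train_test helper, the test_fraction default, and the seed are all assumptions made for illustration.

```python
import random


def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    rows = list(rows)
    # A seeded generator keeps the split stable across runs.
    random.Random(seed).shuffle(rows)
    n_test = int(round(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


train_rows, test_rows = split_train_test(range(100))
```

Every input row lands in exactly one of the two lists, so no row is used for both training and evaluation.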
