alibaba / pipcook

Machine learning platform for Web developers

Home Page: https://alibaba.github.io/pipcook/


Introducing Amazon Data Schema to simplify our workflow

yorkie opened this issue

In Amazon ML, the user workflow is simplified by defining the data structure up front (see https://docs.aws.amazon.com/machine-learning/latest/dg/creating-a-data-schema-for-amazon-ml.html).

Today I discussed with @FeelyChau how we could simplify user-defined data handling, and how the model output should be consumed afterwards. I just came across this document and am writing it down here for future reference.

Amazon ML defines something called a Data Schema, essentially a data specification, which it uses to describe the format of a data file. For example, take the following CSV data:

```
1,3,web developer,basic.4y,no,no,1,261,0
2,1,car repair,high.school,no,no,22,149,0
3,1,car mechanic,high.school,yes,no,65,226,1
4,2,software developer,basic.6y,no,no,1,151,0
```

It is described by the following schema:

```json
{
    "version": "1.0",
    "rowId": "customerId",
    "targetAttributeName": "willRespondToCampaign",
    "dataFormat": "CSV",
    "dataFileContainsHeader": false,
    "attributes": [
        {
            "attributeName": "customerId",
            "attributeType": "CATEGORICAL"
        },
        {
            "attributeName": "jobId",
            "attributeType": "CATEGORICAL"
        },
        {
            "attributeName": "jobDescription",
            "attributeType": "TEXT"
        },
        {
            "attributeName": "education",
            "attributeType": "CATEGORICAL"
        },
        {
            "attributeName": "housing",
            "attributeType": "CATEGORICAL"
        },
        {
            "attributeName": "loan",
            "attributeType": "CATEGORICAL"
        },
        {
            "attributeName": "campaign",
            "attributeType": "NUMERIC"
        },
        {
            "attributeName": "duration",
            "attributeType": "NUMERIC"
        },
        {
            "attributeName": "willRespondToCampaign",
            "attributeType": "BINARY"
        }
    ]
}
```
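
For reference, the same schema could be described with TypeScript types (Pipcook's own language). This is only a sketch; none of these type names exist in Pipcook today, and the fields simply mirror the Amazon ML example above:

```typescript
// Hypothetical schema types mirroring the Amazon ML example above.
type AttributeType = 'BINARY' | 'NUMERIC' | 'CATEGORICAL' | 'TEXT';

interface Attribute {
  attributeName: string;
  attributeType: AttributeType;
}

interface DataSchema {
  version: string;
  rowId: string;                 // column that uniquely identifies a row
  targetAttributeName: string;   // the label column, i.e. what we predict
  dataFormat: 'CSV';
  dataFileContainsHeader: boolean;
  attributes: Attribute[];
}
```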

Mapped to Pipcook, targetAttributeName is our label, i.e. the attribute we need to predict, and dataFormat identifies the format of the source data. The Data Schema divides attribute types into the categories BINARY / NUMERIC / CATEGORICAL / TEXT, so we only need to define the label conversion functions we support in the Model Script or Framework, for example:

  • For BINARY, no conversion is required
  • For CATEGORICAL, a label map is built by convention; the label map only needs to be consumed inside the Model Script
  • The same applies to NUMERIC and TEXT
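
To make these rules concrete, here is a rough sketch of such a conversion function in TypeScript. encodeLabel and the Map-based label map are hypothetical names, not an existing Pipcook or Model Script API:

```typescript
// Hypothetical label encoder, keyed by the schema's attribute type.
// Reuses the AttributeType union from the sketch above.
function encodeLabel(
  type: AttributeType,
  raw: string,
  labelMap: Map<string, number>,
): number {
  switch (type) {
    case 'BINARY':
      return Number(raw);        // already 0/1, no conversion required
    case 'NUMERIC':
      return parseFloat(raw);
    case 'CATEGORICAL':
      // Grow the label map on first sight; the map only needs to be
      // consumed inside the Model Script.
      if (!labelMap.has(raw)) {
        labelMap.set(raw, labelMap.size);
      }
      return labelMap.get(raw)!;
    case 'TEXT':
      // TEXT requires a tokenizer, which is out of scope here.
      throw new Error('TEXT labels are not handled in this sketch');
  }
}
```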

In general, by introducing the Data Schema in the Datacook Script, we attach a type to each field of every row of data; then, before the data enters the model, we run the type-to-Tensor conversion function that corresponds to that type. Likewise, given the type of targetAttributeName, we can always work out how to convert the model's prediction from a Tensor back to BINARY, NUMERIC, CATEGORICAL, or TEXT.
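
For the inverse direction, knowing the type of targetAttributeName is indeed enough to decode the model output. Again a hypothetical sketch, reusing AttributeType and the label map from above:

```typescript
// Hypothetical inverse conversion: model output -> typed label.
function decodePrediction(
  type: AttributeType,
  output: number[],              // e.g. probabilities or a regression value
  labelMap: Map<string, number>,
): string | number {
  switch (type) {
    case 'BINARY':
      return output[0] >= 0.5 ? 1 : 0;
    case 'NUMERIC':
      return output[0];
    case 'CATEGORICAL': {
      const argmax = output.indexOf(Math.max(...output));
      // Invert the label map that was built while encoding.
      for (const [name, index] of labelMap) {
        if (index === argmax) return name;
      }
      return argmax;
    }
    case 'TEXT':
      throw new Error('TEXT decoding depends on the tokenizer');
  }
}
```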

The only remaining question is whether we need to worry about the outputs of the DataSource Script and the Dataflow Script being inconsistent. Strictly speaking, a Dataflow Script should not modify the label; if it does, then feeding the model directly from the Dataflow output is, I think, acceptable. So the final data link is:

Data > Dataflow (Single) > Type→Tensor > Model > Tensor→Type (label)
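
Wired together, the link could read as below. datasource, dataflow, toTensor, and model are placeholders standing in for the corresponding scripts, not real Pipcook calls; DataSchema and decodePrediction come from the sketches above:

```typescript
// Hypothetical end-to-end wiring of the proposed data link.
declare const schema: DataSchema;                  // produced alongside the data
declare function datasource(): Iterable<string[]>; // DataSource Script
declare function dataflow(rows: Iterable<string[]>): Iterable<string[]>; // Dataflow Script
declare function toTensor(row: string[], schema: DataSchema): number[];
declare const model: { predict(x: number[]): number[] };

const labelMap = new Map<string, number>();        // built while encoding labels
const labelType = schema.attributes.find(
  (a) => a.attributeName === schema.targetAttributeName,
)!.attributeType;

for (const row of dataflow(datasource())) {        // Data > Dataflow (Single)
  const x = toTensor(row, schema);                 // Type → Tensor
  const y = model.predict(x);                      // Model
  const label = decodePrediction(labelType, y, labelMap); // Tensor → Type (label)
  console.log(label);
}
```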

What is your opinion on this? Comments are welcome.