Support dataset system labels and tags

Question

cmenguy opened this issue a year ago · comments

Currently in the createDataSets method there is no way to pass labels and tags programmatically. You can do it most likely with the data parameter (haven't tried) but this is cumbersome as it requires passing the entire payload.

We would like to add extra parameters to this function to add system_labels: list[str] and tags: dict[str, list[str]] so it's completely transparent and easy to manipulate.

Possibly we can go further and abstract that fully for some use cases. For example, if I want to create a dataset that is profile-enabled, it would be nice to just call createDataSets(..., profile_enabled=True)

Similarly, having the ability to update system labels and tags could be added in datasets.py module.
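
To make the update part of the request concrete, here is a minimal sketch of what such a helper could do. The name merge_labels_and_tags and its parameters are hypothetical and not part of the existing module, and the systemLabels key is an assumption about the payload field name. The function only edits a dataset dictionary and does not call any API.

```python
from __future__ import annotations

from copy import deepcopy
from typing import Optional


def merge_labels_and_tags(
    dataset: dict,
    system_labels: Optional[list[str]] = None,
    tags: Optional[dict[str, list[str]]] = None,
) -> dict:
    """Return a copy of `dataset` with system labels and tags merged in.

    Hypothetical helper: it only edits the dictionary, it sends nothing.
    """
    updated = deepcopy(dataset)
    if system_labels:
        # "systemLabels" is an assumed field name; duplicates are skipped.
        existing = updated.setdefault("systemLabels", [])
        for label in system_labels:
            if label not in existing:
                existing.append(label)
    if tags:
        # Each tag key is replaced wholesale by the new list of values.
        updated.setdefault("tags", {}).update(deepcopy(tags))
    return updated


current = {"name": "my dataset", "tags": {"unifiedProfile": ["enabled:true"]}}
result = merge_labels_and_tags(
    current,
    system_labels=["user_no_write"],
    tags={"unifiedProfile": ["enabled:true", "isUpsert:true"]},
)
print(result)
```

Returning a copy rather than mutating the input keeps the original dataset description available for comparison before anything is sent.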

Julien · Answer 1 · Mon Feb 06 2023 16:15:00 GMT+0800 (China Standard Time)

I have improved the method as you wish. I think it makes sense to offer that possibility.
For system_labels, I could not find it in the documentation, so I am not sure how to implement it.
Find the new method description below:

def createDataSets(self, data: dict = None, name:str=None, schemaId:str=None, profileEnabled:bool=False,upsert:bool=False, tags:dict=None,**kwargs):
        """
        Create a new dataSets based either on preconfigured setup or by passing the full dictionary for creation.
        Arguments:
            data : REQUIRED : If you want to pass the dataset object directly (not require the name and schemaId then)
                more info: https://www.adobe.io/apis/experienceplatform/home/api-reference.html#/Datasets/postDataset
            name : REQUIRED : if you wish to create a dataset via autocompletion. Provide a name.
            schemaId : REQUIRED : The schema $id reference for creating your dataSet.
            profileEnabled : OPTIONAL : If the dataset is to be created with profile enabled.
            upsert : OPTIONAL : If the dataset is to be created with profile enabled and Upsert capability.
            tags : OPTIONAL : dictionary of attributes to add as tags.
        possible kwargs
            requestDataSource : Set to true if you want Catalog to create a dataSource on your behalf; otherwise, pass a dataSourceId in the body.
        """
        path = "/dataSets"
        params = {"requestDataSource": kwargs.get("requestDataSource", False)}
        if self.loggingEnabled:
            self.logger.debug("Starting createDataSets")
        # only use the raw payload when it is actually a dictionary
        if isinstance(data, dict):
            res = self.connector.postData(self.endpoint+path, params=params,
                             data=data)
        elif name is not None and schemaId is not None:
            data = {
                "name":name,
                "schemaRef": {
                    "id": schemaId,
                    "contentType": "application/vnd.adobe.xed+json;version=1"
                },
                "fileDescription": {
                    "persisted": True,
                    "containerFormat": "parquet",
                    "format": "parquet"
                }
            }
            if profileEnabled:
                data['tags'] = {
                    "unifiedIdentity": [
                        "enabled: true"
                    ],
                    "unifiedProfile": [
                        "enabled: true"
                    ]
                }
            if upsert:
                # setdefault avoids a KeyError when profileEnabled is False
                data.setdefault('tags', {})['unifiedProfile'] = ["enabled: true","isUpsert: true"]
            if tags is not None and isinstance(tags, dict):
                # merge the user-supplied tags, creating the 'tags' entry if needed
                data.setdefault('tags', {}).update(tags)
            res = self.connector.postData(self.endpoint+path, params=params,
                             data=data)
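        else:
            # without this branch, res would be undefined at the return statement
            raise ValueError("Either data, or both name and schemaId, must be provided")
        return res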

If no feedback is received, it will go live as is on 08.02.2023.
As you suggested, the data parameter can still be used for more complex setups.

Charles Menguy · Answer 2 · Mon Feb 06 2023 22:34:54 GMT+0800 (China Standard Time)

Thanks @pitchmuc ! A couple suggestions:

There are use cases that I've seen where you want to enable profile but not identity. As such, I would recommend not tying both tags together in the if profileEnabled block, and maybe just having a separate parameter for identityEnabled: bool
System labels are separate from tags; the example below shows what a system label on a dataset looks like. Values can be user_no_write to make a dataset read-only, and user_no_read to make the dataset hidden in the UI. Both labels can be set on the same dataset to make it both read-only and invisible.

{
	"name":"...",
	"description":"...",
	"schemaRef":{
		"id":"{{upsIngestionSchemaId}}",
		"contentType":"application/vnd.adobe.xed-full+json;version=1"
	},
	"fileDescription":{
		"persisted":true,
		"containerFormat":"parquet",
		"format":"parquet"
	},
	"aspect":"production",
	"tags": {
		"unifiedProfile": ["enabled:true","isUpsert:true"]
	},
	"systemLabels": ["user_no_write"]
}
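
Both suggestions could be combined along these lines. This is only a sketch: build_dataset_payload and its parameter names (including identity_enabled and system_labels) are hypothetical and not the library's actual signature, the schema id in the usage line is a placeholder, and the field names are taken from the example payload above. It also assumes that upsert implies a profile-enabled dataset.

```python
from __future__ import annotations

import json
from typing import Optional


def build_dataset_payload(
    name: str,
    schema_id: str,
    profile_enabled: bool = False,
    identity_enabled: bool = False,
    upsert: bool = False,
    system_labels: Optional[list[str]] = None,
    tags: Optional[dict[str, list[str]]] = None,
) -> dict:
    """Build a dataset creation payload (hypothetical helper, sends nothing)."""
    payload = {
        "name": name,
        "schemaRef": {
            "id": schema_id,
            "contentType": "application/vnd.adobe.xed-full+json;version=1",
        },
        "fileDescription": {
            "persisted": True,
            "containerFormat": "parquet",
            "format": "parquet",
        },
    }
    merged_tags: dict[str, list[str]] = {}
    # Assumption: upsert only makes sense on a profile-enabled dataset.
    if profile_enabled or upsert:
        merged_tags["unifiedProfile"] = ["enabled:true"]
        if upsert:
            merged_tags["unifiedProfile"].append("isUpsert:true")
    # Identity is controlled by its own flag, independent of profile.
    if identity_enabled:
        merged_tags["unifiedIdentity"] = ["enabled:true"]
    if tags:
        merged_tags.update(tags)
    if merged_tags:
        payload["tags"] = merged_tags
    # System labels live next to tags, not inside them.
    if system_labels:
        payload["systemLabels"] = list(system_labels)
    return payload


payload = build_dataset_payload(
    "ml scores",
    "<schema $id placeholder>",
    profile_enabled=True,
    system_labels=["user_no_write", "user_no_read"],
)
print(json.dumps(payload, indent=2))
```

With profile_enabled=True and identity_enabled left at its default, the payload carries only the unifiedProfile tag, and the two system labels mark the dataset as read-only and hidden.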

Julien · Answer 3 · Mon Feb 06 2023 23:50:55 GMT+0800 (China Standard Time)

I will add this possibility.
I am very surprised and curious to know which use cases enable profile but not identity (personal interest).
Can you share them?
Does it mean that a dataset could ingest data into a profile without identity reconciliation?
Does it mean that a dataset like this cannot create a UIS graph connection?

Charles Menguy · Answer 4 · Mon Feb 06 2023 23:57:10 GMT+0800 (China Standard Time)

@pitchmuc Yes for example we have a use case right now where we consume data coming from Profile exports. We then run some ML model on top of it, and at the end we want to ingest the resulting scores back to Profile. However for this use case we want to only ingest in Profile and not in Identity, because we are not re-stitching anything and just want to update profile data for existing profiles. Initially we were writing to both Identity and Profile but that caused issues in Identity where we ended up "resurrecting" some identities that had since been deleted.

Julien · Answer 5 · Tue Feb 07 2023 00:01:33 GMT+0800 (China Standard Time)

Thanks !
I am wondering whether that could solve some use cases where we do not want identities to be stitched together but just want to add attributes to them.
Just added it here: abac275

Julien · Answer 6 · Wed Feb 08 2023 20:25:27 GMT+0800 (China Standard Time)

Released in 0.2.6 (now available on PyPI)