IABTechLab / fideslang

Open-source description language for privacy to declare data types and data behaviors in your tech stack in order to simplify data privacy globally. Supports GDPR, CCPA, LGPD and ISO 19944.

Home Page: https://ethyca.github.io/fideslang

Upstream DSR and connector related metadata from Fides

TheAndrewJackson opened this issue · comments

The fides project has extended the fideslang models to include extra metadata for processing DSRs, most notably the FidesopsMeta model. All of the changes made to the fideslang models in dataset.py should be upstreamed into the fideslang project, to standardize what DSR metadata should look like.

Acceptance Criteria

  • Upstream all of the DSR fields from fides into fideslang
  • Update the Taxonomy model to not strip out extra fields that are included in JSON payloads
  • Rename fidesops_meta to fides_meta

@ThomasLaPiana I have a couple of spec-related questions:

  • Should fidesctl_meta be deprecated and/or removed?
    • I'm not sure if this is used at all. If not, now would be a great time to delete it.
  • Does it make sense to rename fidesops_meta to dsr_meta instead of fides_meta? (@pattisdr I'm curious what you think of this idea as well)
    • I was wondering if this might be a little more self-documenting. From my understanding, fidesops_meta was used for DSRs, so we could just skip straight to that.

Hey @TheAndrewJackson, my vote would be for this to be as generic as possible. I wonder if dsr_meta is too specific, because the information contained there could also be used for data maps.

In other words, some of the information in fidesops_meta includes how one table is linked to another. We use that to build a graph when we're running DSRs, to know that we need to visit table A and table B before we visit table C, but those relationships also describe the existing system in a sense.
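To make the "visit table A and table B before table C" idea concrete, here is a minimal sketch using Python's standard-library graphlib. The table names and the dependencies mapping are hypothetical, and this is only an illustration of ordering by dependencies, not how Fides actually builds or traverses its graph.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical tables: each key maps to the set of tables that must be visited first.
dependencies: Dict[str, Set[str]] = {
    "table_c": {"table_a", "table_b"},
    "table_a": set(),
    "table_b": set(),
}


def traversal_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Return one valid visiting order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


order = traversal_order(dependencies)
```

The same mapping also reads as a plain description of how the tables relate, which is the point made above about the information being useful beyond DSRs.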

But I'm pretty new to the -ctl side of things, so @ThomasLaPiana can better weigh in on whether fidesops_meta information can be shared between -ops and -ctl features.

@pattisdr That makes sense! Thanks for the input. I think fides_meta makes more sense if it contains other info in addition to what's related to DSRs.

Yep, I think fides_meta is a good catch-all! With the projects merged, it is much harder (nay, impossible?) for us to accidentally overwrite each other's keys in that field, so I don't think we need to have separate _meta fields.

Also, @SteveDMurphy can confirm whether or not we use fidesctl_meta; I believe we do for some system scanning and maybe classification?

Hey @pattisdr, I've re-assigned this from @TheAndrewJackson. It looks like some progress has already been made, so when fidesplus is released we should do a quick handover :).

Oof, sorry I missed this! But yes, fidesctl_meta can currently be used by the scan system aws functionality. I believe there is also a generic meta field that may be general-purpose enough for our uses, but I'm open to other ideas as well 👍🏽

To simplify matters, we could keep our focus on combining just Dataset concepts here. The scan system aws functionality writes to Systems > fidesctl_meta but not Datasets > fidesctl_meta. And Fidesops never did anything related to Systems, so we don't necessarily need to touch Systems just yet.

So focusing on Datasets, we could combine fidesctl_meta from fideslang and fidesops_meta from fidesops into one field:

  • Rename fideslang > Datasets.fidesctl_meta to be Datasets.fides_meta.
    • Update fideslang > DatasetMetadata class to also have fields from fides > FidesopsDatasetMeta
  • Downstream in fides, rename ctl_datasets.fidesctl_meta column to fides_meta

I think we should leave the generic meta field alone for customer use, per this docstring:

"An optional object that provides additional information about the Dataset. You can structure the object however you like. It can be a simple set of key: value properties or a deeply nested hierarchy of objects. How you use the object is up to you: Fides ignores it."
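As a rough sketch of the proposal above, a Dataset could carry one free-form meta field plus a single structured fides_meta field. Standard-library dataclasses stand in for the real fideslang models here, and the field names resource_id and after are assumptions made for illustration, not a statement of what DatasetMetadata or FidesopsDatasetMeta actually contain.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class DatasetMetadata:
    """Sketch of one metadata class holding both -ctl and -ops style fields."""

    # Hypothetical field standing in for what the -ctl side stores.
    resource_id: Optional[str] = None
    # Hypothetical field standing in for what the -ops side stores.
    after: List[str] = field(default_factory=list)


@dataclass
class Dataset:
    """Simplified stand-in for the fideslang Dataset model."""

    fides_key: str
    # Free-form object left alone for customer use.
    meta: Optional[Dict[str, Any]] = None
    # Single structured field replacing fidesctl_meta and fidesops_meta.
    fides_meta: Optional[DatasetMetadata] = None


example = Dataset(
    fides_key="demo_dataset",
    meta={"team": "data"},
    fides_meta=DatasetMetadata(after=["other_dataset"]),
)
```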

Huge agree here!

For the fideslang change only, would it make sense to introduce fides_meta across System and Organization (and anywhere else we need it) as well? That way we can plan to deprecate both fidesctl_meta and fidesops_meta without needing updates in both repos again.

Looking at the code, maybe we can directly add meta and fides_meta to the FidesBase class? That way we don't have to worry about this in the future, and I don't think it hurts for every object to have it? (every top-level object at least)

Quick backwards-compat question: is it possible to make this field work whether it's called fidesops_meta or fides_meta? I'd like to stay compatible with existing datasets, especially all the SaaS connector ones.

hey @NevilleS that's the very thing I'm mulling over now! What I'm considering is a validator that takes in fidesops_meta if it exists and replaces fides_meta with that value, and perhaps adds a deprecation warning there.

We could also just move it over as fidesops_meta; that would be simpler.
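The validator idea could look roughly like the sketch below. It works on a plain dict with only the standard library, so the function name normalize_meta_key is hypothetical and the actual fideslang change may be structured differently (for example, as a validator on the model itself).

```python
import warnings
from typing import Any, Dict


def normalize_meta_key(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of payload with a legacy fidesops_meta key moved to fides_meta."""
    data = dict(payload)  # shallow copy so the caller's dict is not mutated
    if "fidesops_meta" in data:
        warnings.warn(
            "fidesops_meta is deprecated; use fides_meta instead",
            DeprecationWarning,
            stacklevel=2,
        )
        # As described above, the legacy value replaces fides_meta if both exist.
        data["fides_meta"] = data.pop("fidesops_meta")
    return data
```

With this approach, existing datasets that still use fidesops_meta (such as the SaaS connector ones) keep loading, while new ones can use fides_meta directly.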

I think scope is starting to creep here and the PRs are getting large, so I'd prefer to keep this specific issue constrained to datasets. Happy to ticket a follow-up to address this if that's alright.

  • My current work is renaming fidesctl_meta to be fides_meta on the fideslang Dataset model and also doing a data migration downstream in fides ctl_datasets to match.
  • Adding this same thing to Systems and Organizations would mean matching migrations downstream, and adding updates to the systems-related code that was writing to the fidesctl_meta field.
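For a sense of what the downstream column rename involves, here is a small self-contained sketch against an in-memory SQLite database. Only the table name ctl_datasets and the two column names come from this thread; the fides_key column, the sample row, and the use of SQLite are assumptions for illustration, and a real migration in fides would go through the project's own database and migration tooling. RENAME COLUMN requires SQLite 3.25 or newer.

```python
import sqlite3


def rename_meta_column(conn: sqlite3.Connection) -> None:
    """Rename the metadata column in place; existing row values are kept."""
    conn.execute(
        "ALTER TABLE ctl_datasets RENAME COLUMN fidesctl_meta TO fides_meta"
    )
    conn.commit()


# Hypothetical minimal schema and row, just enough to exercise the rename.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ctl_datasets (fides_key TEXT PRIMARY KEY, fidesctl_meta TEXT)"
)
conn.execute(
    "INSERT INTO ctl_datasets VALUES (?, ?)",
    ("demo_dataset", '{"example": true}'),
)

rename_meta_column(conn)

# Column names after the rename, and the value that was stored before it.
columns = [row[1] for row in conn.execute("PRAGMA table_info(ctl_datasets)")]
stored = conn.execute("SELECT fides_meta FROM ctl_datasets").fetchone()[0]
```

Each additional model (Systems, Organizations) would need its own rename like this plus updates to any code writing to the old column, which is why it was left for a follow-up.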

@pattisdr I can't argue with keeping a small scope 🙂 thanks for calling out the creep!

Agreed we should focus on datasets here

Great, reticketed here; we could do this last: https://github.com/ethyca/fideslang/issues/96