datacontract / datacontract-specification

The Data Contract Specification Repository

Home Page: https://datacontract.com/

Data Classification - extended

pixie79 opened this issue · comments

While the current specification includes PII (Personally Identifiable Information) and classification designators, we should extend the design to cover more detailed actions and attributes for PII handling and compliance. Here are some suggestions for enhancing the schema:

PII Handling Attributes:

  • PII_Method: This attribute could be either a free-form string or an enumeration with values such as MASKED, ENCRYPTED, etc. It specifies the method used to protect the PII.
  • PII_Description: This is a descriptive field that details how the PII is handled. For example, if the PII is masked, it might specify that a name is replaced with five asterisks.
  • PII_Linked_Entity: This field references a masked or anonymized version of the PII field that should be used for analytics or other purposes by default.

Right to Be Forgotten (RTBF) Attributes:

  • RTBF_Eligible: A boolean attribute indicating whether this field qualifies under the right to be forgotten rules.
  • RTBF_Method: An enumeration describing the method used for RTBF, such as DELETE, ONE_WAY_HASH, ENCRYPT_ONE_TIME_KEY, or PATTERN_MASK.
  • RTBF_Description: A free-form string providing additional details on the RTBF method or specific rules. For example, it may state that the right to be forgotten applies only five years after the last customer activity.
{
  "fields": [
    {
      "Name": "field_name",
      "Type": "data_type",
      "PII": {
        "IsPii": true,
        "Method": "MASKED",
        "Description": "For a name, replace with 5 asterisks",
        "AnaliticsLinkedEntity": "Linked version of the entity in this record that is used for analytics premasked",
        "RTBF": {
          "Eligible": true,
          "Method": "DELETE",
          "Lifetime": 300,
          "Description": "RTBF applies 300 days after last customer activity"
        }
      },
      "Classification": "classification_string"
    }
  ]
}

Schema Explanation

  • fields: An array of field objects.
    • Name: The name of the field (string).
    • Type: The data type of the field (string).
    • PII: An object containing attributes related to Personally Identifiable Information (PII):
      • IsPii: Boolean indicating if the field contains PII.
      • Method: Enum specifying the PII protection method (e.g., MASKED, ENCRYPTED).
      • Description: A string providing details on how the PII is handled.
      • AnaliticsLinkedEntity: A string referencing the pre-masked version of the entity used for analytics.
      • RTBF: An object containing attributes related to the Right to Be Forgotten (RTBF):
        • Eligible: Boolean indicating if the field qualifies for RTBF.
        • Method: Enum specifying the RTBF method (e.g., DELETE, ONE_WAY_HASH, ENCRYPT_ONE_TIME_KEY, PATTERN_MASK).
        • Lifetime: An integer specifying the duration (in days) after which RTBF applies.
        • Description: A string providing additional details on the RTBF process or rules.
    • Classification: A string or enum indicating the data classification.
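
To make the nesting concrete, here is a minimal Python sketch of how a consumer might model and parse one field object from the JSON above. The class names, the `parse_field` helper, and the choice of which keys are optional are assumptions for illustration only; they are not part of the specification. The key `AnaliticsLinkedEntity` is spelled as in the proposal.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PiiMethod(Enum):
    # Only the two values named in the proposal; the list is open-ended ("etc.").
    MASKED = "MASKED"
    ENCRYPTED = "ENCRYPTED"


class RtbfMethod(Enum):
    DELETE = "DELETE"
    ONE_WAY_HASH = "ONE_WAY_HASH"
    ENCRYPT_ONE_TIME_KEY = "ENCRYPT_ONE_TIME_KEY"
    PATTERN_MASK = "PATTERN_MASK"


@dataclass
class Rtbf:
    eligible: bool
    method: RtbfMethod
    lifetime_days: int
    description: str = ""


@dataclass
class Pii:
    is_pii: bool
    method: Optional[PiiMethod] = None
    description: str = ""
    analytics_linked_entity: str = ""
    rtbf: Optional[Rtbf] = None


@dataclass
class Field:
    name: str
    type: str
    pii: Optional[Pii] = None
    classification: str = ""


def parse_field(raw: dict) -> Field:
    """Build a Field from one entry of the "fields" array.

    Raises ValueError if Method is not one of the enum values above.
    """
    pii = None
    pii_raw = raw.get("PII")
    if pii_raw is not None:
        rtbf = None
        rtbf_raw = pii_raw.get("RTBF")
        if rtbf_raw is not None:
            rtbf = Rtbf(
                eligible=rtbf_raw["Eligible"],
                method=RtbfMethod(rtbf_raw["Method"]),
                lifetime_days=rtbf_raw["Lifetime"],
                description=rtbf_raw.get("Description", ""),
            )
        method = pii_raw.get("Method")
        pii = Pii(
            is_pii=pii_raw["IsPii"],
            method=PiiMethod(method) if method is not None else None,
            description=pii_raw.get("Description", ""),
            analytics_linked_entity=pii_raw.get("AnaliticsLinkedEntity", ""),
            rtbf=rtbf,
        )
    return Field(
        name=raw["Name"],
        type=raw["Type"],
        pii=pii,
        classification=raw.get("Classification", ""),
    )
```

Using enums rather than free-form strings means a misspelled method is rejected at parse time, which is the main argument for the enumeration option mentioned earlier.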

This structure captures, for each field, how PII is protected and when and how it must be erased, which gives tooling a basis for managing PII and complying with data protection regulations.

This should probably apply mostly at the entity level. It may also be worth allowing a whole dataset to be defined as PII, but I believe the entity level matters most, especially for the Right to Be Forgotten: some information can be forgotten straight away, while other information needs to be kept longer.

E.g., if you keep the last 5 addresses of a customer, then once they stop being a customer, even if you need to retain records for legal reasons, you probably have no grounds to keep more than the current address.
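
As a rough illustration of why per-entity lifetimes matter, the following Python sketch computes which fields are due for erasure on a given day. The field names, the lifetimes in `FIELD_LIFETIMES`, and both helper functions are hypothetical; the only thing taken from the proposal is the idea of a `Lifetime` in days counted from the last customer activity.

```python
from datetime import date, timedelta

# Hypothetical per-field RTBF lifetimes, in days after last customer activity.
FIELD_LIFETIMES = {
    "previous_addresses": 0,    # no grounds to keep: forget straight away
    "current_address": 300,     # retained longer, e.g. for legal reasons
}


def rtbf_due_date(last_activity: date, lifetime_days: int) -> date:
    """Return the first day on which RTBF applies to a field."""
    return last_activity + timedelta(days=lifetime_days)


def fields_due(last_activity: date, today: date, lifetimes: dict[str, int]) -> list[str]:
    """Return the names of fields whose RTBF lifetime has elapsed, sorted by name."""
    return sorted(
        name
        for name, days in lifetimes.items()
        if today >= rtbf_due_date(last_activity, days)
    )
```

With a single dataset-level setting, both fields would have to share one lifetime, so either the old addresses are kept too long or the current address is erased too early.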

Thanks, @pixie79, for this contribution.

I agree that PII and RTBF are becoming more and more important, and that data contracts should be the source of truth for configuring these attributes for individual fields or models.

They also play nicely with some Policy as Code engines.

At the moment, however, I do not feel confident enough to define a generic set of fields that fits these engines.
So, for now, I'd propose to go with config fields.

E.g.:

fields:
  customer_email_address:
    description: The email address, as entered by the customer. The email address was not verified.
    type: text
    format: email
    pii: true
    classification: sensitive
    config:
      pii:
        method: MASKED
        description: For a name, replace with 5 asterisks
        analiticsLinkedEntity: Linked version of the entity in this record that is used for analytics premasked
        rightToBeForgottonPolicy:
          eligible: true
          method: DELETE
          lifetime: 300
          description: RTBF applies 300 days after last customer activity
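
A consumer could then read these config values defensively, since nothing under config is guaranteed to be present. The sketch below assumes the YAML has already been loaded into a plain Python dict; the `rtbf_policy` helper is hypothetical, and the key names (including the spelling `rightToBeForgottonPolicy`) are copied from the example above.

```python
from typing import Optional

# The field from the example above, as it might look after loading the YAML.
customer_email_address = {
    "type": "text",
    "format": "email",
    "pii": True,
    "classification": "sensitive",
    "config": {
        "pii": {
            "method": "MASKED",
            "rightToBeForgottonPolicy": {
                "eligible": True,
                "method": "DELETE",
                "lifetime": 300,
            },
        }
    },
}


def rtbf_policy(field: dict) -> Optional[dict]:
    """Return the RTBF policy of a field, or None if it has none.

    Fields not flagged as PII are treated as having no policy,
    even if stray config values are present.
    """
    if not field.get("pii", False):
        return None
    pii_config = field.get("config", {}).get("pii", {})
    policy = pii_config.get("rightToBeForgottonPolicy")
    if policy is None or not policy.get("eligible", False):
        return None
    return policy
```

Because config is free-form, the reader has to tolerate missing keys at every level; that is the trade-off against a fixed field set in the specification itself.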

I think that makes sense and meets what I was looking for from it.