justin1121/cape-python

Cape Privacy offers data scientists and data engineers a policy-based interface for applying privacy-enhancing techniques across several popular libraries and frameworks to protect sensitive data throughout the data science life cycle.

Cape Python brings Cape's policy language to Pandas and Apache Spark, enabling you to collaborate on privacy-preserving policies at a non-technical level. The supported techniques include tokenization with linkability, perturbation, and rounding. You can experiment with these techniques programmatically in Python or declaratively in human-readable policy files. Stay tuned for more privacy-enhancing techniques in the future!

See below for instructions on how to get started or visit the documentation.

Getting Started

Cape Python is available via PyPI.

pip install cape-privacy

Support for Apache Spark is optional. If you plan on using the library together with Apache Spark, we suggest the following instead:

pip install "cape-privacy[spark]"

We recommend installing Cape Python in a virtual environment, such as venv.

Installing from source

It is also possible to install the library from source.

git clone https://github.com/capeprivacy/cape-python.git
cd cape-python
make bootstrap

This will also install all dependencies, including Apache Spark. Make sure you have make installed before running the above.

Example

(This example is an abridged version of the project's tutorial.)

To discover what the different transformations do and how you might use them, it is best to explore them through the transformations API:

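import pandas as pd
from cape_privacy.pandas import dtypes
from cape_privacy.pandas.transformations import NumericPerturbation, Tokenizer

df = pd.DataFrame({
    "name": ["alice", "bob"],
    "age": [34, 55],
    "birthdate": [pd.Timestamp(1985, 2, 23), pd.Timestamp(1963, 5, 10)],
})

# Tokenize names with a secret key and add bounded integer noise to ages.
tokenize = Tokenizer(max_token_len=10, key=b"my secret")
perturb_numeric = NumericPerturbation(dtype=dtypes.Integer, min=-10, max=10)

df["name"] = tokenize(df["name"])
df["age"] = perturb_numeric(df["age"])

print(df.head())
# Example output (perturbed ages vary between runs because no seed is set):
# >>
#          name  age  birthdate
# 0  f42c2f1964   34 1985-02-23
# 1  2e586494b2   63 1963-05-10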
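
To build intuition for what the two transformations above do, here is a minimal standard-library sketch. It is not Cape's implementation: the helper names `tokenize_value` and `perturb_value` are hypothetical, the use of HMAC-SHA256 for the keyed hash is an assumption, and so is drawing the noise uniformly from the given range.

```python
import hashlib
import hmac
import random


def tokenize_value(value: str, key: bytes, max_token_len: int = 10) -> str:
    """Map a value to a keyed, truncated hex digest (illustrative only)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:max_token_len]


def perturb_value(value: int, low: int, high: int, rng: random.Random) -> int:
    """Add integer noise drawn uniformly from [low, high] (assumed model)."""
    return value + rng.randint(low, high)


names = ["alice", "bob", "alice"]
ages = [34, 55, 34]

key = b"my secret"
rng = random.Random(4984)  # fixed seed so the noise is reproducible

tokens = [tokenize_value(n, key) for n in names]
noisy_ages = [perturb_value(a, -10, 10, rng) for a in ages]

# Linkability: equal inputs under the same key give equal tokens.
print(tokens[0] == tokens[2])  # True
# Each perturbed age stays within 10 of the original.
print(all(abs(n - a) <= 10 for n, a in zip(noisy_ages, ages)))  # True
```

The point of the keyed hash is that records can still be joined on the token while the raw name stays hidden from anyone without the key. For real data, use the library's own `Tokenizer` and `NumericPerturbation` as in the example above.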

These steps can be saved in policy files so you can share them and collaborate with your team:

# my-policy.yaml
label: my-policy
version: 1
rules:
  - match:
      name: age
    actions:
      - transform:
          type: numeric-perturbation
          dtype: Integer
          min: -10
          max: 10
          seed: 4984
  - match:
      name: name
    actions:
      - transform:
          type: tokenizer
          max_token_len: 10
          key: my secret

You can then load this policy and apply it to your data frame:

import cape_privacy as cape

# df can be a Pandas or Spark data frame
policy = cape.parse_policy("my-policy.yaml")
df = cape.apply_policy(policy, df)

print(df.head())
# >>
#          name  age  birthdate
# 0  f42c2f1964   34 1985-02-23
# 1  2e586494b2   63 1963-05-10
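
Conceptually, applying a policy means walking its rules, finding the column each rule matches, and running the listed transformations on it. The sketch below shows that idea with the standard library only. It makes no claim about how `cape.apply_policy` works internally: `apply_toy_policy`, `TRANSFORMS`, and the dict-of-lists table are hypothetical stand-ins, and the policy is written as a Python dict shaped like my-policy.yaml rather than parsed from the file.

```python
import random

# Hypothetical in-memory policy mirroring the shape of my-policy.yaml.
policy = {
    "label": "my-policy",
    "rules": [
        {
            "match": {"name": "age"},
            "actions": [
                {"transform": {"type": "numeric-perturbation",
                               "min": -10, "max": 10, "seed": 4984}}
            ],
        }
    ],
}


def numeric_perturbation(values, params):
    """Add seeded uniform integer noise to every value (illustrative)."""
    rng = random.Random(params["seed"])
    return [v + rng.randint(params["min"], params["max"]) for v in values]


# Registry from transform type to implementation (names are hypothetical).
TRANSFORMS = {"numeric-perturbation": numeric_perturbation}


def apply_toy_policy(policy, table):
    """Return a copy of table with each rule applied to its matching column."""
    result = {name: list(values) for name, values in table.items()}
    for rule in policy["rules"]:
        column = rule["match"]["name"]
        if column not in result:
            continue  # rules for absent columns are skipped in this sketch
        for action in rule["actions"]:
            params = action["transform"]
            result[column] = TRANSFORMS[params["type"]](result[column], params)
    return result


table = {"name": ["alice", "bob"], "age": [34, 55]}
protected = apply_toy_policy(policy, table)

print(protected["name"])  # unchanged: no rule in this toy policy matches "name"
print(table["age"])       # the input table is not mutated
```

Because the seed lives in the policy, applying the same policy to the same data twice gives the same perturbed values in this sketch, which is what makes a shared policy file reproducible across a team.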

More examples and usage notes are available in the project repository and in our documentation.

Contributing and Bug Reports

Please file feature requests and bug reports as GitHub issues.

License

Licensed under the Apache License, Version 2.0 (see LICENSE or http://www.apache.org/licenses/LICENSE-2.0). Copyright as specified in NOTICE.

About Cape

Cape Privacy helps teams share data and make decisions for safer and more powerful data science. Learn more at capeprivacy.com.
