crowds is a Python module that provides a suite of anonymization algorithms for transforming pandas dataframes so that they satisfy k-anonymity or differential privacy. This is a work in progress: so far, one algorithm (OLA) has been implemented. Get in touch if you would like to contribute.
crowds requires:
- Python (>= 3.6)
- pandas (>= 0.25.1)
The easiest way to install crowds is with pip:
pip install -U crowds
or with conda:
conda install crowds
This is an implementation of the Optimal Lattice Anonymization (OLA) algorithm described by El Emam et al. (2009) [1]. Given a dataframe, an information loss function, and a set of generalization strategies, it returns a k-anonymous version of the dataframe [2]. The result is obtained using the single-dimensional global recoding model: the same values are always mapped to the same generalizations in the new dataset, and the generalizations for each dimension do not overlap.
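As a quick illustration of the property the algorithm aims for, the following sketch checks whether a dataframe is k-anonymous with respect to a set of quasi-identifier columns. It uses only pandas and is not part of crowds; the helper name `is_k_anonymous` and the sample columns are assumptions made for this example.

```python
import pandas as pd

def is_k_anonymous(df, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    occurs in at least k rows (hypothetical helper, not part of crowds)."""
    if df.empty:
        return True
    # Size of each equivalence class, i.e. each group of rows that
    # share the same quasi-identifier values.
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool(group_sizes.min() >= k)

# Toy data that has already been generalized by hand.
df = pd.DataFrame({
    "age": ["30-39", "30-39", "40-49", "40-49"],
    "zip": ["021**", "021**", "946**", "946**"],
    "diagnosis": ["flu", "asthma", "flu", "diabetes"],
})

two_anonymous = is_k_anonymous(df, ["age", "zip"], k=2)
three_anonymous = is_k_anonymous(df, ["age", "zip"], k=3)
```

Each equivalence class in the toy data contains exactly two rows, so the check passes for k=2 and fails for k=3.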
To define a set of generalization rules:
from crowds.kanonymity.generalizations import GenRule
def first_gen(value):
    return 'value'
def second_gen(value):
    return 'value'
new_rule = GenRule([first_gen, second_gen])
ruleset = {
    'attr_name': new_rule,
}
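For a more concrete picture, here is a sketch of two generalization levels for a numeric age attribute, written as plain functions. The function names and the 10-year bin width are assumptions for illustration, and the sketch assumes that each function receives the original (ungeneralized) value. Functions like these could be passed to `GenRule` in the same way as `first_gen` and `second_gen` above.

```python
def age_to_decade(value):
    """Level 1: map an age to a 10-year range, e.g. 37 -> '30-39'."""
    low = (int(value) // 10) * 10
    return f"{low}-{low + 9}"

def age_to_any(value):
    """Level 2: suppress the value entirely."""
    return "*"

ages = [23, 37, 41]
level_1 = [age_to_decade(a) for a in ages]
level_2 = [age_to_any(a) for a in ages]
```

Each level is coarser than the one before it, which is what allows higher levels to merge more rows into the same group.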
For the algorithm to work correctly, the loss function must be monotonic, i.e. non-decreasing as the generalization level increases. Some information loss functions are provided in information_loss.py. It is also possible to define a custom loss function, which must have the same signature as the following example:
def loss_fn(node):
    return 0.0
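To make the monotonicity requirement concrete, the sketch below uses a stand-in for a lattice node: a tuple holding the generalization level chosen for each attribute. This representation, and the names `MAX_LEVELS` and `level_loss`, are assumptions for illustration only; the node object that crowds passes to the loss function may look different.

```python
# Hypothetical: the highest generalization level available per attribute.
MAX_LEVELS = (2, 3)

def level_loss(levels):
    """Average fraction of each hierarchy that has been used (0.0 to 1.0)."""
    fractions = [level / top for level, top in zip(levels, MAX_LEVELS)]
    return sum(fractions) / len(MAX_LEVELS)

# A path through the lattice in which each step generalizes further.
path = [(0, 0), (1, 0), (1, 1), (2, 1), (2, 3)]
losses = [level_loss(node) for node in path]

# Monotonic: the loss never decreases along the path.
is_monotonic = all(a <= b for a, b in zip(losses, losses[1:]))
```

A loss that could drop after further generalization would break this property, and the requirement stated above would no longer hold.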
Then, to anonymize:
from crowds.kanonymity import ola
anonymous_df = ola.anonymize(df, k=10, loss=loss_fn, generalizations=ruleset)
For more, check out this example, using the "Adult" dataset from the UCI Machine Learning Repository [3].
[1] El Emam, Khaled, et al. "A globally optimal k-anonymity method for the de-identification of health data." Journal of the American Medical Informatics Association 16.5 (2009): 670-682.
[2] Sweeney, Latanya. "k-anonymity: A model for protecting privacy." International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10.05 (2002): 557-570.
[3] Dua, D. and Graff, C. "UCI Machine Learning Repository." Irvine, CA: University of California, School of Information and Computer Science (2019).