arx-deidentifier / arx

ARX is a comprehensive open source data anonymization tool aiming to provide scalability and usability. It supports various anonymization techniques, methods for analyzing data quality and re-identification risks, and well-known privacy models such as k-anonymity, l-diversity, t-closeness and differential privacy.

Home Page: http://arx.deidentifier.org/

[FEATURE] Automatic Low band batching

ZachHaber opened this issue

Is your feature request related to a problem? Please describe.
When working with the ARX GUI, it's possible to manually check what the mode (the most frequent value) of a quasi-identifier is and then add a column to the Data Transformation that batches a subset of the remaining values together.

For example, for a Race identifier:

0      1      2      3
null   *      *      *
Asian  *      *      *
White  White  White  *
Black  Black  *      *

This hierarchy has to be built by hand around the most frequent values, to try to keep utility while suppressing the rest. The "alternatives" section of this request shows the full breadth of what I'm asking for.

Describe the solution you'd like

Ideally, I'd like an alternative to the Clustering and Microaggregation setup that automatically determines the "low band" groupings, so that every combination of these values is considered as a candidate for merging. Unlike the current Clustering and Microaggregation, it would try, where possible, to leave values untouched.

Describe alternatives you've considered

The main alternative I can see is to allow non-hierarchical data transformations so that these groupings can be defined manually. As the full enumeration below shows, that is a lot to do by hand, even if it can be done in a single set of "transformations": with only 4 values there are already 16 possibilities.

0      1      2      3      4      5      6      7      8      9      10     11     12     13     14     15
null   null   null   null   *      null   null   null   *      *      *      *      *      null   *      *
Asian  Asian  Asian  *      Asian  Asian  *      *      Asian  Asian  *      *      *      *      Asian  *
White  White  *      White  White  *      White  *      White  *      White  White  *      *      *      *
Black  *      Black  Black  Black  *      *      Black  *      Black  Black  *      Black  *      *      *
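
To make the size of that enumeration concrete, here is a minimal Python sketch (not ARX code) that generates every keep/suppress combination for a list of values. The function name `enumerate_suppression_patterns` and the `"*"` marker are illustrative assumptions, and the patterns come out in `itertools.product` order rather than the column order used in the table above.

```python
from itertools import product


def enumerate_suppression_patterns(values, suppressed="*"):
    """Return every keep/suppress combination for the given values.

    Hypothetical helper for illustration only. Each pattern maps a value
    either to itself (kept) or to the suppression marker, so n values
    yield 2**n patterns.
    """
    patterns = []
    for keep_flags in product([True, False], repeat=len(values)):
        pattern = {
            value: (value if keep else suppressed)
            for value, keep in zip(values, keep_flags)
        }
        patterns.append(pattern)
    return patterns


values = ["null", "Asian", "White", "Black"]
patterns = enumerate_suppression_patterns(values)
print(len(patterns))  # 2**4 combinations
```

Each extra distinct value doubles the number of patterns, which is why defining them by hand stops being practical very quickly.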

Another alternative is the "Ordered Set hierarchy": if the sets could be defined not by name but by frequency within the data, the more likely parts of these enumerations could be created more quickly. For example: a set of all but the top 3 values grouped together, then a set of all but the top 2 values grouped together, then only the top 1 value kept, and finally everything suppressed.
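
The frequency-ordered idea can be sketched roughly as follows. This is an assumption-laden illustration in Python, not the ARX implementation: the function name `frequency_hierarchy`, the `"*"` marker, and the sample counts are all made up, and the builder on the feature branch may behave differently. Values are ranked by how often they occur, and each successive level suppresses the least frequent value that is still visible.

```python
from collections import Counter


def frequency_hierarchy(column, suppressed="*"):
    """Build one generalization row per distinct value, ordered by frequency.

    Hypothetical helper for illustration only. With n distinct values the
    result has n + 1 levels: level 0 keeps everything, level k keeps only
    the n - k most frequent values, and level n suppresses everything.
    """
    counts = Counter(column)
    # most_common() sorts by descending count; ties keep first-seen order.
    ranked = [value for value, _ in counts.most_common()]
    n = len(ranked)
    hierarchy = {}
    for rank, value in enumerate(ranked):
        kept_levels = n - rank  # more frequent values survive longer
        hierarchy[value] = [value] * kept_levels + [suppressed] * (rank + 1)
    return hierarchy


# Made-up sample data: White is most frequent, null least frequent.
column = ["White"] * 5 + ["Black"] * 3 + ["Asian"] * 2 + ["null"]
hierarchy = frequency_hierarchy(column)
for value, levels in hierarchy.items():
    print(value, levels)
```

For four values this yields five levels, one more than the hand-built Race example in the issue, which suppresses the two least frequent values in the same step.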

Thanks a lot for suggesting this. This is a great idea and will be implemented.

This has been implemented in the following branch:

https://github.com/arx-deidentifier/arx/tree/feature-frequency-hierarchy

See PR here:

#403

Are you able to build from source and give the branch a try and see whether the new "priority-based" hierarchy builder / wizard works for you?

I don't have a Java development environment set up currently. I can try to set one up and take a look on my next work day.

Implemented.