[FEATURE] Automatic Low band batching

ZachHaber opened this issue 2 years ago · comments

Is your feature request related to a problem? Please describe.
When working with the arx-deidentifier GUI, it's possible to manually check the mode (the most frequent value) of a quasi-identifier and then add a column to the Data Transformation that batches a subset of the values together.

E.g., a Race identifier:

0	1	2	3
null	*	*	*
Asian	*	*	*
White	White	White	*
Black	Black	*	*

This hierarchy has to be built by hand around the most frequent values, to try and keep utility while suppressing the rarer ones. The "alternatives" section of this request shows the full breadth of what I'm asking for.

Describe the solution you'd like

Ideally, I'd like an alternative to the Clustering and Microaggregation setup that automatically determines the "low band" alternatives, where each enumeration of these values is considered as a possible set to merge together. Unlike the current Clustering and Microaggregation, it would attempt to leave terms alone where possible.

Describe alternatives you've considered

The main alternative I can see is to allow non-hierarchical data transformations, so that these transformations can be defined manually. As the full enumeration of what I'm asking for shows, that is a lot to do by hand, even if it can be done in a single set of "transformations". This is only for 4 values and it's already at 16 sets of possibilities.

0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
null	null	null	null	*	null	null	null	*	*	*	*	*	null	*	*
Asian	Asian	Asian	*	Asian	Asian	*	*	Asian	Asian	*	*	*	*	Asian	*
White	White	*	White	White	*	White	*	White	*	White	White	*	*	*	*
Black	*	Black	Black	Black	*	*	Black	*	Black	Black	*	Black	*	*	*
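
The count of 16 follows from each of the 4 values being either kept or suppressed independently (2^4). A minimal Python sketch of that enumeration is below. It is only an illustration of the combinatorics, not anything ARX does; the function name `suppression_patterns` is hypothetical, and the order of the patterns may differ from the table above.

```python
from itertools import combinations

SUPPRESSED = "*"


def suppression_patterns(values):
    """Return every way to keep or suppress each value (2**n patterns)."""
    patterns = []
    # Group the patterns by how many values are suppressed: 0, 1, ..., n.
    for size in range(len(values) + 1):
        for suppressed in combinations(values, size):
            patterns.append(
                [SUPPRESSED if v in suppressed else v for v in values]
            )
    return patterns


patterns = suppression_patterns(["null", "Asian", "White", "Black"])
print(len(patterns))  # 16
```

Every additional distinct value doubles the number of patterns, which is why defining them by hand stops being practical very quickly.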

Another alternative is the "Ordered Set hierarchy": if we can define the sets not by name, but by frequency within the data, then we can create the more likely parts of these enumerations more quickly. E.g., a set of all but the top 3 data terms together, a set of all but the top 2 data terms together, followed by a set of only the top 1 data term, followed by all values suppressed.
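
The frequency-ordered idea can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the ARX implementation: the function name `frequency_hierarchy` and the sample counts are made up, and the level that would merge only a single value is skipped so that the output has the same shape as the Race table earlier in this request.

```python
from collections import Counter

SUPPRESSED = "*"


def frequency_hierarchy(column):
    """Map each distinct value to its generalization at every level."""
    # Rank distinct values from most to least frequent.
    ranked = [value for value, _ in Counter(column).most_common()]
    n = len(ranked)
    # Level 0 keeps every value. Later levels keep n-2, n-3, ..., 0 of the
    # most frequent values; keeping n-1 is skipped because it would put
    # only one value into the merged group.
    keep_counts = [n] + list(range(n - 2, -1, -1))
    return {
        value: [value if rank < keep else SUPPRESSED for keep in keep_counts]
        for rank, value in enumerate(ranked)
    }


# Made-up sample column: White is most frequent, null is rarest.
column = ["White"] * 5 + ["Black"] * 3 + ["Asian"] * 2 + ["null"]
hierarchy = frequency_hierarchy(column)
for value, levels in hierarchy.items():
    print(value, *levels, sep="\t")
```

With these sample counts, White survives until the last level and Black until the second-to-last, while Asian and null are merged from level 1 onward.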

Fabian Prasser · Answer 1 · Wed Aug 24 2022 04:26:31 GMT+0800 (China Standard Time)

Thanks a lot for suggesting this. This is a great idea and will be implemented.

Fabian Prasser · Answer 2 · Sun Aug 28 2022 21:34:38 GMT+0800 (China Standard Time)

This has been implemented in the following branch:

https://github.com/arx-deidentifier/arx/tree/feature-frequency-hierarchy

See PR here:

#403

Are you able to build from source and give the branch a try and see whether the new "priority-based" hierarchy builder / wizard works for you?

Zachary Haber · Answer 3 · Sun Aug 28 2022 22:26:38 GMT+0800 (China Standard Time)

I don't have a Java development environment set up currently. I can try to get one set up and take a look at the branch on my next work day.

Fabian Prasser · Answer 4 · Fri Sep 16 2022 05:09:45 GMT+0800 (China Standard Time)

Implemented.