Support fully-reproducible deidentification

Question

annawoodard opened this issue 2 years ago

When data on a given cohort is accumulated over long periods, users may wish to run dicognito in multiple passes in order to perform preliminary analyses on the partial dataset. It would be convenient to be able to checkpoint the Anonymizer state so that patients seen in previous dicognito runs over the same cohort would have matching anonymized IDs.

Two options occur to me:

1. Use the anonymization map proposed in #124 as a simple checkpoint. I haven't looked at the code yet, so I'm not sure exactly what drawbacks this would have. I think there are some guarantees about the order of dates that might be broken in this case.
2. Serialize everything in the Anonymizer and save it to a pickle file. I think this would make starting from a checkpoint 'equivalent' to running in a single pass. It would have the disadvantage of adding another file with sensitive data to manage.
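
To make the pickle option concrete, here is a minimal sketch. The `IdMapper` class is a hypothetical stand-in for the anonymizer state, not dicognito's actual `Anonymizer`, and the pseudonym format is invented for illustration. It saves both the ID map and the random generator state, so resuming from the checkpoint yields the same IDs as a single pass would.

```python
import pickle
import random


class IdMapper:
    """Hypothetical stand-in for anonymizer state; not dicognito's Anonymizer."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._mapping = {}  # original patient ID -> pseudonym

    def anonymize_id(self, original_id):
        # Reuse the pseudonym if this patient was seen in an earlier pass.
        if original_id not in self._mapping:
            self._mapping[original_id] = "ANON%08d" % self._rng.randrange(10**8)
        return self._mapping[original_id]

    def to_checkpoint(self):
        # Save the map and the RNG state, using only built-in types.
        state = {"mapping": self._mapping, "rng_state": self._rng.getstate()}
        return pickle.dumps(state)

    @classmethod
    def from_checkpoint(cls, blob):
        # Only unpickle checkpoints you wrote yourself; pickle is not safe
        # for untrusted input.
        state = pickle.loads(blob)
        mapper = cls()
        mapper._mapping = state["mapping"]
        mapper._rng.setstate(state["rng_state"])
        return mapper
```

The bytes returned by `to_checkpoint()` would be written to a file between runs. That file contains the original patient IDs, so it needs the same protection as the source data.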

Blair Conrad · Answer 1 · Fri May 06 2022 04:45:53 GMT+0800 (China Standard Time)

Hi again, ma'am. Let me try to summarize to make sure I understand:

You'd like there to be a way to ensure that if you run dicognito a second time on the same inputs, you'd get the same results.

If that's correct, then at least within a single released version of dicognito, the following flag should accomplish it:

--seed SEED           The seed to use when generating random attribute values. Primarily intended to make testing easier. Best anonymization practice is to omit this value and let dicognito generate its own random seed.

As noted, it was initially conceived of as a convenience for testing. In my line of work, I generally wouldn't need the feature as you've described it. If this turns out to be sufficient for your needs, we could easily update the help message and optionally alias the flag to give it a more intention-revealing name, as "seed" is kind of jargony.
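
For readers wondering why a fixed seed gives matching IDs across runs, the general idea can be sketched as follows. This is an illustration only, not a description of dicognito's internals: the `pseudonym` helper, the alphabet, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def pseudonym(seed, original_value, length=12):
    """Map (seed, original value) to a fixed pseudonym; hypothetical helper."""
    # Hash the seed together with the original value, so the result depends
    # only on those two inputs and not on when or in what order it is computed.
    message = (seed + "\x00" + original_value).encode("utf-8")
    number = int.from_bytes(hashlib.sha256(message).digest(), "big")
    chars = []
    for _ in range(length):
        number, index = divmod(number, len(ALPHABET))
        chars.append(ALPHABET[index])
    return "".join(chars)
```

With this kind of derivation, the same seed and the same original value always produce the same replacement, which is also why the seed itself has to be kept secret.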

Blair Conrad · Answer 2 · Thu May 12 2022 23:11:59 GMT+0800 (China Standard Time)

@annawoodard, I was thinking about your workflow last evening, in particular this issue, and I realize that I'd erred. Setting --seed to a constant value does not guarantee completely reproducible results. I think it preserves everything except for UIDs (that is, anything with a VR of UI).

Consider the problem:

▶ dicognito --seed BLAH --output-directory out/a .\boring\000000CA.DCM
▶ dicognito --seed BLAH --output-directory out/b .\boring\000000CA.DCM
▶ summarize_dicom.py --with SOPInstanceUID out
PatientID     PatientName       AccessionNumber  SOPInstanceUID
------------  ----------------  ---------------  ----------------------------------------------------------------
11WMEJAQ2273  HILL^MAJOR^JAMAR  S4H12VFOT9VD     2.20220512121848929191.10000001.18675574395549631329024754104656
11WMEJAQ2273  HILL^MAJOR^JAMAR  S4H12VFOT9VD     2.20220512121856966994.10000001.16229549704517500334316552710522
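
The same comparison can be done programmatically. The sketch below uses a hypothetical `differing_fields` helper and takes the two rows above as plain dictionaries; how the values are read out of the DICOM files is left out.

```python
def differing_fields(run_a, run_b):
    """Return the attribute names whose values differ between two runs."""
    names = run_a.keys() | run_b.keys()
    return sorted(name for name in names if run_a.get(name) != run_b.get(name))


# Values copied from the two summary rows above.
run_a = {
    "PatientID": "11WMEJAQ2273",
    "PatientName": "HILL^MAJOR^JAMAR",
    "AccessionNumber": "S4H12VFOT9VD",
    "SOPInstanceUID": "2.20220512121848929191.10000001.18675574395549631329024754104656",
}
run_b = dict(
    run_a,
    SOPInstanceUID="2.20220512121856966994.10000001.16229549704517500334316552710522",
)

print(differing_fields(run_a, run_b))  # ['SOPInstanceUID']
```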

Now, depending on the exact details of your workflow, this may be acceptable. You said:

"I run a pipeline every few months to pull down new exams, preprocess them, and add them to the training set."

If the deidentification is run only on the new exams, you likely won't have a problem, since the manner in which new UIDs are generated shouldn't have an impact across studies. If you were to run the pipeline over all exams (old and new), the old exams would be deidentified with different UIDs than the first time through, which may cause issues.

I've created #127, which demonstrates that we do not have consistent UI elements across time even when using the same seed value.

Anna Woodard · Answer 3 · Tue May 17 2022 04:59:03 GMT+0800 (China Standard Time)

Thanks for the explanations! Fixing the seeds will work perfectly for my workflow, so I agree that we can resolve this issue with a tweak to the help message to clarify. I don't think there's a need to alias the argument; I think seed is pretty explanatory.

Blair Conrad · Answer 4 · Tue May 17 2022 09:39:41 GMT+0800 (China Standard Time)

Cool. I'd worked up a change to the UI-handling to make it completely reproducible as well (we lose "sortable" UIs, but I'm not sure anyone cares, to be honest). I'll likely continue the pull request I've attached to this issue, and may make an additional one to expand the documentation for --seed.
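
One way reproducible UIDs could work, sketched under stated assumptions: derive the replacement from the seed and the original UID alone, with no timestamp or counter. This is not the implementation in the pull request. The `reproducible_uid` helper is hypothetical, and it borrows the `2.25.` prefix that DICOM permits for UUID-derived UIDs.

```python
import hashlib


def reproducible_uid(seed, original_uid):
    """Derive a stable replacement UID; hypothetical, not dicognito's code."""
    message = (seed + "\x00" + original_uid).encode("utf-8")
    digest = hashlib.sha256(message).digest()
    # Keep 128 bits so the decimal form stays well under the 64-character
    # limit for DICOM UIDs.
    value = int.from_bytes(digest[:16], "big")
    return "2.25." + str(value)
```

Because nothing in the result depends on when it was generated, UIDs produced this way no longer sort by creation time, which is the trade-off mentioned above.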