blairconrad / dicognito

A library and command line tool for anonymizing DICOM files

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Explicitly exclude DICOM tags from anonymization

medihack opened this issue · comments

Is there a way to explicitly exclude DICOM tags from anonymization? We use dicognito to anonymize exams for a trial, which works great, but we must retain some tags like StudyDate, SeriesDate or AcquisitionDate. Currently, I reset them programmatically after the anoynomization, but I wonder if there is an option to leave those tags untouched by dicognito.

Hi, @medihack. Thanks for your interest!

Currently, not from the command line. Programmatically, you could probably tear the DateTimeAnonymizer out of the Anonymizer._element_handlers, although it feels a little gross and would also stop anonymizing patient birth dates and the like, so you might weaken anonymity more than you intend.

There's no reason why we couldn't augment the tool (when using Anonymizer directly or via command line) to do something like you need, though.

We'd have to come up with a reasonable interface. Specifying fields to omit by name or tag would probably be a reasonable first start, but I this makes me wonder whether people would eventually want to do so by VR, for example.

Perhaps a --skip-attribute, specifiable multiple times. Initially could skip via name or tag. Private tags would need a little more thinking, but maybe we could defer that anyhow.

so

dicognito --skip-attribute StudyDate …

dicognito --skip-attribute 00080020 …

dicognito --skip-attribute StudyDate --skip-attribute StudyTime …

I'm open to suggestions.

Of course allowing some attributes to be skipped would perhaps open up other inconsistencies. For example, if StudyDate were not shifted, but we let dicognito do its work on PatientBirthDate, then the PatientAge would be incorrect, since we currently don't recalculate it. This can be documented, of course.

Hi @blairconrad. Thank you for looking into this. Something like --skip-attribute sounds great. Also, some API like anonymizer.anonymize(dataset, skip_attribute=["StudyDate", "StudyTime"]) as we use dicognito within another Open Source tool (ADIT) would be great, too. Something like --skip-vr DA sounds also useful. I would be fine with using those options even when some inconsistencies may occur. But sure, some documentation hints about those possible inconsistencies are reasonable.

Thanks for the feedback, @medihack. Were you hoping to contribute these changes?
I have some ideas about the internal architecture, so don't mind doing the work.

Our team can have a look if you don't have the resources right now. But it will take some time, too. Maybe we can start with skip_attribute.

@lmdulz I think you are more into this than me; maybe we can have a look together.

Cool. If you're interested, I'd had half an idea of being able to create a custom ElementAnonymizer that could ignore an attribute (or a VR or whatever's appropriate at the time) and then the Anonymizer could gain an add_element_anonymizer or similar method that would push the new anyonmizers to the front of _element_handlers. Then the command line code (or whatever other client exists) could create whatever instances of the class(es) it needs and shove them in. A little more flexible than building in something specifically for skipping. (But the command line gets the convenient --skip-* arguments as we discussed).

Hey all,

for sure, @blairconrad and me had the same talk about a protector class already before. In my head this would be a simple Class similar to the FixedValueAnonymizer which would get a list of tags to protect. Just instead of actually doing something, the class would simply return True on meeting one of the provided tags. And as said @blairconrad said before, putting to the front of _element_handlers would do the trick. But i see the point towards felxibility to create a class object for each single Tag to protect in connection with the add_element_anonymizer method.
Apart from changing StudyDate, I could also think about the other way around - what if PatientAge needs to be kept for newborn babies, but one sets the StudyDate half a year back with dicognito? Not sure if it would be more beneficial to let the user think about this as I could imagine that it could blow up complexity for too many individual cases.

But i see the point towards felxibility to create a class object for each single Tag to protect in connection with the add_element_anonymizer method.

True. I have no strong opinion on this. If an object had the capability to protect more than one Tag, it would still have the ability to protect one, so making the class that could protect many but using an individual object for each attribute mentioned on the command line would be fine by me. Whatever.

what if PatientAge needs to be kept for newborn babies, but one sets the StudyDate half a year back with dicognito

dicognito currently handles this out of the box, as we're careful to set the PatientBirthDate and the StudyDate (and Time) by the same delta. (In fact all DTs, TMs, and DTs are shifted by the same amount (within a particular invocation of dicognito, or across all runs if the same seed is used).) PatientAge was in my mind when I made this decision, but it was also to ensure, for example, that relevant prior studies remained prior studies relative the latest study. We wouldn't want to shift the current study back 18 months, but last week's study back 12 months. Then the later study becomes the earlier!

Ah I think you got me wrong... I know about the consistency of time-shifting that dicognito currently does. I just wanted to point out with that second example, that caring about all cross-relations of tags might be unnecessary (and at some point out of scope?) ? Maybe it would be enough to get a warning message or put one in the readme but I'd say people using dicognito would know those cross-relations

Ah, thanks for setting me straight. I think we are aligned.

caring about all cross-relations of tags might be unnecessary (and at some point out of scope?)

I agree. I think it's important for dicognito, as it ships, to worry about retaining relationships between attributes. But this "preserve existing values" feature in some ways feels to me like advanced usage, where dicognito will provide for users a more flexible, powerful option for the user, but the user must then take responsibility to use that additional power well.

Maybe it would be enough to get a warning message or put one in the readme but I'd say people using dicognito would know those cross-relations

Yeah, I would expect the users to understand the relationships. I wouldn't even have dicognito itself emit a warning message (which could quickly become annoying to the users), but perhaps a very brief note in the command-line help for the option, and definitely someing more comprehensive in the docstring for the add_element_anonymizer. And as you say, in the "official docs" which is currently the README: when it's augmented to mention the new feature, it should explain that using it has the potential to break the invariants that dicognito typically maintains and that care should be taken not to introduce inconsistencies in the output.

As a side note what I just remeberred bc the topic occurred months ago in my team: there is a Modified Attributes Sequence Attribute, which contains the modified attributes with their previous/ original values. However, I don't know how one could protect/ encrypt this sequence at this moment and it just came up to me - Did you know this before or is this a thing worth looking into for you?

Hey. The purpose of this sequence seems to be to preserve the original values, which is not compatible with how I originally imagined dicognito being used (although I fully admit that "trials" was not a use case I imagined, and am sure there are many more out there). I don't think we'd ever want the default behaviour to be to construct a Modified Attributes Sequence, but I suppose it could be a future opt-in.
Given the likely use cases for dicognito, though, I wonder if retaining the original data in "a safe place" along with a map correlating the new objects to the old ones (not unlike that proposed in #124) might not be a better approach. Or at least easier. Regardless, if you're interested in pursuing functionality that maps back to original data values, let's discuss outside this issue, as I don't see a connection between retaining certain values in the output and mapping back to values that were changed.

I think I misread your comment @lmdulz. You're interested in ensuring that we don't modify the Modified Attributes Sequence Attribute, not interested in creating one.

(I don't understand the comment about encrypting the sequence.)

Here's my initial thoughts: out of the box, I think it's appropriate for dicognito to anonymize the contents of the sequence. If we think we're obscuring original data, then I think obscuring the "original original" data also makes sense. Directives to skip anonymization of attributes based on name or tag or VR would apply to the objects within the sequence, so if we wanted to keep all StudyDates (for example) those would still be preserved.

But if you're wondering about how to ensure we keep the values of StudyDate only when it's inside (or only when outside) a Modified Attributes Sequence, it gets trickier. I suppose a custom element anonymizer could walk the current dataset's ancestor chain (via parent_seq and parent_dataset) to see if the dataset were contained in such a sequence and then decide to anonymize or not.

This may be getting into quite niche territory, though, and I'm not sure there's a need to include a command-line flag for this behaviour or to address it in this issue. Or maybe I'm wrong and there is a need. If so, convince me!

No no, your initial thought was right, the idea was to create one. But I'm also not soo convinced by it. The part about encryption was manily bc I had in my mind one needs to encrypt the Modified Attributes Sequence Attribute, otherwise it's whole existence makes no sense for me.
I think my whole comment about the Sequence could be discarded.

@lmdulz (or @medihack, I'm not sure who's doing what), I thought I'd check in on you. Is there anything I can do to help move your issue along? Design decisions you'd like to discuss? One thing that occurs to me is that --skip-attribute is maybe not the best name. I'd been using "attribute" all over in the code and docs where "element" more closely matches the language of DICOM.

Other notes: I did this morning merge an infrastructure change I'd had sitting in a pull request for a couple of weeks, so if you've an active branch, you'll eventually want to rebase on top of it. If you don't have an active branch, I've been mulling over the implementation and can probably have something for you to preview (maybe a finished product) by Monday. Your call.

Unfortunately, not yet. Our workaround is relatively easy as we only use dicognito programmatically and can easily reset some pseudonymized elements to the original values afterward. So, it is not our highest priority right now, and we would be happy to have a look and use your implementation :-)