Feature Request/Idea: Move draft Datasets should update their PID configuration

johannes-darms opened this issue 2 months ago · comments

Overview of the Feature Request
There is a superuser API/UI to move a dataset from one dataverse to another. Suppose both dataverses have different PID providers configured. Which PID provider do we expect to mint the identifier? IMHO it should be the one configured in the target dataverse, i.e. the one it was moved to. In the current implementation, PID properties are not changed during a move, so the old PID configuration remains active. This makes sense for PIDs that have already been published, even if it is counterintuitive for local administration and configuration! The situation is different for unpublished (i.e. draft) records, which are not yet visible outside the instance and can therefore obey the configuration of their dataverse/collection. Therefore, the PID properties of a dataset must be changed during the move process.

What kind of user is the feature intended for? Superuser
What inspired the request? Managing a dataverse with two collections, one with a PermaLink provider and one with a DoiProvider. If the user has accidentally created a dataset in the wrong dataverse, we cannot change the PID configuration using the move operation.

What existing behavior do you want changed?
The PID properties of a dataset must be changed during the move process and a notification must be sent to the user about the change.

Adapt MoveDatasetCommand so that, for draft datasets, it deletes the identifier at the currently configured PID provider and then registers a new PID with the new PID provider. In addition, add a user notification about the move action.

Sample text for the move UI:

"This function can be used to transfer a dataset from one dataverse to another. The PID settings of the target dataverse become active when an unpublished (i.e. draft) dataset is moved to another dataverse. This invalidates the existing PID and creates a new one. If the PID has already been in use outside the system, it will have to be adjusted. The PID configuration is not adjusted for datasets that have already been published."

Sample text for the user notification for moved draft datasets:

"The record {Dataset title} was moved from {previous host dataverse name} to {new host dataverse name}. The intended PID ({PID prior to move}) was changed to {new PID after move}. If the intended PID was already in use outside the system, it must be adapted."

Sample text for the user notification for moved published datasets:

"The record {Dataset title} was moved from {previous host dataverse name} to {new host dataverse name}."

Johannes Darms · Answer 1 · Wed Apr 17 2024 22:15:48 GMT+0800 (China Standard Time)

More context is available in the Zulip discussion.

Johannes Darms · Answer 2 · Wed Apr 17 2024 22:18:38 GMT+0800 (China Standard Time)

@qqmyers what do you think about this feature request?

qqmyers · Answer 3 · Wed Apr 17 2024 22:34:16 GMT+0800 (China Standard Time)

Given that the Dataset PID is assigned at creation, I'm not sure I like, as a proxy for users, that just moving has the automatic side effect of changing the PID (seems like a surprise, and it's different if the PID is published). That said, the use case makes sense and technically the approach would work without leaving any PIDs behind (since draft ones can be deleted). I'm not sure what might be better: just adding a warning that the PID will change? An option to decide if that's the right choice? A flag to let admins choose which way things should work? Or a separate action? Or a combination?

Some things to consider:

- This seems to be needed only if the PID provider is different in the other collection, i.e. I assume you wouldn't remove one PID with the same protocol/authority/shoulder just to create another one (but that means explaining that the PID won't always change, even for a draft).
- There's a PR in play to register file PIDs at creation time as well. If that merges, you'd need to handle them as well.
- We do have the idea of an alternateID: one could keep the old PID associated with the dataset. That's probably overkill for draft datasets (and DV doesn't manage alternateIDs, so the alternate wouldn't be made findable automatically at publication). And I don't think files have alternateIDs at all. Guessing this is just something to ignore, but if there's a use case where you'd want to handle published datasets, this might be useful: using the old PID would still redirect to the dataset page.
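
To illustrate the first consideration, here is a minimal sketch of the "only replace the PID if the provider settings differ" check. It is not Dataverse code: `PidChangeCheck`, `PidSettings`, and `needsNewPid` are hypothetical names, and the protocol/authority/shoulder values are placeholders.

```java
import java.util.Objects;

/** Hypothetical illustration; these are not Dataverse classes. */
final class PidChangeCheck {

    /** The three values that determine what a newly minted PID looks like. */
    record PidSettings(String protocol, String authority, String shoulder) {}

    /**
     * A draft PID only needs replacing when the target collection would mint
     * a PID with a different protocol, authority, or shoulder.
     */
    static boolean needsNewPid(PidSettings current, PidSettings target) {
        // Record equality compares all three components.
        return !Objects.equals(current, target);
    }

    public static void main(String[] args) {
        PidSettings perma = new PidSettings("perma", "", "DRAFT");   // placeholder values
        PidSettings doi = new PidSettings("doi", "10.5072", "FK2/"); // placeholder values
        System.out.println(needsNewPid(perma, doi));
        System.out.println(needsNewPid(doi, new PidSettings("doi", "10.5072", "FK2/")));
    }
}
```

With a check like this, the UI text would need to say that the PID changes only when the settings differ, which is the explanation burden mentioned above.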

Johannes Darms · Answer 4 · Thu Apr 18 2024 14:32:47 GMT+0800 (China Standard Time)

Taking your comments into account, it may be better to leave the move operation as is and add a PidReconcileCommand that a superuser can execute. This method/API only acts on a given dataset and can be called independently of the move action. This way the admins are aware of the action and the consequences.

This approach would also support the use case where the PID provider changes and existing drafts need to be updated to use the new one. Furthermore, this command could support the use case where a PID provider has changed after a dataset has been published and a new version is requested. (Assuming this currently fails because the required PIDProvider configuration is no longer available).

qqmyers · Answer 5 · Thu Apr 18 2024 19:37:08 GMT+0800 (China Standard Time)

Sounds good to me. I think there are non-move use cases for this as well, e.g. just bulk replacing a FAKE, DataCite Test, or PermaLink provider in your system (w/o moving the data).

Johannes Darms · Answer 6 · Thu Apr 18 2024 21:41:24 GMT+0800 (China Standard Time)

I'll create a new PidReconcileCommand. The first implementation will only work for draft datasets with no published version. The command acts on a dataset specified by an identifier and a PID provider identified by a PidGeneratorId.

The process flow will be the following:

1. Verify that the invoking user is a superuser. Otherwise: notify the invoking entity about missing access rights.
2. Verify that the given dataset is not yet published. Otherwise: notify the invoking entity about the unmet precondition.
3. Verify that the PID provider currently configured for the dataset differs from the requested one. Otherwise: notify the invoking entity about the unmet precondition.
4. Invoke delete identifier for the dataset.
5. Register a new PID with the new PID provider.
6. When file PIDs are enabled globally (:FilePIDsEnabled) or for the collection of the dataset, and the dataset has datafiles, perform steps 4 and 5 for each datafile. [Assuming the PR to register file PIDs at creation time is merged.]
7. Notify the people involved (owner and more?) about the change.
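
The flow above can be sketched roughly as follows. This is a simplified, self-contained illustration and not the actual command: `Provider`, `Dataset`, `reconcile`, and `counting` are hypothetical stand-ins rather than Dataverse types, and the precondition failures are modelled as exceptions instead of notifications.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the reconcile flow; none of these types are Dataverse classes. */
final class PidReconcileSketch {

    interface Provider {
        String id();
        String register();        // mints and registers a new PID
        void delete(String pid);  // deletes a not-yet-published PID
    }

    static final class Dataset {
        String pid;
        String providerId;
        final boolean published;
        final List<String> filePids = new ArrayList<>();

        Dataset(String pid, String providerId, boolean published) {
            this.pid = pid;
            this.providerId = providerId;
            this.published = published;
        }
    }

    /** Returns one "old -> new" line per replaced PID, to be used for the notification. */
    static List<String> reconcile(boolean superuser, Dataset ds, Provider current,
                                  Provider target, boolean filePidsEnabled) {
        if (!superuser) throw new SecurityException("superuser required");         // step 1
        if (ds.published) throw new IllegalStateException("dataset is published"); // step 2
        if (ds.providerId.equals(target.id()))                                     // step 3
            throw new IllegalStateException("PID provider is unchanged");

        List<String> changes = new ArrayList<>();
        String oldPid = ds.pid;
        current.delete(oldPid);                                                    // step 4
        ds.pid = target.register();                                                // step 5
        ds.providerId = target.id();
        changes.add(oldPid + " -> " + ds.pid);

        if (filePidsEnabled) {                                                     // step 6
            for (int i = 0; i < ds.filePids.size(); i++) {
                String oldFilePid = ds.filePids.get(i);
                current.delete(oldFilePid);
                ds.filePids.set(i, target.register());
                changes.add(oldFilePid + " -> " + ds.filePids.get(i));
            }
        }
        return changes;                                                            // step 7: caller notifies
    }

    /** Toy provider that mints prefix + counter; delete is a no-op in this sketch. */
    static Provider counting(String id, String prefix) {
        return new Provider() {
            private int next = 1;
            public String id() { return id; }
            public String register() { return prefix + next++; }
            public void delete(String pid) { }
        };
    }

    public static void main(String[] args) {
        Provider perma = counting("perma1", "perma:X");
        Provider doi = counting("doi1", "doi:10.5072/FK2/");
        Dataset ds = new Dataset(perma.register(), perma.id(), false);
        ds.filePids.add(perma.register());
        reconcile(true, ds, perma, doi, true).forEach(System.out::println);
    }
}
```

Returning the list of changes keeps the notification question open: whoever calls the command can decide who receives the old and new PIDs.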

Who should we notify about updated PIDs? Do we need to notify anyone other than the creator?

Johannes Darms · Answer 7 · Tue Apr 23 2024 22:15:09 GMT+0800 (China Standard Time)

@qqmyers I wonder whether this requirement is partly fulfilled with this method. Does it make sense to add another method or should I extend this method?

qqmyers · Answer 8 · Tue Apr 23 2024 22:58:30 GMT+0800 (China Standard Time)

I'd suggest a new method that's explicitly about regeneratePIDs/changePids()/replacePids/etc. The one above only affects which PidProvider would be used if new files are added; it doesn't affect the PID of the dataset or existing files. (The provider that would be used is the same as the one the dataset uses if that one can be used for new PIDs, or the one set for the collection if not, unless you set it explicitly with this method.)
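
As a reading aid, the provider-selection rule in the parenthetical can be sketched like this. It is only one interpretation of that sentence, not the actual Dataverse logic; `EffectiveProviderSketch`, `Provider`, `canCreateNewPids`, and `effectiveProvider` are hypothetical names.

```java
import java.util.Optional;

/** Hypothetical sketch of which provider mints PIDs for newly added files. */
final class EffectiveProviderSketch {

    record Provider(String id, boolean canCreateNewPids) {}

    static Provider effectiveProvider(Optional<Provider> explicitlySet,
                                      Provider datasetProvider,
                                      Provider collectionProvider) {
        // An explicitly set provider always wins.
        if (explicitlySet.isPresent()) {
            return explicitlySet.get();
        }
        // Otherwise reuse the dataset's provider if it can still mint new PIDs,
        // and fall back to the collection's provider if it cannot.
        return datasetProvider.canCreateNewPids() ? datasetProvider : collectionProvider;
    }

    public static void main(String[] args) {
        Provider legacy = new Provider("legacy", false);
        Provider collection = new Provider("collection", true);
        System.out.println(effectiveProvider(Optional.empty(), legacy, collection).id());
    }
}
```

Nothing in this rule touches PIDs that already exist, which is why a separate replace/regenerate method is needed for the reconcile use case.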

Johannes Darms · Answer 9 · Fri May 17 2024 22:36:57 GMT+0800 (China Standard Time)

@qqmyers I've added a PR with a working implementation, including reconciling file PIDs. Could you take a look?
#10567