IQSS / dataverse

Open source research data repository software

Home Page:http://dataverse.org

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Dataset pages cannot be opened after update to v6.2 with perma PID

HenningTimm opened this issue · comments

Error Description

After updating to v6.2, I get a 404 error for all dataset pages. The page only reads "Page Not Found - The page you are looking for was not found." in a red box, reports 404 in the DevTools, and the Payara logs tell me "failed to retrieve version" as shown below:

[#|2024-05-13T16:26:17.001+0200|WARNUNG|Payara 6.2023.8|edu.harvard.iq.dataverse.DatasetPage|_ThreadID=94;_ThreadName=http-thread-pool::jk-connector(4);_TimeMillis=1715610377001;_LevelValue=900;|
  Failed to retrieve version|#]

This error only occurs for datasets that existed prior to the update and have a valid PID. Opening pages with made-up PIDs reports an internal server error when parsing the PID. The datasets show up in the SOLR index with all of their metadata and files can be downloaded, but downloading metadata for the files fails. Datasets created after the migration were not affected.

Underlying Cause

TL/DR: Our Dataverse was running a perma PID provider that used an authority that had a trailing dash, which we later used as separator after the v6.2 update. This (probably) caused splitting issues making the old datasets unavailable since Dataverse could not find them in the respective database column.

This error occurred when migrating from v6.1 to v6.2 with an existing perma PID provider.
Beginning with v6.2 a separator (i.e. the character that sits between the authority and the should of the generated PID) can also be configured for perma PID providers. Prior to v6.2, we had encoded this separator either in the Shoulder or in the Authority, to be able to visually tell the parts of the PID apart. However, we were a bit inconsistent between our staging setup and our production setup:

Staging Server:
  Authority: thz
  Shoulder: -ds
→ (thz)(-ds)(randomString)

Production Server:
  Authority: thz-
  Shoulder: ds-
→ (thz-)(ds-)(randomString)

During migration, we tried to unify this again. So, we switched to the following pattern when setting up the new PID provider (as described in the release notes. Cf. #10516 for some more information about which settings we added) :

Both Server:
  Authority: thz
  Separator: -
  Shoulder: ds-
-> (thz)(-)(ds-)(randomString)

This worked fine for the staging server, but caused the issue described above for our production system.

Our best guess is that this caused issues with splitting the PIDs so that the old datasets could not be accessed, since the trailing - of the authority was used as separator during parsing of the PID. This made PIDs generated before the migration inaccessible since they still expected the - to be there. This can nicely be seen in the database (see below). Working backwards from the ERROR and INFO message in the payara logs, I reached this part in the code which splits at the first occurrence of the separator. This means that the authority for both thz- and thz would be reported as thz. These entries cannot be found in the database though, since the older datasets have the authority thz-.

When we took a look at the database with the following query

SELECT d.authority, d.createdate, d.identifier, d.identifierregistered, d.indextime, d.protocol, d.storageidentifier
FROM dvobject d
where d.dtype = 'Dataset'
order by createdate desc
limit 10;

We got the following result (truncated to two lines, with some data redacted) on our production server. The first entry with authority thz was created after the update, the second one is a dataset that was there before the update.

 authority |       createdate        | identifier         | identifierregistered |        indextime        | protocol |  storageidentifier  
-----------+-------------------------+--------------------+----------------------+-------------------------+----------+---------------------
 thz       | 2024-05-14 15:02:27.685 | ds-<randomString>  | t                    | 2024-05-14 15:02:28.048 | perma    | s3://thz/ds-<randomString>
 thz-      | 2024-mm-dd 10:00:00.000 | ds-<randomString>  | t                    | 2024-05-13 17:22:53.777 | perma    | s3://thz-/ds-<randomString>

Workarounds and Fixes

So, the easiest workaround would be to not use a separator that is a suffix of the authority in the first place.

We had to revert our production server (restored a VM snapshot) and re-ran the update but used / as a separator instead of -. This solved the issue, however, we now need to use a new PID schema: (thz)(/)(ds-)(randomString). This keeps all old datasets available via the legacy provider.

Potentially (warning: this is untested and highly speculative) we could have also manually removed the trailing dashes from the authority-column in the database to keep our established PID pattern, however we were not brave enough to try this.

While we only just begin to dig into the code base, we will check if it is possible to match the whole prefix instead of splitting on the separator when parsing a PID. This will only be relevant for perma PIDs, since DOIs, Handles, etc. follow are more rigid structure. However, for perma PIDs this could probably lead to unintentional behavior, e.g. when the separator is contained anywhere within the authority, not just as a suffix.

Further Reading

Interesting case. Permalinks and the idea of configurable separators definitely have some limitations as you can create ambiguous cases.

My guess is the code is working as ~designed - trying to parse a string like (thz)(-)(ds-)(randomString) with a perma provider that has - as a separator doesn't match dataset with authority thz- and identifier ds-(randomString) - that would match thz--ds-(randomString). (FWIW, the code you point to above doesn't run for permalinks - this parsePersistentId is used instead.) That's a consequence of the db not storing the separator and assuming it matches that of the pid provider. An alternate design could store the separator and that design would be needed to allow two perma providers that only differ by their separator. Even with that design, I'm not sure that allowing a provider to recognize a dataset who's authority includes the separator (and had a null separator) would be a good idea - that would allow two providers to recognize the same dataset. That said, any proposals/PRs to improve how PermaLinks work are welcome. Some limits on what characters can be in authorities and separators could avoid confusing cases and might be fine with everyone.

I can think of two possible ways to fix things:

  • just delete the '-' from the authority of the relevant datasets/datafiles in the database - as you suggest,
  • configure a second perma provider to recognize the legacy datasets and just make your new one with the - separator the default for new datasets.

The former is probably cleaner, but I think the latter is possible and would properly represent what you have now - old datasets created with one form of permalink and new ones with another. (We're not talking about base-urls here, but if you did have different ones for the two cases, you could support that with two providers.)

Awaiting response from @HenningTimm. When we hear back, we will be able to decide next steps.