IQSS / dataverse.harvard.edu

Custom code for dataverse.harvard.edu and an issue tracker for the IQSS Dataverse team's operational work, for better tracking on https://github.com/orgs/IQSS/projects/34

Use Abacus Data Network's new server URL to harvest the installation's metadata

jggautier opened this issue

Clicking on the titles of the records harvested from Abacus Data Network into Harvard Dataverse (https://dataverse.harvard.edu/dataverse/ubc_harvested) no longer takes users to the datasets.

Abacus Data Network upgraded its Dataverse software version to v5.6 and changed its server URL to https://abacus.library.ubc.ca. The old URL, https://dvn.library.ubc.ca, redirects to https://abacus.library.ubc.ca.

We'll need to harvest using the new URL, https://abacus.library.ubc.ca/oai.
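For anyone who wants to sanity-check the new endpoint by hand first, here's a minimal sketch (not something we ran, just a standard OAI-PMH Identify request) that asks the new URL for a few fields to confirm the server answers before configuring a harvesting client:

```python
# A minimal sketch (standard OAI-PMH, nothing Dataverse-specific) that sends
# an Identify request to the new endpoint and prints a few fields, just to
# confirm the server answers before pointing a harvesting client at it.
import urllib.request
import xml.etree.ElementTree as ET

OAI_BASE = "https://abacus.library.ubc.ca/oai"
NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

def oai_identify(base_url: str) -> dict:
    """Send an OAI-PMH Identify request and return a few key fields."""
    with urllib.request.urlopen(f"{base_url}?verb=Identify", timeout=30) as resp:
        root = ET.fromstring(resp.read())
    ident = root.find("oai:Identify", NS)
    if ident is None:  # e.g. the server returned an OAI error response instead
        raise RuntimeError("No Identify section in the OAI response")
    return {
        "repositoryName": ident.findtext("oai:repositoryName", default="", namespaces=NS),
        "baseURL": ident.findtext("oai:baseURL", default="", namespaces=NS),
        "protocolVersion": ident.findtext("oai:protocolVersion", default="", namespaces=NS),
    }

if __name__ == "__main__":
    print(oai_identify(OAI_BASE))
```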

I'm not sure whether we should:

  • Edit the existing harvesting client in Harvard Dataverse so that it uses the new server URL (https://abacus.library.ubc.ca/oai), then try to harvest
  • Or remove the existing client in Harvard Dataverse, then create a new one with the new server URL

To see if harvesting from Abacus Data Network would work, I just told Demo Dataverse to harvest records from https://abacus.library.ubc.ca/oai into the collection at https://demo.dataverse.org/dataverse/ubc_abacus_harvested, using the dataverse_json format.
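As a side note, a quick way to see which metadata formats the Abacus server actually advertises (dataverse_json, oai_ddi, oai_dc, etc.) is a standard OAI-PMH ListMetadataFormats request; a rough sketch, again using only the endpoint URL mentioned above:

```python
# A small sketch that asks the Abacus endpoint which metadata formats it
# advertises, via a standard OAI-PMH ListMetadataFormats request.
import urllib.request
import xml.etree.ElementTree as ET

OAI_BASE = "https://abacus.library.ubc.ca/oai"
NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

def list_metadata_formats(base_url: str) -> list[str]:
    """Return the metadataPrefix values the OAI server advertises."""
    with urllib.request.urlopen(f"{base_url}?verb=ListMetadataFormats", timeout=30) as resp:
        root = ET.fromstring(resp.read())
    return [
        fmt.findtext("oai:metadataPrefix", default="", namespaces=NS)
        for fmt in root.findall(".//oai:metadataFormat", NS)
    ]

if __name__ == "__main__":
    print(list_metadata_formats(OAI_BASE))
```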

I'll check later this week to see what happens.


Just an update that Demo Dataverse couldn't harvest any of the dataset metadata from Abacus using the dataverse_json metadata format.

Just want to put it on record that changing the server type (from "DVN" to "Dataverse") in the harvesting clients panel did NOT fix the redirects for the existing harvested records either.
(screenshot attached: Screen Shot 2024-03-15 at 1 00 28 PM)

This appears to be because they have changed all their handle identifiers: the ones we harvested look like hdl:11272/NNNNN, while the ones they are using now look like hdl:11272.1/AB2/XXXXX.
The old handles themselves still redirect properly if you click on them:
(screenshot attached: Screen Shot 2024-03-15 at 1 08 32 PM)
so they are still registered on the Handle.Net side. But Abacus's new Dataverse installation no longer recognizes them when we try to redirect there directly.
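If it helps, a rough sketch for checking where an old-style handle currently redirects via the Handle.Net proxy; the handle in the example is a placeholder in the old hdl:11272/NNNNN pattern, not a real identifier:

```python
# A rough sketch for checking whether an old-style handle still resolves via
# the Handle.Net proxy and where it ends up. The handle below is a placeholder
# in the old hdl:11272/NNNNN pattern, not a real identifier.
import urllib.request

HANDLE_PROXY = "https://hdl.handle.net"

def resolve_handle(handle: str) -> str:
    """Follow the Handle.Net proxy redirect and return the final URL."""
    # HEAD avoids downloading the landing page; switch to GET if a server
    # rejects HEAD requests.
    req = urllib.request.Request(f"{HANDLE_PROXY}/{handle}", method="HEAD")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.url  # urlopen follows redirects, so this is the final URL

if __name__ == "__main__":
    print(resolve_handle("11272/NNNNN"))  # placeholder handle
```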

... one way or another, these records are hopelessly stale. We do need to delete this client and re-harvest. I agree it's prudent to first work out a working configuration on demo. If dataverse_json isn't working, we should follow the normal downgrade route - to oai_ddi, and then to oai_dc, if that's not working either.

Hah, I was able to fix the redirects for the old harvested records in prod by changing the type to "generic" and using the handle resolver as the archive URL:
(screenshot attached: Screen Shot 2024-03-15 at 1 23 46 PM)

Haven't tried all of them, but the ones I tried worked.
This does NOT change the fact that we want to be able to re-harvest from their new server and to restart regular harvesting from them.

On demo every record failed with the same exception:
Failed to import harvested dataset: class edu.harvard.iq.dataverse.util.json.JsonParseException (Invalid license: ...)

I'll need to refresh my memory on what this means.

If dataverse_json isn't working, we should follow the normal downgrade route - to oai_ddi, and then to oai_dc, if that's not working either.

Is it okay if I try oai_ddi now and then oai_dc if that doesn't work? Or should I wait until you can look into what's going on with that "Invalid license" exception?

In case "license" there means a dataset's license metadata, Abacus is running v5.6 so its datasets' dataverse_json exports are different than exports from installations running v5.10+ after the multiple license update.

Please go ahead and try the other formats, no need to wait.
(You will need to either delete and recreate the client, or purge the clientharvestrun entry from the database; otherwise it'll attempt to harvest incrementally since the date/time of the last so-called "success". Sorry if I'm explaining the obvious.)
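For the record, purging those rows directly could look roughly like the sketch below, assuming database access; the clientharvestrun table name comes from the comment above, but the column names and the join to harvestingclient are my assumptions about the schema, so verify them before deleting anything:

```python
# Rough sketch only, assuming direct access to the Dataverse PostgreSQL
# database. "clientharvestrun" is the table named above; the column names and
# the harvestingclient table/columns are assumptions -- verify against the
# real schema before running anything like this.
import psycopg2

def purge_harvest_runs(dsn: str, client_nickname: str) -> int:
    """Delete recorded runs for one harvesting client; returns rows removed."""
    conn = psycopg2.connect(dsn)
    try:
        with conn, conn.cursor() as cur:  # the with-block commits on success
            cur.execute(
                """
                DELETE FROM clientharvestrun
                WHERE harvestingclient_id = (          -- column name is an assumption
                    SELECT id FROM harvestingclient    -- table/column names assumed
                    WHERE nickname = %s
                )
                """,
                (client_nickname,),
            )
            return cur.rowcount
    finally:
        conn.close()

# Hypothetical usage (DSN and nickname are placeholders):
# purge_harvest_runs("dbname=dvndb user=dvnapp host=localhost", "ubc_abacus")
```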

You are most likely correct about the json format. That was my guess, that it's completely incompatible between pre- and post-5.10 because of the license change. Just wanted to confirm this w/ others in dv-tech.

Using oai_ddi mostly worked.

It looks like 2,435 records were harvested into https://demo.dataverse.org/dataverse/ubc_abacus_harvested. The harvesting client page says that 3 failed.

And it looks like there are 2,449 datasets in the repository (https://abacus.library.ubc.ca), although maybe a few of those are missing from their OAI feed because they were published very recently.
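One way to double-check that gap would be to count the records the OAI feed actually exposes and compare that to the repository's dataset count; a sketch using standard OAI-PMH ListIdentifiers paging against the endpoint discussed above:

```python
# A sketch for counting the records the OAI feed exposes, using standard
# OAI-PMH ListIdentifiers paging. Only the endpoint URL and the oai_ddi
# prefix come from this thread; the rest is generic.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI_BASE = "https://abacus.library.ubc.ca/oai"
NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

def count_identifiers(base_url: str, metadata_prefix: str = "oai_ddi") -> int:
    """Page through ListIdentifiers and count non-deleted record headers."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    count = 0
    while True:
        url = f"{base_url}?{urllib.parse.urlencode(params)}"
        with urllib.request.urlopen(url, timeout=60) as resp:
            root = ET.fromstring(resp.read())
        headers = root.findall(".//oai:header", NS)
        count += sum(1 for h in headers if h.get("status") != "deleted")
        token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
        if not token:
            return count
        # Follow-up requests carry only the verb and the resumption token.
        params = {"verb": "ListIdentifiers", "resumptionToken": token}

if __name__ == "__main__":
    print(count_identifiers(OAI_BASE))
```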

Want me to delete the client in Harvard Dataverse and create a new one that uses oai_ddi instead?

Yes, that's a very good success-to-fail ratio; let's use oai_ddi in prod.
I would only suggest waiting until the weekend to run the actual harvest.

Just leaving an update that the old client was deleted and I created a new client using the new server URL https://abacus.library.ubc.ca/oai. It's scheduled to run Saturdays at 4am and harvest oai_ddi metadata into the collection at https://dataverse.harvard.edu/dataverse/ubc_harvested.
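For future reference, deleting and recreating the client could also be scripted against the Harvesting Clients API instead of the UI, if the installed Dataverse version supports it. The nickname and API token below are placeholders, and the field names in the create body are my best recollection of that API, so they should be checked against the API guide; the Saturday 4am schedule mentioned above would still be set separately:

```python
# Rough sketch of deleting and recreating a harvesting client through the
# Dataverse Harvesting Clients API instead of the UI. The nickname and token
# are placeholders; the JSON field names are my recollection of that API and
# should be checked against the API guide for the installed version.
import json
import urllib.request

BASE = "https://dataverse.harvard.edu"
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"  # placeholder superuser token
NICKNAME = "ubc_abacus"  # placeholder client nickname

def api(method: str, path: str, body: dict | None = None) -> dict:
    """Minimal helper for authenticated JSON calls to the Dataverse native API."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        f"{BASE}{path}",
        data=data,
        method=method,
        headers={"X-Dataverse-key": API_TOKEN, "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())

# Delete the stale client, then create one pointing at the new server URL.
api("DELETE", f"/api/harvest/clients/{NICKNAME}")
api("POST", f"/api/harvest/clients/{NICKNAME}", {
    "dataverseAlias": "ubc_harvested",              # collection receiving the records
    "harvestUrl": "https://abacus.library.ubc.ca/oai",
    "archiveUrl": "https://abacus.library.ubc.ca",
    "metadataFormat": "oai_ddi",
})
```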

Next Monday I'll check to see how it went 🤞

The harvesting client page for HDV says that the scheduled harvest ran on Saturday at 4 am and 2,438 records were harvested into https://dataverse.harvard.edu/dataverse/ubc_harvested. It failed to harvest 3 records.

I'm going to close this issue.

I saw that last night. Going to consider this a smashing success, by our standards.