IQSS / dataverse.harvard.edu

Custom code for dataverse.harvard.edu and an issue tracker for the IQSS Dataverse team's operational work, for better tracking on https://github.com/orgs/IQSS/projects/34

Update Schema.org exports of all datasets so they appear in Google Dataset Search

jggautier opened this issue

Datasets whose latest versions were published after Dataverse v5.13 was applied to Harvard Dataverse have Schema.org exports that include the creator @type updates made in the pull request at #9089.

These changes were made largely to improve the odds that Google Dataset Search would index the datasets.

I think v5.13 was applied to Harvard's repository on Feb 15, 2023, so datasets with versions published after that date have the updated Schema.org metadata exports.

For example, in the Schema.org export of the dataset at https://doi.org/10.7910/DVN/ZJ8MC0 published on Feb. 15, we see the "Creator" metadata and its @type property saying that the creator is a person:

Screenshot 2023-04-26 at 9 36 31 AM

Datasets with versions published before v5.13 was applied have Schema.org exports that don't include the creator @type updates.

For example, in the Schema.org export of the dataset at https://doi.org/10.7910/DVN/VOPK0E published on Feb. 14, we see the "Creator" metadata doesn't have an @type property saying if the creator is a person or organization:

Screenshot 2023-04-26 at 9 46 31 AM

The same is true if we look at the JSON-LD in the page source.
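One way to check an export for this difference is to look at each creator entry for an @type key. The following is a minimal sketch, assuming the Schema.org export has already been downloaded and parsed into a Python dict. The function name `creators_missing_type` and the two sample records are hypothetical and trimmed down; they are not copied from the real exports.

```python
import json


def creators_missing_type(schema_org: dict) -> list:
    """Return the names of creators that have no "@type" property."""
    creators = schema_org.get("creator", [])
    # JSON-LD allows a single object where a list is expected.
    if isinstance(creators, dict):
        creators = [creators]
    return [
        c.get("name", "<unnamed>")
        for c in creators
        if isinstance(c, dict) and "@type" not in c
    ]


# Hypothetical examples of an updated and a not-yet-updated export.
updated = json.loads(
    '{"@type": "Dataset", "creator": [{"@type": "Person", "name": "Doe, Jane"}]}'
)
outdated = json.loads(
    '{"@type": "Dataset", "creator": [{"name": "Doe, Jane"}]}'
)

print(creators_missing_type(updated))   # []
print(creators_missing_type(outdated))  # ['Doe, Jane']
```

An empty list would suggest the export already carries the creator @type updates; any names returned point to creators still missing them.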

In a conversation in an unrelated pull request, @qqmyers wrote that installations will need to do a reExportAll() so that all datasets include the Schema.org export updates.

Definition of done:
Do a reExportAll() so that the Schema.org metadata exports of all datasets in the Harvard Dataverse include the updates made in v5.13 pull request at #9089
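As a rough sketch of what triggering this could look like: the Dataverse admin guide describes a metadata re-export call under the admin API, which is normally reachable only from the server itself. The snippet below assumes the endpoint path `/api/admin/metadata/reExportAll` and a base URL of `http://localhost:8080`; both are assumptions that should be checked against the guide for the installed version before use. The helper names are hypothetical.

```python
from urllib.request import urlopen


def reexport_all_url(base_url: str = "http://localhost:8080") -> str:
    """Build the (assumed) admin API URL for re-exporting all datasets."""
    return base_url.rstrip("/") + "/api/admin/metadata/reExportAll"


def trigger_reexport_all(base_url: str = "http://localhost:8080",
                         timeout: float = 30.0) -> str:
    """Send a GET request to the assumed endpoint and return the response body.

    Intended to be run on the application server, since admin API
    endpoints are typically blocked from outside access.
    """
    with urlopen(reexport_all_url(base_url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Only the URL is printed here; call trigger_reexport_all() on the server
# itself once the endpoint has been confirmed.
print(reexport_all_url())
```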

The schema.org export only gets updated with a reExport, but the json-ld in the page is only cached as an @transient value in the DatasetVersion object, unless I'm missing something. The page info is version specific whereas the export is only cached for the latest version, which is one reason why the page doesn't just load the cached export. So I'm not sure why the json-ld in the page wouldn't be updated without a re-export. Are @transient values getting cached in ./generated or ./osgi-cache?

Ah okay. When you say that "the page info is version specific whereas the export is only cached for the latest version," this makes me think that for each of a dataset's published versions, in the page source code there should be schema.org json-ld metadata.

Should that be the case? What I'm seeing is that only the latest published version has schema.org json-ld metadata in its page source code.

So for https://doi.org/10.7910/DVN/FZOVRC that has two published versions, the source code on the page for version 1 has an empty <script> tag:

Screenshot 2023-04-26 at 2 34 09 PM

For version 2, there's the Schema.org metadata, and it refers to version 2:

Screenshot 2023-04-26 at 2 34 28 PM

I see the same thing in a couple other Dataverse repositories I've been able to check.
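To repeat this check without reading the page source by hand, the JSON-LD can be pulled out of a saved copy of each version's page. This is a minimal sketch using only the Python standard library. It assumes the metadata sits in a `<script type="application/ld+json">` tag, and the two sample pages are hypothetical stand-ins for the version 1 and version 2 pages, not the actual page source.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer).strip())
            self._in_jsonld = False


def jsonld_from_page(html: str) -> list:
    """Return the parsed JSON-LD objects in a page, skipping empty script tags."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return [json.loads(block) for block in parser.blocks if block]


# Hypothetical page sources: an empty script tag versus one with metadata.
v1_page = '<html><head><script type="application/ld+json"></script></head></html>'
v2_page = ('<html><head><script type="application/ld+json">'
           '{"@type": "Dataset", "version": "2"}</script></head></html>')

print(jsonld_from_page(v1_page))  # []
print(jsonld_from_page(v2_page))  # [{'@type': 'Dataset', 'version': '2'}]
```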

It also sounds like a reExport wouldn't update the metadata in the page. And since that's the metadata that I think Google Dataset Search is using to index datasets in Dataverse repositories, a reExportAll wouldn't result in more datasets being discoverable through Google Dataset Search.

Does that all make sense?

And should I open an issue in the Dataverse GitHub repo about figuring out how to update the json-ld on dataset pages?

I was referring to the underlying code. I see now that https://github.com/IQSS/dataverse/blob/4903e9f0277105ea6a8c59a2f962dac8bcf715f2/src/main/java/edu/harvard/iq/dataverse/DatasetPage.java#L5535 only displays the json-ld for the latest version (even though the underlying code could do it for earlier versions).

As for the latest version not being up-to-date, I don't know why that is. If it is because the generated or osgi-cache dirs weren't cleared, we should see which one. The release notes already say to delete things in generated, so if that wasn't done, it's not an issue with the code or release notes. If it is the osgi-cache dir, we should probably open an issue to add that to the release notes. If the reason still isn't clear, we should perhaps treat either this issue or a new one in the dataverse repo as a spike to investigate.

reExportAll would still be a good thing to do - in order to get the schema.org export files up-to-date.