IQSS / dataverse

Open source research data repository software

Home Page: http://dataverse.org

Add Croissant to Signposting "describedby" output

pdurbin opened this issue

Today @siacus and I were talking about how dataset landing pages can become heavy when the machine-readable JSON we put in the <head> (Schema.org JSON-LD or Croissant) gets large. In a real-life dataset with 25K files, the Croissant file can be 7.1 MB.

We talked about putting a link to the Croissant file in our Signposting output, like we do for Schema.org JSON-LD. Basically, robots could request just the headers (e.g. with curl --head) and receive a link to the Croissant file, rather than the entire payload, which can be large.
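A rough sketch of what such a robot might do on the client side, in Python. The helper names (`parse_link_header`, `described_by`, `fetch_link_header`) are made up for illustration and are not part of Dataverse; the parser is a simplification that assumes link targets and parameter values contain no `<` characters.

```python
import re
import urllib.request

# One link-value: <target> followed by its ;name=value parameters.
_LINK_RE = re.compile(r'<([^>]*)>([^<]*)')
# One parameter, with either a quoted or a bare value.
_PARAM_RE = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^;,\s]+))')


def parse_link_header(value):
    """Split a Link header value into dicts: 'target' plus each parameter."""
    links = []
    for target, params in _LINK_RE.findall(value):
        link = {"target": target}
        for name, quoted, bare in _PARAM_RE.findall(params):
            link[name.lower()] = quoted or bare
        links.append(link)
    return links


def described_by(value):
    """Keep only the links whose rel includes 'describedby'."""
    return [link for link in parse_link_header(value)
            if "describedby" in link.get("rel", "").split()]


def fetch_link_header(url):
    """Send a HEAD request (like `curl --head`) and return the Link header(s)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return ", ".join(response.headers.get_all("Link") or [])


# Offline example using the link from this issue; no network access needed.
sample = (
    '<https://dataverse.harvard.edu/api/datasets/export'
    '?exporter=schema.org&persistentId=doi:10.7910/DVN/TJCLKP>;rel="describedby"'
)
for link in described_by(sample):
    print(link["target"])
```

In practice a robot would call `fetch_link_header` on the dataset landing page URL and pass the result to `described_by`, so only the headers travel over the wire until it decides which metadata file to download.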

Unfortunately, people suffering from heavy dataset pages won't get relief until the large content is removed from the <head> of the page, but putting the link in Signposting gives machines an option for the future if the world wants to move in that direction. We already suggested Signposting to the Croissant/Google Dataset Search team at mlcommons/croissant#530 (comment)

In our Signposting output, we already include a link for downloading Schema.org JSON-LD data via API. For example:

<https://dataverse.harvard.edu/api/datasets/export?exporter=schema.org&persistentId=doi:10.7910/DVN/TJCLKP>;rel="describedby"

The Signposting spec seems to allow multiple "describedby" values, but if we prefer to keep a single "describedby" value, we could consider swapping out schema.org for croissant when it's available, like we do for the <head> tag.

I don't think this is a lot of work. A 3 is probably enough, but I'll give it a 10 to leave room for reviewing the Signposting spec and talking to that community, if need be, about multiple "describedby" values. The file to edit is SignpostingResources.java as seen in PR #8981.

See also this issue we opened with the Croissant team where we asked for guidance on large Croissant files:

Related issues:

FWIW: I think Signposting allows multiple "describedby" links, since you add the type attribute to specify the format of each one. We originally didn't put all of our exports in it because the draft/spec said something about only common formats, but based on subsequent discussions, I don't think there would be any concern if we just automatically added all installed exports to the list.
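To make the multiple-"describedby" idea concrete, here is a small Python sketch of what the header value could look like with one entry per export, each carrying a type attribute. This only illustrates the shape of the output; the real change would be in SignpostingResources.java. The exporter name `croissant`, the media types, and the `described_by_links` helper are assumptions for illustration, not confirmed Dataverse behavior.

```python
from urllib.parse import urlencode

# Assumed mapping of installed exporter names to media types (hypothetical).
EXPORTER_TYPES = {
    "schema.org": "application/ld+json",
    "croissant": "application/ld+json",
}


def described_by_links(base_url, persistent_id, exporters):
    """Build a Link header value with one describedby entry per exporter."""
    parts = []
    for name, media_type in exporters.items():
        # Keep ':' and '/' readable so the DOI looks like the existing output.
        query = urlencode(
            {"exporter": name, "persistentId": persistent_id}, safe=":/"
        )
        parts.append(
            f'<{base_url}/api/datasets/export?{query}>'
            f';rel="describedby";type="{media_type}"'
        )
    return ", ".join(parts)


print(described_by_links(
    "https://dataverse.harvard.edu", "doi:10.7910/DVN/TJCLKP", EXPORTER_TYPES
))
```

If both exports end up sharing a media type, as assumed here, the type attribute alone would not tell them apart, which may be worth raising with the Signposting community.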