Table of Zenodo and GitHub locations for versions of datasets

Question

HedvigS opened this issue 8 months ago · comments

I think it may be useful to users if the rcldf package came with a data object containing pointers to the major CLDF datasets that DLCE maintains, pointers that can be fed to rcldf::download().

What do you think @SimonGreenhill @xrotwang ?

dataset	version	Zenodo_url	GitHub_url
grambank	1.0.3	https://zenodo.org/records/7844558/files/grambank/grambank-v1.0.3.zip	https://github.com/grambank/grambank/tree/v1.0.3
glottolog-cldf	4.8	https://zenodo.org/records/8131084/files/glottolog/glottolog-v4.8.zip	https://github.com/glottolog/glottolog-cldf/tree/v4.8
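
For readers who want to script against a table like this, here is a minimal Python sketch (not part of rcldf). It assumes the table is stored as tab-separated text with the four column names shown above; the function name `find_url` is hypothetical.

```python
import csv
import io

# The two rows from the table above, as tab-separated text.
TABLE = (
    "dataset\tversion\tZenodo_url\tGitHub_url\n"
    "grambank\t1.0.3\t"
    "https://zenodo.org/records/7844558/files/grambank/grambank-v1.0.3.zip\t"
    "https://github.com/grambank/grambank/tree/v1.0.3\n"
    "glottolog-cldf\t4.8\t"
    "https://zenodo.org/records/8131084/files/glottolog/glottolog-v4.8.zip\t"
    "https://github.com/glottolog/glottolog-cldf/tree/v4.8\n"
)


def find_url(table_text, dataset, version, column="Zenodo_url"):
    """Return the URL for a dataset/version pair, or None if it is not listed."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    for row in reader:
        if row["dataset"] == dataset and row["version"] == version:
            return row[column]
    return None


print(find_url(TABLE, "grambank", "1.0.3"))
```

The returned string is the kind of pointer that would then be handed to rcldf::download() on the R side.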

Robert Forkel · Answer 1 · Tue Nov 07 2023 18:36:45 GMT+0800 (China Standard Time)

If anything, the pointers should be DOIs, I think. Also, such a list might be of wider applicability than just R, so a data object seems like an unnecessary limitation. Why not a CSV file? Or a CLDF dataset of datasets? @johenglisch is working on a catalog of CLDF datasets, so he may have an idea about this.

Hedvig Skirgård · Answer 2 · Tue Nov 07 2023 18:54:04 GMT+0800 (China Standard Time)

DOIs should be included too, that's a good point.

I was thinking practically: for users of the rcldf package, Zenodo download URLs may be very handy.

Yes, I would like to know what @johenglisch thinks too. I couldn't tag him originally; I think he's not "in" this repos.

CSV would be cool too, but I believe best practice for R packages is .rda files. I would prefer CSV too since it'd make the history more transparent, but I don't think that's what is usually done. @SimonGreenhill would know more.

Johannes Englisch · Answer 3 · Wed Nov 08 2023 19:42:33 GMT+0800 (China Standard Time)

Tadaaaa~ Here's a list of datasets the catalogue knows about currently: contributions.csv

I think adding direct download links to this shouldn't be too hard.

Hedvig Skirgård · Answer 4 · Mon Nov 13 2023 21:07:23 GMT+0800 (China Standard Time)

> Tadaaaa~ Here's a list of datasets the catalogue knows about currently: contributions.csv

> I think adding direct download links to this shouldn't be too hard.

NICE!

Those Zenodo links, though: they need to look something like "https://zenodo.org/records/7740140/files/grambank/grambank-v1.0.zip" for rcldf::download() (and therefore also rcldf::cldf()) to work. What we currently expect is the link underneath the "Download" button on the web page for the record.
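
To make the difference between a record page and a file link concrete, here is a small Python sketch. The rule it applies (`/records/<id>/files/<name>.zip` on zenodo.org) is inferred only from the example URL above; it is an assumption for illustration, not rcldf's actual validation, and the function name is hypothetical.

```python
from urllib.parse import urlparse


def is_zenodo_file_url(url):
    """True if the URL has the shape /records/<id>/files/<...>.zip on zenodo.org."""
    parsed = urlparse(url)
    # Split the path into its non-empty segments.
    parts = [p for p in parsed.path.split("/") if p]
    return (
        parsed.netloc == "zenodo.org"
        and len(parts) >= 4
        and parts[0] == "records"
        and parts[1].isdigit()
        and parts[2] == "files"
        and parts[-1].endswith(".zip")
    )


# A direct file link versus the record landing page.
print(is_zenodo_file_url(
    "https://zenodo.org/records/7740140/files/grambank/grambank-v1.0.zip"))
print(is_zenodo_file_url("https://zenodo.org/records/7740140"))
```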

Johannes Englisch · Answer 5 · Tue Nov 14 2023 17:26:02 GMT+0800 (China Standard Time)

Yeah, those links point to the Zenodo entry itself, which can have multiple downloadable files (though datasets by us usually only have one). I quickly made a list of files: file-list.csv

Hedvig Skirgård · Answer 6 · Tue Nov 14 2023 18:03:29 GMT+0800 (China Standard Time)

> Yeah, those links point to the Zenodo entry itself, which can have multiple downloadable files (though datasets by us usually only have one). I quickly made a list of files: file-list.csv

Yep. I don't know about @SimonGreenhill, but for me it'd be handy to have a list like that, pointing to the zip that Zenodo provides, together with the version etc. info from the previous table. Thanks @johenglisch!

As far as I can tell, R packages usually don't store information of this kind in CSV sheets. I think there are two options:

a) Combine the CSV files Johannes has suggested into one and render it as an rdata object that comes with the rcldf package.
b) Combine the CSV files Johannes has suggested into one, store it as a CSV somewhere, and encourage rcldf users to go fetch it there.

Simon J Greenhill · Answer 7 · Thu Dec 14 2023 05:17:04 GMT+0800 (China Standard Time)

I think this is well outside the scope of rcldf. A package to read CLDF files is not the place to catalogue all the CLDF files (it'll be a pain to update and maintain).

Hedvig Skirgård · Answer 8 · Thu Dec 14 2023 07:05:25 GMT+0800 (China Standard Time)

> I think this is well outside the scope of rcldf. A package to read CLDF files is not the place to catalogue all the CLDF files (it'll be a pain to update and maintain).

Ok!

Hedvig Skirgård · Answer 9 · Thu Dec 14 2023 07:06:06 GMT+0800 (China Standard Time)

@xrotwang @johenglisch would you still consider (b)? I'd very, very much appreciate if there was one stable place for a table of this kind.

Robert Forkel · Answer 10 · Fri Dec 15 2023 22:26:51 GMT+0800 (China Standard Time)

Why should the files be combined into one? If people download cldf data, they'd be experienced with joining tables, I'd hope.

Hedvig Skirgård · Answer 11 · Fri Dec 15 2023 23:19:54 GMT+0800 (China Standard Time)

> Why should the files be combined into one? If people download cldf data, they'd be experienced with joining tables, I'd hope.

I don't think it's crucial that the tables are joined. For me, it would just be great if there were a stable place with info on Zenodo and GitHub locations for CLDF datasets, along with version information. If you want to have multiple tables, do so; I'll join them when I need to. It's all worth it for a stable location with Zenodo record URLs :D
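
As a rough illustration of the join being discussed, here is a Python sketch using only the standard library. The column names (`ID`, `Name`, `Version`, `Contribution_ID`, `Download_URL`) and the sample rows are made up for the example; the real contributions.csv and file-list.csv may be laid out differently.

```python
import csv
import io

# Made-up sample rows; the real CSV files may use different column names.
CONTRIBUTIONS = "ID,Name,Version\ngrambank,Grambank,v1.0.3\n"
FILES = (
    "Contribution_ID,Download_URL\n"
    "grambank,https://zenodo.org/records/7844558/files/grambank/grambank-v1.0.3.zip\n"
)


def join_tables(contrib_text, files_text):
    """Left-join file rows onto contribution rows via the shared dataset ID."""
    # Index download URLs by dataset ID; one record can have several files.
    urls = {}
    for row in csv.DictReader(io.StringIO(files_text)):
        urls.setdefault(row["Contribution_ID"], []).append(row["Download_URL"])
    joined = []
    for row in csv.DictReader(io.StringIO(contrib_text)):
        # Datasets without a listed file are kept, with Download_URL set to None.
        for url in urls.get(row["ID"], [None]):
            joined.append({**row, "Download_URL": url})
    return joined


for record in join_tables(CONTRIBUTIONS, FILES):
    print(record["Name"], record["Version"], record["Download_URL"])
```

Keeping the two tables separate and joining on demand like this means neither file has to change shape when a record gains a second downloadable file.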

Hedvig Skirgård · Answer 12 · Sat Dec 16 2023 00:34:45 GMT+0800 (China Standard Time)

☁️ My dream ☁️ would be a place on https://cldf.clld.org/ , perhaps called "CLDF-dataset list" or something else equally informative and catchy where I can expect to find up-to-date tables that I can fetch straight from the command line as csv, tsv, json or something else simple and that in there somewhere is a list of Zenodo record URLs that I can feed to rcldf::cldf() and fetch CLDF-datasets ^^!

Robert Forkel · Answer 13 · Sat Dec 16 2023 00:49:21 GMT+0800 (China Standard Time)

Why not from https://raw.githubusercontent.com/cldf-datasets/clld_meta/master/cldf/contributions.csv ?
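
Fetching that file from a script could look roughly like the Python sketch below. Only the URL comes from this thread; the function names are hypothetical, and nothing is assumed about the file's columns beyond it having a header row. The location may also change while the repository is still being worked on.

```python
import csv
import io
import urllib.request

# URL as given in this thread.
CONTRIBUTIONS_URL = (
    "https://raw.githubusercontent.com/cldf-datasets/clld_meta/"
    "master/cldf/contributions.csv"
)


def parse_csv(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def fetch_contributions(url=CONTRIBUTIONS_URL, timeout=30):
    """Download the CSV and parse it; raises urllib.error.URLError on failure."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_csv(response.read().decode("utf-8"))


# Usage (needs network access):
#   rows = fetch_contributions()
#   print(len(rows), "rows")
```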

Robert Forkel · Answer 14 · Sat Dec 16 2023 00:49:55 GMT+0800 (China Standard Time)

Oh, and "up-to-date" is a big ask.

Hedvig Skirgård · Answer 15 · Sat Dec 16 2023 00:50:58 GMT+0800 (China Standard Time)

> Oh, and "up-to-date" is a big ask.

Well, as up-to-date as what @johenglisch is building, is that an okay ask? I was thinking the easiest was to make it a regular product that is derived from what Johannes is building.

Robert Forkel · Answer 16 · Sat Dec 16 2023 00:52:11 GMT+0800 (China Standard Time)

But fetching the product from where it's built seems to be the easiest solution.

Hedvig Skirgård · Answer 17 · Sat Dec 16 2023 00:53:52 GMT+0800 (China Standard Time)

> Why not from https://raw.githubusercontent.com/cldf-datasets/clld_meta/master/cldf/contributions.csv ?

Well, "why not" is because I didn't know this existed ^^!
What is the plan and intention for that repos? I'm getting a 404 when I try to go to https://github.com/cldf-datasets/clld_meta

Hedvig Skirgård · Answer 18 · Sat Dec 16 2023 00:54:20 GMT+0800 (China Standard Time)

> But fetching the product from where it's built seems to be the easiest solution.

I agree! I just don't know where that is, or whether it's stable where it is now.

Hedvig Skirgård · Answer 19 · Sat Dec 16 2023 00:55:58 GMT+0800 (China Standard Time)

Raw GitHub links seem a little bit sub-optimal to me personally, but I'll go with it if that's the plan!

Robert Forkel · Answer 20 · Sat Dec 16 2023 00:56:05 GMT+0800 (China Standard Time)

It's still private. @johenglisch is it ok to make the repos public?

Robert Forkel · Answer 21 · Sat Dec 16 2023 01:00:27 GMT+0800 (China Standard Time)

The plan is not to make this the "official" list of all CLDF datasets or something like this. If a dataset you're interested in is in there, then it'll work for you. If it isn't you might file an issue, but we don't want to turn this into the only place to look for CLDF datasets.

Hedvig Skirgård · Answer 22 · Sat Dec 16 2023 01:31:15 GMT+0800 (China Standard Time)

> The plan is not to make this the "official" list of all CLDF datasets or something like this. If a dataset you're interested in is in there, then it'll work for you. If it isn't you might file an issue, but we don't want to turn this into the only place to look for CLDF datasets.

Okay, that is good to know. Always good to know the plan and the imagined audience.

A table that is reasonably maintained over time (updated maybe at least once a year?) and contains Zenodo links to at least the major datasets would be very valuable to me and, I think, to many others. If things can be added on request, that's great.

Robert Forkel · Answer 23 · Sat Dec 16 2023 01:35:14 GMT+0800 (China Standard Time)

I don't know. I guess most people will be interested in a handful of datasets and maintaining a list of links for these yourself seems simpler to me.

Hedvig Skirgård · Answer 24 · Sat Dec 16 2023 01:36:21 GMT+0800 (China Standard Time)

> I don't know. I guess most people will be interested in a handful of datasets and maintaining a list of links for these yourself seems simpler to me.

If @johenglisch is making something that can produce a table like this anyway, please please please could there please bitte be a table somewhere that is reasonably stable? Please?

Hedvig Skirgård · Answer 25 · Sat Dec 16 2023 01:38:58 GMT+0800 (China Standard Time)

I don't think I'm alone in finding a table like that useful. I think the following people would use it: Angela Chira, Ezequiel Koile, Olena Shcherbakova and Erich Round.

Robert Forkel · Answer 26 · Sat Dec 16 2023 01:41:34 GMT+0800 (China Standard Time)

Sure, I asked @johenglisch above, whether the repos can be made public.

Johannes Englisch · Answer 27 · Mon Jan 08 2024 18:18:22 GMT+0800 (China Standard Time)

Hi, I'm back. Happy new year and all that! o/

> is it ok to make the repos public?

I don't see why not: https://github.com/cldf-datasets/clld_meta

Just keep in mind that I'm still working on it, so things might change here and there.

> What is the plan and intention for that repos?

It's the database catalogue I was talking about before; just a way to get a bird's eye view on what kind of data is out there (and maybe what kind of data could be missing) and what glottocodes the datasets refer to.

Caveats:

This list is never going to be exhaustive.
This list will only be as up-to-date as the last guy who regenerated the database.

But on the plus side it's just a regular old cldfbench, so anyone can run the code and make their own more up-to-date table (as long as Zenodo's API doesn't break again (<_<)" ).

> we don't want to turn this into the only place to look for CLDF datasets

This. I'm kinda afraid of what I call the WALS Effect: During my Bachelor's someone told us about WALS and students started treating it kind of like an authoritative source of truth; like it's a complete lexicon of all the languages of the world. This database catalogue has potential to cause the same effect.

I don't really know how to combat that other than plastering disclaimers all over the thing…

> could there please bitte be a table somewhere that is reasonable stable

Hm, my gut feeling tells me it might not be a good idea to simply create a ‘pretty link’ that people hard-code into their programs… We could release the catalogue to Zenodo – that would give you a DOI to whatever the latest version of the data is. That at least feels more stable than a long link into Microsoft's proprietary git hosting website.

There's likely gonna be a (very) simple web app to browse the meta data. But that will also just link to GitHub/Zenodo instead of providing direct downloads.

> updated maybe at least once a year

Well, there's not a whole lot of years left before the question comes up who will have the time and/or funding to do the updates.

Hedvig Skirgård · Answer 28 · Mon Jan 08 2024 19:31:37 GMT+0800 (China Standard Time)

> Hi, I'm back. Happy new year and all that! o/

> > is it ok to make the repos public?

> I don't see why not: https://github.com/cldf-datasets/clld_meta

> Just keep in mind that I'm still working on it, so things might change here and there.

> > What is the plan and intention for that repos?

> It's the database catalogue I was talking about before; just a way to get a bird's eye view on what kind of data is out there (and maybe what kind of data could be missing) and what glottocodes the datasets refer to.

> Caveats:

> This list is never going to be exhaustive.

> This list will only be as up-to-date as the last guy who regenerated the data base.

Lovely stuff! As up-to-date as this will be PLENTY indeed!!!

> But on the plus side it's just a regular old cldfbench, so anyone can run the code and make their own more up-to-date table (as long as Zenodo's API doesn't break again (<_<)" ).

> > we don't want to turn this into the only place to look for CLDF datasets

> This. I'm kinda afraid of what I call the WALS Effect: During my Bachelor's someone told us about WALS and students started treating it kind of like an authoritative source of truth; like it's a complete lexicon of all the languages of the world. This database catalogue has potential to cause the same effect.

I think you're letting this fear get in the way of a useful tool. Caveats and disclaimers will be great; I think you needn't worry too much about this.

> I don't really know how to combat that other than plastering disclaimers all over the thing…

> > could there please bitte be a table somewhere that is reasonable stable

> Hm, my gut feeling tells me it might not be a good idea to simply create a ‘pretty link’ that people hard-code into their programs… We could release the catalogue to Zenodo – that would give you a DOI to whatever the latest version of the data is. That at least feels more stable than a long link into Microsoft's proprietary git hosting website.

I would also prefer a Zenodo-released dataset, for sure. Versioning is always better, and Zenodo is more reliable than GitHub, as you say.

> There's likely gonna be a (very) simple web app to browse the meta data. But that will also just link to GitHub/Zenodo instead of providing direct downloads.

> > updated maybe at least once a year

> Well, there's not a whole lot of years left before the question comes up who will have the time and/or funding to do the updates.

That seems like a question for RG at one of your programmers' meetings.

For now, all I can say is that, as a CLDF end user, a table of this kind would be MARVELOUS.

Hedvig Skirgård · Answer 29 · Fri Mar 15 2024 22:41:14 GMT+0800 (China Standard Time)

@johenglisch and @xrotwang say that clld_meta will not include glottolog-cldf.

@SimonGreenhill, can rcldf please include a vector, table, or whatever you want, of just the glottolog-cldf versions etc.?

@xrotwang says:

> @HedvigS I think it's reasonable to not include a whole catalog of DOIs of CLDF datasets in rcldf. But again, Glottolog is different. It's arguably a component of the CLDF spec itself in that it provides the catalog of valid Glottocodes. So including a way to retrieve this catalog in rcldf wouldn't seem totally inappropriate. OTOH figuring out this DOI and supplying it to rcldf at runtime doesn't seem too much to ask users, I'd think.

Robert Forkel · Answer 30 · Fri Mar 15 2024 22:45:50 GMT+0800 (China Standard Time)

> include a vector/table whatever you want of just glottolog-cldf versions

Listing all versions of glottolog-cldf may not be optimal - since it basically couples Glottolog versioning and rcldf versioning. You'd need to release a new version of rcldf simply to make a new version of Glottolog "show up" and be accessible in the same way the others are. I was talking about glottolog-cldf's concept DOI. Retrieving all versions for this concept DOI is possible via Zenodo's search API.

Robert Forkel · Answer 31 · Fri Mar 15 2024 23:05:24 GMT+0800 (China Standard Time)

FWIW, here's how to do that:

$ curl "https://zenodo.org/api/records?q=conceptdoi%3A%2210.5281%2Fzenodo.3260727%22&allversions=true" | jq '.hits.hits[] | {doi: .doi, version: .metadata.version}'
{
  "doi": "10.5281/zenodo.10804582",
  "version": "v5.0"
}
{
  "doi": "10.5281/zenodo.6578300",
  "version": "v4.6"
}
{
  "doi": "10.5281/zenodo.4762034",
  "version": "v4.4"
}
...
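
The same lookup can be sketched in Python with only the standard library. The endpoint, the query string, and the `hits.hits[].doi` / `metadata.version` fields are taken from the curl and jq call above; the function names are hypothetical.

```python
import json
from urllib.parse import urlencode

API = "https://zenodo.org/api/records"


def versions_query_url(concept_doi):
    """Build the search URL that lists all versions under a concept DOI."""
    query = {"q": f'conceptdoi:"{concept_doi}"', "allversions": "true"}
    return f"{API}?{urlencode(query)}"


def extract_versions(response_text):
    """Mirror the jq filter: one {doi, version} dict per search hit."""
    hits = json.loads(response_text)["hits"]["hits"]
    return [
        {
            "doi": hit.get("doi"),
            "version": hit.get("metadata", {}).get("version"),
        }
        for hit in hits
    ]


# Prints the same URL that is passed to curl above.
print(versions_query_url("10.5281/zenodo.3260727"))
```

To run it against the live API, the URL can be opened with `urllib.request.urlopen` and the decoded body passed to `extract_versions`. The search API may paginate, so a single response is not guaranteed to contain every version.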

Hedvig Skirgård · Answer 32 · Wed Mar 20 2024 03:55:06 GMT+0800 (China Standard Time)

Never mind, let's use clld_meta for everything but Glottolog and count on people finding the Glottolog pointers themselves.