SimonGreenhill / rcldf

rcldf - The R library for reading CLDF files

table of zenodo and GitHub locations for versions of datasets

HedvigS opened this issue · comments

I think it may be useful to users if the rcldf package came with a data object containing pointers to the major CLDF datasets that DLCE maintains, pointers that can be fed to rcldf::download().

What do you think @SimonGreenhill @xrotwang ?

dataset version Zenodo_url GitHub_url
grambank 1.0.3 https://zenodo.org/records/7844558/files/grambank/grambank-v1.0.3.zip https://github.com/grambank/grambank/tree/v1.0.3
glottolog-cldf 4.8 https://zenodo.org/records/8131084/files/glottolog/glottolog-v4.8.zip https://github.com/glottolog/glottolog-cldf/tree/v4.8

If anything, the pointers should be DOIs, I think. Also, such a list might be of wider applicability than just R, so a data object seems like an unnecessary limitation. Why not a CSV file? Or a CLDF dataset of datasets? @johenglisch is working on a catalog of CLDF datasets, so he may have an idea about this.

DOIs should be included too, that's a good point.

I was thinking practically: for users of the rcldf package, Zenodo download URLs may be very handy.

Yes, I would like to know what @johenglisch thinks too. I couldn't tag him originally, I think he's not "in" this repo.

CSV would be cool too, but I believe best practice for R packages is rda files. I would prefer CSV too, since it'd make the history more transparent, but I don't think that's what is usually done. @SimonGreenhill would know more.

Tadaaaa~ Here's a list of datasets the catalogue knows about currently: contributions.csv

I think adding direct download links to this shouldn't be too hard.

NICE!

Those Zenodo links though, they need to look something like "https://zenodo.org/records/7740140/files/grambank/grambank-v1.0.zip" for rcldf::download() (and therefore also rcldf::cldf()) to work. What we're currently expecting is the link that's underneath the button "download" on the webpage for the record.
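A minimal sketch of the distinction made above, assuming the only thing to check is the URL shape seen in this thread (`/records/<id>/files/<name>.zip`). The helper name `is_direct_zenodo_file` is made up for illustration and is not part of rcldf; it does not verify that the file actually exists on Zenodo.

```python
from urllib.parse import urlparse


def is_direct_zenodo_file(url: str) -> bool:
    """Return True if url has the shape of a Zenodo file download link
    (/records/<id>/files/<name>.zip) rather than a record landing page.

    Hypothetical helper: it only inspects the URL, based on the examples
    quoted in this thread.
    """
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    # Expected shape: ["records", "<numeric id>", "files", "<name>", ...]
    return (
        parsed.netloc == "zenodo.org"
        and len(parts) >= 4
        and parts[0] == "records"
        and parts[1].isdigit()
        and parts[2] == "files"
        and parts[-1].endswith(".zip")
    )
```

With this check, the link under the "download" button passes, while a link to the record page itself (for example `https://zenodo.org/records/7740140`) does not.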

Yeah, those links point to the Zenodo entry itself, which can have multiple downloadable files (though datasets by us usually only have one). I quickly made a list of files: file-list.csv

Yep. I don't know about @SimonGreenhill , but for me it'd be handy to have a list like that that points to the zip that Zenodo provides with the version etc info from the previous table. Thanks @johenglisch !

As far as I can tell, R packages usually don't store information of this kind in CSV sheets. I think there are two options:

a) Combine the csv-files Johannes has suggested into one and render it as an rdata object that comes with the rcldf package
b) Combine the csv-files Johannes has suggested into one and store it as a csv somewhere and encourage rcldf users to go fetch it there.

I think this is well outside the scope: a package to read CLDF files is not the place to catalogue all the CLDF files (it'll be a pain to update and maintain).

Ok!

@xrotwang @johenglisch would you still consider (b)? I'd very, very much appreciate if there was one stable place for a table of this kind.

Why should the files be combined into one? If people download cldf data, they'd be experienced with joining tables, I'd hope.

I don't think it's crucial that the tables are joined. For me, it would just be great if there were a stable place with info on Zenodo and GitHub locations for CLDF datasets, along with version information. If you want to have multiple tables, do so; I'll join them when I need to. It's all worth it for a stable location with Zenodo record URLs :D
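For readers wondering what "joining when needed" could look like, here is a rough sketch using only the Python standard library. The column names (`record_id`, `dataset`, `version`, `file_url`) are invented for illustration; the real contributions.csv and file-list.csv may be laid out differently. The sample row reuses the grambank values from the table at the top of this issue.

```python
import csv
import io

# Made-up stand-ins for the two tables; real column names may differ.
CONTRIBUTIONS = """record_id,dataset,version
7844558,grambank,1.0.3
"""
FILES = """record_id,file_url
7844558,https://zenodo.org/records/7844558/files/grambank/grambank-v1.0.3.zip
"""


def join_on(left_csv: str, right_csv: str, key: str) -> list[dict]:
    """Inner-join two CSV texts on a shared key column."""
    # Index the right-hand table by key; one key may map to several rows,
    # since a Zenodo record can hold more than one file.
    right_rows: dict[str, list[dict]] = {}
    for row in csv.DictReader(io.StringIO(right_csv)):
        right_rows.setdefault(row[key], []).append(row)

    joined = []
    for row in csv.DictReader(io.StringIO(left_csv)):
        for match in right_rows.get(row[key], []):
            joined.append({**row, **match})
    return joined


rows = join_on(CONTRIBUTIONS, FILES, "record_id")
```

Each joined row then carries the dataset name, the version, and a direct file link in one place.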

☁️ My dream ☁️ would be a place on https://cldf.clld.org/ , perhaps called "CLDF-dataset list" or something else equally informative and catchy where I can expect to find up-to-date tables that I can fetch straight from the command line as csv, tsv, json or something else simple and that in there somewhere is a list of Zenodo record URLs that I can feed to rcldf::cldf() and fetch CLDF-datasets ^^!

Oh, and "up-to-date" is a big ask.

Well, as up-to-date as what @johenglisch is building, is that an okay ask? I was thinking the easiest was to make it a regular product that is derived from what Johannes is building.

But fetching the product from where it's built seems to be the easiest solution.

Why not from https://raw.githubusercontent.com/cldf-datasets/clld_meta/master/cldf/contributions.csv ?

Well, "why not" is because I didn't know this existed ^^!
What is the plan and intention for that repo? I'm getting a 404 when I try to go to https://github.com/cldf-datasets/clld_meta

But fetching the product from where it's built seems to be the easiest solution.

I agree! I just don't know where that is or if it's stable where it is now

Raw GitHub links seem a little bit sub-optimal to me personally, but I'll go with it if that's the plan!

It's still private. @johenglisch is it ok to make the repos public?

The plan is not to make this the "official" list of all CLDF datasets or something like this. If a dataset you're interested in is in there, then it'll work for you. If it isn't you might file an issue, but we don't want to turn this into the only place to look for CLDF datasets.

Okay, that is good to know. Always good to know the plan and the imagined audience.

A table that is reasonably maintained over time (updated maybe at least once a year?) and contains Zenodo links to at least the major datasets would be very valuable to me, and I think to many others. If things can be added on request, that's great.

I don't know. I guess most people will be interested in a handful of datasets and maintaining a list of links for these yourself seems simpler to me.

If @johenglisch is making something that can produce a table like this anyway, please please please could there please bitte be a table somewhere that is reasonably stable? Please?

I don't think I'm alone in finding a table like that useful. I think the following people would use it: Angela Chira, Ezequiel Koile, Olena Shcherbakova and Erich Round.

Sure, I asked @johenglisch above, whether the repos can be made public.

Hi, I'm back. Happy new year and all that! o/

is it ok to make the repos public?

I don't see why not: https://github.com/cldf-datasets/clld_meta

Just keep in mind that I'm still working on it, so things might change here and there.

What is the plan and intention for that repos?

It's the database catalogue I was talking about before; just a way to get a bird's eye view on what kind of data is out there (and maybe what kind of data could be missing) and what glottocodes the datasets refer to.

Caveats:

  • This list is never going to be exhaustive.
  • This list will only be as up-to-date as the last guy who regenerated the database.

But on the plus side it's just a regular old cldfbench, so anyone can run the code and make their own more up-to-date table (as long as Zenodo's API doesn't break again (<_<)" ).

we don't want to turn this into the only place to look for CLDF datasets

This. I'm kinda afraid of what I call the WALS Effect: During my Bachelor's someone told us about WALS and students started treating it kind of like an authoritative source of truth; like it's a complete lexicon of all the languages of the world. This database catalogue has potential to cause the same effect.

I don't really know how to combat that other than plastering disclaimers all over the thing…

could there please bitte be a table somewhere that is reasonably stable

Hm, my gut feeling tells me it might not be a good idea to simply create a ‘pretty link’ that people hard-code into their programs… We could release the catalogue to Zenodo – that would give you a DOI to whatever the latest version of the data is. That at least feels more stable than a long link into Microsoft's proprietary git hosting website.

There's likely gonna be a (very) simple web app to browse the meta data. But that will also just link to GitHub/Zenodo instead of providing direct downloads.

updated maybe at least once a year

Well, there's not a whole lot of years left before the question comes up who will have the time and/or funding to do the updates.

Caveats:

  • This list is never going to be exhaustive.
  • This list will only be as up-to-date as the last guy who regenerated the database.

Lovely stuff! As up-to-date as this will be PLENTY indeed!!!

But on the plus side it's just a regular old cldfbench, so anyone can run the code and make their own more up-to-date table (as long as Zenodo's API doesn't break again (<_<)" ).

we don't want to turn this into the only place to look for CLDF datasets

This. I'm kinda afraid of what I call the WALS Effect: During my Bachelor's someone told us about WALS and students started treating it kind of like an authoritative source of truth; like it's a complete lexicon of all the languages of the world. This database catalogue has potential to cause the same effect.

I think you're letting this fear get in the way of a useful tool. Caveats and disclaimers will be great, I think you needn't worry too much about this.

I don't really know how to combat that other than plastering disclaimers all over the thing…

could there please bitte be a table somewhere that is reasonably stable

Hm, my gut feeling tells me it might not be a good idea to simply create a ‘pretty link’ that people hard-code into their programs… We could release the catalogue to Zenodo – that would give you a DOI to whatever the latest version of the data is. That at least feels more stable than a long link into Microsoft's proprietary git hosting website.

I would also prefer a Zenodo released dataset, for sure. Versioning is always better, and Zenodo is more reliable than GitHub like you say.

There's likely gonna be a (very) simple web app to browse the meta data. But that will also just link to GitHub/Zenodo instead of providing direct downloads.

updated maybe at least once a year

Well, there's not a whole lot of years left before the question comes up who will have the time and/or funding to do the updates.

That seems like a question for RG at one of your programmers' meetings.

For now, all I can say is that as a CLDF-end user - a table of this kind would be MARVELOUS.

@johenglisch and @xrotwang say that clld_meta will not include glottolog-cldf.

@SimonGreenhill can rcldf please include a vector/table whatever you want of just glottolog-cldf versions etc?

@xrotwang says:

@HedvigS I think it's reasonable to not include a whole catalog of DOIs of CLDF datasets in rcldf. But again, Glottolog is different. It's arguably a component of the CLDF spec itself in that it provides the catalog of valid Glottocodes. So including a way to retrieve this catalog in rcldf wouldn't seem totally inappropriate. OTOH figuring out this DOI and supplying it to rcldf at runtime doesn't seem too much to ask users, I'd think.

include a vector/table whatever you want of just glottolog-cldf versions

Listing all versions of glottolog-cldf may not be optimal - since it basically couples Glottolog versioning and rcldf versioning. You'd need to release a new version of rcldf simply to make a new version of Glottolog "show up" and be accessible in the same way the others are. I was talking about glottolog-cldf's concept DOI. Retrieving all versions for this concept DOI is possible via Zenodo's search API.

FWIW, here's how to do that:

$ curl "https://zenodo.org/api/records?q=conceptdoi%3A%2210.5281%2Fzenodo.3260727%22&allversions=true" | jq '.hits.hits[] | {doi: .doi, version: .metadata.version}'
{
  "doi": "10.5281/zenodo.10804582",
  "version": "v5.0"
}
{
  "doi": "10.5281/zenodo.6578300",
  "version": "v4.6"
}
{
  "doi": "10.5281/zenodo.4762034",
  "version": "v4.4"
}
...
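For anyone who would rather do this from a script than from curl and jq, here is a rough Python equivalent using only the standard library. It assumes the response has the shape implied by the jq filter above (`.hits.hits[]`, with `.doi` and `.metadata.version` on each hit); the function names are invented for this sketch, and `fetch_versions` needs network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ZENODO_API = "https://zenodo.org/api/records"


def versions_query_url(concept_doi: str) -> str:
    """Build the same search URL as the curl call above."""
    query = urlencode({"q": f'conceptdoi:"{concept_doi}"', "allversions": "true"})
    return f"{ZENODO_API}?{query}"


def extract_versions(payload: dict) -> list[dict]:
    """Mirror the jq filter: one {doi, version} dict per hit."""
    return [
        {"doi": hit["doi"], "version": hit["metadata"]["version"]}
        for hit in payload["hits"]["hits"]
    ]


def fetch_versions(concept_doi: str) -> list[dict]:
    """Query Zenodo and list all versions for a concept DOI (needs network)."""
    with urlopen(versions_query_url(concept_doi)) as response:
        return extract_versions(json.load(response))
```

Calling `fetch_versions("10.5281/zenodo.3260727")` should give the same DOI/version pairs as the jq output, as long as Zenodo's API keeps that response shape.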

Never mind, let's use clld_meta for everything but Glottolog and count on people finding the Glottolog pointers themselves.