data(allen) no longer works

Question

flying-sheep opened this issue 4 years ago

I specifically relied on this package to get real data for testing without internet access (once the package is installed). I can’t do that anymore!

Are there alternative packages with scRNAseq datasets that don’t download data at runtime?

Aaron Lun · Answer 1 · Thu Aug 20 2020 05:51:40 GMT+0800 (China Standard Time)

We switched over to the ExperimentHub model for data distribution in early 2019. This was the only sustainable approach for continued development; you can see how this package has grown from 3 small datasets to almost 50 datasets of varying sizes. Under the old model, the initial installation would have to pull down a few GB of data. All Bioconductor single-cell data packages that I manage or am aware of have migrated to ExperimentHub or were developed with that in mind.

However, if you're willing to do a little bit of work after package installation, there is a way. After you install the package, you can run the getters for the desired datasets, e.g., scRNAseq::ZeiselBrainData(). This will pull down and cache the objects locally so that subsequent accesses do not require internet access. You can then set EXPERIMENT_HUB_LOCAL=TRUE to ensure that subsequent accesses don't even try to look online and only use the local cache. In container contexts, it is usually necessary to also set EXPERIMENT_HUB_CACHE to cache the files in some persistent location inside the container (and EXPERIMENT_HUB_ASK=FALSE to avoid being asked about whether the cache directory should be created).
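As a rough sketch of that workflow (not an official recipe): the cache path below is a hypothetical example, and the sketch assumes the environment variables are read when ExperimentHub is loaded, so they are set before any package is touched and the offline step runs in a fresh R session.

```r
# Step 1 (online, run once after installing scRNAseq): prime the local cache.
# "/opt/ehub-cache" is a hypothetical path; use any persistent location.
Sys.setenv(
  EXPERIMENT_HUB_CACHE = "/opt/ehub-cache",
  EXPERIMENT_HUB_ASK = "FALSE"  # don't prompt about creating the cache directory
)
sce <- scRNAseq::ZeiselBrainData()  # downloads the dataset and caches it locally

# Step 2 (offline, in a fresh R session): read only from the local cache.
Sys.setenv(
  EXPERIMENT_HUB_CACHE = "/opt/ehub-cache",
  EXPERIMENT_HUB_LOCAL = "TRUE"  # don't try to look online
)
sce <- scRNAseq::ZeiselBrainData()
```

In a CI or container setup, the same variables could instead be exported in the job or image environment, with step 1 run during the build so the cache is already populated when the tests start.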

Philipp A. · Answer 2 · Thu Aug 20 2020 05:58:07 GMT+0800 (China Standard Time)

anndata2ri’s test infrastructure is complex enough already; I don’t want to figure out how to convince Travis to cache that. But thank you!

To avoid breaking backward compatibility, I’d have left the three initial datasets in there, accessible via data(), and added downloader functions for the others (and, for completeness’ sake, also functions for the three OGs that simply invoke data()).

Aaron Lun · Answer 3 · Thu Aug 20 2020 06:22:34 GMT+0800 (China Standard Time)

I'm pretty sure we had a deprecation period for those three built-in datasets during the last Bioconductor release cycle, along with pointers to the new functions that would replace them; I don't recall getting any noise about it at the time.

My various CI/CD steps (GitHub Actions mostly, but sometimes Travis) for several GitHub-hosted R packages routinely use scRNAseq without any extra work or problems, aside from the occasional connectivity issue.

Philipp A. · Answer 4 · Thu Aug 20 2020 07:04:21 GMT+0800 (China Standard Time)

Hm, then I’ll try doing just that. It’ll make things slower though, as the package cache doesn’t conveniently cache the data.