LTLA / scRNAseq

Clone of the Bioconductor repository for the scRNAseq package.

Home Page: http://bioconductor.org/packages/devel/data/experiment/html/scRNAseq.html

Seurat versions?

jr-leary7 opened this issue

If I converted the datasets in this package to Seurat format, with all metadata retained in both versions, would the authors accept a pull request? I know the community is somewhat split over SingleCellExperiment vs. Seurat, but my lab and many others use Seurat, and whenever I pull a dataset from this package I need to convert it to Seurat format. I think it'd be great for the package to include a compressed Seurat object for each experiment, and I'm willing to do the legwork of converting the files myself.
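For context, the conversion I do for each dataset looks roughly like the sketch below; the exact `as.Seurat()` arguments depend on which assays a given SingleCellExperiment carries, so this is only an illustration.

```r
# Rough sketch of the per-dataset conversion (assumes Seurat >= 3).
# logcounts are computed first because as.Seurat() looks for a "logcounts" assay.
library(scRNAseq)
library(scater)
library(Seurat)

sce <- ZeiselBrainData()                 # a SingleCellExperiment from this package
sce <- logNormCounts(sce)                # populate the "logcounts" assay
seu <- as.Seurat(sce, counts = "counts", data = "logcounts")
# colData(sce) ends up as cell-level metadata in the Seurat object (seu[[]])
```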

I appreciate the offer, but I'm afraid that I won't accept a dependency on Seurat in this package. This isn't a reflection of your conversion abilities; it's just that I do not consider Seurat to be sufficiently reliable as a dependency. Indeed, updating serialized instances of S4 classes is painful enough when I have full control over the class definition.

To be clear, I'm not against interoperability with other analysis frameworks. However, it should occur via dedicated adapter packages (e.g., https://github.com/theislab/zellkonverter, https://github.com/cellgeni/sceasy) so that any patches can be implemented in one place and rolled out to all affected packages at once.

Of course, other authors are willing to accept higher levels of risk, so you will see various SCE-based packages in Bioconductor also support Seurat inputs (e.g., dittoSeq). Good for them.

Fair enough. I'm curious as to why you don't consider Seurat to be sufficiently reliable? Are there issues with the package I'm unaware of that I should know about?

These are mostly considerations from a developer perspective.

Wheel reinvention

Bioconductor has pretty well-established classes for representing genomics data, stretching all the way back to the ExpressionSet class used for microarray data back in 2004. monocle and the older version of scater derived their classes from ExpressionSets, meaning that they could directly access a large body of statistical methods. (Not that much of the bulk transcriptomics stuff ended up being of great use in single-cell analysis, but we didn't know that at the time.)

Now, the ExpressionSet isn't the easiest to use and it doesn't support sparse matrices, so it was officially superseded by the SummarizedExperiment class in 2015. I'll cut Seurat version 2 some slack here, because it came out around or before then (2014, IIRC). So I can understand why they home-brewed their own class. I can also understand why Rahul didn't want to change the package to use a SingleCellExperiment when Davide and I suggested it to him in 2017; it would break back-compatibility and affect end-users and all that.

But then they went ahead and made breaking changes to version 3 anyway (see my next point). If they were going to do that, they should have switched to a class that would play nice with other packages - there are over 300 Bioconductor packages that use SummarizedExperiments! - and then users could use Seurat and Bioconductor functions without having the hassle of conversion. Users would only have to learn how to work with a single data structure and they would immediately understand how to use a large portion of the Bioconductor software ecosystem. The Seurat developers would also be spared the overhead of defining, testing and updating the class, and if they needed more than the class had to offer, then they could either extend the class or help further its development.
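To make the inheritance point concrete, here is a minimal sketch with simulated data (nothing Seurat-specific): a SingleCellExperiment is a SummarizedExperiment, so every function written against the parent class works on it unchanged.

```r
# Minimal sketch: SingleCellExperiment inherits from SummarizedExperiment.
library(SingleCellExperiment)

counts <- matrix(rpois(2000, lambda = 5), nrow = 100,
                 dimnames = list(paste0("gene", 1:100), paste0("cell", 1:20)))
sce <- SingleCellExperiment(assays = list(counts = counts))

is(sce, "SummarizedExperiment")   # TRUE: all of the parent's generics apply
assay(sce, "counts")[1:3, 1:3]    # standard accessors shared across packages
colData(sce)                      # per-cell metadata, as in any SummarizedExperiment
```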

It's an obvious win-win for all parties but clearly this did not happen. Indeed, there must be something in the water in New York, because engineering decisions like this seem to be par for the course - for example, I recently noticed https://github.com/mojaveazure/seurat-disk that uses HDF5 for storing large datasets. Seriously? Bioconductor has had HDF5 matrix abstractions for years. And we'll soon have TileDB abstractions for native sparse support. It's a shame that Seurat users won't be able to take advantage of these developments.
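For reference, the existing abstraction looks something like this from the Bioconductor side (a sketch only; the file name and HDF5 dataset name are hypothetical):

```r
# Sketch of Bioconductor's HDF5-backed matrix abstraction.
# "counts.h5" and its "counts" dataset are hypothetical.
library(HDF5Array)
library(SingleCellExperiment)

mat <- HDF5Array("counts.h5", name = "counts")       # data stays on disk
sce <- SingleCellExperiment(assays = list(counts = mat))

# Downstream code sees an ordinary matrix-like object; operations are
# delayed and evaluated block by block rather than loading everything into RAM.
dim(sce)
colSums(assay(sce, "counts"))
```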

Breaking changes

The class definition change from Seurat version 2 to 3 is probably the prime example. While class definitions change all the time, we in Bioconductor are generally very careful to ensure that user code does not break. See, for example, the code inside the updateObject() function in SingleCellExperiment; this ensures that users working with serialized instances of the old class definition can continue to do so without any change to their code, because the updating is handled quietly within the SingleCellExperiment functions.
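The pattern looks roughly like this toy sketch (the class, slots and migration logic here are hypothetical, not the actual SingleCellExperiment code):

```r
# Toy sketch of the updateObject() pattern: serialized instances of an old
# class definition are silently migrated inside the package's own functions.
library(BiocGenerics)

setClass("Thing", representation(values = "numeric", version = "integer"))

setMethod("updateObject", "Thing", function(object, ..., verbose = FALSE) {
    # Hypothetical migration: instances serialized before the 'version'
    # slot existed get a sensible default filled in.
    if (!.hasSlot(object, "version")) {
        object@version <- 2L
    }
    object
})

setGeneric("values", function(x) standardGeneric("values"))
setMethod("values", "Thing", function(x) {
    x <- updateObject(x)   # every accessor quietly upgrades old objects first
    x@values
})
```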

So, a class definition change can be properly managed to avoid any impact on end-users. In Seurat's case, this did not occur. From what I recall, the fundamental problem was that Seurat encouraged users to access the class slots directly with @, a very big no-no for S4 programming. They should be using getter and setter functions so that, if the class definition does change, the functions are responsible for silently remapping the user's request to the new class definition. (The various evolutions of SingleCellExperiment::reducedDim over the years are a good example.) However, by allowing users to access slots with @, the developers painted themselves into a corner where any change would be breaking.

To be fair, I can understand why the use of @ happened in the first place - it's a common rookie error when starting out with S4 development. But the correct remedy would have been to introduce getters and setters first, give them about a year to propagate into end-users' code, and then update the class definition. (Maybe this did happen, I don't know, but I certainly don't recall seeing any getters/setters back in 2017.) Or even better, if Seurat had been part of Bioconductor, this would have been caught during package review and there wouldn't have been any tears at all.
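In code, the distinction looks something like this (a toy example with a hypothetical class, not Seurat's actual internals):

```r
# Toy example: getters/setters insulate user code from changes to the slots.
setClass("Dataset", representation(.counts = "matrix"))

setGeneric("counts", function(x) standardGeneric("counts"))
setMethod("counts", "Dataset", function(x) x@.counts)

setGeneric("counts<-", function(x, value) standardGeneric("counts<-"))
setMethod("counts<-", "Dataset", function(x, value) {
    x@.counts <- value
    validObject(x)
    x
})

d <- new("Dataset", .counts = matrix(0, 2, 2))
counts(d)        # users call this...
# d@.counts      # ...not this; a later slot rename would break it.
```

If the `.counts` slot is later renamed, only the two methods above need to change; scripts that call `counts()` keep working.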

One can only hope that the developers have learnt from this episode, but I understand that Seurat continues to chop and change across minor version numbers without any deprecation period. My only first-hand experience is with the CCA functions that just disappeared upon package update (or got renamed, I didn't bother checking). This is annoying for end-users, but only mildly so, as all you have to do is revert the package update and keep working on the old version.

However, it is an absolute heart-stopper for package developers. The CRAN/Bioconductor package model expects that, at any point in time, all packages are compatible with each other. This means that if Seurat introduces breaking changes, the onus is on its downstream clients to update promptly; client developers can't just say that their package only supports version 2 of Seurat. (In contrast, Python does allow these <= version requirements, but those have their own problems - namely that users may never be able to use two packages that they want in the same Python environment because the versions of those packages' dependencies are incompatible.)

I just don't need that stress in my life where I wake up one morning and a bunch of my packages are broken because of changes to Seurat and need immediate fixing lest they be kicked off CRAN or Bioconductor (see next point). In fact, Bioconductor protects me against that by providing a release and a development environment. The release environment is highly stable for a period of 6 months and intended for end-users who just want to get stuff done. The devel environment allows developers to play around with things that may or may not break downstream dependencies; and if something does break, it's not an issue, because the devel packages are a separate stream from the release packages and we can fix problems at our leisure. Of course, this only works if the packages being changed are themselves in Bioconductor, and this is not the case for Seurat.
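In practice, opting into one stream or the other is just a one-liner with BiocManager (the version numbers here are illustrative):

```r
# Sketch: choosing between the Bioconductor release and devel streams.
install.packages("BiocManager")

BiocManager::install(version = "3.12")    # pin to a specific 6-month release
BiocManager::install(version = "devel")   # or opt into the devel stream
BiocManager::valid()                      # check all packages come from one consistent stream
```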

Getting dropped from CRAN

Earlier this year, Seurat disappeared from CRAN for a few days. I don't even know how they managed to do that. I don't have much experience with the CRAN package submission system, but I understand that you only get booted off if you've been failing their checks for some time or if you're doing something Very Bad. I mean, come on, I'm one guy and I maintain close to 20 Bioconductor packages, so it's not that hard to keep everything together.

Whatever the reasons were, I can't depend on a package that doesn't reliably exist in a supported repository.

Unregulated use of Python

This one caused all sorts of nightmares. Seurat tries to call Python via reticulate to do various things (e.g., run UMAP). The problem is that reticulate will try to install Miniconda if it can't detect that Python is present. Moreover, once a reticulate-managed Miniconda is installed, reticulate will use that Miniconda's Python to the exclusion of all other Pythons that might be available - even if a user explicitly asks for another specific version of Python with, e.g., use_condaenv(). (You need to use required=TRUE to force it to use your requested version of Python, but this is not the default.)
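For reference, the difference in behaviour looks like this (a sketch; the environment name is made up):

```r
# Sketch of the reticulate behaviour described above; "my-env" is hypothetical.
library(reticulate)

use_condaenv("my-env")                   # only a hint: can be ignored if a
                                         # reticulate-managed Miniconda takes priority
use_condaenv("my-env", required = TRUE)  # errors if that environment can't be used,
                                         # which is usually what you actually want
py_config()                              # shows which Python ended up being bound
```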

The result was that other people's reticulate code broke, even in R sessions that didn't use Seurat at all. I was willing to give the developers some initial benefit of the doubt, as this behavior is more a problem of reticulate than anything else. But once it was clear what was happening, one would have hoped that Seurat would stop depending on Python like that. The proper way to do it would be to use something like https://bioconductor.org/packages/devel/bioc/html/basilisk.html to provision a dedicated Python environment that doesn't break users or other packages in a persistent manner. And again, they would have been able to get this if they were part of Bioconductor.
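A rough sketch of the basilisk approach (the environment contents, package name and function here are all hypothetical):

```r
# Rough sketch of the basilisk pattern: the (hypothetical) package owns a fully
# pinned Python environment, so its Python use cannot clobber the user's own
# reticulate setup or other packages' environments.
library(basilisk)

umap_env <- BasiliskEnvironment(
    envname = "umap_env",
    pkgname = "mypackage",               # normally defined inside the package itself
    packages = c("umap-learn=0.4.6")     # hypothetical, fully pinned dependency
)

run_umap <- function(mat) {
    proc <- basiliskStart(umap_env)
    on.exit(basiliskStop(proc))
    basiliskRun(proc, fun = function(mat) {
        umap <- reticulate::import("umap")
        umap$UMAP()$fit_transform(mat)
    }, mat = mat)
}
```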

Inefficiencies

I don't really know or care whether Seurat is inefficient; that's their problem. But somehow the take-home message has mutated into "R is not efficient for single-cell analysis", which is certainly not true - I can analyze 300,000 cells on my laptop with 16 GB RAM using Bioconductor code; see https://osca.bioconductor.org/hca-human-bone-marrow-10x-genomics.html for an example. Perhaps this perception is a factor in the migration of users to Python, and in particular to the scanpy framework, which is a shame, as I believe that R offers a superior environment to Python for interactive analysis. (That is not to say that I am against R/Python integration; indeed, I am actually looking forward to stronger integration with well-written Python code via reticulate and basilisk, where the use of Python is fully transparent and hidden inside typical R function calls.)

Conclusions

I ranted a bit there, but that's the perspective from a package developer. These points might also impact end-users, or they might not; for example, you don't have to care particularly about version changes and updates if you're using Docker to control your environment. I'll leave it to you to figure out whether any of the points above might impact your work.

With the transition of monocle3 to use the SingleCellExperiment, it really just remains for Seurat to migrate over and we will achieve unification of the entire R single-cell analysis stack. Version 4, perhaps.

Thank you so much for such a detailed response Dr. Lun! That was incredibly interesting and informative. As someone from a statistics background who is trying to follow software dev best practices when writing code & developing analytical frameworks, this sort of detailed background info is exactly the type of thing I like to read about & learn from.