chanzuckerberg / cellxgene

An interactive explorer for single-cell transcriptomics data

Home Page: https://chanzuckerberg.github.io/cellxgene/

Optimizing Large Dataset Loading and Differential Expression Analysis in a Locally Hosted CellxGene VM

chunhuicai opened this issue · comments

We are currently utilizing CellxGene VM (https://github.com/Novartis/cellxgene-gateway) to host a substantial spatial transcriptomic dataset comprising roughly 16 million cells. However, we are facing a couple of critical issues that are hampering our analysis workflow:

Dataset Loading:

  • Incomplete Loading: Dataset loading is often interrupted or left incomplete. After several attempts we can load the full dataset in about 3m30s, but the inconsistency remains a concern.
  • Conversion to CXG: After successful conversion of our dataset to CXG format, we realized that it is not being recognized by our self-hosted explorer.

Differential Expression Analysis:

  • Inconsistent Loading of Gene Details: When we use the differential expression function, the gene details do not always finish loading.

By comparison, when working with large datasets (over 4 million cells) on CZI, we observed fast data loading, and differential expression analysis completed smoothly within a few seconds. Are there any practices, setups, or approaches that would help us handle and analyze big datasets efficiently on the local CellxGene VM and achieve performance similar to CZI?

Hi, I am having similar issues with loading times for larger datasets. I was wondering how you were able to convert your datasets into CXG format?

Hey @mohammed-hussain1259

Thanks for the question. The original issue this user was experiencing was partially addressed over private communication, but sharing here for visibility and to continue the public discussion.

W.r.t. converting to CXG, you can refer to this code in the single cell data portal repo that is the entry point for the CXG conversion. To provide a bit more context, the CXG file format is an implementation of the TileDB format/data structure that adheres to the SOMA specification, with the goal of catering more specifically to the single cell use case.

Happy to provide more information if needed, but as a disclaimer - because of the wide range of contexts/requirements of different self-hosting use cases, the CELLxGENE team does not explicitly offer/guarantee support for self-hosting CZ CELLxGENE Annotate.

Hi @MaximilianLombardo,

Thank you so much for your quick response. I am attempting to self-host CellxGene with a setup similar to the original poster's, using CellxGene Gateway (https://github.com/Novartis/cellxgene-gateway). I have been running into similar issues with loading times: even a dataset of 400k cells (4.5 GB) takes nearly 15 minutes to load. I noticed that https://cellxgene.cziscience.com/ is able to load similarly sized datasets incredibly fast, and I was wondering if you could provide some guidance on how I might achieve similar load times. I have used the --sparse and --backed flags, and while they do improve performance, the load times are still not comparable to what I see on the CellxGene site.
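For context on why sparse storage matters at this scale, here is a rough back-of-the-envelope sketch of the in-memory size of an expression matrix stored densely versus in CSR form. The gene count (20,000) and the fraction of non-zero entries (5%) are assumptions chosen only for illustration; neither figure comes from this thread, and the helper names are hypothetical.

```python
def dense_bytes(n_cells, n_genes, itemsize=4):
    """Size of a dense matrix with `itemsize`-byte values (4 = float32)."""
    return n_cells * n_genes * itemsize


def csr_bytes(n_cells, n_genes, density, itemsize=4, index_size=4):
    """Size of a CSR matrix: one value and one column index per non-zero,
    plus a row-pointer array with n_cells + 1 entries."""
    nnz = round(n_cells * n_genes * density)
    return nnz * (itemsize + index_size) + (n_cells + 1) * index_size


def to_gib(n_bytes):
    return n_bytes / 2**30


if __name__ == "__main__":
    n_cells = 400_000   # from the post above
    n_genes = 20_000    # assumption: not stated in the thread
    density = 0.05      # assumption: fraction of non-zero entries

    print(f"dense float32: {to_gib(dense_bytes(n_cells, n_genes)):.1f} GiB")
    print(f"CSR float32:   {to_gib(csr_bytes(n_cells, n_genes, density)):.1f} GiB")
```

Under these assumed numbers the dense form is roughly ten times larger than the sparse one, which is one plausible reason a matrix that gets densified on load could be slow or fail on a memory-constrained host. Actual figures depend on the dataset's real gene count and sparsity.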

I completely understand that you are not able to explicitly offer/guarantee support for self-hosting; however, I would appreciate any guidance you can provide.

If you would like to correspond over private communication, please feel free to reach out to me on my email at mohammed.hussain@regeneron.com

Thank you again for all your assistance.