Speed up and extend normalizations

Question

Speed up and extend normalizations

mbuttner opened this issue a year ago · comments

Current normalization tools in pytometry use the FlowKit implementation and are FlowJo input compatible, but could use a speed up, especially the functions for logicle and biexponential transformations. The Cytotransform package implements commonly used transformations and is much faster. An adaptation of these functions to the anndata format would be amazing.

Ross Burton · Answer 1 · Thu Sep 07 2023 16:10:00 GMT+0800 (China Standard Time)

Cytotransform uses the FastLogicle C++ implementation from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4761345/. How do you feel about making cytotransform a dependency of pytometry? Might be easier to manage rather than ingesting the C++ code into this repo?

It also accepts a Pandas DataFrame as input so a wrapper for anndata should be easy to create. I'm happy to pick this one up.

Maren Büttner · Answer 2 · Thu Sep 07 2023 23:49:14 GMT+0800 (China Standard Time)

I'm happy to make CytoTransform a dependency of pytometry. The wrapper should be relatively easy because most data matrices for flow data are dense. My only concern is that the pandas DataFrame is less memory efficient than a numpy 2D matrix, which might become problematic as soon as you use 10M+ cells. So I think that using the 2D numpy matrix as input would be preferable, especially when the C++ implementation will convert the input to a 2D array anyways.

Ross Burton · Answer 3 · Mon Sep 18 2023 00:30:32 GMT+0800 (China Standard Time)

That makes perfect sense. It will be a simple refactor to cytotransform to fix this since under the hood it acts on the numpy array of the pandas dataframe anyway. I'll refactor cytotransform base class to accept a dataframe or numpy array and skip wrangling the output into a dataframe when numpy arrays are provided as input - see burtonrj/cytotransform#5

Scott White · Answer 4 · Thu Sep 21 2023 11:13:42 GMT+0800 (China Standard Time)

Hi Maren and Ross,

It's great to see the Python cytometry ecosystem growing. However, I'm concerned that we may be unnecessarily fracturing our efforts. It seems we have 3 different implementations of transformations here. Perhaps there is a way to work together?

Kind regards,
Scott

Maren Büttner · Answer 5 · Fri Sep 22 2023 01:13:50 GMT+0800 (China Standard Time)

Hi Scott,
thanks for reaching out! The current normalization implementation in pytometry largely follow your implementation in the FlowKit package, and I respect your concern. Happy to talk and join forces.

Best,
Maren

Scott White · Answer 6 · Fri Sep 22 2023 11:19:37 GMT+0800 (China Standard Time)

Maren,

Most transformations in FlowKit are modeled after the GatingML 2.0 specification, and their implementation is defined in the FlowUtils package. The exception is the FlowJo biex transformation, as you've discovered. The history of FlowUtils comes from inheriting the original fcm code base, which I broke apart with the hope that the individual parts would be more useful to others.

I realize the requirements / needs for projects may differ, but the transformation functions are one of those rare, well-defined requirements that seem better served by many eyes on one set of implementations. I can move the FJ biex xform to FlowUtils if that would help. The reason I kept it separate was the added dependency of SciPy, where FlowUtils has the singular dependency of NumPy.

And, I see Ross has parallelized some of the other xforms in the CytoTransform package. I also thought about doing this as the parallelization for transforming FCM events is obvious, but multiprocessing (mp) in Python has a double-edge. Nesting mp in Python is "here be dragons" territory. I use mp in FlowKit for processing multiple FCS samples using the same gating strategy, with the rationale that a user would unlikely mp different gating strategies. Adding mp to the transformations that are necessary for the gating functions threw a wrench in the works. Still, it is an interesting idea Ross has for making the xform classes optionally mp.

Enough rambling by me, curious what you both think and what ideas you have.

-Scott

Ross Burton · Answer 7 · Fri Sep 22 2023 22:54:32 GMT+0800 (China Standard Time)

Hi guys!

Thanks for the input Scott! Much appreciated and I too agree transformations is an area where we should all be aligned regardless of what else we are doing - maybe also on things like compensations, various utilities, gate definitions, but we can discuss that another time - then we should allow the user some choice around what frameworks they want to work in. Maren is doing great work expanding cytometry into the scanpy ecosystem and I am right behind her on that! In fact I even hope to retire CytoPy and point people this direction.

As for cytotransform, I took inspiration from your work Scott and tried to make it GatingML 2.0 compatible and also performing testing against the GatingML 2.0 definitions (https://github.com/burtonrj/cytotransform/blob/main/tests/test_transformer.py). The mp approach is an ambitious attempt to brute force transforms on enormous datasets....but mp in Python is a swamp I will agree (hopefully to be fixed in 3.13 onwards https://peps.python.org/pep-0703/). Recognising that mp is not always desirable, cytotransform will just execute the function in a single process if n_jobs is 0 or 1. I've also made this the default behaviour if you have less than 10000 events....maybe it should just be the default behaviour anyway? MP should be an explicit choice? Interested to hear peoples thoughts on this.

Ross Burton · Answer 8 · Fri Sep 22 2023 22:58:27 GMT+0800 (China Standard Time)

Oh, also I've had far less installation woes with the FastLogicle.cpp implementation coupled with PyBind11. I haven't compared the performance of FastLogicle vs your C implementation though @whitews, not sure which is faster or more precise.

Maren Büttner · Answer 9 · Tue Oct 03 2023 00:18:48 GMT+0800 (China Standard Time)

Hi there,

thank you all for your suggestions. I might be more taciturn, but I enjoy the discussion.

About multiprocessing:
I had some good experience with multiprocessing in Python on our Linux cluster so far, but I do understand that this might not be generalizable to other architectures. So I agree with @whitews that we should disable multiprocessing for normalization by default and leave it to the user - as @burtonrj stated "make it an active choice".
Normalization functions as defined in FlowUtils/FlowKit:
I agree that the definitions of the data transformation leaves no room for variation. I do realize that the current implementation of the arcsinh does not quite follow the GatingML 2.0 standard, so that should be adapted. When I looked into FlowKit, I realized that pytometry follows much less the object-oriented coding paradigm as FlowKit does. That is why I decided to adapt these transformations to fit with the anndata format and the scanpy coding style instead of relying on FlowUtils/FlowKit directly. That results into a parallel development, obviously, but I haven't figured out a more elegant way to do this (or at least I didn't when I first implemented the functions).
Second, I extended the arcsinh and biExp transform functions such that every channel can be transformed individually.That's something that FlowJo does for the biExp function and which is a common practice for spectral flow data. My concern is that when we decide to rely more on FlowUtils, we might be less flexible to respond to user requests to extend the functionality (see issue #54). I'm leaning more towards keeping the pytometry package separate from FlowKit and follow through with @burtonrj's implementations in cytoTransform.

Scott White · Answer 10 · Tue Oct 03 2023 09:05:22 GMT+0800 (China Standard Time)

Maren & Ross,

It sounds like we are pretty much on the same page. What if I move the transform classes from FlowKit to FlowUtils, along with moving the FlowJo Biex implementation there. This adds SciPy as a dependency, which is why it was not done previously, but does put all the transforms in one place.

Ross, would you be available to help maintain FlowUtils if we did this? We could add the optional mp as well. I do feel that the more eyes we have on these critical implementations the better, inline with the Torvalds quote "given enough eyeballs, all bugs are shallow".

One other point is that the Sample class in FlowKit does allow for a dictionary of transforms mapped to individual channels. The Gate classes are also designed to be used independently from the heavy Session & Workspace classes. Not sure if it was clear, but FlowKit is more or less feature complete. It will never expand to have clustering algos, a GUI, etc. This was a decision made early on after being a bit stung by the algo chase building ReFlow.

Maren Büttner · Answer 11 · Tue Oct 03 2023 23:55:06 GMT+0800 (China Standard Time)

Hi Scott,

that sounds good to me. It would be great if the FlowJo Biexp implementation would also be in FlowUtils. pytometry has the scipy dependency already, so that doesn't change much. I guess that I have to spend more time with FlowKit to fully understand its functionality and implementation - thank you for pointing this out 🤷‍♀️
@burtonrj you'll be making the call as this decision will affect you the most. If yes, this issue turns into "inherit FlowUtils normalizations".

Ross Burton · Answer 12 · Thu Oct 05 2023 02:54:56 GMT+0800 (China Standard Time)

Hi all,

I'm happy with this call, I was going to have to maintain cytotransform anyway so it makes no difference to me, other than I won't be maintaining it on my own - completely agree with you @whitews, the more eyes the better!

Let me know next steps RE changes to FlowUtils.

Scott White · Answer 13 · Thu Oct 05 2023 21:00:50 GMT+0800 (China Standard Time)

Great, I've started a discussion on the FlowUtils repo here:
whitews/FlowUtils#11

We can plan the changes there (and avoid spamming this issue :)

Scott White · Answer 14 · Sun Oct 08 2023 10:51:29 GMT+0800 (China Standard Time)

Bumping this as I plan to release new versions of FlowUtils and FlowKit soon to support Python 3.12. I've discussed the proposed changes with my colleagues at Duke and have the go ahead. We are also considering moving the base Gate classes to FlowUtils, as these are also well-defined and would likely be helpful to others...though there are some implementation details to be worked out, particularly around the ratio dimension.

alefrol638 · Answer 15 · Wed Oct 18 2023 05:30:36 GMT+0800 (China Standard Time)

Do you also plan to add autologicle normalisation, like used in the R package cytofkit (cytofkit::autoLgcl, https://github.com/JinmiaoChenLab/cytofkit/blob/master/R/cytof_preProcess.R)?

Ross Burton · Answer 16 · Wed Oct 18 2023 16:55:05 GMT+0800 (China Standard Time)

I haven't personally looked into parameter optimisation for logicle transform, I've always left parameters as options for the user.

I would say we could incorporate this but first we need to develop a consensus in methodologies which (as discussed above) appears to be FlowUtils

Maren Büttner · Answer 17 · Wed May 08 2024 00:09:51 GMT+0800 (China Standard Time)

Do you also plan to add autologicle normalisation, like used in the R package cytofkit (cytofkit::autoLgcl, https://github.com/JinmiaoChenLab/cytofkit/blob/master/R/cytof_preProcess.R)?

@alefrol638 The autologicle normalization function is now implemented in pytometry v.0.1.5

Gregor Sturm · Answer 18 · Wed May 08 2024 16:02:34 GMT+0800 (China Standard Time)

I would be in favor of unifying normalization functions between the different ecosystems, which if I followed this discussion correctly, means relying on FlowUtils. In that I think it would be good if someone beyond @whitews had commit/release rights on that repo to avoid a single point of failure.

About multiprocessing

I think multiprocessing is important when moving towards datasets with 10s to 100s of millions of cells. I agree that Python multiprocessing is not perfect. Ideally the C functions can be paralellized (where applicable), or numba could be an option for this kind of code. I also made good experiences with replacing python multiprocessing with joblib.Parallel which is more robust and supports different backends, including dask.

Scott White · Answer 19 · Wed May 08 2024 23:50:05 GMT+0800 (China Standard Time)

@grst

I agree on unifying the well-defined components like the transforms and that it is not ideal to have such a critical repo with a single person. Equally important IMO is the stability of such a well-defined codebase, i.e. there's no need to change it unless something is broken.

On that point, I did make a local branch of FlowUtils with class-based transforms (something mentioned earlier in the thread). In the end, I decided against it. It was awkward to have an end user of FlowKit use those "pass through" classes, as the docstrings are in a second tier import. Seems easy enough to wrap the transform functions in a class as close to the end user as possible. The important part is to unify the underlying math of the transforms.

One other point, related to #65, is FlowUtils already has inverse functions for all the transforms. There are also a few other functions, including compensation and a GatingML compatible points_in_polygon and points_in_ellipsoid that might be useful.

For optimization, I think it would be useful to define some requirements / expectations prior to any implementation. Some data sets and benchmarks would be good as part of that as well. There are several ways to implement this, each with pros and cons. A good case can be made that parallelization could be a downstream responsibility, it is rather easy to batch transforms per sample or even on a more granular level.

Just a proposal, and I'd need to verify this on my end, but what if we move FlowUtils to scverse? I'd be in support of this if we can agree to keep it stable and focused, avoiding breaking changes for downstream packages such as pytometry and FlowKit.

Gregor Sturm · Answer 20 · Mon May 13 2024 15:42:44 GMT+0800 (China Standard Time)

Equally important IMO is the stability of such a well-defined codebase, i.e. there's no need to change it unless something is broken

Fully agree.

Just a proposal, and I'd need to verify this on my end, but what if we move FlowUtils to scverse? I'd be in support of this if we can agree to keep it stable and focused, avoiding breaking changes for downstream packages such as pytometry and FlowKit.

I'd be up for that, but this is something I would like to ask in the core team first. I'll put it on the agenda for next week.

For optimization, I think it would be useful to define some requirements / expectations prior to any implementation. Some data sets and benchmarks would be good as part of that as well.

We now have some infrastructure for continuous benchmarking in place for scverse repos. See in action e.g. here:
scverse/scanpy#3031 (comment). This would be another advantag of moving it to the scverse org.

Wherever it makes sense, I would be slightly in favor of implementing paralellization in the central repo. As discussed above, it's trivial to do parallelization, but doing it well might not be that easy.

Philipp A. · Answer 21 · Tue May 14 2024 16:45:17 GMT+0800 (China Standard Time)

If you need help with the benchmark setup, ask @ilan-gold or me. It’s easy to set up but you might need help with permissions. It’s also currently in alpha state so expect it to break sometimes.

Gregor Sturm · Answer 22 · Wed May 22 2024 01:10:08 GMT+0800 (China Standard Time)

Hi @whitews,

core team is fine with moving FlowUtils to scverse. One concern that came up is that the release process is well documented (we don't have a lot of experience with packages that have C extensions) and ideally automated via GitHub actions.

So we all agree that this is the way forward and pytometry will replace its normalization code with FlowUtils? @mbuttner @quentinblampey?

Maren Büttner · Answer 23 · Wed May 22 2024 02:10:27 GMT+0800 (China Standard Time)

Hi @grst

Thank you for the update. I am happy with the suggested action to replace pytometry's normalization code with FlowUtils.

Quentin Blampey · Answer 24 · Wed May 22 2024 15:29:24 GMT+0800 (China Standard Time)

Hello @grst, I'm also fine with this! One small question though: since the Python package is called flowutils, should we rename the GitHub repository as flowutils as well? I think this would also make more sense regarding python naming conventions

Scott White · Answer 25 · Wed May 22 2024 22:42:38 GMT+0800 (China Standard Time)

@grst OK, so how would this work? Would I remain the lead on FlowUtils?

Gregor Sturm · Answer 26 · Thu May 23 2024 00:17:06 GMT+0800 (China Standard Time)

Of course! You remain (co-)owner of the repository. You can also add a section to the readme to make clear you are the lead developer.

I guess @quentinblampey would get commit rights, too. Additionally all scverse core developers will have committ rights in theory, but not use them without your approval as long as you are reachable.

Gregor Sturm · Answer 27 · Thu May 23 2024 00:17:59 GMT+0800 (China Standard Time)

To initiate the transfer it's probably easiest if you make me an owner of the flowutils repo.