capeprivacy / cape-python

Collaborate on privacy-preserving policy for data science projects in Pandas and Apache Spark

Integrate Cape Python to work with Dask

kjam opened this issue · comments

Is your feature request related to a problem? Please describe.
Several users have asked to work with Dask directly instead of Spark and Pandas. Because of its popularity in the Python data science community and its ease of use for out-of-core computation and parallelized workflows, Dask fits well with the data science needs we are trying to address.

Describe the solution you'd like
We should find out how many changes are needed to get the cape_pandas integrations working with Dask DataFrames. Matt Rocklin took a look during the webinar and pointed out that only a few lines would need to be updated for it to just work (for example, the places where we explicitly call pd.Series to return an array as a series).
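To make the idea concrete, here is a minimal sketch of one way a transformation could stay compatible with both libraries. It is not Cape Python's actual code: `round_values` and `apply_transform` are hypothetical names made up for illustration. The assumption is that the transformation itself keeps working on plain pandas Series, and that for Dask input we hand it to `map_partitions`, so each partition is still an ordinary pandas Series.

```python
import numpy as np
import pandas as pd


def round_values(series: pd.Series, precision: int = 1) -> pd.Series:
    """Hypothetical transformation that works on a single pandas Series."""
    rounded = np.round(series.to_numpy(), precision)
    # Rebuild the Series with the original index and name so that results
    # from separate partitions still line up with the input.
    return pd.Series(rounded, index=series.index, name=series.name)


def apply_transform(series, func):
    """Apply func to a pandas Series or, lazily, to a Dask Series."""
    if hasattr(series, "map_partitions"):
        # Dask Series: run func on each underlying pandas partition.
        # meta describes the output name and dtype (assumed unchanged here).
        return series.map_partitions(func, meta=(series.name, series.dtype))
    # Plain pandas Series: call the transformation directly.
    return func(series)
```

With pandas, `apply_transform(pd.Series([1.26, 2.34], name="x"), round_values)` returns a pandas Series. With Dask installed, passing `dask.dataframe.from_pandas(s, npartitions=2)` instead should return a lazy Dask Series that is evaluated with `.compute()`.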

Describe alternatives you've considered
We could wait on Dask integration to prioritize other integrations; however, if it truly is as simple as changing a few returns, I would prefer we do it sooner! :)

Additional context
To hear Matt's comments, check out around the 48-minute mark of this YouTube video. I'm sure he is happy to help if we need extra guidance! 🙌