nteract / papermill

📚 Parameterize, execute, and analyze notebooks

Home Page: http://papermill.readthedocs.io/en/latest/

Advice on the storage of Pandas DataFrames

aabdullah-bos opened this issue

I am interested in using Papermill to generate notebooks that analyze financial data. The example in the README shows how to store primitive Python types, but it does not provide guidance on how to "record"/store more complex data types. I am considering storing the data in CSV files or in some type of "database-like" resource.

Do you have any guidance or advice on things that you've learned about data ingress and data storage when using Papermill?

Hi @aabdullah-bos,

We definitely want to improve data storage in papermill. You can see #175, which suggests moving that capability into an adjacent project and upgrading the interaction.

In general, with what's there today, the data has to be serializable to/from JSON for the record operation. Converting pandas dataframes to CSV/JSON and recording that does work (I've used that in a few places myself). You just have to re-encode the data back into dataframes on the consumer side, knowing how you originally saved it.

As far as learnings on the topic: it's good for one-off situations, or situations where you don't have a readily available database/object store to reference. It's handy when you have a chain of work and want to reproducibly store state in isolation from everything else along the way. It's bad when your dataframe is large (>20MB), because most notebook interfaces don't do incremental saves and send the whole notebook to the store on each save (papermill included). It's also not easily understood by most external systems what you're storing and where you're storing it if you need to share more globally. This means that if the notebook data is to be used by more than one downstream system, the notebook objects/stores should register the data they house within your pipeline, or you should save the data to such a store (like S3/Azure/your favorite blob store).
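For the "store the data elsewhere, record a pointer" pattern described above, a minimal sketch might look like the following (the storage path and variable names are hypothetical; pm.record is the same pre-1.0 papermill API used in the examples below):

import papermill as pm
import pandas

df = pandas.DataFrame.from_dict({'a': [1, 2, 3], 'b': [3, 2, 1]})

# Write the full dataframe to shared storage. A local path is used here;
# an S3/Azure blob URL works the same way given the appropriate libraries.
data_path = '/shared/output/my_result.csv'  # hypothetical location
df.to_csv(data_path)

# Record only the locator, keeping the notebook itself small.
pm.record('csv_path', data_path)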

Hopefully this is useful insight. Feel free to ask more specific questions if any of that wasn't clear or needs deeper elaboration.

@MSeal Thanks for the response. I've implemented something like the code below. I haven't decided whether I want to use JSON or CSV. I'll keep your "learnings" in mind as I work through a few more iterations. I think I may be leaning towards recording a URL, handle, path, or resource locator in the output notebook so that downstream notebooks can access the data.

On the producer side

import papermill as pm
import pandas

def compute_some_data():
    df = pandas.DataFrame.from_dict({'a': [1,2,3], 'b': [3,2,1]})
    return df

df = compute_some_data()

# record as JSON
json_data = df.to_json()
pm.record('json_data', json_data)

# record as CSV string
csv_data = df.to_csv()
pm.record('csv_data', csv_data)
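
For context, this producer code is presumably run as a notebook through papermill so that the recorded values land in the output notebook read on the consumer side. A minimal sketch, assuming the producer cells live in a hypothetical producer.ipynb:

import papermill as pm

# Execute the producer notebook; recorded values are written into the
# output notebook, which the consumer side reads back.
pm.execute_notebook('producer.ipynb', 'test_df_json.ipynb')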

On the consumer side

import papermill as pm
import pandas
import json
import io

nb = pm.read_notebook('test_df_json.ipynb')

# read from JSON
json_data = json.loads(nb.dataframe[nb.dataframe.name == 'json_data'].iloc[0]['value'])
df_j = pandas.DataFrame.from_dict(json_data)

# read from CSV
stream = io.StringIO(nb.dataframe[nb.dataframe.name == 'csv_data'].iloc[0]['value'])
df_c = pandas.read_csv(stream, index_col=0)
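
If the producer recorded only a path or URL (as in the sketch earlier in the thread), the consumer side would follow the same filtering pattern; this continues from the snippet above, and 'csv_path' is the hypothetical key recorded by the producer:

# Look up the recorded locator, then load the data from shared storage.
csv_path = nb.dataframe[nb.dataframe.name == 'csv_path'].iloc[0]['value']
df_p = pandas.read_csv(csv_path, index_col=0)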

Awesome, hopefully the points above will help. I'm going to close the issue but feel free to reopen if you have followup concerns/questions.