awillats / interactive-visualization-resources-and-advice

A comparison of several Python libraries for interactive visualization, as well as advice for effective data viz

Home Page:https://awillats.github.io/interactive-visualization-resources-and-advice/


Interactive Visualization in Science:

Resources and advice for Python + JavaScript

by Adam Willats

Choose your own adventure!



Context / background

What do I want from visualization libraries?

I'm interested in using Python and JavaScript to interactively explore problems in computational neuroscience and neuroengineering. I'm also interested in using interactive visualizations to teach concepts to others. Here's some of my work:

[image of high-level + low-level tools, from the HoloViz documentation]

my priorities are:

  • must allow rich interactivity (zooming, filtering, linking plots)
  • robust handling of multiple 10k sample timeseries
  • expressive, functional-style flexible mapping of (nested) indices to plots
    • I want to be able to explore relationships between variables in high(er) dimensional parameter sweeps
    • bonus points if it can render 3D plots
  • easy to embed / host results, especially in static webpages
    • easy to share, and for others to access, preferably without complex tools

Visualization library options

Here I focus on a subset of solutions which I think are most promising for the interactive use-case. See also pyviz.org or "Dynamic science viz.." [1] for a more comprehensive evaluation.

| library | language for computation | lots of data | custom js | easy to embed | interactivity | 3D plots |
|---|---|---|---|---|---|---|
| Plotly / Dash | python [2] | yes, webgl + datashader | sort of - via Dash [3] | yes, dash is more flexible, but more complicated | high | yes |
| Bokeh | python (+ js) | yes, column-data & server solutions | yes | yes | high | yes-ish |
| Altair | python | no [4] | difficult | yes, through vega-lite | medium | no |
| HoloViz | python | via datashader | yes | yes | high | yes, through plotly |
| Observable | javascript [5] | yes [6], stream from server | yes [7] | yes, can also embed single cells | high | yes, through js libraries |

other libraries / ecosystems not investigated / compared here:

  • matplotlib - the de facto standard for (non-interactive, 2D) plotting in Python

    • Seaborn - high-level "opinionated defaults" library built on matplotlib
    • mpld3 - "The mpld3 project brings together Matplotlib, the popular Python-based graphing library, and D3js, the popular JavaScript library for creating interactive data visualizations for the web."
  • plotnine - like Altair, this is a Python library which implements a grammar of graphics approach to plotting

  • Streamlit - python dashboarding library becoming increasingly popular

  • weights and biases - visualization & logging for machine learning

  • see also:

    • The Python Visualization Landscape (2017) video of talk by Jake VanderPlas
    • lots more comparisons of python viz tools at pyviz.org

      [image: adaptation of Jake VanderPlas graphic about the Python visualization landscape, by Nicolas P. Rougier, via PyViz.org]

Lineage / taxonomy of libraries

breakdown of where libraries came from / were inspired by
  • D3.js

    • best at remapping data to visual features
    • comparatively "low-level" specification of plots
    • many libraries built from, or inspired by D3
      • D3 -> Vega

        • provides a higher-level, more concise interface to plots
        • Vega -> Vega-lite
          • making Vega simpler, provides an even higher-level interface
          • Vega-lite -> Altair
          • see "Exploratory Data Visualization with Vega, Vega-Lite, and Altair" - PyCon 2018 video of talk by Jake VanderPlas for examples and context
      • D3 -> Plotly

        • is a JavaScript library with interfaces to several languages (Python, R, MATLAB, Julia)
        • Plotly is somewhat unique in the Python landscape for having rich, interactable 3D plots
  • Bokeh is inspired by, but not built on D3.js

  • HoloViz is a very high-level tool

    • which allows combinations of many of the above libraries
      • matplotlib, Seaborn
      • Bokeh
      • Plotly
      • Altair / Vega
      • ggplot2
  • Grammar of Graphics (see sources for more info)

    • explicit framework for:
      • Vega -> Vega-lite -> Altair
      • ggplot / ggplot2 -> plotnine
      • Observable:Plot
    • influenced
      • Plotly
      • Seaborn
      • Bokeh
  • Pandas DataFrames

    • several of the above plotting libraries depend on (or work best with) data being in pandas dataframes
    • they also work most smoothly if the data is organized in "tidy" form (see resources here)
    • 10 minutes to pandas
    • Python for Financial Data Analysis with pandas by Wes McKinney
      • Design philosophy
        • Clean axis indexing design to support fast data alignment, loops, hierarchical indexing and more
        • Think outside the matrix: stop thinking about shape and start thinking about indices
        • Fault tolerance: save you from common blunders caused by coding errors (misaligned data)
        • Hierarchical indexing + `group_by` database operations
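
To make the "think about indices, not shape" idea concrete, here is a minimal sketch of hierarchical indexing and grouping for a parameter sweep. The parameter names (`gain`, `noise`) and the random data are made up for illustration, and it assumes a reasonably recent pandas and NumPy.

```python
import numpy as np
import pandas as pd

# Hypothetical parameter sweep: every combination of two parameters,
# each producing a short timeseries.
gains = [0.5, 1.0, 2.0]
noise_levels = [0.0, 0.1]
n_samples = 4

index = pd.MultiIndex.from_product(
    [gains, noise_levels, range(n_samples)],
    names=["gain", "noise", "t"],
)
rng = np.random.default_rng(0)
sweep = pd.DataFrame({"signal": rng.normal(size=len(index))}, index=index)

# Select by index label rather than by array position:
# all time points for one (gain, noise) pair.
one_run = sweep.loc[(1.0, 0.1)]

# Group over the parameter levels to get one summary row per run.
summary = sweep.groupby(level=["gain", "noise"])["signal"].agg(["mean", "std"])
print(summary)
```

Keeping the sweep parameters in the index means the same selection and grouping code keeps working when another parameter is added to the sweep.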

Which plotting library should I use?

It depends on your goals and use case. Select the goals that line up most closely with what you want and I'll recommend a library to you.

  • 🛠️ I just want to publish a paper (or other non-interactive publication), and I want to be able to tweak the details of the figure

    matplotlib

  • ⚔️ I'm not interested in learning the intricacies of how to draw ellipses on a computer, just give me nice looking plots with minimal code

    Seaborn

    • you can always fine-tune details with matplotlib later if needed.
  • 🎻 I love the elegance of the grammar of graphics approach, and work mostly with small to medium-sized datasets. Easily specified interactivity would be a bonus

    Altair

  • 🧙 I want interactivity, to be able to switch between python and javascript, and from small to large datasets (even if it takes more code)

    Bokeh

  • 🏹 I want the maximum amount of plotting features (3D plots, interactivity, dashboards) without having to write too much code

    Plotly

  • 🧰 I have no loyalties to a particular plotting library, let me mix-and-match the best tools for the job

    HoloViz

Design philosophies & features

Plotly 🚧 One double-edged feature of Plotly is the gradient of multiple approaches to achieve the same plot:

  • Good-looking plots can be put together very quickly with Plotly express
  • If you need more customization, you often end up switching to Plotly's graph objects. This has a similar level of customization to matplotlib
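  • If you want specialized, pre-composed chart types (e.g. a customized scatter-plot matrix), you use Plotly's figure factory; to compose multiple plots into one figure, you use `make_subplots`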

While this flexibility allows you to pick the right tool for the job, it makes looking through the documentation much more confusing.

Altair 🚧 The biggest part of the Altair learning curve for me was getting my data into the correct form. It was also disheartening to fall in love with the grammar of graphics approach, only to run into a wall when trying to plot many timeseries [4].

Bokeh technical vision

  • attempts to address "How do we look at all the data?" and "How can scientists and data analysts be empowered to use visualization fluidly, not merely as an output facility or one stage of a pipeline, but as an entire mode of engagement with data and models?"

HoloViz 🚧

  • HoloViz Goals: talk
    • Full functionality in browsers (not desktop)
    • Full interactivity (inside and out of plots)
    • Focus on Python users, not web programmers
    • Start with data, not coding
    • Work with data of any size
    • Exploit general-purpose SciPy/PyData tools
    • Focus on 2D primarily, with some 3D
    • Avoid entangling your data, code, and viz:
      • Same viz/analysis code in Jupyter, Python, HPC, ...
      • Widgets/apps in Jupyter, standalone servers, web pages
      • Jupyter as a tool, not part of the results

Observable 🚧

  • created by Mike Bostock, the designer of D3.js
  • reactive javascript notebooks
  • often compared to excel, in that cells of code automatically rerun when their predecessors change value (unlike jupyter notebooks)

Handling large datasets

large datasets in Bokeh

large datasets in Plotly

large datasets in Altair

large datasets in Observable

Misc. considerations

a good storage system

Should you be saving data from Python with pickle, numpy binary, writing to csv, or something else?

  • plain-text “human-readable” formats vs. binary
    • "human-readable" formats like .csv give the ability to visually inspect the integrity of the data, which is very useful
    • This choice also impacts git’s ability to diff files.
      • Binary files seem very slow to add to git because of this.
      • The standard solution to this seems to be simply to avoid version-controlling binary data.
    • but generally binary data storage is going to be faster and more memory-efficient
  • ability to load partial chunks of data
    • one of the primary selling points of hdf5
    • also seems to be one of the use cases for ColumnDataSource as in Bokeh, although I haven't tried this yet
  • ease of integration with summary visualization tools
    • integration with data-analysis tools like numpy / pandas
  • tight connection between metadata / summary data and primary timeseries

Additional considerations

features which are good to have, but don’t strongly impact my user experience at the moment


Embedding results

Being able to share results and code with others, especially without them having to install a complex ecosystem of tools, is useful and good for open, reproducible science.

  • Observable

  • Plotly

  • Altair

  • Bokeh

    • Embedding Bokeh content docs

      Standalone documents: These documents don’t require a Bokeh server to work. They may have many tools and interactions such as custom JavaScript callbacks but are otherwise nothing but HTML, CSS, and JavaScript. These documents can be embedded into other HTML pages as one large document or as a set of sub-components with individual templating.

      • can use file_html() or json_item() to get standalone components

      Bokeh applications: These applications require a Bokeh server to work. Having a Bokeh server lets you connect events and tools to real-time Python callbacks that execute on the server. For more information about creating and running Bokeh apps, see Running a Bokeh server.

    • Code for embedding using various servers - examples repo
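
As a sketch of the standalone route mentioned above, the following produces both a full HTML page and a JSON item from one Bokeh plot. The plot contents and the target id `"myplot"` are placeholders, and it assumes a recent Bokeh release.

```python
from bokeh.embed import file_html, json_item
from bokeh.plotting import figure
from bokeh.resources import CDN

plot = figure(title="standalone example")
plot.line([0, 1, 2, 3], [0, 1, 0, 1])

# A complete HTML page as a string; BokehJS is loaded from the CDN,
# so the page can be hosted as a static file.
html = file_html(plot, resources=CDN, title="standalone example")

# A JSON-serializable dict that a web page can render with
# Bokeh.embed.embed_item(item); "myplot" is a hypothetical target div id.
item = json_item(plot, "myplot")
```

Writing `html` to a file gives something that can be served from a static host, with no Bokeh server involved.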


Interactivity

Shortlist - my favorite examples

  • Seattle weather interactive - demo
  • Using facets to identify patterns - Correlation over Time - observable

Examples of useful interactivity

Custom callbacks:

In order to implement rich interactivity beyond preconstructed templates, it is useful to have control over the callbacks or functions which execute after another event.

  • Dash has its own pseudo-javascript interface to callbacks:

  • Bokeh has very straightforward integration with custom JS callbacks!

  • Altair / vega-lite

    Altair does not offer any way to register event handlers, beyond what's available in the Vega-Lite spec. That would have to be done in Javascript via the Vega view API

  • ObservableHQ:

    • Since the entire notebook is reactive, and built on javascript, callbacks should be straightforward
    • but specifically from vega-lite:
  • HoloViz

    • Linking using custom JS code

      Linking objects in Python is often very convenient, because it allows writing code entirely in Python. However, it also requires a live Python kernel. If instead we want a static example (e.g. on a simple website or in an email) to have custom interactivity, or we simply want to avoid the overhead of having to call back into Python, we can define links in JavaScript.

for Altair-specific implementation notes see building blocks of interactivity
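
To give a feel for the Bokeh option above, here is a minimal sketch of a custom JS callback, loosely following the pattern in the Bokeh documentation: a slider rewrites a data source entirely in the browser. The variable names and the power-law curve are made up for illustration.

```python
from bokeh.layouts import column
from bokeh.models import ColumnDataSource, CustomJS, Slider
from bokeh.plotting import figure

x = [i / 10 for i in range(100)]
source = ColumnDataSource(data={"x": x, "y": list(x)})

plot = figure(title="y = x^power")
plot.line("x", "y", source=source)

slider = Slider(start=0.1, end=3.0, value=1.0, step=0.1, title="power")

# The JavaScript below runs in the browser whenever the slider value
# changes, so no Python process is needed once the page is rendered.
callback = CustomJS(
    args={"source": source, "slider": slider},
    code="""
    const power = slider.value
    const x = source.data.x
    const y = x.map((xi) => Math.pow(xi, power))
    source.data = { x, y }
    """,
)
slider.js_on_change("value", callback)

layout = column(slider, plot)
```

Because the callback is plain JavaScript, `layout` can be exported as a standalone HTML document and the slider keeps working without a server.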


Useful plotting techniques

(see the New Python Data Visualization Tools repo by Stephanie Kirmer to compare plot-type implementations across Altair, Plotly, Bokeh)

🚧 to-do: embed examples for each of these 🚧

  1. think about explanatory versus exploratory data-viz

  2. faceting / small multiples:

    • scatter-plot matrix (aka SPLOM) - 💡this is always my starting point for visualizing complex data

      implementations:
        • [comparisons](https://github.com/skirmer/new-py-dataviz/blob/main/facets.ipynb) by Stephanie Kirmer
        • [seaborn](https://seaborn.pydata.org/tutorial/axis_grids.html)
          • I think the added value of marginal distributions visualized with [kernel-density estimates](https://seaborn.pydata.org/examples/joint_kde.html) is great.
          • Seaborn's [PairGrid implementation](https://seaborn.pydata.org/tutorial/axis_grids.html) is the best one I've seen for this in Python
          • although [R's `pairs.panels`](http://www.sthda.com/english/wiki/scatter-plot-matrices-r-base-graphs) seems to do something similar
        • [altair implementation](https://altair-viz.github.io/gallery/scatter_matrix.html) with linked behavior between panels
        • [plotly](https://plotly.com/python/splom/) [w/ customization using figure factory](https://plotly.com/python/v3/legacy/scatterplot-matrix/)

    • case study: correlation over time, article by Mike Freeman

      🌟 live, embedded demo 🌟 <iframe width="100%" height="600" frameborder="0" src="https://observablehq.com/embed/@observablehq/correlation-over-time?cells=facet_wrap"></iframe>
  3. add tooltips on hover with useful detail

    • plotly implementation docs
    • customizing tooltips in altair github
  4. use interactive heatmaps

    • can nest / hierarchically organize a lot of dimensions
    • Clustergrammer demo, talk video
      • highly interactive heatmap for clustering genes associated with phenotypes
    • Plotly examples, docs
  5. parallel coordinates - ⚠️ primarily for exploratory data-viz ⚠️

    • 5 minute intro by Amit Kapoor

    • longer showcase of parallel coordinates by Kai Chang

    • practical, implementation tips:
      • plotly implementation
        • while there are parallel coordinates implementations in many python plotting packages, this is the only one I've found with the very useful feature of filtering each dimension into ranges as well as being able to reorder axes
      • order of dimensions matters a lot! use a tool where you can rearrange order
      • scaling / normalization matters a lot!
      • coloring by a key attribute can help dissect structure
      • interactivity is crucial to sort through the "hairball"
        • high bandwidth, but hard to parse
      • can be used to pick out clusters in high-dimensional parameter space
  6. replacing legends with direct text-annotation

  7. "banking to 45 degrees" i.e. choosing aspect ratios for plots that maximize discriminability

  8. meaningful color-scales


Sources, inspiration, more resources


Appendix


Further musings

The following are my personal opinions and not necessarily general recommendations:

  • successful faceting might be even more useful than interactivity

    • with interactivity like sweeping through parameter space, we need something like a faded version of previous parameters to show context for where we explored from
    • this tradeoff changes based on the density, continuity and monotonicity of
  • fully flattened tidy csv for everything means loading far too much information (especially in Altair/Vega-lite)

    • but the tidy data paper helped me understand a lot of the philosophy of both Pandas DataFrames and vega-lite / altair
    • need some virtual-link / lazy-loading solution to continue this paradigm to large datasets
  • straying too far outside python limits iteration

    • there's value to communicating to scientists the python you're using to compute things
    • oftentimes the numpy syntax is much nicer than the equivalent javascript for matrix stuff
    • this is why jupyter is the current horse to bet on
    • this is also a factor making me hesitant to jump to Observable, despite all its nice features
    • this also means I'm very interested in watching the Pyodide project
  • aesthetically I don't like jupyter notebooks

    • I want plain-text notebooks
      • no excessive meta-data in file
      • easy to inspect, version-control the important stuff
      • easy to convert between script and notebook
    • I want to be able to use my multi-purpose editor (VSCode / Atom) for research as well
    • I want embedding not to require binder. (ideally "just go to a link!")
    • they also look very clunky (compared to observablehq for instance)
  • being able to host via github pages is a big advantage

    • but is this always doable via services like heroku / netlify ?
      • even if the library itself doesn't natively support it?
      • ( beyond my current expertise )

Tidying data

Structuring data for visualization tools

  • short version: keep one instance to one row

  • Tidy Data by Wickham, python version

    • well worth reading
    • practical examples
    • elucidated the philosophy behind R, grammar of graphics, pandas
    • discusses value of alternative "wide-format" representations also
    • idea has ties to dimensional stacking for viz
  • data wrangling in observable
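
A small sketch of the "one instance to one row" idea, converting between wide and tidy layouts with pandas. The column names (`trial`, `channel_a`, `channel_b`) and values are hypothetical.

```python
import pandas as pd

# Wide format: one row per trial, one column per measured channel.
wide = pd.DataFrame({
    "trial": [1, 2],
    "channel_a": [0.1, 0.3],
    "channel_b": [0.2, 0.4],
})

# Tidy (long) format: one observation per row.
tidy = wide.melt(id_vars="trial", var_name="channel", value_name="value")

# pivot recovers the wide layout when that is more convenient.
wide_again = tidy.pivot(index="trial", columns="channel", values="value")

print(tidy)
```

Grammar-of-graphics libraries like Altair generally expect the tidy layout, since a column such as `channel` can then be mapped directly to color or facet.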

Footnotes:


  1. Dynamic scientific visualizations in the browser for Python users by Patrick Mineault

  2. Plotly also has interfaces in R, MATLAB, Julia, JavaScript

  3. Dash tries to provide a pure-Python interface to mimic the roles of HTML, JS, CSS in traditional websites: "Dash abstracts away all of the technologies and protocols that are required to build a full-stack web app with interactive data visualization." (see dash callbacks)

  4. there are some workarounds in progress for Altair w/ large datasets 2

  5. While Observable notebooks can't currently execute Python, I've included it here because I do think it's a promising solution for interactive data-science notebooks. At the moment my workflow would look something like performing the primary analysis in Python, exporting the results to .csv then importing that to Observable for visualization.

  6. data capacity / capabilities differ between public and private notebooks

  7. Observable's not JavaScript
