d3 / d3-contour

Compute contour polygons using marching squares.

Home Page: https://d3js.org/d3-contour

Separate bandwidth in x and y directions? + NRD formula

jrus opened this issue

As discussed over at the Observable forum, it might be nice for the bandwidth to accept a 2-entry list or an object with x and y attributes or the like, especially since internally the implementation is already blurring separately in x and y directions.
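
To make the request concrete, here is a minimal sketch of how the three input shapes mentioned above (a single number, a 2-entry list, an object with x and y attributes) could be normalized into one pair. The name `normalizeBandwidth` is hypothetical and is not part of d3-contour; it only illustrates the proposed input handling.

```javascript
// Hypothetical helper (not part of d3-contour): accept a bandwidth given as
// a number, a 2-entry [x, y] list, or an {x, y} object, and return [bx, by].
function normalizeBandwidth(bandwidth) {
  let bx, by;
  if (typeof bandwidth === "number") {
    bx = by = bandwidth; // same bandwidth in both directions
  } else if (Array.isArray(bandwidth) && bandwidth.length === 2) {
    [bx, by] = bandwidth;
  } else if (bandwidth !== null && typeof bandwidth === "object") {
    ({x: bx, y: by} = bandwidth);
  }
  bx = +bx;
  by = +by;
  // Rejects negative values, NaN, and unsupported input shapes.
  if (!(bx >= 0) || !(by >= 0)) throw new Error("invalid bandwidth");
  return [bx, by];
}
```

With this, `normalizeBandwidth(20)`, `normalizeBandwidth([10, 30])` and `normalizeBandwidth({x: 10, y: 30})` all yield a pair that can be passed to the two blur directions independently.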

I've now implemented this through various test notebooks that are not yet fully ready (coming soon). I'm enthusiastic about the idea of selecting the bandwidth relative to the variance of each dimension, but there are already a few observations I can share:

First, there is an obvious case where it doesn't work: when the data has no variance (maybe it's a single point, or points that show very small variation in y, for some reason). In those cases we wouldn't want the density contours to be flattened to a line. So there's going to be a sort of minimum, which can be set at the (arbitrary) default value of 20 pixels that exists already.
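
A minimal sketch of such a minimum, assuming the per-axis bandwidths have already been computed by some data-driven rule. The helper name `floorBandwidth` is hypothetical, and reusing 20 as the floor is only the suggestion made above, not an existing d3-contour behavior.

```javascript
// Hypothetical helper (not part of d3-contour): keep a data-driven per-axis
// bandwidth from collapsing when the data has no variance along an axis.
// Assumption: the floor reuses the existing default of 20 (pixels).
function floorBandwidth([bx, by], minimum = 20) {
  // Non-finite values (e.g. NaN from a single data point) fall back to the minimum.
  const floor = (b) => (Number.isFinite(b) ? Math.max(b, minimum) : minimum);
  return [floor(bx), floor(by)];
}
```

For example, `floorBandwidth([35, 0])` keeps the x bandwidth and raises the degenerate y bandwidth to the minimum, so the contours do not flatten to a line.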

Second, the way it looks is a bit underwhelming. The current strategy creates "circles" around the data, whereas an x/y aspect ratio creates "ellipses" (on purpose). That is certainly nicer for statistics, but not as nice on the eye. So I would not want the default bandwidth generator to use a different aspect ratio.

Third, the nrd formula returns values that don't coincide with the way we use the given bandwidth. (Currently bandwidth represents, let's say, the radius of one iteration of blurring on a 4x grid, whereas in the literature it's something like the standard deviation of the Gaussian.) In my experiments, the scale factor between these values is about 5.
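
For reference, a sketch of a normal reference rule applied per axis. The thread does not say which variant of "nrd" is meant; this assumes the form 1.06 · min(σ, IQR / 1.34) · n^(−1/5), as in R's `bw.nrd`, with the sample standard deviation and linearly interpolated quartiles. The function names are hypothetical, and the result is in data units, so it is not rescaled by the factor of about 5 mentioned above.

```javascript
// Linear-interpolation quantile of an ascending sorted array (R's default, type 7).
function quantileSorted(sorted, p) {
  const h = (sorted.length - 1) * p;
  const lo = Math.floor(h);
  const hi = Math.min(lo + 1, sorted.length - 1);
  return sorted[lo] + (h - lo) * (sorted[hi] - sorted[lo]);
}

// Sample standard deviation (n - 1 denominator).
function sampleDeviation(values) {
  const n = values.length;
  const mean = values.reduce((s, v) => s + v, 0) / n;
  const ss = values.reduce((s, v) => s + (v - mean) ** 2, 0);
  return Math.sqrt(ss / (n - 1));
}

// Assumed variant: 1.06 * min(sd, IQR / 1.34) * n^(-1/5).
function nrd(values) {
  const n = values.length;
  if (n < 2) return NaN; // undefined for fewer than two values
  const sorted = Float64Array.from(values).sort(); // numeric ascending sort
  const iqr = quantileSorted(sorted, 0.75) - quantileSorted(sorted, 0.25);
  return 1.06 * Math.min(sampleDeviation(values), iqr / 1.34) * Math.pow(n, -1 / 5);
}

// Per-axis bandwidths for an array of [x, y] points.
function nrdXY(points) {
  return [nrd(points.map((d) => d[0])), nrd(points.map((d) => d[1]))];
}
```

Note that for points with no variation in y, `nrdXY` returns 0 for the y bandwidth, which is exactly the degenerate case described in the first observation.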

As a consequence, either we change the meaning of bandwidth, and users will have to rescale their hand-tuned values (in my experience it's always hand-tuned to give a "nice" graph), or we keep the current "bandwidth" and scale nrd to match what it's supposed to deliver, but then the bandwidth's statistical properties are incorrect. Maybe a solution could be to deprecate bandwidth() and replace it with a new name like blur() or something.

Here's an implementation that seems to work, based on the new d3.blur proposal.
https://observablehq.com/@fil/x-y-bandwidth-for-density-contours
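
To illustrate what blurring separately in x and y means, here is a deliberately naive sketch of a separable box blur with independent integer radii. It is not the code from the notebook, nor the d3.blur proposal, nor d3-contour's internals; the names `boxBlurAxis` and `blurXY` are hypothetical, and cells outside the grid are assumed to be zero.

```javascript
// One box-blur pass along a single axis of a row-major width x height grid.
// r is an integer radius; the window has 2 * r + 1 cells.
function boxBlurAxis(values, width, height, r, horizontal) {
  const out = new Float64Array(width * height);
  const length = horizontal ? width : height; // cells along the blurred axis
  const stride = horizontal ? 1 : width;      // index step along that axis
  for (let j = 0; j < height; ++j) {
    for (let i = 0; i < width; ++i) {
      const pos = horizontal ? i : j;
      let sum = 0;
      for (let k = -r; k <= r; ++k) {
        const p = pos + k;
        // Cells outside the grid contribute zero.
        if (p >= 0 && p < length) sum += values[j * width + i + k * stride];
      }
      out[j * width + i] = sum / (2 * r + 1);
    }
  }
  return out;
}

// Blur horizontally with radius rx, then vertically with radius ry.
function blurXY(values, width, height, rx, ry) {
  const blurredX = boxBlurAxis(values, width, height, rx, true);
  return boxBlurAxis(blurredX, width, height, ry, false);
}
```

Because the two passes are independent, using rx different from ry stretches the blur along one axis, which is what turns the "circles" into "ellipses".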

The remarks above still stand.

[Image: before]

[Image: after]

I figure that as a first step we should ship a version that accepts x/y bandwidths as inputs, and allow experimentation (this depends in turn on d3.blur, d3/d3-array#151).

For the nrd stuff, I'd wait for serious statisticians to test and validate the approach.