selection.select and data inheritance.
scameron opened this issue · comments
I'm hesitant to open this as an issue because I know the behaviour of having selection.select
inherit data from corresponding nodes in the parent selection is totally deliberate. But I think it's at least worth discussing so here goes.
This side effect of select
is something that confused me when I was first starting out with d3. And I regularly find myself explaining the behaviour to colleagues who are surprised by it. These explanations never seems to stop at the "what happens" but invariably continue into queries of "why would it do that"? I take this repeating pattern as a red flag that there is a rough edge in the API.
I think the confusion stems from 2 points:
- The term
select
has strong connotations of being read-only from popular usage in SQL and other query languages. This seems to run pretty deep for people (myself included). - All other instances of
select
andselectAll
in d3 besidesselection.select
are read only.
I understand that changing this would be a pretty severe backward compatibility break, but just as a thought experiment, do you see any other issues with the following?:
- Make all
select/selectAll
operations to be read only only. - Introduce
selection.update(selector)
to handle the update workflow where you truly do want to propagate data across nodes. This would essentially work just likeselection.select
does today.
It's maybe not as seamless but there is something nice (to me) about having mutating operations be explicit even if it means sacrificing some convenience.
BTW, I posted a StackOverflow question a little while ago, but that was more targeted at techniques for working around practical issues we were hitting. It adds some relevant context to this topic, though, so here's a link.
I assume you’ve read them already, but this is covered pretty well in Working with Selections (and earlier, Nested Selections). The key graf being:
Only selectAll has special behavior regarding grouping; select preserves the existing grouping. The select method differs because there is exactly one element in the new selection for each element in the old selection. Thus, select also propagates data from parent to child, whereas selectAll does not (hence the need for a data-join)!
Further, if you look at the implementation of selection.append and selection.insert, you’ll see there are just simple wrappers on top of selection.select. And I assume you’re not arguing to remove data inheritance from selection.append and selection.insert.
While anecdotal, in my usage I typically do want selection.select to propagate data automatically. This is because selection.select implies there is a 1-to-1 relationship between the parent and child, and so it is almost always the case that the parent and child are bound to the same data. Thus either the propagation of data is a noop (the child already has the same data) or it is desired, as when the parent’s data is updated and you then want to update a child:
var parent = d3.selectAll("element") // select some existing elements
.data(data) // join them to new data
.attr("foo", function(d) { return d.foo; }); // update parent attr to reflect new data
parent.select("child")
.attr("bar", function(d) { return d.bar; }); // update child attr to reflect new data
The idea is that parent.select is parallel to the parent.append you would do on enter:
var parent = d3.selectAll("element") // assuming no existing elements
.data(data) // join them to new data
.enter().append("element")
.attr("foo", function(d) { return d.foo; }); // initialize parent attr
parent.append("child")
.attr("bar", function(d) { return d.bar; }); // initialize child attr
If we removed data inheritance from selection.select, then you have to make the propagation of data to the child explicit, either by forcing a second data join on the child selection (child.data(data)
) or by introducing a new selection method that behaves like selection.select except for the fact that it also propagates data (your selection.update). Then people would likely misapply selection.select when they wanted the data to propagate. This is especially true if we change the behavior of the method that has existed since D3 was first launched! But any new API adds surface area and thus has to be weighed carefully.
I think this discussion would be improved if you could describe a specific example where selection.select propagating data is harmful. I recall vaguely at least one instance in the past where I didn’t want selection.select to propagate data, but I’ve forgotten the specifics, and I’m pretty sure the fix was a straightforward switch to selection.selectAll and a more restrictive selector (e.g., "child:first-child", or "child.special").
I’m going to close this issue, if only because this particular ship sailed three years ago and changing selection.select now would be highly disruptive, even for a new major version. But, I look forward to your reply, and perhaps there are other less disruptive ways we can address this confusion. Or maybe you’ll still change my mind, but don’t count on it. :)
Thanks a lot for the reply. You're right, I didn't really think there was any cheese down this tunnel but you never know what kind of ideas pop up when you talk it through so I thought it worthwhile. (Actually, as I reached the end of typing this I actually did think of one potential idea but it's buried at the end. :) )
I'll try to explain a little bit about the scenario we're working with to give some context around how we bump up against this.
The first point that's somewhat special is that we don't control the DOM. We have a visualization library that renders into a provided container element. Each visualization instance should not know about anything higher in the tree than the container element. This means we need to be careful to always contextualize selections under this container. d3.select
and d3.selectAll
with selector strings are more or less not used. Sure, we could use very specific selectors to ensure we don't pick up stray nodes, but the safest thing of all is to just use selection.select
and selection.selectAll
everywhere as it is guaranteed to search downward only.
Inside the visualization we have a couple of our own levels in the DOM (although over time we've been progressively flattening our SVG as we fight against the lack of explicit z-ordering!). Each visualization can be comprised of multiple plots and each plot can be comprised of many marks. The code is factored to separate responsibilities so, for example, a particular mark renderer is passed a "plot" container <g>
element and it does it's work inside this container. As with the higher level, the mark renderer doesn't know about anything in the visualization outside of its current plot.
So, therein, as they say, lies the rub. The following seemingly contradictory things (at least from the perspective of data inheritance) are true:
- Using container elements combined with separation of responsibilities in the code means element nesting and, correspondingly, nested selections are the most convenient way to constrain search visibility.
- In terms of bound data, there is little or no relationship between the data bound to a container node and the data bound to the children.
Here's an example:
<svg>
<g class="viz">
<g class="plot">
<defs>
... linear gradients, masks, usable elements
</defs>
<g class="datapoint">
... mark shapes and so on
</g>
.... more datapoints
</g>
</g>
</svg>
So, the plot <g>
has some plot-specific metadata bound to it. And the <defs>
node has some pre-calculated information needed by the defs on it. Along comes a developer who wants get the defs node as a selection. He knows there's only one defs node so, not deeply understanding the nuances of selection grouping semantics, his first instinct was to use plotContainer.select
instead of plotContainer.selectAll
. Whammo, now the defs node has plot metadata on it.
One thing that is true is that we definitely know very clearly which levels in the DOM have a parent-child data relationship and which ones don't. And the type of relationship between nodes never changes. If there was some way to indicate on the node that it's the end-of-line in terms of data inheritance, I think that would work for us. Do you think that would cause problems elsewhere? I suppose there would need to be something special done for insert
and append
to get around it when creating nodes...
Of course I would be very interested in hearing any other ideas you have either for d3 enhancements or techniques we can use to work around the issue. The technique we're using currently is called "teach people to not use selection.select unless they want data inheritance". That works too. :)
I think the main thing you’re doing that’s different than what I commonly do is how you treat data. As far as encapsulation goes, I think of data as something that’s public, like an input argument to a function, rather than the encapsulated private state of a component. To me this is close to the idea of “data-driven documents”, in that the data is driving the transformation of the document. That’s why in my piece on reusable charts I bind data to an element and then use selection.call to apply the component.
There are still cases, like with the axis component, where the data is rebound. But this is when there is a one-to-many relationship, as in the relationship between the parent axis and its main ticks.
Perhaps related, the axis component also has private state. But this is stored on element.__chart__
rather than in element.__data__
. The axis needs to store a copy of the scale at the state in which the axis was rendered so that when a transition is needed, the axis can transition the entering nodes using the old scale. (A copy is necessary because the state of the scale outside the axis is lost by the time the transition is requested.)
Of course, I suspect changing how you store data in your components would be pretty disruptive as well, but I’m curious to know what you think.
One thing I should clarify first is that when I talk about a developer making an error with selection.select
I'm talking about a developer implementing the internals of the library, not a developer consuming the library. Consumers never need to worry about this level of detail. In fact, consumers don't even have to use d3 if they don't wish to (although, really, who would not want to use d3?!? :) ).
There are essentially 2 kinds of data we deal with. There's the data fed in by the caller, which as you say is absolutely public and treated as an input argument. Then there's data used internally by the implementation for various supporting purposes. Maybe this second type of data is derived from the first type or maybe it's just made up to drive some supporting structure of the visualization.
With the first type there is no problem. There is generally a clear mapping between data elements and visual artifacts. As you say, this is truly the idea of a data-driven document where the data and the document are (to some extent) a reflection of one another.
The second type is a bit different, though. This data is really an internal implementation detail that the outside world doesn't (and shouldn't) know about. You could argue that driving DOM structure using this kind of data is somewhat artificial and that's probably technically true. But the data joining and selection mechanisms in d3 are so convenient and powerful that people want to use them everywhere. So if I want to create a <defs>
element I might choose to just bind an arbitrary single element array to create it. And if later I want some internal data available to the children of that defs element, a logical next step is to change my arbitrary-value single element array into an object-value single element array. Yes, it's internal state and it could (maybe should?) be persisted on a different property (like in your element.__chart__
example) but I don't think this would be many people's first instinct.
Looking a bit at the implementation of the d3.svg.axis
component, I was able to find one example where you're doing something along these lines:
var path = g.selectAll(".domain").data([ 0 ]);
This uses an arbitrary 0 value but you could imagine binding some internal data about the path here. And if you do g.select(".domain")
you would clobber that data. But this is a case where the .domain
path doesn't have any relationship to the data bound to g
. In fact, you could make this same argument for nearly all of the internal data joins in axis because the data for ticks and subticks are derived from the scale argument, not any data bound to the container selection.
var path = g.selectAll(".domain").data([0]);
This data-join is more about creating the domain path if it doesn’t exist than it is about selecting the domain path without propagating data; none of the code depends on the data that is bound to the domain path. And the ticks have different bound data because it’s a 1-to-many relationship with the axis.
You raise an interesting point about multiple levels of abstraction / encapsulation. In most of my usage I make very little effort at encapsulation because I’m typically building a one-off graphic. Even my example of the reusable chart component is somewhat theoretical as there’s little point in building a chart component for a one-off. The closest I get to encapsulation in practice is loose coupling via custom events.
The axis component and the brush component are also (limited) encapsulations. Both of these have some internal state which is captured by closure (how the components are configured). The axis component has some additional state that is hidden in the DOM, but in the semi-private __chart__
property rather than as data. Neither of these components are truly data-driven in the sense that their display depends solely on their configuration, not on bound data. Whereas in the reusable chart proposal, the chart depends both on the configuration and on the bound data, and thus the bound data is considered part of the public API.
Anyway, it’s difficult to speculate in the abstract about how you should use data, so I’m simply describing alternative different approaches. If you were designing a library in the most D3-ish way, using data joins and selections rather than the conventional way of instantiating each chart separately, then I’d recommend storing internal data in a separate property rather than risk clobbering bound data. Or perhaps this would apply to the design of components within your library, even if your library doesn’t expose the underlying D3-based implementation.
Regardless of what design you choose, if you’re using D3, then being cognizant of how data is bound to elements is essential. The implicit way in which selection.select propagates data from parent to child perhaps encourages people to overlook this relationship. But, I still think propagation of data in the 1-to-1 case is the more common behavior, and in addition to my aversion for changing a core API, I also feel like it would be more tedious to propagate the data explicitly, and trade one confusion for another. (I’ve seen confusion in the past where on binding data to a parent, people are surprised that the child data is not automatically updated. Implementing this automatically would require a DOM traversal whenever data is bound!)
I could maybe see introducing a way to select the first matching element without propagating data, say as a psuedo-selector in the style of jQuery ("element:first") or as a new method (selection.selectFirst). Yet this wouldn’t really address the original confusion, since you’d still need to understand the effect of selection.select on data to know to avoid it. And if you know this, you could just as easily construct a selector more carefully than learn the new API.
Regardless of what design you choose, if you’re using D3, then being cognizant of how data is bound to elements is essential. The implicit way in which selection.select propagates data from parent to child perhaps encourages people to overlook this relationship. But, I still think propagation of data in the 1-to-1 case is the more common behavior, and in addition to my aversion for changing a core API, I also feel like it would be more tedious to propagate the data explicitly, and trade one confusion for another. (I’ve seen confusion in the past where on binding data to a parent, people are surprised that the child data is not automatically updated. Implementing this automatically would require a DOM traversal whenever data is bound!)
I agree with this. Understanding the details of selections and data binding is essential. A bug based on unexpected behaviour is really just a learning opportunity. :) That sounds sarcastic but there is definitely some truth in it. I didn't understand the nuances of the 1-to-1 replacement nature of select
nearly as well as I do now after I needed to dig in to figure out our inheritance problem. And that is certainly a positive outcome for me.
I could maybe see introducing a way to select the first matching element without propagating data, say as a psuedo-selector in the style of jQuery ("element:first") or as a new method (selection.selectFirst). Yet this wouldn’t really address the original confusion, since you’d still need to understand the effect of selection.select on data to know to avoid it. And if you know this, you could just as easily construct a selector more carefully than learn the new API.
No, I totally agree that a new API would only shuffle the issue around (not to mention adding a subtle variant to an existing API). Since changing selection.select
at this stage is not feasible, I would leave the selection part of things alone.
What do you think about the idea of marking data as "uninheritable" at data bind time? Maybe with an additional (optional) argument on the selection.data
call that let you indicate that you want to bind this data as the end-of-the-line. The flag would be an explicit indication on the DOM node itself that __data__
on the parent node and __data__
on the child node are not related and selection.select
would not automatically propagate this data. I supposed selection.append
and selection.insert
would need to somehow ignore the flag to work correctly.
Not a great answer, but one way you can prevent data inheritance is to have an intermediate node without bound data.
var intermediary = selection.append("div")
.datum(function() { return null; });
Then, any select from the intermediary wouldn’t propagate data from the parent selection. But of course, the intermediate node in the DOM is somewhat unfortunate.
That's a clever workaround. Not sure I could live with the extra DOM node, though. :)
In the d3-graphviz library there is never a 1-to-1 relationship between parents and children and it is never the case that the parent and child are bound to the same data. Thus the propagation of data is never desired.
Since the application where I use d3-graphviz frequently wants to access nodes and edges of the generated graph without affecting bound data, I've added an extension to d3-selection, selection.selectWithoutDataPropagation which does d3.select(this.node().querySelector(name))
I also added a note to the library documentation to avoid the use of selection.select when building applications with d3-graphviz in order to avoid propagating incorrect data by mistake.
Would you we willing to accept a PR to d3-selection adding this function? If so, would you prefer a different name, e.g selectPure?
Even if this function is simple enough to implement outside d3-selection, its pure existence would further highlight that selection.select does more than just pure selection.