Clarify tracking of highest and lowest trackable values

Question

marshallpierce opened this issue 7 years ago

See #74 (comment) for more context.

We save in fields (and write when serializing) the requested limits, not the actual limits that result from the underlying encoding (which will encompass at least as much as what the user requested, and maybe more). Perhaps we should expose the actual limits of what a particular histogram can do, rather than just regurgitate the limits that the user requested? This would be useful when, say, storing metadata about histograms, since the data actually in the histogram is likely more interesting than the particular subset of values that were initially requested as trackable.

Strawman:

configured_low() for what the user requested when creating the histogram
actual_low() for what the histogram can support
configured_high() and actual_high(), correspondingly, for the upper limit
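
To illustrate how the actual upper limit can exceed the requested one, here is a minimal sketch. It assumes the usual HdrHistogram layout, in which each additional bucket doubles the covered range. The helpers `unit_magnitude`, `sub_bucket_count`, and `actual_high` are hypothetical names for this illustration only, not existing API.

```rust
/// Hypothetical helper: floor(log2(low)). Assumes `low >= 1`.
fn unit_magnitude(low: u64) -> u32 {
    63 - low.leading_zeros()
}

/// Hypothetical helper: sub-buckets needed for `sigfig` significant decimal
/// digits, rounded up to a power of two. Assumes a small `sigfig` (0 to 5).
fn sub_bucket_count(sigfig: u32) -> u64 {
    (2 * 10u64.pow(sigfig)).next_power_of_two()
}

/// Sketch of the largest value covered once enough buckets have been added
/// to reach `configured_high`.
fn actual_high(configured_low: u64, configured_high: u64, sigfig: u32) -> u64 {
    // The first bucket covers [0, sub_bucket_count << unit_magnitude).
    let mut smallest_untrackable =
        sub_bucket_count(sigfig) << unit_magnitude(configured_low);
    // Each extra bucket doubles the covered range.
    while smallest_untrackable <= configured_high {
        if smallest_untrackable > u64::MAX / 2 {
            // Doubling again would overflow, so the whole u64 range is covered.
            return u64::MAX;
        }
        smallest_untrackable <<= 1;
    }
    smallest_untrackable - 1
}
```

Under these assumptions, `actual_high(1, 1_000_000, 3)` evaluates to 1_048_575, which is larger than the 1_000_000 that was requested.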

Marshall Pierce · Answer 1 · Fri Dec 22 2017 12:10:54 GMT+0800 (China Standard Time)

In particular for actual_low(), note that we don't want the value for index 0. That index would represent every value in [0, 1 << unit_magnitude), which isn't very useful since it doesn't obey the precision guarantees of the rest of the data structure. Instead, we want the value of index 1, which should be 1 << unit_magnitude.
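
A minimal sketch of that calculation follows. It assumes unit_magnitude is the floor of log2 of the configured low, and `actual_low` is the hypothetical name from the strawman above rather than an existing method.

```rust
/// floor(log2(configured_low)). Assumes `configured_low >= 1`.
fn unit_magnitude(configured_low: u64) -> u32 {
    63 - configured_low.leading_zeros()
}

/// Hypothetical `actual_low`: the value at index 1, which is the smallest
/// value covered by the precision guarantees.
fn actual_low(configured_low: u64) -> u64 {
    1u64 << unit_magnitude(configured_low)
}
```

With a configured low of 1000 the unit magnitude is 9, so `actual_low(1000)` is 512. With a configured low of 1 the result is 1.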

Jon Gjengset · Answer 2 · Sun Dec 24 2017 02:01:48 GMT+0800 (China Standard Time)

Are configured_low and configured_high even necessary?

Marshall Pierce · Answer 3 · Sun Dec 24 2017 04:44:35 GMT+0800 (China Standard Time)

Nope, they aren't, but we have to keep track of those numbers for serialization anyway. We don't need to ship them in v1, so to speak, but we should probably leave room for them name-wise in case it turns out that something needs them. The use cases I could imagine are things like generic display tools for histograms that might want to list all the metadata or things like that.

Jon Gjengset · Answer 4 · Sun Dec 24 2017 18:17:30 GMT+0800 (China Standard Time)

Hmm.. I'm partial to exposing the actual low and high as low and high, and then maybe exposing the configured values through a more verbose method.

Marshall Pierce · Answer 5 · Sun Dec 24 2017 22:09:20 GMT+0800 (China Standard Time)

I have some reservations about that nomenclature.

We've already used those names for different concepts.
It's not very clear when reading foo.low() exactly what that means. A user who hasn't read the API docs (of the current version!) may be forgiven for thinking that the number returned is the min value recorded thus far, or the configured low.

I don't think those are deal breakers, so if your heart is set on low() and high(), I can live with it, but if we're breaking (at least semantically) backwards compatibility, I think we could just as well go with names that don't have those downsides.

Jon Gjengset · Answer 6 · Mon Dec 25 2017 18:19:10 GMT+0800 (China Standard Time)

Well, we already have low, min, max, and high. I'm proposing that low and high be changed to return a more reasonable value than they currently do. One alternative would be to deprecate low and high, and instead introduce a range method which returns (actual_low, actual_high), plus a configured_range method which returns the configured low/high?
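
As a rough sketch of what that alternative could look like, using a toy stand-in type: `ToyHistogram` and its fields are hypothetical, and only the method names `range` and `configured_range` come from the proposal.

```rust
/// Toy stand-in for a histogram; only the four limits are modeled.
struct ToyHistogram {
    configured_low: u64,
    configured_high: u64,
    actual_low: u64,
    actual_high: u64,
}

impl ToyHistogram {
    /// Limits the histogram can actually represent, as (low, high).
    fn range(&self) -> (u64, u64) {
        (self.actual_low, self.actual_high)
    }

    /// Limits that were requested at construction time, as (low, high).
    fn configured_range(&self) -> (u64, u64) {
        (self.configured_low, self.configured_high)
    }
}
```

Returning a tuple keeps the two limits together, so a caller cannot mix an actual low with a configured high by accident.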

Marshall Pierce · Answer 7 · Wed Dec 27 2017 04:50:16 GMT+0800 (China Standard Time)

Hm, I do kinda like returning something more structured -- what about perhaps even taking it one level further and exposing a struct that has all that sort of stuff in it (min, max, unit magnitude, etc)? That would help reduce the rather imposing number of methods on Histogram if we could funnel all that stuff through a helper type.

Jon Gjengset · Answer 8 · Wed Dec 27 2017 18:02:54 GMT+0800 (China Standard Time)

Yeah, that's not a bad idea. Something like Histogram::statistics() -> HistogramStatistics. I guess we can bikeshed the name a little. I also like metrics and measure. Another question is whether or not we want to include measurements of the data (like max and min), and if so, where do we draw the line? Do we also want to start measuring (and report) the average? Number of samples?

Marshall Pierce · Answer 9 · Fri Dec 29 2017 23:28:36 GMT+0800 (China Standard Time)

I think maybe we could divide the fields in Histogram up as follows:

Implementation details like leading_zero_count_base: semantically derived from other fields, but useful optimizations for fast path calculations.
Stuff that's dependent on the values stored, like max_value and total_count.
Stuff that's dependent only on initial config and not in the first group, like highest_trackable_value.

Assuming that doesn't end up seeming odd in practice, we could bundle up the last group and expose that to users.
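
A minimal sketch of that split, with one struct per group. The type names, the `config` accessor, and the fields `lowest_discernible_value`, `significant_figures`, and `unit_magnitude` are assumptions made for illustration; the other field names are taken from the list above.

```rust
/// Group 3: depends only on the initial configuration. This is the part
/// that could be bundled up and exposed to users.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Config {
    lowest_discernible_value: u64,
    highest_trackable_value: u64,
    significant_figures: u8,
}

/// Group 1: implementation details derived from the configuration and
/// cached for fast-path calculations.
struct Derived {
    unit_magnitude: u8,
    leading_zero_count_base: u8,
}

/// Group 2: depends on the values recorded so far.
struct DataStats {
    max_value: u64,
    total_count: u64,
}

/// Toy histogram assembled from the three groups.
struct ToyHistogram {
    config: Config,
    derived: Derived,
    stats: DataStats,
}

impl ToyHistogram {
    /// Hypothetical accessor returning a copy of the configuration group.
    fn config(&self) -> Config {
        self.config
    }
}
```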

Jon Gjengset · Answer 10 · Mon Nov 30 2020 04:00:58 GMT+0800 (China Standard Time)

While going through my email, I came across some relevant notes from @giltene on the subject. The first is HdrHistogram/hdrhistogram-go#23 (comment). The second is from Gitter:

I like it, you’ve certainly made the lowest discernible changes clear. This is one of those situations where method overloading or optional parameters would come in handy, as allowing people to set lowestDiscernibleValue in the relatively rare cases where they need it to be non-1 creates a significant opportunity for confusion, which does not exist to the same level in languages where the commonly used histogram constructors don’t even use a lowest discernible argument.... my golang-fu is not strong enough to know what a good idiom to use here would be, if you wanted to encourage histogram creation that does not specify this parameter.

I think that a matching parameter name change to the one discussed here for golang would be appropriate for the C implementation, for the same user-confusion reasons. The difference between lowest discernible value and lowest trackable value may seem like a small semantic one, but I think there is a big difference in the likelihood of the API user misinterpreting the implications of setting the parameter to values other than 1.

Ditto for the Python implementation: I’d suggest changing lowest_trackable_value to lowest_discernible_value to cause people to think harder about the meaning and not jump to the assumption that it is simply the “min” of the range they want to cover.
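
To make the distinction concrete, here is a simplified sketch of how values might be quantized when the parameter is not 1. It assumes values in the first bucket are resolved in units of 1 << unit_magnitude. The function `lowest_equivalent_in_first_bucket` is a hypothetical helper, valid only for values that fall in that first bucket, and is not any implementation's API.

```rust
/// floor(log2(lowest_discernible_value)). Assumes the argument is >= 1.
fn unit_magnitude(lowest_discernible_value: u64) -> u32 {
    63 - lowest_discernible_value.leading_zeros()
}

/// Simplified: the smallest value that cannot be told apart from `value`.
/// Only meaningful for values in the first bucket.
fn lowest_equivalent_in_first_bucket(value: u64, lowest_discernible_value: u64) -> u64 {
    let shift = unit_magnitude(lowest_discernible_value);
    // Dropping the low `shift` bits merges every value within the same unit.
    (value >> shift) << shift
}
```

With a setting of 1000, the values 1 and 500 both map to 0 in this model, so they are merged with each other rather than treated as out of range. The setting behaves as a resolution, not as the minimum of the range.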