jaegertracing / jaeger-client-go

Problem

The probabilistic sampler non-uniformly selects traces to be sampled, increasing the probability of trace collisions. The probability of trace collisions greatly increases when using a very small sampling rate.

The probabilistic sampler relies on the fact that traceIDs are 63bit random number, and make a sampling decision without independently generating a random number, and instead checking whether the traceID < (samplingRate * 2^63). Note that only half the traceID, traceID.Low is used for this computation.

The probabilistic sampler computes a SamplingBoundary as follows:

jaeger-client-go/sampler.go

Line 107 in 2cf72ee

const maxRandomNumber = ^(uint64(1) << 63) // i.e. 0x7fffffffffffffff

jaeger-client-go/sampler.go

Line 130 in 2cf72ee

s.samplingBoundary = uint64(float64(maxRandomNumber) * s.samplingRate)

For the purposes of illustration, let's assume the smallest expressible samplingBoundary, 0x1

The probabilistic sampler computes whether a trace is sampled as follows:

jaeger-client-go/sampler.go

Line 145 in 2cf72ee

return s.samplingBoundary >= id.Low&maxRandomNumber, s.tags

As samplingBoundary is 0x1, this selection logic will only ever select traces where the traceID is 0x1.

While the probabilistic sampler select traces probabilistically, the range of traceIDs selected is coupled to the probability.

Because the traceIDs of traces sampled by the probabilistic sampler can only lie in a particular range, we can compute the number of traceIDs required to generate a collision by using the analysis of the birthday problem.

where n(p;b) denotes the number of random integers drawn from [1, 2^b] to optain a probability p that at least two numbers are the same.
and b is the number of bits used
and p is the probability of a collision

We see that if the number of bits is 32 (e.g. setting samplingProbability to 0.5), there only needs to be approx 30k traceIDs produced to satisfy a 0.1 probability of collision.

Proposed Solution

The probabilistic sampler could generate the sampling decision independently of the traceID, and thus not bias the range of selected traceIDs.

This will probably cause an increase in cpu utilization from random number generation and we may want to provide a way to revert to previous behavior. As other samplers like PerOperationSampler and GuaranteedThroughputProbabilisticSampler depend on this, the change could be pretty involved.

Similar discussion open-telemetry/oteps#135 (comment)

Probabilistic sampler causes trace collisions

Problem

Proposed Solution