Probabilistic sampler causes trace collisions
vprithvi opened this issue · comments
Problem
The probabilistic sampler non-uniformly selects traces to be sampled, increasing the probability of trace collisions. The probability of trace collisions greatly increases when using a very small sampling rate.
The probabilistic sampler relies on the fact that traceIDs are 63bit random number, and make a sampling decision without independently generating a random number, and instead checking whether the traceID < (samplingRate * 2^63). Note that only half the traceID, traceID.Low
is used for this computation.
The probabilistic sampler computes a SamplingBoundary
as follows:
Line 107 in 2cf72ee
Line 130 in 2cf72ee
For the purposes of illustration, let's assume the smallest expressible samplingBoundary, 0x1
The probabilistic sampler computes whether a trace is sampled as follows:
Line 145 in 2cf72ee
As samplingBoundary
is 0x1
, this selection logic will only ever select traces where the traceID is 0x1
.
While the probabilistic sampler select traces probabilistically, the range of traceIDs selected is coupled to the probability.
Because the traceIDs of traces sampled by the probabilistic sampler can only lie in a particular range, we can compute the number of traceIDs required to generate a collision by using the analysis of the birthday problem.
where n(p;b) denotes the number of random integers drawn from [1, 2^b] to optain a probability p that at least two numbers are the same.
and b is the number of bits used
and p is the probability of a collision
We see that if the number of bits is 32 (e.g. setting samplingProbability
to 0.5
), there only needs to be approx 30k traceIDs produced to satisfy a 0.1
probability of collision.
Proposed Solution
The probabilistic sampler could generate the sampling decision independently of the traceID, and thus not bias the range of selected traceIDs.
This will probably cause an increase in cpu utilization from random number generation and we may want to provide a way to revert to previous behavior. As other samplers like PerOperationSampler
and GuaranteedThroughputProbabilisticSampler
depend on this, the change could be pretty involved.
Similar discussion open-telemetry/oteps#135 (comment)