Version 0.20210206 introduced significant overhead
whitehatboxer opened this issue · comments
I recently updated nginx-lua-prometheus in my project and found that the new version adds a lot of overhead on my machine.
As we all know, the library uses "Inf" as the label value of the last bucket of a Histogram so that keys sort in the order Prometheus needs. It replaces "Inf" with "+Inf" just before exposing the data.
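A minimal sketch of why "Inf" sorts where it is needed while "+Inf" would not. The metric name and the zero-padded bucket values are made up for illustration; the only thing relied on is that `table.sort` compares strings in byte order in the default C locale.

```lua
-- Hypothetical keys for illustration; names and bucket values are made up.
local function sorted_copy(keys)
  local copy = {}
  for i, k in ipairs(keys) do
    copy[i] = k
  end
  table.sort(copy)  -- string comparison (byte order in the default C locale)
  return copy
end

local with_inf = sorted_copy({
  'request_duration_bucket{le="Inf"}',
  'request_duration_bucket{le="00.500"}',
  'request_duration_bucket{le="10.000"}',
})

local with_plus_inf = sorted_copy({
  'request_duration_bucket{le="+Inf"}',
  'request_duration_bucket{le="00.500"}',
  'request_duration_bucket{le="10.000"}',
})

-- "I" sorts after the digits, so the "Inf" bucket ends up last.
print(with_inf[3])
-- "+" sorts before the digits, so a "+Inf" bucket would end up first.
print(with_plus_inf[1])
```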
MR #119 tried to remove leading and trailing zeros from bucket label values, so it changed the code that handled the "Inf" case. This feature was removed in version 0.20210206, but some of the code was left behind and seems to have a side effect.
Originally the code looked like this:
-- Replace "Inf" with "+Inf" in each metric's last bucket 'le' label.
if key:find('le="Inf"', 1, true) then
  key = key:gsub('le="Inf"', 'le="+Inf"')
end
As of commit 47d4db1 (MR #119):
key = format_bucket_when_expose(key)
...
local function format_bucket_when_expose(key)
  local part1, bucket, part2 = key:match('(.*[,{]le=")(.*)(".*)')
  if part1 == nil then
    return key
  end
  if bucket == "Inf" then
    return table.concat({part1, "+Inf", part2})
  else
    bucket = tostring(tonumber(bucket))
    -- In lua5.3 when decimal part is zero, tonumber would not turn float to int like <5.3,
    -- rather it would leave '.0' at the end. So trim it here.
    if (bucket:sub(-2, -1) == ".0") then
      bucket = bucket:sub(1, -3)
    end
    return table.concat({part1, bucket, part2})
  end
end
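To make the effect of the `tostring(tonumber(bucket))` step concrete, here is a small sketch of the normalisation on its own. The helper name `normalize_bucket` and the sample values are hypothetical and not taken from the library; the trimming mirrors the Lua 5.3 workaround in the code above.

```lua
-- Hypothetical helper: strip leading and trailing zeros from a bucket string.
local function normalize_bucket(bucket)
  local s = tostring(tonumber(bucket))
  -- Lua 5.3+ renders a float with an integral value as e.g. "5.0",
  -- while Lua 5.1 and LuaJIT render it as "5". Trim the suffix if present.
  if s:sub(-2) == ".0" then
    s = s:sub(1, -3)
  end
  return s
end

print(normalize_bucket("00.500"))  -- 0.5
print(normalize_bucket("05.000"))  -- 5
print(normalize_bucket("10.000"))  -- 10
```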
The commit in version 0.20210206 (857e1d9):
key = fix_histogram_bucket_labels(key)
...
local function fix_histogram_bucket_labels(key)
  local part1, bucket, part2 = key:match('(.*[,{]le=")(.*)(".*)')
  if part1 == nil then
    return key
  end
  if bucket == "Inf" then
    return table.concat({part1, "+Inf", part2})
  else
    return table.concat({part1, tostring(tonumber(bucket)), part2})
  end
end
Version 0.20210206 has to run string.match on every key, so it is much slower than the earlier version. Since it does not bring any new feature, why don't we fall back to the older (and faster) code?
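A rough way to compare the two approaches outside of nginx is a micro-benchmark like the sketch below. The key shape, the count of 20,000 and the helper names are assumptions for illustration, and timings will vary by machine and Lua implementation. One likely contributor to the cost: the pattern is not anchored, so for keys without an `le` label `string.match` retries the greedy `.*` from every start position before giving up.

```lua
-- Older approach: plain-text find, then gsub only when needed.
local function fix_with_find(key)
  if key:find('le="Inf"', 1, true) then
    key = key:gsub('le="Inf"', 'le="+Inf"')
  end
  return key
end

-- Approach from 0.20210206: pattern match on every key.
local function fix_with_match(key)
  local part1, bucket, part2 = key:match('(.*[,{]le=")(.*)(".*)')
  if part1 == nil then
    return key
  end
  if bucket == "Inf" then
    return table.concat({part1, "+Inf", part2})
  end
  return table.concat({part1, tostring(tonumber(bucket)), part2})
end

-- Hypothetical workload: 20,000 non-histogram keys (made-up names).
local keys = {}
for i = 1, 20000 do
  keys[#keys + 1] = string.format('http_requests_total{host="h%d",status="200"}', i)
end

local function time_it(fn)
  local start = os.clock()
  for _, key in ipairs(keys) do
    fn(key)
  end
  return os.clock() - start
end

print(string.format("find + gsub:  %.3fs", time_it(fix_with_find)))
print(string.format("string.match: %.3fs", time_it(fix_with_match)))
```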
> This feature was removed in version 0.20210206, but some of the code was left behind and seems to have a side effect.
Stripping leading and trailing zeroes was not removed in 0.20210206. It was simplified a bit, removing some logic that was only necessary on Lua 5.3 (which is not used for nginx-lua, as I understand). string.match is still the method used to parse bucket boundary values.
I'd be curious to know why you are seeing performance issues because of this. While string matching is expensive, it's not performed on the "hot path" and should only happen occasionally when metrics are served to Prometheus.
> I'd be curious to know why you are seeing performance issues because of this.
There are two reasons. First, our project is an API gateway with a large amount of traffic (2k QPS), so it is very sensitive to performance. Second, a single node can hold 10k+ metrics, sometimes even 20k+ (it may be bad practice, but that's the reality). That means 20k string matches each time metrics are served to Prometheus, during which the Nginx worker is completely exhausted.
Since the library is used in ngx_lua, I think it can use ngx.re.match instead of string.match, which reduced CPU time by more than 50% in my benchmark.
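A minimal sketch of what such a change might look like, not the patch that was actually applied. It assumes the code runs inside OpenResty, where the global `ngx` exists, and falls back to `string.match` elsewhere so it can still be exercised in plain Lua. The helper name `split_bucket_key` is hypothetical. It also assumes keys contain no raw newlines, since `.` in PCRE does not match a newline by default.

```lua
-- Assumption: inside OpenResty the global `ngx` provides ngx.re.match.
-- Outside of it this is nil and the string.match fallback is used.
local re_match = ngx and ngx.re and ngx.re.match

-- Hypothetical helper: split a key around the value of its 'le' label.
local function split_bucket_key(key)
  if re_match then
    -- "j" enables PCRE JIT compilation, "o" compiles the regex once and caches it.
    local m = re_match(key, '^(.*[,{]le=")(.*)(".*)$', "jo")
    if not m then
      return nil
    end
    return m[1], m[2], m[3]
  end
  return key:match('^(.*[,{]le=")(.*)(".*)$')
end

local function fix_histogram_bucket_labels(key)
  local part1, bucket, part2 = split_bucket_key(key)
  if part1 == nil then
    return key
  end
  if bucket == "Inf" then
    return part1 .. "+Inf" .. part2
  end
  return part1 .. tostring(tonumber(bucket)) .. part2
end

print(fix_histogram_bucket_labels('request_duration_bucket{le="Inf"}'))
```

Both patterns here are anchored with `^`, which differs from the quoted code. The captures should be the same because the leading `.*` already consumes the prefix, and anchoring avoids retrying from every start position when a key has no `le` label.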
> Stripping leading and trailing zeroes was not removed in 0.20210206.
I hadn't noticed that. But it really is a feature. I think I should use an old version.
Furthermore, our project had used an older version of this library for a long time, and it ran great until our metrics grew rapidly (20k+, as described above) because of a bad design. Requests to the API that serves metrics to Prometheus eventually took 2s (the duration was 1 minute), which slowed down our API gateway heavily. I used SystemTap to analyze it and drew a flame graph. I found that a lot of CPU time was spent in lj_str_new, called by string.format in metric_data() of this library. I could not reproduce the situation on my machine, so I haven't figured it out, but I found a related issue, openresty/luajit2#60. Have you faced this situation before? I can upload more information if needed.
Makes sense, thanks for providing more details.
I think using ngx.re.match instead of string.match is an obvious improvement, even though it will be a bit harder to test (it might require adding a check in the integration test). Would you like to send a PR?
You could also consider patching the library locally and removing label name rewriting, or just running an older version. If there are more folks who would like to be able to disable label rewriting for performance reasons, we could make it configurable (but it will increase overall complexity of the library).
I'd like to send a PR. Since I need to deal with the performance issue mentioned above anyway, I will send one along the way.
@unbeatablekb did you manage to patch it? We are running into the same issue. 10k+ metrics and 60%+ CPU usage while exporting metrics.
@scrwr I already fixed it in my code. Because I have had no time to add tests for it, I haven't submitted a PR.
If you provide the diff here, I can have a look at the rest.
I think this has been improved in #131, available in a new release (0.20220127).