High-cardinality tags in Micrometer metrics
grzegorz-moto opened this issue · comments
The metric tags contain a random id that causes high memory usage and leads to a memory leak/exhaustion. This is due to the fact that Micrometer treats each new tag combination as a new metric.
Example: https://github.com/reactor/reactor-netty/blob/main/reactor-netty-core/src/main/java/reactor/netty/resources/MicrometerPooledConnectionProviderMeterRegistrar.java#L55
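To illustrate the failure mode, here is a minimal stand-in (plain Java, not Micrometer's actual implementation): a registry keyed by metric name plus tags, where every new random id tag value produces a distinct meter id entry. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class CardinalityDemo {
    // Simplified stand-in for a meter registry: a meter is identified by
    // its name plus its tag key/value pairs, so every new tag value
    // creates a brand-new entry that is never reclaimed.
    static final Map<String, Long> METERS = new HashMap<>();

    static void record(String name, String poolId) {
        METERS.merge(name + "{id=" + poolId + "}", 1L, Long::sum);
    }

    public static void main(String[] args) {
        // One short-lived pool per connection => a unique random id tag each time.
        for (int i = 0; i < 1000; i++) {
            record("reactor.netty.connection.provider.active.connections",
                   UUID.randomUUID().toString());
        }
        // 1000 distinct meter ids accumulate for what is logically one metric.
        System.out.println(METERS.size());
    }
}
```

With real Micrometer the same effect shows up as ever-growing numbers of io.micrometer.core.instrument.Meter$Id instances in the heap.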
@grzegorz-moto Please specify the Reactor Netty version. Also, why do you need so many connection pools, and do you need all of them operational all the time?
Version 1.0.25.
The application creates short-lived TCP connections to multiple targets, many of them at a time.
After some time the heap dump contains tons of instances of io.micrometer.core.instrument.Meter$Id
for
reactor.netty.connection.provider.active.connections
reactor.netty.connection.provider.total.connections
reactor.netty.connection.provider.max.connections
reactor.netty.connection.provider.pending.connections
reactor.netty.connection.provider.idle.connections
reactor.netty.connection.provider.max.pending.connections
where only the ID tag makes each one unique.
@grzegorz-moto Update at least to version 1.0.26, where we deregister the metrics for disposed connection pools (better yet, update to the latest one, 1.0.34).
https://github.com/reactor/reactor-netty/releases/tag/v1.0.26
#2608
Then ensure you have configured disposeInactivePoolsInBackground; this will dispose all inactive connection pools and deregister their metrics.
See more info on the configuration here: https://projectreactor.io/docs/netty/release/reference/index.html#_connection_pool
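A minimal configuration sketch of the advice above (the provider name and interval values are illustrative, not recommendations):

```java
import java.time.Duration;

import reactor.netty.resources.ConnectionProvider;

// Hypothetical sketch: enable background disposal of inactive pools,
// which also deregisters the meters registered for those pools.
ConnectionProvider provider = ConnectionProvider
        .builder("provider-with-metrics")        // assumed pool name
        .metrics(true)
        .disposeInactivePoolsInBackground(
                Duration.ofSeconds(30),          // background check interval
                Duration.ofSeconds(60))          // inactivity threshold before disposal
        .build();
```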
> where only the ID tag makes it unique
This is interesting, because the metrics have ID, Remote Address, and Name tags; ideally you should not see different ID values for the same Remote Address.
I have the ConnectionProvider and TcpClient configured in the following way:
private static final ConnectionProvider CONNECTION_PROVIDER = ConnectionProvider
        .builder("connection-provider-with-metrics")
        .metrics(true)
        .maxIdleTime(Duration.ofSeconds(60))
        .evictInBackground(Duration.ofSeconds(30))
        .disposeInactivePoolsInBackground(Duration.ofSeconds(30), Duration.ofSeconds(60))
        .build();

private static final TcpClient TCP_CLIENT = TcpClient
        .create(CONNECTION_PROVIDER)
        .wiretap(true)
        .metrics(true)
        .option(ChannelOption.SO_KEEPALIVE, true)
        .option(EpollChannelOption.TCP_KEEPCNT, SipTransportTcpConnectionConfiguration.getTcpKeepCnt())
        .option(EpollChannelOption.TCP_KEEPIDLE, SipTransportTcpConnectionConfiguration.getTcpKeepIdle())
        .option(EpollChannelOption.TCP_KEEPINTVL, SipTransportTcpConnectionConfiguration.getTcpKeepIntvl())
        .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, CONNECT_TIMEOUT)
        .doOnConnect(bootstrap -> log.debug("Trying to connect to the remote host [{}]",
                NetworkLoggingUtility.createRemoteHostKeyValue(bootstrap.remoteAddress().get())));
private Mono<? extends Connection> connect(InetSocketAddress socketAddress) {
    log.info("Establishing new TCP connection to [{}]", socketAddress);
    return TCP_CLIENT
            .remoteAddress(() -> socketAddress)
            .doOnConnected(this::addHandlers)
            .doOnConnected(tcpHandler::handleConnection)
            .doOnDisconnected(tcpHandler::handleDisconnected)
            .observe(tcpHandler.connectionObserver())
            .connect()
            .retryWhen(CONNECTION_RETRY_SPEC)
            .onErrorMap(e -> new TransportFailureException(TransportFailureException.FailureStatus.CONNECTION_FAILURE, e));
}
and in onDisconnected(), dispose() is called on the connection. Without this manual dispose, the connections seem to leak.
After all these changes I'm still observing that the number of instances of io.micrometer.core.instrument.Meter$Id is constantly growing.
Versions I'm using:
"io.projectreactor:reactor-bom:2022.0.9"
"io.netty:netty-bom:4.1.96.Final"
@grzegorz-moto Is it necessary for the code below NOT to be included in the common configuration?
.doOnConnected(this::addHandlers)
.doOnConnected(tcpHandler::handleConnection)
.doOnDisconnected(tcpHandler::handleDisconnected)
.observe(tcpHandler.connectionObserver())
@grzegorz-moto I'm closing this one; we can reopen it when you are able to reply to the comment above.