kserve / modelmesh

Distributed Model Serving Framework

ByteBuf leak observed in production deployment

njhill opened this issue · comments

A few instances of this were reported recently:

2023-05-05 14:30:25
modelmesh-runtime ERROR LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records: 
Created at:
	io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:403)
	io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:188)
	io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:179)
	io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:140)
	com.ibm.watson.modelmesh.GrpcSupport$3.parse(GrpcSupport.java:403)
	com.ibm.watson.modelmesh.GrpcSupport$3.parse(GrpcSupport.java:318)
	io.grpc.MethodDescriptor.parseRequest(MethodDescriptor.java:307)
	io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailableInternal(ServerCallImpl.java:333)
	io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:316)
	io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:835)
	io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
	io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
	java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	java.base/java.lang.Thread.run(Thread.java:833)
	com.ibm.watson.litelinks.server.ServerRequestThread.run(ServerRequestThread.java:47)

This indicates a bug in either model-mesh or grpc-java. Circumstantially it may have been introduced by #84 - this needs investigation.
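For context, this is the contract the Netty leak detector enforces. A minimal standalone illustration (not the model-mesh code) of the allocation pattern being flagged:

import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.util.ResourceLeakDetector;

public class LeakIllustration {
    public static void main(String[] args) {
        // Paranoid level makes Netty track every allocation, so leaks are reported reliably.
        ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.PARANOID);

        ByteBuf buf = PooledByteBufAllocator.DEFAULT.ioBuffer(256);
        try {
            buf.writeBytes(new byte[] {1, 2, 3});
            // ... use the buffer ...
        } finally {
            // Skipping this release is what produces the
            // "LEAK: ByteBuf.release() was not called before it's garbage-collected"
            // record once the buffer is garbage-collected.
            buf.release();
        }
    }
}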

I had a look at the code. However, it'd be useful to know whether payload processing was enabled in such cases.
Besides that, it might be that we need to re-insert

if (response != null) {
    response.release();
}

within the finally block here.
While ReleaseAfterResponse.releaseAll() should do the trick, I am not seeing the ByteBuf being "added" to ReleaseAfterResponse in other parts of the code, hence there might be nothing actually getting released.
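To make the suggestion concrete, this is roughly the shape being described. ModelResponse and invokeModel are placeholders rather than the real model-mesh names, and whether the ByteBuf is ever registered with ReleaseAfterResponse is exactly the open question:

// Rough outline only; placeholder names, not the actual model-mesh method.
ModelResponse response = null;
try {
    response = invokeModel(request);          // produces a ByteBuf-backed response
    call.sendHeaders(response.metadata);
    call.sendMessage(response.data);
} finally {
    if (response != null) {
        response.release();                   // the release being proposed for re-insertion
    }
    ReleaseAfterResponse.releaseAll();        // only frees buffers that were registered with it
}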

@njhill do these logs refer to installations where any PayloadProcessors were enabled? Or is it also happening on "default" ones?

@tteofili no, this would have been just the default configuration.

Are there any steps to reproduce the observed behavior (or some data/logs/dumps to look at)?

After a first analysis, it seems that the ReleaseAfterResponse.releaseAll() call is not correctly decreasing the refCount; however, all attempts I have made to replace or decorate it with a different mechanism for releasing the response ByteBuf resulted in either decreasing the refCount to -1 or breaking the PayloadProcessor mechanism.
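For anyone following along, a standalone illustration of Netty's reference-counting rules (again, not model-mesh code), which shows why an extra unconditional release in the wrong place over-decrements the count:

import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;
import io.netty.util.IllegalReferenceCountException;

public class RefCountDemo {
    public static void main(String[] args) {
        ByteBuf buf = Unpooled.buffer(16);   // a new buffer starts with refCnt == 1
        buf.retain();                        // refCnt == 2
        buf.release();                       // refCnt == 1
        buf.release();                       // refCnt == 0, buffer is deallocated

        try {
            buf.release();                   // one release too many
        } catch (IllegalReferenceCountException e) {
            // Releasing past zero fails, which is what a "safety" release runs into
            // when another code path has already freed the buffer.
            System.out.println("over-released: " + e.getMessage());
        }
    }
}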

@tteofili no steps that I know of, but I am trying to get more complete logs to look at from a production occurrence.

After a first analysis, it seems that the ReleaseAfterResponse.releaseAll() call is not correctly decreasing the refCount; however, all attempts I have made to replace or decorate it with a different mechanism for releasing the response ByteBuf resulted in either decreasing the refCount to -1 or breaking the PayloadProcessor mechanism.

I think you're right that it must be in that area. The ReleaseAfterResponse.releaseAll() part hasn't changed, but there is a way I can see that it could leak in theory, which wasn't the case before - specifically, if any of the following lines throw:

            respReaderIndex = response.data.readerIndex();
            respSize = response.data.readableBytes();
            call.sendHeaders(response.metadata);
            call.sendMessage(response.data);

I'll open a PR to address that. I'm not sure that it's the problem, but it's the only way I can see that the changes in #84 could be implicated (and if so, it's my fault since this was part of the changes that I suggested!)
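A rough sketch of that guard, reusing the names from the snippet above (io.netty.util.ReferenceCountUtil is an existing Netty helper; the actual PR may structure this differently):

int respReaderIndex;
int respSize;
try {
    respReaderIndex = response.data.readerIndex();
    respSize = response.data.readableBytes();
    call.sendHeaders(response.metadata);
    call.sendMessage(response.data);
} catch (RuntimeException | Error t) {
    // On the failure path nothing downstream will release response.data,
    // so release it here to avoid the leak-detector report, then rethrow.
    ReferenceCountUtil.safeRelease(response.data);
    throw t;
}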