asynkron / protoactor-go

Proto Actor - Ultra fast distributed actors for Go, C# and Java/Kotlin

Home Page:http://proto.actor

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Memory leak when actor metrics are enabled

antoniomacri opened this issue · comments

Describe the bug
Enabling actor metrics:

	system := actor.NewActorSystem(actor.WithMetricProviders(global.MeterProvider()))

causes a memory leak when actors are stopped.

To Reproduce
Run the following main:

package main

import (
	"fmt"
	"github.com/asynkron/protoactor-go/actor"
	"github.com/google/uuid"
	"go.opentelemetry.io/otel/metric/global"
	"log"
	"math/rand"
	"net/http"
	_ "net/http/pprof"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	system := actor.NewActorSystem(actor.WithMetricProviders(global.MeterProvider()))

	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	for i := 0; i < 20_000; i++ {
		message := &MyData{id: uuid.New().String(), value: randomString(100_000)}
		props := actor.PropsFromProducer(func() actor.Actor {
			return &myActor{data: make(map[string]*MyData)}
		})
		pid, err := system.Root.SpawnNamed(props, message.id)
		if err != nil && err != actor.ErrNameExists {
			fmt.Printf("Error while sending message to actor=%v err=%v\n", pid, err)
		}
		system.Root.Send(pid, message)
		if err := system.Root.PoisonFuture(pid).Wait(); err != nil {
			fmt.Printf("Error while terminating actor=%v err=%v\n", pid, err)
		}
		time.Sleep(5 * time.Millisecond)
	}
	fmt.Println("-- Done.")

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
	<-sigs
}

type myActor struct {
	data map[string]*MyData
}

type MyData struct {
	id    string
	value string
}

func (a *myActor) Receive(ctx actor.Context) {
	switch msg := ctx.Message().(type) {
	case *MyData:
		a.data[msg.id] = msg
	}
}

func randomString(length int) string {
	const letterBytes = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
	b := make([]byte, length)
	for i := range b {
		b[i] = letterBytes[rand.Intn(len(letterBytes))]
	}
	return string(b)
}

A few GB of RAM are retained even though the actors are stopped immediately after sending the data message.

By replacing

	system := actor.NewActorSystem(actor.WithMetricProviders(global.MeterProvider()))

with

	system := actor.NewActorSystem()

the memory leak disappears.

Expected behavior
The memory is released when the actors are stopped.

Additional context
In props.go the default spawner contains this code:

		ctx := newActorContext(actorSystem, props, parentContext.Self())
		mb := props.produceMailbox()

		// prepare the mailbox number counter
		if ctx.actorSystem.Config.MetricsProvider != nil {
			sysMetrics, ok := ctx.actorSystem.Extensions.Get(extensionId).(*Metrics)
			if ok && sysMetrics.enabled {
				if instruments := sysMetrics.metrics.Get(metrics.InternalActorMetrics); instruments != nil {
					sysMetrics.PrepareMailboxLengthGauge()
					meter := global.Meter(metrics.LibName)
					if err := meter.RegisterCallback([]instrument.Asynchronous{instruments.ActorMailboxLength}, func(goCtx context.Context) {
						instruments.ActorMailboxLength.Observe(goCtx, int64(mb.UserMessageCount()), sysMetrics.CommonLabels(ctx)...)
					}); err != nil {
						err = fmt.Errorf("failed to instrument Actor Mailbox, %w", err)
						plog.Error(err.Error(), log.Error(err))
					}
				}
			}
		}

Specifically the closurefunc(goCtx context.Context) captures mb and ctx, and is not unregistered after the actor is stopped.

This seems to be the root cause of the memory leak.

Also, it's not clear to me how that metric (ActorMailboxLength) should work.
Shouldn't it sum the mailbox length for all the living actors?
It instead seems to replace the gauge value with the mailbox length of the last called back actor.

I'm keen to help with this issue.

Any hint on how to fix the leak and correctly implement the mailbox metric?

Thanks for the detailed report.
I have to dive into this to see what the memory issue could be

It should be due to the registered callback for each actor, which keeps references to its mailbox and context via the closure.

So changing the callback logic should also fix the leak.

I'm thinking about registering a single callback in the OpenTelemetry Meter. Go implementation of the Meter doesn't provide a way to unregister the callback (although this seems mandated by the spec...). Therefore this should be handled explicitly. WDYT?

Thanks for the response.