sugarme / gotch

Go binding for Pytorch C++ API (libtorch)


Memory Leak in JIT Model under Multi-Goroutine Environment

yinziyang opened this issue · comments

I have encountered a memory leak when executing a JIT model in a multi-goroutine environment. With a single goroutine, memory usage appears normal, stabilizing around 1GB. However, when multiple goroutines are launched (e.g., 10), memory usage rapidly exceeds 10GB and keeps increasing quickly.

Below is the code to reproduce the issue:

package main

import (
    "encoding/json"
    "log"
    "os"
    "time"

    "github.com/sugarme/gotch"
    "github.com/sugarme/gotch/nn"
    "github.com/sugarme/gotch/pickle"
    "github.com/sugarme/gotch/ts"
    "github.com/sugarme/gotch/vision"
)

// getModel loads the resnet18 model.
func getModel() (net nn.FuncT) {
    modelName := "resnet18"
    url, ok := gotch.ModelUrls[modelName]
    if !ok {
        panic("Unsupported model name")
    }
    modelFile, err := gotch.CachedPath(url)
    if err != nil {
        panic(err)
    }
    vs := nn.NewVarStore(gotch.CPU)
    net = vision.ResNet18NoFinalLayer(vs.Root())

    err = pickle.LoadAll(vs, modelFile)
    if err != nil {
        panic(err)
    }

    return
}

// getTensor generates a test tensor.
func getTensor() (tensor *ts.Tensor) {
    b, err := os.ReadFile("test.data")
    if err != nil {
        panic(err)
    }

    var data []float32
    err = json.Unmarshal(b, &data)
    if err != nil {
        panic(err)
    }

    tensor = ts.MustOfSlice(data).MustView([]int64{3, 224, 224}, true)
    tensor = tensor.MustUnsqueeze(0, false)
    return
}

func main() {

    // Load resnet18 model
    net := getModel()

    // Generate test tensor
    tensor := getTensor()
    defer tensor.MustDrop()

    // Launch goroutines
    var goroutineNum = 10
    // When a single goroutine is used, the memory usage appears normal, long-term occupying 1GB.
    // When multiple goroutines are launched, e.g., 10, memory usage quickly exceeds 10GB, and continues to increase rapidly.
    for i := 0; i < goroutineNum; i++ {
        go func(net nn.FuncT) {
            for {
                log.Println(net.ForwardT(tensor, false))
            }
        }(net)
    }

    time.Sleep(5 * time.Minute)
}

Steps to Reproduce:

  1. Load the resnet18 model using the getModel function.
  2. Generate a test tensor using the getTensor function.
  3. Launch multiple goroutines that continuously call the ForwardT method on the net object, and observe the memory usage.

Expected Behavior:
The memory usage should remain stable regardless of the number of goroutines launched.

Actual Behavior:
The memory usage rapidly increases when multiple goroutines are launched, indicating a potential memory leak issue.

Environment:

  • Go version: (1.21.3)
  • Gotch version: (v0.9.0)
  • OS: (e.g., Ubuntu 22.04)

Any assistance on this issue would be greatly appreciated. Thank you!

@yinziyang ,

Thanks for the report. However, I had a quick look and see 2-3 things that may cause the memory to blow up:

  1. The last operation in your getTensor func should be tensor = tensor.MustUnsqueeze(0, true) (true) so that the existing tensor is deleted before the new one is assigned; otherwise it leaks here.
  2. In the goroutine for loop, you run net.ForwardT(tensor), which returns a tensor; that tensor should be deleted after being used by log.Println(), otherwise it leaks here as well.
  3. When doing a forward pass in inference mode, you should wrap it in ts.NoGrad(), otherwise autograd state will build up (not really a memory leak, but hidden tensors).

Please try those things and see how things go. Thanks.
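
A minimal sketch of how those three points could combine in the goroutine loop, reusing the getModel/getTensor helpers from your snippet (a sketch only, not tested against this exact setup):

go func(net nn.FuncT) {
    for {
        ts.NoGrad(func() {
            // ForwardT returns a new tensor; drop it after use so its
            // underlying C memory is released.
            result := net.ForwardT(tensor, false)
            log.Println(result)
            result.MustDrop()
        })
    }
}(net)

And in getTensor, the last line becomes tensor = tensor.MustUnsqueeze(0, true) so the intermediate tensor is dropped.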

@sugarme

Thank you for your response. I have adjusted my code according to your suggestions, but the memory usage still keeps increasing. Below is my latest code:

package main

import (
    "encoding/json"
    "os"
    "time"

    "github.com/sugarme/gotch"
    "github.com/sugarme/gotch/nn"
    "github.com/sugarme/gotch/pickle"
    "github.com/sugarme/gotch/ts"
    "github.com/sugarme/gotch/vision"
)

func getModel() (net nn.FuncT) {
    modelName := "resnet18"
    url, ok := gotch.ModelUrls[modelName]
    if !ok {
        panic("Unsupported model name")
    }
    modelFile, err := gotch.CachedPath(url)
    if err != nil {
        panic(err)
    }

    vs := nn.NewVarStore(gotch.CPU)
    net = vision.ResNet18NoFinalLayer(vs.Root())

    err = pickle.LoadAll(vs, modelFile)
    if err != nil {
        panic(err)
    }

    return
}

func getTensor() (tensor *ts.Tensor) {
    b, err := os.ReadFile("test.data")
    if err != nil {
        panic(err)
    }

    var data []float32
    err = json.Unmarshal(b, &data)
    if err != nil {
        panic(err)
    }

    tensor = ts.MustOfSlice(data).MustView([]int64{3, 224, 224}, true)
    tensor = tensor.MustUnsqueeze(0, true)
    return
}

func main() {

    net := getModel()

    tensor := getTensor()
    defer tensor.MustDrop()

    var goroutineNum = 10
    for i := 0; i < goroutineNum; i++ {
        go func(net nn.FuncT) {
            for {
                ts.NoGrad(func() {
                    result := net.ForwardT(tensor, false)
                    result.MustDrop()
                })
            }
        }(net)
    }

    time.Sleep(5 * time.Minute)
}

When calling the model in multiple goroutines, a lot of warning messages appear, as follows:

2023/10/30 11:54:50 WARNING: Probably double free tensor "Conv2d_000235087". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "BatchNorm_000235091". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Relu_000235100". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Relu_000235098". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "BatchNorm_000235215". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Relu_000235245". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Relu_000235395". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Conv2d_000235566". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Relu_000235609". Called from "ts.Drop()". Just skipping...

@yinziyang ,

Probably you should create a model for each goroutine then. Actually, I have never tried to run one model concurrently like that. I guess there will be a lot of data collisions as all goroutines feed into a single model.

I created a model for each goroutine and used the corresponding model inside each goroutine, but the problem persists.

package main

import (
    "encoding/json"
    "os"
    "time"

    "github.com/sugarme/gotch"
    "github.com/sugarme/gotch/nn"
    "github.com/sugarme/gotch/pickle"
    "github.com/sugarme/gotch/ts"
    "github.com/sugarme/gotch/vision"
)

func getModel() (net nn.FuncT) {
    modelName := "resnet18"
    url, ok := gotch.ModelUrls[modelName]
    if !ok {
        panic("Unsupported model name")
    }
    modelFile, err := gotch.CachedPath(url)
    if err != nil {
        panic(err)
    }

    vs := nn.NewVarStore(gotch.CPU)
    net = vision.ResNet18NoFinalLayer(vs.Root())

    err = pickle.LoadAll(vs, modelFile)
    if err != nil {
        panic(err)
    }

    return
}

func getTensor() (tensor *ts.Tensor) {
    b, err := os.ReadFile("test.data")
    if err != nil {
        panic(err)
    }

    var data []float32
    err = json.Unmarshal(b, &data)
    if err != nil {
        panic(err)
    }

    tensor = ts.MustOfSlice(data).MustView([]int64{3, 224, 224}, true)
    tensor = tensor.MustUnsqueeze(0, true)
    return
}

func main() {

    var goroutineNum = 10

    var nets []nn.FuncT
    for i := 0; i < goroutineNum; i++ {
        nets = append(nets, getModel())
    }

    tensor := getTensor()
    defer tensor.MustDrop()

    for i := 0; i < goroutineNum; i++ {
        net := nets[i]
        go func(net nn.FuncT) {
            for {
                ts.NoGrad(func() {
                    result := net.ForwardT(tensor, false)
                    result.MustDrop()
                })
            }
        }(net)
    }

    time.Sleep(5 * time.Minute)
}

@yinziyang ,

I will try to reproduce your problem when I have time this week. However, your latest go func() should not take an argument then.

What about something like this:

for i := 0; i < goroutineNum; i++ {
    go func() {
        net := getModel()
        tensor := getTensor()
        ts.NoGrad(func() {
            result := net.ForwardT(tensor, false)
            result.MustDrop()
        })
        tensor.MustDrop()
    }()
}

The memory usage still keeps increasing; the key code is as follows:

for i := 0; i < goroutineNum; i++ {
    go func() {
        // goroutine model
        net := getModel()

        // test input tensor
        tensor := getTensor()
        defer tensor.MustDrop()

        // stress test to observe memory increase
        for {
            ts.NoGrad(func() {
                result := net.ForwardT(tensor, false)

                // drop result tensor
                result.MustDrop()
            })
        }
    }()
}

@sugarme

I understand now: I seem to have found a bug in tensor.go that causes some tensors not to be released.

This is the old code:

	atomic.AddInt64(&TensorCount, 1)
	nbytes := x.nbytes()
	atomic.AddInt64(&AllocatedMem, nbytes)

	lock.Lock()
	if _, ok := ExistingTensors[name]; ok {
		name = fmt.Sprintf("%s_%09d", name, TensorCount)
	}
	ExistingTensors[name] = struct{}{}
	lock.Unlock()

Changed to:

	tensorCount := atomic.AddInt64(&TensorCount, 1)
	nbytes := x.nbytes()
	atomic.AddInt64(&AllocatedMem, nbytes)

	lock.Lock()
	if _, ok := ExistingTensors[name]; ok {
		name = fmt.Sprintf("%s_%09d", name, tensorCount)
	}
	ExistingTensors[name] = struct{}{}
	lock.Unlock()
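
For context on why the old code leaks under concurrency (my reading of the warnings above, so treat it as an assumption): atomic.AddInt64 bumps TensorCount, but the name suffix is built from a separate read of the shared counter, so two goroutines can observe the same value and register identical names in ExistingTensors. The first Drop removes the only map entry and later drops are skipped as "double free", so those tensors are never released. A small standalone sketch of the pattern with hypothetical names (counter, racyName, safeName are illustrative, not gotch identifiers):

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

var counter int64

// racyName mirrors the old code: the increment is atomic, but the suffix is
// read from the shared counter afterwards, so concurrent callers can end up
// with the same number.
func racyName(prefix string) string {
	atomic.AddInt64(&counter, 1)
	return fmt.Sprintf("%s_%09d", prefix, counter)
}

// safeName mirrors the fix: AddInt64 returns the post-increment value, which
// is unique to this caller.
func safeName(prefix string) string {
	n := atomic.AddInt64(&counter, 1)
	return fmt.Sprintf("%s_%09d", prefix, n)
}

func main() {
	seen := make(map[string]bool)
	var mu sync.Mutex
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			name := racyName("Conv2d") // swap in safeName and the collisions disappear
			mu.Lock()
			if seen[name] {
				fmt.Println("collision:", name)
			}
			seen[name] = true
			mu.Unlock()
		}()
	}
	wg.Wait()
}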

I just realized that you had already fixed this last week, but I wasn't using your latest code. The problem is resolved now; this issue can be closed.

@yinziyang ,

Thanks for reporting.