sugarme / gotch

Go binding for Pytorch C++ API (libtorch)


Memory Leak in JIT Model under Multi-Goroutine Environment

yinziyang opened this issue · comments

I have encountered a memory leak when executing a JIT model in a multi-goroutine environment. With a single goroutine, memory usage appears normal, stabilizing around 1GB. However, when multiple goroutines are launched (e.g., 10), memory usage rapidly exceeds 10GB and keeps increasing quickly.

Below is the code to reproduce the issue:

package main

import (
    "encoding/json"
    "log"
    "os"
    "time"

    "github.com/sugarme/gotch"
    "github.com/sugarme/gotch/nn"
    "github.com/sugarme/gotch/pickle"
    "github.com/sugarme/gotch/ts"
    "github.com/sugarme/gotch/vision"
)

// getModel loads the resnet18 model.
func getModel() (net nn.FuncT) {
    modelName := "resnet18"
    url, ok := gotch.ModelUrls[modelName]
    if !ok {
        panic("Unsupported model name")
    }
    modelFile, err := gotch.CachedPath(url)
    if err != nil {
        panic(err)
    }
    vs := nn.NewVarStore(gotch.CPU)
    net = vision.ResNet18NoFinalLayer(vs.Root())

    err = pickle.LoadAll(vs, modelFile)
    if err != nil {
        panic(err)
    }

    return
}

// getTensor generates a test tensor.
func getTensor() (tensor *ts.Tensor) {
    b, err := os.ReadFile("test.data")
    if err != nil {
        panic(err)
    }

    var data []float32
    err = json.Unmarshal(b, &data)
    if err != nil {
        panic(err)
    }

    tensor = ts.MustOfSlice(data).MustView([]int64{3, 224, 224}, true)
    tensor = tensor.MustUnsqueeze(0, false)
    return
}

func main() {

    // Load resnet18 model
    net := getModel()

    // Generate test tensor
    tensor := getTensor()
    defer tensor.MustDrop()

    // Launch goroutines
    var goroutineNum = 10
    // When a single goroutine is used, the memory usage appears normal, long-term occupying 1GB.
    // When multiple goroutines are launched, e.g., 10, memory usage quickly exceeds 10GB, and continues to increase rapidly.
    for i := 0; i < goroutineNum; i++ {
        go func(net nn.FuncT) {
            for {
                log.Println(net.ForwardT(tensor, false))
            }
        }(net)
    }

    time.Sleep(5 * time.Minute)
}

Steps to Reproduce:

  1. Load the resnet18 model using the getModel function.
  2. Generate a test tensor using the getTensor function.
  3. Launch multiple goroutines that continuously call the ForwardT method on the net object, and observe the memory usage.

Expected Behavior:
The memory usage should remain stable regardless of the number of goroutines launched.

Actual Behavior:
The memory usage rapidly increases when multiple goroutines are launched, indicating a potential memory leak issue.

Environment:

  • Go version: (1.21.3)
  • Gotch version: (v0.9.0)
  • OS: (e.g., Ubuntu 22.04)

Any assistance on this issue would be greatly appreciated. Thank you!

@yinziyang ,

Thanks for the report. However, I had a quick look and see 2-3 things that may cause the memory to blow up:

  1. The last operation in your getTensor func should be tensor = tensor.MustUnsqueeze(0, true) (true) so that the existing tensor is deleted before the new one is assigned; otherwise it leaks here.
  2. In the goroutine for loop, you run net.ForwardT(tensor), which returns a tensor; that tensor should be deleted after being used by log.Println(), otherwise it leaks here as well.
  3. When doing a forward pass in inference mode, you should wrap it in ts.NoGrad(), otherwise autograd state will build up (not really a memory leak, but hidden tensors).

Please try those things and see how things go. Thanks.
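
A minimal sketch of how those three points could combine in the goroutine loop, reusing the getModel/getTensor helpers from your snippet (a sketch only, not tested against this exact setup):

go func(net nn.FuncT) {
    for {
        ts.NoGrad(func() {
            // ForwardT returns a new tensor; drop it after use so its
            // underlying C memory is released.
            result := net.ForwardT(tensor, false)
            log.Println(result)
            result.MustDrop()
        })
    }
}(net)

And in getTensor, the last line becomes tensor = tensor.MustUnsqueeze(0, true) so the intermediate tensor is dropped.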

@sugarme

Thank you for your response. I have adjusted my code according to your suggestions, but the memory usage still keeps increasing. Below is my latest code:

package main

import (
    "encoding/json"
    "os"
    "time"

    "github.com/sugarme/gotch"
    "github.com/sugarme/gotch/nn"
    "github.com/sugarme/gotch/pickle"
    "github.com/sugarme/gotch/ts"
    "github.com/sugarme/gotch/vision"
)

func getModel() (net nn.FuncT) {
    modelName := "resnet18"
    url, ok := gotch.ModelUrls[modelName]
    if !ok {
        panic("Unsupported model name")
    }
    modelFile, err := gotch.CachedPath(url)
    if err != nil {
        panic(err)
    }

    vs := nn.NewVarStore(gotch.CPU)
    net = vision.ResNet18NoFinalLayer(vs.Root())

    err = pickle.LoadAll(vs, modelFile)
    if err != nil {
        panic(err)
    }

    return
}

func getTensor() (tensor *ts.Tensor) {
    b, err := os.ReadFile("test.data")
    if err != nil {
        panic(err)
    }

    var data []float32
    err = json.Unmarshal(b, &data)
    if err != nil {
        panic(err)
    }

    tensor = ts.MustOfSlice(data).MustView([]int64{3, 224, 224}, true)
    tensor = tensor.MustUnsqueeze(0, true)
    return
}

func main() {

    net := getModel()

    tensor := getTensor()
    defer tensor.MustDrop()

    var goroutineNum = 10
    for i := 0; i < goroutineNum; i++ {
        go func(net nn.FuncT) {
            for {
                ts.NoGrad(func() {
                    result := net.ForwardT(tensor, false)
                    result.MustDrop()
                })
            }
        }(net)
    }

    time.Sleep(5 * time.Minute)
}

When calling the model in multiple goroutines, a lot of warning messages appear, as follows:

2023/10/30 11:54:50 WARNING: Probably double free tensor "Conv2d_000235087". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "BatchNorm_000235091". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Relu_000235100". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Relu_000235098". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "BatchNorm_000235215". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Relu_000235245". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Relu_000235395". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Conv2d_000235566". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Relu_000235609". Called from "ts.Drop()". Just skipping...

@yinziyang ,

Probably you should create a model for each goroutine then. Actually, I have never tried to run one model concurrently like that. I guess there will be a lot of data collisions as all goroutines feed into a single model.

I created a model for each goroutine and used the corresponding model inside each goroutine, but the problem persists.

package main

import (
    "encoding/json"
    "os"
    "time"

    "github.com/sugarme/gotch"
    "github.com/sugarme/gotch/nn"
    "github.com/sugarme/gotch/pickle"
    "github.com/sugarme/gotch/ts"
    "github.com/sugarme/gotch/vision"
)

func getModel() (net nn.FuncT) {
    modelName := "resnet18"
    url, ok := gotch.ModelUrls[modelName]
    if !ok {
        panic("Unsupported model name")
    }
    modelFile, err := gotch.CachedPath(url)
    if err != nil {
        panic(err)
    }

    vs := nn.NewVarStore(gotch.CPU)
    net = vision.ResNet18NoFinalLayer(vs.Root())

    err = pickle.LoadAll(vs, modelFile)
    if err != nil {
        panic(err)
    }

    return
}

func getTensor() (tensor *ts.Tensor) {
    b, err := os.ReadFile("test.data")
    if err != nil {
        panic(err)
    }

    var data []float32
    err = json.Unmarshal(b, &data)
    if err != nil {
        panic(err)
    }

    tensor = ts.MustOfSlice(data).MustView([]int64{3, 224, 224}, true)
    tensor = tensor.MustUnsqueeze(0, true)
    return
}

func main() {

    var goroutineNum = 10

    var nets []nn.FuncT
    for i := 0; i < goroutineNum; i++ {
        nets = append(nets, getModel())
    }

    tensor := getTensor()
    defer tensor.MustDrop()

    for i := 0; i < goroutineNum; i++ {
        net := nets[i]
        go func(net nn.FuncT) {
            for {
                ts.NoGrad(func() {
                    result := net.ForwardT(tensor, false)
                    result.MustDrop()
                })
            }
        }(net)
    }

    time.Sleep(5 * time.Minute)
}

@yinziyang ,

I will try to reproduce your problem when I have time this week. However, your latest go func() should not take an argument then.

What about something like this:

for i := 0; i < goroutineNum; i++ {
    go func() {
        net := getModel()
        tensor := getTensor()
        ts.NoGrad(func() {
            result := net.ForwardT(tensor, false)
            result.MustDrop()
        })
        tensor.MustDrop()
    }()
}

The memory usage still keeps increasing; the key code is as follows:

for i := 0; i < goroutineNum; i++ {
    go func() {
        // goroutine model
        net := getModel()

        // test input tensor
        tensor := getTensor()
        defer tensor.MustDrop()

        // stress test to observe memory increase
        for {
            ts.NoGrad(func() {
                result := net.ForwardT(tensor, false)

                // drop result tensor
                result.MustDrop()
            })
        }
    }()
}

@sugarme

I understand now: I seem to have found a bug in tensor.go that causes some tensors not to be released.

This is the old code:

	atomic.AddInt64(&TensorCount, 1)
	nbytes := x.nbytes()
	atomic.AddInt64(&AllocatedMem, nbytes)

	lock.Lock()
	if _, ok := ExistingTensors[name]; ok {
		name = fmt.Sprintf("%s_%09d", name, TensorCount)
	}
	ExistingTensors[name] = struct{}{}
	lock.Unlock()

Changed to:

	tensorCount := atomic.AddInt64(&TensorCount, 1)
	nbytes := x.nbytes()
	atomic.AddInt64(&AllocatedMem, nbytes)

	lock.Lock()
	if _, ok := ExistingTensors[name]; ok {
		name = fmt.Sprintf("%s_%09d", name, tensorCount)
	}
	ExistingTensors[name] = struct{}{}
	lock.Unlock()
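
For context on why the old code leaks under concurrency (my reading of the warnings above, so treat it as an assumption): atomic.AddInt64 bumps TensorCount, but the name suffix is built from a separate read of the shared counter, so two goroutines can observe the same value and register identical names in ExistingTensors. The first Drop removes the only map entry and later drops are skipped as "double free", so those tensors are never released. A small standalone sketch of the pattern with hypothetical names (counter, racyName, safeName are illustrative, not gotch identifiers):

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

var counter int64

// racyName mirrors the old code: the increment is atomic, but the suffix is
// read from the shared counter afterwards, so concurrent callers can end up
// with the same number.
func racyName(prefix string) string {
	atomic.AddInt64(&counter, 1)
	return fmt.Sprintf("%s_%09d", prefix, counter)
}

// safeName mirrors the fix: AddInt64 returns the post-increment value, which
// is unique to this caller.
func safeName(prefix string) string {
	n := atomic.AddInt64(&counter, 1)
	return fmt.Sprintf("%s_%09d", prefix, n)
}

func main() {
	seen := make(map[string]bool)
	var mu sync.Mutex
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			name := racyName("Conv2d") // swap in safeName and the collisions disappear
			mu.Lock()
			if seen[name] {
				fmt.Println("collision:", name)
			}
			seen[name] = true
			mu.Unlock()
		}()
	}
	wg.Wait()
}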

I just realized that you had already fixed this last week, but I wasn't using your latest code. The problem is resolved now; this issue can be closed.

@yinziyang ,

Thanks for reporting.