Memory Leak in JIT Model under Multi-Goroutine Environment
yinziyang opened this issue
I have encountered a memory leak issue when executing a JIT model under a multi-goroutine environment. When a single goroutine is used, the memory usage appears to be normal, stabilizing around 1GB. However, when multiple goroutines are launched (e.g., 10), the memory usage rapidly exceeds 10GB and continues to increase at a fast pace.
Below is the code to reproduce the issue:
```go
package main

import (
	"encoding/json"
	"log"
	"os"
	"time"

	"github.com/sugarme/gotch"
	"github.com/sugarme/gotch/nn"
	"github.com/sugarme/gotch/pickle"
	"github.com/sugarme/gotch/ts"
	"github.com/sugarme/gotch/vision"
)

// getModel loads the resnet18 model.
func getModel() (net nn.FuncT) {
	modelName := "resnet18"
	url, ok := gotch.ModelUrls[modelName]
	if !ok {
		panic("Unsupported model name")
	}
	modelFile, err := gotch.CachedPath(url)
	if err != nil {
		panic(err)
	}
	vs := nn.NewVarStore(gotch.CPU)
	net = vision.ResNet18NoFinalLayer(vs.Root())
	err = pickle.LoadAll(vs, modelFile)
	if err != nil {
		panic(err)
	}
	return
}

// getTensor generates a test tensor.
func getTensor() (tensor *ts.Tensor) {
	b, err := os.ReadFile("test.data")
	if err != nil {
		panic(err)
	}
	var data []float32
	err = json.Unmarshal(b, &data)
	if err != nil {
		panic(err)
	}
	tensor = ts.MustOfSlice(data).MustView([]int64{3, 224, 224}, true)
	tensor = tensor.MustUnsqueeze(0, false)
	return
}

func main() {
	// Load resnet18 model
	net := getModel()
	// Generate test tensor
	tensor := getTensor()
	defer tensor.MustDrop()

	// Launch goroutines.
	// With a single goroutine, memory usage appears normal, staying around 1GB long-term.
	// With multiple goroutines (e.g., 10), memory usage quickly exceeds 10GB
	// and continues to increase rapidly.
	var goroutineNum = 10
	for i := 0; i < goroutineNum; i++ {
		go func(net nn.FuncT) {
			for {
				log.Println(net.ForwardT(tensor, false))
			}
		}(net)
	}
	time.Sleep(5 * time.Minute)
}
```
Steps to Reproduce:
- Load the resnet18 model using the `getModel` function.
- Generate a test tensor using the `getTensor` function.
- Launch goroutines that continuously call the `ForwardT` method on the `net` object, and observe the memory usage.
Expected Behavior:
The memory usage should remain stable regardless of the number of goroutines launched.
Actual Behavior:
The memory usage rapidly increases when multiple goroutines are launched, indicating a potential memory leak issue.
Environment:
- Go version: 1.21.3
- Gotch version: v0.9.0
- OS: (e.g., Ubuntu 22.04)
Any assistance on this issue would be greatly appreciated. Thank you!
Thanks for the report. I had a quick look and see two or three things that may cause memory to blow up:

- The last operation in your `getTensor` func should be `tensor = tensor.MustUnsqueeze(0, true)`. Passing `true` deletes the existing tensor before assigning the new one; otherwise there is a leak here.
- In the goroutine's loop, `net.ForwardT(tensor, false)` returns a tensor, and that tensor should be deleted after being used by `log.Println()`, otherwise it leaks here as well.
- When doing a `forward` pass in inference mode, you should wrap it inside `ts.NoGrad()`; otherwise autograd state builds up (not really a memory leak, but hidden tensors).

Please try those changes and see how things go. Thanks.
Thank you for your response. I have adjusted my code according to your suggestions, but the memory usage still keeps increasing. Below is my latest code:
```go
package main

import (
	"encoding/json"
	"os"
	"time"

	"github.com/sugarme/gotch"
	"github.com/sugarme/gotch/nn"
	"github.com/sugarme/gotch/pickle"
	"github.com/sugarme/gotch/ts"
	"github.com/sugarme/gotch/vision"
)

func getModel() (net nn.FuncT) {
	modelName := "resnet18"
	url, ok := gotch.ModelUrls[modelName]
	if !ok {
		panic("Unsupported model name")
	}
	modelFile, err := gotch.CachedPath(url)
	if err != nil {
		panic(err)
	}
	vs := nn.NewVarStore(gotch.CPU)
	net = vision.ResNet18NoFinalLayer(vs.Root())
	err = pickle.LoadAll(vs, modelFile)
	if err != nil {
		panic(err)
	}
	return
}

func getTensor() (tensor *ts.Tensor) {
	b, err := os.ReadFile("test.data")
	if err != nil {
		panic(err)
	}
	var data []float32
	err = json.Unmarshal(b, &data)
	if err != nil {
		panic(err)
	}
	tensor = ts.MustOfSlice(data).MustView([]int64{3, 224, 224}, true)
	tensor = tensor.MustUnsqueeze(0, true)
	return
}

func main() {
	net := getModel()
	tensor := getTensor()
	defer tensor.MustDrop()

	var goroutineNum = 10
	for i := 0; i < goroutineNum; i++ {
		go func(net nn.FuncT) {
			for {
				ts.NoGrad(func() {
					result := net.ForwardT(tensor, false)
					result.MustDrop()
				})
			}
		}(net)
	}
	time.Sleep(5 * time.Minute)
}
```
When calling the model in multiple goroutines, a lot of warning messages appear, as follows:
```
2023/10/30 11:54:50 WARNING: Probably double free tensor "Conv2d_000235087". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "BatchNorm_000235091". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Relu_000235100". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Relu_000235098". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "BatchNorm_000235215". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Relu_000235245". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Relu_000235395". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Conv2d_000235566". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Relu_000235609". Called from "ts.Drop()". Just skipping...
```
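For what it's worth, warnings like these are what one would expect if two live tensors ended up registered under the same internal name: dropping the first removes the shared bookkeeping entry, and dropping the second finds nothing, warns, and "just skips", leaving its native memory unreleased. Below is a stdlib-only simulation of that failure mode; the `registry` type and its methods are hypothetical illustrations, not gotch's actual implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a toy stand-in for name-keyed tensor bookkeeping.
// It assumes every live tensor has a unique name.
type registry struct {
	mu   sync.Mutex
	live map[string]struct{}
}

func newRegistry() *registry {
	return &registry{live: make(map[string]struct{})}
}

// add registers a tensor name. A name collision silently overwrites the
// previous entry, so two live tensors now share one bookkeeping slot.
func (r *registry) add(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.live[name] = struct{}{}
}

// drop releases a name. If the entry is already gone it warns and skips,
// mirroring the "Probably double free tensor ... Just skipping" message.
// It returns false when it had to skip.
func (r *registry) drop(name string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.live[name]; !ok {
		fmt.Printf("WARNING: Probably double free tensor %q. Just skipping...\n", name)
		return false
	}
	delete(r.live, name)
	return true
}

func main() {
	r := newRegistry()

	// Two concurrently created tensors happen to get the same name.
	r.add("Conv2d_000235087")
	r.add("Conv2d_000235087") // collision: overwrites the first entry

	fmt.Println("first drop ok: ", r.drop("Conv2d_000235087")) // frees the shared slot
	fmt.Println("second drop ok:", r.drop("Conv2d_000235087")) // warning + skip
}
```

In this toy model the second tensor's backing memory is never freed, which matches the thread's symptom: skipped drops plus steadily growing memory.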
You should probably create a model for each goroutine then. Actually, I have never tried running concurrency against one model like that. I would guess there will be a lot of data collisions when all goroutines feed into a single model.
I created a model for each goroutine, and used the corresponding model when calling within the goroutine, but there are still issues.
```go
package main

import (
	"encoding/json"
	"os"
	"time"

	"github.com/sugarme/gotch"
	"github.com/sugarme/gotch/nn"
	"github.com/sugarme/gotch/pickle"
	"github.com/sugarme/gotch/ts"
	"github.com/sugarme/gotch/vision"
)

func getModel() (net nn.FuncT) {
	modelName := "resnet18"
	url, ok := gotch.ModelUrls[modelName]
	if !ok {
		panic("Unsupported model name")
	}
	modelFile, err := gotch.CachedPath(url)
	if err != nil {
		panic(err)
	}
	vs := nn.NewVarStore(gotch.CPU)
	net = vision.ResNet18NoFinalLayer(vs.Root())
	err = pickle.LoadAll(vs, modelFile)
	if err != nil {
		panic(err)
	}
	return
}

func getTensor() (tensor *ts.Tensor) {
	b, err := os.ReadFile("test.data")
	if err != nil {
		panic(err)
	}
	var data []float32
	err = json.Unmarshal(b, &data)
	if err != nil {
		panic(err)
	}
	tensor = ts.MustOfSlice(data).MustView([]int64{3, 224, 224}, true)
	tensor = tensor.MustUnsqueeze(0, true)
	return
}

func main() {
	var goroutineNum = 10
	var nets []nn.FuncT
	for i := 0; i < goroutineNum; i++ {
		nets = append(nets, getModel())
	}

	tensor := getTensor()
	defer tensor.MustDrop()

	for i := 0; i < goroutineNum; i++ {
		net := nets[i]
		go func(net nn.FuncT) {
			for {
				ts.NoGrad(func() {
					result := net.ForwardT(tensor, false)
					result.MustDrop()
				})
			}
		}(net)
	}
	time.Sleep(5 * time.Minute)
}
```
I will try to reproduce your problem when I have time this week. However, your latest `go func()` should not take an input parameter then. What about something like this:
```go
for i := 0; i < goroutineNum; i++ {
	go func() {
		net := getModel()
		tensor := getTensor()
		ts.NoGrad(func() {
			result := net.ForwardT(tensor, false)
			result.MustDrop()
		})
		tensor.MustDrop()
	}()
}
```
The memory usage still keeps increasing; the key code is as follows:
```go
for i := 0; i < goroutineNum; i++ {
	go func() {
		// per-goroutine model
		net := getModel()
		// test input tensor
		tensor := getTensor()
		defer tensor.MustDrop()
		// stress test to observe memory growth
		for {
			ts.NoGrad(func() {
				result := net.ForwardT(tensor, false)
				// drop the result tensor
				result.MustDrop()
			})
		}
	}()
}
```
I understand now. I seem to have found a bug in `tensor.go` that causes some tensors not to be released.

This is the old code:
```go
atomic.AddInt64(&TensorCount, 1)
nbytes := x.nbytes()
atomic.AddInt64(&AllocatedMem, nbytes)

lock.Lock()
if _, ok := ExistingTensors[name]; ok {
	name = fmt.Sprintf("%s_%09d", name, TensorCount)
}
ExistingTensors[name] = struct{}{}
lock.Unlock()
```
Change it to:
```go
tensorCount := atomic.AddInt64(&TensorCount, 1)
nbytes := x.nbytes()
atomic.AddInt64(&AllocatedMem, nbytes)

lock.Lock()
if _, ok := ExistingTensors[name]; ok {
	name = fmt.Sprintf("%s_%09d", name, tensorCount)
}
ExistingTensors[name] = struct{}{}
lock.Unlock()
```
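The root cause is subtle: the old code re-reads the shared `TensorCount` after the atomic increment, so under concurrency two goroutines can both increment and then both observe the same counter value, producing identical "deduplicated" names. The fixed code uses the value returned by `atomic.AddInt64`, which is unique per call. The following self-contained sketch contrasts the two patterns; the `distinctFixed`/`distinctRacy` helpers are illustrative, not gotch code.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// distinctFixed increments a shared counter from several goroutines using
// the value RETURNED by atomic.AddInt64 (the fixed pattern) and reports
// how many distinct values were observed. It is always workers*perWorker.
func distinctFixed(workers, perWorker int) int {
	var counter int64
	seen := make(map[int64]struct{})
	var mu sync.Mutex
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				v := atomic.AddInt64(&counter, 1) // return value is unique per call
				mu.Lock()
				seen[v] = struct{}{}
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return len(seen)
}

// distinctRacy does the same but re-reads the counter after the add
// (the old pattern). Two goroutines can increment and then both observe
// the same value, so the result is typically LESS than workers*perWorker.
func distinctRacy(workers, perWorker int) int {
	var counter int64
	seen := make(map[int64]struct{})
	var mu sync.Mutex
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				atomic.AddInt64(&counter, 1)
				v := atomic.LoadInt64(&counter) // separate read: can duplicate
				mu.Lock()
				seen[v] = struct{}{}
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return len(seen)
}

func main() {
	const workers, perWorker = 8, 10000
	fmt.Println("fixed pattern distinct values:", distinctFixed(workers, perWorker))
	fmt.Println("racy pattern distinct values: ", distinctRacy(workers, perWorker))
}
```

With duplicate counter values, two tensors can end up under one name in `ExistingTensors`, which would explain both the "double free" warnings and the unreleased memory.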
I just realized that you had already made a fix for this issue last week, but I wasn't using your latest code. The problem is resolved now; this can be closed.
Thanks for reporting.