lifadev / archive_aws-lambda-go-shim

Author your AWS Lambda functions in Go, effectively.

Home Page:https://github.com/eawsy/aws-lambda-go-shim

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

First invocation sometimes crashes in handle()

thoellrich opened this issue · comments

About 20% of first time invocations of my lambda handler crash before my code is called. This happens somewhere in handle(). I don't see this exception when everything is running normally. It only appears to happen for the first call right after I upload a new version of my lambda handler. Any ideas?

unexpected fault address 0x0
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x80 addr=0x0 pc=0x7fd476570ec8]

goroutine 17 [running, locked to thread]:
	runtime.throw(0x7fd4765f3250, 0x5)
	/usr/local/go/src/runtime/panic.go:596 +0x97 fp=0xc4202878f0 sp=0xc4202878d0
	runtime.sigpanic()
	/usr/local/go/src/runtime/signal_unix.go:297 +0x290 fp=0xc420287940 sp=0xc4202878f0
	runtime.call64(0xc4202477a0, 0x7fd46f727cd0, 0xc420247800, 0x1000000030)
	/usr/local/go/src/runtime/asm_amd64.s:515 +0x48 fp=0xc420287990 sp=0xc420287940
	reflect.Value.call(0x7fd46fb61f40, 0x7fd46f727cd0, 0x13, 0x7fd4765f2ff8, 0x4, 0xc420287e10, 0x2, 0x2, 0x7fd4768259c0, 0x7fd47684d368, ...)
	/usr/local/go/src/reflect/value.go:434 +0x921 fp=0xc420287cd8 sp=0xc420287990
	reflect.Value.Call(0x7fd46fb61f40, 0x7fd46f727cd0, 0x13, 0xc420287e10, 0x2, 0x2, 0xc4202f68f8, 0x193, 0x7fd476549aa8)
	/usr/local/go/src/reflect/value.go:302 +0xa6 fp=0xc420287d40 sp=0xc420287cd8
	main.handle(0x14e2614, 0x13f9324, 0x14b60e4, 0xc420287eb0, 0xc420287eb8, 0x0)
	/build/src/runtime.go:140 +0x4e1 fp=0xc420287e50 sp=0xc420287d40
	main._cgoexpwrap_dd780bcacc70_handle(0x14e2614, 0x13f9324, 0x14b60e4, 0x0, 0x0, 0x0)
	_/build/src/_obj/_cgo_gotypes.go:127 +0x9a fp=0xc420287e90 sp=0xc420287e50
	runtime.call64(0x0, 0x7ffc5d059ee8, 0x7ffc5d059f80, 0x30)
	/usr/local/go/src/runtime/asm_amd64.s:515 +0x4a fp=0xc420287ee0 sp=0xc420287e90
	runtime.cgocallbackg1(0x0)
	/usr/local/go/src/runtime/cgocall.go:301 +0x1a1 fp=0xc420287f58 sp=0xc420287ee0
	runtime.cgocallbackg(0x0)
	/usr/local/go/src/runtime/cgocall.go:184 +0x86 fp=0xc420287fc0 sp=0xc420287f58
	runtime.cgocallback_gofunc(0x0, 0x0, 0x0, 0x0)
	/usr/local/go/src/runtime/asm_amd64.s:767 +0x71 fp=0xc420287fe0 sp=0xc420287fc0
	runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc420287fe8 sp=0xc420287fe0

goroutine 18 [syscall, locked to thread]:
	runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:2197 +0x1

goroutine 19 [syscall]:
	os/signal.signal_recv(0x0)
	/usr/local/go/src/runtime/sigqueue.go:116 +0x118
	os/signal.loop()
	/usr/local/go/src/os/signal/signal_unix.go:22 +0x24
	created by os/signal.init.1
	/usr/local/go/src/os/signal/signal_unix.go:28 +0x43

Hi @thoellrich,

Can you please provide a failing snippet so I can investigate more effectively?

PS: I suspect you are running some goroutines without waiting for them to finish before exiting you lambda. This is just an intuition. So please double check that when your lambda function returns, no goroutine is still running.

Thanks for the quick response. The error happens before my code is called. One of the first things I do in my handler is logging a message that indicates what has arrived. This line is never printed in the failing cases.

The 3 go-routines above are the only ones active. In the failing case I did not even get a chance to spawn a new go-routine. At the end of my handler, I also log any still active go-routines and I can confirm that whatever I launch, is terminated correctly.

If I read the stack trace above correctly, the crash happens in the reflect.Value() call on line 140 of runtime.go: https://github.com/eawsy/aws-lambda-go-shim/blob/master/src/runtime.go#L140

I see that eawsy/aws-lambda-go-shim was just updated to use 1.8.2. I'll see if that makes a difference.

Please isolate the failing case and provide a code snippet. I cannot see what is happening this way. My intuition seems false. And if there is a bug with Go 1.8.2 we also need to provide the Go team a failing test.
PS: the stack trace provides info about the runtime.go#L140 but the problem is that reflect stumble upon a nil value and we have to find the root of the problem.

I reduced the handler to the bare minimum and it's still showing the crash.

Here's my handler.go:

package main

import (
	"log"

	"github.com/eawsy/aws-lambda-go-core/service/lambda/runtime"
	"github.com/eawsy/aws-lambda-go-event/service/lambda/runtime/event/kinesisstreamsevt"
)

func Handle(evt *kinesisstreamsevt.Event, c *runtime.Context) (result interface{}, err error) {
	log.Printf("Handle() - record count=%d\n", len(evt.Records))
	return nil, nil
}

This lambda is listening on a Kinesis stream. The batch size on the trigger is set to 100.
An external process feeds sensor data into the stream.
The crash seems to happen more consistently when the first batch into the lambda is full (=batch size).
When the crash happens I can see the same stack trace as above. The log.Printf() is not called.

Here's the crash followed by a "good" call (you can see the result of the log.Printf) in Cloudwatch Logs:
capture

Something seems to wrong when the first batch of records arrive and that batch happens to be close or equal to batch size. The unraveling of the handler's event seems to cause the exception above. ctx.Elem() was used before, but evt.Elem() wasn't.

Hope this makes sense.

Thank you very much for taking the time to provide a minimal test case. I need your assistance to spot the problem.

Can you change a little bit your function to know if the incoming event is always filled with something:

package main

import (
	"log"

	"github.com/eawsy/aws-lambda-go-core/service/lambda/runtime"
	"github.com/eawsy/aws-lambda-go-event/service/lambda/runtime/event/kinesisstreamsevt"
)

func Handle(evt *kinesisstreamsevt.Event, c *runtime.Context) (result interface{}, err error) {
        log.Printf("evt is not nil: %t", evt != nil)
	log.Printf("Handle() - record count=%d\n", len(evt.Records))
	return nil, nil
}

I tried various different runs but cannot reproduce the error, so it is very important to keep your settings without any modification, so we can dig into the bug and find it.

PS: Can you please run the following command to know if you run the latest version of the shim docker run --rm -it eawsy/aws-lambda-go-shim:latest version.

I absolutely cannot reproduce the error. Moreover, we have several Lambda behind kinesis and they are executed several hundred millions of times per month and I haven't spotted this error in the CloudWatch logs. Can you provide me more info info about your events, your Lambda configs, etc? I'm stuck!
If you need to converse in private mode, let me know.

I changed the code to:

package main

import (
	"log"

	"github.com/eawsy/aws-lambda-go-core/service/lambda/runtime"
	"github.com/eawsy/aws-lambda-go-event/service/lambda/runtime/event/kinesisstreamsevt"
)

func Handle(evt *kinesisstreamsevt.Event, c *runtime.Context) (result interface{}, err error) {
	if evt != nil {
		log.Printf("Handle() - record count=%d\n", len(evt.Records))
	} else {
		log.Println("Handle() - evt==nil")
	}
	return nil, nil
}

and don't see any occurences of the evt==nil case. Yet I still see the SIGSEGV.

Also it looks like I had the latest version of the shim in place:

$ docker pull eawsy/aws-lambda-go-shim:latest
latest: Pulling from eawsy/aws-lambda-go-shim
Digest: sha256:e7d0182f0703226449446dfbf726dbab4ad35e1960b9a0e1870d80edef34f7ca
Status: Image is up to date for eawsy/aws-lambda-go-shim:latest
$ docker run --rm -it eawsy/aws-lambda-go-shim:latest version
2017-05-24

The code also works fine here and also handles millions of messages. The problem only occurs under 2 conditions: a) I push a code update and Lambda calls it the first time and b) Lambda decides to recycle the function (happens on a regular basis) and calls the new instance for the first time. For those 2 conditions I see the SIGSEGV. As Lambda detects the crash, it'll resend the Kinesis message and no data is lost.

I still think that evt at https://github.com/eawsy/aws-lambda-go-shim/blob/master/src/runtime.go#L140 has an unexpected value. Maybe I should try to build the shim here locally, add some logging and see if I can get more details that way? What do you think?

Interesting! I changed the runtime.go code to read:

capture

After that change I created the updated shim docker image, rebuilt my handler and called it several times. Each time I made sure that Lambda loaded it anew.

I saw it crash 2 times. Look at the Cloudwatch Logs Search below, I bet you can immediately spot where it crashed:

capture2

I have no idea why you're getting that weird address for the handler symbol. And why it works for the same file the 2nd time.

At that point it may be worth noting that my handler.so is ~ 20MB and the zip ~ 6MB:

-rw-r--r--  1 ubuntu ubuntu 21230440 May 26 11:43 handler.so
-rw-r--r--  1 ubuntu ubuntu  6401036 May 26 11:43 handler.zip

Could size be an issue?

So thankful for your efforts, it seems to be a very awkward bug.

  1. Whaoo. Why are these two addresses so different from the other calls!
  2. When I build your code, I have:
  • 2.7M handler.so
  • 1.7M handler.zip
    How is it possible to have 20M? Do you embed anything in the zip?

I just modified the main handler. There's other code that pulls in the AWS-SDK, sql.DB and other stuff. It's still part of the simplified *.so, it just does not get called (other than any init()). That would account for the size differences. Sorry I did not mention that.

I'm wondering now if the plugin is not completely loaded. Could it still be relocating?

I don't think that there is any problem with the size and the loading process. We've already uploaded very heavy .so and did not encountered any issue.
What I was suspecting is that there is some process in an init function somewhere in your dependencies or your own code which cause the problem. But as you've provided me a simple snippet I was very upset to not find any init function 😄
If the error happens only when you freshly reset/upload your Lambda, then we have to focus on init functions. Can you take an inventory of your init functions and then we will dig in each category and try to spot the bug.

One more thing, you must have less than 8 exported functions in your Handler. Is this verified?

I've opened an issue long time ago golang/go#19269 that will be resolved in Go1.9 and the issue you have referenced also references mine. I think we are getting closer 😄

Yeah - I think this is the same problem we are talking about. I think I installed the shim before the note about the 8 exported functions was there :-( That could be the problem here. I'll check exports and init()s and report back any findings.

My handler package had 11 exported symbols. Reduced it down to the bare minimum and have not seen the crash since.
Thanks for your help and I'm sorry we were chasing a known problem. Hopefully it'll help somebody else who runs into the same issue.
Thanks!

Perfect.
No problem, it's always with a pleasure that we tend to help around this project. It was also my fault to not think about this limitation at the beginning. It will certainly help others and in a few month with the upcoming Go1.9 this kind of issue will disappear.
Anyway, happy Go on Lambda and do not hesitate if you encounter any other problem in the future.