Long session connections get dropped

Rez0k opened this issue 8 months ago · comments

What happened:
After migrating from Calico to AWS VPC CNI network policies (we are running Istio, if that matters), we experience disconnections on long-session connections such as Redis pub-sub or MongoDB connections.
The connection gets closed and then reconnects, and this happens every few minutes.

I configured the vpc cni addon to be:

{
  "enableNetworkPolicy": "true",
  "nodeAgent": {
      "enableCloudWatchLogs": "true"
  }
}

I can't see any useful logs in the nodeagent container of the aws-node pod. All I get is:

{"level":"info","ts":"2023-11-23T10:48:03.421Z","caller":"runtime/asm_amd64.s:1650","msg":"version","GitVersion":"","GitCommit":"","BuildDate":""}
2023-11-23T10:48:03.497038855Z 2023-11-23 10:48:03.4968622 +0000 UTC Logger.check error: failed to get caller

So, I can't attach logs.

What you expected to happen:
I expect the connections not to be dropped in the first place.

How to reproduce it (as minimally and precisely as possible):
Install VPC CNI (network policy enabled) on EKS 1.28, apply a network policy, and try to connect to a MongoDB instance (or Redis pub-sub), or probably any other long-session technology.
The image I am using is: public.ecr.aws/docker/library/node:18.16.0-bullseye-slim
Run this sample code in a Node.js pod (preferably using the image above):

const mongoose = require('mongoose');

async function init() {
    const db = await mongoose.connect('mongodb://<mongodb-host>:27017/<db>?retryWrites=true&w=majority&directConnection=true');
    
    mongoose.connection.on('error', error => {
        console.log(`Got error: ${error}`);
    });
    
    mongoose.connection.on('connected', () => {
        console.log(`Mongo Connected`);
    });
    
    mongoose.connection.on('disconnected', () => {
        console.log("Mongo Disconnected");
    });
    
    mongoose.connection.on('reconnected', () => {
        console.log("Mongo Reconnected");
    });

    console.log("connected!")
}

init()

Wait a few minutes and you should see logs like:

user@container-7b9f7xzs2-ysl25:/app# node mongo-sample-code.js 
connected!
Mongo Disconnected
Mongo Connected
Mongo Reconnected
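
To put a number on "every few minutes", the sample above could record when each 'disconnected' event fires and print the gap between consecutive drops. The following is only a sketch: createDropTracker is a hypothetical helper written for this illustration and is not part of mongoose.

```javascript
// Hypothetical helper (not part of mongoose): remembers when each drop
// happened and reports the gaps between consecutive drops.
function createDropTracker(now = () => Date.now()) {
    const drops = [];
    return {
        // Call this from the 'disconnected' handler.
        record() {
            drops.push(now());
        },
        // Gaps between consecutive drops, rounded to whole seconds.
        intervalsSeconds() {
            const gaps = [];
            for (let i = 1; i < drops.length; i++) {
                gaps.push(Math.round((drops[i] - drops[i - 1]) / 1000));
            }
            return gaps;
        },
    };
}

// Possible wiring inside init() from the sample above:
//   const tracker = createDropTracker();
//   mongoose.connection.on('disconnected', () => {
//       tracker.record();
//       console.log('seconds between drops:', tracker.intervalsSeconds());
//   });
```

A steady gap would point at some idle timeout along the path rather than at random network failures.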

Anything else we need to know?:
I use Istio in my cluster and used Calico up until yesterday. I terminated all instances to flush any leftovers from Calico.
With Calico everything worked as expected.

Environment:

Kubernetes version (use kubectl version): v1.28.3-eks-4f4795d
CNI Version: v1.15.4-eksbuild.1
Network Policy Agent Version
OS (e.g: cat /etc/os-release): Debian GNU/Linux 11 (bullseye)
Kernel (e.g. uname -a): Linux #### ####.amzn2.x86_64

Jayanth Varavani · Answer 1 · Fri Nov 24 2023 13:52:56 GMT+0800 (China Standard Time)

Can you please set the flag --enable-policy-event-logs=true and check whether you see a DENY verdict for the flow? I suspect that is what is happening, and it might be similar to #139.

Rez0k · Answer 2 · Sun Nov 26 2023 17:49:15 GMT+0800 (China Standard Time)

I elaborated a bit in the "How to reproduce" section (maybe it will help).

I enabled this flag and now I am getting logs in the nodeagent container, but all the logs look like:

2023-11-26 09:39:08.533284802 +0000 UTC Logger.check error: failed to get caller
2023-11-26 09:39:08.533309791 +0000 UTC Logger.check error: failed to get caller
2023-11-26 09:39:08.533326096 +0000 UTC Logger.check error: failed to get caller
2023-11-26 09:39:08.533344727 +0000 UTC Logger.check error: failed to get caller
2023-11-26 09:39:08.533365531 +0000 UTC Logger.check error: failed to get caller
2023-11-26 09:39:08.533388542 +0000 UTC Logger.check error: failed to get caller
...

There is no DENY verdict in those logs, and I don't know why the nodeagent prints the logs like this.
I am still getting connection drops on long-session connections, as I mentioned above.

Any ideas why?
Is this a bug on your side or is it something on my side?
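
When the output is flooded with the "failed to get caller" line, a small filter makes it easier to confirm whether any DENY verdict is present at all. This is a rough sketch: summarizeAgentLogs is a hypothetical helper, and it assumes a verdict line contains the literal text DENY, since the exact flow-log format is not shown in this thread.

```javascript
// The noise line seen in the nodeagent container output above.
const NOISE = 'Logger.check error: failed to get caller';

// Hypothetical helper: counts noise lines and collects lines mentioning DENY.
// Assumption: a DENY verdict shows up as the literal substring "DENY".
function summarizeAgentLogs(text) {
    const lines = text.split('\n').filter(line => line.trim() !== '');
    const noise = lines.filter(line => line.includes(NOISE)).length;
    const deny = lines.filter(line => line.includes('DENY'));
    return { total: lines.length, noise, deny };
}

// Possible usage with logs saved to a file (the file name is an assumption):
//   const fs = require('fs');
//   console.log(summarizeAgentLogs(fs.readFileSync('nodeagent.log', 'utf8')));
```

If noise equals total and deny is empty, the container output alone cannot confirm or rule out a policy drop.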

Rez0k · Answer 3 · Tue Nov 28 2023 16:04:54 GMT+0800 (China Standard Time)

Maybe #73 and #83 are related.

Jayanth Varavani · Answer 4 · Wed Nov 29 2023 04:07:07 GMT+0800 (China Standard Time)

Yes it looks similar.

The v1.0.7-rc1 tag is available. You can replace the aws-eks-nodeagent container image on the aws-node DaemonSet with the v1.0.7-rc1 tag.

For example:

  - name: aws-eks-nodeagent
    image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-network-policy-agent:v1.0.7-rc1

Please try and let us know if it is holding up.
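
For anyone scripting this change, the image swap can be sketched as a pure function over the DaemonSet manifest parsed into a JavaScript object. withNodeAgentImage is a hypothetical helper; it only relies on the standard spec.template.spec.containers layout of a DaemonSet and on the container name shown above.

```javascript
// Hypothetical helper: returns a copy of a DaemonSet manifest object with
// the image of the named container replaced. The input is not modified.
function withNodeAgentImage(daemonSet, image, containerName = 'aws-eks-nodeagent') {
    // A JSON round-trip is enough for a deep copy of plain manifest data.
    const copy = JSON.parse(JSON.stringify(daemonSet));
    const containers = copy.spec.template.spec.containers;
    const target = containers.find(c => c.name === containerName);
    if (!target) {
        throw new Error(`container ${containerName} not found`);
    }
    target.image = image;
    return copy;
}
```

Throwing when the container is missing avoids silently applying an unchanged manifest.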

Rez0k · Answer 5 · Wed Nov 29 2023 21:38:08 GMT+0800 (China Standard Time)

Still not working for me, mongodb disconnecting every 5 minutes.

Jayanth Varavani · Answer 6 · Wed Nov 29 2023 23:35:20 GMT+0800 (China Standard Time)

I missed the above logs; those are just the pod logs. Can you share the node logs? You can collect them by running sudo bash /opt/cni/bin/aws-cni-support.sh on the node whose pod is trying to connect to MongoDB and seeing disconnects, and mail them to k8s-awscni-triage@amazon.com. Please also share the describe output of the corresponding policyEndpoint resources. Can you also share the source and destination IPs of the long sessions in your test, so we can review the logs?

Rez0k · Answer 7 · Sun Dec 03 2023 21:13:33 GMT+0800 (China Standard Time)

After further investigation, it seems that the long-session connections are being terminated because of the Istio Envoy sidecar.
In brief, Istio creates a sidecar container for each pod; this sidecar is an Envoy proxy responsible for forwarding traffic to the main container.
It worked before with Calico, but now it does not.

Does this new information help?
I submitted the machine logs to the email address you mentioned, along with the relevant policyEndpoint.

How to reproduce:
Create an EKS cluster with AWS VPC CNI network policy and Istio (https://istio.io/latest/docs/setup/getting-started/).
Create a Node.js pod with the Istio sidecar and run the code I mentioned above (MongoDB connection).
After 2-3 minutes the connection should get dropped.

Jayanth Varavani · Answer 8 · Thu Dec 07 2023 07:09:06 GMT+0800 (China Standard Time)

Will you be able to try this image -

<account-number>.dkr.ecr.<region>.amazonaws.com/amazon/aws-network-policy-agent:v1.0.7-rc3

Please make sure you replace the account number and region.
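
As an illustration of the substitution, the full image reference can be assembled from its parts. buildAgentImage is a hypothetical helper; the repository path is copied from the example earlier in this thread, and the region check is a loose assumption about how region names look.

```javascript
// Hypothetical helper: builds the network policy agent image reference
// from an account number, a region and a tag.
function buildAgentImage(account, region, tag) {
    // AWS account numbers are 12 digits.
    if (!/^\d{12}$/.test(account)) {
        throw new Error('account must be a 12-digit string');
    }
    // Loose shape check only, e.g. "us-west-2" (an assumption, not a full list).
    if (!/^[a-z]{2}(-[a-z]+)+-\d+$/.test(region)) {
        throw new Error('region does not look like an AWS region name');
    }
    return `${account}.dkr.ecr.${region}.amazonaws.com/amazon/aws-network-policy-agent:${tag}`;
}

// Example: buildAgentImage('602401143452', 'us-west-2', 'v1.0.7-rc3')
```

Passing the account as a string keeps any leading zeros intact.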

Jeffrey Nelson · Answer 9 · Thu Dec 28 2023 01:09:00 GMT+0800 (China Standard Time)

Discussed with @Rez0k offline. The official release, i.e. VPC CNI v1.15.5 containing Network Policy agent tag v1.0.7, should fix this issue. Waiting for confirmation before closing the issue.

Rez0k · Answer 10 · Sun Dec 31 2023 17:23:53 GMT+0800 (China Standard Time)

I will try it this week and will update here.

Rez0k · Answer 11 · Mon Jan 08 2024 21:49:07 GMT+0800 (China Standard Time)

Update: I prefer to wait for your next release candidate, according to: #175 (comment)

Jayanth Varavani · Answer 12 · Tue Jan 09 2024 13:03:49 GMT+0800 (China Standard Time)

@Rez0k - We have v1.0.8-rc1 tag available if you would like to test.

Jayanth Varavani · Answer 13 · Wed Jan 17 2024 06:50:09 GMT+0800 (China Standard Time)

@Rez0k - Did you get a chance to verify the image?

Rez0k · Answer 14 · Tue Jan 23 2024 16:46:31 GMT+0800 (China Standard Time)

I prefer to wait for the official release; I will try the v1.0.8 release.
I don't want to apply the network policy when I am not sure it will work, as it will cause problems for my environment and developers.

Jayanth Varavani · Answer 15 · Wed Feb 21 2024 02:10:01 GMT+0800 (China Standard Time)

The v1.0.8 release is available: https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.16.3. Please try it out and let us know if you see any issues.

Rez0k · Answer 16 · Sun Mar 03 2024 20:38:25 GMT+0800 (China Standard Time)

Seems to work!