aws / aws-network-policy-agent

Long session connections get dropped

Rez0k opened this issue

What happened:
After migrating from Calico to AWS VPC CNI network policies (we also run Istio, if that matters), we see disconnections on long-lived connections such as Redis pub/sub or MongoDB connections.
The connection gets closed and then re-established, and this repeats every few minutes.

I configured the vpc cni addon to be:

{
  "enableNetworkPolicy": "true",
  "nodeAgent": {
      "enableCloudWatchLogs": "true"
  }
}

I can't see any logs in the nodeagent container of the aws-node pod. All I get is:

{"level":"info","ts":"2023-11-23T10:48:03.421Z","caller":"runtime/asm_amd64.s:1650","msg":"version","GitVersion":"","GitCommit":"","BuildDate":""}
2023-11-23T10:48:03.497038855Z 2023-11-23 10:48:03.4968622 +0000 UTC Logger.check error: failed to get caller

So, I can't attach logs.

What you expected to happen:
I expect the connections not to be dropped in the first place.

How to reproduce it (as minimally and precisely as possible):
Install VPC CNI (with network policy enabled) on EKS 1.28, apply a network policy, and try to connect to a MongoDB instance (or Redis pub/sub), or probably any other technology that relies on long-lived connections.
Run this sample code in a Node.js pod (preferably using the image public.ecr.aws/docker/library/node:18.16.0-bullseye-slim, which is the one I am using):

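const mongoose = require('mongoose');

async function init() {
    // Replace <mongodb-host> and <db> with your own values.
    await mongoose.connect('mongodb://<mongodb-host>:27017/<db>?retryWrites=true&w=majority&directConnection=true');

    // Listeners are attached after the initial connect, so they only report later state changes.
    mongoose.connection.on('error', error => {
        console.log(`Got error: ${error}`);
    });

    mongoose.connection.on('connected', () => {
        console.log(`Mongo Connected`);
    });

    mongoose.connection.on('disconnected', () => {
        console.log("Mongo Disconnected");
    });

    mongoose.connection.on('reconnected', () => {
        console.log("Mongo Reconnected");
    });

    console.log("connected!")
}

// Report a failed initial connection instead of leaving an unhandled rejection.
init().catch(error => console.log(`Initial connection failed: ${error}`));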

Wait a few minutes and you should see logs like:

user@container-7b9f7xzs2-ysl25:/app# node mongo-sample-code.js 
connected!
Mongo Disconnected
Mongo Connected
Mongo Reconnected
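
To check whether the drops also affect a plain idle TCP connection, independent of mongoose, a probe along these lines may help. This is only a minimal sketch using Node's built-in net module: the function name watchIdleConnection and the file name idle-probe.js are made up for illustration, and the host and port are whatever long-lived endpoint you want to test (for example the MongoDB host on 27017).

```javascript
const net = require('net');

// Open a plain TCP connection, keep it idle, and report how long it survived.
// Resolves with { elapsedMs, hadError } once the socket closes.
// (watchIdleConnection is a hypothetical helper name, not part of any library.)
function watchIdleConnection(host, port) {
  return new Promise((resolve, reject) => {
    let connectedAt = null;
    const socket = net.connect({ host, port });

    // Discard anything the server sends so that 'end' and 'close' are delivered.
    socket.resume();

    socket.on('connect', () => {
      connectedAt = Date.now();
      console.log(`connected to ${host}:${port}`);
    });

    socket.on('error', (err) => {
      console.log(`socket error: ${err.message}`);
      if (connectedAt === null) reject(err); // never connected at all
    });

    socket.on('close', (hadError) => {
      if (connectedAt === null) return; // connect failed; already rejected above
      const elapsedMs = Date.now() - connectedAt;
      console.log(`closed after ${Math.round(elapsedMs / 1000)}s (hadError=${hadError})`);
      resolve({ elapsedMs, hadError });
    });
  });
}

// Usage: node idle-probe.js <host> <port>
if (process.argv.length >= 4) {
  watchIdleConnection(process.argv[2], Number(process.argv[3]))
    .catch((err) => { console.error(err.message); process.exitCode = 1; });
}
```

If the probe also reports a close after a few minutes, the drop is happening below the database driver; if it stays open, the driver or its heartbeat traffic is more likely involved.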

Anything else we need to know?:
I use Istio in my cluster and used Calico until yesterday. I terminated all instances to flush any leftovers from Calico.
With Calico everything worked as expected.

Environment:

  • Kubernetes version (use kubectl version): v1.28.3-eks-4f4795d
  • CNI Version: v1.15.4-eksbuild.1
  • Network Policy Agent Version
  • OS (e.g: cat /etc/os-release): Debian GNU/Linux 11 (bullseye)
  • Kernel (e.g. uname -a): Linux #### ####.amzn2.x86_64

Can you please set the flag --enable-policy-event-logs=true and check whether you see a DENY verdict for the flow? I suspect that is what is happening, and it might be similar to #139.

  • I elaborated a bit in the HOW TO REPRODUCE section (maybe it will help)

I enabled this flag and now I am getting logs in the nodeagent container, but all the logs look like:

2023-11-26 09:39:08.533284802 +0000 UTC Logger.check error: failed to get caller
2023-11-26 09:39:08.533309791 +0000 UTC Logger.check error: failed to get caller
2023-11-26 09:39:08.533326096 +0000 UTC Logger.check error: failed to get caller
2023-11-26 09:39:08.533344727 +0000 UTC Logger.check error: failed to get caller
2023-11-26 09:39:08.533365531 +0000 UTC Logger.check error: failed to get caller
2023-11-26 09:39:08.533388542 +0000 UTC Logger.check error: failed to get caller
...

There is no DENY verdict in those logs, and I don't know why the nodeagent prints logs like this.
I am still getting connection drops on long-lived connections, as mentioned above.

Any ideas why?
Is this a bug on your side or is it something on my side?

Maybe related to #73 and #83.

Yes it looks similar.

The v1.0.7-rc1 tag is available. You can replace the aws-eks-nodeagent container image in the aws-node DaemonSet with the v1.0.7-rc1 tag.

For example -

  - name: aws-eks-nodeagent
    image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-network-policy-agent:v1.0.7-rc1

Please try and let us know if it is holding up.

Still not working for me, mongodb disconnecting every 5 minutes.
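
To put numbers on "every 5 minutes", the gap between consecutive disconnects can be recorded from the same events the sample script already listens to. The following is a rough sketch, not part of the original report: trackDisconnects is a hypothetical helper, and the stand-in emitter and fake clock exist only so the example runs without a MongoDB instance.

```javascript
const { EventEmitter } = require('events');

// Record the time of every 'disconnected' event and the gap since the previous one.
// `emitter` is any EventEmitter (for example mongoose.connection);
// `now` is injectable so the helper can be exercised with a fake clock.
function trackDisconnects(emitter, now = Date.now) {
  const timestamps = [];
  const intervalsMs = [];
  emitter.on('disconnected', () => {
    const t = now();
    if (timestamps.length > 0) {
      const gap = t - timestamps[timestamps.length - 1];
      intervalsMs.push(gap);
      console.log(`disconnected; ${(gap / 1000).toFixed(1)}s since previous disconnect`);
    } else {
      console.log('disconnected (first occurrence)');
    }
    timestamps.push(t);
  });
  return { timestamps, intervalsMs };
}

// Demo with a stand-in emitter and a fake clock (three disconnects, 5 minutes apart).
const fakeConnection = new EventEmitter();
const clock = [0, 300000, 600000];
const stats = trackDisconnects(fakeConnection, () => clock.shift());
fakeConnection.emit('disconnected');
fakeConnection.emit('disconnected');
fakeConnection.emit('disconnected');
console.log(stats.intervalsMs); // [ 300000, 300000 ]
```

In the sample script, calling trackDisconnects(mongoose.connection) after the connect call would log the real intervals, which makes it easier to compare them against any timeout that is suspected of causing the drops.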

I missed the logs above; those are just the pod logs. Can you share the node logs? You can collect them by running sudo bash /opt/cni/bin/aws-cni-support.sh on the node hosting the pod that connects to MongoDB and sees the disconnects, and mail them to k8s-awscni-triage@amazon.com. Please also share the describe output of the corresponding PolicyEndpoint resources, along with the source and destination IPs of the long-lived connection in your test, so we can review the logs.

After further investigation, it seems that the long-lived connections are being terminated because of the Istio Envoy sidecar.
In brief, Istio injects a sidecar container into each pod; this sidecar is an Envoy proxy responsible for forwarding traffic to the main container.
This worked with Calico, but it no longer does.

Does this new information help?
I submitted the machine logs to the email address you mentioned, together with the relevant PolicyEndpoint.

How to reproduce:
Create an EKS cluster with AWS VPC CNI network policy and Istio (https://istio.io/latest/docs/setup/getting-started/).
Create a Node.js pod with the Istio sidecar and run the code I posted above (MongoDB connection).
After 2-3 minutes the connection should get dropped.

Will you be able to try this image -

<account-number>.dkr.ecr.<region>.amazonaws.com/amazon/aws-network-policy-agent:v1.0.7-rc3

Please make sure you replace the account number and region.

Discussed with @Rez0k offline and the official release, i.e. VPC CNI v1.15.5 containing Network Policy agent tag v1.0.7, should fix this issue. Waiting for confirmation before closing issue

I will try it this week and will update here.

update: I prefer to wait for your next release candidate according to: #175 (comment)

@Rez0k - We have v1.0.8-rc1 tag available if you would like to test.

@Rez0k - Did you get a chance to verify the image?

I prefer to wait for the official release; I will try the v1.0.8 release.
I don't want to apply the network policy when I am not sure it will work, as it would cause problems for my environment and developers.

The v1.0.8 release is available - https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.16.3. Please try it out and let us know if you see any issues.

Seems to work!