Is the network policy for VPC CNI designed to be stateful or stateless?
khayong opened this issue
What happened:
I have created an egress network policy allowing the web pod to establish connections with the backend server pod at port 4000.
podSelector:
  matchLabels:
    app.kubernetes.io/component: web
egress:
  - ports:
      - protocol: TCP
        port: 4000
    to:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: backend
        podSelector:
          matchLabels:
            app.kubernetes.io/name: backend
While initially operating as intended, after some time, the packet log occasionally registers a DENY entry for certain return traffic.
Node: ip-10-0-64-172.ap-southeast-1.compute.internal;SIP: 10.0.68.172;SPORT: 4000;DIP: 10.0.74.123;DPORT: 39816;PROTOCOL: TCP;PolicyVerdict: DENY
where 10.0.68.172 is the backend server and 10.0.74.123 is the web server.
To mitigate this issue, I have to define an ephemeral port range for the ingress of the return traffic, similar to a VPC network ACL configuration.
podSelector:
  matchLabels:
    app.kubernetes.io/component: web
ingress:
  - ports:
      - protocol: TCP
        port: 1024
        endPort: 65535
    from:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: backend
        podSelector:
          matchLabels:
            app.kubernetes.io/name: backend
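The workaround behaves like a stateless ACL: return packets from the backend are matched against an explicit ingress rule rather than relying on connection tracking. A minimal sketch of that check (illustrative only; the IPs and ports are taken from the policy and logs in this thread, not from the agent's actual code):

```python
def ingress_allowed(src_ip, dst_port):
    """Stateless check mirroring the workaround policy above:
    allow traffic from the backend pod to ephemeral ports 1024-65535."""
    backend_ips = {"10.0.68.172"}  # backend server from the DENY log
    return src_ip in backend_ips and 1024 <= dst_port <= 65535

# The return packet that was denied in the log (backend:4000 -> web:39816)
# now matches the explicit ingress rule, with or without a conntrack entry.
assert ingress_allowed("10.0.68.172", 39816)
# Traffic from the backend to a privileged port is still denied.
assert not ingress_allowed("10.0.68.172", 443)
```

The trade-off is the same as with network ACLs: the wide ephemeral-port range admits any backend-initiated traffic to those ports, not just return traffic.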
Attach logs
What you expected to happen:
I expected Kubernetes Network Policy enforcement to be stateful, meaning there should be no need to explicitly define rules for the return traffic of established connections.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version): Server Version: v1.28.4-eks-8cb36c9
- CNI Version: v1.16.0-eksbuild.1
- OS (e.g. cat /etc/os-release):
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
SUPPORT_END="2025-06-30"
- Kernel (e.g. uname -a): Linux ip-10-0-64-172.ap-southeast-1.compute.internal 5.10.199-190.747.amzn2.x86_64 #1 SMP Sat Nov 4 16:55:14 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Moving to Network Policy agent repo
@khayong the network policy implementation is stateless. What does the PolicyEndpoint object show for this policy? You can see the output with kubectl get policyendpoint <policy_name>
Yes, you are right, there is no need to explicitly define rules for return traffic. Can you check the number of entries in the network policy agent's conntrack table when the issue starts to happen? When the issue happens, is there any pod churn, or do just the established connections fail after a while?
Steps to check -
- SSH to the node where you are seeing deny logs, then
cd /opt/cni/bin/
- Dump the maps -
./aws-eks-na-cli ebpf maps
- Pick the ID which has
Keysize 20 Valuesize 1 MaxEntries 65536
For example here ID is 5 ->
./aws-eks-na-cli ebpf maps
Maps currently loaded :
Type : 2 ID : 3
Keysize 4 Valuesize 98 MaxEntries 1
========================================================================================
Type : 9 ID : 5
Keysize 20 Valuesize 1 MaxEntries 65536
========================================================================================
Type : 27 ID : 6
Keysize 0 Valuesize 0 MaxEntries 262144
========================================================================================
Type : 11 ID : 16
Keysize 8 Valuesize 288 MaxEntries 65536
========================================================================================
- Then using the ID, we should be able to get the number of entries using this CLI ->
./aws-eks-na-cli ebpf dump-maps 5
(Note: replace 5 with the ID you found in the previous step.)
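Picking the right ID out of the maps listing can be scripted. The helper below is an assumption on my part, not part of the official tooling; it simply parses the `./aws-eks-na-cli ebpf maps` output format shown above to find the map whose size line reads `Keysize 20 Valuesize 1`:

```python
import re

def find_conntrack_map_id(maps_output: str) -> int:
    """Return the ID of the map whose following line reports
    'Keysize 20 Valuesize 1 ...' (the conntrack map per the steps above)."""
    lines = maps_output.splitlines()
    for i, line in enumerate(lines):
        m = re.search(r"ID : (\d+)", line)
        if m and i + 1 < len(lines) and "Keysize 20 Valuesize 1" in lines[i + 1]:
            return int(m.group(1))
    raise ValueError("conntrack map not found")

# On the node you would feed it live output, e.g.:
#   out = subprocess.run(["/opt/cni/bin/aws-eks-na-cli", "ebpf", "maps"],
#                        capture_output=True, text=True).stdout
# Here we use the sample listing from the steps above:
sample = """\
Maps currently loaded :
Type : 2 ID : 3
Keysize 4 Valuesize 98 MaxEntries 1
Type : 9 ID : 5
Keysize 20 Valuesize 1 MaxEntries 65536
Type : 27 ID : 6
Keysize 0 Valuesize 0 MaxEntries 262144
"""
assert find_conntrack_map_id(sample) == 5
```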
I have also encountered this issue, and it seems to relate to long-lived connections being removed from the conntrack table prematurely. There are other issues in this repository relating to this, and the latest version (CNI v1.16.0-eksbuild.1 / policy agent 1.0.7) does not fix the issue.
If you enable policy logging using the below configuration on the VPC CNI (if deployed through the console UI; otherwise use the appropriate args in Helm/CLI), you'll see that there's an ACCEPT for the connection, then some time later it's removed from the conntrack table, followed by a DENY in your logs.
{
  "enableNetworkPolicy": "true",
  "nodeAgent": {
    "enableCloudWatchLogs": "true",
    "enablePolicyEventLogs": "true"
  }
}
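The ACCEPT, then conntrack-entry removal, then DENY sequence described above can be modeled with a toy stateless policy engine plus conntrack table. This is a simplified illustration only, not the agent's actual eBPF implementation; the IPs and ports are the ones from the DENY log earlier in the thread:

```python
# Toy model: stateless rule checks, with return traffic admitted only
# via a conntrack table populated by allowed egress packets.
conntrack = set()  # entries keyed by (src_ip, src_port, dst_ip, dst_port)

def egress_allowed(dst_ip, dst_port):
    # Stands in for the compiled egress rules ("TCP 4000 to backend").
    return dst_ip == "10.0.68.172" and dst_port == 4000

def handle_egress(src_ip, src_port, dst_ip, dst_port):
    if egress_allowed(dst_ip, dst_port):
        conntrack.add((src_ip, src_port, dst_ip, dst_port))
        return "ACCEPT"
    return "DENY"

def handle_ingress(src_ip, src_port, dst_ip, dst_port):
    # Return traffic: the reverse tuple must exist in the conntrack table.
    if (dst_ip, dst_port, src_ip, src_port) in conntrack:
        return "ACCEPT"
    return "DENY"

# Web pod opens a connection to the backend on port 4000 ...
assert handle_egress("10.0.74.123", 39816, "10.0.68.172", 4000) == "ACCEPT"
# ... so the backend's reply is accepted as return traffic.
assert handle_ingress("10.0.68.172", 4000, "10.0.74.123", 39816) == "ACCEPT"
# If the conntrack entry is evicted prematurely, the same return
# packet is denied -- the behaviour reported in this issue.
conntrack.clear()
assert handle_ingress("10.0.68.172", 4000, "10.0.74.123", 39816) == "DENY"
```

This is why premature eviction of long-lived connections shows up as DENY logs on return traffic even though the forward direction was explicitly allowed.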
Do you have multiple replicas of these pods scheduled on the same node?
I think this could very well be the case, as we bin-pack onto a small number of nodes to keep costs low. This would explain why we did not witness this issue in our development environment, which does not have more than 1 replica per deployment.
We will have a release candidate image soon if you are willing to try it out to see if it resolves the issue. The official release image containing #179 is targeting mid-January.
Thanks jayanthvn, it works. With v1.0.8-rc1, there's no need for me to explicitly define rules for return traffic.
I observed some denied connections in the log today. It appears that there might be a delay in creating entries in the conntrack table. The first two logs show a DENY, presumably because the conntrack table had not yet been updated. However, after a delay of 3 seconds, the third log shows an ALLOW, which suggests the conntrack entry had been created by then.
On the conntrack table, I can see the presence of the corresponding entry.
Is it considered normal for there to be a delay in the creation of conntrack entries?
I have observed the same behaviour. This is with a single pod in a replicaset so unrelated to the race condition I think.
@khayong - There will be a few seconds' (1-2s) delay for the controller to reconcile and attach probes to new pods. Traffic will be allowed until the probes are attached, and then policy enforcement will take effect based on the config. In this case the probe was probably missing when the ingress traffic came in, so no conntrack entry was created.
Regarding the 2nd issue, do you have an active policy on the .54 pod? If yes, can you share the PolicyEndpoint?
Yes, here it is:
apiVersion: networking.k8s.aws/v1alpha1
kind: PolicyEndpoint
metadata:
  creationTimestamp: "2024-01-11T16:17:37Z"
  generateName: live2-gateway-
  generation: 1
  name: live2-gateway-855lp
  namespace: live2
  ownerReferences:
    - apiVersion: networking.k8s.io/v1
      blockOwnerDeletion: true
      controller: true
      kind: NetworkPolicy
      name: live2-gateway
      uid: e2fec936-f1d0-4f9a-bd8c-07d5967ba9e8
  resourceVersion: "24471548"
  uid: 2734c483-dc7b-412f-983b-6f2d2b2ca463
spec:
  egress:
    - cidr: 0.0.0.0/0
      ports:
        - port: 53
          protocol: UDP
    - cidr: ::/0
      ports:
        - port: 53
          protocol: UDP
  ingress:
    - cidr: 10.0.64.172
      ports:
        - port: 8080
          protocol: TCP
        - port: 8080
          protocol: TCP
    - cidr: 10.0.78.236
      ports:
        - port: 8080
          protocol: TCP
        - port: 8080
          protocol: TCP
  podIsolation:
    - Ingress
    - Egress
  podSelector:
    matchLabels:
      app.kubernetes.io/instance: live2
      app.kubernetes.io/name: gateway
  podSelectorEndpoints:
    - hostIP: 10.0.54.248
      name: live2-gateway-b575dcf44-w6sfc
      namespace: live2
      podIP: 10.0.59.54
    - hostIP: 10.0.54.248
      name: live2-gateway-b575dcf44-ktrzz
      namespace: live2
      podIP: 10.0.60.190
  policyRef:
    name: live2-gateway
    namespace: live2
@khayong we are unable to repro this. Can we get on a call? Are you on the Kubernetes Slack? If so, we can connect in #aws-vpc-cni.
Can you please try with the latest v1.0.8 release? - https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.16.3
Closing as v1.0.8 has been released. Please reopen if your issue is not resolved.