envoyproxy / xds-relay

Caching, aggregation, and relaying for xDS compliant clients and origin servers


Properly handle (N)ACKs

caitong93 opened this issue

It seems go-control-plane delegates the handling of (N)ACKs to the cache implementation. For every request, the server tries to cancel the last watch and create a new one: https://github.com/envoyproxy/go-control-plane/blob/master/pkg/server/v2/server.go#L402 . This causes duplicate pushes in xds-relay.
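For context, here is a heavily simplified sketch of that server loop, using hypothetical stand-in types rather than the actual go-control-plane APIs:

package sketch

// Hypothetical stand-ins for the go-control-plane types, used only to
// illustrate the control flow of the server.go linked above; this is
// a simplified sketch, not the actual implementation.
type request struct{ typeURL, responseNonce string }

type response struct{ nonce string }

type watchCache interface {
	// CreateWatch opens a watch for req. The cache implementation is
	// the one that decides whether req is a fresh subscription or an
	// (N)ACK of an earlier response.
	CreateWatch(req request) (<-chan response, func())
}

// handleStream mirrors the behavior described above: on every incoming
// request, ACKs included, the previous watch is cancelled and a fresh
// one is created against the cache.
func handleStream(reqs <-chan request, c watchCache, push func(response)) {
	var cancel func()
	for req := range reqs {
		if cancel != nil {
			cancel()
		}
		var watch <-chan response
		watch, cancel = c.CreateWatch(req)
		go func(w <-chan response) {
			for resp := range w {
				push(resp)
			}
		}(watch)
	}
}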

Environment

It should be easy to reproduce; istiod can be replaced with any other control plane.

config.yaml

admin:
  address:
    address: 0.0.0.0
    port_value: 8889
server:
  address:
    address: 0.0.0.0
    port_value: 8888
origin_server:
  address:
    address: istiod
    port_value: 15010
logging:
  level: DEBUG
metrics_sink:
  statsd:
    root_prefix: abc
    flush_interval: 30s
    address:
      address: 127.0.0.1
      port_value: 12345
cache:
  ttl: 30s
  max_entries: 100000

rules.yaml

fragments:
  - rules:
    - match:
        request_type_match:
          types:
          - "type.googleapis.com/envoy.api.v2.ClusterLoadAssignment"
      result:
        and_result:
          result_predicates:
          - request_node_fragment:
              # field: 2
              field: NODE_CLUSTER
              action:
                exact: true
          - string_fragment: eds
    - match:
        request_type_match:
          types:
          - "type.googleapis.com/envoy.api.v2.Cluster"
      result:
        and_result:
          result_predicates:
          - request_node_fragment:
              # field: 2
              field: NODE_CLUSTER
              action:
                exact: true
          - string_fragment: cds

Thanks @caitong93 for the issue - sorry I'm just seeing this now. Can you elaborate further on the duplicate pushing in xds-relay if the prior watch is cancelled? Can you also give me an example of the behavior you're seeing vs. the expected behavior?

@jessicayuen

Sorry, I found it has nothing to do with watch cancellation. The issue is here:

aggregatedKey = fmt.Sprintf("%s%s", unaggregatedPrefix, req.String())

req.String() contains the nonce; consider the following case (a minimal sketch of the resulting key mismatch follows the list):

  1. client sends CDS request1 (nonce="")
  2. xds-relay fails to map the request to a key, generates key1 (nonce=""), and responds with response1 (nonce="1")
  3. client sends ack1 (nonce="1")
  4. xds-relay receives ack1 but handles it as a normal request; it again fails to map, generates key2 (nonce="1"), and responds with response2 (nonce="2")
  5. client sends ack2 (nonce="2")
     ... (infinite loop)
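The mismatch is easy to demonstrate with the real v2 DiscoveryRequest proto; the unaggregatedPrefix value below is a hypothetical placeholder for illustration:

package main

import (
	"fmt"

	v2 "github.com/envoyproxy/go-control-plane/envoy/api/v2"
)

func main() {
	const unaggregatedPrefix = "unaggregated_" // hypothetical value for illustration

	req := &v2.DiscoveryRequest{TypeUrl: "type.googleapis.com/envoy.api.v2.Cluster"}
	key1 := fmt.Sprintf("%s%s", unaggregatedPrefix, req.String())

	// The ACK echoes the nonce of the response it acknowledges, so
	// req.String() -- and therefore the fallback key -- changes even
	// though this is logically the same subscription.
	req.ResponseNonce = "1"
	key2 := fmt.Sprintf("%s%s", unaggregatedPrefix, req.String())

	fmt.Println(key1 == key2) // false: each ACK lands on a brand-new cache key
}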

This can be reproduced using the example in the README:

Update example/config-files/aggregation-rules.yaml, removing the CDS match:

fragments: 
  - rules:
      - match:
          request_type_match:
            types:
              - "type.googleapis.com/envoy.api.v2.Listener"
              - "type.googleapis.com/envoy.api.v2.ClusterLoadAssignment"
              - "type.googleapis.com/envoy.api.v2.RouteConfiguration"
        result:
          request_node_fragment:
            field: 1
            action:
              regex_action: { pattern: "^(.*)-.*$", replace: "$1" }
  - rules:
      - match:
          request_type_match:
            types:
              - "type.googleapis.com/envoy.api.v2.Listener"
        result:
          string_fragment: "lds"
      - match:
          request_type_match:
            types:
              - "type.googleapis.com/envoy.api.v2.ClusterLoadAssignment"
        result:
          string_fragment: "eds"
      - match:
          request_type_match:
            types:
              - "type.googleapis.com/envoy.api.v2.RouteConfiguration"
        result:
          string_fragment: "rds"
  - rules:
      - match:
          request_type_match:
            types:
              - "type.googleapis.com/envoy.api.v2.RouteConfiguration"
        result:
          resource_names_fragment:
            element: 0
            action: { exact: true }

Start Envoy:

docker run -it --rm -v `pwd`/example/config-files:/config-files  envoyproxy/envoy-dev -c /config-files/envoy-bootstrap-1.yaml

I see what you mean. There needs to be some more design around how we want to support keys that don't map to an aggregation rule (#56). A couple of cursory options:

1. Error and cancel the watch, deferring back to clients to make sure that all aggregation rules are captured. This is probably not great, since the aggregation rules are currently not feature-rich enough to capture all use cases.
2. Maintain a separate cache for unaggregated requests. This would be a simple request/response cache; a sketch of one way to key it follows this list.
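One way to make option 2 viable without re-triggering the loop above is to drop the volatile fields before building the fallback key. A minimal sketch, assuming the key only needs to identify the subscription (unaggregatedKey and the prefix value are hypothetical, not xds-relay's actual code):

package main

import (
	"fmt"

	"github.com/golang/protobuf/proto"

	v2 "github.com/envoyproxy/go-control-plane/envoy/api/v2"
)

// unaggregatedKey is a hypothetical replacement for the current
// fmt.Sprintf("%s%s", unaggregatedPrefix, req.String()) logic: it
// strips the fields that legitimately differ between a request and
// its (N)ACKs so that all of them map to the same cache key.
func unaggregatedKey(prefix string, req *v2.DiscoveryRequest) string {
	stripped := proto.Clone(req).(*v2.DiscoveryRequest)
	stripped.VersionInfo = ""
	stripped.ResponseNonce = ""
	stripped.ErrorDetail = nil
	return fmt.Sprintf("%s%s", prefix, stripped.String())
}

func main() {
	req := &v2.DiscoveryRequest{TypeUrl: "type.googleapis.com/envoy.api.v2.Cluster"}
	ack := &v2.DiscoveryRequest{
		TypeUrl:       "type.googleapis.com/envoy.api.v2.Cluster",
		VersionInfo:   "1",
		ResponseNonce: "1",
	}
	// Both the initial request and its ACK now produce the same key.
	fmt.Println(unaggregatedKey("unaggregated_", req) == unaggregatedKey("unaggregated_", ack)) // true
}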

cc @jyotimahapatra @eapolinario for thoughts.

Following up on this one. We also want to test how xds-relay behaves in the case of a NACK, and whether it sends the error_details to the upstream control plane.
When the response from the upstream control plane contains a field that Envoy doesn't understand, Envoy sends error_details back in its next request. The presence of error_details indicates a NACK, and xds-relay should be able to transparently inform the upstream control plane about it. It is also possible, from xds-relay's perspective, to perform more actions if necessary. We can have a proposal about what we expect xds-relay to do in such scenarios. @samrabelachew fyi
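For reference, detecting a NACK on the relay side is straightforward; a minimal sketch (isNACK is a hypothetical helper, not xds-relay's actual code):

package sketch

import (
	v2 "github.com/envoyproxy/go-control-plane/envoy/api/v2"
)

// isNACK reports whether req rejects a previously pushed response.
// Per the xDS protocol, a NACK echoes the nonce of the rejected
// response and carries a non-nil error_detail; a plain ACK carries
// the nonce but no error_detail. On a NACK, the relay would forward
// the request, error_detail included, to the origin server unchanged.
func isNACK(req *v2.DiscoveryRequest) bool {
	return req.GetResponseNonce() != "" && req.GetErrorDetail() != nil
}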

We handled some aspects of NACK in #165.

Closing this for now. Let us know if there are any other use cases remaining.