Properly handle (N)ACKs
caitong93 opened this issue
It seems go-control-plane delegates handling of (N)ACKs to the cache implementation. For every request, the server tries to cancel the last watch and create a new one: https://github.com/envoyproxy/go-control-plane/blob/master/pkg/server/v2/server.go#L402 . This causes duplicate pushes in xds-relay.
Environment: this should be easy to reproduce; istiod can be replaced with another control plane.
config.yaml

```yaml
admin:
  address:
    address: 0.0.0.0
    port_value: 8889
server:
  address:
    address: 0.0.0.0
    port_value: 8888
origin_server:
  address:
    address: istiod
    port_value: 15010
logging:
  level: DEBUG
metrics_sink:
  statsd:
    root_prefix: abc
    flush_interval: 30s
    address:
      address: 127.0.0.1
      port_value: 12345
cache:
  ttl: 30s
  max_entries: 100000
```
rules.yaml

```yaml
fragments:
  - rules:
      - match:
          request_type_match:
            types:
              - "type.googleapis.com/envoy.api.v2.ClusterLoadAssignment"
        result:
          and_result:
            result_predicates:
              - request_node_fragment:
                  # field: 2
                  field: NODE_CLUSTER
                  action:
                    exact: true
              - string_fragment: eds
      - match:
          request_type_match:
            types:
              - "type.googleapis.com/envoy.api.v2.Cluster"
        result:
          and_result:
            result_predicates:
              - request_node_fragment:
                  # field: 2
                  field: NODE_CLUSTER
                  action:
                    exact: true
              - string_fragment: cds
```
Thanks @caitong93 for the issue - sorry I'm just seeing this now. Can you elaborate further on the duplicate pushing in xds-relay if the prior watch is cancelled? Can you also give me an example of what the behavior you're seeing is vs. expected behavior?
Sorry, I found it has nothing to do with the watch cancel. The issue is that the cache key is derived from `req.String()`, which contains the nonce. Consider the following case:

- client sends CDS request1 (nonce="")
- xds-relay fails to map the request to a key, generates key1 (nonce=""), and responds with response1 (nonce="1")
- client sends ack1 (nonce="1")
- xds-relay receives ack1 but handles it as a normal request: it again fails to map it, generates key2 (nonce="1"), and responds with response2 (nonce="2")
- client sends ack2 (nonce="2")
- ... (infinite loop)
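The loop above comes down to the nonce being part of the cache key. A minimal Go sketch (using a simplified, hypothetical `DiscoveryRequest` struct, not the real go-control-plane type) shows why keying on the full request string makes every ACK look like a new request, and why clearing the ACK-only fields before keying avoids that:

```go
package main

import "fmt"

// DiscoveryRequest is a simplified stand-in for the v2 xDS request type
// (hypothetical struct for illustration, not the real proto).
type DiscoveryRequest struct {
	TypeUrl       string
	ResourceNames []string
	VersionInfo   string
	ResponseNonce string
}

// buggyKey mirrors the reported behavior: the key is derived from the
// whole request, so the nonce leaks into the cache key.
func buggyKey(req DiscoveryRequest) string {
	return fmt.Sprintf("%+v", req)
}

// stableKey clears the ACK-only fields before deriving the key, so an
// ACK (same request plus a nonce) maps to the same entry as the
// original request.
func stableKey(req DiscoveryRequest) string {
	req.VersionInfo = ""
	req.ResponseNonce = ""
	return fmt.Sprintf("%+v", req)
}

func main() {
	initial := DiscoveryRequest{TypeUrl: "type.googleapis.com/envoy.api.v2.Cluster"}
	ack := initial
	ack.VersionInfo = "1"
	ack.ResponseNonce = "1"

	fmt.Println(buggyKey(initial) == buggyKey(ack))   // false: the ACK looks like a new request
	fmt.Println(stableKey(initial) == stableKey(ack)) // true: the ACK maps to the same key
}
```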
This can be reproduced using the example in the README: update `example/config-files/aggregation-rules.yaml` to remove the CDS match:
```yaml
fragments:
  - rules:
      - match:
          request_type_match:
            types:
              - "type.googleapis.com/envoy.api.v2.Listener"
              - "type.googleapis.com/envoy.api.v2.ClusterLoadAssignment"
              - "type.googleapis.com/envoy.api.v2.RouteConfiguration"
        result:
          request_node_fragment:
            field: 1
            action:
              regex_action: { pattern: "^(.*)-.*$", replace: "$1" }
  - rules:
      - match:
          request_type_match:
            types:
              - "type.googleapis.com/envoy.api.v2.Listener"
        result:
          string_fragment: "lds"
      - match:
          request_type_match:
            types:
              - "type.googleapis.com/envoy.api.v2.ClusterLoadAssignment"
        result:
          string_fragment: "eds"
      - match:
          request_type_match:
            types:
              - "type.googleapis.com/envoy.api.v2.RouteConfiguration"
        result:
          string_fragment: "rds"
  - rules:
      - match:
          request_type_match:
            types:
              - "type.googleapis.com/envoy.api.v2.RouteConfiguration"
        result:
          resource_names_fragment:
            element: 0
            action: { exact: true }
```
Then start Envoy:

```shell
docker run -it --rm -v `pwd`/example/config-files:/config-files envoyproxy/envoy-dev -c /config-files/envoy-bootstrap-1.yaml
```
I see what you mean. There needs to be some more design around how we want to support keys that don't map to an aggregation rule (#56). A couple of cursory options:

1. Error and cancel the watch, deferring to clients to make sure that all aggregation rules are captured.
-- This is probably not great, since the aggregation rules are currently not feature-rich enough to capture all use cases.
2. Maintain a separate cache for unaggregated requests. This would be a simple request/response cache.

cc @jyotimahapatra @eapolinario for thoughts.
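Option 2 could look roughly like the sketch below, where `mapKey` is a hypothetical stand-in for the aggregation-rule mapper: matched requests share one aggregated entry, while unmatched requests fall back to a per-node, per-type key instead of erroring or looping.

```go
package main

import "fmt"

// mapKey is a hypothetical stand-in for the aggregation-rule mapper: it
// returns the aggregated key and true when a rule matches the type URL,
// and false otherwise.
func mapKey(typeURL string) (string, bool) {
	rules := map[string]string{
		"type.googleapis.com/envoy.api.v2.Cluster": "cds",
	}
	k, ok := rules[typeURL]
	return k, ok
}

// cacheKey sketches option 2: requests matching a rule share one
// aggregated cache entry; unmatched requests get their own stable
// per-(node, type) key, so they are still served from a simple
// request/response cache.
func cacheKey(typeURL, nodeID string) string {
	if k, ok := mapKey(typeURL); ok {
		return k
	}
	return "unaggregated/" + nodeID + "/" + typeURL
}

func main() {
	fmt.Println(cacheKey("type.googleapis.com/envoy.api.v2.Cluster", "node-a"))
	fmt.Println(cacheKey("type.googleapis.com/envoy.api.v2.Listener", "node-a"))
}
```

Note the fallback key deliberately excludes the nonce, so ACKs still hit the same entry.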
Following up on this one. We also want to test how xds-relay behaves in case of a NACK, and whether it sends the error_details to the upstream control plane.

When the response from the upstream control plane contains a field that Envoy doesn't understand, Envoy sends error_details in its next request. The presence of error_details indicates a NACK, and xds-relay should be able to transparently inform the upstream control plane about it. It is also possible, from the xds-relay perspective, to perform more actions if necessary. We can have a proposal about what we expect xds-relay to do in such scenarios. @samrabelachew fyi
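For reference, the classification involved can be sketched as follows (using a simplified, hypothetical struct rather than the real DiscoveryRequest proto): per the xDS protocol, a request with no nonce is the initial request, a nonce without an error is an ACK, and a nonce together with an error detail is a NACK that a relay should surface upstream.

```go
package main

import "fmt"

// Status is a simplified stand-in for google.rpc.Status.
type Status struct {
	Code    int32
	Message string
}

// DiscoveryRequest is a simplified, hypothetical stand-in carrying only
// the fields relevant to ACK/NACK classification.
type DiscoveryRequest struct {
	ResponseNonce string
	ErrorDetail   *Status
}

// classify distinguishes the three request kinds a relay sees: the
// initial request (no nonce), an ACK (nonce, no error), and a NACK
// (nonce plus an error detail to forward upstream).
func classify(req DiscoveryRequest) string {
	switch {
	case req.ResponseNonce == "":
		return "initial"
	case req.ErrorDetail != nil:
		return "nack"
	default:
		return "ack"
	}
}

func main() {
	fmt.Println(classify(DiscoveryRequest{}))                    // initial
	fmt.Println(classify(DiscoveryRequest{ResponseNonce: "1"})) // ack
	fmt.Println(classify(DiscoveryRequest{
		ResponseNonce: "2",
		ErrorDetail:   &Status{Code: 3, Message: "unknown field"},
	})) // nack
}
```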
We handled some aspects of NACK in #165.
Closing this for now. Let us know if there are any other use cases remaining.