Multicluster examples do not work if trustDomain value is set
nea1 opened this issue · comments
Bug Description
The Multi-Primary on different networks multicluster example works exactly as expected when both clusters use the same trustDomain (e.g. cluster.local). However, if the trustDomain value differs across clusters (everything else remaining the same), cross-cluster calls do not work.
e.g.
cluster 1:
meshConfig:
  trustDomain: search.prod
cluster 2:
meshConfig:
  trustDomain: payments.prod
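For context, a sketch of where this fragment would sit in a full IstioOperator resource for cluster 1. The meshID, clusterName, and network values are assumptions borrowed from the naming used in the multi-primary guide, not values confirmed in this report; cluster 2 would differ in trustDomain, clusterName, and network.

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    trustDomain: search.prod     # per-cluster trust domain from this report
  values:
    global:
      meshID: mesh1              # assumed value
      multiCluster:
        clusterName: cluster1    # assumed value
      network: network1          # assumed value
```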
Both clusters have a common root of trust, as per the prerequisite setup in Configure Trust.
We require different trust domains to uniquely identify workloads across the estate; however, we also require them to share the same X.509 root cert.
Expected Behaviour
The workload certs in both clusters share the same root, so I would expect cross-cluster mTLS calls to succeed, i.e.:
- Calls to the helloworld service should be served by both the local v1 deployment and the remote v2 deployment
- Calls to a remote httpbin service should be successfully served over mTLS
Actual Behaviour
- Calls to the helloworld service are served only from the local cluster:
Hello version: v1, instance: helloworld-v1-7df57fccf6-wl85c
Hello version: v1, instance: helloworld-v1-7df57fccf6-wl85c
Hello version: v1, instance: helloworld-v1-7df57fccf6-wl85c
Hello version: v1, instance: helloworld-v1-7df57fccf6-wl85c
...
- Calls to a remote httpbin service fail:
kubectl exec -n foo deploy/sleep --context="${CTX_CLUSTER1}" -- curl http://httpbin.bar/get
upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: TLS error: 268435581:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED
I would expect differing trustDomain values to come into play only when SPIFFE IDs/principals are referenced in an AuthorizationPolicy, which is not the case here (no policies are deployed).
Unless some extra config is required for multiple trustDomains (with the same root) to work in a multicluster setup, this looks to be a bug?
Version
istioctl version ✔
client version: 1.13.4
control plane version: 1.13.4
data plane version: 1.13.4 (4 proxies)
Additional Information
No response
By default only the same trust domain is trusted; to add more, trustDomainAliases can be set.
@howardjohn my understanding is that trustDomainAliases effectively says "these trust domains should be considered the same as this trust domain", which is subtly different from "these trust domains should be trusted".
For example if I configure the following in the "payments" cluster:
meshConfig:
  trustDomain: payments.prod
  trustDomainAliases:
  - search.prod
...then a workload with SPIFFE ID spiffe://payments.prod/ns/foo/sa/sleep is considered identical to a workload with SPIFFE ID spiffe://search.prod/ns/foo/sa/sleep. Therefore an AuthorizationPolicy containing:
- from:
  - source:
      principals: ["payments.prod/ns/foo/sa/sleep"]
would allow access from both payments.prod/ns/foo/sa/sleep and search.prod/ns/foo/sa/sleep.
I still want to be able to distinguish between workloads with different SPIFFE IDs and, for example, allow access from the payments instance but deny access from the search instance. How do I do that?
I would have expected that, when there is a common root cert and no AuthorizationPolicy says otherwise, communication would be trusted by default. A user could then deploy an AuthorizationPolicy to deny other trust domains if desired, i.e. something like:
action: DENY
rules:
- from:
  - source:
      principals: ["anotherdomain.prod/*"]
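Fleshed out into a complete resource, that fragment might look like the sketch below. The metadata name and namespace are placeholders, not values from this setup. It only expresses the policy I would like to be able to write; it does not by itself make cross-trust-domain mTLS succeed.

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: deny-anotherdomain       # placeholder name
  namespace: foo                 # placeholder namespace
spec:
  action: DENY
  rules:
  - from:
    - source:
        # trailing "*" is a prefix match on the principal
        principals: ["anotherdomain.prod/*"]
```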
Ah good point... this does seem like a functionality gap.
Do you know if there is a workaround for this today? I was looking at certificatedata which appears to let you "add" trust domains to the current mesh, but looks like it relies on adding a bundle url (which in this case would be redundant, as it's effectively the same bundle)
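For reference, a rough and untested sketch of what that might look like in the "payments" cluster, assuming "certificatedata" refers to the CertificateData entries under meshConfig.caCertificates. The bundle URL is a hypothetical endpoint, and nothing in this thread confirms that this resolves the failure.

```yaml
meshConfig:
  trustDomain: payments.prod
  caCertificates:
  # hypothetical SPIFFE bundle endpoint; with a shared root it would
  # serve the same bundle the cluster already trusts
  - spiffeBundleUrl: https://bundles.example.com/search.prod
    trustDomains:
    - search.prod
```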
Hey @nea1 - did you ever find an alternative approach here? I've run into the exact same situation.
Hey @skizot722 - unfortunately not. I didn't pursue the certificatedata approach I listed above because, even if it worked, it would have been difficult to manage for our use case (we have tens of trust domains across 100+ clusters). Ultimately it would also have been a workaround rather than a proper solution. I don't currently have the bandwidth to contribute a fix, so if someone is able to do that, that would be awesome.
🚧 This issue or pull request has been closed due to not having had activity from an Istio team member since 2022-06-01. If you feel this issue or pull request deserves attention, please reopen the issue. Please see this wiki page for more information. Thank you for your contributions.
Created by the issue and PR lifecycle manager.