Controller crashes when certificates are handled by Vault PKI
joatmon08 opened this issue
Overview of the Issue
I configured a Gateway with a TLS certificate generated by the Vault PKI secrets engine. It comes up successfully, but when I create an HTTPRoute to add an upstream, the API Gateway controller throws an error and fails to add the route because it cannot validate the certificate's SPIFFE URL.
Reproduction Steps
- Create three self-signed root CAs and configure each with a Vault PKI secrets engine with two levels of intermediate certificates:
  - Cluster certificate (self-signed)
  - Connect CA (self-signed)
  - Consul API Gateway certificate (self-signed)
- Create a Consul cluster that uses the Vault PKI secrets engine.
  global:
    datacenter: "${CONSUL_DATACENTER}"
    name: consul
    secretsBackend:
      vault:
        enabled: true
        consulServerRole: ${CONSUL_SERVER_ROLE}
        consulClientRole: ${CONSUL_CLIENT_ROLE}
        consulCARole: ${CONSUL_CA_ROLE}
        manageSystemACLsRole: ${SERVER_ACL_INIT_ROLE}
        agentAnnotations: |
          "vault.hashicorp.com/namespace": "${VAULT_NAMESPACE}"
        connectCA:
          address: ${VAULT_ADDR}
          rootPKIPath: ${CONSUL_CONNECT_PKI_PATH_ROOT}
          intermediatePKIPath: ${CONSUL_CONNECT_PKI_PATH_INT}
          authMethodPath: ${KUBERNETES_AUTH_METHOD_PATH}
          additionalConfig: '"{"connect": [{ "ca_config": [{ "namespace": "${VAULT_NAMESPACE}"}]}]}"'
    tls:
      enabled: true
      enableAutoEncrypt: true
      caCert:
        secretName: "${CONSUL_PKI_PATH}/cert/ca"
      caKey:
        secretName: "${CONSUL_PKI_PATH}/issue/${CONSUL_SERVER_ROLE}"
        secretKey: private_key
    acls:
      manageSystemACLs: true
      bootstrapToken:
        secretName: "${CONSUL_STATIC_PATH}/data/bootstrap"
        secretKey: token
    gossipEncryption:
      secretName: ${CONSUL_STATIC_PATH}/data/gossip
      secretKey: key
  server:
    replicas: 1
    serverCert:
      secretName: "${CONSUL_PKI_PATH}/issue/${CONSUL_SERVER_ROLE}"
  connectInject:
    replicas: 1
    enabled: true
  controller:
    enabled: true
  terminatingGateways:
    enabled: true
    defaults:
      replicas: 1
  apiGateway:
    enabled: true
    logLevel: trace
    image: "hashicorp/consul-api-gateway:0.2.1"
    managedGatewayClass:
      serviceType: LoadBalancer
  ui:
    enabled: true
    service:
      enabled: true
      type: LoadBalancer
- Deploy a gateway with a TLS certificate.
  apiVersion: gateway.networking.k8s.io/v1alpha2
  kind: Gateway
  metadata:
    name: api-gateway
    namespace: default
  spec:
    gatewayClassName: consul-api-gateway
    listeners:
    - allowedRoutes:
        namespaces:
          from: Same
      name: https
      port: 8443
      protocol: HTTPS
      tls:
        certificateRefs:
        - group: ""
          kind: Secret
          name: consul-api-gateway-cert
        mode: Terminate
  The gateway comes up:
  $ kubectl get pods
  NAME                                             READY   STATUS    RESTARTS       AGE
  api-gateway-5d5dd555b5-9kxqh                     1/1     Running   0              8m35s
  consul-api-gateway-controller-6489bfb4dc-rn8rw   2/2     Running   18 (23m ago)   85m
- Deploy an HTTPRoute.
  apiVersion: gateway.networking.k8s.io/v1alpha2
  kind: HTTPRoute
  metadata:
    name: hashicups
  spec:
    parentRefs:
    - name: api-gateway
    rules:
    - matches:
      - path:
          type: PathPrefix
          value: /
      backendRefs:
      - kind: Service
        name: nginx
        namespace: default
        port: 80
  The controller throws an error and restarts:
  $ kubectl get pods
  NAME                                             READY   STATUS    RESTARTS       AGE
  api-gateway-5d5dd555b5-9kxqh                     1/1     Running   0              10m
  consul-api-gateway-controller-6489bfb4dc-rn8rw   1/2     Error     20 (19s ago)   87m
Logs
2022-06-03T16:39:26.260Z [INFO] manager/internal.go:383: consul-api-gateway-server.controller-runtime: starting metrics server: path=/metrics
2022-06-03T16:39:26.260Z [TRACE] envoy/secrets.go:300: consul-api-gateway-server.sds-server.secret-manager: running secrets manager
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0xafb0b1]
goroutine 324 [running]:
github.com/hashicorp/consul-api-gateway/internal/envoy.verifySPIFFE({0x1c56bf8, 0xc0007bbe60}, {0x1c987b0, 0xc00015b280}, 0x0, {0x1c3a9b8, 0xc00080a3c0})
/home/runner/work/consul-api-gateway/consul-api-gateway/internal/envoy/middleware.go:84 +0x1d1
github.com/hashicorp/consul-api-gateway/internal/envoy.SPIFFEStreamMiddleware.func1({0x198c780, 0xc0004fa090}, {0x1c732b0, 0xc00068ec00}, 0x167aec0, 0x1abe690)
/home/runner/work/consul-api-gateway/consul-api-gateway/internal/envoy/middleware.go:68 +0xc5
google.golang.org/grpc.(*Server).processStreamingRPC(0xc0003cd6c0, {0x1c85848, 0xc000017500}, 0xc0000c9b00, 0xc0004fa120, 0x2ae5940, 0x0)
/home/runner/go/pkg/mod/google.golang.org/grpc@v1.40.0/server.go:1557 +0xe9a
google.golang.org/grpc.(*Server).handleStream(0xc0003cd6c0, {0x1c85848, 0xc000017500}, 0xc0000c9b00, 0x0)
/home/runner/go/pkg/mod/google.golang.org/grpc@v1.40.0/server.go:1630 +0x9e5
google.golang.org/grpc.(*Server).serveStreams.func1.2()
/home/runner/go/pkg/mod/google.golang.org/grpc@v1.40.0/server.go:941 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
/home/runner/go/pkg/mod/google.golang.org/grpc@v1.40.0/server.go:939 +0x294
Expected behavior
I expected the HTTPRoute to add an upstream to my service so that I could access the service over HTTPS.
Environment details
- consul-api-gateway version: v0.2.1
- Kubernetes version: v1.22.9-eks-a64ea69
- Consul Server version: v1.12.0
- Consul-K8s version: v0.44.0
- Cloud Provider (If self-hosted, the Kubernetes provider utilized): EKS, AKS, GKE, OpenShift (and version), Rancher (and version), TKGI (and version): EKS v1.22.9
Additional Context
You can find the full deployment (including Vault PKI secrets engine setup and certificate generation) at joatmon08/hashicorp-stack-demoapp.
So, this is likely coming from an issue with our server-side mTLS verification for SDS. Besides using our root cert for cryptographic verification, we use the root and leaf cert SPIFFE URLs to verify the identity of a known gateway and to ensure that it is allowed to request certain certificates. This identity verification happens after the cryptographic verification of the leaf certs using the requested root cert.
Included in the ID check is this bit:
consul-api-gateway/internal/envoy/middleware.go
Lines 84 to 87 in 358445d
where spiffeCA comes from the Connect root cert from Consul (in this case, backed by Vault's PKI infrastructure). It seems like the root CA for Connect in this case has no SPIFFE identifier (probably due to the particularities of the Vault setup), though Consul generally has one when using its default PKI setup. I think this means that we can't always assume that the root actually has such an identifier.
My suggestion is that we consider just dropping this particular check and ignore the "host" part of the SPIFFE URL in the client cert. We'd still use the rest of the SPIFFE path for identifying the namespace/name of the deployed gateway and aligning it with our gateway configuration, but we should be able to ignore the need for a root CA SPIFFE component and only leverage the CA for cryptographic verification. So, TL;DR: just remove the above lines and I think we should be good.