caius / spacelift-tailscale

Docker image for Spacelift containing Tailscale

Update docs for HTTPS_PROXY

caius opened this issue

We can't set HTTPS_PROXY as a plain environment variable because it breaks Spacelift uploading your planned resources/changes into its own S3 bucket.

Instead we need to export HTTPS_PROXY=http://127.0.0.1:8080 in the before_* hooks and unset HTTPS_PROXY in the matching after_* hooks. This stops Spacelift saving the envariable as part of the environment.

(We can't do the unsetting from the trap … EXIT handler because that runs after Spacelift has saved the transient envariables to disk.)
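For reference, the hook pair described above amounts to roughly this (a minimal sketch; the full working version is worked out below):

  # before_* hooks: route HTTPS traffic through the local tailscaled proxy
  export HTTPS_PROXY=http://127.0.0.1:8080

  # after_* hooks: drop the variable before Spacelift persists the environment
  unset HTTPS_PROXY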

The plot thickens even further. What we're trying to achieve is calling terraform with HTTPS_PROXY=http://127.0.0.1:8080 so HTTPS connections from within terraform go via tailscaled, which is acting as an HTTP proxy. Anything for the tailnet goes across it; anything for the internet goes straight out of the container.

The limitations are that we can only start tailscaled during the before hooks of a phase, and that we need to stop it during the after hooks (or when the shell exits); otherwise Spacelift sits there for 10 minutes waiting for all processes in the container to exit.
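(For context — and this is a hypothetical sketch, not the actual spacetail script — the up/down steps boil down to running tailscaled in userspace mode with its outbound HTTP proxy listener, assuming an auth key in a TAILSCALE_AUTH_KEY variable:)

  # Hypothetical sketch of the "up" step: userspace tailscaled exposing an
  # outbound HTTP proxy on 127.0.0.1:8080, then joining the tailnet.
  tailscaled --state=mem: --tun=userspace-networking \
    --outbound-http-proxy-listen=localhost:8080 &
  tailscale up --authkey="${TAILSCALE_AUTH_KEY}"

  # Hypothetical sketch of the "down" step: leave the tailnet and stop the
  # daemon so all processes in the container can exit.
  tailscale logout
  pkill tailscaled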

Initial Attempt

For the Plan & Perform phases, Spacelift runs the following in order:

  1. Environment variables saved to /mnt/workspace/.env_hooks_before
  2. before_$phase hooks invoked
  3. $phase command invoked (eg, terraform plan …)
  4. after_$phase hooks invoked
  5. Environment variables saved to /mnt/workspace/.env_hooks_after
  6. Changes to resources taken from workspace, uploaded into S3/control plane

Steps 1 through 5 run inside the same ash shell (sh -c "… && …"); step 6 I guess runs from the spacelift-worker binary itself, though I haven't confirmed that. (Setting set -o xtrace in the ash shell has no effect after the shell exits.)
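In other words, the worker seems to compose something roughly like the following for Plan/Perform (a hedged reconstruction from the observed ordering, not Spacelift's actual command line; how the environment is actually captured is a guess):

  sh -c '
    env > /mnt/workspace/.env_hooks_before &&        # 1. persist environment
    spacetail up && trap "spacetail down" EXIT &&    # 2. before_plan hooks
    terraform plan -out=spacelift.plan &&            # 3. phase command (plan file name is a guess)
    true &&                                          # 4. after_plan hooks (none in this attempt)
    env > /mnt/workspace/.env_hooks_after            # 5. persist environment again
  '
  # The EXIT trap (and so "spacetail down") only fires as the shell exits, after
  # step 5, which is why unsetting variables from the trap is too late.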

In the initial instance (as per the README on main) we define the before_$phase hooks on a context as:

  • spacetail up

  • trap 'spacetail down' EXIT

    (Using trap here stops the tailscaled process as the shell exits after $phase has been executed, even if the $phase command errored. If we instead use an after_$phase hook to stop tailscaled, it doesn't get stopped when the $phase command errors, and the container gets "stuck" waiting for processes to exit until it hits a 10-minute timeout.)

And we have the following environment variables set on the context too:

  • HTTP_PROXY=http://127.0.0.1:8080
  • HTTPS_PROXY=http://127.0.0.1:8080

Running a plan now lets terraform talk to HTTPS endpoints over the tailnet, but then Step 6 fails with a proxy error:

[01HP7K3N3GNWZ9KQG934MM6A29] Changes are GO
╷
│ Error: error configuring S3 Backend: error validating provider credentials: error calling sts:GetCallerIdentity: RequestError: send request failed
│ caused by: Post "https://sts.amazonaws.com/": proxyconnect tcp: dial tcp 127.0.0.1:8080: connect: connection refused
│ 
│ 
╵
[01HP7K3N3GNWZ9KQG934MM6A29] unexpected exit code when running show command: 1
[01HP7K3N3GNWZ9KQG934MM6A29] Uploading the list of managed resources...
╷
│ Error: error configuring S3 Backend: error validating provider credentials: error calling sts:GetCallerIdentity: RequestError: send request failed
│ caused by: Post "https://sts.amazonaws.com/": proxyconnect tcp: dial tcp 127.0.0.1:8080: connect: connection refused
│ 
│ 
╵
[01HP7K3N3GNWZ9KQG934MM6A29] Unexpected exit code when listing outputs: 1

I think this is bubbling out from https://github.com/golang/go/blob/e17e5308fd5a26da5702d16cc837ee77cdb30ab6/src/net/http/transport.go#L1617, which is why I suspect spacelift-worker is trying to talk to S3 itself and Go is picking up HTTP_PROXY / HTTPS_PROXY from the environment it's running in.
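(The same failure mode is easy to reproduce locally, outside Spacelift — a minimal sketch assuming a working directory with an S3 backend configured and nothing actually listening on 127.0.0.1:8080:)

  # terraform is a Go binary, and Go's default HTTP transport honours HTTPS_PROXY,
  # so the backend/STS calls die with the same "proxyconnect ... connection refused".
  HTTPS_PROXY=http://127.0.0.1:8080 terraform init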

Looking at Spacelift's terraform workflow.yml, it'll be calling either terraform show -json or terraform show -json {{ .PlanFileName }}, and presumably loading the envariables from the file to make sure terraform can execute properly. In this case we don't want it to use the proxy, because our state isn't over the tailnet. (And we can't leave tailscaled running for this anyway, because it happens after the point where we control the runtime.)

unset Attempt

After some thinking and debugging, I came up with setting the HTTP_PROXY and HTTPS_PROXY environment variables in the before-phase hooks and unsetting them in the after-phase hooks, rather than defining them as context environment variables. That way they never get persisted into the env file, so they also aren't defined at the point spacelift-worker does the S3 upload.

(As mentioned in the issue body above, I initially attempted the unset from the trap, but that doesn't work because the environment is persisted to disk before the shell exits and the trap is called.)
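(That non-working trap variant looked roughly like this, for completeness — a sketch, not the exact hook:)

  # Too late: by the time the EXIT trap fires, the shell has already written the
  # environment (including HTTPS_PROXY) to /mnt/workspace/.env_hooks_after.
  trap 'unset HTTP_PROXY HTTPS_PROXY; spacetail down' EXIT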

So we end up with the context having no environment variables set, and the before_$phase hooks set to:

  • spacetail up
  • trap 'spacetail down' EXIT
  • export HTTP_PROXY=http://127.0.0.1:8080 HTTPS_PROXY=http://127.0.0.1:8080

And the corresponding after_$phase hooks set to:

  • unset HTTP_PROXY HTTPS_PROXY

🎉 This works for the Plan phase. 😭 And then fails for the Apply phase. (It also works fine for the Perform phase.)

Turns out, after a bunch of debugging (drop set -o xtrace into a before hook to observe what's being run in the shell), there's an ordering difference (bug?) in the Apply phase compared to the Plan and Perform phases.

For the Apply phase the list of steps above doesn't hold true: the environment persistence and after hooks are swapped. So I'm observing the following happening for an Apply phase:

  1. Environment variables saved to /mnt/workspace/.env_hooks_before
  2. before_$phase hooks invoked
  3. $phase command invoked (eg, terraform apply …)
  4. Environment variables saved to /mnt/workspace/.env_hooks_after
  5. after_$phase hooks invoked
  6. Changes to resources taken from workspace, uploaded into S3/control plane

(Steps 4 and 5 are reversed.)

So now we're back to the terraform show / S3 bucket upload erroring out, trying to use an HTTPS_PROXY pointing at a tailscaled that has been shut down by the time Step 6 is invoked.

[01HPGVA7CVJEAH3CPHARYR31N7] Changes applied successfully
[01HPGVA7CVJEAH3CPHARYR31N7] Uploading the list of managed resources...
╷
│ Error: error configuring S3 Backend: error validating provider credentials: error calling sts:GetCallerIdentity: RequestError: send request failed
│ caused by: Post "https://sts.amazonaws.com/": proxyconnect tcp: dial tcp 127.0.0.1:8080: connect: connection refused
│ 
│ 
╵
[01HPGVA7CVJEAH3CPHARYR31N7] Uploading the list of managed resources failed: unexpected exit code when running show command: 1

As a workaround for now, I'm editing the /mnt/workspace/.env_hooks_after file on disk in an after hook to remove the HTTP_PROXY= and HTTPS_PROXY= lines, so they aren't loaded when Step 6 runs terraform show. So the working hooks for now are:

before_$phase:

  • spacetail up
  • trap 'spacetail down' EXIT
  • export HTTP_PROXY=http://127.0.0.1:8080 HTTPS_PROXY=http://127.0.0.1:8080

after_$phase:

  • unset HTTP_PROXY HTTPS_PROXY
  • sed -e '/HTTP_PROXY=/d' -e '/HTTPS_PROXY=/d' -i /mnt/workspace/.env_hooks_after || true

This ensures the environment variables aren't left in the env after any of the phases run, but it's weird that the Apply phase saves the environment variables before running the after hooks.
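A quick way to sanity-check the sed workaround from an extra after hook (hypothetical, not part of the image):

  # Confirm no proxy variables survive in the file that Step 6 will load.
  grep -E "(HTTP|HTTPS)_PROXY=" /mnt/workspace/.env_hooks_after \
    && echo "proxy vars still persisted" \
    || echo "env_hooks_after is clean"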