caius / spacelift-tailscale

Docker image for Spacelift containing Tailscale

Update docs for HTTPS_PROXY

caius opened this issue

We can't set HTTPS_PROXY as a plain environment variable because it breaks Spacelift uploading your planned resources/changes into its own S3 bucket.

Instead we need to export HTTPS_PROXY=http://127.0.0.1:8080 in the before_* hooks and unset HTTPS_PROXY in the matching after_* hooks. This stops Spacelift saving the envariable as part of the environment.

(We can't do the unsetting from the trap … EXIT handler because that runs after Spacelift has saved the transient envariables to disk.)
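For reference, the hook pair described above amounts to roughly this (a minimal sketch; the full working version is worked out below):

  # before_* hooks: route HTTPS traffic through the local tailscaled proxy
  export HTTPS_PROXY=http://127.0.0.1:8080

  # after_* hooks: drop the variable before Spacelift persists the environment
  unset HTTPS_PROXY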

The plot thickens even further. What we're trying to achieve is calling terraform with HTTPS_PROXY=http://127.0.0.1:8080 so HTTPS connections from within terraform go via tailscaled, which is acting as an HTTP proxy. Anything for the tailnet goes across it; anything for the internet goes straight out of the container.

The limitations are that we can only start tailscaled during the before hooks of a phase, and that we need to stop it during the after hooks (or when the shell exits); otherwise Spacelift sits there for 10 minutes waiting for all processes in the container to exit.
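(For context — and this is a hypothetical sketch, not the actual spacetail script — the up/down steps boil down to running tailscaled in userspace mode with its outbound HTTP proxy listener, assuming an auth key in a TAILSCALE_AUTH_KEY variable:)

  # Hypothetical sketch of the "up" step: userspace tailscaled exposing an
  # outbound HTTP proxy on 127.0.0.1:8080, then joining the tailnet.
  tailscaled --state=mem: --tun=userspace-networking \
    --outbound-http-proxy-listen=localhost:8080 &
  tailscale up --authkey="${TAILSCALE_AUTH_KEY}"

  # Hypothetical sketch of the "down" step: leave the tailnet and stop the
  # daemon so all processes in the container can exit.
  tailscale logout
  pkill tailscaled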

Initial Attempt

For the Plan & Perform phases, Spacelift runs the following in order:

  1. Environment variables saved to /mnt/workspace/.env_hooks_before
  2. before_$phase hooks invoked
  3. $phase command invoked (eg, terraform plan …)
  4. after_$phase hooks invoked
  5. Environment variables saved to /mnt/workspace/.env_hooks_after
  6. Changes to resources taken from workspace, uploaded into S3/control plane

Steps 1 through 5 run inside the same ash shell (sh -c "… && …"); step 6 I guess runs from the spacelift-worker binary itself, though I haven't confirmed that. (Setting set -o xtrace in the ash shell has no effect after the shell exits.)
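In other words, the worker seems to compose something roughly like the following for Plan/Perform (a hedged reconstruction from the observed ordering, not Spacelift's actual command line; how the environment is actually captured is a guess):

  sh -c '
    env > /mnt/workspace/.env_hooks_before &&        # 1. persist environment
    spacetail up && trap "spacetail down" EXIT &&    # 2. before_plan hooks
    terraform plan -out=spacelift.plan &&            # 3. phase command (plan file name is a guess)
    true &&                                          # 4. after_plan hooks (none in this attempt)
    env > /mnt/workspace/.env_hooks_after            # 5. persist environment again
  '
  # The EXIT trap (and so "spacetail down") only fires as the shell exits, after
  # step 5, which is why unsetting variables from the trap is too late.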

In the initial instance (as per the README on main) we define the before_$phase hooks on a context as:

  • spacetail up

  • trap 'spacetail down' EXIT

    (Using trap here stops the tailscaled process as the shell exits after $phase has been executed, even if the $phase command errored. If we instead use an after_$phase hook to stop tailscaled, it doesn't get stopped when the $phase command errors, and the container gets "stuck" waiting for processes to exit until it hits a 10-minute timeout.)

And we have the following environment variables set on the context too:

  • HTTP_PROXY=http://127.0.0.1:8080
  • HTTPS_PROXY=http://127.0.0.1:8080

Running a plan now lets terraform talk to HTTPS endpoints over the tailnet, but then Step 6 fails with a proxy error:

[01HP7K3N3GNWZ9KQG934MM6A29] Changes are GO
╷
│ Error: error configuring S3 Backend: error validating provider credentials: error calling sts:GetCallerIdentity: RequestError: send request failed
│ caused by: Post "https://sts.amazonaws.com/": proxyconnect tcp: dial tcp 127.0.0.1:8080: connect: connection refused
│ 
│ 
╵
[01HP7K3N3GNWZ9KQG934MM6A29] unexpected exit code when running show command: 1
[01HP7K3N3GNWZ9KQG934MM6A29] Uploading the list of managed resources...
╷
│ Error: error configuring S3 Backend: error validating provider credentials: error calling sts:GetCallerIdentity: RequestError: send request failed
│ caused by: Post "https://sts.amazonaws.com/": proxyconnect tcp: dial tcp 127.0.0.1:8080: connect: connection refused
│ 
│ 
╵
[01HP7K3N3GNWZ9KQG934MM6A29] Unexpected exit code when listing outputs: 1

I think this is bubbling out from https://github.com/golang/go/blob/e17e5308fd5a26da5702d16cc837ee77cdb30ab6/src/net/http/transport.go#L1617, which is why I suspect spacelift-worker is trying to talk to S3 itself and Go is picking up HTTP_PROXY / HTTPS_PROXY from the environment it's running in.
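(The same failure mode is easy to reproduce locally, outside Spacelift — a minimal sketch assuming a working directory with an S3 backend configured and nothing actually listening on 127.0.0.1:8080:)

  # terraform is a Go binary, and Go's default HTTP transport honours HTTPS_PROXY,
  # so the backend/STS calls die with the same "proxyconnect ... connection refused".
  HTTPS_PROXY=http://127.0.0.1:8080 terraform init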

Looking at Spacelift's terraform workflow.yml, it'll be calling either terraform show -json or terraform show -json {{ .PlanFileName }}, and presumably loading the envariables from the file to make sure terraform can execute properly. In this case we don't want it to use the proxy, because our state isn't over the tailnet. (And we can't leave tailscaled running for this anyway, because it happens after the point where we control the runtime.)

unset Attempt

After some thinking and debugging, I came up with setting the HTTP_PROXY and HTTPS_PROXY environment variables in the before-phase hooks and unsetting them in the after-phase hooks, rather than defining them as context environment variables. That way they never get persisted into the env file, so they also aren't defined at the point spacelift-worker does the S3 upload.

(As mentioned in the issue body above, I initially attempted the unset from the trap, but that doesn't work because the environment is persisted to disk before the shell exits and the trap is called.)
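(That non-working trap variant looked roughly like this, for completeness — a sketch, not the exact hook:)

  # Too late: by the time the EXIT trap fires, the shell has already written the
  # environment (including HTTPS_PROXY) to /mnt/workspace/.env_hooks_after.
  trap 'unset HTTP_PROXY HTTPS_PROXY; spacetail down' EXIT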

So we end up with the context having no environment variables set, and the before_$phase hooks set to:

  • spacetail up
  • trap 'spacetail down' EXIT
  • export HTTP_PROXY=http://127.0.0.1:8080 HTTPS_PROXY=http://127.0.0.1:8080

And the corresponding after_$phase hooks set to:

  • unset HTTP_PROXY HTTPS_PROXY

🎉 This works for the Plan phase. 😭 And then fails for the Apply phase. (It also works fine for the Perform phase.)

Turns out, after a bunch of debugging (drop set -o xtrace into a before hook to observe what's being run in the shell), there's an ordering difference (bug?) in the Apply phase compared to the Plan and Perform phases.

For the Apply phase the list of steps above doesn't hold true: the environment persistence and after hooks are swapped. So I'm observing the following happening for an Apply phase:

  1. Environment variables saved to /mnt/workspace/.env_hooks_before
  2. before_$phase hooks invoked
  3. $phase command invoked (eg, terraform apply …)
  4. Environment variables saved to /mnt/workspace/.env_hooks_after
  5. after_$phase hooks invoked
  6. Changes to resources taken from workspace, uploaded into S3/control plane

(Steps 4 and 5 are reversed.)

So now we're back to the terraform show / S3 bucket upload erroring out, trying to use an HTTPS_PROXY pointing at a tailscaled that has been shut down by the time Step 6 is invoked.

[01HPGVA7CVJEAH3CPHARYR31N7] Changes applied successfully
[01HPGVA7CVJEAH3CPHARYR31N7] Uploading the list of managed resources...
╷
│ Error: error configuring S3 Backend: error validating provider credentials: error calling sts:GetCallerIdentity: RequestError: send request failed
│ caused by: Post "https://sts.amazonaws.com/": proxyconnect tcp: dial tcp 127.0.0.1:8080: connect: connection refused
│ 
│ 
╵
[01HPGVA7CVJEAH3CPHARYR31N7] Uploading the list of managed resources failed: unexpected exit code when running show command: 1

As a workaround for now, I'm editing the /mnt/workspace/.env_hooks_after file on disk in an after hook to remove the HTTP_PROXY= and HTTPS_PROXY= lines, so they aren't loaded when Step 6 runs terraform show. So the working hooks for now are:

before_$phase:

  • spacetail up
  • trap 'spacetail down' EXIT
  • export HTTP_PROXY=http://127.0.0.1:8080 HTTPS_PROXY=http://127.0.0.1:8080

after_$phase:

  • unset HTTP_PROXY HTTPS_PROXY
  • sed -e '/HTTP_PROXY=/d' -e '/HTTPS_PROXY=/d' -i /mnt/workspace/.env_hooks_after || true

This ensures the environment variables aren't left in the env after any of the phases run, but it's weird that the Apply phase saves the environment variables before running the after hooks.
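A quick way to sanity-check the sed workaround from an extra after hook (hypothetical, not part of the image):

  # Confirm no proxy variables survive in the file that Step 6 will load.
  grep -E "(HTTP|HTTPS)_PROXY=" /mnt/workspace/.env_hooks_after \
    && echo "proxy vars still persisted" \
    || echo "env_hooks_after is clean"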