zhaofengli / attic

Multi-tenant Nix Binary Cache

Home Page: https://docs.attic.rs


Stream error in h2 framing layer reported by nix

oluceps opened this issue

I deployed attic with PostgreSQL and S3 (MinIO) behind a Caddy reverse proxy. Some store paths download normally, but others don't.

When downloading from attic, Nix keeps reporting HTTP error 200 (curl error: Stream error in the HTTP/2 framing layer).

nix-store --store $PWD/nix-demo -r /nix/store/bl6bp58rs0iw42991h88q2b9rd3gd5sf-rust_niri-0.1.4

<snip>
warning: error: unable to download 'https://attic.nyaw.xyz/dev/nar/bl6bp58rs0iw42991h88q2b9rd3gd5sf.nar': HTTP error 200 (curl error: Stream error in the HTTP/2 framing layer); retrying in 272 ms
warning: error: unable to download 'https://attic.nyaw.xyz/dev/nar/bl6bp58rs0iw42991h88q2b9rd3gd5sf.nar': HTTP error 200 (curl error: Stream error in the HTTP/2 framing layer); retrying in 588 ms
warning: error: unable to download 'https://attic.nyaw.xyz/dev/nar/bl6bp58rs0iw42991h88q2b9rd3gd5sf.nar': HTTP error 200 (curl error: Stream error in the HTTP/2 framing layer); retrying in 1057 ms
warning: error: unable to download 'https://attic.nyaw.xyz/dev/nar/bl6bp58rs0iw42991h88q2b9rd3gd5sf.nar': HTTP error 200 (curl error: Stream error in the HTTP/2 framing layer); retrying in 2458 ms
error: unable to download 'https://attic.nyaw.xyz/dev/nar/bl6bp58rs0iw42991h88q2b9rd3gd5sf.nar': HTTP error 200 (curl error: Stream error in the HTTP/2 framing layer)
error: build of '/nix/store/bl6bp58rs0iw42991h88q2b9rd3gd5sf-rust_niri-0.1.4' failed



This occurs every time I try to download.

I tried setting http2 = false in the Nix config, and the error changes to HTTP error 200 (curl error: Transferred a partial file), still 100% reproducible. Switching the reverse proxy from Caddy to nginx produces the same error as well.
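
For reference, the line I toggled (a minimal sketch, assuming a user-level ~/.config/nix/nix.conf; it can also go in /etc/nix/nix.conf):

# Disable HTTP/2 for binary cache downloads; Nix falls back to HTTP/1.1
http2 = false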

info

caddy config: https://github.com/oluceps/nixos-config/blob/baa0ac6404a554d3ce4ab92e41a794b1bbc279cd/hosts/hastur/caddy.nix#L27

nginx config: https://github.com/oluceps/nixos-config/blob/0135fc3e596792085ea93cde1d57a32be2ac1798/srv/nginx.nix#L6

atticd config: https://github.com/oluceps/nixos-config/blob/0135fc3e596792085ea93cde1d57a32be2ac1798/srv/atticd.nix#L4

nix config (others placed in ~/.config/nix): https://github.com/oluceps/nixos-config/blob/0135fc3e596792085ea93cde1d57a32be2ac1798/misc.nix#L34

nix-info -m
 - system: `"x86_64-linux"`
 - host os: `Linux 6.8.1-cachyos, NixOS, 24.05 (Uakari), 24.05.20240329.d8fe5e6`
 - multi-user?: `no`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.18.2`
 - channels(root): `"nixos"`
 - nixpkgs: `/nix/store/2h6rmgvakbz2mhy8l8img6cxxx200d08-cb1gs888vfqxawvc65q1dk6jzbayh3wz-source`

Weird. It seems fixed after I recreated the psql db. The service had been failing with this error:

Apr 07 16:57:17 hastur systemd[1]: Started atticd.service.
░░ Subject: A start job for unit atticd.service has finished successfully
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░ 
░░ A start job for unit atticd.service has finished successfully.
░░ 
░░ The job identifier is 6966.
Apr 07 16:57:17 hastur atticd[3734]: Attic Server 0.1.0 (release)
Apr 07 16:57:17 hastur atticd[3734]: Running migrations...
Apr 07 16:57:17 hastur atticd[3734]: Starting API server...
Apr 07 16:57:17 hastur atticd[3734]: Listening on [::]:8083...
Apr 07 16:57:41 hastur atticd[3734]: thread 'main' panicked at /nix/store/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee-vendor-cargo-deps/c19b7c6f923b580ac259164a89f2577984ad5ab09ee9d583b888f934adbbe8d0/sqlx-postgres-0.7.3/src/message/parse.rs:31:13:
Apr 07 16:57:41 hastur atticd[3734]: assertion failed: self.param_types.len() <= (u16::MAX as usize)
Apr 07 16:57:41 hastur atticd[3734]: note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Apr 07 16:57:42 hastur systemd[1]: atticd.service: Main process exited, code=exited, status=101/n/a
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░ 
░░ An ExecStart= process belonging to unit atticd.service has exited.
░░ 
░░ The process' exit code is 'exited' and its exit status is 101.

I don't know about the h2 issues, but that assertion failure sounds similar to #130 and #115 (which has a PR attempting to fix it: #116).

Running into this issue too with my CI. The reverse proxy is Caddy, and attic uses PostgreSQL via UNIX sockets plus S3 via Garage. Occasionally the spurious H2 failures cause Nix to just stop using it as a substituter entirely, which is pretty annoying. There are no errors in my attic logs when this happens.

error: unable to download 'https://attic.kennel.juneis.dog/conduwuit/nar/yx6f4mdc4050n19hfvc5hy8rcf8sm0qw.nar': HTTP error 200 (curl error: Stream error in the HTTP/2 framing layer)

error: substituter 'https://attic.kennel.juneis.dog/conduit' is disabled
error: substituter 'https://attic.kennel.juneis.dog/conduit' is disabled
error: substituter 'https://attic.kennel.juneis.dog/conduwuit' is disabled
error: substituter 'https://attic.kennel.juneis.dog/conduwuit' is disabled
error: substituter 'https://attic.kennel.juneis.dog/conduit' is disabled
error: substituter 'https://attic.kennel.juneis.dog/conduit' is disabled
error: substituter 'https://attic.kennel.juneis.dog/conduit' is disabled

I was able to reproduce the HTTP/2 errors with the following reverse proxy config (note that TLS is required to reproduce curl error: Stream error in the HTTP/2 framing layer -- without TLS, I was instead seeing curl error: Transferred a partial file):

# Caddyfile
# attic
:8081 {
	tls scadrial.tailb203c.ts.net.crt scadrial.tailb203c.ts.net.key
	reverse_proxy h2c://[::]:8080 {
		transport http {
			versions h2c 2
		}
	}
}

# minio s3
:9998 {
	tls scadrial.tailb203c.ts.net.crt scadrial.tailb203c.ts.net.key
	reverse_proxy http://127.0.0.1:9999 {
		transport http {
			versions h2c 2
		}
	}
}

but not with this one:

# Caddyfile
# attic
:8081 {
	tls scadrial.tailb203c.ts.net.crt scadrial.tailb203c.ts.net.key
	reverse_proxy h2c://[::]:8080 {
		transport http {
			versions h2c 2
		}
	}
}

# minio s3
:9998 {
	tls scadrial.tailb203c.ts.net.crt scadrial.tailb203c.ts.net.key
	reverse_proxy http://127.0.0.1:9999 {
	}
}

I don't know if this is exactly the same issue you are seeing, though. It may be worth modifying your atticd command's environment to include RUST_LOG=hyper::proto::h2=trace, which will show you logs like this when it fails:

2024-05-26T22:23:43.348934Z DEBUG hyper::proto::h2::server: stream error: error from user's HttpBody stream: Storage error: service error
   0: tokio::task::runtime.spawn
           with kind=task task.name= task.id=56 loc.file="/home/vin/workspace/vcs/attic/attic/src/stream.rs" loc.line=85 loc.col=34
             at /home/vin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/util/trace.rs:17
   1: tower_http::trace::make_span::request
           with method=GET uri=http://scadrial.tailb203c.ts.net:8081/test-test/nar/7g9f1rh7ab38q81llw9h8l4c56g00405.nar version=HTTP/2.0
             at /home/vin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-http-0.4.4/src/trace/make_span.rs:109
   2: tokio::task::runtime.spawn
           with kind=task task.name= task.id=52 loc.file="/home/vin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-0.14.27/src/common/exec.rs" loc.line=49 loc.col=21
             at /home/vin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/util/trace.rs:17
   3: tokio::task::runtime.spawn
           with kind=task task.name= task.id=45 loc.file="/home/vin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-0.14.27/src/common/exec.rs" loc.line=49 loc.col=21
             at /home/vin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/util/trace.rs:17

and in the Caddy logs, I saw (among other errors):

2024/05/26 22:23:44.574 ERROR   http.log.error  unexpected EOF  {"request": {"remote_ip": "100.109.66.101", "remote_port": "55646", "client_ip": "100.109.66.101", "proto": "HTTP/2.0", "method": "GET", "host": "scadrial.tailb203c.ts.net:9998", "uri": "/attic/249ee91f-9bc5-4606-8f08-10a17ed00f03.chunk?x-id=GetObject", "headers": {"Authorization": [], "X-Amz-Content-Sha256": ["e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"], "Amz-Sdk-Request": ["attempt=2; max=3"], "Amz-Sdk-Invocation-Id": ["7fb1131b-6231-4d1e-969f-3df25ddc4614"], "User-Agent": ["aws-sdk-rust/0.57.1 os/linux lang/rust/1.76.0"], "X-Amz-User-Agent": ["aws-sdk-rust/0.57.1 api/s3/0.35.0 os/linux lang/rust/1.76.0"], "X-Amz-Date": ["20240526T222344Z"]}, "tls": {"resumed": true, "version": 772, "cipher_suite": 4865, "proto": "h2", "server_name": "scadrial.tailb203c.ts.net"}}, "duration": 0.000733056, "status": 502, "err_id": "3bnwbtvet", "err_trace": "reverseproxy.statusError (reverseproxy.go:1267)"}

My conclusion is that, for whatever reason, the connection between attic and your S3 backend is trying to use HTTP/2 to download the chunks (before attic reassembles the chunks into a NAR and streams that back in an HTTP/2 response). If you can force HTTP/1.1 between attic and your S3 backend, this might go away?

When I did some research, I found that S3 doesn't appear to actually support HTTP/2 directly? Or at least there are issues scattered around the internet that aren't really clear either way. So this may be the problem: the communication between attic and S3 is erroneously trying to upgrade to HTTP/2 when it's unsupported. I can't say for sure, but maybe this is helpful in tracking down the issue.
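
One way to test that, as a sketch only (adapted from the MinIO block above; whether it maps onto your setup is an assumption), is to pin the proxy's upstream transport to HTTP/1.1 so nothing between attic and the S3 backend negotiates h2:

# Caddyfile
# minio s3, upstream transport pinned to HTTP/1.1
:9998 {
	tls scadrial.tailb203c.ts.net.crt scadrial.tailb203c.ts.net.key
	reverse_proxy http://127.0.0.1:9999 {
		transport http {
			versions 1.1
		}
	}
}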

Your Caddyfile seems to be giving me better results. I re-ran a CI job that pulls from attic and I don't see any failures now.

Unfortunately still getting HTTP/2 errors: https://gitlab.com/conduwuit/conduwuit/-/jobs/6952157628#L440

I'll re-read your comment and see if I missed something.

(I get a 404 when I try to click that, but I would be interested to know if you saw anything in the atticd logs while running it with RUST_LOG=hyper::proto::h2=trace)
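
In case it helps, one way to set that for a systemd-managed atticd is a drop-in (a sketch; the unit name is taken from your logs, the file path is an assumption):

# /etc/systemd/system/atticd.service.d/trace.conf
[Service]
Environment=RUST_LOG=hyper::proto::h2=trace

followed by a systemctl daemon-reload and a restart of atticd.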

Had the GitLab mirror private for some reason (how were other people able to star it in the past?)

I'll re-run CI on GitLab with trace logs again

https://girlboss.ceo/~strawberry/pb/qa2A

Nothing really sticks out when the HTTP/2 errors happen except for one of these:

May 27 20:03:15 TPYEXECL.local atticd[2867619]: 2024-05-28T00:03:15.893193Z DEBUG hyper::proto::h2: send body user stream error: error from user's HttpBody stream: Storage error: Does not understand the remote file reference
May 27 20:03:15 TPYEXECL.local atticd[2867619]:    0: tokio::task::runtime.spawn
May 27 20:03:15 TPYEXECL.local atticd[2867619]:            with kind=task task.name= task.id=1488 loc.file="/opt/attic/attic/attic/src/stream.rs" loc.line=85 loc.col=34
May 27 20:03:15 TPYEXECL.local atticd[2867619]:              at /opt/attic/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/util/trace.rs:17
May 27 20:03:15 TPYEXECL.local atticd[2867619]: 2024-05-28T00:03:15.893272Z DEBUG hyper::proto::h2::server: stream error: error from user's HttpBody stream: Storage error: Does not understand the remote file reference
May 27 20:03:15 TPYEXECL.local atticd[2867619]:    0: tokio::task::runtime.spawn
May 27 20:03:15 TPYEXECL.local atticd[2867619]:            with kind=task task.name= task.id=1488 loc.file="/opt/attic/attic/attic/src/stream.rs" loc.line=85 loc.col=34
May 27 20:03:15 TPYEXECL.local atticd[2867619]:              at /opt/attic/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/util/trace.rs:17

Now that is interesting... At any point, were you using the default local-file setup (stored on disk) before moving to an S3-based storage solution? That error message can only come from one of three places:

"Does not understand the remote file reference"

"Does not understand the remote file reference"

"Does not understand the remote file reference"

So it seems to me like it's trying to download a chunk from S3 but finding something that is not an S3 file reference (interesting that this only shows up in the debug logs, though...).

You could run something like the following to check whether this is what's happening: SELECT * FROM chunk WHERE remote_file_id NOT LIKE 's3:%'

If there are any results, then those will be causing this issue. You'll likely need to purge them from the database in some way.
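
A variant of the same check that summarizes how many chunks point at each backend prefix (PostgreSQL syntax; table and column names assumed from the query above):

SELECT substring(remote_file_id from '^[^:]+') AS backend, count(*)
FROM chunk
GROUP BY 1;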

Oh, yes, I was. I knew this would be a footgun at some point :< I really wish there were a proper way to migrate from local to S3; I couldn't find anything online for attic.

I was using local storage and really wanted to move to Garage, so I dumped the entire local storage into the S3 bucket (retaining the layout I saw when I did ls on the S3 bucket). Everything appeared to work fine, but I guess it didn't.

The fact that this isn't surfaced even as an error in the atticd logs is pretty annoying, but at least we know what the issue is now.

Is there a way I can migrate the "local" references in the database to S3? I still have all the data/chunks/etc, it's just in the S3 bucket and not local.

I looked at SELECT * FROM chunk WHERE remote_file_id NOT LIKE 'local:%'; and I think I can piece it together, but writing the SQL query to fix this will be a pain. Unless attic can repopulate this and I don't need to do any database magic?

Is there a way I can migrate the "local" references in the database to S3? I still have all the data/chunks/etc, it's just in the S3 bucket and not local.

Storage migration is tricky business and can be very case-specific (in this case you moved the files yourself while keeping the old references). It's something that can be added, but for now you could write a one-off script that changes all the remote_file and remote_file_id values in a loop, and it should work.
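
If you go that route, a sketch of the inspection step first (table and column names assumed from the queries above): compare a row still carrying the old reference against a chunk that was pushed after the switch to S3, so you know exactly what remote_file and remote_file_id need to become, and derive the UPDATE from that.

-- rows still carrying the old local reference
SELECT remote_file_id, remote_file FROM chunk WHERE remote_file_id LIKE 'local:%' LIMIT 5;
-- a known-good S3 reference to use as the template for the rewrite
SELECT remote_file_id, remote_file FROM chunk WHERE remote_file_id LIKE 's3:%' LIMIT 5;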

Perhaps these HTTP/2 errors are actually just database errors not being shown to the user at the end of the day, considering oluceps said the HTTP/2 errors disappeared after recreating the database and there's nothing actually wrong with the reverse proxy...

Weird. It seems fixed after I recreated the psql db.

for now you could write a one-off script that changes all the remote_file and remote_file_id values in a loop, and it should work.

This has worked! (and I am very impressed it did considering what I did lol)

copying path '/nix/store/6hpjqjglnfq6n9bgharvm52gs3gcjrq2-source' from 'https://attic.kennel.juneis.dog/conduwuit'...
copying path '/nix/store/k34c74fqb2a957c854gnhr5xfajsnxsh-source' from 'https://attic.kennel.juneis.dog/conduwuit'...
copying path '/nix/store/b2qciqlj1nfzknnxjm7dlw9jjqf9k974-configureCargoCommonVarsHook' from 'https://attic.kennel.juneis.dog/conduit'...
copying path '/nix/store/mvzsygcwg6512djvv6zfcwb2m2a8d0kg-configureCargoVendoredDepsHook' from 'https://attic.kennel.juneis.dog/conduit'...
copying path '/nix/store/j3bqgzdla9kgpgczv6hpkfyy0230z74k-compiler-rt-libc-static-x86_64-unknown-linux-musl-17.0.6-dev' from 'https://attic.kennel.juneis.dog/conduwuit'...
copying path '/nix/store/zzm37bc4pakds0yybq1xqv7pgvasgmcn-installCargoArtifactsHook' from 'https://attic.kennel.juneis.dog/conduit'...
copying path '/nix/store/amv3ym8zg6kh7lwilzsp7lcypgckswkx-attic-0.1.0' from 'https://attic.kennel.juneis.dog/conduwuit'...

No more HTTP/2 errors and no more hyper errors in atticd! Since I appear to be the first user ever to want to migrate from local to S3 for attic, I'll probably write a quick post somewhere about what I did.

I was also able to revert my reverse proxy changes back to the typical Caddy defaults (HTTP/3, HTTP/2, HTTP/1.1) for atticd, Garage, and the reverse_proxy http transport, and there are still no issues. So these HTTP/2 stream errors are definitely red herrings for a deeper issue, in this case a database issue.

Wrote what I did here: https://girlboss.ceo/~strawberry/nix_attic_s3_migration-28-may-2024.txt

No, this is not a good or clean way to do it, but for anyone who really wants to migrate without deleting everything, this is what I did.

Thanks for the write-up, I'm sure it'll be very helpful to anyone who runs into the same issue!

To "resolve" this issue, I'm gonna look into seeing if there's a way to surface that error instead of burying it in hyper's debug logging... If you don't hear anything before Friday, ping me and I'll remind myself to look at it next week.

I opened #137, which should make this kind of issue more obvious in the future. I briefly looked for anywhere else we stream data back to the user, but this looks like the only place.

Before:

$ cargo r --bin atticd
Attic Server 0.1.0 (debug)
Running migrations...
Starting API server...
Listening on [::]:8080...

$ nix-store -r ............
error: unable to download 'https://...:8081/test-test/nar/i53fqfzxc4cz0251r6crz9hzv89rj05h.nar': HTTP error 200 (curl error: Stream error in the HTTP/2 framing layer)
error: build of '/nix/store/i53fqfzxc4cz0251r6crz9hzv89rj05h-test' failed

After:

$ cargo r --bin atticd
Attic Server 0.1.0 (debug)
Running migrations...
Starting API server...
Listening on [::]:8080...
2024-05-28T18:38:26.580550Z ERROR attic_server::api::binary_cache: Stream error: Storage error: Does not understand the remote file reference
   0: tokio::task::runtime.spawn
           with kind=task task.name= task.id=63 loc.file="/home/vin/workspace/vcs/attic/attic/src/stream.rs" loc.line=85 loc.col=34
             at /home/vin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/util/trace.rs:17
   1: tower_http::trace::make_span::request
           with method=GET uri=http://scadrial.tailb203c.ts.net:8081/test-test/nar/i53fqfzxc4cz0251r6crz9hzv89rj05h.nar version=HTTP/2.0
             at /home/vin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-http-0.4.4/src/trace/make_span.rs:109
   2: tokio::task::runtime.spawn
           with kind=task task.name= task.id=55 loc.file="/home/vin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-0.14.27/src/common/exec.rs" loc.line=49 loc.col=21
             at /home/vin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/util/trace.rs:17
   3: tokio::task::runtime.spawn
           with kind=task task.name= task.id=48 loc.file="/home/vin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-0.14.27/src/common/exec.rs" loc.line=49 loc.col=21
             at /home/vin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/util/trace.rs:17

$ nix-store -r ............
error: unable to download 'https://...:8081/test-test/nar/i53fqfzxc4cz0251r6crz9hzv89rj05h.nar': HTTP error 200 (curl error: Stream error in the HTTP/2 framing layer)
error: build of '/nix/store/i53fqfzxc4cz0251r6crz9hzv89rj05h-test' failed