Stream error in h2 framing layer reported by nix
oluceps opened this issue · comments
I deployed attic with PostgreSQL and S3 (MinIO) behind a Caddy reverse proxy. Some of the paths download normally, but others do not.
With attic, Nix keeps reporting the error HTTP error 200 (curl error: Stream error in the HTTP/2 framing layer).
nix-store --store $PWD/nix-demo -r /nix/store/bl6bp58rs0iw42991h88q2b9rd3gd5sf-rust_niri-0.1.4
<snip>
warning: error: unable to download 'https://attic.nyaw.xyz/dev/nar/bl6bp58rs0iw42991h88q2b9rd3gd5sf.nar': HTTP error 200 (curl error: Stream error in the HTTP/2 framing layer); retrying in 272 ms
warning: error: unable to download 'https://attic.nyaw.xyz/dev/nar/bl6bp58rs0iw42991h88q2b9rd3gd5sf.nar': HTTP error 200 (curl error: Stream error in the HTTP/2 framing layer); retrying in 588 ms
warning: error: unable to download 'https://attic.nyaw.xyz/dev/nar/bl6bp58rs0iw42991h88q2b9rd3gd5sf.nar': HTTP error 200 (curl error: Stream error in the HTTP/2 framing layer); retrying in 1057 ms
warning: error: unable to download 'https://attic.nyaw.xyz/dev/nar/bl6bp58rs0iw42991h88q2b9rd3gd5sf.nar': HTTP error 200 (curl error: Stream error in the HTTP/2 framing layer); retrying in 2458 ms
error: unable to download 'https://attic.nyaw.xyz/dev/nar/bl6bp58rs0iw42991h88q2b9rd3gd5sf.nar': HTTP error 200 (curl error: Stream error in the HTTP/2 framing layer)
error: build of '/nix/store/bl6bp58rs0iw42991h88q2b9rd3gd5sf-rust_niri-0.1.4' failed
This occurs every time I try to download.
I tried setting http2 = false in my Nix config, and the error becomes 200 (curl error: Transferred a partial file), also 100% reproducible. Switching the reverse proxy from Caddy to nginx produces the same error as well.
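For reference, the http2 = false setting mentioned above goes in nix.conf (the path shown is an assumption based on the single-user install described below):

```
# ~/.config/nix/nix.conf
http2 = false
```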
info
caddy config: https://github.com/oluceps/nixos-config/blob/baa0ac6404a554d3ce4ab92e41a794b1bbc279cd/hosts/hastur/caddy.nix#L27
nginx config: https://github.com/oluceps/nixos-config/blob/0135fc3e596792085ea93cde1d57a32be2ac1798/srv/nginx.nix#L6
atticd config: https://github.com/oluceps/nixos-config/blob/0135fc3e596792085ea93cde1d57a32be2ac1798/srv/atticd.nix#L4
nix config(others placed in ~/.config/nix): https://github.com/oluceps/nixos-config/blob/0135fc3e596792085ea93cde1d57a32be2ac1798/misc.nix#L34
nix-info -m
- system: `"x86_64-linux"`
- host os: `Linux 6.8.1-cachyos, NixOS, 24.05 (Uakari), 24.05.20240329.d8fe5e6`
- multi-user?: `no`
- sandbox: `yes`
- version: `nix-env (Nix) 2.18.2`
- channels(root): `"nixos"`
- nixpkgs: `/nix/store/2h6rmgvakbz2mhy8l8img6cxxx200d08-cb1gs888vfqxawvc65q1dk6jzbayh3wz-source`
Weird. It seems fixed after I recreated the PostgreSQL database. The service had logged this error:
Apr 07 16:57:17 hastur systemd[1]: Started atticd.service.
░░ Subject: A start job for unit atticd.service has finished successfully
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ A start job for unit atticd.service has finished successfully.
░░
░░ The job identifier is 6966.
Apr 07 16:57:17 hastur atticd[3734]: Attic Server 0.1.0 (release)
Apr 07 16:57:17 hastur atticd[3734]: Running migrations...
Apr 07 16:57:17 hastur atticd[3734]: Starting API server...
Apr 07 16:57:17 hastur atticd[3734]: Listening on [::]:8083...
Apr 07 16:57:41 hastur atticd[3734]: thread 'main' panicked at /nix/store/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee-vendor-cargo-deps/c19b7c6f923b580ac259164a89f2577984ad5ab09ee9d583b888f934adbbe8d0/sqlx-postgres-0.7.3/src/message/parse.rs:31:13:
Apr 07 16:57:41 hastur atticd[3734]: assertion failed: self.param_types.len() <= (u16::MAX as usize)
Apr 07 16:57:41 hastur atticd[3734]: note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Apr 07 16:57:42 hastur systemd[1]: atticd.service: Main process exited, code=exited, status=101/n/a
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ An ExecStart= process belonging to unit atticd.service has exited.
░░
░░ The process' exit code is 'exited' and its exit status is 101.
Running into this issue too with my CI. The reverse proxy is Caddy; attic uses PostgreSQL via UNIX sockets and S3 via garage. Occasionally Nix will just stop using it as a substituter, since it randomly hits spurious H2 failures, which is pretty annoying. No errors in my attic logs when this happens.
error: unable to download 'https://attic.kennel.juneis.dog/conduwuit/nar/yx6f4mdc4050n19hfvc5hy8rcf8sm0qw.nar': HTTP error 200 (curl error: Stream error in the HTTP/2 framing layer)
error: substituter 'https://attic.kennel.juneis.dog/conduit' is disabled
error: substituter 'https://attic.kennel.juneis.dog/conduit' is disabled
error: substituter 'https://attic.kennel.juneis.dog/conduwuit' is disabled
error: substituter 'https://attic.kennel.juneis.dog/conduwuit' is disabled
error: substituter 'https://attic.kennel.juneis.dog/conduit' is disabled
error: substituter 'https://attic.kennel.juneis.dog/conduit' is disabled
error: substituter 'https://attic.kennel.juneis.dog/conduit' is disabled
I was able to reproduce the HTTP/2 errors with the following reverse proxy config (note that TLS is required to reproduce curl error: Stream error in the HTTP/2 framing layer; without TLS, I was instead seeing curl error: Transferred a partial file):
# Caddyfile
# attic
:8081 {
    tls scadrial.tailb203c.ts.net.crt scadrial.tailb203c.ts.net.key
    reverse_proxy h2c://[::]:8080 {
        transport http {
            versions h2c 2
        }
    }
}
# minio s3
:9998 {
    tls scadrial.tailb203c.ts.net.crt scadrial.tailb203c.ts.net.key
    reverse_proxy http://127.0.0.1:9999 {
        transport http {
            versions h2c 2
        }
    }
}
but not with this one:
# Caddyfile
# attic
:8081 {
    tls scadrial.tailb203c.ts.net.crt scadrial.tailb203c.ts.net.key
    reverse_proxy h2c://[::]:8080 {
        transport http {
            versions h2c 2
        }
    }
}
# minio s3
:9998 {
    tls scadrial.tailb203c.ts.net.crt scadrial.tailb203c.ts.net.key
    reverse_proxy http://127.0.0.1:9999 {
    }
}
I don't know if this is exactly the same issue you are seeing, though. It may be worth modifying your atticd command's environment to include RUST_LOG=hyper::proto::h2=trace, which will show you logs like this when it fails:
2024-05-26T22:23:43.348934Z DEBUG hyper::proto::h2::server: stream error: error from user's HttpBody stream: Storage error: service error
0: tokio::task::runtime.spawn
with kind=task task.name= task.id=56 loc.file="/home/vin/workspace/vcs/attic/attic/src/stream.rs" loc.line=85 loc.col=34
at /home/vin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/util/trace.rs:17
1: tower_http::trace::make_span::request
with method=GET uri=http://scadrial.tailb203c.ts.net:8081/test-test/nar/7g9f1rh7ab38q81llw9h8l4c56g00405.nar version=HTTP/2.0
at /home/vin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-http-0.4.4/src/trace/make_span.rs:109
2: tokio::task::runtime.spawn
with kind=task task.name= task.id=52 loc.file="/home/vin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-0.14.27/src/common/exec.rs" loc.line=49 loc.col=21
at /home/vin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/util/trace.rs:17
3: tokio::task::runtime.spawn
with kind=task task.name= task.id=45 loc.file="/home/vin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-0.14.27/src/common/exec.rs" loc.line=49 loc.col=21
at /home/vin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/util/trace.rs:17
and in the Caddy logs, I saw (among other errors):
2024/05/26 22:23:44.574 ERROR http.log.error unexpected EOF {"request": {"remote_ip": "100.109.66.101", "remote_port": "55646", "client_ip": "100.109.66.101", "proto": "HTTP/2.0", "method": "GET", "host": "scadrial.tailb203c.ts.net:9998", "uri": "/attic/249ee91f-9bc5-4606-8f08-10a17ed00f03.chunk?x-id=GetObject", "headers": {"Authorization": [], "X-Amz-Content-Sha256": ["e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"], "Amz-Sdk-Request": ["attempt=2; max=3"], "Amz-Sdk-Invocation-Id": ["7fb1131b-6231-4d1e-969f-3df25ddc4614"], "User-Agent": ["aws-sdk-rust/0.57.1 os/linux lang/rust/1.76.0"], "X-Amz-User-Agent": ["aws-sdk-rust/0.57.1 api/s3/0.35.0 os/linux lang/rust/1.76.0"], "X-Amz-Date": ["20240526T222344Z"]}, "tls": {"resumed": true, "version": 772, "cipher_suite": 4865, "proto": "h2", "server_name": "scadrial.tailb203c.ts.net"}}, "duration": 0.000733056, "status": 502, "err_id": "3bnwbtvet", "err_trace": "reverseproxy.statusError (reverseproxy.go:1267)"}
I think my conclusion is that your S3 backend is, for whatever reason, trying to use HTTP/2 between it and attic to download the chunks (before attic re-assembles the chunks into a NAR and streams that back in an HTTP/2 response). If you can try to force HTTP/1.1 between attic and your S3 backend, this might go away?
When I did some research, I found that S3 doesn't appear to actually support HTTP/2 directly? Or at least there are some issues throughout the internet that are not really clear either way. So this may be the issue -- that the communication between attic and S3 is trying to erroneously upgrade to HTTP/2 when it's unsupported? I can't say for sure, but maybe this is helpful in tracking down the issue.
Your Caddyfile seems to be giving me better results. I re-ran a CI job that pulls from attic and I don't see any failures now.
Unfortunately still getting HTTP/2 errors: https://gitlab.com/conduwuit/conduwuit/-/jobs/6952157628#L440
I'll re-read your comment and see if I missed something.
(I get a 404 when I try to click that, but I would be interested to know if you saw anything in the atticd logs while running it with RUST_LOG=hyper::proto::h2=trace)
I had the GitLab mirror set to private for some reason (how were other people able to star it in the past?).
I'll re-run CI on GitLab with trace logs again.
https://girlboss.ceo/~strawberry/pb/qa2A
Nothing really sticks out when the HTTP/2 errors happen except for one of these:
May 27 20:03:15 TPYEXECL.local atticd[2867619]: 2024-05-28T00:03:15.893193Z DEBUG hyper::proto::h2: send body user stream error: error from user's HttpBody stream: Storage error: Does not understand the remote file reference
May 27 20:03:15 TPYEXECL.local atticd[2867619]: 0: tokio::task::runtime.spawn
May 27 20:03:15 TPYEXECL.local atticd[2867619]: with kind=task task.name= task.id=1488 loc.file="/opt/attic/attic/attic/src/stream.rs" loc.line=85 loc.col=34
May 27 20:03:15 TPYEXECL.local atticd[2867619]: at /opt/attic/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/util/trace.rs:17
May 27 20:03:15 TPYEXECL.local atticd[2867619]: 2024-05-28T00:03:15.893272Z DEBUG hyper::proto::h2::server: stream error: error from user's HttpBody stream: Storage error: Does not understand the remote file reference
May 27 20:03:15 TPYEXECL.local atticd[2867619]: 0: tokio::task::runtime.spawn
May 27 20:03:15 TPYEXECL.local atticd[2867619]: with kind=task task.name= task.id=1488 loc.file="/opt/attic/attic/attic/src/stream.rs" loc.line=85 loc.col=34
May 27 20:03:15 TPYEXECL.local atticd[2867619]: at /opt/attic/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/util/trace.rs:17
Now that is interesting... At any point, were you using the default local-file setup (stored on disk) before moving to an S3-based storage solution? That error message can only come from one of three places:
attic/server/src/storage/s3.rs, line 122 (at 4dbdbee)
attic/server/src/storage/local.rs, line 173 (at 4dbdbee)
attic/server/src/storage/local.rs, line 202 (at 4dbdbee)
So it seems to me like it's trying to download a chunk from S3 but it's finding something that is not an S3 file reference (interesting that this is only in the debug logs though...).
You could run something like the following to find out whether this is what's happening: SELECT * FROM chunk WHERE remote_file_id NOT LIKE 's3:%'
If there are any results, then those will be causing this issue. You'll likely need to purge them from the database in some way.
Oh, yes I was. I knew this would be a footgun at some point :< I really wish there was a proper way to migrate from local to S3; I couldn't find anything online for attic.
I was using local storage and really wanted to move to garage, so I dumped the entire local storage into the S3 bucket (retaining the format I saw when I ran ls on the S3 bucket). Everything appeared to work fine, but I guess it didn't..
The fact that this isn't surfaced even as an error in the atticd logs is pretty annoying, but at least the issue is identified now.
Is there a way I can migrate the "local" references in the database to S3? I still have all the data/chunks/etc, it's just in the S3 bucket and not local.
I looked at SELECT * FROM chunk WHERE remote_file_id NOT LIKE 'local:%'; and I think I can piece it together, but writing the SQL query to fix this will be a pain. Unless attic can repopulate this and I don't need to do any database magic?
Is there a way I can migrate the "local" references in the database to S3? I still have all the data/chunks/etc, it's just in the S3 bucket and not local.
Storage migration is tricky business that can be very case-specific (in this case you moved the files yourself while keeping the old references). It's something that could be added, but for now you could write a one-off script that changes all remote_file and remote_file_id values in a loop, and it should work.
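A hypothetical sketch of that one-off fix: rewrite each chunk's reference id from the local backend's form to the S3 form. The 'local:' and 's3:' prefixes match the LIKE queries earlier in this thread; everything else (the column layout, and especially the serialized remote_file blob, which must be rewritten analogously) is an assumption to verify against a row that attic itself uploaded to S3. Back up the database before running anything like this.

```python
def migrate_ref(remote_file_id: str) -> str:
    """Map a 'local:<name>' reference id to 's3:<name>'; leave others untouched."""
    prefix = "local:"
    if remote_file_id.startswith(prefix):
        return "s3:" + remote_file_id[len(prefix):]
    return remote_file_id

# The surrounding loop would look roughly like this (database driver assumed,
# not run here; remote_file needs an equivalent rewrite once its stored
# format is confirmed):
#
#   cur.execute("SELECT id, remote_file_id FROM chunk")
#   for chunk_id, ref in cur.fetchall():
#       new_ref = migrate_ref(ref)
#       if new_ref != ref:
#           cur.execute("UPDATE chunk SET remote_file_id = %s WHERE id = %s",
#                       (new_ref, chunk_id))
```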
Perhaps these HTTP/2 errors are actually just database errors not being shown to the user, considering oluceps said the HTTP/2 errors disappeared after recreating the database and there's nothing actually wrong with the reverse proxy...
Weird. seems fixed after I recreated the psql db. With service error
for now you could make an one-off script that changes all remote_file and remote_file_id in a loop and it should work.
This has worked! (and I am very impressed it did considering what I did lol)
copying path '/nix/store/6hpjqjglnfq6n9bgharvm52gs3gcjrq2-source' from 'https://attic.kennel.juneis.dog/conduwuit'...
copying path '/nix/store/k34c74fqb2a957c854gnhr5xfajsnxsh-source' from 'https://attic.kennel.juneis.dog/conduwuit'...
copying path '/nix/store/b2qciqlj1nfzknnxjm7dlw9jjqf9k974-configureCargoCommonVarsHook' from 'https://attic.kennel.juneis.dog/conduit'...
copying path '/nix/store/mvzsygcwg6512djvv6zfcwb2m2a8d0kg-configureCargoVendoredDepsHook' from 'https://attic.kennel.juneis.dog/conduit'...
copying path '/nix/store/j3bqgzdla9kgpgczv6hpkfyy0230z74k-compiler-rt-libc-static-x86_64-unknown-linux-musl-17.0.6-dev' from 'https://attic.kennel.juneis.dog/conduwuit'...
copying path '/nix/store/zzm37bc4pakds0yybq1xqv7pgvasgmcn-installCargoArtifactsHook' from 'https://attic.kennel.juneis.dog/conduit'...
copying path '/nix/store/amv3ym8zg6kh7lwilzsp7lcypgckswkx-attic-0.1.0' from 'https://attic.kennel.juneis.dog/conduwuit'...
No more HTTP/2 errors and no more hyper errors in atticd! Since I appear to be the first attic user who wanted to migrate from local to S3, I'll probably write a quick post somewhere about what I did.
I was also able to revert my reverse proxy changes back to the typical Caddy defaults (HTTP/3, HTTP/2, HTTP/1.1) for atticd, garage, and the reverse_proxy http transport, and still no issues. So these HTTP/2 stream errors are definitely red herrings for a deeper issue, in this case a database issue.
Wrote what I did here: https://girlboss.ceo/~strawberry/nix_attic_s3_migration-28-may-2024.txt
No, this is not a good or clean way to do it, but for anyone who really wants to migrate without deleting everything, this is what I did.
Thanks for that write-up, I'm sure it'll be very helpful to anyone who runs into the same issue!
To "resolve" this issue, I'm going to look into whether there's a way to surface that error instead of burying it in hyper's debug logging... If you don't hear anything before Friday, ping me and I'll remind myself to look at it next week.
I opened #137 which should make this kind of issue more obvious in the future. I briefly looked for anywhere we were streaming data back to the user, but this looks like the only place.
Before:
$ cargo r --bin atticd
Attic Server 0.1.0 (debug)
Running migrations...
Starting API server...
Listening on [::]:8080...
$ nix-store -r ............
error: unable to download 'https://...:8081/test-test/nar/i53fqfzxc4cz0251r6crz9hzv89rj05h.nar': HTTP error 200 (curl error: Stream error in the HTTP/2 framing layer)
error: build of '/nix/store/i53fqfzxc4cz0251r6crz9hzv89rj05h-test' failed
After:
$ cargo r --bin atticd
Attic Server 0.1.0 (debug)
Running migrations...
Starting API server...
Listening on [::]:8080...
2024-05-28T18:38:26.580550Z ERROR attic_server::api::binary_cache: Stream error: Storage error: Does not understand the remote file reference
0: tokio::task::runtime.spawn
with kind=task task.name= task.id=63 loc.file="/home/vin/workspace/vcs/attic/attic/src/stream.rs" loc.line=85 loc.col=34
at /home/vin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/util/trace.rs:17
1: tower_http::trace::make_span::request
with method=GET uri=http://scadrial.tailb203c.ts.net:8081/test-test/nar/i53fqfzxc4cz0251r6crz9hzv89rj05h.nar version=HTTP/2.0
at /home/vin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-http-0.4.4/src/trace/make_span.rs:109
2: tokio::task::runtime.spawn
with kind=task task.name= task.id=55 loc.file="/home/vin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-0.14.27/src/common/exec.rs" loc.line=49 loc.col=21
at /home/vin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/util/trace.rs:17
3: tokio::task::runtime.spawn
with kind=task task.name= task.id=48 loc.file="/home/vin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-0.14.27/src/common/exec.rs" loc.line=49 loc.col=21
at /home/vin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/util/trace.rs:17
$ nix-store -r ............
error: unable to download 'https://...:8081/test-test/nar/i53fqfzxc4cz0251r6crz9hzv89rj05h.nar': HTTP error 200 (curl error: Stream error in the HTTP/2 framing layer)
error: build of '/nix/store/i53fqfzxc4cz0251r6crz9hzv89rj05h-test' failed