p-e-w / pingpong

End-to-end latency monitoring for Matrix

Second account always unable to join room

MacLemon opened this issue

Summary:

Second account always unable to join room created by first account.

Steps to Reproduce:

Have two known-working accounts for latency measuring ready with credentials.
Launch pingpong:

go/bin/pingpong --debug @fairydust.space-latency-bot:matrix.org:<PASSPHRASEREDACTED> @matrix.org-latency-bot:fairydust.space:<PASSPHRASEREDACTED>

Expected Results:

Both accounts should connect successfully, the first should create a room which is joined by the second.

Actual Results:

Both accounts log in correctly to their respective homeservers. The first creates a room, then immediately leaves.
The second account is always unable to join that newly created room.
When flipping the order of the accounts upon invocation, it is again the second one that is unable to join.

Log output when launched with --debug flag:

2021/01/21 18:19:28 [@matrix.org-latency-bot:fairydust.space] logged in
2021/01/21 18:19:29 [@fairydust.space-latency-bot:matrix.org] logged in
2021/01/21 18:19:29 [@matrix.org-latency-bot:fairydust.space] created room !GGkvXWSMTmoaAAahNb:fairydust.space
2021/01/21 18:19:29 [@matrix.org-latency-bot:fairydust.space] left room !GGkvXWSMTmoaAAahNb:fairydust.space
2021/01/21 18:19:30 [@fairydust.space-latency-bot:matrix.org] logged out
2021/01/21 18:19:30 [@matrix.org-latency-bot:fairydust.space] logged out
2021/01/21 18:19:30 [@fairydust.space-latency-bot:matrix.org] [FATAL] unable to join room !GGkvXWSMTmoaAAahNb:fairydust.space: failed to POST /_matrix/client/r0/join/!GGkvXWSMTmoaAAahNb:fairydust.space: M_UNKNOWN (HTTP 404): No known servers

When switching the order of accounts in the invocation:

2021/01/21 18:19:04 [@fairydust.space-latency-bot:matrix.org] logged in
2021/01/21 18:19:04 [@matrix.org-latency-bot:fairydust.space] logged in
2021/01/21 18:19:04 [@fairydust.space-latency-bot:matrix.org] created room !LKDqzxQsMfXZYFkynW:matrix.org
2021/01/21 18:19:05 [@fairydust.space-latency-bot:matrix.org] left room !LKDqzxQsMfXZYFkynW:matrix.org
2021/01/21 18:19:05 [@matrix.org-latency-bot:fairydust.space] logged out
2021/01/21 18:19:05 [@fairydust.space-latency-bot:matrix.org] logged out
2021/01/21 18:19:05 [@matrix.org-latency-bot:fairydust.space] [FATAL] unable to join room !LKDqzxQsMfXZYFkynW:matrix.org: failed to POST /_matrix/client/r0/join/!LKDqzxQsMfXZYFkynW:matrix.org: M_UNKNOWN (HTTP 404): No known servers

Regression:

n/a

Notes:

pingpong installed as suggested in the README by issuing:

go get github.com/p-e-w/pingpong

Checked DNS and all hostnames can be properly resolved on the host running pingpong. (Just because it's always DNS.)

Version Information:

Tested on FreeBSD 12.2-p2, with legacy IPv4 available only.

go version
go version go1.15.6 freebsd/amd64

Tested on macOS 10.13.6 HighSierra, with IP and legacy IP available.

go version go1.15.6 darwin/amd64

fairydust.space is running synapse 1.25.0.
matrix.org is running 1.26.0rc1 (b=matrix-org-hotfixes,bde75f5f6,dirty).

Thank you for the high-quality report! It's so nice (and rare) to get this much useful information without having to ask.


First, to clear up what is happening here:

The first creates a room, then immediately leaves.

For architectural reasons (the TUI, which uses the alternate screen, has to be closed before messages can be printed), fatal errors are currently reported from a deferred routine that is the last thing that runs before the program exits. This means they are printed as the last line of the debug output, regardless of when the underlying error occurs.

So the real sequence of events here is that the first client creates a room, the second client tries to join and fails, which is a fatal error, PingPong cleans up everything (leaving rooms, logging out), and only then is the error reported.
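
The ordering described above can be reproduced with a minimal Go sketch. This is not PingPong's actual code: the function name run and the log strings are made up for illustration, and the only thing it relies on is that Go runs deferred calls in last-in-first-out order.

```go
package main

import (
	"errors"
	"fmt"
)

// run is a hypothetical stand-in for the program's main flow. It returns
// the log lines in the order in which they would be printed.
func run() (lines []string) {
	var fatal error

	// Registered first, so it runs last: deferred calls execute in
	// last-in-first-out order. This plays the role of the error reporter.
	defer func() {
		if fatal != nil {
			lines = append(lines, "[FATAL] "+fatal.Error())
		}
	}()

	// Registered second, so it runs before the reporter above.
	// This plays the role of the cleanup (leaving rooms, logging out).
	defer func() {
		lines = append(lines, "left room", "logged out")
	}()

	lines = append(lines, "logged in", "created room")

	// The join fails here, before any cleanup has happened, but the
	// error is only recorded, not printed.
	fatal = errors.New("unable to join room")
	return
}

func main() {
	for _, line := range run() {
		fmt.Println(line)
	}
}
```

The error is recorded before the cleanup runs, yet its line comes out last, which matches the shape of the debug logs in the report.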

I apologize for this confusing behavior. I will look into how I can improve it.


Now, for the actual problem. PingPong delegates all interactions with homeservers to Mautrix. The following possibilities come to mind to explain what is happening in your case:

  1. There is a federation issue between matrix.org and fairydust.space.
  2. There is a bug in Mautrix.
  3. PingPong doesn't use Mautrix correctly.

I have tested PingPong with various combinations of accounts, including accounts on the same homeserver, accounts on separate homeservers, and accounts on homeservers running different software (Synapse and Dendrite). I have never encountered any problems, so my working hypothesis is that option (3) is unlikely.

To further debug this issue, it would help if you could do the following:

  • Use separate Element instances to connect to both accounts simultaneously, then try to start a direct chat, which is essentially what PingPong does. Try to exchange messages. If that works, option (1) can be excluded.
  • Do the same thing, but with gomuks instead of Element. Gomuks uses Mautrix as well, so if that also works, option (2) can be excluded.

Use separate Element instances to connect to both accounts simultaneously, then try to start a direct chat, which is essentially what PingPong does. Try to exchange messages. If that works, option (1) can be excluded.

I had checked that before posting this issue, sorry for not mentioning it.
We do get occasional reports of delayed message delivery (by up to several minutes) from matrix.org to fairydust.space, which is the reason we want to have some latency monitoring. The delays only affect messages originating from matrix.org and are also observable at other instances in the federation. Therefore I assume that there is no fundamental federation issue at play.

Maybe pingpong times out before the join completes?

Do the same thing, but with gomuks instead of Element. Gomuks uses Mautrix as well, so if that also works, option (2) can be excluded.

Just tried that but at the moment installing gomuks via the usual go get fails (on multiple platforms).

Probably not relevant to you, but for completeness:

$ go get github.com/tulir/gomuks
cannot find package "github.com/russross/blackfriday/v2" in any of:
        /usr/local/go/src/github.com/russross/blackfriday/v2 (from $GOROOT)
        /home/<username>/go/src/github.com/russross/blackfriday/v2 (from $GOPATH)

Trying go get github.com/russross/blackfriday works without a problem, just like installing pingpong didn't return any errors.

gomuks installation fails on

  • FreeBSD 12.2-p2
  • macOS 10.13.6
  • Arch Linux

I'll see if I can manually build gomuks, but that will take some time.
I managed to build gomuks manually, but it doesn't result in a working binary. Therefore I've filed gomuks/issues/262.

Joining rooms by room ID requires passing a server name to join through, otherwise your server doesn't know how to join the room (the server name at the end of room IDs is never used). For mautrix-go, the server name can be passed in the second parameter of JoinRoom, which seems to be empty in main.go#L111.
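
At the HTTP level, passing a server name means adding a server_name query parameter to the join request. The sketch below only builds that request path with the standard library; it does not call mautrix-go. The helpers homeserverOf and joinURL are hypothetical names, and taking the room creator's homeserver as the server to join through is an assumption made for this example, not necessarily what the actual fix does.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// homeserverOf returns the server part of a Matrix user ID such as
// "@alice:example.org". Everything after the first ':' is kept, so a
// server name with a port survives intact. (Hypothetical helper.)
func homeserverOf(userID string) string {
	i := strings.Index(userID, ":")
	if i < 0 {
		return ""
	}
	return userID[i+1:]
}

// joinURL builds the path of a client-server join request for roomID.
// If viaServer is non-empty, it is sent as the server_name query
// parameter, telling the joining user's homeserver which server to
// join through. (Hypothetical helper.)
func joinURL(roomID, viaServer string) string {
	path := "/_matrix/client/r0/join/" + url.PathEscape(roomID)
	if viaServer == "" {
		return path
	}
	q := url.Values{}
	q.Set("server_name", viaServer)
	return path + "?" + q.Encode()
}

func main() {
	// Values taken from the debug log in the report.
	creator := "@matrix.org-latency-bot:fairydust.space"
	roomID := "!GGkvXWSMTmoaAAahNb:fairydust.space"

	// Without a server name, the joining homeserver has nothing to go on.
	fmt.Println(joinURL(roomID, ""))
	// With the creator's homeserver, it knows where to send the join.
	fmt.Println(joinURL(roomID, homeserverOf(creator)))
}
```

The creator's homeserver is a reasonable choice here because the creator has just made the room and is still in it, whereas the server name embedded in the room ID carries no such guarantee.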

Joining empty rooms is not possible in Matrix, but it sounds like that wasn't actually what happened here.

@tulir Thanks for the insight. It does work with Dendrite, which is why I didn't notice this bug until now. I guess I never tested using a Dendrite account as the one that creates the room, and a Synapse account as the one that joins.

the server name at the end of room IDs is never used

That's highly unintuitive, and not at all clear from the spec. The spec makes it sound like server_name is an optional parameter for fine-control. Why wouldn't the homeserver use the name from the room ID if it is already there?

I've applied the fix suggested by @tulir, and it works for me now. Please test again. In particular, I am interested in whether there are any more issues on FreeBSD specifically, as that is an often-neglected platform, but ideally I would still like it to work.

The spec makes it sound like server_name is an optional parameter for fine-control.

When joining with room aliases or joining a room that you're certain your server is in already, the server name param isn't necessary.

Why wouldn't the homeserver use the name from the room ID if it is already there?

The name is there only for uniqueness to prevent room ID conflicts. It doesn't mean the server is still in the room. In the future, room IDs will become hashes or public keys and the server name won't be there anymore.

I'm pretty certain FreeBSD is not to be blamed here, since the problem occurred on multiple platforms in the same manner.

I updated and installed via git clone and go build as per tulir's recommendation. With the new patch it's also fixed on all three platforms I tested, in the same manner. Thanks for the fix, it pings and pongs fine now.