Timeout error not show which fallback was used

Question

Timeout error not show which fallback was used

laskoviymishka opened this issue 7 months ago · comments

Is your feature request related to a problem? Please describe.

Same as here: jackc/pgconn#139
Given: Cluster with ipv4 and ipv6 address.
when there is no network connection we should see what exact address we failed to connect

Describe the solution you'd like

Show IP-address that failed as part of error message.

failed to connect to `host=chinook.chiyepnb01it.eu-west-2.rds.amazonaws.com user=tutorial database=chinook`: dial error (timeout: dial tcp [2a05:d01c:d05:6500:3318:e752:f216:3d9b]:5432: i/o timeout)

Describe alternatives you've considered

Leave it as is

failed to connect to `host=chinook.chiyepnb01it.eu-west-2.rds.amazonaws.com user=tutorial database=chinook`: dial error (timeout: context Deadline exceeded)

Additional context

This error also fixed at v4 driver - jackc/pgconn#140

Jack Christensen · Answer 1 · Sun Mar 17 2024 05:25:09 GMT+0800 (China Standard Time)

I think this is fundamentally the same issue as #1929. And I think it should be solved by the same approach. Connect should return some sort of multi-error that includes the errors from all the connection attempts.

Andrei Tserakhau · Answer 2 · Sun Mar 17 2024 05:31:19 GMT+0800 (China Standard Time)

This error is an exception, so solving same way as #1929 - is incorrect, since we try host sequential, once we reach first timeout all other host will receive same errTimeout automatically.

Andrei Tserakhau · Answer 3 · Sun Mar 17 2024 05:35:37 GMT+0800 (China Standard Time)

In the meantime using parallel conn checker would break dual-stack logic.

I tried to implement parallel connect. In this approach we will use the ipv4 address and start working with it, so in theory it's okay, but the psql does not work like this, so I would say that the current approach with a sequential check is right, but the error message can be improved.

Jack Christensen · Answer 4 · Sun Mar 17 2024 11:18:12 GMT+0800 (China Standard Time)

Yeah, I don't think a parallel connection attempt is a good idea. And I don't think the other PR that is proposed for #1929 is right either. But if the error message was a multi error that included all the attempts, I think that would solve this and #1929.

Andrei Tserakhau · Answer 5 · Sun Mar 17 2024 15:44:44 GMT+0800 (China Standard Time)

Nah, multiple errors would be equal to no error, you will see 2 hosts that errored but in fact only one is not reachable.
Maybe it's a good idea to stop iterations after nor errTimeout but after errTimeout with context deadline exceeded. Will try to debug such impl.

Jack Christensen · Answer 6 · Sun May 12 2024 03:43:55 GMT+0800 (China Standard Time)

FWIW, multi errors are implemented in 8db9716.

This will produce errors like:

    failed to connect to `user=postgres database=pgx_test`:
            lookup foo.invalid: no such host
            [::1]:1 (localhost): dial error: dial tcp [::1]:1: connect: connection refused
            127.0.0.1:1 (localhost): dial error: dial tcp 127.0.0.1:1: connect: connection refused
            127.0.0.1:2 (127.0.0.1): dial error: dial tcp 127.0.0.1:2: connect: connection refused

Andrei Tserakhau · Answer 7 · Sun May 12 2024 03:49:34 GMT+0800 (China Standard Time)

This is a bit wrong, per what I see in commit, what happens when ipv4 connectivity is ok, but ipv6 is blocked is an issue.
We shall return ipv6 conn refused, but not ipv4.

Jack Christensen · Answer 8 · Sun May 12 2024 06:34:53 GMT+0800 (China Standard Time)

If ipv4 worked then no error would be returned. The error list is only returned if no attempt succeeds.

Andrei Tserakhau · Answer 9 · Sun May 12 2024 07:27:19 GMT+0800 (China Standard Time)

Then this violate original behavior of ip versions priority, for example psql cli (and any known to me driver) will ignore ipv4 if ipv6 available on host, so connecting here would be a strange behavior.

Jack Christensen · Answer 10 · Sun May 12 2024 08:15:33 GMT+0800 (China Standard Time)

This is matching the behavior of libpq. See https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-MULTIPLE-HOSTS.

In either format, a single host name can translate to multiple network addresses. A common example of this is a host that has both an IPv4 and an IPv6 address.

When multiple hosts are specified, or when a single host name is translated to multiple addresses, all the hosts and addresses will be tried in order, until one succeeds. If none of the hosts can be reached, the connection fails. If a connection is established successfully, but authentication fails, the remaining hosts in the list are not tried.

This is exactly what pgx is trying to match.

Furthermore, there is no change in connection behavior. The only change 8db9716 made was to report the results of all attempts when none of them succeeded.

Andrei Tserakhau · Answer 11 · Mon May 13 2024 05:25:27 GMT+0800 (China Standard Time)

Then looks like this closes the original issue.

Enviroment:

host chinook.cygmty5fevxe.eu-west-1.rds.amazonaws.com
chinook.cygmty5fevxe.eu-west-1.rds.amazonaws.com has address 10.0.101.236
chinook.cygmty5fevxe.eu-west-1.rds.amazonaws.com has IPv6 address 2a05:d018:471:2f50:9396:217d:8dcc:c73

ubuntu@ip-10-0-1-121:~$ telnet chinook.cygmty5fevxe.eu-west-1.rds.amazonaws.com 5432
Trying 2a05:d018:471:2f50:9396:217d:8dcc:c73...
^C
ubuntu@ip-10-0-1-121:~$ telnet 10.0.101.236 5432
Trying 10.0.101.236...
Connected to 10.0.101.236.
Escape character is '^]'.

With incorrect db-name error is not network related:

ubuntu@ip-10-0-1-121:~$  DATABASE_URL="host=chinook.cygmty5fevxe.eu-west-1.rds.amazonaws.com user=cdcdb_admin password=Password connect_timeout=2" && ./todo 

Unable to connection to database: failed to connect to `user=cdcdb_admin database=`:
        [2a05:d018:471:2f50:9396:217d:8dcc:c73]:5432 (chinook.cygmty5fevxe.eu-west-1.rds.amazonaws.com): dial error: timeout: context deadline exceeded
        10.0.101.236:5432 (chinook.cygmty5fevxe.eu-west-1.rds.amazonaws.com): server error: FATAL: database "cdcdb_admin" does not exist (SQLSTATE 3D000)

With correct DB URL:

ubuntu@ip-10-0-1-121:~$ DATABASE_URL="host=chinook.cygmty5fevxe.eu-west-1.rds.amazonaws.com user=cdcdb_admin password=Password database=chinook connect_timeout=2" && ./todo 

Todo pgx demo

Usage:

  todo list
  todo add task
  todo update task_num item
  todo remove task_num

Example:

  todo add 'Learn Go'
  todo list

Andrei Tserakhau · Answer 12 · Mon May 13 2024 05:30:03 GMT+0800 (China Standard Time)

@jackc will you backport this to old v4 driver? (some people still use it, for example me :D)

Jack Christensen · Answer 13 · Wed May 15 2024 09:10:22 GMT+0800 (China Standard Time)

@laskoviymishka

Personally, I don't plan on porting back to v4. I'm not sure how far that part of the code has diverged between v4 and v5 and I'm typically only doing bug fixes on v4. But if someone else wants to do the work I don't mind merging it.

Andrei Tserakhau · Answer 14 · Wed May 15 2024 16:18:55 GMT+0800 (China Standard Time)

Okay, I'll try to backport and if it goes well - will open PR.