Timeout error does not show which fallback was used
laskoviymishka opened this issue
Is your feature request related to a problem? Please describe.
Same as here: jackc/pgconn#139
Given: a cluster with both IPv4 and IPv6 addresses.
When there is no network connectivity, the error should show exactly which address we failed to connect to.
Describe the solution you'd like
Show the IP address that failed as part of the error message:
failed to connect to `host=chinook.chiyepnb01it.eu-west-2.rds.amazonaws.com user=tutorial database=chinook`: dial error (timeout: dial tcp [2a05:d01c:d05:6500:3318:e752:f216:3d9b]:5432: i/o timeout)
Describe alternatives you've considered
Leave it as is
failed to connect to `host=chinook.chiyepnb01it.eu-west-2.rds.amazonaws.com user=tutorial database=chinook`: dial error (timeout: context Deadline exceeded)
Additional context
This error was also fixed in the v4 driver: jackc/pgconn#140
I think this is fundamentally the same issue as #1929. And I think it should be solved by the same approach. Connect
should return some sort of multi-error that includes the errors from all the connection attempts.
This error is an exception, so solving it the same way as #1929 is incorrect: we try hosts sequentially, so once the first attempt times out, all remaining hosts automatically receive the same errTimeout.
At the same time, a parallel connection checker would break the dual-stack logic.
I tried to implement parallel connect. With that approach we would pick the IPv4 address and start working with it, so in theory it's okay, but psql
does not work like this. So I would say the current approach with a sequential check is right, but the error message can be improved.
Nah, multiple errors would be as good as no error: you would see two hosts that errored when in fact only one is unreachable.
Maybe it's a good idea to stop iterating not after any errTimeout, but only after an errTimeout caused by context deadline exceeded. I will try to debug such an implementation.
FWIW, multi errors are implemented in 8db9716.
This will produce errors like:
failed to connect to `user=postgres database=pgx_test`:
lookup foo.invalid: no such host
[::1]:1 (localhost): dial error: dial tcp [::1]:1: connect: connection refused
127.0.0.1:1 (localhost): dial error: dial tcp 127.0.0.1:1: connect: connection refused
127.0.0.1:2 (127.0.0.1): dial error: dial tcp 127.0.0.1:2: connect: connection refused
This looks a bit wrong. From what I see in the commit, the problem case is when IPv4 connectivity is fine but IPv6 is blocked.
We should return the IPv6 connection error, but not an IPv4 one.
If ipv4 worked then no error would be returned. The error list is only returned if no attempt succeeds.
Then this violates the original IP-version priority behavior. For example, the psql CLI (and every driver I know of) will ignore IPv4 if IPv6 is available on the host, so connecting over IPv4 here would be strange behavior.
This is matching the behavior of libpq. See https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-MULTIPLE-HOSTS.
In either format, a single host name can translate to multiple network addresses. A common example of this is a host that has both an IPv4 and an IPv6 address.
When multiple hosts are specified, or when a single host name is translated to multiple addresses, all the hosts and addresses will be tried in order, until one succeeds. If none of the hosts can be reached, the connection fails. If a connection is established successfully, but authentication fails, the remaining hosts in the list are not tried.
This is exactly what pgx is trying to match.
Furthermore, there is no change in connection behavior. The only change 8db9716 made was to report the results of all attempts when none of them succeeded.
Then it looks like this closes the original issue.
Environment:
host chinook.cygmty5fevxe.eu-west-1.rds.amazonaws.com
chinook.cygmty5fevxe.eu-west-1.rds.amazonaws.com has address 10.0.101.236
chinook.cygmty5fevxe.eu-west-1.rds.amazonaws.com has IPv6 address 2a05:d018:471:2f50:9396:217d:8dcc:c73
ubuntu@ip-10-0-1-121:~$ telnet chinook.cygmty5fevxe.eu-west-1.rds.amazonaws.com 5432
Trying 2a05:d018:471:2f50:9396:217d:8dcc:c73...
^C
ubuntu@ip-10-0-1-121:~$ telnet 10.0.101.236 5432
Trying 10.0.101.236...
Connected to 10.0.101.236.
Escape character is '^]'.
With an incorrect database name, the error is not network-related:
ubuntu@ip-10-0-1-121:~$ DATABASE_URL="host=chinook.cygmty5fevxe.eu-west-1.rds.amazonaws.com user=cdcdb_admin password=Password connect_timeout=2" && ./todo
Unable to connection to database: failed to connect to `user=cdcdb_admin database=`:
[2a05:d018:471:2f50:9396:217d:8dcc:c73]:5432 (chinook.cygmty5fevxe.eu-west-1.rds.amazonaws.com): dial error: timeout: context deadline exceeded
10.0.101.236:5432 (chinook.cygmty5fevxe.eu-west-1.rds.amazonaws.com): server error: FATAL: database "cdcdb_admin" does not exist (SQLSTATE 3D000)
With correct DB URL:
ubuntu@ip-10-0-1-121:~$ DATABASE_URL="host=chinook.cygmty5fevxe.eu-west-1.rds.amazonaws.com user=cdcdb_admin password=Password database=chinook connect_timeout=2" && ./todo
Todo pgx demo
Usage:
todo list
todo add task
todo update task_num item
todo remove task_num
Example:
todo add 'Learn Go'
todo list
@jackc will you backport this to the old v4 driver? (Some people still use it, for example me :D)
Personally, I don't plan on porting back to v4. I'm not sure how far that part of the code has diverged between v4 and v5 and I'm typically only doing bug fixes on v4. But if someone else wants to do the work I don't mind merging it.
Okay, I'll try to backport it, and if it goes well I will open a PR.