jackc / pgx

PostgreSQL driver and toolkit for Go

Timeout error does not show which fallback address was used

laskoviymishka opened this issue · comments

Is your feature request related to a problem? Please describe.

Same as jackc/pgconn#139.
Given: a cluster with both IPv4 and IPv6 addresses.
When there is no network connection, we should see exactly which address we failed to connect to.

Describe the solution you'd like

Show the IP address that failed as part of the error message.

failed to connect to `host=chinook.chiyepnb01it.eu-west-2.rds.amazonaws.com user=tutorial database=chinook`: dial error (timeout: dial tcp [2a05:d01c:d05:6500:3318:e752:f216:3d9b]:5432: i/o timeout)

Describe alternatives you've considered

Leave it as is

failed to connect to `host=chinook.chiyepnb01it.eu-west-2.rds.amazonaws.com user=tutorial database=chinook`: dial error (timeout: context Deadline exceeded)

Additional context

This error was also fixed in the v4 driver: jackc/pgconn#140

I think this is fundamentally the same issue as #1929. And I think it should be solved by the same approach. Connect should return some sort of multi-error that includes the errors from all the connection attempts.

This error is an exception, so solving it the same way as #1929 is incorrect: since we try hosts sequentially, once we reach the first timeout all other hosts will automatically receive the same errTimeout.

Meanwhile, using a parallel connection checker would break dual-stack logic.
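To illustrate the sequential-timeout concern, here is a minimal sketch using only the standard library. It is not pgx code: `fakeDial`, `trySharedDeadline`, `tryPerAttemptDeadline` and the latencies are hypothetical, and it only assumes that a single context deadline is shared by all attempts (first case) versus each attempt getting its own deadline (second case).

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// fakeDial is a hypothetical stand-in for a network dial that needs
// `latency` to complete. It is not part of pgx.
func fakeDial(ctx context.Context, latency time.Duration) error {
	// An already-expired context fails immediately, before any waiting.
	if err := ctx.Err(); err != nil {
		return err
	}
	timer := time.NewTimer(latency)
	defer timer.Stop()
	select {
	case <-timer.C:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func describe(err error) string {
	if err == nil {
		return "ok"
	}
	return err.Error()
}

// trySharedDeadline dials every address in order under ONE deadline.
func trySharedDeadline(total time.Duration, latencies []time.Duration) []string {
	ctx, cancel := context.WithTimeout(context.Background(), total)
	defer cancel()
	out := make([]string, 0, len(latencies))
	for _, l := range latencies {
		out = append(out, describe(fakeDial(ctx, l)))
	}
	return out
}

// tryPerAttemptDeadline gives each address its own fresh deadline.
func tryPerAttemptDeadline(each time.Duration, latencies []time.Duration) []string {
	out := make([]string, 0, len(latencies))
	for _, l := range latencies {
		ctx, cancel := context.WithTimeout(context.Background(), each)
		out = append(out, describe(fakeDial(ctx, l)))
		cancel()
	}
	return out
}

// demoLatencies: the first address hangs (think blocked IPv6),
// the second would answer quickly (think reachable IPv4).
func demoLatencies() []time.Duration {
	return []time.Duration{time.Hour, time.Millisecond}
}

func sharedDemo() []string     { return trySharedDeadline(50*time.Millisecond, demoLatencies()) }
func perAttemptDemo() []string { return tryPerAttemptDeadline(50*time.Millisecond, demoLatencies()) }

func main() {
	fmt.Println("shared deadline:     ", sharedDemo())
	fmt.Println("per-attempt deadline:", perAttemptDemo())
}
```

With a shared deadline the second address never gets a real chance, which is the situation described above. With a per-attempt deadline the hanging address times out on its own and the next one is still tried.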

I tried to implement parallel connect. With that approach we would pick up the IPv4 address and start working with it, so in theory it's okay, but psql does not work like this. So I would say that the current approach with a sequential check is right, but the error message can be improved.

Yeah, I don't think a parallel connection attempt is a good idea. And I don't think the other PR that is proposed for #1929 is right either. But if the error message was a multi error that included all the attempts, I think that would solve this and #1929.

Nah, multiple errors would be equal to no error: you will see 2 hosts that errored, but in fact only one is unreachable.
Maybe it's a good idea to stop iterating not after any errTimeout, but only after an errTimeout caused by context deadline exceeded. I will try to debug such an implementation.

FWIW, multi errors are implemented in 8db9716.

This will produce errors like:

    failed to connect to `user=postgres database=pgx_test`:
            lookup foo.invalid: no such host
            [::1]:1 (localhost): dial error: dial tcp [::1]:1: connect: connection refused
            127.0.0.1:1 (localhost): dial error: dial tcp 127.0.0.1:1: connect: connection refused
            127.0.0.1:2 (127.0.0.1): dial error: dial tcp 127.0.0.1:2: connect: connection refused
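For readers wondering how an aggregated error of this shape can be built, here is a minimal sketch. The types `attemptError` and `connectError`, the sentinel errors and the sample addresses are all hypothetical and are not the types used in 8db9716; the sketch only relies on the standard library's support for `Unwrap() []error`, which `errors.Is` and `errors.As` understand since Go 1.20.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// attemptError records one failed connection attempt.
// Hypothetical type, not pgx's actual error type.
type attemptError struct {
	addr string // resolved address that was dialed
	host string // host name as configured
	err  error  // underlying failure
}

func (e *attemptError) Error() string {
	return fmt.Sprintf("%s (%s): %v", e.addr, e.host, e.err)
}

func (e *attemptError) Unwrap() error { return e.err }

// connectError aggregates every failed attempt of one connect call.
type connectError struct {
	target   string
	attempts []*attemptError
}

func (e *connectError) Error() string {
	var b strings.Builder
	fmt.Fprintf(&b, "failed to connect to `%s`:", e.target)
	for _, a := range e.attempts {
		b.WriteString("\n\t")
		b.WriteString(a.Error())
	}
	return b.String()
}

// Unwrap exposes all attempts so errors.Is / errors.As (Go 1.20+)
// can inspect each of them.
func (e *connectError) Unwrap() []error {
	errs := make([]error, len(e.attempts))
	for i, a := range e.attempts {
		errs[i] = a
	}
	return errs
}

// Illustrative sentinel errors, defined only for this sketch.
var (
	errDialTimeout = errors.New("i/o timeout")
	errRefused     = errors.New("connection refused")
)

func sampleError() error {
	return &connectError{
		target: "user=postgres database=pgx_test",
		attempts: []*attemptError{
			{addr: "[::1]:5432", host: "localhost", err: errDialTimeout},
			{addr: "127.0.0.1:5432", host: "localhost", err: errRefused},
		},
	}
}

func hasRefused(err error) bool { return errors.Is(err, errRefused) }

func main() {
	err := sampleError()
	fmt.Println(err)
	fmt.Println("any attempt refused:", hasRefused(err))
}
```

Keeping one entry per address means the caller can see which address timed out and which was refused, instead of a single error for the whole host.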

This is a bit wrong. From what I see in the commit, the case where IPv4 connectivity is OK but IPv6 is blocked is an issue.
We should return the IPv6 connection refused error, but not the IPv4 one.

If ipv4 worked then no error would be returned. The error list is only returned if no attempt succeeds.

Then this violates the original behavior of IP version priority. For example, the psql CLI (and any driver known to me) will ignore IPv4 if IPv6 is available on the host, so connecting here would be strange behavior.

This is matching the behavior of libpq. See https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-MULTIPLE-HOSTS.

In either format, a single host name can translate to multiple network addresses. A common example of this is a host that has both an IPv4 and an IPv6 address.

When multiple hosts are specified, or when a single host name is translated to multiple addresses, all the hosts and addresses will be tried in order, until one succeeds. If none of the hosts can be reached, the connection fails. If a connection is established successfully, but authentication fails, the remaining hosts in the list are not tried.

This is exactly what pgx is trying to match.


Furthermore, there is no change in connection behavior. The only change 8db9716 made was to report the results of all attempts when none of them succeeded.
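The libpq rule quoted above (try every address in order, stop at the first success, and do not try the remaining addresses after an authentication failure) can be sketched as follows. This is an illustration under stated assumptions, not pgx's connect loop: `outcome`, `connectInOrder`, `demo` and the documentation-range addresses are all hypothetical.

```go
package main

import "fmt"

// outcome is a hypothetical classification of one connection attempt.
type outcome int

const (
	connected   outcome = iota // connection established and authenticated
	unreachable                // network-level failure (refused, timeout, ...)
	authFailed                 // connection established, authentication failed
)

// connectInOrder tries addrs sequentially. It returns the address that
// connected ("" if none did) and the addresses that were actually tried.
func connectInOrder(addrs []string, attempt func(addr string) outcome) (string, []string) {
	var tried []string
	for _, addr := range addrs {
		tried = append(tried, addr)
		switch attempt(addr) {
		case connected:
			return addr, tried // first success wins
		case authFailed:
			return "", tried // remaining addresses are not tried
		}
		// unreachable: fall through to the next address
	}
	return "", tried
}

// demo resolves one host name to an IPv6 and an IPv4 address. The IPv6
// attempt ends with `first`; the IPv4 attempt always connects.
func demo(first outcome) (string, int) {
	addrs := []string{"[2001:db8::1]:5432", "192.0.2.10:5432"}
	winner, tried := connectInOrder(addrs, func(addr string) outcome {
		if addr == addrs[0] {
			return first
		}
		return connected
	})
	return winner, len(tried)
}

func main() {
	fmt.Println(demo(unreachable)) // IPv6 unreachable: falls back to IPv4
	fmt.Println(demo(authFailed))  // auth failure: stops after the first address
}
```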

Then it looks like this closes the original issue.

Environment:

host chinook.cygmty5fevxe.eu-west-1.rds.amazonaws.com
chinook.cygmty5fevxe.eu-west-1.rds.amazonaws.com has address 10.0.101.236
chinook.cygmty5fevxe.eu-west-1.rds.amazonaws.com has IPv6 address 2a05:d018:471:2f50:9396:217d:8dcc:c73

ubuntu@ip-10-0-1-121:~$ telnet chinook.cygmty5fevxe.eu-west-1.rds.amazonaws.com 5432
Trying 2a05:d018:471:2f50:9396:217d:8dcc:c73...
^C
ubuntu@ip-10-0-1-121:~$ telnet 10.0.101.236 5432
Trying 10.0.101.236...
Connected to 10.0.101.236.
Escape character is '^]'.

With an incorrect database name, the error is not network related:

ubuntu@ip-10-0-1-121:~$  DATABASE_URL="host=chinook.cygmty5fevxe.eu-west-1.rds.amazonaws.com user=cdcdb_admin password=Password connect_timeout=2" && ./todo 

Unable to connection to database: failed to connect to `user=cdcdb_admin database=`:
        [2a05:d018:471:2f50:9396:217d:8dcc:c73]:5432 (chinook.cygmty5fevxe.eu-west-1.rds.amazonaws.com): dial error: timeout: context deadline exceeded
        10.0.101.236:5432 (chinook.cygmty5fevxe.eu-west-1.rds.amazonaws.com): server error: FATAL: database "cdcdb_admin" does not exist (SQLSTATE 3D000)

With correct DB URL:

ubuntu@ip-10-0-1-121:~$ DATABASE_URL="host=chinook.cygmty5fevxe.eu-west-1.rds.amazonaws.com user=cdcdb_admin password=Password database=chinook connect_timeout=2" && ./todo 

Todo pgx demo

Usage:

  todo list
  todo add task
  todo update task_num item
  todo remove task_num

Example:

  todo add 'Learn Go'
  todo list

@jackc will you backport this to the old v4 driver? (Some people still use it, for example me :D)

@laskoviymishka

Personally, I don't plan on porting back to v4. I'm not sure how far that part of the code has diverged between v4 and v5 and I'm typically only doing bug fixes on v4. But if someone else wants to do the work I don't mind merging it.

Okay, I'll try to backport it and, if it goes well, I'll open a PR.