dimitri / pgloader

Migrate to PostgreSQL in a single command!

Home Page:http://pgloader.io

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

postgres -> postgres include/exclude logic not working correctly

pob8888 opened this issue · comments

  • pgloader --version

    3.6.3~devel and also 3.6.2079646 (compiled from src today)
    
  • did you test a fresh compile from the source tree?

yes

  • did you search for other similar issues?

yes - issue #1153 similar, but was not resolved/progressed

  • how can I reproduce the bug?
load database
  from pgsql://adminuser:redacted@dbsrc.abcd.eu-west-2.rds.amazonaws.com:5432/postgres
  into pgsql://adminuser:redacted@dbdst.abcd.eu-west-2.rds.amazonaws.com:5432/postgres
  with drop schema
  including only table names matching ~/.*/ in schema 'schema1'
  including only table names matching ~/.*/ in schema 'schema2'
    excluding table names matching ~/_staging/ in schema 'schema1'
--  excluding table names matching ~/_staging/ in schema 'schema2'
  cast type "character varying" to "character varying" keep typemod
;

  • pgloader output you obtain
2024-01-24T20:21:37.008000Z LOG pgloader version "3.6.3~devel"
2024-01-24T20:21:37.008000Z LOG Data errors in '/tmp/pgloader/'
2024-01-24T20:21:37.008000Z LOG Parsing commands from file #P"/home/pob/my.dms"
2024-01-24T20:21:37.196002Z LOG Migrating from #<PGSQL-CONNECTION pgsql://adminuser@dbsrc.abcd.eu-west-2.rds.amazonaws.com:5432/postgres {1007EE2E93}>
2024-01-24T20:21:37.196002Z LOG Migrating into #<PGSQL-CONNECTION pgsql://adminuser@dbdst.abcd.eu-west-2.rds.amazonaws.com:5432/postgres {1007EE46A3}>
2024-01-24T20:21:39.052005Z LOG report summary reset
               table name     errors       rows      bytes      total time
-------------------------  ---------  ---------  ---------  --------------
          fetch meta data          0          1                     0.336s
           Create Schemas          0          0                     0.164s
         Create SQL Types          0          0                     0.052s
            Create tables          0          2                     0.144s
           Set Table OIDs          0          1                     0.020s
-------------------------  ---------  ---------  ---------  --------------
"schema1"."films_staging"          0          0                     0.312s
-------------------------  ---------  ---------  ---------  --------------
  COPY Threads Completion          0          4                     0.316s
   Index Build Completion          0          0                     0.000s
          Reset Sequences          0          0                     0.260s
             Primary Keys          0          0                     0.000s
      Create Foreign Keys          0          0                     0.000s
          Create Triggers          0          0                     0.096s
         Install Comments          0          0                     0.000s
-------------------------  ---------  ---------  ---------  --------------
        Total import time          ✓          0                     0.672s

  • data that is being loaded, if relevant
\d schema*.*
                          Table "schema1.films_staging"
 Column  |         Type          | Collation | Nullable |         Default         
---------+-----------------------+-----------+----------+-------------------------
 title   | character varying(20) |           |          | NULL::character varying
 release | date                  |           |          | 
 awards  | character varying(20) |           |          | NULL::character varying

  • How the data is different from what you expected, if relevant

I am expecting it to NOT load the table schema1.films_staging
I am also expecting to see schema1.films_me, schema2.films, schema2.filmes_staging being copied

Instead, it is copying only schema1.films_staging - it is as though the exclude is inverted.

Ultimately, I am trying to achieve something like this config:

load database
  from pgsql://adminuser:redacted@dbsrc.abcd.eu-west-2.rds.amazonaws.com:5432/postgres
  into pgsql://adminuser:redacted@dbdst.abcd.eu-west-2.rds.amazonaws.com:5432/postgres
  with drop schema
  including only table names matching ~/.*/ in schema 'schema1'
  including only table names matching ~/.*/ in schema 'schema2'
    excluding table names matching ~/_staging/ in schema 'schema1'
    excluding table names matching ~/_staging/ in schema 'schema2'
  cast type "character varying" to "character varying" keep typemod
;

successfully copy all tables from schema1 and schema2, excluding any tables in schema1 or schema2 that match the string *_staging

Further explanation:

src database:

postgres=> \d schema*.*
                     Table "schema1.films_me"
 Column  |         Type          | Collation | Nullable | Default 
---------+-----------------------+-----------+----------+---------
 title   | character varying(20) |           |          | 
 release | date                  |           |          | 
 awards  | character varying(20) |           |          | 

                  Table "schema1.films_staging"
 Column  |         Type          | Collation | Nullable | Default 
---------+-----------------------+-----------+----------+---------
 title   | character varying(20) |           |          | 
 release | date                  |           |          | 
 awards  | character varying(20) |           |          | 

                      Table "schema2.films"
 Column  |         Type          | Collation | Nullable | Default 
---------+-----------------------+-----------+----------+---------
 title   | character varying(20) |           |          | 
 release | date                  |           |          | 
 awards  | character varying(20) |           |          | 

                  Table "schema2.films_staging"
 Column  |         Type          | Collation | Nullable | Default 
---------+-----------------------+-----------+----------+---------
 title   | character varying(20) |           |          | 
 release | date                  |           |          | 
 awards  | character varying(20) |           |          | 

postgres=> 

destination database before running pgloader:

postgres=> \d schema*.*
Did not find any relation named "schema*.*".
postgres=> 

run pgloader my.loader

pob@pob-laptop:~$ ./pgloader my.loader 
2024-01-24T20:25:26.008000Z LOG pgloader version "3.6.2079646"
2024-01-24T20:25:26.008000Z LOG Data errors in '/tmp/pgloader/'
2024-01-24T20:25:26.008000Z LOG Parsing commands from file #P"/home/pob/my.loader"
2024-01-24T20:25:26.208001Z LOG Migrating from #<PGSQL-CONNECTION pgsql://adminuser@dbsrc.abcd.eu-west-2.rds.amazonaws.com:5432/postgres {1007E2A7A3}>
2024-01-24T20:25:26.208001Z LOG Migrating into #<PGSQL-CONNECTION pgsql://adminuser@dbdst.abcd.eu-west-2.rds.amazonaws.com:5432/postgres {1007E2BFB3}>
2024-01-24T20:25:27.924004Z LOG report summary reset
               table name     errors       rows      bytes      total time
-------------------------  ---------  ---------  ---------  --------------
          fetch meta data          0          1                     0.412s
           Create Schemas          0          0                     0.080s
         Create SQL Types          0          0                     0.048s
            Create tables          0          2                     0.148s
           Set Table OIDs          0          1                     0.020s
-------------------------  ---------  ---------  ---------  --------------
"schema1"."films_staging"          0          0                     0.236s
-------------------------  ---------  ---------  ---------  --------------
  COPY Threads Completion          0          4                     0.244s
   Index Build Completion          0          0                     0.000s
          Reset Sequences          0          0                     0.292s
             Primary Keys          0          0                     0.000s
      Create Foreign Keys          0          0                     0.000s
          Create Triggers          0          0                     0.040s
         Install Comments          0          0                     0.000s
-------------------------  ---------  ---------  ---------  --------------
        Total import time          ✓          0                     0.576s
pob@pob-laptop:~$ 

destination database after running pgloader

postgres=> \d schema*.*
                          Table "schema1.films_staging"
 Column  |         Type          | Collation | Nullable |         Default         
---------+-----------------------+-----------+----------+-------------------------
 title   | character varying(20) |           |          | NULL::character varying
 release | date                  |           |          | 
 awards  | character varying(20) |           |          | NULL::character varying

Facing a similar issue. My set up is slightly different, but the symptom seems to be the same.

I'm running in a docker container, so I didn't do a fresh build locally. My Dockerfile is very basic:
FROM dimitri/pgloader:latest

I have a subset of tables I want to migrate from sqlserver (2016) -> postgres. The source and target DBs are hosted in docker containers as well.

This example only has a single table, as it reproduces the error consistently (my ms.load file can be seen in the log below). I'm setting a fetch limit to avoid hitting out of heap errors.

I run pgloader with -d and -v to see what information i could get:

# uname -a
Linux 01d1a0a03d1c 5.10.102.1-microsoft-standard-WSL2 #1 SMP Wed Mar 2 00:30:59 UTC 2022 x86_64 GNU/Linux
# pgloader --version
pgloader version "3.6.7~devel"
compiled with SBCL 2.1.1.debian

# pgloader -d -v ms.load
pgloader version 3.6.7~devel
compiled with SBCL 2.1.1.debian
sb-impl::*default-external-format* :UTF-8
tmpdir: #P"/tmp/pgloader/"
2024-01-24T23:30:27.000000Z NOTICE Starting pgloader, log system is ready.
2024-01-24T23:30:27.010000Z INFO Starting monitor
2024-01-24T23:30:27.010000Z LOG pgloader version "3.6.7~devel"
2024-01-24T23:30:27.020000Z INFO Parsed command:
load database
    from mssql://<user>@<host>/<src_db>
    into pgsql://<user>@<host>/<dst_db>

including only table names like 'address' in schema 'application'

WITH prefetch rows = 10000
;

2024-01-24T23:30:27.090000Z DEBUG CONNECTED TO #<PGLOADER.PGSQL:PGSQL-CONNECTION pgsql://<user>@<host>/<dst_db> {1007E429A3}>
2024-01-24T23:30:27.090000Z DEBUG SET client_encoding TO 'utf8'
2024-01-24T23:30:27.090000Z DEBUG SET application_name TO 'pgloader'
2024-01-24T23:30:27.100000Z LOG Migrating from #<MSSQL-CONNECTION mssql://<user>@<host>/<src_db> {1007E41EE3}>
2024-01-24T23:30:27.100000Z LOG Migrating into #<PGSQL-CONNECTION pgsql://<user>@<host>/<dst_db> {1007E429A3}>
Max connections reached, increase value of TDS_MAX_CONN
2024-01-24T23:30:27.120000Z SQL MSSQL: sending query: -- params: dbname
--         table-type-name
--         including
--         filter-list-to-where-clause including
--         excluding
--         filter-list-to-where-clause excluding
  select c.TABLE_SCHEMA,
         c.TABLE_NAME,
         c.COLUMN_NAME,
         c.DATA_TYPE,
         CASE
         WHEN c.COLUMN_DEFAULT LIKE '((%' AND c.COLUMN_DEFAULT LIKE '%))' THEN
             CASE
                 WHEN SUBSTRING(c.COLUMN_DEFAULT,3,len(c.COLUMN_DEFAULT)-4) = 'newid()' THEN 'GENERATE_UUID'
                 WHEN SUBSTRING(c.COLUMN_DEFAULT,3,len(c.COLUMN_DEFAULT)-4) LIKE 'convert(%varchar%,getdate(),%)' THEN 'today'
                 WHEN SUBSTRING(c.COLUMN_DEFAULT,3,len(c.COLUMN_DEFAULT)-4) = 'getdate()' THEN 'CURRENT_TIMESTAMP'
                 WHEN SUBSTRING(c.COLUMN_DEFAULT,3,len(c.COLUMN_DEFAULT)-4) LIKE '''%''' THEN SUBSTRING(c.COLUMN_DEFAULT,4,len(c.COLUMN_DEFAULT)-6)
                 ELSE SUBSTRING(c.COLUMN_DEFAULT,3,len(c.COLUMN_DEFAULT)-4)
             END
         WHEN c.COLUMN_DEFAULT LIKE '(%' AND c.COLUMN_DEFAULT LIKE '%)' THEN
             CASE
                 WHEN SUBSTRING(c.COLUMN_DEFAULT,2,len(c.COLUMN_DEFAULT)-2) = 'newid()' THEN 'GENERATE_UUID'
                 WHEN SUBSTRING(c.COLUMN_DEFAULT,2,len(c.COLUMN_DEFAULT)-2) LIKE 'convert(%varchar%,getdate(),%)' THEN 'today'
                 WHEN SUBSTRING(c.COLUMN_DEFAULT,2,len(c.COLUMN_DEFAULT)-2) = 'getdate()' THEN 'CURRENT_TIMESTAMP'
                 WHEN SUBSTRING(c.COLUMN_DEFAULT,2,len(c.COLUMN_DEFAULT)-2) LIKE '''%''' THEN SUBSTRING(c.COLUMN_DEFAULT,3,len(c.COLUMN_DEFAULT)-4)
                 ELSE SUBSTRING(c.COLUMN_DEFAULT,2,len(c.COLUMN_DEFAULT)-2)
             END
         ELSE c.COLUMN_DEFAULT
         END,
         c.IS_NULLABLE,
         COLUMNPROPERTY(object_id(c.TABLE_NAME), c.COLUMN_NAME, 'IsIdentity'),
         c.CHARACTER_MAXIMUM_LENGTH,
         c.NUMERIC_PRECISION,
         c.NUMERIC_PRECISION_RADIX,
         c.NUMERIC_SCALE,
         c.DATETIME_PRECISION,
         c.CHARACTER_SET_NAME,
         c.COLLATION_NAME

    from INFORMATION_SCHEMA.COLUMNS c
         join INFORMATION_SCHEMA.TABLES t
              on c.TABLE_SCHEMA = t.TABLE_SCHEMA
             and c.TABLE_NAME = t.TABLE_NAME

   where     c.TABLE_CATALOG = 'axelerator_hannover'
         and t.TABLE_TYPE = 'BASE TABLE'
         and ((c.table_schema = 'application' and c.table_name LIKE 'address'))
         

order by c.table_schema, c.table_name, c.ordinal_position;
2024-01-24T23:30:27.150000Z SQL MSSQL: sending query: -- params: including
--         filter-list-to-where-clause including
--         excluding
--         filter-list-to-where-clause excluding
    select schema_name(schema_id) as SchemaName,
           o.name as TableName,
           REPLACE(i.name, '.', '_') as IndexName,
           co.[name] as ColumnName,
           i.is_unique,
           i.is_primary_key,
           i.filter_definition

    from sys.indexes i
         join sys.objects o on i.object_id = o.object_id
         join sys.index_columns ic on ic.object_id = i.object_id
             and ic.index_id = i.index_id
         join sys.columns co on co.object_id = i.object_id
             and co.column_id = ic.column_id

   where schema_name(schema_id) not in ('dto', 'sys')
         and ((schema_name(schema_id) = 'application' and o.name LIKE 'address'))
         

order by SchemaName,
         o.[name],
         i.[name],
         ic.is_included_column,
         ic.key_ordinal;
2024-01-24T23:30:27.170000Z SQL MSSQL: sending query: -- params: dbname
--         including
--         filter-list-to-where-clause including
--         excluding
--         filter-list-to-where-clause excluding
   SELECT
           REPLACE(KCU1.CONSTRAINT_NAME, '.', '_') AS 'CONSTRAINT_NAME'
         , KCU1.TABLE_SCHEMA AS 'TABLE_SCHEMA'
         , KCU1.TABLE_NAME AS 'TABLE_NAME'
         , KCU1.COLUMN_NAME AS 'COLUMN_NAME'
         , KCU2.TABLE_SCHEMA AS 'UNIQUE_TABLE_SCHEMA'
         , KCU2.TABLE_NAME AS 'UNIQUE_TABLE_NAME'
         , KCU2.COLUMN_NAME AS 'UNIQUE_COLUMN_NAME'
         , RC.UPDATE_RULE AS 'UPDATE_RULE'
         , RC.DELETE_RULE AS 'DELETE_RULE'

    FROM INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS RC
         JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE KCU1
              ON KCU1.CONSTRAINT_CATALOG = RC.CONSTRAINT_CATALOG
                 AND KCU1.CONSTRAINT_SCHEMA = RC.CONSTRAINT_SCHEMA
                 AND KCU1.CONSTRAINT_NAME = RC.CONSTRAINT_NAME
         JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE KCU2
              ON KCU2.CONSTRAINT_CATALOG = RC.UNIQUE_CONSTRAINT_CATALOG
                 AND KCU2.CONSTRAINT_SCHEMA = RC.UNIQUE_CONSTRAINT_SCHEMA
                 AND KCU2.CONSTRAINT_NAME = RC.UNIQUE_CONSTRAINT_NAME

   WHERE KCU1.ORDINAL_POSITION = KCU2.ORDINAL_POSITION
         AND KCU1.TABLE_CATALOG = 'axelerator_hannover'
         AND KCU1.CONSTRAINT_CATALOG = 'axelerator_hannover'
         AND KCU1.CONSTRAINT_SCHEMA NOT IN ('dto', 'sys')
         AND KCU1.TABLE_SCHEMA NOT IN ('dto', 'sys')
         AND KCU2.TABLE_SCHEMA NOT IN ('dto', 'sys')

         and ((kcu1.table_schema = 'application' and kcu1.table_name LIKE 'address'))
         

ORDER BY KCU1.CONSTRAINT_NAME, KCU1.ORDINAL_POSITION;
2024-01-24T23:30:27.180000Z ERROR MSSQL ERROR: %dbsqlexec fail
2024-01-24T23:30:27.180000Z LOG You might need to review the FreeTDS protocol version in your freetds.conf file, see http://www.freetds.org/userguide/choosingtdsprotocol.htm
2024-01-24T23:30:27.180000Z LOG report summary reset
       table name     errors       read   imported      bytes      total time       read      write
-----------------  ---------  ---------  ---------  ---------  --------------  ---------  ---------
  fetch meta data          0          0          0                     0.000s    
-----------------  ---------  ---------  ---------  ---------  --------------  ---------  ---------
-----------------  ---------  ---------  ---------  ---------  --------------  ---------  ---------
2024-01-24T23:30:27.190000Z INFO Stopping monitor
#

I'm not an expert on sqlserver, but I believe there is an issue in the last query using kcu1 rather than KCU1 as the table alias. Running a simple query (just via SQL Server Management Studio):

select *
from INFORMATION_SCHEMA.KEY_COLUMN_USAGE KCU1
ORDER BY KCU1.CONSTRAINT_NAME
<returns rows>
select *
from INFORMATION_SCHEMA.KEY_COLUMN_USAGE KCU1
WHERE kcu1.table_name LIKE 'address'

Msg 4104, Level 16, State 1, Line 41
The multi-part identifier "kcu1.table_name" could not be bound.

Removing the including only table names like 'address' in schema 'application' from the ms.load file removes this issue and my migration runs successfully.