uktrade / pg-bulk-ingest

Python utility function to ingest data into a SQLAlchemy-defined PostgreSQL table

Home Page: https://pg-bulk-ingest.docs.trade.gov.uk/

Error inserting bytes data

wpfl-dbt opened this issue · comments

I'm trying to insert data into a table with the following definition, where the base class defines the schema and metadata.

from sqlalchemy import UniqueConstraint
from sqlalchemy.orm import Mapped, mapped_column

class Dedupes(CMFBase):
    __tablename__ = "cmf__ddupes"
    __table_args__ = (UniqueConstraint("left", "right"),)

    sha1: Mapped[bytes] = mapped_column(primary_key=True)
    left: Mapped[bytes] = mapped_column()
    right: Mapped[bytes] = mapped_column()

I'm trying to insert data that looks like this:

sha1 left right
b'\x1a\x86\x83\xbf\xe8I\x8f\x14\xe7\xe8i\xe0\x... b'\x03\xfb\xaf\xea\xb1\xe3O\xcbY\x11p\2\x83\x... b']\x92\x90V\xda\xc2\xe0\xbe\t\xb385\x9bx%f{\x...
b'\x08\xdd\x1e\x85r\xa6\x14"\x1b>r3\x85E\xd4e... b'v\xc6\xa7\tM\xbay\x96\x19\x03e\xe8\xec\xb6r... b'\xa5\xceT\x06\xad\x8eg\xaa\x81\xc6\n\x9bs\x9...
b'\xe8{J\xc0[\xfa\xe7Y\xd4M\t\xf1V\x9a\x07\x1b... b'#\t\x8aa1\x01\xfb1u\xb3\xcdf\xf7\xa3\x97\xbf... b'\xad>H\x14\x83\x15&\xe5\xfcn\xb3\xef\x8a\xa0...

The Python bytes datatype, especially when represented as a b-string, seems to give pg-bulk-ingest some trouble:

import itertools  # itertools.batched requires Python 3.12+
from pg_bulk_ingest import Upsert, ingest

def batches(high_watermark):
    for records in itertools.batched(dataframe.to_records(index=False), 100_000):
        yield None, None, ((Dedupes.__table__, tuple(t)) for t in records)

with engine.connect() as conn:
    ingest(
        conn=conn,
        metadata=Dedupes.metadata,
        batches=batches,
        upsert=Upsert.IF_PRIMARY_KEY,
    )
cmf/data/results.py:161: in to_cmf
    self._deduper_to_cmf(engine=engine)
cmf/data/results.py:369: in _deduper_to_cmf
    ingest(
/opt/conda/envs/company_matching/lib/python3.9/site-packages/pg_bulk_ingest.py:370: in ingest
    csv_copy(sql, copy_from_stdin, conn, target_table, batch_table, table_batch)
/opt/conda/envs/company_matching/lib/python3.9/site-packages/pg_bulk_ingest.py:267: in csv_copy
    copy_from_stdin(cursor, str(bind_identifiers(sql, conn, "COPY {}.{} FROM STDIN", batch_table.schema, batch_table.name)), to_file_like_obj(db_rows, str))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

cursor = <cursor object at 0x7f663ae73130; closed: -1>
query = 'COPY "test"."_tmp_52c372382f5740d4a65701a598aa99a0" FROM STDIN'
f = <to_file_like_obj.to_file_like_obj.<locals>.FileLikeObj object at 0x7f660fc4bf10>

    def copy_from_stdin2(cursor, query, f):
>       cursor.copy_expert(query, f, size=65536)
E       psycopg2.errors.InvalidTextRepresentation: invalid input syntax for type bytea
E       CONTEXT:  COPY _tmp_52c372382f5740d4a65701a598aa99a0, line 1, column left: "b'\xbe>\xbb\r\xf6\xd8`F\x0c\xa7rt/p!\xd8\xb0\xc7\xac\x1c'"

/opt/conda/envs/company_matching/lib/python3.9/site-packages/pg_bulk_ingest.py:60: InvalidTextRepresentation
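The CONTEXT line in the error points at the root cause: the rows are serialised to text for COPY ... FROM STDIN via str (note the to_file_like_obj(db_rows, str) call above), so each bytes value arrives as its Python repr, the b'...' form, which PostgreSQL cannot parse as bytea. A minimal demonstration of the mismatch, assuming nothing beyond the standard library:

value = b'\xbe>\xbb'
str(value)           # "b'\xbe>\xbb'" - the Python repr, not valid bytea text
'\\x' + value.hex()  # '\xbe3ebb'     - PostgreSQL's hex bytea format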

While pg-bulk-ingest should probably handle this better, I suspect this is solvable in client (i.e. your) code, essentially by converting each bytes value to a str instance in the "hex" format described at https://www.postgresql.org/docs/current/datatype-binary.html#DATATYPE-BINARY-BYTEA-HEX-FORMAT:

my_bytes = b'abc'
# '\x' prefix followed by hex digits; PostgreSQL accepts upper or lower case
my_bytes_postgresql_friendly_str = '\\x' + my_bytes.hex().upper()
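For example, the conversion could be applied to each record as it's yielded. A minimal sketch reusing the batches generator from above (to_pg_hex is a hypothetical helper, not part of pg-bulk-ingest):

import itertools

def to_pg_hex(value):
    # Render bytes in PostgreSQL's hex bytea format; pass other values through
    return '\\x' + value.hex() if isinstance(value, bytes) else value

def batches(high_watermark):
    for records in itertools.batched(dataframe.to_records(index=False), 100_000):
        yield None, None, (
            (Dedupes.__table__, tuple(to_pg_hex(v) for v in t))
            for t in records
        )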

Or if you're keen, you can probably add a clause at

to test if the column is sa.BYTEA, and then do the above (and add a test for it).

It would technically be a breaking change if, for all sa.BYTEA columns, we depended on the input value being a bytes instance, but I think I'm good with that.
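For reference, a minimal sketch of what such a clause might test (the helper name and its placement are assumptions, not pg-bulk-ingest's actual internals; the BYTEA type comes from SQLAlchemy's postgresql dialect):

import sqlalchemy as sa
from sqlalchemy.dialects.postgresql import BYTEA

def encode_bytea(column, value):
    # Assumed helper: render values bound for BYTEA (or the generic
    # LargeBinary) columns in the hex format that COPY accepts as bytea
    if isinstance(column.type, (BYTEA, sa.LargeBinary)) and isinstance(value, bytes):
        return '\\x' + value.hex()
    return value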