Error inserting bytes data
wpfl-dbt opened this issue
I'm trying to insert data into a table with the following definition, where the base class defines the schema and metadata.
```python
class Dedupes(CMFBase):
    __tablename__ = "cmf__ddupes"
    __table_args__ = (UniqueConstraint("left", "right"),)

    sha1: Mapped[bytes] = mapped_column(primary_key=True)
    left: Mapped[bytes] = mapped_column()
    right: Mapped[bytes] = mapped_column()
```
I'm trying to insert data that looks like this:
| sha1 | left | right |
| --- | --- | --- |
| b'\x1a\x86\x83\xbf\xe8I\x8f\x14\xe7\xe8i\xe0\x... | b'\x03\xfb\xaf\xea\xb1\xe3O\xcbY\x11p\2\x83\x... | b']\x92\x90V\xda\xc2\xe0\xbe\t\xb385\x9bx%f{\x... |
| b'\x08\xdd\x1e\x85r\xa6\x14"\x1b>r3\x85E\xd4e... | b'v\xc6\xa7\tM\xbay\x96\x19\x03e\xe8\xec\xb6r... | b'\xa5\xceT\x06\xad\x8eg\xaa\x81\xc6\n\x9bs\x9... |
| b'\xe8{J\xc0[\xfa\xe7Y\xd4M\t\xf1V\x9a\x07\x1b... | b'#\t\x8aa1\x01\xfb1u\xb3\xcdf\xf7\xa3\x97\xbf... | b'\xad>H\x14\x83\x15&\xe5\xfcn\xb3\xef\x8a\xa0... |
The `bytes` Python datatype, especially represented as a b-string, seems to give pg-bulk-ingest some trouble:
```python
import itertools

from pg_bulk_ingest import Upsert, ingest

def batches(high_watermark):
    for records in itertools.batched(dataframe.to_records(index=None), 100_000):
        yield None, None, ((Dedupes.__table__, t) for t in records)

with engine.connect() as conn:
    ingest(
        conn=conn,
        metadata=Dedupes.metadata,
        batches=batches,
        upsert=Upsert.IF_PRIMARY_KEY,
    )
```
```
cmf/data/results.py:161: in to_cmf
    self._deduper_to_cmf(engine=engine)
cmf/data/results.py:369: in _deduper_to_cmf
    ingest(
/opt/conda/envs/company_matching/lib/python3.9/site-packages/pg_bulk_ingest.py:370: in ingest
    csv_copy(sql, copy_from_stdin, conn, target_table, batch_table, table_batch)
/opt/conda/envs/company_matching/lib/python3.9/site-packages/pg_bulk_ingest.py:267: in csv_copy
    copy_from_stdin(cursor, str(bind_identifiers(sql, conn, "COPY {}.{} FROM STDIN", batch_table.schema, batch_table.name)), to_file_like_obj(db_rows, str))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cursor = <cursor object at 0x7f663ae73130; closed: -1>
query = 'COPY "test"."_tmp_52c372382f5740d4a65701a598aa99a0" FROM STDIN'
f = <to_file_like_obj.to_file_like_obj.<locals>.FileLikeObj object at 0x7f660fc4bf10>

    def copy_from_stdin2(cursor, query, f):
>       cursor.copy_expert(query, f, size=65536)
E       psycopg2.errors.InvalidTextRepresentation: invalid input syntax for type bytea
E       CONTEXT:  COPY _tmp_52c372382f5740d4a65701a598aa99a0, line 1, column left: "b'\xbe>\xbb\r\xf6\xd8`F\x0c\xa7rt/p!\xd8\xb0\xc7\xac\x1c'"

/opt/conda/envs/company_matching/lib/python3.9/site-packages/pg_bulk_ingest.py:60: InvalidTextRepresentation
```
While pg-bulk-ingest should probably handle this better, I suspect this is solvable in client (i.e. your) code, essentially by converting each value to a `str` instance in the "hex" format described at https://www.postgresql.org/docs/current/datatype-binary.html#DATATYPE-BINARY-BYTEA-HEX-FORMAT:
```python
my_bytes = b'abc'
my_bytes_postgresql_friendly_str = '\\x' + my_bytes.hex().upper()
```
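Applied to the generator from the issue, that conversion is the only change needed. The sketch below is illustrative: plain tuples stand in for `dataframe.to_records(index=None)`, the string `'dedupes_table'` stands in for `Dedupes.__table__`, and a small `batched` helper replaces `itertools.batched` (Python 3.12+) so it runs standalone:

```python
from itertools import islice

def batched(iterable, n):
    # Minimal stand-in for itertools.batched (Python 3.12+): yield
    # successive tuples of up to n items.
    it = iter(iterable)
    while chunk := tuple(islice(it, n)):
        yield chunk

def to_pg_hex(value):
    # PostgreSQL's bytea "hex" input format: backslash-x, then hex digits.
    # Uppercase matches the snippet above; PostgreSQL accepts either case.
    return '\\x' + value.hex().upper()

# Stand-in for the real data: each record is a (sha1, left, right) tuple
# of bytes values.
records = [(b'\x1a\x86', b'\x03\xfb', b']\x92')]

def batches(high_watermark):
    # Same shape as the generator in the issue, but every bytes value is
    # hex-encoded before it reaches pg-bulk-ingest's COPY stream.
    for chunk in batched(records, 100_000):
        yield None, None, (
            ('dedupes_table', tuple(to_pg_hex(v) for v in row))
            for row in chunk
        )

for _, _, rows in batches(None):
    print(list(rows))  # [('dedupes_table', ('\\x1A86', '\\x03FB', '\\x5D92'))]
```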
Or if you're keen, you can probably add a clause for `sa.BYTEA` at pg-bulk-ingest/pg_bulk_ingest.py, line 257 in 38840d1, and then do the above (and add a test for it).
It would technically be a breaking change if, for all `sa.BYTEA` columns, we depend on the input value being a `bytes` instance, but I think I'm good with that.
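As a sketch of what that library-side clause could do (the `encode_for_copy` function and the stand-in `BYTEA` class are hypothetical, not pg-bulk-ingest's actual internals):

```python
class BYTEA:
    # Stand-in for sqlalchemy.dialects.postgresql.BYTEA, purely so this
    # sketch runs without SQLAlchemy installed.
    pass

def encode_for_copy(column_type, value):
    # Hypothetical per-type encoder for the COPY ... FROM STDIN stream.
    # The BYTEA branch is the proposed addition: hex-encode bytes values
    # instead of falling through to str(value), whose b'...' repr is what
    # PostgreSQL rejects with "invalid input syntax for type bytea".
    if isinstance(column_type, BYTEA) and isinstance(value, bytes):
        return '\\x' + value.hex().upper()
    return str(value)

print(encode_for_copy(BYTEA(), b'abc'))  # \x616263
print(encode_for_copy(None, 42))         # 42
```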