AWS Glue error when appending data

Question

apersilva opened this issue 4 months ago

Apache Iceberg version

0.6.0 (latest release)

Please describe the bug 🐞

I started using pyiceberg with the Glue catalog and ran into an error.

The table in the Glue catalog has comments on its columns.
Is it possible to ignore the column comments when appending data to the table?

Andre Luis Anastacio · Answer 1 · Wed May 15 2024 03:14:44 GMT+0800 (China Standard Time)

Hello @apersilva, can you give us the error stack trace and a minimal code example that can reproduce this error?

apersilva · Answer 2 · Wed May 15 2024 04:03:39 GMT+0800 (China Standard Time)

from datetime import datetime

import pyarrow as pa
from pyiceberg.catalog import load_catalog


def update_table(database_target, table_target, database_name, table_name, partition_by, size, process_date, custom_partion):
    catalog = load_catalog('glue', **{'type': 'glue', 'verify': False})

    tabela = catalog.load_table(f"{database_target}.{table_target}")

    # Collect each column's doc (comment) from the table schema
    metadata = {}
    for doc in tabela.metadata.schemas[0].columns:
        metadata.update({doc.name: f"({doc.doc})"})

    # Build a one-row Arrow table, passing the comments as metadata
    df = pa.Table.from_pylist(
        [
            {
                "nome_tabela": table_name,
                "nome_base_dados": database_name,
                "particao": partition_by,
                "numero_registro": size,
                "process_date": process_date,
                "particao_customizada": custom_partion,
                "data_criacao": datetime.now().date(),
            }
        ],
        metadata=metadata,
    )

    # This call fails with "ValueError: Mismatch in fields" (see the traceback below)
    tabela.append(df)

apersilva · Answer 3 · Wed May 15 2024 04:08:52 GMT+0800 (China Standard Time)

Traceback (most recent call last):
  File "c:\great_teste\update_table.py", line 45, in update_table
    tabela.append(df)
  File "C:\Users\9001329\AppData\Roaming\Python\Python310\site-packages\pyiceberg\table\__init__.py", line 1057, in append
    _check_schema_compatible(self.schema(), other_schema=df.schema)
  File "C:\Users\9001329\AppData\Roaming\Python\Python310\site-packages\pyiceberg\table\__init__.py", line 175, in _check_schema_compatible
    raise ValueError(f"Mismatch in fields:\n{console.export_text()}")
ValueError: Mismatch in fields:
┏━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃    ┃ Table field                                                               ┃ Dataframe field                          ┃
┡━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ ❌ │ 1: nome_tabela: optional string (Nome data Tabela Processada)             │ 1: nome_tabela: optional string          │
│ ❌ │ 2: nome_base_dados: optional string (Nome do Banco de dados que pertence  │ 2: nome_base_dados: optional string      │
│    │ a tabela)                                                                 │                                          │
│ ❌ │ 3: particao: optional string (Nome da particao)                           │ 3: particao: optional string             │
│ ❌ │ 4: numero_registro: optional long (Quantidade de registros)               │ 4: numero_registro: optional long        │
│ ❌ │ 5: process_date: optional string (parametro quando é enviado e passo para │ 5: process_date: optional string         │
│    │ a funcao de escrita para particao)                                        │                                          │
│ ❌ │ 6: particao_customizada: optional string (Indica que a partição é         │ 6: particao_customizada: optional string │
│    │ diferente do padrão)                                                      │                                          │
│ ❌ │ 7: data_criacao: optional date (Data em que foi inserido o registro)      │ 7: data_criacao: optional date           │
└────┴───────────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────┘

Andre Luis Anastacio · Answer 4 · Thu May 16 2024 23:37:00 GMT+0800 (China Standard Time)

@Fokko, can you help with clarifying the expected behavior? I believe we should compare the representations (repr) of the objects. Currently, the doc attribute is not included in the __repr__, so changing the comparison to be between repr objects might solve this problem. What do you think?
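
To illustrate the distinction being discussed, here is a small stdlib-only sketch. `ToyField` is a hypothetical stand-in, not pyiceberg's actual field class; it only assumes what the comment above describes, namely that `doc` takes part in equality but is left out of the repr.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ToyField:
    """Hypothetical stand-in for a schema field; not pyiceberg's real class."""

    field_id: int
    name: str
    field_type: str
    required: bool = False
    # Excluded from repr, but still compared by the generated __eq__
    doc: Optional[str] = field(default=None, repr=False)


table_field = ToyField(1, "nome_tabela", "string", doc="Nome data Tabela Processada")
dataframe_field = ToyField(1, "nome_tabela", "string")

# Equality sees the differing doc; the repr strings do not
fields_equal = table_field == dataframe_field
reprs_equal = repr(table_field) == repr(dataframe_field)
print(fields_equal, reprs_equal)
```

In this toy model, comparing the objects reports a mismatch, while comparing their repr strings would hide the difference in comments.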

Andre Luis Anastacio · Answer 5 · Fri May 17 2024 04:59:43 GMT+0800 (China Standard Time)

Sorry, I double-checked the Java implementation, and the Python side is behaving correctly.

@apersilva, for your case, I believe you need to do something like this:

from pyiceberg.io.pyarrow import schema_to_pyarrow

schema = schema_to_pyarrow(tabela.schema())

df = pa.Table.from_pylist(
    [
        {
            "nome_tabela": table_name,
            "nome_base_dados": database_name,
            "particao": partition_by,
            "numero_registro": size,
            "process_date": process_date,
            "particao_customizada": custom_partion,  # parameter name as spelled in update_table
            "data_criacao": datetime.now().date()
        }
    ],
    schema=schema
)

tabela.append(df)

In a future release, there will be a function in the Schema object to return the Arrow schema, so it would look like this: schema = tabela.schema().as_arrow()

apersilva · Answer 6 · Fri May 17 2024 05:42:22 GMT+0800 (China Standard Time)

It works, thanks a lot.

Kevin Liu · Answer 7 · Thu Jun 20 2024 00:21:25 GMT+0800 (China Standard Time)

@apersilva, it looks like your issue is resolved. Can we close this issue?