damavis / airflow-hop-plugin

Apache Hop plugin for Apache Airflow - Orchestrate Apache Hop pipelines and workflows from Airflow


Error processing Unicode characters in the response returned by the Hop server

PeterCahn opened this issue · comments

Hello everyone!

Context

I am running both Hop and Airflow in Docker containers, and I ran into an issue with Unicode characters in the XML response returned after starting a workflow.

Problem

I am trying to run a HopWorkflowOperator in the following way:

    wf = HopWorkflowOperator(
        task_id='workflow_test',
        workflow='workflow/simple.hwf',
        project_name='hop-poc',
        log_level='Debug')

But I faced this problem:

44d66f31d403
*** Reading local file: /opt/airflow/logs/dag_id=sample_dag/run_id=manual__2022-11-18T16:40:46.452020+00:00/task_id=workflow_test/attempt=1.log
[2022-11-18, 17:40:49 CET] {taskinstance.py:1165} INFO - Dependencies all met for <TaskInstance: sample_dag.workflow_test manual__2022-11-18T16:40:46.452020+00:00 [queued]>
[2022-11-18, 17:40:49 CET] {taskinstance.py:1165} INFO - Dependencies all met for <TaskInstance: sample_dag.workflow_test manual__2022-11-18T16:40:46.452020+00:00 [queued]>
[2022-11-18, 17:40:49 CET] {taskinstance.py:1362} INFO - 
--------------------------------------------------------------------------------
[2022-11-18, 17:40:49 CET] {taskinstance.py:1363} INFO - Starting attempt 1 of 1
[2022-11-18, 17:40:49 CET] {taskinstance.py:1364} INFO - 
--------------------------------------------------------------------------------
[2022-11-18, 17:40:49 CET] {taskinstance.py:1383} INFO - Executing <Task(HopWorkflowOperator): workflow_test> on 2022-11-18 16:40:46.452020+00:00
[2022-11-18, 17:40:49 CET] {standard_task_runner.py:55} INFO - Started process 79 to run task
[2022-11-18, 17:40:49 CET] {standard_task_runner.py:82} INFO - Running: ['***', 'tasks', 'run', 'sample_dag', 'workflow_test', 'manual__2022-11-18T16:40:46.452020+00:00', '--job-id', '736', '--raw', '--subdir', 'DAGS_FOLDER/hop-***-plugin.py', '--cfg-path', '/tmp/tmp12mv56ef']
[2022-11-18, 17:40:49 CET] {standard_task_runner.py:83} INFO - Job 736: Subtask workflow_test
[2022-11-18, 17:40:49 CET] {task_command.py:376} INFO - Running <TaskInstance: sample_dag.workflow_test manual__2022-11-18T16:40:46.452020+00:00 [running]> on host 44d66f31d403
[2022-11-18, 17:40:49 CET] {taskinstance.py:1592} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=***
AIRFLOW_CTX_DAG_ID=sample_dag
AIRFLOW_CTX_TASK_ID=workflow_test
AIRFLOW_CTX_EXECUTION_DATE=2022-11-18T16:40:46.452020+00:00
AIRFLOW_CTX_TRY_NUMBER=1
AIRFLOW_CTX_DAG_RUN_ID=manual__2022-11-18T16:40:46.452020+00:00
[2022-11-18, 17:40:49 CET] {base.py:71} INFO - Using connection ID 'hop_default' for task execution.
[2022-11-18, 17:40:49 CET] {operators.py:83} INFO - workflow/simple.hwf: Workflow 'simple' was added to the list with id 62acb5f5-7062-438b-820e-5e1ccf24c1cc
[2022-11-18, 17:40:49 CET] {taskinstance.py:1851} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow_hop/operators.py", line 85, in execute
    start_rs = conn.start_workflow(self.workflow, work_id)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow_hop/hooks.py", line 210, in start_workflow
    result = xmltodict.parse(response.content)
  File "/home/airflow/.local/lib/python3.7/site-packages/xmltodict.py", line 378, in parse
    parser.Parse(xml_input, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 4, column 45
[2022-11-18, 17:40:49 CET] {taskinstance.py:1406} INFO - Marking task as FAILED. dag_id=sample_dag, task_id=workflow_test, execution_date=20221118T164046, start_date=20221118T164049, end_date=20221118T164049
[2022-11-18, 17:40:49 CET] {standard_task_runner.py:105} ERROR - Failed to execute job 736 for task workflow_test (not well-formed (invalid token): line 4, column 45; 79)
[2022-11-18, 17:40:49 CET] {local_task_job.py:164} INFO - Task exited with return code 1
[2022-11-18, 17:40:49 CET] {local_task_job.py:273} INFO - 0 downstream tasks scheduled from follow-on schedule check

After digging in, I discovered that the XML returned by the call inside start_workflow, accessed via response.content, was the following:

b'<?xml version="1.0" encoding="UTF-8"?>\n<webresult>\n <result>OK</result>\n <message>Il workflow [workflow/simple.hwf] \xe8 iniziato.</message>\n <id>885c2aae-0a37-4147-a783-65e87cf71b29</id>\n</webresult>\n\n'

where "line 4, column 45" points exactly at the byte \xe8 (the letter 'è'), which caused the parsing error in xmltodict.
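The failure can be reproduced with the standard library alone. A minimal sketch, using xml.etree.ElementTree in place of the xmltodict call the plugin actually makes (both sit on the same expat parser), and a Latin-1 decode standing in for what requests' response.text typically falls back to when no charset is declared:

```python
import xml.etree.ElementTree as ET

# Byte-for-byte copy of the server response shown above: the declaration
# claims UTF-8, but the 0xE8 byte ('è') is a single Latin-1 byte.
raw = (b'<?xml version="1.0" encoding="UTF-8"?>\n<webresult>\n'
       b' <result>OK</result>\n'
       b' <message>Il workflow [workflow/simple.hwf] \xe8 iniziato.</message>\n'
       b' <id>885c2aae-0a37-4147-a783-65e87cf71b29</id>\n</webresult>\n\n')

try:
    ET.fromstring(raw)   # same expat parser, same failure as in the log
except ET.ParseError as err:
    print(err)           # not well-formed (invalid token): line 4, column 45

# Decoding as Latin-1 first and re-encoding as genuine UTF-8
# makes the document parseable.
fixed = raw.decode('latin-1').encode('utf-8')
message = ET.fromstring(fixed).find('message').text
print(message)           # Il workflow [workflow/simple.hwf] è iniziato.
```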

Workaround

Pass response.text instead of response.content to the xmltodict.parse method so that Unicode characters are handled during parsing.


I see that response.content is used everywhere a response from the Hop server is parsed; could the same problem affect other methods inside HopServerConnection?
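A hedged sketch of how every parse site could decode defensively instead. parse_webresult is a hypothetical helper name, not part of the plugin, and standard-library xml.etree stands in for xmltodict here:

```python
import xml.etree.ElementTree as ET

def parse_webresult(content: bytes) -> dict:
    """Decode a Hop server response tolerantly, then parse it.

    Try UTF-8 first (what the XML declaration claims); fall back to
    Latin-1, which maps every byte and matches the 0xE8 seen in the logs.
    """
    try:
        text = content.decode('utf-8')
    except UnicodeDecodeError:
        text = content.decode('latin-1')
    # ET refuses str input carrying an encoding declaration,
    # so re-encode as genuine UTF-8 before parsing.
    root = ET.fromstring(text.encode('utf-8'))
    return {child.tag: child.text for child in root}

sample = (b'<?xml version="1.0" encoding="UTF-8"?>\n<webresult>\n'
          b' <result>OK</result>\n'
          b' <message>Il workflow [workflow/simple.hwf] \xe8 iniziato.</message>\n'
          b' <id>885c2aae-0a37-4147-a783-65e87cf71b29</id>\n</webresult>\n')
print(parse_webresult(sample)['result'])   # OK
```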

It seems (not 100% sure) that the messages are encoded in UTF-16 instead of UTF-8; see messages_it_IT.properties.

è in UTF-16: 00E8
è in UTF-8: C3 A8
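A quick comparison of how 'è' (U+00E8) is encoded suggests the stray byte actually matches Latin-1 (ISO-8859-1), where the single byte 0xE8 is legal, rather than UTF-16:

```python
# How 'è' (code point U+00E8) is encoded by each codec under discussion;
# the bare 0xE8 byte from the server matches Latin-1.
for codec in ('utf-8', 'utf-16-be', 'latin-1'):
    print(codec, 'è'.encode(codec).hex())
# utf-8 c3a8
# utf-16-be 00e8
# latin-1 e8
```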

@PeterCahn It would help if you shared a server response in a file, captured with curl or a similar tool. Please use English while investigating.

UTF-8 invalid

>>> b'\xe8'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 0: invalid continuation byte

UTF-8 valid

>>> b'\xc3\xa8'.decode('utf-8')
'è'

Hi, thank you for your answer!

I see that, in the property files inside the Hop project that you linked, \u00E8 is hardcoded.
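Note that \u00E8 in a Java .properties file is a textual Unicode escape naming the code point U+00E8; it says nothing about the bytes the server puts on the wire. A small sketch (the message string is an assumed example, not the exact entry from messages_it_IT.properties):

```python
# A raw string keeps the literal backslash, mimicking what sits in the
# .properties file; 'unicode_escape' resolves \u00e8 to the character 'è'.
escaped = r'Il workflow [{0}] \u00e8 iniziato.'
print(escaped.encode('ascii').decode('unicode_escape'))
# Il workflow [{0}] è iniziato.
```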

How else could it be resolved, if not by using response.text?

Hi @PeterCahn ,

You were completely right. We have applied the patch, and it is available in the latest release, 0.1.0b2.

Thank you.