Error in processing unicode characters in the response returned by Hop server
PeterCahn opened this issue
Hello everyone!
Context
I am running both Hop and Airflow in Docker containers, and I faced an issue involving unicode characters in the XML response after starting a workflow.
Problem
I am trying to run a HopWorkflowOperator in the following way:
from airflow_hop.operators import HopWorkflowOperator

wf = HopWorkflowOperator(
    task_id='workflow_test',
    workflow='workflow/simple.hwf',
    project_name='hop-poc',
    log_level='Debug')
But the task failed with the following log:
44d66f31d403
*** Reading local file: /opt/airflow/logs/dag_id=sample_dag/run_id=manual__2022-11-18T16:40:46.452020+00:00/task_id=workflow_test/attempt=1.log
[2022-11-18, 17:40:49 CET] {taskinstance.py:1165} INFO - Dependencies all met for <TaskInstance: sample_dag.workflow_test manual__2022-11-18T16:40:46.452020+00:00 [queued]>
[2022-11-18, 17:40:49 CET] {taskinstance.py:1165} INFO - Dependencies all met for <TaskInstance: sample_dag.workflow_test manual__2022-11-18T16:40:46.452020+00:00 [queued]>
[2022-11-18, 17:40:49 CET] {taskinstance.py:1362} INFO -
--------------------------------------------------------------------------------
[2022-11-18, 17:40:49 CET] {taskinstance.py:1363} INFO - Starting attempt 1 of 1
[2022-11-18, 17:40:49 CET] {taskinstance.py:1364} INFO -
--------------------------------------------------------------------------------
[2022-11-18, 17:40:49 CET] {taskinstance.py:1383} INFO - Executing <Task(HopWorkflowOperator): workflow_test> on 2022-11-18 16:40:46.452020+00:00
[2022-11-18, 17:40:49 CET] {standard_task_runner.py:55} INFO - Started process 79 to run task
[2022-11-18, 17:40:49 CET] {standard_task_runner.py:82} INFO - Running: ['***', 'tasks', 'run', 'sample_dag', 'workflow_test', 'manual__2022-11-18T16:40:46.452020+00:00', '--job-id', '736', '--raw', '--subdir', 'DAGS_FOLDER/hop-***-plugin.py', '--cfg-path', '/tmp/tmp12mv56ef']
[2022-11-18, 17:40:49 CET] {standard_task_runner.py:83} INFO - Job 736: Subtask workflow_test
[2022-11-18, 17:40:49 CET] {task_command.py:376} INFO - Running <TaskInstance: sample_dag.workflow_test manual__2022-11-18T16:40:46.452020+00:00 [running]> on host 44d66f31d403
[2022-11-18, 17:40:49 CET] {taskinstance.py:1592} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=***
AIRFLOW_CTX_DAG_ID=sample_dag
AIRFLOW_CTX_TASK_ID=workflow_test
AIRFLOW_CTX_EXECUTION_DATE=2022-11-18T16:40:46.452020+00:00
AIRFLOW_CTX_TRY_NUMBER=1
AIRFLOW_CTX_DAG_RUN_ID=manual__2022-11-18T16:40:46.452020+00:00
[2022-11-18, 17:40:49 CET] {base.py:71} INFO - Using connection ID 'hop_default' for task execution.
[2022-11-18, 17:40:49 CET] {operators.py:83} INFO - workflow/simple.hwf: Workflow 'simple' was added to the list with id 62acb5f5-7062-438b-820e-5e1ccf24c1cc
[2022-11-18, 17:40:49 CET] {taskinstance.py:1851} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow_hop/operators.py", line 85, in execute
    start_rs = conn.start_workflow(self.workflow, work_id)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow_hop/hooks.py", line 210, in start_workflow
    result = xmltodict.parse(response.content)
  File "/home/airflow/.local/lib/python3.7/site-packages/xmltodict.py", line 378, in parse
    parser.Parse(xml_input, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 4, column 45
[2022-11-18, 17:40:49 CET] {taskinstance.py:1406} INFO - Marking task as FAILED. dag_id=sample_dag, task_id=workflow_test, execution_date=20221118T164046, start_date=20221118T164049, end_date=20221118T164049
[2022-11-18, 17:40:49 CET] {standard_task_runner.py:105} ERROR - Failed to execute job 736 for task workflow_test (not well-formed (invalid token): line 4, column 45; 79)
[2022-11-18, 17:40:49 CET] {local_task_job.py:164} INFO - Task exited with return code 1
[2022-11-18, 17:40:49 CET] {local_task_job.py:273} INFO - 0 downstream tasks scheduled from follow-on schedule check
After digging in, I found that the XML returned by the call inside start_workflow, accessed through response.content, was the following:
b'<?xml version="1.0" encoding="UTF-8"?>\n<webresult>\n <result>OK</result>\n <message>Il workflow [workflow/simple.hwf] \xe8 iniziato.</message>\n <id>885c2aae-0a37-4147-a783-65e87cf71b29</id>\n</webresult>\n\n'
Here "line 4, column 45" points exactly at the byte "\xe8" (the letter 'è' in ISO-8859-1). On its own it is not valid UTF-8, even though the XML declaration says UTF-8, and this is what made xmltodict fail during parsing.
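The failure can be reproduced without a Hop server. Below is a minimal sketch that uses only the standard library's xml.parsers.expat, which is the parser xmltodict calls according to the traceback above. The try_parse helper is hypothetical, and the payload is the response shown above as it appears in this post, so the reported column may differ slightly from the one in the log.

```python
from xml.parsers import expat

# Response body as shown in the issue; \xe8 is a lone byte, not valid UTF-8.
payload = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<webresult>\n'
    b' <result>OK</result>\n'
    b' <message>Il workflow [workflow/simple.hwf] \xe8 iniziato.</message>\n'
    b' <id>885c2aae-0a37-4147-a783-65e87cf71b29</id>\n'
    b'</webresult>\n\n'
)


def try_parse(data):
    """Hypothetical helper: return the ExpatError raised while parsing, or None."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(data, True)
    except expat.ExpatError as exc:
        return exc
    return None


error = try_parse(payload)
if error is not None:
    # lineno and offset locate the offending byte in the input.
    print(f"parse failed: {error} (lineno={error.lineno}, offset={error.offset})")
```

The same bytes parse cleanly once the lone \xe8 is replaced by the two-byte UTF-8 sequence \xc3\xa8.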
Workaround
Pass response.text instead of response.content to the xmltodict.parse method, so that the bytes are decoded to a string before parsing.
I see that response.content is used everywhere when parsing responses from the Hop server. Could the same problem affect the other methods of HopServerConnection?
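To illustrate why decoding first helps, here is a rough sketch using only the standard library. It is not the plugin's code: decode_xml_bytes is a hypothetical helper, and the ISO-8859-1 fallback is an assumption about what the server actually sent. With requests, response.text performs the decoding step itself, using the charset taken from the HTTP headers or a guess when none is given.

```python
import xml.etree.ElementTree as ET


def decode_xml_bytes(raw, fallback="iso-8859-1"):
    """Hypothetical helper: try UTF-8 first, then a single-byte fallback charset."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: the server emitted single-byte Latin-1 text.
        return raw.decode(fallback)


raw = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<webresult>\n'
    b' <result>OK</result>\n'
    b' <message>Il workflow [workflow/simple.hwf] \xe8 iniziato.</message>\n'
    b'</webresult>\n'
)

text = decode_xml_bytes(raw)   # str with a real 'è' in it
root = ET.fromstring(text)     # parsing a str no longer depends on the raw bytes
result = root.findtext("result")
message = root.findtext("message")
print(result, message)
```

Trying UTF-8 first keeps correctly encoded responses untouched; the fallback only applies when the bytes are not valid UTF-8.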
It seems (not 100% sure) that the messages are encoded in UTF-16 instead of UTF-8; see messages_it_IT.properties.
è in UTF-16: 00E8
è in UTF-8: C3A8
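As a quick check of the byte values, here is a small sketch that encodes è under three encodings. U+00E8 is the code point that a \u00E8 escape in a .properties file denotes. The lone byte E8 seen in the server response matches the ISO-8859-1 form rather than the UTF-8 one.

```python
char = "\u00e8"  # 'è', Unicode code point U+00E8

# Hex representation of the same character under different encodings.
encodings = {
    name: char.encode(name).hex()
    for name in ("utf-8", "iso-8859-1", "utf-16-be")
}

for name, hexbytes in encodings.items():
    print(f"{name:>10}: {hexbytes}")
```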
@PeterCahn It would help if you shared a server response in a file, captured with curl or a similar tool. Please use English while investigating.
UTF-8 invalid
>>> b'\xe8'.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 0: invalid continuation byte
UTF-8 valid
>>> b'\xc3\xa8'.decode('utf-8')
'è'
Hi, thank you for your answer!
I see that \u00E8 is hardcoded in the property files of the Hop project that you linked.
How else could this be resolved, if not by using response.text?
Hi @PeterCahn,
You were completely right. We have applied the patch, and it is available in the latest release, 0.1.0b2.
Thank you.