datahub-project / datahub

The Metadata Platform for your Data Stack

Home Page: https://datahubproject.io

Redshift Ingestion broken in 0.13.2

joshua-pgatour opened this issue

Describe the bug

Execution finished with errors.
{'exec_id': '2236cd44-90eb-4781-9563-05ea51f4bbd4',
 'infos': ['2024-05-03 19:27:34.726045 INFO: Starting execution for task with name=RUN_INGEST',
           '2024-05-03 19:55:40.923833 INFO: Caught exception EXECUTING task_id=2236cd44-90eb-4781-9563-05ea51f4bbd4, name=RUN_INGEST, '
           'stacktrace=Traceback (most recent call last):\n'
           '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 140, in execute_task\n'
           '    task_event_loop.run_until_complete(task_future)\n'
           '  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n'
           '    return future.result()\n'
           '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 282, in execute\n'
           '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
           "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"],
 'errors': ['2024-05-03 19:55:40.923635 ERROR: The ingestion process was killed by signal SIGKILL likely because it ran out of memory. You can '
            'resolve this issue by allocating more memory to the datahub-actions container.']}

I have tried increasing the memory up to 32 GB and I still get this error. I've turned off lineage and profiling, and even tried ingesting just tables, no views.

Occasionally I get this error instead of the memory one:

  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 391, in _batch_workunits_by_urn
    for wu in stream:
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 184, in auto_materialize_referenced_tags
    for wu in stream:
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 91, in auto_status_aspect
    for wu in stream:
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/source/redshift/redshift.py", line 468, in get_workunits_internal
    yield from self.extract_lineage(
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/source/redshift/redshift.py", line 987, in extract_lineage
    lineage_extractor.populate_lineage(
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/source/redshift/lineage.py", line 659, in populate_lineage
    table_renames, all_tables_set = self._process_table_renames(
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/source/redshift/lineage.py", line 872, in _process_table_renames
    all_tables[database][schema].add(prev_name)
KeyError: 'pgat_competitions_x'

In this case it seems to be trying to reference a schema name that I have filtered out in the ingest recipe.
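For readers following along, here is a minimal sketch of why the last frame raises. It assumes `all_tables` is a nested mapping of database to schema to table names that only contains schemas which passed the recipe's filter; the data and the `record_rename` helper are hypothetical and are not the fix merged upstream.

```python
# Hypothetical stand-in for the structure indexed in the traceback:
# database -> schema -> set of table names, holding only ingested schemas.
all_tables = {"pgat": {"curated_scores": {"rounds"}}}


def record_rename(tables, database, schema, prev_name):
    """Record prev_name only if the schema is known.

    Returns True when the name was added, False when the database or
    schema is absent (for example, filtered out by schema_pattern).
    """
    schemas = tables.get(database)
    if schemas is None or schema not in schemas:
        return False
    schemas[schema].add(prev_name)
    return True


# The unguarded form from the traceback fails for a filtered-out schema.
try:
    all_tables["pgat"]["pgat_competitions_x"].add("old_table")
except KeyError as exc:
    print("unguarded lookup raised KeyError:", exc)

# The guarded form skips the unknown schema instead of raising.
print(record_rename(all_tables, "pgat", "pgat_competitions_x", "old_table"))
print(record_rename(all_tables, "pgat", "curated_scores", "old_rounds"))
```

Skipping renames for unknown schemas is only one possible behaviour; whether the real extractor should skip, log, or create the entry is a design decision for the maintainers.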

@joshua-pgatour what CLI version is this with?

I merged a fix related to this in #9967

Thank you for the reply. I tried going back one version at a time through the datahub-actions Docker Hub releases, and the KeyError seems to stop happening around v10. However, I still have a memory issue: I have pretty much maxed out the size my pod can be, and it still fails with a memory SIGKILL. Any suggestions on getting around this? Here is my current recipe:

`source:
  type: redshift
  config:
    host_port: ''
    database: pgat
    username:
    table_lineage_mode: mixed
    include_table_lineage: false
    include_tables: true
    include_views: false
    profiling:
      enabled: false
      profile_table_level_only: false
    stateful_ingestion:
      enabled: true
    password: '${redshift_secret2}'
    schema_pattern:
      allow:
        - 'curated_.*'
`
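As a quick sanity check on the `schema_pattern` above, the sketch below tests which schema names the allow list would accept. It assumes patterns are applied as regular expressions anchored at the start of the name and matched case-insensitively; `schema_allowed` is a hypothetical helper, and `curated_stats` is a made-up schema name.

```python
import re

# Allow list copied from the recipe.
ALLOW_PATTERNS = ["curated_.*"]


def schema_allowed(schema, allow=ALLOW_PATTERNS):
    """Return True if any allow pattern matches from the start of the name.

    Assumption: matching is anchored at the start (re.match) and ignores case.
    """
    return any(re.match(pattern, schema, re.IGNORECASE) for pattern in allow)


for name in ("curated_stats", "pgat_competitions_x"):
    print(name, schema_allowed(name))
```

Under these assumptions `pgat_competitions_x` is rejected, which is consistent with the observation that the KeyError names a schema the recipe filters out.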

So I figured out how to change the CLI version in the ingest recipe. Sorry, I thought it was controlled by the datahub-actions container version (I didn't know it was controlled in the recipe). CLI 0.10.5.1 works fine on 16 GB of memory and there is no KeyError. I will experiment to find the point at which this breaks, but I gotta believe there's a memory leak in newer versions.

I can confirm that v0.13.2 has the memory problem. v0.13.1 works, but in my testing the ingest process has slowed significantly since 0.12.
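For anyone who wants to compare versions with numbers rather than pass/fail, here is a rough sketch that runs a command and reports its exit code, wall-clock time, and peak resident memory. It relies on the Unix-only `resource` module; `run_and_measure` is a hypothetical helper, not part of DataHub.

```python
import resource
import subprocess
import sys
import time


def run_and_measure(cmd):
    """Run cmd and return (returncode, elapsed_seconds, peak_rss_mib).

    Peak RSS comes from RUSAGE_CHILDREN, which is the maximum over all
    child processes waited for so far, so use a fresh Python process per
    measurement when comparing runs.
    """
    start = time.monotonic()
    proc = subprocess.run(cmd, check=False)
    elapsed = time.monotonic() - start
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    # ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return proc.returncode, elapsed, peak / divisor
```

A call such as `run_and_measure(["datahub", "ingest", "-c", "recipe.yml"])`, where `recipe.yml` is a placeholder file name, could be repeated in a separate virtualenv for each CLI version. A return code of -9 means the child was killed by SIGKILL.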