dbt-labs / dbt-event-logging

a dbt package to make auditing dbt runs easy.

Home Page: https://hub.getdbt.com/dbt-labs/logging/latest/

Run started events not logged

Limess opened this issue

Describe the bug

Since my change to include extra columns (#17), run started events are logged only on the first run, which creates the table.

Steps To Reproduce

Execute dbt run with a single model and the event logging hooks configured.

Check the database table afterwards; no run started event is present:

select * from analytics_meta.dbt_audit_log order by event_timestamp desc;

Expected behavior

The run started event is logged.

Actual behavior

All expected events are logged except for run started.

run started is logged only on the first run, when the table is created.

Example from two consecutive runs, starting with no existing table:

| event_name                 | event_timestamp     | event_schema                   | event_model                | event_user             | event_target | event_is_full_refresh | invocation_id                        |
|----------------------------|---------------------|--------------------------------|----------------------------|------------------------|--------------|-----------------------|--------------------------------------|
| run completed              | 2020-03-03 09:50:53 |                                |                            | analyst_charlie_briggs | dev          | t                     | 826ed09c-523a-49fb-8e98-f89437926d7a |
| model deployment completed | 2020-03-03 09:50:52 | analyst_charlie_briggs_scratch | users_for_tracking_erasure | analyst_charlie_briggs | dev          | t                     | 826ed09c-523a-49fb-8e98-f89437926d7a |
| model deployment started   | 2020-03-03 09:50:51 | analyst_charlie_briggs_scratch | users_for_tracking_erasure | analyst_charlie_briggs | dev          | t                     | 826ed09c-523a-49fb-8e98-f89437926d7a |
| run completed              | 2020-03-03 09:48:27 |                                |                            | analyst_charlie_briggs | dev          | t                     | 59de4dbb-99c0-4178-8c59-501234085fdc |
| model deployment started   | 2020-03-03 09:48:25 | analyst_charlie_briggs_scratch | users_for_tracking_erasure | analyst_charlie_briggs | dev          | t                     | 59de4dbb-99c0-4178-8c59-501234085fdc |
| model deployment completed | 2020-03-03 09:48:25 | analyst_charlie_briggs_scratch | users_for_tracking_erasure | analyst_charlie_briggs | dev          | t                     | 59de4dbb-99c0-4178-8c59-501234085fdc |
| run started                | 2020-03-03 09:48:22 |                                |                            | analyst_charlie_briggs | dev          | t                     | 59de4dbb-99c0-4178-8c59-501234085fdc |

System information

How did you add this package to your project:

packages:
  - package: fishtown-analytics/dbt_utils
    version: 0.2.5

  - package: fishtown-analytics/logging
    version: 0.2.1

Which database are you using dbt with?

  • Postgres
  • Redshift
  • BigQuery
  • Snowflake
  • Other (specify: ____________)

The output of dbt --version:

0.15.2

Additional context

I can't see anything obvious in my change which would have resulted in a change of behaviour. This statement only has extra fields.

This seems to suggest that if create_audit_log_table hits the CREATE TABLE path, the event is logged correctly, but if it hits the modification step, it is not.

I've tried adding SELECT 1 as a default statement when no columns require modification, on the assumption that the macro outputting nothing might be the cause, but that doesn't resolve the issue.

Upon checking the logs:

2020-03-03 09:59:57,28995 (MainThread): 09:59:57 | 3 of 3 START hook: logging.on-run-start.2............................ [RUN]
2020-03-03 09:59:57,29261 (MainThread): Using redshift connection "master".
2020-03-03 09:59:57,29406 (MainThread): On master: /* {"app": "dbt", "dbt_version": "0.15.2", "profile_name": "signal", "target_name": "dev", "connection_name": "master"} */

    insert into "analytics"."analyst_charlie_briggs_scratch_meta"."dbt_audit_log" (
        event_name,
        event_timestamp,
        event_schema,
        event_model,
        event_user,
        event_target,
        event_is_full_refresh,
        invocation_id
    )
    values (
        'run started',
        getdate(),
        '',
        '',
        'analyst_charlie_briggs',
        'dev',
        TRUE,
        '976e02e6-d351-447a-9f87-266980607df6'
    )

2020-03-03 09:59:57,104381 (MainThread): SQL status: INSERT 0 1 in 0.07 seconds
2020-03-03 09:59:57,105942 (MainThread): 09:59:57 | 3 of 3 OK hook: logging.on-run-start.2............................... [INSERT 0 1 in 0.08s]
2020-03-03 09:59:57,106273 (MainThread): 09:59:57 | 
2020-03-03 09:59:57,106509 (MainThread): On master: ROLLBACK

It looks like the insert is being rolled back for some reason. I've tried running that statement manually with no issues. The rollback could be down to unrelated dbt internals, though.

Adding commit; after the insert statement seems to resolve this issue. The run end statement already does this. Any idea why it's not the default, or why it seems to be required (or why this has regressed with my change 😅)?

hey @Limess - I think your analysis is spot-on here! I did some digging, and there's some funkiness going on here around how dbt manages on-run-{start|end} hooks. If you check out the configured hooks in this package:
https://github.com/fishtown-analytics/dbt-event-logging/blob/71533ed08b29ad7ed3469aa04e50dd0293570d3c/dbt_project.yml#L18-L21

You should see that the create_audit_log_table macro invokes a query against the database (via adapter.get_columns_in_relation). This query should require a new connection, but it appears to me that the introspective query (via get_columns_in_relation) and the insert query are both using the same connection. So, when dbt goes to clean up the connection created for the get_columns_in_relation query, it is errantly rolling back the MainThread connection, including the insert statement!
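To make the suspected mechanism concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than dbt or Redshift. It assumes that a single connection stands in for the shared master connection and that cleanup is a plain rollback. The function name run_start_hook and its commit_after_insert flag are hypothetical, and dbt's real connection handling may well differ.

```python
import sqlite3


def run_start_hook(commit_after_insert: bool) -> int:
    """Return how many 'run started' rows survive the connection cleanup."""
    # One connection stands in for the shared "master" connection (assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute("create table dbt_audit_log (event_name text)")
    conn.commit()

    # Introspective query, standing in for adapter.get_columns_in_relation.
    conn.execute("select * from dbt_audit_log").fetchall()

    # The hook's insert implicitly opens a transaction on the same connection.
    conn.execute("insert into dbt_audit_log values ('run started')")
    if commit_after_insert:
        conn.commit()  # the stop-gap: commit right after the insert

    # Cleanup for the introspective query rolls back the shared connection.
    conn.rollback()

    (count,) = conn.execute("select count(*) from dbt_audit_log").fetchone()
    conn.close()
    return count


print(run_start_hook(commit_after_insert=False))  # 0: the insert is rolled back
print(run_start_hook(commit_after_insert=True))   # 1: the committed insert survives
```

In this toy model the uncommitted insert disappears with the rollback, while an explicit commit immediately after the insert protects it, which matches the behaviour reported above.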

So, I think adding a little commit in here might be a good stop-gap to add to this package, but this does sound like a bug in dbt Core to me! I think the ultimate fix would just be to use a different connection for the on-run-{start|end} hooks (named on-run-start instead of master). That way, the master connection can still be rolled back without impacting any queries executing in the on-run-* part of the run.
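The separate-connection idea can be sketched the same way, again with Python's sqlite3 module rather than dbt itself. The two connections below are stand-ins for master and a hypothetical on-run-start connection, and the sketch assumes the hook connection commits its own work when it is cleaned up, which is not something this thread confirms about dbt Core.

```python
import os
import sqlite3
import tempfile


def run_with_separate_hook_connection() -> int:
    """Return how many 'run started' rows survive a rollback of "master"."""
    # A file-backed database lets two independent connections share one table.
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    try:
        master = sqlite3.connect(path)  # stands in for the "master" connection
        hook = sqlite3.connect(path)    # hypothetical "on-run-start" connection
        master.execute("create table dbt_audit_log (event_name text)")
        master.commit()

        # Introspective query runs on master, as get_columns_in_relation would.
        master.execute("select * from dbt_audit_log").fetchall()

        # The hook's insert runs in a transaction on its own connection.
        hook.execute("insert into dbt_audit_log values ('run started')")

        # Cleaning up master no longer touches the hook's transaction.
        master.rollback()

        # Assumption: the hook connection commits when it is cleaned up.
        hook.commit()

        (count,) = master.execute(
            "select count(*) from dbt_audit_log"
        ).fetchone()
        master.close()
        hook.close()
        return count
    finally:
        os.remove(path)


print(run_with_separate_hook_connection())  # 1: the insert is unaffected
```

Because the rollback and the insert happen on different connections here, the order of cleanup stops mattering for the audit row.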

I can create an issue in dbt Core that elaborates on this a little bit more, but in the meantime, please lmk if you have any questions or thoughts!