apache / superset

Apache Superset is a Data Visualization and Data Exploration Platform

Home Page: https://superset.apache.org/

Big Number with Trendline doesn't work with PostgreSQL's date field

ensky opened this issue · comments

commented

Big Number with Trendline doesn't work with PostgreSQL's date field.

How to reproduce the bug

  1. Create a table with a column of type "date" in PostgreSQL
  2. Add that database and table as a dataset in Superset
  3. Try to create a Big Number with Trendline chart with that dataset
  4. It shows
Unexpected error
Error: '+08'

where +08 is the timezone offset from your PostgreSQL settings

Expected results

The chart should be drawn successfully.

Actual results

Unexpected error
Error: '+08'

Screenshots

[screenshot]

Additional Error Log

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/usr/local/lib/python3.8/site-packages/flask_appbuilder/security/decorators.py", line 190, in wraps
    return f(self, *args, **kwargs)
  File "/app/superset/utils/log.py", line 245, in wrapper
    value = f(*args, **kwargs)
  File "/app/superset/views/core.py", line 2351, in results
    return self.results_exec(key)
  File "/app/superset/views/core.py", line 2419, in results_exec
    obj = _deserialize_results_payload(
  File "/app/superset/views/utils.py", line 625, in _deserialize_results_payload
    df = result_set.SupersetResultSet.convert_table_to_df(pa_table)
  File "/app/superset/result_set.py", line 177, in convert_table_to_df
    return table.to_pandas(integer_object_nulls=True)
  File "pyarrow/array.pxi", line 757, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 1748, in pyarrow.lib.Table._to_pandas
  File "/usr/local/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 789, in table_to_blockmanager
    blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
  File "/usr/local/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 1130, in _table_to_blocks
    return [_reconstruct_block(item, columns, extension_columns)
  File "/usr/local/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 1130, in <listcomp>
    return [_reconstruct_block(item, columns, extension_columns)
  File "/usr/local/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 733, in _reconstruct_block
    dtype = make_datetimetz(item['timezone'])
  File "/usr/local/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 758, in make_datetimetz
    tz = pa.lib.string_to_tzinfo(tz)
  File "pyarrow/types.pxi", line 1927, in pyarrow.lib.string_to_tzinfo
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "/usr/local/lib/python3.8/site-packages/pytz/__init__.py", line 188, in timezone
    raise UnknownTimeZoneError(zone)
pytz.exceptions.UnknownTimeZoneError: '+08'
2022-05-23 03:27:32,051:ERROR:superset.views.base:'+08'

Environment

  • browser type and version: any browser
  • superset version: 1.5.0
  • python version: 3.8.12
  • any feature flags active:
    • VERSIONED_EXPORT
    • ENABLE_TEMPLATE_PROCESSING

Checklist

  • I have checked the superset logs for python stacktraces and included it here as text if there are any.
  • I have reproduced the issue with at least the latest released version of superset.
  • I have checked the issue tracker for the same issue and I haven't found one similar.

Additional context

  • This issue doesn't exist in Superset v1.4.0

The SQL in the chart is

SELECT DATE_TRUNC('day', publish_date) AS __timestamp,
       count(*) AS count
FROM
  (SELECT publish_date
   FROM software.package_file) AS virtual_table
GROUP BY DATE_TRUNC('day', publish_date)
LIMIT 50000;

The error seems to come from the DATE_TRUNC function, which returns something like "2022-05-23 12:00:00+08". That format appears to be incorrectly processed by the pyarrow project.

According to the pyarrow source, the expected timezone offset format is like +08:00, so maybe that's the root cause? https://github.com/apache/arrow/blob/apache-arrow-5.0.0/python/pyarrow/types.pxi#L1915
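
For what it's worth, here is a minimal sketch of the failure mode in the traceback (assuming, per that pyarrow source link, that anything not matching a +HH:MM offset pattern is handed to pytz, which only resolves named zones):

# Minimal sketch of the failure seen in the traceback above: pytz only
# resolves named zones, so a bare "+08" offset string raises the same
# UnknownTimeZoneError that Superset reports.
import pytz

for tz in ("Asia/Taipei", "+08"):
    try:
        print(tz, "->", pytz.timezone(tz))
    except pytz.exceptions.UnknownTimeZoneError as exc:
        print(tz, "-> UnknownTimeZoneError:", exc)  # '+08' ends up here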

@ensky what is the datatype of publish_date in the software.package_file table?

commented

@villebro It's the date type:

SELECT 
   table_name, 
   column_name, 
   data_type 
FROM 
   information_schema.columns
WHERE 
   table_name = 'package_file' and column_name = 'publish_date';

[screenshot]

I'm unable to repro on master:

[screenshot]

I unfortunately don't have an instance of 1.5.0 running right now, but I can start one if needed.

Are you using a cache to store results? And can you check if you're able to view samples for the dataset?

commented

@villebro I can reproduce it in SQL Lab following your reproduction steps, using this query:

SELECT DATE_TRUNC('day', publish_date),
       count(*) AS count
FROM
  (SELECT '2021-01-01'::date as publish_date) AS virtual_table
GROUP BY DATE_TRUNC('day', publish_date)
LIMIT 50000;

[screenshot]

PostgreSQL version:
PostgreSQL 12.7 (Debian 12.7-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit

[screenshot]

I am facing the same issue with all Time-series charts (both ECharts and legacy). The issue seems to be the handling of the Date data type in Postgres; casting such a column to Timestamp solves the issue.
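
A rough sketch of that cast-to-timestamp workaround, assuming a plain SQLAlchemy/psycopg2 connection (the engine URL is a placeholder; the table name is taken from the original report):

# Hedged sketch of the workaround described above: cast the DATE column to
# TIMESTAMP so DATE_TRUNC returns a timezone-naive timestamp instead of a
# timestamptz carrying a "+08"-style offset. The engine URL is a placeholder.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:password@localhost/mydb")

sql = """
SELECT DATE_TRUNC('day', publish_date::timestamp) AS __timestamp,
       count(*) AS count
FROM software.package_file
GROUP BY DATE_TRUNC('day', publish_date::timestamp)
"""
df = pd.read_sql(sql, engine)
print(df.dtypes)  # __timestamp should come back as datetime64[ns], no tz

In Superset itself the same cast can be applied in a virtual dataset's SQL or in a calculated column expression.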

It looks like an issue that popped up somewhere else, and that was fixed by upgrading psycopg2 to 2.9.1 in requirements.txt.
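
If it helps to verify which side produces the offset string, here is a quick sketch (placeholder DSN) that prints the tzinfo psycopg2 attaches to a DATE_TRUNC result, so you can compare 2.8.x and 2.9.x behaviour on your own setup:

# Quick check of which tzinfo object the installed psycopg2 attaches to a
# timestamptz value. The DSN is a placeholder; adjust for your environment.
import psycopg2

conn = psycopg2.connect("dbname=test user=postgres password=secret host=localhost")
with conn, conn.cursor() as cur:
    cur.execute("SELECT DATE_TRUNC('day', '2021-01-01'::date)")
    (value,) = cur.fetchone()
    # The reports in this thread suggest the string derived from this tzinfo
    # ('+08', '+01', ...) is what pytz later fails to resolve.
    print(type(value.tzinfo), repr(value.tzinfo))
conn.close()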

@ensky I'm not able to reproduce the error with the query you provided:

SELECT DATE_TRUNC('day', publish_date),
       count(*) AS count
FROM
  (SELECT '2021-01-01'::date as publish_date) AS virtual_table
GROUP BY DATE_TRUNC('day', publish_date)
LIMIT 50000;

Executed tests:

  • master on Postgres 12
  • master on Postgres 12.11
  • 1.5.1rc1 on Postgres 12
  • 1.5.1rc1 on Postgres 12.11

I am facing the same issue with all Time-series charts (both ECharts and legacy). The issue seems to be the handling of the Date data type in Postgres; casting such a column to Timestamp solves the issue.

Seems like the issue is something else. The local Docker setup doesn't show the issue, but the Kubernetes setup shows the issue. I have an older version of Postgres in k8s, so that may be the cause.

commented

but the Kubernetes setup shows the issue

True. Our environment is indeed running on k8s.

What's the Postgres version in k8s?

commented

@michael-s-molina As mentioned above, it's 12.7
#20155 (comment)

I misunderstood 😄. I thought that when @kamalkeshavani-aiinside commented:

I have an older version of Postgres in k8s

he was already considering the previous comment and the fact that we tested on versions 12 and 12.11.

I am facing the same issue with all Time-series charts

Me too. For me, this issue was introduced with the upgrade to 1.5.0. Unfortunately, a downgrade does not work ("Databases" is empty after downgrade).

Maybe the FutureWarning at the top of my (redacted) stack trace contains a valuable hint for one of you:

"Data is timezone-aware. Converting timezone-aware data to timezone-naive by passing dtype='datetime64[ns]' to DataFrame or Series is deprecated and will raise in a future version. Use pd.Series(values).dt.tz_localize(None) instead."

Jun 02 15:09:25 demeter gunicorn[1697193]: /srv/superset/local/lib/python3.8/site-packages/superset/common/query_context_processor.py:186: FutureWarning: Data is timezone-aware. Converting timezone-aware data to timezone-naive by passing dtype='datetime64[ns]' to DataFrame or Series is deprecated and will raise in a future version. Use `pd.Series(values).dt.tz_localize(None)` instead.
Jun 02 15:09:25 demeter gunicorn[1697193]:   result = query_context.datasource.query(query_object.to_dict())
Jun 02 15:09:25 demeter gunicorn[1697193]: Query SELECT DATE_TRUNC('year', entry_date) AS __timestamp,
Jun 02 15:09:25 demeter gunicorn[1697193]:        sum(our_fees) AS "SUM(our_fees)"
Jun 02 15:09:25 demeter gunicorn[1697193]: FROM
Jun 02 15:09:25 demeter gunicorn[1697193]:   (SELECT i.date AS invoice_date,
Jun 02 15:09:25 demeter gunicorn[1697193]:           e.date AS entry_date,
...
Jun 02 15:09:25 demeter gunicorn[1697193]: WHERE entry_date >= TO_TIMESTAMP('2012-01-01 00:00:00.000000', 'YYYY-MM-DD HH24:MI:SS.US')
Jun 02 15:09:25 demeter gunicorn[1697193]:   AND entry_date < TO_TIMESTAMP('2022-01-01 00:00:00.000000', 'YYYY-MM-DD HH24:MI:SS.US')
Jun 02 15:09:25 demeter gunicorn[1697193]:   AND number = 970
Jun 02 15:09:25 demeter gunicorn[1697193]:   AND subnumber = 1
Jun 02 15:09:25 demeter gunicorn[1697193]: GROUP BY DATE_TRUNC('year', entry_date)
Jun 02 15:09:25 demeter gunicorn[1697193]: ORDER BY "SUM(our_fees)" DESC
Jun 02 15:09:25 demeter gunicorn[1697193]: LIMIT 10000 on schema public failed
Jun 02 15:09:25 demeter gunicorn[1697193]: Traceback (most recent call last):
Jun 02 15:09:25 demeter gunicorn[1697193]:   File "/srv/superset/local/lib/python3.8/site-packages/superset/connectors/sqla/models.py", line 1719, in query
Jun 02 15:09:25 demeter gunicorn[1697193]:     df = self.database.get_df(sql, self.schema, mutator=assign_column_label)
Jun 02 15:09:25 demeter gunicorn[1697193]:   File "/srv/superset/local/lib/python3.8/site-packages/superset/models/core.py", line 453, in get_df
Jun 02 15:09:25 demeter gunicorn[1697193]:     df = result_set.to_pandas_df()
Jun 02 15:09:25 demeter gunicorn[1697193]:   File "/srv/superset/local/lib/python3.8/site-packages/superset/result_set.py", line 202, in to_pandas_df
Jun 02 15:09:25 demeter gunicorn[1697193]:     return self.convert_table_to_df(self.table)
Jun 02 15:09:25 demeter gunicorn[1697193]:   File "/srv/superset/local/lib/python3.8/site-packages/superset/result_set.py", line 177, in convert_table_to_df
Jun 02 15:09:25 demeter gunicorn[1697193]:     return table.to_pandas(integer_object_nulls=True)
Jun 02 15:09:25 demeter gunicorn[1697193]:   File "pyarrow/array.pxi", line 757, in pyarrow.lib._PandasConvertible.to_pandas
Jun 02 15:09:25 demeter gunicorn[1697193]:   File "pyarrow/table.pxi", line 1748, in pyarrow.lib.Table._to_pandas
Jun 02 15:09:25 demeter gunicorn[1697193]:   File "/srv/superset/local/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 789, in table_to_blockmanager
Jun 02 15:09:25 demeter gunicorn[1697193]:     blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
Jun 02 15:09:25 demeter gunicorn[1697193]:   File "/srv/superset/local/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 1130, in _table_to_blocks
Jun 02 15:09:25 demeter gunicorn[1697193]:     return [_reconstruct_block(item, columns, extension_columns)
Jun 02 15:09:25 demeter gunicorn[1697193]:   File "/srv/superset/local/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 1130, in <listcomp>
Jun 02 15:09:25 demeter gunicorn[1697193]:     return [_reconstruct_block(item, columns, extension_columns)
Jun 02 15:09:25 demeter gunicorn[1697193]:   File "/srv/superset/local/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 733, in _reconstruct_block
Jun 02 15:09:25 demeter gunicorn[1697193]:     dtype = make_datetimetz(item['timezone'])
Jun 02 15:09:25 demeter gunicorn[1697193]:   File "/srv/superset/local/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 758, in make_datetimetz
Jun 02 15:09:25 demeter gunicorn[1697193]:     tz = pa.lib.string_to_tzinfo(tz)
Jun 02 15:09:25 demeter gunicorn[1697193]:   File "pyarrow/types.pxi", line 1927, in pyarrow.lib.string_to_tzinfo
Jun 02 15:09:25 demeter gunicorn[1697193]:   File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
Jun 02 15:09:25 demeter gunicorn[1697193]:   File "/usr/lib/python3/dist-packages/pytz/__init__.py", line 181, in timezone
Jun 02 15:09:25 demeter gunicorn[1697193]:     raise UnknownTimeZoneError(zone)
Jun 02 15:09:25 demeter gunicorn[1697193]: pytz.exceptions.UnknownTimeZoneError: '+01' 
Jun 02 15:09:25 demeter gunicorn[1697193]: 192.168.0.121 - - [02/Jun/2022:15:09:25 +0200] "POST /api/v1/chart/data?form_data=%7B%22slice_id%22%3A211%7D&dashboard_id=13&force HTTP/1.0" 400 27 "https://[...]" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0"
Jun 02 15:09:25 demeter gunicorn[1697193]: [2022-06-02 15:09:25 +0200] [1697193] [DEBUG] Closing connection.
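
For reference, a minimal pandas sketch of what that FutureWarning is pointing at, with made-up values; the deprecated construction coerces timezone-aware values to naive, and the warning's suggested replacement makes the conversion explicit:

# Minimal illustration of the FutureWarning quoted above (made-up values).
import pandas as pd

aware = pd.to_datetime(["2022-05-23 12:00:00+08:00", "2022-05-24 12:00:00+08:00"])

# Deprecated: coercing tz-aware values straight to datetime64[ns] emits the
# FutureWarning shown in the log (and will raise in a future pandas version).
deprecated = pd.Series(aware, dtype="datetime64[ns]")

# Suggested replacement from the warning text: strip the timezone explicitly.
explicit = pd.Series(aware).dt.tz_localize(None)
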
commented

It looks like an issue that popped up somewhere else, and that was fixed by upgrading psycopg2 to 2.9.1 in requirements.txt.

I can confirm that with psycopg2==2.9.1 the problem is solved.

So it seems to be a problem with the k8s setup example located at https://superset.apache.org/docs/installation/running-on-kubernetes, which instructs you to pip install psycopg2==2.8.5 in the dependencies section.

Anyone can change it to pip install psycopg2==2.9.1 and the problem is solved.

So it seems to be a problem with the k8s setup example located at https://superset.apache.org/docs/installation/running-on-kubernetes, which instructs you to pip install psycopg2==2.8.5 in the dependencies section.

Thanks for reporting this @ensky and also for submitting a PR to update the guide 🙂

I can confirm that with psycopg2==2.9.1 the problem is solved.

Anyone can change it to pip install psycopg2==2.9.1 and the problem is solved.

I cannot confirm this resolution. I upgraded Superset to 1.5.1 via pip, but the problem persists for me; pip even upgrades to psycopg2==2.9.3. Also, manually downgrading to 2.9.1 does not resolve the issue. The patch in 0b49ca6 still works.

This is really useful.

Setting psycopg2==2.9.3 works for me (using the Superset 2.0.0 Docker image).

Perhaps this is an issue with my PostgreSQL version (14). Anyway, c61ac76 is an updated patch to fix the timezone issue if (like me) the psycopg2 upgrade does not resolve it for you. This is better than my previous patch, because it is also compatible with using Prophet.