DataExploration.profile results in 'ErrorMessage': "unsupported operand type(s) for -: 'int' and 'NoneType'"

Question

paulochf opened this issue 2 years ago

Hey all!

I'm trying to use the package, but I'm getting the error message in the title.

import luminaire
import pandas as pd

from luminaire.exploration.data_exploration import DataExploration

past = pd.read_csv("dataset.csv").set_index("index")

de = DataExploration(freq='D')

past_prof, profile = de.profile(df=past)
#(None,
#{'success': False,
# 'ErrorMessage': "unsupported operand type(s) for -: 'int' and 'NoneType'"})

Is this data-related?

Here is my environment:

Python 3.7.10
requirements.txt: see below, result from pip install -U jupyterlab numpy pandas matplotlib luminaire pip setuptools pyarrow

Thanks!

anyio==3.6.1
appnope==0.1.3
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
attrs==22.1.0
Babel==2.10.3
backcall==0.2.0
beautifulsoup4==4.11.1
bleach==5.0.1
boto3==1.24.76
botocore==1.27.76
certifi==2022.9.14
cffi==1.15.1
changepy==0.3.1
charset-normalizer==2.1.1
cloudpickle==2.2.0
cycler==0.11.0
debugpy==1.6.3
decorator==5.1.1
defusedxml==0.7.1
entrypoints==0.4
fastjsonschema==2.16.2
fonttools==4.37.2
future==0.18.2
hyperopt==0.2.7
idna==3.4
importlib-metadata==4.12.0
importlib-resources==5.9.0
ipykernel==6.15.3
ipython==7.34.0
ipython-genutils==0.2.0
jedi==0.18.1
Jinja2==3.1.2
jmespath==1.0.1
joblib==1.2.0
json5==0.9.10
jsonschema==4.16.0
jupyter-core==4.11.1
jupyter-server==1.18.1
jupyter_client==7.3.5
jupyterlab==3.4.7
jupyterlab-pygments==0.2.2
jupyterlab_server==2.15.1
kiwisolver==1.4.4
luminaire==0.4.0
lxml==4.9.1
MarkupSafe==2.1.1
matplotlib==3.5.3
matplotlib-inline==0.1.6
mistune==2.0.4
nbclassic==0.4.3
nbclient==0.6.8
nbconvert==7.0.0
nbformat==5.5.0
nest-asyncio==1.5.5
networkx==2.6.3
notebook==6.4.12
notebook-shim==0.1.0
numpy==1.21.6
packaging==21.3
pandas==1.3.5
pandas-redshift==2.0.5
pandocfilters==1.5.0
parso==0.8.3
patsy==0.5.2
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.2.0
pkgutil_resolve_name==1.3.10
prometheus-client==0.14.1
prompt-toolkit==3.0.31
psutil==5.9.2
psycopg2-binary==2.9.3
ptyprocess==0.7.0
py4j==0.10.9.7
pyarrow==9.0.0
pycparser==2.21
Pygments==2.13.0
pykalman==0.9.5
pyparsing==3.0.9
pyrsistent==0.18.1
python-dateutil==2.8.2
pytz==2022.2.1
pyzmq==24.0.0
requests==2.28.1
s3transfer==0.6.0
scikit-learn==1.0.2
scipy==1.7.3
Send2Trash==1.8.0
six==1.16.0
sniffio==1.3.0
soupsieve==2.3.2.post1
statsmodels==0.13.2
terminado==0.15.0
threadpoolctl==3.1.0
tinycss2==1.1.1
tomli==2.0.1
tornado==6.2
tqdm==4.64.1
traitlets==5.4.0
typing_extensions==4.3.0
urllib3==1.26.12
wcwidth==0.2.5
webencodings==0.5.1
websocket-client==1.4.1
zipp==3.8.1

SAYAN CHAKRABORTY · Answer 1 · Tue Sep 20 2022 13:46:34 GMT+0800 (China Standard Time)

It is difficult to understand the issue from the screenshot alone. Can you explicitly cast the raw column to np.float and see whether the error still reproduces? Otherwise, can you reproduce it with any other shareable data? I doubt it could possibly be data related.

Paulo Haddad · Answer 2 · Tue Sep 20 2022 22:54:05 GMT+0800 (China Standard Time)

I added the code as both text and an image. The image shows that the data is already float.

Paulo Haddad · Answer 3 · Tue Sep 20 2022 23:49:02 GMT+0800 (China Standard Time)

I debugged the code and found that fill_rate has None as its default value when it should be a float.

Is that expected? If the argument is required, it should be indicated as one and checked beforehand.
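To illustrate the point, here is a minimal sketch of how a None default can surface as exactly this error, and what an up-front check could look like. Both functions and the `1 - fill_rate` expression are hypothetical stand-ins, not luminaire's actual internals; the only thing taken from the thread is that fill_rate defaults to None and is later used as a number.

```python
from typing import Optional


def missing_share_unchecked(fill_rate: Optional[float] = None) -> float:
    """Hypothetical stand-in for internal arithmetic that assumes a number."""
    # With the default of None this evaluates 1 - None and raises TypeError.
    return 1 - fill_rate


def missing_share_checked(fill_rate: Optional[float] = None) -> float:
    """Same arithmetic, but the argument is validated before it is used."""
    if fill_rate is None:
        raise ValueError("fill_rate is required: pass a float in (0, 1]")
    if not 0 < fill_rate <= 1:
        raise ValueError(f"fill_rate must be in (0, 1], got {fill_rate!r}")
    return 1 - fill_rate


try:
    missing_share_unchecked()
except TypeError as exc:
    unchecked_message = str(exc)

try:
    missing_share_checked()
except ValueError as exc:
    checked_message = str(exc)

print(unchecked_message)  # the opaque message reported in this issue
print(checked_message)    # a message that names the missing argument
```

The checked variant fails early with a message that names the argument, which is the behavior I would have expected from a required parameter.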

SAYAN CHAKRABORTY · Answer 4 · Wed Sep 21 2022 08:10:34 GMT+0800 (China Standard Time)

@paulochf I think your debugging caught the right issue. This example shows the required attributes for the DataExploration class. You can also bypass it and run the hyperparameter optimization, which gives you the best fill rate based on the missingness pattern in the data. Please refer to this example for more details.

The purpose of the fill_rate parameter is to capture the missingness pattern over the longer history, so that the model neither generates overconfident predictions (with narrower confidence bounds) based only on a recent time window nor relies on too much synthetic data.
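As a rough illustration of what a fill rate measures, the sketch below computes the share of expected daily timestamps that actually carry a value. The helper `observed_fill_rate` and the sample data are hypothetical and use only the standard library; luminaire's own definition and windowing may differ.

```python
from datetime import date, timedelta
from typing import Dict, Optional


def observed_fill_rate(
    series: Dict[date, Optional[float]], start: date, end: date
) -> float:
    """Fraction of expected daily timestamps in [start, end] that have a value."""
    expected_days = (end - start).days + 1
    if expected_days <= 0:
        raise ValueError("end must not be before start")
    # A day counts as present only if it has a key and the value is not None.
    present = sum(
        1
        for offset in range(expected_days)
        if series.get(start + timedelta(days=offset)) is not None
    )
    return present / expected_days


# Ten expected days, two of them missing: one absent, one explicitly None.
start = date(2022, 9, 1)
series = {start + timedelta(days=i): float(i) for i in range(10)}
del series[date(2022, 9, 4)]
series[date(2022, 9, 8)] = None

rate = observed_fill_rate(series, start, date(2022, 9, 10))
print(rate)  # 0.8
```

A value like this can then be compared against the threshold passed to the constructor, for example DataExploration(freq='D', fill_rate=0.9), where 0.9 is only an illustrative choice.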

Paulo Haddad · Answer 5 · Wed Sep 21 2022 11:59:22 GMT+0800 (China Standard Time)

That's a nice piece of information! I wish I had seen it earlier, as I went directly to the API doc page. It could be indicated there as a required parameter.

Thank you for your replies!