mckinsey / causalnex

A Python library that helps data scientists to infer causation rather than observing correlation.

Home Page: http://causalnex.readthedocs.io/

Only works in Jupyter notebook?

RoneChen opened this issue

commented

Description

Hi, I want to use your package in one of my projects. However, with the same dataset, the BayesianNetwork can be trained successfully when I run it in a Jupyter notebook, while running it in PyCharm throws a MemoryError.

Context

This means I cannot use your package in my project.

Steps to Reproduce

Use the same dataset, copy the code into PyCharm, and run it.

Expected Result

I expected it to train the Bayesian network successfully.

Actual Result

numpy.core._exceptions.MemoryError: Unable to allocate 83.3 GiB for an array with shape (570, 39251968)...

Traceback (most recent call last):
  File "D:/internship/causalnex_online/causalnex_model.py", line 148, in <module>
    bn = bn.fit_cpds(train, method='BayesianEstimator', bayes_prior='K2')
  File "C:\A_Rone\App_installation\Anaconda3\envs\causalnex\lib\site-packages\causalnex\network\network.py", line 421, in fit_cpds
    self._model.fit(
  File "C:\A_Rone\App_installation\Anaconda3\envs\causalnex\lib\site-packages\pgmpy\models\BayesianNetwork.py", line 586, in fit
    cpds_list = _estimator.get_parameters(n_jobs=n_jobs, **kwargs)
  File "C:\A_Rone\App_installation\Anaconda3\envs\causalnex\lib\site-packages\pgmpy\estimators\BayesianEstimator.py", line 97, in get_parameters
    parameters = Parallel(n_jobs=n_jobs, prefer="threads")(
  File "C:\A_Rone\App_installation\Anaconda3\envs\causalnex\lib\site-packages\joblib\parallel.py", line 1056, in __call__
    self.retrieve()
  File "C:\A_Rone\App_installation\Anaconda3\envs\causalnex\lib\site-packages\joblib\parallel.py", line 935, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "C:\A_Rone\App_installation\Anaconda3\envs\causalnex\lib\multiprocessing\pool.py", line 771, in get
    raise self._value
  File "C:\A_Rone\App_installation\Anaconda3\envs\causalnex\lib\multiprocessing\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "C:\A_Rone\App_installation\Anaconda3\envs\causalnex\lib\site-packages\joblib\_parallel_backends.py", line 595, in __call__
    return self.func(*args, **kwargs)
  File "C:\A_Rone\App_installation\Anaconda3\envs\causalnex\lib\site-packages\joblib\parallel.py", line 262, in __call__
    return [func(*args, **kwargs)
  File "C:\A_Rone\App_installation\Anaconda3\envs\causalnex\lib\site-packages\joblib\parallel.py", line 262, in <listcomp>
    return [func(*args, **kwargs)
  File "C:\A_Rone\App_installation\Anaconda3\envs\causalnex\lib\site-packages\pgmpy\estimators\BayesianEstimator.py", line 89, in _get_node_param
    cpd = self.estimate_cpd(
  File "C:\A_Rone\App_installation\Anaconda3\envs\causalnex\lib\site-packages\pgmpy\estimators\BayesianEstimator.py", line 177, in estimate_cpd
    pseudo_counts = np.ones(cpd_shape, dtype=int)
  File "C:\A_Rone\App_installation\Anaconda3\envs\causalnex\lib\site-packages\numpy\core\numeric.py", line 204, in ones
    a = empty(shape, dtype, order)
numpy.core._exceptions.MemoryError: Unable to allocate 83.3 GiB for an array with shape (570, 39251968) and data type int32

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • CausalNex version used (pip show causalnex): 0.11.0
  • Python version used (python -V): 3.8
  • Operating system and version: Windows 10

The dataset is attached below:
day.csv

Hi @RoneChen, and thanks for the question!

In the error body you can see that during data processing a matrix with ~40M columns is created; most likely your machine doesn't have the ~83 GiB of RAM needed to hold it. Unfortunately, I cannot reproduce model training in a Jupyter notebook either, due to the same data-size error, so I'd suggest working with a smaller dataset.
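
For reference, the failing allocation is easy to check: an int32 array with shape (570, 39251968) takes roughly 83 GiB.

    # Back-of-envelope check of the allocation from the traceback:
    # 570 * 39251968 int32 values at 4 bytes each.
    rows, cols = 570, 39_251_968
    gib = rows * cols * 4 / 2**30
    print(f"{gib:.1f} GiB")  # -> 83.3 GiB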

Hope it helps!

commented

Hi @EvgeniiaVeselova. Thanks for your reply. If I shrink the dataset any further, I am afraid that might hurt the model's performance.

When I train the model in a Jupyter notebook, there is no error. However, when I train it outside a notebook, for example in PyCharm, it throws the memory error.

@RoneChen you can analyze the data-preprocessing steps and find out where the dataset grows so drastically: your initial dataset contains 16 columns, while the error is thrown for a dataset with ~40M columns. In general, memory-management errors like this are not caused by CausalNex itself, as you can see in the error body.
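
To locate the offending variable, here is a rough sketch (reusing the `bn` and `train` names from your traceback) that estimates each CPD's shape before fit_cpds is called; the column count of a node's CPD is the product of its parents' cardinalities:

    import numpy as np

    # For each node: rows = node cardinality, columns = product of the
    # parents' cardinalities. One huge product pinpoints the culprit.
    for node in bn.structure.nodes:
        parents = list(bn.structure.predecessors(node))
        n_rows = train[node].nunique()
        n_cols = int(np.prod([train[p].nunique() for p in parents]))
        size_gib = n_rows * n_cols * 4 / 2**30  # int32 pseudo-counts
        print(f"{node}: CPD shape ({n_rows}, {n_cols}), ~{size_gib:.2f} GiB")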

My guess is that you can reduce the number of bins generated during data discretisation. This is only a guess; you can also attach your .ipynb and .py files for a more detailed look.
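
For example, a minimal sketch of coarser binning with CausalNex's Discretiser; "temp" is just an illustrative column name, so apply this to whichever numeric columns you discretise:

    from causalnex.discretiser import Discretiser

    # Fewer buckets per numeric column keeps parent cardinalities, and
    # hence the CPD tables, small. "temp" is an illustrative column name.
    disc = Discretiser(method="quantile", num_buckets=5)
    disc.fit(train["temp"].values)
    train["temp"] = disc.transform(train["temp"].values)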

commented

@EvgeniiaVeselova Thanks for your suggestions! Here is my code; please take a detailed look.
causalnex_online.zip