Only works in Jupyter notebook?
RoneChen opened this issue · comments
Description
Hi, I want to use your package in one of my projects. However, with the same dataset, when I run it in a Jupyter notebook, the BayesianNetwork can be trained successfully, while running it in PyCharm throws a MemoryError.
Context
It means I cannot use your package in my project.
Steps to Reproduce
Just use the same dataset, and copy the code to PyCharm and run it.
Expected Result
I expect it to train a Bayesian Network.
Actual Result
numpy.core._exceptions.MemoryError: Unable to allocate 83.3 GiB for an array with shape (570, 39251968)...
Traceback (most recent call last):
File "D:/internship/causalnex_online/causalnex_model.py", line 148, in <module>
bn = bn.fit_cpds(train, method='BayesianEstimator', bayes_prior='K2')
File "C:\A_Rone\App_installation\Anaconda3\envs\causalnex\lib\site-packages\causalnex\network\network.py", line 421, in fit_cpds
self._model.fit(
File "C:\A_Rone\App_installation\Anaconda3\envs\causalnex\lib\site-packages\pgmpy\models\BayesianNetwork.py", line 586, in fit
cpds_list = _estimator.get_parameters(n_jobs=n_jobs, **kwargs)
File "C:\A_Rone\App_installation\Anaconda3\envs\causalnex\lib\site-packages\pgmpy\estimators\BayesianEstimator.py", line 97, in get_parameters
parameters = Parallel(n_jobs=n_jobs, prefer="threads")(
File "C:\A_Rone\App_installation\Anaconda3\envs\causalnex\lib\site-packages\joblib\parallel.py", line 1056, in __call__
self.retrieve()
File "C:\A_Rone\App_installation\Anaconda3\envs\causalnex\lib\site-packages\joblib\parallel.py", line 935, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "C:\A_Rone\App_installation\Anaconda3\envs\causalnex\lib\multiprocessing\pool.py", line 771, in get
raise self._value
File "C:\A_Rone\App_installation\Anaconda3\envs\causalnex\lib\multiprocessing\pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "C:\A_Rone\App_installation\Anaconda3\envs\causalnex\lib\site-packages\joblib\_parallel_backends.py", line 595, in __call__
return self.func(*args, **kwargs)
File "C:\A_Rone\App_installation\Anaconda3\envs\causalnex\lib\site-packages\joblib\parallel.py", line 262, in __call__
return [func(*args, **kwargs)
File "C:\A_Rone\App_installation\Anaconda3\envs\causalnex\lib\site-packages\joblib\parallel.py", line 262, in <listcomp>
return [func(*args, **kwargs)
File "C:\A_Rone\App_installation\Anaconda3\envs\causalnex\lib\site-packages\pgmpy\estimators\BayesianEstimator.py", line 89, in _get_node_param
cpd = self.estimate_cpd(
File "C:\A_Rone\App_installation\Anaconda3\envs\causalnex\lib\site-packages\pgmpy\estimators\BayesianEstimator.py", line 177, in estimate_cpd
pseudo_counts = np.ones(cpd_shape, dtype=int)
File "C:\A_Rone\App_installation\Anaconda3\envs\causalnex\lib\site-packages\numpy\core\numeric.py", line 204, in ones
a = empty(shape, dtype, order)
numpy.core._exceptions.MemoryError: Unable to allocate 83.3 GiB for an array with shape (570, 39251968) and data type int32
Your Environment
Include as many relevant details about the environment in which you experienced the bug:
- CausalNex version used (`pip show causalnex`): 0.11.0
- Python version used (`python -V`): 3.8
- Operating system and version: Windows 10
The dataset is attached below:
day.csv
Hi @RoneChen, and thanks for the question!
In the error body you can see that during data processing a matrix with ~40M columns is created; most likely your machine doesn't have the 80+ GB of RAM needed to store it. Unfortunately, I cannot reproduce model training in a Jupyter notebook due to the same data-size error, so I'd suggest working with a smaller dataset.
Hope it helps!
Hi @EvgeniiaVeselova. Thanks for your reply. If I shrink the dataset further, I'm afraid it might affect the model's performance.
When I train this model in a Jupyter notebook, there is no error. However, when I train the model outside a Jupyter notebook, for example in PyCharm, it throws the memory error.
@RoneChen, you can analyze the data preprocessing steps to find out where the dataset grows so drastically: your initial dataset contains 16 columns, while the error is thrown for a matrix with ~40M columns. In general, memory-management errors are not caused by CausalNex itself, as you can see in the error body.
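For context, an allocation of that size is consistent with how conditional probability tables scale: a node's CPD has one column per combination of parent states, so its size grows multiplicatively with each parent's cardinality. A minimal sketch (the cardinalities below are hypothetical, chosen only to illustrate the order of magnitude, not taken from the actual dataset):

```python
# A CPD table has shape (node_cardinality, product of parent cardinalities).
# With int32 entries, each cell costs 4 bytes.
def cpd_size_bytes(node_card, parent_cards, itemsize=4):
    n_cols = 1
    for c in parent_cards:
        n_cols *= c  # one column per joint parent-state combination
    return node_card * n_cols * itemsize

# Hypothetical: a 570-state node with a few high-cardinality discrete
# parents already requires on the order of 80 GiB for a single CPD.
print(cpd_size_bytes(570, [570, 570, 121]) / 2**30)  # tens of GiB
```

This is why a handful of finely discretized columns can blow up memory even when the raw dataset is small.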
My guess is that you can reduce the number of bins generated during data discretization. This is only a guess; you could also attach your .ipynb and .py files for a more detailed look.
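The bin-reduction idea above can be tried with plain pandas before fitting CPDs; the data and column names below are made up purely for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical continuous data; column names are illustrative, not from day.csv.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "temp": rng.normal(20.0, 5.0, 1000),
    "windspeed": rng.uniform(0.0, 40.0, 1000),
})

# Bucket each continuous column into a small, fixed number of bins before
# learning: fewer states per node keeps the CPD tables small.
n_bins = 5
discrete = df.apply(lambda col: pd.cut(col, bins=n_bins, labels=False))

print(discrete.nunique())  # each column now has at most n_bins distinct states
```

CausalNex also ships its own discretisation helpers, but the pandas sketch keeps the example dependency-light and shows the key lever: the number of states per column.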
@EvgeniiaVeselova Thanks for your suggestions! Here is my code; please take a detailed look.
causalnex_online.zip