bashtage / linearmodels

Additional linear models including instrumental variable and panel data models that are missing from statsmodels.

Home Page: https://bashtage.github.io/linearmodels/

Suggestion: Implement a `.remove_data` function for Results

jpweytjens opened this issue

Description

Fitted results from linearmodels can be pickled with pickle.dump. These pickled files contain the estimated parameters, alongside all the data used to estimate them. Keeping this data in the results is generally (always?) not desired, as it substantially increases the size of the pickled files, while the estimated parameters no longer need these potentially large datasets in order to be displayed or processed.
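
As a rough illustration (assuming res is a fitted PanelOLS result, as in the example below):

import pickle

# The pickled result carries the full dataset, so its size scales with the
# size of the panel rather than with the handful of estimated parameters
size_in_bytes = len(pickle.dumps(res))
print(f"pickled result: {size_in_bytes / 1e6:.1f} MB")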

Example

My use case is as follows, with a large panel dataset (N = 500,000, T = 123).

  • Create a list of all desired model specifications and comparisons
  • Estimate all the different models
  • Save different comparisons of these results with compare

In pseudocode

import pandas as pd
from linearmodels.panel import PanelOLS, compare

# One row per specification: the model formula and the criterium used to group
# the comparisons
specifications = pd.DataFrame({"formulas": formulas, "criterium": criteria})

# Estimate every specification and keep the fitted results
results = []
for formula in specifications["formulas"]:
    model = PanelOLS.from_formula(formula, data)
    res = model.fit()
    results.append(res)

specifications["results"] = results

# Build one comparison per criterium
for criterium in specifications["criterium"].unique():
    results = specifications.query("criterium == @criterium")["results"]
    comparison = compare(results)
    comparison.summary.as_latex()

As my dataset is very large, pickling the results or the specifications DataFrame takes up multiple gigabytes just to store a handful of estimated parameters. Ideally, I would be able to store/pickle the results without the data. That way, I can separate estimating the models from comparing them: for example, the estimations could run during the night and the process could be killed once they are done, with the comparisons built later from the saved results, as in the sketch below.
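
A minimal sketch of that two-stage workflow, assuming the results pickle to a manageable size once the data is removed; results is the list of fitted results built in the pseudocode above:

import pickle

from linearmodels.panel import compare

# Overnight session: estimate the models, then persist only the fitted results
with open("results.pkl", "wb") as f:
    pickle.dump(results, f)

# Later session: reload the results and build the comparison tables without
# touching the original panel data again
with open("results.pkl", "rb") as f:
    results = pickle.load(f)
print(compare(results).summary)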

Workaround

I created this hacky workaround to remove many attributes from the model and results objects that aren't required if you're only interested in storing the results. With this, I can reduce the size of the pickled objects from ~50 GB to around 250 MB.

import functools


# Replacement for the deferred covariance callable: ignore any arguments and
# simply return the covariance that was already computed.
def fake_cov(_deferred_cov, *args, **kwargs):
    return _deferred_cov


def shrink_mod_and_res(mod, res):
    """
    Remove any DataFrame and large objects that are unnecessarily stored in the model and results objects.
    """
    # Keep only the first row of each stored frame and drop the cached
    # panels/arrays that are only needed to (re-)estimate the model
    mod.dependent._frame = mod.dependent._frame.head(1)
    mod.dependent._original = None
    mod.dependent._panel = None
    mod.exog._frame = mod.exog._frame.head(1)
    mod.exog._original = None
    mod.exog._panel = None
    mod.weights._frame = mod.weights._frame.head(1)
    mod.weights._original = None
    mod.weights._panel = None
    mod._cov_estimators = None
    mod._x = None
    mod._y = None
    mod._w = None
    mod._not_null = None
    mod._original_index = None

    # Drop the cached residuals, fitted values, effects and index from the results
    res._resids = None
    res._wresids = None
    res._original_index = None
    res._effects = None
    res._index = None
    res._fitted = None
    res._idiosyncratic = None
    res._not_null = None

    # Evaluate the deferred covariance once and replace the callable with one
    # that returns the cached value
    _deferred_cov = res._deferred_cov()
    res._deferred_cov = functools.partial(fake_cov, _deferred_cov=_deferred_cov)

    return mod, res

model = PanelOLS(y, x)
res = model.fit()
model, res = shrink_mod_and_res(model, res)

It's not clear to me why the calculation of the covariance is deferred. I suppose that if you want to be able to change the covariance estimator after estimation, this hacky method would need to store all possible covariance estimates.

Suggestion

Implement a (cleaner) method to remove the large datasets contained in the results, similar to the remove_data() method and the remove_data flag of .save() on statsmodels' results, as sketched below.
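
For reference, a rough sketch of the statsmodels behaviour this would mirror; the toy OLS model is only an illustration, and the linearmodels lines at the end are hypothetical, not an existing API:

import numpy as np
import statsmodels.api as sm

# statsmodels: remove_data() strips the arrays that are only needed to
# re-estimate the model, so the saved/pickled result stays small
X = sm.add_constant(np.random.standard_normal((1000, 3)))
y = X @ np.array([1.0, 0.5, -0.2, 0.3]) + np.random.standard_normal(1000)
ols_res = sm.OLS(y, X).fit()
ols_res.remove_data()
ols_res.save("ols_res.pkl")  # or ols_res.save("ols_res.pkl", remove_data=True) in one step

# Hypothetical linearmodels equivalent (does not exist today):
# res = PanelOLS.from_formula(formula, data).fit()
# res.remove_data()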