eltonlaw / impyute

Data imputations library to preprocess datasets with missing data

Home Page: http://impyute.readthedocs.io/

Enhance locf to (a) optionally allow entire row/column to be NaN (b) optionally not perform look forward

gkovaig opened this issue

(a) Real-world data can occasionally have all data for a specific row/column missing.
(b) In processing time series, we know only about the past, not the future.
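To make (a) and (b) concrete, here is a minimal standalone NumPy sketch of the behaviour being asked for. It is not impyute's code: `ffill_rows` is a hypothetical helper name, and it assumes rows are ordered in time and each column is one series.

```python
import numpy as np


def ffill_rows(data):
    """Carry the last observation forward down each column.

    Only earlier rows are used, so a NaN in the first row stays NaN (b),
    and a column that is entirely NaN is left untouched (a).
    """
    out = np.array(data, dtype=float)  # work on a copy
    for row in range(1, out.shape[0]):
        missing = np.isnan(out[row])
        # Copy from the previous row, which has already been filled
        out[row, missing] = out[row - 1, missing]
    return out


series = np.array([
    [np.nan, 1.0, np.nan],
    [2.0, np.nan, np.nan],
    [np.nan, 3.0, np.nan],
])
filled = ffill_rows(series)
```

In this sketch the leading NaN in the first column is kept, the gaps after an observation are filled, and the all-NaN third column does not raise.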

Example call:
impyute.imputation.ts.locf(p000008, axis=1, entire_set_nan_ok=True, no_look_forward=True)

Example code after modification - apologies, I've not done pull requests before :-)

import numpy as np
from impyute.ops import matrix
from impyute.ops import wrapper
from impyute.ops import error

@wrapper.wrappers
@wrapper.checks
def locf(data, axis=0, no_look_forward=False, entire_set_nan_ok=False):
    """Last Observation Carried Forward.

    For each missing value, use the value one row before (same column).
    If the missing value is in the first row, look one row ahead instead.
    If that row is also NaN, keep looking ahead until a row in this column
    is not NaN. All the rows before it are filled with this value.

    Parameters
    ----------
    data: numpy.ndarray
        Data to impute.
    axis: int (optional). Default=0
        0 if time series is in row format (Ex. data[0][:] is 1st data point).
        1 if time series is in col format (Ex. data[:][0] is 1st data point).
    no_look_forward: boolean (optional). Default=False
        False  if NaN in first row, try to impute by looking ahead in next rows.
        True   do not impute in first row, even if NaN is present there.
               Result may contain NaN in first row.
    entire_set_nan_ok: boolean (optional). Default=False
        Only checked when looking forward (no_look_forward=False).
        False  if entire column is NaN, raise exception.
        True   if entire column is NaN, ignore.
               Result may contain NaN in entire column.

    Returns
    -------
    numpy.ndarray
        Imputed data.

    """
    if axis == 0:
        data = np.transpose(data)
    elif axis == 1:
        pass
    else:
        raise error.BadInputError("Error: Axis value is invalid, please use either 0 (row format) or 1 (column format)")

    nan_xy = matrix.nan_indices(data)
    for x_i, y_i in nan_xy:
        # Simplest scenario, look one row back
        if x_i-1 > -1:
            data[x_i][y_i] = data[x_i-1][y_i]

        # First row is NaN: look n rows forward, unless no_look_forward=True.
        # no_look_forward is meant for situations where the index is time,
        # so we would not know what will happen in the future.
        elif not no_look_forward:
            x_residuals = np.shape(data)[0]-x_i-1  # n datapoints left
            val_found = False
            # +1 so that the last row is also checked
            for i in range(1, x_residuals+1):
                if not np.isnan(data[x_i+i][y_i]):
                    val_found = True
                    break
            if val_found:
                # pylint: disable=undefined-loop-variable
                for x_nan in range(i):
                    data[x_i+x_nan][y_i] = data[x_i+i][y_i]
            elif not entire_set_nan_ok:
                raise Exception("Error: Entire Column is NaN")
    return data

Never mind - I found that this functionality is already available in pandas: DataFrame.fillna(method='ffill').
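For reference, a short sketch of the pandas route, using made-up sample data. It uses `DataFrame.ffill()`, which performs the same forward fill as `fillna(method='ffill')`.

```python
import numpy as np
import pandas as pd

# Made-up sample: one row per time step, one column per series
df = pd.DataFrame({
    "a": [np.nan, 2.0, np.nan],         # leading NaN, then a gap
    "b": [1.0, np.nan, 3.0],            # gap in the middle
    "c": [np.nan, np.nan, np.nan],      # entirely missing
})

# Forward fill down the rows; pass axis=1 to fill across columns instead
filled = df.ffill()
```

Here the leading NaN in `a` stays NaN because nothing precedes it, and the all-NaN column `c` is returned unchanged without an error, which covers both (a) and (b).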