eltonlaw / impyute

Data imputations library to preprocess datasets with missing data

Home Page: http://impyute.readthedocs.io/

Different results on using function fast_knn and the function's content

aadarshsingh191198 opened this issue

I was trying to understand how the function fast_knn works, so I executed its body line by line. Here is the code I ran:

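import numpy as np
from scipy.spatial import KDTree
# `mean` is impyute's mean imputation; the import path below is assumed
from impyute.imputation.cs import mean

def shepards(distances, power=2):
    # Inverse distance weighting: closer neighbours get larger weights
    return to_percentage(1/np.power(distances, power))

def to_percentage(vec):
    # Normalise so the weights sum to 1
    return vec/np.sum(vec)

# `np.float` was removed in NumPy 1.24; the builtin `float` is equivalent
data_temp = np.arange(25).reshape((5, 5)).astype(float)
data_temp[0][2] = np.nan
k=4
eps=0
p=2
distance_upper_bound=np.inf
leafsize=10
idw_fn=shepards
init_impute_fn=mean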

nan_xy = np.argwhere(np.isnan(data_temp))
data_temp_c = init_impute_fn(data_temp)
kdtree = KDTree(data_temp_c, leafsize=leafsize)
for x_i, y_i in nan_xy:
    distances, indices = kdtree.query(data_temp_c[x_i], k=k+1, eps=eps,
                                      p=p, distance_upper_bound=distance_upper_bound)
    # Will always return itself in the first index. Delete it.
    distances, indices = distances[1:], indices[1:]
    # Add small constant to distances to avoid division by 0
    distances += 1e-3
    weights = idw_fn(distances)
    # Assign missing value the weighted average of `k` nearest neighbours
    data_temp[x_i][y_i] = np.dot(weights, [data_temp_c[ind][y_i] for ind in indices])
data_temp

This outputs:

array([[ 0.        ,  1.        , 10.06569379,  3.        ,  4.        ],
       [ 5.        ,  6.        ,  7.        ,  8.        ,  9.        ],
       [10.        , 11.        , 12.        , 13.        , 14.        ],
       [15.        , 16.        , 17.        , 18.        , 19.        ],
       [20.        , 21.        , 22.        , 23.        , 24.        ]])

However, calling the function directly gives a different result. The code:

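import numpy as np
from impyute.imputation.cs import fast_knn

# `np.float` was removed in NumPy 1.24; the builtin `float` is equivalent
data_temp = np.arange(25).reshape((5, 5)).astype(float)
data_temp[0][2] = np.nan
fast_knn(data_temp, k=4)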

and its output:

array([[ 0.        ,  1.        , 16.78451885,  3.        ,  4.        ],
       [ 5.        ,  6.        ,  7.        ,  8.        ,  9.        ],
       [10.        , 11.        , 12.        , 13.        , 14.        ],
       [15.        , 16.        , 17.        , 18.        , 19.        ],
       [20.        , 21.        , 22.        , 23.        , 24.        ]])

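The master branch (which is where I copied the function body from) uses shepards for the weights, whereas v0.0.8 (the version installed by pip install) uses weights = distances/np.sum(distances). The latter makes each weight proportional to the distance, so farther neighbours count more rather than less.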
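
To check that the weighting alone accounts for the gap, here is a minimal sketch that applies both schemes to the same neighbours using only NumPy. It is not the library's code. The helper name `proportional` and the variables `idw_estimate` and `proportional_estimate` are made up for illustration. It also assumes that, with k=4 and only five rows, every other row is a neighbour, so the KDTree lookup can be replaced by a direct Euclidean distance (the p=2 case).

```python
import numpy as np

def to_percentage(vec):
    # Normalise so the weights sum to 1
    return vec / np.sum(vec)

def shepards(distances, power=2):
    # Inverse distance weighting, as in the snippet from master
    return to_percentage(1 / np.power(distances, power))

def proportional(distances):
    # Hypothetical name for the formula quoted from v0.0.8
    return distances / np.sum(distances)

data = np.arange(25).reshape((5, 5)).astype(float)
# Initial fill of the missing cell with the mean of the observed column values
data[0, 2] = data[1:, 2].mean()

# Euclidean distance from row 0 to every other row, plus the small
# constant used in the snippet above to avoid division by zero
distances = np.linalg.norm(data[1:] - data[0], axis=1) + 1e-3
neighbour_values = data[1:, 2]

idw_estimate = np.dot(shepards(distances), neighbour_values)
proportional_estimate = np.dot(proportional(distances), neighbour_values)

print(f"shepards:     {idw_estimate:.2f}")
print(f"proportional: {proportional_estimate:.2f}")
```

The two estimates come out near 10.07 and 16.78, in line with the two outputs above, which suggests the weighting formula is the source of the difference.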