eltonlaw / impyute

Data imputations library to preprocess datasets with missing data

Home Page: http://impyute.readthedocs.io/

fast_knn: the nearest neighbor gets the lowest weight

MinjieSh opened this issue

Hi Elton,

Thank you for implementing this library, it's so convenient!
I found your library from the link below.
https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779

When I was using fast_knn, I found that nearer neighbors get lower weights in the weighted average of the k nearest neighbors.

In the example you provided,

fast_knn(data, k=2) # Weighted average of nearest 2 neighbours
array([[ 0. ,  1. , 10.08608891,  3. ,  4. ],
       [ 5. ,  6. ,  7. ,  8. ,  9. ],
       [10. , 11. , 12. , 13. , 14. ],
       [15. , 16. , 17. , 18. , 19. ],
       [20. , 21. , 22. , 23. , 24. ]])

In this example, 10.086 is the value imputed by the kNN algorithm.
The 2 nearest neighbors are found using Euclidean distance. Treating the first row as a "point", its nearest neighbor is the second row and its second nearest neighbor is the third row.
The distance from the first point to the second point (the nearest neighbor) is 12.5, and the distance from the first point to the third point (the second nearest neighbor) is 20.156.
So this is where 10.086 comes from:
10.086 = 7 * 12.5/(12.5 + 20.156) + 12 * 20.156/(12.5 + 20.156)
Each neighbor's weight is its distance divided by the total distance, so the nearer the point, the smaller its distance and the lower its weight. It should be the opposite.

In a nutshell, I believe the nearest neighbor should have the highest weight. In this example, the imputed value should be closer to 7 than to 12 (for reference, the plain average of 7 and 12 is 9.5).
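
To make the arithmetic above easy to check, here is a minimal NumPy sketch that recomputes the two distances and the imputed value. It does not call impyute. It assumes the missing cell in the first row is filled with the column mean (7 + 12 + 17 + 22) / 4 = 14.5 before distances are taken, which is an assumption on my part, chosen because it reproduces the 12.5 and 20.156 quoted above.

```python
import numpy as np

# First three rows of the example. The missing cell in row 0 is replaced by
# the column mean 14.5 (assumption: this pre-fill is what yields the
# distances 12.5 and 20.156 quoted in the post).
row0 = np.array([0.0, 1.0, 14.5, 3.0, 4.0])
row1 = np.array([5.0, 6.0, 7.0, 8.0, 9.0])
row2 = np.array([10.0, 11.0, 12.0, 13.0, 14.0])

# Euclidean distances from row 0 to its two nearest neighbors.
distances = np.array([
    np.linalg.norm(row0 - row1),
    np.linalg.norm(row0 - row2),
])

# Values of the two neighbors in the column being imputed.
neighbor_values = np.array([row1[2], row2[2]])  # 7 and 12

# Weights proportional to distance: the farther neighbor gets the larger
# weight, which is the behavior described in this issue.
weights = distances / distances.sum()
imputed = float(np.dot(weights, neighbor_values))

print(distances)
print(weights)
print(imputed)
```

Running this gives distances of 12.5 and about 20.156 and an imputed value of about 10.086, with the farther neighbor carrying roughly 62% of the weight.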

Thanks.
Best,
Minjie

Hi Minjie, sorry for the late reply, I've been really busy with other things. Thanks for catching that and for the long write-up. I've taken a look and I think I see the problem.

There was an issue at https://github.com/eltonlaw/impyute/blob/master/impyute/imputation/cs/fast_knn.py#L113 where the weights were being calculated directly as ratios of the total distance, which is why the rows further away had higher associated weights...so that'll need to be inverted. I'll probably define another kwarg and a few general inverse distance weighting helper functions so that we can dynamically modify the behaviour.

This wiki page talks about Shepard's method and its modifications, so that'll probably be a starting point.
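
As a rough illustration of what such a helper could look like, here is a sketch of basic Shepard-style inverse distance weighting applied to the numbers from this issue. The function name `shepard_weights` and its `power` and `eps` parameters are hypothetical and not part of impyute. It only shows the idea of weighting each neighbour by 1 / distance**power and normalising.

```python
import numpy as np


def shepard_weights(distances, power=1.0, eps=1e-12):
    """Hypothetical helper: normalised inverse-distance (Shepard) weights.

    Each weight is proportional to 1 / distance**power, so nearer
    neighbours get larger weights. Distances are clamped to `eps` to
    avoid dividing by zero when a neighbour coincides with the point.
    """
    d = np.asarray(distances, dtype=float)
    inverse = 1.0 / np.maximum(d, eps) ** power
    return inverse / inverse.sum()


# Distances and neighbour values taken from the example in this issue.
distances = np.array([12.5, 20.156])
values = np.array([7.0, 12.0])

w = shepard_weights(distances, power=1.0)
imputed = float(np.dot(w, values))

# A larger power concentrates more weight on the nearest neighbour.
w_squared = shepard_weights(distances, power=2.0)

print(w, imputed)
print(w_squared)
```

With power=1 the nearest neighbour gets roughly 62% of the weight and the imputed value comes out at about 8.91, on the 7 side of the plain average 9.5. Raising power pushes the result further toward the nearest neighbour.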