Errors in NearestNeighborLearner
GoogleCodeExporter opened this issue
Google Code Exporter commented
What steps will reproduce the problem?
Tried to use the learning.NearestNeighborLearner on the Sex Classification
dataset from this Wikipedia article on Naive Bayes classifiers:
http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Sex_Classification
What is the expected output? What do you see instead?
The program would not run because of bugs in the implementation of NearestNeighborLearner.
What version of the product are you using?
Bug exists in r30
Please provide any additional information below.
Here's my sample code:
import learning

examples = [[6, 180, 12, 'male'], [5.92, 190, 11, 'male'], [5.58, 170, 12, 'male'],
            [5, 100, 6, 'female'], [5.5, 150, 8, 'female'], [5.42, 130, 7, 'female'],
            [5.75, 150, 9, 'female']]
ds = learning.DataSet(examples)
nnl = learning.NearestNeighborLearner(2)
nnl.train(ds)
print(nnl.predict([5.1, 105, 6.3]))
And I would expect it to print 'female'.
I believe the following fixes should work:
old learning.py, lines 217 - 231
        else:
            ## Maintain a sorted list of (distance, example) pairs.
            ## For very large k, a PriorityQueue would be better
            best = []
            for e in examples:
                d = self.distance(e, example)
                if len(best) < k:
                    e.append((d, e))
                elif d < best[-1][0]:
                    best[-1] = (d, e)
                best.sort()
            return mode([e[self.dataset.target] for (d, e) in best])

    def distance(self, e1, e2):
        return mean_boolean_error(e1, e2)
new learning.py:
        else:
            ## Maintain a sorted list of (distance, example) pairs.
            ## For very large k, a PriorityQueue would be better
            best = []
            for e in self.dataset.examples:
                d = self.distance(e, example)
                if len(best) < self.k:
                    best.append((d, e))
                elif d < best[-1][0]:
                    best[-1] = (d, e)
                best.sort()
            return mode([e[self.dataset.target] for (d, e) in best])

    def distance(self, e1, e2):
        return mean_error(e1, e2)
Specifically:
1) changed 'examples' to self.dataset.examples, and 'k' to self.k.
2) changed e.append((d,e)) to best.append((d, e))
3) and I could be wrong, but I believe you wanted mean_error, not
mean_boolean_error in your distance function.
For the gender classification example, it seems to work great. Thanks!
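For anyone who wants to check the expected result without the library, here is a minimal standalone sketch of the k-nearest-neighbor vote described above. It is not the learning.py implementation: the names knn_predict and the local mean_error are hypothetical stand-ins, and heapq.nsmallest is swapped in for the hand-maintained sorted list.

```python
import heapq
from collections import Counter


def mean_error(e1, e2):
    """Mean absolute difference over paired attributes.

    Assumed definition for illustration. zip() stops at the shorter
    input, so the trailing class label of an example is ignored when
    the query has only the three numeric attributes.
    """
    diffs = [abs(a - b) for a, b in zip(e1, e2)]
    return sum(diffs) / float(len(diffs))


def knn_predict(examples, query, k, target=-1):
    """Return the most common target value among the k nearest examples."""
    # nsmallest with a key function never compares the examples themselves.
    nearest = heapq.nsmallest(k, examples, key=lambda e: mean_error(e, query))
    votes = Counter(e[target] for e in nearest)
    return votes.most_common(1)[0][0]


examples = [[6, 180, 12, 'male'], [5.92, 190, 11, 'male'], [5.58, 170, 12, 'male'],
            [5, 100, 6, 'female'], [5.5, 150, 8, 'female'], [5.42, 130, 7, 'female'],
            [5.75, 150, 9, 'female']]

prediction = knn_predict(examples, [5.1, 105, 6.3], k=2)
print(prediction)  # female
```

With k=2 the two closest examples are [5, 100, 6, 'female'] and [5.42, 130, 7, 'female'], so the vote is unanimous.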
Original issue reported on code.google.com by tblana...@gmail.com
on 19 Oct 2010 at 5:21
Google Code Exporter commented
Whether to use mean_error or mean_boolean_error depends on the dataset, I
believe, and the code needs some larger change to be able to do either
generically. For now, mean_boolean_error at least never blows up. The rest of
this looks right -- thanks!
Original comment by wit...@gmail.com
on 15 Sep 2011 at 2:40
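The trade-off between the two distance functions can be illustrated with a small sketch. The definitions below are assumptions written for this example, not copied from the library, and reading "blows up" as a TypeError on non-numeric attributes is one plausible interpretation rather than something the comment states.

```python
def mean_boolean_error(e1, e2):
    """Fraction of paired attributes that differ (assumed definition)."""
    pairs = list(zip(e1, e2))
    return sum(a != b for a, b in pairs) / float(len(pairs))


def mean_error(e1, e2):
    """Mean absolute difference of paired attributes (assumed definition)."""
    pairs = list(zip(e1, e2))
    return sum(abs(a - b) for a, b in pairs) / float(len(pairs))


query = [5.1, 105, 6.3]
examples = [[6, 180, 12, 'male'], [5, 100, 6, 'female']]

# On continuous attributes almost nothing matches exactly, so the boolean
# distance is 1.0 for every example and cannot rank neighbors.
boolean_distances = [mean_boolean_error(e, query) for e in examples]

# The numeric distance separates the two examples clearly.
numeric_distances = [mean_error(e, query) for e in examples]

# On categorical attributes the numeric distance fails outright,
# because strings cannot be subtracted.
try:
    mean_error(['male'], ['female'])
    raised = False
except TypeError:
    raised = True
```

So mean_boolean_error is safe on any attribute type but uninformative on continuous data, while mean_error ranks numeric examples well but needs all compared attributes to be numbers.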
Google Code Exporter commented
This issue was closed by revision r71.
Original comment by wit...@gmail.com
on 15 Sep 2011 at 2:41
- Changed state: Fixed