The KNN Algorithm (№ 01)
The Classification Problem
In this post we'll introduce the k-nearest neighbors algorithm (KNN) and get a grasp of what is going on above. Further into the series we'll motivate some generalizations and eventually cover some modern innovations in the implementation of the algorithm.
To motivate the algorithm, let's take a simple example from car insurance. Consider the problem of determining whether someone is at high risk, medium risk, or low risk of getting in a car crash. Each of these labels (high, medium, and low risk) we will call a class. We'd want to know this so we can charge the appropriate premiums.
A naive way to predict which risk class someone belongs to is to take the "walks like a duck, talks like a duck, it's probably a duck" heuristic. If someone shares a lot of characteristics with people who are low risk, then they are probably low risk themselves.
As an example, consider an 18-year-old male named Alex. Alex is about to get car insurance for the first time. Based on the only two things we know about him, let's stereotype¹ him. Well, Alex doesn't have any driving history himself yet, so we can't use that. We only know his age and gender. So maybe we can use the driving history of other 18-year-old males—i.e. his nearest neighbors—to determine his premium. As it turns out, 18-year-old males crash with startling regularity. So Alex's premiums will be high. Sorry, Alex.
Now this might not be fair to Alex. Maybe Alex is a skilled and risk-averse driver (though still new to driving) and would be better placed in the medium risk class. We could gather more information, like Alex's high school GPA, and be a bit more accurate with our guess. But more information isn't always more helpful,² and often we'll be restricted in what data we can legally and ethically use.
Since we're making predictions about the future, we'll make a lot of mistakes. The percentage of misclassifications is called the error rate. Often our goal in this type of problem is simply to get the error rate as low as possible.
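To make the error rate concrete, here is a minimal sketch in Python; the risk labels below are made up for illustration:

```python
def error_rate(predicted, actual):
    """Fraction of predictions that don't match the true labels."""
    misses = sum(p != a for p, a in zip(predicted, actual))
    return misses / len(actual)

# One of the four hypothetical predictions is wrong, so the error rate is 0.25.
predicted = ["high", "low", "medium", "low"]
actual = ["high", "medium", "medium", "low"]
print(error_rate(predicted, actual))  # 0.25
```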
Understanding the Algorithm
To get a feel for this algorithm, doing the math can help, but in this case we can bypass the math by letting the computer do it for us.
The demo at the top of the page will let us select a position on the graph and get a predicted class based on the KNN algorithm. We can also change the number of neighbors, the k, considered in the algorithm.
When a point on the graph is selected, a circle expands from that point until there are exactly k neighbors within the radius. We then tally up how those neighbors are classified. If most of the neighbors are of Type A, then the algorithm guesses that this new point we picked is also Type A. And that's it—that's enough to get us started!
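The procedure above (find the k nearest points, tally their classes, predict the majority) can be sketched in a few lines of Python. The points, labels, and query below are hypothetical stand-ins for the demo's data:

```python
import math
from collections import Counter

def knn_predict(points, labels, query, k):
    # Sort the training points by their distance to the query point.
    dists = sorted(
        (math.dist(p, query), label) for p, label in zip(points, labels)
    )
    # Tally the classes of the k nearest neighbors.
    votes = Counter(label for _, label in dists[:k])
    # Predict the majority class among those neighbors.
    return votes.most_common(1)[0][0]

# Two hypothetical clusters: Type A near the origin, Type B near (5, 5).
points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(points, labels, (0.5, 0.5), k=3))  # A
```

Note that this sketch leaves ties to `Counter.most_common`, which we'll have more to say about below.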
Now the demo will only allow us to consider odd k, which would prevent ties if there were only two classes. But our demo has three classes, so we need to be careful. For k = 3, if the three nearest neighbors are each of a different class, the algorithm can tie: no good! Maybe you should try to produce a tie and see what happens. Riddle: Can you guess what determines the classification when there is a tie?³
When playing around with the demo, consider where the algorithm is doing a good job and where it will struggle. In which regions of the graph will the error rate be higher or lower?
To be continued...
1. This is a little tongue-in-cheek, but it's what we're doing, isn't it?↩
2. More information not helping us will be the topic of another blog post in this series.↩
3. Riddle Solution: We predict the class whose neighbors are closest overall, i.e. the tied class with the minimum sum of distances. This can still tie, but it is far less likely now.↩
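The tie-breaking rule described in the riddle solution can be sketched as follows; the neighbor distances below are hypothetical:

```python
from collections import defaultdict

def break_tie(neighbors):
    """Among the tied classes, pick the one whose neighbors
    have the smallest total distance to the query point.

    neighbors: list of (distance, label) pairs for the k nearest points.
    """
    totals = defaultdict(float)
    for dist, label in neighbors:
        totals[label] += dist
    return min(totals, key=totals.get)

# k = 3 with one neighbor of each class: a three-way tie on votes.
# Class C's neighbor is closest, so C wins the tie-break.
neighbors = [(1.0, "A"), (1.2, "B"), (0.8, "C")]
print(break_tie(neighbors))  # C
```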