編寫我們的第一個分類器--機器學習食譜#5。 (Writing Our First Classifier - Machine Learning Recipes #5)

字幕列表影片播放

[MUSIC PLAYING]
Hey, everyone.
Welcome back.
In this episode, we're going to do something special,
and that's write our own classifier from scratch.
If you're new to machine learning,
this is a big milestone.
Because if you can follow along and do this on your own,
it means you understand an important piece of the puzzle.
The classifier we're going to write today
is a scrappy version of k-Nearest Neighbors.
That's one of the simplest classifiers around.
First, here's a quick outline of what we'll do in this episode.
We'll start with our code from Episode 4, Let's
Write a Pipeline.
Recall in that episode we did a simple experiment.
We imported a data set and split it into train and test.
We used train to train a classifier,
and test to see how accurate it was.
Writing the classifier is the part
we're going to focus on today.
Previously we imported the classifier
from a library using these two lines.
Here we'll comment them out and write our own.
The rest of the pipeline will stay exactly the same.
I'll pop in and out of the screencast to explain things
as we go along.
To start, let's run our pipeline to remind ourselves
what the accuracy was.
As you can see, it's over 90%.
And that's the goal for the classifier
we'll write ourselves.
Now let's comment out that import.
Right off the bat, this breaks our code.
So the first thing we need to do is fix our pipeline.
And to do that, we'll implement a class for our classifier.
I'll call it ScrappyKNN.
And by scrappy, I mean bare bones.
Just enough to get it working.
Next, I'll change our pipeline to use it.
Now let's see what methods we need to implement.
Looking at the interface for a classifier,
we see there are two we care about-- fit,
which does the training, and predict,
which does the prediction.
First we'll declare our fit method.
Remember this takes the features and labels for the training set
as input, so we'll add parameters for those.
Now let's move on to our predict method.
As input, this receives the features for our testing data.
And as output, it returns predictions for the labels.
Our first goal is to get the pipeline working,
and to understand what these methods do.
So before we write our real classifier,
we'll start with something simpler.
We'll write a random classifier.
And by random, I mean we'll just guess the label.
To start, we'll add some code to the fit and predict methods.
In fit, I'll store the training data in this class.
You can think of this as just memorizing it.
And you'll see why we do that later on.
Inside the predict method, remember
that we'll need to return a list of predictions.
That's because the parameter, X_test, is actually
a 2D array, or list of lists.
Each row contains the features for one testing example.
To make a prediction for each row,
I'll just randomly pick a label from the training data
and append that to our predictions.
At this point, our pipeline is working again.
So let's run it and see how well it does.
Recall there are three different types of flowers
in the iris dataset, so accuracy should be about 33%.
Now we know the interface for a classifier.
But when we started this exercise,
our accuracy was above 90%.
So let's see if we can do better.
To do that, we'll implement our classifier,
which is based on k-Nearest Neighbors.
Here's the intuition for how that algorithm works.
Let's return to our drawings of green dots and red dots
from the last episode.
Imagine the dots we see on the screen
are the training data we memorized in the fit method,
say for a toy dataset.
Now imagine we're asked to make a prediction for this testing
point that I'll draw here in gray.
How can we do that?
Well in a nearest neighbor classifier,
it works exactly like it sounds.
We'll find the training point that's
closest to the testing point.
This point is the nearest neighbor.
Then we'll predict that the testing
point has the same label.
For example, we'll guess that this testing dot is green,
because that's the color of its nearest neighbor.
As another example, if we had a testing dot over here,
we'd guess that it's red.
Now what about this one right in the middle?
Imagine that this dot is equidistant to the nearest
green dot and the nearest red one.
There's a tie, so how could we classify it?
Well one way is we could randomly break the tie.
But there's another way, and that's where k comes in.
K is the number of neighbors we consider
when making our prediction.
If k was 1, we'd just look at the closest training point.
But if k was 3, we'd look at the three closest.
In this case, two of those are green and one is red.
To predict, we could vote and predict the majority class.
Now there's more detail to this algorithm,
but that's enough to get us started.
To code this up, first we'll need a way
to find the nearest neighbor.
And to do that, we'll measure the straight line
distance between two points, just like you do with a ruler.
There's a formula for that called the Euclidean Distance,
and here's what the formula looks like.
It measures the distance between two points,
and it works a bit like the Pythagorean Theorem.
A squared plus B squared equals C squared.
You can think of this term as A, or the difference
between the first two features.
Likewise, you can think of this term as B,
or the difference between the second pair of features.
And the distance we compute is the length of the hypotenuse.
Now here's something cool.
Right now we're computing distance
in two-dimensional space, because we have just two
features in our toy dataset.
But what if we had three features or three dimensions?
Well then we'd be in a cube.
We can still visualize how to measure distance
in the space with a ruler.
But what if we had four features or four dimensions,
like we do in iris?
Well, now we're in a hypercube, and we
can't visualize this very easy.
The good news is the Euclidean Distance
works the same way regardless of the number of dimensions.
With more features, we can just add more terms to the equation.
You can find more details about this online.
Now let's code up Euclidean distance.
There are plenty of ways to do that,
but we'll use a library called scipy that's
already installed by Anaconda.
Here, A and B are lists of numeric features.
Say A is a point from our training data,
and B is a point from our testing data.
This function returns the distance between them.
That's all the math we need, so now
let's take a look at the algorithm for a classifier.
To make a prediction for a test point,
we'll calculate the distance to all the training points.
Then we'll predict the testing point has the same label
as the closest one.
I'll delete the random prediction we made,
and replace it with a method that finds the closest training
point to the test point.
For this video hard, I'll hard-code k to 1,
so we're writing a nearest neighbor classifier.
The k variable won't appear in our code,
since we'll always just find the closest point.
Inside this method, we'll loop over all the training points
and keep track of the closest one so far.
Remember that we memorized the training data in our fit
function, and X_train contains the features.
To start, I'll calculate the distance from the test point
to the first training point.
I'll use this variable to keep track of the shortest
distance we've found so far.
And I'll use this variable to keep
track of the index of the training point that's closest.
We'll need this later to retrieve its label.
Now we'll iterate over all the other training points.
And every time we find a closer one,
we'll update our variables.
Finally, we'll use the index to return
the label for the closest training example.
At this point, we have a working nearest neighbor classifier,
so let's run it and see what the accuracy is.
As you can see, it's over 90%.
And we did it.
When you run this on your own, the accuracy
might be a bit different because of randomness in the train test
split.
Now if you can code this up and understand it,
that's a big accomplishment because it
means you can write a simple classifier from scratch.
Now, there are a number of pros and cons
to this algorithm, many of which you can find online.
The basic pro is that it's relatively easy to understand,
and works reasonably well for some problems.
And the basic cons are that it's slow,
because it has to iterate over every training point
to make a prediction.
And importantly, as we saw in Episode 3,
some features are more informative than others.
But there's not an easy way to represent
that in k-Nearest Neighbors.
In the long run, we want a classifier
that learns more complex relationships between features
and the label we're trying to predict.
A decision tree is a good example of that.
And a neural network like we saw in TensorFlow Playground
is even better.
OK, hope that was helpful.
Thanks as always for watching.
You can follow me on Twitter for updates and, of course,
Google Developers.
And I'll see you guys next time.
[MUSIC PLAYING]