Lecture 11: Introduction to Machine Learning

Description: In this lecture, Prof. Grimson introduces machine learning and shows examples of supervised learning using feature vectors.

Instructor: Eric Grimson

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

ERIC GRIMSON: OK. Welcome back. You know, it's that time of term when we're all kind of doing this. So let me see if I can get a few smiles by simply noting to you that two weeks from today is the last class. Should be worth at least a little bit of a smile, right? Professor Guttag is smiling. He likes that idea. You're almost there. What are we doing for the last couple of lectures? We're talking about linear regression.

And I just want to remind you, this was the idea of I have some experimental data. The case of a spring where I put different weights on and measure displacements. And regression was giving us a way of deducing a model to fit that data. And in some cases it was easy. We knew, for example, it was going to be a linear model. We found the best line that would fit that data.

In some cases, we said we could use validation to actually let us explore to find the best model that would fit it, whether a linear, a quadratic, a cubic, some higher order thing. So we'll be using that to deduce something about a model.

That's a nice segue into the topic for the next three lectures, the last big topic of the class, which is machine learning. And I'm going to argue, you can debate whether that's actually an example of learning. But it has many of the elements that we want to talk about when we talk about machine learning. So as always, there's a reading assignment. Chapter 22 of the book gives you a good start on this, and it will follow up with other pieces.

And I want to start by basically outlining what we're going to do. And I'm going to begin by saying, as I'm sure you're aware, this is a huge topic. I've listed just five subjects in Course 6 that all focus on machine learning. And that doesn't include other subjects where learning is a central part. So natural language processing, computational biology, computer vision, and robotics all rely heavily today on machine learning. And you'll see those in those subjects as well.

So we're not going to compress five subjects into three lectures. But what we are going to do is give you the introduction. We're going to start by talking about the basic concepts of machine learning. The idea of having examples, and how do you talk about features representing those examples, how do you measure distances between them, and use the notion of distance to try and group similar things together as a way of doing machine learning. And we're going to look, as a consequence, at two different standard ways of doing learning.

One we call classification methods; the example we're going to see there is something called "k nearest neighbors." The second class is called clustering methods. Classification works well when I have what we would call labeled data. I know labels on my examples, and I'm going to use that to try and define classes that I can learn. And clustering works well when I don't have labeled data. And we'll see what that means in a couple of minutes. But we're going to give you an early view of this.

Unless Professor Guttag changes his mind, we're probably not going to show you the current really sophisticated machine learning methods like convolutional neural nets or deep learning, things you'll read about in the news. But you're going to get a sense of what's behind those, by looking at what we do when we talk about learning algorithms.

Before I do it, I want to point out to you just how prevalent this is. And I'm going to admit, with my gray hair, I started working in AI in 1975, when machine learning was a pretty simple thing to do. And it's been fascinating to watch the change over 40 years. And if you think about it, just think about where you see it. AlphaGo, a machine learning based system from Google that beat a world-class level Go player. Chess has already been conquered by computers for a while. Go now belongs to computers. The best Go players in the world are computers.

I'm sure many of you use Netflix. Any recommendation system, Netflix, Amazon, pick your favorite, uses a machine learning algorithm to suggest things for you. And in fact, you've probably seen it on Google, right? The ads that pop up on Google are coming from a machine learning algorithm that's looking at your preferences. Scary thought. Drug discovery, character recognition-- the post office does character recognition of handwritten characters using a machine learning algorithm and a computer vision system behind it.

You probably don't know this company. It's actually an MIT spin-off called Two Sigma, it's a hedge fund in New York. They heavily use AI and machine learning techniques. And two years ago, their fund returned a 56% return. I wish I'd invested in the fund. I don't have the kinds of millions you need, but that's an impressive return. 56% return on your money in one year. Last year they didn't do quite as well, but they do extremely well using machine learning techniques.

Siri. Another great MIT company called Mobileye that does computer vision systems with a heavy machine learning component that is used in assistive driving and will be used in completely autonomous driving. It will do things like kick in your brakes if you're closing too fast on the car in front of you, which is going to be really bad for me because I drive like a Bostonian. And it would be kicking in constantly.

Face recognition. Facebook uses this, as do many other systems, to both detect and recognize faces. IBM Watson-- cancer diagnosis. These are all just examples of machine learning being used everywhere. And it really is. I've only picked nine. So what is it? I'm going to make an obnoxious statement. You're now used to that. I'm going to claim that you could argue that almost every computer program learns something. But the level of learning really varies a lot.

So if you think back to the first lecture in 6.0001, we showed you Newton's method for computing square roots. And you could argue, you'd have to stretch it, but you could argue that that method learns something about how to compute square roots. In fact, you could generalize it to roots of any order power. But it really didn't learn. I really had to program it. All right. Think about last week when we talked about linear regression. Now it starts to feel a little bit more like a learning algorithm. Because what did we do?

We gave you a set of data points, mass displacement data points. And then we showed you how the computer could essentially fit a curve to that data. And it was, in some sense, learning a model for that data that it could then use to predict behavior in other situations. And that's getting closer to what we would like when we think about a machine learning algorithm. We'd like to have a program that can learn from experience, something that it can then use to deduce new facts.

Now it's been a problem in AI for a very long time. And I love this quote. It's from a gentleman named Art Samuel. The quote is from 1959, in which he gives his definition of machine learning: the field of study that gives computers the ability to learn without being explicitly programmed. And I think many people would argue he wrote the first such program. It learned from experience. In his case, it played checkers.

Kind of shows you how the field has progressed. But we started with checkers, we got to chess, we now do Go. But it played checkers. It beat national-level players. Most importantly, it learned to improve its methods by watching how it did in games and then inferring something to change what it thought about as it played. Samuel did a bunch of other things. I just highlighted one. You may see in a follow-on course, he invented what's called alpha-beta pruning, which is a really useful technique for doing search.

But the idea is, how can we have the computer learn without being explicitly programmed? And one way to think about this is to think about the difference between how we would normally program and what we would like from a machine learning algorithm. Normal programming, I know you're not convinced there's such a thing as normal programming, but if you think of traditional programming, what's the process? I write a program that I input to the computer so that it can then take data and produce some appropriate output.

And the square root finder really sits there, right? I wrote code using Newton's method to find a square root, and that gave me the process of, given any number, I'll give you the square root. But if you think about what we did last time, it was a little different. And in fact, in a machine learning approach, the idea is that I'm going to give the computer output. I'm going to give it examples of what I want the program to do, labels on data, characterizations of different classes of things.

And what I want the computer to do is, given that characterization of output and data, I want that machine learning algorithm to actually produce for me a program, a program that I can then use to infer new information about things. And that creates, if you like, a really nice loop where I can have the machine learning algorithm learn the program, which I can then use to solve some other problem. That would be really great if we could do it.

And as I suggested, that curve-fitting algorithm is a simple version of that. It learned a model for the data, which I could then use to label any other instances of the data or predict what I would see in terms of spring displacement as I changed the masses. So that's the kind of idea we're going to explore. If we want to learn things, we could also ask, so how do you learn? And how should a computer learn?

Well, for you as a human, there are a couple of possibilities. This is the boring one. This is the old style way of doing it, right? Memorize facts. Memorize as many facts as you can and hope that we ask you on the final exam instances of those facts, as opposed to some other facts you haven't memorized. This is, if you think way back to the first lecture, an example of declarative knowledge, statements of truth. Memorize as many as you can. Have Wikipedia in your back pocket.

Better way to learn is to be able to infer, to deduce new information from old. And if you think about this, this gets closer to what we called imperative knowledge-- ways to deduce new things. Now, in the first case, we built that in when we wrote that program to do square roots. But what we'd like in a learning algorithm is to have much more of that generalization idea.

We're interested in extending our capabilities to write programs that can infer useful information from implicit patterns in the data. So not something explicitly built in, like that comparison of weights and displacements, but actually implicit patterns in the data, and have the algorithm figure out what those patterns are, and use those to generate a program you can use to infer new data about objects, about spring displacements, whatever it is you're trying to do.

OK. So the idea then, the basic paradigm that we're going to see, is we're going to give the system some training data, some observations. We did that last time with just the spring displacements. We're going to then try and have a way to figure out, how do we write code, how do we write a program, a system that will infer something about the process that generated the data? And then from that, we want to be able to use that to make predictions about things we haven't seen before.

So again, I want to drive home this point. If you think about it, the spring example fit that model. I gave you a set of data, spring displacements relative to masses. For different masses, how far did the spring move? I then inferred something about the underlying process. In the first case, I said I know it's linear, but let me figure out what the actual linear equation is. What's the spring constant associated with it? And based on that result, I got a piece of code I could use to predict new displacements.

So it's got all of those elements, training data, an inference engine, and then the ability to use that to make new predictions. But that's a very simple kind of learning setting. So the more common one is one I'm going to use as an example, which is, when I give you a set of examples, those examples have some data associated with them, some features and some labels. For each example, I might say this is a particular kind of thing. This other one is another kind of thing. And what I want to do is figure out how to do inference on labeling new things.

So it's not just, what's the displacement of the mass, it's actually a label. And I'm going to use one of my favorite examples. I'm a big New England Patriots fan, if you're not, my apologies. But I'm going to use football players. So I'm going to show you in a second, I'm going to give you a set of examples of football players. The label is the position they play. And the data, well, it could be lots of things. We're going to use height and weight.

But what we want to do is then see how would we come up with a way of characterizing the implicit pattern of how does weight and height predict the kind of position this player could play. And then come up with an algorithm that will predict the position of new players. We'll do the draft for next year. Where do we want them to play? That's the paradigm. Set of observations, potentially labeled, potentially not. Think about how do we do inference to find a model. And then how do we use that model to make predictions.

What we're going to see, and we're going to see multiple examples today, is that that learning can be done in one of two very broad ways. The first one is called supervised learning. And in that case, for every new example I give you as part of the training data, I have a label on it. I know the kind of thing it is. And what I'm going to do is look for how do I find a rule that would predict the label associated with unseen input based on those examples. It's supervised because I know what the labeling is.

Second kind, if this is supervised, the obvious other one is called unsupervised. In that case, I'm just going to give you a bunch of examples. But I don't know the labels associated with them. I'm going to just try and find what are the natural ways to group those examples together into different models. And in some cases, I may know how many models are there. In some cases, I may want to just say what's the best grouping I can find.

OK. What I'm going to do today is not a lot of code. I was expecting cheers for that, John, but I didn't get them. Not a lot of code. What I'm going to do is show you, basically, the intuitions behind doing this learning. And I'm going to start with my New England Patriots example. So here are some data points about current Patriots players. And I've got two kinds of positions. I've got receivers, and I have linemen. And each one is just labeled by the name, the height in inches, and the weight in pounds. OK? Five of each.

If I plot those on a two-dimensional plot, this is what I get. OK? No big deal. What am I trying to do? I'm trying to learn, are there characteristics that distinguish the two classes from one another? And in the unlabeled case, all I have are just a set of examples. So what I want to do is decide what makes two players similar, with the goal of seeing, can I separate this distribution into two or more natural groups.

Similar is a distance measure. It says, how do I take two examples with values or features associated with them, and decide how far apart they are? And in the unlabeled case, the simple way to do it is to say, if I know that there are at least k groups there-- in this case, I'm going to tell you there are two different groups there-- how could I decide how best to cluster things together so that all the examples in one group are close to each other, all the examples in the other group are close to each other, and they're reasonably far apart.

There are many ways to do it. I'm going to show you one. It's a very standard way, and it works, basically, as follows. If all I know is that there are two groups there, I'm going to start by just picking two examples as my exemplars. Pick them at random. Actually at random is not great. I don't want to pick too closely to each other. I'm going to try and pick them far apart. But I pick two examples as my exemplars. And for all the other examples in the training data, I say which one is it closest to.

What I'm going to try and do is create clusters with the property that the distances between all of the examples of that cluster are small. The average distance is small. And see if I can find clusters that get the average distance for both clusters as small as possible. This algorithm works by picking two examples, then clustering all the other examples by simply putting each one in the group whose exemplar it's closest to.

Once I've got those clusters, I'm going to find the median element of that group. Not mean, but median, what's the one closest to the center? And treat those as exemplars and repeat the process. And I'll just do it either some number of times or until I don't get any change in the process. So it's clustering based on distance. And we'll come back to distance in a second.
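
To make that loop concrete, here is a minimal sketch of the procedure in Python (the course's language). The function names, the plain (height, weight) tuples, and the sample numbers are illustrative assumptions, not code from the course.

```python
import math

def euclidean(v, w):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))

def cluster(examples, exemplars, num_iters=10):
    """Assign each example to its closest exemplar, then replace each
    exemplar with the member of its group closest to the group's center,
    and repeat until nothing changes (or we run out of iterations)."""
    groups = []
    for _ in range(num_iters):
        # Assignment step: put every example with the exemplar it is closest to.
        groups = [[] for _ in exemplars]
        for e in examples:
            dists = [euclidean(e, x) for x in exemplars]
            groups[dists.index(min(dists))].append(e)
        # Update step: pick the "median-like" member nearest the group's center.
        new_exemplars = []
        for g, old in zip(groups, exemplars):
            if not g:                      # keep the old exemplar if its group emptied
                new_exemplars.append(old)
                continue
            center = [sum(dim) / len(g) for dim in zip(*g)]
            new_exemplars.append(min(g, key=lambda e: euclidean(e, center)))
        if new_exemplars == exemplars:     # no change, so stop early
            break
        exemplars = new_exemplars
    return groups

# Hypothetical (height in inches, weight in pounds) points, two far-apart starting exemplars.
players = [(75, 305), (77, 315), (76, 320), (72, 190), (70, 185), (71, 200)]
print(cluster(players, [players[0], players[3]]))
```

Once the exemplars settle, a new point can be labeled by whichever cluster center it is closest to, which is exactly the equidistant-line classifier described below.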

So here's what I would have with my football players. If I just did this based on weight, there's the natural dividing line. And it kind of makes sense. All right? These three are obviously clustered, and again, it's just on this axis. They're all down here. These seven are at a different place. There's a natural dividing line there. If I were to do it based on height, it's not as clean. This is what my algorithm came up with as the best dividing line here, meaning that these four, again, just based on this axis, are close together. These six are close together. But it's not nearly as clean.

And that's part of the issue we'll look at: how do I find the best clusters? If I use both height and weight, I get that, which was actually kind of nice, right? Those three cluster together. They're near each other, in terms of just distance in the plane. Those seven are near each other. There's a nice, natural dividing line through here. And in fact, that gives me a classifier. This line is the equidistant line between the centers of those two clusters. Meaning, any point along this line is the same distance to the center of that group as it is to that group.

And so any new example, if it's above the line, I would say gets that label, if it's below the line, gets that label. In a second, we'll come back to look at how do we measure the distances, but the idea here is pretty simple. I want to find groupings near each other and far apart from the other group.

Now suppose I actually knew the labels on these players. These are the receivers. Those are the linemen. And for those of you who are football fans, you can figure it out, right? Those are the two tight ends. They are much bigger. I think that's Bennett and that's Gronk if you're really a big Patriots fan. But those are tight ends, those are wide receivers, and it's going to come back in a second, but there are the labels.

Now what I want to do is say, if I could take advantage of knowing the labels, how would I divide these groups up? And that's kind of easy to see. Basic idea, in this case, is if I've got labeled groups in that feature space, what I want to do is find a subsurface that naturally divides that space. Now subsurface is a fancy word. It says, in the two-dimensional case, I want to know what's the best line, if I can find a single line, that separates all the examples with one label from all the examples of the second label.

We'll see that, if the examples are well separated, this is easy to do, and it's great. But in some cases, it's going to be more complicated because some of the examples may be very close to one another. And that's going to raise a problem that you saw last lecture. I want to avoid overfitting. I don't want to create a really complicated surface to separate things. And so we may have to tolerate a few incorrectly labeled things, if we can't pull it out.

And as you already figured out, in this case, with the labeled data, there's the best fitting line right there. Anybody over 280 pounds is going to be a great lineman. Anybody under 280 pounds is more likely to be a receiver. OK. So I've got two different ways of trying to think about doing this labeling. I'm going to come back to both of them in a second.
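
Expressed as code, that labeled-data rule is essentially a one-liner. A minimal sketch, assuming the 280-pound cutoff just mentioned:

```python
def label_player(weight_lbs):
    """Single learned rule from the labeled data: heavier than 280 pounds
    looks like a lineman, otherwise a receiver."""
    return "lineman" if weight_lbs > 280 else "receiver"

print(label_player(305))  # lineman
print(label_player(210))  # receiver
```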

Now suppose I add in some new data. I want to label new instances. Now these are actually players of a different position. These are running backs. But I say, all I know about is receivers and linemen. I get these two new data points. I'd like to know, are they more likely to be a receiver or a linemen? And there's the data for these two gentlemen. So if I go back to now plotting them, oh you notice one of the issues. So there are my linemen, the red ones are my receivers, the two black dots are the two running backs.

And notice right here. It's going to be really hard to separate those two examples from one another. They are so close to each other. And that's going to be one of the things we have to trade off. But if I think about using what I learned as a classifier with unlabeled data, there were my two clusters. Now you see, oh, I've got an interesting example. This new example I would say is clearly more like a receiver than a lineman. But that one there, unclear. Almost exactly lies along that dividing line between those two clusters.

And I would either say, I want to rethink the clustering or I want to say, you know what? As I know, maybe there aren't two clusters here. Maybe there are three. And I want to classify them a little differently. So I'll come back to that. On the other hand, if I had used the labeled data, there was my dividing line. This is really easy. Both of those new examples are clearly below the dividing line. They are clearly examples that I would categorize as being more like receivers than they are like linemen.

And I know it's a football example. If you don't like football, pick another example. But you get the sense of why I can use the data in a labeled case and the unlabeled case to come up with different ways of building the clusters. So what we're going to do over the next 2 and 1/2 lectures is look at how can we write code to learn that way of separating things out?

We're going to learn models based on unlabeled data. That's the case where I don't know what the labels are, by simply trying to find ways to cluster things together nearby, and then use the clusters to assign labels to new data. And we're going to learn models by looking at labeled data and seeing how do we best come up with a way of separating with a line or a plane or a collection of lines, examples from one group, from examples of the other group.

With the acknowledgment that we want to avoid overfitting, we don't want to create a really complicated system. And as a consequence, we're going to have to make some trade-offs between what we call false positives and false negatives. But the resulting classifier can then label any new data by just deciding where you are with respect to that separating line.

So here's what you're going to see over the next 2 and 1/2 lectures. Every machine learning method has five essential components. We need to decide what's the training data, and how are we going to evaluate the success of that system. We've already seen some examples of that. We need to decide how are we going to represent each instance that we're giving it.

I happened to choose height and weight for football players. But I might have been better off to pick average speed or, I don't know, arm length, something else. How do I figure out what are the right features? And associated with that, how do I measure distances between those features? How do I decide what's close and what's not close? Maybe it should be different, in terms of weight versus height, for example. I need to make that decision.

And those are the two things we're going to show you examples of today, how to go through that. Starting next week, Professor Guttag is going to show you how you take those and actually start building more detailed versions of measuring clustering, measuring similarities to find an objective function that you want to minimize to decide what is the best cluster to use. And then what is the best optimization method you want to use to learn that model.

So let's start talking about features. I've got a set of examples, labeled or not. I need to decide what is it about those examples that's useful to use when I want to decide what's close to another thing or not. And one of the problems is, if it was really easy, it would be really easy. Features don't always capture what you want. I'm going to belabor that football analogy, but why did I pick height and weight? Because they were easy to find.

You know, if you work for the New England Patriots, what is the thing that you really look for when you're asking, what's the right feature? It's probably some other combination of things. So you, as a designer, have to say what are the features I want to use. That quote, by the way, is from one of the great statisticians of the 20th century, which I think captures it well.

So feature engineering, for you as a programmer, comes down to deciding both what are the features I want to measure in that vector that I'm going to put together, and how do I decide relative ways to weight it? So John, and Ana, and I could have made our job this term really easy if we had sat down at the beginning of the term and said, you know, we've taught this course many times. We've got data from, I don't know, John, thousands of students, probably, over this time. Let's just build a little learning algorithm that takes a set of data and predicts your final grade.

You don't have to come to class, don't have to go through all the problems, because we'll just predict your final grade. Wouldn't that be nice? Make our job a little easier, and you may or may not like that idea. But I could think about predicting that grade. Now why am I telling you this example? I was trying to see if I could get a few smiles. I saw a couple of them there.

But think about the features. What would I measure? Actually, I'll put this on John because it's his idea. What would he measure? Well, GPA is probably not a bad predictor of performance. You do well in other classes, you're likely to do well in this class.

I'm going to use this one very carefully. Prior programming experience is at least a predictor, but it is not a perfect predictor. Those of you who haven't programmed before, in this class, you can still do really well in this class. But it's an indication that you've seen other programming languages.

On the other hand, I don't believe in astrology. So I don't think the month in which you're born, the astrological sign under which you were born has probably anything to do with how well you'd program. I doubt that eye color has anything to do with how well you'd program. You get the idea. Some features matter, others don't.

Now I could just throw all the features in and hope that the machine learning algorithm sorts out those it wants to keep from those it doesn't. But I remind you of that idea of overfitting. If I do that, there is the danger that it will find some correlation between birth month, eye color, and GPA.

And that's going to lead to a conclusion that we really don't like. By the way, in case you're worried, I can assure you that Stu Schmill, the dean of admissions, does not use machine learning to pick you. He actually looks at a whole bunch of things because it's not easy to replace him with a machine-- yet.

All right. So what this says is we need to think about how do we pick the features. And mostly, what we're trying to do is to maximize something called the signal to noise ratio. Maximize those features that carry the most information, and remove the ones that don't. So I want to show you an example of how you might think about this. I want to label reptiles. I want to come up with a way of labeling animals as, are they a reptile or not.

And I give you a single example. With a single example, you can't really do much. But from this example, I know that a cobra, it lays eggs, it has scales, it's poisonous, it's cold blooded, it has no legs, and it's a reptile. So I could say my model of a reptile is well, I'm not certain. I don't have enough data yet.

But if I give you a second example, and it also happens to be egg-laying, have scales, poisonous, cold blooded, no legs. There is my model, right? Perfectly reasonable model, whether I design it or a machine learning algorithm would do it says, if all of these are true, label it as a reptile. OK?

And now I give you a boa constrictor. Ah. It's a reptile. But it doesn't fit the model. And in particular, it's not egg-laying, and it's not poisonous. So I've got to refine the model. Or the algorithm has got to refine the model. And this, I want to remind you, is looking at the features. So I started out with five features. This doesn't fit.

So probably what I should do is reduce it. I'm going to look at scales. I'm going to look at cold blooded. I'm going to look at legs. That captures all three examples. Again, if you think about this in terms of clustering, all three of them would fit with that.

OK. Now I give you another example-- chicken. I don't think it's a reptile. In fact, I'm pretty sure it's not a reptile. And it nicely still fits this model, right? Because, while it has scales, which you may or may not realize, it's not cold blooded, and it has legs. So it is a negative example that reinforces the model. Sounds good. And now I'll give you an alligator. It's a reptile. And oh fudge, right? It doesn't satisfy the model. Because while it does have scales and it is cold blooded, it has legs.

I'm almost done with the example. But you see the point. Again, I've got to think about how do I refine this. And I could do that by saying, all right, let's make it a little more complicated-- has scales, cold blooded, zero or four legs-- I'm going to say it's a reptile. I'll give you the dart frog. Not a reptile, it's an amphibian. And that's nice because it still satisfies this. So it's an example outside of the cluster: no scales, not cold blooded, but happens to have four legs. It's not a reptile. That's good.

And then I give you-- I have to give you a python, right? I mean, there has to be a python in here. Oh come on. At least groan at me when I say that. There has to be a python here. And I give you that and a salmon. And now I am in trouble. Because look at scales, look at cold blooded, look at legs. I can't separate them. On those features, there's no way to come up with a way that will correctly say that the python is a reptile and the salmon is not. And so there's no easy way to add in that rule.

And probably my best thing is to simply go back to just two features, scales and cold blooded. And basically say, if something has scales and it's cold blooded, I'm going to call it a reptile. If it doesn't have both of those, I'm going to say it's not a reptile. It won't be perfect. It's going to incorrectly label the salmon. But I've made a design choice here that's important. And the design choice is that I will have no false negatives.

What that means is there's not going to be any instance of an actual reptile that I fail to label as a reptile. I may, however, have some false positives, in that I may have a few things that I will incorrectly label as a reptile. And in particular, the salmon is going to be an instance of that.
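
As a concrete sketch of that final two-feature rule (illustrative code, not from the lecture), the classifier and its salmon false positive look like this:

```python
def is_reptile(has_scales, is_cold_blooded):
    """Deliberately simple rule: scales AND cold blooded means reptile."""
    return has_scales and is_cold_blooded

print(is_reptile(True, True))   # python (the snake): correctly labeled a reptile
print(is_reptile(True, False))  # chicken: has scales but isn't cold blooded -- correctly rejected
print(is_reptile(True, True))   # salmon: scales and cold blooded, so it is incorrectly
                                # labeled a reptile -- the false positive we tolerate
```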

This trade off of false positives and false negatives is something that we worry about, as we think about it. Because there's no perfect way, in many cases, to separate out the data. And if you think back to my example of the New England Patriots, that running back and that wide receiver were so close together in height and weight, there was no way I'm going to be able to separate them apart. And I just have to be willing to decide how many false positives or false negatives do I want to tolerate.

Once I've figured out what features to use, which is good, then I have to decide about distance. How do I compare two feature vectors? I'm going to say vector because there could be multiple dimensions to it. How do I decide how to compare them? Because I want to use the distances to figure out either how to group things together or how to find a dividing line that separates things apart.

So one of the things I have to decide is which features. I also have to decide the distance. And finally, I may want to decide how to weigh relative importance of different dimensions in the feature vector. Some may be more valuable than others in making that decision. And I want to show you an example of that.

So let's go back to my animals. I started off with a feature vector that actually had five dimensions to it. It was egg-laying, cold blooded, has scales, I forget what the other one was, and number of legs. So one of the ways I could think about this is saying I've got four binary features and one integer feature associated with each animal. And one way to learn to separate out reptiles from non reptiles is to measure the distance between pairs of examples and use that distance to decide what's near each other and what's not.

And as we've said before, it will either be used to cluster things or to find a classifier surface that separates them. So here's a simple way to do it. For each of these examples, I'm going to just let true be 1, false be 0. So the first four are either 0s or 1s. And the last one is the number of legs. And now I could say, all right. How do I measure distances between animals or anything else, but these kinds of feature vectors?

Here, we're going to use something called the Minkowski metric or the Minkowski difference. Given two vectors and a power, p, we basically take the absolute value of the difference between each of the components of the vectors, raise it to the p-th power, take the sum, and take the p-th root of that. So let's do the two obvious examples. If p is equal to 1, I just measure the absolute distance between each component, add them up, and that's my distance. It's called the Manhattan metric.
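
As a small sketch in Python (the function name is an illustrative choice), with p = 1 giving the Manhattan metric just described and p = 2 giving the Euclidean distance discussed next:

```python
def minkowski_dist(v, w, p):
    """Minkowski distance: p-th root of the sum of the absolute component
    differences, each raised to the p-th power."""
    return sum(abs(a - b) ** p for a, b in zip(v, w)) ** (1 / p)

# Two points that differ by 2 in each of two dimensions (hypothetical coordinates).
v, w = (0, 0), (2, 2)
print(minkowski_dist(v, w, 1))  # Manhattan distance: 4
print(minkowski_dist(v, w, 2))  # Euclidean distance: about 2.83
```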

The one you've seen more, the one we saw last time, if p is equal to 2, this is Euclidean distance, right? It's the sum of the squares of the differences of the components. Take the square root. Take the square root because it makes it have certain properties of a distance. That's the Euclidean distance. So now if I want to measure difference between these two, here's the question. Is this circle closer to the star or closer to the cross?

Unfortunately, I put the answer up here. But it differs, depending on the metric I use. Right? Euclidean distance, well, that's square root of 2 times 2, so it's about 2.8. And that's three. So in terms of just standard distance in the plane, we would say that these two are closer than those two are. Manhattan distance, why is it called that? Because you can only walk along the avenues and the streets. Manhattan distance would basically say this is one, two, three, four units away. This is one, two, three units away.

And under Manhattan distance, this is closer, this pairing is closer than that pairing is. Now you're used to thinking Euclidean. We're going to use that. But this is going to be important when we think about how are we comparing distances between these different pieces. So typically, we'll use Euclidean. We're going to see Manhattan actually has some value. So if I go back to my three examples-- boy, that's a gross slide, isn't it? But there we go-- rattlesnake, boa constrictor, and dart frog. There is the representation.

I can ask, what's the distance between them? In the handout for today, we've given you a little piece of code that would do that. And if I actually run through it, I get, actually, a nice little result. Here are the distances between those vectors using Euclidean metric. I'm going to come back to them. But you can see the two snakes, nicely, are reasonably close to each other. Whereas, the dart frog is a fair distance away from that. Nice, right? That's a nice separation that says there's a difference between these two.
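
The handout's code isn't reproduced here, but a minimal sketch of the same computation might look like the following. The 0/1 feature values follow the descriptions in the lecture (the rattlesnake is assumed to have the same features as the cobra), so treat them as illustrative.

```python
import math

def euclidean(v, w):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))

# Features: [egg-laying, has scales, poisonous, cold blooded, number of legs]
animals = {
    "rattlesnake":     (1, 1, 1, 1, 0),
    "boa constrictor": (0, 1, 0, 1, 0),
    "dart frog":       (1, 0, 1, 0, 4),
}

names = list(animals)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(a, "<->", b, round(euclidean(animals[a], animals[b]), 3))
# The two snakes come out close to each other; the dart frog is noticeably
# farther away, largely because of the number-of-legs dimension.
```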

OK. Now I throw in the alligator. Sounds like a Dungeons & Dragons game. I throw in the alligator, and I want to do the same comparison. And I don't get nearly as nice a result. Because now it says, as before, the two snakes are close to each other. But it says that the dart frog and the alligator are much closer, under this measurement, than either of them is to the other. And to remind you, right, the alligator and the two snakes I would like to be close to one another and a distance away from the frog. Because I'm trying to classify reptiles versus not.

So what happened here? Well, this is a place where the feature engineering is going to be important. Because in fact, the alligator differs from the frog in three features. And only in two features from, say, the boa constrictor. But one of those features is the number of legs. And there, while on the binary axes, the difference is between a 0 and 1, here it can be between 0 and 4. So that is weighing the distance a lot more than we would like. The legs dimension is too large, if you like.

How would I fix this? This is actually, I would argue, a natural place to use Manhattan distance. Why should I think that the difference in the number of legs is more important than whether it has scales or not? Why should I think that measuring that distance Euclidean-wise makes sense? They are really completely different measurements. And in fact, I'm not going to do it, but if I ran the Manhattan metric on this, it would get the alligator much closer to the snakes, exactly because it differs only in two features, not three.

The other way I could fix it would be to say I'm letting too much weight be associated with the difference in the number of legs. So let's just make it a binary feature. Either it doesn't have legs or it does have legs. Run the same classification. And now you see the snakes and the alligator are all close to each other. Whereas the dart frog, not as far away as it was before, but there's a pretty natural separation, especially using that number between them.
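
Here is the same sketch with that fix applied: the last feature is collapsed to a 0/1 has-legs value (and the alligator added), so no single dimension can dominate the distance. The feature values are the same illustrative ones as above.

```python
import math

def euclidean(v, w):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))

# Features: [egg-laying, has scales, poisonous, cold blooded, has legs]
animals = {
    "rattlesnake":     (1, 1, 1, 1, 0),
    "boa constrictor": (0, 1, 0, 1, 0),
    "alligator":       (1, 1, 0, 1, 1),
    "dart frog":       (1, 0, 1, 0, 1),
}

names = list(animals)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(a, "<->", b, round(euclidean(animals[a], animals[b]), 3))
# The alligator now sits close to the two snakes, with the dart frog still set apart.
```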

What's my point? Choice of features matters. Throwing too many features in may, in fact, give us some overfitting. And in particular, deciding the weights that I want on those features has a real impact. And you, as a designer or a programmer, have a lot of influence in how you think about using those. So feature engineering really matters. How you pick the features, what you use is going to be important. OK.

The last piece of this then is we're going to look at some examples where we give you data, got features associated with them. We're going to, in some cases have them labeled, in other cases not. And we know how now to think about how do we measure distances between them. John.

JOHN GUTTAG: You probably didn't intend to say weights of features. You intended to say how they're scaled.

ERIC GRIMSON: Sorry. The scales and not the-- thank you, John. No, I did. I take that back. I did not mean to say weights of features. I meant to say the scale of the dimension is going to be important here. Thank you, for the amplification and correction. You're absolutely right.

JOHN GUTTAG: Weights, we use in a different way, as we'll see next time.

ERIC GRIMSON: And we're going to see next time why we're going to use weights in different ways. So rephrase it. Block that out of your mind. We're going to talk about scales and the scale on the axes as being important here. And we already said we're going to look at two different kinds of learning, labeled and unlabeled, clustering and classifying. And I want to just finish up by showing you two examples of that. How we would think about them algorithmically, and we'll look at them in more detail next time.

As we look at it, I want to remind you the things that are going to be important to you. How do I measure distance between examples? What's the right way to design that? What is the right set of features to use in that vector? And then, what constraints do I want to put on the model? In the case of unlabeled data, how do I decide how many clusters I want to have? Because I can give you a really easy way to do clustering. If I give you 100 examples, I say build 100 clusters. Every example is its own cluster.

Distance is really good. Each example is really close to itself, but it does a lousy job of labeling new things. So I have to think about, how do I decide how many clusters, what's the complexity of that separating surface? How do I basically avoid the overfitting problem, which I don't want to have? So just to remind you, we've already seen a little version of this, the clustering method. This is a standard way to do it, simply repeating what we had on an earlier slide.

If I want to cluster it into groups, I start by saying how many clusters am I looking for? Pick an example I take as my initial exemplar for each one. For every other example in the training data, put it in the closest cluster. Once I've got those, find the median, repeat the process. And that led to that separation. Now once I've got it, I'd like to validate it. And in fact, I should have said this better. Those two clusters came without looking at the two black dots.

Once I put the black dots in, I'd like to validate, how well does this really work? And that example there is really not very encouraging. It's too close. So that's a natural place to say, OK, what if I did this with three clusters? That's what I get. I like that. All right? That has a really nice cluster up here. The fact that the algorithm didn't know the labeling is irrelevant. There's a nice grouping of five. There's a nice grouping of four. And there's a nice grouping of three in between.

And in fact, if I looked at the average distance between examples in each of these clusters, it is much tighter than in that example. And so that leads to, then, the question of should I look for four clusters? Question, please.

AUDIENCE: Is that overlap between the two clusters not an issue?

ERIC GRIMSON: Yes. The question is, is the overlap between the two clusters a problem? No. I just drew it here so I could let you see where those pieces are. But in fact, if you like, the center is there. Those three points are all closer to that center than they are to that center. So the fact that they overlap is a good question. It's just the way I happened to draw them. I should really draw these, not as circles, but as some little bit more convoluted surface. OK? Having done three, I could say should I look for four?

Well, those points down there, as I've already said, are an example where it's going to be hard to separate them out. And I don't want to overfit. Because the only way to separate those out is going to be to come up with a really convoluted cluster, which I don't like. All right? Let me finish with showing you one other example from the other direction. Which is, suppose I give you labeled examples.

So again, the goal is I've got features associated with each example. They're going to have multiple dimensions on it. But I also know the label associated with them. And I want to learn what is the best way to come up with a rule that will let me take new examples and assign them to the right group. There are a number of ways to do this. You can simply say I'm looking for the simplest surface that will separate those examples. In my football case, where they were in the plane, what's the best line that separates them, which turns out to be easy.

I might look for a more complicated surface. And we're going to see an example in a second where maybe it's a sequence of line segments that separates them out. Because there's not just one line that does the separation. As before, I want to be careful. If I make it too complicated, I may get a really good separator, but I overfit to the data. And you're going to see next time. I'm going to just highlight it here. There's a third way, which will lead to almost the same kind of result called k nearest neighbors.

And the idea here is I've got a set of labeled data. And what I'm going to do is, for every new example, find the k-- say five-- closest labeled examples. And take a vote. If 3 out of 5 or 4 out of 5 or 5 out of 5 of those labels are the same, I'm going to say it's part of that group. And if I have less than that, I'm going to leave it as unclassified. And that's a nice way of actually thinking about how to learn them.
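
A minimal sketch of that voting idea, with made-up names and training points rather than the course's code:

```python
import math
from collections import Counter

def euclidean(v, w):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))

def knn_label(labeled_examples, new_example, k=5, votes_needed=3):
    """Vote among the k nearest labeled examples; return the majority label
    if at least votes_needed of them agree, otherwise None (unclassified)."""
    nearest = sorted(labeled_examples,
                     key=lambda ex: euclidean(ex[0], new_example))[:k]
    label, votes = Counter(lbl for _, lbl in nearest).most_common(1)[0]
    return label if votes >= votes_needed else None

# Hypothetical (height, weight) -> position training data.
train = [((75, 305), "lineman"), ((77, 315), "lineman"), ((76, 320), "lineman"),
         ((72, 190), "receiver"), ((70, 185), "receiver"), ((71, 200), "receiver")]
print(knn_label(train, (73, 210)))  # the five nearest neighbors vote "receiver"
```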

And let me just finish by showing you an example. Now I won't use football players on this one. I'll use a different example. I'm going to give you some voting data. I think this is actually simulated data. But these are a set of voters in the United States with their preference. They tend to vote Republican. They tend to vote Democrat. And the two categories are their age and how far away they live from Boston. Whether those are relevant or not, I don't know, but they are just two things I'm going to use to classify them. And I'd like to say, how would I fit a curve to separate those two classes?

I'm going to keep half the data to test. I'm going to use half the data to train. So if this is my training data, I can say what's the best line that separates these? I don't know about best, but here are two examples. This solid line has the property that all the Democrats are on one side. Everything on the other side is a Republican, but there are some Republicans on this side of the line. I can't find a line that completely separates these, as I did with the football players. But there is a decent line to separate them.

Here's another candidate. That dash line has the property that on the right side you've got-- boy, I don't think this is deliberate, John, right-- but on the right side, you've got almost all Republicans. It seems perfectly appropriate. One Democrat, but there's a pretty good separation there. And on the left side, you've got a mix of things. But most of the Democrats are on the left side of that line. All right? The fact that left and right correlates with distance from Boston is completely irrelevant here. But it has a nice punch to it.

JOHN GUTTAG: Relevant, but not accidental.

ERIC GRIMSON: But not accidental. Thank you. All right. So now the question is, how would I evaluate these? How do I decide which one is better? And I'm simply going to show you, very quickly, some examples. The first one is to look at what's called the confusion matrix. What does that mean? It says, for one of these classifiers, for example the solid line, here are the predictions, based on the solid line, of whether they would be more likely to be Democrat or Republican. And here is the actual label. Same thing for the dashed line.

And that diagonal is important because those are the correctly labeled results. Right? It correctly, in the solid line case, gets all of the correct labelings of the Democrats. It gets half of the Republicans right. But it has some where it's actually Republican, but it labels it as a Democrat. That, we'd like to be really large. And in fact, it leads to a natural measure called the accuracy. Which is, just to go back to that, we say that these are true positives. Meaning, I labeled it as being an instance, and it really is.

These are true negatives. I label it as not being an instance, and it really isn't. And then these are the false positives. I labeled it as being an instance and it's not, and these are the false negatives. I labeled it as not being an instance, and it is. And an easy way to measure it is to look at the correct labels over all of the labels. The true positives and the true negatives, the ones I got right. And in that case, both models come up with a value of 0.7.

So which one is better? Well, I should validate that. And I'm going to do that in a second by looking at other data. We could also ask, could we find something with less training error? This is only getting 70% right. Not great. Well, here is a more complicated model. And this is where you start getting worried about overfitting. Now what I've done, is I've come up with a sequence of lines that separate them. So everything above this line, I'm going to say is a Republican. Everything below this line, I'm going to say is a Democrat.

So I'm avoiding that one. I'm avoiding that one. I'm still capturing many of the same things. And in this case, I get 12 true positives, 13 true negatives, and only 5 false positives. And that's kind of nice. You can see the 5. It's those five red ones down there. Its accuracy is 0.833. And now, if I apply that to the test data, I get an OK result. It has an accuracy of about 0.6.

I could use this idea to try and generalize to say could I come up with a better model. And you're going to see that next time. There could be other ways in which I measure this. And I want to use this as the last example. Another good measure we use is called PPV, positive predictive value, which is how many true positives do I come up with out of all the things I labeled positively. And with the solid line and the dashed line, I get values of about 0.57. The complex model on the training data is better. And on the testing data it is even stronger.

And finally, two other examples are called sensitivity and specificity. Sensitivity basically tells you what percentage did I correctly find. And specificity says what percentage did I correctly reject. And I show you this because this is where the trade-off comes in. Sensitivity asks, out of everything that really is the kind of thing I want-- the ones I correctly labeled plus the ones I incorrectly labeled as negative-- how many did I correctly label? I can make sensitivity 1.

Label everything as the thing I'm looking for. Great. Everything is correct. But the specificity will be 0. Because I'll have a bunch of things incorrectly labeled. I could make the specificity 1: reject everything, say nothing is an instance. Specificity goes to 1, and I'm in a great place there, but my sensitivity goes to 0. I've got a trade-off. As I think about the machine learning algorithm I'm using and my choice of that classifier, I'm going to see a trade-off where I can increase specificity at the cost of sensitivity or vice versa.
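
All four of these measures come straight off the confusion matrix. A small sketch, using the complex-model training counts quoted above (12 true positives, 13 true negatives, 5 false positives, and, by implication, no false negatives):

```python
def metrics(tp, fp, tn, fn):
    """Standard measures computed from confusion-matrix counts.
    (Assumes none of the denominators is zero.)"""
    return {
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),  # fraction labeled correctly
        "ppv":         tp / (tp + fp),                   # positive predictive value
        "sensitivity": tp / (tp + fn),                   # fraction of actual positives found
        "specificity": tn / (tn + fp),                   # fraction of actual negatives rejected
    }

print(metrics(tp=12, fp=5, tn=13, fn=0))  # accuracy works out to 0.833, matching the number quoted above
```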

And you'll see a nice technique called ROC, or receiver operating characteristic, curves that gives you a sense of how you want to deal with that. And with that, we'll see you next time. We'll take your question offline if you don't mind, because I've run over time. But we'll see you next time, where Professor Guttag will show you examples of this.