Lecture 1: Introduction to Statistics | Statistics for Applications


*NOTE: This video was recorded in Fall 2017. The rest of the lectures were recorded in Fall 2016, but video of Lecture 1 was not available.


The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PHILIPPE RIGOLLET: OK, so the course you're currently sitting in is 18.650. And it's called Fundamentals of Statistics. And until last spring, it was still called Statistics for Applications. It turned out that really, based on the content, "Fundamentals of Statistics" was a more appropriate title.

I'll tell you a little bit about what we're going to be covering in class, what this class is about, what it's not about. I realize there's several offerings in statistics on campus. So I want to make sure that you've chosen the right one. And I also understand that for some of you, it's a matter of scheduling.

I need to actually throw out a disclaimer. I tend to speak too fast. I'm aware of that.

Someone in the back, just do like that when you have no idea what I'm saying. Hopefully, I will repeat myself many times. So if you average over time, statistics will tell you that you will get the right message that I was actually trying to send.

All right, so what are the goals of this class? The first one is basically to give you an introduction. No one here is expected to have seen statistics before, but as you will see, you are expected to have seen probability. And usually, you do see some statistics in a probability course. So I'm sure some of you have some ideas, but I won't expect anything.

And we'll be using mathematics. This is a math class, so there's going to be a bunch of equations-- not so much real data and statistical thinking. We're going to try to provide theoretical guarantees. Say I have two estimators available to me-- how does theory guide me to choose the better of the two? How certain can I be of my guarantees or predictions?

It's one thing to just spit out a number. It's another thing to put some error bars around it. And we'll see how to build error bars, for example.

You will have your own applications. I'm happy to answer questions about specific applications. But rather than trying to tailor applications to an entire institute, I think we're going to work with pretty standard applications, mostly not very serious ones. And hopefully, you'll be able to take the main principles back with you and apply them to your particular problem.

What I'm hoping that you will get out of this class is that when you have a real-life situation-- and by "real life", I mean mostly at MIT, so some people probably would not call that real life-- your goal is to formulate a statistical problem in mathematical terms. If I want to say, is a drug effective, that's not in mathematical terms. I have to find out which measure I want to use to call it effective. Maybe it's over a certain period of time.

So there's a lot of things that you actually need. And I'm not really going to tell you how to go from the application to the point you need to be. But I will certainly describe to you what point you need to be at if you want to start applying statistical methodology. Then once you understand what kind of question you want to answer-- do I want a yes/no answer, do I want a number, do I want error bars, do I want to make predictions five years into the future, do I have side information, or do I not have side information, all those things-- based on that, hopefully, you will have a catalog of statistical methods that you're going to be able to use and apply in the wild.

And also, no statistical method is perfect. Some of them people have agreed upon over the years, and people understand that this is the standard. But I want you to be able to understand what the limitations are, and when you make conclusions based on data, that those conclusions might be erroneous, for example.

All right, more practically, my goal here is to have you ready. So who has taken, for example, a machine-learning class here? All right, so many of you, actually-- maybe a third have taken a machine-learning class.

So statistics has somewhat evolved into machine learning in recent years. And my goal is to take you there. So machine learning has a strong algorithmic component.

So maybe some of you have taken a machine-learning class that displays mostly the algorithmic component. But there's also a statistical component. The machine learns from data.

So this is a statistical track. And there are some statistical machine-learning classes that you can take here. They're offered at the graduate level, I believe. But I want you to be ready to be able to take those classes, having the statistical fundamentals to understand what you're doing. And then you're going to be able to expand to broader and more sophisticated methods.

Lectures are here from 11:00 to 12:30 on Tuesday and Thursday. Victor-Emmanuel will also be-- and you can call him Victor-- will also be holding mandatory recitation. So please go on Stellar and pick your recitation. It's either 3:00 to 4:00 or 4:00 to 5:00 on Wednesdays. And it's going to be mostly focused on problem-solving.

They're mandatory in the sense that we're allowed to do this, but they're not going to cover entirely new material. But they might cover some techniques that might save you some time when it comes to the exam. So you might get by.

Attendance is not going to be taken or anything like this. But I highly recommend that you go, because, well, they're mandatory. So you cannot really complain that something was taught only in recitation. So please register on Stellar for which of the two recitations you would like to be in. They're capped at 40, so first come, first served.

Homework will be due weekly. There's a total of 11 problem sets. I realize this is a lot. Hopefully, we'll keep them light. I just want you to not rush too much.

The 10 best will be kept, and this will count for a total of 30% of the final grade. They are due Mondays at 8:00 PM on Stellar. And this is a new thing.

We're not going to use the boxes outside of the math department. We're going to use only PDF files. Well, you're always welcome to type them and practice your LaTeX or Word typing.

I also understand that this can be a bit of a strain, so just write them down on a piece of paper, use your iPhone, and take a picture of it. Dropbox has a nice, new-- so try to find something that puts a lot of contrast, especially if you use pencil, because we're going to check if they're readable. And this is your responsibility to have a readable file.

I've had over the years-- not at MIT, I must admit-- but I've had students who actually write the doc file and think that converting it to a PDF consists in erasing the extension doc and replacing it by PDF. This is not how it works. So I'm sure you will figure it out. Please try to keep them letter-sized. This is not a strict requirement, but I don't want to see thumbnails, either.

You are allowed to have two late homeworks. And by late, I mean 24 hours late. No questions asked. You submit them, this will be counted. You don't have to send an email to warn us or anything like this.

Beyond that, given that you have slack for one 0 grade and slack for two late homeworks, you're going to have to come up with a very good explanation of why you actually need more extensions than that, if you ever do. And in particular, you're going to have to keep track of why you've used your three options before.

There's going to be two midterms. One is October 3, and one is November 7. They're both going to be in class for the duration of the lecture.

When I say they last for an hour and 20 minutes, it does not mean that if you arrive 10 minutes before the end of lecture, you still get an hour and 20 minutes. It will end at the end of lecture time.

For this as well, no pressure. Only the best of the two will be kept. And this grade will count for 30% of the grade.

This will be closed-books and closed-notes. The purpose is for you to-- yes?

AUDIENCE: How many midterms did you say there are?

PHILIPPE RIGOLLET: Two.

AUDIENCE: You said the best of the two will be kept?

PHILIPPE RIGOLLET: I said the best of the two will be kept, yes.

AUDIENCE: So both the midterms will be kept?

PHILIPPE RIGOLLET: The best of the two, not the best two.

AUDIENCE: Oh.

PHILIPPE RIGOLLET: We will add them, multiply the number by 9, and that will be the grade. No. I am trying to be nice, there's just a limit to what I can do.

All right, so the goal is for you to learn things and to be familiar with them. In the final, you will be allowed to have your notes with you. But the midterms are also a way for you to develop some mechanism so that you don't actually waste too much time on things that you should be able to do without thinking too much.

You will be allowed a cheat sheet, because, well, you can always forget something. And it will be a two-sided, letter-size sheet, and you can practice writing as small as you want. And you can put whatever you want on this cheat sheet.

All right, the final will be decided by the registrar. It's going to be three hours, and it's going to count for 40%. You cannot bring books, but you can bring your notes. Yes.

AUDIENCE: I noticed that the midterm dates aren't dated in the syllabus. So I wanted to make sure you know.

PHILIPPE RIGOLLET: They are not?

AUDIENCE: Yeah--

PHILIPPE RIGOLLET: Oh, yeah, there's a "1" that's missing on both of them, isn't there? Yeah, let's figure that out. The syllabus is the true one.

The slides are so that we can discuss, but the ones that are on the syllabus are the ones that count. And I think they're also posted on the calendar on Stellar as well. Any other question?

OK, so the pre-reqs here-- and who has looked at the first problem set already? OK, so those hands that are raised realize that there is a true prerequisite of probability for this class. It can be at the level of 18.600 or 6.041. I should say "B" now. It's two classes.

I will require you to know some calculus and have some notions of linear algebra, such as, what is a matrix, what is a vector, how do you multiply those things together, some notion of what orthonormal vectors are. We'll talk about eigenvectors and eigenvalues, but I'll remind you of all of that. So this is not a strict pre-req. But if you've taken it, for example, it doesn't hurt to go back to your notes when we get closer to this chapter on principal component analysis. The chapters, as they're listed in the syllabus, are in order, so you will see when it actually comes.

There's no required textbook. And I know you tend to not like that. You like to have your textbook to know where you're going and what we're doing.

I'm sorry, it's just this class. Either I would have to go to a mathematical statistics textbook, which is just too much, or to go to a more engineering-type statistics class, which is just too little. So hopefully, the problem sets will be enough for you to practice.

The recitations will have some problems to solve as well. And the material will be posted on the slides. So you should have everything you need. There's plenty of resources online if you want to expand on a particular topic or read it as said by somebody else.

The book that I recommend in the syllabus is this book called All of Statistics by Wasserman. Mainly because of the title, I'm guessing it has all of it in it. It's pretty broad. There's actually not that many.

It's more of an intro-grad level. But it's not very deep, but you see a lot of the overview. Certainly, what we're going to cover will be a subset of what's in there. The slides will be posted on Stellar before lectures before we start a new chapter and after we're done with the chapter, with the annotations, and also, with the typos corrected, like for the exam.

There will be some video lectures. Again, the first one will be posted on OCW from last year. But all of them will be available on Stellar-- of course, modulo technical problems.

But this is an automated system. And hopefully, it will work out well for us. So if you somehow have to miss a lecture, you can always catch up by watching it. You can also play it at speed 0.75 in case I end up speaking too fast, but I think I've managed myself so far-- so just last warning.

All right, why should you study statistics? Well, if you read the news, you will see a lot of statistics. I mentioned machine learning. It's built on a lot of statistics.

If I were to teach this class 10 years ago, I would have to explain to you that data collection and making decisions based on data was something that made sense. But now, it's almost in our life. We're used to this idea that data helps in making decisions.

And people use data to conduct studies. So here, I found a bunch of press titles that-- I think the key word I was looking for was "study finds"-- if I want to do this. So I actually did not bother doing it again this year. This is all 2016, 2016, 2016.

But the key word that I look for is usually "study find"-- so a new study find-- traffic is bad for your health. So we had to wait for 2016 for data to tell us that. And there's a bunch of other slightly more interesting ones. For example, one that you might find interesting is that this study finds that students benefit from waiting to declare a major.

Now, there's a bunch of press titles. There's one in the MIT News that finds brain connections are key to reading. And so here, we have an idea of what happened there.

Some data was collected. Some scientific hypothesis was formulated. And then the data was here to try to prove or disprove this scientific hypothesis. That's the usual scientific process.

And we need to understand how the scientific process goes, because some of those things might be actually questionable. Who is 100% sure of the study that finds that students-- do you think that you benefit from waiting to declare a major? Right, I would be skeptical about this. I would be like, I don't want to wait to declare a major.

So what kind of thing can we bring? Well, maybe this study studied people that were different from me. Or maybe the study finds that this is beneficial for a majority of people. I'm not a majority. I'm just one person.

There's a bunch of things that we need to understand what those things actually mean. And we'll see that those are actually not statements about individuals. They're not even statements about the cohort of people they've actually looked at. They're statements about a parameter of a distribution that was used to model the benefit of waiting.

So there's a lot of questions. And there are a lot of layers that come into this. And we're going to want to understand what was going on in there and try to peel it off and understand what assumptions have been put in there.

Even though it looks like a totally legit study, out of those studies, statistically, I think there's going to be one that's going to be wrong. Well, maybe not one. But if I put a long list of those, there would be a few that would actually be wrong. If I put 20, there would definitely be one that's wrong.

So you have to see that. Every time you see 20 studies, one is probably wrong. When there are studies about drug effects, out of a list of 100, one would be wrong. So we'll see what that means and what I mean by that. Of course, not only studies that make discoveries are actually making the press titles. There's also the press that talks about things that make no sense.

I love this first experiment-- the salmon experiment. Actually, it was a grad student who came to a neuroscience poster session, pulled out this poster, and explained the scientific experiment that he was conducting, which consisted in taking a previously frozen and thawed salmon, putting it in an MRI, showing it pictures of violent images, and recording its brain activity. And he was able to discover a few voxels that were activated by those violent images. And can somebody tell me what happened here? Was the salmon responding to the violent activity?

Basically, this is just a statistical fluke. That's just randomness at play. There's so many voxels that are recorded, and there's so many fluctuations. There's always a little bit of noise when you're in those things, that some of them, just by chance, got lit up. And so we need to understand how to correct for that.

In this particular instance, we need to have tools that tell us that, well, finding three voxels that are activated for that many voxels that you can find in the salmon's brain is just too small of a number. Maybe we need to find a clump of 20 of them, for example. All right, so we're going to have mathematical tools that help us find those particular numbers.
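The salmon effect can be sketched numerically. The snippet below is only an illustration, not the analysis from the poster: the voxel count (10,000), the per-voxel false-positive rate (0.001), and the independence of voxels are assumptions chosen for the example, and the function names are hypothetical.

```python
import random


def expected_false_positives(n_voxels, alpha):
    """Average number of pure-noise voxels that cross the threshold."""
    return n_voxels * alpha


def prob_at_least_one(n_voxels, alpha):
    """Chance that at least one pure-noise voxel lights up (independence assumed)."""
    return 1 - (1 - alpha) ** n_voxels


def simulate_noise_only_scan(n_voxels, threshold, seed=0):
    """Count voxels whose standard-normal noise exceeds the threshold."""
    rng = random.Random(seed)
    return sum(1 for _ in range(n_voxels) if rng.gauss(0.0, 1.0) > threshold)


if __name__ == "__main__":
    n_voxels, alpha = 10_000, 0.001  # hypothetical values, not from the study
    print(expected_false_positives(n_voxels, alpha))
    print(prob_at_least_one(n_voxels, alpha))
    # 3.09 is roughly the one-sided standard-normal cutoff for alpha = 0.001
    print(simulate_noise_only_scan(n_voxels, threshold=3.09))
```

Under these assumptions, a handful of "activated" voxels is what pure noise produces on average, which is why a small count on its own is not evidence of anything.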

I don't know if you ever saw this one by John Oliver about phacking. Or actually, it said p-hacking. Basically, what John Oliver is saying is actually a full-length-- like there's long segments on this. And he was explaining how there's a sociology question here about how there's a huge incentive for scientists to publish results. You're not going to say, you know what? This year, I found nothing.

And so people are trying to find things. And just by searching, it's as if they were searching for all the voxels in a brain until they find one that was just lit up by chance. And so they just run all these studies. And at some point, one will be right just out of chance.

And so we have to be very careful about doing this. There's much more complicated problems associated to what's called p-hacking, which consists of violating the basic assumptions, in particular, looking at the data, and then formulating your scientific assumption based on data, and then going back to it. Your idea doesn't work. Let's just formulate another one. And if you are doing this, all bets are off.

The theory that we're going to develop is actually for a very clean use of data, which might be a little unpleasant. If you've had an army of graduate students collecting genomic data for a year, for example, maybe you don't want to say, well, I had one hypothesis that didn't work. Let's throw all the data into the trash. And so we need to find ways to be able to do this.

And there's actually a course being taught at BU. It's still in its early stages, but it's something called "adaptive data analysis" that will allow you to do these kinds of things. Questions?

OK, so of course, statistics is not just for you to be able to read the press. Statistics will probably be used in whatever career path you choose for yourself. It started in the 10th century in the Netherlands for hydrology.

The Netherlands is basically under water, under sea level. And so they wanted to build some dikes. But once you're going to build a dike, you want to make sure that it's going to sustain some tides and some floods.

And so in particular, they wanted to build dikes that were high enough, but not too high. You could always say, well, I'm going to build a 500-meter dike, and then I'm going to be safe. You want something that's based on data. You want to make sure.

And so in particular, what did they do? Well, they collected data for previous floods. And then they just found a dike that was going to cover all these things.

Now, if you look at the data they probably had, maybe it was scarce. Maybe they had 10 data points. And so for those data points, then maybe they wanted to sort of interpolate between those points, maybe extrapolate for the larger one. Based on what they've seen, maybe they have chances of seeing something which is even larger than everything they've seen before. And that's exactly the goal of statistical modeling-- being able to extrapolate beyond the data that you have, guessing what you have not seen yet might happen.

When you buy insurance for your car, or your apartment, or your phone, there is a premium that you have to pay. And this premium has been determined based on how much you are, in expectation, going to cost the insurance. It says, OK, this person has, say, a 10% chance of breaking their iPhone. An iPhone costs that much to repair, so I'm going to charge them that much. And then I'm going to add an extra dollar for my time.

That's basically how those things are determined. And so this is using statistics. This is basically where statistics is probably mostly used. I was personally trained as an actuary. And that's me being a statistician at an insurance company.
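The premium calculation described here can be written as a short sketch. The 10% chance and the extra dollar come from the lecture; the repair cost of $300 is a made-up number, and the function names are hypothetical.

```python
def fair_premium(p_claim, claim_cost):
    """Expected cost of the customer to the insurer: probability times cost."""
    return p_claim * claim_cost


def charged_premium(p_claim, claim_cost, margin=1.0):
    """Fair premium plus a flat margin (the 'extra dollar for my time')."""
    return fair_premium(p_claim, claim_cost) + margin


if __name__ == "__main__":
    repair_cost = 300.0  # hypothetical repair cost in dollars
    print(fair_premium(0.10, repair_cost))
    print(charged_premium(0.10, repair_cost))
```

In practice the hard part is the input `p_claim`: it is not given, and estimating it from data is the statistical problem.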

Clinical trials-- this is also one of the earliest success stories of statistics. It's actually now widespread. Every time a new drug is approved for market by the FDA, it requires a very strict regimen of testing with data, and control group, and treatment group, and how many people you need in there, and what kind of significance you need for those things. In particular, those things look like this, so now it's 5,000 patients.

It depends on what kind of drug it is, but for, say, 100 patients, 56 were cured, and 44 showed no improvement. Does the FDA consider that this is a good number? Do they have a table for how many patients were cured? Is there a placebo effect? Do I need a control group of people that are actually getting a placebo?

It's not clear, all these things. And so there's a lot of things to put into place. And there's a lot of floating parameters. So hopefully, we're going to be able to use statistical modeling to shrink it down to a small number of parameters to be able to ask very simple questions.

"Is a drug effective" is not a mathematical equation. But "Is p larger than 0.5?" is a mathematical question. And that's essentially what we're going to be doing. We're going to take this question, is a drug effective, and reduce it to, is a variable larger than 0.5?

Now, of course, genetics is using that. That's typically actually the same size of data that you would see for fMRI data. So this is actually a study that I found.

You have about 4,000 cases of Alzheimer's and 8,000 control. So people without Alzheimer's-- that's what's called a control. That's something just to make sure that you can see the difference with people that are not affected by either a drug or a disease.

Is the gene APOE associated with Alzheimer's disease? Everybody can see why this would be an important question. We now have CRISPR. It's targeted to very specific genes.

If we could edit it, or knock it down, or knock it up, or boost it, maybe we could actually have an impact on that. So those are very important questions, because we have the technology to target those things. But we need the answers about what those things are.

And there's a bunch of other questions. The minute you talk to biologists and say, I can do that, they're going to say, OK, are there any other genes within the genes, or any particular SNPs that I can actually look at? And they're looking at very different questions.

And when you start asking all these questions, you have to be careful, because you're reusing your data again. And it might lead you to wrong conclusions. And those are all over the place, those things. And that's why they go all the way to John Oliver talking about them.

Any questions about those examples? So this is really a motivation. Again, we're not going to just take this data set of those cases and look at them in detail.

So what is common to all these examples? Like, why do we have to use statistics for all those things? Well, there's the randomness of the data.

There's some effect that we just don't understand-- for example, the randomness associated with the lighting up of some voxels. Or the fact that, as far as the insurance is concerned, whether you're going to break your iPhone or not is essentially a coin toss. Hopefully, it's biased. But it's a coin toss.

From the perspective of the statistician, those things are actually random events. And we need to tame this randomness, to understand this randomness. Is this going to be a lot of randomness? Or is it going to be a little randomness?

Is it going to be something that's like, out of their people-- let's see, for example, for the floods. Were the floods that I saw consistently almost the same size? Was it almost a rounding error, or were they really widespread? All these things, we need to understand so we can understand how to build those dikes or how to make decisions based on those data. And we need to understand this randomness.

OK, so the associated questions to randomness were actually hidden in the text. So we talked about the notion of average. Right, so as far as the insurance is concerned, they want to know on average what the probability is. Like, what is your chance of actually breaking your iPhone? And that's what came in this notion of fair premium.

There's this notion of quantifying chance. We don't want to talk maybe only about average. Maybe you want to cover, say, 99% of the floods. So we need to know what is the height of a flood that's higher than 99% of the floods. But maybe there's 1% of them, you know. When doomsday comes, doomsday comes. Right, we're not going to pay for it. All right, so that's most of the floods.
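The "height that covers 99% of the floods" is a quantile. As a rough sketch, here is a nearest-rank empirical quantile in Python. The flood heights are made-up numbers for illustration only, and `empirical_quantile` is a hypothetical helper, not something from the course.

```python
def empirical_quantile(values, percent):
    """Nearest-rank quantile: the smallest observation such that at least
    `percent` percent of the data is at or below it (percent is an integer)."""
    if not values or not 0 < percent <= 100:
        raise ValueError("need data and 0 < percent <= 100")
    ordered = sorted(values)
    rank = -(-percent * len(ordered) // 100)  # integer ceiling of percent * n / 100
    return ordered[rank - 1]


if __name__ == "__main__":
    # Made-up flood heights in meters, for illustration only.
    floods = [2.1, 2.4, 1.9, 3.0, 2.7, 2.2, 3.6, 2.5, 2.0, 2.8]
    print(empirical_quantile(floods, 90))
    print(empirical_quantile(floods, 99))
```

With only 10 observations, the 99% empirical quantile is just the largest flood seen so far. Going beyond that requires a model that extrapolates past the data, which is the point made earlier about the dikes.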

And then there's questions of significance, right? So you know I gave this example a second ago about clinical trials. I gave you some numbers. Clearly the drug cured more people than it did not. But does it mean that it's significantly good, or was this just by chance? Maybe it's just that these people just recovered. It's like, you know, curing a common cold. And you feel like, oh, I got cured. But it's really you waited five days and then you got cured.
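To make the "just by chance" question concrete, here is a minimal sketch using the earlier numbers (56 cured out of 100). It assumes patients are independent and each is cured with the same probability p, and it asks how often 56 or more cures would show up if p were exactly 0.5. This is one possible way to quantify chance, not necessarily the procedure developed later in the course, and `binomial_upper_tail` is a hypothetical helper name.

```python
from math import comb


def binomial_upper_tail(n, k, p):
    """P(X >= k) when X counts successes in n independent trials with success probability p."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


if __name__ == "__main__":
    # Chance of seeing at least 56 cures out of 100 if the true cure rate were 0.5.
    print(binomial_upper_tail(100, 56, 0.5))
```

If this probability is not small, then 56 out of 100 is the kind of result a coin flip could produce, and "more cured than not" is weak evidence that p is larger than 0.5.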

All right, so there's this notion of significance, of variability. All these things are actually notions that describe randomness and quantify randomness into simple things. Randomness is a very complicated beast. But we can summarize it into things that we understand. Just like I am a complicated object. I'm made of molecules, and made of genes, and made of very complicated things. But I can be summarized as my name, my email address, my height and my weight, and maybe for most of you, this is basically enough. You will recognize me without having to do a biopsy on me every time you see me.

All right, so, to understand randomness you have to go through probability. Probability is the study of randomness. That's what it is. That's the first sentence that a lecturer in probability will say. And so that's why I need the pre-requisite, because this is what we're going to use to describe the randomness. We'll see in a second how it interacts with statistics.

So sometimes, and actually probably most of the time throughout your semester on probability, randomness was very well understood. When you saw a probability problem, here was the chance of this happening, here was the chance of that happening. Maybe you had more complicated questions that you had some basic elements to answer.

For example, the probability that I have HBO is this much. And the probability that I watch Game of Thrones is that much. And given that I play basketball what is the probability-- you had all these crazy questions, but you were able to build them. But all the basic numbers were given to you. Statistics will be about finding those basic numbers.

All right, so some examples that you've probably seen were dice, cards, roulette, flipping coins. All of these things are things that you've seen in a probability class. And the reason is because it's very easy to describe the probability of each outcome. For a die we know that each face is going to come with probability 1/6. Now I'm not going to go into a debate of whether this is pure randomness or this is determinism. I think as a model for actual randomness a die is a pretty good model, flipping a coin is a pretty good model. So those are actually a good thing.

So the questions that you would see, for example, in probability are the following. I roll one die. Alice gets $1 if the number of dots is at most three. Bob gets $2 if the number of dots is at most two. Do you want to be Alice or Bob, given that your role is actually to make money?

Yeah, you want to be Bob, right? So let's see why. Look at the expectation of what Alice makes. Let's call it a. This is $1 with probability 3/6, so that's 1/2. And the expectation of what Bob makes is $2 with probability 2/6, and that's 2/3, which is definitely larger than 1/2. So Bob's expectation is actually a bit higher.
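The two expectations can be checked with exact fractions. This sketch reads the thresholds as "at most three dots" for Alice and "at most two dots" for Bob, which is what the probabilities 3/6 and 2/6 in the computation imply. The function name is hypothetical.

```python
from fractions import Fraction


def expected_payout(payout, max_dots, sides=6):
    """Expected gain when `payout` dollars are paid if a fair die shows at most `max_dots`."""
    p_win = Fraction(max_dots, sides)
    return payout * p_win


if __name__ == "__main__":
    alice = expected_payout(1, 3)  # $1 with probability 3/6
    bob = expected_payout(2, 2)    # $2 with probability 2/6
    print(alice, bob, bob > alice)
```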

So those are the kind of questions that you may ask with probability. I described it to you exactly: you use the fact that the die would show at most three dots with probability one half. We knew that. And I didn't have to describe to you what was going on there. You didn't have to collect data about a die. Same thing, you roll two dice. You choose a number between 2 and 12 and you win $100 if you choose the sum of the two dice. Which number do you pick? What?

AUDIENCE: 7.

PHILIPPE RIGOLLET: 7. Why 7?

AUDIENCE: It's the most likely.

PHILIPPE RIGOLLET: That's the most likely one, right? So your expected gain here will be $100 times the probability that the sum of the two dice, let's say x plus y, is equal to your little z, where little z is the number you pick. So 7 is the most likely to happen and that's the one that maximizes this function of z. And for this you need to study a more complicated function. But it's a function that involves two dice. But you can compute the probability that x plus y is equal to z, for every z between 2 and 12. So you know exactly what the probabilities are and that's how you start probability.
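The function of z mentioned here can be tabulated by enumerating all 36 equally likely outcomes of two fair dice. This is a small sketch with hypothetical names; it assumes the dice are fair and independent.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def sum_distribution(sides=6):
    """Exact distribution of x + y for two fair, independent dice."""
    counts = Counter(x + y for x, y in product(range(1, sides + 1), repeat=2))
    total = sides**2
    return {z: Fraction(c, total) for z, c in sorted(counts.items())}


if __name__ == "__main__":
    dist = sum_distribution()
    best_z = max(dist, key=dist.get)   # the z that maximizes P(x + y = z)
    expected_gain = 100 * dist[best_z]  # dollars, when betting on best_z
    for z, p in dist.items():
        print(z, p)
    print(best_z, expected_gain)
```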

So here that's exactly what I said. You have a very simple process that describes basic events. Probability 1/6 for each of them. And then you can build up on that, and understand the probability of more complicated events. You can throw some money in there. You can build functions. You can do very complicated things building on that.

Now if I was a statistician, a statistician would be the guy who just arrived on earth, had never seen a die, and needs to understand that a die comes up with probability 1/6 on each side. And the way he would do it is just to roll the die until he gets some counts and tries to estimate those. And maybe that guy would come and say, well, you know, actually, the probability that I get a 1 is 1/6 plus 0.001 and the probability that I get a 2 is 1/6 minus 0.005. And there would be some fluctuations around this.

And it's going to be his role as a statistician to say, listen, this is too complicated of a model for this thing. And these should all be the same numbers. Just looking at data, they should be all the same numbers. And that's part of the modeling. You make some simplifying assumptions that essentially make your questions more accurate.
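The statistician's experiment can be simulated. The sketch below rolls a simulated fair die many times and reports the observed frequency of each face. The number of rolls and the seed are arbitrary choices, and the function name is hypothetical.

```python
import random
from collections import Counter


def estimate_face_probabilities(n_rolls, seed=0):
    """Estimate each face's probability by its observed frequency over n_rolls rolls."""
    rng = random.Random(seed)
    counts = Counter(rng.randint(1, 6) for _ in range(n_rolls))
    return {face: counts[face] / n_rolls for face in range(1, 7)}


if __name__ == "__main__":
    estimates = estimate_face_probabilities(60_000)
    for face, p_hat in estimates.items():
        # Each estimate fluctuates around 1/6; the deviations are noise, not structure.
        print(face, round(p_hat, 4), round(p_hat - 1 / 6, 4))
```

The six estimates will not be exactly equal. The modeling step is deciding that those small differences are fluctuations and that a single number, 1/6, describes all faces.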

Now, of course, if your model is wrong, if it's not true that all the faces arrive with the same probability, then you have a model error here. So we will be making model errors. But that's going to be the price to pay to be able to extract anything from our data.
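
Here is a rough sketch of what that newly arrived statistician might do, assuming a simulated fair die. The number of rolls and the seed are arbitrary choices, not values from the lecture.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the run is reproducible (arbitrary choice)

n_rolls = 6000  # hypothetical number of rolls
rolls = [random.randint(1, 6) for _ in range(n_rolls)]
counts = Counter(rolls)

# Unconstrained model: one estimated probability per face.
freq = {face: counts[face] / n_rolls for face in range(1, 7)}
for face in range(1, 7):
    print(face, round(freq[face], 4), "deviation from 1/6:", round(freq[face] - 1 / 6, 4))

# Simplified model: all faces share one probability, which then has to be 1/6.
fair_model = {face: 1 / 6 for face in range(1, 7)}
print(fair_model)
```

The estimated frequencies wobble around 1/6. Deciding that they are all the same number is the modeling step, and it is only as good as the assumption that the die is fair.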

So for more complicated processes, so of course nobody's going to waste their time rolling dice. I mean, I'm sure you might have done this in AP stat or something. But the need is to estimate parameters from data.

All right, so for more complicated things you might want to estimate some density parameter on a particular set of material. And for this maybe you need to beam something to it, and measure how fast it's coming back. And you're going to have some measurement errors. And maybe you need to do that several times and you have a model for the physical process that's actually going on. And physics is usually a very good way to get models from an engineering perspective.

But there's models for sociology where we have no physical system, right. God knows how people interact. And maybe I'm going to say that the way I make friends is by first flipping a coin in my pocket. And with probability 2/3, I'm going to make my friend at work. And with probability 1/3 I'm going to make my friend at soccer.

And once I make my friends at soccer-- I decide to make my friend at soccer. Then I will face someone who's flipping the same coin with maybe slightly different parameters. But those things actually exist. There's models about how friendships are formed. And the one I described is called the mixed-membership model. So those are models that are sort of hypothesized. And they're more reasonable than taking into account all the things that made you meet that person at that particular time.

So the goal here-- so based on data now, once we have the model, it's going to be reduced to maybe two, three, four parameters, depending on how complex the model is. And then your goal will be to estimate those parameters.

So sometimes the randomness we have here is real. So there's some true randomness in some surveys. If I pick a random student, as long as I believe that my random number generator that will pick your random ID is actually random, there is something random about you. The student that I pick at random will be a random student. The person that I call on the phone is a random person. So there's some randomness that I can build into my system by drawing something from a random number generator.

A biased coin is a random thing. It's not a very interesting random thing. But it is a random thing. Again, if I wash out the fact that it actually is a deterministic mechanism. But at a certain accuracy, a certain granularity, this can be thought of as a truly random experiment.

Measurement error, for example: if you buy some measurement device, or some optics device, for example, you will have like standard deviation and things that come on the side of the box. And it tells you, this will be making some measurement error. And it's usually thermal noise maybe, or things like this. And those are very accurately described by some random phenomenon.

But sometimes, and I'd say most times, there's no randomness. There's no randomness. It's not like you breaking your iPhone is a random event. This is just something that we sweep-- randomness is a big rug under which we sweep everything we don't understand. And we just hope that on average we've captured the average effect of what's going on. And the rest of it might fluctuate to the right, might fluctuate to the left. But what remains is just sort of randomness that can be averaged out.

So, of course, this is where the leap of faith is. We do not know whether we were correct in doing this. Maybe we make some huge systematic biases by doing this. Maybe we forget a very important component. Right, for example, if I have-- I don't know, let's think of something-- a drug for breast cancer.

All right, and I throw out the fact that my patient is either a man or woman. I'm going to have some serious model biases. Right. So if I say I'm going to collect a random patient. And say I'm going to start doing this. There's some information that I really need, clearly, to build into my model.

And so the model should be complicated enough, but not too complicated. Right, so it should take into account things that will systematically be important.

So, in particular, the simple rule of thumb is, when you have a complicated process, you can think of it as being a simple process and some random noise. Now, again, the random noise is everything you don't understand about the complicated process. And the simple process is everything you actually do.

So good modeling, and this is not what we'll be seeing in this class, consists in choosing plausible simple models. And this requires a tremendous amount of domain knowledge. And that's why we're not doing it in this class. This is not something where I can make a blanket statement about making good modeling.

You need to know, if I were a statistician working on a study, I would have to grill the person in front of me, the expert, for two hours to know, but how about this? How about that? How does this work? So it requires to understand a lot of things.

There's this famous statistician to whom this sentence is attributed, and it's probably not his then, but Tukey said that he loves being a statistician, because you get to play in everybody's backyard. Right, so you get to go and see people. And you get to understand, at least to a certain extent, what their problems are. Enough that you can actually build a reasonable model for what they're actually doing.

So you get to do some sociology. You get to do some biology. You get to do some engineering. And you get to do a lot of different things. Right, so he was actually at some point predicting the presidential election.

So, you see, you get to do a lot of different things. But it requires a lot of time to understand what problem you're working on. And if you have a particular application in mind you're the best person to actually understand this. So I'm just going to give you the basic tools.

So this is the circle of trust. No, this is really just a simple graphic that tells you what's going on. When you do probability, you're given the truth. Somebody tells you what die God is rolling. So you know exactly what the parameters of the problems are. And what you're trying to do is to describe what the outcomes are going to be.

You can say, if you're rolling a fair die, 1/6 of the time in your data you're going to have a one. 1/6 of the time you're going to have a two. And so you can describe-- if I told you what the truth is, you could actually go into a computer and either generate some data, or you could describe to me some more macro properties of what the data would be like.

Oh, I would see a bunch of numbers that would be centered around 35, if I drew from a Gaussian distribution centered at 35. Right, you would know this kind of thing. I would know that it's very unlikely that if my Gaussian has standard deviation-- is centered on 0, say, with standard deviation 3. It's very unlikely that I will see numbers below minus 10 or above 10, right? You know this, that you basically will not see them.
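
Both statements can be checked numerically. This is a small sketch using Python's standard library; the standard deviation of 3 for the Gaussian centered at 35, the sample size and the seed are assumptions made only for illustration.

```python
from statistics import NormalDist

# Gaussian centered at 0 with standard deviation 3, as in the example above.
X = NormalDist(mu=0, sigma=3)

# Probability of seeing a number below -10 or above 10.
p_outside = X.cdf(-10) + (1 - X.cdf(10))
print("P(|X| > 10) =", p_outside)

# Going forward from the truth: generate data and look at where it is centered.
# sigma=3 here is an assumed value; the lecture only fixes the center at 35.
samples = NormalDist(mu=35, sigma=3).samples(1000, seed=0)
sample_mean = sum(samples) / len(samples)
print("sample mean:", round(sample_mean, 2))
```

The first probability comes out below one in a thousand, which is the sense in which you "basically will not see them."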

So you know from the truth, from the distribution of a random variable that does not have mus or sigmas, but really numbers there, you know what data you're going to be having. Statistics is about going backwards. It's saying, if I have some data, what was the truth that generated it? And since there are so many possible truths, modeling says you have to pick one of the simpler possible truths, so that you can average out.

Statistics basically means averaging. You're averaging when you do statistics. And averaging means that if I say that I received-- so if I collect all your GPAs, for example. And my model is that the possible GPAs are any possible numbers. And anybody can have any possible GPA. This is going to be a serious problem.

But if I can summarize those GPAs into two numbers, say, mean and standard deviation, then I have a pretty good description of what is going on, rather than having to predict the full list. Right, if I learn a full list of GPAs and I say, well this was the distribution. Then it's not going to be of any use for me to predict what the GPA would be for some random student walking in, or something like this.

So just to finish my rant about probability versus statistics, this is a question you would see in a probability-- this is a probabilistic question, and this is a statistical question. The probabilistic question is, previous studies showed that the drug was 80% effective. So you know that. This is the effectiveness of the drug. It's given to you. This is how your problem starts. Then we can anticipate that, for a study on 100 patients, on average, 80 will be cured. And at least 65 will be cured with 99.99% chances.

So again these are not-- I'm not predicting on 100 patients exactly the number of them they're going to be cured. And the number of them that are not. But I'm actually sort of predicting what things are going to look like on average, or some macro properties of what my data sets will look like.

So with 99.99% chances, that means that in 99.99% of the data sets you will draw from this particular distribution, 99.99% of the cohorts of 100 patients to whom you administer this drug, I will be able to conclude that at least 65 of them will be cured, on 99.99% of those data sets.
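
The probabilistic side of the drug example can be reproduced in a few lines. This is a sketch under the assumption that each of the 100 patients is cured independently with probability 0.8, so that the number cured is binomial; the helper name `binom_pmf` is made up for the example.

```python
from math import comb

n, p = 100, 0.8  # 100 patients, drug assumed 80% effective

def binom_pmf(k, n, p):
    """P(exactly k cured) when each patient is cured independently with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

expected_cured = n * p
prob_at_least_65 = sum(binom_pmf(k, n, p) for k in range(65, n + 1))

print("expected number cured:", expected_cured)
print("P(at least 65 cured):", prob_at_least_65)
```

Under these assumptions the second number comes out above 99.9%, in line with the very high confidence quoted for this macro property of the data.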

So that's a pretty accurate prediction of what's going to happen. Statistics is the opposite. It says, well, I just know that 78 out of 100 were cured. I have only one data set. I cannot make predictions for all data sets. But I can go back to the probability, make some inference about what my probability will look like, and then say, OK, then I can make those predictions later on.

So when I start with 78/100 then maybe I'm actually, in this case, I just don't know. My best guess here is that-- I have to add the extra error that I'm making by predicting that here the drug is not 80% effective but 78% effective. And I need some error bars around this that will hopefully contain 80%, and then based on those error bars I'm going to make slightly less precise predictions for the future.
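
One common way to put error bars around 78/100 is a Gaussian-approximation interval. The lecture has not introduced any interval at this point, so the sketch below is only an illustration: the 95% level and the formula used are assumptions, not the method of the course.

```python
from math import sqrt
from statistics import NormalDist

cured, n = 78, 100
p_hat = cured / n

level = 0.95  # confidence level, an assumption; the lecture does not fix one here
z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # two-sided Gaussian quantile

# Gaussian approximation: p_hat plus or minus z * sqrt(p_hat * (1 - p_hat) / n).
half_width = z * sqrt(p_hat * (1 - p_hat) / n)
lower, upper = p_hat - half_width, p_hat + half_width
print(round(lower, 3), round(upper, 3))
```

With these choices the interval stretches several points on each side of 78% and does contain 80%.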

So, to conclude, so this was, why statistics? So what is this course about? It's about understanding the mathematics behind statistical methods. It's more of a tool. We're not going to have fun and talk about algebraic geometry just for fun in the middle of it. So it justifies quantitative statements given some modeling assumptions, that we will, in this class, mostly admit that the modeling assumptions are correct.

In the first part-- in this introduction, we will go through them because it's very easy to forget what assumptions we're actually making. But this will be a pretty standard thing. The words you will hear a lot are IID-- independent and identically distributed-- that means that your data is basically all the same. And one data point is not impacting another data point.

Hopefully we can describe some interesting mathematics arising in statistics. You know, if you've taken linear algebra, maybe we can explain to you why. If you've done some calculus, maybe we can do some interesting calculus. We'll see how in the spirit of applied math those things answer interesting questions.

And basically we'll try to carve out a math toolbox that's useful for statistics. And maybe you can extend it to more sophisticated methods that we did not cover in this class. In particular in the machine learning class, hopefully you'll be able to have some statistical intuition about what is going on.

So what this course is not about, it's not about spending a lot of time looking at data sets, and trying to understand some statistical thinking kind of questions. So this is more of an applied statistical perspective on things, or more modeling. So I'm going to typically give you the model. And say this is a model. And this is how we're going to build an estimator in the framework of this model.

So for example, 15.075, to a certain extent, is called "Statistical Thinking and Data Analysis." So I'm hoping there is some statistical thinking in there. We will not talk about software implementation. Unfortunately, there's just too little time in a semester. There's other courses that are giving you some overview. So the main software these days-- R is the leading software, I'd say, in statistics, both in academia and industry, lots of packages, one every day that's probably coming out.

But there's other things, right, so now Python is probably catching up with all these scikit-learn packages that are coming up. Julia has some statistics in there, but really, if you were to learn a statistical software, let's say you love doing this, this would be the one that would prove most useful for you in the future. It does not scale super well to high dimensional data.

So there is a class at IDSS that actually uses R. It's called IDS.012, I think it's called "Statistics, Computation, and Applications," or something like this. I'm also preparing, with Peter Kempthorne, a course called "Computational Statistics." It's going to be offered this Spring as a special topics. And so Peter Kempthorne will be teaching it. And this class will actually focus on using R. And even beyond that, it's not just going to be about using. It's going to be about understanding-- just the same way we're going to see how math helps you do statistics, it's going to help see how math helps you do algorithms for statistics.

All right, so we'll talk about the maximum likelihood estimator. We'll need to maximize some function. There's an optimization toolbox to do that. And we'll see how we can have specialized for statistics for that, and what are the principles behind it. And you know, of course, if you've taken AP stats you probably think that stats is boring to death because it was just a long laundry-list that spent a lot of time on t-test. I'm pretty sure we're not going to talk about t-test, well, maybe once. But this is not a matter of saying you're going to do this. And this is a slight variant of it. We're going to really try to understand what's going on.

So, admittedly, you have not chosen the simplest way to get an A in statistics on campus. All right, this is not the easiest class. It might be challenging at times, but I can promise you that you will maybe suffer. But you will learn something by the time you're out of this class. This will not be a waste of your time. And you will be able to understand, and not having to remember by heart how those things actually work.

Are there any questions?

Anybody want to go to other stats class on campus? Maybe it's not too late. OK.

So let's do some statistics. So I see the time now and it's 11:56, so we have another 30 minutes. I will typically give you a three, four minute break if you want to stretch, if you want to run to the bathroom, if you want to check your texts or Instagram. There was very little content in this class, hopefully it was entertaining enough that you don't need the break. But just in the future, so you know you will have a break.

So statistics, this is how it starts, I'm French, what can I say I need to put some French words. So this is not how office hours are going to go down.

Anybody know this sculpture by Rodin, The Kiss? Maybe probably The Thinker is more famous. But this is actually a pretty famous one. But is it really this one, or is it this one?

Anybody knows which one it is?

This one? Or this one?

AUDIENCE: The previous.

PHILIPPE RIGOLLET: What's that?

AUDIENCE: This one.

PHILIPPE RIGOLLET: It's this one.

AUDIENCE: Final answer.

PHILIPPE RIGOLLET: Yeah, who votes for this one. OK. Who votes for that one? Thank you. I love that you do not want to pronounce yourself with no data actually to make any decision. This is a total coin toss right. Turns out that there is data, and there is in the very serious journal Nature, someone published a very serious paper which actually looks pretty serious.

If you look at it, it's like "Human Behavior: Adult persistence of head-turning asymmetry," there's a lot of fancy words in there. And this, I'm not kidding you, this study is about collecting data of people kissing, and knowing if they bend their head to the right or if they bend their head to the left. And that's all it is. And so a neonatal right-side preference makes a surprising romantic reappearance in later life. There's an explanation for it.

All right, so if we follow this Nature which one is the one.

This one? Or this one?

This one, right? Head to the right. And to be fair, for this class I was like, oh, I'm going to go and show them what Google Images does. When you Google kissing couple, it's inappropriate after maybe the first picture. And so I cannot show you this. But you know you can check for yourself.

Though I would argue, so this person here actually went out in airports and took pictures of strangers kissing and collected data. And can somebody guess why he did not just stay home and collect data from Google Images by just googling kissing couples? What's wrong with this data? I didn't know actually before I actually went on Google Images.

AUDIENCE: It can be altered?

PHILIPPE RIGOLLET: What was that?

AUDIENCE: It can be altered.

PHILIPPE RIGOLLET: It can be altered. But, you know, who would want to do this? I mean there's no particular reason why you would want to flip an image before putting it out there. I mean, you might, but you know maybe they want to hide the brand of your Gap shirt or something.

AUDIENCE: I guess the people who post pictures of themselves kissing on Google Images are not representative of the general population.

PHILIPPE RIGOLLET: Yeah, that's very true. And actually it's even worse than that. The people who post pictures are not posting pictures of themselves; they're putting up pictures of the people that they took a picture of. And there usually is a stock watermark on this. And it's basically stock images. Those are actors, and so they've been directed to kiss and this is not a natural thing to do. And actually, if you go to Google Images-- and I encourage you to do this, unless you don't want to see inappropriate pictures, and they're mightily inappropriate.

And basically you will see that this study is actually not working at all. I mean, I looked briefly. I didn't actually collect numbers. But I didn't find a particular tendency to bend right. If anything, it was actually probably the opposite. And it's because those people were directed to do it. They just don't actually think about doing it.

And also because I think you need to justify writing in your paper more than, I sat in front of my computer. So again, this first sentence here, a neonatal right-side preference-- "is there a right side preference?" is not a mathematical question. But we can start saying, let's blah, and put some variables, and ask questions about those variables. So you know x is actually not a variable that's used very much in statistics for parameters. But p is one, for parameter.

And so you're going to take your parameter of interest, p, which here is going to be the proportion of couples. And that's among all couples. So here, if you talk about statistical thinking, there would be a question about what population this would actually be representative of.

Usually this is a call to your-- sorry, I should not forget this word, it's important for you. OK, I forget this word. So this is-- OK.

So if you look at this proportion, maybe these couples that are in the study might be representative only of couples in airports. Maybe they actually put on a show for the other passengers. Who knows? You know, like, oh, let's just do it as well. And just like the people in Google Images they are actually doing it. So maybe you want to just restrict it. But of course clearly if it's appearing in Nature, it should not be only about couples in airports. It's supposedly representative of all couples in the world.

And so here let's just keep it vague, but you need to keep in mind what population this is actually making a statement about. So you have this full population of people in the world. Right, so those are all the couples. And this person went ahead and collected data about a bunch of them.

And we know that, in this thing, there's basically a proportion of them, that's like p, and that's the proportion of them that's bending their head to the right. And so everybody on this side is bending their head to the right. And hopefully we can actually sample this thing uniformly. That's basically the process that's going on.

So this is the statistical experiment. We're going to observe n kissing couples. So here we're going to put as many variables as we can. So we don't have to stick with numbers. And then we'll just plug in the numbers. n kissing couples, and n is also, in statistics, by the way, n is the size of your sample 99.9% of the time. And collect the value of each outcome.

So we want numbers. We don't want right or left. So we're going to code them by 0 and 1, pretty naturally. And then we're going to estimate p, which is unknown. So p is this area. And we're going to estimate it simply by the proportion of rights. So the proportion of crosses that actually fell in the right side.

So in this study what you will find is that the numbers that were collected were 124 couples, and that, out of those 124, 80 of them turned their head to the right. So, p hat is a proportion. How do we do it? Well, you don't need statistics for that. You're going to see 80 divided by 124. And you will find that in this particular study 64.5% of the couples were bending their heads to the right. That's a pretty large number, right?

The question is, if I picked another 124 couples, maybe at different airports, different times, would I see the same number? Would this number be all over the place? Would it be sometimes very close to 120, or sometimes close to 10? Or would it be-- is this number actually fluctuating a lot?

And so, hopefully not too much, 64.5 percent is definitely much larger than 50%. And so there seems to be this preference. Now we're going to have to quantify how much of this preference. Is this number significantly larger than 50%? So if our data, for example, was just three couples. I'm just going there, I'm going to Logan. I call it, I do right, left right.

And then I see-- see what's the name of the fish place there? I go to Wahlburgers at Logan and I'm like, OK, I'm done for the day. I collect this data. I go home, and I'm like, wow, 66.7% to the right. That's a pretty big number. It's even farther from 50% than this other guy. So I'm doing even better.

But of course you know that this is not true. Three people is definitely not representative. If I stopped at the first one, I would have actually-- at the first two, I would have even 100%.

So the question that statistics is going to help us answer is, how large should the sample be? For some reason, I don't know if you guys receive this, I'm an affiliate with the Broad Institute, and since then I receive one email per day that says, sample size determination-- how large should your sample be? Like, I know how large my sample should be. I've taken 18.650 multiple times.

And so I know, but the question is-- is 124 a large enough number or not? Well, the answer is actually, as usual, it depends. It will depend on the true unknown value of p. But from those particular values that we got, so 120 and-- how many couples was there? 80? We actually can make some question.

So here we said that 80 was larger than 50-- was allowing us to conclude at 64.5%. So it could be one reason to say that it was larger than 50%. 50% of 124 is 62.

So the question is, would I be willing to make this conclusion at 63? Is that a number that would convince you? Who would be convinced by 63? Who would be convinced by 72? Who would be convinced by 75? Hopefully the number of hands that are raised should grow. Who would be convinced by 80?

All right, so basically those numbers actually don't come from anywhere. This 72 would be the number that you would need for a study-- most statistical studies would be the number that they would retain. That's for 124. You would need to see 72 that turn their head right to actually make this conclusion. And then 75--

So we'll see that there's many ways to come to this conclusion because, as you can see, this was published in Nature with 80. So that was OK. So 80 is actually a very large number. This is 99 point-- this 99% -- no, so this is 95% confidence.

This is 99% confidence. And this is 99.9% confidence. So if you said 80 you're a very conservative person. Starting at 72, you can start making this conclusion.
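
The lecture does not say at this point how 72, 75 and 80 were obtained. One computation that happens to reproduce them is a one-sided Gaussian approximation under "no preference" (p = 1/2). The sketch below works under that assumption and should not be read as the derivation used in the course.

```python
from math import ceil, sqrt
from statistics import NormalDist

n = 124
mean_null = n * 0.5            # 62 right-turners expected if there is no preference
sd_null = sqrt(n * 0.5 * 0.5)  # standard deviation of the count when p = 1/2

thresholds = {}
for level in (0.95, 0.99, 0.999):
    z = NormalDist().inv_cdf(level)  # one-sided Gaussian quantile
    # Smallest whole number of right-turners beyond mean + z standard deviations.
    thresholds[level] = ceil(mean_null + z * sd_null)

print(thresholds)
```

Raising the confidence level pushes the threshold further away from 62, which is why 80 is the choice of a very conservative person.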

To understand this, we need to do our little mathematical kitchen here, and we need to do some modeling. So we need to understand by modeling-- we need to understand what random process we think this data is generated from. So it's going to have some unknown parameters, unlike in probability. But we need to have just basically everything written except for the values of the parameters.

When I said a die is coming uniformly with probability 1/6 then I need to have, say maybe with probability-- maybe I should say here are six numbers, and I need to just fill those numbers.

So for i equal 1 to n, I'm going to define Ri to be the indicator. An indicator is just something that takes value 1 if something is true, and 0 if not. So it's an indicator that the i-th couple turns their head to the right. So, Ri, so it's indexed by i. And it's 1 if the i-th couple turns their head to the right, and 0 if it's-- well actually, I guess they can probably kiss straight, right? So that would be weird, but they might be able to do this. So let's say not right.

Then the estimator of p, we said, was p hat. It was just the ratio of two numbers. But really what it is is I count, I sum those Ri's. Since I only add those that take value 1, what this is is-- this sum here is actually just counting the number of 1's. Which is another way to say it's counting the number of couples that are kissing to the right.

And here I don't even have to tell you anything about the numbers or anything. I can only keep track of-- first couple is a 0 second couple is a 1, third couple is 0. The data set-- you can actually find it online-- is actually a sequence of 0's and 1's. Now clearly for the question that we're asking about this proportion, I don't need to keep track of all this information. All I need to keep track of is the number of 0's and the number of 1's. Those are completely interchangeable. There's no time effect in this. The first couple is no different than the 15th couple.

So we call this Rn bar. That's going to be a very standard notation that we use. R might be replaced by other letters like x-- so xn bar, yn bar. And this thing essentially means that I average the R's, or the Ri's over n of them. And the bar means the average. So I divide by n the total number of 1's. So here this sum was equal to 80 in our example and n was equal to 124.
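
As a concrete picture of Rn bar, here is a minimal sketch. The list `R` is a made-up sequence with the same totals as the study (80 ones out of 124); the actual order of the observations is not reproduced here.

```python
import random

# Hypothetical 0/1 records: 1 if the couple turned right, 0 otherwise.
# The counts (80 ones out of 124) match the study; the order is invented.
R = [1] * 80 + [0] * 44
random.seed(0)
random.shuffle(R)

n = len(R)
R_bar = sum(R) / n  # summing 0/1 values counts the ones
print(n, sum(R), round(R_bar, 3))

# Reordering the couples leaves the average unchanged: only the counts matter.
assert sum(sorted(R)) / n == R_bar
```

The final check is the "no time effect" remark in code form: any permutation of the sequence gives the same average.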

Now this is an estimator. So an estimator is different from an estimate. An estimate is a number. My estimate was 64.5%. My estimator is this thing where I keep all the variables free. And in particular, I keep those variables to be random because I'm going to think of a random couple kissing left or right as the outcome of a random process, just like flipping a coin and getting heads or tails.

And so this thing here is a random variable, Ri. And this average is, of course, an average of random variables. It's itself a random variable. So an estimator is a random variable. An estimate is the realization of a random variable, or, in other words, is the value that you get for this random variable once you plug in the numbers that you've collected.

So I can talk about the accuracy of an estimator. Accuracy means what? Well, what would we want for an estimator? Maybe we don't want it to fluctuate too much. It's a random variable. So I'm talking about the accuracy of a random variable. So maybe I don't want it to be too volatile.

I could have one estimator which would be-- just throw out 122 couples, keep only 2 and average those two numbers. That's definitely a worse estimator than keeping all of the 124. So I need to find a way to say that. And what I'm going to be able to say is that the number is going to be fluctuating. If I take another two couples, I'm probably going to get a completely different number. But if I take another 124 couples two days later, maybe I'm going to have a number that's very close to 64.5%.
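
A quick simulation makes the difference in volatility visible. This is only a sketch: it assumes couples behave like independent coin flips with a "true" proportion `p_true`, a value picked for the experiment and not known in reality.

```python
import random
from statistics import pstdev

random.seed(0)
p_true = 0.645  # hypothetical true proportion, chosen only for the simulation

def sample_proportion(n):
    """Average of n independent 0/1 draws, each equal to 1 with probability p_true."""
    return sum(random.random() < p_true for _ in range(n)) / n

repeats = 2000  # number of imaginary repetitions of the whole study
small = [sample_proportion(2) for _ in range(repeats)]
large = [sample_proportion(124) for _ in range(repeats)]

spread_small = pstdev(small)
spread_large = pstdev(large)
print("spread with n = 2:  ", round(spread_small, 3))
print("spread with n = 124:", round(spread_large, 3))
```

With two couples the estimate can only be 0, 1/2 or 1, so it jumps around a lot; with 124 couples the values cluster much more tightly.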

So that's one way. The other thing we would like about this estimator it's actually-- maybe it's not too volatile-- but also we want it to be close to the number that we're looking for. Here is an estimator. It's a beautiful variable. 72%, that's an estimator. Go out there just do your favorite study about drug performance. And then they're going to call you, MIT student taking statistics, they say, so how are you going to build your estimator? We've collected those 5,000 or something like that.

I'm just going to spit out 72%. Whatever the data says, that's an estimator. It's a stupid estimator but it is an estimator. But this estimator is very not volatile. Every time you're going to have a new study, even if you change fields, it's still going to be 72%. This is beautiful. And the problem is that's probably not very close to the value you're actually trying to estimate.

So we need two things. We need our estimator to be a random variable. So think in terms of densities. We want the density to be pretty narrow. We want this thing to have very little-- so this is definitely better than this. But also, we want the number that we're interested in, p, to be very close to this-- to be close to the values that this thing is likely to take. If p is here, this is not very good for us.

So that's basically the things we're going to be looking at. The first one is referred to as variance. The second one is referred to as bias. Those things come all over in statistics.
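
The two estimators from the discussion above, the average and the constant 72%, can be compared on both criteria by simulation. This sketch assumes independent 0/1 observations and a hypothetical true value `p_true = 0.5`, chosen so that the constant answer is visibly off.

```python
import random
from statistics import mean, pvariance

random.seed(1)
p_true = 0.5  # hypothetical true value, picked for illustration only
n, repeats = 124, 2000

def average_estimator(data):
    return sum(data) / len(data)

def constant_estimator(data):
    return 0.72  # ignores the data entirely

def simulate(estimator):
    """Return (bias, variance) of the estimator over many simulated studies."""
    values = []
    for _ in range(repeats):
        data = [1 if random.random() < p_true else 0 for _ in range(n)]
        values.append(estimator(data))
    return mean(values) - p_true, pvariance(values)

bias_avg, var_avg = simulate(average_estimator)
bias_const, var_const = simulate(constant_estimator)
print("average : bias", round(bias_avg, 3), "variance", round(var_avg, 5))
print("constant: bias", round(bias_const, 3), "variance", round(var_const, 5))
```

The constant estimator has no variance at all but carries whatever bias the true p leaves it with, while the average trades a little variance for essentially no bias.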

So we need to understand a model. So here's the model that we have for this particular problem. So we need to make assumptions on the observations that we see. So we said we're going to assume that the random variable-- that's not too much of a leap of faith. We're just sweeping under the rug everything we don't understand about those couples.

And the assumption that we make is that Ri is a random variable. This one you will forget very soon. The second one is that each of the Ri's is-- so it's a random variable that takes value 0 or 1. Anybody can suggest the distribution for this random variable?

AUDIENCE: Bernoulli.

PHILIPPE RIGOLLET: What?

AUDIENCE: Bernoulli.

PHILIPPE RIGOLLET: Bernoulli, right? And it's actually beautiful. This is where you have to do the least statistical modeling. A random variable that takes value 0 or 1 is always a Bernoulli. That's the simplest variable you can ever think of. Any variable that takes only two possible values can be reduced to a Bernoulli. OK, so this is a Bernoulli.

And here we make the assumption that it actually takes parameter p. And there's an assumption here. Anybody can tell me what the assumption is?

AUDIENCE: It's the same.

PHILIPPE RIGOLLET: Yeah, it's same, right? I could have said p i, but it's p. And that's where I'm going to be able to start getting to do some statistics. It's that I'm going to start to be able to pull information across all my guys. If I assume that they're all pi's completely uncoupled with each other. Then I'm in trouble. There's nothing I can actually get.

And then I'm going to assume that those guys are mutually independent. And most of the time we will just say independent. Meaning that it's not like all these guys called each other and it's actually a flash mob. And they were like, let's all turn our left side to the left. And then this is definitely not going to give you a valid conclusion.

So, again, randomness is a way of modeling lack of information. Here there is a way to figure it out. Maybe I could have followed all those guys, and knew exactly what they were-- maybe I could have looked at pictures of them in the womb and guessed how they were turning-- by the way, that's one of the conclusions: they're guessing that we turn our head to the right because our head is turned to the right in the womb. So we don't know what goes on in the kissers' minds. And there's, you know, physics, sociology. There's a lot of things that could help us, but it's just too complicated to keep track of, or too expensive in many instances.

Now again, the nicest part of this modeling was the fact that the Ri's take only two values, which means that this conclusion that they were Bernoulli was totally free for us. Once we know it's a random variable, it's a Bernoulli. Now, as we said, they could have been Bernoulli with parameter p i.

For each i, I could have put a different parameter, but I just don't have enough information. What would I have said? I would say, well the first couple turned to the right. p1 has to be one, that's my best guess. The second couple kiss to the left, well, p2 should be 0, that's my best guess.

And so basically I need to be able to average my information. And the way I do it is by coupling all these guys, the p i's, to be the same p for all i. OK, does it make sense? Here what I am assuming is that my population is homogeneous. Maybe it's not. Maybe I could actually look at a finer grain, but I'm basically making a statement about a population.
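The contrast between one parameter per couple and a single shared parameter can be sketched in a few lines. This is an illustration under assumptions: the data are simulated, and `P_TRUE` is a hypothetical value rather than anything measured in the study.

```python
import random

rng = random.Random(1)
P_TRUE = 0.65  # assumed common parameter, for illustration only
N = 124        # number of couples in the example

# Simulated observations: 1 if the couple turns right, 0 otherwise.
r = [1 if rng.random() < P_TRUE else 0 for _ in range(N)]

# One parameter per couple: with a single observation each, the best guess
# for p_i is R_i itself, so every estimate is either 0 or 1.
p_hat_individual = list(r)

# One shared parameter: all observations are averaged into one estimate.
p_hat_pooled = sum(r) / N

print("distinct individual estimates:", sorted(set(p_hat_individual)))
print("pooled estimate:", round(p_hat_pooled, 3))
```

With separate parameters no averaging is possible, while the homogeneity assumption turns 124 single observations into one estimate backed by all of them.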

And so maybe you kiss to the left, and then you're not-- I'm not making a statement about a person individually, I'm making a statement about the overall population.

Now independence is probably reasonable, right? This person just went and-- you know, we can seriously hope that these couples did not communicate with each other. Or that, you know, Tanya did not text that we should all turn our head to the left now. And there's no external stimulus that forces people to do something different.

OK, so-- sorry about that. Since we have less than 10 minutes, let's do a few exercises, is that OK with you? So I just have some exercises so we can see what an exercise is going to look like. This is sort of similar to the exercises you will see with me. We should do it together, OK?

So now we're going to have-- I have a test. So that's an exam in probability. OK. And I'm going to have 15 students in this test. And hopefully, this should be 15 grades that are representative of the grades of a large class.

Right, so if you go to, you know, 18.600, it's a large class, there's definitely more than 15 students. And maybe, just by sampling 15 students at random, I want to have an idea of what my grade distribution will look like. I'm grading them, I want to make an educated guess.

So I'm going to make some modeling assumptions for those guys. So here, 15 students and the grades are x1 to x15. Just like we had R1, R2, all the way to R124. Those were my Ri's. And so now I have my xi's. And I'm going to assume that xi follows a Gaussian or normal distribution with mean mu and variance sigma squared.

Now this is modeling, right? Nobody told me this. There's no physical process that makes this happen. We know that there's something called the central limit theorem in the background that says that things tend to be Gaussian, but this is really a matter of convenience.

Actually this is, if you think about it, this is terrible because this puts non-zero probability on negative scores. I'm definitely not going to get a negative score. But you know it's good enough because, you know, the probability is non-zero but it's probably 10 to the minus 12. So I would be very unlucky to see a negative score.
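The probability that a Gaussian model puts on negative scores can be computed directly. The sketch below uses the standard library's error function. The values of `MU` and `SIGMA` are the estimates quoted later in this lecture and are plugged in only as an illustration, since the order of magnitude of the result depends strongly on both.

```python
import math


def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


MU = 67.5     # assumed mean, taken from the estimate quoted in the lecture
SIGMA = 18.0  # assumed standard deviation, also quoted in the lecture

p_negative = normal_cdf(0.0, MU, SIGMA)
print(f"P(score < 0) under N({MU}, {SIGMA}^2) = {p_negative:.2e}")
```

Trying a smaller `SIGMA` shows how quickly this tail probability shrinks as the distribution tightens around its mean.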

So here's the list of grades, so I have 65, 41, 70, 90, 58, 82, 76, 78-- maybe I should have done it with 8 --59, 59-- sitting next to each other --84, 89, 134, 51, and 72.

So those are the scores that I got. There were clearly some bonus points over there. And the question is, find an estimator for mu. What is my estimator for mu? Well, an estimator, again, is something that depends on the random variables. All right, so mu is the expectation, right? So a good estimator is definitely the average score, just like we had the average of the Ri's.

Now the xi's no longer need to be 0's and 1's, so it's not going to boil down to being the number of ones divided by the total number. Now if I'm looking for an estimate, well, I need to actually sum those numbers and divide them by 15. So my estimate is going to be 1/15 times-- then I'm going to start summing those numbers-- 65 plus, all the way to, 72. OK, and I can do it, and it's 67.5. This is my estimate.

Now if I want to compute a standard deviation-- so let's say an estimate for sigma. You've seen that before, right? An estimate for sigma is what? An estimate for sigma, we'll see methods to do this, but sigma squared is the variance, which is the expectation of the square of x minus the expectation of x.

And the problem is that I don't know what those expectations are. And so I'm going to do what 99.9% of statistics is. And what is statistics about? What's my motto? Statistics is about replacing expectations with averages. That's what all of statistics is about. There's 300 pages in a purple book called All of Statistics that tells you this. All right, and then you do something fancy. Maybe you minimize something after you replace the expectation. Maybe you need to plug in other stuff. But really, every time you see an expectation, you replace it by an average.

OK, let's do this. So sigma squared hat will be what? It's going to be 1 over n, sum from i equals 1 to n of xi minus-- well, here I need to replace my expectation by an average, which is really this average, I'm going to call it mu hat-- and the whole thing squared.
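Written out, the definition of the variance and the two plug-in estimators described in words here read as follows. The notation is an assumption: capital X for the random variable and lowercase x_i for the observed grades.

```latex
\sigma^2 = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big],
\qquad
\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} \big(x_i - \hat{\mu}\big)^2 .
```

Both expectations in the definition of sigma squared have been replaced by averages over the sample: the inner one by mu hat, the outer one by the 1/n sum.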

There, I have replaced my expectation with an average. OK, so the golden thing is, take your expectation and replace it with this. Frame it, get a tattoo, I don't care, but that's what it is. If you remember one thing from this class, that's what it is.

Now you can be fancy. If you look at your calculator, it's going to put an n minus 1 here because it wants to be unbiased. And those are things we are going to come to. But let's say right now we stick to this. And then when I plug in my numbers, I'm going to get an estimate for sigma, which is the square root of the estimator once I plug in the numbers. And you can check that the number you get will be 18.
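The plug-in computation can be sketched in a few lines of Python. The list below is copied from this transcript, so the printed values need not match the figures quoted in the lecture if the transcribed list differs from the one shown in class. The variable names are illustrative.

```python
import math

# Grades as they appear in the transcript (15 values).
grades = [65, 41, 70, 90, 58, 82, 76, 78, 59, 59, 84, 89, 134, 51, 72]
n = len(grades)

# Replace the expectation by an average.
mu_hat = sum(grades) / n

# Replace both expectations in the variance by averages (divide by n).
sum_sq = sum((x - mu_hat) ** 2 for x in grades)
sigma_hat = math.sqrt(sum_sq / n)

# Calculator convention: divide by n - 1 instead of n.
sigma_hat_n_minus_1 = math.sqrt(sum_sq / (n - 1))

print(f"mu_hat              = {mu_hat:.2f}")
print(f"sigma_hat (1/n)     = {sigma_hat:.2f}")
print(f"sigma_hat (1/(n-1)) = {sigma_hat_n_minus_1:.2f}")
```

The n minus 1 version is always slightly larger than the 1/n version, and the gap shrinks as n grows.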

So those are basic things and if you've taken any AP stats this should be completely standard to you.

Now I have another list, and I don't have time to see it. It doesn't really matter. OK, we'll do that next time. This is fine. We'll see another list of numbers and see-- we're going to think about modeling assumptions. The goal of this exercise is not to compute those things, it's really to think about modeling assumptions. Is it reasonable to think that things are IID? Is it reasonable to think that they have all the same parameters, that they're independent, et cetera?

OK, so one thing that I wanted to add is, probably by tonight, so I will try to use-- in the spirit of-- I don't know what's starting to happen. In the spirit of using my iPad and fancy things, I will try to post some videos-- in particular, who has never used a statistical table to read, say, the quantiles of a Gaussian distribution?

OK, so there's several of you. This is a simple but boring exercise. I will just post a video on how to do this, and you will be able to find it on Stellar. It's going to take five minutes, and then you will know everything there is to know about those things, but that's something you need for the first problem set.

By the way, so the problem set has 30 exercises in probability. You need to do 15. And you only need to turn in 15. You can turn in all 30 if you want. But you need to know, by the time we hit those things, you need to know-- well actually, by next week you need to know what's in there.

So if you don't have time to do all the homework, and then go back to your probability class to figure out how to do it, just do 15 easy ones that you can do. And return those things. But go back to your probability class and make sure that you know how to do all of them. Those are pretty basic questions, and those are things that I'm not going to slow down on. So you need to remember that the expectation of the product of independent random variables is the product of the expectations. The expectation of the sum is the sum of the expectations. This kind of thing, which is a little silly, but it just requires practice. So, just have fun. Those are simple exercises. You will have fun remembering your probability class.
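These two identities are easy to check numerically. The sketch below is a Monte Carlo illustration, not a proof, and the two distributions are arbitrary choices made for the example. It also shows that the product rule fails without independence, by using X times X.

```python
import random

rng = random.Random(0)
TRIALS = 200_000

# Two independent samples (arbitrary distributions chosen for illustration).
xs = [rng.uniform(0.0, 1.0) for _ in range(TRIALS)]  # X ~ Uniform(0, 1)
ys = [rng.gauss(2.0, 1.0) for _ in range(TRIALS)]    # Y ~ N(2, 1)


def mean(values):
    return sum(values) / len(values)


e_x, e_y = mean(xs), mean(ys)
e_sum = mean([x + y for x, y in zip(xs, ys)])
e_prod = mean([x * y for x, y in zip(xs, ys)])
e_x_squared = mean([x * x for x in xs])  # X is not independent of itself

print(f"E[X+Y] = {e_sum:.4f}  vs  E[X]+E[Y] = {e_x + e_y:.4f}")
print(f"E[XY]  = {e_prod:.4f}  vs  E[X]E[Y]  = {e_x * e_y:.4f}")
print(f"E[X*X] = {e_x_squared:.4f}  vs  E[X]^2    = {e_x ** 2:.4f}")
```

The sum rule holds up to floating-point rounding for any pair of variables, the product rule holds up to sampling noise for the independent pair, and the last line shows a clear gap when independence is missing.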

All right, so I'll see you on Tuesday-- or Monday.
