The following is copied (read: re-blogged) from this link. It is good to be reminded of the concepts that so often lie at the core of day-to-day statistical analyses.
The Ten Best Ideas in Statistics
I’ve been studying Statistics for six years now, seriously for the last four years, and as my main focus for the last three. Now that I’ve finished the core PhD curriculum at Stanford, I’ve spent some time reflecting on the best ideas I’ve learned in Probability and Statistics over the years. I’ve compiled a list of brilliant and beautiful ideas, ones that I’m still impressed with every time I think about them.
The ideas here are all classical (the most recent is from perhaps a hundred years ago), but they remain foundational in the field rather than obsolete. Some of the ideas here build on previous ones, so their position in the list below reflects this partial order, and not necessarily my opinion of their relative importance. In fact, all of these ideas are simple, useful, and incredibly insightful, though many of them are relatively unknown or misunderstood outside of Statistics. Without further ado:
1. P-Values
Suppose you were part of the physics team that discovered the Higgs Boson. Problem was, your experiment produced data with a ton of noise. People were therefore skeptical of you, and thought that the supposed “particle” you claimed to see might just have been a funny pattern in some random noise. How could you convince them that it’s not? A good strategy for arguing your point would be to say, “Look, suppose you’re right, and the patterns in my data really are just from random noise. Then how would you explain the fact that random noise very rarely produces patterns like this?”
P-values quantify this argument. In the example above, the p-value would be the probability that you would see a pattern so indicative of a new particle if you really were just looking at random noise (which, in the case of the Higgs Boson, was below 0.000001). More generally, a p-value is the probability that you would observe something as extreme as or more extreme than what you actually did observe, supposing the null hypothesis that “nothing interesting is going on”. (Of course, the definitions of “extreme” and “nothing interesting is going on” depend on the underlying problem.) Thus small p-values indicate that whatever you saw would be very unlikely under the null (“nothing interesting”) hypothesis, and so suggest that the null hypothesis is implausible. For the more mathematically inclined, you can think of p-values as sort of quantifying a proof by contradiction argument, where we begin by assuming the null hypothesis is true, and show how it leads to an implausibility.
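As a rough illustration of the definition above, here is a minimal sketch that computes an exact two-sided p-value for a coin-flipping experiment. The null hypothesis is that the coin is fair, and “as extreme or more so” is taken to mean “at least as unlikely under the null as what was observed”, which is one common convention and an assumption of this sketch. The function name and the example counts are made up for illustration.

```python
from math import comb

def binomial_p_value(heads: int, flips: int, p_null: float = 0.5) -> float:
    """Two-sided exact p-value for a coin-flip experiment.

    Sums the null probabilities of every outcome that is at least as
    unlikely as the observed one.
    """
    def prob(k: int) -> float:
        return comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)

    observed = prob(heads)
    # Small tolerance so that exact ties are not lost to floating-point rounding.
    return sum(prob(k) for k in range(flips + 1) if prob(k) <= observed * (1 + 1e-9))

p_typical = binomial_p_value(52, 100)   # unremarkable under a fair coin
p_extreme = binomial_p_value(65, 100)   # hard to explain with a fair coin
print(f"52 heads out of 100: p = {p_typical:.3f}")
print(f"65 heads out of 100: p = {p_extreme:.4f}")
```

The first p-value is large, so 52 heads is easily explained by noise alone; the second is small, so a fair coin is an implausible explanation for 65 heads.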
I was super-impressed when I learned about p-values, all the way back in AP Stat in high school. The reason I like them so much is that they abstract away most of the specifics of the problem under consideration. So I can read any scientist’s work, not knowing anything about their field, and still understand what they’re saying when they quote a p-value. And I can find any statistical test, not knowing anything about how it works internally, and still know how to interpret the p-value it gives me. This level of generality is relatively rare. You’ll see it again in some of the later sections here, because those ideas are also exceptional, but it’s a gem within Statistics.
P-values are absolutely ubiquitous in Statistics and the social and natural sciences. For many scientific journals, obtaining a low p-value is essentially a requirement for getting published. In recognition of this, I’ve heard p-values called the “Statistician’s stamp of approval” (perhaps a little unfairly).
Further reading: Wikipedia
2. Confidence Intervals
When you estimate a statistical parameter from some data, you can’t be certain about what the true value of that parameter is. If you have a lot of high-quality data, then you’re more confident that your estimate is near its true value, but if you don’t have very much data, or if it’s of poor quality, then you don’t have much confidence in it.
Confidence intervals quantify these observations. Suppose you have an experiment that produces some noisy data. A confidence interval is an interval computed from your data, and is therefore random because the data are generated with randomness. The defining feature of a 95% confidence interval is that if you repeated the experiment many times, sampling new random data each time, then the computed 95% confidence interval would cover the true parameter 95% of the time. (You can of course also compute confidence intervals with confidence levels other than 95%.) So, as you’d expect, with more data, confidence intervals get smaller, with lower quality data they get larger, and if you require lower confidence they get smaller.
Considered as a random interval then, the probability that the confidence interval contains the true parameter is 95%. But a subtle point is that once a confidence interval has actually been computed, giving you fixed numbers like (5.2, 6.8), nothing random remains: the parameter of interest either is in that interval or is not, so the probability is either 1 or 0, and you don’t know which. So that’s why statisticians say that the parameter is in that interval with 95% confidence, rather than with 95% probability.
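The “repeat the experiment many times” reading can be checked by simulation. The sketch below is only an illustration under simplifying assumptions: the data are normal with a known standard deviation, so the interval is the textbook estimate plus or minus z times sigma over the square root of n. The true mean, sample size, and number of trials are arbitrary made-up values.

```python
import random
from math import sqrt

def coverage(true_mean=6.0, sigma=2.0, n=25, trials=4000, z=1.96, seed=0):
    """Fraction of repeated experiments whose interval covers true_mean."""
    rng = random.Random(seed)
    half_width = z * sigma / sqrt(n)      # known-sigma normal interval
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        estimate = sum(sample) / n
        if estimate - half_width <= true_mean <= estimate + half_width:
            hits += 1
    return hits / trials

cov_95 = coverage()             # z = 1.96 gives a 95% interval
cov_80 = coverage(z=1.2816)     # lower confidence, narrower interval
print(f"95% intervals covered the truth in {cov_95:.1%} of experiments")
print(f"80% intervals covered the truth in {cov_80:.1%} of experiments")
```

Each individual interval either covers the true mean or misses it; the 95% describes the long-run hit rate of the procedure, not any single interval.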
Confidence intervals are an effective and elegant solution to the problem of determining how confident you are in a parameter estimate. You might have seen them in the newspaper when it quoted you the margin of sampling error of a poll, but they’re also used far more widely in the sciences.
Further reading: Wikipedia
3. The Multivariate Normal Distribution
The Multivariate Normal Distribution is a beautiful and useful joint probability distribution for the random variables $X_1, \dots, X_n$, generalizing the univariate normal distribution to $n$ dimensions. It has two parameters, the mean vector $\mu \in \mathbb{R}^n$ and the covariance matrix $\Sigma \in \mathbb{R}^{n \times n}$. As you would imagine, if $X_1, \dots, X_n$ are multivariate normal, then the mean of $X_i$ is $\mu_i$, and the covariance between $X_i$ and $X_j$ is $\Sigma_{ij}$.
It’s valuable and useful for a lot of reasons: it’s one of the simplest models for dependence, and has a ton of wonderfully convenient properties. For example, linear transformations of multivariate normal random variables are multivariate normal (see the Linear Regression section later), sample means of random vectors are approximately multivariate normal (see the Central Limit Theorem section later), and jointly multivariate normal variables are actually independent if they are uncorrelated, an implication that is not generally true.
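The linear-transformation property also gives a standard way to simulate from this distribution: start with independent standard normals z and form mu + L z, where L L^T equals the covariance matrix. The sketch below does this by hand in two dimensions and checks that the sample covariance comes out close to the target. The particular mean vector, covariance matrix, and function names are made up for illustration.

```python
import random
from math import sqrt

def sample_bivariate_normal(mu, sigma, n, seed=0):
    """Draw n points from N(mu, sigma) for a 2x2 covariance matrix sigma.

    Uses X = mu + L z, where z has independent standard normal entries and
    L is the lower-triangular Cholesky factor of sigma (L L^T = sigma).
    """
    rng = random.Random(seed)
    (a, b), (_, c) = sigma
    l11 = sqrt(a)
    l21 = b / l11
    l22 = sqrt(c - l21 * l21)
    points = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        points.append((mu[0] + l11 * z1, mu[1] + l21 * z1 + l22 * z2))
    return points

def sample_covariance(xs, ys):
    """Unbiased sample covariance (the sample variance when xs is ys)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

mu = (1.0, -2.0)
sigma = ((1.0, 0.8), (0.8, 2.0))
pts = sample_bivariate_normal(mu, sigma, n=20000)
xs = [p[0] for p in pts]
ys = [p[1] for p in pts]
est_var_x = sample_covariance(xs, xs)
est_var_y = sample_covariance(ys, ys)
est_cov = sample_covariance(xs, ys)
print(f"var(X1) ~ {est_var_x:.2f}, var(X2) ~ {est_var_y:.2f}, cov ~ {est_cov:.2f}")
```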
Further reading: Wikipedia
4. The Central Limit Theorem
The Central Limit Theorem is a beautiful and often misunderstood theorem about the approximate distribution of sample means. It says that if you sample $n$ independent random variables $X_1, \dots, X_n$ from some common probability distribution with mean $\mu$ and finite variance $\sigma^2$, then for large $n$ the sample mean
$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$ is approximately normally distributed with mean $\mu$ and variance $\sigma^2/n$. That is, if you repeatedly sampled $n$ independent random variables and computed their sample means, the distribution of those sample means would be roughly normal.
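A quick simulation makes the statement concrete. The sketch below draws from an exponential distribution, which is strongly skewed and looks nothing like a normal, and checks that the sample means nevertheless have roughly the mean, the variance, and the 95% central range that the normal approximation predicts. The choice of distribution, sample size, and number of repetitions is arbitrary and only for illustration.

```python
import random
from math import sqrt

def sample_means(n, reps, seed=0):
    """Means of `reps` samples, each made of n draws from Exponential(1)."""
    rng = random.Random(seed)
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(reps)]

n, reps = 50, 5000
mu, sigma2 = 1.0, 1.0                 # mean and variance of Exponential(1)
means = sample_means(n, reps)

grand_mean = sum(means) / reps
var_of_means = sum((m - grand_mean) ** 2 for m in means) / (reps - 1)
sd_clt = sqrt(sigma2 / n)             # standard deviation predicted by the CLT
# A normal distribution puts about 95% of its mass within 1.96 sd of its mean.
within = sum(abs(m - mu) <= 1.96 * sd_clt for m in means) / reps

print(f"mean of sample means: {grand_mean:.3f} (theory {mu})")
print(f"variance of sample means: {var_of_means:.4f} (theory {sigma2 / n})")
print(f"fraction within 1.96 sd: {within:.3f} (normal theory 0.95)")
```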
One of the other PhD students in my cohort made this nice observation about one of the philosophical consequences of the CLT: You’d think, intuitively, that it’s more complicated to deal with lots of random variables than with one, but the CLT says that if you average them it actually gets easier! In fact, asymptotically the only things relevant about the original probability distribution are its mean and variance. The CLT is the most basic asymptotic result, and much of asymptotic theory, including the maximum likelihood theory considered later, is based upon it.
I’ve known about the CLT for five years now, and seen it proved in at least three ways, but I still think it’s amazing that it’s actually true. There really is something very special about the normal distribution. In fact, the multivariate normal distribution is special as well—the analogous theorem for taking the mean of independent, identically distributed random vectors also holds.
Further reading: Amir Dembo’s rigorous probability notes, Wikipedia, and an illustration on Wikipedia
5. Maximum Likelihood Estimation
Suppose you have some data $X$, and a family of probability models for how $X$ is distributed. In particular, suppose you think that $X$ has some density $f_\theta$, where $\theta \in \mathbb{R}^p$. If you observe $X$, how should you estimate $\theta$?
Maximum likelihood estimation (MLE) provides a very cheeky answer: it just says to pick the θ under which the observed data appear most likely. And remarkably, for a large variety of problems this works really well, both in practice and in theory. In theory, one can show (using the Central Limit Theorem) that for most well-behaved problems, the maximum likelihood estimate is asymptotically unbiased, asymptotically multivariate normal, and has the smallest possible asymptotic variance. In practice, the MLE is the de facto standard, so much so that people typically use it whenever they can compute it. Indeed, the parameter estimates in a number of statistical procedures, like linear regression (considered later) and logistic regression, are actually special cases of MLE.
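To make “pick the θ under which the observed data appear most likely” concrete, here is a minimal sketch for the simplest possible model: independent coin flips with an unknown probability of heads. It maximizes the log-likelihood by brute force over a grid and compares the result with the known closed-form answer, the observed proportion of heads. The counts and the grid resolution are made up for illustration; real problems would normally use a numerical optimizer rather than a grid.

```python
from math import log

def log_likelihood(p, heads, tails):
    """Log-likelihood of a Bernoulli(p) model for the observed flips."""
    return heads * log(p) + tails * log(1 - p)

heads, tails = 7, 3
grid = [i / 1000 for i in range(1, 1000)]          # candidate values of p
p_hat = max(grid, key=lambda p: log_likelihood(p, heads, tails))
closed_form = heads / (heads + tails)              # known MLE: sample proportion

print(f"grid-search MLE: {p_hat}, closed form: {closed_form}")
```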
Further reading: Wikipedia
6. Bayesian Statistics
Bayes’s Theorem, at first glance, is a relatively obvious statement in probability, so much so, in fact, that one of my classmates enjoys calling it “Bayes’s Remark”. The theorem says that if A and B are events that could occur, then P(A|B)=P(B|A)P(A)/P(B). Here, P(A|B)=P(A∩B)/P(B) is the probability that A will happen given that B happened, so the theorem is something of a triviality if you just plug in the definitions for P(A|B) and P(B|A).
No, the real insight in Bayesian Statistics is that you can apply Bayes’s theorem when A and B are hypotheses about the world and observed data. Writing this explicitly, we have: P(H|X)=P(X|H)P(H)/P(X), where H is some hypothesis about how the world behaves, and X is the observed data. Bayesians then interpret this statement like this: P(H|X) is the posterior, or how likely H is to be true after seeing the data X. P(H) is the prior, or how likely H is to be true before seeing any data. P(X|H) is the likelihood that X would happen, given that the world obeys the hypothesis H. And P(X) is how likely X is to occur, averaged over all possible hypotheses about the world H.
This is a fundamentally different approach to the statistical inference problem than the frequentist approach considered in the MLE section and the rest of this post. Bayesians are modeling their beliefs about the world, and adjusting those beliefs based upon the observed data, rather than attempting to figure out what version of the world they live in without any prior opinion. Because of this, a number of people really like Bayesian inference as a model of how people ought to adjust their prior beliefs, P(H), in the presence of new data X. Of course, the necessity of this prior P(H) is the major criticism of Bayesian inference—the conclusions that you draw depend to varying degrees on your prior opinions, which can be subjective.
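Here is a small worked example of the formula P(H|X)=P(X|H)P(H)/P(X) with just two hypotheses, a sketch under made-up assumptions: a coin is either fair or biased toward heads with probability 0.8, and we observe 8 heads in 10 flips. It also reruns the calculation with a more skeptical prior to show how much the conclusion can depend on P(H).

```python
from math import comb

def posterior(priors, likelihoods):
    """Apply Bayes's theorem over a finite set of hypotheses.

    priors[h] is P(H = h); likelihoods[h] is P(X | H = h).
    Returns P(H = h | X) for each hypothesis h.
    """
    evidence = sum(priors[h] * likelihoods[h] for h in priors)   # P(X)
    return {h: priors[h] * likelihoods[h] / evidence for h in priors}

def binomial_likelihood(p_heads, heads, flips):
    """P(observing `heads` heads in `flips` flips | heads probability p_heads)."""
    return comb(flips, heads) * p_heads**heads * (1 - p_heads) ** (flips - heads)

heads, flips = 8, 10
likelihoods = {
    "fair": binomial_likelihood(0.5, heads, flips),
    "biased": binomial_likelihood(0.8, heads, flips),
}

post = posterior({"fair": 0.5, "biased": 0.5}, likelihoods)
post_skeptic = posterior({"fair": 0.99, "biased": 0.01}, likelihoods)
print(f"even prior:      P(biased | data) = {post['biased']:.3f}")
print(f"skeptical prior: P(biased | data) = {post_skeptic['biased']:.3f}")
```

With an even prior the data favor the biased coin fairly strongly; with a prior that starts out 99% sure the coin is fair, the same data leave the biased hypothesis unlikely.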
Bayesian statistics has had enormous influence. It is safe to say that there is no field in the sciences that does not have at least a sizeable minority of Bayesians in it. Bayesian inference has found interesting applications as far ranging as spam filtering, time-series analysis, and clinical trials.
Further reading: Wikipedia
7. Almost Sure Convergence and The Strong Law of Large Numbers
The Weak Law of Large Numbers is reasonably well known, if not by name. It says that if you have a sequence of independent, identically distributed random variables $X_1, \dots, X_n$, then the probability that their sample mean
$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$ is more than any given distance away from their true expectation goes to zero as $n \to \infty$. For example, if you flip a coin a hundred times, the probability that the fraction of heads is near a half is pretty large.
The Strong Law, on the other hand, is more subtle. Unlike everything else in this post, understanding it rigorously actually does require the basic measure-theoretic definitions of measure spaces and random variables. But the main idea can be illustrated reasonably well with an example:
Suppose everyone in the world flips a coin once a day, starting today, and keeps track of the average number of heads they’ve seen. The Weak Law says that a year from now, most people’s average will be close to a half. And in ten years, even more people will have averages close to a half. It’s possible, though, that you personally will have an average close to a half at the one year and ten year marks, but that at some point in between you have an average that’s pretty far off. In fact, if at year $n$, $1/n$ of the people in the world had an average far from a half, and if people took turns every year being in that $1/n$ fraction, then everyone could take infinitely many turns having averages far away from a half (since $\sum_n 1/n = \infty$), while the aggregate fraction of people with averages away from a half would dwindle to zero.
The Strong Law says that this is impossible—that every person’s average will converge to a half, with probability one. It is in a sense a “personal” version of the “aggregate” Weak Law, and I think the difference between them is subtle and beautiful. It explains, in a sense, how stochasticity can become deterministic as more and more data accumulates.
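The “personal” flavor of the Strong Law can be illustrated by following a single person’s running average and asking how far it ever strays from a half after a given day. This is only a sketch: a finite simulation cannot prove convergence with probability one, it can only show the behavior the theorem describes. The number of flips and the checkpoints are arbitrary choices.

```python
import random

def running_averages(flips, seed=0):
    """Running fraction of heads along one person's sequence of fair coin flips."""
    rng = random.Random(seed)
    heads = 0
    averages = []
    for n in range(1, flips + 1):
        heads += rng.random() < 0.5       # True counts as 1, False as 0
        averages.append(heads / n)
    return averages

avgs = running_averages(100_000)

def worst_deviation_after(start):
    """Largest |average - 1/2| seen from flip number `start` onward."""
    return max(abs(a - 0.5) for a in avgs[start - 1:])

for start in (100, 1_000, 10_000):
    print(f"after flip {start}: worst deviation {worst_deviation_after(start):.4f}")
```

The quantity tracked here is the worst deviation over the entire remaining path, not the deviation at one fixed time, which is the distinction between the two laws.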
Further reading: Amir Dembo’s rigorous probability notes, and Wikipedia
8. Multiple Linear Regression
Linear regression is one of the simplest nontrivial statistical models you can come up with. That said, it’s incredibly useful and it and its variants are ubiquitous in Statistics, the social sciences, and many natural sciences.
The model looks like this: Suppose you’re trying to model the quarterly US GDP growth. A simple yet reasonable model is that the GDP growth in a given quarter is some constant times the unemployment rate plus some constant times the inflation rate plus some constant times the price of oil plus some constant times … plus some constant plus random noise. Mathematically, if $Y$ is growth and the $x_j$ are the things that are supposed to explain it, the model is $Y = \beta_0 + \sum_j \beta_j x_j + \epsilon$, where $\epsilon$ is a random error and the $\beta_j$ are the “some constant”s. ($\beta_0$ is the intercept term, i.e., the final “some constant”.)
Linear regression is valuable and useful for at least three reasons: First of all, the parameter estimates are very easy to interpret: If the estimated coefficient βj corresponding to the unemployment rate is −0.4, that means that for every percentage point increase in the unemployment rate, with all else equal we expect GDP growth to be about 0.4 percentage points lower. (I made up this number. If you’re curious about what it actually is, go and fit this model yourself and let me know!)
The second reason linear regression is valuable is that it’s very easy to analyze theoretically. Consequently, there are exact results about how the estimated coefficients are distributed (exactly multivariate normal, if the errors ϵ are normal), exact results about how to test whether or not a given coefficient βj is zero (with a t-test), and exact intervals for predicting new values. There’s also a great geometric picture explaining what linear regression is doing: it projects the vector Y down onto a lower dimensional subspace determined by the values of the x’s. People outside of Statistics might take this sort of theory for granted, but the truth is that there are not very many models which are understood as well as linear regression and for which the theory is so strong.
The third reason linear regression is valuable is that it’s very easy to compute—the computations involve just a bit of linear algebra that can be done with any linear algebra package.
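To show how little linear algebra is involved, here is a minimal sketch that fits a multiple linear regression by solving the normal equations (X^T X) beta = X^T y with a hand-written Gaussian elimination. The data are made up and noise-free so the recovered coefficients can be checked exactly; in practice one would call a linear algebra or statistics package, which typically uses more numerically stable factorizations than the normal equations.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_ols(xs, ys):
    """Least-squares coefficients [b0, b1, ..., bp] via the normal equations."""
    design = [[1.0] + list(row) for row in xs]          # prepend intercept column
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(p)]
    return solve(xtx, xty)

# Made-up, noise-free data generated from Y = 1.0 - 0.4*x1 + 2.0*x2.
xs = [(5.0, 1.0), (6.5, 2.0), (4.0, 3.5), (7.2, 0.5), (5.5, 2.5), (8.0, 1.5)]
ys = [1.0 - 0.4 * x1 + 2.0 * x2 for x1, x2 in xs]
beta = fit_ols(xs, ys)
print([round(b, 6) for b in beta])
```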
For at least these reasons, linear regression and variants of it have found incredibly wide applications. Read any applied social sciences paper, and chances are you’ll see a few linear regressions or similar models.
Further reading: Wikipedia
9. Correlation Does not Imply Causation
Statistics is one of the more humble academic fields. When the data are not strong enough to support a conclusion, it is the job of Statistics to say so, rather than spew nonsense. Indeed, two of the fundamental ideas presented until now—p-values and confidence intervals—are explicitly concerned with quantifying how confident we can be in our conclusions based upon our data.
With this in mind, it’s no surprise that one of the things Statisticians most enjoy complaining about is attempts to infer causality from associations between variables. If you see a study in the newspaper showing that people who eat fast food regularly have greater rates of heart disease, it would be tempting to conclude that eating fast food causes you to have heart problems. A Statistician, however, would chime in and suggest that it’s possible that fast food has no effect on cardiovascular health, and that what’s really happening is that poorer people eat at fast food places because they don’t have much money and get heart problems because they don’t have access to good preventative medical care. The fundamental issue here is that if you’ve observed two associated variables, it’s possible that the association comes from one causing the other, but it’s also possible that they are both being caused by some variable not under consideration.
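The confounding story can be reproduced in a toy simulation. In the sketch below, a hidden variable drives two other variables that have no effect on each other at all, yet the two end up clearly correlated; once the hidden variable’s contribution is removed, the correlation vanishes. The variable names, coefficients, and noise levels are entirely made up and are not estimates of anything real.

```python
import random
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(0)
n = 20000
income = [rng.gauss(0, 1) for _ in range(n)]              # hidden common cause
fast_food = [-z + rng.gauss(0, 1) for z in income]        # depends on income only
heart_risk = [-z + rng.gauss(0, 1) for z in income]       # depends on income only

raw = correlation(fast_food, heart_risk)
# Remove the (here, known) effect of income from both variables and re-check.
adjusted = correlation(
    [f + z for f, z in zip(fast_food, income)],
    [h + z for h, z in zip(heart_risk, income)],
)
print(f"raw correlation: {raw:.2f}, after removing income: {adjusted:.2f}")
```

In real data the common cause is usually unobserved and its effect unknown, which is exactly why the association alone cannot settle the causal question.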
This idea—that you can’t be sure about causality just from observing associated variables—runs deeply in Statistics and the social and natural sciences. This simple insight has led to more rigorous and careful thinking about causality, preventing countless (though not enough) spurious and potentially dangerous claims of causality.
Further reading: Wikipedia
10. Markov Chains
Markov Chains are the most beautiful and useful sequential probability models around. The setup is that you have $X_1, \dots, X_n$ coming from some sequence: maybe they are the words you’re reading on this page, the price of a stock over time, or the population of a species over time. The key idea is that the distribution of $X_k$, given the past $X_1, \dots, X_{k-1}$, is assumed to depend only on $X_{k-1}$, its immediate predecessor. So, for example, the model would be that the next word after “We the people of…” depends only on the word “of”. Moreover, if we know the first word in a document, then the distribution of the following words can be entirely parametrized by the transition probabilities between words, like the probability that “of” is followed by the word “the”.
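Transition probabilities between words can be estimated simply by counting which word follows which. The sketch below does this for a tiny made-up sentence; the text and the function name are hypothetical and chosen only so that the counts are easy to verify by hand.

```python
from collections import Counter, defaultdict

def transition_probabilities(words):
    """Estimate P(next word | current word) from consecutive word pairs."""
    counts = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        counts[current][nxt] += 1
    return {
        current: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for current, followers in counts.items()
    }

# Hypothetical toy text, chosen only to make the counts easy to check by hand.
text = "we the people of the united states of america and the people of the world"
probs = transition_probabilities(text.split())
print(probs["of"])    # distribution of the word that follows "of"
```

In this toy text “of” appears three times and is followed by “the” twice, so the estimated probability of that transition is two thirds.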
Lest you think that the Markov property is too restrictive, note that you can also incorporate the recent past into the states $X_i$. So, for example, instead of considering each word in “We the people of…” as a state, you could consider any two neighboring words (bigram) as a state, and ask what the probability of transitioning from “people of” to “of the” is.
The Markov property yields huge theoretical and practical benefits—theoretically, Markov chains can be categorized neatly, their asymptotics are well known, and most interesting probabilities associated with them can be calculated exactly. Practically, Markov chain calculations reduce to matrix algebra, and they have been applied very successfully to tasks as diverse as speech recognition, ranking websites, and approximating integrals.
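As an illustration of “calculations reduce to matrix algebra”, the sketch below pushes a probability distribution through a small transition matrix many times and watches it settle on the chain’s long-run (stationary) distribution. The two-state chain and its probabilities are made up; for this particular matrix the stationary distribution can be worked out by hand as (5/6, 1/6).

```python
def step(dist, transition):
    """One step of the chain: new_dist[j] = sum_i dist[i] * transition[i][j]."""
    n = len(dist)
    return [sum(dist[i] * transition[i][j] for i in range(n)) for j in range(n)]

# Hypothetical two-state chain; each row holds the next-state probabilities.
transition = [
    [0.9, 0.1],   # from state 0
    [0.5, 0.5],   # from state 1
]

dist = [0.0, 1.0]            # start in state 1 with certainty
for _ in range(200):
    dist = step(dist, transition)

print([round(p, 6) for p in dist])
```

Starting from state 0 instead leads to the same limit, which is the sense in which the asymptotics of a well-behaved chain do not depend on where it started.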
Further reading: Steve Lalley’s great, intuitive notes on the basic theory of Markov Chains, my graph of Markov Chain implications, and Wikipedia
This list of great Statistics ideas has been deliberately classical: the youngest idea is about a hundred years old, and some are even older. Even so, each of these ideas is still incredibly relevant today. Although the last century, and even the last decade, has seen a number of great ideas in Statistics and Probability, those newer ideas have mostly built upon these fundamental ones rather than supplanted them. Understand these ideas, and you’ll understand the basics of Statistics.