Wednesday, November 28, 2012

Why the projections on this website are better than all others

Since I started this website in 2005, a few other websites have started to make RPI projections of their own. In my opinion, they are all doing it wrong. It comes down to something called Jensen's Inequality.

Jensen's inequality basically says that taking the expectation of a nonlinear function of a random variable gives you a different answer than applying that same nonlinear function to the expectation. To put it simply, the order in which you do things matters. It turns out that when you are trying to predict end-of-season RPI, you need to take the expectation of a nonlinear function of a random variable. I thought about these issues very carefully before I started making predictions to make certain I was doing things the right way.
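A tiny toy example makes the inequality concrete. Take a fair coin X (0 or 1 with equal probability) and the nonlinear function f(x) = x squared: squaring the average gives 0.25, but averaging the squares gives 0.5.

```python
# X is a fair coin: it comes up 0 or 1 with equal probability.
outcomes = [0, 1]

def f(x):
    return x * x  # a simple nonlinear function

e_x = sum(outcomes) / len(outcomes)                    # E[X] = 0.5
f_of_e = f(e_x)                                        # f(E[X]) = 0.25
e_of_f = sum(f(x) for x in outcomes) / len(outcomes)   # E[f(X)] = 0.5

print(f_of_e, e_of_f)  # 0.25 0.5: the order of operations changes the answer
```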

Here's why Jensen's inequality matters for RPI projections. In our case, the random variables are the possible outcomes of the future games. The nonlinear function is actually two nonlinear functions. First is the RPI formula itself, which takes the wins and losses and produces a percentage (the number between 0 and 1 that we usually ignore). The next function sorts these percentages and assigns a ranking (i.e., 1, 2, 3, 4, ..., 347 - the RPI number everybody talks about). The expectation is simply the average of these RPI ranks across all simulated seasons.

In other words, we want this:

Expectation(Rank(RPI(random game outcomes)))

That's exactly what I do on this site. I start in the innermost parentheses and generate random season outcomes 10,000 times. Next, I calculate the RPI (percentage) from each of the 10,000 simulated seasons. Then, I rank each of them. Finally, I calculate the expectation (or simple average) of these ranks across all 10,000 simulations.
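The four steps above can be sketched in a few lines of Python. Everything here is a stand-in: the team names and per-game win probabilities are made up, and plain winning percentage substitutes for the real RPI formula. The point is the order of operations: simulate first, then compute the percentage, then rank, then average the ranks.

```python
import random

random.seed(42)

# Hypothetical teams and per-game win probabilities (not real data)
WIN_PROB = {"A": 0.8, "B": 0.6, "C": 0.5, "D": 0.3}
GAMES_LEFT = 10   # remaining games per team
N_SIMS = 10_000   # number of simulated seasons

def simulate_pct():
    # Steps 1 and 2: simulate the remaining games, then compute a percentage.
    # Plain winning percentage stands in for the real RPI formula here.
    return {team: sum(random.random() < p for _ in range(GAMES_LEFT)) / GAMES_LEFT
            for team, p in WIN_PROB.items()}

def rank(pcts):
    # Step 3: sort the percentages and assign ranks (1 = best).
    order = sorted(pcts, key=pcts.get, reverse=True)
    return {team: i + 1 for i, team in enumerate(order)}

# Step 4: average each team's rank across all simulated seasons.
rank_sums = {team: 0 for team in WIN_PROB}
for _ in range(N_SIMS):
    for team, r in rank(simulate_pct()).items():
        rank_sums[team] += r

expected_rank = {team: rank_sums[team] / N_SIMS for team in WIN_PROB}
print(expected_rank)
```

Notice that the averaged ranks come out as fractions rather than whole numbers.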

Here's what the other sites do: They completely reverse things. They start with an expected record. Instead of simulating the games and looking at many different possibilities, they only look at the average. Then, they calculate the RPI (percentage) and then rank those RPIs.

In other words, this is what they do:

Rank(RPI(Expectation(future game outcomes)))

Jensen's inequality says that:
Expectation(Rank(RPI(random game outcomes)))
does not equal
Rank(RPI(Expectation(future game outcomes))).


The other sites take a shortcut that saves time (you don't have to simulate anything), but it gives you biased predictions. The bias especially matters near the top of the rankings (i.e., 1-100), near the bottom, and for teams that are less consistent.
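To see how the shortcut can go wrong for inconsistent teams, consider a toy example with three hypothetical teams over a ten-game stretch. "Steady" always goes 5-5; "Volatile" goes 10-0 or 0-10 with equal odds (so its expected record is also 5-5); "Mid" always goes 6-4. As before, winning percentage stands in for the RPI formula.

```python
def rank(pcts):
    # Rank 1 = best (highest percentage)
    order = sorted(pcts, key=pcts.get, reverse=True)
    return {team: i + 1 for i, team in enumerate(order)}

# The shortcut: rank the *expected* records.
expected_pct = {"Steady": 0.5, "Volatile": 0.5, "Mid": 0.6}
shortcut = rank(expected_pct)

# The right way: enumerate both equally likely scenarios and average the ranks.
scenarios = [
    {"Steady": 0.5, "Volatile": 1.0, "Mid": 0.6},  # Volatile goes 10-0
    {"Steady": 0.5, "Volatile": 0.0, "Mid": 0.6},  # Volatile goes 0-10
]
ranks = [rank(s) for s in scenarios]
expected_rank = {team: sum(r[team] for r in ranks) / len(ranks)
                 for team in expected_pct}

print(shortcut)       # Mid is 1; Steady and Volatile tie at 0.5
print(expected_rank)  # {'Steady': 2.5, 'Volatile': 2.0, 'Mid': 1.5}
```

The shortcut calls Steady and Volatile a tie, but averaging across the scenarios shows Volatile's expected rank (2.0) is actually better than Steady's (2.5), because Volatile finishes ahead of everyone in half the scenarios.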

This discussion also explains why the "Expected RPI" is usually not a whole number: I am taking an average of many possible ranks. It is also why you will almost never see a team with an Expected RPI of exactly 1. Because it is an expectation (an average) across 10,000 simulations, a team would have to be ranked number 1 in every single scenario. That may happen near the end of the season, but never before.

Because it is human nature to want to know who is number 1, 2, 3, etc., I also provide these alternate (but, as I have explained, biased) predictions in the first column, the "overall" column. You may notice that it is sometimes not sorted in the same order as the "Expected RPI Rank" column. If you look closely, though, you'll see that it is in the same order as the "RPI Forecast" column. This is the equivalent of the forecast that other sites will give you. By the way, the RPI Forecast column is actually an unbiased estimate of the final RPI percentage; it's just that once you start ranking teams based on that number, you have to be careful to do things the right way and in the right order.

To summarize: RPI Forecast > Other sites

Anyway, enough of the rant and thanks to Mike Crawford for asking about this.