Wednesday, November 28, 2012

Why the projections on this website are better than all others

Since I started this website in 2005, a few other websites have started to make RPI projections of their own. In my opinion, they are all doing it wrong. It comes down to something called Jensen's Inequality.

Jensen's inequality basically says that taking an expectation of a nonlinear function of a random variable will give you a different answer than taking the same nonlinear function of an expectation. To put it simply, the order in which you do things matters. It turns out that when you are trying to predict end-of-season RPI, you need to take the expectation of a nonlinear function of a random variable. I thought about these issues very carefully before I started making predictions to make certain I was doing things the right way.

Here's why Jensen's inequality matters for RPI projections. In our case, the random variables are the possible outcomes of the future games. The nonlinear function is actually two nonlinear functions applied one after the other. First, the RPI formula itself takes the wins and losses and comes up with a percentage (this is the number between 0 and 1 that we usually ignore). Second, a ranking function sorts these percentages and assigns a rank (i.e., 1, 2, 3, 4, ..., 347, the RPI number everybody talks about). The expectation is simply the average of these ranks across all simulated seasons.

In other words, we want this:

Expectation(Rank(RPI(random game outcomes)))

That's exactly what I do on this site. I start in the innermost parentheses and generate random season outcomes 10,000 times. Next, I calculate the RPI (percentage) from each of the 10,000 simulated seasons. Then, I rank each of them. Finally, I calculate the expectation (or simple average) of these ranks across all 10,000 simulations.
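
The four steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the code that runs this site: the three teams, the win probabilities, and the helper names (`win_pct_rating`, `rank_teams`, `expected_ranks`) are all made up, and plain winning percentage stands in for the real RPI formula to keep the example short.

```python
import random
from collections import defaultdict

def win_pct_rating(wins, losses):
    """Stand-in for the RPI formula (assumption): plain winning percentage."""
    return {t: wins[t] / max(1, wins[t] + losses[t]) for t in wins}

def rank_teams(rating):
    """Rank 1 = best rating; ties are broken by team name for determinism."""
    ordered = sorted(rating, key=lambda t: (-rating[t], t))
    return {t: i + 1 for i, t in enumerate(ordered)}

def expected_ranks(teams, games, n_sims=10_000, seed=0):
    """games: list of (team_a, team_b, probability that team_a wins)."""
    rng = random.Random(seed)
    rank_sum = defaultdict(float)
    for _ in range(n_sims):
        # Step 1: generate one random season outcome.
        wins = {t: 0 for t in teams}
        losses = {t: 0 for t in teams}
        for a, b, p in games:
            winner, loser = (a, b) if rng.random() < p else (b, a)
            wins[winner] += 1
            losses[loser] += 1
        # Step 2: rating (percentage) for this simulated season.
        # Step 3: rank the teams within this simulated season.
        ranks = rank_teams(win_pct_rating(wins, losses))
        for t in teams:
            rank_sum[t] += ranks[t]
    # Step 4: average the ranks across all simulations.
    return {t: rank_sum[t] / n_sims for t in teams}

# Hypothetical three-team schedule with made-up win probabilities.
teams = ["A", "B", "C"]
games = [("A", "B", 0.7), ("A", "C", 0.8), ("B", "C", 0.6)]
result = expected_ranks(teams, games)
print(result)
```

Note that the averaged ranks are generally not whole numbers, and that they always add up to the same total as 1 + 2 + ... + (number of teams).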

Here's what the other sites do: They completely reverse things. They start with an expected record. Instead of simulating the games and looking at many different possibilities, they only look at the average. Then, they calculate the RPI (percentage) and then rank those RPIs.

In other words, this is what they do:

Rank(RPI(Expectation(random game outcomes)))

Jensen's inequality says that:
Expectation(Rank(RPI(random game outcomes)))
does not equal
Rank(RPI(Expectation(random game outcomes))).
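
A tiny made-up example shows how the two orderings can disagree. The sketch below assumes two hypothetical teams with invented end-of-season ratings (these are not real RPI values): "Steady" always finishes at 0.60, while "Streaky" finishes at 0.70 half the time and 0.45 the other half. It enumerates every scenario exactly rather than simulating.

```python
from itertools import product

# Hypothetical teams: each has a list of (probability, final rating) scenarios.
scenarios = {
    "Steady":  [(1.0, 0.60)],
    "Streaky": [(0.5, 0.70), (0.5, 0.45)],
}

def rank(ratings):
    """Rank 1 = highest rating; ties broken by team name."""
    ordered = sorted(ratings, key=lambda t: (-ratings[t], t))
    return {t: i + 1 for i, t in enumerate(ordered)}

teams = list(scenarios)

# Expectation(Rank(rating)): rank inside each scenario, then average the ranks.
# Teams are assumed independent, so joint probabilities multiply.
expected_rank = {t: 0.0 for t in teams}
for combo in product(*(scenarios[t] for t in teams)):
    prob = 1.0
    ratings = {}
    for t, (p, r) in zip(teams, combo):
        prob *= p
        ratings[t] = r
    for t, rk in rank(ratings).items():
        expected_rank[t] += prob * rk

# Rank(Expectation(rating)): average the ratings first, then rank once.
expected_rating = {t: sum(p * r for p, r in scenarios[t]) for t in teams}
rank_of_expected = rank(expected_rating)

print(expected_rank)     # {'Steady': 1.5, 'Streaky': 1.5}
print(rank_of_expected)  # {'Steady': 1, 'Streaky': 2}
```

Ranking the averages puts Steady first and Streaky second, because Streaky's average rating is 0.575. Averaging the ranks says the two teams are dead even at 1.5, because Streaky finishes ahead exactly half the time. The shortcut hides that.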


The other sites take a shortcut which saves time (you don't have to simulate anything), but it will give you biased predictions. The bias matters most near the top of the rankings (i.e., 1-100), near the bottom, and for teams that are less consistent.

This is also the reason why the "Expected RPI" is not usually a round number: I am taking an average of many possible ranks. It is also why you will almost never see any team with an expected RPI of exactly 1. Because it is an expectation (or an average) across 10,000 simulations, a team would have to be ranked number one in every simulated scenario. This may happen near the end of the season, but not before.

Because it is human nature to want to know who is number 1, 2, 3, etc., I also provide these alternate (but, as I have explained, biased) predictions in the first column, the "overall" column. You may notice that sometimes this is not sorted in the same order as the "expected RPI Rank" column. If you look closely, though, you'll see that it is in the same order as the "RPI Forecast" column. This is the equivalent of the forecast that other sites will give you. By the way, the RPI Forecast column is actually an unbiased estimate of the final RPI percentage. It's just that once you start ranking teams based on that number, you have to be careful to do things the right way and in the right order.

To summarize: RPI Forecast > Other sites

Anyway, enough of the rant and thanks to Mike Crawford for asking about this.

6 comments:

Troy said...

I'd like to say that your site is fantastic. I use it all the time. Keep up the great work.

Tarvol said...

I am rather confused. You argue the accuracy of this particular RPI forecast, but it is written that it has been updated as of 12:04 AM, 3/7/13, and you have North Carolina at 18, with a victory over Maryland last night, and a record of 21-8. North Carolina's victory gives them a 22-8 record. Does this not affect their rating? It seems that accuracy is predicated on getting certain numbers correct.

Ryan Israelsen said...

Tarvol, what you were looking at was the live-rpi portion of the website which provides the current rpi. That is updated every few minutes. The forecasts of end-of-season RPI are updated the following morning.

Thanks for visiting.

Anonymous said...

you forgot to add Drexel's game against #4 Arizona for tomorrow

Anonymous said...

also you forgot the Drexel game vs Alabama

Adam said...

Have always enjoyed your work, thanks for all the effort you've put in with the website.

How does the new NCAA ranking system affect things? I haven't even seen the actual formulas they'll be using. Is this something you can also do real time?

Thanks.