Wednesday, November 28, 2012

Why the projections on this website are better than all others

Since I started this website in 2005, a few other websites have started to make RPI projections of their own. In my opinion, they are all doing it wrong. It comes down to something called Jensen's Inequality.

Jensen's inequality basically says that taking an expectation of a nonlinear function of a random variable will give you a different answer than taking the same nonlinear function of an expectation. To put it simply, the order in which you do things matters. It turns out that when you are trying to predict end-of-season RPI, you need to take the expectation of a nonlinear function of a random variable. I thought about these issues very carefully before I started making predictions, to make certain I was doing things the right way.

Here's why Jensen's inequality matters for RPI projections. In our case, the random variable (or actually variables) is the set of possible outcomes of the future games. The nonlinear function is actually two nonlinear functions. The first is the RPI formula itself, which takes the wins and losses and comes up with a percentage (this is the number between 0 and 1 that we usually ignore). The second sorts these percentages and assigns a ranking (i.e., 1, 2, 3, 4, ..., 347 - the RPI number everybody talks about). The expectation is simply the average of these ranks across all simulated seasons.

In other words, we want this:

Expectation(Rank(RPI(random game outcomes)))

That's exactly what I do on this site. I start in the innermost parentheses and generate random season outcomes 10,000 times. Next, I calculate the RPI (percentage) from each of the 10,000 simulated seasons. Then, I rank each of them. Finally, I calculate the expectation (or simple average) of these ranks across all 10,000 simulations.
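The four steps above can be sketched in a few lines of Python. This is only a toy illustration under loud assumptions, not the code that runs this site: the team names and win probabilities are made up, every game is treated as an independent coin flip, and a plain winning percentage stands in for the real RPI formula (which also depends on opponents' records).

```python
import random

def expected_ranks(win_prob, n_games=10, n_sims=10_000, seed=0):
    """Estimate Expectation(Rank(rating(random game outcomes))) by simulation.

    win_prob maps a (hypothetical) team name to its per-game win probability.
    ASSUMPTION: winning percentage is used as a stand-in for the RPI percentage.
    """
    rng = random.Random(seed)
    teams = list(win_prob)
    rank_sums = {team: 0 for team in teams}

    for _ in range(n_sims):
        # Step 1: generate one random season (innermost parentheses).
        # Step 2: turn each team's results into a rating between 0 and 1.
        rating = {
            team: sum(rng.random() < win_prob[team] for _ in range(n_games)) / n_games
            for team in teams
        }
        # Step 3: rank the teams within this simulated season.
        # Ties are broken at random so that no team is favored by list order.
        tiebreak = {team: rng.random() for team in teams}
        ordered = sorted(teams, key=lambda t: (rating[t], tiebreak[t]), reverse=True)
        for position, team in enumerate(ordered, start=1):
            rank_sums[team] += position

    # Step 4: average the ranks across all simulated seasons.
    return {team: rank_sums[team] / n_sims for team in teams}

# Hypothetical three-team league.
ranks = expected_ranks({"A": 0.8, "B": 0.7, "C": 0.4})
print(ranks)
```

Even the strongest team in this toy league ends up with an expected rank above 1, because it finishes behind another team in at least some of the simulated seasons.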

Here's what the other sites do: They completely reverse things. They start with an expected record. Instead of simulating the games and looking at many different possibilities, they only look at the average. Then, they calculate the RPI (percentage) and then rank those RPIs.

In other words, this is what they do:

Rank(RPI(E(future game outcomes)))

Jensen's inequality says that:
Expectation(Rank(RPI(random game outcomes)))
does not equal
Rank(RPI(E(future game outcomes))).
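
A tiny worked example makes the difference concrete. The two teams and their numbers below are hypothetical, chosen only so the arithmetic can be checked by hand: "Steady" always finishes with a rating of 0.60, while "Streaky" finishes at 0.90 half the time and 0.20 the other half (an average of 0.55). Because there are only two scenarios, the sketch enumerates them exactly instead of simulating.

```python
from itertools import product

# Hypothetical end-of-season ratings as (value, probability) pairs.
scenarios = {
    "Steady": [(0.60, 1.0)],
    "Streaky": [(0.90, 0.5), (0.20, 0.5)],
}

def rank(ratings):
    """Assign rank 1 to the highest rating, 2 to the next, and so on."""
    ordered = sorted(ratings, key=ratings.get, reverse=True)
    return {team: position for position, team in enumerate(ordered, start=1)}

def rank_of_expectation(scenarios):
    """The shortcut: average the ratings first, then rank the averages."""
    means = {
        team: sum(value * prob for value, prob in outcomes)
        for team, outcomes in scenarios.items()
    }
    return rank(means)

def expectation_of_rank(scenarios):
    """Rank within every scenario first, then average the ranks."""
    teams = list(scenarios)
    expected = {team: 0.0 for team in teams}
    for combo in product(*(scenarios[team] for team in teams)):
        prob = 1.0
        ratings = {}
        for team, (value, p) in zip(teams, combo):
            ratings[team] = value
            prob *= p
        for team, position in rank(ratings).items():
            expected[team] += prob * position
    return expected

print(rank_of_expectation(scenarios))  # {'Steady': 1, 'Streaky': 2}
print(expectation_of_rank(scenarios))  # {'Steady': 1.5, 'Streaky': 1.5}
```

The shortcut puts Steady clearly ahead, but Streaky finishes first in half of the possible seasons, so both teams have an expected rank of 1.5. The less consistent team is the one the shortcut misjudges.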

The other sites take a shortcut that saves time (you don't have to simulate anything) but gives you biased predictions. The bias matters most near the top of the rankings (i.e., 1-100), near the bottom, and for teams that are less consistent.

This is the reason why the "Expected RPI" is usually not a round number: I am taking an average of many possible ranks. It is also why you will almost never see a team with an Expected RPI of exactly 1. Because it is an expectation (or an average) across 10,000 simulations, a team would have to be ranked first in every simulated scenario. This may happen near the end of the season, but never before.

Because it is human nature to want to know who is number 1, 2, 3, etc., I also provide these alternate (but, as I have explained, biased) predictions in the first column - the "overall" column. You may notice that sometimes this is not sorted in the same order as the "expected RPI Rank" column. If you look closely, though, you'll see that it is in the same order as the "RPI Forecast" column. This is the equivalent of the forecast that other sites will give you. By the way, the RPI Forecast column is actually an unbiased estimate of the final RPI percentage. It's just that once you start ranking teams based on that number, you have to be careful to do things the right way and in the right order.

To summarize: RPI Forecast > Other sites

Anyway, enough of the rant and thanks to Mike Crawford for asking about this.

Saturday, February 11, 2012

Back in Business!

I ended up ditching the old webhost and found a much better one. Up until two weeks ago, I hadn't had any problems with the old host, but having the site unavailable twice in two weeks is unacceptable. The new host should be much more reliable (not to mention faster). If you are still not able to view the page, try hitting refresh. It might take a few hours for your ISP to pick up the new nameservers so that the domain points to the right IP address. By the end of the day, it should be fine.

Friday, February 10, 2012

Site down again.

It looks like more problems at the web host. Hopefully they can resolve this problem. It's the second time in 10 days.

Monday, January 30, 2012

Site down

The servers which host and appear to be down. I'm working on getting the sites back up.

Monday, March 14, 2011

Probabilities/Odds For NCAA Tournament Advancement

We've added a page that lists the probabilities for reaching each of the rounds of the tournament - along with winning it all - for all 68 teams.

The page is here:

There are two types of probabilities listed: UNCONDITIONAL and CONDITIONAL

You are probably most familiar with UNCONDITIONAL. That tells you the overall probability of a team reaching each of the rounds. The CONDITIONAL probabilities, however, tell you the probability that a team reaches a round given that it has already reached the previous round. For example, a team like UT-San Antonio may have a very low unconditional probability of reaching the Final 4. Conditional on it having already reached the Elite 8, however, it will have a much higher probability. We'll update the webpage with new probabilities as we reach each new round. These will change as teams (and potential future matchups) are eliminated.
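
The two kinds of probability are linked by simple arithmetic. A team can only reach a round if it reached the previous one, so the conditional probability is the unconditional probability of reaching that round divided by the unconditional probability of reaching the round before it. Here is a minimal sketch of that relationship; the numbers are made up for illustration, and it is not a description of how the page itself is generated.

```python
def conditional_probabilities(unconditional):
    """Convert per-round unconditional probabilities into conditional ones.

    unconditional[k] is the probability of reaching round k+1 of the bracket,
    listed in order. Each conditional value is P(reach round | reached previous round).
    """
    conditional = []
    previous = 1.0  # the team is already in the field, so the starting point is certain
    for prob in unconditional:
        # Reaching this round implies having reached the previous one,
        # so the ratio is the conditional probability.
        conditional.append(prob / previous if previous > 0 else 0.0)
        previous = prob
    return conditional

# Hypothetical long shot: unconditional chances of reaching three successive rounds.
long_shot = [0.5, 0.25, 0.05]
print(conditional_probabilities(long_shot))
```

In this made-up case the team has only a 5% unconditional chance of reaching the third round listed, but a 20% chance once it has actually reached the second.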

How'd we do?

The main goals (and the main strengths) of this site are (1) to provide accurate predictions of end-of-season RPI, and (2) to provide up-to-the-minute current RPI. A secondary goal is to provide predictions of at-large bids and seeds. Up until this season, we were not predicting seeds, just bids. This season, we decided to use the "Dance Card" formulas for seeds in addition to the at-large formulas. In terms of at-large bids, we did a bit worse than in seasons past, missing 3 (VCU, USC, and Florida State). However, only one major bracketologist got even 67 out of 68 (Fox Sports). As far as I have seen, nobody got 68. Based on the Bracket Project website, only a handful got 66 or better. Most got 65 or fewer, so missing 3 is nothing to be ashamed of - especially considering that this is all done using a simple formula. Typically the Dance Card formula misses no more than 1 or 2, so this season was unusual.

As for seeds, there IS something to be ashamed of. Quite simply, the seed predictions stunk! In the name of objectivity, we wanted to avoid tweaking the seeds and let the formulas do all the work. However, something obviously needs to be done. The plan for next season is probably to abandon the formulas for seeding - or at least to modify them to allow for some "human analysis". We'll stick with the at-large formulas and figure out a better way to seed the teams. At this point, the seeding decisions seem to be more complicated than can be explained by a simple formula.

The big benefit from projecting bids/seeds relative to other bracketologists ought to come EARLIER in the season, rather than LATER. In fact, RPIFORECAST is one of the best at predicting seeds in December/January and February. We have archived all of the major bracketologists' predictions this season and will be determining who were the best and worst EARLY in the season.

Anyway, thanks for visiting, enjoy the tournament, and look for some more exciting things next season!

Monday, February 21, 2011

Bracket Predictions

I've decided to switch up the at-large and seeding formulas a bit. I had been using an equation that includes conference affiliation and representation on the selection committee. However, just today, I switched to one of the models that does not take these into account. This was mainly because of the problem with the MWC teams' seeds (BYU, SDSU, etc.). Because the MWC has never been this good, it may not make sense to assume that the same "bias" would exist against them this season. Anyway, BYU immediately jumped to a 2 seed and SDSU to a 5 seed (the highest 5 seed).