New Mexico High School Sports

More Details About How Calculations are Done

Here are quick links to sections below describing steps of the calculation:

  • Collect the raw data
  • Weighting the games
  • Define the model parameters
  • The calculation method
  • Starting the calculation
  • Markov Chain Monte Carlo
  • Aging the initial team strengths
  • Collecting Statistics
  • Post-processing

  • Collect the raw data -- Games are reported to NMAA by coaches or team representatives. I download page sources for games reported each week, and feed the files to a program that reads them and extracts game information. The information captured is the HomeTeamName and AwayTeamName, and the HomeTeamScore and AwayTeamScore, even though all that's used later is the difference in scores. It is this difference in scores (GameScoreDifferences) that comprises the raw data for the simulation portion of the calculation.

    We also record the PlayDates, the dates the games are played, so team schedules can be reconstructed later and so we can check whether duplicate games have been reported.

    District alignments are also captured from the NMAA website, and are checked occasionally through the season, since teams change their names and district alignments get updated from time to time.

    All the game data is assembled into an event list, one event per game. In soccer, there are about 700 games and 70 teams in a season, and in basketball, there are about 2000 games and 150 teams.

    Weighting the games -- As discussed in more detail on another page, we weight the games so that recent games count more than early-season games, and close games count more than games with a large winning margin. The time-based weight factors and the score-based weight factors are calculated, and then, whenever we calculate the error between the model prediction for the winning margin of a game and the actual observed score difference, we scale the error by the game-weighting factor. So, if an early-season game was won by a 22 point margin, its relative weight compared to a recent close game (weight=1) will be close to 0.5*0.37 = 0.18. The factor of one half accounts for the time elapsed from the beginning of the season to the last day (so far) that games were played, and the factor of 0.37 represents the reduction in game importance due to the 22 point score margin.
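The exact functional forms of the two weight factors are not given on this page; the linear time ramp and exponential margin factor below are assumptions chosen only so that an early-season 22 point blowout reproduces the 0.5 and 0.37 figures quoted above:

```python
import math

def game_weight(days_ago, season_length_days, score_margin,
                oldest_game_weight=0.5, margin_scale=22.0):
    """Combined weight for one game (hypothetical functional forms).

    Time factor: ramps linearly from 1.0 (played today) down to
    oldest_game_weight at the start of the season.
    Margin factor: exp(-|margin| / margin_scale), which gives ~0.37
    for a 22 point margin when margin_scale is 22.
    """
    frac = days_ago / season_length_days
    time_factor = 1.0 - (1.0 - oldest_game_weight) * frac
    margin_factor = math.exp(-abs(score_margin) / margin_scale)
    return time_factor * margin_factor

# An early-season 22-point blowout vs. a recent close game:
w_old_blowout = game_weight(days_ago=90, season_length_days=90, score_margin=22)
w_recent_close = game_weight(days_ago=0, season_length_days=90, score_margin=0)
```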

    Define the model parameters -- Once a team has played enough games against other rated teams, it is added to the pool of rated teams. Once we are well into the season, the pool consists of all the in-state varsity teams, plus a few out-of-state teams and a few JV teams. Nothing prevents us from rating these out-of-state and JV teams along with the rest, but they typically don't have enough reported games for their ratings to become meaningful.

    For every rated team, there is a TeamStrength parameter. It is a single number for each team, and the idea is that teams with a larger strength parameter are expected to win games against teams with a smaller strength parameter, and the score difference in a game is larger when the strength difference is larger. In soccer, a difference of 100 strength points implies a one goal advantage for the stronger team, whereas in basketball, the strength difference maps directly to the score difference.

    In addition to the strength parameters, there is also an additional parameter HomeBias representing the strength increment a team enjoys when playing as the home team. It is intended to capture the basics of a home field or home court advantage.

    The final model parameter is Noise, a single number that characterizes the overall variability in how teams play from game to game.

    So, in summary, if the number of rated teams is NTeams, and the number of games is NGames, we are trying to "fit" NGames pieces of raw data (GameScoreDifferences) to NTeams+2 model parameters.
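The core prediction underlying the fit is just a difference of strengths plus the home bonus. A minimal sketch (the function name and scaling argument are mine, not the program's):

```python
def predicted_margin(strengths, home, away, home_bias, points_per_strength=1.0):
    """Predicted HomeTeamScore - AwayTeamScore for one game.

    strengths: dict of TeamName -> TeamStrength.
    points_per_strength: 1.0 for basketball (strength difference maps
    directly to points); in soccer 100 strength points imply roughly
    one goal, so it would be 0.01 there.
    """
    return (strengths[home] + home_bias - strengths[away]) * points_per_strength

strengths = {"A": 1020.0, "B": 1010.0}
# Basketball-style scaling: B hosts A with a 4-point home advantage.
margin = predicted_margin(strengths, home="B", away="A", home_bias=4.0)
```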

    The calculation method -- In the statistics literature, what we do next is called "Bayesian Parameter Estimation." If you want to know more, a web search on that quoted phrase will turn up all you could want. A few sites that I found useful are:

  • Larry Bretthorst's site at Washington University
  • Larry Bretthorst's Introduction paper
  • Eliezer Yudkowsky's tutorial site
  • publications by Richard Jeffrey on probability theory

    Parameter Estimation itself is carried out using a Markov Chain Monte Carlo (MCMC) algorithm. Besides the Wikipedia article, a useful paper on the subject is Radford Neal's manuscript Probabilistic Inference Using Markov Chain Monte Carlo Methods.

    Starting the calculation -- The problem we are trying to solve is only weakly nonlinear, so at the beginning we have a rough idea what a good set of TeamStrength parameters is, based on a linearized (least squares) approximation to the equations that the program solves. But the actual choice of initial strength parameters is unimportant, because the full calculation forgets this choice by the time we start collecting our reported results (see later).

    The HomeBias parameter starts off near zero; its initial value is set from the mean of all reported game score differences (Home Team - Away Team). The Noise parameter starts out at a small value -- about 1 goal in soccer, and about 10 points in basketball.

    Once we have a starting set of parameters, we use the difference of home and visitor TeamStrength parameters (adjusted by adding the HomeBias parameter to the strength of each home team) to predict the score differences of every game that has been reported. The actual scores are compared with the predicted scores and the difference between them is an error value for that game. All the error values for all the games are squared and added together to obtain a global error that characterizes the goodness of the starting set of parameters. This sum of squares of prediction errors characterizes the relative goodness of every set of parameters we try out during the course of the computation.
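The global error described above can be sketched in a few lines. Following the text, each game's prediction error is scaled by its weight before squaring (the function shape and names here are mine):

```python
def global_error(games, strengths, home_bias, weights):
    """Sum of squared, weighted prediction errors over all games.

    games: list of (home, away, observed_margin) tuples, where
    observed_margin is HomeTeamScore - AwayTeamScore.
    weights: per-game weighting factors (1.0 = full weight).
    """
    total = 0.0
    for (home, away, observed), w in zip(games, weights):
        predicted = strengths[home] + home_bias - strengths[away]
        total += (w * (observed - predicted)) ** 2
    return total

strengths = {"A": 1010.0, "B": 1020.0}
games = [("A", "B", 6.0),   # upset: A wins by 6 despite being weaker
         ("B", "A", 10.0)]  # B wins by exactly the predicted margin
err = global_error(games, strengths, home_bias=0.0, weights=[1.0, 0.5])
```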

    Markov Chain Monte Carlo -- The calculation proceeds by repeatedly evaluating the goodness of trial sets of parameters. The "goodness" is essentially the sums of the squares of the errors between predicted and actual game scores -- a smaller sum means a better set of parameters. So, we take the first set of parameters (TeamStrength and HomeBias), and randomly change one of them by a little amount. We then see if the new set of parameters is better (smaller sums of squares). If it is better, we keep the new set. If it is worse, we keep it sometimes, and other times we keep the old set. The decision process that determines if we keep the new (poorer) set is the basis of what makes the Monte Carlo process a little magical, and if you want to know more about this part of the story, consult some of those references I listed above. The idea to remember is that we explore not just better and better parameter sets, but poorer sets as well, so long as they are not too poor. It is this exploration process that eventually tells us not only how strong a team is, but also tells us how variable their team strength is from game to game.
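A single trial step of this kind of walk might look like the sketch below. The exp(-Δerror/T) rule for sometimes keeping a poorer set is the standard Metropolis acceptance choice; that formula, and the temperature knob, are my assumptions, since the page doesn't spell out its exact decision process:

```python
import math
import random

def metropolis_step(params, error_fn, temperature, step_size=1.0, rng=random):
    """One MCMC trial: perturb one randomly chosen parameter, then keep
    the new set if it is better, or keep it with probability
    exp(-(new_err - old_err)/temperature) if it is worse."""
    old_err = error_fn(params)
    key = rng.choice(sorted(params))          # pick one parameter at random
    trial = dict(params)
    trial[key] += rng.uniform(-step_size, step_size)
    new_err = error_fn(trial)
    if new_err <= old_err or rng.random() < math.exp(-(new_err - old_err) / temperature):
        return trial                          # keep the new (sometimes poorer) set
    return params                             # keep the old set
```

Run long enough, the walk samples parameter sets in proportion to how well they fit, which is what yields the spread (not just the best value) of each team's strength.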

    Aging the initial team strengths -- The initial team strengths (and the home court/field advantage parameter) are set to values that are determined by a least squares solution to a linearized version of the full model. If the problem were exactly linear, then this solution would give the most likely values for each team's strength parameter. But in fact, the problem is a little nonlinear, because there is a practical limit to the winning margin you can get in a contest, no matter how much stronger the better team is. Several factors account for this upper limit to the winning margin. First, all the games run for a fixed length of time, and so eventually the stronger team has to stop running up the score because the game is over. More likely, when a game obviously becomes comfortable for the stronger team, less experienced players are given a chance to play, and they are less productive in scoring. So, the score prediction is not just a straight-line prediction based on the strength difference. The line curves to limit the maximum score difference achievable, but it is very nearly a straight line when the two teams playing are close to each other in strength. Because the real problem is not the same as the linearized one, these initial parameters are not the end of the story.

    Although the initial guesses are good, the final data we collect is really not sensitive to these initial parameters, because we first "age" the parameter set.

    To age the parameters, we repeat the trial parameter evaluation step (described in the section above) over and over. We keep randomly picking a different parameter (for example, the team strength for a different team), evaluating the goodness of the set, and deciding whether to keep the new parameter set or the old one. We repeat the process many thousands of times. After about 30,000 to 100,000 iterations, we are sitting on a set of parameters that has effectively "forgotten" the initial set we started with, and is one (of many) reasonable choices that characterize the problem. The parameter set has now been "equilibrated." At that point, we move from the aging step to the statistics collection step.

    Conduct the MCMC walk and collect statistics -- Once the set of TeamStrength, HomeBias, and Noise parameters has equilibrated, we explore many other possible and likely choices of these parameters, simply by continuing the Monte Carlo walk. Now, we save every parameter set (or, to save computer memory, we save every Nth step, where N is about 10). After exploring nearly a million trial parameter sets, the MCMC walk is completed, and we post-process the results and report statistics.
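The aging (burn-in) and thinned collection phases fit naturally into one small driver loop; this sketch assumes a step function like the Metropolis one above, and the counts are the rough figures quoted in the text:

```python
def collect_samples(step_fn, params, n_burn=30000, n_steps=1000000, thin=10):
    """Discard the first n_burn equilibration steps, then save every
    `thin`-th parameter set from the remaining walk (to save memory)."""
    for _ in range(n_burn):
        params = step_fn(params)
    saved = []
    for i in range(n_steps):
        params = step_fn(params)
        if i % thin == 0:
            saved.append(dict(params))
    return saved

# Tiny demonstration with a trivial deterministic "step":
samples = collect_samples(lambda p: {"x": p["x"] + 1}, {"x": 0},
                          n_burn=5, n_steps=20, thin=10)
```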

    Post-process the results -- When the MCMC walk is over, we have saved many tens of thousands of parameter sets. This means we have thousands of possible (and likely) choices for HomeBias, thousands of choices for Noise, and for every rated team, thousands of choices for TeamStrength.

    The post-processing step uses the statistics gathered to calculate quantities of interest, such as the average or median < TeamStrength > for each team. This is the single number that is used to characterize the strength of each team, and that is what's shown in the ranking pages under the column heading Median Strength.

    The collection of values for the TeamStrength of each team represents a numerical probability distribution function (PDF) that characterizes the variability associated with how that team plays from one game to the next, and allows us to find numbers to characterize the variation of strengths (as the 17th and 83rd percentile strengths) for the team. This result is shown on team ranking pages in the column labeled Range - Low to High. We produce a figure of this probability distribution function for the stronger teams in each classification (1A - 5A).
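Extracting the median and the 17th/83rd percentile range from a team's saved strength samples is a one-pass sort; the nearest-rank percentile convention used here is my assumption (any reasonable percentile definition agrees to within one sample for thousands of samples):

```python
def strength_summary(samples):
    """Median and 17th/83rd-percentile range of saved TeamStrength values."""
    s = sorted(samples)
    n = len(s)
    def pct(p):
        # nearest-rank percentile; adequate when n is in the thousands
        return s[min(n - 1, p * n // 100)]
    return pct(50), pct(17), pct(83)

median, low, high = strength_summary(list(range(100)))  # toy samples 0..99
```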

    Using the difference between the PDF's for two teams, we can calculate the predicted distribution of score differences that we would expect if the two teams played each other. This difference between PDF's is what is used when we say that the computer plays virtual games between the two teams. From the PDF difference, we can then determine the probability of each team winning, losing, or tying a future contest. The team schedule pages show this result in the column labeled P(W-L-T).
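Playing virtual games can be sketched as resampling: draw one strength for each team from its saved samples, add game-to-game noise, and tally the outcome. Rounding to a whole-number margin as the way to register ties is my assumption here, as is the Gaussian noise:

```python
import random

def win_loss_tie(samples_a, samples_b, noise_sd, n_virtual=10000, rng=None):
    """Estimate P(win), P(loss), P(tie) for team A vs. team B by
    simulating n_virtual games from the two teams' strength samples."""
    rng = rng or random.Random()
    wins = losses = ties = 0
    for _ in range(n_virtual):
        diff = rng.choice(samples_a) - rng.choice(samples_b)
        margin = round(diff + rng.gauss(0.0, noise_sd))  # integer final margin
        if margin > 0:
            wins += 1
        elif margin < 0:
            losses += 1
        else:
            ties += 1
    return wins / n_virtual, losses / n_virtual, ties / n_virtual
```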

    By sorting the average < TeamStrength >'s, we can rank teams. Because we have gathered the season schedule for each team, we can use the median < TeamStrength > for every opponent a team has played (or will play) as an estimate of the Strength of Schedule (SOS) for a team.

    From the event list, we can compile overall and in-district win-loss-tie records (Overall W-L-T and District W-L-T).

    If every team always played at exactly its rated < TeamStrength >, then it would always win when playing weaker teams, and would always lose when playing stronger teams. Because the team's strength does change from game to game, we get some results we call surprises, in which the so-called stronger team loses to a so-called weaker team, providing a result we call "Better" for the weaker team, and "Worse" for the stronger team. When the stronger team wins, the result is "Expected." These data are reported on the ranking pages (and team schedule pages) as the E/B/W record.

    We can use actual game results to calculate the effective Playing Strength a team had in each game. Knowing the team strength value for each team tells us what the score difference was expected to be (adjusted by home court/field advantage). To simplify the explanation of how we calculate this quantity, let's assume for the moment that strength points and score points are scaled the same (a one strength point advantage implies a one point win for the stronger team). Suppose that Team A has a median strength rating of 1010, and that Team B has a median rating of 1020. We would expect B to win by 10 points (1020-1010), but suppose Team A actually wins by 6 points. The effective team strengths of these two teams, in this particular game, must now come out so that Team A plays 6 points stronger than Team B. To keep things right on average, we start halfway between the two teams' strengths (1015), and divide the actual score margin evenly between the two teams. So, Team A effectively played that game with a strength of 1018 (= 1015 + 6/2) and Team B effectively played that game with a strength of 1012 (= 1015 - 6/2). The actual 6 point win by Team A is now consistent with the two effective playing strengths (1018 - 1012 = 6).
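The split described above takes two lines of arithmetic. This sketch mirrors the worked example (home advantage ignored here for simplicity, as in the text):

```python
def effective_strengths(rating_a, rating_b, actual_margin_a_minus_b):
    """Effective playing strengths for one game: split the observed
    margin evenly about the midpoint of the two teams' ratings."""
    midpoint = (rating_a + rating_b) / 2.0
    eff_a = midpoint + actual_margin_a_minus_b / 2.0
    eff_b = midpoint - actual_margin_a_minus_b / 2.0
    return eff_a, eff_b

# The worked example: A rated 1010, B rated 1020, but A wins by 6.
eff_a, eff_b = effective_strengths(1010.0, 1020.0, 6.0)
```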

    Report the results -- Finally, the program writes out web pages.....


    Questions or comments: send email to Bob Walker ()