Where Does Vegas Get Those Wonderful (Tennis Forecasting) Toys?

Vegas is good at math.  House payouts in casino games are carefully tailored to ensure a house edge based on probability theory for the game being played (cards, dice, etc.).  For example, if you are rolling two balanced dice, there are only 36 combinations.  Each sum of the dice from 2-12 has a fixed probability.  There’s only one way to roll a 2 (snake eyes), so there’s a 1/36 chance of that.  There are four ways to roll a 5, so there’s a 1/9 chance of that.  Etc.  House payouts are set below your odds, so that over time, you will give the house more than it gives you.
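
For readers who want to check the dice arithmetic, here is a minimal Python sketch that enumerates all 36 rolls of two fair dice and counts how many produce each sum.  The variable names are just illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 equally likely rolls produce each sum.
ways = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Convert the counts to exact probabilities.
dice_prob = {total: Fraction(count, 36) for total, count in ways.items()}

print(dice_prob[2])  # 1/36
print(dice_prob[5])  # 1/9
```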

Sports bookmaking is different, in that pure theory does not work because there are too many unknown variables.  Bookmakers have to rely on experience to achieve profitable oddsmaking.  This morning as I was comparing some of my WTA forecasts to the Vegas odds, I noticed one particular area in which the Vegas odds deviate significantly from a mathematical model of tennis, and I thought it might be fun to explore that.

First a little background on the most common mathematical model for tennis.

Markov, Again

The Basics

On this site and elsewhere, you will see discussions of a Markov model of a tennis match.  A Markov model starts with the probability of a player winning a point on her or his serve.  The Markov model then examines all the progressions from one “state” to another “state” and generates the probabilities of each progression happening.  If you already are familiar with the Markov model, you can skip this subsection.

Let’s start at the beginning of a match to illustrate how the Markov model works.  The game starts with a score of 0-0, which is the base state.  There are two possible states that can come next (transition states).  One is 15-0 and one is 0-15.  If the server has a 70% chance of winning a point on his serve, then the chances of the next state being 15-0 are 70% and the chances of 0-15 are 30%.  Straightforward.  The new base state is whatever happened on that last point.  Let’s assume the server won the point, so the new base state is 15-0.

The next possible transition states are either 30-0 or 15-15.  The probability of the server winning the point remains at 70% (at least theoretically), so looked at from the start of the game, the probability of reaching 30-0 is 0.70*0.70 = 49%.  The probability of reaching 15-15 after the server won the first point, looked at from the start of the game, is 0.70*0.30 = 21%.  Ah, but not so simple.  At the start of the match, you don’t know if the server will win the first point.  There’s a 30% chance it is 0-15 after the first point.  Then the chances of it being 15-15 are 0.30*0.70 = 21%.

In other words, there are two ways to get to 15-15:  (1) 70% chance the server wins the first point and 30% chance he loses the second point and (2) 30% chance the server loses the first point and 70% chance he wins the second point.  Add those together and you have a 42% chance of 15-15.  We had a 49% chance of 30-0.  So that means there’s only a 9% chance of 0-30.  We know this because there are only three possibilities after the second point (30-0, 15-15 and 0-30) and the probabilities for the first two add to 91%.  You get to the same place using the model, by squaring the returner’s chance of winning a point:  0.30*0.30 = 9%.
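
The bookkeeping above can be sketched in a few lines of Python.  This is only an illustration of the state-to-state idea, assuming a constant 70% point probability for the server.  The function name `score_distribution` is made up for this example, and it only handles the first three points, before anyone can win the game.

```python
def score_distribution(p_server, n_points):
    """Probability of each (server points, returner points) score
    after n_points points, starting from 0-0."""
    if not 0 <= n_points <= 3:
        raise ValueError("this sketch only covers the first three points")
    dist = {(0, 0): 1.0}  # base state: 0-0 with certainty
    for _ in range(n_points):
        nxt = {}
        for (s, r), prob in dist.items():
            # Server wins the point: move to (s + 1, r).
            nxt[(s + 1, r)] = nxt.get((s + 1, r), 0.0) + prob * p_server
            # Returner wins the point: move to (s, r + 1).
            nxt[(s, r + 1)] = nxt.get((s, r + 1), 0.0) + prob * (1 - p_server)
        dist = nxt
    return dist

after_two = score_distribution(0.70, 2)
for score, prob in sorted(after_two.items(), reverse=True):
    print(score, f"{prob:.0%}")
```

The keys count points won, so (2, 0) is 30-0, (1, 1) is 15-15 and (0, 2) is 0-30.  The two routes to 15-15 are added together automatically because both land on the same key.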

Now all you have to do is do that for every possible state, account for the fact that one player serves every other game and his opponent probably has a different service point percentage, and take deuces into account.  If you can do that, you can get to a game probability for each player.  From a game probability, you can do the same sort of thinking to determine a set probability, which has the added trickiness of a tiebreak, with serves switching after points rather than games.  And from set probability, it is much easier to determine a match probability, based on the “best-of” number of sets.
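
As a rough sketch of the first of those steps, here is one way to get from a service point probability to a game probability in Python.  It assumes every point is independent with the same probability, and it handles deuce with the standard closed form p²/(p² + q²).  The function name `game_win_prob` is hypothetical, and this is not the code behind any particular calculator.  Sets, tiebreaks and alternating servers are left out.

```python
from functools import lru_cache

def game_win_prob(p):
    """Probability that the server wins a service game, given the
    probability p of winning any single point on serve."""
    q = 1.0 - p

    @lru_cache(maxsize=None)
    def win_from(s, r):
        # s and r are points won so far by server and returner.
        if s == 4:
            return 1.0  # server reached four points before deuce
        if r == 4:
            return 0.0  # returner reached four points before deuce
        if s == 3 and r == 3:
            # Deuce: the server must win two points in a row before
            # the returner does.
            return p * p / (p * p + q * q)
        # Otherwise, weight the two possible next states.
        return p * win_from(s + 1, r) + q * win_from(s, r + 1)

    return win_from(0, 0)

print(f"{game_win_prob(0.70):.4f}")
```

With a 70% point probability the server holds roughly 90% of the time, which shows how a modest edge on each point compounds over a game.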

I have a pre-match Markov calculator here, which skips all the intermediate steps.  You can plug in your projected service point probabilities and see the results.  With the proper calculations you can also use the Markov model in the middle of matches, given a particular game and set score.  I (and others) have some tools to calculate those on our home computers, but the math is far more complicated and I do not have the skills or tools to put a calculator like that on this website.

Set Scores from the Markov Model

To recap, if we can calculate game probabilities from service point probabilities, then we can calculate set probabilities from the game probabilities.  Set probabilities are used to generate match probabilities, which means you can also use the set probabilities to determine the likelihood of the set score in a match.  For this post we’ll use a best-of-3 format and use the Zarina Diyas vs Nicole Gibbs second round qualifying match at Indian Wells as an example.

First we need some service point probabilities for Diyas and Gibbs.  Where do we get those?  That’s the million dollar question.  I have a method for estimating them that suggests Diyas will be around 54.8% on her serve and Gibbs will be at 50.5%, but this is notoriously hard to forecast.

You also can back into service point probabilities, starting with the Vegas odds and going backwards through the Markov chain (i.e., instead of going serve–>game–>set–>match, you start with the Vegas match odds–>set–>game–>serve).  When I last checked OddsPortal, the decimal odds for Diyas and Gibbs were 1.61 and 2.28.  Take out the “juice” and you’ve got implied probabilities of 58.6% and 41.4%.  Running those match probabilities backwards through Markov you get service point percentage estimates of 53.54% and 51.86%.
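
Here is a minimal sketch of taking out the juice, assuming simple proportional normalization (divide each raw implied probability by their sum).  Other methods exist, but this one reproduces the 58.6% and 41.4% figures.  The function name `remove_juice` is just for illustration.

```python
def remove_juice(decimal_odds):
    """Convert decimal odds to implied probabilities that sum to 1,
    assuming the bookmaker margin is spread proportionally."""
    raw = [1.0 / odds for odds in decimal_odds]  # raw implied probabilities
    total = sum(raw)                             # exceeds 1 because of the juice
    return [r / total for r in raw]

diyas, gibbs = remove_juice([1.61, 2.28])
print(f"Diyas {diyas:.1%}, Gibbs {gibbs:.1%}")
```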

Since we’re talking Vegas here, we will use the implied Vegas service point probability estimates.  Plug them into the Markov calculator and you’ll see Diyas’s chance of winning a set is 55.8%.

Suppose you want to use the Markov model to generate the probability of Diyas winning the match in straight sets.  Well that’s just 0.558*0.558 = 31.1%.  And since there’s only one other way to win a best-of-3, her odds of winning in three sets are her overall probability of winning the match less the straight sets probability:  0.586-0.311 = 27.5%.  Do the same for Gibbs (using 1.000 – 0.558 = 0.442 for her set percentage) and you’ll get 19.6% for straight sets and 21.8% for a three-set victory.
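
The same arithmetic can be sketched in Python, assuming each set is independent and Diyas has the same 55.8% chance in every set.  The function name is hypothetical.  The three-set figures are computed directly here, as two split sets followed by a deciding-set win, which under these assumptions matches the subtraction above apart from rounding.

```python
def best_of_three_scores(p_set):
    """Set-score probabilities in a best-of-3 match, where p_set is
    player A's probability of winning any single set."""
    q_set = 1.0 - p_set
    return {
        "A 2:0": p_set ** 2,
        # A and B split the first two sets (two orders), then A wins the third.
        "A 2:1": 2 * p_set * q_set * p_set,
        "B 2:0": q_set ** 2,
        "B 2:1": 2 * p_set * q_set * q_set,
    }

scores = best_of_three_scores(0.558)
for label, prob in scores.items():
    print(label, f"{prob:.1%}")
```

Because 55.8% is itself a rounded number, the last digit can differ slightly from the figures quoted above.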

Notice that because Diyas is expected to win the match, the Markov model shows it will be easier for her to win in straight sets than in three sets.  The opposite is true of Gibbs.  Because she is not expected to win the match, the Markov model shows she is less likely to win in straight sets.

What Does Vegas Say the Set Probabilities Are?

If you still have your OddsPortal tab open, switch to OddsPortal’s “C/S” tab.  This shows the players’ odds of winning in straight sets and three sets.*  If the formatting holds, you will be seeing decimal odds, where a lower number indicates higher probability.  Don’t focus on the specific numbers yet.  Just look at how they progress from Diyas’s straight sets odds at the top of the chart to Gibbs’s three-set odds at the bottom of the chart.  Notice anything vis-a-vis the Markov model?

*I see that the match has started, so you may not see what I was seeing, but you can look at other matches for the same idea.

That’s right, Gibbs is expected to be more likely to win in straight sets than in three sets, despite being the overall underdog.  And that’s true for every WTA match for which I looked up the Vegas odds.

If you take out the Vegas juice, the set score probabilities for Diyas-Gibbs are:

Diyas 2:0 36.3%
Diyas 2:1 22.1%
Gibbs 2:0 23.4%
Gibbs 2:1 18.2%

*These do not add up to the same overall match probability I used above, but that’s because the match probability included a very large number of bookmakers in the average, and the set score subset only has 13 mostly-smaller bookmakers.  That incongruity doesn’t matter for this post.

Markov says “Gibbs is supposed to lose, so if she wins, it should be harder for her to win.”  Vegas says “Gibbs is supposed to lose, but if she wins, she probably will win in straight sets.”  Why?

Actual Match Histories

I searched all WTA tour-level events (including qualifying) to break down two-set matches and three-set matches.  You probably know this intuitively, but there are more straight-set victories than three-set victories.  What you may not know is how big that gap is.  Since 2015, about two-thirds of matches have been decided in straight sets.

What about underdogs?  When they win, is it disproportionately in two sets?  Yes, although it is closer to 62%.  (Note: For this purpose, I have used the players’ official WTA rankings at the time as a stand-in for Vegas underdog status.)  Markov may expect the underdog to have a harder time winning in straight sets, but Vegas knows most WTA matches won by underdogs are won in two sets by a factor of more than 1.5 to 1.

So Vegas Relies on the Aggregate Match Histories?

Not exactly.  If it did that without reference to the players themselves, the set splits would be wider.  Favorites, when they win, do so in straight sets almost 70% of the time.  Vegas has Diyas with about a 62% split.  Underdogs, when they win, do so in straight sets almost 62% of the time.  Vegas has Gibbs with about a 56% split.

Let’s go back to the Markov set forecast for a second.  Under the model, Diyas should have a 31.1% chance of winning the match in straight sets and 27.5% in three sets.  That’s a 53.1% split.  Under the model, Gibbs has 19.6% for straight sets and 21.8% for a three-set victory, for a 47.3% split.
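
Each “split” in this section is the straight-sets probability divided by the player’s total chance of winning the match.  A minimal Python sketch, using a made-up helper name and the percentages quoted in this post:

```python
def straight_set_split(p_straight, p_three):
    """Share of a player's wins that come in straight sets."""
    return p_straight / (p_straight + p_three)

# Markov model figures
markov_diyas = straight_set_split(31.1, 27.5)
markov_gibbs = straight_set_split(19.6, 21.8)

# Vegas figures with the juice removed
vegas_diyas = straight_set_split(36.3, 22.1)
vegas_gibbs = straight_set_split(23.4, 18.2)

print(f"Markov: Diyas {markov_diyas:.1%}, Gibbs {markov_gibbs:.1%}")
print(f"Vegas:  Diyas {vegas_diyas:.1%}, Gibbs {vegas_gibbs:.1%}")
```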

What would happen if you took the implied splits from the Markov model and averaged them with the historical results for favorites and underdogs? For Diyas, the 53.1% Markov split and 69.8% favorite split would get you 61.5%.  For Gibbs, the 47.3% model split and 61.9% underdog split would get you 54.6%.   Let’s again look at the table from above, but this time with this blended method in the rightmost column.

            Vegas    Blend
Diyas 2:0   36.3%    36.0%
Diyas 2:1   22.1%    22.6%
Gibbs 2:0   23.4%    22.6%
Gibbs 2:1   18.2%    18.8%

Not bad.
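
The blended column can be reproduced with a short Python sketch.  It assumes a plain 50/50 average of the Markov split and the historical split, applied to the juice-free match probabilities of 58.6% and 41.4%.  The function name `blended_scores` is hypothetical, and nothing here should be read as how a bookmaker actually sets its lines.

```python
def blended_scores(p_match, markov_split, historical_split):
    """Return (straight-sets, three-sets) win probabilities using an
    equal-weight average of the model split and the historical split."""
    split = (markov_split + historical_split) / 2
    p_straight = p_match * split
    return p_straight, p_match - p_straight

diyas_20, diyas_21 = blended_scores(0.586, 0.531, 0.698)  # favorite
gibbs_20, gibbs_21 = blended_scores(0.414, 0.473, 0.619)  # underdog

print(f"Diyas 2:0 {diyas_20:.1%}  Diyas 2:1 {diyas_21:.1%}")
print(f"Gibbs 2:0 {gibbs_20:.1%}  Gibbs 2:1 {gibbs_21:.1%}")
```

The equal weighting is the simplest possible choice.  A different weight would shift the splits, which is one reason to treat the close match with the Vegas numbers as suggestive rather than conclusive.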

You may think I just got lucky with Diyas and Gibbs, or that I planned it this way, which I did not (it was the match I was most interested in).  However, I used this same procedure on all the second-round qualifying WTA matches in Indian Wells, and the splits were within approximately 2% for nearly all of the 24 players (the outlier was about 4.5%).  A 2% difference in split is smaller than it sounds when translated to the actual probabilities, where it is more like a 1% difference in expected set winning percentage.

I doubt that Vegas does it exactly this way, but it’s interesting to see how a mathematical model and match histories can come together to generate set probabilities.