A Hierarchical Bayesian Analysis of Horse Racing
Noah Silverman, MS
UCLA Department of Statistics
[email protected]
1. INTRODUCTION
Horse racing is the most popular sport in Hong Kong. Nowhere else in the
world is such attention paid to the races, or such large sums of money bet. It
is, for all practical purposes, a national sport. The popular literature has many
stories of computerized "betting teams" winning fortunes through statistical
analysis [1]. Additionally, numerous academic papers have been published on
the subject, implementing a variety of statistical methods. The academic
justification for these papers is that a parimutuel game represents a study in
decision-making under uncertainty, market efficiency, and even investor
psychology. A review of the available published literature failed to find
any Bayesian approach to this modeling challenge.
This study attempts to predict the running speed of a horse in a given
race. To that end, the coefficients of a linear model are estimated using the
Bayesian method of Markov chain Monte Carlo (MCMC). Two methods of
sampling from the posterior are used and their results compared: the Gibbs
method assumes that all the coefficients are normally distributed, while the
Metropolis method allows their distribution to have an unknown shape. I will
calculate and compare the predictive results of several models built with these
Bayesian methods.
2. PARIMUTUEL BETTING IN HONG KONG
At the racecourses in Hong Kong, the games are truly parimutuel. The
bettors all place their bets in a "pool", which is divided amongst the winners
immediately at the end of each race. Various pools exist, representing
different betting combinations, but this paper focuses on the "win pool",
which holds bets on a given horse to win the race. Unlike a casino, the track
does not bet against the public; instead it takes a fixed percentage (18% in
Hong Kong) from each betting pool. The remainder is divided amongst the
winning bettors proportionally, based upon the amount each bet.
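To make the pool arithmetic concrete, the sketch below (in R, with
hypothetical pool figures rather than data from this study) computes the
dividend per dollar bet on the winner:

```r
# Win-pool accounting with hypothetical figures (not data from this study):
# the track removes its 18% take, and the rest is split among winning bettors
# in proportion to their stakes.
win_pool      <- 1000000   # total dollars bet in the win pool
take          <- 0.18      # Hong Kong track take
bet_on_winner <- 250000    # dollars bet on the eventual winner

net_pool <- win_pool * (1 - take)      # amount actually paid out
dividend <- net_pool / bet_on_winner   # payout per dollar bet on the winner
dividend                               # 3.28: the posted "odds"
```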
A large tote board at the track displays the expected winnings per dollar
bet on each horse. These figures are commonly called the "odds", and naive
bettors often mistakenly interpret them as a horse's probability of winning.
What the posted odds do represent is a measure of the aggregate public
opinion about a horse's likelihood of winning the race. Empirical study shows
a correlation of roughly 40% between the public's opinion, as expressed in
the payoff odds, and a horse's finishing position, so the market may be
considered weakly efficient.
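One way to recover the public's implied win probabilities from the posted
odds is to invert each dividend, net of the take, and renormalize. A minimal
R sketch follows; the dividends shown are hypothetical:

```r
# Back out the public's implied win probabilities from posted dividends.
# The eight dividends below are hypothetical, not data from this study.
dividends <- c(2.1, 3.5, 6.0, 9.8, 14.0, 21.0, 34.0, 50.0)  # payout per $1
take      <- 0.18

raw      <- (1 - take) / dividends   # invert each dividend, net of the take
p_public <- raw / sum(raw)           # renormalize to sum to one
round(p_public, 3)
```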
3. LITERATURE REVIEW
Since the advent of horse racing, people have searched for a way to profit
from the game. In 1986, Bolton and Chapman described a multinomial logit
model that forms the basis for most modern prediction methods [2]. In that
paper they develop a logistic regression model for predicting the "utility" of a
horse, and mildly positive results (profits) were produced.
In 1994, Chapman published a second paper that refined the concepts of
his first while applying them to the horse racing industry in Hong Kong [3].
He compared his predicted results to the public's and found that a
combination of the two produced the most profitable results.
In 1994, Bill Benter published what many consider to be the seminal work
on the subject, "Computer Based Horse Race Handicapping and Wagering
Systems" [7]. In the paper, Benter develops a two-stage prediction process. In
stage one, he uses a conditional logit to calculate the "strength" of a horse. In
the second stage, he combines the strength measure with the public's
predicted probability using a second conditional logit function. Benter reports
that his team made significant profits during its five-year gambling operation.
(Unlike the other academics discussed here, Benter actually lived in Hong
Kong and conducted a real betting operation.)
In 2007, Edelman published an extension of Benter's two-stage technique
[4]. Edelman proposes using a support vector machine (SVM) instead of a
conditional logit for the first stage of the process, on the rationale that an
SVM will better capture subtle relationships in the data. He theorizes that the
betting market is near-efficient and that the bulk of the information about a
horse is already contained in its market odds for the race. His method
simplifies Benter's in that it combines only the odds from a horse's last race
with the outcome and conditions of that race and the conditions of the current
race.
Lessman and Sung then expanded on Edelman's work in 2007 by
modifying the first-stage SVM process [5]. They theorized that because only
the jockeys in the first few finishing positions are trying their hardest,
information from later finishers is not accurate, as those jockeys are not
riding to their full potential. The authors develop a data-importance algorithm
based on the Normalized Discounted Cumulative Gain (NDCG), in which a
horse's data is weighted as a function of its finishing position, so that data
from first-place finishers counts for more than data from later finishers. This
NDCG is used to tune the hyperparameters of the SVM, which is then used as
part of the traditional two-stage model.
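A rough illustration of this idea, position weights that decay with finishing
rank, is sketched below in R. The logarithmic discount is an illustrative
choice; the exact gain and discount functions used in [5] are not reproduced
here:

```r
# Position-discounted data weights in the spirit of NDCG: a horse's data
# counts for less the further back it finished. The logarithmic discount
# below is an illustrative choice, not the exact function from [5].
ndcg_weight <- function(position) 1 / log2(position + 1)

positions <- 1:14                 # a typical Hong Kong field size
w <- ndcg_weight(positions)
w <- w / max(w)                   # scale so the winner's data has weight 1
round(w, 3)
```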
Speed, the dependent variable, shows some correlation with the other
variables, as described in Table 1 (see the Appendix). The data were divided
into a training set, consisting of races run prior to January 1st, 2009, and a
test set, consisting of races run during 2009. A Bayesian MCMC approach is
used to estimate the running speed of a horse from the covariates in the
training set; model performance is then evaluated on the test set.
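A minimal sketch of this split, assuming a data frame `races` with a
Date-valued `race_date` column (both names are illustrative):

```r
# Split races into a training set (before 2009) and a test set (during 2009).
# `races` and `race_date` are assumed names for the data set and its date column.
cutoff <- as.Date("2009-01-01")
train  <- races[races$race_date <  cutoff, ]
test   <- races[races$race_date >= cutoff, ]
```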
5.1 Priors
First, I wrote a custom R script to simulate draws from the posterior using
the Gibbs method. Initial tests showed some autocorrelation in the MCMC
chains, so the code was adjusted to store only one out of every 10 iterations
of the chain. A total of 300,000 iterations produced 30,000 samples from the
posterior. The chain converged well after 200,000 iterations, so the final
100,000 iterations were treated as samples from the converged posterior;
storing one out of every 10 gave a chain of 10,000 draws. Additionally, I
calculated the residual sum of squares (RSS) for each iteration and stored it
along with each posterior sample. This chain of 30,000 RSS values allowed
me to track the accuracy of the inference. The residual sum of squares for this
method converged to 988.637, which corresponds to a predicted squared error
of 0.0350 for an individual horse.
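A minimal sketch of such a Gibbs sampler follows, in the style of the
semi-conjugate normal linear model in Hoff [10]. The design matrix `X`,
response `y`, and the prior hyperparameters (`beta0`, `Sigma0`, `nu0`, `s20`)
are illustrative assumptions, not necessarily the settings used in this study:

```r
# Gibbs sampler for a normal linear model with semi-conjugate priors (Hoff [10]):
# beta ~ N(beta0, Sigma0), sigma^2 ~ inverse-gamma(nu0/2, nu0*s20/2).
# Priors and run lengths here are illustrative assumptions.
gibbs_lm <- function(y, X, n_iter = 300000, burn = 200000, thin = 10,
                     beta0 = rep(0, ncol(X)), Sigma0 = diag(100, ncol(X)),
                     nu0 = 1, s20 = 1) {
  n <- nrow(X); p <- ncol(X)
  XtX <- crossprod(X); Xty <- crossprod(X, y)
  iSigma0 <- solve(Sigma0)
  beta <- rep(0, p); s2 <- var(y)
  keep <- matrix(NA, (n_iter - burn) / thin, p)   # 10,000 x p, as in the text
  rss_chain <- numeric(nrow(keep)); k <- 0

  for (it in 1:n_iter) {
    # Full conditional of beta given sigma^2: multivariate normal
    V <- solve(iSigma0 + XtX / s2)
    m <- V %*% (iSigma0 %*% beta0 + Xty / s2)
    beta <- as.vector(m + t(chol(V)) %*% rnorm(p))

    # Full conditional of sigma^2 given beta: inverse-gamma
    rss <- sum((y - X %*% beta)^2)
    s2  <- 1 / rgamma(1, (nu0 + n) / 2, (nu0 * s20 + rss) / 2)

    # After burn-in, keep every `thin`-th draw along with its RSS
    if (it > burn && (it - burn) %% thin == 0) {
      k <- k + 1
      keep[k, ] <- beta
      rss_chain[k] <- rss
    }
  }
  list(beta = keep, rss = rss_chain)
}
```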
Next, I wrote custom R code to generate draws from the posterior using
Metropolis-Hastings. Following the same model as the Gibbs technique
above, each candidate draw was accepted or rejected according to the
Metropolis acceptance ratio:
$$ r \;=\; \frac{p(\beta^{*} \mid y)}{p(\beta^{(s)} \mid y)} \;=\; \frac{p(y \mid \beta^{*})\, p(\beta^{*})}{p(y \mid \beta^{(s)})\, p(\beta^{(s)})} \tag{15} $$

where the candidate $\beta^{*}$ is accepted with probability $\min(1, r)$.
Initially the acceptance ratio was low, so the variance of the proposal
distribution was adjusted through trial and error until it produced a reasonable
acceptance ratio of 0.54. A total of 200,000 iterations were run. The chains
converged after 100,000 iterations, so the remaining 100,000 were used as
samples from the converged posterior. Since one out of every 10 draws was
stored, the resulting posterior chain was 10,000 draws long. The residual sum
of squares was again tracked for each stored sample.
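A corresponding random-walk Metropolis sketch is below. The flat normal
prior in `log_post()` and the proposal standard deviation `prop_sd` are
illustrative assumptions, with `prop_sd` playing the role of the trial-and-error
variance described above:

```r
# Random-walk Metropolis for the same linear model. The prior in log_post()
# and the proposal sd `prop_sd` are illustrative assumptions.
log_post <- function(beta, y, X, s2 = 1, prior_sd = 10) {
  mu <- as.vector(X %*% beta)
  sum(dnorm(y, mu, sqrt(s2), log = TRUE)) +
    sum(dnorm(beta, 0, prior_sd, log = TRUE))
}

metropolis_lm <- function(y, X, n_iter = 200000, burn = 100000, thin = 10,
                          prop_sd = 0.01) {
  p <- ncol(X)
  beta <- rep(0, p)
  lp <- log_post(beta, y, X)
  keep <- matrix(NA, (n_iter - burn) / thin, p)   # 10,000 x p, as in the text
  acc <- 0; k <- 0

  for (it in 1:n_iter) {
    beta_star <- beta + rnorm(p, 0, prop_sd)      # symmetric proposal
    lp_star <- log_post(beta_star, y, X)
    if (log(runif(1)) < lp_star - lp) {           # accept w.p. min(1, r)
      beta <- beta_star; lp <- lp_star; acc <- acc + 1
    }
    if (it > burn && (it - burn) %% thin == 0) {
      k <- k + 1; keep[k, ] <- beta
    }
  }
  list(beta = keep, accept_rate = acc / n_iter)
}
```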
Predicted speeds were calculated for each horse in the test data set to
measure the predictive ability of the Gibbs model. The 10,000 posterior
draws were combined with the covariates for each horse, and the maximum a
posteriori (MAP) value of the resulting predictive distribution was stored as
that horse's "predicted speed". The variance of this predicted speed was
0.0397. The horse with the fastest predicted speed won its race 21.63% of the
time. This is better than a random choice, which would produce an expected
winner between 7.14% and 12.5% (that is, between 1/14 and 1/8, depending
on the number of horses in the race). However, simply betting on the horse
with the highest predicted speed was not enough to profit. (The ultimate goal
is not to guess winners, but to generate profit.)
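The prediction step might look like the following sketch, where `fit` is the
result of the Gibbs sampler above, and `X_test` and `test$race_id` are
assumed test-set objects; the density mode is used here as a stand-in for the
MAP summary:

```r
# Turn the 10,000 stored posterior draws into a per-horse predicted speed.
# `fit`, `X_test`, and `test$race_id` are assumed names.
speed_draws <- fit$beta %*% t(X_test)     # 10,000 draws x n_test horses

# Use the mode of the draws as a stand-in for the MAP summary in the text.
density_mode <- function(x) {
  d <- density(x)
  d$x[which.max(d$y)]
}
predicted_speed <- apply(speed_draws, 2, density_mode)

# Within each race, the model's pick is the horse with the fastest prediction.
picks <- tapply(predicted_speed, test$race_id, which.max)
```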
As a further step, a conditional logit was calculated to combine the
predicted speed with the public's odds estimate, since this is the "standard"
second stage in other predictive models. The coefficients for that model were
6.4186 for the public odds and 4.0541 for the hierarchical Bayesian predicted
speed.
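A hand-rolled version of this second-stage conditional logit can be fit by
maximizing the within-race softmax likelihood with `optim()`. The column
names (`pub_prob`, `won`) and the use of the raw public probability, rather
than a transformation of it, are assumptions:

```r
# Second-stage conditional logit: within each race, win probability is a
# softmax over a linear combination of the public's implied probability and
# the Bayesian predicted speed. `pub_prob` and `won` are assumed columns.
cond_logit_loglik <- function(theta, pub, speed, race_id, winner) {
  u <- theta[1] * pub + theta[2] * speed          # horse "utility"
  ll <- 0
  for (r in unique(race_id)) {
    idx <- race_id == r
    ll <- ll + u[idx][winner[idx]] - log(sum(exp(u[idx])))
  }
  ll
}

fit2 <- optim(c(0, 0), cond_logit_loglik,
              pub = test$pub_prob, speed = predicted_speed,
              race_id = test$race_id, winner = test$won == 1,
              control = list(fnscale = -1))       # fnscale = -1: maximize
fit2$par   # two coefficients, analogous to the 6.4186 and 4.0541 above
```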
As a further test of performance, the expected value of each $1 win bet
was calculated as the model's win probability multiplied by the posted
dividend, minus the stake:

$$ EV \;=\; p_{\text{win}} \times O \;-\; 1 $$

There were 3,323 horses in the test set with positive expected value (out of
8,618 possible). If $1 had been bet on each horse with a positive expected
value, the return would have been $2,919, a net loss of $404 (-12.15%).
While the predictive errors are small, the model is still not good enough to
generate profit without further refinement.
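The betting simulation reduces to a few lines; `p_model` and `dividend` are
assumed columns holding the combined model probability and the posted
payout per dollar:

```r
# Expected value per $1 win bet: model win probability times posted dividend,
# minus the stake. `p_model` and `dividend` are assumed columns.
ev <- test$p_model * test$dividend - 1

bets     <- ev > 0                                  # bet only positive-EV horses
staked   <- sum(bets)                               # $1 flat bet on each
returned <- sum(test$dividend[bets & test$won == 1])
c(staked = staked, returned = returned,
  roi = (returned - staked) / staked)
```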
8. ACKNOWLEDGEMENTS
The author would like to thank Juana Sanchez, Senior Lecturer at UCLA,
for all her help, and for teaching the course C236 Bayesian Statistics, where
he conceptualized the ideas for this paper.
REFERENCES
[1] Michael Kaplan, "The High Tech Trifecta", Wired Magazine, October 2003.
[2] R.N. Bolton and R.G. Chapman, "A Multinomial Logit Model for Handicapping
Horse Races", in Efficiency of Racetrack Betting Markets, Academic Press, Inc.,
1994.
[3] Randall G. Chapman, "Still Searching for Positive Returns at the Track:
Empirical Results from 2,000 Hong Kong Races", in Efficiency of Racetrack
Betting Markets, Academic Press, Inc., 1994.
[4] David Edelman, "Adapting Support Vector Machine Methods for Horserace Odds
Prediction", Annals of Operations Research (2007) 151:325-336, Springer
Science + Business Media.
[5] Stefan Lessman, Ming-Chien Sung, and Johnnie E.V. Johnson, "Adapting Least-
Square Support Vector Regression Models to Forecast the Outcome of
Horseraces", The Journal of Prediction Markets (2007) 1 3, 169-187.
[6] Stefan Lessman, Ming-Chien Sung, and Johnnie E.V. Johnson, "Identifying
Winners of Competitive Events: A SVM-based Classification Model for
Horserace Prediction", European Journal of Operational Research 196 (2009)
569-577.
[7] Bill Benter, "Computer Based Horse Race Handicapping and Wagering Systems:
A Report", in Efficiency of Racetrack Betting Markets, Academic Press, Inc.,
1994.
[8] Yulanda Chung, "A Punter's Program Makes Millions Trackside", AsiaWeek
Magazine, November 3, 2000, Vol. 26 No. 43.
[9] Michael Kaplan, "Gambling: The Hundred and Fifty Million Dollar Man", Cigar
Aficionado Magazine.
[10] Peter D. Hoff, A First Course in Bayesian Statistical Methods, Springer Science
and Business Media, 2009.
APPENDIX
Table 1. Correlation of each covariate with the dependent variable, speed.
Variable            Correlation
last rank 0.077898638
last run 1 0.091421415
last run 2 0.106251733
last run 3 0.110888381
last odds prob 0.111767807
last distance -0.562025513
last weight 0.035323525
last draw 0.007378011
last speed 0.563767210
last percentage -0.016219487
perc won 0.159669358
last 4 perc -0.024847700
last 4 rank 0.128202630
last 4 odds prob 0.147430995
rest 0.136668197
distance -0.799783155
weight 0.005454088
draw 0.022529607
total races -0.187955730
bad runs -0.077079847
jockey rides -0.030171224
dist last 30 -0.217824843
Table 2. Standardized covariate values for a sample of ten horses (variables
as rows; the first row is the intercept).
1 2 3 4 5 6 7 8 9 10
1 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
last_rank -0.98 -1.34 1.48 -1.29 0.56 1.53 1.53 0.75 -0.30 -0.37
last_run_1 0.15 -0.36 -1.60 -0.72 1.32 0.87 1.12 -0.36 0.13 1.61
last_run_2 -1.30 0.12 -1.30 -1.59 1.58 -1.59 1.10 -0.61 -1.10 1.01
last_run_3 -1.37 0.36 0.13 -1.37 1.62 -0.91 1.37 -0.40 -0.66 1.32
last_odds_prob 0.40 0.16 0.63 -0.39 -0.78 -0.62 -0.28 -0.14 -0.91 0.16
last_distance -1.12 -1.12 -1.12 1.06 1.06 0.82 -0.15 -1.12 -1.12 1.06
last_weight -0.15 1.36 -0.83 0.40 -0.56 0.81 0.67 -0.83 -1.11 0.54
last_draw 0.95 -1.12 -1.12 0.95 -0.86 1.73 -0.08 -1.38 0.18 0.18
last_speed 0.10 0.53 0.40 -1.30 -1.09 -0.08 0.65 0.04 -0.44 -1.75
last_percentage -0.43 -1.00 -0.88 0.40 0.76 0.14 0.72 -1.79 -2.57 -0.38
perc_won -0.90 -0.90 -0.90 -0.90 -0.90 -0.90 -0.90 -0.90 -0.90 0.94
last_4_perc 0.55 -0.96 -0.00 -0.37 -1.28 -0.73 -0.29 -1.69 -2.03 0.08
last_4_rank 0.00 -0.32 1.25 -0.33 -0.50 0.96 1.31 1.16 -1.24 0.20
last_4_odds_prob 0.77 0.42 0.50 0.65 -0.35 0.21 -0.24 -0.34 -1.07 0.27
rest 0.72 1.52 1.00 2.53 0.90 1.16 1.16 0.79 0.79 3.80
weight -0.20 0.26 1.03 -0.65 0.26 0.88 0.72 0.88 -0.20 -0.65
draw -0.33 1.42 0.67 0.17 0.92 -1.34 1.67 -1.59 0.42 -1.08
total_races -0.24 -0.30 -0.24 -0.54 -0.97 -0.24 -0.67 -1.10 -1.16 -0.48
jockey_rides -0.19 -0.59 -0.19 -0.59 -0.39 -0.39 -0.59 -0.59 0.21 -0.59
dist_last_30 -1.06 -1.06 -1.06 -1.06 -1.06 -1.06 -1.06 -1.06 -1.06 -1.06