This GitHub page describes predicting the 2016 european soccer championships matches using ordinal regression. The GitHub repository related to this page is here.
Articles (in finnish) describing the analysis and the predictions for the EURO 2016 finals are available in tyyppiarvo.com:
Note that all articles are in Finnish.
I combined the statistics from the 2008 and 2012 qualifiers (“homeq” and “awayq” features) with the results of the EURO 2008 and 2012 final tournaments and studied how the qualifier statistics could be used to predict match outcomes.
emq <- get(load("data/EUROq.Rda"))
emq <- emq[complete.cases(emq),]
emq
Since only the relative strength of two teams can be though to predict the outcome of a match, I studied the ratios and differences of the qualifier statistics as predictors for the final tournament match goal differences and -outcomes.
# ratios and differences of team stats
homevars <- substr(names(emq),1,5)=="homeq"
awayvars <- substr(names(emq),1,5)=="awayq"
#fifa
homevars[7] <- T
awayvars[8] <- T
ratios <- emq[,homevars] / emq[,awayvars]
diffs <- emq[,homevars] - emq[,awayvars]
names(ratios) <- gsub("home", "", names(ratios))
names(diffs) <- names(ratios)
I studied the connections of hometeam vs. awayteam qualifier statistics ratios and differences with the final tournament match goal differences, using scatter plots and correlations. I did this to choose which statistics and which version (ratio or difference) I should use as predictors for the match outcomes.
#correlations
corrs <- rbind(cor_ratios=apply(ratios, 2, cor, emq$goal_diff),
cor_diffs=apply(diffs, 2, cor, emq$goal_diff))
t(corrs)
cor_ratios cor_diffs
fifa_ 0.40009797 0.39304221
q_avrg_ontarget 0.49041619 0.47996598
q_avrg_offtarget 0.58930183 0.55530451
q_avrg_shots 0.56855521 0.54889152
q_avrg_scored 0.25636224 0.22931319
q_avrg_conceded 0.11908885 0.05685243
q_avrg_goaldifference 0.27329631 0.21734655
q_offsides -0.08345245 -0.06243588
q_corners 0.40939549 0.44114740
Correlations of home vs awayteam qualifier statistic ratios and differences versus final tournament match goal difference. The strongest linear connections are given by the ratios, except with corners were the difference in qualifier corner kicks has the strongest connection to final tournament match goal difference.
library(reshape2)
library(ggplot2)
df <- melt(cbind(goaldiff = emq$goal_diff, ratios), id.vars = "goaldiff")
ggplot(df, aes(value, goaldiff)) + geom_point() +geom_smooth(method = "lm") + facet_wrap("variable", scales = "free") + ggtitle("Qualifier statistic ratios vs. tournament goal differences, EURO 2008 & 2012")
To study the explanatory power of the qualifier statistics more closely, I fit an ordinal regression model with the final tournament match outcome as the target variable and a selection of promising features (those correlating with the match goal difference) as the explanatory variables. I used lasso regularisation to study which features best explain the final tournament match outcome.
features <- cbind(ratios[,1:4], diffs[,8])
colnames(features) <- c("fifa_r","ontarget_r","offtarget_r", "shot_r","corners_d")
outcome <- emq$outcome
library(glmnetcr)
netfit <- glmnet.cr(features, outcome, maxit=500)
par(mfrow = c(2,1), mar=c(4,4,2,10), cex.axis=0.7)
plot(netfit, xvar="step", type="aic")
plot(netfit, xvar = "step", type = "coefficients")
I decided it didn’t make sense that somewhow offtarget shots were a better predictor than ontarget shots and total shots. So I excluded the off and ontarget shots and used only the ratio of the total shots.
features2 <- cbind(ratios[,c(1,4)], diffs[,8])
colnames(features2) <- c("fifa_r","shot_r","corners_d")
netfit2 <- glmnet.cr(features2, outcome, maxit=500)
par(mfrow = c(2,1), mar=c(4,4,2,10), cex.axis=0.7)
plot(netfit2, xvar="step", type="aic")
plot(netfit2, xvar = "step", type = "coefficients")
# a scatter plot of shots ratio and goal difference
emq$shot_ratio <- ratios[,4]
plot(emq$shot_ratio, emq$goal_diff,
xlab="SR",
ylab="match goal difference",
main="Qualifier shot ratio (SR)
versus tournament match goal difference
EM 2008 & 2012",
col="green4", pch=20)
abline(lm(goal_diff~shot_ratio, data = emq), col="darksalmon")
legend("bottomright",
legend=paste("correlation:",round(cor(emq$shot_ratio, emq$goal_diff),2)),
bty="n")
I an ordinal regression models using only the qualifiers shot ratio as explanatory variable. The table shows the prediction made by the model and the data used to train the model (EURO 2008 % 2012 tournament matches).
library(MASS)
fit1 <- polr(outcome ~ shot_ratio, Hess=T, data = emq)
cbind.data.frame(emq[,c("hometeam","awayteam","outcome")], predict(fit1, type="probs"))
I also studied the pre-tournament uefa rating as a predictor to the match outcomes. The definition of the uefa rating changed after 2008 so only the data from 2012 could be used in the analysis.
emu <- read.csv2("data/em2012_uefa.csv", stringsAsFactors = F)
emu$uefa_ratio <- emu$uefa_koti / emu$uefa_vieras
emu$goal_diff <- emu$maalit_koti - emu$maalit_vieras
with(emu,cor.test(uefa_ratio, goal_diff))
Pearson's product-moment correlation
data: uefa_ratio and goal_diff
t = 3.4872, df = 29, p-value = 0.001576
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.2343442 0.7528831
sample estimates:
cor
0.5435493
The correlation between ratios of pre-tournament uefa rating and tournament match goal difference is positive and statistically significant.
with(emu,{
plot(uefa_ratio,goal_diff,
xlab="UCR",
ylab="match goal difference",
main="Uefa coefficient ratio (UCR)
versus tournament match goal difference \n EM 2012.",
col="green4", pch=20)
abline(lm(goal_diff~uefa_ratio), col="darksalmon")
legend("bottomright",
legend=paste("correlation:",round(cor(uefa_ratio, goal_diff),2)),
bty="n")}
)
Pre-tournament uefa ratio correlated reasonably with the final tournament match goals so I fit a second ordinal regression model using the uefa ratio as a predictor. The table shows the predictions made by the model and the data used to train the model (EURO 2012 tournament matches).
# model based on 2012 uefa ratings
# define the outcome
emu$outcome <- 1
emu$outcome[emu$goal_diff > 0] <- 3
emu$outcome[emu$goal_diff < 0 ] <- 0
emu$outcome <- factor(emu$outcome,ordered = T, levels=c(0,1,3),labels=c("loss","draw","win"))
fit2 <- polr(outcome ~ uefa_ratio, Hess=T, data=emu)
cbind.data.frame(emu[,c("kotijoukkue","vierasjoukkue","outcome")], predict(fit2, type="probs"))
I averaged the predictions given by the two ordinal regression models to predict the matches of the 2016 tournament, using the ratio of qualifier shots and uefa ratings as the predicting features.
# load 2016 matches
matches2016 <- get(load("data/matches2016.Rda"))
# use mass library to fit ordinal regression models
library(MASS)
# use the models to predict match outcome probabilities for 2016
probs1 <- predict(fit1, newdata=matches2016, type="probs")
probs2 <- predict(fit2, newdata=matches2016, type="probs")
# combine the predictions with the match data
predictions2016 <- 0.5*(probs1+probs2)[,3:1]
alkulohko <- cbind(matches2016, predictions2016)
# exclude the explanatory variables
alkulohko2016 <- alkulohko[,c("date","group","hometeam","awayteam","win","draw","loss", "homeuefa", "awayuefa", "uefa_ratio", "homeavrg_shots", "awayavrg_shots", "shot_ratio")]
alkulohko2016
# calculate expected points per match
getEV <- function(p) {sum(c(3,1,0)*p)}
homePoints <- apply(predictions2016, 1, getEV)
awayPoints <- apply(predictions2016[,3:1], 1, getEV)
alkulohko2016 <- cbind(alkulohko2016[, 1:7], homePoints, awayPoints)
# rounding
alkulohko2016[,5:9] <- round(alkulohko2016[,5:9], 2)
alkulohko2016
# calculate expected points by group
# function
get_total_points <- function(data, team) {
homedata <- data[data$hometeam==team,]
awaydata <- data[data$awayteam==team,]
homeEV <- sum(homedata$homePoints)
awayEV <- sum(awaydata$awayPoints)
return(homeEV+awayEV)
}
group <- alkulohko2016$group
# group predictions
by(alkulohko2016,group,FUN=function(d) {
teams <- unique(c(d$hometeam,d$awayteam))
points <- t(sort(sapply(teams,get_total_points,data=d),decreasing = T))
rownames(points) <- "points"
points
})
group: A
France Switzerland Romania Albania
points 6 5.61 3.84 1.49
--------------------------------------------------------
group: B
England Russia Slovakia Wales
points 6.21 5.02 2.86 2.73
--------------------------------------------------------
group: C
Germany Ukraine Poland Northern Ireland
points 6.92 3.97 3.94 2.08
--------------------------------------------------------
group: D
Spain Croatia Turkey Czech Republic
points 6.11 4.1 3.42 3.11
--------------------------------------------------------
group: E
Belgium Italy Sweden Republic of Ireland
points 6.17 5.28 3.35 2.06
--------------------------------------------------------
group: F
Austria Portugal Hungary Iceland
points 5.76 5.48 2.9 2.69
Tuomo Nieminen 2017