Welcome

This GitHub page describes predicting the 2016 european soccer championships matches using ordinal regression. The GitHub repository related to this page is here.

Articles (in finnish) describing the analysis and the predictions for the EURO 2016 finals are available in tyyppiarvo.com:

Note that all articles are in Finnish.

2008 & 2012 qualifier and final tournament data

I combined the statistics from the 2008 and 2012 qualifiers (“homeq” and “awayq” features) with the results of the EURO 2008 and 2012 final tournaments and studied how the qualifier statistics could be used to predict match outcomes.

emq <- get(load("data/EUROq.Rda"))
emq <- emq[complete.cases(emq),]
emq

Feature engineering

Since only the relative strength of two teams can be though to predict the outcome of a match, I studied the ratios and differences of the qualifier statistics as predictors for the final tournament match goal differences and -outcomes.

# ratios and differences of team stats
homevars <- substr(names(emq),1,5)=="homeq"
awayvars <- substr(names(emq),1,5)=="awayq"
#fifa

homevars[7] <- T
awayvars[8] <- T
ratios <- emq[,homevars] / emq[,awayvars]
diffs <- emq[,homevars] - emq[,awayvars]

names(ratios) <- gsub("home", "", names(ratios))
names(diffs) <- names(ratios)

Correlations of qualifier stats and final tournament goal difference

I studied the connections of hometeam vs. awayteam qualifier statistics ratios and differences with the final tournament match goal differences, using scatter plots and correlations. I did this to choose which statistics and which version (ratio or difference) I should use as predictors for the match outcomes.

#correlations
corrs <- rbind(cor_ratios=apply(ratios, 2, cor, emq$goal_diff),
               cor_diffs=apply(diffs, 2, cor, emq$goal_diff))

t(corrs)
                       cor_ratios   cor_diffs
fifa_                  0.40009797  0.39304221
q_avrg_ontarget        0.49041619  0.47996598
q_avrg_offtarget       0.58930183  0.55530451
q_avrg_shots           0.56855521  0.54889152
q_avrg_scored          0.25636224  0.22931319
q_avrg_conceded        0.11908885  0.05685243
q_avrg_goaldifference  0.27329631  0.21734655
q_offsides            -0.08345245 -0.06243588
q_corners              0.40939549  0.44114740

Correlations of home vs awayteam qualifier statistic ratios and differences versus final tournament match goal difference. The strongest linear connections are given by the ratios, except with corners were the difference in qualifier corner kicks has the strongest connection to final tournament match goal difference.


Ratios of qualifier stats versus tournament match goal differences

library(reshape2)
library(ggplot2)
df <- melt(cbind(goaldiff = emq$goal_diff, ratios), id.vars = "goaldiff")
ggplot(df, aes(value, goaldiff)) + geom_point() +geom_smooth(method = "lm") + facet_wrap("variable", scales = "free") + ggtitle("Qualifier statistic ratios vs. tournament goal differences, EURO 2008 & 2012")
Linear connections between ratios of qualifier statistics and tournament match goal differences. All the statistics related to shot attempts have positive linear connections with the match goal difference.

Linear connections between ratios of qualifier statistics and tournament match goal differences. All the statistics related to shot attempts have positive linear connections with the match goal difference.

Feature selection with regularized regression

To study the explanatory power of the qualifier statistics more closely, I fit an ordinal regression model with the final tournament match outcome as the target variable and a selection of promising features (those correlating with the match goal difference) as the explanatory variables. I used lasso regularisation to study which features best explain the final tournament match outcome.

features <- cbind(ratios[,1:4], diffs[,8])
colnames(features) <- c("fifa_r","ontarget_r","offtarget_r", "shot_r","corners_d")
outcome <- emq$outcome
library(glmnetcr)
netfit <- glmnet.cr(features, outcome, maxit=500)

Bayesian information criteria and coefficient path (1)

par(mfrow = c(2,1), mar=c(4,4,2,10), cex.axis=0.7)
plot(netfit, xvar="step", type="aic")
plot(netfit, xvar = "step", type = "coefficients")
Akaike information criteria (AIC) and the value of coefficients by the amount of regularization used (step) in ordinal regression, where final tournament match outcomes are predicted by ratios or differences in qualifier match statistics. The best solution is the one where AIC is lowest. Only the offtarget shot ratio deviates from zero at this point.

Akaike information criteria (AIC) and the value of coefficients by the amount of regularization used (step) in ordinal regression, where final tournament match outcomes are predicted by ratios or differences in qualifier match statistics. The best solution is the one where AIC is lowest. Only the offtarget shot ratio deviates from zero at this point.


I decided it didn’t make sense that somewhow offtarget shots were a better predictor than ontarget shots and total shots. So I excluded the off and ontarget shots and used only the ratio of the total shots.

features2 <- cbind(ratios[,c(1,4)], diffs[,8])
colnames(features2) <- c("fifa_r","shot_r","corners_d")
netfit2 <- glmnet.cr(features2, outcome, maxit=500)

Bayesian information criteria and coefficient path (2)

par(mfrow = c(2,1), mar=c(4,4,2,10), cex.axis=0.7)
plot(netfit2, xvar="step", type="aic")
plot(netfit2, xvar = "step", type = "coefficients")
Akaike information criteria (AIC) and the value of coefficients by the amount of regularization used in ordinal regression, where final tournament match outcomes are predicted by ratios or differences in qualifier match statistics. The best solution is the one where AIC is lowest. Only the shot ratio deviates from zero at this point.

Akaike information criteria (AIC) and the value of coefficients by the amount of regularization used in ordinal regression, where final tournament match outcomes are predicted by ratios or differences in qualifier match statistics. The best solution is the one where AIC is lowest. Only the shot ratio deviates from zero at this point.


Qualifier shot ratio versus final tournament match goal difference

# a scatter plot of shots ratio and goal difference
emq$shot_ratio <- ratios[,4]
plot(emq$shot_ratio, emq$goal_diff,
     xlab="SR", 
     ylab="match goal difference",
     main="Qualifier shot ratio (SR) 
     versus tournament match goal difference 
     EM 2008 & 2012",
     col="green4", pch=20)
abline(lm(goal_diff~shot_ratio, data = emq), col="darksalmon")
legend("bottomright",
       legend=paste("correlation:",round(cor(emq$shot_ratio, emq$goal_diff),2)),
       bty="n")


Ordinal regression with shot ratio

I an ordinal regression models using only the qualifiers shot ratio as explanatory variable. The table shows the prediction made by the model and the data used to train the model (EURO 2008 % 2012 tournament matches).

library(MASS)
fit1 <- polr(outcome ~ shot_ratio, Hess=T, data = emq)
cbind.data.frame(emq[,c("hometeam","awayteam","outcome")], predict(fit1, type="probs"))

Uefa ratings

I also studied the pre-tournament uefa rating as a predictor to the match outcomes. The definition of the uefa rating changed after 2008 so only the data from 2012 could be used in the analysis.

Correlation of pre-tournament uefa ratio and tournament match goal difference

emu <- read.csv2("data/em2012_uefa.csv", stringsAsFactors = F)
emu$uefa_ratio <- emu$uefa_koti / emu$uefa_vieras
emu$goal_diff <- emu$maalit_koti - emu$maalit_vieras

with(emu,cor.test(uefa_ratio, goal_diff))

    Pearson's product-moment correlation

data:  uefa_ratio and goal_diff
t = 3.4872, df = 29, p-value = 0.001576
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.2343442 0.7528831
sample estimates:
      cor 
0.5435493 

The correlation between ratios of pre-tournament uefa rating and tournament match goal difference is positive and statistically significant.


Scatter plot of uefa ratio versus match goals

with(emu,{
     plot(uefa_ratio,goal_diff,
          xlab="UCR", 
          ylab="match goal difference",
          main="Uefa coefficient ratio (UCR) 
     versus tournament match goal difference \n  EM 2012.",
     col="green4", pch=20)
     abline(lm(goal_diff~uefa_ratio), col="darksalmon")
     legend("bottomright",
            legend=paste("correlation:",round(cor(uefa_ratio, goal_diff),2)),
            bty="n")}
     )


Ordinal regression with uefa ratio

Pre-tournament uefa ratio correlated reasonably with the final tournament match goals so I fit a second ordinal regression model using the uefa ratio as a predictor. The table shows the predictions made by the model and the data used to train the model (EURO 2012 tournament matches).

# model based on 2012 uefa ratings
# define the outcome
emu$outcome <- 1
emu$outcome[emu$goal_diff > 0] <- 3
emu$outcome[emu$goal_diff < 0 ] <- 0
emu$outcome <- factor(emu$outcome,ordered = T, levels=c(0,1,3),labels=c("loss","draw","win"))

fit2 <- polr(outcome ~ uefa_ratio, Hess=T, data=emu)

cbind.data.frame(emu[,c("kotijoukkue","vierasjoukkue","outcome")], predict(fit2, type="probs"))


The 2016 matches and predictions

I averaged the predictions given by the two ordinal regression models to predict the matches of the 2016 tournament, using the ratio of qualifier shots and uefa ratings as the predicting features.

# load 2016 matches
matches2016 <- get(load("data/matches2016.Rda"))

# use mass library to fit ordinal regression models
library(MASS)

# use the models to predict match outcome probabilities for 2016
probs1 <- predict(fit1, newdata=matches2016, type="probs")
probs2 <- predict(fit2, newdata=matches2016, type="probs")

# combine the predictions with the match data

predictions2016 <- 0.5*(probs1+probs2)[,3:1]
alkulohko <- cbind(matches2016, predictions2016)

# exclude the explanatory variables 
alkulohko2016 <- alkulohko[,c("date","group","hometeam","awayteam","win","draw","loss", "homeuefa", "awayuefa", "uefa_ratio", "homeavrg_shots", "awayavrg_shots", "shot_ratio")]
alkulohko2016


Expected points per match

# calculate expected points per match
getEV <- function(p) {sum(c(3,1,0)*p)}
homePoints <- apply(predictions2016, 1, getEV)
awayPoints <- apply(predictions2016[,3:1], 1, getEV)
alkulohko2016 <- cbind(alkulohko2016[, 1:7], homePoints, awayPoints)
# rounding
alkulohko2016[,5:9] <- round(alkulohko2016[,5:9], 2)

alkulohko2016


Expected points by group

# calculate expected points by group

#  function
get_total_points <- function(data, team) {
  homedata <- data[data$hometeam==team,]
  awaydata <- data[data$awayteam==team,]
  homeEV <- sum(homedata$homePoints)
  awayEV <- sum(awaydata$awayPoints)
  return(homeEV+awayEV)
}

group <- alkulohko2016$group

#  group predictions
by(alkulohko2016,group,FUN=function(d) {
  teams <- unique(c(d$hometeam,d$awayteam))
  points <- t(sort(sapply(teams,get_total_points,data=d),decreasing = T))
  rownames(points) <- "points"
  points
})
group: A
       France Switzerland Romania Albania
points      6        5.61    3.84    1.49
-------------------------------------------------------- 
group: B
       England Russia Slovakia Wales
points    6.21   5.02     2.86  2.73
-------------------------------------------------------- 
group: C
       Germany Ukraine Poland Northern Ireland
points    6.92    3.97   3.94             2.08
-------------------------------------------------------- 
group: D
       Spain Croatia Turkey Czech Republic
points  6.11     4.1   3.42           3.11
-------------------------------------------------------- 
group: E
       Belgium Italy Sweden Republic of Ireland
points    6.17  5.28   3.35                2.06
-------------------------------------------------------- 
group: F
       Austria Portugal Hungary Iceland
points    5.76     5.48     2.9    2.69



Tuomo Nieminen 2017