Predictions for the 2016 Miss Finland competition. The predictions are the result of statistical learning, where data from past competitors and competitions were used to predict future results. The percentages give the probabilities of the 2016 contestants reaching the top 3 of the competition, based on their body measurements, eigenfaces (a mathematical approximation of their head shots) and demographic information.
This GitHub web page summarizes a statistical analysis of the Miss Finland beauty contest. The page introduces the methods used to predict the winners of the 2016 Miss Finland competition from the contestants' faces, body measurements and demographic information. The analysis builds on dimensionality reduction, binary classification, cross-validation and ensemble learning.
The repository behind this GitHub page is here, and an article describing the analysis in Finnish is here.
Miss Finland is a beauty contest held annually in Finland. In the spring of 2016, as the head of news for the tyyppiarvo.com magazine, I presented our readers with an open challenge: predict the winner of the 2016 Miss Finland competition using statistical analysis. The following people deserve thanks for enabling the competition:
On this page I introduce and explain my own submission to the competition, which ended up being the winning one.
The competition data was formed by combining the image data collected by the Tyyppiarvo staff with the data provided by Mika Sutela. The competition data included demographic information, body measurements, past competition results and numeric vectors representing greyscale pictures of the competitors' faces. The data also included the (then future) 2016 competitors.
missdata <- read.csv("Data/challenge_data.csv",
encoding="UTF-8", stringsAsFactors = F)
The data is very high dimensional because it includes a picture of each competitor's face. The pictures are greyscale 64 x 64 (= 4096) pixel images. From a computational point of view, each picture is a vector in which every entry is an integer representing the shade of grey of the corresponding pixel.
NR <- nrow(missdata)
NC <- ncol(missdata)
c(rows=NR,columns=NC)
## rows columns
## 97 4114
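Each face is thus stored as a single row of 4096 values that can be reshaped into a 64 x 64 matrix for plotting. The drawing itself is handled by the drawFace() helper in missR.R; as a rough sketch (the actual helper may differ), reshaping and plotting a face in base R could look like this:
# reshape a length-4096 face vector into a 64 x 64 image and plot it
# (illustrative sketch of what a helper like drawFace() could do)
show_face <- function(v, d = 64) {
  m <- matrix(as.numeric(v), nrow = d, ncol = d)
  image(t(m[d:1, ]), col = grey(seq(0, 1, length.out = 256)), axes = FALSE)
}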
The data includes the following features:
The goal was to predict the winners of the Miss Finland 2016 competition, based on the high dimensional competition data. My solution involves the following steps:
library(LiblineaR)  # regularized logistic regression
library(IM)         # image processing (histogram equalization)
library(dplyr)
source("missR.R")   # helper functions used throughout this analysis
# label each row with the competitor's name and year
rownames(missdata) <- paste0(missdata$name, " (", missdata$year, ")")
# the face pixels occupy columns 19 onwards
faces <- as.matrix(missdata[,19:NC, drop = F])
# visualize a randomly chosen face
randomface <- faces[sample(1:NR,1),, drop = F]
drawFace(randomface)
# visualize the average perintoprinsessa (runner-up, places 2-3; 'PP')
PPs <- subset(faces, missdata$PP==1)
colMeans(PPs) %>% drawFace
# visualize the 2016 competitors
miss2016 <- faces[missdata$year==2016,,drop = F]
drawMultipleFaces(miss2016)
Histogram equalization can help to nullify the effect of different lighting and contrast in the pictures. We could also binarize the images by choosing a threshold and simply coding each pixel as “black” or “white”. What would that look like?
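For intuition, histogram equalization maps each pixel value through the empirical cumulative distribution of the image's pixel values, spreading the intensities over the full greyscale range. The code below uses IM::histeq for this; a bare-bones sketch of the idea:
# histogram equalization in a nutshell: map pixels through the empirical CDF
# (for intuition only; the actual analysis uses IM::histeq)
hist_eq <- function(v, levels = 256) {
  cdf <- rank(v, ties.method = "max") / length(v)  # empirical CDF of pixel values
  round(cdf * (levels - 1))
}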
# perform histogram equalization on all images
eq_faces <- apply(faces,1, IM::histeq)
eq_faces <- t(round(eq_faces))
drawSample(eq_faces)
# further simplify the pictures by binarizing them
co <- 130  # threshold between "black" and "white"
bi_faces <- eq_faces
bi_faces[bi_faces <= co] <- 0
bi_faces[co < bi_faces] <- 256
drawSample(bi_faces)
That’s pretty artsy.
colMeans(eq_faces) %>% drawFace
colMeans(bi_faces) %>% drawFace
Right now, there are over 4000 variables related to each competitor. Next, we’ll reduce the dimensionality of the data with Principal Component Analysis (PCA).
eq_pca <- prcomp(eq_faces)
PC <- eq_pca$rotation
s <- summary(eq_pca)
df <- data.frame(t(s$importance))
df$PC <- rownames(df)
df
A summary of the principal component analysis performed with the prcomp() function. The summary shows that 86 principal components are needed to capture 99% of the variance in the original data. 86 is a lot, but it is far less than 4096!
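The cutoff can also be read off programmatically from the cumulative variance proportions stored in the summary object:
# index of the first component where the cumulative proportion of variance reaches 99%
which(s$importance["Cumulative Proportion", ] >= 0.99)[1]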
compare_faces(eq_faces, PC, which = 1:86, n = 2)
compare_faces(eq_faces, PC, which = 1:50, n = 2)
compare_faces(eq_faces, PC, which = 1:7, n = 2)
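compare_faces() comes from missR.R and compares each original face to a version reconstructed from only the chosen components. Assuming it follows the usual PCA reconstruction, the core step looks roughly like this:
# project a (1 x 4096) face onto the first k components and map it back
# to pixel space (sketch of the reconstruction compare_faces() presumably uses)
reconstruct <- function(face, pca, k) {
  centered <- face - pca$center               # prcomp() centers the data by default
  scores <- centered %*% pca$rotation[, 1:k]  # coordinates in eigenface space
  scores %*% t(pca$rotation[, 1:k]) + pca$center
}
# drawFace(reconstruct(eq_faces[1, , drop = FALSE], eq_pca, 86))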
After PCA the data still had more features than observations, which usually results in overfitting the training data. I used L1 regularization to further reduce the number of predictors, and logistic regression to make the predictions.
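As a toy illustration of why this helps: an L1 penalty drives many weights to exactly zero, which amounts to feature selection. The data below are made up purely for the demonstration; in LiblineaR, type = 6 selects L1-regularized logistic regression.
# toy demonstration: L1-regularized logistic regression zeroes out most weights
# (made-up data; not part of the actual analysis)
set.seed(1)
X <- matrix(rnorm(80 * 100), 80, 100)  # more features than observations
y <- as.numeric(X[, 1] - X[, 2] + rnorm(80) > 0)
toy <- LiblineaR(data = X, target = y, type = 6, cost = 0.5)
sum(abs(toy$W) > 1e-8)                 # number of non-zero weights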
data <- cbind(missdata[,1:18], eq_faces)
train_data <- subset(data, year < 2016)
target_data <- subset(data, year == 2016)
PC <- eq_pca$rotation
face_dimensions <- 1:86
The strength of the regularization can be tuned, and it might not be optimal to include most of the principal components. I used cross-validation to find the best-performing regularization parameters and the best number of principal components to use as predictors. During each round of cross-validation, a single competition year was set aside as the test set. The model was then fit on the rest of the data and its predictive power measured on the held-out year.
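The grid search itself is implemented in cross_validate_grid() in missR.R. A condensed sketch of the leave-one-year-out logic is below; fit_and_score() is a hypothetical stand-in for fitting the model on the training years and returning the share of correct predictions on the held-out year:
# sketch of the leave-one-year-out grid search (the real implementation is
# cross_validate_grid() in missR.R; fit_and_score() is a hypothetical helper)
cv_grid <- function(data, PC, C, FD, target = "Kolme") {
  years <- unique(data$year)
  acc <- matrix(NA, nrow = length(FD), ncol = length(C), dimnames = list(FD, C))
  for (i in seq_along(FD)) for (j in seq_along(C)) {
    scores <- sapply(years, function(yr) {
      train <- subset(data, year != yr)  # fit on all other years
      test  <- subset(data, year == yr)  # evaluate on the held-out year
      fit_and_score(train, test, PC, cost = C[j], facedim = FD[i], target)
    })
    acc[i, j] <- mean(scores)            # average accuracy across held-out years
  }
  acc
}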
# parameter grid for cross validation
costs <- c(0.01,0.1,1,10,1e2,1e4,1e7)
# returns a matrix containing the average percentage of correct predictions
# target is either "Miss" for winner predictions, or "Kolme" for top3 predictions
# winner predictions might not be very reliable since there is so little data
results <- cross_validate_grid(data = train_data, PC = PC,
C = costs, FD = face_dimensions, target = "Kolme")
best_predictors <- which(results > quantile(results, probs=0.8), arr.ind = T)
chosen_costs <- costs[best_predictors[,2]]
chosen_facedims <- face_dimensions[best_predictors[,1]]
data.frame(cost = chosen_costs, facedims = chosen_facedims, accuracy = results[best_predictors])
The regularization costs and eigenface dimensionalities (facedims) of the top 20% of logistic regression models, as measured by their accuracy in the cross-validation rounds.
I created an ensemble from the top 20% of models. The distributions of the weights associated with each feature in the data are given below.
res <- fit_l1_logreg(tr_data = train_data, PC = PC, costs = chosen_costs,
facedims = chosen_facedims, target_data = target_data,
target = "Kolme")
w_summaries <- do.call(cbind, lapply(data.frame(res$weights), summary)) %>% t %>% data.frame
w_summaries$feature <- rownames(w_summaries)
w_summaries
The distribution of weights for each feature in the data, learned using the top 20% of L1-regularized logistic regression models. The negative weight of the AsuinP feature tells us that the winners more often come from the southern part of Finland than from the middle or northern parts. The competitors who do well are not usually from Helsinki, though; you might be better off hailing from Turku instead. All of the 'curviness' features vyotaro (waist), rinta (chest) and lantio (hips), as well as pituus (height), have small positive weights associated with them on average.
To get the final predictions, I averaged the predictions of the different models to get a single probability value for each competitor.
top3prob <- rowMeans(res$probs)
names(top3prob) <- target_data$name
P <- data.frame(top3prob)
P[order(-P),, drop = F]
Predictions for the 2016 Miss Finland competition as probabilities of reaching the top 3. The statistical model predicts Shirly Karvinen and Emilia Seppänen as the favourites.
pred_labels <- paste0(target_data$name," ", 100*round(top3prob,2),"%")
faces2016 <- target_data[,19:ncol(target_data)]
drawMultipleFaces(faces2016, titles = pred_labels, cex = 1.2)
Two of my model's predictions turned out to be correct: Shirly Karvinen won the competition and Emilia Seppänen reached the top 3. The actual top 3 also included Heta Sallinen in second place.