University of Helsinki, spring 2017
Powered by R Presentation. The code for this presentation is here
Simple regression
Multiple regression
A statistical model:
Linear regression is an approach for modeling the relationship between a dependent variable \( \boldsymbol{y} \) and one or more explanatory variables \( \boldsymbol{X} \).
There are many applications for linear models such as
In a simple case, the model includes one explanatory variable \( \boldsymbol{x} \)
\( \boldsymbol{y} = \alpha + \beta \boldsymbol{x} + \boldsymbol{\epsilon} \)
R:
lm(y ~ x)
The model can also include more than one explanatory variable
\[ \boldsymbol{y} = \alpha + \beta_1 \boldsymbol{x}_1 + \beta_2 \boldsymbol{x}_2 + \boldsymbol{\epsilon} \]
R:
lm(y ~ x1 + x2)
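As a quick sketch, such a model can be fitted on R's built-in mtcars data (the variables mpg, wt and hp come from that data set, not from these slides):
R:
# fuel consumption explained by weight and horsepower
my_fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(my_fit)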
In linear regression, it is assumed that the relationship between the target variable \( \boldsymbol{y} \) and the parameters (\( \alpha \), \( \boldsymbol{\beta} \)) is linear:
\[ \boldsymbol{y} = \boldsymbol{\alpha} + \boldsymbol{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \]
In the simple linear equation \( \boldsymbol{y} = \alpha + \beta \boldsymbol{x} + \boldsymbol{\epsilon} \)
The best model is found by minimizing the prediction errors that the model would make
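In ordinary least squares, this means choosing the parameters that minimize the sum of squared residuals; for the simple model:
\[ (\hat{\alpha}, \hat{\beta}) = \underset{\alpha, \beta}{\arg\min} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2 \]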
When the model is \[ \boldsymbol{y} = \alpha + \beta_1 \boldsymbol{x}_1 + \beta_2 \boldsymbol{x}_2 + \boldsymbol{\epsilon} \]
For a quick rundown of interpreting R's regression summary, see the 'Calling summary' section of this blog post or read about coefficients and p-values here
Call:
lm(formula = Y ~ some_variable)
Residuals:
Min 1Q Median 3Q Max
-5.2528 -1.8261 -0.1636 1.5288 5.8723
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.04364 0.49417 -0.088 0.93026
some_variable 1.81379 0.58925 3.078 0.00463 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.643 on 28 degrees of freedom
Multiple R-squared: 0.2528, Adjusted R-squared: 0.2262
F-statistic: 9.475 on 1 and 28 DF, p-value: 0.004626
The linearity assumption isn't as restrictive as one might imagine.
It is possible to add polynomial terms to the model if the effect of a variable is non-linear
\[ \boldsymbol{y} = \alpha + \beta_1 \cdot \boldsymbol{x} + \beta_2 \cdot \boldsymbol{x}^2 + \boldsymbol{\epsilon} \]
R:
lm(y ~ x + I(x^2))
A statistical model always includes several assumptions which describe the data generating process.
Analyzing the residuals of the model provides a way to explore the validity of the model assumptions. Several key assumptions are encoded in the expression
\[ \boldsymbol{\epsilon} \sim N(0, \sigma^2) \]
A QQ plot of the residuals provides a way to explore the assumption that the errors of the model are normally distributed
The constant variance assumption implies that the size of the errors should not depend on the explanatory variables.
This can be explored with a simple scatter plot of residuals versus model predictions.
Any pattern in the scatter plot implies a problem with the assumptions
Leverage measures how much impact a single observation has on the model.
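In R, these diagnostic plots can be drawn directly from a fitted model object. A minimal sketch (my_model, y, x1, x2 and the data frame d are placeholder names):
R:
my_model <- lm(y ~ x1 + x2, data = d)
# 1: Residuals vs Fitted, 2: Normal Q-Q, 5: Residuals vs Leverage
plot(my_model, which = c(1, 2, 5))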
Odds and probability
Predicting binary outcomes
In regression analysis, the target variable \( \boldsymbol{Y} \) is modelled as a linear combination of the model parameters and the explanatory variables \( \boldsymbol{X} \)
\[ \boldsymbol{Y} = \boldsymbol{\alpha} + \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \]
Another way to express this is to use conditional expectation
\[ E[\boldsymbol{Y} \mid \boldsymbol{X}] = \boldsymbol{\alpha} + \boldsymbol{X}\boldsymbol{\beta} \]
So, linear regression is a model for the (conditional) expected value of Y.
If the target variable \( Y \) is binary, taking only the values 0 and 1, with the value 1 occurring with probability \( p \), then \( E[Y] = p \).
The ratio of expected “successes” to “failures” is called the odds:
\[ \frac{p}{1-p} \]
Odds of 2 to 1: success is twice as likely as failure, which corresponds to p = 2/3.
To transform \( p \) onto a completely unrestricted scale, we can take the logarithm of the odds:
\[ \log\left( \frac{p}{1 - p} \right) \in (-\infty, \infty) \]
In a simple case, the logistic regression model for the expected value \( p \) of a binary variable \( Y \) is:
\[ \log\left( \frac{p}{1 - p} \right) = \alpha + \beta \boldsymbol{x} \]
which implies
\[ P(Y_i = 1) = \frac{1}{1 + e^{-\alpha - \beta \cdot x_i}} = p_i \]
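In R, the model is fitted with glm() by specifying the binomial family. A minimal sketch, assuming a data frame d with a binary target y and an explanatory variable x:
R:
m <- glm(y ~ x, family = "binomial", data = d)
summary(m)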
The ratio of two odds is called the odds ratio. It can be computed by the following steps:
The odds ratio can be used to quantify the relationship between \( X \) and \( Y \). An odds ratio higher than 1 means that \( X \) is positively associated with “success”.
toy data:
| Y | X | X_ | total |
|---|---|---|---|
| 1 | 10 | 16 | 26 |
| 0 | 15 | 4 | 19 |
| total | 25 | 20 | 45 |
The following conditional probabilities can be calculated from the (toy) data:
Odds is the ratio of successes to failures:
Odds ratio (OR) is the ratio of the two odds:
\[ OR = \frac{Odds(Y \mid X)}{Odds(Y \mid X\_)} = \frac{2/3}{4} = \frac{1}{6} \]
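The same computation can be reproduced in R from the counts of the toy data (a sketch; the object names are made up):
R:
tab <- matrix(c(10, 15, 16, 4), nrow = 2,
              dimnames = list(Y = c("1", "0"), group = c("X", "X_")))
odds_X  <- tab["1", "X"]  / tab["0", "X"]    # 10/15 = 2/3
odds_X_ <- tab["1", "X_"] / tab["0", "X_"]   # 16/4  = 4
odds_X / odds_X_                             # OR = 1/6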
Because the quantity modelled in logistic regression is the log of the odds, applying the exponential function to the fitted values gives the odds:
\[ \exp \left( \log\left( \frac{\hat{p}}{1 - \hat{p}} \right) \right) = \frac{\hat{p}}{1 - \hat{p}}. \]
The exponent of a coefficient can be interpreted as the odds ratio between a unit increase (vs. no change) in the corresponding explanatory variable:
\[ \exp(\hat{\beta}_x) = \frac{\text{Odds}(Y \mid x + 1)}{\text{Odds}(Y \mid x)} \]
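In R, this means exponentiating the coefficients of a fitted logistic regression model. A sketch, using the model m from the earlier glm() example:
R:
exp(coef(m))     # exponentiated coefficients = odds ratios
exp(confint(m))  # profile-likelihood confidence intervals on the odds ratio scale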
(Image: cross-validation, source: https://prateekvjoshi.files.wordpress.com/2013/06/cross-validation.png)
A statistical model can be used to make predictions. An intuitive way of comparing different models is to test their predictive power on unseen data.
In order to assess model performance, we need to measure it somehow.
Cross-validation is a powerful general technique for assessing how the results of a statistical analysis will generalize to an independent data set.
One round of cross-validation involves
This process is repeated so that eventually all of the data is used for both training and testing.
Below is an example of 5-fold cross-validation
The data is divided into subsets K times and eventually all the data is used for both training and testing.
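A sketch of k-fold cross-validation in R with cv.glm() from the boot package, assuming the logistic model m and data frame d from the earlier example:
R:
library(boot)
# cost function: proportion of misclassified observations at a 0.5 threshold
loss <- function(obs, pred) mean(abs(obs - pred) > 0.5)
cv <- cv.glm(data = d, cost = loss, glmfit = m, K = 5)
cv$delta[1]  # cross-validated estimate of the prediction error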
Classification:
Clustering:
Linear discriminant analysis (LDA) is a classification method. It can be used to model a binary target variable, as in logistic regression, or a target with multiple classes. The target variable needs to be categorical.
It can be used to
This is a good and simple blog post about LDA. R-Bloggers also has a post about LDA; see it here.
Linear discriminant analysis produces results based on the assumptions that
Because of the assumptions, the data might need scaling before fitting the model. The variables also need to be continuous.
Let's see an example next to wrap our heads around what LDA is really doing.
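As a sketch, output like that below could be produced along these lines (the iris measurements are standardized first, in line with the assumptions above):
R:
library(MASS)  # for lda()
d <- data.frame(scale(iris[, 1:4]), Species = iris$Species)
fit <- lda(Species ~ ., data = d)
fit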
Call:
lda(Species ~ ., data = d)
Prior probabilities of groups:
setosa versicolor virginica
0.3333333 0.3333333 0.3333333
Group means:
Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa -1.0111914 0.8504137 -1.3006301 -1.2507035
versicolor 0.1119073 -0.6592236 0.2843712 0.1661774
virginica 0.8992841 -0.1911901 1.0162589 1.0845261
Coefficients of linear discriminants:
LD1 LD2
Sepal.Length 0.6867795 0.01995817
Sepal.Width 0.6688251 0.94344183
Petal.Length -3.8857950 -1.64511887
Petal.Width -2.1422387 2.16413593
Proportion of trace:
LD1 LD2
0.9912 0.0088
Classifying new observations:
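A minimal sketch with predict(), assuming new_obs holds new observations with the same (similarly scaled) variables as the training data:
R:
pred <- predict(fit, newdata = new_obs)
pred$class      # predicted classes
pred$posterior  # posterior class probabilities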
How can we determine whether observations are similar or dissimilar to each other?
Repeat the update steps until the centroids (and hence the clusters) no longer change
Source: This R-Bloggers Post
Remarks about k-means:
The total within sum of squares is calculated by adding together the within-cluster sum of squares (WCSS) of every cluster. The WCSS of a single cluster can be calculated with the formula
\( WCSS = \sum_{i=1}^{N} (x_i - \text{centroid})^2 \)
So you are searching for the number of clusters where the observations are closest to their cluster centers.
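A sketch of this search on the scaled iris measurements, using the tot.withinss component returned by kmeans():
R:
x <- scale(iris[, 1:4])
twcss <- sapply(1:10, function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss)
plot(1:10, twcss, type = "b", xlab = "number of clusters", ylab = "total WCSS")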
For IODS by Tuomo Nieminen & Emma Kämäräinen
Powered by R Presentation. The code for this presentation is here
From high…
… to lower dimensionality
In statistical analysis, one can think of dimensionality as the number of variables (features) related to each observation in the data.
The original variables of high dimensional data might contain “too much” information (as well as noise or other random error) for representing the underlying phenomenon of interest.
On the linear algebra level, Singular Value Decomposition (SVD) is the most important tool for reducing the number of dimensions in multivariate data.
In PCA, the data is first transformed to a new space with an equal or smaller number of dimensions (new features). These new features are called the principal components. They always have the following properties:
Given the properties of the principal components, we can simply choose the first few principal components to represent our data.
This will give us uncorrelated variables which capture the maximum amount of variation in the data!
The dimensionality of iris reduced to two principal components (PC). The first PC captures more than 70% of the total variance in the 4 original variables.
Unlike LDA, PCA has no criteria or target variable. PCA may therefore be called an unsupervised method.
PCA is a mathematical tool, not a statistical model, which is why linear algebra (SVD) is enough.
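A sketch with prcomp(), which computes the PCA via SVD; setting scale. = TRUE standardizes the variables first:
R:
pca <- prcomp(iris[, 1:4], scale. = TRUE)
summary(pca)            # proportion of variance captured by each PC
biplot(pca, cex = 0.8)  # biplot of the first two principal components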
Correlations of iris
Sep.Len Sep.Wid Pet.Len Pet.Wid
Sep.Len 1.00 -0.12 0.87 0.82
Sep.Wid -0.12 1.00 -0.43 -0.37
Pet.Len 0.87 -0.43 1.00 0.96
Pet.Wid 0.82 -0.37 0.96 1.00
The correlations (and more) can be interpreted from the biplot on the left, but how?
A biplot is a way of visualizing two representations of the same data. The biplot displays:
(1) The observations in a lower (2-)dimensional representation
(2) The original features and their relationships with both each other and the principal components
In a biplot, the following connections hold:
Biplots can be used to visualize the results of dimensionality reduction methods such as LDA, PCA, Correspondence Analysis (CA) and Multiple CA.
There are also several other variations of the CA methods
Next, let's look at how the MCA output looks in R!
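The output below could be produced roughly like this (a sketch, assuming a data frame named data whose columns are all factors):
R:
library(FactoMineR)  # for MCA()
mca <- MCA(X = data, graph = FALSE, method = "indicator")
summary(mca, nbelements = 3)  # print only the first 3 individuals and categories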
Output of MCA summary contains…
Call:
MCA(X = data, graph = FALSE, method = "indicator")
Eigenvalues
Dim.1 Dim.2 Dim.3 Dim.4
Variance 0.413 0.334 0.330 0.256
% of var. 30.992 25.053 24.743 19.212
Cumulative % of var. 30.992 56.045 80.788 100.000
Individuals (the 3 first)
Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
1 | 0.283 0.065 0.056 | -0.283 0.080 0.056 | -0.534 0.288
2 | 1.055 0.898 0.802 | -0.431 0.185 0.134 | -0.209 0.044
3 | 0.138 0.015 0.035 | -0.048 0.002 0.004 | -0.103 0.011
cos2
1 0.199 |
2 0.032 |
3 0.019 |
Categories (the 3 first)
Dim.1 ctr cos2 v.test Dim.2 ctr cos2 v.test
Label1 | 1.199 28.599 0.471 11.862 | -0.786 15.197 0.202 -7.775
Label2 | -0.569 16.801 0.584 -13.214 | -0.121 0.935 0.026 -2.803
Label3 | 0.639 3.627 0.051 3.887 | 2.468 66.852 0.753 15.002
Dim.3 ctr cos2 v.test
Label1 | -0.274 1.875 0.025 -2.714 |
Label2 | -0.092 0.553 0.015 -2.142 |
Label3 | 1.154 14.813 0.165 7.018 |
Categorical variables (eta2)
Dim.1 Dim.2 Dim.3
Var1 | 0.608 0.832 0.171 |
Var2 | 0.079 0.154 0.741 |
Var3 | 0.553 0.017 0.078 |
Visualizing MCA: the plot.MCA() function in R (from FactoMineR) has a lot of options for plotting.
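For example (a sketch of two commonly used arguments, applied to the mca object from the summary above):
R:
plot(mca, invisible = c("ind"), habillage = "quali")  # only the variable categories, colored by variable
plot(mca, invisible = c("var"))                       # only the individuals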