Saskia A. Otto
Postdoctoral Researcher
source: R for Data Science by Wickham & Grolemund, 2017 (licensed under CC-BY-NC-ND 3.0 US)
a measure that describes the entire population (e.g. $\mu, \sigma^{2}$)
Parameters are rarely known and are usually estimated by statistics describing a sample (e.g. $\bar{x}, s^2$)
median(x)
or quantile(x, probs = 0.5)
mean()
population mean: \(\mu\); sample mean: \(\bar{x} = \frac{1}{n} \sum_{i=1}^{n}x_{i}\)
weighted.mean(x, w, ...)
The arithmetic mean can be a good measure for roughly symmetric distributions but can be misleading in skewed distributions since it can be greatly influenced by extreme scores.
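A small sketch with made-up numbers: a single extreme score pulls the mean upwards while the median barely moves:
x <- c(2, 3, 3, 4, 5, 40)   # hypothetical skewed sample with one extreme score
mean(x)                      # 9.5 - strongly influenced by the extreme score
median(x)                    # 3.5 - robust against it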
range()
var()
sd()
sd(x)/sqrt(length(x))
sd(x)/mean(x)
The variance is the 'Mean Sum of Squared' (MSS) distances of individual measurements from the overall mean: \[s^{2} = \frac{\sum\limits_{i=1}^{n} \left(x_{i} - \bar{x}\right)^{2}}{n-1}\]
Example: a sample has 5 numbers and a mean of 4 → the sum has to be 20. If the first four numbers are 1, 1, 4, and 10, the last number has to be 4: so we have 4 df for 5 numbers.
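You can check in R that var() indeed divides by the degrees of freedom (n - 1), reusing the five numbers from the example:
x <- c(1, 1, 4, 10, 4)                  # five numbers with mean 4
var(x)                                   # 13.5
sum((x - mean(x))^2) / (length(x) - 1)   # same result: SS divided by n - 1 = 4 df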
Here is a simulation demonstrating how the spread of variance estimates changes with sample size:
# Empty plot frame: sample size on x, variance estimates on y
plot(c(0, 32), c(0, 15), type = "n",
  xlab = "Sample size",
  ylab = "Variance")
# For each sample size, draw 30 random samples and plot their variances
for (df in seq(3, 31, 2)) {
  for (i in 1:30) {
    x <- rnorm(df, mean = 10, sd = 2)
    points(df, var(x))
  }
}
Press 'p' in the html document for more info.
The standard deviation is just the square root of the variance \[s = \sqrt{\frac{\sum\limits_{i=1}^{n} \left(x_{i} - \bar{x}\right)^{2}} {n-1}}\] The standard error normalizes the SD by the square root of the sample size n \[SE = \frac{s}{\sqrt{n}}\]
Both measures are typically used for error bars in graphs:
source: Sizing ocean giants: patterns of intraspecific size variation in marine megafauna by McClain et al., DOI: 10.7717/peerj.715
1-dimensional (left panel): Point A might be an outlier of variable y, but not of x; C is an outlier of x but not y, and B is an outlier of both x and y separately
2-dimensional space xy (left panel): A and C could be outliers (both are far away from the regression line) but B is not an outlier anymore, since it lies on the regression line. Right panel: Point A could be an outlier in the x-space but is definitely one in the xy-space
sources: reality tree from www.pixabay.com (CC0 1.0); stick man graph on the right from J. Nielsen (master thesis "Conversion of Graphs to Polygonal Meshes", Technical University Copenhagen) and www.tchami.com
\[Y = \text{constant} + \text{coefficient}_{1} \cdot X_{1} + \text{coefficient}_{2} \cdot X_{2} + \ldots + \text{error}\]
Y: response variable
\(X_{1}, X_{2}\): predictor variables
constant: mean or intercept of Y (value when all X's equal zero)
coefficient: regression slopes or treatment effects
error: part of Y not explained or predicted by X's
predictor variables and the parameters relating them to the response variable are included in the model as a linear combination (nonlinear terms are also possible)
the term "linear" refers to the combination of parameters, not the shape of the distribution: a linear combination of a series of parameters (regression slope, intercept) in which no parameter appears as an exponent or is divided or multiplied by another parameter
\[Y_{i} = \alpha + \beta_{1} X_{i} + \beta_{2} X_{i}^{2} + \epsilon_{i}\] \[Y_{i} = \alpha + \beta_{1} (X_{i} \cdot W_{i}) + \epsilon_{i}\] \[Y_{i} = \alpha + \beta_{1} \log(X_{i}) + \epsilon_{i}\] \[Y_{i} = \alpha + \beta_{1} \exp(X_{i}) + \epsilon_{i}\] \[Y_{i} = \alpha + \beta_{1} \sin(X_{i}) + \epsilon_{i}\]
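All of these stay linear in the parameters and can be fitted with lm(); a quick sketch using hypothetical variables x, w and y:
d <- data.frame(x = runif(50, 1, 10), w = runif(50))  # made-up predictors
d$y <- 2 + 0.5 * d$x + rnorm(50)                      # made-up response
lm(y ~ x + I(x^2), data = d)  # polynomial term
lm(y ~ I(x * w), data = d)    # product of two variables as one term
lm(y ~ log(x), data = d)      # log-transformed predictor
lm(y ~ exp(x), data = d)      # exponential transformation
lm(y ~ sin(x), data = d)      # periodic transformation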
lm(formula, data)
coef(model)
summary(model)
Iris petal length relationship
Apply a linear regression model to the iris dataset and model the petal width as a function of the petal length (Petal.Width ~ Petal.Length).
Look at the summary of your model: what would be the corresponding linear regression equation?
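One possible solution sketch (iris ships with base R, so this runs as is):
iris_mod <- lm(Petal.Width ~ Petal.Length, data = iris)
summary(iris_mod)   # coefficients give: Petal.Width = a + b * Petal.Length
coef(iris_mod)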
Most common way of writing the equation for the population: \[Y_{i} = \alpha + \beta X_{i} + \epsilon_{i}\]
Most common way of writing the equation for the sample (observed values): \[y_{i} = a + b x_{i} + e_{i}\]
sample intercept \(a\) estimates \(\alpha\) → expected value of y when x is 0
sample regression slope \(b\) estimates \(\beta\) → the expected increase in y associated with a one-unit increase in x
Most common way of writing the equation for the fitted values: \[\hat{y}_{i} = a + b x_{i}\]
So we need to be specific whether we specify the parameters for the population or the sample, and whether for observed or fitted y values.
The population parameters \(\alpha\), \(\beta\) and \(\sigma^{2}\) (population variance) are estimated by \(a\), \(b\) and \(s^{2}\) (sample variance).
Underlying mathematical tool: the (ordinary) least squares method, which chooses the estimates a and b such that the sum of squared residuals is minimal.
(taken from the ICES hydro dataset, station 0076, 2015-02-01)
mod <- lm(sal ~ pres, data = sal_profile)
coef(mod) # same as: coefficients(mod)
## (Intercept) pres
## 5.57662340 0.01282616
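For a simple linear regression the least squares estimates have a closed form; a minimal sketch (standard textbook formulas, added here for illustration) to verify coef(mod):
b_hat <- with(sal_profile, cov(pres, sal) / var(pres))      # slope: S_xy / S_xx
a_hat <- with(sal_profile, mean(sal) - b_hat * mean(pres))  # intercept
c(a_hat, b_hat)                                             # reproduces coef(mod)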
Use geom_smooth() and set the method argument to the name of the modelling function (in this case "lm"). If you don't want to show the standard error additionally, set se = FALSE:
p <- ggplot(sal_profile,
            aes(x = pres, y = sal)) +
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE)
p
You can also compute the predicted values for your model using
predict(model)
from the stats package → simply returns a vector, which you can add to your data using the dplyr::mutate() function
add_predictions(data, model, var = "pred") from the modelr package (in the tidyverse) → adds the variable 'pred' containing the predicted values to your data frame; useful when piping operations!
modelr::add_predictions(sal_profile, mod) %>%
  ggplot(aes(x = pres)) +
  geom_point(aes(y = sal)) +
  geom_line(aes(y = pred))
Predictions are best visualised from an evenly spaced grid of X values that covers the region where your data lies
Provide the new X values via the newdata argument in a data frame that includes an even X sequence:
predict(model, newdata = data.frame(x = seq(min(x), max(x), 0.1)))
Use modelr::data_grid(data, X1, X2) to generate this grid. If you have more than one X, it finds the unique values and then generates all combinations. Then apply the add_predictions(data = grid, model) function:
library(modelr)
grid <- sal_profile %>%
  data_grid(pres = seq_range(pres, 20)) # 20 evenly spaced values between min and max
add_predictions(grid, mod) %>%
  ggplot(aes(x = pres)) + geom_line(aes(y = pred))
You can extract the model parameters with coef() or summary() (as a help: look at the names or the structure to see what you could extract):
a <- round(coef(mod)[1], 3)
b <- round(coef(mod)[2], 3)
mod_summ <- summary(mod)
names(mod_summ)
## [1] "call" "terms" "residuals" "coefficients"
## [5] "aliased" "sigma" "df" "r.squared"
## [9] "adj.r.squared" "fstatistic" "cov.unscaled"
r_sq <- round(mod_summ$r.squared, 3)
annotate("text", x, y, label)
function. paste()
function to concatenate text and values stored in objects:p +
annotate("text", x = 20, y = 6.2,
label = paste("Sal =",a ,
" + ",b , "x Depth")) +
annotate("text", x = 15, y = 6.1,
label = paste("R-sq =", r_sq))
Iris petal length relationship
Visualize your Petal.Width ~ Petal.Length model using geom_smooth(), or use predict() or modelr::add_predictions() to add the regression line manually.
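A possible solution sketch for the manual approach (assuming the tidyverse is loaded; iris_mod is the model from the previous exercise):
library(tidyverse)  # for %>% and ggplot2
library(modelr)
iris_mod <- lm(Petal.Width ~ Petal.Length, data = iris)
add_predictions(iris, iris_mod) %>%
  ggplot(aes(x = Petal.Length)) +
  geom_point(aes(y = Petal.Width)) +
  geom_line(aes(y = pred))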
y ~ x
You can include an intercept explicitly by adding a 1 to the formula term. If you want to exclude the intercept, add a 0 or subtract a 1:
# includes intercept
mod1 <- lm(sal ~ pres, data = sal_profile)
mod1 <- lm(sal ~ 1 + pres, data = sal_profile)
# excludes intercept
mod2 <- lm(sal ~ 0 + pres, data = sal_profile)
mod2 <- lm(sal ~ pres - 1, data = sal_profile)
mod1
##
## Call:
## lm(formula = sal ~ 1 + pres, data = sal_profile)
##
## Coefficients:
## (Intercept) pres
## 5.57662 0.01283
mod2
##
## Call:
## lm(formula = sal ~ pres - 1, data = sal_profile)
##
## Coefficients:
## pres
## 0.1644
Not including an intercept forces the regression line through the origin of the coordinate system.
Iris petal length relationship
Compare your Petal.Width ~ Petal.Length model including an intercept with the same model without an intercept.
Simply add both regression lines to the same plot. Or, if you want to show both plots next to each other, save each plot under, e.g., p1 and p2 and use the following line of code:
gridExtra::grid.arrange(p1, p2, nrow = 1)
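One way to sketch the side-by-side comparison, using geom_smooth()'s formula argument for the no-intercept fit (gridExtra must be installed):
library(ggplot2)
p1 <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)      # with intercept
p2 <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x - 1, se = FALSE)  # without intercept
gridExtra::grid.arrange(p1, p2, nrow = 1)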
Total variation in the response variable (\(SS_{Y}\) or \(SS_{Total}\)) can be partitioned into two components: the variation explained by the model (\(SS_{Regression}\)) and the unexplained residual variation (\(SS_{Residual}\)).
R² automatically increases with an increasing number of covariates, even if they do not contribute to the explained variance. The adjusted R² (= corrected coefficient of determination) accounts for the number of covariates.
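For reference, the standard definitions behind these statements (added here; \(p\) = number of predictors):
\[R^{2} = \frac{SS_{Regression}}{SS_{Total}} = 1 - \frac{SS_{Residual}}{SS_{Total}}\]
\[R^{2}_{adj} = 1 - \frac{SS_{Residual}/(n - p - 1)}{SS_{Total}/(n - 1)}\]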
Iris petal length relationship
How much variation in Petal.Width is explained by Petal.Length in your model? Do you consider it a good model?
\[MS_{Regression} = \frac{SS_{Regression}}{df_{Regression}}= \frac{\sum(\hat{y}_{i} - \bar{y})^{2}}{p}\] (with \(p\) = number of predictors; \(p = 1\) in a simple linear regression)
\[MS_{Residual} = \frac{SS_{Residual}}{df_{Residual}} = \frac{\sum(y_{i} - \hat{y})^{2}}{n-2}\]
\[MS_{Total} = \frac{SS_{Total}}{df_{Total}} = \frac{\sum(y_{i} - \bar{y})^{2}}{n-1}\]
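The F-statistic reported by summary() is the ratio of these mean squares (a standard identity, added for completeness):
\[F = \frac{MS_{Regression}}{MS_{Residual}}\]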
The uncertainty of the estimated slope and intercept is expressed by their standard errors: it decreases with increasing sample size and increases with the residual variance. The uncertainty of the intercept estimate additionally increases with the distance of the observed x values from zero.
summary(mod)
##
## Call:
## lm(formula = sal ~ pres, data = sal_profile)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.034885 -0.026902 -0.002162 0.013069 0.060330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.5766234 0.0145514 383.24 < 2e-16 ***
## pres 0.0128262 0.0004551 28.18 4.58e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03057 on 16 degrees of freedom
## Multiple R-squared: 0.9803, Adjusted R-squared: 0.979
## F-statistic: 794.2 on 1 and 16 DF, p-value: 4.583e-15
Iris petal length relationship
Look at the standard errors of your model.
Visualize the standard error: simply set the se argument in geom_smooth() to TRUE.
Imagine each data point represents an individual fish and all data points together represent the entire population. If you take a random sample of 10 fish, how similar would the estimated coefficients be?
Guess: Which ones are more similar and which ones might be closer to the true population parameters?
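The model list used below (mod_list) could be built with a sketch like this; the simulated 'population', the true parameters and the sample sizes are assumptions for illustration:
set.seed(42)                                      # hypothetical seed
pop <- data.frame(x = runif(1000, 0, 30))         # simulated population
pop$y <- 10 + 0.5 * pop$x + rnorm(1000, sd = 8)   # assumed true intercept and slope
n_vec <- c(10, 10, 10, 10, 100, 100, 100, 100)    # four samples each of n = 10 and n = 100
mod_list <- lapply(n_vec, function(n) {
  lm(y ~ x, data = pop[sample(nrow(pop), n), ])   # fit lm() to each random sample
})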
2 approaches for reasoning about uncertain model parameters: hypothesis tests (p-values) and confidence intervals
If a p-value is very low (< 0.05), it suggests that either the null hypothesis (no relationship, \(\beta = 0\)) is false or our sample was highly unusual.
Knowing probabilities also lets us calculate confidence intervals for \(\beta\)
# Models with n = 10
confint(mod_list[[1]], level = 0.95)
## 2.5 % 97.5 %
## (Intercept) -9.5084037 25.537982
## x -0.5626634 1.686557
confint(mod_list[[2]], level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 11.2339388 23.6086437
## x -0.3394683 0.4677254
# Models with n = 100
confint(mod_list[[5]], level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 4.7328994 11.0508714
## x 0.4490422 0.8620645
confint(mod_list[[7]], level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 9.0867047 15.4839002
## x 0.1259973 0.5433739
##
## Call:
## lm(formula = len ~ age, data = cc_fn1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.452 -12.315 -2.452 13.003 55.094
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.997 2.058 19.92 <2e-16 ***
## age 35.454 1.408 25.18 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.84 on 216 degrees of freedom
## Multiple R-squared: 0.746, Adjusted R-squared: 0.7448
## F-statistic: 634.3 on 1 and 216 DF, p-value: < 2.2e-16
This exercise is based on Introductory Fisheries Analyses with R (IFAR) by D. Ogle, 2016.
descriptive stats: median(), quantile(), mean(), weighted.mean(), range(), var(), sd()
linear regression model: lm(), coef(), summary(), confint()
model visualization: geom_smooth(method = "lm"), predict(), modelr::add_predictions()
Try out the exercises and read up on linear regressions in R for Data Science (Wickham & Grolemund, 2017).
Apply a model for the width variables in the iris dataset. Load the hydro data and model the temperature or oxygen as a function of depth. Choose a single station or model across all stations.
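A hedged sketch for the hydro task: only pres is known from the salinity example, so the column names temp and station are assumptions — check names(hydro) first:
# Column names are assumptions - adjust to your data
temp_mod <- lm(temp ~ pres, data = hydro)   # hypothetical 'temp' column
summary(temp_mod)
# For a single station (hypothetical 'station' column):
# lm(temp ~ pres, data = dplyr::filter(hydro, station == "0076"))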
Then go grab a coffee, lean back and enjoy the rest of the day...!
For more information contact me: saskia.otto@uni-hamburg.de
http://www.researchgate.net/profile/Saskia_Otto
http://www.github.com/saskiaotto
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License, except for the borrowed content marked with a proper source statement.
Image on title and end slide: section of an infrared satellite image showing the Larsen C ice shelf on the Antarctic Peninsula - USGS/NASA Landsat: A Crack of Light in the Polar Dark, Landsat 8 - TIRS, June 17, 2017 (under CC0 license)