Saskia A. Otto
Postdoctoral Researcher
- Dependence inflates p-values
- Dependence you can "smell"
- Dependence due to model misfit
- Temporal or spational dependence
- df incorrect (number of independent components)
→ The variance in Y is constant (i.e. the variance does not change as Y gets bigger/smaller).
→ Only one variance has to be estimated and not one for every X value.
→ Also the multiple residuals for every X are expected to be homogeneous.
p
Adapted from: Zuur et al. (2007)
p
(taken from the ICES hydro dataset, station 0076, 2015-02-01)
par(mfrow = c(2,2))
plot(mod)
The mfrow argument creates a multi-paneled plot; first argument in the vector specifies the number of rows and the second the number of columns of plots.
par(mfrow = c(2,2))
plot(mod)
par(mfrow = c(2,2))
plot(mod)
$\epsilon$ seems slightly correlated with Y
$\epsilon$ ~ N
No outlier or too influential data point
You can compute both types of residuals using
residuals(model)
(works too: resid()
) from the stats package → returns a vector with the ordinary residualsrstandard(model)
from the stats package → returns a vector with the standardized residuals
add_residuals(data, model, var = "resid")
from the modelr package (in tidyverse) → adds the variable 'resid' containing the ordinary residuals to your data frame; useful when piping operations!
Load the 4 datasets into your workspace: assumptions1.txt, assumptions2.txt, assumptions3.txt, assumptions4.txt
df <- read_delim("data/assumptions1.txt")
str(df)
Regress each y variables against x in the same dataset using the lm
function and inspect the 4 diagnostic plots per model.
par(mfrow = c(2,2))
plot(your_lin_model)
Each of the 4 models will have outliers or assumptions that are not met. Find these and try to find a solution! How do the summary outputs change?
source flowchart: R for Data Science by Wickam & Grolemund, 2017 (licensed under CC-BY-NC-ND 3.0 US)
Can you find a latitudinal or longitudinal gradient in cod CPUE during the first quarter in 2015?
In this exercise, you can apply yourself what has been outlined in the EDA cycle: Follow up on the exercise from lecture 10, in which we tried to identify visually whether the CPUE differs between areas. Now you will explore visually but also statistically, whether the observed differences follow a specific pattern: Does the CPUE increase or decrease with latitude or longitude?
cod15
p
Load the following dataset, which is a subset of the full CPUE dataset ("CPUE per age per area_2017-11-20 06_48_16.csv") we used already in lecture 10:
load("data/cod_2015_q1.R")
ls()
## [1] "cod15"
p
To get the latitude and longitude for each area
Area
variable as well as the respective Lat
and Long
values and than cod15
tibble using a join
function from the dplyr packageHere are some rough approximations of the central coordinates of each area:
sd_coord <- tibble(
Area = factor(c(21,22,23,24,25,26,27,28,29,30,31,32)),
Lat = c(57,55,55.75,55,55.5,55.5,58,57.5,59.5,62,64.75,60),
Long = c(11.5,11,12.5,13.5,16,19.5,18,20,21,19.5,22.5,26)
)
The merging can be done using the left_join()
function from dpylr
:
cod15 <- left_join(cod15, sd_coord,
by = "Area") %>% print(n = 5)
## # A tibble: 88 x 5
## Area Age CPUE Lat Long
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 21 0 0 57 11.5
## 2 22 0 0 55 11
## 3 23 0 0 55.8 12.5
## 4 24 0 0 55 13.5
## 5 25 0 0 55.5 16
## # ... with 83 more rows
This function returns all rows from the left table (1st table listed in the function), and all columns from both tables. Alternative functions: right_join(), inner_join(), full_join()
p_lat <- cod15 %>%
ggplot(aes(x = Lat, y = CPUE)) +
geom_point() +
geom_smooth(method="lm", se=F)
p_long <- cod15 %>%
ggplot(aes(x = Long, y = CPUE)) +
geom_point() +
geom_smooth(method="lm", se=F)
grid.arrange(p_lat, p_long, nrow = 2)
m_lat <- lm(formula = CPUE ~ Lat, data = cod15)
m_long <- lm(formula = CPUE ~ Long, data = cod15)
par(mfrow = c(2,4))
plot(m_lat)
plot(m_long)
m_lat <- lm(formula = CPUE ~ Lat, data = cod15)
m_long <- lm(formula = CPUE ~ Long, data = cod15)
par(mfrow = c(2,4))
plot(m_lat)
plot(m_long)
What do you think about the residual distributions? Any outlier?
m_lat <- lm(formula = CPUE ~ Lat, data = cod15)
m_long <- lm(formula = CPUE ~ Long, data = cod15)
par(mfrow = c(2,4))
plot(m_lat)
plot(m_long)
What do you think about the residual distributions? Any outlier?
Compute the residuals and generate histograms for both models
cod15 <- cod15 %>%
add_residuals(model = m_lat, var = "Res_lat") %>%
add_residuals(model = m_long, var = "Res_long")
p_lat <- cod15 %>%
ggplot(aes(x = Res_lat)) +
geom_histogram(binwidth = 5) +
geom_vline(xintercept = 0,
colour = "blue", size = 0.5)
p_long <- cod15 %>%
ggplot(aes(x = Res_lat)) +
geom_histogram(binwidth = 5) +
geom_vline(xintercept = 0,
colour = "blue", size = 0.5)
grid.arrange(p_lat, p_long, nrow = 2)
One should stop here, in fact, as the model assumptions are violated, and do something such as transforming the data or excluding age groups. This will be discussed in the next lecture.
Lets visualize again the relationship between CPUE and Lat/Long but this time colour the data points by the age groups and plot the predictions manually.
p_lat <- cod15 %>%
add_predictions(m_lat, "Pred") %>%
ggplot(aes(x = Lat)) +
geom_point(aes(y = CPUE, colour = Age)) +
geom_line(aes(y = Pred))
p_long <- cod15 %>%
add_predictions(m_long, "Pred") %>%
ggplot(aes(x = Long)) +
geom_point(aes(y = CPUE, colour = Age)) +
geom_line(aes(y = Pred)) +
guides(colour = "none")
grid.arrange(p_lat, p_long, nrow = 2)
par()
for setting global graphical parameters
linear regression model: plot(model)
, residuals()
, resid()
, rstandard(model)
, modelr::add_residuals(data)
joining tables: dplyr::left_join()
Try out the exercises and read up on linear regressions in
Think of solutions for the CPUE ~ Lat/Long model and compare the model results.
Then go grab a coffee, lean back and enjoy the rest of the day...!
For more information contact me: saskia.otto@uni-hamburg.de
http://www.researchgate.net/profile/Saskia_Otto
http://www.github.com/saskiaotto
This work is licensed under a
Creative Commons Attribution-ShareAlike 4.0 International License except for the
borrowed and mentioned with proper source: statements.
Image on title and end slide: Section of an infrared satallite image showing the Larsen C
ice shelf on the Antarctic
Peninsula - USGS/NASA Landsat:
A Crack of Light in the Polar Dark, Landsat 8 - TIRS, June 17, 2017
(under CC0 license)