Saskia A. Otto
Postdoctoral Researcher
source: Data Import cheat sheet (licensed under CC-BY-SA)
This table shows the number of times the 3 fish species were caught in 5 trawls. How many variables do you see?
How would you access individual elements, e.g. with the following expressions?
names(df)[2:3]
df[1, 2:4]
c( df[1, ], df[2, ], df[3, ])
df$occurrences
df$species
df$n
What can be potential issues with such a table?
gather() and spread()
gather(): Moves column names into a key column, gathering the column values into a single value column.
spread(): Moves the unique values of a key column into the column names, spreading the values of a value column across the new columns.
For the hydrographical example you would write in the R console:
cruise <- tibble(pressure = 1:5, station_1 = c(14.1,14.0,14.0,13.9,13.8),
station_2 = c(15.3,15.4,15.2,15.0,14.9))
cruise
# Make tibble long
cruise_long <- gather(cruise, station_1, station_2, key = "station", value = "temp")
cruise_long
# Reshape tibble back into wide format
cruise_wide <- spread(cruise_long, station, temp)
cruise_wide # same as cruise!
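To convince yourself that the round trip really restores the original, a quick check (not in the original slides):
all.equal(cruise, cruise_wide) # TRUE: spread() has exactly undone gather()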
Why are gather() and spread() not perfectly symmetrical?
Carefully consider the following example:
(stocks <- tibble(
  year = c(rep(2014, 3), rep(2015, 4), rep(2016, 2)),
  quarter = c(2:4, 1:4, 1:2),
  return = round(rnorm(9, mean = 1, sd = 0.3), 2)
))
## # A tibble: 9 x 3
## year quarter return
## <dbl> <int> <dbl>
## 1 2014 2 1.45
## 2 2014 3 1.12
## 3 2014 4 0.81
## 4 2015 1 0.34
## 5 2015 2 1.34
## 6 2015 3 0.99
## 7 2015 4 1
## 8 2016 1 1.28
## 9 2016 2 1.25
Now let's make the data wide and then long again:
stocks_wide <- spread(stocks, quarter, return)
stocks_long <- gather(stocks_wide, `1`:`4`, key = "quarter", value = "return")
What is the difference between stocks and stocks_long? (Hint: look also at the variable types and think about column names.)
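One way to compare the two objects is to look at their structure with base R's str() (just a suggested inspection aid, not part of the original slides):
str(stocks) # note the type of 'quarter' and the number of rows here...
str(stocks_long) # ...and compare with the type and number of rows here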
What is the argument 'factor_key' in gather() for? Read the help documentation: ?gather.
Use these functions to split or combine cells into individual, isolated values: separate(), separate_rows(), unite()
df <- tibble(
sd_station = c("25_BY5", "26_BMPJ2", "26_J56", "26_K32",
"27_B1", "28_BY15", "29_F64", "30_SR5", "30_US5B"),
temp = c(14.1, 13.0, 15.2, 17.9, 14.8, 12.9, 12.1, 11.3, 11.1)
)
df_split <- separate(df, sd_station, into = c("sd", "station")) # splits at the "_" (any non-alphanumeric character by default)
df_join <- unite(df_split, sd, station, col = "sd_station", sep = "/") # e.g. "25/BY5"
What do the extra and fill arguments do in separate()? Experiment with the various options for the following two toy datasets.
df1 <- tibble(x = c("a,b,c", "d,e,f,g", "h,i,j"))
separate(df1, x, c("one", "two", "three"), remove = FALSE)
df2 <- tibble(x = c("a,b,c", "d,e", "f,g,i"))
separate(df2, x, c("one", "two", "three"))
(The solution is at the end of the presentation.)
Both unite() and separate() have a remove argument. What does it do? Why would you set it to FALSE?
Look into the help (?separate) and play with the argument and the above toy datasets.
You can apply the function is.na() to single vectors and single variables in a data frame, as you learned in lecture 2. Here is an example with the ICES dataset:
is.na(hydro$temp)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## ... (output truncated for display) ...
## [ reached getOption("max.print") -- omitted 29012 entries ]
Remember, the is.na() function returns a logical vector of the same length as the original vector (which has, in our case, 30012 values!).
BETTER: If you wrap the function in sum(), you can calculate the sum of all TRUEs in this vector:
sum(is.na(hydro$temp))
## [1] 1714
sum(is.na(hydro$psal))
## [1] 2382
You can do the same with the entire data frame:
sum(is.na(hydro))
## [1] 13809
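A related trick: since the mean of a logical vector is the proportion of TRUEs, you can also compute the share of missing values directly:
mean(is.na(hydro$temp)) # proportion of NAs in temp instead of the count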
A very fast way to see whether and how many NAs you have in a data frame is to use the summary() function, which displays not only some descriptive statistics but also the number of NAs:
summary(hydro[ , 9:11]) # only a subset of columns selected for display purposes
## temp psal doxy
## Min. :-0.700 Min. : 0.065 Min. : 0.000
## 1st Qu.: 4.200 1st Qu.: 6.548 1st Qu.: 5.870
## Median : 6.067 Median : 7.430 Median : 6.960
## Mean : 7.668 Mean : 8.082 Mean : 6.493
## 3rd Qu.:10.500 3rd Qu.: 8.436 3rd Qu.: 8.200
## Max. :24.400 Max. :34.016 Max. :11.760
## NA's :1714 NA's :2382 NA's :7304
drop_na(data, ...): Drops entire rows containing missing values.
fill(data, ..., .direction = c("down", "up")): Fills missing values using the previous (.direction = "down") or following (.direction = "up") entry. Useful if only values that change are recorded.
replace_na(data, replace = list(), ...): Replaces missing values with a value specified for each column.
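A minimal demonstration of the three functions on an invented toy tibble (all values are made up for illustration):
library(tibble)
library(tidyr)
df_na <- tibble(x = c(1, NA, NA, 4), y = c("a", "b", NA, "d"))
drop_na(df_na) # keeps only the complete rows 1 and 4
fill(df_na, x) # replaces each NA in x with the previous value 1
replace_na(df_na, list(x = 0, y = "unknown")) # column-specific replacement values
Use the indexing rules and operators for subsetting you've learned so far: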
cruise <- tibble(pressure = 1:5, station_1 = c(4.1,4.0,4.0,3.9,33.8),
station_2 = c(-5.3,-5.4,5.2,5.0,4.9))
# if you know the row of the outlier
cruise$station_1[5] <- NA
# set all negative values to NA
cruise$station_2[cruise$station_2 < 0] <- NA
# this works only if you have no NAs in the condition, otherwise try
cruise$station_2[!is.na(cruise$station_2) & cruise$station_2 < 0] <- NA
A useful function to identify the positions of elements in a vector for which a specified condition holds is which(). Let's identify the temperature value(s) that are equal to the maximum observed temperature value and set this/these to NA:
id <- which(cruise$station_1 == max(cruise$station_1))
id
## [1] 5
cruise$station_1[id] <- NA
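As an aside, base R also offers which.max(), which returns the position of the maximum directly; unlike the which() approach above, it gives only the first position even if the maximum occurs several times (and it ignores NAs):
which.max(cruise$station_1) # position of the (first) maximum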
We have the following dataset:
df <- tibble(
station = paste0("station_", 1:20), # concatenates "station" with the number 1 to 20
temp = rnorm(20, mean = 14, sd = 2), # generates 20 values from a normal distribution
sal = rnorm(20, mean = 8, sd = 0.5)
)
random_nas <- sample(1:20, 7) # samples 7 random positions
df$temp[random_nas] <- NA # fill these positions in variable 'temp' with NAs
df$sal[random_nas] <- NA
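Note that rnorm() and sample() draw different numbers on every run; if you want a reproducible example, you can set a seed before running the code above (optional, any fixed number works):
set.seed(123) # makes the random draws reproducible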
Try to deal with the NAs in the variable 'temp' in the following ways: (1) drop the rows containing NAs, (2) fill the NAs with the previous value, (3) replace the NAs with the mean temperature, (4) replace the NAs with -999.
What are the consequences of each choice? Which of the options (the four above, plus simply keeping the original NAs) is best for which situation?
As a help, calculate for each option the mean temperature and compare the results (the solution is at the end of the presentation).
Restore the original df (including the NAs in temp) and figure out how to replace the NAs (with the previous value) in both variables 'temp' and 'sal' IN ONE STEP.
(The solution is at the end of the presentation.)
The cheat sheet is freely available at https://www.rstudio.com/resources/cheatsheets/
WHAT ARE THE THINGS TO CONSIDER?
base package: is.na(), which(), summary()
tidyr package: gather(), spread(), separate(), separate_rows(), unite(), drop_na(), fill(), replace_na()
Go thoroughly through the tasks and quizzes. Read chapter 12 'Tidy Data' in 'R for Data Science'.
Then start playing with the full hydro dataset "1111473b.csv" and make it tidy.
Then go grab a coffee, lean back and enjoy the rest of the day...!
For more information contact me: saskia.otto@uni-hamburg.de
http://www.researchgate.net/profile/Saskia_Otto
http://www.github.com/saskiaotto
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License, except for the borrowed material marked with a proper 'source:' statement.
Image on title and end slide: section of an infrared satellite image showing the Larsen C ice shelf on the Antarctic Peninsula - USGS/NASA Landsat: A Crack of Light in the Polar Dark, Landsat 8 - TIRS, June 17, 2017 (under CC0 license)
# extra = "merge" keeps surplus pieces in the last column instead of dropping them
separate(df1, x, c("one", "two", "three"), extra = "merge")
# fill = "left" pads missing pieces with NA from the left instead of the right
separate(df2, x, c("one", "two", "three"), fill = "left")
# This would be your code:
# 1. drop the rows containing NAs in temp
drop_na(df, temp)
# 2. fill NAs with the previous value (the default direction)
fill(df, temp)
# 3. replace NAs with the mean temperature
temp_mean <- mean(df$temp, na.rm = TRUE)
replace_na(df, list(temp = temp_mean))
# 4. replace NAs with -999
replace_na(df, list(temp = -999))
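To compare the consequences of the four options, as the task asks, calculate the mean temperature under each one (here df is still the version containing the NAs):
mean(drop_na(df, temp)$temp) # option 1
mean(fill(df, temp)$temp) # option 2
mean(replace_na(df, list(temp = temp_mean))$temp) # option 3
mean(replace_na(df, list(temp = -999))$temp) # option 4: heavily distorted
mean(df$temp, na.rm = TRUE) # keeping the NAs and simply ignoring them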
Some functions cannot handle NAs, so one way to solve the problem is to drop the entire rows that contain NAs or to replace the NAs with some value such as the mean, median, or previous value:
dropping rows: a simple, clean solution that doesn't bring any artificial information into the data. But you might lose so many rows that your sample size becomes too small. In that case, filling NAs with values is a better option.
previous values: can be useful in time series or spatial data (including depth profiles) where values next to each other (e.g. temperatures on the next day or at the adjacent water depth) are likely to be similar. This becomes problematic if you have large gaps, e.g. a measurement at 1 m depth and then the next one only at 100 m. For all intermediate depths (2-99 m) the values would resemble the one from 1 m depth, which is highly unrealistic. In such a case, removing the entire sample might be the best choice.
mean: replacing NAs with means (or medians) is a common technique, particularly if samples are not closely related (e.g. temperature measurements of stations far away from each other). But think carefully about which mean to take. The overall mean (across all samples, stations, etc.)? The mean of that sample? The mean of the specific depth? The mean of the specific depth in the specific month?
Some datasets come with the value -999 for missing values. To be consistent, you might be tempted to do the same for other datasets. But as soon as you do any calculations, this value will be treated as a real number, messing up all your results (imagine the mean of 15, 16 and -999). Always change such missing value codes to NAs before you do any calculation, visualisation, or modelling.
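A quick illustration of how such a missing value code distorts a calculation:
x <- c(15, 16, -999)
mean(x) # -322.67 instead of the intended 15.5
x[x == -999] <- NA # recode the marker to NA
mean(x, na.rm = TRUE) # 15.5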
Best solution: there is no single best solution, but good practice is to keep your missing values as NAs and to deal with them, for each particular method, in the way best suited to the type of data.
If you don't specify any variable, NAs will be replaced in the entire dataset:
fill(df)
If you want to make sure NAs are replaced only in temp and sal, specify them:
fill(df, temp, sal)
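And if you want the following instead of the previous value, use the .direction argument:
fill(df, temp, sal, .direction = "up") # fills each NA with the next non-missing value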