Saskia A. Otto
Postdoctoral Researcher
Five data types most often used in data analysis:
Dimensions | Homogeneous | Heterogeneous |
---|---|---|
1d | Atomic vector | List |
2d | Matrix | Data frame |
nd | Array |
list()
instead of c()
:x <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9))
str(x)
## List of 4
## $ : int [1:3] 1 2 3
## $ : chr "a"
## $ : logi [1:3] TRUE FALSE TRUE
## $ : num [1:2] 2.3 5.9
NULL
is often used to represent the absence of a vector (as opposed to NA
which is used to represent the absence of a value in a vector). NULL
typically behaves like a vector of length 0.
x <- list(list(list(list())))
str(x)
## List of 1
## $ :List of 1
## ..$ :List of 1
## .. ..$ : list()
x1 <- list(c(1, 2), c(3, 4))
x2 <- list(list(1, 2), list(3, 4))
x3 <- list(1, list(2, list(3)))
source: R for Data Science by Wickam & Grolemund, 2017 (licensed under CC-BY-NC-ND 3.0 US)
typeof()
a list is a listis.list()
and as.list()
unlist()
. unlist()
uses the same coercion rules as c()
.A very useful tool for working with lists is str()
because it focuses on the structure, not the content.
x <- list(1, 2, 3)
str(x)
## List of 3
## $ : num 1
## $ : num 2
## $ : num 3
A very useful tool for working with lists is str()
because it focuses on the structure, not the content.
x <- list(1, 2, 3)
str(x)
## List of 3
## $ : num 1
## $ : num 2
## $ : num 3
x_named <- list(a = 1, b = 2, c = 3)
str(x_named)
## List of 3
## $ a: num 1
## $ b: num 2
## $ c: num 3
Three ways to subset a list:
[
extracts a sublist. [[
extracts a single component from a list. $
is a shorthand for extracting named elements of a list.I will demonstrate each way using the following list:
a <- list(a = 1:3, b = "a string", c = pi, list(-1, -5))
str(a)
## List of 4
## $ a: int [1:3] 1 2 3
## $ b: chr "a string"
## $ c: num 3.14
## $ :List of 2
## ..$ : num -1
## ..$ : num -5
1.[
extracts a sublist. The result will always be a list (it keeps the original list 'container' and removes all elements not selected). Like with vectors, you can subset with a logical, integer, or character vector.
str(a[1:2])
## List of 2
## $ a: int [1:3] 1 2 3
## $ b: chr "a string"
str(a[4])
## List of 1
## $ :List of 2
## ..$ : num -1
## ..$ : num -5
1.[
extracts a sublist. The result will always be a list (it keeps the original list 'container' and removes all elements not selected). Like with vectors, you can subset with a logical, integer, or character vector.
str(a[1:2])
## List of 2
## $ a: int [1:3] 1 2 3
## $ b: chr "a string"
str(a[4])
## List of 1
## $ :List of 2
## ..$ : num -1
## ..$ : num -5
a[4]
## [[1]]
## [[1]][[1]]
## [1] -1
##
## [[1]][[2]]
## [1] -5
2.[[
extracts a single component from a list. It removes a level of hierarchy from the list (= you remove one 'container').
str(a[[1]])
## int [1:3] 1 2 3
str(a[[4]])
## List of 2
## $ : num -1
## $ : num -5
a[[4]]
## [[1]]
## [1] -1
##
## [[2]]
## [1] -5
3.$
is a shorthand for extracting named elements of a list. It works similarly to [[
except that you don’t need to use quotes.
a$a
## [1] 1 2 3
# same as
a[["a"]]
## [1] 1 2 3
source: R for Data Science by Wickam & Grolemund, 2017 (licensed under CC-BY-NC-ND 3.0 US)
1. list(a, b, list(c, d), list(e, f))
2. list(list(list(list(list(list(a))))))
The following list has been created:
list_example <- list(one = 1:10, two = letters,
three = list(abc = c(132, 876, 42), xyz = c(T,F,F,T,F,T)), four = NULL)
What does the following return? list_example[1:2]
[
extracts always sublists and returns a list. In this example, you extract the first and second sublists, which are the numeric vector "one" and the character vector "two".
The following list has been created:
list_example <- list(one = 1:10, two = letters,
three = list(abc = c(132, 876, 42), xyz = c(T,F,F,T,F,T)), four = NULL)
What does the following return? list_example["four"]
[
extracts always sublists and returns a list. In this example, you extract the sublist that is called "four", which is the fourth sublist containing NULL elements.
The following list has been created:
list_example <- list(one = 1:10, two = letters,
three = list(abc = c(132, 876, 42), xyz = c(T,F,F,T,F,T)), four = NULL)
What does the following return? list_example[[1]][2]
[[
extracts always single components from a list. In this case, it extracts the first sublist, which is a vector, and from there the 2nd element, which is a 2. So the returned object is also a vector containing only the number 2.
The following list has been created:
list_example <- list(one = 1:10, two = letters,
three = list(abc = c(132, 876, 42), xyz = c(T,F,F,T,F,T)), four = NULL)
What does the following return? list_example[3][[2]]
This is a bit trickier! The code snippet tries to subset the following: return a list that extracts from 'list_example' the 3rd sublist and from this list take sublist 2.
BUT: that list has only 1 element, which is the list 'three' (containing vectors 'abc' and 'xyz'); a sublist 2 doesn't exist. However, the following would have worked: list_example[2:3][[2]]
since now the extracted list contains 2 sublists, "two" and "three", from which it can subset list "three".
What is equivalent to the following code (multiple answers correct)? And which of the options below returns a suprising value?
list_example[["three"]][c("abc", "xyz")]
Option 2 subsets something else, which is not so intuitive: it extracts also the 3rd sublist "three", but from there it subsets the 2nd element of the 1st sublist (the vector "abc"). DO NOT use this notation for extracting the number '876', use instead: list_example[[3]][[1]][2]
(element 2 in sublist 1 within sublist 3)!!!
Create a new vector that contains the 4th element of sublist "one" and element 1 and 3 from sublist "abc" within "three" in 'list_example'.
Within the vector you extract first the value from sublist "one" and then values from sublist "abc" within sublist "three" (both seperated by a comma).
To get the correct answer you could subset for instance like this:
sum( c(list_example[[1]][2], list_example[[3]][[1]][c(1,3)]) )
or using the $
sign:
sum( c(list_example$one[2], list_example$three$abc[c(1,3)]) )
Execute the following R command in your console
lm_list <- lm(Sepal.Length ~ Sepal.Width, data = iris)
and look at the structure of the list you created with
str(lm_list, max.level = 1) # max.level=1 shows only the first level
# of the hierarchy (and not all sub/sub/..lists))
Look at the names of the sublists in lm_list
. Is there anything that sounds like what we are looking for? If you found it check the number of elements it contains (we want the last one!).
The object created from a linear regression (that is what the lm()
function does) is a long list with many nested lists. lm_list contains 12 sublists amongst one is a vector called 'residuals' (thats what we are after). So to get the last value (from the 150 values) type:
lm_list$residuals[150]
dim
attribute to an atomic vector allows it to behave like a multi-dimensional array. Matrices are created with
matrix()
cbind()
(stands for column-binding).Matrices are created with
matrix()
cbind()
(stands for column-binding).a <- matrix(1:6, ncol = 3, nrow = 2)
a
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
v1 <- 1:3
v2 <- 4:6
a <- cbind(v1, v2)
a
## v1 v2
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
length()
and names()
have high-dimensional generalisations:
length()
generalises to
nrow()
and ncol()
for matrices, and dim()
for arrays.length(a)
## [1] 6
nrow(a)
## [1] 3
ncol(a)
## [1] 2
names()
generalises to
rownames()
and colnames()
for matrices, and dimnames()
, a list of character vectors, for arrays.colnames(a) <- c("A","B")
rownames(a) <- c("a","b","c")
a
## A B
## a 1 4
## b 2 5
## c 3 6
Most common way of subsetting matrices (2d) and arrays (>2d) is a simple generalisation of 1d subsetting:
Most common way of subsetting matrices (2d) and arrays (>2d) is a simple generalisation of 1d subsetting:
a <- matrix(1:9, nrow = 3)
colnames(a) <- c("A", "B", "C")
a
## A B C
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
a[1:2, 2]
a[, c("A", "C") ]
Guess which values you get!
Most common way of subsetting matrices (2d) and arrays (>2d) is a simple generalisation of 1d subsetting:
a <- matrix(1:9, nrow = 3)
colnames(a) <- c("A", "B", "C")
a
## A B C
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
a[1:2, 2]
## [1] 4 5
a[, c("A", "C") ]
## A C
## [1,] 1 7
## [2,] 2 8
## [3,] 3 9
Subset the following matrix...
(mat <- matrix(c(0,NA,4,18,35,97,7,9,20), nrow = 3))
## [,1] [,2] [,3]
## [1,] 0 18 7
## [2,] NA 35 9
## [3,] 4 97 20
to get all values in row 1
Submit and Compare ClearSubset the following matrix...
(mat <- matrix(c(0,NA,4,18,35,97,7,9,20), nrow = 3))
## [,1] [,2] [,3]
## [1,] 0 18 7
## [2,] NA 35 9
## [3,] 4 97 20
to get all values in row 1 and 3
Submit and Compare ClearSubset the following matrix...
(mat <- matrix(c(0,NA,4,18,35,97,7,9,20), nrow = 3))
## [,1] [,2] [,3]
## [1,] 0 18 7
## [2,] NA 35 9
## [3,] 4 97 20
to get all values in column 2
Submit and Compare ClearSubset the following matrix...
(mat <- matrix(c(0,NA,4,18,35,97,7,9,20), nrow = 3))
## [,1] [,2] [,3]
## [1,] 0 18 7
## [2,] NA 35 9
## [3,] 4 97 20
to get all values in row 2 and 3 and column 1 and 2
Submit and Compare Clear
names()
, colnames()
, and rownames()
, although names()
and colnames()
are the same thing. length()
is the length of the underlying list and so is the same as ncol()
; nrow()
gives the number of rows.You create a data frame using data.frame()
(note the point inbetween both words!), which takes named vectors as input:
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
str(df)
## 'data.frame': 3 obs. of 2 variables:
## $ x: int 1 2 3
## $ y: Factor w/ 3 levels "a","b","c": 1 2 3
You create a data frame using data.frame()
(note the point inbetween both words!), which takes named vectors as input:
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
str(df)
## 'data.frame': 3 obs. of 2 variables:
## $ x: int 1 2 3
## $ y: Factor w/ 3 levels "a","b","c": 1 2 3
data.frame()
s default behaviour, which turns strings into factors. Use stringsAsFactors = FALSE
to suppress this behaviour:df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
str(df)
## 'data.frame': 3 obs. of 2 variables:
## $ x: int 1 2 3
## $ y: chr "a" "b" "c"
Either like a matrix (useful if several columns and rows are selected)
df[1:2, 1] # row 1-2, column 1
## [1] 1 2
Either like a matrix (useful if several columns and rows are selected)
df[1:2, 1] # row 1-2, column 1
## [1] 1 2
Or like a list
df$x # shows all elements of column 'x'
## [1] 1 2 3
df$y[2] # 2nd element of column 'y'
## [1] "b"
df[[2]][2] # same
## [1] "b"
that contains
iris
dataset - data structureExplore the following dataset
head(iris, 1)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
What type of data structure is iris
?
Use the class()
function to answer this question.
iris
is a data frame because it is 2-dimensional and contains different types (it is heterogeneous).
iris
dataset - dimensionsWhat are the dimensions of the dataset iris
?
Use the nrow()
and ncol()
functions to answer this question.
iris
dataset - data typeWhich basic data types does the dataset iris
contain?
Use the str()
function to answer this question.
The first 4 columns contain decimal numbers, the last is a character vector with additional attributes, which makes it a factor (see also attributes(iris$Species)
).
iris
dataset - subsettingsum()
The easiest is to subset matrix-like within the sum
function: sum( iris[, 1:4] )
When you create R objects, you'll see them appear in your environment pane under Global Environment:
x <- 1:10
y <- 1:10
z <- cbind(x,y) # matrix
The global environment, more often known as the user's workspace, is the first item on the search path. When a user starts a new session in R, the R system creates a new environment for objects created during that session.
You can list all objects in the workspace using the function ls()
:
x <- 1:10
y <- 1:10
z <- cbind(x,y) # matrix
ls()
## [1] "x" "y" "z"
You can remove an object with rm()
:
x <- 4
x
rm(x)
You can remove an object with rm()
:
x <- 4
x
rm(x)
Or remove all objects in one go:
str()
, [
, [[
, $
,
list
, is.list()
, as.list()
, unlist()
,
matrix()
, cbind()
, nrow()
, ncol()
, dim
, rownames()
, colnames()
, dimnames()
,
data.frame()
, data.frame(stringsAsFactors = FALSE)
Again, try out the online tutorial at Data Camp.
And go over this lecture again and do the quizzes.
Then try out the following: Calculate for the iris
data set
Then go grab a coffee, lean back and enjoy the rest of the day...!
For more information contact me: saskia.otto@uni-hamburg.de
http://www.researchgate.net/profile/Saskia_Otto
http://www.github.com/saskiaotto
This work is licensed under a
Creative Commons Attribution-ShareAlike 4.0 International License except for the
borrowed and mentioned with proper source: statements.
Image on title and end slide: Section of an infrared satallite image showing the Larsen C
ice shelf on the Antarctic
Peninsula - USGS/NASA Landsat:
A Crack of Light in the Polar Dark, Landsat 8 - TIRS, June 17, 2017
(under CC0 license)