Saskia A. Otto
Postdoctoral Researcher
Decimal values such as 4.5 are called doubles. Whole numbers such as 4 are called integers. Integers and doubles are both called numerics. The values TRUE and FALSE are called logical.
my_double <- 42.5
my_integer <- 5
# With the L suffix, you get an integer rather than a double
my_integer_correct <- 5L
my_logical <- TRUE
my_character <- "some text"
# Note how the quotation marks on the right indicate that "some text" is a character.
To determine the (R internal) type or storage mode of any object or variable, use the function typeof():
typeof(my_double)
## [1] "double"
typeof(my_integer)
## [1] "double"
typeof(my_integer_correct)
## [1] "integer"
You can check if an object is of a specific type with an 'is.' function:
int_var <- 10L
is.integer(int_var)
## [1] TRUE
dbl_var <- 4.5
is.double(dbl_var)
## [1] TRUE
Overview of 'is.' functions
Function | lgl | int | dbl | num | chr |
---|---|---|---|---|---|
is.logical() | x | | | | |
is.integer() | | x | | | |
is.double() | | | x | | |
is.numeric() | | x | x | x | |
is.character() | | | | | x |
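The table can be checked directly in the console. A minimal sketch, using one illustrative value per type (the values are assumptions, not taken from the slides):

```r
# One example value per type (illustrative values only)
lgl <- TRUE
int <- 3L
dbl <- 3.7
chr <- "a"

is.numeric(int)    # TRUE: integers are numeric
is.numeric(dbl)    # TRUE: doubles are numeric
is.numeric(lgl)    # FALSE: logicals are not numeric, although they can be coerced
is.double(int)     # FALSE: an integer is not a double
is.character(chr)  # TRUE
```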
NA will always be coerced to the correct type if used inside a vector, or you can create NAs of a specific type with:
NA # logical
NA_integer_ # integer
NA_real_ # double
NA_character_ # character
Use is.na() to test for missing values:
x <- NA
is.na(x)
## [1] TRUE
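How NA adapts to its surroundings can be seen with typeof(). The following is only a sketch with made-up example vectors:

```r
# A plain NA is logical ...
typeof(NA)              # "logical"
# ... but it takes on the type of the vector it is placed in
typeof(c(NA, 2L))       # "integer"
typeof(c(NA, 2.5))      # "double"
typeof(c(NA, "text"))   # "character"
# The typed constants already carry their type
typeof(NA_character_)   # "character"
# is.na() works element by element
is.na(c(1, NA, 3))      # FALSE TRUE FALSE
```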
R’s base data structures can be organised by their dimensionality (1d, 2d, or nd) and whether they’re homogeneous (all contents must be of the same type) or heterogeneous (the contents can be of different types). This gives rise to the five data types most often used in data analysis:
Dimensions | Homogeneous | Heterogeneous |
---|---|---|
1d | Atomic vector | List |
2d | Matrix | Data frame |
nd | Array |
Atomic vectors are usually created with c(), short for combine:
dbl_var <- c(1, 2.5, 4.5)
# Use TRUE and FALSE (or T and F) to create logical vectors
log_var <- c(TRUE, FALSE, T, F)
chr_var <- c("these are", "some strings")
seq() (= sequence):
seq(from = 0, to = 1, by = 0.2)
## [1] 0.0 0.2 0.4 0.6 0.8 1.0
rep() (= repeat):
rep("a", times = 5)
## [1] "a" "a" "a" "a" "a"
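Both functions have further arguments that are often handy. A short sketch; the argument values are chosen only for illustration:

```r
# seq(): ask for a number of values instead of a step size
seq(from = 0, to = 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00
# The colon operator is a shortcut for integer sequences
1:5                                     # 1 2 3 4 5
# rep(): repeat the whole vector ...
rep(c("a", "b"), times = 2)             # "a" "b" "a" "b"
# ... or repeat each element in turn
rep(c("a", "b"), each = 2)              # "a" "a" "b" "b"
```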
Atomic vectors are always flat, even if you nest c()'s:
c(1, c(2, c(3, 4)))
## [1] 1 2 3 4
# the same as
c(1, 2, 3, 4)
## [1] 1 2 3 4
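Every vector has two key properties: its type, which you can determine with typeof(), and its length, which you can determine with length().
typeof(1:10)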
## [1] "integer"
x <- c(200, 50, 40, 1, 100, 20)
length(x)
## [1] 6
How to convert from one type to another, and when that happens automatically?
What happens when you work with vectors of different lengths?
How to name the elements of a vector?
How to pull out elements of interest?
For example, combining a character and an integer yields a character:
str(c("a", 1))
## chr [1:2] "a" "1"
When a logical vector is coerced to an integer or double, TRUE becomes 1 and FALSE becomes 0. This is very useful in conjunction with sum() and mean():
x <- c(FALSE, FALSE, TRUE)
as.numeric(x)
## [1] 0 0 1
# Total number of TRUEs
sum(x)
## [1] 1
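The same trick works with mean(), which then gives the proportion of TRUEs. A small sketch with a made-up vector vals:

```r
# vals is an illustrative vector, not part of the slides
vals <- c(2, 8, 5, 9)
# The comparison returns a logical vector
vals > 4          # FALSE TRUE TRUE TRUE
# Number of values above 4
sum(vals > 4)     # 3
# Proportion of values above 4
mean(vals > 4)    # 0.75
```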
Test your knowledge of vector coercion rules by predicting the output of the following uses of c():
c(1, FALSE)
The infinite set of numbers cannot be reduced to just two states, whereas TRUE and FALSE are easily coerced to the numbers 1 and 0. Because the 1 in this vector is not explicitly specified as an integer, both elements are coerced to type double.
c("a", 1)
A string (in this case "a") has no corresponding number it can be coerced to. A number such as 1, however, can be coerced to a string.
c(TRUE, 1L)
Now the number 1 is explicitly defined as integer, hence the TRUE is coerced to the integer 1.
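The three predictions can be verified with typeof(). The last line is an extra illustration, not part of the exercise: explicit coercion with as.integer() returns NA when no sensible conversion exists.

```r
typeof(c(1, FALSE))    # "double"
typeof(c("a", 1))      # "character"
typeof(c(TRUE, 1L))    # "integer"

# Coercing a string without a numeric meaning gives NA
# (R also issues a warning, suppressed here)
suppressWarnings(as.integer("a"))   # NA
```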
x <- c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE)
How many TRUEs are in x?
Remember, all TRUEs are coerced to 1 and all FALSEs to 0.
Simply use the sum function: sum(x)
Type the following into the R console (or run it in your script), which will create a long vector containing a random number of NAs.
x <- 1:10000
set.seed(123) # so we get all the same results
y <- sample(1:10000, 1) # random number of NAs
z <- sample(1:10000, y) # randomly assign positions of the y NAs
x[z] <- NA # place NAs on the positions in z
How many NAs are in x?
Recall, the is.na() function tests for NAs and returns a logical vector, which can be coerced to numbers for further calculations.
You count the NAs by typing: sum(is.na(x))
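The same idea on a small vector, where the result is easy to check by eye. This is a sketch with a made-up vector v; which() and the na.rm argument are additions that do not appear in the exercise:

```r
# v is an illustrative vector, not the one from the exercise
v <- c(4, NA, 7, NA, NA, 1)
sum(is.na(v))          # 3   number of NAs
mean(is.na(v))         # 0.5 proportion of NAs
which(is.na(v))        # 2 4 5  positions of the NAs
# Many summary functions can skip NAs on request
sum(v, na.rm = TRUE)   # 12
```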
As well as implicitly coercing the types of vectors to be compatible, R will also implicitly coerce the length of vectors. This is called vector recycling, because the shorter vector is repeated, or recycled, to the same length as the longer vector.
1:10 + 100
## [1] 101 102 103 104 105 106 107 108 109 110
# What will happen with this summation?
1:10 + 1:2
## [1] 2 4 4 6 6 8 8 10 10 12
What happens when you do the following calculation?
a <- c(10, 5, 100)
b <- 1:5
(a*b)*3
The shorter vector a is recycled to the length of the longer vector b: R fills the gap with the elements of the shorter vector one by one, i.e. position 4 gets the first element again and position 5 the second. Because 5 is not a multiple of 3, R also issues a warning. The 3 is likewise repeated as many times as the length of the longest vector, b.
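To make the recycling visible, the shorter vector can be extended by hand with rep(). A sketch; the helper name a_recycled is chosen here only for illustration:

```r
a <- c(10, 5, 100)
b <- 1:5
# Recycle a by hand to the length of b
a_recycled <- rep(a, length.out = length(b))
a_recycled             # 10 5 100 10 5
a_recycled * b * 3     # 30 30 900 120 75
# The original expression gives the same values (warning suppressed)
res <- suppressWarnings((a * b) * 3)
all.equal(res, a_recycled * b * 3)   # TRUE
```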
All types of vectors can be named. You can name them during creation with c():
c(a = 1, b = 2, c = 4)
## a b c
## 1 2 4
Or afterwards by using the function names():
x <- c(1,5,3)
names(x) <- c("a", "b", "c")
x
## a b c
## 1 5 3
[ is the subsetting function, and is called like x[a].
There are 4 ways to subset a vector:
1. Using a numeric vector containing only integers.
x <- c("one", "two", "three", "four", "five")
# positive integers keep elements at position:
x[c(5, 1, 3)]
## [1] "five" "one" "three"
# repeating integers make vectors longer:
x[c(1,1,1,1,2,2,2,2,3,3,3,4,4,5,5)]
## [1] "one" "one" "one" "one" "two" "two"
## [7] "two" "two" "three" "three" "three" "four"
## [13] "four" "five" "five"
# negative integers remove elements:
x[c(-3,-5)]
## [1] "one" "two" "four"
# but you cannot mix
# x[c(1,2,-5)] # --> gives error message
# Using zero
x[0] # --> returns an empty vector
## character(0)
2. Subsetting with a logical vector keeps all values corresponding to a TRUE value. This is most often useful in conjunction with the comparison functions.
x <- c(10, 3, NA, 5, 8, 1, NA)
# All non-missing values of x
b <- is.na(x)
x[!b] # the ! reverses the TRUE/FALSE values
## [1] 10 3 5 8 1
# All even (or missing!) values of x
x[x %% 2 == 0]
## [1] 10 NA 8 NA
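If the NAs should not show up in the result, the two conditions can be combined, or the logical vector can be turned into positions with which(). Both variants are sketches that go slightly beyond the slide:

```r
x <- c(10, 3, NA, 5, 8, 1, NA)
# Combine both conditions with & to drop the NAs
x[!is.na(x) & x %% 2 == 0]   # 10 8
# which() returns the positions of the TRUEs and ignores NAs
which(x %% 2 == 0)           # 1 5
x[which(x %% 2 == 0)]        # 10 8
```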
3. If you have a named vector, you can subset it with a character vector:
x <- c(abc = 1, def = 2, xyz = 5)
x[c("xyz", "def")]
## xyz def
## 5 2
# you can also duplicate elements
x[c("xyz", "def", "def")]
## xyz def def
## 5 2 2
4. Using nothing returns the original vector. This is more useful for other data structures than for vectors:
x[]
## abc def xyz
## 1 2 5
A vector x has been created by drawing 20 numbers randomly from 1 to 1000:
set.seed(1) # (= state of the Random Number Generator set to 1)
x <- sample(1:1000, 20)
Try it out yourself and answer the following 3 questions:
1. To get the 5th element: x[5]
2. To sum up over the first 4 elements: sum(x[1:4]) or sum(x[c(1,2,3,4)])
3. To remove the 3rd and 15th element before summation: sum(x[c(-3, -15)])
What happens when you subset with a positive integer that’s bigger than the length of the vector?
R returns NA for every requested position beyond the end of the vector. The values are not recycled; the positions that do not exist are filled with NAs.
What happens when you subset with a name that doesn’t exist?
Instead of returning an error message, R returns NA for every requested name that does not exist. Beware of this behaviour and check your results at each step!
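Both cases are easy to try out. A sketch with a small named vector; the name "nope" is deliberately one that does not exist, and the %in% check is only one possible safeguard:

```r
x <- c(abc = 1, def = 2, xyz = 5)
# Position beyond the end: NA is returned, x itself is unchanged
x[5]
length(x)                       # 3
# Unknown name: NA is returned for that element
x[c("abc", "nope")]
# One way to check beforehand that all requested names exist
c("abc", "nope") %in% names(x)  # TRUE FALSE
```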
Attributes can be accessed individually with attr() or all at once (as a list) with attributes().
temp <- c(17.4, 18.3, 20.8, 16.9, 28.1)
# this metadata is typically written in the header in Excel or in an extra
# spreadsheet, but can be put as attributes into R:
attr(temp, "unit") <- "°C"
attr(temp, "samplinginfo") <- "surface temperature (0.5m depth), measured with CTD"
attributes(temp)
## $unit
## [1] "°C"
##
## $samplinginfo
## [1] "surface temperature (0.5m depth), measured with CTD"
Each of these attributes has a specific accessor function to get and set values:
names(x)
length(x) (for 1-dimensional structures: vectors, lists), otherwise dim(x)
class(x)
The attribute names and other attributes that you set manually will always appear when you look at the content of your vector:
# add stationnames
names(temp) <- c("st_03", "st_11", "st_17", "st_21", "st_25")
temp
## st_03 st_11 st_17 st_21 st_25
## 17.4 18.3 20.8 16.9 28.1
## attr(,"unit")
## [1] "°C"
## attr(,"samplinginfo")
## [1] "surface temperature (0.5m depth), measured with CTD"
Length and class, in contrast, are only shown when you ask for them explicitly:
length(temp)
## [1] 5
class(temp)
## [1] "numeric"
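One caveat worth knowing: subsetting with [ keeps the names but drops attributes that were set by hand. A minimal sketch with a shortened version of the temperature vector; the object name temp_sub and the attribute value are chosen only for illustration:

```r
temp <- c(st_03 = 17.4, st_11 = 18.3, st_17 = 20.8)
attr(temp, "unit") <- "degrees Celsius"
# Take the first two stations
temp_sub <- temp[1:2]
names(temp_sub)          # "st_03" "st_11": names are kept
attr(temp_sub, "unit")   # NULL: the manually set attribute is gone
```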
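One important use of attributes is to define factors. Factors are vectors that can contain only predefined values (the levels) and are used to store categorical data: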
biomass <- factor(c("low", "medium", "low", "high", "medium"))
biomass
## [1] low medium low high medium
## Levels: high low medium
class(biomass)
## [1] "factor"
levels(biomass) # shown in alphabetic order if not specified
## [1] "high" "low" "medium"
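The alphabetic default is rarely the order you want for categories such as low, medium and high. A sketch of how the order can be set with the levels argument of factor():

```r
biomass <- factor(c("low", "medium", "low", "high", "medium"),
                  levels = c("low", "medium", "high"))
levels(biomass)       # "low" "medium" "high"
# Internally a factor is an integer vector pointing to its levels
typeof(biomass)       # "integer"
as.integer(biomass)   # 1 2 1 3 2
# Counts per level, reported in the order of the levels
table(biomass)
```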
a <- c(1,2,3,4)
# named res rather than c, so the result does not share its name with the c() function
res <- (a + sqrt(a))/(exp(2)+1)
res
## [1] 0.2384058 0.4069842 0.5640743 0.7152175
R calculations are vectorized: an operation is applied to each element of a vector.
a <- c(1,2,3,4)
b <- 10 # b gets recycled to the length of a
a + b # = a[1] + b[1], a[2] + b[1], a[3] + b[1], a[4] + b[1] (b[1] is reused)
## [1] 11 12 13 14
a * b # = a[1] * b[1], ...
## [1] 10 20 30 40
Calculate for the following vector
set.seed(1)
x <- sample(1:20, 20, replace = TRUE)
the sum, over all observations, of squared deviation of each observation from the overall mean.
The following functions are useful: sum(), mean().
Remember, the calculation needs to be from the innermost to the
outermost parenthesis (just like a calculator). So your order should be:
1. calculate deviations of vector,
2. square deviations,
3. sum all up.
sum( (x-mean(x))^2 )
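As a cross-check, the sum of squared deviations equals the sample variance multiplied by n - 1. The sketch below uses var(), which is not among the functions introduced in these slides:

```r
set.seed(1)
x <- sample(1:20, 20, replace = TRUE)
# Step by step: deviations, squared deviations, sum
dev <- x - mean(x)
ss <- sum(dev^2)
# var() divides the same sum by (n - 1), so multiplying back recovers it
all.equal(ss, var(x) * (length(x) - 1))   # TRUE
```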
c(), typeof(), length(), is.logical(), as.logical(), is.integer(), as.integer(), is.double(), as.double(), is.numeric(), as.numeric(), is.character(), as.character(), str(),
names(), [], is.na(), set.seed(), sample(), attr(), attributes(), dim(), class(),
factor(), levels(),
+, -, *, /, ^, sqrt(), exp()
Try out the online tutorial at Data Camp
Don't worry! Soon you won't be bored anymore!!
Then go grab a coffee, lean back and enjoy the rest of the day...!
For more information contact me: saskia.otto@uni-hamburg.de
http://www.researchgate.net/profile/Saskia_Otto
http://www.github.com/saskiaotto
This work is licensed under a
Creative Commons Attribution-ShareAlike 4.0 International License except for the
borrowed and mentioned with proper source: statements.
Image on title and end slide: Section of an infrared satellite image showing the Larsen C ice shelf on the Antarctic Peninsula - USGS/NASA Landsat: A Crack of Light in the Polar Dark, Landsat 8 - TIRS, June 17, 2017 (under CC0 license)