Saskia A. Otto
Postdoctoral Researcher
source of flowchart: R for Data Science by Wickham & Grolemund, 2017 (licensed under CC BY-NC-ND 3.0 US)
The tidyverse is a collection of R packages that share common philosophies and are designed to work together. Some of them you will get to know during the course; for example, instead of utils::write.csv(row.names = FALSE) you will use readr::write_csv().
The easiest way to get these packages is to install the whole tidyverse:
install.packages("tidyverse")
The chart was created using this code from Andrie de Vries (on Oct 12th, 2018).
The 'approved' versions can be downloaded from CRAN using the function
install.packages("package_name")
or via RStudio:
You load a package using the functions library()
or require()
. R checks whether the package has been installed; the main difference between both functions is what happens if it has not: library() stops with an error message, while require() returns FALSE and gives a warning. For consistency, simply stick to one function:
library(any_package) # library("any_package") would also work
## Error in library(any_package): there is no package called 'any_package'
require(any_package) # require("any_package")
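The return value is what makes require() useful in scripts. A minimal base-R sketch ("no_such_package_xyz" is a deliberately nonexistent name):

```r
# require() returns TRUE/FALSE instead of stopping with an error
ok  <- require("datasets", quietly = TRUE)             # base package: TRUE
bad <- require("no_such_package_xyz", quietly = TRUE)  # missing: FALSE
# In scripts this enables a conditional install, e.g.:
# if (!require("somepkg")) install.packages("somepkg")
c(ok, bad)
```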
When you load a specific package, you add it to the search path:
modified from Advanced R by H. Wickham, 2014
You can see the search path and package list by running search()
.
search()
## [1] ".GlobalEnv" "package:FSAdata" "package:modelr"
## [4] "package:FSA" "package:magrittr" "package:cowplot"
## [7] "package:gridExtra" "package:grid" "package:maps"
## [10] "package:ggthemes" "package:ggmap" "package:ggridges"
## [13] "package:bindrcpp" "package:lubridate" "package:forcats"
## [16] "package:stringr" "package:dplyr" "package:purrr"
## [19] "package:readr" "package:tidyr" "package:tibble"
## [22] "package:ggplot2" "package:tidyverse" "tools:rstudio"
## [25] "package:stats" "package:graphics" "package:grDevices"
## [28] "package:utils" "package:datasets" "package:methods"
## [31] "Autoloads" "package:base"
After loading
library(tidyverse)
you see that 8 additional tidyverse core packages are loaded.
You also see a conflict of function names (filter()
and lag()
exist in 2 packages)!
Let's look at the search path again:
search()
## [1] ".GlobalEnv" "package:FSAdata" "package:modelr"
## [4] "package:FSA" "package:magrittr" "package:cowplot"
## [7] "package:gridExtra" "package:grid" "package:maps"
## [10] "package:ggthemes" "package:ggmap" "package:ggridges"
## [13] "package:bindrcpp" "package:lubridate" "package:forcats"
## [16] "package:stringr" "package:dplyr" "package:purrr"
## [19] "package:readr" "package:tidyr" "package:tibble"
## [22] "package:ggplot2" "package:tidyverse" "tools:rstudio"
## [25] "package:stats" "package:graphics" "package:grDevices"
## [28] "package:utils" "package:datasets" "package:methods"
## [31] "Autoloads" "package:base"
You now see the 9 packages added to the search path (right after the global environment).
From which packages will R use the functions filter()
and lag()
?
dplyr has been loaded after stats (which R automatically loads in every session), so dplyr comes before stats in the search path (position 17 vs. 25 above). After R doesn't find the functions filter() and lag() in the global environment, it looks further along the search path and reaches dplyr first. As R is successful in finding the functions there, it will not continue searching elsewhere.
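Whatever the search path says, you can always pick a package explicitly with the pkg::fun() notation. A base-R sketch (it uses stats::filter(), so dplyr is not required to run it):

```r
# stats::filter() computes linear filters such as moving averages; after
# loading dplyr, a bare filter() call would reach dplyr's version first.
# The :: notation bypasses the search path entirely.
x <- stats::filter(1:5, c(0.5, 0.5), sides = 1)  # lagged 2-point mean
as.numeric(x)  # NA 1.5 2.5 3.5 4.5
```

The reverse, dplyr::filter(), likewise works even for packages that sit late in the search path.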
You remove a package from the search path using the function
detach("package:packagename")
or by unchecking the box next to the package name in the 'Packages' pane.
Via ?packagename (e.g. ?tidyverse) you get some more information on what the package does and sometimes a list of the functions available in the package or web links for further information. Many packages also come with longer worked introductions, so-called vignettes, which you open with vignette("packagename") or list with browseVignettes("packagename"):
vignette("dplyr")
browseVignettes("dplyr")
read_delim(): reads in files with any delimiter
read_csv(): comma-delimited files (.csv)
read_csv2(): semicolon-separated files (.csv) - common when the comma is used as decimal mark
read_tsv(): tab-delimited files (.txt files)
read_table(), read_fwf(), read_log(): whitespace-separated, fixed-width, and log files
Inline csv files are useful for experimenting and for creating reproducible examples:
read_csv("a,b,c
1,2,3
4,5,6")
## # A tibble: 2 x 3
## a b c
## <int> <int> <int>
## 1 1 2 3
## 2 4 5 6
You can skip the first n lines of metadata at the top of the file using skip = n:
read_csv("The first line of metadata
The second line of metadata
x,y,z
1,2,3", skip = 2)
## # A tibble: 1 x 3
## x y z
## <int> <int> <int>
## 1 1 2 3
Or use comment = "#" to drop all lines that start with (e.g.) #.
read_csv("# A comment to skip
x,y,z
1,2,3", comment = "#")
## # A tibble: 1 x 3
## x y z
## <int> <int> <int>
## 1 1 2 3
If you don't have column names set col_names = FALSE; R labels them sequentially from X1 to Xn:
read_csv("1,2,3
4,5,6", col_names = FALSE)
## # A tibble: 2 x 3
## X1 X2 X3
## <int> <int> <int>
## 1 1 2 3
## 2 4 5 6
You can also pass a character vector to col_names:
read_csv("1,2,3
4,5,6", col_names = c("x", "y", "z"))
## # A tibble: 2 x 3
## x y z
## <int> <int> <int>
## 1 1 2 3
## 2 4 5 6
readr functions guess the type of each column and convert types when appropriate (but will NOT automatically convert strings to factors). If you want to specify other types, use the col_*() functions in the col_types argument to guide parsing.
read_csv("your_file.csv", col_types = cols(
  a = col_integer(),
  b = col_character(),
  c = col_logical()
))
The argument na specifies the value (or values) that are used to represent missing values in your file (-999 or -9999 are typical placeholders for missing values).
read_csv("a,b,c
1,2,.", na = ".")
## # A tibble: 1 x 3
## a b c
## <int> <int> <chr>
## 1 1 2 <NA>
read_csv("a,b,c
1,-9999,2", na = "-9999")
## # A tibble: 1 x 3
## a b c
## <int> <chr> <int>
## 1 1 <NA> 2
For more details see vignette("tibble"). You coerce an existing data frame with tibble::as_tibble(your_dataframe) (NOTE: tidyverse uses underscores, not dots):
iris_tbl <- as_tibble(iris)
# Compare the difference:
class(iris)
## [1] "data.frame"
class(iris_tbl)
## [1] "tbl_df" "tbl" "data.frame"
As you can see, iris_tbl still inherits the data.frame class, but additionally has the tbl_df class.
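Because the data.frame class is retained, functions written for plain data frames keep working on tibbles. You can query class membership with inherits():

```r
# inherits() tests whether an object carries a given class
df <- data.frame(a = 1:3)
inherits(df, "data.frame")  # TRUE
inherits(df, "tbl_df")      # FALSE - plain data frames lack the tibble class
```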
You can also create a new tibble from individual vectors with tibble():
tibble(x = 1:5, y = 1, z = x ^ 2 + y)
## # A tibble: 5 x 3
## x y z
## <int> <dbl> <dbl>
## 1 1 1 2
## 2 2 1 5
## 3 3 1 10
## 4 4 1 17
## 5 5 1 26
Inputs of length 1 are automatically recycled; longer inputs must match the number of rows!
To show more (or fewer) rows and all columns, call print() and change the arguments:
print(iris_tbl, n = 2, width = Inf) # width = Inf shows all columns
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## # ... with 148 more rows
Cheat sheet is freely available at https://www.rstudio.com/resources/cheatsheets/
What function would you use to read a file where fields are separated with "|"?
You should find a solution without a hint. But look in the help documentation and play around with one of the previous inline csv file examples.
read_delim() has the extra argument delim where you can specify the character used to separate fields.
What function would you use if you generated a CSV file on your own computer?
Check which symbol your computer uses as decimal mark (you can see that in Excel).
Computers in Germany typically use a comma as decimal mark; hence, when you generate a CSV file in Excel, a semicolon will automatically be used as delimiter. In that case you should use read_csv2(), otherwise read_csv() is the appropriate function.
What arguments do read_delim() and read_csv() NOT have in common?
Identify what is wrong with each of the following inline CSV files. What happens when you run the code? (You'll find the solutions at the end of the presentation.)
read_csv("a,b
1,2,3
4,5,6")
read_csv("a,b,c
1,2
1,2,3,4")
read_csv("a,b
1,2
a,b")
read_csv("a;b
1;3")
Compare and contrast the following operations on a data.frame
and equivalent tibble
.
df <- data.frame(abc = 1, xyz = "a")
df$x
df[, "xyz"]
df[, c("abc", "xyz")]
What is different? Why might the default data frame behaviours cause you frustration?
If you have other types of files to import try one of the following packages:
hydro <- read_csv("data/1111473b.csv")
print(hydro, n = 5)
## # A tibble: 30,012 x 11
## Cruise Station Type `yyyy-mm-ddThh:mm` `Latitude [degr…
## <chr> <chr> <chr> <dttm> <dbl>
## 1 ???? 0247 B 2015-02-17 09:54:00 55
## 2 ???? 0247 B 2015-02-17 09:54:00 55
## 3 ???? 0247 B 2015-02-17 09:54:00 55
## 4 ???? 0247 B 2015-02-17 09:54:00 55
## 5 ???? 0247 B 2015-02-17 09:54:00 55
## # ... with 3.001e+04 more rows, and 6 more variables: `Longitude
## # [degrees_east]` <dbl>, `Bot. Depth [m]` <chr>, `PRES [db]` <dbl>,
## # `TEMP [deg C]` <dbl>, `PSAL [psu]` <dbl>, `DOXY [ml/l]` <dbl>
To make subsetting and data manipulation easier change the column names, e.g.
names(hydro) <- c("cruise", "station", "type", "date_time",
"lat", "long", "depth", "pres", "temp", "psal", "doxy")
hydro
. Do they match with what you've seen in the Editor?Subset the data to get only observations of Station "0613" and
Hints for question 3: You want ALL pres values that are >= 1 and <= 10 (not only the integer values 1, 2, 3, etc.!), and you have NAs in the temperature variable, which the default setting of the mean() function cannot handle --> check the help to see whether you can change some arguments!
Explanation
1. and 2: Either you subset first and then calculate the means:
hydro_sub1 <- hydro[hydro$station == "0613", ]
and mean(hydro_sub1$psal)
and
mean(hydro_sub1$doxy)
or you do both in one step:
mean(hydro$psal[hydro$station == "0613"])
and
mean(hydro$doxy[hydro$station == "0613"])
3. First subset by the range:
hydro_sub2 <- hydro[hydro$pres >= 1 & hydro$pres <= 10, ]
DON'T use hydro_sub2 <- hydro[hydro$pres %in% 1:10, ]
as this would filter only integers, excluding pres values such as 4.5!
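The difference is easy to verify on a toy vector (values made up for illustration):

```r
pres <- c(0.5, 2, 4.5, 10, 12.3)
pres %in% 1:10          # FALSE TRUE FALSE TRUE FALSE - integers only
pres >= 1 & pres <= 10  # FALSE TRUE TRUE  TRUE FALSE - keeps 4.5 as well
```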
Now handle the NAs, either directly in the mean()
function using the na.rm
argument:
mean(hydro_sub2$temp, na.rm = TRUE)
or by excluding all NAs manually: mean(hydro_sub2$temp[!is.na(hydro_sub2$temp)])
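Both routes give the same result, as this toy example (hypothetical temperature values) shows:

```r
temp <- c(4.2, NA, 5.0, 4.6)
mean(temp)                # NA - by default missing values propagate
mean(temp, na.rm = TRUE)  # 4.6
mean(temp[!is.na(temp)])  # 4.6 - manual exclusion, identical result
```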
You can save your tibble or data frame as an R object and load it later with save(your_tibble, file = "filename") and load("filename"):
save(your_subset, file = "My_first_object.R")
# Let's remove your subset and see what happens when we load it again
rm(your_subset)
your_subset
load(file = "My_first_object.R")
your_subset # now it should be back again
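A base-R alternative not covered above: saveRDS() stores a single object without its name, and readRDS() returns it, so you choose the name at load time. A sketch using a temporary file:

```r
tmp <- tempfile(fileext = ".rds")  # temporary path; use your own file name
saveRDS(iris, tmp)
iris_restored <- readRDS(tmp)      # returned, not restored under its old name
identical(iris_restored, iris)     # TRUE
```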
read_delim()
--> write_delim()
read_csv()
--> write_csv()
read_tsv()
--> write_tsv()
write_excel_csv()
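For comparison, the base-R counterpart utils::write.csv() writes an extra column of row names unless you switch it off, which is why row.names = FALSE was recommended earlier. A sketch with the built-in iris data and a temporary file:

```r
tmp <- tempfile(fileext = ".csv")
write.csv(head(iris, 2), tmp)                     # header starts with "" (row names)
readLines(tmp)[1]
write.csv(head(iris, 2), tmp, row.names = FALSE)  # like write_csv(): no row names
readLines(tmp)[1]
```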
With the subsets you created (or any other tibble/data frame in your workspace):
install.packages(), library(), require(), search(), detach(), vignette(), browseVignettes(),
read_delim(), read_csv(), read_csv2(), read_tsv(), read_table(), read_fwf(), read_log(),
tibble(), as_tibble(), print(), save(), load(),
write_delim(), write_csv(), write_tsv(), write_excel_csv()
Go thoroughly through the tasks and quizzes. Read chapters 10 (Tibbles) and 11 (Data import) in 'R for Data Science'.
Then try to import, explore, and export other datasets you have (e.g. from Excel).
Then go grab a coffee, lean back and enjoy the rest of the day...!
For more information contact me: saskia.otto@uni-hamburg.de
http://www.researchgate.net/profile/Saskia_Otto
http://www.github.com/saskiaotto
This work is licensed under a
Creative Commons Attribution-ShareAlike 4.0 International License, except for the
borrowed material, which is mentioned with its proper source statements.
Image on title and end slide: Section of an infrared satellite image showing the Larsen C
ice shelf on the Antarctic
Peninsula - USGS/NASA Landsat:
A Crack of Light in the Polar Dark, Landsat 8 - TIRS, June 17, 2017
(under CC0 license)
Example: The header has 1 element (= column) less than the data rows --> R then skips the 3rd element of each data row entirely (3 and 6 are no longer shown).
Example: The 1st data row has 1 element less than the header and the 2nd data row --> R automatically fills the missing element with an NA.
Example: The data rows have mixed data types --> R coerces all values to the more general character data type.
Example: Remember, the function read_csv() expects a comma as delimiter, NOT a semicolon --> R then reads each row as a single element. Try alternatively:
read_csv2("a;b
1;3")
For tibbles the complete column name is needed. This can be useful in case "x" doesn't exist but 2 other columns contain the letter x in their names. If you subset tibbles like a matrix ([row, col]) you will always get a tibble returned, never a vector (as data frames do in the 2nd example).
df_tbl <- as_tibble(df)
df_tbl$x
## NULL
df_tbl[, "xyz"]
## # A tibble: 1 x 1
## xyz
## <fct>
## 1 a
df_tbl[, c("abc", "xyz")]
## # A tibble: 1 x 2
## abc xyz
## <dbl> <fct>
## 1 1 a
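For completeness, the same three operations on the plain data.frame (run in a current R version, where strings stay character by default):

```r
df <- data.frame(abc = 1, xyz = "a")
df$x                   # partial matching silently returns the xyz column!
df[, "xyz"]            # drops to a bare vector, not a data frame
df[, c("abc", "xyz")]  # stays a data.frame (more than one column selected)
```

The silent partial matching and the silent dropping to a vector are the two behaviours that most often cause frustration; tibbles refuse both.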