6 Tutorial: Advanced data management
After working through Tutorial 6, you’ll…
- have learned some neat tricks that took me several years to figure out… :)
- … for example, soft-coding your working directory
- … using relative paths to navigate your folders
- … using advanced column selectors
Some parts of this tutorial are heavily inspired by Christian Burkhart’s book Going from Beginner to Advanced in the Tidyverse. This book is the next step in your journey to tidyverse mastery!
6.1 Stop hard-coding your working directory!
Hard-coding your working directory with
setwd("C:/Users/LaraK/Documents/IPR/")
comes with a significant
drawback: it is not reproducible on your colleagues’ machines! Your
colleagues may have different user names, run the script from different
source file paths, or have different folder structures.
To avoid these problems, you can use the rstudioapi
package. It
provides a set of functions that allow you to interact with RStudio’s
IDE (Integrated Development Environment) programmatically. First,
install and load the package:
Now you should open or create an R script file in RStudio that is
located in the desired folder that you want to become your working
directory. I will create a script in a folder called IPR
that rests
under the folder path: "C:/Users/LaraK/Documents/IPR/"
. Save that
script in your folder!
Now copy the following line of code in your R script and execute it:
By using the dirname()
function on
rstudioapi::getSourceEditorContext()$path
, we extract the directory
path of the currently active R script and pass it to setwd()
to
set it as the working directory. This approach also ensures that the
working directory is set to the correct folder, even if you rename or
move the R script within RStudio. Note : Remember to save the R
script in the desired folder before executing the setwd()
function
to ensure that it sets the correct working directory.
That’s really handy, isn’t it?
6.3 Reading multiple files into R
There are several common challenges when reading multiple files into R,
such as inconsistent column names and file paths, as well as selecting
only the necessary files. In this tutorial, I’ll teach you a convenient
solution by creating a character vector of file paths and reading them
into R using the read_*
functions.
First, you will need to set the folder "multiple_files"
as your
working directory.
Now you need to save the folder path of your working directory as a
character string. You can use getwd()
and save the result in an R
object called path
:
Next, create a character vector that contains the file names of all
files contained in your subfolder "multiple_files"
. You can look at
the file names of a folder by using the function list.files()
:
Now turn these names into a proper character vector of all the file
paths by adding additional arguments to the function list.files()
:
file_paths <- list.files(
path = path, # takes the path to the working directory / folder
pattern = "csv", # file should contain the string 'csv', but you could also opt for "WoJ_names"
full.names = TRUE
) # store the full file paths, not just the file names
file_paths
Finally, read the files using map()
: Use map()
from the purrr
package (part of the tidyverse
metapackage) to iterate over the file
paths and read each file using a function like read.csv()
(or
read_excel()
, etc.). The result will be a list of data frames.
data_list <- file_paths %>%
map(~ read.csv2(file = .x)) # use map to apply read.csv2 (semicolon-separated) to each individual file path, .x stands for each individual file path
data_list
However, we don’t want our data to be read in as separate files. We want
to combine them in one single file. We can use bind_rows()
from the
dpylr
package for that:
combined_data <- data_list %>% bind_rows() # bind all of your rows in your data list into one dataframe
That involved a great number of individual steps. Let’s now consolidate everything into a simpler pipeline and execute it:
path <- getwd()
file_paths <- list.files(
path = path,
pattern = "csv",
full.names = TRUE
)
combined_data <- file_paths %>%
map(~ read.csv2(file = .x)) %>%
bind_rows()
Note: If the files have different column names or structures, you may encounter issues when binding them together with this approach. Make sure the files have compatible structures or consider preprocessing the data, as shown in the next chapter.
6.3.1 Reading multiple files into R with different column names
Things can sometimes get messy. Please download the .csv files that are
contained in the folder multiple_files_messy
on LRZ Sync &
Share
and store them in a folder called multiple_files_messy
in your working
directory.
In this messy folder, the WoJ_names_1.csv
file has been altered. The
column name country
has been capitalized (i.e., changed to COUNTRY
)
, and the column containing names has been accidentally deleted. Let’s
attempt to reload the .csv files once more and set the column names to
lowercase dynamically. To handle the missing column, we will use
bind_rows
function:
setwd("./multiple_files_messy")
path <- getwd()
file_paths <- list.files(
path = path,
pattern = "csv",
full.names = TRUE
)
combined_data <- file_paths %>%
map(~ read.csv2(file = .x) %>%
setNames(tolower(names(.)))) %>%
bind_rows()
Tip for advanced users: The line setNames(tolower(names(.)))) %>%
is where the mismatch between COUNTRY and country is being solved. There are several other common mismatches between data files that you can solve using these lines of code:
1. Extraneous white spaces (leading/trailing spaces):
setNames(gsub("^\\s+|\\s+$", "", names(.))) %>%
2. Replace spaces with underscores:
setNames(gsub("\\s+", "_", names(.))) %>%
3. Remove special characters:
setNames(gsub("[^[:alnum:]_]", "", names(.))) %>%
4. Standardize certain column names:
If you know in advance that certain files might have deviations in column names, you can use a named vector of replacements. For this, you’d likely want to break the renaming process out of the setNames function for clarity:
name_corrections <- c("First Column" = "first_column", "Last Column" = "last_column")
corrected_names <- ifelse(names(.) %in% names(name_corrections), name_corrections[names(.)], names(.))
combined_data <- file_paths %>%
map(~ read.csv2(file = .x) %>%
setNames(corrected_names) %>%
bind_rows()
Neat! Note that you can always exchange the string in the pattern
argument of list.files
to include a regular expression. I can’t cover
regular expressions and the stringr
package in this tutorial, but if
you are interested in mastering regular expressions, you can have look
at my tutorial for our MA journalism
students.
6.4 Joining two datasets
We’ve already covered how to join multiple datasets while loading the data into R. This next approach joins datasets after they’ve been loaded into R. It’s the more common approach.
First things first, let’s load the libraries and data we’ll need. As we load this data, we’re doing an extra step in our pipeline. For every row of data, we’re adding a unique ID. This ID is simply the row’s position in the list (so, the first row is 1, the second is 2, and so on). We’re naming this unique ID as case_id.
# install.packages("tidycomm") # only the first time
library(tidycomm)
library(tidyverse)
WoJ <- tidycomm::WoJ %>%
mutate(case_id = row_number())
Next, we’ll split the WoJ dataset into two groups based on work experience: rookies and veterans. We’ll also reduce the numbers of rookies and veterans based on an arbitrary third variable to make the cases included in the datasets somewhat different from each other.
rookies <- WoJ %>%
filter(work_experience <= 12) %>%
filter(autonomy_selection > 1)
veterans <- WoJ %>%
filter(work_experience >= 10) %>%
filter(ethics_1 < 5)
With these restrictions set in place, these two datasets include 1,169 unique journalists. The rookies are made up of 433 journalists and the veterans make up 853. 117 journalists are being classified as BOTH rookie AND veteran, i.e. have been counted twice. This means we are dealing with: 433 + 853 - 117 = 1,169.
Now, let’s say you want to merge these datasets back together for some comparative analysis. dplyr
offers you different types of joins for this exact purpose. Let’s explore them!
6.4.1 inner_join
This is like an overly correct librarian. It only takes rows that have matching keys in both datasets, otherwise it will drop the rows.
## # A tibble: 117 × 31
## country.x reach.x employment.x temp_contract.x autonomy_selection.x
## <fct> <fct> <chr> <fct> <dbl>
## 1 Germany National Full-time Permanent 5
## 2 Denmark National Full-time Permanent 3
## 3 UK Regional Full-time Permanent 3
## 4 UK Transnational Full-time Permanent 4
## 5 UK Regional Full-time Permanent 4
## 6 Germany Local Full-time Permanent 4
## 7 UK Regional Full-time Permanent 4
## 8 Switzerland Regional Freelancer <NA> 2
## 9 Denmark Regional Full-time Temporary 3
## 10 Austria National Full-time Permanent 3
## # ℹ 107 more rows
## # ℹ 26 more variables: autonomy_emphasis.x <dbl>, ethics_1.x <dbl>,
## # ethics_2.x <dbl>, ethics_3.x <dbl>, ethics_4.x <dbl>,
## # work_experience.x <dbl>, trust_parliament.x <dbl>,
## # trust_government.x <dbl>, trust_parties.x <dbl>, trust_politicians.x <dbl>,
## # case_id <int>, country.y <fct>, reach.y <fct>, employment.y <chr>,
## # temp_contract.y <fct>, autonomy_selection.y <dbl>, …
6.4.2 left_join
This is like a fan of the first dataset. It takes all rows from the first dataset and the matched rows from the second dataset.
## # A tibble: 433 × 31
## country.x reach.x employment.x temp_contract.x autonomy_selection.x
## <fct> <fct> <chr> <fct> <dbl>
## 1 Germany National Full-time Permanent 5
## 2 Germany National Full-time Permanent 3
## 3 Switzerland Regional Full-time Permanent 4
## 4 Switzerland Local Part-time Permanent 4
## 5 Denmark National Full-time Permanent 3
## 6 Denmark National Full-time Permanent 2
## 7 Austria Local Full-time Permanent 5
## 8 UK Regional Full-time Permanent 3
## 9 UK Transnational Full-time Permanent 4
## 10 UK Transnational Full-time Permanent 4
## # ℹ 423 more rows
## # ℹ 26 more variables: autonomy_emphasis.x <dbl>, ethics_1.x <dbl>,
## # ethics_2.x <dbl>, ethics_3.x <dbl>, ethics_4.x <dbl>,
## # work_experience.x <dbl>, trust_parliament.x <dbl>,
## # trust_government.x <dbl>, trust_parties.x <dbl>, trust_politicians.x <dbl>,
## # case_id <int>, country.y <fct>, reach.y <fct>, employment.y <chr>,
## # temp_contract.y <fct>, autonomy_selection.y <dbl>, …
6.4.3 right_join
This is like a fan of the second dataset. It takes all rows from the second dataset and the matched rows from the first dataset.
right_joined <- right_join(rookies, veterans, by = "case_id")
# Pro tip: you can use the argument `by = your_column_here` to match the two datsets based on a specific column, e.g. case id
right_joined
## # A tibble: 853 × 31
## country.x reach.x employment.x temp_contract.x autonomy_selection.x
## <fct> <fct> <chr> <fct> <dbl>
## 1 Germany National Full-time Permanent 5
## 2 Denmark National Full-time Permanent 3
## 3 UK Regional Full-time Permanent 3
## 4 UK Transnational Full-time Permanent 4
## 5 UK Regional Full-time Permanent 4
## 6 Germany Local Full-time Permanent 4
## 7 UK Regional Full-time Permanent 4
## 8 Switzerland Regional Freelancer <NA> 2
## 9 Denmark Regional Full-time Temporary 3
## 10 Austria National Full-time Permanent 3
## # ℹ 843 more rows
## # ℹ 26 more variables: autonomy_emphasis.x <dbl>, ethics_1.x <dbl>,
## # ethics_2.x <dbl>, ethics_3.x <dbl>, ethics_4.x <dbl>,
## # work_experience.x <dbl>, trust_parliament.x <dbl>,
## # trust_government.x <dbl>, trust_parties.x <dbl>, trust_politicians.x <dbl>,
## # case_id <int>, country.y <fct>, reach.y <fct>, employment.y <chr>,
## # temp_contract.y <fct>, autonomy_selection.y <dbl>, …
6.4.4 full_join
This is like a hoarder. It takes all rows and throws in missings for variables that are not part of both datasets.
full_joined <- full_join(rookies, veterans, by = "case_id")
# Pro tip: you can use the argument `by = your_column_here` to match the two datsets based on a specific column, e.g. case id
full_joined
## # A tibble: 1,169 × 31
## country.x reach.x employment.x temp_contract.x autonomy_selection.x
## <fct> <fct> <chr> <fct> <dbl>
## 1 Germany National Full-time Permanent 5
## 2 Germany National Full-time Permanent 3
## 3 Switzerland Regional Full-time Permanent 4
## 4 Switzerland Local Part-time Permanent 4
## 5 Denmark National Full-time Permanent 3
## 6 Denmark National Full-time Permanent 2
## 7 Austria Local Full-time Permanent 5
## 8 UK Regional Full-time Permanent 3
## 9 UK Transnational Full-time Permanent 4
## 10 UK Transnational Full-time Permanent 4
## # ℹ 1,159 more rows
## # ℹ 26 more variables: autonomy_emphasis.x <dbl>, ethics_1.x <dbl>,
## # ethics_2.x <dbl>, ethics_3.x <dbl>, ethics_4.x <dbl>,
## # work_experience.x <dbl>, trust_parliament.x <dbl>,
## # trust_government.x <dbl>, trust_parties.x <dbl>, trust_politicians.x <dbl>,
## # case_id <int>, country.y <fct>, reach.y <fct>, employment.y <chr>,
## # temp_contract.y <fct>, autonomy_selection.y <dbl>, …
6.4.5 Join without .x and .y suffixes
You might have noticed that when you’re using dplyr to join two datasets, .x and .y suffixes are added to any columns that have the same names in both datasets. For example, if we join using the case_id
variable, all other variables in the resulting combined dataset will be named something like country.x
and country.y
. The .x and .y suffixes differentiate the country columns from the rookies
dataset and the veterans
dataset, respectively.
However, if you’re certain that columns with the same name in both datasets also contain the same values, or if you simply don’t want to retain both versions in the output, there are a couple of approaches to prevent or remove the duplicate columns with the .x and .y suffixes:
1. Using the by
argument:
The by
argument in the join
functions lets you specify which columns to join by. By creating a character vector of the column names in your data frames, you can provide this column name vector to the by
argument. This way, you’re specifying to join on all columns in your data frames. Since the column names and values are identical in both data frames, dplyr
won’t create the duplicate columns with .x and .y suffixes.
column_names <- colnames(rookies) # colnames() grabs the column names from the rookies data frame and turns them into a character vector
full_joined <- full_join(rookies, veterans, by = column_names)
full_joined
## # A tibble: 1,169 × 16
## country reach employment temp_contract autonomy_selection autonomy_emphasis
## <fct> <fct> <chr> <fct> <dbl> <dbl>
## 1 Germany Nati… Full-time Permanent 5 4
## 2 Germany Nati… Full-time Permanent 3 4
## 3 Switzerl… Regi… Full-time Permanent 4 4
## 4 Switzerl… Local Part-time Permanent 4 5
## 5 Denmark Nati… Full-time Permanent 3 3
## 6 Denmark Nati… Full-time Permanent 2 4
## 7 Austria Local Full-time Permanent 5 5
## 8 UK Regi… Full-time Permanent 3 4
## 9 UK Tran… Full-time Permanent 4 5
## 10 UK Tran… Full-time Permanent 4 4
## # ℹ 1,159 more rows
## # ℹ 10 more variables: ethics_1 <dbl>, ethics_2 <dbl>, ethics_3 <dbl>,
## # ethics_4 <dbl>, work_experience <dbl>, trust_parliament <dbl>,
## # trust_government <dbl>, trust_parties <dbl>, trust_politicians <dbl>,
## # case_id <int>
You receive only 16 columns in your combined data set. This is the most elegant solution.
2. Renaming .x columns and dropping .y columns: Using the knowledge that you’ll acquire in the next few tutorials, you could also rename all the .x columns in your combined data set and delete all .y columns.
inner_joined <- inner_joined %>%
rename_with(~ gsub("\\.x$", "", .), everything()) %>% # replaces all column endings ".x" with nothing: ""
select(-matches("\\.y$")) # deselects all columns that end with ".y"
inner_joined
## # A tibble: 117 × 16
## country reach employment temp_contract autonomy_selection autonomy_emphasis
## <fct> <fct> <chr> <fct> <dbl> <dbl>
## 1 Germany Nati… Full-time Permanent 5 4
## 2 Denmark Nati… Full-time Permanent 3 3
## 3 UK Regi… Full-time Permanent 3 4
## 4 UK Tran… Full-time Permanent 4 4
## 5 UK Regi… Full-time Permanent 4 4
## 6 Germany Local Full-time Permanent 4 5
## 7 UK Regi… Full-time Permanent 4 4
## 8 Switzerl… Regi… Freelancer <NA> 2 3
## 9 Denmark Regi… Full-time Temporary 3 3
## 10 Austria Nati… Full-time Permanent 3 4
## # ℹ 107 more rows
## # ℹ 10 more variables: ethics_1 <dbl>, ethics_2 <dbl>, ethics_3 <dbl>,
## # ethics_4 <dbl>, work_experience <dbl>, trust_parliament <dbl>,
## # trust_government <dbl>, trust_parties <dbl>, trust_politicians <dbl>,
## # case_id <int>
You are left with only 16 columns.
3. Drop the columns from one data frame before joining: This is the most basic and error-prone, but quickest approach.
veterans <- veterans %>%
select(case_id) # drop all columns except case_id
left_joined <- left_join(rookies, veterans, by = "case_id")
left_joined
## # A tibble: 433 × 16
## country reach employment temp_contract autonomy_selection autonomy_emphasis
## <fct> <fct> <chr> <fct> <dbl> <dbl>
## 1 Germany Nati… Full-time Permanent 5 4
## 2 Germany Nati… Full-time Permanent 3 4
## 3 Switzerl… Regi… Full-time Permanent 4 4
## 4 Switzerl… Local Part-time Permanent 4 5
## 5 Denmark Nati… Full-time Permanent 3 3
## 6 Denmark Nati… Full-time Permanent 2 4
## 7 Austria Local Full-time Permanent 5 5
## 8 UK Regi… Full-time Permanent 3 4
## 9 UK Tran… Full-time Permanent 4 5
## 10 UK Tran… Full-time Permanent 4 4
## # ℹ 423 more rows
## # ℹ 10 more variables: ethics_1 <dbl>, ethics_2 <dbl>, ethics_3 <dbl>,
## # ethics_4 <dbl>, work_experience <dbl>, trust_parliament <dbl>,
## # trust_government <dbl>, trust_parties <dbl>, trust_politicians <dbl>,
## # case_id <int>
6.5 Reformatting datasets
In this subtutorial, we will learn how to use pivot_longer()
and
pivot_wider()
functions from the tidyr
package. We will create a
fictional dataset based on the Worlds of Journalism (WoJ) dataset, with
a twist - it will have several panel waves. Imagine that the WoJ dataset
has been expanded to include data from different years:
set.seed(123)
WoJ_panel <- WoJ %>%
slice_sample(n = 10) %>%
select(case_id, work_experience, autonomy_selection) %>%
mutate(
work_experience_w2018 = work_experience + sample(1:1, 10, replace = TRUE),
work_experience_w2019 = work_experience_w2018 + sample(1:1, 10, replace = TRUE),
autonomy_selection_w2018 = autonomy_selection + sample(-1:1, 10, replace = TRUE),
autonomy_selection_w2019 = autonomy_selection_w2018 + sample(-1:1, 10, replace = TRUE)
) %>%
select(-c(work_experience, autonomy_selection))
WoJ_panel
## # A tibble: 10 × 5
## case_id work_experience_w2018 work_experience_w2019 autonomy_selection_w2018
## <int> <dbl> <dbl> <dbl>
## 1 415 24 25 3
## 2 463 5 6 4
## 3 179 26 27 4
## 4 526 38 39 3
## 5 195 13 14 5
## 6 938 26 27 4
## 7 1142 4 5 3
## 8 1038 10 11 6
## 9 665 16 17 2
## 10 602 3 4 1
## # ℹ 1 more variable: autonomy_selection_w2019 <dbl>
6.5.1 pivot_longer()
Let’s say you want to analyze how work experience and autonomy selection change over time. For this, you need to reshape the dataset into a longer format.
long_data <- WoJ_panel %>%
pivot_longer(
cols = contains("_w"),
names_to = c("variable", "wave"),
names_pattern = "(.*)_(w\\d+)", # the symbol starts with any symbol that repeats 0 to several times until a _ is reached, followed by a "w" and at least one digit
values_to = "value"
)
long_data
## # A tibble: 40 × 4
## case_id variable wave value
## <int> <chr> <chr> <dbl>
## 1 415 work_experience w2018 24
## 2 415 work_experience w2019 25
## 3 415 autonomy_selection w2018 3
## 4 415 autonomy_selection w2019 3
## 5 463 work_experience w2018 5
## 6 463 work_experience w2019 6
## 7 463 autonomy_selection w2018 4
## 8 463 autonomy_selection w2019 5
## 9 179 work_experience w2018 26
## 10 179 work_experience w2019 27
## # ℹ 30 more rows
This code transforms the dataset into a longer format where each row represents a single observation for a variable at a specific wave.
6.5.2 pivot_wider()
Now, let’s say you want to do the opposite. You have a long dataset and you want to reshape it into a wider format.
wide_data <- long_data %>%
pivot_wider(
names_from = c("variable", "wave"),
names_sep = "_",
values_from = "value"
)
wide_data
## # A tibble: 10 × 5
## case_id work_experience_w2018 work_experience_w2019 autonomy_selection_w2018
## <int> <dbl> <dbl> <dbl>
## 1 415 24 25 3
## 2 463 5 6 4
## 3 179 26 27 4
## 4 526 38 39 3
## 5 195 13 14 5
## 6 938 26 27 4
## 7 1142 4 5 3
## 8 1038 10 11 6
## 9 665 16 17 2
## 10 602 3 4 1
## # ℹ 1 more variable: autonomy_selection_w2019 <dbl>
6.6 Advanced column selectors
Tidyselect
is a package within the tidyverse
that provides a set of
functions for selecting and manipulating columns in a data frame. It is
mainly used internally by other tidyverse packages, such as dplyr
and
tidyr
, to handle column selections. Tidyselect functions, such as
starts_with()
, ends_with()
, and contains()
, allow you to specify
column selections based on various patterns, such as prefixes, suffixes,
and substrings. In this tutorial, we will look at the following tidyselect
functions:
- last_col()
- starts_with()
- ends_with()
- contains()
- matches()
- num_range()
- where()
Let’s first import the full Worlds of Journalism
dataset from the
tidycomm
package:
6.6.1 Selecting the last column
Use last_col()
when you want to select the last column of a data frame
regardless of its name.
## # A tibble: 1,200 × 1
## trust_politicians
## <dbl>
## 1 3
## 2 3
## 3 3
## 4 3
## 5 2
## 6 2
## 7 2
## 8 3
## 9 1
## 10 3
## # ℹ 1,190 more rows
You can also select all columns except the last one:
## # A tibble: 1,200 × 14
## country reach employment temp_contract autonomy_selection autonomy_emphasis
## <fct> <fct> <chr> <fct> <dbl> <dbl>
## 1 Germany Nati… Full-time Permanent 5 4
## 2 Germany Nati… Full-time Permanent 3 4
## 3 Switzerl… Regi… Full-time Permanent 4 4
## 4 Switzerl… Local Part-time Permanent 4 5
## 5 Austria Nati… Part-time Permanent 4 4
## 6 Switzerl… Local Freelancer <NA> 4 4
## 7 Germany Local Full-time Permanent 4 4
## 8 Denmark Nati… Full-time Permanent 3 3
## 9 Switzerl… Local Full-time Permanent 5 5
## 10 Denmark Nati… Full-time Permanent 2 4
## # ℹ 1,190 more rows
## # ℹ 8 more variables: ethics_1 <dbl>, ethics_2 <dbl>, ethics_3 <dbl>,
## # ethics_4 <dbl>, work_experience <dbl>, trust_parliament <dbl>,
## # trust_government <dbl>, trust_parties <dbl>
6.6.2 Selecting columns that start with, end with or contain a specific string
starts_with()
selects columns that start with a specific prefix:
## # A tibble: 1,200 × 4
## ethics_1 ethics_2 ethics_3 ethics_4
## <dbl> <dbl> <dbl> <dbl>
## 1 2 3 2 1
## 2 1 2 2 1
## 3 2 4 2 1
## 4 1 3 1 2
## 5 2 3 2 1
## 6 2 4 4 3
## 7 1 3 2 2
## 8 2 4 4 4
## 9 1 2 1 3
## 10 1 4 4 4
## # ℹ 1,190 more rows
ends_with()
selects columns that end with one or more specific
suffixes:
## # A tibble: 1,200 × 2
## ethics_1 ethics_2
## <dbl> <dbl>
## 1 2 3
## 2 1 2
## 3 2 4
## 4 1 3
## 5 2 3
## 6 2 4
## 7 1 3
## 8 2 4
## 9 1 2
## 10 1 4
## # ℹ 1,190 more rows
contains()
selects columns that contain a specific substring:
## # A tibble: 1,200 × 12
## temp_contract autonomy_selection autonomy_emphasis ethics_1 ethics_2 ethics_3
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Permanent 5 4 2 3 2
## 2 Permanent 3 4 1 2 2
## 3 Permanent 4 4 2 4 2
## 4 Permanent 4 5 1 3 1
## 5 Permanent 4 4 2 3 2
## 6 <NA> 4 4 2 4 4
## 7 Permanent 4 4 1 3 2
## 8 Permanent 3 3 2 4 4
## 9 Permanent 5 5 1 2 1
## 10 Permanent 2 4 1 4 4
## # ℹ 1,190 more rows
## # ℹ 6 more variables: ethics_4 <dbl>, work_experience <dbl>,
## # trust_parliament <dbl>, trust_government <dbl>, trust_parties <dbl>,
## # trust_politicians <dbl>
matches()
selects columns based on a regular expression pattern:
## # A tibble: 1,200 × 4
## ethics_1 ethics_2 ethics_3 ethics_4
## <dbl> <dbl> <dbl> <dbl>
## 1 2 3 2 1
## 2 1 2 2 1
## 3 2 4 2 1
## 4 1 3 1 2
## 5 2 3 2 1
## 6 2 4 4 3
## 7 1 3 2 2
## 8 2 4 4 4
## 9 1 2 1 3
## 10 1 4 4 4
## # ℹ 1,190 more rows
## # A tibble: 1,200 × 4
## ethics_1 ethics_2 ethics_3 ethics_4
## <dbl> <dbl> <dbl> <dbl>
## 1 2 3 2 1
## 2 1 2 2 1
## 3 2 4 2 1
## 4 1 3 1 2
## 5 2 3 2 1
## 6 2 4 4 3
## 7 1 3 2 2
## 8 2 4 4 4
## 9 1 2 1 3
## 10 1 4 4 4
## # ℹ 1,190 more rows
Note: All of the above functions, except matches()
are not case-sensetive. This means that they will match both upper- and
lowercase letters. If you want to make them case-sensetive, use the ignore.case
argument and set it to FALSE.
WoJ %>% select(starts_with("Ethics", ignore.case = FALSE)) # but there is no capitalized Ethics in the data
## # A tibble: 1,200 × 0
6.6.3 Selecting columns based on a numeric range
You can selects columns based on a numerical range with num_range()
:
## # A tibble: 1,200 × 2
## ethics_1 ethics_2
## <dbl> <dbl>
## 1 2 3
## 2 1 2
## 3 2 4
## 4 1 3
## 5 2 3
## 6 2 4
## 7 1 3
## 8 2 4
## 9 1 2
## 10 1 4
## # ℹ 1,190 more rows
6.6.4 Selecting columns based on their type
You can selects columns based on their type with where()
, e.g., all
character columns:
## # A tibble: 1,200 × 1
## employment
## <chr>
## 1 Full-time
## 2 Full-time
## 3 Full-time
## 4 Part-time
## 5 Part-time
## 6 Freelancer
## 7 Full-time
## 8 Full-time
## 9 Full-time
## 10 Full-time
## # ℹ 1,190 more rows
Or you could select all columns that contain numbers:
## # A tibble: 1,200 × 11
## autonomy_selection autonomy_emphasis ethics_1 ethics_2 ethics_3 ethics_4
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5 4 2 3 2 1
## 2 3 4 1 2 2 1
## 3 4 4 2 4 2 1
## 4 4 5 1 3 1 2
## 5 4 4 2 3 2 1
## 6 4 4 2 4 4 3
## 7 4 4 1 3 2 2
## 8 3 3 2 4 4 4
## 9 5 5 1 2 1 3
## 10 2 4 1 4 4 4
## # ℹ 1,190 more rows
## # ℹ 5 more variables: work_experience <dbl>, trust_parliament <dbl>,
## # trust_government <dbl>, trust_parties <dbl>, trust_politicians <dbl>
You could also use:
- is.integer (whole numbers, e.g., 7)
- is.double (decimal numbers, e.g., 7.25)
- is.logical (e.g., TRUE/FALSE)
- is.factor (e.g., “great”, “greater”, “the greatest”)
6.6.5 Combine selectors
You can combine selectors, of course. Let’s select all character columns and columns that start with “autonomy”.
## # A tibble: 1,200 × 3
## employment autonomy_selection autonomy_emphasis
## <chr> <dbl> <dbl>
## 1 Full-time 5 4
## 2 Full-time 3 4
## 3 Full-time 4 4
## 4 Part-time 4 5
## 5 Part-time 4 4
## 6 Freelancer 4 4
## 7 Full-time 4 4
## 8 Full-time 3 3
## 9 Full-time 5 5
## 10 Full-time 2 4
## # ℹ 1,190 more rows
6.6.6 Reorder your columns with everything()
Finally, you can select all columns with everything()
. This allows
you to reorder your columns quickly:
## # A tibble: 1,200 × 15
## temp_contract country reach employment autonomy_selection autonomy_emphasis
## <fct> <fct> <fct> <chr> <dbl> <dbl>
## 1 Permanent Germany Nati… Full-time 5 4
## 2 Permanent Germany Nati… Full-time 3 4
## 3 Permanent Switzerl… Regi… Full-time 4 4
## 4 Permanent Switzerl… Local Part-time 4 5
## 5 Permanent Austria Nati… Part-time 4 4
## 6 <NA> Switzerl… Local Freelancer 4 4
## 7 Permanent Germany Local Full-time 4 4
## 8 Permanent Denmark Nati… Full-time 3 3
## 9 Permanent Switzerl… Local Full-time 5 5
## 10 Permanent Denmark Nati… Full-time 2 4
## # ℹ 1,190 more rows
## # ℹ 9 more variables: ethics_1 <dbl>, ethics_2 <dbl>, ethics_3 <dbl>,
## # ethics_4 <dbl>, work_experience <dbl>, trust_parliament <dbl>,
## # trust_government <dbl>, trust_parties <dbl>, trust_politicians <dbl>
6.7 Advanced selection of rows
You can make very advanced row selections by using the function
str_detect
from the stringr
package in combination with filter
from the dplyr
package, especially if you combine str_detect
with
regular expressions.
Let’s start with partial matching, i.e., when your pattern is part of a string, select it:
## # A tibble: 172 × 15
## country reach employment temp_contract autonomy_selection autonomy_emphasis
## <fct> <fct> <chr> <fct> <dbl> <dbl>
## 1 Switzerl… Local Freelancer <NA> 4 4
## 2 Denmark Nati… Freelancer <NA> 4 4
## 3 Denmark Nati… Freelancer <NA> 5 5
## 4 UK Tran… Freelancer <NA> 4 4
## 5 Germany Regi… Freelancer <NA> 2 4
## 6 Denmark Nati… Freelancer <NA> 4 4
## 7 Denmark Nati… Freelancer <NA> 2 3
## 8 Denmark Nati… Freelancer <NA> 4 3
## 9 Austria Nati… Freelancer <NA> 5 5
## 10 Denmark Regi… Freelancer <NA> 4 4
## # ℹ 162 more rows
## # ℹ 9 more variables: ethics_1 <dbl>, ethics_2 <dbl>, ethics_3 <dbl>,
## # ethics_4 <dbl>, work_experience <dbl>, trust_parliament <dbl>,
## # trust_government <dbl>, trust_parties <dbl>, trust_politicians <dbl>
Note that str_detect is case-sensitive, i.e., it reacts sensitive to capitalization. You won’t get a match for “free”:
## # A tibble: 0 × 15
## # ℹ 15 variables: country <fct>, reach <fct>, employment <chr>,
## # temp_contract <fct>, autonomy_selection <dbl>, autonomy_emphasis <dbl>,
## # ethics_1 <dbl>, ethics_2 <dbl>, ethics_3 <dbl>, ethics_4 <dbl>,
## # work_experience <dbl>, trust_parliament <dbl>, trust_government <dbl>,
## # trust_parties <dbl>, trust_politicians <dbl>
In regex language, you can ask str_detectto
look for lowercase and
uppercase matches, when you are unsure of whether the entries are
capitalized in your dataset:
## # A tibble: 172 × 15
## country reach employment temp_contract autonomy_selection autonomy_emphasis
## <fct> <fct> <chr> <fct> <dbl> <dbl>
## 1 Switzerl… Local Freelancer <NA> 4 4
## 2 Denmark Nati… Freelancer <NA> 4 4
## 3 Denmark Nati… Freelancer <NA> 5 5
## 4 UK Tran… Freelancer <NA> 4 4
## 5 Germany Regi… Freelancer <NA> 2 4
## 6 Denmark Nati… Freelancer <NA> 4 4
## 7 Denmark Nati… Freelancer <NA> 2 3
## 8 Denmark Nati… Freelancer <NA> 4 3
## 9 Austria Nati… Freelancer <NA> 5 5
## 10 Denmark Regi… Freelancer <NA> 4 4
## # ℹ 162 more rows
## # ℹ 9 more variables: ethics_1 <dbl>, ethics_2 <dbl>, ethics_3 <dbl>,
## # ethics_4 <dbl>, work_experience <dbl>, trust_parliament <dbl>,
## # trust_government <dbl>, trust_parties <dbl>, trust_politicians <dbl>
You can also ask for matches that end on your pattern:
## # A tibble: 1,028 × 15
## country reach employment temp_contract autonomy_selection autonomy_emphasis
## <fct> <fct> <chr> <fct> <dbl> <dbl>
## 1 Germany Nati… Full-time Permanent 5 4
## 2 Germany Nati… Full-time Permanent 3 4
## 3 Switzerl… Regi… Full-time Permanent 4 4
## 4 Switzerl… Local Part-time Permanent 4 5
## 5 Austria Nati… Part-time Permanent 4 4
## 6 Germany Local Full-time Permanent 4 4
## 7 Denmark Nati… Full-time Permanent 3 3
## 8 Switzerl… Local Full-time Permanent 5 5
## 9 Denmark Nati… Full-time Permanent 2 4
## 10 Austria Local Full-time Permanent 5 5
## # ℹ 1,018 more rows
## # ℹ 9 more variables: ethics_1 <dbl>, ethics_2 <dbl>, ethics_3 <dbl>,
## # ethics_4 <dbl>, work_experience <dbl>, trust_parliament <dbl>,
## # trust_government <dbl>, trust_parties <dbl>, trust_politicians <dbl>
I can’t cover regular expressions and the stringr
package in this
tutorial, but if you are interested in mastering regular expressions,
you can have a look at my tutorial for our MA journalism
students.
6.8 Take-Aways
- Working directory: Avoid hard-coding your working directory as
it is not reproducible on different machines. Instead, use the
rstudioapi
package to set the working directory dynamically based on the location of the R script. - Use relative paths: Use relative paths to navigate your working
directory. They describe how to navigate through folders and
subfolders starting from your current location. Use
./
to move down a folder and../
to move up one level in the folder hierarchy. - Reading multiple files: When reading multiple files into R,
create a character vector of file paths using
list.files()
. Use themap()
function from thepurrr
package to iterate over the file paths and read each file using functions likeread.csv()
. Combine the resulting data frames usingbind_rows()
from thedplyr
package. - Advanced column selectors: Use advanced column selectors from
the
tidyselect
package indplyr::select()
to select and manipulate columns in a data frame. Examples includelast_col()
,starts_with()
,ends_with()
,contains()
,matches()
,num_range()
, andwhere()
inselect()
. - Reorder columns: using
everything()
inselect()
to select all columns and place them in a desired order. - Advanced row selectors: Use st
r_detect()
from thestringr
package indplyr::filter()
to filter observations based on a pattern or regular expression.
You still have questions? The following book can help you with that: