6 Tutorial: Advanced data management

After working through Tutorial 6, you’ll…

  • have learned some neat tricks that took me several years to figure out… :)
  • … for example, soft-coding your working directory
  • … using relative paths to navigate your folders
  • … using advanced column selectors

Some parts of this tutorial are heavily inspired by Christian Burkhart’s book Going from Beginner to Advanced in the Tidyverse. This book is the next step in your journey to tidyverse mastery!

6.1 Stop hard-coding your working directory!

Hard-coding your working directory with setwd("C:/Users/LaraK/Documents/IPR/") comes with a significant drawback: it is not reproducible on your colleagues’ machines! Your colleagues may have different user names, run the script from different source file paths, or have different folder structures.

To avoid these problems, you can use the rstudioapi package. It provides a set of functions that allow you to interact with RStudio’s IDE (Integrated Development Environment) programmatically. First, install and load the package:

# install.packages("rstudioapi") # run only the first time
library(rstudioapi)

Now you should open or create an R script file in RStudio that is located in the desired folder that you want to become your working directory. I will create a script in a folder called IPR that rests under the folder path: "C:/Users/LaraK/Documents/IPR/". Save that script in your folder!

Now copy the following line of code in your R script and execute it:

setwd(dirname(rstudioapi::getSourceEditorContext()$path))

By using the dirname() function on rstudioapi::getSourceEditorContext()$path, we extract the directory path of the currently active R script and pass it to setwd() to set it as the working directory. This approach also ensures that the working directory is set to the correct folder, even if you rename or move the R script within RStudio. Note : Remember to save the R script in the desired folder before executing the setwd() function to ensure that it sets the correct working directory.

That’s really handy, isn’t it?

6.2 Start navigating your working directory like a pro!

6.2.1 The concept of relative paths

If you have subfolders in your working directory or when your working directory itself is a subfolder of a more high-level folder, you can start navigating in it by using relative paths. A relative path is like giving directions from your current location (e.g.: “When you leave the IfKW, you turn right on the Oettingenstraße. Walk down the road until you see a crossroad. Turn left and then you can see the Bayrische Eismanufaktur!”). It’s the contrary of providing an absolute path, which means providing the full address of a place (e.g.: “You can find the Bayrische Eismanufaktur at Oettingenstr. 42a – 80538 München, longitude: 48.14646612168593, latitude: 11.593379619133687”).

In R, a relative path specifies the path to a folder or file relative to your current working directory in RStudio. It doesn’t include the complete path from the root of your file system. Instead, it describes how to navigate through the folders and subfolders starting from your current location. Relative paths are flexible and can adapt to different working directories.

For example, if your current working directory is "C:/Users/LaraK/Documents/IPR/", a relative path to a file named WoJ_names_1.csv in a subfolder named multiple_files would be multiple_files/WoJ_names_1.csv.

6.2.2 Moving up and down folders by using relative paths

Please download the .csv files that are contained in the folder multiple_files on LRZ Sync & Share (click here) and store them in a subfolder called multiple_files in your working directory.

To move down a folder, you can use the ./ notation. The ./ represents the child directory of the current working directory that you want to navigate to. For example, if you’re in the folder "C:/Users/LaraK/Documents/IPR/" and you want to move your working directory down to the "multiple_files" subfolder, you can use the following code:

setwd("./multiple_files")

To move up one level in the folder hierarchy again, you can use the ../ notation. The ../ represents the parent directory of the current working directory. You can use this notation with the setwd() function to navigate up the folder again:

setwd("../")

6.3 Reading multiple files into R

There are several common challenges when reading multiple files into R, such as inconsistent column names and file paths, as well as selecting only the necessary files. In this tutorial, I’ll teach you a convenient solution by creating a character vector of file paths and reading them into R using the read_* functions.

First, you will need to set the folder "multiple_files" as your working directory.

setwd("./multiple_files")

Now you need to save the folder path of your working directory as a character string. You can use getwd() and save the result in an R object called path:

path <- getwd()
path

Next, create a character vector that contains the file names of all files contained in your subfolder "multiple_files". You can look at the file names of a folder by using the function list.files():

file_names <- list.files(path)
file_names

Now turn these names into a proper character vector of all the file paths by adding additional arguments to the function list.files():

file_paths <- list.files(
  path = path, # takes the path to the working directory / folder
  pattern = "csv", # file should contain the string 'csv', but you could also opt for "WoJ_names"
  full.names = TRUE
) # store the full file paths, not just the file names
file_paths

Finally, read the files using map(): Use map() from the purrr package (part of the tidyverse metapackage) to iterate over the file paths and read each file using a function like read.csv() (or read_excel(), etc.). The result will be a list of data frames.

data_list <- file_paths %>%
  map(~ read.csv2(file = .x)) # use map to apply read.csv2 (semicolon-separated) to each individual file path, .x stands for each individual file path
data_list

However, we don’t want our data to be read in as separate files. We want to combine them in one single file. We can use bind_rows() from the dpylr package for that:

combined_data <- data_list %>% bind_rows() # bind all of your rows in your data list into one dataframe

That involved a great number of individual steps. Let’s now consolidate everything into a simpler pipeline and execute it:

path <- getwd()
file_paths <- list.files(
  path = path,
  pattern = "csv",
  full.names = TRUE
)

combined_data <- file_paths %>%
  map(~ read.csv2(file = .x)) %>%
  bind_rows()

Note: If the files have different column names or structures, you may encounter issues when binding them together with this approach. Make sure the files have compatible structures or consider preprocessing the data, as shown in the next chapter.

6.3.1 Reading multiple files into R with different column names

Things can sometimes get messy. Please download the .csv files that are contained in the folder multiple_files_messy on LRZ Sync & Share and store them in a folder called multiple_files_messyin your working directory.

In this messy folder, the WoJ_names_1.csv file has been altered. The column name country has been capitalized (i.e., changed to COUNTRY) , and the column containing names has been accidentally deleted. Let’s attempt to reload the .csv files once more and set the column names to lowercase dynamically. To handle the missing column, we will use bind_rows function:

setwd("./multiple_files_messy")

path <- getwd()
file_paths <- list.files(
  path = path,
  pattern = "csv",
  full.names = TRUE
)

combined_data <- file_paths %>%
  map(~ read.csv2(file = .x) %>%
    setNames(tolower(names(.)))) %>%
  bind_rows()

Tip for advanced users: The line setNames(tolower(names(.)))) %>% is where the mismatch between COUNTRY and country is being solved. There are several other common mismatches between data files that you can solve using these lines of code:

1. Extraneous white spaces (leading/trailing spaces): setNames(gsub("^\\s+|\\s+$", "", names(.))) %>%

2. Replace spaces with underscores: setNames(gsub("\\s+", "_", names(.))) %>%

3. Remove special characters: setNames(gsub("[^[:alnum:]_]", "", names(.))) %>%

4. Standardize certain column names: If you know in advance that certain files might have deviations in column names, you can use a named vector of replacements. For this, you’d likely want to break the renaming process out of the setNames function for clarity: name_corrections <- c("First Column" = "first_column", "Last Column" = "last_column") corrected_names <- ifelse(names(.) %in% names(name_corrections), name_corrections[names(.)], names(.))

combined_data <- file_paths %>% map(~ read.csv2(file = .x) %>% setNames(corrected_names) %>% bind_rows()

Neat! Note that you can always exchange the string in the pattern argument of list.files to include a regular expression. I can’t cover regular expressions and the stringr package in this tutorial, but if you are interested in mastering regular expressions, you can have look at my tutorial for our MA journalism students.

6.4 Joining two datasets

We’ve already covered how to join multiple datasets while loading the data into R. This next approach joins datasets after they’ve been loaded into R. It’s the more common approach.

First things first, let’s load the libraries and data we’ll need. As we load this data, we’re doing an extra step in our pipeline. For every row of data, we’re adding a unique ID. This ID is simply the row’s position in the list (so, the first row is 1, the second is 2, and so on). We’re naming this unique ID as case_id.

# install.packages("tidycomm") # only the first time
library(tidycomm)
library(tidyverse)
WoJ <- tidycomm::WoJ %>%
  mutate(case_id = row_number())

Next, we’ll split the WoJ dataset into two groups based on work experience: rookies and veterans. We’ll also reduce the numbers of rookies and veterans based on an arbitrary third variable to make the cases included in the datasets somewhat different from each other.

rookies <- WoJ %>%
  filter(work_experience <= 12) %>%
  filter(autonomy_selection > 1)

veterans <- WoJ %>%
  filter(work_experience >= 10) %>%
  filter(ethics_1 < 5)

With these restrictions set in place, these two datasets include 1,169 unique journalists. The rookies are made up of 433 journalists and the veterans make up 853. 117 journalists are being classified as BOTH rookie AND veteran, i.e. have been counted twice. This means we are dealing with: 433 + 853 - 117 = 1,169.

Now, let’s say you want to merge these datasets back together for some comparative analysis. dplyr offers you different types of joins for this exact purpose. Let’s explore them!

6.4.1 inner_join

This is like an overly correct librarian. It only takes rows that have matching keys in both datasets, otherwise it will drop the rows.

inner_joined <- inner_join(rookies, veterans, by = "case_id")

inner_joined
## # A tibble: 117 × 31
##    country.x   reach.x       employment.x temp_contract.x autonomy_selection.x
##    <fct>       <fct>         <chr>        <fct>                          <dbl>
##  1 Germany     National      Full-time    Permanent                          5
##  2 Denmark     National      Full-time    Permanent                          3
##  3 UK          Regional      Full-time    Permanent                          3
##  4 UK          Transnational Full-time    Permanent                          4
##  5 UK          Regional      Full-time    Permanent                          4
##  6 Germany     Local         Full-time    Permanent                          4
##  7 UK          Regional      Full-time    Permanent                          4
##  8 Switzerland Regional      Freelancer   <NA>                               2
##  9 Denmark     Regional      Full-time    Temporary                          3
## 10 Austria     National      Full-time    Permanent                          3
## # ℹ 107 more rows
## # ℹ 26 more variables: autonomy_emphasis.x <dbl>, ethics_1.x <dbl>,
## #   ethics_2.x <dbl>, ethics_3.x <dbl>, ethics_4.x <dbl>,
## #   work_experience.x <dbl>, trust_parliament.x <dbl>,
## #   trust_government.x <dbl>, trust_parties.x <dbl>, trust_politicians.x <dbl>,
## #   case_id <int>, country.y <fct>, reach.y <fct>, employment.y <chr>,
## #   temp_contract.y <fct>, autonomy_selection.y <dbl>, …

6.4.2 left_join

This is like a fan of the first dataset. It takes all rows from the first dataset and the matched rows from the second dataset.

left_joined <- left_join(rookies, veterans, by = "case_id")

left_joined
## # A tibble: 433 × 31
##    country.x   reach.x       employment.x temp_contract.x autonomy_selection.x
##    <fct>       <fct>         <chr>        <fct>                          <dbl>
##  1 Germany     National      Full-time    Permanent                          5
##  2 Germany     National      Full-time    Permanent                          3
##  3 Switzerland Regional      Full-time    Permanent                          4
##  4 Switzerland Local         Part-time    Permanent                          4
##  5 Denmark     National      Full-time    Permanent                          3
##  6 Denmark     National      Full-time    Permanent                          2
##  7 Austria     Local         Full-time    Permanent                          5
##  8 UK          Regional      Full-time    Permanent                          3
##  9 UK          Transnational Full-time    Permanent                          4
## 10 UK          Transnational Full-time    Permanent                          4
## # ℹ 423 more rows
## # ℹ 26 more variables: autonomy_emphasis.x <dbl>, ethics_1.x <dbl>,
## #   ethics_2.x <dbl>, ethics_3.x <dbl>, ethics_4.x <dbl>,
## #   work_experience.x <dbl>, trust_parliament.x <dbl>,
## #   trust_government.x <dbl>, trust_parties.x <dbl>, trust_politicians.x <dbl>,
## #   case_id <int>, country.y <fct>, reach.y <fct>, employment.y <chr>,
## #   temp_contract.y <fct>, autonomy_selection.y <dbl>, …

6.4.3 right_join

This is like a fan of the second dataset. It takes all rows from the second dataset and the matched rows from the first dataset.

right_joined <- right_join(rookies, veterans, by = "case_id")
# Pro tip: you can use the argument `by = your_column_here` to match the two datsets based on a specific column, e.g. case id

right_joined
## # A tibble: 853 × 31
##    country.x   reach.x       employment.x temp_contract.x autonomy_selection.x
##    <fct>       <fct>         <chr>        <fct>                          <dbl>
##  1 Germany     National      Full-time    Permanent                          5
##  2 Denmark     National      Full-time    Permanent                          3
##  3 UK          Regional      Full-time    Permanent                          3
##  4 UK          Transnational Full-time    Permanent                          4
##  5 UK          Regional      Full-time    Permanent                          4
##  6 Germany     Local         Full-time    Permanent                          4
##  7 UK          Regional      Full-time    Permanent                          4
##  8 Switzerland Regional      Freelancer   <NA>                               2
##  9 Denmark     Regional      Full-time    Temporary                          3
## 10 Austria     National      Full-time    Permanent                          3
## # ℹ 843 more rows
## # ℹ 26 more variables: autonomy_emphasis.x <dbl>, ethics_1.x <dbl>,
## #   ethics_2.x <dbl>, ethics_3.x <dbl>, ethics_4.x <dbl>,
## #   work_experience.x <dbl>, trust_parliament.x <dbl>,
## #   trust_government.x <dbl>, trust_parties.x <dbl>, trust_politicians.x <dbl>,
## #   case_id <int>, country.y <fct>, reach.y <fct>, employment.y <chr>,
## #   temp_contract.y <fct>, autonomy_selection.y <dbl>, …

6.4.4 full_join

This is like a hoarder. It takes all rows and throws in missings for variables that are not part of both datasets.

full_joined <- full_join(rookies, veterans, by = "case_id")
# Pro tip: you can use the argument `by = your_column_here` to match the two datsets based on a specific column, e.g. case id

full_joined
## # A tibble: 1,169 × 31
##    country.x   reach.x       employment.x temp_contract.x autonomy_selection.x
##    <fct>       <fct>         <chr>        <fct>                          <dbl>
##  1 Germany     National      Full-time    Permanent                          5
##  2 Germany     National      Full-time    Permanent                          3
##  3 Switzerland Regional      Full-time    Permanent                          4
##  4 Switzerland Local         Part-time    Permanent                          4
##  5 Denmark     National      Full-time    Permanent                          3
##  6 Denmark     National      Full-time    Permanent                          2
##  7 Austria     Local         Full-time    Permanent                          5
##  8 UK          Regional      Full-time    Permanent                          3
##  9 UK          Transnational Full-time    Permanent                          4
## 10 UK          Transnational Full-time    Permanent                          4
## # ℹ 1,159 more rows
## # ℹ 26 more variables: autonomy_emphasis.x <dbl>, ethics_1.x <dbl>,
## #   ethics_2.x <dbl>, ethics_3.x <dbl>, ethics_4.x <dbl>,
## #   work_experience.x <dbl>, trust_parliament.x <dbl>,
## #   trust_government.x <dbl>, trust_parties.x <dbl>, trust_politicians.x <dbl>,
## #   case_id <int>, country.y <fct>, reach.y <fct>, employment.y <chr>,
## #   temp_contract.y <fct>, autonomy_selection.y <dbl>, …

6.4.5 Join without .x and .y suffixes

You might have noticed that when you’re using dplyr to join two datasets, .x and .y suffixes are added to any columns that have the same names in both datasets. For example, if we join using the case_id variable, all other variables in the resulting combined dataset will be named something like country.x and country.y. The .x and .y suffixes differentiate the country columns from the rookies dataset and the veterans dataset, respectively.

However, if you’re certain that columns with the same name in both datasets also contain the same values, or if you simply don’t want to retain both versions in the output, there are a couple of approaches to prevent or remove the duplicate columns with the .x and .y suffixes:

1. Using the by argument: The by argument in the join functions lets you specify which columns to join by. By creating a character vector of the column names in your data frames, you can provide this column name vector to the by argument. This way, you’re specifying to join on all columns in your data frames. Since the column names and values are identical in both data frames, dplyr won’t create the duplicate columns with .x and .y suffixes.

column_names <- colnames(rookies) # colnames() grabs the column names from the rookies data frame and turns them into a character vector

full_joined <- full_join(rookies, veterans, by = column_names)

full_joined
## # A tibble: 1,169 × 16
##    country   reach employment temp_contract autonomy_selection autonomy_emphasis
##    <fct>     <fct> <chr>      <fct>                      <dbl>             <dbl>
##  1 Germany   Nati… Full-time  Permanent                      5                 4
##  2 Germany   Nati… Full-time  Permanent                      3                 4
##  3 Switzerl… Regi… Full-time  Permanent                      4                 4
##  4 Switzerl… Local Part-time  Permanent                      4                 5
##  5 Denmark   Nati… Full-time  Permanent                      3                 3
##  6 Denmark   Nati… Full-time  Permanent                      2                 4
##  7 Austria   Local Full-time  Permanent                      5                 5
##  8 UK        Regi… Full-time  Permanent                      3                 4
##  9 UK        Tran… Full-time  Permanent                      4                 5
## 10 UK        Tran… Full-time  Permanent                      4                 4
## # ℹ 1,159 more rows
## # ℹ 10 more variables: ethics_1 <dbl>, ethics_2 <dbl>, ethics_3 <dbl>,
## #   ethics_4 <dbl>, work_experience <dbl>, trust_parliament <dbl>,
## #   trust_government <dbl>, trust_parties <dbl>, trust_politicians <dbl>,
## #   case_id <int>

You receive only 16 columns in your combined data set. This is the most elegant solution.

2. Renaming .x columns and dropping .y columns: Using the knowledge that you’ll acquire in the next few tutorials, you could also rename all the .x columns in your combined data set and delete all .y columns.

inner_joined <- inner_joined %>%
  rename_with(~ gsub("\\.x$", "", .), everything()) %>% # replaces all column endings ".x" with nothing: ""
  select(-matches("\\.y$")) # deselects all columns that end with ".y"

inner_joined
## # A tibble: 117 × 16
##    country   reach employment temp_contract autonomy_selection autonomy_emphasis
##    <fct>     <fct> <chr>      <fct>                      <dbl>             <dbl>
##  1 Germany   Nati… Full-time  Permanent                      5                 4
##  2 Denmark   Nati… Full-time  Permanent                      3                 3
##  3 UK        Regi… Full-time  Permanent                      3                 4
##  4 UK        Tran… Full-time  Permanent                      4                 4
##  5 UK        Regi… Full-time  Permanent                      4                 4
##  6 Germany   Local Full-time  Permanent                      4                 5
##  7 UK        Regi… Full-time  Permanent                      4                 4
##  8 Switzerl… Regi… Freelancer <NA>                           2                 3
##  9 Denmark   Regi… Full-time  Temporary                      3                 3
## 10 Austria   Nati… Full-time  Permanent                      3                 4
## # ℹ 107 more rows
## # ℹ 10 more variables: ethics_1 <dbl>, ethics_2 <dbl>, ethics_3 <dbl>,
## #   ethics_4 <dbl>, work_experience <dbl>, trust_parliament <dbl>,
## #   trust_government <dbl>, trust_parties <dbl>, trust_politicians <dbl>,
## #   case_id <int>

You are left with only 16 columns.

3. Drop the columns from one data frame before joining: This is the most basic and error-prone, but quickest approach.

veterans <- veterans %>% 
  select(case_id) # drop all columns except case_id

left_joined <- left_join(rookies, veterans, by = "case_id")

left_joined
## # A tibble: 433 × 16
##    country   reach employment temp_contract autonomy_selection autonomy_emphasis
##    <fct>     <fct> <chr>      <fct>                      <dbl>             <dbl>
##  1 Germany   Nati… Full-time  Permanent                      5                 4
##  2 Germany   Nati… Full-time  Permanent                      3                 4
##  3 Switzerl… Regi… Full-time  Permanent                      4                 4
##  4 Switzerl… Local Part-time  Permanent                      4                 5
##  5 Denmark   Nati… Full-time  Permanent                      3                 3
##  6 Denmark   Nati… Full-time  Permanent                      2                 4
##  7 Austria   Local Full-time  Permanent                      5                 5
##  8 UK        Regi… Full-time  Permanent                      3                 4
##  9 UK        Tran… Full-time  Permanent                      4                 5
## 10 UK        Tran… Full-time  Permanent                      4                 4
## # ℹ 423 more rows
## # ℹ 10 more variables: ethics_1 <dbl>, ethics_2 <dbl>, ethics_3 <dbl>,
## #   ethics_4 <dbl>, work_experience <dbl>, trust_parliament <dbl>,
## #   trust_government <dbl>, trust_parties <dbl>, trust_politicians <dbl>,
## #   case_id <int>

6.5 Reformatting datasets

In this subtutorial, we will learn how to use pivot_longer() and pivot_wider() functions from the tidyr package. We will create a fictional dataset based on the Worlds of Journalism (WoJ) dataset, with a twist - it will have several panel waves. Imagine that the WoJ dataset has been expanded to include data from different years:

set.seed(123)
WoJ_panel <- WoJ %>%
  slice_sample(n = 10) %>%
  select(case_id, work_experience, autonomy_selection) %>%
  mutate(
    work_experience_w2018 = work_experience + sample(1:1, 10, replace = TRUE),
    work_experience_w2019 = work_experience_w2018 + sample(1:1, 10, replace = TRUE),
    autonomy_selection_w2018 = autonomy_selection + sample(-1:1, 10, replace = TRUE),
    autonomy_selection_w2019 = autonomy_selection_w2018 + sample(-1:1, 10, replace = TRUE)
  ) %>%
  select(-c(work_experience, autonomy_selection))

WoJ_panel
## # A tibble: 10 × 5
##    case_id work_experience_w2018 work_experience_w2019 autonomy_selection_w2018
##      <int>                 <dbl>                 <dbl>                    <dbl>
##  1     415                    24                    25                        3
##  2     463                     5                     6                        4
##  3     179                    26                    27                        4
##  4     526                    38                    39                        3
##  5     195                    13                    14                        5
##  6     938                    26                    27                        4
##  7    1142                     4                     5                        3
##  8    1038                    10                    11                        6
##  9     665                    16                    17                        2
## 10     602                     3                     4                        1
## # ℹ 1 more variable: autonomy_selection_w2019 <dbl>

6.5.1 pivot_longer()

Let’s say you want to analyze how work experience and autonomy selection change over time. For this, you need to reshape the dataset into a longer format.

long_data <- WoJ_panel %>%
  pivot_longer(
    cols = contains("_w"),
    names_to = c("variable", "wave"),
    names_pattern = "(.*)_(w\\d+)", # the symbol starts with any symbol that repeats 0 to several times until a _ is reached, followed by a "w" and at least one digit
    values_to = "value"
  )

long_data
## # A tibble: 40 × 4
##    case_id variable           wave  value
##      <int> <chr>              <chr> <dbl>
##  1     415 work_experience    w2018    24
##  2     415 work_experience    w2019    25
##  3     415 autonomy_selection w2018     3
##  4     415 autonomy_selection w2019     3
##  5     463 work_experience    w2018     5
##  6     463 work_experience    w2019     6
##  7     463 autonomy_selection w2018     4
##  8     463 autonomy_selection w2019     5
##  9     179 work_experience    w2018    26
## 10     179 work_experience    w2019    27
## # ℹ 30 more rows

This code transforms the dataset into a longer format where each row represents a single observation for a variable at a specific wave.

6.5.2 pivot_wider()

Now, let’s say you want to do the opposite. You have a long dataset and you want to reshape it into a wider format.

wide_data <- long_data %>%
  pivot_wider(
    names_from = c("variable", "wave"),
    names_sep = "_",
    values_from = "value"
  )

wide_data
## # A tibble: 10 × 5
##    case_id work_experience_w2018 work_experience_w2019 autonomy_selection_w2018
##      <int>                 <dbl>                 <dbl>                    <dbl>
##  1     415                    24                    25                        3
##  2     463                     5                     6                        4
##  3     179                    26                    27                        4
##  4     526                    38                    39                        3
##  5     195                    13                    14                        5
##  6     938                    26                    27                        4
##  7    1142                     4                     5                        3
##  8    1038                    10                    11                        6
##  9     665                    16                    17                        2
## 10     602                     3                     4                        1
## # ℹ 1 more variable: autonomy_selection_w2019 <dbl>

6.6 Advanced column selectors

Tidyselect is a package within the tidyverse that provides a set of functions for selecting and manipulating columns in a data frame. It is mainly used internally by other tidyverse packages, such as dplyr and tidyr, to handle column selections. Tidyselect functions, such as starts_with(), ends_with(), and contains(), allow you to specify column selections based on various patterns, such as prefixes, suffixes, and substrings. In this tutorial, we will look at the following tidyselect functions:

  • last_col()
  • starts_with()
  • ends_with()
  • contains()
  • matches()
  • num_range()
  • where()

Let’s first import the full Worlds of Journalism dataset from the tidycomm package:

WoJ <- tidycomm::WoJ

6.6.1 Selecting the last column

Use last_col() when you want to select the last column of a data frame regardless of its name.

WoJ %>% select(last_col())
## # A tibble: 1,200 × 1
##    trust_politicians
##                <dbl>
##  1                 3
##  2                 3
##  3                 3
##  4                 3
##  5                 2
##  6                 2
##  7                 2
##  8                 3
##  9                 1
## 10                 3
## # ℹ 1,190 more rows

You can also select all columns except the last one:

WoJ %>% select(!last_col())
## # A tibble: 1,200 × 14
##    country   reach employment temp_contract autonomy_selection autonomy_emphasis
##    <fct>     <fct> <chr>      <fct>                      <dbl>             <dbl>
##  1 Germany   Nati… Full-time  Permanent                      5                 4
##  2 Germany   Nati… Full-time  Permanent                      3                 4
##  3 Switzerl… Regi… Full-time  Permanent                      4                 4
##  4 Switzerl… Local Part-time  Permanent                      4                 5
##  5 Austria   Nati… Part-time  Permanent                      4                 4
##  6 Switzerl… Local Freelancer <NA>                           4                 4
##  7 Germany   Local Full-time  Permanent                      4                 4
##  8 Denmark   Nati… Full-time  Permanent                      3                 3
##  9 Switzerl… Local Full-time  Permanent                      5                 5
## 10 Denmark   Nati… Full-time  Permanent                      2                 4
## # ℹ 1,190 more rows
## # ℹ 8 more variables: ethics_1 <dbl>, ethics_2 <dbl>, ethics_3 <dbl>,
## #   ethics_4 <dbl>, work_experience <dbl>, trust_parliament <dbl>,
## #   trust_government <dbl>, trust_parties <dbl>

6.6.2 Selecting columns that start with, end with or contain a specific string

starts_with() selects columns that start with a specific prefix:

WoJ %>% select(starts_with("ethics"))
## # A tibble: 1,200 × 4
##    ethics_1 ethics_2 ethics_3 ethics_4
##       <dbl>    <dbl>    <dbl>    <dbl>
##  1        2        3        2        1
##  2        1        2        2        1
##  3        2        4        2        1
##  4        1        3        1        2
##  5        2        3        2        1
##  6        2        4        4        3
##  7        1        3        2        2
##  8        2        4        4        4
##  9        1        2        1        3
## 10        1        4        4        4
## # ℹ 1,190 more rows

ends_with() selects columns that end with one or more specific suffixes:

WoJ %>% select(ends_with(c("_1", "_2"))) # ends with _1 OR _2
## # A tibble: 1,200 × 2
##    ethics_1 ethics_2
##       <dbl>    <dbl>
##  1        2        3
##  2        1        2
##  3        2        4
##  4        1        3
##  5        2        3
##  6        2        4
##  7        1        3
##  8        2        4
##  9        1        2
## 10        1        4
## # ℹ 1,190 more rows

contains() selects columns that contain a specific substring:

WoJ %>% select(contains("_"))
## # A tibble: 1,200 × 12
##    temp_contract autonomy_selection autonomy_emphasis ethics_1 ethics_2 ethics_3
##    <fct>                      <dbl>             <dbl>    <dbl>    <dbl>    <dbl>
##  1 Permanent                      5                 4        2        3        2
##  2 Permanent                      3                 4        1        2        2
##  3 Permanent                      4                 4        2        4        2
##  4 Permanent                      4                 5        1        3        1
##  5 Permanent                      4                 4        2        3        2
##  6 <NA>                           4                 4        2        4        4
##  7 Permanent                      4                 4        1        3        2
##  8 Permanent                      3                 3        2        4        4
##  9 Permanent                      5                 5        1        2        1
## 10 Permanent                      2                 4        1        4        4
## # ℹ 1,190 more rows
## # ℹ 6 more variables: ethics_4 <dbl>, work_experience <dbl>,
## #   trust_parliament <dbl>, trust_government <dbl>, trust_parties <dbl>,
## #   trust_politicians <dbl>

matches() selects columns based on a regular expression pattern:

WoJ %>% select(matches("^ethics")) # ^ stands for "begins with" in regex notation
## # A tibble: 1,200 × 4
##    ethics_1 ethics_2 ethics_3 ethics_4
##       <dbl>    <dbl>    <dbl>    <dbl>
##  1        2        3        2        1
##  2        1        2        2        1
##  3        2        4        2        1
##  4        1        3        1        2
##  5        2        3        2        1
##  6        2        4        4        3
##  7        1        3        2        2
##  8        2        4        4        4
##  9        1        2        1        3
## 10        1        4        4        4
## # ℹ 1,190 more rows
WoJ %>% select(matches("\\d")) # \\d stands for "contains a digit/number" in regex notation
## # A tibble: 1,200 × 4
##    ethics_1 ethics_2 ethics_3 ethics_4
##       <dbl>    <dbl>    <dbl>    <dbl>
##  1        2        3        2        1
##  2        1        2        2        1
##  3        2        4        2        1
##  4        1        3        1        2
##  5        2        3        2        1
##  6        2        4        4        3
##  7        1        3        2        2
##  8        2        4        4        4
##  9        1        2        1        3
## 10        1        4        4        4
## # ℹ 1,190 more rows

Note: All of the above functions, except matches() are not case-sensetive. This means that they will match both upper- and lowercase letters. If you want to make them case-sensetive, use the ignore.case argument and set it to FALSE.

WoJ %>% select(starts_with("Ethics", ignore.case = FALSE)) # but there is no capitalized Ethics in the data
## # A tibble: 1,200 × 0

6.6.3 Selecting columns based on a numeric range

You can selects columns based on a numerical range with num_range():

WoJ %>% select(num_range("ethics_", 1:2)) # selects ethics_1 and ethics2, but not ethics_3, etc.
## # A tibble: 1,200 × 2
##    ethics_1 ethics_2
##       <dbl>    <dbl>
##  1        2        3
##  2        1        2
##  3        2        4
##  4        1        3
##  5        2        3
##  6        2        4
##  7        1        3
##  8        2        4
##  9        1        2
## 10        1        4
## # ℹ 1,190 more rows

6.6.4 Selecting columns based on their type

You can selects columns based on their type with where(), e.g., all character columns:

WoJ %>% select(where(is.character))
## # A tibble: 1,200 × 1
##    employment
##    <chr>     
##  1 Full-time 
##  2 Full-time 
##  3 Full-time 
##  4 Part-time 
##  5 Part-time 
##  6 Freelancer
##  7 Full-time 
##  8 Full-time 
##  9 Full-time 
## 10 Full-time 
## # ℹ 1,190 more rows

Or you could select all columns that contain numbers:

WoJ %>% select(where(is.numeric))
## # A tibble: 1,200 × 11
##    autonomy_selection autonomy_emphasis ethics_1 ethics_2 ethics_3 ethics_4
##                 <dbl>             <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
##  1                  5                 4        2        3        2        1
##  2                  3                 4        1        2        2        1
##  3                  4                 4        2        4        2        1
##  4                  4                 5        1        3        1        2
##  5                  4                 4        2        3        2        1
##  6                  4                 4        2        4        4        3
##  7                  4                 4        1        3        2        2
##  8                  3                 3        2        4        4        4
##  9                  5                 5        1        2        1        3
## 10                  2                 4        1        4        4        4
## # ℹ 1,190 more rows
## # ℹ 5 more variables: work_experience <dbl>, trust_parliament <dbl>,
## #   trust_government <dbl>, trust_parties <dbl>, trust_politicians <dbl>

You could also use:

  • is.integer (whole numbers, e.g., 7)
  • is.double (decimal numbers, e.g., 7.25)
  • is.logical (e.g., TRUE/FALSE)
  • is.factor (e.g., “great”, “greater”, “the greatest”)

6.6.5 Combine selectors

You can combine selectors, of course. Let’s select all character columns and columns that start with “autonomy”.

WoJ %>% select(where(is.character) | starts_with("autonomy"))
## # A tibble: 1,200 × 3
##    employment autonomy_selection autonomy_emphasis
##    <chr>                   <dbl>             <dbl>
##  1 Full-time                   5                 4
##  2 Full-time                   3                 4
##  3 Full-time                   4                 4
##  4 Part-time                   4                 5
##  5 Part-time                   4                 4
##  6 Freelancer                  4                 4
##  7 Full-time                   4                 4
##  8 Full-time                   3                 3
##  9 Full-time                   5                 5
## 10 Full-time                   2                 4
## # ℹ 1,190 more rows

6.6.6 Reorder your columns with everything()

Finally, you can select all columns with everything(). This allows you to reorder your columns quickly:

WoJ %>% select(temp_contract, country, everything())
## # A tibble: 1,200 × 15
##    temp_contract country   reach employment autonomy_selection autonomy_emphasis
##    <fct>         <fct>     <fct> <chr>                   <dbl>             <dbl>
##  1 Permanent     Germany   Nati… Full-time                   5                 4
##  2 Permanent     Germany   Nati… Full-time                   3                 4
##  3 Permanent     Switzerl… Regi… Full-time                   4                 4
##  4 Permanent     Switzerl… Local Part-time                   4                 5
##  5 Permanent     Austria   Nati… Part-time                   4                 4
##  6 <NA>          Switzerl… Local Freelancer                  4                 4
##  7 Permanent     Germany   Local Full-time                   4                 4
##  8 Permanent     Denmark   Nati… Full-time                   3                 3
##  9 Permanent     Switzerl… Local Full-time                   5                 5
## 10 Permanent     Denmark   Nati… Full-time                   2                 4
## # ℹ 1,190 more rows
## # ℹ 9 more variables: ethics_1 <dbl>, ethics_2 <dbl>, ethics_3 <dbl>,
## #   ethics_4 <dbl>, work_experience <dbl>, trust_parliament <dbl>,
## #   trust_government <dbl>, trust_parties <dbl>, trust_politicians <dbl>

6.7 Advanced selection of rows

You can make very advanced row selections by using the function str_detect from the stringr package in combination with filter from the dplyr package, especially if you combine str_detect with regular expressions.

Let’s start with partial matching, i.e., when your pattern is part of a string, select it:

WoJ %>% filter(str_detect(employment, "Free"))
## # A tibble: 172 × 15
##    country   reach employment temp_contract autonomy_selection autonomy_emphasis
##    <fct>     <fct> <chr>      <fct>                      <dbl>             <dbl>
##  1 Switzerl… Local Freelancer <NA>                           4                 4
##  2 Denmark   Nati… Freelancer <NA>                           4                 4
##  3 Denmark   Nati… Freelancer <NA>                           5                 5
##  4 UK        Tran… Freelancer <NA>                           4                 4
##  5 Germany   Regi… Freelancer <NA>                           2                 4
##  6 Denmark   Nati… Freelancer <NA>                           4                 4
##  7 Denmark   Nati… Freelancer <NA>                           2                 3
##  8 Denmark   Nati… Freelancer <NA>                           4                 3
##  9 Austria   Nati… Freelancer <NA>                           5                 5
## 10 Denmark   Regi… Freelancer <NA>                           4                 4
## # ℹ 162 more rows
## # ℹ 9 more variables: ethics_1 <dbl>, ethics_2 <dbl>, ethics_3 <dbl>,
## #   ethics_4 <dbl>, work_experience <dbl>, trust_parliament <dbl>,
## #   trust_government <dbl>, trust_parties <dbl>, trust_politicians <dbl>

Note that str_detect is case-sensitive, i.e., it reacts sensitive to capitalization. You won’t get a match for “free”:

WoJ %>% filter(str_detect(employment, "free"))
## # A tibble: 0 × 15
## # ℹ 15 variables: country <fct>, reach <fct>, employment <chr>,
## #   temp_contract <fct>, autonomy_selection <dbl>, autonomy_emphasis <dbl>,
## #   ethics_1 <dbl>, ethics_2 <dbl>, ethics_3 <dbl>, ethics_4 <dbl>,
## #   work_experience <dbl>, trust_parliament <dbl>, trust_government <dbl>,
## #   trust_parties <dbl>, trust_politicians <dbl>

In regex language, you can ask str_detectto look for lowercase and uppercase matches, when you are unsure of whether the entries are capitalized in your dataset:

WoJ %>% filter(str_detect(employment, "[fF]ree"))
## # A tibble: 172 × 15
##    country   reach employment temp_contract autonomy_selection autonomy_emphasis
##    <fct>     <fct> <chr>      <fct>                      <dbl>             <dbl>
##  1 Switzerl… Local Freelancer <NA>                           4                 4
##  2 Denmark   Nati… Freelancer <NA>                           4                 4
##  3 Denmark   Nati… Freelancer <NA>                           5                 5
##  4 UK        Tran… Freelancer <NA>                           4                 4
##  5 Germany   Regi… Freelancer <NA>                           2                 4
##  6 Denmark   Nati… Freelancer <NA>                           4                 4
##  7 Denmark   Nati… Freelancer <NA>                           2                 3
##  8 Denmark   Nati… Freelancer <NA>                           4                 3
##  9 Austria   Nati… Freelancer <NA>                           5                 5
## 10 Denmark   Regi… Freelancer <NA>                           4                 4
## # ℹ 162 more rows
## # ℹ 9 more variables: ethics_1 <dbl>, ethics_2 <dbl>, ethics_3 <dbl>,
## #   ethics_4 <dbl>, work_experience <dbl>, trust_parliament <dbl>,
## #   trust_government <dbl>, trust_parties <dbl>, trust_politicians <dbl>

You can also ask for matches that end on your pattern:

WoJ %>% filter(str_detect(employment, "time$"))
## # A tibble: 1,028 × 15
##    country   reach employment temp_contract autonomy_selection autonomy_emphasis
##    <fct>     <fct> <chr>      <fct>                      <dbl>             <dbl>
##  1 Germany   Nati… Full-time  Permanent                      5                 4
##  2 Germany   Nati… Full-time  Permanent                      3                 4
##  3 Switzerl… Regi… Full-time  Permanent                      4                 4
##  4 Switzerl… Local Part-time  Permanent                      4                 5
##  5 Austria   Nati… Part-time  Permanent                      4                 4
##  6 Germany   Local Full-time  Permanent                      4                 4
##  7 Denmark   Nati… Full-time  Permanent                      3                 3
##  8 Switzerl… Local Full-time  Permanent                      5                 5
##  9 Denmark   Nati… Full-time  Permanent                      2                 4
## 10 Austria   Local Full-time  Permanent                      5                 5
## # ℹ 1,018 more rows
## # ℹ 9 more variables: ethics_1 <dbl>, ethics_2 <dbl>, ethics_3 <dbl>,
## #   ethics_4 <dbl>, work_experience <dbl>, trust_parliament <dbl>,
## #   trust_government <dbl>, trust_parties <dbl>, trust_politicians <dbl>

I can’t cover regular expressions and the stringr package in this tutorial, but if you are interested in mastering regular expressions, you can have a look at my tutorial for our MA journalism students.

6.8 Take-Aways

  • Working directory: Avoid hard-coding your working directory as it is not reproducible on different machines. Instead, use the rstudioapi package to set the working directory dynamically based on the location of the R script.
  • Use relative paths: Use relative paths to navigate your working directory. They describe how to navigate through folders and subfolders starting from your current location. Use ./ to move down a folder and ../ to move up one level in the folder hierarchy.
  • Reading multiple files: When reading multiple files into R, create a character vector of file paths using list.files(). Use the map() function from the purrr package to iterate over the file paths and read each file using functions like read.csv(). Combine the resulting data frames using bind_rows() from the dplyr package.
  • Advanced column selectors: Use advanced column selectors from the tidyselect package in dplyr::select() to select and manipulate columns in a data frame. Examples include last_col(), starts_with(), ends_with(), contains(), matches(), num_range(), and where() in select().
  • Reorder columns: using everything() in select() to select all columns and place them in a desired order.
  • Advanced row selectors: Use str_detect() from the stringr package in dplyr::filter() to filter observations based on a pattern or regular expression.

You still have questions? The following book can help you with that: