4 Tutorial: Working with data (files)

After working through Tutorial 4, you’ll…

  • understand how to import data
  • know how to select variables in data frames
  • know how to access values in data frames

If you’d like to learn data import through a video tutorial, you can watch this video that I created for my students:

4.1 Defining your working directory

In most cases, you don’t want to manually enter all your values into R and combine them with the c() function. Instead, you’ll likely want to import data files that you already have on your personal computer or drive. Therefore, the first step to importing your data into R is usually to define your working directory.

Your working directory is the folder from which data can be imported into R or to which you can export and save data created with R.

Create a folder that you want to use as your working directory for this tutorial (or use an existing one). For this example, I’ll create a folder called “IPR” (short for In Public Repository), but you could use a different name.

Image: Working Directory on Windows

Go to that folder and copy the folder path to it:

Image: Copy Working Directory on Windows

On Mac, you go to a document in your folder and right click on it. An options menu opens and you can copy the folder path:

Image: Copy Working Directory on MAC

NOTE: On MAC, your folder path might be too long to display the full name. In this case, I’d recommend to move your files a few levels up. E.g., from “IPR” to “Documents”. In addition, white spaces might occur in the folder path. In this case, you might need to replace them with underscores.

Now you know where this working directory is located - but R should know too! Telling R from which folder to import data or where to export data is called setting your working directory. We use a function called setwd() (you guessed it right: short for “set working directory”) which allows us to do exactly that.

Important: The way this working directory is set differs between Windows- and Mac-Operating Systems.

Windows: The dashes need to be pointing towards the right direction (if you simply copy the path to the folder, you may need to replace these signs “\” with “/”)

setwd("C:/Users/LaraK/Documents/IPR")

Mac: You may need to add a “/” at the beginning:

setwd("/Users/LaraK/Documents/IPR")

NOTE: Setting your entire desktop as the working directory rarely works. It’s better to create a new folder on your desktop and set that folder as your working directory.

If you have forgotten where you set your current working directory, you can also ask R about the path of your current working directory with getwd():

getwd()
## [1] "C:/Users/LaraK/Documents/IPR"

4.1.1 Optional: Setting the working directory on a remote desktop

The LMU Munich provides you with remote desktop access to the PCs in the local CIP-Pools. If you want to use your remote desktop to run R & RStudio, you can follow this link to log into the remote desktop. This is a great fix if – for whatever reason – you can’t get R(Studio) installed on your machine and need a quick solution!

Once you have logged in to the remote desktop, you can open RStudio and set the working directory (and import data) just like it is described in this tutorial (see next image).

Image: Working Directory (& Data Import) on a Remote Desktop

The drawback: Since Windows does not allow RStudio to save script files without permission, you can’t save script files on the remote desktop. You can solve this with a workaround:

  1. Paste your script into a text file before closing RStudio. E.g., WordPad is pre-installed. Save this file as .R.
  2. When you want to load your script in RStudio, right click on the .R file and choose “open with RStudio”.
Image: Saving a Script File on a Remote Desktop

4.2 Import data from your working directory

After setting the working directory, you need to transfer the data file that you want to work with to that folder (here: the “IPR” folder). To do this, download the “WoJ.csv” file from LRZ Sync & Share using this link. The dataset is a subset of the Worlds of Journalism 2012-16 study containing survey data from 40 journalists from five European countries. Put the downloaded data file in the folder that you just set up as your working directory (here: the “IPR” folder).

The data file WoJ.csv is structured as follows:

  • Each row contains the survey answers of a single journalist.
  • Each column contains the values given by all journalists for a single variable.

The variables included here are:

  • country: the home country of the journalist (e.g. Germany, Switzerland)
  • reach: the reach of the outlet that the journalist is working for (Transnational, National, Regional, Local)
  • employment: the journalist’s current employment status (Full-time, Part-time, Freelancer)
  • temp_contract: the journalist’s type of contract (Permenent, Temporary)
  • autonomy_selection: how much autonomy the journalist indicates to have in his/her daily work (from 1 = none at all to 5 = very much)
  • work_experience: in years
  • trust_parliament & trust_government: how much trust the journalist indicates to have (from 1 = none at all to 5 = very much)

We will read in the file using read.csv(). We specify the file path in quotation marks to indicate where to find the data file. However, if you have already set your working directory to the folder where the file is located, you don’t need to specify the path. Additionally, we use the argument header = TRUE to let R know that the first row contains variable names. Finally, we assign the data file to a source object named survey (but you could choose a different name for your source object like WoJ or data). The data is now stored in this object.

survey <- read.csv("WoJ.csv", header = TRUE)
##        country         reach employment temp_contract autonomy_selection
## 1      Germany      National  Full-time     Permanent                  5
## 2      Germany      National  Full-time     Permanent                  3
## 3  Switzerland      Regional  Full-time     Permanent                  4
## 4  Switzerland         Local  Part-time     Permanent                  4
## 5      Austria      National  Part-time     Permanent                  4
## 6  Switzerland         Local Freelancer          <NA>                  4
## 7      Germany         Local  Full-time     Permanent                  4
## 8      Denmark      National  Full-time     Permanent                  3
## 9  Switzerland         Local  Full-time     Permanent                  5
## 10     Denmark      National  Full-time     Permanent                  2
## 11     Austria         Local  Full-time     Permanent                  5
## 12     Denmark      National Freelancer          <NA>                  4
## 13          UK      Regional  Full-time     Permanent                  3
## 14          UK Transnational  Full-time     Permanent                  4
## 15     Austria      National  Full-time     Permanent                  3
## 16     Denmark      National Freelancer          <NA>                  5
## 17          UK Transnational  Full-time     Permanent                  4
## 18 Switzerland      Regional  Full-time     Permanent                  4
## 19 Switzerland      National  Part-time     Permanent                  4
## 20     Denmark      National  Full-time     Permanent                  4
##    work_experience trust_parliament trust_government
## 1               10                3                3
## 2                7                4                4
## 3                6                4                4
## 4                7                4                4
## 5               15                3                2
## 6               27                4                4
## 7               24                2                1
## 8               11                4                3
## 9               25                1                1
## 10               4                3                3
## 11               8                3                2
## 12              25                3                3
## 13              10                3                2
## 14               5                3                2
## 15              23                3                2
## 16              25                4                2
## 17              11                2                3
## 18               8                3                3
## 19              32                3                4
## 20              21                3                3

NOTE: While read.csv() reads in comma-separated values, read.csv2() reads in values that are separated by semicolons.

To make the dataset a little more tangible, we will give the journalists 40 different fictitious names. The names will be saved into a new column in the dataset, which we will call name. Please run this code to add names to your data:

survey$name <- c(
  "Rosalie", "Laurens", "Florian", "Chantal", "Cynthia", "Paul", "Jonas",
  "Tanja", "David", "Ferdinand", "Caroline", "Charline", "Sev", "Theodor",
  "Helke", "Joshua", "Jona", "Konrad", "Lennart", "Luise", "Wiebke", "Marie", "Rosa",
  "Alma", "Ida", "Jean", "Leonie", "Tom", "Maximilian", "Viktor", "Marianne", "Velma", "Carl",
  "Wolf", "Merten", "Tong-Tong", "Sal", "Joe", "Alex", "Robin"
)

4.3 Subsetting variables / columns in data frames

In the Tutorial: Using R as a calculator, your variables were “floating” in your workspace or environment. They were not stored in a container, so you could call them by simply writing their name in the console. When you import data files into R, all variables in that dataset are stored in a “container,” which is your source object. These containers for variables are called data frames in R.

Variables that are part of a data frame can be accessed by their name, but we need to specify the data frame AND the variable name and combine them with the access operator: $. This takes the form of:

dataframe$variablename
# the first part is the container name, i.e. data frame
# this is followed by the access operator $
# finally, you call the variable by name

For instance, we could retrieve the variable / column “name” in our survey data frame by simply using its variable name: We specify the object we want to access, the data frame survey and then retrieve the column name via the operator $:

survey$name
##  [1] "Rosalie"    "Laurens"    "Florian"    "Chantal"    "Cynthia"   
##  [6] "Paul"       "Jonas"      "Tanja"      "David"      "Ferdinand" 
## [11] "Caroline"   "Charline"   "Sev"        "Theodor"    "Helke"     
## [16] "Joshua"     "Jona"       "Konrad"     "Lennart"    "Luise"     
## [21] "Wiebke"     "Marie"      "Rosa"       "Alma"       "Ida"       
## [26] "Jean"       "Leonie"     "Tom"        "Maximilian" "Viktor"    
## [31] "Marianne"   "Velma"      "Carl"       "Wolf"       "Merten"    
## [36] "Tong-Tong"  "Sal"        "Joe"        "Alex"       "Robin"

You know that the name variable is the ninth column of you data frame. Therefore, you can also access this column / variable by calling it by its index number (column index, here: 9). Just like you’ve learned in the Tutorial: Using R as a calculator, you can access sub-elements of a greater object with square brackets [ ]:

survey[9]
##          name
## 1     Rosalie
## 2     Laurens
## 3     Florian
## 4     Chantal
## 5     Cynthia
## 6        Paul
## 7       Jonas
## 8       Tanja
## 9       David
## 10  Ferdinand
## 11   Caroline
## 12   Charline
## 13        Sev
## 14    Theodor
## 15      Helke
## 16     Joshua
## 17       Jona
## 18     Konrad
## 19    Lennart
## 20      Luise
## 21     Wiebke
## 22      Marie
## 23       Rosa
## 24       Alma
## 25        Ida
## 26       Jean
## 27     Leonie
## 28        Tom
## 29 Maximilian
## 30     Viktor
## 31   Marianne
## 32      Velma
## 33       Carl
## 34       Wolf
## 35     Merten
## 36  Tong-Tong
## 37        Sal
## 38        Joe
## 39       Alex
## 40      Robin

Note: While the first command gives you the names as a vector, the second one gives you the names as a data frame object with only one column. This keeps the column header “name” intact. However, if you want to retrieve a vector using the column index, you need to provide two indices: one for the row that you want to select, followed by a comma, and one for the column. Since we want to select all rows, but only column No. 9, we need leave the row No. blank:

survey[, 9] # column index = 9
##  [1] "Rosalie"    "Laurens"    "Florian"    "Chantal"    "Cynthia"   
##  [6] "Paul"       "Jonas"      "Tanja"      "David"      "Ferdinand" 
## [11] "Caroline"   "Charline"   "Sev"        "Theodor"    "Helke"     
## [16] "Joshua"     "Jona"       "Konrad"     "Lennart"    "Luise"     
## [21] "Wiebke"     "Marie"      "Rosa"       "Alma"       "Ida"       
## [26] "Jean"       "Leonie"     "Tom"        "Maximilian" "Viktor"    
## [31] "Marianne"   "Velma"      "Carl"       "Wolf"       "Merten"    
## [36] "Tong-Tong"  "Sal"        "Joe"        "Alex"       "Robin"

4.4 Subsetting observations / rows in data frames

Using the same indexing technique, you can also select an entire row (i.e., journalist) by providing a row index and leaving the column index blank:

survey[1, ] # row index = 1
##   country    reach employment temp_contract autonomy_selection work_experience
## 1 Germany National  Full-time     Permanent                  5              10
##   trust_parliament trust_government    name
## 1                3                3 Rosalie

4.5 Subsetting values / cells in data frames

To subset values of a data set, you can call a variable / column by its name. You just have to specify the data frame, the variable name, AND the row index. For example, to look only at the first name entry in the data, which is Rosalie, you can use the following code:

survey$name[1]
## [1] "Rosalie"

Now, can you guess how to access the exact same value using the column index instead of the column name?

You can enter the row index first, followed by a comma, and then finish with the column index, like this:

survey[1, 9] # row index = 1, column index = 9
## [1] "Rosalie"

Of course, you can use complex indexing on data frames. Let’s look at the first ten rows of the eighth (trust_government) and ninth (name) column:

survey[1:10, 8:9]
##    trust_government      name
## 1                 3   Rosalie
## 2                 4   Laurens
## 3                 4   Florian
## 4                 4   Chantal
## 5                 2   Cynthia
## 6                 4      Paul
## 7                 1     Jonas
## 8                 3     Tanja
## 9                 1     David
## 10                3 Ferdinand

Or select the second ten rows of the sixth (work_experience) and ninth (name) column:

survey[10:20, c(6, 9)]
##    work_experience      name
## 10               4 Ferdinand
## 11               8  Caroline
## 12              25  Charline
## 13              10       Sev
## 14               5   Theodor
## 15              23     Helke
## 16              25    Joshua
## 17              11      Jona
## 18               8    Konrad
## 19              32   Lennart
## 20              21     Luise

4.6 Subsetting data with conditions

Let’s say we want to select all names of journalists who have a work experience of at least 25 years. This time, we’ll need to use a condition to select our rows:

survey[survey$work_experience >= 25, 9]
##  [1] "Paul"       "David"      "Charline"   "Joshua"     "Lennart"   
##  [6] "Alma"       "Ida"        "Maximilian" "Viktor"     "Marianne"  
## [11] "Merten"
# Read: I want to select column 9 (name) from the survey data frame
# and display all rows in which the work experience column of the survey data frame has a value greater than or equal to 25

Alternatively, if you want to show all columns that belong to journalists who have a work experience of at least 25 years, you would do it like that:

survey[survey$work_experience >= 25, ]
##        country         reach employment temp_contract autonomy_selection
## 6  Switzerland         Local Freelancer          <NA>                  4
## 9  Switzerland         Local  Full-time     Permanent                  5
## 12     Denmark      National Freelancer          <NA>                  4
## 16     Denmark      National Freelancer          <NA>                  5
## 19 Switzerland      National  Part-time     Permanent                  4
## 24          UK Transnational Freelancer          <NA>                  4
## 25     Austria      National  Full-time     Permanent                  4
## 29     Austria      Regional  Full-time     Permanent                  4
## 30     Denmark      National Freelancer          <NA>                  4
## 31     Denmark      National  Full-time     Permanent                  4
## 35     Denmark      National  Full-time     Permanent                  4
##    work_experience trust_parliament trust_government       name
## 6               27                4                4       Paul
## 9               25                1                1      David
## 12              25                3                3   Charline
## 16              25                4                2     Joshua
## 19              32                3                4    Lennart
## 24              29                4                4       Alma
## 25              30                3                3        Ida
## 29              30                3                3 Maximilian
## 30              35                3                3     Viktor
## 31              25                3                3   Marianne
## 35              29                3                3     Merten

Of course, you can index more than one column:

survey[survey$work_experience >= 25, 8:9]
##    trust_government       name
## 6                 4       Paul
## 9                 1      David
## 12                3   Charline
## 16                2     Joshua
## 19                4    Lennart
## 24                4       Alma
## 25                3        Ida
## 29                3 Maximilian
## 30                3     Viktor
## 31                3   Marianne
## 35                3     Merten

Finally, you can use multiple conditions:

survey[survey$work_experience >= 25 & survey$trust_government < 2, ]
##       country reach employment temp_contract autonomy_selection work_experience
## 9 Switzerland Local  Full-time     Permanent                  5              25
##   trust_parliament trust_government  name
## 9                1                1 David

4.7 Optional: Import data from various file formats

R provides a wide range of functions and packages to import data from various file formats such as Excel, SPSS, SAS, Stata, JSON and others. Here, we will focus on some commonly used file formats and the R functions to import them.

4.7.1 Importing data from Excel

We will use the readxl package to import data from Excel files. First, we need to install and load the readxl package:

# Install and load the 'readxl' package
install.packages("readxl") # run only the first time
library(readxl)

To import Excel data from your working directory into R, you can now use the read_excel() function:

data <- read_excel("WoJ.xlsx")

This will create a data frame named “data” containing the Excel data. By default, the first row of the Excel file is assumed to contain column names.

4.7.2 Importing data from JSON

JSON files are a popular file format for storing structured data, such as data collected from Social Networking Sites. To import data from a JSON file into R, you can use the jsonlite package:

# Install and load the 'jsonlite' package
install.packages("jsonlite") # run only the first time
library(jsonlite)

And import your data:

data <- fromJSON("WoJ.json")

4.7.3 Importing data from SPSS

SPSS files are still commonly used in social science research and you might want to import your current SPSS projects to R. In R, we can import SPSS files using the haven package.

# Install and load the 'haven' package
install.packages("haven") # run only the first time
library(haven)

To import your data into R, use the read_spss() function (here I repeated the entire working directory, but you could only write read_spss("WoJ.sav")):

data <- read_spss("C:/Users/LaraK/Documents/IPR/WoJ.sav")

By default, the read_spss() function will convert user-defined missings to NA (i.e., missings in R). If you want to read variables with user defined missings into R (e.g., keep -9 instead of turning it into NA), you can use the user_na argument:

data <- read_spss("C:/Users/LaraK/Documents/IPR/WoJ.sav", user_na = TRUE)

read_spss reads all your labelled variables in as an object type (i.e., class) called haven_labelled that stores your label information. You can check your label information by inspecting your data in your environment (by default: right-hand side of RStudio). Just click on the blue error next to your source object:

Image: Inspecting your labelled data frame

NOTE FOR ADVANCED USERS: If you want to add, change, or remove any kind of labels in R, you can use the labelled package:

# Install and load the 'labelled' package
install.packages("labelled")
library(labelled)

To add a label to a variable, use the var_label() function:

var_labels <- c(
  country = "The journalist's country of origin",
  reach = "The reach of the journalist's outlet",
  employment = "The journalist's employment status",
  temp_contract = "The journalist's type of contract",
  autonomy_selection = "The journalist's perceived autonomy",
  work_experience = "The journalist's work experience in years",
  trust_parliament = "The journalist's trust in the current parliament",
  trust_government = "The journalist's trust in the current government",
  name = "Name of journalist"
) # Creates a vector with your variable labels
var_label(survey) <- var_labels # Applies the labels to your column variables of the unlabeled "survey" data set (you could also apply it to the SPSS-labelled "data" data set)

To add value labels to a variable, use the set_value_labels function:

trust_labels <- c("none at all" = 1, "a little" = 2, "partly" = 3, "a lot" = 4, "very much" = 5)
data$trust_parliament <- set_value_labels(data$trust_parliament, .labels = trust_labels)
data$trust_government <- set_value_labels(data$trust_government, .labels = trust_labels)

To remove a variable label from a variable, use the remove_var_label function:

data$trust_parliament <- remove_var_label(data$trust_parliament)

And to remove a value label from a variable, use the remove_val_labels function:

data$trust_parliament <- remove_val_labels(data$trust_parliament)

For a full documentation and a helpful cheat sheet of the labelled package, click this link. It also contains information on how to use the labelled package in combination with the tidyverse. I’ve used Base R in this tutorial.

4.8 Take-Aways

  • Setting the working directory: tells R where your folder with the data is located on your drive, setwd(your_filepath)
  • Import data: after setting the working directory, with read.csv() (comma-separated) or read.csv2() (semicolon-separated) or the package of your choice (e.g., haven)
  • Access variables: either by the access operator $ (dataframe&variablename) or by the column index [,columnNo.]
  • Access values: by indexing, i.e. using [ ] and row + column indices
  • Conditions: you can select rows based on conditions, e.g.: greater >, greater or equal >=, equal ==, and not equal !=

4.9 Additional tutorials

You still have questions? The following online guides can help you with that:

Now it’s your time to get into coding: Try Exercise 2: Base R.