6 Tutorial: Text manipulation with stringr

After working through Tutorial 6, you’ll…

  • understand the concept of string patterns and regular expressions
  • know how to search for string patterns

6.1 What’s stringr?

The stringr package is another package of the tidyverse family, i.e., it is installed together with the tidyverse and loaded whenever you run library(tidyverse). The package offers a neat set of functions that make working with strings really simple for beginners. Therefore, stringr is a good place to start getting into text data management. A string is a data type that is used to represent text rather than numbers.

6.2 Working with strings

First, let’s create a vector that contains some strings and print the vector to the console!

library(stringr) # load stringr (it is also loaded by library(tidyverse))

fruits <- c("banana", "apple", "pear", "strawberry", "raspberry", "kiwi", "rhubarb")

fruits
## [1] "banana"     "apple"      "pear"       "strawberry" "raspberry" 
## [6] "kiwi"       "rhubarb"

Next, we want to know how long each of these strings is, i.e., how many characters each element of the fruits vector contains. We will use the str_length() function.

str_length(fruits)
## [1]  6  5  4 10  9  4  7
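
Because str_length() returns one number per string, its result can also be used to filter a vector. The following is only a small sketch; the object name long_fruits is made up for this illustration.

```r
library(stringr)

fruits <- c("banana", "apple", "pear", "strawberry", "raspberry", "kiwi", "rhubarb")

# keep only the fruits whose names are longer than five characters
long_fruits <- fruits[str_length(fruits) > 5]
long_fruits
## [1] "banana"     "strawberry" "raspberry"  "rhubarb"
```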

That was easy. Next, we want to join multiple strings into a single string. We will use the str_c() function.

str_c(fruits, collapse = " and ")
## [1] "banana and apple and pear and strawberry and raspberry and kiwi and rhubarb"
# If collapse is not NULL, it is inserted between the elements when they are joined into a single string, here: " and "

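You can also use str_c() to put a fixed string in front of every element. With collapse = NULL (the default), the result stays a vector with one string per fruit: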

str_c("My favourite fruit is: ", fruits, collapse = NULL) 
## [1] "My favourite fruit is: banana"     "My favourite fruit is: apple"     
## [3] "My favourite fruit is: pear"       "My favourite fruit is: strawberry"
## [5] "My favourite fruit is: raspberry"  "My favourite fruit is: kiwi"      
## [7] "My favourite fruit is: rhubarb"
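
To make the difference between the two arguments sep and collapse more tangible, here is a minimal sketch with a shortened fruits vector. The object names labelled and one_string are chosen for this illustration only.

```r
library(stringr)

fruits <- c("banana", "apple", "pear")

# sep is placed between the arguments that are pasted together element by element
labelled <- str_c("fruit", fruits, sep = ": ")
labelled
## [1] "fruit: banana" "fruit: apple"  "fruit: pear"

# collapse then joins the resulting elements into one single string
one_string <- str_c("fruit", fruits, sep = ": ", collapse = "; ")
one_string
## [1] "fruit: banana; fruit: apple; fruit: pear"
```

In short: sep works within each element, collapse works across elements.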

Let’s say we want to extract substrings from a character vector. For example, we only want to keep the second to fourth character of each string. We can use the str_sub() function to achieve that.

str_sub(fruits, start = 2, end = 4) 
## [1] "ana" "ppl" "ear" "tra" "asp" "iwi" "hub"
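
str_sub() also accepts negative positions, which count backwards from the end of each string. As a small sketch, this extracts the last three characters of every fruit; the object name last_three is made up for this illustration.

```r
library(stringr)

fruits <- c("banana", "apple", "pear", "strawberry", "raspberry", "kiwi", "rhubarb")

# -1 is the last character, -3 the third to last
last_three <- str_sub(fruits, start = -3, end = -1)
last_three
## [1] "ana" "ple" "ear" "rry" "rry" "iwi" "arb"
```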

6.3 Working with string patterns

Often we want to search for certain string patterns in a text document. String patterns are character sequences (for instance, letters, numbers, or special characters). Let’s assume for a second that we have misspelled one of our fruits (banana; we have switched the letters na to an).

misspelled_fruits <- c("baanan", "apple", "pear", "strawberry", "raspberry", "kiwi", "rhubarb")

It would be really great if we could search for the string pattern “an” and replace it with the string pattern “na” automatically, wouldn’t it? Well, stringr offers some functions to do just that. For example, str_detect() tells you if there’s any match to the pattern.

str_detect(misspelled_fruits, "an")
## [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

Now we know that there is one word that contains letters matching the “an” pattern. We have a misspelling! What is that word? str_subset() extracts the matching strings, so that we can find out.

str_subset(misspelled_fruits, "an")
## [1] "baanan"

Of course, it’s banana. But how many times does “an” occur in the misspelled banana? Just once? Let’s find out with str_count(), which counts the number of pattern matches in each string.

str_count(misspelled_fruits, "an")
## [1] 2 0 0 0 0 0 0

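Two times! Let’s fix that with the str_replace() function. Because str_replace() only replaces the first match in each string, we replace the whole sequence “anan” with “nana” in one go.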

misspelled_fruits <- str_replace(misspelled_fruits, "anan", "nana")

misspelled_fruits
## [1] "banana"     "apple"      "pear"       "strawberry" "raspberry" 
## [6] "kiwi"       "rhubarb"

Perfect! We have fixed our misspelled fruits. However, keep in mind that pattern correction can mess up your strings pretty badly if you are not cautious. Therefore, you should always explore your strings very thoroughly before replacing any string patterns. For example, let’s see what our pattern detection uncovers if our misspelled fruits contained an additional orange, which has been spelled correctly:

misspelled_fruits <- c("baanan", "apple", "pear", "strawberry", "raspberry", "kiwi", "rhubarb", "orange")

str_detect(misspelled_fruits, "an")
## [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

Now, str_detect() matches two words with the “an” pattern, but the latter is not a misspelling! So always be careful!
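
To see what such damage can look like, here is a minimal sketch with a shortened vector. It compares str_replace(), which only replaces the first match in each string, with str_replace_all(), which replaces every match. The object names first_only, all_matches, and targeted are made up for this illustration.

```r
library(stringr)

misspelled_fruits <- c("baanan", "apple", "orange")

# str_replace() only replaces the first "an" in each string
first_only <- str_replace(misspelled_fruits, "an", "na")
first_only
## [1] "banaan" "apple"  "ornage"

# str_replace_all() replaces every "an": banana is repaired, but orange is broken
all_matches <- str_replace_all(misspelled_fruits, "an", "na")
all_matches
## [1] "banana" "apple"  "ornage"

# a more specific pattern only touches the misspelled word
targeted <- str_replace(misspelled_fruits, "anan", "nana")
targeted
## [1] "banana" "apple"  "orange"
```

The more specific the pattern, the smaller the risk of changing strings that were correct in the first place.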

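As a final lesson, you can also split a string into multiple strings based on certain string patterns using the str_split() function. The result is a list with one element per input string. Note that splitting on the comma alone keeps the space after each comma as part of the pieces: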

cs_fruits <- c("banana, apple, pear, strawberry, raspberry, kiwi, rhubarb, orange")
str_split(cs_fruits, ",")
## [[1]]
## [1] "banana"      " apple"      " pear"       " strawberry" " raspberry" 
## [6] " kiwi"       " rhubarb"    " orange"
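
If the leading spaces are not wanted, there are at least two simple ways to get rid of them. This is only a sketch with a shortened string; the object names split_fruits and trimmed_fruits are made up for this illustration.

```r
library(stringr)

cs_fruits <- "banana, apple, pear, strawberry"

# option 1: make the space part of the pattern that we split on
split_fruits <- str_split(cs_fruits, ", ")[[1]] # [[1]] extracts the first list element
split_fruits
## [1] "banana"     "apple"      "pear"       "strawberry"

# option 2: split on the comma only and trim the whitespace afterwards
trimmed_fruits <- str_trim(str_split(cs_fruits, ",")[[1]])
identical(split_fruits, trimmed_fruits)
## [1] TRUE
```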

6.4 Working with regular expressions

Often, we want to match more complicated string patterns than a simple “an”. For example, we might wish to detect all strings in our text document that do not start with “RT”, because “RT” at the beginning of a string implies a retweet rather than an original tweet when analyzing Twitter data. Arguably, we are often not really interested in analyzing retweets (but sometimes we are, it depends on the research question).

To search and match complex string patterns, we need regular expressions. Regular expressions (short: regex) are a concise language for describing patterns of text. In a regex, some characters are not taken literally but have a special meaning; for example, ^ does not match the character ^ but marks the start of a string.

Let’s keep working with our (non-misspelled) fruits vector to display what regex can do. First, let’s look for all strings that start with the letter b using the ^ (start of string) regex.

str_detect(fruits, "^b") # ^ stands for "start of string", i.e. we are matching for strings that start with the letter b
## [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

That’s on point because only our first entry, banana, starts with b and str_detect() matched that correctly! With a similar approach, you can find all fruits that end with the letter b:

str_detect(fruits, "b$") # $ stands for "end of string", i.e. we are matching for strings that end with the letter b
## [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

Again, we have a perfect match of the only fruit that ends with the letter b: rhubarb. Please note the difference between these two anchored regexes (^b and b$) and the simple string pattern “b” without any anchor:

str_detect(fruits, "b") # matches all strings that contain the letter b at any place
## [1]  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE

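This has matched banana, strawberry, raspberry, and rhubarb, because all of these fruits contain the letter b at some place. Finally, you should not confuse “^b” with “[^b]”: inside square brackets, ^ stands for “anything but” in regex language, so “[^b]” matches any single character that is not a b. To see the difference, we add the strings “b” and “bbb” to our vector: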

fruits <- c("banana", "apple", "pear", "strawberry", "raspberry", "kiwi", "rhubarb", "b", "bbb")

str_detect(fruits, "[^b]") # matches all strings that contain at least one character other than b
## [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
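
With the ^ anchor, we can return to the retweet example from the beginning of this section. The following is a minimal sketch; the tweets are invented for this illustration and the object names is_retweet and original_tweets are assumptions, not part of any Twitter data set.

```r
library(stringr)

# hypothetical example tweets, made up for illustration
tweets <- c("RT @someone: stringr is neat", "Just learned about regex!", "SMART people use R")

# TRUE for every string that starts with "RT"
is_retweet <- str_detect(tweets, "^RT")
is_retweet
## [1]  TRUE FALSE FALSE

# keep only the original tweets, i.e., those that do NOT start with "RT"
original_tweets <- tweets[!is_retweet]
original_tweets
## [1] "Just learned about regex!" "SMART people use R"
```

Note that the third tweet contains “RT” inside the word SMART, but it is not treated as a retweet because the pattern is anchored to the start of the string.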

Next, let’s say that it doesn’t really matter to us whether our fruits contain the letter b or the letter e at any position of the string. There is a very powerful regex to match this string pattern: the “match any one of” operator [].

fruits <- c("banana", "apple", "pear", "strawberry", "raspberry", "kiwi", "rhubarb")

str_detect(fruits, "[be]") # matches all strings that contain either the letter b or the letter e
## [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE

Our regex has managed to match all strings that contain either the letter b or the letter e, which leaves only kiwi to be FALSE. Next, we can match all fruits that contain any letter between s and w in the alphabet. We will need to use the range operator [-]:

str_detect(fruits, "[s-w]") # matches all strings that contain the letter s, t, u, v, or w
## [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE

Again, we have a perfect match! str_detect() has successfully matched “strawberry”, “raspberry”, “kiwi”, and “rhubarb”. Next, we might want to match all fruits that contain a double r, i.e., two rs directly next to each other: we want to match strawberry and raspberry, but not pear or rhubarb. This is where the ? operator (zero or one), the * operator (zero or more), the + operator (one or more), and the {n} operator (exactly n) come in handy. These quantifiers always apply to the character directly in front of them.

str_detect(fruits, "r?") # matches zero or one r --> since zero rs are allowed, every string matches
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
str_detect(fruits, "r*") # matches zero or more rs --> again, every string matches
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
str_detect(fruits, "r+") # matches all strings that contain one or more rs --> this grabs pear and rhubarb as well, not there yet!
## [1] FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE
str_detect(fruits, "r{2}") # matches all strings that contain two rs in a row --> this grabs only the berries, yeah!
## [1] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
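
Keep in mind that r{2} looks for two rs directly next to each other, not for two rs anywhere in the word. If you want to know how many rs a string contains in total, str_count() is the better tool. A small sketch; the object names r_counts, at_least_two_rs, and double_r are made up for this illustration.

```r
library(stringr)

fruits <- c("banana", "apple", "pear", "strawberry", "raspberry", "kiwi", "rhubarb")

# number of rs per string, whether they are adjacent or not
r_counts <- str_count(fruits, "r")
r_counts
## [1] 0 0 1 3 3 0 2

# strings with at least two rs anywhere in the word: rhubarb is included
at_least_two_rs <- fruits[r_counts >= 2]
at_least_two_rs
## [1] "strawberry" "raspberry"  "rhubarb"

# strings with two rs in a row: rhubarb is not included
double_r <- str_subset(fruits, "r{2}")
double_r
## [1] "strawberry" "raspberry"
```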

This looks great! We now know the most important regular expressions! If you feel like you need even more advanced regular expressions, you can look them up in this awesome stringr cheat sheet.
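
The building blocks from this section can also be combined into a single pattern. The following sketch puts anchors, character classes, and quantifiers together; the object names are made up for this illustration.

```r
library(stringr)

fruits <- c("banana", "apple", "pear", "strawberry", "raspberry", "kiwi", "rhubarb")

# start of string, followed by any letter from a to k
starts_a_to_k <- str_subset(fruits, "^[a-k]")
starts_a_to_k
## [1] "banana" "apple"  "kiwi"

# two rs in a row, followed by y, at the end of the string
ends_rry <- str_subset(fruits, "r{2}y$")
ends_rry
## [1] "strawberry" "raspberry"

# start of string, followed by any character that is NOT between b and r
starts_outside_b_to_r <- str_subset(fruits, "^[^b-r]")
starts_outside_b_to_r
## [1] "apple"      "strawberry"
```

The last pattern uses ^ twice with two different meanings: outside the square brackets it marks the start of the string, inside them it means “anything but”.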

6.5 Take-Aways

  • String patterns & RegEx: String patterns are sequences of characters; regular expressions are a type of string pattern used to match or detect other string patterns in texts.
  • Important regular expressions: The best overview of all regex options can be found on the stringr cheat sheet.

6.6 Additional tutorials

You still have questions? The following tutorials, books, & tools may help you: