6 Tutorial: Text manipulation with stringr
After working through Tutorial 6, you’ll…
- understand the concept of string patterns and regular expressions
- know how to search for string patterns
6.1 What’s stringr?
The stringr package is another member of the tidyverse family, i.e., it is installed along with the tidyverse. The package offers a neat set of functions that make working with strings really simple for beginners. Therefore, stringr is a good place to start getting into text data management. A string is a data type that is used to represent text rather than numbers.
6.2 Working with strings
First, let’s create a vector that contains some strings and print the vector to the console!
fruits <- c("banana", "apple", "pear", "strawberry", "raspberry", "kiwi", "rhubarb")
fruits
## [1] "banana" "apple" "pear" "strawberry" "raspberry"
## [6] "kiwi" "rhubarb"
To begin, we want to know how long each of these strings is, i.e., how many characters each element of the fruits vector contains. We will use the str_length() function.
str_length(fruits)
## [1] 6 5 4 10 9 4 7
That was easy. Next, we want to join multiple strings into a single string. We will use the str_c() function.
str_c(fruits, collapse = " and ")
## [1] "banana and apple and pear and strawberry and raspberry and kiwi and rhubarb"
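# If collapse is not NULL, it is inserted between the elements of the result (here: " and ")
You can also pass more than one argument to str_c(). A single string is then combined with every element of the vector; with collapse = NULL (the default), the result remains a vector: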
str_c("My favourite fruit is: ", fruits, collapse = NULL)
## [1] "My favourite fruit is: banana" "My favourite fruit is: apple"
## [3] "My favourite fruit is: pear" "My favourite fruit is: strawberry"
## [5] "My favourite fruit is: raspberry" "My favourite fruit is: kiwi"
## [7] "My favourite fruit is: rhubarb"
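The two arguments of str_c() that control how pieces are glued together can be combined. The following is a minimal sketch, assuming a shortened fruits vector chosen only for illustration: sep is placed between the arguments that are combined element-wise, and collapse then joins the resulting vector into a single string.

```r
library(stringr)

# Shortened example vector (an assumption for illustration only)
fruits <- c("banana", "apple", "pear")

# sep is inserted between the arguments that are combined element-wise
labelled <- str_c("fruit", fruits, sep = ": ")
labelled

# collapse additionally joins the resulting vector into one single string
one_string <- str_c("fruit", fruits, sep = ": ", collapse = "; ")
one_string
```

Without collapse the result has one element per fruit; with collapse it always has length one.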
Let’s say we want to extract substrings from a character vector. For example, we only want to keep the second to fourth letters of each string. We can use the str_sub() function to achieve that.
str_sub(fruits, start = 2, end = 4)
## [1] "ana" "ppl" "ear" "tra" "asp" "iwi" "hub"
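str_sub() also accepts negative positions, which count backwards from the end of each string. As a small sketch using the same fruits vector, the following keeps the last three letters of every fruit, regardless of how long the string is.

```r
library(stringr)

fruits <- c("banana", "apple", "pear", "strawberry", "raspberry", "kiwi", "rhubarb")

# Negative positions count from the end: -1 is the last character,
# -3 is the third-to-last character
last_three <- str_sub(fruits, start = -3, end = -1)
last_three
```

This is handy when the interesting part of a string sits at its end, such as a file extension or a suffix.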
6.3 Working with string patterns
Often we want to search for certain string patterns in a text document. String patterns are sequences of characters (for instance, letters, numbers, or special characters). Let’s assume for a second that we have misspelled one of our fruits: in banana, the letters “na” have been switched to “an”.
misspelled_fruits <- c("baanan", "apple", "pear", "strawberry", "raspberry", "kiwi", "rhubarb")
It would be really great if we could search for the string pattern “an” and replace it with the string pattern “na” automatically, wouldn’t it? Well, stringr offers some functions to do just that. For example, str_detect() tells you whether there is any match to the pattern.
str_detect(misspelled_fruits, "an")
## [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Now we know that there is one word that contains letters matching the “an” pattern. We have a misspelling! What is that word? str_subset() extracts the matching strings, so that we can find out.
str_subset(misspelled_fruits, "an")
## [1] "baanan"
Of course, it’s banana. But how many times does the “an” pattern occur in our misspelled banana? Just once? Let’s find out with str_count(), which counts the number of matches in each string.
str_count(misspelled_fruits, "an")
## [1] 2 0 0 0 0 0 0
Two times! Let’s fix that with the str_replace() function.
misspelled_fruits <- str_replace(misspelled_fruits, "anan", "nana")
misspelled_fruits
## [1] "banana" "apple" "pear" "strawberry" "raspberry"
## [6] "kiwi" "rhubarb"
Perfect! We have fixed our misspelled fruits. However, keep in mind that pattern replacement can mess up your strings pretty badly if you are not cautious. Therefore, you should always explore your strings thoroughly before replacing any string patterns. For example, let’s see what our pattern detection uncovers if our misspelled fruits contained an additional orange, which has been spelled correctly:
misspelled_fruits <- c("baanan", "apple", "pear", "strawberry", "raspberry", "kiwi", "rhubarb", "orange")
str_detect(misspelled_fruits, "an")
## [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
Now, str_detect() matches two words with the “an” pattern, but the latter is not a misspelling! So always be careful!
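To make the risk concrete, here is a minimal sketch, assuming a shortened version of the vector above. It contrasts three replacement strategies: str_replace() only replaces the first match in each string, str_replace_all() replaces every match, and a pattern that is anchored to the whole misspelled word only touches that word. The anchors ^ and $ are a preview of section 6.4.

```r
library(stringr)

# Shortened example vector (an assumption for illustration only)
misspelled_fruits <- c("baanan", "apple", "orange")

# str_replace() replaces only the FIRST match in each string:
# "baanan" is only half fixed and "orange" gets damaged
first_only <- str_replace(misspelled_fruits, "an", "na")
first_only

# str_replace_all() replaces EVERY match:
# "baanan" is fixed, but "orange" is still damaged
all_matches <- str_replace_all(misspelled_fruits, "an", "na")
all_matches

# Anchoring the pattern to the whole misspelled word leaves all other strings alone
# (^ = start of string, $ = end of string)
safe_fix <- str_replace(misspelled_fruits, "^baanan$", "banana")
safe_fix
```

Only the anchored version leaves the correctly spelled orange untouched, which is why the more specific pattern is usually the safer choice.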
As a final lesson, you can also split a string into multiple strings based on certain string patterns using the str_split() function:
cs_fruits <- c("banana, apple, pear, strawberry, raspberry, kiwi, rhubarb, orange")
str_split(cs_fruits, ",")
## [[1]]
## [1] "banana" " apple" " pear" " strawberry" " raspberry"
## [6] " kiwi" " rhubarb" " orange"
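Note that every element after the first one in the output above starts with a space, because the split happened at the comma only. A small sketch of two ways to get clean elements, assuming a shortened input string: either remove the surrounding whitespace afterwards with str_trim(), or include the space in the split pattern.

```r
library(stringr)

# Shortened example string (an assumption for illustration only)
cs_fruits <- "banana, apple, pear"

# str_split() returns a list with one element per input string;
# [[1]] extracts the character vector for our single input string
pieces <- str_split(cs_fruits, ",")[[1]]

# Option 1: strip leading and trailing whitespace after splitting
trimmed <- str_trim(pieces)
trimmed

# Option 2: make the space part of the split pattern
direct <- str_split(cs_fruits, ", ")[[1]]
direct
```

Option 1 is more forgiving when the spacing after the commas is inconsistent.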
6.4 Working with regular expressions
Often, we want to match more complicated string patterns than a simple “an”. For example, we might wish to detect all strings in our text document that do not start with “RT”, because “RT” at the beginning of a string implies a retweet rather than an original tweet when analyzing Twitter data. Arguably, we are often not really interested in analyzing retweets (but sometimes we are, it depends on the research question).
To search and match complex string patterns, we need regular expressions. Regular expressions (short: regex) are a concise language for describing patterns of text. Some characters in a regex are not matched literally; instead, they carry a special meaning, such as “start of string” or “one or more”.
Let’s keep working with our (non-misspelled) fruits vector to show what regex can do. First, let’s look for all strings that start with the letter b using the ^ (start of string) regex.
str_detect(fruits, "^b") # ^ stands for "start of string", i.e. we are matching for strings that start with the letter b
## [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE
That’s on point because only our first entry, banana, starts with b and str_detect() matched that correctly! With a similar approach, you can find all fruits that end with the letter b:
str_detect(fruits, "b$") # $ stands for "end of string", i.e. we are matching for strings that end with the letter b
## [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE
Again, we have a perfect match of the only fruit that ends with the letter b: rhubarb. Note the difference between these two anchored regexes (^ and $) and the simple, unanchored string pattern “b”:
str_detect(fruits, "b") # matches all strings that contain the letter b at any place
## [1] TRUE FALSE FALSE TRUE TRUE FALSE TRUE
This has matched banana, strawberry, raspberry, and rhubarb, because all of these fruits contain the letter b at some position. Finally, you should not confuse “^b” with “[^b]”, because inside square brackets, [^] stands for “anything but” in regex language.
fruits <- c("banana", "apple", "pear", "strawberry", "raspberry", "kiwi", "rhubarb", "b", "bbb")
str_detect(fruits, "[^b]") # matches all strings that contain at least one character other than b
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
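Returning to the retweet example from the start of this section: to keep only the strings that do not start with “RT”, one option is to detect the strings that do start with it and then negate the result with R's ! operator. The following is a minimal sketch; the tweets vector is made up for illustration and is not part of the tutorial's data.

```r
library(stringr)

# Hypothetical example tweets (made up for illustration)
tweets <- c("RT @user: great talk", "Great talk today", "ART is fun")

# "^RT" only matches when the string STARTS with RT,
# so "ART is fun" is not flagged as a retweet
is_retweet <- str_detect(tweets, "^RT")
is_retweet

# Negate the logical vector with ! to keep only the original tweets
originals <- tweets[!is_retweet]
originals
```

Note that “[^R]” would not do this job: it matches any string that contains at least one character other than R, which is true for all three tweets.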
Next, let’s say that it doesn’t really matter to us whether our fruits contain the letter b or the letter e at any position of the string. There is a very powerful regex to match this string pattern: the “match any one of” operator [].
fruits <- c("banana", "apple", "pear", "strawberry", "raspberry", "kiwi", "rhubarb")
str_detect(fruits, "[be]") # matches all strings that contain either the letter b or the letter e
## [1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE
Our regex has managed to match all strings that contain either the letter b or the letter e, which leaves only kiwi to be FALSE. Next, we can match all fruits that contain any letter from s to w in the alphabet. We will need to use the range operator [-]:
str_detect(fruits, "[s-w]") # matches all strings that contain at least one of the letters s, t, u, v, or w
## [1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE
Again, we have a perfect match! str_detect() has successfully matched “strawberry”, “raspberry”, “kiwi”, and “rhubarb”. Next, we might want to match all fruits that contain a double r, i.e., we want to match strawberry and raspberry, but not pear. This is where the ? operator (zero or one), the * operator (zero or more), the + operator (one or more), and the {n} operator (exactly n) come in handy.
str_detect(fruits, "r?") # matches all strings that contain zero or one rs
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
str_detect(fruits, "r*") # matches all strings that contain zero or more rs
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
str_detect(fruits, "r+") # matches all strings that contain one or more rs --> this grabs pear as well, not there yet!
## [1] FALSE FALSE TRUE TRUE TRUE FALSE TRUE
str_detect(fruits, "r{2}") # matches all strings that contain two rs in a row --> this grabs only the berries, yeah!
## [1] FALSE FALSE FALSE TRUE TRUE FALSE FALSE
This looks great! We now know the most important regular expressions! If you feel like you need even more advanced regular expressions, you can look them up in this awesome stringr cheat sheet.
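One caveat on the last example: “r{2}” looks for two rs in a row, not for two rs anywhere in the string. That is why rhubarb, which contains two separate rs, was not matched. If the goal is to find all strings with more than one r at any position, one possible approach (a sketch, not the only way) is to count the matches with str_count() and compare the counts.

```r
library(stringr)

fruits <- c("banana", "apple", "pear", "strawberry", "raspberry", "kiwi", "rhubarb")

# Two consecutive rs: only the berries match
double_r <- str_detect(fruits, "r{2}")
double_r

# Number of rs per string, wherever they occur
r_counts <- str_count(fruits, "r")
r_counts

# More than one r at any position: now rhubarb matches as well
more_than_one_r <- r_counts > 1
fruits[more_than_one_r]
```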
6.5 Take-Aways
- String patterns & RegEx: String patterns are sequences of characters; regular expressions are a type of string pattern used to match or detect other string patterns in texts.
- Important regular expressions: The best overview of all regex options can be found on the stringr cheat sheet.