6 Tutorial: Text manipulation with stringr
After working through Tutorial 6, you’ll…
- understand the concept of string patterns and regular expressions
- know how to search for string patterns
6.1 What’s stringr?
The stringr
package is another package of the tidyverse
family,
i.e., it comes pre-installed with the tidyverse. The package offers a
neat set of functions that makes working with strings really simple for
beginners. Therefore, stringr
is a good place to start getting into
text data management. A string is a data type that is used to represent
text rather than numbers.
6.2 Working with strings
First, let’s create a vector that contains some strings and print the vector to the console!
<- c("banana", "apple", "pear", "strawberry", "raspberry", "kiwi", "rhubarb")
fruits
fruits
## [1] "banana" "apple" "pear" "strawberry" "raspberry"
## [6] "kiwi" "rhubarb"
First, we want to know how long each of these strings is, i.e., how many
characters the elements of the fruits vector contain. We will use the
str_length()
function.
str_length(fruits)
## [1] 6 5 4 10 9 4 7
That was easy. Next, we want to join multiple strings into a single
string. We will use the str_c
function.
str_c(fruits, collapse = " and ")
## [1] "banana and apple and pear and strawberry and raspberry and kiwi and rhubarb"
# If collapse is not NULL, it will be inserted between elements of the result, here: and
You can also change the order of the str_c
function:
str_c("My favourite fruit is: ", fruits, collapse = NULL)
## [1] "My favourite fruit is: banana" "My favourite fruit is: apple"
## [3] "My favourite fruit is: pear" "My favourite fruit is: strawberry"
## [5] "My favourite fruit is: raspberry" "My favourite fruit is: kiwi"
## [7] "My favourite fruit is: rhubarb"
Let’s say, we want to extract substrings from a character vector. For
example, we only want to keep the second to fourth letter of each
string. We can use the str_sub
function to achieve that.
str_sub(fruits, start = 2, end = 4)
## [1] "ana" "ppl" "ear" "tra" "asp" "iwi" "hub"
6.3 Working with string patterns
Often we want to search for certain string patterns in a text document. String patterns are character sequences (for instance, letter, numbers, or special characters). Let’s assume for a second that we have misspelled one of our fruits (banana, we have switched the na letters to an).
<- c("baanan", "apple", "pear", "strawberry", "raspberry", "kiwi", "rhubarb") misspelled_fruits
It would be really great if we could search for the string pattern
“an” and replace it with the string pattern “na” automatically,
wouldn’t it? Well, stringr
offers some functions to do just that. For
example, str_detect()
tells you if there’s any match to the pattern.
str_detect(misspelled_fruits, "an")
## [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Now we know that there is one word that contain letters that match the
“an” pattern. We have a misspelling! What is that word? str_subset()
extracts the matching strings, so that we can find out.
str_subset(misspelled_fruits, "an")
## [1] "baanan"
Of course, it’s banana. But how many times has “an” been misspelled
in banana? Just once? Let’s find out with str_count(
, which counts
the number of patterns in each string.
str_count(misspelled_fruits, "an")
## [1] 2 0 0 0 0 0 0
Two times! Let’s fix that with the str_replace
function.
<- str_replace(misspelled_fruits, "anan", "nana")
misspelled_fruits
misspelled_fruits
## [1] "banana" "apple" "pear" "strawberry" "raspberry"
## [6] "kiwi" "rhubarb"
Perfect! We have fixed our misspelled fruits. However, keep in mind that pattern correction can mess up your string pretty badly if you are not cautious. Therefore, you should always explore your strings very thoroughly before replacing any string patterns. For example, let’s see what our pattern detection will uncover if our misspelled fruits would contain an additional orange, which has been spelled correctly:
<- c("baanan", "apple", "pear", "strawberry", "raspberry", "kiwi", "rhubarb", "orange")
misspelled_fruits
str_detect(misspelled_fruits, "an")
## [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
Now, str_detect()
matches two words with the “an” pattern, but the
latter is not a misspelling! So always be careful!
As a final lesson, you can also split a string into multiple strings
based on certain string patterns using the str_split()
function:
<- c("banana, apple, pear, strawberry, raspberry, kiwi, rhubarb, orange")
cs_fruits str_split(cs_fruits, ",")
## [[1]]
## [1] "banana" " apple" " pear" " strawberry" " raspberry"
## [6] " kiwi" " rhubarb" " orange"
6.4 Working with regular expressions
Often, we want to match more complicated string patterns than a simple “an”. For example, we might wish to detect all strings in our text document that do not start with “RT”, because “RT” at the beginning of a string implies a retweet rather than an original tweet when analyzing Twitter data. Arguably, we are often not really interested in analyzing retweets (but sometimes we are, it depends on what you want to analyze).
To search and match complex string patterns, we need regular expressions. Regular expressions (short: regex) are a concise language for describing patterns of text. Regex should not be taken literally, but have a non-literal meaning.
6.4.1 Anchors, alternates, and quantifiers
Let’s keep working with our (non-misspelled) fruits vector to display
what regex can do. We will first learn what “anchors” are. Anchors are used to match
the beginning or end of a string, i.e., they denote the start or end of a pattern and are usually used in combination with other patterns. First, let’s look for all strings that start with the
letter b using the ^
(start of string) anchor.
str_detect(fruits, "^b") # ^ stands for "start of string", i.e. we are matching for strings that start with the letter b
## [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE
That’s on point because only our first entry, banana, starts with b
and str_detect()
matched that correctly! With a similar approach, you
can find all fruits that end with the letter b:
str_detect(fruits, "b$") # $ stands for "end of string", i.e. we are matching for strings that end with the letter b
## [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE
Again, we have a perfect match of the only fruit that ends with the
letter b: rhubarb. Please, note the difference to not matching these
two regexes (^
and $
), but the simple string pattern “b”:
str_detect(fruits, "b") # matches all strings that contain the letter b at any place
## [1] TRUE FALSE FALSE TRUE TRUE FALSE TRUE
This has matched banana, strawberry, raspberry, and rhubarb, because all
of these fruits contain a letter b at some place. Finally, you should
also not confuse “^b” with “[^b]”, because [^]
stands for
“anything but” in regex language.
<- c("banana", "apple", "pear", "strawberry", "raspberry", "kiwi", "rhubarb", "b", "bbb")
fruits
str_detect(fruits, "[^b]") # matches all strings that contain any letters different from b(s)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
Next, let’s look at “alternates” in regex. Alternates are used to match multiple characters in a single position. They are specified in square brackets, with a list of characters that could be matched. For example, the regular expression “[bc]ow” will match any string that starts with b, or c, followed by ow. This means it will match both bow and cow.
Let’s look at our fruits again and learn about the very powerful “match any one of” regex []
.
We will now try to find all the fruits that contain the letter b or the letter e at any position of the string.
<- c("banana", "apple", "pear", "strawberry", "raspberry", "kiwi", "rhubarb")
fruits
str_detect(fruits, "[be]") # matches all strings that contain either the letter b or the letter e
## [1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE
Our regex has managed to match all strings that contain either the
letter b or the letter e, which leaves only kiwi to be FALSE. Next, we
can match all fruits that contain letters that range between s to w
in the ABC. We will need to use the range operator [-]
:
str_detect(fruits, "[s-w]") # matches all strings that contain either the letter s, t, u, v, or, w
## [1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE
Again, we have a perfect match! The str_detect
has successfully
matched “strawberry”, “raspberry”, “kiwi”, and “rhubarb”. Next, we might
want to match all fruits that contain more than one r, i.e., we want
to match strawberry and raspberry, but not pear. This is where the
?
operator (zero or one) *
operator (zero or more), the +
operator
(one or more), and the {n}
operator (exactly n) come in handy. These regex are called “quantifiers” and are used to specify how many of the preceding characters should be matched in a string.
str_detect(fruits, "r?") # matches all strings that contain zero or one rs
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
str_detect(fruits, "r*") # matches all strings that contain zero or more rs
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
str_detect(fruits, "r+") # matches all strings that contain one or more rs --> this grabs pear as well, not there yet!
## [1] FALSE FALSE TRUE TRUE TRUE FALSE TRUE
str_detect(fruits, "r{2}") # matches all strings that contain exactly two rs --> this grabs only the berries, yeah!
## [1] FALSE FALSE FALSE TRUE TRUE FALSE FALSE
This looks great! We now know what what anchors, alternates, and quantifiers in regex are!
6.4.2 Look arounds and groups
Another useful type of regex are “look arounds” and “groups”. Look arounds are used to match patterns before or after a certain character or pattern, and groups are used to group elements of a pattern and specify the order in which they should be evaluated.
One powerful loook around regex is ?=
, wich means “followed by” in regex language. For example, a(?=c)
will match any a that is being followed by c, i.e., it will match words like “account” or “accurate”. Let’s try to grab all the berries again, but this time we won’t use a quantifier regex, but a look around.
str_detect(fruits, "e(?=r)") # matches all strings that contain an e followed by an r
## [1] FALSE FALSE FALSE TRUE TRUE FALSE FALSE
Once again, we detect only the two berries in our fruits vector and ignore the pear. It’s just another way of reaching the same goal. Pro tip: You can also negate the “followed by” regex with a !
turning it into a “not followed by” (because !
is the universal negation symbol; do you remember how !
negates filters in dplyr
, e.g., filter(gender != “male”)?). For example:
str_detect(fruits, "e(?!r)") # matches all strings that contain an e not followed by an r
## [1] FALSE TRUE TRUE FALSE FALSE FALSE FALSE
Finally, we can use the “preceded by” regex: ?<=
. Similar to “followed by”, it is used to find a pattern that is preceded by a specific string or character. Let’s match our berries again, but let’s use the “preceeded by” regex instead of the “followed by” regex.
str_detect(fruits, "(?<=e)r") # matches all strings that contain an r preceeded by an e
## [1] FALSE FALSE FALSE TRUE TRUE FALSE FALSE
Of course, you could also negate this into “not preceeded by” with the help of !
:
str_detect(fruits, "(?<!e)r") # matches all strings that contain an r that is not preceeded by an e
## [1] FALSE FALSE TRUE TRUE TRUE FALSE TRUE
Finally, let’s turn to “groups”, which are used to group elements of a pattern and specify the order in which they should be evaluated.
str_detect(fruits, "(e|a)r") # matches all strings that contain an e or an a which are followed by an r
## [1] FALSE FALSE TRUE TRUE TRUE FALSE TRUE
If you remember “alternates” regex from above, you could also write this expression this way:
str_detect(fruits, "[ea]r") # matches all strings that contain an e or an a which are followed by an r
## [1] FALSE FALSE TRUE TRUE TRUE FALSE TRUE
However, groups can become handy if you need to chain multiple groups for complex patterns. This is very advanced stuff, but I wanted to let you know that it exists.
This was a lot, so here comes an overview / summary of all the regex (screenshot of the stringr
cheat sheet):
Screenshot of anchors, alternates, quantifiers, look arounds, and groups (stringr cheat sheet): |
6.4.3 Match specific types of characters
There is one last thing I would like to teach you. It’s how to match specific types of characters (e.g., numbers, space symbols, etc.). Let’s first change our fruits vector. I’ll add some numbers, spaces, and slashes.
<- c("ban/ana", "apple", "pear2", "strawberry ", "r0spberry", "k1w1", "rhubarb") fruits
Here comes an overview over the types of characters that you can match and how to address them:
Screenshot of types of characters that can be matched in stringr (stringr cheat sheet): |
Let’s try to detect strings that contain a number, i.e., [:digit:].
str_detect(fruits, "[:digit:]") # matches all strings that contain a number
## [1] FALSE FALSE TRUE FALSE TRUE TRUE FALSE
Let’s try to detect strings that contain a forwardslash /.
str_detect(fruits, "[:punct:]") # matches all strings that contain a punctuation symbol, such as a forwardslash: /
## [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Finally, we can detect all fruits that are written with a space symbol:
str_detect(fruits, "[:space:]") # matches all strings that contain a space symbol
## [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE
Finished! Repeat and practice these commands and you will become a real expert on regular expressions! This tutorial provided an in-depth overview of regular expressions, with only a very few aspects left to explore. If you feel like you need to look up some regular expressions, you can find them in this awesome stringr cheat sheet.
6.5 Take-Aways
- String patterns & RegEx: String patterns are sequences of characters; regular expressions are a type of abstract, generalized string pattern used to match or detect other string patterns in texts.
- Important regular expressions: Anchors, alternates, quantifiers, look arounds and groups. The best overview of all regex options can be found in the stringr cheat sheet.
6.6 Additional tutorials
You still have questions? The following tutorials, books, & tools may help you:
Now let’s see what you’ve learned so far: Exercise 5: Test your knowledge.