7 Exercise 5: Test your knowledge

First, load the library stringr (or the entire tidyverse).

library(tidyverse)

We will work on real data in this exercise. But before that, we will test our regex on a simple “test” vector that contains hyperlinks of media outlets and webpages. So run this code and create a vector called “seitenURL” in your environment:

seitenURL <- c(
  "https://www.bild.de",
  "https://www.kicker.at/sport",
  "http://www.mycoffee.net/fresh",
  "http://www1.neuegedanken.blog",
  "https://home.1und1.info/magazine/unterhaltung",
  "https://de.sputniknews.com/technik",
  "https://rp-online.de/nrw/staedte/geldern",
  "https://www.bzbasel.ch/limmattal/region-limmattal",
  "http://sportdeutschland.tv/wm-maenner-2019"
)

7.1 Task 1

Let’s get rid of the first part of the URL strings in our “seitenURL” vector, i.e., everything that comes before the name of the outlet (you will need to work with str_replace). Usually, this is some version of “https://www.” or “http://www.” Try to make your pattern as versatile as possible! In a big data set, there will always be exceptions to the rule (e.g., “http://sportdeutschland.tv”), so wherever you can, match types of characters (regex character classes and quantifiers) instead of literal characters.
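As a hint rather than a solution, here is a small sketch of the difference between a literal pattern and a pattern built from character classes. The vector “files” and its contents are invented for illustration and are not part of the exercise data.

```r
library(stringr)

# Hypothetical toy vector (an assumption for illustration only)
files <- c("img_001_beach.png", "img_27_city.png", "scan9_forest.png")

# Literal pattern: only removes the exact prefix "img_001_"
literal <- str_replace(files, "img_001_", "")

# Character classes: lowercase letters, an optional underscore,
# one or more digits, then an underscore, anchored at the start
stripped <- str_replace(files, "^[a-z]+_?[0-9]+_", "")

literal
stripped
```

The literal pattern only cleans the first element, while the character-class pattern handles all three prefixes. The same idea should carry over to the URL prefixes in “seitenURL”.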

7.2 Task 2

Using the seitenURL vector, try to get rid of the characters that follow the media outlet (e.g., “.de” or “.com”).
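Again as a hint only, the following sketch removes everything from the first dot to the end of a string. The vector “docs” is a made-up example; note that the dot has to be escaped, because an unescaped dot matches any character.

```r
library(stringr)

# Hypothetical toy vector (an assumption for illustration only)
docs <- c("alpha.txt", "beta.tar.gz", "gamma.md/old")

# "\\." is a literal dot, ".*" is any run of characters,
# and "$" anchors the match at the end of the string
trimmed <- str_replace(docs, "\\..*$", "")

trimmed
```

Whether this exact pattern is enough for the URLs depends on how well the prefix was removed in Task 1, so treat it as a starting point.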

7.3 Task 3

Now download our newest data set from LRZ Sync and Share and load it into R as an object called “data”. It is a heavily reduced version of the original data set that I worked with for a project. The data set investigates what kind of outlets and webpages are most often shared and engaged with on Twitter and Facebook. We have tracked every webpage, from big players like BILD to small, private blogs. The outlets are still hidden in the URLs, so we need to extract them first using the regex patterns that you’ve just created.

Use this command and adapt it with your patterns (it uses str_replace_all in combination with mutate):

data2 <- data %>%
  mutate(seitenURL = str_replace_all(seitenURL, "your https pattern here", "")) %>% 
  mutate(seitenURL = str_replace_all(seitenURL, "your .de/.com pattern here", "")) 
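To see how mutate and str_replace_all work together before running them on the real data, here is a minimal sketch on an invented tibble. The object “toy” and its column “code” are assumptions for illustration only.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data (not the exercise data set)
toy <- tibble(code = c("ID-001", "ID-042", "ID-777"))

toy2 <- toy %>%
  # first pass: drop the uppercase prefix and the hyphen
  mutate(code = str_replace_all(code, "^[A-Z]+-", "")) %>%
  # second pass: drop leading zeros from what is left
  mutate(code = str_replace_all(code, "^0+", ""))

toy2
```

Each mutate call overwrites the column with the cleaned version, so the second pattern is applied to the result of the first.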

7.4 Task 4

The data set provides two additional, highly informative columns: “SeitenAnzahlTwitter” and “SeitenAnzahlFacebook”. These columns show the number of reactions (shares, likes, comments) for each URL. Having extracted the media outlets, let us examine which media outlet got the most engagement on Twitter and Facebook across all of its URLs. Use your dplyr skills to create an R object named “overview” that stores summary statistics of the engagement on Twitter and Facebook per media outlet (remember group_by and summarize!).

Next, arrange your “overview” data to reveal which media outlet creates the most engagement on Twitter. Do the same for Facebook.
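If you need a reminder of how group_by, summarize, and arrange fit together, the following sketch shows the pattern on invented data. The tibble “toy” and its columns “shop”, “online”, and “offline” are hypothetical, and summing is only one possible summary statistic.

```r
library(dplyr)

# Hypothetical toy data (not the exercise data set)
toy <- tibble(
  shop    = c("A", "B", "A", "C", "B"),
  online  = c(10, 5, 20, 1, 7),
  offline = c(3, 8, 2, 9, 4)
)

# One row per shop with the summed counts
toy_overview <- toy %>%
  group_by(shop) %>%
  summarize(
    online_total  = sum(online),
    offline_total = sum(offline)
  )

# Sort descending so the largest total comes first
by_online  <- toy_overview %>% arrange(desc(online_total))
by_offline <- toy_overview %>% arrange(desc(offline_total))

by_online
by_offline
```

For the exercise, you would group by the outlet column and summarize the two engagement columns instead.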