7 Tutorial: Data management with tidycomm

After working through Tutorial 7, you’ll…

  • learn how to use the tidycomm package to perform descriptive analyses.
  • learn how to use tidycomm to transform / rescale variables.

If you haven’t installed the package on this machine yet, install it:

install.packages("tidycomm")

After installation, load the package:

library(tidycomm)

7.1 Index generation

The tidycomm package provides a workflow to quickly add mean/sum indices of variables to the dataset and compute reliability estimates for those added indices. The package offers two key functions:

  1. add_index() and
  2. get_reliability()

7.1.1 add_index():

The add_index() function adds a mean or sum index of the specified variables to the data. The second argument (or first, if used in a pipe) is the name of the index variable to be created. For example, if you want to create a mean index named ‘ethical_concerns’ using variables ‘ethics_1’ to ‘ethics_4’, you can use the following code:

WoJ %>%
  add_index(ethical_concerns, ethics_1, ethics_2, ethics_3, ethics_4) %>%
  dplyr::select(ethical_concerns,
                ethics_1,
                ethics_2,
                ethics_3,
                ethics_4)
## # A tibble: 1,200 × 5
##    ethical_concerns ethics_1 ethics_2 ethics_3 ethics_4
##               <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
##  1             2           2        3        2        1
##  2             1.5         1        2        2        1
##  3             2.25        2        4        2        1
##  4             1.75        1        3        1        2
##  5             2           2        3        2        1
##  6             3.25        2        4        4        3
##  7             2           1        3        2        2
##  8             3.5         2        4        4        4
##  9             1.75        1        2        1        3
## 10             3.25        1        4        4        4
## # ℹ 1,190 more rows

To create a sum index instead, set type = "sum":

WoJ %>%
  add_index(ethical_flexibility, ethics_1, ethics_2, ethics_3, ethics_4, type = "sum") %>%
  dplyr::select(ethical_flexibility,
                ethics_1,
                ethics_2,
                ethics_3,
                ethics_4)
## # A tibble: 1,200 × 5
##    ethical_flexibility ethics_1 ethics_2 ethics_3 ethics_4
##                  <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
##  1                   8        2        3        2        1
##  2                   6        1        2        2        1
##  3                   9        2        4        2        1
##  4                   7        1        3        1        2
##  5                   8        2        3        2        1
##  6                  13        2        4        4        3
##  7                   8        1        3        2        2
##  8                  14        2        4        4        4
##  9                   7        1        2        1        3
## 10                  13        1        4        4        4
## # ℹ 1,190 more rows

Make sure to save your index back into your original data set if you want to keep it for later use:

WoJ <- WoJ %>%
  add_index(ethical_concerns, ethics_1, ethics_2, ethics_3, ethics_4)

7.1.2 get_reliability()

The get_reliability() function computes reliability/internal consistency estimates for indices created with add_index(). The function outputs Cronbach’s α along with descriptives and index information.

WoJ %>%
  get_reliability(ethical_concerns)
## # A tibble: 1 × 5
##   Index            Index_of                              M    SD Cronbachs_Alpha
## * <chr>            <chr>                             <dbl> <dbl>           <dbl>
## 1 ethical_concerns ethics_1, ethics_2, ethics_3, et…  2.45 0.777           0.612

By default, if you pass no further arguments to the function, it will automatically compute reliability estimates for all indices created with add_index() found in the data.

WoJ %>%
  get_reliability()
## # A tibble: 1 × 5
##   Index            Index_of                              M    SD Cronbachs_Alpha
## * <chr>            <chr>                             <dbl> <dbl>           <dbl>
## 1 ethical_concerns ethics_1, ethics_2, ethics_3, et…  2.45 0.777           0.612

7.2 Rescaling of continuous scales

Tidycomm provides five functions to easily handle missing values (NAs), to transform continuous scales, and to standardize them:

  • setna_scale() allows you to set specific values to NA in selected variables or the entire data frame
  • reverse_scale() simply turns a scale upside down
  • minmax_scale() down- or upsizes a scale to new minimum/maximum while retaining distances
  • center_scale() subtracts the mean from each individual data point to center a scale at a mean of 0
  • z_scale() works just like center_scale() but also divides the result by the standard deviation to also obtain a standard deviation of 1 and make it comparable to other z-standardized distributions

7.2.1 setna_scale()

setna_scale() allows you to set specific values to NA in selected variables or the entire data frame. It is a more readable alternative to dplyr::mutate(na_if())

It works for numeric, character, and factor variables.

WoJ %>%
  dplyr::select(autonomy_emphasis) %>%
  setna_scale(autonomy_emphasis, value = 5)
## # A tibble: 1,200 × 2
##    autonomy_emphasis autonomy_emphasis_na
##  *             <dbl>                <dbl>
##  1                 4                    4
##  2                 4                    4
##  3                 4                    4
##  4                 5                   NA
##  5                 4                    4
##  6                 4                    4
##  7                 4                    4
##  8                 3                    3
##  9                 5                   NA
## 10                 4                    4
## # ℹ 1,190 more rows

7.2.1.1 Equivalent with mutate(na_if())

WoJ %>%
  select(autonomy_emphasis) %>%
  mutate(autonomy_emphasis = na_if(autonomy_emphasis, 5))
## # A tibble: 1,200 × 1
##    autonomy_emphasis
##                <dbl>
##  1                 4
##  2                 4
##  3                 4
##  4                NA
##  5                 4
##  6                 4
##  7                 4
##  8                 3
##  9                NA
## 10                 4
## # ℹ 1,190 more rows

To create a new variable instead of overwriting:

WoJ %>%
  dplyr::select(autonomy_emphasis) %>%
  setna_scale(autonomy_emphasis,
              value = 5,
              name = "new_na_autonomy")
## # A tibble: 1,200 × 2
##    autonomy_emphasis new_na_autonomy
##  *             <dbl>           <dbl>
##  1                 4               4
##  2                 4               4
##  3                 4               4
##  4                 5              NA
##  5                 4               4
##  6                 4               4
##  7                 4               4
##  8                 3               3
##  9                 5              NA
## 10                 4               4
## # ℹ 1,190 more rows

Multiple values can be replaced at once:

WoJ %>%
  setna_scale(value = c(2, 3, 4), overwrite = TRUE)
## # A tibble: 1,200 × 16
##    country   reach employment temp_contract autonomy_selection autonomy_emphasis
##  * <fct>     <fct> <chr>      <fct>                      <dbl>             <dbl>
##  1 Germany   Nati… Full-time  Permanent                      5                NA
##  2 Germany   Nati… Full-time  Permanent                     NA                NA
##  3 Switzerl… Regi… Full-time  Permanent                     NA                NA
##  4 Switzerl… Local Part-time  Permanent                     NA                 5
##  5 Austria   Nati… Part-time  Permanent                     NA                NA
##  6 Switzerl… Local Freelancer <NA>                          NA                NA
##  7 Germany   Local Full-time  Permanent                     NA                NA
##  8 Denmark   Nati… Full-time  Permanent                     NA                NA
##  9 Switzerl… Local Full-time  Permanent                      5                 5
## 10 Denmark   Nati… Full-time  Permanent                     NA                NA
## # ℹ 1,190 more rows
## # ℹ 10 more variables: ethics_1 <dbl>, ethics_2 <dbl>, ethics_3 <dbl>,
## #   ethics_4 <dbl>, work_experience <dbl>, trust_parliament <dbl>,
## #   trust_government <dbl>, trust_parties <dbl>, trust_politicians <dbl>,
## #   ethical_concerns <dbl>

It also works for categorical data:

WoJ %>%
  dplyr::select(country) %>%
  setna_scale(country,
              value = c("Germany", "Switzerland"))
## # A tibble: 1,200 × 2
##    country     country_na
##  * <fct>       <fct>     
##  1 Germany     <NA>      
##  2 Germany     <NA>      
##  3 Switzerland <NA>      
##  4 Switzerland <NA>      
##  5 Austria     Austria   
##  6 Switzerland <NA>      
##  7 Germany     <NA>      
##  8 Denmark     Denmark   
##  9 Switzerland <NA>      
## 10 Denmark     Denmark   
## # ℹ 1,190 more rows

7.2.2 reverse_scale()

The easiest one is to reverse your scale. You can just specify the scale and define the scale’s lower and upper end. Take autonomy_emphasis as an example that originally ranges from 1 to 5. We will reverse it to range from 5 to 1.

The function adds a new column named autonomy_emphasis_rev:

WoJ %>%
  reverse_scale(autonomy_emphasis,
                lower_end = 1,
                upper_end = 5) %>%
  dplyr::select(autonomy_emphasis,
                autonomy_emphasis_rev)
## # A tibble: 1,200 × 2
##    autonomy_emphasis autonomy_emphasis_rev
##                <dbl>                 <dbl>
##  1                 4                     2
##  2                 4                     2
##  3                 4                     2
##  4                 5                     1
##  5                 4                     2
##  6                 4                     2
##  7                 4                     2
##  8                 3                     3
##  9                 5                     1
## 10                 4                     2
## # ℹ 1,190 more rows

Alternatively, you can also specify the new column name manually:

WoJ %>%
  reverse_scale(autonomy_emphasis,
                name = "new_emphasis",
                lower_end = 1,
                upper_end = 5) %>%
  dplyr::select(autonomy_emphasis,
                new_emphasis)
## # A tibble: 1,200 × 2
##    autonomy_emphasis new_emphasis
##                <dbl>        <dbl>
##  1                 4            2
##  2                 4            2
##  3                 4            2
##  4                 5            1
##  5                 4            2
##  6                 4            2
##  7                 4            2
##  8                 3            3
##  9                 5            1
## 10                 4            2
## # ℹ 1,190 more rows

7.2.3 minmax_scale()

minmax_scale() just takes your continuous scale to a new range. For example, convert the 1-5 scale of autonomy_emphasis to a 1-10 scale while keeping the distances:

WoJ %>%
  minmax_scale(autonomy_emphasis,
               change_to_min = 1,
               change_to_max = 10) %>%
  dplyr::select(autonomy_emphasis,
                autonomy_emphasis_1to10)
## # A tibble: 1,200 × 2
##    autonomy_emphasis autonomy_emphasis_1to10
##                <dbl>                   <dbl>
##  1                 4                    7.75
##  2                 4                    7.75
##  3                 4                    7.75
##  4                 5                   10   
##  5                 4                    7.75
##  6                 4                    7.75
##  7                 4                    7.75
##  8                 3                    5.5 
##  9                 5                   10   
## 10                 4                    7.75
## # ℹ 1,190 more rows

7.2.4 center_scale()

center_scale() moves your continuous scale around a mean of 0:

WoJ %>%
  center_scale(autonomy_selection) %>%
  dplyr::select(autonomy_selection,
                autonomy_selection_centered)
## # A tibble: 1,200 × 2
##    autonomy_selection autonomy_selection_centered
##                 <dbl>                       <dbl>
##  1                  5                       1.12 
##  2                  3                      -0.876
##  3                  4                       0.124
##  4                  4                       0.124
##  5                  4                       0.124
##  6                  4                       0.124
##  7                  4                       0.124
##  8                  3                      -0.876
##  9                  5                       1.12 
## 10                  2                      -1.88 
## # ℹ 1,190 more rows

7.2.5 z_scale()

Finally, z_scale() does more or less the same but standardizes the outcome. To visualize this, we look at it with a visualized tab_frequencies():

WoJ %>%
  z_scale(autonomy_selection) %>%
  tab_frequencies(autonomy_selection,
                  autonomy_selection_z) %>%
  visualize()

7.3 Categorizing and recoding categorical scales

Beyond rescaling continuous variables, tidycomm also provides functions for handling categorical data: categorizing continuous variables, recoding categorical scales, and transforming them into dummy variables.

7.3.1 categorize_scale()

categorize_scale() converts numeric variables into categorical ones based on specified breaks and labels. The breaks are always included in the lowest category. For example, breaks = c(2, 3) with lower_end = 1 and upper_end = 5 creates intervals from 1 to <= 2, >2 to <= 3, and >3 to <= 5.

This is useful for grouping continuous data into categories (e.g., Low, Medium, High).

WoJ %>%
  dplyr::select(trust_parliament, trust_politicians) %>%
  categorize_scale(trust_parliament, trust_politicians,
                   lower_end = 1,
                   upper_end = 5,
                   breaks = c(2, 3),
                   labels = c("Low", "Medium", "High"))
## # A tibble: 1,200 × 4
##    trust_parliament trust_politicians trust_parliament_cat trust_politicians_cat
##  *            <dbl>             <dbl> <fct>                <fct>                
##  1                3                 3 Medium               Medium               
##  2                4                 3 High                 Medium               
##  3                4                 3 High                 Medium               
##  4                4                 3 High                 Medium               
##  5                3                 2 Medium               Low                  
##  6                4                 2 High                 Low                  
##  7                2                 2 Low                  Low                  
##  8                4                 3 High                 Medium               
##  9                1                 1 Low                  Low                  
## 10                3                 3 Medium               Medium               
## # ℹ 1,190 more rows

You can also give the resulting variable a custom name:

WoJ %>%
  dplyr::select(autonomy_selection) %>%
  categorize_scale(autonomy_selection,
                   lower_end = 1,
                   upper_end = 5,
                   breaks = c(2, 3, 4),
                   labels = c("Low", "Medium", "High", "Very High"),
                   name = "autonomy_in_categories")
## # A tibble: 1,200 × 2
##    autonomy_selection autonomy_in_categories_1
##  *              <dbl> <fct>                   
##  1                  5 Very High               
##  2                  3 Medium                  
##  3                  4 High                    
##  4                  4 High                    
##  5                  4 High                    
##  6                  4 High                    
##  7                  4 High                    
##  8                  3 Medium                  
##  9                  5 Very High               
## 10                  2 Low                     
## # ℹ 1,190 more rows

7.3.2 recode_cat_scale()

recode_cat_scale() transforms one or more categorical variables into new categories based on a mapping you define. It’s a simpler alternative to dplyr::mutate(case_when()), especially when recoding multiple variables.

You can specify: - assign: a named vector mapping old values to new ones - other: a value for all unmatched categories (defaults to NA) - overwrite: whether to overwrite the original variable - name: an optional name for the new variable

Example:

WoJ %>%
  recode_cat_scale(country,
                   assign = c("Germany" = "german",
                              "Switzerland" = "swiss"),
                   other = "other",
                   overwrite = TRUE)
## # A tibble: 1,200 × 16
##    country reach   employment temp_contract autonomy_selection autonomy_emphasis
##  * <fct>   <fct>   <chr>      <fct>                      <dbl>             <dbl>
##  1 german  Nation… Full-time  Permanent                      5                 4
##  2 german  Nation… Full-time  Permanent                      3                 4
##  3 swiss   Region… Full-time  Permanent                      4                 4
##  4 swiss   Local   Part-time  Permanent                      4                 5
##  5 other   Nation… Part-time  Permanent                      4                 4
##  6 swiss   Local   Freelancer <NA>                           4                 4
##  7 german  Local   Full-time  Permanent                      4                 4
##  8 other   Nation… Full-time  Permanent                      3                 3
##  9 swiss   Local   Full-time  Permanent                      5                 5
## 10 other   Nation… Full-time  Permanent                      2                 4
## # ℹ 1,190 more rows
## # ℹ 10 more variables: ethics_1 <dbl>, ethics_2 <dbl>, ethics_3 <dbl>,
## #   ethics_4 <dbl>, work_experience <dbl>, trust_parliament <dbl>,
## #   trust_government <dbl>, trust_parties <dbl>, trust_politicians <dbl>,
## #   ethical_concerns <dbl>

7.3.2.1 Equivalent with mutate(case_when())

To achieve the same result manually using dplyr::mutate(case_when()):

WoJ %>%
  dplyr::mutate(
    country = dplyr::case_when(
      country == "Germany"     ~ "german",
      country == "Switzerland" ~ "swiss",
      is.na(country)           ~ NA_character_,  # keep missing as NA
      TRUE                     ~ "other"         # everything else → "other"
    )
  )
## # A tibble: 1,200 × 16
##    country reach   employment temp_contract autonomy_selection autonomy_emphasis
##    <chr>   <fct>   <chr>      <fct>                      <dbl>             <dbl>
##  1 german  Nation… Full-time  Permanent                      5                 4
##  2 german  Nation… Full-time  Permanent                      3                 4
##  3 swiss   Region… Full-time  Permanent                      4                 4
##  4 swiss   Local   Part-time  Permanent                      4                 5
##  5 other   Nation… Part-time  Permanent                      4                 4
##  6 swiss   Local   Freelancer <NA>                           4                 4
##  7 german  Local   Full-time  Permanent                      4                 4
##  8 other   Nation… Full-time  Permanent                      3                 3
##  9 swiss   Local   Full-time  Permanent                      5                 5
## 10 other   Nation… Full-time  Permanent                      2                 4
## # ℹ 1,190 more rows
## # ℹ 10 more variables: ethics_1 <dbl>, ethics_2 <dbl>, ethics_3 <dbl>,
## #   ethics_4 <dbl>, work_experience <dbl>, trust_parliament <dbl>,
## #   trust_government <dbl>, trust_parties <dbl>, trust_politicians <dbl>,
## #   ethical_concerns <dbl>

If you want to create a new column instead of overwriting:

WoJ %>%
  dplyr::mutate(
    country_rec = dplyr::case_when(
      country == "Germany"     ~ "german",
      country == "Switzerland" ~ "swiss",
      is.na(country)           ~ NA_character_,
      TRUE                     ~ "other"
    )
  )
## # A tibble: 1,200 × 17
##    country   reach employment temp_contract autonomy_selection autonomy_emphasis
##    <fct>     <fct> <chr>      <fct>                      <dbl>             <dbl>
##  1 Germany   Nati… Full-time  Permanent                      5                 4
##  2 Germany   Nati… Full-time  Permanent                      3                 4
##  3 Switzerl… Regi… Full-time  Permanent                      4                 4
##  4 Switzerl… Local Part-time  Permanent                      4                 5
##  5 Austria   Nati… Part-time  Permanent                      4                 4
##  6 Switzerl… Local Freelancer <NA>                           4                 4
##  7 Germany   Local Full-time  Permanent                      4                 4
##  8 Denmark   Nati… Full-time  Permanent                      3                 3
##  9 Switzerl… Local Full-time  Permanent                      5                 5
## 10 Denmark   Nati… Full-time  Permanent                      2                 4
## # ℹ 1,190 more rows
## # ℹ 11 more variables: ethics_1 <dbl>, ethics_2 <dbl>, ethics_3 <dbl>,
## #   ethics_4 <dbl>, work_experience <dbl>, trust_parliament <dbl>,
## #   trust_government <dbl>, trust_parties <dbl>, trust_politicians <dbl>,
## #   ethical_concerns <dbl>, country_rec <chr>

7.3.2.2 Reverse ordinal categories

You can also reverse ordinal items easily:

WoJ %>%
  recode_cat_scale(ethics_1,
                   assign = c(`1` = 5, `2` = 4, `3` = 3, `4` = 2, `5` = 1),
                   overwrite = FALSE)
## # A tibble: 1,200 × 17
##    country   reach employment temp_contract autonomy_selection autonomy_emphasis
##  * <fct>     <fct> <chr>      <fct>                      <dbl>             <dbl>
##  1 Germany   Nati… Full-time  Permanent                      5                 4
##  2 Germany   Nati… Full-time  Permanent                      3                 4
##  3 Switzerl… Regi… Full-time  Permanent                      4                 4
##  4 Switzerl… Local Part-time  Permanent                      4                 5
##  5 Austria   Nati… Part-time  Permanent                      4                 4
##  6 Switzerl… Local Freelancer <NA>                           4                 4
##  7 Germany   Local Full-time  Permanent                      4                 4
##  8 Denmark   Nati… Full-time  Permanent                      3                 3
##  9 Switzerl… Local Full-time  Permanent                      5                 5
## 10 Denmark   Nati… Full-time  Permanent                      2                 4
## # ℹ 1,190 more rows
## # ℹ 11 more variables: ethics_1 <dbl>, ethics_2 <dbl>, ethics_3 <dbl>,
## #   ethics_4 <dbl>, work_experience <dbl>, trust_parliament <dbl>,
## #   trust_government <dbl>, trust_parties <dbl>, trust_politicians <dbl>,
## #   ethical_concerns <dbl>, ethics_1_rec <fct>

Equivalent in case_when():

WoJ %>%
  dplyr::mutate(
    ethics_1_rev = dplyr::case_when(
      ethics_1 == 1 ~ 5,
      ethics_1 == 2 ~ 4,
      ethics_1 == 3 ~ 3,
      ethics_1 == 4 ~ 2,
      ethics_1 == 5 ~ 1,
      is.na(ethics_1) ~ NA_real_
    )
  )
## # A tibble: 1,200 × 17
##    country   reach employment temp_contract autonomy_selection autonomy_emphasis
##    <fct>     <fct> <chr>      <fct>                      <dbl>             <dbl>
##  1 Germany   Nati… Full-time  Permanent                      5                 4
##  2 Germany   Nati… Full-time  Permanent                      3                 4
##  3 Switzerl… Regi… Full-time  Permanent                      4                 4
##  4 Switzerl… Local Part-time  Permanent                      4                 5
##  5 Austria   Nati… Part-time  Permanent                      4                 4
##  6 Switzerl… Local Freelancer <NA>                           4                 4
##  7 Germany   Local Full-time  Permanent                      4                 4
##  8 Denmark   Nati… Full-time  Permanent                      3                 3
##  9 Switzerl… Local Full-time  Permanent                      5                 5
## 10 Denmark   Nati… Full-time  Permanent                      2                 4
## # ℹ 1,190 more rows
## # ℹ 11 more variables: ethics_1 <dbl>, ethics_2 <dbl>, ethics_3 <dbl>,
## #   ethics_4 <dbl>, work_experience <dbl>, trust_parliament <dbl>,
## #   trust_government <dbl>, trust_parties <dbl>, trust_politicians <dbl>,
## #   ethical_concerns <dbl>, ethics_1_rev <dbl>

7.3.2.3 dummify_scale()

dummify_scale() converts categorical variables into dummy variables (0/1 indicators). Each category becomes its own column.

WoJ %>%
  dplyr::select(temp_contract) %>%
  dummify_scale(temp_contract)
## # A tibble: 1,200 × 3
##    temp_contract temp_contract_permanent temp_contract_temporary
##  * <fct>                           <int>                   <int>
##  1 Permanent                           1                       0
##  2 Permanent                           1                       0
##  3 Permanent                           1                       0
##  4 Permanent                           1                       0
##  5 Permanent                           1                       0
##  6 <NA>                               NA                      NA
##  7 Permanent                           1                       0
##  8 Permanent                           1                       0
##  9 Permanent                           1                       0
## 10 Permanent                           1                       0
## # ℹ 1,190 more rows

You can also first categorize a numeric variable and then create dummy variables from the categories:

WoJ %>%
  categorize_scale(autonomy_emphasis,
                   breaks = c(2, 3),
                   labels = c('low', 'medium', 'high')) %>%
  dummify_scale(autonomy_emphasis_cat) %>%
  dplyr::select(starts_with('autonomy_emphasis'))
## # A tibble: 1,200 × 5
##    autonomy_emphasis autonomy_emphasis_cat autonomy_emphasis_cat_low
##                <dbl> <fct>                                     <int>
##  1                 4 high                                          0
##  2                 4 high                                          0
##  3                 4 high                                          0
##  4                 5 high                                          0
##  5                 4 high                                          0
##  6                 4 high                                          0
##  7                 4 high                                          0
##  8                 3 medium                                        0
##  9                 5 high                                          0
## 10                 4 high                                          0
## # ℹ 1,190 more rows
## # ℹ 2 more variables: autonomy_emphasis_cat_medium <int>,
## #   autonomy_emphasis_cat_high <int>

7.3.3 Summary of all scaling functions

Function Purpose Typical Use
setna_scale() Replace values with NA Clean invalid data
reverse_scale() Flip numeric scales Reverse items
minmax_scale() Rescale numeric ranges Convert 1–5 to 0–100
center_scale() Center around mean = 0 Prepare for regression
z_scale() Center & standardize Compare across variables, prepare for regression
categorize_scale() Turn continuous into categories Group into Low/Med/High, etc.
recode_cat_scale() Recode categories Reorganize categorical data
dummify_scale() Create dummy variables E.g., for Regression / modeling

7.4 Univariate analysis

Univariate, descriptive analysis is usually the first step in data exploration. Tidycomm offers four basic functions to quickly output relevant statistics:

  1. describe(): For continuous variables
  2. tab_percentiles(): For continuous variables
  3. describe_cat(): For categorical variables
  4. tab_frequencies(): For categorical variables

7.4.1 Describe continuous variables

There are two options two describe continuous variables in tidycomm:

  1. describe() and
  2. tab_percentiles()

First, the describe() function outputs several measures of central tendency and variability for all variables named in the function call (i.e., mean, standard deviation, min, max, range, and quartiles). Here’s how you use it:

WoJ %>% describe(autonomy_emphasis, ethics_1, work_experience)
## # A tibble: 3 × 15
##   Variable            N Missing     M     SD   Min   Q25   Mdn   Q75   Max Range
## * <chr>           <int>   <int> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 autonomy_empha…  1195       5  4.08  0.793     1     4     4     5     5     4
## 2 ethics_1         1200       0  1.63  0.892     1     1     1     2     5     4
## 3 work_experience  1187      13 17.8  10.9       1     8    17    25    53    52
## # ℹ 4 more variables: CI_95_LL <dbl>, CI_95_UL <dbl>, Skewness <dbl>,
## #   Kurtosis <dbl>

If no variables are passed to describe(), all numeric variables in the data are described:

WoJ %>% describe()
## # A tibble: 12 × 15
##    Variable           N Missing     M     SD   Min   Q25   Mdn   Q75   Max Range
##  * <chr>          <int>   <int> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 autonomy_sele…  1197       3  3.88  0.803     1  4      4       4     5     4
##  2 autonomy_emph…  1195       5  4.08  0.793     1  4      4       5     5     4
##  3 ethics_1        1200       0  1.63  0.892     1  1      1       2     5     4
##  4 ethics_2        1200       0  3.21  1.26      1  2      4       4     5     4
##  5 ethics_3        1200       0  2.39  1.13      1  2      2       3     5     4
##  6 ethics_4        1200       0  2.58  1.25      1  1.75   2       4     5     4
##  7 work_experien…  1187      13 17.8  10.9       1  8     17      25    53    52
##  8 trust_parliam…  1200       0  3.05  0.811     1  3      3       4     5     4
##  9 trust_governm…  1200       0  2.82  0.854     1  2      3       3     5     4
## 10 trust_parties   1200       0  2.42  0.736     1  2      2       3     4     3
## 11 trust_politic…  1200       0  2.52  0.712     1  2      3       3     4     3
## 12 ethical_conce…  1200       0  2.45  0.777     1  2      2.5     3     5     4
## # ℹ 4 more variables: CI_95_LL <dbl>, CI_95_UL <dbl>, Skewness <dbl>,
## #   Kurtosis <dbl>

You can also group data (using dplyr) before describing:

WoJ %>%
  dplyr::group_by(country) %>%
  describe(autonomy_emphasis)
## # A tibble: 5 × 16
## # Groups:   country [5]
##   country Variable     N Missing     M    SD   Min   Q25   Mdn   Q75   Max Range
## * <fct>   <chr>    <int>   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Austria autonom…   205       2  4.19 0.614     2     4     4     5     5     3
## 2 Denmark autonom…   375       1  3.90 0.856     1     4     4     4     5     4
## 3 Germany autonom…   172       1  4.34 0.818     1     4     5     5     5     4
## 4 Switze… autonom…   233       0  4.07 0.694     1     4     4     4     5     4
## 5 UK      autonom…   210       1  4.08 0.838     2     4     4     5     5     3
## # ℹ 4 more variables: CI_95_LL <dbl>, CI_95_UL <dbl>, Skewness <dbl>,
## #   Kurtosis <dbl>

Finally, you can create visualizations of your variables by applying the visualize() function to your data:

WoJ %>%
  describe(autonomy_emphasis) %>%
  visualize()

Second, the tab_percentiles function tabulates percentiles for at least one continuous variable:

WoJ %>%
  tab_percentiles(autonomy_emphasis)
## # A tibble: 1 × 11
##   Variable            p10   p20   p30   p40   p50   p60   p70   p80   p90  p100
## * <chr>             <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 autonomy_emphasis     3     4     4     4     4     4     4     5     5     5

You can use it to tabulate percentiles for more than one variable:

WoJ %>%
  tab_percentiles(autonomy_emphasis, ethics_1)
## # A tibble: 2 × 11
##   Variable            p10   p20   p30   p40   p50   p60   p70   p80   p90  p100
## * <chr>             <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 autonomy_emphasis     3     4     4     4     4     4     4     5     5     5
## 2 ethics_1              1     1     1     1     1     2     2     2     3     5

You can also specify your own percentiles:

WoJ %>%
  tab_percentiles(autonomy_emphasis, ethics_1, levels = c(0.16, 0.50, 0.84, 1.0))
## # A tibble: 2 × 5
##   Variable            p16   p50   p84  p100
## * <chr>             <dbl> <dbl> <dbl> <dbl>
## 1 autonomy_emphasis     3     4     5     5
## 2 ethics_1              1     1     2     5

And you can visualize your percentiles:

WoJ %>%
  tab_percentiles(autonomy_emphasis) %>%
  visualize()

7.4.2 Describe categorical variables

There are two options two describe categorical variables in tidycomm:

  1. describe_cat(): This function gives you a quick overview of one or more categorical variables.
  2. tab_frequencies(): The tab_frequencies() function creates a frequency table for one or more categorical variables.

This function gives you a quick overview of one or more categorical variables.
It reports, for each variable: - N: the number of valid observations
- Missing: the number of missing values
- Unique: the number of different categories
- Mode: the most common category
- Mode_N: the frequency of that most common category

In the example below using describe_cat(), country has 1200 valid observations, no missing values, 5 unique categories, and Denmark is the most frequent category with 376 cases:

WoJ %>%
  describe_cat(country)
## # A tibble: 1 × 6
##   Variable     N Missing Unique Mode    Mode_N
## * <chr>    <int>   <int>  <dbl> <chr>    <int>
## 1 country   1200       0      5 Denmark    376

Adding visualize() to the script plots the quantities, giving you a quick and intuitive bar chart:

WoJ %>%
  describe_cat(country) %>% 
  visualize()

The tab_frequencies() function creates a frequency table for one or more categorical variables.

For each category, it reports: - n: the absolute frequency
- percent: the relative frequency
- cum_n: the cumulative frequency
- cum_percent: the cumulative relative frequency

In the example below, you can see how the 5 countries in the WoJ sample are distributed, including their proportions and cumulative shares:

WoJ %>%
  tab_frequencies(country)
## # A tibble: 5 × 5
##   country         n percent cum_n cum_percent
## * <fct>       <int>   <dbl> <int>       <dbl>
## 1 Austria       207   0.172   207       0.172
## 2 Denmark       376   0.313   583       0.486
## 3 Germany       173   0.144   756       0.63 
## 4 Switzerland   233   0.194   989       0.824
## 5 UK            211   0.176  1200       1

Adding visualize() to the script plots the frequencies, giving you a quick and intuitive histogram:

WoJ %>%
  tab_frequencies(country) %>% 
  visualize()

7.5 How to easily copy-paste tidycomm commands

A simple way to explore and reuse tidycomm functions is to let RStudio help you. Here’s how you can quickly access ready-to-copy code examples:

  1. Start typing tidycomm:: in the console.
    RStudio will automatically show a list of all available functions in the tidycomm package.

  2. Choose the function you want to use, for example:
    tidycomm::recode_cat_scale().

  3. Open the help page for this function by adding a ? in front of it:

?tidycomm::recode_cat_scale()

This opens the documentation pane in RStudio. Scroll to the bottom of the help page. This is where the Examples section lives: a goldmine of ready-to-use code snippets.

These examples show you exactly how each function works and give you clean code you can copy and paste into your own script.

Honestly, because I’m terrible at memorizing even the functions I wrote myself, I use this trick all the time. :)

Now let’s see what you’ve learned so far: Exercise 6: tidycomm.