Introduction to summarytabl

Working with categorical variables

Let’s explore how to use cat_tbl() and cat_group_tbl() to summarize categorical variables. We’ll begin by summarizing a single categorical variable, race, from the nlsy dataset.

cat_tbl(data = nlsy, var = "race")

## # A tibble: 3 × 3
##   race                   count percent
##   <chr>                  <int>   <dbl>
## 1 Black                    868   0.292
## 2 Hispanic                 631   0.212
## 3 Non-Black,Non-Hispanic  1477   0.496

The function returns a tibble with three columns by default:

race: the categories of the variable being summarized (the column takes the variable's name)
count: the number of observations in each category of race
percent: the share of observations in each category of race, expressed as a proportion of the total (0.292 corresponds to 29.2%)

You can exclude certain values and eliminate missing values from the data using the ignore and na.rm arguments, respectively.

cat_tbl(data = nlsy, 
        var = "race",
        ignore = "Hispanic",
        na.rm = TRUE)

## # A tibble: 2 × 3
##   race                   count percent
##   <chr>                  <int>   <dbl>
## 1 Black                    868   0.370
## 2 Non-Black,Non-Hispanic  1477   0.630
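
The percentages in both tables can be reproduced by hand from the counts printed above, which shows what ignore does to the denominator: dropping a category also removes its observations from the total. The sketch below uses only base R and the counts shown in the output. It illustrates the arithmetic and is not meant to describe how cat_tbl() is implemented internally.

```r
# Counts copied from the first cat_tbl() output above
counts <- c(Black = 868, Hispanic = 631, "Non-Black,Non-Hispanic" = 1477)

# Share of each category relative to all 2,976 observations
all_share <- round(counts / sum(counts), 3)

# Drop "Hispanic": the denominator is now 868 + 1477 = 2345
kept <- counts[names(counts) != "Hispanic"]
kept_share <- round(kept / sum(kept), 3)

print(all_share)
print(kept_share)
```

The rounded shares match the percent columns of the two tables (0.292, 0.212, 0.496 and 0.370, 0.630).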

Suppose we want to create a contingency table to summarize two categorical variables. We can do this using the cat_group_tbl() function. In this example, we summarize race by bthwht. Before applying cat_group_tbl(), we’ll recode the values of bthwht, changing 0 to regular_birthweight and 1 to low_birthweight.

nlsy_cross_tab <- 
  nlsy |>
  dplyr::select(c(race, bthwht)) |>
  dplyr::mutate(bthwht = ifelse(bthwht == 0, "regular_bithweight", "low_birthweight")) 

cat_group_tbl(data = nlsy_cross_tab,
              row_var = "race",
              col_var = "bthwht")

## # A tibble: 6 × 4
##   race                   bthwht             count percent
##   <chr>                  <chr>              <int>   <dbl>
## 1 Black                  low_birthweight      102  0.0343
## 2 Black                  regular_bithweight   766  0.257 
## 3 Hispanic               low_birthweight       42  0.0141
## 4 Hispanic               regular_bithweight   589  0.198 
## 5 Non-Black,Non-Hispanic low_birthweight       83  0.0279
## 6 Non-Black,Non-Hispanic regular_bithweight  1394  0.468

The function returns a tibble with four columns by default:

race: the categories of the row_var variable
bthwht: the categories of the col_var variable
count: the number of observations for each combination of race and bthwht categories
percent: the share of observations for each combination of race and bthwht categories, expressed as a proportion of the total

To pivot the output to the wide format, set pivot = "wider".

cat_group_tbl(data = nlsy_cross_tab,
              row_var = "race",
              col_var = "bthwht",
              pivot = "wider")

## # A tibble: 3 × 5
##   race      count_bthwht_low_bir…¹ count_bthwht_regular…² percent_bthwht_low_b…³
##   <chr>                      <int>                  <int>                  <dbl>
## 1 Black                        102                    766                 0.0343
## 2 Hispanic                      42                    589                 0.0141
## 3 Non-Blac…                     83                   1394                 0.0279
## # ℹ abbreviated names: ¹count_bthwht_low_birthweight,
## #   ²count_bthwht_regular_bithweight, ³percent_bthwht_low_birthweight
## # ℹ 1 more variable: percent_bthwht_regular_bithweight <dbl>

To display only percentages, set only = "percent". You can also control how those percentages are calculated and displayed using the margins argument.

# Default: percentages across the full table sum to one
cat_group_tbl(data = nlsy_cross_tab,
              row_var = "race",
              col_var = "bthwht",
              pivot = "wider",
              only = "percent")

## # A tibble: 3 × 3
##   race                   percent_bthwht_low_birthweight percent_bthwht_regular…¹
##   <chr>                                           <dbl>                    <dbl>
## 1 Black                                          0.0343                    0.257
## 2 Hispanic                                       0.0141                    0.198
## 3 Non-Black,Non-Hispanic                         0.0279                    0.468
## # ℹ abbreviated name: ¹percent_bthwht_regular_bithweight

# Rowwise: percentages sum to one across columns within each row
cat_group_tbl(data = nlsy_cross_tab,
              row_var = "race",
              col_var = "bthwht",
              margins = "rows",
              pivot = "wider",
              only = "percent")

## # A tibble: 3 × 3
##   race                   percent_bthwht_low_birthweight percent_bthwht_regular…¹
##   <chr>                                           <dbl>                    <dbl>
## 1 Black                                          0.118                     0.882
## 2 Hispanic                                       0.0666                    0.933
## 3 Non-Black,Non-Hispanic                         0.0562                    0.944
## # ℹ abbreviated name: ¹percent_bthwht_regular_bithweight

# Columnwise: percentages within each column sum to one
cat_group_tbl(data = nlsy_cross_tab,
              row_var = "race",
              col_var = "bthwht",
              margins = "columns",
              pivot = "wider",
              only = "percent")

## # A tibble: 3 × 3
##   race                   percent_bthwht_low_birthweight percent_bthwht_regular…¹
##   <chr>                                           <dbl>                    <dbl>
## 1 Black                                           0.449                    0.279
## 2 Hispanic                                        0.185                    0.214
## 3 Non-Black,Non-Hispanic                          0.366                    0.507
## # ℹ abbreviated name: ¹percent_bthwht_regular_bithweight
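
The three margins settings correspond to the three ways base R's prop.table() can normalize a table of counts. The sketch below rebuilds the cross-tabulation from the counts in the long-format output above and reproduces each set of percentages. The short column labels "low" and "regular" are chosen here for readability, and the comparison with prop.table() is offered as an analogy, not as a description of the package internals.

```r
# Counts copied from the long-format cat_group_tbl() output above
counts <- matrix(
  c(102,  766,
     42,  589,
     83, 1394),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    race   = c("Black", "Hispanic", "Non-Black,Non-Hispanic"),
    bthwht = c("low", "regular")
  )
)

overall <- prop.table(counts)              # all six cells sum to one
by_row  <- prop.table(counts, margin = 1)  # each row sums to one
by_col  <- prop.table(counts, margin = 2)  # each column sums to one

print(round(overall, 4))
print(round(by_row, 3))
print(round(by_col, 3))
```

Comparing the three printed matrices with the three tables above shows that the same counts yield different percentages depending on which total serves as the denominator.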

Sometimes you may want to exclude specific values from your analysis. To do this, pass a named vector or list to ignore, where each name identifies the variable (row_var or col_var) and each value identifies what to exclude from it. In the example below, the Non-Black,Non-Hispanic category is excluded from the race variable (i.e., row_var). Setting na.rm.row_var to TRUE ensures that NAs are not returned in the final table.

cat_group_tbl(data = nlsy_cross_tab,
              row_var = "race",
              col_var = "bthwht",
              na.rm.row_var = TRUE,
              ignore = c(race = "Non-Black,Non-Hispanic"))

## # A tibble: 4 × 4
##   race     bthwht             count percent
##   <chr>    <chr>              <int>   <dbl>
## 1 Black    low_birthweight      102  0.0680
## 2 Black    regular_bithweight   766  0.511 
## 3 Hispanic low_birthweight       42  0.0280
## 4 Hispanic regular_bithweight   589  0.393

When you need to exclude more than one value from row_var or col_var, use a named list. In the example below, both the Non-Black,Non-Hispanic and Hispanic categories are excluded from the race variable.

cat_group_tbl(data = nlsy_cross_tab,
              row_var = "race",
              col_var = "bthwht",
              na.rm.row_var = TRUE,
              ignore = list(race = c("Non-Black,Non-Hispanic", "Hispanic")))

## # A tibble: 2 × 4
##   race  bthwht             count percent
##   <chr> <chr>              <int>   <dbl>
## 1 Black low_birthweight      102   0.118
## 2 Black regular_bithweight   766   0.882
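
One plausible reason a list is needed for multiple values comes from how base R builds named vectors: combining several values under one name changes the name, whereas a list keeps the values together under the original name. The sketch below shows that base R behavior only. Whether the package relies on it in exactly this way is an assumption.

```r
# A named character vector holds one value per name
single <- c(race = "Non-Black,Non-Hispanic")
names(single)      # "race"

# Combining two values under one name makes R append suffixes to the name
flattened <- c(race = c("Non-Black,Non-Hispanic", "Hispanic"))
names(flattened)   # "race1" "race2"

# A named list keeps both values together under the original name
grouped <- list(race = c("Non-Black,Non-Hispanic", "Hispanic"))
names(grouped)     # "race"
length(grouped$race)
```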

Working with multiple response and ordinal variables

Next, let’s explore how to use the select_tbl() and select_group_tbl() functions to summarize multiple response and ordinal variables. Multiple response and ordinal variables are commonly used in survey research, psychology, and health sciences. Examples include symptom checklists, scales like a depression index with multiple items, or questions allowing respondents to select all choices that apply to them.

The depressive dataset contains eight variables that share the same variable stem: dep, with each one representing a different item used to measure depression.

names(depressive)

##  [1] "cid"   "race"  "sex"   "yob"   "dep_1" "dep_2" "dep_3" "dep_4" "dep_5"
## [10] "dep_6" "dep_7" "dep_8"

Using the select_tbl() function, we can summarize participants’ responses to these items by showing how many respondents chose each answer option (i.e., value) for every variable.

select_tbl(data = depressive, var_stem = "dep")

## # A tibble: 24 × 4
##    variable values count percent
##    <chr>     <int> <int>   <dbl>
##  1 dep_1         1   109  0.0678
##  2 dep_1         2   689  0.429 
##  3 dep_1         3   809  0.503 
##  4 dep_2         1   144  0.0896
##  5 dep_2         2   746  0.464 
##  6 dep_2         3   717  0.446 
##  7 dep_3         1  1162  0.723 
##  8 dep_3         2   392  0.244 
##  9 dep_3         3    53  0.0330
## 10 dep_4         1   601  0.374 
## # ℹ 14 more rows

Alternatively, you can summarize specific variables by passing their names to the var_stem argument and setting var_input = "name".

select_tbl(data = depressive, 
           var_stem = c("dep_1", "dep_4", "dep_6"),
           var_input = "name")

## # A tibble: 9 × 4
##   variable values count percent
##   <chr>     <int> <int>   <dbl>
## 1 dep_1         1   117  0.0714
## 2 dep_1         2   703  0.429 
## 3 dep_1         3   818  0.499 
## 4 dep_4         1   608  0.371 
## 5 dep_4         2   854  0.521 
## 6 dep_4         3   176  0.107 
## 7 dep_6         1   398  0.243 
## 8 dep_6         2   872  0.532 
## 9 dep_6         3   368  0.225

By default, missing values are removed using listwise deletion. To switch to pairwise deletion instead, set na_removal = "pairwise".

select_tbl(data = depressive, 
           var_stem = "dep",
           na_removal = "pairwise")

## # A tibble: 24 × 4
##    variable values count percent
##    <chr>     <int> <int>   <dbl>
##  1 dep_1         1   120  0.0726
##  2 dep_1         2   709  0.429 
##  3 dep_1         3   825  0.499 
##  4 dep_2         1   151  0.0920
##  5 dep_2         2   762  0.464 
##  6 dep_2         3   728  0.444 
##  7 dep_3         1  1192  0.721 
##  8 dep_3         2   406  0.246 
##  9 dep_3         3    55  0.0333
## 10 dep_4         1   611  0.371 
## # ℹ 14 more rows
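
The difference between the two deletion strategies explains why the counts change: for dep_1, the listwise table totals 1,607 responses while the pairwise table totals 1,654. The sketch below illustrates the general idea in base R on a small made-up data frame. The toy data and the item_ column names are hypothetical, and the code shows the concept of listwise versus pairwise deletion, not the package's own implementation.

```r
# Hypothetical toy data: three items sharing the stem "item", with some NAs
toy <- data.frame(
  id     = 1:5,
  item_1 = c(1, 2, NA, 3, 1),
  item_2 = c(2, NA, 1, 3, 3),
  item_3 = c(1, 1, 2, NA, 3)
)
items <- toy[startsWith(names(toy), "item")]

# Listwise: keep only rows that are complete on every selected item
listwise <- items[complete.cases(items), ]
n_listwise <- vapply(listwise, function(x) sum(!is.na(x)), integer(1))

# Pairwise: each item keeps all of its own non-missing responses
n_pairwise <- vapply(items, function(x) sum(!is.na(x)), integer(1))

print(n_listwise)
print(n_pairwise)
```

In the toy data only two rows are complete on all three items, so listwise deletion leaves two responses per item, while pairwise deletion keeps four per item.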

To display the output in the wide format, set pivot = "wider".

select_tbl(data = depressive, 
           var_stem = "dep",
           na_removal = "pairwise",
           pivot = "wider")

## # A tibble: 8 × 7
##   variable count_value_1 count_value_2 count_value_3 percent_value_1
##   <chr>            <int>         <int>         <int>           <dbl>
## 1 dep_1              120           709           825          0.0726
## 2 dep_2              151           762           728          0.0920
## 3 dep_3             1192           406            55          0.721 
## 4 dep_4              611           856           181          0.371 
## 5 dep_5              206           574           871          0.125 
## 6 dep_6              399           879           371          0.242 
## 7 dep_7             1046           507            95          0.635 
## 8 dep_8              323           801           519          0.197 
## # ℹ 2 more variables: percent_value_2 <dbl>, percent_value_3 <dbl>
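
To make the reshaping concrete, the sketch below pivots a small long-format table into the wide layout using base R's reshape(). The counts for dep_1 and dep_2 are copied from the pairwise output above. Using reshape() with sep = "_value_" is just one way to mimic the column names shown and is not a claim about how the package performs the pivot.

```r
# Long-format counts for two items, copied from the pairwise output above
long <- data.frame(
  variable = rep(c("dep_1", "dep_2"), each = 3),
  values   = rep(1:3, times = 2),
  count    = c(120, 709, 825, 151, 762, 728),
  stringsAsFactors = FALSE
)

# One row per variable, one count column per response value
wide <- reshape(long,
                idvar = "variable",
                timevar = "values",
                direction = "wide",
                sep = "_value_")

print(wide)
```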

It’s common practice to group multiple response or ordinal variables by another variable. This type of descriptive analysis allows for meaningful comparisons across different segments of your dataset. With select_group_tbl(), you can create a summary table for multiple response and ordinal variables, grouped either by another variable in your dataset or by matching a pattern in the variable names. For example, we often want to summarize survey responses by race.

First, recode the race variable and the values of each of the eight depression index variables in the depressive dataset, replacing numeric codes with descriptive string labels for easier interpretation.

dep_recoded <- 
  depressive |>
  dplyr::mutate(
    race = dplyr::case_match(.x = race,
                             1 ~ "Hispanic", 
                             2 ~ "Black", 
                             3 ~ "Non-Black/Non-Hispanic",
                             .default = NA)
  ) |>
  dplyr::mutate(
    dplyr::across(
      .cols = dplyr::starts_with("dep"),
      .fns = ~ dplyr::case_when(.x == 1 ~ "often", 
                                .x == 2 ~ "sometimes", 
                                .x == 3 ~ "hardly ever")
    ))

Next, use the select_group_tbl() function to summarize responses for all eight variables by race:

select_group_tbl(data = dep_recoded, 
                 var_stem = "dep",
                 group = "race")

## # A tibble: 72 × 5
##    variable race                   values      count percent
##    <chr>    <chr>                  <chr>       <int>   <dbl>
##  1 dep_1    Black                  hardly ever   248  0.154 
##  2 dep_1    Black                  often          45  0.0280
##  3 dep_1    Black                  sometimes     194  0.121 
##  4 dep_1    Hispanic               hardly ever   187  0.116 
##  5 dep_1    Hispanic               often          28  0.0174
##  6 dep_1    Hispanic               sometimes     155  0.0965
##  7 dep_1    Non-Black/Non-Hispanic hardly ever   374  0.233 
##  8 dep_1    Non-Black/Non-Hispanic often          36  0.0224
##  9 dep_1    Non-Black/Non-Hispanic sometimes     340  0.212 
## 10 dep_2    Black                  hardly ever   234  0.146 
## # ℹ 62 more rows

As with select_tbl(), setting the pivot argument to "wider" reshapes the table into the wide format, while using "pairwise" for the na_removal argument ensures missing values are addressed through pairwise deletion.

select_group_tbl(data = dep_recoded, 
                 var_stem = "dep",
                 group = "race",
                 na_removal = "pairwise",
                 pivot = "wider")

## # A tibble: 24 × 8
##    variable values   count_race_Black count_race_Hispanic count_race_Non-Black…¹
##    <chr>    <chr>               <int>               <int>                  <int>
##  1 dep_1    hardly …              256                 190                    379
##  2 dep_1    often                  54                  28                     38
##  3 dep_1    sometim…              203                 159                    347
##  4 dep_2    hardly …              241                 172                    315
##  5 dep_2    often                  52                  38                     61
##  6 dep_2    sometim…              213                 165                    384
##  7 dep_3    hardly …               20                  20                     15
##  8 dep_3    often                 342                 252                    598
##  9 dep_3    sometim…              149                 105                    152
## 10 dep_4    hardly …               48                  40                     93
## # ℹ 14 more rows
## # ℹ abbreviated name: ¹`count_race_Non-Black/Non-Hispanic`
## # ℹ 3 more variables: percent_race_Black <dbl>, percent_race_Hispanic <dbl>,
## #   `percent_race_Non-Black/Non-Hispanic` <dbl>

The ignore argument can be used to exclude specific values from the analysis. In the example below, the value "often" is removed from all eight depression index variables, and the Non-Black/Non-Hispanic category is excluded from the race variable.

select_group_tbl(data = dep_recoded, 
                 var_stem = "dep",
                 group = "race",
                 na_removal = "pairwise",
                 pivot = "wider",
                 ignore = c(dep = "often", race = "Non-Black/Non-Hispanic"))

## # A tibble: 16 × 6
##    variable values      count_race_Black count_race_Hispanic percent_race_Black
##    <chr>    <chr>                  <int>               <int>              <dbl>
##  1 dep_1    hardly ever              256                 190             0.317 
##  2 dep_1    sometimes                203                 159             0.251 
##  3 dep_2    hardly ever              241                 172             0.305 
##  4 dep_2    sometimes                213                 165             0.269 
##  5 dep_3    hardly ever               20                  20             0.0680
##  6 dep_3    sometimes                149                 105             0.507 
##  7 dep_4    hardly ever               48                  40             0.0854
##  8 dep_4    sometimes                269                 205             0.479 
##  9 dep_5    hardly ever              253                 201             0.333 
## 10 dep_5    sometimes                182                 124             0.239 
## 11 dep_6    hardly ever              128                  95             0.190 
## 12 dep_6    sometimes                249                 200             0.371 
## 13 dep_7    hardly ever               38                  28             0.110 
## 14 dep_7    sometimes                152                 128             0.439 
## 15 dep_8    hardly ever              171                 127             0.238 
## 16 dep_8    sometimes                237                 182             0.331 
## # ℹ 1 more variable: percent_race_Hispanic <dbl>
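
The percentages in this table are computed over the cells that remain after the ignored values are dropped. The sketch below checks this for dep_1 in base R, using the pairwise counts printed in the earlier wide output. It reproduces the arithmetic only and is not the package's code.

```r
# Pairwise counts for dep_1, copied from the earlier wide output
dep_1 <- matrix(
  c(256, 190, 379,
     54,  28,  38,
    203, 159, 347),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    values = c("hardly ever", "often", "sometimes"),
    race   = c("Black", "Hispanic", "Non-Black/Non-Hispanic")
  )
)

# Drop the ignored response value and the ignored race category
kept <- dep_1[rownames(dep_1) != "often",
              colnames(dep_1) != "Non-Black/Non-Hispanic"]

# Percentages are taken over the 808 responses that remain
kept_share <- round(prop.table(kept), 3)
print(kept_share)
```

The Black column of the result (0.317 and 0.251) matches the percent_race_Black values shown for dep_1 above.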

When group_type is set to "variable" (the default), the margins argument controls how percentages are calculated and presented.

# Default: percentages across each variable sum to one
select_group_tbl(data = dep_recoded, 
                 var_stem = "dep",
                 group = "race",
                 na_removal = "pairwise",
                 pivot = "wider")

## # A tibble: 24 × 8
##    variable values   count_race_Black count_race_Hispanic count_race_Non-Black…¹
##    <chr>    <chr>               <int>               <int>                  <int>
##  1 dep_1    hardly …              256                 190                    379
##  2 dep_1    often                  54                  28                     38
##  3 dep_1    sometim…              203                 159                    347
##  4 dep_2    hardly …              241                 172                    315
##  5 dep_2    often                  52                  38                     61
##  6 dep_2    sometim…              213                 165                    384
##  7 dep_3    hardly …               20                  20                     15
##  8 dep_3    often                 342                 252                    598
##  9 dep_3    sometim…              149                 105                    152
## 10 dep_4    hardly …               48                  40                     93
## # ℹ 14 more rows
## # ℹ abbreviated name: ¹`count_race_Non-Black/Non-Hispanic`
## # ℹ 3 more variables: percent_race_Black <dbl>, percent_race_Hispanic <dbl>,
## #   `percent_race_Non-Black/Non-Hispanic` <dbl>

# Rowwise: for each value of the variable, the percentages 
# across all levels of the grouping variable sum to one
select_group_tbl(data = dep_recoded, 
                 var_stem = "dep",
                 group = "race",
                 margins = "rows",
                 na_removal = "pairwise",
                 pivot = "wider")

## # A tibble: 24 × 8
##    variable values   count_race_Black count_race_Hispanic count_race_Non-Black…¹
##    <chr>    <chr>               <int>               <int>                  <int>
##  1 dep_1    hardly …              256                 190                    379
##  2 dep_1    often                  54                  28                     38
##  3 dep_1    sometim…              203                 159                    347
##  4 dep_2    hardly …              241                 172                    315
##  5 dep_2    often                  52                  38                     61
##  6 dep_2    sometim…              213                 165                    384
##  7 dep_3    hardly …               20                  20                     15
##  8 dep_3    often                 342                 252                    598
##  9 dep_3    sometim…              149                 105                    152
## 10 dep_4    hardly …               48                  40                     93
## # ℹ 14 more rows
## # ℹ abbreviated name: ¹`count_race_Non-Black/Non-Hispanic`
## # ℹ 3 more variables: percent_race_Black <dbl>, percent_race_Hispanic <dbl>,
## #   `percent_race_Non-Black/Non-Hispanic` <dbl>

# Columnwise: for each level of the grouping variable, the percentages
# across all values of the variable sum to one
select_group_tbl(data = dep_recoded, 
                 var_stem = "dep",
                 group = "race",
                 margins = "columns",
                 na_removal = "pairwise",
                 pivot = "wider")

## # A tibble: 24 × 8
##    variable values   count_race_Black count_race_Hispanic count_race_Non-Black…¹
##    <chr>    <chr>               <int>               <int>                  <int>
##  1 dep_1    hardly …              256                 190                    379
##  2 dep_1    often                  54                  28                     38
##  3 dep_1    sometim…              203                 159                    347
##  4 dep_2    hardly …              241                 172                    315
##  5 dep_2    often                  52                  38                     61
##  6 dep_2    sometim…              213                 165                    384
##  7 dep_3    hardly …               20                  20                     15
##  8 dep_3    often                 342                 252                    598
##  9 dep_3    sometim…              149                 105                    152
## 10 dep_4    hardly …               48                  40                     93
## # ℹ 14 more rows
## # ℹ abbreviated name: ¹`count_race_Non-Black/Non-Hispanic`
## # ℹ 3 more variables: percent_race_Black <dbl>, percent_race_Hispanic <dbl>,
## #   `percent_race_Non-Black/Non-Hispanic` <dbl>
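
Because the percent columns are truncated in the three printouts above, the effect of margins is not visible there. The sketch below computes what each setting would give for a single cell of dep_1, assuming margins behaves as the code comments above describe. It uses base R's prop.table() on the printed counts and is an illustration, not the package's implementation.

```r
# Pairwise counts for dep_1, copied from the output above
dep_1 <- matrix(
  c(256, 190, 379,
     54,  28,  38,
    203, 159, 347),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    values = c("hardly ever", "often", "sometimes"),
    race   = c("Black", "Hispanic", "Non-Black/Non-Hispanic")
  )
)

overall <- prop.table(dep_1)              # default: the whole variable sums to one
by_row  <- prop.table(dep_1, margin = 1)  # "rows": each response value sums to one
by_col  <- prop.table(dep_1, margin = 2)  # "columns": each race group sums to one

# The same cell ("hardly ever", Black) under the three settings
cell <- c(overall = overall["hardly ever", "Black"],
          rows    = by_row["hardly ever", "Black"],
          columns = by_col["hardly ever", "Black"])
print(round(cell, 3))
```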

Another way to use select_group_tbl() is to summarize responses that match a specific pattern, such as survey waves or time points. To enable this feature, set group_type = "pattern" and provide the desired pattern in the group argument. For example, the stem_social_psych dataset contains variables that capture student responses about their sense of belonging in the STEM community at two distinct time points: “w1” and “w2”. You can summarize these responses using a pattern-based approach, where the time points (e.g., “w1” and “w2”) serve as grouping variables.

select_group_tbl(data = stem_social_psych, 
                 var_stem = "belong_belong",
                 group = "_w\\d",
                 group_type = "pattern")

## # A tibble: 10 × 5
##    variable             group values count percent
##    <chr>                <chr>  <dbl> <int>   <dbl>
##  1 belong_belongStem_w1 w1         1     5  0.0185
##  2 belong_belongStem_w1 w1         2    20  0.0741
##  3 belong_belongStem_w1 w1         3    59  0.219 
##  4 belong_belongStem_w1 w1         4   107  0.396 
##  5 belong_belongStem_w1 w1         5    79  0.293 
##  6 belong_belongStem_w2 w2         1    11  0.0407
##  7 belong_belongStem_w2 w2         2    11  0.0407
##  8 belong_belongStem_w2 w2         3    44  0.163 
##  9 belong_belongStem_w2 w2         4   113  0.419 
## 10 belong_belongStem_w2 w2         5    91  0.337
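
To see how a pattern can turn variable names into group labels, the sketch below applies the same regular expression to the two variable names from the output using base R. Stripping the leading underscore is an assumed clean-up step inferred from the group column above, and the code is a sketch of the idea, not the package's implementation.

```r
# Variable names as they appear in the output above
vars <- c("belong_belongStem_w1", "belong_belongStem_w2")

# Keep the names that start with the stem
matched <- vars[startsWith(vars, "belong_belong")]

# Pull out the part of each name that matches the pattern "_w\\d"
hits <- regmatches(matched, regexpr("_w\\d", matched))

# Assumed clean-up step: drop the leading underscore to get "w1" and "w2"
wave <- sub("^_", "", hits)

print(data.frame(variable = matched, group = wave, stringsAsFactors = FALSE))
```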

Use the group_name argument to assign a descriptive name to the column containing the matched pattern values.

select_group_tbl(data = stem_social_psych, 
                 var_stem = "belong_belong",
                 group = "_w\\d",
                 group_type = "pattern",
                 group_name = "wave")

## # A tibble: 10 × 5
##    variable             wave  values count percent
##    <chr>                <chr>  <dbl> <int>   <dbl>
##  1 belong_belongStem_w1 w1         1     5  0.0185
##  2 belong_belongStem_w1 w1         2    20  0.0741
##  3 belong_belongStem_w1 w1         3    59  0.219 
##  4 belong_belongStem_w1 w1         4   107  0.396 
##  5 belong_belongStem_w1 w1         5    79  0.293 
##  6 belong_belongStem_w2 w2         1    11  0.0407
##  7 belong_belongStem_w2 w2         2    11  0.0407
##  8 belong_belongStem_w2 w2         3    44  0.163 
##  9 belong_belongStem_w2 w2         4   113  0.419 
## 10 belong_belongStem_w2 w2         5    91  0.337

You can also include variable labels in your summary table by using the var_labels argument.

select_group_tbl(data = stem_social_psych, 
                 var_stem = "belong_belong",
                 group = "_w\\d",
                 group_type = "pattern",
                 group_name = "wave",
                 var_labels = c(
                   belong_belongStem_w1 = "I feel like I belong in STEM (wave 1)",
                   belong_belongStem_w2 = "I feel like I belong in STEM (wave 2)"
                 ))

## # A tibble: 10 × 6
##    variable             variable_label                wave  values count percent
##    <chr>                <chr>                         <chr>  <dbl> <int>   <dbl>
##  1 belong_belongStem_w1 I feel like I belong in STEM… w1         1     5  0.0185
##  2 belong_belongStem_w1 I feel like I belong in STEM… w1         2    20  0.0741
##  3 belong_belongStem_w1 I feel like I belong in STEM… w1         3    59  0.219 
##  4 belong_belongStem_w1 I feel like I belong in STEM… w1         4   107  0.396 
##  5 belong_belongStem_w1 I feel like I belong in STEM… w1         5    79  0.293 
##  6 belong_belongStem_w2 I feel like I belong in STEM… w2         1    11  0.0407
##  7 belong_belongStem_w2 I feel like I belong in STEM… w2         2    11  0.0407
##  8 belong_belongStem_w2 I feel like I belong in STEM… w2         3    44  0.163 
##  9 belong_belongStem_w2 I feel like I belong in STEM… w2         4   113  0.419 
## 10 belong_belongStem_w2 I feel like I belong in STEM… w2         5    91  0.337

Finally, use the only argument to choose what information to return.

# Default: counts and percentages
select_group_tbl(data = stem_social_psych, 
                 var_stem = "belong_belong",
                 group = "_w\\d",
                 group_type = "pattern",
                 group_name = "wave")

## # A tibble: 10 × 5
##    variable             wave  values count percent
##    <chr>                <chr>  <dbl> <int>   <dbl>
##  1 belong_belongStem_w1 w1         1     5  0.0185
##  2 belong_belongStem_w1 w1         2    20  0.0741
##  3 belong_belongStem_w1 w1         3    59  0.219 
##  4 belong_belongStem_w1 w1         4   107  0.396 
##  5 belong_belongStem_w1 w1         5    79  0.293 
##  6 belong_belongStem_w2 w2         1    11  0.0407
##  7 belong_belongStem_w2 w2         2    11  0.0407
##  8 belong_belongStem_w2 w2         3    44  0.163 
##  9 belong_belongStem_w2 w2         4   113  0.419 
## 10 belong_belongStem_w2 w2         5    91  0.337

# Counts only
select_group_tbl(data = stem_social_psych, 
                 var_stem = "belong_belong",
                 group = "_w\\d",
                 group_type = "pattern",
                 group_name = "wave",
                 only = "count")

## # A tibble: 10 × 4
##    variable             wave  values count
##    <chr>                <chr>  <dbl> <int>
##  1 belong_belongStem_w1 w1         1     5
##  2 belong_belongStem_w1 w1         2    20
##  3 belong_belongStem_w1 w1         3    59
##  4 belong_belongStem_w1 w1         4   107
##  5 belong_belongStem_w1 w1         5    79
##  6 belong_belongStem_w2 w2         1    11
##  7 belong_belongStem_w2 w2         2    11
##  8 belong_belongStem_w2 w2         3    44
##  9 belong_belongStem_w2 w2         4   113
## 10 belong_belongStem_w2 w2         5    91

# Percentages only
select_group_tbl(data = stem_social_psych, 
                 var_stem = "belong_belong",
                 group = "_w\\d",
                 group_type = "pattern",
                 group_name = "wave",
                 only = "percent")

## # A tibble: 10 × 4
##    variable             wave  values percent
##    <chr>                <chr>  <dbl>   <dbl>
##  1 belong_belongStem_w1 w1         1  0.0185
##  2 belong_belongStem_w1 w1         2  0.0741
##  3 belong_belongStem_w1 w1         3  0.219 
##  4 belong_belongStem_w1 w1         4  0.396 
##  5 belong_belongStem_w1 w1         5  0.293 
##  6 belong_belongStem_w2 w2         1  0.0407
##  7 belong_belongStem_w2 w2         2  0.0407
##  8 belong_belongStem_w2 w2         3  0.163 
##  9 belong_belongStem_w2 w2         4  0.419 
## 10 belong_belongStem_w2 w2         5  0.337

Working with continuous variables

Finally, let’s look at how to use the mean_tbl() and mean_group_tbl() functions to summarize continuous variables. The mean_tbl() function generates descriptive statistics either for a set of continuous variables that share a common stem or for individually named continuous variables. The resulting summary table reports each variable’s mean, standard deviation, minimum, maximum, and number of non-missing observations.

The sdoh dataset contains six variables describing characteristics of health care facilities, all of which begin with the prefix HHC_PCT. Using the mean_tbl() function, you can generate summary statistics for these variables:

mean_tbl(data = sdoh, var_stem = "HHC_PCT")

## # A tibble: 6 × 6
##   variable                  mean    sd   min   max  nobs
##   <chr>                    <dbl> <dbl> <dbl> <dbl> <int>
## 1 HHC_PCT_HHA_NURSING       58.2  49.3     0   100  3227
## 2 HHC_PCT_HHA_PHYS_THERAPY  56.7  48.8     0   100  3227
## 3 HHC_PCT_HHA_OCC_THERAPY   52.4  48.3     0   100  3227
## 4 HHC_PCT_HHA_SPEECH        49.1  47.6     0   100  3227
## 5 HHC_PCT_HHA_MEDICAL       42.2  46.2     0   100  3227
## 6 HHC_PCT_HHA_AIDE          55.1  48.6     0   100  3227
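
For readers who want to see what these columns represent, the sketch below computes the same five statistics in base R for a small made-up data frame. The toy data and the PCT_ column names are hypothetical, and missing values are dropped separately for each column. This is a conceptual illustration, not the package's implementation.

```r
# Hypothetical toy data; the column names and values are made up
toy <- data.frame(
  id        = 1:4,
  PCT_alpha = c(0, 50, 100, NA),
  PCT_beta  = c(20, 40, 60, 80)
)

# Select the columns that share the stem "PCT"
cols <- toy[startsWith(names(toy), "PCT")]

# One row of summary statistics per selected column,
# dropping missing values column by column
stats <- do.call(rbind, lapply(names(cols), function(nm) {
  x <- cols[[nm]]
  data.frame(
    variable = nm,
    mean = mean(x, na.rm = TRUE),
    sd   = sd(x, na.rm = TRUE),
    min  = min(x, na.rm = TRUE),
    max  = max(x, na.rm = TRUE),
    nobs = sum(!is.na(x)),
    stringsAsFactors = FALSE
  )
}))

print(stats)
```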

Alternatively, if you want to generate summary statistics for only a subset of those variables, you can specify their names directly in the var_stem argument and set var_input = "name" to indicate you’re referencing variable names rather than a shared stem.

mean_tbl(
  data = sdoh,
  var_stem = c("HHC_PCT_HHA_PHYS_THERAPY",
               "HHC_PCT_HHA_OCC_THERAPY",
               "HHC_PCT_HHA_SPEECH"),
  var_input = "name"
)

## # A tibble: 3 × 6
##   variable                  mean    sd   min   max  nobs
##   <chr>                    <dbl> <dbl> <dbl> <dbl> <int>
## 1 HHC_PCT_HHA_PHYS_THERAPY  56.7  48.8     0   100  3227
## 2 HHC_PCT_HHA_OCC_THERAPY   52.4  48.3     0   100  3227
## 3 HHC_PCT_HHA_SPEECH        49.1  47.6     0   100  3227

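You can also specify how missing values are removed using the na_removal argument. In this example both settings return the same table, because each variable has 3,227 non-missing observations either way.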

# Default listwise removal
mean_tbl(data = sdoh, var_stem = "HHC_PCT")

## # A tibble: 6 × 6
##   variable                  mean    sd   min   max  nobs
##   <chr>                    <dbl> <dbl> <dbl> <dbl> <int>
## 1 HHC_PCT_HHA_NURSING       58.2  49.3     0   100  3227
## 2 HHC_PCT_HHA_PHYS_THERAPY  56.7  48.8     0   100  3227
## 3 HHC_PCT_HHA_OCC_THERAPY   52.4  48.3     0   100  3227
## 4 HHC_PCT_HHA_SPEECH        49.1  47.6     0   100  3227
## 5 HHC_PCT_HHA_MEDICAL       42.2  46.2     0   100  3227
## 6 HHC_PCT_HHA_AIDE          55.1  48.6     0   100  3227

# Pairwise removal
mean_tbl(data = sdoh, 
         var_stem = "HHC_PCT",
         na_removal = "pairwise")

## # A tibble: 6 × 6
##   variable                  mean    sd   min   max  nobs
##   <chr>                    <dbl> <dbl> <dbl> <dbl> <int>
## 1 HHC_PCT_HHA_NURSING       58.2  49.3     0   100  3227
## 2 HHC_PCT_HHA_PHYS_THERAPY  56.7  48.8     0   100  3227
## 3 HHC_PCT_HHA_OCC_THERAPY   52.4  48.3     0   100  3227
## 4 HHC_PCT_HHA_SPEECH        49.1  47.6     0   100  3227
## 5 HHC_PCT_HHA_MEDICAL       42.2  46.2     0   100  3227
## 6 HHC_PCT_HHA_AIDE          55.1  48.6     0   100  3227

Consider adding variable labels using the var_labels argument to help make the variable names easier to interpret.

mean_tbl(data = sdoh, 
         var_stem = "HHC_PCT",
         na_removal = "pairwise",
         var_labels = c(
           HHC_PCT_HHA_NURSING="% agencies offering nursing care services",
           HHC_PCT_HHA_PHYS_THERAPY="% agencies offering physical therapy services",
           HHC_PCT_HHA_OCC_THERAPY="% agencies offering occupational therapy services",
           HHC_PCT_HHA_SPEECH="% agencies offering speech pathology services",
           HHC_PCT_HHA_MEDICAL="% agencies offering medical social services",
           HHC_PCT_HHA_AIDE="% agencies offering home health aide services"
         ))

## # A tibble: 6 × 7
##   variable                 variable_label           mean    sd   min   max  nobs
##   <chr>                    <chr>                   <dbl> <dbl> <dbl> <dbl> <int>
## 1 HHC_PCT_HHA_NURSING      % agencies offering nu…  58.2  49.3     0   100  3227
## 2 HHC_PCT_HHA_PHYS_THERAPY % agencies offering ph…  56.7  48.8     0   100  3227
## 3 HHC_PCT_HHA_OCC_THERAPY  % agencies offering oc…  52.4  48.3     0   100  3227
## 4 HHC_PCT_HHA_SPEECH       % agencies offering sp…  49.1  47.6     0   100  3227
## 5 HHC_PCT_HHA_MEDICAL      % agencies offering me…  42.2  46.2     0   100  3227
## 6 HHC_PCT_HHA_AIDE         % agencies offering ho…  55.1  48.6     0   100  3227

As with multiple response variables, it’s common to group continuous variables by another variable so you can compare segments of a dataset. The mean_group_tbl() function supports this kind of descriptive analysis by generating summary statistics for continuous variables, grouped either by a variable in the dataset or by a pattern matched in the variable names. For example, it’s often useful to present summary statistics by demographic categories such as region, gender, age, or race.

mean_group_tbl(data = sdoh, 
               var_stem = "HHC_PCT",
               group = "REGION",
               group_type = "variable")

## # A tibble: 24 × 7
##    variable                 REGION     mean    sd   min   max  nobs
##    <chr>                    <chr>     <dbl> <dbl> <dbl> <dbl> <int>
##  1 HHC_PCT_HHA_NURSING      Midwest    57.4  49.5     0   100  1055
##  2 HHC_PCT_HHA_NURSING      Northeast  74.2  43.9     0   100   217
##  3 HHC_PCT_HHA_NURSING      South      58.8  49.2     0   100  1422
##  4 HHC_PCT_HHA_NURSING      West       56    49.7     0   100   450
##  5 HHC_PCT_HHA_PHYS_THERAPY Midwest    55.2  48.9     0   100  1055
##  6 HHC_PCT_HHA_PHYS_THERAPY Northeast  68.0  43.1     0   100   217
##  7 HHC_PCT_HHA_PHYS_THERAPY South      58.4  49.0     0   100  1422
##  8 HHC_PCT_HHA_PHYS_THERAPY West       54.5  49.0     0   100   450
##  9 HHC_PCT_HHA_OCC_THERAPY  Midwest    52.9  48.7     0   100  1055
## 10 HHC_PCT_HHA_OCC_THERAPY  Northeast  64.8  42.8     0   100   217
## # ℹ 14 more rows
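
For readers who want to see what grouping by a variable amounts to, here is a minimal base R sketch that computes a mean and an observation count per group. The data frame and its columns (region, score) are made up for illustration, and tapply() stands in for whatever mean_group_tbl() does internally.

```r
# Hypothetical toy data; "region" plays the role of the grouping variable.
toy_df <- data.frame(
  region = c("East", "East", "West", "West", "West"),
  score  = c(10, 20, 30, 40, 50)
)

# Split the scores by region, then summarize each piece
group_mean <- tapply(toy_df$score, toy_df$region, mean)
group_nobs <- tapply(toy_df$score, toy_df$region, length)

data.frame(
  region = names(group_mean),
  mean   = as.numeric(group_mean),
  nobs   = as.integer(group_nobs)
)
```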

You can control which values to exclude and how missing data is handled using the ignore and na_removal arguments. To specify values to ignore, use a named vector or list, where each name corresponds to a variable stem or specific variable name. Use a list when you need to exclude several values from the same stem or group variable.

# Default listwise removal
mean_group_tbl(data = sdoh, 
               var_stem = "HHC_PCT",
               group = "REGION",
               ignore = c(HHC_PCT = 0, REGION = "Northeast"))

## # A tibble: 18 × 7
##    variable                 REGION   mean    sd    min   max  nobs
##    <chr>                    <chr>   <dbl> <dbl>  <dbl> <dbl> <int>
##  1 HHC_PCT_HHA_NURSING      Midwest 100    0    100      100   403
##  2 HHC_PCT_HHA_NURSING      South   100    0    100      100   681
##  3 HHC_PCT_HHA_NURSING      West    100    0    100      100   200
##  4 HHC_PCT_HHA_PHYS_THERAPY Midwest  97.7  7.15  50      100   403
##  5 HHC_PCT_HHA_PHYS_THERAPY South    99.2  4.78  50      100   681
##  6 HHC_PCT_HHA_PHYS_THERAPY West     98.3  5.31  60      100   200
##  7 HHC_PCT_HHA_OCC_THERAPY  Midwest  96.3 10.4   33.3    100   403
##  8 HHC_PCT_HHA_OCC_THERAPY  South    95.5 12.4   28.6    100   681
##  9 HHC_PCT_HHA_OCC_THERAPY  West     94.8 12.2   25      100   200
## 10 HHC_PCT_HHA_SPEECH       Midwest  91.9 16.2   33.3    100   403
## 11 HHC_PCT_HHA_SPEECH       South    93.4 15.3   25      100   681
## 12 HHC_PCT_HHA_SPEECH       West     91.0 17.2   20      100   200
## 13 HHC_PCT_HHA_MEDICAL      Midwest  82.4 23.8    9.09   100   403
## 14 HHC_PCT_HHA_MEDICAL      South    89.4 18.6   16.7    100   681
## 15 HHC_PCT_HHA_MEDICAL      West     92.6 15.3   33.3    100   200
## 16 HHC_PCT_HHA_AIDE         Midwest  97.3  8.97  50      100   403
## 17 HHC_PCT_HHA_AIDE         South    96.1 10.3   42.9    100   681
## 18 HHC_PCT_HHA_AIDE         West     96.4  9.96  50      100   200

# Pairwise removal
mean_group_tbl(data = sdoh, 
               var_stem = "HHC_PCT",
               group = "REGION",
               na_removal = "pairwise",
               ignore = c(HHC_PCT = 0, REGION = "Northeast"))

## # A tibble: 18 × 7
##    variable                 REGION   mean    sd    min   max  nobs
##    <chr>                    <chr>   <dbl> <dbl>  <dbl> <dbl> <int>
##  1 HHC_PCT_HHA_NURSING      Midwest 100    0    100      100   606
##  2 HHC_PCT_HHA_NURSING      South   100    0    100      100   836
##  3 HHC_PCT_HHA_NURSING      West    100    0    100      100   252
##  4 HHC_PCT_HHA_PHYS_THERAPY Midwest  97.8  8.36  25      100   595
##  5 HHC_PCT_HHA_PHYS_THERAPY South    99.4  4.32  50      100   836
##  6 HHC_PCT_HHA_PHYS_THERAPY West     97.7  8.14  33.3    100   251
##  7 HHC_PCT_HHA_OCC_THERAPY  Midwest  96.3 11.5   25      100   579
##  8 HHC_PCT_HHA_OCC_THERAPY  South    95.8 12.2   28.6    100   787
##  9 HHC_PCT_HHA_OCC_THERAPY  West     94.5 13.0   25      100   232
## 10 HHC_PCT_HHA_SPEECH       Midwest  92.6 16.1   25      100   552
## 11 HHC_PCT_HHA_SPEECH       South    93.7 15.2   25      100   769
## 12 HHC_PCT_HHA_SPEECH       West     91.3 17.0   20      100   221
## 13 HHC_PCT_HHA_MEDICAL      Midwest  83.0 23.6    9.09   100   419
## 14 HHC_PCT_HHA_MEDICAL      South    89.7 18.6   16.7    100   724
## 15 HHC_PCT_HHA_MEDICAL      West     92.5 15.8   33.3    100   224
## 16 HHC_PCT_HHA_AIDE         Midwest  98.0  7.85  50      100   588
## 17 HHC_PCT_HHA_AIDE         South    96.6  9.82  42.9    100   816
## 18 HHC_PCT_HHA_AIDE         West     96.4 10.8   33.3    100   247

# Pairwise removal excluding several values from the same stem 
# or group variable.
mean_group_tbl(data = sdoh, 
               var_stem = "HHC_PCT",
               group = "REGION",
               na_removal = "pairwise",
               ignore = list(HHC_PCT = 0, REGION = c("Northeast", "South")))

## # A tibble: 12 × 7
##    variable                 REGION   mean    sd    min   max  nobs
##    <chr>                    <chr>   <dbl> <dbl>  <dbl> <dbl> <int>
##  1 HHC_PCT_HHA_NURSING      Midwest 100    0    100      100   606
##  2 HHC_PCT_HHA_NURSING      West    100    0    100      100   252
##  3 HHC_PCT_HHA_PHYS_THERAPY Midwest  97.8  8.36  25      100   595
##  4 HHC_PCT_HHA_PHYS_THERAPY West     97.7  8.14  33.3    100   251
##  5 HHC_PCT_HHA_OCC_THERAPY  Midwest  96.3 11.5   25      100   579
##  6 HHC_PCT_HHA_OCC_THERAPY  West     94.5 13.0   25      100   232
##  7 HHC_PCT_HHA_SPEECH       Midwest  92.6 16.1   25      100   552
##  8 HHC_PCT_HHA_SPEECH       West     91.3 17.0   20      100   221
##  9 HHC_PCT_HHA_MEDICAL      Midwest  83.0 23.6    9.09   100   419
## 10 HHC_PCT_HHA_MEDICAL      West     92.5 15.8   33.3    100   224
## 11 HHC_PCT_HHA_AIDE         Midwest  98.0  7.85  50      100   588
## 12 HHC_PCT_HHA_AIDE         West     96.4 10.8   33.3    100   247
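
In the tables above, nobs is the same for every variable under listwise removal but differs by variable under pairwise removal. One way to read this is that ignored values are treated like missing values before removal is applied. That reading is an inference from the output rather than documented behavior, and the base R sketch below only reproduces the pattern on invented toy data.

```r
# Hypothetical toy data; 0 is the value we want to ignore.
toy_df <- data.frame(
  a = c(0, 50, 100, 100),
  b = c(25, 0, 100, 75)
)

# Treat the ignored value as missing in every column
recoded <- toy_df
recoded[recoded == 0] <- NA

# Pairwise: each column keeps its own non-missing values
pairwise_nobs <- colSums(!is.na(recoded))

# Listwise: only rows with no missing value in any column are kept
listwise_nobs <- sum(complete.cases(recoded))

pairwise_nobs
listwise_nobs
```

Each column keeps three values under the pairwise rule, but only two rows survive the listwise rule because the zeros sit in different rows.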

Another way to use mean_group_tbl() is to summarize responses based on a shared pattern, such as survey time points. To enable this feature, set group_type = "pattern" and specify the desired pattern in the group argument.

Consider a dataset compiled by researchers examining how many symptoms participants reported after a long illness. In this (fictitious) dataset, responses were collected at three time points: “t1” (baseline), “t2” (6-month follow-up), and “t3” (one-year follow-up). Using a pattern-based approach, you can group variables by these time points to generate summary statistics for each phase of data collection.

In the example below, we first create the symptoms_data dataset. We then use mean_group_tbl() to generate summary statistics for variables that begin with the prefix symptoms and contain a substring matching the pattern "_t\\d": an underscore, the letter “t”, and a single digit, which together identify the time point. The ignore argument excludes the value -999 from the analysis.

set.seed(0803)
symptoms_data <-
  data.frame(
    symptoms_t1 = sample(c(0:10, -999), replace = TRUE, size = 50),
    symptoms_t2 = sample(c(NA, 0:10, -999), replace = TRUE, size = 50),
    symptoms_t3 = sample(c(NA, 0:10, -999), replace = TRUE, size = 50)
  )

mean_group_tbl(data = symptoms_data, 
               var_stem = "symptoms",
               group = "_t\\d",
               group_type = "pattern",
               ignore = c(symptoms = -999))

## # A tibble: 3 × 7
##   variable    group  mean    sd   min   max  nobs
##   <chr>       <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 symptoms_t1 t1     4.03  3.14     0    10    33
## 2 symptoms_t2 t2     5.12  3.33     0    10    33
## 3 symptoms_t3 t3     4.64  3.29     0    10    33
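
To check what a pattern such as "_t\\d" will match before passing it to mean_group_tbl(), you can test it against the variable names with base R's regular expression functions. The sketch below is a standalone illustration: the extra name symptoms_total is hypothetical, and stripping the leading underscore merely mimics the labels shown in the group column above, without claiming that the package derives them this way.

```r
# Variable names to test; "symptoms_total" is a made-up non-matching name.
var_names <- c("symptoms_t1", "symptoms_t2", "symptoms_t3", "symptoms_total")

# Keep names that start with the stem and contain the pattern
has_pattern <- grepl("^symptoms", var_names) & grepl("_t\\d", var_names)
matched <- var_names[has_pattern]

# Extract the matched substring, then drop the leading underscore
groups <- sub("^_", "", regmatches(matched, regexpr("_t\\d", matched)))

data.frame(variable = matched, group = groups)
```

The name symptoms_total is excluded because the character after "_t" is not a digit.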

To make your output easier to understand, use the group_name argument to add a label to the column that shows grouping values or matched patterns. You can also use the var_labels argument to display descriptive labels for each variable.

mean_group_tbl(data = symptoms_data, 
               var_stem = "symptoms",
               group = "_t\\d",
               group_type = "pattern",
               group_name = "time_point",
               ignore = c(symptoms = -999), 
               var_labels = c(symptoms_t1 = "# of symptoms at baseline",
                              symptoms_t2 = "# of symptoms at 6 months follow up",
                              symptoms_t3 = "# of symptoms at one-year follow up"))

## # A tibble: 3 × 8
##   variable    variable_label            time_point  mean    sd   min   max  nobs
##   <chr>       <chr>                     <chr>      <dbl> <dbl> <dbl> <dbl> <int>
## 1 symptoms_t1 # of symptoms at baseline t1          4.03  3.14     0    10    33
## 2 symptoms_t2 # of symptoms at 6 month… t2          5.12  3.33     0    10    33
## 3 symptoms_t3 # of symptoms at one-yea… t3          4.64  3.29     0    10    33

Finally, you can choose which summary statistics to return using the only argument.

# Default: all summary statistics returned
# (mean, sd, min, max, nobs)
mean_group_tbl(data = symptoms_data, 
               var_stem = "symptoms",
               group = "_t\\d",
               group_type = "pattern",
               group_name = "time_point",
               ignore = c(symptoms = -999))

## # A tibble: 3 × 7
##   variable    time_point  mean    sd   min   max  nobs
##   <chr>       <chr>      <dbl> <dbl> <dbl> <dbl> <int>
## 1 symptoms_t1 t1          4.03  3.14     0    10    33
## 2 symptoms_t2 t2          5.12  3.33     0    10    33
## 3 symptoms_t3 t3          4.64  3.29     0    10    33

# Means and non-missing observations only
mean_group_tbl(data = symptoms_data, 
               var_stem = "symptoms",
               group = "_t\\d",
               group_type = "pattern",
               group_name = "time_point",
               ignore = c(symptoms = -999),
               only = c("mean", "nobs"))

## # A tibble: 3 × 4
##   variable    time_point  mean  nobs
##   <chr>       <chr>      <dbl> <int>
## 1 symptoms_t1 t1          4.03    33
## 2 symptoms_t2 t2          5.12    33
## 3 symptoms_t3 t3          4.64    33

# Means and standard deviations only
mean_group_tbl(data = symptoms_data, 
               var_stem = "symptoms",
               group = "_t\\d",
               group_type = "pattern",
               group_name = "time_point",
               ignore = c(symptoms = -999),
               only = c("mean", "sd"))

## # A tibble: 3 × 4
##   variable    time_point  mean    sd
##   <chr>       <chr>      <dbl> <dbl>
## 1 symptoms_t1 t1          4.03  3.14
## 2 symptoms_t2 t2          5.12  3.33
## 3 symptoms_t3 t3          4.64  3.29
