summarytabl is an R package designed to simplify the creation of summary tables for different types of data. It provides a set of functions that help you quickly describe categorical variables, multiple response and ordinal variables, and continuous variables.
Each function is clearly prefixed based on the type of data it summarizes, making it easy to identify and apply the right tool for your analysis.
Use these functions to summarize binary and nominal variables:
cat_tbl() creates a summary table for a categorical variable.
cat_group_tbl() summarizes two categorical variables.
These functions are ideal for summarizing binary, ordinal, and Likert-scale variables in which respondents select one response per statement, question, or item:
select_tbl() summarizes multiple response and ordinal variables.
select_group_tbl() summarizes multiple response and ordinal variables by a group or pattern.
For interval and ratio-level variables, use:
mean_tbl() generates summary statistics for continuous variables.
mean_group_tbl() generates summary statistics for continuous variables by group or pattern.
All functions work with data frames and tibbles, and each returns a tibble as output.
This document is organized into three sections, each focusing on a different set of functions for summarizing a specific type of variable.
To begin working with summarytabl, load the package with library(summarytabl).
Keep reading to learn more about how each function works, or jump to the section that matches the type of variable or data you’re working with.
Let’s explore how to use cat_tbl() and
cat_group_tbl() to summarize categorical variables. We’ll
begin by summarizing a single categorical variable, race,
from the nlsy dataset.
## # A tibble: 3 × 3
## race count percent
## <chr> <int> <dbl>
## 1 Black 868 0.292
## 2 Hispanic 631 0.212
## 3 Non-Black,Non-Hispanic 1477 0.496
The function returns a tibble with three columns by default:
race: the name of the variable being summarized
count: the number of observations in each category of race
percent: the share of observations in each category of race, calculated relative to the total and reported as a proportion between 0 and 1
You can exclude certain values and eliminate missing values from the
data using the ignore and na.rm arguments,
respectively.
## # A tibble: 2 × 3
## race count percent
## <chr> <int> <dbl>
## 1 Black 868 0.370
## 2 Non-Black,Non-Hispanic 1477 0.630
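The percent column in both tables can be checked by hand: each count is divided by the total of the categories that remain. The base R sketch below uses the counts printed above and does not call summarytabl. It assumes, based on the second table, that Hispanic was the excluded value.

```r
# Counts copied from the first cat_tbl() output above.
counts <- c("Black" = 868, "Hispanic" = 631, "Non-Black,Non-Hispanic" = 1477)

# All categories kept: divide each count by the grand total.
percent_all <- counts / sum(counts)

# Assumption: "Hispanic" is the ignored value, as the second table suggests.
kept <- counts[names(counts) != "Hispanic"]
percent_kept <- kept / sum(kept)

print(round(percent_all, 3))
print(round(percent_kept, 3))
```

Dropping a category shrinks the denominator, which is why the share for Black rises from 0.292 to 0.370 even though its count is unchanged.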
Suppose we want to create a contingency table to summarize two
categorical variables. We can do this using the
cat_group_tbl() function. In this example, we summarize
race by bthwht. Before applying
cat_group_tbl(), we’ll recode the values of
bthwht, changing 0 to
regular_birthweight and 1 to
low_birthweight.
nlsy_cross_tab <-
nlsy |>
dplyr::select(c(race, bthwht)) |>
dplyr::mutate(bthwht = ifelse(bthwht == 0, "regular_bithweight", "low_birthweight"))
cat_group_tbl(data = nlsy_cross_tab,
row_var = "race",
col_var = "bthwht")
## # A tibble: 6 × 4
## race bthwht count percent
## <chr> <chr> <int> <dbl>
## 1 Black low_birthweight 102 0.0343
## 2 Black regular_bithweight 766 0.257
## 3 Hispanic low_birthweight 42 0.0141
## 4 Hispanic regular_bithweight 589 0.198
## 5 Non-Black,Non-Hispanic low_birthweight 83 0.0279
## 6 Non-Black,Non-Hispanic regular_bithweight 1394 0.468
The function returns a tibble with four columns by default:
race: the name of the row_var variable
bthwht: the name of the col_var variable
count: the number of observations for each combination of race and bthwht categories
percent: the percentage of observations for each combination of race and bthwht categories, calculated relative to the total
To pivot the output to the wide format, set
pivot = "wider".
## # A tibble: 3 × 5
## race count_bthwht_low_bir…¹ count_bthwht_regular…² percent_bthwht_low_b…³
## <chr> <int> <int> <dbl>
## 1 Black 102 766 0.0343
## 2 Hispanic 42 589 0.0141
## 3 Non-Blac… 83 1394 0.0279
## # ℹ abbreviated names: ¹count_bthwht_low_birthweight,
## # ²count_bthwht_regular_bithweight, ³percent_bthwht_low_birthweight
## # ℹ 1 more variable: percent_bthwht_regular_bithweight <dbl>
To display only percentages, set only = "percent". You
can also control how those percentages are calculated and displayed
using the margins argument.
# Default: percentages across the full table sum to one
cat_group_tbl(data = nlsy_cross_tab,
row_var = "race",
col_var = "bthwht",
pivot = "wider",
only = "percent")
## # A tibble: 3 × 3
## race percent_bthwht_low_birthweight percent_bthwht_regular…¹
## <chr> <dbl> <dbl>
## 1 Black 0.0343 0.257
## 2 Hispanic 0.0141 0.198
## 3 Non-Black,Non-Hispanic 0.0279 0.468
## # ℹ abbreviated name: ¹percent_bthwht_regular_bithweight
# Rowwise: percentages sum to one across columns within each row
cat_group_tbl(data = nlsy_cross_tab,
row_var = "race",
col_var = "bthwht",
margins = "rows",
pivot = "wider",
only = "percent")
## # A tibble: 3 × 3
## race percent_bthwht_low_birthweight percent_bthwht_regular…¹
## <chr> <dbl> <dbl>
## 1 Black 0.118 0.882
## 2 Hispanic 0.0666 0.933
## 3 Non-Black,Non-Hispanic 0.0562 0.944
## # ℹ abbreviated name: ¹percent_bthwht_regular_bithweight
# Columnwise: percentages within each column sum to one
cat_group_tbl(data = nlsy_cross_tab,
row_var = "race",
col_var = "bthwht",
margins = "columns",
pivot = "wider",
only = "percent")
## # A tibble: 3 × 3
## race percent_bthwht_low_birthweight percent_bthwht_regular…¹
## <chr> <dbl> <dbl>
## 1 Black 0.449 0.279
## 2 Hispanic 0.185 0.214
## 3 Non-Black,Non-Hispanic 0.366 0.507
## # ℹ abbreviated name: ¹percent_bthwht_regular_bithweight
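The three margins settings correspond to three choices of denominator. As a rough illustration only, the base R sketch below rebuilds the count table printed earlier and applies prop.table(); it is not the package's implementation, just the same arithmetic.

```r
# Counts copied from the long-format cat_group_tbl() output above.
counts <- matrix(
  c(102, 766,
     42, 589,
     83, 1394),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    race = c("Black", "Hispanic", "Non-Black,Non-Hispanic"),
    bthwht = c("low_birthweight", "regular_bithweight")
  )
)

overall <- prop.table(counts)              # denominator: grand total
by_row  <- prop.table(counts, margin = 1)  # denominator: each row total
by_col  <- prop.table(counts, margin = 2)  # denominator: each column total

print(round(overall, 3))
print(round(by_row, 3))
print(round(by_col, 3))
```

The rounded values agree with the three percentage tables above: for Black and low_birthweight they are 0.034, 0.118, and 0.449 respectively.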
Sometimes, you may want to exclude specific values from your
analysis. To do this, use a named vector or list to specify which values
to exclude from the row_var and col_var
variables. In the example below, the
Non-Black,Non-Hispanic category is excluded from the race
variable (i.e., row_var). Setting na.rm.row_var to
TRUE ensures that NAs are not returned in the final table.
cat_group_tbl(data = nlsy_cross_tab,
row_var = "race",
col_var = "bthwht",
na.rm.row_var = TRUE,
ignore = c(race = "Non-Black,Non-Hispanic"))
## # A tibble: 4 × 4
## race bthwht count percent
## <chr> <chr> <int> <dbl>
## 1 Black low_birthweight 102 0.0680
## 2 Black regular_bithweight 766 0.511
## 3 Hispanic low_birthweight 42 0.0280
## 4 Hispanic regular_bithweight 589 0.393
When you need to exclude more than one value from
row_var or col_var, use a named list. In the
example below, both the Non-Black,Non-Hispanic and
Hispanic categories are excluded from the race
variable.
cat_group_tbl(data = nlsy_cross_tab,
row_var = "race",
col_var = "bthwht",
na.rm.row_var = TRUE,
ignore = list(race = c("Non-Black,Non-Hispanic", "Hispanic")))
## # A tibble: 2 × 4
## race bthwht count percent
## <chr> <chr> <int> <dbl>
## 1 Black low_birthweight 102 0.118
## 2 Black regular_bithweight 766 0.882
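One way to see why a list is needed for several values: a named vector carries a single value under each name, while a list element can hold a whole vector. The sketch below mimics the exclusion step on the counts shown earlier. The helper drop_and_recompute() is hypothetical and not part of summarytabl; it only illustrates the idea.

```r
# Long-format counts copied from the cat_group_tbl() output above.
cross_tab <- data.frame(
  race = rep(c("Black", "Hispanic", "Non-Black,Non-Hispanic"), each = 2),
  bthwht = rep(c("low_birthweight", "regular_bithweight"), times = 3),
  count = c(102, 766, 42, 589, 83, 1394),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop rows whose value appears under the matching
# name in `ignore`, then recompute percentages over what is left.
drop_and_recompute <- function(tbl, ignore) {
  for (nm in names(ignore)) {
    tbl <- tbl[!tbl[[nm]] %in% ignore[[nm]], , drop = FALSE]
  }
  tbl$percent <- tbl$count / sum(tbl$count)
  rownames(tbl) <- NULL
  tbl
}

# A named vector is enough for one value per variable.
one_value <- drop_and_recompute(cross_tab, c(race = "Non-Black,Non-Hispanic"))

# A named list lets one name carry several values.
two_values <- drop_and_recompute(
  cross_tab,
  list(race = c("Non-Black,Non-Hispanic", "Hispanic"))
)

print(one_value)
print(two_values)
```

The recomputed percentages round to the values in the two tables above (0.068, 0.511, 0.028, 0.393 and 0.118, 0.882).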
Next, let’s explore how to use the select_tbl() and
select_group_tbl() functions to summarize multiple response
and ordinal variables. Multiple response and ordinal variables are
commonly used in survey research, psychology, and health sciences.
Examples include symptom checklists, scales like a depression index with
multiple items, or questions allowing respondents to select all choices
that apply to them.
The depressive dataset contains eight variables that
share the same variable stem: dep, with each one
representing a different item used to measure depression.
## [1] "cid" "race" "sex" "yob" "dep_1" "dep_2" "dep_3" "dep_4" "dep_5"
## [10] "dep_6" "dep_7" "dep_8"
Using the select_tbl() function, we can summarize
participants’ responses to these items by showing how many respondents
chose each answer option (i.e., value) for every variable.
## # A tibble: 24 × 4
## variable values count percent
## <chr> <int> <int> <dbl>
## 1 dep_1 1 109 0.0678
## 2 dep_1 2 689 0.429
## 3 dep_1 3 809 0.503
## 4 dep_2 1 144 0.0896
## 5 dep_2 2 746 0.464
## 6 dep_2 3 717 0.446
## 7 dep_3 1 1162 0.723
## 8 dep_3 2 392 0.244
## 9 dep_3 3 53 0.0330
## 10 dep_4 1 601 0.374
## # ℹ 14 more rows
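Conceptually, the output stacks one frequency table per variable that shares the stem. The base R sketch below shows that idea on a small made-up data frame (the toy data and the helper summarise_item() are illustrative assumptions, and matching the stem at the start of the name with "^dep" is also an assumption about how stems are resolved).

```r
# Made-up responses; only columns starting with "dep" should be summarized.
toy <- data.frame(
  id = 1:5,
  dep_1 = c(1L, 2L, 3L, 3L, 2L),
  dep_2 = c(1L, 1L, 2L, 3L, 3L)
)

# Find the columns that share the stem.
stem_cols <- grep("^dep", names(toy), value = TRUE)

# Hypothetical helper: one frequency table per variable, in long format.
summarise_item <- function(v) {
  tab <- table(toy[[v]])
  data.frame(
    variable = v,
    values = as.integer(names(tab)),
    count = as.integer(tab),
    percent = as.numeric(tab) / sum(tab),
    stringsAsFactors = FALSE
  )
}

stacked <- do.call(rbind, lapply(stem_cols, summarise_item))
print(stacked)
```

Within each variable the percent column sums to one, which matches the pattern in the table above (for dep_1, 0.0678 + 0.429 + 0.503).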
Alternatively, you can choose to summarize specific variables by
passing their names to the var_stem argument and setting
the var_input argument to "name".
## # A tibble: 9 × 4
## variable values count percent
## <chr> <int> <int> <dbl>
## 1 dep_1 1 117 0.0714
## 2 dep_1 2 703 0.429
## 3 dep_1 3 818 0.499
## 4 dep_4 1 608 0.371
## 5 dep_4 2 854 0.521
## 6 dep_4 3 176 0.107
## 7 dep_6 1 398 0.243
## 8 dep_6 2 872 0.532
## 9 dep_6 3 368 0.225
By default, missing values are removed using listwise deletion. To
switch to pairwise deletion instead, set
na_removal = "pairwise".
## # A tibble: 24 × 4
## variable values count percent
## <chr> <int> <int> <dbl>
## 1 dep_1 1 120 0.0726
## 2 dep_1 2 709 0.429
## 3 dep_1 3 825 0.499
## 4 dep_2 1 151 0.0920
## 5 dep_2 2 762 0.464
## 6 dep_2 3 728 0.444
## 7 dep_3 1 1192 0.721
## 8 dep_3 2 406 0.246
## 9 dep_3 3 55 0.0333
## 10 dep_4 1 611 0.371
## # ℹ 14 more rows
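The difference between the two strategies shows up in the denominators. Under listwise deletion the dep_1 and dep_2 counts shown earlier both total 1,607, whereas under pairwise deletion they total 1,654 and 1,641. The base R sketch below illustrates the usual meaning of the two terms on made-up data; that the package follows exactly this logic is an assumption.

```r
# Made-up items with scattered missing values.
toy <- data.frame(
  item_1 = c(1, 2, NA, 3, 2),
  item_2 = c(2, NA, 1, 3, 3),
  item_3 = c(1, 1, 2, NA, 3)
)

# Listwise: keep only rows that are complete on every item,
# so every item is summarized on the same set of rows.
listwise <- toy[complete.cases(toy), , drop = FALSE]
n_listwise <- vapply(listwise, function(x) sum(!is.na(x)), integer(1))

# Pairwise: each item keeps all of its own non-missing values.
n_pairwise <- vapply(toy, function(x) sum(!is.na(x)), integer(1))

print(n_listwise)
print(n_pairwise)
```

Pairwise deletion retains more observations per variable, at the cost of each variable being summarized on a slightly different set of respondents.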
To display the output in the wide format, set
pivot = "wider".
## # A tibble: 8 × 7
## variable count_value_1 count_value_2 count_value_3 percent_value_1
## <chr> <int> <int> <int> <dbl>
## 1 dep_1 120 709 825 0.0726
## 2 dep_2 151 762 728 0.0920
## 3 dep_3 1192 406 55 0.721
## 4 dep_4 611 856 181 0.371
## 5 dep_5 206 574 871 0.125
## 6 dep_6 399 879 371 0.242
## 7 dep_7 1046 507 95 0.635
## 8 dep_8 323 801 519 0.197
## # ℹ 2 more variables: percent_value_2 <dbl>, percent_value_3 <dbl>
It’s common practice to group multiple response or ordinal variables
by another variable. This type of descriptive analysis allows for
meaningful comparisons across different segments of your dataset. With
select_group_tbl(), you can create a summary table for
multiple response and ordinal variables, grouped either by another
variable in your dataset or by matching a pattern in the variable names.
For example, we often want to summarize survey responses by race.
First, recode the race variable and the values for each
of the eight depressive index variables in the depressive
dataset, replacing numeric categories with descriptive string labels for
easier interpretation.
dep_recoded <-
depressive |>
dplyr::mutate(
race = dplyr::case_match(.x = race,
1 ~ "Hispanic",
2 ~ "Black",
3 ~ "Non-Black/Non-Hispanic",
.default = NA)
) |>
dplyr::mutate(
dplyr::across(
.cols = dplyr::starts_with("dep"),
.fns = ~ dplyr::case_when(.x == 1 ~ "often",
.x == 2 ~ "sometimes",
.x == 3 ~ "hardly ever")
))
Next, use the select_group_tbl() function to summarize
responses for all eight variables by race:
## # A tibble: 72 × 5
## variable race values count percent
## <chr> <chr> <chr> <int> <dbl>
## 1 dep_1 Black hardly ever 248 0.154
## 2 dep_1 Black often 45 0.0280
## 3 dep_1 Black sometimes 194 0.121
## 4 dep_1 Hispanic hardly ever 187 0.116
## 5 dep_1 Hispanic often 28 0.0174
## 6 dep_1 Hispanic sometimes 155 0.0965
## 7 dep_1 Non-Black/Non-Hispanic hardly ever 374 0.233
## 8 dep_1 Non-Black/Non-Hispanic often 36 0.0224
## 9 dep_1 Non-Black/Non-Hispanic sometimes 340 0.212
## 10 dep_2 Black hardly ever 234 0.146
## # ℹ 62 more rows
As with select_tbl(), setting the pivot argument to
"wider" reshapes the table into the wide format, while
using "pairwise" for the na_removal argument
ensures missing values are addressed through pairwise deletion.
select_group_tbl(data = dep_recoded,
var_stem = "dep",
group = "race",
na_removal = "pairwise",
pivot = "wider")
## # A tibble: 24 × 8
## variable values count_race_Black count_race_Hispanic count_race_Non-Black…¹
## <chr> <chr> <int> <int> <int>
## 1 dep_1 hardly … 256 190 379
## 2 dep_1 often 54 28 38
## 3 dep_1 sometim… 203 159 347
## 4 dep_2 hardly … 241 172 315
## 5 dep_2 often 52 38 61
## 6 dep_2 sometim… 213 165 384
## 7 dep_3 hardly … 20 20 15
## 8 dep_3 often 342 252 598
## 9 dep_3 sometim… 149 105 152
## 10 dep_4 hardly … 48 40 93
## # ℹ 14 more rows
## # ℹ abbreviated name: ¹`count_race_Non-Black/Non-Hispanic`
## # ℹ 3 more variables: percent_race_Black <dbl>, percent_race_Hispanic <dbl>,
## # `percent_race_Non-Black/Non-Hispanic` <dbl>
The ignore argument can be used to exclude specific
values from analysis. In the example below, the value often
is removed from all eight depression index variables, and the
Non-Black/Non-Hispanic category is excluded from the race
variable.
select_group_tbl(data = dep_recoded,
var_stem = "dep",
group = "race",
na_removal = "pairwise",
pivot = "wider",
ignore = c(dep = "often", race = "Non-Black/Non-Hispanic"))
## # A tibble: 16 × 6
## variable values count_race_Black count_race_Hispanic percent_race_Black
## <chr> <chr> <int> <int> <dbl>
## 1 dep_1 hardly ever 256 190 0.317
## 2 dep_1 sometimes 203 159 0.251
## 3 dep_2 hardly ever 241 172 0.305
## 4 dep_2 sometimes 213 165 0.269
## 5 dep_3 hardly ever 20 20 0.0680
## 6 dep_3 sometimes 149 105 0.507
## 7 dep_4 hardly ever 48 40 0.0854
## 8 dep_4 sometimes 269 205 0.479
## 9 dep_5 hardly ever 253 201 0.333
## 10 dep_5 sometimes 182 124 0.239
## 11 dep_6 hardly ever 128 95 0.190
## 12 dep_6 sometimes 249 200 0.371
## 13 dep_7 hardly ever 38 28 0.110
## 14 dep_7 sometimes 152 128 0.439
## 15 dep_8 hardly ever 171 127 0.238
## 16 dep_8 sometimes 237 182 0.331
## # ℹ 1 more variable: percent_race_Hispanic <dbl>
When group_type is set to variable (the
default), the margins argument controls how percentages are
calculated and presented.
# Default: percentages across each variable sum to one
select_group_tbl(data = dep_recoded,
var_stem = "dep",
group = "race",
na_removal = "pairwise",
pivot = "wider")
## # A tibble: 24 × 8
## variable values count_race_Black count_race_Hispanic count_race_Non-Black…¹
## <chr> <chr> <int> <int> <int>
## 1 dep_1 hardly … 256 190 379
## 2 dep_1 often 54 28 38
## 3 dep_1 sometim… 203 159 347
## 4 dep_2 hardly … 241 172 315
## 5 dep_2 often 52 38 61
## 6 dep_2 sometim… 213 165 384
## 7 dep_3 hardly … 20 20 15
## 8 dep_3 often 342 252 598
## 9 dep_3 sometim… 149 105 152
## 10 dep_4 hardly … 48 40 93
## # ℹ 14 more rows
## # ℹ abbreviated name: ¹`count_race_Non-Black/Non-Hispanic`
## # ℹ 3 more variables: percent_race_Black <dbl>, percent_race_Hispanic <dbl>,
## # `percent_race_Non-Black/Non-Hispanic` <dbl>
# Rowwise: for each value of the variable, the percentages
# across all levels of the grouping variable sum to one
select_group_tbl(data = dep_recoded,
var_stem = "dep",
group = "race",
margins = "rows",
na_removal = "pairwise",
pivot = "wider")
## # A tibble: 24 × 8
## variable values count_race_Black count_race_Hispanic count_race_Non-Black…¹
## <chr> <chr> <int> <int> <int>
## 1 dep_1 hardly … 256 190 379
## 2 dep_1 often 54 28 38
## 3 dep_1 sometim… 203 159 347
## 4 dep_2 hardly … 241 172 315
## 5 dep_2 often 52 38 61
## 6 dep_2 sometim… 213 165 384
## 7 dep_3 hardly … 20 20 15
## 8 dep_3 often 342 252 598
## 9 dep_3 sometim… 149 105 152
## 10 dep_4 hardly … 48 40 93
## # ℹ 14 more rows
## # ℹ abbreviated name: ¹`count_race_Non-Black/Non-Hispanic`
## # ℹ 3 more variables: percent_race_Black <dbl>, percent_race_Hispanic <dbl>,
## # `percent_race_Non-Black/Non-Hispanic` <dbl>
# Columnwise: for each level of the grouping variable,
# the percentages across all values of the variable sum
# to one.
select_group_tbl(data = dep_recoded,
var_stem = "dep",
group = "race",
margins = "columns",
na_removal = "pairwise",
pivot = "wider")
## # A tibble: 24 × 8
## variable values count_race_Black count_race_Hispanic count_race_Non-Black…¹
## <chr> <chr> <int> <int> <int>
## 1 dep_1 hardly … 256 190 379
## 2 dep_1 often 54 28 38
## 3 dep_1 sometim… 203 159 347
## 4 dep_2 hardly … 241 172 315
## 5 dep_2 often 52 38 61
## 6 dep_2 sometim… 213 165 384
## 7 dep_3 hardly … 20 20 15
## 8 dep_3 often 342 252 598
## 9 dep_3 sometim… 149 105 152
## 10 dep_4 hardly … 48 40 93
## # ℹ 14 more rows
## # ℹ abbreviated name: ¹`count_race_Non-Black/Non-Hispanic`
## # ℹ 3 more variables: percent_race_Black <dbl>, percent_race_Hispanic <dbl>,
## # `percent_race_Non-Black/Non-Hispanic` <dbl>
Another way to use select_group_tbl() is to summarize
responses that match a specific pattern, such as survey waves or time
points. To enable this feature, set group_type = "pattern"
and provide the desired pattern in the group argument. For example, the
stem_social_psych dataset contains variables that capture
student responses about their sense of belonging in the STEM community
at two distinct time points: “w1” and “w2”. You can summarize these
responses using a pattern-based approach, where the time points (e.g.,
“w1” and “w2”) serve as grouping variables.
select_group_tbl(data = stem_social_psych,
var_stem = "belong_belong",
group = "_w\\d",
group_type = "pattern")
## # A tibble: 10 × 5
## variable group values count percent
## <chr> <chr> <dbl> <int> <dbl>
## 1 belong_belongStem_w1 w1 1 5 0.0185
## 2 belong_belongStem_w1 w1 2 20 0.0741
## 3 belong_belongStem_w1 w1 3 59 0.219
## 4 belong_belongStem_w1 w1 4 107 0.396
## 5 belong_belongStem_w1 w1 5 79 0.293
## 6 belong_belongStem_w2 w2 1 11 0.0407
## 7 belong_belongStem_w2 w2 2 11 0.0407
## 8 belong_belongStem_w2 w2 3 44 0.163
## 9 belong_belongStem_w2 w2 4 113 0.419
## 10 belong_belongStem_w2 w2 5 91 0.337
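To see what the pattern "_w\\d" picks out, the base R sketch below applies it to the two variable names from the output above. Removing the leading underscore afterwards is an assumption made here to reproduce the "w1" and "w2" labels in the group column; how the package derives those labels internally is not shown in this document.

```r
# Variable names taken from the output above.
vars <- c("belong_belongStem_w1", "belong_belongStem_w2")

# Extract the first substring matching an underscore, "w", and one digit.
matched <- regmatches(vars, regexpr("_w\\d", vars))

# Assumption: the leading underscore is dropped to form the group label.
group <- sub("^_", "", matched)

print(data.frame(variable = vars, group = group))
```

A pattern that matches none of the variable names would leave nothing to group by, so it is worth testing the expression on names(data) first.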
Use the group_name argument to assign a descriptive name
to the column containing the matched pattern values.
select_group_tbl(data = stem_social_psych,
var_stem = "belong_belong",
group = "_w\\d",
group_type = "pattern",
group_name = "wave")
## # A tibble: 10 × 5
## variable wave values count percent
## <chr> <chr> <dbl> <int> <dbl>
## 1 belong_belongStem_w1 w1 1 5 0.0185
## 2 belong_belongStem_w1 w1 2 20 0.0741
## 3 belong_belongStem_w1 w1 3 59 0.219
## 4 belong_belongStem_w1 w1 4 107 0.396
## 5 belong_belongStem_w1 w1 5 79 0.293
## 6 belong_belongStem_w2 w2 1 11 0.0407
## 7 belong_belongStem_w2 w2 2 11 0.0407
## 8 belong_belongStem_w2 w2 3 44 0.163
## 9 belong_belongStem_w2 w2 4 113 0.419
## 10 belong_belongStem_w2 w2 5 91 0.337
You can also include variable labels in your summary table by using
the var_labels argument.
select_group_tbl(data = stem_social_psych,
var_stem = "belong_belong",
group = "_w\\d",
group_type = "pattern",
group_name = "wave",
var_labels = c(
belong_belongStem_w1 = "I feel like I belong in STEM (wave 1)",
belong_belongStem_w2 = "I feel like I belong in STEM (wave 2)"
))
## # A tibble: 10 × 6
## variable variable_label wave values count percent
## <chr> <chr> <chr> <dbl> <int> <dbl>
## 1 belong_belongStem_w1 I feel like I belong in STEM… w1 1 5 0.0185
## 2 belong_belongStem_w1 I feel like I belong in STEM… w1 2 20 0.0741
## 3 belong_belongStem_w1 I feel like I belong in STEM… w1 3 59 0.219
## 4 belong_belongStem_w1 I feel like I belong in STEM… w1 4 107 0.396
## 5 belong_belongStem_w1 I feel like I belong in STEM… w1 5 79 0.293
## 6 belong_belongStem_w2 I feel like I belong in STEM… w2 1 11 0.0407
## 7 belong_belongStem_w2 I feel like I belong in STEM… w2 2 11 0.0407
## 8 belong_belongStem_w2 I feel like I belong in STEM… w2 3 44 0.163
## 9 belong_belongStem_w2 I feel like I belong in STEM… w2 4 113 0.419
## 10 belong_belongStem_w2 I feel like I belong in STEM… w2 5 91 0.337
Finally, use the only argument to choose what
information to return.
# Default: counts and percentages
select_group_tbl(data = stem_social_psych,
var_stem = "belong_belong",
group = "_w\\d",
group_type = "pattern",
group_name = "wave")
## # A tibble: 10 × 5
## variable wave values count percent
## <chr> <chr> <dbl> <int> <dbl>
## 1 belong_belongStem_w1 w1 1 5 0.0185
## 2 belong_belongStem_w1 w1 2 20 0.0741
## 3 belong_belongStem_w1 w1 3 59 0.219
## 4 belong_belongStem_w1 w1 4 107 0.396
## 5 belong_belongStem_w1 w1 5 79 0.293
## 6 belong_belongStem_w2 w2 1 11 0.0407
## 7 belong_belongStem_w2 w2 2 11 0.0407
## 8 belong_belongStem_w2 w2 3 44 0.163
## 9 belong_belongStem_w2 w2 4 113 0.419
## 10 belong_belongStem_w2 w2 5 91 0.337
# Counts only
select_group_tbl(data = stem_social_psych,
var_stem = "belong_belong",
group = "_w\\d",
group_type = "pattern",
group_name = "wave",
only = "count")
## # A tibble: 10 × 4
## variable wave values count
## <chr> <chr> <dbl> <int>
## 1 belong_belongStem_w1 w1 1 5
## 2 belong_belongStem_w1 w1 2 20
## 3 belong_belongStem_w1 w1 3 59
## 4 belong_belongStem_w1 w1 4 107
## 5 belong_belongStem_w1 w1 5 79
## 6 belong_belongStem_w2 w2 1 11
## 7 belong_belongStem_w2 w2 2 11
## 8 belong_belongStem_w2 w2 3 44
## 9 belong_belongStem_w2 w2 4 113
## 10 belong_belongStem_w2 w2 5 91
# Percentages only
select_group_tbl(data = stem_social_psych,
var_stem = "belong_belong",
group = "_w\\d",
group_type = "pattern",
group_name = "wave",
only = "percent")
## # A tibble: 10 × 4
## variable wave values percent
## <chr> <chr> <dbl> <dbl>
## 1 belong_belongStem_w1 w1 1 0.0185
## 2 belong_belongStem_w1 w1 2 0.0741
## 3 belong_belongStem_w1 w1 3 0.219
## 4 belong_belongStem_w1 w1 4 0.396
## 5 belong_belongStem_w1 w1 5 0.293
## 6 belong_belongStem_w2 w2 1 0.0407
## 7 belong_belongStem_w2 w2 2 0.0407
## 8 belong_belongStem_w2 w2 3 0.163
## 9 belong_belongStem_w2 w2 4 0.419
## 10 belong_belongStem_w2 w2 5 0.337
Finally, let’s look at how to use the mean_tbl() and
mean_group_tbl() functions to summarize continuous
variables. The mean_tbl() function allows you to generate
descriptive statistics for either a set of continuous variables that
share a common stem or for individual continuous variables. The
resulting summary table includes key metrics such as the variable’s
mean, standard deviation, minimum value, maximum value, and the count of
non-missing observations for each variable.
The sdoh dataset contains six variables describing
characteristics of health care facilities, all of which begin with the
prefix HHC_PCT. Using the mean_tbl() function,
you can generate summary statistics for these variables:
## # A tibble: 6 × 6
## variable mean sd min max nobs
## <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 HHC_PCT_HHA_NURSING 58.2 49.3 0 100 3227
## 2 HHC_PCT_HHA_PHYS_THERAPY 56.7 48.8 0 100 3227
## 3 HHC_PCT_HHA_OCC_THERAPY 52.4 48.3 0 100 3227
## 4 HHC_PCT_HHA_SPEECH 49.1 47.6 0 100 3227
## 5 HHC_PCT_HHA_MEDICAL 42.2 46.2 0 100 3227
## 6 HHC_PCT_HHA_AIDE 55.1 48.6 0 100 3227
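Each row of this table holds five statistics computed on one variable's non-missing values. The base R sketch below reproduces that layout on made-up data; the data frame and the helper summarise_values() are illustrative assumptions, and missing values are dropped per variable here for simplicity.

```r
# Made-up percentage variables with a few missing values.
toy <- data.frame(
  pct_a = c(0, 50, 100, NA, 100),
  pct_b = c(100, 100, 0, 25, NA)
)

# Hypothetical helper: the five statistics for one numeric vector.
summarise_values <- function(x) {
  x <- x[!is.na(x)]
  data.frame(mean = mean(x), sd = sd(x), min = min(x),
             max = max(x), nobs = length(x))
}

# One row per variable, with the variable name as the first column.
pieces <- lapply(toy, summarise_values)
stats <- cbind(variable = names(pieces), do.call(rbind, pieces))
rownames(stats) <- NULL

print(stats)
```

Here nobs counts the values that entered the calculation, so it can be smaller than the number of rows in the data.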
Alternatively, if you want to generate summary statistics for only a
subset of those variables, you can specify their names directly in the
var_stem argument and set var_input = "name"
to indicate you’re referencing variable names rather than a shared
stem.
mean_tbl(
data = sdoh,
var_stem = c("HHC_PCT_HHA_PHYS_THERAPY",
"HHC_PCT_HHA_OCC_THERAPY",
"HHC_PCT_HHA_SPEECH"),
var_input = "name"
)
## # A tibble: 3 × 6
## variable mean sd min max nobs
## <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 HHC_PCT_HHA_PHYS_THERAPY 56.7 48.8 0 100 3227
## 2 HHC_PCT_HHA_OCC_THERAPY 52.4 48.3 0 100 3227
## 3 HHC_PCT_HHA_SPEECH 49.1 47.6 0 100 3227
You can also specify how missing values are removed, using the
na_removal argument.
## # A tibble: 6 × 6
## variable mean sd min max nobs
## <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 HHC_PCT_HHA_NURSING 58.2 49.3 0 100 3227
## 2 HHC_PCT_HHA_PHYS_THERAPY 56.7 48.8 0 100 3227
## 3 HHC_PCT_HHA_OCC_THERAPY 52.4 48.3 0 100 3227
## 4 HHC_PCT_HHA_SPEECH 49.1 47.6 0 100 3227
## 5 HHC_PCT_HHA_MEDICAL 42.2 46.2 0 100 3227
## 6 HHC_PCT_HHA_AIDE 55.1 48.6 0 100 3227
## # A tibble: 6 × 6
## variable mean sd min max nobs
## <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 HHC_PCT_HHA_NURSING 58.2 49.3 0 100 3227
## 2 HHC_PCT_HHA_PHYS_THERAPY 56.7 48.8 0 100 3227
## 3 HHC_PCT_HHA_OCC_THERAPY 52.4 48.3 0 100 3227
## 4 HHC_PCT_HHA_SPEECH 49.1 47.6 0 100 3227
## 5 HHC_PCT_HHA_MEDICAL 42.2 46.2 0 100 3227
## 6 HHC_PCT_HHA_AIDE 55.1 48.6 0 100 3227
Consider adding variable labels using the var_labels
argument to help make the variable names easier to interpret.
mean_tbl(data = sdoh,
var_stem = "HHC_PCT",
na_removal = "pairwise",
var_labels = c(
HHC_PCT_HHA_NURSING="% agencies offering nursing care services",
HHC_PCT_HHA_PHYS_THERAPY="% agencies offering physical therapy services",
HHC_PCT_HHA_OCC_THERAPY="% agencies offering occupational therapy services",
HHC_PCT_HHA_SPEECH="% agencies offering speech pathology services",
HHC_PCT_HHA_MEDICAL="% agencies offering medical social services",
HHC_PCT_HHA_AIDE="% agencies offering home health aide services"
))
## # A tibble: 6 × 7
## variable variable_label mean sd min max nobs
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 HHC_PCT_HHA_NURSING % agencies offering nu… 58.2 49.3 0 100 3227
## 2 HHC_PCT_HHA_PHYS_THERAPY % agencies offering ph… 56.7 48.8 0 100 3227
## 3 HHC_PCT_HHA_OCC_THERAPY % agencies offering oc… 52.4 48.3 0 100 3227
## 4 HHC_PCT_HHA_SPEECH % agencies offering sp… 49.1 47.6 0 100 3227
## 5 HHC_PCT_HHA_MEDICAL % agencies offering me… 42.2 46.2 0 100 3227
## 6 HHC_PCT_HHA_AIDE % agencies offering ho… 55.1 48.6 0 100 3227
Similar to working with multiple response variables, it’s common
practice to group continuous variables by another variable to enable
meaningful comparisons across different segments of a dataset. The
mean_group_tbl() function facilitates this type of
descriptive analysis by generating summary statistics for continuous
variables, grouped either by a specific variable in the dataset or by
matching patterns in variable names. For example, it’s often useful to
present summary statistics by demographic categories such as region,
gender, age, or race.
## # A tibble: 24 × 7
## variable REGION mean sd min max nobs
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 HHC_PCT_HHA_NURSING Midwest 57.4 49.5 0 100 1055
## 2 HHC_PCT_HHA_NURSING Northeast 74.2 43.9 0 100 217
## 3 HHC_PCT_HHA_NURSING South 58.8 49.2 0 100 1422
## 4 HHC_PCT_HHA_NURSING West 56 49.7 0 100 450
## 5 HHC_PCT_HHA_PHYS_THERAPY Midwest 55.2 48.9 0 100 1055
## 6 HHC_PCT_HHA_PHYS_THERAPY Northeast 68.0 43.1 0 100 217
## 7 HHC_PCT_HHA_PHYS_THERAPY South 58.4 49.0 0 100 1422
## 8 HHC_PCT_HHA_PHYS_THERAPY West 54.5 49.0 0 100 450
## 9 HHC_PCT_HHA_OCC_THERAPY Midwest 52.9 48.7 0 100 1055
## 10 HHC_PCT_HHA_OCC_THERAPY Northeast 64.8 42.8 0 100 217
## # ℹ 14 more rows
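Grouping by a variable amounts to splitting the values by group and computing the same statistics within each piece. The base R sketch below shows this split-and-summarize idea on made-up data with one continuous variable; the data and the helper summarise_values() are illustrative assumptions rather than the package's code.

```r
# Made-up data: one percentage variable and a grouping variable.
toy <- data.frame(
  region = c("Midwest", "Midwest", "South", "South", "South", "West", "West"),
  pct = c(0, 100, 50, NA, 100, 100, 100),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the five statistics for one numeric vector.
summarise_values <- function(x) {
  x <- x[!is.na(x)]
  data.frame(mean = mean(x), sd = sd(x), min = min(x),
             max = max(x), nobs = length(x))
}

# Split the values by region, summarize each piece, and stack the rows.
pieces <- lapply(split(toy$pct, toy$region), summarise_values)
by_region <- cbind(region = names(pieces), do.call(rbind, pieces))
rownames(by_region) <- NULL

print(by_region)
```

With several variables, the same step is repeated per variable, which gives the variable-by-group rows seen in the table above.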
You can control which values to exclude and how missing data is
handled using the ignore and na_removal
arguments. To specify values to ignore, use a named vector or list,
where each name corresponds to a variable stem or specific variable
name.
# Default listwise removal
mean_group_tbl(data = sdoh,
var_stem = "HHC_PCT",
group = "REGION",
ignore = c(HHC_PCT = 0, REGION = "Northeast"))
## # A tibble: 18 × 7
## variable REGION mean sd min max nobs
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 HHC_PCT_HHA_NURSING Midwest 100 0 100 100 403
## 2 HHC_PCT_HHA_NURSING South 100 0 100 100 681
## 3 HHC_PCT_HHA_NURSING West 100 0 100 100 200
## 4 HHC_PCT_HHA_PHYS_THERAPY Midwest 97.7 7.15 50 100 403
## 5 HHC_PCT_HHA_PHYS_THERAPY South 99.2 4.78 50 100 681
## 6 HHC_PCT_HHA_PHYS_THERAPY West 98.3 5.31 60 100 200
## 7 HHC_PCT_HHA_OCC_THERAPY Midwest 96.3 10.4 33.3 100 403
## 8 HHC_PCT_HHA_OCC_THERAPY South 95.5 12.4 28.6 100 681
## 9 HHC_PCT_HHA_OCC_THERAPY West 94.8 12.2 25 100 200
## 10 HHC_PCT_HHA_SPEECH Midwest 91.9 16.2 33.3 100 403
## 11 HHC_PCT_HHA_SPEECH South 93.4 15.3 25 100 681
## 12 HHC_PCT_HHA_SPEECH West 91.0 17.2 20 100 200
## 13 HHC_PCT_HHA_MEDICAL Midwest 82.4 23.8 9.09 100 403
## 14 HHC_PCT_HHA_MEDICAL South 89.4 18.6 16.7 100 681
## 15 HHC_PCT_HHA_MEDICAL West 92.6 15.3 33.3 100 200
## 16 HHC_PCT_HHA_AIDE Midwest 97.3 8.97 50 100 403
## 17 HHC_PCT_HHA_AIDE South 96.1 10.3 42.9 100 681
## 18 HHC_PCT_HHA_AIDE West 96.4 9.96 50 100 200
# Pairwise removal
mean_group_tbl(data = sdoh,
var_stem = "HHC_PCT",
group = "REGION",
na_removal = "pairwise",
ignore = c(HHC_PCT = 0, REGION = "Northeast"))
## # A tibble: 18 × 7
## variable REGION mean sd min max nobs
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 HHC_PCT_HHA_NURSING Midwest 100 0 100 100 606
## 2 HHC_PCT_HHA_NURSING South 100 0 100 100 836
## 3 HHC_PCT_HHA_NURSING West 100 0 100 100 252
## 4 HHC_PCT_HHA_PHYS_THERAPY Midwest 97.8 8.36 25 100 595
## 5 HHC_PCT_HHA_PHYS_THERAPY South 99.4 4.32 50 100 836
## 6 HHC_PCT_HHA_PHYS_THERAPY West 97.7 8.14 33.3 100 251
## 7 HHC_PCT_HHA_OCC_THERAPY Midwest 96.3 11.5 25 100 579
## 8 HHC_PCT_HHA_OCC_THERAPY South 95.8 12.2 28.6 100 787
## 9 HHC_PCT_HHA_OCC_THERAPY West 94.5 13.0 25 100 232
## 10 HHC_PCT_HHA_SPEECH Midwest 92.6 16.1 25 100 552
## 11 HHC_PCT_HHA_SPEECH South 93.7 15.2 25 100 769
## 12 HHC_PCT_HHA_SPEECH West 91.3 17.0 20 100 221
## 13 HHC_PCT_HHA_MEDICAL Midwest 83.0 23.6 9.09 100 419
## 14 HHC_PCT_HHA_MEDICAL South 89.7 18.6 16.7 100 724
## 15 HHC_PCT_HHA_MEDICAL West 92.5 15.8 33.3 100 224
## 16 HHC_PCT_HHA_AIDE Midwest 98.0 7.85 50 100 588
## 17 HHC_PCT_HHA_AIDE South 96.6 9.82 42.9 100 816
## 18 HHC_PCT_HHA_AIDE West 96.4 10.8 33.3 100 247
# Pairwise removal excluding several values from the same stem
# or group variable.
mean_group_tbl(data = sdoh,
var_stem = "HHC_PCT",
group = "REGION",
na_removal = "pairwise",
ignore = list(HHC_PCT = 0, REGION = c("Northeast", "South")))
## # A tibble: 12 × 7
## variable REGION mean sd min max nobs
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 HHC_PCT_HHA_NURSING Midwest 100 0 100 100 606
## 2 HHC_PCT_HHA_NURSING West 100 0 100 100 252
## 3 HHC_PCT_HHA_PHYS_THERAPY Midwest 97.8 8.36 25 100 595
## 4 HHC_PCT_HHA_PHYS_THERAPY West 97.7 8.14 33.3 100 251
## 5 HHC_PCT_HHA_OCC_THERAPY Midwest 96.3 11.5 25 100 579
## 6 HHC_PCT_HHA_OCC_THERAPY West 94.5 13.0 25 100 232
## 7 HHC_PCT_HHA_SPEECH Midwest 92.6 16.1 25 100 552
## 8 HHC_PCT_HHA_SPEECH West 91.3 17.0 20 100 221
## 9 HHC_PCT_HHA_MEDICAL Midwest 83.0 23.6 9.09 100 419
## 10 HHC_PCT_HHA_MEDICAL West 92.5 15.8 33.3 100 224
## 11 HHC_PCT_HHA_AIDE Midwest 98.0 7.85 50 100 588
## 12 HHC_PCT_HHA_AIDE West 96.4 10.8 33.3 100 247
Another way to use mean_group_tbl() is to summarize
responses based on a shared pattern, such as survey time points. To
enable this feature, set group_type = "pattern" and specify
the desired pattern in the group argument.
Consider a dataset compiled by researchers examining how many symptoms participants reported they’d had after a long illness. In this (fictitious) dataset, responses are collected at three time points: “t1” (baseline), “t2” (6-month follow-up), and “t3” (one-year follow-up). Using a pattern-based approach, you can group variables by these time points to generate summary statistics for each phase of data collection.
In the example below, we first create the symptoms_data
dataset and then use the mean_group_tbl() function to
generate summary statistics for variables that begin with the prefix
symptoms and contain a substring matching the pattern
"_t\\d", an underscore followed by the letter “t” and a
single digit, indicating different time points. The ignore
argument is also used to exclude the value -999 from the
analysis.
set.seed(0803)
symptoms_data <-
data.frame(
symptoms_t1 = sample(c(0:10, -999), replace = TRUE, size = 50),
symptoms_t2 = sample(c(NA, 0:10, -999), replace = TRUE, size = 50),
symptoms_t3 = sample(c(NA, 0:10, -999), replace = TRUE, size = 50)
)
mean_group_tbl(data = symptoms_data,
var_stem = "symptoms",
group = "_t\\d",
group_type = "pattern",
ignore = c(symptoms = -999))
## # A tibble: 3 × 7
## variable group mean sd min max nobs
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 symptoms_t1 t1 4.03 3.14 0 10 33
## 2 symptoms_t2 t2 5.12 3.33 0 10 33
## 3 symptoms_t3 t3 4.64 3.29 0 10 33
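The identical nobs values across the three rows are what listwise deletion produces when ignored values count as missing: a row is dropped if any of its three time points is NA or -999. The base R sketch below illustrates that interaction. Treating ignored values as missing before rows are dropped is an assumption about the package's order of operations, and the simulated values depend on the R version's random number generator, so no specific counts are claimed.

```r
# Same simulated data as in the example above.
set.seed(0803)
symptoms_data <- data.frame(
  symptoms_t1 = sample(c(0:10, -999), replace = TRUE, size = 50),
  symptoms_t2 = sample(c(NA, 0:10, -999), replace = TRUE, size = 50),
  symptoms_t3 = sample(c(NA, 0:10, -999), replace = TRUE, size = 50)
)

# Assumption: ignored values are treated like missing values.
cleaned <- symptoms_data
cleaned[] <- lapply(symptoms_data, function(x) replace(x, x %in% -999, NA))

# Listwise: keep rows that are complete at every time point.
complete_rows <- cleaned[complete.cases(cleaned), , drop = FALSE]
nobs_listwise <- vapply(complete_rows, function(x) sum(!is.na(x)), integer(1))

# Pairwise: each time point keeps its own valid values.
nobs_pairwise <- vapply(cleaned, function(x) sum(!is.na(x)), integer(1))

print(nobs_listwise)
print(nobs_pairwise)
```

Under listwise deletion all three counts are equal by construction, while the pairwise counts can differ from one time point to the next.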
To make your output easier to understand, use the
group_name argument to add a label to the column that shows
grouping values or matched patterns. You can also use the
var_labels argument to display descriptive labels for each
variable.
mean_group_tbl(data = symptoms_data,
var_stem = "symptoms",
group = "_t\\d",
group_type = "pattern",
group_name = "time_point",
ignore = c(symptoms = -999),
var_labels = c(symptoms_t1 = "# of symptoms at baseline",
symptoms_t2 = "# of symptoms at 6 months follow up",
symptoms_t3 = "# of symptoms at one-year follow up"))
## # A tibble: 3 × 8
## variable variable_label time_point mean sd min max nobs
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 symptoms_t1 # of symptoms at baseline t1 4.03 3.14 0 10 33
## 2 symptoms_t2 # of symptoms at 6 month… t2 5.12 3.33 0 10 33
## 3 symptoms_t3 # of symptoms at one-yea… t3 4.64 3.29 0 10 33
Finally, you can choose what information to return using the
only argument.
# Default: all summary statistics returned
# (mean, sd, min, max, nobs)
mean_group_tbl(data = symptoms_data,
var_stem = "symptoms",
group = "_t\\d",
group_type = "pattern",
group_name = "time_point",
ignore = c(symptoms = -999))
## # A tibble: 3 × 7
## variable time_point mean sd min max nobs
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 symptoms_t1 t1 4.03 3.14 0 10 33
## 2 symptoms_t2 t2 5.12 3.33 0 10 33
## 3 symptoms_t3 t3 4.64 3.29 0 10 33
# Means and non-missing observations only
mean_group_tbl(data = symptoms_data,
var_stem = "symptoms",
group = "_t\\d",
group_type = "pattern",
group_name = "time_point",
ignore = c(symptoms = -999),
only = c("mean", "nobs"))
## # A tibble: 3 × 4
## variable time_point mean nobs
## <chr> <chr> <dbl> <int>
## 1 symptoms_t1 t1 4.03 33
## 2 symptoms_t2 t2 5.12 33
## 3 symptoms_t3 t3 4.64 33
# Means and standard deviations only
mean_group_tbl(data = symptoms_data,
var_stem = "symptoms",
group = "_t\\d",
group_type = "pattern",
group_name = "time_point",
ignore = c(symptoms = -999),
only = c("mean", "sd"))
## # A tibble: 3 × 4
## variable time_point mean sd
## <chr> <chr> <dbl> <dbl>
## 1 symptoms_t1 t1 4.03 3.14
## 2 symptoms_t2 t2 5.12 3.33
## 3 symptoms_t3 t3 4.64 3.29