Getting Started with paneldesc

Dmitrii Tereshchenko

The paneldesc package provides tools for analyzing panel (longitudinal) data. It helps you explore the structure of your panel, examine missing value patterns, decompose numeric variables into between‑ and within‑entity components, and analyze transitions in categorical variables. It works with data frames that have been marked as panels using make_panel(), so you do not have to specify the entity and time identifiers in every call.

This vignette walks you through the basic workflow using the built‑in production dataset, a simulated unbalanced panel of firms over six years.

For a comprehensive guide with detailed examples, case studies, and extended tutorials, please visit the package web-book: https://dtereshch.github.io/paneldesc-guides/.

Installation

If you haven’t installed the package yet, you can get the stable version from CRAN.

install.packages("paneldesc")

Or you can install the development version from GitHub.

# install.packages("devtools")
devtools::install_github("dtereshch/paneldesc")

Loading the package

Load the package.

library(paneldesc)

Data import

The package includes a simulated dataset called production. It contains information on 30 firms over up to 6 years, with variables such as sales, capital, labor, industry, and ownership. Missing values are present in some variables to mimic real‑world data.

data(production)

To avoid repeatedly specifying the entity and time variables (firm and year), we create a panel_data object using make_panel(). This adds metadata that many subsequent functions will automatically use.

panel <- make_panel(production, index = c("firm", "year"))

Panel data structure analysis

The first group of functions is designed to analyze the structure of the panel.

describe_dimensions() returns the number of rows, distinct entities, distinct time periods, and substantive variables.

describe_dimensions(panel)
#>   rows entities periods variables
#> 1  180       30       6         5
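
To make the meaning of each column concrete, the same four counts can be reproduced by hand in base R. This is only a sketch on a small hypothetical data frame, toy, and not the package's implementation. It assumes that every column other than the two identifiers counts as a substantive variable, which matches the output above.

```r
# Hypothetical toy panel (not the production data):
# 3 firms, up to 3 years, 2 substantive variables
toy <- data.frame(
  firm  = c(1, 1, 1, 2, 2, 3),
  year  = c(1, 2, 3, 1, 2, 3),
  sales = c(10, 12, NA, 7, 8, 20),
  labor = c(5, 6, 7, NA, 4, 9)
)
index <- c("firm", "year")

dims <- data.frame(
  rows      = nrow(toy),
  entities  = length(unique(toy$firm)),
  periods   = length(unique(toy$year)),
  variables = length(setdiff(names(toy), index))  # everything except the identifiers
)
dims
```

For this toy data the result is 6 rows, 3 entities, 3 periods, and 2 variables.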

describe_periods() shows, for each time period, how many entities have non‑missing data in at least one substantive variable, along with their share of the total number of entities.

describe_periods(panel)
#>   year count share
#> 1    1    25 0.833
#> 2    2    28 0.933
#> 3    3    30 1.000
#> 4    4    29 0.967
#> 5    5    26 0.867
#> 6    6    19 0.633
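
The counting rule described above can be sketched in base R. The example uses a hypothetical data frame, toy, in which some firm-year rows exist but contain only missing values. It illustrates the idea (a row counts as present if at least one substantive variable is observed) and is not the package's own code.

```r
# Hypothetical toy panel: firm 2 in year 3 and firm 3 in year 1
# have rows, but all their substantive values are NA
toy <- data.frame(
  firm  = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  year  = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  sales = c(10, 12, 11, 7, 8, NA, NA, 21, 20),
  labor = c(5, NA, 7, 3, 4, NA, NA, 9, 9)
)
vars <- c("sales", "labor")

# A row is "present" if at least one substantive variable is observed
present <- rowSums(!is.na(toy[vars])) > 0

# Number of present entities per period
count <- tapply(toy$firm[present], toy$year[present],
                function(f) length(unique(f)))

periods <- data.frame(
  year  = as.numeric(names(count)),
  count = as.vector(count),
  share = round(as.vector(count) / length(unique(toy$firm)), 3)
)
periods
```

Here years 1 and 3 each have 2 of the 3 firms present, while year 2 has all 3.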

describe_balance() provides summary statistics for the distribution of entities per period and periods per entity.

describe_balance(panel)
#>   dimension   mean   std min max
#> 1  entities 26.167 3.971  19  30
#> 2   periods  5.233 0.935   3   6
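
As a cross-check, both rows of this table can be recomputed from numbers printed elsewhere in this vignette. The sketch below copies the entity counts from the describe_periods() output and reconstructs the periods per entity from the describe_patterns() output shown later. The variable names are illustrative only.

```r
# Entities per period, copied from the describe_periods() output
entities_per_period <- c(25, 28, 30, 29, 26, 19)

# Periods per entity, reconstructed from the describe_patterns() output:
# the number of observed periods in each pattern, repeated by its entity count
periods_per_entity <- rep(c(6, 5, 4, 4, 4, 5, 3),
                          times = c(16, 5, 3, 2, 2, 1, 1))

balance <- data.frame(
  dimension = c("entities", "periods"),
  mean = round(c(mean(entities_per_period), mean(periods_per_entity)), 3),
  std  = round(c(sd(entities_per_period), sd(periods_per_entity)), 3),
  min  = c(min(entities_per_period), min(periods_per_entity)),
  max  = c(max(entities_per_period), max(periods_per_entity))
)
balance
```

The recomputed values agree with the describe_balance() output, which suggests that std is the sample standard deviation (the one sd() computes).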

plot_periods() creates a histogram of the number of time periods covered by each entity.

plot_periods(panel)

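describe_patterns() tabulates the distinct patterns of presence and absence across time periods, with the number and share of entities that follow each pattern. In the output, 1 marks a period in which the entity is observed and 0 a period in which it is absent.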

describe_patterns(panel)
#>   pattern 1 2 3 4 5 6 count share
#> 1       1 1 1 1 1 1 1    16 0.533
#> 2       2 1 1 1 1 1 0     5 0.167
#> 3       3 1 1 1 1 0 0     3 0.100
#> 4       4 0 0 1 1 1 1     2 0.067
#> 5       5 0 1 1 1 1 0     2 0.067
#> 6       6 0 1 1 1 1 1     1 0.033
#> 7       7 1 1 1 0 0 0     1 0.033
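
The tabulation itself is simple to sketch in base R: collapse each entity's row of presence indicators into a single key and count identical keys. The presence matrix below is hypothetical, and the code is an illustration of the idea rather than the package's implementation.

```r
# Hypothetical presence matrix: rows are entities, columns are periods 1 to 3
# (1 = observed, 0 = absent)
presence <- rbind(
  c(1, 1, 1),
  c(1, 1, 0),
  c(1, 1, 1),
  c(0, 1, 1),
  c(1, 1, 1)
)

# Collapse each row into a string key and tabulate identical keys,
# most frequent pattern first
keys <- apply(presence, 1, paste, collapse = " ")
tab  <- sort(table(keys), decreasing = TRUE)

patterns <- data.frame(
  pattern = names(tab),
  count   = as.vector(tab),
  share   = round(as.vector(tab) / nrow(presence), 3)
)
patterns
```

In this toy example the fully observed pattern "1 1 1" comes first, with 3 of the 5 entities.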

You can also visualize these patterns with a heatmap using plot_patterns().

plot_patterns(panel)

Missing values analysis

The second group of functions analyzes missing values by variable, entity, and time period.

plot_missing() creates a heatmap showing the number of missing values for each variable in each time period. Darker cells indicate more missing values.

plot_missing(panel)
#> Analysing all variables: sales, capital, labor, industry, ownership

summarize_missing() returns a table with overall missing counts, shares, and the number of entities and periods affected per variable.

summarize_missing(panel)
#> Analyzing all variables: sales, capital, labor, industry, ownership
#>    variable na_count na_share entities periods
#> 1     sales       26    0.144       15       5
#> 2   capital       26    0.144       17       5
#> 3     labor       26    0.144       16       6
#> 4  industry       23    0.128       14       5
#> 5 ownership       23    0.128       14       5
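
The columns of this table can be approximated by hand. The sketch below uses a hypothetical data frame, toy, and assumes that entities and periods count the entities and periods with at least one missing value in the variable. That reading is consistent with the output above but is not taken from the package source.

```r
# Hypothetical toy panel with a few missing values
toy <- data.frame(
  firm  = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  year  = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  sales = c(10, NA, 11, 7, 8, NA, 15, 21, 20),
  labor = c(5, 6, 7, NA, 4, NA, 8, 9, 9)
)
vars <- c("sales", "labor")

missing_summary <- do.call(rbind, lapply(vars, function(v) {
  is_na <- is.na(toy[[v]])
  data.frame(
    variable = v,
    na_count = sum(is_na),
    na_share = round(mean(is_na), 3),
    entities = length(unique(toy$firm[is_na])),  # entities with at least one NA
    periods  = length(unique(toy$year[is_na]))   # periods with at least one NA
  )
}))
missing_summary
```

Both toy variables have 2 missing values, but the missing sales values are spread over 2 firms while the missing labor values belong to a single firm.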

describe_incomplete() lists entities that have at least one missing value, with the total number of missing values and the number of variables affected for each entity.

describe_incomplete(panel)
#>    firm na_count variables
#> 1    23       15         5
#> 2    21       11         5
#> 3     1       10         5
#> 4     2       10         5
#> 5     6       10         5
#> 6     7       10         5
#> 7    12       10         5
#> 8    26       10         5
#> 9    25        6         5
#> 10   30        6         5
#> 11    4        5         5
#> 12   13        5         5
#> 13   17        5         5
#> 14   29        5         5
#> 15    8        2         2
#> 16    3        1         1
#> 17   10        1         1
#> 18   14        1         1
#> 19   24        1         1
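
A per-entity version of the same bookkeeping can be sketched in base R. The data frame toy is hypothetical. The code counts missing values per firm and the number of variables with at least one missing value, then sorts by na_count in descending order, as the output above appears to do.

```r
# Hypothetical toy panel with a few missing values
toy <- data.frame(
  firm  = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  year  = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  sales = c(10, NA, 11, 7, 8, NA, 15, 21, 20),
  labor = c(5, 6, 7, NA, 4, NA, 8, 9, 9)
)
vars <- c("sales", "labor")
na_flags <- is.na(toy[vars])  # logical matrix: TRUE where a value is missing

by_firm <- data.frame(
  firm = sort(unique(toy$firm)),
  # Total number of missing cells per firm
  na_count = as.vector(tapply(rowSums(na_flags), toy$firm, sum)),
  # Number of variables with at least one missing value per firm
  variables = as.vector(tapply(
    seq_len(nrow(toy)), toy$firm,
    function(i) sum(colSums(na_flags[i, , drop = FALSE]) > 0)
  ))
)

# Keep incomplete firms only, most missing values first
incomplete <- by_firm[by_firm$na_count > 0, ]
incomplete <- incomplete[order(-incomplete$na_count), ]
incomplete
```

Firm 3 is complete and drops out. Firm 2 comes first with 3 missing values in 2 variables, followed by firm 1 with 1 missing value in 1 variable.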

Numeric variables analysis

The third group of functions summarizes numeric variables and separates their variation across entities from their variation over time.

summarize_numeric() calculates basic statistics (count, mean, std, min, max) for numeric variables.

summarize_numeric(panel)
#> Analyzing all numeric variables: sales, capital, labor
#>   variable count   mean    std   min     max
#> 1    sales   154 69.756 46.804 8.321 336.853
#> 2  capital   154 32.490 31.053 0.968 194.719
#> 3    labor   154 79.329 73.687 4.097 419.848

You can optionally group by another variable, which does not necessarily have to be a panel identifier. Here we use year.

summarize_numeric(panel, group = "year")
#> Analyzing all numeric variables: sales, capital, labor
#>    year variable count   mean     std    min     max
#> 1     1    sales    25 58.491  44.590  8.321 190.100
#> 2     1  capital    24 24.862  16.273  0.968  65.950
#> 3     1    labor    25 68.871  66.941  4.097 246.852
#> 4     2    sales    28 56.099  37.944 17.803 186.349
#> 5     2  capital    27 28.790  31.053  3.150 151.464
#> 6     2    labor    27 60.463  48.484 11.692 222.761
#> 7     3    sales    30 76.660  47.574 20.580 219.513
#> 8     3  capital    30 35.464  39.174  4.729 194.719
#> 9     3    labor    29 90.437  82.628  9.284 414.844
#> 10    4    sales    28 73.104  33.238 19.455 135.118
#> 11    4  capital    29 44.522  35.375  5.080 132.898
#> 12    4    labor    29 73.967  54.005 16.327 240.726
#> 13    5    sales    24 75.398  43.091 20.161 211.092
#> 14    5  capital    25 28.351  23.127  5.339  86.078
#> 15    5    labor    26 90.604  85.026 21.063 413.784
#> 16    6    sales    19 81.744  73.320 20.352 336.853
#> 17    6  capital    19 29.767  30.908  2.288 108.787
#> 18    6    labor    18 96.609 103.777 20.507 419.848

plot_heterogeneity() visualizes the distribution of a numeric variable across groups. We use select = "sales" to look at sales, and the function automatically uses the entity and time variables as groups because panel has panel attributes.

plot_heterogeneity(panel, select = "sales")

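decompose_numeric() decomposes the variation in each numeric variable into between‑entity and within‑entity components, reporting overall, between, and within statistics. In the count column, the between row gives the number of entities (30) and the within row gives the average number of observations per entity (154 / 30 ≈ 5.133).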

decompose_numeric(panel)
#> Analyzing all numeric variables: sales, capital, labor
#>   variable dimension   mean    std     min     max   count
#> 1    sales   overall 69.756 46.804   8.321 336.853 154.000
#> 2    sales   between     NA 29.776  25.772 159.197  30.000
#> 3    sales    within     NA 35.862 -28.397 247.412   5.133
#> 4  capital   overall 32.490 31.053   0.968 194.719 154.000
#> 5  capital   between     NA 13.969   8.671  75.083  30.000
#> 6  capital    within     NA 27.701 -22.444 152.126   5.133
#> 7    labor   overall 79.329 73.687   4.097 419.848 154.000
#> 8    labor   between     NA 44.023  24.606 175.731  30.000
#> 9    labor    within     NA 59.561 -77.709 323.445   5.133
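
One common way to compute such a decomposition is the convention used by Stata's xtsum: the between component is the set of entity means, and the within component is each observation's deviation from its entity mean, re-centred on the overall mean. Whether paneldesc does exactly this is an assumption, but the output above is consistent with it. For sales, the largest value minus the largest entity mean plus the overall mean, 336.853 - 159.197 + 69.756, equals the reported within maximum of 247.412. The sketch below applies this convention to a hypothetical data frame, toy.

```r
# Hypothetical toy panel: 2 firms, 3 observations each, no missing values
toy <- data.frame(
  firm  = c(1, 1, 1, 2, 2, 2),
  sales = c(10, 12, 14, 20, 22, 24)
)

overall_mean <- mean(toy$sales)           # grand mean
firm_mean    <- ave(toy$sales, toy$firm)  # entity mean, repeated on each row

# Between: one mean per entity
between <- as.vector(tapply(toy$sales, toy$firm, mean))
# Within: deviation from the entity mean, re-centred on the grand mean
within <- toy$sales - firm_mean + overall_mean

decomp <- data.frame(
  dimension = c("overall", "between", "within"),
  std = c(sd(toy$sales), sd(between), sd(within)),
  min = c(min(toy$sales), min(between), min(within)),
  max = c(max(toy$sales), max(between), max(within))
)
decomp
```

In the toy data the two firms differ by 10 on average but each varies only by ±2 around its own mean, so the between standard deviation is much larger than the within one. Because of the re-centring, within values can fall below the observed minimum and can even be negative, as in the production output.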

Factor variables analysis

The last group of functions analyzes factor (categorical) variables: how categories are distributed across and within entities, and how entities move between categories over time.

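decompose_factor() breaks down the overall frequency of each category into a between‑entity component (how many entities ever have that category) and a within‑entity component (the average share of time spent in that category among the entities that ever have it). Between shares can sum to more than 1, because an entity that changes category is counted under every category it ever has.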

decompose_factor(panel)
#> Analyzing all factor variables: industry, ownership
#>    variable   category count_overall share_overall count_between share_between
#> 1  industry Industry 1            63         0.401            13         0.433
#> 2  industry Industry 2            45         0.287            11         0.367
#> 3  industry Industry 3            49         0.312            10         0.333
#> 4 ownership    private            76         0.484            16         0.533
#> 5 ownership     public            55         0.350            13         0.433
#> 6 ownership      mixed            26         0.166             7         0.233
#>   share_within
#> 1        0.918
#> 2        0.809
#> 3        0.917
#> 4        0.898
#> 5        0.813
#> 6        0.724
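
The three components can be illustrated in base R on a hypothetical data frame, toy. The sketch assumes that share_within is the simple average of per-entity shares over the entities that ever have the category. The package may weight entities differently, so treat this as an illustration of the idea rather than a reproduction of its arithmetic.

```r
# Hypothetical toy panel: firm 1 changes ownership once, firms 2 and 3 never do
toy <- data.frame(
  firm      = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  ownership = c("private", "private", "mixed",
                "public", "public", "public",
                "private", "private", "private")
)

# Share of each firm's observations spent in each category (rows sum to 1)
shares <- prop.table(table(toy$firm, toy$ownership), margin = 1)
cats   <- colnames(shares)

decomp <- data.frame(
  category      = cats,
  count_overall = as.vector(table(toy$ownership)[cats]),  # observations
  count_between = as.vector(colSums(shares > 0)),         # firms ever in category
  # Average within-firm share, over firms that ever have the category
  share_within  = sapply(cats, function(k) mean(shares[shares[, k] > 0, k]),
                         USE.NAMES = FALSE)
)
decomp$share_overall <- round(decomp$count_overall / nrow(toy), 3)
decomp$share_between <- round(decomp$count_between / length(unique(toy$firm)), 3)
decomp
```

Two of the three toy firms are ever private. One of them is private in all of its observations and the other in two thirds of them, so the within share for private is the average of 1 and 2/3.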

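summarize_transition() computes transition counts and shares between states of a factor variable over consecutive time periods. The output below reports shares: each row is the current state, each column is the state in the next period, and the shares in a row sum to 1 up to rounding. Here we analyze transitions in ownership.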

summarize_transition(panel, select = "ownership")
#> 23 rows with NA values in 'ownership' removed.
#>   from_to private public mixed
#> 1 private   0.950  0.000 0.050
#> 2  public   0.016  0.984 0.000
#> 3   mixed   0.043  0.065 0.891
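
A transition table of this kind can be built in base R by pairing each observation with the next observation of the same entity. The sketch below uses a hypothetical data frame, toy, without missing values, and keeps only pairs from consecutive years. How the package treats gaps left by removed NA rows is not shown in this vignette, so that rule is an assumption.

```r
# Hypothetical toy panel: 2 firms observed in 4 consecutive years
toy <- data.frame(
  firm      = c(1, 1, 1, 1, 2, 2, 2, 2),
  year      = c(1, 2, 3, 4, 1, 2, 3, 4),
  ownership = c("private", "private", "mixed", "mixed",
                "public", "public", "private", "private")
)
toy <- toy[order(toy$firm, toy$year), ]
n <- nrow(toy)

# Pair each row with the next one; keep pairs that belong to the same firm
# and to consecutive years
same_firm   <- toy$firm[-1] == toy$firm[-n]
consecutive <- toy$year[-1] - toy$year[-n] == 1
keep <- same_firm & consecutive

from <- toy$ownership[-n][keep]  # state in period t
to   <- toy$ownership[-1][keep]  # state in period t + 1

counts <- table(from, to)
shares <- round(prop.table(counts, margin = 1), 3)  # row-wise shares
shares
```

The toy data yield six transitions. Private firms stay private in 2 of 3 cases and become mixed once, while public firms stay public once and become private once.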