The PanelSelect package supports a family of panel sample selection models. The models extend the classic Heckman selection framework to panel data with individual random effects, enabling more robust and flexible correction for non-random participation and attrition in panel settings.
The general structure involves a two-stage setup:
The first stage is a panel Probit model describing the participation or response decision.
The second stage can be a panel linear, Probit, Poisson, or Poisson log-normal model for the outcome.
The package allows two distinct sources of selection bias:
Selection on individual effects (i.e., correlation between random effects across equations)
Selection on idiosyncratic errors (i.e., contemporaneous correlation of error terms)
These two types of selection can be activiated separately or jointly, providing flexibility for empirical applications.
The likelihood functions for these models are detailed in Bailey and Peng (2025) and Peng and Van den Bulte (2024). Because they lack closed-form solutions, likelihoods are approximated numerically via Gaussian–Hermite quadrature. Two types of standard errors are available: (1) Hessian-based standard errors and (2) BHHH (Berndt–Hall–Hall–Hausman) standard errors.
Table 1 lists the supported models. Panel count models are supported through a companion package PanelCount.
| Model | First Stage | Second Stage | Outcome Variable | Package |
|---|---|---|---|---|
| probitRE_linearRE | probitRE | linearRE | linear | PanelSelect |
| probitRE_probitRE | probitRE | ProbitRE | binary | PanelSelect |
| probitRE_PoissonRE | probitRE | PoissonRE | count | PanelCount |
| probitRE_PLNRE | probitRE | PLN_RE | count | PanelCount |
Let \(\mathbf{w}_{it}\) and \(\mathbf{x}_{it}\) represent the row vectors of covariates in the selection and outcome equations, respectively, with \(\boldsymbol{\alpha}\) and \(\boldsymbol{\beta}\) denoting the corresponding column vectors of parameters. Define \(d_{it}\) as a binary participation or response indicator, and \(y_{it}\) as the outcome variable, which is observed only when \(d_{it} = 1\). Individuals are indexed by \(i = 1, \ldots, N\), and time periods by \(t = 1, \ldots, T\). The models supported by the package are summarized below, where “RE” denotes the inclusion of individual random effects.
First stage (ProbitRE): \[d_{it} = 1( \mathbf{w}_{it} \boldsymbol{\alpha} + \delta u_i + \varepsilon_{it} > 0 )\]
Second stage (LinearRE):
\[y_{it}^* = \mathbf{x}_{it} \boldsymbol{\beta} + \lambda v_i + \sigma\epsilon_{it}\]
Correlation structure:
\[ \begin{pmatrix} u_i \\ v_i \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix} \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right), \quad \begin{pmatrix} \varepsilon_{it} \\ \epsilon_{it} \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \tau \\ \tau & 1 \end{pmatrix} \right). \]
Here, \(\delta\), \(\lambda\), and \(\sigma\) are positive parameters capturing the standard deviation of random effects and the second-stage error. The variance of the first-stage error is normalized to one to ensure unique identification of model parameters, consistent with the convention in Probit models.
First stage (ProbitRE): \[d_{it} = 1( \mathbf{w}_{it} \boldsymbol{\alpha} + \delta u_i + \varepsilon_{it} > 0 )\]
Second stage (LinearRE):
\[y_{it}^* = 1( \mathbf{x}_{it} \boldsymbol{\beta} + \lambda v_i + \epsilon_{it} > 0 )\]
Correlation structure:
\[ \begin{pmatrix} u_i \\ v_i \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix} \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right), \quad \begin{pmatrix} \varepsilon_{it} \\ \epsilon_{it} \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \tau \\ \tau & 1 \end{pmatrix} \right). \]
Since both stages are Probit models, the variances of error terms in both stages are fixed to one to ensure model identification.
First stage (ProbitRE): \[d_{it} = 1( \mathbf{w}_{it} \boldsymbol{\alpha} + \delta u_i + \varepsilon_{it} > 0 )\]
Second stage (PoissonRE): \[E[y_{it}|x_{it},v_i] = exp(\mathbf{x}_{it} \boldsymbol{\beta} + \sigma v_i)\]
Correlation structure:
\[ \begin{pmatrix} u_i \\ v_i \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix} \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right). \] In the second stage, the random effect \(v_i\) accounts for unobserved heterogeneity and potential overdispersion in the count outcome. Because the Poisson distribution intrinsically models conditional variance, no additional idiosyncratic error term is introduced in this specification.
First stage (ProbitRE): \[d_{it} = 1( \mathbf{w}_{it} \boldsymbol{\alpha} + \delta u_i + \varepsilon_{it} > 0 )\]
Second stage (PLN_RE): \[E[y_{it}|x_{it},v_i,\epsilon_{it}] = exp(\mathbf{x}_{it} \boldsymbol{\beta} + \sigma v_i + \gamma\epsilon_{it})\]
Correlation structure:
\[ \begin{pmatrix} u_i \\ v_i \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix} \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right), \quad \begin{pmatrix} \varepsilon_{it} \\ \epsilon_{it} \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \tau \\ \tau & 1 \end{pmatrix} \right). \] This model introduces an additional random effect at the individual–time level, thereby allowing selection to occur at both the individual and individual–time levels. However, \(\epsilon_{it}\) should be interpreted as a random effect rather than as an idiosyncratic error term.
After loading the PanelSelect package, type “example(model_name)” to see sample code for each model. The code below runs the probitRE_linearRE model on a simulated dataset with the following data generating process (DGP):
\[d_{it} = 1( x_{it} + u_i + \varepsilon_{it} > 0 )\]
\[y_{it}^* = x_{it} + v_i + \epsilon_{it}\]
\[ \begin{pmatrix} u_i \\ v_i \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix} \begin{pmatrix} 1 & -0.5 \\ -0.5 & 1 \end{pmatrix} \right), \quad \begin{pmatrix} \varepsilon_{it} \\ \epsilon_{it} \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix} \right). \]
library(PanelSelect)
library(MASS)
N = 500
period = 5
obs = N*period
rho = -0.5
tau = 0.5
set.seed(100)
re = mvrnorm(N, mu=c(0,0), Sigma=matrix(c(1,rho,rho,1), nrow=2))
u = rep(re[,1], each=period)
v = rep(re[,2], each=period)
e = mvrnorm(obs, mu=c(0,0), Sigma=matrix(c(1,tau,tau,1), nrow=2))
e1 = e[,1]
e2 = e[,2]
t = rep(1:period, N)
id = rep(1:N, each=period)
w = rnorm(obs)
z = rnorm(obs)
x = rnorm(obs)
d = as.numeric(x + w + u + e1 > 0)
y = x + w + v + e2
y[d==0] = NA
dt = data.frame(id, t, y, x, w, z, d)
# As N increases, the parameter estimates will be more accurate
m = probitRE_linearRE(d~x+w, y~x+w, 'id', dt, H=10, verbose=-1)
print(m$estimates, digits=4)
#> estimate se z p lci uci
#> probit.(Intercept) -0.01675 0.05586 -0.2998 7.643e-01 -0.12623 0.09274
#> probit.x 0.96137 0.05128 18.7461 0.000e+00 0.86086 1.06188
#> probit.w 1.04892 0.05324 19.7033 0.000e+00 0.94458 1.15326
#> linear.(Intercept) 0.03536 0.15517 0.2279 8.197e-01 -0.26877 0.33950
#> linear.x 0.94444 0.06366 14.8362 0.000e+00 0.81967 1.06921
#> linear.w 0.92165 0.06963 13.2359 0.000e+00 0.78518 1.05813
#> delta 0.97825 0.07005 13.9653 0.000e+00 0.85015 1.12564
#> lambda 1.01392 0.05469 18.5410 0.000e+00 0.91221 1.12697
#> sigma 1.02086 0.03570 28.5923 0.000e+00 0.95322 1.09329
#> rho -0.46251 0.07913 -5.8452 5.060e-09 -0.60296 -0.29425
#> tau 0.44739 0.15957 2.8038 5.051e-03 0.09018 0.70261The resulting estimates closely match the true parameter values, confirming the model’s effectiveness in recovering the underlying parameters.
This section highlights several practical considerations while using this package.
As shown in Bailey and Peng (2025), panel selection models typically achieve stronger identification than their cross-sectional counterparts.
The panel need not be balanced. Also, individuals are allowed to make different participation decisions at different time periods.
The covariates used in the first-stage and second-stage equations may differ, although excluding a variable from one stage typically requires theoretical justification. When in doubt, it is advisable to use the same set of covariates in both stages.
The model allows the two correlation parameters \(\rho\) and \(\tau\) to be turned off separately or jointly.
Inputs and output formats for models imported from the PanelCount package are somewhat different.
When the outcome variable is observed regardless of participation, the endogeneity package may be more appropriate, although it currently focuses on cross-sectional settings.
Bailey, M., & Peng, J. (2025). A Random Effects Model of Non-Ignorable Nonresponse in Panel Survey Data. Available at SSRN https://www.ssrn.com/abstract=5475626
Peng, J., & Van den Bulte, C. (2024). Participation vs. Effectiveness in Sponsored Tweet Campaigns: A Quality-Quantity Conundrum. Management Science, 70(11):7345-8215. Available at https://doi.org/10.1287/mnsc.2019.01897