Double truncation occurs when a data sample only includes subjects
whose event time lies within a random truncation interval. For example,
the survdt package includes the doubly truncated dataset
aids, which holds records of AIDS incubation times from a
retrospective sample of patients with transfusion-acquired HIV. The
inclusion criteria were that the patient was diagnosed with AIDS between
the discovery of the disease in 1982 and the end of the study in 1986,
with time measured in months since HIV infection. This sampling scheme
was chosen so that the precise HIV infection time could be ascertained
retrospectively from a blood transfusion registry. The event time of
interest is the AIDS incubation time incu, with the
truncation times contained in ltrunc and
rtrunc respectively. See ?survdt::aids for
more details.
The survdt package has a special Survdt
data structure designed to store doubly truncated data. Below we create
a Survdt object using the aids dataset and
plot a simple visualization of the sample. See
?survdt::Survdt for more details.
y <- Survdt(aids$incu, aids$ltrunc, aids$rtrunc)
head(y)
#> time ltrunc rtrunc
#> [1,] 28 25 80
#> [2,] 14 9 64
#> [3,] 10 2 57
#> [4,] 10 -1 54
#> [5,] 23 8 63
#> [6,] 13 -3 52
#> attr(,"class")
#> [1] "Survdt"
plot(y)Each row of this plot contains a subject’s truncation interval as a gray rectangle and the subject’s event time in black, with rows sorted by event time.
The nonparametric maximum likelihood estimator (NPMLE) of the event time cumulative distribution function (cdf) is a right-continuous step function with point masses at the observed event times. Think of it like an empirical distribution function that is adjusted for double truncation bias. It is estimated iteratively via an EM algorithm. The underlying assumptions for the NPMLE are discussed in Assumptions required for the NPMLE.
The survdt package includes the function
npmle() to compute the NPMLE and its standard errors. The
first input argument should be a Survdt object or
expression, which is evaluated using the variables found in the
data argument (if provided). See
?survdt::npmle for more details.
Below we fit the NPMLE on the aids data and plot both
the estimated survival function and estimated cumulative hazard
function, along with pointwise 95% confidence intervals. Other plot
options include the estimated selection (non-truncation) probability
over time and the estimated inverse probability weights. See
?survdt::plot.npmle for more details.
npfit <- npmle(Survdt(incu, ltrunc, rtrunc), data = aids, cumhaz = TRUE)
npfit
#>
#> NPMLE fit based on 295 complete samples (0 dropped due to missingness)
#> with 71 unique event times:
#>
#> time cdf se
#> [1,] 0 0.0007053536 0.0007291657
#> [2,] 4 0.0041320035 0.0017019771
#> [3,] 5 0.0054913032 0.0020516114
#> [4,] 6 0.0070255698 0.0023356958
#> [5,] 8 0.0107959130 0.0028739745
#> ...
plot(npfit)
These confidence intervals are computed efficiently through closed-form
standard errors, i.e. without relying on bootstrap or other resampling
methods, in contrast to some other R packages for double truncation.
The aids data has the factor variable
age_gp which categorizes each subject based on age at HIV
infection. The three age groups are children (age \(\leq 4\)), adults (age \(4-59\)), and elderly (age \(>59\)). The group sizes are:
In order to test for differences in incubation times across the three
age groups, the survdt package provides two general
options.
First, the data can be analyzed through a Cox proportional hazards
model using methods that correct for double truncation bias. See
vignette("cox-regression-quasiindep", package = "survdt")
and ?survdt::coxph_indtrunc for more details on how to do
this with the survdt package.
Second, the data can be analyzed nonparametrically to detect any
difference in the incubation time distributions without assuming
proportional hazards. The survdt package implements this in
the function test_samplediff_indtrunc(). The first input
argument should be a formula with a Survdt response and
variables defining the groups, i.e. strata, to test across on the right
hand side. The test can be based on the supremum difference between
survival functions or between cumulative hazards, both estimated by the
NPMLE. The p-value is computed by simulating from the estimated
asymptotic null distribution of the test statistic. See
?survdt::test_samplediff_indtrunc for more details.
Below we test for differences in the AIDS incubation time distribution across the three age groups. We limit the range of the test to at most 70 months since HIV infection to avoid the high variability associated with the tail of the distribution (see the plot in Nonparametric maximum likelihood estimation).
agetest <- test_samplediff_indtrunc(Survdt(incu, ltrunc, rtrunc) ~ age_gp,
data = aids,
restr_times = c(NA, 70))
agetest
#>
#> Test for group differences under quasi-independent truncation.
#>
#> Unstratified NPMLE fit based on 295 complete samples (0 dropped due to missingness)
#> with 71 unique event times.
#> Min event time: 0
#> Max event time: 89
#>
#> Supremum test of type "survival".
#> Test statistic: 0.74
#> Time range: [0, 70]
#> Time weights: none
#> Critical value: 0.21 at level 0.95
#> Significant evidence found against equality across strata (p<5e-04).
#> Results based on 2000 simulations from the estimated null distribution.
#>
#> Summary of the 3 groups compared:
#> stratum size unique event times
#> age_gp>59 141 59
#> age_gp4-59 120 59
#> age_gp<=4 34 20We find highly significant evidence that the incubation time distribution differs across age groups. The test result can be visualized through the following plot, which compares the group-specific survival function estimates with the unstratified estimates. Under the null hypothesis of equality across strata, the stratified estimates should all lie within the 95% uniform confidence band, denoted by dashed lines.
Plotting the test result is recommended in general, since it provides a
better understanding of the estimated effect size and power of the
test.
The NPMLE requires two key conditions. First, all possible event
times must have a positive probability of being observed. This can be
determined from the study design, as well as by comparing the observed
event times to the range of possible event times based on external
knowledge. Second, the truncation times must be independent of the event
time within the observable data region, commonly known as
quasi-independence. See Testing for
quasi-independent truncation for more information on how to test
this assumption with the survdt package.
The survdt package includes the function
test_quasiindep_ctau() to test for quasi-independence
between the event and truncation times using the conditional Kendall’s
tau rank correlation. The first input argument should be a
Survdt object or expression, which is evaluated using the
variables found in the data argument (if provided). It
computes the (event time, left truncation time) and (event time, right
truncation time) rank correlations (tau) accounting for
double truncation, as well as a p-value for testing whether both are
zero. See ?survdt::test_quasiindep_ctau for more
details.
Below we apply this quasi-independence test to the aids
data.
test_quasiindep_ctau(Survdt(incu, ltrunc, rtrunc), data = aids)
#>
#> Quasi-independent truncation test by conditional rank correlations.
#>
#> Conditional Kendall's tau between event time and...
#> left truncation time: 0.07
#> right truncation time: 0.07
#> Test statistic: chisq=3.5, df=1
#> p-value: 0.06
#> Proportion of subject pairs comparable was 0.47.Based on the small estimated correlations and moderate p-value, we do
not find strong evidence against quasi-independence in the
aids data.
A unique feature of doubly truncated data, compared to one-sided truncation, is that it is possible in some cases for there to be no effective sampling bias induced by truncation. This occurs when the sampling probability is constant across all event times. If the sampling bias is ignorable, standard methods which do not correct for double truncation bias may be used to analyze the data.
The survdt package includes the function
test_ignorability_indtrunc to test for ignorable sampling
bias nonparametrically through the NPMLE. The first input argument can
be a Survdt object/expression or the fitted NPMLE object
returned by npmle(). The test can be based on the supremum
difference between the time-specific and overall selection probabilities
or the supremum difference between the NPMLE CDF and the empirical CDF
(not adjusted for double truncation). The p-value is computed by
simulating from the estimated asymptotic null distribution of the test
statistic. See ?survdt::test_ignorability_indtrunc for more
details.
Below we apply these tests to the aids data. The output
from test_ignorability_indtrunc has a plot
method to help visualize the deviation of the NPMLE from its reference
value using a uniform confidence band. For the CDF-based test, we limit
the range of the test to at most 80 months since HIV infection to avoid
the high variability associated with the tail of the distribution (see
the plot in Nonparametric
maximum likelihood estimation).
igtest_prob <- test_ignorability_indtrunc(npfit, test_type = "sel_prob")
igtest_prob
#>
#> Ignorable sampling bias test under quasi-independent truncation.
#>
#> NPMLE fit based on 295 complete samples (0 dropped due to missingness)
#> with 71 unique event times.
#> Min event time: 0
#> Max event time: 89
#>
#> Supremum test of type "sel_prob".
#> Test statistic: 4.0
#> Time range: [0, 89]
#> Time weights: none
#> Critical value: 1.5 at level 0.95
#> Significant evidence found against ignorability (p<5e-04).
#> Results based on 2000 simulations from the estimated null distribution.
plot(igtest_prob)
igtest_cdf <- test_ignorability_indtrunc(npfit, test_type = "cdf",
restr_times = c(NA, 80))
igtest_cdf
#>
#> Ignorable sampling bias test under quasi-independent truncation.
#>
#> NPMLE fit based on 295 complete samples (0 dropped due to missingness)
#> with 71 unique event times.
#> Min event time: 0
#> Max event time: 89
#>
#> Supremum test of type "cdf".
#> Test statistic: 0.51
#> Time range: [0, 80]
#> Time weights: none
#> Critical value: 0.18 at level 0.95
#> Significant evidence found against ignorability (p<5e-04).
#> Results based on 2000 simulations from the estimated null distribution.
plot(igtest_cdf)Both tests show highly significant evidence against ignorable
sampling bias in the aids data. Thus this data should be
analyzed with methods that directly account for double truncation
bias.
To illustrate the utility of plotting the test results, we run the CDF-based test without restricting the time range for the test statistic.
igtest_cdf2 <- test_ignorability_indtrunc(npfit, test_type = "cdf")
igtest_cdf2
#>
#> Ignorable sampling bias test under quasi-independent truncation.
#>
#> NPMLE fit based on 295 complete samples (0 dropped due to missingness)
#> with 71 unique event times.
#> Min event time: 0
#> Max event time: 89
#>
#> Supremum test of type "cdf".
#> Test statistic: 0.51
#> Time range: [0, 89]
#> Time weights: none
#> Critical value: 0.52 at level 0.95
#> Significant evidence not found against ignorability (p=0.057).
#> Results based on 2000 simulations from the estimated null distribution.
plot(igtest_cdf2)This unrestricted test returns a much larger p-value, and after
plotting the result it is evident that this is due to the large variance
of the NPMLE cdf at long incubation times. Thus the unrestricted CDF
test has low power to detect what is clearly a large deviation from
ignorability. This issue can occur when the estimated selection
probabilities approach zero within the event time range. The plot below
confirms that this is the case for the aids data.