Nonparametric Analysis of Doubly Truncated Data

library(survdt)

Doubly truncated data

Double truncation occurs when a data sample only includes subjects whose event time lies within a random truncation interval. For example, the survdt package includes the doubly truncated dataset aids, which holds records of AIDS incubation times from a retrospective sample of patients with transfusion-acquired HIV. The inclusion criteria were that the patient was diagnosed with AIDS between the discovery of the disease in 1982 and the end of the study in 1986, with time measured in months since HIV infection. This sampling scheme was chosen so that the precise HIV infection time could be ascertained retrospectively from a blood transfusion registry. The event time of interest is the AIDS incubation time incu, with the truncation times contained in ltrunc and rtrunc respectively. See ?survdt::aids for more details.

The survdt package has a special Survdt data structure designed to store doubly truncated data. Below we create a Survdt object using the aids dataset and plot a simple visualization of the sample. See ?survdt::Survdt for more details.

y <- Survdt(aids$incu, aids$ltrunc, aids$rtrunc)
head(y)
#>      time ltrunc rtrunc
#> [1,]   28     25     80
#> [2,]   14      9     64
#> [3,]   10      2     57
#> [4,]   10     -1     54
#> [5,]   23      8     63
#> [6,]   13     -3     52
#> attr(,"class")
#> [1] "Survdt"
plot(y)

Each row of this plot contains a subject’s truncation interval as a gray rectangle and the subject’s event time in black, with rows sorted by event time.

Nonparametric maximum likelihood estimation

The nonparametric maximum likelihood estimator (NPMLE) of the event time cumulative distribution function (cdf) is a right-continuous step function with point masses at the observed event times. Think of it like an empirical distribution function that is adjusted for double truncation bias. It is estimated iteratively via an EM algorithm. The underlying assumptions for the NPMLE are discussed in Assumptions required for the NPMLE.

The survdt package includes the function npmle() to compute the NPMLE and its standard errors. The first input argument should be a Survdt object or expression, which is evaluated using the variables found in the data argument (if provided). See ?survdt::npmle for more details.

Below we fit the NPMLE on the aids data and plot both the estimated survival function and estimated cumulative hazard function, along with pointwise 95% confidence intervals. Other plot options include the estimated selection (non-truncation) probability over time and the estimated inverse probability weights. See ?survdt::plot.npmle for more details.

npfit <- npmle(Survdt(incu, ltrunc, rtrunc), data = aids, cumhaz = TRUE)
npfit
#> 
#> NPMLE fit based on 295 complete samples (0 dropped due to missingness)
#> with 71 unique event times:
#> 
#>      time          cdf           se
#> [1,]    0 0.0007053536 0.0007291657
#> [2,]    4 0.0041320035 0.0017019771
#> [3,]    5 0.0054913032 0.0020516114
#> [4,]    6 0.0070255698 0.0023356958
#> [5,]    8 0.0107959130 0.0028739745
#> ...

plot(npfit)

plot(npfit, target = "cumhaz")

These confidence intervals are computed efficiently through closed-form standard errors, i.e. without relying on bootstrap or other resampling methods, in contrast to some other R packages for double truncation.

Testing for event time differences across multiple groups

The aids data has the factor variable age_gp which categorizes each subject based on age at HIV infection. The three age groups are children (age \(\leq 4\)), adults (age \(4-59\)), and elderly (age \(>59\)). The group sizes are:

summary(aids$age_gp)
#>  >59 4-59  <=4 
#>  141  120   34

In order to test for differences in incubation times across the three age groups, the survdt package provides two general options.

First, the data can be analyzed through a Cox proportional hazards model using methods that correct for double truncation bias. See vignette("cox-regression-quasiindep", package = "survdt") and ?survdt::coxph_indtrunc for more details on how to do this with the survdt package.

Second, the data can be analyzed nonparametrically to detect any difference in the incubation time distributions without assuming proportional hazards. The survdt package implements this in the function test_samplediff_indtrunc(). The first input argument should be a formula with a Survdt response and variables defining the groups, i.e. strata, to test across on the right hand side. The test can be based on the supremum difference between survival functions or between cumulative hazards, both estimated by the NPMLE. The p-value is computed by simulating from the estimated asymptotic null distribution of the test statistic. See ?survdt::test_samplediff_indtrunc for more details.

Below we test for differences in the AIDS incubation time distribution across the three age groups. We limit the range of the test to at most 70 months since HIV infection to avoid the high variability associated with the tail of the distribution (see the plot in Nonparametric maximum likelihood estimation).

agetest <- test_samplediff_indtrunc(Survdt(incu, ltrunc, rtrunc) ~ age_gp, 
                                    data = aids,
                                    restr_times = c(NA, 70))
agetest
#> 
#> Test for group differences under quasi-independent truncation.
#> 
#>   Unstratified NPMLE fit based on 295 complete samples (0 dropped due to missingness)
#>     with 71 unique event times.
#>   Min event time: 0
#>   Max event time: 89
#> 
#>   Supremum test of type "survival".
#>   Test statistic: 0.74
#>     Time range:   [0, 70]
#>     Time weights: none
#>   Critical value: 0.21 at level 0.95
#>   Significant evidence found against equality across strata (p<5e-04).
#>   Results based on 2000 simulations from the estimated null distribution.
#> 
#> Summary of the 3 groups compared:
#>     stratum size unique event times
#>   age_gp>59  141                 59
#>  age_gp4-59  120                 59
#>   age_gp<=4   34                 20

We find highly significant evidence that the incubation time distribution differs across age groups. The test result can be visualized through the following plot, which compares the group-specific survival function estimates with the unstratified estimates. Under the null hypothesis of equality across strata, the stratified estimates should all lie within the 95% uniform confidence band, denoted by dashed lines.

plot(agetest)

Plotting the test result is recommended in general, since it provides a better understanding of the estimated effect size and power of the test.

Assumptions required for the NPMLE

The NPMLE requires two key conditions. First, all possible event times must have a positive probability of being observed. This can be determined from the study design, as well as by comparing the observed event times to the range of possible event times based on external knowledge. Second, the truncation times must be independent of the event time within the observable data region, commonly known as quasi-independence. See Testing for quasi-independent truncation for more information on how to test this assumption with the survdt package.

Testing for quasi-independent truncation

The survdt package includes the function test_quasiindep_ctau() to test for quasi-independence between the event and truncation times using the conditional Kendall’s tau rank correlation. The first input argument should be a Survdt object or expression, which is evaluated using the variables found in the data argument (if provided). It computes the (event time, left truncation time) and (event time, right truncation time) rank correlations (tau) accounting for double truncation, as well as a p-value for testing whether both are zero. See ?survdt::test_quasiindep_ctau for more details.

Below we apply this quasi-independence test to the aids data.

test_quasiindep_ctau(Survdt(incu, ltrunc, rtrunc), data = aids)
#> 
#> Quasi-independent truncation test by conditional rank correlations.
#> 
#>   Conditional Kendall's tau between event time and...
#>     left truncation time: 0.07
#>     right truncation time: 0.07
#>   Test statistic: chisq=3.5, df=1
#>   p-value: 0.06
#>   Proportion of subject pairs comparable was 0.47.

Based on the small estimated correlations and moderate p-value, we do not find strong evidence against quasi-independence in the aids data.

Testing for ignorable sampling bias

A unique feature of doubly truncated data, compared to one-sided truncation, is that it is possible in some cases for there to be no effective sampling bias induced by truncation. This occurs when the sampling probability is constant across all event times. If the sampling bias is ignorable, standard methods which do not correct for double truncation bias may be used to analyze the data.

The survdt package includes the function test_ignorability_indtrunc to test for ignorable sampling bias nonparametrically through the NPMLE. The first input argument can be a Survdt object/expression or the fitted NPMLE object returned by npmle(). The test can be based on the supremum difference between the time-specific and overall selection probabilities or the supremum difference between the NPMLE CDF and the empirical CDF (not adjusted for double truncation). The p-value is computed by simulating from the estimated asymptotic null distribution of the test statistic. See ?survdt::test_ignorability_indtrunc for more details.

Below we apply these tests to the aids data. The output from test_ignorability_indtrunc has a plot method to help visualize the deviation of the NPMLE from its reference value using a uniform confidence band. For the CDF-based test, we limit the range of the test to at most 80 months since HIV infection to avoid the high variability associated with the tail of the distribution (see the plot in Nonparametric maximum likelihood estimation).

igtest_prob <- test_ignorability_indtrunc(npfit, test_type = "sel_prob")
igtest_prob
#> 
#> Ignorable sampling bias test under quasi-independent truncation.
#> 
#>   NPMLE fit based on 295 complete samples (0 dropped due to missingness)
#>     with 71 unique event times.
#>   Min event time: 0
#>   Max event time: 89
#> 
#>   Supremum test of type "sel_prob".
#>   Test statistic: 4.0
#>     Time range:   [0, 89]
#>     Time weights: none
#>   Critical value: 1.5 at level 0.95
#>   Significant evidence found against ignorability (p<5e-04).
#>   Results based on 2000 simulations from the estimated null distribution.
plot(igtest_prob)


igtest_cdf <- test_ignorability_indtrunc(npfit, test_type = "cdf", 
                                         restr_times = c(NA, 80))
igtest_cdf
#> 
#> Ignorable sampling bias test under quasi-independent truncation.
#> 
#>   NPMLE fit based on 295 complete samples (0 dropped due to missingness)
#>     with 71 unique event times.
#>   Min event time: 0
#>   Max event time: 89
#> 
#>   Supremum test of type "cdf".
#>   Test statistic: 0.51
#>     Time range:   [0, 80]
#>     Time weights: none
#>   Critical value: 0.18 at level 0.95
#>   Significant evidence found against ignorability (p<5e-04).
#>   Results based on 2000 simulations from the estimated null distribution.
plot(igtest_cdf)

Both tests show highly significant evidence against ignorable sampling bias in the aids data. Thus this data should be analyzed with methods that directly account for double truncation bias.

To illustrate the utility of plotting the test results, we run the CDF-based test without restricting the time range for the test statistic.

igtest_cdf2 <- test_ignorability_indtrunc(npfit, test_type = "cdf")
igtest_cdf2
#> 
#> Ignorable sampling bias test under quasi-independent truncation.
#> 
#>   NPMLE fit based on 295 complete samples (0 dropped due to missingness)
#>     with 71 unique event times.
#>   Min event time: 0
#>   Max event time: 89
#> 
#>   Supremum test of type "cdf".
#>   Test statistic: 0.51
#>     Time range:   [0, 89]
#>     Time weights: none
#>   Critical value: 0.52 at level 0.95
#>   Significant evidence not found against ignorability (p=0.057).
#>   Results based on 2000 simulations from the estimated null distribution.
plot(igtest_cdf2)

This unrestricted test returns a much larger p-value, and after plotting the result it is evident that this is due to the large variance of the NPMLE cdf at long incubation times. Thus the unrestricted CDF test has low power to detect what is clearly a large deviation from ignorability. This issue can occur when the estimated selection probabilities approach zero within the event time range. The plot below confirms that this is the case for the aids data.

plot(npfit, target = "sel_prob")