---
title: "Nonparametric Analysis of Doubly Truncated Data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Nonparametric Analysis of Doubly Truncated Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 5,
  fig.height = 3,
  fig.align = "center"
)
```

```{r setup}
library(survdt)
```

## Doubly truncated data

Double truncation occurs when a data sample only includes subjects whose event 
time lies within a random truncation interval.
For example, the `survdt` package includes the doubly truncated dataset `aids`,
which holds records of AIDS incubation times from a retrospective sample of 
patients with transfusion-acquired HIV.
The inclusion criteria were that the patient was diagnosed with AIDS between the
discovery of the disease in 1982 and the end of the study in 1986, with time 
measured in months since HIV infection.
This sampling scheme was chosen so that the precise HIV infection time could be 
ascertained retrospectively from a blood transfusion registry.
The event time of interest is the AIDS incubation time `incu`, with the truncation 
times contained in `ltrunc` and `rtrunc` respectively. See `?survdt::aids` for more
details.

The `survdt` package has a special `Survdt` data structure designed to store 
doubly truncated data. Below we create a `Survdt` object using the `aids` dataset
and plot a simple visualization of the sample. See `?survdt::Survdt` for more 
details.

```{r}
y <- Survdt(aids$incu, aids$ltrunc, aids$rtrunc)
head(y)
plot(y)
```

Each row of this plot contains a subject's truncation interval as a gray rectangle
and the subject's event time in black, with rows sorted by event time.

## Nonparametric maximum likelihood estimation

The nonparametric maximum likelihood estimator (NPMLE) of the event time cumulative
distribution function (cdf) is a right-continuous step function with point masses at
the observed event times.
Think of it like an empirical distribution function that is adjusted for double
truncation bias.
It is estimated iteratively via an EM algorithm.
The underlying assumptions for the NPMLE are discussed in 
[Assumptions required for the NPMLE].

The `survdt` package includes the function `npmle()` to compute the NPMLE
and its standard errors.
The first input argument should be a `Survdt` object or expression, which is 
evaluated using the variables found in the `data` argument (if provided).
See `?survdt::npmle` for more details.

Below we fit the NPMLE on the `aids` data and plot both the estimated survival 
function and estimated cumulative hazard function, along with pointwise 95% 
confidence intervals.
Other plot options include the estimated selection (non-truncation) probability 
over time and the estimated inverse probability weights.
See `?survdt::plot.npmle` for more details.
```{r}
npfit <- npmle(Survdt(incu, ltrunc, rtrunc), data = aids, cumhaz = TRUE)
npfit

plot(npfit)
```
```{r}
plot(npfit, target = "cumhaz")
```
These confidence intervals are computed efficiently through closed-form standard
errors, i.e. without relying on bootstrap or other resampling methods, in 
contrast to some other R packages for double truncation.

### Testing for event time differences across multiple groups

The `aids` data has the factor variable `age_gp` which categorizes each subject
based on age at HIV infection.
The three age groups are children (age $\leq 4$), adults (age $4-59$), and 
elderly (age $>59$).
The group sizes are:
```{r}
summary(aids$age_gp)
```

In order to test for differences in incubation times across the three age groups,
the `survdt` package provides two general options.

First, the data can be analyzed through a Cox proportional hazards model using
methods that correct for double truncation bias.
See `vignette("cox-regression-quasiindep", package = "survdt")` and 
`?survdt::coxph_indtrunc` for more details on how to do this with the `survdt`
package.

Second, the data can be analyzed nonparametrically to detect any difference in the 
incubation time distributions without assuming proportional hazards.
The `survdt` package implements this in the function `test_samplediff_indtrunc()`.
The first input argument should be a formula with a `Survdt` response and 
variables defining the groups, i.e. strata, to test across on the right hand side.
The test can be based on the supremum difference between survival functions or 
between cumulative hazards, both estimated by the NPMLE.
The p-value is computed by simulating from the estimated asymptotic null 
distribution of the test statistic.
See `?survdt::test_samplediff_indtrunc` for more details.

Below we test for differences in the AIDS incubation time distribution across 
the three age groups. 
We limit the range of the test to at most 70 months since HIV infection to avoid
the high variability associated with the tail of the distribution (see the plot 
in [Nonparametric maximum likelihood estimation]).

```{r}
agetest <- test_samplediff_indtrunc(Survdt(incu, ltrunc, rtrunc) ~ age_gp, 
                                    data = aids,
                                    restr_times = c(NA, 70))
agetest
```

We find highly significant evidence that the incubation time distribution differs
across age groups.
The test result can be visualized through the following plot, which compares the
group-specific survival function estimates with the unstratified estimates.
Under the null hypothesis of equality across strata, the stratified estimates
should all lie within the 95% uniform confidence band, denoted by dashed lines.

```{r, fig.height = 3.5}
plot(agetest)
```
Plotting the test result is recommended in general, since it provides a better
understanding of the estimated effect size and power of the test.

### Assumptions required for the NPMLE

The NPMLE requires two key conditions. 
First, all possible event times must have a positive probability of being observed. 
This can be determined from the study design, as well as by comparing the observed 
event times to the range of possible event times based on external knowledge.
Second, the truncation times must be independent of the event time within the 
observable data region, commonly known as quasi-independence.
See [Testing for quasi-independent truncation] for more information on how to
test this assumption with the `survdt` package.

### Testing for quasi-independent truncation

The `survdt` package includes the function `test_quasiindep_ctau()` to test for
quasi-independence between the event and truncation times using the conditional 
Kendall's tau rank correlation.
The first input argument should be a `Survdt` object or expression, which is 
evaluated using the variables found in the `data` argument (if provided).
It computes the (event time, left truncation time) and 
(event time, right truncation time) rank correlations (`tau`) accounting for
double truncation, as well as a p-value for testing whether both are zero.
See `?survdt::test_quasiindep_ctau` for more details.

Below we apply this quasi-independence test to the `aids` data.

```{r}
test_quasiindep_ctau(Survdt(incu, ltrunc, rtrunc), data = aids)
```

Based on the small estimated correlations and moderate p-value, we do not find 
strong evidence against quasi-independence in the `aids` data.

### Testing for ignorable sampling bias

A unique feature of doubly truncated data, compared to one-sided truncation, is
that it is possible in some cases for there to be no effective sampling bias 
induced by truncation.
This occurs when the sampling probability is constant across all event times.
If the sampling bias is ignorable, standard methods which do not correct for 
double truncation bias may be used to analyze the data.

The `survdt` package includes the function `test_ignorability_indtrunc` to test for
ignorable sampling bias nonparametrically through the NPMLE.
The first input argument can be a `Survdt` object/expression or the fitted NPMLE
object returned by `npmle()`.
The test can be based on the supremum difference between the time-specific and
overall selection probabilities or the supremum difference between the NPMLE CDF
and the empirical CDF (not adjusted for double truncation).
The p-value is computed by simulating from the estimated asymptotic null 
distribution of the test statistic.
See `?survdt::test_ignorability_indtrunc` for more details.

Below we apply these tests to the `aids` data.
The output from `test_ignorability_indtrunc` has a `plot` method to help visualize
the deviation of the NPMLE from its reference value using a uniform 
confidence band.
For the CDF-based test, we limit the range of the test to at most 80 months since 
HIV infection to avoid the high variability associated with the tail of the 
distribution (see the plot in [Nonparametric maximum likelihood estimation]).

```{r, fig.height=3.5}
igtest_prob <- test_ignorability_indtrunc(npfit, test_type = "sel_prob")
igtest_prob
plot(igtest_prob)

igtest_cdf <- test_ignorability_indtrunc(npfit, test_type = "cdf", 
                                         restr_times = c(NA, 80))
igtest_cdf
plot(igtest_cdf)
```

Both tests show highly significant evidence against ignorable sampling bias in 
the `aids` data.
Thus this data should be analyzed with methods that directly account for double
truncation bias.

To illustrate the utility of plotting the test results, we run the CDF-based
test without restricting the time range for the test statistic.

```{r, fig.height=3.5}
igtest_cdf2 <- test_ignorability_indtrunc(npfit, test_type = "cdf")
igtest_cdf2
plot(igtest_cdf2)
```

This unrestricted test returns a much larger p-value, and after plotting the 
result it is evident that this is due to the large variance of the NPMLE cdf at
long incubation times.
Thus the unrestricted CDF test has low power to detect what is clearly a large 
deviation from ignorability.
This issue can occur when the estimated selection probabilities approach zero 
within the event time range.
The plot below confirms that this is the case for the `aids` data.

```{r}
plot(npfit, target = "sel_prob")
```