---
title: "Species Distribution Response Curves"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Species Distribution Response Curves}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 10.5,
  fig.height = 6.5,
  out.width = "100%",
  fig.align = "center"
)

species_example <- requireNamespace("disdat", quietly = TRUE) &&
  requireNamespace("randomForest", quietly = TRUE)

ensemble_example <- species_example &&
  requireNamespace("mgcv", quietly = TRUE)
```

This vignette shows a compact species distribution workflow using
[`disdat`](https://cran.r-project.org/package=disdat) data, a down-sampled
random forest classifier, and model-agnostic response plots from `curves`.

The example uses one species from New South Wales (`"nsw14"`), combines its
presence records with 10,000 random background points, and then down-samples
the background class during model fitting so each tree sees a balanced sample.

The code below is shown in full. It is evaluated when the optional packages
`disdat` and `randomForest` are installed; the short ensemble example also
requires `mgcv`, and the final 3D surface example additionally requires
`plotly`.

## Prepare the training data

```{r, eval = species_example, message = FALSE, warning = FALSE}
spID <- "nsw14"
pr <- disdat::disPo("NSW")
bg <- disdat::disBg("NSW")

pr <- pr[pr$spid == spID, ]
training <- rbind(pr, bg)

covars <- c(
  "disturb", "mi", "rainann", "raindq",
  "rugged", "soildepth",
  "tempann", "topo", "vegsys"
)

training <- training[, c("occ", covars)]
training$vegsys <- as.factor(training$vegsys)

numeric_covars <- setdiff(covars, "vegsys")
training[numeric_covars] <- lapply(
  training[numeric_covars],
  function(x) as.numeric(scale(x))
)

head(training[, 1:5])
```

The numeric covariates are centred and scaled to keep the example plots on
comparable axes. This is only for readability; `curves` does not require
standardised predictors.
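
As a quick check, each scaled numeric column should now have a mean of
approximately zero and a standard deviation of one, while the factor `vegsys`
is left unchanged.

```{r, eval = species_example}
# means and standard deviations of the scaled numeric covariates
round(sapply(training[numeric_covars], mean), 2)
round(sapply(training[numeric_covars], sd), 2)
```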

## Fit a down-sampled random forest

`randomForest::randomForest()` can balance each bootstrap sample with
`sampsize`. Here the number of background samples drawn for each tree is set to
the number of presences, which is a common way to reduce class imbalance. The
number of trees is deliberately modest so the vignette remains quick to build.

```{r, eval = species_example, message = FALSE, warning = FALSE}
# count the presences before `occ` is converted to a factor
pr_num <- sum(training$occ)
smpsize <- c("0" = pr_num, "1" = pr_num)

training$occ <- as.factor(training$occ)

set.seed(20260501)
rf_model <- randomForest::randomForest(
  occ ~ .,
  data = training,
  ntree = 500,
  sampsize = smpsize,
  replace = TRUE
)
```
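
Printing the out-of-bag confusion matrix is a quick sanity check that the
down-sampled forest separates presences from background; it is not needed for
the response-curve plots themselves.

```{r, eval = species_example}
# out-of-bag confusion matrix and per-class error of the fitted forest
rf_model$confusion
```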

## Plot univariate and bivariate response curves

For binary classification models, `predict(..., type = "prob")` returns one
column per class. In this example the presence class is `"1"`, so the response
plots use `response = "1"` to make that choice explicit. Because the model is
fit with presences and background samples rather than confirmed absences, the
plotted values are interpreted as relative likelihoods, not occurrence
probabilities.
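
The snippet below illustrates that structure: the prediction matrix has one
column per class, named after the factor levels of `occ`, so column `"1"`
holds the values plotted in the response curves.

```{r, eval = species_example}
# class-probability predictions: one column per factor level of `occ`
head(predict(rf_model, newdata = training, type = "prob"))
```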

```{r, eval = species_example, fig.height = 8, message = FALSE, warning = FALSE}
predictors <- training[, covars]

curves::univariate(
  rf_model,
  predictors,
  method = "profile",
  type = "prob",
  response = "1",
  ylab = "Relative likelihood"
)
```

The same interface can be used for partial dependence, ICE, and ALE plots.
`method = "pdp"` averages predictions over rows sampled from the supplied
predictor data, whereas
`method = "ice+pdp"` overlays that average on top of the sampled ICE curves.
Here `n = 100` controls the grid resolution for numeric predictors, while
`background_n` controls how many rows are sampled for PDP/ICE. For
`method = "pdp"`, `interval = "quantile"` can be used to show a central
pointwise band around the averaged curve.

```{r, eval = species_example, fig.height = 8, message = FALSE, warning = FALSE}
set.seed(20260501)
curves::univariate(
  rf_model,
  predictors,
  method = "pdp",
  n = 100,
  background_n = 200,
  interval = "quantile",
  interval_level = 0.8,
  type = "prob",
  response = "1",
  ylab = "Relative likelihood"
)
```

```{r, eval = species_example, fig.height = 8, message = FALSE, warning = FALSE}
set.seed(20260502)
curves::univariate(
  rf_model,
  predictors,
  method = "ice+pdp",
  n = 100,
  background_n = 200,
  type = "prob",
  response = "1",
  ylab = "Relative likelihood"
)
```

ALE curves accumulate local finite differences within the observed predictor
distribution, which often makes them less sensitive than PDPs to strongly
correlated predictors. In `curves`, univariate ALE supports both numeric and
factor predictors; bivariate ALE surfaces currently use numeric predictor
pairs.

```{r, eval = species_example, fig.height = 8, message = FALSE, warning = FALSE}
curves::univariate(
  rf_model,
  predictors,
  method = "ale",
  n = 40,
  type = "prob",
  response = "1",
  ylab = "Relative likelihood"
)
```

The bivariate examples use `mi` and `rainann`, two numeric covariates in the
training data.
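
In the calls below, `pairs = c(2, 3)` selects the second and third columns of
`predictors`; the quick check here confirms that those positions correspond to
`mi` and `rainann`.

```{r, eval = species_example}
# columns 2 and 3 of the predictor data frame passed to bivariate()
names(predictors)[c(2, 3)]
```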

```{r, eval = species_example, fig.width = 5.6, fig.height = 4, message = FALSE, warning = FALSE}
set.seed(20260503)
curves::bivariate(
  rf_model,
  predictors,
  pairs = c(2, 3),
  method = "pdp",
  background_n = 200,
  rug = TRUE,
  plot_type = "heatmap",
  type = "prob",
  response = "1",
  zlab = "Relative likelihood"
)
```

For correlated covariates, a centred bivariate ALE surface is often more
stable than a PDP surface because it accumulates local differences only where
the data are observed.

```{r, eval = species_example, fig.width = 5.6, fig.height = 4, message = FALSE, warning = FALSE}
curves::bivariate(
  rf_model,
  predictors,
  pairs = c(2, 3),
  method = "ale",
  n = 40,
  plot_type = "heatmap",
  type = "prob",
  response = "1",
  zlab = "ALE"
)
```

## Ensemble response curves

Species distribution modelling often uses ensembles of models fit to the same
response and predictors. `multimodel()` computes the chosen response-curve
method for each ensemble member, then combines those curves pointwise. This
short example fits a random forest and a GAM to the same two predictors and
puts both models on the predicted relative likelihood scale before averaging.

```{r, eval = ensemble_example, fig.width = 7, fig.height = 4.5, message = FALSE, warning = FALSE}
ensemble_covars <- c("mi", "rainann")
ensemble_training <- training[, c("occ", ensemble_covars)]

set.seed(20260504)
rf_member <- randomForest::randomForest(
  occ ~ .,
  data = ensemble_training,
  ntree = 300,
  sampsize = smpsize,
  replace = TRUE
)

gam_training <- transform(
  ensemble_training,
  occ_num = as.integer(as.character(occ))
)

gam_member <- mgcv::gam(
  occ_num ~ s(mi, k = 6) + s(rainann, k = 6),
  data = gam_training,
  family = binomial(),
  method = "REML"
)

rf_likelihood <- function(model, newdata) {
  predict(model, newdata, type = "prob")[, "1"]
}

gam_likelihood <- function(model, newdata) {
  as.numeric(stats::predict(model, newdata, type = "response"))
}

set.seed(20260505)
curves::multimodel(
  list(rf_member, gam_member),
  ensemble_training[, ensemble_covars],
  fun = list(rf_likelihood, gam_likelihood),
  method = "pdp",
  n = 60,
  background_n = 200,
  show_models = TRUE,
  ylab = "Relative likelihood"
)
```

For a 3D surface (which requires `plotly`), a small wrapper around
`predict()` is convenient when you want direct control over the returned
prediction vector.

```{r, eval = species_example && requireNamespace("plotly", quietly = TRUE), message = FALSE, warning = FALSE}
# re-defined here so this chunk also runs when the ensemble chunk is skipped
rf_likelihood <- function(model, newdata) {
  predict(model, newdata, type = "prob")[, "1"]
}

curves::bivariate(
  rf_model,
  predictors,
  pairs = c(2, 3),
  plot_type = "surface",
  fun = rf_likelihood,
  zlab = "Relative likelihood"
)
```
