The idionomics Pipeline: Keeping Individual-Level Information to Find Functionally Relevant Principles

The idionomic science principle

Classical longitudinal methods (e.g., multilevel models) estimate one set of parameters shared, or partially shared, across all units of an ensemble. If the focus of interest is the trajectory of individuals, this is only sensible under hard-to-meet assumptions such as exchangeability and ergodicity. When these assumptions fail, ensemble averages can systematically obscure individual differences: an average positive effect may coexist with a substantial subset of individuals for whom the effect is negative, null, or negligible.

Idionomic science inverts the order of operations:

  1. Unit first. Fit a model to each person’s time series independently, capturing that person’s unique dynamics, residual structure, and effect sizes.
  2. Group later. Aggregate the individual estimates with meta-analytic methods that explicitly represent and quantify heterogeneity across people.
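
The contrast is easy to see with plain lm() standing in for the per-subject models (a sketch, not package code; the simulated effect sizes are illustrative):

```r
set.seed(1)
# Unit first: one slope per person; person 10 has a sign-reversed true effect.
slopes <- sapply(1:10, function(i) {
  x <- rnorm(50)
  beta <- if (i == 10) -0.5 else 0.5
  y <- beta * x + rnorm(50)
  unname(coef(lm(y ~ x))["x"])
})
mean(slopes)   # the ensemble average is clearly positive...
min(slopes)    # ...yet one person's estimated effect is clearly negative
```

In the pipeline below, iarimax() plays the "unit first" role with per-subject ARIMAX fits, and metafor::rma() plays the "group later" role.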

The recommended pipeline is:

i_screener()  →  pmstandardize()  →  i_detrender()  →  iarimax()/looping_machine()  →  i_pval() / sden_test()

Synthetic data

This vignette walks through each step with a synthetic panel of 12 subjects, each contributing 50 time points. The data includes four variables (a, b, c, y) with known relationships, plus three manually created subjects that demonstrate specific pipeline behaviors.

library(idionomics)
#> 
#> One Size Fits None!

set.seed(42)
panel <- do.call(rbind, lapply(1:9, function(id) {
  a <- rnorm(50)
  b <- 0.4 * a + rnorm(50)
  c <- 0.4 * b + rnorm(50)
  data.frame(id = as.character(id), time = seq_len(50),
             a = a, b = b, c = c,
             y = 0.5 * a + rnorm(50),
             stringsAsFactors = FALSE)
}))

# Subject 10: near-constant "a" — will be caught by i_screener(min_sd = 0.5)
s10 <- data.frame(id = "10", time = seq_len(50),
                  a = rep(3, 50), b = rnorm(50), c = rnorm(50),
                  y = rnorm(50), stringsAsFactors = FALSE)

# Subject 11: full positive loop (a -> b -> c -> a)
a11 <- rnorm(50)
b11 <- 0.6 * a11 + rnorm(50, sd = 0.5)
c11 <- 0.6 * b11 + rnorm(50, sd = 0.5)
s11 <- data.frame(id = "11", time = seq_len(50),
                  a = a11 + 0.4 * c11, b = b11, c = c11,
                  y = 0.5 * a11 + rnorm(50), stringsAsFactors = FALSE)

# Subject 12: negative a -> y effect
a12 <- rnorm(50)
s12 <- data.frame(id = "12", time = seq_len(50),
                  a = a12, b = 0.4 * a12 + rnorm(50),
                  c = rnorm(50), y = -0.5 * a12 + rnorm(50),
                  stringsAsFactors = FALSE)

panel <- rbind(panel, s10, s11, s12)

The data-generating process:

  - Subjects 1-9: b = 0.4 * a + noise, c = 0.4 * b + noise, y = 0.5 * a + noise (a shared positive a -> y effect).
  - Subject 10: a is constant (SD = 0); b, c, and y are pure noise.
  - Subject 11: a closed positive loop in which a drives b, b drives c, and c feeds back into a.
  - Subject 12: y = -0.5 * a + noise (a sign-reversed a -> y effect).


Step 1: Data quality screening — i_screener()

Before standardizing, screen subjects on their raw data. After pmstandardize(), all non-constant series have within-person variance = 1 by construction, making variance-based filters ineffective. i_screener() catches low-quality subjects at the right stage.
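
A two-line illustration of why the SD filter must run before standardization (a sketch, not package code):

```r
# After within-person z-scoring, every non-constant series has SD exactly 1,
# so an SD-based quality filter can no longer discriminate.
x <- c(3, 3, 3.1, 2.9, 3, 3)       # a near-constant raw series
sd(x)                              # ~0.06: would fail min_sd = 0.5
z <- (x - mean(x)) / sd(x)
sd(z)                              # exactly 1 after standardization
```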

panel_clean <- i_screener(panel, cols = c("a", "b", "c", "y"), id_var = "id",
                          min_n_subject = 20, min_sd = 0.5, verbose = TRUE)
#> i_screener applies per-subject data quality filters.
#>    min_n_subject: subject-column combinations need >= 20 non-NA observations per variable.
#>    min_sd       : subject-column combinations need within-person SD >= 0.5 per variable.
#>    filter_type  : 'joint' - Subjects excluded if any variable fails any criterion.
#>    Subjects before screening : 12
#>    Subjects failing at least one criterion across variables : 1
#>    Subjects passing all criteria on all variables : 11

Subject 10 is removed because a has SD = 0, which falls below min_sd = 0.5.

Three criteria are available (all optional except min_n_subject):

Criterion       What it catches
-------------   --------------------------------------------------------
min_n_subject   Too few observations
min_sd          Near-constant series (floor/ceiling responders)
max_mode_pct    “Stuck” responders (e.g., >= 80% of responses identical)

Use mode = "report" to inspect quality metrics without committing to exclusion:

report <- i_screener(panel, cols = c("a", "b", "c", "y"), id_var = "id",
                     min_n_subject = 20, min_sd = 0.5, max_mode_pct = 0.80,
                     mode = "report", verbose = TRUE)
#> i_screener applies per-subject data quality filters.
#>    min_n_subject: subject-column combinations need >= 20 non-NA observations per variable.
#>    min_sd       : subject-column combinations need within-person SD >= 0.5 per variable.
#>    max_mode_pct : subject-column combinations need <= 80% of responses on the modal value.
#>    filter_type  : 'joint' - Subjects excluded if any variable fails any criterion.
#>    Subjects before screening : 12
#>    Subjects failing at least one criterion across variables : 1
#>    Subjects passing all criteria on all variables : 11
print(report)
#> # A tibble: 12 × 18
#>    id    a_n_valid b_n_valid c_n_valid y_n_valid  a_sd  b_sd  c_sd  y_sd
#>    <chr>     <int>     <int>     <int>     <int> <dbl> <dbl> <dbl> <dbl>
#>  1 1            50        50        50        50 1.15  0.930 1.02  1.11 
#>  2 10           50        50        50        50 0     0.994 0.979 0.934
#>  3 11           50        50        50        50 1.09  0.709 0.752 1.08 
#>  4 12           50        50        50        50 1.05  1.15  1.12  1.09 
#>  5 2            50        50        50        50 0.988 1.18  0.878 1.02 
#>  6 3            50        50        50        50 0.996 1.19  1.13  1.11 
#>  7 4            50        50        50        50 1.05  1.00  1.02  1.09 
#>  8 5            50        50        50        50 1.07  1.10  1.10  1.40 
#>  9 6            50        50        50        50 1.06  1.11  0.933 1.20 
#> 10 7            50        50        50        50 0.995 1.04  1.15  1.03 
#> 11 8            50        50        50        50 0.953 1.04  1.11  1.09 
#> 12 9            50        50        50        50 0.935 1.18  1.03  1.10 
#> # ℹ 9 more variables: a_mode_pct <dbl>, b_mode_pct <dbl>, c_mode_pct <dbl>,
#> #   y_mode_pct <dbl>, a_pass <lgl>, b_pass <lgl>, c_pass <lgl>, y_pass <lgl>,
#> #   pass_overall <lgl>

Step 2: Within-person standardization — pmstandardize()

Person-mean standardization computes (x - person_mean) / person_sd for each subject and column. This removes between-person differences in level and scale, making coefficients comparable across subjects. Output columns are named <col>_psd.
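
The core transform can be sketched in base R with ave(), ignoring the NA and constant-series edge cases that pmstandardize() handles:

```r
# Two subjects measured on very different levels and scales.
df <- data.frame(id = rep(c("1", "2"), each = 4),
                 a  = c(10, 12, 11, 9, 100, 104, 98, 102))
# Within-person z-score: (x - person_mean) / person_sd, computed per id.
df$a_psd <- ave(df$a, df$id, FUN = function(x) (x - mean(x)) / sd(x))
tapply(df$a_psd, df$id, sd)   # both exactly 1: level and scale removed
```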

panel_std <- pmstandardize(panel_clean, cols = c("a", "b", "c", "y"), id_var = "id",
                           verbose = TRUE)
#> This function creates within-person z-scores (person-level standardization).
#>    If all values for a feature within ID are NA, returns NA.
#>    If fewer than 2 non-NA values exist, returns 0 (SD undefined).
#>    If values are constant within person, returns 0 (zero variance).
#>    Otherwise, returns (x - person_mean) / person_sd.
head(panel_std[, c("id", "time", "a_psd", "b_psd", "c_psd", "y_psd")])
#> # A tibble: 6 × 6
#>   id     time   a_psd  b_psd  c_psd   y_psd
#>   <chr> <int>   <dbl>  <dbl>  <dbl>   <dbl>
#> 1 1         1  1.22    0.843  1.64   0.618 
#> 2 1         2 -0.459  -1.18   0.746 -1.61  
#> 3 1         3  0.346   1.76  -0.195  1.25  
#> 4 1         4  0.581   0.871  2.29   0.0759
#> 5 1         5  0.382   0.178 -0.443 -0.202 
#> 6 1         6 -0.0612  0.159  0.311 -1.12

Edge cases are handled automatically: all-NA series remain NA; single-observation series return 0; constant series (zero SD) return 0 and are filtered downstream by iarimax()’s minvar guard.


Step 3: Linear detrending — i_detrender()

i_detrender() removes the within-person linear time trend via lm(col ~ time) and stores the residuals in new columns (<col>_dt). This lowers the differencing order auto.arima() selects, improving comparability between subjects' models.

Filtering is per-column and independent: a subject can produce valid detrended residuals for one variable while receiving NA in another if that column fails the variance or observation-count thresholds.
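
The core operation is just per-person lm() residuals; a minimal sketch for a single series:

```r
set.seed(3)
tm <- 1:50
x  <- 0.05 * tm + rnorm(50)      # series with a linear upward drift
x_dt <- residuals(lm(x ~ tm))    # detrended version of x
coef(lm(x_dt ~ tm))["tm"]        # ~0: the linear trend is gone
```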

panel_dt <- i_detrender(panel_std, cols = c("a_psd", "b_psd", "c_psd", "y_psd"),
                        id_var = "id", timevar = "time", verbose = TRUE)
#> i_detrender removes the within-person linear time trend per variable.
#>    If all values for a column within ID are NA: returns NA.
#>    If fewer than 20 non-NA observations: returns NA.
#>    If pre-detrend or post-detrend variance < 0.01: returns NA.
#>    Otherwise: returns residuals from lm(col ~ timevar) per person.
head(panel_dt[, c("id", "time", "a_psd_dt", "b_psd_dt", "c_psd_dt", "y_psd_dt")])
#> # A tibble: 6 × 6
#>   id     time a_psd_dt b_psd_dt c_psd_dt y_psd_dt
#>   <chr> <int>    <dbl>    <dbl>    <dbl>    <dbl>
#> 1 1         1   0.883    0.512     1.41    0.587 
#> 2 1         2  -0.785   -1.50      0.523  -1.64  
#> 3 1         3   0.0349   1.45     -0.409   1.22  
#> 4 1         4   0.283    0.580     2.08    0.0493
#> 5 1         5   0.0983  -0.0999   -0.637  -0.227 
#> 6 1         6  -0.331   -0.105     0.126  -1.15

Pipeline order matters: detrend after standardizing. Reversing the order (i_detrender() then pmstandardize()) can amplify near-zero residuals to unit variance, bypassing iarimax()'s minvar filter.
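
The failure mode of the reversed order can be shown in miniature (a sketch; assumes a series that is almost pure trend):

```r
set.seed(7)
tm <- 1:50
x  <- 2 * tm + rnorm(50, sd = 0.01)   # almost pure linear trend
r  <- residuals(lm(x ~ tm))
sd(r)                                 # ~0.01: essentially no signal left
sd((r - mean(r)) / sd(r))             # 1: standardizing re-inflates the noise
```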


Step 4a: Per-subject ARIMAX and meta-analysis — iarimax()

iarimax() fits one forecast::auto.arima() model per subject, extracts coefficients via broom::tidy(), and pools the focal predictor’s coefficients with metafor::rma() (REML).

result <- iarimax(panel_dt,
                  y_series  = "y_psd_dt",
                  x_series  = "a_psd_dt",
                  id_var    = "id",
                  timevar   = "time",
                  verbose   = TRUE)
#> Filtering data based on minimum non-NA observations and variance...
#> Filtering done: 11 subjects will be used for the analyses.
#> Running I-ARIMAX algorithm...
#>   Applying ARIMAX (auto d) to case: 1 ... 9.1% done
#>   Applying ARIMAX (auto d) to case: 11 ... 18.2% done
#>   Applying ARIMAX (auto d) to case: 12 ... 27.3% done
#>   Applying ARIMAX (auto d) to case: 2 ... 36.4% done
#>   Applying ARIMAX (auto d) to case: 3 ... 45.5% done
#>   Applying ARIMAX (auto d) to case: 4 ... 54.5% done
#>   Applying ARIMAX (auto d) to case: 5 ... 63.6% done
#>   Applying ARIMAX (auto d) to case: 6 ... 72.7% done
#>   Applying ARIMAX (auto d) to case: 7 ... 81.8% done
#>   Applying ARIMAX (auto d) to case: 8 ... 90.9% done
#>   Applying ARIMAX (auto d) to case: 9 ... 100% done
#> I-ARIMAX algorithm finished.

By default, auto.arima() selects the differencing order d independently per subject. When d varies across subjects, some coefficients describe effects on levels (d = 0) while others describe effects on changes (d = 1), making them less comparable in the meta-analysis. Use fixed_d to impose a common differencing order:

# Fix d = 0 for all subjects (coefficients describe effects on levels)
result_d0 <- iarimax(panel_dt, y_series = "y_psd_dt", x_series = "a_psd_dt",
                     id_var = "id", timevar = "time", fixed_d = 0)

Summary

summary(result)
#> 
#>               IARIMAX Results Summary
#>            ----------------------------
#>    Features:
#>      Outcome          : y_psd_dt 
#>      Focal predictor  : a_psd_dt 
#>      ID variable      : id 
#> 
#>    Number of subjects:
#>      Original Dataframe (iarimax input): 11 
#>      Excl. first filter (var or n).    : 0 
#>      Skipped (auto.arima failed)       : 0 
#>      Analyzed cases                    : 11 
#> 
#>    Per-subject effects  (alpha =0.05):
#>      Positive            : 10 
#>      Significant positive: 10 
#>      Negative            : 1 
#>      Significant negative: 1 
#> 
#>    Random-Effects Meta-Analysis (REML)
#>      beta = 0.3801   SE = 0.0961   z = 3.9561   p = 1e-04 
#>      95% Confidence Interval: [0.1918, 0.5684]
#>      95% Prediction Interval: [-0.2236, 0.9838]
#> 
#>    Heterogeneity
#>      tau2 = 0.0856   I2 = 84.54 %   Q(df = 10 ) = 65.5311   p = 0

Caterpillar plot

plot(result,
     y_series_name = "Outcome (y)",
     x_series_name = "Predictor (a)")

Each bar is a per-subject confidence interval, colored green (significantly positive), red (significantly negative), or black (CI crosses zero). The blue band is the 95% CI of the pooled effect from the random-effects meta-analysis (REMA). Subject 12 (negative a -> y effect) should appear on the negative side.

Per-subject results

The full individual-level output is in $results_df:

head(result$results_df[, c("id", "nAR", "nI", "nMA",
                           "estimate_a_psd_dt", "std.error_a_psd_dt",
                           "n_valid", "n_params", "raw_cor")])
#> # A tibble: 6 × 9
#>   id      nAR    nI   nMA estimate_a_psd_dt std.error_a_psd_dt n_valid n_params
#>   <chr> <int> <int> <int>             <dbl>              <dbl>   <int>    <int>
#> 1 1         0     0     0             0.634              0.113      50        1
#> 2 11        0     0     0             0.438              0.129      50        1
#> 3 12        0     0     0            -0.531              0.126      50        1
#> 4 2         0     0     0             0.345              0.129      50        1
#> 5 3         0     0     3             0.448              0.143      50        4
#> 6 4         0     0     0             0.570              0.119      50        1
#> # ℹ 1 more variable: raw_cor <dbl>

Step 4b: Directed loop detection — looping_machine()

looping_machine() tests whether three variables form a directed positive loop (a -> b -> c -> a). It fits three I-ARIMAX legs, applies i_pval() to each, and computes Loop_positive_directed: a 0/1 indicator that is 1 only when all three focal betas are positive and significant at alpha.

loop_result <- looping_machine(panel_dt,
                               a_series = "a_psd_dt",
                               b_series = "b_psd_dt",
                               c_series = "c_psd_dt",
                               id_var   = "id",
                               timevar  = "time",
                               verbose  = TRUE)
#> Calculating a to b: From a_psd_dt to b_psd_dt...
#> Filtering data based on minimum non-NA observations and variance...
#> Filtering done: 11 subjects will be used for the analyses.
#> Running I-ARIMAX algorithm...
#>   Applying ARIMAX (auto d) to case: 1 ... 9.1% done
#>   Applying ARIMAX (auto d) to case: 11 ... 18.2% done
#>   Applying ARIMAX (auto d) to case: 12 ... 27.3% done
#>   Applying ARIMAX (auto d) to case: 2 ... 36.4% done
#>   Applying ARIMAX (auto d) to case: 3 ... 45.5% done
#>   Applying ARIMAX (auto d) to case: 4 ... 54.5% done
#>   Applying ARIMAX (auto d) to case: 5 ... 63.6% done
#>   Applying ARIMAX (auto d) to case: 6 ... 72.7% done
#>   Applying ARIMAX (auto d) to case: 7 ... 81.8% done
#>   Applying ARIMAX (auto d) to case: 8 ... 90.9% done
#>   Applying ARIMAX (auto d) to case: 9 ... 100% done
#> I-ARIMAX algorithm finished.
#> Calculating b to c: From b_psd_dt to c_psd_dt...
#> Filtering data based on minimum non-NA observations and variance...
#> Filtering done: 11 subjects will be used for the analyses.
#> Running I-ARIMAX algorithm...
#>   Applying ARIMAX (auto d) to case: 1 ... 9.1% done
#>   Applying ARIMAX (auto d) to case: 11 ... 18.2% done
#>   Applying ARIMAX (auto d) to case: 12 ... 27.3% done
#>   Applying ARIMAX (auto d) to case: 2 ... 36.4% done
#>   Applying ARIMAX (auto d) to case: 3 ... 45.5% done
#>   Applying ARIMAX (auto d) to case: 4 ... 54.5% done
#>   Applying ARIMAX (auto d) to case: 5 ... 63.6% done
#>   Applying ARIMAX (auto d) to case: 6 ... 72.7% done
#>   Applying ARIMAX (auto d) to case: 7 ... 81.8% done
#>   Applying ARIMAX (auto d) to case: 8 ... 90.9% done
#>   Applying ARIMAX (auto d) to case: 9 ... 100% done
#> I-ARIMAX algorithm finished.
#> Calculating c to a: From c_psd_dt to a_psd_dt...
#> Filtering data based on minimum non-NA observations and variance...
#> Filtering done: 11 subjects will be used for the analyses.
#> Running I-ARIMAX algorithm...
#>   Applying ARIMAX (auto d) to case: 1 ... 9.1% done
#>   Applying ARIMAX (auto d) to case: 11 ... 18.2% done
#>   Applying ARIMAX (auto d) to case: 12 ... 27.3% done
#>   Applying ARIMAX (auto d) to case: 2 ... 36.4% done
#>   Applying ARIMAX (auto d) to case: 3 ... 45.5% done
#>   Applying ARIMAX (auto d) to case: 4 ... 54.5% done
#>   Applying ARIMAX (auto d) to case: 5 ... 63.6% done
#>   Applying ARIMAX (auto d) to case: 6 ... 72.7% done
#>   Applying ARIMAX (auto d) to case: 7 ... 81.8% done
#>   Applying ARIMAX (auto d) to case: 8 ... 90.9% done
#>   Applying ARIMAX (auto d) to case: 9 ... 100% done
#> I-ARIMAX algorithm finished.
#> Looping Machine finished.
#> Number of cases with the positive loop present: 1
# Proportion of subjects with a detected positive directed loop
mean(loop_result$loop_df$Loop_positive_directed, na.rm = TRUE)
#> [1] 0.09090909

# Per-leg I-ARIMAX results are also accessible
summary(loop_result$iarimax_a_to_b)
#> 
#>               IARIMAX Results Summary
#>            ----------------------------
#>    Features:
#>      Outcome          : b_psd_dt 
#>      Focal predictor  : a_psd_dt 
#>      ID variable      : id 
#> 
#>    Number of subjects:
#>      Original Dataframe (iarimax input): 11 
#>      Excl. first filter (var or n).    : 0 
#>      Skipped (auto.arima failed)       : 0 
#>      Analyzed cases                    : 11 
#> 
#>    Per-subject effects  (alpha =0.05):
#>      Positive            : 11 
#>      Significant positive: 10 
#>      Negative            : 0 
#>      Significant negative: 0 
#> 
#>    Random-Effects Meta-Analysis (REML)
#>      beta = 0.4695   SE = 0.0542   z = 8.6541   p = 0 
#>      95% Confidence Interval: [0.3631, 0.5758]
#>      95% Prediction Interval: [0.1854, 0.7535]
#> 
#>    Heterogeneity
#>      tau2 = 0.0181   I2 = 57.56 %   Q(df = 10 ) = 27.0741   p = 0.0025

Because the positive directed loop is a conjunction of three one-sided tests, it is very conservative: under the global null and independence across legs, the false positive rate is approximately (alpha/2)^3.
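
With the default alpha = 0.05, this bound is tiny:

```r
alpha <- 0.05
(alpha / 2)^3   # approximate loop-wise false positive rate under the null
#> [1] 1.5625e-05
```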

Optional arguments include covariates (shared extra predictors for all three legs), include_third_as_covariate (adds the third loop variable as a covariate in each leg), and fixed_d (fixes the differencing order across all legs for comparability).


Step 5: Per-subject p-values — i_pval()

i_pval() attaches a pval_<feature> column to results_df using the two-tailed t-distribution with degrees of freedom n_valid - n_params.

result_pval <- i_pval(result)
result_pval$results_df[, c("id", "estimate_a_psd_dt", "pval_a_psd_dt")]
#> # A tibble: 11 × 3
#>    id    estimate_a_psd_dt pval_a_psd_dt
#>    <chr>             <dbl>         <dbl>
#>  1 1                 0.634   0.000000928
#>  2 11                0.438   0.00136    
#>  3 12               -0.531   0.000106   
#>  4 2                 0.345   0.0103     
#>  5 3                 0.448   0.00295    
#>  6 4                 0.570   0.0000153  
#>  7 5                 0.443   0.00111    
#>  8 6                 0.448   0.000802   
#>  9 7                 0.618   0.00000115 
#> 10 8                 0.387   0.00488    
#> 11 9                 0.364   0.00762
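
The formula can be checked by hand against subject 1's row in results_df (estimate 0.634, SE 0.113, 50 valid observations, 1 estimated parameter); a sketch:

```r
est <- 0.634; se <- 0.113
df <- 50 - 1                                  # n_valid - n_params
2 * pt(abs(est / se), df = df, lower.tail = FALSE)
```

This matches subject 1's pval_a_psd_dt above, up to rounding of the displayed estimate and SE.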

Step 6: Sign Divergence / Equisyncratic Null test — sden_test()

sden_test() runs a binomial test on the count of individually significant effects. Two variants are available, selected automatically based on the REMA pooled effect:

  1. SDT (Sign Divergence Test): used when the pooled effect is statistically significant; it tests whether the proportion of significant effects opposing the pooled sign exceeds alpha/2.
  2. ENT (Equisyncratic Null Test): used when the pooled effect is not significant; it tests whether the proportion of all significant effects (both sides) exceeds alpha.

The auto-selection pivot is fixed at p = 0.05, independent of alpha_arimax.

sden <- sden_test(result_pval)
#> Running Sign Divergence Test (SDT): number of negative cases (counter positive pooled-effect) values are being evaluated
summary(sden)
#> 
#>                   SDEN Test Results
#>                -----------------------
#>    Features:
#>      Outcome          : y_psd_dt 
#>      Focal predictor  : a_psd_dt 
#>      ID variable      : id 
#> 
#>    Test          : SDT counter-positive (Sign Divergence Test) 
#>    Test selection: Automatic 
#> 
#>    Explanation:
#>   -------------
#>    Automatic selection based on:
#>    The pooled effect (0.3801) was positive and statistically significant (p = 1e-04).
#> 
#>    Testing whether the proportion of negative significant effects
#>     is greater than 0.025 
#> 
#>    Results:
#>   ---------
#>    Context - REMA pooled effect (p-val): 0.38 ( 1e-04 )
#> 
#>    Total number of significant effects (both sides): 11 
#>    Number of significant positive cases            : 10 
#>    Number of significant negative cases            : 1 
#>    Number of valid cases                           : 11 
#> 
#>    Given the test, relevant significant effects are just negatives : 1 
#>    SDT p-value = 0.2430786
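
The SDT p-value above can be reproduced with a plain binomial test (a sketch, assuming independence across subjects): 1 significant negative case among 11 valid cases, against a one-sided null rate of alpha/2 = 0.025.

```r
binom.test(x = 1, n = 11, p = 0.025, alternative = "greater")$p.value
#> [1] 0.2430786
```

The ENT p-value is the analogous one-sided test of all significant cases (both signs) against a null rate of alpha = 0.05.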

You can also force a specific test:

sden_ent <- sden_test(result, test = "ENT")
#> Running Equisyncratic Null Test (ENT)
#> The p-value of the pooled effect is statistically significant at 0.05. p = 1e-04. SDT test is suggested.
summary(sden_ent)
#> 
#>                   SDEN Test Results
#>                -----------------------
#>    Features:
#>      Outcome          : y_psd_dt 
#>      Focal predictor  : a_psd_dt 
#>      ID variable      : id 
#> 
#>    Test          : ENT (Equisyncratic Null Test) 
#>    Test selection: Manual 
#> 
#>    Explanation:
#>   -------------
#>    You manually selected the test.
#> 
#>    Testing whether the proportion of all significant effects (both sides)
#>     is greater than 0.05 
#> 
#>    Results:
#>   ---------
#>    Context - REMA pooled effect (p-val): 0.38 ( 1e-04 )
#> 
#>    Total number of significant effects (both sides): 11 
#>    Number of significant positive cases            : 10 
#>    Number of significant negative cases            : 1 
#>    Number of valid cases                           : 11 
#> 
#>    Given the test, relevant significant effects are all (both sides) : 11 
#>    ENT p-value = 0