Pipeline audit trails and data diagnostics for tidyverse workflows.
tidyaudit captures metadata-only snapshots at each step of a dplyr pipeline, building a structured audit report without storing the data itself. Operation-aware taps enrich snapshots with join match rates, filter drop statistics, and more. The package combines diagnostic tools for interactive development and production-oriented tools for data quality.
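To make "metadata-only" concrete, here is a minimal base R sketch of the kind of summary a snapshot could hold. The helper `snapshot_meta()` is hypothetical and is not part of tidyaudit; the package's actual snapshot structure may differ.

```r
# Hypothetical helper (not a tidyaudit function): summarise a data frame
# without keeping a copy of its rows.
snapshot_meta <- function(df, label) {
  list(
    label = label,
    rows  = nrow(df),
    cols  = ncol(df),
    nas   = sum(is.na(df)),  # total missing cells across all columns
    types = vapply(df, function(col) class(col)[1], character(1))
  )
}

df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))
snap <- snapshot_meta(df, "raw")
str(snap)
```

Only counts and column types are retained, so the summary stays small regardless of how many rows pass through the pipeline.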
# Install the CRAN version
install.packages("tidyaudit")
# Install the development version using `pak`
pak::pak("fpcordeiro/tidyaudit")
library(tidyaudit)
library(dplyr)
set.seed(123)
orders <- data.frame(id = 1:100, amount = runif(100, 10, 500), region_id = sample(1:5, 100, TRUE))
regions <- data.frame(region_id = 1:4, name = c("North", "South", "East", "West"))
trail <- audit_trail("order_pipeline")
result <- orders |>
  audit_tap(trail, "raw") |>
  left_join_tap(regions, by = "region_id", .trail = trail, .label = "with_region") |>
  filter_tap(amount > 100, .trail = trail, .label = "high_value", .stat = amount)
#> ℹ filter_tap: amount > 100
#> Dropped 18 of 100 rows (18.0%)
#> Stat amount: dropped 1,062.191 of 25,429.39
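The drop figures printed by filter_tap() can be reproduced by hand with plain dplyr. The sketch below uses a small made-up table and assumes the "Stat" line reports the sum of the `.stat` column over the dropped rows; that reading is consistent with the numbers shown but is an assumption, not a statement about tidyaudit's internals.

```r
library(dplyr)

# Toy data, invented for illustration only
orders_small <- data.frame(id = 1:5, amount = c(50, 150, 80, 300, 420))

kept <- filter(orders_small, amount > 100)

# Compare before and after instead of negating the condition, so rows
# where the condition evaluates to NA would also count as dropped.
n_dropped    <- nrow(orders_small) - nrow(kept)
pct_dropped  <- 100 * n_dropped / nrow(orders_small)
stat_total   <- sum(orders_small$amount)
stat_dropped <- stat_total - sum(kept$amount)

cat(sprintf("Dropped %d of %d rows (%.1f%%)\n",
            n_dropped, nrow(orders_small), pct_dropped))
cat(sprintf("Stat amount: dropped %s of %s\n",
            format(stat_dropped), format(stat_total)))
```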
print(trail)
#> ── Audit Trail: "order_pipeline" ─────────────────────────────────────────────────────────────────────
#> Created: 2026-02-21 14:36:35
#> Snapshots: 3
#>
#>   #  Label        Rows  Cols  NAs  Type
#>   ─  ───────────  ────  ────  ───  ────────────────────────────────────
#>   1  raw           100     3    0  tap
#>   2  with_region   100     4   23  left_join (many-to-one, 77% matched)
#>   3  high_value     82     4   20  filter (dropped 18 rows, 18%)
#>
#> Changes:
#> raw → with_region: = rows, +1 cols, +23 NAs
#> with_region → high_value: -18 rows, = cols, -3 NAs
audit_diff(trail, "raw", "high_value")
#> ── Audit Diff: "raw" → "high_value" ──
#>
#>   Metric  Before  After  Delta
#>   ──────  ──────  ─────  ─────
#>   Rows       100     82    -18
#>   Cols         3      4     +1
#>   NAs          0     20    +20
#>
#> ✔ Columns added: name
#>
#> Numeric shifts (common columns):
#>   Column     Mean before  Mean after   Shift
#>   ─────────  ───────────  ──────────  ──────
#>   id               50.50       49.66   -0.84
#>   amount          254.29      297.16  +42.87
#>   region_id         3.08        3.05   -0.03

Audit trail system (the core innovation):
- audit_trail() / audit_tap(): build snapshot timelines inside pipes
- left_join_tap(), filter_tap(), and friends: operation-aware taps with enriched diagnostics (match rates, drop statistics)
- audit_diff(): detailed before/after comparison of any two snapshots
- audit_report(): full pipeline report in one call

Diagnostic functions (tidyverse ports from dtaudit):
- validate_join(): analyze joins without performing them
- validate_primary_keys() / validate_var_relationship(): key validation
- compare_tables(): column, row, and numeric comparison
- filter_keep() / filter_drop(): filter with diagnostic output
- diagnose_nas() / summarize_column() / get_summary_table(): data quality
- diagnose_strings() / audit_transform(): string quality auditing and transformation

See vignette("tidyaudit") for the audit trail walkthrough and vignette("diagnostics") for the diagnostic functions guide.
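As a rough illustration of what analyzing a join without performing it can involve, the sketch below computes a match rate and checks key uniqueness using only dplyr's filtering joins. It does not call validate_join() and is not the package's implementation; the toy tables and variable names are invented for the example.

```r
library(dplyr)

# Toy data, invented for illustration only
orders_small  <- data.frame(id = 1:5, region_id = c(1L, 2L, 2L, 5L, 4L))
regions_small <- data.frame(region_id = 1:4,
                            name = c("North", "South", "East", "West"))

# Filtering joins keep or drop rows of the left table without adding
# columns, so the full joined table is never built.
matched   <- semi_join(orders_small, regions_small, by = "region_id")
unmatched <- anti_join(orders_small, regions_small, by = "region_id")
match_rate <- nrow(matched) / nrow(orders_small)

# Keys that repeat on the right-hand side would multiply rows in a left join.
dup_keys <- regions_small |>
  count(region_id) |>
  filter(n > 1)

cat(sprintf("Matched %d of %d rows (%.0f%%)\n",
            nrow(matched), nrow(orders_small), 100 * match_rate))
cat(sprintf("Duplicated right-hand keys: %d\n", nrow(dup_keys)))
```

Checking the key side first is cheap and tells you whether the join can change the row count before you commit to it.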
tidyaudit is a tidyverse-native sibling to dtaudit (a data.table-based package on CRAN). The two packages share design vocabulary and S3 class naming conventions but no code or dependencies.
LGPL (>= 3)