Scheduling featdelta runs

Purpose of this tutorial

The earlier vignettes showed how to define features and run the database pipeline with fd_run(). In practice, you usually do not want to open RStudio and run that command manually every day. You want a script that can be executed by the operating system.

This tutorial shows how to set up that script and schedule it:

  1. create a small project folder;
  2. save an R script that connects to the database and calls fd_run();
  3. test the script manually with Rscript;
  4. schedule the script with Windows Task Scheduler or cron;
  5. inspect a log file after each run.

This vignette is written as step-by-step instructions rather than as executed package code. The database connection uses placeholder PostgreSQL credentials. You should adapt the connection details, SQL query, feature definitions, and table names to your own project.

The example is not meant to be copied without changes. It assumes a specific database, a specific raw table, and a specific set of columns. Your own script will differ in those places.

Before you start

This tutorial assumes that the previous vignettes have already explained what fd_run() does. Here we focus only on scheduling.

You need:

  1. R installed on the machine that will run the job;
  2. the featdelta package installed in that R library;
  3. the DBI driver for your database, such as RPostgres or RMariaDB;
  4. permission to write to the target feature table;
  5. permission to create scheduled tasks on the machine.
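A quick way to confirm points 2 and 3 is to run a short check in the R installation that the scheduler will use. The package names below match this tutorial's PostgreSQL example; substitute your own driver:

# TRUE only when both packages are installed in this R library.
requireNamespace("featdelta", quietly = TRUE) &&
  requireNamespace("RPostgres", quietly = TRUE)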

Operating-system schedulers are normally minute-based. Windows Task Scheduler and cron are good for running jobs every minute, hourly, daily, or on a business schedule. They are not meant for second-by-second demonstrations, so this tutorial uses minute-level schedules.

Create a project folder

Create a small folder for the scheduled pipeline. For example:

featdelta_sensor_pipeline/
  scripts/
    run_featdelta.R
  logs/

On Windows, this could be:

C:/featdelta_sensor_pipeline/

On Linux or macOS, this could be:

/home/analyst/featdelta_sensor_pipeline/

The important point is that the scheduled script should use stable, absolute paths. A scheduled job may not start in the same working directory as your interactive R session.
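For example, compare a relative log path with an absolute one; a sketch, assuming the Windows folder above:

# Fragile: resolved against whatever working directory the scheduler uses.
log_path <- "logs/featdelta-run.log"

# Robust: always resolves to the same file in the project folder.
log_path <- "C:/featdelta_sensor_pipeline/logs/featdelta-run.log"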

Create the R script

Create this file:

C:/featdelta_sensor_pipeline/scripts/run_featdelta.R

On Linux or macOS, use the equivalent path in your project folder, for example:

/home/analyst/featdelta_sensor_pipeline/scripts/run_featdelta.R

Save the following script as run_featdelta.R.

Before looking at the script, here is the scenario it assumes: a raw table raw_schema.device_readings holds sensor readings keyed by reading_id, with the columns device_id, temperature, vibration, pressure, and runtime_hours. Only rows with reading_status = 'ready' should be processed.

For this example, the features are:

  1. temp_above_80, TRUE when temperature exceeds 80;
  2. vibration_score, vibration multiplied by runtime_hours;
  3. pressure_high, TRUE when pressure is at least 38;
  4. maintenance_flag, TRUE when temp_above_80 is TRUE, vibration_score exceeds 80, or pressure_high is TRUE.

In your own project, replace these with the transformations that make sense for your raw table.

library(DBI)
library(RPostgres)
library(featdelta)

# Use an absolute project directory so the script works from a scheduler.
project_dir <- "C:/featdelta_sensor_pipeline"
log_path <- file.path(project_dir, "logs", "featdelta-run.log")

log_line <- function(...) {
  cat(
    format(Sys.time(), "%Y-%m-%d %H:%M:%S"),
    " | ",
    paste0(..., collapse = ""),
    "\n",
    file = log_path,
    append = TRUE,
    sep = ""
  )
}

con <- dbConnect(
  RPostgres::Postgres(),
  # Replace these connection settings with your own database details.
  host = "database.company.local",
  port = 5432,
  dbname = "analytics",
  user = Sys.getenv("ANALYTICS_DB_USER"),
  password = Sys.getenv("ANALYTICS_DB_PASSWORD")
)

# Replace this query with the raw dataset you want to process.
# The query must return the key column and every raw column used below.
source_sql <- "
  SELECT
    reading_id,
    device_id,
    temperature,
    vibration,
    pressure,
    runtime_hours
  FROM raw_schema.device_readings
  WHERE reading_status = 'ready'
  ORDER BY reading_id
"

# Replace these definitions with your own feature logic.
defs <- fd_define(
  temp_above_80 = temperature > 80,
  vibration_score = vibration * runtime_hours,
  pressure_high = pressure >= 38,
  maintenance_flag = temp_above_80 | vibration_score > 80 | pressure_high
)

report <- fd_run(
  con = con,
  sql = source_sql,
  defs = defs,
  # Replace the key and feature table with your project-specific values.
  key = "reading_id",
  feat_table_name = "feature_schema.reading_features",
  fail_fast = FALSE,
  verbose = FALSE
)

would_insert <- if (is.null(report$upsert)) {
  NA_integer_
} else {
  report$upsert$counts$would_insert
}

would_update <- if (is.null(report$upsert)) {
  NA_integer_
} else {
  report$upsert$counts$would_update
}

log_line(
  "success=", report$success,
  ", stage=", report$stage,
  ", fetched=", report$fetch$n_rows,
  ", inserts=", would_insert,
  ", updates=", would_update
)

# Close the connection explicitly; on.exit() is only reliable inside
# functions, so a top-level script should disconnect itself.
if (!isTRUE(report$success)) {
  log_line("error=", report$error$message)
  dbDisconnect(con)
  stop(report$error$message)
}

dbDisconnect(con)

This script does four things:

  1. opens a database connection;
  2. defines the source SQL query;
  3. defines the R feature expressions;
  4. calls fd_run() and writes a compact log line.

The parts you normally change are:

  1. project_dir, so logs are written to your project folder;
  2. dbConnect(), so the script reaches your database;
  3. source_sql, so the script selects your raw data;
  4. defs, so the script computes your features;
  5. key, so featdelta knows the unique row identifier;
  6. feat_table_name, so features are written to the correct database table;
  7. the scheduler frequency, so the job runs when your raw data is ready.

For MySQL or MariaDB, the connection part would look similar, but with RMariaDB:

library(DBI)
library(RMariaDB)

con <- dbConnect(
  RMariaDB::MariaDB(),
  host = "database.company.local",
  port = 3306,
  dbname = "analytics",
  user = Sys.getenv("ANALYTICS_DB_USER"),
  password = Sys.getenv("ANALYTICS_DB_PASSWORD")
)

If automatic dialect detection is not sufficient in your environment, pass the dialect explicitly:

report <- fd_run(
  con = con,
  sql = source_sql,
  defs = defs,
  key = "reading_id",
  feat_table_name = "feature_schema.reading_features",
  dialect = "postgres",
  fail_fast = FALSE,
  verbose = FALSE
)

For MySQL or MariaDB, use dialect = "mysql".

Store credentials outside the script

The example reads credentials from environment variables:

Sys.getenv("ANALYTICS_DB_USER")
Sys.getenv("ANALYTICS_DB_PASSWORD")

This is safer than writing passwords directly into the R file. The exact way to set environment variables depends on your operating system and security setup.

For local testing, you can set them temporarily inside an R session:

Sys.setenv(ANALYTICS_DB_USER = "analyst")
Sys.setenv(ANALYTICS_DB_PASSWORD = "secret")

For production, use your organization’s preferred method, such as user-level environment variables, a secrets manager, or a protected configuration file.
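One common choice is a user-level .Renviron file, which R (including Rscript) reads at startup unless started with --vanilla. For example, a ~/.Renviron for the account that runs the job could contain:

ANALYTICS_DB_USER=analyst
ANALYTICS_DB_PASSWORD=secret

Both values then become visible to Sys.getenv() in the scheduled script. Restrict the file's permissions so only that account can read it.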

Test the script manually

Before scheduling anything, run the script manually from a terminal.

On Windows PowerShell:

& "C:\Program Files\R\R-4.5.2\bin\Rscript.exe" "C:\featdelta_sensor_pipeline\scripts\run_featdelta.R"

On Linux or macOS:

Rscript /home/analyst/featdelta_sensor_pipeline/scripts/run_featdelta.R

Then inspect the log file:

C:/featdelta_sensor_pipeline/logs/featdelta-run.log

A successful line might look like this:

2026-05-05 09:00:04 | success=TRUE, stage=complete, fetched=12, inserts=12, updates=0

If the script fails manually, fix that first. A scheduler will not make an unreliable script more reliable. The script must be able to run from a plain terminal command before you schedule it.

Windows: schedule with taskscheduleR

On Windows, you can use the taskscheduleR package to create a Windows Task Scheduler job from R.

Install it if needed:

install.packages("taskscheduleR")

Then run this once from an interactive R session:

library(taskscheduleR)

taskscheduler_create(
  taskname = "featdelta_sensor_pipeline",
  rscript = "C:/featdelta_sensor_pipeline/scripts/run_featdelta.R",
  schedule = "MINUTE",
  starttime = format(Sys.time() + 70, "%H:%M"),
  modifier = 1
)

This creates a task that runs the script every minute. That is a useful testing frequency because you can watch the log file and database table change shortly after creating the task.

In this demonstration, “every minute” is only a testing schedule. A real production schedule should match the arrival pattern of your raw data. For example, if new sensor readings are loaded every hour, schedule the job after that hourly load. If a warehouse table is refreshed once per night, schedule fd_run() after the nightly refresh.
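For example, assuming the same task name and script path as above, a daily run at 02:30 could be created with:

taskscheduler_create(
  taskname = "featdelta_sensor_pipeline",
  rscript = "C:/featdelta_sensor_pipeline/scripts/run_featdelta.R",
  schedule = "DAILY",
  starttime = "02:30"
)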

To list scheduled tasks created through taskscheduleR:

taskscheduler_ls()

To remove the test task:

taskscheduler_delete("featdelta_sensor_pipeline")

For production, use a less aggressive schedule, such as hourly, daily, or after the expected raw-data load has finished.

Linux or macOS: schedule with cronR

On Linux or macOS, you can use the cronR package to create a cron job.

Install it if needed:

install.packages("cronR")

Then run this once from an interactive R session:

library(cronR)

cmd <- cron_rscript(
  rscript = "/home/analyst/featdelta_sensor_pipeline/scripts/run_featdelta.R",
  rscript_log = "/home/analyst/featdelta_sensor_pipeline/logs/featdelta-cron.log",
  log_append = TRUE
)

cron_add(
  command = cmd,
  frequency = "minutely",
  id = "featdelta_sensor_pipeline",
  description = "Run featdelta sensor feature pipeline"
)

This creates a cron job that runs every minute. During testing, inspect:

/home/analyst/featdelta_sensor_pipeline/logs/featdelta-run.log
/home/analyst/featdelta_sensor_pipeline/logs/featdelta-cron.log

To list cron jobs:

cron_ls()

To remove the test job:

cron_rm(id = "featdelta_sensor_pipeline")

For production, use a schedule that matches your data refresh cycle. For example, a daily run after midnight, an hourly run during business hours, or a custom cron expression.
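For example, reusing the cmd object from above (and after removing the minutely test job), a nightly run at 00:30 could be added with:

cron_add(
  command = cmd,
  frequency = "daily",
  at = "00:30",
  id = "featdelta_sensor_pipeline",
  description = "Run featdelta sensor feature pipeline nightly"
)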

What to monitor

At minimum, monitor the log file written by the script. Each run should tell you:

  1. whether the run succeeded;
  2. which stage completed;
  3. how many rows were fetched;
  4. how many rows were inserted;
  5. how many rows were updated.

You may also want to query the feature table directly:

SELECT COUNT(*) AS n_rows
FROM feature_schema.reading_features;

Or inspect the most recent keys:

SELECT *
FROM feature_schema.reading_features
ORDER BY reading_id DESC
LIMIT 20;

The exact monitoring setup depends on your database and scheduler, but the principle is simple: make each scheduled run leave evidence that it ran and what it changed.

Common scheduler issues

Scheduled jobs often fail for environmental reasons rather than because of fd_run() itself.

Common issues include:

  1. the scheduler uses a different R installation than RStudio;
  2. the scheduled R library does not have featdelta or the DBI driver installed;
  3. environment variables are available interactively but not to the scheduled process;
  4. the script uses relative paths;
  5. the scheduler account does not have database access;
  6. the scheduler account cannot write to the log folder.

When debugging, first run the exact Rscript command from a terminal. Then make sure the scheduler runs under a user account that has the same package library, environment variables, file permissions, and database permissions.
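One way to see what the scheduled process actually has is to log it from the script itself. A minimal sketch, reusing the log_line() helper and placed just after its definition in run_featdelta.R:

# Record the R version, library paths, and whether the credential
# variable is set, so scheduled and interactive runs can be compared.
log_line(
  "R=", R.version.string,
  ", libpaths=", paste(.libPaths(), collapse = ";"),
  ", db_user_set=", nzchar(Sys.getenv("ANALYTICS_DB_USER"))
)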

What to remember

Scheduling is not part of featdelta itself. The package gives you the repeatable R feature pipeline. Windows Task Scheduler or cron runs that pipeline at the chosen time.

The practical setup is:

  1. write a standalone run_featdelta.R;
  2. test it manually with Rscript;
  3. log each run;
  4. schedule it with taskscheduleR or cronR;
  5. monitor the log and feature table.

References

The scheduling examples use the public documentation for:

  1. taskscheduleR (https://cran.r-project.org/package=taskscheduleR);
  2. cronR (https://cran.r-project.org/package=cronR).