featdelta is for R users who create features in R but
need those features to live in a database.
In many analytics projects, the raw observations are stored in a database and new rows arrive over time. The feature engineering logic, however, may be easier to write, test, and maintain in R. This creates a practical problem: after the R code computes the features, those values still need to be stored in a database table so they can be reused by modelling scripts, dashboards, monitoring jobs, or other downstream systems.
featdelta provides an automated pipeline for that workflow. You define feature expressions in R, and the package handles the database-oriented steps around them: fetching the source rows, computing the features, comparing keys against the existing feature table, and inserting or updating only what is needed.
The goal is to reduce the amount of repeated code needed to refresh feature tables. Instead of rebuilding an analysis dataset by hand each time new raw data arrives, you can keep a persistent feature table in the database and update it incrementally.
This pattern appears in many applied data workflows.
For example, in a daily market-data workflow, a database table may be updated with the latest closing prices after each trading day. Once the new prices are available, an analyst may calculate indicators in R and store them in a separate feature table. On the next day, the pipeline should process only the newly available observations rather than rebuilding the entire history.
In a credit-risk workflow, new loan or credit-card applications may arrive every day. Feature engineering often involves transformed variables that are more useful for modelling than the raw input columns. For example, an analyst may derive ratios, delinquency summaries, utilization measures, or other scorecard inputs from the raw application and bureau data. If these features are recreated only inside ad-hoc modelling scripts, refreshing the dataset with recent applications can become a recurring manual task. A persistent feature table makes the transformed variables easier to reuse for model development, validation, and monitoring.
In a business-intelligence workflow, a reporting tool may need variables that are difficult to create directly in the BI layer. These might come from an R model, a specialized matching algorithm, or domain-specific transformation code. If the results are pushed back to the database, the BI tool can read them like ordinary columns.
The details differ across domains, but the workflow is similar: raw data changes over time, feature logic is maintained in R, and the computed features are more useful when they are available in the database.
This vignette uses an in-memory SQLite database and the built-in mtcars dataset. The database is tiny and the example features are intentionally simple, but the workflow mirrors a real production pattern: raw rows arrive over time, the feature logic is maintained in R, and the computed features are persisted in a database table.
Before demonstrating featdelta, we first set up the database. In this example, we pretend that the rows arrive in two daily batches. Our goal is to add new features to a database feature table at the end of each day:
library(DBI)
library(RSQLite)
library(featdelta)
cars <- mtcars
cars$id <- seq_len(nrow(cars))
# Name of the unique row identifier used throughout the pipeline.
key <- "id"
day_one <- 1:20
day_two <- 21:30

We use an in-memory SQLite database to mimic a real R-to-database
workflow. We establish the database connection with
DBI::dbConnect(). For this small example, no database
credentials are needed.
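In a production setting, the same calls would work against a server-based database; only the connection changes. As an illustration (the driver, host, and credential values below are placeholders, not part of featdelta), a PostgreSQL connection might look like this:

# Illustrative only, not run in this vignette: connecting to a
# server-based database. Host, database, and credentials are placeholders.
con <- dbConnect(
  RPostgres::Postgres(),
  host     = "db.example.com",
  dbname   = "analytics",
  user     = Sys.getenv("DB_USER"),
  password = Sys.getenv("DB_PASS")
)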
Next, DBI::dbWriteTable() pushes the first batch of rows
from R into the database and creates a table named
raw_cars. This table represents the raw observations for
which we want to create separate features.
After the raw data has been written to the database, we can inspect it using a simple SQL query.
con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "raw_cars", cars[day_one, ], overwrite = TRUE)
dbGetQuery(con, "SELECT id, mpg, cyl, hp, wt, am FROM raw_cars ORDER BY id")
#> id mpg cyl hp wt am
#> 1 1 21.0 6 110 2.620 1
#> 2 2 21.0 6 110 2.875 1
#> 3 3 22.8 4 93 2.320 1
#> 4 4 21.4 6 110 3.215 0
#> 5 5 18.7 8 175 3.440 0
#> 6 6 18.1 6 105 3.460 0
#> 7 7 14.3 8 245 3.570 0
#> 8 8 24.4 4 62 3.190 0
#> 9 9 22.8 4 95 3.150 0
#> 10 10 19.2 6 123 3.440 0
#> 11 11 17.8 6 123 3.440 0
#> 12 12 16.4 8 180 4.070 0
#> 13 13 17.3 8 180 3.730 0
#> 14 14 15.2 8 180 3.780 0
#> 15 15 10.4 8 205 5.250 0
#> 16 16 10.4 8 215 5.424 0
#> 17 17 14.7 8 230 5.345 0
#> 18 18 32.4 4 66 2.200 1
#> 19 19 30.4 4 52 1.615 1
#> 20 20 33.9 4 65 1.835 1

In real-world scenarios, the data needed for feature engineering is
rarely stored in one simple table. It is often selected from several
database tables with joins, filters, and other selection criteria. For
that reason, the database-to-R interface in this vignette uses a SQL
query instead of referring directly to the raw_cars
table.
In the snippet below, we define the source dataset as rows where
id > 15. This demonstrates that the dataset to be
processed can be created with a custom query. There is one important
requirement: each returned row must have a unique identifier. That is
why we defined the key variable earlier.
We can bring this source dataset into the R session with
DBI::dbGetQuery(con, source_sql).
source_sql <- "SELECT * FROM raw_cars WHERE id > 15 ORDER BY id"
dbGetQuery(con, source_sql)
#> mpg cyl disp hp drat wt qsec vs am gear carb id
#> 1 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 16
#> 2 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 17
#> 3 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 18
#> 4 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 19
#> 5 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 20

At this point, the source table contains raw observations, but the feature table does not exist yet.
The rest of the vignette demonstrates how the feature table can be
created and refreshed with featdelta. Later, we summarize
which manual steps this pipeline replaces.
Start by defining features. These are ordinary R expressions
evaluated against the raw data returned by source_sql.
defs <- fd_define(
transmission = ifelse(am == 1, "manual", "automatic"),
hp_per_cyl = hp / cyl,
wt_per_hp = wt / hp
)
defs
#> <featdelta_defs>
#> Definition steps (3):
#> - [column] transmission -> ifelse(am == 1, "manual", "automatic")
#> - [column] hp_per_cyl -> hp/cyl
#> - [column] wt_per_hp -> wt/hp

This is one of the first conveniences of featdelta:
feature definitions are stored as an object. They can be printed,
reviewed, reused, tested, and passed to the pipeline. The expressions
are not scattered across one-off scripts.
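Because the definitions are ordinary R objects, they can even be sanity-checked against hand-made rows before touching the database. A minimal sketch (the toy values below are ours, not from the database):

# Sanity-check the definitions on a couple of hand-made rows.
toy <- data.frame(id = 1:2, am = c(0, 1),
                  hp = c(100, 200), cyl = c(4, 8), wt = c(2.0, 3.0))
fd_compute(data = toy, defs = defs, key = "id")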
Before writing to the database, we can also compute locally to see the shape of the feature table that will be produced.
raw_preview <- dbGetQuery(con, source_sql)
fd_compute(
data = raw_preview,
defs = defs,
key = key
)
#> id transmission hp_per_cyl wt_per_hp
#> 1 16 automatic 26.875 0.02522791
#> 2 17 automatic 28.750 0.02323913
#> 3 18 manual 16.500 0.03333333
#> 4 19 manual 13.000 0.03105769
#> 5 20 manual 16.250 0.02823077

The output contains the key column plus the computed features. This is the table shape we want to persist.
In this example, we define features in the simplest possible way: one expression per feature. In real projects, feature definitions may be generated by longer R scripts and may produce many columns. Later vignettes will cover richer definition patterns, including multi-column feature blocks.
Now let featdelta run the full pipeline. The important
result is the database table it creates. In this example, we call the
new database table car_features.
run_day_one <- fd_run(
con = con,
sql = source_sql,
defs = defs,
key = key,
feat_table_name = "car_features",
verbose = FALSE
)
dbGetQuery(con, "SELECT * FROM car_features ORDER BY id")
#> id transmission hp_per_cyl wt_per_hp
#> 1 16 automatic 26.875 0.02522791
#> 2 17 automatic 28.750 0.02323913
#> 3 18 manual 16.500 0.03333333
#> 4 19 manual 13.000 0.03105769
#> 5 20 manual 16.250 0.02823077

The R expressions were evaluated against the selected raw rows, and the result was stored as a persistent table. You can inspect that table with an ordinary SQL query, just as you would inspect any other database table.
That is the central idea: the data scientist can write feature logic in R, while the computed features are stored in a database table that other workflows can reuse.
Now pretend time passes and more raw observations arrive.
dbAppendTable(con, "raw_cars", cars[day_two, ])
#> [1] 10
dbGetQuery(
con,
"SELECT id, mpg, cyl, hp, wt, am FROM raw_cars WHERE id > 15 ORDER BY id DESC"
)
#> id mpg cyl hp wt am
#> 1 30 19.7 6 175 2.770 1
#> 2 29 15.8 8 264 3.170 1
#> 3 28 30.4 4 113 1.513 1
#> 4 27 26.0 4 91 2.140 1
#> 5 26 27.3 4 66 1.935 1
#> 6 25 19.2 8 175 3.845 0
#> 7 24 13.3 8 245 3.840 0
#> 8 23 15.2 8 150 3.435 0
#> 9 22 15.5 8 150 3.520 0
#> 10 21 21.5 4 97 2.465 0
#> 11 20 33.9 4 65 1.835 1
#> 12 19 30.4 4 52 1.615 1
#> 13 18 32.4 4 66 2.200 1
#> 14 17 14.7 8 230 5.345 0
#> 15 16 10.4 8 215 5.424 0

The query above shows that new rows have been added to the raw table. The feature table, however, has not been refreshed yet. If we inspect it now, it still ends at id 20, even though raw rows up to id 30 are available.
dbGetQuery(
con,
"SELECT * FROM car_features ORDER BY id DESC"
)
#> id transmission hp_per_cyl wt_per_hp
#> 1 20 manual 16.250 0.02823077
#> 2 19 manual 13.000 0.03105769
#> 3 18 manual 16.500 0.03333333
#> 4 17 automatic 28.750 0.02323913
#> 5 16 automatic 26.875 0.02522791

This is the point where manual pipelines become repetitive: you need to compare raw keys with feature-table keys, build feature rows only for the missing raw observations, and append them without duplicating existing rows.
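For comparison, a hand-rolled version of that delta step might look roughly like the sketch below. It uses only DBI, base R, and fd_compute() from above; it is shown for illustration and is not run in this vignette.

# Manual delta refresh (illustrative, not run): fetch the raw rows,
# find keys missing from the feature table, compute features for those
# rows only, and append them.
raw  <- dbGetQuery(con, source_sql)
done <- dbGetQuery(con, "SELECT id FROM car_features")$id
todo <- raw[!(raw$id %in% done), ]
if (nrow(todo) > 0) {
  feats <- fd_compute(data = todo, defs = defs, key = key)
  dbAppendTable(con, "car_features", feats)
}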
With featdelta, run the same pipeline again and inspect
the database table.
run_day_two <- fd_run(
con = con,
sql = source_sql,
defs = defs,
key = key,
feat_table_name = "car_features",
verbose = FALSE
)
dbGetQuery(con, "SELECT * FROM car_features ORDER BY id DESC")
#> id transmission hp_per_cyl wt_per_hp
#> 1 30 manual 29.16667 0.01582857
#> 2 29 manual 33.00000 0.01200758
#> 3 28 manual 28.25000 0.01338938
#> 4 27 manual 22.75000 0.02351648
#> 5 26 manual 16.50000 0.02931818
#> 6 25 automatic 21.87500 0.02197143
#> 7 24 automatic 30.62500 0.01567347
#> 8 23 automatic 18.75000 0.02290000
#> 9 22 automatic 18.75000 0.02346667
#> 10 21 automatic 24.25000 0.02541237
#> 11 20 manual 16.25000 0.02823077
#> 12 19 manual 13.00000 0.03105769
#> 13 18 manual 16.50000 0.03333333
#> 14 17 automatic 28.75000 0.02323913
#> 15 16 automatic 26.87500 0.02522791

The existing feature rows were not duplicated. featdelta
detected the raw rows whose keys were missing from
car_features, computed features only for those rows, and
inserted them into the database table.
Feature engineering is rarely finished on the first day. Suppose we decide that we also want a power-to-weight feature.
We can define a new version of the feature set and run the same
pipeline again. This time, the definitions include the new
power_to_weight feature.
defs_v2 <- fd_define(
transmission = ifelse(am == 1, "manual", "automatic"),
hp_per_cyl = hp / cyl,
wt_per_hp = wt / hp,
power_to_weight = hp / wt
)
run_refresh <- fd_run(
con = con,
sql = source_sql,
defs = defs_v2,
key = key,
feat_table_name = "car_features",
fetch_mode = "all",
verbose = FALSE
)
dbGetQuery(con, "SELECT * FROM car_features ORDER BY id DESC")
#> id transmission hp_per_cyl wt_per_hp power_to_weight
#> 1 30 manual 29.16667 0.01582857 63.17690
#> 2 29 manual 33.00000 0.01200758 83.28076
#> 3 28 manual 28.25000 0.01338938 74.68605
#> 4 27 manual 22.75000 0.02351648 42.52336
#> 5 26 manual 16.50000 0.02931818 34.10853
#> 6 25 automatic 21.87500 0.02197143 45.51365
#> 7 24 automatic 30.62500 0.01567347 63.80208
#> 8 23 automatic 18.75000 0.02290000 43.66812
#> 9 22 automatic 18.75000 0.02346667 42.61364
#> 10 21 automatic 24.25000 0.02541237 39.35091
#> 11 20 manual 16.25000 0.02823077 35.42234
#> 12 19 manual 13.00000 0.03105769 32.19814
#> 13 18 manual 16.50000 0.03333333 30.00000
#> 14 17 automatic 28.75000 0.02323913 43.03087
#> 15 16 automatic 26.87500 0.02522791 39.63864

Here we use fetch_mode = "all", which tells
fd_run() to refresh the features for all rows returned by
source_sql, including rows that already existed in the
feature table. This ensures that old rows are not missing the newly
added feature. In a real workflow, this is the difference between
maintaining a database feature table over time and repeatedly rebuilding
analysis datasets by hand.
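One way to organize the two modes is a small wrapper that runs incrementally by default and switches to fetch_mode = "all" after the definitions change. This is a sketch of our own, not a featdelta API; it only wraps the fd_run() call shown above:

# Hypothetical helper: incremental refresh by default, full recompute
# (fetch_mode = "all") after the feature definitions change.
refresh_car_features <- function(defs, full = FALSE) {
  args <- list(
    con = con, sql = source_sql, defs = defs, key = key,
    feat_table_name = "car_features", verbose = FALSE
  )
  if (full) args$fetch_mode <- "all"
  do.call(fd_run, args)
}
# refresh_car_features(defs_v2)               # daily incremental run
# refresh_car_features(defs_v2, full = TRUE)  # after adding a feature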
Without featdelta, the workflow usually grows into a script that does all of this manually: query the raw table, recompute the features, compare keys with the existing feature table, append the missing rows without duplicating existing ones, and patch old rows whenever a new feature is added. That script can be written, but it tends to become repetitive and fragile. featdelta makes those steps explicit:
defs <- fd_define(
feature_a = some_r_expression,
feature_b = another_r_expression
)
fd_run(
con = con,
sql = "SELECT * FROM raw_table",
defs = defs,
key = "id",
feat_table_name = "feature_table"
)

For most users, this is the main workflow: define features in R, then run the database pipeline.
The database table is the main result, but fd_run() also
returns a structured report. The report is useful when developing,
debugging, or monitoring the pipeline.
run_report <- fd_run(
con = con,
sql = source_sql,
defs = defs_v2,
key = key,
feat_table_name = "car_features",
fetch_mode = "all",
return_data = "features",
preview_n = 3,
verbose = FALSE
)
run_report
#> <fd_run_report>
#> success: TRUE
#> stage: complete
#> table: car_features
#> dialect: sqlite
#> fetched: 15 rows
#> computed: 15 rows, 4 feature columns
#> upsert: would_insert=0, would_update=15

The report tells you how many rows were fetched, which features were computed, and how many rows were targeted for insert or update. It is there when you need visibility, but it does not get in the way of the core workflow.
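Those counts make light-weight monitoring possible. As a sketch, assuming the printed fields are accessible as elements of the report object (the exact accessor names may differ), a scheduled job could fail loudly when a run does not complete:

# Hypothetical check for a scheduled job; the field names are
# assumptions based on the printed report, not a documented API.
if (!isTRUE(run_report$success)) {
  stop("featdelta run failed at stage: ", run_report$stage)
}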
This vignette introduced the main story: create feature logic in R, keep a database feature table up to date, and avoid rebuilding the same feature dataset by hand.
The next topics are:

- richer definition patterns with fd_define() and fd_block();
- calling fd_fetch() and fd_upsert() directly.