Your Database Schema Is a Graph
My default when building a feature query is to write it in one shot. I want to do all the joins, all the aggregates, and arrive at the output I need in a single pass. That’s the intention. Then something is off. The row counts are wrong or an aggregate is getting inflated by a join I didn’t account for. So I start unwinding it. In Postgres that means CTEs: WITH base AS (...), joined AS (...), aggregated AS (...). One checkpoint at a time, stepping through the relational algebra until I find where things went sideways. It works. It’s also exactly like scattering console.log statements through a codebase: manual, sequential, and it tells you where the break is but nothing about how to avoid being here next time. And you will be back next time, because when the schema changes or the distribution shifts, you’re iterating the whole thing again.
That loop is the villain in Fey et al.’s 2023 paper, “Relational Deep Learning: Graph Representation Learning on Relational Databases”. The paper’s argument is that we have been solving the wrong problem. Instead of getting better at writing aggregation queries, we should stop writing them. The Graph Neural Network does it instead.
How I think about feature engineering on relational data
Feature engineering on a relational schema is mostly a translation problem. My mental model is still rooted in classic data warehousing (star schemas, anyone?): you have a fact table recording events over time and a dimension table holding context about the entities involved. Now say you want to predict something about one of those entities at some future point. To get there, you have to flatten the event history into a single row per entity: counts, averages, maximums, windowed aggregates. Each column in that flat table represents a decision you made about what might be predictive.
The decisions compound. Do you count all reviews or just recent ones? Do you weight by recency? What time window is relevant for the outcome you’re predicting? Every one of those choices is a hypothesis encoded as SQL. And when you get the model results back and something doesn’t work, you’re not sure whether the model is wrong or whether one of those hypotheses was wrong.
The place where this bites me hardest is in the null cases. When I flatten a schema that has nullable FK relationships, I have to make a deliberate choice: zero? NULL? A flag? Whatever I pick, I’m potentially hiding signal. A customer with no purchases in the prediction window is different from a customer with three purchases. A user with a null on an optional demographic field is telling me something. But the flat table can only see what I explicitly decided to encode. What gets quietly dropped matters more than what goes in.
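To make that choice concrete, here is a minimal sketch in plain Python. The rows and the `flatten` helper are made up for illustration; nothing here comes from the paper. It keeps "no purchases" distinguishable from "purchases that average to zero" by emitting a None plus an explicit flag instead of silently writing 0.

```python
from collections import defaultdict

# Hypothetical rows: (customer_id, amount). Customer "c3" has no purchases.
purchases = [("c1", 20.0), ("c1", 30.0), ("c2", 5.0)]
customers = ["c1", "c2", "c3"]

def flatten(customers, purchases):
    """Build one feature row per customer, encoding absence explicitly."""
    amounts_by_customer = defaultdict(list)
    for customer_id, amount in purchases:
        amounts_by_customer[customer_id].append(amount)

    rows = {}
    for customer_id in customers:
        amounts = amounts_by_customer.get(customer_id, [])
        rows[customer_id] = {
            "purchase_count": len(amounts),
            # None keeps "no purchases" distinct from "purchases averaging 0".
            "avg_amount": sum(amounts) / len(amounts) if amounts else None,
            # The flag is the part a bare zero would have hidden.
            "has_purchases": bool(amounts),
        }
    return rows

features = flatten(customers, purchases)
```

Whether the flag, the None, or a plain zero is right depends on the downstream model. The point is only that the encoding is a visible decision rather than a side effect of the join.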
What the paper does
The bet: relational databases should be treated as graphs, and graph neural networks should learn directly from that graph structure, replacing the manual join-and-aggregate step whose output is normally fed to a model that predicts a value for the entities in your system.
The paper’s framing of what’s broken:
“The core problem is that no machine learning method is capable of learning on multiple tables interconnected by primary-foreign key relations. Current methods can only learn from a single table, so the data must first be manually joined and aggregated into a single training table, the process known as feature engineering. Feature engineering is slow, error prone and leads to suboptimal models.”
That overstates the gap. Cvitkovic demonstrated in earlier work that GNNs could work on relational tables, and Featuretools has been automating the aggregation layer since around 2017. The paper’s more defensible point is that no prior approach combined three things: a rigorous benchmark for reproducible comparison, a correct treatment of temporal data, and an end-to-end trainable pipeline.
The two-level graph abstraction
The paper defines two layers of graph structure. The first, the schema graph, is your ER diagram. One node per table, one edge per FK relationship. You already have this mental model. Nothing new here except a name.
The second is the Relational Entity Graph (REG). This is where it gets interesting. The REG instantiates your ER diagram at row level: every row in every table becomes a node, and every FK reference between two specific rows becomes an edge. Your customer row is a node. The review that customer wrote is a separate node. The FK from that review to the customer is one edge. The FK from that review to the product is a second edge. The REG has as many edges as there are FK references in your actual data, not in your schema.
Your ER diagram is the blueprint. The REG is the building. The GNN runs on the building.
The nodes carry the column values from their source rows as features. Node types are tracked separately (a customer node and a product node get different weight matrices). Edge types are tracked separately too. The paper calls this a heterogeneous graph, which just means the GNN knows which table a row came from.
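A rough sketch of that row-level instantiation, using plain Python dictionaries. The table contents, the `foreign_keys` mapping, and `build_reg` are all hypothetical names of mine; real tooling builds a heterogeneous graph object rather than dicts, but the shape is the same.

```python
from collections import defaultdict

# Hypothetical rows keyed by primary key; None marks a null foreign key.
tables = {
    "customer": {1: {"name": "Ann"}, 2: {"name": "Bo"}},
    "product": {10: {"title": "Book"}},
    "review": {
        100: {"customer_id": 1, "product_id": 10, "rating": 5},
        101: {"customer_id": 2, "product_id": 10, "rating": 3},
        102: {"customer_id": None, "product_id": 10, "rating": 4},
    },
}

# Schema graph: (child table, FK column) -> parent table.
foreign_keys = {
    ("review", "customer_id"): "customer",
    ("review", "product_id"): "product",
}

def build_reg(tables, foreign_keys):
    """Return typed nodes and typed edges, one edge per non-null FK reference."""
    # Every row becomes a node, identified by (table name, primary key).
    nodes = {
        (table, pk): row
        for table, rows in tables.items()
        for pk, row in rows.items()
    }
    # Edges are grouped by type so each FK relationship can get its own weights.
    edges = defaultdict(list)
    for (child, column), parent in foreign_keys.items():
        for pk, row in tables[child].items():
            if row[column] is not None:
                edges[(child, column, parent)].append(((child, pk), (parent, row[column])))
    return nodes, dict(edges)

nodes, edges = build_reg(tables, foreign_keys)
```

Six rows give six nodes. The edge count follows the data, not the schema: review 102 has a null customer_id, so it contributes one edge instead of two.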
Message passing as the GROUP BY you don’t write
A message-passing GNN updates each node’s representation by doing two operations: it collects representations from every FK-linked neighbor (that is the JOIN), and it aggregates those collected representations into a summary (that is the GROUP BY). The paper calls this “the exact neural version of SQL JOIN+AGGREGATE operations.”
Drop the word “exact.” SQL aggregates are deterministic algebraic operations over a complete dataset. GNN message passing is a learned approximation running on sampled subgraphs. The analogy is useful; the precision claim is not. The right framing: the GNN does structurally what your GROUP BY clause does, except the aggregation function is defined by parameters learned from training data rather than by clauses you wrote. The optimizer decides what to weight. You designed the schema and defined the label.
After several rounds of message passing (each round is one hop in the FK graph), every node has a representation that incorporates signal from its neighborhood. Node-level predictions use that representation directly. Link-level predictions (will this user buy this product?) combine two node representations.
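Here is the smallest version of one message-passing round I can write, in plain Python with scalar features. It is a sketch under heavy assumptions: the toy graph is invented, the aggregation is a fixed mean, and the two weights are hard-coded where a real GNN would learn weight matrices by gradient descent.

```python
# Hypothetical toy graph: each customer lists the ratings of its FK-linked reviews.
neighbor_values = {"ann": [5.0, 3.0], "bo": [4.0], "cy": []}
own_feature = {"ann": 1.0, "bo": 0.0, "cy": 1.0}

def message_pass(neighbor_values, own_feature, w_self, w_neigh):
    """One round: gather neighbor values (the JOIN), mean them (the GROUP BY), mix."""
    updated = {}
    for node, values in neighbor_values.items():
        # A node with no neighbors aggregates to 0.0 here, which is itself
        # an encoding choice for "no linked rows".
        aggregated = sum(values) / len(values) if values else 0.0
        updated[node] = w_self * own_feature[node] + w_neigh * aggregated
    return updated

# Fixed weights for illustration only; training would set these.
hidden = message_pass(neighbor_values, own_feature, w_self=0.5, w_neigh=0.25)
```

Running the same function again on the updated values is the second hop. Stacking rounds is how signal from a product reaches a customer through the reviews that connect them.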
The temporal problem that prior work ignored
Every feature engineering query for a time-sensitive prediction task needs a WHERE event_date <= :prediction_timestamp clause. Forget it once and your model trains on reviews the user hadn’t posted yet, or orders that hadn’t happened. Training metrics look great. Deployment fails because those future rows don’t exist. This is temporal leakage, and it invalidates a model completely. Not “makes it weaker.” Invalidates it.
The REG has a built-in answer. Every node that has a timestamp column (every row in a fact table like reviews or orders) gets that timestamp attached. When the GNN computes a prediction for entity v at time t, it only looks at neighbors where the neighbor’s timestamp is at or before t. The paper calls this the temporal neighborhood.
“In order to prevent future data leakage… we require that the computational graph for an entity v with seed time tv only includes entities with timestamp τ(w) ≤ tv.”
This is a correctness requirement, not a performance optimization. A model that violates it is broken. The REG enforces it architecturally rather than relying on the practitioner to write the right WHERE clause every time, for every feature, without exception.
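The rule itself is a one-line filter. A minimal sketch in Python, with hypothetical node names and a single hop only (the paper's requirement applies to the whole multi-hop computational graph):

```python
from datetime import date

# Hypothetical edges: entity -> list of (neighbor_id, neighbor_timestamp).
neighbor_rows = {
    "customer_1": [
        ("review_100", date(2015, 3, 1)),
        ("review_101", date(2016, 7, 9)),
        ("review_102", date(2017, 1, 2)),
    ]
}

def temporal_neighborhood(entity, seed_time, neighbor_rows):
    """Keep only neighbors whose timestamp is at or before the seed time."""
    return [
        neighbor
        for neighbor, timestamp in neighbor_rows.get(entity, [])
        if timestamp <= seed_time
    ]

visible = temporal_neighborhood("customer_1", date(2016, 7, 9), neighbor_rows)
```

The difference from the WHERE clause is where the filter lives: it sits in the sampler, so every feature the model derives inherits it.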
The paper also describes three temporal sampling strategies (uniform, ordered by recency, recency-biased) for choosing which neighbors to include when the full temporal neighborhood is too large for GPU memory. These are heuristics. They are not validated in this paper. Whether recency-biased sampling outperforms uniform sampling is an open question the series will track.
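To show how two of those strategies differ, here is a sketch in Python. The function names and candidate rows are mine, and I am assuming "ordered by recency" means "take the k most recent". The recency-biased variant is left out because I would be guessing at its weighting.

```python
import random
from datetime import date

# Hypothetical neighbors that already passed the temporal filter.
candidates = [
    ("r1", date(2015, 1, 1)),
    ("r2", date(2016, 1, 1)),
    ("r3", date(2017, 1, 1)),
    ("r4", date(2018, 1, 1)),
]

def sample_uniform(candidates, k, rng):
    """Pick up to k neighbors with equal probability, ignoring timestamps."""
    return rng.sample(candidates, min(k, len(candidates)))

def sample_most_recent(candidates, k):
    """Pick the k neighbors closest to the seed time."""
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:k]

rng = random.Random(0)  # seeded so the sketch is repeatable
uniform_pick = sample_uniform(candidates, 2, rng)
recent_pick = sample_most_recent(candidates, 2)
```

Uniform sampling can hand the model a 2015 review and skip the 2018 one; most-recent sampling never will. Which of those helps is exactly the unvalidated part.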
What you still have to build
The paper’s claim is that RDL works “without any manual feature engineering.” What RDL genuinely eliminates: the SQL aggregation design loop. The COUNT(*) and AVG(rating) and windowed aggregate queries that encode your feature hypotheses. Those go away. The GNN learns what to aggregate.
What RDL does not eliminate is more interesting than the paper lets on. You still write the training table SQL: time-conditioned queries defining which entities you’re predicting, at what timestamp, with what label. The paper calls this T_train. It requires temporal reasoning and domain knowledge. That’s still yours to write.
Take the rel-amazon lifetime value task as a concrete example. The label is “sum of prices of products the user will buy and review in the next two years.” Writing that label SQL requires decisions the paper glosses over:
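-- seeds(customer_id, seed_time): assumed table of candidate entity/seed-time pairs
SELECT
    s.customer_id,
    s.seed_time,
    COALESCE(SUM(t.price), 0) AS label
FROM seeds s
LEFT JOIN transactions t
    ON t.customer_id = s.customer_id
    -- strictly after seed_time: rows at or before it are model input
    AND t.transaction_date > s.seed_time
    AND t.transaction_date <= s.seed_time + INTERVAL '2 years'
GROUP BY s.customer_id, s.seed_time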
Does a returned order count toward the label? Does a partial refund reduce it? What about orders placed but not fulfilled before the label window closes? The query above does not answer those questions — you do, by how you filter the transactions table before this aggregation runs. Whatever you write, the model optimizes toward it.
Then there is entity filtering. The paper restricts to “active users defined as users that wrote a review in the past two years before the timestamp.” That WHERE clause is yours:
WHERE customer_id IN (
    SELECT customer_id FROM reviews
    WHERE review_time >= seed_time - INTERVAL '2 years'
        AND review_time < seed_time
)
Why two years? Why reviews and not purchases? A user who bought things but never reviewed — do they get a row with a real label, or do you exclude them? Every entity filter is a hypothesis about who the model should learn from.
Then there is the seed time constraint. If your prediction window is 90 days, your seed times must stop at least 90 days before the end of your data — otherwise you cannot compute what actually happened in the label window. Your last valid seed time is your data end date minus 90 days. You slide backwards from that cutoff in fixed intervals to generate multiple training rows per entity:
-- generate seed times at 90-day intervals going back from the cutoff
SELECT
    generate_series(
        last_seed_time,
        min_date,
        INTERVAL '-90 days'
    ) AS seed_time
FROM (
    SELECT
        MIN(transaction_date) AS min_date,
        -- the 90-day label window is subtracted once, here
        MAX(transaction_date) - INTERVAL '90 days' AS last_seed_time
    FROM transactions
) t
Earlier seed times give you more training examples but older data. Shorter strides give you more rows but overlapping label windows. Those are your tradeoffs to navigate.
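The stride tradeoff is easier to see with numbers. A small Python sketch, with hypothetical dates and a helper name of my own, that walks backwards from the last valid seed time:

```python
from datetime import date, timedelta

def seed_times(data_start, data_end, window_days, stride_days):
    """Walk backwards from the last seed time whose label window fits in the data."""
    seeds = []
    current = data_end - timedelta(days=window_days)
    while current >= data_start:
        seeds.append(current)
        current -= timedelta(days=stride_days)
    return seeds

data_start, data_end = date(2023, 1, 1), date(2023, 12, 31)

# Stride equal to the window: label windows tile the year without overlapping.
tiled = seed_times(data_start, data_end, window_days=90, stride_days=90)

# Shorter stride: more rows per entity, but consecutive label windows overlap.
dense = seed_times(data_start, data_end, window_days=90, stride_days=30)
```

On one year of data, the 90-day stride yields 4 seed times and the 30-day stride yields 10. The extra rows in the dense version are not independent examples, because each label window shares 60 days with the next one.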
Finally, edge cases in the label itself. A customer with zero purchases in the prediction window gets a label of $0. That is a valid training example. But $0 and NULL mean different things depending on your entity filter. If you excluded inactive customers, your $0 labels are all from customers who were active at seed time and then went quiet. If you did not filter, $0 includes customers who were never going to buy anything. The COALESCE(SUM(price), 0) in the label query treats them identically. The model will learn whatever pattern that SQL produces.
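A sketch of how the entity filter changes what a $0 label means, in Python with invented customers. The data, the `build_labels` helper, and the 730-day windows are assumptions for illustration.

```python
from datetime import date, timedelta

# Hypothetical data: review dates (history) and purchases as (date, price).
reviews = {"ann": [date(2019, 6, 1)], "bo": [date(2019, 9, 1)], "cy": []}
purchases = {"ann": [(date(2020, 3, 1), 12.0)], "bo": [], "cy": []}

def build_labels(reviews, purchases, seed, window_days, lookback_days, active_only):
    """Sum purchases in the label window, optionally keeping only active customers."""
    window_end = seed + timedelta(days=window_days)
    lookback_start = seed - timedelta(days=lookback_days)
    labels = {}
    for customer in purchases:
        active = any(lookback_start <= r < seed for r in reviews[customer])
        if active_only and not active:
            continue  # excluded entirely: no row at all, not a $0 row
        labels[customer] = sum(
            price for day, price in purchases[customer] if seed < day <= window_end
        )
    return labels

seed = date(2020, 1, 1)
filtered = build_labels(reviews, purchases, seed, 730, 730, active_only=True)
unfiltered = build_labels(reviews, purchases, seed, 730, 730, active_only=False)
```

With the filter on, the only $0 belongs to bo, who was active and then went quiet. With it off, cy shows up with the same $0 despite never having done anything. The label column cannot tell them apart.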
None of this is what the paper means by “feature engineering.” But it is engineering. The distinction the paper draws is real: you are specifying what to predict, not how to predict it. The GNN handles the how. Whether that distinction holds up under the pressure of production workloads is still an open question, and I’m very interested to see what other papers in this space show.
Schema configuration stays too. You declare which FK relationships to include in the graph, which column carries the timestamp for each table, how to handle dimension tables that lack per-row timestamps. On a clean, well-documented schema this is manageable. On an enterprise database with nullable FKs, implicit joins via business-logic keys, and denormalized tables that encode FK logic in application code, that configuration is not trivial.
Then there’s encoder selection: text columns need a pretrained transformer, numerical columns need normalization, categoricals need embedding layers. The quality of these encoders directly determines what information the GNN has to work with. And sampling hyperparameters: how many hops deep should the GNN go? How large a neighborhood to sample? These are decisions additive to the SQL workflow, not a replacement for part of it.
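For a sense of what the simplest of those encoders involve, here is a stand-in sketch in plain Python covering the numerical and categorical cases. These are not the encoders the paper or PyTorch Frame use; the helper names, the zero fill for missing numbers, and the reserved index 0 are my assumptions. Text columns are skipped because they need a pretrained model.

```python
from statistics import mean, pstdev

def fit_numeric(values):
    """Return a function that standardizes a value; missing values map to 0.0."""
    present = [v for v in values if v is not None]
    mu, sigma = mean(present), pstdev(present)
    scale = sigma or 1.0  # avoid dividing by zero on constant columns
    return lambda v: 0.0 if v is None else (v - mu) / scale

def fit_categorical(values):
    """Return a function mapping a category to an embedding-table index."""
    categories = sorted({v for v in values if v is not None})
    vocab = {category: index + 1 for index, category in enumerate(categories)}
    return lambda v: vocab.get(v, 0)  # index 0 is reserved for missing or unseen

encode_price = fit_numeric([1.0, 3.0, None])
encode_genre = fit_categorical(["fiction", "history", "fiction", None])
```

Even here there are decisions hiding in plain sight: mapping a missing price to 0.0 after standardization is the same as imputing the mean.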
The CNN analogy the paper uses is apt for the part it gets right. Before deep learning, computer vision researchers hand-crafted image filters: Sobel edge detectors, Gabor filters, HOG descriptors. CNNs replaced those with learned ones. RDL proposes the same substitution for relational feature aggregation. That substitution is real and, if it holds empirically, genuinely valuable. But moving from hand-crafted to learned representations in computer vision didn’t eliminate image preprocessing, architecture selection, or hyperparameter tuning. It replaced one set of decisions with a different set.
The benchmark that made this the founding document
The paper introduces RELBENCH, an open benchmark package with two real-world databases. The first, rel-amazon, has 21.9 million reviews in the Amazon books catalog spanning 1996-2018, organized across three tables. The second, rel-stackex, pulls from Stats Stack Exchange: seven tables including users, posts, comments, votes, and badges, spanning 2009-2023.
Each database has two prediction tasks: customer lifetime value and churn prediction on rel-amazon, user engagement and vote prediction on rel-stackex. Temporal train/validation/test splits are pinned to specific timestamps. The Python package integrates with PyTorch Geometric for GNN training and PyTorch Frame for multi-modal column encoding.
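A pinned temporal split is simple to state in code. This Python sketch uses made-up cutoff dates and a helper name of my own, not the benchmark's actual timestamps or API:

```python
from datetime import date

def temporal_split(rows, val_time, test_time):
    """Split (seed_time, payload) rows by time; never shuffle across the cutoffs."""
    train = [row for row in rows if row[0] < val_time]
    val = [row for row in rows if val_time <= row[0] < test_time]
    test = [row for row in rows if row[0] >= test_time]
    return train, val, test

# Hypothetical training-table rows: (seed_time, entity id).
rows = [
    (date(2012, 1, 1), "u1"),
    (date(2014, 1, 1), "u2"),
    (date(2015, 6, 1), "u3"),
    (date(2016, 6, 1), "u4"),
]
train, val, test = temporal_split(rows, val_time=date(2015, 1, 1), test_time=date(2016, 1, 1))
```

Pinning the cutoffs matters for comparison across papers: two groups reporting on the same task are evaluated on the same future.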
The moment was right: two different groups tacking in the same direction
Another group, including Han Zhang, Quan Gan, David Wipf, and Weinan Zhang, posted Graph-based Feature Synthesis for Prediction over Relational Databases (GFS) on arXiv within days of the RDL paper. Both groups arrived at the same foundational formulation: relational databases should be treated as heterogeneous graphs.
The infrastructure matured simultaneously: PyTorch Geometric reached production-grade support for heterogeneous graphs and temporal neighbor sampling around 2022-2023. Temporal GNN research (TGAT in 2020, TGN in 2020) had already worked out time-consistent sampling in adjacent contexts. PyTorch Frame, a concurrent build from the same author group, handled multi-modal column encoding. The OGB benchmark from 2020 had already demonstrated that open reproducible benchmarks can anchor GNN subfields. The same lab built OGB.
The two papers made different architectural bets, and that divergence introduces the first live debate in the nascent field. GFS is a two-stage pipeline: the graph structure synthesizes features that feed into a conventional gradient boosting model. No joint optimization. RDL trains everything end-to-end. Two fundamentally different philosophies about whether the graph is a preprocessing step or the model itself.
What this paper actually proves vs. what it promises
The NeurIPS 2024 RelBench empirical paper is where the first head-to-head comparison appears: RDL against a professional data scientist doing manual feature engineering across seven datasets and thirty tasks. That is the result the abstract is promising. That is the next paper that I definitely want to check out.
So what does this paper actually prove?
- The REG definition is rigorous. The vocabulary it establishes (REG, schema graph, training table, temporal neighborhood) derives from first principles.
- The temporal correctness argument is important. Filtering neighbors by timestamp is not optional. A model that skips this step is invalid, not weaker. The paper handles this well.
- RELBENCH. It anchored NeurIPS 2024 RelBench, ICML 2025 RelGNN, RelBench v2, and a KDD 2025 survey paper commissioned because the field had enough activity to warrant synthesis. A 2023 preprint that produces a 2025 survey is not a small bet.
My take on the honest caveat
The nullable FK problem I described earlier is exactly where I want to see how RDL behaves. When I flatten a schema that has optional FK relationships, I make a deliberate choice about how to encode the absence. That choice is mine, it’s visible, and I can inspect its effect on the downstream model.
The paper’s answer is that the GNN learns what to do with missing signal from training data. A row with a null FK simply has no edge in the REG. The GNN sees the absence structurally.
Maybe. What I need from an RDL implementation to trust that answer is logging: being able to see what the model is doing with those absent edges and validate it against my expectations. That inspection surface is the difference between trusting a new system and hoping it does what you think it does.
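The data-side half of that inspection is something I can already sketch. This Python helper is hypothetical (my naming, my report shape) and says nothing about what the model does with absent edges. It only tells me how many nodes will be missing an edge of each type, which is the number I would want logged next to any per-segment evaluation.

```python
def null_fk_report(rows, fk_columns):
    """Count rows whose foreign key is null, i.e. nodes that get no edge of that type."""
    report = {}
    for column in fk_columns:
        missing = sum(1 for row in rows if row.get(column) is None)
        report[column] = {
            "missing": missing,
            "total": len(rows),
            "share": missing / len(rows) if rows else 0.0,
        }
    return report

# Hypothetical review rows; one has no customer attached.
review_rows = [
    {"customer_id": 1, "product_id": 10},
    {"customer_id": 2, "product_id": 10},
    {"customer_id": None, "product_id": 11},
    {"customer_id": 1, "product_id": 12},
]
report = null_fk_report(review_rows, ["customer_id", "product_id"])
```

If a quarter of the review nodes have no customer edge, I want to see model error broken out for that quarter before I trust that the absence is being learned rather than ignored.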
The paper does not prescribe how to handle dimension table rows that lack timestamps. The framework assumes FK declarations are correct and complete. Neither of these is consistently true in production databases I have worked with. These are not fatal objections to the approach. They are the things I want to watch as I start to explore more papers in this space.
What comes next
The questions this paper left open are the questions the series exists to track.
The most practically important: does RDL beat XGBoost plus a well-configured automated feature engineering pipeline under fair conditions? Not “manual SQL from scratch” as the paper implies, but a real baseline where a skilled data scientist uses Featuretools-style tooling with domain knowledge.
The open question I keep coming back to is the one the paper does not address at all: how does RDL behave when the schema is messy? Nullable FKs, undocumented relationships, denormalized tables where FK logic lives in application code rather than database constraints. Every production database I have worked with has at least some of this.
The Relational Entity Graph, the schema graph, the training table, the temporal neighborhood, and the temporal leakage problem: these are the terms I’m adding to my lexicon.