Inside Fabi: 3 systems behind an analyst agent

TL;DR: Fabi is an analyst agent. The interesting engineering sits in 3 systems beneath the chat: a ReAct-loop harness that handles context collection (static schema RAG plus dynamic on-demand retrieval), memory and nightly dreaming consolidation, and cost / token optimization through per-iteration reasoning routing; a dependency engine that joins a static cell DAG with live kernel variable provenance to skip stale-but-still-resident cells; and a custom stateful sandbox whose result cache the dependency engine can invalidate coherently. Each is shaped by the same 4 axes: accuracy, latency / UX, cost, and security.

Fabi is an analyst agent. From the user's chair it looks like a chat panel sitting next to dashboards and data apps, powered by SQL, Python, and React for interactive elements. Underneath there are 3 systems doing the real work: the agent harness, the dependency engine that decides what to rerun when any code or data artifact changes, and the stateful sandbox that executes the code. This post walks through each of them at the architectural level.

A note on terminology used throughout: a thread is a persistent conversation tied to a Smartbook; a chat is one user message and the agent's full response, which may run several LLM turns (ReAct iterations) under the hood.

The 4 axes we optimize for

Every architectural decision in Fabi pays against 4 axes that fight each other:

  • Accuracy. The agent picks the right tables, writes code that runs, answers the actual question.
  • Latency & UX. Wall-clock latency, perceived progress, interruptibility, and whether the user can predict what the agent is about to change. A silent 60-second wait is a UX failure even if the answer at the end is correct, and so is an agent that rewrites cells the user did not ask it to touch.
  • Cost. Tokens, warehouse queries, and compute stay bounded across multi-turn chats and across long-lived threads.
  • Security. No DROP TABLE on a customer warehouse, no exfiltration of credentials, no execution of arbitrary code outside the sandbox. Security is a hard constraint, not something the other 3 axes get to negotiate against.

You can buy any one of these by paying with the others. Spending more reasoning tokens improves accuracy and degrades cost and UX. Skipping context retrieval improves UX and degrades accuracy. Caching saves latency and cost, but only when invalidation is correct.

Architecture at a glance

A user prompt enters from any surface (chat panel in the Fabi web app, Slack, CLI, or an MCP client) and lands at the analyst agent. The agent runs a ReAct loop (reasoning interleaved with tool actions), drawing on a fixed set of tools (RAG over schemas, sub-agent dispatch, dry-run code execution, cell edits, memory updates, MCP integrations) and round-tripping with the LLM provider on a compacted prompt. Code generated along the way executes against a stateful sandbox and the artifacts (cells, charts, data apps) get saved into a Smartbook the user can return to. Underneath the agent sit the rest of the pieces (multi-language dependency resolver, schema retrieval, kernel orchestrator, nightly dreaming) covered later in the post.

Each of the 3 systems sits at a different operating point on the axes above. The harness balances all 4 at runtime; the dependency engine cuts cost and latency through aggressive caching while keeping reruns correct; the stateful sandbox owns kernel-level isolation and the latency floor.

1. The agent harness

A harness is what everyone in agent-land is talking about right now, and where most of the engineering hours actually go. The LLM is a fixed commodity; you build with whichever frontier LLM is best on the day. The harness is everything around the LLM that turns a generic call into something that can analyse a 3 TB warehouse, propose code that runs the first time, get cancelled cleanly, and not bankrupt the company in tokens.

Components

Agentic flow. A ReAct loop with reflection, with the primary agent as orchestrator. We chose ReAct over deeper architectures (file-system-backed plans, persistent todo lists, multi-agent swarms) because the product is interactive: analysts expect each chat to complete in 10 seconds to 2 minutes end to end (even when the ReAct loop runs multiple LLM turns under the hood), not the 10-minute to multi-hour horizons of a deep research agent. Cancellation is cooperative via an atomic flag checked at every step boundary; the same mechanism is the foundation for mid-flight prompt injection, so live steering of an in-flight agent will land on the same plumbing.
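
A minimal sketch of that cancellation shape, with invented names (run_chat, llm.reason, step.is_final are illustrative stand-ins, not Fabi's actual interfaces); a threading.Event plays the atomic flag, checked at every step boundary:

```python
import threading

class Cancelled(Exception):
    """Raised when the user cancels mid-chat."""

def run_chat(prompt, llm, tools, cancel: threading.Event, max_iters=10):
    # ReAct loop: reason, act, observe -- the cancel flag is checked at
    # every step boundary so a cancel lands cleanly between steps,
    # never mid-tool-call.
    observations = []
    for _ in range(max_iters):
        if cancel.is_set():
            raise Cancelled("before reasoning step")
        step = llm.reason(prompt, observations)  # a tool call or a final answer
        if step.is_final:
            return step.answer
        if cancel.is_set():
            raise Cancelled("before tool call")
        observations.append(tools[step.tool](**step.args))
    raise RuntimeError("iteration budget exhausted")
```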

Multi-turn chats are dominated by turn 1, where the LLM figures out what the user wants, which tables matter, which approach to take. Turns 2 onwards are mechanical: SQL dialect fixes, column swaps, formatting, follow-up clarifications. So we route per iteration: turn 1 against a high-effort reasoning LLM, subsequent iterations against the same provider's lower-effort variant. The trade is fundamentally accuracy against latency, spending reasoning budget where it matters and running fast enough on the cheap turns to feel responsive. The swap survives prompt caching because we stay within one provider; the cheap-LLM call is just another hit on the long-cached prefix.
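
As a sketch, the routing reduces to a lookup keyed on the turn index; the model label and parameter names here are invented for illustration, not the provider's real API:

```python
def llm_params(turn_index: int) -> dict:
    # Turn 1 carries the hard reasoning (intent, tables, approach);
    # later turns are mostly mechanical fixes. Same provider either
    # way, so the long-cached prompt prefix keeps hitting.
    if turn_index == 1:
        return {"model": "frontier-large", "reasoning_effort": "high"}
    return {"model": "frontier-large", "reasoning_effort": "low"}
```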

Tools. Tools are the agent's hands. The design rule is that anything a human can do in the Smartbook UI, the agent can do through a tool call: RAG over schemas, sub-agent dispatch, dry-run code execution, cell edits, memory updates, MCP integrations. Calls are non-blocking and stream progress back to the UI, so the user watches the agent work mid-turn instead of staring at a spinner. Every invocation routes through a permission check first.

Context. Context is the agent's senses, what it can see and know about the world before deciding what to do next. Static context loads through RAG once per chat (schema, semantics, prior certified queries) and can be re-pulled later via tool calls when the agent needs more; dynamic context is fetched on demand each turn (conversation search, Smartbook state, cell status). The harness assembles both fresh on every turn and compacts when the bundle approaches the LLM context window limit.

How much schema context the agent gets is decided dynamically based on the size of the data source. For small warehouses (under 5 tables) we pass the full schema with sample values; for medium ones (5 to 40 tables), table and column names only; large warehouses route through embedding-based RAG, and only the columns relevant to the prompt come back. The principle is to keep the context highly relevant without bloating the prompt. Per-org semantic documents and prior example queries layer on top and get richer as the dreaming process feeds them more facts. Most accuracy regressions we saw in the first year traced back to context that was either too thin (the LLM guessed) or too thick (the LLM drowned).
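
A hedged sketch of the tiering, assuming hypothetical table objects (schema_with_samples, column_names) and a rag_index standing in for the embedding store:

```python
def build_schema_context(tables, prompt, rag_index):
    # Tier the schema bundle by warehouse size, using the thresholds
    # from the post.
    n = len(tables)
    if n < 5:            # small: full schema plus sample values
        return [t.schema_with_samples() for t in tables]
    if n <= 40:          # medium: table and column names only
        return [{"table": t.name, "columns": t.column_names()} for t in tables]
    # large: embedding RAG returns only the columns relevant to the prompt
    return rag_index.search(prompt, top_k=20)
```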

Memory. Memory is what makes the agent smarter over time, separate from any single chat's context. We split it into short-term (per-thread chat memory captured chat by chat) and long-term (org-level and user-level context distilled by a nightly dreaming process). Skills, both progressively discovered patterns and agent-developed routines, sit in the same layer.

Per-thread chat memory is updated on every turn and tracks the user's overall goal plus thread-local context (tables landed on, dataframes produced, dead ends ruled out); it survives process restarts so a user comes back tomorrow without re-explaining yesterday. Nightly dreaming is a scheduled job that aggregates the past week's thread memory across an org and runs 2 structured LLM extraction passes: an org-level pass distils data-source-scoped facts ("amount column is in cents", "fiscal year starts in February") into the org's data source semantics, and a user-level pass captures preferences and patterns into the user's custom instruction. Both merges are section-scoped, so hand-authored semantics survive untouched. The shape is a feedback loop: more usage produces more thread memory, more thread memory produces sharper contexts, sharper contexts produce more accurate runs against the same warehouse.
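
A minimal sketch of a section-scoped merge, assuming markdown-style section headers; the section names and the ownership rule are invented for illustration:

```python
def merge_sections(doc: str, extracted: dict[str, str],
                   agent_owned=("## Learned facts", "## Usage patterns")) -> str:
    # Rewrite only the sections the dreaming pass owns; hand-authored
    # sections pass through byte-for-byte.
    out, skipping = [], False
    for line in doc.splitlines():
        if line.startswith("## "):
            skipping = line in agent_owned and line in extracted
            out.append(line)
            if skipping:
                out.append(extracted[line])  # fresh distilled body
        elif not skipping:
            out.append(line)                 # untouched hand-authored text
    return "\n".join(out)
```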

Eval. Eval is the bottleneck on accuracy improvements, and the hard part is attribution. When end-to-end accuracy moves, was it the LLM, the RAG retrieval, the context assembly, an extra step in the ReAct loop, or a new tool that moved it? Per-component scorers help, but in production accuracy is highly contextualised: the same user prompt against 2 different orgs lands in different ground truths, so generic benchmarks stop being load-bearing past a certain scale. What we shipped is observability for adhoc debugging, not a proper eval framework; rigorous eval is still the hardest unsolved problem in the agent stack we built.

Guardrails. Governed code execution is the load-bearing part of security: an agent running arbitrary code against a customer warehouse is a security surface, and an LLM that hallucinates a DROP TABLE is not a bug you can ship and apologise for later. The harness layer enforces a permission check ahead of every tool call (data source role, sub-agent permission scope, MCP allow-list) and a read-only SQL path that rejects DML and DDL before the warehouse driver sees the statement. We wrote about the broader principle in Governed data execution layer: BI as a data OS; the kernel-level isolation that backs all of this is covered below in the stateful sandbox section.
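
A minimal sketch of the read-only SQL gate using the sqlparse library (an allow-list, not a deny-list); Fabi's actual checks also cover data source roles, sub-agent scopes, and the MCP allow-list:

```python
import sqlparse

def assert_read_only(sql: str) -> None:
    # Anything that is not a plain SELECT (INSERT, UPDATE, DELETE,
    # DROP, ALTER, ...) is rejected before the warehouse driver ever
    # sees the statement.
    for statement in sqlparse.parse(sql):
        kind = statement.get_type()  # e.g. 'SELECT', 'DROP', 'UNKNOWN'
        if kind != "SELECT":
            raise PermissionError(f"read-only SQL path rejects {kind} statement")
```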

2. The dependency engine

Data analysis is highly interactive. We keep state in the kernel between cells (dataframes, intermediate variables, query results) so the agent does not have to hit the warehouse on every change. The catch is that traditional Jupyter kernels turn that state into a mess: out-of-order execution, hidden variables, results that drift from the code that produced them. The dependency engine fixes that by tracking the code blocks the agent generates alongside the kernel variables that came out of them, deciding for each cell whether what's in memory is still trustworthy and what needs to refresh. Users never have to think about kernel state; whatever runs lands an accurate, reproducible result.

The engine has 2 halves: a parser that builds the cell-level dependency graph, and a resolver that decides what to execute against live kernel state.

Multi-language variable parser

A unified parser walks 3 cell types and emits 1 dependency graph.

| Cell type | Extracts |
|---|---|
| Python | assigns and in-place dataframe edits (df['col'] = ..., df.append(...)), both treated as redefinitions; tuple unpacking; comprehension targets; def / class names; builtins filtered |
| SQL + Jinja | referenced tables and dataframes (via DuckDB), plus the explicit output dataframe name and variables expanded inside {{ }} |
| Text/Markdown + templating | variable names referenced in narrative via {{revenue}} |

We build the DAG using Python AST parsing for code cells, SQL plus Jinja parsing for query cells, and {{ }} interpolation parsing for Markdown. Each cell becomes a node with explicit produces/references edges.
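
A simplified sketch of the Python half using the standard ast module; the real parser handles more cases, but the Store/Load walk below is the core shape:

```python
import ast
import builtins

_BUILTINS = set(dir(builtins))

def parse_python_cell(code: str) -> tuple[set[str], set[str]]:
    """Return (produces, references) for one Python cell."""
    produces, references = set(), set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Name):
            # Store context covers plain assigns, tuple unpacking, and
            # comprehension targets in a single pass
            if isinstance(node.ctx, ast.Store):
                produces.add(node.id)
            elif isinstance(node.ctx, ast.Load) and node.id not in _BUILTINS:
                references.add(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            produces.add(node.name)
        elif isinstance(node, ast.Subscript) and isinstance(node.ctx, ast.Store):
            # df['col'] = ... is treated as a redefinition of df
            if isinstance(node.value, ast.Name):
                produces.add(node.value.id)
    return produces, references
```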

Resolver: stale or fresh

Alongside the static graph, the kernel tracks which code block last produced each variable in memory. The resolver joins the 2: walking the static DAG tells it the parent and child set of any cell, and walking the kernel state tells it which of those cells already have fresh results sitting in memory. Together they answer the only question that matters when the agent edits a cell: which parents need to refresh and which children need to rerun.

When the agent edits a cell, the resolver runs a 3-step flow (sketched in code after the list):

  1. Compute the affected set. Walk the static DAG to collect every parent and every descendant of the edited cell.
  2. Classify each cell against kernel state. For each cell in the affected set, check the kernel: if the cell's output variable is still resident in memory and the producing code has not changed, the cell is fresh and gets skipped. Otherwise it is stale and queued.
  3. Run queued cells in dependency order. Parents first, then descendants, so each cell sees fresh inputs.
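
A compact sketch of that flow; the dag and kernel methods (ancestors, output_of, producer_hash, and so on) are invented stand-ins for the real interfaces:

```python
def cells_to_run(edited: str, dag, kernel) -> list[str]:
    # Step 1: affected set = ancestors + the edit itself + descendants.
    affected = dag.ancestors(edited) | {edited} | dag.descendants(edited)
    # Step 2: classify each cell against live kernel state.
    stale = {edited}
    for cell in affected - {edited}:
        var = dag.output_of(cell)
        fresh = (kernel.is_resident(var)                      # value still in memory
                 and kernel.producer_hash(var) == dag.code_hash(cell))  # same code
        if not fresh:
            stale.add(cell)
    # Step 3: dependency order, parents before descendants.
    return dag.topological_sort(stale)
```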

The short-circuit in step 2 is the single largest warehouse-cost saver in the system. Static analysis alone would happily rerun an upstream SQL cell whose dataframe is already sitting in kernel memory; checking kernel state first means we only pay the warehouse round-trip when memory has actually lost the value.

The flow is reactive by default: when any cell changes, every affected cell reruns automatically to keep the Smartbook state consistent. The user never has to figure out what to refresh by hand and never sees a half-stale Smartbook.

3. The stateful sandbox

We built our own sandbox system. The standard hosted offerings (Daytona, Vercel Sandbox, and similar) get you isolation and a process to run code in, but they do not track which cell produced which variable, do not expose a per-kernel persistent volume that survives pod recycle, and do not give us a result cache surface we can manage in lockstep with the dependency resolver.

Kernel isolation

Every user session with a Smartbook gets a dedicated kernel.

Cell-level variable provenance

Every variable in the kernel carries metadata: name, the cell that produced it, and a timestamp. Tagging happens at mutation, not just at definition, so provenance survives in-place changes (df.append(...) and friends).

The dependency resolver consumes this metadata to decide whether a variable is still fresh.
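
A minimal sketch of the metadata shape, stamped on definition and on mutation alike; the field names are illustrative:

```python
import time
from dataclasses import dataclass

@dataclass
class Provenance:
    variable: str
    produced_by_cell: str   # cell uuid
    produced_at: float      # unix timestamp

def stamp(meta: dict[str, Provenance], variable: str, cell_uuid: str) -> None:
    # Called on definition AND on in-place mutation (df.append and
    # friends), so provenance follows the value through edits an
    # assignment-only view would miss.
    meta[variable] = Provenance(variable, cell_uuid, time.time())
```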

Result cache

The kernel runs a 2-layer result cache for warehouse data:

| Layer | Keyed on | Purpose |
|---|---|---|
| Kernel in-memory | variable name | near-zero-cost reuse while the kernel is alive (across chats and turns) |
| On-disk DuckDB | variable name + cell_uuid | survives pod recycles via the persistent volume |

The dependency resolver invalidates both layers coherently.
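
A sketch of the 2-layer shape using the duckdb Python package; the keying follows the table above, but the table naming and class layout are invented for illustration:

```python
import duckdb
import pandas as pd

class ResultCache:
    # Layer 1: kernel memory, keyed on variable name.
    # Layer 2: a DuckDB file on the persistent volume, keyed on
    # variable name + cell_uuid, so it survives pod recycles.

    def __init__(self, db_path: str):
        self.mem: dict[str, pd.DataFrame] = {}
        self.disk = duckdb.connect(db_path)

    def _key(self, name: str, cell_uuid: str) -> str:
        return f"{name}__{cell_uuid}".replace("-", "_")

    def put(self, name: str, cell_uuid: str, df: pd.DataFrame) -> None:
        self.mem[name] = df
        self.disk.register("incoming", df)
        self.disk.execute(
            f'CREATE OR REPLACE TABLE "{self._key(name, cell_uuid)}" '
            "AS SELECT * FROM incoming")
        self.disk.unregister("incoming")

    def get(self, name: str, cell_uuid: str):
        if name in self.mem:   # layer 1 hit: no warehouse, no disk
            return self.mem[name]
        try:                   # layer 2 hit: pod was recycled, volume was not
            return self.disk.execute(
                f'SELECT * FROM "{self._key(name, cell_uuid)}"').df()
        except duckdb.CatalogException:
            return None        # miss on both layers: caller reruns the cell

    def invalidate(self, name: str, cell_uuid: str) -> None:
        # Called by the dependency resolver so both layers drop together.
        self.mem.pop(name, None)
        self.disk.execute(f'DROP TABLE IF EXISTS "{self._key(name, cell_uuid)}"')
```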

Warm-pool orchestration with persistent recovery

The orchestrator pulls kernels from a standby pool of pre-started pods rather than provisioning fresh, so kernel boot itself is sub-second. Each kernel owns a persistent volume keyed by kernel id, so when a pod is recycled (node drain, OOM, deploy) the replacement remounts the same volume and resumes with the user's files and DuckDB cache already on disk. First-cell latency in practice is dominated by what the volume needs: a recycled pod with the volume already populated lands the first cell almost immediately, while a cold start has to pull files into the volume from object storage and pays for it.
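
As a sketch, acquisition looks roughly like this, with standby_pool and volume_store standing in for the real orchestrator clients:

```python
def acquire_kernel(standby_pool, volume_store, kernel_id: str):
    # Boot cost is already paid: the pod was started before any user
    # asked for it. The persistent volume is keyed by kernel id, so a
    # recycled kernel gets its files and DuckDB cache back on remount.
    pod = standby_pool.pop()
    volume = volume_store.get_or_create(kernel_id)
    pod.mount(volume)
    if volume.is_empty():
        # true cold start: pull the user's files down from object storage
        volume.hydrate_from_object_storage(kernel_id)
    return pod
```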

Beyond a vanilla Jupyter kernel

The rest of what makes the kernel useful is smaller individually but adds up to why we did not just wrap a Jupyter kernel:

| Capability | Vanilla Jupyter | Fabi kernel |
|---|---|---|
| Transport | zmq message passing | HTTP + JSON, easier to load balance and observe |
| Dry-run | Not supported | Snapshot + revert: the kernel executes speculatively, captures the variable delta, rolls back the dry run |
| Output handling | Renders as-is | Auto-trims oversized HTML, SVG, and Plotly; degrades large dataframes to bounded previews |
| Secrets | Plain env vars | Secret manager with TTL caching and regex log redaction so credentials never leak through stdout |
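
A toy sketch of the snapshot + revert dry-run from the table above; a real kernel needs copy-on-write or deep snapshots to catch in-place mutation of existing objects, which this shallow version deliberately ignores:

```python
def dry_run(namespace: dict, code: str) -> dict:
    before = dict(namespace)           # shallow snapshot of the namespace
    exec(code, namespace)              # speculative execution
    delta = {k: v for k, v in namespace.items()
             if k != "__builtins__" and (k not in before or before[k] is not v)}
    namespace.clear()
    namespace.update(before)           # revert: the dry run leaves no trace
    return delta
```

For example, dry_run({"x": 1}, "y = x + 1") returns {"y": 2} and hands back the namespace containing only x, exactly as it was before the speculative run.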

How the 3 systems compose

The 3 systems lean on each other. The harness only stays cheap and correct because the dependency engine and the kernel keep stale work and warehouse round-trips out of every chat. The dependency engine only stays correct because the kernel exposes live variable provenance. The kernel is only worth maintaining because the dependency engine and the agent consume its state. Unbundle any 1 of the 3 and the other 2 lose teeth.

A concrete example. The dreaming process learns "fiscal year starts in February" from a week of thread memory and writes it to org context. The schema RAG injects it into every future chat against the same warehouse. The fact is part of the cached static context, so the LLM prompt cache stays stable across turns. The cheap-LLM swap on turn 2 onwards keeps hitting that cache. The agent never re-derives the fact, the warehouse is never queried for it, and reasoning tokens stay cheap. One nightly job cashes out across every future chat against the same warehouse.

Pull any layer out of that chain (dreaming, schema RAG, prompt cache, per-iteration routing) and it breaks. The interesting engineering was not in any single layer; it was in the contracts between them, and that is what lets each piece stay simple while the whole system pulls its weight.
