When I took ownership of the real-time data layer for our LLM-powered agent, my team's first instinct was to reuse our existing embedded BI models. I understood the impulse: we'd just finished migrating them to ClickHouse. They were battle-tested. But they were also the wrong foundation.

The problem

For years, customers who needed complex filtering (show me all records whose history includes X and that also have these custom properties) hit a wall in the product UI. The embedded analytics feature could handle those queries, but on a three-to-four-hour delay. When you're talking to an AI agent, you expect live data. Neither system could deliver both the complexity and the freshness we needed.

Why some BI models don't work for agents

Our BI data model is optimized for predefined query patterns. We knew which dashboards existed and which questions customers asked most often, so we designed the model to present results in a flat shape: wide reporting tables, the dominant dashboard format customers demanded.

An agent data model has to support whatever question a customer decides to ask. An agent also has one advantage a dashboard doesn't: it can run a sequence of queries, each returning a smaller, focused result set, and then combine those results in post-processing to answer the customer's question. We could still control some patterns to keep joins optimized for ClickHouse's structure and measure definitions consistent, but we didn't need big wide models and complex transformations built ahead of time.
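To make the query-decomposition idea concrete, here is a sketch in ClickHouse SQL. The table and column names (events, record_properties, and so on) are illustrative stand-ins, not our real schema:

```sql
-- Query 1: a narrow scan for records whose history includes event X.
SELECT record_id
FROM events
WHERE event_type = 'X'
GROUP BY record_id;

-- Query 2: filter those records by a custom property. The agent feeds
-- the IDs from query 1 back in, rather than forcing one wide join.
SELECT record_id, property_value
FROM record_properties
WHERE property_key = 'tier'
  AND record_id IN (/* ids collected from query 1 */);
```

Each query stays small and index-friendly, and the agent does the stitching that a pre-built wide model would otherwise have to anticipate.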

The part that gets missed is the semantic layer. A dashboard has labels and tooltips. A human analyst has institutional knowledge. An LLM agent has neither. It needs explicit context for every table and every field: what it means, how it relates to other data, when to use it and when not to. Years of data model tables with minimal documentation meant I had to build that context myself. Without it, the agent can access the data but won't consistently understand it.
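One way to anchor that context close to the data is ClickHouse's own COMMENT support on tables and columns. This is a hypothetical sketch (the names and descriptions are invented for illustration), but it shows the kind of usage guidance an agent needs spelled out:

```sql
-- Illustrative only: table and column names are not the real model.
CREATE TABLE agent.records
(
    record_id  UInt64 COMMENT 'Stable identifier; join key to events and properties',
    account_id UInt64 COMMENT 'Owning account; required for row-level access control',
    status     LowCardinality(String)
        COMMENT 'Current lifecycle state. Use for active/closed filters; do not use for historical state, query the events table instead'
)
ENGINE = MergeTree
ORDER BY (account_id, record_id)
COMMENT 'One row per record. Prefer this over raw event tables for current-state lookups';
```

The same descriptions can then be surfaced through the semantic layer so the agent sees not just field names but when each field applies.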

What I built

I designed the architecture around ClickHouse with as little transformation as possible. That meant avoiding heavy CTEs and window functions wherever we could, reasoning that anything requiring a window function could often be handled by the agent instead, in post-processing on the query results. The streaming layer uses ClickPipes, which detect new and updated records and append them in sub-second time. Downstream, I used ReplacingMergeTree tables for incremental updates: when a record changes, it arrives as an insert, and ClickHouse deduplicates in the background. This gave us near real-time data without the cost of rebuilding tables from scratch on every change.
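The ReplacingMergeTree pattern looks roughly like this (names are illustrative, not our real schema). Each change arrives as a plain insert, and ClickHouse keeps the row with the highest version per sorting key:

```sql
CREATE TABLE agent.records_latest
(
    record_id  UInt64,
    account_id UInt64,
    status     String,
    updated_at DateTime64(3)  -- version column: the newest row wins
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY (account_id, record_id);

-- Deduplication happens during background merges, so a read that must
-- see exactly one row per key uses FINAL:
SELECT status
FROM agent.records_latest FINAL
WHERE account_id = 42 AND record_id = 1001;
```

FINAL has a query-time cost, so it's worth reserving for reads that truly need merge-time semantics; many aggregate queries can tolerate the occasional not-yet-merged duplicate.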

The modeling layer is defined using dbt. The semantic layer uses Cube. I built the full context layer (field descriptions, table relationships, and usage guidance) so the agent could make sense of what was available and what it was querying.

The result: billions of rows queryable with sub-second query latency, end-to-end response times of a couple of seconds, and support for the complex filtering patterns our product platform couldn't handle.

What broke after shipping

About a month in, I found something unexpected. The downstream ReplacingMergeTree tables needed account IDs for row-level access control, but not every source table carried an account ID directly. I was joining to fetch them, and those joins, running concurrently with the upstream ClickPipe inserts, drove significant memory costs that surfaced on the ClickPipe operation rather than where I expected. I audited every expensive join, optimized what I could, and flagged the rest for source table backfills. ClickHouse rewards correct configuration and punishes incorrect configuration fast.
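The problematic shape is a known ClickHouse footgun, sketched here with invented names. A materialized view that joins on insert re-evaluates the join for every incoming block, so the memory cost lands on the ingest path (in our case, the ClickPipe operation). The dictionary lookup shown after it is one common mitigation, not necessarily what we shipped:

```sql
-- The expensive shape: this MV fires on every insert into
-- source_events, and the JOIN builds its right-hand side in memory
-- each time.
CREATE MATERIALIZED VIEW agent.events_enriched_mv
TO agent.events_enriched AS
SELECT e.record_id, e.event_type, e.ts, m.account_id
FROM source_events AS e
JOIN record_account_map AS m ON m.record_id = e.record_id;

-- One mitigation: a dictionary keeps the record-to-account mapping
-- resident once, instead of rebuilding it per insert block.
SELECT record_id,
       dictGet('record_account_dict', 'account_id', record_id) AS account_id
FROM source_events;
```

The cleaner long-term fix is the one described above: backfill account_id onto the source tables so the enrichment join disappears entirely.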

The outcome nobody planned for

Once the real-time data layer existed, product teams realized they had access to complex data capabilities that hadn't been easily available in the product before. Not just for the agent, but potentially across the platform. They'd been limited by what the system could support. Now they had instantaneous access to join patterns and filtering that had previously existed only in our analytics layer, on a multi-hour delay.

The data model I built as agent infrastructure has been integrated with platform agent infrastructure to provide even better operational capabilities for the AI agent. That wasn't the original requirement, but it's probably the most valuable outcome of the whole project.