When we migrated our data warehouse from BigQuery to ClickHouse, I expected the challenge to be in the BI migration itself: refactoring models, views, and explores in Looker to work with the new warehouse. That part was straightforward. But the problem that ended up mattering most was one I didn't see coming.

The constraint

Looker's aggregate awareness has limitations with ClickHouse that prevent certain measure types from working as expected: distinct sum, distinct average, and distinct median. For those functions, Looker simply doesn't support distinct aggregates on ClickHouse.

This isn't an edge case for us. Our customers rely heavily on distinct aggregations. A count of records that match a certain history pattern isn't useful if it's double-counting because of join fanout. A median time to reach a given state is meaningless if it's computed across duplicated rows. Getting these measures right is table stakes.
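To make the fanout problem concrete, here's a minimal sketch with made-up data: one customer row fanned out by a one-to-many join inflates a plain sum and median, while a distinct-aware aggregate (deduplicating per entity first) stays correct.

```python
from statistics import median

# Hypothetical result of a one-to-many join: customer 1's single value of 100
# is repeated once per joined event row.
joined_rows = [
    {"customer_id": 1, "value": 100, "event": "a"},
    {"customer_id": 1, "value": 100, "event": "b"},
    {"customer_id": 1, "value": 100, "event": "c"},
    {"customer_id": 2, "value": 50, "event": "a"},
]

# Naive aggregation over the joined rows triple-counts customer 1.
naive_sum = sum(r["value"] for r in joined_rows)      # 350, wrong

# Distinct-aware aggregation: deduplicate to one value per customer first.
per_customer = {r["customer_id"]: r["value"] for r in joined_rows}
distinct_sum = sum(per_customer.values())             # 150, right
distinct_median = median(per_customer.values())       # 75.0, right
```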

The solution

Once I identified the problem, I went looking for ClickHouse native functions that could handle the aggregation correctly, bypassing Looker's aggregate awareness for those specific measure types. Collaborating with AI (read my post on how to use AI as a thinking partner), I found capabilities I didn't previously know about, built the solution, tested it, and documented it. It worked. The variety and depth of ClickHouse's higher-order and array functions was key to the success: functions like arrayMap, arrayReduce, and groupUniqArray are deeply integrated and execute fast, really fast.
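The shape of that pattern can be sketched outside SQL. In ClickHouse terms it's roughly "collect the unique (entity, value) pairs into an array, map out the values, then reduce them with the aggregate you want", something like `arrayReduce('median', arrayMap(t -> t.2, groupUniqArray((id, value))))`. Here's a Python analogue of those three building blocks, using the same hypothetical fanned-out data as above (this illustrates the idea, not the exact production SQL):

```python
from statistics import median

def group_uniq_array(rows):
    # Analogue of ClickHouse groupUniqArray: unique elements of a group.
    return list(dict.fromkeys(rows))

def array_map(fn, arr):
    # Analogue of ClickHouse arrayMap.
    return [fn(x) for x in arr]

def array_reduce(agg_name, arr):
    # Analogue of ClickHouse arrayReduce for a few aggregate names.
    aggs = {"sum": sum, "avg": lambda a: sum(a) / len(a), "median": median}
    return aggs[agg_name](arr)

# Fanned-out rows as (customer_id, value) pairs; customer 1 duplicated by a join.
rows = [(1, 100), (1, 100), (1, 100), (2, 50)]

unique_pairs = group_uniq_array(rows)               # [(1, 100), (2, 50)]
values = array_map(lambda t: t[1], unique_pairs)    # [100, 50]
distinct_median = array_reduce("median", values)    # 75.0
distinct_sum = array_reduce("sum", values)          # 150
```

Deduplicating on the (entity, value) pair rather than the bare value is the important design choice: two different customers with the same value must both survive the dedup.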

The second time

A few months later, I was deep into the agent real-time data model. Completely different project, completely different tool: the semantic layer was Cube instead of Looker, with somewhat different syntax and architecture.

And then the same constraint appeared. Cube had its own version of the same limitation with distinct aggregations in ClickHouse. Same root cause, different surface.

The moment I recognized it, the solution was already in hand. I applied the same ClickHouse native functions, adapted to Cube's configuration. Done.

Why this matters

It's not because the technical solution is especially complex. It's because of what it represents.

When I solved this the first time, I could have patched it with a one-off custom CTE or view to fix the specific measures that were broken, moved on, and forgotten about it. But that would only have worked for that one situation. Instead, I built the solution around the underlying constraint (ClickHouse's handling of distinct aggregations) rather than the specific tool (Looker's aggregate awareness).

That decision cost maybe an hour of extra thinking. It saved me days when the same problem appeared in a completely different context.

This is the kind of thing that compounds over a career. Not every solution is transferable. But asking "is this a tool-specific problem or a platform-level constraint?" changes how you build things. And every time the answer is "platform-level," you've got a pattern you'll carry forever.

It's also, incidentally, the same reason we eliminated Redshift as a warehouse option: Redshift can't handle multiple orderings for distinct aggregations in the same query. You'd have to run separate queries and stitch the results together. The constraint follows you across tools. The only question is whether you've built a portable solution.