Christopher Sniffen recently sat down with Rezaur Rahman — CIO / CISO / CAIO at the Advisory Council on Historic Preservation — for a conversation on what it actually takes to build frontier AI for federal infrastructure. They get into the limits of frontier models on geospatial reasoning, mechanistic interpretability for applied AI, the trick that makes vision models useful for satellite data, the geospatial benchmark gap, and one case where a frontier model returned a wildly confident — and wildly wrong — threat attribution.

Transcript

Lightly edited for clarity/brevity.

Building AI-native, not bolt-on

Rezaur: So we’re working with Google Public Sector and Snorkel as key partners on building a geospatial deep-research, AI-native system. And along with that — since that wasn’t challenging enough — we’re building a world-simulation system to simulate real-world impacts for large infrastructure projects. The regulatory impacts, and how you can accelerate permits around that.

What’s interesting is that the approach we’re taking is very different from traditional AI projects. Traditionally, you take an existing software environment and stack and add AI on top of it using APIs. For us, it was important to build an AI-native system — where the data coming in, even how the information is visualized, is designed for the AI model’s interpretation first. Then we create an optimized representation of that information for the humans.

Christopher Sniffen: I want to talk about what your vision for AI-native really means, because I think it’s unique — at least compared to others I’ve spoken with. But before we do that, let’s talk about the context of the problem we’re trying to solve, which is the Section 106 process. If a large hyperscaler or a telecommunications provider wants to build a data center or put up cell towers, one of the things they have to consider is the impact on historic properties across the country and in US territories as well.

Rezaur: Right. Those could be properties on the National Register or eligible for the National Register.

Chris: As I’ve gotten to work with you on this project, I’ve realized how many different data sources are implicated in this process. I had no idea. Is there anything else we should touch on in terms of what we’re trying to do?

Rezaur: Sure. Across the federal space, we have this problem even between agencies — and in this scenario, we have to retrieve data from all the federal agencies, all the states, all the territories, and even tribes. So how do you bring all that back? Different data formats, different data standards. There’s no standard schema, there’s no data dictionary. It’s literally the Wild West.

We pull all this data back with relevant information about potential properties that might be implicated in a project. You’re looking for cultural resources. There are specific legal criteria you have to follow for projects involving federal agencies, federal funding, federal grants, or approvals.

Then once we get that information back, we want to create — like you said — a deep-research agent that also has some world-model characteristics, so we can present users with not just high-level impact but really concrete visuals.

When I started architecting this system last year — really late 2024, then early 2025 — the models were just not there yet. Around April is when I initially published my architecture for it: a pipeline to do deep research and bring all this information in. The launch of Nano Banana was a key part of it, because I saw that it could generate fairly accurate imagery consistently — and in our case, we have to generate imagery of the real world, often involving structures, buildings, locations, cities. The scale of what we’re doing is not just individual properties. It could be above ground: buildings, an entire site, a community, a town, a pipeline. You’re thinking about things across states.

Chris: Yes, multi-state projects. That presents unique challenges.

Rezaur: How do you explain that to an AI? How do you explain, “Build me the most optimal pipeline route at the lowest regulatory cost”? That’s ultimately the goal. We’re being funded for this project to accelerate permitting, and hopefully it becomes a model of how you can build AI-native systems for public sector and other uses too — for large-scale infrastructure projects at national scale.

Why one model can’t do geospatial AI

Chris: Let’s pivot to talk about what AI-native means for a system like this. Most people’s experience with AI is limited to chat interactions, maybe some image generation. What you’re talking about is inherently multimodal, very geospatial-based — and these aren’t things that are easily conveyed to a chatbot.

Rezaur: The other issue is that a single model doesn’t understand all of it. Language models are good at language. Vision models are good at images or video. There are some geospatial models — Google’s doing some work around Earth AI — but a geospatial model looking at satellite data or specific geospatial embeddings is not necessarily understanding the geometry of structures.

While building the system, I ran into this. I thought: OK, I can take the LiDAR data, the 3D tiles from Google Maps, the geospatial information, and extract accurate wireframes and geometric structures, then feed that into an optimized pipeline to Nano Banana to render. Guess what? Doesn’t work.

Chris: Most of the images I’ve seen Nano Banana produce as teasers are not photorealistic renderings of real places.

Rezaur: I’ve tested a pipeline where — before Google announced grounding Google Maps with Gemini — I built a browser plugin where I could take a specific location and say, “OK, this is the 1930s, here’s some context, research this location.” I’d have Gemini run a deep-research cycle for the location, bring all that into context, and produce an optimized prompt for Nano Banana to render. That’s done some interesting work and been pretty good.

Chris: Does it do fairly well at the grounding?

Rezaur: It does, if it has knowledge about that location or if it can retrieve knowledge from the web. I tested it both ways — historical and current. At test time, it’s retrieving information from the web, outside of the training cutoff. And it does pretty well. However, there are limitations around accuracy and consistency. With Nano Banana 2, I believe you can steer it through context with six or seven images.
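To make the shape of that pipeline concrete, here’s a minimal sketch of the two-stage flow Rezaur describes — deep research first, then an optimized render prompt. The function names are placeholders standing in for whatever Gemini and Nano Banana calls the plugin actually makes; none of them are real APIs.

```python
# Illustrative skeleton only: deep_research() and render_image() are
# hypothetical stand-ins, not real Gemini / Nano Banana API calls.

def deep_research(location: str, era: str) -> str:
    """Stage 1: have a reasoning model gather web and historical context
    for the location and return a condensed research summary."""
    raise NotImplementedError("call your deep-research model here")

def build_render_prompt(location: str, era: str, research: str) -> str:
    """Stage 2: compress the research into an image prompt grounded in the
    retrieved facts rather than the image model's priors."""
    return (
        f"Photorealistic view of {location} as it appeared in {era}. "
        f"Ground every visible detail in these research notes:\n{research}"
    )

def render_image(prompt: str, reference_images: list[bytes]) -> bytes:
    """Stage 3: call the image model, optionally steering it with a few
    reference images, as Rezaur mentions for Nano Banana 2."""
    raise NotImplementedError("call your image model here")

def generate_grounded_view(location: str, era: str, refs: list[bytes]) -> bytes:
    research = deep_research(location, era)
    prompt = build_render_prompt(location, era, research)
    return render_image(prompt, reference_images=refs)
```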

“I want more entropy, not less”

Rezaur: Since late 2024, one of my things has been: there’s a limit to how far you can push transformer models for accuracy. They’re generative by nature. What you’re generating is a probability distribution of their training data. And I actually like that. I want more entropy, not less — because if I’m trying to get to a point of discovery and intelligence, that’s going to happen through higher entropy, a higher probability-distribution exploration of the mathematical space and the geometric manifolds. If you limit it to a specific path, you lose that exploration.
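A toy example of the entropy point (not from the conversation): sampling temperature directly controls how concentrated or spread out the next-token distribution is, and therefore how much of the space the model explores.

```python
import numpy as np

def softmax(logits: np.ndarray, temperature: float) -> np.ndarray:
    z = logits / temperature
    z -= z.max()                          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy_bits(p: np.ndarray) -> float:
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

logits = np.array([4.0, 2.0, 1.0, 0.5, 0.1])   # toy next-token logits

for t in (0.2, 1.0, 2.0):
    p = softmax(logits, t)
    print(f"T={t:<3} entropy={entropy_bits(p):.2f} bits  top-token p={p.max():.2f}")
# Low temperature collapses the distribution onto one token; higher
# temperature spreads probability mass and raises entropy — the extra
# exploration Rezaur is arguing for.
```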

Chris: I’ve heard you talk about this before. Sometimes, when we are over-optimizing our context or manipulating it a lot, we’re giving more input than necessary in an effort to ground the model — to make sure it’s consistent and accurate. Maybe that’s not the right way to approach prompt optimization. Is that a fair way to put it?

Rezaur: There’s a reason you don’t hear people talk about prompt optimization or “think step by step” anymore — that’s kind of passed. Now we’re focused on harness engineering and wrapping context engineering along with it. But it’s a side effect of the model’s architecture — with these larger-parameter models, or even models distilled from the larger ones, given a task or input that’s generalizable to a degree, the output is good. So it’s usable. But you’re not really understanding how it got to that output in the distribution space — in the matrices, the weights, what’s plugging into which layers, how it’s building that manifold.

Fundamentally to me, you have to understand what’s happening inside the model. That’s why mechanistic interpretability is so important to me. That’s why a key partner for us from early on was Goodfire — to use the software that goes along with it to look inside the model and steer it in the right direction.

Inside the model: mechanistic interpretability

Rezaur: Because I’m really bothered by external steering — forcing the model using skills, even though skills are super useful. Skills, instructions, files — essentially stuffing the context with “hey, stay in this lane,” because I need to make sure I get a deterministic output. To me, I’m like, what are you doing? Why are you forcing this intelligence, which can be so creative, into this narrow probabilistic space, where you’re losing the intelligence and exploration? You’ll never really have AGI this way.

Chris: There are generally two ways you can understand a model. One is to look at the externalities — if I give it a set of inputs, I can look at the outputs, repeat the experiment, build several that represent a degree of variance, and measure the outputs. It’s like my dog — the way I understand my dog’s behavior, and get comfortable with it consistently coming when I call its name, is that I do it a lot of times in a lot of ways and observe what the dog does. But there’s another way to look at this — and that is to look at the internals. I’m not going to do that to my dog, but we can do it to a model, thankfully, with some of the stuff Goodfire has talked about. Mechanistic interpretability is something that’s been talked about a lot, but it’s not something that’s really been done.

Rezaur: Traditionally, in its short history, it’s been for alignment. You want to make sure the model isn’t doing bad things or going off the rails — or what Anthropic would want us to believe it’s doing. But applied AI is where I’ve always been thinking about it. Initially, my thinking was for cybersecurity. When I was testing the models, I didn’t think they were good enough. However, doing some activation steering, I was able to get a much better output. That got me hooked in that direction. I could see: if we can steer the internals of the models while still staying in alignment with the model prior — that being the model’s training and the manifolds built over time based on that training — we can get a much better outcome, because we’re staying within the model’s latent space and the intelligence the manifolds have developed.
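For readers unfamiliar with activation steering, here’s a generic sketch of the idea — adding a direction vector into one layer’s residual stream at inference time. It uses GPT-2 purely as a small public stand-in, and the random vector stands in for a learned feature direction; this is not the ACHP system or Goodfire’s tooling.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

layer = 6                                   # which block to steer
steer = torch.randn(model.config.n_embd)    # stand-in for a learned feature direction
steer = 4.0 * steer / steer.norm()          # scale controls steering strength

def add_steering(module, inputs, output):
    # output[0] holds the block's hidden states in the residual stream
    if isinstance(output, tuple):
        return (output[0] + steer.to(output[0].dtype),) + output[1:]
    return output + steer.to(output.dtype)

handle = model.transformer.h[layer].register_forward_hook(add_steering)
ids = tok("The security analyst concluded that", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=30, do_sample=True)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()                             # restore unsteered behavior
```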

Chris: That sheds a different light on the idea of evaluation. Evaluating a model with a standard benchmark might involve running 3,000 different scenarios through a model, observing the outputs, and comparing them to a gold standard. But it’s a little different if you’re trying to measure the internals and you can actually steer features within the model. You might run the same scenario five times with different steering of features.

Rezaur: It is a lot of work, because you want to take a lot of the density that’s in the model and get to sparsity, so you can isolate specific features you want to measure or steer.
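The density-to-sparsity step he’s describing is typically done with a sparse autoencoder trained on captured activations. A minimal, illustrative version — random tensors stand in for real residual-stream activations:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))    # overcomplete, mostly-zero code
        return self.decoder(features), features

sae = SparseAutoencoder(d_model=768, d_hidden=768 * 8)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_weight = 1e-3                                   # sparsity pressure

activations = torch.randn(4096, 768)               # stand-in for captured activations
for batch in activations.split(256):
    recon, feats = sae(batch)
    loss = ((recon - batch) ** 2).mean() + l1_weight * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, an individual latent unit (a decoder column) becomes a
# candidate feature direction you can measure — or steer, as in the hook above.
```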

I want to backtrack a little, because the levels of optimization in building AI systems are very important to me. We’re optimizing every single layer to maximize performance. When people interact with AI models, they’re just experiencing the chatbot or the image generation. They’re not seeing what’s happening behind the scenes. Even when Nano Banana is running, it’s probably not just one model — there are multiple Gemini models interacting, coming together. So there’s an orchestration happening across a mix of different models. That’s one level of system engineering and orchestration.

Externalizing memory: context, files, graphs

Rezaur: The other layer is where you’re externalizing the intelligence of the model. How are we doing that? Because the model’s training data is fixed. It’s not learning anything new. And it’s not practical to retrain every time based on whatever your need is.

Chris: Which is why the economics aren’t there.

Rezaur: Right. And it could just be a limitation of the transformer architecture — we may need a whole new architecture. So when you’re externalizing, what are you externalizing? You’re externalizing memory. That was one of the first problems early on. Like, “Oh wait, I told the model something, and next time I have to tell it all over again. Make sure you do it this way. Do my code this way. The reference files are here.” Every single time.

So we said: wait, maybe we can write these into Markdown files and load them into the context. Just keep loading up the context. We got this. Get rid of the LoRA adapters. We don’t need fine-tuning anymore. Context engineering solved everything.

Chris: Absolutely.

Rezaur: I remember reading that agent context engineering paper — I talked to Alex, the Snorkel CEO, about it too. I really liked it. But it was funny that maybe four to six months later, another big lab published a paper saying: wait, fine-tuning with LoRA adapters is amazing — and now we have easier infrastructure to do it with, and we’re getting better results than context engineering. So there’s a lot of back-and-forth — but there are just so many horizons you can optimize, and you have to pursue it to build the best possible system.
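For context on the LoRA side of that back-and-forth, here’s what attaching low-rank adapters looks like with the peft library on a small open model — generic usage, not the lab’s actual recipe or the models involved.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                         # rank of the low-rank update matrices
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],   # GPT-2's fused attention projection
    fan_in_fan_out=True,         # GPT-2 uses Conv1D-style weights
)
model = get_peft_model(base, config)
model.print_trainable_parameters()   # only a fraction of a percent is trainable

# Training then proceeds as usual (Trainer, SFT scripts, etc.); only the
# adapter weights update, which is why the economics differ so much from
# full retraining.
```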

You can only influence the intelligence of the model — which is fixed — during test time and in context. But that’s not persisting. So how do we persist? We come up with external memory systems. There are many memory-architecture approaches. Early on — late 2024, early 2025 — we were focused on building graph systems for that reason, to give persistent and referential memory that agents could go back to. But that might not be the answer. A file system might be a better way to do it, because the models know how to deal with file systems.
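A minimal sketch of the file-system flavor of memory — Markdown notes an agent can list, read, and append to. Paths and helper names are illustrative, not the ACHP system.

```python
from pathlib import Path
from datetime import datetime, timezone

MEMORY_DIR = Path("agent_memory")
MEMORY_DIR.mkdir(exist_ok=True)

def remember(topic: str, note: str) -> Path:
    """Append a timestamped note to a per-topic Markdown file."""
    path = MEMORY_DIR / f"{topic}.md"
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with path.open("a", encoding="utf-8") as f:
        f.write(f"\n## {stamp}\n{note}\n")
    return path

def recall(topic: str) -> str:
    """Return everything previously written about a topic, if anything."""
    path = MEMORY_DIR / f"{topic}.md"
    return path.read_text(encoding="utf-8") if path.exists() else ""

remember("section-106-data-sources", "State X publishes its register as a quarterly CSV export.")
print(recall("section-106-data-sources"))
```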

Chris: It’s probably not a one-size-fits-all problem. Depending on the domain, the model, how many parameters it has, how much information you’re packing into the context.

Rezaur: Absolutely. It’s kind of throwing stuff up to see what sticks. ML has always been trial and error. AI seems to be the same way.

That makes me think of Google AlphaEvolve and Co-Scientist. Those are things we’re applying to this too. If you’re exploring probability distributions and you want to get to the most optimal one, you have to run through these loops. We have auto-research out in the wild — we don’t have to wait for AlphaEvolve. You can run these auto-research loops to iteratively run hypotheses and do self-optimizations for hours.
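The auto-research loop he’s describing is, at its core, a propose-evaluate-keep-the-best cycle. A stripped-down sketch, where propose() and evaluate() are placeholders for model calls and real metrics:

```python
import random

def propose(parent: dict) -> dict:
    """Placeholder: ask a model to mutate the current best configuration."""
    child = dict(parent)
    child["temperature"] = max(0.1, parent["temperature"] + random.uniform(-0.2, 0.2))
    return child

def evaluate(candidate: dict) -> float:
    """Placeholder: run the pipeline with this configuration and score it."""
    return -abs(candidate["temperature"] - 0.7)    # toy objective

best = {"temperature": 1.5}
best_score = evaluate(best)
for _ in range(200):                               # in practice, "for hours"
    candidate = propose(best)
    score = evaluate(candidate)
    if score > best_score:
        best, best_score = candidate, score

print(best, best_score)
```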

Chris: Do you worry about grounding for those types of systems?

Rezaur: I worry about grounding on all the systems, because we’re working within a legal framework — these are very high-impact, large-scale, national-scale infrastructure projects. Mistakes aren’t good. My goal is to get to a hallucination rate of sub-1%. For our cybersecurity system, my goal initially was around a hallucination rate of 30 to 40%. The reason: I was able to provide expertise, having a security background. I’m the verifier. So there are very different trust standards for the two systems.

Chris: How do you think we’ll get there? Are there particular things you need that don’t exist yet?

Rezaur: That’s definitely the case. One thing I like to do — I read a bunch of papers and try to see how I can bring them together. And that’s why I’m excited to be on here, because I have no one to talk to about this stuff. I’m always thinking about how these papers come together, to predict what’s coming six months to a year out.

I was talking to my friend about this — CrowdStrike is architecting their agentic SOC the way we architected ours back in 2024, down to the synthetic data generation. They’re using synthetic-data-generation software from NVIDIA — via its acquisition of Gretel, who we were working with back in 2024 to generate synthetic data for cybersecurity.

Chris: What kind of synthetic data?

Rezaur: What they were specifically scrubbing was PII as it relates to IP addresses. That’s one area, but they’re also using synthetic data pairs and expert data pairs for training. I think they had about 500,000, which is a lot. But they were using Nemotron and GPT-OSS, and actually older versions. I think that architecture could be better.

Chris: 500,000 data points will probably make it a lot better.

Rezaur: Theirs was focused on converting natural-language queries into the code for CrowdStrike systems — for querying against databases populated by Falcon. Like their SIEM, essentially.
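For a sense of what those pairs look like, here’s an invented example of natural-language-to-query training data in JSONL form. The query syntax below is made up for illustration — it is not CrowdStrike’s actual query language, and these are not their pairs.

```python
import json

pairs = [
    {
        "instruction": "Show failed logins from any single IP in the last 24 hours",
        "query": "event=login_failure | window=24h | group_by=src_ip | having count > 1",
    },
    {
        "instruction": "List hosts that ran an unsigned binary this week",
        "query": "event=process_start signed=false | window=7d | distinct host",
    },
]

with open("train_pairs.jsonl", "w", encoding="utf-8") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")

# In practice you'd generate hundreds of thousands of these (synthetic plus
# expert-written), scrub PII such as real IP addresses, and use them for
# supervised fine-tuning of the query model.
```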

Chris: Why don’t we talk about the data needs you have for the systems we’re building?

Rezaur: I’m dealing with data that’s in databases — sometimes Esri, sometimes not. Sometimes it’s spreadsheets. It might be in a scanned file or not scanned. This data is living across federal and state systems. Then we have petabytes of geospatial data. How do you bring this all together and have a single model understand all this information — grounded? I mean literally grounded to the accurate geospatial point on the ground, with the context around that location, and with temporal depth. You’re not just looking at it at this point in time, like you pulled it up on Google Maps. You’re looking at it over time, how it’s changed. A single model cannot do it.
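One way to picture the grounding problem: every record from every source has to land on a common, geospatially and temporally keyed schema before any model can reason over it. The field names below are illustrative, not the ACHP schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class GroundedRecord:
    source: str          # e.g. "state-shpo-db", "federal-survey", "tribal-inventory"
    property_id: str
    latitude: float      # WGS84
    longitude: float
    observed_on: date    # when this description of the site was recorded
    description: str

records = [
    GroundedRecord("state-shpo-db", "XX-0001", 38.8895, -77.0353,
                   date(1998, 6, 1), "Masonry structure, eligible for the Register."),
    GroundedRecord("federal-survey", "XX-0001", 38.8895, -77.0353,
                   date(2021, 9, 15), "Same site; partial restoration documented."),
]

# Sorting one location's records by time gives the temporal depth Rezaur
# describes: not a snapshot, but how the site has changed.
timeline = sorted((r for r in records if r.property_id == "XX-0001"),
                  key=lambda r: r.observed_on)
```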