Parsing Isn’t Neutral: Why Evaluation Choices Matter
Behind every AI benchmark is a hidden choice: how to read the model’s answers. That choice—parsing—can quietly tilt results more than the model itself.
Parsing is where we take an AI system’s raw response and extract the “answer” we use for scoring. It sounds mechanical, but as our research shows, the choice of parser can dramatically change measured accuracy.
In this post, we’ll unpack one of the most overlooked pieces of model evaluation: parsing. Along the way, we’ll explore three key questions:
- What happens when you enforce strict versus flexible parsing rules?
- How can structured outputs limit model reasoning itself?
- Why do the same models look better or worse depending on the evaluation setup?
The Setup: Same Models, Different Parsers
We ran a range of models across SnorkelGraph, a benchmark of graph reasoning problems. Each problem has a structural answer, such as a list of nodes, which must be parsed, normalized, and then passed through graph validators to check correctness.
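To make that pipeline concrete, here is a minimal sketch of the normalize-and-validate step. The function names, the adjacency-dict graph, and the path check are hypothetical stand-ins for illustration, not the benchmark’s actual validators:

```python
# Illustrative sketch of parse -> normalize -> validate (not the actual SnorkelGraph code).

def normalize_nodes(raw_nodes):
    """Strip whitespace/quotes and deduplicate while preserving order."""
    seen, cleaned = set(), []
    for node in raw_nodes:
        node = str(node).strip().strip("\"'")
        if node and node not in seen:
            seen.add(node)
            cleaned.append(node)
    return cleaned

def is_valid_path(adjacency, nodes):
    """Hypothetical validator: do consecutive nodes share an edge in the graph?"""
    if not nodes or any(n not in adjacency for n in nodes):
        return False
    return all(v in adjacency[u] for u, v in zip(nodes, nodes[1:]))

adjacency = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
answer = normalize_nodes([" A", "B", '"C"', "D"])
print(is_valid_path(adjacency, answer))  # True
```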
To prepare those answers, we tested two main parsing strategies (both sketched in the example below):
- Structured parsers: Require exact JSON outputs, parsed with tools like Pydantic.
- Unstructured parsers: Extract answers using regex rules or another LLM.
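Here is a minimal sketch of what the two strategies look like in practice. The schema, the regex, and the expected answer format are assumptions made for illustration, not the exact parsers used in the benchmark:

```python
# Illustrative sketch of structured vs. unstructured parsing (not the benchmark's parsers).
import re
from pydantic import BaseModel, ValidationError

class NodeListAnswer(BaseModel):
    """Hypothetical schema a structured parser would enforce."""
    nodes: list[str]

def parse_structured(response: str) -> list[str] | None:
    """Structured: the response must be exact JSON matching the schema."""
    try:
        return NodeListAnswer.model_validate_json(response).nodes
    except ValidationError:
        return None  # a formatting slip means no usable answer at all

def parse_unstructured(response: str) -> list[str] | None:
    """Unstructured: pull the answer out of free-form reasoning with a regex."""
    match = re.search(r"(?:final answer|answer)\s*[:=]\s*\[([^\]]*)\]", response, re.IGNORECASE)
    if not match:
        return None
    return [n.strip().strip("\"'") for n in match.group(1).split(",") if n.strip()]

print(parse_structured('{"nodes": ["A", "B", "C"]}'))           # ['A', 'B', 'C']
print(parse_unstructured("...so the final answer: [A, B, C]"))  # ['A', 'B', 'C']
```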
This design put parsing at the center, allowing us to measure its direct impact on evaluation outcomes.
Results: Parsing Changes the Score
The results, shown in Figure 1 (Parsing Methods Comparison) and Figure 2 (Model Performance Comparison), reveal three notable patterns:
- Structured formats can constrain reasoning. Models like GPT-4.1 and Grok-3 performed worse when forced into rigid JSON structures, as the requirement to maintain exact formatting limited their ability to reason fully through the task.
- Reasoning-first models held steady. Models like Claude Sonnet 4, Gemini 2.5 Pro, and o4-mini showed minimal sensitivity to the parsing method, with only slight decreases under regex parsing due to its stricter format requirements.
- Flexible parsing raised scores for weaker models. Regex and LLM parsers captured valid answers from freer-form reasoning, improving reported accuracy. GPT-4o struggled across all formats but performed best when given more flexibility with the LLM parser.
Parsing speed also varied dramatically. As shown in Figure 3 (Parsing Methods: Time Comparison), regex and JSON parsing were nearly instantaneous, LLM parsing took a few seconds, and Pydantic AI lagged far behind at nearly 30 seconds per response.
In short: the same model could look better or worse depending on how its answers were parsed.
Why Structured Outputs Can Hold Models Back
Forcing models into structured formats didn’t just affect evaluation—it actively reduced reasoning quality.
Weaker models, in particular, struggled to balance two demands at once: reasoning through the problem and conforming to schema rules. The result was often incomplete reasoning or failed answers, even before validation.
By contrast, when allowed to reason freely and have answers extracted and validated later, models performed better across the board. Structured constraints don’t just change how we measure results—they can reshape reasoning itself.
Why It’s Tricky
Parsing isn’t just a technical detail—it’s part of the evaluation. Strict parsing enforces discipline but can constrain reasoning. Flexible parsing captures more reasoning ability but risks overstating robustness by being too forgiving.
It’s a trade-off: exactness versus resilience. Both are valid, but they measure different things.
Recommendations for Practitioners
- For precision: Use regex parsing—it’s reliable, fast, and strict.
- For flexibility: Use LLM-based parsing. It better reflects reasoning, though it’s less exact (see the sketch after this list).
- Use caution with structured outputs: They can depress scores and limit reasoning for weaker models.
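As a rough illustration of the flexible option, here is what an LLM-based extraction parser might look like. The prompt, model choice, and failure handling are assumptions for the sketch, not the benchmark’s actual parser:

```python
# Illustrative sketch of an LLM-based parser; prompt and model name are assumptions.
import json
from openai import OpenAI

client = OpenAI()

EXTRACTION_PROMPT = """You are an answer extractor.
Read the model response below and return ONLY a JSON list of the node names
it gives as its final answer, e.g. ["A", "B", "C"]. Return [] if there is none.

Response:
{response}"""

def parse_with_llm(response: str, model: str = "gpt-4o-mini") -> list[str]:
    """Ask a second LLM to pull the final answer out of free-form reasoning."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": EXTRACTION_PROMPT.format(response=response)}],
        temperature=0,
    )
    try:
        return json.loads(completion.choices[0].message.content)
    except (json.JSONDecodeError, TypeError):
        return []  # the extractor itself can fail; treat that as no answer
```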
Closing Thoughts
Parsing may seem like a technical afterthought, but it shapes the story you tell about your AI. By making deliberate parsing choices—and recognizing their impact before answers even reach an evaluator—we can move from misleading metrics to evaluations we can trust.
That’s why our published SnorkelGraph benchmark uses an LLM parser with unstructured outputs. The goal isn’t to measure whether models can produce perfectly formatted JSON, but whether they can actually solve the complex spatial and mathematical reasoning problems the benchmark was designed to test.
At Snorkel AI, we pay close attention to every aspect of evaluating LLM responses, and we iteratively improve our evaluations by collaborating with our network of experts to develop rubrics of carefully chosen criteria. Be sure to take a look at our series of posts on rubric development. Get in touch with us if you have a project that needs high-quality data!
Justin Bauer is a Research Engineer at Snorkel AI, working on synthetic data, evaluation, and benchmarks. He previously interned at Google DeepMind and Tesla, focusing on reinforcement learning and sensor perception.