Research spotlight: Is intent analysis the key to unlocking more accurate LLM question answering?
In this week’s community research spotlight, we’re taking a look at the research paper, ARR: Question Answering with Large Language Models via Analyzing, Retrieving, and Reasoning (Yin et al., 2025).
What’s the big idea?
LLMs answer questions more accurately when asked to analyze the question’s intent first, then retrieve relevant information and reason step by step. Accordingly, the authors proposed and tested a new zero-shot prompting method called ARR – analyze intent, retrieve relevant information, and reason step by step.
Is it the greatest thing since sliced bread? I’m not a fan of bread, so no.
Is it a compelling alternative to Chain-of-Thought (CoT)? Perhaps. We’ll see.
ARR concepts
ARR emphasizes a trigger sentence composed of three parts:
- Analyze intent – Understand the purpose and desired outcome of the question.
- Retrieve Information – Gather the right information needed to answer the question.
- Reason Step-by-Step – Apply logical steps to reach the correct conclusion.
In comparison to Chain-of-Thought (CoT) prompting, which focuses solely on step-by-step reasoning, the authors believe that explicit intent analysis allows for a more nuanced understanding of questions. This isn’t surprising: even when instructed to think step by step, LLMs can reach premature conclusions and then attempt to justify them – and not understanding the intent may explain why.
ARR experiments
The authors measured the effectiveness of ARR by testing it on 10 standard QA datasets spanning reading comprehension, commonsense reasoning, world knowledge, and multitask understanding. To perform the tests, they used the following prompt template and trigger sentences.
Prompt template
Passage: {source}
Question: {question}
Answer: {trigger}
Trigger sentences
Prompt | Trigger sentence |
---|---|
Base | Answer: (blank trigger) |
CoT | Answer: Let’s think step by step. |
ARR | Answer: Let’s analyze the intent of the question, find relevant information, and answer the question with step-by-step reasoning. |
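To make the setup concrete, here’s a minimal sketch of how the template and trigger sentences combine into the three prompt variants. This is my own illustration, not the authors’ code; the function and dictionary names are made up.

```python
# Sketch: assemble the Base, CoT, and ARR prompts from the paper's template
# and trigger sentences quoted above (names here are illustrative).

TEMPLATE = "Passage: {source}\nQuestion: {question}\nAnswer:{trigger}"

TRIGGERS = {
    "base": "",  # blank trigger: the prompt simply ends with "Answer:"
    "cot": " Let's think step by step.",
    "arr": (
        " Let's analyze the intent of the question, find relevant information,"
        " and answer the question with step-by-step reasoning."
    ),
}

def build_prompt(source: str, question: str, method: str = "arr") -> str:
    """Fill the QA template with a passage, a question, and the chosen trigger."""
    return TEMPLATE.format(source=source, question=question, trigger=TRIGGERS[method])

print(build_prompt("(passage text)", "(question and options)", method="cot"))
# Passage: (passage text)
# Question: (question and options)
# Answer: Let's think step by step.
```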
The results revealed consistent improvements over Base and CoT prompting; the table below shows results obtained with Llama-3.1-8B-Instruct.
Performance comparison
Type | Dataset | Base | CoT | ARR |
---|---|---|---|---|
RC | BoolQ | 84.16 | 84.65 | 86.33 |
RC | LogiQA | 35.79 | 38.10 | 39.02 |
CR | CSQA | 72.97 | 73.71 | 74.94 |
CR | SIQA | 69.55 | 68.12 | 70.98 |
WK | SciQ | 85.90 | 93.70 | 94.40 |
WK | OBQA | 72.20 | 78.20 | 80.00 |
WK | ARC | 82.59 | 84.31 | 84.84 |
MT | BBH | 50.19 | 58.40 | 59.01 |
MT | MMLU | 60.68 | 32.08 | 63.51 |
MT | MMLU-Pro | 38.75 | 40.10 | 42.72 |
RC – reading comprehension
CR – commonsense reasoning
WK – world knowledge
MT – multitask understanding
Testing
Datasets
BoolQ, LogiQA, CSQA, SIQA, SciQ, OBQA, ARC, BBH, MMLU, MMLU-Pro
Llama models
Llama-3.2-1B-Instruct, Llama-3.2-3B-Instruct and Llama-3.1-8B-Instruct
Other models
Qwen2.5-7B-Instruct, Gemma-7B-IT and Mistral-7B-Instruct-v0.3
ARR consistently outperformed CoT, showcasing particularly noticeable gains in datasets requiring world knowledge (SciQ, OBQA) and multitask understanding (MMLU-Pro). The findings underscore the value of intent analysis in demanding and diverse contexts.
It certainly appears that ARR elicits more accurate responses than CoT.
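If you want to sanity-check that claim on your own model, a rough harness like the one below will do. Note that this is my own simplification – a naive letter-extraction heuristic and a placeholder `ask_llm` function – not the paper’s evaluation protocol, which may score options differently.

```python
import re

def extract_choice(completion: str) -> str | None:
    """Pull the first option letter (A-D) the completion commits to, e.g. '(A)'."""
    match = re.search(r"\(?([A-D])\)", completion)
    return match.group(1) if match else None

def accuracy(examples: list[dict], method: str, ask_llm) -> float:
    """Score one trigger variant over examples shaped like
    {"source": ..., "question": ..., "answer": "A"}.

    Reuses build_prompt from the earlier sketch; ask_llm is a placeholder for
    whatever completion call you use (OpenAI, Ollama, vLLM, ...).
    """
    correct = 0
    for ex in examples:
        completion = ask_llm(build_prompt(ex["source"], ex["question"], method))
        correct += int(extract_choice(completion) == ex["answer"])
    return correct / len(examples)

# Example comparison across the three triggers:
# for method in ("base", "cot", "arr"):
#     print(method, accuracy(dev_examples, method, ask_llm))
```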
ARR sample question
I particularly liked this example:
A scientist maps a long region in which earthquakes originate and determines this region is a transform plate boundary. Which evidence would cause the scientist to reevaluate this determination?
(A) Volcanism also characterizes the region. (B) Earthquake centers in the region occur at shallow depths. (C) The region shows extensive faulting of sediments. (D) Equal crust densities are found on opposite sides of the region
With CoT
Answer: Let’s think step by step.
Transform plate boundaries are characterized by horizontal motion of the Earth’s crust, which is evidenced by faulting.
Volcanism is characteristic of divergent and convergent plate boundaries, not transform plate boundaries. Therefore, (A) is incorrect.
Earthquakes at transform plate boundaries can occur at any depth, so (B) is incorrect.
(C) is correct because faulting is a characteristic of transform plate boundaries.
(D) is incorrect because transform plate boundaries are characterized by horizontal motion of the Earth’s crust, not by equal crust densities.
With ARR
Answer: Let’s analyze the intent of the question, find relevant information, and answer the question with step-by-step reasoning.
The question asks for evidence that would cause a scientist to reevaluate the determination that a region is a transform plate boundary.
A transform plate boundary is a region where two plates slide past one another horizontally. The evidence that would cause a scientist to reevaluate this determination would be evidence that is characteristic of a different type of plate boundary.
A divergent plate boundary is a region where two plates move apart from one another, and a convergent plate boundary is a region where two plates move toward one another. Volcanism is characteristic of a divergent plate boundary, so (A) is correct.
With CoT, the LLM jumped to the wrong conclusion and then tried to justify it. With ARR, it correctly stated the intent of the question and used that to reach the correct answer, (A).
ARR component analysis
I found the ablation study particularly interesting: the authors measured the impact of each individual component (analyze intent, retrieve information, reason step by step) on its own.
First, let’s take a look at the trigger sentences:
Prompt | Trigger sentence |
---|---|
ARR | Answer: Let’s analyze the intent of the question, find relevant information, and answer the question with step-by-step reasoning. |
Analyze | Answer: Let’s analyze the intent of the question, and answer the question. |
Retrieve | Answer: Let’s find relevant information, and answer the question. |
Reason | Answer: Let’s answer the question with step-by-step reasoning. |
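Reproducing the component analysis only requires swapping in the reduced trigger sentences. Continuing the earlier sketch (illustrative keys, not the authors’ code):

```python
# Reduced trigger sentences for the component ablation; adding them to TRIGGERS
# lets build_prompt(...) produce the single-component prompts quoted above.
TRIGGERS.update({
    "analyze": " Let's analyze the intent of the question, and answer the question.",
    "retrieve": " Let's find relevant information, and answer the question.",
    "reason": " Let's answer the question with step-by-step reasoning.",
})
```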
You would think ARR would always result in the highest accuracy, right? Wrong.
Dataset | ARR | Analyze | Retrieve | Reason |
---|---|---|---|---|
BoolQ | 86.33 | 86.09 | 85.35 | 85.87 |
LogiQA | 39.02 | 38.40 | 37.79 | 38.86 |
CSQA | 74.94 | 75.76 | 75.59 | 74.53 |
SIQA | 70.98 | 70.78 | 68.01 | 68.01 |
SciQ | 94.40 | 94.30 | 92.80 | 94.50 |
OBQA | 80.00 | 86.80 | 81.20 | 82.60 |
ARC | 84.84 | 85.83 | 85.33 | 85.03 |
BBH | 59.01 | 57.08 | 58.27 | 58.96 |
MMLU | 63.51 | 63.66 | 63.73 | 61.77 |
MMLU-Pro | 42.72 | 42.54 | 43.08 | 41.11 |
Final thoughts
While ARR achieves the highest score only 40% of the time, intent analysis appears to be the biggest contributor on questions requiring world knowledge, whereas retrieval contributes most on those requiring multitask understanding. My impression is that ARR may be a practical alternative to CoT, but it depends on the task.
As I mentioned earlier, it makes sense. If we don’t ensure LLMs fully understand the intent behind our questions, how can we expect them to answer correctly? Well, “correctly” based on our own perspective and expectations. There’s a lot of research going on regarding reasoning capabilities, but I think it’s worth examining the importance of question context too (e.g., intent).
Check out the research paper. There’s a lot more information, several clear examples, and additional measurements using various models and approaches.
I would definitely add ARR to my prompt engineering toolbox.
I started out as a developer and architect before pivoting to product/marketing. I'm still a developer at heart (and love coding for fun), but I love advocating for innovative products -- particularly to developers.
I've spent most of my time in the database space, but lately I've been going down the LLM rabbit hole.