Research spotlight: Is intent analysis the key to unlocking more accurate LLM question answering?
In this week’s community research spotlight, we’re taking a look at the research paper, ARR: Question Answering with Large Language Models via Analyzing, Retrieving, and Reasoning (Yin et al., 2025).
What’s the big idea?
LLMs answer questions more accurately when asked to analyze the question’s intent first, then retrieve relevant information and reason step by step. Accordingly, the authors proposed and tested a new zero-shot prompting method called ARR – analyze intent, retrieve relevant information, and reason step by step.
Is it the greatest thing since sliced bread? I’m not a fan of bread, so no.
Is it a compelling alternative to Chain-of-Thought (CoT)? Perhaps. We’ll see.
ARR concepts
ARR emphasizes a trigger sentence composed of three parts:
- Analyze intent – Understand the purpose and desired outcome of the question.
- Retrieve Information – Gather the right information needed to answer the question.
- Reason Step-by-Step – Apply logical steps to reach the correct conclusion.
In comparison to Chain-of-Thought (CoT) prompting, which focuses solely on step-by-step reasoning, the authors believe that explicit intent analysis allows for a more nuanced understanding of questions. This isn’t surprising: even when instructed to think step by step, LLMs can reach premature conclusions and then attempt to justify them – and not understanding the intent may explain why.
ARR experiments
The authors measured the effectiveness of ARR by testing it on 10 standard QA datasets spanning reading comprehension, commonsense reasoning, world knowledge, and multitask understanding. To perform the tests, they used the following prompt template and trigger sentences.
Prompt template
Passage: {source}
Question: {question}
Answer: {trigger}
Trigger sentences
Prompt | Trigger sentence |
---|---|
Base | Answer: (blank trigger) |
CoT | Answer: Let’s think step by step. |
ARR | Answer: Let’s analyze the intent of the question, find relevant information, and answer the question with step-by-step reasoning. |
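To make the setup concrete, here’s a minimal sketch of how the template and trigger sentences combine into the three prompt variants. This is my own illustration, not the authors’ code; the function and dictionary names are made up.

```python
# Sketch: assemble the Base, CoT, and ARR prompts from the paper's template
# and trigger sentences quoted above (names here are illustrative).

TEMPLATE = "Passage: {source}\nQuestion: {question}\nAnswer:{trigger}"

TRIGGERS = {
    "base": "",  # blank trigger: the prompt simply ends with "Answer:"
    "cot": " Let's think step by step.",
    "arr": (
        " Let's analyze the intent of the question, find relevant information,"
        " and answer the question with step-by-step reasoning."
    ),
}

def build_prompt(source: str, question: str, method: str = "arr") -> str:
    """Fill the QA template with a passage, a question, and the chosen trigger."""
    return TEMPLATE.format(source=source, question=question, trigger=TRIGGERS[method])

print(build_prompt("(passage text)", "(question and options)", method="cot"))
# Passage: (passage text)
# Question: (question and options)
# Answer: Let's think step by step.
```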
The results revealed consistent improvements over Base and CoT prompting; the table below shows results obtained with Llama-3.1-8B-Instruct.
Performance comparison
Type | Dataset | Base | CoT | ARR |
---|---|---|---|---|
RC | BoolQ | 84.16 | 84.65 | 86.33 |
RC | LogiQA | 35.79 | 38.10 | 39.02 |
CR | CSQA | 72.97 | 73.71 | 74.94 |
CR | SIQA | 69.55 | 68.12 | 70.98 |
WK | SciQ | 85.90 | 93.70 | 94.40 |
WK | OBQA | 72.20 | 78.20 | 80.00 |
WK | ARC | 82.59 | 84.31 | 84.84 |
MT | BBH | 50.19 | 58.40 | 59.01 |
MT | MMLU | 60.68 | 32.08 | 63.51 |
MT | MMLU-Pro | 38.75 | 40.10 | 42.72 |
RC – reading comprehension
CR – commonsense reasoning
WK – world knowledge
MT – multitask understanding
Testing
Datasets
BoolQ, LogiQA, CSQA, SIQA, SciQ, OBQA, ARC, BBH, MMLU, MMLU-Pro
Llama models
Llama-3.2-1B-Instruct, Llama-3.2-3B-Instruct and Llama-3.1-8B-Instruct
Other models
Qwen2.5-7B-Instruct, Gemma-7B-IT and Mistral-7B-Instruct-v0.3
ARR consistently outperformed CoT, showcasing particularly noticeable gains in datasets requiring world knowledge (SciQ, OBQA) and multitask understanding (MMLU-Pro). The findings underscore the value of intent analysis in demanding and diverse contexts.
It certainly appears that ARR elicits more accurate responses than CoT.
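If you want to sanity-check that claim on your own model, a rough harness like the one below will do. Note that this is my own simplification – a naive letter-extraction heuristic and a placeholder `ask_llm` function – not the paper’s evaluation protocol, which may score options differently.

```python
import re

def extract_choice(completion: str) -> str | None:
    """Pull the first option letter (A-D) the completion commits to, e.g. '(A)'."""
    match = re.search(r"\(?([A-D])\)", completion)
    return match.group(1) if match else None

def accuracy(examples: list[dict], method: str, ask_llm) -> float:
    """Score one trigger variant over examples shaped like
    {"source": ..., "question": ..., "answer": "A"}.

    Reuses build_prompt from the earlier sketch; ask_llm is a placeholder for
    whatever completion call you use (OpenAI, Ollama, vLLM, ...).
    """
    correct = 0
    for ex in examples:
        completion = ask_llm(build_prompt(ex["source"], ex["question"], method))
        correct += int(extract_choice(completion) == ex["answer"])
    return correct / len(examples)

# Example comparison across the three triggers:
# for method in ("base", "cot", "arr"):
#     print(method, accuracy(dev_examples, method, ask_llm))
```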
ARR sample question
I particularly liked this example:
A scientist maps a long region in which earthquakes originate and determines this region is a transform plate boundary. Which evidence would cause the scientist to reevaluate this determination?
(A) Volcanism also characterizes the region. (B) Earthquake centers in the region occur at shallow depths. (C) The region shows extensive faulting of sediments. (D) Equal crust densities are found on opposite sides of the region
With CoT
Answer: Let’s think step by step.
Transform plate boundaries are characterized by horizontal motion of the Earth’s crust, which is evidenced by faulting.
Volcanism is characteristic of divergent and convergent plate boundaries, not transform plate boundaries. Therefore, (A) is incorrect.
Earthquakes at transform plate boundaries can occur at any depth, so (B) is incorrect.
(C) is correct because faulting is a characteristic of transform plate boundaries.
(D) is incorrect because transform plate boundaries are characterized by horizontal motion of the Earth’s crust, not by equal crust densities.
With ARR
Answer: Let’s analyze the intent of the question, find relevant information, and answer the question with step-by-step reasoning.
The question asks for evidence that would cause a scientist to reevaluate the determination that a region is a transform plate boundary.
A transform plate boundary is a region where two plates slide past one another horizontally. The evidence that would cause a scientist to reevaluate this determination would be evidence that is characteristic of a different type of plate boundary.
A divergent plate boundary is a region where two plates move apart from one another, and a convergent plate boundary is a region where two plates move toward one another. Volcanism is characteristic of a divergent plate boundary, so (A) is correct.
With CoT, the LLM jumped to the wrong conclusion and then tried to justify it. With ARR, it correctly stated the intent of the question and used that to reach the correct answer, (A).
ARR component analysis
I found the ablation study particularly interesting: the authors measured the impact of each individual component (analyze intent, retrieve information, reason step by step) on its own.
First, let’s take a look at the trigger sentences:
Prompt | Trigger sentence |
---|---|
ARR | Answer: Let’s analyze the intent of the question, find relevant information, and answer the question with step-by-step reasoning. |
Analyze | Answer: Let’s analyze the intent of the question, and answer the question. |
Retrieve | Answer: Let’s find relevant information, and answer the question. |
Reason | Answer: Let’s answer the question with step-by-step reasoning. |
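Reproducing the component analysis only requires swapping in the reduced trigger sentences. Continuing the earlier sketch (illustrative keys, not the authors’ code):

```python
# Reduced trigger sentences for the component ablation; adding them to TRIGGERS
# lets build_prompt(...) produce the single-component prompts quoted above.
TRIGGERS.update({
    "analyze": " Let's analyze the intent of the question, and answer the question.",
    "retrieve": " Let's find relevant information, and answer the question.",
    "reason": " Let's answer the question with step-by-step reasoning.",
})
```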
You would think ARR would always result in the highest accuracy, right? Wrong.
Dataset | ARR | Analyze | Retrieve | Reason |
---|---|---|---|---|
BoolQ | 86.33 | 86.09 | 85.35 | 85.87 |
LogiQA | 39.02 | 38.40 | 37.79 | 38.86 |
CSQA | 74.94 | 75.76 | 75.59 | 74.53 |
SIQA | 70.98 | 70.78 | 68.01 | 68.01 |
SciQ | 94.40 | 94.30 | 92.80 | 94.50 |
OBQA | 80.00 | 86.80 | 81.20 | 82.60 |
ARC | 84.84 | 85.83 | 85.33 | 85.03 |
BBH | 59.01 | 57.08 | 58.27 | 58.96 |
MMLU | 63.51 | 63.66 | 63.73 | 61.77 |
MMLU-Pro | 42.72 | 42.54 | 43.08 | 41.11 |
Final thoughts
While ARR achieves the highest score only 40% of the time, intent analysis appears to be the biggest contributor on questions requiring world knowledge, whereas retrieval contributes most on those requiring multitask understanding. My impression is that ARR may be a practical alternative to CoT, but it depends on the task.
As I mentioned earlier, it makes sense. If we don’t ensure LLMs fully understand the intent behind our questions, how can we expect them to answer correctly? Well, “correctly” based on our own perspective and expectations. There’s a lot of research going on regarding reasoning capabilities, but I think it’s worth examining the importance of question context too (e.g., intent).
Check out the research paper. There’s a lot more information, several clear examples, and additional measurements using various models and approaches.
I would definitely add ARR to my prompt engineering toolbox.
I started out as a developer and architect before pivoting to product/marketing. I'm still a developer at heart (and love coding for fun), but I love advocating for innovative products -- particularly to developers.
I've spent most of my time in the database space, but lately I've been going down the LLM rabbit hole.