We’re taking a look at the research paper LLMs Can Easily Learn to Reason from Demonstrations (Li et al., 2025) in this week’s community research spotlight. It focuses on how the structure of reasoning traces impacts distillation from models such as DeepSeek R1.

What’s the big idea regarding LLM reasoning distillation?

The reasoning capabilities of powerful models such as DeepSeek R1 and QwQ-32B-Preview can be efficiently and easily distilled into smaller, open models – and when doing so, the structure of a long chain-of-thought (CoT) is more important than the details within its individual steps. In fact, the details don’t even have to be correct.

The reasoning distillation process described in the paper generates, for each problem in a dataset, a response that includes a long CoT whose reasoning steps lead to the correct solution. A smaller, open model is then fine-tuned on these problem-solution pairs – effectively transferring the reasoning capabilities of a large reasoning model such as DeepSeek R1 (the teacher) to a smaller one (the student).
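To make that loop concrete, here’s a minimal sketch in Python. Note that `teacher_generate` and `check_answer` are hypothetical placeholders standing in for the teacher model’s sampling and the correctness filtering – this isn’t the authors’ actual code.

```python
def build_distillation_dataset(problems, teacher_generate, check_answer):
    """Pair each problem with a teacher-generated long CoT that ends in a
    correct solution; traces with wrong final answers are discarded."""
    dataset = []
    for problem in problems:
        response = teacher_generate(problem)  # long CoT + final solution
        if check_answer(problem, response):   # keep only verified-correct pairs
            dataset.append({"prompt": problem, "completion": response})
    return dataset
```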

It’s worth noting the authors performed reasoning distillation with just 17,000 examples via supervised fine-tuning (SFT) and LoRA – the small dataset keeps the process data-efficient, and LoRA keeps it parameter-efficient.

We’ll dive into the two main experiments soon, but I think this chart says it all. Distilling reasoning capabilities from DeepSeek R1 to Qwen2.5-32B-Instruct resulted in a model that is more or less on par with OpenAI o1-preview, and sometimes noticeably better (e.g., on math benchmarks).

However, the key takeaway of this research paper is that the structure of a long CoT was more important than the content of its individual steps.

Long Chain-of-Thought concepts

Large reasoning models generate a long CoT by incorporating reflection, backtracking, and self-validation. This long CoT helps the model reach the correct conclusion – or, in this paper’s framing, generate a correct answer via multi-step reasoning.

One little tidbit I found interesting was a list of words and phrases that are frequent indicators of reflection, backtracking, and self-validation (a quick way to count them programmatically follows the list):

  • “Alternatively”
  • “Wait”
  • “Just to be thorough”
  • “Just to make sure”
  • “Let me just double-check”
  • “Let me try another”
  • “Let me verify”
  • “Let me check”
  • “Hmm”
  • “But”
  • “Maybe I should consider”
  • “Maybe I can consider”
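These markers are easy to count. Here’s a small heuristic of my own (not from the paper) for flagging reflection behavior in a trace; “But” is left out because it’s too common a word to be a reliable signal on its own:

```python
REFLECTION_KEYWORDS = [
    "alternatively", "wait", "just to be thorough", "just to make sure",
    "let me just double-check", "let me try another", "let me verify",
    "let me check", "hmm", "maybe i should consider", "maybe i can consider",
]

def count_reflection_markers(trace: str) -> int:
    """Count case-insensitive occurrences of reflection/backtracking phrases."""
    lowered = trace.lower()
    return sum(lowered.count(keyword) for keyword in REFLECTION_KEYWORDS)
```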

And for reference, here is the prompt they used.

“Your role as an assistant involves thoroughly exploring questions through a systematic long thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracking, and iteration to develop a well-considered thinking process.

Please structure your response into two main sections: Thought and Solution.

In the Thought section, detail your reasoning process using the specified format: <|begin of thought|> {thought with steps separated with \n\n} <|end of thought|> Each step should include detailed considerations such as analyzing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps.

In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The solution should remain a logical, accurate, concise expression style and detail necessary step needed to reach the conclusion, formatted as follows: <|begin of solution|> {final formatted, precise, and clear solution} <|end of solution|>

Now, try to solve the following question through the above guidelines:”
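Given that prompt, a response can be split back into its Thought and Solution sections by matching the literal tags. Here’s a sketch assuming the model emits the tags exactly as written above:

```python
import re

THOUGHT_RE = re.compile(r"<\|begin of thought\|>(.*?)<\|end of thought\|>", re.DOTALL)
SOLUTION_RE = re.compile(r"<\|begin of solution\|>(.*?)<\|end of solution\|>", re.DOTALL)

def split_thought_solution(response: str):
    """Extract the Thought and Solution sections from a tagged response."""
    thought = THOUGHT_RE.search(response)
    solution = SOLUTION_RE.search(response)
    return (
        thought.group(1).strip() if thought else None,
        solution.group(1).strip() if solution else None,
    )
```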

LLM reasoning distillation experiments

DeepSeek-R1 → Qwen2.5-32B-Instruct

In this experiment, the authors fine-tuned Qwen2.5-32B-Instruct with the Bespoke-Stratos-17k reasoning dataset.

It contains coding questions from APPS and TACO, math questions from NuminaMATH, and science/puzzle questions from STILL-2. These questions were paired with reasoning traces and correct solutions generated by DeepSeek R1 to create the training data.

The result is a 15.2% average improvement in accuracy with just 17k samples. Notably, their fine-tuned Qwen2.5-32B-Instruct model approaches DeepSeek R1’s accuracy on the AMC 2023 benchmark in particular.

QwQ-32B-Preview → Qwen2.5-32B-Instruct

In the second experiment, the authors curated a similar dataset, but the reasoning traces and correct solutions were generated by QwQ-32B-Preview.

This time they experimented with both SFT and LoRA, as well as two different dataset sizes (7k and 17k). Interestingly enough, fine-tuning with LoRA produced a model on par with SFT – and with OpenAI’s o1-preview. This demonstrates that reasoning distillation can be quite efficient in terms of both data and parameters. Further, this is where the authors began to realize that long CoT reasoning may not rely on knowledge, but rather on the structure of reasoning patterns.
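For reference, a LoRA fine-tuning run along these lines can be set up with Hugging Face’s peft and trl libraries. This is a minimal sketch with illustrative hyperparameters, not the authors’ exact configuration:

```python
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

peft_config = LoraConfig(
    r=16,                    # low-rank adapter dimension (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-32B-Instruct",  # the student model
    train_dataset=dataset,              # problem/CoT pairs, e.g. a datasets.Dataset
    peft_config=peft_config,            # only the small adapters are trained
    args=SFTConfig(output_dir="distilled-qwen32b"),
)
trainer.train()
```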

Incorrect reasoning traces

The most interesting finding in this research is that modifying reasoning traces to introduce errors has little impact on reasoning transfer via distillation and fine-tuning. The authors introduced errors in different places to measure the impact of a reasoning trace’s overall structure vs. the content within its individual steps – a sketch of both kinds of perturbation follows the two lists below.

Changes within reasoning steps:

  • Modified examples so the answers were wrong
  • Modified digits within reasoning steps (e.g. replaced with random numbers)
  • Removed common reasoning keywords such as “wait”

Changes to the reasoning structure:

  • Deleted random reasoning steps
  • Inserted random reasoning steps from other examples
  • Shuffled reasoning steps randomly (i.e., changed the order)
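Here’s a hedged reconstruction (not the authors’ code) of one content-level perturbation and one structure-level perturbation, assuming reasoning steps are separated by blank lines as in the prompt above:

```python
import random
import re

def randomize_digits(step: str, fraction: float = 0.67) -> str:
    """Content perturbation: replace a fraction of digits with random ones."""
    return re.sub(
        r"\d",
        lambda m: str(random.randint(0, 9)) if random.random() < fraction else m.group(),
        step,
    )

def shuffle_steps(trace: str, fraction: float = 0.67) -> str:
    """Structure perturbation: randomly reorder a fraction of the steps."""
    steps = trace.split("\n\n")
    positions = sorted(random.sample(range(len(steps)), k=int(len(steps) * fraction)))
    subset = [steps[i] for i in positions]
    random.shuffle(subset)
    for i, step in zip(positions, subset):
        steps[i] = step
    return "\n\n".join(steps)
```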

100% of examples with wrong answers? Just 3.2% lower accuracy.
67% of reasoning step digits randomized? Just 4.3% lower accuracy.
100% of reasoning keywords removed? Just 3.3% lower accuracy.

All in all, even with the reasoning traces corrupted, reasoning distillation was effective – producing a model that was within a few percentage points of baseline accuracy. Now, what happens when the reasoning structure is corrupted?

67% of reasoning steps deleted? 12.8% lower accuracy.
67% of reasoning steps randomly added? 14.3% lower accuracy.
67% of reasoning steps shuffled? 

In short, preserving the overall structure of a long CoT is the most critical aspect when fine-tuning models to improve their reasoning capabilities.

Final thoughts on LLM reasoning distillation

I found this research paper to be particularly insightful in light of the DeepSeek R1 release. We continue to see that models with exceptional reasoning capabilities are well within reach for everyone, including enterprises that may want or need to deploy open models. There are 270+ fine-tuned versions of DeepSeek R1 on Hugging Face now, and hundreds of datasets derived from it. We continue to believe that distillation is a particularly effective method for enterprises where specialized models are needed to support AI applications with specific domain, business, and use-case requirements.
If you want to learn more about LLM distillation, register for our webinar tomorrow, where I’ll go into much more detail. If you can’t attend, don’t worry – you can always watch the on-demand recording.