K2-Think (different from Kimi-K2!) is a reasoning LLM released a few days ago that, as described in its paper, claims performance on par with GPT-OSS 120B and DeepSeek v3.1 with fewer parameters. It received significant attention online, with several news articles published on the topic (Wired, Forbes, CNBC, etc.). However, as we discuss below, the reported gains are overstated, relying on a flawed evaluation marked by data contamination, unfair comparisons, and misrepresentation of both its own and competing models’ results. Instead, K2-Think performs slightly worse than other models of similar size, such as GPT-OSS 20B, Nemotron-32B, or Qwen3 30B 2507.

Evaluation is invalid due to data contamination

We find clear evidence of data contamination.

For math, both SFT and RL datasets used by K2-Think include the DeepScaleR dataset, which in turn includes Omni-Math problems. As K2-Think uses Omni-Math for its evaluation, this suggests contamination. We confirm this using approximate string matching, finding that at least 87 of the 173 Omni-Math problems that K2-Think uses in evaluation were also included in its training data. Interestingly, there is a large overlap between the creators of the RL dataset, Guru, and the authors of K2-Think, who should have been fully aware of this.
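A check of this kind takes only a few lines of approximate string matching. The sketch below is illustrative rather than our exact procedure: the dataset loading, field names, and the 0.9 similarity threshold are assumptions.

```python
# Minimal sketch of an overlap check between an evaluation set and a training
# corpus via approximate string matching. Loading, field names, and threshold
# are illustrative assumptions, not the exact procedure used for our analysis.
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not hide matches."""
    return " ".join(text.lower().split())


def is_near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Flag two problem statements as near-duplicates if their similarity ratio is high."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def count_contaminated(eval_problems: list[str], train_problems: list[str]) -> int:
    """Count evaluation problems that closely match at least one training problem."""
    return sum(
        any(is_near_duplicate(e, t) for t in train_problems)
        for e in eval_problems
    )
```

In practice one would prefilter candidate pairs (e.g. by shared rare tokens) to avoid the quadratic all-pairs comparison, but the principle is the same.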

We find a similar issue in the LiveCodeBench evaluation: around 22% of the samples used in K2-Think’s evaluation also appear in its SFT dataset. The original authors of the SFT dataset (AM-Team) do include a decontamination step, but it only removes problems from October 2024 onward. K2-Think’s LiveCodeBench evaluation, however, uses all problems from July 2024 onward, so the problems from July through September 2024, about 22% of the evaluation set, were also seen during training.
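The mismatch between the two date windows is easy to state in code. The sketch below is illustrative; the `release_date` field on the problem records is an assumed name.

```python
# Minimal sketch of the date-window mismatch: problems released between the
# evaluation start (July 2024) and the training decontamination cutoff
# (October 2024) can end up in both training and evaluation.
from datetime import date

EVAL_START = date(2024, 7, 1)        # evaluation uses LiveCodeBench problems from here onward
DECONTAM_CUTOFF = date(2024, 10, 1)  # SFT decontamination only removes problems from here onward


def potentially_contaminated(problem: dict) -> bool:
    """A problem dated in [EVAL_START, DECONTAM_CUTOFF) can appear in both splits."""
    released = date.fromisoformat(problem["release_date"])  # assumed field name
    return EVAL_START <= released < DECONTAM_CUTOFF
```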

The net effect is that the evaluation results on mathematics and code are invalid.

Unfair comparisons with best-of-n and external model use

The paper’s main results table reports K2-Think’s best-of-3 performance, a well-known method for boosting scores, while all other models are evaluated best-of-1, placing them at a significant disadvantage. To make matters worse, the best-of-3 judgment is made by an unspecified “external model” of undisclosed and potentially arbitrary size. This same external model is also used to provide K2-Think with detailed problem-solving plans. The authors define this entire pipeline as “K2-Think,” with the model itself being only one component, undermining the claim that K2-Think relies only on a small 32B-parameter model; a sketch of such a pipeline is shown below.
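Concretely, the setup amounts to something like the following. This is our own illustrative reading, not the authors’ released code: the model names, prompts, and judging criterion are placeholders, and the external model’s identity and size are exactly what the paper leaves unspecified.

```python
# Illustrative sketch of a plan + best-of-3 pipeline wrapped around a base model,
# using an OpenAI-compatible client. Model names and prompts are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint serving both models


def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content


def solve(problem: str, base_model: str = "k2-think-32b", helper: str = "external-model") -> str:
    # 1. An external model drafts a problem-solving plan before the base model runs.
    plan = ask(helper, f"Outline a step-by-step plan for solving:\n{problem}")

    # 2. The base model produces three independent candidate solutions (best-of-3).
    candidates = [
        ask(base_model, f"Plan:\n{plan}\n\nSolve the problem:\n{problem}")
        for _ in range(3)
    ]

    # 3. The same external model judges which candidate to report.
    #    (A real implementation would parse the judge's answer more robustly.)
    joined = "\n\n---\n\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
    pick = ask(helper, f"Return only the index of the best solution:\n{joined}")
    return candidates[int(pick.strip())]
```

Because steps 1 and 3 can wrap any base model, the measured gains belong to the pipeline, not to the 32B model alone.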

Comparing this pipeline to other models run without it, as done in the paper, is invalid: the pipeline could just as easily be applied to other models and would similarly increase their scores. Without the external help, K2-Think’s performance is worse than that of Nemotron 32B, a similarly sized model trained with a similar methodology on Qwen2.5 32B and released in July.

| Model | AIME 2024 | AIME 2025 | HMMT 2025 |
|---|---|---|---|
| K2-Think | 86.26 | 77.72 | 66.46 |
| Nemotron 32B | 87.09 | 82.71 | 67.29 |
| Qwen3 30B (July)* | - | 85.00 | 71.40 |

Table 1: Performance comparison of K2-Think without external help, Nemotron 32B (both finetunes of Qwen2.5 32B), and Qwen3 30B. Results for Qwen3 (*) are taken from its model page; all other results are taken from the K2-Think paper.

Misrepresenting results of other models

The report also fails to adequately evaluate other models. Most notably, GPT-OSS is run only with medium instead of high reasoning effort, even though high is the recommended setting for reasoning benchmarks.

Additionally, K2-Think uses outdated versions of many models. For example, even though it evaluates GPT-OSS, which was released in August, the Qwen3 models evaluated in the paper do not appear to be the latest versions, which were published in July. Specifically, for the three benchmarks that overlap between the Qwen3 releases and K2-Think (AIME 2025, HMMT 2025, GPQA-Diamond), the reported results match the older versions, which score significantly (15-20%) below the July versions.

In the table below, we compare the self-reported results of Qwen3 with the numbers reported in the K2-Think paper. The scores attributed to Qwen3-30B are far lower than expected, even when compared against the earlier non-July release.1

| Model | AIME 2025 Self-Report | AIME 2025 MathArena | AIME 2025 K2-Think | HMMT 2025 Self-Report | HMMT 2025 MathArena | HMMT 2025 K2-Think | GPQA-Diamond Self-Report | GPQA-Diamond K2-Think |
|---|---|---|---|---|---|---|---|---|
| Qwen3-30B (Think) | 70.90 | 70.00 | 58.14 (?) | 49.80 | 50.83 | 23.54 (?) | 65.80 | 58.91 (?) |
| Qwen3-30B (Think, July) | 85.00 | - | - | 71.40 | - | - | 73.40 | - |
| Qwen3-235B (Think) | 81.50 | 80.83 | 75.43 | 62.50 | 62.50 | 61.88 | 71.10 | 65.55 |
| Qwen3-235B (Think, July) | 92.30 | - | - | 83.90 | - | - | 81.10 | - |

Table 2: Comparing reported scores from Qwen3 technical report and model pages, MathArena benchmark, and K2-Think paper on AIME 2025, HMMT 2025, and GPQA-Diamond.

Giving more weight to high-scoring math benchmarks

Finally, K2-Think reports aggregate math scores using a “micro average”, weighting each of the four benchmarks (AIME24, AIME25, HMMT, OmniMath-Hard) by its respective number of problems, instead of averaging the individual benchmark scores equally. While meant to quantify the overall math ability of the model, such an average is heavily dominated by OmniMath-Hard (~66% of the total weight). Not only is this K2-Think’s strongest benchmark, but it is also directly tied to the contamination issues discussed above.
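To see how lopsided this weighting is, consider the following worked example. The problem counts are assumptions based on the 173 Omni-Math problems mentioned above and the usual 30-problem AIME/HMMT format.

```python
# Worked example of micro vs. macro averaging. Problem counts are assumptions
# (173 OmniMath-Hard problems from the contamination analysis above, 30 each
# for AIME24/AIME25/HMMT); the point is the weighting, not exact scores.
counts = {"AIME24": 30, "AIME25": 30, "HMMT25": 30, "OmniMath-Hard": 173}

total = sum(counts.values())
weights = {name: n / total for name, n in counts.items()}
print(weights)  # OmniMath-Hard gets ~0.66 of the weight (173 / 263)

# Macro average: every benchmark counts equally (weight 0.25 each).
# Micro average: every *problem* counts equally, so the largest benchmark dominates.
```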

Our own evaluation

To validate our analysis, we ran K2-Think on our MathArena benchmark in a fair comparison against other models, following the recommended hyperparameters for K2-Think: temperature 1, top-p = 0.95, and a 64,000 output-token limit. The results show that while K2-Think is a competent model, it falls well short of the performance claimed in the paper and in the popular media articles. In particular, it is far from matching DeepSeek v3.1 or GPT-OSS 120B, despite the authors’ claims to the contrary. In fact, its math capabilities are not even on par with the smaller GPT-OSS 20B model.
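For reference, the sampling setup corresponds to a request like the one below against an OpenAI-compatible endpoint; the endpoint URL and model identifier are placeholders, not our actual evaluation harness.

```python
# Minimal sketch of the sampling settings described above, via an
# OpenAI-compatible client. Endpoint and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="K2-Think",  # placeholder model identifier
    messages=[{"role": "user", "content": "<competition problem statement>"}],
    temperature=1.0,   # recommended temperature
    top_p=0.95,        # recommended nucleus sampling parameter
    max_tokens=64000,  # 64k output-token budget
)
print(response.choices[0].message.content)
```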

Conclusion

Overall, we found that the K2-Think report makes incorrect claims in several places: it evaluates on data the model was trained on, relies on an external model and additional samples for its claimed performance gains, artificially reduces the scores of competing models, and reweights its own scores to claim parity or superiority.

Open models are good and we evaluate them all the time. However, flawed evaluations and exaggerated claims are not helpful. We hope the authors fix these issues in the next iteration of K2-Think and correctly present their achievements.

Footnotes

1 Since the K2-Think paper does not specify whether the thinking model was used for Qwen3-30B, it is possible that the authors evaluated the instruction-tuned variant instead. However, under that assumption, the reported numbers suddenly become implausibly high, raising further doubts about the validity of these comparisons.