The Crucible of Intelligence
Benchmarking LLMs on the Trailing Edge of Cyber Warfare
“If you know the enemy and know yourself, you need not fear the result of a hundred battles. If you know yourself but not the enemy, for every victory gained you will also suffer a defeat.” — Sun Tzu, The Art of War
In the mythology of modern technology, artificial intelligence is often cast as an omniscient oracle—a pristine intellect capable of unraveling the complexities of cyber warfare with effortless precision. Across our industry, Large Language Models (LLMs) are being hailed as the ultimate weapon in the defender’s arsenal. Yet, in the trenches of Security Operations (SecOps), blind faith in untested systems is the prelude to ruin. To defend the modern enterprise, we must first know our tools as intimately as we know our adversaries.
Today, Athena Labs is proud to announce the latest benchmark results from AthenaBench, our industry-leading Cyber Threat Intelligence (CTI) evaluation framework (read our full methodology at arXiv:2511.01144).
AthenaBench emerges from a fundamental truth in cyber defense: true security depends on measurement. If you cannot test, verify, and deeply understand how your AI systems perform in the fog of a real-world security event, you cannot trust them to protect your infrastructure.
The Illusion of Static Knowledge
Most general-purpose LLM benchmarks evaluate models against historical, static datasets. They test an AI’s ability to recall past events, much like a historian reciting facts from a closed book. But the battlefield of cyberspace is not static.
The vast majority of high-impact cyberattacks exploit vulnerabilities disclosed within the trailing 30 to 60 days. Adversaries weaponize zero-day exploits, novel malware variants, and newly disclosed CVEs before defenses can adapt. A model might score exceptionally well on a standard benchmark simply because the answers were baked into its training data years ago.
AthenaBench shatters this illusion. It is the only benchmark that continuously incorporates up-to-date, dynamic threat data into its test criteria. By evaluating models against recent and emerging threats, AthenaBench tests true reasoning and analytical capability, proving whether an LLM can reason over novel threat intelligence or is merely regurgitating memorized history.
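To make the idea of a trailing-window evaluation concrete, here is a minimal sketch of filtering a CVE feed down to recently disclosed entries. The record layout, field names, and 60-day cutoff are illustrative assumptions, not the actual AthenaBench ingestion pipeline.

```python
from datetime import datetime, timedelta, timezone

# Illustrative only: the record format and 60-day cutoff are assumptions,
# not the actual AthenaBench data pipeline.
TRAILING_WINDOW_DAYS = 60

def in_trailing_window(published: str, now: datetime) -> bool:
    """True if an ISO-8601 publication timestamp falls inside the trailing window."""
    return now - datetime.fromisoformat(published) <= timedelta(days=TRAILING_WINDOW_DAYS)

def select_recent_cves(cve_records: list[dict], now: datetime) -> list[dict]:
    """Keep only CVEs recent enough that answers are unlikely to be memorized."""
    return [r for r in cve_records if in_trailing_window(r["published"], now)]

if __name__ == "__main__":
    now = datetime(2026, 3, 1, tzinfo=timezone.utc)
    sample = [
        {"id": "CVE-2026-0001", "published": "2026-02-20T00:00:00+00:00"},  # hypothetical recent CVE
        {"id": "CVE-2021-44228", "published": "2021-12-10T00:00:00+00:00"},  # Log4Shell, long since public
    ]
    print([r["id"] for r in select_recent_cves(sample, now)])  # ['CVE-2026-0001']
```

Only the first record survives the filter, mirroring the benchmark's emphasis on threats newer than a model's likely training cutoff.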
Order from Chaos: The Six Pillars of AthenaBench
To provide a comprehensive assessment of an LLM’s utility in a modern Security Operations Center (SOC), AthenaBench evaluates models across six critical domains:
- CKT (CTI Knowledge Test): A rigorous evaluation testing foundational knowledge of threat intelligence, attack vectors, and security principles.
- ATE (Attack Technique Enumeration): Measures the model’s ability to accurately map observed behaviors to standardized frameworks like MITRE ATT&CK (a scoring sketch follows this list).
- RCM (Root Cause Mapping): Evaluates how well a model can trace an incident back to its originating vulnerability, misconfiguration, or human error.
- RMS (Risk Mitigation Strategy): Tests the model’s ability to provide complete, actionable, and accurate defensive recommendations without hallucinating irrelevant steps.
- VSP (Vulnerability Severity Prediction): Tests the capability to accurately assess the impact and severity of a given vulnerability.
- TAA (Threat Actor Attribution): Challenges the model to analyze tactics, techniques, and procedures (TTPs) to attribute an attack to a specific threat group or APT.
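To illustrate one of these pillars, here is a minimal, hypothetical sketch of how an ATE item might be scored: the model’s free-text answer is reduced to the set of MITRE ATT&CK technique IDs it names and compared against a reference set. The regex, scoring rule, and example incident are assumptions, not the published AthenaBench grader.

```python
import re

# Hypothetical grader sketch, not the published AthenaBench implementation.
TECHNIQUE_ID = re.compile(r"\bT\d{4}(?:\.\d{3})?\b")

def extract_technique_ids(answer: str) -> set[str]:
    """Pull ATT&CK technique IDs (e.g. T1566 or T1059.001) out of free text."""
    return set(TECHNIQUE_ID.findall(answer))

def ate_accuracy(model_answer: str, reference_ids: set[str]) -> float:
    """Fraction of reference techniques the model correctly enumerated."""
    predicted = extract_technique_ids(model_answer)
    if not reference_ids:
        return 0.0
    return len(predicted & reference_ids) / len(reference_ids)

# Example: a spear-phishing incident that leads to PowerShell execution.
answer = "Initial access via spearphishing attachment (T1566.001), then T1059.001."
print(ate_accuracy(answer, {"T1566.001", "T1059.001", "T1204.002"}))  # 2 of 3 found -> 0.666...
```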
The State of the Machine Mind: Current Results
We have subjected 19 of the industry’s leading commercial and open-source models to the AthenaBench crucible. The models below are ranked by their Combined Score, which aggregates their performance across all six SecOps categories; a sketch of that aggregation follows the table.
| Model | CKT (Accuracy) | ATE (Accuracy) | RCM (Accuracy) | RMS (F1-score) | VSP (Accuracy) | TAA (Accuracy) | Combined |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Gemini-3-pro | 90.83% | 83.20% | 74.35% | 43.10 | 90.66% | 36.00% | 69.69 |
| GPT-5.2 (reasoning effort = high) | 91.00% | 75.20% | 72.85% | 35.65 | 86.07% | 42.00% | 67.12 |
| GPT-5 | 92.00% | 76.00% | 71.60% | 32.60 | 85.42% | 39.00% | 66.10 |
| Gemini-2.5-pro | 89.07% | 76.20% | 71.20% | 28.43 | 85.43% | 31.00% | 63.55 |
| GPT-5.2 (reasoning effort = none) | 89.10% | 62.20% | 71.40% | 34.15 | 87.00% | 31.00% | 62.48 |
| GPT-4o | 85.23% | 51.60% | 71.30% | 20.20 | 84.73% | 35.00% | 58.01 |
| Minerva-Llama8B | 73.33% | 47.60% | 69.00% | 41.22 | 87.56% | 19.00% | 56.28 |
| Gemini-2.5-flash | 85.13% | 51.60% | 65.05% | 13.44 | 78.50% | 30.00% | 53.95 |
| Foundation-sec-8B-Reasoning | 81.57% | 48.80% | 70.25% | 20.07 | 81.78% | 20.00% | 53.75 |
| GPT-4 | 78.67% | 35.80% | 63.05% | 15.06 | 84.72% | 31.00% | 51.38 |
| Llama 3.3-70b-Instruct | 81.37% | 30.40% | 60.00% | 11.13 | 70.14% | 26.00% | 46.51 |
| Llama 3-70b-Instruct | 78.93% | 31.60% | 56.65% | 11.08 | 63.81% | 22.00% | 44.01 |
| Llama-Primus-Merged | 76.33% | 33.80% | 56.60% | 6.56 | 71.86% | 17.00% | 43.69 |
| Qwen3-14B | 78.57% | 19.40% | 54.10% | 6.95 | 80.30% | 17.00% | 42.72 |
| Qwen2.5-14B | 77.70% | 15.40% | 56.85% | 6.89 | 72.17% | 19.00% | 41.33 |
| Qwen3-8B | 75.70% | 11.80% | 48.90% | 5.50 | 82.59% | 16.00% | 40.08 |
| Llama 3.1-8B | 71.80% | 16.40% | 42.77% | 3.61 | 74.02% | 24.00% | 38.77 |
| Qwen3-4B | 74.67% | 5.60% | 45.35% | 4.82 | 79.60% | 15.00% | 37.51 |
| Foundation-sec-8B | 25.20% | 41.00% | 43.30% | 0.79 | 59.73% | 2.00% | 28.67 |
(Data as of March 2026)
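As a reading aid, the Combined column above is consistent with an unweighted arithmetic mean of the six category scores (with the RMS F1-score already on a 0-100 scale). Below is a minimal sketch under that assumption; the exact aggregation used by AthenaBench may differ.

```python
# Assumption: Combined = unweighted mean of the six category scores.
CATEGORIES = ("CKT", "ATE", "RCM", "RMS", "VSP", "TAA")

def combined_score(scores: dict[str, float]) -> float:
    """Average the six per-category scores into a single leaderboard value."""
    return round(sum(scores[c] for c in CATEGORIES) / len(CATEGORIES), 2)

gemini_3_pro = {"CKT": 90.83, "ATE": 83.20, "RCM": 74.35,
                "RMS": 43.10, "VSP": 90.66, "TAA": 36.00}
print(combined_score(gemini_3_pro))  # 69.69, matching the leaderboard row
```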
Forging the Future of Defense
As AI continues to reshape the landscape of cyber conflict, our defense must evolve in tandem. We are constantly updating AthenaBench to reflect the shifting realities of the threat landscape, ensuring that as adversaries adapt, our metrics adapt with them.
If you are interested in learning more about the research and development happening at Athena Labs, or would like us to evaluate a model not currently on the leaderboard, we welcome the conversation.
Reach out to our team at: labs@athenasecuritygrp.com

