AthenaBench Results: The Premier Cyber Threat Intelligence LLM Benchmark
Read our full methodology: arXiv:2511.01144
The Need for Real-World SecOps Evaluation
As Large Language Models (LLMs) are increasingly integrated into Cybersecurity and Security Operations (SecOps), the need to accurately measure their domain-specific capabilities has never been more critical. General-purpose benchmarks fail to capture the nuanced, high-stakes reasoning required to identify, attribute, and mitigate cyber threats.
AthenaBench was developed by Athena Labs to bridge this gap. It is an industry-leading Cyber Threat Intelligence (CTI) benchmark designed specifically to test how commercial and open-source models perform under real-world SecOps conditions.
Why AthenaBench is Unique: Testing the Trailing Edge
In the real world, threat landscapes are not static. The vast majority of high-impact cyberattacks exploit vulnerabilities disclosed within the previous 30 to 60 days, including newly published CVEs and zero-day exploits. Standard LLM benchmarks evaluate models against historical, static datasets, meaning a model might score well simply because the answers were in its training data years ago.
AthenaBench is the only benchmark that includes up-to-date, dynamic data in its test criteria. By evaluating models against recent, emerging threats, AthenaBench reveals whether an LLM can genuinely reason through novel CTI or is merely regurgitating memorized, outdated intelligence.
The AthenaBench Categories
To provide a comprehensive assessment of an LLM’s utility in a Security Operations Center (SOC), AthenaBench evaluates models across six critical categories:
- CKT (CTI Knowledge Test): A rigorous multiple-choice evaluation testing foundational knowledge of threat intelligence, attack vectors, and cybersecurity principles.
- ATE (Attack Technique Enumeration): Measures the model’s ability to accurately identify and map observed behaviors to standardized frameworks like MITRE ATT&CK.
- RCM (Root Cause Mapping): Evaluates how well a model can trace an incident back to its originating vulnerability, misconfiguration, or user error.
- RMS (Risk Mitigation Strategy): Scored via F1-score (see the sketch after this list), this tests the model’s ability to provide complete, accurate, and actionable defensive recommendations without hallucinating irrelevant steps.
- VSP (Vulnerability Severity Prediction): Tests the model’s capability to accurately assess the impact and severity (e.g., CVSS equivalents) of a given exploit or vulnerability.
- TAA (Threat Actor Attribution): Challenges the model to analyze tactics, techniques, and procedures (TTPs) to attribute an attack to a specific Advanced Persistent Threat (APT) or threat group.
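The RMS F1-score balances precision (no hallucinated steps) against recall (no missing steps). Here is a minimal Python sketch of that metric; the set-based exact matching, helper name, and example mitigations are illustrative assumptions, not the actual AthenaBench scorer (see the methodology paper for the real matching logic):
```python
def f1_score(predicted: set[str], reference: set[str]) -> float:
    """F1 = harmonic mean of precision and recall over recommendation sets."""
    if not predicted or not reference:
        return 0.0
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)  # penalizes hallucinated steps
    recall = true_positives / len(reference)     # penalizes missing steps
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: two of three predicted mitigations match
# the four recommendations in the reference answer.
predicted = {"patch cve", "rotate credentials", "disable telnet"}
reference = {"patch cve", "rotate credentials", "segment network", "enable mfa"}
print(f"F1: {f1_score(predicted, reference):.2f}")  # precision 0.67, recall 0.50 -> F1 0.57
```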
Current LLM Leaderboard
Below are the curated results from our latest testing of industry-leading commercial and open-source models. The models are ranked by their Combined Score, which aggregates their performance across all six SecOps categories.
| Model | CKT (Accuracy) | ATE (Accuracy) | RCM (Accuracy) | RMS (F1-score) | VSP (Accuracy) | TAA (Accuracy) | Combined |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Gemini-3-pro | 90.83% | 83.20% | 74.35% | 43.10 | 90.66% | 36.00% | 69.69 |
| GPT-5.2 (reasoning effort = high) | 91.00% | 75.20% | 72.85% | 35.65 | 86.07% | 42.00% | 67.12 |
| GPT-5 | 92.00% | 76.00% | 71.60% | 32.60 | 85.42% | 39.00% | 66.10 |
| Gemini-2.5-pro | 89.07% | 76.20% | 71.20% | 28.43 | 85.43% | 31.00% | 63.55 |
| GPT-5.2 (reasoning effort = none) | 89.10% | 62.20% | 71.40% | 34.15 | 87.00% | 31.00% | 62.48 |
| GPT-4o | 85.23% | 51.60% | 71.30% | 20.20 | 84.73% | 35.00% | 58.01 |
| Minerva-Llama8B | 73.33% | 47.60% | 69.00% | 41.22 | 87.56% | 19.00% | 56.28 |
| Gemini-2.5-flash | 85.13% | 51.60% | 65.05% | 13.44 | 78.50% | 30.00% | 53.95 |
| Foundation-sec-8B-Reasoning | 81.57% | 48.80% | 70.25% | 20.07 | 81.78% | 20.00% | 53.75 |
| GPT-4 | 78.67% | 35.80% | 63.05% | 15.06 | 84.72% | 31.00% | 51.38 |
| Llama 3.3-70b-Instruct | 81.37% | 30.40% | 60.00% | 11.13 | 70.14% | 26.00% | 46.51 |
| Llama 3-70b-Instruct | 78.93% | 31.60% | 56.65% | 11.08 | 63.81% | 22.00% | 44.01 |
| Llama-Primus-Merged | 76.33% | 33.80% | 56.60% | 6.56 | 71.86% | 17.00% | 43.69 |
| Qwen3-14B | 78.57% | 19.40% | 54.10% | 6.95 | 80.30% | 17.00% | 42.72 |
| Qwen2.5-14B | 77.70% | 15.40% | 56.85% | 6.89 | 72.17% | 19.00% | 41.33 |
| Qwen3-8B | 75.70% | 11.80% | 48.90% | 5.50 | 82.59% | 16.00% | 40.08 |
| Llama 3.1-8B | 71.80% | 16.40% | 42.77% | 3.61 | 74.02% | 24.00% | 38.77 |
| Qwen3-4B | 74.67% | 5.60% | 45.35% | 4.82 | 79.60% | 15.00% | 37.51 |
| Foundation-sec-8B | 25.20% | 41.00% | 43.30% | 0.79 | 59.73% | 2.00% | 28.67 |
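For readers who want to sanity-check the ranking: the Combined column is consistent with a simple unweighted mean of the six category scores, with the RMS F1 already on a 0 to 100 scale. A quick Python check against the top row:
```python
# Assumption (verifiable from the table): Combined = unweighted mean
# of the six category scores, with RMS F1 on a 0-100 scale.
gemini_3_pro_scores = {
    "CKT": 90.83, "ATE": 83.20, "RCM": 74.35,
    "RMS": 43.10, "VSP": 90.66, "TAA": 36.00,
}
combined = sum(gemini_3_pro_scores.values()) / len(gemini_3_pro_scores)
print(f"Combined: {combined:.2f}")  # -> Combined: 69.69, matching the table
```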
Get Involved with Athena Labs
We are constantly updating AthenaBench to reflect the ever-evolving threat landscape and testing new models as they hit the market.
To learn more about the research and development work we are doing at Athena Labs, or to request that we test a model not yet on the leaderboard, please reach out to our team.
Email us at: labs@athenasecuritygrp.com
