Athena Security

AthenaBench Results: The Premier Cyber Threat Intelligence LLM Benchmark

Read our full methodology: arXiv:2511.01144

The Need for Real-World SecOps Evaluation

As Large Language Models (LLMs) are increasingly integrated into Cybersecurity and Security Operations (SecOps), the need to accurately measure their domain-specific capabilities has never been more critical. General-purpose benchmarks fail to capture the nuanced, high-stakes reasoning required to identify, attribute, and mitigate cyber threats.

AthenaBench was developed by Athena Labs to bridge this gap. It is an industry-leading Cyber Threat Intelligence (CTI) benchmark designed specifically to test how commercial and open-source models perform under real-world SecOps conditions.

Why AthenaBench is Unique: Testing the Trailing Edge

In the real world, threat landscapes are not static. The majority of high-impact cyberattacks exploit vulnerabilities disclosed within the trailing 30-60 days, including zero-day exploits and newly published CVEs. Standard LLM benchmarks evaluate models against historical, static datasets, which means a model might score well simply because the answers were in its training data years ago.

AthenaBench is the only benchmark that incorporates up-to-date, dynamic data into its test criteria. By evaluating models against recent, emerging threats, AthenaBench reveals whether an LLM can genuinely reason through novel CTI or is merely regurgitating memorized, outdated intelligence.

The AthenaBench Categories

To provide a comprehensive assessment of an LLM’s utility in a Security Operations Center (SOC), AthenaBench evaluates models across six distinct critical categories:

  • CKT (CTI Knowledge Test): A rigorous multiple-choice evaluation testing foundational knowledge of threat intelligence, attack vectors, and cybersecurity principles.
  • ATE (Attack Technique Enumeration): Measures the model’s ability to accurately identify and map observed behaviors to standardized frameworks like MITRE ATT&CK.
  • RCM (Root Cause Mapping): Evaluates how well a model can trace an incident back to its originating vulnerability, misconfiguration, or user error.
  • RMS (Risk Mitigation Strategy): Scored via F1-score, this tests the model’s ability to provide complete, actionable, and accurate defensive recommendations without hallucinating irrelevant steps.
  • VSP (Vulnerability Severity Prediction): Tests the model’s capability to accurately assess the impact and severity (e.g., CVSS equivalents) of a given exploit or vulnerability.
  • TAA (Threat Actor Attribution): Challenges the model to analyze tactics, techniques, and procedures (TTPs) to attribute an attack to a specific Advanced Persistent Threat (APT) or threat group.
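To illustrate how an F1-scored category such as RMS rewards complete, accurate recommendations while penalizing hallucinated ones, here is a minimal set-based sketch. The mitigation steps and gold answer are hypothetical examples, not taken from AthenaBench itself:

```python
def f1_score(predicted, gold):
    """Set-based F1: harmonic mean of precision and recall."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)       # recommended steps that are correct
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)  # penalizes hallucinated steps
    recall = tp / len(gold)          # penalizes missing steps
    return 2 * precision * recall / (precision + recall)

# Hypothetical mitigation steps for a single incident
gold = {"patch CVE", "rotate credentials", "block C2 domain"}
predicted = {"patch CVE", "rotate credentials", "reboot server"}  # one hallucinated step

print(round(f1_score(predicted, gold), 2))  # 2/3 precision, 2/3 recall -> 0.67
```

Under this kind of scoring, a model that pads its answer with irrelevant steps loses precision, and one that omits necessary steps loses recall, so only complete and accurate recommendation sets score well.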

Current LLM Leaderboard

Below are the curated results from our latest testing of industry-leading commercial and open-source models. The models are ranked by their Combined Score, which aggregates their performance across all six SecOps categories.

| Model | CKT (Accuracy) | ATE (Accuracy) | RCM (Accuracy) | RMS (F1-score) | VSP (Accuracy) | TAA (Accuracy) | Combined |
|---|---|---|---|---|---|---|---|
| Gemini-3-pro | 90.83% | 83.20% | 74.35% | 43.10 | 90.66% | 36.00% | 69.69 |
| GPT-5.2 (reasoning effort = high) | 91.00% | 75.20% | 72.85% | 35.65 | 86.07% | 42.00% | 67.12 |
| GPT-5 | 92.00% | 76.00% | 71.60% | 32.60 | 85.42% | 39.00% | 66.10 |
| Gemini-2.5-pro | 89.07% | 76.20% | 71.20% | 28.43 | 85.43% | 31.00% | 63.55 |
| GPT-5.2 (reasoning effort = none) | 89.10% | 62.20% | 71.40% | 34.15 | 87.00% | 31.00% | 62.48 |
| GPT-4o | 85.23% | 51.60% | 71.30% | 20.20 | 84.73% | 35.00% | 58.01 |
| Minerva-Llama8B | 73.33% | 47.60% | 69.00% | 41.22 | 87.56% | 19.00% | 56.28 |
| Gemini-2.5-flash | 85.13% | 51.60% | 65.05% | 13.44 | 78.50% | 30.00% | 53.95 |
| Foundation-sec-8B-Reasoning | 81.57% | 48.80% | 70.25% | 20.07 | 81.78% | 20.00% | 53.75 |
| GPT-4 | 78.67% | 35.80% | 63.05% | 15.06 | 84.72% | 31.00% | 51.38 |
| Llama 3.3-70b-Instruct | 81.37% | 30.40% | 60.00% | 11.13 | 70.14% | 26.00% | 46.51 |
| Llama 3-70b-Instruct | 78.93% | 31.60% | 56.65% | 11.08 | 63.81% | 22.00% | 44.01 |
| Llama-Primus-Merged | 76.33% | 33.80% | 56.60% | 6.56 | 71.86% | 17.00% | 43.69 |
| Qwen3-14B | 78.57% | 19.40% | 54.10% | 6.95 | 80.30% | 17.00% | 42.72 |
| Qwen2.5-14B | 77.70% | 15.40% | 56.85% | 6.89 | 72.17% | 19.00% | 41.33 |
| Qwen3-8B | 75.70% | 11.80% | 48.90% | 5.50 | 82.59% | 16.00% | 40.08 |
| Llama 3.1-8B | 71.80% | 16.40% | 42.77% | 3.61 | 74.02% | 24.00% | 38.77 |
| Qwen3-4B | 74.67% | 5.60% | 45.35% | 4.82 | 79.60% | 15.00% | 37.51 |
| Foundation-sec-8B | 25.20% | 41.00% | 43.30% | 0.79 | 59.73% | 2.00% | 28.67 |
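The Combined column is consistent with a simple arithmetic mean of the six category scores, with the RMS F1 value treated on the same 0-100 scale as the accuracy percentages. Checking the top row as a worked example:

```python
# Gemini-3-pro leaderboard scores (RMS F1 is already on a 0-100 scale)
scores = {"CKT": 90.83, "ATE": 83.20, "RCM": 74.35,
          "RMS": 43.10, "VSP": 90.66, "TAA": 36.00}

combined = sum(scores.values()) / len(scores)
print(round(combined, 2))  # matches the published Combined score of 69.69
```

The same unweighted mean reproduces the other rows as well, e.g. GPT-5's six scores average to 66.10.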

Get Involved with Athena Labs

We are constantly updating AthenaBench to reflect the ever-evolving threat landscape and testing new models as they hit the market.

If you want to know more about the research and development work we are doing at Athena Labs, or if you have a request for us to test an additional model that is not included in the current leaderboard, please reach out to our team.

Email us at: labs@athenasecuritygrp.com


© Copyright - Athena Software Group, Inc. 2026