Athena Publications
Explore Athena’s published technical papers, articles, and AI-driven security insights from Athena Labs to stay informed on cybersecurity innovation.
Word Embeddings and Semantic Spaces in Natural Language Processing
Worth, P. J. (2023). International Journal of Intelligence Science, 13, 1-21.
Abstract
One of the critical hurdles, and breakthroughs, in the field of Natural Language Processing (NLP) over the last two decades has been the development of text representation techniques that address the so-called curse of dimensionality, a problem that plagues NLP generally because the feature set for learning begins as a function of the size of the language in question, typically upwards of hundreds of thousands of terms. Much of the research and development in NLP over this period has accordingly focused on finding and optimizing solutions to this problem, effectively on feature selection for NLP. This paper traces the development of these techniques, which leverage a variety of statistical methods resting on linguistic theories advanced in the middle of the last century, chiefly the distributional hypothesis, which holds that words found in similar contexts generally have similar meanings. In this survey paper we examine the development of some of the most popular of these techniques from both a mathematical and a data structure perspective, from Latent Semantic Analysis to Vector Space Models to their more modern variants, typically referred to as word embeddings. In reviewing algorithms such as Word2Vec, GloVe, ELMo, and BERT, we explore the idea of semantic spaces more generally, beyond their applicability to NLP.
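The core idea the abstract describes, that words occurring in similar contexts receive nearby vectors in a semantic space, can be illustrated with a minimal sketch. The vectors below are toy values invented for illustration; real models such as Word2Vec or GloVe learn vectors of hundreds of dimensions from corpus statistics.

```python
import math

# Toy 4-dimensional "embeddings" (hypothetical values for illustration only);
# trained models like Word2Vec or GloVe learn these from co-occurrence data.
vectors = {
    "king":  [0.8, 0.6, 0.1, 0.0],
    "queen": [0.7, 0.7, 0.2, 0.0],
    "apple": [0.0, 0.1, 0.9, 0.8],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: closer to 1.0 means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Per the distributional hypothesis, words used in similar contexts
# end up close together in the semantic space.
print(cosine_similarity(vectors["king"], vectors["queen"]))  # relatively high
print(cosine_similarity(vectors["king"], vectors["apple"]))  # relatively low
```

Cosine similarity is the standard distance measure in such spaces because it compares vector direction rather than magnitude.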
KEYWORDS
Natural Language Processing, Vector Space Models, Semantic Spaces, Word Embeddings, Representation Learning, Text Vectorization, Machine Learning, Deep Learning
Layer Cake: On Language Representation and Compute Characteristics in Text Classification
Abstract
Since transformer-based language models were introduced in 2017, they have proven extraordinarily effective across a variety of NLP tasks, including but not limited to language generation. The introduction and widespread adoption of these LLMs, which encode extremely high-dimensional semantic spaces, comes at a significant cost in system and computational resource requirements, requirements that have reshaped the entire chip (GPU) and data center industry as hardware, cloud, and infrastructure providers try to keep up with demand. This has motivated the research community to develop a variety of design strategies that optimize the use of these resources; nevertheless, computational requirements continue to grow in proportion to model size and complexity. In this study, we introduce Layer Cake, a framework for precisely measuring the relative computational resource requirements of text classification using a variety of classifiers from across the Machine Learning (ML) and Deep Learning (DL) landscape, leveraging different language model families and focusing on the forms of language representation used in different test scenarios. We find that while LLMs do yield the best results across classifiers on average, these improvements come at significant computational overhead. From a Macro-F1 score perspective, LLM-based classifiers outperform their static embedding language model counterparts (Word2Vec, FastText, and GloVe) by 8.87% on average, even when the latter are encapsulated in DL architectures such as Convolutional Neural Networks or Long Short-Term Memory (LSTM) networks, and perform 12.73% better than ML classifiers such as Support Vector Machines and Logistic Regression models.
However, this uptick in model performance comes at a computational overhead cost of 4398.07% relative to the GPU requirements of static word embedding DL classifiers, and a 4126.02% increase in computation time relative to ML classifiers, the latter of which are CPU-bound rather than GPU-bound.
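The comparisons above are reported in terms of Macro-F1, which averages per-class F1 scores without weighting by class frequency, so rare and common classes count equally. A minimal sketch of the metric (not the paper's evaluation code) might look like this:

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: the unweighted mean of per-class F1 scores."""
    labels = set(y_true) | set(y_pred)
    f1_scores = []
    for c in labels:
        # Per-class counts of true positives, false positives, false negatives.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)

# Class "a": precision 1.0, recall 0.5 -> F1 = 2/3; class "b": precision 2/3, recall 1.0 -> F1 = 0.8
print(macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"]))
```

In practice, library implementations such as scikit-learn's `f1_score(..., average="macro")` compute the same quantity.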
AthenaBench: A Dynamic Benchmark for Evaluating LLMs in Cyber Threat Intelligence
Md Tanvirul Alam, Dipkamal Bhusal, Salman Ahmad, Nidhi Rastogi, Peter Worth
Abstract
Large Language Models (LLMs) have demonstrated strong capabilities in natural language reasoning, yet their application to Cyber Threat Intelligence (CTI) remains limited. CTI analysis involves distilling large volumes of unstructured reports into actionable knowledge, a process where LLMs could substantially reduce analyst workload. CTIBench introduced a comprehensive benchmark for evaluating LLMs across multiple CTI tasks. In this work, we extend CTIBench by developing AthenaBench, an enhanced benchmark that includes an improved dataset creation pipeline, duplicate removal, refined evaluation metrics, and a new task focused on risk mitigation strategies. We evaluate twelve LLMs, including state-of-the-art proprietary models such as GPT-5 and Gemini-2.5 Pro, alongside seven open-source models from the LLaMA and Qwen families. While proprietary LLMs achieve stronger results overall, their performance remains subpar on reasoning-intensive tasks, such as threat actor attribution and risk mitigation, with open-source models trailing even further behind. These findings highlight fundamental limitations in the reasoning capabilities of current LLMs and underscore the need for models explicitly tailored to CTI workflows and automation.
Subjects:
Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Minerva: Reinforcement Learning with Verifiable Rewards for Cyber Threat Intelligence LLMs
Md Tanvirul Alam, Aritran Piplai, Ionut Cardei, Nidhi Rastogi, Peter J Worth Jr
Abstract
Cyber threat intelligence (CTI) analysts routinely convert noisy, unstructured security artifacts into standardized, automation-ready representations. Although large language models (LLMs) show promise for this task, existing approaches remain brittle when producing structured CTI outputs and have largely relied on supervised fine-tuning (SFT). In contrast, CTI standards and community-maintained resources define canonical identifiers and schemas that enable deterministic verification of model outputs. We leverage this structure to study reinforcement learning with verifiable rewards (RLVR) for CTI tasks. We introduce Minerva, a unified dataset and training pipeline spanning multiple CTI subtasks, each paired with task-specific verifiers that score structured outputs and identifier predictions. To address reward sparsity during rollout, we propose a lightweight self-training mechanism that generates additional verified trajectories and distills them back into the model. Experiments across LLM backbones show consistent improvements in accuracy and robustness over SFT across multiple benchmarks.
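The abstract's key observation is that CTI outputs can be checked deterministically against canonical identifiers and schemas, yielding an exact reward signal for RLVR. The sketch below is an illustrative assumption, not the paper's actual Minerva verifiers: the field names and JSON schema are invented, and only the CVE identifier format is a real standard.

```python
import json
import re

# CVE identifiers follow a canonical, verifiable pattern: CVE-YYYY-NNNN (4+ digits).
CVE_PATTERN = re.compile(r"^CVE-\d{4}-\d{4,}$")

def verify_cti_output(model_output: str) -> float:
    """Deterministic reward in the spirit of RLVR: 1.0 only if the model's
    output is valid JSON with the required (hypothetical) fields and a
    well-formed CVE identifier, else 0.0."""
    try:
        record = json.loads(model_output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(record, dict):
        return 0.0
    if "cve_id" not in record or "summary" not in record:
        return 0.0
    if not CVE_PATTERN.match(str(record["cve_id"])):
        return 0.0
    return 1.0

print(verify_cti_output('{"cve_id": "CVE-2021-44228", "summary": "Log4Shell RCE"}'))  # 1.0
print(verify_cti_output("not even json"))  # 0.0
```

Because the verifier is exact rather than learned, such rewards are sparse, which is the motivation for the self-training mechanism the abstract describes.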
Subjects:
Machine Learning (cs.LG)
Contact: info@athenasecuritygrp.com