arXiv論文一覧 - cs.CR updates on arXiv.org

#1 Automated Post-Incident Policy Gap Analysis via Threat-Informed Evidence Mapping using Large Language Models

著者: Huan Lin Oh, Jay Yong Jun Jie, Mandy Lee Ling Siu, Jonathan Pan

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2601.03287

要約:
Cybersecurity post-incident reviews are essential for identifying control failures and improving organisational resilience, yet they remain labour-intensive, time-consuming, and heavily reliant on expert judgment. This paper investigates whether Large Language Models (LLMs) can augment post-incident review workflows by autonomously analysing system evidence and identifying security policy gaps. We present a threat-informed, agentic framework that ingests log data, maps observed behaviours to the MITRE ATT&CK framework, and evaluates organisational security policies for adequacy and compliance. Using a simulated brute-force attack scenario against a Windows OpenSSH service (MITRE ATT&CK T1110), the system leverages GPT-4o for reasoning, LangGraph for multi-agent workflow orchestration, and LlamaIndex for traceable policy retrieval. Experimental results indicate that the LLM-based pipeline can interpret log-derived evidence, identify insufficient or missing policy controls, and generate actionable remediation recommendations with explicit evidence-to-policy traceability. Unlike prior work that treats log analysis and policy validation as isolated tasks, this study integrates both into a unified end-to-end proof-of-concept post-incident review framework. The findings suggest that LLM-assisted analysis has the potential to improve the efficiency, consistency, and auditability of post-incident evaluations, while highlighting the continued need for human oversight in high-stakes cybersecurity decision-making.

#2 How Real is Your Jailbreak? Fine-grained Jailbreak Evaluation with Anchored Reference

著者: Songyang Liu, Chaozhuo Li, Rui Pu, Litian Zhang, Chenxu Wang, Zejian Chen, Yuting Zhang, Yiming Hei

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2601.03288

要約:
Jailbreak attacks present a significant challenge to the safety of Large Language Models (LLMs), yet current automated evaluation methods largely rely on coarse classifications that focus mainly on harmfulness, leading to substantial overestimation of attack success. To address this problem, we propose FJAR, a fine-grained jailbreak evaluation framework with anchored references. We first categorized jailbreak responses into five fine-grained categories: Rejective, Irrelevant, Unhelpful, Incorrect, and Successful, based on the degree to which the response addresses the malicious intent of the query. This categorization serves as the basis for FJAR. Then, we introduce a novel harmless tree decomposition approach to construct high-quality anchored references by breaking down the original queries. These references guide the evaluator in determining whether the response genuinely fulfills the original query. Extensive experiments demonstrate that FJAR achieves the highest alignment with human judgment and effectively identifies the root causes of jailbreak failures, providing actionable guidance for improving attack strategies.

#3 Differentiation Between Faults and Cyberattacks through Combined Analysis of Cyberspace Logs and Physical Measurements

著者: Mohammad Shamim Ahsan, Haizhou Wang, Venkateswara Reddy Motakatla, Minghui Zhu, Peng Liu

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2601.03289

要約:
In recent years, cyberattacks - along with physical faults - have become an increasing factor causing system failures, especially in DER (Distributed Energy Resources) systems. In addition, according to the literature, a number of faults have been reported to remain undetected. Consequently, unlike anomaly detection works that only identify abnormalities, differentiating undetected faults and cyberattacks is a challenging task. Although several works have studied this problem, they crucially fall short of achieving an accurate distinction due to the reliance on physical laws or physical measurements. To resolve this issue, the industry typically conducts an integrated analysis with physical measurements and cyberspace information. Nevertheless, this industry approach consumes a significant amount of time due to the manual efforts required in the analysis. In this work, we focus on addressing these crucial gaps by proposing a non-trivial approach of distinguishing undetected faults and cyberattacks in DER systems. Specifically, first, a special kind of dependency graph is constructed using a novel virtual physical variable-oriented taint analysis (PVOTA) algorithm. Then, the graph is simplified using an innovative node pruning technique, which is based on a set of context-dependent operations. Next, a set of patterns capturing domain-specific knowledge is derived to bridge the semantic gaps between the cyber and physical sides. Finally, these patterns are matched to the relevant events that occurred during failure incidents, and possible root causes are concluded based on the pattern matching results. In the end, the efficacy of our proposed automatic integrated analysis is evaluated through four case studies covering failure incidents caused by the FDI attack, undetected faults, and memory corruption attacks.

#4 AgentMark: Utility-Preserving Behavioral Watermarking for Agents

intellectual propertyagent

著者: Kaibo Huang, Jin Tan, Yukun Wei, Wanling Li, Zipei Zhang, Hui Tian, Zhongliang Yang, Linna Zhou

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2601.03294

要約:
LLM-based agents are increasingly deployed to autonomously solve complex tasks, raising urgent needs for IP protection and regulatory provenance. While content watermarking effectively attributes LLM-generated outputs, it fails to directly identify the high-level planning behaviors (e.g., tool and subgoal choices) that govern multi-step execution. Critically, watermarking at the planning-behavior layer faces unique challenges: minor distributional deviations in decision-making can compound during long-term agent operation, degrading utility, and many agents operate as black boxes that are difficult to intervene in directly. To bridge this gap, we propose AgentMark, a behavioral watermarking framework that embeds multi-bit identifiers into planning decisions while preserving utility. It operates by eliciting an explicit behavior distribution from the agent and applying distribution-preserving conditional sampling, enabling deployment under black-box APIs while remaining compatible with action-layer content watermarking. Experiments across embodied, tool-use, and social environments demonstrate practical multi-bit capacity, robust recovery from partial logs, and utility preservation. The code is available at https://github.com/Tooooa/AgentMark.

#5 TRYLOCK: Defense-in-Depth Against LLM Jailbreaks via Layered Preference and Representation Engineering

著者: Scott Thornton

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2601.03300

要約:
Large language models remain vulnerable to jailbreak attacks, and single-layer defenses often trade security for usability. We present TRYLOCK, the first defense-in-depth architecture that combines four heterogeneous mechanisms across the inference stack: weight-level safety alignment via DPO, activation-level control via Representation Engineering (RepE) steering, adaptive steering strength selected by a lightweight sidecar classifier, and input canonicalization to neutralize encoding-based bypasses. On Mistral-7B-Instruct evaluated against a 249-prompt attack set spanning five attack families, TRYLOCK achieves 88.0% relative ASR reduction (46.5% to 5.6%), with each layer contributing unique coverage: RepE blocks 36% of attacks that bypass DPO alone, while canonicalization catches 14% of encoding attacks that evade both. We discover a non-monotonic steering phenomenon -- intermediate strength (alpha=1.0) degrades safety below baseline -- and provide mechanistic hypotheses explaining RepE-DPO interference. The adaptive sidecar reduces over-refusal from 60% to 48% while maintaining identical attack defense, demonstrating that security and usability need not be mutually exclusive. We release all components -- trained adapters, steering vectors, sidecar classifier, preference pairs, and complete evaluation methodology -- enabling full reproducibility.

#6 Autonomous Threat Detection and Response in Cloud Security: A Comprehensive Survey of AI-Driven Strategies

著者: Gaurav Sarraf, Vibhor Pal

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2601.03303

要約:
Cloud computing has changed online communities in three dimensions, which are scalability, adaptability and reduced overhead. But there are serious security concerns which are brought about by its distributed and multi-tenant characteristics. The old methods of detecting and reacting to threats which are mostly reliant on fixed signatures, predefined rules and human operators are becoming less and less effective even in the advanced stages of cyberattacks of cloud infrastructures. The recent trend in the field of addressing these limitations is the creation of technologies of artificial intelligence (AI). The strategies allow independent protection, anomaly detection, and real-time analysis with references to using deep learning, machine learning, and reinforcement learning. Through imbuing AI with a constantly-learning feature, it enables the intrusion detection system to be more accurate and generate a lesser number of false positives and it also enables the possibility of adaptive and predictive security. The fusion of large-scale language models with efficient orchestration platforms contributes to reacting to the arising threats with a quicker and more precise response. This allows automatic control over incidences, self-healing network, and defense mechanisms on a policy basis. Considering the current detection and response methods, this discussion assesses their strengths and weaknesses and outlines key issues such as data privacy, adversarial machine learning and integration complexity in the context of AI-based cloud security. These results suggest the future application of AI to support autonomous, scalable and active cloud security operations.

#7 AI-Driven Cybersecurity Threats: A Survey of Emerging Risks and Defensive Strategies

著者: Sai Teja Erukude, Viswa Chaitanya Marella, Suhasnadh Reddy Veluru

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2601.03304

要約:
Artificial Intelligence's dual-use nature is revolutionizing the cybersecurity landscape, introducing new threats across four main categories: deepfakes and synthetic media, adversarial AI attacks, automated malware, and AI-powered social engineering. This paper aims to analyze emerging risks, attack mechanisms, and defense shortcomings related to AI in cybersecurity. We introduce a comparative taxonomy connecting AI capabilities with threat modalities and defenses, review over 70 academic and industry references, and identify impactful opportunities for research, such as hybrid detection pipelines and benchmarking frameworks. The paper is structured thematically by threat type, with each section addressing technical context, real-world incidents, legal frameworks, and countermeasures. Our findings emphasize the urgency for explainable, interdisciplinary, and regulatory-compliant AI defense systems to maintain trust and security in digital ecosystems.

#8 Listen to Rhythm, Choose Movements: Autoregressive Multimodal Dance Generation via Diffusion and Mamba with Decoupled Dance Dataset

diffusion

著者: Oran Duan, Yinghua Shen, Yingzhu Lv, Luyang Jie, Yaxin Liu, Qiong Wu

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2601.03323

要約:
Advances in generative models and sequence learning have greatly promoted research in dance motion generation, yet current methods still suffer from coarse semantic control and poor coherence in long sequences. In this work, we present Listen to Rhythm, Choose Movements (LRCM), a multimodal-guided diffusion framework supporting both diverse input modalities and autoregressive dance motion generation. We explore a feature decoupling paradigm for dance datasets and generalize it to the Motorica Dance dataset, separating motion capture data, audio rhythm, and professionally annotated global and local text descriptions. Our diffusion architecture integrates an audio-latent Conformer and a text-latent Cross-Conformer, and incorporates a Motion Temporal Mamba Module (MTMM) to enable smooth, long-duration autoregressive synthesis. Experimental results indicate that LRCM delivers strong performance in both functional capability and quantitative metrics, demonstrating notable potential in multimodal input scenarios and extended sequence generation. We will release the full codebase, dataset, and pretrained models publicly upon acceptance.

#9 DeepLeak: Privacy Enhancing Hardening of Model Explanations Against Membership Leakage

privacy

著者: Firas Ben Hmida, Zain Sbeih, Philemon Hailemariam, Birhanu Eshete

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2601.03429

要約:
Machine learning (ML) explainability is central to algorithmic transparency in high-stakes settings such as predictive diagnostics and loan approval. However, these same domains require rigorous privacy guaranties, creating tension between interpretability and privacy. Although prior work has shown that explanation methods can leak membership information, practitioners still lack systematic guidance on selecting or deploying explanation techniques that balance transparency with privacy. We present DeepLeak, a system to audit and mitigate privacy risks in post-hoc explanation methods. DeepLeak advances the state-of-the-art in three ways: (1) comprehensive leakage profiling: we develop a stronger explanation-aware membership inference attack (MIA) to quantify how much representative explanation methods leak membership information under default configurations; (2) lightweight hardening strategies: we introduce practical, model-agnostic mitigations, including sensitivity-calibrated noise, attribution clipping, and masking, that substantially reduce membership leakage while preserving explanation utility; and (3) root-cause analysis: through controlled experiments, we pinpoint algorithmic properties (e.g., attribution sparsity and sensitivity) that drive leakage. Evaluating 15 explanation techniques across four families on image benchmarks, DeepLeak shows that default settings can leak up to 74.9% more membership information than previously reported. Our mitigations cut leakage by up to 95% (minimum 46.5%) with only <=3.3% utility loss on average. DeepLeak offers a systematic, reproducible path to safer explainability in privacy-sensitive ML.

#10 Security Parameter Analysis of the LINEture Post-Quantum Digital Signature Scheme

著者: Yevgen Kotukh, Gennady Khalimov

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2601.03465

要約:
This paper presents a comprehensive cryptographic analysis of the security parameters of the LINEture post-quantum digital signature scheme, which is constructed using matrix algebra over elementary abelian 2-groups. We investigate the influence of three principal parameters. First, the word size m (exhibiting quadratic impact), the second is a vector dimension l, and the third is a number of submatrices in the session key q (exhibiting linear impact) on cryptographic strength. Our analysis reveals a dualistic nature of the parameter l. According to the previous analysis, it does not affect resistance to guessing attacks. A deeper examination of the verification mechanism demonstrates that l establishes a kind of verification barrier of l times m bits. We establish the threshold relationship l less q minus 1 times m, below which parameter l becomes security-critical. The optimal selection rule l near q minus 1 times m is proposed for maximum cryptographic efficiency. Comparative analysis with NIST PQC standards and practical parameter recommendations are provided.

#11 Full-Stack Knowledge Graph and LLM Framework for Post-Quantum Cyber Readiness

著者: Rasmus Erlemann, Charles Colyer Morris, Sanjyot Sathe

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2601.03504

要約:
The emergence of large-scale quantum computing threatens widely deployed public-key cryptographic systems, creating an urgent need for enterprise-level methods to assess post-quantum (PQ) readiness. While PQ standards are under development, organizations lack scalable and quantitative frameworks for measuring cryptographic exposure and prioritizing migration across complex infrastructures. This paper presents a knowledge graph based framework that models enterprise cryptographic assets, dependencies, and vulnerabilities to compute a unified PQ readiness score. Infrastructure components, cryptographic primitives, certificates, and services are represented as a heterogeneous graph, enabling explicit modeling of dependency-driven risk propagation. PQ exposure is quantified using graph-theoretic risk functionals and attributed across cryptographic domains via Shapley value decomposition. To support scalability and data quality, the framework integrates large language models with human-in-the-loop validation for asset classification and risk attribution. The resulting approach produces explainable, normalized readiness metrics that support continuous monitoring, comparative analysis, and remediation prioritization.

#12 A Critical Analysis of the Medibank Health Data Breach and Differential Privacy Solutions

privacy

著者: Zhuohan Cui, Qianqian Lang, Zikun Song

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2601.03508

要約:
This paper critically examines the 2022 Medibank health insurance data breach, which exposed sensitive medical records of 9.7 million individuals due to unencrypted storage, centralized access, and the absence of privacy-preserving analytics. To address these vulnerabilities, we propose an entropy-aware differential privacy (DP) framework that integrates Laplace and Gaussian mechanisms with adaptive budget allocation. The design incorporates TLS-encrypted database access, field-level mechanism selection, and smooth sensitivity models to mitigate re-identification risks. Experimental validation was conducted using synthetic Medibank datasets (N = 131,000) with entropy-calibrated DP mechanisms, where high-entropy attributes received stronger noise injection. Results demonstrate a 90.3% reduction in re-identification probability while maintaining analytical utility loss below 24%. The framework further aligns with GDPR Article 32 and Australian Privacy Principle 11.1, ensuring regulatory compliance. By combining rigorous privacy guarantees with practical usability, this work contributes a scalable and technically feasible solution for healthcare data protection, offering a pathway toward resilient, trustworthy, and regulation-ready medical analytics.

#13 Deontic Knowledge Graphs for Privacy Compliance in Multimodal Disaster Data Sharing

privacy

著者: Kelvin Uzoma Echenim, Karuna Pande Joshi

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2601.03587

要約:
Disaster response requires sharing heterogeneous artifacts, from tabular assistance records to UAS imagery, under overlapping privacy mandates. Operational systems often reduce compliance to binary access control, which is brittle in time-critical workflows. We present a novel deontic knowledge graph-based framework that integrates a Disaster Management Knowledge Graph (DKG) with a Policy Knowledge Graph (PKG) derived from IoT-Reg and FEMA/DHS privacy drivers. Our release decision function supports three outcomes: Allow, Block, and Allow-with-Transform. The latter binds obligations to transforms and verifies post-transform compliance via provenance-linked derived artifacts; blocked requests are logged as semantic privacy incidents. Evaluation on a 5.1M-triple DKG with 316K images shows exact-match decision correctness, sub-second per-decision latency, and interactive query performance across both single-graph and federated workloads.

#14 Jailbreaking LLMs & VLMs: Mechanisms, Evaluation, and Unified Defense

著者: Zejian Chen, Chaozhuo Li, Chao Li, Xi Zhang, Litian Zhang, Yiming He

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2601.03594

要約:
This paper provides a systematic survey of jailbreak attacks and defenses on Large Language Models (LLMs) and Vision-Language Models (VLMs), emphasizing that jailbreak vulnerabilities stem from structural factors such as incomplete training data, linguistic ambiguity, and generative uncertainty. It further differentiates between hallucinations and jailbreaks in terms of intent and triggering mechanisms. We propose a three-dimensional survey framework: (1) Attack dimension-including template/encoding-based, in-context learning manipulation, reinforcement/adversarial learning, LLM-assisted and fine-tuned attacks, as well as prompt- and image-level perturbations and agent-based transfer in VLMs; (2) Defense dimension-encompassing prompt-level obfuscation, output evaluation, and model-level alignment or fine-tuning; and (3) Evaluation dimension-covering metrics such as Attack Success Rate (ASR), toxicity score, query/time cost, and multimodal Clean Accuracy and Attribute Success Rate. Compared with prior works, this survey spans the full spectrum from text-only to multimodal settings, consolidating shared mechanisms and proposing unified defense principles: variant-consistency and gradient-sensitivity detection at the perception layer, safety-aware decoding and output review at the generation layer, and adversarially augmented preference alignment at the parameter layer. Additionally, we summarize existing multimodal safety benchmarks and discuss future directions, including automated red teaming, cross-modal collaborative defense, and standardized evaluation.

#15 Detection and Prevention of Process Disruption Attacks in the Electrical Power Systems using MMS Traffic: An EPIC Case

著者: Praneeta K Maganti, Daisuke Mashima, Rajib Ranjan Maiti

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2601.03690

要約:
Smart grids are increasingly exposed to sophisticated cyber threats due to their reliance on interconnected communication networks, as demonstrated by real world incidents such as the cyberattacks on the Ukrainian power grid. In IEC61850 based smart substations, the Manufacturing Message Specification protocol operates over TCP to facilitate communication between SCADA systems and field devices such as Intelligent Electronic Devices and Programmable Logic Controllers. Although MMS enables efficient monitoring and control, it can be exploited by adversaries to generate legitimate looking packets for reconnaissance, unauthorized state reading, and malicious command injection, thereby disrupting grid operations. In this work, we propose a fully automated attack detection and prevention framework for IEC61850 compliant smart substations to counter remote cyberattacks that manipulate process states through compromised PLCs and IEDs. A detailed analysis of the MMS protocol is presented, and critical MMS field value pairs are extracted during both normal SCADA operation and active attack conditions. The proposed framework is validated using seven datasets comprising benign operational scenarios and multiple attack instances, including IEC61850Bean based attacks and script driven attacks leveraging the libiec61850 library. Our approach accurately identifies attack signature carrying MMS packets that attempt to disrupt circuit breaker status, specifically targeting the smart home zone IED and PLC of the EPIC testbed. The results demonstrate the effectiveness of the proposed framework in precisely detecting malicious MMS traffic and enhancing the cyber resilience of IEC61850 based smart grid environments.

#16 Human Challenge Oracle: Designing AI-Resistant, Identity-Bound, Time-Limited Tasks for Sybil-Resistant Consensus

著者: Homayoun Maleki, Nekane Sainz, Jon Legarda

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2601.03923

要約:
Sybil attacks remain a fundamental obstacle in open online systems, where adversaries can cheaply create and sustain large numbers of fake identities. Existing defenses, including CAPTCHAs and one-time proof-of-personhood mechanisms, primarily address identity creation and provide limited protection against long-term, large-scale Sybil participation, especially as automated solvers and AI systems continue to improve. We introduce the Human Challenge Oracle (HCO), a new security primitive for continuous, rate-limited human verification. HCO issues short, time-bound challenges that are cryptographically bound to individual identities and must be solved in real time. The core insight underlying HCO is that real-time human cognitive effort, such as perception, attention, and interactive reasoning, constitutes a scarce resource that is inherently difficult to parallelize or amortize across identities. We formalize the design goals and security properties of HCO and show that, under explicit and mild assumptions, sustaining s active identities incurs a cost that grows linearly with s in every time window. We further describe abstract classes of admissible challenges and concrete browser-based instantiations, and present an initial empirical study illustrating that these challenges are easily solvable by humans within seconds while remaining difficult for contemporary automated systems under strict time constraints.

#17 SoK: Privacy Risks and Mitigations in Retrieval-Augmented Generation Systems

privacy

著者: Andreea-Elena Bodea, Stephen Meisenbacher, Alexandra Klymenko, Florian Matthes

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2601.03979

要約:
The continued promise of Large Language Models (LLMs), particularly in their natural language understanding and generation capabilities, has driven a rapidly increasing interest in identifying and developing LLM use cases. In an effort to complement the ingrained "knowledge" of LLMs, Retrieval-Augmented Generation (RAG) techniques have become widely popular. At its core, RAG involves the coupling of LLMs with domain-specific knowledge bases, whereby the generation of a response to a user question is augmented with contextual and up-to-date information. The proliferation of RAG has sparked concerns about data privacy, particularly with the inherent risks that arise when leveraging databases with potentially sensitive information. Numerous recent works have explored various aspects of privacy risks in RAG systems, from adversarial attacks to proposed mitigations. With the goal of surveying and unifying these works, we ask one simple question: What are the privacy risks in RAG, and how can they be measured and mitigated? To answer this question, we conduct a systematic literature review of RAG works addressing privacy, and we systematize our findings into a comprehensive set of privacy risks, mitigation techniques, and evaluation strategies. We supplement these findings with two primary artifacts: a Taxonomy of RAG Privacy Risks and a RAG Privacy Process Diagram. Our work contributes to the study of privacy in RAG not only by conducting the first systematization of risks and mitigations, but also by uncovering important considerations when mitigating privacy risks in RAG systems and assessing the current maturity of proposed mitigations.

#18 HoneyTrap: Deceiving Large Language Model Attackers to Honeypot Traps with Resilient Multi-Agent Defense

agent

著者: Siyuan Li, Xi Lin, Jun Wu, Zehao Liu, Haoyu Li, Tianjie Ju, Xiang Chen, Jianhua Li

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2601.04034

要約:
Jailbreak attacks pose significant threats to large language models (LLMs), enabling attackers to bypass safeguards. However, existing reactive defense approaches struggle to keep up with the rapidly evolving multi-turn jailbreaks, where attackers continuously deepen their attacks to exploit vulnerabilities. To address this critical challenge, we propose HoneyTrap, a novel deceptive LLM defense framework leveraging collaborative defenders to counter jailbreak attacks. It integrates four defensive agents, Threat Interceptor, Misdirection Controller, Forensic Tracker, and System Harmonizer, each performing a specialized security role and collaborating to complete a deceptive defense. To ensure a comprehensive evaluation, we introduce MTJ-Pro, a challenging multi-turn progressive jailbreak dataset that combines seven advanced jailbreak strategies designed to gradually deepen attack strategies across multi-turn attacks. Besides, we present two novel metrics: Mislead Success Rate (MSR) and Attack Resource Consumption (ARC), which provide more nuanced assessments of deceptive defense beyond conventional measures. Experimental results on GPT-4, GPT-3.5-turbo, Gemini-1.5-pro, and LLaMa-3.1 demonstrate that HoneyTrap achieves an average reduction of 68.77% in attack success rates compared to state-of-the-art baselines. Notably, even in a dedicated adaptive attacker setting with intensified conditions, HoneyTrap remains resilient, leveraging deceptive engagement to prolong interactions, significantly increasing the time and computational costs required for successful exploitation. Unlike simple rejection, HoneyTrap strategically wastes attacker resources without impacting benign queries, improving MSR and ARC by 118.11% and 149.16%, respectively.

#19 Jailbreak-Zero: A Path to Pareto Optimal Red Teaming for Large Language Models

著者: Kai Hu, Abhinav Aggarwal, Mehran Khodabandeh, David Zhang, Eric Hsin, Li Chen, Ankit Jain, Matt Fredrikson, Akash Bharadwaj

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2601.03265

要約:
This paper introduces Jailbreak-Zero, a novel red teaming methodology that shifts the paradigm of Large Language Model (LLM) safety evaluation from a constrained example-based approach to a more expansive and effective policy-based framework. By leveraging an attack LLM to generate a high volume of diverse adversarial prompts and then fine-tuning this attack model with a preference dataset, Jailbreak-Zero achieves Pareto optimality across the crucial objectives of policy coverage, attack strategy diversity, and prompt fidelity to real user inputs. The empirical evidence demonstrates the superiority of this method, showcasing significantly higher attack success rates against both open-source and proprietary models like GPT-40 and Claude 3.5 when compared to existing state-of-the-art techniques. Crucially, Jailbreak-Zero accomplishes this while producing human-readable and effective adversarial prompts with minimal need for human intervention, thereby presenting a more scalable and comprehensive solution for identifying and mitigating the safety vulnerabilities of LLMs.

#20 Jailbreaking LLMs Without Gradients or Priors: Effective and Transferable Attacks

著者: Zhakshylyk Nurlanov, Frank R. Schmidt, Florian Bernard

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2601.03420

要約:
As Large Language Models (LLMs) are increasingly deployed in safety-critical domains, rigorously evaluating their robustness against adversarial jailbreaks is essential. However, current safety evaluations often overestimate robustness because existing automated attacks are limited by restrictive assumptions. They typically rely on handcrafted priors or require white-box access for gradient propagation. We challenge these constraints by demonstrating that token-level iterative optimization can succeed without gradients or priors. We introduce RAILS (RAndom Iterative Local Search), a framework that operates solely on model logits. RAILS matches the effectiveness of gradient-based methods through two key innovations: a novel auto-regressive loss that enforces exact prefix matching, and a history-based selection strategy that bridges the gap between the proxy optimization objective and the true attack success rate. Crucially, by eliminating gradient dependency, RAILS enables cross-tokenizer ensemble attacks. This allows for the discovery of shared adversarial patterns that generalize across disjoint vocabularies, significantly enhancing transferability to closed-source systems. Empirically, RAILS achieves near 100% success rates on multiple open-source models and high black-box attack transferability to closed-source systems like GPT and Gemini.

#21 Verbatim Data Transcription Failures in LLM Code Generation: A State-Tracking Stress Test

著者: Mohd Ariful Haque, Kishor Datta Gupta, Mohammad Ashiqur Rahman, Roy George

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2601.03640

要約:
Many real-world software tasks require exact transcription of provided data into code, such as cryptographic constants, protocol test vectors, allowlists, and calibration tables. These tasks are operationally sensitive because small omissions or alterations can remain silent while producing syntactically valid programs. This paper introduces a deliberately minimal transcription-to-code benchmark to isolate this reliability concern in LLM-based code generation. Given a list of high-precision decimal constants, a model must generate Python code that embeds the constants verbatim and performs a simple aggregate computation. We describe the prompting variants, evaluation protocol based on exact-string inclusion, and analysis framework used to characterize state-tracking and long-horizon generation failures. The benchmark is intended as a compact stress test that complements existing code-generation evaluations by focusing on data integrity rather than algorithmic reasoning.

#22 What Matters For Safety Alignment?

著者: Xing Li, Hui-Ling Zhen, Lihao Yin, Xianzhi Yu, Zhenhua Dong, Mingxuan Yuan

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2601.03868

要約:
This paper presents a comprehensive empirical study on the safety alignment capabilities. We evaluate what matters for safety alignment in LLMs and LRMs to provide essential insights for developing more secure and reliable AI systems. We systematically investigate and compare the influence of six critical intrinsic model characteristics and three external attack techniques. Our large-scale evaluation is conducted using 32 recent, popular LLMs and LRMs across thirteen distinct model families, spanning a parameter scale from 3B to 235B. The assessment leverages five established safety datasets and probes model vulnerabilities with 56 jailbreak techniques and four CoT attack strategies, resulting in 4.6M API calls. Our key empirical findings are fourfold. First, we identify the LRMs GPT-OSS-20B, Qwen3-Next-80B-A3B-Thinking, and GPT-OSS-120B as the top-three safest models, which substantiates the significant advantage of integrated reasoning and self-reflection mechanisms for robust safety alignment. Second, post-training and knowledge distillation may lead to a systematic degradation of safety alignment. We thus argue that safety must be treated as an explicit constraint or a core optimization objective during these stages, not merely subordinated to the pursuit of general capability. Third, we reveal a pronounced vulnerability: employing a CoT attack via a response prefix can elevate the attack success rate by 3.34x on average and from 0.6% to 96.3% for Seed-OSS-36B-Instruct. This critical finding underscores the safety risks inherent in text-completion interfaces and features that allow user-defined response prefixes in LLM services, highlighting an urgent need for architectural and deployment safeguards. Fourth, roleplay, prompt injection, and gradient-based search for adversarial prompts are the predominant methodologies for eliciting unaligned behaviors in modern models.

#23 An Ontology-Based Approach to Security Risk Identification of Container Deployments in OT Contexts

著者: Yannick Landeck, Dian Balta, Martin Wimmer, Christian Knierim

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2601.04010

要約:
In operational technology (OT) contexts, containerised applications often require elevated privileges to access low-level network interfaces or perform administrative tasks such as application monitoring. These privileges reduce the default isolation provided by containers and introduce significant security risks. Security risk identification for OT container deployments is challenged by hybrid IT/OT architectures, fragmented stakeholder knowledge, and continuous system changes. Existing approaches lack reproducibility, interpretability across contexts, and technical integration with deployment artefacts. We propose a model-based approach, implemented as the Container Security Risk Ontology (CSRO), which integrates five key domains: adversarial behaviour, contextual assumptions, attack scenarios, risk assessment rules, and container security artefacts. Our evaluation of CSRO in a case study demonstrates that the end-to-end formalisation of risk calculation, from artefact to risk level, enables automated and reproducible risk identification. While CSRO currently focuses on technical, container-level treatment measures, its modular and flexible design provides a solid foundation for extending the approach to host-level and organisational risk factors.

#24 IoTChain: A Three-Tier Blockchain-based IoT Security Architecture

著者: Zijian Bao, Wenbo Shi, Debiao He, Kim-Kwang Raymond Chood

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/1806.02008

要約:
There has been increasing interest in the potential of blockchain in enhancing the security of devices and systems, such as Internet of Things (IoT). In this paper, we present a blockchain-based IoT security architecture, IoTchain. The three-tier architecture comprises an authentication layer, a blockchain layer and an application layer, and is designed to achieve identity authentication, access control, privacy protection, lightweight feature, regional node fault tolerance, denial-of-service resilience, and storage integrity. We also evaluate the performance of IoTchain to demonstrate its utility in an IoT deployment.

#25 Lockcoin: a secure and privacy-preserving mix service for bitcoin anonymity

privacy

著者: Zijian Bao, Bin Wang, Yongxin Zhang, Qinghao Wang, Wenbo Shi

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/1811.04349

要約:
We propose Lockcoin, a secure and privacy-preserving mix service for bitcoin anonymity. We introduce mix servers to provide mix service for user to prevent attackers linking the input address with output address by using blind signature shceme, multisignature scheme. Lockcoin provides anonymity, scalability, bitcoin compatibillity, theft impossibility and accountability. We have proposed a prototype of Lockcoin based on bitcoin test network, experimental results show that our solution is efficient. Lockcoin's source codes are released on github.com/Northeastern-University-Blockchain/Lockcoin.

#26 SoK: Security of the Image Processing Pipeline for Camera-based Sensing in Autonomous Vehicles

著者: Michael K\"uhr, Mohammad Hamad, Pedram MohajerAnsari, Mert D. Pes\'e, Sebastian Steinhorst

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2409.01234

要約:
Cameras capture images that are essential for many safety-critical tasks. To process these images, a complex pipeline with multiple layers is used. Security attacks on this pipeline can severely affect passenger safety and system performance. However, many attacks presented in scientific literature overlook the fact that there are different layers and, hence, the feasibility and impact of these attacks can vary. While there has been research to improve the quality and robustness of the image processing pipeline, these efforts are often orthogonal to security research without exploiting potential overlap and synergies. In this work, we aim to bridge this gap by combining security and robustness research for the image processing pipeline in autonomous vehicles. We thoroughly investigated the body of literature on the security and robustness of the image processing pipeline and selected 92 papers for deeper discussion in this SoK. For the security domain, we classify the risk of attacks using the automotive security standard ISO 21434, emphasizing the need to consider all layers for overall system security. With our online tool TARA-CAM, we propose an interactive method to perform threat analysis and risk assessment following the ISO standard. We also demonstrate how existing robustness research can help mitigate the impact of attacks, addressing the current research gap. Finally, we present PICT, an embedded open-source testbed that can influence various parameters across all layers, allowing researchers to analyze the effects of different defense strategies and attack impacts. With this SoK, we contribute a comprehensive discussion and systematic analysis of existing approaches to image processing pipeline security and robustness, together with an open-source tool and testbed that jointly facilitates hardening the image processing pipeline against existing and future security attacks.

#27 Lightweight and Resilient Signatures for Cloud-Assisted Embedded IoT Systems

著者: Saif E. Nouma, Attila A. Yavuz

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2409.13937

要約:
Digital signatures provide scalable authentication with non-repudiation and are vital tools for the Internet of Things (IoT). Many IoT applications harbor vast quantities of resource-limited devices often used with cloud computing. However, key compromises (e.g., physical, malware) pose a significant threat to IoTs due to increased attack vectors and open operational environments. Forward security and distributed key management are critical breach-resilient countermeasures to mitigate such threats. Yet forward-secure signatures are exorbitantly costly for low-end IoTs, while cloud-assisted approaches suffer from centrality or non-colluding semi-honest servers. In this work, we create two novel digital signatures called Lightweight and Resilient Signatures with Hardware Assistance (LRSHA) and its Forward-secure version (FLRSHA). They offer a near-optimally efficient signing with small keys and signature sizes. We synergize various design strategies, such as commitment separation to eliminate costly signing operations and hardware-assisted distributed servers to enable breach-resilient verification. Our schemes achieve magnitudes of faster forward-secure signing and compact key/signature sizes without suffering from strong security assumptions (non-colluding, central servers) or a heavy burden on the verifier (extreme storage, computation). We formally prove the security of our schemes and validate their performance with full-fledged open-source implementations on both commodity hardware and 8-bit AVR microcontrollers.

#28 Privacy for Quantum Annealing. Attack on Spin Reversal Transformations in the case of cryptanalysis

privacy

著者: Mateusz Le\'sniak, Micha{\l} Wro\'nski

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2409.17744

要約:
This paper demonstrates that applying spin reversal transformations (SRT), commonly known as a sufficient method for privacy enhancement in problems solved using quantum annealing, does not guarantee privacy for all possible cases. We show how to recover the original problem from the Ising problem obtained using SRT when the resulting problem in Ising form represents the algebraic attack on the $E_0$ stream cipher. A small example illustrates how to retrieve the original problem from that transformed by SRT. Moreover, we show that our method is efficient also for full-scale problems.

#29 Marking Code Without Breaking It: Code Watermarking for Detecting LLM-Generated Code

intellectual property

著者: Jungin Kim, Shinwoo Park, Yo-Sub Han

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2502.18851

要約:
Identifying LLM-generated code through watermarking poses a challenge in preserving functional correctness. Previous methods rely on the assumption that watermarking high-entropy tokens effectively maintains output quality. Our analysis reveals a fundamental limitation of this assumption: syntax-critical tokens such as keywords often exhibit the highest entropy, making existing approaches vulnerable to logic corruption. We present STONE, a syntax-aware watermarking method that embeds watermarks only in non-syntactic tokens and preserves code integrity. For its rigorous assessment, we also introduce STEM, a comprehensive framework that balances three critical dimensions: correctness, detectability, and imperceptibility. Across Python, C++, and Java, STONE preserves correctness, sustains strong detectability, and achieves balanced performance with minimal overhead. Our implementation is available at https://anonymous.4open.science/r/STONE-watermarking-AB4B/.

#30 Kitten or Panda? Measuring the Specificity of Threat Group Behaviors in Public CTI Knowledge Bases

著者: Aakanksha Saha, Martina Lindorfer, Juan Caballero

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2506.10645

要約:
In recent years, the cyber threat intelligence (CTI) community has invested significant effort in building knowledge bases that catalog threat groups. These knowledge bases associate each threat group with its observed behaviors, including their Tactics, Techniques, and Procedures (TTPs) as well as the malware and tools they employ during attacks. However, the distinctiveness and completeness of such behavioral profiles remain largely unexplored, despite being critical for tasks such as threat group attribution. In this work, we systematically analyze threat group profiles built from two public CTI knowledge bases: MITRE ATT&CK and Malpedia. We first investigate what fraction of threat groups have group-specific behaviors, i.e., behaviors used exclusively by a single group. We find that only 34% of threat groups in ATT&CK have group-specific techniques, limiting the use of techniques as reliable behavioral signatures to identify the threat group behind an attack. The software used by a threat group proves to be more distinctive, with 73% of ATT&CK groups using group-specific software. However, this percentage drops to 24% in the broader Malpedia dataset. Next, we evaluate how group profiles improve when data from both sources are combined. While coverage improves modestly, the proportion of groups with group-specific behaviors remains under 30%. We then enhance profiles by adding exploited vulnerabilities and additional techniques extracted from threat reports. Despite the additional information, 64% of groups still lack any group-specific behavior. Our findings raise concerns about the specificity of existing behavioral profiles and highlight the need for caution, as well as further improvement, when using them for threat group attribution.

#31 Sharpening Kubernetes Audit Logs with Context Awareness

著者: Matteo Franzil, Valentino Armani, Luis Augusto Dias Knob, Domenico Siracusa

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2506.16328

要約:
Kubernetes has emerged as the de facto orchestrator of microservices, providing scalability and extensibility to a highly dynamic environment. It builds an intricate and deeply connected system that requires extensive monitoring capabilities to be properly managed. To this account, K8s natively offers audit logs, a powerful feature for tracking API interactions in the cluster. Audit logs provide a detailed and chronological record of all activities in the system. Unfortunately, K8s auditing suffers from several practical limitations: it generates large volumes of data continuously, as all components within the cluster interact and respond to user actions. Moreover, each action can trigger a cascade of secondary events dispersed across the log, with little to no explicit linkage, making it difficult to reconstruct the context behind user-initiated operations. In this paper, we introduce K8NTEXT, a novel approach for streamlining K8s audit logs by reconstructing contexts, i.e., grouping actions performed by actors on the cluster with the subsequent events these actions cause. Correlated API calls are automatically identified, labeled, and consistently grouped using a combination of inference rules and a Machine Learning model, largely simplifying data consumption. We evaluate K8NTEXT's performance, scalability, and expressiveness both in systematic tests and with a series of use cases. We show that it consistently provides accurate context reconstruction, even for complex operations involving 50, 100 or more correlated actions, achieving over 95 percent accuracy across the entire spectrum, from simple to highly composite actions.

#32 Web Fraud Attacks Against LLM-Driven Multi-Agent Systems

agent

著者: Dezhang Kong, Hujin Peng, Yilun Zhang, Lele Zhao, Zhenhua Xu, Shi Lin, Changting Lin, Meng Han

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2509.01211

要約:
With the proliferation of LLM-driven multi-agent systems (MAS), the security of Web links has become a critical concern. Once MAS is induced to trust a malicious link, attackers can use it as a springboard to expand the attack surface. In this paper, we propose Web Fraud Attacks, a novel type of attack manipulating unique structures of web links to deceive MAS. We design 12 representative attack variants that encompass various methods, such as homoglyph deception, sub-directory nesting, and parameter obfuscation. Through extensive experiments on these attack vectors, we demonstrate that Web fraud attacks not only exhibit significant destructive potential across different MAS architectures but also possess a distinct advantage in evasion: they circumvent the need for complex input design, lowering the threshold for attacks significantly. These results underscore the importance of addressing Web fraud attacks, providing new insights into MAS safety. Our code is available at https://github.com/JiangYingEr/Web-Fraud-Attack-in-MAS.

#33 Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring

著者: Peichun Hua, Hao Li, Shanghao Shi, Zhiyuan Yu, Ning Zhang

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2512.12069

要約:
Large Vision-Language Models (LVLMs) are vulnerable to a growing array of multimodal jailbreak attacks, necessitating defenses that are both generalizable to novel threats and efficient for practical deployment. Many current strategies fall short, either targeting specific attack patterns, which limits generalization, or imposing high computational overhead. While lightweight anomaly-detection methods offer a promising direction, we find that their common one-class design tends to confuse novel benign inputs with malicious ones, leading to unreliable over-rejection. To address this, we propose Representational Contrastive Scoring (RCS), a framework built on a key insight: the most potent safety signals reside within the LVLM's own internal representations. Our approach inspects the internal geometry of these representations, learning a lightweight projection to maximally separate benign and malicious inputs in safety-critical layers. This enables a simple yet powerful contrastive score that differentiates true malicious intent from mere novelty. Our instantiations, MCD (Mahalanobis Contrastive Detection) and KCD (K-nearest Contrastive Detection), achieve state-of-the-art performance on a challenging evaluation protocol designed to test generalization to unseen attack types. This work demonstrates that effective jailbreak detection can be achieved by applying simple, interpretable statistical methods to the appropriate internal representations, offering a practical path towards safer LVLM deployment. Our code is available on Github https://github.com/sarendis56/Jailbreak_Detection_RCS.

#34 Secure Digital Semantic Communications: Fundamentals, Challenges, and Opportunities

著者: Weixuan Chen, Qianqian Yang, Yuanyuan Jia, Junyu Pan, Shuo Shao, Jincheng Dai, Meixia Tao, Ping Zhang

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2512.24602

要約:
Semantic communication (SemCom) has emerged as a promising paradigm for future wireless networks by prioritizing task-relevant meaning over raw data delivery, thereby reducing communication overhead and improving efficiency. However, shifting from bit-accurate transmission to task-oriented delivery introduces new security and privacy risks. These include semantic leakage, semantic manipulation, knowledge base vulnerabilities, model-related attacks, and threats to authenticity and availability. Most existing secure SemCom studies focus on analog SemCom, where semantic features are mapped to continuous channel inputs. In contrast, digital SemCom transmits semantic information through discrete bits or symbols within practical transceiver pipelines, offering stronger compatibility with realworld systems while exposing a distinct and underexplored attack surface. In particular, digital SemCom typically represents semantic information over a finite alphabet through explicit digital modulation, following two main routes: probabilistic modulation and deterministic modulation. These discrete mechanisms and practical transmission procedures introduce additional vulnerabilities affecting bit- or symbol-level semantic information, the modulation stage, and packet-based delivery and protocol operations. Motivated by these challenges and the lack of a systematic analysis of secure digital SemCom, this paper reviews SemCom fundamentals, clarifies the architectural differences between analog and digital SemCom and their security implications, organizes the threat landscape for digital SemCom, and discusses potential defenses. Finally, we outline open research directions toward secure and deployable digital SemCom systems.

#35 Static Deadlock Detection for Rust Programs

著者: Yu Zhang, Kaiwen Zhang, Guanjun Liu

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2401.01114

要約:
Rust relies on its unique ownership mechanism to ensure thread and memory safety. However, numerous potential security vulnerabilities persist in practical applications. New language features in Rust pose new challenges for vulnerability detection. This paper proposes a static deadlock detection method tailored for Rust programs, aiming to identify various deadlock types, including double lock, conflict lock, and deadlock associated with conditional variables. With due consideration for Rust's ownership and lifetimes, we first complete the pointer analysis. Then, based on the obtained points-to information, we analyze dependencies among variables to identify potential deadlocks. We develop a tool and conduct experiments based on the proposed method. The experimental results demonstrate that our method outperforms existing deadlock detection methods in precision.

#36 HONEYBEE: Efficient Role-based Access Control for Vector Databases via Dynamic Partitioning[Technical Report]

著者: Hongbin Zhong, Matthew Lentz, Nina Narodytska, Adriana Szekeres, Kexin Rong

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2505.01538

要約:
Enterprise deployments of vector databases require access control policies to protect sensitive data. These systems often implement access control through hybrid vector queries that combine nearest-neighbor search with relational predicates based on user permissions. However, existing approaches face a fundamental trade-off: dedicated per-user indexes minimize query latency but incur high memory redundancy, while shared indexes with post-search filtering reduce memory overhead at the cost of increased latency. This paper introduces HONEYBEE, a dynamic partitioning framework that leverages the structure of Role-Based Access Control (RBAC) policies to create a smooth trade-off between these extremes. RBAC policies organize users into roles and assign permissions at the role level, creating a natural ``thin waist`` in the permission structure that is ideal for partitioning decisions. Specifically, HONEYBEE produces overlapping partitions where vectors can be strategically replicated across different partitions to reduce query latency while controlling memory overhead. To guide these decisions, HONEYBEE develops analytical models of vector search performance and recall, and formulates partitioning as a constrained optimization problem that balances memory usage, query efficiency, and recall. Evaluations on RBAC workloads demonstrate that HONEYBEE achieves up to 13.5X lower query latency than row-level security with only a 1.24X increase in memory usage, while achieving comparable query performance to dedicated, per-role indexes with 90.4% reduction in additional memory consumption, offering a practical middle ground for secure and efficient vector search.

#37 TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent

privacyagent

著者: Dominik Meier, Jan Philip Wahle, Paul R\"ottger, Terry Ruas, Bela Gipp

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2505.20118

要約:
As large language models (LLMs) become integrated into sensitive workflows, concerns grow over their potential to leak confidential information. We propose TrojanStego, a novel threat model in which an adversary fine-tunes an LLM to embed sensitive context information into natural-looking outputs via linguistic steganography, without requiring explicit control over inference inputs. We introduce a taxonomy outlining risk factors for compromised LLMs, and use it to evaluate the risk profile of the threat. To implement TrojanStego, we propose a practical encoding scheme based on vocabulary partitioning learnable by LLMs via fine-tuning. Experimental results show that compromised models reliably transmit 32-bit secrets with 87% accuracy on held-out prompts, reaching over 97% accuracy using majority voting across three generations. Further, they maintain high utility, can evade human detection, and preserve coherence. These results highlight a new class of LLM data exfiltration attacks that are passive, covert, practical, and dangerous.

#38 Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts

著者: Chiyu Zhang, Lu Zhou, Xiaogang Xu, Jiafei Wu, Liming Fang, Zhe Liu

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2508.10390

要約:
Existing black-box jailbreak attacks achieve certain success on non-reasoning models but degrade significantly on recent SOTA reasoning models. To improve attack ability, inspired by adversarial aggregation strategies, we integrate multiple jailbreak tricks into a single developer template. Especially, we apply Adversarial Context Alignment to purge semantic inconsistencies and use NTP (a type of harmful prompt) -based few-shot examples to guide malicious outputs, lastly forming DH-CoT attack with a fake chain of thought. In experiments, we further observe that existing red-teaming datasets include samples unsuitable for evaluating attack gains, such as BPs, NHPs, and NTPs. Such data hinders accurate evaluation of true attack effect lifts. To address this, we introduce MDH, a Malicious content Detection framework integrating LLM-based annotation with Human assistance, with which we clean data and build RTA dataset suite. Experiments show that MDH reliably filters low-quality samples and that DH-CoT effectively jailbreaks models including GPT-5 and Claude-4, notably outperforming SOTA methods like H-CoT and TAP.

#39 $\mathbf{S^2LM}$: Towards Semantic Steganography via Large Language Models

著者: Huanqi Wu, Huangbiao Xu, Runfeng Xie, Jiaxin Cai, Kaixin Zhang, Xiao Ke

公開日: Thu, 08 Jan 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.05319

要約:
Despite remarkable progress in steganography, embedding semantically rich, sentence-level information into carriers remains a challenging problem. In this work, we present a novel concept of Semantic Steganography, which aims to hide semantically meaningful and structured content, such as sentences or paragraphs, in cover media. Based on this concept, we present Sentence-to-Image Steganography as an instance that enables the hiding of arbitrary sentence-level messages within a cover image. To accomplish this feat, we propose S^2LM: Semantic Steganographic Language Model, which leverages large language models (LLMs) to embed high-level textual information into images. Unlike traditional bit-level approaches, S^2LM redesigns the entire pipeline, involving the LLM throughout the process to enable the hiding and recovery of arbitrary sentences. Furthermore, we establish a benchmark named Invisible Text (IVT), comprising a diverse set of sentence-level texts as secret messages to evaluate semantic steganography methods. Experimental results demonstrate that S^2LM effectively enables direct sentence recovery beyond bit-level steganography. The source code and IVT dataset will be released soon.

cs.CR updates on arXiv.org

📋 論文タイトル一覧