arXiv論文一覧 - cs.CR updates on arXiv.org

#1 When Backdoors Go Beyond Triggers: Semantic Drift in Diffusion Models Under Encoder Attacks

backdoordiffusion

著者: Shenyang Chen, Liuwan Zhu

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.20193

要約:
Standard evaluations of backdoor attacks on text-to-image (T2I) models primarily measure trigger activation and visual fidelity. We challenge this paradigm, demonstrating that encoder-side poisoning induces persistent, trigger-free semantic corruption that fundamentally reshapes the representation manifold. We trace this vulnerability to a geometric mechanism: a Jacobian-based analysis reveals that backdoors act as low-rank, target-centered deformations that amplify local sensitivity, causing distortion to propagate coherently across semantic neighborhoods. To rigorously quantify this structural degradation, we introduce SEMAD (Semantic Alignment and Drift), a diagnostic framework that measures both internal embedding drift and downstream functional misalignment. Our findings, validated across diffusion and contrastive paradigms, expose the deep structural risks of encoder poisoning and highlight the necessity of geometric audits beyond simple attack success rates.

#2 OpenPort Protocol: A Security Governance Specification for AI Agent Tool Access

agent

著者: Genliang Zhu, Chu Wang, Ziyuan Wang, Zhida Li, Qiang Li

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.20196

要約:
AI agents increasingly require direct, structured access to application data and actions, but production deployments still struggle to express and verify the governance properties that matter in practice: least-privilege authorization, controlled write execution, predictable failure handling, abuse resistance, and auditability. This paper introduces OpenPort Protocol (OPP), a governance-first specification for exposing application tools through a secure server-side gateway that is model- and runtime-neutral and can bind to existing tool ecosystems. OpenPort defines authorization-dependent discovery, stable response envelopes with machine-actionable \texttt{agent.*} reason codes, and an authorization model combining integration credentials, scoped permissions, and ABAC-style policy constraints. For write operations, OpenPort specifies a risk-gated lifecycle that defaults to draft creation and human review, supports time-bounded auto-execution under explicit policy, and enforces high-risk safeguards including preflight impact binding and idempotency. To address time-of-check/time-of-use drift in delayed approval flows, OpenPort also specifies an optional State Witness profile that revalidates execution-time preconditions and fails closed on state mismatch. Operationally, the protocol requires admission control (rate limits/quotas) with stable 429 semantics and structured audit events across allow/deny/fail paths so that client recovery and incident analysis are deterministic. We present a reference runtime and an executable governance toolchain (layered conformance profiles, negative security tests, fuzz/abuse regression, and release-gate scans) and evaluate the core profile at a pinned release tag using artifact-based, externally reproducible validation.

#3 Evaluating the Reliability of Digital Forensic Evidence Discovered by Large Language Model: A Case Study

著者: Jeel Piyushkumar Khatiwala, Daniel Kwaku Ntiamoah Addai, Weifeng Xu

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.20202

要約:
The growing reliance on AI-identified digital evidence raises significant concerns about its reliability, particularly as large language models (LLMs) are increasingly integrated into forensic investigations. This paper proposes a structured framework that automates forensic artifact extraction, refines data through LLM-driven analysis, and validates results using a Digital Forensic Knowledge Graph (DFKG). Evaluated on a 13 GB forensic image dataset containing 61 applications, 2,864 databases, and 5,870 tables, the framework ensures artifact traceability and evidentiary consistency through deterministic Unique Identifiers (UIDs) and forensic cross-referencing. We propose this methodology to address challenges in ensuring the credibility and forensic integrity of AI-identified evidence, reducing classification errors, and advancing scalable, auditable methodologies. A comprehensive case study on this dataset demonstrates the framework's effectiveness, achieving over 95 percent accuracy in artifact extraction, strong support of chain-of-custody adherence, and robust contextual consistency in forensic relationships. Key results validate the framework's ability to enhance reliability, reduce errors, and establish a legally sound paradigm for AI-assisted digital forensics.

#4 Right to History: A Sovereignty Kernel for Verifiable AI Agent Execution

agent

著者: Jing Zhang

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.20214

要約:
AI agents increasingly act on behalf of humans, yet no existing system provides a tamper-evident, independently verifiable record of what they did. As regulations such as the EU AI Act begin mandating automatic logging for high-risk AI systems, this gap carries concrete consequences -- especially for agents running on personal hardware, where no centralized provider controls the log. Extending Floridi's informational rights framework from data about individuals to actions performed on their behalf, this paper proposes the Right to History: the principle that individuals are entitled to a complete, verifiable record of every AI agent action on their own hardware. The paper formalizes this principle through five system invariants with structured proof sketches, and implements it in PunkGo, a Rust sovereignty kernel that unifies RFC 6962 Merkle tree audit logs, capability-based isolation, energy-budget governance, and a human-approval mechanism. Adversarial testing confirms all five invariants hold. Performance evaluation shows sub-1.3 ms median action latency, ~400 actions/sec throughput, and 448-byte Merkle inclusion proofs at 10,000 log entries.

#5 The TCF doesn't really A(A)ID -- Automatic Privacy Analysis and Legal Compliance of TCF-based Android Applications

privacy

著者: Victor Morel, Cristiana Santos, Pontus Carlsson, Joel Ahlinder, Romaric Duvignau

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.20222

要約:
The Transparency and Consent Framework (TCF), developed by the Interactive Advertising Bureau (IAB) Europe, provides a de facto standard for requesting, recording, and managing user consent from European end-users. This framework has previously been found to infringe European data protection law and has subsequently been regularly updated. Previous research on the TCF focused exclusively on web contexts, with no attention given to its implementation in mobile applications. No work has systematically studied the privacy implications of the TCF on Android apps. To address this gap, we investigate the prevalence of the TCF in popular Android apps from the Google Play Store, and assess whether these apps respect users' consent banner choices. By scraping and downloading 4482 of the most popular Google Play Store apps on an emulated Android device, we automatically determine which apps use the TCF, automatically interact with consent banners, and analyze the apps' traffic in two different stages, passive (post choices) and active (during banner interaction and post choices). We found that 576 (12.85%) of the 4482 downloadable apps in our dataset implemented the TCF, and we identified potential privacy violations within this subset. In 15 (2.6%) of these apps, users' choices are stored only when consent is granted. Users who refuse consent are shown the consent banner again each time they launch the app. Network traffic analysis conducted during the passive stage reveals that 66.2% of the analyzed TCF-based apps share personal data, through the Android Advertising ID (AAID), in the absence of a lawful basis for processing. 55.3% of apps analyzed during the active stage share AAID before users interact with the apps' consent banners, violating the prior consent requirement.

#6 CryptRISC: A Secure RISC-V Processor for High-Performance Cryptography with Power Side-Channel Protection

著者: Amisha Srivastava, Muskan Porwal, Kanad Basu

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.20285

要約:
Cryptographic computations are fundamental to modern computing, ensuring data confidentiality and integrity. However, these operations are highly vulnerable to power side-channel attacks that exploit variations in power consumption to leak sensitive information. Masking is a widely used countermeasure, yet software-based techniques often introduce significant performance overhead and implementation complexity, while fixed-function hardware masking lacks flexibility across diverse cryptographic algorithms. In this paper, we present CryptRISC, the first RISC-V-based processor that combines cryptographic acceleration with hardware-level power side-channel resistance through an ISA-driven operand masking framework. Our design extends the CVA6 core with 64-bit RISC-V Scalar Cryptography Extensions and introduces two microarchitectural components: a Field Detection Layer, which identifies the dominant algebraic field of each cryptographic instruction, and a Masking Control Unit, which applies field-aware operand randomization at runtime. This enables dynamic selection of Boolean, affine, or arithmetic masking schemes based on instruction semantics, providing optimized protection across algorithms including AES, SHA-256, SHA-512, SM3, and SM4. Unlike prior approaches relying on static masking logic or software instrumentation, our method performs operand masking transparently within the execution pipeline without modifying instruction encoding. Experimental results show speedups up to 6.80$\times$ over baseline software implementations, with only a 1.86% hardware overhead relative to the baseline CVA6 core, confirming the efficiency and practicality of CryptRISC.

#7 Understanding Human-AI Collaboration in Cybersecurity Competitions

著者: Tingxuan Tang, Nicolas Janis, Kalyn Asher Montague, Kevin Eykholt, Dhilung Kirat, Youngja Park, Jiyong Jang, Adwait Nadkarni, Yue Xiao

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.20446

要約:
Capture-the-Flag (CTF) competitions are increasingly becoming a testbed for evaluating AI capabilities at solving security tasks, due to the controlled environments and objective success criteria. Existing evaluations have focused on how successful AI is at solving CTF challenges in isolation from human CTF players. As AI usage increases in both academic and industrial settings, it is equally likely that human players may collaborate with AI agents to solve challenges. This possibility exposes a key knowledge gap: how do humans perceive AI CTF assistance; when assistance is provided, how do they collaborate and is it effective with respect to human performance; how do humans assisted by AI compare to the performance of fully autonomous AI agents on the same challenges. We address this gap with the first empirical study of AI assistance in a live, onsite CTF. In a study with 41 participants, we qualitatively study (i) how participants' perception, trust, and expectations shift before versus after hands-on AI use, and (ii) how participants collaborate with an instrumented AI agent. Moreover, we also (iii) benchmark four autonomous AI agents on the same fresh challenge set to compare outcomes with human teams and analyze agent trajectories. We find that, as the competition progresses, teams increasingly delegate larger subtasks to the AI, giving it more agency. Interestingly, CTF challenges solving rates are often constrained not by model's reasoning capabilities, but rather by the human players: ineffective prompting and poor context specification become the primary bottleneck. Remarkably, autonomous agents that self-direct their prompting and tool use bypass this bottleneck and outperform most human teams, coming in second overall in the competition. We conclude with implications for the future design of CTF challenges and for building effective human-in-the-loop AI systems for security.

#8 Towards Secure and Efficient DNN Accelerators via Hardware-Software Co-Design

著者: Wei Xuan, Zihao Xuan, Rongliang Fu, Ning Lin, Kwunhang Wong, Zikang Yuan, Lang Feng, Zhongrui Wang, Tsung-Yi Ho, Yuzhong Jiao, Luhong Liang

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.20521

要約:
The rapid deployment of deep neural network (DNN) accelerators in safety-critical domains such as autonomous vehicles, healthcare systems, and financial infrastructure necessitates robust mechanisms to safeguard data confidentiality and computational integrity. Existing security solutions for DNN accelerators, however, suffer from excessive hardware resource demands and frequent off-chip memory access overheads, which degrade performance and scalability. To address these challenges, this paper presents a secure and efficient memory protection framework for DNN accelerators with minimal overhead. First, we propose a bandwidth-aware cryptographic scheme that adapts encryption granularity based on memory traffic patterns, striking a balance between security and resource efficiency. Second, we observe that both the overlapping regions in the intra-layer tiling's sliding window pattern and those resulting from inter-layer tiling strategy discrepancies introduce substantial redundant memory accesses and repeated computational overhead in cryptography. Third, we introduce a multi-level authentication mechanism that effectively eliminates unnecessary off-chip memory accesses, enhancing performance and energy efficiency. Experimental results show that this work decreases performance overhead by over 12% and achieves 87% energy efficiency improvement for both server and edge neural processing units (NPUs), while ensuring robust scalability.

#9 OptiLeak: Efficient Prompt Reconstruction via Reinforcement Learning in Multi-tenant LLM Services

privacy

著者: Longxiang Wang, Xiang Zheng, Xuhao Zhang, Yao Zhang, Ye Wu, Cong Wang

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.20595

要約:
Multi-tenant LLM serving frameworks widely adopt shared Key-Value caches to enhance efficiency. However, this creates side-channel vulnerabilities enabling prompt leakage attacks. Prior studies identified these attack surfaces yet focused on expanding attack vectors rather than optimizing attack performance, reporting impractically high attack costs that underestimate the true privacy risk. We propose OptiLeak, a reinforcement learning-enhanced framework that maximizes prompt reconstruction efficiency through two-stage fine-tuning. Our key insight is that domain-specific ``hard tokens'' -- terms difficult to predict yet carrying sensitive information -- can be automatically identified via likelihood ranking and used to construct preference pairs for Direct Preference Optimization, eliminating manual annotation. This enables effective preference alignment while avoiding the overfitting issues of extended supervised fine-tuning. Evaluated on three benchmarks spanning medical and financial domains, OptiLeak achieves up to $12.48\times$ reduction in average requests per token compared to baseline approaches, with consistent improvements across model scales from 3B to 14B parameters. Our findings demonstrate that cache-based prompt leakage poses a more severe threat than previously reported, underscoring the need for robust cache isolation in production deployments.

#10 Post-Quantum Sanitizable Signatures from McEliece-Based Chameleon Hashing

著者: Shahzad Ahmad, Stefan Rass, Zahra Seyedi

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.20657

要約:
We introduce a novel post-quantum sanitizable signature scheme constructed upon a chameleon hash function derived from the McEliece cryptosystem. In this design, the designated sanitizer possesses the inherent trapdoor of a Goppa code, which facilitates controlled collision-finding via Patterson decoding. This mechanism enables authorized modification of specific message blocks while ensuring all other content remains immutably bound. We provide formal security definitions and rigorous proofs of existential unforgeability and immutability, grounded in the hardness of syndrome decoding in the random-oracle model, where a robust random oracle thwarts trivial linear hash collisions. A key innovation lies in our precise characterization of the transparency property: by imposing a specific weight constraint on the randomizers generated by the signer, we achieve perfect transparency, rendering sanitized signatures indistinguishable from freshly signed ones. This work establishes the first transparent, code-based, post-quantum sanitizable signature scheme, offering strong theoretical guarantees and a pathway for practical deployment in long-term secure applications.

#11 ICSSPulse: A Modular LLM-Assisted Platform for Industrial Control System Penetration Testing

著者: Michail Takaronis, Athanasia Kollarou, Vyron Kampourakis, Vasileios Gkioulos, Sokratis Katsikas

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.20663

要約:
It is well established that industrial control systems comprise the operational backbone of modern critical infrastructures, yet their increasing connectivity exposes them to cyber threats that are difficult to study and remedy safely under real-time operational conditions. In this paper, we present ICSSPulse, an open-source, modular, and extensible penetration testing platform designed for the security assessment of ICS communication protocols. To the best of our knowledge, ICSSPulse is the first web-based platform that unifies network scanning, protocol-aware Modbus and OPC~UA interaction, and Large Language Model (LLM)-assisted reporting within a single, lightweight ecosystem. Our platform provides a user-friendly graphical interface that orchestrates enumeration, exploitation, and reporting activities over simulated industrial services, enabling safe and reproducible experimentation. It supports protocol-level discovery, asset enumeration, and controlled read/write interactions, while preserving protocol fidelity and operational transparency. Experimental evaluation using synthetic Modbus test servers, a Factory I/O water treatment scenario, and a custom OPC~UA production-line model demonstrated ICSSPulse's potential to discover active industrial services, enumerate process-relevant assets, and manipulate process variables. A key contribution of this work lies in the integration of an LLM-assisted reporting module that automatically translates technical findings into structured executive and technical reports, with mitigation guidance informed by the ICS MITRE ATT&CK ICS matrix.

#12 Vanishing Watermarks: Diffusion-Based Image Editing Undermines Robust Invisible Watermarking

intellectual propertydiffusion

著者: Fan Guo, Jiyu Kang, Qi Ming, Emily Davis, Finn Carter

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.20680

要約:
Robust invisible watermarking schemes aim to embed hidden information into images such that the watermark survives common manipulations. However, powerful diffusion-based image generation and editing techniques now pose a new threat to these watermarks. In this paper, we present a comprehensive theoretical and empirical analysis demonstrating that diffusion models can effectively erase robust watermarks even when those watermarks were designed to withstand conventional distortions. We show that a diffusion-driven image regeneration process, which leverages generative models to recreate an image, can remove embedded watermarks while preserving the image's perceptual content. Furthermore, we introduce a guided diffusion-based attack that explicitly targets the embedded watermark signal during generation, significantly degrading watermark detectability. Theoretically, we prove that as an image undergoes sufficient diffusion transformations, the mutual information between the watermarked image and the hidden payload approaches zero, leading to inevitable decoding failure. Experimentally, we evaluate multiple state-of-the-art watermarking methods (including deep learning-based schemes like StegaStamp, TrustMark, and VINE) and demonstrate that diffusion edits yield near-zero watermark recovery rates after attack, while maintaining high visual fidelity of the regenerated images. Our findings reveal a fundamental vulnerability in current robust watermarking techniques against generative model-based edits, underscoring the need for new strategies to ensure watermark resilience in the era of powerful diffusion models.

#13 AdapTools: Adaptive Tool-based Indirect Prompt Injection Attacks on Agentic LLMs

agent

著者: Che Wang, Jiaming Zhang, Ziqi Zhang, Zijie Wang, Yinghui Wang, Jianbo Gao, Tao Wei, Zhong Chen, Wei Yang Bryan Lim

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.20720

要約:
The integration of external data services (e.g., Model Context Protocol, MCP) has made large language model-based agents increasingly powerful for complex task execution. However, this advancement introduces critical security vulnerabilities, particularly indirect prompt injection (IPI) attacks. Existing attack methods are limited by their reliance on static patterns and evaluation on simple language models, failing to address the fast-evolving nature of modern AI agents. We introduce AdapTools, a novel adaptive IPI attack framework that selects stealthier attack tools and generates adaptive attack prompts to create a rigorous security evaluation environment. Our approach comprises two key components: (1) Adaptive Attack Strategy Construction, which develops transferable adversarial strategies for prompt optimization, and (2) Attack Enhancement, which identifies stealthy tools capable of circumventing task-relevance defenses. Comprehensive experimental evaluation shows that AdapTools achieves a 2.13 times improvement in attack success rate while degrading system utility by a factor of 1.78. Notably, the framework maintains its effectiveness even against state-of-the-art defense mechanisms. Our method advances the understanding of IPI attacks and provides a useful reference for future research.

#14 A Secure and Interoperable Architecture for Electronic Health Record Access Control and Sharing

著者: Tayeb Kenaza, Islam Debicha, Youcef Fares, Mehdi Sehaki, Sami Messai

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.20830

要約:
Electronic Health Records (EHRs) store sensitive patient information, necessitating stringent access control and sharing mechanisms to uphold data security and comply with privacy regulations such as the General Data Protection Regulation (GDPR). In this paper, we propose a comprehensive architecture with a suite of efficient protocols that leverage the synergistic capabilities of the Blockchain and Interplanetary File System (IPFS) technologies to enable secure access control and sharing of EHRs. Our approach is based on a private blockchain, wherein smart contracts are deployed to enforce control exclusively by patients. By granting patients exclusive control over their EHRs, our solution ensures compliance with personal data protection laws and empowers individuals to manage their health information autonomously. Notably, our proposed architecture seamlessly integrates with existing health provider information systems, facilitating interoperability and effectively addressing security and data heterogeneity challenges. To demonstrate the effectiveness of our approach, we developed a prototype based on a private implementation of the Hyperledger platform, enabling the simulation of diverse scenarios involving access control and health data sharing among healthcare practitioners. Our experimental results demonstrate the scalability of our solution, thereby substantiating its efficacy and robustness in real-world healthcare settings.

#15 SoK: Agentic Skills -- Beyond Tool Use in LLM Agents

agent

著者: Yanna Jiang, Delong Li, Haiyu Deng, Baihe Ma, Xu Wang, Qin Wang, Guangsheng Yu

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.20867

要約:
Agentic systems increasingly rely on reusable procedural capabilities, \textit{a.k.a., agentic skills}, to execute long-horizon workflows reliably. These capabilities are callable modules that package procedural knowledge with explicit applicability conditions, execution policies, termination criteria, and reusable interfaces. Unlike one-off plans or atomic tool calls, skills operate (and often do well) across tasks. This paper maps the skill layer across the full lifecycle (discovery, practice, distillation, storage, composition, evaluation, and update) and introduces two complementary taxonomies. The first is a system-level set of \textbf{seven design patterns} capturing how skills are packaged and executed in practice, from metadata-driven progressive disclosure and executable code skills to self-evolving libraries and marketplace distribution. The second is an orthogonal \textbf{representation $\times$ scope} taxonomy describing what skills \emph{are} (natural language, code, policy, hybrid) and what environments they operate over (web, OS, software engineering, robotics). We analyze the security and governance implications of skill-based agents, covering supply-chain risks, prompt injection via skill payloads, and trust-tiered execution, grounded by a case study of the ClawHavoc campaign in which nearly 1{,}200 malicious skills infiltrated a major agent marketplace, exfiltrating API keys, cryptocurrency wallets, and browser credentials at scale. We further survey deterministic evaluation approaches, anchored by recent benchmark evidence that curated skills can substantially improve agent success rates while self-generated skills may degrade them. We conclude with open challenges toward robust, verifiable, and certifiable skills for real-world autonomous agents.

#16 CodeHacker: Automated Test Case Generation for Detecting Vulnerabilities in Competitive Programming Solutions

著者: Jingwei Shi, Xinxiang Yin, Jing Huang, Jinman Zhao, Shengyu Tao

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.20213

要約:
The evaluation of Large Language Models (LLMs) for code generation relies heavily on the quality and robustness of test cases. However, existing benchmarks often lack coverage for subtle corner cases, allowing incorrect solutions to pass. To bridge this gap, we propose CodeHacker, an automated agent framework dedicated to generating targeted adversarial test cases that expose latent vulnerabilities in program submissions. Mimicking the hack mechanism in competitive programming, CodeHacker employs a multi-strategy approach, including stress testing, anti-hash attacks, and logic-specific targeting to break specific code submissions. To ensure the validity and reliability of these attacks, we introduce a Calibration Phase, where the agent iteratively refines its own Validator and Checker via self-generated adversarial probes before evaluating contestant code.Experiments demonstrate that CodeHacker significantly improves the True Negative Rate (TNR) of existing datasets, effectively filtering out incorrect solutions that were previously accepted. Furthermore, generated adversarial cases prove to be superior training data, boosting the performance of RL-trained models on benchmarks like LiveCodeBench.

#17 Personal Information Parroting in Language Models

著者: Nishant Subramani, Kshitish Ghate, Mona Diab

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.20580

要約:
Modern language models (LM) are trained on large scrapes of the Web, containing millions of personal information (PI) instances, many of which LMs memorize, increasing privacy risks. In this work, we develop the regexes and rules (R&R) detector suite to detect email addresses, phone numbers, and IP addresses, which outperforms the best regex-based PI detectors. On a manually curated set of 483 instances of PI, we measure memorization: finding that 13.6% are parroted verbatim by the Pythia-6.9b model, i.e., when the model is prompted with the tokens that precede the PI in the original document, greedy decoding generates the entire PI span exactly. We expand this analysis to study models of varying sizes (160M-6.9B) and pretraining time steps (70k-143k iterations) in the Pythia model suite and find that both model size and amount of pretraining are positively correlated with memorization. Even the smallest model, Pythia-160m, parrots 2.7% of the instances exactly. Consequently, we strongly recommend that pretraining datasets be aggressively filtered and anonymized to minimize PI parroting.

#18 Is the Trigger Essential? A Feature-Based Triggerless Backdoor Attack in Vertical Federated Learning

backdoor

著者: Yige Liu, Yiwei Lou, Che Wang, Yongzhi Cao, Hanpin Wang

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.20593

要約:
As a distributed collaborative machine learning paradigm, vertical federated learning (VFL) allows multiple passive parties with distinct features and one active party with labels to collaboratively train a model. Although it is known for the privacy-preserving capabilities, VFL still faces significant privacy and security threats from backdoor attacks. Existing backdoor attacks typically involve an attacker implanting a trigger into the model during the training phase and executing the attack by adding the trigger to the samples during the inference phase. However, in this paper, we find that triggers are not essential for backdoor attacks in VFL. In light of this, we disclose a new backdoor attack pathway in VFL by introducing a feature-based triggerless backdoor attack. This attack operates under a more stringent security assumption, where the attacker is honest-but-curious rather than malicious during the training phase. It comprises three modules: label inference for the targeted backdoor attack, poison generation with amplification and perturbation mechanisms, and backdoor execution to implement the attack. Extensive experiments on five benchmark datasets demonstrate that our attack outperforms three baseline backdoor attacks by 2 to 50 times while minimally impacting the main task. Even in VFL scenarios with 32 passive parties and only one set of auxiliary data, our attack maintains high performance. Moreover, when confronted with distinct defense strategies, our attack remains largely unaffected and exhibits strong robustness. We hope that the disclosure of this triggerless backdoor attack pathway will encourage the community to revisit security threats in VFL scenarios and inspire researchers to develop more robust and practical defense strategies.

#19 ICON: Indirect Prompt Injection Defense for Agents based on Inference-Time Correction

agent

著者: Che Wang, Fuyao Zhang, Jiaming Zhang, Ziqi Zhang, Yinghui Wang, Longtao Huang, Jianbo Gao, Zhong Chen, Wei Yang Bryan Lim

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.20708

要約:
Large Language Model (LLM) agents are susceptible to Indirect Prompt Injection (IPI) attacks, where malicious instructions in retrieved content hijack the agent's execution. Existing defenses typically rely on strict filtering or refusal mechanisms, which suffer from a critical limitation: over-refusal, prematurely terminating valid agentic workflows. We propose ICON, a probing-to-mitigation framework that neutralizes attacks while preserving task continuity. Our key insight is that IPI attacks leave distinct over-focusing signatures in the latent space. We introduce a Latent Space Trace Prober to detect attacks based on high intensity scores. Subsequently, a Mitigating Rectifier performs surgical attention steering that selectively manipulate adversarial query key dependencies while amplifying task relevant elements to restore the LLM's functional trajectory. Extensive evaluations on multiple backbones show that ICON achieves a competitive 0.4% ASR, matching commercial grade detectors, while yielding a over 50% task utility gain. Furthermore, ICON demonstrates robust Out of Distribution(OOD) generalization and extends effectively to multi-modal agents, establishing a superior balance between security and efficiency.

#20 PackMonitor: Enabling Zero Package Hallucinations Through Decoding-Time Monitoring

著者: Xiting Liu, Yuetong Liu, Yitong Zhang, Jia Li, Shi-Min Hu

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.20717

要約:
As Large Language Models (LLMs) are increasingly integrated into software development workflows, their trustworthiness has become a critical concern. However, in dependency recommendation scenarios, the reliability of LLMs is undermined by widespread package hallucinations, where models often recommend hallucinated packages. Recent studies have proposed a range of approaches to mitigate this issue. Nevertheless, existing approaches typically merely reduce hallucination rates rather than eliminate them, leaving persistent software security risks. In this work, we argue that package hallucinations are theoretically preventable based on the key insight that package validity is decidable through finite and enumerable authoritative package lists. Building on this, we propose PackMonitor, the first approach capable of fundamentally eliminating package hallucinations by continuously monitoring the model's decoding process and intervening when necessary. To implement this in practice, PackMonitor addresses three key challenges: (1) determining when to trigger intervention via a Context-Aware Parser that continuously monitors model outputs and selectively activates intervening only during installation command generation; (2) resolving how to intervene by employing a Package-Name Intervenor that strictly limits the decoding space to an authoritative package list; and (3) ensuring monitoring efficiency through a DFA-Caching Mechanism that enables scalability to millions of packages with negligible overhead. Extensive experiments on five widely used LLMs demonstrate that PackMonitor is a training-free, plug-and-play solution that consistently reduces package hallucination rates to zero while maintaining low-latency inference and preserving original model capabilities.

#21 "Are You Sure?": An Empirical Study of Human Perception Vulnerability in LLM-Driven Agentic Systems

agent

著者: Xinfeng Li, Shenyu Dai, Kelong Zheng, Yue Xiao, Gelei Deng, Wei Dong, Xiaofeng Wang

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.21127

要約:
Large language model (LLM) agents are rapidly becoming trusted copilots in high-stakes domains like software development and healthcare. However, this deepening trust introduces a novel attack surface: Agent-Mediated Deception (AMD), where compromised agents are weaponized against their human users. While extensive research focuses on agent-centric threats, human susceptibility to deception by a compromised agent remains unexplored. We present the first large-scale empirical study with 303 participants to measure human susceptibility to AMD. This is based on HAT-Lab (Human-Agent Trust Laboratory), a high-fidelity research platform we develop, featuring nine carefully crafted scenarios spanning everyday and professional domains (e.g., healthcare, software development, human resources). Our 10 key findings reveal significant vulnerabilities and provide future defense perspectives. Specifically, only 8.6% of participants perceive AMD attacks, while domain experts show increased susceptibility in certain scenarios. We identify six cognitive failure modes in users and find that their risk awareness often fails to translate to protective behavior. The defense analysis reveals that effective warnings should interrupt workflows with low verification costs. With experiential learning based on HAT-Lab, over 90% of users who perceive risks report increased caution against AMD. This work provides empirical evidence and a platform for human-centric agent security research.

#22 Usability Study of Security Features in Programmable Logic Controllers

著者: Karen Li, Kopo M. Ramokapane, Awais Rashid

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2208.02500

要約:
Programmable Logic Controllers (PLCs) drive industrial processes critical to society, for example, water treatment and distribution, electricity and fuel networks. Search engines, e.g., Shodan, have highlighted that PLCs are often left exposed to the Internet, one of the main reasons being the misconfigurations of security settings. This leads to the question - why do these misconfigurations occur and, specifically, whether usability of security controls plays a part. To date, the usability of configuring PLC security mechanisms has not been studied. We present the first investigation through a task based study and subsequent semi-structured interviews (N=19). We explore the usability of PLC connection configurations and two key security mechanisms (i.e., access levels and user administration). We find that the use of unfamiliar labels, layouts and misleading terminology exacerbates an already complex process of configuring security mechanisms. Our results uncover various misperceptions about the security controls and how design constraints, e.g., safety and lack of regular updates due to the long-term nature of such systems, provide significant challenges to the realization of modern HCI and usability principles. Based on these findings, we provide design recommendations to bring usable security in industrial settings at par with its IT counterpart.

#23 Watermarking Language Models with Error Correcting Codes

intellectual property

著者: Patrick Chao, Yan Sun, Edgar Dobriban, Hamed Hassani

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2406.10281

要約:
Recent progress in large language models enables the creation of realistic machine-generated content. Watermarking is a promising approach to distinguish machine-generated text from human text, embedding statistical signals in the output that are ideally undetectable to humans. We propose a watermarking framework that encodes such signals through an error correcting code. Our method, termed robust binary code (RBC) watermark, introduces no noticeable degradation in quality. We evaluate our watermark on base and instruction fine-tuned models and find that our watermark is robust to edits, deletions, and translations. We provide an information-theoretic perspective on watermarking, a powerful statistical test for detection and for generating $p$-values, and theoretical guarantees. Our empirical findings suggest our watermark is fast, powerful, and robust, comparing favorably to the state-of-the-art.

#24 MARVEL: Multi-Agent RTL Vulnerability Extraction using Large Language Models

agent

著者: Luca Collini, Baleegh Ahmad, Joey Ah-kiow, Ramesh Karri

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2505.11963

要約:
Hardware security verification is a challenging and time-consuming task. Design engineers may use formal verification, linting, and functional simulation tests, coupled with analysis and a deep understanding of the hardware design being inspected. Large Language Models (LLMs) have been used to assist during this task, either directly or in conjunction with existing tools. We improve the state of the art by proposing MARVEL, a multi-agent LLM framework for a unified approach to decision-making, tool use, and reasoning. MARVEL mimics the cognitive process of a designer looking for security vulnerabilities in RTL code. It consists of a supervisor agent that devises the security policy of the system-on-chips (SoCs) using its security documentation. It delegates tasks to validate the security policy to individual executor agents. Each executor agent carries out its assigned task using a particular strategy. Each executor agent may use one or more tools to identify potential security bugs in the design and send the results back to the supervisor agent for further analysis and confirmation. MARVEL includes executor agents that leverage formal tools, linters, simulation tests, LLM-based detection schemes, and static analysis-based checks. We test our approach on a known buggy SoC based on OpenTitan from the Hack@DATE competition. We find that of the 51 issues reported by MARVEL, 19 are valid security vulnerabilities, 14 are concrete warnings, and 18 are hallucinated reports.

#25 A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness

著者: Xuan Luo, Yue Wang, Zefeng He, Geng Tu, Jing Li, Ruifeng Xu

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2509.14297

要約:
This study reveals a critical safety blind spot in modern LLMs: learning-style queries, which closely resemble ordinary educational questions, can reliably elicit harmful responses. The learning-style queries are constructed by a novel reframing paradigm: HILL (Hiding Intention by Learning from LLMs). The deterministic, model-agnostic reframing framework is composed of 4 conceptual components: 1) key concept, 2) exploratory transformation, 3) detail-oriented inquiry, and optionally 4) hypotheticality. Further, new metrics are introduced to thoroughly evaluate the efficiency and harmfulness of jailbreak methods. Experiments on the AdvBench dataset across a wide range of models demonstrate HILL's strong generalizability. It achieves top attack success rates on the majority of models and across malicious categories while maintaining high efficiency with concise prompts. On the other hand, results of various defense methods show the robustness of HILL, with most defenses having mediocre effects or even increasing the attack success rates. In addition, the assessment of defenses on the constructed safe prompts reveals inherent limitations of LLMs' safety mechanisms and flaws in the defense methods. This work exposes significant vulnerabilities of safety measures against learning-style elicitation, highlighting a critical challenge of fulfilling both helpfulness and safety alignments.

#26 GPM: The Gaussian Pancake Mechanism for Planting Undetectable Backdoors in Differential Privacy

privacybackdoor

著者: Haochen Sun, Xi He

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2509.23834

要約:
Differential privacy (DP) has become the gold standard for preserving individual privacy in data analysis. However, an implicit yet fundamental assumption underlying these rigorous privacy guarantees is the correct implementation and execution of DP mechanisms. Several incidents of unintended privacy loss have occurred due to numerical issues and inappropriate configurations of DP software, which have been successfully exploited in privacy attacks. To better understand the seriousness of defective DP software, we ask the following question: is it possible to elevate these passive defects into active privacy attacks while maintaining covertness? To address this question, we present the Gaussian pancake mechanism (GPM), a novel mechanism that is computationally indistinguishable from the widely used Gaussian mechanism (GM), yet exhibits arbitrarily weaker statistical DP guarantees. This unprecedented separation enables a new class of backdoor attacks: by indistinguishably passing off as the authentic GM, GPM can covertly degrade statistical privacy. Unlike the unintentional privacy loss caused by GM's numerical issues, GPM is an adversarial yet undetectable backdoor attack against data privacy. We formally prove GPM's covertness, characterize its statistical leakage, and demonstrate a concrete distinguishing attack that can achieve near-perfect success rates under suitable parameter choices, both theoretically and empirically. Our results underscore the importance of using transparent, open-source DP libraries and highlight the need for rigorous scrutiny and formal verification of DP implementations to prevent subtle, undetectable privacy compromises in real-world systems.

#27 Breaking Agent Backbones: Evaluating the Security of Backbone LLMs in AI Agents

agent

著者: Julia Bazinska, Max Mathys, Francesco Casucci, Mateo Rojas-Carulla, Xander Davies, Alexandra Souly, Niklas Pfister

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2510.22620

要約:
AI agents powered by large language models (LLMs) are being deployed at scale, yet we lack a systematic understanding of how the choice of backbone LLM affects agent security. The non-deterministic sequential nature of AI agents complicates security modeling, while the integration of traditional software with AI components entangles novel LLM vulnerabilities with conventional security risks. Existing frameworks only partially address these challenges as they either capture specific vulnerabilities only or require modeling of complete agents. To address these limitations, we introduce threat snapshots: a framework that isolates specific states in an agent's execution flow where LLM vulnerabilities manifest, enabling the systematic identification and categorization of security risks that propagate from the LLM to the agent level. We apply this framework to construct the $b^3$ benchmark, a security benchmark based on 194,331 unique crowdsourced adversarial attacks. We then evaluate 34 popular LLMs with it, revealing, among other insights, that enhanced reasoning capabilities improve security, while model size does not correlate with security. We release our benchmark, dataset, and evaluation code to facilitate widespread adoption by LLM providers and practitioners, offering guidance for agent developers and incentivizing model developers to prioritize backbone security improvements.

#28 HYDRA: Unearthing "Black Swan" Vulnerabilities in LEO Satellite Networks

著者: Bintao Yuan, Mingsheng Tang, Binbin Ge, Hongbin Luo, Zijie Yan

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.06612

要約:
As Low Earth Orbit (LEO) become mega-constellations critical infrastructure, attacks targeting them have grown in number and range. The security analysis of LEO constellations faces a fundamental paradigm gap: traditional topology-centric methods fail to capture systemic risks arising from dynamic load imbalances and high-order dependencies, which can transform localized failures into network-wide cascades. To address this, we propose HYDRA, a hypergraph-based dynamic risk analysis framework. Its core is a novel metric, Hyper-Bridge Centrality (HBC), which quantifies node criticality via a load-to-redundancy ratio within dependency structures. A primary challenge to resilience: the most critical vulnerabilities are not in the densely connected satellite core, but in the seemingly marginal ground-space interfaces. These are the system's "Black Swan" nodes--topologically peripheral yet structurally lethal. We validate this through extensive simulations using realistic StarLink TLE data and population-based gravity model. Experiments demonstrate that HBC consistently outperforms traditional metrics, identifying critical failure points that surpass the structural damage potential of even betweenness centrality. This work shifts the security paradigm from connectivity to structural stress, demonstrating that securing the network edge is paramount and necessitates a fundamental redesign of redundancy strategies.

#29 MalTool: Malicious Tool Attacks on LLM Agents

agent

著者: Yuepeng Hu, Yuqi Jia, Mengyuan Li, Dawn Song, Neil Gong

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.12194

要約:
In a malicious tool attack, an attacker uploads a malicious tool to a distribution platform; once a user installs the tool and the LLM agent selects it during task execution, the tool can compromise the user's security and privacy. Prior work primarily focuses on manipulating tool names and descriptions to increase the likelihood of installation by users and selection by LLM agents. However, a successful attack also requires embedding malicious behaviors in the tool's code implementation, which remains largely unexplored. In this work, we bridge this gap by presenting the first systematic study of malicious tool code implementations. We first propose a taxonomy of malicious tool behaviors based on the confidentiality-integrity-availability triad, tailored to LLM-agent settings. To investigate the severity of the risks posed by attackers exploiting coding LLMs to automatically generate malicious tools, we develop MalTool, a coding-LLM-based framework that synthesizes tools exhibiting specified malicious behaviors, either as standalone tools or embedded within otherwise benign implementations. To ensure functional correctness and structural diversity, MalTool leverages an automated verifier that validates whether generated tools exhibit the intended malicious behaviors and differ sufficiently from prior instances, iteratively refining generations until success. Our evaluation demonstrates that MalTool is highly effective even when coding LLMs are safety-aligned. Using MalTool, we construct two datasets of malicious tools: 1,200 standalone malicious tools and 5,287 real-world tools with embedded malicious behaviors. We further show that existing detection methods, including commercial malware detection approaches such as VirusTotal and methods tailored to the LLM-agent setting, exhibit limited effectiveness at detecting the malicious tools, highlighting an urgent need for new defenses.

#30 MCPShield: A Security Cognition Layer for Adaptive Trust Calibration in Model Context Protocol Agents

agent

著者: Zhenhong Zhou, Yuanhe Zhang, Hongwei Cai, Moayad Aloqaily, Ouns Bouachir, Linsey Pang, Prakhar Mehrotra, Kun Wang, Qingsong Wen

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.14281

要約:
The Model Context Protocol (MCP) standardizes tool use for LLM-based agents and enable third-party servers. This openness introduces a security misalignment: agents implicitly trust tools exposed by potentially untrusted MCP servers. However, despite its excellent utility, existing agents typically offer limited validation for third-party MCP servers. As a result, agents remain vulnerable to MCP-based attacks that exploit the misalignment between agents and servers throughout the tool invocation lifecycle. In this paper, we propose MCPShield as a plug-in security cognition layer that mitigates this misalignment and ensures agent security when invoking MCP-based tools. Drawing inspiration from human experience-driven tool validation, MCPShield assists agent forms security cognition with metadata-guided probing before invocation. Our method constrains execution within controlled boundaries while cognizing runtime events, and subsequently updates security cognition by reasoning over historical traces after invocation, building on human post-use reflection on tool behavior. Experiments demonstrate that MCPShield exhibits strong generalization in defending against six novel MCP-based attack scenarios across six widely used agentic LLMs, while avoiding false positives on benign servers and incurring low deployment overhead. Overall, our work provides a practical and robust security safeguard for MCP-based tool invocation in open agent ecosystems.

#31 Intent Laundering: AI Safety Datasets Are Not What They Seem

著者: Shahriar Golchin, Marc Wetter

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.16729

要約:
We systematically evaluate the quality of widely used AI safety datasets from two perspectives: in isolation and in practice. In isolation, we examine how well these datasets reflect real-world adversarial attacks based on three key properties: being driven by ulterior intent, well-crafted, and out-of-distribution. We find that these datasets overrely on "triggering cues": words or phrases with overt negative/sensitive connotations that are intended to trigger safety mechanisms explicitly, which is unrealistic compared to real-world attacks. In practice, we evaluate whether these datasets genuinely measure safety risks or merely provoke refusals through triggering cues. To explore this, we introduce "intent laundering": a procedure that abstracts away triggering cues from adversarial attacks (data points) while strictly preserving their malicious intent and all relevant details. Our results indicate that current AI safety datasets fail to faithfully represent real-world adversarial behavior due to their overreliance on triggering cues. Once these cues are removed, all previously evaluated "reasonably safe" models become unsafe, including Gemini 3 Pro and Claude Sonnet 3.7. Moreover, when intent laundering is adapted as a jailbreaking technique, it consistently achieves high attack success rates, ranging from 90% to over 98%, under fully black-box access. Overall, our findings expose a significant disconnect between how model safety is evaluated by existing datasets and how real-world adversaries behave.

#32 KUDA: Knowledge Unlearning by Deviating Representation for Large Language Models

privacy

著者: Ce Fang, Zhikun Zhang, Min Chen, Qing Liu, Lu Zhou, Zhe Liu, Yunjun Gao

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.19275

要約:
Large language models (LLMs) acquire a large amount of knowledge through pre-training on vast and diverse corpora. While this endows LLMs with strong capabilities in generation and reasoning, it amplifies risks associated with sensitive, copyrighted, or harmful content in training data. LLM unlearning, which aims to remove specific knowledge encoded within models, is a promising technique to reduce these risks. However, existing LLM unlearning methods often force LLMs to generate random or incoherent answers due to their inability to alter the encoded knowledge precisely. To achieve effective unlearning at the knowledge level of LLMs, we propose Knowledge Unlearning by Deviating representAtion (KUDA). We first utilize causal tracing to locate specific layers for target knowledge storage. We then design a new unlearning objective that induces the model's representations to deviate from its original position in the phase of knowledge removal, thus disrupting the ability to associate with the target knowledge. To resolve the optimization conflicts between forgetting and retention, we employ a relaxation null-space projection mechanism to mitigate the disruption to the representation space of retaining knowledge. Extensive experiments on representative benchmarks, WMDP and MUSE, demonstrate that KUDA outperforms most existing baselines by effectively balancing knowledge removal and model utility retention.

#33 Skill-Inject: Measuring Agent Vulnerability to Skill File Attacks

agent

著者: David Schmotz, Luca Beurer-Kellner, Sahar Abdelnabi, Maksym Andriushchenko

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.20156

要約:
LLM agents are evolving rapidly, powered by code execution, tools, and the recently introduced agent skills feature. Skills allow users to extend LLM applications with specialized third-party code, knowledge, and instructions. Although this can extend agent capabilities to new domains, it creates an increasingly complex agent supply chain, offering new surfaces for prompt injection attacks. We identify skill-based prompt injection as a significant threat and introduce SkillInject, a benchmark evaluating the susceptibility of widely-used LLM agents to injections through skill files. SkillInject contains 202 injection-task pairs with attacks ranging from obviously malicious injections to subtle, context-dependent attacks hidden in otherwise legitimate instructions. We evaluate frontier LLMs on SkillInject, measuring both security in terms of harmful instruction avoidance and utility in terms of legitimate instruction compliance. Our results show that today's agents are highly vulnerable with up to 80% attack success rate with frontier models, often executing extremely harmful instructions including data exfiltration, destructive action, and ransomware-like behavior. They furthermore suggest that this problem will not be solved through model scaling or simple input filtering, but that robust agent security will require context-aware authorization frameworks. Our benchmark is available at https://www.skill-inject.com/.

#34 Watermarking Degrades Alignment in Language Models: Analysis and Mitigation

intellectual property

著者: Apurv Verma, NhatHai Phan, Shubhendu Trivedi

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2506.04462

要約:
Watermarking has become a practical tool for tracing language model outputs, but it modifies token probabilities at inference time, which were carefully tuned by alignment training. This creates a tension: how do watermark-induced shifts interact with the procedures intended to make models safe and useful? Experiments on several contemporary models and two representative watermarking schemes reveal that watermarking induces a nontrivial, patterned yet model-specific shift in alignment. We see two failure modes: guard attenuation, where models become more helpful but less safe, and guard amplification, where refusals become overly conservative. These effects persist even after controlling for perplexity degradation, pointing to alignment-specific distortions, not just quality loss. We address this with Alignment Resampling (AR), a procedure that samples multiple watermarked outputs and selects the most aligned response according to an external reward model. Using standard results on the expected maximum of Gaussian random variables, we derive a theoretical lower bound showing that alignment gains grow sublogarithmically with sample size. In practice, sampling as few as two to four candidates largely restores unwatermarked alignment performance in truthfulness, safety, and helpfulness, without hurting watermark detection. This is the first empirical study of watermarking-alignment interactions; it shows that a simple inference-time fix can recover alignment.

#35 Multi-hop Deep Joint Source-Channel Coding with Deep Hash Distillation for Semantically Aligned Image Recovery

model extraction

著者: Didrik Bergstr\"om, Deniz G\"und\"uz, Onur G\"unl\"u

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2510.06868

要約:
We consider image transmission via deep joint source-channel coding (DeepJSCC) over multi-hop additive white Gaussian noise (AWGN) channels by training a DeepJSCC encoder-decoder pair with a pre-trained deep hash distillation (DHD) module to semantically cluster images, facilitating security-oriented applications through enhanced semantic consistency and improving the perceptual reconstruction quality. We train the DeepJSCC module to both reduce mean square error (MSE) and minimize cosine distance between DHD hashes of source and reconstructed images. Significantly improved perceptual quality as a result of semantic alignment is illustrated for different multi-hop settings, for which classical DeepJSCC may suffer from noise accumulation, measured by the learned perceptual image patch similarity (LPIPS) metric.

#36 Evaluating and Mitigating LLM-as-a-judge Bias in Communication Systems

著者: Jiaxin Gao, Chen Chen, Yanwen Jia, Xueluan Gong, Kwok-Yan Lam, Qian Wang

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2510.12462

要約:
Large Language Models (LLMs) are increasingly being used to autonomously evaluate the quality of content in communication systems, e.g., to assess responses in telecom customer support chatbots. However, the impartiality of these AI "judges" is not guaranteed, and any biases in their evaluation criteria could skew outcomes and undermine user trust. In this paper, we systematically investigate judgment biases in two LLM-as-a-judge models (i.e., GPT-Judge and JudgeLM) under the point-wise scoring setting, encompassing 11 types of biases that cover both implicit and explicit forms. We observed that state-of-the-art LLM judges demonstrate robustness to biased inputs, generally assigning them lower scores than the corresponding clean samples. Providing a detailed scoring rubric further enhances this robustness. We further found that fine-tuning an LLM on high-scoring yet biased responses can significantly degrade its performance, highlighting the risk of training on biased data. We also discovered that the judged scores correlate with task difficulty: a challenging dataset like GPQA yields lower average scores, whereas an open-ended reasoning dataset (e.g., JudgeLM-val) sees higher average scores. Finally, we proposed four potential mitigation strategies to ensure fair and reliable AI judging in practical communication scenarios.

#37 Defending Unauthorized Model Merging via Dual-Stage Weight Protection

著者: Wei-Jia Chen, Min-Yen Tsai, Cheng-Yi Lee, Chia-Mu Yu

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.11851

要約:
The rapid proliferation of pretrained models and open repositories has made model merging a convenient yet risky practice, allowing free-riders to combine fine-tuned models into a new multi-capability model without authorization. Such unauthorized model merging not only violates intellectual property rights but also undermines model ownership and accountability. To address this issue, we present MergeGuard, a proactive dual-stage weight protection framework that disrupts merging compatibility while maintaining task fidelity. In the first stage, we redistribute task-relevant information across layers via L2-regularized optimization, ensuring that important gradients are evenly dispersed. In the second stage, we inject structured perturbations to misalign task subspaces, breaking curvature compatibility in the loss landscape. Together, these stages reshape the model's parameter geometry such that merged models collapse into destructive interference while the protected model remains fully functional. Extensive experiments on both vision (ViT-L-14) and language (Llama2, Gemma2, Mistral) models demonstrate that MergeGuard reduces merged model accuracy by up to 90% with less than 1.5% performance loss on the protected model.

#38 What Matters For Safety Alignment?

著者: Xing Li, Hui-Ling Zhen, Lihao Yin, Xianzhi Yu, Zhenhua Dong, Mingxuan Yuan

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2601.03868

要約:
This paper presents a comprehensive empirical study on the safety alignment capabilities. We evaluate what matters for safety alignment in LLMs and LRMs to provide essential insights for developing more secure and reliable AI systems. We systematically investigate and compare the influence of six critical intrinsic model characteristics and three external attack techniques. Our large-scale evaluation is conducted using 32 recent, popular LLMs and LRMs across thirteen distinct model families, spanning a parameter scale from 3B to 235B. The assessment leverages five established safety datasets and probes model vulnerabilities with 56 jailbreak techniques and four CoT attack strategies, resulting in 4.6M API calls. Our key empirical findings are fourfold. First, we identify the LRMs GPT-OSS-20B, Qwen3-Next-80B-A3B-Thinking, and GPT-OSS-120B as the top-three safest models, which substantiates the significant advantage of integrated reasoning and self-reflection mechanisms for robust safety alignment. Second, post-training and knowledge distillation may lead to a systematic degradation of safety alignment. We thus argue that safety must be treated as an explicit constraint or a core optimization objective during these stages, not merely subordinated to the pursuit of general capability. Third, we reveal a pronounced vulnerability: employing a CoT attack via a response prefix can elevate the attack success rate by 3.34x on average and from 0.6% to 96.3% for Seed-OSS-36B-Instruct. This critical finding underscores the safety risks inherent in text-completion interfaces and features that allow user-defined response prefixes in LLM services, highlighting an urgent need for architectural and deployment safeguards. Fourth, roleplay, prompt injection, and gradient-based search for adversarial prompts are the predominant methodologies for eliciting unaligned behaviors in modern models.

#39 Analysis of Shuffling Beyond Pure Local Differential Privacy

privacy

著者: Shun Takagi, Seng Pei Liew

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2601.19154

要約:
Shuffling is a powerful way to amplify privacy of a local randomizer in private distributed data analysis. Most existing analyses of how shuffling amplifies privacy are based on the pure local differential privacy (DP) parameter $\varepsilon_0$. This paper raises the question of whether $\varepsilon_0$ adequately captures the privacy amplification. For example, since the Gaussian mechanism does not satisfy pure local DP for any finite $\varepsilon_0$, does it follow that shuffling yields weak amplification? To solve this problem, we revisit the privacy blanket bound of Balle et al. (the blanket divergence) and develop a direct asymptotic analysis that bypasses $\varepsilon_0$. Our key finding is that, asymptotically, the blanket divergence depends on the local mechanism only through a single scalar parameter $\chi$ and that this dependence is monotonic. Therefore, this parameter serves as a proxy for shuffling efficiency, which we call the shuffle index. By applying this analysis to both upper and lower bounds of the shuffled mechanism's privacy profile, we obtain a band for its privacy guarantee through shuffle indices. Furthermore, we derive a simple structural, necessary and sufficient condition on the local randomizer under which this band collapses asymptotically. $k$-RR families with $k\ge3$ satisfy this condition, while for generalized Gaussian mechanisms the condition may not hold but the resulting band remains tight. Finally, we complement the asymptotic theory with an FFT-based algorithm for computing the blanket divergence at finite $n$, which offers rigorously controlled relative error and near-linear running time in $n$, providing a practical numerical analysis for shuffle DP.

#40 How Well Can LLM Agents Simulate End-User Security and Privacy Attitudes and Behaviors?

privacyagent

著者: Yuxuan Li, Leyang Li, Hao-Ping Lee, Sauvik Das

公開日: Wed, 25 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.18464

要約:
A growing body of research assumes that large language model (LLM) agents can serve as proxies for how people form attitudes toward and behave in response to security and privacy (S&P) threats. If correct, these simulations could offer a scalable way to forecast S&P risks in products prior to deployment. We interrogate this assumption using SP-ABCBench, a new benchmark of 30 tests derived from validated S&P human-subject studies, which measures alignment between simulations and human-subjects studies on a 0-100 ascending scale, where higher scores indicate better alignment across three dimensions: Attitude, Behavior, and Coherence. Evaluating twelve LLMs, four persona construction strategies, and two prompting methods, we found that there remains substantial room for improvement: all models score between 50 and 64 on average. Newer, bigger, and smarter models do not reliably do better and sometimes do worse. Some simulation configurations, however, do yield high alignment: e.g., with scores above 95 for some behavior tests when agents are prompted to apply bounded rationality and weigh privacy costs against perceived benefits. We release SP-ABCBench to enable reproducible evaluation as methods improve.

cs.CR updates on arXiv.org

📋 論文タイトル一覧