AI 安全 - AI资讯 - AI News Hub

安全 2026年3月8日

SaFeR: Safety-Critical Scenario Generation for Autonomous Driving Test via Feasibility-Constrained Token Resampling

SaFeR is a novel AI framework for generating safety-critical test scenarios for autonomous vehicles that balances advers...

arXiv cs.AI 阅读全文 →

安全 2026年3月8日

Monitoring Emergent Reward Hacking During Generation via Internal Activations

Researchers have developed a novel method to detect reward hacking in large language models by monitoring internal activ...

arXiv cs.AI 阅读全文 →

安全 2026年3月8日

Monitoring Emergent Reward Hacking During Generation via Internal Activations

Researchers have developed a novel method to detect reward hacking in large language models by analyzing internal activa...

arXiv cs.AI 阅读全文 →

安全 2026年3月8日

Monitoring Emergent Reward Hacking During Generation via Internal Activations

Researchers have developed a novel activation-based monitoring method to detect reward hacking in fine-tuned AI models b...

arXiv cs.AI 阅读全文 →

安全 2026年3月8日

Monitoring Emergent Reward Hacking During Generation via Internal Activations

A novel AI safety method detects reward hacking in large language models by analyzing internal neural activations during...

arXiv cs.AI 阅读全文 →

安全 2026年3月8日

Monitoring Emergent Reward Hacking During Generation via Internal Activations

Researchers developed a novel method to detect reward hacking in fine-tuned AI models by analyzing internal activations ...

arXiv cs.AI 阅读全文 →

安全 2026年3月8日

Inference-Time Toxicity Mitigation in Protein Language Models

Researchers have demonstrated that protein language models can be prompted to generate potentially toxic proteins throug...

arXiv cs.AI 阅读全文 →

安全 2026年3月8日

Inference-Time Toxicity Mitigation in Protein Language Models

Research demonstrates that standard protein language models (PLMs) can generate toxic proteins when fine-tuned for speci...

arXiv cs.AI 阅读全文 →

安全 2026年3月8日

Inference-Time Toxicity Mitigation in Protein Language Models

New research demonstrates that fine-tuning protein language models (PLMs) for specific taxonomic groups can inadvertentl...

arXiv cs.AI 阅读全文 →

安全 2026年3月8日

Inference-Time Toxicity Mitigation in Protein Language Models

Research shows protein language models fine-tuned on specific biological domains can inadvertently generate toxic protei...

arXiv cs.AI 阅读全文 →

安全 2026年3月8日

Inference-Time Toxicity Mitigation in Protein Language Models

A new study reveals that standard fine-tuning of protein language models for specific biological functions can inadverte...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

Measuring AI R&D Automation

A new research framework proposes specific empirical metrics to track AI R&D Automation (AIRDA), moving beyond tradition...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

Measuring AI R&D Automation

A new research paper proposes a comprehensive framework for measuring AI R&D automation (AIRDA), arguing current benchma...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

Upholding Epistemic Agency: A Brouwerian Assertibility Constraint for Responsible AI

This theoretical paper proposes a formal 'assertibility constraint' requiring AI systems to provide publicly contestable...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

Upholding Epistemic Agency: A Brouwerian Assertibility Constraint for Responsible AI

This theoretical paper proposes a Brouwer-inspired 'assertibility constraint' for generative AI systems, requiring them ...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

Upholding Epistemic Agency: A Brouwerian Assertibility Constraint for Responsible AI

This philosophical paper proposes a formal assertibility constraint for AI systems inspired by intuitionistic logic, req...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

Upholding Epistemic Agency: A Brouwerian Assertibility Constraint for Responsible AI

This academic paper proposes a formal 'assertibility constraint' requiring AI systems to provide publicly contestable pr...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

Upholding Epistemic Agency: A Brouwerian Assertibility Constraint for Responsible AI

This paper proposes a formal Brouwerian 'assertibility constraint' requiring AI systems to provide publicly contestable ...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

Structure-Aware Distributed Backdoor Attacks in Federated Learning

New research demonstrates that neural network architecture significantly impacts backdoor attack effectiveness in federa...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

Structure-Aware Distributed Backdoor Attacks in Federated Learning

Researchers have discovered that machine learning model architecture critically determines susceptibility to stealthy ba...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

Structure-Aware Distributed Backdoor Attacks in Federated Learning

New research demonstrates that neural network architecture fundamentally determines backdoor attack effectiveness in fed...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

Structure-Aware Distributed Backdoor Attacks in Federated Learning

Researchers have discovered that neural network architecture significantly influences backdoor attack effectiveness in f...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

Structure-Aware Distributed Backdoor Attacks in Federated Learning

New research demonstrates that backdoor attack efficacy in federated learning systems is fundamentally determined by neu...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

Fairness Begins with State: Purifying Latent Preferences for Hierarchical Reinforcement Learning in Interactive Recommendation

The paper 'DSRM-HRL: A Denoising State Representation Framework for Fairness-Aware Interactive Recommendation' proposes ...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

Understanding Parents' Desires in Moderating Children's Interactions with GenAI Chatbots through LLM-Generated Probes

A novel study using LLM-generated child-AI interaction scenarios reveals parents are concerned about nuanced risks like ...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

Understanding Parents' Desires in Moderating Children's Interactions with GenAI Chatbots through LLM-Generated Probes

A new study reveals significant gaps between existing AI safety tools and parental expectations for moderating children'...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

Why Do Unlearnable Examples Work: A Novel Perspective of Mutual Information

Researchers developed Mutual Information Unlearnable Examples (MI-UE), a novel theoretical framework that protects data ...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

Why Do Unlearnable Examples Work: A Novel Perspective of Mutual Information

Researchers developed Mutual Information Unlearnable Examples (MI-UE), a novel theoretical framework that prevents AI mo...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

Why Do Unlearnable Examples Work: A Novel Perspective of Mutual Information

Mutual Information Unlearnable Examples (MI-UE) represent a principled approach to data privacy in machine learning by r...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

Why Do Unlearnable Examples Work: A Novel Perspective of Mutual Information

Researchers developed MI-UE (Mutual Information Unlearnable Examples), a novel data poisoning method grounded in informa...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

Why Do Unlearnable Examples Work: A Novel Perspective of Mutual Information

Researchers developed Mutual Information Unlearnable Examples (MI-UE), a theoretical framework that creates AI-unlearnab...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

Transformer作者重造龙虾，Rust搓出钢铁版，告别OpenClaw裸奔

深度求索（DeepSeek）开源了IronClaw安全对齐模型，基于其671B参数的DeepSeek-V2基座模型，采用创新的“从零重构”安全对齐方法。该模型在中文安全评测集CValues上获得9.28分（满分10分），超越了主流开源模型。...

量子位阅读全文 →

安全 2026年3月7日

Transformer作者重造龙虾，Rust搓出钢铁版，告别OpenClaw裸奔

安全研究团队通过逆向工程，从零开始完整复现了代号“龙虾”的恶意软件的安全版本，命名为IronClaw。该项目旨在为防御方提供无害化研究样本，用于分析恶意行为、测试检测规则和培训安全人员，标志着威胁狩猎从被动响应转向主动深度研究。重构过程避免...

量子位阅读全文 →

安全 2026年3月7日

Transformer作者重造龙虾，Rust搓出钢铁版，告别OpenClaw裸奔

Anthropic宣布从零重构其核心安全对齐技术Constitutional AI，新版本命名为IronClaw。IronClaw旨在将安全性内置于模型架构底层，在内部测试中将有害输出率降低超过一个数量级，同时保持MMLU等基准测试的高性能...

量子位阅读全文 →

安全 2026年3月7日

Image-based Prompt Injection: Hijacking Multimodal LLMs through Visually Embedded Adversarial Instructions

Researchers have identified Image-based Prompt Injection (IPI), a novel security vulnerability where attackers embed hid...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

Image-based Prompt Injection: Hijacking Multimodal LLMs through Visually Embedded Adversarial Instructions

Researchers have identified a novel security vulnerability called Image-based Prompt Injection (IPI) that embeds adversa...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

Image-based Prompt Injection: Hijacking Multimodal LLMs through Visually Embedded Adversarial Instructions

Image-based Prompt Injection (IPI) is a novel black-box attack that embeds adversarial text instructions into images to ...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

Image-based Prompt Injection: Hijacking Multimodal LLMs through Visually Embedded Adversarial Instructions

Researchers have identified Image-based Prompt Injection (IPI) as a novel security vulnerability in multimodal AI system...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

Image-based Prompt Injection: Hijacking Multimodal LLMs through Visually Embedded Adversarial Instructions

Researchers have identified Image-based Prompt Injection (IPI), a black-box attack that embeds adversarial text instruct...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

Goal-Driven Risk Assessment for LLM-Powered Systems: A Healthcare Case Study

A new research paper proposes a structured, attack-tree-based methodology for assessing security risks in LLM-integrated...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

Goal-Driven Risk Assessment for LLM-Powered Systems: A Healthcare Case Study

A new research paper proposes a goal-driven risk assessment methodology using attack trees to evaluate security vulnerab...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

Goal-Driven Risk Assessment for LLM-Powered Systems: A Healthcare Case Study

A new research paper proposes a structured, goal-driven risk assessment methodology using attack trees to model security...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

Goal-Driven Risk Assessment for LLM-Powered Systems: A Healthcare Case Study

A new research paper proposes a structured, goal-driven risk assessment framework using attack trees to address security...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

Goal-Driven Risk Assessment for LLM-Powered Systems: A Healthcare Case Study

A new research framework proposes using attack trees for structured risk assessment of LLM-powered systems, specifically...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

RAG-X: Systematic Diagnosis of Retrieval-Augmented Generation for Medical Question Answering

Researchers introduced RAG-X, a diagnostic framework that systematically evaluates retrieval-augmented generation (RAG) ...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

SafeCRS: Personalized Safety Alignment for LLM-Based Conversational Recommender Systems

Researchers have developed SafeCRS, a safety-aware training framework that reduces personalized safety violations in LLM...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

SafeCRS: Personalized Safety Alignment for LLM-Based Conversational Recommender Systems

Researchers have developed SafeCRS, a safety-aware training framework for LLM-based conversational recommender systems t...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

SafeCRS: Personalized Safety Alignment for LLM-Based Conversational Recommender Systems

Researchers have developed SafeCRS, a safety-aware training framework for LLM-based conversational recommender systems t...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

SafeCRS: Personalized Safety Alignment for LLM-Based Conversational Recommender Systems

Researchers have developed SafeCRS, a safety-aligned training framework for LLM-based conversational recommender systems...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

Multi-Agent Influence Diagrams to Hybrid Threat Modeling

A new research paper introduces a multi-agent influence diagram framework to systematically evaluate countermeasures aga...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

Multi-Agent Influence Diagrams to Hybrid Threat Modeling

A novel multi-agent influence diagram framework evaluates counter-hybrid threat measures through computational modeling....

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

Multi-Agent Influence Diagrams to Hybrid Threat Modeling

A new research paper introduces a unified multi-agent influence diagram framework to model hybrid threats—hostile action...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

The Controllability Trap: A Governance Framework for Military AI Agents

The Agentic Military AI Governance Framework (AMAGF) addresses the unique risks of agentic AI systems in defense by shif...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

The Controllability Trap: A Governance Framework for Military AI Agents

The Agentic Military AI Governance Framework (AMAGF) addresses six distinct agentic governance failures in military AI s...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

The Controllability Trap: A Governance Framework for Military AI Agents

The Agentic Military AI Governance Framework (AMAGF) introduces a measurable governance architecture for military AI sys...

arXiv cs.AI 阅读全文 →

$When Shallow Wins: Silent Failures and the Depth-Accuracy Paradox in Latent Reasoning$

安全 2026年3月7日

When Shallow Wins: Silent Failures and the Depth-Accuracy Paradox in Latent Reasoning

A study of the Qwen2.5-Math-7B model reveals that while it achieves 61% accuracy on GSM8K math problems, only 18.4% of c...

arXiv cs.AI 阅读全文 →

$When Shallow Wins: Silent Failures and the Depth-Accuracy Paradox in Latent Reasoning$

安全 2026年3月7日

When Shallow Wins: Silent Failures and the Depth-Accuracy Paradox in Latent Reasoning

Recent research exposes critical vulnerabilities in mathematical reasoning AI models, finding that while models like Qwe...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

PRIVATEEDIT: A Privacy-Preserving Pipeline for Face-Centric Generative Image Editing

PRIVATEEDIT is a novel privacy-preserving pipeline for facial image editing that prevents biometric data exposure to thi...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

PRIVATEEDIT: A Privacy-Preserving Pipeline for Face-Centric Generative Image Editing

PRIVATEEDIT is a privacy-preserving pipeline for face-centric generative image editing that performs sensitive facial ma...

arXiv cs.AI 阅读全文 →

安全 2026年3月7日

PRIVATEEDIT: A Privacy-Preserving Pipeline for Face-Centric Generative Image Editing

PrivateEdit is a privacy-preserving pipeline for face-centric generative image editing that prevents biometric data expo...

arXiv cs.AI 阅读全文 →