AI 安全

AI 对齐、安全评估、隐私保护、伦理治理等 AI 安全领域深度报道。

SaFeR: Safety-Critical Scenario Generation for Autonomous Driving Test via Feasibility-Constrained Token Resampling
安全

SaFeR: Safety-Critical Scenario Generation for Autonomous Driving Test via Feasibility-Constrained Token Resampling

SaFeR is a novel AI framework for generating safety-critical test scenarios for autonomous vehicles that balances advers...

Monitoring Emergent Reward Hacking During Generation via Internal Activations
安全

Monitoring Emergent Reward Hacking During Generation via Internal Activations

Researchers have developed a novel method to detect reward hacking in large language models by monitoring internal activ...

Monitoring Emergent Reward Hacking During Generation via Internal Activations
安全

Monitoring Emergent Reward Hacking During Generation via Internal Activations

Researchers have developed a novel method to detect reward hacking in large language models by analyzing internal activa...

Monitoring Emergent Reward Hacking During Generation via Internal Activations
安全

Monitoring Emergent Reward Hacking During Generation via Internal Activations

Researchers have developed a novel activation-based monitoring method to detect reward hacking in fine-tuned AI models b...

Monitoring Emergent Reward Hacking During Generation via Internal Activations
安全

Monitoring Emergent Reward Hacking During Generation via Internal Activations

A novel AI safety method detects reward hacking in large language models by analyzing internal neural activations during...

Monitoring Emergent Reward Hacking During Generation via Internal Activations
安全

Monitoring Emergent Reward Hacking During Generation via Internal Activations

Researchers developed a novel method to detect reward hacking in fine-tuned AI models by analyzing internal activations ...

Inference-Time Toxicity Mitigation in Protein Language Models
安全

Inference-Time Toxicity Mitigation in Protein Language Models

Researchers have demonstrated that protein language models can be prompted to generate potentially toxic proteins throug...

Inference-Time Toxicity Mitigation in Protein Language Models
安全

Inference-Time Toxicity Mitigation in Protein Language Models

Research demonstrates that standard protein language models (PLMs) can generate toxic proteins when fine-tuned for speci...

Inference-Time Toxicity Mitigation in Protein Language Models
安全

Inference-Time Toxicity Mitigation in Protein Language Models

New research demonstrates that fine-tuning protein language models (PLMs) for specific taxonomic groups can inadvertentl...

Inference-Time Toxicity Mitigation in Protein Language Models
安全

Inference-Time Toxicity Mitigation in Protein Language Models

Research shows protein language models fine-tuned on specific biological domains can inadvertently generate toxic protei...

Inference-Time Toxicity Mitigation in Protein Language Models
安全

Inference-Time Toxicity Mitigation in Protein Language Models

A new study reveals that standard fine-tuning of protein language models for specific biological functions can inadverte...

Measuring AI R&D Automation
安全

Measuring AI R&D Automation

A new research framework proposes specific empirical metrics to track AI R&D Automation (AIRDA), moving beyond tradition...

Measuring AI R&D Automation
安全

Measuring AI R&D Automation

A new research paper proposes a comprehensive framework for measuring AI R&D automation (AIRDA), arguing current benchma...

Upholding Epistemic Agency: A Brouwerian Assertibility Constraint for Responsible AI
安全

Upholding Epistemic Agency: A Brouwerian Assertibility Constraint for Responsible AI

This theoretical paper proposes a formal 'assertibility constraint' requiring AI systems to provide publicly contestable...

Upholding Epistemic Agency: A Brouwerian Assertibility Constraint for Responsible AI
安全

Upholding Epistemic Agency: A Brouwerian Assertibility Constraint for Responsible AI

This theoretical paper proposes a Brouwer-inspired 'assertibility constraint' for generative AI systems, requiring them ...

Upholding Epistemic Agency: A Brouwerian Assertibility Constraint for Responsible AI
安全

Upholding Epistemic Agency: A Brouwerian Assertibility Constraint for Responsible AI

This philosophical paper proposes a formal assertibility constraint for AI systems inspired by intuitionistic logic, req...

Upholding Epistemic Agency: A Brouwerian Assertibility Constraint for Responsible AI
安全

Upholding Epistemic Agency: A Brouwerian Assertibility Constraint for Responsible AI

This academic paper proposes a formal 'assertibility constraint' requiring AI systems to provide publicly contestable pr...

Upholding Epistemic Agency: A Brouwerian Assertibility Constraint for Responsible AI
安全

Upholding Epistemic Agency: A Brouwerian Assertibility Constraint for Responsible AI

This paper proposes a formal Brouwerian 'assertibility constraint' requiring AI systems to provide publicly contestable ...

Structure-Aware Distributed Backdoor Attacks in Federated Learning
安全

Structure-Aware Distributed Backdoor Attacks in Federated Learning

New research demonstrates that neural network architecture significantly impacts backdoor attack effectiveness in federa...

Structure-Aware Distributed Backdoor Attacks in Federated Learning
安全

Structure-Aware Distributed Backdoor Attacks in Federated Learning

Researchers have discovered that machine learning model architecture critically determines susceptibility to stealthy ba...

Structure-Aware Distributed Backdoor Attacks in Federated Learning
安全

Structure-Aware Distributed Backdoor Attacks in Federated Learning

New research demonstrates that neural network architecture fundamentally determines backdoor attack effectiveness in fed...

Structure-Aware Distributed Backdoor Attacks in Federated Learning
安全

Structure-Aware Distributed Backdoor Attacks in Federated Learning

Researchers have discovered that neural network architecture significantly influences backdoor attack effectiveness in f...

Structure-Aware Distributed Backdoor Attacks in Federated Learning
安全

Structure-Aware Distributed Backdoor Attacks in Federated Learning

New research demonstrates that backdoor attack efficacy in federated learning systems is fundamentally determined by neu...

Fairness Begins with State: Purifying Latent Preferences for Hierarchical Reinforcement Learning in Interactive Recommendation
安全

Fairness Begins with State: Purifying Latent Preferences for Hierarchical Reinforcement Learning in Interactive Recommendation

The paper 'DSRM-HRL: A Denoising State Representation Framework for Fairness-Aware Interactive Recommendation' proposes ...

Understanding Parents' Desires in Moderating Children's Interactions with GenAI Chatbots through LLM-Generated Probes
安全

Understanding Parents' Desires in Moderating Children's Interactions with GenAI Chatbots through LLM-Generated Probes

A novel study using LLM-generated child-AI interaction scenarios reveals parents are concerned about nuanced risks like ...

Understanding Parents' Desires in Moderating Children's Interactions with GenAI Chatbots through LLM-Generated Probes
安全

Understanding Parents' Desires in Moderating Children's Interactions with GenAI Chatbots through LLM-Generated Probes

A new study reveals significant gaps between existing AI safety tools and parental expectations for moderating children'...

Why Do Unlearnable Examples Work: A Novel Perspective of Mutual Information
安全

Why Do Unlearnable Examples Work: A Novel Perspective of Mutual Information

Researchers developed Mutual Information Unlearnable Examples (MI-UE), a novel theoretical framework that protects data ...

Why Do Unlearnable Examples Work: A Novel Perspective of Mutual Information
安全

Why Do Unlearnable Examples Work: A Novel Perspective of Mutual Information

Researchers developed Mutual Information Unlearnable Examples (MI-UE), a novel theoretical framework that prevents AI mo...

Why Do Unlearnable Examples Work: A Novel Perspective of Mutual Information
安全

Why Do Unlearnable Examples Work: A Novel Perspective of Mutual Information

Mutual Information Unlearnable Examples (MI-UE) represent a principled approach to data privacy in machine learning by r...

Why Do Unlearnable Examples Work: A Novel Perspective of Mutual Information
安全

Why Do Unlearnable Examples Work: A Novel Perspective of Mutual Information

Researchers developed MI-UE (Mutual Information Unlearnable Examples), a novel data poisoning method grounded in informa...

Why Do Unlearnable Examples Work: A Novel Perspective of Mutual Information
安全

Why Do Unlearnable Examples Work: A Novel Perspective of Mutual Information

Researchers developed Mutual Information Unlearnable Examples (MI-UE), a theoretical framework that creates AI-unlearnab...

Transformer作者重造龙虾,Rust搓出钢铁版,告别OpenClaw裸奔
安全

Transformer作者重造龙虾,Rust搓出钢铁版,告别OpenClaw裸奔

深度求索(DeepSeek)开源了IronClaw安全对齐模型,基于其671B参数的DeepSeek-V2基座模型,采用创新的“从零重构”安全对齐方法。该模型在中文安全评测集CValues上获得9.28分(满分10分),超越了主流开源模型。...

Transformer作者重造龙虾,Rust搓出钢铁版,告别OpenClaw裸奔
安全

Transformer作者重造龙虾,Rust搓出钢铁版,告别OpenClaw裸奔

安全研究团队通过逆向工程,从零开始完整复现了代号“龙虾”的恶意软件的安全版本,命名为IronClaw。该项目旨在为防御方提供无害化研究样本,用于分析恶意行为、测试检测规则和培训安全人员,标志着威胁狩猎从被动响应转向主动深度研究。重构过程避免...

Transformer作者重造龙虾,Rust搓出钢铁版,告别OpenClaw裸奔
安全

Transformer作者重造龙虾,Rust搓出钢铁版,告别OpenClaw裸奔

Anthropic宣布从零重构其核心安全对齐技术Constitutional AI,新版本命名为IronClaw。IronClaw旨在将安全性内置于模型架构底层,在内部测试中将有害输出率降低超过一个数量级,同时保持MMLU等基准测试的高性能...

Image-based Prompt Injection: Hijacking Multimodal LLMs through Visually Embedded Adversarial Instructions
安全

Image-based Prompt Injection: Hijacking Multimodal LLMs through Visually Embedded Adversarial Instructions

Researchers have identified Image-based Prompt Injection (IPI), a novel security vulnerability where attackers embed hid...

Image-based Prompt Injection: Hijacking Multimodal LLMs through Visually Embedded Adversarial Instructions
安全

Image-based Prompt Injection: Hijacking Multimodal LLMs through Visually Embedded Adversarial Instructions

Researchers have identified a novel security vulnerability called Image-based Prompt Injection (IPI) that embeds adversa...

Image-based Prompt Injection: Hijacking Multimodal LLMs through Visually Embedded Adversarial Instructions
安全

Image-based Prompt Injection: Hijacking Multimodal LLMs through Visually Embedded Adversarial Instructions

Image-based Prompt Injection (IPI) is a novel black-box attack that embeds adversarial text instructions into images to ...

Image-based Prompt Injection: Hijacking Multimodal LLMs through Visually Embedded Adversarial Instructions
安全

Image-based Prompt Injection: Hijacking Multimodal LLMs through Visually Embedded Adversarial Instructions

Researchers have identified Image-based Prompt Injection (IPI) as a novel security vulnerability in multimodal AI system...

Image-based Prompt Injection: Hijacking Multimodal LLMs through Visually Embedded Adversarial Instructions
安全

Image-based Prompt Injection: Hijacking Multimodal LLMs through Visually Embedded Adversarial Instructions

Researchers have identified Image-based Prompt Injection (IPI), a black-box attack that embeds adversarial text instruct...

Goal-Driven Risk Assessment for LLM-Powered Systems: A Healthcare Case Study
安全

Goal-Driven Risk Assessment for LLM-Powered Systems: A Healthcare Case Study

A new research paper proposes a structured, attack-tree-based methodology for assessing security risks in LLM-integrated...

Goal-Driven Risk Assessment for LLM-Powered Systems: A Healthcare Case Study
安全

Goal-Driven Risk Assessment for LLM-Powered Systems: A Healthcare Case Study

A new research paper proposes a goal-driven risk assessment methodology using attack trees to evaluate security vulnerab...

Goal-Driven Risk Assessment for LLM-Powered Systems: A Healthcare Case Study
安全

Goal-Driven Risk Assessment for LLM-Powered Systems: A Healthcare Case Study

A new research paper proposes a structured, goal-driven risk assessment methodology using attack trees to model security...

Goal-Driven Risk Assessment for LLM-Powered Systems: A Healthcare Case Study
安全

Goal-Driven Risk Assessment for LLM-Powered Systems: A Healthcare Case Study

A new research paper proposes a structured, goal-driven risk assessment framework using attack trees to address security...

Goal-Driven Risk Assessment for LLM-Powered Systems: A Healthcare Case Study
安全

Goal-Driven Risk Assessment for LLM-Powered Systems: A Healthcare Case Study

A new research framework proposes using attack trees for structured risk assessment of LLM-powered systems, specifically...

RAG-X: Systematic Diagnosis of Retrieval-Augmented Generation for Medical Question Answering
安全

RAG-X: Systematic Diagnosis of Retrieval-Augmented Generation for Medical Question Answering

Researchers introduced RAG-X, a diagnostic framework that systematically evaluates retrieval-augmented generation (RAG) ...

SafeCRS: Personalized Safety Alignment for LLM-Based Conversational Recommender Systems
安全

SafeCRS: Personalized Safety Alignment for LLM-Based Conversational Recommender Systems

Researchers have developed SafeCRS, a safety-aware training framework that reduces personalized safety violations in LLM...

SafeCRS: Personalized Safety Alignment for LLM-Based Conversational Recommender Systems
安全

SafeCRS: Personalized Safety Alignment for LLM-Based Conversational Recommender Systems

Researchers have developed SafeCRS, a safety-aware training framework for LLM-based conversational recommender systems t...

SafeCRS: Personalized Safety Alignment for LLM-Based Conversational Recommender Systems
安全

SafeCRS: Personalized Safety Alignment for LLM-Based Conversational Recommender Systems

Researchers have developed SafeCRS, a safety-aware training framework for LLM-based conversational recommender systems t...

SafeCRS: Personalized Safety Alignment for LLM-Based Conversational Recommender Systems
安全

SafeCRS: Personalized Safety Alignment for LLM-Based Conversational Recommender Systems

Researchers have developed SafeCRS, a safety-aligned training framework for LLM-based conversational recommender systems...

Multi-Agent Influence Diagrams to Hybrid Threat Modeling
安全

Multi-Agent Influence Diagrams to Hybrid Threat Modeling

A new research paper introduces a multi-agent influence diagram framework to systematically evaluate countermeasures aga...

Multi-Agent Influence Diagrams to Hybrid Threat Modeling
安全

Multi-Agent Influence Diagrams to Hybrid Threat Modeling

A novel multi-agent influence diagram framework evaluates counter-hybrid threat measures through computational modeling....

Multi-Agent Influence Diagrams to Hybrid Threat Modeling
安全

Multi-Agent Influence Diagrams to Hybrid Threat Modeling

A new research paper introduces a unified multi-agent influence diagram framework to model hybrid threats—hostile action...

The Controllability Trap: A Governance Framework for Military AI Agents
安全

The Controllability Trap: A Governance Framework for Military AI Agents

The Agentic Military AI Governance Framework (AMAGF) addresses the unique risks of agentic AI systems in defense by shif...

The Controllability Trap: A Governance Framework for Military AI Agents
安全

The Controllability Trap: A Governance Framework for Military AI Agents

The Agentic Military AI Governance Framework (AMAGF) addresses six distinct agentic governance failures in military AI s...

The Controllability Trap: A Governance Framework for Military AI Agents
安全

The Controllability Trap: A Governance Framework for Military AI Agents

The Agentic Military AI Governance Framework (AMAGF) introduces a measurable governance architecture for military AI sys...

When Shallow Wins: Silent Failures and the Depth-Accuracy Paradox in Latent Reasoning
安全

When Shallow Wins: Silent Failures and the Depth-Accuracy Paradox in Latent Reasoning

A study of the Qwen2.5-Math-7B model reveals that while it achieves 61% accuracy on GSM8K math problems, only 18.4% of c...

When Shallow Wins: Silent Failures and the Depth-Accuracy Paradox in Latent Reasoning
安全

When Shallow Wins: Silent Failures and the Depth-Accuracy Paradox in Latent Reasoning

Recent research exposes critical vulnerabilities in mathematical reasoning AI models, finding that while models like Qwe...

PRIVATEEDIT: A Privacy-Preserving Pipeline for Face-Centric Generative Image Editing
安全

PRIVATEEDIT: A Privacy-Preserving Pipeline for Face-Centric Generative Image Editing

PRIVATEEDIT is a novel privacy-preserving pipeline for facial image editing that prevents biometric data exposure to thi...

PRIVATEEDIT: A Privacy-Preserving Pipeline for Face-Centric Generative Image Editing
安全

PRIVATEEDIT: A Privacy-Preserving Pipeline for Face-Centric Generative Image Editing

PRIVATEEDIT is a privacy-preserving pipeline for face-centric generative image editing that performs sensitive facial ma...

PRIVATEEDIT: A Privacy-Preserving Pipeline for Face-Centric Generative Image Editing
安全

PRIVATEEDIT: A Privacy-Preserving Pipeline for Face-Centric Generative Image Editing

PrivateEdit is a privacy-preserving pipeline for face-centric generative image editing that prevents biometric data expo...