AI 安全
AI 对齐、安全评估、隐私保护、伦理治理等 AI 安全领域深度报道。
SaFeR: Safety-Critical Scenario Generation for Autonomous Driving Test via Feasibility-Constrained Token Resampling
SaFeR is a novel AI framework for generating safety-critical test scenarios for autonomous vehicles that balances advers...
Monitoring Emergent Reward Hacking During Generation via Internal Activations
Researchers have developed a novel method to detect reward hacking in large language models by monitoring internal activ...
Monitoring Emergent Reward Hacking During Generation via Internal Activations
Researchers have developed a novel method to detect reward hacking in large language models by analyzing internal activa...
Monitoring Emergent Reward Hacking During Generation via Internal Activations
Researchers have developed a novel activation-based monitoring method to detect reward hacking in fine-tuned AI models b...
Monitoring Emergent Reward Hacking During Generation via Internal Activations
A novel AI safety method detects reward hacking in large language models by analyzing internal neural activations during...
Monitoring Emergent Reward Hacking During Generation via Internal Activations
Researchers developed a novel method to detect reward hacking in fine-tuned AI models by analyzing internal activations ...
Inference-Time Toxicity Mitigation in Protein Language Models
Researchers have demonstrated that protein language models can be prompted to generate potentially toxic proteins throug...
Inference-Time Toxicity Mitigation in Protein Language Models
Research demonstrates that standard protein language models (PLMs) can generate toxic proteins when fine-tuned for speci...
Inference-Time Toxicity Mitigation in Protein Language Models
New research demonstrates that fine-tuning protein language models (PLMs) for specific taxonomic groups can inadvertentl...
Inference-Time Toxicity Mitigation in Protein Language Models
Research shows protein language models fine-tuned on specific biological domains can inadvertently generate toxic protei...
Inference-Time Toxicity Mitigation in Protein Language Models
A new study reveals that standard fine-tuning of protein language models for specific biological functions can inadverte...
Measuring AI R&D Automation
A new research framework proposes specific empirical metrics to track AI R&D Automation (AIRDA), moving beyond tradition...
Measuring AI R&D Automation
A new research paper proposes a comprehensive framework for measuring AI R&D automation (AIRDA), arguing current benchma...
Upholding Epistemic Agency: A Brouwerian Assertibility Constraint for Responsible AI
This theoretical paper proposes a formal 'assertibility constraint' requiring AI systems to provide publicly contestable...
Upholding Epistemic Agency: A Brouwerian Assertibility Constraint for Responsible AI
This theoretical paper proposes a Brouwer-inspired 'assertibility constraint' for generative AI systems, requiring them ...
Upholding Epistemic Agency: A Brouwerian Assertibility Constraint for Responsible AI
This philosophical paper proposes a formal assertibility constraint for AI systems inspired by intuitionistic logic, req...
Upholding Epistemic Agency: A Brouwerian Assertibility Constraint for Responsible AI
This academic paper proposes a formal 'assertibility constraint' requiring AI systems to provide publicly contestable pr...
Upholding Epistemic Agency: A Brouwerian Assertibility Constraint for Responsible AI
This paper proposes a formal Brouwerian 'assertibility constraint' requiring AI systems to provide publicly contestable ...
Structure-Aware Distributed Backdoor Attacks in Federated Learning
New research demonstrates that neural network architecture significantly impacts backdoor attack effectiveness in federa...
Structure-Aware Distributed Backdoor Attacks in Federated Learning
Researchers have discovered that machine learning model architecture critically determines susceptibility to stealthy ba...
Structure-Aware Distributed Backdoor Attacks in Federated Learning
New research demonstrates that neural network architecture fundamentally determines backdoor attack effectiveness in fed...
Structure-Aware Distributed Backdoor Attacks in Federated Learning
Researchers have discovered that neural network architecture significantly influences backdoor attack effectiveness in f...
Structure-Aware Distributed Backdoor Attacks in Federated Learning
New research demonstrates that backdoor attack efficacy in federated learning systems is fundamentally determined by neu...
Fairness Begins with State: Purifying Latent Preferences for Hierarchical Reinforcement Learning in Interactive Recommendation
The paper 'DSRM-HRL: A Denoising State Representation Framework for Fairness-Aware Interactive Recommendation' proposes ...
Understanding Parents' Desires in Moderating Children's Interactions with GenAI Chatbots through LLM-Generated Probes
A novel study using LLM-generated child-AI interaction scenarios reveals parents are concerned about nuanced risks like ...
Understanding Parents' Desires in Moderating Children's Interactions with GenAI Chatbots through LLM-Generated Probes
A new study reveals significant gaps between existing AI safety tools and parental expectations for moderating children'...
Why Do Unlearnable Examples Work: A Novel Perspective of Mutual Information
Researchers developed Mutual Information Unlearnable Examples (MI-UE), a novel theoretical framework that protects data ...
Why Do Unlearnable Examples Work: A Novel Perspective of Mutual Information
Researchers developed Mutual Information Unlearnable Examples (MI-UE), a novel theoretical framework that prevents AI mo...
Why Do Unlearnable Examples Work: A Novel Perspective of Mutual Information
Mutual Information Unlearnable Examples (MI-UE) represent a principled approach to data privacy in machine learning by r...
Why Do Unlearnable Examples Work: A Novel Perspective of Mutual Information
Researchers developed MI-UE (Mutual Information Unlearnable Examples), a novel data poisoning method grounded in informa...
Why Do Unlearnable Examples Work: A Novel Perspective of Mutual Information
Researchers developed Mutual Information Unlearnable Examples (MI-UE), a theoretical framework that creates AI-unlearnab...
Transformer作者重造龙虾,Rust搓出钢铁版,告别OpenClaw裸奔
深度求索(DeepSeek)开源了IronClaw安全对齐模型,基于其671B参数的DeepSeek-V2基座模型,采用创新的“从零重构”安全对齐方法。该模型在中文安全评测集CValues上获得9.28分(满分10分),超越了主流开源模型。...
Transformer作者重造龙虾,Rust搓出钢铁版,告别OpenClaw裸奔
安全研究团队通过逆向工程,从零开始完整复现了代号“龙虾”的恶意软件的安全版本,命名为IronClaw。该项目旨在为防御方提供无害化研究样本,用于分析恶意行为、测试检测规则和培训安全人员,标志着威胁狩猎从被动响应转向主动深度研究。重构过程避免...
Transformer作者重造龙虾,Rust搓出钢铁版,告别OpenClaw裸奔
Anthropic宣布从零重构其核心安全对齐技术Constitutional AI,新版本命名为IronClaw。IronClaw旨在将安全性内置于模型架构底层,在内部测试中将有害输出率降低超过一个数量级,同时保持MMLU等基准测试的高性能...
Image-based Prompt Injection: Hijacking Multimodal LLMs through Visually Embedded Adversarial Instructions
Researchers have identified Image-based Prompt Injection (IPI), a novel security vulnerability where attackers embed hid...
Image-based Prompt Injection: Hijacking Multimodal LLMs through Visually Embedded Adversarial Instructions
Researchers have identified a novel security vulnerability called Image-based Prompt Injection (IPI) that embeds adversa...
Image-based Prompt Injection: Hijacking Multimodal LLMs through Visually Embedded Adversarial Instructions
Image-based Prompt Injection (IPI) is a novel black-box attack that embeds adversarial text instructions into images to ...
Image-based Prompt Injection: Hijacking Multimodal LLMs through Visually Embedded Adversarial Instructions
Researchers have identified Image-based Prompt Injection (IPI) as a novel security vulnerability in multimodal AI system...
Image-based Prompt Injection: Hijacking Multimodal LLMs through Visually Embedded Adversarial Instructions
Researchers have identified Image-based Prompt Injection (IPI), a black-box attack that embeds adversarial text instruct...
Goal-Driven Risk Assessment for LLM-Powered Systems: A Healthcare Case Study
A new research paper proposes a structured, attack-tree-based methodology for assessing security risks in LLM-integrated...
Goal-Driven Risk Assessment for LLM-Powered Systems: A Healthcare Case Study
A new research paper proposes a goal-driven risk assessment methodology using attack trees to evaluate security vulnerab...
Goal-Driven Risk Assessment for LLM-Powered Systems: A Healthcare Case Study
A new research paper proposes a structured, goal-driven risk assessment methodology using attack trees to model security...
Goal-Driven Risk Assessment for LLM-Powered Systems: A Healthcare Case Study
A new research paper proposes a structured, goal-driven risk assessment framework using attack trees to address security...
Goal-Driven Risk Assessment for LLM-Powered Systems: A Healthcare Case Study
A new research framework proposes using attack trees for structured risk assessment of LLM-powered systems, specifically...
RAG-X: Systematic Diagnosis of Retrieval-Augmented Generation for Medical Question Answering
Researchers introduced RAG-X, a diagnostic framework that systematically evaluates retrieval-augmented generation (RAG) ...
SafeCRS: Personalized Safety Alignment for LLM-Based Conversational Recommender Systems
Researchers have developed SafeCRS, a safety-aware training framework that reduces personalized safety violations in LLM...
SafeCRS: Personalized Safety Alignment for LLM-Based Conversational Recommender Systems
Researchers have developed SafeCRS, a safety-aware training framework for LLM-based conversational recommender systems t...
SafeCRS: Personalized Safety Alignment for LLM-Based Conversational Recommender Systems
Researchers have developed SafeCRS, a safety-aware training framework for LLM-based conversational recommender systems t...
SafeCRS: Personalized Safety Alignment for LLM-Based Conversational Recommender Systems
Researchers have developed SafeCRS, a safety-aligned training framework for LLM-based conversational recommender systems...
Multi-Agent Influence Diagrams to Hybrid Threat Modeling
A new research paper introduces a multi-agent influence diagram framework to systematically evaluate countermeasures aga...
Multi-Agent Influence Diagrams to Hybrid Threat Modeling
A novel multi-agent influence diagram framework evaluates counter-hybrid threat measures through computational modeling....
Multi-Agent Influence Diagrams to Hybrid Threat Modeling
A new research paper introduces a unified multi-agent influence diagram framework to model hybrid threats—hostile action...
The Controllability Trap: A Governance Framework for Military AI Agents
The Agentic Military AI Governance Framework (AMAGF) addresses the unique risks of agentic AI systems in defense by shif...
The Controllability Trap: A Governance Framework for Military AI Agents
The Agentic Military AI Governance Framework (AMAGF) addresses six distinct agentic governance failures in military AI s...
The Controllability Trap: A Governance Framework for Military AI Agents
The Agentic Military AI Governance Framework (AMAGF) introduces a measurable governance architecture for military AI sys...
When Shallow Wins: Silent Failures and the Depth-Accuracy Paradox in Latent Reasoning
A study of the Qwen2.5-Math-7B model reveals that while it achieves 61% accuracy on GSM8K math problems, only 18.4% of c...
When Shallow Wins: Silent Failures and the Depth-Accuracy Paradox in Latent Reasoning
Recent research exposes critical vulnerabilities in mathematical reasoning AI models, finding that while models like Qwe...
PRIVATEEDIT: A Privacy-Preserving Pipeline for Face-Centric Generative Image Editing
PRIVATEEDIT is a novel privacy-preserving pipeline for facial image editing that prevents biometric data exposure to thi...
PRIVATEEDIT: A Privacy-Preserving Pipeline for Face-Centric Generative Image Editing
PRIVATEEDIT is a privacy-preserving pipeline for face-centric generative image editing that performs sensitive facial ma...
PRIVATEEDIT: A Privacy-Preserving Pipeline for Face-Centric Generative Image Editing
PrivateEdit is a privacy-preserving pipeline for face-centric generative image editing that prevents biometric data expo...