AI Safety

In-depth coverage of AI safety topics, including AI alignment, safety evaluation, privacy protection, and ethics and governance.

Anthropic refuses Pentagon’s new terms, standing firm on lethal autonomous weapons and mass surveillance
Safety

Less than 24 hours before the deadline in an ultimatum issued by the Pentagon, Anthropic has refused the Department of D...

How disconnected clouds improve AI data governance
Safety

Disconnected clouds aim to improve AI data governance as businesses rethink their infrastructure under tighter regul...

Anthropic: Claude faces ‘industrial-scale’ AI model distillation
Safety

Anthropic has detailed three “industrial-scale” AI model distillation campaigns by overseas labs designed to...

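State Administration for Market Regulation: Severe crackdown on three types of false food advertising
Safety

Starting in March this year, the State Administration for Market Regulation will run a six-month nationwide campaign against false advertising in online sales of food and health food. The campaign will target three types of violations: first, false advertising of all kinds; second, illegal advertisements; third, violations by platforms and institutions. (CCTV News...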

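New State Administration for Market Regulation rules require a dedicated “no dine-in” label for food delivery
Safety

On the evening of February 26, the State Administration for Market Regulation released the Provisions on the Supervision and Administration of Online Catering Service Operators' Fulfillment of Primary Responsibility for Food Safety (《网络餐饮服务经营者落实食品安全主体责任监督管理规定》). The provisions specify that merchants that offer delivery only, with no dine-in service, must display a “no dine-in” label prominently on their main page, and that delivery platforms must also show the label on merchant listing pages. (Xinhua News Agency)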

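State Administration for Market Regulation: To prevent “ghost takeout,” delivery platforms should conduct “on-site verification” of registered shops
Safety

According to a press conference held by the State Administration for Market Regulation, the Provisions on the Supervision and Administration of Online Catering Service Operators' Fulfillment of Primary Responsibility for Food Safety (《网络餐饮服务经营者落实食品安全主体责任监督管理规定》) will take effect on June 1. The provisions require delivery platforms to register delivery merchants under their real names and, through on-site verification and other means, to substantively review merchants' food business licenses and other operating qualification certificates, so as to ensure...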

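Safety

Tianli Lithium Energy: Sichuan Tianli's lithium iron phosphate production line has completed maintenance and resumed production

36Kr has learned that Tianli Lithium Energy announced that, starting January 14, 2026, Sichuan Tianli shut down its lithium iron phosphate production line for scheduled maintenance, to keep the line running efficiently, stably, and safely and to support safe production. During the maintenance period, Sichuan Tianli fully overhauled and upgraded the line; for subsequent safe, continuous, and efficient production, this laid a solid...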

Safety

Contextual Safety Reasoning and Grounding for Open-World Robots

arXiv:2602.19983v2 Announce Type: replace-cross Abstract: Robots are increasingly operating in open-world environments w...

Safety

Capabilities Ain't All You Need: Measuring Propensities in AI

arXiv:2602.18182v2 Announce Type: replace-cross Abstract: AI evaluation has primarily focused on measuring capabilities,...

Safety

Stop Saying "AI"

arXiv:2602.17729v2 Announce Type: replace-cross Abstract: Across academia, industry, and government, "AI" has become c...

Safety

Orthogonalized Policy Optimization: Policy Optimization as Orthogonal Projection in Hilbert Space

arXiv:2601.12415v5 Announce Type: replace-cross Abstract: We propose Orthogonalized Policy Optimization (OPO), a princip...

Safety

Improving Variational Autoencoder using Random Fourier Transformation: An Aviation Safety Anomaly Detection Case-Study

arXiv:2601.01016v3 Announce Type: replace-cross Abstract: In this study, we focus on the training process and inference ...

Safety

The Subject of Emergent Misalignment in Superintelligence: An Anthropological, Cognitive Neuropsychological, Machine-Learning, and Ontological Perspective

arXiv:2512.17989v2 Announce Type: replace-cross Abstract: We examine the conceptual and ethical gaps in current represen...

Safety

EARL: Entropy-Aware RL Alignment of LLMs for Reliable RTL Code Generation

arXiv:2511.12033v2 Announce Type: replace-cross Abstract: Recent advances in large language models (LLMs) have demonstra...

Safety

Data-Augmented Deep Learning for Downhole Depth Sensing and Validation

arXiv:2511.00129v4 Announce Type: replace-cross Abstract: Accurate downhole depth measurement is essential for oil and g...

Safety

MNO: Multiscale Neural Operator for 3D Computational Fluid Dynamics

arXiv:2510.16071v2 Announce Type: replace-cross Abstract: Neural operators have emerged as a powerful data-driven paradi...

Safety

When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment

arXiv:2506.07452v3 Announce Type: replace-cross Abstract: Large language models (LLMs) can be prompted with specific sty...

Safety

Rethinking Flexible Graph Similarity Computation: One-step Alignment with Global Guidance

arXiv:2504.06533v3 Announce Type: replace-cross Abstract: Graph Edit Distance (GED) is a widely used measure of graph si...

Safety

Provable Last-Iterate Convergence for Multi-Objective Safe LLM Alignment via Optimistic Primal-Dual

arXiv:2602.22146v1 Announce Type: cross Abstract: Reinforcement Learning from Human Feedback (RLHF) plays a significant ...

Safety

Hidden Topics: Measuring Sensitive AI Beliefs with List Experiments

arXiv:2602.21939v1 Announce Type: cross Abstract: How can researchers identify beliefs that large language models (LLMs)...

Safety

Understanding Annotation Error Propagation and Learning an Adaptive Policy for Expert Intervention in Barrett's Video Segmentation

arXiv:2602.21855v1 Announce Type: cross Abstract: Accurate annotation of endoscopic videos is essential yet time-consumi...

Safety

StoryMovie: A Dataset for Semantic Alignment of Visual Stories with Movie Scripts and Subtitles

arXiv:2602.21829v1 Announce Type: cross Abstract: Visual storytelling models that correctly ground entities in images ma...

Safety

SemVideo: Reconstructs What You Watch from Brain Activity via Hierarchical Semantic Guidance

arXiv:2602.21819v1 Announce Type: cross Abstract: Reconstructing dynamic visual experiences from brain activity provides...

Safety

Evaluating the relationship between regularity and learnability in recursive numeral systems using Reinforcement Learning

arXiv:2602.21720v1 Announce Type: cross Abstract: Human recursive numeral systems (i.e., counting systems such as Englis...

Safety

Following the Diagnostic Trace: Visual Cognition-guided Cooperative Network for Chest X-Ray Diagnosis

arXiv:2602.21657v1 Announce Type: cross Abstract: Computer-aided diagnosis (CAD) has significantly advanced automated ch...

Safety

Self-Correcting VLA: Online Action Refinement via Sparse World Imagination

arXiv:2602.21633v1 Announce Type: cross Abstract: Standard vision-language-action (VLA) models rely on fitting statistic...

Safety

Virtual Biopsy for Intracranial Tumors Diagnosis on MRI

arXiv:2602.21613v1 Announce Type: cross Abstract: Deep intracranial tumors situated in eloquent brain regions controllin...

Safety

Enhancing Multilingual Embeddings via Multi-Way Parallel Text Alignment

arXiv:2602.21543v1 Announce Type: cross Abstract: Multilingual pretraining typically lacks explicit alignment signals, l...

Safety

On the Structural Non-Preservation of Epistemic Behaviour under Policy Transformation

arXiv:2602.21424v1 Announce Type: cross Abstract: Reinforcement learning (RL) agents under partial observability often c...

Safety

Towards Controllable Video Synthesis of Routine and Rare OR Events

arXiv:2602.21365v1 Announce Type: cross Abstract: Purpose: Curating large-scale datasets of operating room (OR) workflow...

Safety

Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment

arXiv:2602.21346v1 Announce Type: cross Abstract: Recent advances in alignment techniques such as Supervised Fine-Tuning...

Safety

Equitable Evaluation via Elicitation

arXiv:2602.21327v1 Announce Type: cross Abstract: Individuals with similar qualifications and skills may vary in their d...

Safety

Group Orthogonalized Policy Optimization: Group Policy Optimization as Orthogonal Projection in Hilbert Space

arXiv:2602.21269v1 Announce Type: cross Abstract: We present Group Orthogonalized Policy Optimization (GOPO), a new alig...

Safety

A Systematic Review of Algorithmic Red Teaming Methodologies for Assurance and Security of AI Applications

arXiv:2602.21267v1 Announce Type: cross Abstract: Cybersecurity threats are becoming increasingly sophisticated, making ...

Safety

Inference-time Alignment via Sparse Junction Steering

arXiv:2602.21215v1 Announce Type: cross Abstract: Token-level steering has emerged as a pivotal approach for inference-t...

Safety

Language Models Exhibit Inconsistent Biases Towards Algorithmic Agents and Human Experts

arXiv:2602.22070v1 Announce Type: new Abstract: Large language models are increasingly used in decision-making tasks tha...

Safety

Distill and Align Decomposition for Enhanced Claim Verification

arXiv:2602.21857v1 Announce Type: new Abstract: Complex claim verification requires decomposing sentences into verifiabl...

Safety

The ASIR Courage Model: A Phase-Dynamic Framework for Truth Transitions in Human and AI Systems

arXiv:2602.21745v1 Announce Type: new Abstract: We introduce the ASIR (Awakened Shared Intelligence Relationship) Courag...

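Safety

US orders diplomats to lobby against data regulation initiatives

According to a report citing an internal diplomatic cable, the Trump administration has asked US diplomats to lobby against foreign governments' efforts to regulate how US technology companies handle foreigners' data. The report says the US government believes such foreign regulation could interfere with AI-related services. The US State Department did not respond to a request for comment. (Sina Finance)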

Safety

Probability-Invariant Random Walk Learning on Gyral Folding-Based Cortical Similarity Networks for Alzheimer's and Lewy Body Dementia Diagnosis

arXiv:2602.17557v2 Announce Type: replace-cross Abstract: Alzheimer's disease (AD) and Lewy body dementia (LBD) present ...

Safety

SimToolReal: An Object-Centric Policy for Zero-Shot Dexterous Tool Manipulation

arXiv:2602.16863v2 Announce Type: replace-cross Abstract: The ability to manipulate tools significantly expands the set ...

Safety

Intent Laundering: AI Safety Datasets Are Not What They Seem

arXiv:2602.16729v2 Announce Type: replace-cross Abstract: We systematically evaluate the quality of widely used AI safet...

Safety

Anatomy of Capability Emergence: Scale-Invariant Representation Collapse and Top-Down Reorganization in Neural Networks

arXiv:2602.15997v3 Announce Type: replace-cross Abstract: Capability emergence during neural network training remains me...

Safety

Silent Inconsistency in Data-Parallel Full Fine-Tuning: Diagnosing Worker-Level Optimization Misalignment

arXiv:2602.14462v2 Announce Type: replace-cross Abstract: Data-parallel (DP) training with synchronous all-reduce is a d...

Safety

Pawsterior: Variational Flow Matching for Structured Simulation-Based Inference

arXiv:2602.13813v2 Announce Type: replace-cross Abstract: We introduce Pawsterior, a variational flow-matching framework...

Safety

SAS-Net: Scene-Appearance Separation Network for Robust Spatiotemporal Registration in Bidirectional Photoacoustic Microscopy

arXiv:2602.09050v2 Announce Type: replace-cross Abstract: High-speed optical-resolution photoacoustic microscopy (OR-PAM...

Safety

What Matters For Safety Alignment?

arXiv:2601.03868v2 Announce Type: replace-cross Abstract: This paper presents a comprehensive empirical study on the saf...

Safety

Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics

arXiv:2512.16602v3 Announce Type: replace-cross Abstract: We introduce Refusal Steering, an inference-time method to exe...

Safety

Multi-hop Deep Joint Source-Channel Coding with Deep Hash Distillation for Semantically Aligned Image Recovery

arXiv:2510.06868v2 Announce Type: replace-cross Abstract: We consider image transmission via deep joint source-channel c...

Safety

From Parameters to Behaviors: Unsupervised Compression of the Policy Space

arXiv:2509.22566v2 Announce Type: replace-cross Abstract: Despite its recent successes, Deep Reinforcement Learning (DRL...

Safety

RooseBERT: A New Deal For Political Language Modelling

arXiv:2508.03250v3 Announce Type: replace-cross Abstract: The increasing amount of political debates and politics-relate...

Safety

Augmenting Molecular Graphs with Geometries via Machine Learning Interatomic Potentials

arXiv:2507.00407v2 Announce Type: replace-cross Abstract: Accurate molecular property predictions require 3D geometries,...

Safety

Safe Reinforcement Learning for Real-World Engine Control

arXiv:2501.16613v2 Announce Type: replace-cross Abstract: This work introduces a toolchain for applying Reinforcement Le...

Safety

BioX-Bridge: Model Bridging for Unsupervised Cross-Modal Knowledge Transfer across Biosignals

arXiv:2510.02276v2 Announce Type: replace Abstract: Biosignals offer valuable insights into the physiological states of ...

Safety

Localized Dynamics-Aware Domain Adaption for Off-Dynamics Offline Reinforcement Learning

arXiv:2602.21072v1 Announce Type: cross Abstract: Off-dynamics offline reinforcement learning (RL) aims to learn a polic...

Safety

UrbanFM: Scaling Urban Spatio-Temporal Foundation Models

arXiv:2602.20677v1 Announce Type: cross Abstract: Urban systems, as dynamic complex systems, continuously generate spati...

Safety

PRECTR-V2: Unified Relevance-CTR Framework with Cross-User Preference Mining, Exposure Bias Correction, and LLM-Distilled Encoder Optimization

arXiv:2602.20676v1 Announce Type: cross Abstract: In search systems, effectively coordinating the two core objectives of...

Safety

Three Concrete Challenges and Two Hopes for the Safety of Unsupervised Elicitation

arXiv:2602.20400v1 Announce Type: cross Abstract: To steer language models towards truthful outputs on tasks which are b...

Safety

CAGE: A Framework for Culturally Adaptive Red-Teaming Benchmark Generation

arXiv:2602.20170v1 Announce Type: cross Abstract: Existing red-teaming benchmarks, when adapted to new languages via dir...

Safety

Pressure Reveals Character: Behavioural Alignment Evaluation at Depth

arXiv:2602.20813v1 Announce Type: new Abstract: Evaluating alignment in language models requires testing how they behave...