Daily Papers Arch&EAI

Snapshot: 20260611_0812

Geometric Formulation of Unified Force-Impedance Control on SE(3) for Robotic Manipulators

Authors: Joohwan Seo, Nikhil Potu Surya Prakash, Soomi Lee, Arvind Kruthiventy, Megan Teng, Jongeun Choi, Roberto Horowitz

First: 2025-04-23T20:06:09+00:00 · Latest: 2026-06-09T17:41:55+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

In this paper, we present an impedance control framework on the SE(3) manifold, which enables force tracking while guaranteeing passivity. Building upon the unified force-impedance control (UFIC) and our previous work on geometric impedance control (GIC), we develop the geometric unified force impedance control (GUFIC) to account for the SE(3) manifold structure in the controller formulation using a differential geometric perspective. As in the case of the UFIC, the GUFIC utilizes energy tank augmentation for both force-tracking and impedance control to guarantee the manipulator's passivity relative to external forces. This ensures that the end effector maintains safe contact interaction with uncertain environments and tracks a desired interaction force. Moreover, we resolve a non-causal implementation problem in the UFIC formulation by introducing velocity and force fields. Due to its formulation on SE(3), the proposed GUFIC inherits the desirable SE(3) invariance and equivariance properties of the GIC, which helps increase sample efficiency in machine learning applications where a learning algorithm is incorporated into the control law. The proposed control law is validated in a simulation environment under scenarios requiring tracking an SE(3) trajectory, incorporating both position and orientation, while exerting a force on a surface. The codes are available at https://github.com/Joohwan-Seo/GUFIC_mujoco.

Summary / 总结

In this paper, we present an impedance control framework on the SE(3) manifold, which enables force tracking while guaranteeing passivity.

Defeat the Heap: Zero-Copy Data Movement in AXI4MLIR

Authors: Elam Cohavi, Nicolas Bohm Agostini, Jude Haris, Antonino Tumeo, David Kaeli, José Cano

First: 2026-06-09T17:40:13+00:00 · Latest: 2026-06-09T17:40:13+00:00

Comments: Accepted to the 7th Compilers for Machine Learning Workshop (C4ML), co-located with CGO 2026

Abs · PDF · Code1 · Code2

Abstract

As custom hardware accelerators become increasingly central to machine learning workloads, efficient data transfer is critical for maximizing accelerator performance on linear algebra kernels. AXI4MLIR, an extension of the Multi-Level Intermediate Representation (MLIR) compiler framework for automated generation of host-accelerator driver code, incurs significant runtime overhead due to non-zero-copy CPU-accelerator data movement. During transfers from the host to the accelerator, data is copied from heap-allocated memory buffers into contiguous Direct Memory Access (DMA)-mapped buffers. This work identifies this copy as a redundant staging operation and eliminates it through zero-copy data movement. The optimization extends accel, an MLIR dialect introduced by AXI4MLIR, and implements lowering support that allocates buffers directly within DMA-mapped memory, thereby omitting the staging copy. We evaluate the proposed scheme using a configurable matrix-matrix multiplication accelerator and show that the zero-copy optimization reduces main memory data movement by up to 2x, increasing overall accelerator utilization.

Summary / 总结

As custom hardware accelerators become increasingly central to machine learning workloads, efficient data transfer is critical for maximizing accelerator performance on linear algebra kernels.

Towards Autonomous Accelerator Design: FPGA Accelerator Generation with SECDA

Authors: Vinamra Sharma, Xingjian Fu, Jude Haris, José Cano

First: 2026-06-09T17:14:44+00:00 · Latest: 2026-06-09T17:14:44+00:00

Comments: Accepted to the Machine Learning for Architecture and Systems Workshop (MLArchSys), co-located with ISCA 2026

Abs · PDF · Code1 · Code2

Abstract

Designing FPGA-based accelerators for modern artificial intelligence workloads requires exploring a large and complex hardware design space that involves architectural parameters, data flow strategies, and memory hierarchies, making the process very time consuming. While existing methodologies such as SECDA enable rapid hardware-software co-design through SystemC simulation and FPGA execution, identifying efficient accelerator configurations remains a largely manual process requiring extensive domain knowledge. SECDA-DSE is a framework that integrates Large Language Models (LLMs) into the SECDA ecosystem to guide design space exploration (DSE) of FPGA-based accelerators. It combines a structured DSE Explorer for generating candidate architectures with an LLM Stack that performs reasoning-guided exploration using retrieval-augmented generation and chain-of-thought prompting, coupled with a feedback loop for iterative and reinforced refinement. Building on our previous work introducing SECDA-DSE, this paper extends its evaluation by generating three accelerator designs, including element-wise vector multiplication, 2D convolution, and matrix transpose, and performing end-to-end execution on FPGA hardware. The results show that SECDA-DSE can generate SECDA-compliant accelerator designs that are successfully synthesized and executed on FPGA hardware. Furthermore, the generated designs capture kernel-specific trade-offs between compute parallelism and data movement, highlighting the potential of LLM-guided exploration to adapt architectural configurations across diverse workloads while reducing exploration time and the need for extensive human expertise.

Summary / 总结

FADA: Accessible fetal ultrasound interpretation and annotation with a selectively distilled unified vision-language model

Authors: Mahmood Alzubaidi, Uzair Shah, Raden Muaz, Ines Abbes, Nader Mohammed, Abdullatif Magram, Khalid Alyafei, Mowafa Househ, Marco Agus

First: 2026-06-09T17:03:37+00:00 · Latest: 2026-06-09T17:03:37+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

A global shortage of trained sonographers limits prenatal ultrasound screening in low- and middle-income countries, where over half of pregnant women receive no skilled sonography. Current deep learning approaches address detection, segmentation, or classification in isolation, each demanding a separate model and expert-specified labels at inference. We present FADA, a unified vision-language model built on Qwen3.5-VL that performs clinical interpretation, classification, detection, and segmentation through a single interpretation-first pipeline without external labels. FADA distills knowledge from four domain-specific foundation models (FetalCLIP, UltraSAM, USF-MAE, UltraFedFM) via offline pre-computed feature caching. Selective distillation, which applies feature alignment only to annotation tasks while interpretation relies on standard fine-tuning, consistently outperforms full distillation across most evaluation axes. The recommended variant, FADA-SKD, achieves 0.8820 mean Dice for segmentation, 0.7671 mAP@0.50 for detection, and 100% structured interpretation compliance. Expert sonographer validation across 237 images confirms clinically acceptable outputs in both autonomous and human-in-the-loop modes, with 73.5% of interpretations scoring perfectly under clinician guidance. The system is trainable on a single consumer GPU and deployable without cloud connectivity. We validate edge deployment by running the compressed 0.8B model on a commodity smartphone (Qualcomm Snapdragon 7 Gen 1, 12 GB RAM) using llama.cpp with GGUF quantization, completing the full 5-phase pipeline in approximately 60 seconds entirely offline. This establishes a practical pathway for integrating AI-assisted fetal assessment with portable ultrasound devices, directly addressing diagnostic access gaps in resource-constrained settings. Code, models, and data are available at https://github.com/mahmoodphd/FADA.

Summary / 总结

A global shortage of trained sonographers limits prenatal ultrasound screening in low- and middle-income countries, where over half of pregnant women receive no skilled sonography.

GRAU: Generic Reconfigurable Activation Unit Design for Neural Network Hardware Accelerators

Authors: Yuhao Liu, Salim Ullah, Akash Kumar

First: 2026-02-25T19:18:22+00:00 · Latest: 2026-06-09T16:42:58+00:00

Abs · PDF · Code1 · Code2

Abstract

With the continuous growth of neural network scales, low-precision quantization is widely used in edge accelerators. Classic multi-threshold activation hardware requires 2^n thresholds for $n$-bit outputs, causing a rapid increase in hardware cost as precision increases. We propose a reconfigurable activation hardware, GRAU, based on piecewise linear fitting, where the segment slopes are approximated by powers of two. Our design requires only basic comparators and 1-bit right shifters, supporting mixed-precision quantization and nonlinear functions such as SiLU. Compared with multi-threshold activators, GRAU reduces LUT consumption by over 90%, achieving higher hardware efficiency, flexibility, and scalability. The best trade-off is usually achieved with 6-8 segments, while complex nonlinearities under aggressive low-cost settings may suffer larger accuracy degradation.

Summary / 总结

With the continuous growth of neural network scales, low-precision quantization is widely used in edge accelerators.

Coset Ensemble Decoder for Quantum Error Correction with Algorithm-Hardware Co-Design

Authors: Shuang Liang, Jubo Xu, Giulio Bassanino, Qianzhou Wang, Yidong Zhou, Yuncheng Lu, Zhiwen Mo, Paul H. J. Kelly, Bo Yuan, Wayne Luk, Hongxiang Fan

First: 2026-06-09T16:37:13+00:00 · Latest: 2026-06-09T16:37:13+00:00

Comments: 15 pages, 19 figures, 1 table. Accepted to appear in the 53rd Annual International Symposium on Computer Architecture (ISCA 2026)

Abs · PDF · Code1 · Code2 · Code3

Abstract

Reliable large-scale quantum computation relies on fault-tolerant architectures, where quantum error correction (QEC) continuously extracts and decodes error syndromes in real time. A critical component in QEC is the decoder, a classical subsystem that must simultaneously deliver high logical accuracy and ultra-low latency. This paper presents a novel algorithm-hardware co-design that improves the accuracy-latency trade-off over existing approaches such as vanilla Minimum-Weight Perfect Matching (MWPM) and Union-Find (UF) decoders. At the algorithmic level, we introduce coset ensemble decoding, which improves UF decoding by explicitly exploiting logically equivalent cosets. Our method performs ensemble forest exploration to generate multiple coset-consistent candidates and aggregates them to approximate coset-level maximum-likelihood decoding. We further reduce computational and memory complexity via reverse-order elimination and lossless graph compression, without sacrificing accuracy. At the hardware level, we design a domain-specific architecture that temporally reuses resources, avoiding the code-distance-proportional resource growth in prior spatial architectures. Several optimizations, such as multi-bank memory hashing and hierarchical ID mapping, are proposed to mitigate pipeline stalls and memory conflicts under highly concurrent access patterns. Under a circuit-level depolarizing noise model, our co-design approach achieves a better accuracy-latency trade-off than prior MWPM- and UF-based decoders, while reducing FPGA LUT consumption by up to 8.2 times compared with reported UF-based decoder resources. The tunable candidate number further exposes a flexible design knob, enabling users to tailor decoding performance to the requirements of different fault-tolerant workloads. Our implementation is publicly available at https://github.com/IMSeonL/coset-ensemble-decoder.

Summary / 总结

Reliable large-scale quantum computation relies on fault-tolerant architectures, where quantum error correction (QEC) continuously extracts and decodes error syndromes in real time.

MV-Actor: Aligning Multi-View Semantics and Spatial Awareness for Bimanual Manipulation

Authors: Yinchen Tian, Huan Li, Muyao Peng, Xi Wang, Yan Wang, You Yang

First: 2026-06-09T14:09:13+00:00 · Latest: 2026-06-09T14:09:13+00:00

Comments: 14 pages,9 figures

Abs · PDF · Code1 · Code2

Abstract

Robotic manipulation has been widely applied in industrial scenarios. Compared with single-arm manipulation, bimanual manipulation is equipped with multiple cameras to capture information from different viewpoints. However, existing multi-view policies encode each view independently or fuse view features shallowly, resulting in limited sharing semantic perception and unreliable spatial awareness. In this paper, we propose \textbf{MV-Actor}, a multi-view perception framework that builds a unified semantic-spatial representation for bimanual manipulation. First, MV-Actor performs Multi-view Semantic Interaction to share semantic perception across views. Then it uses Semantic-Spatial Token Interaction to ground visual semantics with feed-forward reconstruction model features and acquire reliable spatial awareness. Finally, a Guided Metric Depth Repair module refines degraded sensor depth to provide more reliable metric anchors under consumer-grade depth noise. In simulation experiments conducted on the PerAct2 bimanual benchmark, MV-Actor achieves a state-of-the-art average success rate of 87.8\%. In real-world evaluations with more frequent viewpoint changes and unstable consumer-grade depth, MV-Actor outperforms both RGB and RGB-D baselines, further demonstrating the benefit of sharing semantic perception and reliable spatial awareness for bimanual manipulation.

Summary / 总结

Robotic manipulation has been widely applied in industrial scenarios.

MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation

Authors: Mehrshad Taji, Arad Mahdinezhad Kashani, Iman Ahmadi, AmirHossein Jadidi, Saina Kashani, Babak Khalaj

First: 2026-02-18T21:28:56+00:00 · Latest: 2026-06-09T14:01:51+00:00

Comments: Some fundemental change in text and codebase

Abs · PDF · Code1 · Code2 · Code3

Abstract

Task planning for robotic manipulation with large language models (LLMs) is an emerging area. Prior approaches rely on specialized models, fine tuning, or prompt tuning, and often operate in an open loop manner without robust environmental feedback, making them fragile in dynamic settings. MALLVI presents a Multi Agent Large Language and Vision framework that enables closed-loop feedback driven robotic manipulation. Given a natural language instruction and an image of the environment, MALLVI generates executable atomic actions for a robot manipulator. After action execution, a Vision Language Model (VLM) evaluates environmental feedback and decides whether to repeat the process or proceed to the next step. Rather than using a single model, MALLVI coordinates specialized agents, Decomposer, Localizer, Thinker, and Reflector, to manage perception, localization, reasoning, and high level planning. An optional Descriptor agent provides visual memory of the initial state. The Reflector supports targeted error detection and recovery by reactivating only relevant agents, avoiding full replanning. Experiments in simulation and real-world settings show that iterative closed loop multi agent coordination improves generalization and increases success rates in zero shot manipulation tasks. Code available at https://github.com/iman1234ahmadi/MALLVI .

Summary / 总结

Task planning for robotic manipulation with large language models (LLMs) is an emerging area.

From Human Guidance to Autonomy: Agent Skill System for End-to-End LLM Deployment on Spatial NPUs

Authors: Jiajie Li, Erwei Wang, Zhiru Zhang, Samuel Bayliss

First: 2026-05-27T18:16:23+00:00 · Latest: 2026-06-09T13:49:01+00:00

Comments: Accepted to the Machine Learning for Architecture and Systems Workshop (MLArchSys), co-located with ISCA 2026

Abs · PDF · Code1 · Code2

Abstract

Spatial neural processing units (NPUs) provide an energy-efficient platform for edge LLM inference, but efficiently deploying an LLM end-to-end on such hardware remains labor-intensive. Although AI coding agents have begun to lower this cost, existing studies have largely focused on single-kernel optimization rather than end-to-end LLM deployment on resource-constrained spatial NPUs. We present a two-stage methodology, instantiated on the AMD XDNA 2 NPU, that progresses from human-guided development to agent autonomy. In the first stage, we develop a reference deployment of Llama-3.2-1B through human-guided agent assistance. The resulting implementation achieves a speedup of 2.2x on prefill and 4.0x on decode over the hand-optimized baseline, with the optimization trajectory and its lessons recorded as structured documentation throughout. In the second stage, we distill the documentation into an agent skill system consisting of eight phases, orchestrating the optimization and debugging skill sets, with numerical correctness strictly enforced at each phase. Using our agent skill system, we autonomously deploy eight additional decoder-only LLMs (Llama-3.2-3B, SmolLM2-1.7B, Qwen2.5-{0.5B, 1.5B, 3B}, Qwen3-{0.6B, 1.7B, 4B}) end-to-end on the AMD XDNA 2 NPU using the open-source compiler stack. To our knowledge, these models have not previously been deployed on AMD NPUs via any open-source software stack. Each deployment completes in 0.5-4 hours of agent wall time with almost no human guidance, and passes the numerical-correctness gates, demonstrating functional generalization to previously unencountered LLMs. Three of the eight match or exceed the sustained performance of our Llama-3.2-1B reference deployment, suggesting that the resulting implementations can be competitive without additional model-specific human engineering.

Summary / 总结

Spatial neural processing units (NPUs) provide an energy-efficient platform for edge LLM inference, but efficiently deploying an LLM end-to-end on such hardware remains labor-intensive.

LIBERO-Occ: Evaluating and Improving Vision-Language-Action Models under Scene-Induced Occlusion via Viewpoint Imagination

Authors: Taishan Li, Jiwen Zhang, Siyuan Wang, Xuanjing Huang, Zhongyu Wei

First: 2026-06-09T13:39:49+00:00 · Latest: 2026-06-09T13:39:49+00:00

Comments: 14 pages, 7 figures

Abs · PDF · Code1 · Code2 · Code3

Abstract

Vision-Language-Action (VLA) models achieve strong performance on standard manipulation benchmarks, but most evaluations assume that task-relevant objects are fully visible. This assumption often fails in realistic settings, where occlusion makes manipulation partially observable. In this paper, we study \textit{scene-induced occlusion} as a fundamental challenge for VLA models and introduce \textbf{LIBERO-Occ}, an occlusion-oriented extension of LIBERO. Experiments show that state-of-the-art VLAs suffer substantial performance degradation under occlusion. To address this issue, we propose \textbf{Viewpoint Imagination (VIM)}, which generates a complementary view from an occluded primary observation and conditions action prediction on both observed and imagined evidence. VIM improves robustness across task suites, occlusion types, and severity levels without requiring additional cameras at deployment time, suggesting that viewpoint imagination is an promising mechanism for perception completion in partially observable manipulation. Our benchmark and corresponding code are available at: \href{https://github.com/litsh/Libero-Occ}{https://github.com/litsh/Libero-Occ}.

Summary / 总结

Vision-Language-Action (VLA) models achieve strong performance on standard manipulation benchmarks, but most evaluations assume that task-relevant objects are fully visible.

A 185 TOPS/W/mm2 Bayesian Inference Engine with 640 aJ Write-Free FeFET GRNG for Uncertainty-Aware Aerial Search and Rescue

Authors: Zephan M. Enciso, Xuezhong Niu, Xingtian Wang, Mohammad Mehdi Sharifi, Subhasish Mukherjee, Likai Pei, Halid Mulaosmanovic, Stefan Duenkel, Sven Beyer, Michael Niemier, Kai Ni, Ningyuan Cao

First: 2026-06-09T13:04:02+00:00 · Latest: 2026-06-09T13:04:02+00:00

Abs · PDF · Code1 · Code2

Abstract

Aerial search and rescue missions require fast and reliable victim detection under uncertain and rapidly changing environments. Deterministic deep learning models can produce overconfident false positives, forcing unmanned aircraft systems to perform costly verification maneuvers that reduce search coverage and increase rescue delay. Bayesian neural networks provide uncertainty-aware detection, but their sampling overhead is challenging for battery-constrained edge platforms. This work presents a FeFET-based Bayesian inference engine with a write-free central limit theorem Gaussian random number generator embedded in a compute-in-memory macro. By summing currents from a randomly selected subset of minimum-sized, programmed-once FeFETs, the proposed architecture eliminates energy- and endurance-intensive write operations during inference while maintaining scalable Gaussian sampling. The CLT-GRNG consumes 640 aJ per sample, providing a 560x energy-efficiency improvement over prior BNN accelerators, while the CIM tile achieves 185 TOPS/W/mm2. Evaluated on aerial search and rescue detection, the Bayesian model improves uncertainty calibration and robustness under environmental corruption, reducing risk and enabling low-confidence detections to be filtered before costly verification. These results demonstrate an energy-efficient and uncertainty-aware edge AI engine for autonomous search and rescue systems.

Summary / 总结

Aerial search and rescue missions require fast and reliable victim detection under uncertain and rapidly changing environments.

IMPACT: Learning Internal-Model Predictive Control for Forceful Robotic Manipulation

Authors: Jiawei Gao, Chaoqi Liu, Peilin Wu, Haonan Chen, Yilun Du

First: 2026-06-09T13:00:56+00:00 · Latest: 2026-06-09T13:00:56+00:00

Comments: Project website: https://gao-jiawei.com/IMPACT/

Abs · PDF · Code1 · Code2

Abstract

Real-world robotic manipulation tasks often involve forceful interactions with the environment, such as using tools of varying weights, transporting objects with different masses, and performing contact-rich tasks like table wiping. Previous learning-based approaches typically employ imitation learning policies that output target end-effector poses tracked by low-level impedance controllers. In these systems, forceful interactions are either implicitly realized through steady-state tracking errors or explicitly commanded using wrist force/torque or tactile sensors. However, implicit approaches generalize poorly across object weights, while explicit approaches require specialized hardware and increase system complexity. In this work, we propose IMPACT, a framework that decouples these forceful tasks into task-planning and internal-model-based predictive control. Extensive simulation and real-world experiments demonstrate that the proposed framework achieves higher success rates and improved generalization to unseen object weights, as well as better safety and energy efficiency.

Summary / 总结

Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use

Authors: Zhixin Ma, Yutong Zhou, Yongqi Li, Chong-Wah Ngo, Wenjie Li

First: 2026-06-09T12:49:11+00:00 · Latest: 2026-06-09T12:49:11+00:00

Abs · PDF · Code1 · Code2

Abstract

Multimodal Large Language Models (MLLMs) excel at utilizing digital APIs and increasingly serve as the "brain" of embodied AI, instructing robots to interact with the physical world. In such embodied settings, a central capability is the use of physical tools, which underpins MLLMs' ability to assist humans in real-world tasks. Despite the importance, MLLMs' proficiency in physical tool use remains largely unexplored. To address this gap, we introduce PhysTool-Bench, the first physical tool-use benchmark designed to evaluate MLLMs' ability to comprehend real-world scenarios, identify physical tools, and plan their use. PhysTool-Bench comprises 2,510 queries over 2,678 real-world physical tools spanning diverse domains, including manufacturing, electrical work, agriculture, and healthcare. Concretely, models are evaluated along two primary dimensions: 1) recognizing all physical tools present in the scene, and 2) planning the tool selection and use sequence based on the instruction and visual context. Across 13 leading MLLMs, even the strongest model (Gemini-3.1-Pro) identifies only 58.7% of tools in a scene and completes merely 21.0% of queries end-to-end. Our analysis reveals a two-level deficit: MLLMs struggle to perceive tools in realistic scenes, and the much larger drop at the planning stage further indicates a lack of functional commonsense for mapping perceived tools onto task semantics, pinpointing a critical bottleneck for the development of practical embodied AI.

Summary / 总结

Multimodal Large Language Models (MLLMs) excel at utilizing digital APIs and increasingly serve as the "brain" of embodied AI, instructing robots to interact with the physical world.

CITRAS-FM: Tiny Time Series Foundation Model for Covariate-Informed Zero-Shot Forecasting

Authors: Yosuke Yamaguchi, Issei Suemitsu, Yuki Kajihara, Wenpeng Wei

First: 2026-06-09T12:46:15+00:00 · Latest: 2026-06-09T12:46:15+00:00

Comments: Accepted to EUSIPCO 2026

Abs · PDF · Code1 · Code2

Abstract

Pretrained time series foundation models (TSFMs) have enabled zero-shot forecasting on unseen target series. However, existing TSFMs often incur high computational cost and provide limited support for diverse variable types, often failing to account for covariates that exogenously influence target variability. To address these challenges, we propose CITRAS-FM, a tiny 7M-parameter TSFM that supports univariate, multivariate, and covariate-informed zero-shot forecasting with real-time CPU inference. Built on a patch-based, decoder-only Transformer, CITRAS-FM introduces Shifted Attention into the cross-variate module to effectively exploit known covariates accessible throughout the forecast horizon. Moreover, to enable covariate-aware pretraining despite the scarcity of covariate-rich corpora, we propose CovSynth, which synthesizes realistic covariates from decomposed components of target series. Experiments on fev-bench, spanning 100 tasks across various settings, demonstrate that CITRAS-FM achieves state-of-the-art zero-shot accuracy among sub-10M TSFMs while delivering sub-0.1-second CPU inference, offering a strong balance between forecasting accuracy and real-time deployability.

Summary / 总结

Pretrained time series foundation models (TSFMs) have enabled zero-shot forecasting on unseen target series.

QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models

Authors: Yixuan Li, Yuhui Chen, Mingcai Zhou, Haoran Li, Zhengtao Zhang, Dongbin Zhao

First: 2025-10-16T16:11:18+00:00 · Latest: 2026-06-09T10:47:44+00:00

Abs · PDF · Code1 · Code2

Abstract

Spatial perception and reasoning are crucial for Vision-Language-Action (VLA) models to accomplish fine-grained manipulation tasks. However, existing approaches often lack the ability to understand and reason over the essential 3D structures necessary for precise control. To address this limitation, we propose QDepth-VLA, a general framework that augments VLA models with an auxiliary depth prediction task. A dedicated depth expert is designed to predict quantized latent tokens of depth maps obtained from a VQ-VAE encoder, enabling the model to learn depth-aware representations that capture critical geometric cues. Experimental results on the simulation benchmarks and real-world tasks demonstrate that QDepth-VLA yields strong spatial reasoning and competitive performance on manipulation tasks.

Summary / 总结

Spatial perception and reasoning are crucial for Vision-Language-Action (VLA) models to accomplish fine-grained manipulation tasks.

A Survey of Robotic Navigation and Manipulation with Physics Simulators in the Era of Embodied AI

Authors: Lik Hang Kenny Wong, Xueyang Kang, Kaixin Bai, Jianwei Zhang

First: 2025-05-01T09:22:23+00:00 · Latest: 2026-06-09T09:31:25+00:00

Comments: Under Review

Abs · PDF · Code1 · Code2

Abstract

Navigation and manipulation are core capabilities in Embodied AI, but training agents to perform them directly in the real world is costly, time-consuming, and unsafe. Therefore, sim-to-real transfer has emerged as a key approach, yet the sim-to-real gap persists. This survey examines how physics simulators address this gap by analyzing properties that have received limited attention in prior surveys. We also analyze their features for navigation and manipulation tasks, as well as their hardware requirements. Additionally, we offer a resource with benchmark datasets, metrics, simulation platforms, and methods to help researchers select suitable tools while accounting for hardware constraints.

Summary / 总结

Navigation and manipulation are core capabilities in Embodied AI, but training agents to perform them directly in the real world is costly, time-consuming, and unsafe.

Online Self-Training for Co-Adaptation in Hierarchical Diffusion Policies

Authors: Clemence Grislain, Mathilde Kappel, Olivier Sigaud, Mohamed Chetouani

Venue: ICML 2026

First: 2026-03-05T15:34:43+00:00 · Latest: 2026-06-09T09:25:28+00:00

Comments: Accepted at ICML 2026 Workshop on Decision-Making from Offline Datasets to Online Adaptation (DEMO)

Abs · PDF · Code1 · Code2

Abstract

Hierarchical policies decompose language-conditioned long-horizon robotic manipulation into a high-level planner and a low-level controller. However, effective coordination between HL and LL requires that both components operate on compatible subgoal distributions. We propose ORCHID, a self-training framework that enables stable online improvement of hierarchical diffusion policies by aligning planning and control through iterative refinement. By filtering policy samples via environment feedback, ORCHID identifies trajectories where the planner and controller are jointly successful and distills them back into both modules via supervised learning. This process induces a bidirectional co-adaptation: the planner grounds its subgoals in the actual reaching capabilities of the controller, while the controller specializes in the trajectory structures the planner produces. By relying on supervised distillation of filtered on-policy samples, ORCHID avoids the instability typical of online hierarchical gradient-based RL training with diffusion models. On the CALVIN benchmark, ORCHID allows a lightweight, initially weak model to outperform pure offline methods, including a Vision-Language-Action model twice its size.

Summary / 总结

Hierarchical policies decompose language-conditioned long-horizon robotic manipulation into a high-level planner and a low-level controller.

Dexterous Point Policy: Learning Point-based Dexterous Hand Policies from Human Demonstrations

Authors: Beomjun Kim, Seong Hyeon Park, Seunghoon Sim, Seungjun Moon, Sanghyeok Lee, Jinwoo Shin

First: 2026-06-09T09:13:36+00:00 · Latest: 2026-06-09T09:13:36+00:00

Abs · PDF · Code1 · Code2

Abstract

Robotic foundation models pre-trained on human demonstration videos have shown promise, but a significant embodiment gap remains when the resulting policies are deployed on real robots. A common remedy is to fine-tune these models on robot-specific demonstrations. However, robot data collection can be prohibitively expensive and time-consuming, which is particularly acute in dexterous manipulation, e.g., teleoperating a multi-fingered hand for even a single atomic task can take days. To address this, we introduce Dexterous Point Policy, a framework that learns dexterous manipulation policies directly from human videos and requires no robot demonstrations. Our core insight is that a unified 3D keypoint representation can bridge human and robot embodiments when used for both observations and actions. Specifically, we extract 3D keypoints of task-relevant objects and human hands from raw videos, and train an autoregressive transformer over these keypoints. We observe that at the keypoint level, specifically the wrist and fingertips, human and robot behaviors closely align, enabling direct policy transfer. On a suite of real-robot tasks spanning pick-and-place and tool use, Dexterous Point Policy attains 75.0% success, whereas a state-of-the-art VLA baseline reaches only 1.0%. Furthermore, our method generalizes strongly to unseen scenarios, including multi-object environments and novel object categories.

Summary / 总结

Robotic foundation models pre-trained on human demonstration videos have shown promise, but a significant embodiment gap remains when the resulting policies are deployed on real robots.

AgenticNav: Zero-Shot Vision-and-Language Navigation as a Tool-Calling Harness

Authors: Yijian Li, Changze Li, Hantian Shi, Jiaying Luo, Jiyuan Cai, Ming Yang, Tong Qin

First: 2026-06-09T08:43:05+00:00 · Latest: 2026-06-09T08:43:05+00:00

Abs · PDF · Code1 · Code2

Abstract

Zero-shot vision-and-language navigation in continuous environments (VLN-CE) has recently become feasible with large vision-language models (VLMs). However, existing methods typically rely on learned waypoint predictors to propose navigable actions. This severely limits the model's action space and fails to leverage depth inputs effectively. Moreover, memory is commonly handled by accumulating long textual or visual histories with substantial irrelevant context, or by retrieving cross-episode experiences, which weakens the zero-shot setting. In this paper, we rethink zero-shot VLN-CE as an agentic interface between the VLM and the environment, and present AgenticNav, a lightweight navigation harness that exposes action, depth, and memory as callable tools. Instead of choosing from predicted waypoints, the action tool allows the VLM to directly select a target pixel in RGB observations, converting it into executable motion. Depth is exposed through an on-demand pixel-depth tool, enabling the VLM to request precise metric distances only where they matter. For memory, AgenticNav provides a compact map image summarizing the historical trajectory, paired with a recall tool that allows the VLM to selectively revisit past visual observations without overwhelming the prompt context. On the R2R-CE benchmark, AgenticNav establishes new state-of-the-art (SOTA) performance among zero-shot methods given the same VLM backbone. Real-world validation further highlights its zero-shot generalization compared to prior methods. Ablations show that our action tool design outperforms traditional waypoint predictors, and that depth tool and agentic memory further contribute to navigation performance.

Summary / 总结

Zero-shot vision-and-language navigation in continuous environments (VLN-CE) has recently become feasible with large vision-language models (VLMs).

One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA

Authors: Zhi Zheng, Ziqiao Meng, Hao Luan, Wei Liu, Wee Sun Lee

First: 2026-06-09T08:36:08+00:00 · Latest: 2026-06-09T08:36:08+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

External memory effectively grounds large language models (LLMs) and vision-language models (VLMs)-based question answering (QA) in relevant multimodal evidence. However, existing memory paradigms represent each memory item in raw text and image forms, so retrieval-based systems must pass the retrieved text or images to the generation LLMs/VLMs, resulting in high token consumption and storage pressure, making it unaffordable for resource-constrained applications. We propose Latent Memory, a latent-space memory paradigm that replaces each raw text or image evidence item with a single high-dimensional latent token produced by a small compressor LLM/VLM. Rather than retrieving raw evidence for generation, Latent Memory operates in a unified latent representation space: the query is embedded into this space to retrieve relevant latent tokens, and the retrieved latent tokens are directly prompted to a pretrained LLM or VLM for answer generation. To make each latent token simultaneously informative for reconstruction, retrieval, and generation, we train the compressor with reconstruction, contrastive, and distillation objectives in a unified end-to-end manner. Latent Memory is evaluated on seven text-only QA benchmarks (e.g., HotpotQA) and multimodal QA benchmarks, where it achieves competitive QA performance compared to advanced RAG baselines while consuming 3x to 10x fewer generator tokens. It can also deliver the strongest image-grounded QA performance on WebQA. Code is available at https://github.com/zz1358m/Latent-Memory-Master.

Summary / 总结

External memory effectively grounds large language models (LLMs) and vision-language models (VLMs)-based question answering (QA) in relevant multimodal evidence.

VeriSpace: Spatially Grounded Action Verification for Vision-Language-Action Models

Authors: Guiyu Zhao, Longteng Guo, Junyou Zhu, Jun Fu, Yanghong Mei, Bin Cao, Jie Jiang, Xingjian He, Jing Liu

Venue: ACM MM

First: 2026-06-09T08:31:59+00:00 · Latest: 2026-06-09T08:31:59+00:00

Comments: Submit to ACM MM

Abs · PDF · Code1 · Code2

Abstract

Vision-language-action (VLA) models have shown strong promise for robotic manipulation, but their reliability at test time remains limited by one-shot action prediction, where even small action errors can cause grasp failure, collision, or incorrect task progression. A natural alternative is to equip VLA systems with test-time verification, allowing multiple candidate actions to be proposed and evaluated before execution. However, reliable action verification is challenging because it requires not only distinguishing subtle geometric differences between candidate actions, but also assessing whether an action makes meaningful progress toward the task goal. We present VeriSpace, a 3D-aware action verifier for test-time action selection in VLA systems. VeriSpace evaluates candidate actions through two key components: Dual-Path 3D-Injected Scene Encoding, which constructs a scene representation that jointly preserves visual semantics and explicit 3D geometry, and Spatially-Grounded Action Reasoning, which evaluates each action by reasoning over task-relevant spatial relations, geometric validity, and expected goal progress. Together, these components enable more reliable discrimination between subtle yet outcome-critical action candidates while remaining fully compatible with existing VLA policies. Experiments on public benchmarks and real-world robotic manipulation tasks show that VeriSpace consistently improves decision reliability over both underlying VLA policies and prior verification-based methods, yielding substantial gains in both in-distribution and out-of-distribution settings.

Summary / 总结

HandCept: A Visual-Inertial Fusion Framework for Accurate Proprioception in Dexterous Hands

Authors: Huang Junda, Honghao Guo, Hao Wu, Zhengyang Liu, Marcelo H Ang, Jianshu Zhou

First: 2025-05-13T04:06:35+00:00 · Latest: 2026-06-09T08:27:49+00:00

Comments: 8 pages, 7 figures, conference

Abs · PDF · Code1 · Code2 · Code3

Abstract

As robotics progresses toward general manipulation, dexterous hands are becoming increasingly critical. However, proprioception in dexterous hands remains a bottleneck due to limitations in volume and generality. In this work, we present HandCept, the first visual-inertial proprioception framework designed to overcome the challenges of traditional joint angle estimation methods for dexterous hands. HandCept addresses the difficulty of achieving accurate and robust joint angle estimation in dynamic environments where both visual and inertial measurements are prone to noise and drift. It leverages a zero-shot learning approach using a wrist-mounted RGB-D camera and 9-axis IMUs, fused in real time via a latency-free Extended Kalman Filter (EKF). Our results show that HandCept achieves joint angle estimation errors generally between $2^{\circ}$ and $4^{\circ}$ without observable drift, outperforming visual-only and inertial-only methods. Furthermore, we validate the stability and uniformity of the IMU system, demonstrating that a common base frame across IMUs simplifies system calibration. To support sim-to-real transfer, we also open-source our high-fidelity rendering pipeline, which is essential for training without real-world ground truth. This work offers a robust, generalizable solution for proprioception in dexterous hands, with significant implications for robotic manipulation and human-robot interaction. https://github.com/huangjund/blenderYCB

Summary / 总结

As robotics progresses toward general manipulation, dexterous hands are becoming increasingly critical.

FedSLoP: Memory-Efficient Federated Learning with Low-Rank Gradient Projection

Authors: Yutong He, Zhengyang Huang, Jiahe Geng, Kun Yuan

First: 2026-04-27T03:47:50+00:00 · Latest: 2026-06-09T08:27:40+00:00

Comments: 27 pages, 7 figures

Abs · PDF · Code1 · Code2 · Code3

Abstract

Federated learning enables a population of clients to collaboratively train machine learning models without exchanging their raw data, but standard algorithms such as FedAvg suffer from slow convergence and high communication and memory costs in heterogeneous, resource-constrained environments. We introduce FedSLoP, a federated optimization algorithm that combines stochastic low-rank subspace projections of gradients, thereby reducing the dimension of communicated and stored updates while preserving optimization progress. On the theoretical side, we develop a detailed nonconvex convergence analysis under standard smoothness and bounded-variance assumptions, showing that FedSLoP is guaranteed to converge to a first-order stationary point at a rate of $O(1/\sqrt{NT})$. On the empirical side, we conduct extensive experiments on federated MNIST classification with heterogeneous data partitions, showing that FedSLoP substantially reduces communication volume and client-side memory while achieving competitive or better accuracy compared with FedAvg and representative sparse or low-rank baselines. Together, our results demonstrate that random subspace momentum methods such as FedSLoP provide a principled and effective approach to communication- and memory-efficient federated learning. Codes are available at: https://github.com/pkumelon/FedSLoP.git.

Summary / 总结

A Hybrid Edge-Cloud Architecture for Low-Latency Entitlement Verification in Resource-Constrained Devices

Authors: Pravin Nagare, Aditya Sabbineni, Devendra Dahiphale, Faiz Gouri, Pratik Thantharate

First: 2026-06-09T08:04:52+00:00 · Latest: 2026-06-09T08:04:52+00:00

Comments: 6 pages, 4 figures, 2 tables, 1 algorithm. Prepared in IEEE format. Proposes the AEC-PR framework for low-latency OTT entitlement verification using TEE and Ed25519

Abs · PDF · Code1 · Code2

Abstract

As digital media consumption shifts toward large-scale Over-the-Top (OTT) platforms, the efficiency of the control plane, specifically entitlement and identity verification, has become a critical factor in user experience. Current architectures often rely on synchronous cloud-tethered validation flows that introduce significant latency, especially on resource-constrained consumer electronics. This paper proposes a Hybrid Edge-Cloud Entitlement Framework designed to minimize user-perceived friction. By implementing a secure, local caching layer within device middleware and utilizing an Adaptive Entitlement Cache with Proactive Refresh (AEC-PR) algorithm, we decouple the user interaction from backend network variability. We evaluate the performance on ARM Cortex-A series hardware, demonstrating that localized cryptographic verification reduces authorization latency from a mean of 422.8ms to 18.4ms (a 95.6% reduction) while mitigating implementation-level side-channel risks through deterministic Ed25519 arithmetic and TEE isolation.

Summary / 总结

Reformulate LLM Reinforcement Learning for Efficient Training under Black-box Discrepancy

Authors: Jiashun Liu, Runze Liu, Xu Wan, Jing Liang, Hongyao Tang, Ling Pan

First: 2026-06-07T18:49:33+00:00 · Latest: 2026-06-09T07:53:57+00:00

Abs · PDF · Code1 · Code2

Abstract

Reinforcement Learning (RL) has emerged as a pivotal post-training paradigm, yet it frequently suffers from unpredictable sub-optimum performance or even training collapses. Recent findings attribute these failures to a hidden train-inference discrepancy (or mismatch), stemming from the disparate underlying engines and architecture. We find that the training policy can actively self-correct such a discrepancy when provided with an appropriate learning signal. Then, we further empirically identify a discrepancy tolerance region: within this region, aggressively narrowing the discrepancy can suppress policy exploration and reduce learning efficiency, whereas outside this region, reducing excessive discrepancy improves optimization consistency and raises the achievable local performance ceiling. According to such findings, we formulate this problem as a Discrepancy-Constrained Markov Decision Process (DCMDP), where reward maximization is coupled with a constraint that aligns training-Inference behavior, achieving stable dual-objective optimization. To adaptively balance performance improvement and discrepancy control, we introduce a Lagrangian relaxation mechanism that dynamically adjusts the relative weight of the two objectives according to the current degree of discrepancy violation. This enables stable dual-objective optimization: the policy is allowed to explore freely within the tolerance region, while being guided back when the discrepancy exceeds the safe boundary. Empirically, DCMDP significantly improves the performance of 8B dense model (Qwen-3-8b) and 30B Mixture-of-Expert model (Qwen-3-30bA3b), and enables a heterogeneous training paradigm, where LLMs can be optimized in high-fidelity training setup while being explicitly aligned for low-cost, resource-constrained inference deployment.

Summary / 总结

Reinforcement Learning (RL) has emerged as a pivotal post-training paradigm, yet it frequently suffers from unpredictable sub-optimum performance or even training collapses.

RobotEQ: Transitioning from Passive Intelligence to Active Intelligence in Embodied AI

Authors: Kuofei Fang, Xinyi Che, Haomin Ouyang, Shufan Zhang, Xuehao Wang, Qi Liu, Liyi Liu, Chenqi Zhang, Wenxi Cai, Wenyu Dai, Jinyang Wu, Fan Zhang, Haoyu Chen, Bin He, Zheng Lian

First: 2026-05-07T13:22:26+00:00 · Latest: 2026-06-09T07:34:06+00:00

Abs · PDF · Code1 · Code2

Abstract

Embodied AI is a prominent research topic in both academia and industry. Current research centers on completing tasks based on explicit user instructions. However, for robots to integrate into human society, they must understand which actions are permissible and which are prohibited, even without explicit commands. We refer to the user-guided AI as passive intelligence and the unguided AI as active intelligence. This paper introduces RobotEQ, the first benchmark for active intelligence, aiming to assess whether existing models can comprehend and adhere to social norms in embodied scenarios. First, we construct RobotEQ-Data, a dataset consisting of 1,894 egocentric images, spanning 10 representative embodied categories and 56 subcategories. Through extensive manual annotation, we provide 4,944 action judgment questions and 1,157 spatial grounding questions, specifying appropriate robot actions across diverse scenarios. Furthermore, we establish RobotEQ-Bench to evaluate the performance of state-of-the-art models on this task. Experimental results demonstrate that current models still fall short in achieving reliable active intelligence, particularly in spatial grounding. Meanwhile, leveraging RAG techniques to incorporate external social norm knowledge bases can generally enhance performance. This work can facilitate the transition of robotics from user-guided passive manipulation to active social compliance.

Summary / 总结

Embodied AI is a prominent research topic in both academia and industry.

TaCarla: A comprehensive benchmarking dataset for end-to-end autonomous driving

Authors: Tugrul Gorgulu, Atakan Dag, M. Esat Kalfaoglu, Halil Ibrahim Kuru, Baris Can Cam, Halil Ibrahim Ozturk, Ozsel Kilinc

Venue: CVPR 2026

First: 2026-02-26T21:16:20+00:00 · Latest: 2026-06-09T07:27:50+00:00

Comments: Accepted at the Third Workshop on Simulation for Autonomous Driving (SAD), CVPR 2026

Abs · PDF · Code1 · Code2

Abstract

Collecting a high-quality dataset is a critical task that demands meticulous attention to detail, as overlooking certain aspects can render the entire dataset unusable. Autonomous driving challenges remain a prominent area of research, requiring further exploration to enhance the perception and planning performance of vehicles. However, existing datasets are often incomplete. For instance, datasets that include perception information generally lack planning data, while planning datasets typically consist of extensive driving sequences where the ego vehicle predominantly drives forward, offering limited behavioral diversity. In addition, many real datasets struggle to evaluate their models, especially for planning tasks, since they lack a proper closed-loop evaluation setup. The CARLA Leaderboard 2.0 challenge, which provides a diverse set of scenarios to address the long-tail problem in autonomous driving, has emerged as a valuable alternative platform for developing perception and planning models in both open-loop and closed-loop evaluation setups. Nevertheless, existing datasets collected on this platform present certain limitations. Some datasets appear to be tailored primarily for limited sensor configuration, with particular sensor configurations. To support end-to-end autonomous driving research, we have collected a new dataset comprising over 2.85 million frames using the CARLA simulation environment for the diverse Leaderboard 2.0 challenge scenarios. Our dataset is designed not only for planning tasks but also supports dynamic object detection, lane divider detection, centerline detection, traffic light recognition, prediction tasks and visual language action models . Furthermore, we demonstrate its versatility by training various models using our dataset. Moreover, we also provide numerical rarity scores to understand how rarely the current state occurs in the dataset.

Summary / 总结

Collecting a high-quality dataset is a critical task that demands meticulous attention to detail, as overlooking certain aspects can render the entire dataset unusable.

Uncovering Vulnerability of Vision-Language-Action Models under Joint-Level Physical Faults

Authors: Minsoo Jo, Taeju Kwon, Junha Chun, Youngjoon Jeong, Taesup Kim

First: 2026-06-09T07:26:25+00:00 · Latest: 2026-06-09T07:26:25+00:00

Abs · PDF · Code1 · Code2

Abstract

Deploying Vision-Language-Action (VLA) models in real robotic systems requires robustness not only to semantic and perceptual variations, but also to embodiment-side faults that change how actions are physically realized. Real robots can experience joint-level changes caused by actuator degradation, hardware faults, safety limits, collision damage, or wear-induced friction. These faults are critical because they alter the action-to-motion interface of a policy, disrupting the learned closed-loop relationship between commanded actions, realized motion, and subsequent observations. In this work, we study realistic joint-level physical faults and show that VLA models are vulnerable when predicted actions are executed through a perturbed robot body. Our analysis reveals joint-dependent effects, with heterogeneous degradation in task success across affected joints. We also show that performance drops cannot be attributed solely to physical infeasibility, since feasible faults such as increased joint friction can still substantially reduce success rates and induce closed-loop execution mismatch. Motivated by these findings, we propose Joint-level Physical-fault Aware Residual Calibrator (J-PARC), a lightweight residual calibration framework built on top of a frozen VLA policy. J-PARC infers a latent joint-fault regime from recent joint dynamics and conditions a shared residual calibrator on this regime, enabling adaptive action correction across faulty joints. Experiments show that J-PARC improves robustness under joint-level faults while preserving fault-free environment performance.

Summary / 总结

Act on What You See: Unlocking Safe Social Navigation in Vision-Language-Action Models

Authors: Qingzi Wang, Xiyang Wu, Guangyao Shi, Dianwei Chen, Xianfeng Yang, Dinesh Manocha

First: 2026-06-09T07:18:01+00:00 · Latest: 2026-06-09T07:18:01+00:00

Abs · PDF · Code1 · Code2

Abstract

Safe social navigation requires robots to distinguish people from ordinary obstacles and to react before danger becomes imminent. We show that pretrained Vision-Language-Action (VLA) models already encode pedestrian-object distinctions and future collision signals in their internal representations, but behavior cloning fails to translate these signals into socially appropriate actions. To address this mismatch, we propose SALSA, a two-stage annotation-free post-training framework: (1) social behavioral alignment bridges intermediate-layer social features to the action head and trains on counterfactual human-object scene pairs to break visual saliency shortcuts; (2) temporal safety alignment provides automatically generated future-risk supervision to enable anticipatory collision avoidance. On SCAND and real-world deployment, SALSA reduces near-collisions by 86.4% and improves social counterfactual accuracy from 53% to 93%, demonstrating that safer social navigation can be achieved by teaching VLA policies to act on representations they already possess. These results show that pretrained VLA policies can be adapted for safer social navigation by better aligning their latent representations with action generation.

Summary / 总结

Safe social navigation requires robots to distinguish people from ordinary obstacles and to react before danger becomes imminent.

BadRobot: Jailbreaking Embodied LLM Agents in the Physical World

Authors: Hangtao Zhang, Chenyu Zhu, Xianlong Wang, Ziqi Zhou, Changgan Yin, Minghui Li, Lulu Xue, Yichen Wang, Shengshan Hu, Aishan Liu, Peijin Guo, Leo Yu Zhang

Venue: ICLR 2025

First: 2024-07-16T13:13:16+00:00 · Latest: 2026-06-09T05:19:18+00:00

Comments: Accepted to ICLR 2025. Please cite the conference version. Project page: https://Embodied-LLMs-Safety.github.io

Abs · PDF · Code1 · Code2 · Code3 · Project1

Abstract

Embodied AI represents systems where AI is integrated into physical entities. Large Language Model (LLM), which exhibits powerful language understanding abilities, has been extensively employed in embodied AI by facilitating sophisticated task planning. However, a critical safety issue remains overlooked: could these embodied LLMs perpetrate harmful behaviors? In response, we introduce BadRobot, a novel attack paradigm aiming to make embodied LLMs violate safety and ethical constraints through typical voice-based user-system interactions. Specifically, three vulnerabilities are exploited to achieve this type of attack: (i) manipulation of LLMs within robotic systems, (ii) misalignment between linguistic outputs and physical actions, and (iii) unintentional hazardous behaviors caused by world knowledge's flaws. Furthermore, we construct a benchmark of various malicious physical action queries to evaluate BadRobot's attack performance. Based on this benchmark, extensive experiments against existing prominent embodied LLM frameworks (e.g., Voxposer, Code as Policies, and ProgPrompt) demonstrate the effectiveness of our BadRobot. Our code is available at https://github.com/Rookie143/BadRobot.

Summary / 总结

Embodied AI represents systems where AI is integrated into physical entities.

History

20260610_0810 20260609_0803 20260608_0804 20260607_0802 20260606_0806 20260604_0829 20260603_0828 20260602_0811 20260601_0806 20260531_0758 20260530_0809 20260528_0759 20260526_0803 20260525_0757 20260524_0754 20260523_0805 20260522_0759 20260521_0810 20260519_0805 20260518_0755 20260517_0750 20260516_0753 20260515_0755 20260514_0754 20260513_0757 20260512_0755 20260511_0750 20260510_0743 20260509_0754 20260507_0746 20260506_0748 20260505_0752 20260504_0741 20260503_0739 20260502_0749 20260501_0751 20260430_0752 20260429_0753 20260428_0751 20260427_0736 20260426_0735 20260425_0737 20260424_0742 20260423_0743 20260422_0733 20260421_0740 20260420_0733 20260419_0732 20260418_0736 20260417_0737 20260416_0739 20260415_0740 20260414_0740 20260413_0732 20260412_0730 20260410_0735 20260409_0735 20260408_0735 20260407_0733 20260406_0731 20260405_0728 20260403_0732 20260401_0731 20260331_0732 20260330_0731 20260328_0730 20260327_0730 20260326_0732 20260325_0729 20260324_0729 20260323_0725 20260322_0721 20260321_0726 20260320_0727 20260319_0728 20260318_0733 20260317_0729 20260316_0726 20260315_0725 20260314_0725 20260313_2237 20260312_0723 20260311_0724 20260310_0725 20260309_0721 20260308_0720 20260307_0725 20260306_0749 20260305_0727 20260304_2013 20260304_2010 20260304_0724 20260303_0723 20260302_2107 20260302_0721 20260301_0719 20260228_0721 20260227_1206 20260227_0727 20260226_1121 20260226_1100 20260226_0725 20260225_2020 20260225_0404 20260224_0406 20260223_0338 20260222_0339 20260221_0345 20260220_0348 20260219_0358 20260218_0358 20260217_0343 20260216_0339 20260215_0338 20260213_0401 20260212_0404 20260210_0409 20260208_0339 20260207_0349 20260206_0347 20260205_0346 20260204_0354 20260202_0337 20260201_0333 20260131_0345 20260130_0341 20260129_0344 20260128_0341 20260127_0338 20260126_0330 20260125_0329 20260124_0337 20260123_0337 20260122_0343 20260121_0424 20260119_0329 20260118_0327 20260117_0332 20260116_0339 20260115_0334 20260114_0333 20260113_0334 20260112_0331 20260111_0329 20260110_0333 20260109_0334 20260108_0335 20260107_0330 20260106_0336 20260105_0328 20260104_0328 20260103_0325 20260102_0339 20260101_0329 20251231_0333 20251230_0332 20251229_0329 20251228_0332 20251227_0329 20251226_0330 20251225_0329 20251224_0331 20251223_0332 20251222_0328 20251221_0329 20251220_0330 20251219_0330 20251218_0345 20251217_0332 20251216_0333 20251215_0333 20251214_0327 20251212_0333 20251211_0331 20251210_0332 20251209_0331 20251208_0328 20251207_0327 20251206_0330 20251205_0331 20251204_0331 20251203_0333 20251202_0335 20251201_0328 20251130_0327 20251129_0328 20251128_0327 20251127_0327 20251126_0329 20251125_0327 20251124_0327 20251123_0326 20251122_0328 20251121_0328 20251120_0329 20251119_0328 20251118_0328 20251117_0326 20251116_0325 20251115_0327 20251114_0328 20251113_0330 20251112_0329 20251111_0328 20251110_0325 20251109_0326 20251108_0328 20251107_0328 20251106_0329 20251105_0326 20251104_0327 20251103_0324 20251102_0326 20251101_0324 20251031_0328 20251030_0330 20251029_0329 20251028_0329 20251027_0322 20251026_0327 20251025_0331 20251024_0329 20251023_0329 20251022_0330 20251021_0331 20251020_0328 20251019_0321 20251018_0327 20251017_0320 20251016_0328 20251015_0328 20251014_0323 20251011_0328 20251010_0330 20251009_0321 20251008_0343 20251007_0353 20251006_0325 20251005_0350 20251004_0352 20251003_0352 20251002_0356 20251001_0321 20250925_0335 20250924_0350 20250923_0348 20250922_0346 20250921_0345 20250920_0342 20250919_0346 20250918_0342 20250917_0336 20250916_0333 20250915_0333 20250914_0328 20250913_0322 20250912_0335 20250911_0337 20250910_0338 20250909_0341 20250908_0342 20250907_0333 20250906_0350 20250905_0319 20250904_0323 20250903_0355 20250902_0325 20250901_0355 20250831_0355 20250830_0356 20250829_0355 20250828_0333 20250827_1654 20250827_1602 20250827_1557 20250827_0320 20250826_0320 20250825_1752 20250825_1709 20250825_1652 20250825_1647 20250825_1645 20250825_1631 20250825_1606 20250825_1559 20250825_1558 20250825_1556 20250825_1531 20250825_1525 20250825_1516 20250825_1450 20250825_1444 20250825_1438 20250825_1414 20250825_1413 20250825_1410 20250825_1408 20250825_1405 20250825_1401 20250825_1355 20250825_1347 20250825_1345 20250825_1344 20250825_1343 20250825_1340 20250825_1339 20250825_1333 20250825_1323 20250825_1317 20250825_1243 20250824_0342 20250823_0343 20250823_0142 20250822_2331 20250822_2308 20250822_2258 20250822_2241 20250822_2228 20250822_2206 20250822_2147 20250822_2111 20250822_1259 20250822_1233 20250822_1229 20250822_1223 20250822_1210 20250822_1201 20250822_1111 20250822_1058 20250822_1052 20250822_1045 20250822_0657 20250822_0553