Daily Papers Arch&EAI

Snapshot: 20260310_0725

Unified Learning of Temporal Task Structure and Action Timing for Bimanual Robot Manipulation

Authors: Christian Dreher, Patrick Dormanns, Andre Meixner, Tamim Asfour

First: 2026-03-06T18:25:42+00:00 · Latest: 2026-03-06T18:25:42+00:00

Comments: This work has been submitted to the IEEE for possible publication

Abstract

Temporal task structure is fundamental for bimanual manipulation: a robot must not only know that one action precedes or overlaps another, but also when each action should occur and how long it should take. While symbolic temporal relations enable high-level reasoning about task structure and alternative execution sequences, concrete timing parameters are equally essential for coordinating two hands at the execution level. Existing approaches address these two levels in isolation, leaving a gap between high-level task planning and low-level movement synchronization. This work presents an approach for learning both symbolic and subsymbolic temporal task constraints from human demonstrations and deriving executable, temporally parametrized plans for bimanual manipulation. Our contributions are (i) a 3-dimensional representation of timings between two actions with methods based on multivariate Gaussian Mixture Models to represent temporal relationships between actions on a subsymbolic level, (ii) a method based on the Davis-Putnam-Logemann-Loveland (DPLL) algorithm that finds and ranks all contradiction-free assignments of Allen relations to action pairs, representing different modes of a task, and (iii) an optimization-based planning system that combines the identified symbolic and subsymbolic temporal task constraints to derive temporally parametrized plans for robot execution. We evaluate our approach on several datasets, demonstrating that our method generates temporally parametrized plans closer to human demonstrations than the most characteristic demonstration baseline.

Summary

Temporal task structure is fundamental for bimanual manipulation: a robot must not only know that one action precedes or overlaps another, but also when each action should occur and how long it should take.
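
The symbolic level described in the abstract rests on Allen's interval relations between pairs of actions. As a minimal sketch (not the authors' implementation), the function below classifies the relation between two time intervals. The function name and the string labels are hypothetical, and each interval is assumed to satisfy start < end.

```python
def allen_relation(a_start, a_end, b_start, b_end):
    """Return the Allen relation of interval A with respect to interval B.

    Assumes a_start < a_end and b_start < b_end (illustrative helper only).
    """
    if a_end < b_start:
        return "before"
    if a_end == b_start:
        return "meets"
    if b_end < a_start:
        return "after"
    if b_end == a_start:
        return "met-by"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    if a_start < b_start:
        # A begins first and the intervals share time.
        return "overlaps" if a_end < b_end else "contains"
    # A begins after B has started and before B ends.
    return "during" if a_end < b_end else "overlapped-by"


if __name__ == "__main__":
    # Left hand grasps during [0, 3], right hand lifts during [2, 5].
    print(allen_relation(0, 3, 2, 5))
```

A relation label like this fixes only the ordering of the endpoints; the concrete offsets and durations are what the abstract's subsymbolic timing model is meant to capture.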

History-Conditioned Spatio-Temporal Visual Token Pruning for Efficient Vision-Language Navigation

Authors: Qitong Wang, Yijun Liang, Ming Li, Tianyi Zhou, Christopher Rasmussen

First: 2026-03-06T17:03:16+00:00 · Latest: 2026-03-06T17:03:16+00:00

Abstract

Vision-Language Navigation (VLN) enables robots to follow natural-language instructions in visually grounded environments, serving as a key capability for embodied robotic systems. Recent Vision-Language-Action (VLA) models have demonstrated strong navigation performance, but their high computational cost introduces latency that limits real-time deployment. We propose a training-free spatio-temporal vision token pruning framework tailored to VLA-based VLN. We apply spatial token selection to the current view, alongside spatio-temporal compression for historical memories, enabling efficient long-horizon inference while reducing redundant computation. Leveraging attention-based token importance and query-guided spatio-temporal filtering, the proposed approach preserves navigation-relevant information without retraining or modifying pretrained models, allowing plug-and-play integration into existing VLA systems. Experiments on standard VLN benchmarks confirm that our method significantly outperforms existing pruning strategies. It preserves superior navigation accuracy under extreme pruning scenarios while maintaining highly competitive inference efficiency. Real-world deployment on a Unitree Go2 quadruped robot further validates reliable and low-latency instruction-following navigation under practical robotic constraints. We hope this work helps bridge the gap between large-scale multimodal modeling and efficient, real-time embodied deployment in robotic navigation systems.

Summary

Vision-Language Navigation (VLN) enables robots to follow natural-language instructions in visually grounded environments, serving as a key capability for embodied robotic systems.
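
The abstract mentions selecting tokens by attention-based importance. A minimal sketch of that general idea is top-k selection over per-token scores, shown below. The function name, the `keep_ratio` parameter, and the way scores are supplied are assumptions for illustration; the paper's actual scoring and its spatio-temporal compression of history are not reproduced here.

```python
def prune_tokens(tokens, importance, keep_ratio):
    """Keep the highest-scoring tokens, preserving their original order.

    tokens:      list of visual tokens (any objects)
    importance:  one score per token, e.g. attention received from the query
    keep_ratio:  fraction of tokens to keep, in (0, 1]
    """
    if len(tokens) != len(importance):
        raise ValueError("tokens and importance must have the same length")
    keep = max(1, int(len(tokens) * keep_ratio))
    # Rank token indices from most to least important.
    ranked = sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)
    # Restore spatial order so downstream layers see a consistent sequence.
    kept_idx = sorted(ranked[:keep])
    return [tokens[i] for i in kept_idx], kept_idx


if __name__ == "__main__":
    kept, idx = prune_tokens(["a", "b", "c", "d"], [0.1, 0.7, 0.05, 0.15], 0.5)
    print(kept, idx)
```

Because the selection only reads scores and drops tokens, it needs no gradient updates, which is consistent with the training-free, plug-and-play framing in the abstract.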

SuperSuit: An Isomorphic Bimodal Interface for Scalable Mobile Manipulation

Authors: Tongqing Chen, Hang Wu, Jiasen Wang, Xiaotao Li, Zhu Jin, Lu Fang

First: 2026-03-06T13:40:30+00:00 · Latest: 2026-03-06T13:40:30+00:00

Abstract

High-quality, long-horizon demonstrations are essential for embodied AI, yet acquiring such data for tightly coupled wheeled mobile manipulators remains a fundamental bottleneck. Unlike fixed-base systems, mobile manipulators require continuous coordination between $SE(2)$ locomotion and precise manipulation, exposing limitations in existing teleoperation and wearable interfaces. We present SuperSuit, a bimodal data acquisition framework that supports both robot-in-the-loop teleoperation and active demonstration under a shared kinematic interface. Both modalities produce structurally identical joint-space trajectories, enabling direct data mixing without modifying downstream policies. For locomotion, SuperSuit maps natural human stepping to continuous planar base velocities, eliminating discrete command switches. For manipulation, it employs a strictly isomorphic wearable arm in both modes, while policy training is formulated in a shift-invariant delta-joint representation to mitigate calibration offsets and structural compliance without inverse kinematics. Real-world experiments on long-horizon mobile manipulation tasks show 2.6× higher demonstration throughput in active mode compared to a teleoperation baseline, comparable policy performance when substituting teleoperation data with active demonstrations at fixed dataset size, and monotonic performance improvement as active data volume increases. These results indicate that consistent kinematic representations across collection modalities enable scalable data acquisition for long-horizon mobile manipulation.

Summary

High-quality, long-horizon demonstrations are essential for embodied AI, yet acquiring such data for tightly coupled wheeled mobile manipulators remains a fundamental bottleneck.
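
The abstract describes training on a shift-invariant delta-joint representation to reduce sensitivity to calibration offsets. One plausible reading, assumed here and not confirmed by the abstract, is that the policy sees per-step differences of joint positions rather than absolute values. The sketch below shows why a constant offset cancels under that assumption; the function name is hypothetical.

```python
def to_delta_joints(trajectory):
    """Convert absolute joint positions into per-step joint deltas.

    trajectory: list of joint-position vectors, one per time step.
    Returns a list with one fewer entry than the input.
    """
    return [
        [curr - prev for prev, curr in zip(q_prev, q_curr)]
        for q_prev, q_curr in zip(trajectory, trajectory[1:])
    ]


if __name__ == "__main__":
    clean = [[0.0, 1.0], [0.5, 1.5], [1.5, 1.0]]
    # The same motion recorded with a constant calibration offset per joint.
    offset = [0.25, -0.5]
    shifted = [[q + o for q, o in zip(step, offset)] for step in clean]
    print(to_delta_joints(clean))
    print(to_delta_joints(shifted))
```

Both calls print the same deltas, since a constant per-joint offset is removed by differencing.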

Few-Shot Neural Differentiable Simulator: Real-to-Sim Rigid-Contact Modeling

Authors: Zhenhao Huang, Siyuan Luo, Bingyang Zhou, Ziqiu Zeng, Jason Pho, Fan Shi

First: 2026-03-06T12:32:56+00:00 · Latest: 2026-03-06T12:32:56+00:00

Abstract

Accurate physics simulation is essential for robotic learning and control, yet analytical simulators often fail to capture complex contact dynamics, while learning-based simulators typically require large amounts of costly real-world data. To bridge this gap, we propose a few-shot real-to-sim approach that combines the physical consistency of analytical formulations with the representational capacity of graph neural network (GNN)-based models. Using only a small amount of real-world data, our method calibrates analytical simulators to generate large-scale synthetic datasets that capture diverse contact interactions. On this foundation, we introduce a mesh-based GNN that implicitly models rigid-body forward dynamics and derive surrogate gradients for collision detection, achieving full differentiability. Experimental results demonstrate that our approach enables learning-based simulators to outperform differentiable baselines in replicating real-world trajectories. In addition, the differentiable design supports gradient-based optimization, which we validate through simulation-based policy learning in multi-object interaction scenarios. Extensive experiments show that our framework not only improves simulation fidelity with minimal supervision but also increases the efficiency of policy learning. Taken together, these findings suggest that differentiable simulation with few-shot real-world grounding provides a powerful direction for advancing future robotic manipulation and control.

Summary

Accurate physics simulation is essential for robotic learning and control, yet analytical simulators often fail to capture complex contact dynamics, while learning-based simulators typically require large amounts of costly real-world data.

Safe Autonomous Lane Changing: Planning with Dynamic Risk Fields and Time-Varying Convex Space Generation

Authors: Yijun Lu, Zhihao Lin, Zhen Tian

First: 2025-11-28T01:27:24+00:00 · Latest: 2026-03-06T12:24:16+00:00

Abstract

This paper presents a novel trajectory planning pipeline for complex driving scenarios such as autonomous lane changing, integrating risk-aware planning with guaranteed collision avoidance in a unified optimization framework. We first construct a dynamic risk field (DRF) that captures both the static and dynamic collision risks from surrounding vehicles. Then, we develop a rigorous strategy for generating time-varying convex feasible spaces that ensure kinematic feasibility and safety requirements. The trajectory planning problem is formulated as a finite-horizon optimal control problem and solved using a constrained iterative Linear Quadratic Regulator (iLQR) algorithm that jointly optimizes trajectory smoothness, control effort, and risk exposure while maintaining strict feasibility. Extensive simulations demonstrate that our method outperforms traditional approaches in terms of safety and efficiency, achieving collision-free trajectories with shorter lane-changing distances (28.59 m) and times (2.84 s) while maintaining smooth and comfortable acceleration patterns. In dense roundabout environments, the planner further demonstrates robust adaptability, producing larger safety margins, lower jerk, and superior curvature smoothness compared with APF-, MPC-, and RRT-based baselines. These results confirm that the integrated DRF with convex feasible space and constrained iLQR solver provides a balanced solution for safe, efficient, and comfortable trajectory generation in dynamic and interactive traffic scenarios.

Summary

This paper presents a novel trajectory planning pipeline for complex driving scenarios like autonomous lane changing, by integrating risk-aware planning with guaranteed collision avoidance into a unified optimization framework.

SWE-MiniSandbox: Container-Free Reinforcement Learning for Building Software Engineering Agents

Authors: Danlong Yuan, Wei Wu, Zhengren Wang, Xueliang Zhao, Huishuai Zhang, Dongyan Zhao

First: 2026-02-11T02:33:04+00:00 · Latest: 2026-03-06T11:45:53+00:00

Abstract

Reinforcement learning (RL) has become a key paradigm for training software engineering (SWE) agents, but existing pipelines typically rely on per-task containers for isolation. At scale, pre-built container images incur substantial storage overhead, slow down environment setup, and require container-management privileges. We propose SWE-MiniSandbox, a lightweight, container-free method that enables scalable RL training of SWE agents without sacrificing isolation. Instead of relying on per-instance containers, SWE-MiniSandbox executes each task in an isolated workspace backed by kernel-level mechanisms, substantially reducing system overhead. It leverages lightweight environment pre-caching techniques to eliminate the need for bulky container images. As a result, our approach lowers disk usage to approximately 5% of that required by container-based pipelines and reduces environment preparation time to about 25% of the container baseline. Empirical results demonstrate that SWE-MiniSandbox achieves evaluation performance comparable to standard container-based pipelines. By removing the dependency on heavy container infrastructure, SWE-MiniSandbox offers a practical and accessible foundation for scaling RL-based SWE agents, particularly in resource-constrained research environments.

Summary

Reinforcement learning (RL) has become a key paradigm for training software engineering (SWE) agents, but existing pipelines typically rely on per-task containers for isolation.

Devil is in Narrow Policy: Unleashing Exploration in Driving VLA Models

Authors: Canyu Chen, Yuguang Yang, Zhewen Tan, Yizhi Wang, Ruiyi Zhan, Haiyan Liu, Xuanyao Mao, Jason Bao, Xinyue Tang, Linlin Yang, Bingchuan Sun, Yan Wang, Baochang Zhang

First: 2026-03-06T09:01:34+00:00 · Latest: 2026-03-06T09:01:34+00:00

Comments: Accepted by CVPR2026 findings

Abstract

We identify a fundamental Narrow Policy limitation undermining the performance of autonomous VLA models, where driving Imitation Learning (IL) tends to collapse exploration and limit the potential of subsequent Reinforcement Learning (RL) stages, which often saturate prematurely due to insufficient feedback diversity. We therefore propose Curious-VLA, a framework that alleviates the exploit-explore dilemma through a two-stage design. During IL, we introduce a Feasible Trajectory Expansion (FTE) strategy to generate multiple physically valid trajectories and a step-wise normalized trajectory representation to adapt to this diverse data. In the RL stage, we present Adaptive Diversity-Aware Sampling (ADAS), which prioritizes high-diversity samples, and introduce Spanning Driving Reward (SDR) with a focal-style weighting that amplifies the reward's value span to improve sensitivity to driving quality. On the Navsim benchmark, Curious-VLA achieves SoTA results (PDMS 90.3, EPDMS 85.4) and a Best-of-N PDMS of 94.8, demonstrating its effectiveness in unlocking the exploratory potential of VLA models. Code: https://github.com/Mashiroln/curious_vla.git.

Summary

We identify a fundamental Narrow Policy limitation undermining the performance of autonomous VLA models, where driving Imitation Learning (IL) tends to collapse exploration and limit the potential of subsequent Reinforcement Learning (RL) stages, which often saturate prematurely due to insufficient feedback diversity.

Beyond Imitation: Reinforcement Learning-Based Sim-Real Co-Training for VLA Models

Authors: Liangzhi Shi, Shuaihang Chen, Feng Gao, Yinuo Chen, Kang Chen, Tonghe Zhang, Hongzhi Zang, Weinan Zhang, Chao Yu, Yu Wang

First: 2026-02-13T05:15:50+00:00 · Latest: 2026-03-06T08:46:11+00:00

Abstract

Simulation offers a scalable and low-cost way to enrich vision-language-action (VLA) training, reducing reliance on expensive real-robot demonstrations. However, most sim-real co-training methods rely on supervised fine-tuning (SFT), which treats simulation as a static source of demonstrations and does not exploit large-scale closed-loop interaction. Consequently, real-world gains and generalization are often limited. In this paper, we propose an RL-based sim-real Co-training (RL-Co) framework that leverages interactive simulation while preserving real-world capabilities. Our method follows a generic two-stage design: we first warm-start the policy with SFT on a mixture of real and simulated demonstrations, then fine-tune it with reinforcement learning in simulation while adding an auxiliary supervised loss on real-world data to anchor the policy and mitigate catastrophic forgetting. We evaluate our framework on four real-world tabletop manipulation tasks using two representative VLA architectures, OpenVLA and $π_{0.5}$, and observe consistent improvements over real-only fine-tuning and SFT-based co-training, including +24% real-world success on OpenVLA and +20% on $π_{0.5}$. Beyond higher success rates, RL co-training yields stronger generalization to unseen task variations and substantially improved real-world data efficiency, providing a practical and scalable pathway for leveraging simulation to enhance real-robot deployment.

Summary

Simulation offers a scalable and low-cost way to enrich vision-language-action (VLA) training, reducing reliance on expensive real-robot demonstrations.

Restoring Linguistic Grounding in VLA Models via Train-Free Attention Recalibration

Authors: Ninghao Zhang, Bin Zhu, Shijie Zhou, Jingjing Chen

First: 2026-03-06T08:01:36+00:00 · Latest: 2026-03-06T08:01:36+00:00

Abstract

Vision-Language-Action (VLA) models enable robots to perform manipulation tasks directly from natural language instructions and are increasingly viewed as a foundation for generalist robotic policies. However, their reliability under Out-of-Distribution (OOD) instructions remains underexplored. In this paper, we reveal a critical failure mode in which VLA policies continue executing visually plausible actions even when the language instruction contradicts the scene. We refer to this phenomenon as linguistic blindness, where VLA policies prioritize visual priors over instruction semantics during action generation. To systematically analyze this issue, we introduce ICBench, a diagnostic benchmark constructed from the LIBERO dataset that probes language-action coupling by injecting controlled OOD instruction contradictions while keeping the visual environment unchanged. Evaluations on three representative VLA architectures, including Pi0, Pi0.5 and OpenVLA OFT, show that these models frequently succeed at tasks despite logically impossible instructions, revealing a strong visual bias in action generation. To mitigate this issue, we propose Instruction-Guided Attention Recalibration (IGAR), a train-free inference-time mechanism that rebalances attention distributions to restore the influence of language instructions. IGAR operates without retraining or architectural modification and can be directly applied to existing VLA models. Experiments across 30 LIBERO tasks demonstrate that IGAR substantially reduces erroneous execution under OOD contradictory instructions while preserving baseline task performance. We additionally validate the approach on a real Franka robotic arm, where IGAR effectively prevents manipulation triggered by inconsistent instructions.

Summary

Vision-Language-Action (VLA) models enable robots to perform manipulation tasks directly from natural language instructions and are increasingly viewed as a foundation for generalist robotic policies.

HarvestFlex: Strawberry Harvesting via Vision-Language-Action Policy Adaptation in the Wild

Authors: Ziyang Zhao, Shuheng Wang, Zhonghua Miao, Ya Xiong

First: 2026-03-06T07:26:45+00:00 · Latest: 2026-03-06T07:26:45+00:00

Abstract

This work presents the first study on transferring vision-language-action (VLA) policies to real greenhouse tabletop strawberry harvesting, a long-horizon, unstructured task challenged by occlusion and specular reflections. We built an end-to-end closed-loop system on the HarvestFlex platform using three-view RGB sensing (two fixed scene views plus a wrist-mounted view) and intentionally avoided depth clouds and explicit geometric calibration. We collected 3.71 h of VR teleoperated demonstrations (227 episodes) and fine-tuned pi_0, pi_0.5, and WALL-OSS with full fine-tuning and LoRA. Under a unified 50-trial real-greenhouse protocol with metrics spanning completion, pi_0.5 with full fine-tuning achieved a success rate of 74.0% at 32.6 s/pick and a damage rate of 4.1%. Asynchronous inference-control decoupling further improved performance over synchronous deployment. Results showed non-trivial closed-loop picking with fewer than four hours of real data, while remaining limited by close-range observability loss and contact-dynamics mismatch. A demonstration video is available at: https://youtu.be/bN8ZowZKPMI.

Summary

This work presents the first study on transferring vision-language-action (VLA) policies to real greenhouse tabletop strawberry harvesting, a long-horizon, unstructured task challenged by occlusion and specular reflections.

Iterative Convex Optimization with Control Barrier Functions for Obstacle Avoidance among Polytopes

Authors: Shuo Liu, Zhe Huang, Calin A. Belta

First: 2026-03-06T05:10:44+00:00 · Latest: 2026-03-06T05:10:44+00:00

Comments: 9 pages, 4 figures

Abstract

Obstacle avoidance of polytopic obstacles by polytopic robots is a challenging problem in optimization-based control and trajectory planning. Many existing methods rely on smooth geometric approximations, such as hyperspheres or ellipsoids, which allow differentiable distance expressions but distort the true geometry and restrict the feasible set. Other approaches integrate exact polytope distances into nonlinear model predictive control (MPC), resulting in nonconvex programs that limit real-time performance. In this paper, we construct linear discrete-time control barrier function (DCBF) constraints by deriving supporting hyperplanes from exact closest-point computations between convex polytopes. We then propose a novel iterative convex MPC-DCBF framework, where local linearization of system dynamics and robot geometry ensures convexity of the finite-horizon optimization at each iteration. The resulting formulation reduces computational complexity and enables fast online implementation for safety-critical control and trajectory planning of general nonlinear dynamics. The framework extends to multi-robot and three-dimensional environments. Numerical experiments demonstrate collision-free navigation in cluttered maze scenarios with millisecond-level solve times.

Summary

Obstacle avoidance of polytopic obstacles by polytopic robots is a challenging problem in optimization-based control and trajectory planning.
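
The abstract's idea of building a linear DCBF constraint from a supporting hyperplane can be written out in a generic form. The notation below is an assumption for illustration and is not taken from the paper: $p^{R}_k$ and $p^{O}_k$ denote the closest points on the robot and the obstacle at step $k$, $p^{R}(x)$ is the robot-side point as a function of the state, and $\gamma$ is the usual DCBF decay rate.

```latex
\begin{aligned}
n_k &= \frac{p^{R}_k - p^{O}_k}{\lVert p^{R}_k - p^{O}_k \rVert}, \\
h_k(x) &= n_k^{\top}\bigl(p^{R}(x) - p^{O}_k\bigr), \\
h_k(x_{k+1}) &\ge (1 - \gamma)\, h_k(x_k), \qquad 0 < \gamma \le 1 .
\end{aligned}
```

With $n_k$ and $p^{O}_k$ held fixed from the closest-point computation, and $p^{R}(x)$ linearized around the previous iterate, the last inequality is affine in the decision variables, which is what would keep each iteration of the MPC problem convex.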

Bi-AQUA: Bilateral Control-Based Imitation Learning for Underwater Robot Arms via Lighting-Aware Action Chunking with Transformers

Authors: Takeru Tsunoori, Masato Kobayashi, Yuki Uranishi

First: 2025-11-20T05:11:26+00:00 · Latest: 2026-03-06T04:18:19+00:00

Abstract

Underwater robotic manipulation remains challenging because lighting variation, color attenuation, scattering, and reduced visibility can severely degrade visuomotor policies. We present Bi-AQUA, the first underwater bilateral control-based imitation learning framework for robot arms that explicitly models lighting within the policy. Bi-AQUA integrates transformer-based bilateral action chunking with a hierarchical lighting-aware design composed of a label-free Lighting Encoder, FiLM-based visual feature modulation, and a lighting token for action conditioning. This design enables adaptation to static and dynamically changing underwater illumination while preserving the force-sensitive advantages of bilateral control, which are particularly important in long-horizon and contact-rich manipulation. Real-world experiments on underwater pick-and-place, drawer closing, and peg extraction tasks show that Bi-AQUA outperforms a bilateral baseline without lighting modeling and achieves robust performance under seen, unseen, and changing lighting conditions. These results highlight the importance of combining explicit lighting modeling with force-aware bilateral imitation learning for reliable underwater manipulation. For additional material, please check: https://mertcookimg.github.io/bi-aqua

Summary

Underwater robotic manipulation remains challenging because lighting variation, color attenuation, scattering, and reduced visibility can severely degrade visuomotor policies.
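
FiLM (feature-wise linear modulation), which the abstract names as part of the lighting-aware design, scales and shifts each feature channel with conditioning-dependent parameters. The sketch below shows only that per-channel affine step. In Bi-AQUA the scale and shift would presumably be produced from the lighting encoder's output; here they are passed in directly, and all names are hypothetical.

```python
def film(features, gamma, beta):
    """Apply FiLM: scale each channel by gamma and shift it by beta.

    features: list of channels, each a list of activations
    gamma:    one scale per channel
    beta:     one shift per channel
    """
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("need one gamma and one beta per channel")
    return [
        [g * value + b for value in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


if __name__ == "__main__":
    visual = [[1.0, 2.0], [3.0, 4.0]]  # two channels, two activations each
    # Example conditioning: brighten channel 0, damp and offset channel 1.
    print(film(visual, gamma=[2.0, 0.5], beta=[0.0, 1.0]))
```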

AnyCamVLA: Zero-Shot Camera Adaptation for Viewpoint Robust Vision-Language-Action Models

Authors: Hyeongjun Heo, Seungyeon Woo, Sang Min Kim, Junho Kim, Junho Lee, Yonghyeon Lee, Young Min Kim

First: 2026-03-06T03:44:23+00:00 · Latest: 2026-03-06T03:44:23+00:00

Comments: Under review, Project Page: https://heo0224.github.io/AnyCamVLA/

Abstract

Despite remarkable progress in Vision-Language-Action models (VLAs) for robot manipulation, these large pre-trained models require fine-tuning to be deployed in specific environments. These fine-tuned models are highly sensitive to camera viewpoint changes that frequently occur in unstructured environments. In this paper, we propose a zero-shot camera adaptation framework that requires no additional demonstration data, policy fine-tuning, or architectural modification. Our key idea is to virtually adjust test-time camera observations to match the training camera configuration in real time. To do so, we use a recent feed-forward novel view synthesis model that outputs high-quality target-view images, handling both extrinsic and intrinsic parameters. This plug-and-play approach preserves the pre-trained capabilities of VLAs and applies to any RGB-based policy. Through extensive experiments on the LIBERO benchmark, our method consistently outperforms baselines that use data augmentation for policy fine-tuning or additional 3D-aware features for visual input. We further validate that our approach consistently enhances viewpoint robustness in real-world robotic manipulation scenarios, including settings with varying camera extrinsics, intrinsics, and freely moving handheld cameras.

Summary

Despite remarkable progress in Vision-Language-Action models (VLAs) for robot manipulation, these large pre-trained models require fine-tuning to be deployed in specific environments.

DexEMG: Towards Dexterous Teleoperation System via EMG2Pose Generalization

Authors: Qianyou Zhao, Wenqiao Li, Chiyu Wang, Kaifeng Zhang

First: 2026-03-06T03:36:33+00:00 · Latest: 2026-03-06T03:36:33+00:00

Abstract

High-fidelity teleoperation of dexterous robotic hands is essential for bringing robots into unstructured domestic environments. However, existing teleoperation systems often face a trade-off between performance and portability: vision-based capture systems are constrained by costs and line-of-sight requirements, while mechanical exoskeletons are bulky and physically restrictive. In this paper, we present DexEMG, a lightweight and cost-effective teleoperation system leveraging surface electromyography (sEMG) to bridge the gap between human intent and robotic execution. We first collect a synchronized dataset of sEMG signals and hand poses via a MoCap glove to train EMG2Pose, a neural network capable of continuously predicting hand kinematics directly from muscle activity. To ensure seamless control, we develop a robust hand retargeting algorithm that maps the predicted poses onto a multi-fingered dexterous hand in real-time. Experimental results demonstrate that DexEMG achieves high precision in diverse teleoperation tasks. Notably, our system exhibits strong generalization capabilities across novel objects and complex environments without the need for intensive individual-specific recalibration. This work offers a scalable and intuitive interface for both general-purpose robotic manipulation and assistive technologies.

Summary

High-fidelity teleoperation of dexterous robotic hands is essential for bringing robots into unstructured domestic environments.

EchoVLA: Synergistic Declarative Memory for VLA-Driven Mobile Manipulation

Authors: Min Lin, Xiwen Liang, Bingqian Lin, Liu Jingzhi, Zijian Jiao, Kehan Li, Yu Sun, Weijia Liufu, Yuhan Ma, Yuecheng Liu, Shen Zhao, Yuzheng Zhuang, Xiaodan Liang

First: 2025-11-22T16:30:55+00:00 · Latest: 2026-03-06T03:26:56+00:00

Abstract

Recent progress in Vision-Language-Action (VLA) models has enabled embodied agents to interpret multimodal instructions and perform complex tasks. However, existing VLAs are mostly confined to short-horizon, table-top manipulation, lacking the memory and reasoning capability required for mobile manipulation, where agents must coordinate navigation and manipulation under changing spatial contexts. In this work, we present EchoVLA, a memory-aware VLA model for mobile manipulation. EchoVLA incorporates a synergistic declarative memory inspired by the human brain, consisting of a scene memory that maintains a collection of spatial-semantic maps and an episodic memory that stores task-level experiences with multimodal contextual features. The two memories are individually stored, updated, and retrieved based on current observations, task history, and instructions, and their retrieved representations are fused via coarse- and fine-grained attention to guide base-arm diffusion policies. To support large-scale training, we further introduce MoMani, an automated benchmark that generates expert-level trajectories through multimodal large language model (MLLM)-guided planning and feedback-driven refinement, supplemented with real-robot demonstrations. Comprehensive simulated and real-world results demonstrate that EchoVLA substantially improves overall performance, e.g., it achieves the highest success rates of 0.52 on manipulation/navigation tasks and 0.31 on mobile manipulation tasks in simulation, exceeding the strong baseline $π_{0.5}$ by +0.20 and +0.11, respectively.

Summary

Recent progress in Vision-Language-Action (VLA) models has enabled embodied agents to interpret multimodal instructions and perform complex tasks.

Phys2Real: Fusing VLM Priors with Interactive Online Adaptation for Uncertainty-Aware Sim-to-Real Manipulation

Authors: Maggie Wang, Stephen Tian, Aiden Swann, Ola Shorinwa, Jiajun Wu, Mac Schwager

Venue: ICRA

First: 2025-10-13T17:51:23+00:00 · Latest: 2026-03-06T01:58:18+00:00

Comments: Accepted to IEEE International Conference on Robotics and Automation (ICRA) 2026

Abstract

Learning robotic manipulation policies directly in the real world can be expensive and time-consuming. While reinforcement learning (RL) policies trained in simulation present a scalable alternative, effective sim-to-real transfer remains challenging, particularly for tasks that require precise dynamics. To address this, we propose Phys2Real, a real-to-sim-to-real RL pipeline that combines vision-language model (VLM)-inferred physical parameter estimates with interactive adaptation through uncertainty-aware fusion. Our approach consists of three core components: (1) high-fidelity geometric reconstruction with 3D Gaussian splatting, (2) VLM-inferred prior distributions over physical parameters, and (3) online physical parameter estimation from interaction data. Phys2Real conditions policies on interpretable physical parameters, refining VLM predictions with online estimates via ensemble-based uncertainty quantification. On planar pushing tasks of a T-block with varying center of mass (CoM) and a hammer with an off-center mass distribution, Phys2Real achieves substantial improvements over a domain randomization baseline: 100% vs 79% success rate for the bottom-weighted T-block, 57% vs 23% in the challenging top-weighted T-block, and 15% faster average task completion for hammer pushing. Ablation studies indicate that the combination of VLM and interaction information is essential for success. Project website: https://phys2real.github.io/.

Summary

Learning robotic manipulation policies directly in the real world can be expensive and time-consuming.
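
The abstract describes refining a VLM-inferred prior over a physical parameter with an online estimate, weighted by uncertainty. The paper quantifies uncertainty with ensembles; as a simpler stand-in, the sketch below uses the textbook precision-weighted fusion of two independent Gaussian estimates. This is a swapped-in technique for illustration, and the function and argument names are hypothetical.

```python
def fuse_gaussian(prior_mean, prior_var, obs_mean, obs_var):
    """Precision-weighted fusion of two independent Gaussian estimates.

    Returns the fused mean and variance. The estimate with the smaller
    variance (higher confidence) pulls the result toward itself.
    """
    if prior_var <= 0 or obs_var <= 0:
        raise ValueError("variances must be positive")
    w_prior = 1.0 / prior_var
    w_obs = 1.0 / obs_var
    fused_var = 1.0 / (w_prior + w_obs)
    fused_mean = fused_var * (w_prior * prior_mean + w_obs * obs_mean)
    return fused_mean, fused_var


if __name__ == "__main__":
    # Hypothetical numbers: a vague prior on a centre-of-mass offset,
    # followed by a more confident estimate from interaction data.
    print(fuse_gaussian(prior_mean=0.0, prior_var=4.0, obs_mean=1.0, obs_var=1.0))
```

In this example the fused mean lands closer to the interaction-based estimate because its variance is smaller, and the fused variance is lower than either input.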

ExpReS-VLA: Specializing Vision-Language-Action Models Through Experience Replay and Retrieval

Authors: Shahram Najam Syed, Yatharth Ahuja, Arthur Jakobsson, Jeff Ichnowski

Venue: ICRA

First: 2025-11-09T03:24:28+00:00 · Latest: 2026-03-06T00:12:03+00:00

Comments: 8 pages, 4 figures, 3 tables, accepted to International Conference on Robotics and Automation (ICRA) 2026

Abstract

Vision-Language-Action (VLA) models like OpenVLA demonstrate impressive zero-shot generalization across robotic manipulation tasks but struggle to adapt to specific deployment environments where consistent high performance on a limited set of tasks is more valuable than broad generalization. We present EXPerience replayed, REtrieval augmented, Specialized VLA (ExpReS-VLA), a method that enables rapid on-device adaptation of pre-trained VLAs to target domains while preventing catastrophic forgetting through compressed experience replay and retrieval-augmented generation. Our approach maintains a memory-efficient buffer by storing extracted embeddings from OpenVLA's frozen vision backbone, reducing storage requirements by 97% compared to raw image-action pairs. During deployment, ExpReS-VLA retrieves the $k$ most similar past experiences using cosine similarity to augment training batches, while a prioritized experience replay buffer preserves recently successful trajectories. To leverage failed attempts, we introduce Thresholded Hybrid Contrastive Loss (THCL), enabling the model to learn from both successful and unsuccessful demonstrations. Experiments on the LIBERO benchmark show improvements from 82.6% to 93.1% on spatial reasoning and 61% to 72.3% on long-horizon tasks over base OpenVLA, with gains across architectures including $π_0$ (+3.2 points) and OpenVLA-OFT (+1.7 points). Physical robot experiments across five tasks demonstrate 98% success on both in-distribution and out-of-distribution conditions, improving from 84.7% and 32% respectively for naive fine-tuning. Adaptation completes in 31 seconds using 12 demonstrations on a single RTX 5090.

Summary / 总结

Vision-Language-Action (VLA) models like OpenVLA demonstrate impressive zero-shot generalization across robotic manipulation tasks but struggle to adapt to specific deployment environments where consistent high performance on a limited set of tasks is more valuable than broad generalization.

Multi-Robot Trajectory Planning via Constrained Bayesian Optimization and Local Cost Map Learning with STL-Based Conflict Resolution

Authors: Sourav Raxit, Abdullah Al Redwan Newaz, Jose Fuentes, Paulo Padrao, Ana Cavalcanti, Leonardo Bobadilla

Venue: ICRA 2026

First: 2026-03-06T00:03:18+00:00 · Latest: 2026-03-06T00:03:18+00:00

Comments: Accepted to ICRA 2026

Abs · PDF · Code1 · Code2 · Project1

Abstract

We address multi-robot motion planning under Signal Temporal Logic (STL) specifications with kinodynamic constraints. Exact approaches face scalability bottlenecks and limited adaptability, while conventional sampling-based methods require excessive samples to construct optimal trajectories. We propose a two-stage framework integrating sampling-based online learning with formal STL reasoning. At the single-robot level, our constrained Bayesian Optimization-based Tree search (cBOT) planner uses a Gaussian process as a surrogate model to learn local cost maps and feasibility constraints, generating shorter collision-free trajectories with fewer samples. At the multi-robot level, our STL-enhanced Kinodynamic Conflict-Based Search (STL-KCBS) algorithm incorporates STL monitoring into conflict detection and resolution, ensuring specification satisfaction while maintaining scalability and probabilistic completeness. Benchmarking demonstrates improved trajectory efficiency and safety over existing methods. Real-world experiments with autonomous surface vehicles validate robustness and practical applicability in uncertain environments. The STLcBOT Planner will be released as an open-source package, and videos of real-world and simulated experiments are available at https://stlbot.github.io/.

Summary / 总结

We address multi-robot motion planning under Signal Temporal Logic (STL) specifications with kinodynamic constraints.

EmboAlign: Aligning Video Generation with Compositional Constraints for Zero-Shot Manipulation

Authors: Gehao Zhang, Zhenyang Ni, Payal Mohapatra, Han Liu, Ruohan Zhang, Qi Zhu

First: 2026-03-05T23:31:56+00:00 · Latest: 2026-03-05T23:31:56+00:00

Abs · PDF · Code1 · Code2

Abstract

Video generative models (VGMs) pretrained on large-scale internet data can produce temporally coherent rollout videos that capture rich object dynamics, offering a compelling foundation for zero-shot robotic manipulation. However, VGMs often produce physically implausible rollouts, and converting their pixel-space motion into robot actions through geometric retargeting further introduces cumulative errors from imperfect depth estimation and keypoint tracking. To address these challenges, we present EmboAlign, a data-free framework that aligns VGM outputs with compositional constraints generated by vision-language models (VLMs) at inference time. The key insight is that VLMs offer a capability complementary to VGMs: structured spatial reasoning that can identify the physical constraints critical to the success and safety of manipulation execution. Given a language instruction, EmboAlign uses a VLM to automatically extract a set of compositional constraints capturing task-specific requirements, which are then applied at two stages: (1) constraint-guided rollout selection, which scores and filters a batch of VGM rollouts to retain the most physically plausible candidate, and (2) constraint-based trajectory optimization, which uses the selected rollout as initialization and refines the robot trajectory under the same constraint set to correct retargeting errors. We evaluate EmboAlign on six real-robot manipulation tasks requiring precise, constraint-sensitive execution, improving the overall success rate by 43.3 percentage points over the strongest baseline without any task-specific training data.

Summary / 总结

Video generative models (VGMs) pretrained on large-scale internet data can produce temporally coherent rollout videos that capture rich object dynamics, offering a compelling foundation for zero-shot robotic manipulation.

Safe-Night VLA: Seeing the Unseen via Thermal-Perceptive Vision-Language-Action Models for Safety-Critical Manipulation

Authors: Dian Yu, Qingchuan Zhou, Bingkun Huang, Majid Khadiv, Zewen Yang

First: 2026-03-05T23:26:44+00:00 · Latest: 2026-03-05T23:26:44+00:00

Abs · PDF · Code1 · Code2

Abstract

Current Vision-Language-Action (VLA) models rely primarily on RGB perception, preventing them from capturing modalities such as thermal signals that are imperceptible to conventional visual sensors. Moreover, end-to-end generative policies lack explicit safety constraints, making them fragile when encountering obstacles and novel scenarios outside the training distribution. To address these limitations, we propose Safe-Night VLA, a multimodal manipulation framework that enables robots to see the unseen while enforcing rigorous safety constraints for thermal-aware manipulation in unstructured environments. Specifically, Safe-Night VLA integrates long-wave infrared thermal perception into a pre-trained vision-language backbone, enabling semantic reasoning grounded in thermodynamic properties. To ensure safe execution under out-of-distribution conditions, we incorporate a safety filter via control barrier functions, which provide deterministic workspace constraint enforcement during policy execution. We validate our framework through real-world experiments on a Franka manipulator, introducing a novel evaluation paradigm featuring temperature-conditioned manipulation, subsurface target localization, and reflection disambiguation, while maintaining constrained execution at inference time. Results demonstrate that Safe-Night VLA outperforms RGB-only baselines and provide empirical evidence that foundation models can effectively leverage non-visible physical modalities for robust manipulation.
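To make the idea of a control-barrier-function safety filter concrete, here is a minimal sketch under strong simplifying assumptions: single-integrator dynamics (the command is a velocity) and an axis-aligned box workspace. With barrier functions h = upper - x and h = x - lower, the condition dh/dt + alpha * h >= 0 reduces to clipping each velocity component, which is the closed-form solution of the usual CBF quadratic program in this special case. The names `cbf_filter` and `alpha` and all numeric values are hypothetical; the dynamics and constraints used in the paper are not given in the abstract.

```python
import numpy as np

def cbf_filter(x, u_nominal, lower, upper, alpha=5.0):
    """Minimally modify a velocity command so the box constraint stays satisfied.

    For x_dot = u, the barrier h = upper - x requires u <= alpha * (upper - x),
    and h = x - lower requires u >= alpha * (lower - x).
    """
    u_max = alpha * (upper - x)
    u_min = alpha * (lower - x)
    return np.clip(u_nominal, u_min, u_max)

def simulate(x0, u_nominal, lower, upper, dt=0.01, steps=300):
    """Euler rollout of the filtered single integrator; returns all visited states."""
    x = np.array(x0, dtype=float)
    states = [x.copy()]
    for _ in range(steps):
        x = x + dt * cbf_filter(x, u_nominal, lower, upper)
        states.append(x.copy())
    return np.array(states)

# A nominal policy that always pushes toward +x would leave the box without the filter.
lower = np.array([-0.5, -0.5])
upper = np.array([0.5, 0.5])
states = simulate([0.0, 0.0], np.array([1.0, 0.0]), lower, upper)
```

The filter leaves the nominal command untouched far from the boundary and scales it down smoothly as the state approaches the limit.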

Summary / 总结

Current Vision-Language-Action (VLA) models rely primarily on RGB perception, preventing them from capturing modalities such as thermal signals that are imperceptible to conventional visual sensors.

CAVER: Curious Audiovisual Exploring Robot

Authors: Luca Macesanu, Boueny Folefack, Samik Singh, Ruchira Ray, Ben Abbatematteo, Roberto Martín-Martín

First: 2025-11-10T20:42:51+00:00 · Latest: 2026-03-05T22:06:10+00:00

Comments: 9 pages, 6 figures

Abs · PDF · Code1 · Code2 · Project1

Abstract

Multimodal audiovisual perception can enable new avenues for robotic manipulation, from better material classification to the imitation of demonstrations for which only audio signals are available (e.g., playing a tune by ear). However, to unlock such multimodal potential, robots need to learn the correlations between an object's visual appearance and the sound it generates when they interact with it. Such an active sensorimotor experience requires new interaction capabilities, representations, and exploration methods to guide the robot in efficiently building increasingly rich audiovisual knowledge. In this work, we present CAVER, a novel robot that builds and utilizes rich audiovisual representations of objects. CAVER includes three novel contributions: 1) a novel 3D printed end-effector, attachable to parallel grippers, that excites objects' audio responses, 2) an audiovisual representation that combines local and global appearance information with sound features, and 3) an exploration algorithm that uses and builds the audiovisual representation in a curiosity-driven manner that prioritizes interacting with high uncertainty objects to obtain good coverage of surprising audio with fewer interactions. We demonstrate that CAVER builds rich representations in different scenarios more efficiently than several exploration baselines, and that the learned audiovisual representation leads to significant improvements in material classification and the imitation of audio-only human demonstrations. https://caver-bot.github.io/

Summary / 总结

Multimodal audiovisual perception can enable new avenues for robotic manipulation, from better material classification to the imitation of demonstrations for which only audio signals are available (e.g., playing a tune by ear).

Scalable Digital Compute-in-Memory Ising Machines for Robustness Verification of Binary Neural Networks

Authors: Madhav Vadlamani, Rahul Singh, Yuyao Kong, Zheng Zhang, Shimeng Yu

First: 2026-03-05T21:08:46+00:00 · Latest: 2026-03-05T21:08:46+00:00

Abs · PDF · Code1 · Code2

Abstract

Verification of binary neural network (BNN) robustness is NP-hard, as it can be formulated as a combinatorial search for an adversarial perturbation that induces misclassification. Exact verification methods therefore scale poorly with problem dimension, motivating the use of hardware-accelerated heuristics and unconventional computing platforms, such as Ising solvers, that can efficiently explore complex energy landscapes and discover high-quality solutions. In this work, we reformulate BNN robustness verification as a quadratic unconstrained binary optimization (QUBO) problem and solve it using a digital compute-in-memory (DCIM) SRAM-based Ising machine. Instead of requiring globally optimal solutions, we exploit imperfect solutions produced by the DCIM Ising machine to extract adversarial perturbations and thereby demonstrate the non-robustness of the BNN. The proposed architecture stores quantized QUBO coefficients in approximately 9.1 Mb of SRAM and performs annealing in memory via voltage-controlled pseudo-read dynamics, enabling iterative updates with minimal data movement. Experimental projections indicate that the proposed approach achieves a $178\times$ acceleration in convergence rate and a $1538\times$ improvement in power efficiency relative to conventional CPU-based implementations.
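For readers unfamiliar with QUBO, the sketch below shows the objective $E(x) = x^\top Q x$ over binary vectors and a plain software simulated-annealing loop that, like the approach in the abstract, accepts imperfect (not necessarily optimal) solutions. This is only a CPU stand-in: it does not model the DCIM hardware, and the matrix `Q` is an arbitrary toy example rather than an encoding of any BNN. All function names and parameters are assumptions.

```python
import itertools
import numpy as np

def qubo_energy(Q, x):
    """QUBO objective x^T Q x for a binary vector x."""
    return float(x @ Q @ x)

def anneal(Q, steps=2000, t0=2.0, seed=0):
    """Single-bit-flip simulated annealing; returns the best state seen and its energy."""
    rng = np.random.default_rng(seed)
    n = Q.shape[0]
    x = rng.integers(0, 2, size=n)
    e = qubo_energy(Q, x)
    best_x, best_e = x.copy(), e
    for step in range(steps):
        t = t0 * (1.0 - step / steps) + 1e-9  # linear cooling schedule
        i = rng.integers(n)
        x[i] ^= 1  # propose flipping one bit
        e_new = qubo_energy(Q, x)
        if e_new <= e or rng.random() < np.exp((e - e_new) / t):
            e = e_new  # accept the move
            if e < best_e:
                best_x, best_e = x.copy(), e
        else:
            x[i] ^= 1  # reject: undo the flip
    return best_x, best_e

def brute_force_min(Q):
    """Exact minimum by enumeration; only feasible for very small n."""
    n = Q.shape[0]
    return min(qubo_energy(Q, np.array(bits))
               for bits in itertools.product((0, 1), repeat=n))

# Toy instance: negative diagonal rewards setting bits, off-diagonal penalizes neighbors.
Q = np.array([[-1.0, 2.0, 0.0],
              [0.0, -1.0, 2.0],
              [0.0, 0.0, -1.0]])
best_x, best_e = anneal(Q)
```

Enumeration is included only to check the heuristic on tiny instances; its cost doubles with every added variable, which is the scaling problem that motivates heuristic solvers.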

Summary / 总结

Verification of binary neural network (BNN) robustness is NP-hard, as it can be formulated as a combinatorial search for an adversarial perturbation that induces misclassification.

Observing and Controlling Features in Vision-Language-Action Models

Authors: Hugo Buurmeijer, Carmen Amo Alonso, Aiden Swann, Marco Pavone

First: 2026-03-05T18:53:50+00:00 · Latest: 2026-03-05T18:53:50+00:00

Abs · PDF · Code1 · Code2

Abstract

Vision-Language-Action Models (VLAs) have shown remarkable progress towards embodied intelligence. While their architecture partially resembles that of Large Language Models (LLMs), VLAs exhibit higher complexity due to their multi-modal inputs/outputs and often hybrid nature of transformer and diffusion heads. This is part of the reason why insights from mechanistic interpretability in LLMs, which explain how the internal model representations relate to their output behavior, do not trivially transfer to VLA counterparts. In this work, we propose to close this gap by introducing and analyzing two main concepts: feature-observability and feature-controllability. In particular, we first study features that are linearly encoded in representation space, and show how they can be observed by means of a linear classifier. Then, we use a minimal linear intervention grounded in optimal control to accurately place internal representations and steer the VLA's output towards a desired region. Our results show that targeted, lightweight interventions can reliably steer a robot's behavior while preserving closed-loop capabilities. We demonstrate on different VLA architectures ($π_{0.5}$ and OpenVLA) through simulation experiments that VLAs possess interpretable internal structure amenable to online adaptation without fine-tuning, enabling real-time alignment with user preferences and task requirements.

Summary / 总结

Vision-Language-Action Models (VLAs) have shown remarkable progress towards embodied intelligence.

RealWonder: Real-Time Physical Action-Conditioned Video Generation

Authors: Wei Liu, Ziyu Chen, Zizhang Li, Yue Wang, Hong-Xing Yu, Jiajun Wu

First: 2026-03-05T18:22:54+00:00 · Latest: 2026-03-05T18:22:54+00:00

Comments: The first two authors contributed equally. The last two authors advised equally. Project website: https://liuwei283.github.io/RealWonder/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Current video generation models cannot simulate physical consequences of 3D actions like forces and robotic manipulations, as they lack structural understanding of how actions affect 3D scenes. We present RealWonder, the first real-time system for action-conditioned video generation from a single image. Our key insight is using physics simulation as an intermediate bridge: instead of directly encoding continuous actions, we translate them through physics simulation into visual representations (optical flow and RGB) that video models can process. RealWonder integrates three components: 3D reconstruction from single images, physics simulation, and a distilled video generator requiring only 4 diffusion steps. Our system achieves 13.2 FPS at 480x832 resolution, enabling interactive exploration of forces, robot actions, and camera controls on rigid objects, deformable bodies, fluids, and granular materials. We envision that RealWonder opens new opportunities to apply video models in immersive experiences, AR/VR, and robot learning. Our code and model weights are publicly available on our project website: https://liuwei283.github.io/RealWonder/

Summary / 总结

Current video generation models cannot simulate physical consequences of 3D actions like forces and robotic manipulations, as they lack structural understanding of how actions affect 3D scenes.

PhysiFlow: Physics-Aware Humanoid Whole-Body VLA via Multi-Brain Latent Flow Matching and Robust Tracking

Authors: Weikai Qin, Sichen Wu, Ci Chen, Mengfan Liu, Linxi Feng, Xinru Cui, Haoqi Han, Hesheng Wang

First: 2026-03-05T17:33:20+00:00 · Latest: 2026-03-05T17:33:20+00:00

Abs · PDF · Code1 · Code2

Abstract

In the domain of humanoid robot control, the fusion of Vision-Language-Action (VLA) with whole-body control is essential for semantically guided execution of real-world tasks. However, existing methods encounter challenges in terms of low VLA inference efficiency or an absence of effective semantic guidance for whole-body control, resulting in instability in dynamic limb-coordinated tasks. To bridge this gap, we present a semantic-motion intent guided, physics-aware multi-brain VLA framework for humanoid whole-body control. A series of experiments was conducted to evaluate the performance of the proposed framework. The experimental results demonstrated that the framework enabled reliable vision-language-guided full-body coordination for humanoid robots.

Summary / 总结

In the domain of humanoid robot control, the fusion of Vision-Language-Action (VLA) with whole-body control is essential for semantically guided execution of real-world tasks.

PRISM: Personalized Refinement of Imitation Skills for Manipulation via Human Instructions

Authors: Arnau Boix-Granell, Alberto San-Miguel-Tello, Magí Dalmau-Moreno, Néstor García

First: 2026-03-05T17:05:08+00:00 · Latest: 2026-03-05T17:05:08+00:00

Comments: 10 pages, 3 figures, Accepted for publication at European Robotics Forum 2026

Abs · PDF · Code1 · Code2

Abstract

This paper presents PRISM: an instruction-conditioned refinement method for imitation policies in robotic manipulation. The approach bridges Imitation Learning (IL) and Reinforcement Learning (RL) frameworks into a seamless pipeline, such that an imitation policy for a broad generic task, generated from a set of user-guided demonstrations, can be refined through reinforcement to generate new, unseen fine-grained behaviours. The refinement process follows the Eureka paradigm, where reward functions for RL are iteratively generated from an initial natural-language task description. The presented approach builds on this mechanism to adapt a refined IL policy of a generic task to new goal configurations and newly introduced constraints by also adding human feedback correction on intermediate rollouts, enabling policy reusability and therefore data efficiency. Results for a pick-and-place task in a simulated scenario show that the proposed method outperforms policies without human feedback, improving robustness on deployment and reducing computational burden.

Summary / 总结

This paper presents PRISM: an instruction-conditioned refinement method for imitation policies in robotic manipulation.

OpenFrontier: General Navigation with Visual-Language Grounded Frontiers

Authors: Esteban Padilla, Boyang Sun, Marc Pollefeys, Hermann Blum

First: 2026-03-05T17:02:22+00:00 · Latest: 2026-03-05T17:02:22+00:00

Abs · PDF · Code1 · Code2

Abstract

Open-world navigation requires robots to make decisions in complex everyday environments while adapting to flexible task requirements. Conventional navigation approaches often rely on dense 3D reconstruction and hand-crafted goal metrics, which limits their generalization across tasks and environments. Recent advances in vision-language navigation (VLN) and vision-language-action (VLA) models enable end-to-end policies conditioned on natural language, but typically require interactive training, large-scale data collection, or task-specific fine-tuning with a mobile agent. We formulate navigation as a sparse subgoal identification and reaching problem and observe that providing visual anchoring targets for high-level semantic priors enables highly efficient goal-conditioned navigation. Based on this insight, we select navigation frontiers as semantic anchors and propose OpenFrontier, a training-free navigation framework that seamlessly integrates diverse vision-language prior models. OpenFrontier enables efficient navigation with a lightweight system design, without dense 3D mapping, policy training, or model fine-tuning. We evaluate OpenFrontier across multiple navigation benchmarks and demonstrate strong zero-shot performance, as well as effective real-world deployment on a mobile robot.

Summary / 总结

Open-world navigation requires robots to make decisions in complex everyday environments while adapting to flexible task requirements.

LHM-Humanoid: Learning a Unified Policy for Long-Horizon Humanoid Whole-Body Loco-Manipulation in Diverse Messy Environments

Authors: Haozhuo Zhang, Jingkai Sun, Michele Caprio, Jian Tang, Shanghang Zhang, Qiang Zhang, Wei Pan

First: 2025-08-23T08:23:14+00:00 · Latest: 2026-03-05T16:38:10+00:00

Abs · PDF · Code1 · Code2

Abstract

We introduce LHM-Humanoid, a benchmark and learning framework for long-horizon whole-body humanoid loco-manipulation in diverse, cluttered scenes. In our setting, multiple objects are displaced from their intended locations and may obstruct navigation; a humanoid agent must repeatedly (i) walk to a target, (ii) pick it up with diverse whole-body postures under balance constraints, (iii) carry it while navigating around obstacles, and (iv) place it at a designated goal -- all within a single continuous episode and without any environment reset. This task simultaneously demands cross-scene generalization and unified one-policy control: layouts, obstacle arrangements, object category/mass/shape/color and object start/goal poses vary substantially even within a room category, requiring a single general policy that directly outputs actions rather than invoking pre-trained skill libraries. Our dataset spans four room types (bedroom, living room, kitchen, and warehouse), comprising 350 diverse scenes/tasks with 79 objects (25 movable targets). Since no scene-specific ground-truth motion sequences are provided, we learn goal-conditioned teacher policies via reinforcement learning and distill them into a single end-to-end student policy using DAgger. We further distill this unified policy into a vision-language-action (VLA) model driven by egocentric RGB observations and natural language. Experiments in Isaac Gym demonstrate that LHM-Humanoid substantially outperforms end-to-end RL baselines and prior humanoid loco-manipulation methods on both seen and unseen scenes, exhibiting strong long-horizon robustness and cross-scene generalization.

Summary / 总结

We introduce LHM-Humanoid, a benchmark and learning framework for long-horizon whole-body humanoid loco-manipulation in diverse, cluttered scenes.

Curve-Induced Dynamical Systems on Riemannian Manifolds and Lie Groups

Authors: Saray Bakker, Martin Schonger, Tobias Löw, Javier Alonso-Mora, Sylvain Calinon

First: 2026-03-05T15:18:26+00:00 · Latest: 2026-03-05T15:18:26+00:00

Comments: Preprint, 14 pages, video linked in the paper, Saray Bakker and Martin Schonger contributed equally as first authors and are listed alphabetically

Abs · PDF · Code1 · Code2

Abstract

Deploying robots in household environments requires safe, adaptable, and interpretable behaviors that respect the geometric structure of tasks. This structure is often represented on Lie groups and Riemannian manifolds, for example as poses on SE(3) or as symmetric positive definite matrices encoding stiffness or damping. In this context, dynamical system-based approaches offer a natural framework for generating such behavior, providing stability and convergence while remaining responsive to changes in the environment. We introduce Curve-induced Dynamical systems on Smooth Manifolds (CDSM), a real-time framework for constructing dynamical systems directly on Riemannian manifolds and Lie groups. The proposed approach constructs a nominal curve on the manifold and generates a dynamical system that combines a tangential component, which drives motion along the curve, with a normal component, which attracts the state toward the curve. We provide a stability analysis of the resulting dynamical system and validate the method quantitatively. On an S2 benchmark, CDSM demonstrates improved trajectory accuracy, reduced path deviation, and faster generation and query times compared to state-of-the-art methods. Finally, we demonstrate the practical applicability of the framework on both a robotic manipulator, where poses on SE(3) and damping matrices on SPD(n) are adapted online, and a mobile manipulator.
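The tangential-plus-normal construction can be pictured with a deliberately simplified sketch in the flat plane. This is not the CDSM method itself: the paper works on Riemannian manifolds and Lie groups, whereas the code below uses Euclidean differences, a polyline curve, a nearest-point lookup, and hypothetical gains `k_tangent` and `k_normal`. It only shows how the two components combine into one vector field.

```python
import numpy as np

def curve_field(x, curve, k_tangent=1.0, k_normal=2.0):
    """Velocity at state x: move along the curve and get pulled toward it.

    `curve` is an (N, 2) array of ordered points (N >= 2) in the plane.
    """
    d = np.linalg.norm(curve - x, axis=1)
    i = int(np.argmin(d))                # index of the nearest curve point
    j = min(i + 1, len(curve) - 1)
    tangent = curve[j] - curve[j - 1]    # local direction of travel
    tangent = tangent / np.linalg.norm(tangent)
    normal = curve[i] - x                # points from the state back to the curve
    return k_tangent * tangent + k_normal * normal

def rollout(x0, curve, dt=0.01, steps=200):
    """Explicit Euler integration of the vector field."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = x + dt * curve_field(x, curve)
    return x

def distance_to_curve(x, curve):
    """Distance from x to the nearest sampled curve point."""
    return float(np.min(np.linalg.norm(curve - x, axis=1)))

# Nominal curve: upper half of the unit circle; the start state lies off the curve.
angles = np.linspace(0.0, np.pi, 200)
arc = np.stack([np.cos(angles), np.sin(angles)], axis=1)
x_final = rollout([1.5, 0.0], arc)
```

On a manifold, the differences `curve[i] - x` and `curve[j] - curve[j - 1]` would be replaced by manifold-appropriate operations such as logarithmic maps; that substitution is beyond this sketch.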

Summary / 总结

Deploying robots in household environments requires safe, adaptable, and interpretable behaviors that respect the geometric structure of tasks.

Parallel Split Learning with Global Sampling

Authors: Mohammad Kohankhaki, Ahmad Ayad, Mahdi Barhoush, Anke Schmeink

First: 2024-07-22T15:41:23+00:00 · Latest: 2026-03-05T14:34:53+00:00

Comments: Accepted at the 2025 IEEE 3rd International Conference on Foundation and Large Language Models (FLLM). This version corresponds to the accepted manuscript

Abs · PDF · Code1 · Code2

Abstract

Parallel split learning (PSL) suffers from two intertwined issues: the effective batch size grows with the number of clients, and data that is not identically and independently distributed (non-IID) skews global batches. We present parallel split learning with global sampling (GPSL), a server-driven scheme that fixes the global batch size while computing per-client batch-size schedules using pooled-level proportions. The actual samples are drawn locally without replacement by each selected client. This eliminates per-class rounding, decouples the effective batch from the client count, and makes each global batch distributionally equivalent to centralized uniform sampling without replacement. Consequently, we obtain finite-population deviation guarantees via Serfling's inequality, yielding a zero rounding bias compared to local sampling schemes. GPSL is a drop-in replacement for PSL with negligible overhead and scales to large client populations. In extensive experiments on CIFAR-10/100 and ResNet-18/34 under non-IID splits, GPSL stabilizes optimization and achieves centralized-like accuracy, while fixed local batching trails by up to 60%. Furthermore, GPSL shortens training time by avoiding inflation of training steps induced by data-depletion. These findings suggest GPSL is a promising and scalable approach for learning in resource-constrained environments.
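One way to picture a server-side schedule with a fixed global batch size is to draw per-client counts from a multivariate hypergeometric distribution over the clients' remaining sample counts. That distribution is exactly what the per-client counts follow when a batch is drawn uniformly without replacement from the pooled data. The sketch below is an assumption about how such a schedule could be computed, not the authors' implementation; `batch_schedule` and its parameters are hypothetical names.

```python
import numpy as np

def batch_schedule(client_sizes, global_batch, seed=0):
    """Return an (num_steps, num_clients) array of per-client batch sizes.

    Each row sums to `global_batch` (the last row may be smaller if the pool
    runs out), and each column sums to that client's dataset size.
    """
    rng = np.random.default_rng(seed)
    remaining = np.array(client_sizes, dtype=np.int64)
    schedule = []
    while remaining.sum() > 0:
        b = min(global_batch, int(remaining.sum()))
        # Counts per client for one uniform draw of b samples from the pool.
        counts = rng.multivariate_hypergeometric(remaining, b)
        schedule.append(counts)
        remaining = remaining - counts
    return np.array(schedule)

# Three clients with unequal data; the global batch stays at 10 regardless of client count.
schedule = batch_schedule([50, 30, 20], global_batch=10)
```

Each client would then draw its assigned number of samples locally without replacement, so the server never needs to see the data itself.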

Summary / 总结

Parallel split learning (PSL) suffers from two intertwined issues: the effective batch size grows with the number of clients, and data that is not identically and independently distributed (non-IID) skews global batches.
