Daily Papers Arch&EAI

Snapshot: 20260519_0805

Offline Semantic Guidance for Efficient Vision-Language-Action Policy Distillation

Authors: Jin Shi, Brady Zhang, Yishun Lu

First: 2026-05-15T17:48:25+00:00 · Latest: 2026-05-15T17:48:25+00:00

Abstract

Billion-parameter Vision-Language-Action (VLA) policies have recently shown impressive performance in robotic manipulation, yet their size and inference cost remain major obstacles for real-time closed-loop control. We introduce VLA-AD, a distillation framework that uses a Vision-Language Model (VLM) as an offline semantic supervisor to transfer large VLA teachers into lightweight student policies. Instead of relying only on low-level action imitation, VLA-AD augments teacher-provided 7-DoF action targets with high-level semantic guidance, including task phase anchors and multi-frame operating-direction descriptions. These auxiliary signals are used only during training: at test time, the student policy runs independently, with neither the VLA teacher nor the VLM required. We evaluate VLA-AD on three LIBERO benchmark suites. Using OpenVLA-7B as the teacher, our method produces a 158M-parameter student, yielding a 44× reduction in model size while matching the teacher with only a 0.27% average relative gap. The resulting policy runs at 12.5 Hz on an RTX 4090, achieving a 3.28× inference speedup over OpenVLA-7B. We further show that the same semantic distillation pipeline generalizes to a different $π_{0.5}$-4B teacher, where the student outperforms the teacher on two suites and remains within 0.53% on libero_goal. Additional analysis indicates that phase-level supervision and multi-frame directional cues make the student less sensitive to noisy teacher actions, such as erroneous high-frequency gripper changes. Overall, VLA-AD demonstrates that offline semantic guidance from VLMs can substantially improve the efficiency, robustness, and deployability of VLA policy distillation.

Summary

Billion-parameter Vision-Language-Action (VLA) policies have recently shown impressive performance in robotic manipulation, yet their size and inference cost remain major obstacles for real-time closed-loop control.
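
The size and speed figures quoted in the VLA-AD abstract can be sanity-checked with a few lines of arithmetic. The sketch below assumes the nominal "7B" teacher size means exactly 7e9 parameters (the abstract does not give an exact count) and derives the teacher's implied control rate from the stated 12.5 Hz and 3.28× speedup; that derived rate is not a number reported in the abstract.

```python
# Back-of-the-envelope check of the figures quoted in the VLA-AD abstract.
# Assumption: "7B" is taken as exactly 7e9 parameters (nominal size only).
teacher_params = 7e9
student_params = 158e6
size_reduction = teacher_params / student_params

student_hz = 12.5   # reported student control rate on an RTX 4090
speedup = 3.28      # reported speedup over the teacher
implied_teacher_hz = student_hz / speedup      # derived, not reported
student_latency_ms = 1000.0 / student_hz       # time budget per control step

print(f"size reduction: {size_reduction:.1f}x")
print(f"implied teacher rate: {implied_teacher_hz:.2f} Hz")
print(f"student latency: {student_latency_ms:.0f} ms per step")
```

Under that assumption the ratio rounds to the reported 44×, and the student's 80 ms step time corresponds to a teacher running at a little under 4 Hz.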

Flatness-based trajectory planning for 3D overhead cranes with friction compensation and collision avoidance

Authors: Jorge Vicente-Martinez, Edgar Ramirez-Laboreo

First: 2025-10-28T14:24:47+00:00 · Latest: 2026-05-15T16:53:35+00:00

Comments: 6 pages, 8 figures. Final version, after peer review and acceptance, submitted to the 23rd IFAC World Congress

Abstract

This paper presents an optimal trajectory generation method for 3D overhead cranes by leveraging differential flatness. This framework enables the direct inclusion of complex physical and dynamic constraints, such as nonlinear friction and collision avoidance for both payload and rope. Our approach allows for aggressive movements by constraining payload swing only at the final point. A comparative simulation study validates our approach, demonstrating that neglecting dry friction leads to actuator saturation and collisions. The results show that friction modeling is a fundamental requirement for fast and safe crane trajectories.

Summary

This paper presents an optimal trajectory generation method for 3D overhead cranes by leveraging differential flatness.

Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking

Authors: Vaidehi Bagaria, Nikshep Grampurohit, Pulkit Verma

First: 2026-05-15T16:33:59+00:00 · Latest: 2026-05-15T16:33:59+00:00

Abstract

Reinforcement learning (RL) allows vision-language-action (VLA) policies to generalize beyond their training distribution by optimizing directly for task success, but post-training is computationally expensive. A natural response has been to speed rollout collection through faster simulators and world models. In GRPO-based VLA RL, we find that the dominant cost lies elsewhere: gradient computation accounts for approximately 78% of wall-clock time per step in our runs, while rollout collection accounts for only 21%. Gradient cost dominates because much of this computation is spent on phases that contribute little to learning. GRPO's learning signal is driven by advantage variance: only phases where successful and failed rollouts diverge produce learning signal. However, GRPO assigns the same advantage to every chunk in a rollout. As a result, actor-update compute is spent uniformly across the trajectory, including phases the policy already handles after pre-training and supervised fine-tuning. This paper presents Probabilistic Chunk Masking (PCM), a drop-in modification to GRPO that allocates gradient computation to a small, probabilistically selected subset of chunks per trajectory. PCM scores semantic phases using success-failure action variance, a rollout-derived proxy for per-phase gradient variance, and samples a fixed chunk budget with online-updated phase-level keep probabilities. We formalize per-phase gradient variance as the quantity that determines where gradient computation is useful and show that success-failure action variance provides a measurable proxy for it. PCM requires no reward model or learned critic. On three LIBERO benchmarks, PCM matches the final success rate of standard GRPO while achieving a 2.38× wall-clock speedup, 4.8× faster gradient updates, and 60% lower peak activation memory, all while backpropagating through fewer than 20% of trajectory chunks.

Summary

Reinforcement learning (RL) allows vision-language-action (VLA) policies to generalize beyond their training distribution by optimizing directly for task success, but post-training is computationally expensive.
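
The chunk-selection step described in the PCM abstract (phase scores turned into keep probabilities, then a fixed budget of chunks sampled per trajectory) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function names, the probability floor, the exponential-moving-average update, and the use of Efraimidis-Spirakis weighted sampling are not taken from the paper, which does not specify these details in the abstract.

```python
import random

def phase_keep_probs(action_var, floor=0.05):
    """Map per-phase success-failure action variance to keep probabilities.

    The floor (hypothetical) keeps every phase at a nonzero probability.
    """
    total = sum(action_var.values())
    if total == 0:
        return {p: 1.0 / len(action_var) for p in action_var}
    raw = {p: max(v / total, floor) for p, v in action_var.items()}
    norm = sum(raw.values())
    return {p: r / norm for p, r in raw.items()}

def ema_update(old, new, alpha=0.1):
    """Hypothetical online update: blend old and new keep probabilities."""
    return {p: (1 - alpha) * old[p] + alpha * new[p] for p in old}

def sample_chunks(chunk_phases, keep_probs, budget, rng):
    """Pick `budget` distinct chunk indices, weighted by phase keep probability.

    Uses weighted sampling without replacement: each chunk draws a key
    u ** (1 / w) and the chunks with the largest keys are kept.
    """
    keys = []
    for idx, phase in enumerate(chunk_phases):
        weight = keep_probs[phase]
        keys.append((rng.random() ** (1.0 / weight), idx))
    keys.sort(reverse=True)
    return sorted(idx for _, idx in keys[:budget])

# Toy trajectory: 20 chunks in four phases, with made-up variance scores.
action_var = {"reach": 0.01, "grasp": 0.30, "transport": 0.04, "place": 0.15}
chunk_phases = ["reach"] * 5 + ["grasp"] * 5 + ["transport"] * 5 + ["place"] * 5
probs = phase_keep_probs(action_var)
kept = sample_chunks(chunk_phases, probs, budget=3, rng=random.Random(0))
print(f"backpropagating through {len(kept)} of {len(chunk_phases)} chunks")
```

Only the chunks in `kept` would receive gradient computation in this sketch; with a budget of 3 out of 20 chunks, the fraction stays below the 20% figure quoted in the abstract.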

STABLE: Simulation-Ready Tabletop Layout Generation via a Semantics-Physics Dual System

Authors: Zhen Luo, Yixuan Yang, Xudong Xu, Jinkun Hao, Zhaoyang Lyu, Feng Zheng, Jiangmiao Pang, Yanwei Fu

Venue: ICML 2026

First: 2026-05-15T16:18:42+00:00 · Latest: 2026-05-15T16:18:42+00:00

Comments: ICML 2026

Abstract

Generating simulation-ready tabletop scenes from task instructions is an intriguing and promising research direction in the field of Embodied AI. However, existing task-to-scene generation methods rely exclusively on large language models (LLMs) to predict scene layouts, inevitably yielding object collisions or floating due to LLMs' inherent limitations in 3D spatial reasoning. In this paper, we present STABLE, a semantics-physics dual system tailored for simulation-ready tabletop scene generation. STABLE consists of two complementary modules: (i) a Semantic Reasoner, a fine-tuned LLM trained on a structured tabletop scene dataset to generate coarse layouts from input task instructions, and (ii) a Physics Corrector, a physics-aware flow-based denoising model that outputs pose updates to refine layouts, which ensures the physical plausibility of scenes while preserving semantic alignment with task instructions. STABLE adopts a progressive generation paradigm: by alternating between the Semantic Reasoner and Physics Corrector, it incrementally expands the scene from task-critical objects to background objects. Experiments demonstrate that STABLE successfully generates simulation-ready tabletop scenes that strictly conform to task instructions and significantly enhances the physical validity of scenes over prior art.

Summary

Generating simulation-ready tabletop scenes from task instructions is an intriguing and promising research direction in the field of Embodied AI.

Health-Conditioned Vision-Language-Action Models for Malfunction-Aware Robot Control

Authors: Hüseyin Arslan, Özgür Erkent

Venue: ICRA

First: 2026-05-15T15:21:39+00:00 · Latest: 2026-05-15T15:21:39+00:00

Comments: VLA Pipelines Workshop at IEEE International Conference on Robotics and Automation (ICRA) 2026

Abstract

Research on Vision Language Action (VLA) models has been increasing rapidly in recent years. Although some of this work focuses on detecting, preventing, and recovering from task failures, it usually does not address adaptation to a robot's physical failures. In real-life scenarios, most robots face physical degradation of various kinds, such as joint degradation, actuator failure, or a weak gripper. We introduce a malfunction-aware (health-conditioned) VLA that takes as input a health vector describing the operating angle and torque capability of the robot's joints, and adapts its predictions to complete the tasks with the degraded joints. To achieve this, we inject a Health Projector module into the VLA-Adapter architecture and train it on malfunctioning-robot data that we collected in the LIBERO environment [1]. We collect 128 teleoperated episodes on Libero-Spatial tasks. Our results show that, with a very lightweight addition, the model can learn to operate successfully with different configurations of degraded joints, which the default pretrained VLA-Adapter Libero-Spatial-Pro model cannot. The code and dataset will be available soon at https://github.com/h-arslan/health-aware-vla

Summary

Research on Vision Language Action (VLA) models has been increasing rapidly in recent years.

Learning Structured Robot Policies from Vision-Language Models via Synthetic Neuro-Symbolic Supervision

Authors: Alessandro Adami, Tommaso Tubaldo, Marco Todescato, Ruggero Carli, Pietro Falco

First: 2026-04-03T07:27:33+00:00 · Latest: 2026-05-15T14:52:29+00:00

Abstract

Vision-Language Models (VLMs) have recently demonstrated strong capabilities in mapping multimodal observations to robot behaviors. However, most current approaches rely on end-to-end visuomotor policies that remain opaque and difficult to analyze, limiting their use in real-world robotic applications. In contrast, classical robotic systems often rely on structured policy representations that provide interpretability, modularity, and reactive execution. This work investigates how foundation models can be specialized to generate structured robot policies grounded in multimodal perception, bridging high-dimensional learning and symbolic control. We propose a neuro-symbolic approach in which a VLM synthesizes executable Behavior Tree policies from visual observations, natural language instructions, and structured system specifications. To enable scalable supervision without manual annotation, we introduce an automated pipeline that generates a synthetic multimodal dataset of domain-randomized scenes paired with instruction-policy examples produced by a foundation model. By decoupling structured task decomposition under constrained symbolic grammars from hardware-specific motor control, we demonstrate that a 12B-parameter model can learn structured spatial-symbolic mappings required for executable BT synthesis, solely through in-silico supervision. Real-world physical experiments on two heterogeneous robotic manipulators confirm that these structurally constrained policies achieve zero-shot transfer to real-world environments. The results emphasize that the data bottleneck in robotic planning can be bypassed by procedurally synthesizing high-fidelity, neuro-symbolic training data.

Summary

Vision-Language Models (VLMs) have recently demonstrated strong capabilities in mapping multimodal observations to robot behaviors.

Petri Net Induced Heuristic Search for Resource Constrained Scheduling

Authors: Ido Lublin, Dor Atzmon, Izack Cohen

First: 2026-05-15T14:15:50+00:00 · Latest: 2026-05-15T14:15:50+00:00

Comments: Accepted at the International Symposium on Combinatorial Search (SoCS 2026)

Abstract

We formulate the Resource-Constrained Project Scheduling Problem (RCPSP) as optimal search over the reachability graph of a Timed Transition Petri Net with Resources, using relative-delay tokens so that scheduling decisions correspond to transition firings in the induced state space. We solve the resulting problem with $A^*$ guided by a heuristic that combines Critical Path and resource-based lower bounds, and prove that it is consistent under our token-based time semantics. Experiments on the PSPLIB benchmarks show that the approach outperforms strong exact Mixed-Integer Linear Programming (MIP) baselines (SCIP, CBC) in both success rate and solve time. Per-instance analysis shows that heuristic search and MIP degrade along independent axes, resource tightness for $A^*$ and formulation size for MIP, with resource strength mediating which solver benefits from scale.

Summary

We formulate the Resource-Constrained Project Scheduling Problem (RCPSP) as optimal search over the reachability graph of a Timed Transition Petri Net with Resources, using relative-delay tokens so that scheduling decisions correspond to transition firings in the induced state space.
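
The Petri net abstract mentions a heuristic combining Critical Path and resource-based lower bounds. As a rough illustration of that idea, the sketch below computes both bounds for a whole toy project and takes their maximum. It is a generic textbook-style construction, not the paper's heuristic: the function names, the data layout, and the specific resource bound (total work divided by capacity) are assumptions, and the paper evaluates its heuristic on search states with relative-delay tokens rather than on the full project.

```python
from math import ceil

def critical_path_bound(durations, preds):
    """Length of the longest precedence chain (resources ignored)."""
    finish = {}

    def earliest_finish(activity):
        if activity not in finish:
            start = max(
                (earliest_finish(p) for p in preds.get(activity, [])),
                default=0,
            )
            finish[activity] = start + durations[activity]
        return finish[activity]

    return max(earliest_finish(a) for a in durations)

def resource_bound(durations, demands, capacity):
    """Total work on each resource divided by its capacity, rounded up."""
    best = 0
    for resource, cap in capacity.items():
        work = sum(durations[a] * demands[a].get(resource, 0) for a in durations)
        best = max(best, ceil(work / cap))
    return best

def lower_bound(durations, preds, demands, capacity):
    """Maximum of the two bounds; each one alone never overestimates."""
    return max(
        critical_path_bound(durations, preds),
        resource_bound(durations, demands, capacity),
    )

# Toy instance with integer durations and a single renewable resource "R".
durations = {"A": 3, "B": 2, "C": 4, "D": 1}
preds = {"C": ["A"], "D": ["B", "C"]}
demands = {"A": {"R": 2}, "B": {"R": 1}, "C": {"R": 2}, "D": {"R": 1}}
capacity = {"R": 2}

cp = critical_path_bound(durations, preds)       # A -> C -> D
rb = resource_bound(durations, demands, capacity)
print(cp, rb, lower_bound(durations, preds, demands, capacity))  # prints "8 9 9"
```

In this toy instance the resource bound (9) exceeds the critical path (8), which is the kind of case where combining the two bounds gives a tighter estimate than either alone.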

Learning Bilevel Policies over Symbolic World Models for Long-Horizon Planning

Authors: Dillon Z. Chen, Till Hofmann, Toryn Q. Klassen, Sheila A. McIlraith

First: 2026-05-15T14:08:44+00:00 · Latest: 2026-05-15T14:08:44+00:00

Abstract

We tackle the challenge of building embodied AI agents that can reliably solve long-horizon planning problems. Imitation learning from demonstrations has shown itself to be effective in training robots to solve a diversity of complex tasks requiring fine motor control and manipulation over low-level (LL), continuous environments. Yet, it remains a difficult endeavour to generate long-horizon plans from imitation learning alone. In contrast, high-level (HL), symbolic abstractions facilitate efficient and interpretable long-horizon planning. We propose to combine the strengths of LL imitation learning for manipulation and control, and HL symbolic abstractions for long-horizon planning. We realise this idea via bilevel policies of the form $(π^{\mathrm{hl}}, π^{\mathrm{ll}})$, consisting of a neural policy $π^{\mathrm{ll}}$ learned from LL demonstrations, and an HL symbolic policy $π^{\mathrm{hl}}$ that is constructed from symbolic abstractions of the LL demonstrations combined with inductive generalisation. We implement these ideas in the BISON system. Experiments on extended MetaWorld benchmarks demonstrate that BISON generalises to long horizons and problems with greater numbers of objects than those solved by VLA and end-to-end methods, and is more time and memory efficient in training and inference. Notably, when ignoring LL execution, BISON's HL policies can solve HL problems with 10,000 relevant objects in under a minute. Project page: https://dillonzchen.github.io/bison

Summary

We tackle the challenge of building embodied AI agents that can reliably solve long-horizon planning problems.

OHP-RL: Online Human Preference as Guidance in Reinforcement Learning for Robot Manipulation

Authors: Yunyang Mo, Jian Li, Qiwei Wu, Yihang Kang, Renjing Xu

First: 2026-05-15T14:02:34+00:00 · Latest: 2026-05-15T14:02:34+00:00

Abstract

While reinforcement learning (RL) enables robots to acquire skills autonomously, its real-world deployment is severely limited by inefficient and unsafe exploration. Human-in-the-loop interventions offer a practical solution, yet existing methods typically exploit these interventions as auxiliary training signals, without fully capturing the richer information they provide about when and how autonomy should be guided. Human interventions often encode relative preferences over behavior under safety and task constraints, rather than prescribing exact actions to imitate. Motivated by this perspective, we propose Online Human Preference as Guidance in Reinforcement Learning (OHP-RL), a framework that leverages human interventions as preference information to guide policy learning. OHP-RL introduces a state-dependent preference gate that adaptively regulates when and to what extent human interventions should shape policy learning. This design enables the agent to benefit from intermittent and imperfect human feedback while preserving autonomous exploration and stable policy optimization. We evaluate OHP-RL on three challenging real-world contact-rich manipulation tasks on a Franka robot. Across all tasks, OHP-RL consistently achieves strong success rates, faster convergence, and substantially lower human intervention effort than prior approaches. Moreover, the learned policies exhibit more stable and human-aligned behavior throughout training.

Summary

While reinforcement learning (RL) enables robots to acquire skills autonomously, its real-world deployment is severely limited by inefficient and unsafe exploration.

WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation

Authors: Baining Zhao, Jiacheng Xu, Weicheng Feng, Xin Zhang, Zhaolu Wang, Haoyang Wang, Shilong Ji, Ziyou Wang, Jianjie Fang, Zhiheng Zheng, Weichen Zhang, Yu Shang, Wei Wu, Chen Gao, Xinlei Chen, Yong Li

First: 2026-05-15T13:55:39+00:00 · Latest: 2026-05-15T13:55:39+00:00

Abstract

Aerial vision-language navigation (VLN) requires agents to follow natural-language instructions through closed-loop perception and action in 3D environments. We argue that aerial VLN can be formulated as a prediction-driven world-action problem: the agent should anticipate latent world evolution and act according to the predicted consequences. To this end, we propose WorldVLN, the first autoregressive world action model for aerial VLN. Unlike full-sequence video-generation world models that generate an entire visual clip, WorldVLN adapts a latent autoregressive video backbone to predict short-horizon world-state transitions and directly decodes them into executable waypoint actions. After each action segment is executed, newly received observations are encoded back into the autoregressive context, enabling closed-loop world-action prediction. We further introduce a two-stage training framework that first grounds the video prior in instruction-conditioned navigation dynamics and then develops Action-aware GRPO, the first reinforcement learning method tailored to autoregressive WAMs, to optimize waypoint decisions through their downstream rollout consequences. On public outdoor and indoor benchmarks, WorldVLN consistently outperforms existing Vision-Language-Action baselines with 12%+ success-rate gains and larger advantages on challenging cases. It further transfers zero-shot to real drone deployment, suggesting that the proposed WorldVLN offers a promising route for spatial action tasks. Demos and code are available at https://embodiedcity.github.io/WorldVLN/.

Summary

Aerial vision-language navigation (VLN) requires agents to follow natural-language instructions through closed-loop perception and action in 3D environments.

OpenFrontier: General Navigation with Visual-Language Grounded Frontiers

Authors: Esteban Padilla-Cerdio, Boyang Sun, Marc Pollefeys, Hermann Blum

First: 2026-03-05T17:02:22+00:00 · Latest: 2026-05-15T13:33:34+00:00

Abstract

Open-world navigation requires robots to make decisions in complex everyday environments while adapting to flexible task requirements. Conventional navigation approaches often rely on dense 3D reconstruction and hand-crafted goal metrics, which limits their generalization across tasks and environments. Recent advances in vision-language navigation (VLN) and vision-language-action (VLA) models enable end-to-end policies conditioned on natural language, but typically require interactive training, large-scale data collection, or task-specific fine-tuning with a mobile agent. We formulate navigation as a sparse subgoal identification and reaching problem and observe that providing visual anchoring targets for high-level semantic priors enables highly efficient goal-conditioned navigation. Based on this insight, we select visual frontiers as semantic anchors and propose OpenFrontier, a navigation framework that requires no task-specific training or fine-tuning and seamlessly integrates diverse vision-language prior models. OpenFrontier enables efficient navigation with a lightweight system design, without dense 3D semantic mapping, task-specific policy training, or model fine-tuning. We evaluate OpenFrontier across multiple navigation benchmarks and demonstrate strong zero-shot performance, as well as effective real-world deployment on a mobile robot.

Summary

Open-world navigation requires robots to make decisions in complex everyday environments while adapting to flexible task requirements.

Hardware Utilization and Inference Performance of Edge Object Detection Under Fault Injection

Authors: Faezeh Pasandideh, Mehdi Azarafza, Achim Rettberg

First: 2026-03-19T17:55:59+00:00 · Latest: 2026-05-15T11:58:05+00:00

Abstract

As deep learning models are deployed on resource-constrained edge platforms in autonomous driving systems, reliable knowledge of hardware behavior under resource degradation becomes an essential requirement. Therefore, we introduce a systematic characterization of CPU load, GPU utilization, RAM consumption, power draw, throughput, and thermal behaviour of TensorRT-optimized YOLOv10s, YOLOv11s and YOLO2026n pipelines running on NVIDIA Jetson Nano under a large-scale fault injection campaign targeting both lane-following and object detection tasks. Faults are synthesized using a decoupled framework that leverages large language models (LLMs) and latent diffusion models (LDMs), based on original data from our JetBot platform data collection. Results show that across both tasks and both models the inference engines keep GPU occupancy stable, temperature rise under control, and power consumption within safe limits, while memory usage settles into a consistent release pattern after the initial warm-up phase. Object detection tends to show somewhat more variability in memory and thermal behavior, yet both tasks point to the same conclusion: the TensorRT pipelines hold up well even when the input data is heavily degraded. These findings offer a hardware-level view of model reliability that sits alongside, rather than against, the broader body of work focused on inference performance at the edge.

Summary

As deep learning models are deployed on resource-constrained edge platforms in autonomous driving systems, reliable knowledge of hardware behavior under resource degradation becomes an essential requirement.

Ti-iLSTM: A TinyDL Approach for Logic-Level Anomaly Detection in Industrial Water Treatment Systems

Authors: Mandar Joshi, Farzana Zahid, Judy Bowen, Matthew M. Y. Kuo, Valeriy Vyatkin, Emil Karlsson

First: 2026-05-15T11:44:31+00:00 · Latest: 2026-05-15T11:44:31+00:00

Abstract

Industrial Water Treatment Systems (IWTS) are safety-critical cyber-physical infrastructures and, due to increased connectivity, are exposed to cyber threats that can manipulate process behaviour without creating obvious device outliers. In particular, logic-layer deception anomalies can preserve numerically plausible measurements while breaking expected cause-and-effect relationships in the control process. These attacks are difficult to detect using threshold-based monitoring or require heavy server-oriented anomaly detection models. This paper explores the potential of Tiny Deep Learning (TinyDL) to provide lightweight on-device logic-level anomaly detection for resource-constrained Programmable Logic Controllers (PLCs). We propose a novel framework, TinyDL-based incremental LSTM (Ti-iLSTM), which optimises the memory and space footprint of Long Short-Term Memory (LSTM) networks to detect logic-layer inconsistencies in PLC-based IWTS. Experiments on the publicly available SWaT dataset show that the optimised model achieves high detection performance (F1-score = 0.983 and ROC-AUC = 0.998). A deployment-style validation on the WADI dataset confirms that the proposed lightweight framework remains applicable beyond a single dataset. The research demonstrates that combining logic-aware supervision with TinyDL sequence learning yields efficient and accurate anomaly detection suitable for resource-constrained PLCs in industrial environments.

Summary

Industrial Water Treatment Systems (IWTS) are safety-critical cyber-physical infrastructures and, due to increased connectivity, are exposed to cyber threats that can manipulate process behaviour without creating obvious device outliers.

GAP: Geometric Anchor Pre-training for Data-Efficient Visuomotor Learning of Manipulation Tasks

Authors: Davide Buoso, Andrea Protopapa, Stefano Di Carlo, Francesca Pistilli, Giuseppe Averta

First: 2026-05-15T10:48:30+00:00 · Latest: 2026-05-15T10:48:30+00:00

Comments: Project webpage at https://lambdavi.github.io/gap

Abstract

Learning visuomotor policies from scarce expert demonstrations remains a core challenge in robotic manipulation. A primary hurdle lies in distilling high-dimensional RGB representations into control-relevant geometry without overfitting. While using frozen pre-trained Vision Foundation Models (VFMs) improves data efficiency, it also shifts most task adaptation onto a small spatial pooling module, which can latch onto task-irrelevant shortcuts and lose geometric grounding when finetuned with few data samples. More broadly, pre-trained visual representations used for policy learning have been observed to struggle under even minor scene perturbations, highlighting the need for robustness-oriented inductive biases. We propose Geometric Anchor Pre-training (GAP), a simple, action-free warm-up stage that regularizes the spatial adapter before downstream imitation learning. GAP pre-trains the pooling layer on a lightweight simulated proxy task where object masks are available at no cost, encouraging the adapter to produce keypoints that lie on the object, cover its spatial extent, and remain sharp and repeatable over time. This yields stable geometric anchors that provide a reliable coordinate interface for few-shot policy learning, while keeping the VFM frozen. We evaluate GAP on RoboMimic and ManiSkill under severe data scarcity (15-50 demonstrations) and domain shift. A simple adapter regularized with GAP consistently outperforms stronger attention-based poolers and end-to-end fine-tuning, achieving 62% success on RoboMimic Can with 15 demonstrations (+16% over AFA), 63% on the long-horizon high-precision Tool Hang task with 50 demonstrations, and 61% on ManiSkill StackCube with 30 demonstrations (+11% over full fine-tuning). The proxy stage is lightweight and fully decoupled from downstream tasks, making it practical to reuse across environments and manipulation skills.

Summary

Learning visuomotor policies from scarce expert demonstrations remains a core challenge in robotic manipulation.

CLARE: Continual Learning for Vision-Language-Action Models via Autonomous Adapter Routing and Expansion

Authors: Ralf Römer, Yi Zhang, Yuming Li, Angela P. Schoellig

First: 2026-01-14T14:23:42+00:00 · Latest: 2026-05-15T10:27:07+00:00

Comments: Accepted to IEEE Robotics and Automation Letters 2026. Project page: https://tum-lsy.github.io/clare. 11 pages, 9 figures

Abstract

To teach robots complex manipulation tasks, a common approach is to fine-tune a pre-trained vision-language-action model (VLA) on task-specific data. However, since this recipe updates existing representations, it is unsuitable for long-term operation in the real world, where robots must continually adapt to new tasks and environments while retaining the knowledge they have already acquired. Existing continual learning methods for robotics commonly require storing previous data (exemplars), struggle with long task sequences, or rely on task identifiers for deployment. To address these limitations, we propose CLARE, a general, parameter-efficient framework for exemplar-free continual learning with VLAs. CLARE introduces lightweight modular adapters into selected VLA modules and autonomously expands the model only where necessary when learning a new task, guided by layer-wise feature similarity. During deployment, an autoencoder-based routing mechanism dynamically activates the most relevant adapters without requiring task labels. Through extensive experiments on the LIBERO benchmark and five real-world tasks, we show that CLARE achieves high performance on new tasks without catastrophic forgetting of earlier tasks, significantly outperforming even exemplar-based methods. Code, data, and videos are available at our website: https://tum-lsy.github.io/clare.

Summary

To teach robots complex manipulation tasks, a common approach is to fine-tune a pre-trained vision-language-action model (VLA) on task-specific data.
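
The CLARE abstract describes an autoencoder-based routing mechanism that activates the most relevant adapters without task labels. One common way to realize such routing is to pick the adapter whose autoencoder reconstructs the current feature with the lowest error; the sketch below illustrates that rule. It is an assumption about how routing might work, not the paper's implementation: the selection rule, the function names, and the toy "autoencoders" (plain projections onto a single direction, standing in for trained networks) are all hypothetical.

```python
def make_toy_autoencoder(direction):
    """Stand-in for a trained autoencoder: projects onto one unit direction."""
    norm = sum(d * d for d in direction) ** 0.5
    unit = [d / norm for d in direction]

    def autoencoder(x):
        coeff = sum(a * b for a, b in zip(x, unit))
        return [coeff * u for u in unit]

    return autoencoder

def reconstruction_error(x, autoencoder):
    """Mean squared error between a feature vector and its reconstruction."""
    recon = autoencoder(x)
    return sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)

def route(x, autoencoders):
    """Return the adapter name with the lowest reconstruction error."""
    errors = {name: reconstruction_error(x, ae) for name, ae in autoencoders.items()}
    return min(errors, key=errors.get), errors

# Two hypothetical adapters, each paired with its own toy autoencoder.
autoencoders = {
    "task_a": make_toy_autoencoder([1.0, 0.0, 0.0]),
    "task_b": make_toy_autoencoder([0.0, 1.0, 1.0]),
}
feature = [0.1, 0.9, 1.1]  # made-up feature vector resembling task_b data
chosen, errors = route(feature, autoencoders)
print(chosen)  # prints "task_b"
```

The feature lies close to the direction learned for `task_b`, so that autoencoder reconstructs it almost exactly and its adapter is selected.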

Hierarchical and Holistic Open-Vocabulary Functional 3D Scene Graphs for Indoor Spaces

Authors: Xinggang Hu, Chenyangguang Zhang, Alexandros Delitzas, Xiangkui Zhang, Marc Pollefeys, Francis Engelmann, Xiangyang Ji

First: 2026-05-15T09:14:50+00:00 · Latest: 2026-05-15T09:14:50+00:00

Abstract

Functional 3D scene graphs offer a versatile and flexible representation for 3D scene understanding and robotic manipulation, defined by object nodes, interactive elements, and functional relationship edges. However, their potential remains underexplored due to the limited coverage of existing benchmarks and the overly straightforward design of previous pipelines, which primarily focus on large-scale furniture but lack hierarchical structures. Therefore, in this work, we extend the benchmark coverage by introducing dense tabletop objects and explicit multi-level functional relationships. This expansion introduces critical challenges involving small-scale, dense, and similar instances, including a lack of visual anchoring in relational reasoning, instance confusion during cross-frame fusion, and attribution uncertainty under dynamic viewpoints. To address these issues, we propose an open-vocabulary pipeline based on 2D visual grounding and 3D graph optimization. Specifically, we anchor fine-grained functional edges from 2D visual evidence, and associate nodes across frames in 3D using multiple cues. Furthermore, edge association is formulated as temporal graph optimization, integrating evidence accumulation, entropy regularization, and temporal smoothing to robustly determine the functional connections of each node. Finally, global hierarchy shaping is performed to recover the hierarchical graph structure. Extensive experiments demonstrate that the proposed method can reliably infer functional 3D scene graphs in challenging real-world scenes, thereby further unlocking their potential for practical applications.

Summary

Functional 3D scene graphs offer a versatile and flexible representation for 3D scene understanding and robotic manipulation, defined by object nodes, interactive elements, and functional relationship edges.

UAM: A Dual-Stream Perspective on Forgetting in VLA Training

Authors: Jianke Zhang, Yuanfei Luo, Yucheng Hu, Xiaoyu Chen, Yanjiang Guo, Ziyang Liu, Hongbin Xu, Tian Lan, Jianyu Chen

First: 2026-05-15T08:45:37+00:00 · Latest: 2026-05-15T08:45:37+00:00

Abstract

Vision--language--action (VLA) models are typically built by fine-tuning a pretrained vision--language model (VLM) on action data. However, we show that this standard recipe systematically erodes the VLM's multimodal competence, a side effect we call the embodiment tax. But do VLAs have to forget? Inspired by the two-stream organization of biological vision, we trace this degradation to a structural bottleneck: current VLAs ask a single encoder to support both language-grounded semantics and control-relevant visual features, whereas biological vision separates recognition and visuomotor control into distinct pathways. Building on this view, we propose the Unified Action Model (UAM), which adds a parallel Dorsal Expert, an analog of the brain's dorsal pathway. To make the Dorsal Expert an effective second pathway and reduce the control-learning burden on the VLM, we initialize it from a pretrained generative model and train it with a mid-level reasoning objective that predicts visual dynamics. This design allows us to train the whole VLA end-to-end on action data alone: with no parameter freezing, no gradient stopping, and no auxiliary VL co-training, UAM retains over $95\%$ of the underlying VLM's multimodal capability and at the same time achieves the highest average success rate among baselines on a variety of manipulation tasks that probe out-of-distribution generalization, including unseen objects, novel object--target compositions, and instruction variation. Together, these results suggest that semantic preservation in VLAs can emerge from architectural separation itself, rather than being enforced by frozen weights or auxiliary data replay, and that this preserved semantic capability can naturally transfer from VLMs to semantic generalization in actions.

Summary

Vision--language--action (VLA) models are typically built by fine-tuning a pretrained vision--language model (VLM) on action data.

Shaping Sparse Rewards in Reinforcement Learning: A Semi-supervised Approach

Authors: Wenyun Li, Wenjie Huang, Chen Sun

First: 2025-01-31T13:35:19+00:00 · Latest: 2026-05-15T06:40:35+00:00

Abs · PDF · Code1 · Code2

Abstract

In many real-world scenarios, reward signals for agents are exceedingly sparse, making it challenging to learn an effective reward function for reward shaping. To address this issue, the proposed approach performs reward shaping not only by utilizing non-zero-reward transitions but also by employing \emph{Semi-Supervised Learning} (SSL) combined with a novel data augmentation to learn trajectory space representations from the majority of transitions, i.e., zero-reward transitions, thereby improving the efficacy of reward shaping. Experimental results in Atari and robotic manipulation demonstrate that our method outperforms supervised-based approaches in reward inference, leading to higher agent scores. Notably, in more sparse-reward environments, our method achieves up to twice the peak scores compared to supervised baselines. The proposed double entropy data augmentation enhances performance, showcasing a 15.8\% increase in best score over other augmentation methods.

Summary

In many real-world scenarios, reward signals for agents are exceedingly sparse, making it challenging to learn an effective reward function for reward shaping.

TFZ-Tree: An Ultra-Lightweight Waveform Classification Framework for Resource-Constrained Devices

Authors: Hao Wang, Kuang Zhang, Yonggang Chi, Tianqi Zhao, Yanbo Fu, Jiaxing Guo

First: 2026-05-15T06:24:44+00:00 · Latest: 2026-05-15T06:24:44+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Under the trend of multi-waveform coexistence in 6G IoT, intelligent receivers must first identify physical-layer waveform types before performing correct demodulation and resource scheduling. However, existing signal identification research largely focuses on symbol-level modulation classification. Research directly targeting physical-layer waveform types (e.g., OFDM, OTFS, LoRa) is not only extremely scarce but also heavily reliant on deep neural networks and complex time-frequency transforms, making deployment on resource-constrained terminals difficult. Symbol modulation classification methods themselves cannot circumvent the prerequisite of ``waveform identification first.'' To address this dual gap, we propose an ultra-lightweight waveform classification framework based on time-frequency multidimensional features with a cooperative Z-test tree (ZTree). The framework employs low-complexity time-domain feature extraction, and the classification backend adopts a ZTree optimized by Z-statistical testing, which uses hypothesis testing confidence to automatically control decision tree splitting and size, ensuring efficient execution on resource-limited processors. Tested on ten 6G candidate waveforms including OFDM, OTFS, DSSS, LoRa, and NB-IoT, the method achieves 99.5\% average accuracy under AWGN and 87.4\% under TDL-C multipath channels, with main confusion between OTFS and LoRa. Implemented in C on an x86 platform, single inference latency is under 4~ms. To the best of our knowledge, this is the first work achieving real-time recognition of ten IoT waveform types. Future work will target deployment acceleration on embedded MCUs. Code and dataset are open-sourced at: https://github.com/Einstein-sworder/IoT-wave.

Summary

Under the trend of multi-waveform coexistence in 6G IoT, intelligent receivers must first identify physical-layer waveform types before performing correct demodulation and resource scheduling.

Perforated Neural Networks for Keyword Spotting

Authors: Vishy Gopal, Aris Ilias Goutis, Ralph Crewe, Erin Yanacek, Rorry Brenner

First: 2026-05-15T06:02:19+00:00 · Latest: 2026-05-15T06:02:19+00:00

Comments: 9 pages, 1 figure, 800-trial hyperparameter sweep; Best Model award, Edge Impulse 2025 Hackathon

Abs · PDF · Code1 · Code2

Abstract

Edge machine learning presents a unique set of constraints not encountered in cloud-scale model deployment: strict memory budgets, limited compute, and non-negotiable accuracy thresholds must all be satisfied simultaneously. Existing compression and optimization techniques can trade one resource for another, but rarely improve both accuracy and model size at the same time. This paper presents the application of Perforated Backpropagation to keyword spotting on the Edge Impulse platform, an experiment that won the Best Model award at the Edge Impulse 2025 Hackathon in December 2025. By adding artificial Dendrite Nodes to a standard convolutional neural network trained on the Edge Impulse keyword spotting tutorial pipeline, we demonstrate that dendritic models outperform traditional architectures at every level of parameter count and at every accuracy threshold tested across 800 hyperparameter trials. The best dendritic model achieved a test accuracy of 0.933 with only 1,500 parameters, versus the baseline accuracy of 0.921 requiring approximately 4,000 parameters. These results suggest that Perforated Backpropagation is a powerful addition to the edge AI engineer's toolkit, offering simultaneous gains in both model quality and deployment efficiency.

Summary

Edge machine learning presents a unique set of constraints not encountered in cloud-scale model deployment: strict memory budgets, limited compute, and non-negotiable accuracy thresholds must all be satisfied simultaneously.

FAR: Function-preserving Attention Replacement for IMC-friendly Inference

Authors: Yuxin Ren, Maxwell D Collins, Miao Hu, Huanrui Yang

First: 2025-05-24T02:23:46+00:00 · Latest: 2026-05-15T05:59:15+00:00

Comments: 7 pages main paper, 6 figures; accepted by GLSVLSI 2026

Abs · PDF · Code1 · Code2

Abstract

While transformers dominate modern vision and language models, their attention mechanism remains poorly suited for in-memory computing (IMC) devices due to intensive activation-to-activation multiplications and non-local memory access, leading to substantial latency and bandwidth overhead on ReRAM-based accelerators. To address this mismatch, we propose FAR, a Function-preserving Attention Replacement framework that substitutes all attention in pretrained DeiTs with sequential modules inherently compatible with IMC dataflows. Specifically, FAR replaces self-attention with a multi-head bidirectional LSTM architecture via block-wise distillation to retain functional equivalence while enabling linear-time computation and localized weight reuse. We further incorporate structured pruning on FAR models, enabling flexible adaptation to resource-constrained IMC arrays while maintaining functional fidelity. Evaluations on the DeiT family demonstrate that FAR maintains comparable accuracy to the original attention-based models on ImageNet and multiple downstream tasks with reduced parameters and latency. Further analysis shows that FAR preserves the semantic token relationships learned by attention while improving computational efficiency, highlighting its potential for energy-efficient transformer inference on IMC-based edge accelerators.

Summary

While transformers dominate modern vision and language models, their attention mechanism remains poorly suited for in-memory computing (IMC) devices due to intensive activation-to-activation multiplications and non-local memory access, leading to substantial latency and bandwidth overhead on ReRAM-based accelerators.

IO-SVD: Input-Output Whitened SVD for Adaptive-Rank LLM Compression

Authors: Ali Abbasi, Chayne Thrash, Haoran Qin, Hamed Pirsiavash, Soheil Kolouri

First: 2026-05-15T05:19:10+00:00 · Latest: 2026-05-15T05:19:10+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Large language models deliver strong performance across language and reasoning tasks, but their storage and compute costs remain major barriers to deployment in resource-constrained and latency-sensitive settings. SVD-based post-training compression offers a hardware-agnostic way to reduce model size and improve inference efficiency through low-rank factorization. However, existing methods often rely on input-only whitening spaces, homogeneous rank allocation, or loss-agnostic allocation heuristics, limiting their ability to preserve model quality under aggressive compression. We propose Input-Output Whitened SVD (IO-SVD), a post-training compression method that forms a KL-aware double-sided whitening space for model weights. Using a second-order expansion of the KL loss over the top-K token probabilities, IO-SVD constructs an output-side metric that captures predictive sensitivity, while input whitening captures activation statistics. We further introduce an efficient heterogeneous rank-allocation strategy that scores whitened singular components using first-order calibration loss estimates and prunes the least sensitive components under a global budget. Inspired by prior work that combines SVD truncation with quantization, we improve hybrid SVD-quantization compression through loss-aware remapping, which selects low-rank factor rows for 8-bit quantization based on the predicted loss change incurred by quantizing them. Extensive experiments across diverse LLM and VLM families, together with inference-time analysis, show that IO-SVD compresses LLMs with minimal performance degradation while delivering practical inference speedups. Code is available at https://github.com/mint-vu/IO-SVD.git

Summary

Large language models deliver strong performance across language and reasoning tasks, but their storage and compute costs remain major barriers to deployment in resource-constrained and latency-sensitive settings.
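
The size reduction from low-rank factorization mentioned in the abstract comes from simple parameter counting: a dense m x n weight matrix stores m*n values, while rank-r factors store r*(m+n). The sketch below works through that arithmetic only. It does not implement IO-SVD's whitening or rank allocation, and the helper names and the 4096 x 4096 layer shape are assumptions chosen for illustration.

```python
def dense_params(m, n):
    """Parameters stored by a dense m x n weight matrix."""
    return m * n


def low_rank_params(m, n, r):
    """Parameters stored by rank-r factors of shapes (m x r) and (r x n)."""
    return r * (m + n)


def max_rank_for_ratio(m, n, keep_ratio):
    """Largest rank whose factors use at most keep_ratio of the dense count."""
    return int(keep_ratio * m * n // (m + n))


# Hypothetical layer shape, not taken from the paper.
m, n = 4096, 4096
rank = max_rank_for_ratio(m, n, keep_ratio=0.5)
kept_fraction = low_rank_params(m, n, rank) / dense_params(m, n)
```

For a square layer, keeping half of the parameters leaves only a quarter of the full rank, which is one reason the choice of which components to keep matters under aggressive compression.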

Position: Zeroth-Order Optimization in Deep Learning Is Underexplored, Not Underpowered

Authors: Sijia Liu, Yicheng Lang, Soumyadeep Pal, Changsheng Wang, Yancheng Huang, Chongyu Fan, James Diffenderfer, Bhavya Kailkhura, Yihua Zhang

First: 2026-05-15T05:11:43+00:00 · Latest: 2026-05-15T05:11:43+00:00

Abs · PDF · Code1 · Code2

Abstract

Zeroth-order (ZO) optimization, learning from finite differences of function evaluations without backpropagation, has recently regained attention in deep learning due to its memory efficiency and applicability to gray- or black-box pipelines. Yet, ZO methods are often dismissed as fundamentally unscalable because of estimator variance and unfavorable query complexity. We argue that this conclusion might be misguided: ZO optimization is underexplored, not underpowered. We show that many perceived limitations stem from myopic development practices, most notably full-space, element-wise, estimator-centric designs. We articulate six positions spanning the algorithmic, systems, and evaluation stack. First, we revisit the feasibility boundaries of estimator-centric ZO methods through variance control, variance-query tradeoffs, and directional-derivative lenses. Then, we identify three underexplored opportunities: (i) subspace and spectral views of ZO that enable interpretable variance reduction with graceful query scaling, (ii) the forward-only nature of ZO as a systems advantage for communication-efficient, pipeline-friendly, and resource-constrained training, and (iii) the need to de-obfuscate ZO evaluations from task complexity. We strongly advocate rethinking ZO optimization around its unique strengths and acting accordingly, opening a viable path toward large-scale, system-aware, and resource-efficient learning with ZO optimization.

Summary

Zeroth-order (ZO) optimization, learning from finite differences of function evaluations without backpropagation, has recently regained attention in deep learning due to its memory efficiency and applicability to gray- or black-box pipelines.
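
As a rough illustration of "learning from finite differences of function evaluations without backpropagation", the sketch below implements a generic two-point random-direction gradient estimator in plain Python. It is not taken from the paper. The names `zo_gradient` and `zo_descent`, the smoothing radius `mu`, the query count, the learning rate, and the quadratic test objective are all assumptions made for illustration.

```python
import random


def zo_gradient(f, x, mu=1e-3, num_queries=20, rng=random):
    """Estimate the gradient of f at x using function values only.

    Averages central finite differences along random Gaussian directions
    (a standard randomized estimator; hypothetical helper name).
    """
    dim = len(x)
    grad = [0.0] * dim
    for _ in range(num_queries):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        # Finite-difference estimate of the directional derivative along u.
        slope = (f(x_plus) - f(x_minus)) / (2.0 * mu)
        for i in range(dim):
            grad[i] += slope * u[i] / num_queries
    return grad


def zo_descent(f, x0, lr=0.05, steps=200, seed=0):
    """Plain gradient descent driven by the zeroth-order estimate."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        g = zo_gradient(f, x, rng=rng)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


def quadratic(x):
    """Toy objective with its minimum at (1, 1, ..., 1)."""
    return sum((xi - 1.0) ** 2 for xi in x)


x_start = [0.0, 0.0, 0.0]
x_final = zo_descent(quadratic, x_start)
```

Each step costs `2 * num_queries` forward evaluations and no backward pass, which is the variance-versus-query tradeoff the abstract refers to.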

Wind-Aware Optimal Trajectory Planning for Efficient Gliding of Fixed-Wing Aerial Systems

Authors: Luca Morando, Nishanth Bobbili, Giuseppe Loianno

Venue: ICRA 2026

First: 2026-05-15T05:01:09+00:00 · Latest: 2026-05-15T05:01:09+00:00

Comments: Accepted for publication at IEEE International Conference on Robotics and Automation (ICRA 2026) held in Vienna

Abs · PDF · Code1 · Code2

Abstract

Gliding offers small fixed-wing UAVs extended endurance and silent operation but requires accurate energy management, especially under wind disturbances and obstacle constraints. Traditional controllers based on Total Energy Control Systems regulate the trade between potential and kinetic energy reactively, often requiring fine-tuning and knowledge of trim conditions. In this work, we shift the regulation to the planning level and present a nonlinear, multi-cost trajectory planner for small UAV gliders. The method generates $\mathcal{C}^3$ continuous trajectories based on Bernstein polynomials, mapped into control commands through differential flatness, and re-planned online to match experimentally derived sink polar curves. A simulated netto variometer is integrated into the optimization to estimate air mass motion, constraining the glide to energy-balanced states. Consecutive gliding trajectories are linked by cruising segments computed through trajectories initialized on Dubins path-based waypoints, enabling hybrid missions that combine powered and unpowered flight. The approach is validated in CFD simulations and real-world experiments with a fixed-wing platform, showing reliable stabilization of sink rate, airspeed, and glide ratio under wind gusts and in the presence of obstacles.

Summary

Gliding offers small fixed-wing UAVs extended endurance and silent operation but requires accurate energy management, especially under wind disturbances and obstacle constraints.
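
The abstract builds its trajectories on Bernstein polynomials. As a minimal sketch of that building block only, the code below evaluates a curve defined by control points in the Bernstein basis. The planner's cost terms, continuity constraints, and differential-flatness mapping are not reproduced, and the function name and the three control points are hypothetical.

```python
from math import comb


def bernstein_point(control_points, t):
    """Evaluate a Bernstein-polynomial (Bezier) curve at t in [0, 1]."""
    n = len(control_points) - 1
    dim = len(control_points[0])
    point = [0.0] * dim
    for i, p in enumerate(control_points):
        # i-th Bernstein basis polynomial of degree n.
        basis = comb(n, i) * (t ** i) * ((1.0 - t) ** (n - i))
        for d in range(dim):
            point[d] += basis * p[d]
    return point


# Hypothetical (x, y, altitude) control points for a straight descending glide.
control_points = [(0.0, 0.0, 100.0), (50.0, 0.0, 90.0), (100.0, 0.0, 80.0)]
start = bernstein_point(control_points, 0.0)
middle = bernstein_point(control_points, 0.5)
end = bernstein_point(control_points, 1.0)
```

The curve starts at the first control point and ends at the last, so boundary positions can be fixed directly by choosing those two points.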

Rethinking Output Alignment For 1-bit Post-Training Quantization of Large Language Models

Authors: Dung Anh Hoang, Cuong Pham, Cuong Nguyen, Trung Le, Jianfei Cai, Thanh-Toan Do

First: 2025-12-25T12:39:36+00:00 · Latest: 2026-05-15T03:51:16+00:00

Abs · PDF · Code1 · Code2

Abstract

Large Language Models (LLMs) deliver strong performance across a wide range of NLP tasks, but their massive sizes hinder deployment on resource-constrained devices. To reduce their computational and memory burden, various compression techniques have been proposed, including quantization, pruning, and knowledge distillation. Among these, post-training quantization (PTQ) is widely adopted for its efficiency, as it requires no retraining and only a small dataset for calibration, enabling low-cost deployment. Recent advances in post-training quantization have demonstrated that even methods near 4 bits can maintain most of the original model performance. However, 1-bit quantization remains particularly challenging. A common strategy in 1-bit quantization is to determine binary weights by matching full-precision parameters, following a weight-driven criterion. However, this objective is not directly aligned with the quantized model's objective, which is to preserve the model's output behavior under the impact of quantization. A natural alternative is to adopt output-driven criteria that minimize discrepancies in model outputs using calibration data. Surprisingly, naive output-driven approaches often perform even worse in the 1-bit regime. In this paper, we show that this failure arises from two fundamental issues: error accumulation across layers and, more critically, \emph{anisotropic distortion} of the representation space. Based on these insights, we propose a novel PTQ method for 1-bit LLMs that explicitly addresses these issues while maintaining computational efficiency. Extensive experiments demonstrate that our approach consistently outperforms existing 1-bit PTQ methods.

Summary

Large Language Models (LLMs) deliver strong performance across a wide range of NLP tasks, but their massive sizes hinder deployment on resource-constrained devices.
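
To make the distinction between weight-driven and output-driven criteria concrete, the sketch below binarizes a small weight vector with the common sign-and-scale rule (weights approximated by `alpha * sign(w)`), then measures both the weight error and the output error on calibration inputs. This shows the two criteria only, not the paper's method. The helper names, the weight values, and the calibration inputs are assumptions made for illustration.

```python
def binarize_weight_driven(weights):
    """Approximate weights by alpha * sign(w).

    For fixed signs, alpha = mean(|w|) minimizes the squared weight error
    sum((w - alpha * sign(w)) ** 2).
    """
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1.0 if w >= 0 else -1.0 for w in weights]
    return alpha, signs


def weight_error(weights, alpha, signs):
    """Weight-driven criterion: squared distance to the full-precision weights."""
    return sum((w - alpha * s) ** 2 for w, s in zip(weights, signs))


def output_error(weights, alpha, signs, inputs):
    """Output-driven criterion: squared output gap on calibration inputs."""
    total = 0.0
    for x in inputs:
        full = sum(w * xi for w, xi in zip(weights, x))
        quant = sum(alpha * s * xi for s, xi in zip(signs, x))
        total += (full - quant) ** 2
    return total


# Hypothetical weights and calibration inputs.
weights = [0.5, -1.5, 0.25, -0.75]
calibration = [[1.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 1.0]]

alpha, signs = binarize_weight_driven(weights)
err_w = weight_error(weights, alpha, signs)
err_out = output_error(weights, alpha, signs, calibration)
```

The scale that is optimal for `weight_error` is generally not the one that minimizes `output_error`, because the latter depends on which inputs the layer actually sees.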

BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models

Authors: Junyu Chen, Jungang Li, Jing Xiong, Wenjie Wang, Qingyao Yang, He Xiao, Zhen Li, Taiqiang Wu, Mengzhao Chen, Zhen Peng, Chaofan Tao, Long Shi, Hongxia Yang, Ngai Wong

First: 2026-02-04T02:54:37+00:00 · Latest: 2026-05-15T03:25:05+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Large language model inference is often bounded by memory footprint and bandwidth in resource-constrained deployments, making quantization fundamental to efficient serving. While post-training quantization (PTQ) maintains high fidelity at 4-bit, it deteriorates at 2-3 bits. In essence, existing methods enforce a shape-invariant quantization grid (e.g., the fixed uniform intervals of UINT2) for each group, severely restricting the feasible set for error minimization. To address this, we propose Bit-Plane Decomposition Quantization (BPDQ), which constructs a variable quantization grid via bit-planes and scalar coefficients, and iteratively refines them using second-order information while progressively compensating for quantization errors to minimize output discrepancy. In the 2-bit regime, BPDQ enables serving Qwen2.5-72B on a single RTX 3090 with 83.85\% GSM8K accuracy (vs. 90.83\% at 16-bit). Moreover, we theoretically show that the variable grid expands the feasible set, and that the quantization process consistently aligns with the optimization objective in Hessian-induced geometry. The code is available at https://github.com/KingdalfGoodman/BPDQ.

Summary

Large language model inference is often bounded by memory footprint and bandwidth in resource-constrained deployments, making quantization fundamental to efficient serving.
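
One way to picture a grid built from "bit-planes and scalar coefficients" is to write each 2-bit level as `offset + c1*b1 + c2*b2` with bits in {0, 1}. When `c2 = 2*c1` the levels are evenly spaced, as in a uniform grid, and freeing the coefficients lets the levels move. The sketch below compares the two on a toy weight group. This parameterization is an assumed reading of the abstract, not the paper's algorithm, and the coefficients are hand-picked rather than refined with second-order information.

```python
from itertools import product


def grid_levels(offset, coeffs):
    """All values offset + sum(c_i * b_i) for bits b_i in {0, 1}."""
    return sorted(
        offset + sum(c * b for c, b in zip(coeffs, bits))
        for bits in product((0, 1), repeat=len(coeffs))
    )


def quantize(weights, levels):
    """Round each weight to its nearest grid level."""
    return [min(levels, key=lambda q: abs(w - q)) for w in weights]


def sq_error(weights, quantized):
    """Sum of squared quantization errors."""
    return sum((w - q) ** 2 for w, q in zip(weights, quantized))


# Hypothetical weight group with two tight clusters.
weights = [-1.0, -0.9, 0.9, 1.0]

uniform = grid_levels(-1.0, (2 / 3, 4 / 3))   # evenly spaced levels on [-1, 1]
variable = grid_levels(-1.0, (0.1, 1.9))      # hand-picked, unevenly spaced

err_uniform = sq_error(weights, quantize(weights, uniform))
err_variable = sq_error(weights, quantize(weights, variable))
```

Both grids use the same two bits per weight. Only the placement of the four levels differs, which is the larger feasible set the abstract describes.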

KaRMA: A Kinematic Metric for Fine Manipulation Ability in Robotic Hands

Authors: Martin Peticco, Pulkit Agrawal

First: 2026-05-15T02:40:05+00:00 · Latest: 2026-05-15T02:40:05+00:00

Abs · PDF · Code1 · Code2

Abstract

Traditional robotic hand metrics focus on static properties such as workspace, manipulability, and grasp stability. However, these metrics do not directly measure dexterity under the standard definition in robotic manipulation: the ability to continuously change an object's pose within the hand while maintaining contact from an initial grasp. We introduce Kinematic Rolling Manipulation Ability (KaRMA), a kinematic-only metric for fine manipulation that quantifies reachable in-hand translation and reorientation of a spherical test object within a two-finger precision pinch through feasible rolling motions. KaRMA enforces joint limits, collision constraints, rolling contact, and antipodal force feasibility, then investigates reachable in-hand object poses via breadth-first search over translation and rotation primitives. KaRMA reports three scores: translational coverage (KaRMA-T), rotational coverage (KaRMA-R), and sensitivity to the initial grasp (KaRMA-S). We evaluate KaRMA on 16 widely used robotic hands and compare against static baselines, showing that KaRMA separates hands that rank identically under static proxies, reveals translation-rotation tradeoffs invisible to existing baselines, and is qualitatively consistent with selected published task benchmarks where Jacobian-based metrics can be misleading.

Summary

Traditional robotic hand metrics focus on static properties such as workspace, manipulability, and grasp stability.
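
The reachability step described in the abstract, a breadth-first search over translation and rotation primitives subject to feasibility checks, can be sketched on a toy discretized state space. The state encoding, the primitives, and the `toy_feasible` predicate below are placeholders invented for illustration. The actual metric checks joint limits, collisions, rolling contact, and antipodal force feasibility, none of which are modeled here.

```python
from collections import deque


def reachable_states(start, primitives, is_feasible):
    """Breadth-first search over states reachable through feasible primitives."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for step in primitives:
            nxt = tuple(s + d for s, d in zip(state, step))
            if nxt not in seen and is_feasible(nxt):
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Toy state: (translation index, rotation index) on an integer grid.
primitives = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def toy_feasible(state):
    """Placeholder constraint: larger translations leave less room to rotate."""
    translation, rotation = state
    return abs(translation) + abs(rotation) <= 2


reached = reachable_states((0, 0), primitives, toy_feasible)
grid_size = 5 * 5  # all states with both indices in [-2, 2]
coverage = len(reached) / grid_size
```

A coverage score of this kind depends on the starting state, which is why the abstract reports sensitivity to the initial grasp as a separate score.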

SkiP: When to Skip and When to Refine for Efficient Robot Manipulation

Authors: Mingtong Dai, Guanqi Peng, Yongjie Bai, Feng Yan, Chunjie Chen, Lingbo Liu, Liang Lin, Xinyu Wu

First: 2026-05-15T02:16:34+00:00 · Latest: 2026-05-15T02:16:34+00:00

Abs · PDF · Code1 · Code2 · Project1

Abstract

Previous imitation learning policies predict future actions at every control step, whether in smooth motion phases or precise, contact-rich operation phases. This uniform treatment is wasteful: most steps in a manipulation trajectory traverse free space and carry little task-relevant information, while a small fraction of \emph{key} steps around contacts, grasps, and alignment demand dense, high-resolution prediction. We propose a novel \emph{action relabeling} mechanism: at each timestep in a skip segment, we replace the behavior cloning target with the action at the entrance of the next key segment, enabling the policy to leap over redundant steps in a single decision. The resulting \textbf{Skip Policy (SkiP)} dynamically leaps over skip segments and intensively refines actions in key segments, within a single unified network requiring no learned skip planner or hierarchical structure. To automatically partition demonstrations into key and skip segments without manual annotation, we introduce \emph{Motion Spectrum Keying} (MSK), a fast, task-agnostic procedure that detects local motion complexity from action signals. Extensive experiments across 72 simulated manipulation tasks and three real-robot tasks show that SkiP reduces executed steps by $15$--$40\%$ while matching or improving success rates across various policy backbones. Project page: \texttt{https://pgq18.github.io/SkiP-page/}.

Summary

Previous imitation learning policies predict future actions at every control step, whether in smooth motion phases or precise, contact-rich operation phases.
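
The action relabeling rule in the abstract is mechanical enough to sketch: for every timestep in a skip segment, the behavior cloning target is replaced by the action at the entrance of the next key segment. The code below applies that rule to a demonstration that has already been partitioned. The function name, the boolean mask format, and the handling of trailing skip steps are assumptions, and the key/skip partition itself (MSK in the paper) is taken as given.

```python
def relabel_skip_targets(actions, is_key):
    """Return behavior-cloning targets after action relabeling.

    actions: per-timestep actions (any objects).
    is_key:  booleans, True where the timestep belongs to a key segment.
    Skip timesteps receive the action at the first timestep of the next
    key segment. Key timesteps keep their own action. Skip timesteps with
    no later key segment also keep their own action (an assumption; the
    abstract does not specify this case).
    """
    if len(actions) != len(is_key):
        raise ValueError("actions and is_key must have the same length")
    targets = list(actions)
    next_entrance = None  # first index of the nearest key segment ahead
    for t in range(len(actions) - 1, -1, -1):
        if is_key[t]:
            # Scanning backward, the earliest index seen in a key run
            # is that segment's entrance.
            next_entrance = t
        elif next_entrance is not None:
            targets[t] = actions[next_entrance]
    return targets


# Hypothetical demonstration with two key segments: steps 2-3 and step 6.
actions = ["a0", "a1", "a2", "a3", "a4", "a5", "a6", "a7"]
is_key = [False, False, True, True, False, False, True, False]
targets = relabel_skip_targets(actions, is_key)
```

With these targets, a policy queried anywhere inside a skip segment is trained to output the action that begins the next key segment, which is what allows it to leap over the segment in one decision.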

A Hierarchical Spatiotemporal Action Tokenizer for In-Context Imitation Learning in Robotics

Authors: Fawad Javed Fateh, Ali Shah Ali, Murad Popattia, Usman Nizamani, Andrey Konin, M. Zeeshan Zia, Quoc-Huy Tran

First: 2026-04-16T16:47:08+00:00 · Latest: 2026-05-15T00:53:18+00:00

Abs · PDF · Code1 · Code2

Abstract

We present a novel hierarchical spatiotemporal action tokenizer for in-context imitation learning. We first propose a hierarchical approach, which consists of two successive levels of vector quantization. In particular, the lower level assigns input actions to fine-grained subclusters, while the higher level further maps fine-grained subclusters to clusters. Our hierarchical approach outperforms the non-hierarchical counterpart, while mainly exploiting spatial information by reconstructing input actions. Furthermore, we extend our approach by utilizing both spatial and temporal cues, forming a hierarchical spatiotemporal action tokenizer, namely HiST-AT. Specifically, our hierarchical spatiotemporal approach conducts multi-level clustering, while simultaneously recovering input actions and their associated timestamps. Finally, extensive evaluations on multiple simulation and real robotic manipulation benchmarks show that our approach establishes a new state-of-the-art performance in in-context imitation learning.

Summary

We present a novel hierarchical spatiotemporal action tokenizer for in-context imitation learning.
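
The two successive quantization levels described in the abstract can be sketched with fixed codebooks: an action is first assigned to its nearest fine-grained subcluster, and that subcluster is then mapped to its nearest cluster. In the paper the codebooks are learned and timestamps are reconstructed as well. Here the codebooks are hand-picked 2-D points, and the function names are hypothetical.

```python
def nearest(vec, codebook):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((v - c) ** 2 for v, c in zip(vec, codebook[k])),
    )


def tokenize(action, fine_codebook, coarse_codebook):
    """Two successive quantization levels: action -> subcluster -> cluster."""
    sub = nearest(action, fine_codebook)
    cluster = nearest(fine_codebook[sub], coarse_codebook)
    return cluster, sub


# Hypothetical codebooks, chosen by hand for illustration (normally learned).
fine_codebook = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (0.9, 1.2)]
coarse_codebook = [(0.1, 0.05), (0.95, 1.1)]

token = tokenize((0.85, 1.15), fine_codebook, coarse_codebook)
```

The resulting pair gives a coarse token shared by similar actions plus a fine token that distinguishes them within the cluster.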

Weight Concentration Regularization for Improving Pruning Robustness Under High Sparsity

Authors: Vincent-Daniel Yun, Junhyuk Jo, Sunwoo Lee

First: 2025-11-18T09:18:26+00:00 · Latest: 2026-05-15T00:13:52+00:00

Abs · PDF · Code1 · Code2

Abstract

Deep neural networks achieve outstanding performance across vision and language tasks, yet their large parameter counts limit deployment in resource-constrained settings. One-shot pruning reduces model size without retraining, but models trained with standard objectives often suffer substantial accuracy drops under aggressive sparsity. Prior work mitigates this drop along two directions: regularizers such as $\ell_1$ and DeepHoyer that shape the weight distribution during training, and pruning-robust optimizers such as SAM, CrAM, and S$^2$SAM that flatten the loss landscape. However, existing regularizers either shrink all weights uniformly ($\ell_1$) or induce scale-invariant sparsity (DeepHoyer), without concentrating weight energy onto a small set of informative parameters. We propose a Weight Concentration Regularizer (WCR), a training-time regularizer that amplifies the magnitude of a small subset of parameters while driving the remainder toward zero, so that magnitude pruning predominantly removes parameters with negligible functional contribution. We provide a convergence analysis and evaluate WCR on LLM fine-tuning, image classification, and medical segmentation, demonstrating consistent improvements in pruning robustness across architectures and compatibility with existing pruning-robust optimizers.

Summary

Deep neural networks achieve outstanding performance across vision and language tasks, yet their large parameter counts limit deployment in resource-constrained settings.
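
Why concentrating weight magnitude helps one-shot magnitude pruning can be seen on a toy example. The sketch below prunes the smallest-magnitude half of two weight vectors and reports how much squared weight mass survives. It does not implement WCR itself, since the abstract does not give the regularizer's form. The helper names and both weight vectors are made up for illustration.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    num_pruned = int(sparsity * len(weights))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:num_pruned]:
        pruned[i] = 0.0
    return pruned


def retained_energy(weights, pruned):
    """Fraction of squared weight mass that survives pruning."""
    return sum(p * p for p in pruned) / sum(w * w for w in weights)


# Hypothetical weight vectors: one concentrated, one evenly spread.
concentrated = [2.0, -1.5, 0.01, -0.02, 0.0, 0.01]
spread = [0.6, -0.5, 0.55, -0.45, 0.5, 0.6]

kept_concentrated = retained_energy(concentrated, magnitude_prune(concentrated, 0.5))
kept_spread = retained_energy(spread, magnitude_prune(spread, 0.5))
```

At the same 50% sparsity the concentrated vector keeps almost all of its squared mass, while the spread vector loses a large share of it. Retained squared mass is only a crude proxy for the functional contribution the abstract refers to.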
