arXiv 论文速递

2026-02-16 03:39
Snapshot: 20260216_0339
Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment
Authors: Jacky Kwok, Xilun Zhang, Mengdi Xu, Yuejiang Liu, Azalia Mirhoseini, Chelsea Finn, Marco Pavone
First: 2026-02-12T18:59:59+00:00 · Latest: 2026-02-12T18:59:59+00:00
Abstract
The long-standing vision of general-purpose robots hinges on their ability to understand and act upon natural language instructions. Vision-Language-Action (VLA) models have made remarkable progress toward this goal, yet their generated actions can still misalign with the given instructions. In this paper, we investigate test-time verification as a means to shrink the "intention-action gap." We first characterize the test-time scaling law for embodied instruction following and demonstrate that jointly scaling the number of rephrased instructions and generated actions greatly increases test-time sample diversity, often recovering correct actions more efficiently than scaling each dimension independently. To capitalize on these scaling laws, we present CoVer, a contrastive verifier for vision-language-action alignment, and show that our architecture scales gracefully with additional computational resources and data. We then introduce "boot-time compute" and a hierarchical verification inference pipeline for VLAs. At deployment, our framework precomputes a diverse set of rephrased instructions from a Vision-Language-Model (VLM), repeatedly generates action candidates for each instruction, and then uses a verifier to select the optimal high-level prompt and low-level action chunks. Compared to scaling policy pre-training on the same data, our verification approach yields 22% gains in-distribution and 13% out-of-distribution on the SIMPLER benchmark, with a further 45% improvement in real-world experiments. On the PolaRiS benchmark, CoVer achieves 14% gains in task progress and 9% in success rate.
中文标题/摘要
标题:扩展验证比扩展策略学习更能有效实现视觉-语言-行动对齐
通用机器人的长期愿景依赖于它们理解并执行自然语言指令的能力。视觉-语言-行动(VLA)模型在这一目标上取得了显著进展,但它们生成的动作仍可能与给定指令不一致。在本文中,我们研究测试时验证作为缩小“意图-行动差距”的手段。我们首先刻画了具身指令跟随的测试时扩展定律,证明同时扩展重述指令的数量和生成动作的数量能大幅增加测试时样本多样性,通常比独立扩展任一维度更高效地恢复正确动作。为了利用这些扩展定律,我们提出了CoVer,一种用于视觉-语言-行动对齐的对比验证器,并展示了我们的架构能随额外计算资源和数据平滑扩展。随后,我们为VLA引入了“启动时计算”和层次化验证推理流水线。在部署时,我们的框架从视觉语言模型(VLM)预计算一组多样化的重述指令,反复为每条指令生成动作候选,然后使用验证器选择最优的高层提示和低层动作片段。与在相同数据上扩展策略预训练相比,我们的验证方法在SIMPLER基准上取得了22%的分布内提升和13%的分布外提升,并在真实世界实验中进一步提升45%。在PolaRiS基准上,CoVer在任务进度上提升14%,在成功率上提升9%。
Summary / 总结
This paper explores test-time verification as a method to improve alignment between actions and natural language instructions in vision-language-action models. It demonstrates that jointly scaling the number of rephrased instructions and generated actions increases test-time sample diversity, leading to more efficient recovery of correct actions. The proposed CoVer architecture scales gracefully with additional resources; the framework precomputes diverse rephrased instructions, generates action candidates for each, and uses a verifier to select the optimal actions. Compared to scaling policy pre-training, the verification approach shows 22% in-distribution and 13% out-of-distribution gains on the SIMPLER benchmark, with further improvements in real-world experiments. On the PolaRiS benchmark, CoVer achieves 14% gains in task progress and 9% in success rate.
本文探讨了测试时验证作为提高视觉-语言-行动模型中动作与自然语言指令对齐的方法。研究表明,同时扩大重述指令和生成动作的数量可以增加测试时样本多样性,通常比独立扩大单一维度更高效地获得正确动作。提出的CoVer架构能够随着资源的增加而平滑扩展;该框架预先计算多样化的重述指令,生成动作候选,并使用验证器选择最优动作。与扩大策略预训练相比,验证方法在SIMPLER基准上分别获得22%(分布内)和13%(分布外)的提升,并在真实世界实验中进一步改进。在PolaRiS基准上,CoVer实现了14%的任务进度提升和9%的成功率提升。
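The boot-time rephrase-then-verify loop described in this entry can be sketched as a best-of-N×M selection; `policy`, `verifier`, and `rephrase` below are hypothetical callables standing in for the VLA policy, the CoVer verifier, and the VLM rephraser, not the paper's actual APIs:

```python
def best_of_nm(policy, verifier, rephrase, obs, instruction,
               n_rephrase=4, n_actions=8):
    # Boot-time step: precompute a diverse set of rephrased instructions.
    prompts = [instruction] + [rephrase(instruction) for _ in range(n_rephrase - 1)]
    # Sample several action candidates per prompt, then score every pair.
    candidates = [(p, policy(obs, p)) for p in prompts for _ in range(n_actions)]
    # The verifier selects the best (prompt, action-chunk) pair.
    return max(candidates, key=lambda pa: verifier(obs, pa[0], pa[1]))
```

Jointly enumerating prompts and actions is what gives the N×M candidate pool the abstract argues is more sample-efficient than scaling either dimension alone.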
CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning
Authors: Denis Makhov, Dmitriy Shopkhoev, Magauiya Zhussip, Ammar Ali, Stamatios Lefkimmiatis
First: 2025-09-26T08:55:09+00:00 · Latest: 2026-02-12T17:32:59+00:00
Abstract
Post-training compression of large language models (LLMs) often relies on low-rank weight approximations that represent each column of the weight matrix in a shared low-dimensional subspace. This strategy is computationally efficient but the underlying constraint can be overly rigid for heterogeneous projection weights and may incur avoidable accuracy loss. We propose CoSpaDi (Compression via Sparse Dictionary Learning), a training-free framework that replaces low-rank factorization with a structured sparse decomposition in which each weight matrix is represented as a dense dictionary multiplied by a column-sparse coefficient matrix. This yields a union-of-subspaces model: the columns of the weight matrix are represented as linear combinations of different subsets of dictionary atoms, improving expressiveness at a fixed parameter budget. CoSpaDi is calibration-guided: using a small calibration set, we optimize the factorization to minimize functional reconstruction error of layer outputs rather than weight-space error. An activation-derived Gram orthonormalization reformulates this data-aware objective into a standard dictionary learning problem on transformed weights, and we support both per-layer compression and cross-layer dictionary sharing within groups of similar projections. Across Llama and Qwen model families, CoSpaDi consistently improves the accuracy-compression and perplexity-compression trade-offs over state-of-the-art SVD-based baselines and strong structured pruning baselines at 20-40% compression ratios. The resulting structured sparsity enables sparse-dense computation and integrates with post-training quantization of the sparse coefficients.
中文标题/摘要
标题:CoSpaDi: 通过校准引导的稀疏字典学习压缩大型语言模型
大型语言模型(LLMs)的后训练压缩通常依赖于低秩权重近似,将权重矩阵的每一列表示在共享的低维子空间中。这种策略计算效率高,但其背后的约束对于异构投影权重来说可能过于僵硬,可能会导致不必要的准确度损失。我们提出了CoSpaDi(通过稀疏字典学习压缩),这是一种无需训练的框架,用结构化稀疏分解替代低秩分解,其中每个权重矩阵表示为一个稠密字典乘以一个列稀疏系数矩阵。这产生了一种子空间并集模型:权重矩阵的列表示为不同字典原子子集的线性组合,从而在固定参数预算下提高表达能力。CoSpaDi 是校准引导的:使用一个小的校准集,我们优化分解以最小化层输出的功能重建误差,而不是权重空间误差。激活衍生的格拉姆正交化将这种数据感知目标重新表述为变换后权重上的标准字典学习问题,并支持逐层压缩以及在相似投影组内的跨层字典共享。在Llama和Qwen模型家族中,CoSpaDi 在20-40%的压缩比下,相对于基于SVD的先进基线和强大的结构化剪枝基线,始终能够改善准确度-压缩和困惑度-压缩的权衡。这种结构稀疏性使得稀疏-稠密计算成为可能,并可与稀疏系数的后训练量化相结合。
Summary / 总结
CoSpaDi is a training-free compression framework for large language models that uses a structured sparse decomposition to replace low-rank factorization, improving expressiveness while maintaining accuracy. It optimizes the factorization using a calibration set to minimize functional reconstruction error, leading to better performance than SVD-based and structured pruning methods at 20-40% compression ratios. The structured sparsity allows for efficient sparse-dense computation and integrates well with post-training quantization.
CoSpaDi 是一种无需训练的大型语言模型压缩框架,使用结构化稀疏分解替代低秩分解,在固定参数预算下提高表达能力。它使用校准集优化因子分解,以最小化层输出的功能重构误差,并支持逐层压缩和跨层字典共享。在 Llama 和 Qwen 模型上,CoSpaDi 在 20-40% 压缩比下优于最先进的基于SVD的基线和结构化剪枝基线,改善了准确率-压缩和困惑度-压缩的权衡。
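A toy illustration of the union-of-subspaces idea (W ≈ D·C with a column-sparse C): each column of W is rebuilt from its own small subset of dictionary atoms. This minimal least-squares-plus-hard-thresholding sketch is an assumption for illustration only, not the paper's calibration-guided dictionary learning:

```python
import numpy as np

def sparse_factorize(W, n_atoms=8, k=3, seed=0):
    """Approximate W (m x n) as D @ C with at most k active atoms per column."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((W.shape[0], n_atoms))     # random dictionary (toy)
    C = np.linalg.lstsq(D, W, rcond=None)[0]           # dense codes
    # Keep only the k largest-magnitude atoms per column of W.
    idx = np.argsort(np.abs(C), axis=0)[:-k, :]
    np.put_along_axis(C, idx, 0.0, axis=0)
    # Refit the surviving coefficients per column for a tighter fit.
    for j in range(W.shape[1]):
        s = np.nonzero(C[:, j])[0]
        C[s, j] = np.linalg.lstsq(D[:, s], W[:, j], rcond=None)[0]
    return D, C
```

Because each column picks its own support `s`, different columns live in different subspaces, unlike a shared low-rank factorization.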
LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs
Authors: Yujun Zhou, Jingdong Yang, Yue Huang, Kehan Guo, Zoe Emory, Bikram Ghosh, Amita Bedar, Sujay Shekar, Zhenwen Liang, Pin-Yu Chen, Tian Gao, Werner Geyer, Nuno Moniz, Nitesh V Chawla, Xiangliang Zhang
Venue: Nat Mach Intell 8, 20-31 (2026)
First: 2024-10-18T05:21:05+00:00 · Latest: 2026-02-12T17:29:23+00:00
Comments: Published at Nature Machine Intelligence
Abstract
Artificial Intelligence (AI) is revolutionizing scientific research, yet its growing integration into laboratory environments presents critical safety challenges. Large language models (LLMs) and vision language models (VLMs) now assist in experiment design and procedural guidance, yet their "illusion of understanding" may lead researchers to overtrust unsafe outputs. Here we show that current models remain far from meeting the reliability needed for safe laboratory operation. We introduce LabSafety Bench, a comprehensive benchmark that evaluates models on hazard identification, risk assessment, and consequence prediction across 765 multiple-choice questions and 404 realistic lab scenarios, encompassing 3,128 open-ended tasks. Evaluations on 19 advanced LLMs and VLMs show that no model evaluated on hazard identification surpasses 70% accuracy. While proprietary models perform well on structured assessments, they do not show a clear advantage in open-ended reasoning. These results underscore the urgent need for specialized safety evaluation frameworks before deploying AI systems in real laboratory settings.
中文标题/摘要
标题:LabSafety Bench:在科学实验室安全问题上对大型语言模型进行基准测试
人工智能(AI)正在革新科学研究,但其在实验室环境中的日益集成带来了关键的安全挑战。大型语言模型(LLMs)和视觉语言模型(VLMs)现在协助实验设计和程序指导,但它们的“理解错觉”可能导致研究人员过度信任不安全的输出。我们展示了当前模型远未达到实验室安全操作所需的可靠性。我们引入了LabSafety Bench,这是一个全面的基准测试,评估模型在危害识别、风险评估和后果预测方面的表现,涵盖765个多项选择题和404个现实实验室场景,共计3,128个开放式任务。对19个先进LLMs和VLMs的评估显示,没有模型在危害识别上的准确率超过70%。虽然专有模型在结构化评估中表现良好,但在开放式推理方面并没有明显优势。这些结果强调了在实际实验室环境中部署AI系统之前,迫切需要专门的安全评估框架。
Summary / 总结
The research aims to address the safety challenges posed by the integration of AI in scientific laboratories. It introduces LabSafety Bench, a benchmark that tests models on hazard identification, risk assessment, and consequence prediction. Evaluations on 19 advanced LLMs and VLMs reveal that no model achieves more than 70% accuracy in hazard identification, highlighting the need for specialized safety evaluation frameworks before deploying AI in real laboratory settings.
研究旨在解决AI在科学实验室中的安全挑战。引入了LabSafety Bench基准,评估模型在危害识别、风险评估和后果预测方面的表现。对19种先进LLM和VLM的评估显示,没有模型在危害识别上的准确率超过70%,强调了在实际实验室环境中部署AI系统前需要专门的安全评估框架的迫切性。
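Benchmarks of this kind ultimately reduce to simple scoring over the multiple-choice portion; a minimal accuracy function is sketched below (the letter-level normalization is an assumption, not LabSafety Bench's exact protocol):

```python
def mc_accuracy(predictions, answers):
    """Score multiple-choice predictions by first-letter exact match."""
    norm = lambda s: s.strip().upper()[:1]   # "b)" -> "B" (assumed normalization)
    correct = sum(norm(p) == norm(a) for p, a in zip(predictions, answers))
    return correct / len(answers)
```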
Chatting with Images for Introspective Visual Thinking
Authors: Junfei Wu, Jian Guan, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, Tieniu Tan
First: 2026-02-11T17:42:37+00:00 · Latest: 2026-02-12T16:49:33+00:00
Abstract
Current large vision-language models (LVLMs) typically rely on text-only reasoning based on a single-pass visual encoding, which often leads to loss of fine-grained visual information. The recent proposal of "thinking with images" attempts to alleviate this limitation by manipulating images via external tools or code; however, the resulting visual states are often insufficiently grounded in linguistic semantics, impairing effective cross-modal alignment, particularly when visual semantics or geometric relationships must be reasoned over across distant regions or multiple images. To address these challenges, we propose "chatting with images", a new framework that reframes visual manipulation as language-guided feature modulation. Under the guidance of expressive language prompts, the model dynamically performs joint re-encoding over multiple image regions, enabling tighter coupling between linguistic reasoning and visual state updates. We instantiate this paradigm in ViLaVT, a novel LVLM equipped with a dynamic vision encoder explicitly designed for such interactive visual reasoning, and train it with a two-stage curriculum combining supervised fine-tuning and reinforcement learning to promote effective reasoning behaviors. Extensive experiments across eight benchmarks demonstrate that ViLaVT achieves strong and consistent improvements, with particularly pronounced gains on complex multi-image and video-based spatial reasoning tasks.
中文标题/摘要
标题:基于图像对话的内省视觉思考
当前的大型视觉-语言模型(LVLMs)通常依赖基于单次视觉编码的纯文本推理,这往往会导致细粒度视觉信息的丢失。最近提出的“通过图像思考”试图通过外部工具或代码操作图像来缓解这一限制;然而,由此产生的视觉状态往往缺乏语言语义的充分支撑,影响了跨模态对齐的有效性——特别是在需要跨远距离区域或多幅图像推理视觉语义或几何关系时。为了解决这些挑战,我们提出了“与图像对话”这一新框架,将视觉操作重新构想为语言引导的特征调制。在表达性语言提示的指导下,模型动态地对多个图像区域进行联合重新编码,从而加强语言推理与视觉状态更新之间的耦合。我们通过ViLaVT这一新型LVLM实例化了这一范式,ViLaVT配备了专为此类交互式视觉推理设计的动态视觉编码器,并通过结合监督微调和强化学习的两阶段课程训练,促进有效的推理行为。在八个基准测试上的大量实验表明,ViLaVT取得了显著且一致的改进,在复杂的多图像和基于视频的空间推理任务上提升尤为明显。
Summary / 总结
This paper addresses the limitations of current large vision-language models (LVLMs) that rely on text-only reasoning and proposes a new framework called 'chatting with images'. This framework reframes visual manipulation as language-guided feature modulation, enabling tighter coupling between linguistic reasoning and visual state updates. The resulting model, ViLaVT, is trained with a two-stage curriculum combining supervised fine-tuning and reinforcement learning. Experiments show that ViLaVT outperforms existing models, especially in complex multi-image and video-based spatial reasoning tasks.
论文针对当前大型视觉-语言模型(LVLM)依赖于基于单一视觉编码的文本推理,往往会丢失细粒度的视觉信息这一局限,提出了‘通过图像对话’的新框架,将视觉操作重新定义为语言引导的特征调制。开发了具有动态视觉编码器的新型LVLM ViLaVT,并通过结合监督微调和强化学习的两阶段课程训练,以增强推理能力。在八个基准测试中的实验表明,ViLaVT在复杂的多图像和基于视频的空间推理任务中表现出显著的性能提升。
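Language-guided feature modulation can be illustrated with a FiLM-style scale-and-shift, where a text-derived `gamma`/`beta` pair re-encodes visual features; this generic sketch is an assumption for illustration, not ViLaVT's actual dynamic vision encoder:

```python
import numpy as np

def film_modulate(visual_feats, gamma, beta):
    """Apply a language-conditioned per-channel scale and shift.

    visual_feats: (N, d) region features; gamma, beta: (d,) derived from text.
    """
    return gamma[None, :] * visual_feats + beta[None, :]
```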
3DGSNav: Enhancing Vision-Language Model Reasoning for Object Navigation via Active 3D Gaussian Splatting
Authors: Wancai Zheng, Hao Chen, Xianlong Lu, Linlin Ou, Xinyi Yu
First: 2026-02-12T16:41:26+00:00 · Latest: 2026-02-12T16:41:26+00:00
Abstract
Object navigation is a core capability of embodied intelligence, enabling an agent to locate target objects in unknown environments. Recent advances in vision-language models (VLMs) have facilitated zero-shot object navigation (ZSON). However, existing methods often rely on scene abstractions that convert environments into semantic maps or textual representations, causing high-level decision making to be constrained by the accuracy of low-level perception. In this work, we present 3DGSNav, a novel ZSON framework that embeds 3D Gaussian Splatting (3DGS) as persistent memory for VLMs to enhance spatial reasoning. Through active perception, 3DGSNav incrementally constructs a 3DGS representation of the environment, enabling trajectory-guided free-viewpoint rendering of frontier-aware first-person views. Moreover, we design structured visual prompts and integrate them with Chain-of-Thought (CoT) prompting to further improve VLM reasoning. During navigation, a real-time object detector filters potential targets, while VLM-driven active viewpoint switching performs target re-verification, ensuring efficient and reliable recognition. Extensive evaluations across multiple benchmarks and real-world experiments on a quadruped robot demonstrate that our method achieves robust and competitive performance against state-of-the-art approaches. Project page: https://aczheng-cai.github.io/3dgsnav.github.io/
中文标题/摘要
标题:3DGSNav:通过主动三维高斯点绘制增强视觉-语言模型的物体导航推理
物体导航是具身智能的核心能力,使代理能够在未知环境中定位目标物体。近期视觉-语言模型(VLMs)的进步促进了零样本物体导航(ZSON)。然而,现有方法往往依赖于场景抽象,将环境转换为语义地图或文本表示,导致高层决策受限于低层感知的准确性。在本文中,我们提出了3DGSNav,这是一种新颖的ZSON框架,将三维高斯点绘制(3DGS)嵌入为持久记忆,以增强VLM的空间推理。通过主动感知,3DGSNav逐步构建环境的3DGS表示,实现基于轨迹的自由视角渲染,生成前沿感知的第一人称视图。此外,我们设计了结构化视觉提示,并将其与链式思考(CoT)提示相结合,进一步提高VLM的推理能力。在导航过程中,实时物体检测器筛选潜在目标,而VLM驱动的主动视角切换执行目标再验证,确保高效可靠的识别。在多个基准测试和现实世界实验中,我们的方法在四足机器人上展示了稳健且具有竞争力的性能。项目页面:https://aczheng-cai.github.io/3dgsnav.github.io/
Summary / 总结
3DGSNav is a novel framework for zero-shot object navigation that enhances vision-language model reasoning by using 3D Gaussian Splatting (3DGS) as persistent memory. It incrementally constructs a 3DGS representation of the environment, enabling trajectory-guided rendering and improving spatial reasoning. The method integrates structured visual prompts and Chain-of-Thought prompting to further enhance reasoning. Experimental results show that 3DGSNav achieves robust and competitive performance in various benchmarks and real-world scenarios compared to state-of-the-art approaches.
3DGSNav 是一种通过主动 3D 高斯点绘技术增强视觉语言模型空间推理能力的新框架,用于零样本物体导航。它逐步构建环境的 3D 表示,实现轨迹导向的渲染,并改善 VLM 的决策能力。实验结果显示,3DGSNav 在多个基准测试和真实世界实验中表现出色,优于现有方法,展示了在物体导航任务中的稳健和可靠性能。
Self-Attention Decomposition For Training Free Diffusion Editing
Authors: Tharun Anand, Mohammad Hassan Vali, Arno Solin, Green Rosh, BH Pawan Prasad
Venue: ICASSP 2026
First: 2025-10-26T12:22:56+00:00 · Latest: 2026-02-12T16:23:33+00:00
Comments: ICASSP 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Abstract
Diffusion models achieve remarkable fidelity in image synthesis, yet precise control over their outputs for targeted editing remains challenging. A key step toward controllability is to identify interpretable directions in the model's latent representations that correspond to semantic attributes. Existing approaches for finding interpretable directions typically rely on sampling large sets of images or training auxiliary networks, which limits efficiency. We propose an analytical method that derives semantic editing directions directly from the pretrained parameters of diffusion models, requiring neither additional data nor fine-tuning. Our insight is that self-attention weight matrices encode rich structural information about the data distribution learned during training. By computing the eigenvectors of these weight matrices, we obtain robust and interpretable editing directions. Experiments demonstrate that our method produces high-quality edits across multiple datasets while reducing editing time significantly by 60% over current benchmarks.
中文标题/摘要
标题:用于免训练扩散编辑的自注意力分解
扩散模型在图像合成中实现了惊人的保真度,但对输出进行精确控制以实现目标编辑仍然具有挑战性。可控性的关键一步是识别模型潜在表示中与语义属性相对应的可解释方向。现有寻找可解释方向的方法通常依赖于采样大量图像或训练辅助网络,这限制了效率。我们提出了一种分析方法,可以直接从预训练的扩散模型参数中推导出语义编辑方向,无需额外数据或微调。我们的见解是,自注意力权重矩阵编码了模型在训练过程中学习到的数据分布的丰富结构信息。通过计算这些权重矩阵的特征向量,我们获得了稳健且可解释的编辑方向。实验表明,我们的方法在多个数据集上生成了高质量的编辑结果,同时相比当前基准将编辑时间显著减少了60%。
Summary / 总结
The research aims to improve the control over diffusion models for image synthesis by identifying interpretable directions in the model's latent representations. The method proposed involves deriving these directions directly from the pretrained parameters of diffusion models using self-attention weight matrices, without the need for additional data or fine-tuning. The experiments show that this approach produces high-quality edits across multiple datasets and reduces editing time by 60% compared to current benchmarks.
研究旨在通过识别模型潜在表示中的可解释方向,提高对扩散模型输出进行目标编辑的控制能力。提出的分析方法可以直接从预训练的扩散模型参数中提取语义编辑方向,无需额外数据或微调。该方法利用自注意力权重矩阵的特征向量来获得稳健且可解释的编辑方向,能够生成高质量的编辑结果,并将编辑时间较当前基准减少60%。
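The core analytical step, taking eigenvectors of self-attention weight matrices as candidate edit directions, can be sketched as follows; the symmetrization and the choice of which attention matrix to decompose are assumptions here, not the paper's exact recipe:

```python
import numpy as np

def edit_directions(W_attn, top=3):
    """Return the `top` eigenvectors of a symmetrized attention weight matrix."""
    S = 0.5 * (W_attn + W_attn.T)          # symmetrize so the spectrum is real
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(-np.abs(vals))      # strongest structure first
    return vecs[:, order[:top]]

def apply_edit(h, direction, alpha=1.0):
    """Move a latent h along an edit direction by strength alpha."""
    return h + alpha * direction
```

Because the directions come from pretrained parameters alone, no sampling or auxiliary training is needed, which is the efficiency argument the abstract makes.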
Kelix Technical Report
Authors: Boyang Ding, Chenglong Chu, Dunju Zang, Han Li, Jiangxia Cao, Kun Gai, Muhao Wei, Ruiming Tang, Shiyao Wang, Siyang Mao, Xinchen Luo, Yahui Liu, Zhixin Ling, Zhuoran Yang, Ziming Li, Chengru Song, Guorui Zhou, Guowang Zhang, Hao Peng, Hao Wang, Jiaxin Deng, Jin Ouyang, Jinghao Zhang, Lejian Ren, Qianqian Wang, Qigen Hu, Tao Wang, Xingmei Wang, Yiping Yang, Zixing Zhang, Ziqi Wang
First: 2026-02-10T14:48:26+00:00 · Latest: 2026-02-12T15:36:05+00:00
Comments: Work in progress
Abstract
Autoregressive large language models (LLMs) scale well by expressing diverse tasks as sequences of discrete natural-language tokens and training with next-token prediction, which unifies comprehension and generation under self-supervision. Extending this paradigm to multimodal data requires a shared, discrete representation across modalities. However, most vision-language models (VLMs) still rely on a hybrid interface: discrete text tokens paired with continuous Vision Transformer (ViT) features. Because supervision is largely text-driven, these models are often biased toward understanding and cannot fully leverage large-scale self-supervised learning on non-text data. Recent work has explored discrete visual tokenization to enable fully autoregressive multimodal modeling, showing promising progress toward unified understanding and generation. Yet existing discrete vision tokens frequently lose information due to limited code capacity, resulting in noticeably weaker understanding than continuous-feature VLMs. We present Kelix, a fully discrete autoregressive unified model that closes the understanding gap between discrete and continuous visual representations.
中文标题/摘要
标题:Kelix 技术报告
自回归大型语言模型(LLMs)通过将多样化的任务表示为离散自然语言标记序列,并通过下一个标记预测进行训练,能够很好地扩展,这将理解和生成统一在自我监督下。将这一范式扩展到多模态数据需要跨模态的共享离散表示。然而,大多数视觉语言模型(VLMs)仍然依赖于混合界面:离散文本标记配以连续的视觉变换器(ViT)特征。由于监督主要由文本驱动,这些模型往往偏向于理解,而不能充分利用大规模的非文本数据的自我监督学习。最近的工作探索了离散视觉标记化,以实现完全自回归的多模态建模,显示出统一理解和生成的有希望的进展。然而,现有的离散视觉标记经常由于编码能力有限而丢失信息,导致理解能力明显弱于连续特征的VLMs。我们提出了Kelix,一种完全离散的自回归统一模型,以弥合离散和连续视觉表示之间的理解差距。
Summary / 总结
The research aims to improve multimodal understanding by developing a fully discrete autoregressive model, Kelix, which addresses the limitations of current vision-language models that rely on hybrid interfaces. Kelix uses discrete visual tokens to enable unified understanding and generation, overcoming the information loss issue in existing discrete vision tokens. Key experimental findings show that Kelix closes the understanding gap between discrete and continuous visual representations, demonstrating promising progress toward unified multimodal modeling.
研究旨在通过将自回归语言模型扩展到处理视觉数据,提高多模态理解和生成能力。Kelix 是一个完全离散的自回归统一模型,使用跨文本和视觉的共享离散表示。关键发现表明,Kelix 弥合了离散和连续视觉表示之间的理解差距,展示了统一多模态建模的进展。
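Discrete visual tokenization of the kind discussed here can be illustrated with plain nearest-neighbor vector quantization: continuous patch features are snapped to codebook indices, which become the discrete tokens the autoregressive model consumes. This is a generic sketch, not Kelix's tokenizer:

```python
import numpy as np

def quantize(features, codebook):
    """Map (N, d) continuous features to nearest entries of a (K, d) codebook."""
    # Squared distances between every feature and every codebook atom: (N, K)
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    tokens = d2.argmin(axis=1)     # discrete token ids
    recon = codebook[tokens]       # what survives quantization (lossy)
    return tokens, recon
```

The gap the abstract describes comes from exactly this lossiness: with limited code capacity, `recon` discards information that continuous-feature VLMs would keep.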
Can Local Vision-Language Models improve Activity Recognition over Vision Transformers? -- Case Study on Newborn Resuscitation
Authors: Enrico Guerriero, Kjersti Engan, Øyvind Meinich-Bache
First: 2026-02-12T14:31:10+00:00 · Latest: 2026-02-12T14:31:10+00:00
Comments: Presented at the Satellite Workshop on Workshop 15: Generative AI for World Simulations and Communications & Celebrating 40 Years of Excellence in Education: Honoring Professor Aggelos Katsaggelos, IEEE International Conference on Image Processing (ICIP), 2025
Abstract
Accurate documentation of newborn resuscitation is essential for quality improvement and adherence to clinical guidelines, yet remains underutilized in practice. Previous work using 3D-CNNs and Vision Transformers (ViT) has shown promising results in detecting key activities from newborn resuscitation videos, but also highlighted the challenges in recognizing such fine-grained activities. This work investigates the potential of generative AI (GenAI) methods to improve activity recognition from such videos. Specifically, we explore the use of local vision-language models (VLMs), combined with large language models (LLMs), and compare them to a supervised TimeSformer baseline. Using a simulated dataset comprising 13.26 hours of newborn resuscitation videos, we evaluate several zero-shot VLM-based strategies and fine-tuned VLMs with classification heads, including Low-Rank Adaptation (LoRA). Our results suggest that small (local) VLMs struggle with hallucinations, but when fine-tuned with LoRA they reach an F1 score of 0.91, surpassing the TimeSformer baseline at 0.70.
中文标题/摘要
标题:本地视觉语言模型能否比视觉Transformer更好地识别活动?——新生儿复苏案例研究
准确记录新生儿复苏过程对于质量改进和遵守临床指南至关重要,但在实践中却未得到充分利用。先前使用3D-CNN和视觉变换器(ViT)的工作在检测新生儿复苏视频中的关键活动方面取得了令人鼓舞的结果,但也指出了识别此类细粒度活动的挑战。本研究探讨了生成式人工智能(GenAI)方法在提高此类视频活动识别方面的潜力。具体而言,我们探索了本地视觉语言模型(VLM)与大型语言模型(LLM)结合使用的方法,并将其与监督TimeSformer基线进行比较。使用包含13.26小时新生儿复苏视频的模拟数据集,我们评估了几种零样本VLM策略和带有分类头的微调VLM,包括低秩适应(LoRA)。结果显示,小型(本地)VLM容易出现幻觉,但在使用LoRA微调后,F1分数达到0.91,超过了TimeSformer的0.70。
Summary / 总结
This study investigates the potential of local vision-language models (VLMs) to improve activity recognition in newborn resuscitation videos, comparing them to a supervised TimeSformer baseline. Using a simulated dataset of 13.26 hours of newborn resuscitation videos, the research finds that fine-tuned VLMs with LoRA achieve an F1 score of 0.91, surpassing the TimeSformer result of 0.70, despite initial struggles with hallucinations in zero-shot settings.
研究探讨了本地视觉-语言模型(VLM)在新生儿复苏视频中活动识别方面的潜力,将其与监督学习的TimeSformer基线进行比较。使用模拟数据集,研究发现经过Low-Rank Adaptation (LoRA) 微调的VLM实现了0.91的F1分数,超过了TimeSformer的0.70分数,尽管小型VLM初期存在幻觉问题。
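The LoRA recipe used for fine-tuning keeps the base weight frozen and learns only a low-rank update B·A added to it; a minimal forward-pass sketch is below (shapes are illustrative and not tied to any specific VLM):

```python
import numpy as np

def lora_forward(x, W, A, B, scale=1.0):
    """y = x @ (W + scale * B @ A).T, computed without materializing the sum.

    W: frozen (out, in) weight; A: (r, in) and B: (out, r) trainable, r small.
    """
    return x @ W.T + scale * (x @ A.T) @ B.T
```

With the common initialization B = 0, the adapted model starts exactly at the frozen baseline, and only the tiny A/B matrices are updated during fine-tuning.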
Benchmarking Vision-Language Models for French PDF-to-Markdown Conversion
Authors: Bruno Rigal, Victor Dupriez, Alexis Mignon, Ronan Le Hy, Nicolas Mery
First: 2026-02-12T13:55:43+00:00 · Latest: 2026-02-12T13:55:43+00:00
Comments: 13 pages, 6 figures
Abstract
This report evaluates PDF-to-Markdown conversion using recent Vision-Language Models (VLMs) on challenging French documents. Document parsing is a critical step for Retrieval-Augmented Generation (RAG) pipelines, where transcription and layout errors propagate to downstream retrieval and grounding. Existing benchmarks often emphasize English or Chinese and can over-penalize benign formatting and linearization choices (e.g., line breaks, list segmentation, alternative table renderings) that are largely irrelevant for downstream use. We introduce a French-focused benchmark of difficult pages selected via model-disagreement sampling from a corpus of 60,000 documents, covering handwritten forms, complex layouts, dense tables, and graphics-rich pages. Evaluation is performed with unit-test-style checks that target concrete failure modes (text presence, reading order, and local table constraints) combined with category-specific normalization designed to discount presentation-only variance. Across 15 models, we observe substantially higher robustness for the strongest proprietary models on handwriting and forms, while several open-weights systems remain competitive on standard printed layouts.
中文标题/摘要
标题:视觉-语言模型在法语PDF转Markdown转换中的基准测试
本报告评估了使用近期视觉-语言模型(VLMs)对具有挑战性的法语文档进行PDF转Markdown转换的情况。文档解析是检索增强生成(RAG)管道中的一个关键步骤,其中转录和布局错误会传播到下游检索和定位。现有基准测试通常侧重于英语或中文,并且往往会过度惩罚无害的格式和线性化选择(例如换行、列表分割、替代表格呈现),这些选择对下游使用来说大多无关紧要。我们通过模型分歧采样从60,000份文档的语料库中引入了一个以法语为重点的基准测试,涵盖了手写表单、复杂布局、密集表格和图形丰富的页面。评估使用了单元测试风格的检查,针对具体的失败模式(文本存在、阅读顺序和局部表格约束)进行,并结合了特定类别的归一化,以减少仅涉及呈现的差异。在15个模型中,我们观察到最强的专有模型在手写和表单方面表现出显著更高的鲁棒性,而几个开放权重系统在标准印刷布局方面仍然具有竞争力。
Summary / 总结
This report evaluates the performance of Vision-Language Models (VLMs) in converting French PDFs to Markdown. The study focuses on challenging French documents, addressing transcription and layout errors that affect downstream retrieval and grounding. A French-focused benchmark was created using model-disagreement sampling from 60,000 documents, including handwritten forms and complex layouts. The evaluation uses unit-test-style checks and category-specific normalization. Results show that proprietary models are more robust for handwriting and forms, while open-weights models remain competitive for standard printed layouts.
该报告评估了视觉-语言模型在将法语PDF转换为Markdown时的表现,重点关注包含手写表单、复杂布局和密集表格的挑战性文档。研究引入了一个基于模型分歧采样的法语专用基准,从60,000份文档中选取,并使用单元测试风格的检查进行评估。结果显示,专有模型在处理手写表单方面表现出更高的鲁棒性,而开放权重模型在标准印刷布局方面仍具有竞争力。
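Unit-test-style checks like those in the benchmark can be sketched as small predicates over the converted Markdown; the function names and the lower-casing normalization below are assumptions, not the benchmark's actual harness:

```python
def check_text_presence(markdown, required_phrases):
    """Return the phrases that did NOT survive conversion (case-insensitive)."""
    text = markdown.lower()
    return [p for p in required_phrases if p.lower() not in text]

def check_reading_order(markdown, ordered_phrases):
    """Pass if every phrase appears, at strictly increasing positions."""
    text, last = markdown.lower(), -1
    for p in ordered_phrases:
        i = text.find(p.lower(), last + 1)
        if i == -1:
            return False
        last = i
    return True
```

Targeting concrete failure modes this way avoids penalizing benign formatting choices such as alternative line breaks or list segmentation.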
Are Two LLMs Better Than One? A Student-Teacher Dual-Head LLMs Architecture for Pharmaceutical Content Optimization
Authors: Suyash Mishra, Qiang Li, Anubhav Girdhar
First: 2026-02-12T13:53:29+00:00 · Latest: 2026-02-12T13:53:29+00:00
Comments: Submitted to the Demo Track of Top Tier Conference; currently under peer review
Abstract
Large language models (LLMs) are increasingly used to create content in regulated domains such as pharmaceuticals, where outputs must be scientifically accurate and legally compliant. Manual quality control (QC) is slow, error prone, and can become a publication bottleneck. We introduce LRBTC, a modular LLM and vision language model (VLM) driven QC architecture covering Language, Regulatory, Brand, Technical, and Content Structure checks. LRBTC combines a Student-Teacher dual-model architecture and a human-in-the-loop (HITL) workflow with waterfall rule filtering to enable scalable, verifiable content validation and optimization. On AIReg-Bench, our approach achieves 83.0% F1 and 97.5% recall, reducing missed violations by 5x compared with Gemini 2.5 Pro. On CSpelling, it improves mean accuracy by 26.7%. Error analysis further reveals that while current models are strong at detecting misspellings (92.5% recall), they fail to identify complex medical grammatical (25.0% recall) and punctuation (41.7% recall) errors, highlighting a key area for future work. This work provides a practical, plug-and-play solution for reliable, transparent quality control of content in high-stakes, compliance-critical industries. We also provide access to our demo under the MIT License.
中文标题/摘要
标题:两个LLM比一个好吗?一种用于制药内容优化的学生-教师双头LLM架构
大型语言模型(LLM)在制药等受监管领域中越来越多地用于生成内容,这些内容必须科学准确且符合法律要求。手动质量控制(QC)过程缓慢、容易出错,可能会成为出版瓶颈。我们引入了LRBTC,这是一种模块化LLM和视觉语言模型(VLM)驱动的QC架构,涵盖语言、法规、品牌、技术以及内容结构检查。LRBTC结合了学生-教师双模型架构、人在回路(HITL)工作流和瀑布规则过滤,以实现可扩展、可验证的内容验证和优化。在AIReg-Bench上,我们的方法实现了83.0%的F1和97.5%的召回率,与Gemini 2.5 Pro相比将漏检的违规减少了5倍。在CSpelling上,它将平均准确率提高了26.7%。错误分析进一步表明,当前模型在检测拼写错误方面表现强劲(召回率92.5%),但在识别复杂医学语法(召回率25.0%)和标点符号(召回率41.7%)错误方面存在不足,这指出了未来工作的关键方向。这项工作为高风险、合规关键行业的内容提供了一种实用、即插即用的可靠透明质量控制解决方案。我们还根据MIT许可证提供了演示程序的访问权限。
Summary / 总结
The research aims to improve the quality control of pharmaceutical content using a modular LLM and VLM architecture. The method involves a Student-Teacher dual model with a human-in-the-loop workflow and waterfall rule filtering. Key findings include an 83.0% F1 score and 97.5% recall on AIReg-Bench, reducing missed violations by 5x compared to Gemini 2.5 Pro, and a 26.7% improvement in mean accuracy on CSpelling. However, current models struggle with complex medical grammatical and punctuation errors, indicating areas for future improvement.
本文介绍了LRBTC,一种涵盖语言、法规、品牌、技术和内容结构检查的学生-教师双模型LLM质量控制架构。它在AIReg-Bench上实现了83.0%的F1和97.5%的召回率,与Gemini 2.5 Pro相比将漏检的违规减少了5倍,同时在CSpelling上将平均准确率提高了26.7%。研究还指出了当前模型在检测复杂医学语法和标点错误方面的局限性,提出了未来改进的方向。
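The waterfall filtering with student-teacher escalation can be sketched as follows; all callables, the confidence threshold, and the return format are hypothetical illustrations of the described workflow, not LRBTC's API:

```python
def waterfall_qc(text, rules, student, teacher, threshold=0.8):
    """Cheap deterministic rules run first; the student model scores what
    passes; only low-confidence cases escalate to the (costlier) teacher."""
    violations = [name for name, rule in rules if rule(text)]
    if violations:  # rule hits short-circuit the LLM stages entirely
        return {"verdict": "fail", "violations": violations, "stage": "rules"}
    verdict, confidence = student(text)
    if confidence >= threshold:
        return {"verdict": verdict, "violations": [], "stage": "student"}
    return {"verdict": teacher(text), "violations": [], "stage": "teacher"}
```

The waterfall ordering is the cost-control mechanism: most content never reaches the expensive teacher model, yet every output remains traceable to the stage that decided it.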
LoGoSeg: Integrating Local and Global Features for Open-Vocabulary Semantic Segmentation
Authors: Junyang Chen, Xiangbo Lv, Zhiqiang Kou, Xingdong Sheng, Ning Xu, Yiguo Qiao
First: 2026-02-05T12:03:11+00:00 · Latest: 2026-02-12T13:43:33+00:00
Abstract
Open-vocabulary semantic segmentation (OVSS) extends traditional closed-set segmentation by enabling pixel-wise annotation for both seen and unseen categories using arbitrary textual descriptions. While existing methods leverage vision-language models (VLMs) like CLIP, their reliance on image-level pretraining often results in imprecise spatial alignment, leading to mismatched segmentations in ambiguous or cluttered scenes. Moreover, most existing approaches lack strong object priors and region-level constraints, which can lead to object hallucination or missed detections, further degrading performance. To address these challenges, we propose LoGoSeg, an efficient single-stage framework that integrates three key innovations: (i) an object existence prior that dynamically weights relevant categories through global image-text similarity, effectively reducing hallucinations; (ii) a region-aware alignment module that establishes precise region-level visual-textual correspondences; and (iii) a dual-stream fusion mechanism that optimally combines local structural information with global semantic context. Unlike prior works, LoGoSeg eliminates the need for external mask proposals, additional backbones, or extra datasets, ensuring efficiency. Extensive experiments on six benchmarks (A-847, PC-459, A-150, PC-59, PAS-20, and PAS-20b) demonstrate its competitive performance and strong generalization in open-vocabulary settings.
中文标题/摘要
标题:LoGoSeg:结合局部和全局特征的开放词汇语义分割
开放词汇语义分割(OVSS)扩展了传统的封闭集分割,通过任意文本描述对已见和未见类别进行像素级标注。现有方法利用如CLIP等视觉语言模型(VLMs),但其依赖于图像级预训练,常导致空间对齐不精确,从而在模糊或杂乱场景中产生不匹配的分割。此外,大多数现有方法缺乏强大的物体先验和区域级约束,可能导致物体幻觉或漏检,进一步降低性能。为解决这些挑战,我们提出LoGoSeg,一种高效的单阶段框架,集成了三项关键创新:(i)一种物体存在先验,通过全局图像-文本相似性动态加权相关类别,有效减少幻觉;(ii)一种区域感知对齐模块,建立精确的区域级视觉-文本对应关系;(iii)一种双流融合机制,最优结合局部结构信息与全局语义上下文。与先前工作不同,LoGoSeg 消除了对外部掩码提案、额外骨干网络或额外数据集的需求,确保高效性。在六个基准(A-847、PC-459、A-150、PC-59、PAS-20 和 PAS-20b)上的广泛实验表明,其在开放词汇设置中的性能和泛化能力具有竞争力。
Summary / 总结
LoGoSeg is an efficient single-stage framework for open-vocabulary semantic segmentation that integrates global image-text similarity, region-aware alignment, and dual-stream fusion to improve spatial alignment and reduce hallucinations. Experiments on six benchmarks show its competitive performance and strong generalization in open-vocabulary settings, addressing the limitations of existing methods that rely on image-level pretraining and lack strong object priors and region-level constraints.
LoGoSeg 是一种通过整合局部和全局特征来提升开放词汇语义分割的方法。它引入了对象存在先验、区域感知对齐模块和双流融合机制,以增强空间对齐并减少幻觉现象。在六个基准上的实验表明,LoGoSeg 在开放词汇设置中表现出色且具有较强的泛化能力。
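The object-existence prior, down-weighting categories by global image-text similarity, can be sketched as a softmax reweighting of per-pixel category scores; this is an illustrative simplification, not LoGoSeg's exact formulation:

```python
import numpy as np

def weight_by_existence_prior(pixel_logits, global_sim, temperature=1.0):
    """Reweight (H, W, C) per-pixel scores by a (C,) global-similarity prior.

    Categories with low global image-text similarity contribute less,
    suppressing hallucinated segments for absent objects.
    """
    w = np.exp(global_sim / temperature)
    w = w / w.sum()                          # softmax prior over categories
    return pixel_logits * w[None, None, :]
```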
LLM-in-Sandbox Elicits General Agentic Intelligence
Authors: Daixuan Cheng, Shaohan Huang, Yuxian Gu, Huatong Song, Guoxin Chen, Li Dong, Wayne Xin Zhao, Ji-Rong Wen, Furu Wei
First: 2026-01-22T18:57:09+00:00 · Latest: 2026-02-12T12:39:21+00:00
Comments: Project Page: https://llm-in-sandbox.github.io
Abstract
We introduce LLM-in-Sandbox, enabling LLMs to explore within a code sandbox (i.e., a virtual computer), to elicit general intelligence in non-code domains. We first demonstrate that strong LLMs, without additional training, exhibit generalization capabilities to leverage the code sandbox for non-code tasks. For example, LLMs spontaneously access external resources to acquire new knowledge, leverage the file system to handle long contexts, and execute scripts to satisfy formatting requirements. We further show that these agentic capabilities can be enhanced through LLM-in-Sandbox Reinforcement Learning (LLM-in-Sandbox-RL), which uses only non-agentic data to train models for sandbox exploration. Experiments demonstrate that LLM-in-Sandbox, in both training-free and post-trained settings, achieves robust generalization spanning mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following. Finally, we analyze LLM-in-Sandbox's efficiency from computational and system perspectives, and open-source it as a Python package to facilitate real-world deployment.
中文标题/摘要
标题:LLM-in-Sandbox 激发通用代理智能
我们介绍了 LLM-in-Sandbox,使大语言模型能够在代码沙箱(即虚拟计算机)中探索,以激发非代码领域的通用智能。我们首先证明,强大的大语言模型在无需额外训练的情况下,能够利用代码沙箱来执行非代码任务,表现出泛化能力。例如,大语言模型会自发地访问外部资源以获取新知识,利用文件系统处理长文本,并执行脚本以满足格式要求。我们进一步展示了通过仅使用非代理数据训练模型来进行沙箱探索的 LLM-in-Sandbox 强化学习(LLM-in-Sandbox-RL)可以增强这些代理能力。实验表明,无论是在免训练还是后训练设置下,LLM-in-Sandbox 都能够在数学、物理、化学、生物医学、长文本理解以及指令遵循等多个领域实现稳健的泛化。最后,我们从计算和系统角度分析了 LLM-in-Sandbox 的效率,并将其开源为 Python 包,以促进实际部署。
Summary / 总结
The research introduces LLM-in-Sandbox, which allows large language models (LLMs) to explore a code sandbox to develop general intelligence in non-code domains. The study demonstrates that strong LLMs can generalize and use the sandbox for non-code tasks, such as accessing external resources and executing scripts. LLM-in-Sandbox-RL further enhances these capabilities through reinforcement learning on non-agentic data. Experiments show robust generalization across various domains including mathematics, physics, and biomedicine. The research also analyzes the efficiency of LLM-in-Sandbox from computational and system perspectives and open-sources it as a Python package for real-world deployment.
研究引入了LLM-in-Sandbox,使大型语言模型(LLMs)能够在代码沙箱中探索,以在非代码领域发展一般智能。LLMs展示了通过访问外部资源、处理长文本和执行脚本的能力进行泛化的潜力。进一步通过仅使用非代理数据的LLM-in-Sandbox强化学习来增强这些能力。实验表明,LLM-in-Sandbox在数学、物理、化学、生物医学和指令遵循等多个领域实现了稳健的泛化。研究还从计算和系统效率方面评估了LLM-in-Sandbox,并将其作为Python包开源,以促进实际部署。
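The core sandbox step described above — the model emits code, an isolated environment executes it, and the output is fed back — can be sketched as follows. This is a minimal illustration assuming a plain subprocess-based sandbox; the real system provides a full virtual computer, and `run_in_sandbox` is a hypothetical helper name, not the package's API.

```python
import os
import subprocess
import sys
import tempfile

def run_in_sandbox(code, timeout=10):
    """Execute model-emitted Python in a separate interpreter inside a
    scratch working directory, returning (stdout, stderr)."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "step.py")
        with open(path, "w") as f:
            f.write(code)
        proc = subprocess.run(
            [sys.executable, path],
            cwd=workdir,                 # confine file-system side effects
            capture_output=True,
            text=True,
            timeout=timeout,             # bound runaway executions
        )
        return proc.stdout, proc.stderr

# one agentic step: the "model" writes a script, the sandbox runs it
out, err = run_in_sandbox("print(2 + 2)")
```

In the training-free setting, a loop around such a call simply appends `out`/`err` to the conversation and lets the model decide the next action.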
TABLET: A Large-Scale Dataset for Robust Visual Table Understanding
Authors: Iñigo Alonso, Imanol Miranda, Eneko Agirre, Mirella Lapata
First: 2025-09-25T14:14:27+00:00 · Latest: 2026-02-12T12:11:03+00:00
Abstract
While table understanding increasingly relies on pixel-only settings, current benchmarks predominantly use synthetic renderings that lack the complexity and visual diversity of real-world tables. Additionally, existing visual table understanding (VTU) datasets offer fixed examples with single visualizations and pre-defined instructions, providing no access to underlying serialized data for reformulation. We introduce TABLET, a large-scale VTU dataset with 4 million examples across 21 tasks, grounded in 2 million unique tables where 88% preserve original visualizations. To evaluate whether models are able to jointly reason over tabular and visual content, we also introduce VisualTableQA, a benchmark requiring both visual perception and table understanding. Fine-tuning vision-language models like Qwen2.5-VL-7B and Gemma 3-4B on TABLET improves performance on seen and unseen VTU tasks while increasing robustness on real-world table visualizations. By preserving original visualizations and maintaining example traceability in a unified large-scale collection, TABLET establishes a foundation for robust training and extensible evaluation of future VTU models.
中文标题/摘要
标题:TABLET:大规模视觉表格理解数据集
尽管表格理解越来越多地依赖于像素级设置,但当前的基准测试主要使用缺乏现实世界表格复杂性和视觉多样性的合成渲染。此外,现有的视觉表格理解(VTU)数据集提供固定示例和单一可视化,并预定义指令,无法访问底层序列化数据进行重新表述。我们引入了TABLET,一个包含21个任务的400万示例的大规模VTU数据集,基于200万张独特表格,其中88%保留了原始可视化。为了评估模型是否能够联合推理表格和视觉内容,我们还引入了VisualTableQA,一个需要视觉感知和表格理解的基准测试。在TABLET上微调如Qwen2.5-VL-7B和Gemma 3-4B等视觉语言模型,可以提高已见和未见VTU任务的性能,同时增强对现实世界表格可视化的鲁棒性。通过保留原始可视化并保持示例可追溯性,TABLET为未来VTU模型的稳健训练和扩展评估奠定了基础。
Summary / 总结
The research aims to address the limitations of current VTU benchmarks by introducing TABLET, a large-scale dataset with 4 million examples across 21 tasks, grounded in 2 million unique tables. TABLET improves robustness in visual table understanding by preserving original visualizations and providing access to underlying data. Fine-tuning vision-language models on TABLET enhances performance on both seen and unseen VTU tasks and increases robustness on real-world table visualizations.
研究引入了TABLET,一个包含400万示例、覆盖21个任务的大规模视觉表格理解数据集,基于200万张独特表格。该研究旨在解决当前基准的局限性,提供现实世界的复杂性和视觉多样性。TABLET还包括VisualTableQA,一个需要视觉感知和表格理解的基准。在TABLET上微调视觉语言模型可以提高对已见和未见任务的性能,并增强对真实世界表格视觉化的鲁棒性。
Free Lunch for Stabilizing Rectified Flow Inversion
Authors: Chenru Wang, Beier Zhu, Chi Zhang
First: 2026-02-12T11:42:36+00:00 · Latest: 2026-02-12T11:42:36+00:00
Abstract
Rectified-Flow (RF)-based generative models have recently emerged as strong alternatives to traditional diffusion models, demonstrating state-of-the-art performance across various tasks. By learning a continuous velocity field that transforms simple noise into complex data, RF-based models not only enable high-quality generation, but also support training-free inversion, which facilitates downstream tasks such as reconstruction and editing. However, existing inversion methods, such as vanilla RF-based inversion, suffer from approximation errors that accumulate across timesteps, leading to unstable velocity fields and degraded reconstruction and editing quality. To address this challenge, we propose Proximal-Mean Inversion (PMI), a training-free gradient correction method that stabilizes the velocity field by guiding it toward a running average of past velocities, constrained within a theoretically derived spherical Gaussian. Furthermore, we introduce mimic-CFG, a lightweight velocity correction scheme for editing tasks, which interpolates between the current velocity and its projection onto the historical average, balancing editing effectiveness and structural consistency. Extensive experiments on PIE-Bench demonstrate that our methods significantly improve inversion stability, image reconstruction quality, and editing fidelity, while reducing the required number of neural function evaluations. Our approach achieves state-of-the-art performance on the PIE-Bench with enhanced efficiency and theoretical soundness.
中文标题/摘要
标题:稳定校正流反转的免费午餐
基于校正流(RF)的生成模型最近已成为传统扩散模型的强大替代方案,展示了在各种任务中的最佳性能。通过学习一个连续的速度场,将简单的噪声转换为复杂的数据,基于RF的模型不仅能够实现高质量的生成,还支持无需训练的反转,这有助于下游任务如重建和编辑。然而,现有的反转方法,如传统的RF反转,会遭受随时间累积的近似误差,导致不稳定的速度场和重建和编辑质量下降。为了解决这一挑战,我们提出了邻近均值反转(PMI),这是一种无需训练的梯度校正方法,通过引导速度场向过去速度的运行平均值趋近,约束在理论上推导出的球形高斯内,从而稳定速度场。此外,我们还引入了模仿-CFG,这是一种轻量级的速度校正方案,用于编辑任务,它在当前速度和其在历史平均值上的投影之间进行插值,平衡编辑效果和结构一致性。在PIE-Bench上的大量实验表明,我们的方法显著提高了反转稳定性、图像重建质量和编辑保真度,同时减少了所需的神经函数评估次数。我们的方法在PIE-Bench上实现了最先进的性能,具有增强的效率和理论严谨性。
Summary / 总结
This paper addresses the issue of instability in velocity fields during the inversion process of Rectified-Flow (RF) models, which can degrade reconstruction and editing quality. To tackle this, the authors propose Proximal-Mean Inversion (PMI), a training-free method that guides the velocity field towards a running average of past velocities, constrained within a theoretically derived spherical Gaussian. Additionally, they introduce mimic-CFG, a lightweight correction scheme for editing tasks that balances editing effectiveness and structural consistency. Experiments on PIE-Bench show that these methods significantly improve inversion stability, image reconstruction quality, and editing fidelity, while reducing the number of neural function evaluations.
论文解决了基于Rectified-Flow (RF)的生成模型中不稳定的速度场问题,这会降低图像重建和编辑的质量。为此,作者提出了Proximal-Mean Inversion (PMI)方法,这是一种无需训练的梯度校正方法,用于稳定速度场。此外,他们还引入了mimic-CFG,这是一种轻量级的编辑任务校正方案,能够在编辑效果和结构一致性之间取得平衡。实验表明,这些方法显著提高了反演稳定性、图像质量和编辑保真度,同时减少了神经函数评估的数量,实现了在PIE-Bench上的最先进性能。
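The stabilization idea above — guiding the current velocity toward a running average of past velocities, with a norm bound standing in for the spherical Gaussian constraint — can be sketched as follows. The function name, mixing coefficient, and ball radius are hypothetical; this is not the paper's exact update rule.

```python
import numpy as np

def pmi_correct(v_t, history, strength=0.5, max_radius=None):
    """Pull the current velocity estimate toward the running mean of past
    velocities, then optionally retract it into a ball of bounded radius."""
    if history:
        v_bar = np.mean(history, axis=0)          # running average of past velocities
        v_t = (1 - strength) * v_t + strength * v_bar
    if max_radius is not None:
        norm = np.linalg.norm(v_t)
        if norm > max_radius:                      # keep the correction near the prior
            v_t = v_t * (max_radius / norm)
    return v_t

# toy inversion loop: noisy per-step velocities, smoothed by the running mean
rng = np.random.default_rng(0)
history = []
for step in range(10):
    raw = np.ones(4) + 0.5 * rng.standard_normal(4)   # noisy velocity estimate
    v = pmi_correct(raw, history, strength=0.5, max_radius=3.0)
    history.append(v)
```

The proximal flavor comes from the convex mix between the raw estimate and the historical mean, which damps the accumulation of per-step approximation error.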
JEPA-VLA: Video Predictive Embedding is Needed for VLA Models
Authors: Shangchen Miao, Ningya Feng, Jialong Wu, Ye Lin, Xu He, Dong Li, Mingsheng Long
First: 2026-02-12T11:20:43+00:00 · Latest: 2026-02-12T11:20:43+00:00
Abstract
Recent vision-language-action (VLA) models built upon pretrained vision-language models (VLMs) have achieved significant improvements in robotic manipulation. However, current VLAs still suffer from low sample efficiency and limited generalization. This paper argues that these limitations are closely tied to an overlooked component, pretrained visual representation, which offers insufficient knowledge on both aspects of environment understanding and policy prior. Through an in-depth analysis, we find that commonly used visual representations in VLAs, whether pretrained via language-image contrastive learning or image-based self-supervised learning, remain inadequate at capturing crucial, task-relevant environment information and at inducing effective policy priors, i.e., anticipatory knowledge of how the environment evolves under successful task execution. In contrast, we discover that predictive embeddings pretrained on videos, in particular V-JEPA 2, are adept at flexibly discarding unpredictable environment factors and encoding task-relevant temporal dynamics, thereby effectively compensating for key shortcomings of existing visual representations in VLAs. Building on these observations, we introduce JEPA-VLA, a simple yet effective approach that adaptively integrates predictive embeddings into existing VLAs. Our experiments demonstrate that JEPA-VLA yields substantial performance gains across a range of benchmarks, including LIBERO, LIBERO-plus, RoboTwin2.0, and real-robot tasks.
中文标题/摘要
标题:JEPA-VLA:视频预测嵌入对于VLA模型是必要的
基于预训练视觉-语言模型(VLMs)的近期视觉-语言-动作(VLA)模型在机器人操作方面取得了显著进步。然而,当前的VLA模型仍然存在样本效率低和泛化能力有限的问题。本文认为这些限制与被忽视的组件——预训练视觉表示密切相关,后者在环境理解和策略先验知识方面提供的知识不足。通过深入分析,我们发现VLA中常用的视觉表示,无论是通过语言-图像对比学习还是基于图像的自我监督学习预训练,仍然无法充分捕捉关键的任务相关信息,也无法诱导有效的策略先验,即在成功执行任务时环境如何演变的预见性知识。相比之下,我们发现基于视频的预训练预测嵌入,特别是V-JEPA 2,能够灵活地忽略不可预测的环境因素,并编码任务相关的时间动态,从而有效弥补现有VLA模型中视觉表示的关键不足。基于这些观察,我们提出了JEPA-VLA,这是一种简单而有效的方法,能够将预测嵌入自适应地整合到现有的VLA模型中。我们的实验表明,JEPA-VLA在包括LIBERO、LIBERO-plus、RoboTwin2.0和真实机器人任务在内的多种基准测试中取得了显著的性能提升。
Summary / 总结
This paper addresses the limitations of current vision-language-action (VLA) models, particularly their low sample efficiency and limited generalization, by highlighting the inadequacy of pretrained visual representations. The authors propose JEPA-VLA, which integrates predictive embeddings pretrained on videos to improve the models' ability to capture task-relevant temporal dynamics and induce effective policy priors. Experiments show that JEPA-VLA significantly outperforms existing VLA models across various benchmarks and real-robot tasks.
本文针对视觉-语言-行动(VLA)模型在样本效率和泛化能力方面的不足,指出这些问题是由于预训练视觉表示不足引起的。作者提出了JEPA-VLA方法,通过将基于视频预训练的预测嵌入整合到现有模型中,提高模型捕捉任务相关的时间动态并丢弃不可预测的环境因素的能力。实验表明,JEPA-VLA在各种基准测试和真实机器人任务中显著优于现有模型。
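One way to "adaptively integrate" a predictive video embedding into an existing VLA feature stream is a learned sigmoid gate over the two inputs. The sketch below is a hypothetical fusion scheme under that assumption; the paper's actual integration may differ, and all weight names are illustrative.

```python
import numpy as np

def gated_fuse(vlm_feat, jepa_feat, W_p, W_g):
    """Mix a VLM visual feature with a projected predictive (video) embedding
    through a scalar sigmoid gate computed from both inputs."""
    proj = jepa_feat @ W_p                          # map the JEPA feature into VLM space
    gate_in = np.concatenate([vlm_feat, proj])
    g = 1.0 / (1.0 + np.exp(-(gate_in @ W_g)))      # gate in (0, 1)
    return g * vlm_feat + (1.0 - g) * proj

# toy example: 2-d features; zero gate weights give an even 50/50 mix
vlm = np.array([1.0, 0.0])
jepa = np.array([0.0, 1.0])
fused = gated_fuse(vlm, jepa, W_p=np.eye(2), W_g=np.zeros(4))  # → [0.5, 0.5]
```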
Revis: Sparse Latent Steering to Mitigate Object Hallucination in Large Vision-Language Models
Authors: Jialin Wu, Wei Shi, Han Shen, Peigui Qi, Kunsheng Tang, Zhicong Huang, Binghao Wang, Zhou Yang
First: 2026-02-12T11:07:44+00:00 · Latest: 2026-02-12T11:07:44+00:00
Abstract
Despite the advanced capabilities of Large Vision-Language Models (LVLMs), they frequently suffer from object hallucination. One reason is that visual features and pretrained textual representations often become intertwined in the deeper network layers. To address this, we propose REVIS, a training-free framework designed to explicitly re-activate this suppressed visual information. Rooted in latent space geometry, REVIS extracts the pure visual information vector via orthogonal projection and employs a calibrated strategy to perform sparse intervention only at the precise depth where suppression occurs. This surgical approach effectively restores visual information with minimal computational cost. Empirical evaluations on standard benchmarks demonstrate that REVIS reduces object hallucination rates by approximately 19% compared to state-of-the-art baselines, while preserving general reasoning capabilities.
中文标题/摘要
标题:Revis: 稀疏潜在引导以减轻大型视觉-语言模型中的物体幻觉
尽管大型视觉-语言模型(LVLMs)具有先进的能力,但它们经常遭受物体幻觉的问题。其中一个原因是视觉特征和预训练的文本表示在深层网络层中常常交织在一起。为了解决这一问题,我们提出了一种无需训练的REVIS框架,旨在明确重新激活这种被抑制的视觉信息。基于潜在空间几何,REVIS通过正交投影提取纯净的视觉信息向量,并采用校准策略仅在抑制发生的精确深度处进行稀疏干预。这种手术式的方法有效地以最小的计算成本恢复了视觉信息。在标准基准上的实证评估表明,与最先进的基线相比,REVIS将物体幻觉率降低了约19%,同时保持了一般推理能力。
Summary / 总结
The paper addresses the issue of object hallucination in Large Vision-Language Models (LVLMs) by proposing REVIS, a training-free framework. REVIS reactivates suppressed visual information through orthogonal projection and sparse intervention at the precise depth where suppression occurs. Experiments show that REVIS reduces object hallucination rates by about 19% compared to state-of-the-art methods, while maintaining general reasoning capabilities.
研究旨在通过提出REVIS框架解决大型视觉-语言模型(LVLM)中的物体幻觉问题。REVIS通过正交投影和在抑制发生的精确深度处进行稀疏干预来重新激活被抑制的视觉信息。实证评估表明,与最先进的基线相比,REVIS可以将物体幻觉率降低约19%,同时保持一般的推理能力。
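The orthogonal-projection step above can be made concrete: remove the span of text-representation directions from a hidden state and keep the residual as the "pure visual" component. This is an illustrative sketch of the geometry only; the basis construction and intervention depth in REVIS are not reproduced here.

```python
import numpy as np

def visual_component(h, text_basis):
    """Project a hidden state onto the orthogonal complement of the
    text subspace spanned by the rows of text_basis."""
    Q, _ = np.linalg.qr(text_basis.T)     # orthonormal basis of the text subspace
    proj = Q @ (Q.T @ h)                  # component lying inside the text subspace
    return h - proj                       # orthogonal residual ("visual" part)

h = np.array([1.0, 2.0, 3.0])
text_basis = np.array([[1.0, 0.0, 0.0]])  # one text direction, along e1
v = visual_component(h, text_basis)       # → [0., 2., 3.]
```

A sparse intervention would add a scaled `v` back into the residual stream only at the layer where suppression is detected.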
Light4D: Training-Free Extreme Viewpoint 4D Video Relighting
Authors: Zhenghuang Wu, Kang Chen, Zeyu Zhang, Hao Tang
First: 2026-02-12T09:50:13+00:00 · Latest: 2026-02-12T09:50:13+00:00
Abstract
Recent advances in diffusion-based generative models have established a new paradigm for image and video relighting. However, extending these capabilities to 4D relighting remains challenging, due primarily to the scarcity of paired 4D relighting training data and the difficulty of maintaining temporal consistency across extreme viewpoints. In this work, we propose Light4D, a novel training-free framework designed to synthesize consistent 4D videos under target illumination, even under extreme viewpoint changes. First, we introduce Disentangled Flow Guidance, a time-aware strategy that effectively injects lighting control into the latent space while preserving geometric integrity. Second, to reinforce temporal consistency, we develop Temporal Consistent Attention within the IC-Light architecture and further incorporate deterministic regularization to eliminate appearance flickering. Extensive experiments demonstrate that our method achieves competitive performance in temporal consistency and lighting fidelity, robustly handling camera rotations from -90° to 90°. Code: https://github.com/AIGeeksGroup/Light4D. Website: https://aigeeksgroup.github.io/Light4D.
中文标题/摘要
标题:Light4D: 无需训练的极端视角4D视频重新光照
基于扩散生成模型的最新进展为图像和视频重新光照建立了一种新的范式。然而,将这些能力扩展到4D重新光照仍然具有挑战性,主要是由于缺乏配对的4D重新光照训练数据,以及在极端视角下保持时间一致性的难度。在本文中,我们提出了一种名为Light4D的新型无需训练框架,旨在在目标光照下合成一致的4D视频,即使在极端视角变化下也是如此。首先,我们引入了时间感知的解耦流引导策略,该策略有效地将光照控制注入到潜在空间中,同时保持几何完整性。其次,为了增强时间一致性,我们在IC-Light架构中开发了时间一致注意力,并进一步引入确定性正则化以消除外观闪烁。广泛的实验表明,我们的方法在时间一致性和光照保真度方面取得了竞争力的表现,能够稳健地处理从-90到90的相机旋转。代码: https://github.com/AIGeeksGroup/Light4D。网站: https://aigeeksgroup.github.io/Light4D.
Summary / 总结
Light4D is a training-free framework for 4D video relighting that addresses the challenges of maintaining temporal consistency and geometric integrity under extreme viewpoint changes. It uses Disentangled Flow Guidance to inject lighting control into the latent space and Temporal Consistent Attention to reinforce temporal consistency, while deterministic regularization is employed to prevent appearance flickering. Experiments show that Light4D achieves competitive performance in temporal consistency and lighting fidelity, handling camera rotations from -90 to 90 degrees effectively.
Light4D 是一个无需训练的 4D 视频重新光照框架,旨在解决在极端视角变化下保持时间一致性和几何完整性的问题。它引入了 Disentangled Flow Guidance 来将光照控制注入到潜在空间,并使用 Temporal Consistent Attention 来增强时间一致性。实验表明,Light4D 在时间一致性和光照保真度方面表现出色,能够有效处理从 -90 到 90 度的相机旋转。
Adaptive Debiasing Tsallis Entropy for Test-Time Adaptation
Authors: Xiangyu Wu, Dongming Jiang, Feng Yu, Yueying Tian, Jiaqi Tang, Qing-Guo Chen, Yang Yang, Jianfeng Lu
Venue: ICLR 2026
First: 2026-02-12T09:12:22+00:00 · Latest: 2026-02-12T09:12:22+00:00
Comments: Accepted for publication at ICLR 2026; 24 pages; 5 figures
Abstract
Mainstream Test-Time Adaptation (TTA) methods for adapting vision-language models, e.g., CLIP, typically rely on Shannon Entropy (SE) at test time to measure prediction uncertainty and inconsistency. However, since CLIP has a built-in bias from pretraining on highly imbalanced web-crawled data, SE inevitably results in producing biased estimates of uncertainty entropy. To address this issue, we notably find and demonstrate that Tsallis Entropy (TE), a generalized form of SE, is naturally suited for characterizing biased distributions by introducing a non-extensive parameter q, with the performance of SE serving as a lower bound for TE. Building upon this, we generalize TE into Adaptive Debiasing Tsallis Entropy (ADTE) for TTA, customizing a class-specific parameter q^l derived by normalizing the estimated label bias from continuously incoming test instances, for each category. This adaptive approach allows ADTE to accurately select high-confidence views and seamlessly integrate with a label adjustment strategy to enhance adaptation, without introducing distribution-specific hyperparameter tuning. Besides, our investigation reveals that both TE and ADTE can serve as direct, advanced alternatives to SE in TTA, without any other modifications. Experimental results show that ADTE outperforms state-of-the-art methods on ImageNet and its five variants, and achieves the highest average performance on 10 cross-domain benchmarks, regardless of the model architecture or text prompts used. Our code is available at https://github.com/Jinx630/ADTE.
中文标题/摘要
标题:自适应去偏Tsallis熵在测试时适应中的应用
主流的测试时适应(TTA)方法,例如针对CLIP等视觉-语言模型,通常依赖于测试时的香农熵(SE)来衡量预测不确定性与不一致性。然而,由于CLIP在预训练时使用了高度不平衡的网络抓取数据,SE不可避免地会产生有偏的不确定性熵估计。为了解决这一问题,我们发现并证明了Tsallis熵(TE),作为SE的一种广义形式,通过引入非可加参数q,自然适用于描述有偏分布,并且SE的性能是TE的下界。基于此,我们为TTA提出了自适应去偏Tsallis熵(ADTE),为每个类别定制了一个由不断到来的测试实例估计的标签偏置归一化得到的类别特定参数q^l。这种自适应方法使ADTE能够准确选择高置信度视图,并无缝地与标签调整策略结合以增强适应,而无需进行特定分布的超参数调整。此外,我们的研究发现,TE和ADTE都可以直接作为SE在TTA中的高级替代方案,无需其他修改。实验结果表明,ADTE在ImageNet及其五个变体上优于最先进的方法,并在10个跨域基准测试中实现了最高的平均性能,无论使用哪种模型架构或文本提示。我们的代码可在https://github.com/Jinx630/ADTE获取。
Summary / 总结
This paper addresses the issue of biased uncertainty estimation in Test-Time Adaptation (TTA) for vision-language models like CLIP, which are pre-trained on imbalanced data. It proposes Adaptive Debiasing Tsallis Entropy (ADTE), a generalized form of Tsallis Entropy that introduces an adaptive parameter q for each category, to better handle biased distributions. Experiments show that ADTE outperforms existing methods on ImageNet and its variants, as well as on 10 cross-domain benchmarks, demonstrating its effectiveness in TTA without requiring additional hyperparameter tuning.
本文针对视觉-语言模型CLIP在测试时适应性(TTA)中由于预训练数据不平衡导致的偏差不确定性估计问题,提出了自适应去偏差Tsallis熵(ADTE),该方法通过为每个类别引入自适应参数q来更好地处理偏差分布。实验结果显示,ADTE在ImageNet及其变体以及10个跨域基准上均优于现有方法,且无需额外的超参数调整,证明了其在TTA中的有效性。
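The Tsallis entropy at the heart of ADTE generalizes Shannon entropy through the non-extensive parameter q: S_q(p) = (1 − Σᵢ pᵢ^q)/(q − 1), recovering Shannon entropy in the limit q → 1. The snippet below shows the entropy itself; the class-specific adaptive q^l derived from estimated label bias is the paper's contribution and is not reproduced here.

```python
import numpy as np

def tsallis_entropy(p, q):
    """Tsallis entropy S_q(p) = (1 - sum_i p_i^q) / (q - 1).
    As q -> 1 this recovers Shannon entropy (natural log)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                                    # ignore zero-probability classes
    if abs(q - 1.0) < 1e-12:
        return float(-np.sum(p * np.log(p)))        # Shannon limit
    return float((1.0 - np.sum(p ** q)) / (q - 1.0))

p = [0.7, 0.2, 0.1]
shannon = tsallis_entropy(p, 1.0)       # Shannon entropy of p
near = tsallis_entropy(p, 1.0001)       # q near 1 approaches the Shannon value
```

Varying q reweights how sharply peaked (or biased) distributions are penalized, which is why a per-class q can counteract the pretraining label bias.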
Adapting Vision-Language Models for E-commerce Understanding at Scale
Authors: Matteo Nulli, Vladimir Orshulevich, Tala Bazazo, Christian Herold, Michael Kozielski, Marcin Mazur, Szymon Tuzel, Cees G. M. Snoek, Seyyed Hadi Hashemi, Omar Javed, Yannick Versley, Shahram Khadivi
First: 2026-02-12T08:59:22+00:00 · Latest: 2026-02-12T08:59:22+00:00
Abstract
E-commerce product understanding by nature demands strong multimodal comprehension of text, images, and structured attributes. General-purpose Vision-Language Models (VLMs) enable generalizable multimodal latent modelling, yet there is no documented, well-known strategy for adapting them to the attribute-centric, multi-image, and noisy nature of e-commerce data without sacrificing general performance. In this work, we show through a large-scale experimental study how targeted adaptation of general VLMs can substantially improve e-commerce performance while preserving broad multimodal capabilities. Furthermore, we propose a novel extensive evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction.
中文标题/摘要
标题:适应电子商务大规模理解的视觉-语言模型调整
电子商务产品理解本质上要求强大的跨模态理解能力,包括文本、图像和结构化属性。通用的视觉-语言模型(VLMs)能够实现跨模态的泛化潜在建模,但尚未有记录且广为人知的方法能够在不牺牲泛化性能的情况下,将它们调整到以属性为中心、多图像且噪声较大的电子商务数据上。在本研究中,我们通过大规模实验研究,展示了针对通用VLMs进行目标调整如何在保持广泛跨模态能力的同时显著提高电子商务性能。此外,我们还提出了一套新的全面评估方案,涵盖深入的产品理解、严格的指令遵循和动态属性提取。
Summary / 总结
The research aims to enhance the performance of Vision-Language Models (VLMs) in e-commerce by adapting them to the specific needs of the e-commerce domain, such as attribute-centric, multi-image, and noisy data. The study employs a large-scale experimental approach to demonstrate that targeted adaptation of VLMs can significantly improve e-commerce performance while maintaining their general multimodal capabilities. Key findings include the development of a comprehensive evaluation suite that assesses deep product understanding, strict instruction following, and dynamic attribute extraction.
本研究针对将通用视觉-语言模型适应于电子商务的需求,通过开发一种针对性的适应策略,提高了在属性中心、多图像和嘈杂的电子商务数据上的性能,同时保持了广泛的多模态能力。关键发现包括通过大规模实验研究实现了显著的电子商务理解改进,并引入了一个全面的评估套件,涵盖深度产品理解、严格指令遵循和动态属性提取。
STVG-R1: Incentivizing Instance-Level Reasoning and Grounding in Videos via Reinforcement Learning
Authors: Xiaowen Zhang, Zhi Gao, Licheng Jiao, Lingling Li, Qing Li
First: 2026-02-12T08:53:32+00:00 · Latest: 2026-02-12T08:53:32+00:00
Abstract
In vision-language models (VLMs), misalignment between textual descriptions and visual coordinates often induces hallucinations. This issue becomes particularly severe in dense prediction tasks such as spatial-temporal video grounding (STVG). Prior approaches typically focus on enhancing visual-textual alignment or attaching auxiliary decoders. However, these strategies inevitably introduce additional trainable modules, leading to significant annotation costs and computational overhead. In this work, we propose a novel visual prompting paradigm that avoids the difficult problem of aligning coordinates across modalities. Specifically, we reformulate per-frame coordinate prediction as a compact instance-level identification problem by assigning each object a unique, temporally consistent ID. These IDs are embedded into the video as visual prompts, providing explicit and interpretable inputs to the VLMs. Furthermore, we introduce STVG-R1, the first reinforcement learning framework for STVG, which employs a task-driven reward to jointly optimize temporal accuracy, spatial consistency, and structural format regularization. Extensive experiments on six benchmarks demonstrate the effectiveness of our approach. STVG-R1 surpasses the baseline Qwen2.5-VL-7B by a remarkable margin of 20.9% on m_IoU on the HCSTVG-v2 benchmark, establishing a new state of the art (SOTA). Surprisingly, STVG-R1 also exhibits strong zero-shot generalization to multi-object referring video object segmentation tasks, achieving a SOTA 47.3% J&F on MeViS.
中文标题/摘要
标题:STVG-R1:通过强化学习激励视频中的实例级推理与定位
在视觉-语言模型(VLMs)中,文本描述与视觉坐标之间的不一致常常导致幻觉。这一问题在时空视频定位(STVG)等密集预测任务中尤为严重。先前的方法通常侧重于增强视觉-文本对齐或附加辅助解码器。然而,这些策略不可避免地引入了额外的可训练模块,导致显著的标注成本和计算开销。在本文中,我们提出了一种新颖的视觉提示范式,避免了跨模态对齐坐标的困难问题。具体而言,我们将每帧坐标预测重新表述为一个紧凑的实例级识别问题,为每个对象分配一个唯一的、时间一致的ID。这些ID被嵌入到视频中作为视觉提示,为VLMs提供明确且可解释的输入。此外,我们引入了STVG-R1,这是第一个用于STVG的强化学习框架,它采用任务驱动的奖励来联合优化时间准确性、空间一致性和结构格式正则化。在六个基准上的广泛实验表明了我们方法的有效性。STVG-R1在HCSTVG-v2基准的m_IoU指标上以20.9%的显著优势超越了基线Qwen2.5-VL-7B,建立了新的最先进水平(SOTA)。令人惊讶的是,STVG-R1在多对象指代视频对象分割任务上也表现出强大的零样本泛化能力,在MeViS上实现了47.3% J&F的SOTA成绩。
Summary / 总结
This work addresses the issue of hallucinations in vision-language models for spatial-temporal video grounding by proposing a novel visual prompting paradigm. The method assigns unique, temporally consistent IDs to objects and embeds them as visual prompts, reformulating per-frame coordinate prediction as an instance-level identification problem. The approach also introduces STVG-R1, a reinforcement learning framework that optimizes temporal accuracy, spatial consistency, and structural format regularization. Experiments show that STVG-R1 outperforms the baseline Qwen2.5-VL-7B by 20.9% on m_IoU on the HCSTVG-v2 benchmark and achieves a SOTA 47.3% J&F on MeViS for multi-object referring video object segmentation tasks.
本文针对视觉语言模型中文本描述与视觉坐标之间的对齐问题,特别是在时空视频定位任务中的问题,提出了一种新颖的视觉提示范式,为每个对象分配唯一的、时序一致的ID,并将其嵌入视频中作为视觉提示,避免了额外可训练模块的需求。该方法还引入了STVG-R1,这是一种基于强化学习的框架,用于优化时间准确性、空间一致性和结构格式正则化。实验表明,STVG-R1在HCSTVG-v2基准上的m_IoU上比基线Qwen2.5-VL-7B高出20.9%,并建立了新的最先进的技术水平。此外,它在多对象引用视频对象分割任务上也表现出强大的零样本泛化能力,实现了MeViS上的SOTA 47.3% J&F。
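A task-driven reward that jointly scores temporal accuracy, spatial consistency, and output format, as described above, might look like the following sketch. The weights, field names, and decomposition are hypothetical, not the paper's exact reward.

```python
def interval_iou(a, b):
    """IoU of two [start, end] temporal intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def box_iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def stvg_reward(pred, gt, well_formatted, w_t=0.5, w_s=0.4, w_f=0.1):
    """Combine temporal IoU, mean per-frame box IoU, and a format bonus
    into one scalar reward for policy optimization."""
    r_t = interval_iou(pred["span"], gt["span"])
    frames = [f for f in pred["boxes"] if f in gt["boxes"]]
    r_s = (sum(box_iou(pred["boxes"][f], gt["boxes"][f]) for f in frames)
           / len(frames)) if frames else 0.0
    return w_t * r_t + w_s * r_s + w_f * float(well_formatted)

# a perfect, well-formatted prediction earns the full reward of 1.0
gt = {"span": [0.0, 10.0], "boxes": {0: [0.0, 0.0, 1.0, 1.0]}}
r = stvg_reward(gt, gt, well_formatted=True)
```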
Clutt3R-Seg: Sparse-view 3D Instance Segmentation for Language-grounded Grasping in Cluttered Scenes
Authors: Jeongho Noh, Tai Hyoung Rhee, Eunho Lee, Jeongyun Kim, Sunwoo Lee, Ayoung Kim
Venue: ICRA 2026
First: 2026-02-12T07:25:52+00:00 · Latest: 2026-02-12T07:25:52+00:00
Comments: Accepted to ICRA 2026. 9 pages, 8 figures
Abstract
Reliable 3D instance segmentation is fundamental to language-grounded robotic manipulation. Its critical application lies in cluttered environments, where occlusions, limited viewpoints, and noisy masks degrade perception. To address these challenges, we present Clutt3R-Seg, a zero-shot pipeline for robust 3D instance segmentation for language-grounded grasping in cluttered scenes. Our key idea is to introduce a hierarchical instance tree of semantic cues. Unlike prior approaches that attempt to refine noisy masks, our method leverages them as informative cues: through cross-view grouping and conditional substitution, the tree suppresses over- and under-segmentation, yielding view-consistent masks and robust 3D instances. Each instance is enriched with open-vocabulary semantic embeddings, enabling accurate target selection from natural language instructions. To handle scene changes during multi-stage tasks, we further introduce a consistency-aware update that preserves instance correspondences from only a single post-interaction image, allowing efficient adaptation without rescanning. Clutt3R-Seg is evaluated on both synthetic and real-world datasets, and validated on a real robot. Across all settings, it consistently outperforms state-of-the-art baselines in cluttered and sparse-view scenarios. Even on the most challenging heavy-clutter sequences, Clutt3R-Seg achieves an AP@25 of 61.66, over 2.2x higher than baselines, and with only four input views it surpasses MaskClustering with eight views by more than 2x. The code is available at: https://github.com/jeonghonoh/clutt3r-seg.
中文标题/摘要
标题:Clutt3R-Seg:杂乱场景中基于语言的抓取稀疏视图3D实例分割
可靠的3D实例分割是基于语言的机器人操作的基础。其关键应用在于杂乱环境,其中遮挡、有限视角和噪声掩膜会降低感知效果。为了解决这些挑战,我们提出了Clutt3R-Seg,这是一种用于杂乱场景中基于语言的抓取的鲁棒3D实例分割零样本管道。我们的核心思想是引入层次化的语义实例树。与先前尝试细化噪声掩膜的方法不同,我们的方法将它们用作信息性线索:通过跨视图分组和条件替换,树抑制了过度分割和不足分割,产生了视图一致的掩膜和鲁棒的3D实例。每个实例都丰富了开放词汇语义嵌入,使从自然语言指令中准确选择目标成为可能。为了处理多阶段任务中的场景变化,我们进一步引入了一种一致性感知更新,仅通过单张后交互图像保留实例对应关系,从而实现高效的适应而无需重新扫描。Clutt3R-Seg在合成和真实世界数据集上进行了评估,并在真实机器人上进行了验证。在所有设置中,它在杂乱和稀疏视图场景中始终优于最先进的基线。即使在最具挑战性的重杂乱序列中,Clutt3R-Seg的AP@25也达到了61.66,比基线高出2.2倍,仅使用四个输入视图时,它也超过了使用八视图的MaskClustering超过2倍。代码可在:https://github.com/jeonghonoh/clutt3r-seg/ 获取。
Summary / 总结
Clutt3R-Seg is a zero-shot pipeline for 3D instance segmentation in cluttered scenes, addressing occlusions and limited viewpoints. It introduces a hierarchical instance tree of semantic cues, using cross-view grouping and conditional substitution to suppress over- and under-segmentation, resulting in view-consistent masks and robust 3D instances. The method enriches each instance with open-vocabulary semantic embeddings and introduces a consistency-aware update to preserve instance correspondences, enabling efficient adaptation. Clutt3R-Seg outperforms state-of-the-art baselines in cluttered and sparse-view scenarios, achieving an AP@25 of 61.66 and surpassing MaskClustering with fewer input views.
Clutt3R-Seg 是一种针对杂乱场景的零样本 3D 实例分割管道,解决遮挡和有限视角的问题。它引入了一种基于语义线索的分层实例树来抑制过度分割和不足分割,生成视图一致的掩码和稳健的 3D 实例。该方法为每个实例添加开放词汇语义嵌入,并使用一致性感知更新来保存实例对应关系。实验表明,Clutt3R-Seg 在杂乱和稀疏视角场景中优于最先进的基线,AP@25 达到 61.66,并且在重杂乱场景中使用更少的输入视图超过 MaskClustering。
ScalSelect: Scalable Training-Free Multimodal Data Selection for Efficient Visual Instruction Tuning
Authors: Changti Wu, Jiahuai Mao, Yuzhuo Miao, Shijie Lian, Bin Yu, Xiaopeng Lin, Cong Huang, Lei Zhang, Kai Chen
First: 2026-02-12T06:38:49+00:00 · Latest: 2026-02-12T06:38:49+00:00
Comments: The code is available at https://github.com/ChangtiWu/ScalSelect
Abstract
Large-scale Visual Instruction Tuning (VIT) has become a key paradigm for advancing the performance of vision-language models (VLMs) across various multimodal tasks. However, training on these large-scale datasets is computationally expensive and inefficient due to redundancy in the data, which motivates the need for multimodal data selection to improve training efficiency. Existing data selection methods for VIT either require costly training or gradient computation. Training-free alternatives often depend on proxy models or datasets, instruction-agnostic representations, and pairwise similarity with quadratic complexity, limiting scalability and representation fidelity. In this work, we propose ScalSelect, a scalable training-free multimodal data selection method with linear-time complexity with respect to the number of samples, eliminating the need for external models or auxiliary datasets. ScalSelect first constructs sample representations by extracting visual features most attended by instruction tokens in the target VLM, capturing instruction-relevant information. It then identifies samples whose representations best approximate the dominant subspace of the full dataset representations, enabling scalable importance scoring without pairwise comparisons. Extensive experiments across multiple VLMs, datasets, and selection budgets demonstrate that ScalSelect achieves over 97.5% of the performance of training on the full dataset using only 16% of the data, and even outperforms full-data training in some settings. The code is available at https://github.com/ChangtiWu/ScalSelect.
中文标题/摘要
标题:ScalSelect:高效视觉指令调优的可扩展无训练多模态数据选择
大规模视觉指令调优(VIT)已成为提升视觉语言模型(VLMs)在各种多模态任务中性能的关键范式。然而,由于数据中的冗余性,大规模数据集的训练在计算上既昂贵又低效,这促使需要多模态数据选择以提高训练效率。现有的VIT数据选择方法要么需要昂贵的训练或梯度计算,要么依赖于代理模型或数据集、指令无关的表示以及具有二次复杂度的成对相似性,限制了可扩展性和表示保真度。在本文中,我们提出了一种名为ScalSelect的可扩展无训练多模态数据选择方法,其时间复杂度与样本数量成线性关系,消除了对外部模型或辅助数据集的需求。ScalSelect首先通过提取目标VLM中指令标记最关注的视觉特征来构建样本表示,捕捉指令相关信息。然后,它识别出那些表示最能逼近完整数据集表示主导子空间的样本,从而在无需成对比较的情况下实现可扩展的重要性评分。在多个VLM、数据集和选择预算上的广泛实验表明,ScalSelect仅使用数据的16%就能达到超过97.5%的完整数据集训练性能,并且在某些情况下甚至优于完整数据集训练。代码可在<https://github.com/ChangtiWu/ScalSelect> 获取。
Summary / 总结
ScalSelect is a scalable training-free multimodal data selection method for efficient Visual Instruction Tuning (VIT) of vision-language models. It constructs sample representations by extracting visual features attended by instruction tokens and identifies samples that best approximate the dominant subspace of the full dataset, avoiding pairwise comparisons. ScalSelect achieves over 97.5% of the performance of full-data training using only 16% of the data across multiple VLMs, datasets, and selection budgets, and even outperforms full-data training in some settings.
ScalSelect 是一种可扩展的无需训练的多模态数据选择方法,用于高效的视觉指令调优,解决大规模训练的计算效率问题。该方法通过提取目标 VLM 中指令标记最关注的视觉特征来构建样本表示,并识别出那些能够最好地逼近完整数据集表示主导子空间的样本。实验表明,ScalSelect 使用仅 16% 的数据即可达到超过 97.5% 的全数据训练性能,并在某些情况下甚至优于全数据训练。
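Dominant-subspace importance scoring can be illustrated with a truncated SVD: score each sample by how much of its representation's energy lies in the top-k right singular subspace of the dataset, then keep the highest-scoring samples. This is an illustrative reading of the idea only; the paper's algorithm avoids the full SVD to stay linear in the number of samples.

```python
import numpy as np

def subspace_scores(X, k):
    """Score each row of X (n samples x d features) by the energy of its
    projection onto the top-k right singular subspace of X."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt[:k].T                        # d x k basis of the dominant subspace
    proj = X @ V                        # n x k projected coordinates
    return np.sum(proj ** 2, axis=1)    # per-sample energy in the subspace

# synthetic dataset: 90 samples on a rank-2 "signal" subspace, 10 noise samples
rng = np.random.default_rng(0)
signal = rng.standard_normal((90, 2)) @ rng.standard_normal((2, 16))
noise = 0.1 * rng.standard_normal((10, 16))
X = np.vstack([signal, noise])
scores = subspace_scores(X, k=2)
keep = np.argsort(scores)[::-1][:50]    # retain the 50 most representative samples
```

Samples aligned with the dominant subspace (the "redundant core" of the data) score high, while off-subspace noise scores low, and no pairwise comparisons are needed.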
SkillRater: Untangling Capabilities in Multimodal Data
Authors: Naveen Sahi, Jeremy Dohmann, Armen Aghajanyan, Akshat Shrivastava
First: 2026-02-12T06:07:03+00:00 · Latest: 2026-02-12T06:07:03+00:00
Abstract
Data curation methods typically assign samples a single quality score. We argue this scalar framing is fundamentally limited: when training requires multiple distinct capabilities, a monolithic scorer cannot maximize useful signals for all of them simultaneously. Quality is better understood as multidimensional, with each dimension corresponding to a capability the model must acquire. We introduce SkillRater, a framework that decomposes data filtering into specialized raters - one per capability, each trained via meta-learning on a disjoint validation objective - and composes their scores through a progressive selection rule: at each training stage, a sample is retained if any rater ranks it above a threshold that tightens over time, preserving diversity early while concentrating on high-value samples late. We validate this approach on vision language models, decomposing quality into three capability dimensions: visual understanding, OCR, and STEM reasoning. At 2B parameters, SkillRater improves over unfiltered baselines by 5.63% on visual understanding, 2.00% on OCR, and 3.53% on STEM on held out benchmarks. The learned rater signals are near orthogonal, confirming that the decomposition captures genuinely independent quality dimensions and explaining why it outperforms both unfiltered training and monolithic learned filtering.
中文标题/摘要
标题:SkillRater:解开多模态数据能力
数据整理方法通常为样本分配单一的质量评分。我们认为这种标量框架本质上是有限的:当训练需要多种不同的能力时,单一的评分器无法同时最大化所有能力的有用信号。质量应被视为多维的,每个维度对应模型必须掌握的一种能力。我们引入了SkillRater框架,将数据过滤分解为专门的评分器——每个能力一个,通过元学习在独立的验证目标上进行训练——并通过逐步选择规则组合它们的评分:在每个训练阶段,如果任何评分器将样本排名高于随时间收紧的阈值,则保留该样本,早期保持多样性,后期集中于高价值样本。我们在视觉语言模型上验证了这种方法,将质量分解为三个能力维度:视觉理解、OCR和STEM推理。在20亿参数规模下,与未过滤的基线相比,SkillRater在视觉理解上提高了5.63%,在OCR上提高了2.00%,在STEM上提高了3.53%。学习到的评分器信号几乎正交,证实了分解捕捉到了真正独立的质量维度,解释了为什么它优于未过滤训练和单一学习过滤。
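The progressive selection rule — keep a sample if ANY capability rater ranks it above a threshold that tightens over training stages — can be sketched as follows. The percentile schedule is a hypothetical choice for illustration, not the paper's exact schedule.

```python
import numpy as np

def progressive_select(rater_scores, stage, n_stages, start_pct=50, end_pct=90):
    """Return a boolean keep-mask: a sample survives if any rater places it
    above a percentile threshold that tightens linearly with the stage."""
    frac = stage / max(n_stages - 1, 1)
    pct = start_pct + frac * (end_pct - start_pct)   # threshold tightens over time
    keep = np.zeros(rater_scores.shape[0], dtype=bool)
    for r in range(rater_scores.shape[1]):           # one column per capability rater
        thr = np.percentile(rater_scores[:, r], pct)
        keep |= rater_scores[:, r] > thr             # union over raters
    return keep

# 1000 samples scored by 3 independent (near-orthogonal) raters
rng = np.random.default_rng(1)
scores = rng.random((1000, 3))
early = progressive_select(scores, stage=0, n_stages=4)  # loose: diverse pool
late = progressive_select(scores, stage=3, n_stages=4)   # tight: high-value only
```

The union over raters is what preserves capability diversity: a sample excellent for OCR survives even if the visual-understanding rater scores it poorly.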
The Determinism of Randomness: Latent Space Degeneracy in Diffusion Model
Authors: Song Yan, Chenfeng Wang, Wei Zhai, Xinliang Bi, Jian Yang, Yusen Zhang, Yunwei Lan, Tao Zhang, GuanYe Xiong, Min Li, Zheng-Jun Zha
First: 2025-11-11T02:12:38+00:00 · Latest: 2026-02-12T03:33:55+00:00
Abstract
Diffusion models draw the initial latent from an isotropic Gaussian distribution (all directions equally likely). But in practice, changing only the random seed can sharply alter image quality and prompt faithfulness. We explain this by distinguishing the isotropic prior from the semantics induced by the sampling map: while the prior is direction-agnostic, the mapping from latent noise to semantics has semantic-invariant directions and semantic-sensitive directions, so different seeds can lead to very different semantic outcomes. Motivated by this view, we propose a training-free inference procedure that (i) suppresses seed-specific, semantic-irrelevant variation via distribution-preserving semantic erasure, (ii) reinforces prompt-relevant semantic directions through timestep-aggregated horizontal injection, and (iii) applies a simple spherical retraction to stay near the prior's typical set. Across multiple backbones and benchmarks, our method consistently improves alignment and generation quality over standard sampling.
中文标题/摘要
标题:随机性的决定论:扩散模型中的潜在空间退化
扩散模型从各向同性的高斯分布中抽取初始潜在变量(所有方向的可能性相同)。但在实践中,仅改变随机种子就能显著改变图像质量和提示一致性。我们通过区分各向同性先验与采样映射诱导的语义:虽然先验对方向无偏好,但从潜在噪声到语义的映射具有语义不变的方向和语义敏感的方向,因此不同的种子可能导致非常不同的语义结果。受此观点的启发,我们提出了一种无需训练的推理过程,(i)通过保持分布的语义擦除抑制种子特定的、与语义无关的变异,(ii)通过时间步聚合水平注入强化与提示相关的语义方向,(iii)应用简单的球面收缩以保持在先验的典型集附近。在多个骨干网络和基准测试中,我们的方法在一致性与生成质量上均优于标准采样方法。
Summary / 总结
The paper investigates why changing the random seed in diffusion models can significantly affect image quality and prompt faithfulness. It proposes a method that suppresses seed-specific, irrelevant variation, reinforces prompt-relevant semantic directions, and applies a spherical retraction to stay near the prior's typical set. This method improves alignment and generation quality across different models and benchmarks compared to standard sampling.
论文探讨了为什么在扩散模型中改变随机种子会显著影响图像质量和提示一致性,尽管初始潜在变量使用的是等向性的高斯先验。提出了一种无需训练的推理方法,该方法抑制了与语义无关的种子特定变异,强化了与提示相关的语义方向,并使生成的图像保持在先验的典型集中。实验结果显示,在不同模型和基准上一致地提高了对齐性和生成质量。
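The paper's step (iii), "spherical retraction to stay near the prior's typical set," rests on the fact that a d-dimensional isotropic Gaussian concentrates its mass near the shell of radius sqrt(d). A minimal sketch, assuming the retraction is a simple norm rescaling (the function name and the `alpha` interpolation knob are illustrative assumptions, not the paper's exact operator):

```python
import numpy as np

def spherical_retraction(z, alpha=1.0):
    """Rescale a latent toward the Gaussian typical set (norm ~ sqrt(d)).

    alpha=1 snaps exactly to the sqrt(d) shell; alpha<1 moves only part way.
    """
    d = z.size
    target = np.sqrt(d)
    norm = np.linalg.norm(z)
    scale = (target / norm) ** alpha
    return z * scale

rng = np.random.default_rng(0)
z = rng.normal(size=4096) * 1.5  # deliberately scaled off the typical set
z_ret = spherical_retraction(z)  # back on the sqrt(4096) = 64 shell
```

After the semantic-erasure and injection edits perturb the latent, a retraction like this keeps it in the region the diffusion model was actually trained to denoise.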
What if Agents Could Imagine? Reinforcing Open-Vocabulary HOI Comprehension through Generation
Authors: Zhenlong Yuan, Xiangyan Qu, Jing Tang, Rui Chen, Lei Sun, Ruidong Chen, Hongwei Yu, Chengxuan Qian, Xiangxiang Chu, Shuo Li, Yuyin Zhou
First: 2026-02-12T02:51:59+00:00 · Latest: 2026-02-12T02:51:59+00:00
Abstract
Multimodal Large Language Models have shown promising capabilities in bridging visual and textual reasoning, yet their reasoning capabilities in Open-Vocabulary Human-Object Interaction (OV-HOI) are limited by cross-modal hallucinations and occlusion-induced ambiguity. To address this, we propose \textbf{ImagineAgent}, an agentic framework that harmonizes cognitive reasoning with generative imagination for robust visual understanding. Specifically, our method innovatively constructs cognitive maps that explicitly model plausible relationships between detected entities and candidate actions. Subsequently, it dynamically invokes tools including retrieval augmentation, image cropping, and diffusion models to gather domain-specific knowledge and enriched visual evidence, thereby achieving cross-modal alignment in ambiguous scenarios. Moreover, we propose a composite reward that balances prediction accuracy and tool efficiency. Evaluations on SWIG-HOI and HICO-DET datasets demonstrate our SOTA performance, requiring approximately 20\% of training data compared to existing methods, validating our robustness and efficiency.
中文标题/摘要
标题:假如智能体能想象?通过生成增强开放词汇人-物交互理解
多模态大型语言模型在视觉和文本推理方面显示出有希望的能力,但在开放词汇人-物交互(OV-HOI)中的推理能力受限于跨模态幻觉和遮挡引起的歧义。为了解决这一问题,我们提出了一种名为\textbf{ImagineAgent}的智能体框架,该框架将认知推理与生成性想象相结合,以实现稳健的视觉理解。具体而言,我们的方法创新性地构建了认知地图,明确地建模了检测到的实体与候选动作之间可能的关系。随后,它动态地调用包括检索增强、图像裁剪和扩散模型在内的工具,以收集领域特定知识和丰富的视觉证据,从而在模糊场景中实现跨模态对齐。此外,我们提出了一种综合奖励,平衡预测准确性和工具效率。在SWIG-HOI和HICO-DET数据集上的评估表明,我们的方法在性能上达到SOTA水平,所需训练数据量约为现有方法的20%,验证了我们的稳健性和效率。
Summary / 总结
The research aims to enhance the reasoning capabilities of multimodal large language models in Open-Vocabulary Human-Object Interaction (OV-HOI) by addressing cross-modal hallucinations and occlusion-induced ambiguity. The proposed ImagineAgent framework combines cognitive reasoning with generative imagination to construct cognitive maps and dynamically invoke various tools for gathering domain-specific knowledge and visual evidence. This approach achieves state-of-the-art performance on SWIG-HOI and HICO-DET datasets with reduced training data requirements, demonstrating robustness and efficiency.
研究旨在通过解决跨模态幻觉和遮挡引起的歧义问题,增强多模态大型语言模型在开放词汇人类物体交互(OV-HOI)中的推理能力。提出了ImagineAgent框架,该框架结合了认知推理和生成性想象,构建认知地图,并动态调用各种工具来收集领域特定知识和视觉证据。该方法在SWIG-HOI和HICO-DET数据集上达到了最先进的性能,仅需现有方法约20%的训练数据,证明了其稳健性和效率。
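The abstract mentions a "composite reward that balances prediction accuracy and tool efficiency" but does not give its form. One plausible shape is an accuracy term minus a tool-usage penalty; the function name, weight `lambda_tool`, and cap `max_calls` below are all hypothetical, shown only to make the trade-off concrete.

```python
def composite_reward(correct, n_tool_calls, lambda_tool=0.1, max_calls=5):
    """Hypothetical composite reward: accuracy term minus a tool-usage penalty."""
    accuracy_term = 1.0 if correct else 0.0
    efficiency_penalty = lambda_tool * min(n_tool_calls, max_calls)
    return accuracy_term - efficiency_penalty

r_good = composite_reward(correct=True, n_tool_calls=1)    # correct and frugal
r_waste = composite_reward(correct=True, n_tool_calls=5)   # correct but tool-heavy
r_wrong = composite_reward(correct=False, n_tool_calls=0)  # wrong, no tools used
```

Under any reward of this shape, the agent is pushed to invoke retrieval, cropping, or diffusion tools only when the accuracy gain outweighs the per-call cost.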
Anagent For Enhancing Scientific Table & Figure Analysis
Authors: Xuehang Guo, Zhiyong Lu, Tom Hope, Qingyun Wang
First: 2026-02-10T18:46:28+00:00 · Latest: 2026-02-12T02:51:40+00:00
Abstract
In scientific research, analysis requires accurately interpreting complex multimodal knowledge, integrating evidence from different sources, and drawing inferences grounded in domain-specific knowledge. However, current artificial intelligence (AI) systems struggle to consistently demonstrate such capabilities. The complexity and variability of scientific tables and figures, combined with heterogeneous structures and long-context requirements, pose fundamental obstacles to scientific table \& figure analysis. To quantify these challenges, we introduce AnaBench, a large-scale benchmark featuring $63,178$ instances from nine scientific domains, systematically categorized along seven complexity dimensions. To tackle these challenges, we propose Anagent, a multi-agent framework for enhanced scientific table \& figure analysis through four specialized agents: Planner decomposes tasks into actionable subtasks, Expert retrieves task-specific information through targeted tool execution, Solver synthesizes information to generate coherent analysis, and Critic performs iterative refinement through five-dimensional quality assessment. We further develop modular training strategies that leverage supervised finetuning and specialized reinforcement learning to optimize individual capabilities while maintaining effective collaboration. Comprehensive evaluation across 9 broad domains with 170 subdomains demonstrates that Anagent achieves substantial improvements, up to $\uparrow 13.43\%$ in training-free settings and $\uparrow 42.12\%$ with finetuning, while revealing that task-oriented reasoning and context-aware problem-solving are essential for high-quality scientific table \& figure analysis. Our project page: https://xhguo7.github.io/Anagent/.
中文标题/摘要
标题:Anagent:用于增强科学表格与图表分析的智能体
在科学研究中,分析需要准确解读复杂的多模态知识,整合来自不同来源的证据,并基于领域特定知识得出推论。然而,当前的人工智能(AI)系统在持续展示这些能力方面存在困难。科学表格和图表的复杂性和变异性,以及异构结构和长上下文需求,构成了科学表格与图表分析的基本障碍。为了量化这些挑战,我们引入了AnaBench,这是一个包含来自九个科学领域的63,178个实例的大规模基准,系统地沿七个复杂维度进行分类。为应对这些挑战,我们提出了一种多代理框架Anagent,通过四个专门的代理进行增强的科学表格与图表分析:规划者将任务分解为可执行的子任务,专家通过有针对性的工具执行检索特定任务信息,解决者综合信息生成连贯的分析,评论家通过五维质量评估进行迭代改进。我们进一步开发了模块化的训练策略,利用监督微调和专门的强化学习来优化个体能力,同时保持有效的协作。在9个广泛领域和170个子领域的全面评估中,Anagent实现了显著的改进,在无训练设置中最高可达$\uparrow 13.43\%$,在微调设置中最高可达$\uparrow 42.12\%$,揭示了任务导向的推理和上下文感知的问题解决对于高质量的科学表格与图表分析至关重要。我们的项目页面:https://xhguo7.github.io/Anagent/
Summary / 总结
The paper addresses the challenge of accurately analyzing complex scientific tables and figures, which current AI systems often struggle with. It introduces AnaBench, a benchmark with 63,178 instances from nine scientific domains, and proposes Anagent, a multi-agent framework comprising Planner, Expert, Solver, and Critic agents. Anagent shows significant improvements in analysis quality, up to 13.43% without fine-tuning and 42.12% with fine-tuning, emphasizing the importance of task-oriented reasoning and context-aware problem-solving in scientific analysis.
论文旨在解决当前AI系统在分析复杂科学表格和图表时遇到的挑战:由于表格和图表的复杂性与多样性,现有系统难以进行准确分析。为此,作者引入了包含63,178个实例、覆盖九个科学领域的AnaBench基准,并提出了多智能体框架Anagent,该框架包括规划者、专家、解决者和评论家四个智能体。Anagent在分析质量上取得了显著改进,无需训练的设置下最高提升13.43%,微调后最高提升42.12%,强调了任务导向的推理和上下文感知的问题解决在科学分析中的重要性。
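The Planner → Expert → Solver → Critic loop described in the abstract can be sketched as a generic control flow. This is a toy illustration only — the agent callables, the numeric "table" task, and `max_rounds` are stand-ins invented here, not Anagent's actual interfaces.

```python
def run_pipeline(task, planner, expert, solver, critic, max_rounds=3):
    """Toy Planner -> Expert -> Solver -> Critic loop with iterative refinement."""
    subtasks = planner(task)                   # decompose into actionable subtasks
    evidence = [expert(st) for st in subtasks] # retrieve task-specific information
    answer = solver(task, evidence)            # synthesize a coherent analysis
    for _ in range(max_rounds):
        ok, feedback = critic(task, answer)    # quality assessment
        if ok:
            break
        answer = solver(task, evidence + [feedback])  # refine with critique
    return answer

# Toy agents on a numeric "sum the column" task.
table = [1, 2, 3, 4]
planner = lambda task: ["extract", "aggregate"]
expert = lambda st: table if st == "extract" else sum(table)

def solver(task, evidence):
    # Take the last scalar piece of evidence as the answer.
    nums = [e for e in evidence if isinstance(e, (int, float))]
    return nums[-1] if nums else None

critic = lambda task, ans: (ans == 10, "check the aggregate again")

result = run_pipeline("sum the column", planner, expert, solver, critic)
```

The point of the structure is that each role is independently replaceable and trainable, which is what the paper's modular SFT/RL training strategies exploit.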
NarraScore: Bridging Visual Narrative and Musical Dynamics via Hierarchical Affective Control
Authors: Yufan Wen, Zhaocheng Liu, YeGuo Hua, Ziyi Guo, Lihua Zhang, Chun Yuan, Jian Wu
First: 2026-02-09T09:39:42+00:00 · Latest: 2026-02-12T02:33:29+00:00
Abstract
Synthesizing coherent soundtracks for long-form videos remains a formidable challenge, currently stalled by three critical impediments: computational scalability, temporal coherence, and, most critically, a pervasive semantic blindness to evolving narrative logic. To bridge these gaps, we propose NarraScore, a hierarchical framework predicated on the core insight that emotion serves as a high-density compression of narrative logic. Uniquely, we repurpose frozen Vision-Language Models (VLMs) as continuous affective sensors, distilling high-dimensional visual streams into dense, narrative-aware Valence-Arousal trajectories. Mechanistically, NarraScore employs a Dual-Branch Injection strategy to reconcile global structure with local dynamism: a \textit{Global Semantic Anchor} ensures stylistic stability, while a surgical \textit{Token-Level Affective Adapter} modulates local tension via direct element-wise residual injection. This minimalist design bypasses the bottlenecks of dense attention and architectural cloning, effectively mitigating the overfitting risks associated with data scarcity. Experiments demonstrate that NarraScore achieves state-of-the-art consistency and narrative alignment with negligible computational overhead, establishing a fully autonomous paradigm for long-video soundtrack generation.
中文标题/摘要
标题:NarraScore:通过层次情感控制连接视觉叙事与音乐动态
为长视频合成连贯的音轨仍然是一个艰巨的挑战,目前受阻于三个关键障碍:计算可扩展性、时间连贯性,以及最关键的、对不断演变的叙事逻辑普遍缺乏语义感知。为弥合这些差距,我们提出NarraScore,这一层次框架基于一个核心洞察:情感是叙事逻辑的高密度压缩。独特地,我们将冻结的视觉-语言模型(VLMs)重新用作连续的情感传感器,将高维视觉流提炼为密集的、具有叙事感知的愉悦-唤醒轨迹。机制上,NarraScore采用双分支注入策略来协调全局结构与局部动态:全局语义锚点确保风格稳定性,而精细的标记级情感适配器通过直接的元素级残差注入调节局部紧张度。这种简约设计绕过了密集注意力和架构克隆的瓶颈,有效缓解了数据稀缺带来的过拟合风险。实验表明,NarraScore在一致性与叙事对齐方面达到最先进水平,且计算开销几乎可以忽略,确立了一种完全自主的长视频音轨生成范式。
Summary / 总结
NarraScore is a hierarchical framework that addresses the challenges of generating coherent soundtracks for long-form videos by leveraging the insight that emotion can compress narrative logic. It uses repurposed Vision-Language Models as affective sensors to generate Valence-Arousal trajectories and employs a Dual-Branch Injection strategy to balance global structure and local dynamism. Experiments show that NarraScore achieves state-of-the-art consistency and narrative alignment with minimal computational cost, setting a new standard for long-video soundtrack generation.
NarraScore 是一个层次框架,基于情感是叙事逻辑高密度压缩这一洞察,为长视频合成连贯的音轨。它将冻结的视觉-语言模型重新用作情感传感器以生成愉悦-唤醒轨迹,并采用双分支注入策略:全局语义锚点确保风格稳定性,标记级情感适配器直接调节局部紧张度。实验表明,NarraScore 在保持极小计算开销的同时实现了最先进的一致性和叙事对齐。
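The "token-level affective adapter ... via direct element-wise residual injection" admits a very small sketch: project a (valence, arousal) pair into the token embedding space and add it as a residual. All names, shapes, and the `beta` gain below are assumptions for illustration; the paper's adapter parameterization is not specified in the abstract.

```python
import numpy as np

def affective_residual_injection(tokens, valence_arousal, W, beta=0.5):
    """Project a (valence, arousal) pair through W and add it element-wise
    as a residual to every token embedding (broadcast over the sequence)."""
    residual = valence_arousal @ W  # (2,) @ (2, d) -> (d,)
    return tokens + beta * residual

rng = np.random.default_rng(0)
d = 8
tokens = rng.normal(size=(5, d))      # 5 music tokens (toy)
W = rng.normal(size=(2, d)) * 0.1     # learned projection (toy values)
va = np.array([0.7, -0.2])            # a calm-positive narrative moment
out = affective_residual_injection(tokens, va, W)
```

Because the injection is a single broadcast add, it costs O(sequence length × d) with no attention over the conditioning signal — consistent with the abstract's claim of negligible overhead.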
DSO: Direct Steering Optimization for Bias Mitigation
Authors: Lucas Monteiro Paes, Nivedha Sivakumar, Yinong Oliver Wang, Masha Fedzechkina, Barry-John Theobald, Luca Zappella, Nicholas Apostoloff
First: 2025-12-17T19:43:46+00:00 · Latest: 2026-02-12T00:30:59+00:00
Abstract
Generative models are often deployed to make decisions on behalf of users, such as vision-language models (VLMs) identifying which person in a room is a doctor to help visually impaired individuals. Yet, VLM decisions are influenced by the perceived demographic attributes of people in the input, which can lead to biased outcomes like failing to identify women as doctors. Moreover, when reducing bias leads to performance loss, users may have varying needs for balancing bias mitigation with overall model capabilities, highlighting the demand for methods that enable controllable bias reduction during inference. Activation steering is a popular approach for inference-time controllability that has shown potential in inducing safer behavior in large language models (LLMs). However, we observe that current steering methods struggle to correct biases, where equiprobable outcomes across demographic groups are required. To address this, we propose Direct Steering Optimization (DSO) which uses reinforcement learning to find linear transformations for steering activations, tailored to mitigate bias while maintaining control over model performance. We demonstrate that DSO achieves state-of-the-art trade-off between fairness and capabilities on both VLMs and LLMs, while offering practitioners inference-time control over the trade-off. Overall, our work highlights the benefit of designing steering strategies that are directly optimized to control model behavior, providing more effective bias intervention than methods that rely on pre-defined heuristics for controllability.
中文标题/摘要
标题:DSO:直接导向优化以减轻偏差
生成模型通常被部署为代用户做出决策,例如视觉语言模型(VLMs)识别房间中哪个人是医生,以帮助视力受损的个体。然而,VLM 的决策会受到输入中人物被感知的人口统计属性的影响,这可能导致有偏差的结果,例如未能将女性识别为医生。此外,当减少偏差导致性能损失时,用户在平衡偏差缓解与整体模型能力方面可能有不同的需求,这突显了对能够在推理时可控地减少偏差的方法的需求。激活导向是一种流行的推理时可控性方法,在诱导大型语言模型(LLMs)产生更安全的行为方面显示出潜力。然而,我们观察到当前的导向方法难以纠正那些要求不同人口群体获得等概率结果的偏差。为了解决这个问题,我们提出了直接导向优化(DSO),它使用强化学习来寻找用于导向激活的线性变换,在减轻偏差的同时保持对模型性能的控制。我们证明DSO在VLMs和LLMs上实现了公平性和能力之间的最先进权衡,同时为实践者提供了对该权衡的推理时控制。总体而言,我们的工作突显了设计直接优化以控制模型行为的导向策略的好处,相比依赖预定义启发式实现可控性的方法,它提供了更有效的偏差干预。
Summary / 总结
The research aims to address the issue of bias in generative models like vision-language models (VLMs) and large language models (LLMs) by proposing Direct Steering Optimization (DSO), which uses reinforcement learning to find linear transformations for steering activations. This method enables controllable bias reduction while maintaining model performance. Experimental results show that DSO achieves the best trade-off between fairness and capabilities on both VLMs and LLMs, offering practitioners control over the bias-performance trade-off during inference.
研究旨在通过提出直接导向优化(DSO)方法来解决生成模型(如视觉语言模型(VLMs)和大型语言模型(LLMs))中的偏差问题,该方法利用强化学习寻找用于导向激活的线性变换,在保持模型性能的同时实现可控的偏差减少。实验表明,DSO在VLMs和LLMs上实现了公平性和能力之间的最先进权衡,并为实践者提供了推理时对该权衡的控制。
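The inference-time mechanics of activation steering with a learned linear transform can be sketched as follows. The RL optimization that finds the transform is the paper's contribution and is omitted here; the affine form `h @ A + b`, the interpolation by `strength`, and all names are illustrative assumptions.

```python
import numpy as np

def steer_activations(h, A, b, strength=1.0):
    """Apply a learned affine steering transform to hidden activations h.

    strength=0 recovers the original model; strength=1 applies full steering,
    giving inference-time control over the fairness/capability trade-off.
    """
    steered = h @ A + b
    return (1.0 - strength) * h + strength * steered

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 16))  # a batch of hidden states (toy)
A = np.eye(16) + 0.01 * rng.normal(size=(16, 16))  # near-identity transform (toy)
b = 0.05 * rng.normal(size=16)

h_off = steer_activations(h, A, b, strength=0.0)  # identical to no intervention
h_on = steer_activations(h, A, b, strength=1.0)   # fully steered
```

Exposing `strength` as a user-facing dial is what lets different deployments choose their own point on the bias-mitigation vs. capability curve without retraining.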
Hierarchical Concept Embedding & Pursuit for Interpretable Image Classification
Authors: Nghia Nguyen, Tianjiao Ding, René Vidal
First: 2026-02-11T23:53:15+00:00 · Latest: 2026-02-11T23:53:15+00:00
Abstract
Interpretable-by-design models are gaining traction in computer vision because they provide faithful explanations for their predictions. In image classification, these models typically recover human-interpretable concepts from an image and use them for classification. Sparse concept recovery methods leverage the latent space of vision-language models to represent image embeddings as a sparse combination of concept embeddings. However, because such methods ignore the hierarchical structure of concepts, they can produce correct predictions with explanations that are inconsistent with the hierarchy. In this work, we propose Hierarchical Concept Embedding \& Pursuit (HCEP), a framework that induces a hierarchy of concept embeddings in the latent space and uses hierarchical sparse coding to recover the concepts present in an image. Given a hierarchy of semantic concepts, we construct a corresponding hierarchy of concept embeddings and, assuming the correct concepts for an image form a rooted path in the hierarchy, derive desirable conditions for identifying them in the embedded space. We show that hierarchical sparse coding reliably recovers hierarchical concept embeddings, whereas vanilla sparse coding fails. Our experiments on real-world datasets demonstrate that HCEP outperforms baselines in concept precision and recall while maintaining competitive classification accuracy. Moreover, when the number of samples is limited, HCEP achieves superior classification accuracy and concept recovery. These results show that incorporating hierarchical structures into sparse coding yields more reliable and interpretable image classification models.
中文标题/摘要
标题:层次概念嵌入与追求以实现可解释的图像分类
设计上可解释的模型在计算机视觉中正日益受到重视,因为它们能够为预测提供忠实的解释。在图像分类中,这些模型通常从图像中恢复出人类可理解的概念,并使用这些概念进行分类。稀疏概念恢复方法利用视觉-语言模型的潜在空间,将图像嵌入表示为概念嵌入的稀疏组合。然而,由于这些方法忽略了概念的层次结构,它们可能在给出正确预测的同时,产生与层次结构不一致的解释。在本文中,我们提出了层次概念嵌入与追求(HCEP)框架,该框架在潜在空间中诱导概念嵌入的层次结构,并使用层次稀疏编码来恢复图像中存在的概念。给定语义概念的层次结构,我们构建相应的概念嵌入层次结构,并在假设图像的正确概念构成层次结构中一条以根为起点的路径的前提下,推导出在嵌入空间中识别它们的理想条件。我们证明了层次稀疏编码能够可靠地恢复层次概念嵌入,而普通稀疏编码则会失败。在真实数据集上的实验表明,HCEP 在概念精确度和召回率方面优于基线模型,同时保持了有竞争力的分类准确性。此外,当样本数量有限时,HCEP 在分类准确性和概念恢复方面表现更优。这些结果表明,将层次结构纳入稀疏编码可以得到更可靠、更可解释的图像分类模型。
Summary / 总结
This work addresses the need for interpretable image classification models by proposing Hierarchical Concept Embedding & Pursuit (HCEP), which incorporates the hierarchical structure of concepts in the latent space of vision-language models. HCEP uses hierarchical sparse coding to recover concepts from images, leading to better concept precision and recall compared to vanilla sparse coding. Experiments show that HCEP outperforms baselines in both classification accuracy and concept recovery, especially with limited samples.
研究旨在通过将层次结构纳入稀疏编码来开发可解释的图像分类模型。提出的层次概念嵌入与追求(HCEP)框架构建概念的层次结构,并使用层次稀疏编码从图像中恢复相关概念。实验表明,HCEP 在概念精确度和召回率方面优于基线方法,同时保持了竞争力的分类准确性,尤其是在数据有限的情况下。
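The rooted-path constraint can be illustrated with a toy stand-in for hierarchical sparse coding: enumerate root-to-leaf paths in a small concept tree and keep the path whose concept embeddings best reconstruct the image embedding by least squares. The exhaustive search, the tiny two-path tree, and all names here are illustrative simplifications, not HCEP's actual algorithm.

```python
import numpy as np

def best_rooted_path(x, paths, embeddings):
    """Pick the root-to-leaf concept path whose embeddings best explain x
    (least-squares residual as a toy stand-in for hierarchical sparse coding)."""
    best, best_err = None, np.inf
    for path in paths:
        D = np.stack([embeddings[c] for c in path], axis=1)  # (d, |path|)
        coef, *_ = np.linalg.lstsq(D, x, rcond=None)
        err = np.linalg.norm(x - D @ coef)
        if err < best_err:
            best, best_err = path, err
    return best

rng = np.random.default_rng(0)
emb = {c: rng.normal(size=16)
       for c in ["animal", "dog", "cat", "beagle", "tabby"]}
paths = [("animal", "dog", "beagle"), ("animal", "cat", "tabby")]

# An image embedding built mostly from the dog path, plus small noise.
x = (1.0 * emb["animal"] + 0.8 * emb["dog"] + 0.6 * emb["beagle"]
     + 0.05 * rng.normal(size=16))
path = best_rooted_path(x, paths, emb)
```

Restricting candidates to rooted paths is what rules out hierarchy-inconsistent explanations (e.g. "animal + tabby" without "cat"), which vanilla sparse coding over the flat concept dictionary cannot enforce.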
Bootstrapping Action-Grounded Visual Dynamics in Unified Vision-Language Models
Authors: Yifu Qiu, Yftah Ziser, Anna Korhonen, Shay B. Cohen, Edoardo M. Ponti
First: 2025-06-06T11:50:18+00:00 · Latest: 2026-02-11T20:03:58+00:00
Abstract
Can unified vision-language models (VLMs) perform forward dynamics prediction (FDP), i.e., predicting the future state (in image form) given the previous observation and an action (in language form)? We find that VLMs struggle to generate physically plausible transitions between frames from instructions. Nevertheless, we identify a crucial asymmetry in multimodal grounding: fine-tuning a VLM to learn inverse dynamics prediction (IDP), effectively captioning the action between frames, is significantly easier than learning FDP. In turn, IDP can be used to bootstrap FDP through two main strategies: 1) weakly supervised learning from synthetic data and 2) inference time verification. Firstly, IDP can annotate actions for unlabelled pairs of video frame observations to expand the training data scale for FDP. Secondly, IDP can assign rewards to multiple samples of FDP to score them, effectively guiding search at inference time. We evaluate the FDP resulting from both strategies through the task of action-centric image editing on Aurora-Bench with two families of VLMs. Despite remaining general-purpose, our best model achieves a performance competitive with state-of-the-art image editing models, improving on them by a margin between $7\%$ and $13\%$ according to GPT4o-as-judge, and achieving the best average human evaluation across all subsets of Aurora-Bench.
中文标题/摘要
标题:统一视觉语言模型中的基于动作的视觉动力学自举
统一视觉语言模型(VLMs)能否执行前向动力学预测(FDP),即在给定先前观察和动作(语言形式)的情况下预测未来状态(图像形式)?我们发现,VLMs 难以根据指令生成物理上合理的帧间过渡。然而,我们发现多模态接地中的一个关键不对称性:将 VLM 微调以学习逆向动力学预测(IDP),即有效描述帧间发生的动作,比学习 FDP 要容易得多。反过来,IDP 可以通过两种主要策略来自举 FDP:1)基于合成数据的弱监督学习,2)推理时验证。首先,IDP 可以为未标注的视频帧观察对标注动作,以扩大 FDP 的训练数据规模。其次,IDP 可以为 FDP 的多个样本分配奖励并打分,从而在推理时有效指导搜索。我们通过在 Aurora-Bench 上以动作为中心的图像编辑任务,用两个家族的 VLMs 评估了这两种策略得到的 FDP。尽管保持通用性,我们的最佳模型取得了与最先进图像编辑模型相当的性能,在 GPT4o-as-judge 的评估下比它们高出 7% 至 13%,并在 Aurora-Bench 的所有子集上取得最佳的平均人类评估得分。
Summary / 总结
The study aims to explore whether unified vision-language models (VLMs) can predict future states given past observations and actions. Despite VLMs struggling with generating physically plausible transitions, the research identifies that learning inverse dynamics prediction (IDP) is easier and can be used to bootstrap forward dynamics prediction (FDP). Two strategies are proposed: weakly supervised learning from synthetic data and inference-time verification. The best model, while remaining general-purpose, shows competitive performance in action-centric image editing, improving state-of-the-art models by 7% to 13% according to GPT4o-as-judge and achieving the highest average human evaluation across all subsets of Aurora-Bench.
研究旨在探索统一的视觉-语言模型是否能够根据过去的观察和动作预测未来状态。尽管在生成物理上合理的过渡方面存在挑战,但研究发现逆向动力学预测(IDP)比正向动力学预测(FDP)更容易学习。IDP可以通过从合成数据的弱监督学习和推理时验证来促进FDP。在Aurora-Bench上的评估显示,最佳模型的性能与最先进的图像编辑模型相当,根据GPT4o-as-judge的评估,其改进幅度在7%到13%之间,并且在Aurora-Bench的所有子集上获得了最高的平均人类评估得分。
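The second bootstrapping strategy — IDP as an inference-time verifier — is a best-of-n selection loop: sample several candidate future states from the forward model and keep the one whose inferred action best matches the instruction. The sketch below mocks both models with trivial numeric stand-ins (states are scalars, the "action" is a +3 shift); only the selection structure reflects the paper.

```python
import random

def best_of_n(observation, action, fdp_sample, idp_score, n=8):
    """Inference-time verification: sample n future-state candidates from the
    forward model and keep the one the inverse model scores highest."""
    candidates = [fdp_sample(observation, action) for _ in range(n)]
    return max(candidates, key=lambda c: idp_score(observation, c, action))

random.seed(0)

def fdp_sample(obs, action):
    return obs + random.gauss(3.0, 2.0)  # noisy forward prediction of "+3"

def idp_score(obs, nxt, action):
    return -abs((nxt - obs) - 3.0)       # how well the transition matches "+3"

best = best_of_n(10.0, "add 3", fdp_sample, idp_score, n=16)
```

This mirrors the asymmetry the paper exploits: even a noisy generator becomes useful once a reliable inverse-dynamics scorer can rank its samples.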