Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment
Authors: Jacky Kwok, Xilun Zhang, Mengdi Xu, Yuejiang Liu, Azalia Mirhoseini, Chelsea Finn, Marco Pavone
First: 2026-02-12T18:59:59+00:00 · Latest: 2026-02-12T18:59:59+00:00
Abstract
The long-standing vision of general-purpose robots hinges on their ability to understand and act upon natural language instructions. Vision-Language-Action (VLA) models have made remarkable progress toward this goal, yet their generated actions can still misalign with the given instructions. In this paper, we investigate test-time verification as a means to shrink the "intention-action gap." We first characterize the test-time scaling law for embodied instruction following and demonstrate that jointly scaling the number of rephrased instructions and generated actions greatly increases test-time sample diversity, often recovering correct actions more efficiently than scaling each dimension independently. To capitalize on these scaling laws, we present CoVer, a contrastive verifier for vision-language-action alignment, and show that our architecture scales gracefully with additional computational resources and data. We then introduce "boot-time compute" and a hierarchical verification inference pipeline for VLAs. At deployment, our framework precomputes a diverse set of rephrased instructions from a Vision-Language-Model (VLM), repeatedly generates action candidates for each instruction, and then uses a verifier to select the optimal high-level prompt and low-level action chunks. Compared to scaling policy pre-training on the same data, our verification approach yields 22% gains in-distribution and 13% out-of-distribution on the SIMPLER benchmark, with a further 45% improvement in real-world experiments. On the PolaRiS benchmark, CoVer achieves 14% gains in task progress and 9% in success rate.
中文标题/摘要
标题:扩展验证比扩展策略学习更能有效实现视觉-语言-行动对齐
通用机器人长期愿景依赖于它们理解和执行自然语言指令的能力。视觉-语言-行动(VLA)模型在这一目标上取得了显著进展,但它们生成的动作仍然可能与给定的指令不一致。在本文中,我们研究测试时验证作为缩小“意图-行动差距”的手段。我们首先表征了基于指令的执行的测试时扩展定律,证明了同时扩展重述指令的数量和生成动作的数量大大增加了测试时样本多样性,通常比独立扩展每个维度更有效地恢复正确的动作。为了利用这些扩展定律,我们提出了CoVer,一种对比验证器,用于视觉-语言-行动对齐,并展示了我们的架构在额外计算资源和数据上的扩展能力。然后,我们介绍了“启动时计算”和层次验证推理管道。在部署时,我们的框架从视觉语言模型(VLM)预计算一组多样化的重述指令,反复为每条指令生成动作候选,然后使用验证器选择最优的高层提示和低层动作片段。与在相同数据上扩展策略预训练相比,我们的验证方法在SIMPLER基准测试中获得了22%的同分布改进和13%的异分布改进,在实际实验中进一步提高了45%。在PolaRiS基准测试中,CoVer实现了14%的任务进展和9%的成功率改进。
Summary / 总结
This paper explores test-time verification as a method to improve the alignment between actions and natural language instructions in vision-language-action models. It demonstrates that jointly scaling the number of rephrased instructions and generated actions increases test-time sample diversity, leading to more efficient recovery of correct actions. The proposed CoVer architecture scales gracefully with additional resources, and the framework precomputes diverse rephrased instructions, generating action candidates and using a verifier to select the optimal actions. The verification approach outperforms scaling policy learning, achieving 22% and 13% gains in-distribution and out-of-distribution on the SIMPLER benchmark, and 14% and 9% improvements in task progress and success rate on the PolaRiS benchmark.
本文探讨了测试时验证作为提高视觉-语言-行动模型中动作与自然语言指令之间对齐的方法。通过表征指令跟随的缩放定律,作者表明同时缩放重述指令的数量和生成的动作可以提高测试时样本多样性。他们引入了CoVer对比验证器,该验证器在额外资源下可以平滑扩展。提出的框架包括启动时计算和分层验证推理管道,在SIMPLER基准测试中实现了22%和13%的分布内和分布外改进,而在实际实验中进一步提高了45%。在PolaRiS基准测试中,CoVer展示了14%的任务进度提升和9%的成功率提升。
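The selection step described in the abstract (rephrase the instruction, sample several actions per phrasing, let a verifier pick the best pair) can be illustrated with a toy sketch. Everything here is a stand-in and not the CoVer implementation: `toy_policy`, `toy_verifier`, `PHRASING_MEAN`, and `TARGET` are hypothetical names, the "policy" is Gaussian noise around a phrasing-dependent mean, and the "verifier" is a distance to a known target. A real verifier would score observations, instructions, and action chunks with a learned model.

```python
import numpy as np

rng = np.random.default_rng(0)
TARGET = np.array([1.0, 0.0])  # hypothetical "correct" action for this toy task

# Hypothetical effect of each phrasing on the toy policy's mean action.
PHRASING_MEAN = {
    "pick up the cup": -1.0,
    "grasp the cup": 0.0,
    "lift the cup by its handle": 1.0,
}

def toy_policy(instruction, n_actions):
    """Stand-in for a VLA policy: sample n_actions 2-D actions for one phrasing."""
    mean = np.array([PHRASING_MEAN[instruction], 0.0])
    return mean + 0.1 * rng.standard_normal((n_actions, 2))

def toy_verifier(action):
    """Stand-in for a learned verifier: a higher score means better alignment."""
    return -float(np.linalg.norm(action - TARGET))

def select(instructions, n_actions):
    """Score every (phrasing, action) pair and return the best-scoring one."""
    candidates = [
        (toy_verifier(action), instruction, action)
        for instruction in instructions
        for action in toy_policy(instruction, n_actions)
    ]
    return max(candidates, key=lambda c: c[0])

score, instruction, action = select(list(PHRASING_MEAN), n_actions=4)
```

In this toy setup only one phrasing places the policy's samples near the target, so widening the set of phrasings is what makes a good candidate available for the verifier to pick; sampling more actions from a poorly matched phrasing would not.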
CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning
Authors: Denis Makhov, Dmitriy Shopkhoev, Magauiya Zhussip, Ammar Ali, Stamatios Lefkimmiatis
First: 2025-09-26T08:55:09+00:00 · Latest: 2026-02-12T17:32:59+00:00
Abstract
Post-training compression of large language models (LLMs) often relies on low-rank weight approximations that represent each column of the weight matrix in a shared low-dimensional subspace. This strategy is computationally efficient but the underlying constraint can be overly rigid for heterogeneous projection weights and may incur avoidable accuracy loss. We propose CoSpaDi (Compression via Sparse Dictionary Learning), a training-free framework that replaces low-rank factorization with a structured sparse decomposition in which each weight matrix is represented as a dense dictionary multiplied by a column-sparse coefficient matrix. This yields a union-of-subspaces model: the columns of the weight matrix are represented as linear combinations of different subsets of dictionary atoms, improving expressiveness at a fixed parameter budget. CoSpaDi is calibration-guided: using a small calibration set, we optimize the factorization to minimize functional reconstruction error of layer outputs rather than weight-space error. An activation-derived Gram orthonormalization reformulates this data-aware objective into a standard dictionary learning problem on transformed weights, and we support both per-layer compression and cross-layer dictionary sharing within groups of similar projections. Across Llama and Qwen model families, CoSpaDi consistently improves the accuracy-compression and perplexity-compression trade-offs over state-of-the-art SVD-based baselines and strong structured pruning baselines at 20-40% compression ratios. The resulting structured sparsity enables sparse-dense computation and integrates with post-training quantization of the sparse coefficients.
中文标题/摘要
标题:CoSpaDi: 通过校准引导的稀疏字典学习压缩大语言模型
大语言模型(LLMs)的后训练压缩通常依赖于低秩权重近似,将权重矩阵的每一列表示在共享的低维子空间中。这种策略计算效率高,但其背后的约束对于异构投影权重来说可能过于僵硬,可能会导致不必要的准确度损失。我们提出了CoSpaDi(通过稀疏字典学习压缩),这是一种无需训练的框架,用结构化稀疏分解替代低秩分解,其中每个权重矩阵表示为一个稠密字典乘以一个列稀疏的系数矩阵。这产生了一种子空间并集模型:权重矩阵的列表示为不同字典原子子集的线性组合,从而在固定参数预算下提高表达能力。CoSpaDi 是校准引导的:使用一个小的校准集,我们优化分解以最小化层输出的功能重建误差,而不是权重空间误差。激活衍生的Gram正交化将这种数据感知目标重新表述为转换后的权重上的标准字典学习问题,并且我们支持逐层压缩和组内相似投影之间的跨层字典共享。在Llama和Qwen模型家族中,CoSpaDi 在20-40% 压缩比下始终优于基于SVD的先进基线和强大的结构化剪枝基线,提高了准确度-压缩和困惑度-压缩的权衡。由此产生的结构稀疏性使稀疏-密集计算成为可能,并与稀疏系数的后训练量化集成。
Summary / 总结
CoSpaDi is a training-free compression framework for large language models that uses a structured sparse decomposition to replace low-rank factorization, allowing for a more flexible representation of weight matrices and improved expressiveness at a fixed parameter budget. It optimizes the factorization using a small calibration set to minimize functional reconstruction error, and supports both per-layer compression and cross-layer dictionary sharing. Experiments on Llama and Qwen models show that CoSpaDi outperforms state-of-the-art SVD-based and structured pruning baselines at 20-40% compression ratios.
CoSpaDi 是一种无需训练的大型语言模型(LLM)压缩框架,它使用结构化的稀疏分解来替代低秩分解,从而在固定参数预算下提高表达能力和准确性。该框架使用一个小的校准集来优化分解,以最小化层输出的功能重构误差,并支持逐层压缩和跨层字典共享。实验表明,CoSpaDi 在 Llama 和 Qwen 模型家族中,在 20-40% 压缩比下,优于最先进的 SVD 基准和结构化剪枝基准。
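The structure named in the abstract, a dense dictionary times a column-sparse coefficient matrix, can be sketched in a few lines of NumPy. This is an illustration of the factorized form and of one way a data-aware objective can be rewritten on transformed weights, not the authors' algorithm: the sizes, the random factors, and the use of a Cholesky factor of the activation Gram matrix are assumptions made for the example, and no dictionary learning is performed.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k, s = 64, 128, 32, 4   # illustrative sizes, not taken from the paper

D = rng.standard_normal((m, k))   # dense dictionary with k atoms
S = np.zeros((k, n))              # column-sparse coefficients: s nonzeros per column
supports = []
for j in range(n):
    support = rng.choice(k, size=s, replace=False)   # atoms used by column j
    S[support, j] = rng.standard_normal(s)
    supports.append(support)

W_hat = D @ S   # each column lies in the span of its own s atoms

dense_params = m * n
factor_values = m * k + n * s   # stored values only; index overhead not counted

# Data-aware error: with G = X^T X = L L^T (Cholesky, an assumption of this sketch),
# ||X (W - W_hat)||_F equals ||L^T (W - W_hat)||_F, a weight-space error on
# transformed weights.
X = rng.standard_normal((256, m))   # stand-in calibration activations
W = rng.standard_normal((m, n))     # stand-in original weights
L = np.linalg.cholesky(X.T @ X)
functional_err = np.linalg.norm(X @ (W - W_hat))
transformed_err = np.linalg.norm(L.T @ (W - W_hat))
```

Different columns draw on different atom subsets, which is the union-of-subspaces property the abstract contrasts with a single shared low-rank subspace.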
LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs
Authors: Yujun Zhou, Jingdong Yang, Yue Huang, Kehan Guo, Zoe Emory, Bikram Ghosh, Amita Bedar, Sujay Shekar, Zhenwen Liang, Pin-Yu Chen, Tian Gao, Werner Geyer, Nuno Moniz, Nitesh V Chawla, Xiangliang Zhang
Venue: Nat Mach Intell 8, 20-31 (2026)
First: 2024-10-18T05:21:05+00:00 · Latest: 2026-02-12T17:29:23+00:00
Comments: Published at Nature Machine Intelligence
Abstract
Artificial Intelligence (AI) is revolutionizing scientific research, yet its growing integration into laboratory environments presents critical safety challenges. Large language models (LLMs) and vision language models (VLMs) now assist in experiment design and procedural guidance, yet their "illusion of understanding" may lead researchers to overtrust unsafe outputs. Here we show that current models remain far from meeting the reliability needed for safe laboratory operation. We introduce LabSafety Bench, a comprehensive benchmark that evaluates models on hazard identification, risk assessment, and consequence prediction across 765 multiple-choice questions and 404 realistic lab scenarios, encompassing 3,128 open-ended tasks. Evaluations on 19 advanced LLMs and VLMs show that no model evaluated on hazard identification surpasses 70% accuracy. While proprietary models perform well on structured assessments, they do not show a clear advantage in open-ended reasoning. These results underscore the urgent need for specialized safety evaluation frameworks before deploying AI systems in real laboratory settings.
中文标题/摘要
标题:实验室安全台:评估大型语言模型在科学实验室安全问题上的表现
人工智能(AI)正在革新科学研究,但其在实验室环境中的日益集成带来了关键的安全挑战。大型语言模型(LLMs)和视觉语言模型(VLMs)现在协助实验设计和程序指导,但它们的“理解错觉”可能导致研究人员过度信任不安全的输出。我们展示当前模型远未达到安全实验室操作所需的可靠性。我们引入了LabSafety Bench,这是一个全面的基准测试,评估模型在危害识别、风险评估和后果预测方面的表现,涵盖765个多项选择题和404个现实实验室场景,共计3,128个开放式任务。对19个先进LLM和VLM的评估显示,在危害识别方面,没有模型的准确率超过70%。虽然专有模型在结构化评估中表现良好,但在开放式推理方面并没有明显优势。这些结果强调了在实际实验室环境中部署AI系统之前,迫切需要专门的安全评估框架。
Summary / 总结
The research aims to address the critical safety challenges posed by the integration of AI in scientific laboratories. It introduces LabSafety Bench, a benchmark that evaluates models on hazard identification, risk assessment, and consequence prediction. Evaluations on 19 advanced LLMs and VLMs reveal that no model achieves over 70% accuracy in hazard identification, highlighting the need for specialized safety evaluation frameworks before deploying AI systems in real laboratory settings.
研究旨在应对AI在科学实验室中集成所带来的安全挑战。引入了LabSafety Bench基准测试模型在危害识别、风险评估和后果预测方面的表现。对19种先进LLM和VLM的评估显示,没有模型在危害识别上的准确率超过70%,强调了在实际实验室环境中部署AI系统前需要专门的安全评估框架的紧迫性。
Chatting with Images for Introspective Visual Thinking
Authors: Junfei Wu, Jian Guan, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, Tieniu Tan
First: 2026-02-11T17:42:37+00:00 · Latest: 2026-02-12T16:49:33+00:00
Abstract
Current large vision-language models (LVLMs) typically rely on text-only reasoning based on a single-pass visual encoding, which often leads to loss of fine-grained visual information. The recently proposed "thinking with images" paradigm attempts to alleviate this limitation by manipulating images via external tools or code; however, the resulting visual states are often insufficiently grounded in linguistic semantics, impairing effective cross-modal alignment - particularly when visual semantics or geometric relationships must be reasoned over across distant regions or multiple images. To address these challenges, we propose "chatting with images", a new framework that reframes visual manipulation as language-guided feature modulation. Under the guidance of expressive language prompts, the model dynamically performs joint re-encoding over multiple image regions, enabling tighter coupling between linguistic reasoning and visual state updates. We instantiate this paradigm in ViLaVT, a novel LVLM equipped with a dynamic vision encoder explicitly designed for such interactive visual reasoning, and train it with a two-stage curriculum combining supervised fine-tuning and reinforcement learning to promote effective reasoning behaviors. Extensive experiments across eight benchmarks demonstrate that ViLaVT achieves strong and consistent improvements, with particularly pronounced gains on complex multi-image and video-based spatial reasoning tasks.
中文标题/摘要
标题:基于图像对话的内省视觉思考
当前的大规模视觉-语言模型(LVLMs)通常依赖基于单次视觉编码的纯文本推理,这往往会导致精细视觉信息的丢失。最近提出的“通过图像思考”试图通过外部工具或代码操作图像来缓解这一限制;然而,由此产生的视觉状态往往缺乏语言语义的充分支撑,影响了有效的跨模态对齐——特别是在需要在远距离区域或多个图像之间推理视觉语义或几何关系时。为了解决这些挑战,我们提出了一种新的“基于图像的对话”框架,将视觉操作重新构想为语言引导的特征调制。在表达性语言提示的指导下,模型动态地对多个图像区域进行联合重新编码,从而增强了语言推理与视觉状态更新之间的耦合。我们通过ViLaVT实现这一范式,这是一种新型的LVLM,配备了专门设计用于此类交互式视觉推理的动态视觉编码器,并通过结合监督微调和强化学习的两阶段课程训练,促进有效的推理行为。在八个基准测试中的广泛实验表明,ViLaVT实现了显著且一致的改进,特别是在复杂的多图像和基于视频的空间推理任务上表现尤为突出。
Summary / 总结
The paper addresses the limitation of current large vision-language models (LVLMs) in handling fine-grained visual information and proposes a new framework called 'chatting with images' to improve cross-modal alignment. This framework reframes visual manipulation as language-guided feature modulation, allowing the model to dynamically re-encode multiple image regions under the guidance of language prompts. The model, ViLaVT, is trained with a two-stage curriculum combining supervised fine-tuning and reinforcement learning. Experiments show that ViLaVT achieves strong and consistent improvements across eight benchmarks, especially on complex multi-image and video-based spatial reasoning tasks.
论文针对当前大型视觉-语言模型(LVLM)在处理细粒度视觉信息和跨模态对齐不足的问题,提出了一种新的框架‘与图像对话’,将视觉操作重新定义为语言引导的特征调制。开发了具有动态视觉编码器的新型LVLM ViLaVT,并使用两阶段课程进行训练。实验表明,ViLaVT 在八个基准测试上取得了显著且一致的改进,在复杂的多图像和视频空间推理任务上提升尤为明显。
3DGSNav: Enhancing Vision-Language Model Reasoning for Object Navigation via Active 3D Gaussian Splatting
Authors: Wancai Zheng, Hao Chen, Xianlong Lu, Linlin Ou, Xinyi Yu
First: 2026-02-12T16:41:26+00:00 · Latest: 2026-02-12T16:41:26+00:00
Abstract
Object navigation is a core capability of embodied intelligence, enabling an agent to locate target objects in unknown environments. Recent advances in vision-language models (VLMs) have facilitated zero-shot object navigation (ZSON). However, existing methods often rely on scene abstractions that convert environments into semantic maps or textual representations, causing high-level decision making to be constrained by the accuracy of low-level perception. In this work, we present 3DGSNav, a novel ZSON framework that embeds 3D Gaussian Splatting (3DGS) as persistent memory for VLMs to enhance spatial reasoning. Through active perception, 3DGSNav incrementally constructs a 3DGS representation of the environment, enabling trajectory-guided free-viewpoint rendering of frontier-aware first-person views. Moreover, we design structured visual prompts and integrate them with Chain-of-Thought (CoT) prompting to further improve VLM reasoning. During navigation, a real-time object detector filters potential targets, while VLM-driven active viewpoint switching performs target re-verification, ensuring efficient and reliable recognition. Extensive evaluations across multiple benchmarks and real-world experiments on a quadruped robot demonstrate that our method achieves robust and competitive performance against state-of-the-art approaches. Project page: https://aczheng-cai.github.io/3dgsnav.github.io/
中文标题/摘要
标题:3DGSNav:通过主动三维高斯点绘制增强视觉-语言模型的物体导航推理
物体导航是具身智能的核心能力,使代理能够在未知环境中定位目标物体。近期视觉-语言模型(VLMs)的进步促进了零样本物体导航(ZSON)。然而,现有方法往往依赖于场景抽象,将环境转换为语义地图或文本表示,导致高层决策受限于低层感知的准确性。在本文中,我们提出了3DGSNav,这是一种新颖的ZSON框架,将三维高斯点绘制(3DGS)嵌入为持久记忆,以增强VLM的空间推理。通过主动感知,3DGSNav逐步构建环境的3DGS表示,实现基于轨迹的前方视角感知第一人称视图的自由视角渲染。此外,我们设计了结构化视觉提示,并将其与链式思考(CoT)提示结合使用,以进一步提高VLM的推理能力。在导航过程中,实时物体检测器过滤潜在目标,而VLM驱动的主动视角切换执行目标再验证,确保高效可靠的识别。在多个基准测试和现实世界实验中,我们的方法在四足机器人上展示了稳健且具有竞争力的性能。项目页面:https://aczheng-cai.github.io/3dgsnav.github.io/
Summary / 总结
3DGSNav is a novel framework for zero-shot object navigation that enhances vision-language model reasoning through active 3D Gaussian Splatting. It constructs a 3D representation of the environment, enabling trajectory-guided rendering and improving spatial reasoning. The method integrates structured visual prompts and Chain-of-Thought prompting to further enhance VLM performance. Experiments show that 3DGSNav achieves robust and competitive performance compared to state-of-the-art approaches in both benchmarks and real-world scenarios on a quadruped robot.
3DGSNav 是一种新颖的零样本物体导航框架,通过使用3D高斯点绘(3DGS)作为持久记忆来增强视觉语言模型的推理能力。该方法通过主动感知构建环境的3DGS表示,并通过轨迹引导的方式渲染前沿感知的第一人称视图。此外,该方法还结合了结构化视觉提示和链式思考提示以进一步提高推理效果。实验结果表明,3DGSNav 在多个基准测试和真实世界实验中,特别是在四足机器人上,表现出稳健且具有竞争力的性能。
Self-Attention Decomposition For Training Free Diffusion Editing
Authors: Tharun Anand, Mohammad Hassan Vali, Arno Solin, Green Rosh, BH Pawan Prasad
Venue: ICASSP 2026
First: 2025-10-26T12:22:56+00:00 · Latest: 2026-02-12T16:23:33+00:00
Comments: ICASSP 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Abstract
Diffusion models achieve remarkable fidelity in image synthesis, yet precise control over their outputs for targeted editing remains challenging. A key step toward controllability is to identify interpretable directions in the model's latent representations that correspond to semantic attributes. Existing approaches for finding interpretable directions typically rely on sampling large sets of images or training auxiliary networks, which limits efficiency. We propose an analytical method that derives semantic editing directions directly from the pretrained parameters of diffusion models, requiring neither additional data nor fine-tuning. Our insight is that self-attention weight matrices encode rich structural information about the data distribution learned during training. By computing the eigenvectors of these weight matrices, we obtain robust and interpretable editing directions. Experiments demonstrate that our method produces high-quality edits across multiple datasets while reducing editing time significantly by 60% over current benchmarks.
中文标题/摘要
标题:用于免训练扩散编辑的自注意力分解
扩散模型在图像合成中实现了惊人的保真度,但对输出进行精确控制以实现目标编辑仍然具有挑战性。可控性的关键步骤是识别与语义属性相对应的模型潜在表示中的可解释方向。现有找到可解释方向的方法通常依赖于采样大量图像或训练辅助网络,这限制了效率。我们提出了一种分析方法,可以直接从预训练的扩散模型参数中推导出语义编辑方向,无需额外数据或微调。我们的见解是,自我注意力权重矩阵编码了模型在训练过程中学习到的数据分布的丰富结构信息。通过计算这些权重矩阵的特征向量,我们获得了稳健且可解释的编辑方向。实验表明,我们的方法在多个数据集上生成了高质量的编辑结果,同时将编辑时间显著减少了60%。
Summary / 总结
The research aims to enhance the control over diffusion models for precise image editing. It introduces an analytical method that decomposes self-attention weight matrices to derive interpretable editing directions directly from pretrained parameters, without the need for additional data or fine-tuning. The method significantly reduces editing time by 60% and produces high-quality edits across multiple datasets.
研究旨在通过识别模型潜空间中的可解释方向来增强扩散模型在图像编辑中的可控性。方法是通过分析自注意力权重矩阵来获取这些方向,无需额外数据或微调。关键发现表明,所提出的方法在多个数据集上生成高质量的编辑结果,并将编辑时间显著减少了60%,优于当前基准。
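The idea of reading editing directions off pretrained weights via an eigendecomposition can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: `W_q` is a random stand-in for a pretrained self-attention projection matrix, forming the symmetric product `W_q.T @ W_q` before decomposing is a choice made here so that the eigenvectors are real and orthonormal, and `edit` is a hypothetical helper that shifts a feature vector along one direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W_q = rng.standard_normal((d, d))   # stand-in for a pretrained projection matrix

# Assumption of this sketch: decompose the symmetric matrix W^T W.
eigvals, eigvecs = np.linalg.eigh(W_q.T @ W_q)   # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]                # largest eigenvalue first
directions = eigvecs[:, order[:3]].T             # top-3 candidate directions, shape (3, d)

def edit(h, direction, alpha):
    """Shift a feature vector along one unit-norm direction by strength alpha."""
    return h + alpha * direction

h = rng.standard_normal(d)          # stand-in latent feature
h_edited = edit(h, directions[0], alpha=2.0)
```

No images are sampled and no auxiliary network is trained, which is the efficiency argument made in the abstract; whether a given direction is semantically meaningful would still have to be checked on a real model.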
Kelix Technical Report
Authors: Boyang Ding, Chenglong Chu, Dunju Zang, Han Li, Jiangxia Cao, Kun Gai, Muhao Wei, Ruiming Tang, Shiyao Wang, Siyang Mao, Xinchen Luo, Yahui Liu, Zhixin Ling, Zhuoran Yang, Ziming Li, Chengru Song, Guorui Zhou, Guowang Zhang, Hao Peng, Hao Wang, Jiaxin Deng, Jin Ouyang, Jinghao Zhang, Lejian Ren, Qianqian Wang, Qigen Hu, Tao Wang, Xingmei Wang, Yiping Yang, Zixing Zhang, Ziqi Wang
First: 2026-02-10T14:48:26+00:00 · Latest: 2026-02-12T15:36:05+00:00
Comments: Work in progress
Abstract
Autoregressive large language models (LLMs) scale well by expressing diverse tasks as sequences of discrete natural-language tokens and training with next-token prediction, which unifies comprehension and generation under self-supervision. Extending this paradigm to multimodal data requires a shared, discrete representation across modalities. However, most vision-language models (VLMs) still rely on a hybrid interface: discrete text tokens paired with continuous Vision Transformer (ViT) features. Because supervision is largely text-driven, these models are often biased toward understanding and cannot fully leverage large-scale self-supervised learning on non-text data. Recent work has explored discrete visual tokenization to enable fully autoregressive multimodal modeling, showing promising progress toward unified understanding and generation. Yet existing discrete vision tokens frequently lose information due to limited code capacity, resulting in noticeably weaker understanding than continuous-feature VLMs. We present Kelix, a fully discrete autoregressive unified model that closes the understanding gap between discrete and continuous visual representations.
中文标题/摘要
标题:Kelix 技术报告
自回归大型语言模型(LLMs)通过将多样化的任务表示为离散的自然语言标记序列,并通过下一个标记预测进行训练,从而能够很好地扩展,这将理解与生成统一在自我监督之下。将这一范式扩展到多模态数据需要在不同模态之间共享离散表示。然而,大多数视觉语言模型(VLMs)仍然依赖于混合界面:离散文本标记配以连续的视觉变换器(ViT)特征。由于监督主要来自文本,这些模型往往偏向于理解,而无法充分利用大规模的非文本数据的自我监督学习。最近的工作探索了离散视觉标记化,以实现完全自回归的多模态建模,显示出统一理解和生成的有希望的进展。然而,现有的离散视觉标记经常由于编码容量有限而丢失信息,导致理解能力明显弱于连续特征的VLMs。我们提出了Kelix,一种完全离散的自回归统一模型,以缩小离散和连续视觉表示之间的理解差距。
Summary / 总结
The research aims to improve multimodal understanding by developing a fully discrete autoregressive model, Kelix, which addresses the limitations of current vision-language models that rely on hybrid interfaces. Kelix uses discrete visual tokens to enable unified understanding and generation, overcoming the information loss issue faced by previous models. Key experimental findings show that Kelix closes the understanding gap between discrete and continuous visual representations, demonstrating promising progress in multimodal modeling.
研究旨在通过开发一个完全离散的自回归统一模型Kelix来提高多模态理解,解决当前依赖混合接口的视觉语言模型的局限性。Kelix使用离散视觉标记来实现统一的理解和生成,克服了现有离散视觉标记中的信息丢失问题。主要发现表明,Kelix在视觉表示的离散和连续之间缩小了理解差距,展示了多模态建模的进展。
Can Local Vision-Language Models improve Activity Recognition over Vision Transformers? -- Case Study on Newborn Resuscitation
Authors: Enrico Guerriero, Kjersti Engan, Øyvind Meinich-Bache
First: 2026-02-12T14:31:10+00:00 · Latest: 2026-02-12T14:31:10+00:00
Comments: Presented at the Satellite Workshop on Workshop 15: Generative AI for World Simulations and Communications & Celebrating 40 Years of Excellence in Education: Honoring Professor Aggelos Katsaggelos, IEEE International Conference on Image Processing (ICIP), 2025
Abstract
Accurate documentation of newborn resuscitation is essential for quality improvement and adherence to clinical guidelines, yet remains underutilized in practice. Previous work using 3D-CNNs and Vision Transformers (ViT) has shown promising results in detecting key activities from newborn resuscitation videos, but also highlighted the challenges in recognizing such fine-grained activities. This work investigates the potential of generative AI (GenAI) methods to improve activity recognition from such videos. Specifically, we explore the use of local vision-language models (VLMs), combined with large language models (LLMs), and compare them to a supervised TimeSformer baseline. Using a simulated dataset comprising 13.26 hours of newborn resuscitation videos, we evaluate several zero-shot VLM-based strategies and fine-tuned VLMs with classification heads, including Low-Rank Adaptation (LoRA). Our results suggest that small (local) VLMs struggle with hallucinations, but when fine-tuned with LoRA they reach an F1 score of 0.91, surpassing the TimeSformer result of 0.70.
中文标题/摘要
标题:本地视觉语言模型能否在活动识别上超越视觉Transformer?新生儿复苏案例研究
准确记录新生儿复苏过程对于质量改进和遵守临床指南至关重要,但在实践中却未得到充分利用。先前使用3D-CNN和视觉变换器(ViT)的工作在检测新生儿复苏视频中的关键活动方面取得了令人鼓舞的结果,但也指出了识别此类细粒度活动的挑战。本研究探讨了生成式人工智能(GenAI)方法在提高此类视频活动识别方面的潜力。具体而言,我们探索了局部视觉语言模型(VLM)与大型语言模型(LLM)结合使用的方法,并将其与监督TimeSformer基线进行比较。使用包含13.26小时新生儿复苏视频的模拟数据集,我们评估了几种零样本VLM策略和带有分类头的微调VLM,包括低秩适应(LoRA)。结果显示,小型(局部)VLM在幻觉方面存在困难,但通过LoRA微调后,F1分数达到0.91,超过了TimeSformer的0.70。
Summary / 总结
This study investigates the use of local vision-language models (VLMs) combined with large language models (LLMs) to improve activity recognition in newborn resuscitation videos, comparing them to a supervised TimeSformer baseline. Using a simulated dataset, the study finds that fine-tuned VLMs with LoRA achieve an F1 score of 0.91, surpassing the TimeSformer baseline's F1 score of 0.70, although small local VLMs struggle with hallucinations.
本研究旨在通过使用局部视觉-语言模型(VLM)和大型语言模型(LLM),提高新生儿复苏视频中的活动识别准确性,与监督学习的TimeSformer基线相比。在包含13.26小时新生儿复苏视频的模拟数据集上的评估结果显示,经过LoRA微调的VLM实现了0.91的F1分数,超过了TimeSformer基线的0.70分数,尽管小型本地VLM存在幻觉问题。
Benchmarking Vision-Language Models for French PDF-to-Markdown Conversion
Authors: Bruno Rigal, Victor Dupriez, Alexis Mignon, Ronan Le Hy, Nicolas Mery
First: 2026-02-12T13:55:43+00:00 · Latest: 2026-02-12T13:55:43+00:00
Comments: 13 pages, 6 figures
Abstract
This report evaluates PDF-to-Markdown conversion using recent Vision-Language Models (VLMs) on challenging French documents. Document parsing is a critical step for Retrieval-Augmented Generation (RAG) pipelines, where transcription and layout errors propagate to downstream retrieval and grounding. Existing benchmarks often emphasize English or Chinese and can over-penalize benign formatting and linearization choices (e.g., line breaks, list segmentation, alternative table renderings) that are largely irrelevant for downstream use.
We introduce a French-focused benchmark of difficult pages selected via model-disagreement sampling from a corpus of 60,000 documents, covering handwritten forms, complex layouts, dense tables, and graphics-rich pages. Evaluation is performed with unit-test-style checks that target concrete failure modes (text presence, reading order, and local table constraints) combined with category-specific normalization designed to discount presentation-only variance. Across 15 models, we observe substantially higher robustness for the strongest proprietary models on handwriting and forms, while several open-weights systems remain competitive on standard printed layouts.
中文标题/摘要
标题:视觉-语言模型在法语PDF转Markdown评估
本报告评估了使用近期视觉-语言模型(VLMs)对具有挑战性的法语文档进行PDF转Markdown的转换。文档解析是检索增强生成(RAG)流水线中的关键步骤,其中转录和布局错误会传播到下游检索和定位。现有基准测试通常强调英语或中文,并且往往会过度惩罚无害的格式和线性化选择(例如换行、列表分割、替代表格呈现),这些选择对下游使用来说大多无关紧要。
我们通过模型分歧采样从60,000份文档的语料库中引入了一个以法语为重点的基准测试,涵盖了手写表单、复杂布局、密集表格和图形丰富的页面。评估使用单元测试风格的检查,针对具体的失败模式(文本存在、阅读顺序和局部表格约束)进行,并结合特定类别的归一化设计,以减少仅涉及呈现的差异。在15个模型中,我们观察到最强的专有模型在手写和表单方面表现出更高的鲁棒性,而几个开源权重系统在标准印刷布局方面仍然具有竞争力。
Summary / 总结
This report evaluates Vision-Language Models (VLMs) on converting French PDFs to Markdown, focusing on challenging documents with handwritten forms, complex layouts, dense tables, and graphics-rich pages. The study introduces a French-focused benchmark of difficult pages selected via model-disagreement sampling from a corpus of 60,000 documents. Unit-test-style checks target specific failure modes, and category-specific normalization discounts presentation-only variance. Across 15 models, the strongest proprietary models are substantially more robust on handwriting and forms, while several open-weights systems remain competitive on standard printed layouts.
该报告评估了视觉-语言模型(VLM)在将法语PDF转换为Markdown时的表现,重点关注包含手写表单、复杂布局、密集表格和图形丰富页面的挑战性文档。研究通过模型分歧采样,从60,000份文档的语料库中挑选困难页面,构建了一个以法语为重点的基准。评估采用单元测试风格的检查来针对具体的失败模式,并结合类别特定的归一化来消除仅与呈现相关的差异。结果显示,最强的专有模型在手写和表单方面更具鲁棒性,而若干开放权重模型在标准印刷布局方面仍然具有竞争力。
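The three kinds of unit-test-style checks named in the abstract (text presence, reading order, local table constraints) could look roughly like the sketch below. The function names, the normalization rule, and the sample page are all hypothetical and are not the benchmark's actual code; the normalization simply drops common Markdown markup and collapses whitespace so that presentation-only differences do not fail a check.

```python
import re

def normalize(text):
    """Lowercase, replace common Markdown markup with spaces, collapse whitespace."""
    text = re.sub(r"[*_#|`>-]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def text_present(markdown, snippet):
    """Text-presence check: the snippet appears somewhere in the output."""
    return normalize(snippet) in normalize(markdown)

def in_reading_order(markdown, first, second):
    """Reading-order check: `first` appears before `second`."""
    doc = normalize(markdown)
    i, j = doc.find(normalize(first)), doc.find(normalize(second))
    return 0 <= i < j

def same_table_row(markdown, cell_a, cell_b):
    """Local table check: both cells occur in the same pipe-table row."""
    for line in markdown.splitlines():
        if line.lstrip().startswith("|"):
            cells = [normalize(c) for c in line.strip().strip("|").split("|")]
            if normalize(cell_a) in cells and normalize(cell_b) in cells:
                return True
    return False

# Hypothetical model output for one page.
page = (
    "# Facture\n\n- Nom :  Dupont\n- Montant : 120 €\n\n"
    "| Article | Prix |\n|---|---|\n| Stylo | 2 € |"
)
```

Because each check targets one concrete failure mode, a model that renders the list with different bullets or extra blank lines still passes, while a model that drops the amount or swaps table cells does not.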
Are Two LLMs Better Than One? A Student-Teacher Dual-Head LLMs Architecture for Pharmaceutical Content Optimization
Authors: Suyash Mishra, Qiang Li, Anubhav Girdhar
First: 2026-02-12T13:53:29+00:00 · Latest: 2026-02-12T13:53:29+00:00
Comments: Submitted to the Demo Track of Top Tier Conference; currently under peer review
Abstract
Large language models (LLMs) are increasingly used to create content in regulated domains such as pharmaceuticals, where outputs must be scientifically accurate and legally compliant. Manual quality control (QC) is slow, error-prone, and can become a publication bottleneck. We introduce LRBTC, a modular LLM and vision language model (VLM) driven QC architecture covering Language, Regulatory, Brand, Technical, and Content Structure checks. LRBTC combines a Student-Teacher dual-model architecture and a human-in-the-loop (HITL) workflow with waterfall rule filtering to enable scalable, verifiable content validation and optimization. On AIReg-Bench, our approach achieves 83.0% F1 and 97.5% recall, reducing missed violations by 5x compared with Gemini 2.5 Pro. On CSpelling, it improves mean accuracy by 26.7%. Error analysis further reveals that while current models are strong at detecting misspellings (92.5 recall), they fail to identify complex medical grammatical (25.0 recall) and punctuation (41.7 recall) errors, highlighting a key area for future work. This work provides a practical, plug-and-play solution for reliable, transparent quality control of content in high-stakes, compliance-critical industries. We also provide access to our demo under the MIT License.
中文标题/摘要
标题:两颗LLM比一颗好吗?一种用于制药内容优化的学生-教师双头LLM架构
大型语言模型(LLMs)在制药等受监管领域中越来越多地用于生成内容,这些内容必须科学准确且符合法律要求。手动质量控制(QC)过程缓慢、容易出错,可能会成为出版瓶颈。我们引入了LRBTC,这是一种模块化LLM和视觉语言模型(VLM)驱动的QC架构,涵盖了语言、法规、品牌、技术以及内容结构检查。LRBTC结合了学生-教师双模型架构、人工在环(HITL)工作流和瀑布规则过滤,以实现可扩展、可验证的内容验证和优化。在AIReg-Bench上,我们的方法实现了83.0%的F1和97.5%的召回率,将未检测到的违规行为减少了5倍,与Gemini 2.5 Pro相比。在CSpelling上,它将平均准确性提高了26.7%。错误分析进一步表明,当前模型在检测拼写错误方面表现强劲(召回率为92.5%),但在识别复杂医学语法(召回率25.0%)和标点符号(召回率41.7%)错误方面存在不足,这指出了未来工作的关键领域。这项工作提供了一种实用、即插即用的解决方案,用于高风险、合规关键行业的内容可靠、透明的质量控制。我们还根据MIT许可证提供了我们的演示访问。
Summary / 总结
The research aims to enhance the quality control of pharmaceutical content using a Student-Teacher dual-head LLM architecture. The method involves combining a modular LLM and VLM to perform Language, Regulatory, Brand, Technical, and Content Structure checks, with a human-in-the-loop workflow. Key findings include an 83.0% F1 score and 97.5% recall on AIReg-Bench, reducing missed violations by 5x compared to Gemini 2.5 Pro, and a 26.7% improvement in mean accuracy on CSpelling. Error analysis indicates current models struggle with complex medical grammatical and punctuation errors, suggesting areas for future improvement.
研究旨在通过使用学生-教师双头LLM架构来提高制药内容的质量控制。该方法结合了模块化的LLM和VLM,以执行语言、法规、品牌、技术方面和内容结构的检查。在AIReg-Bench基准测试中,该方法的F1得分为83.0%,召回率为97.5%,显著减少了与Gemini 2.5 Pro相比的遗漏违规情况。它还在CSpelling上提高了平均准确性26.7%。然而,分析表明,当前模型在复杂医学语法和标点符号错误方面存在困难,这表明未来需要改进的领域。
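The reported F1 and recall together imply a precision, which is useful for reading the AIReg-Bench numbers. The short calculation below inverts the F1 definition; it assumes the 83.0% F1 and 97.5% recall refer to the same class and the same averaging, which the abstract does not state, so the resulting figure is an illustration and not a reported result.

```python
def precision_from_f1_recall(f1, recall):
    """Solve F1 = 2PR / (P + R) for precision P."""
    return f1 * recall / (2 * recall - f1)

f1, recall = 0.830, 0.975
precision = precision_from_f1_recall(f1, recall)   # roughly 0.72 under the assumption above
miss_rate = 1 - recall                             # share of violations not flagged
```

A recall this high with a lower precision is consistent with a system tuned to flag aggressively and leave the final call to the human-in-the-loop step.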
LoGoSeg: Integrating Local and Global Features for Open-Vocabulary Semantic Segmentation
Authors: Junyang Chen, Xiangbo Lv, Zhiqiang Kou, Xingdong Sheng, Ning Xu, Yiguo Qiao
First: 2026-02-05T12:03:11+00:00 · Latest: 2026-02-12T13:43:33+00:00
Abstract
Open-vocabulary semantic segmentation (OVSS) extends traditional closed-set segmentation by enabling pixel-wise annotation for both seen and unseen categories using arbitrary textual descriptions. While existing methods leverage vision-language models (VLMs) like CLIP, their reliance on image-level pretraining often results in imprecise spatial alignment, leading to mismatched segmentations in ambiguous or cluttered scenes. Moreover, most existing approaches lack strong object priors and region-level constraints, which can lead to object hallucination or missed detections, further degrading performance. To address these challenges, we propose LoGoSeg, an efficient single-stage framework that integrates three key innovations: (i) an object existence prior that dynamically weights relevant categories through global image-text similarity, effectively reducing hallucinations; (ii) a region-aware alignment module that establishes precise region-level visual-textual correspondences; and (iii) a dual-stream fusion mechanism that optimally combines local structural information with global semantic context. Unlike prior works, LoGoSeg eliminates the need for external mask proposals, additional backbones, or extra datasets, ensuring efficiency. Extensive experiments on six benchmarks (A-847, PC-459, A-150, PC-59, PAS-20, and PAS-20b) demonstrate its competitive performance and strong generalization in open-vocabulary settings.
中文标题/摘要
标题:LoGoSeg:结合局部和全局特征的开放词汇语义分割
开放词汇语义分割(OVSS)通过使用任意文本描述对已见和未见类别进行像素级标注,扩展了传统的封闭集分割。现有方法利用如CLIP等视觉语言模型(VLMs),但其依赖于图像级预训练,往往导致空间对齐不精确,导致在模糊或杂乱场景中出现不匹配的分割。然而,大多数现有方法缺乏强大的对象先验和区域级约束,这可能导致对象幻觉或漏检,进一步降低性能。为解决这些挑战,我们提出LoGoSeg,一种高效的单阶段框架,集成了三个关键创新:(i)一种对象存在先验,通过全局图像-文本相似性动态加权相关类别,有效减少幻觉;(ii)一种区域感知对齐模块,建立精确的区域级视觉-文本对应关系;(iii)一种双流融合机制,最优地结合局部结构信息与全局语义上下文。与先前工作不同,LoGoSeg 消除了对外部掩码提案、额外骨干网络或额外数据集的需要,确保高效性。在六个基准(A-847、PC-459、A-150、PC-59、PAS-20 和 PAS-20b)上的广泛实验表明,其在开放词汇设置中的性能和泛化能力具有竞争力。
Summary / 总结
LoGoSeg is an efficient single-stage framework for open-vocabulary semantic segmentation that integrates object existence prior, region-aware alignment, and dual-stream fusion to improve spatial alignment and reduce hallucinations. Experiments on six benchmarks show its competitive performance and strong generalization in open-vocabulary settings.
LoGoSeg 是一种高效的单阶段开放词汇语义分割框架,结合了对象存在先验、区域感知对齐和双流融合,以解决现有方法的局限性。它基于全局图像-文本相似性动态加权类别,建立精确的区域级对应关系,并结合局部和全局信息。在六个基准上的实验表明,LoGoSeg 在开放词汇设置中具有有竞争力的性能和较强的泛化能力。
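The object existence prior, which weights categories by global image-text similarity, can be illustrated with a small NumPy sketch. The formulation here is an assumption made for the example and not LoGoSeg's actual module: the prior is a temperature-scaled softmax over cosine similarities between a global image feature and the category text embeddings, and it multiplies per-pixel class probabilities. The function names, `tau`, and the toy embeddings are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def segment_with_prior(pixel_feats, global_feat, text_embs, tau=0.1):
    """Label each pixel after down-weighting categories unlikely to be in the image."""
    pixel_feats = l2_normalize(pixel_feats)
    global_feat = l2_normalize(global_feat)
    text_embs = l2_normalize(text_embs)
    prior = softmax(text_embs @ global_feat / tau)                 # (C,) existence prior
    pixel_probs = softmax(pixel_feats @ text_embs.T / tau, axis=1) # (P, C) per-pixel scores
    return (pixel_probs * prior).argmax(axis=1), prior

# Toy data: three categories, with the image globally matching categories 0 and 1.
text_embs = np.eye(3, 4)
global_feat = np.array([1.0, 1.0, 0.0, 0.0])
pixel_feats = np.array([
    [1.0, 0.0, 0.0, 0.0],     # clearly category 0
    [0.0, 1.0, 0.0, 0.0],     # clearly category 1
    [0.0, 0.49, 0.51, 0.0],   # ambiguous, leaning slightly toward absent category 2
])
labels, prior = segment_with_prior(pixel_feats, global_feat, text_embs)
```

In the toy data the third pixel would be assigned to category 2 from local similarity alone; the prior suppresses that category because the global feature gives no evidence for it, which is the hallucination-reduction effect the abstract describes.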
LLM-in-Sandbox Elicits General Agentic Intelligence
Authors: Daixuan Cheng, Shaohan Huang, Yuxian Gu, Huatong Song, Guoxin Chen, Li Dong, Wayne Xin Zhao, Ji-Rong Wen, Furu Wei
First: 2026-01-22T18:57:09+00:00 · Latest: 2026-02-12T12:39:21+00:00
Comments: Project Page: https://llm-in-sandbox.github.io
Abstract
We introduce LLM-in-Sandbox, enabling LLMs to explore within a code sandbox (i.e., a virtual computer), to elicit general intelligence in non-code domains. We first demonstrate that strong LLMs, without additional training, exhibit generalization capabilities to leverage the code sandbox for non-code tasks. For example, LLMs spontaneously access external resources to acquire new knowledge, leverage the file system to handle long contexts, and execute scripts to satisfy formatting requirements. We further show that these agentic capabilities can be enhanced through LLM-in-Sandbox Reinforcement Learning (LLM-in-Sandbox-RL), which uses only non-agentic data to train models for sandbox exploration. Experiments demonstrate that LLM-in-Sandbox, in both training-free and post-trained settings, achieves robust generalization spanning mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following. Finally, we analyze LLM-in-Sandbox's efficiency from computational and system perspectives, and open-source it as a Python package to facilitate real-world deployment.
中文标题/摘要
标题:LLM-in-Sandbox 激发通用代理智能
我们介绍了 LLM-in-Sandbox,使大语言模型能够在代码沙箱(即虚拟计算机)中探索,以激发非代码领域的通用智能。我们首先展示了强大的大语言模型在无需额外训练的情况下,能够利用代码沙箱来完成非代码任务的一般化能力。例如,大语言模型自发地访问外部资源以获取新知识,利用文件系统处理长文本,并执行脚本以满足格式要求。我们进一步表明,通过仅使用非代理数据训练用于沙箱探索的模型,LLM-in-Sandbox 强化学习(LLM-in-Sandbox-RL)可以增强这些代理能力。实验表明,无论是在无训练还是后训练设置下,LLM-in-Sandbox 都能够实现涵盖数学、物理、化学、生物医学、长文本理解以及指令遵循的稳健泛化。最后,我们从计算和系统角度分析了 LLM-in-Sandbox 的效率,并将其开源为 Python 包,以促进其实用部署。
Summary / 总结
The study introduces LLM-in-Sandbox, which lets large language models (LLMs) explore within a code sandbox to elicit general intelligence in non-code domains. Strong LLMs, without additional training, generalize to use the sandbox for non-code tasks, for example by accessing external resources, handling long contexts through the file system, and executing scripts. LLM-in-Sandbox Reinforcement Learning, trained only on non-agentic data, further enhances these capabilities. Experiments show robust generalization across mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following. The study also analyzes the efficiency of LLM-in-Sandbox from computational and system perspectives and open-sources it as a Python package for real-world deployment.
研究旨在通过探索代码沙箱,使大型语言模型(LLMs)在非代码领域展示出通用智能。研究表明,强大的LLMs可以在无需额外训练的情况下,通过利用沙箱执行非代码任务,例如访问外部资源和执行脚本。通过LLM-in-Sandbox强化学习进一步提升这些能力。实验验证了LLM-in-Sandbox在数学、物理和生物医学等多个领域的稳健泛化能力。研究还从计算和系统效率方面评估了LLM-in-Sandbox,并将其开源为Python包,以便在实际场景中部署。
TABLET: A Large-Scale Dataset for Robust Visual Table Understanding
Authors: Iñigo Alonso, Imanol Miranda, Eneko Agirre, Mirella Lapata
First: 2025-09-25T14:14:27+00:00 · Latest: 2026-02-12T12:11:03+00:00
Abstract
While table understanding increasingly relies on pixel-only settings, current benchmarks predominantly use synthetic renderings that lack the complexity and visual diversity of real-world tables. Additionally, existing visual table understanding (VTU) datasets offer fixed examples with single visualizations and pre-defined instructions, providing no access to underlying serialized data for reformulation. We introduce TABLET, a large-scale VTU dataset with 4 million examples across 21 tasks, grounded in 2 million unique tables where 88% preserve original visualizations. To evaluate whether models are able to jointly reason over tabular and visual content, we also introduce VisualTableQA, a benchmark requiring both visual perception and table understanding. Fine-tuning vision-language models like Qwen2.5-VL-7B and Gemma 3-4B on TABLET improves performance on seen and unseen VTU tasks while increasing robustness on real-world table visualizations. By preserving original visualizations and maintaining example traceability in a unified large-scale collection, TABLET establishes a foundation for robust training and extensible evaluation of future VTU models.
中文标题/摘要
标题:TABLET:大规模视觉表格理解数据集
尽管表格理解越来越多地依赖于基于像素的设置,但当前的基准测试主要使用缺乏现实世界表格复杂性和视觉多样性的合成渲染。此外,现有的视觉表格理解(VTU)数据集提供固定示例和单一可视化,并预定义指令,不提供访问底层序列化数据以重新表述的机会。我们引入了TABLET,这是一个包含400万示例、涵盖21项任务的大规模VTU数据集,基于200万张独特表格,其中88%保留了原始可视化。为了评估模型是否能够联合推理表格和视觉内容,我们还引入了VisualTableQA,这是一个需要视觉感知和表格理解的基准测试。在TABLET上微调如Qwen2.5-VL-7B和Gemma 3-4B等视觉语言模型,可以提高已见和未见VTU任务的性能,同时增强对现实世界表格可视化的鲁棒性。通过保留原始可视化并保持示例可追溯性,TABLET为未来的VTU模型的稳健训练和扩展评估奠定了基础。
Summary / 总结
The research aims to address the limitations of current VTU datasets by introducing TABLET, a large-scale dataset with 4 million examples across 21 tasks, grounded in 2 million unique tables. The method involves fine-tuning vision-language models on this dataset to improve performance on both seen and unseen VTU tasks. Key findings show that fine-tuning on TABLET enhances robustness on real-world table visualizations and improves model performance across various tasks.
研究旨在通过引入包含400万示例、覆盖21项任务的TABLET数据集来解决当前VTU数据集的局限性,该数据集基于200万张独特的表格。方法是通过在TABLET上微调视觉-语言模型来提高在各种VTU任务上的性能。关键发现包括在真实世界表格可视化上的鲁棒性增强以及模型性能的提升。
Free Lunch for Stabilizing Rectified Flow Inversion
Authors: Chenru Wang, Beier Zhu, Chi Zhang
First: 2026-02-12T11:42:36+00:00 · Latest: 2026-02-12T11:42:36+00:00
Abstract
Rectified-Flow (RF)-based generative models have recently emerged as strong alternatives to traditional diffusion models, demonstrating state-of-the-art performance across various tasks. By learning a continuous velocity field that transforms simple noise into complex data, RF-based models not only enable high-quality generation, but also support training-free inversion, which facilitates downstream tasks such as reconstruction and editing. However, existing inversion methods, such as vanilla RF-based inversion, suffer from approximation errors that accumulate across timesteps, leading to unstable velocity fields and degraded reconstruction and editing quality. To address this challenge, we propose Proximal-Mean Inversion (PMI), a training-free gradient correction method that stabilizes the velocity field by guiding it toward a running average of past velocities, constrained within a theoretically derived spherical Gaussian. Furthermore, we introduce mimic-CFG, a lightweight velocity correction scheme for editing tasks, which interpolates between the current velocity and its projection onto the historical average, balancing editing effectiveness and structural consistency. Extensive experiments on PIE-Bench demonstrate that our methods significantly improve inversion stability, image reconstruction quality, and editing fidelity, while reducing the required number of neural function evaluations. Our approach achieves state-of-the-art performance on the PIE-Bench with enhanced efficiency and theoretical soundness.
中文标题/摘要
标题:免费午餐以稳定校正流反转
基于校正流(RF)的生成模型最近已成为传统扩散模型的强大替代方案,在各种任务中表现出最先进的性能。通过学习一个连续的速度场,将简单的噪声转换为复杂的数据,基于RF的模型不仅能够实现高质量的生成,还支持无需训练的反转,这有助于下游任务如重建和编辑。然而,现有的反转方法,如纯RF反转,会遭受随时间累积的近似误差,导致不稳定的速度场和重建和编辑质量下降。为了解决这一挑战,我们提出了邻近均值反转(PMI),这是一种无需训练的梯度校正方法,通过将其引导向理论推导出的球形高斯内的过去速度的运行平均值来稳定速度场。此外,我们还引入了模仿-CFG,这是一种轻量级的速度校正方案,用于编辑任务,它在当前速度与其历史平均值投影之间进行插值,平衡编辑效果和结构一致性。在PIE-Bench上的广泛实验表明,我们的方法显著提高了反转稳定性、图像重建质量和编辑保真度,同时减少了所需的神经函数评估次数。我们的方法在PIE-Bench上实现了最先进的性能,具有增强的效率和理论严谨性。
Summary / 总结
The paper addresses the issue of unstable velocity fields in existing Rectified-Flow (RF)-based inversion methods, which can degrade reconstruction and editing quality. It proposes Proximal-Mean Inversion (PMI), a training-free gradient correction method that stabilizes the velocity field by guiding it towards a running average of past velocities, constrained within a theoretically derived spherical Gaussian. Additionally, it introduces mimic-CFG, a lightweight velocity correction scheme for editing tasks that balances editing effectiveness and structural consistency. Experiments show significant improvements in inversion stability, image reconstruction quality, and editing fidelity, while reducing the number of neural function evaluations.
论文针对Rectified-Flow (RF)生成模型中不稳定的速度场问题,该问题会降低图像重建和编辑的质量。为此,作者提出了Proximal-Mean Inversion (PMI),这是一种无需训练的方法,通过将速度场引导向过去速度的运行平均值来稳定它。此外,他们还引入了mimic-CFG,这是一种轻量级的编辑任务速度校正方案。实验表明,这些方法显著提高了反演稳定性、图像重建质量和编辑保真度,同时减少了神经函数评估的数量,实现了在PIE-Bench上的最先进的性能。
JEPA-VLA: Video Predictive Embedding is Needed for VLA Models
Authors: Shangchen Miao, Ningya Feng, Jialong Wu, Ye Lin, Xu He, Dong Li, Mingsheng Long
First: 2026-02-12T11:20:43+00:00 · Latest: 2026-02-12T11:20:43+00:00
Abstract
Recent vision-language-action (VLA) models built upon pretrained vision-language models (VLMs) have achieved significant improvements in robotic manipulation. However, current VLAs still suffer from low sample efficiency and limited generalization. This paper argues that these limitations are closely tied to an overlooked component, pretrained visual representation, which offers insufficient knowledge on both aspects of environment understanding and policy prior. Through an in-depth analysis, we find that commonly used visual representations in VLAs, whether pretrained via language-image contrastive learning or image-based self-supervised learning, remain inadequate at capturing crucial, task-relevant environment information and at inducing effective policy priors, i.e., anticipatory knowledge of how the environment evolves under successful task execution. In contrast, we discover that predictive embeddings pretrained on videos, in particular V-JEPA 2, are adept at flexibly discarding unpredictable environment factors and encoding task-relevant temporal dynamics, thereby effectively compensating for key shortcomings of existing visual representations in VLAs. Building on these observations, we introduce JEPA-VLA, a simple yet effective approach that adaptively integrates predictive embeddings into existing VLAs. Our experiments demonstrate that JEPA-VLA yields substantial performance gains across a range of benchmarks, including LIBERO, LIBERO-plus, RoboTwin2.0, and real-robot tasks.
中文标题/摘要
标题:JEPA-VLA:视频预测嵌入对于VLA模型是必需的
基于预训练视觉语言模型(VLMs)的近期视觉语言行动(VLA)模型在机器人操作方面取得了显著进步。然而,当前的VLA模型仍然存在样本效率低和泛化能力有限的问题。本文认为,这些限制与一个被忽视的组件——预训练视觉表示——密切相关,该组件在环境理解和策略先验方面提供的知识不足。通过深入分析,我们发现VLA中常用的视觉表示,无论是通过语言图像对比学习还是基于图像的自我监督学习预训练,仍然无法充分捕捉关键的任务相关信息,也无法诱导有效的策略先验,即在成功执行任务时环境如何演变的预见性知识。相比之下,我们发现视频上预训练的预测嵌入,特别是V-JEPA 2,能够灵活地忽略不可预测的环境因素,并编码任务相关的时间动态,从而有效弥补现有视觉表示在VLA中的关键不足。基于这些观察,我们提出了JEPA-VLA,这是一种简单而有效的方法,能够将预测嵌入适应性地整合到现有的VLA中。我们的实验表明,JEPA-VLA在包括LIBERO、LIBERO-plus、RoboTwin2.0和真实机器人任务在内的多种基准测试中取得了显著的性能提升。
Summary / 总结
This paper addresses the limitations of current vision-language-action (VLA) models in terms of sample efficiency and generalization, attributing these issues to inadequate pretrained visual representations. The authors propose JEPA-VLA, which integrates predictive embeddings pretrained on videos to improve the models' ability to capture task-relevant temporal dynamics and discard unpredictable environment factors. Experiments show that JEPA-VLA significantly enhances performance across various benchmarks and real-robot tasks.
该论文探讨了VLA模型在样本效率和泛化能力方面的局限性,认为这些局限性源于预训练视觉表示的不足。作者提出JEPA-VLA,该方法将基于视频预训练的预测嵌入整合到现有VLA模型中。实验表明,JEPA-VLA在各种基准测试和真实机器人任务中显著提高了性能。
Revis: Sparse Latent Steering to Mitigate Object Hallucination in Large Vision-Language Models
Authors: Jialin Wu, Wei Shi, Han Shen, Peigui Qi, Kunsheng Tang, Zhicong Huang, Binghao Wang, Zhou Yang
First: 2026-02-12T11:07:44+00:00 · Latest: 2026-02-12T11:07:44+00:00
Abstract
Despite the advanced capabilities of Large Vision-Language Models (LVLMs), they frequently suffer from object hallucination. One reason is that visual features and pretrained textual representations often become intertwined in the deeper network layers. To address this, we propose REVIS, a training-free framework designed to explicitly re-activate this suppressed visual information. Rooted in latent space geometry, REVIS extracts the pure visual information vector via orthogonal projection and employs a calibrated strategy to perform sparse intervention only at the precise depth where suppression occurs. This surgical approach effectively restores visual information with minimal computational cost. Empirical evaluations on standard benchmarks demonstrate that REVIS reduces object hallucination rates by approximately 19% compared to state-of-the-art baselines, while preserving general reasoning capabilities.
中文标题/摘要
标题:Revis: 稀疏潜在引导以减轻大型视觉-语言模型中的物体幻觉
尽管大型视觉-语言模型(LVLMs)具有先进的功能,但它们经常遭受物体幻觉的问题。其中一个原因是视觉特征和预训练的文本表示在深层网络层中常常交织在一起。为了解决这个问题,我们提出了一种无需训练的REVIS框架,旨在显式地重新激活这种被抑制的视觉信息。基于潜在空间几何,REVIS通过正交投影提取纯净的视觉信息向量,并采用校准策略仅在抑制发生的精确深度处进行稀疏干预。这种手术式的方法以最小的计算成本有效地恢复了视觉信息。在标准基准上的实证评估表明,与最先进的基线相比,REVIS将物体幻觉率降低了约19%,同时保持了通用推理能力。
Summary / 总结
The research aims to address the issue of object hallucination in Large Vision-Language Models (LVLMs) by proposing REVIS, a training-free framework. REVIS reactivates suppressed visual information through orthogonal projection and sparse intervention at the precise depth where suppression occurs. Experimental results show that REVIS reduces object hallucination rates by about 19% compared to state-of-the-art baselines, while maintaining general reasoning capabilities.
研究旨在通过提出REVIS框架来解决大型视觉-语言模型(LVLM)中的物体幻觉问题。REVIS通过正交投影和在抑制发生的精确深度处进行稀疏干预来重新激活被抑制的视觉信息。实验表明,与最先进的基线相比,REVIS可以将物体幻觉率降低约19%,同时保持一般的推理能力。
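The abstract above says only that REVIS extracts a visual information vector "via orthogonal projection" and intervenes sparsely at a chosen depth. As a rough illustration of that geometric step, and not the authors' implementation, the sketch below removes from a visual vector its component along a text direction and adds the remainder to a hidden state. The function names, the choice of vectors, and the scale `alpha` are all hypothetical.

```python
import numpy as np

def orthogonal_component(visual, text):
    """Return the part of `visual` that is orthogonal to `text`.

    Hypothetical helper: which vectors REVIS actually projects is not
    stated in the abstract; this is plain vector rejection.
    """
    v = np.asarray(visual, dtype=float)
    t = np.asarray(text, dtype=float)
    denom = float(t @ t)
    if denom == 0.0:
        # Nothing to project out when the text direction is the zero vector.
        return v.copy()
    return v - (v @ t) / denom * t

def steer(hidden, direction, alpha):
    """Add a scaled steering direction to a hidden state (assumed form)."""
    return np.asarray(hidden, dtype=float) + alpha * np.asarray(direction, dtype=float)

# Usage: project out the text direction, then nudge one layer's hidden state.
pure_visual = orthogonal_component([3.0, 4.0], [1.0, 0.0])
steered = steer([1.0, 1.0], pure_visual, alpha=0.5)
```

Applying `steer` at a single layer rather than at every layer is what would make such an intervention sparse; how REVIS calibrates that depth is not described in the abstract.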
Light4D: Training-Free Extreme Viewpoint 4D Video Relighting
Authors: Zhenghuang Wu, Kang Chen, Zeyu Zhang, Hao Tang
First: 2026-02-12T09:50:13+00:00 · Latest: 2026-02-12T09:50:13+00:00
Abstract
Recent advances in diffusion-based generative models have established a new paradigm for image and video relighting. However, extending these capabilities to 4D relighting remains challenging, due primarily to the scarcity of paired 4D relighting training data and the difficulty of maintaining temporal consistency across extreme viewpoints. In this work, we propose Light4D, a novel training-free framework designed to synthesize consistent 4D videos under target illumination, even under extreme viewpoint changes. First, we introduce Disentangled Flow Guidance, a time-aware strategy that effectively injects lighting control into the latent space while preserving geometric integrity. Second, to reinforce temporal consistency, we develop Temporal Consistent Attention within the IC-Light architecture and further incorporate deterministic regularization to eliminate appearance flickering. Extensive experiments demonstrate that our method achieves competitive performance in temporal consistency and lighting fidelity, robustly handling camera rotations from -90 to 90 degrees. Code: https://github.com/AIGeeksGroup/Light4D. Website: https://aigeeksgroup.github.io/Light4D.
中文标题/摘要
标题:Light4D:无需训练的极端视角4D视频重新光照
基于扩散生成模型的最新进展已经确立了图像和视频重新光照的新范式。然而,将这些能力扩展到4D重新光照仍然具有挑战性,主要是由于缺乏配对的4D重新光照训练数据,以及在极端视角下保持时间一致性的难度。在本文中,我们提出了一种名为Light4D的新型无需训练框架,旨在在目标照明下合成一致的4D视频,即使在极端视角变化下也是如此。首先,我们引入了时间感知的解耦流引导策略,该策略有效地将光照控制注入到潜在空间中,同时保持几何完整性。其次,为了增强时间一致性,我们开发了IC-Light架构内的时间一致注意力,并进一步引入确定性正则化以消除外观闪烁。广泛的实验表明,我们的方法在时间一致性和光照保真度方面取得了竞争力的表现,能够稳健地处理从-90到90的相机旋转。代码:https://github.com/AIGeeksGroup/Light4D。网站:https://aigeeksgroup.github.io/Light4D。
Summary / 总结
Light4D is a training-free framework for 4D video relighting that addresses the challenges of maintaining temporal consistency and geometric integrity under extreme viewpoint changes. It uses Disentangled Flow Guidance to inject lighting control into the latent space and Temporal Consistent Attention to reinforce temporal consistency, with deterministic regularization to prevent appearance flickering. Experiments show that Light4D achieves competitive performance in temporal consistency and lighting fidelity, handling camera rotations from -90 to 90 degrees effectively.
Light4D 是一个无需训练的 4D 视频重新光照框架,旨在解决在极端视角变化下保持时间一致性和几何完整性的问题。它通过引入 Disentangled Flow Guidance 将光照控制注入到潜在空间,并通过 Temporal Consistent Attention 强化时间一致性,结合确定性正则化以消除外观闪烁。实验表明,Light4D 在时间一致性和光照保真度方面表现出色,能够有效处理从 -90 到 90 度的相机旋转。
Adaptive Debiasing Tsallis Entropy for Test-Time Adaptation
Authors: Xiangyu Wu, Dongming Jiang, Feng Yu, Yueying Tian, Jiaqi Tang, Qing-Guo Chen, Yang Yang, Jianfeng Lu
Venue: ICLR 2026
First: 2026-02-12T09:12:22+00:00 · Latest: 2026-02-12T09:12:22+00:00
Comments: Accepted for publication at ICLR 2026; 24 pages; 5 figures
Abstract
Mainstream Test-Time Adaptation (TTA) methods for adapting vision-language models, e.g., CLIP, typically rely on Shannon Entropy (SE) at test time to measure prediction uncertainty and inconsistency. However, since CLIP has a built-in bias from pretraining on highly imbalanced web-crawled data, SE inevitably results in producing biased estimates of uncertainty entropy. To address this issue, we notably find and demonstrate that Tsallis Entropy (TE), a generalized form of SE, is naturally suited for characterizing biased distributions by introducing a non-extensive parameter q, with the performance of SE serving as a lower bound for TE. Building upon this, we generalize TE into Adaptive Debiasing Tsallis Entropy (ADTE) for TTA, customizing a class-specific parameter q^l derived by normalizing the estimated label bias from continuously incoming test instances, for each category. This adaptive approach allows ADTE to accurately select high-confidence views and seamlessly integrate with a label adjustment strategy to enhance adaptation, without introducing distribution-specific hyperparameter tuning. Besides, our investigation reveals that both TE and ADTE can serve as direct, advanced alternatives to SE in TTA, without any other modifications. Experimental results show that ADTE outperforms state-of-the-art methods on ImageNet and its five variants, and achieves the highest average performance on 10 cross-domain benchmarks, regardless of the model architecture or text prompts used. Our code is available at https://github.com/Jinx630/ADTE.
中文标题/摘要
标题:自适应去偏Tsallis熵在测试时适应中的应用
主流的测试时适应(TTA)方法,例如针对CLIP等视觉-语言模型,通常依赖于测试时的香农熵(SE)来衡量预测不确定性与不一致性。然而,由于CLIP在预训练时使用了高度不平衡的网络抓取数据,SE不可避免地会产生偏倚的不确定性熵估计。为了解决这一问题,我们发现并证明了Tsallis熵(TE),一种SE的广义形式,通过引入非可加参数q,自然适用于描述偏倚分布,并且SE的性能是TE的下界。基于此,我们为TTA提出了自适应去偏Tsallis熵(ADTE),为每个类别定制了一个由不断到来的测试实例估计的标签偏倚归一化得到的类别特定参数q^l。这种自适应方法使ADTE能够准确选择高置信度视图,并无缝地与标签调整策略结合以增强适应,而无需进行特定分布的超参数调整。此外,我们的研究发现,TE和ADTE都可以直接作为SE在TTA中的高级替代方案,无需其他修改。实验结果表明,ADTE在ImageNet及其五个变体上优于最先进的方法,并在10个跨域基准测试中实现了最高的平均性能,无论使用哪种模型架构或文本提示。我们的代码可在https://github.com/Jinx630/ADTE获取。
Summary / 总结
This paper addresses biased uncertainty estimation in Test-Time Adaptation (TTA) for vision-language models like CLIP, which are pretrained on highly imbalanced data. It builds on Tsallis Entropy (TE), a generalization of Shannon Entropy (SE) with a non-extensive parameter q that is better suited to characterizing biased distributions, and proposes Adaptive Debiasing Tsallis Entropy (ADTE). ADTE sets a class-specific parameter q^l for each category by normalizing the label bias estimated from continuously incoming test instances, allowing it to accurately select high-confidence views and integrate with a label adjustment strategy. Experiments show that ADTE outperforms state-of-the-art methods on ImageNet and its five variants and achieves the highest average performance across 10 cross-domain benchmarks.
本文针对视觉-语言模型CLIP在不平衡数据上训练导致的测试时自适应(TTA)中偏差的不确定性估计问题,提出了一种改进的Tsallis熵方法——自适应去偏差Tsallis熵(ADTE)。ADTE通过引入非可加参数q来更好地表征偏差分布,并根据传入的测试实例自适应地调整每个类别的参数q,从而准确选择高置信度的视图并结合标签调整策略。实验结果显示,ADTE在ImageNet及其变体和10个跨域基准上均优于现有方法,且不依赖于特定的模型架构或文本提示。
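For readers unfamiliar with Tsallis Entropy, the sketch below computes its standard textbook form, S_q(p) = (1 - sum_i p_i^q) / (q - 1), and checks that it approaches Shannon Entropy as q approaches 1. It assumes q > 0 and does not reproduce ADTE's class-specific q^l or its bias estimation; the function names are illustrative.

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))

def tsallis_entropy(p, q):
    """Standard Tsallis entropy S_q(p) = (1 - sum p_i^q) / (q - 1), assuming q > 0.

    Falls back to Shannon entropy at q = 1, which is the limiting case.
    """
    p = np.asarray(p, dtype=float)
    if abs(q - 1.0) < 1e-8:
        return shannon_entropy(p)
    nz = p[p > 0]
    return float((1.0 - np.sum(nz ** q)) / (q - 1.0))

# Usage: a uniform prediction over four classes.
probs = [0.25, 0.25, 0.25, 0.25]
se = shannon_entropy(probs)          # equals log(4)
te_q2 = tsallis_entropy(probs, 2.0)  # equals 1 - 4 * 0.25**2
```

Because `tsallis_entropy` has the same input and output shape as `shannon_entropy`, it can stand in wherever an entropy score is used to rank augmented views, which matches the abstract's claim that TE can replace SE without other modifications.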
Adapting Vision-Language Models for E-commerce Understanding at Scale
Authors: Matteo Nulli, Vladimir Orshulevich, Tala Bazazo, Christian Herold, Michael Kozielski, Marcin Mazur, Szymon Tuzel, Cees G. M. Snoek, Seyyed Hadi Hashemi, Omar Javed, Yannick Versley, Shahram Khadivi
First: 2026-02-12T08:59:22+00:00 · Latest: 2026-02-12T08:59:22+00:00
Abstract
E-commerce product understanding by nature demands strong multimodal comprehension of text, images, and structured attributes. General-purpose Vision-Language Models (VLMs) enable generalizable multimodal latent modelling, yet there is no documented, well-known strategy for adapting them to the attribute-centric, multi-image, and noisy nature of e-commerce data without sacrificing general performance. In this work, we show through a large-scale experimental study how targeted adaptation of general VLMs can substantially improve e-commerce performance while preserving broad multimodal capabilities. Furthermore, we propose a novel, extensive evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction.
中文标题/摘要
标题:适应电子商务大规模理解的视觉-语言模型调整
电子商务产品理解本质上需要从文本、图像和结构化属性中获得强大的多模态理解。通用的视觉-语言模型(VLMs)能够实现多模态的泛化潜在建模,但尚未有记录且广为人知的方法能够在不牺牲一般性能的情况下将它们调整为以属性为中心、多图像且噪声较大的电子商务数据。在本研究中,我们通过大规模实验研究展示了如何有针对性地调整通用VLMs,以显著提高电子商务性能同时保留广泛的多模态能力。此外,我们提出了一套新的全面评估方案,涵盖了深入的产品理解、严格的指令遵循和动态属性提取。
Summary / 总结
The research aims to enhance the performance of Vision-Language Models (VLMs) in e-commerce by adapting them to handle attribute-centric, multi-image, and noisy data. The study demonstrates that targeted adaptation of general VLMs can significantly improve e-commerce performance while maintaining their broad multimodal capabilities. The authors propose an extensive evaluation suite that includes deep product understanding, strict instruction following, and dynamic attribute extraction to validate their approach.
研究旨在通过适应Vision-Language Models (VLMs),使其能够处理属性中心、多图片和嘈杂的数据,从而提升在电子商务中的性能。研究显示,有针对性地适应通用VLMs可以显著提高电子商务性能,同时保持其广泛的多模态能力。作者提出了一套全面的评估套件,包括深入的产品理解、严格的指令遵循和动态的属性提取,以验证其方法。
STVG-R1: Incentivizing Instance-Level Reasoning and Grounding in Videos via Reinforcement Learning
Authors: Xiaowen Zhang, Zhi Gao, Licheng Jiao, Lingling Li, Qing Li
First: 2026-02-12T08:53:32+00:00 · Latest: 2026-02-12T08:53:32+00:00
Abstract
In vision-language models (VLMs), misalignment between textual descriptions and visual coordinates often induces hallucinations. This issue becomes particularly severe in dense prediction tasks such as spatial-temporal video grounding (STVG). Prior approaches typically focus on enhancing visual-textual alignment or attaching auxiliary decoders. However, these strategies inevitably introduce additional trainable modules, leading to significant annotation costs and computational overhead. In this work, we propose a novel visual prompting paradigm that avoids the difficult problem of aligning coordinates across modalities. Specifically, we reformulate per-frame coordinate prediction as a compact instance-level identification problem by assigning each object a unique, temporally consistent ID. These IDs are embedded into the video as visual prompts, providing explicit and interpretable inputs to the VLMs. Furthermore, we introduce STVG-R1, the first reinforcement learning framework for STVG, which employs a task-driven reward to jointly optimize temporal accuracy, spatial consistency, and structural format regularization. Extensive experiments on six benchmarks demonstrate the effectiveness of our approach. STVG-R1 surpasses the baseline Qwen2.5-VL-7B by a remarkable margin of 20.9% on m_IoU on the HCSTVG-v2 benchmark, establishing a new state of the art (SOTA). Surprisingly, STVG-R1 also exhibits strong zero-shot generalization to multi-object referring video object segmentation tasks, achieving a SOTA 47.3% J&F on MeViS.
中文标题/摘要
标题:STVG-R1:通过强化学习激励实例级推理和视频接地
在视觉语言模型(VLMs)中,文本描述与视觉坐标之间的不一致常常导致幻觉。这一问题在时空视频接地(STVG)等密集预测任务中尤为严重。先前的方法通常侧重于增强视觉-文本对齐或附加辅助解码器。然而,这些策略不可避免地引入了额外的可训练模块,导致显著的注释成本和计算开销。在本文中,我们提出了一种新颖的视觉提示范式,避免了跨模态对齐坐标的困难问题。具体而言,我们将每帧坐标预测重新表述为一个紧凑的实例级识别问题,为每个对象分配一个唯一的、时间一致的ID。这些ID被嵌入到视频中作为视觉提示,为VLMs提供明确且可解释的输入。此外,我们引入了STVG-R1,这是第一个用于STVG的强化学习框架,它采用任务驱动的奖励来联合优化时间准确性、空间一致性和结构格式正则化。在六个基准上的广泛实验表明了我们方法的有效性。STVG-R1在HCSTVG-v2基准上的m_IoU上超越了基线Qwen2.5-VL-7B,取得了20.9%的显著优势,建立了新的最先进水平(SOTA)。令人惊讶的是,STVG-R1在多对象引用视频对象分割任务上也表现出强大的零样本泛化能力,实现了MeViS上的SOTA 47.3% J&F。
Summary / 总结
This paper addresses the misalignment between textual descriptions and visual coordinates in vision-language models, particularly in dense prediction tasks like spatial-temporal video grounding (STVG). It proposes a visual prompting paradigm that assigns each object a unique, temporally consistent ID and embeds these IDs into the video as visual prompts. It also introduces STVG-R1, a reinforcement learning framework whose task-driven reward jointly optimizes temporal accuracy, spatial consistency, and structural format regularization. Experiments on six benchmarks show that STVG-R1 outperforms the baseline Qwen2.5-VL-7B by 20.9% on m_IoU on the HCSTVG-v2 benchmark and generalizes zero-shot to multi-object referring video object segmentation, reaching a state-of-the-art 47.3% J&F on MeViS.
该论文针对视觉语言模型中文本描述与视觉坐标之间的对齐问题,特别是在时空视频定位任务中的问题。提出了一种新颖的视觉提示方法,为每个对象分配一个独特的、时间一致的ID,并将其嵌入到视频中作为视觉提示。该方法使用STVG-R1强化学习框架,联合优化时间准确性、空间一致性和结构格式正则化。实验表明,STVG-R1在m_IoU上比基线Qwen2.5-VL-7B高出20.9%,并在多对象引用视频对象分割任务中实现了零样本泛化的新最佳性能,达到47.3%的J&F。
Clutt3R-Seg: Sparse-view 3D Instance Segmentation for Language-grounded Grasping in Cluttered Scenes
Authors: Jeongho Noh, Tai Hyoung Rhee, Eunho Lee, Jeongyun Kim, Sunwoo Lee, Ayoung Kim
Venue: ICRA 2026
First: 2026-02-12T07:25:52+00:00 · Latest: 2026-02-12T07:25:52+00:00
Comments: Accepted to ICRA 2026. 9 pages, 8 figures
Abstract
Reliable 3D instance segmentation is fundamental to language-grounded robotic manipulation. Its critical application lies in cluttered environments, where occlusions, limited viewpoints, and noisy masks degrade perception. To address these challenges, we present Clutt3R-Seg, a zero-shot pipeline for robust 3D instance segmentation for language-grounded grasping in cluttered scenes. Our key idea is to introduce a hierarchical instance tree of semantic cues. Unlike prior approaches that attempt to refine noisy masks, our method leverages them as informative cues: through cross-view grouping and conditional substitution, the tree suppresses over- and under-segmentation, yielding view-consistent masks and robust 3D instances. Each instance is enriched with open-vocabulary semantic embeddings, enabling accurate target selection from natural language instructions. To handle scene changes during multi-stage tasks, we further introduce a consistency-aware update that preserves instance correspondences from only a single post-interaction image, allowing efficient adaptation without rescanning. Clutt3R-Seg is evaluated on both synthetic and real-world datasets, and validated on a real robot. Across all settings, it consistently outperforms state-of-the-art baselines in cluttered and sparse-view scenarios. Even on the most challenging heavy-clutter sequences, Clutt3R-Seg achieves an AP@25 of 61.66, over 2.2x higher than baselines, and with only four input views it surpasses MaskClustering with eight views by more than 2x. The code is available at: https://github.com/jeonghonoh/clutt3r-seg.
中文标题/摘要
标题:Clutt3R-Seg:杂乱场景中语言导向抓取的稀疏视图3D实例分割
可靠的3D实例分割是语言导向机器人操作的基础。其关键应用在于杂乱环境,其中遮挡、有限视角和噪声掩膜会降低感知效果。为应对这些挑战,我们提出了Clutt3R-Seg,这是一种用于杂乱场景中语言导向抓取的零样本稳健3D实例分割管道。我们的核心思想是引入层次化的语义实例树。与先前尝试细化噪声掩膜的方法不同,我们的方法利用它们作为信息性线索:通过跨视图分组和条件替换,树抑制了过度分割和不足分割,生成视图一致的掩膜和稳健的3D实例。每个实例都丰富了开放词汇语义嵌入,使从自然语言指令中准确选择目标成为可能。为处理多阶段任务中的场景变化,我们进一步引入了一种一致性感知更新,仅通过单张后交互图像保留实例对应关系,从而实现高效的适应而无需重新扫描。Clutt3R-Seg在合成和真实世界数据集上进行了评估,并在真实机器人上进行了验证。在所有设置中,它在杂乱和稀疏视角场景中始终优于最先进的基线。即使在最具挑战性的重杂乱序列中,Clutt3R-Seg的AP@25为61.66,比基线高出2.2倍,且仅使用四个输入视图时,它也超过了使用八视图的MaskClustering超过2倍。代码可在:https://github.com/jeonghonoh/clutt3r-seg/ 获取。
Summary / 总结
Clutt3R-Seg is a zero-shot pipeline for 3D instance segmentation in cluttered scenes, addressing occlusions, limited viewpoints, and noisy masks. It introduces a hierarchical instance tree of semantic cues that suppresses over- and under-segmentation, yielding view-consistent masks and robust 3D instances. The method enriches each instance with open-vocabulary semantic embeddings for accurate target selection from natural language instructions. Clutt3R-Seg outperforms state-of-the-art baselines in cluttered and sparse-view scenarios, achieving an AP@25 of 61.66 on the most challenging heavy-clutter sequences (over 2.2x higher than baselines) and, with only four input views, surpassing MaskClustering with eight views by more than 2x.
Clutt3R-Seg 是一种零样本管道,用于处理杂乱场景中的 3D 实例分割,解决遮挡和有限视角的问题。它通过引入层次化的实例树来抑制过度分割和不足分割,生成视图一致的掩码和稳健的 3D 实例。该方法为每个实例添加开放词汇量语义嵌入,并使用一致性感知更新来保留实例对应关系。实验表明,Clutt3R-Seg 在多个设置中均优于最先进的基线方法,AP@25 达到 61.66,使用更少的输入视图就超过了 MaskClustering。
ScalSelect: Scalable Training-Free Multimodal Data Selection for Efficient Visual Instruction Tuning
Authors: Changti Wu, Jiahuai Mao, Yuzhuo Miao, Shijie Lian, Bin Yu, Xiaopeng Lin, Cong Huang, Lei Zhang, Kai Chen
First: 2026-02-12T06:38:49+00:00 · Latest: 2026-02-12T06:38:49+00:00
Comments: The code is available at https://github.com/ChangtiWu/ScalSelect
Abstract
Large-scale Visual Instruction Tuning (VIT) has become a key paradigm for advancing the performance of vision-language models (VLMs) across various multimodal tasks. However, training on the large-scale datasets is computationally expensive and inefficient due to redundancy in the data, which motivates the need for multimodal data selection to improve training efficiency. Existing data selection methods for VIT either require costly training or gradient computation. Training-free alternatives often depend on proxy models or datasets, instruction-agnostic representations, and pairwise similarity with quadratic complexity, limiting scalability and representation fidelity. In this work, we propose ScalSelect, a scalable training-free multimodal data selection method with linear-time complexity with respect to the number of samples, eliminating the need for external models or auxiliary datasets. ScalSelect first constructs sample representations by extracting visual features most attended by instruction tokens in the target VLM, capturing instruction-relevant information. It then identifies samples whose representations best approximate the dominant subspace of the full dataset representations, enabling scalable importance scoring without pairwise comparisons. Extensive experiments across multiple VLMs, datasets, and selection budgets demonstrate that ScalSelect achieves over 97.5% of the performance of training on the full dataset using only 16% of the data, and even outperforms full-data training in some settings. The code is available at https://github.com/ChangtiWu/ScalSelect.
中文标题/摘要
标题:ScalSelect:高效视觉指令调优的可扩展无训练多模态数据选择
大规模视觉指令调优(VIT)已成为提升视觉语言模型(VLMs)在各种多模态任务性能的关键范式。然而,由于数据中的冗余性,大规模数据集的训练在计算上既昂贵又低效,这促使需要多模态数据选择以提高训练效率。现有的VIT数据选择方法要么需要昂贵的训练或梯度计算,要么依赖于代理模型或数据集、指令无关的表示以及具有二次复杂度的成对相似性,限制了可扩展性和表示保真度。在本文中,我们提出了一种名为ScalSelect的可扩展无训练多模态数据选择方法,其时间复杂度与样本数量成线性关系,消除了对外部模型或辅助数据集的需求。ScalSelect首先通过提取目标VLM中指令标记最关注的视觉特征来构建样本表示,捕捉指令相关信息。然后,它识别出那些表示最能逼近完整数据集表示主导子空间的样本,从而在无需成对比较的情况下实现可扩展的重要性评分。在多个VLM、数据集和选择预算的广泛实验中表明,ScalSelect仅使用数据的16%就能达到全数据集训练超过97.5%的性能,并且在某些情况下甚至优于全数据集训练。代码可在<https://github.com/ChangtiWu/ScalSelect> 获取。
Summary / 总结
ScalSelect is a scalable training-free multimodal data selection method for efficient visual instruction tuning, addressing the computational inefficiency of large-scale training. It constructs sample representations by extracting visual features most attended by instruction tokens and identifies samples that best approximate the dominant subspace of full dataset representations, avoiding pairwise comparisons. ScalSelect achieves over 97.5% of the performance of full dataset training using only 16% of the data across various VLMs and datasets, outperforming full-data training in some settings.
ScalSelect 是一种用于高效视觉指令调优(VIT)的可扩展无训练数据选择方法,适用于视觉语言模型。它通过指令标记最关注的视觉特征构建样本表示,并识别出最能逼近完整数据集表示主导子空间的样本。ScalSelect 使用仅 16% 的数据即可达到完整数据集训练超过 97.5% 的性能,并在某些情况下甚至优于完整数据集训练。
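The subspace-based scoring described in the ScalSelect abstract can be illustrated with a simple proxy. The sketch below scores each sample by how much of its representation lies in the top-k singular subspace of the full representation matrix, then keeps the highest-scoring samples. This is an assumed stand-in, not the paper's scoring rule: the function names, the energy-based score, and the parameters `k` and `budget` are hypothetical, and the attention-based feature extraction from the target VLM is not reproduced.

```python
import numpy as np

def subspace_scores(reps, k):
    """Score samples by their energy in the top-k singular subspace.

    reps: array of shape (n_samples, dim). For a fixed dim, the thin SVD
    cost grows linearly with n_samples and no pairwise comparison is needed.
    """
    X = np.asarray(reps, dtype=float)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    basis = vt[:k]                 # (k, dim) dominant right singular vectors
    proj = X @ basis.T             # coordinates of each sample in the subspace
    return np.sum(proj ** 2, axis=1)

def select_top(reps, k, budget):
    """Return indices of the `budget` samples with the highest scores."""
    scores = subspace_scores(reps, k)
    return np.argsort(-scores)[:budget]

# Usage: two samples dominate the first axis and are selected.
reps = np.array([[10.0, 0.0], [0.0, 0.1], [9.0, 0.0], [0.0, 0.2]])
chosen = select_top(reps, k=1, budget=2)
```

Scoring against a shared low-rank basis is what keeps the cost free of an n-by-n similarity matrix, which is the scalability property the abstract emphasizes.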
SkillRater: Untangling Capabilities in Multimodal Data
Authors: Naveen Sahi, Jeremy Dohmann, Armen Aghajanyan, Akshat Shrivastava
First: 2026-02-12T06:07:03+00:00 · Latest: 2026-02-12T06:07:03+00:00
Abstract
Data curation methods typically assign samples a single quality score. We argue this scalar framing is fundamentally limited: when training requires multiple distinct capabilities, a monolithic scorer cannot maximize useful signals for all of them simultaneously. Quality is better understood as multidimensional, with each dimension corresponding to a capability the model must acquire. We introduce SkillRater, a framework that decomposes data filtering into specialized raters - one per capability, each trained via meta-learning on a disjoint validation objective - and composes their scores through a progressive selection rule: at each training stage, a sample is retained if any rater ranks it above a threshold that tightens over time, preserving diversity early while concentrating on high-value samples late. We validate this approach on vision language models, decomposing quality into three capability dimensions: visual understanding, OCR, and STEM reasoning. At 2B parameters, SkillRater improves over unfiltered baselines by 5.63% on visual understanding, 2.00% on OCR, and 3.53% on STEM on held out benchmarks. The learned rater signals are near orthogonal, confirming that the decomposition captures genuinely independent quality dimensions and explaining why it outperforms both unfiltered training and monolithic learned filtering.
中文标题/摘要
标题:SkillRater:解开多模态数据能力的谜团
数据整理方法通常为样本分配单一的质量评分。我们认为这种标量框架本质上是有限的:当训练需要多种不同的能力时,单一的评分器无法同时最大化所有能力的有用信号。质量应被视为多维的,每个维度对应模型必须掌握的一种能力。我们引入了SkillRater框架,将数据过滤分解为专门的评分器(每个能力一个,通过元学习在互不相交的验证目标上进行训练),并通过逐步选择规则组合它们的评分:在每个训练阶段,只要有任一评分器将样本排名高于随时间收紧的阈值,该样本就被保留,早期保持多样性,后期集中于高价值样本。我们在视觉语言模型上验证了这种方法,将质量分解为三个能力维度:视觉理解、OCR和STEM推理。在20亿参数规模下,在留出的基准测试上,与未过滤的基线相比,SkillRater在视觉理解上提高了5.63%,在OCR上提高了2.00%,在STEM上提高了3.53%。学习到的评分器信号几乎正交,证实了分解捕捉到了真正独立的质量维度,解释了为什么它优于未过滤训练和单一学习过滤。
Summary / 总结
SkillRater is a framework that decomposes data quality into multiple dimensions corresponding to distinct capabilities, addressing the limitations of scalar quality scores. It uses meta-learning to train a specialized rater for each capability and combines their scores through a progressive selection rule. On 2B-parameter vision language models, SkillRater improves over unfiltered baselines by 5.63% on visual understanding, 2.00% on OCR, and 3.53% on STEM on held-out benchmarks. The learned rater signals are near orthogonal, which indicates that the decomposition captures independent quality dimensions.
SkillRater 是一个框架,将数据过滤分解为针对不同能力的专业评分器,每个评分器通过元学习进行训练。它在视觉语言模型上分别提高了5.63%的视觉理解、2.00%的OCR和3.53%的STEM推理,超过了未过滤的基线。学习到的评分信号几乎是正交的,表明独立的质量维度,并解释了该框架的有效性。
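The progressive selection rule described in the SkillRater abstract (keep a sample if any rater ranks it above a threshold that tightens over training stages) can be sketched as follows. This is only a minimal illustration: the percentile ranking, the linear schedule, and the `start`/`end` cutoffs are assumptions made here, not details given in the abstract.

```python
from typing import Dict, List


def rank_percentiles(scores: List[float]) -> List[float]:
    """Return each sample's rank percentile in (0, 1]; 1.0 is the highest score."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    pct = [0.0] * n
    for rank, i in enumerate(order):
        pct[i] = (rank + 1) / n
    return pct


def threshold_at(stage: int, num_stages: int,
                 start: float = 0.2, end: float = 0.8) -> float:
    """Hypothetical linear schedule: the percentile cutoff rises from start to end."""
    if num_stages <= 1:
        return end
    return start + (end - start) * stage / (num_stages - 1)


def select(rater_scores: Dict[str, List[float]],
           stage: int, num_stages: int) -> List[int]:
    """Keep a sample if ANY rater ranks it above the current cutoff."""
    cutoff = threshold_at(stage, num_stages)
    pcts = [rank_percentiles(s) for s in rater_scores.values()]
    n = len(pcts[0])
    return [i for i in range(n) if any(p[i] > cutoff for p in pcts)]


# Toy scores from three hypothetical raters over four samples.
scores = {
    "visual": [0.9, 0.1, 0.5, 0.3],
    "ocr": [0.2, 0.8, 0.4, 0.1],
    "stem": [0.1, 0.2, 0.3, 0.95],
}
early = select(scores, stage=0, num_stages=3)  # loose cutoff, keeps diversity
late = select(scores, stage=2, num_stages=3)   # tight cutoff, keeps top samples
```

Because the rule is a union over raters, a sample that is mediocre for every capability is dropped late in training, while a sample that is excellent for even one capability survives.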
The Determinism of Randomness: Latent Space Degeneracy in Diffusion Model
Authors: Song Yan, Chenfeng Wang, Wei Zhai, Xinliang Bi, Jian Yang, Yusen Zhang, Yunwei Lan, Tao Zhang, GuanYe Xiong, Min Li, Zheng-Jun Zha
First: 2025-11-11T02:12:38+00:00 · Latest: 2026-02-12T03:33:55+00:00
Abstract
Diffusion models draw the initial latent from an isotropic Gaussian distribution (all directions equally likely). But in practice, changing only the random seed can sharply alter image quality and prompt faithfulness. We explain this by distinguishing the isotropic prior from the semantics induced by the sampling map: while the prior is direction-agnostic, the mapping from latent noise to semantics has semantic-invariant directions and semantic-sensitive directions, so different seeds can lead to very different semantic outcomes. Motivated by this view, we propose a training-free inference procedure that (i) suppresses seed-specific, semantic-irrelevant variation via distribution-preserving semantic erasure, (ii) reinforces prompt-relevant semantic directions through timestep-aggregated horizontal injection, and (iii) applies a simple spherical retraction to stay near the prior's typical set. Across multiple backbones and benchmarks, our method consistently improves alignment and generation quality over standard sampling.
中文标题/摘要
标题:随机性的决定论:扩散模型中的潜在空间退化
扩散模型从各向同性的高斯分布中抽取初始潜在变量(所有方向的可能性相同)。但在实践中,仅改变随机种子就能显著改变图像质量和提示忠实度。我们通过区分各向同性先验与采样映射诱导的语义来解释这一现象:虽然先验对方向无偏好,但从潜在噪声到语义的映射具有语义不变的方向和语义敏感的方向,因此不同的种子可能导致非常不同的语义结果。受此观点的启发,我们提出了一种无需训练的推理过程:(i)通过分布保持的语义擦除抑制种子特定的、与语义无关的变异,(ii)通过时间步聚合的水平注入强化与提示相关的语义方向,(iii)应用简单的球面回缩以保持在先验的典型集附近。在多个骨干网络和基准测试中,我们的方法在对齐度与生成质量上均稳定优于标准采样。
Summary / 总结
This paper investigates the determinism of randomness in diffusion models, where changing the random seed can significantly affect image quality and prompt faithfulness despite the isotropic Gaussian prior. The authors propose a training-free inference method that suppresses seed-specific, semantic-irrelevant variations, reinforces prompt-relevant semantic directions, and maintains proximity to the prior's typical set, leading to improved alignment and generation quality across different models and benchmarks.
论文探讨了随机种子如何显著影响扩散模型中的图像质量和提示忠实度,尽管这些模型的初始潜在变量来自各向同性的高斯分布。作者提出了一种无需训练的推理方法,该方法抑制种子特定的、与语义无关的变异,强化与提示相关的语义方向,并应用球面回缩以保持在先验的典型集附近。该方法在多种扩散模型骨干和基准上提高了对齐度和生成质量,优于标准采样方法。
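As a rough illustration of step (iii) in the abstract above: samples from a d-dimensional standard Gaussian concentrate near the sphere of radius sqrt(d), so one simple way to stay near the prior's typical set is to rescale an edited latent back to that radius. The sketch below assumes this plain rescaling; the abstract does not specify the exact retraction the authors use.

```python
import math
from typing import List


def spherical_retraction(z: List[float]) -> List[float]:
    """Rescale z to norm sqrt(d), the typical radius of a d-dim standard Gaussian.

    Assumption: a plain radial rescaling; the paper's retraction may differ.
    """
    d = len(z)
    norm = math.sqrt(sum(x * x for x in z))
    if norm == 0.0:
        return list(z)  # direction undefined; leave the latent unchanged
    scale = math.sqrt(d) / norm
    return [x * scale for x in z]


# A latent whose norm (5.0) drifted away from sqrt(4) = 2.0 after editing.
edited = [3.0, 4.0, 0.0, 0.0]
retracted = spherical_retraction(edited)
retracted_norm = math.sqrt(sum(x * x for x in retracted))
```

The direction of the latent is preserved; only its radius changes.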
What if Agents Could Imagine? Reinforcing Open-Vocabulary HOI Comprehension through Generation
Authors: Zhenlong Yuan, Xiangyan Qu, Jing Tang, Rui Chen, Lei Sun, Ruidong Chen, Hongwei Yu, Chengxuan Qian, Xiangxiang Chu, Shuo Li, Yuyin Zhou
First: 2026-02-12T02:51:59+00:00 · Latest: 2026-02-12T02:51:59+00:00
Abstract
Multimodal Large Language Models have shown promising capabilities in bridging visual and textual reasoning, yet their reasoning capabilities in Open-Vocabulary Human-Object Interaction (OV-HOI) are limited by cross-modal hallucinations and occlusion-induced ambiguity. To address this, we propose ImagineAgent, an agentic framework that harmonizes cognitive reasoning with generative imagination for robust visual understanding. Specifically, our method innovatively constructs cognitive maps that explicitly model plausible relationships between detected entities and candidate actions. Subsequently, it dynamically invokes tools including retrieval augmentation, image cropping, and diffusion models to gather domain-specific knowledge and enriched visual evidence, thereby achieving cross-modal alignment in ambiguous scenarios. Moreover, we propose a composite reward that balances prediction accuracy and tool efficiency. Evaluations on SWIG-HOI and HICO-DET datasets demonstrate our SOTA performance, requiring approximately 20% of training data compared to existing methods, validating our robustness and efficiency.
中文标题/摘要
标题:假如智能体能想象?通过生成增强开放词汇HOI理解
多模态大型语言模型在视觉和文本推理方面显示出有希望的能力,但在开放词汇人类-物体交互(OV-HOI)中的推理能力受限于跨模态幻觉和遮挡引起的歧义。为了解决这个问题,我们提出了ImagineAgent,这是一种将认知推理与生成性想象相结合的智能体框架,以实现稳健的视觉理解。具体而言,我们的方法创新性地构建了认知地图,明确地建模了检测到的实体与候选动作之间可能的关系。随后,它动态地调用包括检索增强、图像裁剪和扩散模型在内的工具,以收集领域特定知识和丰富的视觉证据,从而在模糊场景中实现跨模态对齐。此外,我们提出了一种综合奖励,平衡预测准确性和工具效率。在SWIG-HOI和HICO-DET数据集上的评估表明,我们的方法在性能上达到SOTA水平,所需训练数据量约为现有方法的20%,验证了我们的稳健性和效率。
Summary / 总结
The research aims to enhance the reasoning capabilities of multimodal large language models in Open-Vocabulary Human-Object Interaction (OV-HOI) by addressing cross-modal hallucinations and occlusion-induced ambiguity. The proposed ImagineAgent framework combines cognitive reasoning with generative imagination to construct cognitive maps and dynamically invoke tools (retrieval augmentation, image cropping, and diffusion models) for cross-modal alignment. Experimental results on SWIG-HOI and HICO-DET datasets show that ImagineAgent achieves state-of-the-art performance while requiring approximately 20% of the training data used by existing methods, validating its robustness and efficiency.
研究旨在通过解决跨模态幻觉和遮挡引起的歧义问题,增强多模态大型语言模型在开放词汇人类物体交互(OV-HOI)中的推理能力。提出的ImagineAgent框架结合了认知推理和生成性想象,构建认知地图,并动态调用各种工具进行跨模态对齐。实验结果表明,ImagineAgent在SWIG-HOI和HICO-DET数据集上达到了最先进的性能,并且所需训练数据量显著减少,验证了其稳健性和效率。
Anagent For Enhancing Scientific Table & Figure Analysis
Authors: Xuehang Guo, Zhiyong Lu, Tom Hope, Qingyun Wang
First: 2026-02-10T18:46:28+00:00 · Latest: 2026-02-12T02:51:40+00:00
Abstract
In scientific research, analysis requires accurately interpreting complex multimodal knowledge, integrating evidence from different sources, and drawing inferences grounded in domain-specific knowledge. However, current artificial intelligence (AI) systems struggle to consistently demonstrate such capabilities. The complexity and variability of scientific tables and figures, combined with heterogeneous structures and long-context requirements, pose fundamental obstacles to scientific table & figure analysis. To quantify these challenges, we introduce AnaBench, a large-scale benchmark featuring 63,178 instances from nine scientific domains, systematically categorized along seven complexity dimensions. To tackle these challenges, we propose Anagent, a multi-agent framework for enhanced scientific table & figure analysis through four specialized agents: Planner decomposes tasks into actionable subtasks, Expert retrieves task-specific information through targeted tool execution, Solver synthesizes information to generate coherent analysis, and Critic performs iterative refinement through five-dimensional quality assessment. We further develop modular training strategies that leverage supervised finetuning and specialized reinforcement learning to optimize individual capabilities while maintaining effective collaboration. Comprehensive evaluation across 9 broad domains with 170 subdomains demonstrates that Anagent achieves substantial improvements, up to 13.43% in training-free settings and 42.12% with finetuning, while revealing that task-oriented reasoning and context-aware problem-solving are essential for high-quality scientific table & figure analysis. Our project page: https://xhguo7.github.io/Anagent/.
Summary / 总结
The paper addresses the challenge of accurately analyzing complex scientific tables and figures, which current AI systems often struggle with. It introduces AnaBench, a benchmark with 63,178 instances, and proposes Anagent, a multi-agent framework with four specialized agents: Planner, Expert, Solver, and Critic. Anagent shows significant improvements in analysis quality, up to 13.43% in training-free settings and 42.12% with finetuning, highlighting the importance of task-oriented reasoning and context-aware problem-solving in scientific analysis.
论文旨在解决当前AI系统在分析复杂多变的科学表格和图表时遇到的挑战。为此,作者引入了包含63,178个实例的AnaBench基准,并提出了一个由规划者、专家、解决者和评论者组成的多智能体框架Anagent。Anagent在无训练和微调设置下的分析质量最高分别提高了13.43%和42.12%,表明任务导向的推理和上下文感知的问题解决对于高质量的科学分析至关重要。
NarraScore: Bridging Visual Narrative and Musical Dynamics via Hierarchical Affective Control
Authors: Yufan Wen, Zhaocheng Liu, YeGuo Hua, Ziyi Guo, Lihua Zhang, Chun Yuan, Jian Wu
First: 2026-02-09T09:39:42+00:00 · Latest: 2026-02-12T02:33:29+00:00
Abstract
Synthesizing coherent soundtracks for long-form videos remains a formidable challenge, currently stalled by three critical impediments: computational scalability, temporal coherence, and, most critically, a pervasive semantic blindness to evolving narrative logic. To bridge these gaps, we propose NarraScore, a hierarchical framework predicated on the core insight that emotion serves as a high-density compression of narrative logic. Uniquely, we repurpose frozen Vision-Language Models (VLMs) as continuous affective sensors, distilling high-dimensional visual streams into dense, narrative-aware Valence-Arousal trajectories. Mechanistically, NarraScore employs a Dual-Branch Injection strategy to reconcile global structure with local dynamism: a Global Semantic Anchor ensures stylistic stability, while a surgical Token-Level Affective Adapter modulates local tension via direct element-wise residual injection. This minimalist design bypasses the bottlenecks of dense attention and architectural cloning, effectively mitigating the overfitting risks associated with data scarcity. Experiments demonstrate that NarraScore achieves state-of-the-art consistency and narrative alignment with negligible computational overhead, establishing a fully autonomous paradigm for long-video soundtrack generation.
中文标题/摘要
标题:NarraScore:通过层次情感控制连接视觉叙事与音乐动态
为长视频合成连贯的音轨仍然是一个艰巨的挑战,目前受阻于三个关键障碍:计算可扩展性、时间连贯性,以及最关键的一点,即对不断演变的叙事逻辑普遍存在语义盲区。为弥合这些差距,我们提出NarraScore,这是一种层次框架,其核心洞察是情感可作为叙事逻辑的高密度压缩。独特之处在于,我们将冻结的视觉-语言模型(VLMs)重新用作连续的情感传感器,将高维视觉流提炼为密集的、具有叙事感知的效价-唤醒(Valence-Arousal)轨迹。从机制上讲,NarraScore采用双分支注入策略来协调全局结构与局部动态:全局语义锚点(Global Semantic Anchor)确保风格的稳定性,而精细的词元级情感适配器(Token-Level Affective Adapter)通过直接的逐元素残差注入调节局部张力。这种极简设计绕过了密集注意力和架构克隆的瓶颈,有效缓解了数据稀缺带来的过拟合风险。实验表明,NarraScore在一致性与叙事对齐方面达到最先进的水平,且计算开销可忽略不计,确立了一种完全自主的长视频音轨生成范式。
Summary / 总结
NarraScore is a hierarchical framework that addresses the challenges of synthesizing coherent soundtracks for long-form videos by leveraging the insight that emotion can compress narrative logic. It uses frozen Vision-Language Models as affective sensors to generate Valence-Arousal trajectories and employs a Dual-Branch Injection strategy to balance global structure and local dynamism. Experiments show that NarraScore achieves state-of-the-art consistency and narrative alignment with minimal computational cost, setting a new standard for long-video soundtrack generation.
NarraScore 是一个分层框架,通过利用情感可以压缩叙事逻辑这一核心洞察来应对长视频配乐生成的挑战。它将冻结的视觉-语言模型重新用作情感传感器,把视觉流提炼为具有叙事感知的效价-唤醒(Valence-Arousal)轨迹。该框架采用双分支注入策略,在确保风格稳定性的同时调节局部张力,以极小的计算开销实现了最先进的一致性和叙事对齐。
DSO: Direct Steering Optimization for Bias Mitigation
Authors: Lucas Monteiro Paes, Nivedha Sivakumar, Yinong Oliver Wang, Masha Fedzechkina, Barry-John Theobald, Luca Zappella, Nicholas Apostoloff
First: 2025-12-17T19:43:46+00:00 · Latest: 2026-02-12T00:30:59+00:00
Abstract
Generative models are often deployed to make decisions on behalf of users, such as vision-language models (VLMs) identifying which person in a room is a doctor to help visually impaired individuals. Yet, VLM decisions are influenced by the perceived demographic attributes of people in the input, which can lead to biased outcomes like failing to identify women as doctors. Moreover, when reducing bias leads to performance loss, users may have varying needs for balancing bias mitigation with overall model capabilities, highlighting the demand for methods that enable controllable bias reduction during inference. Activation steering is a popular approach for inference-time controllability that has shown potential in inducing safer behavior in large language models (LLMs). However, we observe that current steering methods struggle to correct biases, where equiprobable outcomes across demographic groups are required. To address this, we propose Direct Steering Optimization (DSO) which uses reinforcement learning to find linear transformations for steering activations, tailored to mitigate bias while maintaining control over model performance. We demonstrate that DSO achieves state-of-the-art trade-off between fairness and capabilities on both VLMs and LLMs, while offering practitioners inference-time control over the trade-off. Overall, our work highlights the benefit of designing steering strategies that are directly optimized to control model behavior, providing more effective bias intervention than methods that rely on pre-defined heuristics for controllability.
中文标题/摘要
标题:DSO:直接导向优化以减轻偏差
生成模型通常被部署为代用户做出决策,例如视觉语言模型(VLMs)识别房间中哪个人是医生,以帮助视力受损的个体。然而,VLM 的决策会受到输入中人物被感知到的人口统计属性的影响,这可能导致有偏差的结果,例如未能将女性识别为医生。此外,当减少偏差导致性能损失时,用户在平衡偏差减轻与整体模型能力方面可能有不同的需求,这突显了对能够在推理时可控地减少偏差的方法的需求。激活导向是一种流行的推理时可控性方法,在诱导大型语言模型(LLMs)的更安全行为方面显示出潜力。然而,我们观察到,在需要各人口统计群体之间结果等概率的偏差纠正场景中,当前的导向方法表现不佳。为了解决这个问题,我们提出了直接导向优化(DSO),它使用强化学习来寻找用于导向激活的线性变换,专门用于减轻偏差,同时保持对模型性能的控制。我们证明DSO在VLMs和LLMs上实现了公平性和能力之间最先进的权衡,同时为实践者提供了推理时对该权衡的控制。总体而言,我们的工作突显了设计直接针对控制模型行为进行优化的导向策略的好处,它比依赖预定义启发式来实现可控性的方法提供了更有效的偏差干预。
Summary / 总结
The paper addresses the issue of bias in generative models like vision-language models (VLMs) and large language models (LLMs), which can lead to unfair outcomes. It introduces Direct Steering Optimization (DSO), a method using reinforcement learning to optimize linear transformations of model activations, enabling better control over bias mitigation and model performance. DSO demonstrates superior performance in balancing fairness and capabilities compared to existing methods, offering practitioners more control over the trade-off during inference.
研究旨在减轻生成模型如视觉语言模型(VLMs)和大型语言模型(LLMs)中的偏见,同时保持模型性能。提出了一种直接导向优化(DSO)方法,使用强化学习找到线性变换来导向模型激活,从而实现可控的偏见减少。DSO在VLMs和LLMs上实现了公平性和能力之间最先进的权衡,并让用户能够在推理时控制这种权衡。
Hierarchical Concept Embedding & Pursuit for Interpretable Image Classification
Authors: Nghia Nguyen, Tianjiao Ding, René Vidal
First: 2026-02-11T23:53:15+00:00 · Latest: 2026-02-11T23:53:15+00:00
Abstract
Interpretable-by-design models are gaining traction in computer vision because they provide faithful explanations for their predictions. In image classification, these models typically recover human-interpretable concepts from an image and use them for classification. Sparse concept recovery methods leverage the latent space of vision-language models to represent image embeddings as a sparse combination of concept embeddings. However, because such methods ignore the hierarchical structure of concepts, they can produce correct predictions with explanations that are inconsistent with the hierarchy. In this work, we propose Hierarchical Concept Embedding & Pursuit (HCEP), a framework that induces a hierarchy of concept embeddings in the latent space and uses hierarchical sparse coding to recover the concepts present in an image. Given a hierarchy of semantic concepts, we construct a corresponding hierarchy of concept embeddings and, assuming the correct concepts for an image form a rooted path in the hierarchy, derive desirable conditions for identifying them in the embedded space. We show that hierarchical sparse coding reliably recovers hierarchical concept embeddings, whereas vanilla sparse coding fails. Our experiments on real-world datasets demonstrate that HCEP outperforms baselines in concept precision and recall while maintaining competitive classification accuracy. Moreover, when the number of samples is limited, HCEP achieves superior classification accuracy and concept recovery. These results show that incorporating hierarchical structures into sparse coding yields more reliable and interpretable image classification models.
中文标题/摘要
标题:层次概念嵌入与追求以实现可解释的图像分类
设计可解释的模型在计算机视觉中正逐渐流行,因为它们能够为预测提供忠实的解释。在图像分类中,这些模型通常从图像中恢复出可由人类理解的概念,并使用这些概念进行分类。稀疏概念恢复方法利用视觉-语言模型的潜在空间,将图像嵌入表示为概念嵌入的稀疏组合。然而,由于这些方法忽略了概念的层次结构,它们可能会产生与层次结构不一致的解释,但仍然得到正确的预测。在本文中,我们提出了一种名为层次概念嵌入与追求(HCEP)的框架,该框架在潜在空间中诱导概念嵌入的层次结构,并使用层次稀疏编码来恢复图像中存在的概念。给定一组语义概念的层次结构,我们构建相应的概念嵌入层次结构,并假设图像中的正确概念形成层次结构中的根路径,从而推导出在嵌入空间中识别它们的有利条件。我们证明,层次稀疏编码可靠地恢复了层次概念嵌入,而普通的稀疏编码则失败。我们在真实数据集上的实验表明,HCEP 在概念精确度和召回率方面优于基线模型,同时保持了竞争力的分类准确性。此外,当样本数量有限时,HCEP 在分类准确性和概念恢复方面表现出色。这些结果表明,将层次结构纳入稀疏编码中可以产生更可靠和可解释的图像分类模型。
Summary / 总结
This work addresses the need for interpretable image classification models by proposing Hierarchical Concept Embedding & Pursuit (HCEP), which induces a hierarchy of concept embeddings in the latent space of vision-language models. HCEP uses hierarchical sparse coding to recover the concepts present in an image, whereas vanilla sparse coding fails to recover hierarchical concept embeddings. Experiments on real-world datasets show that HCEP outperforms baselines in concept precision and recall while maintaining competitive classification accuracy, and that it achieves superior classification accuracy and concept recovery when samples are limited.
本文提出了层次概念嵌入与追求(HCEP)框架,通过引入概念的层次结构来提升图像分类模型的可解释性。HCEP 构建了概念的层次结构,并使用层次稀疏编码来恢复图像中存在的概念。实验表明,HCEP 在概念精度和召回率方面优于基线方法,同时保持了竞争力的分类准确性,特别是在样本有限的情况下表现更优。
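To make the rooted-path assumption in the HCEP abstract concrete, the toy sketch below walks a concept tree from the root and, at each level, keeps the child whose embedding correlates most with the image embedding. This greedy walk is only a stand-in chosen for illustration; it is not the paper's hierarchical sparse coding algorithm, and the tree, names, and embeddings are all hypothetical.

```python
from typing import Dict, List


def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def pursue_path(image_emb: List[float],
                children: Dict[str, List[str]],
                embeddings: Dict[str, List[float]],
                root: str = "root") -> List[str]:
    """Greedy root-to-leaf walk: pick the best-matching child at every level."""
    path: List[str] = []
    node = root
    while children.get(node):
        node = max(children[node], key=lambda c: dot(image_emb, embeddings[c]))
        path.append(node)
    return path


# Hypothetical two-level concept hierarchy with 4-dimensional embeddings.
children = {
    "root": ["animal", "vehicle"],
    "animal": ["dog", "cat"],
    "vehicle": ["car", "truck"],
}
embeddings = {
    "animal": [1.0, 0.0, 0.0, 0.0],
    "vehicle": [0.0, 1.0, 0.0, 0.0],
    "dog": [0.7, 0.0, 0.7, 0.0],
    "cat": [0.7, 0.0, 0.0, 0.7],
    "car": [0.0, 0.7, 0.7, 0.0],
    "truck": [0.0, 0.7, 0.0, 0.7],
}
image_emb = [0.8, 0.1, 0.6, 0.1]
path = pursue_path(image_emb, children, embeddings)
```

Because each step only considers the children of the previously selected concept, the recovered concepts always form a rooted path, so the explanation cannot contradict the hierarchy.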
Bootstrapping Action-Grounded Visual Dynamics in Unified Vision-Language Models
Authors: Yifu Qiu, Yftah Ziser, Anna Korhonen, Shay B. Cohen, Edoardo M. Ponti
First: 2025-06-06T11:50:18+00:00 · Latest: 2026-02-11T20:03:58+00:00
Abstract
Can unified vision-language models (VLMs) perform forward dynamics prediction (FDP), i.e., predicting the future state (in image form) given the previous observation and an action (in language form)? We find that VLMs struggle to generate physically plausible transitions between frames from instructions. Nevertheless, we identify a crucial asymmetry in multimodal grounding: fine-tuning a VLM to learn inverse dynamics prediction (IDP), effectively captioning the action between frames, is significantly easier than learning FDP. In turn, IDP can be used to bootstrap FDP through two main strategies: 1) weakly supervised learning from synthetic data and 2) inference time verification. Firstly, IDP can annotate actions for unlabelled pairs of video frame observations to expand the training data scale for FDP. Secondly, IDP can assign rewards to multiple samples of FDP to score them, effectively guiding search at inference time. We evaluate the FDP resulting from both strategies through the task of action-centric image editing on Aurora-Bench with two families of VLMs. Despite remaining general-purpose, our best model achieves a performance competitive with state-of-the-art image editing models, improving on them by a margin between 7% and 13% according to GPT4o-as-judge, and achieving the best average human evaluation across all subsets of Aurora-Bench.
中文标题/摘要
标题:统一视觉语言模型中的基于动作的视觉动力学自举
统一视觉语言模型(VLMs)能否执行前向动力学预测(FDP),即给定先前观察和动作(语言形式)的情况下预测未来状态(图像形式)?我们发现,VLMs 在从指令生成物理上合理的帧间过渡方面表现不佳。然而,我们发现多模态定位中的一个关键不对称性:将 VLM 微调以学习逆向动力学预测(IDP),即在帧之间有效描述动作,比学习 FDP 要容易得多。反过来,IDP 可以通过两种主要策略来启动 FDP:1)从合成数据的弱监督学习,2)推理时验证。首先,IDP 可以为未标记的视频帧观察对标注动作,从而扩大 FDP 的训练数据规模。其次,IDP 可以为 FDP 的多个样本分配奖励,从而在推理时有效指导搜索。我们通过使用两种家族的 VLMs 在 Aurora-Bench 上执行以动作为中心的图像编辑任务,评估了这两种策略产生的 FDP。尽管保持通用性,我们的最佳模型在 GPT4o-as-judge 的评估中与最先进的图像编辑模型竞争,比它们高出 7% 至 13% 的幅度,并在 Aurora-Bench 的所有子集上实现了最佳的人类评估平均得分。
Summary / 总结
The study explores whether unified vision-language models (VLMs) can predict future states, in image form, given a previous observation and an action described in language. Although VLMs struggle to generate physically plausible transitions, the authors find that learning inverse dynamics prediction (IDP) is significantly easier than learning forward dynamics prediction (FDP). IDP can then bootstrap FDP through weakly supervised learning from synthetic data and through inference-time verification. On Aurora-Bench, the best model improves on state-of-the-art image editing models by 7% to 13% according to GPT4o-as-judge and achieves the best average human evaluation across all subsets.
研究探讨了统一视觉语言模型(VLMs)是否能够根据过去的观察和语言形式的动作预测未来的图像状态。尽管存在挑战,研究发现学习逆向动力学预测(IDP)比学习前向动力学预测(FDP)更容易。IDP可以通过从合成数据的弱监督学习和推理时验证来自举FDP。在Aurora-Bench上的评估显示,最佳模型在GPT4o-as-judge评分上比最先进的图像编辑模型提高了7%到13%,并在所有子集上实现了最高的平均人类评估。
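The inference-time verification strategy in the abstract above (IDP assigns rewards to multiple FDP samples to guide search) amounts to a best-of-N selection. The sketch below shows that pattern with placeholder callables: `generate` stands in for an FDP sampler and `score` for an IDP-based reward, and both signatures are assumptions made for this example rather than the authors' interfaces.

```python
from typing import Callable, List, Tuple, TypeVar

Frame = TypeVar("Frame")


def best_of_n(prev_frame: Frame,
              action: str,
              generate: Callable[[Frame, str, int], Frame],
              score: Callable[[Frame, Frame, str], float],
              n: int = 4) -> Tuple[Frame, float]:
    """Sample n candidate next frames and return the one the verifier rates highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[Frame] = [generate(prev_frame, action, seed) for seed in range(n)]
    rewards = [score(prev_frame, cand, action) for cand in candidates]
    best = max(range(n), key=lambda i: rewards[i])
    return candidates[best], rewards[best]


# Toy stand-ins: a "frame" is an integer and the action asks to add 3.
def toy_generate(prev: int, action: str, seed: int) -> int:
    return prev + seed  # each seed proposes a different transition


def toy_score(prev: int, cand: int, action: str) -> float:
    return -abs((cand - prev) - 3)  # reward transitions that match the action


chosen, reward = best_of_n(10, "add 3", toy_generate, toy_score, n=5)
```

In the toy run the verifier prefers the candidate whose change from the previous frame matches the requested action, which is the role IDP plays when scoring FDP samples.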