Language and Geometry Grounded Sparse Voxel Representations for Holistic Scene Understanding
Authors: Guile Wu, David Huang, Bingbing Liu, Dongfeng Bai
First: 2026-02-17T17:10:13+00:00 · Latest: 2026-02-17T17:10:13+00:00
Comments: Technical Report
Abstract
Existing 3D open-vocabulary scene understanding methods mostly emphasize distilling language features from 2D foundation models into 3D feature fields, but largely overlook the synergy among scene appearance, semantics, and geometry. As a result, scene understanding often deviates from the underlying geometric structure of scenes and becomes decoupled from the reconstruction process. In this work, we propose a novel approach that leverages language and geometry grounded sparse voxel representations to comprehensively model appearance, semantics, and geometry within a unified framework. Specifically, we use 3D sparse voxels as primitives and employ an appearance field, a density field, a feature field, and a confidence field to holistically represent a 3D scene. To promote synergy among the appearance, density, and feature fields, we construct a feature modulation module and distill language features from a 2D foundation model into our 3D scene model. In addition, we integrate geometric distillation into feature field distillation to transfer geometric knowledge from a geometry foundation model to our 3D scene representations via depth correlation regularization and pattern consistency regularization. These components work together to synergistically model the appearance, semantics, and geometry of the 3D scene within a unified framework. Extensive experiments demonstrate that our approach achieves superior overall performance compared with state-of-the-art methods in holistic scene understanding and reconstruction.
中文标题/摘要
标题:语言和几何驱动的稀疏体素表示法用于整体场景理解
现有的3D开放词汇场景理解方法主要强调将2D基础模型的语言特征提炼为3D特征场,但很大程度上忽视了场景外观、语义和几何之间的协同作用。因此,场景理解往往偏离场景的几何结构,与重建过程脱节。在本文中,我们提出了一种新的方法,利用语言和几何驱动的稀疏体素表示法,在统一框架中全面建模外观、语义和几何。具体而言,我们使用3D稀疏体素作为基本单元,并采用外观场、密度场、特征场和置信场来全面表示3D场景。为了促进外观、密度和特征场之间的协同作用,我们构建了一个特征调制模块,并将2D基础模型的语言特征提炼到我们的3D场景模型中。此外,我们还将几何提炼整合到特征场提炼中,通过深度相关正则化和模式一致性正则化,将几何知识从几何基础模型转移到我们的3D场景表示中。这些组件共同作用,在统一框架中协同建模3D场景的外观、语义和几何。大量实验表明,我们的方法在整体场景理解和重建方面优于最先进的方法。
Summary / 总结
This work addresses the limitations of existing 3D open-vocabulary scene understanding methods by grounding sparse voxel representations in both language and geometry. The method uses 3D sparse voxels as primitives and represents a scene with appearance, density, feature, and confidence fields. A feature modulation module promotes synergy among the appearance, density, and feature fields, while depth correlation regularization and pattern consistency regularization transfer geometric knowledge from a geometry foundation model. Experiments show superior overall performance compared with state-of-the-art methods in holistic scene understanding and reconstruction.
该研究针对现有3D场景理解方法的局限性,提出了一种将语言和几何学整合到稀疏体素表示中的新方法。该方法使用3D稀疏体素,并构建了外观、密度、特征和置信度字段来表示3D场景。此外,还包含特征调制模块和几何蒸馏技术,以增强这些字段之间的协同作用。实验表明,该方法在场景理解和重建任务中均优于现有最先进的方法。
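The abstract above names depth correlation regularization but does not give its formula. As a minimal sketch, assuming the regularizer rewards a high Pearson correlation between rendered depth and the depth predicted by a geometry foundation model (a common choice, not confirmed by the source), the loss could look like the following. The function names `pearson` and `depth_correlation_loss` are hypothetical.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of two equally long sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def depth_correlation_loss(rendered: Sequence[float], reference: Sequence[float]) -> float:
    """Assumed form: 1 - correlation, so perfectly correlated depths give a loss near 0."""
    return 1.0 - pearson(rendered, reference)


# Toy depths (made up): the reference is a scaled copy of the rendered depth,
# so the loss is near 0 even though the absolute scales differ.
loss_aligned = depth_correlation_loss([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
loss_flipped = depth_correlation_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

A correlation-based loss is invariant to the scale and shift of the reference depth, which is one reason it is often preferred when the reference comes from a monocular depth model.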
Can Multimodal LLMs Perform Time Series Anomaly Detection?
Authors: Xiongxiao Xu, Haoran Wang, Yueqing Liang, Philip S. Yu, Yue Zhao, Kai Shu
Venue: WWW
First: 2025-02-25T03:37:43+00:00 · Latest: 2026-02-17T17:04:01+00:00
Comments: ACM Web Conference 2026 (WWW'26)
Abstract
Time series anomaly detection (TSAD) has been a long-standing pillar problem in Web-scale systems and online infrastructures, such as service reliability monitoring, system fault diagnosis, and performance optimization. Although large language models (LLMs) have demonstrated unprecedented capabilities in time series analysis, the potential of multimodal LLMs (MLLMs), particularly vision-language models, in TSAD remains largely under-explored. One natural way for humans to detect time series anomalies is through visualization and textual description. This motivates our research question: Can multimodal LLMs perform time series anomaly detection? Existing studies often oversimplify the problem by treating point-wise anomalies as special cases of range-wise ones or by aggregating point anomalies to approximate range-wise scenarios. This limits our understanding of realistic scenarios such as multi-granular anomalies and irregular time series. To address the gap, we build the VisualTimeAnomaly benchmark to comprehensively investigate the zero-shot capabilities of MLLMs for TSAD, progressing from point-wise to range-wise to variate-wise anomalies and extending to irregular sampling conditions. Our study reveals several key insights into MLLMs for TSAD. Building on these findings, we propose TSAD-Agents, an MLLM-based multi-agent framework for automatic TSAD. The framework comprises scanning, planning, detection, and checking agents that collaborate to reason, plan, and self-reflect. These agents adaptively invoke tools such as traditional methods and MLLMs and dynamically switch between text and image modalities to optimize detection performance.
中文标题/摘要
标题:多模态LLM能否进行时间序列异常检测?
时间序列异常检测(TSAD)一直是大规模系统和在线基础设施中的长期核心问题,如服务可靠性监控、系统故障诊断和性能优化。大型语言模型(LLMs)在时间序列分析方面展现了前所未有的能力,而多模态LLM(MLLMs),尤其是视觉-语言模型,在TSAD方面的潜力尚未得到充分探索。人类自然地通过可视化和文本描述来检测时间序列异常。这激发了我们的研究问题:多模态LLM能否进行时间序列异常检测?现有研究往往通过将点异常视为区间异常的特殊情况或通过聚合点异常来近似区间场景来简化问题,这限制了我们对多粒度异常和不规则时间序列等现实场景的理解。为解决这一差距,我们构建了一个VisualTimeAnomaly基准,全面调查MLLMs在TSAD中的零样本能力,从点、区间到变量层面的异常,进一步扩展到不规则采样条件。我们的研究揭示了多模态MLLMs在TSAD中的几个关键见解。基于这些发现,我们提出了一种基于MLLMs的多智能体框架TSAD-Agents,以实现自动TSAD。该框架包括扫描、规划、检测和检查智能体,它们协同合作,进行推理、规划和自我反思,以实现自动TSAD。这些智能体能够适当地调用传统方法和MLLMs,并动态切换文本和图像模态,以优化检测性能。
Summary / 总结
This study asks whether multimodal large language models (MLLMs) can perform time series anomaly detection (TSAD), a question that existing studies oversimplify. It introduces the VisualTimeAnomaly benchmark to investigate the zero-shot TSAD capabilities of MLLMs across point-wise, range-wise, and variate-wise anomalies, and extends the evaluation to irregular sampling conditions. Building on the resulting insights, the authors propose TSAD-Agents, a multi-agent framework whose scanning, planning, detection, and checking agents collaborate to automate TSAD through adaptive tool invocation and dynamic switching between text and image modalities.
论文探讨了多模态大型语言模型(MLLMs)是否能够进行时间序列异常检测(TSAD),解决了现有研究的局限性。引入了VisualTimeAnomaly基准来探索MLLMs在TSAD中的零样本能力,涵盖了点异常、区间异常和变量异常,并扩展到不规则采样条件。关键发现包括MLLMs在TSAD中的表现,进而开发了TSAD-Agents多代理框架,该框架利用文本和图像模态协同工作,以优化异常检测性能。
LeafNet: A Large-Scale Dataset and Comprehensive Benchmark for Foundational Vision-Language Understanding of Plant Diseases
Authors: Khang Nguyen Quoc, Phuong D. Dao, Luyl-Da Quach
First: 2026-02-14T08:10:27+00:00 · Latest: 2026-02-17T16:47:13+00:00
Comments: 26 pages, 13 figures and 8 tables
Abstract
Foundation models and vision-language pre-training have significantly advanced Vision-Language Models (VLMs), enabling multimodal processing of visual and linguistic data. However, their application in domain-specific agricultural tasks, such as plant pathology, remains limited due to the lack of large-scale, comprehensive multimodal image-text datasets and benchmarks. To address this gap, we introduce LeafNet, a comprehensive multimodal dataset, and LeafBench, a visual question-answering benchmark developed to systematically evaluate the capabilities of VLMs in understanding plant diseases. The dataset comprises 186,000 leaf digital images spanning 97 disease classes, paired with metadata, generating 13,950 question-answer pairs spanning six critical agricultural tasks. The questions assess various aspects of plant pathology understanding, including visual symptom recognition, taxonomic relationships, and diagnostic reasoning. Benchmarking 12 state-of-the-art VLMs on our LeafBench dataset, we reveal substantial disparity in their disease understanding capabilities. Our study shows performance varies markedly across tasks: binary healthy-diseased classification exceeds 90% accuracy, while fine-grained pathogen and species identification remains below 65%. Direct comparison between vision-only models and VLMs demonstrates the critical advantage of multimodal architectures: fine-tuned VLMs outperform traditional vision models, confirming that integrating linguistic representations significantly enhances diagnostic precision. These findings highlight critical gaps in current VLMs for plant pathology applications and underscore the need for LeafBench as a rigorous framework for methodological advancement and progress evaluation toward reliable AI-assisted plant disease diagnosis. Code is available at https://github.com/EnalisUs/LeafBench.
中文标题/摘要
标题:LeafNet:植物病害基础视觉-语言理解的大规模数据集和全面基准
基础模型和视觉-语言预训练显著推动了视觉-语言模型(VLMs)的发展,使其能够处理视觉和语言数据。然而,由于缺乏大规模、全面的多模态图像-文本数据集和基准,它们在特定农业任务中的应用,如植物病理学,仍然受到限制。为解决这一问题,我们引入了LeafNet,一个全面的多模态数据集,以及LeafBench,一个视觉问答基准,用于系统评估VLMs在理解植物病害方面的能力。该数据集包含186,000张叶子的数字图像,涵盖97种疾病类别,配以元数据,生成了13,950个问题-答案对,覆盖六个关键农业任务。问题评估了植物病理学理解的各个方面,包括视觉症状识别、分类关系和诊断推理。在我们的LeafBench数据集上对12个最先进的VLMs进行基准测试,我们揭示了它们在疾病理解能力上的巨大差异。我们的研究表明,不同任务的性能差异显著:二元健康-患病分类的准确率超过90%,而细粒度病原体和物种识别的准确率低于65%。视觉模型与VLMs之间的直接比较表明,多模态架构具有关键优势:微调的VLMs优于传统的视觉模型,证实了整合语言表示显著提高了诊断精度。这些发现突显了当前VLMs在植物病理学应用中的关键差距,并强调了LeafBench作为严格框架的重要性,用于方法学进步和可靠AI辅助植物病害诊断的进展评估。代码可在https://github.com/EnalisUs/LeafBench/获取。
Summary / 总结
The research introduces LeafNet, a large-scale multimodal dataset for plant diseases, and LeafBench, a benchmark for evaluating Vision-Language Models (VLMs) in understanding plant pathology. The dataset includes 186,000 leaf images with 13,950 question-answer pairs covering six agricultural tasks. Benchmarking 12 state-of-the-art VLMs on LeafBench, the study reveals significant disparities in disease understanding, with binary classification outperforming fine-grained identification. The findings underscore the advantage of multimodal architectures in enhancing diagnostic precision and highlight the need for further research in plant pathology applications.
研究引入了LeafNet,这是一个大规模的多模态数据集,用于植物病害,以及LeafBench,一个用于评估Vision-Language模型在理解植物病理方面能力的基准。数据集包含186,000张叶子图像和13,950个问题-答案对,涵盖了97种病害类别。在LeafBench上对12个最先进的Vision-Language模型进行基准测试,研究揭示了在疾病理解方面的显著差异,二分类任务的准确率超过90%,但细粒度识别低于65%。研究强调了多模态架构在诊断精度方面优于仅视觉模型的优势。这项工作突显了进一步研究以提高Vision-Language模型在植物病理应用中的必要性。
cadrille: Multi-modal CAD Reconstruction with Reinforcement Learning
Authors: Maksim Kolodiazhnyi, Denis Tarasov, Dmitrii Zhemchuzhnikov, Alexander Nikulin, Ilya Zisman, Anna Vorontsova, Anton Konushin, Vladislav Kurenkov, Danila Rukhovich
Venue: ICLR 2026 Oral
First: 2025-05-28T22:32:31+00:00 · Latest: 2026-02-17T16:31:55+00:00
Comments: ICLR 2026 (Oral)
Abstract
Computer-Aided Design (CAD) plays a central role in engineering and manufacturing, making it possible to create precise and editable 3D models. Using a variety of sensor or user-provided data as inputs for CAD reconstruction can democratize access to design applications. However, existing methods typically focus on a single input modality, such as point clouds, images, or text, which limits their generalizability and robustness. Leveraging recent advances in vision-language models (VLM), we propose a multi-modal CAD reconstruction model that simultaneously processes all three input modalities. Inspired by large language model (LLM) training paradigms, we adopt a two-stage pipeline: supervised fine-tuning (SFT) on large-scale procedurally generated data, followed by reinforcement learning (RL) fine-tuning using online feedback, obtained programmatically. Furthermore, we are the first to explore RL fine-tuning of LLMs for CAD tasks, demonstrating that online RL algorithms such as Group Relative Policy Optimization (GRPO) outperform offline alternatives. On the DeepCAD benchmark, our SFT model outperforms existing single-modal approaches in all three input modalities simultaneously. More importantly, after RL fine-tuning, cadrille sets a new state of the art on three challenging datasets, including a real-world one. Code is available at https://github.com/col14m/cadrille.
中文标题/摘要
标题:cadrille:基于强化学习的多模态CAD重建
计算机辅助设计(CAD)在工程和制造中起着核心作用,使其能够创建精确可编辑的3D模型。使用各种传感器或用户提供的数据作为CAD重建的输入可以普及设计应用程序的访问。然而,现有方法通常仅专注于单一输入模态,如点云、图像或文本,这限制了它们的通用性和鲁棒性。利用视觉语言模型(VLM)的最新进展,我们提出了一种同时处理所有三种输入模态的多模态CAD重建模型。受大型语言模型(LLM)训练范式的启发,我们采用两阶段管道:在大规模程序生成数据上进行监督微调(SFT),然后使用在线反馈进行强化学习(RL)微调,该反馈是程序获取的。此外,我们首次探索了使用强化学习微调LLM进行CAD任务,表明在线RL算法如组相对偏好优化(GRPO)优于离线替代方案。在DeepCAD基准测试中,我们的SFT模型在所有三种输入模态上均优于现有单模态方法。更重要的是,在RL微调后,cadrille在三个具有挑战性的数据集中均达到了新的最佳性能,包括一个真实世界的数据集。代码可在https://github.com/col14m/cadrille 获取。
Summary / 总结
The research aims to enhance CAD reconstruction by integrating multiple input modalities such as point clouds, images, and text, which is crucial for democratizing access to design applications. The method involves a two-stage pipeline: supervised fine-tuning on large-scale procedurally generated data, followed by reinforcement learning fine-tuning using online feedback. Key experimental findings show that the proposed model, cadrille, outperforms existing single-modal approaches in all three input modalities and sets new state-of-the-art on three challenging datasets after RL fine-tuning, including a real-world dataset.
研究旨在通过结合点云、图像和文本等多种输入模态来提升CAD重建能力,并利用强化学习(RL)提高通用性和鲁棒性。方法包括两阶段管道:在大规模程序生成数据上进行监督微调(SFT),然后使用在线反馈进行RL微调。关键发现表明,提出的模型cadrille在DeepCAD基准测试中优于现有单模态方法,并在三个具有挑战性的数据集上(包括一个真实世界数据集)达到了新的最佳性能。
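The abstract above credits online RL with GRPO for the final gains. GRPO scores each sampled output relative to the other samples drawn for the same input, so no learned value function is needed. The sketch below shows only that group-relative normalization step. The reward values are made up, and using the population standard deviation is an assumption, since implementations differ on this detail.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of samples generated for the same input."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical programmatic rewards for four CAD programs sampled from one input,
# e.g. 1.0 if the program executes and matches the target shape well, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Samples that beat the group average get a positive advantage and the rest get a negative one. When all samples in a group receive the same reward, every advantage is zero and that group contributes no learning signal.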
PERSONA: Dynamic and Compositional Inference-Time Personality Control via Activation Vector Algebra
Authors: Xiachong Feng, Liang Zhao, Weihong Zhong, Yichong Huang, Yuxuan Gu, Lingpeng Kong, Xiaocheng Feng, Bing Qin
Venue: ICLR 2026
First: 2026-02-17T15:47:58+00:00 · Latest: 2026-02-17T15:47:58+00:00
Comments: ICLR 2026
Abstract
Current methods for personality control in Large Language Models rely on static prompting or expensive fine-tuning, failing to capture the dynamic and compositional nature of human traits. We introduce PERSONA, a training-free framework that achieves fine-tuning level performance through direct manipulation of personality vectors in activation space. Our key insight is that personality traits appear as extractable, approximately orthogonal directions in the model's representation space that support algebraic operations. The framework operates through three stages: Persona-Base extracts orthogonal trait vectors via contrastive activation analysis; Persona-Algebra enables precise control through vector arithmetic (scalar multiplication for intensity, addition for composition, subtraction for suppression); and Persona-Flow achieves context-aware adaptation by dynamically composing these vectors during inference. On PersonalityBench, our approach achieves a mean score of 9.60, nearly matching the supervised fine-tuning upper bound of 9.61 without any gradient updates. On our proposed Persona-Evolve benchmark for dynamic personality adaptation, we achieve up to 91% win rates across diverse model families. These results provide evidence that aspects of LLM personality are mathematically tractable, opening new directions for interpretable and efficient behavioral control.
中文标题/摘要
标题:PERSONA:通过激活向量代数实现动态和组合的推理时个性控制
当前在大型语言模型中实现个性控制的方法依赖于静态提示或昂贵的微调,无法捕捉人类特质的动态和组合性质。我们提出了PERSONA,一种无需训练的框架,通过直接在激活空间中操纵个性向量实现了微调级别的性能。我们的核心洞察是,个性特质在模型表示空间中表现为可提取的、近似正交的方向,支持代数运算。该框架通过三个阶段运作:Persona-Base 通过对比激活分析提取正交特质向量;Persona-Algebra 通过向量算术实现精确控制(标量乘法用于强度,加法用于组合,减法用于抑制);Persona-Flow 在推理时通过动态组合这些向量实现上下文感知的适应。在PersonalityBench上,我们的方法平均得分为9.60,几乎达到了监督微调上限9.61,而没有任何梯度更新。在我们提出的用于动态个性适应的Persona-Evolve基准测试中,我们在多种模型家族中实现了高达91%的胜率。这些结果表明,LLM个性的某些方面是数学上可处理的,为可解释和高效的行为控制开辟了新方向。
Summary / 总结
The research aims to address the limitations of static prompting and expensive fine-tuning in controlling personality in Large Language Models. PERSONA is a training-free framework that manipulates personality vectors in activation space to achieve fine-tuning level performance. It consists of three stages: Persona-Base extracts approximately orthogonal trait vectors via contrastive activation analysis, Persona-Algebra enables precise control through vector arithmetic, and Persona-Flow dynamically composes these vectors during inference. On PersonalityBench, the approach scores 9.60, nearly matching the supervised fine-tuning upper bound of 9.61. On the Persona-Evolve benchmark, it achieves up to 91% win rates across various model families, suggesting that aspects of LLM personality are mathematically tractable and can be controlled efficiently.
研究旨在解决在大型语言模型中控制个性时静态提示和昂贵微调的局限性。PERSONA 是一个无需训练的框架,通过在激活空间中操纵个性向量来实现与微调相当的性能。该框架分为三个阶段:Persona-Base 提取正交特质向量,Persona-Algebra 通过向量算术实现精确控制,Persona-Flow 在推理过程中动态组合这些向量。在 PersonalityBench 上,该方法得分 9.60,几乎与监督微调的上限 9.61 相匹配。在 Persona-Evolve 基准测试中,它在各种模型家族中实现了高达 91% 的胜率,证明了 LLM 个性方面的数学可处理性,并展示了高效的行为控制。
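The vector arithmetic described above (scalar multiplication for intensity, addition for composition, subtraction for suppression) can be sketched with plain lists. This is a toy illustration, not the authors' implementation: the trait names, the three-dimensional vectors, and the helpers `compose_steering` and `steer` are all hypothetical, and a real system would add the result to a model's hidden activations.

```python
from typing import Dict, List

Vector = List[float]


def scale(v: Vector, alpha: float) -> Vector:
    """Scalar multiplication: controls trait intensity."""
    return [alpha * x for x in v]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise addition: composes traits."""
    return [x + y for x, y in zip(a, b)]


def compose_steering(traits: Dict[str, Vector], weights: Dict[str, float]) -> Vector:
    """Weighted sum of trait vectors; a negative weight suppresses that trait."""
    dim = len(next(iter(traits.values())))
    out = [0.0] * dim
    for name, weight in weights.items():
        out = add(out, scale(traits[name], weight))
    return out


def steer(hidden: Vector, steering: Vector) -> Vector:
    """Shift a hidden state along the composed direction."""
    return add(hidden, steering)


# Hypothetical orthogonal trait directions in a 3-dimensional activation space.
traits = {"extraversion": [1.0, 0.0, 0.0], "agreeableness": [0.0, 1.0, 0.0]}
steering = compose_steering(traits, {"extraversion": 2.0, "agreeableness": -0.5})
steered = steer([0.1, 0.2, 0.3], steering)
```

Because the toy directions are orthogonal, amplifying one trait leaves the component along the other unchanged, which is the property that makes this kind of composition predictable.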
Concept-Enhanced Multimodal RAG: Towards Interpretable and Accurate Radiology Report Generation
Authors: Marco Salmè, Federico Siciliano, Fabrizio Silvestri, Paolo Soda, Rosa Sicilia, Valerio Guarrasi
First: 2026-02-17T15:18:07+00:00 · Latest: 2026-02-17T15:18:07+00:00
Abstract
Radiology Report Generation (RRG) through Vision-Language Models (VLMs) promises to reduce documentation burden, improve reporting consistency, and accelerate clinical workflows. However, their clinical adoption remains limited by the lack of interpretability and the tendency to hallucinate findings misaligned with imaging evidence. Existing research typically treats interpretability and accuracy as separate objectives, with concept-based explainability techniques focusing primarily on transparency and Retrieval-Augmented Generation (RAG) methods targeting factual grounding through external retrieval. We present Concept-Enhanced Multimodal RAG (CEMRAG), a unified framework that decomposes visual representations into interpretable clinical concepts and integrates them with multimodal RAG. This approach exploits enriched contextual prompts for RRG, improving both interpretability and factual accuracy. Experiments on MIMIC-CXR and IU X-Ray across multiple VLM architectures, training regimes, and retrieval configurations demonstrate consistent improvements over both conventional RAG and concept-only baselines on clinical accuracy metrics and standard NLP measures. These results challenge the assumed trade-off between interpretability and performance, showing that transparent visual concepts can enhance rather than compromise diagnostic accuracy in medical VLMs. Our modular design decomposes interpretability into visual transparency and structured language model conditioning, providing a principled pathway toward clinically trustworthy AI-assisted radiology.
中文标题/摘要
标题:概念增强的多模态RAG:迈向可解释和准确的放射学报告生成
通过视觉-语言模型(VLMs)进行放射学报告生成(RRG)有望减轻文档负担,提高报告一致性并加速临床工作流程。然而,其临床应用受限于缺乏可解释性和生成与影像证据不符的发现的倾向。现有研究通常将可解释性和准确性视为两个独立的目标,基于概念的可解释性技术主要关注透明度,而检索增强生成(RAG)方法则通过外部检索来实现事实上的定位。我们提出了概念增强的多模态RAG(CEMRAG),这是一种统一框架,将视觉表示分解为可解释的临床概念,并将其与多模态RAG结合。该方法利用丰富的上下文提示来提高RRG的可解释性和事实准确性。在MIMIC-CXR和IU X-Ray数据集上,针对多种VLM架构、训练策略和检索配置的实验表明,CEMRAG在临床准确性和标准自然语言处理(NLP)指标上均优于传统的RAG和仅概念基线。这些结果挑战了可解释性和性能之间的假设权衡,表明透明的视觉概念可以增强而不是损害医学VLM中的诊断准确性。我们的模块化设计将可解释性分解为视觉透明度和结构化语言模型条件,提供了一条通往临床可信赖的AI辅助放射学的原理性途径。
Summary / 总结
The research aims to improve the interpretability and accuracy of radiology report generation using Vision-Language Models (VLMs). It introduces Concept-Enhanced Multimodal RAG (CEMRAG), which decomposes visual representations into interpretable clinical concepts and integrates them with multimodal RAG. Experiments on MIMIC-CXR and IU X-Ray show consistent improvements in clinical accuracy and standard NLP measures over conventional RAG and concept-only baselines, challenging the trade-off between interpretability and performance.
该研究提出了CEMRAG框架,通过将视觉表示分解为可解释的临床概念并与多模态RAG集成,增强RRG的可解释性和事实准确性。实验结果显示,在不同VLM架构和训练方案下,CEMRAG在临床准确性和NLP指标上均优于传统RAG和概念基线,挑战了可解释性和性能之间的权衡。
CARE Drive: A Framework for Evaluating Reason-Responsiveness of Vision Language Models in Automated Driving
Authors: Lucas Elbert Suryana, Farah Bierenga, Sanne van Buuren, Pepijn Kooij, Elsefien Tulleners, Federico Scari, Simeon Calvert, Bart van Arem, Arkady Zgonnikov
First: 2026-02-17T15:13:36+00:00 · Latest: 2026-02-17T15:13:36+00:00
Comments: 21 pages, on submission to Transportation Research Part C
Abstract
Foundation models, including vision language models, are increasingly used in automated driving to interpret scenes, recommend actions, and generate natural language explanations. However, existing evaluation methods primarily assess outcome-based performance, such as safety and trajectory accuracy, without determining whether model decisions reflect human-relevant considerations. As a result, it remains unclear whether explanations produced by such models correspond to genuine reason-responsive decision making or merely post hoc rationalizations. This limitation is especially significant in safety-critical domains because it can create false confidence. To address this gap, we propose CARE Drive (Context-Aware Reasons Evaluation for Driving), a model-agnostic framework for evaluating reason responsiveness in vision language models applied to automated driving. CARE Drive compares baseline and reason-augmented model decisions under controlled contextual variation to assess whether human reasons causally influence decision behavior. The framework employs a two-stage evaluation process: prompt calibration first ensures stable outputs, and systematic contextual perturbation then measures decision sensitivity to human reasons such as safety margins, social pressure, and efficiency constraints. We demonstrate CARE Drive in a cyclist-overtaking scenario involving competing normative considerations. Results show that explicit human reasons significantly influence model decisions, improving alignment with expert-recommended behavior. However, responsiveness varies across contextual factors, indicating uneven sensitivity to different types of reasons. These findings provide empirical evidence that reason responsiveness in foundation models can be systematically evaluated without modifying model parameters.
中文标题/摘要
标题:CARE Drive 一种评估视觉语言模型在自动驾驶中合理性响应的框架
基础模型,包括视觉语言模型,越来越多地在自动驾驶中用于解释场景、推荐行动和生成自然语言解释。然而,现有的评估方法主要评估基于结果的表现,如安全性和轨迹准确性,而没有确定模型决策是否反映了人类相关考虑。因此,目前尚不清楚此类模型生成的解释是否对应于真正的合理性响应决策,还是仅仅是事后合理化。这一局限性在安全关键领域尤为重要,因为它可能导致虚假的信心。为了解决这一差距,我们提出了CARE Drive,一种适用于自动驾驶的基于上下文的合理性评估框架,用于评估视觉语言模型的合理性响应。CARE Drive 在受控的上下文变化下比较基线模型和理由增强模型的决策,以评估人类理由是否因果影响决策行为。该框架采用两阶段评估过程。提示校准确保稳定的输出。系统性的上下文扰动则测量决策对人类理由的敏感性,如安全裕度、社会压力和效率约束。我们通过涉及竞争规范性考虑的骑自行车者超车场景展示了CARE Drive。结果表明,明确的人类理由显著影响模型决策,提高了与专家推荐行为的一致性。然而,响应性在不同上下文因素之间存在差异,表明对不同类型的理由敏感性不均。这些发现提供了实证证据,证明基础模型中的合理性响应可以系统地评估,而无需修改模型参数。
Summary / 总结
CARE Drive is a framework designed to evaluate the reason-responsiveness of vision language models in automated driving. It compares baseline and reason-augmented model decisions under controlled contextual variations to assess whether human reasons causally influence model behavior. The framework demonstrates that explicit human reasons significantly influence model decisions, improving alignment with expert recommendations, but responsiveness varies across different contextual factors.
CARE Drive 是一个框架,用于评估视觉语言模型在自动驾驶中的理由响应性。该框架通过在受控的上下文变化下比较基础模型和理由增强模型的决策,来评估人类理由是否能因果地影响模型的行为。研究结果表明,明确的人类理由显著影响了模型的决策,使其更符合专家推荐的行为,但不同上下文因素对理由的响应性存在差异。
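The comparison described above, baseline versus reason-augmented decisions under controlled contextual variation, can be sketched as a per-context change in agreement with an expert-recommended decision. Everything here is illustrative: the context names, the decisions, and the helper names are hypothetical, and the numbers are not results from the paper.

```python
from typing import Dict, List


def alignment_rate(decisions: List[str], expert_decision: str) -> float:
    """Fraction of model decisions that match the expert-recommended decision."""
    return sum(d == expert_decision for d in decisions) / len(decisions)


def reason_effect(
    baseline: Dict[str, List[str]],
    augmented: Dict[str, List[str]],
    expert: Dict[str, str],
) -> Dict[str, float]:
    """Per-context change in expert alignment when human reasons are added to the prompt."""
    return {
        ctx: alignment_rate(augmented[ctx], expert[ctx]) - alignment_rate(baseline[ctx], expert[ctx])
        for ctx in baseline
    }


# Made-up repeated decisions for two contextual variations of a cyclist-overtaking scene.
expert = {"wide_lane": "overtake", "oncoming_traffic": "wait"}
baseline = {
    "wide_lane": ["overtake", "wait", "overtake", "overtake"],
    "oncoming_traffic": ["overtake", "overtake", "wait", "wait"],
}
augmented = {
    "wide_lane": ["overtake", "overtake", "overtake", "overtake"],
    "oncoming_traffic": ["wait", "wait", "wait", "wait"],
}
effects = reason_effect(baseline, augmented, expert)
```

Differences in the effect size across contexts correspond to the uneven sensitivity the abstract reports, although a real evaluation would also need the prompt calibration stage and a statistical test.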
Prompt Reinjection: Alleviating Prompt Forgetting in Multimodal Diffusion Transformers
Authors: Yuxuan Yao, Yuxuan Chen, Hui Li, Kaihui Cheng, Qipeng Guo, Yuwei Sun, Zilong Dong, Jingdong Wang, Siyu Zhu
First: 2026-02-06T17:19:53+00:00 · Latest: 2026-02-17T14:56:20+00:00
Comments: 18 pages
Abstract
Multimodal Diffusion Transformers (MMDiTs) for text-to-image generation maintain separate text and image branches, with bidirectional information flow between text tokens and visual latents throughout denoising. In this setting, we observe a prompt forgetting phenomenon: the semantics of the prompt representation in the text branch is progressively forgotten as depth increases. We further verify this effect on three representative MMDiTs (SD3, SD3.5, and FLUX.1) by probing linguistic attributes of the representations over the layers in the text branch. Motivated by these findings, we introduce a training-free approach, prompt reinjection, which reinjects prompt representations from early layers into later layers to alleviate this forgetting. Experiments on GenEval, DPG, and T2I-CompBench++ show consistent gains in instruction-following capability, along with improvements on metrics capturing preference, aesthetics, and overall text-image generation quality.
中文标题/摘要
标题:提示重注入:多模态扩散变换器中的提示遗忘缓解
多模态扩散变换器(MMDiTs)在文本到图像生成中保持独立的文本和图像分支,在去噪过程中文本令牌和视觉潜在变量之间存在双向信息流。在这种设置中,我们观察到一种提示遗忘现象:文本分支中的提示表示的语义会随着深度增加而逐渐被遗忘。我们进一步通过在文本分支的各层中探测表示的语言属性,验证了这一效果。受这些发现的启发,我们提出了一种无需训练的方法——提示重注入,该方法将早期层中的提示表示重新注入到后续层中,以缓解这种遗忘现象。在GenEval、DPG和T2I-CompBench++上的实验显示,该方法在指令遵循能力方面取得了持续的改进,并且在衡量偏好、美学和整体文本-图像生成质量的指标上也有所提升。
Summary / 总结
This paper addresses the prompt forgetting issue in Multimodal Diffusion Transformers (MMDiTs) for text-to-image generation, where the semantic information of the prompt representation in the text branch is gradually lost as the depth increases. The authors introduce a training-free method called prompt reinjection, which reintroduces prompt representations from earlier layers into later layers to mitigate this problem. Experimental results on GenEval, DPG, and T2I-CompBench++ demonstrate consistent improvements in instruction-following capability and enhancements in preference, aesthetics, and overall text-image generation quality.
本文探讨了Multimodal Diffusion Transformers (MMDiTs)在文本到图像生成中的提示遗忘问题,即文本分支中的提示表示的语义信息随着深度增加而逐渐丢失。作者提出了一种无需训练的方法——提示重注入,该方法将早期层的提示表示重新注入到后期层以缓解这一问题。实验结果表明,在GenEval、DPG和T2I-CompBench++上,该方法在指令遵循能力、偏好、美学和整体文本-图像生成质量方面均表现出一致的改进。
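The abstract above does not say how early-layer prompt representations are reinjected. As a minimal sketch, assuming a weighted additive blend of a cached early-layer output into the input of selected later layers, the control flow could look like this. The layer functions, the `weight` parameter, and the choice of source and target layers are all hypothetical.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def run_text_branch(
    prompt: Vector,
    layers: Sequence[Layer],
    source_layer: int = 0,
    target_layers: Sequence[int] = (),
    weight: float = 0.5,
) -> Vector:
    """Run the layers in order, adding the cached source-layer output before each target layer."""
    hidden = list(prompt)
    cached = None
    for idx, layer in enumerate(layers):
        if idx in target_layers and cached is not None:
            # Assumed reinjection rule: additive blend with a fixed weight.
            hidden = [h + weight * c for h, c in zip(hidden, cached)]
        hidden = layer(hidden)
        if idx == source_layer:
            cached = list(hidden)
    return hidden


def halve(v: Vector) -> Vector:
    """Toy layer that shrinks the signal, standing in for gradual prompt forgetting."""
    return [0.5 * x for x in v]


layers = [halve, halve, halve]
without_reinjection = run_text_branch([8.0, 4.0], layers)
with_reinjection = run_text_branch([8.0, 4.0], layers, source_layer=0, target_layers=(2,))
```

In this toy setup the reinjected run ends with a larger prompt signal than the plain run, and no parameters are updated, which matches the training-free framing.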
Revealing and Enhancing Core Visual Regions: Harnessing Internal Attention Dynamics for Hallucination Mitigation in LVLMs
Authors: Guangtao Lyu, Qi Liu, Chenghao Xu, Jiexi Yan, Muli Yang, Xueting Li, Fen Fang, Cheng Deng
First: 2026-02-17T13:08:06+00:00 · Latest: 2026-02-17T13:08:06+00:00
Abstract
LVLMs have achieved strong multimodal reasoning capabilities but remain prone to hallucinations, producing outputs inconsistent with visual inputs or user instructions. Existing training-free methods have notable drawbacks: contrastive decoding and auxiliary expert models incur several times more computational overhead and may introduce interference, while static internal signal enhancement is often vulnerable to the attention sink phenomenon. We find that internal Positive Attention Dynamics (PAD) in LVLMs naturally reveal semantically core visual regions even under the distortions of attention sinks. Based on this, we propose Positive Attention Dynamics Enhancement (PADE), a training-free attention intervention that constructs a PAD map to identify semantically core visual regions, applies per-head Median Absolute Deviation Scaling to adaptively control the intervention strength, and leverages System-Token Compensation to maintain attention to complex user instructions and support long-term output consistency. Experiments on multiple LVLMs and benchmarks show that PADE improves visual grounding and reduces hallucinations, validating the effectiveness of leveraging internal attention dynamics for reliable multimodal reasoning.
中文标题/摘要
标题:揭示并增强核心视觉区域:利用内部注意力动态减轻LVLM中的幻觉
LVLMs已经实现了强大的多模态推理能力,但仍容易产生幻觉,输出与视觉输入或用户指令不一致。现有的无需训练的方法,包括对比解码和辅助专家模型,可能会引入更多的计算开销和潜在干扰,且静态内部信号增强往往容易受到注意力陷阱现象的影响。我们发现,LVLM中的内部正向注意力动态(PAD)在注意力陷阱的扭曲下自然揭示了语义上核心的视觉区域。基于此,我们提出了正向注意力动态增强(PADE),这是一种无需训练的注意力干预方法,通过构建PAD图来识别语义上核心的视觉区域,应用每个头的中值绝对偏差缩放以自适应地控制干预强度,并利用系统标记补偿来保持对复杂用户指令的注意力并支持长期输出一致性。在多个LVLM和基准上的实验表明,PADE提高了视觉定位并减少了幻觉,验证了利用内部注意力动态进行可靠多模态推理的有效性。
Summary / 总结
The research aims to mitigate hallucinations in large vision-language models (LVLMs) by leveraging internal attention dynamics. The method, Positive Attention Dynamics Enhancement (PADE), constructs a PAD map to identify semantically core visual regions, uses per-head Median Absolute Deviation Scaling to adaptively control the intervention strength, and applies System-Token Compensation to maintain attention to user instructions. Experiments demonstrate that PADE improves visual grounding and reduces hallucinations, validating the effectiveness of using internal attention dynamics for reliable multimodal reasoning.
论文通过提出Positive Attention Dynamics Enhancement (PADE) 方法,利用大型视觉语言模型(LVLM)内部的Positive Attention Dynamics (PAD) 来识别和增强语义核心视觉区域,解决了LVLM中的幻觉问题。PADE 构建PAD图,通过每头Median Absolute Deviation Scaling进行自适应控制,并使用System-Token Compensation保持对用户指令的注意力,从而支持长期输出一致性。实验表明,PADE 提高了视觉定位并减少了幻觉,验证了其在增强多模态推理可靠性方面的有效性。
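The summary above mentions per-head Median Absolute Deviation (MAD) scaling without giving a formula. The sketch below shows the standard MAD statistic applied to one head's scores. How PADE turns the result into an intervention strength is not stated in the source, so the normalization here is an assumption and `mad_scale` is a hypothetical name.

```python
from statistics import median
from typing import List, Sequence


def mad_scale(scores: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Center one head's scores on the median and divide by the median absolute deviation."""
    med = median(scores)
    mad = median(abs(s - med) for s in scores)
    return [(s - med) / (mad + eps) for s in scores]


# Made-up scores for one attention head; the last value mimics an attention-sink outlier.
head_scores = [1.0, 2.0, 3.0, 4.0, 100.0]
scaled = mad_scale(head_scores)

# Applying the same function to every head gives each head its own robust scale.
per_head = [mad_scale(h) for h in ([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])]
```

Unlike the mean and standard deviation, the median and MAD barely move when a single extreme value is present, so the outlier does not compress the scaled values of the remaining positions.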
VLM-DEWM: Dynamic External World Model for Verifiable and Resilient Vision-Language Planning in Manufacturing
Authors: Guoqin Tang, Qingxuan Jia, Gang Chen, Tong Li, Zeyuan Huang, Zihang Lv, Ning Ji
First: 2026-02-17T12:54:18+00:00 · Latest: 2026-02-17T12:54:18+00:00
Abstract
Vision-language models (VLMs) show promise for high-level planning in smart manufacturing, yet their deployment in dynamic workcells faces two critical challenges: (1) stateless operation: they cannot persistently track out-of-view states, causing world-state drift; and (2) opaque reasoning: failures are difficult to diagnose, leading to costly blind retries. This paper presents VLM-DEWM, a cognitive architecture that decouples VLM reasoning from world-state management through a persistent, queryable Dynamic External World Model (DEWM). Each VLM decision is structured into an Externalizable Reasoning Trace (ERT), comprising action proposal, world belief, and causal assumption, which is validated against DEWM before execution. When failures occur, discrepancy analysis between predicted and observed states enables targeted recovery instead of global replanning. We evaluate VLM-DEWM on multi-station assembly, large-scale facility exploration, and real-robot recovery under induced failures. Compared to baseline memory-augmented VLM systems, VLM-DEWM improves state-tracking accuracy from 56% to 93%, increases recovery success rate from below 5% to 95%, and significantly reduces computational overhead through structured memory. These results establish VLM-DEWM as a verifiable and resilient solution for long-horizon robotic operations in dynamic manufacturing environments.
中文标题/摘要
标题:VLM-DEWM:制造动态工作单元中可验证和抗扰的视觉-语言规划的认知架构
视觉-语言模型(VLM)在智能制造中的高级规划显示出潜力,但在动态工作单元中的部署面临两个关键挑战:(1)无状态操作,它们无法持续跟踪视域外的状态,导致世界状态漂移;(2)不透明的推理,故障难以诊断,导致昂贵的盲目重试。本文提出了一种认知架构VLM-DEWM,通过持久的、可查询的动态外部世界模型(DEWM)将VLM推理与世界状态管理解耦。每个VLM决策被结构化为可外部化的推理痕迹(ERT),包括行动建议、世界信念和因果假设,这些痕迹在执行前需验证DEWM。当发生故障时,预测状态与观察状态之间的差异分析可实现有针对性的恢复,而不是全局重规划。我们通过多站装配、大规模设施探索以及在诱导故障下真实机器人恢复评估了VLM-DEWM。与基线增强记忆的VLM系统相比,VLM-DEWM将状态跟踪准确性从56%提高到93%,将恢复成功率从不到5%提高到95%,并通过结构化记忆显著减少了计算开销。这些结果确立了VLM-DEWM作为动态制造环境中长期机器人操作的可验证和抗扰解决方案的地位。
Summary / 总结
VLM-DEWM addresses the challenges of stateless operation and opaque reasoning in VLMs for smart manufacturing by introducing a Dynamic External World Model (DEWM). Each VLM decision is recorded as an Externalizable Reasoning Trace (ERT) and validated against DEWM before execution. This approach improves state-tracking accuracy from 56% to 93%, increases recovery success rate from below 5% to 95%, and reduces computational overhead. VLM-DEWM is evaluated on various tasks and demonstrates verifiable and resilient performance in dynamic manufacturing environments.
VLM-DEWM通过引入动态外部世界模型(DEWM)来解决VLM在智能制造中的状态无持续跟踪和推理不透明的问题。每个VLM决策被记录为可外部化的推理痕迹(ERT),并在执行前与DEWM进行验证。这种方法将状态跟踪准确性从56%提高到93%,恢复成功率从不到5%提高到95%,并显著减少了计算开销。VLM-DEWM在各种任务上的评估表明,它是一个可验证和可靠的解决方案,适用于动态制造环境中的长期机器人操作。
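The validation step described above, checking an Externalizable Reasoning Trace against the external world model before execution, can be sketched with two small data classes. The field names follow the three components named in the abstract, but the class names, the key-value state layout, and the workcell example are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ReasoningTrace:
    """One VLM decision: proposed action, believed world state, and causal assumption."""
    action: str
    world_belief: Dict[str, str]
    causal_assumption: str


@dataclass
class WorldModel:
    """Persistent, queryable store of world state kept outside the VLM."""
    state: Dict[str, str] = field(default_factory=dict)

    def update(self, key: str, value: str) -> None:
        self.state[key] = value

    def discrepancies(self, trace: ReasoningTrace) -> List[str]:
        """Keys where the trace's belief disagrees with the stored state."""
        return sorted(k for k, v in trace.world_belief.items() if self.state.get(k) != v)


# Hypothetical workcell: the part was moved while out of the camera's view.
dewm = WorldModel()
dewm.update("part_A", "station_2")
dewm.update("gripper", "empty")

trace = ReasoningTrace(
    action="pick part_A at station_1",
    world_belief={"part_A": "station_1", "gripper": "empty"},
    causal_assumption="part_A has not moved since it was last seen",
)
mismatches = dewm.discrepancies(trace)
should_execute = not mismatches
```

Returning the specific mismatched keys, rather than a bare pass or fail, is what would let a planner repair only the stale belief instead of replanning from scratch.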
Dynamic Training-Free Fusion of Subject and Style LoRAs
Authors: Qinglong Cao, Yuntian Chen, Chao Ma, Xiaokang Yang
First: 2026-02-17T12:42:30+00:00 · Latest: 2026-02-17T12:42:30+00:00
Abstract
Recent studies have explored the combination of multiple LoRAs to simultaneously generate user-specified subjects and styles. However, most existing approaches fuse LoRA weights using static statistical heuristics that deviate from LoRA's original purpose of learning adaptive feature adjustments and ignore the randomness of sampled inputs. To address this, we propose a dynamic training-free fusion framework that operates throughout the generation process. During the forward pass, at each LoRA-applied layer, we dynamically compute the KL divergence between the base model's original features and those produced by subject and style LoRAs, respectively, and adaptively select the most appropriate weights for fusion. In the reverse denoising stage, we further refine the generation trajectory by dynamically applying gradient-based corrections derived from objective metrics such as CLIP and DINO scores, providing continuous semantic and stylistic guidance. By integrating these two complementary mechanisms (feature-level selection and metric-guided latent adjustment) across the entire diffusion timeline, our method dynamically achieves coherent subject-style synthesis without any retraining. Extensive experiments across diverse subject-style combinations demonstrate that our approach consistently outperforms state-of-the-art LoRA fusion methods both qualitatively and quantitatively.
中文标题/摘要
标题:主体和风格LoRA的动态无训练融合
近期研究探索了将多个LoRA结合以同时生成用户指定的主体和风格。然而,大多数现有方法使用静态统计启发式融合LoRA权重,这偏离了LoRA最初学习自适应特征调整的目的,并忽略了采样输入的随机性。为解决这一问题,我们提出了一种动态无训练融合框架,该框架在整个生成过程中运行。在前向传递过程中,在每个应用LoRA的层中,我们动态计算基模型原始特征与主体和风格LoRA生成的特征之间的KL散度,并自适应选择最合适的权重进行融合。在反向去噪阶段,我们进一步通过从CLIP和DINO分数等客观指标导出的梯度基纠正动态应用生成轨迹,提供连续的语义和风格指导。通过在整个扩散时间线中整合这两种互补机制——特征级选择和指标引导的潜在调整,我们的方法在无需任何重新训练的情况下动态实现了主体-风格的连贯合成。在多种主体-风格组合的广泛实验中,我们的方法在定性和定量上均优于最先进的LoRA融合方法。
Summary / 总结
The paper addresses the limitation of static fusion methods for combining subject and style LoRAs, which ignore the randomness of inputs and deviate from LoRA's purpose. It proposes a dynamic training-free fusion framework that computes the KL divergence at each layer to select the most appropriate weights and refines the generation trajectory using gradient-based corrections from objective metrics. Experiments show that this method outperforms existing LoRA fusion techniques in both qualitative and quantitative evaluations across various subject-style combinations.
论文提出了一种动态无训练融合框架,以解决在文本到图像生成中融合主题和风格LoRA的问题。该框架在每一层计算KL散度以选择最合适的权重,并在反向去噪阶段应用基于梯度的修正来细化生成。实验表明,所提出的方法在各种主题和风格组合中,在定性和定量评价方面均优于现有最先进的LoRA融合方法。
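As a rough illustration of the per-layer selection step described in the abstract above, the sketch below compares the KL divergence of subject-LoRA and style-LoRA features against the base model's features and picks one of them. The abstract does not state how features are turned into distributions or which direction the selection goes, so the softmax normalization, the "larger divergence wins" rule, and the function names (`softmax`, `kl_divergence`, `select_lora`) are all assumptions made for illustration, not the paper's method.

```python
import math


def softmax(xs):
    """Turn a raw feature vector into a probability distribution."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def select_lora(base_feat, subject_feat, style_feat):
    """Pick which LoRA to apply at one layer for the current input.

    Assumed rule (not from the paper): choose the LoRA whose features
    diverge more from the base model's features at this layer.
    """
    p_base = softmax(base_feat)
    kl_subject = kl_divergence(softmax(subject_feat), p_base)
    kl_style = kl_divergence(softmax(style_feat), p_base)
    choice = "subject" if kl_subject >= kl_style else "style"
    return choice, kl_subject, kl_style


# Hypothetical features for a single layer.
choice, kl_subject, kl_style = select_lora(
    base_feat=[0.0, 0.0, 0.0],
    subject_feat=[0.0, 0.0, 0.1],
    style_feat=[2.0, 0.0, -2.0],
)
print(choice, kl_subject, kl_style)
```

Because the decision is recomputed from the features of each sampled input, the selection can differ between layers and between generations, which is the contrast with static weight-merging heuristics that the abstract draws.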
ExpertWeaver: Unlocking the Inherent MoE in Dense LLMs with GLU Activation Patterns
Authors: Ziyu Zhao, Tong Zhu, Zhi Zhang, Tiantian Fan, Jinluan Yang, Kun Kuang, Zhongyu Wei, Fei Wu, Yu Cheng
First: 2026-02-17T11:50:58+00:00 · Latest: 2026-02-17T11:50:58+00:00
Abstract
Mixture-of-Experts (MoE) effectively scales model capacity while preserving computational efficiency through sparse expert activation. However, training high-quality MoEs from scratch is prohibitively expensive. A promising alternative is to convert pretrained dense models into sparse MoEs. Existing dense-to-MoE methods fall into two categories: dynamic structural pruning that converts dense models into MoE architectures with moderate sparsity to balance performance and inference efficiency, and downcycling approaches that use pretrained dense models to initialize highly sparse MoE architectures. However, existing methods break the intrinsic activation patterns within dense models, leading to suboptimal expert construction. In this work, we argue that the Gated Linear Unit (GLU) mechanism provides a natural blueprint for dense-to-MoE conversion. We show that the fine-grained neural-wise activation patterns of GLU reveal a coarse-grained structure, uncovering an inherent MoE architecture composed of consistently activated universal neurons and dynamically activated specialized neurons. Leveraging this discovery, we introduce ExpertWeaver, a training-free framework that partitions neurons according to their activation patterns and constructs shared experts and specialized routed experts with layer-adaptive configurations. Our experiments demonstrate that ExpertWeaver significantly outperforms existing methods, both as a training-free dynamic structural pruning technique and as a downcycling strategy for superior MoE initialization.
中文标题/摘要
标题:ExpertWeaver:通过GLU激活模式解锁稠密LLM中的固有MoE
混合专家(MoE)通过稀疏专家激活有效扩展模型容量,同时保持计算效率。然而,从头开始训练高质量的MoE是极其昂贵的。一种有前途的替代方案是将预训练的稠密模型转换为稀疏的MoE。现有的稠密到MoE方法可分为两类:动态结构剪枝,将稠密模型转换为具有适度稀疏性的MoE架构,以平衡性能和推理效率;以及降级方法,使用预训练的稠密模型初始化高度稀疏的MoE架构。然而,现有方法破坏了稠密模型内的固有激活模式,导致专家构建效果不佳。在本文中,我们主张门控线性单元(GLU)机制为稠密到MoE转换提供自然蓝图。我们展示了GLU的细粒度神经级激活模式揭示了粗粒度结构,揭示了一个固有的MoE架构,由始终激活的通用神经元和动态激活的专业神经元组成。利用这一发现,我们引入了ExpertWeaver,这是一种无需训练的框架,根据激活模式划分神经元,并构建具有层自适应配置的共享专家和专业路由专家。我们的实验表明,ExpertWeaver在作为无需训练的动态结构剪枝技术和作为高性能MoE初始化的降级策略方面均显著优于现有方法。
Summary / 总结
The research aims to convert dense pretrained models into efficient sparse MoE architectures without breaking the intrinsic activation patterns. The method leverages the GLU mechanism to reveal a natural inherent MoE structure within dense models. ExpertWeaver partitions neurons based on their activation patterns and constructs shared and specialized experts, leading to superior performance compared to existing methods both in dynamic structural pruning and MoE initialization.
研究旨在利用Gated Linear Units (GLU) 的内在激活模式,将密集的大语言模型转换为高效的稀疏MoE架构。方法是根据神经元的激活模式进行分区,并构建具有层自适应配置的共享专家和专门路由专家。实验结果表明,ExpertWeaver 在作为无训练动态结构剪枝技术以及作为初始化高度稀疏MoE架构的下采样策略方面均优于现有方法。
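The partitioning idea in the abstract above (universal neurons go to a shared expert, specialized neurons go to routed experts) can be sketched as follows. This is a minimal sketch under stated assumptions: the activation test (`abs(gate) > act_threshold`), the `shared_rate` cut-off, and the round-robin grouping of specialized neurons are hypothetical choices for illustration; the abstract does not give ExpertWeaver's actual criteria or its layer-adaptive configuration.

```python
def partition_neurons(gate_acts, act_threshold=0.1, shared_rate=0.9, num_routed=2):
    """Split the FFN neurons of one layer into shared and routed groups.

    gate_acts: list of per-token lists with GLU gate activations,
               shape [num_tokens][num_neurons].
    Returns (shared_ids, routed_groups, activation_rates).
    """
    num_tokens = len(gate_acts)
    num_neurons = len(gate_acts[0])

    # Fraction of tokens on which each neuron counts as "active".
    rates = []
    for n in range(num_neurons):
        active = sum(1 for t in range(num_tokens) if abs(gate_acts[t][n]) > act_threshold)
        rates.append(active / num_tokens)

    # Consistently active neurons form the shared expert (assumed cut-off).
    shared = [n for n, r in enumerate(rates) if r >= shared_rate]

    # Remaining neurons are dealt round-robin into routed experts (assumed grouping).
    specialized = sorted(
        (n for n, r in enumerate(rates) if r < shared_rate),
        key=lambda n: rates[n],
        reverse=True,
    )
    routed = [specialized[i::num_routed] for i in range(num_routed)]
    return shared, routed, rates


# Hypothetical gate activations: 4 tokens, 4 neurons.
acts = [
    [1.0, 0.5, 0.0, 0.0],
    [0.9, 0.0, 0.7, 0.0],
    [1.2, 0.6, 0.0, 0.0],
    [0.8, 0.0, 0.0, 0.3],
]
shared, routed, rates = partition_neurons(acts)
print(shared, routed, rates)
```

No weights are updated here; the sketch only regroups existing neurons, which is what makes a procedure of this kind training-free.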
Semantic-Guided 3D Gaussian Splatting for Transient Object Removal
Authors: Aditi Prabakaran, Priyesh Shukla
First: 2026-02-17T11:44:16+00:00 · Latest: 2026-02-17T11:44:16+00:00
Abstract
Transient objects in casual multi-view captures cause ghosting artifacts in 3D Gaussian Splatting (3DGS) reconstruction. Existing solutions relied on scene decomposition at significant memory cost or on motion-based heuristics that were vulnerable to parallax ambiguity. A semantic filtering framework was proposed for category-aware transient removal using vision-language models. CLIP similarity scores between rendered views and distractor text prompts were accumulated per-Gaussian across training iterations. Gaussians exceeding a calibrated threshold underwent opacity regularization and periodic pruning. Unlike motion-based approaches, semantic classification resolved parallax ambiguity by identifying object categories independently of motion patterns. Experiments on the RobustNeRF benchmark demonstrated consistent improvement in reconstruction quality over vanilla 3DGS across four sequences, while maintaining minimal memory overhead and real-time rendering performance. Threshold calibration and comparisons with baselines validated semantic guidance as a practical strategy for transient removal in scenarios with predictable distractor categories.
中文标题/摘要
标题:语义引导的3D高斯点云化以移除瞬态对象
在随意的多视角捕获中,瞬态对象会导致3D高斯点云化(3DGS)重建中的幽灵伪影。现有解决方案依赖于在显著内存开销下的场景分解,或者依赖于易受视差歧义影响的基于运动的启发式方法。提出了一种语义过滤框架,用于使用视觉-语言模型进行类别感知的瞬态移除。在训练迭代中,根据渲染视图与干扰文本提示之间的CLIP相似度得分,逐高斯积累得分。超过校准阈值的高斯点云进行了透明度正则化和周期性修剪。与基于运动的方法不同,语义分类通过独立于运动模式识别对象类别来解决视差歧义。在RobustNeRF基准上的实验表明,与vanilla 3DGS相比,在四个序列上的一致重建质量改进,同时保持了最小的内存开销和实时渲染性能。阈值校准和与基线的比较验证了语义引导作为在具有可预测干扰类别场景中瞬态移除的实用策略的有效性。
Summary / 总结
The paper addresses the issue of ghosting artifacts caused by transient objects in 3D Gaussian Splatting reconstructions. It proposes a semantic filtering framework using vision-language models to identify and remove transient objects. CLIP similarity scores between rendered views and distractor text prompts were accumulated per Gaussian across training iterations, and Gaussians exceeding a calibrated threshold underwent opacity regularization and periodic pruning. Experiments showed consistent improvement in reconstruction quality over vanilla 3DGS without significant memory overhead or loss of real-time rendering performance.
论文解决了3D高斯点云重建中由瞬态物体引起的鬼影伪影问题。提出了一种基于语义的过滤框架,利用视觉-语言模型识别并移除瞬态物体。该方法通过累积CLIP相似度分数,并对高斯点应用透明度正则化和周期性修剪。实验在RobustNeRF基准上显示,该方法在保持最小内存开销和实时渲染性能的同时,能够一致地提高重建质量。语义分类比基于运动的方法更好地解决了视差歧义问题。
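The accumulate, threshold, regularize, and prune loop described above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the class name `TransientFilter`, the running-mean aggregation, the multiplicative opacity decay, and every numeric default are hypothetical; the abstract does not specify how a per-view CLIP score is attributed to individual Gaussians, so the sketch simply credits the score to every Gaussian visible in that view.

```python
class TransientFilter:
    """Per-Gaussian bookkeeping of distractor similarity scores."""

    def __init__(self, num_gaussians, threshold=0.3, decay=0.5, min_opacity=0.1):
        self.score_sum = [0.0] * num_gaussians
        self.count = [0] * num_gaussians
        self.threshold = threshold      # assumed calibrated threshold
        self.decay = decay              # assumed opacity regularization factor
        self.min_opacity = min_opacity  # assumed pruning cut-off

    def accumulate(self, visible_ids, similarity):
        """Credit one view's distractor similarity to the Gaussians it shows."""
        for g in visible_ids:
            self.score_sum[g] += similarity
            self.count[g] += 1

    def flagged(self):
        """Gaussians whose mean similarity exceeds the threshold."""
        return [
            g for g in range(len(self.count))
            if self.count[g] > 0 and self.score_sum[g] / self.count[g] > self.threshold
        ]

    def regularize(self, opacities):
        """Shrink the opacity of flagged Gaussians; leave the others untouched."""
        flagged = set(self.flagged())
        return [o * self.decay if g in flagged else o for g, o in enumerate(opacities)]

    def surviving_ids(self, opacities):
        """Periodic pruning: keep Gaussians whose opacity is still above the cut-off."""
        return [g for g, o in enumerate(opacities) if o >= self.min_opacity]


f = TransientFilter(num_gaussians=3)
f.accumulate([0, 1], 0.40)   # view 1 looks like a distractor
f.accumulate([1, 2], 0.10)   # view 2 does not
f.accumulate([0], 0.35)      # view 3 looks like a distractor
opacities = [0.8, 0.8, 0.8]
for _ in range(4):
    opacities = f.regularize(opacities)
print(f.flagged(), f.surviving_ids(opacities))
```

The only extra state is two numbers per Gaussian, which is consistent with the abstract's claim of minimal memory overhead compared with scene-decomposition approaches.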
LQA: A Lightweight Quantized-Adaptive Framework for Vision-Language Models on the Edge
Authors: Xin Wang, Hong Jia, Hualin Zhou, Sheng Guang Wang, Yu Zhang, Ting Dang, Tao Gu
First: 2026-02-08T07:37:37+00:00 · Latest: 2026-02-17T11:38:26+00:00
Comments: 15 pages, 9 figures, 9 tables, preprint
Abstract
Deploying Vision-Language Models (VLMs) on edge devices is challenged by resource constraints and performance degradation under distribution shifts. While test-time adaptation (TTA) can counteract such shifts, existing methods are too resource-intensive for on-device deployment. To address this challenge, we propose LQA, a lightweight, quantized-adaptive framework for VLMs that combines a modality-aware quantization strategy with gradient-free test-time adaptation. We introduce Selective Hybrid Quantization (SHQ) and a quantized, gradient-free adaptation mechanism to enable robust and efficient VLM deployment on resource-constrained hardware. Experiments across both synthetic and real-world distribution shifts show that LQA improves overall adaptation performance by 4.5%, uses less memory than full-precision models, and significantly outperforms gradient-based TTA methods, achieving up to 19.9× lower memory usage across seven open-source datasets. These results demonstrate that LQA offers a practical pathway for robust, privacy-preserving, and efficient VLM deployment on edge devices.
中文标题/摘要
标题:LQA:边缘设备上视觉-语言模型的轻量级量化自适应框架
在边缘设备上部署视觉-语言模型(VLMs)受到资源限制和分布转移导致性能下降的挑战。虽然测试时自适应(TTA)可以对抗这些转移,但现有方法在设备上部署时过于资源密集。为了解决这一挑战,我们提出了一种轻量级的量化自适应框架LQA,该框架结合了模态感知量化策略和无梯度测试时自适应机制。我们引入了选择性混合量化(SHQ)和量化、无梯度自适应机制,以在资源受限的硬件上实现鲁棒且高效的VLM部署。在合成和真实世界分布转移的实验中,LQA 的整体自适应性能提高了4.5%,使用比全精度模型更少的内存,并且显著优于基于梯度的TTA方法,实现了七个开源数据集上高达19.9倍的内存使用率降低。这些结果表明,LQA 提供了一条在边缘设备上实现鲁棒、隐私保护和高效VLM部署的实用途径。
Summary / 总结
LQA is a lightweight quantized-adaptive framework for Vision-Language Models (VLMs) on edge devices, combining a modality-aware quantization strategy with gradient-free test-time adaptation. It introduces Selective Hybrid Quantization (SHQ) and a quantized, gradient-free adaptation mechanism to enhance robust and efficient VLM deployment. Experiments show that LQA improves adaptation performance by 4.5%, uses less memory than full-precision models, and outperforms gradient-based TTA methods with up to 19.9 times lower memory usage across seven open-source datasets.
LQA 是一种轻量级的量化自适应框架,用于边缘设备上的视觉-语言模型(VLMs),解决了资源限制和分布变化下的性能下降问题。它结合了模态感知的量化策略和无梯度的测试时自适应机制,实现了整体自适应性能4.5%的提升,并且在七个开源数据集上与基于梯度的自适应方法相比,内存使用量最多降低了19.9倍。
Annotation-Efficient Vision-Language Model Adaptation to the Polish Language Using the LLaVA Framework
Authors: Grzegorz Statkiewicz, Alicja Dobrzeniecka, Karolina Seweryn, Aleksandra Krasnodębska, Karolina Piosek, Katarzyna Bogusz, Sebastian Cygert, Wojciech Kusa
First: 2026-02-15T09:54:40+00:00 · Latest: 2026-02-17T10:14:04+00:00
Abstract
Most vision-language models (VLMs) are trained on English-centric data, limiting their performance in other languages and cultural contexts. This restricts their usability for non-English-speaking users and hinders the development of multimodal systems that reflect diverse linguistic and cultural realities. In this work, we reproduce and adapt the LLaVA-Next methodology to create a set of Polish VLMs. We rely on a fully automated pipeline for translating and filtering existing multimodal datasets, and complement this with synthetic Polish data for OCR and culturally specific tasks. Despite relying almost entirely on automatic translation and minimal manual intervention to the training data, our approach yields strong results: we observe a +9.5% improvement over LLaVA-1.6-Vicuna-13B on a Polish-adapted MMBench, along with higher-quality captions in generative evaluations, as measured by human annotators in terms of linguistic correctness. These findings highlight that large-scale automated translation, combined with lightweight filtering, can effectively bootstrap high-quality multimodal models for low-resource languages. Some challenges remain, particularly in cultural coverage and evaluation. To facilitate further research, we make our models and evaluation dataset publicly available.
中文标题/摘要
标题:使用LLaVA框架高效注释的视觉-语言模型在波兰语中的适应
大多数视觉-语言模型(VLMs)都是在以英语为中心的数据上训练的,这限制了它们在其他语言和文化背景中的性能。这限制了非英语使用者的使用,并阻碍了反映多元语言和文化现实的多模态系统的开发。在本文中,我们重现并适应了LLaVA-Next方法,以创建一系列波兰VLMs。我们依赖于一个完全自动化的管道来翻译和过滤现有的多模态数据集,并通过合成的波兰数据补充了OCR和文化特定任务。尽管几乎完全依赖自动翻译和少量的手动干预对训练数据的干预,我们的方法取得了良好的结果:我们在波兰适应的MMBench上观察到对LLaVA-1.6-Vicuna-13B的改进达到了+9.5%,并且在生成性评估中,由人类注释者衡量的字幕质量更高。这些发现表明,大规模的自动翻译结合轻量级的过滤可以有效地为低资源语言启动高质量的多模态模型。仍存在一些挑战,特别是在文化覆盖面和评估方面。为了促进进一步的研究,我们公开了我们的模型和评估数据集。
Summary / 总结
This work addresses the limitation of vision-language models (VLMs) trained primarily on English data by adapting the LLaVA-Next methodology to create Polish VLMs. The approach uses an automated pipeline for translating and filtering existing datasets, supplemented by synthetic Polish data. Despite minimal manual intervention, the models show a significant improvement of +9.5% on a Polish-adapted MMBench and higher-quality captions in human evaluations. This demonstrates the effectiveness of large-scale automated translation and lightweight filtering in developing high-quality VLMs for low-resource languages, though challenges in cultural coverage and evaluation persist.
本研究旨在通过将LLaVA-Next方法论适应波兰语,解决视觉-语言模型(VLMs)主要基于英语数据训练的局限性。该方法使用自动化的数据翻译和过滤管道,并补充了合成的波兰数据。尽管手动干预较少,但模型在波兰适应的MMBench上取得了显著的+9.5%改进,并且生成的字幕也得到了人类注释者的更高评价。这表明大规模的自动翻译和轻量级过滤可以有效地支持低资源语言的高质量VLMs的发展,尽管在文化覆盖面和评估方面仍存在挑战。
On the Out-of-Distribution Generalization of Reasoning in Multimodal LLMs for Simple Visual Planning Tasks
Authors: Yannic Neuhaus, Nicolas Flammarion, Matthias Hein, Francesco Croce
First: 2026-02-17T09:51:40+00:00 · Latest: 2026-02-17T09:51:40+00:00
Abstract
Integrating reasoning in large language models and large vision-language models has recently led to significant improvement of their capabilities. However, the generalization of reasoning models is still vaguely defined and poorly understood. In this work, we present an evaluation framework to rigorously examine how well chain-of-thought (CoT) approaches generalize on a simple planning task. Specifically, we consider a grid-based navigation task in which a model is provided with a map and must output a sequence of moves that guides a player from a start position to a goal while avoiding obstacles. The versatility of the task and its data allows us to fine-tune model variants using different input representations (visual and textual) and CoT reasoning strategies, and systematically evaluate them under both in-distribution (ID) and out-of-distribution (OOD) test conditions. Our experiments show that, while CoT reasoning improves in-distribution generalization across all representations, out-of-distribution generalization (e.g., to larger maps) remains very limited in most cases when controlling for trivial matches with the ID data. Surprisingly, we find that reasoning traces which combine multiple text formats yield the best (and non-trivial) OOD generalization. Finally, purely text-based models consistently outperform those utilizing image-based inputs, including a recently proposed approach relying on latent space reasoning.
中文标题/摘要
标题:多模态大语言模型在简单视觉规划任务中推理的跨分布泛化研究
将推理整合到大型语言模型和大型视觉-语言模型中,最近显著提高了它们的能力。然而,推理模型的泛化仍然定义模糊且理解不足。在本工作中,我们提出了一种评估框架,以严格检查链式思考(CoT)方法在简单规划任务中的泛化能力。具体而言,我们考虑了一个基于网格的导航任务,在该任务中,模型被提供一张地图,并必须输出一系列动作,引导玩家从起点到达目标并避开障碍。任务的多样性和数据允许我们使用不同的输入表示(视觉和文本)和CoT推理策略对模型变体进行微调,并在分布内(ID)和分布外(OOD)测试条件下系统地评估它们。我们的实验表明,虽然CoT推理在所有表示下都提高了分布内泛化,但在大多数情况下,控制与ID数据的简单匹配时,分布外泛化(例如,到更大的地图)仍然非常有限。令人惊讶的是,我们发现结合多种文本格式的推理轨迹在分布外泛化中表现最佳(且非平凡)。最后,纯文本模型始终优于使用图像输入的模型,包括一个最近提出的依赖于潜在空间推理的方法。
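To make the task in the abstract above concrete, the sketch below checks whether a model's output sequence of moves is a correct plan on a grid map. The map encoding (`S` start, `G` goal, `#` obstacle, `.` free cell), the move alphabet `U/D/L/R`, and the function names are assumptions chosen for illustration; the paper's actual input representations are not given in the abstract.

```python
# Row/column offsets for each move symbol (assumed alphabet).
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def find_cell(grid, symbol):
    """Return (row, col) of the first cell containing symbol."""
    for r, row in enumerate(grid):
        c = row.find(symbol)
        if c != -1:
            return (r, c)
    raise ValueError(f"symbol {symbol!r} not found in grid")


def is_valid_plan(grid, moves):
    """True if moves lead from S to G without leaving the map or hitting '#'."""
    r, c = find_cell(grid, "S")
    goal = find_cell(grid, "G")
    for m in moves:
        if m not in MOVES:
            return False
        dr, dc = MOVES[m]
        r, c = r + dr, c + dc
        if not (0 <= r < len(grid) and 0 <= c < len(grid[r])):
            return False  # stepped off the map
        if grid[r][c] == "#":
            return False  # walked into an obstacle
    return (r, c) == goal


grid = [
    "S.#",
    "..#",
    "..G",
]
print(is_valid_plan(grid, "DDRR"))  # reaches the goal
print(is_valid_plan(grid, "RRDD"))  # hits an obstacle
```

A checker of this kind scores plans exactly, independent of the reasoning trace that produced them, and the same function applies unchanged to larger maps used for OOD evaluation.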
ActionCodec: What Makes for Good Action Tokenizers
Authors: Zibin Dong, Yicheng Liu, Shiduo Zhang, Baijun Ye, Yifu Yuan, Fei Ni, Jingjing Gong, Xipeng Qiu, Hang Zhao, Yinchuan Li, Jianye Hao
First: 2026-02-17T07:07:15+00:00 · Latest: 2026-02-17T07:07:15+00:00
Abstract
Vision-Language-Action (VLA) models leveraging the native autoregressive paradigm of Vision-Language Models (VLMs) have demonstrated superior instruction-following and training efficiency. Central to this paradigm is action tokenization, yet its design has primarily focused on reconstruction fidelity, failing to address its direct impact on VLA optimization. Consequently, the fundamental question of what makes for good action tokenizers remains unanswered. In this paper, we bridge this gap by establishing design principles specifically from the perspective of VLA optimization. We identify a set of best practices based on information-theoretic insights, including maximized temporal token overlap, minimized vocabulary redundancy, enhanced multimodal mutual information, and token independence. Guided by these principles, we introduce ActionCodec, a high-performance action tokenizer that significantly enhances both training efficiency and VLA performance across diverse simulation and real-world benchmarks. Notably, on LIBERO, a SmolVLM2-2.2B fine-tuned with ActionCodec achieves a 95.5% success rate without any robotics pre-training. With advanced architectural enhancements, this reaches 97.4%, representing a new SOTA for VLA models without robotics pre-training. We believe our established design principles, alongside the released model, will provide a clear roadmap for the community to develop more effective action tokenizers.
中文标题/摘要
标题:ActionCodec:什么是好的动作分词器
利用视觉-语言模型(VLMs)的原生自回归范式的视觉-语言-动作(VLA)模型在指令遵循和训练效率方面表现出色。这一范式的核心是动作分词,但其设计主要集中在重建保真度上,未能解决其对VLA优化的直接影响。因此,什么是好的动作分词器这一基本问题仍然没有答案。在本文中,我们通过从VLA优化的角度建立设计原则来填补这一空白。基于信息论的见解,我们确定了一套最佳实践,包括最大化时间分词重叠、最小化词汇冗余、增强多模态互信息以及分词独立性。遵循这些原则,我们引入了**ActionCodec**,这是一种高性能的动作分词器,显著提高了各种模拟和现实世界基准中的训练效率和VLA性能。值得注意的是,在LIBERO上,使用ActionCodec微调的SmolVLM2-2.2B在没有任何机器人预训练的情况下达到了95.5%的成功率。通过先进的架构增强,这一数字达到了97.4%,这是在没有机器人预训练的情况下VLA模型的新SOTA。我们认为,我们建立的设计原则以及发布的模型将为社区提供一条明确的道路,以开发更有效的动作分词器。
Summary / 总结
This paper addresses the lack of design principles for action tokenizers in Vision-Language-Action models, focusing on VLA optimization rather than reconstruction fidelity. It introduces ActionCodec, an action tokenizer that maximizes temporal token overlap, minimizes vocabulary redundancy, enhances multimodal mutual information, and ensures token independence. ActionCodec improves both training efficiency and VLA performance, achieving a 95.5% success rate on LIBERO without robotics pre-training and reaching 97.4% with architectural enhancements, setting a new SOTA for VLA models without robotics pre-training.
本文探讨了在视觉-语言-动作(VLA)模型中什么是好的动作分词器,这些模型对于指令遵循和训练效率至关重要。通过应用信息论的见解,作者提出了最大化时间分词重叠和最小化词汇冗余等设计原则。他们引入了ActionCodec,这是一种遵循这些原则的分词器,显著提高了训练效率和VLA性能。在LIBERO基准测试中,使用ActionCodec微调的SmolVLM2-2.2B模型达到了95.5%的成功率,通过架构增强后,这一数字提高到97.4%,这是无机器人预训练的VLA模型的新最佳表现。
The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems
Authors: Xiaoze Liu, Ruowang Zhang, Weichen Yu, Siheng Xiong, Liu He, Feijie Wu, Hoin Jung, Matt Fredrikson, Xiaoqian Wang, Jing Gao
First: 2026-02-17T06:31:53+00:00 · Latest: 2026-02-17T06:31:53+00:00
Comments: Preprint. Work in progress
Abstract
Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain shackled by the inefficiency of discrete text communication, which imposes significant runtime overhead and information quantization loss. While latent state transfer offers a high-bandwidth alternative, existing approaches either assume homogeneous sender-receiver architectures or rely on pair-specific learned translators, limiting scalability and modularity across diverse model families with disjoint manifolds. In this work, we propose the Vision Wormhole, a novel framework that repurposes the visual interface of Vision-Language Models (VLMs) to enable model-agnostic, text-free communication. By introducing a Universal Visual Codec, we map heterogeneous reasoning traces into a shared continuous latent space and inject them directly into the receiver's visual pathway, effectively treating the vision encoder as a universal port for inter-agent telepathy. Our framework adopts a hub-and-spoke topology to reduce pairwise alignment complexity from O(N^2) to O(N) and leverages a label-free, teacher-student distillation objective to align the high-speed visual channel with the robust reasoning patterns of the text pathway. Extensive experiments across heterogeneous model families (e.g., Qwen-VL, Gemma) demonstrate that the Vision Wormhole reduces end-to-end wall-clock time in controlled comparisons while maintaining reasoning fidelity comparable to standard text-based MAS. Code is available at https://github.com/xz-liu/heterogeneous-latent-mas
中文标题/摘要
标题:视觉虫洞:异构多智能体系统中的潜在空间通信
由大型语言模型驱动的多智能体系统(MAS)已经解锁了高级协作推理,但它们仍然受到离散文本通信低效性的束缚,这会带来显著的运行时开销和信息量化损失。虽然潜在状态传输提供了高带宽的替代方案,但现有方法要么假设同构的发送者-接收者架构,要么依赖于特定对的已学习翻译器,这限制了在具有不同流形的多样化模型家族之间进行扩展和模块化的能力。在本文中,我们提出了一种名为视觉虫洞的新型框架,该框架重新利用视觉语言模型(VLM)的视觉界面,以实现模型无关、无文本的通信。通过引入通用视觉编解码器,我们将异构推理轨迹映射到共享的连续潜在空间,并直接注入接收者的视觉路径中,从而有效地将视觉编码器视为智能体之间心灵感应的通用端口。我们的框架采用中心辐射型拓扑结构,将成对对齐的复杂度从O(N^2)降低到O(N),并利用无标签、教师-学生蒸馏目标来对齐高速视觉通道与文本路径的稳健推理模式。在不同模型家族(例如,Qwen-VL,Gemma)的广泛实验中,视觉虫洞在受控比较中减少了端到端的墙钟时间,同时保持了与标准基于文本的MAS相当的推理保真度。代码可在https://github.com/xz-liu/heterogeneous-latent-mas获取
Summary / 总结
The research addresses the inefficiency of text-based communication in Multi-Agent Systems (MAS) powered by Large Language Models, proposing the Vision Wormhole framework. This framework uses a Universal Visual Codec to map heterogeneous reasoning traces into a shared latent space and inject them into the receiver's visual pathway, enabling model-agnostic, text-free communication. Experiments show that the Vision Wormhole reduces end-to-end wall-clock time while maintaining reasoning fidelity comparable to standard text-based MAS.
研究旨在通过提出Vision Wormhole框架来提高异构多智能体系统(MAS)中通信的效率和可扩展性。该框架使用通用视觉编解码器将推理轨迹映射到共享的潜在空间,并直接注入接收者的视觉路径中,实现模型无关的无文本通信。实验表明,Vision Wormhole可以减少端到端的时间消耗,同时保持与标准基于文本的MAS相当的推理准确性。
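The O(N^2) to O(N) reduction claimed for the hub-and-spoke topology is easy to check with a small count. The sketch below assumes one directed translator per ordered sender-receiver pair in the pairwise setting, and one encoder plus one decoder per model in the hub setting; both counting conventions and the function names are assumptions for illustration, since the abstract only states the asymptotic complexity.

```python
def pairwise_translators(num_models):
    """Directed pair-specific translators: one per ordered (sender, receiver) pair."""
    return num_models * (num_models - 1)


def hub_adapters(num_models):
    """Hub-and-spoke: each model needs one map into and one map out of the shared space."""
    return 2 * num_models


for n in (2, 5, 10, 50):
    print(n, pairwise_translators(n), hub_adapters(n))
```

Under these assumptions, adding one more model to a pool of N costs 2N new pairwise translators but only 2 new adapters in the hub design, which is the modularity argument the abstract makes.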
GMAIL: Generative Modality Alignment for generated Image Learning
Authors: Shentong Mo, Sukmin Yun
First: 2026-02-17T05:40:25+00:00 · Latest: 2026-02-17T05:40:25+00:00
Abstract
Generative models have made it possible to synthesize highly realistic images, potentially providing an abundant data source for training machine learning models. Despite the advantages of these synthesizable data sources, the indiscriminate use of generated images as real images for training can even cause mode collapse due to modality discrepancies between real and synthetic domains. In this paper, we propose a novel framework for discriminative use of generated images, coined GMAIL, that explicitly treats generated images as a separate modality from real images. Instead of indiscriminately replacing real images with generated ones in the pixel space, our approach bridges the two distinct modalities in the same latent space through a multi-modal learning approach. To be specific, we first fine-tune a model exclusively on generated images using a cross-modality alignment loss and then employ this aligned model to further train various vision-language models with generated images. By aligning the two modalities, our approach effectively leverages the benefits of recent advances in generative models, thereby boosting the effectiveness of generated image learning across a range of vision-language tasks. Our framework can be easily incorporated into various vision-language models, and we demonstrate its efficacy through extensive experiments. For example, our framework significantly improves performance on image captioning, zero-shot image retrieval, zero-shot image classification, and long caption retrieval tasks. It also shows positive generated data scaling trends and notable enhancements in the captioning performance of the large multimodal model, LLaVA.
中文标题/摘要
标题:GMAIL:生成图像学习的生成模态对齐
生成模型使得合成高度逼真图像成为可能,为训练机器学习模型提供了丰富的数据来源。尽管这些可合成的数据源具有优势,但随意将生成图像作为真实图像用于训练可能会由于真实域和合成域模态差异而导致模式崩溃。在本文中,我们提出了一种新的框架GMAIL,明确将生成图像视为与真实图像不同的模态。我们的方法不是在像素空间中随意用生成图像替换真实图像,而是通过多模态学习方法在相同的潜在空间中连接这两种不同的模态。具体来说,我们首先使用跨模态对齐损失在生成图像上微调模型,然后使用此对齐模型进一步训练各种视觉-语言模型。通过对齐两种模态,我们的方法有效地利用了生成模型近期进展的好处,从而在一系列视觉-语言任务中提升了生成图像学习的效果。我们的框架可以轻松与各种视觉-语言模型结合,并通过广泛的实验展示了其有效性。例如,我们的框架在图像字幕、零样本图像检索、零样本图像分类和长字幕检索任务中显著提高了性能。它还展示了生成数据规模扩展的积极趋势,并在大型多模态模型LLaVA的字幕性能方面取得了显著提升。
Summary / 总结
GMAIL is a framework that aligns generated images with real images to improve the effectiveness of generated image learning for vision-language tasks. It fine-tunes a model on generated images and then uses this aligned model to train various vision-language models. Experiments show significant improvements in tasks like image captioning, zero-shot image retrieval, and zero-shot image classification, and positive scaling trends with more generated data.
GMAIL 是一种框架,通过将生成图像与真实图像对齐来提高生成图像在视觉语言任务中的学习效果。该框架首先在生成图像上微调模型,然后使用此对齐后的模型进一步训练各种视觉语言模型。实验显示,在图像字幕、零样本图像检索和零样本图像分类等任务中取得了显著改进,并且随着更多生成数据的使用,表现出积极的扩展趋势。
TTSA3R: Training-Free Temporal-Spatial Adaptive Persistent State for Streaming 3D Reconstruction
Authors: Zhijie Zheng, Xinhao Xiang, Jiawei Zhang
First: 2026-01-30T06:14:42+00:00 · Latest: 2026-02-17T05:37:58+00:00
Abstract
Streaming recurrent models enable efficient 3D reconstruction by maintaining persistent state representations. However, they suffer from catastrophic forgetting over long sequences because of the difficulty of balancing historical information with new observations. Recent methods alleviate this by deriving adaptive signals from an attention perspective, but they operate on single dimensions without considering temporal and spatial consistency. To this end, we propose a training-free framework termed TTSA3R that leverages both temporal state evolution and spatial observation quality for adaptive state updates in 3D reconstruction. In particular, we devise a Temporal Adaptive Update Module that regulates update magnitude by analyzing temporal state evolution patterns. Then, a Spatial Contextual Update Module is introduced to localize spatial regions that require updates through observation-state alignment and scene dynamics. These complementary signals are finally fused to determine the state updating strategies. Extensive experiments demonstrate the effectiveness of TTSA3R in diverse 3D tasks. Moreover, our method exhibits only a 1.33x error increase compared to over 4x degradation in the baseline model on extended sequences of 3D reconstruction, significantly improving long-term reconstruction stability. Our code is available at https://github.com/anonus2357/ttsa3r.
中文标题/摘要
标题:TTSA3R:无需训练的空间-时间自适应持久状态用于流式3D重建
流式递归模型通过维护持久状态表示来实现高效的3D重建。然而,由于在历史信息与新观察之间进行平衡,它们在长序列中会遭受灾难性遗忘。最近的方法通过从注意力视角推导自适应信号来缓解这一问题,但它们仅在单个维度上操作,而不考虑时间和空间一致性。为了解决这个问题,我们提出了一种无需训练的框架TTSA3R,该框架利用时间和空间信息进行自适应状态更新。特别是,我们设计了一个时间自适应更新模块,通过分析时间状态演变模式来调节更新幅度。然后,我们引入了一个空间上下文更新模块,通过观察-状态对齐和场景动态来定位需要更新的空间区域。最后,这些互补信号被融合以确定状态更新策略。广泛的实验表明,TTSA3R在各种3D任务中具有有效性。此外,与基线模型相比,我们的方法在长序列3D重建中的误差增加仅为1.33倍,而基线模型的性能下降超过4倍,显著提高了长期重建的稳定性。我们的代码可在https://github.com/anonus2357/ttsa3r获取。
Summary / 总结
TTSA3R is a training-free framework that enhances 3D reconstruction by adapting persistent state representations through temporal and spatial mechanisms. It introduces a Temporal Adaptive Update Module to regulate update magnitude based on temporal state evolution and a Spatial Contextual Update Module to identify regions needing updates based on observation-state alignment and scene dynamics. Experiments show that TTSA3R improves long-term reconstruction stability, with only a 1.33x error increase on extended sequences compared to over 4x degradation in the baseline model.
TTSA3R 是一个无需训练的框架,用于通过结合时空自适应更新来解决长序列中的灾难性遗忘问题。它使用时序自适应更新模块根据时序状态演变来调节更新幅度,并使用空间上下文更新模块通过观测状态对齐和场景动态来识别需要更新的区域。实验表明,与基线模型相比,TTSA3R 在扩展的 3D 重建序列上仅导致 1.33 倍的误差增加,而基线模型则增加了 4 倍的误差,显著提高了长期重建的稳定性。
Train Short, Inference Long: Training-free Horizon Extension for Autoregressive Video Generation
Authors: Jia Li, Xiaomeng Fu, Xurui Peng, Weifeng Chen, Youwei Zheng, Tianyu Zhao, Jiexi Wang, Fangmin Chen, Xing Wang, Hayden Kwok-Hay So
First: 2026-02-15T07:14:47+00:00 · Latest: 2026-02-17T04:53:36+00:00
Comments: 19 pages, 15 figures
Abstract
Autoregressive video diffusion models have emerged as a scalable paradigm for long video generation. However, they often suffer from severe extrapolation failure, where rapid error accumulation leads to significant temporal degradation when extending beyond training horizons. We identify that this failure primarily stems from the spectral bias of 3D positional embeddings and the lack of dynamic priors in noise sampling. To address these issues, we propose FLEX (Frequency-aware Length EXtension), a training-free inference-time framework that bridges the gap between short-term training and long-term inference. FLEX introduces Frequency-aware RoPE Modulation to adaptively interpolate under-trained low-frequency components while extrapolating high-frequency ones to preserve multi-scale temporal discriminability. This is integrated with Antiphase Noise Sampling (ANS) to inject high-frequency dynamic priors and Inference-only Attention Sink to anchor global structure. Extensive evaluations on VBench demonstrate that FLEX significantly outperforms state-of-the-art models at 6x extrapolation (30s duration) and matches the performance of long-video fine-tuned baselines at 12x scale (60s duration). As a plug-and-play augmentation, FLEX seamlessly integrates into existing inference pipelines for horizon extension. It effectively pushes the generation limits of models such as LongLive, supporting consistent and dynamic video synthesis at a 4-minute scale. Project page is available at https://ga-lee.github.io/FLEX_demo.
中文标题/摘要
标题:短训练,长推理:无需训练的超长期自回归视频生成
自回归视频扩散模型已成为长视频生成的可扩展范式。然而,它们通常会遭受严重的外推失败,即快速的误差累积导致在超出训练范围时出现显著的时间降解。我们发现,这种失败主要源于3D位置嵌入的频谱偏差以及噪声采样中缺乏动态先验。为了解决这些问题,我们提出了FLEX(频率感知长度扩展),这是一种无需训练的推理时框架,能够弥合短期训练与长期推理之间的差距。FLEX引入了频率感知RoPE调制,以适应性地插值未充分训练的低频成分,同时外推高频成分,以保持多尺度时间可分辨性。这与反相噪声采样(ANS)结合使用,以注入高频动态先验,并与推理专用注意力汇合,以锚定全局结构。在VBench上的广泛评估表明,FLEX在6倍外推(30秒时长)时显著优于最先进的模型,并在12倍尺度(60秒时长)时与长视频微调基线相当。作为即插即用的增强方法,FLEX无缝集成到现有的推理管道中,有效推动了如LongLive等模型的生成极限,支持在4分钟尺度上的一致和动态视频合成。项目页面可在https://ga-lee.github.io/FLEX_demo/获取。
Summary / 总结
The paper addresses the issue of temporal degradation in autoregressive video diffusion models when extending beyond their training horizons. It proposes FLEX, a training-free framework that enhances long-term inference by introducing Frequency-aware RoPE Modulation, Antiphase Noise Sampling, and Inference-only Attention Sink. Experiments on VBench show that FLEX significantly improves performance at 6x extrapolation and matches long-video fine-tuned baselines at 12x scale, demonstrating its effectiveness in extending the generation limits of models like LongLive for consistent and dynamic video synthesis.
论文针对自回归视频扩散模型在超出训练范围时快速积累误差导致的外推失败问题,提出了一种名为FLEX的训练免费框架,通过引入频率感知RoPE调制、反相噪声采样和推理专用注意力下陷来增强长期推理能力。在VBench上的实验表明,FLEX在6倍外推时显著提高了性能,并在12倍规模时与长视频微调基线相当,展示了其在将模型如LongLive的生成极限扩展至4分钟视频中的有效性和一致性动态视频合成能力。
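The split between "interpolate low-frequency components" and "extrapolate high-frequency components" mentioned above can be illustrated with a generic frequency-dependent RoPE rescaling. This is a sketch under stated assumptions and not FLEX's actual modulation rule: the hard wavelength cut-off (a component is treated as under-trained when its wavelength exceeds the training length), the uniform `train_len / target_len` scale, and the function names are all hypothetical.

```python
import math


def rope_frequencies(dim, base=10000.0):
    """Standard RoPE angular frequencies, one per pair of channels."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def modulate_frequencies(freqs, train_len, target_len):
    """Rescale only the components that never completed a full period in training.

    Low-frequency components (wavelength > train_len) are interpolated by
    shrinking their frequency so positions up to target_len stay inside the
    phase range seen during training. High-frequency components are left
    unchanged, i.e. they are extrapolated to the longer horizon.
    """
    scale = train_len / target_len
    out = []
    for f in freqs:
        wavelength = 2.0 * math.pi / f
        out.append(f * scale if wavelength > train_len else f)
    return out


freqs = rope_frequencies(dim=8)
modulated = modulate_frequencies(freqs, train_len=100, target_len=600)
for before, after in zip(freqs, modulated):
    print(f"{before:.6f} -> {after:.6f}")
```

Leaving the fast components untouched keeps fine-grained temporal distinctions between nearby frames, while compressing the slow components avoids presenting the model with phases it never observed during training.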
Stratified Hazard Sampling: Minimal-Variance Event Scheduling for CTMC/DTMC Discrete Diffusion and Flow Models
Authors: Seunghwan Jang, SooJean Han
First: 2026-01-06T08:19:02+00:00 · Latest: 2026-02-17T04:16:50+00:00
Comments: Work in progress. Feedback welcome
Abstract
Uniform-noise discrete diffusion and flow models (e.g., D3PM, SEDD, UDLM, DFM) generate sequences non-autoregressively by iteratively refining randomly initialized vocabulary tokens through multiple context-dependent replacements. These models are typically formulated as time-inhomogeneous CTMC/DTMC processes and sampled using independent Bernoulli change decisions at each discretization step. This induces Poisson-binomial variance in per-position jump counts that grows with the number of required edits, leading to the characteristic under-editing (residual noise) and over-editing (cascading substitutions) failure modes that degrade sample quality, especially under tight discretization budgets. In contrast, absorbing-state (mask-start) models avoid this instability by allowing each position to jump at most once. We propose Stratified Hazard Sampling (SHS), a training-free, drop-in, and hyperparameter-free inference principle for any sampler that admits a stay-vs.-replace decomposition. SHS models per-token edits as events driven by cumulative hazard (CTMC) or cumulative jump mass (DTMC) and places events by stratifying this cumulative quantity: with a single random phase per position, a token is updated whenever its accumulated hazard crosses unit-spaced thresholds. This preserves the expected number of jumps while achieving the minimum possible conditional variance among unbiased integer estimators (bounded by 1/4 for any fixed cumulative mass), without altering per-jump destination sampling and thus retaining multimodality. Experiments on uniform-noise discrete diffusion language models show that SHS consistently improves sample quality. We further show that SHS improves robustness under token-level blacklist filtering, with benefits increasing as lexical constraints grow more severe.
中文标题/摘要
标题:分层危险抽样:CTMC/DTMC 离散扩散和流模型的最小方差事件调度
均匀噪声离散扩散和流模型(例如 D3PM、SEDD、UDLM、DFM)通过多次上下文相关替换逐步精炼随机初始化的词汇单元来生成非自回归序列。这些模型通常被表述为时间非齐次的 CTMC/DTMC 过程,并在每个离散化步骤中使用独立的伯努利变化决策进行采样。这导致了每个位置跳跃次数的泊松二项式方差随所需编辑次数增加而增加,从而导致特征性的残余噪声(剩余噪声)和级联替换(级联替换)失败模式,这些模式在紧缩的离散化预算下尤其会降低样本质量。相比之下,吸收态(掩码起始)模型通过允许每个位置最多跳跃一次来避免这种不稳定性。我们提出了一种无需训练、即插即用且无需超参数的推理原则——分层危险抽样(SHS),适用于任何可分解为停留或替换的采样器。SHS 将每个词汇单元的编辑视为由累积危险(CTMC)或累积跳跃质量(DTMC)驱动的事件,并通过分层累积量来放置这些事件:每个位置仅有一个随机相位,当累积危险超过单位间隔阈值时,该词汇单元将被更新。这保持了预期的跳跃次数,同时实现了无偏整数估计器中可能的最小条件方差(任何固定累积质量下的上限为 1/4),而不改变每次跳跃的目的地采样,从而保留了多模态性。在均匀噪声离散扩散语言模型上的实验表明,SHS 一致地提高了样本质量。我们还表明,SHS 在词汇单元级黑名单过滤下提高了鲁棒性,随着词汇约束条件变得越来越严格,其益处会增加。
Summary / 总结
The paper addresses under-editing and over-editing in uniform-noise discrete diffusion and flow models by proposing Stratified Hazard Sampling (SHS), a training-free, hyperparameter-free inference principle. SHS stratifies the cumulative hazard (CTMC) or cumulative jump mass (DTMC) of each position using a single random phase, which preserves the expected number of jumps while minimizing the conditional variance of per-position jump counts. Experiments show that SHS consistently improves sample quality and improves robustness under token-level blacklist filtering, with benefits increasing as lexical constraints grow more severe.
论文提出了分层危险率采样(SHS),通过最小化每个位置跳跃计数的条件方差来解决均匀噪声离散扩散模型中的编辑不足和过度编辑问题。SHS通过对累积危险率或跳跃质量进行分层,确保最小条件方差,同时保留多模态性而不改变每个跳跃的目的地采样。实验表明,SHS能够一致地提高样本质量,并增强在词元级黑名单过滤下的鲁棒性,随着词汇约束条件的增加,这种优势更加明显。
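The scheduling rule described in the SHS abstract above can be sketched for a single position as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the per-step `hazards` list, and the Bernoulli baseline (which uses the hazard increment directly as a change probability) are hypothetical. Only the decision of when a token is replaced is modeled; which token it becomes (destination sampling) is left to the underlying sampler, as the abstract states.

```python
import math
import random
import statistics


def stratified_jump_steps(hazards, phase):
    """Steps at which one position is re-sampled under stratified scheduling.

    hazards: non-negative per-step hazard increments (assumed given by the sampler).
    phase:   a single uniform draw in [0, 1) for this position.
    A jump fires each time (phase + cumulative hazard) crosses an integer.
    """
    steps = []
    cumulative = phase
    crossed = 0
    for k, h in enumerate(hazards):
        cumulative += h
        n = math.floor(cumulative)
        # Record one jump per unit-spaced threshold crossed in this step.
        steps.extend([k] * (n - crossed))
        crossed = n
    return steps


def bernoulli_jump_steps(hazards, rng):
    """Baseline: an independent change decision at every step (simplified)."""
    return [k for k, h in enumerate(hazards) if rng.random() < min(h, 1.0)]


def compare_variance(hazards, trials=2000, seed=0):
    """Empirical variance of the jump count under both schedulers."""
    rng = random.Random(seed)
    strat = [len(stratified_jump_steps(hazards, rng.random())) for _ in range(trials)]
    bern = [len(bernoulli_jump_steps(hazards, rng)) for _ in range(trials)]
    return statistics.pvariance(strat), statistics.pvariance(bern)
```

With total cumulative hazard Λ, the stratified count equals floor(Λ + phase), so it only ever takes the values floor(Λ) or ceil(Λ). Its mean is Λ and its variance is frac(Λ)(1 − frac(Λ)), which is at most 1/4, matching the bound quoted in the abstract.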
Sparrow: Text-Anchored Window Attention with Visual-Semantic Glimpsing for Speculative Decoding in Video LLMs
Authors: Libo Zhang, Zhaoning Zhang, Wangyang Hong, Peng Qiao, Dongsheng Li
First: 2026-02-17T02:51:36+00:00 · Latest: 2026-02-17T02:51:36+00:00
Comments: 15 pages, 6 figures
Abstract
Although speculative decoding is widely used to accelerate Vision-Language Models (VLMs) inference, it faces severe performance collapse when applied to Video Large Language Models (Vid-LLMs). The draft model typically falls into the trap of attention dilution and negative visual gain due to key-value cache explosion and context window mismatches. We observe a visual semantic internalization phenomenon in Vid-LLMs, indicating that critical visual semantics are implicitly encoded into text hidden states during deep-layer interactions, which renders raw visual inputs structurally redundant during deep inference. To address this, we propose the Sparrow framework, which first utilizes visually-aware text-anchored window attention via hidden state reuse to fully offload visual computation to the target model, and leverages intermediate-layer visual state bridging to train the draft model with semantic-rich intermediate states, thereby filtering out low-level visual noise. Additionally, a multi-token prediction strategy is introduced to bridge the training-inference distribution shift. Experiments show that Sparrow achieves an average speedup of 2.82x even with 25k visual tokens, effectively resolving the performance degradation in long sequences and offering a practical solution for real-time long video tasks.
中文标题/摘要
标题:麻雀:基于文本锚定窗口注意力的视觉语义概览在视频大语言模型中推测性解码
尽管推测性解码被广泛用于加速视觉语言模型(VLMs)的推理,但在应用于视频大语言模型(Vid-LLMs)时,它会面临严重的性能崩溃。草稿模型通常会陷入注意力稀释和负视觉增益的陷阱,这主要是由于关键值缓存爆炸和上下文窗口不匹配造成的。我们观察到Vid-LLMs中存在视觉语义内化现象,表明在深层层交互过程中,关键视觉语义被隐式编码到文本隐藏状态中,这使得在深层推理过程中原始视觉输入结构上变得冗余。为了解决这个问题,我们提出了麻雀框架,该框架首先利用视觉感知的文本锚定窗口注意力并通过隐藏状态重用将视觉计算完全卸载到目标模型上,并利用中间层视觉状态桥梁来用富含语义的中间状态训练草稿模型,从而过滤掉低级视觉噪声。此外,还引入了一种多令牌预测策略来弥合训练-推理分布偏移。实验表明,即使使用25k视觉令牌,麻雀也能实现2.82倍的平均加速,有效解决了长序列中的性能下降问题,并为实时长视频任务提供了一种实用的解决方案。
Summary / 总结
The Sparrow framework addresses the performance collapse of speculative decoding in Video Large Language Models (Vid-LLMs) by utilizing visually-aware text-anchored window attention and intermediate-layer visual state bridging. This approach offloads visual computation and filters out low-level visual noise, achieving an average speedup of 2.82x even with 25k visual tokens, and effectively resolving performance degradation in long sequences.
Sparrow框架通过利用视觉感知的文字锚定窗口注意力和中间层视觉状态桥梁来解决视频大型语言模型(Vid-LLMs)中投机解码的性能下降问题。该方法卸载视觉计算并过滤掉低级视觉噪声,即使使用25k视觉标记也能实现2.82倍的平均加速,有效解决了长序列中的性能下降问题。
Training-Free Zero-Shot Anomaly Detection in 3D Brain MRI with 2D Foundation Models
Authors: Tai Le-Gia, Jaehyun Ahn
First: 2026-02-17T02:46:45+00:00 · Latest: 2026-02-17T02:46:45+00:00
Comments: Accepted for MIDL 2026
Abstract
Zero-shot anomaly detection (ZSAD) has gained increasing attention in medical imaging as a way to identify abnormalities without task-specific supervision, but most advances remain limited to 2D datasets. Extending ZSAD to 3D medical images has proven challenging, with existing methods relying on slice-wise features and vision-language models, which fail to capture volumetric structure. In this paper, we introduce a fully training-free framework for ZSAD in 3D brain MRI that constructs localized volumetric tokens by aggregating multi-axis slices processed by 2D foundation models. These 3D patch tokens restore cubic spatial context and integrate directly with distance-based, batch-level anomaly detection pipelines. The framework provides compact 3D representations that are practical to compute on standard GPUs and require no fine-tuning, prompts, or supervision. Our results show that training-free, batch-based ZSAD can be effectively extended from 2D encoders to full 3D MRI volumes, offering a simple and robust approach for volumetric anomaly detection.
中文标题/摘要
标题:基于2D基础模型的无训练3D脑MRI零样本异常检测
零样本异常检测(ZSAD)在医学影像中越来越受到关注,作为一种无需任务特定监督即可识别异常的方法,但大多数进展仍局限于2D数据集。将ZSAD扩展到3D医学图像具有挑战性,现有方法依赖于切片特征和视觉-语言模型,无法捕捉体素结构。在本文中,我们提出了一种用于3D脑MRI的完全无训练框架,通过使用2D基础模型处理多轴切片来构建局部体素标记。这些3D片段标记恢复了立方体空间上下文,并直接与基于距离的批量级异常检测管道集成。该框架提供了紧凑的3D表示,可以在标准GPU上高效计算,无需微调、提示或监督。我们的结果表明,无训练、批量级ZSAD可以从2D编码器扩展到完整的3D MRI体积,提供了一种简单且稳健的体素异常检测方法。
Summary / 总结
This paper presents a training-free framework for zero-shot anomaly detection (ZSAD) in 3D brain MRI using 2D foundation models. It constructs localized volumetric tokens by aggregating multi-axis slices, which restores cubic spatial context and integrates with distance-based anomaly detection pipelines. The framework does not require fine-tuning, prompts, or supervision, and can effectively extend 2D encoder-based ZSAD to full 3D MRI volumes, providing a simple and robust approach for volumetric anomaly detection.
本文提出了一种使用2D基础模型的无训练零样本异常检测(ZSAD)框架,用于3D脑MRI。该方法通过聚合多轴切片构建局部体素化标记,恢复了立方体空间上下文,并直接与基于距离的异常检测管道集成。结果表明,这种方法可以将基于2D编码器的ZSAD有效扩展到完整的3D MRI体积,提供了一种无需微调、提示或监督的简单且稳健的体积异常检测解决方案。
Hierarchical Refinement of Universal Multimodal Attacks on Vision-Language Models
Authors: Peng-Fei Zhang, Zi Huang
First: 2026-01-15T11:45:56+00:00 · Latest: 2026-02-17T02:05:08+00:00
Comments: 10 pages, 7 figures
Abstract
Existing adversarial attacks for VLP models are mostly sample-specific, resulting in substantial computational overhead when scaled to large datasets or new scenarios. To overcome this limitation, we propose Hierarchical Refinement Attack (HRA), a multimodal universal attack framework for VLP models. For the image modality, we refine the optimization path by leveraging a temporal hierarchy of historical and estimated future gradients to avoid local minima and stabilize universal perturbation learning. For the text modality, it hierarchically models textual importance by considering both intra- and inter-sentence contributions to identify globally influential words, which are then used as universal text perturbations. Extensive experiments across various downstream tasks, VLP models, and datasets, demonstrate the superior transferability of the proposed universal multimodal attacks.
中文标题/摘要
标题:视觉语言模型的分层细化普遍多模态攻击
现有的针对VLP模型的对抗攻击大多针对样本特定,当扩展到大规模数据集或新场景时会产生大量的计算开销。为克服这一局限,我们提出了一种分层细化攻击(HRA),这是一种针对VLP模型的普遍多模态攻击框架。对于图像模态,我们通过利用历史和估计未来梯度的时间层次结构来细化优化路径,以避免局部最小值并稳定普遍扰动学习。对于文本模态,我们通过考虑句内和句间贡献来分层建模文本的重要性,以识别全局有影响力的单词,然后将这些单词用作普遍文本扰动。广泛的实验表明,提出的普遍多模态攻击具有优越的迁移性。
Summary / 总结
The research aims to address the computational overhead of sample-specific adversarial attacks on vision-language models by proposing Hierarchical Refinement Attack (HRA), a universal multimodal attack framework. HRA refines the optimization path for images using a temporal hierarchy of gradients and hierarchically models textual importance for text. Experiments show that HRA achieves superior transferability across different downstream tasks, VLP models, and datasets.
研究旨在通过提出层次精炼攻击(HRA),一种多模态的通用攻击框架,解决视觉语言模型中样本特定的对抗性攻击带来的计算开销问题。HRA 通过时间层次的历史和估计未来梯度来细化图像优化路径,并通过考虑句内和句间贡献来层次建模文本的重要性,以识别全局有影响力的单词。实验表明,HRA 在各种下游任务、视觉语言模型和数据集上具有更好的转移性。
EAA: Automating materials characterization with vision language model agents
Authors: Ming Du, Yanqi Luo, Srutarshi Banerjee, Michael Wojcik, Jelena Popovic, Mathew J. Cherukara
First: 2026-02-17T01:34:05+00:00 · Latest: 2026-02-17T01:34:05+00:00
Abstract
We present Experiment Automation Agents (EAA), a vision-language-model-driven agentic system designed to automate complex experimental microscopy workflows. EAA integrates multimodal reasoning, tool-augmented action, and optional long-term memory to support both autonomous procedures and interactive user-guided measurements. Built on a flexible task-manager architecture, the system enables workflows ranging from fully agent-driven automation to logic-defined routines that embed localized LLM queries. EAA further provides a modern tool ecosystem with two-way compatibility for Model Context Protocol (MCP), allowing instrument-control tools to be consumed or served across applications. We demonstrate EAA at an imaging beamline at the Advanced Photon Source, including automated zone plate focusing, natural language-described feature search, and interactive data acquisition. These results illustrate how vision-capable agents can enhance beamline efficiency, reduce operational burden, and lower the expertise barrier for users.
中文标题/摘要
标题:EAA:使用视觉语言模型代理自动化材料表征
我们介绍了实验自动化代理(EAA),这是一种以视觉-语言模型驱动的代理系统,旨在自动化复杂的实验显微镜工作流程。EAA 结合了多模态推理、工具增强的操作以及可选的长期记忆,以支持自主程序和交互式用户引导测量。基于灵活的任务管理架构,该系统使工作流程从完全由代理驱动的自动化到嵌入局部LLM查询的逻辑定义例行程序成为可能。EAA 还提供了一个现代工具生态系统,该生态系统与模型上下文协议(MCP)具有双向兼容性,允许仪器控制工具在应用程序之间被消费或提供。我们在先进光子源的成像束线处展示了EAA,包括自动化区板对焦、自然语言描述的特征搜索以及交互式数据采集。这些结果说明了视觉能力代理如何提高束线效率、减轻操作负担并降低用户的专业门槛。
Summary / 总结
EAA is a vision-language-model-driven system designed to automate complex experimental microscopy workflows. It integrates multimodal reasoning, tool-augmented actions, and optional long-term memory to support both autonomous procedures and interactive user-guided measurements. EAA demonstrates automated zone plate focusing, natural language-described feature search, and interactive data acquisition, enhancing beamline efficiency and reducing operational burden.
EAA 是一个基于视觉-语言模型的系统,旨在自动化复杂的实验显微镜工作流程。它结合了多模态推理、工具增强的动作和可选的长期记忆,以支持自主程序和交互式用户指导测量。EAA 在先进光子源的成像光束线上展示了自动聚焦区板、自然语言描述的特征搜索和交互式数据采集,展示了增强的光束线效率和降低的操作负担。
Generating Findings for Jaw Cysts in Dental Panoramic Radiographs Using a GPT-Based VLM: A Preliminary Study on Building a Two-Stage Self-Correction Loop with Structured Output (SLSO) Framework
Authors: Nanaka Hosokawa, Ryou Takahashi, Tomoya Kitano, Yukihiro Iida, Chisako Muramatsu, Tatsuro Hayashi, Yuta Seino, Xiangrong Zhou, Takeshi Hara, Akitoshi Katsumata, Hiroshi Fujita
First: 2025-10-02T13:22:13+00:00 · Latest: 2026-02-17T00:52:38+00:00
Comments: Revised manuscript; supplementary materials added. Submitted to Diagnostics
Abstract
Vision-language models (VLMs) such as GPT (Generative Pre-Trained Transformer) have shown potential for medical image interpretation; however, challenges remain in generating reliable radiological findings in clinical practice, as exemplified by dental pathologies. This study proposes a Self-correction Loop with Structured Output (SLSO) framework as an integrated processing methodology to enhance the accuracy and reliability of AI-generated findings for jaw cysts in dental panoramic radiographs. Dental panoramic radiographs with jaw cysts were used to implement a 10-step integrated processing framework incorporating image analysis, structured data generation, tooth number extraction, consistency checking, and iterative regeneration. The framework functioned as an external validation mechanism for GPT outputs. Performance was compared against the conventional Chain-of-Thought (CoT) method across seven evaluation items: transparency, internal structure, borders, root resorption, tooth movement, relationships with other structures, and tooth number. The SLSO framework improved output accuracy for multiple items compared to the CoT method, with the most notable improvements observed in tooth number identification, tooth movement detection, and root resorption assessment. In successful cases, consistently structured outputs were achieved after up to five regenerations. The framework enforced explicit negative finding descriptions and suppressed hallucinations, although accurate identification of extensive lesions spanning multiple teeth remained limited. This investigation established the feasibility of the proposed integrated processing methodology and provided a foundation for future validation studies with larger, more diverse datasets.
中文标题/摘要
标题:基于GPT的VLM生成牙颌囊肿全景牙片发现:初步研究及SLSO框架下的两阶段自我校正循环构建
视觉语言模型(VLMs)如GPT(生成预训练变换器)在医学图像解释方面显示出潜力;然而,在临床实践中生成可靠的放射学发现仍面临挑战,特别是在牙科病理学方面。本研究提出了一种结构输出自我校正循环(SLSO)框架,作为集成处理方法,以提高AI生成的牙颌囊肿发现的准确性和可靠性。使用包含牙颌囊肿的全景牙片实施了一个包含图像分析、结构化数据生成、牙号提取、一致性检查和迭代再生的十步集成处理框架。该框架作为GPT输出的外部验证机制。在七个评估项目:透明度、内部结构、边界、根吸收、牙齿移动、与其他结构的关系和牙号方面,与传统的链式思维(CoT)方法进行了性能比较。SLSO框架在多个项目上提高了输出准确性,特别是在牙号识别、牙齿移动检测和根吸收评估方面取得了最显著的改进。在成功案例中,经过最多五次再生后实现了结构一致的输出。该框架强制执行明确的负面发现描述并抑制幻觉,尽管准确识别跨越多个牙齿的广泛病变仍有限。本研究确立了所提集成处理方法的可行性,并为未来使用更大、更多样数据集的验证研究奠定了基础。
Summary / 总结
This study introduces a Self-correction Loop with Structured Output (SLSO) framework to improve the accuracy of AI-generated findings for jaw cysts in dental panoramic radiographs using a GPT-based vision-language model. The 10-step framework combines image analysis, structured data generation, tooth number extraction, consistency checking, and iterative regeneration, and was evaluated against the Chain-of-Thought (CoT) method across seven evaluation items. SLSO improved accuracy on multiple items, most notably tooth number identification, tooth movement detection, and root resorption assessment; in successful cases, consistently structured outputs were achieved after up to five regenerations. Accurate identification of extensive lesions spanning multiple teeth remained limited.
本研究提出了一种自校正循环与结构输出(SLSO)框架,利用基于GPT的视觉语言模型来提高牙源性囊肿在全景牙科X光片中生成的诊断结果的准确性和可靠性。该框架结合了图像分析、结构化数据生成、牙齿编号提取和迭代再生,以提高输出准确性。与链式思考(CoT)方法相比,SLSO框架在牙齿编号识别、牙齿移动检测和根吸收评估方面显示出显著改进,且在最多五次再生后可实现一致的结构化输出。然而,对跨越多个牙齿的广泛病变的准确识别仍有限制。
Visual Persuasion: What Influences Decisions of Vision-Language Models?
Authors: Manuel Cherep, Pranav M R, Pattie Maes, Nikhil Singh
First: 2026-02-17T00:33:53+00:00 · Latest: 2026-02-17T00:33:53+00:00
Comments: 45 pages, 17 figures
Abstract
The web is littered with images, once created for human consumption and now increasingly interpreted by agents using vision-language models (VLMs). These agents make visual decisions at scale, deciding what to click, recommend, or buy. Yet, we know little about the structure of their visual preferences. We introduce a framework for studying this by placing VLMs in controlled image-based choice tasks and systematically perturbing their inputs. Our key idea is to treat the agent's decision function as a latent visual utility that can be inferred through revealed preference: choices between systematically edited images. Starting from common images, such as product photos, we propose methods for visual prompt optimization, adapting text optimization methods to iteratively propose and apply visually plausible modifications using an image generation model (such as in composition, lighting, or background). We then evaluate which edits increase selection probability. Through large-scale experiments on frontier VLMs, we demonstrate that optimized edits significantly shift choice probabilities in head-to-head comparisons. We develop an automatic interpretability pipeline to explain these preferences, identifying consistent visual themes that drive selection. We argue that this approach offers a practical and efficient way to surface visual vulnerabilities, safety concerns that might otherwise be discovered implicitly in the wild, supporting more proactive auditing and governance of image-based AI agents.
中文标题/摘要
标题:视觉说服力:什么影响视觉语言模型的决策?
互联网上充斥着图像,这些图像原本是为人类准备的,现在越来越多地被使用视觉语言模型(VLMs)的代理所解读。这些代理在大规模地做出视觉决策,决定点击什么、推荐什么或购买什么。然而,我们对它们的视觉偏好结构知之甚少。我们通过将VLMs置于受控的基于图像的选择任务中,并系统地扰动其输入,引入了一个研究框架。我们的核心思想是将代理的决策函数视为一种潜在的视觉效用,可以通过揭示偏好(即系统编辑的图像之间的选择)来推断:从常见的图像(如产品照片)开始,我们提出了视觉提示优化的方法,将文本优化方法适应为迭代地提出并应用视觉上合理的修改(如构图、照明或背景),然后评估哪些修改增加了选择概率。通过在前沿VLMs上的大规模实验,我们证明了优化的修改在一对一比较中显著地改变了选择概率。我们开发了一个自动可解释性管道来解释这些偏好,识别出驱动选择的一致视觉主题。我们认为这种方法提供了一种实用且高效的方式来揭示视觉漏洞,这些安全问题在野生环境中可能会隐式地被发现,支持对基于图像的AI代理进行更积极的审计和治理。
Summary / 总结
This study investigates how vision-language models (VLMs) make visual decisions by placing them in controlled choice tasks and systematically perturbing their inputs. By optimizing visual prompts and evaluating the impact of different edits, the research demonstrates that specific visual modifications can significantly alter VLMs' choices. An automatic interpretability pipeline was developed to explain these preferences, revealing consistent visual themes that drive selection. This approach provides a practical method to identify visual vulnerabilities and safety concerns in VLMs, supporting proactive auditing and governance.
研究通过将视觉语言模型(VLMs)置于受控的图像选择任务中,并系统地改变其输入,来探讨它们如何做出视觉决策。关键方法是优化视觉提示,迭代地提出并应用如构图、照明或背景等视觉上合理的修改,以确定哪些修改能增加选择概率。主要发现表明,优化的修改在一对一比较中显著改变了选择概率,揭示了驱动选择的一致视觉主题,并提供了一种实用且高效的方法来发现VLMs中的视觉漏洞和安全问题,支持更主动的审计和治理。
How to Train Your Long-Context Visual Document Model
Authors: Austin Veselka
First: 2026-02-16T23:26:51+00:00 · Latest: 2026-02-16T23:26:51+00:00
Abstract
We present the first comprehensive, large-scale study of training long-context vision language models up to 344K context, targeting long-document visual question answering with measured transfer to long-context text. While several such strong models are open-weight, namely Qwen3 VL and GLM 4.5/6V, their training recipes and data pipelines are not reproducible. We systematically study continued pretraining, supervised finetuning, and preference optimization for 24B and 32B parameter models, backed by extensive long-context (LC) evaluations and ablations to bridge this gap, and achieve state-of-the-art performance on MMLongBenchDoc for both parameter scales. In addition, our key findings include: (i) training on context lengths that match evaluation context lengths outperforms training on longer contexts, (ii) training and evaluating with page indices provides a simple, high-impact boost to long-document performance, (iii) our synthetic data pipelines enable self-improvement via continued pretraining and supervised finetuning, and (iv) we extend the known text-to-visual long context transfer to the reverse, showing that visual long context training transfers to long-context text performance. We also release MMLBD-C, a manually corrected version of MMLongBenchDoc to reduce erroneous and low quality examples in the benchmark.
中文标题/摘要
标题:如何训练您的长上下文视觉文档模型
我们首次全面研究了训练至344K上下文的长上下文视觉语言模型,针对长文档视觉问答,并通过测量将其转移到长上下文文本上。虽然有几种这样的强大模型,例如Qwen3 VL和GLM 4.5/6V,但它们的训练配方和数据管道不可重现。我们系统地研究了24B和32B参数模型的持续预训练、监督微调和偏好优化,通过广泛的长上下文评估和消融实验来弥补这一差距,并在MMLongBenchDoc基准测试中实现了参数规模的最新性能。此外,我们的主要发现包括:(i) 在与评估上下文长度匹配的上下文中进行训练优于在更长的上下文中进行训练,(ii) 使用页面索引进行训练和评估为长文档性能提供了简单而高效的提升,(iii) 我们合成的数据管道通过持续预训练和监督微调实现自我改进,(iv) 我们将已知的文本到视觉长上下文转移扩展到反向,表明视觉长上下文训练转移到长上下文文本性能。我们还发布了MMLBD-C,这是MMLongBenchDoc的手动修正版本,以减少基准测试中的错误和低质量示例。
Summary / 总结
This study presents a comprehensive analysis of training long-context vision language models up to 344K context, focusing on long-document visual question answering and its transfer to long-context text. The research systematically explores continued pretraining, supervised finetuning, and preference optimization for 24B and 32B parameter models, achieving state-of-the-art performance on MMLongBenchDoc. Key findings include the benefit of training on context lengths matching evaluation lengths, the positive impact of using page indices, the effectiveness of synthetic data pipelines, and the transfer of visual long context training to long-context text performance.
该研究探讨了训练至344K上下文长度的长上下文视觉语言模型,重点关注长文档视觉问答及其向长上下文文本的迁移。研究系统地考察了24B和32B参数模型的持续预训练、监督微调和偏好优化,实现了在MMLongBenchDoc上的最新性能。关键发现包括训练上下文长度与评估长度匹配的益处、使用页面索引的积极影响、合成数据管道的有效性,以及视觉长上下文训练向长上下文文本性能的迁移。
COMPOT: Calibration-Optimized Matrix Procrustes Orthogonalization for Transformers Compression
Authors: Denis Makhov, Dmitriy Shopkhoev, Magauiya Zhussip, Ammar Ali, Baher Mohammad, Stamatios Lefkimmiatis
First: 2026-02-16T21:31:34+00:00 · Latest: 2026-02-16T21:31:34+00:00
Abstract
Post-training compression of Transformer models commonly relies on truncated singular value decomposition (SVD). However, enforcing a single shared subspace can degrade accuracy even at moderate compression. Sparse dictionary learning provides a more flexible union-of-subspaces representation, but existing approaches often suffer from iterative dictionary and coefficient updates. We propose COMPOT (Calibration-Optimized Matrix Procrustes Orthogonalization for Transformers), a training-free compression framework that uses a small calibration dataset to estimate a sparse weight factorization. COMPOT employs orthogonal dictionaries that enable closed-form Procrustes updates for the dictionary and analytical single-step sparse coding for the coefficients, eliminating iterative optimization. To handle heterogeneous layer sensitivity under a global compression budget, COMPOT further introduces a one-shot dynamic allocation strategy that adaptively redistributes layer-wise compression rates. Extensive experiments across diverse architectures and tasks show that COMPOT consistently delivers a superior quality-compression trade-off over strong low-rank and sparse baselines, while remaining fully compatible with post-training quantization for extreme compression. Code is available at https://github.com/mts-ai/COMPOT.
中文标题/摘要
标题:COMPOT:校准优化矩阵普克鲁特正交化方法用于变压器压缩
Transformer模型的后训练压缩通常依赖于截断奇异值分解(SVD)。然而,在适度压缩时强制使用单一共享子空间会降低准确性。稀疏字典学习提供了一种更灵活的子空间并集表示,但现有方法通常会遭受迭代字典和系数更新的问题。我们提出了COMPOT(校准优化矩阵普克鲁特正交化方法用于变压器),这是一种无需训练的压缩框架,使用少量校准数据集来估计稀疏权重分解。COMPOT采用正交字典,使字典的普克鲁特更新和系数的分析单步稀疏编码能够闭式求解,从而消除了迭代优化。为了在全局压缩预算下处理异构层敏感性,COMPOT进一步引入了一次性动态分配策略,以自适应地重新分配层间压缩率。在多种架构和任务上的广泛实验表明,COMPOT在与强大的低秩和稀疏基线相比时,始终能够提供更优的质量-压缩权衡,同时仍然完全兼容后训练量化以实现极端压缩。代码可在$\href{https://github.com/mts-ai/COMPOT}{这里}$获取。
Summary / 总结
The research aims to improve the accuracy of Transformer model compression by addressing the limitations of truncated SVD and sparse dictionary learning. COMPOT proposes a training-free compression framework that uses a small calibration dataset to estimate a sparse weight factorization. It employs orthogonal dictionaries and analytical single-step sparse coding, avoiding iterative optimization. Additionally, COMPOT introduces a one-shot dynamic allocation strategy to handle heterogeneous layer sensitivity under a global compression budget. Experiments demonstrate that COMPOT outperforms strong low-rank and sparse baselines in terms of the quality-compression trade-off, and is fully compatible with post-training quantization for extreme compression.
研究旨在通过解决截断SVD和稀疏字典学习的限制,提高Transformer模型压缩的准确性。COMPOT提出了一种无需训练的压缩框架,使用小的校准数据集来估计稀疏权重分解。它采用正交字典和分析的单步稀疏编码,避免了迭代优化。此外,它还引入了一种动态分配策略来处理在全局压缩预算下的层敏感性。实验结果表明,COMPOT在各种架构和任务上优于强大的低秩和稀疏基线,在质量和压缩之间的权衡上表现出色,并且与后训练量化兼容以实现极端压缩。
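The two closed-form ingredients named in the COMPOT abstract above, an orthogonal Procrustes update for the dictionary and single-step sparse coding under an orthogonal dictionary, can be illustrated with the textbook unweighted versions below. This is a sketch under assumptions, not the released COMPOT code: it ignores the calibration data and the layer-wise budget allocation, and the function names, matrix shapes, and sparsity level `k` are hypothetical.

```python
import numpy as np


def procrustes_dictionary(W, S):
    """Orthogonal D minimizing ||W - D @ S||_F (classic orthogonal Procrustes).

    W: (d, n) weight matrix, S: (d, n) sparse coefficients.
    The minimizer is U @ Vt, where U, Vt come from the SVD of W @ S.T.
    """
    U, _, Vt = np.linalg.svd(W @ S.T)
    return U @ Vt


def sparse_code(W, D, k):
    """Best column-wise k-sparse S for a fixed orthogonal D (0 <= k <= d).

    Because D is orthogonal, ||W - D @ S||_F equals ||D.T @ W - S||_F, so the
    optimum keeps the k largest-magnitude entries of each column of D.T @ W.
    """
    C = D.T @ W
    d = C.shape[0]
    # Row indices of the (d - k) smallest magnitudes in every column.
    drop = np.argsort(np.abs(C), axis=0)[: d - k, :]
    S = C.copy()
    np.put_along_axis(S, drop, 0.0, axis=0)
    return S


# Usage on random data (illustrative only).
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
D0 = np.linalg.qr(rng.standard_normal((8, 8)))[0]  # random orthogonal start
S = sparse_code(W, D0, k=3)
D1 = procrustes_dictionary(W, S)
```

For the fixed `S`, `D1` is the best orthogonal dictionary, so the reconstruction error `||W - D1 @ S||_F` cannot exceed the error obtained with `D0`.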