Cold-Start Personalization via Training-Free Priors from Structured World Models
Authors: Avinandan Bose, Shuyue Stella Li, Faeze Brahman, Pang Wei Koh, Simon Shaolei Du, Yulia Tsvetkov, Maryam Fazel, Lin Xiao, Asli Celikyilmaz
First: 2026-02-16T18:52:13+00:00 · Latest: 2026-02-16T18:52:13+00:00
Comments: 24 pages, 4 figures, 4 tables
Abstract
Cold-start personalization requires inferring user preferences through interaction when no user-specific historical data is available. The core challenge is a routing problem: each task admits dozens of preference dimensions, yet individual users care about only a few, and which ones matter depends on who is asking. With a limited question budget, asking without structure will miss the dimensions that matter. Reinforcement learning is the natural formulation, but in multi-turn settings its terminal reward fails to exploit the factored, per-criterion structure of preference data, and in practice learned policies collapse to static question sequences that ignore user responses. We propose decomposing cold-start elicitation into offline structure learning and online Bayesian inference. Pep (Preference Elicitation with Priors) learns a structured world model of preference correlations offline from complete profiles, then performs training-free Bayesian inference online to select informative questions and predict complete preference profiles, including dimensions never asked about. The framework is modular across downstream solvers and requires only simple belief models. Across medical, mathematical, social, and commonsense reasoning, Pep achieves 80.8% alignment between generated responses and users' stated preferences versus 68.5% for RL, with 3-5x fewer interactions. When two users give different answers to the same question, Pep changes its follow-up 39-62% of the time versus 0-28% for RL. It does so with ~10K parameters versus 8B for RL, showing that the bottleneck in cold-start elicitation is the capability to exploit the factored structure of preference data.
中文标题/摘要
标题:基于结构化世界模型无训练先验的冷启动个性化
冷启动个性化要求通过用户交互推断用户偏好,当没有用户特定的历史数据时。核心挑战是路由问题:每个任务包含数十个偏好维度,但每个用户只关心其中的几个,而且哪些维度重要取决于提问的对象。在有限的问题预算下,无结构提问将错过重要的维度。强化学习是自然的表述方式,但在多轮设置中,其终端奖励未能利用偏好数据的因式分解和按标准划分的结构,实践中学习到的策略会退化为静态问题序列,忽略用户反馈。我们提出将冷启动提取分解为离线结构学习和在线贝叶斯推理。Pep(偏好提取与先验)从完整档案中学习一个结构化的偏好相关世界模型,然后在线进行无训练的贝叶斯推理,选择信息性问题并预测完整的偏好档案,包括从未提问过的维度。该框架在下游求解器中是模块化的,只需要简单的信念模型。在医学、数学、社会和常识推理中,Pep 在生成的响应与用户声明的偏好之间的匹配度为 80.8%,而强化学习为 68.5%,交互次数少 3-5 倍。当两个用户对同一问题给出不同答案时,Pep 的后续问题改变比例为 39-62%,而强化学习为 0-28%。它仅使用约 10K 参数,而强化学习为 8B,表明冷启动提取的瓶颈在于利用偏好数据的因式结构的能力。
Summary / 总结
This paper addresses cold-start personalization by decomposing elicitation into offline structure learning and online Bayesian inference. The approach, named Pep, learns a structured world model of preference correlations from complete profiles and uses it for training-free Bayesian inference that selects informative questions and predicts complete preference profiles, including dimensions never asked about. Pep achieves higher alignment with users' stated preferences than reinforcement learning (80.8% vs 68.5%) with 3-5x fewer interactions. It also adapts more to user responses: when two users answer the same question differently, Pep changes its follow-up question 39-62% of the time versus 0-28% for reinforcement learning.
论文通过提出Pep解决冷启动个性化问题,将过程分解为离线结构学习和在线贝叶斯推理。Pep从完整资料中学习结构化的偏好模型,并使用该模型选择有信息量的问题和预测偏好。实验结果显示,Pep在与用户偏好匹配度和所需交互次数上均优于强化学习方法。
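The "offline structure, online Bayesian inference" split described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the paper's belief model: users are assumed to come from a small number of latent types, `theta[k, d]` is a hypothetical offline-learned probability that type `k` cares about preference dimension `d`, and the next question is chosen by expected information gain about the latent type.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def posterior(prior, theta, answers):
    """Posterior over latent user types given yes/no answers.

    prior:   (K,) prior over hypothetical user types
    theta:   (K, D) P(answer = 1 | type k, dimension d), assumed learned offline
    answers: dict mapping asked dimension -> 0 or 1
    """
    log_post = np.log(prior)
    for d, a in answers.items():
        like = theta[:, d] if a == 1 else 1.0 - theta[:, d]
        log_post = log_post + np.log(np.clip(like, 1e-12, 1.0))
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

def next_question(prior, theta, answers):
    """Pick the unasked dimension with the largest expected information gain."""
    post = posterior(prior, theta, answers)
    h_now = entropy(post)
    best_dim, best_gain = None, -1.0
    for d in range(theta.shape[1]):
        if d in answers:
            continue
        p_yes = float(post @ theta[:, d])
        gain = h_now
        for a, p_a in ((1, p_yes), (0, 1.0 - p_yes)):
            if p_a > 0:
                gain -= p_a * entropy(posterior(prior, theta, {**answers, d: a}))
        if gain > best_gain:
            best_dim, best_gain = d, gain
    return best_dim

def predict_profile(prior, theta, answers):
    """Predicted preference probability for every dimension, asked or not."""
    return posterior(prior, theta, answers) @ theta
```

No gradient step runs online: each answer only reweights the posterior, which is what lets the follow-up question depend on what the user said.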
BPP: Long-Context Robot Imitation Learning by Focusing on Key History Frames
Authors: Max Sobol Mark, Jacky Liang, Maria Attarian, Chuyuan Fu, Debidatta Dwibedi, Dhruv Shah, Aviral Kumar
First: 2026-02-16T18:49:56+00:00 · Latest: 2026-02-16T18:49:56+00:00
Abstract
Many robot tasks require attending to the history of past observations. For example, finding an item in a room requires remembering which places have already been searched. However, the best-performing robot policies typically condition only on the current observation, limiting their applicability to such tasks. Naively conditioning on past observations often fails due to spurious correlations: policies latch onto incidental features of training histories that do not generalize to out-of-distribution trajectories upon deployment. We analyze why policies latch onto these spurious correlations and find that this problem stems from limited coverage over the space of possible histories during training, which grows exponentially with horizon. Existing regularization techniques provide inconsistent benefits across tasks, as they do not fundamentally address this coverage problem. Motivated by these findings, we propose Big Picture Policies (BPP), an approach that conditions on a minimal set of meaningful keyframes detected by a vision-language model. By projecting diverse rollouts onto a compact set of task-relevant events, BPP substantially reduces distribution shift between training and deployment, without sacrificing expressivity. We evaluate BPP on four challenging real-world manipulation tasks and three simulation tasks, all requiring history conditioning. BPP achieves 70% higher success rates than the best comparison on real-world evaluations.
中文标题/摘要
标题:BPP:通过关注关键历史帧进行长上下文机器人模仿学习
许多机器人任务需要关注过去的观察历史。例如,在房间中寻找物品需要记住已经搜索过的地方。然而,表现最好的机器人策略通常仅以当前观察为条件,限制了它们在这类任务中的适用性。简单地以过去的观察为条件往往由于虚假相关性而失败:策略会抓住训练历史中的偶然特征,而这些特征无法泛化到部署时的分布外轨迹。我们分析了策略为何会抓住这些虚假相关性,并发现这个问题源于训练过程中对可能历史空间的覆盖有限,而该空间随时间范围呈指数增长。现有的正则化技术在不同任务上带来的收益并不一致,因为它们没有从根本上解决这一覆盖问题。受这些发现的启发,我们提出了 Big Picture Policies(BPP),该方法以视觉-语言模型检测出的一组最小的、有意义的关键帧为条件。通过将多样化的轨迹投影到一组紧凑的任务相关事件上,BPP显著减少了训练和部署之间的分布偏移,而不会牺牲表达能力。我们在四个具有挑战性的现实世界操作任务和三个模拟任务上评估了BPP,所有任务都需要历史条件化。BPP在现实世界评估中的成功率比最佳对比方法高出70%。
Summary / 总结
The paper addresses robot imitation learning for tasks that require remembering past observations, where naively conditioning on history fails because limited coverage of possible histories during training leads policies to latch onto spurious correlations. It proposes Big Picture Policies (BPP), which condition on a minimal set of meaningful keyframes detected by a vision-language model. This reduces distribution shift between training and deployment, and BPP achieves 70% higher success rates than the best comparison method on real-world evaluations.
研究旨在通过改进机器人模仿学习,解决当前仅依赖当前观察的策略在需要长期记忆的任务中表现不佳的问题。提出的Big Picture Policies (BPP) 方法关注由视觉-语言模型检测到的关键历史帧,减少了部署时的数据分布差异,并在真实世界的操作任务中比现有方法的成功率高出70%。
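The idea of projecting diverse rollouts onto a compact set of task-relevant events can be sketched in a few lines. This is an illustrative toy, not the authors' pipeline: `is_key_event` is a hypothetical stand-in for the vision-language model's keyframe detector, and observations are plain strings rather than images.

```python
def keyframe_context(history, is_key_event, max_keyframes=4):
    """Build the policy input from a rollout.

    history:       list of observations, oldest first; the last is the current one
    is_key_event:  hypothetical callable standing in for a VLM keyframe detector
    max_keyframes: assumed cap on how many past keyframes are kept
    """
    keyframes = [obs for obs in history[:-1] if is_key_event(obs)]
    # Keep only the most recent keyframes, then append the current observation.
    return keyframes[-max_keyframes:] + [history[-1]]


# Two rollouts that differ only in incidental frames.
rollout_a = ["idle", "opened drawer 1", "idle", "idle", "opened drawer 2", "look"]
rollout_b = ["opened drawer 1", "wander", "opened drawer 2", "idle", "look"]
detector = lambda obs: obs.startswith("opened")

context_a = keyframe_context(rollout_a, detector)
context_b = keyframe_context(rollout_b, detector)
```

Both rollouts map to the same context, so the policy never sees the incidental differences that would otherwise count as distinct histories.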
ThermEval: A Structured Benchmark for Evaluation of Vision-Language Models on Thermal Imagery
Authors: Ayush Shrivastava, Kirtan Gangani, Laksh Jain, Mayank Goel, Nipun Batra
First: 2026-02-16T18:16:19+00:00 · Latest: 2026-02-16T18:16:19+00:00
Comments: 8 Pages with 2 figures of main content. 2 pages of References. 10 pages of appendix with 6 figures
Abstract
Vision language models (VLMs) achieve strong performance on RGB imagery, but they do not generalize to thermal images. Thermal sensing plays a critical role in settings where visible light fails, including nighttime surveillance, search and rescue, autonomous driving, and medical screening. Unlike RGB imagery, thermal images encode physical temperature rather than color or texture, requiring perceptual and reasoning capabilities that existing RGB-centric benchmarks do not evaluate. We introduce ThermEval-B, a structured benchmark of approximately 55,000 thermal visual question answering pairs designed to assess the foundational primitives required for thermal vision language understanding. ThermEval-B integrates public datasets with our newly collected ThermEval-D, the first dataset to provide dense per-pixel temperature maps with semantic body-part annotations across diverse indoor and outdoor environments. Evaluating 25 open-source and closed-source VLMs, we find that models consistently fail at temperature-grounded reasoning, degrade under colormap transformations, and default to language priors or fixed responses, with only marginal gains from prompting or supervised fine-tuning. These results demonstrate that thermal understanding requires dedicated evaluation beyond RGB-centric assumptions, positioning ThermEval as a benchmark to drive progress in thermal vision language modeling.
中文标题/摘要
标题:ThermEval:热成像视觉语言模型评估基准
视觉语言模型(VLMs)在RGB图像上表现出色,但在热图像上却无法泛化。热成像在可见光失效的环境中至关重要,包括夜间监视、搜索与救援、自动驾驶和医学筛查。与RGB图像不同,热图像编码的是物理温度而非颜色或纹理,这需要感知和推理能力,而现有的以RGB为中心的基准测试并未对其进行评估。我们引入了ThermEval-B,这是一个包含约55,000个热视觉问答对的结构化基准,旨在评估热视觉语言理解所需的底层能力。ThermEval-B将公共数据集与我们新收集的ThermEval-D集成,后者是首个提供密集的逐像素温度图和语义身体部分注释的数据集,适用于多种室内外环境。评估25个开源和闭源的VLMs,我们发现模型在温度相关推理方面表现一致不佳,在颜色映射变换下性能下降,并依赖语言先验或固定响应,仅通过提示或监督微调获得微小改进。这些结果表明,热理解需要超越RGB中心假设的专门评估,将ThermEval定位为推动热视觉语言建模进展的基准。
Summary / 总结
ThermEval is a structured benchmark for evaluating vision-language models on thermal imagery, motivated by their failure to generalize from RGB to thermal images. Its benchmark component, ThermEval-B, contains approximately 55,000 thermal visual question answering pairs and combines public datasets with the newly collected ThermEval-D, which provides dense per-pixel temperature maps with semantic body-part annotations. Evaluating 25 open-source and closed-source models, the study finds that they consistently fail at temperature-grounded reasoning, degrade under colormap transformations, and default to language priors or fixed responses, with only marginal gains from prompting or supervised fine-tuning.
研究动机是评估视觉语言模型(VLMs)在热成像上的泛化能力,这对于夜间监视和医疗筛查等应用至关重要。主要方法是创建包含55,000个热视觉问答对的基准ThermEval-B,将公共数据集与新的ThermEval-D数据集结合,后者提供了密集的像素级温度图和语义人体部位标注。关键发现表明,VLMs在温度相关的推理上表现不佳,在颜色映射变换下性能下降,并依赖于语言先验或固定响应,这表明需要超越RGB中心假设的专门评估基准。
Foundation Models in Autonomous Driving: A Survey on Scenario Generation and Scenario Analysis
Authors: Yuan Gao, Mattia Piccinini, Yuchen Zhang, Dingrui Wang, Korbinian Moller, Roberto Brusnicki, Baha Zarrouki, Alessio Gambi, Jan Frederik Totz, Kai Storms, Steven Peters, Andrea Stocco, Bassam Alrifaee, Marco Pavone, Johannes Betz
Venue: IEEE Open Journal of Intelligent Transportation Systems, 03 February 2026
First: 2025-06-13T07:25:59+00:00 · Latest: 2026-02-16T16:49:09+00:00
Comments: IEEE Open Journal of Intelligent Transportation Systems
Abstract
For autonomous vehicles, safe navigation in complex environments depends on handling a broad range of diverse and rare driving scenarios. Simulation- and scenario-based testing have emerged as key approaches to development and validation of autonomous driving systems. Traditional scenario generation relies on rule-based systems, knowledge-driven models, and data-driven synthesis, often producing limited diversity and unrealistic safety-critical cases. With the emergence of foundation models, which represent a new generation of pre-trained, general-purpose AI models, developers can process heterogeneous inputs (e.g., natural language, sensor data, HD maps, and control actions), enabling the synthesis and interpretation of complex driving scenarios. In this paper, we conduct a survey about the application of foundation models for scenario generation and scenario analysis in autonomous driving (as of May 2025). Our survey presents a unified taxonomy that includes large language models, vision-language models, multimodal large language models, diffusion models, and world models for the generation and analysis of autonomous driving scenarios. In addition, we review the methodologies, open-source datasets, simulation platforms, and benchmark challenges, and we examine the evaluation metrics tailored explicitly to scenario generation and analysis. Finally, the survey concludes by highlighting the open challenges and research questions, and outlining promising future research directions. All reviewed papers are listed in a continuously maintained repository, which contains supplementary materials and is available at https://github.com/TUM-AVS/FM-for-Scenario-Generation-Analysis.
中文标题/摘要
标题:自动驾驶中的基础模型:场景生成与分析综述
对于自动驾驶车辆而言,安全导航依赖于处理各种多样且罕见的驾驶场景。仿真和基于场景的测试已成为开发和验证自动驾驶系统的关键方法。传统场景生成依赖于基于规则的系统、知识驱动模型和数据驱动合成,通常产生有限的多样性和不切实际的安全关键案例。随着基础模型的出现,这些代表了新一代预训练的通用人工智能模型,开发者可以处理异构输入(例如自然语言、传感器数据、高清地图和控制动作),从而生成和解释复杂的驾驶场景。在本文中,我们对截至2025年5月基础模型在自动驾驶场景生成与分析中的应用进行了综述。综述中提出了一种统一的分类体系,包括大型语言模型、视觉语言模型、多模态大型语言模型、扩散模型和世界模型,用于生成和分析自动驾驶场景。此外,我们还回顾了方法论、开源数据集、仿真平台和基准挑战,并审查了针对场景生成与分析的评估指标。最后,综述总结了存在的开放挑战和研究问题,并概述了有前景的未来研究方向。所有审查的论文都列在一个持续维护的仓库中,该仓库包含补充材料,并可在https://github.com/TUM-AVS/FM-for-Scenario-Generation-Analysis/获取。
Summary / 总结
This paper surveys the application of foundation models to scenario generation and scenario analysis in autonomous driving (as of May 2025). It highlights the limited diversity and unrealistic safety-critical cases produced by traditional methods, and presents a unified taxonomy covering large language models, vision-language models, multimodal large language models, diffusion models, and world models. The survey also reviews methodologies, open-source datasets, simulation platforms, benchmark challenges, and evaluation metrics tailored to scenario generation and analysis, and outlines open challenges and future research directions.
本文对基础模型在自动驾驶中的应用进行了综述,特别是在场景生成和分析方面。它通过利用能够处理多种输入(如自然语言、传感器数据和高清地图)的基础模型来解决传统方法的局限性。关键发现包括使用大型语言模型、视觉-语言模型和扩散模型来生成和分析复杂的驾驶场景,并确定了该领域中的开放挑战和未来研究方向。综述还回顾了针对场景生成和分析的方法、数据集和评估指标。
Efficient Test-Time Scaling for Small Vision-Language Models
Authors: Mehmet Onurcan Kaya, Desmond Elliott, Dim P. Papadopoulos
Venue: ICLR 2026
First: 2025-10-03T23:49:06+00:00 · Latest: 2026-02-16T15:56:06+00:00
Comments: Accepted at ICLR 2026. Project Page: https://monurcan.github.io/efficient_test_time_scaling
Abstract
Small Vision-Language Models (VLMs) provide a computationally efficient alternative to larger models, at the cost of weaker generalization abilities and downstream task performance. These shortcomings could be addressed by test-time scaling techniques, but existing methods are typically computationally demanding, contradicting the resource-efficient design goals of small models. To address these limitations, we propose two novel and efficient test-time scaling strategies that leverage the model-internal features rather than external supervision: (i) Test-Time Augmentation (TTAug), which generates multiple augmented inputs and aggregates outputs at the token level without parameter updates, and (ii) Test-Time Adaptation (TTAdapt), which adapts model parameters during inference using consensus-based pseudolabels from TTAug. Through extensive experiments across nine benchmarks, we demonstrate consistent performance improvements while maintaining computational efficiency suitable for resource-constrained environments. The generality of our approach is demonstrated both within models at different scales and across different VLMs without additional tuning.
中文标题/摘要
标题:小视觉语言模型测试时高效缩放方法
小视觉语言模型(VLMs)提供了一种在计算上更高效的替代方案,但代价是泛化能力和下游任务性能较弱。这些不足可以通过测试时缩放技术来解决,但现有方法通常计算成本较高,这与小模型资源高效的设计目标相矛盾。为了解决这些限制,我们提出了两种新颖且高效的测试时缩放策略,这些策略利用模型内部特征而非外部监督:(i) 测试时增强(TTAug),它生成多个增强输入并在标记级别聚合输出而不更新参数,(ii) 测试时适应(TTAdapt),它在推理过程中通过TTAug提供的基于共识的伪标签来调整模型参数。通过在九个基准上的广泛实验,我们展示了在保持计算效率的同时的一致性能改进,这种计算效率适合资源受限的环境。我们的方法的通用性在不同规模的模型内部以及不同VLMs之间得到了验证,无需额外调整。
Summary / 总结
This paper addresses the weaker generalization and downstream performance of small Vision-Language Models (VLMs) by proposing two efficient test-time scaling strategies, TTAug and TTAdapt, that rely on model-internal features rather than external supervision. TTAug generates multiple augmented inputs and aggregates outputs at the token level without parameter updates, while TTAdapt adapts model parameters during inference using consensus-based pseudolabels from TTAug. Experiments across nine benchmarks show consistent performance improvements while maintaining computational efficiency suitable for resource-constrained environments.
本文提出两种高效的测试时缩放策略TTAug和TTAdapt,以增强小视觉-语言模型(VLM)的性能而不增加计算需求。TTAug生成多个增强输入并聚合输出,而TTAdapt在推理过程中使用TTAug生成的伪标签来调整模型参数。实验表明,在九个基准上的一致性能提升,同时保持资源受限环境下的计算效率。
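Token-level aggregation over augmented inputs can be sketched as follows. The aggregation rule is an assumption: the abstract does not say how outputs are combined, so this sketch simply averages next-token probabilities across views and decodes greedily. `step_fn` is a hypothetical callable standing in for one forward pass of a VLM.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def aggregate_next_token(logits_per_view):
    """Average next-token probabilities over augmented views (assumed rule).

    logits_per_view: array-like of shape (n_views, vocab_size)
    Returns the chosen token id and the averaged distribution.
    """
    probs = softmax(np.asarray(logits_per_view, dtype=float)).mean(axis=0)
    return int(probs.argmax()), probs

def greedy_decode(step_fn, views, max_len, eos_id):
    """Decode one shared sequence, consulting every augmented view at each step.

    step_fn(view, tokens) -> next-token logits; hypothetical model interface.
    """
    tokens = []
    for _ in range(max_len):
        logits = [step_fn(view, tokens) for view in views]
        token, _ = aggregate_next_token(logits)
        tokens.append(token)
        if token == eos_id:
            break
    return tokens
```

No parameters change here; the per-step consensus token is also the kind of signal that a pseudolabel-based adaptation step could train on.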
Are foundation models for computer vision good conformal predictors?
Authors: Leo Fillioux, Julio Silva-Rodríguez, Ismail Ben Ayed, Paul-Henry Cournède, Maria Vakalopoulou, Stergios Christodoulidis, Jose Dolz
First: 2024-12-08T22:05:38+00:00 · Latest: 2026-02-16T15:31:47+00:00
Abstract
Recent advances in self-supervision and contrastive learning have brought the performance of foundation models to unprecedented levels in a variety of tasks. Fueled by this progress, these models are becoming the prevailing approach for a wide array of real-world vision problems, including risk-sensitive and high-stakes applications. However, ensuring safe deployment in these scenarios requires a more comprehensive understanding of their uncertainty modeling capabilities, which has received little attention. In this work, we delve into the behaviour of vision and vision-language foundation models under Conformal Prediction (CP), a statistical framework that provides theoretical guarantees of marginal coverage of the true class. Across extensive experiments including popular vision classification benchmarks, well-known foundation vision models, and three CP methods, our findings reveal that foundation models are well-suited for conformalization procedures, particularly those integrating Vision Transformers. We also show that calibrating the confidence predictions of these models, a popular strategy to improve their uncertainty quantification, actually leads to efficiency degradation of the conformal set on adaptive CP methods. Furthermore, few-shot adaptation of Vision-Language Models (VLMs) to downstream tasks, whose popularity is surging, enhances conformal scores compared to zero-shot predictions. Last, our empirical study exposes APS as particularly promising in the context of vision foundation models, as it does not violate the marginal coverage guarantees across multiple challenging, yet realistic scenarios.
中文标题/摘要
标题:计算机视觉基础模型是良好的共形预测器吗?
自监督和对比学习的最新进展将基础模型在各种任务中的性能提升到了前所未有的水平。受此推动,这些模型正成为各类现实视觉问题的主流方法,包括风险敏感和高风险应用。然而,要在这些场景中确保安全部署,需要更全面地理解其不确定性建模能力,而这方面此前很少受到关注。在本文中,我们研究了视觉和视觉语言基础模型在共形预测(CP)下的行为,CP是一种为真实类别的边际覆盖提供理论保证的统计框架。通过涵盖流行的视觉分类基准、知名的视觉基础模型和三种CP方法的大量实验,我们发现基础模型非常适合共形化流程,尤其是采用视觉Transformer的模型。我们还表明,校准这些模型的置信度预测(一种改进不确定性量化的常用策略)实际上会降低自适应CP方法中共形集的效率。此外,日益流行的视觉语言模型(VLMs)对下游任务的少样本适应,与零样本预测相比改善了共形分数。最后,我们的实证研究表明,APS在视觉基础模型中特别有前景,因为它在多个具有挑战性但现实的场景中都不违反边际覆盖保证。
Summary / 总结
This study investigates how well vision and vision-language foundation models behave under Conformal Prediction (CP), a statistical framework that provides theoretical guarantees of marginal coverage of the true class. Across popular classification benchmarks, well-known foundation models, and three CP methods, it finds that foundation models, especially those using Vision Transformers, are well-suited for conformalization. Calibrating confidence predictions, a popular strategy for improving uncertainty quantification, reduces the efficiency of conformal sets for adaptive CP methods. Additionally, few-shot adaptation of Vision-Language Models enhances conformal scores compared to zero-shot predictions, and Adaptive Prediction Sets (APS) is particularly promising because it does not violate the marginal coverage guarantees across multiple challenging, yet realistic scenarios.
研究探讨了计算机视觉基础模型在共形预测(CP)下的表现,CP是一种为真实类别的边际覆盖提供理论保证的统计框架。研究发现,视觉和视觉语言基础模型,尤其是采用Vision Transformers的模型,适合进行共形化处理。校准置信度预测虽是改进不确定性量化的常用策略,但在自适应CP方法中会降低共形集的效率。此外,对下游任务的少样本适应与零样本预测相比能改善共形分数,而APS(Adaptive Prediction Sets,自适应预测集)方法在多种具有挑战性但现实的场景中不违反边际覆盖保证,因而特别有前景。
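For readers unfamiliar with the marginal coverage guarantee, split conformal prediction with APS-style scores can be sketched as below. This is the deterministic (non-randomized) textbook variant, written as an assumption about what a minimal implementation looks like; it is not the paper's experimental code, and the function names are illustrative.

```python
import math
import numpy as np

def aps_scores(probs, labels):
    """APS conformity score: probability mass of all classes ranked at or
    above the true class, for each calibration sample.

    probs:  (n, C) softmax outputs on a held-out calibration set
    labels: (n,) integer true classes
    """
    order = np.argsort(-probs, axis=1, kind="stable")
    cum = np.cumsum(np.take_along_axis(probs, order, axis=1), axis=1)
    rank = np.argmax(order == labels[:, None], axis=1)
    return cum[np.arange(len(labels)), rank]

def conformal_threshold(scores, alpha):
    """The ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: include every class
    return float(np.sort(scores)[k - 1])

def prediction_set(p, qhat, tol=1e-12):
    """All classes whose APS score is at most qhat.

    In this deterministic variant the set can be empty when the top class
    alone already exceeds the threshold.
    """
    order = np.argsort(-p, kind="stable")
    cum = np.cumsum(p[order])
    return [int(c) for c, s in zip(order, cum) if s <= qhat + tol]
```

The true class lands in the set exactly when its score is at most the threshold, which is where the 1 - alpha marginal coverage comes from under exchangeability. Smaller sets at the same coverage are what "efficiency" refers to in the summary.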
SAILS: Segment Anything with Incrementally Learned Semantics for Task-Invariant and Training-Free Continual Learning
Authors: Shishir Muralidhara, Didier Stricker, René Schuster
First: 2026-02-16T14:14:02+00:00 · Latest: 2026-02-16T14:14:02+00:00
Comments: Accepted at IEEE CAI 2026
Abstract
Continual learning remains constrained by the need for repeated retraining, high computational costs, and the persistent challenge of forgetting. These factors significantly limit the applicability of continual learning in real-world settings, as iterative model updates require significant computational resources and inherently exacerbate forgetting. We present SAILS -- Segment Anything with Incrementally Learned Semantics, a training-free framework for Class-Incremental Semantic Segmentation (CISS) that sidesteps these challenges entirely. SAILS leverages foundational models to decouple CISS into two stages: zero-shot region extraction using the Segment Anything Model (SAM), followed by semantic association through prototypes in a fixed feature space. SAILS incorporates selective intra-class clustering, resulting in multiple prototypes per class to better model intra-class variability. Our results demonstrate that, despite requiring no incremental training, SAILS typically surpasses the performance of existing training-based approaches on standard CISS datasets, particularly in long and challenging task sequences where forgetting tends to be most severe. By avoiding parameter updates, SAILS completely eliminates forgetting and maintains consistent, task-invariant performance. Furthermore, SAILS exhibits positive backward transfer, where the introduction of new classes can enhance performance on previous classes.
中文标题/摘要
标题:SAILS:使用逐步学习语义进行任务不变和无需训练的持续学习
持续学习仍然受到重复重新训练、高计算成本以及长期存在的遗忘问题的限制。这些因素显著限制了持续学习在实际场景中的应用,因为迭代式模型更新需要大量计算资源,并且会从本质上加剧遗忘。我们提出了SAILS(Segment Anything with Incrementally Learned Semantics),一个用于类增量语义分割(CISS)的无需训练框架,完全规避了这些挑战。SAILS 利用基础模型将 CISS 分解为两个阶段:使用 Segment Anything Model (SAM) 进行零样本区域提取,然后通过固定特征空间中的原型进行语义关联。SAILS 引入了选择性的类内聚类,为每个类别生成多个原型,以更好地建模类内变异性。我们的结果表明,尽管不需要增量训练,SAILS 在标准 CISS 数据集上的性能通常超过现有的基于训练的方法,特别是在遗忘往往最为严重的长且具有挑战性的任务序列中。通过避免参数更新,SAILS 完全消除了遗忘,并保持一致的、任务不变的性能。此外,SAILS 还表现出正向的后向迁移,即引入新类别可以提升先前类别的性能。
Summary / 总结
SAILS is a training-free framework for Class-Incremental Semantic Segmentation that uses the Segment Anything Model for zero-shot region extraction and selective intra-class clustering to improve semantic association. It avoids retraining and parameter updates, thus preventing forgetting and maintaining consistent performance across tasks. SAILS outperforms existing training-based approaches, especially in long and challenging task sequences.
SAILS 是一个无需训练的类增量语义分割框架,使用 Segment Anything Model 进行零样本区域提取,并通过固定原型进行语义关联。它避免了重新训练,显著减少了遗忘现象,性能优于现有的基于训练的方法,尤其是在长期任务序列中表现更佳。SAILS 维持一致的性能,并展示了正向迁移,即新类别的引入可以提升先前类别的性能。
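The second stage, semantic association through multiple prototypes per class in a fixed feature space, can be sketched as below. This rests on stated assumptions: region features are taken as given (in the paper they would come from frozen foundation models), plain k-means stands in for the paper's selective intra-class clustering, and `PrototypeBank` is a hypothetical name.

```python
import numpy as np

def l2norm(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def kmeans(x, k, iters=10):
    """Tiny k-means with deterministic init (first k points); a stand-in only."""
    centers = x[:k].copy()
    for _ in range(iters):
        dists = ((x[:, None] - centers[None]) ** 2).sum(-1)
        assign = np.argmin(dists, axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = x[assign == j].mean(axis=0)
    return centers

class PrototypeBank:
    """Grows by appending prototypes; earlier prototypes are never modified."""

    def __init__(self):
        self.protos = []  # one (k_i, dim) array per added class
        self.labels = []  # class label for each prototype row, in order

    def add_class(self, label, feats, k=2):
        feats = l2norm(np.asarray(feats, dtype=float))
        k = min(k, len(feats))
        self.protos.append(l2norm(kmeans(feats, k)))
        self.labels += [label] * k

    def classify(self, feat):
        """Assign a region feature to the class of its most similar prototype."""
        sims = np.vstack(self.protos) @ l2norm(np.asarray(feat, dtype=float))
        return self.labels[int(np.argmax(sims))]
```

Adding a class only appends rows, so stored prototypes of earlier classes stay bit-identical. That is the mechanical reason a scheme of this shape involves no parameter updates.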
MATEO: A Multimodal Benchmark for Temporal Reasoning and Planning in LVLMs
Authors: Gabriel Roccabruna, Olha Khomyn, Giuseppe Riccardi
First: 2026-02-16T09:41:50+00:00 · Latest: 2026-02-16T09:41:50+00:00
Abstract
AI agents need to plan to achieve complex goals that involve orchestrating perception, sub-goal decomposition, and execution. These plans consist of ordered steps structured according to a Temporal Execution Order (TEO), a directed acyclic graph that ensures each step executes only after its preconditions are satisfied. Existing research on foundational models' understanding of temporal execution is limited to automatically derived annotations, approximations of the TEO as a linear chain, or text-only inputs. To address this gap, we introduce MATEO (MultimodAl Temporal Execution Order), a benchmark designed to assess and improve the temporal reasoning abilities of Large Vision Language Models (LVLMs) required for real-world planning. We acquire a high-quality professional multimodal recipe corpus, authored through a standardized editorial process that decomposes instructions into discrete steps, each paired with corresponding images. We collect TEO annotations as graphs by designing and using a scalable crowdsourcing pipeline. Using MATEO, we evaluate six state-of-the-art LVLMs across model scales, varying language context, multimodal input structure, and fine-tuning strategies.
中文标题/摘要
标题:MATEO:一种用于LVLMs的时间推理和规划多模态基准
AI代理需要制定计划以实现复杂的任务,这些任务涉及协调感知、子目标分解和执行。这些计划由按时间执行顺序(TEO,一个有向无环图,确保每个步骤仅在其先决条件满足后执行)结构化的有序步骤组成。现有研究中,对基础模型对时间执行的理解仅限于自动提取的注释、TEO的线性近似或仅文本输入。为解决这一问题,我们引入了MATEO(多模态时间执行顺序),一个旨在评估和提高大型视觉语言模型(LVLMs)时间推理能力的基准,这些模型对于实际规划是必要的。我们获得了一个高质量的专业多模态食谱语料库,通过标准化编辑流程分解指令为离散步骤,并为每个步骤配对相应的图像。我们通过设计并使用可扩展的众包流程收集TEO注释作为图形。使用MATEO,我们评估了六种最先进的LVLMs,涉及模型规模、语言上下文、多模态输入结构和微调策略的变化。
Summary / 总结
The research aims to assess and improve the temporal reasoning abilities that Large Vision Language Models (LVLMs) need for real-world planning by introducing MATEO, a multimodal benchmark. MATEO builds on a high-quality professional recipe corpus in which each step is paired with images, and annotates each recipe with a Temporal Execution Order (TEO) graph collected through a scalable crowdsourcing pipeline. Six state-of-the-art LVLMs are evaluated across model scales, language context, multimodal input structure, and fine-tuning strategies.
研究旨在通过引入MATEO多模态基准来评估并提升大型视觉语言模型(LVLM)在实际规划中所需的时序推理能力。MATEO基于高质量的专业食谱语料库,其中每个步骤都配有相应图像,并通过可扩展的众包流程为每个食谱标注时序执行顺序(TEO)图。研究在不同模型规模、语言上下文、多模态输入结构和微调策略下评估了六个最先进的LVLM。
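A TEO is a directed acyclic graph, which is why a linear chain under-specifies it. The sketch below uses Python's standard `graphlib` module (Python 3.9+). The recipe steps and function names are invented for illustration and do not come from the MATEO corpus.

```python
from graphlib import CycleError, TopologicalSorter

def is_valid_teo(preconditions):
    """True if the precondition graph is acyclic.

    preconditions: dict mapping each step to the set of steps that must
    finish before it can start.
    """
    try:
        tuple(TopologicalSorter(preconditions).static_order())
        return True
    except CycleError:
        return False

def respects_teo(order, preconditions):
    """True if every step in `order` comes after all of its preconditions."""
    pos = {step: i for i, step in enumerate(order)}
    return all(
        pos[pre] < pos[step]
        for step, pres in preconditions.items()
        for pre in pres
    )

# Hypothetical recipe: chopping and boiling are independent of each other.
teo = {
    "chop": set(),
    "boil_water": set(),
    "cook_pasta": {"boil_water"},
    "mix": {"chop", "cook_pasta"},
}
order_a = ["chop", "boil_water", "cook_pasta", "mix"]
order_b = ["boil_water", "chop", "cook_pasta", "mix"]
order_bad = ["cook_pasta", "boil_water", "chop", "mix"]
```

Both `order_a` and `order_b` satisfy the graph. A linear-chain annotation would accept only one of them, so a model that correctly swaps the independent steps would be penalized.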
Top-Down Semantic Refinement for Image Captioning
Authors: Jusheng Zhang, Kaitong Cai, Jing Yang, Jian Wang, Chengpei Tang, Keze Wang
First: 2025-10-25T18:27:00+00:00 · Latest: 2026-02-16T09:20:59+00:00
Abstract
Large Vision-Language Models (VLMs) face an inherent contradiction in image captioning: their powerful single-step generation capabilities often lead to a myopic decision-making process. This makes it difficult to maintain global narrative coherence while capturing rich details, a limitation that is particularly pronounced in tasks that require multi-step and complex scene description. To overcome this fundamental challenge, we redefine image captioning as a goal-oriented hierarchical refinement planning problem, and further propose a novel framework, named Top-Down Semantic Refinement (TDSR), which models the generation process as a Markov Decision Process (MDP). However, planning within the vast state space of a VLM presents a significant computational hurdle. Our core contribution, therefore, is the design of a highly efficient Monte Carlo Tree Search (MCTS) algorithm tailored for VLMs. By incorporating a visual-guided parallel expansion and a lightweight value network, our TDSR reduces the call frequency to the expensive VLM by an order of magnitude without sacrificing planning quality. Furthermore, an adaptive early stopping mechanism dynamically matches computational overhead to the image's complexity. Extensive experiments on multiple benchmarks, including DetailCaps, COMPOSITIONCAP, and POPE, demonstrate that our TDSR, as a plug-and-play module, can significantly enhance the performance of existing VLMs (e.g., LLaVA-1.5, Qwen2.5-VL) by achieving state-of-the-art or highly competitive results in fine-grained description, compositional generalization, and hallucination suppression.
中文标题/摘要
标题:自上而下的语义精炼在图像字幕中的应用
大型视觉-语言模型(VLMs)在图像字幕中面临一个固有的矛盾:它们强大的单步生成能力往往导致短视的决策过程,这使得在保持全局叙述连贯性的同时捕捉丰富细节变得困难,特别是在需要多步和复杂场景描述的任务中,这一限制尤为明显。为克服这一根本挑战,我们将图像字幕重新定义为具有目标导向的分层精炼规划问题,并进一步提出了一种名为自上而下的语义精炼(TDSR)的新框架,该框架将生成过程建模为马尔可夫决策过程(MDP)。然而,在VLM庞大的状态空间中进行规划会带来显著的计算障碍。因此,我们的核心贡献是为VLM设计了一种高效的蒙特卡洛树搜索(MCTS)算法。通过引入视觉引导的并行扩展和轻量级价值网络,我们的TDSR将对昂贵的VLM的调用频率降低了十倍,同时不牺牲规划质量。此外,自适应的早期停止机制动态匹配计算开销与图像的复杂性。在DetailCaps、COMPOSITIONCAP和POPE等多个基准上的广泛实验表明,作为即插即用模块,我们的TDSR可以显著增强现有VLM(例如,LLaVA-1.5、Qwen2.5-VL)的表现,实现细粒度描述、组合泛化和幻觉抑制的最新或高度竞争结果。
Summary / 总结
The paper addresses the challenge of maintaining global narrative coherence while capturing rich details in image captioning using large Vision-Language Models (VLMs). It proposes a Top-Down Semantic Refinement (TDSR) framework that models the captioning process as a Markov Decision Process (MDP) and uses a highly efficient Monte Carlo Tree Search (MCTS) algorithm to reduce the computational burden. Experiments show that TDSR enhances the performance of existing VLMs, achieving state-of-the-art or highly competitive results in various benchmarks.
论文旨在解决大型视觉-语言模型在图像描述中保持全局叙事连贯性的同时捕捉丰富细节的局限性。它提出了一种名为Top-Down Semantic Refinement (TDSR)的新框架,将生成过程建模为马尔可夫决策过程(MDP),并使用高效的蒙特卡洛树搜索(MCTS)算法来减轻计算负担。TDSR结合了视觉引导的并行扩展和轻量级价值网络,以增强现有视觉-语言模型的性能,在多个基准测试(如DetailCaps、COMPOSITIONCAP和POPE)上取得了最先进的或具有竞争力的结果。
OmniVTON++: Training-Free Universal Virtual Try-On with Principal Pose Guidance
Authors: Zhaotong Yang, Yong Du, Shengfeng He, Yuhui Li, Xinzhe Li, Yangyang Xu, Junyu Dong, Jian Yang
First: 2026-02-16T08:27:43+00:00 · Latest: 2026-02-16T08:27:43+00:00
Abstract
Image-based Virtual Try-On (VTON) concerns the synthesis of realistic person imagery through garment re-rendering under human pose and body constraints. In practice, however, existing approaches are typically optimized for specific data conditions, making their deployment reliant on retraining and limiting their generalization as a unified solution. We present OmniVTON++, a training-free VTON framework designed for universal applicability. It addresses the intertwined challenges of garment alignment, human structural coherence, and boundary continuity by coordinating Structured Garment Morphing for correspondence-driven garment adaptation, Principal Pose Guidance for step-wise structural regulation during diffusion sampling, and Continuous Boundary Stitching for boundary-aware refinement, forming a cohesive pipeline without task-specific retraining. Experimental results demonstrate that OmniVTON++ achieves state-of-the-art performance across diverse generalization settings, including cross-dataset and cross-garment-type evaluations, while reliably operating across scenarios and diffusion backbones within a single formulation. In addition to single-garment, single-human cases, the framework supports multi-garment, multi-human, and anime character virtual try-on, expanding the scope of virtual try-on applications. The source code will be released to the public.
中文标题/摘要
标题:OmniVTON++: 无需训练的通用虚拟试衣系统,基于主要姿态指导
基于图像的虚拟试衣(VTON)涉及在人体姿态和身体约束下通过服装重新渲染合成现实的人像。然而,现有方法通常针对特定的数据条件进行优化,使其部署依赖于重新训练,并限制了其作为统一解决方案的泛化能力。我们提出了OmniVTON++,这是一种无需训练的VTON框架,旨在实现通用适用性。它通过协调结构服装变形以实现基于对应关系的服装适应、主要姿态指导以在扩散采样过程中逐步结构调节,以及边界感知缝合以实现边界感知细化,来协调服装对齐、人体结构连贯性和边界连续性的复杂挑战,形成一个无需特定任务重新训练的统一管道。实验结果表明,OmniVTON++在多种泛化设置中实现了最先进的性能,包括跨数据集和跨服装类型的评估,同时在单一公式内跨不同场景和扩散基础模型可靠运行。除了单件服装、单个人物的情况外,该框架还支持多件服装、多人物以及动漫角色的虚拟试衣,扩展了虚拟试衣的应用范围。源代码将向公众发布。
Summary / 总结
OmniVTON++ is a training-free VTON framework designed for universal applicability in virtual try-on. It addresses garment alignment, human structural coherence, and boundary continuity through Structured Garment Morphing, Principal Pose Guidance, and Continuous Boundary Stitching. Experimental results show that OmniVTON++ outperforms existing methods in diverse settings, supporting various scenarios and diffusion backbones without retraining. The framework also extends to multi-garment, multi-human, and anime character virtual try-on applications.
OmniVTON++ 是一个无需训练的 VTON 框架,旨在解决现有方法的局限性,提供适用于多种泛化场景的统一解决方案。它通过结构化服装变形、主要姿态引导和连续边界缝合分别处理服装对齐、人体结构连贯性和边界连续性问题。实验结果表明,OmniVTON++ 在跨数据集和跨服装类型评估中优于现有方法,并支持包括多件服装和动漫角色在内的多种虚拟试穿场景。
Error Patterns in Historical OCR: A Comparative Analysis of TrOCR and a Vision-Language Model
Authors: Ari Vesalainen, Eetu Mäkelä, Laura Ruotsalainen, Mikko Tolonen
First: 2026-02-16T07:17:52+00:00 · Latest: 2026-02-16T07:17:52+00:00
Abstract
Optical Character Recognition (OCR) of eighteenth-century printed texts remains challenging due to degraded print quality, archaic glyphs, and non-standardized orthography. Although transformer-based OCR systems and Vision-Language Models (VLMs) achieve strong aggregate accuracy, metrics such as Character Error Rate (CER) and Word Error Rate (WER) provide limited insight into their reliability for scholarly use. We compare a dedicated OCR transformer (TrOCR) and a general-purpose Vision-Language Model (Qwen) on line-level historical English texts using length-weighted accuracy metrics and hypothesis-driven error analysis.
While Qwen achieves lower CER/WER and greater robustness to degraded input, it exhibits selective linguistic regularization and orthographic normalization that may silently alter historically meaningful forms. TrOCR preserves orthographic fidelity more consistently but is more prone to cascading error propagation. Our findings show that architectural inductive biases shape OCR error structure in systematic ways. Models with similar aggregate accuracy can differ substantially in error locality, detectability, and downstream scholarly risk, underscoring the need for architecture-aware evaluation in historical digitization workflows.
中文标题/摘要
标题:历史OCR中的错误模式:TrOCR与视觉语言模型的比较分析
由于印刷质量退化、古体字符和非标准化拼写,十八世纪印刷文本的光学字符识别(OCR)仍然具有挑战性。尽管基于变换器的OCR系统和视觉语言模型(VLMs)在总体准确度上表现出色,但字符错误率(CER)和词错误率(WER)等指标对它们在学术用途中的可靠性提供了有限的洞察。我们使用长度加权准确度指标和假设驱动的错误分析,将一个专门的OCR变换器(TrOCR)和一个通用视觉语言模型(Qwen)应用于历史英语文本的行级识别。虽然Qwen的CER/WER较低且对退化输入具有更强的鲁棒性,但它表现出选择性的语言正则化和拼写规范化,可能会无声地改变具有历史意义的形式。TrOCR更一致地保持了拼写忠实度,但更容易发生级联错误传播。我们的研究发现,架构归纳偏见以系统的方式塑造了OCR错误结构。具有相似总体准确度的模型在错误局部性、可检测性和下游学术风险方面可能存在显著差异,强调了在历史数字化工作流程中进行架构感知评估的必要性。
Summary / 总结
The study analyzes error patterns in OCR of eighteenth-century printed texts by comparing TrOCR with a general-purpose VLM (Qwen) on line-level historical English text. It uses length-weighted accuracy metrics and hypothesis-driven error analysis. While Qwen shows lower CER/WER and greater robustness to degraded input, it also regularizes language and normalizes orthography, potentially altering historically meaningful forms without any visible sign. TrOCR preserves orthographic fidelity more consistently but is more prone to cascading error propagation. The findings indicate that architectural inductive biases shape error structure systematically, highlighting the importance of architecture-aware evaluation for historical digitization.
研究旨在通过比较TrOCR和VLM(Qwen)来分析历史文本OCR中的错误模式。研究使用长度加权准确度指标和错误分析来评估其性能。虽然Qwen的CER/WER较低且更具鲁棒性,但它也会规范化拼写,可能改变历史意义的形式。TrOCR虽然更易出错,但能更好地保持拼写一致性。研究发现,架构偏见显著影响错误结构,强调在历史数字化工作流程中进行架构感知评估的重要性。
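The CER and WER metrics named in the abstract above can be illustrated with a minimal sketch. It assumes that "length-weighted" means total edits divided by total reference length across lines, which is one common reading and not confirmed by the abstract; the sample lines and function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_error_rate(pairs, tokenize=list):
    """Total edits divided by total reference length (assumed weighting)."""
    edits = sum(edit_distance(tokenize(r), tokenize(h)) for r, h in pairs)
    length = sum(len(tokenize(r)) for r, _ in pairs)
    return edits / length if length else 0.0


# Hypothetical (reference, hypothesis) line pairs; the first hypothesis
# modernizes the historical spellings "publick" and "musick".
lines = [("publick musick", "public music"),
         ("the olde shoppe", "the olde shoppe")]
cer = corpus_error_rate(lines)                      # character level
wer = corpus_error_rate(lines, tokenize=str.split)  # word level
```

In this toy case two dropped characters give a small CER while two of five words count as wrong, which hints at why aggregate scores alone say little about silent normalization.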
Story-Iter: A Training-free Iterative Paradigm for Long Story Visualization
Authors: Jiawei Mao, Xiaoke Huang, Yunfei Xie, Yuanqi Chang, Mude Hui, Bingjie Xu, Zeyu Zheng, Zirui Wang, Cihang Xie, Yuyin Zhou
First: 2024-10-08T17:59:30+00:00 · Latest: 2026-02-16T06:28:35+00:00
Comments: 31 pages, 33 figures, The project page and associated code can be accessed via https://jwmao1.github.io/storyiter/
Abstract
This paper introduces Story-Iter, a new training-free iterative paradigm to enhance long-story generation. Unlike existing methods that rely on fixed reference images to construct a complete story, our approach features a novel external iterative paradigm, extending beyond the internal iterative denoising steps of diffusion models, to continuously refine each generated image by incorporating all reference images from the previous round. To achieve this, we propose a plug-and-play, training-free global reference cross-attention (GRCA) module, modeling all reference frames with global embeddings, ensuring semantic consistency in long sequences. By progressively incorporating holistic visual context and text constraints, our iterative paradigm enables precise generation with fine-grained interactions, optimizing the story visualization step-by-step. Extensive experiments on the official story visualization dataset and our long story benchmark demonstrate that Story-Iter achieves state-of-the-art performance in long-story visualization (up to 100 frames), excelling in both semantic consistency and fine-grained interactions.
中文标题/摘要
标题:Story-Iter:一种无需训练的迭代范式以增强长故事生成
本文介绍了Story-Iter,这是一种新的无需训练的迭代范式,用于增强长故事生成。与现有方法依赖固定参考图像构建完整故事不同,我们的方法采用了一种新颖的外部迭代范式,超越了扩散模型内部的去噪迭代步骤,通过结合上一轮中所有参考图像来不断细化生成的每一幅图像。为此,我们提出了一种即插即用、无需训练的全局参考交叉注意力(GRCA)模块,使用全局嵌入表示所有参考帧,确保长序列中的语义一致性。通过逐步引入整体视觉上下文和文本约束,我们的迭代范式能够实现精确生成和精细交互,逐步优化故事可视化。在官方故事可视化数据集和我们的长故事基准中的广泛实验表明,Story-Iter在长故事可视化(多达100帧)方面的性能处于领先地位,在语义一致性和精细交互方面表现出色。
Summary / 总结
Story-Iter is a training-free iterative paradigm designed to improve long-story generation by continuously refining each image with all reference images from the previous round. It introduces a global reference cross-attention module to ensure semantic consistency across long sequences. Experiments show that Story-Iter outperforms existing methods in terms of semantic consistency and fine-grained interactions in long-story visualization up to 100 frames.
Story-Iter 是一种无需训练的迭代范式,通过利用上一轮生成的所有参考图像不断细化每张生成图像,以提高长故事生成的质量。它引入了一个全局参考交叉注意力模块,确保长序列中的语义一致性。实验表明,Story-Iter 在长故事可视化(最多 100 帧)方面优于现有方法,在语义一致性和细粒度交互方面表现更佳。
CoCoDiff: Correspondence-Consistent Diffusion Model for Fine-grained Style Transfer
Authors: Wenbo Nie, Zixiang Li, Renshuai Tao, Bin Wu, Yunchao Wei, Yao Zhao
First: 2026-02-16T04:52:29+00:00 · Latest: 2026-02-16T04:52:29+00:00
Abstract
Transferring visual style between images while preserving semantic correspondence between similar objects remains a central challenge in computer vision. While existing methods have made great strides, most of them operate at a global level and overlook region-wise and even pixel-wise semantic correspondence. To address this, we propose CoCoDiff, a novel training-free and low-cost style transfer framework that leverages pretrained latent diffusion models to achieve fine-grained, semantically consistent stylization. We identify that correspondence cues within generative diffusion models are under-explored and that content consistency across semantically matched regions is often neglected. CoCoDiff introduces a pixel-wise semantic correspondence module that mines intermediate diffusion features to construct a dense alignment map between content and style images. A cycle-consistency module then enforces structural and perceptual alignment across iterations, yielding object- and region-level stylization that preserves geometry and detail. Despite requiring no additional training or supervision, CoCoDiff delivers state-of-the-art visual quality and strong quantitative results, outperforming methods that rely on extra training or annotations.
中文标题/摘要
标题:CoCoDiff:一致性对应的扩散模型用于细粒度风格迁移
在计算机视觉中,保留相似对象之间语义对应关系的同时在图像之间转移视觉风格仍然是一个核心挑战。尽管现有方法取得了巨大进展,但大多数方法在全局层面操作,而忽略了区域甚至像素级别的语义对应关系。为了解决这一问题,我们提出了一种名为CoCoDiff的新型无训练和低成本风格迁移框架,该框架利用预训练的潜在扩散模型实现细粒度、语义一致的风格化。我们发现生成扩散模型中的对应线索被严重忽视,且在语义匹配区域之间的内容一致性经常被忽略。CoCoDiff引入了一个像素级语义对应模块,通过挖掘中间扩散特征来构建内容和风格图像之间的密集对齐图。此外,一个循环一致性模块在迭代过程中强制执行结构和感知对齐,从而实现保留几何和细节的对象和区域级别的风格化。尽管不需要额外的训练或监督,CoCoDiff仍能提供最先进的视觉质量和强大的定量结果,超越依赖额外训练或注释的方法。
Summary / 总结
CoCoDiff is a novel training-free and low-cost framework for fine-grained style transfer that leverages pretrained latent diffusion models. It introduces a pixel-wise semantic correspondence module to mine intermediate diffusion features and construct a dense alignment map between content and style images. Additionally, a cycle-consistency module enforces structural and perceptual alignment, resulting in object and region-level stylization that preserves geometry and detail. Despite not requiring additional training or supervision, CoCoDiff achieves state-of-the-art visual quality and strong quantitative results, outperforming methods that rely on extra training or annotations.
CoCoDiff 是一种无需额外训练和低成本的细粒度风格迁移框架,利用预训练的潜在扩散模型来保持语义对应。它引入了一个像素级语义对应模块来构建密集对齐图,并通过循环一致性模块强制结构和感知对齐,从而实现高质量且语义一致的风格化。实验表明,CoCoDiff 在视觉质量和定量指标上均优于需要额外训练或注释的方法。
WiSparse: Boosting LLM Inference Efficiency with Weight-Aware Mixed Activation Sparsity
Authors: Lei Chen, Yuan Meng, Xiaoyu Zhan, Zhi Wang, Wenwu Zhu
First: 2026-02-16T04:18:36+00:00 · Latest: 2026-02-16T04:18:36+00:00
Abstract
Large Language Models (LLMs) offer strong capabilities but incur high inference costs due to dense computation and memory access. Training-free activation sparsity is a promising approach for efficient LLM inference, yet existing methods often rely solely on activation information and uniform sparsity ratios. This overlooks the critical interplay with weights and inter-block sensitivity variation, leading to suboptimal performance. We identify two key phenomena in modern LLMs: 1) less significant activations may align with highly important weights, and 2) sparsity sensitivity varies non-monotonically across model blocks. We propose Weight-aware Mixed-Granularity Training-free Activation Sparsity (WiSparse), which leverages both activation and weight information for adaptive sparsity allocation. Specifically, we introduce a weight-aware mechanism integrating activation magnitudes with precomputed weight norms to accurately identify salient channels. This is combined with a mixed-granularity allocation scheme: a global budget is distributed across blocks via evolutionary search to protect sensitive regions, then refined within blocks to minimize reconstruction error. We improve sparse kernels and demonstrate effectiveness on three representative models. Notably, at 50% sparsity, WiSparse preserves 97% of Llama3.1's dense performance, surpassing the strongest baseline by 2.23 percentage points while achieving a 21.4% acceleration in end-to-end inference speed. Our research advances the limits of training-free approaches for efficient LLM inference, pushing the boundaries of achievable speedup without training.
中文标题/摘要
标题:WiSparse:基于权重感知的混合激活稀疏性提升大语言模型推理效率
大语言模型(LLMs)提供了强大的能力,但由于密集计算和内存访问导致了高昂的推理成本。无训练激活稀疏性是一种有前景的高效LLM推理方法,但现有方法往往仅依赖激活信息和均匀的稀疏性比例,忽视了权重和跨块敏感性变化的关键交互,导致性能不佳。我们识别了现代LLM中的两个关键现象:1)不那么重要的激活可能与高度重要的权重相关联,2)稀疏性敏感性在模型块之间是非单调变化的。我们提出了基于权重感知的混合粒度无训练激活稀疏性(WiSparse),该方法结合了激活和权重信息进行自适应稀疏性分配。具体来说,我们引入了一种基于权重感知的机制,将激活幅度与预计算的权重范数结合,以准确识别重要通道。这与混合粒度分配方案结合:通过进化搜索在块之间分配全局预算以保护敏感区域,然后在块内进行细化以最小化重构误差。我们改进了稀疏内核并在三个代表性模型上展示了其有效性。值得注意的是,在50%稀疏性下,WiSparse保持了97%的Llama3.1密集性能,超越最强基线2.23个百分点,同时实现了21.4%的端到端推理速度加速。我们的研究推进了无训练方法在高效LLM推理中的极限,推动了在不进行训练的情况下可实现的加速边界。
Summary / 总结
WiSparse aims to enhance the efficiency of Large Language Model (LLM) inference by addressing the limitations of existing training-free activation sparsity methods. It proposes a weight-aware mixed-granularity approach that considers both activation and weight information for adaptive sparsity allocation. Experiments on three representative models show that at 50% sparsity, WiSparse preserves 97% of Llama3.1's dense performance, outperforming the strongest baseline by 2.23 percentage points and achieving a 21.4% speedup in inference. This work advances training-free approaches for efficient LLM inference, demonstrating significant performance improvements without training.
WiSparse旨在通过解决现有无训练激活稀疏性方法的局限性,提升大型语言模型(LLM)推理的效率。它提出了一种兼顾权重和激活信息的混合粒度方法,以实现自适应的稀疏性分配。实验结果显示,在50%稀疏度下,WiSparse保持了97%的Llama3.1密集模型性能,超越了最佳基线2.23个百分点,并实现了21.4%的推理速度提升。这项工作推进了无训练方法在高效LLM推理中的应用边界。
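The weight-aware idea in the WiSparse abstract (combining activation magnitudes with precomputed weight norms to find salient channels) might look roughly like the sketch below. The abstract does not give the exact scoring rule, so the product of |activation| and the L2 norm of the matching weight column is an assumption, and `weight_aware_mask` is a hypothetical name.

```python
import numpy as np


def weight_aware_mask(x, W, sparsity):
    """Boolean mask keeping input channels with the largest assumed score
    |x_j| * ||W[:, j]||_2; the remaining channels are treated as zero."""
    col_norms = np.linalg.norm(W, axis=0)  # could be precomputed per layer
    scores = np.abs(x) * col_norms
    n_keep = x.size - int(round(sparsity * x.size))
    keep = np.argsort(scores)[x.size - n_keep:]  # indices of top scores
    mask = np.zeros(x.size, dtype=bool)
    mask[keep] = True
    return mask


# Toy layer: channel 0 has a tiny activation but a very large weight column.
x = np.array([0.1, 2.0, 0.5, 1.0])
W = np.array([[50.0, 0.1, 1.0, 0.1],
              [50.0, 0.1, 1.0, 0.1]])
mask = weight_aware_mask(x, W, sparsity=0.5)
y_sparse = W[:, mask] @ x[mask]  # sparse approximation of W @ x
y_dense = W @ x
```

Here a magnitude-only rule would keep channels 1 and 3, whereas the weight-aware score keeps channels 0 and 2 and stays much closer to the dense output, matching the first phenomenon the abstract describes.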
S2D: Selective Spectral Decay for Quantization-Friendly Conditioning of Neural Activations
Authors: Arnav Chavan, Nahush Lele, Udbhav Bamba, Sankalp Dayal, Aditi Raghunathan, Deepak Gupta
First: 2026-02-16T03:41:06+00:00 · Latest: 2026-02-16T03:41:06+00:00
Abstract
Activation outliers in large-scale transformer models pose a fundamental challenge to model quantization, creating excessively large ranges that cause severe accuracy drops during quantization. We empirically observe that outlier severity intensifies with pre-training scale (e.g., progressing from CLIP to the more extensively trained SigLIP and SigLIP2). Through theoretical analysis as well as empirical correlation studies, we establish the direct link between these activation outliers and dominant singular values of the weights. Building on this insight, we propose Selective Spectral Decay ($S^2D$), a geometrically-principled conditioning method that surgically regularizes only the weight components corresponding to the largest singular values during fine-tuning. Through extensive experiments, we demonstrate that $S^2D$ significantly reduces activation outliers and produces well-conditioned representations that are inherently quantization-friendly. Models trained with $S^2D$ achieve up to 7% improved PTQ accuracy on ImageNet under W4A4 quantization and 4% gains when combined with QAT. These improvements also generalize across downstream tasks and vision-language models, enabling the scaling of increasingly large and rigorously trained models without sacrificing deployment efficiency.
中文标题/摘要
标题:S2D:选择性光谱衰减用于神经激活的量化友好条件化
大型变压器模型中的激活异常值对模型量化构成了根本性挑战,导致量化过程中出现过大的范围,从而严重降低准确性。我们实证观察到,异常值的严重性随着预训练规模的增加而加剧(例如,从CLIP到更广泛训练的SigLIP和SigLIP2)。通过理论分析以及经验相关性研究,我们建立了这些激活异常值与权重主导奇异值之间的直接联系。基于这一洞察,我们提出了选择性光谱衰减($S^2D$),这是一种几何原理上的条件化方法,在微调过程中仅对与最大奇异值对应的权重分量进行手术式正则化。通过广泛的实验,我们证明$S^2D$显著减少了激活异常值,并生成了固有上量化友好的良好条件表示。使用$S^2D$训练的模型在ImageNet的W4A4量化下实现了高达7%的PTQ准确性提升,并且与QAT结合使用时可获得4%的增益。这些改进在下游任务和视觉语言模型中也具有普适性,使得可以扩展越来越大的、训练更为严格的模型,而不牺牲部署效率。
Summary / 总结
The paper addresses the challenge of activation outliers in large-scale transformer models, which degrade quantization accuracy. It proposes Selective Spectral Decay ($S^2D$), a method that regularizes only the weight components corresponding to the largest singular values during fine-tuning to reduce activation outliers. Extensive experiments show that $S^2D$ improves PTQ accuracy on ImageNet by up to 7% under W4A4 quantization, yields 4% gains when combined with QAT, and helps preserve deployment efficiency for large models.
论文针对大型变压器模型中的激活异常值问题,这些问题会影响模型的量化精度。提出了一种名为Selective Spectral Decay ($S^2D$)的方法,该方法在微调过程中仅对权重中对应最大奇异值的组件进行选择性正则化。实验表明,$S^2D$可以减少激活异常值,在ImageNet的W4A4量化下实现高达7%的后训练量化(PTQ)精度提升,并在结合量化感知训练(QAT)时获得4%的增益。
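The abstract above describes regularizing only the weight components tied to the largest singular values. The actual $S^2D$ objective is not given in the abstract, so the following is only a sketch of one plausible form: a penalty on the top-k singular values, plus a direct shrinkage step. Both function names, `k`, and `rate` are hypothetical.

```python
import numpy as np


def selective_spectral_penalty(W, k):
    """Sum of squares of the k largest singular values of W (assumed form)."""
    s = np.linalg.svd(W, compute_uv=False)  # returned in descending order
    return float(np.sum(s[:k] ** 2))


def decay_top_singular_values(W, k, rate):
    """Shrink only the k largest singular values by a factor (1 - rate)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s = s.copy()
    s[:k] *= (1.0 - rate)
    return U @ np.diag(s) @ Vt


# Toy weight matrix with one dominant singular direction.
W = np.diag([10.0, 1.0, 0.5])
penalty = selective_spectral_penalty(W, k=1)
W_decayed = decay_top_singular_values(W, k=1, rate=0.1)
spectrum = np.linalg.svd(W_decayed, compute_uv=False)
```

Unlike plain weight decay, which scales every component, this touches only the dominant direction and leaves the remaining spectrum unchanged.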
Multi-Turn Adaptive Prompting Attack on Large Vision-Language Models
Authors: In Chong Choi, Jiacheng Zhang, Feng Liu, Yiliao Song
First: 2026-02-16T02:15:58+00:00 · Latest: 2026-02-16T02:15:58+00:00
Abstract
Multi-turn jailbreak attacks are effective against text-only large language models (LLMs) by gradually introducing malicious content across turns. When extended to large vision-language models (LVLMs), we find that naively adding visual inputs can cause existing multi-turn jailbreaks to be easily defended. For example, overly malicious visual input will easily trigger the defense mechanism of safety-aligned LVLMs, making the response more conservative. To address this, we propose MAPA: a multi-turn adaptive prompting attack that 1) at each turn, alternates text-vision attack actions to elicit the most malicious response; and 2) across turns, adjusts the attack trajectory through iterative back-and-forth refinement to gradually amplify response maliciousness. This two-level design enables MAPA to consistently outperform state-of-the-art methods, improving attack success rates by 11-35% on recent benchmarks against LLaVA-V1.6-Mistral-7B, Qwen2.5-VL-7B-Instruct, Llama-3.2-Vision-11B-Instruct and GPT-4o-mini.
中文标题/摘要
标题:大型视觉语言模型的多轮自适应提示攻击
多轮脱缰攻击对仅包含文本的大规模语言模型(LLMs)有效,通过逐步引入恶意内容。当扩展到大规模视觉语言模型(LVLMs)时,我们发现简单地添加视觉输入会使现有的多轮脱缰攻击容易被防御。例如,过于恶意的视觉输入会轻易触发安全对齐的LVLMs的防御机制,使响应更加保守。为了解决这个问题,我们提出了MAPA:一种多轮自适应提示攻击,1)在每轮中交替使用文本-视觉攻击动作以获得最恶意的响应;2)在多轮中通过迭代来回优化调整攻击轨迹以逐步放大响应的恶意性。这种两层设计使MAPA能够在最近针对LLaVA-V1.6-Mistral-7B、Qwen2.5-VL-7B-Instruct、Llama-3.2-Vision-11B-Instruct和GPT-4o-mini的基准测试中持续超越最先进的方法,将攻击成功率提高11-35%。
LQA: A Lightweight Quantized-Adaptive Framework for Vision-Language Models on the Edge
Authors: Xin Wang, Hualin Zhou, Sheng Guang Wang, Ting Dang, Yu Zhang, Hong Jia, Tao Gu
First: 2026-02-08T07:37:37+00:00 · Latest: 2026-02-16T02:01:53+00:00
Comments: 15 pages, 9 figures, 9 tables, preprint
Abstract
Deploying Vision-Language Models (VLMs) on edge devices is challenged by resource constraints and performance degradation under distribution shifts. While test-time adaptation (TTA) can counteract such shifts, existing methods are too resource-intensive for on-device deployment. To address this challenge, we propose LQA, a lightweight, quantized-adaptive framework for VLMs that combines a modality-aware quantization strategy with gradient-free test-time adaptation. We introduce Selective Hybrid Quantization (SHQ) and a quantized, gradient-free adaptation mechanism to enable robust and efficient VLM deployment on resource-constrained hardware. Experiments across both synthetic and real-world distribution shifts show that LQA improves overall adaptation performance by 4.5\%, uses less memory than full-precision models, and significantly outperforms gradient-based TTA methods, achieving up to 19.9$\times$ lower memory usage across seven open-source datasets. These results demonstrate that LQA offers a practical pathway for robust, privacy-preserving, and efficient VLM deployment on edge devices.
中文标题/摘要
标题:LQA:边缘设备上视觉-语言模型的轻量级量化自适应框架
在边缘设备上部署视觉-语言模型(VLMs)受到资源限制和分布转移导致性能下降的挑战。虽然测试时自适应(TTA)可以对抗这些转移,但现有方法在设备上部署时过于耗资源。为解决这一挑战,我们提出了一种结合模态感知量化策略和无梯度测试时自适应的轻量级量化自适应框架LQA。我们引入了选择性混合量化(SHQ)和量化无梯度自适应机制,以在资源受限的硬件上实现鲁棒且高效的VLM部署。在合成和真实世界分布转移的实验中,LQA 的整体自适应性能提高了4.5%,使用比全精度模型更少的内存,并且显著优于基于梯度的TTA方法,实现了七个多源数据集上最高达19.9倍的内存使用率降低。这些结果表明,LQA 提供了一条实用的路径,用于在边缘设备上实现鲁棒、隐私保护和高效的VLM部署。
Summary / 总结
LQA is a lightweight quantized-adaptive framework for Vision-Language Models (VLMs) on edge devices, combining a modality-aware quantization strategy with gradient-free test-time adaptation. It introduces Selective Hybrid Quantization (SHQ) and a quantized, gradient-free adaptation mechanism to enhance robust and efficient deployment. Experiments show LQA improves overall adaptation performance by 4.5%, uses less memory than full-precision models, and significantly outperforms gradient-based TTA methods, achieving up to 19.9 times lower memory usage across seven open-source datasets.
LQA 是一种轻量级的量化自适应框架,用于边缘设备上的视觉-语言模型(VLMs),解决了资源限制和性能下降的问题。它结合了模态感知的量化策略和无梯度的测试时自适应机制。实验结果显示,LQA 的整体自适应性能提高了 4.5%,使用了比全精度模型更少的内存,并且在七个开源数据集上比基于梯度的 TTA 方法的内存使用降低了多达 19.9 倍。
LVLM-COUNT: Enhancing the Counting Ability of Large Vision-Language Models
Authors: Muhammad Fetrat Qharabagh, Mohammadreza Ghofrani, Kimon Fountoulakis
First: 2024-12-01T05:50:22+00:00 · Latest: 2026-02-16T00:52:57+00:00
Comments: 38 pages, 24 Figures, 19 Tables
Abstract
Counting is a fundamental operation for various real-world visual tasks, requiring both object recognition and robust counting capabilities. Despite their advanced visual perception, large vision-language models (LVLMs) are known to struggle with counting tasks. In this work, we evaluate the performance of several LVLMs on visual counting tasks across multiple counting and vision datasets. We observe that while their performance may be less prone to error for small numbers of objects, they exhibit significant weaknesses as the number of objects increases. To alleviate this issue, we propose a simple yet effective baseline method that enhances LVLMs' counting ability for large numbers of objects using a divide-and-conquer approach. Our method decomposes counting problems into sub-tasks. Moreover, it incorporates a mechanism to prevent objects from being split during division, which could otherwise lead to repetitive counting -- a common issue in a naive divide-and-conquer implementation. We demonstrate the effectiveness of this approach across various datasets and benchmarks, establishing it as a valuable reference for evaluating future solutions.
中文标题/摘要
标题:LVLM-COUNT:增强大型视觉语言模型的计数能力
计数是各种现实视觉任务中的基本操作,需要物体识别和稳健的计数能力。尽管大型视觉语言模型(LVLMs)在视觉感知方面表现出色,但它们在计数任务上却存在困难。在本研究中,我们评估了几种LVLMs在多个计数和视觉数据集上的计数任务性能。我们观察到,虽然它们在小数量物体上的表现可能不太容易出错,但随着物体数量的增加,它们表现出明显的弱点。为了解决这一问题,我们提出了一种简单而有效的基线方法,通过分而治之的方法增强LVLMs在大量物体上的计数能力。该方法将计数问题分解为子任务,并且引入了一种机制,防止在划分过程中物体被分割,从而避免常见的重复计数问题。我们在各种数据集和基准测试中展示了该方法的有效性,将其确立为评估未来解决方案的重要参考。
Summary / 总结
This paper addresses the challenge of enhancing the counting ability of large vision-language models (LVLMs) for various real-world visual tasks. The authors evaluate several LVLMs on visual counting tasks and find that their performance degrades as the number of objects increases. To improve this, they propose LVLM-COUNT, a divide-and-conquer method that decomposes counting problems into smaller sub-tasks and prevents object splitting, thereby reducing repetitive counting. The method shows significant improvements across multiple datasets and benchmarks, making it a valuable reference for future work in this area.
该研究针对大型视觉-语言模型(LVLMs)在处理大量物体计数任务时的局限性进行了探讨。作者在多个数据集上评估了LVLMs的表现,并发现这些模型在处理少量物体时表现良好,但随着物体数量的增加,其性能显著下降。为了解决这一问题,研究引入了LVLM-COUNT方法,该方法采用分而治之的策略将计数问题分解为更小的子任务,并包含防止物体在分割过程中被重复计数的机制。实验结果表明,该方法在各种数据集和基准测试中提升了LVLMs的计数能力。
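The divide-and-conquer counting described for LVLM-COUNT can be sketched as follows. The abstract does not say how object splitting is prevented, so this assumes the horizontal extents of objects are already available (for example from a detector) and picks a cut that crosses none of them; `count_region` is a hypothetical stand-in for querying an LVLM on a sub-image.

```python
def safe_cut(width, boxes):
    """Return the x position closest to the image centre that crosses no box.

    boxes: list of (x_min, x_max) horizontal extents of detected objects.
    """
    centre = width / 2
    candidates = [centre] + [x for box in boxes for x in box]
    free = [c for c in candidates
            if 0 < c < width and not any(lo < c < hi for lo, hi in boxes)]
    return min(free, key=lambda c: abs(c - centre)) if free else None


def count_divided(width, boxes, count_region):
    """Count objects in two sub-regions split at a safe cut and add them."""
    cut = safe_cut(width, boxes)
    if cut is None:  # no safe cut found: fall back to a single region
        return count_region(0, width)
    return count_region(0, cut) + count_region(cut, width)


# Hypothetical scene: the middle object straddles the image centre (x = 50).
boxes = [(10, 30), (45, 60), (70, 90)]
stub_counter = lambda lo, hi: sum(1 for a, b in boxes if lo <= a and b <= hi)
cut = safe_cut(100, boxes)
total = count_divided(100, boxes, stub_counter)
```

A naive cut at the centre would slice the middle object in two, which is the situation the abstract identifies as a source of repeated counting.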
Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision
Authors: A. Said Gurbuz, Sunghwan Hong, Ahmed Nassar, Marc Pollefeys, Peter Staar
First: 2026-02-15T19:00:02+00:00 · Latest: 2026-02-15T19:00:02+00:00
Comments: 28 pages, 15 figures
Abstract
Modern computer-use agents (CUA) must perceive a screen as a structured state (what elements are visible, where they are, and what text they contain) before they can reliably ground instructions and act. Yet, most available grounding datasets provide sparse supervision, with insufficient and low-diversity labels that annotate only a small subset of task-relevant elements per screen, which limits both coverage and generalization; moreover, practical deployment requires efficiency to enable low-latency, on-device use. We introduce ScreenParse, a large-scale dataset for complete screen parsing, with dense annotations of all visible UI elements (boxes, 55-class types, and text) across 771K web screenshots (21M elements). ScreenParse is generated by Webshot, an automated, scalable pipeline that renders diverse URLs, extracts annotations, and applies VLM-based relabeling and quality filtering. Using ScreenParse, we train ScreenVLM, a compact, 316M-parameter vision language model (VLM) that decodes a compact ScreenTag markup representation with a structure-aware loss that upweights structure-critical tokens. ScreenVLM substantially outperforms much larger foundation VLMs on dense parsing (e.g., 0.592 vs. 0.294 PageIoU on ScreenParse) and shows strong transfer to public benchmarks. Moreover, finetuning foundation VLMs on ScreenParse consistently improves their grounding performance, suggesting that dense screen supervision provides transferable structural priors for UI understanding. Project page: https://saidgurbuz.github.io/screenparse/.
中文标题/摘要
标题:超越稀疏接地:通过完整的屏幕解析监督
现代计算机使用代理(CUA)必须将屏幕视为一个结构化的状态,识别可见元素的位置和文本内容,才能可靠地进行指令定位和执行。然而,大多数可用的接地数据集提供的监督信息稀疏,标签不足且多样性低,只能标注每张屏幕中一小部分任务相关元素,这限制了覆盖范围和泛化能力;此外,实际部署需要高效性以实现低延迟的设备端使用。我们引入了ScreenParse,这是一个大规模的完整屏幕解析数据集,对771,000张网页截图(2100万元素)中的所有可见UI元素(框、55类类型和文本)进行了密集标注。ScreenParse通过Webshot生成,这是一个自动化的、可扩展的管道,可以渲染多样化的URL,提取标注并应用基于VLM的重新标注和质量过滤。使用ScreenParse,我们训练了ScreenVLM,这是一个紧凑的、参数量为3.16亿的视觉语言模型(VLM),使用结构感知损失解码紧凑的ScreenTag标记表示。ScreenVLM在密集解析(例如,ScreenParse上的PageIoU为0.592,而基础VLM为0.294)方面显著优于更大的基础VLM,并且在公共基准测试中表现出强大的迁移能力。此外,对基础VLM进行ScreenParse的微调始终提高了其接地性能,表明密集屏幕监督提供了可转移的结构先验知识,有助于UI理解。项目页面:https://saidgurbuz.github.io/screenparse/
Summary / 总结
The paper addresses the limitations of sparse supervision in grounding datasets by introducing ScreenParse, a large-scale dataset with dense annotations of all visible UI elements across 771K web screenshots. Using this dataset, the authors train ScreenVLM, a compact vision language model that outperforms larger foundation models on dense parsing tasks. The model's performance and the improvement in grounding performance when fine-tuned on ScreenParse suggest the value of dense screen supervision for UI understanding.
论文通过引入包含771K网页截图中UI元素密集标注的ScreenParse数据集,解决了现有数据集稀疏标注的限制。ScreenParse通过自动管道生成,渲染多样化的URL并进行质量过滤。作者使用该数据集训练了ScreenVLM,这是一种紧凑的视觉语言模型,其在密集解析任务上的表现优于更大的基础模型。此外,该模型在公共基准测试中也表现出强大的迁移能力,表明密集屏幕监督提供了可转移的结构先验知识,有助于UI理解。
Dual-Signal Adaptive KV-Cache Optimization for Long-Form Video Understanding in Vision-Language Models
Authors: Vishnu Sai, Dheeraj Sai, Srinath B, Girish Varma, Priyesh Shukla
First: 2026-02-15T17:06:02+00:00 · Latest: 2026-02-15T17:06:02+00:00
Abstract
Vision-Language Models (VLMs) face a critical memory bottleneck when processing long-form video content due to the linear growth of the Key-Value (KV) cache with sequence length. Existing solutions predominantly employ reactive eviction strategies that compute full attention matrices before discarding tokens, resulting in substantial computational waste. We propose Sali-Cache, a novel a priori optimization framework that implements dual-signal adaptive caching through proactive memory management. By integrating a temporal filter based on optical flow analysis for detecting inter-frame redundancy and a spatial filter leveraging saliency detection for identifying visually significant regions, Sali-Cache intelligently manages memory allocation before entering computationally expensive attention operations. Experimental evaluation on the LLaVA 1.6 architecture demonstrates that our method achieves a 2.20x compression ratio in effective memory usage while maintaining 100% accuracy across BLEU, ROUGE-L, and Exact Match metrics. Furthermore, under identical memory budget constraints, Sali-Cache preserves context-rich features over extended temporal durations without degrading model performance, enabling efficient processing of long-form video content on consumer-grade hardware.
中文标题/摘要
标题:长视频理解中视觉-语言模型的双信号自适应KV缓存优化
视觉-语言模型(VLMs)在处理长视频内容时面临严重的内存瓶颈,因为随着序列长度的增长,键值(KV)缓存呈线性增长。现有解决方案主要采用反应式淘汰策略,在计算完整的注意矩阵之前丢弃令牌,导致大量计算浪费。我们提出了一种新颖的先验优化框架Sali-Cache,通过主动内存管理实现双信号自适应缓存。通过结合基于光流分析的时间滤波器来检测帧间冗余,并结合利用显著性检测的空间滤波器来识别视觉显著区域,Sali-Cache在进入昂贵的注意操作之前智能地管理内存分配。在LLaVA 1.6架构上的实验评估表明,我们的方法在有效内存使用方面实现了2.20倍的压缩比,同时在BLEU、ROUGE-L和精确匹配指标上保持100%的准确性。此外,在相同的内存预算约束下,Sali-Cache能够长时间保留丰富的上下文特征而不降低模型性能,从而在消费级硬件上高效处理长视频内容。
Summary / 总结
The paper addresses the memory bottleneck in Vision-Language Models (VLMs) when processing long-form video content. It introduces Sali-Cache, a dual-signal adaptive caching framework that uses proactive memory management with temporal and spatial filters to reduce memory usage. Experimental results show that Sali-Cache achieves a 2.20x compression ratio in memory usage while maintaining model accuracy and performance on long-form video content.
论文提出了一种名为Sali-Cache的双重信号自适应缓存框架,以解决Vision-Language Models (VLMs)在处理长视频内容时的内存瓶颈问题。该框架通过基于光学流的时域滤波器和基于显著性检测的空间滤波器进行主动内存管理。实验结果显示,Sali-Cache在保持模型准确性和长时间保留上下文丰富特征的同时,实现了2.20倍的内存压缩比。
MAGE: All-[MASK] Block Already Knows Where to Look in Diffusion LLM
Authors: Omin Kwon, Yeonjae Kim, Doyeon Kim, Minseo Kim, Yeonhong Park, Jae W. Lee
First: 2026-02-15T16:07:51+00:00 · Latest: 2026-02-15T16:07:51+00:00
Abstract
Block diffusion LLMs are emerging as a promising next paradigm for language generation, but their use of KV caching makes memory access a dominant bottleneck in long-context settings. While dynamic sparse attention has been actively explored, existing methods designed for autoregressive LLMs rely on approximate importance estimation and perform poorly when adapted to block diffusion. This work identifies a key opportunity unique to block diffusion: attention at the first All-[MASK] denoising step reliably predicts important KV entries and budget requirements, enabling MAGE to perform a single exact attention pass per block and reuse it for training-free sparse denoising. Across long-context benchmarks including LongBench and Needle-in-a-Haystack, MAGE achieves near-lossless accuracy with a fraction of the KV budget while delivering up to 3-4x end-to-end speedup, consistently outperforming AR-oriented sparse attention baselines. A lightweight fine-tuning strategy further strengthens [MASK]-guided patterns with minimal cost, requiring only a few hours of training on a single NVIDIA H100 GPU for both 1.5B and 7B models.
中文标题/摘要
标题:MAGE:所有-[MASK]块已知如何在扩散大语言模型中查找
块扩散大语言模型正在成为语言生成下一个有前途的范式,但它们使用KV缓存使得在长上下文设置中内存访问成为主要瓶颈。尽管动态稀疏注意机制已被积极研究,但现有方法针对自回归大语言模型依赖于近似重要性估计,并且在适应块扩散时表现不佳。这项工作识别了一个块扩散独有的关键机会:在首次所有-[MASK]去噪步骤中的注意力可靠地预测了重要的KV条目和预算需求,使MAGE能够在每个块中执行一次精确的注意力传递,并在无需训练的情况下重用它进行稀疏去噪。在包括LongBench和Needle-in-a-Haystack的长上下文基准测试中,MAGE在KV预算的一小部分下实现了近乎无损的准确性,同时提供高达3-4倍的端到端加速,始终优于面向自回归稀疏注意力的基线。一种轻量级的微调策略进一步强化了-[MASK]引导的模式,仅需在单个NVIDIA H100 GPU上进行几小时的训练即可对1.5B和7B模型进行优化。
Summary / 总结
This work addresses the memory bottleneck in block diffusion language models by leveraging the first denoising step to predict important key-value entries, enabling efficient sparse denoising. MAGE performs a single exact attention pass per block and reuses it for training-free sparse denoising, achieving near-lossless accuracy with reduced KV budget and up to 3-4x speedup on long-context benchmarks. A lightweight fine-tuning strategy further enhances performance with minimal cost.
该研究通过利用首次去噪步骤来预测重要键值对,解决了块扩散语言模型中的内存瓶颈问题,使模型能够进行单次精确注意力传递并重复使用它来进行无训练的稀疏去噪。MAGE 在长上下文基准测试中实现了近乎无损的准确性,同时减少了键值预算并实现了高达 3-4 倍的端到端加速。一种轻量级的微调策略进一步增强了基于去噪的模式,仅需在单个 NVIDIA H100 GPU 上进行少量训练即可实现最佳效果。
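The reuse pattern described for MAGE (one exact attention pass, then sparse passes over the selected KV entries) can be illustrated with a small sketch. How MAGE derives its budget and aggregates attention is not stated in the abstract; here the budget is a fixed input and importance is the summed attention mass per key, both of which are assumptions, as are the function names.

```python
import numpy as np


def select_kv_indices(attn, budget):
    """Pick the `budget` key positions with the highest total attention mass.

    attn: array of shape (num_queries, num_keys) from one exact pass.
    """
    mass = attn.sum(axis=0)
    top = np.argsort(mass)[::-1][:budget]
    return np.sort(top)


def sparse_attention(q, K, V, idx):
    """Softmax attention restricted to the selected KV positions."""
    logits = q @ K[idx].T / np.sqrt(q.shape[-1])
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V[idx]


# Toy attention map standing in for the first (all-masked) denoising step.
attn_first_step = np.array([[0.7, 0.1, 0.1, 0.1],
                            [0.1, 0.1, 0.7, 0.1]])
idx = select_kv_indices(attn_first_step, budget=2)

K = np.eye(4)
V = np.arange(8.0).reshape(4, 2)
q = np.ones((1, 4))
out = sparse_attention(q, K, V, idx)  # later steps read only K[idx], V[idx]
```

The saving comes from later denoising steps touching only the selected rows of the KV cache instead of the full context.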
LAP: Language-Action Pre-Training Enables Zero-shot Cross-Embodiment Transfer
Authors: Lihan Zha, Asher J. Hancock, Mingtong Zhang, Tenny Yin, Yixuan Huang, Dhruv Shah, Allen Z. Ren, Anirudha Majumdar
First: 2026-02-11T06:09:11+00:00 · Latest: 2026-02-15T15:32:16+00:00
Comments: Project website: https://lap-vla.github.io
Abstract
A long-standing goal in robotics is a generalist policy that can be deployed zero-shot on new robot embodiments without per-embodiment adaptation. Despite large-scale multi-embodiment pre-training, existing Vision-Language-Action models (VLAs) remain tightly coupled to their training embodiments and typically require costly fine-tuning. We introduce Language-Action Pre-training (LAP), a simple recipe that represents low-level robot actions directly in natural language, aligning action supervision with the pre-trained vision-language model's input-output distribution. LAP requires no learned tokenizer, no costly annotation, and no embodiment-specific architectural design. Based on LAP, we present LAP-3B, which to the best of our knowledge is the first VLA to achieve substantial zero-shot transfer to previously unseen robot embodiments without any embodiment-specific fine-tuning. Across multiple novel robots and manipulation tasks, LAP-3B attains over 50% average zero-shot success, delivering roughly a 2x improvement over the strongest prior VLAs. We further show that LAP enables efficient adaptation and favorable scaling, while unifying action prediction and VQA in a shared language-action format that yields additional gains through co-training.
中文标题/摘要
标题:LAP:语言-动作预训练使零样本跨体态迁移成为可能
机器人领域的一个长期目标是能够零样本部署在新机器人体态上的通用策略,无需针对每个体态进行适应。尽管进行了大规模的多体态预训练,现有的视觉-语言-动作模型(VLAs)仍然紧密耦合于其训练体态,并且通常需要昂贵的微调。我们引入了语言-动作预训练(LAP),这是一种简单的配方,直接将低级机器人动作表示为自然语言,使动作监督与预训练的视觉-语言模型的输入-输出分布相一致。LAP 不需要学习分词器,不需要昂贵的标注,也不需要针对特定体态的架构设计。基于 LAP,我们提出了 LAP-3B,据我们所知,这是第一个在无需任何体态特定微调的情况下实现显著零样本迁移至未见过的机器人体态的视觉-语言-动作模型。在多个新型机器人和操作任务上,LAP-3B 达到了超过 50% 的平均零样本成功率,相比之前最强的视觉-语言-动作模型,大约提高了 2 倍。我们还展示了 LAP 能够实现高效的适应和有利的扩展,并且在共享的语言-动作格式中统一了动作预测和视觉问答,通过协同训练进一步提高了性能。
Summary / 总结
The research aims to develop a generalist robot policy that can be deployed zero-shot on new embodiments without per-embodiment adaptation. The Language-Action Pre-training (LAP) method represents low-level robot actions in natural language, aligning action supervision with the pre-trained vision-language model's input-output distribution. LAP-3B, based on this method, achieves over 50% average zero-shot success across multiple novel robots and manipulation tasks, roughly a 2x improvement over the strongest prior VLAs, without embodiment-specific fine-tuning. The method also supports efficient adaptation and unifies action prediction and VQA in a shared format, yielding additional gains through co-training.
研究旨在开发一种通用的机器人策略,使其能够在无需适应的情况下部署到新的身体上。语言-动作预训练(LAP)方法将低级机器人动作表示为自然语言,促进零样本转移。基于LAP的LAP-3B在多个新型机器人和操作任务上实现了超过50%的平均零样本成功率,显著优于之前的模型。该方法无需学习分词器、昂贵的注释或特定于身体的架构设计,并展示了高效的适应性和有利的扩展性优势。
Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering
Authors: Tao Xu
First: 2026-02-15T14:23:50+00:00 · Latest: 2026-02-15T14:23:50+00:00
Comments: 24 pages, 9 figures, 9 tables
Abstract
Existing multimodal document question answering methods universally adopt a supply-side ingestion strategy: running a Vision-Language Model (VLM) on every page during indexing to generate comprehensive descriptions, then answering questions through text retrieval. However, this "pre-ingestion" approach is costly (a 113-page engineering drawing package requires approximately 80,000 VLM tokens), end-to-end unreliable (VLM outputs may fail to be correctly retrieved due to format mismatches in the retrieval infrastructure), and irrecoverable once it fails. This paper proposes the Deferred Visual Ingestion (DVI) framework, adopting a demand-side ingestion strategy: the indexing phase performs only lightweight metadata extraction, deferring visual understanding to the moment users pose specific questions. DVI's core principle is "Index for locating, not understanding"--achieving page localization through structured metadata indexes and BM25 full-text search, then sending original images along with specific questions to a VLM for targeted analysis. Experiments on two real industrial engineering drawings (113 pages + 7 pages) demonstrate that DVI achieves comparable overall accuracy at zero ingestion VLM cost (46.7% vs. 48.9%), an effectiveness rate of 50% on visually necessary queries (vs. 0% for pre-ingestion), and 100% page localization (98% search space compression). DVI also supports interactive refinement and progressive caching, transforming the "QA accuracy" problem into a "page localization" problem--once the correct drawing page is found, obtaining the answer becomes a matter of interaction rounds.
中文标题/摘要
标题:简明视觉,深思逻辑:延迟视觉摄入在视觉密集型文档问答中的应用
现有的多模态文档问答方法普遍采用供给方摄入策略:在索引阶段对每页运行视觉语言模型(VLM)以生成全面描述,然后通过文本检索回答问题。然而,这种“预摄入”方法成本高昂(一个113页的工程图纸包大约需要80,000个VLM令牌),端到端可靠性差(VLM输出可能因检索基础设施中的格式不匹配而无法正确检索),并且一旦失败就不可恢复。本文提出了延迟视觉摄入(DVI)框架,采用需求方摄入策略:索引阶段仅进行轻量级元数据提取,将视觉理解推迟到用户提出具体问题时。DVI的核心原则是“索引用于定位,而非理解”——通过结构化元数据索引和BM25全文搜索实现页面定位,然后将原始图像和具体问题发送给VLM进行针对性分析。在两个实际工业工程图纸(113页+7页)上的实验表明,DVI在零摄入VLM成本(46.7% vs. 48.9%)的情况下实现了相当的整体准确性,对视觉必要的查询的有效率达到了50%(而预摄入为0%),并且实现了100%的页面定位(98%的搜索空间压缩)。DVI还支持交互式细化和渐进式缓存,将“问答准确性”问题转化为“页面定位”问题——一旦找到正确的图纸页面,获取答案就变成了交互轮次的问题。
Summary / 总结
This paper addresses the inefficiencies of existing multimodal document question answering methods by proposing the Deferred Visual Ingestion (DVI) framework. Instead of pre-ingesting every page with a Vision-Language Model (VLM), DVI extracts only lightweight metadata during indexing and defers visual understanding to when specific questions are asked. Experiments show DVI achieves comparable accuracy to pre-ingestion methods at zero VLM cost, 50% effectiveness on visually necessary queries, and 100% page localization. DVI also supports interactive refinement and progressive caching, transforming the QA accuracy problem into a page localization problem.
本文提出了一种延迟视觉摄入(DVI)框架,以解决现有跨模态文档问答方法的低效问题。DVI 在索引阶段仅提取轻量级元数据,而将视觉理解延迟到用户提出具体问题时。实验表明,DVI 在节省视觉语言模型(VLM)成本的同时,实现了与预摄入方法相似的准确率,有效定位页面并支持交互式细化和缓存。在实际的工业工程图纸上,DVI 在视觉必要查询上的有效性达到 50%,并且实现了 100% 的页面定位,搜索空间压缩率达到 98%。
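The "index for locating, not understanding" step described above can be illustrated with a small BM25 ranker over per-page metadata. This is a minimal sketch under stated assumptions, not the authors' implementation: the page strings, the whitespace tokenizer, the `BM25Index` class, and the `k1`/`b` values are all hypothetical, and the deferred VLM call on the selected page image is left out.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase whitespace tokenization; real drawings would need a richer tokenizer.
    return text.lower().split()

class BM25Index:
    """Okapi BM25 over per-page metadata strings (no VLM calls at index time)."""

    def __init__(self, pages, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [Counter(tokenize(p)) for p in pages]
        self.lengths = [sum(d.values()) for d in self.docs]
        self.avg_len = sum(self.lengths) / len(self.docs)
        n = len(self.docs)
        df = Counter(term for d in self.docs for term in d)
        # Smoothed IDF; the +1 inside the log keeps every weight non-negative.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        doc, length = self.docs[i], self.lengths[i]
        total = 0.0
        for term in tokenize(query):
            tf = doc.get(term, 0)
            if tf == 0:
                continue
            norm = tf + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf[term] * tf * (self.k1 + 1) / norm
        return total

    def top_pages(self, query, k=2):
        order = sorted(range(len(self.docs)),
                       key=lambda i: self.score(query, i), reverse=True)
        return order[:k]

# Hypothetical metadata extracted cheaply at index time (e.g. title-block text).
pages = [
    "sheet 1 general arrangement pump skid overview",
    "sheet 2 piping and instrumentation diagram cooling water",
    "sheet 3 electrical single line diagram motor control",
    "sheet 4 bill of materials pump skid",
]
index = BM25Index(pages)
candidates = index.top_pages("cooling water piping diagram", k=1)
```

In a demand-side pipeline, only the page images listed in `candidates` would be sent to the VLM together with the user's question, which is where the search-space compression comes from.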
Annotation-Efficient Vision-Language Model Adaptation to the Polish Language Using the LLaVA Framework
Authors: Grzegorz Statkiewicz, Alicja Dobrzeniecka, Karolina Seweryn, Aleksandra Krasnodębska, Karolina Piosek, Katarzyna Bogusz, Sebastian Cygert, Wojciech Kusa
First: 2026-02-15T09:54:40+00:00 · Latest: 2026-02-15T09:54:40+00:00
Abstract
Most vision-language models (VLMs) are trained on English-centric data, limiting their performance in other languages and cultural contexts. This restricts their usability for non-English-speaking users and hinders the development of multimodal systems that reflect diverse linguistic and cultural realities. In this work, we reproduce and adapt the LLaVA-Next methodology to create a set of Polish VLMs. We rely on a fully automated pipeline for translating and filtering existing multimodal datasets, and complement this with synthetic Polish data for OCR and culturally specific tasks. Despite relying almost entirely on automatic translation and minimal manual intervention to the training data, our approach yields strong results: we observe a +9.5% improvement over LLaVA-1.6-Vicuna-13B on a Polish-adapted MMBench, along with higher-quality captions in generative evaluations, as measured by human annotators in terms of linguistic correctness. These findings highlight that large-scale automated translation, combined with lightweight filtering, can effectively bootstrap high-quality multimodal models for low-resource languages. Some challenges remain, particularly in cultural coverage and evaluation. To facilitate further research, we make our models and evaluation dataset publicly available.
中文标题/摘要
标题:使用LLaVA框架对波兰语言的注释高效视觉-语言模型适应
大多数视觉-语言模型(VLMs)都是在以英语为中心的数据上训练的,这限制了它们在其他语言和文化背景下的性能。这限制了非英语使用者的使用,并阻碍了反映多元语言和文化现实的多模态系统的开发。在这项工作中,我们重现并适应了LLaVA-Next方法,以创建一组波兰VLMs。我们依赖于一个完全自动化的管道来翻译和过滤现有的多模态数据集,并通过合成的波兰数据补充OCR和文化特定任务。尽管几乎完全依赖自动翻译和少量的手动干预对训练数据进行干预,我们的方法仍取得了良好的结果:我们在波兰适应的MMBench上观察到对LLaVA-1.6-Vicuna-13B的性能提高了9.5%,并且在生成性评估中,由人类注释者衡量的字幕质量更高。这些发现表明,大规模的自动翻译结合轻量级的过滤可以有效地为低资源语言启动高质量的多模态模型。仍存在一些挑战,特别是在文化覆盖面和评估方面。为了促进进一步的研究,我们公开了我们的模型和评估数据集。
Summary / 总结
This work addresses the limitation of vision-language models (VLMs) trained primarily on English data by adapting the LLaVA-Next methodology to create Polish VLMs. The approach uses an automated pipeline for translating and filtering existing multimodal datasets, supplemented by synthetic Polish data for OCR and culturally specific tasks. Despite minimal manual intervention, the models show a +9.5% improvement over LLaVA-1.6-Vicuna-13B on a Polish-adapted MMBench and produce captions that human annotators rate higher for linguistic correctness. This demonstrates that large-scale automated translation and lightweight filtering can effectively support the development of high-quality multimodal models for low-resource languages, though challenges in cultural coverage and evaluation remain.
本研究针对主要以英语数据训练的视觉语言模型(VLMs)的局限性,通过将LLaVA-Next方法适应化来创建波兰语VLMs。该方法使用自动化的数据翻译和过滤管道,并补充了合成的波兰数据。尽管手动干预较少,但模型在波兰适应的MMBench上表现出显著的+9.5%改进,并且生成的字幕在人类评估者看来质量更高。这表明大规模自动翻译和轻量级过滤在开发低资源语言的高质量多模态模型方面是有效的,尽管文化覆盖面的挑战仍然存在。
Train Short, Inference Long: Training-free Horizon Extension for Autoregressive Video Generation
Authors: Jia Li, Xiaomeng Fu, Xurui Peng, Weifeng Chen, Youwei Zheng, Tianyu Zhao, Jiexi Wang, Fangmin Chen, Xing Wang, Hayden Kwok-Hay So
First: 2026-02-15T07:14:47+00:00 · Latest: 2026-02-15T07:14:47+00:00
Comments: 19 pages, 15 figures
Abstract
Autoregressive video diffusion models have emerged as a scalable paradigm for long video generation. However, they often suffer from severe extrapolation failure, where rapid error accumulation leads to significant temporal degradation when extending beyond training horizons. We identify that this failure primarily stems from the spectral bias of 3D positional embeddings and the lack of dynamic priors in noise sampling. To address these issues, we propose FLEX (Frequency-aware Length EXtension), a training-free inference-time framework that bridges the gap between short-term training and long-term inference. FLEX introduces Frequency-aware RoPE Modulation to adaptively interpolate under-trained low-frequency components while extrapolating high-frequency ones to preserve multi-scale temporal discriminability. This is integrated with Antiphase Noise Sampling (ANS) to inject high-frequency dynamic priors and Inference-only Attention Sink to anchor global structure. Extensive evaluations on VBench demonstrate that FLEX significantly outperforms state-of-the-art models at 6x extrapolation (30s duration) and matches the performance of long-video fine-tuned baselines at 12x scale (60s duration). As a plug-and-play augmentation, FLEX seamlessly integrates into existing inference pipelines for horizon extension. It effectively pushes the generation limits of models such as LongLive, supporting consistent and dynamic video synthesis at a 4-minute scale. Project page is available at https://ga-lee.github.io/FLEX_demo.
中文标题/摘要
标题:短训练,长推理:无需训练的超长时间自回归视频生成
自回归视频扩散模型已成为长视频生成的可扩展范式。然而,它们通常会遭受严重的外推失败,即快速的误差累积导致在超出训练范围时出现显著的时间降解。我们发现,这种失败主要源于3D位置嵌入的频谱偏差以及噪声采样中缺乏动态先验。为了解决这些问题,我们提出了FLEX(频率感知长度扩展),这是一种无需训练的推理时框架,可以弥合短期训练与长期推理之间的差距。FLEX引入了频率感知RoPE调制,以自适应地插值未充分训练的低频分量,同时外推高频分量,以保持多尺度时间可区分性。这与反相噪声采样(ANS)结合使用,以注入高频动态先验,并与推理专用注意力汇合,以锚定全局结构。在VBench上的广泛评估表明,FLEX在6倍外推(30秒时长)时显著优于最先进的模型,并且在12倍尺度(60秒时长)时与长视频微调基线的性能相当。作为即插即用的增强功能,FLEX无缝地集成到现有的推理管道中,以扩展时间范围。它有效地推动了如LongLive等模型的生成极限,支持在4分钟尺度上的一致和动态视频合成。项目页面可在https://ga-lee.github.io/FLEX_demo访问。
Summary / 总结
The paper addresses the issue of temporal degradation in autoregressive video diffusion models when extending beyond their training horizons. It proposes FLEX, a training-free inference-time framework that combines Frequency-aware RoPE Modulation, Antiphase Noise Sampling, and Inference-only Attention Sink. On VBench, FLEX significantly outperforms state-of-the-art models at 6x extrapolation (30s duration) and matches long-video fine-tuned baselines at 12x scale (60s duration), supporting consistent and dynamic video synthesis over extended durations.
论文解决了自回归视频扩散模型在超出训练范围时出现的时间退化问题。它提出了FLEX框架,该框架通过使用Frequency-aware RoPE Modulation和Antiphase Noise Sampling来改善长期推理。在VBench上的实验表明,FLEX在6x外推时优于最先进的模型,并且在12x尺度上与长视频微调基线相当。
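The idea of interpolating low-frequency RoPE components while extrapolating high-frequency ones can be sketched on a single temporal axis. The sketch below is illustrative only: the hard wavelength threshold, the function names, and the values of `train_len` and `scale` are assumptions, and the abstract does not state the actual modulation rule used by FLEX.

```python
import math

def rope_frequencies(head_dim, base=10000.0):
    # Standard RoPE inverse frequencies: one per pair of channels.
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def modulate_frequencies(freqs, train_len, scale):
    """Interpolate low-frequency bands, leave high-frequency bands untouched.

    A band counts as under-trained here when its wavelength exceeds the
    training length, i.e. it never completed a full rotation during training.
    This hard threshold is an illustrative assumption, not the FLEX rule.
    """
    out = []
    for f in freqs:
        wavelength = 2.0 * math.pi / f
        if wavelength > train_len:
            out.append(f / scale)  # interpolation: squeeze long rollouts into the trained range
        else:
            out.append(f)          # extrapolation: keep the original rotation speed
    return out

def rotation_angles(position, freqs):
    # Angle applied to each channel pair at a given temporal position.
    return [position * f for f in freqs]

train_len, scale = 32, 6  # hypothetical: 32 trained frames, 6x longer rollout
freqs = rope_frequencies(head_dim=16)
new_freqs = modulate_frequencies(freqs, train_len, scale)
```

With these toy settings the two fastest bands keep their original frequencies, while every slower band is divided by `scale`, so its angle at position `train_len * scale` equals the angle it had at `train_len` during training.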
Consistent text-to-image generation via scene de-contextualization
Authors: Song Tang, Peihao Gong, Kunyu Li, Kai Guo, Boyu Wang, Mao Ye, Jianwei Zhang, Xiatian Zhu
Venue: ICLR 2026
First: 2025-10-16T10:54:49+00:00 · Latest: 2026-02-15T06:23:00+00:00
Comments: This paper is accepted by ICLR 2026
Abstract
Consistent text-to-image (T2I) generation seeks to produce identity-preserving images of the same subject across diverse scenes, yet it often fails due to a phenomenon called identity (ID) shift. Previous methods have tackled this issue, but typically rely on the unrealistic assumption of knowing all target scenes in advance. This paper reveals that a key source of ID shift is the native correlation between subject and scene context, called scene contextualization, which arises naturally as T2I models fit the training distribution of vast natural images. We formally prove the near-universality of this scene-ID correlation and derive theoretical bounds on its strength. On this basis, we propose a novel, efficient, training-free prompt embedding editing approach, called Scene De-Contextualization (SDeC), that imposes an inversion process of T2I's built-in scene contextualization. Specifically, it identifies and suppresses the latent scene-ID correlation within the ID prompt's embedding by quantifying the SVD directional stability to adaptively re-weight the corresponding eigenvalues. Critically, SDeC allows for per-scene use (one scene per prompt) without requiring prior access to all target scenes. This makes it a highly flexible and general solution well-suited to real-world applications where such prior knowledge is often unavailable or varies over time. Experiments demonstrate that SDeC significantly enhances identity preservation while maintaining scene diversity.
中文标题/摘要
标题:通过场景去语境化实现一致的文本到图像生成
一致的文本到图像(T2I)生成旨在生成同一主题在不同场景下的身份保留图像,但由于身份(ID)偏移现象,往往难以实现。先前的方法已经解决了这一问题,但通常依赖于知道所有目标场景这一不现实的假设。本文揭示了身份偏移的一个主要来源是主题和场景语境之间的自然相关性,称为场景语境化,这是T2I模型拟合大量自然图像训练分布时自然产生的。我们正式证明了这种场景-ID相关性的普遍性,并推导出其强度的理论界。在此基础上,我们提出了一种新颖、高效、无需训练的提示嵌入编辑方法,称为场景去语境化(SDeC),它施加了T2I内置场景语境化的逆过程。具体而言,它通过量化SVD方向稳定性来识别并抑制ID提示嵌入中的潜在场景-ID相关性,从而自适应地重新加权相应的特征值。关键的是,SDeC 允许每场景使用(每个提示一个场景)而无需事先访问所有目标场景。这使其成为一种高度灵活且通用的解决方案,特别适合于在先验知识通常不可用或随时间变化的现实世界应用中。实验表明,SDeC 显著增强了身份保留能力,同时保持了场景多样性。
Summary / 总结
This paper addresses the issue of identity shift in text-to-image generation by proposing a novel method called Scene De-Contextualization (SDeC). The method identifies and suppresses the latent scene-ID correlation within the ID prompt's embedding to enhance identity preservation. Experiments show that SDeC significantly improves identity preservation while maintaining scene diversity, making it a flexible solution for real-world applications where prior knowledge of all target scenes is unavailable or varies over time.
该论文通过提出一种名为Scene De-Contextualization (SDeC)的方法来解决文本到图像生成中的身份偏移问题,旨在跨不同场景保持主体的身份一致性。该方法识别并抑制ID提示嵌入中的潜在场景-ID关联,允许按场景使用而无需事先了解所有目标场景。实验表明,SDeC显著提高了身份一致性,同时保持了场景多样性。
LLM DNA: Tracing Model Evolution via Functional Representations
Authors: Zhaomin Wu, Haodong Zhao, Ziyang Wang, Jizhou Guo, Qian Wang, Bingsheng He
Venue: ICLR 2026 Oral
First: 2025-09-29T09:09:57+00:00 · Latest: 2026-02-15T06:13:13+00:00
Comments: ICLR 2026 (Oral)
Abstract
The explosive growth of large language models (LLMs) has created a vast but opaque landscape: millions of models exist, yet their evolutionary relationships through fine-tuning, distillation, or adaptation are often undocumented or unclear, complicating LLM management. Existing methods are limited by task specificity, fixed model sets, or strict assumptions about tokenizers or architectures. Inspired by biological DNA, we address these limitations by mathematically defining LLM DNA as a low-dimensional, bi-Lipschitz representation of functional behavior. We prove that LLM DNA satisfies inheritance and genetic determinism properties and establish the existence of DNA. Building on this theory, we derive a general, scalable, training-free pipeline for DNA extraction. In experiments across 305 LLMs, DNA aligns with prior studies on limited subsets and achieves superior or competitive performance on specific tasks. Beyond these tasks, DNA comparisons uncover previously undocumented relationships among LLMs. We further construct the evolutionary tree of LLMs using phylogenetic algorithms, which align with shifts from encoder-decoder to decoder-only architectures, reflect temporal progression, and reveal distinct evolutionary speeds across LLM families.
中文标题/摘要
标题:LLM DNA:通过功能表示追踪模型进化
大型语言模型(LLMs)的爆炸性增长创造了一个庞大但不透明的景观:数百万个模型存在,但它们通过微调、蒸馏或适应的进化关系往往未被记录或不明确,使LLM管理变得复杂。现有方法受限于任务特定性、固定的模型集或对分词器或架构的严格假设。受生物DNA的启发,我们通过数学定义LLM DNA为功能行为的低维、双利普希茨表示来解决这些限制。我们证明LLM DNA满足继承性和遗传决定性属性,并证明了DNA的存在。在此理论基础上,我们推导出一个通用、可扩展、无需训练的DNA提取管道。在305个LLM的实验中,DNA与有限子集的先前研究结果一致,并在特定任务上实现了优于或竞争性的性能。超出这些任务,DNA比较揭示了LLM之间之前未记录的关系。我们进一步使用系统发生算法构建了LLM的进化树,该树与从编码器-解码器到仅解码器架构的转变、反映时间进程以及揭示不同LLM家族的进化速度相一致。
Summary / 总结
The paper addresses the challenge of understanding the evolutionary relationships among large language models (LLMs) by defining LLM DNA as a low-dimensional, bi-Lipschitz representation of functional behavior. The authors derive a general, scalable, training-free pipeline for extracting LLM DNA and, in experiments across 305 LLMs, show that it aligns with prior studies on limited subsets and achieves superior or competitive performance on specific tasks. Additionally, DNA comparisons reveal previously undocumented relationships among LLMs, and the evolutionary tree constructed using phylogenetic algorithms reflects temporal progression and distinct evolutionary speeds across LLM families.
论文旨在解决大规模语言模型(LLMs)通过微调、蒸馏或适应进化时缺乏透明度的问题。它引入了LLM DNA,这是一种功能行为的低维表示,被数学定义并证明满足继承和遗传决定性属性。方法包括一个可扩展的、无需训练的DNA提取管道,适用于305个LLM。结果表明,DNA与以前对有限子集的研究结果一致,并在特定任务上优于或匹配现有方法。此外,DNA比较揭示了LLMs之间以前未记录的关系,并使用系统发育算法构建的进化树反映了不同LLM家族的历史演变和不同的进化速度。
Inject Where It Matters: Training-Free Spatially-Adaptive Identity Preservation for Text-to-Image Personalization
Authors: Guandong Li, Mengxia Ye
First: 2026-02-15T05:25:57+00:00 · Latest: 2026-02-15T05:25:57+00:00
Abstract
Personalized text-to-image generation aims to integrate specific identities into arbitrary contexts. However, existing tuning-free methods typically employ Spatially Uniform Visual Injection, causing identity features to contaminate non-facial regions (e.g., backgrounds and lighting) and degrading text adherence. To address this without expensive fine-tuning, we propose SpatialID, a training-free spatially-adaptive identity modulation framework. SpatialID fundamentally decouples identity injection into face-relevant and context-free regions using a Spatial Mask Extractor derived from cross-attention responses. Furthermore, we introduce a Temporal-Spatial Scheduling strategy that dynamically adjusts spatial constraints - transitioning from Gaussian priors to attention-based masks and adaptive relaxation - to align with the diffusion generation dynamics. Extensive experiments on IBench demonstrate that SpatialID achieves state-of-the-art performance in text adherence (CLIP-T: 0.281), visual consistency (CLIP-I: 0.827), and image quality (IQ: 0.523), significantly eliminating background contamination while maintaining robust identity preservation.
中文标题/摘要
标题:在关键处注入:无需训练的空间自适应身份保护以实现文本到图像的个性化
个性化文本到图像生成旨在将特定身份融入任意场景中。然而,现有的无需调优方法通常采用空间均匀视觉注入,导致身份特征污染非面部区域(例如背景和照明),降低文本依附性。为解决这一问题而不需昂贵的微调,我们提出了一种无需训练的空间自适应身份调制框架——SpatialID。SpatialID 基本上将身份注入分解为与面部相关和无场景限制的区域,使用来自跨注意力响应的空间掩码提取器。此外,我们引入了一种时空调度策略,动态调整空间约束——从高斯先验过渡到基于注意力的掩码和自适应松弛,以适应扩散生成动力学。在 IBench 上的大量实验表明,SpatialID 在文本依附性(CLIP-T:0.281)、视觉一致性(CLIP-I:0.827)和图像质量(IQ:0.523)方面均达到最佳性能,显著减少了背景污染,同时保持了稳健的身份保护。
Summary / 总结
The paper addresses the issue of identity features contaminating non-facial regions in text-to-image generation, which degrades text adherence. It introduces SpatialID, a training-free framework that decouples identity injection into face-relevant and context-free regions using a Spatial Mask Extractor. Additionally, it employs a Temporal-Spatial Scheduling strategy to dynamically adjust spatial constraints during diffusion generation. Experiments show that SpatialID outperforms existing methods in text adherence, visual consistency, and image quality, effectively eliminating background contamination while preserving identity features.
论文解决了身份特征污染非面部区域的问题,这会降低文本一致性。它提出了SpatialID,一种无需训练的框架,通过空间掩码提取器将身份注入分解为面部相关和非上下文区域。此外,它还采用了时空调度策略,在扩散生成过程中动态调整空间约束。实验表明,SpatialID 在文本一致性、视觉一致性和图像质量方面均优于现有方法,有效消除了背景污染并保持了身份特征的稳定性。
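The scheduling described above (Gaussian prior, then attention-based mask, then relaxation) can be pictured with a toy masked-injection routine. Everything in this sketch is an assumption made for illustration: the function names, the `early`/`late` phase boundaries, the `floor` used for relaxation, and the additive injection rule are not taken from the paper.

```python
import math

def gaussian_mask(h, w, sigma=0.35):
    # Centered Gaussian prior over an h x w latent grid, values in (0, 1].
    cy, cx = (h - 1) / 2, (w - 1) / 2
    return [[math.exp(-(((y - cy) / h) ** 2 + ((x - cx) / w) ** 2) / (2 * sigma ** 2))
             for x in range(w)] for y in range(h)]

def attention_mask(attn, threshold=0.5):
    # Min-max normalize a cross-attention map to [0, 1], then binarize it.
    lo = min(min(row) for row in attn)
    hi = max(max(row) for row in attn)
    span = (hi - lo) or 1.0
    return [[1.0 if (v - lo) / span >= threshold else 0.0 for v in row] for row in attn]

def scheduled_mask(progress, attn, early=0.3, late=0.8, floor=0.2):
    """Pick a mask by denoising progress in [0, 1] (0 = first step, 1 = last)."""
    h, w = len(attn), len(attn[0])
    if progress < early:
        return gaussian_mask(h, w)          # no reliable attention yet: use the prior
    mask = attention_mask(attn)
    if progress >= late:
        # Relaxation: allow a small amount of injection everywhere.
        mask = [[max(m, floor) for m in row] for row in mask]
    return mask

def inject(base, identity, mask, strength=1.0):
    # Spatially weighted injection: identity features only where the mask is active.
    return [[b + strength * m * i for b, i, m in zip(brow, irow, mrow)]
            for brow, irow, mrow in zip(base, identity, mask)]

attn = [[0.0, 0.2], [0.9, 1.0]]             # toy cross-attention map for a face token
base = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 1.0], [1.0, 1.0]]
mid_out = inject(base, identity, scheduled_mask(0.5, attn))
late_out = inject(base, identity, scheduled_mask(0.9, attn))
early_mask = scheduled_mask(0.1, attn)
```

In the middle phase the identity signal reaches only the high-attention cells, which is the behavior that spatially uniform injection lacks.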
Elastic Diffusion Transformer
Authors: Jiangshan Wang, Zeqiang Lai, Jiarui Chen, Jiayi Guo, Hang Guo, Xiu Li, Xiangyu Yue, Chunchao Guo
First: 2026-02-15T05:19:17+00:00 · Latest: 2026-02-15T05:19:17+00:00
Abstract
Diffusion Transformers (DiT) have demonstrated remarkable generative capabilities but remain highly computationally expensive. Previous acceleration methods, such as pruning and distillation, typically rely on a fixed computational capacity, leading to insufficient acceleration and degraded generation quality. To address this limitation, we propose Elastic Diffusion Transformer (E-DiT), an adaptive acceleration framework for DiT that effectively improves efficiency while maintaining generation quality. Specifically, we observe that the generative process of DiT exhibits substantial sparsity (i.e., some computations can be skipped with minimal impact on quality), and this sparsity varies significantly across samples. Motivated by this observation, E-DiT equips each DiT block with a lightweight router that dynamically identifies sample-dependent sparsity from the input latent. Each router adaptively determines whether the corresponding block can be skipped. If the block is not skipped, the router then predicts the optimal MLP width reduction ratio within the block. During inference, we further introduce a block-level feature caching mechanism that leverages router predictions to eliminate redundant computations in a training-free manner. Extensive experiments across 2D image (Qwen-Image and FLUX) and 3D asset (Hunyuan3D-3.0) demonstrate the effectiveness of E-DiT, achieving up to ~2x speedup with negligible loss in generation quality. Code will be available at https://github.com/wangjiangshan0725/Elastic-DiT.
中文标题/摘要
标题:弹性扩散变换器
扩散变换器(DiT)展示了卓越的生成能力,但计算成本仍然很高。之前的加速方法,如剪枝和蒸馏,通常依赖固定的计算能力,导致加速不足且生成质量下降。为解决这一限制,我们提出了弹性扩散变换器(E-DiT),这是一种适应性加速框架,能够在保持生成质量的同时有效提高效率。具体而言,我们观察到DiT的生成过程具有显著的稀疏性(即,一些计算可以被跳过而对质量影响很小),并且这种稀疏性在不同样本之间差异很大。受此观察的启发,E-DiT为每个DiT块配备了一个轻量级路由器,该路由器能够从输入的潜在变量中动态识别样本相关的稀疏性。每个路由器会自适应地确定相应的块是否可以被跳过。如果块不被跳过,路由器会预测块内MLP宽度减少的最佳比例。在推理过程中,我们进一步引入了一种块级特征缓存机制,利用路由器的预测来在无需训练的情况下消除冗余计算。在2D图像(Qwen-Image和FLUX)和3D资产(Hunyuan3D-3.0)上的广泛实验表明,E-DiT的有效性,实现了约2倍的加速,且生成质量几乎没有损失。代码将在https://github.com/wangjiangshan0725/Elastic-DiT/上提供。
Summary / 总结
Elastic Diffusion Transformer (E-DiT) is proposed to address the high computational cost of Diffusion Transformers (DiT) while maintaining generation quality. E-DiT identifies sample-dependent sparsity in DiT blocks using lightweight routers, allowing for adaptive skipping of computations. During inference, a block-level feature caching mechanism further reduces redundant computations. Experiments show E-DiT achieves up to a 2x speedup with negligible quality loss in 2D and 3D generation tasks.
Elastic Diffusion Transformer (E-DiT) 是一种针对扩散变换器(DiT)的自适应加速框架,能够在保持生成质量的同时提高效率。通过每个DiT块中的轻量级路由器识别样本相关的稀疏性,实现动态块跳过和最优MLP宽度缩减。此外,引入了块级特征缓存机制以消除冗余计算。实验表明,E-DiT 在2D和3D生成任务中可实现最高约2倍的加速,且质量损失可忽略不计。
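The per-block routing described above (skip the block, or run it with a reduced MLP width) can be sketched with a toy residual block. This is a hedged illustration, not the E-DiT code: the linear `router`, the 0.5 skip threshold, the "keep the first hidden units" slicing, and all weights are hypothetical, and attention layers and feature caching are omitted.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mlp(x, w_in, w_out, width):
    # Two-layer ReLU MLP that only evaluates the first `width` hidden units.
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, w_in[j]))) for j in range(width)]
    return [sum(h * w_out[j][k] for j, h in enumerate(hidden)) for k in range(len(x))]

def router(x, skip_w, ratio_w):
    # Hypothetical linear router: one logit for skipping, one for width reduction.
    skip_prob = sigmoid(sum(xi * w for xi, w in zip(x, skip_w)))
    reduction = sigmoid(sum(xi * w for xi, w in zip(x, ratio_w)))
    return skip_prob, reduction

def elastic_block(x, params, threshold=0.5):
    """Return (output, hidden units actually used) for one sample."""
    skip_prob, reduction = router(x, params["skip_w"], params["ratio_w"])
    if skip_prob > threshold:
        return list(x), 0                          # block skipped: identity, no MLP cost
    full = len(params["w_in"])
    width = max(1, round((1.0 - reduction) * full))
    update = mlp(x, params["w_in"], params["w_out"], width)
    return [xi + ui for xi, ui in zip(x, update)], width   # residual connection

params = {
    "w_in": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 1.0]],   # 4 hidden units
    "w_out": [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.0, 0.0]],
    "skip_w": [5.0, 0.0],
    "ratio_w": [0.0, 0.0],   # constant reduction ratio of 0.5 in this toy setup
}
skipped_out, skipped_width = elastic_block([1.0, 0.0], params)
active_out, active_width = elastic_block([-1.0, 0.5], params)
```

The two calls show the sample-dependent behavior: the first input is routed around the block entirely, while the second runs the MLP at half width.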
MarsRetrieval: Benchmarking Vision-Language Models for Planetary-Scale Geospatial Retrieval on Mars
Authors: Shuoyuan Wang, Yiran Wang, Hongxin Wei
First: 2026-02-15T02:41:56+00:00 · Latest: 2026-02-15T02:41:56+00:00
Abstract
Data-driven approaches like deep learning are rapidly advancing planetary science, particularly in Mars exploration. Despite recent progress, most existing benchmarks remain confined to closed-set supervised visual tasks and do not support text-guided retrieval for geospatial discovery. We introduce MarsRetrieval, a retrieval benchmark for evaluating vision-language models for Martian geospatial discovery. MarsRetrieval includes three tasks: (1) paired image-text retrieval, (2) landform retrieval, and (3) global geo-localization, covering multiple spatial scales and diverse geomorphic origins. We propose a unified retrieval-centric protocol to benchmark multimodal embedding architectures, including contrastive dual-tower encoders and generative vision-language models. Our evaluation shows MarsRetrieval is challenging: even strong foundation models often fail to capture domain-specific geomorphic distinctions. We further show that domain-specific fine-tuning is critical for generalizable geospatial discovery in planetary settings. Our code is available at https://github.com/ml-stat-Sustech/MarsRetrieval
中文标题/摘要
标题:MarsRetrieval:评估行星规模地理空间检索的视觉-语言模型基准
基于数据的方法,如深度学习,正在迅速推动行星科学的发展,特别是在火星探索方面。尽管取得了进展,但现有的大多数基准仍然局限于封闭集监督视觉任务,不支持文本引导的地理空间发现检索。我们引入了MarsRetrieval,这是一个用于评估视觉-语言模型在火星地理空间发现中的检索基准。MarsRetrieval 包含三个任务:(1)配对图像-文本检索,(2)地貌检索,(3)全球地理定位,涵盖了多个空间尺度和多样的地貌起源。我们提出了一种统一的检索为中心的基准测试协议,用于评估多模态嵌入架构,包括对比度双塔编码器和生成型视觉-语言模型。我们的评估表明,MarsRetrieval 是具有挑战性的:即使强大的基础模型也往往无法捕捉到特定领域的地貌差异。我们进一步表明,特定领域的微调对于行星环境中的可泛化地理空间发现至关重要。我们的代码可在 https://github.com/ml-stat-Sustech/MarsRetrieval 获取
Summary / 总结
MarsRetrieval is a benchmark for evaluating vision-language models on geospatial retrieval for Mars. It includes three tasks: paired image-text retrieval, landform retrieval, and global geo-localization. A unified retrieval-centric protocol is used to benchmark multimodal embedding architectures, including contrastive dual-tower encoders and generative vision-language models. The evaluation shows that even strong foundation models struggle with domain-specific geomorphic distinctions, highlighting the importance of domain-specific fine-tuning for geospatial discovery in planetary settings.
MarsRetrieval 是一个用于评估行星地理空间检索的基准,专注于火星探索。它包括三项任务:配对图像-文本检索、地貌检索和全球地理定位。研究使用了一种统一的检索协议,包括对比双塔编码器和生成型视觉-语言模型。结果显示,即使是强大的基础模型也难以捕捉特定领域的地貌差异,强调了在行星环境中进行地理空间发现时进行领域特定微调的重要性。
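A retrieval-centric evaluation of embedding models typically reduces to ranking a gallery by similarity and checking whether a relevant item appears near the top. The sketch below computes Recall@K with cosine similarity on toy vectors. The metric choice, the function names, and the data are assumptions made for illustration; the abstract does not specify which metrics MarsRetrieval reports.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def recall_at_k(query_embs, gallery_embs, positives, k):
    """Fraction of queries whose top-k gallery items contain at least one positive."""
    hits = 0
    for q, pos in zip(query_embs, positives):
        ranked = sorted(range(len(gallery_embs)),
                        key=lambda j: cosine(q, gallery_embs[j]), reverse=True)
        if pos & set(ranked[:k]):
            hits += 1
    return hits / len(query_embs)

# Toy 2-D embeddings standing in for text queries and image tiles.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gallery = [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]]
positives = [{0}, {1}, {2}]   # gallery indices relevant to each query
r1 = recall_at_k(queries, gallery, positives, k=1)
r3 = recall_at_k(queries, gallery, positives, k=3)
```

The same function applies to any model that maps queries and gallery items into a shared embedding space, which is what makes a single protocol usable across dual-tower and generative architectures.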