Can vision language models learn intuitive physics from interaction?
Authors: Luca M. Schulze Buschoff, Konstantinos Voudouris, Can Demircan, Eric Schulz
First: 2026-02-05T18:59:20+00:00 · Latest: 2026-02-05T18:59:20+00:00
Abstract
Pre-trained vision language models do not have good intuitions about the physical world. Recent work has shown that supervised fine-tuning can improve model performance on simple physical tasks. However, fine-tuned models do not appear to learn robust physical rules that can generalize to new contexts. Based on research in cognitive science, we hypothesize that models need to interact with an environment to properly learn its physical dynamics. We train models that learn through interaction with the environment using reinforcement learning. While learning from interaction allows models to improve their within-task performance, it fails to produce models with generalizable physical intuitions. We find that models trained on one task do not reliably generalize to related tasks, even if the tasks share visual statistics and physical principles, and regardless of whether the models are trained through interaction.
Summary
The study investigates whether vision language models can learn intuitive physics through interaction. Despite improvements in task performance with supervised fine-tuning, the models fail to develop robust, generalizable physical intuitions. Models trained through interaction show enhanced task performance but still struggle to generalize to related tasks, indicating that interaction alone is insufficient for learning transferable physical knowledge.
GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks?
Authors: Ruihang Li, Leigang Qu, Jingxu Zhang, Dongnan Gui, Mengde Xu, Xiaosong Zhang, Han Hu, Wenjie Wang, Jiaqi Wang
First: 2026-02-05T18:52:48+00:00 · Latest: 2026-02-05T18:52:48+00:00
Comments: Project Page: https://genarena.github.io/, Code: https://github.com/ruihanglix/genarena
Abstract
The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models as surrogate judges. In this work, we systematically investigate the reliability of the prevailing absolute pointwise scoring standard across a wide spectrum of visual generation tasks. Our analysis reveals that this paradigm is limited by stochastic inconsistency and poor alignment with human perception. To resolve these limitations, we introduce GenArena, a unified evaluation framework that leverages a pairwise comparison paradigm to ensure stable and human-aligned evaluation. Crucially, our experiments uncover a transformative finding: simply adopting this pairwise protocol enables off-the-shelf open-source models to outperform top-tier proprietary models. Notably, our method boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, drastically surpassing the 0.36 correlation of pointwise methods. Based on GenArena, we benchmark state-of-the-art visual generation models across diverse tasks, providing the community with a rigorous and automated evaluation standard for visual generation.
Summary
This study addresses the limitations of traditional evaluation methods for visual generation models by introducing GenArena, a framework that uses a pairwise comparison paradigm. Adopting this protocol lets off-the-shelf open-source models outperform top-tier proprietary judges, boosting evaluation accuracy by over 20%. The method also correlates strongly with the authoritative LMArena leaderboard (Spearman 0.86 versus 0.36 for pointwise scoring).
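The protocol shift GenArena argues for can be illustrated without any model in the loop: rank candidates by pairwise wins instead of absolute scores, then measure agreement with a reference leaderboard via Spearman correlation. The toy judge and model names below are placeholders, not GenArena's actual components.

```python
from itertools import combinations

def pairwise_rank(models, judge):
    """Rank models by number of pairwise wins; `judge(a, b)` returns the
    winner of a single head-to-head comparison."""
    wins = {m: 0 for m in models}
    for a, b in combinations(models, 2):
        wins[judge(a, b)] += 1
    return sorted(models, key=lambda m: -wins[m])

def spearman(rank_a, rank_b):
    """Spearman rank correlation between two orderings of the same items
    (no-ties formula: 1 - 6*sum(d^2) / (n*(n^2 - 1)))."""
    n = len(rank_a)
    pos_b = {m: i for i, m in enumerate(rank_b)}
    d2 = sum((i - pos_b[m]) ** 2 for i, m in enumerate(rank_a))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Toy judge: always prefers the alphabetically earlier model.
ranking = pairwise_rank(["A", "B", "C", "D"], judge=min)
rho = spearman(ranking, ["A", "B", "C", "D"])  # agreement with a reference leaderboard
```

In the paper's setting the judge is a VLM shown two generations for the same prompt, and the reference ranking is LMArena; the aggregation and correlation steps stay this simple.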
Pathwise Test-Time Correction for Autoregressive Long Video Generation
Authors: Xunzhi Xiang, Zixuan Duan, Guiyu Zhang, Haiyu Zhang, Zhe Gao, Junta Wu, Shaofeng Zhang, Tengfei Wang, Qi Fan, Chunchao Guo
First: 2026-02-05T16:50:39+00:00 · Latest: 2026-02-05T16:50:39+00:00
Abstract
Distilled autoregressive diffusion models facilitate real-time short video synthesis but suffer from severe error accumulation during long-sequence generation. While existing Test-Time Optimization (TTO) methods prove effective for images or short clips, we identify that they fail to mitigate drift in extended sequences due to unstable reward landscapes and the hypersensitivity of distilled parameters. To overcome these limitations, we introduce Test-Time Correction (TTC), a training-free alternative. Specifically, TTC utilizes the initial frame as a stable reference anchor to calibrate intermediate stochastic states along the sampling trajectory. Extensive experiments demonstrate that our method seamlessly integrates with various distilled models, extending generation lengths with negligible overhead while matching the quality of resource-intensive training-based methods on 30-second benchmarks.
Summary
The research addresses the issue of error accumulation in long video generation using autoregressive diffusion models. It introduces Test-Time Correction (TTC), a training-free method that uses the initial frame as a reference to stabilize intermediate states during sampling. Experiments show that TTC can extend generation lengths without significant overhead and matches the quality of more resource-intensive methods on 30-second benchmarks.
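The abstract does not spell out TTC's calibration rule, so the following is only an illustrative sketch of anchor-based drift correction: pull the statistics of an intermediate latent back toward those of the initial-frame anchor. The statistics-matching form, the `strength` parameter, and all names here are assumptions, not the paper's method.

```python
import numpy as np

def anchor_correct(latent, anchor, strength=0.5):
    """Illustrative drift correction: interpolate the mean/scale statistics of
    an intermediate latent toward those of the anchor (initial-frame) latent.
    strength=0 leaves the latent untouched; strength=1 fully matches the anchor."""
    mu, sigma = latent.mean(), latent.std()
    mu_a, sigma_a = anchor.mean(), anchor.std()
    target_mu = (1 - strength) * mu + strength * mu_a
    target_sigma = (1 - strength) * sigma + strength * sigma_a
    return (latent - mu) / (sigma + 1e-8) * target_sigma + target_mu

rng = np.random.default_rng(0)
anchor = rng.normal(0.0, 1.0, size=(4, 8, 8))
drifted = rng.normal(2.0, 3.0, size=(4, 8, 8))  # accumulated drift in mean/scale
corrected = anchor_correct(drifted, anchor, strength=1.0)
```

Because the correction is a closed-form transform applied along the sampling trajectory, it adds negligible overhead compared with test-time optimization, which is the trade-off the abstract emphasizes.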
GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra
Authors: Mateusz Michalkiewicz, Anekha Sokhal, Tadeusz Michalkiewicz, Piotr Pawlikowski, Mahsa Baktashmotlagh, Varun Jampani, Guha Balakrishnan
Venue: ICLR 2026
First: 2025-06-09T20:11:21+00:00 · Latest: 2026-02-05T16:06:21+00:00
Comments: Accepted to ICLR 2026. Camera ready version
Abstract
Modern monocular 3D reconstruction methods and vision-language models (VLMs) demonstrate impressive results on standard benchmarks, yet recent works cast doubt on their true understanding of geometric properties. We introduce GIQ, a comprehensive benchmark specifically designed to evaluate the geometric reasoning capabilities of vision and vision-language foundation models. GIQ comprises synthetic and real-world images and corresponding 3D meshes of diverse polyhedra covering varying levels of complexity and symmetry, from Platonic, Archimedean, Johnson, and Catalan solids to stellations and compound shapes. Through systematic experiments involving monocular 3D reconstruction, 3D symmetry detection, mental rotation tests, and zero-shot shape classification tasks, we reveal significant shortcomings in current models. State-of-the-art reconstruction algorithms trained on extensive 3D datasets struggle to reconstruct even basic Platonic solids accurately. Next, although foundation models may be shown via linear and non-linear probing to capture specific 3D symmetry elements, they falter significantly in tasks requiring detailed geometric differentiation, such as mental rotation. Moreover, advanced vision-language assistants such as ChatGPT, Gemini, and Claude exhibit remarkably low accuracy in interpreting basic shape properties such as face geometry, convexity, and compound structures of complex polyhedra. GIQ is publicly available at toomanymatts.github.io/giq-benchmark/, providing a structured platform to benchmark critical gaps in geometric intelligence and facilitate future progress in robust, geometry-aware representation learning.
Summary
GIQ is a benchmark designed to evaluate the geometric reasoning capabilities of vision and vision-language foundation models. It includes synthetic and real-world images of diverse polyhedra, ranging from simple to complex shapes. The study reveals that state-of-the-art 3D reconstruction algorithms struggle with basic geometric shapes, while foundation models perform poorly in tasks requiring detailed geometric differentiation, such as mental rotation. Vision-language assistants like ChatGPT, Gemini, and Claude also show low accuracy in interpreting basic shape properties. The benchmark is publicly available at toomanymatts.github.io/giq-benchmark/.
Bifrost: Steering Strategic Trajectories to Bridge Contextual Gaps for Self-Improving Agents
Authors: Quan M. Tran, Zhuo Huang, Wenbin Zhang, Bo Han, Koji Yatani, Masashi Sugiyama, Tongliang Liu
First: 2026-02-05T16:03:56+00:00 · Latest: 2026-02-05T16:03:56+00:00
Abstract
Autonomous agents excel at self-improvement through reflection and iterative refinement, reusing successful task trajectories as in-context examples to assist subsequent reasoning. However, shifting across tasks often introduces a context mismatch. Hence, existing approaches either discard the trajectories or manipulate them with heuristics, incurring a non-negligible fine-tuning cost or unguaranteed performance. To bridge this gap, we reveal a context-trajectory correlation, where shifts of context are highly parallel with shifts of trajectory. Based on this finding, we propose BrIdge contextual gap FoR imprOvised trajectory STeering (Bifrost), a training-free method that leverages context differences to precisely guide the adaptation of previously solved trajectories towards the target task, mitigating the misalignment caused by context shifts. Our trajectory adaptation is conducted at the representation level using agent hidden states, ensuring trajectory transformation accurately aligns with the target context in a shared space. Across diverse benchmarks, Bifrost consistently outperforms existing trajectory reuse and finetuned self-improvement methods, demonstrating that agents can effectively leverage past experiences despite substantial context shifts.
Summary
The paper introduces Bifrost, a method that addresses the context mismatch issue in autonomous agents' self-improvement. By identifying the correlation between context and trajectory shifts, Bifrost precisely adapts previously solved trajectories to the target task, reducing the need for fine-tuning. Experiments show that Bifrost outperforms existing trajectory reuse and fine-tuning methods across various benchmarks, indicating effective use of past experiences even with significant context changes.
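The context-trajectory correlation suggests a simple mental model: if context shifts and trajectory shifts are parallel in a shared representation space, a stored trajectory can be re-aligned by offsetting its hidden states with the context difference. The additive update below is an assumed, minimal reading of that idea, not the paper's exact steering rule; all names are illustrative.

```python
import numpy as np

def steer_trajectory(traj_states, src_ctx, tgt_ctx, alpha=1.0):
    """Illustrative trajectory steering in representation space: shift every
    stored hidden state by the context difference (tgt_ctx - src_ctx), on the
    assumption that context shifts and trajectory shifts are parallel."""
    delta = tgt_ctx - src_ctx  # context shift in the shared space
    return [h + alpha * delta for h in traj_states]

rng = np.random.default_rng(1)
src_ctx = rng.normal(size=16)   # representation of the source task's context
tgt_ctx = rng.normal(size=16)   # representation of the target task's context
traj = [rng.normal(size=16) for _ in range(5)]  # hidden states of a solved trajectory
steered = steer_trajectory(traj, src_ctx, tgt_ctx)
```

The point of the sketch is that no gradient step or fine-tuning is involved: adaptation is a deterministic transform of cached representations, which is what makes the approach training-free.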
Focus-Scan-Refine: From Human Visual Perception to Efficient Visual Token Pruning
Authors: Enwei Tong, Yuanchao Bai, Yao Zhu, Junjun Jiang, Xianming Liu
First: 2026-02-05T16:02:48+00:00 · Latest: 2026-02-05T16:02:48+00:00
Abstract
Vision-language models (VLMs) often generate massive visual tokens that greatly increase inference latency and memory footprint; while training-free token pruning offers a practical remedy, existing methods still struggle to balance local evidence and global context under aggressive compression. We propose Focus-Scan-Refine (FSR), a human-inspired, plug-and-play pruning framework that mimics how humans answer visual questions: focus on key evidence, then scan globally if needed, and refine the scanned context by aggregating relevant details. FSR first focuses on key evidence by combining visual importance with instruction relevance, avoiding the bias toward visually salient but query-irrelevant regions. It then scans for complementary context conditioned on the focused set, selecting tokens that are most different from the focused evidence. Finally, FSR refines the scanned context by aggregating nearby informative tokens into the scan anchors via similarity-based assignment and score-weighted merging, without increasing the token budget. Extensive experiments across multiple VLM backbones and vision-language benchmarks show that FSR consistently improves the accuracy-efficiency trade-off over existing state-of-the-art pruning methods. The source codes can be found at https://github.com/ILOT-code/FSR
Summary
The paper addresses the issue of excessive visual tokens in vision-language models, which increases inference latency and memory usage. It introduces Focus-Scan-Refine (FSR), a human-inspired pruning framework that focuses on key evidence, scans for complementary context, and refines the scanned context. Experiments show that FSR outperforms existing methods in balancing accuracy and efficiency across various vision-language benchmarks.
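The three stages can be sketched over plain token embeddings. The scoring and merging formulas below are illustrative assumptions; only the focus / scan / refine structure follows the abstract.

```python
import numpy as np

def fsr_prune(tokens, importance, relevance, n_focus, n_scan):
    """Illustrative Focus-Scan-Refine pruning over token embeddings."""
    # Focus: combine visual importance with instruction relevance, so visually
    # salient but query-irrelevant tokens do not dominate.
    score = importance * relevance
    focus_idx = np.argsort(-score)[:n_focus]
    # Scan: pick complementary context, i.e. tokens least similar to the
    # focused evidence.
    focus_mean = tokens[focus_idx].mean(axis=0)
    sim = tokens @ focus_mean
    rest = np.setdiff1d(np.arange(len(tokens)), focus_idx)
    scan_idx = rest[np.argsort(sim[rest])[:n_scan]]
    # Refine: merge each remaining token into its most similar scan anchor via
    # score-weighted averaging, without growing the token budget.
    anchors = tokens[scan_idx].copy()
    weights = score[scan_idx].copy()
    for i in np.setdiff1d(rest, scan_idx):
        j = np.argmax(tokens[scan_idx] @ tokens[i])
        anchors[j] = (weights[j] * anchors[j] + score[i] * tokens[i]) / (weights[j] + score[i])
        weights[j] += score[i]
    return np.concatenate([tokens[focus_idx], anchors])

rng = np.random.default_rng(0)
tokens = rng.normal(size=(32, 8))
kept = fsr_prune(tokens, rng.random(32), rng.random(32), n_focus=4, n_scan=2)
```

The output keeps exactly `n_focus + n_scan` tokens regardless of how many were pruned, which is how the refine step preserves context without exceeding the budget.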
When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering
Authors: Mahdi Astaraki, Mohammad Arshi Saloot, Ali Shiraee Kasmaee, Hamidreza Mahyar, Soheila Samiee
First: 2026-01-27T17:35:05+00:00 · Latest: 2026-02-05T15:59:52+00:00
Comments: 27 pages, 15 figures
Abstract
Retrieval-Augmented Generation (RAG) extends large language models (LLMs) beyond parametric knowledge, yet it is unclear when iterative retrieval-reasoning loops meaningfully outperform static RAG, particularly in scientific domains with multi-hop reasoning, sparse domain knowledge, and heterogeneous evidence. We provide the first controlled, mechanism-level diagnostic study of whether synchronized iterative retrieval and reasoning can surpass an idealized static upper bound (Gold Context) RAG. We benchmark eleven state-of-the-art LLMs under three regimes: (i) No Context, measuring reliance on parametric memory; (ii) Gold Context, where all oracle evidence is supplied at once; and (iii) Iterative RAG, a training-free controller that alternates retrieval, hypothesis refinement, and evidence-aware stopping. Using the chemistry-focused ChemKGMultiHopQA dataset, we isolate questions requiring genuine retrieval and analyze behavior with diagnostics spanning retrieval coverage gaps, anchor-carry drop, query quality, composition fidelity, and control calibration. Across models, Iterative RAG consistently outperforms Gold Context, with gains up to 25.6 percentage points, especially for non-reasoning fine-tuned models. Staged retrieval reduces late-hop failures, mitigates context overload, and enables dynamic correction of early hypothesis drift, but remaining failure modes include incomplete hop coverage, distractor latch trajectories, early stopping miscalibration, and high composition failure rates even with perfect retrieval. Overall, staged retrieval is often more influential than the mere presence of ideal evidence; we provide practical guidance for deploying and diagnosing RAG systems in specialized scientific settings and a foundation for more reliable, controllable iterative retrieval-reasoning frameworks.
Summary
This study investigates when iterative retrieval-reasoning loops in RAG outperform static RAG, especially in scientific domains. Using the ChemKGMultiHopQA dataset, it compares eleven state-of-the-art LLMs under three regimes: no context, gold context, and iterative RAG. Iterative RAG consistently outperforms the gold context, with gains up to 25.6 percentage points, particularly for non-reasoning fine-tuned models. Staged retrieval reduces late-hop failures and context overload but faces challenges like incomplete hop coverage and early stopping miscalibration.
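The iterative controller reduces to a short loop once retrieval, hypothesis refinement, and evidence-aware stopping are abstracted into callables. The stubs in this sketch are placeholders, not the paper's components.

```python
def iterative_rag(question, retrieve, reason, should_stop, max_hops=4):
    """Minimal sketch of a training-free iterative RAG controller: alternate
    retrieval, hypothesis refinement, and evidence-aware stopping."""
    evidence, hypothesis = [], None
    for hop in range(max_hops):
        # Later hops query with the refined hypothesis, not the raw question,
        # which is what lets the loop correct early hypothesis drift.
        query = question if hypothesis is None else hypothesis
        evidence.extend(retrieve(query))
        hypothesis = reason(question, evidence)
        if should_stop(hypothesis, evidence):
            break
    return hypothesis, evidence

# Toy run with stub components: stop once two pieces of evidence are gathered.
docs = iter([["doc-1"], ["doc-2"], ["doc-3"]])
answer, ev = iterative_rag(
    "Q",
    retrieve=lambda q: next(docs),
    reason=lambda q, e: f"hypothesis after {len(e)} docs",
    should_stop=lambda h, e: len(e) >= 2,
)
```

The failure modes the paper diagnoses map onto this loop directly: incomplete hop coverage (too few useful `retrieve` calls), distractor latch (a bad `hypothesis` steering later queries), and early-stopping miscalibration (`should_stop` firing too soon).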
Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models
Authors: Jiaqi Leng, Xiang Hu, Junxiong Wang, Jianguo Li, Wei Wu, Yucheng Lu
Venue: ICLR 2026
First: 2025-10-20T06:17:57+00:00 · Latest: 2026-02-05T15:48:38+00:00
Comments: Accepted to ICLR 2026
Abstract
Effectively processing long contexts is a critical challenge for language models. While standard Transformers are limited by quadratic complexity and poor length extrapolation, alternative architectures like sliding window attention and state space models sacrifice the ability to effectively utilize the full context due to their fixed-size memory. Chunk-based sparse attention has emerged as a promising paradigm for extreme length generalization, yet the key architectural principles underpinning its success are not yet fully understood. In this work, we present a systematic dissection of these models to identify the core components driving their performance. Through a unified framework and comprehensive ablation studies, we demonstrate that a combination of three design principles is critical: (1) an expressive, non-linear Chunk Encoder with a dedicated CLS token to produce representations for retrieval; (2) a Bypassing Residual Path to stably integrate retrieved global information without it being overridden by the local residual stream; and (3) enforced selection sparsity during pre-training to bridge the train-test distribution gap. We provide a theoretical motivation for intra-chunk information processing and landmark generation. By combining these principles, we establish a new state-of-the-art for training-free length extrapolation, successfully generalizing models trained on a 4K context to 32 million tokens on RULER and BABILong. Our findings provide a clear and empirically-grounded set of design principles for developing future, highly-capable long-context language models.
Summary
This work aims to improve the ability of language models to handle long contexts by analyzing and enhancing chunk-based sparse attention models. The study identifies three key design principles: an expressive Chunk Encoder with a CLS token, a Bypassing Residual Path, and enforced selection sparsity during pre-training. These principles enable better integration of global information and mitigate the distribution gap between training and testing. As a result, the models can generalize effectively from a 4K context to 32 million tokens on RULER and BABILong, setting a new state-of-the-art for training-free length extrapolation.
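The retrieval step behind principles (1) and (3) can be sketched in a few lines: summarize each chunk into a landmark, keep only the top-k chunks per query, and attend within them. A mean pool stands in for the paper's learned non-linear CLS encoder; everything else here is an illustrative simplification.

```python
import numpy as np

def chunk_sparse_attend(query, keys, values, chunk_size, top_k):
    """Illustrative chunk-based sparse attention for one query vector."""
    n, d = keys.shape
    chunks = n // chunk_size
    # Landmark per chunk (mean pool standing in for a learned CLS encoder).
    landmarks = keys[: chunks * chunk_size].reshape(chunks, chunk_size, d).mean(axis=1)
    # Enforced selection sparsity: retrieve only the top-k chunks.
    picked = np.argsort(-(landmarks @ query))[:top_k]
    idx = np.concatenate([np.arange(c * chunk_size, (c + 1) * chunk_size) for c in picked])
    # Dense attention restricted to the retrieved chunks.
    logits = keys[idx] @ query / np.sqrt(d)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ values[idx]

rng = np.random.default_rng(0)
keys = rng.normal(size=(64, 16))
values = rng.normal(size=(64, 16))
out = chunk_sparse_attend(rng.normal(size=16), keys, values, chunk_size=8, top_k=2)
```

Because cost depends on `top_k * chunk_size` rather than sequence length, the same selection rule applies unchanged at 4K or 32M tokens, which is what makes training-free extrapolation plausible; the paper's principles (1)-(3) concern making the landmarks and the selection distribution good enough for this to work.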
Optimization and Generation in Aerodynamics Inverse Design
Authors: Huaguan Chen, Ning Lin, Luxi Chen, Rui Zhang, Wenbing Huang, Chongxuan Li, Hao Sun
First: 2026-02-03T14:32:26+00:00 · Latest: 2026-02-05T15:47:18+00:00
Abstract
Inverse design with physics-based objectives is challenging because it couples high-dimensional geometry with expensive simulations, as exemplified by aerodynamic shape optimization for drag reduction. We revisit inverse design through two canonical solutions, the optimal design point and the optimal design distribution, and relate them to optimization and guided generation. Building on this view, we propose a new training loss for cost predictors and a density-gradient optimization method that improves objectives while preserving plausible shapes. We further unify existing training-free guided generation methods. To address their inability to approximate conditional covariance in high dimensions, we develop a time- and memory-efficient algorithm for approximate covariance estimation. Experiments on a controlled 2D study and high-fidelity 3D aerodynamic benchmarks (car and aircraft), validated by OpenFOAM simulations and miniature wind-tunnel tests with 3D-printed prototypes, demonstrate consistent gains in both optimization and guided generation. Additional offline RL results further support the generality of our approach.
Summary
The research aims to optimize and generate aerodynamic shapes for drag reduction by addressing the challenges of inverse design with physics-based objectives. The study proposes a new training loss for cost predictors and a density-gradient optimization method, which enhances objectives while maintaining plausible shapes. Experiments on 2D and 3D aerodynamic benchmarks, validated by simulations and wind-tunnel tests, show consistent improvements in both optimization and guided generation.
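The density-gradient idea admits a one-line sketch: follow the cost predictor's gradient downhill while following the data log-density gradient uphill, so the design improves the objective without leaving the region of plausible shapes. The update rule and the toy 1-D objective below are assumptions for illustration, not the paper's formulation.

```python
def density_gradient_step(x, grad_cost, grad_logp, lr=0.1, lam=1.0):
    """One illustrative density-gradient update: descend the predicted cost
    while ascending the data log-density, weighted by lam."""
    return x - lr * (grad_cost(x) - lam * grad_logp(x))

# Toy 1-D check: cost x^2 pulls the design toward 0, a Gaussian prior at 1
# pulls it toward 1; the iteration settles at the compromise x = 1/3.
x = 5.0
for _ in range(200):
    x = density_gradient_step(
        x,
        grad_cost=lambda v: 2 * v,          # d/dv of v^2
        grad_logp=lambda v: -(v - 1.0),     # d/dv of log N(v; 1, 1)
    )
```

In the aerodynamic setting, `grad_cost` would come from a learned drag predictor and `grad_logp` from a generative model over shapes; the balance `lam` trades objective improvement against shape plausibility.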
Allocentric Perceiver: Disentangling Allocentric Reasoning from Egocentric Visual Priors via Frame Instantiation
Authors: Hengyi Wang, Ruiqiang Zhang, Chang Liu, Guanjie Wang, Zehua Ma, Han Fang, Weiming Zhang
First: 2026-02-05T15:45:39+00:00 · Latest: 2026-02-05T15:45:39+00:00
Abstract
With the rising need for spatially grounded tasks such as Vision-Language Navigation/Action, allocentric perception capabilities in Vision-Language Models (VLMs) are receiving growing focus. However, VLMs remain brittle on allocentric spatial queries that require explicit perspective shifts, where the answer depends on reasoning in a target-centric frame rather than the observed camera view. Thus, we introduce Allocentric Perceiver, a training-free strategy that recovers metric 3D states from one or more images with off-the-shelf geometric experts, and then instantiates a query-conditioned allocentric reference frame aligned with the instruction's semantic intent. By deterministically transforming reconstructed geometry into the target frame and prompting the backbone VLM with structured, geometry-grounded representations, Allocentric Perceiver offloads mental rotation from implicit reasoning to explicit computation. We evaluate Allocentric Perceiver across multiple backbone families on spatial reasoning benchmarks, observing consistent and substantial gains ($\sim$10%) on allocentric tasks while maintaining strong egocentric performance, and surpassing both spatial-perception-finetuned models and state-of-the-art open-source and proprietary models.
Summary
The paper introduces Allocentric Perceiver, a method that enhances Vision-Language Models' allocentric reasoning capabilities by transforming reconstructed geometry into a target-centric frame, thereby improving performance on spatial reasoning tasks. The method achieves consistent gains of about 10% on allocentric tasks while maintaining strong egocentric performance, surpassing both spatial-perception-finetuned models and state-of-the-art models.
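Frame instantiation is, at its core, a deterministic change of coordinates. The sketch below builds a target-centric frame from an assumed position and facing direction and re-expresses camera-frame points in it, so "left of the target" becomes an explicit sign test instead of an implicit mental rotation. The axis convention is an assumption, not the paper's.

```python
import numpy as np

def to_allocentric(points, origin, forward, up=np.array([0.0, 0.0, 1.0])):
    """Express camera-frame 3D points in a target-centric frame defined by the
    target's position (origin) and facing direction (forward)."""
    f = forward / np.linalg.norm(forward)
    right = np.cross(f, up)
    right /= np.linalg.norm(right)
    u = np.cross(right, f)
    R = np.stack([right, f, u])  # rows: target's right / forward / up axes
    return (points - origin) @ R.T

# A point one unit to the target's right, for a target at the origin facing +y:
p = to_allocentric(np.array([[1.0, 0.0, 0.0]]),
                   origin=np.zeros(3), forward=np.array([0.0, 1.0, 0.0]))
```

Once points carry target-centric coordinates, an allocentric query like "is the cup to the target's right?" reduces to checking the sign of the first coordinate, which is exactly the kind of computation the method offloads from the VLM.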
Ethology of Latent Spaces
Authors: Philippe Boisnard
First: 2026-02-05T14:37:31+00:00 · Latest: 2026-02-05T14:37:31+00:00
Comments: 23 pages, 14 figures; presented at Hyperheritage International Symposium 9 ( https://paragraphe.univ-paris8.fr/IMG/pdf/programme_colloque_his9_campuscondorcet_v3.pdf ) and accepted for publication, after double-blind peer review, in French in 2026-2027
Abstract
This study challenges the presumed neutrality of latent spaces in vision language models (VLMs) by adopting an ethological perspective on their algorithmic behaviors. Rather than constituting spaces of homogeneous indeterminacy, latent spaces exhibit model-specific algorithmic sensitivities, understood as differential regimes of perceptual salience shaped by training data and architectural choices.
Through a comparative analysis of three models (OpenAI CLIP, OpenCLIP LAION, SigLIP) applied to a corpus of 301 artworks (15th to 20th centuries), we reveal substantial divergences in the attribution of political and cultural categories. Using bipolar semantic axes derived from vector analogies (Mikolov et al., 2013), we show that SigLIP classifies 59.4% of the artworks as politically engaged, compared to only 4% for OpenCLIP. African masks receive the highest political scores in SigLIP while remaining apolitical in OpenAI CLIP. On an aesthetic colonial axis, inter-model discrepancies reach 72.6 percentage points.
We introduce three operational concepts: computational latent politicization, describing the emergence of political categories without intentional encoding; emergent bias, irreducible to statistical or normative bias and detectable only through contrastive analysis; and three algorithmic scopic regimes: entropic (LAION), institutional (OpenAI), and semiotic (SigLIP), which structure distinct modes of visibility. Drawing on Foucault's notion of the archive, Jameson's ideologeme, and Simondon's theory of individuation, we argue that training datasets function as quasi-archives whose discursive formations crystallize within latent space. This work contributes to a critical reassessment of the conditions under which VLMs are applied to digital art history and calls for methodologies that integrate learning architectures into any delegation of cultural interpretation to algorithmic agents.
Summary
This study examines the ethological behaviors of latent spaces in vision language models (VLMs) by analyzing three models (OpenAI CLIP, OpenCLIP LAION, SigLIP) applied to 301 artworks from the 15th to 20th centuries. Key findings include significant differences in the attribution of political and cultural categories, with SigLIP showing a higher percentage of politically engaged artworks compared to other models. The study introduces concepts like computational latent politicization and emergent bias, and identifies three algorithmic scopic regimes: entropic, institutional, and semiotic, which structure distinct modes of visibility in VLMs.
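The bipolar-axis measurement can be reproduced in a few lines: embed two opposing text poles, form their difference vector, and project each artwork embedding onto it. The sketch below uses random vectors in place of real CLIP embeddings; the pole prompts are the analyst's choice, and the study's point is precisely that different models yield divergent scores on the same axis.

```python
import numpy as np

def bipolar_score(embedding, pole_pos, pole_neg):
    """Signed projection of a normalized embedding onto a Mikolov-style
    bipolar axis running from a negative to a positive text pole
    (e.g. "apolitical" -> "politically engaged")."""
    axis = pole_pos - pole_neg
    axis = axis / np.linalg.norm(axis)
    return float(embedding / np.linalg.norm(embedding) @ axis)

rng = np.random.default_rng(0)
pos = rng.normal(size=32)   # stand-in for the positive pole's text embedding
neg = rng.normal(size=32)   # stand-in for the negative pole's text embedding
# An embedding aligned with the positive-pole direction scores near +1.
score = bipolar_score(pos - neg, pos, neg)
```

Comparing these scores across models for the same image corpus is what surfaces the inter-model divergences the paper reports, since each model contributes its own embeddings for both the artworks and the poles.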
ShapeUP: Scalable Image-Conditioned 3D Editing
Authors: Inbar Gat, Dana Cohen-Bar, Guy Levy, Elad Richardson, Daniel Cohen-Or
First: 2026-02-05T13:59:16+00:00 · Latest: 2026-02-05T13:59:16+00:00
Abstract
Recent advancements in 3D foundation models have enabled the generation of high-fidelity assets, yet precise 3D manipulation remains a significant challenge. Existing 3D editing frameworks often face a difficult trade-off between visual controllability, geometric consistency, and scalability. Specifically, optimization-based methods are prohibitively slow, multi-view 2D propagation techniques suffer from visual drift, and training-free latent manipulation methods are inherently bound by frozen priors and cannot directly benefit from scaling. In this work, we present ShapeUP, a scalable, image-conditioned 3D editing framework that formulates editing as a supervised latent-to-latent translation within a native 3D representation. This formulation allows ShapeUP to build on a pretrained 3D foundation model, leveraging its strong generative prior while adapting it to editing through supervised training. In practice, ShapeUP is trained on triplets consisting of a source 3D shape, an edited 2D image, and the corresponding edited 3D shape, and learns a direct mapping using a 3D Diffusion Transformer (DiT). This image-as-prompt approach enables fine-grained visual control over both local and global edits and achieves implicit, mask-free localization, while maintaining strict structural consistency with the original asset. Our extensive evaluations demonstrate that ShapeUP consistently outperforms current trained and training-free baselines in both identity preservation and edit fidelity, offering a robust and scalable paradigm for native 3D content creation.
中文标题/摘要
标题:ShapeUP:可扩展的基于图像的3D编辑
近期3D基础模型的发展使得高保真资产的生成成为可能,但精确的3D操作仍然是一个重大挑战。现有的3D编辑框架往往在视觉可控性、几何一致性与可扩展性之间面临难以调和的权衡。具体来说,基于优化的方法速度过慢,多视角2D传播技术存在视觉漂移问题,而无需训练的潜在操作方法则受限于固定的先验知识,无法直接从可扩展性中获益。在本文中,我们提出了ShapeUP,一种可扩展的、基于图像的3D编辑框架,将编辑形式化为在原生3D表示中的监督潜在域到潜在域的转换。这种形式化使得ShapeUP能够基于预训练的3D基础模型,利用其强大的生成先验,并通过监督训练进行适应。实际上,ShapeUP在包含源3D形状、编辑后的2D图像及其对应的编辑后3D形状的三元组上进行训练,并使用3D扩散变换器(DiT)学习直接映射。这种基于图像的提示方法使得对局部和全局编辑具有精细的视觉控制,并实现了隐式的、无掩码的定位,同时保持与原始资产的严格结构一致性。我们的广泛评估表明,ShapeUP在身份保留和编辑保真度方面始终优于当前的训练有素和无需训练的基线,为原生3D内容创作提供了一个稳健且可扩展的范式。
Summary / 总结
ShapeUP is a scalable 3D editing framework that uses a supervised latent-to-latent translation within a native 3D representation, leveraging a pretrained 3D foundation model. It is trained on triplets of source 3D shapes, edited 2D images, and corresponding edited 3D shapes, using a 3D Diffusion Transformer. ShapeUP provides fine-grained visual control and maintains structural consistency, outperforming current baselines in identity preservation and edit fidelity.
ShapeUP 是一种可扩展的 3D 编辑框架,将编辑视为在原生 3D 表示中的监督潜在到潜在的转换,利用预训练的 3D 基础模型。它使用包含原始 3D 形状、编辑后的 2D 图像及其相应编辑后的 3D 形状的三元组来训练 3D 扩散变换器 (DiT) 进行直接映射。ShapeUP 实现了精细的视觉控制、隐式定位和严格的结构一致性,其在身份保留和编辑保真度方面优于当前基线。
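The core formulation, a supervised latent-to-latent translation trained on triplets of (source shape latent, edited-image embedding, edited shape latent), can be sketched in miniature. The sketch below is an illustrative stand-in only: a linear least-squares map replaces the paper's 3D Diffusion Transformer, and all latents are random toy data.

```python
import numpy as np

# Miniature stand-in for ShapeUP's supervised latent-to-latent formulation:
# learn a map from (source 3D latent, edited 2D image embedding) to the edited
# 3D latent from triplets. A linear least-squares fit replaces the paper's 3D
# Diffusion Transformer, and all latents are random toy data.
rng = np.random.default_rng(0)
d_latent, d_img, n = 16, 8, 256

src = rng.normal(size=(n, d_latent))               # source shape latents
img = rng.normal(size=(n, d_img))                  # edited-image embeddings
W_true = 0.1 * rng.normal(size=(d_latent + d_img, d_latent))
X = np.concatenate([src, img], axis=1)
tgt = X @ W_true                                   # edited shape latents

W, *_ = np.linalg.lstsq(X, tgt, rcond=None)        # supervised translation map
mse = float(np.mean((X @ W - tgt) ** 2))
```

Because the toy targets are generated by an exact linear map, the fitted translation recovers it almost perfectly; the real setting replaces both the data and the map with learned 3D representations.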
Customizing Visual Emotion Evaluation for MLLMs: An Open-vocabulary, Multifaceted, and Scalable Approach
Authors: Daiqing Wu, Dongbao Yang, Sicheng Zhao, Can Ma, Yu Zhou
Venue: ICLR 2026
First: 2025-09-26T06:30:39+00:00 · Latest: 2026-02-05T13:38:54+00:00
Comments: Accepted by ICLR 2026
Abstract
Recently, Multimodal Large Language Models (MLLMs) have achieved exceptional performance across diverse tasks, continually surpassing previous expectations regarding their capabilities. Nevertheless, their proficiency in perceiving emotions from images remains debated, with studies yielding divergent results in zero-shot scenarios. We argue that this inconsistency stems partly from constraints in existing evaluation methods, including the oversight of plausible responses, limited emotional taxonomies, neglect of contextual factors, and labor-intensive annotations. To facilitate customized visual emotion evaluation for MLLMs, we propose an Emotion Statement Judgment task that overcomes these constraints. Complementing this task, we devise an automated pipeline that efficiently constructs emotion-centric statements with minimal human effort. Through systematically evaluating prevailing MLLMs, our study showcases their stronger performance in emotion interpretation and context-based emotion judgment, while revealing relative limitations in comprehending perception subjectivity. When compared to humans, even top-performing MLLMs like GPT4o demonstrate remarkable performance gaps, underscoring key areas for future improvement. By developing a fundamental evaluation framework and conducting a comprehensive MLLM assessment, we hope this work contributes to advancing emotional intelligence in MLLMs. Project page: https://github.com/wdqqdw/MVEI.
中文标题/摘要
标题:为MLLMs定制视觉情感评估:一种开放词汇、多维度和可扩展的方法
近年来,多模态大型语言模型(MLLMs)在多种任务中取得了卓越的性能,不断超越人们对它们能力的预期。然而,它们在从图像中感知情感的能力方面仍存在争议,研究结果在零样本场景中存在分歧。我们认为这种不一致部分源于现有评估方法的限制,包括忽视可能的响应、有限的情感分类、忽略背景因素以及劳动密集型注释。为了促进MLLMs的定制视觉情感评估,我们提出了一项情感陈述判断任务,以克服这些限制。作为这项任务的补充,我们设计了一种自动流水线,能够高效地构建以情感为中心的陈述,同时减少人力投入。通过系统地评估现有的MLLMs,我们的研究展示了它们在情感解释和基于上下文的情感判断方面的更强表现,同时也揭示了它们在理解感知主观性方面的相对局限性。与人类相比,即使是表现最佳的MLLMs如GPT4o也显示出显著的性能差距,突显了未来改进的关键领域。通过开发基本的评估框架并进行全面的MLLM评估,我们希望这项工作能够促进MLLMs的情感智能。
Summary / 总结
This study addresses the inconsistency in evaluating MLLMs' ability to perceive emotions from images by proposing an Emotion Statement Judgment task and an automated pipeline. The research demonstrates that MLLMs perform better in emotion interpretation and context-based emotion judgment but still have limitations in comprehending perception subjectivity, highlighting areas for future improvement. Even top-performing models like GPT4o show significant gaps compared to human performance.
该研究通过提出情感陈述判断任务和自动化管道来解决评估MLLMs从图像中感知情绪的一致性问题。研究发现,MLLMs在情绪解释和基于上下文的情绪判断方面表现更好,但在理解感知主观性方面存在困难,指出了改进的关键领域。即使像GPT4o这样的顶级MLLMs与人类相比也显示出显著的性能差距。
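As a rough illustration of how an Emotion Statement Judgment task might be scored, the snippet below checks binary model verdicts on emotion-centric statements against ground-truth labels. The statements, labels, and model answers are invented for illustration; in the paper such statements are constructed automatically by the proposed pipeline.

```python
# Hypothetical scoring for an Emotion Statement Judgment-style evaluation:
# the model answers True/False to emotion-centric statements about an image,
# and we score agreement with ground-truth labels (all values are invented).
statements = [
    ("The crowd in the image appears joyful.", True),
    ("The photo conveys a sense of dread.", False),
    ("The scene could plausibly evoke nostalgia.", True),
]
model_answers = [True, False, False]  # stand-in model outputs

correct = sum(ans == label for (_, label), ans in zip(statements, model_answers))
accuracy = correct / len(statements)
```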
Geometric Observability Index: An Operator-Theoretic Framework for Per-Feature Sensitivity, Weak Observability, and Dynamic Effects in SE(3) Pose Estimation
Authors: Joe-Mei Feng, Sheng-Wei Yu
First: 2026-02-05T12:12:00+00:00 · Latest: 2026-02-05T12:12:00+00:00
Abstract
We present a unified operator-theoretic framework for analyzing per-feature sensitivity in camera pose estimation on the Lie group SE(3). Classical sensitivity tools - conditioning analyses, Euclidean perturbation arguments, and Fisher information bounds - do not explain how individual image features influence the pose estimate, nor why dynamic or inconsistent observations can disproportionately distort modern SLAM and structure-from-motion systems. To address this gap, we extend influence function theory to matrix Lie groups and derive an intrinsic perturbation operator for left-trivialized M-estimators on SE(3).
The resulting Geometric Observability Index (GOI) quantifies the contribution of a single measurement through the curvature operator and the Lie algebraic structure of the observable subspace. GOI admits a spectral decomposition along the principal directions of the observable curvature, revealing a direct correspondence between weak observability and amplified sensitivity. In the population regime, GOI coincides with the Fisher information geometry on SE(3), yielding a single-measurement analogue of the Cramer-Rao bound.
The same spectral mechanism explains classical degeneracies such as pure rotation and vanishing parallax, as well as dynamic feature amplification along weak curvature directions. Overall, GOI provides a geometrically consistent description of measurement influence that unifies conditioning analysis, Fisher information geometry, influence function theory, and dynamic scene detectability through the spectral geometry of the curvature operator. Because these quantities arise directly within Gauss-Newton pipelines, the curvature spectrum and GOI also yield lightweight, training-free diagnostic signals for identifying dynamic features and detecting weak observability configurations without modifying existing SLAM architectures.
中文标题/摘要
标题:几何可观测性指标:SE(3) 姿态估计中单特征敏感性、弱可观测性和动态效应的算子理论框架
我们提出了一种统一的算子理论框架,用于分析李群SE(3)上相机姿态估计的单特征敏感性。经典的敏感性工具——条件分析、欧几里得扰动论证和费雪信息界——无法解释单个图像特征如何影响姿态估计,也无法解释动态或不一致的观测为何会不成比例地扭曲现代SLAM和运动恢复结构(structure-from-motion)系统。为了解决这一差距,我们将影响函数理论扩展到矩阵李群,并推导出SE(3)上左平凡化M-估计量的内在扰动算子。
由此产生的几何可观测性指标(GOI)通过曲率算子和可观测子空间的李代数结构量化单个测量的贡献。GOI 沿可观测曲率的主方向进行谱分解,揭示了弱可观测性与放大敏感性之间的直接对应关系。在总体情形下,GOI 与SE(3)上的费雪信息几何一致,给出了克拉默-拉奥界(Cramér-Rao bound)的单测量类比。
该相同的谱机制解释了诸如纯旋转和消失的视差等经典退化现象,以及沿着弱曲率方向的动态特征放大。总体而言,GOI 提供了一种几何上一致的测量影响描述,统一了条件分析、费雪信息几何学、影响函数理论和动态场景可检测性,通过曲率算子的谱几何学。由于这些量直接出现在高斯-牛顿管道中,曲率谱和GOI 也提供了轻量级、无需训练的诊断信号,用于识别动态特征和检测弱可观测性配置,而无需修改现有的SLAM架构。
Summary / 总结
The paper introduces a unified operator-theoretic framework called Geometric Observability Index (GOI) to analyze the sensitivity of individual image features in camera pose estimation on the Lie group SE(3). GOI extends influence function theory to matrix Lie groups and provides a spectral decomposition that links weak observability to amplified sensitivity. Key findings include a direct correspondence between GOI and the Fisher information geometry on SE(3), and a geometrically consistent description of measurement influence that unifies various sensitivity analysis tools.
论文提出了几何可观测性指标(GOI),用于分析单个图像特征对SE(3)中相机姿态估计的影响。GOI将影响函数理论扩展到矩阵李群,提供了一种谱分解,将弱可观测性与放大灵敏度直接联系起来。关键发现包括GOI与费雪信息几何之间的直接对应关系,解释了经典退化现象和动态特征放大。GOI提供了一种统一灵敏度分析、费雪信息几何和动态场景可检测性的框架,并给出识别动态特征和弱可观测性配置的轻量级诊断信号。
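A numerical sketch of a GOI-style per-measurement sensitivity score, under simplifying assumptions: a plain R^6 parameter space stands in for se(3), and this is not the paper's exact operator. With per-feature Jacobians and residuals, the Gauss-Newton curvature spectrum shows directly how weak (small-eigenvalue) directions amplify a measurement's influence.

```python
import numpy as np

# GOI-style sketch: for Gauss-Newton curvature H = sum_i J_i^T J_i and
# per-measurement gradient g_i = J_i^T r_i, the scalar index g_i^T H^{-1} g_i
# decomposes along curvature eigendirections and grows large along weakly
# observable (small-eigenvalue) directions. R^6 stands in for se(3) here.
rng = np.random.default_rng(1)
dim, n_meas = 6, 50
J = rng.normal(size=(n_meas, 2, dim))      # per-feature residual Jacobians
r = 0.1 * rng.normal(size=(n_meas, 2))     # per-feature residuals

H = np.einsum("nkd,nke->de", J, J)         # Gauss-Newton curvature operator
lam, V = np.linalg.eigh(H)                 # curvature spectrum

g = np.einsum("nkd,nk->nd", J, r)          # g_i = J_i^T r_i
proj = g @ V                               # components along eigendirections
goi = np.sum(proj**2 / lam, axis=1)        # spectral form of g_i^T H^{-1} g_i

# the spectral form should equal the direct linear solve
direct = np.einsum("nd,nd->n", g, np.linalg.solve(H, g.T).T)
```

The division by the eigenvalues makes the weak-observability amplification explicit: a measurement aligned with a near-degenerate direction (e.g. vanishing parallax) receives a disproportionately large score.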
LoGoSeg: Integrating Local and Global Features for Open-Vocabulary Semantic Segmentation
Authors: Junyang Chen, Xiangbo Lv, Zhiqiang Kou, Xingdong Sheng, Ning Xu, Yiguo Qiao
First: 2026-02-05T12:03:11+00:00 · Latest: 2026-02-05T12:03:11+00:00
Abstract
Open-vocabulary semantic segmentation (OVSS) extends traditional closed-set segmentation by enabling pixel-wise annotation for both seen and unseen categories using arbitrary textual descriptions. While existing methods leverage vision-language models (VLMs) like CLIP, their reliance on image-level pretraining often results in imprecise spatial alignment, leading to mismatched segmentations in ambiguous or cluttered scenes. However, most existing approaches lack strong object priors and region-level constraints, which can lead to object hallucination or missed detections, further degrading performance. To address these challenges, we propose LoGoSeg, an efficient single-stage framework that integrates three key innovations: (i) an object existence prior that dynamically weights relevant categories through global image-text similarity, effectively reducing hallucinations; (ii) a region-aware alignment module that establishes precise region-level visual-textual correspondences; and (iii) a dual-stream fusion mechanism that optimally combines local structural information with global semantic context. Unlike prior works, LoGoSeg eliminates the need for external mask proposals, additional backbones, or extra datasets, ensuring efficiency. Extensive experiments on six benchmarks (A-847, PC-459, A-150, PC-59, PAS-20, and PAS-20b) demonstrate its competitive performance and strong generalization in open-vocabulary settings.
中文标题/摘要
标题:LoGoSeg:结合局部和全局特征的开放词汇语义分割
开放词汇语义分割(OVSS)扩展了传统的封闭集分割,通过任意文本描述对已见和未见类别进行像素级标注。现有方法利用如CLIP等视觉语言模型(VLMs),但其依赖于图像级预训练往往导致空间对齐不精确,导致在模糊或杂乱场景中出现不匹配的分割。然而,大多数现有方法缺乏强大的对象先验和区域级约束,这可能导致对象幻觉或漏检,进一步降低性能。为解决这些挑战,我们提出LoGoSeg,一种高效的单阶段框架,集成了三个关键创新:(i)对象存在先验,通过全局图像-文本相似性动态加权相关类别,有效减少幻觉;(ii)区域感知对齐模块,建立精确的区域级视觉-文本对应关系;(iii)双流融合机制,最优结合局部结构信息与全局语义上下文。与先前工作不同,LoGoSeg消除了对外部掩码提案、额外骨干网络或额外数据集的需求,确保高效性。在六个基准(A-847、PC-459、A-150、PC-59、PAS-20和PAS-20b)上的广泛实验表明,其在开放词汇设置中的性能和泛化能力具有竞争力。
Summary / 总结
LoGoSeg is designed to improve open-vocabulary semantic segmentation by integrating local and global features, addressing the limitations of existing methods that rely on image-level pretraining and lack strong object priors. It introduces an object existence prior, a region-aware alignment module, and a dual-stream fusion mechanism to enhance spatial alignment and reduce hallucinations. Experiments on six benchmarks show that LoGoSeg outperforms existing methods in open-vocabulary settings and demonstrates strong generalization capabilities.
LoGoSeg 是一种高效的单阶段框架,用于开放词汇语义分割,通过整合对象存在先验、区域感知对齐和双流融合来解决现有方法的局限性。它基于全局图像-文本相似性动态加权类别,建立精确的区域级对应关系,并结合局部和全局信息。在六个基准上的实验展示了其在开放词汇环境中的竞争力和强泛化能力。
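The object-existence-prior idea can be sketched with a few lines of array code: gate per-class pixel logits by a softmaxed global image-text similarity so classes unlikely to be present are down-weighted. All embeddings below are random stand-ins; LoGoSeg's actual module is learned inside a single-stage network.

```python
import numpy as np

# Minimal sketch of an object existence prior: classes whose text embedding
# matches the global image embedding get up-weighted, reducing hallucinated
# segments for absent classes. Embeddings here are random stand-ins.
rng = np.random.default_rng(2)
n_classes, h, w, d = 5, 4, 4, 32

img_global = rng.normal(size=d)                  # global image embedding
text_emb = rng.normal(size=(n_classes, d))       # class-name text embeddings
pixel_logits = rng.normal(size=(n_classes, h, w))

sim = text_emb @ img_global                      # global image-text similarity
prior = np.exp(sim) / np.exp(sim).sum()          # object existence prior
gated = pixel_logits * prior[:, None, None]      # dynamically weight classes

seg = gated.argmax(axis=0)                       # per-pixel class prediction
```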
PIO-FVLM: Rethinking Training-Free Visual Token Reduction for VLM Acceleration from an Inference-Objective Perspective
Authors: Haokui Zhang, Congyang Ou, Dawei Yan, Peng Wang, Qingsen Yan, Ying Li, Rong Xiao, Chunhua Shen
First: 2026-02-04T15:33:10+00:00 · Latest: 2026-02-05T12:00:10+00:00
Abstract
Recently, reducing redundant visual tokens in vision-language models (VLMs) to accelerate VLM inference has emerged as a hot topic. However, most existing methods rely on heuristics constructed from inter-visual-token similarity or cross-modal visual-text similarity, which limits both compression performance and practical deployment. In contrast, we propose PIO-FVLM from the perspective of inference objectives, which reframes visual token compression as preserving output invariance and selects tokens primarily by their importance to this goal. Specifically, visual tokens are reordered under the guidance of token-level gradient saliency generated by our designed layer-local proxy loss, a coarse constraint linking the current layer to the final result. The most valuable visual tokens are then selected following the non-maximum suppression (NMS) principle. The proposed PIO-FVLM is training-free and compatible with FlashAttention, making it friendly to practical application and deployment. It can be deployed independently as an encoder-free method, or combined with encoder compression approaches such as VisionZip as an encoder-involved method. On LLaVA-Next-7B, PIO-FVLM retains just 11.1% of visual tokens but maintains 97.2% of the original performance, with a 2.67$\times$ prefill speedup, 2.11$\times$ inference speedup, 6.22$\times$ lower FLOPs, and 6.05$\times$ reduced KV Cache overhead. Our code is available at https://github.com/ocy1/PIO-FVLM.
中文标题/摘要
标题:PIO-FVLM:从推理目标视角重新思考无训练视觉标记缩减以加速VLM
近年来,减少视觉语言模型(VLMs)中的冗余视觉标记以加速VLM推理已成为一个热点话题。然而,大多数现有方法依赖于基于视觉标记间相似性或跨模态视觉-文本相似性的启发式构造,这在压缩性能和实际部署方面存在一定的局限性。相比之下,我们从推理目标的角度提出了PIO-FVLM,将视觉标记压缩转化为保持输出结果不变性,并主要通过其对这一目标的重要性来选择标记。特别地,视觉标记在我们设计的层局部代理损失指导下重新排序,这是一种来自当前层到最终结果的粗略约束。然后,根据非极大值抑制(NMS)原则选择最有价值的视觉标记。提出的PIO-FVLM是无训练的,并且与FlashAttention兼容,易于实际应用和部署。它可以独立部署为一种无需编码器的方法,或者与VisionZip等编码器压缩方法结合使用,作为一种涉及编码器的方法。在LLaVA-Next-7B上,PIO-FVLM仅保留了11.1%的视觉标记,但保持了97.2%的原始性能,预填充速度提高了2.67倍,推理速度提高了2.11倍,FLOPs降低了6.22倍,KV缓存开销减少了6.05倍。我们的代码可在https://github.com/ocy1/PIO-FVLM获取。
Summary / 总结
The paper proposes PIO-FVLM, which aims to reduce redundant visual tokens in vision-language models for faster inference without training. It uses token-level gradient saliency to reorder and select the most important tokens based on their contribution to output invariance. On LLaVA-Next-7B, PIO-FVLM retains only 11.1% of visual tokens while maintaining 97.2% of the original performance, achieving significant speedups and reduced computational resources.
PIO-FVLM旨在通过减少视觉语言模型中的冗余视觉标记来加速推理,无需训练。它利用标记级别的梯度显著性重新排序并选择最重要的标记,以保持输出结果不变。在LLaVA-Next-7B上,它仅保留11.1%的视觉标记,同时保持97.2%的原始性能,实现了显著的加速和减少计算资源。该方法无需训练,兼容FlashAttention,可以作为无编码器或带编码器的方法进行部署。
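The selection step described above, saliency ranking followed by NMS-style suppression, can be sketched over 1-D token positions. In PIO-FVLM the scores come from token-level gradient saliency of a layer-local proxy loss; random scores stand in here, and any token within a window of a kept token is treated as redundant.

```python
import numpy as np

# Sketch of saliency-ranked token selection with NMS-style suppression over
# 1-D token positions. Scores are random stand-ins for the paper's
# gradient-saliency values from a layer-local proxy loss.
rng = np.random.default_rng(3)
n_tokens, keep, window = 100, 12, 3

saliency = rng.random(n_tokens)
order = np.argsort(-saliency)                # most important tokens first

selected = []
suppressed = np.zeros(n_tokens, dtype=bool)
for idx in order:
    if suppressed[idx]:
        continue
    selected.append(int(idx))
    lo, hi = max(0, idx - window), min(n_tokens, idx + window + 1)
    suppressed[lo:hi] = True                 # NMS: drop redundant neighbours
    if len(selected) == keep:
        break
```

The suppression window plays the role of the overlap threshold in classical detection NMS: it prevents the budget from being spent on clusters of near-duplicate tokens.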
TangramSR: Can Vision-Language Models Reason in Continuous Geometric Space?
Authors: Yikun Zong, Cheston Tan
First: 2026-02-05T11:49:30+00:00 · Latest: 2026-02-05T11:49:30+00:00
Comments: 13 pages, 4 figures
Abstract
Humans excel at spatial reasoning tasks like Tangram puzzle assembly through cognitive processes involving mental rotation, iterative refinement, and visual feedback. Inspired by how humans solve Tangram puzzles through trial-and-error, observation, and correction, we design a framework that models these human cognitive mechanisms. However, comprehensive experiments across five representative Vision-Language Models (VLMs) reveal systematic failures in continuous geometric reasoning: average IoU of only 0.41 on single-piece tasks, dropping to 0.23 on two-piece composition, far below human performance: even children can complete Tangram tasks successfully. This paper addresses a fundamental challenge in self-improving AI: can models iteratively refine their predictions at test time without parameter updates? We introduce a test-time self-refinement framework that combines in-context learning (ICL) with reward-guided feedback loops, inspired by human cognitive processes. Our training-free verifier-refiner agent applies recursive refinement loops that iteratively self-refine predictions based on geometric consistency feedback, achieving IoU improvements from 0.63 to 0.932 on medium-triangle cases without any model retraining. This demonstrates that incorporating human-inspired iterative refinement mechanisms through ICL and reward loops can substantially enhance geometric reasoning in VLMs, moving self-improving AI from promise to practice in continuous spatial domains. Our work is available at this anonymous link https://anonymous.4open.science/r/TangramVLM-F582/.
中文标题/摘要
标题:TangramSR:视觉语言模型能否在连续几何空间中进行推理?
人类在完成如七巧板拼图等空间推理任务时,通过认知过程中的心理旋转、迭代细化和视觉反馈表现出色。受人类通过试错、观察和修正解决七巧板问题的启发,我们设计了一个框架来模拟这些人类的认知机制。然而,对五种代表性视觉语言模型(VLMs)进行全面实验后发现,它们在连续几何推理方面存在系统性失败:单块任务的平均IoU仅为0.41,两块组合任务降至0.23,远低于人类表现,儿童可以成功完成七巧板任务。本文探讨了自改进AI中的一个基本挑战:模型能否在测试时迭代细化预测而无需参数更新?我们引入了一个结合上下文学习(ICL)和奖励引导反馈循环的测试时自我细化框架。我们的无训练验证-细化代理应用递归细化循环,基于几何一致性反馈迭代自我细化预测,无需任何模型重训练即可将中三角形案例的IoU从0.63提高到0.932。这表明,通过ICL和奖励循环引入的人类启发式迭代细化机制可以显著增强VLMs的几何推理能力,将自改进AI从理论推向实践。我们的工作可在以下匿名链接获取:https://anonymous.4open.science/r/TangramVLM-F582/
Summary / 总结
This paper explores whether Vision-Language Models can perform continuous geometric reasoning, inspired by human cognitive processes in solving Tangram puzzles. Comprehensive experiments show that five VLMs struggle with geometric tasks, achieving low IoU scores. The authors propose a test-time self-refinement framework combining in-context learning and reward-guided feedback, which significantly improves IoU from 0.63 to 0.932 on medium-triangle cases without retraining. This demonstrates the potential of human-inspired iterative refinement mechanisms in enhancing geometric reasoning in VLMs.
该论文研究了视觉-语言模型(VLMs)是否能够进行连续几何推理,灵感来源于人类在解拼图时的认知过程。实验表明,VLMs在几何推理方面表现不佳,单块任务的IoU仅为0.41,两块组合任务仅为0.23。为解决这一问题,作者提出了一种结合上下文学习和奖励引导反馈循环的测试时自我精炼框架,该框架在不重新训练模型的情况下,将中三角形任务的IoU显著提高到0.932。这表明,通过上下文学习和奖励循环引入的人类启发式迭代精炼机制可以显著增强VLMs的几何推理能力。
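The verifier-refiner loop has a simple skeleton: propose a placement, score it with a geometric reward (IoU), and keep refining while the reward improves. In the toy sketch below a greedy coordinate search over a box stands in for the VLM refiner, and the target region stands in for the puzzle's ground truth; only the loop structure mirrors the paper.

```python
def iou(a, b):
    # axis-aligned box IoU; boxes are (x0, y0, x1, y1)
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Toy verifier-refiner loop: greedily nudge a proposed placement to maximize
# the IoU reward; in the paper the proposals come from a VLM and the feedback
# is geometric consistency, so this search is only a structural stand-in.
target = (2.0, 2.0, 6.0, 6.0)
guess = [0.0, 0.0, 4.0, 4.0]
for _ in range(200):
    best, best_r = None, iou(tuple(guess), target)
    for j in range(4):
        for step in (-0.1, 0.1):
            cand = list(guess)
            cand[j] += step
            r = iou(tuple(cand), target)
            if r > best_r:
                best, best_r = cand, r
    if best is None:          # verifier finds no improving refinement
        break
    guess = best
final_iou = iou(tuple(guess), target)
```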
Plug-and-play linear attention with provable guarantees for training-free image restoration
Authors: Srinivasan Kidambi, Karthik Palaniappan, Pravin Nair
First: 2025-06-10T07:37:41+00:00 · Latest: 2026-02-05T11:35:14+00:00
Abstract
Multi-head self-attention (MHSA) is a key building block in modern vision Transformers, yet its quadratic complexity in the number of tokens remains a major bottleneck for real-time and resource-constrained deployment. We present PnP-Nystra, a training-free Nyström-based linear attention module designed as a plug-and-play replacement for MHSA in pretrained image restoration Transformers, with provable kernel approximation error guarantees. PnP-Nystra integrates directly into window-based architectures such as SwinIR, Uformer, and Dehazeformer, yielding efficient inference without finetuning. Across denoising, deblurring, dehazing, and super-resolution on images, PnP-Nystra delivers $1.8$--$3.6\times$ speedups on an NVIDIA RTX 4090 GPU and $1.8$--$7\times$ speedups on CPU inference. Compared with the strongest training-free linear-attention baselines we evaluate, our method incurs the smallest quality drop and stays closest to the original model's outputs.
中文标题/摘要
标题:即插即用线性注意力:具有可证明保证的无训练图像恢复
多头自注意力(MHSA)是现代视觉Transformer的关键构建块,但其在标记数量上的二次复杂性仍然是实时和资源受限部署的主要瓶颈。我们提出了PnP-Nystra,这是一种基于Nyström的无训练线性注意力模块,旨在作为MHSA在预训练图像恢复Transformer中的即插即用替代品,具有可证明的核逼近误差保证。PnP-Nystra可以直接集成到基于窗口的架构中,如SwinIR、Uformer和Dehazeformer,无需微调即可实现高效的推理。在图像去噪、去模糊、去雾和超分辨率上,PnP-Nystra在NVIDIA RTX 4090 GPU上的速度提升为1.8-3.6倍,在CPU推理上的速度提升为1.8-7倍。与我们评估的最强无训练线性注意力基线相比,我们的方法产生的质量下降最小,并且最接近原始模型的输出。
VLN-Pilot: Large Vision-Language Model as an Autonomous Indoor Drone Operator
Authors: Bessie Dominguez-Dager, Sergio Suescun-Ferrandiz, Felix Escalona, Francisco Gomez-Donoso, Miguel Cazorla
First: 2026-02-05T11:23:11+00:00 · Latest: 2026-02-05T11:23:11+00:00
Abstract
This paper introduces VLN-Pilot, a novel framework in which a large Vision-and-Language Model (VLLM) assumes the role of a human pilot for indoor drone navigation. By leveraging the multimodal reasoning abilities of VLLMs, VLN-Pilot interprets free-form natural language instructions and grounds them in visual observations to plan and execute drone trajectories in GPS-denied indoor environments. Unlike traditional rule-based or geometric path-planning approaches, our framework integrates language-driven semantic understanding with visual perception, enabling context-aware, high-level flight behaviors with minimal task-specific engineering. VLN-Pilot supports fully autonomous instruction-following for drones by reasoning about spatial relationships, obstacle avoidance, and dynamic reactivity to unforeseen events. We validate our framework on a custom photorealistic indoor simulation benchmark and demonstrate the ability of the VLLM-driven agent to achieve high success rates on complex instruction-following tasks, including long-horizon navigation with multiple semantic targets. Experimental results highlight the promise of replacing remote drone pilots with a language-guided autonomous agent, opening avenues for scalable, human-friendly control of indoor UAVs in tasks such as inspection, search-and-rescue, and facility monitoring. Our results suggest that VLLM-based pilots may dramatically reduce operator workload while improving safety and mission flexibility in constrained indoor environments.
中文标题/摘要
标题:VLN-Pilot: 大型视觉语言模型作为自主室内无人机操作员
本文介绍了VLN-Pilot,这是一种新颖的框架,在该框架中,大型多模态视觉和语言模型(VLLM)承担了室内无人机导航的人类飞行员角色。通过利用VLLM的多模态推理能力,VLN-Pilot 解释自然语言指令并将其与视觉观察相结合,以规划和执行无人机轨迹,适用于GPS受限的室内环境。与传统的基于规则或几何路径规划方法不同,我们的框架将语言驱动的语义理解与视觉感知相结合,使无人机能够实现上下文感知的高级飞行行为,而无需特定任务的工程。VLN-Pilot 通过推理空间关系、障碍物规避以及对未预见事件的动态反应,支持无人机的完全自主指令跟随。我们在一个自定义的逼真室内模拟基准上验证了我们的框架,并展示了由VLLM驱动的代理在复杂指令跟随任务中实现高成功率的能力,包括多目标的长期导航。实验结果突显了用语言引导的自主代理取代远程无人机飞行员的潜力,为室内无人机在检查、搜索与救援、设施监控等任务中的可扩展、用户友好的控制打开了途径。我们的结果表明,基于VLLM的飞行员可能大幅减少操作员的工作量,同时在受限的室内环境中提高安全性和任务灵活性。
Summary / 总结
VLN-Pilot is a framework where a large Vision-and-Language Model (VLLM) acts as an indoor drone pilot, interpreting natural language instructions and visual observations to navigate drones autonomously in GPS-denied environments. The model integrates semantic understanding and visual perception to perform context-aware flight behaviors. Experiments show high success rates in complex instruction-following tasks, indicating the potential for reducing operator workload and improving safety in indoor UAV operations.
VLN-Pilot 是一个框架,利用大型 Vision-and-Language 模型(VLLM)根据自然语言指令自主导航室内无人机。它解释自然语言并结合视觉观察来规划和执行轨迹。实验结果显示在复杂任务中,包括长时间导航多个目标,具有较高的成功率,表明有可能减少操作员的工作负担并提高室内环境中的安全性。
When Shared Knowledge Hurts: Spectral Over-Accumulation in Model Merging
Authors: Yayuan Li, Ze Peng, Jian Zhang, Jintao Guo, Yue Duan, Yinghuan Shi
First: 2026-02-05T10:52:36+00:00 · Latest: 2026-02-05T10:52:36+00:00
Abstract
Model merging combines multiple fine-tuned models into a single model by adding their weight updates, providing a lightweight alternative to retraining. Existing methods primarily target resolving conflicts between task updates, leaving the failure mode of over-counting shared knowledge unaddressed. We show that when tasks share aligned spectral directions (i.e., overlapping singular vectors), a simple linear combination repeatedly accumulates these directions, inflating the singular values and biasing the merged model toward shared subspaces. To mitigate this issue, we propose Singular Value Calibration (SVC), a training-free and data-free post-processing method that quantifies subspace overlap and rescales inflated singular values to restore a balanced spectrum. Across vision and language benchmarks, SVC consistently improves strong merging baselines and achieves state-of-the-art performance. Furthermore, by modifying only the singular values, SVC improves the performance of Task Arithmetic by 13.0%. Code is available at: https://github.com/lyymuwu/SVC.
中文标题/摘要
标题:共享知识的反面效应:模型合并中的光谱过积累
模型合并通过将多个微调模型的权重更新相加,将它们合并为一个模型,提供了一种轻量级的替代重新训练的方法。现有方法主要针对解决任务更新之间的冲突,而忽略了过度计算共享知识的失败模式。我们展示了当任务共享对齐的光谱方向(即重叠的奇异向量)时,简单的线性组合会反复积累这些方向,膨胀奇异值并使合并模型偏向共享子空间。为了缓解这一问题,我们提出了奇异值校准(SVC),这是一种无需训练和数据的后处理方法,量化子空间重叠并重新缩放膨胀的奇异值以恢复平衡的光谱。在视觉和语言基准测试中,SVC 一致地改进了强大的合并基线并达到了最先进的性能。此外,通过仅修改奇异值,SVC 将任务算术的性能提高了13.0%。代码可在:https://github.com/lyymuwu/SVC。
Summary / 总结
The paper addresses the issue of spectral over-accumulation in model merging, where shared knowledge between tasks is over-counted, leading to biased merged models. The authors propose Singular Value Calibration (SVC), a training-free and data-free method that rescales inflated singular values to restore a balanced spectrum, thereby mitigating the problem. Experiments on vision and language benchmarks show that SVC improves strong merging baselines and achieves state-of-the-art performance, with a 13.0% improvement in Task Arithmetic performance.
论文解决了模型合并中由于任务间共享知识的过度累积导致模型偏差的问题。提出了一种无需训练和数据的Singular Value Calibration (SVC)方法,通过重新调整放大的奇异值来恢复平衡的谱。实验表明,SVC能够提升强合并基线,并在视觉和语言基准测试中达到最先进的性能,同时将Task Arithmetic的性能提升了13.0%。
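Spectral over-accumulation is easy to reproduce in miniature: if several task updates share one rank-1 direction, a plain sum inflates that direction's singular value by the number of tasks. The calibration below (count how many tasks align with each merged singular direction and divide) is a simplified stand-in for the paper's SVC, on toy random matrices.

```python
import numpy as np

# Toy illustration of spectral over-accumulation and an SVC-style fix. Three
# task updates share one rank-1 direction, so summing triples its singular
# value; dividing by the number of aligned tasks restores the balance.
rng = np.random.default_rng(5)
d = 16

def rand_rank1():
    a, b = rng.normal(size=d), rng.normal(size=d)
    return np.outer(a / np.linalg.norm(a), b / np.linalg.norm(b))

shared = rand_rank1()
tasks = [shared + 0.3 * rand_rank1() for _ in range(3)]

merged = np.sum(tasks, axis=0)
U, S, Vt = np.linalg.svd(merged)

# how many task updates align with each merged singular direction?
proj = np.array([[abs(U[:, k] @ t @ Vt[k]) for t in tasks] for k in range(d)])
counts = np.maximum((proj > 0.2).sum(axis=1), 1)

S_cal = S / counts                     # deflate over-counted shared directions
calibrated = (U * S_cal) @ Vt
```

Note that only the singular values change; the singular vectors, and hence the merged subspaces themselves, are untouched, mirroring the paper's post-processing-only design.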
Detecting Misbehaviors of Large Vision-Language Models by Evidential Uncertainty Quantification
Authors: Tao Huang, Rui Wang, Xiaofei Liu, Yi Qin, Li Duan, Liping Jing
Venue: ICLR 2026
First: 2026-02-05T10:51:39+00:00 · Latest: 2026-02-05T10:51:39+00:00
Comments: Accepted to ICLR 2026. Code is available at https://github.com/HT86159/EUQ
Abstract
Large vision-language models (LVLMs) have shown substantial advances in multimodal understanding and generation. However, when presented with incompetent or adversarial inputs, they frequently produce unreliable or even harmful content, such as fact hallucinations or dangerous instructions. This misalignment with human expectations, referred to as \emph{misbehaviors} of LVLMs, raises serious concerns for deployment in critical applications. These misbehaviors are found to stem from epistemic uncertainty, specifically either conflicting internal knowledge or the absence of supporting information. However, existing uncertainty quantification methods, which typically capture only overall epistemic uncertainty, have shown limited effectiveness in identifying such issues. To address this gap, we propose Evidential Uncertainty Quantification (EUQ), a fine-grained method that captures both information conflict and ignorance for effective detection of LVLM misbehaviors. In particular, we interpret features from the model output head as either supporting (positive) or opposing (negative) evidence. Leveraging Evidence Theory, we model and aggregate this evidence to quantify internal conflict and knowledge gaps within a single forward pass. We extensively evaluate our method across four categories of misbehavior, including hallucinations, jailbreaks, adversarial vulnerabilities, and out-of-distribution (OOD) failures, using state-of-the-art LVLMs, and find that EUQ consistently outperforms strong baselines, showing that hallucinations correspond to high internal conflict and OOD failures to high ignorance. Furthermore, layer-wise evidential uncertainty dynamics analysis helps interpret the evolution of internal representations from a new perspective. The source code is available at https://github.com/HT86159/EUQ.
中文标题/摘要
标题:通过证据不确定性量化检测大型视觉-语言模型的不当行为
大型视觉-语言模型(LVLMs)在多模态理解和生成方面取得了显著进展。然而,当面对无能或对抗性输入时,它们经常生成不可靠甚至有害的内容,如事实幻觉或危险指令。这种与人类期望的不一致,被称为LVLMs的不当行为,对关键应用中的部署提出了严重关切。这些不当行为被发现源自于认识不确定性,具体来说是内部知识的冲突或缺乏支持信息。然而,现有的不确定性量化方法,通常只能捕捉整体认识不确定性,对于识别这些问题效果有限。为解决这一差距,我们提出了一种细粒度的方法——证据不确定性量化(EUQ),该方法能够同时捕捉信息冲突和无知,从而有效检测LVLM的不当行为。特别是,我们将模型输出头的特征解释为支持(正面)或反对(负面)的证据。利用证据理论,我们建模并聚合这些证据,在单次前向传播中量化内部冲突和知识空白。我们使用最先进的LVLMs在四个类别(幻觉、越狱、对抗性漏洞和分布外失败)的不当行为上进行了广泛评估,发现EUQ始终优于强基线,表明幻觉对应于高内部冲突,而分布外失败对应于高无知。此外,逐层证据不确定性动态分析有助于从新视角解释内部表示的演变。源代码可在https://github.com/HT86159/EUQ获取。
Summary / 总结
This paper addresses the issue of misbehaviors in large vision-language models (LVLMs) by proposing Evidential Uncertainty Quantification (EUQ), which captures both information conflict and ignorance. The method interprets model output features as supporting or opposing evidence and uses Evidence Theory to quantify internal conflict and knowledge gaps. EUQ outperforms strong baselines in detecting various misbehaviors, including hallucinations and out-of-distribution failures, by consistently identifying high internal conflict and ignorance, respectively. Layer-wise analysis further helps interpret the evolution of internal representations.
本文提出了一种称为证据不确定性量化(EUQ)的方法,用于检测大型视觉-语言模型(LVLM)的不良行为,该方法通过将模型输出特征解释为支持或反对证据,并利用证据理论量化内部冲突和知识空白。该方法在四个类别的一系列不良行为上进行了评估,并优于强基线,表明幻觉与高内部冲突相关,而离分布失败与高无知相关。进一步的逐层分析有助于从新视角解释模型内部表示的演变。
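The distinction between conflict and ignorance can be made concrete with subjective-logic-style formulas, used here as a stand-in for the paper's Evidence Theory aggregation over output-head features. The evidence values below are invented to mimic the two failure modes.

```python
# Conflict vs. ignorance from positive/negative evidence (subjective-logic
# style; the paper's aggregation over output-head features differs in detail).
def evidential_scores(e_pos, e_neg, prior_weight=2.0):
    s = e_pos + e_neg + prior_weight
    belief, disbelief = e_pos / s, e_neg / s
    ignorance = prior_weight / s       # high when there is little evidence
    conflict = belief * disbelief      # high when evidence points both ways
    return conflict, ignorance

# hallucination-like case: plenty of evidence, but contradictory
c_conf, i_conf = evidential_scores(e_pos=10.0, e_neg=9.0)
# OOD-like case: almost no evidence either way
c_ood, i_ood = evidential_scores(e_pos=0.3, e_neg=0.2)
```

The two cases separate exactly as the paper reports: contradictory-evidence inputs score high on conflict, while evidence-poor inputs score high on ignorance, which is what a single overall uncertainty scalar cannot distinguish.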
SSG: Scaled Spatial Guidance for Multi-Scale Visual Autoregressive Generation
Authors: Youngwoo Shin, Jiwan Hur, Junmo Kim
Venue: ICLR 2026
First: 2026-02-05T10:48:58+00:00 · Latest: 2026-02-05T10:48:58+00:00
Comments: Accepted to ICLR 2026
Abstract
Visual autoregressive (VAR) models generate images through next-scale prediction, naturally achieving coarse-to-fine, fast, high-fidelity synthesis mirroring human perception. In practice, this hierarchy can drift at inference time, as limited capacity and accumulated error cause the model to deviate from its coarse-to-fine nature. We revisit this limitation from an information-theoretic perspective and deduce that ensuring each scale contributes high-frequency content not explained by earlier scales mitigates the train-inference discrepancy. With this insight, we propose Scaled Spatial Guidance (SSG), training-free, inference-time guidance that steers generation toward the intended hierarchy while maintaining global coherence. SSG emphasizes target high-frequency signals, defined as the semantic residual, isolated from a coarser prior. To obtain this prior, we leverage a principled frequency-domain procedure, Discrete Spatial Enhancement (DSE), which is devised to sharpen and better isolate the semantic residual through frequency-aware construction. SSG applies broadly across VAR models leveraging discrete visual tokens, regardless of tokenization design or conditioning modality. Experiments demonstrate SSG yields consistent gains in fidelity and diversity while preserving low latency, revealing untapped efficiency in coarse-to-fine image generation. Code is available at https://github.com/Youngwoo-git/SSG.
中文标题/摘要
标题:SSG:多尺度视觉自回归生成的缩放空间指导
视觉自回归(VAR)模型通过下一级预测生成图像,自然实现从粗到细、快速、高保真的合成,模拟人类感知。实践中,这种层次结构在推理时可能会漂移,由于容量有限和累积误差,模型会偏离其从粗到细的性质。我们从信息论的角度重新审视这一限制,并推断确保每个尺度贡献未被早期尺度解释的高频内容可以缓解训练与推理之间的差异。基于这一洞察,我们提出了缩放空间指导(SSG),一种无需训练的推理时指导,引导生成向预期的层次结构发展,同时保持全局一致性。SSG 强调目标高频信号,定义为语义残差,从较粗的先验中隔离出来。为了获得这一先验,我们利用了一个基于频率域的原理性程序,离散空间增强(DSE),旨在通过频率感知构建更好地突出和隔离语义残差。SSG 在利用离散视觉标记的 VAR 模型中广泛适用,无论标记设计或条件模态如何。实验表明,SSG 在保真度和多样性方面提供了持续的改进,同时保持低延迟,揭示了粗到细图像生成中的未开发效率。代码可在 https://github.com/Youngwoo-git/SSG 获取。
Summary / 总结
The paper addresses the issue of hierarchical drift in visual autoregressive models during inference, proposing Scaled Spatial Guidance (SSG) to ensure each scale contributes unique high-frequency content. SSG, a training-free method applied at inference time, guides generation to maintain the intended hierarchy and global coherence. Experiments show SSG improves image fidelity and diversity without increasing latency. Code is available on GitHub.
论文解决了视觉自回归模型在推理过程中层级漂移的问题,提出了Scaled Spatial Guidance (SSG)来缓解这一问题。SSG在推理时引导生成过程,保持预期的从粗到细的层级结构,同时确保全局一致性。它利用Discrete Spatial Enhancement (DSE)来分离语义残差,并强调高频信号,适用于各种VAR模型。实验表明,SSG在提高图像保真度和多样性的同时,不会增加延迟。
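The semantic-residual idea behind DSE/SSG can be sketched in the frequency domain, with the caveat that the real method operates on discrete token maps inside a VAR model; here a 1-D signal stands in. A coarse prior keeps only low frequencies, the residual is exactly the content earlier scales cannot explain, and a guidance scale amplifies it.

```python
import numpy as np

# Frequency-domain sketch of the semantic residual: subtract a low-pass
# "coarse prior" from the full signal, then amplify the leftover
# high-frequency content with a guidance scale (1-D toy stand-in).
rng = np.random.default_rng(7)
n, scale, cutoff = 64, 1.5, 8

fine = rng.normal(size=n)
spec = np.fft.rfft(fine)
spec[cutoff:] = 0.0                     # coarse prior: low frequencies only
coarse = np.fft.irfft(spec, n=n)

residual = fine - coarse                # high-frequency "semantic residual"
guided = coarse + scale * residual      # steer generation toward this content
```

By construction the residual has no energy in the coarse prior's band, which is the property SSG exploits: each scale is pushed to contribute only frequencies the earlier scales do not explain.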
RefAM: Attention Magnets for Zero-Shot Referral Segmentation
Authors: Anna Kukleva, Enis Simsar, Alessio Tonioni, Muhammad Ferjad Naeem, Federico Tombari, Jan Eric Lenssen, Bernt Schiele
First: 2025-09-26T17:59:57+00:00 · Latest: 2026-02-05T10:20:31+00:00
Comments: Project Page: https://refam-diffusion.github.io/
Abstract
Most existing approaches to referring segmentation achieve strong performance only through fine-tuning or by composing multiple pre-trained models, often at the cost of additional training and architectural modifications. Meanwhile, large-scale generative diffusion models encode rich semantic information, making them attractive as general-purpose feature extractors. In this work, we introduce a new method that directly exploits attention scores from diffusion transformers as features for downstream tasks, requiring neither architectural modifications nor additional training. To systematically evaluate these features, we extend benchmarks with vision-language grounding tasks spanning both images and videos. Our key insight is that stop words act as attention magnets: they accumulate surplus attention and can be filtered to reduce noise. Moreover, we identify global attention sinks (GAS) emerging in deeper layers and show that they can be safely suppressed or redirected onto auxiliary tokens, leading to sharper and more accurate grounding maps. We further propose an attention redistribution strategy, where appended stop words partition background activations into smaller clusters, yielding sharper and more localized heatmaps. Building on these findings, we develop RefAM, a simple training-free grounding framework that combines cross-attention maps, GAS handling, and redistribution. Across zero-shot referring image and video segmentation benchmarks, our approach achieves strong performance and surpasses prior methods on most datasets, establishing a new state of the art without fine-tuning, additional components and complex reasoning.
中文标题/摘要
标题:RefAM:用于零样本引用分割的注意力磁铁
现有的大多数引用分割方法仅通过微调或组合多个预训练模型才能实现良好的性能,这通常需要额外的训练和架构修改。同时,大规模生成扩散模型编码丰富的语义信息,使其成为通用特征提取器的有吸引力的选择。在本文中,我们介绍了一种新方法,该方法直接利用扩散变换器的特征和注意力分数,用于下游任务,无需架构修改和额外训练。为了系统地评估这些特征,我们扩展了基准测试,涵盖了从图像到视频的视觉-语言定位任务。我们的关键见解是停用词充当注意力磁铁:它们积累多余的注意力,并可以过滤以减少噪声。此外,我们识别出在深层中出现的全局注意力陷阱(GAS),并表明它们可以安全地被抑制或重新定向到辅助标记,从而生成更清晰和更准确的定位图。我们进一步提出了一种注意力重新分配策略,其中附加的停用词将背景激活划分为更小的簇,从而生成更清晰和更局部化的热图。基于这些发现,我们开发了RefAM,这是一种简单的无需训练的定位框架,结合了跨注意力图、GAS处理和重新分配。在零样本引用图像和视频分割基准测试中,我们的方法实现了强大的性能,并在大多数数据集上超过了先前的方法,无需微调、额外组件和复杂推理。
Summary / 总结
This work introduces RefAM, a training-free method for zero-shot referring segmentation that leverages attention scores from diffusion transformers. By filtering stop words and redirecting global attention sinks, RefAM produces sharper grounding maps. Experiments on various benchmarks show that RefAM outperforms previous methods without requiring fine-tuning or additional components, setting a new state-of-the-art.
该研究提出了一种名为RefAM的无训练框架,利用扩散变压器的特征和注意力分数进行零样本引用分割。通过过滤停用词并重定向全局注意力汇流点,RefAM 生成了更清晰的定位图。实验表明,RefAM 在各种基准测试中优于先前的方法,无需微调或额外组件,建立了新的性能标准。
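The stop-word filtering idea above can be sketched in a few lines. This is a toy illustration under assumed shapes (a per-pixel list of attention scores over text tokens), not the paper's actual diffusion-transformer pipeline; `STOP_WORDS` and `grounding_map` are hypothetical names.

```python
# Toy sketch of stop-word filtering for grounding maps. Stop words soak up
# surplus attention ("attention magnets"), so we drop their columns and
# renormalize the remaining mass into a heatmap over pixels.
STOP_WORDS = {"the", "a", "an", "of", "is"}

def grounding_map(attn, tokens):
    """attn[p][t]: attention from pixel p to text token t (assumed layout).
    Returns a normalized per-pixel heatmap built from content tokens only."""
    keep = [i for i, tok in enumerate(tokens) if tok.lower() not in STOP_WORDS]
    heat = [sum(row[i] for i in keep) for row in attn]
    total = sum(heat) or 1.0
    return [h / total for h in heat]
```

A pixel that mostly attends to a stop word contributes little to the final map, which is the intended denoising effect.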
SDFP: Speculative Decoding with FIT-Pruned Models for Training-Free and Plug-and-Play LLM Acceleration
Authors: Hanyu Wei, Zunhai Su, Peng Lu, Chao Li, Spandan Tiwari, Ashish Sirasao, Yuhan Dong
First: 2026-02-05T10:02:00+00:00 · Latest: 2026-02-05T10:02:00+00:00
Abstract
Large language models (LLMs) underpin interactive multimedia applications such as captioning, retrieval, recommendation, and creative content generation, yet their autoregressive decoding incurs substantial latency. Speculative decoding reduces latency using a lightweight draft model, but deployment is often limited by the cost and complexity of acquiring, tuning, and maintaining an effective draft model. Recent approaches usually require auxiliary training or specialization, and even training-free methods incur costly search or optimization. We propose SDFP, a fully training-free and plug-and-play framework that builds the draft model via Fisher Information Trace (FIT)-based layer pruning of a given LLM. Using layer sensitivity as a proxy for output perturbation, SDFP removes low-impact layers to obtain a compact draft while preserving compatibility with the original model for standard speculative verification. SDFP needs no additional training, hyperparameter tuning, or separately maintained drafts, enabling rapid, deployment-friendly draft construction. Across benchmarks, SDFP delivers 1.32x-1.5x decoding speedup without altering the target model's output distribution, supporting low-latency multimedia applications.
中文标题/摘要
标题:SDFP:基于FIT剪枝模型的推测性解码以实现无需训练和即插即用的LLM加速
大型语言模型(LLMs)支撑着诸如字幕生成、检索、推荐和创意内容生成等交互式多媒体应用,但其自回归解码会带来显著的延迟。推测性解码通过使用轻量级草稿模型来减少延迟,但其部署往往受限于获取、调整和维护有效草稿模型的成本和复杂性。最近的方法通常需要辅助训练或专门化,即使无需训练的方法也会产生昂贵的搜索或优化成本。我们提出了一种完全无需训练且即插即用的框架SDFP,该框架通过给定的LLM的Fisher信息迹(FIT)层剪枝来构建草稿模型。利用层灵敏度作为输出扰动的代理,SDFP移除低影响层以获得紧凑的草稿,同时保持与原始模型的兼容性,以进行标准的推测性验证。SDFP无需额外训练、超参数调整或单独维护的草稿,能够快速构建部署友好的草稿。在基准测试中,SDFP在不改变目标模型输出分布的情况下实现了1.32倍至1.5倍的解码加速,支持低延迟的多媒体应用。
Summary / 总结
SDFP is a training-free and plug-and-play framework that accelerates large language model (LLM) decoding by pruning layers from a given LLM using Fisher Information Trace (FIT). This method reduces latency without altering the output distribution, enabling faster speculative decoding in multimedia applications. SDFP constructs a lightweight draft model by removing low-impact layers, which is compatible with the original model for standard speculative verification, achieving 1.32x-1.5x speedup across benchmarks.
SDFP 是一个无需训练且即插即用的框架,通过剪枝给定的大语言模型(LLM)的层来减少解码延迟,从而加速 LLM。它使用 Fisher 信息迹(FIT)剪枝模型的层,无需额外训练或超参数调整。SDFP 可以实现 1.32 倍到 1.5 倍的解码加速,同时保持原始模型的输出分布,支持低延迟的多媒体应用。
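The FIT-based draft construction can be sketched as follows. This is a minimal stand-in, assuming the Fisher Information Trace of a layer is approximated by the empirical sum of squared gradients; `fit_score` and `build_draft` are hypothetical names, not the paper's API.

```python
def fit_score(layer_grads):
    """Fisher Information Trace proxy for one layer: the sum of squared
    gradients of the loss w.r.t. that layer's parameters (a standard
    empirical-Fisher diagonal estimate)."""
    return sum(g * g for g in layer_grads)

def build_draft(layers, grads_per_layer, keep_ratio=0.75):
    """Drop the lowest-FIT (lowest-sensitivity) layers, preserving the
    original layer order so the draft stays compatible with the target
    model for standard speculative verification."""
    scores = [fit_score(g) for g in grads_per_layer]
    k = max(1, round(len(layers) * keep_ratio))
    keep = sorted(sorted(range(len(layers)), key=lambda i: -scores[i])[:k])
    return [layers[i] for i in keep]
```

Because the draft is a sub-network of the target, standard speculative verification can accept or reject its tokens without changing the target's output distribution.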
Auto-Rubric: Learning From Implicit Weights to Explicit Rubrics for Reward Modeling
Authors: Lipeng Xie, Sen Huang, Zhuo Zhang, Anni Zou, Yunpeng Zhai, Dingchao Ren, Kezun Zhang, Haoyuan Hu, Boyin Liu, Haoran Chen, Zhaoyang Liu, Bolin Ding
First: 2025-10-20T09:01:37+00:00 · Latest: 2026-02-05T09:36:28+00:00
Abstract
Conventional reward modeling relies on gradient descent over neural weights, creating opaque, data-hungry "black boxes." We propose a paradigm shift from implicit to explicit reward parameterization, recasting optimization from continuous weight spaces to the discrete space of natural language rubrics. We introduce a training-free framework based on iterative rubric learning: it locally induces discriminative criteria via verification-driven refinement, and globally compresses the candidate criteria pool into a compact core set by maximizing an information-theoretic coding rate objective. We organize the compressed core set into a hierarchical rubric structure -- high-level evaluation dimensions supported by concrete verification checks -- serving as an interpretable, portable reward function. Empirically, our approach challenges prevailing data scaling assumptions: using only 70 preference pairs, our rubric-guided judges outperform fully trained reward models on diverse benchmarks. For instance, Qwen3-8B equipped with our learned rubrics achieves 80.91% on RewardBench2, surpassing the specialized Skywork-Reward-V2-Qwen3-8B (78.20%). These results demonstrate that alignment signals are highly compressible and can be effectively captured through explicit symbolic search.
中文标题/摘要
标题:自动评分标准:从隐式权重到显式评分标准的奖励建模学习
传统的奖励建模依赖于神经权重的梯度下降,创建出不透明、数据饥渴的“黑箱”。我们提出了一种从隐式到显式的奖励参数化范式转变,将优化从连续权重空间重新定义为自然语言评分标准的离散空间。我们引入了一种无需训练的框架,基于迭代评分标准学习:通过验证驱动的细化局部诱导判别标准,并通过最大化信息论编码率目标全局压缩候选标准池为紧凑的核心集。我们将压缩的核心集组织成层次化的评分标准结构——由具体的验证检查支持的高层次评估维度,作为可解释的、可移植的奖励函数。实证上,我们的方法挑战了现有的数据规模假设:仅使用70对偏好对,我们的评分标准引导的评判者在多种基准上优于完全训练的奖励模型。例如,配备我们学习到的评分标准的Qwen3-8B在RewardBench2上达到了80.91%,超过了专门的Skywork-Reward-V2-Qwen3-8B(78.20%)。这些结果表明,对齐信号是高度可压缩的,并可以通过显式的符号搜索有效地捕捉。
Summary / 总结
The paper proposes Auto-Rubric, a method that shifts reward modeling from implicit neural weights to explicit natural language rubrics. It uses a training-free framework with iterative rubric learning to induce discriminative criteria and compress them into a compact core set. This core set is organized into a hierarchical structure, serving as an interpretable reward function. Experiments show that using only 70 preference pairs, the rubric-guided model outperforms fully trained models on diverse benchmarks, such as achieving 80.91% on RewardBench2 compared to 78.20% for a specialized model. This indicates that alignment signals can be effectively captured through explicit symbolic search and are highly compressible.
该研究提出了Auto-Rubric方法,将奖励建模从隐式的神经权重转向显式的自然语言评分标准。它使用一个无训练的迭代过程来细化判别标准并压缩成一个紧凑的层次结构。实验表明,使用仅70对偏好对,评分标准引导的模型在多种基准上优于完全训练的模型,例如在RewardBench2上达到80.91%,而专门模型仅为78.20%。这表明对齐信号可以通过显式的符号搜索有效捕获,并且高度可压缩。
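The global compression step can be sketched as a greedy search that maximizes a coding-rate objective. This is a toy version assuming 2-dimensional criterion embeddings (so the log-determinant has a closed form); `coding_rate` and `greedy_core_set` are illustrative names, and the exact objective in the paper may differ.

```python
import math

def coding_rate(X, eps=0.5):
    """R(X) = 0.5 * logdet(I + d/(n*eps^2) * X^T X) for n rows of d=2 dims.
    The rate grows when the selected rows span diverse directions."""
    n, d = len(X), 2
    a = d / (n * eps * eps)
    g00 = 1.0 + a * sum(x[0] * x[0] for x in X)
    g11 = 1.0 + a * sum(x[1] * x[1] for x in X)
    g01 = a * sum(x[0] * x[1] for x in X)
    return 0.5 * math.log(g00 * g11 - g01 * g01)

def greedy_core_set(cands, k):
    """Greedily add the criterion embedding that most increases the rate,
    compressing a redundant candidate pool into a compact, diverse core."""
    chosen, pool = [], list(range(len(cands)))
    while len(chosen) < k and pool:
        best = max(pool, key=lambda i: coding_rate(
            [cands[j] for j in chosen] + [cands[i]]))
        chosen.append(best)
        pool.remove(best)
    return sorted(chosen)
```

Near-duplicate criteria add almost nothing to the rate, so the greedy pass naturally skips them, which is the compression behavior the abstract describes.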
RANGER: A Monocular Zero-Shot Semantic Navigation Framework through Contextual Adaptation
Authors: Ming-Ming Yu, Yi Chen, Börje F. Karlsson, Wenjun Wu
Venue: ICRA 2026
First: 2025-12-30T13:25:22+00:00 · Latest: 2026-02-05T09:33:50+00:00
Comments: Accepted at ICRA 2026
Abstract
Efficiently finding targets in complex environments is fundamental to real-world embodied applications. While recent advances in multimodal foundation models have enabled zero-shot object goal navigation, allowing robots to search for arbitrary objects without fine-tuning, existing methods face two key limitations: (1) heavy reliance on precise depth and pose information provided by simulators, which restricts applicability in real-world scenarios; and (2) lack of in-context learning (ICL) capability, making it difficult to quickly adapt to new environments, for example by leveraging short videos of them. To address these challenges, we propose RANGER, a novel zero-shot, open-vocabulary semantic navigation framework that operates using only a monocular camera. Leveraging powerful 3D foundation models, RANGER eliminates the dependency on depth and pose while exhibiting strong ICL capability. By simply observing a short video of a new environment, the system can also significantly improve task efficiency without requiring architectural modifications or fine-tuning. The framework integrates several key components: keyframe-based 3D reconstruction, semantic point cloud generation, vision-language model (VLM)-driven exploration value estimation, high-level adaptive waypoint selection, and low-level action execution. Experiments on the HM3D benchmark and real-world environments demonstrate that RANGER achieves competitive performance in terms of navigation success rate and exploration efficiency, while showing superior ICL adaptability, with no previous 3D mapping of the environment required.
中文标题/摘要
标题:RANGER:通过上下文适应的单目零样本语义导航框架
在复杂环境中高效地找到目标是现实世界体态应用的基础。尽管最近多模态基础模型的进步使得零样本物体目标导航成为可能,允许机器人搜索任意物体而无需微调,但现有方法面临两个关键限制:(1)对模拟器提供的精确深度和姿态信息的高度依赖,这限制了其在现实世界场景中的应用;(2)缺乏上下文学习(ICL)能力,使得难以快速适应新环境,如利用短视频。为了解决这些挑战,我们提出了一种名为RANGER的新型零样本、开放式词汇语义导航框架,仅使用单目相机运行。利用强大的3D基础模型,RANGER消除了对深度和姿态的依赖,同时展示了强大的ICL能力。通过简单观察新环境的短视频,系统也可以显著提高任务效率,无需进行架构修改或微调。该框架整合了几个关键组件:基于关键帧的3D重建、语义点云生成、基于视觉-语言模型(VLM)的探索价值估计、高层自适应航点选择和低层动作执行。在HM3D基准测试和真实世界环境中进行的实验表明,RANGER在导航成功率和探索效率方面表现出竞争力,同时展示了优越的ICL适应性,无需事先对环境进行3D建图。
Summary / 总结
RANGER is a zero-shot semantic navigation framework that uses only a monocular camera to navigate complex environments efficiently. It addresses the limitations of existing methods by eliminating the need for precise depth and pose information and incorporating in-context learning capability. RANGER integrates key components such as 3D reconstruction, semantic point cloud generation, and exploration value estimation, enabling it to adapt quickly to new environments without architectural changes. Experiments show RANGER performs competitively in navigation success rate and exploration efficiency, and demonstrates superior in-context learning adaptability.
RANGER 是一个仅使用单目相机的零样本语义导航框架,无需依赖精确的深度和姿态信息。通过利用 3D 基础模型和上下文学习能力,RANGER 可以通过短视频观察快速适应新环境,提高任务效率。实验表明,RANGER 在导航成功率和探索效率方面表现出色,并且在不需要对环境进行先前 3D 映射的情况下展示了更优的上下文学习适应性。
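The high-level adaptive waypoint selection can be illustrated with a simple value-versus-cost trade-off. This is a schematic sketch, not the paper's method: `value_of` stands in for the VLM-driven exploration value estimate, and the linear scoring with `cost_weight` is an assumption.

```python
def select_waypoint(waypoints, value_of, position, cost_weight=0.5):
    """Pick the waypoint with the best trade-off between estimated
    exploration value and Euclidean travel cost from the current position.
    `value_of(w)` stands in for a VLM scoring call."""
    def dist(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    return max(waypoints, key=lambda w: value_of(w) - cost_weight * dist(position, w))
```

A nearby waypoint with a slightly lower value can beat a distant high-value one, which is the adaptive behavior a cost-aware selector provides.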
MerNav: A Highly Generalizable Memory-Execute-Review Framework for Zero-Shot Object Goal Navigation
Authors: Dekang Qi, Shuang Zeng, Xinyuan Chang, Feng Xiong, Shichao Xie, Xiaolong Wu, Mu Xu
First: 2026-02-05T09:15:34+00:00 · Latest: 2026-02-05T09:15:34+00:00
Comments: 9 pages, 2 figures, 5 tables, conference
Abstract
Visual Language Navigation (VLN) is one of the fundamental capabilities for embodied intelligence and a critical challenge that urgently needs to be addressed. However, existing methods are still unsatisfactory in terms of both success rate (SR) and generalization: Supervised Fine-Tuning (SFT) approaches typically achieve higher SR, while Training-Free (TF) approaches often generalize better, but it is difficult to obtain both simultaneously. To this end, we propose a Memory-Execute-Review framework. It consists of three parts: a hierarchical memory module for providing information support, an execute module for routine decision-making and actions, and a review module for handling abnormal situations and correcting behavior. We validated the effectiveness of this framework on the Object Goal Navigation task. Across 4 datasets, our average SR achieved absolute improvements of 7% and 5% compared to all baseline methods under TF and Zero-Shot (ZS) settings, respectively. On the most commonly used HM3D_v0.1 and the more challenging open vocabulary dataset HM3D_OVON, the SR improved by 8% and 6% under ZS settings, respectively. Furthermore, on the MP3D and HM3D_OVON datasets, our method not only outperformed all TF methods but also surpassed all SFT methods, leading in both SR (by 5% and 2%, respectively) and generalization.
中文标题/摘要
标题:MerNav:一种高度通用的记忆-执行-审查框架用于零样本物体目标导航
视觉语言导航(VLN)是体现智能的基本能力之一,也是亟待解决的关键挑战。然而,现有方法在成功率(SR)和泛化能力方面仍然不尽如人意:监督微调(SFT)方法通常能获得更高的SR,而无需训练(TF)方法往往能更好地泛化,但两者难以同时兼得。为此,我们提出了一种记忆-执行-审查框架。该框架由三部分组成:层次化记忆模块提供信息支持,执行模块进行常规决策和操作,审查模块处理异常情况并纠正行为。我们在物体目标导航任务上验证了该框架的有效性。在4个数据集上,与所有基线方法相比,我们的平均SR分别在无需训练(TF)和零样本(ZS)设置下提高了7%和5%。在最常用的HM3D_v0.1和更具挑战性的开放词汇数据集HM3D_OVON上,SR在零样本设置下分别提高了8%和6%。此外,在MP3D和HM3D_OVON数据集上,我们的方法不仅优于所有无需训练方法,还超越了所有监督微调方法,在成功率(SR)和泛化能力方面均取得了全面领先(5%和2%)。
Summary / 总结
The paper introduces MerNav, a Memory-Execute-Review framework designed to improve zero-shot object goal navigation in VLN tasks. It consists of a memory module for information support, an execute module for routine decision-making, and a review module for handling abnormalities. The framework achieved significant improvements in success rate, with absolute gains of 7% and 5% over baseline methods in training-free and zero-shot settings, respectively. Notably, it outperformed both training-free and supervised fine-tuning methods on challenging datasets, demonstrating comprehensive leadership in both success rate and generalization.
该论文提出了MerNav框架,旨在提高VLN任务中的零样本物体目标导航性能。该框架包括用于信息支持的记忆模块、用于常规决策的执行模块以及用于处理异常情况的审查模块。实验结果显示,该框架在训练自由和零样本设置下分别实现了7%和5%的成功率绝对提升,显著优于基线方法。特别是在具有挑战性的数据集中,该方法不仅超越了所有训练自由方法,还超过了所有监督微调方法,展示了在成功率和泛化能力上的全面领先。
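The Memory-Execute-Review control flow can be sketched as a loop. The module internals are the paper's; the callable signatures below are assumptions, with the hierarchical memory flattened to a list for brevity.

```python
def navigate(observe, execute, is_abnormal, correct, act, max_steps=50):
    """Schematic Memory-Execute-Review loop:
      - memory accumulates observations (stands in for the hierarchical
        memory module),
      - execute proposes a routine action,
      - is_abnormal/correct form the review module, catching and fixing
        abnormal proposals,
      - act applies the action and returns True once the goal is reached."""
    memory = []
    for _ in range(max_steps):
        obs = observe()
        memory.append(obs)
        action = execute(memory, obs)
        if is_abnormal(memory, obs, action):
            action = correct(memory, obs, action)
        if act(action):
            return True
    return False
```

Separating routine execution from review is what lets a correction fire only on detected failure modes instead of on every step.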
See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning
Authors: Shuoshuo Zhang, Yizhen Zhang, Jingjing Fu, Lei Song, Jiang Bian, Yujiu Yang, Rui Wang
First: 2025-12-26T18:59:47+00:00 · Latest: 2026-02-05T08:49:21+00:00
Abstract
Large vision-language models (VLMs) often benefit from intermediate visual cues, either injected via external tools or generated as latent visual tokens during reasoning, but these mechanisms still overlook fine-grained visual evidence (e.g., polylines in charts), generalize poorly across domains, and incur high inference-time cost. In this paper, we propose Bi-directional Perceptual Shaping (BiPS), which transforms question-conditioned masked views into bidirectional where-to-look signals that shape perception during training. BiPS first applies a KL-consistency constraint between the original image and an evidence-preserving view that keeps only question-relevant regions, encouraging coarse but complete coverage of supporting pixels. It then applies a KL-separation constraint between the original and an evidence-ablated view where critical pixels are masked so the image no longer supports the original answer, discouraging text-only shortcuts (i.e., answering from text alone) and enforcing fine-grained visual reliance. Across eight benchmarks, BiPS boosts Qwen2.5-VL-7B by 8.2% on average and shows strong out-of-domain generalization to unseen datasets and image types.
中文标题/摘要
标题:见少而明:双向感知塑造用于多模态推理
大型视觉-语言模型(VLMs)通常从中间视觉提示中受益,这些提示要么通过外部工具注入,要么在推理过程中作为潜在视觉标记生成,但这些机制仍然忽略了细微的视觉证据(例如图表中的多段线),在不同领域泛化能力差,并且在推理时间成本高。在本文中,我们提出了双向感知塑造(BiPS),它将问题条件下的掩码视图转换为双向的看哪里信号,在训练过程中塑造感知。BiPS 首先在原始图像和保留仅与问题相关区域的证据保留视图之间应用KL一致性约束,鼓励粗略但完整的支持像素覆盖。然后在原始图像和关键像素被遮蔽的证据消除视图之间应用KL分离约束,该视图不再支持原始答案,从而避免仅从文本回答(即,仅从文本回答)并强制执行细微的视觉依赖。在八个基准测试中,BiPS 将 Qwen2.5-VL-7B 的性能平均提高了8.2%,并在未见过的数据集和图像类型上展示了强大的跨域泛化能力。
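The two shaping constraints can be sketched over categorical answer distributions. This is a minimal illustration: the KL terms follow the abstract, but the hinge-with-margin form of the separation term and the names `bips_loss`/`margin` are assumptions.

```python
import math

def kl(p, q):
    """KL(p || q) for categorical distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def bips_loss(p_orig, p_keep, p_ablate, margin=1.0):
    """Consistency: the evidence-preserving view should yield the same answer
    distribution as the full image (drive KL to zero). Separation: the
    evidence-ablated view should diverge from it, penalized here via an
    assumed hinge so that ablating critical pixels must change the answer."""
    consistency = kl(p_orig, p_keep)
    separation = max(0.0, margin - kl(p_orig, p_ablate))
    return consistency + separation
```

If the model answers identically with the evidence masked out, it is taking a text-only shortcut, and the separation term penalizes exactly that case.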
DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching
Authors: Chang Zou, Changlin Li, Yang Li, Patrol Li, Jianbing Wu, Xiao He, Songtao Liu, Zhao Zhong, Kailin Huang, Linfeng Zhang
First: 2026-02-05T08:45:08+00:00 · Latest: 2026-02-05T08:45:08+00:00
Comments: 17 pages, 7 figures; cvpr2026 submission
Abstract
While diffusion models have achieved great success in the field of video generation, this progress is accompanied by a rapidly escalating computational burden. Among the existing acceleration methods, Feature Caching is popular due to its training-free property and considerable speedup performance, but it inevitably faces semantic and detail drop with further compression. Another widely adopted method, training-aware step-distillation, though successful in image generation, also faces drastic degradation in video generation with a few steps. Furthermore, the quality loss becomes more severe when simply applying training-free feature caching to the step-distilled models, due to the sparser sampling steps. This paper introduces the first distillation-compatible learnable feature caching mechanism. We employ a lightweight learnable neural predictor instead of traditional training-free heuristics for diffusion models, enabling a more accurate capture of the high-dimensional feature evolution process. Furthermore, we explore the challenges of highly compressed distillation on large-scale video models and propose a conservative Restricted MeanFlow approach to achieve more stable and lossless distillation. By undertaking these initiatives, we further push the acceleration boundaries to $11.8\times$ while preserving generation quality. Extensive experiments demonstrate the effectiveness of our method. The code is in the supplementary materials and will be publicly available.
中文标题/摘要
标题:DisCa: 加速视频扩散变换器的蒸馏兼容可学习特征缓存
尽管扩散模型在视频生成领域取得了巨大成功,但这一进展伴随着计算负担的急剧增加。现有的加速方法中,特征缓存因其无需训练的特性及显著的加速性能而广受欢迎,但进一步压缩时不可避免地会面临语义和细节的丢失。另一种广泛应用的方法,训练感知的步骤蒸馏,在图像生成中取得了成功,但在视频生成中却面临严重的性能下降,且仅应用无需训练的特征缓存到步骤蒸馏模型时,质量损失更为严重,因为采样步骤更为稀疏。本文首次引入了蒸馏兼容的可学习特征缓存机制。我们采用轻量级的可学习神经预测器代替传统的无需训练的启发式方法,能够更准确地捕捉高维特征演化过程。此外,我们探讨了高度压缩蒸馏在大规模视频模型中的挑战,并提出了一种保守的受限均值流方法,以实现更稳定和无损的蒸馏。通过这些努力,我们在保持生成质量的同时将加速边界进一步推至$11.8\times$。大量实验表明了我们方法的有效性。代码附在补充材料中,并将公开。
Summary / 总结
This paper addresses the computational challenges in video generation using diffusion models by introducing a novel distillation-compatible learnable feature caching mechanism. The method uses a lightweight learnable neural predictor to capture feature evolution accurately, and proposes a conservative Restricted MeanFlow approach for stable distillation. The results show that the proposed method can accelerate video generation by 11.8 times while maintaining quality. Extensive experiments validate the effectiveness of the approach.
本文通过引入一种新型的可蒸馏学习可调特征缓存机制,解决了使用扩散模型进行视频生成时的计算挑战。该方法采用轻量级的学习神经预测器来准确捕捉高维特征演化过程,克服了传统特征缓存中的语义和细节丢失问题。提出的受限均值流方法确保了更稳定的无损蒸馏,实现了$11.8\times$的加速,同时不牺牲生成质量。广泛的实验验证了该方法的有效性。
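The shift from heuristic to learned caching can be illustrated with the smallest possible "predictor": a scalar coefficient fitted by least squares on a short calibration run. The paper uses a lightweight neural predictor; the scalar fit and the names `fit_alpha`/`predict_feature` below are simplifying assumptions.

```python
def fit_alpha(deltas, targets):
    """Least-squares scalar so that target ≈ alpha * delta, fitted once on
    calibration trajectories (a stand-in for the learnable neural predictor
    that replaces training-free caching heuristics)."""
    num = sum(d * t for d, t in zip(deltas, targets))
    den = sum(d * d for d in deltas) or 1.0
    return num / den

def predict_feature(cached, prev, alpha):
    """Instead of reusing `cached` verbatim (training-free caching), predict
    the next feature by extrapolating the cached trajectory."""
    return [c + alpha * (c - p) for c, p in zip(cached, prev)]
```

A fixed extrapolation rule corresponds to hard-coding `alpha`; fitting it (or replacing it with a small network) is what makes the cache "learnable" while keeping it cheap at inference time.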
LayoutCoT: Unleashing the Deep Reasoning Potential of Large Language Models for Layout Generation
Authors: Hengyu Shi, Junhao Su, Tianyang Han, Junfeng Luo, Jialin Gao
First: 2025-04-15T03:12:01+00:00 · Latest: 2026-02-05T07:47:50+00:00
Abstract
Conditional layout generation aims to automatically generate visually appealing and semantically coherent layouts from user-defined constraints. While recent methods based on generative models have shown promising results, they typically require substantial amounts of training data or extensive fine-tuning, limiting their versatility and practical applicability. Alternatively, some training-free approaches leveraging in-context learning with Large Language Models (LLMs) have emerged, but they often suffer from limited reasoning capabilities and overly simplistic ranking mechanisms, which restrict their ability to generate consistently high-quality layouts. To this end, we propose LayoutCoT, a novel approach that leverages the reasoning capabilities of LLMs through a combination of Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) techniques. Specifically, LayoutCoT transforms layout representations into a standardized serialized format suitable for processing by LLMs. A Layout-aware RAG is used to facilitate effective retrieval and generate a coarse layout by LLMs. This preliminary layout, together with the selected exemplars, is then fed into a specially designed CoT reasoning module for iterative refinement, significantly enhancing both semantic coherence and visual quality. We conduct extensive experiments on five public datasets spanning three conditional layout generation tasks. Experimental results demonstrate that LayoutCoT achieves state-of-the-art performance without requiring training or fine-tuning. Notably, our CoT reasoning module enables standard LLMs, even those without explicit deep reasoning abilities, to outperform specialized deep-reasoning models such as deepseek-R1, highlighting the potential of our approach in unleashing the deep reasoning capabilities of LLMs for layout generation tasks.
中文标题/摘要
标题:LayoutCoT:利用大型语言模型的深度推理潜力进行布局生成
条件布局生成旨在从用户定义的约束条件自动生成视觉上吸引人且语义上连贯的布局。虽然基于生成模型的近期方法取得了令人鼓舞的结果,但它们通常需要大量的训练数据或广泛的微调,限制了它们的灵活性和实际应用性。相反,一些无需训练的方法利用大型语言模型(LLMs)的上下文学习也出现了,但它们往往推理能力有限,排名机制过于简单,限制了它们生成高质量布局的能力。为此,我们提出了一种名为LayoutCoT的新方法,该方法通过检索增强生成(RAG)和链式思考(CoT)技术的结合,利用LLMs的推理能力。具体而言,LayoutCoT将布局表示转换为适合LLMs处理的标准序列化格式。使用布局感知的RAG来促进有效的检索并生成粗略布局。然后,将初步布局与选定的示例一起输入特别设计的CoT推理模块进行迭代细化,显著提高了语义连贯性和视觉质量。我们在五个公共数据集上进行了广泛的实验,涵盖了三种条件布局生成任务。实验结果表明,LayoutCoT在无需训练或微调的情况下达到了最先进的性能。值得注意的是,我们的CoT推理模块使标准LLMs,即使它们没有明确的深度推理能力,也能超越专门的深度推理模型(如deepseek-R1),突显了我们方法在利用LLMs的深度推理能力进行布局生成任务方面的潜力。
Summary / 总结
LayoutCoT is a novel approach for conditional layout generation that leverages the reasoning capabilities of Large Language Models (LLMs) through a combination of Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) techniques. It transforms layout representations into a standardized format and uses a Layout-aware RAG to generate a coarse layout, which is then refined by a CoT reasoning module. Extensive experiments on five public datasets show that LayoutCoT achieves state-of-the-art performance without the need for training or fine-tuning, outperforming specialized deep-reasoning models.
LayoutCoT 通过结合检索增强生成(RAG)和链式思考(CoT)技术,利用大型语言模型(LLMs)的推理能力,从用户定义的约束中生成视觉上吸引人且语义上连贯的布局。它将布局表示转换为标准化格式,使用布局感知的 RAG 生成粗略布局,并通过 CoT 推理模块进行迭代细化。在五个公开数据集上的实验表明,LayoutCoT 在不需要训练或微调的情况下优于现有方法,展示了 LLMs 在布局生成任务中进行深度推理的潜力。
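The serialization and retrieval stages can be sketched as follows. This is a deliberately reduced illustration: element-type overlap stands in for the Layout-aware RAG's retrieval criterion, and the field names (`type`, `x`, `y`, `w`, `h`) and function names are hypothetical.

```python
def serialize(layout):
    """Flatten a layout into the standardized serialized text an LLM can
    consume, one element per `type[x,y,w,h]` record (assumed format)."""
    return "; ".join(f"{e['type']}[{e['x']},{e['y']},{e['w']},{e['h']}]"
                     for e in layout)

def retrieve_exemplars(query_types, corpus, k=1):
    """Layout-aware retrieval, reduced here to scoring exemplar layouts by
    how many of the requested element types they share."""
    def overlap(entry):
        return len(set(query_types) & {e["type"] for e in entry})
    return sorted(corpus, key=overlap, reverse=True)[:k]
```

The retrieved exemplars, serialized this way, would then be placed in the CoT prompt alongside the coarse layout for iterative refinement.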