Can vision language models learn intuitive physics from interaction?
Authors: Luca M. Schulze Buschoff, Konstantinos Voudouris, Can Demircan, Eric Schulz
First: 2026-02-05T18:59:20+00:00 · Latest: 2026-02-05T18:59:20+00:00
Abstract
Pre-trained vision language models do not have good intuitions about the physical world. Recent work has shown that supervised fine-tuning can improve model performance on simple physical tasks. However, fine-tuned models do not appear to learn robust physical rules that can generalize to new contexts. Based on research in cognitive science, we hypothesize that models need to interact with an environment to properly learn its physical dynamics. We train models that learn through interaction with the environment using reinforcement learning. While learning from interaction allows models to improve their within-task performance, it fails to produce models with generalizable physical intuitions. We find that models trained on one task do not reliably generalize to related tasks, even if the tasks share visual statistics and physical principles, and regardless of whether the models are trained through interaction.
中文标题/摘要
标题:视觉语言模型能否通过交互学习直观的物理知识?
预训练的视觉语言模型对物理世界的直觉不够好。最近的研究表明,监督微调可以提高模型在简单物理任务上的表现。然而,微调后的模型似乎没有学会能够泛化到新情境的稳健物理规则。基于认知科学的研究,我们假设模型需要与环境互动才能正确学习其物理动力学。我们使用强化学习训练通过与环境互动来学习的模型。虽然通过互动学习可以让模型提高其任务内的表现,但无法产生具有可泛化物理直觉的模型。我们发现,即使任务共享视觉统计和物理原理,针对一个任务训练的模型也无法可靠地泛化到相关任务,无论模型是否通过互动进行训练。
GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks?
Authors: Ruihang Li, Leigang Qu, Jingxu Zhang, Dongnan Gui, Mengde Xu, Xiaosong Zhang, Han Hu, Wenjie Wang, Jiaqi Wang
First: 2026-02-05T18:52:48+00:00 · Latest: 2026-02-05T18:52:48+00:00
Comments: Project Page: https://genarena.github.io/, Code: https://github.com/ruihanglix/genarena
Abstract
The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models as surrogate judges. In this work, we systematically investigate the reliability of the prevailing absolute pointwise scoring standard, across a wide spectrum of visual generation tasks. Our analysis reveals that this paradigm is limited due to stochastic inconsistency and poor alignment with human perception. To resolve these limitations, we introduce GenArena, a unified evaluation framework that leverages a pairwise comparison paradigm to ensure stable and human-aligned evaluation. Crucially, our experiments uncover a transformative finding that simply adopting this pairwise protocol enables off-the-shelf open-source models to outperform top-tier proprietary models. Notably, our method boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, drastically surpassing the 0.36 correlation of pointwise methods. Based on GenArena, we benchmark state-of-the-art visual generation models across diverse tasks, providing the community with a rigorous and automated evaluation standard for visual generation.
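The pairwise protocol and the rank-correlation check described in the abstract can be illustrated with a minimal sketch. It is not the GenArena implementation: the `judge` callable, the win-rate aggregation, and the tie-free Spearman formula are all assumptions chosen for brevity.

```python
from itertools import combinations

def win_rates(models, judge):
    """Score each model by the fraction of pairwise comparisons it wins.

    `judge(a, b)` is a hypothetical callable standing in for a VLM judge;
    it must return either `a` or `b`.
    """
    wins = {m: 0 for m in models}
    for a, b in combinations(models, 2):
        wins[judge(a, b)] += 1
    games_per_model = len(models) - 1
    return {m: wins[m] / games_per_model for m in models}

def ranks(scores):
    """Map each model to its rank (1 = highest score). Assumes no ties."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {m: i + 1 for i, m in enumerate(ordered)}

def spearman(scores_a, scores_b):
    """Spearman correlation for tie-free scores: 1 - 6*sum(d^2) / (n*(n^2-1))."""
    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(ra)
    d2 = sum((ra[m] - rb[m]) ** 2 for m in ra)
    return 1 - 6 * d2 / (n * (n * n - 1))
```

In practice the reference scores would come from a human-preference leaderboard, and a rating model such as Elo or Bradley-Terry could replace the plain win rate.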
中文标题/摘要
标题:GenArena:我们如何实现视觉生成任务的人类对齐评估?
视觉生成模型的快速发展已经超越了传统的评估方法,迫切需要采用视觉语言模型作为替代的评判者。在本文中,我们系统地研究了当前广泛使用的绝对点对点评分标准在各种视觉生成任务中的可靠性。我们的分析表明,这种范式由于随机不一致性和与人类感知的不良对齐而受到限制。为了解决这些限制,我们引入了GenArena,这是一种统一的评估框架,利用成对比较范式确保稳定且人类对齐的评估。关键的是,我们的实验揭示了一个变革性的发现,即简单采用这种成对协议可以使现成的开源模型超越顶级专有模型。值得注意的是,我们的方法将评估准确性提高了超过20%,并与权威的LMArena排行榜获得了0.86的斯皮尔曼相关性,远远超过了0.36的点对点方法相关性。基于GenArena,我们对多种视觉生成模型进行了基准测试,为视觉生成领域提供了严格的自动化评估标准。
Summary / 总结
This study addresses the limitations of traditional absolute pointwise scoring for evaluating visual generation models. The authors introduce GenArena, a pairwise comparison framework, to ensure stable and human-aligned evaluation. Experiments show that, under the pairwise protocol, off-the-shelf open-source models used as judges outperform top-tier proprietary models; evaluation accuracy improves by over 20%, and the Spearman correlation with the LMArena leaderboard reaches 0.86, far above the 0.36 achieved by pointwise methods.
该研究针对传统绝对点评分在评估视觉生成模型时的局限性,提出了GenArena,一种成对比较框架,以提高评估的可靠性和与人类感知的对齐。实验表明,GenArena显著提高了评估准确性,超过20%,并与权威的LMArena排行榜实现了0.86的Spearman相关性,远超点评分方法的0.36相关性。
Pathwise Test-Time Correction for Autoregressive Long Video Generation
Authors: Xunzhi Xiang, Zixuan Duan, Guiyu Zhang, Haiyu Zhang, Zhe Gao, Junta Wu, Shaofeng Zhang, Tengfei Wang, Qi Fan, Chunchao Guo
First: 2026-02-05T16:50:39+00:00 · Latest: 2026-02-05T16:50:39+00:00
Abstract
Distilled autoregressive diffusion models facilitate real-time short video synthesis but suffer from severe error accumulation during long-sequence generation. While existing Test-Time Optimization (TTO) methods prove effective for images or short clips, we identify that they fail to mitigate drift in extended sequences due to unstable reward landscapes and the hypersensitivity of distilled parameters. To overcome these limitations, we introduce Test-Time Correction (TTC), a training-free alternative. Specifically, TTC utilizes the initial frame as a stable reference anchor to calibrate intermediate stochastic states along the sampling trajectory. Extensive experiments demonstrate that our method seamlessly integrates with various distilled models, extending generation lengths with negligible overhead while matching the quality of resource-intensive training-based methods on 30-second benchmarks.
中文标题/摘要
标题:路径依赖的测试时校正以实现自回归长视频生成
提炼的自回归扩散模型有助于实时短视频合成,但在长序列生成过程中会遭受严重的错误累积。尽管现有的测试时优化(TTO)方法对图像或短片段有效,但我们发现它们无法缓解扩展序列中的漂移,因为奖励景观不稳定且提炼参数高度敏感。为克服这些限制,我们引入了测试时校正(TTC),这是一种无需训练的替代方案。具体而言,TTC 利用初始帧作为稳定的参考锚点,校准采样轨迹中的中间随机状态。大量实验表明,我们的方法可以无缝集成到各种提炼模型中,延长生成长度,同时在30秒基准测试上与资源密集型训练方法保持相同的质量,几乎没有额外开销。
Summary / 总结
The research addresses the issue of error accumulation in long video generation using autoregressive diffusion models. It introduces Test-Time Correction (TTC), a training-free method that uses the initial frame as a reference to stabilize intermediate states during sampling. Experiments show that TTC can extend generation lengths without significant overhead and maintains video quality comparable to resource-intensive training-based methods on 30-second benchmarks.
研究解决了使用蒸馏自回归扩散模型进行长视频生成时出现的错误累积问题。提出了测试时校正(TTC)方法,该方法利用初始帧作为参考来稳定采样过程中的中间状态。实验表明,TTC可以在不显著增加开销的情况下延长生成长度,并在30秒视频基准测试中与更耗资源的训练方法保持相同的质量。
GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra
Authors: Mateusz Michalkiewicz, Anekha Sokhal, Tadeusz Michalkiewicz, Piotr Pawlikowski, Mahsa Baktashmotlagh, Varun Jampani, Guha Balakrishnan
Venue: ICLR 2026
First: 2025-06-09T20:11:21+00:00 · Latest: 2026-02-05T16:06:21+00:00
Comments: Accepted to ICLR 2026. Camera ready version
Abstract
Modern monocular 3D reconstruction methods and vision-language models (VLMs) demonstrate impressive results on standard benchmarks, yet recent works cast doubt on their true understanding of geometric properties. We introduce GIQ, a comprehensive benchmark specifically designed to evaluate the geometric reasoning capabilities of vision and vision-language foundation models. GIQ comprises synthetic and real-world images and corresponding 3D meshes of diverse polyhedra covering varying levels of complexity and symmetry, from Platonic, Archimedean, Johnson, and Catalan solids to stellations and compound shapes. Through systematic experiments involving monocular 3D reconstruction, 3D symmetry detection, mental rotation tests, and zero-shot shape classification tasks, we reveal significant shortcomings in current models. State-of-the-art reconstruction algorithms trained on extensive 3D datasets struggle to reconstruct even basic geometric Platonic solids accurately. Next, although foundation models may be shown via linear and non-linear probing to capture specific 3D symmetry elements, they falter significantly in tasks requiring detailed geometric differentiation, such as mental rotation. Moreover, advanced vision-language assistants such as ChatGPT, Gemini and Claude exhibit remarkably low accuracy in interpreting basic shape properties such as face geometry, convexity, and compound structures of complex polyhedra. GIQ is publicly available at toomanymatts.github.io/giq-benchmark/, providing a structured platform to benchmark critical gaps in geometric intelligence and facilitate future progress in robust, geometry-aware representation learning.
中文标题/摘要
标题:GIQ:使用模拟和真实多面体基准测试视觉基础模型的3D几何推理能力
现代单目3D重建方法和视觉-语言模型(VLMs)在标准基准测试上表现出令人印象深刻的成果,但近期的研究对其真正理解几何属性表示怀疑。我们引入了GIQ,一个专门设计用于评估视觉和视觉-语言基础模型几何推理能力的综合基准。GIQ包含合成和真实世界的图像及其对应的复杂度和对称性各异的多种多面体的3D网格,从柏拉图、阿基米德、约翰逊和卡塔兰多面体到星形和复合形状。通过涉及单目3D重建、3D对称性检测、心理旋转测试和零样本形状分类任务的系统实验,我们揭示了当前模型的重大不足。最先进的3D重建算法即使在大量3D数据集上训练,也难以准确重建基本的几何柏拉图多面体。其次,尽管基础模型可能通过线性和非线性探针捕捉特定的3D对称元素,但在需要详细几何差异的任务,如心理旋转方面,它们的表现显著下降。此外,先进的视觉-语言助手,如ChatGPT、Gemini和Claude在解释基本形状属性,如面几何、凸性和复杂多面体的复合结构方面表现出极低的准确性。GIQ在toomanymatts.github.io/giq-benchmark/公开,提供了一个结构化的平台,用于基准测试关键的几何智能差距,并促进未来在稳健、几何感知表示学习方面的进展。
Summary / 总结
The paper introduces GIQ, a benchmark for evaluating the geometric reasoning capabilities of vision and vision-language foundation models using both synthetic and real-world images of polyhedra. Through monocular 3D reconstruction, 3D symmetry detection, mental rotation tests, and zero-shot shape classification, the study reveals that current models struggle with basic geometric properties and detailed geometric differentiation, indicating significant gaps in their geometric intelligence. State-of-the-art reconstruction algorithms and advanced vision-language assistants perform poorly in tasks requiring detailed geometric understanding. The benchmark is publicly available at toomanymatts.github.io/giq-benchmark/.
研究旨在通过GIQ基准测试视觉和视觉-语言基础模型的几何推理能力,该基准包括合成和真实世界的多面体图像。研究发现,最先进的3D重建算法在基本几何形状上表现不佳,而基础模型在需要详细几何区分的任务,如心理旋转,上表现较差。此外,像ChatGPT、Gemini和Claude这样的高级视觉-语言助手在解释基本形状属性,如面几何、凸性和复杂多面体的复合结构时表现出较低的准确性。GIQ基准已公开发布,以促进进一步研究和增强几何智能。
Bifrost: Steering Strategic Trajectories to Bridge Contextual Gaps for Self-Improving Agents
Authors: Quan M. Tran, Zhuo Huang, Wenbin Zhang, Bo Han, Koji Yatani, Masashi Sugiyama, Tongliang Liu
First: 2026-02-05T16:03:56+00:00 · Latest: 2026-02-05T16:03:56+00:00
Abstract
Autonomous agents excel in self-improvement through reflection and iterative refinement, which reuse successful task trajectories as in-context examples to assist subsequent reasoning. However, shifting across tasks often introduces a context mismatch. Hence, existing approaches either discard the trajectories or manipulate them using heuristics, leading to a non-negligible fine-tuning cost or unguaranteed performance. To bridge this gap, we reveal a context-trajectory correlation, where shifts of context are highly parallel with shifts of trajectory. Based on this finding, we propose BrIdge contextual gap FoR imprOvised trajectory STeering (Bifrost), a training-free method that leverages context differences to precisely guide the adaptation of previously solved trajectories towards the target task, mitigating the misalignment caused by context shifts. Our trajectory adaptation is conducted at the representation level using agent hidden states, ensuring trajectory transformation accurately aligns with the target context in a shared space. Across diverse benchmarks, Bifrost consistently outperforms existing trajectory reuse and finetuned self-improvement methods, demonstrating that agents can effectively leverage past experiences despite substantial context shifts.
中文标题/摘要
标题:Bifrost:引导战略轨迹以弥补情境差距,促进自我提升代理
自主代理通过反思和迭代改进,在重用成功任务轨迹作为上下文示例以辅助后续推理方面表现出色。然而,任务切换往往会导致情境不匹配。因此,现有方法要么丢弃轨迹,要么使用启发式方法对其进行操作,导致显著的微调成本或无法保证性能。为弥补这一差距,我们揭示了情境与轨迹之间的关联,其中情境变化与轨迹变化高度平行。基于这一发现,我们提出了一种无需训练的方法——BrIdge 情境差距以引导即兴轨迹调整(Bifrost),该方法利用情境差异精确引导先前解决的轨迹向目标任务适应,减轻由情境变化引起的错位。我们的轨迹调整在表示层进行,使用代理隐藏状态确保轨迹转换准确地与共享空间中的目标情境对齐。在多种基准测试中,Bifrost 一致地优于现有的轨迹重用和微调自我改进方法,证明了代理即使在大量情境变化的情况下也能有效利用过往经验。
Summary / 总结
The research aims to address the challenge of context mismatch when autonomous agents shift between tasks, which can lead to poor performance or high fine-tuning costs. The proposed method, Bifrost, identifies a correlation between context and trajectory shifts and uses this to adapt previously successful trajectories to the target task without retraining. Experiments show that Bifrost outperforms existing methods in various benchmarks, indicating that agents can effectively utilize past experiences even with significant context changes.
论文针对自主代理在任务切换时面临的上下文不匹配问题,提出了一种无需训练的方法Bifrost,通过利用上下文差异精确调整隐藏状态,使过去的任务轨迹能够适应新的上下文。实验结果显示,Bifrost 在多种基准测试中优于现有方法,表明代理即使在面临显著上下文变化时也能有效利用过往经验。
Focus-Scan-Refine: From Human Visual Perception to Efficient Visual Token Pruning
Authors: Enwei Tong, Yuanchao Bai, Yao Zhu, Junjun Jiang, Xianming Liu
First: 2026-02-05T16:02:48+00:00 · Latest: 2026-02-05T16:02:48+00:00
Abstract
Vision-language models (VLMs) often generate massive visual tokens that greatly increase inference latency and memory footprint; while training-free token pruning offers a practical remedy, existing methods still struggle to balance local evidence and global context under aggressive compression. We propose Focus-Scan-Refine (FSR), a human-inspired, plug-and-play pruning framework that mimics how humans answer visual questions: focus on key evidence, then scan globally if needed, and refine the scanned context by aggregating relevant details. FSR first focuses on key evidence by combining visual importance with instruction relevance, avoiding the bias toward visually salient but query-irrelevant regions. It then scans for complementary context conditioned on the focused set, selecting tokens that are most different from the focused evidence. Finally, FSR refines the scanned context by aggregating nearby informative tokens into the scan anchors via similarity-based assignment and score-weighted merging, without increasing the token budget. Extensive experiments across multiple VLM backbones and vision-language benchmarks show that FSR consistently improves the accuracy-efficiency trade-off over existing state-of-the-art pruning methods. The source codes can be found at https://github.com/ILOT-code/FSR
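The three stages named in the abstract can be sketched as follows. This is an illustrative reading, not the released FSR code: the multiplicative combination of `importance` and `relevance`, the use of cosine similarity, and the function name `focus_scan_refine` are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def focus_scan_refine(tokens, importance, relevance, k_focus, k_scan):
    """Prune `tokens` down to k_focus + k_scan vectors (requires k_focus >= 1)."""
    n = len(tokens)
    # Assumed scoring rule: visual importance times instruction relevance.
    score = [importance[i] * relevance[i] for i in range(n)]

    # Focus: keep the highest-scoring tokens as key evidence.
    focus = sorted(range(n), key=lambda i: score[i], reverse=True)[:k_focus]
    rest = [i for i in range(n) if i not in focus]

    # Scan: pick the remaining tokens least similar to any focused token.
    def max_sim_to_focus(i):
        return max(cosine(tokens[i], tokens[j]) for j in focus)
    scan = sorted(rest, key=max_sim_to_focus)[:k_scan]
    if not scan:
        return [tokens[i] for i in focus]

    # Refine: assign each leftover token to its most similar scan anchor,
    # then merge each group by score-weighted averaging.
    groups = {a: [a] for a in scan}
    for i in rest:
        if i not in groups:
            nearest = max(scan, key=lambda a: cosine(tokens[i], tokens[a]))
            groups[nearest].append(i)
    merged = []
    for a in scan:
        members = groups[a]
        total = sum(score[i] for i in members)
        weights = [score[i] / total if total else 1 / len(members) for i in members]
        dim = len(tokens[a])
        merged.append([sum(w * tokens[i][d] for w, i in zip(weights, members))
                       for d in range(dim)])
    return [tokens[i] for i in focus] + merged
```

Merging leftovers into the scan anchors keeps the output at exactly `k_focus + k_scan` tokens, which matches the abstract's point that refinement does not increase the token budget.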
中文标题/摘要
标题:聚焦-扫描-精炼:从人类视觉感知到高效的视觉标记剪枝
视觉语言模型(VLMs)通常生成大量的视觉标记,极大地增加了推理延迟和内存占用;而无需训练的标记剪枝提供了一种实用的解决方案,但现有方法在剧烈压缩下仍然难以平衡局部证据和全局上下文。我们提出了一种名为聚焦-扫描-精炼(FSR)的人类启发式、即插即用剪枝框架,该框架模仿了人类回答视觉问题的方式:首先聚焦于关键证据,然后在需要时进行全局扫描,最后通过聚合相关细节来精炼扫描的上下文。FSR 首先通过结合视觉重要性和指令相关性来聚焦关键证据,避免了对视觉上显著但与查询无关的区域的偏见。然后,它根据聚焦的集合进行补充上下文扫描,选择与聚焦证据差异最大的标记。最后,FSR 通过基于相似性的分配和加权合并来精炼扫描的上下文,而不增加标记预算。在多个 VLM 后端和视觉语言基准上的广泛实验表明,FSR 一致地提高了与现有最先进的剪枝方法相比的准确性和效率权衡。源代码可以在 https://github.com/ILOT-code/FSR 获取
Summary / 总结
The research aims to address the issue of excessive visual tokens in vision-language models, which increases inference latency and memory usage. The proposed Focus-Scan-Refine (FSR) framework mimics human visual perception by focusing on key evidence, scanning for complementary context, and refining the context through similarity-based aggregation. Experiments across various VLM backbones and benchmarks demonstrate that FSR outperforms existing methods in balancing accuracy and efficiency.
研究旨在解决视觉语言模型中视觉令牌过多的问题,这会增加推理延迟和内存使用。提出的Focus-Scan-Refine (FSR)框架模仿人类视觉感知,通过聚焦关键证据、扫描补充上下文并基于相似性聚合进行细化。实验表明,FSR在各种VLM骨干网络和基准测试中优于现有方法,能够在准确性和效率之间取得更好的平衡。
When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering
Authors: Mahdi Astaraki, Mohammad Arshi Saloot, Ali Shiraee Kasmaee, Hamidreza Mahyar, Soheila Samiee
First: 2026-01-27T17:35:05+00:00 · Latest: 2026-02-05T15:59:52+00:00
Comments: 27 pages, 15 figures
Abstract
Retrieval-Augmented Generation (RAG) extends large language models (LLMs) beyond parametric knowledge, yet it is unclear when iterative retrieval-reasoning loops meaningfully outperform static RAG, particularly in scientific domains with multi-hop reasoning, sparse domain knowledge, and heterogeneous evidence. We provide the first controlled, mechanism-level diagnostic study of whether synchronized iterative retrieval and reasoning can surpass an idealized static upper bound (Gold Context) RAG. We benchmark eleven state-of-the-art LLMs under three regimes: (i) No Context, measuring reliance on parametric memory; (ii) Gold Context, where all oracle evidence is supplied at once; and (iii) Iterative RAG, a training-free controller that alternates retrieval, hypothesis refinement, and evidence-aware stopping. Using the chemistry-focused ChemKGMultiHopQA dataset, we isolate questions requiring genuine retrieval and analyze behavior with diagnostics spanning retrieval coverage gaps, anchor-carry drop, query quality, composition fidelity, and control calibration. Across models, Iterative RAG consistently outperforms Gold Context, with gains up to 25.6 percentage points, especially for non-reasoning fine-tuned models. Staged retrieval reduces late-hop failures, mitigates context overload, and enables dynamic correction of early hypothesis drift, but remaining failure modes include incomplete hop coverage, distractor latch trajectories, early stopping miscalibration, and high composition failure rates even with perfect retrieval. Overall, staged retrieval is often more influential than the mere presence of ideal evidence; we provide practical guidance for deploying and diagnosing RAG systems in specialized scientific settings and a foundation for more reliable, controllable iterative retrieval-reasoning frameworks.
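A training-free controller of the kind described in regime (iii) might look like the sketch below. The callables `retrieve`, `refine`, and `is_sufficient` are hypothetical placeholders for a retriever, an LLM hypothesis-update step, and a stopping check; the query-rewriting format is likewise an assumption.

```python
def iterative_rag(question, retrieve, refine, is_sufficient, max_hops=4):
    """Alternate retrieval and hypothesis refinement until evidence suffices.

    retrieve(query) -> list of passages              (hypothetical retriever)
    refine(question, evidence, hypothesis) -> str    (hypothetical LLM step)
    is_sufficient(question, evidence, hypothesis) -> bool  (stopping check)
    """
    evidence = []
    hypothesis = None
    for _ in range(max_hops):
        # Condition the next query on the current hypothesis, if any.
        query = question if hypothesis is None else f"{question} | so far: {hypothesis}"
        new_docs = [d for d in retrieve(query) if d not in evidence]
        evidence.extend(new_docs)
        hypothesis = refine(question, evidence, hypothesis)
        # Stop when the evidence looks sufficient or retrieval has stalled.
        if is_sufficient(question, evidence, hypothesis) or not new_docs:
            break
    return hypothesis, evidence
```

Staging the retrieval this way means each hop sees only the evidence gathered so far, which is one plausible reading of how the approach avoids the context overload of supplying all evidence at once.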
中文标题/摘要
标题:当迭代RAG超越理想证据时:科学多跳问答中的诊断研究
检索增强生成(RAG)将大型语言模型(LLMs)扩展到参数化知识之外,但尚不清楚何时迭代检索-推理循环在意义上优于静态RAG,特别是在具有多跳推理、稀疏领域知识和异构证据的科学领域。我们提供了第一个受控的机制级诊断研究,探讨同步迭代检索和推理是否能超越理想化的静态上限(黄金上下文)RAG。我们以三个模式基准了十一个最先进的LLM:(i)无上下文,衡量对参数化记忆的依赖;(ii)黄金上下文,所有先验证据一次提供;(iii)迭代RAG,一个无需训练的控制器,交替进行检索、假设细化和证据感知停止。使用化学重点的ChemKGMultiHopQA数据集,我们隔离需要真正检索的问题,并通过检索覆盖率差距、锚点携带丢失、查询质量、组合保真度和控制校准等诊断分析行为。在所有模型中,迭代RAG始终优于黄金上下文,增幅高达25.6个百分点,尤其是对于非推理微调模型。分阶段检索减少了晚期跳步失败,缓解了上下文过载,并允许动态纠正早期假设漂移,但剩余的失败模式包括不完整的跳步覆盖、干扰物锁定轨迹、早期停止校准错误以及即使在完美检索的情况下也存在高组合失败率。总体而言,分阶段检索往往比理想证据的存在本身更具影响力;我们提供了在专门的科学环境中部署和诊断RAG系统的实用指导,并为更可靠、可控的迭代检索-推理框架奠定了基础。
Summary / 总结
This study investigates when iterative retrieval-reasoning in RAG outperforms static RAG, especially in scientific domains. Using the ChemKGMultiHopQA dataset, eleven state-of-the-art LLMs were benchmarked under three regimes: no context, gold context, and iterative RAG. Iterative RAG consistently outperformed the gold context, with gains up to 25.6 percentage points, particularly for non-reasoning fine-tuned models. Staged retrieval reduced late-hop failures and context overload but faced challenges like incomplete hop coverage and early stopping miscalibration.
研究探讨了在科学领域中,迭代检索-推理循环何时能超越静态RAG。使用ChemKGMultiHopQA数据集,对十一款最先进的LLM在三种模式下进行了基准测试:无上下文、黄金上下文和迭代RAG。迭代RAG在所有模型中都优于黄金上下文,增幅最高可达25.6个百分点,尤其对于非推理微调模型。分阶段检索减少了晚期跳步失败和上下文过载,但仍面临如不完整跳步覆盖和高组合失败率等挑战。
Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models
Authors: Jiaqi Leng, Xiang Hu, Junxiong Wang, Jianguo Li, Wei Wu, Yucheng Lu
Venue: ICLR 2026
First: 2025-10-20T06:17:57+00:00 · Latest: 2026-02-05T15:48:38+00:00
Comments: Accepted to ICLR 2026
Abstract
Effectively processing long contexts is a critical challenge for language models. While standard Transformers are limited by quadratic complexity and poor length extrapolation, alternative architectures like sliding window attention and state space models sacrifice the ability to effectively utilize the full context due to their fixed-size memory. Chunk-based sparse attention has emerged as a promising paradigm for extreme length generalization, yet the key architectural principles underpinning its success are not yet fully understood. In this work, we present a systematic dissection of these models to identify the core components driving their performance. Through a unified framework and comprehensive ablation studies, we demonstrate that a combination of three design principles is critical: (1) an expressive, non-linear Chunk Encoder with a dedicated CLS token to produce representations for retrieval; (2) a Bypassing Residual Path to stably integrate retrieved global information without it being overridden by the local residual stream; and (3) enforced selection sparsity during pre-training to bridge the train-test distribution gap. We provide a theoretical motivation for intra-chunk information processing and landmark generation. By combining these principles, we establish a new state-of-the-art for training-free length extrapolation, successfully generalizing models trained on a 4K context to 32 million tokens on RULER and BABILong. Our findings provide a clear and empirically-grounded set of design principles for developing future, highly-capable long-context language models.
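The retrieval step at the heart of chunk-based sparse attention can be sketched as below. Mean pooling is used here only as a stand-in for the learned non-linear Chunk Encoder and its CLS token, and the function names are illustrative assumptions rather than the paper's API.

```python
def split_into_chunks(vectors, size):
    """Split a sequence of token vectors into consecutive fixed-size chunks."""
    return [vectors[i:i + size] for i in range(0, len(vectors), size)]

def landmark(chunk):
    """Summarize a chunk as one vector (mean pooling, a stand-in encoder)."""
    dim = len(chunk[0])
    return [sum(v[d] for v in chunk) / len(chunk) for d in range(dim)]

def select_chunks(query, chunks, k):
    """Return the indices of the k chunks whose landmarks best match the query."""
    scores = [sum(q * x for q, x in zip(query, landmark(c))) for c in chunks]
    top = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)  # restore positional order for downstream attention
```

Because only `k` chunks are attended to regardless of sequence length, the per-query cost stays bounded as the context grows, which is the property that makes extrapolation far beyond the training length feasible.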
中文标题/摘要
标题:理解并改进层次稀疏注意模型中的长度泛化
有效地处理长上下文是语言模型面临的关键挑战。虽然标准Transformer受到二次复杂性和长度外推能力差的限制,但滑动窗口注意和状态空间模型等替代架构由于其固定大小的内存而牺牲了充分利用完整上下文的能力。基于块的稀疏注意已成为一种有前景的极端长度泛化范式,但其成功背后的关键架构原则尚未完全理解。在本文中,我们系统地剖析了这些模型,以确定驱动其性能的核心组件。通过统一框架和全面的消融研究,我们证明了三种设计原则的结合是至关重要的:(1)具有专用CLS标记的表达性非线性块编码器,用于生成检索表示;(2)旁路残差路径,以稳定地整合检索到的全局信息,而不被局部残差流覆盖;(3)预训练期间强制选择稀疏性,以弥合训练-测试分布差距。我们为块内信息处理和地标生成提供了理论动机。通过结合这些原则,我们在免训练长度外推方面确立了新的最先进水平,成功地将在4K上下文上训练的模型泛化到RULER和BABILong上的3200万标记。我们的发现为开发未来能力更强的长上下文语言模型提供了一套清晰且有实证依据的设计原则。
Summary / 总结
This work addresses the challenge of processing long contexts in language models by dissecting chunk-based sparse attention models. It identifies three key design principles: an expressive Chunk Encoder with a CLS token, a Bypassing Residual Path, and enforced selection sparsity during pre-training. These principles improve length generalization, enabling models trained on 4K contexts to generalize to 32 million tokens on RULER and BABILong, setting a new state-of-the-art for training-free length extrapolation.
该研究分析了基于块的稀疏注意力模型,以解决语言模型处理长上下文的挑战。它确定了三个关键设计原则:具有CLS标记的表达性块编码器、旁路残差路径以及预训练期间强制选择稀疏性。这些原则提高了长度泛化能力,使在4K上下文中训练的模型能够泛化到RULER和BABILong上的3200万令牌,从而在无训练的长度外推方面达到了新的最佳水平。
Optimization and Generation in Aerodynamics Inverse Design
Authors: Huaguan Chen, Ning Lin, Luxi Chen, Rui Zhang, Wenbing Huang, Chongxuan Li, Hao Sun
First: 2026-02-03T14:32:26+00:00 · Latest: 2026-02-05T15:47:18+00:00
Abstract
Inverse design with physics-based objectives is challenging because it couples high-dimensional geometry with expensive simulations, as exemplified by aerodynamic shape optimization for drag reduction. We revisit inverse design through two canonical solutions, the optimal design point and the optimal design distribution, and relate them to optimization and guided generation. Building on this view, we propose a new training loss for cost predictors and a density-gradient optimization method that improves objectives while preserving plausible shapes. We further unify existing training-free guided generation methods. To address their inability to approximate conditional covariance in high dimensions, we develop a time- and memory-efficient algorithm for approximate covariance estimation. Experiments on a controlled 2D study and high-fidelity 3D aerodynamic benchmarks (car and aircraft), validated by OpenFOAM simulations and miniature wind-tunnel tests with 3D-printed prototypes, demonstrate consistent gains in both optimization and guided generation. Additional offline RL results further support the generality of our approach.
中文标题/摘要
标题:气动逆向设计中的优化与生成
基于物理目标的逆向设计具有挑战性,因为它将高维几何与昂贵的模拟耦合在一起,例如气动形状优化以减少阻力。我们通过两个典型的解决方案——最优设计点和最优设计分布——重新审视逆向设计,并将它们与优化和引导生成联系起来。在此基础上,我们提出了一种新的成本预测器训练损失,并提出了一种密度梯度优化方法,该方法在提高目标的同时保留了合理的形状。我们进一步统一了现有的无训练引导生成方法。为了解决它们在高维空间中无法近似条件协方差的问题,我们开发了一种时间效率和内存效率高的近似协方差估计算法。在受控的2D研究和高保真的3D气动基准测试(汽车和飞机)上的实验,通过OpenFOAM模拟和3D打印原型的小型风洞测试,证明了在优化和引导生成方面的一致性改进。额外的离线强化学习结果进一步支持了我们方法的通用性。
Summary / 总结
The research aims to optimize and generate aerodynamic shapes for drag reduction by addressing the challenges of inverse design with physics-based objectives. The authors propose a new training loss for cost predictors and a density-gradient optimization method, which improves objectives while maintaining plausible shapes. They also develop an efficient algorithm for approximate covariance estimation to handle high-dimensional conditional covariance. Experiments on 2D and 3D aerodynamic benchmarks, validated by simulations and wind-tunnel tests, show consistent improvements in both optimization and guided generation.
研究旨在通过解决基于物理目标的逆向设计挑战,优化和生成减少阻力的气动形状。作者提出了一种新的训练损失函数和密度梯度优化方法,该方法在保持合理形状的同时提高了目标性能。通过2D和3D气动基准测试的实验,并通过模拟和风洞测试验证,展示了在优化和引导生成方面的持续改进。
Allocentric Perceiver: Disentangling Allocentric Reasoning from Egocentric Visual Priors via Frame Instantiation
Authors: Hengyi Wang, Ruiqiang Zhang, Chang Liu, Guanjie Wang, Zehua Ma, Han Fang, Weiming Zhang
First: 2026-02-05T15:45:39+00:00 · Latest: 2026-02-05T15:45:39+00:00
Abstract
With the rising need for spatially grounded tasks such as Vision-Language Navigation/Action, allocentric perception capabilities in Vision-Language Models (VLMs) are receiving growing focus. However, VLMs remain brittle on allocentric spatial queries that require explicit perspective shifts, where the answer depends on reasoning in a target-centric frame rather than the observed camera view. Thus, we introduce Allocentric Perceiver, a training-free strategy that recovers metric 3D states from one or more images with off-the-shelf geometric experts, and then instantiates a query-conditioned allocentric reference frame aligned with the instruction's semantic intent. By deterministically transforming reconstructed geometry into the target frame and prompting the backbone VLM with structured, geometry-grounded representations, Allocentric Perceiver offloads mental rotation from implicit reasoning to explicit computation. We evaluate Allocentric Perceiver across multiple backbone families on spatial reasoning benchmarks, observing consistent and substantial gains (~10%) on allocentric tasks while maintaining strong egocentric performance, and surpassing both spatial-perception-finetuned models and state-of-the-art open-source and proprietary models.
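The idea of replacing implicit mental rotation with an explicit frame change can be shown with a small sketch. It simplifies the metric 3D setting of the abstract to a 2D ground plane, and the function names and the front/left sign conventions are assumptions made for illustration.

```python
import math

def to_target_frame(point, origin, heading_rad):
    """Express a 2D world point in a frame centered on `origin`.

    The frame's forward axis points along `heading_rad` (measured
    counter-clockwise from the world x-axis); its second axis points left.
    """
    dx, dy = point[0] - origin[0], point[1] - origin[1]
    c, s = math.cos(heading_rad), math.sin(heading_rad)
    forward = c * dx + s * dy
    left = -s * dx + c * dy
    return forward, left

def describe(point, origin, heading_rad):
    """Turn target-frame coordinates into a coarse textual relation."""
    forward, left = to_target_frame(point, origin, heading_rad)
    return ("front" if forward >= 0 else "behind",
            "left" if left >= 0 else "right")
```

A relation computed this way, for example `describe(cup_xy, person_xy, person_heading)` with hypothetical reconstructed coordinates, could then be serialized into the prompt so that the VLM reads the answer off a structured representation instead of inferring the perspective shift itself.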
中文标题/摘要
标题:以中心视角感知器:通过框架实例化分离以中心视角推理与自视角视觉先验
随着对空间定位任务如视觉语言导航/动作的需求日益增长,视觉语言模型(VLMs)的以中心视角感知能力正受到越来越多的关注。然而,VLMs 在需要显式视角转换的以中心视角空间查询上仍然脆弱,答案依赖于在目标为中心的框架中进行推理,而不是观察到的摄像机视图。因此,我们引入了以中心视角感知器,这是一种无需训练的策略,可以从一张或多张图像中恢复出度量化的3D状态,并使用现成的几何专家,然后根据指令的语义意图实例化一个查询条件下的以中心视角参考框架。通过确定性地将重建的几何形状转换为目标框架,并用结构化的、几何定位的表示提示主干VLM,以中心视角感知器将心理旋转从隐式推理卸载到显式计算。我们在多个主干家族上对以中心视角感知器进行了空间推理基准测试评估,观察到在以中心视角任务上的一致且显著的提升(约10%),同时保持了强大的自视角性能,并超越了空间感知微调模型和最先进的开源和专有模型。
Ethology of Latent Spaces
Authors: Philippe Boisnard
First: 2026-02-05T14:37:31+00:00 · Latest: 2026-02-05T14:37:31+00:00
Comments: 23 pages, 14 figures, presented at the Hyperheritage International Symposium 9 (https://paragraphe.univ-paris8.fr/IMG/pdf/programme_colloque_his9_campuscondorcet_v3.pdf) and accepted for publication in French, after double-blind peer review, in 2026-2027
Abstract
This study challenges the presumed neutrality of latent spaces in vision language models (VLMs) by adopting an ethological perspective on their algorithmic behaviors. Rather than constituting spaces of homogeneous indeterminacy, latent spaces exhibit model-specific algorithmic sensitivities, understood as differential regimes of perceptual salience shaped by training data and architectural choices.
Through a comparative analysis of three models (OpenAI CLIP, OpenCLIP LAION, SigLIP) applied to a corpus of 301 artworks (15th to 20th century), we reveal substantial divergences in the attribution of political and cultural categories. Using bipolar semantic axes derived from vector analogies (Mikolov et al., 2013), we show that SigLIP classifies 59.4% of the artworks as politically engaged, compared to only 4% for OpenCLIP. African masks receive the highest political scores in SigLIP while remaining apolitical in OpenAI CLIP. On an aesthetic colonial axis, inter-model discrepancies reach 72.6 percentage points.
We introduce three operational concepts: computational latent politicization, describing the emergence of political categories without intentional encoding; emergent bias, irreducible to statistical or normative bias and detectable only through contrastive analysis; and three algorithmic scopic regimes: entropic (LAION), institutional (OpenAI), and semiotic (SigLIP), which structure distinct modes of visibility. Drawing on Foucault's notion of the archive, Jameson's ideologeme, and Simondon's theory of individuation, we argue that training datasets function as quasi-archives whose discursive formations crystallize within latent space. This work contributes to a critical reassessment of the conditions under which VLMs are applied to digital art history and calls for methodologies that integrate learning architectures into any delegation of cultural interpretation to algorithmic agents.
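A bipolar semantic axis of the kind mentioned in the abstract can be built from two pole embeddings and used to score image embeddings. The sketch below is a generic construction, not the study's pipeline: how the poles are embedded, the zero threshold, and the function names are assumptions.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def bipolar_axis(pos_emb, neg_emb):
    """Axis pointing from the negative pole to the positive pole."""
    return normalize([p - n for p, n in zip(pos_emb, neg_emb)])

def axis_score(image_emb, axis):
    """Projection of the unit-normalized image embedding onto the axis."""
    return sum(a * b for a, b in zip(normalize(image_emb), axis))

def share_positive(image_embs, axis, threshold=0.0):
    """Fraction of images scoring above `threshold` (assumed decision rule)."""
    hits = sum(1 for e in image_embs if axis_score(e, axis) > threshold)
    return hits / len(image_embs)
```

Running the same poles and the same images through several encoders and comparing `share_positive` across them is the sort of contrastive analysis that would surface the inter-model divergences the abstract reports.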
中文标题/摘要
标题:潜空间的行为学
本研究从行为学的角度审视视觉语言模型(VLMs)的算法行为,挑战了其潜空间被预设的中立性。潜空间并非同质的不确定空间,而是表现出模型特有的算法敏感性,即由训练数据和架构选择塑造的感知显著性的差异化机制。通过对三个模型(OpenAI CLIP、OpenCLIP LAION、SigLIP)应用于301件(15至20世纪)艺术作品的比较分析,我们揭示了政治和文化分类的显著差异。使用源自向量类比(Mikolov等人,2013)的双极语义轴,我们发现SigLIP将59.4%的艺术作品分类为具有政治倾向,而OpenCLIP仅为4%。非洲面具在SigLIP中获得最高的政治评分,而在OpenAI CLIP中则保持非政治化。在美学殖民轴上,不同模型之间的差异达到72.6个百分点。
我们引入了三个操作概念:计算潜空间政治化,描述在无故意编码的情况下政治类别如何出现;涌现偏差,不可归结为统计或规范偏差,仅通过对比分析可被检测到;以及三种算法视域制度:熵式(LAION)、制度式(OpenAI)和符号学式(SigLIP),它们构建了不同的可见性模式。借鉴福柯的档案概念、詹姆逊的意识形态素以及西蒙东的个体化理论,我们主张训练数据集作为准档案,其话语形成在潜空间中固化。本研究为视觉语言模型在数字艺术史中的应用条件提供了批判性重新评估,并呼吁在将文化解释委托给算法代理时,采用将学习架构纳入考量的方法论。
Summary / 总结
This study examines the ethological behaviors of latent spaces in vision language models (VLMs) by analyzing three models (OpenAI CLIP, OpenCLIP LAION, SigLIP) applied to 301 artworks from the 15th to 20th centuries. Key findings include substantial divergences in the attribution of political and cultural categories, with SigLIP classifying 59.4% of artworks as politically engaged, compared to only 4% for OpenCLIP. The study introduces concepts like computational latent politicization and emergent bias, and identifies three algorithmic scopic regimes: entropic, institutional, and semiotic, which structure distinct modes of visibility in latent spaces.
该研究通过分析三个模型(OpenAI CLIP、OpenCLIP LAION、SigLIP)应用于15至20世纪的301件艺术品,探讨了视觉语言模型(VLMs)潜空间的行为学特征。研究揭示了政治和文化分类的显著差异:SigLIP将59.4%的艺术品分类为具有政治倾向,而OpenCLIP仅为4%。研究引入了计算潜空间政治化和涌现偏差的概念,并识别了三种算法视域制度:熵式、制度式和符号学式,这些制度在潜空间中构建了不同的可见性模式。
ShapeUP: Scalable Image-Conditioned 3D Editing
Authors: Inbar Gat, Dana Cohen-Bar, Guy Levy, Elad Richardson, Daniel Cohen-Or
First: 2026-02-05T13:59:16+00:00 · Latest: 2026-02-05T13:59:16+00:00
Abstract
Recent advancements in 3D foundation models have enabled the generation of high-fidelity assets, yet precise 3D manipulation remains a significant challenge. Existing 3D editing frameworks often face a difficult trade-off between visual controllability, geometric consistency, and scalability. Specifically, optimization-based methods are prohibitively slow, multi-view 2D propagation techniques suffer from visual drift, and training-free latent manipulation methods are inherently bound by frozen priors and cannot directly benefit from scaling. In this work, we present ShapeUP, a scalable, image-conditioned 3D editing framework that formulates editing as a supervised latent-to-latent translation within a native 3D representation. This formulation allows ShapeUP to build on a pretrained 3D foundation model, leveraging its strong generative prior while adapting it to editing through supervised training. In practice, ShapeUP is trained on triplets consisting of a source 3D shape, an edited 2D image, and the corresponding edited 3D shape, and learns a direct mapping using a 3D Diffusion Transformer (DiT). This image-as-prompt approach enables fine-grained visual control over both local and global edits and achieves implicit, mask-free localization, while maintaining strict structural consistency with the original asset. Our extensive evaluations demonstrate that ShapeUP consistently outperforms current trained and training-free baselines in both identity preservation and edit fidelity, offering a robust and scalable paradigm for native 3D content creation.
中文标题/摘要
标题:ShapeUP:可扩展的基于图像的3D编辑
近期3D基础模型的进展使得高保真资产的生成成为可能,但精确的3D操作仍然是一个重大挑战。现有的3D编辑框架往往在视觉可控性、几何一致性与可扩展性之间面临难以调和的权衡。具体来说,基于优化的方法速度过慢,多视角2D传播技术存在视觉漂移问题,而无需训练的潜在操作方法则受限于固定的先验知识,无法直接从可扩展性中获益。在本文中,我们提出了ShapeUP,一种可扩展的基于图像的3D编辑框架,将编辑形式化为在原生3D表示中的监督潜在域到潜在域的转换。这种形式化使得ShapeUP能够基于预训练的3D基础模型,利用其强大的生成先验,并通过监督训练进行适应。实际上,ShapeUP在包含源3D形状、编辑后的2D图像及其对应的编辑后3D形状的三元组上进行训练,并使用3D扩散变换器(DiT)学习直接映射。这种基于图像的提示方法使得对局部和全局编辑具有精细的视觉控制,并实现了隐式的、无掩码的定位,同时保持与原始资产的严格结构一致性。我们的广泛评估表明,ShapeUP在身份保留和编辑保真度方面始终优于当前的训练有素和无需训练的基线,提供了一种稳健且可扩展的原生3D内容创建范式。
Summary / 总结
ShapeUP is a scalable 3D editing framework that formulates editing as a supervised latent-to-latent translation within a native 3D representation, leveraging a pretrained 3D foundation model. It uses triplets of source 3D shapes, edited 2D images, and corresponding edited 3D shapes to train a 3D Diffusion Transformer (DiT) for direct mapping. ShapeUP achieves fine-grained visual control, implicit localization, and strict structural consistency, outperforming current baselines in identity preservation and edit fidelity.
ShapeUP 是一种可扩展的 3D 编辑框架,通过 3D 表示中的监督潜在到潜在的转换,利用预训练的 3D 基础模型。它使用 3D 扩散变换器 (DiT) 从包含源 3D 形状、编辑后的 2D 图像及其相应编辑后的 3D 形状的三元组中学习直接映射。ShapeUP 提供了精细的视觉控制并保持结构一致性,其在身份保留和编辑保真度方面优于当前基线。
Customizing Visual Emotion Evaluation for MLLMs: An Open-vocabulary, Multifaceted, and Scalable Approach
Authors: Daiqing Wu, Dongbao Yang, Sicheng Zhao, Can Ma, Yu Zhou
Venue: ICLR 2026
First: 2025-09-26T06:30:39+00:00 · Latest: 2026-02-05T13:38:54+00:00
Comments: Accepted by ICLR 2026
Abstract
Recently, Multimodal Large Language Models (MLLMs) have achieved exceptional performance across diverse tasks, continually surpassing previous expectations regarding their capabilities. Nevertheless, their proficiency in perceiving emotions from images remains debated, with studies yielding divergent results in zero-shot scenarios. We argue that this inconsistency stems partly from constraints in existing evaluation methods, including the oversight of plausible responses, limited emotional taxonomies, neglect of contextual factors, and labor-intensive annotations. To facilitate customized visual emotion evaluation for MLLMs, we propose an Emotion Statement Judgment task that overcomes these constraints. Complementing this task, we devise an automated pipeline that efficiently constructs emotion-centric statements with minimal human effort. Through systematically evaluating prevailing MLLMs, our study showcases their stronger performance in emotion interpretation and context-based emotion judgment, while revealing relative limitations in comprehending perception subjectivity. When compared to humans, even top-performing MLLMs like GPT4o demonstrate remarkable performance gaps, underscoring key areas for future improvement. By developing a fundamental evaluation framework and conducting a comprehensive MLLM assessment, we hope this work contributes to advancing emotional intelligence in MLLMs. Project page: https://github.com/wdqqdw/MVEI.
中文标题/摘要
标题:为MLLMs定制视觉情感评估:一种开放词汇、多维度和可扩展的方法
近年来,多模态大型语言模型(MLLMs)在多种任务中取得了卓越的性能,不断超越人们对它们能力的预期。然而,它们在从图像中感知情感的能力方面仍存在争议,研究结果在零样本场景中存在分歧。我们认为这种不一致部分源于现有评估方法的限制,包括忽视可能的响应、有限的情感分类、忽略背景因素以及劳动密集型注释。为了促进MLLMs的定制视觉情感评估,我们提出了一项情感陈述判断任务,以克服这些限制。配合这一任务,我们设计了一种自动化流水线,能够高效地构建以情感为中心的陈述,同时减少人力投入。通过系统地评估现有的MLLMs,我们的研究展示了它们在情感解释和基于上下文的情感判断方面的更强表现,同时也揭示了它们在理解感知主观性方面的相对局限性。与人类相比,即使是表现最佳的MLLMs如GPT4o也显示出显著的性能差距,突显了未来改进的关键领域。通过开发基本的评估框架并进行全面的MLLM评估,我们希望这项工作能够促进MLLMs的情感智能。
Summary / 总结
This study addresses the divergent findings on how well Multimodal Large Language Models (MLLMs) perceive emotions from images by proposing an Emotion Statement Judgment task. The task and an accompanying automated pipeline are designed to overcome limitations in existing evaluation methods. Key findings show that MLLMs perform better in emotion interpretation and context-based emotion judgment but remain limited in understanding perception subjectivity, with even top-performing models like GPT4o showing significant gaps compared to human performance.
该研究通过提出情感陈述判断任务和自动化管道来解决评估MLLMs从图像中感知情绪的一致性问题。研究发现,MLLMs在情绪解释和基于上下文的情绪判断方面表现更好,但在理解感知主观性方面存在困难,即使是如GPT4o这样的先进模型与人类相比也存在显著差距。
Geometric Observability Index: An Operator-Theoretic Framework for Per-Feature Sensitivity, Weak Observability, and Dynamic Effects in SE(3) Pose Estimation
Authors: Joe-Mei Feng, Sheng-Wei Yu
First: 2026-02-05T12:12:00+00:00 · Latest: 2026-02-05T12:12:00+00:00
Abstract
We present a unified operator-theoretic framework for analyzing per-feature sensitivity in camera pose estimation on the Lie group SE(3). Classical sensitivity tools - conditioning analyses, Euclidean perturbation arguments, and Fisher information bounds - do not explain how individual image features influence the pose estimate, nor why dynamic or inconsistent observations can disproportionately distort modern SLAM and structure-from-motion systems. To address this gap, we extend influence function theory to matrix Lie groups and derive an intrinsic perturbation operator for left-trivialized M-estimators on SE(3).
The resulting Geometric Observability Index (GOI) quantifies the contribution of a single measurement through the curvature operator and the Lie algebraic structure of the observable subspace. GOI admits a spectral decomposition along the principal directions of the observable curvature, revealing a direct correspondence between weak observability and amplified sensitivity. In the population regime, GOI coincides with the Fisher information geometry on SE(3), yielding a single-measurement analogue of the Cramer-Rao bound.
The same spectral mechanism explains classical degeneracies such as pure rotation and vanishing parallax, as well as dynamic feature amplification along weak curvature directions. Overall, GOI provides a geometrically consistent description of measurement influence that unifies conditioning analysis, Fisher information geometry, influence function theory, and dynamic scene detectability through the spectral geometry of the curvature operator. Because these quantities arise directly within Gauss-Newton pipelines, the curvature spectrum and GOI also yield lightweight, training-free diagnostic signals for identifying dynamic features and detecting weak observability configurations without modifying existing SLAM architectures.
中文标题/摘要
标题:几何可观测性指标:SE(3) 姿态估计中单特征灵敏度、弱可观测性和动态效应的算子理论框架
我们提出了一种统一的算子理论框架,用于分析相机姿态估计在李群SE(3)上的单特征灵敏度。经典的灵敏度工具——条件分析、欧几里得扰动论证和Fisher信息界——无法解释单个图像特征如何影响姿态估计,也无法解释动态或不一致的观测为何会不成比例地扭曲现代SLAM和结构光度法系统。为了解决这一差距,我们将影响函数理论扩展到矩阵李群,并推导出SE(3)上左平凡化M-估计量的内在扰动算子。
由此产生的几何可观测性指标(GOI)通过曲率算子和可观测子空间的李代数结构量化单个测量的贡献。GOI沿可观测曲率的主要方向进行谱分解,揭示了弱可观测性和放大灵敏度之间的直接对应关系。在总体情况下,GOI与SE(3)上的Fisher信息几何学一致,产生了一个单测量的Cramer-Rao界类比。
该相同的谱机制解释了纯旋转和消失视差等经典退化现象,以及弱曲率方向上的动态特征放大。总体而言,GOI提供了一种几何上一致的测量影响描述,统一了条件分析、Fisher信息几何学、影响函数理论和动态场景可检测性,通过曲率算子的谱几何学。由于这些量直接出现在高斯-牛顿管道中,曲率谱和GOI也提供了轻量级、无需训练的诊断信号,用于识别动态特征和检测弱可观测性配置,而不修改现有的SLAM架构。
Summary / 总结
The paper introduces the Geometric Observability Index (GOI) to analyze the sensitivity of individual image features in camera pose estimation on SE(3). GOI extends influence function theory to matrix Lie groups, providing a unified framework that connects weak observability with amplified sensitivity. Key findings include a spectral decomposition of GOI revealing the influence of measurement curvature and a direct correspondence to Fisher information geometry, offering a lightweight diagnostic tool for identifying dynamic features and weak observability configurations.
论文提出了几何可观性指数(GOI),用于分析SE(3)上单个图像特征在相机姿态估计中的敏感性。GOI将影响函数理论扩展到矩阵李群,提供了一个统一的框架,将弱可观性与放大敏感性直接联系起来。关键发现包括GOI沿主方向的谱分解,揭示了弱可观性与敏感性之间的直接联系,并在总体情况下给出了Cramer-Rao界的一个单测量类比。
LoGoSeg: Integrating Local and Global Features for Open-Vocabulary Semantic Segmentation
Authors: Junyang Chen, Xiangbo Lv, Zhiqiang Kou, Xingdong Sheng, Ning Xu, Yiguo Qiao
First: 2026-02-05T12:03:11+00:00 · Latest: 2026-02-05T12:03:11+00:00
Abstract
Open-vocabulary semantic segmentation (OVSS) extends traditional closed-set segmentation by enabling pixel-wise annotation for both seen and unseen categories using arbitrary textual descriptions. While existing methods leverage vision-language models (VLMs) like CLIP, their reliance on image-level pretraining often results in imprecise spatial alignment, leading to mismatched segmentations in ambiguous or cluttered scenes. Moreover, most existing approaches lack strong object priors and region-level constraints, which can lead to object hallucination or missed detections, further degrading performance. To address these challenges, we propose LoGoSeg, an efficient single-stage framework that integrates three key innovations: (i) an object existence prior that dynamically weights relevant categories through global image-text similarity, effectively reducing hallucinations; (ii) a region-aware alignment module that establishes precise region-level visual-textual correspondences; and (iii) a dual-stream fusion mechanism that optimally combines local structural information with global semantic context. Unlike prior works, LoGoSeg eliminates the need for external mask proposals, additional backbones, or extra datasets, ensuring efficiency. Extensive experiments on six benchmarks (A-847, PC-459, A-150, PC-59, PAS-20, and PAS-20b) demonstrate its competitive performance and strong generalization in open-vocabulary settings.
中文标题/摘要
标题:LoGoSeg:结合局部和全局特征的开放词汇语义分割
开放词汇语义分割(OVSS)扩展了传统的封闭集分割,通过使用任意文本描述对已见和未见类别进行像素级标注。现有方法利用视觉语言模型(VLMs)如CLIP,但其依赖于图像级预训练往往导致空间对齐不精确,导致在模糊或杂乱场景中出现分割不匹配。然而,大多数现有方法缺乏强大的对象先验和区域级约束,这可能导致对象幻觉或漏检,进一步降低性能。为解决这些挑战,我们提出LoGoSeg,一种高效的单阶段框架,集成了三个关键创新:(i)一种对象存在先验,通过全局图像-文本相似性动态加权相关类别,有效减少幻觉;(ii)一种区域感知对齐模块,建立精确的区域级视觉-文本对应关系;(iii)一种双流融合机制,最优地结合局部结构信息与全局语义上下文。与先前工作不同,LoGoSeg 消除了对外部掩码提案、额外骨干网络或额外数据集的需要,确保高效性。在六个基准(A-847、PC-459、A-150、PC-59、PAS-20 和 PAS-20b)上的广泛实验表明,其在开放词汇设置中的性能和泛化能力具有竞争力。
Summary / 总结
LoGoSeg is designed to improve open-vocabulary semantic segmentation by integrating local and global features. It introduces an object existence prior, a region-aware alignment module, and a dual-stream fusion mechanism to address the issues of imprecise spatial alignment and object hallucination. Experiments on six benchmarks show that LoGoSeg achieves competitive performance and strong generalization in open-vocabulary settings, without external mask proposals, additional backbones, or extra datasets.
LoGoSeg 是一种通过整合局部和全局特征来提升开放词汇语义分割的方法。它引入了对象存在先验、区域感知对齐模块和双流融合机制,以解决空间对齐不精确和对象幻觉的问题。在六个基准上的实验表明,LoGoSeg 在开放词汇设置中表现出色且具有较强的泛化能力,无需额外的外部数据集或骨干网络。
PIO-FVLM: Rethinking Training-Free Visual Token Reduction for VLM Acceleration from an Inference-Objective Perspective
Authors: Haokui Zhang, Congyang Ou, Dawei Yan, Peng Wang, Qingsen Yan, Ying Li, Rong Xiao, Chunhua Shen
First: 2026-02-04T15:33:10+00:00 · Latest: 2026-02-05T12:00:10+00:00
Abstract
Recently, reducing redundant visual tokens in vision-language models (VLMs) to accelerate VLM inference has emerged as a hot topic. However, most existing methods rely on heuristics constructed based on inter-visual-token similarity or cross-modal visual-text similarity, which gives rise to certain limitations in compression performance and practical deployment. In contrast, we propose PIO-FVLM from the perspective of inference objectives, which transforms visual token compression into preserving output result invariance and selects tokens primarily by their importance to this goal. Specifically, vision tokens are reordered with the guidance of token-level gradient saliency generated by our designed layer-local proxy loss, a coarse constraint from the current layer to the final result. Then the most valuable vision tokens are selected following the non-maximum suppression (NMS) principle. The proposed PIO-FVLM is training-free and compatible with FlashAttention, which makes it friendly to practical application and deployment. It can be deployed independently as an encoder-free method, or combined with encoder compression approaches like VisionZip for use as an encoder-involved method. On LLaVA-Next-7B, PIO-FVLM retains just 11.1% of visual tokens but maintains 97.2% of the original performance, with a 2.67× prefill speedup, 2.11× inference speedup, 6.22× lower FLOPs, and 6.05× reduced KV Cache overhead. Our code is available at https://github.com/ocy1/PIO-FVLM.
中文标题/摘要
标题:PIO-FVLM:从推理目标视角重新思考无训练视觉标记缩减以加速VLM
近年来,减少视觉语言模型(VLMs)中的冗余视觉标记以加速VLM推理已成为一个热点话题。然而,大多数现有方法依赖于基于视觉标记间相似性或跨模态视觉-文本相似性的启发式构造,这在压缩性能和实际部署方面存在一定的局限性。相比之下,我们从推理目标的角度提出了PIO-FVLM,将视觉标记压缩转化为保持输出结果不变性,并主要通过其对该目标的重要性来选择标记。特别地,视觉标记在我们设计的层局部代理损失生成的标记级梯度显著性指导下重新排序,这是一种来自当前层到最终结果的粗略约束。然后,根据非极大值抑制(NMS)原则选择最有价值的视觉标记。提出的PIO-FVLM是无训练的,并且与FlashAttention兼容,对实际应用和部署友好。它可以独立部署作为无编码器方法,或者与VisionZip等编码器压缩方法结合使用作为包含编码器的方法。在LLaVA-Next-7B上,PIO-FVLM仅保留了11.1%的视觉标记,但保持了97.2%的原始性能,预填充速度提高了2.67倍,推理速度提高了2.11倍,FLOPs降低了6.22倍,KV缓存开销减少了6.05倍。我们的代码可在https://github.com/ocy1/PIO-FVLM获取。
Summary / 总结
The paper proposes PIO-FVLM, a training-free method for visual token reduction in vision-language models (VLMs) that focuses on preserving output result invariance. It uses token-level gradient saliency guided by a layer-local proxy loss to reorder and select the most valuable tokens, achieving significant performance and efficiency gains. On LLaVA-Next-7B, it retains only 11.1% of visual tokens while maintaining 97.2% of the original performance, with substantial speedups and reduced computational resources.
论文提出了PIO-FVLM,这是一种无需训练的视觉标记减少方法,用于加速视觉语言模型的推理。该方法基于保留输出不变性的目标重新排序和选择标记,使用由层局部代理损失生成的标记级别梯度显著性。在LLaVA-Next-7B上,PIO-FVLM仅保留了11.1%的视觉标记,同时保持了97.2%的原始性能,实现了显著的加速和资源减少。
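The selection step described in the PIO-FVLM abstract (rank tokens by saliency, then pick them under an NMS-style rule) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not say how suppression is measured, so cosine similarity between token features and the `sim_threshold` parameter are hypothetical choices, and the saliency scores are hand-written numbers rather than gradients of the paper's layer-local proxy loss.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_tokens(saliency, features, keep, sim_threshold=0.9):
    """Greedy NMS-style selection (hypothetical criterion).

    Visit tokens from highest to lowest saliency and keep a token only if it
    is not too similar to any token already kept.
    """
    order = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    kept = []
    for i in order:
        if len(kept) == keep:
            break
        if all(cosine(features[i], features[j]) < sim_threshold for j in kept):
            kept.append(i)
    return sorted(kept)  # restore the original token order


# Toy inputs: token 1 is nearly a duplicate of token 0.
saliency = [0.9, 0.85, 0.2, 0.7]
features = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.5, 0.5]]
selected = select_tokens(saliency, features, keep=2)
print(selected)
```

In this toy case token 1 is suppressed despite its high saliency because it almost duplicates token 0, so the budget goes to token 3 instead. If suppression removes too many candidates, the function returns fewer than `keep` tokens.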
TangramSR: Can Vision-Language Models Reason in Continuous Geometric Space?
Authors: Yikun Zong, Cheston Tan
First: 2026-02-05T11:49:30+00:00 · Latest: 2026-02-05T11:49:30+00:00
Comments: 13 pages, 4 figures
Abstract
Humans excel at spatial reasoning tasks like Tangram puzzle assembly through cognitive processes involving mental rotation, iterative refinement, and visual feedback. Inspired by how humans solve Tangram puzzles through trial-and-error, observation, and correction, we design a framework that models these human cognitive mechanisms. However, comprehensive experiments across five representative Vision-Language Models (VLMs) reveal systematic failures in continuous geometric reasoning: average IoU of only 0.41 on single-piece tasks, dropping to 0.23 on two-piece composition, far below human performance where children can complete Tangram tasks successfully. This paper addresses a fundamental challenge in self-improving AI: can models iteratively refine their predictions at test time without parameter updates? We introduce a test-time self-refinement framework that combines in-context learning (ICL) with reward-guided feedback loops, inspired by human cognitive processes. Our training-free verifier-refiner agent applies recursive refinement loops that iteratively self-refine predictions based on geometric consistency feedback, achieving IoU improvements from 0.63 to 0.932 on medium-triangle cases without any model retraining. This demonstrates that incorporating human-inspired iterative refinement mechanisms through ICL and reward loops can substantially enhance geometric reasoning in VLMs, moving self-improving AI from promise to practice in continuous spatial domains. Our work is available at this anonymous link https://anonymous.4open.science/r/TangramVLM-F582/.
中文标题/摘要
标题:TangramSR:视觉语言模型能否在连续几何空间中进行推理?
人类在完成如七巧板拼图等空间推理任务时,通过认知过程中的心理旋转、迭代细化和视觉反馈表现出色。受人类通过试错、观察和修正解决七巧板拼图的启发,我们设计了一个框架来模拟这些人类的认知机制。然而,对五种代表性视觉语言模型(VLMs)进行全面实验后发现,它们在连续几何推理方面存在系统性失败:单块任务的平均IoU仅为0.41,两块组合时降至0.23,远低于人类表现,儿童可以成功完成七巧板任务。本文探讨了自改进AI中的一个基本挑战:模型能否在测试时迭代细化预测而无需参数更新?我们引入了一个结合上下文学习(ICL)和奖励引导反馈循环的测试时自我细化框架。我们的无训练验证-细化代理应用递归细化循环,基于几何一致性反馈迭代自我细化预测,无需任何模型重训练即可将中三角形案例的IoU从0.63提高到0.932。这表明,通过ICL和奖励循环引入的人类启发式迭代细化机制可以显著增强VLMs的几何推理能力,在连续空间领域将自改进AI从承诺推向实践。我们的工作可在以下匿名链接获取:https://anonymous.4open.science/r/TangramVLM-F582/
Summary / 总结
This paper explores the capability of Vision-Language Models (VLMs) in continuous geometric reasoning through the Tangram puzzle task. Despite human proficiency in solving Tangram puzzles, VLMs exhibit significant limitations, achieving only 0.41 average IoU for single-piece tasks and 0.23 for two-piece compositions. To address this, the authors propose a test-time self-refinement framework combining in-context learning and reward-guided feedback loops, which improves IoU from 0.63 to 0.932 on medium-triangle cases without retraining. This demonstrates the potential of human-inspired iterative refinement mechanisms to enhance geometric reasoning in VLMs.
该论文探讨了视觉-语言模型(VLMs)是否能够进行连续几何推理,灵感来源于人类在解拼图时的认知过程。实验结果显示,VLMs 在几何推理方面表现不佳,单块任务的 IoU 仅为 0.41,两块组合任务仅为 0.23。为了解决这一问题,作者提出了一种结合上下文学习和奖励引导反馈循环的测试时自我精炼框架,该框架在无需重新训练的情况下将中三角形任务的 IoU 提高到 0.932。这表明,通过上下文学习和奖励循环引入的人类启发式迭代精炼机制能够显著增强 VLMs 的几何推理能力。
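The TangramSR results are reported as IoU, and its verifier-refiner agent accepts refinements based on geometric feedback. A minimal sketch of that loop is given below, under strong simplifying assumptions: shapes are axis-aligned rectangles on an integer grid, the "refiner" is a plain hill climb over one-cell translations rather than a VLM, and the function names `rasterize`, `iou`, and `refine` are hypothetical.

```python
def rasterize(x, y, w, h):
    """Set of unit grid cells covered by an axis-aligned rectangle."""
    return {(i, j) for i in range(x, x + w) for j in range(y, y + h)}


def iou(a, b):
    """Intersection over union of two cell sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def refine(pred, target, max_steps=10):
    """Verifier-refiner loop: accept a neighbouring placement only if IoU improves."""
    x, y, w, h = pred
    best = iou(rasterize(x, y, w, h), target)
    for _ in range(max_steps):
        candidates = [(x + dx, y + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))]
        scored = [(iou(rasterize(cx, cy, w, h), target), cx, cy) for cx, cy in candidates]
        score, cx, cy = max(scored)
        if score <= best:  # the verifier rejects moves that do not help
            break
        best, x, y = score, cx, cy
    return (x, y, w, h), best


target = rasterize(3, 2, 4, 4)
placement, score = refine((0, 0, 4, 4), target)
print(placement, score)
```

The initial placement overlaps the target in only 2 of 30 cells. Each accepted move raises the IoU, and the loop stops once no neighbouring placement scores higher. No model parameters change at any point, which is the sense in which such refinement is training-free.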
Plug-and-play linear attention with provable guarantees for training-free image restoration
Authors: Srinivasan Kidambi, Karthik Palaniappan, Pravin Nair
First: 2025-06-10T07:37:41+00:00 · Latest: 2026-02-05T11:35:14+00:00
Abstract
Multi-head self-attention (MHSA) is a key building block in modern vision Transformers, yet its quadratic complexity in the number of tokens remains a major bottleneck for real-time and resource-constrained deployment. We present PnP-Nystra, a training-free Nyström-based linear attention module designed as a plug-and-play replacement for MHSA in pretrained image restoration Transformers, with provable kernel approximation error guarantees. PnP-Nystra integrates directly into window-based architectures such as SwinIR, Uformer, and Dehazeformer, yielding efficient inference without finetuning. Across denoising, deblurring, dehazing, and super-resolution on images, PnP-Nystra delivers 1.8-3.6× speedups on an NVIDIA RTX 4090 GPU and 1.8-7× speedups on CPU inference. Compared with the strongest training-free linear-attention baselines we evaluate, our method incurs the smallest quality drop and stays closest to the original model's outputs.
中文标题/摘要
标题:插即用线性注意力:具有可证明保证的无训练图像恢复
多头自注意力(MHSA)是现代视觉Transformer的关键构建块,但其在标记数量上的二次复杂性仍然是实时和资源受限部署的主要瓶颈。我们提出了PnP-Nystra,这是一种基于Nyström的无训练线性注意力模块,旨在作为MHSA在预训练图像恢复Transformer中的即插即用替代品,具有可证明的核逼近误差保证。PnP-Nystra可以直接集成到基于窗口的架构中,如SwinIR、Uformer和Dehazeformer,无需微调即可实现高效的推理。在去噪、去模糊、去雾和超分辨率等图像任务上,PnP-Nystra在NVIDIA RTX 4090 GPU上的速度提升为1.8-3.6倍,在CPU推理上的速度提升为1.8-7倍。与我们评估的最强无训练线性注意力基线相比,我们的方法产生的质量下降最小,并且最接近原始模型的输出。
Summary / 总结
The research aims to address the computational bottleneck of multi-head self-attention (MHSA) in real-time and resource-constrained settings by proposing PnP-Nystra, a training-free linear attention module based on the Nyström method. This module is designed as a plug-and-play replacement for MHSA in pretrained image restoration Transformers and comes with provable kernel approximation error guarantees. Experimental results show speedups of 1.8 to 3.6 times on an NVIDIA RTX 4090 GPU and 1.8 to 7 times on CPU. Among the training-free linear-attention baselines evaluated, it incurs the smallest quality drop and stays closest to the original model's outputs across denoising, deblurring, dehazing, and super-resolution.
研究旨在通过提出基于Nyström方法的训练-free线性注意力模块PnP-Nystra,解决实时和资源受限环境中多头自注意力(MHSA)的计算瓶颈。该模块作为图像恢复Transformer中的插件式替代品,提供可证明的核近似误差保证。实验结果显示,PnP-Nystra在GPU和CPU上的速度分别提高了1.8到7倍,同时保持了高质量,并且最接近原始模型在各种图像恢复任务(如去噪、去模糊、去雾和超分辨率)中的输出。
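To make the idea of Nyström-based linear attention concrete, here is a minimal NumPy sketch of the generic Nyström approximation of softmax attention, with landmarks taken as segment means of the queries and keys. It is not the PnP-Nystra module itself: the landmark choice, the use of `np.linalg.pinv`, and the function names are assumptions, and the paper's windowing and error guarantees are not reproduced.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax with the usual max-subtraction for stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def exact_attention(Q, K, V):
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V


def nystrom_attention(Q, K, V, num_landmarks):
    """Approximate softmax(QK^T / sqrt(d)) V using m landmark queries and keys."""
    n, d = Q.shape
    assert n % num_landmarks == 0, "sketch assumes n is divisible by num_landmarks"
    # Landmarks: mean of each contiguous segment of queries and keys.
    Ql = Q.reshape(num_landmarks, n // num_landmarks, d).mean(axis=1)
    Kl = K.reshape(num_landmarks, n // num_landmarks, d).mean(axis=1)
    scale = 1.0 / np.sqrt(d)
    F = softmax(Q @ Kl.T * scale)   # (n, m)
    A = softmax(Ql @ Kl.T * scale)  # (m, m)
    B = softmax(Ql @ K.T * scale)   # (m, n)
    # No n-by-n matrix is ever formed, so cost is linear in n for fixed m.
    return F @ np.linalg.pinv(A) @ (B @ V)


rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((64, 16)) for _ in range(3))
exact = exact_attention(Q, K, V)
approx = nystrom_attention(Q, K, V, num_landmarks=16)
rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
full = nystrom_attention(Q, K, V, num_landmarks=64)  # one landmark per token
print(f"relative error with 16 landmarks: {rel_err:.3f}")
```

When every token is its own landmark, the three factors all equal the full attention matrix and the approximation reduces to exact attention, which gives a convenient sanity check. With fewer landmarks the error depends on the data, so no particular value is claimed here.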
VLN-Pilot: Large Vision-Language Model as an Autonomous Indoor Drone Operator
Authors: Bessie Dominguez-Dager, Sergio Suescun-Ferrandiz, Felix Escalona, Francisco Gomez-Donoso, Miguel Cazorla
First: 2026-02-05T11:23:11+00:00 · Latest: 2026-02-05T11:23:11+00:00
Abstract
This paper introduces VLN-Pilot, a novel framework in which a large Vision-and-Language Model (VLLM) assumes the role of a human pilot for indoor drone navigation. By leveraging the multimodal reasoning abilities of VLLMs, VLN-Pilot interprets free-form natural language instructions and grounds them in visual observations to plan and execute drone trajectories in GPS-denied indoor environments. Unlike traditional rule-based or geometric path-planning approaches, our framework integrates language-driven semantic understanding with visual perception, enabling context-aware, high-level flight behaviors with minimal task-specific engineering. VLN-Pilot supports fully autonomous instruction-following for drones by reasoning about spatial relationships, obstacle avoidance, and dynamic reactivity to unforeseen events. We validate our framework on a custom photorealistic indoor simulation benchmark and demonstrate the ability of the VLLM-driven agent to achieve high success rates on complex instruction-following tasks, including long-horizon navigation with multiple semantic targets. Experimental results highlight the promise of replacing remote drone pilots with a language-guided autonomous agent, opening avenues for scalable, human-friendly control of indoor UAVs in tasks such as inspection, search-and-rescue, and facility monitoring. Our results suggest that VLLM-based pilots may dramatically reduce operator workload while improving safety and mission flexibility in constrained indoor environments.
中文标题/摘要
标题:VLN-Pilot: 大型视觉语言模型作为自主室内无人机操作员
本文介绍了VLN-Pilot,这是一种新颖的框架,在该框架中,大型多模态视觉语言模型(VLLM)承担了室内无人机导航的人类飞行员角色。通过利用VLLM的多模态推理能力,VLN-Pilot 解释自由形式的自然语言指令,并将其与视觉观察相结合,以规划和执行无人机轨迹,适用于GPS受限的室内环境。与传统的基于规则或几何路径规划方法不同,我们的框架将语言驱动的语义理解与视觉感知相结合,使无人机能够实现上下文感知的高级飞行行为,而无需特定任务的工程。VLN-Pilot 通过推理空间关系、障碍物规避以及对未预见事件的动态反应,支持无人机的完全自主指令跟随。我们在一个自定义的逼真室内模拟基准上验证了我们的框架,并展示了由VLLM驱动的代理在复杂指令跟随任务中实现高成功率的能力,包括多目标的长期导航。实验结果突显了用语言引导的自主代理取代远程无人机飞行员的潜力,为室内无人机在检查、搜索与救援、设施监控等任务中的可扩展、用户友好的控制打开了途径。我们的结果表明,基于VLLM的飞行员可能大幅减少操作员的工作量,同时在受限的室内环境中提高安全性和任务灵活性。
Summary / 总结
VLN-Pilot is a framework where a large Vision-and-Language Model (VLLM) acts as an autonomous indoor drone pilot. It interprets natural language instructions and uses visual observations to plan and execute drone trajectories in GPS-denied environments. The model integrates semantic understanding with visual perception, enabling context-aware flight behaviors. Experiments on a custom photorealistic indoor simulation benchmark show that VLN-Pilot achieves high success rates on complex instruction-following tasks, including long-horizon navigation with multiple semantic targets, demonstrating the potential for language-guided autonomous control in indoor UAV tasks such as inspection and search-and-rescue.
VLN-Pilot 是一个框架,其中大型 Vision-and-Language 模型 (VLLM) 作为自主室内无人机飞行员,通过解释自然语言指令和视觉观察来在 GPS 遮挡的环境中导航无人机。该模型结合了语义理解和视觉感知,以实现上下文相关的高级飞行行为。实验表明,VLN-Pilot 可以成功遵循复杂指令、执行长时导航,并对未预见的事件进行动态反应,其在自定义的室内模拟基准测试中取得了高成功率。
When Shared Knowledge Hurts: Spectral Over-Accumulation in Model Merging
Authors: Yayuan Li, Ze Peng, Jian Zhang, Jintao Guo, Yue Duan, Yinghuan Shi
First: 2026-02-05T10:52:36+00:00 · Latest: 2026-02-05T10:52:36+00:00
Abstract
Model merging combines multiple fine-tuned models into a single model by adding their weight updates, providing a lightweight alternative to retraining. Existing methods primarily target resolving conflicts between task updates, leaving the failure mode of over-counting shared knowledge unaddressed. We show that when tasks share aligned spectral directions (i.e., overlapping singular vectors), a simple linear combination repeatedly accumulates these directions, inflating the singular values and biasing the merged model toward shared subspaces. To mitigate this issue, we propose Singular Value Calibration (SVC), a training-free and data-free post-processing method that quantifies subspace overlap and rescales inflated singular values to restore a balanced spectrum. Across vision and language benchmarks, SVC consistently improves strong merging baselines and achieves state-of-the-art performance. Furthermore, by modifying only the singular values, SVC improves the performance of Task Arithmetic by 13.0%. Code is available at: https://github.com/lyymuwu/SVC.
中文标题/摘要
标题:共享知识的反面效应:模型合并中的光谱过积累
模型合并通过将多个微调模型的权重更新相加,将它们合并为一个模型,提供了一种轻量级的替代重新训练的方法。现有方法主要针对解决任务更新之间的冲突,而忽略了共享知识过量累积的失败模式。我们展示了当任务共享对齐的光谱方向(即重叠的奇异向量)时,简单的线性组合会反复累积这些方向,膨胀奇异值并使合并模型偏向共享子空间。为缓解这一问题,我们提出了奇异值校准(SVC),这是一种无需训练和数据的后处理方法,量化子空间重叠并重新缩放膨胀的奇异值以恢复平衡的光谱。在视觉和语言基准测试中,SVC 一致地改进了强大的合并基线并达到了最先进的性能。此外,仅通过修改奇异值,SVC 将任务算术的性能提高了13.0%。代码可在:https://github.com/lyymuwu/SVC 获取。
Summary / 总结
The paper addresses the issue of spectral over-accumulation in model merging, where shared knowledge between tasks is over-counted, leading to biased models. It introduces Singular Value Calibration (SVC), a training-free and data-free method that rescales inflated singular values to restore a balanced spectrum, improving model performance across vision and language benchmarks. SVC also enhances Task Arithmetic by 13.0%.
论文解决了模型合并中由于任务间的共享知识被过度累加而导致模型偏差的问题。提出了Singular Value Calibration (SVC)方法,这是一种无需训练和数据的方法,通过重新调整放大的奇异值来恢复平衡的谱,从而在视觉和语言基准测试中提升模型性能。SVC还提高了任务算术的性能,提升了13.0%。
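The over-accumulation effect described for model merging can be reproduced in a few lines of NumPy. The sketch below only demonstrates the failure mode (summing task updates that share singular vectors inflates the singular value); it does not implement Singular Value Calibration, whose overlap measure and rescaling rule are not given in the abstract. The rank-1 updates and the helper names are illustrative assumptions.

```python
import numpy as np


def unit(v):
    return v / np.linalg.norm(v)


def orthogonal_to(w, rng):
    """Return a unit vector orthogonal to w (second column of a QR basis)."""
    q, _ = np.linalg.qr(np.column_stack([w, rng.standard_normal(w.size)]))
    return q[:, 1]


rng = np.random.default_rng(0)
u = unit(rng.standard_normal(8))
v = unit(rng.standard_normal(6))

# Two task updates sharing the same rank-1 spectral direction (u, v),
# each with singular value 1.
delta_a = np.outer(u, v)
delta_b = np.outer(u, v)
shared_top = np.linalg.svd(delta_a + delta_b, compute_uv=False)[0]

# Two task updates along mutually orthogonal directions: nothing accumulates.
delta_c = np.outer(orthogonal_to(u, rng), orthogonal_to(v, rng))
disjoint_top = np.linalg.svd(delta_a + delta_c, compute_uv=False)[0]

print(shared_top, disjoint_top)
```

With shared directions the merged update has a top singular value of 2, twice that of either task. With orthogonal directions it stays at 1. A calibration method in the spirit of the abstract would detect the first situation and scale the inflated value back down.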
Detecting Misbehaviors of Large Vision-Language Models by Evidential Uncertainty Quantification
Authors: Tao Huang, Rui Wang, Xiaofei Liu, Yi Qin, Li Duan, Liping Jing
Venue: ICLR 2026
First: 2026-02-05T10:51:39+00:00 · Latest: 2026-02-05T10:51:39+00:00
Comments: Accepted to ICLR 2026. Code is available at https://github.com/HT86159/EUQ
Abstract
Large vision-language models (LVLMs) have shown substantial advances in multimodal understanding and generation. However, when presented with incompetent or adversarial inputs, they frequently produce unreliable or even harmful content, such as fact hallucinations or dangerous instructions. This misalignment with human expectations, referred to as misbehaviors of LVLMs, raises serious concerns for deployment in critical applications. These misbehaviors are found to stem from epistemic uncertainty, specifically either conflicting internal knowledge or the absence of supporting information. However, existing uncertainty quantification methods, which typically capture only overall epistemic uncertainty, have shown limited effectiveness in identifying such issues. To address this gap, we propose Evidential Uncertainty Quantification (EUQ), a fine-grained method that captures both information conflict and ignorance for effective detection of LVLM misbehaviors. In particular, we interpret features from the model output head as either supporting (positive) or opposing (negative) evidence. Leveraging Evidence Theory, we model and aggregate this evidence to quantify internal conflict and knowledge gaps within a single forward pass. We extensively evaluate our method across four categories of misbehavior, including hallucinations, jailbreaks, adversarial vulnerabilities, and out-of-distribution (OOD) failures, using state-of-the-art LVLMs, and find that EUQ consistently outperforms strong baselines, showing that hallucinations correspond to high internal conflict and OOD failures to high ignorance. Furthermore, layer-wise evidential uncertainty dynamics analysis helps interpret the evolution of internal representations from a new perspective. The source code is available at https://github.com/HT86159/EUQ.
中文标题/摘要
标题:通过证据不确定性量化检测大型视觉-语言模型的不当行为
大型视觉-语言模型(LVLMs)在多模态理解和生成方面取得了显著进展。然而,当面对无能或对抗性输入时,它们经常生成不可靠甚至有害的内容,如事实幻觉或危险指令。这种与人类期望不符的现象,被称为LVLMs的不当行为,对关键应用中的部署提出了严重关切。这些不当行为被发现源自于认识不确定性,具体来说是内部知识冲突或缺乏支持信息。然而,现有的不确定性量化方法通常只能捕捉整体认识不确定性,对于识别此类问题效果有限。为解决这一差距,我们提出了一种细粒度的方法——证据不确定性量化(EUQ),该方法能够同时捕捉信息冲突和无知,从而有效检测LVLM的不当行为。特别是,我们将模型输出头的特征解释为支持(正面)或反对(负面)证据。利用证据理论,我们建模并聚合这些证据,在单次前向传播中量化内部冲突和知识空白。我们使用最先进的LVLMs在四个类别(幻觉、脱逃、对抗性漏洞和分布外失败)的不当行为上进行了广泛评估,发现EUQ始终优于强基线,表明幻觉对应于高内部冲突,而分布外失败对应于高无知。此外,逐层证据不确定性动态分析有助于从新视角解释内部表示的演变。源代码可在https://github.com/HT86159/EUQ获取。
Summary / 总结
The paper addresses the issue of misbehaviors in large vision-language models (LVLMs) by proposing Evidential Uncertainty Quantification (EUQ), a method that captures both information conflict and ignorance. EUQ evaluates the model's output by interpreting features as either supporting or opposing evidence and uses Evidence Theory to quantify these aspects. The method is evaluated across four categories of misbehavior and outperforms strong baselines, indicating that hallucinations are associated with high internal conflict and out-of-distribution failures with high ignorance. Layer-wise analysis further helps interpret the model's internal dynamics.
本文提出了一种证据不确定性量化(EUQ)方法,用于检测大型视觉-语言模型(LVLM)的不当行为,该方法捕捉信息冲突和无知。该方法将模型输出特征解释为支持或反对的证据,并使用证据理论来量化内部冲突和知识空白。EUQ在检测幻觉、脱逃、对抗性漏洞和分布外失败等行为方面优于强基线,表明幻觉与高内部冲突相关,而分布外失败与高无知相关。逐层的证据不确定性动态分析有助于从新视角解释内部表示的演变。
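The distinction between conflict and ignorance that EUQ relies on comes from Evidence Theory (Dempster-Shafer theory). The sketch below applies Dempster's rule of combination to two mass functions over a two-element frame to show how the two quantities differ. How EUQ derives evidence from output-head features is not described in the abstract, so the mass values here are hand-picked and the dictionary keys `pos`, `neg`, and `theta` are hypothetical names.

```python
def combine(m1, m2):
    """Dempster's rule on the frame {pos, neg}.

    Each mass function assigns belief to 'pos', to 'neg', and to 'theta'
    (the whole frame, i.e. uncommitted mass representing ignorance).
    Returns the fused mass function and the conflict K between the sources.
    """
    conflict = m1["pos"] * m2["neg"] + m1["neg"] * m2["pos"]
    if conflict >= 1.0:
        raise ValueError("total conflict: the sources cannot be combined")
    norm = 1.0 - conflict
    fused = {
        "pos": (m1["pos"] * m2["pos"] + m1["pos"] * m2["theta"] + m1["theta"] * m2["pos"]) / norm,
        "neg": (m1["neg"] * m2["neg"] + m1["neg"] * m2["theta"] + m1["theta"] * m2["neg"]) / norm,
        "theta": m1["theta"] * m2["theta"] / norm,
    }
    return fused, conflict


# Two confident sources that disagree: high conflict, little ignorance.
fused_c, conflict_c = combine(
    {"pos": 0.8, "neg": 0.1, "theta": 0.1},
    {"pos": 0.1, "neg": 0.8, "theta": 0.1},
)

# Two sources that both know little: low conflict, high ignorance.
fused_i, conflict_i = combine(
    {"pos": 0.1, "neg": 0.1, "theta": 0.8},
    {"pos": 0.1, "neg": 0.1, "theta": 0.8},
)

print(conflict_c, fused_c["theta"])
print(conflict_i, fused_i["theta"])
```

The first pair yields a conflict of 0.65 with almost no mass left on the whole frame, while the second yields a conflict of 0.02 with most of the mass still uncommitted. A single scalar uncertainty score would struggle to separate these two cases, which is the gap the abstract attributes to existing methods.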
SSG: Scaled Spatial Guidance for Multi-Scale Visual Autoregressive Generation
Authors: Youngwoo Shin, Jiwan Hur, Junmo Kim
Venue: ICLR 2026
First: 2026-02-05T10:48:58+00:00 · Latest: 2026-02-05T10:48:58+00:00
Comments: Accepted to ICLR 2026
Abstract
Visual autoregressive (VAR) models generate images through next-scale prediction, naturally achieving coarse-to-fine, fast, high-fidelity synthesis mirroring human perception. In practice, this hierarchy can drift at inference time, as limited capacity and accumulated error cause the model to deviate from its coarse-to-fine nature. We revisit this limitation from an information-theoretic perspective and deduce that ensuring each scale contributes high-frequency content not explained by earlier scales mitigates the train-inference discrepancy. With this insight, we propose Scaled Spatial Guidance (SSG), training-free, inference-time guidance that steers generation toward the intended hierarchy while maintaining global coherence. SSG emphasizes target high-frequency signals, defined as the semantic residual, isolated from a coarser prior. To obtain this prior, we leverage a principled frequency-domain procedure, Discrete Spatial Enhancement (DSE), which is devised to sharpen and better isolate the semantic residual through frequency-aware construction. SSG applies broadly across VAR models leveraging discrete visual tokens, regardless of tokenization design or conditioning modality. Experiments demonstrate SSG yields consistent gains in fidelity and diversity while preserving low latency, revealing untapped efficiency in coarse-to-fine image generation. Code is available at https://github.com/Youngwoo-git/SSG.
中文标题/摘要
标题:SSG:多尺度视觉自回归生成的缩放空间指导
视觉自回归(VAR)模型通过下一级预测生成图像,自然实现从粗到细、快速、高保真的合成,模拟人类感知。实践中,这种层次结构在推理时可能会漂移,由于容量有限和累积误差,模型会偏离从粗到细的性质。我们从信息论的角度重新审视这一限制,并推断确保每个尺度贡献未被早期尺度解释的高频内容可以缓解训练与推理之间的差异。基于这一洞察,我们提出了缩放空间指导(SSG),一种无需训练的推理时指导,引导生成向预期的层次结构发展,同时保持全局一致性。SSG 强调目标高频信号,定义为语义残差,从较粗的先验中隔离出来。为了获得这一先验,我们利用了一个基于频率域的原理性过程,离散空间增强(DSE),旨在通过频率感知构建更好地突出和隔离语义残差。SSG 在利用离散视觉标记的 VAR 模型中广泛适用,无论标记设计或条件模态如何。实验表明,SSG 在保真度和多样性方面提供了持续的改进,同时保持低延迟,揭示了粗到细图像生成中的未开发效率。代码可在 https://github.com/Youngwoo-git/SSG/ 获取。
Summary / 总结
The research aims to address the drift issue in visual autoregressive (VAR) models during inference, which can lead to a mismatch between training and inference. Scaled Spatial Guidance (SSG) is proposed as a training-free method that guides the generation process to maintain the intended coarse-to-fine hierarchy, ensuring high-frequency content is accurately captured at each scale. Experiments show that SSG improves image fidelity and diversity without increasing latency, highlighting its efficiency in coarse-to-fine image generation.
研究解决了视觉自回归模型在推理过程中层级漂移的问题,提出了一种训练免费的方法——缩放空间引导(SSG),确保每一尺度贡献独特的高频内容。SSG 在推理时应用,引导生成保持预期的层级结构,同时保持全局一致性。实验表明,SSG 在提高图像保真度和多样性的同时,不会增加延迟,突显了其在自上而下图像生成中的高效性。
RefAM: Attention Magnets for Zero-Shot Referral Segmentation
Authors: Anna Kukleva, Enis Simsar, Alessio Tonioni, Muhammad Ferjad Naeem, Federico Tombari, Jan Eric Lenssen, Bernt Schiele
First: 2025-09-26T17:59:57+00:00 · Latest: 2026-02-05T10:20:31+00:00
Comments: Project Page: https://refam-diffusion.github.io/
Abstract
Most existing approaches to referring segmentation achieve strong performance only through fine-tuning or by composing multiple pre-trained models, often at the cost of additional training and architectural modifications. Meanwhile, large-scale generative diffusion models encode rich semantic information, making them attractive as general-purpose feature extractors. In this work, we introduce a new method that directly exploits features, namely attention scores, from diffusion transformers for downstream tasks, requiring neither architectural modifications nor additional training. To systematically evaluate these features, we extend benchmarks with vision-language grounding tasks spanning both images and videos. Our key insight is that stop words act as attention magnets: they accumulate surplus attention and can be filtered to reduce noise. Moreover, we identify global attention sinks (GAS) emerging in deeper layers and show that they can be safely suppressed or redirected onto auxiliary tokens, leading to sharper and more accurate grounding maps. We further propose an attention redistribution strategy, where appended stop words partition background activations into smaller clusters, yielding sharper and more localized heatmaps. Building on these findings, we develop RefAM, a simple training-free grounding framework that combines cross-attention maps, GAS handling, and redistribution. Across zero-shot referring image and video segmentation benchmarks, our approach achieves strong performance and surpasses prior methods on most datasets, establishing a new state of the art without fine-tuning, additional components, or complex reasoning.
中文标题/摘要
标题:RefAM:用于零样本引用分割的注意力磁铁
现有的大多数引用分割方法仅通过微调或组合多个预训练模型才能实现较强的性能,通常需要额外的训练和架构修改。同时,大规模生成扩散模型编码丰富的语义信息,使其成为通用特征提取器的有吸引力的选择。在本文中,我们介绍了一种新方法,该方法直接利用扩散变换器的特征和注意力分数,用于下游任务,无需架构修改和额外训练。为了系统地评估这些特征,我们扩展了基准测试,涵盖了从图像到视频的视觉-语言定位任务。我们的关键见解是停用词充当注意力磁铁:它们积累多余的注意力,并可以过滤以减少噪声。此外,我们识别出在深层中出现的全局注意力陷阱(GAS),并表明它们可以安全地被抑制或重新定向到辅助标记,从而产生更清晰和更准确的定位图。我们还提出了一种注意力重新分配策略,其中附加的停用词将背景激活划分为更小的簇,产生更清晰和更局部化的热图。基于这些发现,我们开发了RefAM,这是一种简单的无需训练的定位框架,结合了交叉注意力图、GAS处理和重新分配。在零样本引用图像和视频分割基准测试中,我们的方法实现了强大的性能,并在大多数数据集上超过了先前的方法,无需微调、附加组件和复杂推理。
Summary / 总结
This work introduces RefAM, a training-free framework for zero-shot referring segmentation that uses cross-attention scores from diffusion transformers. It observes that stop words act as attention magnets: they absorb surplus attention and can be filtered out to reduce noise. It also identifies global attention sinks (GAS) in deeper layers, which can be suppressed or redirected onto auxiliary tokens, and appends stop words to split background activations into smaller clusters, yielding sharper heatmaps. On zero-shot referring image and video segmentation benchmarks, RefAM surpasses prior methods on most datasets without fine-tuning or additional components.
该研究提出了一种名为RefAM的无训练零样本引用分割方法,利用扩散变换器的注意力分数。它利用停用词作为注意力磁铁来过滤噪声,并识别深层中的全局注意力汇流点,可以将其重定向到辅助标记。该方法还采用了一种注意力重新分配策略,以生成更清晰的热图。实验表明,RefAM在零样本引用图像和视频分割基准测试中优于先前方法,无需进行微调或添加额外组件。
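As a rough illustration of the stop-word filtering idea in RefAM, the sketch below averages per-token attention maps while skipping stop-word tokens. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `grounding_map`, the flattened-heatmap input format, and the stop-word list are all hypothetical.

```python
# Illustrative stop-word list; the list used by the paper is not given in the abstract.
STOP_WORDS = {"a", "an", "the", "of", "on", "in", "with", "and", "to"}


def grounding_map(tokens, attn_maps, stop_words=STOP_WORDS):
    """Average per-token attention maps, skipping stop-word tokens.

    tokens:    words of the referring expression
    attn_maps: one flattened heatmap (list of floats) per token
    Returns a heatmap normalized to sum to 1 (when the sum is positive).
    """
    # Drop maps that belong to stop words, which tend to soak up surplus attention.
    kept = [m for t, m in zip(tokens, attn_maps) if t.lower() not in stop_words]
    if not kept:  # expression consisted only of stop words; fall back to all maps
        kept = list(attn_maps)
    n = len(kept)
    avg = [sum(vals) / n for vals in zip(*kept)]
    total = sum(avg)
    return [v / total for v in avg] if total > 0 else avg


# Usage: the uniform map attached to "the" is ignored, so the peak stays sharp.
heat = grounding_map(
    ["the", "dog"],
    [[0.25, 0.25, 0.25, 0.25], [0.0, 0.0, 1.0, 0.0]],
)
```

This sketch covers only the filtering step; the handling of global attention sinks and the redistribution through appended stop words are not modeled here.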
SDFP: Speculative Decoding with FIT-Pruned Models for Training-Free and Plug-and-Play LLM Acceleration
Authors: Hanyu Wei, Zunhai Su, Peng Lu, Chao Li, Spandan Tiwari, Ashish Sirasao, Yuhan Dong
First: 2026-02-05T10:02:00+00:00 · Latest: 2026-02-05T10:02:00+00:00
Abstract
Large language models (LLMs) underpin interactive multimedia applications such as captioning, retrieval, recommendation, and creative content generation, yet their autoregressive decoding incurs substantial latency. Speculative decoding reduces latency using a lightweight draft model, but deployment is often limited by the cost and complexity of acquiring, tuning, and maintaining an effective draft model. Recent approaches usually require auxiliary training or specialization, and even training-free methods incur costly search or optimization. We propose SDFP, a fully training-free and plug-and-play framework that builds the draft model via Fisher Information Trace (FIT)-based layer pruning of a given LLM. Using layer sensitivity as a proxy for output perturbation, SDFP removes low-impact layers to obtain a compact draft while preserving compatibility with the original model for standard speculative verification. SDFP needs no additional training, hyperparameter tuning, or separately maintained drafts, enabling rapid, deployment-friendly draft construction. Across benchmarks, SDFP delivers 1.32x-1.5x decoding speedup without altering the target model's output distribution, supporting low-latency multimedia applications.
中文标题/摘要
标题:SDFP:基于FIT剪枝的推测性解码以实现无需训练和即插即用的LLM加速
大型语言模型(LLMs)支撑着诸如字幕生成、检索、推荐和创意内容生成等交互式多媒体应用,但其自回归解码会带来显著的延迟。推测性解码通过使用轻量级草稿模型来减少延迟,但其部署往往受限于获取、调整和维护有效草稿模型的成本和复杂性。最近的方法通常需要辅助训练或专门化,即使无需训练的方法也会产生昂贵的搜索或优化成本。我们提出了一种完全无需训练且即插即用的框架SDFP,该框架通过给定的LLM的Fisher信息追踪(FIT)层剪枝来构建草稿模型。利用层灵敏度作为输出扰动的代理,SDFP移除低影响层以获得紧凑的草稿,同时保持与原始模型的兼容性,以进行标准的推测性验证。SDFP无需额外训练、超参数调整或单独维护的草稿,能够快速构建部署友好的草稿。在基准测试中,SDFP在不改变目标模型输出分布的情况下实现了1.32倍至1.5倍的解码加速,支持低延迟的多媒体应用。
Summary / 总结
SDFP is a training-free, plug-and-play framework for speculative decoding of large language models (LLMs). It builds the draft model by pruning low-impact layers of the target LLM, ranked by Fisher Information Trace (FIT), so no additional training, hyperparameter tuning, or separately maintained draft is needed. Experiments show a 1.32x to 1.5x decoding speedup without changing the target model's output distribution, supporting low-latency multimedia applications.
SDFP 是一个无需训练且即插即用的框架,通过使用 Fisher 信息迹 (FIT) 基础的层剪枝来创建一个轻量级的草稿模型,以加速大型语言模型 (LLMs)。该方法不需要额外的训练或超参数调整,可以快速且易于部署地构建草稿模型。SDFP 在不改变原始模型输出分布的情况下,提供了 1.32 到 1.5 倍的解码加速,支持低延迟的多媒体应用。
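The layer-selection step described for SDFP can be sketched as follows. This is a minimal sketch, assuming that a per-layer sensitivity score (for example, a FIT estimate) has already been computed; the function name `select_draft_layers` and the `num_drop` parameter are hypothetical and do not come from the paper.

```python
def select_draft_layers(sensitivity, num_drop):
    """Return the indices of layers kept in the draft model, in original order.

    sensitivity: one precomputed score per layer (higher = more impact on output)
    num_drop:    how many of the least sensitive layers to remove
    """
    if not 0 <= num_drop < len(sensitivity):
        raise ValueError("num_drop must leave at least one layer")
    # Rank layers from least to most sensitive; ties are broken by layer index.
    ranked = sorted(range(len(sensitivity)), key=lambda i: (sensitivity[i], i))
    dropped = set(ranked[:num_drop])
    return [i for i in range(len(sensitivity)) if i not in dropped]


# Usage: drop the two least sensitive layers of a five-layer model.
kept_layers = select_draft_layers([0.9, 0.1, 0.5, 0.05, 0.7], num_drop=2)
```

Because the draft is a subset of the target model's own layers, it shares the tokenizer and vocabulary, which is what lets it plug into standard speculative verification.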
Auto-Rubric: Learning From Implicit Weights to Explicit Rubrics for Reward Modeling
Authors: Lipeng Xie, Sen Huang, Zhuo Zhang, Anni Zou, Yunpeng Zhai, Dingchao Ren, Kezun Zhang, Haoyuan Hu, Boyin Liu, Haoran Chen, Zhaoyang Liu, Bolin Ding
First: 2025-10-20T09:01:37+00:00 · Latest: 2026-02-05T09:36:28+00:00
Abstract
Conventional reward modeling relies on gradient descent over neural weights, creating opaque, data-hungry "black boxes." We propose a paradigm shift from implicit to explicit reward parameterization, recasting optimization from continuous weight spaces to the discrete space of natural language rubrics. We introduce a training-free framework based on iterative rubric learning: it locally induces discriminative criteria via verification-driven refinement, and globally compresses the candidate criteria pool into a compact core set by maximizing an information-theoretic coding rate objective. We organize the compressed core set into a hierarchical rubric structure -- high-level evaluation dimensions supported by concrete verification checks -- serving as an interpretable, portable reward function. Empirically, our approach challenges prevailing data scaling assumptions: using only 70 preference pairs, our rubric-guided judges outperform fully trained reward models on diverse benchmarks. For instance, Qwen3-8B equipped with our learned rubrics achieves 80.91% on RewardBench2, surpassing the specialized Skywork-Reward-V2-Qwen3-8B (78.20%). These results demonstrate that alignment signals are highly compressible and can be effectively captured through explicit symbolic search.
中文标题/摘要
标题:自动评分标准:从隐式权重到显式评分标准的奖励建模学习
传统的奖励建模依赖于神经权重的梯度下降,创建出不透明的、数据饥渴的“黑箱”。我们提出了一种从隐式到显式的奖励参数化范式转变,将优化从连续的权重空间重新定义为自然语言评分标准的离散空间。我们引入了一种基于迭代评分标准学习的无训练框架:通过验证驱动的细化局部诱导判别标准,并通过最大化信息论编码率目标全局压缩候选标准池,形成一个紧凑的核心集。我们将压缩的核心集组织成一个分层的评分标准结构——高层次的评估维度由具体的验证检查支持,作为可解释的、可移植的奖励函数。实证上,我们的方法挑战了现有的数据规模假设:仅使用70对偏好对,我们的评分标准引导的评判者在多种基准上优于完全训练的奖励模型。例如,配备我们学习到的评分标准的Qwen3-8B在RewardBench2上达到了80.91%,超过了专门的Skywork-Reward-V2-Qwen3-8B(78.20%)。这些结果表明,对齐信号是高度可压缩的,并且可以通过显式的符号搜索有效地捕捉。
Summary / 总结
The paper proposes Auto-Rubric, a method that shifts reward modeling from implicit neural weights to explicit natural language rubrics. It uses a training-free iterative process to refine discriminative criteria and compress them into a compact core set, which is organized into a hierarchical structure. Experiments show that using only 70 preference pairs, the rubric-guided approach outperforms fully trained reward models on various benchmarks, such as achieving 80.91% on RewardBench2 compared to 78.20% for a specialized model. This indicates that alignment signals can be effectively captured through explicit symbolic search and are highly compressible.
论文提出了Auto-Rubric方法,将奖励建模从隐式的神经权重转向显式的自然语言评分标准。它使用无训练的迭代过程来细化判别标准并压缩成一个紧凑的核心集,然后组织成层次结构。实验表明,使用仅70个偏好对,Auto-Rubric在多种基准测试上优于完全训练的奖励模型,例如在RewardBench2上达到80.91%,而专门的模型仅为78.20%。这表明对齐信号可以通过显式的符号搜索有效捕获,并且高度可压缩。
RANGER: A Monocular Zero-Shot Semantic Navigation Framework through Contextual Adaptation
Authors: Ming-Ming Yu, Yi Chen, Börje F. Karlsson, Wenjun Wu
Venue: ICRA 2026
First: 2025-12-30T13:25:22+00:00 · Latest: 2026-02-05T09:33:50+00:00
Comments: Accepted at ICRA 2026
Abstract
Efficiently finding targets in complex environments is fundamental to real-world embodied applications. While recent advances in multimodal foundation models have enabled zero-shot object goal navigation, allowing robots to search for arbitrary objects without fine-tuning, existing methods face two key limitations: (1) heavy reliance on precise depth and pose information provided by simulators, which restricts applicability in real-world scenarios; and (2) lack of in-context learning (ICL) capability, making it difficult to quickly adapt to new environments, as in leveraging short videos. To address these challenges, we propose RANGER, a novel zero-shot, open-vocabulary semantic navigation framework that operates using only a monocular camera. Leveraging powerful 3D foundation models, RANGER eliminates the dependency on depth and pose while exhibiting strong ICL capability. By simply observing a short video of a new environment, the system can also significantly improve task efficiency without requiring architectural modifications or fine-tuning. The framework integrates several key components: keyframe-based 3D reconstruction, semantic point cloud generation, vision-language model (VLM)-driven exploration value estimation, high-level adaptive waypoint selection, and low-level action execution. Experiments on the HM3D benchmark and real-world environments demonstrate that RANGER achieves competitive performance in terms of navigation success rate and exploration efficiency, while showing superior ICL adaptability, with no previous 3D mapping of the environment required.
中文标题/摘要
标题:RANGER:通过上下文适应的单目零样本语义导航框架
在复杂环境中高效地找到目标是现实世界体态应用的基础。虽然近期多模态基础模型的进步使得零样本物体目标导航成为可能,允许机器人搜索任意物体而无需微调,但现有方法面临两个关键限制:(1)对模拟器提供的精确深度和姿态信息的高度依赖,这限制了其在现实世界场景中的应用;(2)缺乏上下文学习(ICL)能力,使得难以快速适应新环境,如利用短视频。为解决这些挑战,我们提出RANGER,一种新颖的零样本、开放式词汇语义导航框架,仅使用单目相机运行。利用强大的3D基础模型,RANGER消除了对深度和姿态的依赖,同时展示了强大的ICL能力。通过简单观察新环境的短视频,系统也可以显著提高任务效率,无需进行架构修改或微调。该框架整合了几个关键组件:基于关键帧的3D重建、语义点云生成、基于视觉-语言模型(VLM)的探索价值估计、高层自适应航点选择和低层动作执行。在HM3D基准测试和真实世界环境中进行的实验表明,RANGER在导航成功率和探索效率方面表现出竞争力,同时展示了优越的ICL适应性,无需事先对环境进行3D建模。
Summary / 总结
RANGER is a zero-shot semantic navigation framework that uses only a monocular camera to navigate complex environments without the need for precise depth and pose information. It leverages 3D foundation models and in-context learning to adapt quickly to new environments. Experiments show that RANGER performs competitively in navigation success rate and exploration efficiency, and demonstrates superior adaptability through short video observations without requiring previous 3D mapping.
RANGER 是一种仅使用单目相机的零样本语义导航框架,旨在高效地在复杂环境中找到目标。它通过消除对精确深度和姿态信息的需求,并结合上下文学习能力来解决现有方法的限制。实验表明,RANGER 在导航成功率和探索效率方面表现出色,并且无需进行架构修改或微调即可快速适应新环境。
MerNav: A Highly Generalizable Memory-Execute-Review Framework for Zero-Shot Object Goal Navigation
Authors: Dekang Qi, Shuang Zeng, Xinyuan Chang, Feng Xiong, Shichao Xie, Xiaolong Wu, Mu Xu
First: 2026-02-05T09:15:34+00:00 · Latest: 2026-02-05T09:15:34+00:00
Comments: 9 pages, 2 figures, 5 tables, conference
Abstract
Visual Language Navigation (VLN) is one of the fundamental capabilities for embodied intelligence and a critical challenge that urgently needs to be addressed. However, existing methods are still unsatisfactory in terms of both success rate (SR) and generalization: Supervised Fine-Tuning (SFT) approaches typically achieve higher SR, while Training-Free (TF) approaches often generalize better, but it is difficult to obtain both simultaneously. To this end, we propose a Memory-Execute-Review framework. It consists of three parts: a hierarchical memory module for providing information support, an execute module for routine decision-making and actions, and a review module for handling abnormal situations and correcting behavior. We validated the effectiveness of this framework on the Object Goal Navigation task. Across 4 datasets, our average SR achieved absolute improvements of 7% and 5% compared to all baseline methods under TF and Zero-Shot (ZS) settings, respectively. On the most commonly used HM3D_v0.1 and the more challenging open vocabulary dataset HM3D_OVON, the SR improved by 8% and 6%, under ZS settings. Furthermore, on the MP3D and HM3D_OVON datasets, our method not only outperformed all TF methods but also surpassed all SFT methods, achieving comprehensive leadership in both SR (5% and 2%) and generalization.
中文标题/摘要
标题:MerNav:一种高度通用的记忆-执行-回顾框架用于零样本物体目标导航
视觉语言导航(VLN)是体现智能的基本能力之一,也是亟待解决的关键挑战。然而,现有方法在成功率(SR)和泛化能力方面仍然不尽如人意:监督微调(SFT)方法通常能获得更高的SR,而无需训练(TF)方法往往能更好地泛化,但两者难以同时兼得。为此,我们提出了一种记忆-执行-回顾框架。该框架由三个部分组成:层次化记忆模块提供信息支持,执行模块进行常规决策和操作,回顾模块处理异常情况并纠正行为。我们在物体目标导航任务上验证了该框架的有效性。在4个数据集上,我们的平均SR在无需训练(TF)和零样本(ZS)设置下分别比所有基线方法提高了7%和5%。在最常用的HM3D_v0.1和更具挑战性的开放词汇数据集HM3D_OVON上,ZS设置下的SR分别提高了8%和6%。此外,在MP3D和HM3D_OVON数据集上,我们的方法不仅优于所有无需训练方法,还超越了所有监督微调方法,在SR(5%和2%)和泛化能力上实现了全面领先。
Summary / 总结
The paper introduces MerNav, a Memory-Execute-Review framework for zero-shot object goal navigation. It combines a hierarchical memory module for information support, an execute module for routine decision-making and actions, and a review module for handling abnormal situations and correcting behavior. Across four datasets, MerNav's average success rate achieves absolute improvements of 7% and 5% over all baseline methods in the training-free and zero-shot settings, respectively. On MP3D and HM3D_OVON, it outperforms all training-free methods and also surpasses all supervised fine-tuning methods.
该研究提出了MerNav框架,旨在提高VLN任务中的零样本物体目标导航性能。该框架结合了层次记忆模块、执行模块和回顾模块,以提升成功率和泛化能力。在四个数据集上的实验表明,MerNav在无需训练和零样本设置下的平均成功率相对所有基线方法分别取得了7%和5%的绝对提升。在MP3D和HM3D_OVON数据集上,MerNav不仅超越了所有无需训练方法,还超过了所有监督微调方法,展示了卓越的性能和泛化能力。
See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning
Authors: Shuoshuo Zhang, Yizhen Zhang, Jingjing Fu, Lei Song, Jiang Bian, Yujiu Yang, Rui Wang
First: 2025-12-26T18:59:47+00:00 · Latest: 2026-02-05T08:49:21+00:00
Abstract
Large vision-language models (VLMs) often benefit from intermediate visual cues, either injected via external tools or generated as latent visual tokens during reasoning, but these mechanisms still overlook fine-grained visual evidence (e.g., polylines in charts), generalize poorly across domains, and incur high inference-time cost. In this paper, we propose Bi-directional Perceptual Shaping (BiPS), which transforms question-conditioned masked views into bidirectional where-to-look signals that shape perception during training. BiPS first applies a KL-consistency constraint between the original image and an evidence-preserving view that keeps only question-relevant regions, encouraging coarse but complete coverage of supporting pixels. It then applies a KL-separation constraint between the original and an evidence-ablated view where critical pixels are masked so the image no longer supports the original answer, discouraging text-only shortcuts (i.e., answering from text alone) and enforcing fine-grained visual reliance. Across eight benchmarks, BiPS boosts Qwen2.5-VL-7B by 8.2% on average and shows strong out-of-domain generalization to unseen datasets and image types.
中文标题/摘要
标题:少看,精准看:双向感知塑造多模态推理
大型视觉-语言模型(VLMs)通常从中间视觉提示中受益,这些提示要么通过外部工具注入,要么在推理过程中作为潜在视觉标记生成,但这些机制仍然忽略了细微的视觉证据(例如图表中的多段线),在不同领域泛化能力差,并且在推理时成本高。在本文中,我们提出了双向感知塑造(BiPS),它将问题条件下的遮蔽视图转换为双向的看哪里信号,在训练过程中塑造感知。BiPS 首先在原始图像和保留仅与问题相关区域的证据保留视图之间施加KL一致性约束,鼓励粗略但完整的支持像素覆盖。然后在原始图像和关键像素被遮蔽的证据消除视图之间施加KL分离约束,使得图像不再支持原始答案,从而避免仅从文本回答(即,仅从文本回答)并强制执行细微的视觉依赖。在八个基准测试中,BiPS 将 Qwen2.5-VL-7B 的性能平均提升 8.2%,并在未见过的数据集和图像类型上展示了强大的跨域泛化能力。
Summary / 总结
This paper addresses the limitations of existing vision-language models in utilizing fine-grained visual evidence and their poor generalization across domains. It introduces Bi-directional Perceptual Shaping (BiPS), which transforms question-conditioned masked views into bidirectional where-to-look signals during training. BiPS enhances the model's reliance on fine-grained visual evidence and discourages text-only shortcuts. Across eight benchmarks, BiPS improves Qwen2.5-VL-7B by 8.2% on average and demonstrates strong out-of-domain generalization.
本文针对现有视觉-语言模型在利用细粒度视觉证据方面的局限性和跨域泛化能力差的问题,提出了双向感知塑造(BiPS)方法。BiPS在训练过程中将问题条件下的遮罩视图转换为双向的注视信号,增强模型对细粒度视觉证据的依赖,并避免纯文本捷径。在八个基准测试中,BiPS将Qwen2.5-VL-7B的性能平均提升8.2%,并展示了强大的跨域泛化能力。
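The two BiPS constraints can be pictured with a small loss sketch. The abstract does not give the exact form of either term, so everything below is an assumption made for illustration: the KL direction, the hinge with a `margin` for the separation term, the `weight` parameter, and the function names.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def bips_style_loss(p_orig, p_preserved, p_ablated, margin=1.0, weight=1.0):
    """Hypothetical combination of a consistency term and a separation term.

    p_orig:      answer distribution given the original image
    p_preserved: answer distribution given the evidence-preserving view
    p_ablated:   answer distribution given the evidence-ablated view
    """
    # Consistency: predictions should match when only relevant regions are kept.
    consistency = kl(p_orig, p_preserved)
    # Separation (assumed hinge form): predictions should differ once evidence is masked.
    separation = max(0.0, margin - kl(p_orig, p_ablated))
    return consistency + weight * separation


# Usage: a model that ignores the masking (same answer either way) is penalized.
shortcut_loss = bips_style_loss([0.9, 0.1], [0.9, 0.1], [0.9, 0.1])
grounded_loss = bips_style_loss([0.9, 0.1], [0.9, 0.1], [0.1, 0.9])
```

In this sketch the shortcut case pays the full margin, while the grounded case, whose prediction flips when the evidence is removed, incurs no penalty.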
DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching
Authors: Chang Zou, Changlin Li, Yang Li, Patrol Li, Jianbing Wu, Xiao He, Songtao Liu, Zhao Zhong, Kailin Huang, Linfeng Zhang
First: 2026-02-05T08:45:08+00:00 · Latest: 2026-02-05T08:45:08+00:00
Comments: 17 pages, 7 figures; cvpr2026 submission
Abstract
While diffusion models have achieved great success in the field of video generation, this progress is accompanied by a rapidly escalating computational burden. Among the existing acceleration methods, Feature Caching is popular due to its training-free property and considerable speedup performance, but it inevitably faces semantic and detail drop with further compression. Another widely adopted method, training-aware step-distillation, though successful in image generation, also faces drastic degradation in video generation when only a few steps are used. Furthermore, the quality loss becomes more severe when simply applying training-free feature caching to the step-distilled models, due to the sparser sampling steps. This paper introduces, for the first time, a distillation-compatible learnable feature caching mechanism. We employ a lightweight learnable neural predictor instead of traditional training-free heuristics for diffusion models, enabling a more accurate capture of the high-dimensional feature evolution process. Furthermore, we explore the challenges of highly compressed distillation on large-scale video models and propose a conservative Restricted MeanFlow approach to achieve more stable and lossless distillation. By undertaking these initiatives, we further push the acceleration boundaries to $11.8\times$ while preserving generation quality. Extensive experiments demonstrate the effectiveness of our method. The code is in the supplementary materials and will be publicly available.
中文标题/摘要
标题:DisCa: 加速视频扩散变换器的蒸馏兼容可学习特征缓存
尽管扩散模型在视频生成领域取得了巨大成功,但这一进展伴随着计算负担的急剧增加。现有的加速方法中,特征缓存因其无需训练的特性及显著的加速性能而广受欢迎,但进一步压缩时不可避免地会面临语义和细节的丢失。另一种广泛应用的方法,训练感知的步骤蒸馏,在图像生成中取得了成功,但在视频生成中却面临严重的性能下降,且仅应用无需训练的特征缓存到步骤蒸馏模型时,质量损失更为严重,因为采样步骤更为稀疏。本文首次引入了蒸馏兼容的可学习特征缓存机制。我们采用轻量级的可学习神经预测器代替传统的无需训练的启发式方法,能够更准确地捕捉高维特征演化过程。此外,我们探讨了高度压缩蒸馏在大规模视频模型中的挑战,并提出了一种保守的受限均值流方法,以实现更稳定和无损的蒸馏。通过这些努力,我们在保持生成质量的同时将加速边界进一步推至$11.8\times$。大量实验表明了我们方法的有效性。代码附在补充材料中,并将公开。
Summary / 总结
This paper addresses the computational cost of video generation with diffusion models by introducing a distillation-compatible learnable feature caching mechanism. The method replaces training-free caching heuristics with a lightweight learnable predictor that captures the high-dimensional feature evolution process more accurately, and proposes a conservative Restricted MeanFlow approach for more stable and lossless distillation. Together these reach an $11.8\times$ speedup while preserving generation quality, as demonstrated by extensive experiments.
该论文通过引入一种新型的可蒸馏学习特征缓存机制,解决了使用扩散模型进行视频生成时的计算挑战。该方法使用轻量级的学习神经预测器来准确捕捉特征演变过程,并提出了一种保守的限制均值流方法以实现更稳定的蒸馏。实验表明,所提出的方法可以在保持质量的同时将视频生成加速11.8倍。
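The general shape of feature caching with a predictor, as described for DisCa, can be sketched as a sampling loop that runs the expensive block only on some steps. This is a minimal sketch under stated assumptions: `run_with_cache`, `interval`, `full_block`, and `predictor` are hypothetical names, and the predictor is a plain callable standing in for the learned network.

```python
def run_with_cache(num_steps, interval, full_block, predictor):
    """Run a sampling loop that refreshes cached features every `interval` steps.

    full_block: expensive callable, step -> features
    predictor:  cheap callable, (cached_features, step) -> approximate features
    Returns the per-step features and the number of expensive calls made.
    """
    cache, outputs, full_calls = None, [], 0
    for step in range(num_steps):
        if cache is None or step % interval == 0:
            cache = full_block(step)  # full computation; refresh the cache
            full_calls += 1
            outputs.append(cache)
        else:
            outputs.append(predictor(cache, step))  # cheap approximation from the cache
    return outputs, full_calls


# Usage with toy stand-ins: 6 steps, full computation on steps 0 and 3 only.
features, calls = run_with_cache(
    num_steps=6,
    interval=3,
    full_block=lambda s: float(s),
    predictor=lambda cached, s: cached + 0.5,
)
```

A training-free heuristic would correspond to a fixed `predictor` such as reusing the cached value unchanged; the paper's contribution, per the abstract, is to learn that predictor instead.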
LayoutCoT: Unleashing the Deep Reasoning Potential of Large Language Models for Layout Generation
Authors: Hengyu Shi, Junhao Su, Tianyang Han, Junfeng Luo, Jialin Gao
First: 2025-04-15T03:12:01+00:00 · Latest: 2026-02-05T07:47:50+00:00
Abstract
Conditional layout generation aims to automatically generate visually appealing and semantically coherent layouts from user-defined constraints. While recent methods based on generative models have shown promising results, they typically require substantial amounts of training data or extensive fine-tuning, limiting their versatility and practical applicability. Alternatively, some training-free approaches leveraging in-context learning with Large Language Models (LLMs) have emerged, but they often suffer from limited reasoning capabilities and overly simplistic ranking mechanisms, which restrict their ability to generate consistently high-quality layouts. To this end, we propose LayoutCoT, a novel approach that leverages the reasoning capabilities of LLMs through a combination of Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) techniques. Specifically, LayoutCoT transforms layout representations into a standardized serialized format suitable for processing by LLMs. A Layout-aware RAG is used to facilitate effective retrieval and generate a coarse layout by LLMs. This preliminary layout, together with the selected exemplars, is then fed into a specially designed CoT reasoning module for iterative refinement, significantly enhancing both semantic coherence and visual quality. We conduct extensive experiments on five public datasets spanning three conditional layout generation tasks. Experimental results demonstrate that LayoutCoT achieves state-of-the-art performance without requiring training or fine-tuning. Notably, our CoT reasoning module enables standard LLMs, even those without explicit deep reasoning abilities, to outperform specialized deep-reasoning models such as deepseek-R1, highlighting the potential of our approach in unleashing the deep reasoning capabilities of LLMs for layout generation tasks.
中文标题/摘要
标题:LayoutCoT:利用大型语言模型的深度推理潜力进行布局生成
条件布局生成旨在从用户定义的约束条件自动生成视觉上吸引人且语义上连贯的布局。虽然基于生成模型的近期方法显示出有希望的结果,但它们通常需要大量的训练数据或广泛的微调,限制了它们的灵活性和实际应用性。相反,一些无需训练的方法利用大型语言模型(LLMs)的上下文学习也出现了,但它们往往推理能力有限,排名机制过于简单,限制了它们生成高质量布局的能力。为此,我们提出了一种名为LayoutCoT的新方法,该方法通过检索增强生成(RAG)和链式思考(CoT)技术结合利用LLMs的推理能力。具体而言,LayoutCoT将布局表示转换为适合LLMs处理的标准序列化格式。使用布局感知RAG来促进有效的检索并生成粗略布局。然后,将初步布局与选定的示例一起输入特别设计的CoT推理模块进行迭代细化,显著提高了语义连贯性和视觉质量。我们在五个公共数据集上进行了广泛的实验,涵盖了三种条件布局生成任务。实验结果表明,LayoutCoT在无需训练或微调的情况下达到了最先进的性能。值得注意的是,我们的CoT推理模块使标准LLMs,即使它们没有明确的深度推理能力,也能超越专门的深度推理模型(如deepseek-R1),突显了我们方法在利用LLMs的深度推理能力进行布局生成任务方面的潜力。
Summary / 总结
LayoutCoT is a training-free method for conditional layout generation that draws on the reasoning capabilities of Large Language Models (LLMs) by combining Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) techniques. It converts layout representations into a standardized serialized format, uses a Layout-aware RAG to retrieve exemplars and generate a coarse layout, and then refines that layout iteratively in a CoT reasoning module. Experiments on five public datasets spanning three conditional layout generation tasks show that LayoutCoT achieves state-of-the-art performance without training or fine-tuning, and that standard LLMs equipped with the CoT module can outperform specialized deep-reasoning models such as deepseek-R1.
LayoutCoT 是一种通过结合检索增强生成(RAG)和链式思考(CoT)技术来增强大型语言模型(LLMs)在布局生成中的推理能力的方法。它将布局表示转换为标准化格式,并使用布局感知的 RAG 生成粗略布局,然后通过 CoT 推理模块进行迭代细化。在五个公共数据集上的广泛实验表明,LayoutCoT 在无需训练或微调的情况下实现了最先进的性能,超越了专门的深度推理模型。