Story-Iter: A Training-free Iterative Paradigm for Long Story Visualization
Authors: Jiawei Mao, Xiaoke Huang, Yunfei Xie, Yuanqi Chang, Mude Hui, Bingjie Xu, Zeyu Zheng, Zirui Wang, Cihang Xie, Yuyin Zhou
First: 2024-10-08T17:59:30+00:00 · Latest: 2026-02-10T18:53:34+00:00
Comments: 31 pages, 33 figures. The project page and associated code can be accessed via https://jwmao1.github.io/storyiter/
Abstract
This paper introduces Story-Iter, a new training-free iterative paradigm to enhance long-story generation. Unlike existing methods that rely on fixed reference images to construct a complete story, our approach features a novel external iterative paradigm, extending beyond the internal iterative denoising steps of diffusion models, to continuously refine each generated image by incorporating all reference images from the previous round. To achieve this, we propose a plug-and-play, training-free global reference cross-attention (GRCA) module, modeling all reference frames with global embeddings, ensuring semantic consistency in long sequences. By progressively incorporating holistic visual context and text constraints, our iterative paradigm enables precise generation with fine-grained interactions, optimizing the story visualization step-by-step. Extensive experiments on the official story visualization dataset and our long story benchmark demonstrate that Story-Iter achieves state-of-the-art performance in long-story visualization (up to 100 frames), excelling in both semantic consistency and fine-grained interactions.
中文标题/摘要
标题:Story-Iter:一种无需训练的迭代范式以增强长故事生成
本文介绍了Story-Iter,一种新的无需训练的迭代范式,用于增强长故事生成。与现有方法依赖固定参考图像构建完整故事不同,我们的方法采用了一种新颖的外部迭代范式,超越了扩散模型内部的去噪迭代步骤,通过结合上一轮所有参考图像来不断细化生成的每一幅图像。为此,我们提出了一种即插即用、无需训练的全局参考交叉注意力(GRCA)模块,使用全局嵌入表示所有参考帧,确保长序列中的语义一致性。通过逐步引入整体视觉上下文和文本约束,我们的迭代范式能够实现精确生成和精细交互,逐步优化故事可视化。在官方故事可视化数据集和我们的长故事基准中的广泛实验表明,Story-Iter在长故事可视化(多达100帧)方面的性能在语义一致性和精细交互方面均表现出色。
Summary / 总结
Story-Iter is a training-free iterative paradigm designed to improve long-story generation. Unlike previous methods that use fixed reference images, Story-Iter refines each generated image by incorporating all reference images from the previous round, which helps keep long sequences semantically consistent. It introduces a plug-and-play global reference cross-attention (GRCA) module that models all reference frames with global embeddings. Experiments on the official story visualization dataset and the authors' long story benchmark show that Story-Iter outperforms existing methods in semantic consistency and fine-grained interactions for stories of up to 100 frames.
Story-Iter 是一种无需训练的迭代框架,旨在提升长故事生成的效果。不同于以往依赖固定参考图像的方法,Story-Iter 在每次生成图像时都会结合前一轮的所有参考图像,确保语义一致性。它引入了一个全局参考交叉注意力模块来建模所有参考帧,增强细粒度交互和精确生成。实验表明,Story-Iter 在长故事可视化(最多100帧)方面优于现有方法,在语义一致性和细粒度交互方面表现出色。
Anagent For Enhancing Scientific Table & Figure Analysis
Authors: Xuehang Guo, Zhiyong Lu, Tom Hope, Qingyun Wang
First: 2026-02-10T18:46:28+00:00 · Latest: 2026-02-10T18:46:28+00:00
Abstract
In scientific research, analysis requires accurately interpreting complex multimodal knowledge, integrating evidence from different sources, and drawing inferences grounded in domain-specific knowledge. However, current artificial intelligence (AI) systems struggle to consistently demonstrate such capabilities. The complexity and variability of scientific tables and figures, combined with heterogeneous structures and long-context requirements, pose fundamental obstacles to scientific table & figure analysis. To quantify these challenges, we introduce AnaBench, a large-scale benchmark featuring 63,178 instances from nine scientific domains, systematically categorized along seven complexity dimensions. To tackle these challenges, we propose Anagent, a multi-agent framework for enhanced scientific table & figure analysis through four specialized agents: Planner decomposes tasks into actionable subtasks, Expert retrieves task-specific information through targeted tool execution, Solver synthesizes information to generate coherent analysis, and Critic performs iterative refinement through five-dimensional quality assessment. We further develop modular training strategies that leverage supervised finetuning and specialized reinforcement learning to optimize individual capabilities while maintaining effective collaboration. Comprehensive evaluation across 170 subdomains demonstrates that Anagent achieves substantial improvements, up to 13.43% in training-free settings and 42.12% with finetuning, while revealing that task-oriented reasoning and context-aware problem-solving are essential for high-quality scientific table & figure analysis. Our project page: https://xhguo7.github.io/Anagent/.
中文标题/摘要
标题:增强科学表格与图表分析的代理工具
在科学研究中,分析需要准确解读复杂的多模态知识,整合来自不同来源的证据,并基于领域特定知识得出推断。然而,当前的人工智能(AI)系统在持续展示这些能力方面存在困难。科学表格与图表的复杂性和变异性,以及异构结构和长上下文需求,构成了科学表格与图表分析的基本障碍。为了量化这些挑战,我们引入了AnaBench,这是一个包含来自九个科学领域的63,178个实例的大规模基准,系统地按七个复杂维度进行分类。为了应对这些挑战,我们提出了Anagent,这是一种多代理框架,通过四个专门的代理来增强科学表格与图表分析:规划者将任务分解为可执行的子任务,专家通过有针对性的工具执行检索特定任务信息,解决者综合信息生成连贯的分析,评论家通过五维质量评估进行迭代优化。我们进一步开发了模块化的训练策略,利用监督微调和专门的强化学习来优化个体能力,同时保持有效的协作。在170个子领域进行全面评估表明,Anagent实现了显著的改进,在无训练设置中最高可达13.43%的提升,在微调设置中可达42.12%,揭示了任务导向的推理和上下文感知的问题解决对于高质量的科学表格与图表分析至关重要。我们的项目页面:https://xhguo7.github.io/Anagent/.
Summary / 总结
The paper addresses the challenge of accurately analyzing complex scientific tables and figures, which current AI systems often fail to handle consistently. To overcome this, the authors introduce AnaBench, a benchmark with 63,178 instances, and propose Anagent, a multi-agent framework with four specialized agents: Planner, Expert, Solver, and Critic. Anagent shows significant improvements in analysis quality, up to 13.43% in training-free settings and 42.12% with finetuning, highlighting the importance of task-oriented reasoning and context-aware problem-solving in scientific analysis.
论文旨在解决当前AI系统在准确分析复杂科学表格和图表时经常无法一致处理的问题。为了量化这些挑战,作者引入了包含63,178个实例的AnaBench基准,这些实例来自九个科学领域。他们提出了一个由分解任务的Planner、检索信息的Expert、综合分析的Solver和迭代改进的Critic组成的多智能体框架。Anagent在无训练和有训练的设置下分别实现了高达13.43%和42.12%的分析质量提升,强调了任务导向的推理和上下文感知的问题解决对于高质量科学分析的重要性。
Chain of Mindset: Reasoning with Adaptive Cognitive Modes
Authors: Tianyi Jiang, Arctanx An, Hengyi Feng, Naixin Zhai, Haodong Li, Xiaomin Yu, Jiahui Liu, Hanwen Du, Shuo Zhang, Zhi Yang, Jie Huang, Yuhua Li, Yongxin Ni, Huacan Wang, Ronghao Chen
First: 2026-02-10T18:31:47+00:00 · Latest: 2026-02-10T18:31:47+00:00
Abstract
Human problem-solving is never the repetition of a single mindset, by which we mean a distinct mode of cognitive processing. When tackling a specific task, we do not rely on a single mindset; instead, we integrate multiple mindsets within the single solution process. However, existing LLM reasoning methods fall into a common trap: they apply the same fixed mindset across all steps, overlooking that different stages of solving the same problem require fundamentally different mindsets. This single-minded assumption prevents models from reaching the next level of intelligence. To address this limitation, we propose Chain of Mindset (CoM), a training-free agentic framework that enables step-level adaptive mindset orchestration. CoM decomposes reasoning into four functionally heterogeneous mindsets: Spatial, Convergent, Divergent, and Algorithmic. A Meta-Agent dynamically selects the optimal mindset based on the evolving reasoning state, while a bidirectional Context Gate filters cross-module information flow to maintain effectiveness and efficiency. Experiments across six challenging benchmarks spanning mathematics, code generation, scientific QA, and spatial reasoning demonstrate that CoM achieves state-of-the-art performance, outperforming the strongest baseline by 4.96% and 4.72% in overall accuracy on Qwen3-VL-32B-Instruct and Gemini-2.0-Flash, respectively, while balancing reasoning efficiency. Our code is publicly available at https://github.com/QuantaAlpha/chain-of-mindset.
中文标题/摘要
标题:思维链:适应性认知模式推理
人类解决问题绝非单一思维模式的重复,我们面对特定任务时,并非依赖单一思维模式,而是将多种思维模式整合到单一的解决方案过程中。然而,现有的大语言模型推理方法往往陷入一个常见陷阱:它们在所有步骤中都采用相同的固定思维模式,忽视了解决同一问题的不同阶段需要根本不同的思维模式。这种单一思维模式的假设阻碍了模型达到更高层次的智能。为解决这一局限,我们提出了一种无需训练的代理框架——思维链(CoM),该框架能够实现步骤级别的适应性思维模式编排。CoM 将推理分解为四个功能异质的思维模式:空间思维、收敛思维、发散思维和算法思维。一个元代理根据推理状态的演变动态选择最优思维模式,而双向上下文门控则过滤模块间的信息流,以保持有效性和效率。跨六个涵盖数学、代码生成、科学问答和空间推理的挑战性基准实验表明,CoM 达到了最先进的性能,在 Qwen3-VL-32B-Instruct 和 Gemini-2.0-Flash 上的整体准确率分别比最强基线高出 4.96% 和 4.72%,同时平衡了推理效率。我们的代码已公开发布于 https://github.com/QuantaAlpha/chain-of-mindset。
Summary / 总结
The research aims to improve the adaptability of large language models (LLMs) in problem-solving by addressing their tendency to use a single mindset throughout the reasoning process. The proposed Chain of Mindset (CoM) framework introduces a dynamic approach where a Meta-Agent selects from four distinct mindsets—Spatial, Convergent, Divergent, and Algorithmic—based on the evolving reasoning state. This method significantly outperforms existing baselines by 4.96% and 4.72% in overall accuracy on Qwen3-VL-32B-Instruct and Gemini-2.0-Flash, respectively, while maintaining efficiency across various benchmarks including mathematics, code generation, scientific QA, and spatial reasoning.
研究旨在通过解决大型语言模型(LLMs)在解决问题时使用单一思维模式的问题,提高其适应性。提出的Chain of Mindset(CoM)框架引入了一种动态方法,根据推理状态的变化,Meta-Agent 从四种不同的思维模式——空间、收敛、发散和算法中选择最优的一种。该方法在Qwen3-VL-32B-Instruct和Gemini-2.0-Flash上分别显著优于现有基线4.96%和4.72%,同时在数学、代码生成、科学问答和空间推理等多个基准测试中保持高效。
Fake-HR1: Rethinking reasoning of vision language model for synthetic image detection
Authors: Changjiang Jiang, Xinkuan Sha, Fengchang Yu, Jingjing Liu, Jian Liu, Mingqi Fang, Chenfeng Zhang, Wei Lu
Venue: ICASSP 2026
First: 2026-02-10T18:10:08+00:00 · Latest: 2026-02-10T18:10:08+00:00
Comments: Accepted by ICASSP 2026
Abstract
Recent studies have demonstrated that incorporating Chain-of-Thought (CoT) reasoning into the detection process can enhance a model's ability to detect synthetic images. However, excessively lengthy reasoning incurs substantial resource overhead, including token consumption and latency, which is particularly redundant when handling obviously generated forgeries. To address this issue, we propose Fake-HR1, a large-scale hybrid-reasoning model that, to the best of our knowledge, is the first to adaptively determine whether reasoning is necessary based on the characteristics of the generative detection task. To achieve this, we design a two-stage training framework: we first perform Hybrid Fine-Tuning (HFT) for cold-start initialization, followed by online reinforcement learning with Hybrid-Reasoning Grouped Policy Optimization (HGRPO) to implicitly learn when to select an appropriate reasoning mode. Experimental results show that Fake-HR1 adaptively performs reasoning across different types of queries, surpassing existing LLMs in both reasoning ability and generative detection performance, while significantly improving response efficiency.
中文标题/摘要
标题:Fake-HR1:重新思考视觉语言模型在合成图像检测中的推理方式
近期研究表明,在检测过程中引入链式思考(CoT)推理可以增强模型检测合成图像的能力。然而,过长的推理过程会带来显著的资源开销,包括令牌消耗和延迟,特别是在处理明显伪造的图像时尤为冗余。为了解决这一问题,我们提出了一种名为Fake-HR1的大规模混合推理模型,据我们所知,这是首个能够根据生成检测任务的特性自适应地决定是否需要推理的模型。为了实现这一点,我们设计了一个两阶段训练框架:首先进行混合微调(HFT)以进行冷启动初始化,然后通过混合推理组策略优化(HGRPO)进行在线强化学习,以隐式学习何时选择合适的推理模式。实验结果表明,Fake-HR1能够在不同类型的查询中自适应地进行推理,不仅在推理能力和生成检测性能上超越现有语言模型,还显著提高了响应效率。
Summary / 总结
The research aims to improve the ability of vision-language models to detect synthetic images by incorporating adaptive reasoning. The method involves a two-stage training framework: Hybrid Fine-Tuning for initialization and Hybrid-Reasoning Grouped Policy Optimization for learning when reasoning is needed. The key finding is that Fake-HR1 outperforms existing models in both reasoning ability and generative detection performance while enhancing response efficiency.
研究旨在通过引入适应性推理来提高视觉-语言模型检测合成图像的能力。方法包括两阶段训练框架:Hybrid Fine-Tuning进行初始设置,Hybrid-Reasoning Grouped Policy Optimization学习何时需要推理。关键发现是Fake-HR1在推理能力和生成检测性能上均优于现有模型,同时提高了响应效率。
Learning to Detect Baked Goods with Limited Supervision
Authors: Thomas H. Schmitt, Maximilian Bundscherer, Tobias Bocklet
First: 2026-02-10T17:06:36+00:00 · Latest: 2026-02-10T17:06:36+00:00
Abstract
Monitoring leftover products provides valuable insights that can be used to optimize future production. This is especially important for German bakeries because freshly baked goods have a very short shelf life. Automating this process can reduce labor costs, improve accuracy, and streamline operations. We propose automating this process using an object detection model to identify baked goods from images. However, the large diversity of German baked goods makes fully supervised training prohibitively expensive and limits scalability. Although open-vocabulary detectors (e.g., OWLv2, Grounding DINO) offer flexibility, we demonstrate that they are insufficient for our task. While motivated by bakeries, our work addresses the broader challenges of deploying computer vision in industry, where tasks are specialized and annotated datasets are scarce. We compile dataset splits with varying supervision levels, covering 19 classes of baked goods. We propose two training workflows to train an object detection model with limited supervision. First, we combine OWLv2 and Grounding DINO localization with image-level supervision to train the model in a weakly supervised manner. Second, we improve viewpoint robustness by fine-tuning on video frames annotated using Segment Anything 2 as a pseudo-label propagation model. Using these workflows, we train YOLOv11 for our detection task due to its favorable speed-accuracy tradeoff. Relying solely on image-level supervision, the model achieves a mean Average Precision (mAP) of 0.91. Fine-tuning with pseudo-labels raises model performance by 19.3% under non-ideal deployment conditions. Combining these workflows trains a model that surpasses our fully supervised baseline model under non-ideal deployment conditions, despite relying only on image-level supervision.
中文标题/摘要
标题:在有限监督下学习检测烘焙食品
监控剩余产品提供了宝贵的见解,可用于优化未来生产。这对德国烘焙店尤为重要,因为新鲜烘焙食品的保质期非常短。自动化此过程可以降低劳动力成本、提高准确性并简化操作。我们提议使用对象检测模型来从图像中识别烘焙食品以实现自动化。然而,德国烘焙食品的多样性很大,完全监督的训练成本高昂且限制了可扩展性。尽管开放词汇检测器(例如OWLv2、Grounding DINO)具有灵活性,但我们证明它们不足以完成我们的任务。虽然受烘焙店的启发,我们的工作解决了计算机视觉在工业中部署的更广泛挑战,其中任务专业化且标注数据集稀缺。我们编制了不同监督水平的数据集分割,涵盖了19类烘焙食品。我们提出了两种训练工作流来在有限监督下训练对象检测模型。首先,我们将OWLv2和Grounding DINO定位与图像级监督结合,以弱监督方式训练模型。其次,我们通过使用Segment Anything 2进行视频帧注释并作为伪标签传播模型来提高视角鲁棒性。使用这些工作流,我们由于YOLOv11具有有利的速度与准确性的权衡,因此使用它进行检测任务的训练。仅依赖图像级监督,模型的平均精度(mAP)达到0.91。使用伪标签进行微调,在非理想部署条件下提高了模型性能19.3%。结合这些工作流,在非理想部署条件下训练的模型超越了我们的完全监督基线模型,尽管仅依赖图像级监督。
Summary / 总结
The paper aims to automate the detection of leftover baked goods in German bakeries to optimize production and reduce labor costs. Because the diversity of baked goods makes fully supervised training too expensive, it proposes two workflows for training an object detection model (YOLOv11) with limited supervision across 19 classes. The first workflow combines OWLv2 and Grounding DINO localization with image-level supervision, achieving a mean Average Precision (mAP) of 0.91. The second workflow fine-tunes the model on video frames pseudo-labeled with Segment Anything 2, improving performance by 19.3% under non-ideal deployment conditions. Combining both workflows trains a model that outperforms the fully supervised baseline under non-ideal deployment conditions.
论文旨在通过图像自动检测烘焙产品来优化德国烘焙业的生产,考虑到劳动力成本和准确性的重要性。为应对由于烘焙产品的多样性导致的监督不足问题,作者提出了两种训练工作流。第一个工作流结合使用OWLv2和Grounding DINO与图像级标签进行弱监督,实现了平均精度(mAP)0.91。第二个工作流通过伪标签传播模型进行视角鲁棒性增强的微调,非理想条件下性能提高了19.3%。结合这两种工作流训练的模型在非理想条件下超过了完全监督的基线模型。
Residual Decoding: Mitigating Hallucinations in Large Vision-Language Models via History-Aware Residual Guidance
Authors: Xinrong Chen, Xu Chu, Yingmin Qiu, Hengyuan Zhang, Jing Xiong, Shiyu Tang, Shuai Liu, Shaokang Yang, Cheng Yang, Hayden Kwok-Hay So, Ngai Wong
First: 2026-02-01T06:12:05+00:00 · Latest: 2026-02-10T16:46:48+00:00
Abstract
Large Vision-Language Models (LVLMs) can reason effectively from image-text inputs and perform well in various multimodal tasks. Despite this success, they are affected by language priors and often produce hallucinations. Hallucinations denote generated content that is grammatically and syntactically coherent, yet bears no match or direct relevance to actual visual input. To address this problem, we propose Residual Decoding (ResDec). It is a novel training-free method that uses historical information to aid decoding. The method relies on the internal implicit reasoning mechanism and token logits evolution mechanism of LVLMs to correct biases. Extensive experiments demonstrate that ResDec effectively suppresses hallucinations induced by language priors, significantly improves visual grounding, and reduces object hallucinations. In addition to mitigating hallucinations, ResDec also performs exceptionally well on comprehensive LVLM benchmarks, highlighting its broad applicability.
中文标题/摘要
标题:残差解码:通过历史感知残差指导减轻大型视觉-语言模型中的幻觉
大型视觉-语言模型(LVLMs)能够有效从图像-文本输入中进行推理并在多种多模态任务中表现出色。尽管取得了这些成功,它们仍然受到语言先验的影响,经常产生幻觉。幻觉指的是语法和句法上连贯但与实际视觉输入毫无关联或直接相关性的生成内容。为了解决这一问题,我们提出了残差解码(ResDec)。这是一种无需训练的新方法,利用历史信息来辅助解码。该方法依赖于LVLMs内部的隐式推理机制和token概率演化机制来纠正偏差。大量实验表明,ResDec有效地抑制了由语言先验引起的幻觉,显著提高了视觉定位,并减少了物体幻觉。除了减轻幻觉外,ResDec在全面的LVLM基准测试中表现也非常出色,突显了其广泛的适用性
Summary / 总结
The paper proposes Residual Decoding (ResDec), a training-free method that uses historical information to correct biases and mitigate hallucinations in large vision-language models. Experiments show that ResDec effectively suppresses language-prior-induced hallucinations, improves visual grounding, and reduces object hallucinations, while also performing well on comprehensive LVLM benchmarks.
论文针对大型视觉-语言模型(LVLM)生成与视觉输入无关的幻觉问题,提出了一种名为残差解码(ResDec)的无训练方法,利用历史信息纠正偏差。实验表明,ResDec能有效减少幻觉,提高视觉定位能力,并在基准测试中表现出色。
SPARC: Separating Perception And Reasoning Circuits for Test-time Scaling of VLMs
Authors: Niccolo Avogaro, Nayanika Debnath, Li Mi, Thomas Frick, Junling Wang, Zexue He, Hang Hua, Konrad Schindler, Mattia Rigotti
First: 2026-02-06T10:05:25+00:00 · Latest: 2026-02-10T15:30:46+00:00
Abstract
Despite recent successes, test-time scaling (i.e., dynamically expanding the token budget during inference as needed) remains brittle for vision-language models (VLMs): unstructured chains-of-thought about images entangle perception and reasoning, leading to long, disorganized contexts where small perceptual mistakes may cascade into completely wrong answers. Moreover, expensive reinforcement learning with hand-crafted rewards is required to achieve good performance. Here, we introduce SPARC (Separating Perception And Reasoning Circuits), a modular framework that explicitly decouples visual perception from reasoning. Inspired by sequential sensory-to-cognitive processing in the brain, SPARC implements a two-stage pipeline where the model first performs explicit visual search to localize question-relevant regions, then conditions its reasoning on those regions to produce the final answer. This separation enables independent test-time scaling with asymmetric compute allocation (e.g., prioritizing perceptual processing under distribution shift), supports selective optimization (e.g., improving the perceptual stage alone when it is the bottleneck for end-to-end performance), and accommodates compressed contexts by running global search at lower image resolutions and allocating high-resolution processing only to selected regions, thereby reducing the total visual token count and compute. Across challenging visual reasoning benchmarks, SPARC outperforms monolithic baselines and strong visual-grounding approaches. For instance, SPARC improves the accuracy of Qwen3VL-4B on the V* VQA benchmark by 6.7 percentage points, and it surpasses "thinking with images" by 4.6 points on a challenging OOD task despite requiring a 200× lower token budget.
中文标题/摘要
标题:SPARC:分离感知和推理电路以实现VLMs测试时的扩展
尽管取得了近期的成功,测试时扩展——即在推理过程中根据需要动态扩展令牌预算——对于视觉语言模型(VLMs)仍然脆弱:关于图像的无序思维链将感知与推理纠缠在一起,导致长而杂乱的上下文,其中微小的感知错误可能会导致完全错误的答案。此外,为了获得良好的性能,需要昂贵的手工设计奖励的强化学习。在这里,我们提出了SPARC(分离感知和推理电路),这是一种模块化框架,明确地将视觉感知与推理分离。受大脑顺序感觉至认知处理的启发,SPARC 实现了一个两阶段流水线,其中模型首先进行显式的视觉搜索以定位与问题相关区域,然后基于这些区域进行推理以生成最终答案。这种分离使得可以独立地进行测试时扩展,并且可以不对称地分配计算资源(例如,在分布转移时优先考虑感知处理),支持选择性优化(例如,仅改进感知阶段以解决端到端性能瓶颈),并且可以通过在较低图像分辨率下运行全局搜索并在选定区域分配高分辨率处理来适应压缩的上下文,从而减少总的视觉令牌计数和计算量。在具有挑战性的视觉推理基准测试中,SPARC 超过了单一模块基线和强大的视觉定位方法。例如,SPARC 在 $V^*$ VQA 基准测试中将 Qwen3VL-4B 的准确性提高了 6.7 个百分点,并且在一项具有挑战性的 OOD 任务中,尽管需要 200 倍更低的令牌预算,但其性能超过了“图像中的思考”。
Summary / 总结
SPARC is a modular framework that decouples visual perception from reasoning in vision-language models to make test-time scaling more robust. It uses a two-stage pipeline: explicit visual search localizes question-relevant regions, and reasoning is then conditioned on those regions. This separation enables independent scaling of each stage and selective optimization. SPARC outperforms monolithic baselines and strong visual-grounding approaches, improving Qwen3VL-4B accuracy on the V* VQA benchmark by 6.7 percentage points and surpassing "thinking with images" by 4.6 points on a challenging OOD task with a 200× lower token budget.
SPARC 是一个模块化框架,将视觉感知与推理分离,以提高视觉语言模型的测试时缩放。它采用两阶段管道:视觉搜索以识别相关区域,然后基于这些区域进行推理。SPARC 在多个视觉推理基准测试中优于单一模型基线和强大的视觉接地方法,实现了显著的准确性提升,并且使用了更低的令牌预算。
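The token saving from searching at low resolution and processing only selected regions at high resolution can be illustrated with simple arithmetic. The sketch below is a back-of-the-envelope model, not SPARC's actual accounting: the patch size of 28 pixels, the 0.25 search scale, and the one-token-per-patch rule are all assumptions chosen for illustration, and the numbers are not meant to reproduce the reported 200× figure.

```python
import math
from typing import Iterable, Tuple


def visual_tokens(width: int, height: int, patch: int = 28) -> int:
    """Assumed rule: one visual token per (patch x patch) pixel tile."""
    return math.ceil(width / patch) * math.ceil(height / patch)


def two_stage_tokens(
    width: int,
    height: int,
    regions: Iterable[Tuple[int, int]],
    search_scale: float = 0.25,
    patch: int = 28,
) -> int:
    """Tokens for a low-resolution global search plus full-resolution
    crops of the selected regions, given as (width, height) in pixels."""
    low_w = round(width * search_scale)
    low_h = round(height * search_scale)
    search = visual_tokens(low_w, low_h, patch)
    crops = sum(visual_tokens(w, h, patch) for w, h in regions)
    return search + crops


if __name__ == "__main__":
    full = visual_tokens(2240, 2240)
    staged = two_stage_tokens(2240, 2240, regions=[(280, 280)])
    print(f"monolithic: {full} tokens, two-stage: {staged} tokens")
```

Under these assumptions a 2240×2240 image costs 6400 tokens when processed whole, versus 500 tokens for a quarter-scale search plus one 280×280 crop.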
Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models
Authors: Yu Zeng, Wenxuan Huang, Shiting Huang, Xikun Bao, Yukun Qi, Yiming Zhao, Qiuchen Wang, Lin Chen, Zehui Chen, Huaian Chen, Wanli Ouyang, Feng Zhao
First: 2025-10-01T17:58:05+00:00 · Latest: 2026-02-10T15:16:36+00:00
Abstract
Although current large Vision-Language Models (VLMs) have advanced in multimodal understanding and reasoning, their fundamental perceptual and reasoning abilities remain limited. Specifically, even on simple jigsaw tasks, existing VLMs perform near randomly, revealing deficiencies in core perception and reasoning capabilities. While high-quality vision-language data can enhance these capabilities, its scarcity and limited scalability impose significant constraints. To address this, we propose AGILE, an Agentic jiGsaw Interaction Learning for Enhancing visual perception and reasoning in VLMs. AGILE formulates jigsaw solving as an interactive process, enabling the model to progressively engage with the environment. At each step, the model generates executable code to perform an action based on the current state, while the environment provides fine-grained visual feedback to guide task completion. Through this iterative cycle of observation and interaction, the model incrementally improves its perceptual and reasoning capabilities via exploration and feedback. Experimental results show that AGILE not only substantially boosts performance on jigsaw tasks of varying complexity (e.g., increasing accuracy from 9.5% to 82.8% under the 2×2 setting) but also demonstrates strong generalization across 9 general vision tasks, achieving an average improvement of 3.1%. These results indicate notable enhancements in both perceptual and reasoning abilities. This work opens a new avenue for advancing reasoning and generalization in multimodal models and provides an efficient, scalable solution to the scarcity of multimodal reinforcement learning data. The code and datasets are available at https://github.com/yuzeng0-0/AGILE.
中文标题/摘要
标题:代理拼图互动学习以增强视觉感知与推理能力——视觉语言模型
尽管当前大型视觉语言模型(VLMs)在多模态理解和推理方面取得了进展,但其基本的感知和推理能力仍然有限。具体而言,即使在简单的拼图任务上,现有的VLMs的表现也接近随机,揭示了核心感知和推理能力的不足。虽然高质量的视觉语言数据可以增强这些能力,但其稀缺性和有限的可扩展性对其构成了重大限制。为了解决这一问题,我们提出了AGILE,一种代理拼图互动学习方法,以增强视觉语言模型的视觉感知和推理能力。AGILE将拼图解决过程公式化为一个互动过程,使模型能够逐步与环境互动。在每一步中,模型根据当前状态生成可执行代码以执行动作,而环境则提供精细的视觉反馈以指导任务完成。通过这种观察和互动的迭代循环,模型通过探索和反馈逐步提高其感知和推理能力。实验结果表明,AGILE不仅在不同复杂度的拼图任务(例如,在2×2设置下将准确率从9.5%提高到82.8%)上显著提升了性能,还在9个通用视觉任务上展示了强大的泛化能力,平均提高了3.1%。这些结果表明,感知和推理能力都有了显著提升。这项工作为推进多模态模型的推理和泛化开辟了一条新途径,并提供了一种高效、可扩展的解决多模态强化学习数据稀缺问题的方案。代码和数据集可在https://github.com/yuzeng0-0/AGILE 获取。
Summary / 总结
The paper proposes AGILE, a method to enhance the perceptual and reasoning abilities of Vision-Language Models (VLMs) by formulating jigsaw solving as an interactive process. The model generates executable code to perform actions based on visual feedback, improving its capabilities through iterative cycles. AGILE significantly boosts performance on jigsaw tasks and demonstrates strong generalization across various vision tasks, achieving an average improvement of 3.1%. This work addresses the limitations of current VLMs in core perception and reasoning abilities and provides a scalable solution for multimodal model development.
AGILE 通过将拼图解决过程视为互动过程来增强视觉语言模型(VLM)的感知和推理能力。模型根据视觉反馈生成可执行代码以执行动作,从而实现逐步改进。AGILE 显著提升了拼图任务的表现,并在其他视觉任务中表现出良好的泛化能力,展示了感知和推理能力的提升。
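The observe-act-feedback cycle described in this entry can be sketched as a small environment loop. This is a toy illustration under stated assumptions, not the AGILE codebase: `JigsawEnv`, its `swap` action, and the count-of-correct-tiles feedback are hypothetical, and `greedy_policy` is a hand-written stand-in for the VLM that, in the paper, emits executable code at each step.

```python
from typing import List, Tuple


class JigsawEnv:
    """Toy jigsaw: state[i] is the tile currently sitting in slot i."""

    def __init__(self, shuffled: List[int]) -> None:
        self.state = list(shuffled)

    def swap(self, i: int, j: int) -> int:
        """Action: swap two slots. Feedback: number of correctly placed tiles."""
        self.state[i], self.state[j] = self.state[j], self.state[i]
        return self.correct()

    def correct(self) -> int:
        return sum(1 for slot, tile in enumerate(self.state) if slot == tile)

    def solved(self) -> bool:
        return self.correct() == len(self.state)


def greedy_policy(state: List[int]) -> Tuple[int, int]:
    """Hypothetical stand-in for the model: find the first misplaced slot
    and swap it with the slot that holds the tile belonging there."""
    for slot, tile in enumerate(state):
        if slot != tile:
            return slot, state.index(slot)
    return 0, 0  # already solved; a no-op swap


def run_episode(env: JigsawEnv, max_steps: int = 10) -> int:
    """Alternate between choosing an action and reading feedback."""
    steps = 0
    while not env.solved() and steps < max_steps:
        i, j = greedy_policy(env.state)
        env.swap(i, j)
        steps += 1
    return steps


if __name__ == "__main__":
    env = JigsawEnv([2, 0, 3, 1])  # a shuffled 2x2 puzzle
    print("steps:", run_episode(env), "final state:", env.state)
```

A learned policy would replace `greedy_policy` and would have to infer tile placement from images rather than from tile indices, which is where the perceptual difficulty lies.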
Free-GVC: Towards Training-Free Extreme Generative Video Compression with Temporal Coherence
Authors: Xiaoyue Ling, Chuqin Zhou, Chunyi Li, Yunuo Chen, Yuan Tian, Guo Lu, Wenjun Zhang
First: 2026-02-10T15:12:51+00:00 · Latest: 2026-02-10T15:12:51+00:00
Abstract
Building on recent advances in video generation, generative video compression has emerged as a new paradigm for achieving visually pleasing reconstructions. However, existing methods exhibit limited exploitation of temporal correlations, causing noticeable flicker and degraded temporal coherence at ultra-low bitrates. In this paper, we propose Free-GVC, a training-free generative video compression framework that reformulates video coding as latent trajectory compression guided by a video diffusion prior. Our method operates at the group-of-pictures (GOP) level, encoding video segments into a compact latent space and progressively compressing them along the diffusion trajectory. To ensure perceptually consistent reconstruction across GOPs, we introduce an Adaptive Quality Control module that dynamically constructs an online rate-perception surrogate model to predict the optimal diffusion step for each GOP. In addition, an Inter-GOP Alignment module establishes frame overlap and performs latent fusion between adjacent groups, thereby mitigating flicker and enhancing temporal coherence. Experiments show that Free-GVC achieves an average of 93.29% BD-Rate reduction in DISTS over the latest neural codec DCVC-RT, and a user study further confirms its superior perceptual quality and temporal coherence at ultra-low bitrates.
中文标题/摘要
标题:Free-GVC:一种基于时间连贯性的无训练极端生成视频压缩方法
基于近期视频生成技术的进展,生成式视频压缩已成为实现视觉上令人愉悦重构的新范式。然而,现有方法在利用时间相关性方面存在局限性,导致在超低比特率下出现明显的闪烁和时间连贯性下降。在本文中,我们提出了一种名为Free-GVC的无训练生成式视频压缩框架,将视频编码重新定义为由视频扩散先验引导的潜在轨迹压缩。该方法在组内图像(GOP)级别操作,将视频段编码到紧凑的潜在空间,并沿扩散轨迹逐步压缩。为了确保GOP间感知一致的重构,我们引入了一个自适应质量控制模块,动态构建在线率-感知代理模型以预测每个GOP的最佳扩散步长。此外,一个跨GOP对齐模块建立了帧重叠并进行潜在融合,从而减轻闪烁并增强时间连贯性。实验表明,Free-GVC在DISTS上的平均BD-Rate降低率为93.29%,并且用户研究进一步证实了其在超低比特率下的优越感知质量和时间连贯性。
Summary / 总结
Free-GVC is a training-free generative video compression framework that treats video coding as latent trajectory compression guided by a video diffusion prior, targeting the flicker and weak temporal coherence of existing methods at ultra-low bitrates. It operates at the group-of-pictures (GOP) level, encoding video segments into a compact latent space and compressing them progressively along the diffusion trajectory. An Adaptive Quality Control module predicts the diffusion step for each GOP to keep reconstruction perceptually consistent, while an Inter-GOP Alignment module uses frame overlap and latent fusion between adjacent groups to mitigate flicker. Experiments show an average 93.29% BD-Rate reduction in DISTS over the neural codec DCVC-RT, and a user study confirms better perceptual quality and temporal coherence at ultra-low bitrates.
Free-GVC 是一种无需训练的生成视频压缩框架,利用视频扩散先验将视频片段压缩到紧凑的潜在空间。它包含一个自适应质量控制模块以预测最优扩散步长,并包含一个跨组帧对齐模块以实现帧重叠和潜在融合,从而增强时间连贯性。实验表明,Free-GVC 在 DISTS 中将 BD-Rate 减少 93.29%,并在超低比特率下表现出更优的感知质量和时间连贯性。
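The idea of overlapping adjacent GOPs and fusing their shared frames can be sketched in a few lines. The abstract does not specify how the fusion is performed, so the linear cross-fade below is an assumption used for illustration, as are the function names `split_into_gops` and `fuse_overlap`; latents are represented as one float per frame to keep the example runnable.

```python
from typing import List, Sequence, Tuple


def split_into_gops(num_frames: int, gop_size: int, overlap: int) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, where adjacent
    GOPs share `overlap` frames."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if not 0 <= overlap < gop_size:
        raise ValueError("overlap must satisfy 0 <= overlap < gop_size")
    stride = gop_size - overlap
    gops: List[Tuple[int, int]] = []
    start = 0
    while True:
        end = min(start + gop_size, num_frames)
        gops.append((start, end))
        if end == num_frames:
            break
        start += stride
    return gops


def fuse_overlap(prev_tail: Sequence[float], next_head: Sequence[float]) -> List[float]:
    """Assumed fusion rule: linearly cross-fade the shared frames so the
    weight moves from the previous GOP toward the next one."""
    if len(prev_tail) != len(next_head):
        raise ValueError("overlapping segments must have equal length")
    n = len(prev_tail)
    fused = []
    for k, (a, b) in enumerate(zip(prev_tail, next_head)):
        w = (k + 1) / (n + 1)  # weight given to the next GOP
        fused.append((1 - w) * a + w * b)
    return fused


if __name__ == "__main__":
    print(split_into_gops(num_frames=10, gop_size=4, overlap=1))
    print(fuse_overlap([0.0, 0.0], [3.0, 3.0]))
```

A gradual blend of this kind avoids an abrupt switch between two independently reconstructed segments, which is one plausible way to reduce flicker at GOP boundaries.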
SNAP: Towards Segmenting Anything in Any Point Cloud
Authors: Aniket Gupta, Hanhui Wang, Charles Saunders, Aruni RoyChowdhury, Hanumant Singh, Huaizu Jiang
First: 2025-10-13T16:07:00+00:00 · Latest: 2026-02-10T15:12:42+00:00
Comments: Project Page, https://neu-vi.github.io/SNAP/
Abstract
Interactive 3D point cloud segmentation enables efficient annotation of complex 3D scenes through user-guided prompts. However, current approaches are typically restricted in scope to a single domain (indoor or outdoor), and to a single form of user interaction (either spatial clicks or textual prompts). Moreover, training on multiple datasets often leads to negative transfer, resulting in domain-specific tools that lack generalizability. To address these limitations, we present SNAP (Segment aNything in Any Point cloud), a unified model for interactive 3D segmentation that supports both point-based and text-based prompts across diverse domains. Our approach achieves cross-domain generalizability by training on 7 datasets spanning indoor, outdoor, and aerial environments, while employing domain-adaptive normalization to prevent negative transfer. For text-prompted segmentation, we automatically generate mask proposals without human intervention and match them against CLIP embeddings of textual queries, enabling both panoptic and open-vocabulary segmentation. Extensive experiments demonstrate that SNAP consistently delivers high-quality segmentation results. We achieve state-of-the-art performance on 8 out of 9 zero-shot benchmarks for spatial-prompted segmentation and demonstrate competitive results on all 5 text-prompted benchmarks. These results show that a unified model can match or exceed specialized domain-specific approaches, providing a practical tool for scalable 3D annotation. Project page: https://neu-vi.github.io/SNAP/
中文标题/摘要
标题:SNAP:在任意点云中分割一切
交互式3D点云分割可以通过用户引导的提示高效标注复杂的3D场景。然而,当前的方法通常局限于单一领域(室内或室外),并且仅支持单一形式的用户交互(空间点击或文本提示)。此外,多数据集训练往往导致负迁移,导致领域特定的工具缺乏泛化能力。为了解决这些限制,我们提出了SNAP(在任意点云中分割一切),这是一种统一的交互式3D分割模型,支持跨领域使用点基和文本基提示。通过在涵盖室内、室外和航空环境的7个数据集上进行训练,并采用领域自适应归一化来防止负迁移,我们的方法实现了跨领域的泛化能力。对于基于文本提示的分割,我们自动生成掩码提案,无需人工干预,并将其与CLIP嵌入的文本查询匹配,从而实现全景和开放词汇分割。大量实验表明,SNAP始终能够提供高质量的分割结果。我们在8个零样本基准测试中达到最先进的性能,在5个文本提示基准测试中表现出竞争力的结果。这些结果表明,统一模型可以匹配或超越专门的领域特定方法,提供一种实用的工具,用于大规模3D标注。项目页面为:https://neu-vi.github.io/SNAP/
Summary / 总结
SNAP is designed to address the limitations of current 3D point cloud segmentation methods by providing a unified model that supports both point-based and text-based prompts across various domains. It achieves cross-domain generalizability through training on diverse datasets and using domain-adaptive normalization. SNAP demonstrates high-quality segmentation results, achieving state-of-the-art performance on 8 out of 9 zero-shot benchmarks for spatial-prompted segmentation and competitive results on all 5 text-prompted benchmarks.
SNAP旨在通过提供一个支持点基和文本基提示的统一模型来解决当前3D点云分割方法的局限性,该模型适用于各种领域。它通过在多种数据集上进行训练并使用领域自适应归一化来实现跨域泛化能力。SNAP展示了高质量的分割结果,在大多数基准测试中达到最先进的性能,并在文本提示分割中表现出竞争力,从而提供了一个实用的工具,用于大规模的3D注释。
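Matching mask proposals against text-query embeddings, as described for text-prompted segmentation, is commonly done with cosine similarity. The sketch below assumes that approach and is not taken from the SNAP code: the toy 2-D vectors stand in for real CLIP embeddings, and `match_masks_to_queries` and its `threshold` parameter are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_masks_to_queries(
    mask_embeddings: Sequence[Sequence[float]],
    query_embeddings: Dict[str, Sequence[float]],
    threshold: float = 0.5,
) -> List[Optional[str]]:
    """For each mask proposal, return the best-matching query label,
    or None when no query exceeds the (assumed) similarity threshold."""
    labels: List[Optional[str]] = []
    for mask_vec in mask_embeddings:
        best_label, best_score = None, threshold
        for label, query_vec in query_embeddings.items():
            score = cosine(mask_vec, query_vec)
            if score > best_score:
                best_label, best_score = label, score
        labels.append(best_label)
    return labels


if __name__ == "__main__":
    masks = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
    queries = {"chair": [1.0, 0.1], "car": [0.0, 2.0]}
    print(match_masks_to_queries(masks, queries))
```

Leaving low-similarity masks unlabeled is one simple way to handle proposals that match none of the supplied queries in an open-vocabulary setting.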
From Off-Policy to On-Policy: Enhancing GUI Agents via Bi-level Expert-to-Policy Assimilation
Authors: Zezhou Wang, Ziyun Zhang, Xiaoyi Zhang, Zhuzhong Qian, Yan Lu
First: 2026-01-09T13:26:38+00:00 · Latest: 2026-02-10T15:12:17+00:00
Comments: Work In Progress
Abstract
Vision-language models are increasingly deployed as computer-use agents (CUAs) that operate desktops and browsers. Top-performing CUAs are framework-based systems that decompose planning and execution, while end-to-end screenshot-to-action policies are easier to deploy but lag behind on benchmarks such as OSWorld-Verified. GUI datasets like OSWorld pose two bottlenecks: they expose only a few hundred interactive, verifiable tasks and environments, and expert trajectories must be gathered by interacting with these environments, making such data hard to scale. We therefore ask how reinforcement learning from verifiable rewards (RLVR) can best exploit a small pool of existing expert trajectories to train end-to-end policies. Naively mixing these off-policy traces into on-policy RLVR is brittle: even after format conversion, expert trajectories exhibit structural mismatch and distribution shift from the learner. We propose BEPA (Bi-Level Expert-to-Policy Assimilation), which turns static expert traces into policy-aligned guidance via self-rolled reachable trajectories under the base policy (LEVEL-1) and a per-task, dynamically updated cache used in RLVR (LEVEL-2). On OSWorld-Verified, BEPA improves UITARS1.5-7B success from 22.87% to 32.13% and raises a held-out split from 5.74% to 10.30%, with consistent gains on MMBench-GUI and Online-Mind2Web. Our code and data are available at: https://github.com/LEON-gittech/Verl_GUI.git
中文标题/摘要
标题:从离策学到就策学:通过双层专家到策略同化提升GUI代理
视觉语言模型越来越多地被部署为计算机使用代理(CUAs),用于操作桌面和浏览器。表现最佳的CUAs是基于框架的系统,将规划和执行分解开来,而端到端的截图到动作策略更容易部署,但在OSWorld-Verified等基准测试中却落后。GUI数据集如OSWorld存在两个瓶颈:它们仅暴露了几百个可交互的、可验证的任务和环境,并且专家轨迹必须通过与这些环境交互来收集,使得此类数据难以扩展。因此,我们探讨了如何利用可验证奖励的强化学习(RLVR)最好地利用少量现有的专家轨迹来训练端到端策略。简单地将这些离策学轨迹混合到就策学RLVR中是脆弱的:即使经过格式转换,专家轨迹也表现出结构不匹配和分布偏移。我们提出了BEPA(双层专家到策略同化),通过在基础策略(LEVEL-1)下生成策略对齐的指导性自卷积可达轨迹和RLVR中使用的每任务动态更新缓存(LEVEL-2)将静态专家轨迹转化为策略对齐的指导。在OSWorld-Verified上,BEPA将UITARS1.5-7B的成功率从22.87%提高到32.13%,并将保留分割从5.74%提高到10.30%,在MMBench-GUI和Online-Mind2Web上也取得了持续的改进。我们的代码和数据可在:https://github.com/LEON-gittech/Verl_GUI.git
Summary / 总结
The research aims to enhance computer-use agents (CUAs) that operate desktops and browsers by improving end-to-end screenshot-to-action policies using a novel method called BEPA (Bi-Level Expert-to-Policy Assimilation). BEPA converts static expert trajectories into policy-aligned guidance through self-rolled reachable trajectories and a dynamically updated cache used in reinforcement learning from verifiable rewards. The method significantly improves the success rate of UITARS1.5-7B on OSWorld-Verified from 22.87% to 32.13% and raises the held-out split from 5.74% to 10.30%, with consistent gains on MMBench-GUI and Online-Mind2Web.
研究旨在通过解决离策训练和端到端截图到动作策略的局限性,提升操作桌面和浏览器的计算机使用代理(CUAs)。研究提出了BEPA(Bi-Level Expert-to-Policy Assimilation),通过在基策略下生成自卷达可达轨迹和每任务动态更新的缓存,将静态专家轨迹转化为策略对齐的指导。在OSWorld-Verified上,BEPA显著提高了UITARS1.5-7B的成功率,从22.87%提升到32.13%,同时将保留分割从5.74%提升到10.30%,并在MMBench-GUI和Online-Mind2Web上保持一致的改进。
Kelix Technique Report
Authors: Boyang Ding, Chenglong Chu, Dunju Zang, Han Li, Jiangxia Cao, Kun Gai, Muhao Wei, Ruiming Tang, Shiyao Wang, Siyang Mao, Xinchen Luo, Yahui Liu, Zhixin Ling, Zhuoran Yang, Ziming Li, Chengru Song, Guorui Zhou, Guowang Zhang, Hao Peng, Hao Wang, Jiaxin Deng, Jin Ouyang, Jinghao Zhang, Lejian Ren, Qianqian Wang, Qigen Hu, Tao Wang, Xingmei Wang, Yiping Yang, Zixing Zhang, Ziqi Wang
First: 2026-02-10T14:48:26+00:00 · Latest: 2026-02-10T14:48:26+00:00
Comments: Work in progress
Abstract
Autoregressive large language models (LLMs) scale well by expressing diverse tasks as sequences of discrete natural-language tokens and training with next-token prediction, which unifies comprehension and generation under self-supervision. Extending this paradigm to multimodal data requires a shared, discrete representation across modalities. However, most vision-language models (VLMs) still rely on a hybrid interface: discrete text tokens paired with continuous Vision Transformer (ViT) features. Because supervision is largely text-driven, these models are often biased toward understanding and cannot fully leverage large-scale self-supervised learning on non-text data. Recent work has explored discrete visual tokenization to enable fully autoregressive multimodal modeling, showing promising progress toward unified understanding and generation. Yet existing discrete vision tokens frequently lose information due to limited code capacity, resulting in noticeably weaker understanding than continuous-feature VLMs. We present Kelix, a fully discrete autoregressive unified model that closes the understanding gap between discrete and continuous visual representations.
中文标题/摘要
标题:Kelix 技术报告
自回归大型语言模型(LLMs)通过将多样化的任务表示为离散自然语言标记序列,并通过下一个标记预测进行训练,从而能够很好地扩展,这将理解与生成统一在自我监督之下。将这一范式扩展到多模态数据需要在不同模态之间共享离散表示。然而,大多数视觉语言模型(VLMs)仍然依赖于混合界面:离散文本标记配以连续的视觉变换器(ViT)特征。由于监督主要来自文本,这些模型往往偏向于理解,而不能充分利用大规模的非文本数据的自我监督学习。最近的工作探索了离散视觉标记化,以实现完全自回归的多模态建模,显示出统一理解和生成的有希望的进展。然而,现有的离散视觉标记经常由于编码容量有限而丢失信息,导致理解能力明显弱于连续特征的VLMs。我们提出了Kelix,一种完全离散的自回归统一模型,以缩小离散和连续视觉表示之间的理解差距。
Summary / 总结
The research aims to improve multimodal data processing by developing a fully discrete autoregressive unified model, Kelix, to bridge the understanding gap between discrete and continuous visual representations. The method involves using discrete visual tokens to enable unified understanding and generation, similar to how autoregressive large language models handle text. Key findings show that Kelix closes the understanding gap, demonstrating promising progress toward unified multimodal modeling.
研究旨在通过开发一个完全离散的自回归统一模型Kelix,来弥合离散和连续视觉表示之间的理解差距。方法是使用离散视觉标记来实现统一的理解和生成,类似于自回归大型语言模型处理文本的方式。主要发现表明,Kelix成功地弥合了理解差距,展示了统一多模态建模的有希望的进展。
SAKED: Mitigating Hallucination in Large Vision-Language Models via Stability-Aware Knowledge Enhanced Decoding
Authors: Zhaoxu Li, Chenqi Kong, Peijun Bao, Song Xia, Yi Tu, Yi Yu, Xinghao Jiang, Xudong Jiang
First: 2026-02-10T14:33:24+00:00 · Latest: 2026-02-10T14:33:24+00:00
Abstract
Hallucinations in Large Vision-Language Models (LVLMs) pose significant security and reliability risks in real-world applications. Inspired by the observation that humans are more error-prone when uncertain or hesitant, we investigate how instability in a model's internal knowledge contributes to LVLM hallucinations. We conduct extensive empirical analyses from three perspectives, namely attention heads, model layers, and decoding tokens, and identify three key hallucination patterns: (i) visual activation drift across attention heads, (ii) pronounced knowledge fluctuations across layers, and (iii) visual focus distraction between neighboring output tokens. Building on these findings, we propose Stability-Aware Knowledge-Enhanced Decoding (SAKED), which introduces a layer-wise Knowledge Stability Score (KSS) to quantify knowledge stability throughout the model. By contrasting the most stability-aware and stability-agnostic layers, SAKED suppresses decoding noise and dynamically leverages the most reliable internal knowledge for faithful token generation. Moreover, SAKED is training-free and can be seamlessly integrated into different architectures. Extensive experiments demonstrate that SAKED achieves state-of-the-art performance for hallucination mitigation on various models, tasks, and benchmarks.
中文标题/摘要
标题:SAKED:通过稳定性意识知识增强解码减轻大型视觉-语言模型的幻觉
大型视觉-语言模型(LVLMs)中的幻觉在实际应用中带来了重大的安全性和可靠性风险。受人类在不确定或犹豫时更容易出错的观察启发,我们研究了模型内部知识的不稳定性如何导致LVLM幻觉。我们从注意力头、模型层和解码标记三个角度进行了广泛的实证分析,并识别出三种关键的幻觉模式:(i)注意力头之间的视觉激活漂移,(ii)各层之间显著的知识波动,(iii)相邻输出标记之间的视觉焦点分散。基于这些发现,我们提出了稳定性意识知识增强解码(SAKED),它引入了一种逐层的知识稳定性评分(KSS)来量化模型中的知识稳定性。通过对比最稳定和最不稳定的层,SAKED 抑制了解码噪声,并动态利用最可靠的内部知识生成忠实的标记。此外,SAKED 是无需训练的,并且可以无缝集成到不同的架构中。广泛的实验表明,SAKED 在各种模型、任务和基准上的幻觉缓解性能达到了最先进的水平。
Summary / 总结
The research aims to address the issue of hallucinations in Large Vision-Language Models (LVLMs) by analyzing the instability in the model's internal knowledge. The study identifies three hallucination patterns and proposes Stability-Aware Knowledge-Enhanced Decoding (SAKED), which uses a layer-wise Knowledge Stability Score to suppress decoding noise and dynamically utilize the most reliable knowledge. Experiments show that SAKED effectively mitigates hallucinations across various models and benchmarks.
研究通过分析大型视觉-语言模型(LVLM)内部知识的不稳定性,识别了三种幻觉模式:视觉激活漂移、层间知识波动以及输出令牌焦点分散。为解决这些问题,研究提出了SAKED,它使用层间知识稳定性评分来抑制解码噪声,并动态利用最可靠的内部知识。实验表明,SAKED在各种模型、任务和基准测试中优于现有方法,在幻觉抑制方面表现出色。
OpenMonoGS-SLAM: Monocular Gaussian Splatting SLAM with Open-set Semantics
Authors: Jisang Yoo, Gyeongjin Kang, Hyun-kyu Ko, Hyeonwoo Yu, Eunbyung Park
First: 2025-12-09T14:10:23+00:00 · Latest: 2026-02-10T14:16:20+00:00
Comments: Work in progress. Project page: https://jisang1528.github.io/OpenMonoGS-SLAM/
Abstract
Simultaneous Localization and Mapping (SLAM) is a foundational component in robotics, AR/VR, and autonomous systems. With the rising focus on spatial AI in recent years, combining SLAM with semantic understanding has become increasingly important for enabling intelligent perception and interaction. Recent efforts have explored this integration, but they often rely on depth sensors or closed-set semantic models, limiting their scalability and adaptability in open-world environments. In this work, we present OpenMonoGS-SLAM, the first monocular SLAM framework that unifies 3D Gaussian Splatting (3DGS) with open-set semantic understanding. To achieve our goal, we leverage recent advances in Visual Foundation Models (VFMs), including MASt3R for visual geometry and SAM and CLIP for open-vocabulary semantics. These models provide robust generalization across diverse tasks, enabling accurate monocular camera tracking and mapping, as well as a rich understanding of semantics in open-world environments. Our method operates without any depth input or 3D semantic ground truth, relying solely on self-supervised learning objectives. Furthermore, we propose a memory mechanism specifically designed to manage high-dimensional semantic features, which effectively constructs Gaussian semantic feature maps, leading to strong overall performance. Experimental results demonstrate that our approach achieves performance comparable to or surpassing existing baselines in both closed-set and open-set segmentation tasks, all without relying on supplementary sensors such as depth maps or semantic annotations.
中文标题/摘要
标题:OpenMonoGS-SLAM:具有开放集语义的单目高斯点云SLAM
同时定位与建图(SLAM)是机器人学、AR/VR和自主系统中的基础组件。近年来,随着对空间AI的关注增加,将SLAM与语义理解相结合变得越来越重要,以实现智能感知和交互。最近的研究已经探索了这种集成,但它们通常依赖于深度传感器或封闭集语义模型,限制了其在开放环境中的可扩展性和适应性。在本文中,我们提出了OpenMonoGS-SLAM,这是第一个将3D高斯点云(3DGS)与开放集语义理解统一的单目SLAM框架。为了实现这一目标,我们利用了视觉基础模型(VFMs)的最新进展,包括MASt3R用于视觉几何和SAM及CLIP用于开放词汇语义。这些模型在多种任务上提供了稳健的泛化能力,使我们能够实现准确的单目相机跟踪和建图,以及对开放环境中的语义有丰富的理解。我们的方法不依赖任何深度输入或3D语义真值,仅依赖于自我监督学习目标。此外,我们提出了一种专门设计的记忆机制,用于管理高维语义特征,有效地构建了高斯语义特征图,从而实现了强大的整体性能。实验结果表明,我们的方法在封闭集和开放集分割任务中均能达到或超过现有基线的性能,且不依赖于诸如深度图或语义注释等辅助传感器。
Summary / 总结
OpenMonoGS-SLAM is a monocular SLAM framework that integrates 3D Gaussian Splatting with open-set semantic understanding, leveraging Visual Foundation Models for robust generalization. It achieves accurate monocular camera tracking and mapping without depth input or 3D semantic ground truth, and introduces a memory mechanism to manage high-dimensional semantic features, leading to strong performance in both closed-set and open-set segmentation tasks.
OpenMonoGS-SLAM 是一种结合了 3D 高斯点云和开放集语义理解的单目 SLAM 框架,使用了 MASt3R、SAM 和 CLIP 等视觉基础模型。该方法在没有深度输入或 3D 语义标注的情况下实现了准确的单目相机跟踪和建图,并在封闭集和开放集分割任务中达到了与现有基线相当或更优的性能。
Learning Tractable Distributions Of Language Model Continuations
Authors: Gwen Yidou-Weng, Ian Li, Anji Liu, Oliver Broadrick, Yuchen Cui, Guy Van den Broeck, Benjie Wang
First: 2025-11-20T05:17:19+00:00 · Latest: 2026-02-10T13:57:46+00:00
Abstract
Controlled generation imposes sequence-level constraints (syntax, style, safety) that depend on future tokens, making exact conditioning of an autoregressive LM intractable. Tractable surrogates such as HMMs can approximate continuation distributions and steer decoding, but standard surrogates are often weakly context-aware. We propose Learning to Look Ahead (LTLA), a hybrid method that uses base-LM embeddings to condition a globally learned tractable surrogate: a neural head predicts only a prefix-dependent latent prior, while a shared HMM answers continuation queries exactly. LTLA is designed to avoid two common efficiency traps when adding neural context. First, it avoids vocabulary-sized prefix rescoring (V extra LM evaluations) by scoring all next-token candidates via a single batched HMM forward update. Second, it avoids predicting a new HMM per prefix by learning one shared HMM and conditioning only the latent prior, which enables reuse of cached future-likelihood (backward) messages across decoding steps. Empirically, LTLA improves continuation likelihood over standard HMM surrogates, enables lookahead control for vision-language models by incorporating continuous context, achieves 100% syntactic constraint satisfaction, and improves detoxification while adding only a 14% decoding-time overhead.
中文标题/摘要
标题:学习可计算的语言模型续写分布
可控生成施加了序列级约束(语法、风格、安全性),这些约束依赖于未来的标记,使得自回归语言模型的确切条件化不可计算。可计算的替代方案如隐马尔可夫模型(HMM)可以近似续写分布并引导解码,但标准替代方案通常缺乏上下文意识。我们提出了一种名为前瞻学习(LTLA)的混合方法,该方法使用基础语言模型嵌入来条件化一个全局学习的可计算替代方案:一个神经头仅预测前缀依赖的潜在先验,而一个共享的HMM可以精确回答续写查询。LTLA旨在避免在添加神经上下文时常见的两种效率陷阱。首先,它通过单次批处理HMM前向更新来评分所有下一个标记候选,从而避免了词汇表大小的前缀重新评分(V次额外的LM评估)。其次,它通过学习一个共享的HMM并仅条件化潜在先验来避免为每个前缀预测一个新的HMM,这使得未来可能性(后向)消息可以在解码步骤之间重用。实验表明,LTLA在续写似然性上优于标准HMM替代方案,通过结合连续上下文使视觉-语言模型能够进行前瞻控制,实现了100%的语法约束满足,并在增加仅14%的解码时间开销的情况下提高了去毒效果。
Summary / 总结
The paper addresses the challenge of applying sequence-level constraints in controlled generation by proposing Learning to Look Ahead (LTLA), a hybrid method that uses a neural head to predict a prefix-dependent latent prior while a shared HMM answers continuation queries exactly. This approach avoids the inefficiencies of vocabulary-sized prefix rescoring and the need to predict a new HMM per prefix, leading to improved continuation likelihood, 100% syntactic constraint satisfaction, and enhanced detoxification with only a 14% decoding-time overhead.
研究提出了一种名为“学习向前看”(LTLA)的混合方法,该方法使用神经头预测前缀相关的潜在先验,并使用共享的HMM精确回答延续查询。该方法避免了词汇量大小的前缀重新评分和每次前缀都需要预测一个新的HMM的效率问题,从而提高了延续似然性、更好地满足了句法约束,并增强了去毒功能,同时仅增加了14%的解码时间开销。
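The single-pass candidate scoring described in the LTLA abstract can be illustrated with a toy HMM. The sketch below is not the authors' implementation: the function name, the argument layout, and the simplification that constraint satisfaction depends only on the next hidden state are all assumptions made for illustration. It shows how one cached backward message lets every next-token candidate be scored without any extra LM evaluations.

```python
def next_token_scores(latent_prior, emission, transition, backward_msg):
    """Score every next-token candidate in one pass over the hidden states.

    All names are hypothetical, chosen for this sketch:
    latent_prior[z]   : P(z_t = z | prefix), the prefix-dependent latent prior
    emission[z][v]    : P(x_t = v | z_t = z), shared HMM parameter
    transition[z][z2] : P(z_{t+1} = z2 | z_t = z), shared HMM parameter
    backward_msg[z2]  : P(constraint holds later | z_{t+1} = z2), cached
                        (toy assumption: it depends only on the hidden state)
    """
    num_states = len(latent_prior)
    vocab_size = len(emission[0])
    # Look-ahead weight per current state: sum over z2 of A[z][z2] * beta[z2].
    lookahead = [
        sum(transition[z][z2] * backward_msg[z2] for z2 in range(num_states))
        for z in range(num_states)
    ]
    # Joint P(x_t = v, constraint holds | prefix), computed for all v at once.
    joint = [
        sum(latent_prior[z] * emission[z][v] * lookahead[z]
            for z in range(num_states))
        for v in range(vocab_size)
    ]
    total = sum(joint)
    if total == 0.0:
        raise ValueError("constraint is unreachable under this prior")
    # Normalise to P(x_t = v | constraint holds, prefix).
    return [score / total for score in joint]
```

Because only `latent_prior` changes with the prefix in this sketch, `transition`, `emission`, and `backward_msg` can be reused unchanged across decoding steps.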
Temporal Concept Dynamics in Diffusion Models via Prompt-Conditioned Interventions
Authors: Ada Gorgun, Fawaz Sammani, Nikos Deligiannis, Bernt Schiele, Jonas Fischer
Venue: ICLR 2026
First: 2025-12-09T11:05:08+00:00 · Latest: 2026-02-10T12:37:43+00:00
Comments: Accepted at the International Conference on Learning Representations 2026 (ICLR 2026). Code is available at: https://adagorgun.github.io/PCI-Project/
Abstract
Diffusion models are usually evaluated by their final outputs, gradually denoising random noise into meaningful images. Yet, generation unfolds along a trajectory, and analyzing this dynamic process is crucial for understanding how controllable, reliable, and predictable these models are in terms of their success/failure modes. In this work, we ask the question: when does noise turn into a specific concept (e.g., age) and lock in the denoising trajectory? We propose PCI (Prompt-Conditioned Intervention) to study this question. PCI is a training-free and model-agnostic framework for analyzing concept dynamics through diffusion time. The central idea is the analysis of Concept Insertion Success (CIS), defined as the probability that a concept inserted at a given timestep is preserved and reflected in the final image, offering a way to characterize the temporal dynamics of concept formation. Applied to several state-of-the-art text-to-image diffusion models and a broad taxonomy of concepts, PCI reveals diverse temporal behaviors across diffusion models, in which certain phases of the trajectory are more favorable to specific concepts even within the same concept type. These findings also provide actionable insights for text-driven image editing, highlighting when interventions are most effective without requiring access to model internals or training, and yielding quantitatively stronger edits that balance semantic accuracy and content preservation better than strong baselines. Code is available at: https://adagorgun.github.io/PCI-Project/
中文标题/摘要
标题:通过提示条件干预研究扩散模型中的时间概念动态
扩散模型通常通过其最终输出进行评估,逐步去除随机噪声以生成有意义的图像。然而,生成过程沿着一条轨迹展开,分析这一动态过程对于理解这些模型在成功/失败模式方面的可控性、可靠性和可预测性至关重要。在本研究中,我们提出的问题是:何时噪声会转变为特定概念(例如年龄)并锁定去噪轨迹?我们提出了基于提示条件干预(PCI)的方法来研究这一问题。PCI 是一种无需训练且模型无关的框架,用于通过扩散时间分析概念动态。核心思想是分析概念插入成功率(CIS),定义为在给定时间步插入的概念在最终图像中被保留和反映的概率,从而表征概念形成的时间动态。将 PCI 应用于多个最先进的文本到图像扩散模型以及广泛的概念分类,揭示了扩散模型中概念的多样时间行为,在同一概念类型内,某些轨迹阶段对特定概念更为有利。这些发现还为文本驱动的图像编辑提供了可操作的见解,无需访问模型内部或进行训练,从而实现比强基线更强的编辑,这些编辑在语义准确性和内容保留之间达到了平衡。代码可在 https://adagorgun.github.io/PCI-Project/ 获取。
Summary / 总结
This work investigates the temporal dynamics of concept formation in diffusion models by proposing PCI (Prompt-Conditioned Intervention), a training-free and model-agnostic framework. PCI analyzes the probability of a concept being preserved and reflected in the final image at different timesteps, revealing diverse temporal behaviors across different diffusion models. These findings provide actionable insights for text-driven image editing, showing when interventions are most effective without requiring access to model internals or training, and yielding stronger edits than strong baselines.
该研究通过提出训练免费且模型无关的PCI(Prompt-Conditioned Intervention)框架,探讨了扩散模型中概念形成的时序动态。PCI通过分析不同时间步长的概念插入成功率(CIS),来理解概念如何在最终图像中被保留和反映。研究揭示了不同扩散模型中多样化的时序行为,表明特定阶段的轨迹对特定概念更为有利。这些发现为文本驱动的图像编辑提供了实用的指导,展示了在无需访问模型内部结构或训练的情况下,何时进行干预最为有效,并且在语义准确性和内容保真度之间取得了优于强基线的平衡。
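Since CIS is defined as a probability per insertion timestep, it can be estimated as an empirical success frequency. The sketch below assumes each trial has already been reduced to a pair `(timestep, preserved)`. How the concept is inserted and how preservation is judged are left outside the sketch, and the function name and record format are hypothetical.

```python
from collections import defaultdict

def concept_insertion_success(trials):
    """Estimate CIS per timestep as the fraction of trials that kept the concept.

    trials: iterable of (timestep, preserved) pairs, where `preserved` is True
    when the inserted concept shows up in the final image (hypothetical format).
    """
    counts = defaultdict(lambda: [0, 0])  # timestep -> [successes, total]
    for timestep, preserved in trials:
        counts[timestep][0] += int(bool(preserved))
        counts[timestep][1] += 1
    return {t: successes / total
            for t, (successes, total) in sorted(counts.items())}
```

Plotting the returned values against the timestep gives one curve per concept, which is one way to read off the phase of the trajectory in which an insertion still takes effect.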
GenSeg-R1: RL-Driven Vision-Language Grounding for Fine-Grained Referring Segmentation
Authors: Sandesh Hegde, Jaison Saji Chacko, Debarshi Banerjee, Uma Mahesh
First: 2026-02-10T11:59:14+00:00 · Latest: 2026-02-10T11:59:14+00:00
Abstract
We study fine-grained referring image segmentation via a decoupled reason-then-segment pipeline. A vision-language model (VLM) receives an image and a natural-language query, reasons about the scene, and emits structured spatial prompts: a bounding box plus two interior keypoints for every referred instance. A frozen promptable segmenter (SAM 2) converts these prompts into high-quality masks.
Within our GenSeg-R1 framework we finetune Qwen3-VL models (4B and 8B parameters) using Group Relative Policy Optimization (GRPO), requiring no supervised reasoning-chain annotations. On RefCOCOg validation our best model (GenSeg-R1-8B) achieves 0.7127 cIoU and 0.7382 mIoU, substantially outperforming the corresponding Qwen3-VL Instruct baselines (+15.3 and +21.9 points, respectively) and surpassing Seg-Zero-7B [3] by +3.3 cIoU under identical evaluation.
We further introduce GenSeg-R1-G, a variant trained on GRefCOCO [9] with a SAM 2 in-the-loop reward that directly optimizes mask quality. On GRefCOCO validation GenSeg-R1-G achieves 76.69% target mIoU with 82.40% accuracy on negative (no-target) prompts, substantially outperforming Seg-R1-7B and Seg-Zero-7B, which lack no-target detection capability. On ReasonSeg test, GenSeg-R1-4B reaches 68.40% mIoU, surpassing Seg-Zero-7B by +7.0 and Seg-R1-7B by +10.7 points.
中文标题/摘要
标题:GenSeg-R1:基于RL的细粒度视觉-语言定位以实现细粒度引用分割
我们通过一个解耦的推理-分割流水线研究细粒度的引用图像分割。视觉-语言模型(VLM)接收图像和自然语言查询,推理场景并发出结构化的空间提示:每个引用实例的边界框加上两个内部关键点。一个冻结的可提示分割器(SAM 2)将这些提示转换为高质量的掩码。
在我们的GenSeg-R1框架中,我们使用组相对策略优化(GRPO)微调Qwen3-VL模型(4B和8B参数),无需监督推理链注释。在RefCOCOg验证集上,我们的最佳模型(GenSeg-R1-8B)达到0.7127 cIoU和0.7382 mIoU,显著优于相应的Qwen3-VL Instruct基线模型(分别提高15.3和21.9分)并在相同的评估条件下超越Seg-Zero-7B [3] 3.3 cIoU。
我们进一步引入了GenSeg-R1-G变体,该变体在GRefCOCO [9]上进行训练,并带有SAM 2在环中的奖励,直接优化掩码质量。在GRefCOCO验证集上,GenSeg-R1-G达到76.69%的目标mIoU,负(无目标)提示的准确率为82.40%,显著优于Seg-R1-7B和Seg-Zero-7B,后者缺乏无目标检测能力。在ReasonSeg测试集上,GenSeg-R1-4B达到68.40%的mIoU,分别超越Seg-Zero-7B和Seg-R1-7B 7.0和10.7分。
Summary / 总结
The research aims to improve fine-grained referring image segmentation by developing a decoupled reason-then-segment pipeline. A vision-language model (VLM) reasons about the scene and generates structured spatial prompts, which a frozen segmenter (SAM 2) converts into masks. The GenSeg-R1 framework, which fine-tunes Qwen3-VL models using Group Relative Policy Optimization, achieves significant improvements in cIoU and mIoU on RefCOCOg validation compared to baselines. Additionally, GenSeg-R1-G, trained on GRefCOCO with a SAM 2 in-the-loop reward, handles both target and no-target prompts on GRefCOCO validation, and GenSeg-R1-4B surpasses Seg-Zero-7B and Seg-R1-7B on the ReasonSeg test.
该研究提出了GenSeg-R1,一种使用分解为先推理后分割管道的细粒度引用图像分割方法。视觉语言模型生成结构化的空间提示,然后由分割器生成掩码。该模型通过组相对策略优化进行微调,在RefCOCOg验证集上显著优于基线模型,获得更高的cIoU和mIoU分数。此外,GenSeg-R1-G在GRefCOCO和ReasonSeg测试上的掩码质量和无目标检测能力上也优于其他模型。
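The GenSeg-R1 abstract reports both cIoU and mIoU without defining them. The sketch below uses the definitions common in referring segmentation, which is an assumption about the paper: cIoU divides the intersection summed over all samples by the union summed over all samples, while mIoU averages the per-sample IoU. Masks are flat sequences of 0/1 values, and scoring an empty union as 1.0 is a convention chosen here for illustration.

```python
def intersection_and_union(pred_mask, gt_mask):
    """Count intersecting and united pixels of two flat binary masks."""
    inter = sum(1 for p, g in zip(pred_mask, gt_mask) if p and g)
    union = sum(1 for p, g in zip(pred_mask, gt_mask) if p or g)
    return inter, union

def ciou_and_miou(mask_pairs):
    """Return (cIoU, mIoU) for a list of (prediction, ground truth) masks."""
    total_inter = total_union = 0
    per_sample = []
    for pred_mask, gt_mask in mask_pairs:
        inter, union = intersection_and_union(pred_mask, gt_mask)
        total_inter += inter
        total_union += union
        # Assumed convention: two empty masks count as a perfect match.
        per_sample.append(inter / union if union else 1.0)
    ciou = total_inter / total_union if total_union else 1.0
    miou = sum(per_sample) / len(per_sample)
    return ciou, miou
```

cIoU weights large objects more heavily because their pixels dominate the sums, whereas mIoU treats every sample equally, which is one reason the two numbers in the abstract can differ.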
Towards Training-free Multimodal Hate Localisation with Large Language Models
Authors: Yueming Sun, Long Yang, Jianbo Jiao, Zeyu Fu
First: 2026-02-10T10:32:46+00:00 · Latest: 2026-02-10T10:32:46+00:00
Abstract
The proliferation of hateful content in online videos poses severe threats to individual well-being and societal harmony. However, existing solutions for video hate detection either rely heavily on large-scale human annotations or lack fine-grained temporal precision. In this work, we propose LELA, the first training-free Large Language Model (LLM) based framework for hate video localization. Distinct from state-of-the-art models that depend on supervised pipelines, LELA leverages LLMs and modality-specific captioning to detect and temporally localize hateful content in a training-free manner. Our method decomposes a video into five modalities, including image, speech, OCR, music, and video context, and uses a multi-stage prompting scheme to compute fine-grained hateful scores for each frame. We further introduce a composition matching mechanism to enhance cross-modal reasoning. Experiments on two challenging benchmarks, HateMM and MultiHateClip, demonstrate that LELA outperforms all existing training-free baselines by a large margin. We also provide extensive ablations and qualitative visualizations, establishing LELA as a strong foundation for scalable and interpretable hate video localization.
中文标题/摘要
标题:朝无需训练的多模态仇恨定位迈进——基于大型语言模型的方法
在线视频中仇恨内容的泛滥对个人福祉和社会和谐构成了严重威胁。然而,现有的视频仇恨检测解决方案要么依赖大规模的人工注释,要么缺乏精细的时间精度。在本文中,我们提出了LELA,这是首个无需训练的基于大型语言模型(LLM)的仇恨视频定位框架。与依赖监督管道的先进模型不同,LELA 利用 LLM 和模态特定的字幕检测并以无需训练的方式检测和时间定位仇恨内容。我们的方法将视频分解为五种模态,包括图像、语音、OCR、音乐和视频上下文,并使用多阶段提示方案为每一帧计算细粒度的仇恨得分。我们还引入了一种组成匹配机制以增强跨模态推理。在两个具有挑战性的基准测试HateMM和MultiHateClip上的实验表明,LELA 在无需训练的基线中表现出显著的优势。我们还提供了详尽的消融分析和定性可视化,确立了LELA 作为可扩展和可解释的仇恨视频定位的强大基础。
AGMark: Attention-Guided Dynamic Watermarking for Large Vision-Language Models
Authors: Yue Li, Xin Yi, Dongsheng Shi, Yongyi Cui, Gerard de Melo, Linlin Wang
First: 2026-02-10T10:02:29+00:00 · Latest: 2026-02-10T10:02:29+00:00
Comments: preprint
Abstract
Watermarking has emerged as a pivotal solution for content traceability and intellectual property protection in Large Vision-Language Models (LVLMs). However, vision-agnostic watermarks may introduce visually irrelevant tokens and disrupt visual grounding by enforcing indiscriminate pseudo-random biases. Additionally, current vision-specific watermarks rely on a static, one-time estimation of vision critical weights and ignore the weight distribution density when determining the proportion of protected tokens. This design fails to account for dynamic changes in visual dependence during generation and may introduce low-quality tokens in the long tail. To address these challenges, we propose Attention-Guided Dynamic Watermarking (AGMark), a novel framework that embeds detectable signals while strictly preserving visual fidelity. At each decoding step, AGMark first dynamically identifies semantic-critical evidence based on attention weights for visual relevance, together with context-aware coherence cues, resulting in a more adaptive and well-calibrated evidence-weight distribution. It then determines the proportion of semantic-critical tokens by jointly considering uncertainty awareness (token entropy) and evidence calibration (weight density), thereby enabling adaptive vocabulary partitioning to avoid irrelevant tokens. Empirical results confirm that AGMark outperforms conventional methods, observably improving generation quality and yielding particularly strong gains in visual semantic fidelity in the later stages of generation. The framework maintains highly competitive detection accuracy (at least 99.36% AUC) and robust attack resilience (at least 88.61% AUC) without sacrificing inference efficiency, effectively establishing a new standard for reliability-preserving multi-modal watermarking.
中文标题/摘要
标题:AGMark:注意力引导的动态水印技术用于大型视觉语言模型
水印技术已成为大型视觉语言模型(LVLMs)中内容可追溯性和知识产权保护的关键解决方案。然而,视觉无关的水印可能会引入视觉上无关的标记并破坏视觉定位,通过施加不分青红皂白的伪随机偏见。此外,当前的视觉特定水印依赖于视觉关键权重的一次性静态估计,并在确定保护标记的比例时忽略了权重分布密度,这种设计未能考虑到生成过程中视觉依赖性的动态变化,可能会在生成的长尾部分引入低质量的标记。为了解决这些挑战,我们提出了注意力引导的动态水印(AGMark)这一新颖框架,该框架在严格保持视觉保真度的同时嵌入可检测的信号。在每次解码步骤中,AGMark 首先根据注意力权重动态识别基于视觉相关性的语义关键证据,结合上下文感知的一致性线索,从而获得更适应且校准更好的证据权重分布。然后,它通过同时考虑不确定性意识(标记熵)和证据校准(权重密度)来确定语义关键标记的比例,从而实现自适应词汇分区以避免无关标记。实验证明,AGMark 在生成质量和视觉语义保真度方面优于传统方法,特别是在生成后期阶段表现出显著的改进。该框架保持了高度竞争力的检测准确性(至少99.36% AUC)和强大的攻击鲁棒性(至少88.61% AUC),而不牺牲推理效率,从而有效地确立了可靠保真的多模态水印的新标准。
Summary / 总结
AGMark is a novel framework for watermarking LVLMs that dynamically identifies and protects semantic-critical tokens based on attention weights and context coherence, ensuring visual fidelity. It outperforms conventional methods by improving generation quality and visual semantic fidelity, especially in later stages, while maintaining high detection accuracy and robustness against attacks without compromising inference efficiency.
AGMark 是一种新型的大型视觉语言模型水印框架,它基于注意力权重和上下文一致性动态识别并保护语义关键的标记。该方法在后期生成阶段特别提高了生成质量和视觉语义保真度,同时保持了高检测准确率和对攻击的鲁棒性,而不牺牲推理效率。
Delving into Spectral Clustering with Vision-Language Representations
Authors: Bo Peng, Yuanwei Hu, Bo Liu, Ling Chen, Jie Lu, Zhen Fang
First: 2026-02-10T09:36:24+00:00 · Latest: 2026-02-10T09:36:24+00:00
Comments: ICLR26
Abstract
Spectral clustering is known as a powerful technique in unsupervised data analysis. The vast majority of approaches to spectral clustering are driven by a single modality, leaving the rich information in multi-modal representations untapped. Inspired by the recent success of vision-language pre-training, this paper extends spectral clustering from a single-modal to a multi-modal regime. In particular, we propose Neural Tangent Kernel Spectral Clustering, which leverages cross-modal alignment in pre-trained vision-language models. By anchoring the neural tangent kernel with positive nouns, i.e., those semantically close to the images of interest, we formulate the affinity between images as a coupling of their visual proximity and semantic overlap. We show that this formulation amplifies within-cluster connections while suppressing spurious ones across clusters, hence encouraging block-diagonal structures. In addition, we present a regularized affinity diffusion mechanism that adaptively ensembles affinity matrices induced by different prompts. Extensive experiments on 16 benchmarks, including classical, large-scale, fine-grained and domain-shifted datasets, show that our method consistently outperforms the state-of-the-art by a large margin.
中文标题/摘要
标题:探索基于视觉语言表示的谱聚类
谱聚类是一种强大的无监督数据分析技术。大多数谱聚类方法仅依赖单一模态,未能充分利用多模态表示中的丰富信息。受近期视觉语言预训练成功的启发,本文将谱聚类从单一模态扩展到多模态领域。特别地,我们提出了基于神经切线核的谱聚类方法,利用预训练视觉语言模型中的跨模态对齐。通过将神经切线核锚定在与目标图像语义接近的阳性名词上,我们将图像之间的亲和力形式化为视觉接近度和语义重叠的耦合。我们证明这种形式化可以放大簇内的连接,抑制跨簇的虚假连接,从而促进块对角结构。此外,我们提出了一种正则化亲和力扩散机制,该机制能够自适应地融合由不同提示诱导的亲和力矩阵。在包括经典、大规模、细粒度和领域转移数据集在内的16个基准测试中,我们的方法在所有测试集上均显著优于现有最佳方法。
Summary / 总结
This paper explores spectral clustering with vision-language representations, motivated by the underutilization of rich multi-modal information in traditional single-modal approaches. The authors propose Neural Tangent Kernel Spectral Clustering, which uses cross-modal alignment from pre-trained vision-language models. By focusing on positive nouns semantically related to images, they enhance the affinity between images based on both visual and semantic similarities. The method shows superior performance across 16 benchmarks, significantly outperforming existing state-of-the-art methods.
该论文探讨了使用视觉-语言表示的谱聚类方法,动机在于传统谱聚类方法中单模态信息的不足。提出了神经 tangent 核谱聚类方法,利用预训练视觉-语言模型的跨模态对齐。该方法基于视觉相似性和语义重叠来计算图像之间的亲和力,从而增强了内部簇的连接并抑制了跨簇的虚假连接。在16个基准数据集上的实验表明,该方法显著优于现有方法。
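The idea of coupling visual proximity with semantic overlap can be shown with a toy affinity matrix. The sketch below is deliberately not the paper's neural-tangent-kernel construction. It assumes image and noun embeddings in a shared space, turns image-to-noun cosine similarities into a distribution over nouns with a softmax, and multiplies visual cosine similarity by the overlap of those distributions. The function names and the temperature value are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def coupled_affinity(image_embs, noun_embs, temperature=0.1):
    """Affinity W[i][j] = visual similarity * overlap of noun distributions."""
    noun_probs = [
        softmax([cosine(img, noun) / temperature for noun in noun_embs])
        for img in image_embs
    ]
    size = len(image_embs)
    affinity = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            visual = max(0.0, cosine(image_embs[i], image_embs[j]))
            semantic = sum(p * q for p, q in zip(noun_probs[i], noun_probs[j]))
            affinity[i][j] = visual * semantic
    return affinity
```

In this toy setting, two images that look alike but match different nouns receive a small affinity, which is the kind of cross-cluster suppression that pushes the matrix toward a block-diagonal shape.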
Scalpel: Fine-Grained Alignment of Attention Activation Manifolds via Mixture Gaussian Bridges to Mitigate Multimodal Hallucination
Authors: Ziqiang Shi, Rujie Liu, Shanshan Yu, Satoshi Munakata, Koichi Shirahata
Venue: WACV 2026
First: 2026-02-10T08:53:43+00:00 · Latest: 2026-02-10T08:53:43+00:00
Comments: WACV 2026 (It was accepted in the first round, with an acceptance rate of 6%.)
Abstract
Rapid progress in large vision-language models (LVLMs) has achieved unprecedented performance in vision-language tasks. However, due to the strong prior of large language models (LLMs) and misaligned attention across modalities, LVLMs often generate outputs inconsistent with visual content, termed hallucination. To address this, we propose Scalpel, a method that reduces hallucination by refining attention activation distributions toward more credible regions. Scalpel predicts trusted attention directions for each head in Transformer layers during inference and adjusts activations accordingly. It employs a Gaussian mixture model to capture multi-peak distributions of attention in trust and hallucination manifolds, and uses entropic optimal transport (equivalent to the Schrödinger bridge problem) to map Gaussian components precisely. During mitigation, Scalpel dynamically adjusts intervention strength and direction based on component membership and mapping relationships between hallucination and trust activations. Extensive experiments across multiple datasets and benchmarks demonstrate that Scalpel effectively mitigates hallucinations, outperforming previous methods and achieving state-of-the-art performance. Moreover, Scalpel is model- and data-agnostic, requiring no additional computation, only a single decoding step.
中文标题/摘要
标题:Scalpel:通过混合高斯桥梁精细调整注意力激活流形以减轻多模态幻觉
大型视觉-语言模型(LVLM)在视觉-语言任务中取得了前所未有的性能。然而,由于大型语言模型(LLM)的强先验和跨模态注意力的不一致,LVLM经常生成与视觉内容不符的输出,称为幻觉。为了解决这一问题,我们提出了一种名为Scalpel的方法,通过将注意力激活分布细化到更可信的区域来减少幻觉。Scalpel在推理过程中为Transformer层中的每个头预测可信的注意力方向,并相应地调整激活。它使用高斯混合模型来捕捉信任和幻觉流形中多峰分布的注意力,并使用熵最优传输(等价于薛定谔桥问题)精确映射高斯成分。在减轻幻觉时,Scalpel根据成分成员关系和幻觉与信任激活之间的映射关系动态调整干预强度和方向。在多个数据集和基准上的广泛实验表明,Scalpel有效地减轻了幻觉,优于先前的方法,并实现了最先进的性能。此外,Scalpel对模型和数据具有普适性,不需要额外的计算,只需一个解码步骤。
Summary / 总结
Scalpel is a method designed to reduce hallucinations in vision-language models by refining attention activation distributions. It predicts trusted attention directions and adjusts activations using a Gaussian mixture model and entropic optimal transport. Scalpel dynamically adjusts intervention strength and direction, leading to effective hallucination mitigation across various datasets and benchmarks, outperforming previous methods and achieving state-of-the-art performance.
Scalpel 是一种通过细化注意力激活分布来减少视觉语言模型中幻觉的方法。它在推理过程中为每个 Transformer 层的头预测可信的注意力方向,并使用高斯混合模型和熵最优传输来调整激活。Scalpel 根据幻觉和信任激活之间的成分成员关系和映射关系动态调整干预强度和方向。实验表明,Scalpel 有效地减少了幻觉,并在多个数据集和基准测试中达到了最先进的性能。
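Entropic optimal transport between two sets of mixture components can be solved with Sinkhorn iterations. The sketch below is a generic illustration under stated assumptions, not Scalpel's actual procedure: components are reduced to one-dimensional means, the cost is the squared distance between means, and the regularisation strength and iteration count are arbitrary. The function names are hypothetical.

```python
import math

def squared_distance_cost(source_means, target_means):
    """Cost matrix between 1-D component means (an assumed, simplified cost)."""
    return [[(s - t) ** 2 for t in target_means] for s in source_means]

def sinkhorn_plan(source_weights, target_weights, cost, epsilon=2.0, iterations=200):
    """Entropic OT plan whose rows sum to source_weights and columns to target_weights."""
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    n, m = len(source_weights), len(target_weights)
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iterations):
        # Alternately rescale rows and columns to match the two marginals.
        u = [source_weights[i] / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [target_weights[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
```

Each row of the returned plan says how the mass of one source component (for example a hallucination-side mixture weight) is distributed over the target components, so the largest entry in a row picks out that component's main counterpart.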
DR.Experts: Differential Refinement of Distortion-Aware Experts for Blind Image Quality Assessment
Authors: Bohan Fu, Guanyi Qin, Fazhan Zhang, Zihao Huang, Mingxuan Li, Runze Hu
Venue: AAAI 2026
First: 2026-02-10T08:41:40+00:00 · Latest: 2026-02-10T08:41:40+00:00
Comments: Accepted by AAAI 2026
Abstract
Blind Image Quality Assessment (BIQA), which aims to replicate human perception of visual quality without a reference image, plays a key role in vision tasks, yet existing models often fail to capture subtle distortion cues, leading to a misalignment with human subjective judgments. We identify that the root cause of this limitation lies in the lack of reliable distortion priors: methods typically learn shallow relationships between unified image features and quality scores, which leaves them insensitive to distortions and limits their performance. To address this, we introduce DR.Experts, a novel prior-driven BIQA framework designed to explicitly incorporate distortion priors, enabling a reliable quality assessment. DR.Experts begins by leveraging a degradation-aware vision-language model to obtain distortion-specific priors, which are further refined and enhanced by the proposed Distortion-Saliency Differential Module through distinguishing them from semantic attentions, thereby ensuring the genuine representations of distortions. The refined priors, along with semantics and bridging representation, are then fused by a proposed mixture-of-experts style module named the Dynamic Distortion Weighting Module. This mechanism weights each distortion-specific feature according to its perceptual impact, ensuring that the final quality prediction aligns with human perception. Extensive experiments conducted on five challenging BIQA benchmarks demonstrate the superiority of DR.Experts over current methods and showcase its excellence in terms of generalization and data efficiency.
中文标题/摘要
标题:DR.Experts: 基于失真先验的失真感知专家细化盲图像质量评估框架
盲图像质量评估旨在无需参考图像的情况下模拟人类对视觉质量的感知,在视觉任务中发挥关键作用,但现有模型往往无法有效捕捉细微的失真线索,导致与人类主观判断不一致。我们发现这一局限性的根源在于缺乏可靠的失真先验,因为方法通常学习统一图像特征与质量评分之间的浅层关系,导致它们对失真不敏感,从而限制了性能。为解决这一问题,我们引入了DR.Experts,这是一种新颖的先验驱动的BIQA框架,旨在明确地结合失真先验,实现可靠的图像质量评估。DR.Experts首先利用失真感知的视觉-语言模型获取失真特定的先验,然后通过提出的失真显著性差异模块进一步细化和增强这些先验,通过区分它们与语义注意力,确保失真的真实表示。细化后的先验与语义和桥梁表示结合,通过提出的混合专家风格模块动态失真加权模块进行融合。该机制根据感知影响加权每个失真特定特征,确保最终的质量预测与人类感知一致。在五个具有挑战性的BIQA基准上的广泛实验表明,DR.Experts在性能上优于现有方法,并在泛化能力和数据效率方面表现出色。
Summary / 总结
The research aims to improve blind image quality assessment by addressing the limitation of existing models in capturing subtle distortion cues. DR.Experts, a novel framework, introduces a degradation-aware vision-language model to obtain distortion-specific priors, which are refined by the Distortion-Saliency Differential Module. These refined priors are then fused with semantic and bridging representations using the Dynamic Distortion Weighting Module to ensure accurate quality prediction. Experiments on five BIQA benchmarks show that DR.Experts outperforms current methods in terms of generalization and data efficiency.
论文针对现有盲图像质量评估(BIQA)模型在捕捉细微失真线索方面的不足,这往往导致与人类判断的不一致。为此,提出了DR.Experts框架,该框架通过降解感知的视觉-语言模型和失真相关性差异模块来引入失真先验。经过细化的先验与语义和桥梁表示相结合,使用动态失真加权模块确保最终的质量预测与人类感知一致。在五个BIQA基准上的实验显示,DR.Experts在泛化能力和数据效率方面优于现有方法。
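The mixture-of-experts style fusion described for DR.Experts can be illustrated with a minimal sketch. The abstract does not specify the gating function, so the softmax gate, the function names `softmax` and `fuse_experts`, and the toy feature vectors below are all assumptions made for illustration, not the authors' Dynamic Distortion Weighting Module.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_experts(expert_features, gate_logits):
    """Weighted sum of per-distortion feature vectors.

    expert_features: one feature vector (list of floats) per distortion type.
    gate_logits: one score per distortion type (hypothetical: in a real model
    these would come from a learned gating network).
    """
    weights = softmax(gate_logits)
    dim = len(expert_features[0])
    fused = [
        sum(w * feat[i] for w, feat in zip(weights, expert_features))
        for i in range(dim)
    ]
    return fused, weights

# Toy example: two distortion-specific features, with the gate favoring the first.
fused, weights = fuse_experts([[1.0, 0.0], [0.0, 1.0]], [2.0, 0.0])
print(weights, fused)
```

A softmax gate keeps the weights positive and summing to one, so the fused feature stays on the same scale as the individual expert features.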
Attention to details, logits to truth: visual-aware attention and logits enhancement to mitigate hallucinations in LVLMs
Authors: Jingyi Wang, Fei Li, Rujie Liu
First: 2026-02-10T08:26:50+00:00 · Latest: 2026-02-10T08:26:50+00:00
Abstract
Existing Large Vision-Language Models (LVLMs) exhibit insufficient visual attention, leading to hallucinations. To alleviate this problem, some previous studies adjust and amplify visual attention. These methods share a limitation: boosting attention for all visual tokens inevitably increases attention to task-irrelevant tokens. To tackle this challenge, we propose a training-free attentional intervention algorithm that enhances the attention of task-relevant tokens, based on the premise that task-relevant tokens generally exhibit high visual-textual similarity. Specifically, the vision-text cross-attention submatrices, which represent visual-textual correlations, are extracted to construct reweighting matrices that reallocate attention. In addition, to enhance the contribution of visual tokens, we inject visual attention values into beam search decoding to identify solutions with higher visual attention. Extensive experiments demonstrate that this method significantly reduces hallucinations across mainstream LVLMs while preserving the accuracy and coherence of generated content.
中文标题/摘要
标题:细节注意,从logits到真相:基于视觉感知的注意力增强以减轻LVLM中的幻觉
现有的大型视觉-语言模型(LVLMs)在视觉注意力方面存在不足,导致产生幻觉。为了解决这一问题,一些先前的研究调整并放大了视觉注意力。这些方法的一个局限性在于,增强所有视觉标记的注意力不可避免地增加了对任务无关标记的注意力。为应对这一挑战,我们提出了一种无需训练的注意力干预算法,基于任务相关标记通常表现出高视觉-文本相似性的论点,增强任务相关标记的注意力。具体而言,提取视觉-文本交叉注意力子矩阵,表示视觉-文本相关性,构建重新加权矩阵以重新分配注意力。此外,为了增强视觉标记的贡献,我们将视觉注意力值注入到束搜索解码中,以识别具有更高视觉注意力的解决方案。广泛的实验表明,该方法显著减少了主流LVLM中的幻觉,同时保持了生成内容的准确性和连贯性。
Summary / 总结
The paper addresses the issue of hallucinations in Large Vision-Language Models (LVLMs) due to insufficient visual attention. It proposes an attentional intervention algorithm that enhances the attention of task-relevant tokens without boosting attention for irrelevant tokens. By extracting vision-text cross-attention submatrices and using them to construct reweighting matrices, the method reallocates attention. Additionally, visual attention values are injected into the beam search decoding to improve the selection of solutions. Experiments show that this approach reduces hallucinations across various LVLMs while maintaining accuracy and coherence.
论文针对大型视觉-语言模型(LVLM)由于视觉注意力不足导致的幻觉问题,提出了一种无需训练的方法,通过提取视觉-文本交叉注意力子矩阵并重新分配注意力来增强任务相关token的注意力。此外,通过将视觉注意力值注入到束搜索解码中,以提高生成内容的相关性。实验表明,该方法有效减少了幻觉现象,同时保持了生成内容的准确性和连贯性。
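The idea of reallocating attention toward task-relevant visual tokens can be sketched as follows. The abstract does not give the reweighting formula, so the multiplicative rule `a_i * (1 + alpha * s_i)` followed by renormalization, the parameter `alpha`, and the function name `reweight_attention` are assumptions chosen for illustration.

```python
def reweight_attention(attn, similarity, alpha=1.0):
    """Boost attention on visual tokens with high visual-textual similarity.

    attn: attention weights over visual tokens (non-negative, sum to 1).
    similarity: per-token visual-textual similarity in [0, 1] (hypothetical:
    in the paper this signal is derived from vision-text cross-attention
    submatrices).
    alpha: boost strength (illustrative hyperparameter).
    """
    boosted = [a * (1.0 + alpha * s) for a, s in zip(attn, similarity)]
    total = sum(boosted)
    return [b / total for b in boosted]

# Toy example: four visual tokens, only the first is task-relevant.
new_attn = reweight_attention([0.25, 0.25, 0.25, 0.25], [1.0, 0.0, 0.0, 0.0])
print(new_attn)
```

Because the boost is selective and the result is renormalized, attention mass moves from low-similarity tokens to high-similarity ones instead of growing for every visual token.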
AdaptMMBench: Benchmarking Adaptive Multimodal Reasoning for Mode Selection and Reasoning Process
Authors: Xintong Zhang, Xiaowen Zhang, Jongrong Wu, Zhi Gao, Shilin Yan, Zhenxin Diao, Kunpeng Gao, Xuanyan Chen, Yuwei Wu, Yunde Jia, Qing Li
First: 2026-02-02T19:00:27+00:00 · Latest: 2026-02-10T07:57:54+00:00
Abstract
Adaptive multimodal reasoning has emerged as a promising frontier in Vision-Language Models (VLMs), aiming to dynamically modulate between tool-augmented visual reasoning and text reasoning to enhance both effectiveness and efficiency. However, existing evaluations rely on static difficulty labels and simplistic metrics, which fail to capture the dynamic nature of difficulty relative to varying model capacities. Consequently, they obscure the distinction between adaptive mode selection and general performance while neglecting fine-grained process analyses. In this paper, we propose AdaptMMBench, a comprehensive benchmark for adaptive multimodal reasoning across five domains: real-world, OCR, GUI, knowledge, and math, encompassing both direct perception and complex reasoning tasks. AdaptMMBench utilizes a Matthews Correlation Coefficient (MCC) metric to evaluate the selection rationality of different reasoning modes, isolating this meta-cognition ability by dynamically identifying task difficulties based on models' capability boundaries. Moreover, AdaptMMBench facilitates multi-dimensional process evaluation across key step coverage, tool effectiveness, and computational efficiency. Our evaluation reveals that while adaptive mode selection scales with model capacity, it notably decouples from final accuracy. Conversely, key step coverage aligns with performance, though tool effectiveness remains highly inconsistent across model architectures.
中文标题/摘要
标题:AdaptMMBench:评估自适应多模态推理以进行模式选择和推理过程
自适应多模态推理已成为视觉语言模型(VLMs)的一个有前景的研究前沿,旨在动态调节工具增强的视觉推理和文本推理之间的平衡,以提高效果和效率。然而,现有的评估依赖于静态难度标签和简单的度量标准,无法捕捉难度相对于不同模型能力的动态性质。因此,它们模糊了自适应模式选择与总体性能之间的区别,同时忽略了精细的过程分析。在本文中,我们提出了AdaptMMBench,这是一个跨五个领域(现实世界、OCR、GUI、知识和数学)的全面基准,涵盖了直接感知和复杂推理任务。AdaptMMBench 使用马修斯相关系数(MCC)度量来评估不同推理模式的选择合理性,通过动态识别任务难度来隔离这种元认知能力。此外,AdaptMMBench 促进了关键步骤覆盖率、工具有效性以及计算效率的多维度过程评估。我们的评估表明,虽然自适应模式选择随着模型能力的增加而扩展,但它明显与最终准确性脱钩。相反,关键步骤覆盖率与性能相关,尽管工具有效性在不同模型架构之间仍然高度不一致。
Summary / 总结
The paper introduces AdaptMMBench, a benchmark for evaluating adaptive multimodal reasoning in Vision-Language Models across five domains. It uses the Matthews Correlation Coefficient to assess the rationality of mode selection and includes metrics for process evaluation such as key step coverage and tool effectiveness. The study finds that adaptive mode selection improves with model capacity but is not directly correlated with final accuracy, while key step coverage is more aligned with performance, though tool effectiveness varies widely among different model architectures.
本文提出了AdaptMMBench,这是一个针对Vision-Language模型在五个领域中的适应性多模态推理的基准。该基准使用Matthews相关系数评估模式选择的合理性,并提供对推理过程的多维度分析。研究发现,适应性模式选择随着模型容量的增加而改善,但与最终准确性无直接关联,而关键步骤的覆盖率则与性能更一致,尽管不同模型架构的工具有效性差异很大。
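The Matthews Correlation Coefficient used to score mode selection can be computed directly from a binary confusion matrix. The sketch below uses the standard MCC formula. The label convention (1 for the tool-augmented visual mode, 0 for the text mode) and the way the reference labels are obtained are assumptions, since the abstract only says that task difficulty is identified from each model's capability boundary.

```python
import math

def matthews_corrcoef(y_true, y_pred):
    """Standard binary MCC; returns 0.0 when the denominator is zero."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Hypothetical labels: 1 = task needs the tool-augmented mode for this model.
needed_mode = [1, 1, 0, 0]
chosen_mode = [1, 0, 0, 0]
print(matthews_corrcoef(needed_mode, chosen_mode))
```

MCC ranges from -1 to 1 and stays informative when the two modes are imbalanced, which makes it a reasonable choice for judging selection rationality separately from final accuracy.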
OSI: One-step Inversion Excels in Extracting Diffusion Watermarks
Authors: Yuwei Chen, Zhenliang He, Jia Tang, Meina Kan, Shiguang Shan
First: 2026-02-10T07:43:16+00:00 · Latest: 2026-02-10T07:43:16+00:00
Abstract
Watermarking is an important mechanism for provenance and copyright protection of diffusion-generated images. Training-free methods, exemplified by Gaussian Shading, embed watermarks into the initial noise of diffusion models with negligible impact on the quality of generated images. However, extracting this type of watermark typically requires multi-step diffusion inversion to obtain precise initial noise, which is computationally expensive and time-consuming. To address this issue, we propose One-step Inversion (OSI), a significantly faster and more accurate method for extracting Gaussian Shading style watermarks. OSI reformulates watermark extraction as a learnable sign classification problem, which eliminates the need for precise regression of the initial noise. Then, we initialize the OSI model from the diffusion backbone and finetune it on synthesized noise-image pairs with a sign classification objective. In this manner, the OSI model is able to accomplish the watermark extraction efficiently in only one step. Our OSI substantially outperforms the multi-step diffusion inversion method: it is 20x faster, achieves higher extraction accuracy, and doubles the watermark payload capacity. Extensive experiments across diverse schedulers, diffusion backbones, and cryptographic schemes consistently show improvements, demonstrating the generality of our OSI framework.
中文标题/摘要
标题:OSI:一步反演在提取扩散水印方面表现出色
水印技术是保护由扩散生成图像的来源和版权的重要机制。无训练方法,如高斯阴影,能够在不影响生成图像质量的情况下将水印嵌入到扩散模型的初始噪声中。然而,提取这种类型的水印通常需要多步扩散反演以获得精确的初始噪声,这在计算上非常昂贵且耗时。为了解决这个问题,我们提出了一步反演(OSI),这是一种显著更快且更准确的提取高斯阴影风格水印的方法。OSI 将水印提取重新表述为可学习的符号分类问题,从而消除了对初始噪声精确回归的需要。然后,我们从扩散主干初始化 OSI 模型,并在合成噪声-图像对上使用符号分类目标进行微调。通过这种方式,OSI 模型能够在一步中高效地完成水印提取。我们的 OSI 显著优于多步扩散反演方法:它快 20 倍,提取准确度更高,并将水印承载能力翻倍。广泛的实验表明,无论是在不同的调度器、扩散主干还是加密方案下,我们的 OSI 框架都表现出改进,证明了其通用性。
Summary / 总结
The paper addresses the computational inefficiency of multi-step diffusion inversion for extracting watermarks from diffusion-generated images. It introduces One-step Inversion (OSI), which reformulates watermark extraction as a sign classification problem, enabling faster and more accurate extraction. Compared to multi-step diffusion inversion, OSI is 20 times faster, achieves higher extraction accuracy, and doubles the watermark payload capacity.
研究旨在提高从扩散生成图像中提取水印的效率和准确性。提出了一步反演(OSI)方法,将水印提取重新定义为可学习的符号分类问题,从而省去了多步扩散反演的需要。OSI从扩散主干初始化并针对合成的噪声-图像对进行微调,允许高效的一步提取。实验结果表明,OSI比多步扩散反演方法快20倍,更准确,并且可以将水印承载容量翻倍。
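Why sign classification is an easier target than regressing the initial noise can be shown with a toy sketch. It assumes, purely for illustration, that each watermark bit is carried by the sign of one element of the initial noise. The helpers `embed_bits`, `extract_bits`, and `bit_accuracy` are hypothetical and do not model the OSI network or the exact Gaussian Shading scheme.

```python
import random

def embed_bits(bits, rng):
    """Sample noise whose sign encodes each bit (1 -> positive, 0 -> negative)."""
    return [abs(rng.gauss(0.0, 1.0)) * (1.0 if b else -1.0) for b in bits]

def extract_bits(noise_estimate):
    """Recover bits by classifying the sign of each estimated noise element."""
    return [1 if z > 0 else 0 for z in noise_estimate]

def bit_accuracy(reference, recovered):
    """Fraction of bits recovered correctly."""
    return sum(a == b for a, b in zip(reference, recovered)) / len(reference)

rng = random.Random(0)
bits = [rng.randint(0, 1) for _ in range(64)]
noise = embed_bits(bits, rng)

# A badly scaled estimate is a poor regression result, yet every sign is intact.
rough_estimate = [0.3 * z for z in noise]
print(bit_accuracy(bits, extract_bits(rough_estimate)))
```

The extractor only needs the sign of each element to be right, so magnitude errors that would count against a regression objective do not affect the recovered payload.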
Listen to the Layers: Mitigating Hallucinations with Inter-Layer Disagreement
Authors: Koduvayur Subbalakshmi, Sabbir Hossain Ujjal, Venkata Krishna Teja Mangichetty, Nastaran Jamalipour Soofi
First: 2026-02-10T07:32:37+00:00 · Latest: 2026-02-10T07:32:37+00:00
Comments: Preprint, 23 pages, 13 tables, 12 figures
Abstract
Pretrained Large Language Models (LLMs) are prone to generating fluent yet factually incorrect text, a phenomenon known as hallucination, which undermines their reliability and utility in downstream tasks. We hypothesize that a generated text span's factuality is correlated with its representational instability across the model's internal layers. Based on this, we propose the CoCoA (Confusion and Consistency Aware) decoder, a novel, training-free decoding algorithm that mitigates hallucinations at inference time by listening to these signals in the middle layers. We propose two metrics to quantify this instability in the middle layers, and use them to penalize outputs that exhibit high internal confusion, thereby steering the model towards more internally consistent and factually grounded outputs. We further propose a self-information gated variant, CoCoA-SIG, that dynamically modulates this penalty to selectively target high-surprise, unstable generations. Extensive experiments on diverse tasks, including question answering, summarization, and code generation, demonstrate that CoCoA significantly improves factual correctness across multiple model families (e.g., Llama-3, Qwen-2.5, Mistral). By leveraging model-intrinsic signals, CoCoA offers an effective and broadly applicable method for enhancing the trustworthiness of LLMs at inference time, without requiring any model retraining.
中文标题/摘要
标题:倾听层次:通过层间分歧减轻幻觉
预训练大型语言模型(LLMs)容易生成流畅但事实错误的文本——这一现象被称为幻觉,这削弱了它们在下游任务中的可靠性和实用性。我们假设生成文本片段的事实性与其在模型内部各层中的表示不稳定性相关。基于此,我们提出了一种名为CoCoA(混淆和一致性感知)的解码器,这是一种新型的无需训练的解码算法,在推理时通过倾听中间层的这些信号来减轻幻觉。我们提出了两种度量中间层不稳定性的方法,并使用它们来惩罚表现出高内部混淆的输出,从而引导模型生成更一致且事实依据更充分的输出。我们还提出了一种自信息门控变体CoCoA-SIG,它可以动态调节这种惩罚,以选择性地针对高惊喜、不稳定的生成。在包括问答、总结和代码生成在内的多种任务上进行的广泛实验表明,CoCoA 显著提高了多个模型家族(例如Llama-3、Qwen-2.5、Mistral)的事实正确性。通过利用模型固有的信号,CoCoA 提供了一种有效且广泛适用的方法,在无需任何模型重训练的情况下增强LLMs在推理时的可信度。
Summary / 总结
The paper addresses the issue of hallucinations in large language models (LLMs) by proposing CoCoA (Confusion and Consistency Aware) decoder, which mitigates hallucinations at inference time using internal layer signals. CoCoA listens to the representational instability across layers to penalize outputs with high internal confusion, promoting factually grounded and consistent outputs. Experiments on various tasks show that CoCoA improves factual correctness across different model families.
论文通过提出CoCoA(混淆和一致性感知)解码器来解决大型语言模型(LLMs)中的幻觉问题,该解码器通过监听模型内部层的表征不稳定性来减轻幻觉。引入了两个指标来量化这种不稳定性,并对高内部混淆的输出施加惩罚。还提出了一个自信息门控变体CoCoA-SIG,以动态调整惩罚。实验表明,CoCoA在不同模型家族上提高了事实正确性,且无需对模型进行重新训练。
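The general shape of an instability-penalized decoding score can be sketched as below. The abstract does not define CoCoA's two metrics, so the variance-across-layers measure, the penalty weight `lam`, and the per-layer probabilities in the example are hypothetical stand-ins, not the authors' formulation.

```python
import math

def instability(layer_probs):
    """Population variance of a token's probability across middle layers
    (a hypothetical instability measure)."""
    mean = sum(layer_probs) / len(layer_probs)
    return sum((p - mean) ** 2 for p in layer_probs) / len(layer_probs)

def penalized_scores(final_logprobs, middle_layer_probs, lam=5.0):
    """Final-layer log-probability minus a penalty for inter-layer disagreement."""
    return {
        tok: final_logprobs[tok] - lam * instability(middle_layer_probs[tok])
        for tok in final_logprobs
    }

# Toy candidates: "Lyon" is slightly more likely at the final layer,
# but its probability swings widely across the middle layers.
final_logprobs = {"Paris": math.log(0.45), "Lyon": math.log(0.50)}
middle_layer_probs = {"Paris": [0.40, 0.42, 0.45], "Lyon": [0.05, 0.60, 0.20]}

scores = penalized_scores(final_logprobs, middle_layer_probs)
print(max(scores, key=scores.get))
```

With `lam=0` the ranking reduces to ordinary greedy selection on the final-layer distribution, so the penalty weight controls how strongly inter-layer disagreement overrides the model's top choice.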
ReaMOT: A Benchmark and Framework for Reasoning-based Multi-Object Tracking
Authors: Sijia Chen, Yanqiu Yu, En Yu, Wenbing Tao
First: 2025-05-26T17:55:19+00:00 · Latest: 2026-02-10T07:20:03+00:00
Comments: https://github.com/chen-si-jia/ReaMOT
Abstract
Referring Multi-Object Tracking (RMOT) aims to track targets specified by language instructions. However, existing RMOT paradigms are largely designed for explicit instructions and consequently fail to generalize to complex instructions that require logical reasoning. To overcome this, we propose Reasoning-based Multi-Object Tracking (ReaMOT), a novel task that requires models to identify and track targets that satisfy implicit constraints via logical reasoning. To advance this field, we construct the ReaMOT Challenge, a comprehensive benchmark comprising: (1) a large-scale dataset with 1,156 instructions categorized into High-Level Reasoning and Low-Level Perception, covering 423,359 image-language pairs across 869 diverse scenes; and (2) a tailored metric suite designed to jointly evaluate reasoning accuracy and tracking robustness. Furthermore, we propose ReaTrack, a training-free framework that synergizes the reasoning capabilities of a Thinking-variant Large Vision-Language Model (LVLM) with the precise temporal modeling of SAM2. Extensive experiments on the ReaMOT Challenge benchmark demonstrate the effectiveness of our ReaTrack framework.
中文标题/摘要
标题:ReaMOT:基于推理的多目标跟踪基准和框架
引用多目标跟踪(RMOT)旨在跟踪由语言指令指定的目标。然而,现有的RMOT范式主要针对明确的指令设计,因此无法很好地泛化到需要逻辑推理的复杂指令。为了解决这一问题,我们提出了基于推理的多目标跟踪(ReaMOT),这是一个新的任务,要求模型通过逻辑推理来识别和跟踪满足隐式约束的目标。为了推进这一领域,我们构建了ReaMOT挑战,这是一个全面的基准,包括:(1)一个大规模数据集,包含1,156条指令,分为高级推理和低级感知,覆盖423,359张图像-语言对,分布在869个不同的场景中;(2)一个定制的度量套件,旨在同时评估推理准确性和跟踪鲁棒性。此外,我们提出了ReaTrack,这是一个无需训练的框架,将Thinking变体大型视觉-语言模型(LVLM)的推理能力与SAM2的精确时间建模相结合。在ReaMOT挑战基准上的广泛实验表明了我们ReaTrack框架的有效性。
Summary / 总结
ReaMOT aims to address the limitations of existing Referring Multi-Object Tracking (RMOT) methods by introducing a new task that requires logical reasoning for implicit constraints. The proposed ReaTrack framework combines the reasoning capabilities of a Thinking-variant Large Vision-Language Model with the temporal precision of SAM2. The ReaMOT Challenge benchmark includes a large dataset with diverse scenes and a metric suite to evaluate reasoning and tracking performance, showing the effectiveness of ReaTrack.
ReaMOT 是一个新的基准和框架,旨在解决现有方法在处理需要逻辑推理的复杂指令时的局限性。它包含一个大规模的数据集,有1,156条指令和423,359个图像-语言对,并且有一套评价推理和跟踪效果的指标。ReaTrack 框架结合了 Thinking-variant 大型视觉-语言模型的推理能力和 SAM2 的时间建模能力,在 ReaMOT 挑战基准上显示出有效性。
SpotAgent: Grounding Visual Geo-localization in Large Vision-Language Models through Agentic Reasoning
Authors: Furong Jia, Ling Dai, Wenjin Deng, Fan Zhang, Chen Hu, Daxin Jiang, Yu Liu
First: 2026-02-10T06:57:12+00:00 · Latest: 2026-02-10T06:57:12+00:00
Abstract
Large Vision-Language Models (LVLMs) have demonstrated strong reasoning capabilities in geo-localization, yet they often struggle in real-world scenarios where visual cues are sparse, long-tailed, and highly ambiguous. Previous approaches, bound by internal knowledge, often fail to provide verifiable results, yielding confident but ungrounded predictions when faced with confounded evidence. To address these challenges, we propose SpotAgent, a framework that formalizes geo-localization into an agentic reasoning process that leverages expert-level reasoning to synergize visual interpretation with tool-assisted verification. SpotAgent actively explores and verifies visual cues by leveraging external tools (e.g., web search, maps) through a ReAct diagram. We introduce a 3-stage post-training pipeline starting with a Supervised Fine-Tuning (SFT) stage for basic alignment, followed by an Agentic Cold Start phase utilizing high-quality trajectories synthesized via a Multi-Agent framework, aiming to instill tool-calling expertise. Subsequently, the model's reasoning capabilities are refined through Reinforcement Learning. We propose a Spatially-Aware Dynamic Filtering strategy to enhance the efficiency of the RL stage by prioritizing learnable samples based on spatial difficulty. Extensive experiments on standard benchmarks demonstrate that SpotAgent achieves state-of-the-art performance, effectively mitigating hallucinations while delivering precise and verifiable geo-localization.
中文标题/摘要
标题:SpotAgent:通过代理推理在大型视觉语言模型中实现视觉地理定位
大型视觉语言模型(LVLMs)在地理定位方面展示了强大的推理能力,但在现实世界场景中,由于视觉线索稀疏、长尾且高度模糊,它们往往难以应对。以往的方法受限于内部知识,往往无法提供可验证的结果,在面对混淆证据时只能给出自信但缺乏依据的预测。为了解决这些挑战,我们提出SpotAgent框架,该框架将地理定位形式化为一个代理推理过程,利用专家级推理来协同视觉解释与工具辅助验证。SpotAgent通过ReAct图利用外部工具(例如网络搜索、地图)积极探索和验证视觉线索。我们引入了一个三阶段后训练管道,首先是一个监督微调(SFT)阶段以实现基本对齐,接着是一个代理冷启动阶段,利用多代理框架合成的高质量轨迹来培养工具调用技巧。随后,模型的推理能力通过强化学习进行精炼。我们提出了一种基于空间难度的动态过滤策略,以增强强化学习阶段的效率,优先学习具有挑战性的样本。在标准基准上的广泛实验表明,SpotAgent实现了最先进的性能,有效减少了幻觉,同时提供了精确且可验证的地理定位。
Summary / 总结
The research aims to improve the geo-localization capabilities of large vision-language models in real-world scenarios where visual cues are sparse and ambiguous. SpotAgent is proposed as a framework that formalizes geo-localization into an agentic reasoning process, leveraging external tools for verification. The method includes a 3-stage post-training pipeline: Supervised Fine-Tuning for basic alignment, Agentic Cold Start using high-quality trajectories, and Reinforcement Learning to refine reasoning. The model's performance is enhanced by a Spatially-Aware Dynamic Filtering strategy. Experiments show that SpotAgent outperforms existing methods, providing precise and verifiable geo-localization results.
SpotAgent通过将地理定位过程形式化为代理推理来解决大型视觉语言模型在现实世界中的局限性。它采用三阶段管道,包括监督微调、多代理框架下的冷启动以及强化学习。通过空间感知动态过滤策略增强模型的推理能力。实验结果表明,SpotAgent在现有方法中表现出色,提供精确且可验证的地理定位。
Look-Ahead and Look-Back Flows: Training-Free Image Generation with Trajectory Smoothing
Authors: Yan Luo, Henry Huang, Todd Y. Zhou, Mengyu Wang
First: 2026-02-10T06:34:47+00:00 · Latest: 2026-02-10T06:34:47+00:00
Abstract
Recent advances have reformulated diffusion models as deterministic ordinary differential equations (ODEs) through the framework of flow matching, providing a unified formulation for the noise-to-data generative process. Various training-free flow matching approaches have been developed to improve image generation through flow velocity field adjustment, eliminating the need for costly retraining. However, modifying the velocity field $v$ introduces errors that propagate through the full generation path, whereas adjustments to the latent trajectory $z$ are naturally corrected by the pretrained velocity network, reducing error accumulation. In this paper, we propose two complementary training-free latent-trajectory adjustment approaches, based on future and past velocity $v$ and latent trajectory $z$ information, that refine the generative path directly in latent space. The two trajectory smoothing schemes are \emph{Look-Ahead}, which averages the current and next-step latents using a curvature-gated weight, and \emph{Look-Back}, which smooths latents using an exponential moving average with decay. We demonstrate through extensive experiments and comprehensive evaluation metrics that the proposed training-free trajectory smoothing models substantially outperform various state-of-the-art models across multiple datasets including COCO17, CUB-200, and Flickr30K.
中文标题/摘要
标题:前瞻与回溯流程:无需训练的图像生成与轨迹平滑
近期进展通过流匹配框架将扩散模型重新表述为确定性的常微分方程(ODEs),提供了一种统一的噪声到数据生成过程的表述。开发了多种无需训练的流匹配方法,通过调整流速度场来改进图像生成,从而消除昂贵的重新训练需求。然而,修改速度场 $v$ 引入的误差会通过整个生成路径传播,而通过预训练的速度网络调整潜在轨迹 $z$ 则自然地减少了误差累积。在本文中,我们提出了两种基于未来和过去速度 $v$ 以及潜在轨迹 $z$ 信息的互补的无需训练的潜在轨迹调整方法,直接在潜在空间中细化生成路径。我们提出了两种无需训练的轨迹平滑方案:\emph{前瞻},使用曲率门控权重平均当前和下一步的潜在变量;\emph{回溯},使用衰减的指数移动平均平滑潜在变量。通过广泛的实验和全面的评估指标,我们证明了所提出的无需训练的轨迹平滑模型在包括COCO17、CUB-200和Flickr30K等多个数据集上显著优于各种最先进的模型。
Summary / 总结
This paper addresses the challenge of training-free image generation by proposing two methods, Look-Ahead and Look-Back, which refine the generative path directly in latent space using future and past velocity and latent trajectory information. The Look-Ahead method averages the current and next-step latents with a curvature-gated weight, while the Look-Back method smooths latents using an exponential moving average with decay. Experiments show that these methods significantly outperform state-of-the-art models on COCO17, CUB-200, and Flickr30K datasets.
本文提出两种互补的方法,Look-Ahead和Look-Back,通过使用未来和过去的速度及潜在轨迹信息直接在潜在空间中细化生成路径来解决训练-free 图像生成中的误差累积问题。Look-Ahead 方法使用曲率门控权重平均当前和下一步的潜在变量,而Look-Back 方法使用衰减的指数移动平均平滑潜在变量。实验表明,这些方法在COCO17、CUB-200和Flickr30K等数据集上在图像质量和生成准确性方面显著优于最先进的模型。
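The two smoothing schemes can be sketched on plain Python lists standing in for latent tensors. The exponential moving average follows the description of Look-Back directly. The Look-Ahead gate is not specified in the abstract, so the exponential gate on the change in velocity, the temperature `tau`, and the function names are assumptions for illustration only.

```python
import math

def look_back(latents, decay=0.5):
    """Exponential moving average over a sequence of latent vectors."""
    smoothed = [list(latents[0])]
    for z in latents[1:]:
        prev = smoothed[-1]
        smoothed.append([decay * p + (1.0 - decay) * x for p, x in zip(prev, z)])
    return smoothed

def look_ahead(z_cur, z_next, v_cur, v_next, tau=1.0):
    """Blend current and next-step latents with a curvature-gated weight.

    Hypothetical gate: curvature is approximated by the change in velocity
    between the two steps; the larger the change, the less weight the
    current latent receives.
    """
    curvature = math.sqrt(sum((a - b) ** 2 for a, b in zip(v_next, v_cur)))
    w = 0.5 * math.exp(-curvature / tau)  # lies in (0, 0.5]
    return [(1.0 - w) * zn + w * zc for zc, zn in zip(z_cur, z_next)]

# Toy one-dimensional trajectory.
print(look_back([[0.0], [1.0], [1.0]], decay=0.5))
print(look_ahead([0.0], [2.0], v_cur=[1.0], v_next=[1.0]))
```

Both functions modify only the latent $z$ and leave the velocity $v$ untouched, which matches the abstract's argument that latent adjustments are corrected by the pretrained velocity network at later steps.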
P1-VL: Bridging Visual Perception and Scientific Reasoning in Physics Olympiads
Authors: Yun Luo, Futing Wang, Qianjia Cheng, Fangchen Yu, Haodi Lei, Jianhao Yan, Chenxi Li, Jiacheng Chen, Yufeng Zhao, Haiyuan Wan, Yuchen Zhang, Shenghe Zheng, Junchi Yao, Qingyang Zhang, Haonan He, Wenxuan Zeng, Li Sheng, Chengxing Xie, Yuxin Zuo, Yizhuo Li, Yulun Wu, Rui Huang, Dongzhan Zhou, Kai Chen, Yu Qiao, Lei Bai, Yu Cheng, Ning Ding, Bowen Zhou, Peng Ye, Ganqu Cui
First: 2026-02-10T06:28:08+00:00 · Latest: 2026-02-10T06:28:08+00:00
Abstract
The transition from symbolic manipulation to science-grade reasoning represents a pivotal frontier for Large Language Models (LLMs), with physics serving as the critical test anchor for binding abstract logic to physical reality. Physics demands that a model maintain physical consistency with the laws governing the universe, a task that fundamentally requires multimodal perception to ground abstract logic in reality. At the Olympiad level, diagrams are often constitutive rather than illustrative, containing essential constraints, such as boundary conditions and spatial symmetries, that are absent from the text. To bridge this visual-logical gap, we introduce P1-VL, a family of open-source vision-language models engineered for advanced scientific reasoning. Our method harmonizes Curriculum Reinforcement Learning, which employs progressive difficulty expansion to stabilize post-training, with Agentic Augmentation, which enables iterative self-verification at inference. Evaluated on HiPhO, a rigorous benchmark of 13 exams from 2024-2025, our flagship P1-VL-235B-A22B becomes the first open-source Vision-Language Model (VLM) to secure 12 gold medals and achieves state-of-the-art performance among open-source models. Our agent-augmented system achieves the No. 2 overall rank globally, trailing only Gemini-3-Pro. Beyond physics, P1-VL demonstrates remarkable scientific reasoning capacity and generalizability, establishing significant leads over base models in STEM benchmarks. By open-sourcing P1-VL, we provide a foundational step toward general-purpose physical intelligence that better aligns visual perception with abstract physical laws for machine scientific discovery.
中文标题/摘要
标题:P1-VL:在物理奥林匹克竞赛中连接视觉感知与科学推理
从符号操作过渡到科学级别的推理是大型语言模型(LLMs)面临的关键前沿,而物理学则是将抽象逻辑与物理现实结合的关键测试锚点。物理学要求模型保持与支配宇宙的物理定律的一致性,这一任务本质上需要多模态感知来将抽象逻辑与现实联系起来。在奥林匹克竞赛级别,图表往往是构成性的而非说明性的,包含边界条件和空间对称性等关键约束,这些在文本中是不存在的。为了弥合这一视觉-逻辑差距,我们引入了P1-VL,这是一种专为高级科学推理设计的开源视觉-语言模型。我们的方法结合了课程强化学习,通过逐步增加难度来稳定训练后的表现,以及代理增强,使推理时能够进行迭代自我验证。在HiPhO这一严格的2024-2025年13场考试基准测试中,我们的旗舰模型P1-VL-235B-A22B成为首个开源视觉-语言模型,获得12枚金牌,并在开源模型中达到最先进的性能。我们的代理增强系统在全球排名中位列第二,仅次于Gemini-3-Pro。除了物理学,P1-VL展示了卓越的科学推理能力和泛化能力,在STEM基准测试中显著优于基础模型。通过开源P1-VL,我们为实现通用物理智能奠定了基础,以更好地将视觉感知与抽象物理定律对齐,促进机器科学发现。
Summary / 总结
The paper introduces P1-VL, a vision-language model designed for advanced scientific reasoning, particularly in physics. It combines Curriculum Reinforcement Learning and Agentic Augmentation to enhance model performance. Evaluated on HiPhO, P1-VL-235B-A22B secured 12 gold medals and achieved state-of-the-art performance in open-source models, ranking second globally behind Gemini-3-Pro. The model demonstrates strong generalizability in STEM benchmarks.
论文介绍了P1-VL,这是一种专门用于物理高级科学推理的视觉语言模型,特别是在奥林匹克级别。该模型结合了 Curriculum Reinforcement Learning 和 Agentic Augmentation,以增强模型的稳定性和自我验证能力。P1-VL-235B-A22B 在 HiPhO 基准测试中获得了 12 枚金牌,并在开源模型中达到了新的最先进水平。它在全球排名中位列第二,仅次于 Gemini-3-Pro,并在 STEM 基准测试中表现出强大的泛化能力。