cs.CL 方向,今日共计35篇
大模型相关(12篇)
【1】Curse of Knowledge: When Complex Evaluation Context Benefits yet Biases LLM Judges
标题 :知识的诅咒:当复杂的评估上下文有益却又使LLM评委产生偏见
链接 :
https://arxiv.org/abs/2509.03419
作者 :i, Xintao Wang, Siyu Yuan, Rui Xu, Jiangjie Chen, Qingqing Dong, Yanghua Xiao, Deqing Yang
备注 :8 pages, 4 figures, conference
摘要 :随着大型语言模型(LLM)的能力越来越强,它们面临的任务也日益多样和复杂,这使得可靠的评估越来越具有挑战性。LLM作为评委(LLM-as-a-judge)的范式已成为一种可扩展的解决方案,但以往工作主要集中在简单的设置上。它们在复杂任务中的可靠性——其中多方面的评分细则、非结构化的参考答案和细致入微的标准至关重要——仍然缺乏研究。在本文中,我们构建了ComplexEval,一个具有挑战性的基准,旨在系统地暴露和量化辅助信息诱导的偏差。我们在12个基本场景和3个高级场景中系统地研究并验证了6种此前未被探索的偏差。主要发现包括:(1)所有被评估的模型都对这些偏差表现出显著的敏感性,且偏差幅度随任务复杂度而增大;(2)值得注意的是,大型推理模型(LRM)表现出矛盾的脆弱性。我们的深入分析为提高评估信号的准确性和可验证性提供了重要见解,为构建更通用、更鲁棒的评估模型铺平了道路。
摘要 :As large language models (LLMs) grow more capable, they face increasingly diverse and complex tasks, making reliable evaluation challenging. The paradigm of LLMs as judges has emerged as a scalable solution, yet prior work primarily focuses on simple settings. Their reliability in complex tasks–where multi-faceted rubrics, unstructured reference answers, and nuanced criteria are critical–remains understudied. In this paper, we constructed ComplexEval, a challenge benchmark designed to systematically expose and quantify Auxiliary Information Induced Biases. We systematically investigated and validated 6 previously unexplored biases across 12 basic and 3 advanced scenarios. Key findings reveal: (1) all evaluated models exhibit significant susceptibility to these biases, with bias magnitude scaling with task complexity; (2) notably, Large Reasoning Models (LRMs) show paradoxical vulnerability. Our in-depth analysis offers crucial insights for improving the accuracy and verifiability of evaluation signals, paving the way for more general and robust evaluation models.
【2】LMEnt: A Suite for Analyzing Knowledge in Language Models from Pretraining Data to Representations
标题 :LMEnt:一个用于分析语言模型中知识(从预训练数据到表示)的套件
链接 :
https://arxiv.org/abs/2509.03405
作者 :ottesman, Alon Gilae-Dotan, Ido Cohen, Yoav Gur-Arieh, Marius Mosbach, Ori Yoran, Mor Geva
备注 :Submitted to TACL, August 2025
摘要 :语言模型(LM)越来越多地驱动需要世界知识的现实应用。然而,模型将数据转化为关于世界的知识与信念表示的内部过程仍然知之甚少。对这些过程的深入理解可以为开发知识表示更一致、更鲁棒、更完整的LM铺平道路。为了便于研究这些问题,我们提出了LMEnt,一个用于分析LM在预训练期间知识获取的套件。LMEnt包括:(1)一个基于维基百科、完整标注了实体提及的知识丰富的预训练语料库;(2)一种基于实体的预训练数据检索方法,性能比以往方法最高提升80.4%;(3)12个参数规模最高达1B、带有4K个中间检查点的预训练模型,其在知识基准上的性能与流行的开源模型相当。总之,这些资源提供了一个受控环境,用于分析预训练中的实体提及与下游表现之间的联系,以及对预训练数据进行因果干预的影响。我们通过研究跨检查点的知识获取来展示LMEnt的实用性,发现事实频率是关键因素,但并不能完全解释学习趋势。我们发布LMEnt以支持对LM中知识的研究,包括知识表示、可塑性、编辑、归因和学习动态。
摘要 :Language models (LMs) increasingly drive real-world applications that require world knowledge. However, the internal processes through which models turn data into representations of knowledge and beliefs about the world, are poorly understood. Insights into these processes could pave the way for developing LMs with knowledge representations that are more consistent, robust, and complete. To facilitate studying these questions, we present LMEnt, a suite for analyzing knowledge acquisition in LMs during pretraining. LMEnt introduces: (1) a knowledge-rich pretraining corpus, fully annotated with entity mentions, based on Wikipedia, (2) an entity-based retrieval method over pretraining data that outperforms previous approaches by as much as 80.4%, and (3) 12 pretrained models with up to 1B parameters and 4K intermediate checkpoints, with comparable performance to popular open-sourced models on knowledge benchmarks. Together, these resources provide a controlled environment for analyzing connections between entity mentions in pretraining and downstream performance, and the effects of causal interventions in pretraining data. We show the utility of LMEnt by studying knowledge acquisition across checkpoints, finding that fact frequency is key, but does not fully explain learning trends. We release LMEnt to support studies of knowledge in LMs, including knowledge representations, plasticity, editing, attribution, and learning dynamics.
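The entity-based retrieval component lends itself to a simple illustration. The sketch below is not the LMEnt implementation (the abstract does not describe its method or API); it only shows the generic idea of indexing entity-annotated pretraining documents so that every document mentioning a given set of entities can be pulled up. The `build_entity_index`/`retrieve` names and the toy corpus are hypothetical.

```python
# Illustrative sketch (not the LMEnt implementation): retrieving pretraining
# documents by annotated entity mentions via a simple inverted index.
from collections import defaultdict

def build_entity_index(corpus):
    """corpus: iterable of (doc_id, entity_list), e.g. from entity-annotated Wikipedia."""
    index = defaultdict(set)
    for doc_id, entities in corpus:
        for ent in entities:
            index[ent].add(doc_id)
    return index

def retrieve(index, query_entities, mode="all"):
    """Return doc ids mentioning all (or any) of the query entities."""
    sets = [index.get(ent, set()) for ent in query_entities]
    if not sets:
        return set()
    return set.intersection(*sets) if mode == "all" else set.union(*sets)

# toy usage
corpus = [("d1", ["Albert Einstein", "Zurich"]), ("d2", ["Zurich"]), ("d3", ["Albert Einstein"])]
idx = build_entity_index(corpus)
print(retrieve(idx, ["Albert Einstein", "Zurich"]))  # {'d1'}
```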
【3】Language Models Do Not Follow Occam's Razor: A Benchmark for Inductive and Abductive Reasoning
标题 :语言模型不遵循奥卡姆剃刀:归纳与溯因推理的基准
链接 :
https://arxiv.org/abs/2509.03345
作者 :n, Abulhair Saparov
摘要 :推理是人工智能系统的核心能力,大型语言模型(LLM)最近在这方面取得了显著进展。然而,大多数工作只关注演绎推理,这是有问题的,因为其他类型的推理在解决现实世界问题时同样必不可少,却很少被探索。这项工作重点评估LLM的归纳和溯因推理能力。我们介绍了一个可编程的合成数据集InAbHyD(发音为in-a-bid),其中每个推理示例由一个不完整的世界模型和一组观察结果组成。智能体的任务是在不完整世界模型下提出能够解释观察结果的假设,以解决每个推理示例。我们基于奥卡姆剃刀提出了一种评价假设质量的新度量。我们评估并分析了一些最先进的LLM。我们的分析表明,LLM可以在简单场景中进行归纳和溯因推理,但在面对复杂世界模型和需要产生高质量假设时会遇到困难,即使使用上下文学习和RLVR等流行的推理增强技术也是如此。
摘要 :Reasoning is a core capability in artificial intelligence systems, for which large language models (LLMs) have recently shown remarkable progress. However, most work focuses exclusively on deductive reasoning, which is problematic since other types of reasoning are also essential in solving real-world problems, and they are less explored. This work focuses on evaluating LLMs' inductive and abductive reasoning capabilities. We introduce a programmable and synthetic dataset, InAbHyD (pronounced in-a-bid), where each reasoning example consists of an incomplete world model and a set of observations. The task for the intelligent agent is to produce hypotheses to explain observations under the incomplete world model to solve each reasoning example. We propose a new metric to evaluate the quality of hypotheses based on Occam's Razor. We evaluate and analyze some state-of-the-art LLMs. Our analysis shows that LLMs can perform inductive and abductive reasoning in simple scenarios, but struggle with complex world models and producing high-quality hypotheses, even with popular reasoning-enhancing techniques such as in-context learning and RLVR.
【4】AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems?
标题 :AgenTracer:谁在LLM智能体系统中导致了失败?
链接 :
https://arxiv.org/abs/2509.03312
作者 :ang, Junhao Wang, Junjie Chen, Wangchunshu Zhou, Kun Wang, Shuicheng Yan
摘要 :基于大型语言模型(LLM)的智能体系统通常包括多个模型、复杂的工具调用和编排协议,其表现大大优于单体智能体。然而,正是这种复杂性放大了它们的脆弱性,使其更容易出现系统故障。在冗长的执行轨迹中准确定位导致错误的特定智能体或步骤,就是智能体系统故障归因这一任务。然而,目前最先进的推理LLM在这一挑战上仍然明显不足,准确率通常低于10%。为了解决这一差距,我们提出了AgenTracer,这是第一个通过反事实重放和程序化故障注入来标注失败多智能体轨迹的自动化框架,并由此构建了精选数据集TracerTraj。利用这一资源,我们开发了AgenTracer-8B,一个经过多粒度强化学习训练的轻量级故障追踪器,能够高效地诊断冗长多智能体交互中的错误。在Who&When基准测试中,AgenTracer-8B的表现比Gemini-2.5-Pro和Claude-4-Sonnet等大型专有LLM高出多达18.18%,为LLM智能体故障归因设定了新标准。更重要的是,AgenTracer-8B能为MetaGPT和MaAS等现成的多智能体系统提供可操作的反馈,带来4.8-14.2%的性能提升,使其能够自我纠正和自我进化。
摘要 :Large Language Model (LLM)-based agentic systems, often comprising multiple models, complex tool invocations, and orchestration protocols, substantially outperform monolithic agents. Yet this very sophistication amplifies their fragility, making them more prone to system failure. Pinpointing the specific agent or step responsible for an error within long execution traces defines the task of agentic system failure attribution. Current state-of-the-art reasoning LLMs, however, remain strikingly inadequate for this challenge, with accuracy generally below 10%. To address this gap, we propose AgenTracer, the first automated framework for annotating failed multi-agent trajectories via counterfactual replay and programmed fault injection, producing the curated dataset TracerTraj. Leveraging this resource, we develop AgenTracer-8B, a lightweight failure tracer trained with multi-granular reinforcement learning, capable of efficiently diagnosing errors in verbose multi-agent interactions. On the Who&When benchmark, AgenTracer-8B outperforms giant proprietary LLMs like Gemini-2.5-Pro and Claude-4-Sonnet by up to 18.18%, setting a new standard in LLM agentic failure attribution. More importantly, AgenTracer-8B delivers actionable feedback to off-the-shelf multi-agent systems like MetaGPT and MaAS with 4.8-14.2% performance gains, empowering self-correcting and self-evolving agentic AI.
【5】Domain Adaptation of LLMs for Process Data
标题 :过程数据的LLM域适应
链接 :
https://arxiv.org/abs/2509.03161
作者 :idi Oyamada, Jari Peeperkorn, Jochen De Weerdt, Johannes De Smedt
摘要 :近年来,大型语言模型(LLM)已成为包括过程挖掘(PM)在内的各个研究领域备受关注的方向。PM中的现有应用主要集中在提示工程策略,或将事件日志转换为叙述式数据集,从而利用LLM的语义能力来解决不同的任务。相比之下,本研究探讨了在无需自然语言改写的情况下,将预训练LLM直接适配到过程数据上,其动机在于这些模型擅长生成令牌序列,这与PM中的目标相似。更具体地说,我们专注于参数高效微调技术,以减轻通常与此类模型相关的计算开销。我们的实验设置侧重于预测性过程监控(PPM),并同时考虑单任务和多任务预测。结果表明,与最先进的循环神经网络(RNN)方法和近期基于叙事风格的解决方案相比,预测性能有潜在提升,在多任务设置中尤为明显。此外,我们的微调模型收敛更快,所需的超参数优化也显著减少。
摘要 :In recent years, Large Language Models (LLMs) have emerged as a prominent area of interest across various research domains, including Process Mining (PM). Current applications in PM have predominantly centered on prompt engineering strategies or the transformation of event logs into narrative-style datasets, thereby exploiting the semantic capabilities of LLMs to address diverse tasks. In contrast, this study investigates the direct adaptation of pretrained LLMs to process data without natural language reformulation, motivated by the fact that these models excel in generating sequences of tokens, similar to the objective in PM. More specifically, we focus on parameter-efficient fine-tuning techniques to mitigate the computational overhead typically associated with such models. Our experimental setup focuses on Predictive Process Monitoring (PPM), and considers both single- and multi-task predictions. The results demonstrate a potential improvement in predictive performance over state-of-the-art recurrent neural network (RNN) approaches and recent narrative-style-based solutions, particularly in the multi-task setting. Additionally, our fine-tuned models exhibit faster convergence and require significantly less hyperparameter optimization.
【6】From Evaluation to Defense: Constructing Persistent Edit-Based Fingerprints for Large Language Models
标题 :从评估到防御:为大型语言模型构建持久的基于编辑的指纹
链接 :
https://arxiv.org/abs/2509.03122
作者 :in Yi, Dongsheng Shi, Yongyi Cui, Gerard de Melo, Xiaoling Wang, Linlin Wang
备注 :preprint
摘要 :大型语言模型(LLM)的知识产权(IP)保护日益重要。通过指令调优将专用指纹注入LLM是一种常见的IP保护技术。然而,这可能会显著降低模型性能,需要大量计算资源,并且在模型被修改后持久性较差。我们认为,知识编辑提供了一种更适合指纹注入的轻量级替代方案。因此,我们首次将知识编辑应用于指纹注入,并展示了其强大的能力。尽管使用打乱的文本作为指纹以防止其在微调期间被覆盖,但在大规模微调下指纹仍会退化。为了解决这个问题,我们提出了指纹子空间感知微调(FSFT),它通过约束指纹子空间的更新来减少指纹退化。即使在最坏的情况下,FSFT的性能也比普通微调高出10%。此外,我们观察到,注入指纹的模型由于指纹与相似文本的特征高度相近,难以将二者区分开。这一发现凸显了为LLM开发更鲁棒、更细粒度的指纹注入方法的迫切需求。
摘要 :The intellectual property (IP) protection of Large Language Models (LLMs) is increasingly critical. Injecting specialized fingerprints into LLMs through instruction tuning is a common IP protection technique. However, this may significantly degrade model performance, requires substantial computational resources, and exhibits poor persistence under model modifications. We argue that knowledge editing offers a lightweight alternative that is more suitable for fingerprint injection. Accordingly, we apply knowledge editing to fingerprint injection for the first time and demonstrate its strong capability. Despite using scrambled text as fingerprints to prevent them from being overwritten during fine-tuning, degradation still occurs under large-scale fine-tuning. To address this, we propose Fingerprint Subspace-aware Fine-Tuning (FSFT), which reduces fingerprint degradation by constraining the update of the fingerprint subspace. The performance of FSFT exceeds fine-tuning by 10% even in the worst-case scenario. Additionally, we observe that the fingerprint-injected models struggle to distinguish between fingerprints and similar texts due to the high similarity of their features. This finding underscores the urgent need for more robust and fine-grained fingerprinting injection methods for LLMs.
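The abstract says FSFT constrains updates so that fine-tuning does not disturb the fingerprint subspace. A minimal sketch of that general idea, assuming the protected subspace is already available as an orthonormal basis `U` (the paper's actual construction of the subspace is not given here), is to project each gradient onto the orthogonal complement before the optimizer step:

```python
# Minimal sketch of subspace-aware fine-tuning: project each gradient so the
# update stays orthogonal to a protected "fingerprint" subspace. The basis U
# below is a stand-in; the paper's construction of the subspace may differ.
import torch

def project_out_subspace(grad: torch.Tensor, U: torch.Tensor) -> torch.Tensor:
    """grad: (d_out, d_in) gradient of a weight matrix.
    U: (d_in, k) orthonormal basis of the protected input subspace."""
    # Remove the component of each gradient row lying in span(U):
    # grad <- grad - (grad @ U) @ U^T
    return grad - grad @ U @ U.T

# toy training step
d_out, d_in, k = 8, 16, 3
W = torch.nn.Parameter(torch.randn(d_out, d_in))
U, _ = torch.linalg.qr(torch.randn(d_in, k))  # stand-in fingerprint basis
loss = (W.sum()) ** 2
loss.backward()
with torch.no_grad():
    W.grad = project_out_subspace(W.grad, U)
    W -= 1e-2 * W.grad  # SGD step that leaves the span(U) directions untouched
```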
【7】Measuring Scalar Constructs in Social Science with LLMs
标题 :利用LLM测量社会科学中的标量构念
链接 :
https://arxiv.org/abs/2509.03116
作者 :ht, Rupak Sarkar, Patrick Y. Wu, Pranav Goel, Niklas Stoehr, Elliott Ash, Alexander Miserlis Hoyle
备注 :Accepted to EMNLP 2025 (Main)
摘要 :许多刻画语言的构念(如复杂性或情感性)具有天然连续的语义结构;一篇公开演讲不只是“简单”或“复杂”,而是处于两个极端之间的连续体上。虽然大型语言模型(LLM)是测量标量构念的有吸引力的工具,但它们处理数值输出的特殊方式带来了如何最佳应用它们的问题。我们通过对社会科学中基于LLM的标量构念测量方法进行全面评估来回答这些问题。我们使用来自政治学文献的多个数据集,评估了四种方法:未加权的直接逐点评分、成对比较的聚合、令牌概率加权的逐点评分,以及微调。我们的研究为应用研究人员提供了可操作的发现。首先,提示LLM直接从文本生成逐点分数会产生不连续的分布,并在任意数字处堆积。由LLM进行成对比较可以提高测量质量,但对逐点分数按令牌概率加权带来的提升更大。最后,用少至1,000个训练对微调较小的模型,就可以达到或超过提示式LLM的性能。
摘要 :Many constructs that characterize language, like its complexity or emotionality, have a naturally continuous semantic structure; a public speech is not just “simple” or “complex,” but exists on a continuum between extremes. Although large language models (LLMs) are an attractive tool for measuring scalar constructs, their idiosyncratic treatment of numerical outputs raises questions of how to best apply them. We address these questions with a comprehensive evaluation of LLM-based approaches to scalar construct measurement in social science. Using multiple datasets sourced from the political science literature, we evaluate four approaches: unweighted direct pointwise scoring, aggregation of pairwise comparisons, token-probability-weighted pointwise scoring, and finetuning. Our study yields actionable findings for applied researchers. First, LLMs prompted to generate pointwise scores directly from texts produce discontinuous distributions with bunching at arbitrary numbers. The quality of the measurements improves with pairwise comparisons made by LLMs, but it improves even more by taking pointwise scores and weighting them by token probability. Finally, finetuning smaller models with as few as 1,000 training pairs can match or exceed the performance of prompted LLMs.
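Of the four approaches, token-probability-weighted pointwise scoring is the easiest to illustrate: instead of keeping the single score token the LLM emits, read out the probabilities of all candidate score tokens at that position and take their expectation. The sketch below assumes the scoring API exposes per-token log-probabilities as a plain dict; the function name is illustrative.

```python
# Sketch of token-probability-weighted pointwise scoring: expected score over
# the candidate score tokens, rather than the single argmax token.
import math

def expected_score(score_token_logprobs: dict) -> float:
    """score_token_logprobs: e.g. {"1": -3.2, "2": -1.0, ...} at the score position."""
    probs = {tok: math.exp(lp) for tok, lp in score_token_logprobs.items()}
    total = sum(probs.values())  # renormalize over the score vocabulary only
    return sum(int(tok) * p / total for tok, p in probs.items())

print(expected_score({"1": -3.2, "2": -1.0, "3": -0.8, "4": -2.5}))  # ~2.5, not a hard 3
```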
【8】Structure-Learnable Adapter Fine-Tuning for Parameter-Efficient Large Language Models
标题 :参数高效的大型语言模型的结构可学习适配器微调
链接 :
https://arxiv.org/abs/2509.03057
作者 :, Yingnan Deng, Nia Qi, Yujun Zou, Zhihao Xue, Yun Zi
摘要 :本文讨论了大型语言模型微调中的参数冗余、刚性结构和有限的任务适应性等问题。它提出了一种基于适配器的微调方法建立在结构学习机制。通过引入可微门函数和结构稀疏控制变量,该方法能够自动优化适配器插入点,激活路径和模块组合。这使得模型能够在多任务设置中灵活地调整其结构,以匹配不同的任务特征。在保持主干参数冻结的情况下,该方法使用结构搜索机制来指导训练过程中特定于任务的有效子结构的动态构建。这大大提高了参数利用率和代表能力。此外,本文还设计了一组灵敏度分析实验,系统地评估了稀疏权重、噪声注入率和数据扰动对模型性能的影响。这些实验验证了所提出的方法在各种多任务自然语言理解任务中的稳定性和鲁棒性。实验结果表明,该方法优于主流的参数有效的调整技术在多个任务。它在精度、压缩率和对噪声和扰动的鲁棒性之间实现了更好的平衡。
摘要 :This paper addresses the issues of parameter redundancy, rigid structure, and limited task adaptability in the fine-tuning of large language models. It proposes an adapter-based fine-tuning method built on a structure-learnable mechanism. By introducing differentiable gating functions and structural sparsity control variables, the method enables automatic optimization of adapter insertion points, activation paths, and module combinations. This allows the model to adjust its structure flexibly in multi-task settings to match different task characteristics. With the backbone parameters kept frozen, the method uses a structure search mechanism to guide the dynamic construction of task-specific efficient substructures during training. This significantly improves parameter utilization and representational capacity. In addition, the paper designs a set of sensitivity analysis experiments to systematically evaluate the effects of sparsity weight, noise injection ratio, and data perturbation on model performance. These experiments verify the stability and robustness of the proposed method across various multi-task natural language understanding tasks. The experimental results show that the proposed method outperforms mainstream parameter-efficient tuning techniques on multiple tasks. It achieves a better balance among accuracy, compression rate, and robustness to noise and perturbation.
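A rough sketch of the gating idea described above, assuming a standard bottleneck adapter with one learnable sigmoid gate per insertion point and an L1-style penalty that drives unused gates toward zero (the paper's exact gating function, sparsity control variables, and structure-search procedure may differ):

```python
# Minimal sketch of a gated bottleneck adapter with a structural sparsity penalty.
import torch
import torch.nn as nn

class GatedAdapter(nn.Module):
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.gate_logit = nn.Parameter(torch.zeros(1))  # sigmoid(0) = 0.5 at init

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        gate = torch.sigmoid(self.gate_logit)            # differentiable on/off switch
        return hidden + gate * self.up(torch.relu(self.down(hidden)))

def sparsity_penalty(adapters, weight: float = 1e-3) -> torch.Tensor:
    # pushes unused insertion points toward gate ~ 0
    return weight * sum(torch.sigmoid(a.gate_logit) for a in adapters).squeeze()

adapters = [GatedAdapter(768) for _ in range(12)]        # e.g. one per frozen backbone layer
x = torch.randn(2, 16, 768)
loss = adapters[0](x).pow(2).mean() + sparsity_penalty(adapters)
loss.backward()
```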
【9】Training LLMs to be Better Text Embedders through Bidirectional Reconstruction
标题 :通过双向重建训练LLM成为更好的文本嵌入模型
链接 :
https://arxiv.org/abs/2509.03020
作者 : Dengliang Shi, Siyuan Huang, Jintao Du, Changhua Meng, Yu Cheng, Weiqiang Wang, Zhouhan Lin
备注 :accepted by EMNLP 2025 Main Conference
摘要 :大型语言模型(LLM)作为强大的文本嵌入模型正被越来越多地探索。现有基于LLM的文本嵌入方法通常利用最后一个令牌的嵌入,该令牌通常是保留的特殊令牌,如[EOS]。然而,这些令牌并没有被有意训练去捕获整个上下文的语义,限制了它们作为文本嵌入的能力,尤其是在检索和重排序任务中。我们提议在对比学习之前增加一个新的训练阶段,以丰富最终令牌嵌入的语义。该阶段采用双向生成式重建任务,即EBQ2D(基于嵌入的查询到文档)和EBD2Q(基于嵌入的文档到查询),二者交替进行,以[EOS]嵌入为锚,重建查询-文档对的另一侧。实验结果表明,这一额外训练阶段显著提升了LLM在海量文本嵌入基准(MTEB)上的性能,在不同的LLM基础模型和规模上取得了新的最先进结果。
摘要 :Large language models (LLMs) have increasingly been explored as powerful text embedders. Existing LLM-based text embedding approaches often leverage the embedding of the final token, typically a reserved special token such as [EOS]. However, these tokens have not been intentionally trained to capture the semantics of the whole context, limiting their capacity as text embeddings, especially for retrieval and re-ranking tasks. We propose to add a new training stage before contrastive learning to enrich the semantics of the final token embedding. This stage employs bidirectional generative reconstruction tasks, namely EBQ2D (Embedding-Based Query-to-Document) and EBD2Q (Embedding-Based Document-to-Query), which interleave to anchor the [EOS] embedding and reconstruct either side of Query-Document pairs. Experimental results demonstrate that our additional training stage significantly improves LLM performance on the Massive Text Embedding Benchmark (MTEB), achieving new state-of-the-art results across different LLM base models and scales.
【10】English Pronunciation Evaluation without Complex Joint Training: LoRA Fine-tuned Speech Multimodal LLM
标题 :无需复杂联合训练的英语发音评估:LoRA微调的语音多模态LLM
链接 :
https://arxiv.org/abs/2509.02915
作者 :Ahn, Hosung Nam
摘要 :该研究表明,通过低秩自适应(LoRA)适配的多模态大型语言模型(MLLM)可以同时执行自动发音评估(APA)与发音错误检测和诊断(MDD)。利用Microsoft的Phi-4-multimodal-instruct,我们的微调方法免去了这两类不同任务传统上所需的复杂架构改动或单独训练流程。在Speechocean762数据集上微调后,该模型预测的发音评估分数与人工评分表现出较强的Pearson相关系数(PCC > 0.7),同时实现了较低的单词错误率(WER)和音素错误率(PER)(均
摘要 :This study demonstrates that a Multimodal Large Language Model (MLLM) adapted via Low-Rank Adaptation (LoRA) can perform both Automatic Pronunciation Assessment (APA) and Mispronunciation Detection and Diagnosis (MDD) simultaneously. Leveraging Microsoft's Phi-4-multimodal-instruct, our fine-tuning method eliminates the need for complex architectural changes or separate training procedures conventionally required for these distinct tasks. Fine-tuned on the Speechocean762 dataset, the pronunciation evaluation scores predicted by the model exhibited a strong Pearson Correlation Coefficient (PCC > 0.7) with human-assigned scores, while achieving low Word Error Rate (WER) and Phoneme Error Rate (PER) (both
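A hedged sketch of the parameter-efficient setup, using the Hugging Face peft library: attach LoRA adapters to the pretrained backbone and train only those. The checkpoint id, `target_modules`, loading details, and hyperparameters below are illustrative assumptions rather than the paper's reported configuration, and the real pipeline additionally handles audio inputs and task prompts.

```python
# Hedged sketch: LoRA adapters on a pretrained multimodal LLM; only the
# adapters are trained. Hyperparameters and module names are assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct",  # assumed checkpoint name
    trust_remote_code=True,
)
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # assumption: attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of the backbone
# Training would then feed (audio, prompt) -> score / phoneme-sequence targets
# from Speechocean762 through the usual causal-LM loss.
```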
【11】IDEAlign: Comparing Large Language Models to Human Experts in Open-ended Interpretive Annotations
标题 :IDEAlign:将大型语言模型与开放式解释注释中的人类专家进行比较
链接 :
https://arxiv.org/abs/2509.02855
作者 :m, Lucia Langlois, James Malamut, Mei Tan, Dorottya Demszky
备注 :10 pages, 9 pages for appendix
摘要 :大型语言模型(LLM)越来越多地被应用于开放式的解释性注释任务,例如研究人员的主题分析或教师对学生作业的反馈。这些任务涉及自由文本注释,需要基于特定目标(例如研究问题或教学目标)的专家级判断。大规模评估LLM生成的注释是否与人类专家的注释一致颇具挑战,目前还没有经过验证、可扩展的想法相似性度量。在本文中,我们(i)将可扩展地评估LLM的解释性注释确立为一项关键且研究不足的任务,(ii)提出IDEAlgin,一种直观的基准范式,通过“挑出异类”(pick-the-odd-one-out)三元组判断任务来捕获专家的相似性评级,以及(iii)以这些人类基准为参照,评估各种相似性度量,包括基于向量的方法(主题模型、嵌入)和通过IDEAlgin实现的LLM作为评委。将这一方法应用于两个真实世界的教育数据集(解释性分析和反馈生成),我们发现基于向量的指标在很大程度上无法捕捉对专家有意义的细微相似性维度。与传统的基于词汇和向量的指标相比,通过IDEAlgin提示LLM显著提高了与专家判断的一致性(提升9-30%)。这些结果确立了IDEAlgin作为大规模评估LLM开放式专家注释的有前途的范式,可为LLM在教育及其他领域的负责任部署提供参考。
摘要 :Large language models (LLMs) are increasingly applied to open-ended, interpretive annotation tasks, such as thematic analysis by researchers or generating feedback on student work by teachers. These tasks involve free-text annotations requiring expert-level judgments grounded in specific objectives (e.g., research questions or instructional goals). Evaluating whether LLM-generated annotations align with those generated by expert humans is challenging to do at scale, and currently, no validated, scalable measure of similarity in ideas exists. In this paper, we (i) introduce the scalable evaluation of interpretive annotation by LLMs as a critical and understudied task, (ii) propose IDEAlgin, an intuitive benchmarking paradigm for capturing expert similarity ratings via a “pick-the-odd-one-out” triplet judgment task, and (iii) evaluate various similarity metrics, including vector-based ones (topic models, embeddings) and LLM-as-a-judge via IDEAlgin, against these human benchmarks. Applying this approach to two real-world educational datasets (interpretive analysis and feedback generation), we find that vector-based metrics largely fail to capture the nuanced dimensions of similarity meaningful to experts. Prompting LLMs via IDEAlgin significantly improves alignment with expert judgments (9-30% increase) compared to traditional lexical and vector-based metrics. These results establish IDEAlgin as a promising paradigm for evaluating LLMs against open-ended expert annotations at scale, informing responsible deployment of LLMs in education and beyond.
【12】Clustering Discourses: Racial Biases in Short Stories about Women Generated by Large Language Models
标题 :话语聚类:大型语言模型生成的关于女性的短篇小说中的种族偏见
链接 :
https://arxiv.org/abs/2509.02834
作者 :onil, João Gondim, Marina dos Santos, Simone Hashiguti, Helena Maia, Nadia Silva, Helio Pedrini, Sandra Avila
备注 :12 pages, 3 figures. Accepted at STIL @ BRACIS 2025
摘要 :本研究探讨如何大的语言模型,特别是LLaMA 3.2-3B,构建在葡萄牙语产生的短篇小说中的黑人和白人妇女的叙事。从2100个文本中,我们应用计算方法对语义类似的故事进行分组,以便进行定性分析。三个主要的话语表征出现:社会克服,祖先神话和主观自我实现。分析揭示了语法连贯,看似中性的文本如何具体化一个结晶,殖民地结构框架的女性身体,加强历史的不平等。该研究提出了一种综合方法,将机器学习技术与定性的手动话语分析相结合。
摘要 :This study investigates how large language models, in particular LLaMA 3.2-3B, construct narratives about Black and white women in short stories generated in Portuguese. From 2100 texts, we applied computational methods to group semantically similar stories, allowing a selection for qualitative analysis. Three main discursive representations emerge: social overcoming, ancestral mythification and subjective self-realization. The analysis uncovers how grammatically coherent, seemingly neutral texts materialize a crystallized, colonially structured framing of the female body, reinforcing historical inequalities. The study proposes an integrated approach, that combines machine learning techniques with qualitative, manual discourse analysis.
Transformer(2篇)
【1】Continuous Saudi Sign Language Recognition: A Vision Transformer Approach
标题 :连续沙特手语识别:视觉Transformer方法
链接 :
https://arxiv.org/abs/2509.03467
作者 :Elhassen, Lama Al Khuzayem, Areej Alhothali, Ohoud Alzamzami, Nahed Alowaidi
备注 :23 pages, 13 figures, 5 tables
摘要 :手语(SL)是听障和聋人的一种重大沟通形式,使他们能够参与更广泛的社会。尽管手语具有重大意义,但公众对手语的认识有限,往往导致教育和职业机会的不平等,从而加剧了社会排斥,特别是在沙特阿拉伯,超过84,000人依赖沙特手语作为其主要交流形式。虽然某些技术方法有助于改善听力障碍者的沟通,但依旧迫切需要更准确和可靠的翻译技术,特别是对于SSL等阿拉伯手语变体。大多数最先进的解决方案主要侧重于非阿拉伯语手语,导致专门用于阿拉伯语手语,特别是SSL的资源严重缺乏。阿拉伯语的复杂性和孤立的手语数据集的普遍性,聚焦在单个单词,而不是连续的语音促成了这个问题。为了解决这一差距,我们的研究代表了开发SSL资源的重大一步。为了解决这个问题,我们引入了第一个连续的沙特手语数据集KAU-CSSL,专注于完整的句子,以促进进一步的研究,并为SSL识别和翻译提供复杂的识别系统。此外,我们提出了一个基于transformer的模型,利用预训练的ResNet-18进行空间特征提取,并使用具有双向LSTM的Transformer编码器进行时间依赖性,在签名者相关模式下达到99.02%的准确率,在签名者独立模式下达到77.71%的准确率。这一发展不仅为SSL社区改善了通信工具,而且为更广泛的手语领域做出了重大贡献。
摘要 :Sign language (SL) is an essential communication form for hearing-impaired and deaf people, enabling engagement within the broader society. Despite its significance, limited public awareness of SL often leads to inequitable access to educational and professional opportunities, thereby contributing to social exclusion, particularly in Saudi Arabia, where over 84,000 individuals depend on Saudi Sign Language (SSL) as their primary form of communication. Although certain technological approaches have helped to improve communication for individuals with hearing impairments, there continues to be an urgent requirement for more precise and dependable translation techniques, especially for Arabic sign language variants like SSL. Most state-of-the-art solutions have primarily focused on non-Arabic sign languages, resulting in a considerable absence of resources dedicated to Arabic sign language, specifically SSL. The complexity of the Arabic language and the prevalence of isolated sign language datasets that concentrate on individual words instead of continuous speech contribute to this issue. To address this gap, our research represents an important step in developing SSL resources. To address this, we introduce the first continuous Saudi Sign Language dataset called KAU-CSSL, focusing on complete sentences to facilitate further research and enable sophisticated recognition systems for SSL recognition and translation. Additionally, we propose a transformer-based model, utilizing a pretrained ResNet-18 for spatial feature extraction and a Transformer Encoder with Bidirectional LSTM for temporal dependencies, achieving 99.02% accuracy at signer dependent mode and 77.71% accuracy at signer independent mode. This development leads the way to not only improving communication tools for the SSL community but also making a substantial contribution to the wider field of sign language.
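The described architecture (frame-level ResNet-18 features, a Transformer encoder, then a bidirectional LSTM and classifier) can be sketched in PyTorch as follows; layer counts, head numbers, and the pooling choice are assumptions, not the authors' exact values.

```python
# Rough architectural sketch: pretrained ResNet-18 per frame, Transformer
# encoder over the frame sequence, BiLSTM, then a sentence-level classifier.
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

class SignRecognizer(nn.Module):
    def __init__(self, num_classes: int, d_model: int = 512):
        super().__init__()
        backbone = resnet18(weights=ResNet18_Weights.DEFAULT)
        backbone.fc = nn.Identity()                    # keep 512-d frame features
        self.backbone = backbone
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.bilstm = nn.LSTM(d_model, 256, bidirectional=True, batch_first=True)
        self.head = nn.Linear(512, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1)).view(b, t, -1)  # (b, t, 512)
        feats = self.encoder(feats)
        out, _ = self.bilstm(feats)
        return self.head(out[:, -1])                   # sentence-level prediction

logits = SignRecognizer(num_classes=80)(torch.randn(2, 12, 3, 224, 224))
```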
【2】Advancing Minority Stress Detection with Transformers: Insights from the Social Media Datasets
标题 :利用Transformer推进少数群体压力检测:来自社交媒体数据集的见解
链接 :
https://arxiv.org/abs/2509.02908
作者 :hapagain, Cory J Cascalheira, Shah Muhammad Hamdi, Soukaina Filali Boubrahimi, Jillian R. Scheer
备注 :Accepted in Social Network Analysis and Mining Journal (SNAM)
摘要 :性和性别少数群体的个人经历不成比例的高比例的健康状况不佳的结果和精神障碍相比,他们的异性恋和顺性别的同行,主要是由于少数压力所描述的迈耶(2003年)模型。这项研究提出了第一个全面的评估变压器为基础的架构,以检测少数民族的压力在网上的话语。我们对多个Transformer模型进行了基准测试,包括ELECTRA、BERT、RoBERTa和BART,并与传统的机器学习基线和图形增强变体进行了对比。我们进一步评估了zero-shot和Few-Shot学习范式,以评估它们在代表性不足的数据集上的适用性。实验在两个最大的公开可用的Reddit语料库上进行,用于少数压力检测,包括12,645和5,789个帖子,并在五个随机种子上重复,以确保鲁棒性。我们的研究结果表明,集成图结构一致地提高了跨transformer-only模型的检测性能,并且具有关系上下文的监督微调优于零和Few-Shot方法。理论分析表明,通过图形增强来建模社会连接和会话上下文,可以提高模型识别关键语言标记的能力,例如身份隐藏,内化的耻辱和支持,这表明图形增强的Transformers为数字健康干预和公共卫生政策提供了最可靠的基础。
摘要 :Individuals from sexual and gender minority groups experience disproportionately high rates of poor health outcomes and mental disorders compared to their heterosexual and cisgender counterparts, largely as a consequence of minority stress as described by Meyer's (2003) model. This study presents the first comprehensive evaluation of transformer-based architectures for detecting minority stress in online discourse. We benchmark multiple transformer models including ELECTRA, BERT, RoBERTa, and BART against traditional machine learning baselines and graph-augmented variants. We further assess zero-shot and few-shot learning paradigms to assess their applicability on underrepresented datasets. Experiments are conducted on the two largest publicly available Reddit corpora for minority stress detection, comprising 12,645 and 5,789 posts, and are repeated over five random seeds to ensure robustness. Our results demonstrate that integrating graph structure consistently improves detection performance across transformer-only models and that supervised fine-tuning with relational context outperforms zero and few-shot approaches. Theoretical analysis reveals that modeling social connectivity and conversational context via graph augmentation sharpens the models' ability to identify key linguistic markers such as identity concealment, internalized stigma, and calls for support, suggesting that graph-enhanced transformers offer the most reliable foundation for digital health interventions and public health policy.
GAN|生成相关(1篇)
【1】A-SEA3L-QA: A Fully Automated Self-Evolving, Adversarial Workflow for Arabic Long-Context Question-Answer Generation
标题 :A-SEA3L-QA:用于阿拉伯语长上下文问答生成的全自动自进化、对抗性工作流程
链接 :
https://arxiv.org/abs/2509.02864
作者 :g, Daulet Toibazar, Pedro J. Moreno
摘要 :我们提出了一个端到端的,自我发展的对抗性工作流程,用于阿拉伯语的长上下文问答(QA)生成。通过编排多个专门的LVLM:一个问题生成器,一个评估器和一群答案生成器,我们的系统迭代地改善自己的性能,而无需任何人为干预。从不同领域的原始多页阿拉伯文文档开始,问题生成器生成细粒度的上下文感知查询,由答案生成器群处理,评估器评估并反馈质量指标。这种闭环循环实现了持续学习:低置信度输出触发自动重新生成和模型更新,逐步提高问题的难度和相关性。此外,我们将质量指标设置为可调的超参数,使问题生成在可控和可定制的难度水平。我们发布了AraLongBench,这是一个跨越数百页的单页和多页挑战的大规模阿拉伯语基准测试,并证明我们的自我进化工作流程大大优于静态管道,显着提高了领先的阿拉伯语大视觉语言模型(LVLM)的长期上下文理解能力。最后,我们还精心设计了一个完全自动化的代理工作流程,用于长上下文阿拉伯文文档收集。
摘要 :We present an end-to-end, self-evolving adversarial workflow for long-context Question-Answer (QA) Generation in Arabic. By orchestrating multiple specialized LVLMs: a question generator, an evaluator, and a swarm of answer generators, our system iteratively refines its own performance without any human intervention. Starting from raw, multi-page Arabic documents across diverse domains, the question generator produces fine-grained, context-aware queries to be tackled by the answer generator swarm, and the evaluator assesses and feeds back quality metrics. This closed-loop cycle enables continuous learning: low-confidence outputs trigger automated re-generation and model updates, progressively enhancing question difficulty and relevance. Moreover, we set the quality metrics as a tunable hyperparameter, enabling question generation at controllable and customizable difficulty levels. We release AraLongBench, a large-scale Arabic benchmark of single- and multi-page challenges spanning hundreds of pages, and demonstrate that our self-evolving workflow substantially outperform static pipelines, markedly boosting the long-context comprehension capabilities of leading Arabic Large Vision Language Models (LVLMs). Lastly, we also meticulously architect a fully automated agentic workflow for long-context Arabic document collection.
QA|VQA|问答|对话(1篇)
【1】ProMQA-Assembly: Multimodal Procedural QA Dataset on Assembly
标题 :ProMQA-Assembly:关于装配的多模式程序QA数据集
链接 :
https://arxiv.org/abs/2509.02949
作者 :Hasegawa, Wiradee Imrattanatrai, Masaki Asada, Susan Holm, Yuran Wang, Vincent Zhou, Ken Fukuda, Teruko Mitamura
备注 :29 pages. Code and data: this https URL
摘要 :装配任务助手具有很大潜力,可以在从日常任务到工业环境的各类场景中帮助人类。然而,目前还没有测试平台支持在实际环境中对面向应用的系统进行评估,尤其是在装配场景中。为了促进相关发展,我们提出了一个新的装配活动多模态QA数据集。我们的数据集ProMQA-Assembly由391个QA对组成,这些QA对需要以在线方式对人类活动录像及其说明书进行多模态理解。在构建过程中,我们采用了半自动QA标注方法(LLM生成候选项、由人工验证)作为一种具有成本效益的方式,并通过融合细粒度动作标签来进一步改进,使问题类型更加多样。此外,我们为组装玩具车这一目标任务创建了指令任务图。这些新建的任务图既用于我们的基准测试实验,也用于辅助QA标注中的人工验证过程。利用该数据集,我们对包括有竞争力的专有多模态模型在内的多种模型进行了基准测试。结果表明,当前模型仍有很大的改进空间。我们相信这一新的评估数据集能够为程序性活动助手的进一步发展做出贡献。
摘要 :Assistants on assembly tasks have a large potential to benefit humans from everyday tasks to industrial settings. However, no testbeds support application-oriented system evaluation in a practical setting, especially in assembly. To foster the development, we propose a new multimodal QA dataset on assembly activities. Our dataset, ProMQA-Assembly, consists of 391 QA pairs that require the multimodal understanding of human-activity recordings and their instruction manuals in an online-style manner. In the development, we adopt a semi-automated QA annotation approach, where LLMs generate candidates and humans verify them, as a cost-effective method, and further improve it by integrating fine-grained action labels to diversify question types. Furthermore, we create instruction task graphs for the target tasks of assembling toy vehicles. These newly created task graphs are used in our benchmarking experiment, as well as to facilitate the human verification process in the QA annotation. Utilizing our dataset, we benchmark models, including competitive proprietary multimodal models. Our results suggest great room for improvement for the current models. We believe our new evaluation dataset can contribute to the further development of procedural-activity assistants.
推理|分析|理解|解释(2篇)
【1】SinhalaMMLU: A Comprehensive Benchmark for Evaluating Multitask Language Understanding in Sinhala
标题 :SinhalaMMLU:评估僧伽罗语多任务语言理解的综合基准
链接 :
https://arxiv.org/abs/2509.03162
作者 :ramodya, Nirasha Nelki, Heshan Shalinda, Chamila Liyanage, Yusuke Sakai, Randil Pushpananda, Ruvan Weerasinghe, Hidetaka Kamigaito, Taro Watanabe
备注 :19 pages, 11 figures
摘要 :大型语言模型(LLM)展示了令人印象深刻的通用知识和推理能力,但对它们的评估主要集中在全球性或以英语为中心的主题上,往往忽视低资源语言和特定文化内容。虽然近期的多语言基准试图弥合这一差距,但许多依赖自动翻译,可能引入错误并歪曲原本的文化语境。为此,我们引入SinhalaMMLU,这是第一个专门为低资源语言僧伽罗语设计的多项选择问答基准。该数据集包含7,000多个问题,覆盖中学到大学教育水平,与斯里兰卡国家课程保持一致,涵盖6个领域和30个科目,既包括一般学术主题,也包括植根于文化的知识。我们在SinhalaMMLU上评估了26个LLM,发现尽管Claude 3.5 Sonnet和GPT-4o分别取得了67%和62%的最高平均准确率,整体模型性能仍然有限。特别是,模型在人文学科等文化内涵丰富的领域表现不佳,表明在让LLM适应低资源和特定文化背景方面仍有很大的改进空间。
摘要 :Large Language Models (LLMs) demonstrate impressive general knowledge and reasoning abilities, yet their evaluation has predominantly focused on global or anglocentric subjects, often neglecting low-resource languages and culturally specific content. While recent multilingual benchmarks attempt to bridge this gap, many rely on automatic translation, which can introduce errors and misrepresent the original cultural context. To address this, we introduce SinhalaMMLU, the first multiple-choice question answering benchmark designed specifically for Sinhala, a low-resource language. The dataset includes over 7,000 questions spanning secondary to collegiate education levels, aligned with the Sri Lankan national curriculum, and covers six domains and 30 subjects, encompassing both general academic topics and culturally grounded knowledge. We evaluate 26 LLMs on SinhalaMMLU and observe that, while Claude 3.5 sonnet and GPT-4o achieve the highest average accuracies at 67% and 62% respectively, overall model performance remains limited. In particular, models struggle in culturally rich domains such as the Humanities, revealing substantial room for improvement in adapting LLMs to low-resource and culturally specific contexts.
【2】A Long Short-Term Memory (LSTM) Model for Business Sentiment Analysis Based on Recurrent Neural Network
标题 :基于循环神经网络的商业情感分析长短期记忆(LSTM)模型
链接 :
https://arxiv.org/abs/2509.03060
作者 :ul Islam Razin, Md. Abdul Karim, M. F. Mridha, S M Rafiuddin, Tahira Alam
备注 :11 pages, 9 figures, 3 tables, published in Sustainable Communication Networks and Application: Proceedings of ICSCN 2020 (2021). Paper presents an LSTM-based business sentiment analysis model with 91.33% accuracy, compares against KNN, SVM, and Naive Bayes, and discusses methodology, dataset, training/testing, results, and implementation tools
摘要 :商业情感分析是自然语言处理领域的一个重要研究方向,是一种面向商业目的的情感分析技术。不同类别的情感分析技术,如基于词典的方法和各类机器学习算法,已被用于英语、印地语、西班牙语等不同语言的情感分析。本文将长短期记忆(LSTM)应用于商业情感分析,其中使用了循环神经网络。在修改后的方法中使用LSTM模型来缓解梯度消失问题,而不是采用传统的循环神经网络(RNN)。为了应用该修改后的RNN模型,我们使用了产品评论数据集。实验中,70%的数据用于LSTM训练,其余30%用于测试。我们将修改后的RNN模型的结果与其他传统RNN模型进行了比较。值得注意的是,所提出的模型优于其他传统RNN模型,其准确率达到约91.33%。应用该模型,任何商业公司或电子商务网站都可以识别客户对其喜爱或不喜爱的各类产品的反馈,并据此评估自身的营销策略。
摘要 :Business sentiment analysis (BSA) is one of the significant and popular topics of natural language processing. It is one kind of sentiment analysis techniques for business purposes. Different categories of sentiment analysis techniques like lexicon-based techniques and different types of machine learning algorithms are applied for sentiment analysis on different languages like English, Hindi, Spanish, etc. In this paper, long short-term memory (LSTM) is applied for business sentiment analysis, where a recurrent neural network is used. An LSTM model is used in a modified approach to prevent the vanishing gradient problem rather than applying the conventional recurrent neural network (RNN). To apply the modified RNN model, product review dataset is used. In this experiment, 70% of the data is trained for the LSTM and the rest 30% of the data is used for testing. The result of this modified RNN model is compared with other conventional RNN models, and a comparison is made among the results. It is noted that the proposed model performs better than the other conventional RNN models. Here, the proposed model, i.e., the modified RNN model approach has achieved around 91.33% of accuracy. By applying this model, any business company or e-commerce business site can identify the feedback from their customers about different types of products that customers like or dislike. Based on the customer reviews, a business company or e-commerce platform can evaluate its marketing strategy.
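An illustrative PyTorch version of the kind of LSTM classifier described (embedding, LSTM, sigmoid output for like/dislike); the paper's exact framework, layer sizes, and preprocessing may differ from this sketch.

```python
# Minimal LSTM sentiment classifier over padded token-id sequences.
import torch
import torch.nn as nn

class LSTMSentiment(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 128, hidden: int = 128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        _, (h_n, _) = self.lstm(self.emb(token_ids))
        return torch.sigmoid(self.out(h_n[-1])).squeeze(-1)  # P(positive review)

model = LSTMSentiment(vocab_size=20000)
probs = model(torch.randint(1, 20000, (4, 50)))  # 4 padded reviews of length 50
loss = nn.functional.binary_cross_entropy(probs, torch.tensor([1., 0., 1., 1.]))
```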
检测相关(1篇)
【1】Speech DF Arena: A Leaderboard for Speech DeepFake Detection Models
标题 :Speech DF Arena:Speech DeepFake检测模型的排行榜
链接 :
https://arxiv.org/abs/2509.02859
作者 : Dowerah, Atharva Kulkarni, Ajinkya Kulkarni, Hoan My Tran, Joonas Kalda, Artem Fedorchenko, Benoit Fauve, Damien Lolive, Tanel Alumäe, Matthew Magimai Doss
摘要 :与先进的deepfake音频生成技术的发展并行,音频deepfake检测也取得了重大进展。然而,仍然缺少一个标准化且全面的基准。为此,我们引入了Speech DeepFake(DF)Arena,这是第一个面向音频deepfake检测的综合基准。Speech DF Arena提供了一个统一评估检测系统的工具包,目前涵盖14个不同的数据集和攻击场景,并提供标准化的评估指标和协议,以保证可复现性和透明度。它还包括一个排行榜,用于比较和排名各系统,帮助研究人员和开发者提升其可靠性与鲁棒性。我们纳入了14个评估集、12个最先进的开源检测系统和3个专有检测系统。我们的研究发现许多系统在域外场景中表现出较高的EER,突显了广泛跨域评估的必要性。排行榜托管在Huggingface上,GitHub上提供了在所列数据集上复现结果的工具包。
摘要 :Parallel to the development of advanced deepfake audio generation, audio deepfake detection has also seen significant progress. However, a standardized and comprehensive benchmark is still missing. To address this, we introduce Speech DeepFake (DF) Arena, the first comprehensive benchmark for audio deepfake detection. Speech DF Arena provides a toolkit to uniformly evaluate detection systems, currently across 14 diverse datasets and attack scenarios, standardized evaluation metrics and protocols for reproducibility and transparency. It also includes a leaderboard to compare and rank the systems to help researchers and developers enhance their reliability and robustness. We include 14 evaluation sets, 12 state-of-the-art open-source and 3 proprietary detection systems. Our study presents many systems exhibiting high EER in out-of-domain scenarios, highlighting the need for extensive cross-domain evaluation. The leaderboard is hosted on Huggingface1 and a toolkit for reproducing results across the listed datasets is available on GitHub.
语料库(1篇)
【1】DiaCBT: A Long-Periodic Dialogue Corpus Guided by Cognitive Conceptualization Diagram for CBT-based Psychological Counseling
标题 :DiaCBT:以认知概念化图为指导的基于CBT的心理咨询长周期对话语料库
链接 :
https://arxiv.org/abs/2509.02999
作者 :ou, Ningning Zhou, Qin Chen, Jie Zhou, Aimin Zhou, Liang He
摘要 :由于社会耻辱和治疗师有限,心理治疗只惠及一小部分精神障碍患者。大型语言模型(LLM),当配备专业的心理治疗技能,提供了一个有前途的解决方案,以扩大获得心理健康服务。不过,缺乏心理对话数据集提出了重大挑战,在开发有效的心理治疗指导的对话代理。本文构建了一个基于认知行为疗法(CBT)的心理咨询长周期对话语料库。我们策划的数据集包括每次咨询的多个会话,并结合认知概念化图(CCD)来指导客户在不同场景中的模拟。为了评估我们的数据集的效用,我们训练了一个深入的咨询模型,并提出了一个全面的评估框架,以基准对建立基于CBT的咨询心理标准。结果表明,DiaCBT有效地提高了LLM模仿具有CBT专业知识的心理学家的能力,强调了其培训更多专业咨询代理人的潜力。
摘要 :Psychotherapy reaches only a small fraction of individuals suffering from mental disorders due to social stigma and the limited availability of therapists. Large language models (LLMs), when equipped with professional psychotherapeutic skills, offer a promising solution to expand access to mental health services. However, the lack of psychological conversation datasets presents significant challenges in developing effective psychotherapy-guided conversational agents. In this paper, we construct a long-periodic dialogue corpus for counseling based on cognitive behavioral therapy (CBT). Our curated dataset includes multiple sessions for each counseling and incorporates cognitive conceptualization diagrams (CCDs) to guide client simulation across diverse scenarios. To evaluate the utility of our dataset, we train an in-depth counseling model and present a comprehensive evaluation framework to benchmark it against established psychological criteria for CBT-based counseling. Results demonstrate that DiaCBT effectively enhances LLMs' ability to emulate psychologists with CBT expertise, underscoring its potential for training more professional counseling agents.
Word2Vec|文本|单词(2篇)
【1】Design and Optimization of Reinforcement Learning-Based Agents in Text-Based Games
标题 :文本游戏中基于强化学习的代理的设计与优化
链接 :
https://arxiv.org/abs/2509.03479
作者 :ng, Mingjia Zhao, Junfeng Sun, Wei Liu
备注 :6 pages
摘要 :随着人工智能技术的进步,利用智能体玩基于文本的游戏的研究越来越受欢迎。本文在强化学习的背景下提出了一种智能体设计与智能体学习的新方法。首先应用深度学习模型处理游戏文本并构建世界模型;接着通过基于策略梯度的深度强化学习方法训练智能体,促进从状态价值到最优策略的转换。增强后的智能体在多个文本游戏实验中表现更好,在游戏完成率和胜率上明显优于以往的智能体。我们的研究为在文本游戏中使用强化学习提供了新的理解和实证基础,并为在更一般的领域和问题上开发与优化强化学习智能体奠定了基础。
摘要 :As AI technology advances, research in playing text-based games with agents has become progressively popular. In this paper, a novel approach to agent design and agent learning is presented with the context of reinforcement learning. A model of deep learning is first applied to process game text and build a world model. Next, the agent is learned through a policy gradient-based deep reinforcement learning method to facilitate conversion from state value to optimal policy. The enhanced agent works better in several text-based game experiments and significantly surpasses previous agents on game completion ratio and win rate. Our study introduces novel understanding and empirical ground for using reinforcement learning for text games and sets the stage for developing and optimizing reinforcement learning agents for more general domains and problems.
【2】An experimental and computational study of an Estonian single-person word naming
标题 :爱沙尼亚单人词命名的实验和计算研究
链接 :
https://arxiv.org/abs/2509.03143
作者 :, Arvi Tavast, Maria Heitmeier, Harald Baayen
摘要 :本研究探讨爱沙尼亚语的词汇加工。本文报道了一项大规模单被试实验,该实验将单词命名任务与眼动追踪相结合。我们用广义加性模型分析了五个反应变量(首次注视时长、总注视时长、注视次数、单词命名潜伏期和口语单词时长)。核心问题是,由心理词库计算模型(判别词库模型,DLM)生成的词汇加工测量指标是否能预测这些反应变量,以及它们与词频、邻域规模和屈折范式规模等经典预测因子相比表现如何。计算模型分别以线性映射和深度映射实现。主要发现是:首先,基于DLM的测量是词汇加工的强有力预测因子;其次,使用深度学习的DLM测量不一定比使用线性映射的DLM测量更准确地预测词汇加工;第三,与基于DLM的预测因子相比,经典预测因子往往提供略更精确的拟合(总注视时长除外,二者在该变量上拟合优度相当);第四,在命名任务中,词汇变量不能预测首次注视时长和注视总次数。由于DLM基于从形式到意义的映射,基于DLM的测量对总注视时长、命名潜伏期和口语单词时长具有预测性,这表明意义在本次单词命名任务中有很大程度的参与。
摘要 :This study investigates lexical processing in Estonian. A large-scale single-subject experiment is reported that combines the word naming task with eye-tracking. Five response variables (first fixation duration, total fixation duration, number of fixations, word naming latency, and spoken word duration) are analyzed with the generalized additive model. Of central interest is the question of whether measures for lexical processing generated by a computational model of the mental lexicon (the Discriminative Lexicon Model, DLM) are predictive for these response variables, and how they compare to classical predictors such as word frequency, neighborhood size, and inflectional paradigm size. Computational models were implemented both with linear and deep mappings. Central findings are, first, that DLM-based measures are powerful predictors for lexical processing, second, that DLM-measures using deep learning are not necessarily more precise predictors of lexical processing than DLM-measures using linear mappings, third, that classical predictors tend to provide somewhat more precise fits compared to DLM-based predictors (except for total fixation duration, where the two provide equivalent goodness of fit), and fourth, that in the naming task lexical variables are not predictive for first fixation duration and the total number of fixations. As the DLM works with mappings from form to meaning, the predictivity of DLM-based measures for total fixation duration, naming latencies, and spoken word duration indicates that meaning is heavily involved in the present word naming task.
其他神经网络|深度学习|模型|建模(3篇)
【1】Learning Mechanism Underlying NLP Pre-Training and Fine-Tuning
标题 :NLP预训练和微调背后的学习机制
链接 :
https://arxiv.org/abs/2509.03407
作者 :ach, Ronit D. Gross, Ella Koresh, Shalom Rosner, Or Shpringer, Tal Halevi, Ido Kanter
备注 :46 pages, 18 figures, 10 tables
摘要 :自然语言处理(NLP)能够理解和生成有意义的人类语言,通常先在大型数据集上预训练复杂架构以学习语言,再微调其权重以完成特定任务。本文考察两个目标:理解成功预训练背后的机制,以及确定预训练准确率与分类任务微调之间的相互作用。主要结果如下:每个令牌的准确率(APT)随其在数据集中的出现频率增加而提高,其在所有令牌上的平均值可作为量化预训练成功程度的序参量,并沿Transformer块逐层上升。预训练打破了令牌之间的对称性,将它们分组为有限的、小的、强匹配的令牌簇,这可以从文中给出的令牌混淆矩阵推断出来。这一特征沿Transformer块向输出层逐渐锐化,使输出层的性能相比嵌入层大幅提升。因此,即使学习代价函数只针对识别单个令牌,预训练也会生成高阶语言结构。这些预训练发现体现在微调准确率沿Transformer块的提升上。此外,研究发现输出标签预测置信度与平均输入APT无关,因为令牌主要被强匹配令牌替换,输入含义得以保留。最后,虽然图像分类任务中通常没有预训练,但其底层机制与微调NLP分类任务所用机制相似,暗示了这一机制的普遍性。结果基于在维基百科数据集上预训练、并在FewRel和DBpedia分类任务上微调的BERT-6架构。
摘要 :Natural language processing (NLP) enables the understanding and generation of meaningful human language, typically using a pre-trained complex architecture on a large dataset to learn the language and next fine-tune its weights to implement a specific task. Twofold goals are examined; to understand the mechanism underlying successful pre-training and to determine the interplay between the pre-training accuracy and the fine-tuning of classification tasks. The following main results were obtained; the accuracy per token (APT) increased with its appearance frequency in the dataset, and its average over all tokens served as an order parameter to quantify pre-training success, which increased along the transformer blocks. Pre-training broke the symmetry among tokens and grouped them into finite, small, strong match token clusters, as inferred from the presented token confusion matrix. This feature was sharpened along the transformer blocks toward the output layer, enhancing its performance considerably compared with that of the embedding layer. Consequently, higher-order language structures were generated by pre-training, even though the learning cost function was directed solely at identifying a single token. These pre-training findings were reflected by the improved fine-tuning accuracy along the transformer blocks. Additionally, the output label prediction confidence was found to be independent of the average input APT, as the input meaning was preserved since the tokens are replaced primarily by strong match tokens. Finally, although pre-training is commonly absent in image classification tasks, its underlying mechanism is similar to that used in fine-tuning NLP classification tasks, hinting at its universality. The results were based on the BERT-6 architecture pre-trained on the Wikipedia dataset and fine-tuned on the FewRel and DBpedia classification tasks.
【2】Comparison of End-to-end Speech Assessment Models for the NOCASA 2025 Challenge
标题 :NOCASA 2025挑战赛的端到端语音评估模型比较
链接 :
https://arxiv.org/abs/2509.03256
作者 :avoronkov, Tanel Alumäe
备注 :Published at IEEE MLSP 2025
摘要 :本文分析了为NOCASA 2025挑战赛开发的三个端到端模型,旨在对以挪威语为第二语言的儿童进行单词级发音自动评估。我们的模型包括:一个编码器-解码器孪生架构(E2E-R);一个利用预训练wav2vec2.0表示的前缀调优直接分类模型;以及一个新模型,它集成了通过CTC计算的无需对齐的发音良好度(GOP)特征。我们引入了一种加权序数交叉熵损失,专为优化未加权平均召回率和平均绝对误差等指标而设计。在所探索的方法中,基于GOP-CTC的模型取得了最高性能,大幅超过挑战赛基线,并获得了排行榜最高分。
摘要 :This paper presents an analysis of three end-to-end models developed for the NOCASA 2025 Challenge, aimed at automatic word-level pronunciation assessment for children learning Norwegian as a second language. Our models include an encoder-decoder Siamese architecture (E2E-R), a prefix-tuned direct classification model leveraging pretrained wav2vec2.0 representations, and a novel model integrating alignment-free goodness-of-pronunciation (GOP) features computed via CTC. We introduce a weighted ordinal cross-entropy loss tailored for optimizing metrics such as unweighted average recall and mean absolute error. Among the explored methods, our GOP-CTC-based model achieved the highest performance, substantially surpassing challenge baselines and attaining top leaderboard scores.
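One plausible reading of the weighted ordinal cross-entropy, sketched below: standard cross-entropy rescaled by the ordinal distance between the predicted and true score levels, with per-class weights (e.g., to favor unweighted average recall). The challenge submission's exact formulation may differ.

```python
# Sketch of a weighted ordinal cross-entropy: CE scaled by ordinal error
# distance and by per-class weights. Not necessarily the paper's exact loss.
import torch
import torch.nn.functional as F

def weighted_ordinal_ce(logits, targets, class_weights):
    # logits: (batch, num_levels); targets: (batch,) integer proficiency levels
    ce = F.cross_entropy(logits, targets, weight=class_weights, reduction="none")
    distance = (logits.argmax(dim=-1) - targets).abs().float()   # ordinal error
    return ((1.0 + distance) * ce).mean()

logits = torch.randn(8, 4, requires_grad=True)
targets = torch.randint(0, 4, (8,))
w = torch.tensor([1.5, 1.0, 1.0, 2.0])   # e.g. up-weight rare levels for UAR
weighted_ordinal_ce(logits, targets, w).backward()
```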
【3】Identifiability and minimality bounds of quantum and post-quantum models of classical stochastic processes
标题 :经典随机过程量子和后量子模型的可识别性和极小性界
链接 :
https://arxiv.org/abs/2509.03004
作者 :iechers, Thomas J. Elliott
备注 :11 pages, 4 figures
摘要 :为了理解我们周围的世界,我们构建模型,使我们能够复现、描述和解释所观察到的行为。聚焦于相关随机变量序列这一广泛情形(即经典随机过程),我们解决的问题是:判定两个不同的模型是否产生相同的可观察行为。这就是可识别性问题。有趣的是,模型的物理机制不必与观测的物理机制相对应;最近的工作表明,在记忆和热效率方面,采用量子模型来生成经典随机过程甚至更有优势。我们在这一设定下解决了可识别性问题,通过将模型映射到一个规范的“广义”隐马尔可夫模型,提供了一种比较任意两个经典过程模型的方法,无论这些模型是经典的、量子的,还是“后量子”的。此外,这使我们能够对生成给定经典随机过程所需的量子模型最小维度给出(有时是紧的)界。
摘要 :To make sense of the world around us, we develop models, constructed to enable us to replicate, describe, and explain the behaviours we see. Focusing on the broad case of sequences of correlated random variables, i.e., classical stochastic processes, we tackle the question of determining whether or not two different models produce the same observable behavior. This is the problem of identifiability. Curiously, the physics of the model need not correspond to the physics of the observations; recent work has shown that it is even advantageous — in terms of memory and thermal efficiency — to employ quantum models to generate classical stochastic processes. We resolve the identifiability problem in this regime, providing a means to compare any two models of a classical process, be the models classical, quantum, or `post-quantum', by mapping them to a canonical `generalized' hidden Markov model. Further, this enables us to place (sometimes tight) bounds on the minimal dimension required of a quantum model to generate a given classical stochastic process.
其他(10篇)
【1】LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence
标题 :LimiX:释放通才智能的结构数据建模能力
链接 :
https://arxiv.org/abs/2509.03505
作者 :Zhang, Gang Ren, Han Yu, Hao Yuan, Hui Wang, Jiansheng Li, Jiayun Wu, Lang Mo, Li Mao, Mingchao Hao, Ningbo Dai, Renzhe Xu, Shuyang Li, Tianyang Zhang, Yue He, Yuanrui Wang, Yunjia Zhang, Zijing Xu, Dongzhe Li, Fang Gao, Hao Zou, Jiandong Liu, Jiashuo Liu, Jiawei Xu, Kaijie Cheng, Kehan Li, Linjun Zhou, Qing Li, Shaohua Fan, Xiaoyu Lin, Xinyan Han, Xuanyue Li, Yan Lu, Yuan Xue, Yuanyuan Jiang, Zimu Wang, Zhenlei Wang, Peng Cui
备注 :56 pages
摘要 :我们认为,通向通用智能的进展需要分别立足于语言、物理世界和结构化数据的互补基础模型。本报告介绍了LimiX,这是我们的大型结构化数据模型(LDM)系列的第一部。LimiX将结构化数据视为变量与缺失情况的联合分布,因此能够通过基于查询的条件预测,用单一模型解决各类表格任务。LimiX采用带有情景式、上下文条件目标的掩码联合分布建模进行预训练,模型以特定于数据集的上下文为条件对查询子集进行预测,从而支持推理时快速、免训练的适应。我们在10个大型结构化数据基准上评估LimiX,覆盖样本量、特征维度、类别数、类别型与数值型特征比例、缺失情况以及样本-特征比等多种设定。凭借单一模型和统一接口,LimiX始终超越强大的基线,包括梯度提升树、深度表格网络、最新的表格基础模型和自动化集成,如图1和图2所示。这种优势适用于分类、回归、缺失值插补和数据生成等各类任务,且往往优势明显,同时避免了针对每个任务的专门架构或定制训练。所有LimiX模型均在Apache 2.0许可下公开。
摘要 :We argue that progress toward general intelligence requires complementary foundation models grounded in language, the physical world, and structured data. This report presents LimiX, the first installment of our large structured-data models (LDMs). LimiX treats structured data as a joint distribution over variables and missingness, thus capable of addressing a wide range of tabular tasks through query-based conditional prediction via a single model. LimiX is pretrained using masked joint-distribution modeling with an episodic, context-conditional objective, where the model predicts for query subsets conditioned on dataset-specific contexts, supporting rapid, training-free adaptation at inference. We evaluate LimiX across 10 large structured-data benchmarks with broad regimes of sample size, feature dimensionality, class number, categorical-to-numerical feature ratio, missingness, and sample-to-feature ratios. With a single model and a unified interface, LimiX consistently surpasses strong baselines including gradient-boosting trees, deep tabular networks, recent tabular foundation models, and automated ensembles, as shown in Figure 1 and Figure 2. The superiority holds across a wide range of tasks, such as classification, regression, missing value imputation, and data generation, often by substantial margins, while avoiding task-specific architectures or bespoke training per task. All LimiX models are publicly accessible under Apache 2.0.
【2】Situating AI Agents in their World: Aspective Agentic AI for Dynamic Partially Observable Information Systems
标题 :将人工智能代理置于其世界中:面向动态部分可观察信息系统的方面化(Aspective)智能体AI
链接 :
https://arxiv.org/abs/2509.03380
作者 :Bentley, Soo Ling Lim, Fuyuki Ishikawa
备注 :9 pages
摘要 :LLM AI代理一般只不过是自主的聊天机器人:演员遵循脚本,一般由不可靠的导演控制。这项工作引入了一个自下而上的框架,将AI代理置于其环境中,所有行为都由其环境中的变化触发。它引入了方面的概念,类似于umwelt的想法,其中代理的集合彼此不同地感知他们的环境,从而能够更清晰地控制信息。我们提供了一个说明性的实现,并表明与泄漏高达83%的时间的典型架构相比,aspective agentic AI实现了零信息泄漏。我们预计,这种概念的专业代理人在自己的信息壁龛有效地工作,可以提供改善的安全性和效率。
摘要 :Agentic LLM AI agents are often little more than autonomous chatbots: actors following scripts, often controlled by an unreliable director. This work introduces a bottom-up framework that situates AI agents in their environment, with all behaviors triggered by changes in their environments. It introduces the notion of aspects, similar to the idea of umwelt, where sets of agents perceive their environment differently to each other, enabling clearer control of information. We provide an illustrative implementation and show that compared to a typical architecture, which leaks up to 83% of the time, aspective agentic AI enables zero information leakage. We anticipate that this concept of specialist agents working efficiently in their own information niches can provide improvements to both security and efficiency.
【3】SESGO: Spanish Evaluation of Stereotypical Generative Outputs
标题 :SESGO:西班牙对刻板印象生成产出的评估
链接 :
https://arxiv.org/abs/2509.03329
作者 :obles, Catalina Bernal, Denniss Raigoso, Mateo Dulce Rubio
摘要 :本文针对多语言大语言模型(LLM)偏见评估中的关键空白展开研究,特别关注具有文化意识的拉丁美洲语境下的西班牙语。尽管LLM已在全球广泛部署,目前的评估仍然主要以美国英语为中心,在其他语言和文化背景下的潜在危害在很大程度上未被充分检视。我们提出了一种新颖的、植根于文化的框架,用于检测经过指令调优的LLM中的社会偏见。我们的方法改编了BBQ数据集的欠指定问题方法,将编码区域刻板印象的特定文化表达和谚语纳入四个社会类别:性别、种族、社会经济阶层和民族出身。基于4,000多个提示,我们提出了一个将准确率与错误方向相结合的新指标,以便在歧义和消歧两种语境下有效平衡模型性能与偏见对齐情况。据我们所知,这是第一项系统评估领先商业LLM如何应对西班牙语中特定文化偏见的工作,揭示了最先进模型之间偏见表现的不同模式。我们还提供证据表明,针对英语优化的偏见缓解技术不能有效迁移到西班牙语任务,且偏见模式在不同采样温度下基本保持一致。我们的模块化框架可以自然地扩展到新的刻板印象、偏见类别或语言与文化背景,是朝着在人工智能系统所处的多样语言环境中进行更公平、更具文化意识的评估迈出的重要一步。
摘要 :This paper addresses the critical gap in evaluating bias in multilingual Large Language Models (LLMs), with a specific focus on Spanish language within culturally-aware Latin American contexts. Despite widespread global deployment, current evaluations remain predominantly US-English-centric, leaving potential harms in other linguistic and cultural contexts largely underexamined. We introduce a novel, culturally-grounded framework for detecting social biases in instruction-tuned LLMs. Our approach adapts the underspecified question methodology from the BBQ dataset by incorporating culturally-specific expressions and sayings that encode regional stereotypes across four social categories: gender, race, socioeconomic class, and national origin. Using more than 4,000 prompts, we propose a new metric that combines accuracy with the direction of error to effectively balance model performance and bias alignment in both ambiguous and disambiguated contexts. To our knowledge, our work presents the first systematic evaluation examining how leading commercial LLMs respond to culturally specific bias in the Spanish language, revealing varying patterns of bias manifestation across state-of-the-art models. We also contribute evidence that bias mitigation techniques optimized for English do not effectively transfer to Spanish tasks, and that bias patterns remain largely consistent across different sampling temperatures. Our modular framework offers a natural extension to new stereotypes, bias categories, or languages and cultural contexts, representing a significant step toward more equitable and culturally-aware evaluation of AI systems in the diverse linguistic environments where they operate.
【4】LatPhon: Lightweight Multilingual G2P for Romance Languages and English
标题 :LatPhon:面向罗曼语族语言和英语的轻量级多语言G2P
链接 :
https://arxiv.org/abs/2509.03300
作者 :pe Chary, Miguel Arjona Ramirez
摘要 :字素到音素(G2P)转换是文本到语音(TTS)、自动语音识别(ASR)、语音到语音翻译(S2ST)和对齐系统的关键前端,尤其是在多种拉丁文字语言之间。我们介绍LatPhon,一个在六种此类语言(英语、西班牙语、法语、意大利语、葡萄牙语和罗马尼亚语)上联合训练的750万参数Transformer。在公开的ipa-dict语料库上,它取得了3.5%的平均音素错误率(PER),优于字节级ByT5基线(5.4%),接近特定语言的WFST(3.2%),同时仅占用30 MB内存,使得在需要时可以部署到设备端。这些结果表明,紧凑的多语言G2P可以作为拉丁文字语言语音管道的通用前端。
摘要 :Grapheme-to-phoneme (G2P) conversion is a key front-end for text-to-speech (TTS), automatic speech recognition (ASR), speech-to-speech translation (S2ST) and alignment systems, especially across multiple Latin-script languages. We present LatPhon, a 7.5M-parameter Transformer jointly trained on six such languages: English, Spanish, French, Italian, Portuguese, and Romanian. On the public ipa-dict corpus, it attains a mean phoneme error rate (PER) of 3.5%, outperforming the byte-level ByT5 baseline (5.4%) and approaching language-specific WFSTs (3.2%) while occupying 30 MB of memory, which makes on-device deployment feasible when needed. These results indicate that compact multilingual G2P can serve as a universal front-end for Latin-language speech pipelines.
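The quoted metric, phoneme error rate, is the Levenshtein edit distance between predicted and reference phoneme sequences, normalized by reference length; a self-contained helper:

```python
# Phoneme error rate (PER): edit distance / reference length.
def edit_distance(a, b):
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def per(hyp_phonemes, ref_phonemes):
    return edit_distance(hyp_phonemes, ref_phonemes) / max(len(ref_phonemes), 1)

print(per(["f", "o", "n"], ["f", "o", "n", "e"]))  # 0.25
```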
【5】Expanding the WMT24++ Benchmark with Rumantsch Grischun, Sursilvan, Sutsilvan, Surmiran, Puter, and Vallader
标题 :与Rumantsch Grischun、Sursilvan、Sutsilvan、Surmiran、Puter和Vallader一起扩展WMT 24 ++基准
链接 :
https://arxiv.org/abs/2509.03148
作者 :mvas, Ignacio Pérez Prat, Not Battesta Soliva, Sandra Baltermia-Guetg, Andrina Beeli, Simona Beeli, Madlaina Capeder, Laura Decurtins, Gian Peder Gregori, Flavia Hobi, Gabriela Holderegger, Arina Lazzarini, Viviana Lazzarini, Walter Rosselli, Bettina Vital, Anna Rutkiewicz, Rico Sennrich
备注 :Submitted to WMT25 (Open Language Data Initiative Shared Task)
摘要 :罗曼什语在瑞士使用,用于机器翻译评估的资源有限。在本文中,我们提出了一个基准的六个品种的罗曼什:Rumantsch Grischun,一个超区域的品种,和五个区域品种:Sursilvan,Sutsilvan,Surmiran,普特,和Vallader。我们的参考翻译是由人工翻译人员基于WMT24++基准创建的,该基准可确保与超过55种其他语言并行。对现有机器翻译系统和LLM的自动评估表明,从罗曼什语到德语的翻译对于所有变体都处理得相对较好,但翻译成罗曼什语依旧具有挑战性。
摘要 :The Romansh language, spoken in Switzerland, has limited resources for machine translation evaluation. In this paper, we present a benchmark for six varieties of Romansh: Rumantsch Grischun, a supra-regional variety, and five regional varieties: Sursilvan, Sutsilvan, Surmiran, Puter, and Vallader. Our reference translations were created by human translators based on the WMT24++ benchmark, which ensures parallelism with more than 55 other languages. An automatic evaluation of existing MT systems and LLMs shows that translation out of Romansh into German is handled relatively well for all the varieties, but translation into Romansh is still challenging.
【6】Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection
标题 :通过基于梯度的自我反思缓解多模态幻觉
链接 :
https://arxiv.org/abs/2509.03113
作者 :, Maying Shen, Nadine Chang, Chuong Nguyen, Hongdong Li, Jose M. Alvarez
摘要 :多模态大语言模型中的幻觉由文本-视觉偏差和共现偏差引起。前者反映了决策过程中对文本信息的过度依赖,后者则源于从训练数据中抽象出的对象配对统计模式。现有的缓解方法以启发式方式处理这些偏差,而没有考虑偏差水平在不同实例之间的波动。我们首先提出用基于梯度的自我反思方法估计各类输入标记(视觉、提示和先前输出)的影响。估计出的标记影响进一步可用于检测与对象相关的视觉标记,并将其整合到影响感知的对比解码框架中,从而同时缓解两类偏差。我们的方法不需要额外资源,例如昂贵的微调、额外的模型或数据统计。大量实验表明,它能有效减少幻觉,在LLaVA-QA90上实现了高达92%的准确率提升。
摘要 :Hallucinations in multimodal large language model are caused by the text-visual bias and the co-occurrence bias. The former reflects an over-reliance on text information in the decision-making process, while the latter arises from the statistical object-pairing patterns abstracted from the training data. Existing mitigation methods heuristically address these biases without understanding the fluctuating bias level across the instances. We first propose estimating the influence of respective token types (visual, prompt, and previous outputs) using a gradient-based self-reflection method. The estimated token influence further enables the detection of object-related visual tokens and their integration into an influence-aware contrastive decoding framework to mitigate both types of biases simultaneously. Our method operates without the need for additional resources, such as costly fine-tuning, extra models, or data statistics. Extensive experiments show it effectively reduces hallucinations, achieving up to a 92% accuracy increase on LLaVA-QA90.
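下面是一个高度简化的 PyTorch 示意,说明"用输入嵌入的梯度估计某类 token 的影响"以及"对比解码"这两个思路;其中的函数接口、玩具模型和缩放方式均为假设,并非论文的实际实现。

```python
import torch

def token_influence(model, embeds, token_type_mask, target_score_fn):
    """用输入嵌入的梯度范数估计某一类 token(如视觉 token)对当前输出的影响占比(示意)。
    embeds: [T, D] 输入嵌入;token_type_mask: [T] 布尔掩码,标记要考察的 token 类型。"""
    embeds = embeds.clone().detach().requires_grad_(True)
    score = target_score_fn(model, embeds)        # 标量,例如下一个词的目标 logit
    score.backward()
    grad_norms = embeds.grad.norm(dim=-1)         # 每个输入 token 的梯度范数
    return (grad_norms[token_type_mask].sum() / grad_norms.sum()).item()

def contrastive_logits(logits_full, logits_masked, alpha=1.0):
    """对比解码的简化形式:放大"保留视觉证据"与"屏蔽视觉证据"两次前向的 logit 差异。"""
    return logits_full + alpha * (logits_full - logits_masked)

# 玩具用法:用一个线性"模型"演示接口(纯示意)
torch.manual_seed(0)
W = torch.randn(8, 4)
toy_fn = lambda model, e: (e @ model).sum()       # 把所有输出求和当作目标分数
emb = torch.randn(6, 8)
mask = torch.tensor([True, True, False, False, False, False])
print(token_influence(W, emb, mask, toy_fn))
```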
【7】Mitigating Data Imbalance in Automated Speaking Assessment
标题 :缓解自动口语评估中的数据不平衡
链接 :
https://arxiv.org/abs/2509.03010
作者 : Tsai, Kuan-Tang Huang, Bi-Cheng Yan, Tien-Hong Lo, Berlin Chen
备注 :Submitted to APSIPA 2025
摘要 :自动口语评估(ASA)在评估二语(L2)学习者的语言水平方面起着至关重要的作用。然而,ASA模型常常受到类别不平衡的影响,导致预测出现偏差。为了解决这一问题,我们为训练ASA模型引入了一个新的目标函数,称为平衡Logit变异(BLV)损失,它通过扰动模型预测来改善少数类的特征表示,而无需修改数据集。在ICNALE基准数据集上的评估表明,将BLV损失集成到知名的基于文本的(BERT)模型中显著提高了分类准确率和公平性,使自动口语评估对不同学习者更加鲁棒。
摘要 :Automated Speaking Assessment (ASA) plays a crucial role in evaluating second-language (L2) learners proficiency. However, ASA models often suffer from class imbalance, leading to biased predictions. To address this, we introduce a novel objective for training ASA models, dubbed the Balancing Logit Variation (BLV) loss, which perturbs model predictions to improve feature representation for minority classes without modifying the dataset. Evaluations on the ICNALE benchmark dataset show that integrating the BLV loss into a celebrated text-based (BERT) model significantly enhances classification accuracy and fairness, making automated speech evaluation more robust for diverse learners.
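下面给出 BLV 思路的一个简化示意(PyTorch):在训练时向 logit 注入与类别频率成反比的随机扰动,再计算交叉熵;具体的缩放方式和超参数均为假设,可能与论文不同。

```python
import torch
import torch.nn.functional as F

def blv_cross_entropy(logits, targets, class_counts, sigma=1.0):
    """BLV 思路的简化示意:向 logit 注入与类别频率成反比的高斯扰动后再计算交叉熵,
    使少数类在训练中获得更大的决策边界扰动。缩放方式为本示例的假设。"""
    freq = class_counts / class_counts.sum()       # 各类别的相对频率,形状 [C]
    scale = sigma * (1.0 - freq)                   # 少数类 -> 扰动幅度更大
    noise = torch.randn_like(logits) * scale       # 按类别广播到 [batch, C]
    return F.cross_entropy(logits + noise, targets)

# 用法示例(类别计数为假设数据)
logits = torch.randn(4, 3, requires_grad=True)
targets = torch.tensor([0, 1, 2, 0])
counts = torch.tensor([100.0, 10.0, 5.0])
loss = blv_cross_entropy(logits, targets, counts)
loss.backward()
print(loss.item())
```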
【8】Decoding the Rule Book: Extracting Hidden Moderation Criteria from Reddit Communities
标题 :解码规则书:从Reddit社区中提取隐藏的审核标准
链接 :
https://arxiv.org/abs/2509.02926
作者 :Kim, Himanshu Beniwal, Steven L. Johnson, Thomas Hartvigsen
备注 :Accepted to EMNLP 2025 Main
摘要 :有效的内容审核系统需要明确的分类标准,但像subreddit这样的在线社区通常依据多样且隐含的标准运作。这项工作提出了一种新方法,利用可解释的架构从历史审核数据中识别并提取这些隐含标准。我们将审核标准表示为与内容删除相关的词汇表达评分表,从而能够在不同社区之间进行系统比较。实验表明,这些提取出的词汇模式能够有效复现神经审核模型的性能,同时为决策过程提供透明的洞见。由此得到的标准矩阵揭示了看似共享的规范在实际执行上的显著差异,发现了此前未被记录的审核模式,包括特定社区对某些语言的容忍度、针对主题限制的特征,以及有毒言论分类下的潜在子类别。
摘要 :Effective content moderation systems require explicit classification criteria, yet online communities like subreddits often operate with diverse, implicit standards. This work introduces a novel approach to identify and extract these implicit criteria from historical moderation data using an interpretable architecture. We represent moderation criteria as score tables of lexical expressions associated with content removal, enabling systematic comparison across different communities. Our experiments demonstrate that these extracted lexical patterns effectively replicate the performance of neural moderation models while providing transparent insights into decision-making processes. The resulting criteria matrix reveals significant variations in how seemingly shared norms are actually enforced, uncovering previously undocumented moderation patterns including community-specific tolerances for language, features for topical restrictions, and underlying subcategories of the toxic speech classification.
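为说明"词汇表达评分表"这类可解释审核模型的形式,下面给出一个最小的 Python 示意;其中的短语、权重和阈值均为虚构,仅用于演示接口,并非论文提取出的真实标准。

```python
class LexicalScoreTable:
    """可解释审核标准的最小示意:每个词汇表达对应一个"删除倾向"分数,
    文本得分为命中表达的分数之和,超过阈值即判定为删除。"""
    def __init__(self, weights: dict, threshold: float = 1.0):
        self.weights = weights
        self.threshold = threshold

    def score(self, text: str) -> float:
        text = text.lower()
        return sum(w for phrase, w in self.weights.items() if phrase in text)

    def should_remove(self, text: str) -> bool:
        return self.score(text) >= self.threshold

# 假设的小型评分表,仅用于演示接口
table = LexicalScoreTable({"spam link": 1.2, "buy now": 0.8, "thanks": -0.5})
print(table.score("Buy now! spam link inside"))   # 2.0
print(table.should_remove("thanks for sharing"))  # False
```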
【9】SSVD: Structured SVD for Parameter-Efficient Fine-Tuning and Benchmarking under Domain Shift in ASR
标题 :SSVD:用于ASR领域偏移下参数高效微调与基准测试的结构化SVD
链接 :
https://arxiv.org/abs/2509.02830
作者 :Shinji Watanabe, Hugo Van hamme
备注 :Accepted by IEEE ASRU 2025
摘要 :参数高效微调(PEFT)已成为适配大型基础模型的可扩展解决方案。虽然低秩自适应(LoRA)在语音应用中被广泛使用,但其最先进的变体(如VeRA、DoRA、PiSSA和SVFT)主要是为语言和视觉任务开发的,在语音领域的验证有限。这项工作首次在ESPnet中对这些PEFT方法进行了全面的集成与基准测试。我们进一步提出结构化SVD引导(SSVD)微调,它选择性地旋转与输入相关的右奇异向量,同时保持与输出相关的向量固定,以保留语义映射。这种设计能以极少的可训练参数和更高的效率实现鲁棒的领域自适应。我们在领域偏移的语音识别任务(包括儿童语音和方言变化)上评估了所有方法,模型规模从0.1B到2B不等。所有实现均在ESPnet中发布,以支持可复现性和后续工作。
摘要 :Parameter-efficient fine-tuning (PEFT) has emerged as a scalable solution for adapting large foundation models. While low-rank adaptation (LoRA) is widely used in speech applications, its state-of-the-art variants, e.g., VeRA, DoRA, PiSSA, and SVFT, are developed mainly for language and vision tasks, with limited validation in speech. This work presents the first comprehensive integration and benchmarking of these PEFT methods within ESPnet. We further introduce structured SVD-guided (SSVD) fine-tuning, which selectively rotates input-associated right singular vectors while keeping output-associated vectors fixed to preserve semantic mappings. This design enables robust domain adaptation with minimal trainable parameters and improved efficiency. We evaluate all methods on domain-shifted speech recognition tasks, including child speech and dialectal variation, across model scales from 0.1B to 2B. All implementations are released in ESPnet to support reproducibility and future work.
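下面是 SSVD 思路的一个简化 PyTorch 示意:对冻结权重做 SVD,仅学习作用在前 r 个右奇异向量(输入方向)上的正交旋转,左奇异向量、奇异值与低秩部分之外的残差保持不变;旋转的参数化方式(反对称矩阵加矩阵指数)是本示例的假设,未必与论文一致。

```python
import torch
import torch.nn as nn

class SSVDLinear(nn.Module):
    """SSVD 思路的简化示意:对冻结权重做 SVD,仅学习作用在前 r 个右奇异向量
    (输入方向)上的正交旋转;左奇异向量、奇异值以及低秩部分之外的残差全部固定。"""
    def __init__(self, weight: torch.Tensor, rank: int):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        Ur, Sr, Vhr = U[:, :rank], S[:rank], Vh[:rank, :]
        self.register_buffer("Ur", Ur)
        self.register_buffer("Sr", Sr)
        self.register_buffer("Vhr", Vhr)
        # 前 r 个奇异方向之外的残差,保持不变
        self.register_buffer("residual", weight - Ur @ torch.diag(Sr) @ Vhr)
        # 反对称矩阵经矩阵指数得到正交旋转 R = exp(A - A^T)
        self.A = nn.Parameter(torch.zeros(rank, rank))

    def forward(self, x):
        R = torch.matrix_exp(self.A - self.A.T)            # 正交旋转
        W = self.residual + self.Ur @ torch.diag(self.Sr) @ (R @ self.Vhr)
        return x @ W.T

# 用法示例:仅 rank*rank 个参数可训练,初始时旋转为单位阵,W 等于原权重
layer = SSVDLinear(torch.randn(16, 32), rank=8)
out = layer(torch.randn(4, 32))                            # 形状 [4, 16]
```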
【10】DrDiff: Dynamic Routing Diffusion with Hierarchical Attention for Breaking the Efficiency-Quality Trade-off
标题 :DrDiff:具有分层注意力的动态路由扩散,打破效率与质量的权衡
链接 :
https://arxiv.org/abs/2509.02785
作者 :hang, Yijia Fan, Kaitong Cai, Zimeng Huang, Xiaofei Sun, Jian Wang, Chengpei Tang, Keze Wang
备注 :Accepted 2025 EMNLP (MainConference)
摘要 :本文介绍了DrDiff,一个通过三项核心技术克服效率-质量权衡的长文本生成新框架。首先,我们设计了一种动态专家调度机制,在扩散过程中根据文本复杂度智能分配计算资源,从而更高效地处理不同难度的文本生成任务。其次,我们引入了分层稀疏注意力(HSA)机制,根据不同的输入长度自适应调整注意力模式,将计算复杂度从O($n^2$)降低到O($n$),同时保持模型性能。最后,我们提出了一种软吸收引导优化策略,结合DPM-solver++减少扩散步数,显著提升生成速度。在多个长文本生成基准上的综合实验表明,DrDiff优于现有的SOTA方法。
摘要 :This paper introduces DrDiff, a novel framework for long-text generation that overcomes the efficiency-quality trade-off through three core technologies. First, we design a dynamic expert scheduling mechanism that intelligently allocates computational resources during the diffusion process based on text complexity, enabling more efficient handling of text generation tasks of varying difficulty. Second, we introduce a Hierarchical Sparse Attention (HSA) mechanism that adaptively adjusts attention patterns according to a variety of input lengths, reducing computational complexity from O($n^2$) to O($n$) while maintaining model performance. Finally, we propose a soft absorption guidance optimization strategy that combines with DPM-solver++ to reduce diffusion steps, significantly improving generation speed. Comprehensive experiments on various long-text generation benchmarks demonstrate the superiority of our DrDiff over the existing SOTA methods.
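下面用一个滑动窗口局部注意力的最小示例(PyTorch)说明"稀疏注意力把复杂度从 O(n^2) 降到 O(n·w)"的基本思想;这只是对分层稀疏注意力(HSA)思想的简化近似,并非 DrDiff 的实际实现。

```python
import torch
import torch.nn.functional as F

def local_attention(q, k, v, window: int = 64):
    """滑动窗口局部注意力:每个位置只关注左右 window 个 token,
    复杂度为 O(n*window) 而非 O(n^2)。为清晰起见逐位置循环计算。"""
    T, D = q.shape
    out = torch.empty_like(v)
    for i in range(T):
        lo, hi = max(0, i - window), min(T, i + window + 1)
        scores = q[i] @ k[lo:hi].T / D ** 0.5       # 仅与窗口内的 key 计算相似度
        out[i] = F.softmax(scores, dim=-1) @ v[lo:hi]
    return out

# 用法示例:长度 256、维度 32 的随机序列
q = k = v = torch.randn(256, 32)
y = local_attention(q, k, v, window=16)             # 形状 [256, 32]
print(y.shape)
```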
机器翻译由腾讯交互翻译提供,仅供参考