Revisiting a Thorny Problem: A Semantic Reasoning Benchmark for Language Models
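Source: arxiv.org. A research team has released SemanticQA, an evaluation suite for assessing how well language models handle semantic phrases. The benchmark consolidates existing multiword expression resources into a unified testbed covering four categories: lexical collocations, idiomatic expressions, noun compounds, and verbal constructions. Tests on models of different architectures and scales show substantial performance variation across extraction, classification, interpretation, and sequential task composition, with the largest gaps on tasks that require deeper semantic reasoning, exposing a bottleneck in the understanding of complex semantic phrases. The evaluation data and tools are open source.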
We present SemanticQA, an evaluation suite designed to assess language models (LMs) on semantic phrase processing tasks. The benchmark consolidates existing multiword expression (MwE) resources and reorganizes them into a unified testbed. It covers both general lexical phenomena, such as lexical collocations, and three fine-grained categories: idiomatic expressions, noun compounds, and verbal constructions. Through SemanticQA, we assess LMs of diverse architectures and scales on extraction, classification, and interpretation tasks, as well as on sequential compositions of these tasks. We reveal substantial performance variation, particularly on tasks requiring semantic reasoning. This variation highlights differences in the reasoning efficacy and semantic understanding of LMs and offers insights for building LMs with stronger comprehension of non-trivial semantic phrases. The evaluation harness and data for SemanticQA are available at https://github.com/jacklanda/SemanticQA.