An Analysis of Grammatical Gender Change from Latin to Occitan
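Source: arxiv.org
This study introduces an interpretable deep learning framework to investigate how the grammatical gender system shifted from a three-way division (masculine, feminine, neuter) to a two-way one as Latin evolved into Occitan, a Romance language. It finds that conventional tokenization strategies are not robust enough for this low-resource historical corpus, and that the proposed improved tokenizer raises model performance. At the lexical level, it evaluates the contribution of morphological features to gender prediction; at the contextual level, it quantifies the influence of different part-of-speech categories on grammatical gender prediction, revealing how gender information is distributed between the lemma and its sentential context.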
The diachronic evolution from Latin to the Romance languages involved a restructuring of the grammatical gender system from a tripartite configuration (masculine, feminine, neuter) to a bipartite one (masculine, feminine) in most Romance languages. In this work, we introduce an interpretable deep learning framework to investigate this phenomenon at both lexical and contextual levels. First, we show that conventional tokenization strategies are insufficiently robust for this low-resource historical setting, and that our proposed tokenizer improves performance over these baselines. At the lexical level, we evaluate the contribution of morphological features to gender prediction. At the contextual level, we quantify the contributions of different part-of-speech categories to grammatical gender prediction. Together, these analyses characterize the distribution of gender information between the lemma and its sentential context. We make our codebase, datasets, and results publicly available at https://github.com/ahan-2000/Lost-in-Translation-.