# Multiple-Choice Normalization in Language Model Evaluation

- Source: EleutherAI: Blog
- Published: 2021-10-11 23:00
- AIHOT link: https://aihot.virxact.com/items/cmnxbjhcw004lsln0oo0rips2
- Original link: https://blog.eleuther.ai/multiple-choice-normalization

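## AI Summary

Multiple-choice evaluation of autoregressive language models (GPT-3, GPT-Neo, GPT-J, and others) can be implemented in several ways. The article surveys the main normalization methods in current use, describing how the probability of each option is computed, how length bias is corrected, and how scores are normalized, as a methodological reference for standardizing language model evaluation.

## Main Text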

Let $x_{0:m}$ be the prompt, and $x_{m:n_i}$ be the $i$th possible continuation, with a token length of $n_i - m$. There are several ways to use a language model to rank multiple possible continuations of a prompt. Since the language model only gives (log) probabilities for the next token given the context (i.e., $\log \mathbb P(x_i|x_{0:i})$), it is ambiguous how an arbitrary multi-token continuation should be scored. The following are several possible ways to resolve this problem:

Unnormalized: The score of continuation $i$ is determined using $\sum_{j=m}^{n_i - 1} \log \mathbb P(x_j|x_{0:j})$. Intuitively, this is the log probability that a generation sampled from the prompt begins with the continuation in question. While this is the simplest method, problems arise when the continuations differ significantly in length: longer continuations tend to have lower log probabilities, which biases the language model towards picking shorter continuations. This approach is used by eval harness in all multiple choice tasks and presented as `acc`.
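The length bias can be seen in a minimal sketch. The per-token log probabilities below are made up for illustration, and the helper names `unnormalized_score` and `argmax` are hypothetical rather than taken from eval harness.

```python
from typing import Sequence


def unnormalized_score(token_logprobs: Sequence[float]) -> float:
    """Sum of the per-token log probabilities of one continuation."""
    return sum(token_logprobs)


def argmax(scores: Sequence[float]) -> int:
    """Index of the highest-scoring continuation."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Made-up per-token log probabilities for two continuations of one prompt.
short_choice = [-1.0, -1.0]                   # 2 tokens, -1.0 per token
long_choice = [-0.5, -0.5, -0.5, -0.5, -0.5]  # 5 tokens, -0.5 per token

scores = [unnormalized_score(short_choice), unnormalized_score(long_choice)]
chosen = argmax(scores)  # index of the continuation the metric would pick
```

In this toy case the shorter continuation is chosen even though every one of its tokens is less likely than every token of the longer one.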

Token-length normalized: The score of continuation $i$ is determined using $\sum_{j=m}^{n_i - 1} \log \mathbb P(x_j|x_{0:j}) / (n_i - m)$. This approach attempts to normalize for length by computing average log probability per token; however, this approach is not tokenization agnostic, and as such two models with different tokenization that assign the same log likelihood to every single input string will have different token-length normalized scores. This approach is used by GPT-3 in most tasks. Eval harness does not report this score because it violates the design principle that all tasks should be tokenization independent.
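The tokenization dependence can be illustrated with a small sketch. It assumes two hypothetical models that assign identical total log likelihoods to each continuation string but split the strings into different numbers of tokens. All numbers are invented.

```python
def token_length_normalized(total_logprob: float, num_tokens: int) -> float:
    """Average log probability per token of one continuation."""
    if num_tokens <= 0:
        raise ValueError("a continuation must contain at least one token")
    return total_logprob / num_tokens


# Both hypothetical models assign the same total log likelihood to each string.
total_logprobs = [-6.0, -8.0]

# Hypothetical token counts per continuation under two different tokenizers.
tokens_a = [2, 4]
tokens_b = [4, 4]

scores_a = [token_length_normalized(lp, n) for lp, n in zip(total_logprobs, tokens_a)]
scores_b = [token_length_normalized(lp, n) for lp, n in zip(total_logprobs, tokens_b)]

pick_a = scores_a.index(max(scores_a))
pick_b = scores_b.index(max(scores_b))
```

Here the two models disagree on the selected answer purely because of how their tokenizers split the first continuation.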

Byte-length normalized: The score of continuation $i$ is determined using $\sum_{j=m}^{n_i - 1} \log \mathbb P(x_j|x_{0:j}) / \sum_{j=m}^{n_i - 1} L_{x_j}$, where $L_{x_j}$ is the number of bytes represented by the token $x_j$. This approach attempts to normalize for length by computing the average log probability per byte, which ensures that it is tokenization agnostic. This approach is also used by eval harness in all multiple choice tasks and presented as `acc_norm`.
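A minimal sketch of the byte-length variant follows. It assumes the tokenizer round-trips text exactly, so the per-token byte counts sum to the UTF-8 byte length of the continuation string. The strings and log probabilities are invented, and the function names are hypothetical.

```python
def byte_length(text: str) -> int:
    """Number of bytes in the UTF-8 encoding of the text."""
    return len(text.encode("utf-8"))


def byte_length_normalized(total_logprob: float, continuation: str) -> float:
    """Average log probability per byte of one continuation."""
    num_bytes = byte_length(continuation)
    if num_bytes == 0:
        raise ValueError("a continuation must not be empty")
    return total_logprob / num_bytes


# Invented continuations and total log probabilities for one prompt.
choices = [" Paris", " the capital"]
total_logprobs = [-6.0, -8.0]

scores = [byte_length_normalized(lp, c) for lp, c in zip(total_logprobs, choices)]
pick = scores.index(max(scores))
```

Because the denominator depends only on the string, two models that assign the same log likelihood to every string receive the same score regardless of tokenizer.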

Unconditional likelihood normalized: The score of continuation $i$ is determined using $\sum_{j=m}^{n_i - 1} \left[ \log \mathbb P(x_j|x_{0:j}) - \log \mathbb P(x_j|x_{m:j}) \right]$, where $\mathbb P(x_j|x_{m:j})$ is the probability of token $x_j$ given only the preceding continuation tokens, without the prompt. Intuitively, this approach measures how much the prompt increases the model's probability of outputting each continuation, relative to the probability of the model producing that continuation unconditionally. This approach is used by GPT-3 in select tasks (ARC, OpenBookQA, and RACE), though no justification is given for why only these tasks use this method, other than that it improves performance.
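As a rough sketch, the score can be computed from two lists of per-token log probabilities for the same continuation: one obtained with the prompt and one obtained without it. The numbers are invented, and `unconditional_normalized` is a hypothetical helper name.

```python
from typing import Sequence


def unconditional_normalized(
    cond_logprobs: Sequence[float], uncond_logprobs: Sequence[float]
) -> float:
    """Sum over tokens of (log prob with prompt) - (log prob without prompt)."""
    if len(cond_logprobs) != len(uncond_logprobs):
        raise ValueError("both lists must cover the same continuation tokens")
    return sum(c - u for c, u in zip(cond_logprobs, uncond_logprobs))


# Continuation A: likely with the prompt, but also likely without it.
cond_a, uncond_a = [-2.0], [-2.5]
# Continuation B: less likely overall, but the prompt helps it much more.
cond_b, uncond_b = [-1.0, -2.0], [-2.0, -4.0]

raw = [sum(cond_a), sum(cond_b)]
normalized = [
    unconditional_normalized(cond_a, uncond_a),
    unconditional_normalized(cond_b, uncond_b),
]

raw_pick = raw.index(max(raw))
normalized_pick = normalized.index(max(normalized))
```

In this toy case the unnormalized score prefers continuation A, while the unconditional likelihood normalized score prefers B, since the prompt raises B's probability by a larger factor.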

The unnormalized, token-length normalized, and byte-length normalized metrics can be computed without additional LM calls. The unconditional likelihood normalized metric requires an additional LM call to obtain the unconditional likelihood.
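The difference in LM call count can be sketched as follows. The `Scorer` signature, the `FakeScorer` lookup table, and the use of an empty string as the prompt-free context are all assumptions made for illustration. A real harness would query a model instead.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical scorer: (context, continuation) -> (summed log prob, token count).
Scorer = Callable[[str, str], Tuple[float, int]]


def score_choices(
    scorer: Scorer, prompt: str, choices: List[str], neutral_context: str = ""
) -> List[Dict[str, float]]:
    """Compute all four scores for each continuation."""
    rows = []
    for choice in choices:
        # One LM call is enough for the first three metrics.
        logprob, num_tokens = scorer(prompt, choice)
        # The unconditional term needs a second call without the prompt.
        uncond_logprob, _ = scorer(neutral_context, choice)
        rows.append({
            "unnormalized": logprob,
            "token_norm": logprob / num_tokens,
            "byte_norm": logprob / len(choice.encode("utf-8")),
            "uncond_norm": logprob - uncond_logprob,
        })
    return rows


class FakeScorer:
    """Stand-in for a language model that looks scores up in a table."""

    def __init__(self, table: Dict[Tuple[str, str], Tuple[float, int]]) -> None:
        self.table = table
        self.calls = 0

    def __call__(self, context: str, continuation: str) -> Tuple[float, int]:
        self.calls += 1
        return self.table[(context, continuation)]


prompt = "Q: What is the capital of France? A:"
fake = FakeScorer({
    (prompt, " Paris"): (-1.0, 1),
    (prompt, " Rome"): (-4.0, 1),
    ("", " Paris"): (-5.0, 1),
    ("", " Rome"): (-5.0, 1),
})
rows = score_choices(fake, prompt, [" Paris", " Rome"])
```

With two choices the fake scorer is called four times: two calls would suffice if the unconditional likelihood normalized score were dropped.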