# Finetuning Models on Downstream Tasks

- Source: EleutherAI Blog
- Published: 2021-05-25 04:00
- AIHOT link: https://aihot.virxact.com/items/cmnxbjhcw004rsln0x37nvpcu
- Original link: https://blog.eleuther.ai/tuning-on-eval-harness

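## AI Summary

This post fine-tunes the GPT-Neo model on downstream tasks, training on the tasks in the eval harness and observing how fine-tuning affects performance. The experiment adapts the model's parameters to specific tasks and evaluates how the pretrained model's capabilities change in downstream settings, providing empirical data on what fine-tuning does to model performance.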

## Main Text

The GPT-3 paper didn't explore fine-tuning on downstream tasks, so I decided to tune Neo 2.7B for 1.1k iterations on all the tasks in eval harness that have a training set (all at once, because tuning one model per task would have taken ages). I was quite surprised that the tuned model didn't completely destroy untuned 2.7B on all tasks; from eyeballing the results, it looks like a tossup. Interestingly, the tuned model beats 2.7B by quite a lot on anli, which is especially notable given that this is one task the models in the GPT-3 paper struggled on. lambada and pubmedqa are also included in these tables even though they don't have a training set (at least in the eval harness implementation, using the OA version of lambada), because I wanted to look at the effect on sets not in the tuning mix and potentially observe some catastrophic forgetting. Sure enough, lambada and pubmedqa scores are significantly worse on the tuned model.
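To go one step beyond eyeballing, here is a minimal sketch that turns the reported `value ± stderr` pairs into a rough z-score per task and tallies the outcomes. It is not part of the original experiment. The rows are a small hand-picked sample from the zero-shot table, the helper names (`z_score`, `verdict`, `tally`) are made up for illustration, and the 1.96 cutoff is an assumed convention. It also treats the two scores as independent, which is a simplification because both models are scored on the same examples.

```python
import math

# (task, metric, untuned 2.7B, tuned, stderr), copied from the zero-shot table.
# The table prints the same ± value in both columns, so one stderr is reused.
# Only higher-is-better metrics are included (no perplexity).
ROWS = [
    ("anli_r1", "acc", 0.332, 0.418, 0.015),
    ("arc_easy", "acc", 0.611, 0.560, 0.010),
    ("hellaswag", "acc", 0.427, 0.400, 0.005),
    ("mnli", "acc", 0.339, 0.729, 0.005),
    ("piqa", "acc", 0.721, 0.713, 0.010),
    ("winogrande", "acc", 0.575, 0.570, 0.014),
    ("lambada", "acc", 0.622, 0.387, 0.007),   # not in the tuning mix
    ("pubmedqa", "acc", 0.565, 0.496, 0.016),  # not in the tuning mix
]


def z_score(untuned, tuned, stderr):
    """z statistic for (tuned - untuned), assuming independent scores."""
    return (tuned - untuned) / math.sqrt(2 * stderr ** 2)


def verdict(z, threshold=1.96):
    """Label a z-score; the threshold is an assumed ~95% two-sided cutoff."""
    if z > threshold:
        return "tuned better"
    if z < -threshold:
        return "tuned worse"
    return "tossup"


def tally(rows):
    """Count how many rows fall into each verdict."""
    counts = {"tuned better": 0, "tuned worse": 0, "tossup": 0}
    for _task, _metric, untuned, tuned, stderr in rows:
        counts[verdict(z_score(untuned, tuned, stderr))] += 1
    return counts


if __name__ == "__main__":
    for task, metric, untuned, tuned, stderr in ROWS:
        z = z_score(untuned, tuned, stderr)
        print(f"{task:12s} {metric:4s} z={z:+7.2f}  {verdict(z)}")
    print(tally(ROWS))
```

On this sample the tally comes out mixed, with both held-out tasks landing on the "tuned worse" side. A sample this small says nothing about the full table; extending `ROWS` to every row would be the obvious next step.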

### Zero shot

| Task | Metric | 2.7B | Tuned |
|---|---|---|---|
| anli_r1 | acc | 0.332 ± 0.015 | 0.418 ± 0.015 |
| anli_r2 | acc | 0.342 ± 0.015 | 0.375 ± 0.015 |
| anli_r3 | acc | 0.352 ± 0.014 | 0.392 ± 0.014 |
| arc_challenge | acc | 0.275 ± 0.013 | 0.286 ± 0.013 |
| | acc_norm | 0.301 ± 0.013 | 0.312 ± 0.013 |
| arc_easy | acc | 0.611 ± 0.010 | 0.560 ± 0.010 |
| | acc_norm | 0.539 ± 0.010 | 0.558 ± 0.010 |
| boolq | acc | 0.630 ± 0.008 | 0.605 ± 0.008 |
| cb | acc | 0.304 ± 0.062 | 0.411 ± 0.062 |
| copa | acc | 0.800 ± 0.040 | 0.730 ± 0.040 |
| ethics_cm | acc | 0.510 ± 0.008 | 0.561 ± 0.008 |
| ethics_deontology | acc | 0.497 ± 0.008 | 0.658 ± 0.008 |
| ethics_justice | acc | 0.501 ± 0.010 | 0.589 ± 0.010 |
| ethics_utilitarianism | acc | 0.497 ± 0.007 | 0.498 ± 0.007 |
| ethics_virtue | acc | 0.251 ± 0.006 | 0.800 ± 0.006 |
| headqa | acc | 0.235 ± 0.008 | 0.233 ± 0.008 |
| | acc_norm | 0.272 ± 0.008 | 0.265 ± 0.008 |
| hellaswag | acc | 0.427 ± 0.005 | 0.400 ± 0.005 |
| | acc_norm | 0.558 ± 0.005 | 0.517 ± 0.005 |
| hendrycksTest-abstract_algebra | acc | 0.230 ± 0.042 | 0.340 ± 0.042 |
| | acc_norm | 0.200 ± 0.040 | 0.350 ± 0.040 |
| hendrycksTest-anatomy | acc | 0.252 ± 0.037 | 0.267 ± 0.037 |
| | acc_norm | 0.222 ± 0.036 | 0.252 ± 0.036 |
| hendrycksTest-astronomy | acc | 0.250 ± 0.035 | 0.309 ± 0.035 |
| | acc_norm | 0.362 ± 0.039 | 0.309 ± 0.039 |
| hendrycksTest-business_ethics | acc | 0.360 ± 0.048 | 0.340 ± 0.048 |
| | acc_norm | 0.280 ± 0.045 | 0.310 ± 0.045 |
| hendrycksTest-clinical_knowledge | acc | 0.291 ± 0.028 | 0.370 ± 0.028 |
| | acc_norm | 0.287 ± 0.028 | 0.374 ± 0.028 |
| hendrycksTest-college_biology | acc | 0.250 ± 0.036 | 0.250 ± 0.036 |
| | acc_norm | 0.222 ± 0.035 | 0.271 ± 0.035 |
| hendrycksTest-college_chemistry | acc | 0.230 ± 0.042 | 0.350 ± 0.042 |
| | acc_norm | 0.250 ± 0.044 | 0.350 ± 0.044 |
| hendrycksTest-college_computer_science | acc | 0.280 ± 0.045 | 0.430 ± 0.045 |
| | acc_norm | 0.270 ± 0.045 | 0.390 ± 0.045 |
| hendrycksTest-college_mathematics | acc | 0.200 ± 0.040 | 0.370 ± 0.040 |
| | acc_norm | 0.300 ± 0.046 | 0.350 ± 0.046 |
| hendrycksTest-college_medicine | acc | 0.254 ± 0.033 | 0.312 ± 0.033 |
| | acc_norm | 0.260 ± 0.033 | 0.306 ± 0.033 |
| hendrycksTest-college_physics | acc | 0.225 ± 0.042 | 0.275 ± 0.042 |
| | acc_norm | 0.245 ± 0.043 | 0.284 ± 0.043 |
| hendrycksTest-computer_security | acc | 0.270 ± 0.045 | 0.290 ± 0.045 |
| | acc_norm | 0.330 ± 0.047 | 0.290 ± 0.047 |
| hendrycksTest-conceptual_physics | acc | 0.247 ± 0.028 | 0.315 ± 0.028 |
| | acc_norm | 0.187 ± 0.026 | 0.319 ± 0.026 |
| hendrycksTest-econometrics | acc | 0.193 ± 0.037 | 0.272 ± 0.037 |
| | acc_norm | 0.228 ± 0.039 | 0.281 ± 0.039 |
| hendrycksTest-electrical_engineering | acc | 0.331 ± 0.039 | 0.386 ± 0.039 |
| | acc_norm | 0.338 ± 0.039 | 0.386 ± 0.039 |
| hendrycksTest-elementary_mathematics | acc | 0.230 ± 0.022 | 0.280 ± 0.022 |
| | acc_norm | 0.270 ± 0.023 | 0.278 ± 0.023 |
| hendrycksTest-formal_logic | acc | 0.333 ± 0.042 | 0.310 ± 0.042 |
| | acc_norm | 0.302 ± 0.041 | 0.278 ± 0.041 |
| hendrycksTest-global_facts | acc | 0.240 ± 0.043 | 0.250 ± 0.043 |
| | acc_norm | 0.240 ± 0.043 | 0.260 ± 0.043 |
| hendrycksTest-high_school_biology | acc | 0.219 ± 0.024 | 0.335 ± 0.024 |
| | acc_norm | 0.284 ± 0.026 | 0.329 ± 0.026 |
| hendrycksTest-high_school_chemistry | acc | 0.167 ± 0.026 | 0.207 ± 0.026 |
| | acc_norm | 0.256 ± 0.031 | 0.212 ± 0.031 |
| hendrycksTest-high_school_computer_science | acc | 0.220 ± 0.042 | 0.290 ± 0.042 |
| | acc_norm | 0.280 ± 0.045 | 0.280 ± 0.045 |
| hendrycksTest-high_school_european_history | acc | 0.267 ± 0.035 | 0.358 ± 0.035 |
| | acc_norm | 0.285 ± 0.035 | 0.358 ± 0.035 |
| hendrycksTest-high_school_geography | acc | 0.227 ± 0.030 | 0.359 ± 0.030 |
| | acc_norm | 0.298 ± 0.033 | 0.333 ± 0.033 |
| hendrycksTest-high_school_government_and_politics | acc | 0.207 ± 0.029 | 0.301 ± 0.029 |
| | acc_norm | 0.259 ± 0.032 | 0.311 ± 0.032 |
| hendrycksTest-high_school_macroeconomics | acc | 0.262 ± 0.022 | 0.267 ± 0.022 |
| | acc_norm | 0.267 ± 0.022 | 0.262 ± 0.022 |
| hendrycksTest-high_school_mathematics | acc | 0.174 ± 0.023 | 0.248 ± 0.023 |
| | acc_norm | 0.244 ± 0.026 | 0.270 ± 0.026 |
| hendrycksTest-high_school_microeconomics | acc | 0.256 ± 0.028 | 0.265 ± 0.028 |
| | acc_norm | 0.328 ± 0.030 | 0.277 ± 0.030 |
| hendrycksTest-high_school_physics | acc | 0.225 ± 0.034 | 0.212 ± 0.034 |
| | acc_norm | 0.219 ± 0.034 | 0.225 ± 0.034 |
| hendrycksTest-high_school_psychology | acc | 0.253 ± 0.019 | 0.338 ± 0.019 |
| | acc_norm | 0.261 ± 0.019 | 0.330 ± 0.019 |
| hendrycksTest-high_school_statistics | acc | 0.264 ± 0.030 | 0.278 ± 0.030 |
| | acc_norm | 0.338 ± 0.032 | 0.273 ± 0.032 |
| hendrycksTest-high_school_us_history | acc | 0.235 ± 0.030 | 0.230 ± 0.030 |
| | acc_norm | 0.270 ± 0.031 | 0.235 ± 0.031 |
| hendrycksTest-high_school_world_history | acc | 0.270 ± 0.029 | 0.388 ± 0.029 |
| | acc_norm | 0.300 ± 0.030 | 0.392 ± 0.030 |
| hendrycksTest-human_aging | acc | 0.296 ± 0.031 | 0.318 ± 0.031 |
| | acc_norm | 0.238 ± 0.029 | 0.314 ± 0.029 |
| hendrycksTest-human_sexuality | acc | 0.336 ± 0.041 | 0.290 ± 0.041 |
| | acc_norm | 0.290 ± 0.040 | 0.290 ± 0.040 |
| hendrycksTest-international_law | acc | 0.248 ± 0.039 | 0.322 ± 0.039 |
| | acc_norm | 0.496 ± 0.046 | 0.347 ± 0.046 |
| hendrycksTest-jurisprudence | acc | 0.250 ± 0.042 | 0.269 ± 0.042 |
| | acc_norm | 0.426 ± 0.048 | 0.296 ± 0.048 |
| hendrycksTest-logical_fallacies | acc | 0.209 ± 0.032 | 0.258 ± 0.032 |
| | acc_norm | 0.288 ± 0.036 | 0.264 ± 0.036 |
| hendrycksTest-machine_learning | acc | 0.295 ± 0.043 | 0.250 ± 0.043 |
| | acc_norm | 0.259 ± 0.042 | 0.259 ± 0.042 |
| hendrycksTest-management | acc | 0.184 ± 0.038 | 0.311 ± 0.038 |
| | acc_norm | 0.282 ± 0.045 | 0.330 ± 0.045 |
| hendrycksTest-marketing | acc | 0.316 ± 0.030 | 0.432 ± 0.030 |
| | acc_norm | 0.338 ± 0.031 | 0.440 ± 0.031 |
| hendrycksTest-medical_genetics | acc | 0.300 ± 0.046 | 0.240 ± 0.046 |
| | acc_norm | 0.370 ± 0.049 | 0.270 ± 0.049 |
| hendrycksTest-miscellaneous | acc | 0.281 ± 0.016 | 0.323 ± 0.016 |
| | acc_norm | 0.271 ± 0.016 | 0.328 ± 0.016 |
| hendrycksTest-moral_disputes | acc | 0.286 ± 0.024 | 0.350 ± 0.024 |
| | acc_norm | 0.355 ± 0.026 | 0.364 ± 0.026 |
| hendrycksTest-moral_scenarios | acc | 0.234 ± 0.014 | 0.264 ± 0.014 |
| | acc_norm | 0.273 ± 0.015 | 0.269 ± 0.015 |
| hendrycksTest-nutrition | acc | 0.275 ± 0.026 | 0.307 ± 0.026 |
| | acc_norm | 0.359 ± 0.027 | 0.333 ± 0.027 |
| hendrycksTest-philosophy | acc | 0.270 ± 0.025 | 0.305 ± 0.025 |
| | acc_norm | 0.315 ± 0.026 | 0.322 ± 0.026 |
| hendrycksTest-prehistory | acc | 0.256 ± 0.024 | 0.361 ± 0.024 |
| | acc_norm | 0.216 ± 0.023 | 0.364 ± 0.023 |
| hendrycksTest-professional_accounting | acc | 0.248 ± 0.026 | 0.230 ± 0.026 |
| | acc_norm | 0.259 ± 0.026 | 0.220 ± 0.026 |
| hendrycksTest-professional_law | acc | 0.267 ± 0.011 | 0.275 ± 0.011 |
| | acc_norm | 0.300 ± 0.012 | 0.284 ± 0.012 |
| hendrycksTest-professional_medicine | acc | 0.246 ± 0.026 | 0.290 ± 0.026 |
| | acc_norm | 0.232 ± 0.026 | 0.298 ± 0.026 |
| hendrycksTest-professional_psychology | acc | 0.258 ± 0.018 | 0.299 ± 0.018 |
| | acc_norm | 0.253 ± 0.018 | 0.315 ± 0.018 |
| hendrycksTest-public_relations | acc | 0.300 ± 0.044 | 0.364 ± 0.044 |
| | acc_norm | 0.164 ± 0.035 | 0.373 ± 0.035 |
| hendrycksTest-security_studies | acc | 0.339 ± 0.030 | 0.343 ± 0.030 |
| | acc_norm | 0.286 ± 0.029 | 0.286 ± 0.029 |
| hendrycksTest-sociology | acc | 0.269 ± 0.031 | 0.403 ± 0.031 |
| | acc_norm | 0.264 ± 0.031 | 0.423 ± 0.031 |
| hendrycksTest-us_foreign_policy | acc | 0.330 ± 0.047 | 0.390 ± 0.047 |
| | acc_norm | 0.350 ± 0.048 | 0.390 ± 0.048 |
| hendrycksTest-virology | acc | 0.313 ± 0.036 | 0.325 ± 0.036 |
| | acc_norm | 0.331 ± 0.037 | 0.343 ± 0.037 |
| hendrycksTest-world_religions | acc | 0.304 ± 0.035 | 0.316 ± 0.035 |
| | acc_norm | 0.386 ± 0.037 | 0.339 ± 0.037 |
| logiqa | acc | 0.201 ± 0.016 | 0.280 ± 0.016 |
| | acc_norm | 0.281 ± 0.018 | 0.283 ± 0.018 |
| mathqa | acc | 0.247 ± 0.008 | 0.248 ± 0.008 |
| | acc_norm | 0.246 ± 0.008 | 0.239 ± 0.008 |
| mnli | acc | 0.339 ± 0.005 | 0.729 ± 0.005 |
| mnli_mismatched | acc | 0.338 ± 0.005 | 0.742 ± 0.005 |
| mrpc | acc | 0.684 ± 0.023 | 0.701 ± 0.023 |
| | f1 | 0.812 ± 0.016 | 0.820 ± 0.016 |
| multirc | acc | 0.016 ± 0.004 | 0.004 ± 0.004 |
| openbookqa | acc | 0.234 ± 0.019 | 0.248 ± 0.019 |
| | acc_norm | 0.332 ± 0.021 | 0.318 ± 0.021 |
| piqa | acc | 0.721 ± 0.010 | 0.713 ± 0.010 |
| | acc_norm | 0.729 ± 0.010 | 0.708 ± 0.010 |
| qnli | acc | 0.509 ± 0.007 | 0.761 ± 0.007 |
| qqp | acc | 0.368 ± 0.002 | 0.843 ± 0.002 |
| | f1 | 0.538 ± 0.003 | 0.789 ± 0.003 |
| race | acc | 0.353 ± 0.015 | 0.362 ± 0.015 |
| record | f1 | 0.845 ± 0.004 | 0.779 ± 0.004 |
| | em | 0.838 ± 0.004 | 0.770 ± 0.004 |
| rte | acc | 0.520 ± 0.030 | 0.729 ± 0.030 |
| sciq | acc | 0.893 ± 0.010 | 0.919 ± 0.010 |
| | acc_norm | 0.828 ± 0.012 | 0.913 ± 0.012 |
| sst | acc | 0.789 ± 0.014 | 0.862 ± 0.014 |
| webqs | acc | 0.016 ± 0.003 | 0.071 ± 0.003 |
| wic | acc | 0.500 ± 0.020 | 0.517 ± 0.020 |
| winogrande | acc | 0.575 ± 0.014 | 0.570 ± 0.014 |
| wnli | acc | 0.310 ± 0.055 | 0.563 ± 0.055 |
| wsc | acc | 0.365 ± 0.047 | 0.365 ± 0.047 |
| lambada | ppl | 5.626 ± 0.139 | 27.796 ± 0.139 |
| | acc | 0.622 ± 0.007 | 0.387 ± 0.007 |
| pubmedqa | acc | 0.565 ± 0.016 | 0.496 ± 0.016 |
| coqa | f1 | 0.604 ± 0.018 | 0.598 ± 0.018 |
| | em | 0.479 ± 0.020 | 0.480 ± 0.020 |
| drop | em | 0.026 ± 0.002 | 0.001 ± 0.002 |
| | f1 | 0.083 ± 0.002 | 0.033 ± 0.002 |
| math_algebra | acc | 0.008 ± 0.003 | 0.025 ± 0.003 |
| math_geometry | acc | 0.002 ± 0.002 | 0.021 ± 0.002 |
| math_intermediate_algebra | acc | 0.004 ± 0.002 | 0.025 ± 0.002 |
| math_num_theory | acc | 0.019 ± 0.006 | 0.046 ± 0.006 |
| math_prealgebra | acc | 0.001 ± 0.001 | 0.039 ± 0.001 |
| math_precalc | acc | 0.005 ± 0.003 | 0.016 ± 0.003 |

### One shot

| Task | Metric | 2.7B | Tuned |
|---|---|---|---|
| anli_r1 | acc | 0.331 ± 0.015 | 0.443 ± 0.015 |
| anli_r2 | acc | 0.307 ± 0.015 | 0.373 ± 0.015 |
| anli_r3 | acc | 0.343 ± 0.014 | 0.423 ± 0.014 |
| arc_challenge | acc | 0.302 ± 0.013 | 0.292 ± 0.013 |
| | acc_norm | 0.323 ± 0.014 | 0.323 ± 0.014 |
| arc_easy | acc | 0.634 ± 0.010 | 0.567 ± 0.010 |
| | acc_norm | 0.622 ± 0.010 | 0.562 ± 0.010 |
| boolq | acc | 0.536 ± 0.009 | 0.620 ± 0.009 |
| cb | acc | 0.429 ± 0.067 | 0.411 ± 0.067 |
| cola | mcc | 0.001 ± 0.031 | 0.022 ± 0.031 |
| copa | acc | 0.770 ± 0.042 | 0.780 ± 0.042 |
| ethics_cm | acc | 0.508 ± 0.008 | 0.625 ± 0.008 |
| ethics_deontology | acc | 0.511 ± 0.008 | 0.683 ± 0.008 |
| ethics_justice | acc | 0.515 ± 0.010 | 0.604 ± 0.010 |
| ethics_utilitarianism | acc | 0.490 ± 0.007 | 0.536 ± 0.007 |
| ethics_virtue | acc | 0.726 ± 0.006 | 0.805 ± 0.006 |
| headqa | acc | 0.230 ± 0.008 | 0.228 ± 0.008 |
| | acc_norm | 0.270 ± 0.008 | 0.275 ± 0.008 |
| hellaswag | acc | 0.428 ± 0.005 | 0.386 ± 0.005 |
| | acc_norm | 0.557 ± 0.005 | 0.494 ± 0.005 |
| hendrycksTest-abstract_algebra | acc | 0.220 ± 0.042 | 0.270 ± 0.042 |
| | acc_norm | 0.290 ± 0.046 | 0.260 ± 0.046 |
| hendrycksTest-anatomy | acc | 0.289 ± 0.039 | 0.304 ± 0.039 |
| | acc_norm | 0.230 ± 0.036 | 0.289 ± 0.036 |
| hendrycksTest-astronomy | acc | 0.204 ± 0.033 | 0.322 ± 0.033 |
| | acc_norm | 0.303 ± 0.037 | 0.322 ± 0.037 |
| hendrycksTest-business_ethics | acc | 0.290 ± 0.046 | 0.320 ± 0.046 |
| | acc_norm | 0.280 ± 0.045 | 0.280 ± 0.045 |
| hendrycksTest-clinical_knowledge | acc | 0.287 ± 0.028 | 0.351 ± 0.028 |
| | acc_norm | 0.328 ± 0.029 | 0.358 ± 0.029 |
| hendrycksTest-college_biology | acc | 0.215 ± 0.034 | 0.271 ± 0.034 |
| | acc_norm | 0.194 ± 0.033 | 0.271 ± 0.033 |
| hendrycksTest-college_chemistry | acc | 0.300 ± 0.046 | 0.330 ± 0.046 |
| | acc_norm | 0.340 ± 0.048 | 0.320 ± 0.048 |
| hendrycksTest-college_computer_science | acc | 0.330 ± 0.047 | 0.390 ± 0.047 |
| | acc_norm | 0.310 ± 0.046 | 0.360 ± 0.046 |
| hendrycksTest-college_mathematics | acc | 0.200 ± 0.040 | 0.280 ± 0.040 |
| | acc_norm | 0.220 ± 0.042 | 0.270 ± 0.042 |
| hendrycksTest-college_medicine | acc | 0.254 ± 0.033 | 0.295 ± 0.033 |
| | acc_norm | 0.260 ± 0.033 | 0.283 ± 0.033 |
| hendrycksTest-college_physics | acc | 0.304 ± 0.046 | 0.284 ± 0.046 |
| | acc_norm | 0.333 ± 0.047 | 0.304 ± 0.047 |
| hendrycksTest-computer_security | acc | 0.320 ± 0.047 | 0.270 ± 0.047 |
| | acc_norm | 0.320 ± 0.047 | 0.290 ± 0.047 |
| hendrycksTest-conceptual_physics | acc | 0.268 ± 0.029 | 0.349 ± 0.029 |
| | acc_norm | 0.255 ± 0.029 | 0.345 ± 0.029 |
| hendrycksTest-econometrics | acc | 0.298 ± 0.043 | 0.272 ± 0.043 |
| | acc_norm | 0.298 ± 0.043 | 0.263 ± 0.043 |
| hendrycksTest-electrical_engineering | acc | 0.338 ± 0.039 | 0.324 ± 0.039 |
| | acc_norm | 0.290 ± 0.038 | 0.303 ± 0.038 |
| hendrycksTest-elementary_mathematics | acc | 0.262 ± 0.023 | 0.275 ± 0.023 |
| | acc_norm | 0.294 ± 0.023 | 0.275 ± 0.023 |
| hendrycksTest-formal_logic | acc | 0.310 ± 0.041 | 0.310 ± 0.041 |
| | acc_norm | 0.294 ± 0.041 | 0.270 ± 0.041 |
| hendrycksTest-global_facts | acc | 0.200 ± 0.040 | 0.290 ± 0.040 |
| | acc_norm | 0.210 ± 0.041 | 0.290 ± 0.041 |
| hendrycksTest-high_school_biology | acc | 0.265 ± 0.025 | 0.342 ± 0.025 |
| | acc_norm | 0.287 ± 0.026 | 0.342 ± 0.026 |
| hendrycksTest-high_school_chemistry | acc | 0.251 ± 0.031 | 0.232 ± 0.031 |
| | acc_norm | 0.291 ± 0.032 | 0.227 ± 0.032 |
| hendrycksTest-high_school_computer_science | acc | 0.260 ± 0.044 | 0.280 ± 0.044 |
| | acc_norm | 0.300 ± 0.046 | 0.260 ± 0.046 |
| hendrycksTest-high_school_european_history | acc | 0.267 ± 0.035 | 0.309 ± 0.035 |
| | acc_norm | 0.315 ± 0.036 | 0.321 ± 0.036 |
| hendrycksTest-high_school_geography | acc | 0.227 ± 0.030 | 0.348 ± 0.030 |
| | acc_norm | 0.278 ± 0.032 | 0.354 ± 0.032 |
| hendrycksTest-high_school_government_and_politics | acc | 0.290 ± 0.033 | 0.332 ± 0.033 |
| | acc_norm | 0.290 ± 0.033 | 0.321 ± 0.033 |
| hendrycksTest-high_school_macroeconomics | acc | 0.279 ± 0.023 | 0.305 ± 0.023 |
| | acc_norm | 0.267 ± 0.022 | 0.285 ± 0.022 |
| hendrycksTest-high_school_mathematics | acc | 0.252 ± 0.026 | 0.278 ± 0.026 |
| | acc_norm | 0.296 ± 0.028 | 0.304 ± 0.028 |
| hendrycksTest-high_school_microeconomics | acc | 0.265 ± 0.029 | 0.256 ± 0.029 |
| | acc_norm | 0.324 ± 0.030 | 0.273 ± 0.030 |
| hendrycksTest-high_school_physics | acc | 0.205 ± 0.033 | 0.205 ± 0.033 |
| | acc_norm | 0.232 ± 0.034 | 0.212 ± 0.034 |
| hendrycksTest-high_school_psychology | acc | 0.251 ± 0.019 | 0.328 ± 0.019 |
| | acc_norm | 0.270 ± 0.019 | 0.325 ± 0.019 |
| hendrycksTest-high_school_statistics | acc | 0.319 ± 0.032 | 0.241 ± 0.032 |
| | acc_norm | 0.319 ± 0.032 | 0.245 ± 0.032 |
| hendrycksTest-high_school_us_history | acc | 0.265 ± 0.031 | 0.221 ± 0.031 |
| | acc_norm | 0.260 ± 0.031 | 0.230 ± 0.031 |
| hendrycksTest-high_school_world_history | acc | 0.283 ± 0.029 | 0.371 ± 0.029 |
| | acc_norm | 0.266 ± 0.029 | 0.380 ± 0.029 |
| hendrycksTest-human_aging | acc | 0.296 ± 0.031 | 0.296 ± 0.031 |
| | acc_norm | 0.274 ± 0.030 | 0.291 ± 0.030 |
| hendrycksTest-human_sexuality | acc | 0.351 ± 0.042 | 0.290 ± 0.042 |
| | acc_norm | 0.282 ± 0.039 | 0.290 ± 0.039 |
| hendrycksTest-international_law | acc | 0.248 ± 0.039 | 0.322 ± 0.039 |
| | acc_norm | 0.347 ± 0.043 | 0.331 ± 0.043 |
| hendrycksTest-jurisprudence | acc | 0.269 ± 0.043 | 0.296 ± 0.043 |
| | acc_norm | 0.370 ± 0.047 | 0.296 ± 0.047 |
| hendrycksTest-logical_fallacies | acc | 0.202 ± 0.032 | 0.276 ± 0.032 |
| | acc_norm | 0.270 ± 0.035 | 0.258 ± 0.035 |
| hendrycksTest-machine_learning | acc | 0.295 ± 0.043 | 0.250 ± 0.043 |
| | acc_norm | 0.330 ± 0.045 | 0.223 ± 0.045 |
| hendrycksTest-management | acc | 0.282 ± 0.045 | 0.320 ± 0.045 |
| | acc_norm | 0.272 ± 0.044 | 0.350 ± 0.044 |
| hendrycksTest-marketing | acc | 0.303 ± 0.030 | 0.415 ± 0.030 |
| | acc_norm | 0.329 ± 0.031 | 0.423 ± 0.031 |
| hendrycksTest-medical_genetics | acc | 0.330 ± 0.047 | 0.300 ± 0.047 |
| | acc_norm | 0.420 ± 0.050 | 0.300 ± 0.050 |
| hendrycksTest-miscellaneous | acc | 0.319 ± 0.017 | 0.318 ± 0.017 |
| | acc_norm | 0.319 ± 0.017 | 0.313 ± 0.017 |
| hendrycksTest-moral_disputes | acc | 0.298 ± 0.025 | 0.341 ± 0.025 |
| | acc_norm | 0.318 ± 0.025 | 0.344 ± 0.025 |
| hendrycksTest-moral_scenarios | acc | 0.267 ± 0.015 | 0.240 ± 0.015 |
| | acc_norm | 0.265 ± 0.015 | 0.238 ± 0.015 |
| hendrycksTest-nutrition | acc | 0.278 ± 0.026 | 0.330 ± 0.026 |
| | acc_norm | 0.337 ± 0.027 | 0.350 ± 0.027 |
| hendrycksTest-philosophy | acc | 0.251 ± 0.025 | 0.315 ± 0.025 |
| | acc_norm | 0.293 ± 0.026 | 0.325 ± 0.026 |
| hendrycksTest-prehistory | acc | 0.244 ± 0.024 | 0.352 ± 0.024 |
| | acc_norm | 0.250 ± 0.024 | 0.361 ± 0.024 |
| hendrycksTest-professional_accounting | acc | 0.287 ± 0.027 | 0.213 ± 0.027 |
| | acc_norm | 0.248 ± 0.026 | 0.216 ± 0.026 |
| hendrycksTest-professional_law | acc | 0.273 ± 0.011 | 0.267 ± 0.011 |
| | acc_norm | 0.269 ± 0.011 | 0.269 ± 0.011 |
| hendrycksTest-professional_medicine | acc | 0.301 ± 0.028 | 0.301 ± 0.028 |
| | acc_norm | 0.268 ± 0.027 | 0.327 ± 0.027 |
| hendrycksTest-professional_psychology | acc | 0.279 ± 0.018 | 0.304 ± 0.018 |
| | acc_norm | 0.284 ± 0.018 | 0.310 ± 0.018 |
| hendrycksTest-public_relations | acc | 0.327 ± 0.045 | 0.345 ± 0.045 |
| | acc_norm | 0.309 ± 0.044 | 0.336 ± 0.044 |
| hendrycksTest-security_studies | acc | 0.265 ± 0.028 | 0.331 ± 0.028 |
| | acc_norm | 0.208 ± 0.026 | 0.290 ± 0.026 |
| hendrycksTest-sociology | acc | 0.269 ± 0.031 | 0.393 ± 0.031 |
| | acc_norm | 0.249 ± 0.031 | 0.383 ± 0.031 |
| hendrycksTest-us_foreign_policy | acc | 0.290 ± 0.046 | 0.320 ± 0.046 |
| | acc_norm | 0.320 ± 0.047 | 0.320 ± 0.047 |
| hendrycksTest-virology | acc | 0.289 ± 0.035 | 0.349 ± 0.035 |
| | acc_norm | 0.265 ± 0.034 | 0.355 ± 0.034 |
| hendrycksTest-world_religions | acc | 0.374 ± 0.037 | 0.345 ± 0.037 |
| | acc_norm | 0.409 ± 0.038 | 0.351 ± 0.038 |
| logiqa | acc | 0.255 ± 0.017 | 0.273 ± 0.017 |
| | acc_norm | 0.272 ± 0.017 | 0.280 ± 0.017 |
| mathqa | acc | 0.256 ± 0.008 | 0.253 ± 0.008 |
| | acc_norm | 0.258 ± 0.008 | 0.240 ± 0.008 |
| mnli | acc | 0.338 ± 0.005 | 0.801 ± 0.005 |
| mnli_mismatched | acc | 0.362 ± 0.005 | 0.811 ± 0.005 |
| mrpc | acc | 0.571 ± 0.025 | 0.750 ± 0.025 |
| | f1 | 0.689 ± 0.022 | 0.841 ± 0.022 |
| multirc | acc | 0.047 ± 0.007 | 0.012 ± 0.007 |
| openbookqa | acc | 0.222 ± 0.019 | 0.268 ± 0.019 |
| | acc_norm | 0.346 ± 0.021 | 0.344 ± 0.021 |
| piqa | acc | 0.726 ± 0.010 | 0.714 ± 0.010 |
| | acc_norm | 0.736 ± 0.010 | 0.718 ± 0.010 |
| qnli | acc | 0.504 ± 0.007 | 0.788 ± 0.007 |
| qqp | acc | 0.534 ± 0.002 | 0.847 ± 0.002 |
| | f1 | 0.372 ± 0.004 | 0.793 ± 0.004 |
| race | acc | 0.352 ± 0.015 | 0.355 ± 0.015 |
| record | f1 | 0.843 ± 0.004 | 0.778 ± 0.004 |
| | em | 0.835 ± 0.004 | 0.771 ± 0.004 |
| rte | acc | 0.491 ± 0.030 | 0.747 ± 0.030 |
| sciq | acc | 0.930 ± 0.008 | 0.939 ± 0.008 |
| | acc_norm | 0.938 ± 0.008 | 0.935 ± 0.008 |
| sst | acc | 0.492 ± 0.017 | 0.916 ± 0.017 |
| webqs | acc | 0.054 ± 0.005 | 0.095 ± 0.005 |
| wic | acc | 0.472 ± 0.020 | 0.539 ± 0.020 |
| winogrande | acc | 0.582 ± 0.014 | 0.571 ± 0.014 |
| wnli | acc | 0.380 ± 0.058 | 0.549 ± 0.058 |
| wsc | acc | 0.365 ± 0.047 | 0.365 ± 0.047 |
| lambada | ppl | 6.423 ± 0.162 | 20.150 ± 0.162 |
| | acc | 0.576 ± 0.007 | 0.394 ± 0.007 |
| pubmedqa | acc | 0.529 ± 0.016 | 0.479 ± 0.016 |
| coqa | f1 | 0.606 ± 0.018 | 0.581 ± 0.018 |
| | em | 0.484 ± 0.020 | 0.472 ± 0.020 |
| drop | em | 0.001 ± 0.000 | 0.001 ± 0.000 |
| | f1 | 0.039 ± 0.001 | 0.031 ± 0.001 |
| math_algebra | acc | 0.016 ± 0.004 | 0.024 ± 0.004 |
| math_counting_and_prob | acc | 0.023 ± 0.007 | 0.030 ± 0.007 |
| math_geometry | acc | 0.006 ± 0.004 | 0.021 ± 0.004 |
| math_intermediate_algebra | acc | 0.020 ± 0.005 | 0.029 ± 0.005 |
| math_num_theory | acc | 0.037 ± 0.008 | 0.039 ± 0.008 |
| math_prealgebra | acc | 0.023 ± 0.005 | 0.041 ± 0.005 |
| math_precalc | acc | 0.015 ± 0.005 | 0.022 ± 0.005 |

The model can be downloaded here, though I don't recommend using it for anything.
