The recipe for “classic” reasoning benchmarks is simple: tex · AI HOT