All benchmarks are flawed, but GPQA has been fairly consiste · AI HOT