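A senior engineer at Tesla AI (working on ML for self-driving and robotics) revealed how time is really spent on ML projects: 50% evaluation, 40% data cleaning, 8% integration, 2% training. The first two together set the noise floor of learning, which the model cannot lower: it is the Shannon optimal bound of the data. He thinks about ontology every day, and old labels must be continuously reviewed, because distribution drift and edge cases in production systems keep exposing label defects. Core conclusion: training is not the bottleneck; the ability to clean real-world data is what matters.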
A Sr. Staff Engineer at Tesla AI dropped an ML time breakdown that nobody wants to face: 50% evaluation, 40% data cleaning, 8% integration, 2% training. The post was shared over 2,000 times, and everyone was shocked that training takes only 2%.

But the truly terrifying part is what he said next. The first two items, evaluation and data cleaning, directly determine the noise floor of learning. No matter how powerful the model is, it cannot lower this floor, because the floor is already what he calls the Shannon optimal bound of the data itself.

In plain English: a floor on error is a ceiling on performance, and the quality of the data you feed the model has already welded that ceiling shut. No matter how strong a model you switch to, the ceiling won't budge an inch.

Then he landed an even harder punch. He said he thinks about ontology every single day, and that old labels must be continuously reviewed. Labeling is not a one-and-done exercise: in production systems, distribution drift and edge cases constantly expose the flaws in old labels.

This guy works on self-driving and robotics ML at Tesla AI. He's not a researcher running benchmarks in a lab. Every day he steps on landmines in real-world deployment, and what he distilled from that is just four numbers and one information-theory statement.

One line from his replies nails it: "Even a genius needs a good textbook."

Give a genius with an IQ of 180 a textbook whose table of contents is completely scrambled, and they won't learn linear algebra. Not because they aren't smart enough, but because the material doesn't even define what linear algebra is.

Ontology is the model's table of contents. If it's messy, the model can only memorize noise. If it's clear, the model knows which direction to reason toward.

Training is not the bottleneck. Our ability to clean reality is. While you're out chasing the latest, strongest model every day, the real pros are reviewing old labels and ontology.
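To make the "noise floor" claim concrete, here is a minimal Python sketch. It is an illustration under a toy assumption, not the engineer's method: binary labels whose recorded value is flipped at random with a fixed probability (symmetric label noise). The names `binary_entropy`, `measured_accuracy_of_perfect_model`, and `flip_rate` are hypothetical and chosen only for this example. Under that assumption, even a model that always predicts the true label scores only about `1 - flip_rate` when evaluated against the recorded labels, and the best achievable cross-entropy against those labels is the binary entropy of the flip rate.

```python
import math
import random


def binary_entropy(p: float) -> float:
    """Entropy, in bits, of a coin that lands heads with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def measured_accuracy_of_perfect_model(
    flip_rate: float, n: int = 100_000, seed: int = 0
) -> float:
    """Score a model that always predicts the TRUE label against noisy labels.

    Assumption (toy model): each recorded label is the true label flipped
    independently with probability `flip_rate`.
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(n):
        true_label = rng.randint(0, 1)
        prediction = true_label  # the hypothetical "perfect" model
        flipped = rng.random() < flip_rate
        recorded_label = 1 - true_label if flipped else true_label
        correct += int(prediction == recorded_label)
    return correct / n


if __name__ == "__main__":
    for flip_rate in (0.0, 0.05, 0.10, 0.20):
        acc = measured_accuracy_of_perfect_model(flip_rate)
        floor_bits = binary_entropy(flip_rate)
        print(
            f"flip_rate={flip_rate:.2f}  "
            f"measured accuracy of perfect model={acc:.3f}  "
            f"cross-entropy floor={floor_bits:.3f} bits/label"
        )
```

In this toy setting, swapping in a stronger model changes nothing about the measured score; only lowering `flip_rate`, that is, cleaning the labels, moves the bound. Real label errors are rarely this uniform, so treat the numbers as intuition rather than a measurement.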