# A Guide to Galaxy Tokenizers: Benchmarking Scientific Foundation Models

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-24 08:00
- AIHOT score: 40
- AIHOT link: https://aihot.virxact.com/items/cmqzjr3n5007dslqv49smcy83
- Original link: https://arxiv.org/abs/2606.25610

## AI Summary

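Within a unified transformer framework, the study compares how four tokenization strategies (Affine, AIM, JetFormer, and VQ-VAE) affect astronomical imaging. Using 640,000 galaxy images from the DESI Legacy Survey and a shared AstroPT backbone, it evaluates reconstruction fidelity and the prediction of physical properties. The results show that the flow-based JetFormer achieves higher reconstruction quality, VQ-VAE gives stronger probe performance for galaxy physical properties, and Affine and AIM better preserve local morphological information. Reconstruction quality and representation quality are decoupled, and no single method is consistently best across all tasks. By benchmarking against independently measured physical quantities, the study highlights the potential of scientific data for building interpretable foundation-model benchmarks.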

## Main Text

Tokenization is central to adapting scientific data for transformer-based foundation models, yet its impact on learned representations remains poorly understood. We compare four tokenization strategies, Affine, AIM, JetFormer, and VQ-VAE, within a unified transformer framework for astronomical imaging. Using 640,000 galaxy images from the DESI Legacy Survey and a shared AstroPT backbone, we evaluate each method on reconstruction fidelity and prediction of physical properties. Our results reveal trade-offs across approaches. The flow-based JetFormer achieves higher reconstruction quality, while VQ-VAE yields strong probe performance for galaxy physical properties. Affine and AIM better preserve localized morphological information. We find that reconstruction and representation quality are decoupled, and no single method consistently performs best across the tasks considered here. By grounding our evaluation in independently measured physical quantities, we hope this study serves to highlight the potential of scientific data as a basis for constructing interpretable benchmarks for foundation models.
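
To make the two evaluation ingredients in the abstract more concrete, here is a minimal toy sketch of (a) an affine patch tokenizer and (b) a linear probe on frozen embeddings. It is not the paper's code. It assumes that "Affine" means a per-patch linear map plus bias, it uses synthetic random images in place of DESI Legacy Survey data, and it uses mean image brightness as a stand-in for a physical property. All function names (`patchify`, `affine_tokenize`, `linear_probe_r2`, `run_demo`) are hypothetical.

```python
import numpy as np


def patchify(images, patch):
    """Split (N, H, W, C) images into flattened, non-overlapping patches.

    Returns an array of shape (N, num_patches, patch * patch * C).
    """
    n, h, w, c = images.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch"
    gh, gw = h // patch, w // patch
    x = images.reshape(n, gh, patch, gw, patch, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # -> (n, gh, gw, patch, patch, c)
    return x.reshape(n, gh * gw, patch * patch * c)


def affine_tokenize(patches, weight, bias):
    """Assumed form of an affine tokenizer: tokens = patches @ W + b."""
    return patches @ weight + bias


def linear_probe_r2(train_x, train_y, test_x, test_y):
    """Fit a least-squares linear probe on frozen embeddings and return test R^2."""
    # Append a column of ones so the probe can learn an intercept.
    xtr = np.hstack([train_x, np.ones((len(train_x), 1))])
    xte = np.hstack([test_x, np.ones((len(test_x), 1))])
    coef, *_ = np.linalg.lstsq(xtr, train_y, rcond=None)
    pred = xte @ coef
    ss_res = np.sum((test_y - pred) ** 2)
    ss_tot = np.sum((test_y - test_y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def run_demo(seed=0):
    """Toy run on synthetic data; the shapes and target are illustrative only."""
    rng = np.random.default_rng(seed)
    images = rng.random((600, 16, 16, 3))        # synthetic stand-in for galaxy cutouts
    patches = patchify(images, patch=4)          # (600, 16, 48)
    weight = rng.normal(size=(48, 64))           # untrained, random projection
    bias = rng.normal(size=64)
    tokens = affine_tokenize(patches, weight, bias)
    embeddings = tokens.mean(axis=1)             # mean-pool tokens per image
    target = images.mean(axis=(1, 2, 3))         # toy "property": mean brightness
    return linear_probe_r2(
        embeddings[:400], target[:400], embeddings[400:], target[400:]
    )


if __name__ == "__main__":
    print(f"probe R^2 on held-out toy data: {run_demo():.4f}")
```

In this toy setup the target is an exact linear function of the pooled patches, so the probe recovers it almost perfectly. With a real tokenizer and independently measured galaxy properties, the probe score becomes an informative measure of what the representation retains, and it is computed separately from any reconstruction metric.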
