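Discovery of candidate therapeutic targets with Geneformer
Geneformer itself dates from 2023; this Nature Protocols paper is its operating manual. Its value lies in standardizing the full workflow of zero-shot inference, fine-tuning and in silico perturbation, so teams working in computational biology or drug target discovery can follow it step by step. For the mainstream AI community, it is a footnote-level update.
This article describes a workflow for discovering disease therapeutic targets in data-limited settings with Geneformer, a pretrained foundation AI model trained on large-scale single-cell transcriptome data. Gene expression data are first converted into rank value encodings, after which zero-shot inference is used to assess how separable the phenotypes are, followed by fine-tuning and in silico perturbation analysis. After gene perturbations are simulated, candidate targets are prioritized by quantifying the change in cell state embeddings. The whole pipeline can be completed within 2 days on a standard GPU workstation and requires only moderate Python programming experience. The protocol provides a general framework for fine-tuning and in silico perturbation, and is an alternative to models such as scGPT and scFoundation.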
Abstract
Mapping the connections between genes enables the identification of networks disrupted in disease. The approach, however, requires large amounts of data, making the discovery of therapeutic targets difficult in settings with limited data. We recently developed a foundational artificial intelligence model, Geneformer, pretrained on a large-scale corpus of single-cell transcriptomes (initially ~30 million, now >100 million) to enable context-aware predictions in network biology with limited data. Here, we cover the methodology for using Geneformer through a combination of zero-shot inference, fine-tuning and in silico perturbation. The procedure includes the tokenization of raw gene expression counts into rank value encodings aligned with the model’s pretrained vocabulary. Separability of relevant phenotypes in the pretrained embedding space is first assessed with zero-shot embeddings. Fine-tuning is then performed either with a single task, for example, disease prediction within a specific cell type, or with multiple tasks to jointly learn cross-informative features, such as cell types and disease states. Performance is evaluated with confusion matrices, macro F1 scores and embedding analysis. Subsequently, in silico perturbation simulates gene repression or activation and quantifies the shift in cell state embeddings, prioritizing candidate targets by statistical and biological metrics. The approach also supports perturbation using a quantized model to enhance computational efficiency. Outputs include predictive models fine-tuned for context-specific cell state representations and rank-ordered predictions of perturbations to induce each target state. The full pipeline typically completes in under 2 days on a standard GPU workstation and requires only moderate Python experience.
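The tokenization step mentioned in the abstract can be illustrated with a small sketch. It follows the general idea of rank value encoding as described in the original Geneformer paper (Theodoris et al., 2023): counts are normalized per cell, scaled by each gene's typical expression across the pretraining corpus, and genes are then ordered from highest to lowest. This is not the protocol's tokenizer. The function name, the toy gene names, the median values, the vocabulary and the `max_len` and `target_sum` defaults are all assumptions made for illustration.

```python
import numpy as np


def rank_value_encode(counts, gene_ids, gene_medians, vocab,
                      max_len=2048, target_sum=10_000):
    """Toy rank value encoding of a single cell (illustrative only).

    counts       : raw counts, one entry per gene in `gene_ids`
    gene_medians : hypothetical per-gene scaling factors (corpus-level medians)
    vocab        : hypothetical mapping from gene ID to integer token
    """
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()

    # Keep only detected genes that the (toy) vocabulary knows about.
    keep = [i for i, g in enumerate(gene_ids)
            if counts[i] > 0 and g in vocab and g in gene_medians]
    if not keep:
        return []

    # Normalize by the cell's total counts, then by each gene's median,
    # so ubiquitously high genes do not dominate the ranking.
    scaled = np.array([
        counts[i] / total * target_sum / gene_medians[gene_ids[i]]
        for i in keep
    ])

    # Order genes from highest to lowest scaled value and truncate.
    order = np.argsort(-scaled, kind="stable")
    return [vocab[gene_ids[keep[j]]] for j in order][:max_len]


# Hypothetical toy cell: GENE_A has the highest raw count, but GENE_B ranks
# first once each gene is scaled by its own median.
gene_ids = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
counts = [50, 5, 0, 20]
gene_medians = {"GENE_A": 10.0, "GENE_B": 0.5, "GENE_C": 1.0, "GENE_D": 5.0}
vocab = {"GENE_A": 2, "GENE_B": 3, "GENE_C": 4, "GENE_D": 5}

tokens = rank_value_encode(counts, gene_ids, gene_medians, vocab)
print(tokens)  # [3, 2, 5]
```

In the toy cell, the undetected GENE_C is dropped and the remaining genes are ordered by scaled rather than raw expression, which is why the output is a sequence of tokens and not a vector of values.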
Key points
- Geneformer uses a bidirectional transformer model to learn patterns of covariation between genes. The protocol includes single-task and multi-task learning, with options for improved scalability, and can be optimized depending on the research question and dataset size.
- The protocol provides a framework for fine-tuning and in silico perturbation analysis, serving as an alternative to pretrained foundation models including scGen, scGPT, scFoundation, GeneCompass, UCE, Nicheformer, scSimilarity and TranscriptFormer.
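The in silico perturbation idea, simulating repression or activation of a gene and measuring the resulting shift in the cell state embedding, can be sketched on a rank-ordered token list. In the original Geneformer paper, deletion removes a gene from the rank value encoding and activation moves it to the front; the sketch below mirrors that. Everything else is an assumption for illustration: `toy_embed` is a stand-in for a model forward pass (a rank-weighted mean over a random embedding table), the token lists are made up, and sorting by cosine shift is a simplification of the statistical and biological metrics the protocol uses.

```python
import numpy as np


def delete_gene(tokens, gene_token):
    """Simulate repression: drop the gene from the rank-ordered encoding."""
    return [t for t in tokens if t != gene_token]


def activate_gene(tokens, gene_token):
    """Simulate activation: move the gene to the front of the encoding."""
    return [gene_token] + [t for t in tokens if t != gene_token]


def toy_embed(tokens, table):
    """Hypothetical stand-in for a model: rank-weighted mean of token vectors."""
    if not tokens:
        raise ValueError("at least one token is required")
    weights = 1.0 / np.arange(1, len(tokens) + 1)
    vectors = table[np.asarray(tokens, dtype=int)]
    return (weights[:, None] * vectors).sum(axis=0) / weights.sum()


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def rank_deletions(tokens, goal_embedding, table):
    """Rank single-gene deletions by how far they move the cell toward the goal."""
    baseline = cosine(toy_embed(tokens, table), goal_embedding)
    shifts = {}
    for gene_token in tokens:
        perturbed = delete_gene(tokens, gene_token)
        shifts[gene_token] = cosine(toy_embed(perturbed, table), goal_embedding) - baseline
    # Largest positive shift first: deletion moves the cell closest to the goal.
    return sorted(shifts.items(), key=lambda item: item[1], reverse=True)


# Toy data: a random embedding table and two made-up token sequences.
rng = np.random.default_rng(0)
table = rng.normal(size=(10, 8))
diseased_cell = [3, 2, 5, 7]
healthy_cell = [2, 5, 7, 6]

goal = toy_embed(healthy_cell, table)
ranking = rank_deletions(diseased_cell, goal, table)
for gene_token, shift in ranking:
    print(f"token {gene_token}: shift toward goal = {shift:+.3f}")
```

With a real model, the embeddings would come from the fine-tuned network and shifts would be aggregated across many cells before any gene is treated as a candidate target.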
Data availability
An example dataset (ref. 18, Chaffin et al.) for use with this protocol is provided with the tutorial on Google Colab at https://tinyurl.com/geneformertutorial. The dataset can also be directly downloaded from Google Drive as indicated in the provided tutorial (https://drive.google.com/uc?id=1VeMkFrUy43xEJZzYaFw0t7aVZRpXF-Yt).
Code availability
All relevant code is available on the Hugging Face Model Hub at https://huggingface.co/ctheodoris/Geneformer. Detailed documentation is available at https://geneformer.readthedocs.io/en/latest/. In addition, a tutorial accompanying this protocol is provided on Google Colab at https://tinyurl.com/geneformertutorial.
References
Theodoris, C. V. et al. Transfer learning enables predictions in network biology. Nature 618, 616–624 (2023).
Vaswani, A. et al. Attention is all you need. Preprint at arXiv https://doi.org/10.48550/arXiv.1706.03762 (2017).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. Preprint at arXiv https://doi.org/10.48550/arXiv.1810.04805 (2018).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. 2016 IEEE Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).
Chen, H. et al. Scaling and quantization of large-scale foundation model enables resource-efficient predictions in network biology. Nat. Comput. Sci. https://doi.org/10.1038/s43588-026-00972-4 (2026).
Lotfollahi, M., Wolf, F. A. & Theis, F. J. scGen predicts single-cell perturbation responses. Nat. Methods 16, 715–721 (2019).
Cui, H. et al. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nat. Methods https://doi.org/10.1038/s41592-024-02201-0 (2024).
Hao, M. et al. Large-scale foundation model on single-cell transcriptomics. Nat. Methods https://doi.org/10.1038/s41592-024-02305-7 (2024).
Yang, X. et al. GeneCompass: deciphering universal gene regulatory mechanisms with a knowledge-informed cross-species foundation model. Cell Res. 34, 830–845 (2024).
Rosen, Y. et al. Universal cell embeddings: a foundation model for cell biology. Preprint at bioRxiv https://doi.org/10.1101/2023.11.28.568918 (2023).
Schaar, A. C. et al. Nicheformer: a foundation model for single-cell and spatial omics. Nat. Methods 22, 2525–2538 (2025).
Pearce, J. D. et al. A cross-species generative cell atlas across 1.5 billion years of evolution: the TranscriptFormer single-cell model. Preprint at bioRxiv https://doi.org/10.1101/2025.04.25.650731 (2025).
Chen, H. et al. Quantized multi-task learning for context-specific representations of gene network dynamics. Preprint at bioRxiv https://doi.org/10.1101/2024.08.16.608180 (2024).
Quality control in single-cell RNA-seq data. Kaggle https://kaggle.com/code/hrishikeshp/2-quality-control-in-single-cell-rna-seq-data (2021).
Quality control—single-cell best practices. Single-cell Best Practices https://www.sc-best-practices.org/preprocessing_visualization/quality_control.html#filtering-low-quality-cells (2023).
Heil, B. How to convert gene ID formats in Python. Autobencoder https://autobencoder.com/2021-10-03-gene-conversion/ (2021).
Notebook on nbviewer. https://nbviewer.org/gist/newgene/6771106.
Chaffin, M. et al. Single-nucleus profiling of human dilated and hypertrophic cardiomyopathy. Nature 608, 174–180 (2022).
Acknowledgements
We thank the Theodoris Lab members and collaborators for helpful scientific discussions. Y.Z. was supported by the Hillblom/BARI Graduate Student Fellowship.
Author information
Authors and Affiliations
Gladstone Institute of Cardiovascular Disease, San Francisco, CA, USA
Yujie Zhang (张毓杰), Madhavan S. Venkatesh & Christina V. Theodoris
Gladstone Institute of Data Science and Biotechnology, San Francisco, CA, USA
Yujie Zhang (张毓杰), Madhavan S. Venkatesh & Christina V. Theodoris
Biological and Medical Informatics Graduate Program, University of California, San Francisco, San Francisco, CA, USA
Yujie Zhang (张毓杰) & Christina V. Theodoris
Department of Computational and Systems Biology, University of California, Los Angeles, Los Angeles, CA, USA
Madhavan S. Venkatesh
Department of Pediatrics, Institute for Human Genetics, Cardiovascular Research Institute, University of California, San Francisco, San Francisco, CA, USA
Christina V. Theodoris
Contributions
Y.Z., M.S.V. and C.V.T. developed the protocol. Y.Z. designed/performed the analyses. C.V.T. designed the analyses and supervised the work. Y.Z. and C.V.T. wrote the manuscript. All authors edited the manuscript.
Corresponding author
Correspondence to Christina V. Theodoris.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Protocols thanks Nathan Palpant and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Key references
Theodoris, C. V. et al. Nature 618, 616–624 (2023): https://doi.org/10.1038/s41586-023-06139-9
Chen, H. et al. Nat. Comput. Sci. https://doi.org/10.1038/s43588-026-00972-4 (2026)
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, Y., Venkatesh, M.S. & Theodoris, C.V. Discovery of candidate therapeutic targets with Geneformer. Nat Protoc (2026). https://doi.org/10.1038/s41596-026-01364-8
Received: 27 June 2025
Accepted: 04 March 2026
Published: 23 April 2026
Version of record: 23 April 2026
DOI: https://doi.org/10.1038/s41596-026-01364-8