Programmable RNA translation through deep learning-driven IRES discovery and de novo generation
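This Nature Machine Intelligence paper uses a language model plus a diffusion model to design RNA translation elements end to end, with experimental validation rates as high as 99%. It is further hard evidence that AI for biology is moving from prediction to generative design, and it merits a close read by anyone working in synthetic biology or RNA therapeutics.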
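The researchers developed an end-to-end artificial intelligence framework for precise control of protein expression. The framework comprises three core models. IRES-LM, a language model trained on 46,774 sequences, predicts internal ribosome entry sites (IRESs) in linear mRNA with a 15% improvement in area under the curve and F1 score over existing methods, and correctly identifies all 21 experimentally validated circular RNA IRESs. IRES-EA, an evolutionary algorithm, was evaluated computationally on 37,293 non-IRES sequences, 60% of which were predicted to gain function; a massively parallel reporter assay on 12,000 mutated sequences confirmed that 98.4% had indeed acquired IRES function. IRES-DM, a diffusion model, generates new IRES sequences de novo that outperform the state-of-the-art method, and experimental validation of another 12,000 generated sequences showed that 99.3% had detectable IRES activity. The framework produces designs ranging from natural-like sequences to structurally conserved yet sequence-divergent ones, providing a powerful tool for programmable translation in synthetic biology and RNA therapeutics.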
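For readers less familiar with the two classification metrics quoted above, the following is a minimal sketch of how an area under the curve and an F1 score can be computed from predicted probabilities. It assumes the curve in question is the ROC curve and that a 0.5 decision threshold is used; neither detail is stated on this page, and the labels and scores are made-up toy values, not results from the paper.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison (Mann-Whitney U); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def f1_score(labels, scores, threshold=0.5):
    """F1 = harmonic mean of precision and recall at an assumed threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): 1 = IRES, 0 = non-IRES.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
print(roc_auc(labels, scores), f1_score(labels, scores))
```

AUC is threshold-free and measures ranking quality, whereas F1 depends on the chosen cutoff, which is one reason papers often report both.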
Abstract
The precise control of protein expression is a major bottleneck in the development of RNA therapeutics. Internal ribosome entry sites (IRES) overcome traditional limitations by enabling cap-independent translation initiation, making them highly desirable tools for synthetic biology and therapeutic payload expression. However, the complex structure-function relationship of IRES elements has historically hindered their rational design. Here we show that a comprehensive, end-to-end artificial intelligence framework unifies IRES identification, evolutionary optimization and de novo generation. First, IRES-LM, an ensemble of two language models trained on 46,774 sequences, predicts linear mRNA IRESs with a 15% improvement in area under the curve and F1 score over existing methods. In addition, IRES-LM demonstrates remarkable cross-applicability to circular RNA IRESs, correctly identifying all 21 experimentally validated circular RNA IRESs and outperforming benchmark methods. Next, IRES-EA integrates an evolutionary algorithm with IRES-LM to induce IRES functionality through targeted mutations. Computational evaluation of 37,293 non-IRES sequences showed 60% predicted functional conversion, with large-scale massively parallel reporter assay validation of 12,000 mutated sequences demonstrating 98.4% acquired IRES functionality, confirming both computational predictions and experimental functionality. Further, IRES-DM employs a diffusion model to de novo generate novel IRES sequences that outperform the state-of-the-art method. Massively parallel reporter assay validation using another set of 12,000 IRES-DM-generated sequences revealed 99.3% detectable IRES functionality. Notably, IRES-DM shows diverse generation capacity, producing sequences ranging from natural-like candidates to structurally conserved yet sequence-divergent designs. Motif analysis revealed both natural-prevalent and design-enriched high-activity motifs. 
Together, this framework establishes a robust approach for programmable RNA translation, expanding the molecular toolkit for scaling up next-generation biomedical discovery and RNA-based therapeutics.
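The idea behind IRES-EA, pairing an evolutionary algorithm with a learned scorer to introduce targeted mutations, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: `toy_score` is a hypothetical stand-in for IRES-LM (it simply rewards A/U content), and the mutation operator, population size and selection rule are generic choices, not the authors' published settings.

```python
import random

BASES = "ACGU"


def toy_score(seq):
    """Hypothetical stand-in for a learned IRES predictor: A/U fraction."""
    return sum(base in "AU" for base in seq) / len(seq)


def mutate(seq, rng, n_mutations=1):
    """Return a copy of seq with n_mutations random point substitutions."""
    chars = list(seq)
    for pos in rng.sample(range(len(chars)), n_mutations):
        chars[pos] = rng.choice([b for b in BASES if b != chars[pos]])
    return "".join(chars)


def evolve(seed_seq, score_fn, generations=30, population=20, keep=5, seed=0):
    """Mutate, score and select repeatedly; return the best sequence found."""
    rng = random.Random(seed)
    parents = [seed_seq]
    for _ in range(generations):
        children = [mutate(rng.choice(parents), rng) for _ in range(population)]
        # Deduplicate while preserving order so runs are reproducible.
        pool = list(dict.fromkeys(parents + children))
        # Elitist selection: parents compete with children, so the best
        # score never decreases from one generation to the next.
        parents = sorted(pool, key=score_fn, reverse=True)[:keep]
    return parents[0]


start = "GCGCGGCCGCGGCGCCGGCG"  # made-up low-scoring starting sequence
best = evolve(start, toy_score)
print(start, toy_score(start))
print(best, toy_score(best))
```

In a real pipeline the scorer would be the trained predictor and the number of mutations per sequence would probably be constrained so that the designs stay close to their starting sequences.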
Data availability
The datasets used in this study, including training data and evaluation data for IRES-LM, IRES-EA and IRES-DM models, along with all experimental results, are available via GitHub at https://github.com/a96123155/IRES_Prediction_Design. The archived version of the repository is available via Zenodo at https://doi.org/10.5281/zenodo.15081323 (ref. 34). Source data are provided with this paper.
Code availability
The code is available via GitHub at https://github.com/a96123155/IRES_Prediction_Design under the GNU General Public License version 3. The repository contains the code dependencies, operating environment, instructions and source code for the proposed IRES-LM, IRES-EA and IRES-DM models. An archived version with a DOI is available via Zenodo at https://doi.org/10.5281/zenodo.15081323 (ref. 34).
References
Komar, A. A. & Hatzoglou, M. Exploring internal ribosome entry sites as therapeutic targets. Front. Oncol. 5, 233 (2015).
Martinand-Mari, C., Lebleu, B. & Robbins, I. Oligonucleotide-based strategies to inhibit human hepatitis C virus. Oligonucleotides 13, 539–548 (2003).
Nulf, C. J. & Corey, D. Intracellular inhibition of hepatitis C virus (HCV) internal ribosomal entry site (IRES)-dependent translation by peptide nucleic acids (PNAs) and locked nucleic acids (LNAs). Nucleic Acids Res. 32, 3792–3798 (2004).
Filbin, M. E. & Kieft, J. S. Toward a structural understanding of IRES RNA function. Curr. Opin. Struct. Biol. 19, 267–276 (2009).
Lozano, G., Fernandez, N. & Martinez-Salas, E. Modeling three-dimensional structural motifs of viral IRES. J. Mol. Biol. 428, 767–776 (2016).
Plank, T.-D. M. & Kieft, J. S. The structures of nonprotein-coding RNAs that drive internal ribosome entry site function. Wiley Interdiscip. Rev. RNA 3, 195–212 (2012).
Mailliot, J. & Martin, F. Viral internal ribosomal entry sites: four classes for one goal. Wiley Interdiscip. Rev. RNA 9, e1458 (2018).
Chen, R. et al. Engineering circular RNA for enhanced protein production. Nat. Biotechnol. 41, 262–272 (2023).
Choi, S.-W. & Nam, J.-W. Optimal design of synthetic circular RNAs. Exp. Mol. Med. 56, 1281–1292 (2024).
Kolekar, P., Pataskar, A., Kulkarni-Kale, U., Pal, J. & Kulkarni, A. IRESPred: web server for prediction of cellular and viral internal ribosome entry site (IRES). Sci. Rep. 6, 27436 (2016).
Zhao, J. et al. IRESfinder: identifying RNA internal ribosome entry site in eukaryotic cell using framed k-mer features. J. Genet. Genomics 45, 403–406 (2018).
Wang, J. & Gribskov, M. IRESpy: an XGBoost model for prediction of internal ribosome entry sites. BMC Bioinf. 20, 409 (2019).
Zhou, Y. et al. DeepCIP: a multimodal deep learning method for the prediction of internal ribosome entry sites of circRNAs. Comput. Biol. Med. 164, 107288 (2023).
Chu, Y. et al. A 5′ UTR language model for decoding untranslated regions of mRNA and function predictions. Zenodo https://doi.org/10.5281/zenodo.10621605 (2024).
Shen, T. et al. Accurate RNA 3D structure prediction using a language model-based deep learning approach. Nat. Methods 21, 2287–2298 (2024).
Weingarten-Gabbay, S. et al. Comparative genetics. Systematic discovery of cap-independent translation sequences in human and viral genomes. Science 351, aad4939 (2016).
Zhao, J. et al. IRESbase: a comprehensive database of experimentally validated internal ribosome entry sites. Genom. Proteom. Bioinform. 18, 129–139 (2020).
Mokrejs, M. et al. IRESite–a tool for the examination of viral and cellular internal ribosome entry sites. Nucleic Acids Res. 38, D131–D136 (2010).
Kalvari, I. et al. Rfam 14: expanded coverage of metagenomic, viral and microRNA families. Nucleic Acids Res. 49, D192–D200 (2021).
Ho, J., Jain, A. & Abbeel, P. Denoising diffusion probabilistic models. Preprint at https://doi.org/10.48550/arxiv.2006.11239 (2020).
Zhao, Y., Oono, K., Takizawa, H. & Kotera, M. GenerRNA: a generative pre-trained language model for de novo RNA design. PLoS ONE 19, e0310814 (2024).
Thoma, C., Bergamini, G., Galy, B., Hundsdoerfer, P. & Hentze, M. W. Enhancement of IRES-mediated translation of the c-myc and BiP mRNAs by the poly(A) tail is independent of intact eIF4G and PABP. Mol. Cell 15, 925–935 (2004).
Li, H. et al. riboCIRC: a comprehensive database of translatable circRNAs. Genome Biol. 22, 79 (2021).
Gritsenko, A. A. et al. Sequence features of viral and human internal ribosome entry sites predictive of their activity. PLoS Comput. Biol. 13, e1005734 (2017).
Dvir, S. et al. Deciphering the rules by which 5′-UTR sequences affect protein expression in yeast. Proc. Natl Acad. Sci. USA 110, E2792–E2801 (2013).
Hie, B. L. et al. Efficient evolution of human antibodies from general protein language models. Nat. Biotechnol. 42, 275–283 (2024).
Ronneberger, O., Fischer, P. & Brox, T. U-net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015 234–241 (Springer, 2015).
He, K. et al. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).
Vaswani, A. et al. Attention is all you need. In Proc. 31st International Conference on Neural Information Processing Systems 6000–6010 (NIPS, 2017).
Shen, Z., Zhang, M., Zhao, H., Yi, S. & Li, H. Efficient attention: attention with linear complexities. In Proc. IEEE/CVF Winter Conference on Applications of Computer Vision 3531–3539 (IEEE, 2021).
Hughes, N. W. et al. Machine-learning-optimized Cas12a barcoding enables the recovery of single-cell lineages and transcriptional profiles. Mol. Cell 82, 3103–3118 (2022).
Yin, D. et al. Targeting herpes simplex virus with CRISPR–Cas9 cures herpetic stromal keratitis in mice. Nat. Biotechnol. 39, 567–577 (2021).
Yin, D. et al. Dendritic-cell-targeting virus-like particles as potent mRNA vaccine carriers. Nat. Biomed. Eng. 9, 185–200 (2025).
Chu, Y. a96123155/IRES_Prediction_Design: IRES-AI. Zenodo https://doi.org/10.5281/zenodo.15081323 (2026).
Acknowledgements
This work was supported by the Donald and Delia Baxter Foundation (L.C.), the Weintz Family Foundation (L.C.), the Princeton AI Lab (M.W.), the Google Faculty Award (M.W.), the Office of Naval Research (ONR) under MURI grant no. N00014-24-1-2687 (M.W.) and Zipcode Bio. We thank J. Sullivan and the Arc Institute for their computational support.
Author information
Yanyi Chu
Present address: Key Laboratory of RNA Innovation, Science and Engineering, CAS Center for Excellence in Molecular Cell Science, Shanghai Institute of Biochemistry and Cell Biology, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
These authors contributed equally: Yanyi Chu, Di Yin, Dan Yu, Guangxue Xu.
Authors and Affiliations
Arc Institute, Palo Alto, CA, USA
Yanyi Chu & Hani Goodarzi
Department of Pathology, Stanford University School of Medicine, Stanford, CA, USA
Di Yin, Guangxue Xu, Junze Zhang, Xiaotong Wang, Ning Zhao, Yi Zhu & Le Cong
Zipcode Bio, Weston, MA, USA
Dan Yu, Yupeng Li & Jason Zhang
Center for Molecular Biophysics, Oak Ridge National Laboratory, Oak Ridge, TN, USA
Yue Shen
Department of Biochemistry and Biophysics, University of California, San Francisco, San Francisco, CA, USA
Hani Goodarzi
Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA
Hani Goodarzi
Center for Statistics and Machine Learning and Department of Electrical and Computer Engineering, Princeton University, Princeton, NJ, USA
Mengdi Wang
Contributions
Y.C. developed the model. D. Yin, G.X., Junze Zhang, N.Z. and Y.Z. performed experimental validation. X.W. performed MPRA analysis. Y.S. and Y.L. processed data and implemented benchmark methods. D. Yu, D. Yin and G.X. initiated the experimental part of the project. H.G. and Jason Zhang reviewed the paper. M.W. and L.C. led the entire project. All authors contributed to paper preparation.
Corresponding authors
Correspondence to Mengdi Wang or Le Cong.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Zheng Xia and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Extended data
Extended Data Fig. 1 Performance comparison of four pre-trained models across different fine-tuning strategies for IRES identification.
Performance evaluation of (a) UTR-LM, (b) RNA-FM, (c) RNA-BERT, and (d) ERNIE-RNA using four different training strategies: random initialization with all parameters fine-tuned, pretrained model frozen with only predictor fine-tuned, last layer of pretrained model and predictor fine-tuned, and pretrained model with all parameters fine-tuned. (e) Performance of frozen Evo-2 with downstream head trained on embeddings extracted from different layer blocks. (f) Performance of Evo-2 with last N blocks and downstream head fine-tuned together.
Extended Data Fig. 2 IRES-LM representations capture biologically relevant features without explicit supervision.
Analysis based on 55,000 sequences annotated with MFE values and 5,388 sequences annotated with splicing scores, with these biological properties never used during model training. a,b, IRES-UTRLM analysis with minimum free energy (MFE). a, Principal component analysis (PCA) showing correlation between first principal component and MFE values. The red dashed line indicates linear regression fit. b, Uniform manifold approximation and projection (UMAP) visualization of complete model representations colored by MFE. c,d, IRES-RNAFM analysis with (c) PCA and (d) UMAP for MFE. e,f, IRES-UTRLM analysis with (e) PCA and (f) UMAP for splicing scores. g,h, IRES-RNAFM analysis with (g) PCA and (h) UMAP for splicing scores.
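The PCA panels described above relate the first principal component of model representations to MFE. A minimal, dependency-free sketch of that kind of analysis follows. It is not the authors' code: power iteration is swapped in for a library PCA routine to keep the example self-contained, and the "embeddings" and MFE values are synthetic numbers generated only for illustration.

```python
import math
import random


def center(rows):
    """Subtract the per-column mean from every row."""
    n_cols = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(n_cols)]
    return [[r[j] - means[j] for j in range(n_cols)] for r in rows]


def first_principal_component(rows, iterations=200):
    """Leading eigenvector of the covariance matrix via power iteration."""
    x = center(rows)
    n, d = len(x), len(x[0])
    cov = [[sum(x[k][i] * x[k][j] for k in range(n)) / (n - 1)
            for j in range(d)] for i in range(d)]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    # Project each centered row onto the leading direction.
    projections = [sum(x[k][j] * v[j] for j in range(d)) for k in range(n)]
    return v, projections


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


# Synthetic data (assumption): 4-dimensional "embeddings" whose dominant
# axis of variation tracks a made-up MFE value, plus Gaussian noise.
rng = random.Random(0)
mfe = [rng.uniform(-60.0, -5.0) for _ in range(200)]
embeddings = [[0.1 * m + rng.gauss(0, 0.5), -0.05 * m + rng.gauss(0, 0.5),
               rng.gauss(0, 0.5), rng.gauss(0, 0.5)] for m in mfe]

_, pc1 = first_principal_component(embeddings)
r = pearson(pc1, mfe)
print(round(r, 3))
```

The sign of a principal component is arbitrary, so only the magnitude of the correlation is meaningful in an analysis like this.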
Supplementary information
Supplementary Information
Supplementary Figs. 1 and 2 and Sections A–E.
Source data
Source Data Fig. 3
Performance metrics used in Fig. 3a–c and predicted IRES probabilities for experimentally validated circular IRES sequences across different methods shown in Fig. 3d.
Source Data Fig. 4
Raw FLuc and NLuc luminescence values from two independent biological replicates for EMCV IRES variants (Fig. 4c) and CVB3 IRES variants (Fig. 4d), including calculated NLuc/FLuc ratios and normalized activity values; summary of IRES-EA library sequence composition and corresponding MPRA activity profiling, including mutation distribution, bin-based sorting results, sequencing coverage and functional activity statistics.
Source Data Fig. 5
Raw FLuc and NLuc luminescence values from two independent biological replicates for VCIP IRES variants (Fig. 5c), including calculated NLuc/FLuc ratios and normalized activity values, and summary of IRES-DM library sequence composition and corresponding MPRA activity profiling, including mutation distribution, bin-based sorting results, sequencing coverage and functional activity statistics.
Source Data Fig. 6
Source data for Fig. 6a–c, including motif enrichment analysis (6-mer and 7-mer), frequency distributions across activity groups, motif annotations and cumulative activity statistics stratified by motif count.
Source Data Extended Data Fig. 1
Performance metrics used in Extended Data Fig. 1a–f.
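The Source Data for Figs. 4 and 5 mention NLuc/FLuc ratios and normalized activity values computed from two biological replicates. One plausible way to derive such values is sketched below. The sketch assumes FLuc serves as the internal control, which the direction of the ratio suggests but this page does not state, and that activity is normalized to a reference construct such as the wild-type IRES; the luminescence readings are invented for illustration.

```python
from statistics import mean


def nluc_fluc_ratio(nluc, fluc):
    """Ratio of NLuc signal to the FLuc signal (assumed internal control)."""
    if fluc <= 0:
        raise ValueError("FLuc luminescence must be positive")
    return nluc / fluc


def normalized_activity(replicates, reference_replicates):
    """Mean NLuc/FLuc ratio of a variant divided by that of a reference.

    Each argument is a list of (nluc, fluc) tuples, one per replicate.
    """
    variant = mean([nluc_fluc_ratio(n, f) for n, f in replicates])
    reference = mean([nluc_fluc_ratio(n, f) for n, f in reference_replicates])
    return variant / reference


# Hypothetical readings from two biological replicates each.
wild_type = [(2000.0, 1000.0), (2200.0, 1000.0)]
variant = [(4100.0, 1000.0), (4300.0, 1000.0)]

activity = normalized_activity(variant, wild_type)
print(activity)
```

Dividing by the control reporter corrects for differences in transfection efficiency and cell number between wells, and dividing by a reference construct puts variants measured on different plates on a common scale.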
About this article
Cite this article
Chu, Y., Yin, D., Yu, D. et al. Programmable RNA translation through deep learning-driven IRES discovery and de novo generation. Nat Mach Intell 8, 559–574 (2026). https://doi.org/10.1038/s42256-026-01213-z
Received: 31 March 2025
Accepted: 06 March 2026
Published: 24 April 2026
Version of record: 24 April 2026
Issue date: April 2026
DOI: https://doi.org/10.1038/s42256-026-01213-z