Keywords: Artificial intelligence, drug discovery, machine learning, target identification, de novo molecule design, reinforcement learning, AlphaFold, drug-target interaction, generative models, multi-target drug design
Drug discovery is a comprehensive process aimed at identifying active compounds that can alter disease progression and address unmet therapeutic needs through the development of new molecular entities [1]. These agents may be derived from various sources, including synthetic chemicals, natural products, biologicals, or repurposed existing drugs. The process begins with identifying a biological target, such as a protein or receptor involved in disease progression. However, unknown disease mechanisms, particularly in nervous system disorders, make target identification difficult [2]. Despite advances that have improved speed, precision, and cost-efficiency, modern drug discovery still faces significant hurdles: failures due to poor efficacy, adverse effects, or commercial limitations remain common. In recent years, artificial intelligence (AI), particularly machine learning (ML), has emerged as a transformative force in drug discovery. AI excels at recognizing complex patterns within large datasets, which can reveal drug-disease relationships that might otherwise go unnoticed and help researchers identify potential drug targets and design novel therapeutic molecules with greater accuracy and speed. However, AI-based approaches also pose challenges, particularly regarding potential biases and fairness in predictions [3]. Approaches such as data augmentation and explainable AI can help improve model reliability while addressing concerns about bias and transparency.
The Drug Discovery Pipeline
Identifying Drug Targets and Generating Novel Molecules
The first step in drug discovery involves identifying biological targets, such as proteins, enzymes, or receptors, that play a role in disease progression [4]. This step is often hindered by the complexity of biological systems and the vast amount of data involved. AI and ML models can mine genomic, proteomic, and disease-related datasets to uncover patterns and predict viable therapeutic targets. Approaches such as unsupervised learning and network-based models help in understanding disease mechanisms more comprehensively.
Once viable targets are identified, AI facilitates the design and optimization of therapeutic molecules. Traditional drug development tends to be slow and iterative, whereas generative approaches based on Generative Adversarial Networks (GANs) [5], Variational Autoencoders (VAEs) [6], and Reinforcement Learning (RL) [7] have demonstrated greater speed and efficiency. These models can generate novel chemical structures and optimize them for key properties such as binding affinity, solubility, and toxicity before any laboratory synthesis. For instance, halicin, a structurally novel antibiotic, was identified by a deep learning model applied to the Drug Repurposing Hub, and its antibacterial activity was validated in vitro and in vivo; the same model was then used to screen over 107 million compounds from the ZINC15 database for further candidates [8]. Reinforcement learning has also been used to guide graph-based generative models to produce molecules with optimal size, drug-likeness, and predicted dopamine receptor D2 (DRD2) activity, including compounds not present in the training set [7]. Recent approaches also include multi-target de novo design, in which AI models create compounds that act on more than one target simultaneously. For example, a Perturbation Theory and Machine Learning (PTML) model was developed to generate virtual dual inhibitors of CDK4 and HER2. This model achieved over 75% sensitivity and specificity in validation and was used to design six compounds, three of which were predicted to be effective dual inhibitors [9].
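To make the validation metrics quoted for the PTML model concrete, the following minimal Python sketch shows how sensitivity and specificity are computed from a confusion matrix. The class name, function names, and the counts are illustrative assumptions; they are not taken from the cited study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts obtained by comparing predicted labels with known labels."""
    true_pos: int   # active compounds predicted active
    false_neg: int  # active compounds predicted inactive
    true_neg: int   # inactive compounds predicted inactive
    false_pos: int  # inactive compounds predicted active


def sensitivity(c: ConfusionCounts) -> float:
    """Fraction of truly active compounds that the model recovers."""
    return c.true_pos / (c.true_pos + c.false_neg)


def specificity(c: ConfusionCounts) -> float:
    """Fraction of truly inactive compounds that the model rejects."""
    return c.true_neg / (c.true_neg + c.false_pos)


# Hypothetical validation set with 200 actives and 300 inactives
# (numbers invented for illustration only).
counts = ConfusionCounts(true_pos=156, false_neg=44, true_neg=240, false_pos=60)
print(f"sensitivity = {sensitivity(counts):.2f}")  # 156 / 200 = 0.78
print(f"specificity = {specificity(counts):.2f}")  # 240 / 300 = 0.80
```

Reporting both values matters for dual-inhibitor screening: a model can reach high sensitivity simply by labelling most compounds active, which the specificity figure would expose.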
Screening, Evaluating Candidates and Engineering Therapeutic Proteins
High-throughput screening of candidate molecules is traditionally resource-intensive. AI enhances this process by predicting molecular interactions and prioritizing the compounds most likely to succeed. For example, DeepMind’s AlphaFold has had a major impact by accurately predicting the 3D structures of proteins from their amino acid sequences, which has made it easier to study how drugs interact with their targets and has improved virtual screening [10]. Beyond small molecules, AI is also advancing the design of therapeutic proteins. Recent breakthroughs in diffusion models have enabled de novo protein design. The RFdiffusion framework, built by fine-tuning RoseTTAFold on structure denoising, generated protein monomers, symmetric oligomers, and binders that performed well in experimental testing. Cryo-EM validation showed a close match between the designed and experimentally determined structures [11].
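A "close match" between a designed and an experimentally determined structure is commonly quantified as a root-mean-square deviation (RMSD) over matched atoms. The sketch below is a minimal illustration in plain Python. It assumes the two coordinate sets are already superimposed and listed in the same atom order, and the coordinates are invented; it is not the evaluation procedure of the cited work.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def rmsd(predicted: Sequence[Point], observed: Sequence[Point]) -> float:
    """RMSD between two equally long, pre-aligned coordinate lists."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for p, q in zip(predicted, observed):
        # Squared Euclidean distance between one pair of matched atoms.
        total += sum((a - b) ** 2 for a, b in zip(p, q))
    return math.sqrt(total / len(predicted))


# Hypothetical C-alpha coordinates in angstroms (illustrative only).
designed = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
experimental = [(0.0, 0.0, 1.0), (3.8, 0.0, 1.0), (7.6, 0.0, 1.0)]

# Every atom is displaced by exactly 1 Å, so the RMSD is 1.00 Å.
print(f"RMSD = {rmsd(designed, experimental):.2f} Å")
```

In practice the structures would first be optimally superimposed (for example with the Kabsch algorithm), a step omitted here for brevity.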
Automated Synthesis, Optimization and Key Outcomes
AI models are also used in retrosynthetic analysis, that is, planning how to synthesize molecules from available precursors [12]. Transformer-based models, such as the Molecular Transformer, are trained on large datasets of chemical reactions (encoded as SMILES strings) and can recommend feasible reaction steps, including reagents, solvents, and catalysts. Linking virtual molecule generation to practical synthesis in this way accelerates drug discovery and development [12].
Recent successes in AI-driven drug discovery highlight the potential of machine learning across multiple stages of the pipeline. A reinforcement learning (RL)-guided generative model produced diverse compounds with predicted dopamine receptor D2 (DRD2) activity in 95% of samples, outperforming earlier models. This demonstrates the capability of RL to generate target-specific molecular structures during de novo design [7]. Another notable advance is the cell-based multi-target QSAR (CBMT-QSAR) model, developed to identify anticancer agents across 17 liver cancer cell lines. The model achieved over 80% predictive accuracy and facilitated the virtual design of eight drug-like compounds, six of which were predicted to be effective across all tested cell lines. This underscores the utility of ML in generating multi-target drug candidates with high therapeutic relevance [13]. In a more integrated approach, AlphaFold was used to predict the 3D structure of CDK20, which was then refined and analyzed using the Chemistry42 AI platform to identify binding pockets and generate inhibitors. Out of 8,918 designed molecules, seven were synthesized, and one compound, ISM042-2-001, showed promising binding with a Kd of 9.2 µM. The entire workflow, from structure prediction to hit identification, was completed within 30 days, exemplifying the efficiency of AI-powered platforms in accelerating early-stage drug discovery [14].
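To put the reported Kd of 9.2 µM in context, a dissociation constant can be converted to a standard binding free energy with ΔG° = RT ln(Kd / 1 M), and to a fraction of target bound under a simple 1:1 binding model. The Python sketch below illustrates both calculations. The function names, the temperature of 298.15 K, and the 10 µM ligand concentration are assumptions chosen for illustration and do not come from the cited study.

```python
import math

GAS_CONSTANT = 8.314462618  # J / (mol K)


def binding_free_energy_kj(kd_molar: float, temperature_k: float = 298.15) -> float:
    """Standard binding free energy, RT * ln(Kd / 1 M), in kJ/mol."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return GAS_CONSTANT * temperature_k * math.log(kd_molar) / 1000.0


def fraction_bound(ligand_molar: float, kd_molar: float) -> float:
    """Fraction of target occupied for 1:1 binding.

    Assumes the free ligand concentration is approximately the total
    ligand concentration (ligand in excess over target).
    """
    return ligand_molar / (ligand_molar + kd_molar)


kd = 9.2e-6  # 9.2 µM, the value reported for ISM042-2-001
dg = binding_free_energy_kj(kd)
print(f"dG = {dg:.1f} kJ/mol ({dg / 4.184:.1f} kcal/mol)")  # about -28.7 kJ/mol
print(f"occupancy at 10 µM ligand = {fraction_bound(10e-6, kd):.2f}")  # 10 / 19.2 = 0.52
```

A micromolar Kd of this size is typical of an early hit rather than an optimized lead, which is consistent with the text describing the compound as a hit identified at the start of the pipeline.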
Table 1 provides a summary of these and other key case studies, showcasing how AI tools have been successfully applied across different phases of the drug discovery process.
Table 1. Key Examples of AI Applications Across the Drug Discovery Pipeline.
This table highlights notable case studies demonstrating the use of artificial intelligence at various stages of drug discovery, including target identification, molecule generation, screening, and optimization.
Stage | AI Techniques / Tools | Use Case |
Target Identification | ML on omics data, Network analysis, imaging data | Predicting new protein targets in Alzheimer’s from gene expression datasets |
Molecule Generation | GANs, VAEs, RL | Discovering halicin, a novel antibiotic, with a deep learning model that also screened over 107 million ZINC15 compounds |
Screening & Evaluation | AlphaFold, Virtual Screening, Docking | AlphaFold used to predict 3D structure of CDK20; docking used to test drug fit |
Optimization | Retrosynthesis AI, Transformer models | AI model plans chemical steps to synthesize a promising lead compound |
Preclinical Prediction | QSAR (Quantitative Structure–Activity Relationship), Toxicity prediction | ML model identifies fragments effective across 17 liver cancer cell lines |
References
[1] I. Bano, U. D. Butt and S. A. H. Mohsan, in Novel Platforms for Drug Delivery Applications, Elsevier, 2023, pp. 619–643.
[2] Forum on Neuroscience and Nervous System Disorders, Board on Health Sciences Policy and Institute of Medicine, in Improving and Accelerating Therapeutic Development for Nervous System Disorders: Workshop Summary, National Academies Press (US), 2014.
[3] J. Kleinberg, in Abstracts of the 2018 ACM International Conference on Measurement and Modeling of Computer Systems, ACM, Irvine CA USA, 2018, pp. 40–40.
[4] Q. Wu, J. Zheng, X. Sui, C. Fu, X. Cui, B. Liao, H. Ji, Y. Luo, A. He, X. Lu, X. Xue, C. S. H. Tan and R. Tian, Chem. Sci., 2024, 15, 2833–2847.
[5] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville and Y. Bengio, arXiv, 2014, preprint, DOI: 10.48550/ARXIV.1406.2661.
[6] D. P. Kingma and M. Welling, arXiv, 2013, preprint, DOI: 10.48550/ARXIV.1312.6114.
[7] S. R. Atance, J. V. Diez, O. Engkvist, S. Olsson and R. Mercado, J. Chem. Inf. Model., 2022, 62, 4863–4872.
[8] J. M. Stokes, K. Yang, K. Swanson, W. Jin, A. Cubillos-Ruiz, N. M. Donghia, C. R. MacNair, S. French, L. A. Carfrae, Z. Bloom-Ackermann, V. M. Tran, A. Chiappino-Pepe, A. H. Badran, I. W. Andrews, E. J. Chory, G. M. Church, E. D. Brown, T. S. Jaakkola, R. Barzilay and J. J. Collins, Cell, 2020, 180, 688–702.e13.
[9] V. V. Kleandrova, M. T. Scotti, L. Scotti and A. Speck-Planche, Curr. Top. Med. Chem., 2021, 21, 661–675.
[10] J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvunakool, R. Bates, A. Žídek, A. Potapenko, A. Bridgland, C. Meyer, S. A. A. Kohl, A. J. Ballard, A. Cowie, B. Romera-Paredes, S. Nikolov, R. Jain, J. Adler, T. Back, S. Petersen, D. Reiman, E. Clancy, M. Zielinski, M. Steinegger, M. Pacholska, T. Berghammer, S. Bodenstein, D. Silver, O. Vinyals, A. W. Senior, K. Kavukcuoglu, P. Kohli and D. Hassabis, Nature, 2021, 596, 583–589.
[11] J. L. Watson, D. Juergens, N. R. Bennett, B. L. Trippe, J. Yim, H. E. Eisenach, W. Ahern, A. J. Borst, R. J. Ragotte, L. F. Milles, B. I. M. Wicky, N. Hanikel, S. J. Pellock, A. Courbet, W. Sheffler, J. Wang, P. Venkatesh, I. Sappington, S. V. Torres, A. Lauko, V. De Bortoli, E. Mathieu, S. Ovchinnikov, R. Barzilay, T. S. Jaakkola, F. DiMaio, M. Baek and D. Baker, Nature, 2023, 620, 1089–1100.
[12] P. Schwaller, R. Petraglia, V. Zullo, V. H. Nair, R. A. Haeuselmann, R. Pisoni, C. Bekas, A. Iuliano and T. Laino, Chem. Sci., 2020, 11, 3316–3325.
[13] V. V. Kleandrova, M. T. Scotti, L. Scotti, A. Nayarisseri and A. Speck-Planche, SAR QSAR Environ. Res., 2020, 31, 815–836.
[14] F. Ren, X. Ding, M. Zheng, M. Korzinkin, X. Cai, W. Zhu, A. Mantsyzov, A. Aliper, V. Aladinskiy, Z. Cao, S. Kong, X. Long, B. H. Man Liu, Y. Liu, V. Naumov, A. Shneyderman, I. V. Ozerov, J. Wang, F. W. Pun, D. A. Polykovskiy, C. Sun, M. Levitt, A. Aspuru-Guzik and A. Zhavoronkov, Chem. Sci., 2023, 14, 1443–1452.
Open Access: This article is published under a Creative Commons Attribution (CC BY 4.0) license. You are free to share and adapt the material, provided proper credit is given to the original author(s) and source.