Multimodal AI for Molecular Design: Bridging Natural Language with Molecular Structures for Applications in Chemistry

Anushya Krishnan, Scientific Collaborator, ReaxionLab

www.linkedin.com/in/anushyakrishnan

Received June 17, 2025. Accepted July 6, 2025.

Reaxion Crucible 2025, 1 (1): e2025006


Abstract

Multimodal AI is transforming molecular design by integrating diverse data types such as graphs, textual descriptions, and molecular images. This approach enables models to learn complex relationships between molecular structure, function, and properties, leading to more accurate predictions and flexible generative capabilities. In conventional single-modal models, string-based representations such as the Simplified Molecular Input Line Entry System (SMILES) limit generative flexibility, but recent innovations in multimodal AI, including Sequential Attachment-based Fragment Embedding (SAFE), Multi-Modal Fusion (MMF), and Vision Transformer (ViT)-based frameworks, provide greater flexibility and control. Models such as Madrigal demonstrate the power of unifying structured biological data for drug combination prediction, while the Llamole model pioneers interleaved text-graph generation using large language models (LLMs). These advances demonstrate strong potential across applications including drug discovery, materials science, and sensor development. This article highlights key breakthroughs and current challenges for multimodal AI in molecular science and outlines emerging trends and tools that enhance molecular design performance and interpretability.

Keywords: Multimodal AI, molecular design, large language models (LLMs), graph neural networks (GNNs), molecular property prediction, inverse molecular design


Molecular design involves creating new molecules by assembling chemical components to achieve specific properties or functions. Designing a new molecule, whether by modifying an existing one or creating a completely novel structure, requires precise control over molecular behavior at the atomic level, which is a highly complex and challenging task [1]. To manage the immense variety of chemical compounds and improve efficiency, various molecule generators based on inverse molecular design and predictive modeling have been created and used in fields such as materials science and drug discovery [2, 3]. Conventional string formats like SMILES can be limiting for AI-based molecular design, as they do not represent molecular substructures in a sequential or intuitive manner [4]. While traditional models rely on a single type of input, multimodal AI systems integrate multiple data sources, offering a more comprehensive understanding of chemical information and expanding the potential of molecular design tools. Multimodal AI models that combine images and text are proving useful in many fields beyond molecular design. For example, mPLUG-DocOwl, a multimodal framework trained using unified instructions across formats like tables, charts, and webpages, demonstrated strong performance in understanding complex documents without relying on traditional Optical Character Recognition (OCR) [5]. Such instruction-tuned models exemplify how large language models (LLMs) can reason across modalities. This capability is now being explored in molecular design, where combining natural language with structured molecular representations enables breakthroughs in designing sensors, optimizing reaction pathways, and tuning electronic or thermal properties.
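To make the contrast between the string and graph views of a molecule concrete, the short Python sketch below (using the open-source RDKit toolkit; the choice of aspirin is arbitrary) parses a SMILES string and enumerates the atom-bond graph that graph-based models consume:

# Minimal sketch: the same molecule as a linear SMILES string and as
# a graph. Requires RDKit; aspirin is an arbitrary example molecule.
from rdkit import Chem

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin as a linear SMILES string
mol = Chem.MolFromSmiles(smiles)

# Graph view of the same molecule: nodes are atoms, edges are bonds.
for atom in mol.GetAtoms():
    print(atom.GetIdx(), atom.GetSymbol(), atom.GetDegree())
for bond in mol.GetBonds():
    print(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond.GetBondType())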

How Multimodal AI Works in Chemistry

Multimodal AI in chemistry uses both structured molecular data and unstructured textual information to improve tasks like property prediction and molecular design. Graph neural networks (GNNs) are well suited to representing molecular systems because they learn atom- and bond-level features from graph-structured data [6]. Indeed, chemistry has been a major influence on how GNNs have developed [7]. One recent framework, Multi-Modal Fusion (MMF) [8], combines GNNs and LLMs to improve molecular property prediction. GNNs learn structural features from molecular graphs, while LLMs are prompted with SMILES strings and generate descriptive chemical text. These textual outputs are converted into embeddings and combined with graph-based embeddings using a cross-modal attention mechanism. A Mixture-of-Experts module then dynamically merges the two representations based on predictive relevance, enabling more accurate and generalizable property predictions without extensive fine-tuning. In a related approach, a multimodal deep learning framework was developed for chemical toxicity prediction by combining molecular structure images with chemical property data. This method used a Vision Transformer (ViT) to extract features from structural images and a multilayer perceptron (MLP) to process numerical descriptors; the outputs were fused through a joint representation to enable multi-label classification of toxic endpoints. The model performed strongly, achieving 87.2% accuracy, an F1-score of 0.86, and a Pearson correlation coefficient of 0.9192, highlighting the effectiveness of multimodal fusion in chemical prediction tasks [9]. The integration of multimodal inputs, ranging from molecular graphs and SMILES strings to 2D/3D images, is crucial to the success of advanced AI frameworks in chemistry. Figure 1 illustrates how different AI models process such inputs to support applications such as drug discovery and materials design.
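A minimal PyTorch sketch of this cross-modal attention plus gated-expert fusion pattern is given below, assuming precomputed graph and text embeddings; the dimensions, module names, and simple two-expert gate are illustrative choices, not the published MMF architecture [8]:

# Illustrative sketch of cross-modal attention plus a two-expert
# gated merge in the spirit of MMF [8]. All sizes and names are
# assumptions, not the published code.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # Graph tokens attend to LLM-generated text tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.gate = nn.Linear(2 * dim, 2)  # relevance weight per modality
        self.head = nn.Linear(dim, 1)      # property regression head

    def forward(self, graph_emb, text_emb):
        # graph_emb: (B, n_atoms, dim); text_emb: (B, n_tokens, dim)
        attended, _ = self.cross_attn(graph_emb, text_emb, text_emb)
        g = attended.mean(dim=1)           # pooled graph-conditioned view
        t = text_emb.mean(dim=1)           # pooled text view
        w = torch.softmax(self.gate(torch.cat([g, t], dim=-1)), dim=-1)
        fused = w[:, :1] * g + w[:, 1:] * t  # gated expert merge
        return self.head(fused)

# Usage with random stand-in embeddings:
model = CrossModalFusion()
pred = model(torch.randn(8, 30, 256), torch.randn(8, 64, 256))
print(pred.shape)  # torch.Size([8, 1])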

Figure 1 Overview of Multimodal AI for Molecular Design. Molecular inputs such as graphs, SMILES strings, and structural images are processed by specialized AI models, including MMF, SAFE, Madrigal, and Llamole, to perform various applications such as drug design, property prediction, and retrosynthesis. (Diagram created using Microsoft PowerPoint.)
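The image-plus-descriptor approach of ref. [9] can be sketched in the same spirit: a ViT backbone encodes the structure image, an MLP encodes numerical descriptors, and the concatenated joint representation feeds a multi-label sigmoid head. The torchvision backbone, feature sizes, and endpoint count below are assumptions for illustration, not the published model:

# Hedged sketch of ViT + MLP fusion for multi-label toxicity
# prediction in the spirit of [9]; the backbone, descriptor count,
# and number of endpoints are assumptions.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16

class ToxicityFusion(nn.Module):
    def __init__(self, n_descriptors: int = 32, n_endpoints: int = 12):
        super().__init__()
        self.vit = vit_b_16(weights=None)
        self.vit.heads = nn.Identity()  # expose 768-dim image features
        self.mlp = nn.Sequential(nn.Linear(n_descriptors, 128), nn.ReLU())
        self.classifier = nn.Linear(768 + 128, n_endpoints)

    def forward(self, image, descriptors):
        # Joint representation: concatenated image and descriptor features.
        joint = torch.cat([self.vit(image), self.mlp(descriptors)], dim=-1)
        return torch.sigmoid(self.classifier(joint))  # one probability per endpoint

model = ToxicityFusion()
probs = model(torch.randn(2, 3, 224, 224), torch.randn(2, 32))
print(probs.shape)  # torch.Size([2, 12])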

Applications of LLM-Graph Fusion in Molecular Innovation

A study combined NLP-based topic modeling with substructure analysis to build datasets for catalysis, photophysics, biology, and magnetism from literature-linked transition metal complexes, illustrating how text and molecular data can be integrated for targeted discovery [10]. Recent advances such as Madrigal, a multimodal AI framework, demonstrate how molecular structure, pathway knowledge, cell viability, and transcriptomic data can be fused to predict drug combination effects and adverse reactions [11]. By handling incomplete modality data and integrating LLMs to model outcomes beyond standardized vocabularies, Madrigal enables personalized combination therapy design for complex diseases like cancer and metabolic disorders. Another notable framework, SAFE, restructures SMILES strings into unordered, interpretable fragment blocks while remaining compatible with standard SMILES parsers. This simplifies generative tasks such as scaffold decoration, fragment linking, polymer generation, and scaffold hopping, supporting autoregressive fragment-constrained design without complex decoding schemes or graph-based models [4].
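The fragment-block intuition behind SAFE can be illustrated with plain RDKit: decomposing a molecule at BRICS bonds yields fragments with marked attachment points. This is a simplification for illustration only; the actual SAFE encoder rewrites such fragments into a single string that remains a valid SMILES via ring-closure numbering [4]:

# Rough illustration of fragment-based representation: cut a molecule
# at BRICS bonds. NOT the actual SAFE algorithm [4], which keeps the
# fragmented result parseable as standard SMILES; this only exposes
# the fragment blocks themselves.
from rdkit import Chem
from rdkit.Chem import BRICS

ibuprofen = Chem.MolFromSmiles("CC(C)Cc1ccc(cc1)C(C)C(=O)O")
fragments = sorted(BRICS.BRICSDecompose(ibuprofen))
print(fragments)  # fragment SMILES; [n*] dummy atoms mark attachment points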

Strengths & Limitations

Multimodal AI models have greatly enhanced molecular design by combining diverse data types such as chemical structures, graphs, images, and text, enabling tasks like property prediction, toxicity assessment, and molecule generation. Frameworks like MMF and ViT-based fusion models demonstrate strong predictive performance through joint representations of structured and unstructured data. SAFE simplifies generative design by fragmenting SMILES into interpretable blocks, aiding scaffold decoration and polymer generation. Madrigal [11] addresses a key challenge in multimodal learning, missing modalities, by employing a transformer bottleneck module that aligns and unifies heterogeneous preclinical drug data during both training and inference. A recent advancement, the Large Language Model for Molecular Discovery (Llamole) [12], introduces the first multimodal large language model capable of interleaving text and graph generation, enabling controllable molecular design with retrosynthetic planning. By integrating Graph Diffusion Transformers and GNNs, Llamole allows large language models to reason over molecular graphs and significantly outperforms other LLMs in tasks requiring graph-aware molecule generation. Despite these innovations, several limitations remain. Most models perform poorly on out-of-distribution data or when domain-specific annotations are sparse. Sequence-based LLMs still underperform graph-based methods in capturing molecular structure-property relationships. Moreover, existing models often assume complete labeled datasets and struggle with uncertainty estimation and interpretability. Future efforts should focus on improving fusion architectures, robustness to sparse or unseen data, and domain-specific fine-tuning to support real-world deployment in chemical discovery.
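A minimal sketch of the bottleneck idea, in the spirit of Madrigal [11], is shown below: a small set of learned bottleneck tokens attends over whichever modality embeddings are present, and missing modalities are simply masked out of the attention. Shapes, the masking scheme, and module names are assumptions, not the published implementation:

# Hedged sketch of a transformer bottleneck that tolerates missing
# modalities, in the spirit of Madrigal [11]. All details are
# assumptions for illustration.
import torch
import torch.nn as nn

class BottleneckFusion(nn.Module):
    def __init__(self, dim: int = 128, n_bottleneck: int = 4, n_modalities: int = 4):
        super().__init__()
        # Learned bottleneck tokens that aggregate available modalities.
        self.bottleneck = nn.Parameter(torch.randn(1, n_bottleneck, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, modality_embs, missing_mask):
        # modality_embs: (B, n_modalities, dim); missing_mask: (B, n_modalities),
        # True where a modality is MISSING (key_padding_mask convention).
        b = self.bottleneck.expand(modality_embs.size(0), -1, -1)
        fused, _ = self.attn(b, modality_embs, modality_embs,
                             key_padding_mask=missing_mask)
        return fused.mean(dim=1)  # unified drug representation

# Example: second sample in the batch lacks modalities 2 and 3.
embs = torch.randn(2, 4, 128)
missing = torch.tensor([[False, False, False, False],
                        [False, False, True, True]])
print(BottleneckFusion()(embs, missing).shape)  # torch.Size([2, 128])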

Conclusion

Multimodal AI is shaping a new era in molecular design by combining molecular graphs, chemical text, and structural data to enhance prediction accuracy and generative capabilities. Recent frameworks such as MMF, SAFE, Madrigal, and Llamole have shown the potential of integrating graph neural networks and large language models for applications ranging from property prediction and toxicity assessment to retrosynthetic planning. These advancements are particularly relevant for designing molecular systems with targeted electronic, thermal, or biochemical properties. Despite these gains, key challenges remain, including limited performance on out-of-distribution molecules, lack of interpretability, and the constraints of using linear molecular representations like SMILES. Future research should focus on improving cross-modal fusion strategies, developing models that are more adaptable to unseen data, and applying instruction-based prompting to better align model outputs with design goals. As multimodal AI continues to evolve, it is expected to transform molecular design across key areas of chemistry, enabling more efficient discovery of functional molecules for catalysis, drug development, and materials engineering.


References:

[1] F. Vella, Biochemical Education, 1988, 16, 110–111.
[2] R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams and A. Aspuru-Guzik, ACS Cent. Sci., 2018, 4, 268–276.
[3] W. Jin, R. Barzilay and T. Jaakkola, in Proceedings of the 35th International Conference on Machine Learning, PMLR, 2018, pp. 2323–2332.
[4] E. Noutahi, C. Gabellini, M. Craig, J. S. C. Lim and P. Tossou, Digital Discovery, 2024, 3, 796–804.
[5] J. Ye, A. Hu, H. Xu, Q. Ye, M. Yan, Y. Dan, C. Zhao, G. Xu, C. Li, J. Tian, Q. Qi, J. Zhang and F. Huang, arXiv, 2023, preprint, DOI: 10.48550/arXiv.2307.02499.
[6] P. Reiser, M. Neubert, A. Eberhard, L. Torresi, C. Zhou, C. Shao, H. Metni, C. Van Hoesel, H. Schopmans, T. Sommer and P. Friederich, Commun. Mater., 2022, 3, 93.
[7] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals and G. E. Dahl, arXiv, 2017, preprint, DOI: 10.48550/arXiv.1704.01212.
[8] S. S. Srinivas and V. Runkana, arXiv, 2024, preprint, DOI: 10.48550/arXiv.2408.14964.
[9] J. Hong and H. Kwon, Sci. Rep., 2025, 15, 19491.
[10] I. Kevlishvili, R. G. St. Michel, A. G. Garrison, J. W. Toney, H. Adamji, H. Jia, Y. Román-Leshkov and H. J. Kulik, Faraday Discuss., 2025, 256, 275–303.
[11] Y. Huang, X. Su, V. Ullanat, I. Liang, L. Clegg, D. Olabode, N. Ho, B. John, M. Gibbs and M. Zitnik, arXiv, 2025, preprint, DOI: 10.48550/arXiv.2503.02781.
[12] G. Liu, M. Sun, W. Matusik, M. Jiang and J. Chen, arXiv, 2024, preprint, DOI: 10.48550/arXiv.2410.04223.

Disclaimer: The views, interpretations, and conclusions presented in this article are those of the author(s) alone and do not necessarily reflect those of the journal, editorial board, or publisher. The journal assumes no responsibility for any loss, damage, or consequences arising from the use of the information, data, or methods described. Readers are encouraged to critically evaluate the content before applying it in practice.

Open Access: This article is published under a Creative Commons Attribution (CC BY 4.0) license. You are free to share and adapt the material, provided proper credit is given to the original author(s) and source.