Multimodal AI for Molecular Design: Bridging Natural Language with Molecular Structures for Applications in Chemistry
Anushya Krishnan, Scientific Collaborator, ReaxionLab
www.linkedin.com/in/anushyakrishnan
Received June 17, 2025. Accepted July 06, 2025.
Reaxion Crucible 2025, 1 (1): e2025006
Abstract
Multimodal AI is transforming molecular design by integrating diverse data types such as graphs, textual descriptions, and molecular images. This approach enables models to learn complex relationships between molecular structure, function, and properties, leading to more accurate predictions and flexible generative capabilities. In conventional single-modal approaches, string-based representations such as the Simplified Molecular Input Line Entry System (SMILES) constrain generative models, but recent innovations in multimodal AI, including Sequential Attachment-based Fragment Embedding (SAFE), Multi-Modal Fusion (MMF), and Vision Transformer (ViT) based frameworks, provide flexibility and control. Models such as Madrigal demonstrate the power of unifying structured biological data for drug combination prediction, while the Llamole model pioneers interleaved text-graph generation using LLMs. These advances demonstrate strong potential across applications including drug discovery, materials science, and sensor development. This article highlights key breakthroughs and current challenges for multimodal AI in molecular science, and outlines emerging trends and tools that enhance molecular design performance and interpretability.
Keywords: Multimodal AI, molecular design, large language models (LLMs), graph neural networks (GNNs), molecular property prediction, inverse molecular design
How Multimodal AI Works in Chemistry
Multimodal AI in chemistry uses both structured molecular data and unstructured textual information to improve tasks like property prediction and molecular design. Graph neural networks (GNNs) are well suited to representing molecular systems because they learn atom- and bond-level features from graph-structured data [6]; indeed, chemistry has been a major influence on how GNNs developed [7]. One recent framework, Multi-Modal Fusion (MMF) [8], combines GNNs and LLMs to improve molecular property prediction. GNNs learn structural features from molecular graphs, while LLMs are prompted with SMILES strings and generate descriptive chemical text. These textual outputs are converted into embeddings and combined with the graph-based embeddings through a cross-modal attention mechanism. A Mixture-of-Experts module then dynamically merges the two representations according to their predictive relevance, enabling more accurate and generalizable property predictions without extensive fine-tuning.

In a related approach, a multimodal deep learning framework was developed for chemical toxicity prediction by combining molecular structure images with chemical property data. A Vision Transformer (ViT) extracts features from the structural images and a Multilayer Perceptron (MLP) processes the numerical descriptors; the outputs are fused through a joint representation to enable multi-label classification of toxic endpoints. The model achieved 87.2% accuracy, an F1-score of 0.86, and a Pearson correlation coefficient of 0.9192, highlighting the effectiveness of multimodal fusion in chemical prediction tasks [9]. The integration of multimodal inputs, ranging from molecular graphs and SMILES strings to 2D/3D images, is crucial to the success of advanced AI frameworks in chemistry.
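The cross-modal attention and Mixture-of-Experts gating described above can be sketched in a toy form. The vectors, function names, and the single-head, unparameterized attention below are illustrative assumptions, not MMF's actual implementation:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_modal_attention(queries, keys, scale=1.0):
    """Each query vector (one modality) attends over the other
    modality's key vectors and returns the attention-weighted context."""
    fused = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        context = [sum(w * k[i] for w, k in zip(weights, keys))
                   for i in range(len(q))]
        fused.append(context)
    return fused

def moe_gate(graph_emb, text_emb, gate_w):
    """Soft gate over the two modalities: weight each embedding by a
    'predictive relevance' score, then mix them."""
    g = softmax([dot(gate_w, graph_emb), dot(gate_w, text_emb)])
    return [g[0] * a + g[1] * b for a, b in zip(graph_emb, text_emb)]

# Toy usage: one graph token attending over two text tokens,
# then a gate whose weights happen to favour the graph embedding.
fused = cross_modal_attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
mixed = moe_gate([2.0, 0.0], [0.0, 2.0], gate_w=[1.0, 0.0])
```

In the real framework both the attention and the gate are learned, multi-headed, and operate on high-dimensional embeddings; the sketch only shows where each mechanism sits in the pipeline.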
Figure 1 illustrates how different AI models process such inputs to support diverse applications like drug discovery and materials design.
Figure 1 Overview of Multimodal AI for Molecular Design. Molecular inputs such as graphs, SMILES strings, and structural images are processed by specialized AI models, including MMF, SAFE, Madrigal, and Llamole, to perform various applications such as drug design, property prediction, and retrosynthesis. (Diagram created using Microsoft PowerPoint.)
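The image-descriptor fusion behind the toxicity model can likewise be sketched minimally. The concatenate-then-score design, feature dimensions, and weights below are illustrative assumptions standing in for the ViT and MLP encoders:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def joint_multilabel(image_feats, desc_feats, weights, biases):
    """Fuse the two feature vectors by concatenation, then score each
    toxicity endpoint with its own sigmoid. Independent sigmoids (not a
    softmax) are what make this multi-label: a molecule can trigger
    several endpoints at once."""
    joint = list(image_feats) + list(desc_feats)
    return [sigmoid(dot(w, joint) + b) for w, b in zip(weights, biases)]

# Toy usage: 2 image features + 2 descriptor features, 3 endpoints,
# with made-up weights.
probs = joint_multilabel(
    image_feats=[0.4, -0.1],
    desc_feats=[1.2, 0.3],
    weights=[[0.5, 0.0, 0.2, 0.0],
             [0.0, 0.0, 0.0, 0.0],
             [-0.3, 0.1, 0.0, 0.4]],
    biases=[0.0, 0.0, 0.1],
)
```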
Applications of LLM-Graph Fusion in Molecular Innovation
A study combined NLP-based topic modeling with substructure analysis to build datasets for catalysis, photophysics, biology, and magnetism from literature-linked transition metal complexes, illustrating how text and molecular data can be integrated for targeted discovery [10]. Recent advances such as Madrigal, a multimodal AI framework, demonstrate how molecular structure, pathway knowledge, cell viability, and transcriptomic data can be fused to predict drug combination effects and adverse reactions [11]. By handling incomplete modality data and integrating large language models (LLMs) to model outcomes beyond standardized vocabularies, Madrigal enables personalized combination-therapy design for complex diseases such as cancer and metabolic disorders. Another notable framework, SAFE, restructures SMILES strings into unordered, interpretable fragment blocks while remaining compatible with standard SMILES parsers. This simplifies generative tasks such as scaffold decoration, fragment linking, polymer generation, and scaffold hopping, supporting autoregressive fragment-constrained design without complex decoding schemes or graph-based models [4].
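SAFE's key property, that fragment blocks are unordered because connectivity is carried by shared attachment numbers rather than block position, can be illustrated with a toy sketch. The `safe_string` below is a made-up example for illustration, not a validated SAFE encoding:

```python
import random

# A made-up SAFE-style string: '.'-separated fragment blocks whose
# attachment points are encoded as shared ring-closure numbers
# (%10, %11, ...), in the spirit of the SAFE notation.
safe_string = "c1ccc%10cc1.C%10(=O)N%11.C%11CC"

def blocks(safe):
    """Read a SAFE-style string as its sequence of fragment blocks."""
    return safe.split(".")

def reorder(safe, seed=0):
    """Shuffling the blocks changes the string but not the molecule it
    encodes: connectivity lives in the shared closure numbers, so any
    permutation of blocks describes the same attachment pattern."""
    bs = blocks(safe)
    random.Random(seed).shuffle(bs)
    return ".".join(bs)
```

This block-permutation invariance is what lets an autoregressive model decorate a fixed scaffold or link fixed fragments simply by conditioning on some blocks and generating the rest.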
Strengths & Limitations
Multimodal AI models have greatly enhanced molecular design by combining diverse data types, such as chemical structures, graphs, images, and text, enabling tasks like property prediction, toxicity assessment, and molecule generation. Frameworks like MMF and ViT-based fusion models demonstrate strong predictive performance through joint representations of structured and unstructured data. SAFE simplifies generative design by fragmenting SMILES into interpretable blocks, aiding scaffold decoration and polymer generation. Madrigal [11] addresses a key challenge in multimodal learning, missing modalities, by employing a transformer bottleneck module that aligns and unifies heterogeneous preclinical drug data during both training and inference. A recent advancement, the Large Language Model for Molecular Discovery (Llamole) [12], introduces the first multimodal large language model capable of interleaved text and graph generation, enabling controllable molecular design with retrosynthetic planning. By integrating Graph Diffusion Transformers and GNNs, Llamole allows large language models to reason over molecular graphs and significantly outperforms other LLMs in tasks requiring graph-aware molecule generation. Despite these innovations, several limitations remain. Most models perform poorly on out-of-distribution data or when domain-specific annotations are sparse. Sequence-based LLMs still underperform graph-based methods in capturing molecular structure-property relationships. Moreover, existing models often assume complete labeled datasets and struggle with uncertainty estimation and interpretability. Future efforts should focus on improving fusion architectures, robustness to sparse or unseen data, and domain-specific fine-tuning to support real-world deployment in chemical discovery.
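The missing-modality problem that Madrigal tackles can be caricatured as mask-aware pooling. The function below is a deliberately simplified stand-in, a plain mean over whichever modalities are present, for its learned transformer bottleneck, and the modality names are hypothetical:

```python
def fuse_available(modality_embs):
    """Pool only the modalities that are present (value is not None).
    A plain mean stands in here for Madrigal's learned transformer
    bottleneck, which aligns modalities rather than averaging them;
    the point is that absent inputs are masked out, not imputed."""
    avail = [v for v in modality_embs.values() if v is not None]
    if not avail:
        raise ValueError("at least one modality must be present")
    dim = len(avail[0])
    return [sum(v[i] for v in avail) / len(avail) for i in range(dim)]

# Toy usage: transcriptomics is unavailable at inference time,
# so the fused embedding is built from the other two modalities.
fused = fuse_available({
    "structure": [1.0, 1.0],
    "transcriptomics": None,   # missing modality
    "viability": [3.0, 3.0],
})
```

The design choice worth noting is that the same fusion path runs at training and inference, so a model trained with deliberately dropped modalities degrades gracefully when real-world inputs are incomplete.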
Conclusion
Multimodal AI is shaping a new era in molecular design by combining molecular graphs, chemical text, and structural data to enhance prediction accuracy and generative capabilities. Recent frameworks such as MMF, SAFE, Madrigal, and Llamole have shown the potential of integrating graph neural networks and large language models for applications ranging from property prediction and toxicity assessment to retrosynthetic planning. These advancements are particularly relevant for designing molecular systems with targeted electronic, thermal, or biochemical properties. Despite these gains, key challenges remain, including limited performance on out-of-distribution molecules, lack of interpretability, and the constraints of using linear molecular representations like SMILES. Future research should focus on improving cross-modal fusion strategies, developing models that are more adaptable to unseen data, and applying instruction-based prompting to better align model outputs with design goals. As multimodal AI continues to evolve, it is expected to transform molecular design across key areas of chemistry, enabling more efficient discovery of functional molecules for catalysis, drug development, and materials engineering.
References: