AI-based large language models for molecule generation have attracted increasing interest in structure-based drug design, but challenges remain, including the quality of the pretraining dataset and the diversity and synthesizability of the generated molecules. To address these challenges, we introduce HitChem 2DMG, an advanced model that generates diverse 2D molecular structures, sparking creativity in medicinal chemistry experts and accelerating the drug discovery process.
The model is pretrained on a vast dataset comprising:
• 30 million commercial compound structures (covering most global commercial compounds)
• 8 million patent-derived compound structures (spanning from the 1960s to 2023)
• 11 million building block structures
• Billions of easily synthesizable molecular structures
This large-scale training enables the model to generate drug-like molecules suitable for rational molecular design and focused library generation.
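The text does not say how drug-likeness is assessed for HitChem 2DMG. As an illustration only, the sketch below applies Lipinski's rule of five, a widely used heuristic, to precomputed molecular descriptors. The `Descriptors` class, the function names, the one-violation tolerance, and the two sets of descriptor values are all hypothetical and are not part of HitChem 2DMG.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed properties for one molecule (hypothetical container)."""
    mol_weight: float       # molecular weight in daltons
    log_p: float            # octanol-water partition coefficient
    h_bond_donors: int      # number of hydrogen-bond donors
    h_bond_acceptors: int   # number of hydrogen-bond acceptors


def lipinski_violations(d: Descriptors) -> int:
    """Count how many of Lipinski's four limits the molecule exceeds."""
    return sum([
        d.mol_weight > 500,
        d.log_p > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ])


def is_drug_like(d: Descriptors, max_violations: int = 1) -> bool:
    """Assumed convention: tolerate at most `max_violations` exceeded limits."""
    return lipinski_violations(d) <= max_violations


# Illustrative descriptor values only, not taken from any real dataset.
small = Descriptors(mol_weight=180.2, log_p=1.2, h_bond_donors=1, h_bond_acceptors=4)
large = Descriptors(mol_weight=812.0, log_p=6.3, h_bond_donors=7, h_bond_acceptors=14)

for name, desc in [("small", small), ("large", large)]:
    print(name, lipinski_violations(desc), is_drug_like(desc))
```

In practice the descriptors would be computed from each generated structure by a cheminformatics toolkit. They are passed in directly here to keep the sketch self-contained.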