HCLS Foundation Model Map
Foundation models are moving fast across biology, chemistry, and clinical medicine. This is my map of prominent models across healthcare and life sciences, auto-generated from my research notes.
Models are tracked in a CSV and in per-model markdown files. A monthly AI-assisted scan of GitHub lists and academic surveys surfaces candidates and posts proposals to Slack. Approved entries are written to the CSV, which triggers an automatic page rebuild.
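As a rough illustration of the rebuild step, the sketch below renders CSV rows into the table layout used on this page. The `Domain` column, the other column names, the sample rows' CSV layout, and the `render_table` helper are all assumptions made for the example; the real pipeline may look quite different.

```python
import csv
import io
from itertools import groupby

# Hypothetical column names; the real CSV schema is not shown on this page.
COLUMNS = ["Model", "Category", "Modality", "Description", "Tags"]


def render_table(csv_text: str) -> str:
    """Render CSV rows as a markdown table with one spanning row per domain."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    lines = ["| " + " | ".join(COLUMNS) + " |", "|" + "---|" * len(COLUMNS)]
    # groupby only merges adjacent rows, so the CSV row order decides the grouping.
    for domain, group in groupby(rows, key=lambda r: r["Domain"]):
        lines.append(f"| {domain} | ||||")
        for row in group:
            # Escape literal pipes so a description cannot break the table.
            cells = [row[c].replace("|", "\\|") for c in COLUMNS]
            lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


SAMPLE = """Domain,Model,Category,Modality,Description,Tags
Biology,Enformer,Genomics,DNA,Long-range regulatory and gene-expression prediction from DNA sequence,
Chemistry,ChemGPT,Small-molecule,SMILES,Large-scale generative model for chemistry over SMILES,
"""

print(render_table(SAMPLE))
```

Keeping the renderer a pure function of the CSV text means a rebuild can be triggered by any change to the file, and the output can be diffed before publishing.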
| Model | Category | Modality | Description | Tags |
|---|---|---|---|---|
| Biology | ||||
| AlphaGenome | Genomics | DNA | DeepMind regulatory-genomics model for variant-effect prediction across modalities including expression, splicing, chromatin, and contact maps. | |
| DNABERT-S | Genomics | DNA | Genome foundation model that produces species-aware DNA embeddings. | |
| DNAGPT | Genomics | DNA | Generalized pre-trained tool for multiple DNA sequence analysis tasks. | |
| Enformer | Genomics | DNA | Long-range regulatory and gene-expression prediction from DNA sequence | |
| Evo 2 | Genomics | DNA | 7B- and 40B-parameter DNA model that processes up to 1M base pairs at nucleotide resolution; backbone of the Mayo Clinic EVEE variant pathogenicity work | |
| GENA-LM | Genomics | DNA | Open-source foundational DNA language models tuned for long sequences. | |
| GPN-MSA | Genomics | DNA | Alignment-based DNA language model for genome-wide variant effect prediction. | |
| HyenaDNA | Genomics | DNA | Long-context DNA sequence model based on the Hyena architecture | |
| Nucleotide Transformer | Genomics | DNA | Foundation models for DNA sequences trained on 3202 human genomes plus multispecies genomes | |
| ATOM-1 | Transcriptomics | RNA | RNA foundation model trained on chemical mapping data for structure and function. | |
| CellPLM | Transcriptomics | scRNA | Cell language model pre-trained beyond single cells, using spatially resolved transcriptomics. | |
| ERNIE-RNA | Transcriptomics | RNA | RNA language model with structure-enhanced representations. | |
| GeneCompass | Transcriptomics | scRNA | Knowledge-informed cross-species single-cell foundation model for gene regulation. | |
| Geneformer | Transcriptomics | scRNA | Transformer pre-trained on 30M single-cell human transcriptomes; widely used for drug-target and cell-state inference | |
| Orthrus | Transcriptomics | RNA | Evolutionary and functional RNA foundation model. | |
| RiNALMo | Transcriptomics | RNA | General-purpose RNA language model that generalizes to structure prediction tasks. | |
| RNA-FM | Transcriptomics | RNA | Interpretable RNA foundation model trained on unannotated ncRNA for structure and function prediction. | |
| RNABERT | Transcriptomics | RNA | Informative RNA base embedding via masked LM for structural alignment and clustering. | |
| scBERT | Transcriptomics | scRNA | BERT-style single-cell RNA-seq foundation model for cell type annotation | |
| scFoundation | Transcriptomics | scRNA | Large-scale foundation model on single-cell transcriptomics. | |
| scGPT | Transcriptomics | scRNA | Single-cell foundation model encoding gene and cell relationships across large atlases | |
| scMulan | Transcriptomics | scRNA | Multitask generative pre-trained language model for single-cell analysis. | |
| scPRINT | Transcriptomics | scRNA | Single-cell RNA FM pre-trained on 50M cells for robust gene network prediction. | |
| UNI-RNA | Transcriptomics | RNA | Universal pre-trained RNA representation model. | |
| Universal Cell Embeddings | Transcriptomics | scRNA | Single-cell foundation model producing universal cell representations across species. | |
| UTR-LM | Transcriptomics | mRNA-UTR | 5' UTR language model for decoding untranslated regions of mRNA. | |
| xTrimoGene | Transcriptomics | scRNA | Efficient and scalable representation learner for single-cell RNA-seq data. | |
| AlphaFold 3 | Protein | protein | Atomic structures of protein, DNA, RNA, and ligand complexes | |
| Ankh | Protein | protein | Optimized protein language model for general-purpose modelling. | |
| BioEmu-1 | Protein | protein-conformation | Scalable emulation of protein equilibrium ensembles via generative deep learning. | |
| CaLM | Protein | codon | Codon language embeddings for protein engineering. | |
| CARP | Protein | protein | Convolutional protein sequence model competitive with transformers. | |
| ESM-3 | Protein | protein | Multimodal protein language model over sequence, structure, and function | |
| Evolla | Protein | protein + text | Decodes the molecular language of proteins for question answering and annotation. | |
| GearNet | Protein | protein-structure | Geometric structure pretraining for protein representation learning. | |
| HelixFold-Single | Protein | protein-structure | MSA-free protein structure prediction using a protein language model. | |
| OntoProtein | Protein | protein | Protein pretraining with gene ontology embeddings. | |
| OpenFold3 | Protein | protein | Open-source effort to support reproducible biomolecular co-folding models. | |
| PrimateAI-3D | Protein | protein | Variant-effect prediction model integrating 3D structure | |
| ProGen2 | Protein | protein | Family of autoregressive protein language models up to 6.4B parameters for design and fitness. | |
| ProteinBERT | Protein | protein | Universal deep-learning model of protein sequence and function with GO annotation pretraining. | |
| ProtGPT2 | Protein | protein | Deep unsupervised generative language model for protein design. | |
| ProTrek | Protein | protein + structure + text | Tri-modal contrastive learning across protein sequence, structure, and text. | |
| ProtST | Protein | protein + text | Multi-modality learning of protein sequences and biomedical texts. | |
| ProtTrans | Protein | protein | Family of protein language models (T5, BERT, Albert, XLNet, Electra) trained via self-supervision at HPC scale. | |
| RoseTTAFold All-Atom | Protein | protein | Open structure-prediction and design model from the Baker Lab, extending RoseTTAFold to broader biomolecular assemblies including proteins, nucleic acids, small molecules, and covalent modifications. | |
| SaProt | Protein | protein + structure | Protein language model with structure-aware vocabulary. | |
| UniRep | Protein | protein | Unified rational protein engineering with sequence-based deep representation learning. | |
| xTrimoPGLM | Protein | protein | Unified 100B-scale pre-trained transformer for protein language. | |
| AbLang2 | Antibody and biologics | antibody-sequence | Antibody-specific language model for infilling and humanization | |
| AntiBERTy | Antibody and biologics | antibody-sequence | BERT-style antibody language model trained on the Observed Antibody Space | |
| IgLM | Antibody and biologics | antibody-sequence | Generative antibody language model for heavy and light chain design | |
| Cell2Sentence | Multi-omics and systems | scRNA + text | Teaches LLMs the language of biology by rendering scRNA cells as sentences. | |
| Nicheformer | Multi-omics and systems | scRNA + spatial | Single-cell and spatial omics foundation model for tissue context modeling. | |
| Chemistry | ||||
| ChemGPT | Small-molecule | SMILES | Large-scale generative model for chemistry over SMILES | |
| MegaMolBART | Small-molecule | SMILES | BART-style model for SMILES, part of the BioNeMo umbrella | |
| MoLFormer | Small-molecule | SMILES | Large-scale chemical language model over SMILES from IBM Research | |
| MolMIM | Small-molecule | SMILES | Controlled molecule generation with property guidance | |
| Clinical | ||||
| BiomedCLIP | Medical imaging | vision-language | Biomedical vision-language model pretrained on 15M PubMed image-text pairs | |
| CXR Foundation | Medical imaging | radiograph | Chest X-ray foundation model from Google | |
| Med-Gemini | Medical imaging | multimodal | Multimodal medical foundation model for radiology report generation and visual question answering | |
| Med-Gemini-2D | Medical imaging | vision-language | 2D medical imaging variant of Med-Gemini | |
| RadFM | Medical imaging | radiology | Radiology foundation model for general medical imaging tasks | |
| CONCH | Digital pathology | vision-language | Vision-language pathology model from the Mahmood Lab | |
| H-optimus-0 | Digital pathology | WSI | Histology foundation model for downstream pathology tasks. | |
| Prov-GigaPath | Digital pathology | WSI | Gigapixel pathology foundation model with tile-to-slide hierarchical representations | |
| UNI | Digital pathology | WSI | General-purpose pathology foundation model from the Mahmood Lab | |
| Virchow | Digital pathology | WSI | Whole-slide image transformer trained on roughly 1.5M slides; pan-cancer features | |
| AMIE | Clinical language and patient | medical-text | Google Research diagnostic-dialogue system for medical reasoning and conversations. | |
| CLMBR | Clinical language and patient | EHR-codes | Clinical Language Model Based Representations from Stanford's Shah lab | |
| GatorTron | Clinical language and patient | EHR-text | Clinical LM backbone for downstream NLP at UF Health | |
| Med-PaLM | Clinical language and patient | medical-text | Google's medical large language model for question answering and clinical reasoning | |
| MedLM | Clinical language and patient | medical-text | Google's productized medical language model offering, sitting closer to medical question-answering than EHR-only pretraining. | |
| NYUTron | Clinical language and patient | EHR-text | Clinical language model trained on NYU Langone clinical notes | |
| Truveta Language Model | Clinical language and patient | EHR-text | Trained on the largest linked EHR corpus in the United States; clinical reasoning over longitudinal records | |
| Endo-FM | Surgical video | video | Endoscopy foundation model with video pre-training for downstream endoscopic tasks | |
| Emerging | ||||
| BioT5+ | Emerging | molecules + protein + text | Generalized biological understanding with IUPAC integration and multi-task tuning. | |
| LaBraM | Emerging | EEG | Large-scale pretraining for EEG signals; representative early entry in the brain-waveform foundation-model space. | |
| METAGENE-1 | Emerging | DNA-metagenomic | Foundation model for metagenomic sequence, oriented toward microbiome and pathogen-surveillance applications. | |