202409212322
Status: #lecture
Tags: #ai #drug_discovery
# Foundational AI Models for Biological Systems
Presented by Le Song
Company Website: https://www.biomap.com/
- Biology as a multiscale, multimodal information system
    - Many different types of signals between systems
    - Network effects of interacting/cascading/inhibiting signals
    - Multiple relevant scales (e.g. organism, cell, molecule)
- Biology is high-dimensional, but data is scarce (especially labeled data)
- Idea: mitigate the lack of labeled data and the small amount of data per modality by feeding all modalities into a single transformer trained with self-supervision.
- Modalities, in order from lowest scale to highest scale:
    - Genomes
    - RNA sequences
    - RNA structures
    - Protein sequences
    - Protein structures
    - Interactions
    - Cells
    - Molecular perturbations
    - Spatial data
- The unified model has not been built yet; the current focus is on separate modalities (so far a protein language model and a genomics language model)
- Built a 100B-parameter protein language model
    - Interestingly, instead of masked language modeling alone, the training objective combines masked language modeling with causal language modeling
    - This allows the same model to do both protein encoding and protein generation
    - Reinforcement learning is used to optimize protein generation for specific objectives (e.g. predicted function)
    - Found that proteins sharing specific gene ontology labels cluster in embedding space, even though the model was not trained on function information
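
To make the combined objective concrete, here is a minimal sketch of how a masked and a causal language modeling loss could be mixed for one protein sequence. This is not the speaker's implementation: the `predict(context, position)` model interface, the `#` mask symbol, the 15% mask rate, and the `mlm_weight` mixing factor are all assumptions made for illustration, and a real model would compute both losses from transformer logits rather than a Python callable.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
MASK = "#"  # hypothetical mask symbol (assumption, not a residue code)


def make_mlm_example(seq, mask_rate=0.15, rng=None):
    """Masked LM: hide random residues; targets are the hidden residues."""
    rng = rng or random.Random(0)
    n_mask = max(1, round(len(seq) * mask_rate))
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    corrupted = list(seq)
    for p in positions:
        corrupted[p] = MASK
    targets = {p: seq[p] for p in positions}
    return "".join(corrupted), targets


def make_clm_example(seq):
    """Causal LM: predict each residue from the prefix that precedes it."""
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]


def cross_entropy(probs, target):
    """Negative log-likelihood of the target residue."""
    return -math.log(probs[target])


def combined_loss(seq, predict, mlm_weight=0.5, rng=None):
    """Weighted sum of the mean MLM loss and the mean CLM loss.

    `predict(context, position)` is a hypothetical model interface that
    returns a dict mapping each residue to its predicted probability.
    """
    corrupted, targets = make_mlm_example(seq, rng=rng)
    mlm_terms = [cross_entropy(predict(corrupted, p), t)
                 for p, t in targets.items()]
    clm_terms = [cross_entropy(predict(prefix, len(prefix)), nxt)
                 for prefix, nxt in make_clm_example(seq)]
    mlm = sum(mlm_terms) / len(mlm_terms)
    clm = sum(clm_terms) / len(clm_terms)
    return mlm_weight * mlm + (1.0 - mlm_weight) * clm


def uniform_model(context, position):
    """Stand-in model that assigns equal probability to every residue."""
    return {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


if __name__ == "__main__":
    print(combined_loss("MKTAYIAKQR", uniform_model))
```

The masked branch sees context on both sides of each hidden residue, which suits encoding; the causal branch only sees the prefix, which is what makes left-to-right generation possible. With the uniform stand-in model, both terms reduce to the log of the alphabet size, which is a handy sanity check before plugging in a real model.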
---
# References