Virtual Cell

The next platform shift in drug discovery is a cell-state data layer.

AI has made proteins, molecules and genomes more computable. The next bottleneck is the training substrate: same-cell, multimodal perturbation data that lets models learn how living cells change when disease, genetics, drugs and time interact.

Opening thesis

Drug R&D is becoming a cell-state data business.

For decades, the industry optimized around targets: identify a protein, design a molecule, test whether the interaction is strong enough, then learn late whether the biology actually moves. That approach helped scale discovery, but it did not create the data needed to predict cell response upfront.

The strategic question is shifting from “Can this molecule bind?” to “What will this intervention do to a living cell in the disease context that matters?” Virtual Cell can only answer that if models are trained on aligned, perturbation-grade cellular records rather than disconnected assay snapshots.

In this transition, the durable moat is not a single model demo. It is the data layer: same-cell, multimodal records that connect genome, chromatin, RNA, protein, perturbation and phenotype with enough quality to guide real experimental decisions.

Target data is not enough

A drug response depends on cell type, disease state, genetic background, dose, timing and microenvironment. Target-level data cannot explain that cellular context.

Prediction needs matched perturbation records

Models need paired observations of baseline state, intervention and downstream cellular response from the same biological system.

The data layer becomes the moat

Once Virtual Cell models enter R&D workflows, proprietary single-cell multimodal perturbation data becomes the operating asset behind model quality.

Technology development

Biology foundation models have made DNA, RNA and protein computable.

The technical path toward Virtual Cell is no longer speculative. In the last few years, foundation models have learned useful representations of major biological modalities: protein structure, single-cell gene expression and genome-scale sequence. The next step is to connect those representations inside real cellular contexts.

Protein modality

AlphaFold made protein structure predictable.

Protein sequences can now be mapped into high-accuracy structural representations, turning a core layer of biology into a modelable substrate for binding, function and design.

RNA / transcriptome modality

Geneformer learned from single-cell expression.

Gene expression can be represented as a cell-state language. Models trained on large single-cell corpora can support network biology, perturbation reasoning and cell-context prediction.

DNA / genome modality

Evo 2 scales genome modeling.

Genome foundation models extend representation learning to long DNA context and cross-organism sequence patterns, bringing regulatory and variant-level biology into the computable stack.

The missing layer is the cell as an integrated system.

Protein, RNA and DNA models each capture an important biological view. But drug response happens inside cells, where genome, expression, protein state, disease context and perturbation interact.

This is why the data platform must be single-cell and multimodal.

Our core technology is the data layer that connects cellular context across modalities: DNA / chromatin, RNA expression, protein state, perturbation, time and disease background from the same biological system.

Technology references: AlphaFold protein structure prediction, Geneformer in network biology, Evo 2 genome foundation model.

Market signals

The ecosystem is already reorganizing around this stack.

The recent signals are not isolated headlines. Capital, pharma infrastructure and regulatory policy are moving in the same direction: drug discovery is absorbing AI infrastructure, and AI infrastructure is moving closer to human biology. The company that owns decision-useful cell-state data can sit at the center of that shift.

AI-native biotech

Xaira launches as a full-stack AI drug discovery company.

Xaira's $1B+ launch shows that investors are backing integrated platforms that combine machine learning, data generation and therapeutic development rather than narrow software tools.

Big Pharma infrastructure

Eli Lilly and NVIDIA connect compute, models and automated labs.

The Eli Lilly x NVIDIA collaboration, reported at up to $1B over five years, signals that major pharmaceutical companies are treating AI biology as core R&D infrastructure.

Regulatory shift

FDA is opening the door to human-relevant alternatives to animal testing.

In April 2025, FDA announced a roadmap to reduce, refine or replace certain animal studies in IND applications with data from new approach methodologies (NAMs), including AI toxicity models, organoids and organ-on-chip systems.

Evidence standards

The bottleneck becomes credible, human-cell data.

FDA's shift and the CZI Virtual Cells workshop point to the same requirement: standardized, benchmarkable, human-relevant data that can make predictive biology trusted enough for R&D decisions.

Source references: Axios on Xaira, NVIDIA News on Eli Lilly x NVIDIA, FDA NAMs roadmap, CZI Virtual Cells workshop.

Our position

Building the data layer for Virtual Cell.

The next bottleneck in Virtual Cell is no longer model access. It is data. The field has started to build data infrastructure, but the leading attempts still concentrate on RNA perturbation. That is a useful signal, not the full substrate required for a real Virtual Cell.

Current attempts
Peers are already proving that cell-state data is becoming infrastructure.

Attempt from Tahoe Therapeutics · 100M-cell dataset

Tahoe-100M pushed RNA-seq drug perturbation scale.

Tahoe published Tahoe-100M and raised roughly $30M, showing that RNA-seq perturbation data is already viewed as strategic infrastructure.

Attempt from Xaira Therapeutics · $1B+ investment

Xaira is scaling RNA-seq gene perturbation.

Xaira's $1B+ launch shows capital is backing foundation-model biology and large perturbation datasets, but the core data view is still largely RNA-centric.

But not enough

RNA-only infrastructure cannot carry Virtual Cell forward.

Current models can learn expression correlations, but without same-cell DNA and protein context they cannot explain the cellular mechanism behind perturbation response.

Virtual Cell is a four-layer stack. The missing layer is data.

Layer architecture

Application · Deployment layer

Turn cell-state prediction into R&D decisions.

Target discovery, drug response, toxicity and patient stratification require models that can predict what an intervention does inside disease-relevant human cells.

Tahoe Therapeutics, Recursion, Cellarity, Noetik, Turbine, Eli Lilly TuneLab

Models · Virtual cell layer

Learn cell states, perturbations and biological representations.

Virtual cell, perturbation and foundation models translate biological data into predictions. Their ceiling is now set less by architecture and more by whether the data captures the cell as a system.

Xaira X-Cell, Noetik OCTO-vc, Recursion / Valence TxPert, NVIDIA BioNeMo, Novo Nordisk + NVIDIA, Genentech lab-in-the-loop, Sanofi AI Toolkit

Data Substrate · Missing layer

Same-cell multimodal dataset.

The missing data layer must be unified, standardized, low-batch-effect and AI-trainable: disease-specific cells, matched modalities, controlled perturbations, dose, time and measurable phenotypic response from the same biological system.

We are here

Infrastructure · Compute + healthcare systems

Provide the physical rails for AI biology.

GPU compute, automated labs, clinical systems and patient-data access create the operating environment. They are necessary, but they do not create the missing training substrate.

NVIDIA AI Factory, Cloud providers, Automated labs, Hospital systems, Pharma data platforms

RNA-only data can predict correlation. It cannot reveal the cell.

RNA-seq perturbation data is useful, but it is still one projection of biology. Without same-cell DNA and protein context, the model sees expression change but not the full mechanism that produced it.

That is the difference between a correlation engine and a Virtual Cell.

Public biology data was not generated for AI alignment.

For decades, experiments were designed for papers, targets and individual assays, not for training generalizable biological models. Public datasets are valuable, but the easy signal has already been mined by today's foundation models.

The next performance step requires purpose-built data: large-scale, high-throughput, biologically meaningful and standardized enough for models to learn mechanism.

The inevitable trend

Multimodal training data is becoming the battleground for Virtual Cell.

Field consensus

CZI is turning Virtual Cell into a benchmarked research agenda.

The CZI Virtual Cells workshop centers the field on data curation, tooling and cross-domain benchmarks, because credible Virtual Cell systems need measurable reliability across modalities.

Capital proof

Biohub committed $500M to cellular AI and data generation.

Biohub's Virtual Biology Initiative includes a major data-collection effort and new measurement tools, reflecting a simple reality: predictive cell models need far more biological ground truth than public datasets provide.

Data gap

Everyone wants multimodal models, but almost nobody owns the substrate.

New cellular AI programs are pairing imaging with genomic and other biological measurements, but the scarce asset remains aligned, high-throughput, perturbation-grade same-cell data for training.

Value proposition

We build the production data layer for Virtual Cell: aligned, high-throughput, biologically interpretable same-cell multimodal perturbation data that combines industrial scale with ground-truth biology across DNA, chromatin, RNA and protein.

Solution intro

We capture the flow of biology, not a cell snapshot.

Others approximate multi-omics ground truth through post-hoc alignment. We generate it directly from the same perturbed cell, turning each experiment into a central-dogma record for AI training.

Current Virtual Cell Data

scRNA-seq atlas data: expression matrices

scRNA-seq perturbation data: drug / CRISPR

RNA + protein data: CITE-seq / ADT

RNA + ATAC data: 10x Multiome

ATAC + protein data: ASAP-seq

RNA + ATAC + protein data: TEA-seq / DOGMA-seq

Multimodal perturbation data: Virtual Cell training

The most biologically meaningful data is also the least available.

Our data layer records perturbation inside the same cell.

Whole Genome Sequencing, ATAC-seq, scRNA-seq, Proteomics

Gene perturbation, Drug perturbation

One perturbed cell → One central-dogma record

Same-cell alignment, Multi-modality pipeline, High-throughput production

What the model receives

Matched molecular state and perturbation response from the same biological system.

No post-hoc stitching across batches or assays. The model learns how genome, regulation, expression and protein state move together after intervention.
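As a rough illustration of "one perturbed cell → one central-dogma record", the sketch below shows one way such a record could be represented and checked. It is not the platform's actual schema: the class names, the modality keys ("dna", "atac", "rna", "protein") and the feature dictionaries are all hypothetical, chosen only to mirror the layers named above.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical modality keys mirroring the layers named in the text.
REQUIRED_MODALITIES = ("dna", "atac", "rna", "protein")

# One measurement: (cell_id, modality, feature -> value). Assumed shape.
Measurement = Tuple[str, str, Dict[str, float]]


@dataclass(frozen=True)
class Perturbation:
    kind: str                       # "gene" or "drug"
    target: str                     # gene symbol or compound name
    dose_um: Optional[float] = None
    hours: Optional[float] = None


@dataclass
class CellRecord:
    cell_id: str
    perturbation: Perturbation
    modalities: Dict[str, Dict[str, float]] = field(default_factory=dict)

    def missing_modalities(self) -> List[str]:
        # A modality counts as present only if it has at least one feature.
        return [m for m in REQUIRED_MODALITIES if not self.modalities.get(m)]

    def is_complete(self) -> bool:
        return not self.missing_modalities()


def assemble_record(
    cell_id: str,
    perturbation: Perturbation,
    measurements: Iterable[Measurement],
) -> CellRecord:
    """Build one record, refusing measurements taken from any other cell."""
    record = CellRecord(cell_id=cell_id, perturbation=perturbation)
    for measured_id, modality, features in measurements:
        if measured_id != cell_id:
            # Same-cell rule: no stitching in readouts from a different cell.
            raise ValueError(
                f"{modality} measurement belongs to {measured_id}, not {cell_id}"
            )
        record.modalities[modality] = dict(features)
    return record
```

The design choice worth noting is that alignment is enforced by a shared cell identifier at assembly time, rather than inferred afterwards by matching similar cells across assays. An RNA-only record would pass through `assemble_record` but fail `is_complete()`, which is the distinction the text draws between expression data and a full same-cell record.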

Technology & Solution

Our route.

The market has strong single-cell platforms, but each route optimizes one constraint. Our route is an in-house, hydrogel-enabled method for high-throughput same-cell multiomics, built for AI-training data rather than general-purpose sequencing workflows.

Production-grade same-cell multimodal data is the design target.

The route has to combine scale, cost control, flexible cell-state handling and matched multimodal capture from the same perturbed biological system.

High throughput, Fixation optional, In-house workflow, Archive-ready samples

Our answer is a hydrogel-enabled, in-house single-cell multiomics method.

Hydrogel chemistry has been under-appreciated in this category. At a high level, it can preserve intracellular molecular information, support efficient reactions, enable sample archiving and avoid full dependence on Chromium-style droplet infrastructure.

Encapsulate the same cell in hydrogel.

Run compatible multiomic chemistry.

Generate archive-ready AI training data.

Cost and scale

Cheaper, high-throughput production.

Designed for throughput comparable to Parse Biosciences and Scale Biosciences, without requiring the same equipment burden as instrument-centered workflows.

Native state

Not fixation-dependent, but fixation-compatible.

The workflow is designed to preserve optionality. It can support native-state capture while remaining compatible with fixation when the experiment requires it.

Information quality

Better preservation and reaction efficiency.

Hydrogel encapsulation can help retain nucleic-acid information inside the cell-like compartment while improving reaction control and efficiency.

Data asset

Archivable biology for AI training.

The same physical design can support longer-term storage and sample archiving, turning experiments into reusable biological records.

Competitive landscape.

Scale and same-cell multimodality exist separately, not together. Commercial platforms have scale, workflow maturity or targeted multiomics. Research methods prove same-cell multimodal measurement is possible. The gap is an industrialized, perturbation-grade data layer built for Virtual Cell training.

Commercial single-cell scale platforms

They scale assays, but not the full same-cell training substrate.

10x Genomics: Mature droplet ecosystem with RNA, ATAC, multiome and feature-barcoding workflows. Strong platform, but standard Multiome is RNA + ATAC. Same-cell ATAC + RNA + protein is not a routine production substrate.

Parse Biosciences: Instrument-free combinatorial barcoding with Evercode WT Plex supporting up to 5M cells and 384 samples in one run. Very strong scale, but RNA-dominated. Not same-cell multimodal central-dogma perturbation data.

Scale Biosciences: Combinatorial indexing across scRNA-seq, methylation and protein profiling products. Broad portfolio, but not a same-cell ATAC + RNA + protein perturbation data factory.

Our route: Same-cell multimodal perturbation data production across ATAC-seq, scRNA-seq and protein markers. Built for high-throughput perturbation-scale production and scale-up validation.

Same-cell multimodal proof-of-concept methods

They prove biology, but remain research workflows.

TEA-seq: Measures scRNA-seq, epitopes or protein markers and scATAC-seq from the same cell. Method-paper scale. Thousands of cells, not production-grade perturbation data.

DOGMA-seq: Measures chromatin accessibility, gene expression and protein from the same cell. Research-grade protocol and platform-dependent processing, not a commercial Virtual Cell data factory.

Our route: Targets ATAC-seq, scRNA-seq and protein markers under gene or drug perturbation. Platform-independent, not 10x-dependent, designed for scalable same-cell central-dogma records.

Virtual Cell datasets are scaling, but ground-truth multi-omics is still missing.

Public Virtual Cell data infrastructure is expanding, but most resources remain transcriptomic, atlas-based or post-hoc harmonized.

Tahoe-100M: 100M cells from ~60,000 drug perturbation experiments, 50 cancer models and 1,100+ drug treatments. Single-cell drug perturbation atlas, primarily gene expression and RNA-seq. One of the strongest public Virtual Cell training datasets. Large-scale but RNA-heavy. Not same-cell ATAC + RNA + protein-marker records.

CZ CELLxGENE: 33M+ unique cells, 436 datasets and 2.7K+ cell types. Curated and standardized public single-cell datasets. Major public single-cell data infrastructure. Broad atlas, but not designed as perturbation-scale multimodal data generation.

scPerturb: 44 public perturbation-response datasets. Harmonized single-cell perturbation datasets with molecular readouts. Useful for benchmarking perturbation models. Heterogeneous and post-hoc harmonized, not a unified production data layer.

PerturBase: 122 scPerturbation datasets, 24,254 genetic and 230 chemical perturbations. Comprehensive database for single-cell perturbation data. Large perturbation-data index for model development. Aggregated from existing studies. Modality and protocol heterogeneity remain.

Our data layer: Designed for perturbation-scale production. Same-cell ATAC-seq, scRNA-seq and protein markers under gene or drug perturbation. Direct central-dogma training records for Virtual Cell. New platform, requires validation and scale-up.

The market has scale without full same-cell multimodality, and multimodality without industrial scale. Our opportunity is to build the production data layer where those two curves meet.

Business opportunities.

The same data layer can monetize through four customer groups. The common product is not a sequencing service alone. It is decision-useful, same-cell perturbation data that makes AI biology models and R&D programs more reliable.

01 · Sell to model companies

Training substrate for Virtual Cell and perturbation models.

Use case: pretraining and fine-tuning models that need matched DNA, chromatin, RNA, protein and perturbation response from the same biological system.

Why they buy: public data is mostly single-modality or post-hoc harmonized, so model quality becomes data-limited.

How it closes: dataset licensing, custom data generation, model-training partnerships and milestone-based data programs.

02 · Sell to pharma and biotech

Experiment engine for target, drug response and toxicity decisions.

Use case: run perturbation-scale screens in disease-relevant cells to compare targets, compounds, dose, time and patient context.

Why they buy: the data helps move decisions upstream before expensive wet-lab and translational programs.

How it closes: paid pilot studies, program-based data packages, recurring screening contracts and co-development deals.

03 · Partner with hospitals

Clinical biology data layer for disease-specific cohorts.

Use case: convert patient-derived samples into multimodal cellular response records for stratification, translational research and real-world disease biology.

Why they buy or partner: hospitals own patient context, but usually lack scalable AI-ready single-cell multiomics production.

How it closes: cohort-data partnerships, sponsored disease atlases, clinical research collaborations and shared downstream data assets.

04 · Sell benchmarking and validation

Neutral ground truth for model evaluation and external validation.

Use case: benchmark Virtual Cell, perturbation and toxicity models against standardized same-cell experimental readouts.

Why they buy: model claims need trusted biological validation before they influence R&D decisions.

How it closes: benchmark dataset access, validation reports, sponsored challenges and independent model-evaluation services.

Status: foundational lab validation is complete for DNA, RNA and protein capture inside hydrogel encapsulation. The next step is prototype engineering for a multimodal version that can produce same-cell central-dogma records at scale.

Platform references: 10x Chromium, Parse Biosciences, Scale Biosciences, Mission Bio Tapestri.