Target data is not enough
A drug response depends on cell type, disease state, genetic background, dose, timing and microenvironment. Target-level data cannot explain that cellular context.
AI has made proteins, molecules and genomes more computable. The next bottleneck is the training substrate: same-cell, multimodal perturbation data that lets models learn how living cells change when disease, genetics, drugs and time interact.
For decades, the industry optimized around targets: identify a protein, design a molecule, test whether the interaction is strong enough, then learn late whether the biology actually moves. That approach helped scale discovery, but it did not create the data needed to predict cell response upfront.
The strategic question is shifting from “Can this molecule bind?” to “What will this intervention do to a living cell in the disease context that matters?” Virtual Cell can only answer that if models are trained on aligned, perturbation-grade cellular records rather than disconnected assay snapshots.
In this transition, the durable moat is not a single model demo. It is the data layer: same-cell, multimodal records that connect genome, chromatin, RNA, protein, perturbation and phenotype with enough quality to guide real experimental decisions.
Models need paired observations of baseline state, intervention and downstream cellular response from the same biological system.
Once Virtual Cell models enter R&D workflows, proprietary single-cell multimodal perturbation data becomes the operating asset behind model quality.
The technical path toward Virtual Cell is no longer speculative. In the last few years, foundation models have learned useful representations of major biological modalities: protein structure, single-cell gene expression and genome-scale sequence. The next step is to connect those representations inside real cellular contexts.
Protein sequences can now be mapped into high-accuracy structural representations, turning a core layer of biology into a modelable substrate for binding, function and design.
Gene expression can be represented as a cell-state language. Models trained on large single-cell corpora can support network biology, perturbation reasoning and cell-context prediction.
Genome foundation models extend representation learning to long DNA context and cross-organism sequence patterns, bringing regulatory and variant-level biology into the computable stack.
Protein, RNA and DNA models each capture an important biological view. But drug response happens inside cells, where genome, expression, protein state, disease context and perturbation interact.
Our core technology is the data layer that connects cellular context across modalities: DNA / chromatin, RNA expression, protein state, perturbation, time and disease background from the same biological system.
Technology references: AlphaFold protein structure prediction, Geneformer in network biology, Evo 2 genome foundation model.
The recent signals are not isolated headlines. Capital, pharma infrastructure and regulatory policy are moving in the same direction: drug discovery is absorbing AI infrastructure, and AI infrastructure is moving closer to human biology. The company that owns decision-useful cell-state data can sit at the center of that shift.
Xaira's $1B+ launch shows that investors are backing integrated platforms that combine machine learning, data generation and therapeutic development rather than narrow software tools.
The Eli Lilly x NVIDIA collaboration, reported at up to $1B over five years, signals that major pharmaceutical companies are treating AI biology as core R&D infrastructure.
In April 2025, the FDA announced a roadmap to reduce, refine or replace certain animal studies with data from new approach methodologies (NAMs), including AI toxicity models, organoids and organ-on-chip systems, in IND applications.
FDA's shift and the CZI Virtual Cells workshop point to the same requirement: standardized, benchmarkable, human-relevant data that can make predictive biology trusted enough for R&D decisions.
Source references: Axios on Xaira, NVIDIA News on Eli Lilly x NVIDIA, FDA NAMs roadmap, CZI Virtual Cells workshop.
The next bottleneck in Virtual Cell is no longer model access. It is data. The field has started to build data infrastructure, but the leading attempts still concentrate on RNA perturbation. That is a useful signal, not the full substrate required for a real Virtual Cell.
Tahoe published Tahoe-100M and raised roughly $30M, showing that RNA-seq perturbation data is already viewed as strategic infrastructure.
Xaira's $1B+ launch shows capital is backing foundation-model biology and large perturbation datasets, but the core data view is still largely RNA-centric.
Current models can learn expression correlations, but without same-cell DNA and protein context they cannot explain the cellular mechanism behind perturbation response.
Target discovery, drug response, toxicity and patient stratification require models that can predict what an intervention does inside disease-relevant human cells.
Virtual cell, perturbation and foundation models translate biological data into predictions. Their ceiling is now set less by architecture and more by whether the data captures the cell as a system.
The missing data layer must be unified, standardized, low-batch-effect and AI-trainable: disease-specific cells, matched modalities, controlled perturbations, dose, time and measurable phenotypic response from the same biological system.
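One way to picture such a record is the minimal Python sketch below. It is an illustration only, not the platform's actual schema: the class name `CellRecord`, the field names, the modality keys and the readiness rule are all assumptions introduced here. It shows one cell carrying its disease context, perturbation, dose, time, matched modalities and phenotype, plus a check for whether any required modality is missing.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Assumed modality keys; the text names DNA, chromatin, RNA and protein.
REQUIRED_MODALITIES = ("dna", "chromatin", "rna", "protein")


@dataclass
class CellRecord:
    """Hypothetical same-cell record: one cell, one perturbation, all modalities."""
    cell_id: str
    disease_context: str
    perturbation: str
    dose_um: float          # dose in micromolar (assumed unit)
    time_h: float           # hours after perturbation (assumed unit)
    modalities: Dict[str, List[float]] = field(default_factory=dict)
    phenotype: Optional[float] = None


def missing_modalities(record: CellRecord) -> List[str]:
    """Return required modalities that are absent or empty for this cell."""
    return [m for m in REQUIRED_MODALITIES if not record.modalities.get(m)]


def is_training_ready(record: CellRecord) -> bool:
    """A record qualifies only if every modality and a phenotype are present."""
    return not missing_modalities(record) and record.phenotype is not None


complete = CellRecord(
    cell_id="cell_001", disease_context="example_disease",
    perturbation="compound_A", dose_um=1.0, time_h=24.0,
    modalities={"dna": [0.0, 1.0], "chromatin": [0.2],
                "rna": [3.1, 0.4], "protein": [1.7]},
    phenotype=0.8,
)
rna_only = CellRecord(
    cell_id="cell_002", disease_context="example_disease",
    perturbation="compound_A", dose_um=1.0, time_h=24.0,
    modalities={"rna": [2.9, 0.5]}, phenotype=0.6,
)
```

Under these assumptions, `is_training_ready(complete)` holds, while `missing_modalities(rna_only)` lists the DNA, chromatin and protein views that an RNA-only dataset cannot supply.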
GPU compute, automated labs, clinical systems and patient-data access create the operating environment. They are necessary, but they do not create the missing training substrate.
RNA-seq perturbation data is useful, but it is still one projection of biology. Without same-cell DNA and protein context, the model sees expression change but not the full mechanism that produced it.
That is the difference between a correlation engine and a Virtual Cell.
For decades, experiments were designed for papers, targets and individual assays, not for training generalizable biological models. Public datasets are valuable, but the easy signal has already been mined by today's foundation models.
The next performance step requires purpose-built data: large-scale, high-throughput, biologically meaningful and standardized enough for models to learn mechanism.
The CZI Virtual Cells workshop centers the field on data curation, tooling and cross-domain benchmarks, because credible Virtual Cell systems need measurable reliability across modalities.
Biohub's Virtual Biology Initiative includes a major data-collection effort and new measurement tools, reflecting a simple reality: predictive cell models need far more biological ground truth than public datasets provide.
New cellular AI programs are pairing imaging with genomic and other biological measurements, but the scarce asset remains aligned, high-throughput, perturbation-grade same-cell data for training.
Value proposition
We build the production data layer for Virtual Cell: aligned, high-throughput, biologically interpretable same-cell multimodal perturbation data that combines industrial scale with ground-truth biology across DNA, chromatin, RNA and protein.
Others approximate multi-omics ground truth through post-hoc alignment. We generate it directly from the same perturbed cell, turning each experiment into a central-dogma record for AI training.
The most biologically meaningful data is also the least available.
No post-hoc stitching across batches or assays. The model learns how genome, regulation, expression and protein state move together after intervention.
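The distinction can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the production pipeline: the function name `same_cell_join`, the barcode strings and the table layout are hypothetical. When every modality is measured in the same cell, the modalities can be joined exactly on a shared cell barcode, with no statistical matching step between separate batches or assays.

```python
from typing import Dict, List

# Assumed layout: modality name -> {cell barcode -> measurement vector}
ModalityTables = Dict[str, Dict[str, List[float]]]


def same_cell_join(tables: ModalityTables) -> Dict[str, Dict[str, List[float]]]:
    """Keep only barcodes observed in every modality and merge their measurements."""
    if not tables:
        return {}
    shared = set.intersection(*(set(t) for t in tables.values()))
    return {
        barcode: {name: table[barcode] for name, table in tables.items()}
        for barcode in sorted(shared)
    }


tables = {
    "rna": {"c1": [3.1, 0.4], "c2": [2.2, 0.9]},
    "protein": {"c1": [1.7], "c3": [0.5]},
}
joined = same_cell_join(tables)
```

Here only `c1` survives the join, because it is the only barcode measured in both modalities. Cells seen in just one assay (`c2`, `c3`) would need post-hoc alignment to be paired with anything, which is the approximation the same-cell approach avoids.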
The market has strong single-cell platforms, but each route optimizes one constraint. Our route is an in-house, hydrogel-enabled method for high-throughput same-cell multiomics, built for AI-training data rather than general-purpose sequencing workflows.
The route has to combine scale, cost control, flexible cell-state handling and matched multimodal capture from the same perturbed biological system.
Hydrogel chemistry has been under-appreciated in this category. At a high level, it can preserve intracellular molecular information, support efficient reactions, enable sample archiving and avoid full dependence on Chromium-style droplet infrastructure.
The workflow is designed for throughput comparable to Parse Biosciences and Scale Biosciences, without the equipment burden of instrument-centered workflows.
The workflow is designed to preserve optionality. It can support native-state capture while remaining compatible with fixation when the experiment requires it.
Hydrogel encapsulation can help retain nucleic-acid information inside the cell-like compartment while improving reaction control and efficiency.
The same physical design can support longer-term storage and sample archiving, turning experiments into reusable biological records.
Scale and same-cell multimodality exist separately, not together. Commercial platforms have scale, workflow maturity or targeted multiomics. Research methods prove same-cell multimodal measurement is possible. The gap is an industrialized, perturbation-grade data layer built for Virtual Cell training.
Public Virtual Cell data infrastructure is expanding, but most resources remain transcriptomic, atlas-based or post-hoc harmonized.
The same data layer can be monetized across four customer groups. The common product is not a sequencing service alone. It is decision-useful, same-cell perturbation data that makes AI biology models and R&D programs more reliable.
Use case: pretraining and fine-tuning models that need matched DNA, chromatin, RNA, protein and perturbation response from the same biological system.
Why they buy: public data is mostly single-modality or post-hoc harmonized, so model quality becomes data-limited.
Use case: run perturbation-scale screens in disease-relevant cells to compare targets, compounds, dose, time and patient context.
Why they buy: the data helps move decisions upstream before expensive wet-lab and translational programs.
Use case: convert patient-derived samples into multimodal cellular response records for stratification, translational research and real-world disease biology.
Why they buy or partner: hospitals own patient context, but usually lack scalable AI-ready single-cell multiomics production.
Use case: benchmark Virtual Cell, perturbation and toxicity models against standardized same-cell experimental readouts.
Why they buy: model claims need trusted biological validation before they influence R&D decisions.
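For the benchmarking use case, a minimal sketch of one possible scoring step is shown below. It assumes, purely for illustration, that a model's predicted perturbation response is compared with the measured response by Pearson correlation. The function names and toy values are hypothetical, and a real benchmark would likely use several metrics across modalities.

```python
from math import sqrt
from typing import List


def response_delta(baseline: List[float], perturbed: List[float]) -> List[float]:
    """Measured response: per-feature change from baseline to perturbed state."""
    return [p - b for b, p in zip(baseline, perturbed)]


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation between two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(var_x * var_y)


# Toy values: measured readout from the same cell before and after perturbation.
measured = response_delta(baseline=[1.0, 1.0, 1.0], perturbed=[2.0, 3.0, 4.0])
predicted = [0.5, 1.0, 1.5]  # hypothetical model output for the same features
score = pearson(predicted, measured)
```

In this toy case the prediction has the right direction and relative size for every feature, so the score is at the top of the range. A score near zero would indicate that the model's claimed response does not track the experimental readout.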
Status: foundational lab validation is complete for DNA, RNA and protein capture inside hydrogel encapsulation. The next step is prototype engineering for a multimodal version that can produce same-cell central-dogma records at scale.
Platform references: 10x Chromium, Parse Biosciences, Scale Biosciences, Mission Bio Tapestri.