SandboxAQ Lives in the Future That Trains on Synthetic Molecules

In the endlessly recursive world of drug development, where billions are burned chasing molecules that never make it to market, the idea of conjuring promising compounds from nothing has always been more hope than method.
Until this week.
SandboxAQ, a company born out of Alphabet and now backed by Nvidia and Google, released something that could change the terms of engagement for drug discovery.
It’s a new public dataset: 5.2 million synthetic 3D molecular structures, built not from lab experiments but by machine learning trained on high-fidelity experimental data.
They stress that these aren’t hallucinations. They’re chemically plausible, structurally rich, and, critically, tagged with predicted binding affinities to target proteins, ready to be ingested by the hungry models of computational chemists.
This is not just more data; this is new data. And that distinction is the hinge on which the next decade of biopharma might swing.
The Alphabet that Became AQ
To understand the play, it helps to understand the player.
SandboxAQ emerged in early 2022 as an independent entity, spun out from Google’s X division. While its public face early on was quantum security (cryptography to help companies withstand quantum attacks), it was always more than that. The “AQ” stands for “AI + Quantum,” a tidy if ambitious promise that it would bridge two of the most complex technologies in modern computing.
It has since evolved into something closer to a full-stack applied science company, with one foot in AI, another in quantum physics, and eyes fixed firmly on sectors like pharma, materials science, and energy.
The leadership includes Jack Hidary, a former Google X executive and entrepreneur whose résumé reads like a CV optimized for frontier tech. The company quickly attracted talent from deep technical domains, including quantum physicists, AI researchers, molecular biologists, and cryptographers, along with nearly $1 billion in funding.
Back in April, Reuters confirmed that Google, Nvidia, and T. Rowe Price were among its major backers, with a post-money valuation of over $5 billion.
Their model isn’t to build AI as a general-purpose assistant. It’s to construct Large Quantitative Models (LQMs), a term SandboxAQ coined to describe AI systems trained on physics-grounded data rather than just language. Think GPT for matter, not words. LQMs ingest curated datasets from structural biology, chemistry, and physics, and are engineered to predict how molecules move, react, bind, or fall apart.
This latest release is the first public demonstration of that strategy at real scale.
Data Out of Thin Air, with Weight
The 5.2 million synthetic molecules SandboxAQ just released are generated by an AI model trained on verified experimental binding data, then cross-validated for structural plausibility.
Each molecule is annotated with what the company describes as “ground-truth”–level predicted affinities for binding to proteins. This means a researcher could, for example, train a downstream model on this dataset to predict whether a given compound might inhibit a kinase, modulate a GPCR, or interact with a viral protease, without ever touching a pipette.
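A downstream workflow of that kind can be sketched with off-the-shelf tools. Everything below is a stand-in, not SandboxAQ’s actual data format or modeling approach: the features are random placeholders where real molecular descriptors (fingerprints, 3D features) would go, and the labels mimic the dataset’s predicted affinities.

```python
# Hypothetical sketch: train a model to predict binding affinity from
# per-molecule features, as a researcher might do with this dataset.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder molecular descriptors and affinity labels; in practice these
# would be computed from the released 3D structures and their annotations.
X = rng.normal(size=(1000, 64))
y = X[:, :8].sum(axis=1) + rng.normal(scale=0.1, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Held-out R^2 tells us whether the features carry affinity signal.
r2 = model.score(X_test, y_test)
print(f"held-out R^2: {r2:.2f}")
```

The same pattern scales up: swap the placeholder arrays for real descriptors and the predicted affinities shipped with each molecule, and the held-out score becomes a cheap sanity check before committing compute to larger architectures.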
The molecules exist in 3D space and follow proper stereochemistry. They include the quirks and constraints of actual molecules (chirality, bond angles, energetics, etc.), not just idealized shapes. They are ready for docking simulations, structure-based screening, or as training fodder for generative models.
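One of those constraints, bond geometry, is easy to illustrate: given 3D coordinates, the angle at any atom falls out of a dot product, which is the kind of plausibility check a screening pipeline runs before docking. The fragment below is a toy sp3 carbon, not drawn from the dataset:

```python
# Illustrative geometric check on 3D coordinates: recover a bond angle.
import numpy as np

def bond_angle(a, b, c):
    """Angle a-b-c in degrees, computed from 3D coordinates."""
    v1 = np.asarray(a) - np.asarray(b)
    v2 = np.asarray(c) - np.asarray(b)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Idealized tetrahedral fragment: two substituents around a central carbon.
center = (0.0, 0.0, 0.0)
h1 = (1.0, 1.0, 1.0)
h2 = (1.0, -1.0, -1.0)

angle = bond_angle(h1, center, h2)
print(f"{angle:.2f} degrees")  # ~109.47, the ideal tetrahedral angle
```

A generated structure whose angles drift far from chemically sensible values would fail checks like this, which is what separates “structurally rich” synthetic molecules from mere point clouds.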
Oh. And they’re free.
There’s a wee bit of a catch. The dataset is open to the public. But the models trained on it, those LQMs, are not. Those are SandboxAQ’s proprietary advantage. They’ll offer them as APIs or software-as-a-service to pharma clients who want faster hit-to-lead workflows.
The free data drives model performance. The models drive revenue. It’s a clever wedge, no?
From Promise to Performance
In partnership with UCSF’s Prusiner Lab, SandboxAQ has already applied its LQMs in a real-world drug discovery campaign, identifying binders to neurodegenerative disease targets.
According to the company, that effort led to a 30x higher hit rate than standard methods and was able to screen 5.5 million compounds in under a month.
What underpins it, of course, is infrastructure. Nvidia GPUs power the entire workflow, from the synthetic molecule generation to the structure-based screening to the physics-informed optimization loops. This aligns with Nvidia’s own push into bioAI, particularly through its BioNeMo platform, which hosts drug discovery models and molecular simulations.
SandboxAQ is not just a customer; it’s a validator of the ecosystem Nvidia is trying to build.
There’s a deeper implication too. As AI becomes increasingly verticalized, the general-purpose foundation models that captivated enterprise headlines last year may give way to domain-specific LQMs trained on tightly held or synthetically created data.
The future might not belong to the model with the biggest dataset, but to the one with the most relevant data and the best simulations behind it.
The Synthetic Road Ahead
Drug discovery has always been an exercise in probabilities and attrition.
Out of 10,000 early-stage compounds, maybe one becomes a drug. And that one takes ten years and costs upward of $2 billion. What SandboxAQ is suggesting is that AI-native, synthetic-first pipelines could collapse both the time and cost curves.
The implications go beyond pharma. If it works here, it could work in materials, in quantum chemistry, in energy catalysis. Basically in any field where we simulate nature to speed up the time between hypothesis and result.
And yet, the credibility here hinges on the quality of the synthetic data.
It’s not enough to generate molecules. They must matter. They must bind, fold, degrade, and persist in ways that mirror real biology. Early signs are promising, but the proof, as always, will live in the wet lab and, if all goes well, eventually in the clinic.
Still, the release of a public, predictive, structurally rich dataset of this scale is a first. Not from a university consortium or public agency. From a private company that believes its moat lies not in the data itself, but in what it can teach a machine to do.
It’s a bet that synthetic data, if crafted carefully enough, isn’t fake at all. It’s just reality, slightly ahead of schedule.