A Million CPU Hours Later, AI Cracks Open the T-Cell Code

MJ Horowitz

03 Jul 2025

Sophia Panagiotopoulou makes it look so straightforward.

Engineer T cells, sequence RNA, crunch the numbers, and let AI spit out targets. Boom, done.

But behind all of this is an intense, computationally ferocious machine, one that required 968,100 CPU hours of data-driven labor (close enough to a million to warrant the title).

At its heart lies gx1, a 500-million-parameter transformer-based model built explicitly to decode T-cell biology, trained on expression data from 75 million cells, translating into billions of tokens.

ArsenalBio’s gamble, using massive compute to tackle drug discovery’s notorious complexity, is easy to grasp but brutally difficult to execute.

Biology is combinatorially explosive: with each cell harboring around 20,000 genes, modifying even two genes simultaneously yields roughly 200 million possible combinations, a number that quickly moves beyond feasible wet-lab experiments.
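The arithmetic behind that figure is easy to check. The sketch below assumes the count refers to unordered pairs of distinct genes drawn from the article's round number of 20,000; the function name is illustrative and not anything from ArsenalBio.

```python
import math

GENES_PER_CELL = 20_000  # approximate figure quoted in the article


def edit_combinations(n_genes: int, genes_edited: int) -> int:
    """Count unordered sets of `genes_edited` distinct genes out of `n_genes`."""
    return math.comb(n_genes, genes_edited)


pairs = edit_combinations(GENES_PER_CELL, 2)
triples = edit_combinations(GENES_PER_CELL, 3)

print(f"two-gene combinations:   {pairs:,}")    # 199,990,000, i.e. about 200 million
print(f"three-gene combinations: {triples:,}")  # more than a trillion
```

Adding a third gene pushes the count past a trillion, which is why exhaustive wet-lab screening stops being an option so quickly.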

“We took gx1, fine-tuned it on a modest 500 CRISPR-edited cells, then computationally screened over 182,000 potential modifications,” she said at the Fully Connected event. In practice, this means ArsenalBio evaluated three hundred times more possibilities computationally than would be physically achievable in the lab. Panagiotopoulou, who is ArsenalBio's Director of Computational Biology and ML, is candid: without gx1’s predictions, traditional screening at this scale would cost millions of dollars and consume months, if not years, of valuable research time.

The secret behind gx1’s compute efficiency is its two-stage training. First, gx1 learns gene-expression patterns through unsupervised pre-training, inspired by large language models. Panagiotopoulou describes it plainly: the model receives gene-expression vectors from T cells, with random portions masked out.

Like an AI filling in the blanks of sentences, gx1 learns to reconstruct those missing gene-expression values, capturing intricate biological relationships along the way.
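To make the fill-in-the-blanks idea concrete, here is a minimal sketch of a masked-reconstruction objective on a toy expression vector. Everything in it is an assumption for illustration: the masking fraction, the mean-squared-error loss, the made-up expression values, and the helper names `mask_expression` and `masked_mse`. The article does not describe gx1's actual masking scheme or loss.

```python
import random

MASK = None  # placeholder standing in for a hidden expression value


def mask_expression(values, mask_fraction=0.15, rng=None):
    """Hide a random subset of entries; return the masked copy and hidden indices."""
    rng = rng or random.Random()
    n_masked = max(1, round(len(values) * mask_fraction))
    masked_idx = sorted(rng.sample(range(len(values)), n_masked))
    hidden = set(masked_idx)
    masked = [MASK if i in hidden else v for i, v in enumerate(values)]
    return masked, masked_idx


def masked_mse(true_values, predicted_values, masked_idx):
    """Mean squared error computed only over the hidden positions."""
    errors = [(true_values[i] - predicted_values[i]) ** 2 for i in masked_idx]
    return sum(errors) / len(errors)


# Toy expression vector for one cell (made-up numbers, ten hypothetical genes).
expression = [0.0, 2.3, 1.1, 0.0, 4.7, 0.4, 3.2, 0.9, 0.0, 1.8]
masked, idx = mask_expression(expression, mask_fraction=0.3, rng=random.Random(0))

# Stand-in "model": guess the mean of the visible values for every hidden gene.
visible = [v for v in masked if v is not MASK]
baseline = sum(visible) / len(visible)
prediction = [baseline] * len(expression)

loss = masked_mse(expression, prediction, idx)
print("hidden positions:", idx)
print("baseline reconstruction loss:", round(loss, 3))
```

A trained model would replace the mean-value guess, and pre-training would adjust its parameters to push this loss down across many cells.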

But ArsenalBio pushes further. They add supervised fine-tuning, embedding cause-effect relationships into gx1 itself.

Using CRISPR, they introduce precise gene modifications into T cells, meticulously mapping how these edits ripple through cellular biology. gx1, in turn, learns to predict exactly how any given CRISPR edit will reshape the cell’s gene-expression landscape.

It sounds ambitious, even audacious, but Panagiotopoulou backs the claims with crisp validation data.

Predictions correlate impressively with experimental results, approaching the reproducibility of repeat lab experiments. “Computational predictions aren’t guesswork,” Panagiotopoulou emphasizes. “They reliably match real-world biology.”
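One way to read that comparison is as two correlations side by side: model versus lab, and lab versus a repeat of the same experiment. The sketch below assumes Pearson correlation as the metric, which the article does not specify, and uses made-up numbers purely to show the mechanics. None of these values are ArsenalBio data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


# Illustrative per-gene expression changes after one edit (made-up values).
lab_replicate_a = [1.2, -0.4, 0.8, 2.1, -1.0, 0.3]
lab_replicate_b = [1.0, -0.5, 0.9, 1.8, -0.8, 0.1]
model_prediction = [1.1, -0.2, 0.6, 1.9, -0.9, 0.4]

replicate_ceiling = pearson(lab_replicate_a, lab_replicate_b)
model_vs_lab = pearson(model_prediction, lab_replicate_a)

print(f"replicate vs replicate: {replicate_ceiling:.3f}")
print(f"model vs lab:           {model_vs_lab:.3f}")
```

The replicate-to-replicate value acts as a practical ceiling, since a prediction cannot be expected to agree with an experiment better than the experiment agrees with itself.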

To underline this, she highlights a discovery: computational screening identified gene edits dramatically superior to anything their human-driven hypotheses, informed by literature alone, could have uncovered.

Yet gx1 isn’t confined to identifying killer T cells alone. The team also leveraged the model’s predictive power to sift subtle signals from clinical datasets, pinpointing precise gene-expression signatures in inflammatory bowel disease (IBD). gx1 outperformed existing diagnostic markers by identifying genes with far superior predictive power and targets completely invisible to conventional approaches.

ArsenalBio is already employing gx1 in-house, continually refining its power with new data, bigger screens, and even more compute.

Drug discovery, historically slow, expensive, and risky, has entered a new era, one defined by precision-guided computational biology, where a million CPU hours aren’t just brute force but a strategic necessity.
