Engineering Biological Alpha

Accelerating towards a world without disease

Pheiron — Tue, 28 Apr 2026 13:54:36 GMT

500 years

That is, roughly, how long humanity will have to wait, at the current pace of drug development, for every known disease to have a cure. It is the ratio of uncured diseases to net new drugs our industry approves each year [1]. That’s twenty generations of patients waiting.

We started Pheiron because this timeline isn’t inevitable. We think it’s time to accelerate.

Making drugs is hard; pioneering new biology is harder

Nine out of ten drug programs that reach human testing will fail. Six or seven of every ten drugs fail in Phase II. Everyone in the industry knows this.

Making drugs is hard because biology is hard. The human body is the most complex system we have ever tried to understand, and the honest answer to most of the questions is: we don’t know yet.

But biology risk is not evenly distributed. It concentrates on novelty.

Pioneering new biology, a new target, a new mechanism, a new disease hypothesis, multiplies biology risk and raises the odds of failure. The safe path is another me-too. The rational move in biopharma today is to stay close to what’s known.

But risk avoidance does not produce net new cures. Me-too’s do not help the patients who are still waiting.

Figure: At current pace, 20 generations of patients will have to wait for all diseases to be cured.

Every failed drug program had patients waiting for its cure

Investing in a novel biological hypothesis means betting on the validity of a long chain of experiments. Cell-based screening, animal experiments, safety / tox, even clinical trials, each is a probabilistic instrument trying to inform a single ground truth question: will this biology translate to humans?

Today, the industry has no systematic way of quantifying the probability that a drug program will succeed. We cannot quantitatively distinguish novelty likely to work from novelty likely to fail.

Our industry avoids novel biology because it lacks the capability to systematically underwrite biology risk.

Most conversations about fixing drug development focus on making experiments faster or cheaper. Better binders, higher-throughput screens, organoids.

But the economics of late-stage clinical trials are not going to change materially any time soon. A Phase III that enrolls thousands of patients, runs for years, and follows rigorous regulatory science will remain one of the most expensive things a human organization can attempt.

Capital for new drugs will remain finite. The lever that actually decides how much disease we cure, per dollar and per year, is not the cost of experiments. It is the selection of which programs to run and which biology to fund in the first place.

We as an industry need to stop funding failures and dramatically improve our ability to select winning biology.

Underwriting biology risk

Of course, every good scientist and every good portfolio manager is reaching for something like this, through their experience, their taste, their intuition. Nearly everyone in biopharma believes they have a better method for picking winners. But almost none of them benchmark those claims rationally and systematically against actual clinical outcomes.

Behind closed doors, in the portfolio strategy and IC meetings, decisions are taken based on “vibes and conviction”, trusting the experts, those “who know what’s good”. But just like algorithms replaced human economic intuition [2], algorithms will replace human biological intuition.

Pheiron is built on one central belief: biology risk is quantifiable.

Certainly not deterministically, but far more rigorously than the industry does today. The substrate to do this exists. Human health data is exploding globally. Hundreds of experiments are published daily on top of decades of prior programs. What’s missing is a system that turns this data into systematic probabilities and updates them in real time as new evidence emerges.

Once we can quantify biology risk, we can underwrite it. We can distinguish novelty that is genuinely risky from novelty that only looks risky, and act on the difference. We can arbitrage mispriced opportunities, we can identify exploits. We can move with conviction on biology that the field has dismissed as too uncertain to fund.

The ability to anticipate what biology translates successfully to humans equals biological alpha.

At Pheiron, we are building AI to systematically underwrite biology risk, decision by decision, across experiments and inflection points. In other words, a way to engineer biological alpha.

How we engineer biological alpha

When assessing the probability of program success, it’s critical to distinguish biological risk from the corporate risk of execution. The cleanest trial in the world fails on a mechanism that doesn’t translate. The strongest biology will fail through an underpowered or badly designed trial.

Established methods exist to quantify execution risk by extrapolating from known biology: parametrized simulations to assess the trial design, its statistical power, the execution plan. The biology risk gets folded in implicitly, through the priors.

Most agentic research relies on what’s already known: synthesizing literature, integrating existing knowledge from the public domain. This works for grading established biology, such as me-too programs and late-stage trials with long paper trails, but cannot inform novelty (see our blog post on why LLMs are weak alpha generators [3]).

At Pheiron, we take a different approach: we build proprietary models and run primary analyses on multimodal human data (genetics, -omics, clinical records, observational and interventional cohorts, spanning millions of lives) to generate novel and comprehensive human evidence. The question we answer, on every program: is this mechanism causal to the disease endpoint?

Our platform treats biology risk the way an underwriter treats any other risk: by quantifying and measuring meticulously against what matters: clinical outcomes.

We construct retrospective and prospective evals to hillclimb biological alpha. We backtest on historical drug approvals and forward-evaluate on prospective approvals; we quantify the alpha that a given method, data source, or model adds to our calls, and iterate.

In backtests, our proprietary models and primary analyses capturing biological risk enrich for successful programs at roughly 20x over the base rate [4].
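For readers unfamiliar with the term, here is a minimal sketch of one common way to compute enrichment over a base rate: the success rate among the top-scored programs divided by the success rate across all programs. This post does not specify how the backtest figure is computed, so the function name, the `top_fraction` parameter and the toy data below are illustrative assumptions, not a description of our pipeline.

```python
def enrichment_over_base_rate(scores, outcomes, top_fraction=0.1):
    """Success rate in the top-scored slice divided by the overall base rate.

    scores   : model scores, higher means more likely to succeed
    outcomes : 1 for success, 0 for failure
    Illustrative definition only; the top_fraction threshold is an assumption.
    """
    if len(scores) != len(outcomes) or not scores:
        raise ValueError("scores and outcomes must be non-empty and of equal length")
    base_rate = sum(outcomes) / len(outcomes)
    if base_rate == 0:
        raise ValueError("base rate is zero, enrichment is undefined")
    # Rank programs from highest to lowest score and keep the top slice.
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    top_rate = sum(outcome for _, outcome in ranked[:k]) / k
    return top_rate / base_rate


# Toy data (made up): 10 programs, 2 of which succeeded.
toy_scores = [0.9, 0.8, 0.3, 0.2, 0.1, 0.4, 0.5, 0.6, 0.7, 0.05]
toy_outcomes = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
toy_enrichment = enrichment_over_base_rate(toy_scores, toy_outcomes, top_fraction=0.2)
print(toy_enrichment)
```

In this toy case both successes sit in the top 20% of scores, so the top slice succeeds at 100% against a 20% base rate.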

Figure: Feedback loop for hillclimbing biological alpha.

We invite you to join us

So far, we’ve been building with partners who put the platform to work. Some have started programs on the back of what we surfaced, others have killed programs. But we’re just at the beginning.

Today, we’re entering the next phase. We are publishing our first set of public predictions on upcoming clinical trial readouts. Each carries two measures: Biological Support, our score for the underlying biology, and probabilities of trial success across the trial’s endpoints. Together, they let you see which lever is doing the work. You can read them at pheiron.com/predictions.

We are doing this for three reasons: to learn faster, to hold ourselves publicly accountable for the claims we make, and to invite the rest of the industry to do the same.

Accelerating towards a world without disease

If we are right, three things follow. They are the things we care about most.

First, pursuing novel biology becomes an economic necessity. When the industry can distinguish high-risk novelty from low-risk novelty and pursue the latter deliberately, the opportunity surface bends. That is how real progress happens.

Second, smaller opportunities become fundable. When risk can be priced, the risk-reward calculus no longer demands a blockbuster for a drug to be worth pursuing. Subtypes become viable. Rare diseases become viable. Personalized medicine is within reach.

Third, and this is the only measure that ultimately matters, we reach cures faster. It will still take time. But the 500-year timeline compresses. Every program set up to fail that goes unfunded is an opportunity to fund a program that might succeed. Every year we shave off that horizon is decades of suffering we did not let happen.

We believe that the missing piece is a disciplined, quantitative way to underwrite biology; this is the single most important lever towards a world without disease.

Pheiron is an applied AI lab dedicated to biological alpha. We use it to select and develop our own drug programs: identify mispriced opportunities, execute with conviction, and compound that advantage program after program to dramatically increase the drug-per-dollar output. This will allow us to break Eroom’s Law [5] and accelerate towards a world without disease.

If you’re a patient, or love someone who is, the wait you’re living through isn’t inevitable.

If you are a scientist, a drug hunter, a decision maker, a capital allocator, or someone who has spent long nights wondering if biology will translate: we would like to work with you.

Five hundred years is too long. Let’s get to work.

References

[1] There are ~10,000 known diseases. 40-60 new drugs are approved per year, but of those only 10-20 are genuinely new; the rest are me-too’s aimed at market share, not novelty.

[2] Algorithmic trading, the replacement of human economic intuition with computers, gave rise to Renaissance Technologies, which is widely regarded as the most successful fund and arguably one of the most valuable technology platforms of all time.

[3] Read our blog post on the ability of LLMs to predict unseen trial outcomes. Naive LLMs are weak alpha generators. https://blog.pheiron.com/p/llm-bench

[4] Retrospective analysis on phase transitions of target-indication rates. Technical Report in preparation. For comparison: human genetic evidence is known to increase a program’s probability of success by 2 to 4x (Minikel, 2024).

[5] Eroom’s Law is the observation that drug discovery is becoming slower and more expensive over time, with the inflation-adjusted cost of developing a new drug roughly doubling every nine years since the 1950s.

Can Frontier LLMs Predict Clinical Trial Outcomes?

Pheiron — Mon, 02 Mar 2026 08:37:44 GMT

tl;dr

Current Frontier LLMs are weak biological alpha generators.
LLM performance on RCT-outcome prediction is driven by memorization: performance decreases with recency.
The field needs leakage-resistant bio-evals that genuinely test predictive validity on novel biology.

Chasing Predictive Validity

~90% of clinical programs fail. We still cannot reliably predict which biology will translate to humans. In practice, drug development remains a high-stakes capital allocation problem driven by expert judgment, imperfect experimental proxies, and uncertainty.

AI is promising to change that: from virtual cells to virtual patients, there is massive hype around models promising better biology. The reality is that, for most tools and evidence sources, we don’t know how much valid information they add to the picture.

Among the tools promising change are the original heralds of the AI era: frontier LLMs. Frontier LLMs are increasingly validated on biology tasks, and biology has moved into the focus of frontier labs [1, 2]. Already today, frontier LLMs are used in scientific workflows that affect real allocation decisions, including which therapeutic programs get advanced, partnered, or deprioritized.

This creates a practical and foundational question: can frontier LLMs truly assess the quality of biology being developed? Could they even assess translational performance of drug programs?

Established benchmarks such as TrialBench (2025) evaluate domain-specific architectures using static time splits. Here, we ask a different question: can off-the-shelf, general-purpose frontier models predict clinical trial outcomes in a way that plausibly reflects biological signal, rather than historical leakage?

The central challenge is train-test leakage through memorization. Real decisions happen at the frontier, where outcomes are not yet known. If performance is strongest on outcomes already disseminated in papers, registries, and media, then real prospective utility is limited.

We set up a straightforward experiment to test this.

Question: can frontier LLMs predict clinical trial outcomes, or are they mostly recovering already-known outcomes?
Experiment: test four frontier models (GPT-4.1, Gemini 3 Pro, Claude Opus 4.6, GPT-5.2) on clinical trial outcomes from ClinicalTrials.gov using a cutoff-centered evaluation design to clearly distinguish memorization from prospective prediction.
Table 1: Assessed Models.

To the best of our knowledge, this specific assessment has not been done in a cutoff-aware way.

Setting up the experiment

The core idea is pretty simple. Each model has a knowledge cutoff date. Clinical trials that concluded before the cutoff may have been represented in the model’s pretraining corpus, either directly or through derivative discussion. In high-visibility therapeutic areas, especially where mechanisms are repeatedly studied, historical outcomes can propagate broadly across scientific and non-scientific sources. Trials that concluded after the cutoff should be less vulnerable to this effect and are the closest available proxy for true prospective prediction.
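A minimal sketch of the cutoff split described above, assuming each trial record carries a completion date. The record layout, the `split_by_cutoff` name, the toy trial IDs and the cutoff date are all hypothetical, and treating a trial that completes on the cutoff day itself as historical is an assumption.

```python
from datetime import date


def split_by_cutoff(trials, cutoff):
    """Split trials into (historical, prospective) around a model's knowledge cutoff.

    Trials completed on or before the cutoff may have leaked into pretraining;
    trials completed afterwards are the closest proxy for prospective prediction.
    """
    historical = [t for t in trials if t["completion_date"] <= cutoff]
    prospective = [t for t in trials if t["completion_date"] > cutoff]
    return historical, prospective


# Toy records with made-up identifiers (not real registry entries).
toy_trials = [
    {"trial_id": "TOY-001", "completion_date": date(2023, 5, 1)},
    {"trial_id": "TOY-002", "completion_date": date(2024, 6, 1)},
    {"trial_id": "TOY-003", "completion_date": date(2025, 2, 15)},
]
toy_cutoff = date(2024, 6, 1)  # hypothetical cutoff, not any real model's

historical, prospective = split_by_cutoff(toy_trials, toy_cutoff)
print([t["trial_id"] for t in historical], [t["trial_id"] for t in prospective])
```

Because each model has its own cutoff, the same trial can be historical for one model and prospective for another, so the split has to be recomputed per model.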

Figure 2: Design of the temporal benchmark.

To set up the experiment, we first constructed a comprehensive time-stamped benchmark of clinical trials with known real world outcomes.

We ingested ClinicalTrials.gov, and extracted all randomized controlled trials between 2010-2016.
Next, we validated the trial completion status and the outcomes, leaving us with a set of 47,133 trials with validated clinical outcomes.
The specific outcome label was either
- success (primary endpoint met) or
- failure (primary endpoint not met).

Figure 1: Benchmark construction pipeline from trial registry ingestion to validated binary outcome labels.

We then prompted each model to generate a probability of success score from 0 to 100 for all trials in our dataset [3] and calculated the AUROC on the predictions.
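For readers who want the metric spelled out: AUROC is the probability that a randomly chosen successful trial receives a higher score than a randomly chosen failed one, with ties counted as half. The sketch below implements that definition directly on 0 to 100 scores. It is illustrative only and uses made-up data; the pairwise loop is quadratic, so for a benchmark of this size a rank-based implementation or an established library would be the practical choice.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores : predicted probability-of-success scores (any monotone scale, e.g. 0-100)
    labels : 1 for success (primary endpoint met), 0 for failure
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one success and one failure")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # success ranked above failure
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy predictions (made up): five trials, three successes.
toy_scores = [90, 70, 60, 40, 20]
toy_labels = [1, 1, 0, 1, 0]
toy_auroc = auroc(toy_scores, toy_labels)
print(round(toy_auroc, 3))
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect separation of successes from failures.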

Frontier models are good at picking winning mechanisms

Across the full dataset of trials from 2016-2026, all four models deliver AUROC scores of 0.69-0.76 on trial outcome classification.

Figure 3: Model performance in predicting overall clinical trial outcomes.

At first, that looks pretty good. Recent benchmarks (such as TrialBench) reported AUROC scores in the 0.64–0.75 range for clinical trial approval predictions for domain-specific models under a conventional hold-out validation setting. This means naive frontier models are roughly matching heavily engineered, domain-specific architectures entirely out of the box.

But of course, this aggregate performance masks the source of signal. So next, we investigate historical vs prospective performance to control for train-test leakage.

Frontier models are good at memorizing winning mechanisms

When we control for train-test leakage by computing two separate performance measures, split at each model’s training cutoff, the picture looks entirely different. The temporally decomposed results show that performance drops in the prospective post-cutoff data for every single model tested.

Figure 4: Historical vs Prospective model performance. Model performance drops on prospective events.

For three of the four models, this degradation is both large and statistically significant under bootstrapped confidence intervals. We observe the largest drop for Gemini 3 Pro. For GPT-5.2, the directional drop is similar, but uncertainty remains high because only 21 prospective trials are available at this specific cutoff horizon.
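To make the uncertainty point concrete, here is a sketch of a percentile bootstrap confidence interval for AUROC. The exact bootstrap procedure is not detailed in this post, so the resampling scheme, the `bootstrap_auroc_ci` name, its defaults and the toy data are assumptions for illustration.

```python
import random


def auroc(scores, labels):
    """Pairwise AUROC; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auroc_ci(scores, labels, n_resamples=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample trials with replacement, recompute AUROC."""
    if len(set(labels)) < 2:
        raise ValueError("need both successes and failures")
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        sample_labels = [labels[i] for i in idx]
        if len(set(sample_labels)) < 2:
            continue  # skip resamples containing only one class
        stats.append(auroc([scores[i] for i in idx], sample_labels))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Toy data (made up): ten trials, five successes.
toy_scores = [95, 80, 75, 60, 55, 40, 35, 20, 15, 5]
toy_labels = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]
ci_low, ci_high = bootstrap_auroc_ci(toy_scores, toy_labels)
print(ci_low, ci_high)
```

With only a handful of trials the interval is wide, which is the same effect that keeps the GPT-5.2 estimate uncertain at 21 prospective trials.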

Table 2: Model Performance.

While we see clear performance drops beyond cutoff dates for each model, the performance is not bad per se. Claude Opus 4.6 still shows an AUROC of 0.680 post cutoff, a pretty impressive result given no grounding and a naive prompt.

However, drug programs take years to be developed, and each program leaves a decades-long paper trail before it reaches registrational trials. This paper trail of positive reporting, whether through published papers or data from previous phase trials, is almost certainly included in the models’ pre-training corpora.

We have no real way to test that directly. But one thing we can do is investigate if more recent trials, with potentially more recent biology, are assessed differently compared to data that has been around longer.

Performance weakens with recency

Rather than scoring models on a single pooled time axis, we aligned each model’s performance relative to its own cutoff date. We evaluated the AUROC per calendar year. This allows us to observe trends and run a fair comparison among models trained at different times.

The calendar-year resolved trend lines show AUROC declining steadily as the evaluation moves through time. Models generally perform weaker on more recent trials, supporting the memorization hypothesis.
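One way to sketch this cutoff-aligned, per-year analysis is to group trials by the difference between their completion year and the model’s cutoff year and compute AUROC within each group. Measuring the offset in whole calendar years, the record layout and the toy values below are assumptions made for illustration.

```python
from collections import defaultdict


def auroc(scores, labels):
    """Pairwise AUROC; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auroc_by_year_offset(records, cutoff_year):
    """Map (completion year - cutoff year) to AUROC within that year.

    records : iterable of (completion_year, score, label) tuples
    Years that contain only successes or only failures are skipped.
    """
    groups = defaultdict(list)
    for year, score, label in records:
        groups[year - cutoff_year].append((score, label))
    result = {}
    for offset in sorted(groups):
        scores = [s for s, _ in groups[offset]]
        labels = [y for _, y in groups[offset]]
        if len(set(labels)) == 2:
            result[offset] = auroc(scores, labels)
    return result


# Toy records (made up): (completion_year, score, label).
toy_records = [
    (2023, 90, 1), (2023, 80, 1), (2023, 20, 0), (2023, 10, 0),
    (2024, 70, 1), (2024, 60, 0), (2024, 50, 1), (2024, 30, 0),
    (2025, 60, 0), (2025, 55, 1), (2025, 50, 1), (2025, 40, 0),
]
toy_trend = auroc_by_year_offset(toy_records, cutoff_year=2024)
print(toy_trend)
```

Indexing by offset rather than absolute year lets models with different cutoffs share one axis, where negative offsets are pre-cutoff and positive offsets are post-cutoff.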

If these systems possessed robust, mechanism-level predictive abilities independent of seen outcomes, their performance would be expected to remain comparatively stable across time. Instead, the performance decays.

What this decay means is that the task with greatest business relevance is the one with weakest evidence for high performance.

Trial-level examples: spike-testing memorization behavior

To illustrate this point, we decided to dive a bit deeper into the memorization angle and investigate some trial-level examples. Following the memorization paper-trail hypothesis, we expect pivotal trials with a large paper trail (i.e. global headlines, strong financial impact or guideline change) to result in predictions with extreme confidence.

The following cases were selected based on model confidence to avoid anecdotal cherry-picking:

NCT02021656: A 2015 trial investigating ledipasvir/sofosbuvir for HCV/HIV co-infection. This trial made global headlines.
- Outcome: Success (96% SVR12).
- Model Performance: All four models assigned a probability of 96–98. GPT-4.1’s score of 98 sits at the 98.5th percentile of all its predictions, representing near-maximum confidence.
- Market Context: The HCV direct-acting antiviral revolution was one of the most covered stories in modern medicine. Gilead, the sponsor, saw historic revenue growth and massive stock appreciation during this era as these curative treatments hit the market.
NCT04353037: This trial investigated hydroxychloroquine for COVID-19 and was terminated in 2021.
- Outcome: Failure. It was terminated early after external data showed no clinical benefit, making global headlines.
- Model Performance: All four models assigned a probability between 2 and 10. Gemini 3 Pro’s score of 2 sits at the 0.01th percentile, while GPT-4.1, GPT-5.2, and Opus 4.6 also ranked it in the bottom 1st percentile of their predictions.
- Context: The immense public and macroeconomic attention on COVID-19 therapeutics guaranteed this failure was heavily overrepresented in training data.
NCT02406027: This trial was terminated in 2018 due to safety concerns, specifically elevated liver enzymes and cognitive worsening.
- Outcome: Failure.
- Model Performance: Opus 4.6 assigned a probability of 5 (the 0.07th percentile). Gemini 3 Pro assigned 10, GPT-4.1 assigned 15, and GPT-5.2 assigned 20.
- Market Context: Atabecestat was a highly anticipated BACE inhibitor developed by Johnson & Johnson (Janssen) and Shionogi. The abrupt 2018 termination sent shockwaves through the sector and was a major catalyst in the subsequent collapse of the entire BACE inhibitor drug class.

While these examples do not prove memorization alone, they illustrate a broader pattern: high-confidence “predictions” on already-disseminated outcomes that received very high coverage in press and scientific journals.
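The percentile figures quoted in these examples describe where a single score sits within the distribution of all of a model’s predictions. A minimal sketch of that calculation follows, defining the percentile rank as the share of predictions strictly below the score. That tie-handling choice and the toy scores are assumptions, and other conventions shift the numbers slightly.

```python
from bisect import bisect_left


def percentile_rank(value, all_scores):
    """Percentage of predictions strictly below `value` (0 to 100)."""
    if not all_scores:
        raise ValueError("all_scores must not be empty")
    ordered = sorted(all_scores)
    # bisect_left returns the number of elements strictly smaller than value.
    return 100.0 * bisect_left(ordered, value) / len(ordered)


# Toy distribution of one model's 0-100 scores (made up).
toy_predictions = [2, 10, 35, 50, 50, 65, 80, 90, 96, 98]
top_rank = percentile_rank(98, toy_predictions)
bottom_rank = percentile_rank(2, toy_predictions)
print(top_rank, bottom_rank)
```

Scores near the 0th or 100th percentile mark the trials on which a model is most confident, which is the criterion used to pick the cases above.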

What This Means

Before we move to interpretation, let’s summarize what we found:

Current frontier LLMs are strong retrievers of historical trial outcomes.
Out-of-the-box performance on prospective trial readouts is much lower but still relevant (AUROC 0.680 for Claude Opus 4.6).
We have to assume that a substantial share of that capability is memorization from prior exposure to the paper trail leading up to the trial. Therefore, predictive reliability on truly novel programs remains unclear.

To be clear: this does not imply that LLMs have no role in drug development workflows.

It implies they should be used where recall-heavy capabilities are appropriate and memorization is not an issue. Many of these use cases exist in current discovery workflows. However, that also means that current frontier LLMs cannot (yet) be used as primary discovery engines where novelty and causal reasoning over unseen data are required.

Application in use cases, such as grading of novel biology, identification of novel targets or capital allocation decisions that assume model confidence equals out-of-sample validity, should be pursued with caution.

Practically, our findings demonstrate that naive LLMs are weak biological alpha generators. Because alpha depends entirely on information arbitrage not already embedded in the consensus, a system that primarily retrieves known outcomes is structurally limited as a predictive engine.

While our study successfully tests for the leakage of published trial results, testing for the predictive validity of truly novel biology faces an even more fundamental leakage problem. In the standard scientific pipeline, a paper is published, an LLM trains on that paper, a company initiates a related trial, and we evaluate the model on that trial. This creates a chain that is virtually impossible to blind models against.

This raises an uncomfortable question for the field: is it even possible to design evaluations that genuinely assess a frontier LLM’s ability to predict truly novel biology?

We are actively working on this evaluation frontier, and it may turn out that LLMs are not the right tool for underwriting novel biology after all.

More to come, stay tuned!

On Pheiron

For readers new to Pheiron: we are an applied research lab focused on biological alpha, built around primary human data. We build calibrated, causal models on primary human data to underwrite biology at the program level, because primary human evidence is where out-of-sample signal should come from.

This is the first post in a series on predictive validity, primary human data, and biological alpha.