Sunday, May 24, 2026

Comparing Chat and Claude approaches to "Apply Bayes to Diagnostics Instead"

 Yes — on this more challenging, open-ended diagnostics question, Claude and ChatGPT both did better than simply “porting” the drug-trial discussion over to diagnostics. But they succeeded in different ways.

My bottom line: Claude is more original and conceptually agile; ChatGPT is more teachable, more systematic, and more directly useful as a training document. Claude better “sees” why diagnostics are a different epistemic world. ChatGPT better converts that insight into a structured working framework.

1. Both answers correctly reject a simple drugs-to-diagnostics translation

Both versions recognize the main point: Bayesian design for diagnostics is not just Bayesian drug-trial design with different endpoints. The FDA drug guidance is about causal inference for therapeutic safety and effectiveness. Diagnostics are about classification, prediction, measurement performance, clinical interpretation, and downstream decision consequences.

Claude states this sharply: therapeutic guidance estimates a treatment effect, while genomic diagnostics involve analytical performance, clinical performance, and clinical utility; it also says these are largely prediction and classification problems, not causal inference problems.

ChatGPT makes the same core distinction in a more training-friendly way: drugs ask whether an intervention improves outcomes; genomic tests ask whether a test accurately and usefully classifies patients, variants, tumors, residual disease, or treatment-relevant molecular states.

That shared insight is the most important success of both answers.

2. Claude is better on the deepest epistemic distinction: diagnostics do not rest primarily on randomization

Claude’s strongest contribution is that it explicitly says the JAMA-style critique of Bayesian borrowing in therapeutic trials — erosion of randomization — does not apply with the same force to diagnostics, because randomization is not the foundational warrant for diagnostic accuracy studies.

That is exactly right, and it is a higher-level insight than simply listing sensitivity, specificity, PPV, and NPV. For drugs, the central regulatory anxiety is: are we preserving the causal inference made possible by randomization? For diagnostics, the central anxiety is different: do we know the truth state, and does test performance generalize to the intended-use population?

Claude then follows this to the right destination: the reference standard problem. It says diagnostic accuracy studies require a reference standard, and that for many genomic tests the reference standard is itself imperfect — orthogonal sequencing, discordance resolution, low variant allele frequencies, future clinical events, latent class models, and composite reference standards.

That is probably the best section in either answer.

3. ChatGPT is better on practical diagnostic categories and intended use

ChatGPT’s strongest contribution is the section on intended use as the anchor. It explains that diagnostics must be tied to who is tested, what specimen is used, disease stage, clinical decision, and truth standard. Then it lists different diagnostic contexts: therapy selection, companion diagnostics, screening, MRD, monitoring, rare inherited disease, and tumor profiling.

This is highly useful as a training document because it prevents the learner from thinking “diagnostic test” is one category. PSA screening, DaTscan, NIPT, tumor profiling, MRD, and cancer relapse monitoring all have different prevalence, error tolerance, reference standards, and clinical consequences.

Your examples sharpen this point nicely:

PSA screening is dominated by low-prevalence screening dynamics, false positives, overdiagnosis, biopsy cascades, and the fact that “detecting prostate cancer” is not the same as improving outcomes.

DaTscan is closer to a diagnostic adjunct or rule-in/rule-out tool for dopaminergic deficit, with error rates that matter because the result affects diagnostic confidence, treatment direction, and exclusion of mimics.

NIPT is a probabilistic screening test where PPV varies dramatically by condition prevalence, maternal age, fetal fraction, and pretest risk. It is almost a textbook example of why Bayesian thinking is unavoidable.

MRD / post-surgical cancer recurrence testing is a prognostic and treatment-decision test, where a negative result may support de-escalation or avoiding adjuvant therapy. Here, the most important error may be the false negative: “no tumor DNA detected” is not equivalent to “no residual disease.”

ChatGPT’s “intended use” section is the better framework for teaching those distinctions.

4. Claude is more innovative on diagnostics-specific Bayesian problems

Claude introduces several points that ChatGPT either omits or treats less deeply.

First, Claude says the reference standard problem has no therapeutic analog. That is a powerful training phrase. Drugs may have endpoint validity problems, surrogate endpoint problems, and ascertainment problems, but they usually do not have the same “what is truth?” structure as diagnostic validation.

Second, Claude is better on locked versus adaptive algorithms. It connects diagnostics to AI/ML-enabled device software and predetermined change control plans, noting that a genomic diagnostic algorithm may update variant classifications as evidence accumulates. That is a real diagnostics-specific issue and not captured well by the drug-trial guidance model of a fixed protocol and pre-specified analysis.

Third, Claude is better on post-market drift. It identifies shifts in allele frequencies, variant spectrum, reagent or platform changes, and changes in clinical indications for testing. This is especially important for genomics, where the test is not a static pill; the specimen mix, variant knowledge base, informatics pipeline, and clinical-use population can all evolve.

Fourth, Claude is good on capability heterogeneity. The diagnostics industry includes large IVD firms, sequencing companies, single-lab LDT providers, and academic centers. A Bayesian diagnostics guidance would need to be realistic about that uneven statistical and regulatory capacity.

Those are high-value insights.

5. ChatGPT is better on diagnostic bias, clinical utility, and payer implications

ChatGPT is stronger in explaining diagnostic bias categories in a way a trainee can use immediately: reference-standard bias, partial verification bias, spectrum bias, prevalence distortion, and incorporation bias.

This is a major teaching advantage. Claude discusses imperfect reference standards but does not lay out the full epidemiologic bias menu as clearly. For someone training in this area, ChatGPT’s section is easier to convert into a checklist for reviewing a study.

ChatGPT is also better on clinical utility. It says accuracy is not automatically utility, then asks the right questions: does the result change management, is the management change evidence-based, does earlier detection improve outcome or merely move the clock, does the test identify patients who benefit from a drug or de-escalation strategy, and does a negative result safely avoid therapy, biopsy, imaging, or chemotherapy?

That section is highly relevant to your MRD example. A Bayesian posterior around analytical sensitivity is not enough. For a post-surgical MRD assay, the business and clinical question is whether MRD-negative patients can safely forgo adjuvant therapy or reduce surveillance intensity. That requires not just test performance but outcome-linked clinical utility.

ChatGPT is also better on payer implications, noting that FDA authorization does not automatically solve reimbursement and that payers may still require evidence of changed management, improved outcomes, matching of coverage criteria, and justification of serial testing.

For your professional use case, that payer paragraph is not optional. It belongs near the center of the final article.

6. Your skepticism about the old CDRH “devices” guidance is fair — and ChatGPT leaned on it too much

You are right to be cautious about the old FDA CDRH Bayesian device guidance as a general answer to this problem. “Devices” is a broad category. Some devices are therapeutic or interventional: stents, valves, ablation devices, neurostimulators, orthopedic implants, wound therapies. These can look much more like drug trials because they produce therapeutic effects and raise causal questions about safety and effectiveness.

Diagnostics are different. PSA, DaTscan, NIPT, and MRD do not themselves treat disease. They change information states and then influence downstream decisions. That means the Bayesian problem is less “does the product cause benefit?” and more “how should a probabilistic test result update belief, classify risk, and change action?”

Claude handled this better. It mentions CDRH and the old device guidance but quickly says the landscape differs in ways that go beyond which center has jurisdiction. ChatGPT, by contrast, spends more time treating CDRH device guidance as a useful comparator and says the analogy is “not perfect, but relevant.” That is not wrong, but it risks blurring your central point: diagnostic Bayesian guidance should not be derived from device Bayesian guidance merely because diagnostics are legally devices.

A sharper version would say: CDRH’s device guidance may be institutionally relevant, but the right intellectual model for diagnostics is closer to Bayesian diagnostic reasoning, classification error, decision analysis, reference-standard uncertainty, and clinical utility, not therapeutic device trial design.

7. Claude better anticipates a true “FDA guidance for diagnostics”

Claude’s synthesis is stronger as a policy concept. It concludes that a Bayesian methodology guidance for genomic diagnostics would not be a simple translation of the therapeutic document because the estimands, role of randomization, multiplicity structure, borrowing problems, post-market lifecycle, and scientific question differ. It then says the current gap is arguably in diagnostic and combination-product spaces, where the inferential questions may be well suited to Bayesian formalism but guidance has not kept pace with modern genomics.

That is probably the best “thought leadership” conclusion.

ChatGPT’s conclusion is cleaner and more memorable: drugs ask whether prior and accumulating data can support a reliable causal claim about safety and effectiveness; genomic diagnostics ask whether heterogeneous molecular, analytical, clinical, and prior-platform information can support reliable claims about test performance and clinical interpretability in a defined intended-use population.

So: Claude wins on policy imagination; ChatGPT wins on communicability.

8. Which answer better handled your examples?

For PSA screening, ChatGPT is better because it explicitly discusses screening, low prevalence, false positives, PPV/NPV, overdiagnosis-like cascades, and clinical utility. Claude has the tools to get there but does not make the screening example vivid enough.

For DaTscan, neither answer is fully developed. A strong answer would discuss rule-in/rule-out use, pretest probability, specialist versus generalist setting, equivocal scans, reference diagnosis that evolves over time, and whether the test changes clinical management. ChatGPT’s intended-use framework would support this better, but neither explicitly exploited that example.

For NIPT, Claude is conceptually strong because it emphasizes posterior probability of disease given a test result and the inherently Bayesian nature of PPV/NPV. ChatGPT is also good because it emphasizes prevalence and PPV/NPV across intended-use populations.

For MRD / molecular residual disease, ChatGPT is stronger. It explicitly mentions MRD, recurrence risk, lead time, serial monitoring, the meaning of “no tumor DNA detected,” false negatives, avoiding adjuvant chemotherapy, and payer consequences. Claude’s framework would support a good MRD discussion, but it does not foreground de-escalation and “negative patients” as strongly.

9. Best final synthesis

I would combine them this way:

Use Claude’s first principles:

Diagnostics are not causal therapeutic interventions.

Randomization is not the central epistemic warrant.

The reference standard problem is central.

Bayesian methods may be more natural in diagnostics because diagnostic interpretation itself is Bayesian.

Modern genomic diagnostics introduce lifecycle problems: algorithm updating, variant reclassification, post-market drift, and uneven sponsor capability.

Use ChatGPT’s teaching structure:

Target of inference.

Prior information.

Bias categories.

Intended use.

Clinical validity versus utility.

Borrowing across variants, samples, platforms, and populations.

Diagnostic-specific operating characteristics.

False-positive / false-negative harm asymmetry.

FDA authorization versus payer adoption.

The best sentence from Claude is essentially: diagnostics are prediction and classification problems, not causal inference problems.

The best sentence from ChatGPT is essentially: the diagnostic version is about classification accuracy, reference truth, measurement validity, and clinical decision consequences.

Together, those two sentences capture the article.

Overall verdict

On this harder question, Claude arguably had the more original insight, especially around reference standards, randomization not being the foundational warrant, adaptive algorithms, and post-market drift. But ChatGPT produced the better usable training document, especially for someone trying to learn the field without already being immersed in diagnostic epidemiology, Bayesian methods, and FDA culture.

For your purposes, I would not choose one over the other. I would use Claude as the conceptual provocateur and ChatGPT as the curriculum designer. The final article should be organized in ChatGPT’s clearer sequence but seeded with Claude’s deeper diagnostics-specific insights — especially the imperfect reference standard problem, the weaker relevance of the old CDRH device guidance, and the fact that diagnostics change belief and action, not physiology directly.


###

And of course, we wouldn't think of using diagnostics without some Bayes-like thinking. If the FP rate on a test is 1%, and a patient in a certain population has a 1:10,000 risk, then FPs will swamp out TPs and no one would think otherwise - it's the real world.
### 

Exactly. Diagnostics are already Bayesian in ordinary clinical use, whether or not anyone calls them that.

A drug trial can be discussed, at least superficially, as “drug works / drug does not work” against a randomized comparator. But a diagnostic test is almost never interpretable without pretest probability. The same sensitivity/specificity table means different things in a high-risk clinic population, a screening population, an incidental finding population, or a worried-well population.

Your example is the cleanest version:

Suppose the disease prevalence is 1 in 10,000 and the test has a 1% false-positive rate. In 10,000 people:

Category                                          Approximate count
True disease cases                                1
Non-disease cases                                 9,999
False positives at 1%                             ~100
True positives, even with excellent sensitivity   ~1

So even if the test catches the one real case, the positive results are roughly:

1 true positive + 100 false positives = 101 positives

That gives a PPV of about:

1 / 101 ≈ 1%

So a “positive” test would mean roughly a 1% chance of true disease, not 99%. That is why screening tests can look excellent on sensitivity/specificity and still perform badly in low-prevalence use. It is also why diagnostics cannot be evaluated only as instruments; they must be evaluated as instruments deployed in a population for a purpose.
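The arithmetic above is Bayes' theorem in count form, and it can be sketched in a few lines. This is a minimal illustration, not a validated calculator: the function name `ppv` is hypothetical, perfect sensitivity is assumed for the low-prevalence case, and the 1-in-5 "high-risk clinic" prevalence is an invented comparison point that does not come from the discussion above.

```python
def ppv(prevalence, sensitivity, specificity):
    """Positive predictive value: P(disease | positive test), via Bayes' theorem."""
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * (1.0 - specificity)
    return true_pos / (true_pos + false_pos)

# The example above: prevalence 1 in 10,000 and a 1% false-positive rate.
# Perfect sensitivity is an assumption made for simplicity.
low_prev = ppv(prevalence=1 / 10_000, sensitivity=1.0, specificity=0.99)

# Same test, hypothetical high-risk clinic where 1 in 5 patients has the disease.
high_prev = ppv(prevalence=0.20, sensitivity=1.0, specificity=0.99)

print(f"PPV at 1 in 10,000: {low_prev:.2%}")
print(f"PPV at 1 in 5:      {high_prev:.2%}")
```

The test's sensitivity and specificity are identical in both calls; only the prior changes, and the meaning of a positive result changes with it.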

This is the point I would add to the article as a core framing sentence:

In therapeutics, Bayesian methods may appear as an advanced statistical option; in diagnostics, Bayes-like reasoning is already embedded in the ordinary meaning of a test result, because every result is interpreted against a prior probability.

That is why diagnostics deserve their own Bayesian framework. The key issues are not merely “borrowing” and Type I error. They are pretest probability, disease prevalence, intended-use population, false-positive and false-negative consequences, reference-standard uncertainty, and whether the result changes a clinical decision in a way that improves outcomes.

It also clarifies why your examples are so good:

PSA screening is not just “can PSA detect prostate cancer?” It is: in which age/risk group, at what threshold, with how many false positives, biopsies, overdiagnoses, and clinically meaningful cancers detected?

DaTscan is not just “is the scan accurate?” It is: what is the pretest probability of Parkinsonian dopaminergic deficit, what alternative diagnoses are being considered, and how much does a positive or negative scan move the clinician’s belief?

NIPT is almost pure real-world Bayes: the same positive test result can have very different PPV depending on maternal age, fetal fraction, condition prevalence, and prior risk.

Post-surgical MRD is even more subtle: “no tumor DNA detected” does not mean “no residual disease.” It means the posterior probability of residual disease has been lowered by an amount that depends on tumor type, shedding biology, assay sensitivity, timing, sample adequacy, and baseline recurrence risk.
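The MRD point can be made concrete with the odds form of Bayes' theorem: posterior odds equal prior odds times the likelihood ratio of the result. The sketch below is illustrative only. The function name is hypothetical, and the 30% baseline recurrence risk, 60% single-timepoint sensitivity, and 99% specificity are invented numbers, not properties of any real assay.

```python
def posterior_after_negative(prior, sensitivity, specificity):
    """P(residual disease | negative test), computed through odds and LR-."""
    prior_odds = prior / (1.0 - prior)
    lr_negative = (1.0 - sensitivity) / specificity  # negative likelihood ratio
    posterior_odds = prior_odds * lr_negative
    return posterior_odds / (1.0 + posterior_odds)

# Hypothetical inputs: 30% baseline recurrence risk, 99% specificity,
# and two assumed sensitivities for comparison.
modest = posterior_after_negative(prior=0.30, sensitivity=0.60, specificity=0.99)
strong = posterior_after_negative(prior=0.30, sensitivity=0.90, specificity=0.99)

print(f"Residual-disease probability after a negative result (60% sensitivity): {modest:.1%}")
print(f"Residual-disease probability after a negative result (90% sensitivity): {strong:.1%}")
```

Under these assumed inputs a negative result lowers the probability of residual disease but does not bring it to zero, and how far it falls depends on both the baseline risk and the assay's sensitivity.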

So yes: for diagnostics, Bayesian thinking is not a modern add-on. It is the mathematics behind what clinicians already mean when they say, “Given this patient and this setting, how much should this test result change what I believe and what I do?”

Comparing the Chat and Claude Opus Documents (Main Version, Therapeutics)

This section compares the two main versions, those addressing therapeutics.

##

My read: ChatGPT is the better teaching document for a smart but non-specialist reader. Claude Opus is the better expert briefing or annotated technical review. For your stated use case — someone with basic statistics, Excel comfort, and some applied scientific background, but not deep familiarity with Bayesian regulatory trial design — I would use the ChatGPT version as the base and selectively import several Claude sections.

Bottom-line comparison

The ChatGPT version has a clearer pedagogic arc. It begins by saying what the guidance is not — not a replacement for randomized trials, not permission to substitute “modern statistics” for evidence — and then gives the reader a practical frame: Bayesian design is “less like a cookbook and more like a regulatory grammar.” It then moves through concepts in digestible sections: what changes, what the guidance accomplishes, posterior credibility versus regulatory reliability, borrowing, implications, concerns, and closing synthesis. This is unusually good scaffolding for a learner.

The Claude Opus version is denser, more reference-rich, and more conventionally “expert.” It starts with docket number, Federal Register timing, PDUFA VII, CDRH’s 2010 device guidance, ICH E9, E9(R1), E11(R1), adaptive-design guidance, ICH E20, and Berry Consultants. That is valuable for someone already oriented to FDA regulatory science, but it front-loads institutional context before the reader has been taught the core Bayesian problem.

Readability

ChatGPT wins on readability. Its paragraphs are shorter, the topic sentences are stronger, and the article repeatedly restates the practical meaning of technical ideas. For example, it explains the frequentist/Bayesian contrast in plain language: frequentist analysis asks how often a design would falsely conclude success under a null; Bayesian analysis starts with a prior, likelihood, and posterior. It then immediately translates that distinction into regulatory consequences: posterior probabilities do not eliminate the need to evaluate trial performance; they change what must be evaluated.

Claude is very good, but it sometimes reads like a graduate seminar handout. Phrases such as “doctrinal framework,” “bifurcation of Bayesian trial designs,” “epistemic warrant,” “design priors or sampling priors,” and “posterior predictive checking” are accurate but not self-teaching. They assume the reader is already comfortable with the statistical and regulatory discourse. For your audience, those terms would need either a glossary or a short explanatory sentence.

Suitability as a training document

For training, the key issue is not just accuracy. It is whether the reader can build a mental model.

The ChatGPT version builds a mental model around five accessible ideas:

Bayesian methods are not one thing. They range from minimally informative priors to consequential borrowing.

The prior matters. It can be benign, useful, or dangerous.

Posterior probability is not the same as regulatory reliability. A posterior probability threshold can look like alpha but does not automatically provide Type I error control.

Borrowing is the central opportunity and central risk. It helps when data are genuinely comparable and harms when data are biased or non-exchangeable.

FDA is opening a door, not lowering the evidentiary bar. The conclusion emphasizes “disciplined prior knowledge, not statistical indulgence.”

That is exactly the conceptual spine a trainee needs.
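The third idea, that a posterior probability threshold does not by itself control Type I error, can be checked with a toy calculation. The sketch below assumes a single-arm trial with a binary endpoint, a null response rate of 30%, 50 patients, and a success rule of "posterior probability that the rate exceeds 30% is above 0.975." All of those numbers, the Beta(10, 10) "enthusiastic" prior, and the function names are hypothetical choices for illustration; none comes from the guidance. It uses only the standard library, which is why the Beta tail is computed through a binomial identity that holds for integer parameters.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)

def beta_tail(a, b, x):
    """P(theta > x) for theta ~ Beta(a, b), valid for integer a, b >= 1.

    Uses the identity P(theta <= x) = P(Binomial(a + b - 1, x) >= a).
    """
    n = a + b - 1
    return sum(binom_pmf(k, n, x) for k in range(a))

def type_one_error(prior_a, prior_b, n, p0, threshold):
    """Exact chance of declaring success when the true rate equals p0."""
    error = 0.0
    for successes in range(n + 1):
        posterior_prob = beta_tail(prior_a + successes, prior_b + n - successes, p0)
        if posterior_prob > threshold:
            error += binom_pmf(successes, n, p0)
    return error

N, P0, THRESHOLD = 50, 0.30, 0.975  # hypothetical design values

flat = type_one_error(1, 1, N, P0, THRESHOLD)             # Beta(1, 1): minimally informative
enthusiastic = type_one_error(10, 10, N, P0, THRESHOLD)   # Beta(10, 10): prior mean 0.5

print(f"False-success rate, flat prior:         {flat:.3f}")
print(f"False-success rate, enthusiastic prior: {enthusiastic:.3f}")
```

The decision rule is the same 0.975 threshold in both runs. In this toy setup the enthusiastic prior yields a noticeably higher false-success rate than the flat prior, which is why operating characteristics have to be evaluated rather than read off the threshold.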

Claude’s version is stronger if the trainee is already past the first stage and wants a technical map of the document. It gives a more detailed account of Type I error-calibrated versus non-calibrated Bayesian regimes, analysis priors versus design priors, effective sample size, static versus dynamic borrowing, commensurate priors, mixture priors, and computational reporting requirements. But as an introductory training document, this richness can become cognitive overload.

Where Claude is better

Claude is stronger in regulatory and bibliographic specificity. It gives the release date, docket number, comment deadline, authorship centers, PDUFA VII commitment, relationship to CDRH’s earlier device guidance, and the ICH framework. That material is useful and should be imported into the ChatGPT version’s opening or an appendix.

Claude is also stronger in technical completeness. It identifies two inferential regimes more sharply: Type I error-calibrated Bayesian trials, where Bayesian machinery functions like a decision-rule engine, and trials not calibrated to Type I error, where priors and Bayesian operating characteristics become central.

Claude is better on specific technical caveats. It flags predictive probabilities, skeptical and enthusiastic priors, exchangeability, patient-level-data pragmatics, and the potential for non-monotonic behavior in certain mixture or discounting approaches. Those are valuable for an advanced version.

Claude also provides a richer “beyond the PDFs” section, including replication concerns, industry-wide calibration, real-world-data priors, international harmonization, workforce gaps, and FDA’s own implicit Bayesian decision-making.

Where ChatGPT is better

ChatGPT is better at teaching the stakes. Its formulation that the model “can quantify borrowing; it cannot make irrelevant data relevant” is probably the best single training sentence in either version. It gives the reader a memorable rule that can be applied to drugs, biologics, diagnostics, external controls, and real-world evidence.

ChatGPT is also better at balancing enthusiasm and skepticism. It does not drown the reader in named methods. Instead, it keeps returning to regulatory judgment: what is being borrowed, why it is relevant, how much influence it has, what happens if it is wrong, and whether the conclusion survives a skeptical alternative prior. That five-question framework is excellent for training.

Finally, ChatGPT’s closing section is more memorable and more usable. “Quantification is not purification” is a strong teaching phrase: a biased prior remains biased after being written in mathematical notation. That is exactly the kind of sentence that helps a non-specialist retain the core lesson.

Main weakness of each version

The ChatGPT weakness is that it is a little too smooth. It sacrifices some concrete regulatory detail. A trainee might finish it understanding the conceptual issues but not knowing enough about the actual FDA document’s structure: docket, PDUFA commitment, CDRH contrast, ICH linkages, prior taxonomy, ESS, simulation code, and MCMC expectations.

The Claude weakness is that it is too front-loaded and too expert-coded. It is highly competent, but it does not sufficiently slow down at the exact places where a learner needs help: Type I error versus false positive conclusion, analysis prior versus design prior, posterior probability versus long-run reliability, and borrowing versus exchangeability.

Best training-document strategy

I would create a hybrid, using ChatGPT as the skeleton and Claude as the technical enrichment layer.

Use ChatGPT’s title, opening frame, section order, explanation of posterior credibility versus regulatory reliability, borrowing discussion, sponsor/FDA implications, and closing “disciplined prior knowledge” synthesis.

Then import from Claude:

Regulatory context paragraph: January 9 release, January 12 Federal Register notice, Docket FDA-2025-D-3217, PDUFA VII commitment, CDER/CBER authorship, and contrast with CDRH’s 2010 Bayesian device guidance.

Technical box: two inferential regimes — Bayesian-with-frequentist-calibration versus fully Bayesian prior-based operating characteristics.

Prior taxonomy box: noninformative/minimally informative, skeptical, enthusiastic, informative/borrowing priors; static versus dynamic borrowing; ESS.

Advanced cautions sidebar: predictive probabilities, exchangeability, patient-level versus summary data, mixture-prior behavior, and RWD risks.

My practical recommendation

For someone who is statistically literate but not immersed in Bayesian FDA trial design, I would assign the ChatGPT version first. It is the better “Chapter 1.” I would then use the Claude version as Chapter 2 / instructor notes / technical appendix.

For a final polished training product, I would aim for this structure:

Part 1: The plain-English regulatory problem.
Why FDA cares, why Bayes matters, why it does not replace evidentiary rigor.

Part 2: The core statistical contrast.
Prior, likelihood, posterior; posterior probability versus Type I error; false positive conclusion versus classical false positive rate.

Part 3: The practical FDA checklist.
Prior, estimand, success criterion, operating characteristics, simulations, sensitivity analyses, software, reporting.

Part 4: Borrowing.
When it helps, when it misleads, and how to test prior-data conflict.

Part 5: Advanced notes.
Predictive probabilities, ESS, design priors, dynamic borrowing, mixture priors, real-world data, and payer implications.

In short: ChatGPT is the better teaching narrative; Claude is the better technical annotation. The optimal training document would be about 70% ChatGPT structure and tone, 30% Claude specificity and caveats.

How FDA-Bayes-Drugs Would Be Different for Diagnostics, not Therapeutics

In a prior blog post (here), I asked ChatGPT and Claude to review a January 2026 guidance on Bayesian design in clinical drug trials. I then asked: what about diagnostics?

Claude first, then ChatGPT.

###


Discuss the most important ways the analysis would be different if the topic were the use of Bayesian design in developing genomic tests [diagnostics] as opposed to biotechnology and drugs [therapeutics].

###

CLAUDE

Bayesian Design in Genomic Diagnostics versus Therapeutics: Key Analytical Divergences

The January 2026 FDA draft guidance addresses drugs and biologics, operating under CDER and CBER. Genomic diagnostics fall under CDRH (for in vitro diagnostic devices, IVDs) and increasingly under CDER for companion diagnostics co-developed with drugs, while laboratory-developed tests (LDTs) sit in a contested regulatory space following the FDA's 2024 final rule and subsequent litigation. The Berry Consultants commentary noted in passing that CDRH "has a separate guidance on Bayesian methods" — specifically, the 2010 Guidance for the Use of Bayesian Statistics in Medical Device Clinical Trials. The analytical landscape differs in ways that go well beyond which Center has jurisdiction.

1. The Estimand Itself Is Categorically Different

The therapeutic guidance centers on estimating a treatment effect — a causal parameter requiring randomization to estimate without bias. The relevant Bayesian machinery concerns priors over treatment effects, hierarchical borrowing of effect estimates, and inference about whether a parameter exceeds a threshold of clinical benefit.

Genomic diagnostics are evaluated against very different estimands:

  • Analytical performance: limit of detection, precision, reproducibility across sites and reagent lots, specificity for the intended target
  • Clinical performance: sensitivity, specificity, positive and negative predictive values, likelihood ratios
  • Clinical utility: whether using the test changes outcomes, typically requiring a clinical-utility study or modeling exercise

These are largely prediction problems and classification problems, not causal inference problems. Bayesian methods are arguably more natural here than in therapeutics, because the relevant quantities — posterior probability of disease given a test result, predictive value adjusted for prevalence — are inherently Bayesian. PPV and NPV are literally Bayes' theorem applied to diagnostic data. The Evans, Fleming, Janes, and Dodd critique in JAMA of Bayesian methods in confirmatory therapeutic trials — that they erode the benefits of randomization — does not apply with the same force, because randomization is not the foundational epistemic warrant for diagnostic accuracy studies.

2. The Reference Standard Problem Has No Therapeutic Analog

Diagnostic accuracy studies require a reference standard against which the new test is compared. For many genomic tests, the reference standard is itself imperfect:

  • Sequencing-based variant calling is benchmarked against orthogonal sequencing platforms, with discordance resolution often involving Sanger confirmation or replicate testing
  • Cancer mutation panels may have no true gold standard for certain variants, particularly at low variant allele frequencies
  • Polygenic risk scores have no contemporaneous reference at all; their "truth" is a future clinical event

This creates a class of statistical problems — latent class models, imperfect reference standard methods, composite reference standards — that Bayesian methods handle particularly well but that have no parallel in the therapeutic guidance. The Bayesian framework here is not optional methodological flavor; it is often the only tractable framework for inference when the reference standard itself is uncertain.
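A small forward calculation shows why an imperfect reference standard matters. This is not a latent class model, only a sketch of the bias such models are meant to correct. It assumes that the errors of the new test and the reference are independent given true disease status. The function name and every numeric input are hypothetical.

```python
def apparent_accuracy(prevalence, test_se, test_sp, ref_se, ref_sp):
    """Sensitivity and specificity of a test as measured against an imperfect reference.

    Assumes test and reference errors are independent given true disease status.
    """
    diseased, healthy = prevalence, 1.0 - prevalence
    both_pos = diseased * test_se * ref_se + healthy * (1 - test_sp) * (1 - ref_sp)
    ref_pos = diseased * ref_se + healthy * (1 - ref_sp)
    both_neg = diseased * (1 - test_se) * (1 - ref_se) + healthy * test_sp * ref_sp
    ref_neg = diseased * (1 - ref_se) + healthy * ref_sp
    return both_pos / ref_pos, both_neg / ref_neg

# Hypothetical case: a new test that is truly perfect, judged against a
# reference with 90% sensitivity and 99% specificity at 10% prevalence.
app_se, app_sp = apparent_accuracy(
    prevalence=0.10, test_se=1.0, test_sp=1.0, ref_se=0.90, ref_sp=0.99
)

print(f"Apparent sensitivity: {app_se:.3f}")
print(f"Apparent specificity: {app_sp:.3f}")
```

Even a test with no errors at all appears imperfect here, because every disagreement with the reference is scored against the test. Models that treat the true state as latent exist to separate those two sources of error.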

3. The Borrowing Question Takes a Different Shape

The therapeutic guidance devotes substantial attention to borrowing from historical trials, external controls, and adult-to-pediatric extrapolation. The borrowing question in genomic diagnostics is structurally different:

Borrowing across analytical conditions: When a sequencing assay is validated, performance characteristics are estimated across many variant types, depths of coverage, allele frequencies, and sample matrices. Hierarchical Bayesian models naturally handle this combinatorial space; treating each variant-condition combination as an independent estimation problem is statistically wasteful and often infeasible given the rarity of specific variants.

Borrowing across variants in panels: A multi-gene panel may contain hundreds or thousands of variants, most of which will never be observed at sufficient frequency in any validation study to establish standalone performance. Bayesian hierarchical modeling — borrowing strength across variants sharing structural features (variant type, genomic context, technical difficulty class) — is essentially mandatory for these products.

Borrowing across populations: This is where the analogy is closest. The therapeutic guidance's framework for adult-to-pediatric extrapolation has a diagnostic parallel in the recognized problem that genomic test performance, particularly for polygenic risk scores, often degrades dramatically when applied to populations under-represented in discovery cohorts. The Bayesian framework can formalize this discounting, but the ethical and equity stakes are arguably sharper than in the therapeutic setting because performance degradation in under-represented groups can entrench health disparities through the test's deployment.
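The partial pooling described above can be sketched with a conjugate Beta-binomial update. This is a simplification: the class-level prior mean and prior strength are hypothetical fixed values, whereas a full hierarchical model would estimate them from the data, and the variant names and counts are invented for illustration.

```python
def shrunken_sensitivity(detected, tested, prior_mean, prior_strength):
    """Posterior mean sensitivity under a Beta prior shared by a variant class.

    prior_mean and prior_strength (in pseudo-observations) are hypothetical
    hyperparameters; a full hierarchical model would learn them from data.
    """
    a = prior_mean * prior_strength
    b = (1.0 - prior_mean) * prior_strength
    return (a + detected) / (a + b + tested)


# Illustrative (detected, tested) counts for variants in one technical class
variants = {"indel_common": (196, 200), "indel_rare": (2, 3), "indel_unseen": (0, 0)}
for name, (k, n) in variants.items():
    raw = k / n if n else float("nan")
    pooled = shrunken_sensitivity(k, n, prior_mean=0.95, prior_strength=20)
    print(f"{name}: raw {raw:.3f}, partially pooled {pooled:.3f}")
```

Well-sampled variants stay close to their own data, while rare or unobserved variants are pulled toward the class-level value. That pull is only defensible if the variants in the class are plausibly exchangeable.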

4. Locked versus Adaptive Algorithms

The therapeutic guidance assumes a fixed protocol throughout the trial, with adaptations limited to pre-specified options. Genomic diagnostics, particularly those incorporating machine learning components (as many modern variant interpretation pipelines do), present a different problem: the Predetermined Change Control Plan framework that FDA developed for AI/ML-enabled device software anticipates ongoing algorithm updates.

This intersects with Bayesian methodology in ways the therapeutic guidance does not address. A diagnostic algorithm that updates its variant classifications as new evidence accumulates is essentially performing continuous Bayesian updating in production. The regulatory questions — when does an update require new submission, how is performance monitored post-market, what constitutes prior-data conflict in a deployed algorithm — are distinct from anything in the drug context. The therapeutic guidance's framework of pre-specified analysis priors with limited sensitivity-analysis design priors does not map cleanly onto a system designed for ongoing learning.

5. Type I Error Has Different Meaning and Stakes

The therapeutic guidance's bifurcation between Type I error-calibrated designs and fully Bayesian designs presupposes that the central inferential question is whether a treatment effect exceeds zero (or a non-inferiority margin). Diagnostic studies typically test multiple performance characteristics simultaneously against pre-specified acceptance criteria:

  • Sensitivity ≥ some threshold (with appropriate confidence)
  • Specificity ≥ some threshold
  • Reproducibility within acceptable bounds
  • Limit of detection at or below claimed value

The multiplicity structure is fundamentally different. CDRH has historically used Bayesian methods with explicit acknowledgment of Type I error inflation, often accepting somewhat higher Type I error rates when borrowing is well-justified — a posture Berry Consultants identified as a real difference between CDRH and CDER/CBER practice. The new CDER/CBER guidance does not change this; it leaves the device-side approach intact, but PhD-level practitioners working across both domains need to recognize that the regulatory calculus differs.

6. The Loss Function Is More Tractable

The therapeutic guidance gestures toward decision-theoretic approaches with loss functions but does not develop them extensively. In diagnostics, the loss function structure is often more concrete and quantifiable:

  • False positives lead to specific downstream costs: confirmatory testing, unnecessary treatment, anxiety, potential harm from invasive workup
  • False negatives lead to specific clinical consequences: missed diagnosis, delayed treatment, potential progression
  • These can often be expressed in QALY terms or direct economic terms more readily than the loss function for approving an ineffective drug

This makes net benefit analysis — Vickers and colleagues' decision-curve framework being one prominent example — a natural setting for Bayesian methods in diagnostics. The prior over disease prevalence in the intended-use population directly enters the calculation, and sensitivity to this prior is a routine sensitivity analysis. The therapeutic guidance's relatively underdeveloped treatment of loss functions reflects, in part, that the loss function question is genuinely harder in the drug context.
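As a rough illustration of how prevalence enters a net benefit calculation, the sketch below uses the standard decision-curve form (true positives minus false positives weighted by the threshold odds). The sensitivity, specificity, prevalence values, and risk threshold are assumptions chosen only to show the mechanics.

```python
def net_benefit(sens, spec, prevalence, threshold):
    """Net benefit per tested patient at a given risk threshold."""
    odds = threshold / (1.0 - threshold)  # harm of a false positive relative to a true positive
    true_pos = sens * prevalence
    false_pos = (1.0 - spec) * (1.0 - prevalence)
    return true_pos - false_pos * odds


def net_benefit_treat_all(prevalence, threshold):
    """Reference strategy: act on everyone, equivalent to sens=1 and spec=0."""
    return net_benefit(1.0, 0.0, prevalence, threshold)


# Sensitivity analysis over hypothetical intended-use prevalences
for prev in (0.01, 0.05, 0.20):
    nb_test = net_benefit(0.90, 0.95, prev, threshold=0.10)
    nb_all = net_benefit_treat_all(prev, threshold=0.10)
    print(f"prevalence {prev:.2f}: test {nb_test:.4f}, treat-all {nb_all:.4f}")
```

Replacing the fixed prevalence with draws from a prior (or posterior) over prevalence would turn each printed value into a distribution, which is the routine sensitivity analysis the text refers to.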

7. Companion Diagnostics: The Hybrid Case

Companion diagnostics, co-developed with targeted therapies, sit at the intersection. Here the therapeutic guidance applies to the drug arm and the device guidance applies to the test, but the integrated development program raises Bayesian questions that neither guidance fully addresses:

  • How should prior information from the analytical validation of the diagnostic inform the design of the therapeutic trial that depends on it?
  • When the cutpoint for biomarker positivity is itself uncertain, how should that uncertainty propagate into the treatment effect estimate?
  • How should the inevitable drift between the trial-use assay and the commercial assay be handled in regulatory inference?

These are inherently Bayesian questions, but the regulatory framework remains compartmentalized in ways that PhD-level methodologists working in oncology biomarker programs encounter constantly.

8. Post-Market Surveillance and Performance Drift

The therapeutic guidance is largely silent on post-market Bayesian updating, treating the trial as the inferential endpoint. For genomic diagnostics, performance can drift for reasons that have no therapeutic analog:

  • Population allele frequencies shift as the test is deployed in new geographies
  • The spectrum of variants encountered changes as the test moves from research to clinical use
  • Reagent reformulations and platform updates introduce subtle performance changes
  • The distribution of indications for testing changes as clinical guidelines evolve

Bayesian sequential monitoring frameworks for diagnostic performance — updating posterior estimates of sensitivity and specificity as confirmed cases accumulate post-market — are an active methodological area with no clean parallel in the drug guidance, which assumes that post-approval evidence generation occurs through formally structured studies rather than continuous Bayesian updating from routine use.
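A minimal sketch of such sequential monitoring follows. It assumes a uniform Beta(1, 1) starting prior (in practice one might start from the premarket posterior), a hypothetical sensitivity floor of 0.90, and an arbitrary alert rule; the batch counts are invented. The Beta tail probability is computed exactly through its identity with a binomial tail, which requires integer parameters and modest counts.

```python
from math import comb


def beta_cdf_int(x, a, b):
    """P(X <= x) for X ~ Beta(a, b) with positive integer a and b.

    Uses the identity P(X <= x) = P(Binomial(a + b - 1, x) >= a).
    """
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a, n + 1))


def monitor_sensitivity(batches, floor=0.90, alert_prob=0.95):
    """Update a Beta(1, 1) prior batch by batch; flag when P(sens < floor) > alert_prob.

    Each batch is (detected, confirmed_positives). The floor and alert level
    are hypothetical choices, not regulatory values.
    """
    a, b = 1, 1
    history = []
    for detected, confirmed in batches:
        a += detected
        b += confirmed - detected
        p_below = beta_cdf_int(floor, a, b)
        history.append((a, b, p_below, p_below > alert_prob))
    return history


# Invented post-market batches with gradually declining detection
for a, b, p_below, alert in monitor_sensitivity([(48, 50), (46, 50), (41, 50), (39, 50)]):
    print(f"Beta({a}, {b}): P(sens < 0.90) = {p_below:.3f}, alert = {alert}")
```

A deployed scheme would also need to handle the fact that confirmed cases in routine use are rarely a random sample of tested patients.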

9. The Replication and Generalizability Critique Cuts Differently

The Evans-Fleming-Janes-Dodd critique emphasizes that Bayesian borrowing in confirmatory therapeutic trials risks compromising the randomization-based warrant for unbiased inference. This concern does not transfer directly to diagnostics, where the inferential foundation is not randomization but rather representative sampling from the intended-use population.

But a parallel concern does apply: diagnostic accuracy estimates derived from spectrum-enriched validation cohorts (e.g., samples enriched for known positives and clear negatives) systematically overestimate real-world performance. Bayesian methods can either ameliorate this (by formalizing the discounting needed to translate validation performance to deployment performance) or worsen it (by importing optimistic priors from non-representative cohorts into new validation studies). The structural risk is analogous; the mechanism differs.

10. Regulatory Sociology and Sponsor Capability

The therapeutic guidance addresses sponsors with substantial biostatistical infrastructure — large pharmaceutical companies with dedicated Bayesian methods groups. The diagnostic industry is more heterogeneous: large IVD manufacturers, mid-size sequencing companies, single-laboratory LDT providers, and academic centers all operate in this space. The implicit assumption of sophisticated sponsor capability that runs through the therapeutic guidance applies less uniformly in the diagnostic context, where regulatory burden falls on a much wider distribution of organizational types. A Bayesian methodology guidance for diagnostics — were one to be developed in parallel to the therapeutic version — would need to address this capability heterogeneity more explicitly than the current CDER/CBER document does.

Synthesis

A Bayesian methodology guidance for genomic diagnostics would not be a simple translation of the January 2026 therapeutic document. The estimands differ, the role of randomization differs, the multiplicity structure differs, the borrowing problems differ, the post-market lifecycle differs, and the relationship between Bayesian inference and the underlying scientific question differs. In some respects Bayesian methods are more natural in diagnostics — PPV is intrinsically Bayesian, hierarchical borrowing across variants is often necessary rather than optional, and loss functions are more tractable. In other respects the methodological challenges are sharper — reference standard imperfection, performance drift, equity implications of population-specific performance, and the integration with continuously-learning algorithms.

The therapeutic guidance is best understood as one document in what should ultimately be a coordinated family of statistical guidances. CDRH's 2010 document on medical device Bayesian statistics anticipated some of these issues but predates both the AI/ML era and the modern genomic diagnostics landscape. The most consequential gap in current FDA Bayesian methodology guidance is arguably not in the therapeutic space — which the January 2026 draft now substantially addresses — but in the diagnostic and combination-product spaces, where the inferential questions are in some ways better suited to Bayesian formalism but where formal guidance has not kept pace with either the science or the deployment patterns of modern genomic tests.


##

CHAT GPT

##

Yes — the analysis changes substantially if the target is Bayesian design for genomic diagnostics rather than Bayesian design for therapeutic drug trials. The same statistical vocabulary applies — priors, posteriors, operating characteristics, borrowing, sensitivity analyses — but the evidentiary object is different.

For drugs, FDA’s January 2026 draft guidance is centered on Bayesian methods to support primary inference in clinical trials intended to establish effectiveness and safety of drugs and biologics. FDA explicitly frames use cases such as interim adaptation, dose selection, and primary inference in trials supporting drug safety and effectiveness. (U.S. Food and Drug Administration) For genomic tests, the central question is not usually, “Does the intervention improve outcomes?” It is more often, “Does this test accurately, reproducibly, and clinically usefully classify patients, variants, tumors, residual disease, or treatment-relevant molecular states?”

1. The target of inference changes: treatment effect vs diagnostic performance

In a therapeutic trial, the principal Bayesian target is often a treatment effect: hazard ratio, response-rate difference, mean change, risk difference, noninferiority margin, dose-response function, or benefit-risk quantity. The regulatory question is whether the drug produces a sufficiently reliable clinical effect in the intended population.

For a genomic diagnostic, the key parameters are different. They may include sensitivity, specificity, positive percent agreement, negative percent agreement, limit of detection, variant-calling accuracy, reproducibility, tumor-fraction thresholds, concordance with tissue testing, positive predictive value, negative predictive value, or clinical reclassification performance. FDA’s diagnostic-test statistical guidance emphasizes that sensitivity and specificity are estimates from a subset of the intended-use population, and that PPV and NPV help characterize how test results should be interpreted. (U.S. Food and Drug Administration)

That changes the Bayesian analysis. In a drug trial, a posterior distribution may answer, “What is the probability the treatment effect exceeds zero or exceeds a clinically meaningful threshold?” In a genomic test submission, the posterior may answer, “What is the probability that sensitivity exceeds 95% at a specified allele fraction?” or “What is the posterior distribution for false-negative risk in Stage II colon cancer MRD detection?” The inferential quantity is not therapeutic benefit; it is classification performance under a specified intended use.
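The first of those diagnostic questions can be sketched in a few lines. The example assumes a binomial model for detections among reference-positive samples, a Jeffreys Beta(0.5, 0.5) prior, and Monte Carlo draws from the posterior; the counts and the 95% threshold are illustrative, and the prior choice is an assumption rather than anything prescribed by FDA.

```python
import random


def prob_sensitivity_exceeds(detected, positives, threshold=0.95,
                             prior_a=0.5, prior_b=0.5, draws=100_000, seed=0):
    """Monte Carlo estimate of P(sensitivity > threshold | data).

    The posterior is Beta(prior_a + detected, prior_b + missed) under a
    binomial likelihood; the default Jeffreys prior is an assumption.
    """
    rng = random.Random(seed)
    a = prior_a + detected
    b = prior_b + (positives - detected)
    hits = sum(rng.betavariate(a, b) > threshold for _ in range(draws))
    return hits / draws


# Hypothetical validation result: 98 of 100 reference-positive samples detected
print(prob_sensitivity_exceeds(98, 100))
```

The same function applied within strata (for example, by allele fraction bin) gives stratum-specific posterior probabilities, at the cost of wider uncertainty in sparsely sampled strata.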

2. The regulatory center of gravity shifts from CDER/CBER to CDRH, and from drug approval logic to device/diagnostic logic

The 2026 Bayesian draft guidance is a CDER/CBER drug-and-biologics document. Diagnostics generally sit under CDRH device/IVD regulation, although companion diagnostics and certain biologic-related assays create cross-center interactions. FDA already has a separate Bayesian guidance for medical device clinical trials, which states that it addresses the statistical design and analysis of medical device trials using Bayesian methods. (U.S. Food and Drug Administration)

That matters because devices and diagnostics often have a different development logic. FDA’s medical-device Bayesian guidance notes that prior information may be more available for devices because devices often evolve incrementally, have physical mechanisms of action, and may have evidence from prior generations or overseas use. (U.S. Food and Drug Administration) For genomic diagnostics, the analogy is not perfect, but it is relevant: prior information may come from earlier versions of the assay, comparator platforms, analytical validation datasets, orthogonal sequencing, curated variant databases, public genomic repositories, natural-history cohorts, or prior clinical studies.

Thus, the diagnostic Bayesian question often becomes: How much can we borrow from technically or biologically adjacent evidence without overstating performance in the exact intended-use population?

3. “Prior information” is more likely to be technical, platform-based, or variant-class-based

For drugs, a prior may come from a prior Phase 2 study, adult data used for pediatric extrapolation, historical controls, a platform trial, or a related disease subtype. FDA’s drug guidance includes examples of borrowing from previous clinical trials, augmenting controls with external or nonconcurrent controls, pediatric extrapolation, borrowing across disease subtypes, subgroup borrowing, and oncology dose finding. (U.S. Food and Drug Administration)

For genomic diagnostics, prior information may be more granular and technical. Examples include:

Prior assay versions. A lab may have an earlier NGS panel and now adds genes, changes chemistry, switches instruments, modifies bioinformatics, or lowers the limit of detection. Bayesian borrowing might support bridging between old and new versions.

Variant-class borrowing. SNVs, indels, CNVs, fusions, MSI, TMB, methylation signals, and fragmentomic features may have different error structures. A Bayesian model might borrow across variant classes only if the wet-lab and informatic mechanisms are sufficiently similar.

Platform or site borrowing. If the same assay is run across laboratories, Bayesian hierarchical models might estimate site effects, lot effects, operator effects, batch effects, or instrument effects.

Synthetic and contrived samples. Genomic diagnostics often rely on blends, cell lines, engineered materials, reference standards, and dilution series. Bayesian designs may formally combine contrived-sample evidence with clinical-sample evidence, but the key regulatory question becomes whether contrived samples really represent clinical specimen behavior.

That is very different from a therapeutic prior. The prior is not just “belief about drug efficacy.” It may be knowledge about measurement performance.

4. The central bias problem changes: confounding vs reference-standard and spectrum bias

In therapeutic trials, the classic threat is confounding, especially when randomized evidence is diluted by external controls or historical borrowing. That is why critics of Bayesian methods in confirmatory drug trials worry about compromising the evidentiary value of randomization.

For diagnostics, the bias problem is different. FDA’s diagnostic-test guidance emphasizes that sensitivity and specificity can be biased; simply increasing the sample size does not reduce bias; and key sources include error in the reference standard and incorporation of the candidate test into the definition of the target condition. (U.S. Food and Drug Administration)

For genomic tests, major bias risks include:

Reference-standard bias. What is the truth comparator? Sanger? ddPCR? another NGS panel? tissue biopsy? clinical adjudication? longitudinal recurrence? orthogonal methylation assay?

Partial verification bias. If only discordant or selected samples undergo deeper adjudication, performance estimates can be biased.

Spectrum bias. A test may perform well in obvious high-tumor-fraction samples but less well in low-shedding tumors, early-stage disease, low-input FFPE, degraded cfDNA, or minority ancestry groups underrepresented in variant databases.

Prevalence distortion. Enriched validation sets can estimate sensitivity/specificity but may not support real-world PPV/NPV.

Incorporation bias. If the new test influences the adjudicated truth standard, the apparent performance can be inflated.

FDA’s diagnostic guidance is explicit that when no true reference standard is available, sensitivity and specificity may not be appropriate terms; agreement measures may be required instead, and agreement is not correctness because two tests can agree and both be wrong. (U.S. Food and Drug Administration) That point becomes central in Bayesian genomic-test design.

5. The intended-use statement becomes the anchor

For drugs, the analog is the indication. For diagnostics, the intended use is even more determinative. Bayesian design must be tied to who is tested, with what specimen, at what disease stage, for what clinical decision, and against what truth standard.

A Bayesian analysis of a genomic test is incomplete unless it specifies whether the test is intended for:

therapy selection, such as detecting actionable mutations;

companion diagnostic use, where false negatives may deny an effective therapy and false positives may expose patients to an ineffective or harmful therapy;

screening, where prevalence is low and false positives can dominate;

minimal residual disease, where the key question may be recurrence risk or lead time;

monitoring, where serial dynamics matter more than a single binary result;

diagnosis of rare inherited disease, where prior probability may be shaped by phenotype, family history, ancestry, and variant interpretation;

tumor profiling, where the endpoint may be concordance with tissue, detection of actionable variants, or successful classification.

The Bayesian model must serve the intended-use claim. A beautiful posterior distribution for analytical sensitivity is not enough if the label claim is about clinical recurrence prediction.

6. Clinical utility becomes harder to separate from clinical validity

For therapeutics, the clinical utility is intrinsic: if the drug improves survival, symptoms, or another accepted endpoint, the intervention has clinical value subject to benefit-risk.

For diagnostics, accuracy is not automatically utility. A genomic test can be analytically excellent and clinically useless, or clinically interesting but not yet action-guiding. Bayesian methods may help estimate diagnostic performance, but the regulatory and payer questions may require additional layers:

Does the result change management?

Is the management change evidence-based?

Does earlier molecular detection improve outcome or merely move the clock?

Does the test identify patients who benefit from a drug, surveillance program, or de-escalation strategy?

Does a negative result safely avoid treatment, biopsy, imaging, or chemotherapy?

For FDA, some diagnostic claims may be cleared or approved based on analytical and clinical validity. For payers, especially Medicare, the clinical utility argument may dominate. A Bayesian design that efficiently estimates sensitivity and specificity may therefore be only the beginning of the commercial evidence story.

7. Bayesian borrowing may be more attractive — but also more treacherous — in genomics

Genomics is full of repeated structure: genes, variants, samples, sites, batches, instruments, bioinformatic pipelines, cancer types, and molecular subgroups. That makes Bayesian hierarchical modeling tempting. It can borrow strength across rare variants, rare cancers, low-prevalence biomarkers, and small subgroups.

This is especially useful where conventional validation becomes numerically absurd. For example, validating every rare fusion, every exon-level deletion, every low-frequency variant, and every specimen subtype with large independent sample sizes may be impossible. Bayesian modeling can help by allowing partial pooling across biologically or technically related categories.

But the danger is false exchangeability. An NTRK fusion, an EGFR exon 19 deletion, a MET exon 14 skipping event, a copy-number amplification, and a methylation classifier are not interchangeable just because they appear on the same report. Similarly, a variant called at 35% allele fraction in a high-quality tissue sample is not exchangeable with a 0.08% ctDNA signal in plasma. Bayesian borrowing is powerful only when the borrowed units are genuinely comparable.

The practical rule would be: borrow across genomic units only when the wet-lab process, informatic process, specimen type, error mode, and clinical interpretation are sufficiently aligned.

8. Adaptive designs are less about patient assignment and more about sample allocation, thresholds, and enrichment

In therapeutic trials, Bayesian adaptive design often means modifying randomization, stopping early for futility or success, dropping arms, selecting doses, or enriching a subgroup.

In genomic diagnostics, adaptive Bayesian design may look different. It may involve:

adaptive allocation of validation samples to poorly characterized variant classes;

sequential expansion of rare-variant cohorts;

adaptive refinement of cutoffs, such as tumor fraction or methylation classifier thresholds;

planned enrichment for discordant or borderline cases;

adaptive stopping when precision around sensitivity or specificity is adequate;

Bayesian monitoring of lot-to-lot, site-to-site, or software-version drift;

posterior predictive checks for performance in underrepresented subgroups.

This is a major conceptual difference. In diagnostics, the “trial” may be less like a drug RCT and more like a measurement-system validation program.

9. Operating characteristics should be diagnostic-specific

For drugs, the standard operating characteristics include Type I error, power, bias, MSE, coverage, and probability of correct decision. The 2026 drug guidance explicitly discusses Bayesian calculations for primary inference and adaptive rules in drug trials. (U.S. Food and Drug Administration)

For genomic diagnostics, operating characteristics should include quantities such as:

posterior probability that sensitivity exceeds a clinically acceptable floor;

posterior probability that specificity exceeds a minimum threshold;

expected false negatives and false positives per 1,000 tested patients at plausible prevalence levels;

posterior distribution of PPV and NPV across intended-use prevalence scenarios;

probability of incorrect classification near the cutoff;

probability of no-call, indeterminate, or quantity-not-sufficient results;

robustness to specimen quality, tumor fraction, ancestry, batch, instrument, and site effects;

performance drift after software updates.

This is particularly important because a genomic test may have multiple performance regimes. A ctDNA MRD assay, for example, may behave very differently by tumor type, stage, shedding biology, time from surgery, adjuvant therapy, and blood-draw schedule. A single posterior mean sensitivity may be less informative than a hierarchical posterior profile across clinically meaningful strata.
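Several of the quantities listed above follow directly from Bayes' theorem once a prevalence is fixed. The sketch below uses plug-in values for sensitivity and specificity, which is a simplification: in a Bayesian analysis one would pass posterior draws through the same function to obtain a distribution for each output. All numbers are hypothetical.

```python
def per_thousand(sens, spec, prevalence, n=1000):
    """Expected errors per n tested patients, plus PPV and NPV, at one prevalence."""
    diseased = n * prevalence
    healthy = n - diseased
    tp = sens * diseased
    fn = diseased - tp
    tn = spec * healthy
    fp = healthy - tn
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    npv = tn / (tn + fn) if tn + fn else float("nan")
    return {"false_neg": fn, "false_pos": fp, "ppv": ppv, "npv": npv}


# Same assay, three hypothetical intended-use prevalences
for prev in (0.005, 0.05, 0.30):
    r = per_thousand(sens=0.90, spec=0.99, prevalence=prev)
    print(f"prevalence {prev}: FN {r['false_neg']:.1f}, FP {r['false_pos']:.1f}, "
          f"PPV {r['ppv']:.3f}, NPV {r['npv']:.3f}")
```

Running the function across strata such as tumor type or stage, each with its own sensitivity estimate, gives the kind of stratified profile described above rather than a single summary number.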

10. The false-positive / false-negative asymmetry may dominate the Bayesian decision rule

In therapeutic trials, false approval and false rejection are usually framed as approval of ineffective therapy versus failure to approve effective therapy, with safety layered into benefit-risk.

In diagnostics, the harms are more context-specific:

A false positive companion diagnostic may send a patient to an ineffective targeted therapy.

A false negative companion diagnostic may deny a highly effective drug.

A false positive MRD test may trigger unnecessary chemotherapy, imaging, anxiety, or trial enrollment.

A false negative MRD test may falsely reassure and delay treatment.

A false positive screening test may initiate cascades of imaging, biopsy, and overdiagnosis.

A false negative screening test may delay cancer diagnosis.

Thus, Bayesian diagnostic design naturally invites decision analysis. The relevant threshold may not be “posterior probability of performance > X,” but expected clinical harm under different test-result pathways. FDA’s device Bayesian guidance explicitly includes decision analysis, predictive probabilities, interim analyses, model checking, and sensitivity analysis among topics for Bayesian medical device trials. (U.S. Food and Drug Administration)

For genomic diagnostics, this is where Bayesian methods could be especially valuable: they can formalize asymmetric consequences of false positives and false negatives in a way that conventional sensitivity/specificity tables often do not.
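One way to picture this is a simple expected-loss comparison across candidate cutoffs. The sketch assumes known sensitivity and specificity at each cutoff and relative losses for false negatives and false positives; the cutoff labels, performance values, and loss weights are all hypothetical.

```python
def expected_loss(sens, spec, prevalence, loss_fn, loss_fp):
    """Expected loss per tested patient from false negatives and false positives."""
    return prevalence * (1 - sens) * loss_fn + (1 - prevalence) * (1 - spec) * loss_fp


def best_cutoff(candidates, prevalence, loss_fn, loss_fp):
    """candidates maps a cutoff label to its (sensitivity, specificity)."""
    return min(
        candidates,
        key=lambda c: expected_loss(*candidates[c], prevalence, loss_fn, loss_fp),
    )


# Hypothetical operating points for one assay
cutoffs = {"low": (0.95, 0.90), "mid": (0.90, 0.97), "high": (0.80, 0.995)}

# Same loss weights, two intended uses with very different prevalence
print(best_cutoff(cutoffs, prevalence=0.01, loss_fn=20, loss_fp=1))
print(best_cutoff(cutoffs, prevalence=0.30, loss_fn=20, loss_fp=1))
```

With these made-up inputs the preferred cutoff differs between the low-prevalence and high-prevalence settings, which is the sense in which the intended use, and not the assay alone, determines the decision rule.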

11. The payer implications are different and probably more severe

For drugs, FDA approval often creates substantial market access momentum, although payers may restrict coverage.

For genomic diagnostics, FDA authorization does not automatically solve reimbursement. Medicare and commercial payers may still ask whether the test is reasonable and necessary, whether it changes management, whether the evidence supports improved outcomes, whether the tested population matches coverage criteria, and whether serial testing is justified.

Bayesian evidence may be persuasive to FDA for a diagnostic performance claim but less persuasive to payers if the clinical utility chain is incomplete. This is especially true for MRD, MCED, methylation classifiers, AI-derived genomic signatures, and recurrence-risk tests. A Bayesian validation design should therefore be paired with a payer-facing evidence plan: decision impact, outcomes, real-world management changes, and health-economic consequences.

12. The article’s tone would shift from “modernizing pivotal trials” to “formalizing evidence integration for complex measurement systems”

If rewriting the earlier article for genomic diagnostics, I would change the organizing thesis.

For therapeutics, the article is about whether Bayesian methods can support drug approval-level causal inference while preserving evidentiary standards.

For genomic diagnostics, the article would be about whether Bayesian methods can support credible performance estimation and clinical interpretation in complex, high-dimensional, low-prevalence, rapidly evolving measurement systems.

That is a different article.

The major sections would become:

1. Bayesian methods as a tool for diagnostic evidence integration.
The emphasis would be on combining analytical validation, clinical validation, orthogonal comparator data, reference materials, prior assay versions, public genomic knowledge, and real-world datasets.

2. Intended use as the Bayesian anchor.
Every prior, endpoint, and operating characteristic would be tied to a specific specimen, population, disease state, and clinical decision.

3. Reference standards and the problem of truth.
This would become a central section, not a side issue. FDA’s diagnostic guidance makes clear that when the comparator is not a reference standard, sensitivity and specificity may be inappropriate and agreement is not correctness. (U.S. Food and Drug Administration)

4. Borrowing across variants, samples, sites, and platforms.
This would be the diagnostics analog of borrowing across trials, diseases, or subgroups in drug development.

5. Bayesian decision thresholds and clinical consequences.
The article would discuss false-positive and false-negative harms for specific diagnostic contexts.

6. FDA authorization versus payer adoption.
For diagnostics, this deserves a major section because clinical validity and clinical utility are often separated in ways that are less true for therapeutics.

Bottom line

For drugs, Bayesian design asks: Can we use prior information and accumulating trial data to make a reliable causal claim about safety and effectiveness?

For genomic diagnostics, Bayesian design asks: Can we integrate heterogeneous molecular, analytical, clinical, and prior-platform information to make a reliable claim about test performance and clinical interpretability in a defined intended-use population?

The therapeutic version is mainly about causal inference under regulatory evidentiary standards. The diagnostic version is mainly about classification accuracy, reference truth, measurement validity, and clinical decision consequences. Bayesian methods may be even more naturally suited to genomic diagnostics because genomics is modular, hierarchical, sparse, and information-rich. But that same structure makes over-borrowing, hidden bias, and false exchangeability especially dangerous.

Chat GPT and CLAUDE OPUS Review FDA Draft Guidance for Bayesian Trials

In January 2026, FDA released a 29-page draft guidance on Bayesian design in clinical drug trials.

I gave that guidance, along with supporting documents (about 6-8 in total, from JAMA, LinkedIn, etc.), and asked for an evaluation. Here I publish Chat GPT first and Claude Opus below.

###

FDA BAYES GUIDANCE 01-2026  by CHAT GPT and then CLAUDE OPUS

 

# # #

Write a detailed and objective article about the FDA draft guidance, what it accomplishes, its implications, and drawbacks or concerns. 

In a closing section, feel free to go beyond the PDFs provided and contribute other germane thought capital to the topic of FDA usage of Bayesian designs and analyses.

You are writing for PhD and MD-PhD professionals.

 

# # #

 See a separate discussion "What about diagnostics?" here.

 

Contents

Chat GPT

Claude

CHAT GPT


 

FDA’s Draft Bayesian Guidance: A Quietly Important Shift in Regulatory Trial Design

FDA’s January 2026 draft guidance, “Use of Bayesian Methodology in Clinical Trials of Drug and Biological Products,” is not a wholesale replacement of the randomized controlled trial paradigm. Nor is it a license for sponsors to substitute optimism, external data, or “modern statistics” for persuasive evidence. Its real importance is more specific and more technical: it gives CDER and CBER sponsors a more explicit regulatory language for when Bayesian methods may support primary inference in trials intended to establish safety and effectiveness. That is a meaningful step, because Bayesian methods have long been used in dose finding, pediatric extrapolation, adaptive designs, device studies, and selected rare-disease settings, but their use in pivotal drug and biologic trials has remained cautious, uneven, and often negotiated case by case. FDA’s draft guidance attempts to move that negotiation into a more transparent framework.

The guidance is draft and nonbinding. FDA emphasizes that it represents current thinking, does not create legally enforceable duties, and allows alternative approaches that satisfy applicable statutes and regulations. That matters because Bayesian trial design is not a single method. It is a family of inferential, computational, and evidentiary approaches, ranging from almost-frequentist analyses with minimally informative priors to highly consequential borrowing from external or historical information. FDA’s document therefore functions less like a cookbook and more like a regulatory grammar: define the prior, define the estimand, define the success criterion, evaluate operating characteristics, test sensitivity, document the computational workflow, and make clear what part of the conclusion is coming from current randomized data versus earlier information.

What Bayesian analysis changes

In a conventional frequentist pivotal trial, the familiar regulatory question is framed around long-run error control: assuming a null hypothesis, how often would a design falsely conclude success? Bayesian analysis begins elsewhere. A prior distribution summarizes pre-study information or uncertainty about a parameter of interest; the likelihood summarizes the observed trial data; the posterior distribution then expresses updated information after the trial. In the FDA’s phrasing, the posterior can be used to draw conclusions about efficacy or safety, and it may be summarized through posterior means, credible intervals, and posterior probabilities such as the probability that the treatment effect exceeds zero.
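As a minimal sketch of the prior-to-posterior update described above, assume a normal prior on the treatment effect and a normally distributed trial estimate with a known standard error. This conjugate model is chosen only for illustration; the guidance does not prescribe it, and the function name and all numbers are hypothetical.

```python
from statistics import NormalDist

def normal_posterior(prior_mean, prior_sd, estimate, std_err):
    """Conjugate update: normal prior combined with a normal likelihood."""
    prior_prec = 1.0 / prior_sd ** 2   # precision = 1 / variance
    data_prec = 1.0 / std_err ** 2
    post_prec = prior_prec + data_prec
    post_mean = (prior_prec * prior_mean + data_prec * estimate) / post_prec
    return post_mean, post_prec ** -0.5

# Hypothetical inputs: a vague prior and an observed effect of 0.30 (SE 0.15).
post_mean, post_sd = normal_posterior(0.0, 10.0, 0.30, 0.15)
posterior = NormalDist(post_mean, post_sd)
prob_positive = 1.0 - posterior.cdf(0.0)      # Pr(effect > 0 | data)
ci_low = posterior.inv_cdf(0.025)             # 95% credible interval, lower bound
ci_high = posterior.inv_cdf(0.975)            # 95% credible interval, upper bound
print(f"mean={post_mean:.3f} CrI=({ci_low:.3f}, {ci_high:.3f}) Pr(>0)={prob_positive:.3f}")
```

With a prior this vague the posterior is driven almost entirely by the trial data, which is the sense in which a minimally informative prior is "overwhelmed" by adequate data.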

That shift is attractive because it often maps more directly to how clinicians, regulators, and patients speak. “What is the probability the drug has a clinically meaningful effect?” is a more natural question than “How surprising would these data be if the null hypothesis were true?” However, the regulatory problem is not merely philosophical. FDA must decide whether a trial result is reliable enough to support approval, labeling, or a supplement. For that purpose, Bayesian posterior probabilities do not eliminate the need to evaluate trial performance. They change what must be evaluated.

FDA’s guidance is particularly attentive to this distinction. It separates trials that use Bayesian analysis inside a broadly frequentist framework—where Bayesian posterior probabilities may be calibrated to Type I error control—from trials that use a more fully Bayesian framework, particularly when external information is incorporated through an informative prior. In the first case, the Bayesian machinery may be used to create an adaptive rule or posterior-probability decision threshold, but the design is still judged by familiar frequentist operating characteristics: Type I error, power, bias, mean squared error, and interval coverage. In the second case, especially when borrowing is central, FDA recognizes that the operating-characteristic conversation is different: the reliability of the decision depends strongly on the prior, the fit of the data model, the design, and the measurement process.

What the guidance accomplishes

The first accomplishment is normalization. The guidance makes clear that Bayesian methods are not exotic exceptions. FDA lists several settings in which Bayesian methods have already appeared in development programs: borrowing from prior clinical trials, augmenting concurrent randomized controls with external or nonconcurrent control data, pediatric extrapolation, borrowing across related diseases or disease subtypes, subgroup estimation through hierarchical modeling, and oncology dose-finding. The examples include REBYOTA, where a Phase 3 analysis incorporated data from a prior placebo-controlled Phase 2 study; oncology platform trials such as GBM AGILE and Precision Promise, where nonconcurrent controls and temporal effects are methodological concerns; pediatric extrapolation examples involving empagliflozin and linagliptin; and basket or related-disease contexts where hierarchical models can borrow across groups.

The second accomplishment is regulatory taxonomy. FDA forces sponsors to specify whether they are using Bayesian methods for interim adaptation, dose selection, primary inference, borrowing from external information, subgroup estimation, or some combination. This is useful because “Bayesian trial” is otherwise too vague. A dose-escalation model in early oncology development, a Bayesian hierarchical subgroup model, an adaptive platform trial using predictive probabilities, and a pivotal trial borrowing adult data for pediatric inference pose very different evidentiary risks.

The third accomplishment is discipline around success criteria. FDA recognizes that frequentist success criteria may not be directly applicable or appropriate when the primary analysis is Bayesian. A Bayesian success criterion might be framed as a posterior probability that the treatment effect exceeds a threshold, or it might be built into a more explicit benefit-risk or decision-theoretic framework. FDA also notes that more complex approaches may consider disease seriousness, availability of approved therapies, the harm of approving an ineffective drug, and the harm of failing to approve an effective drug. In such settings, a “loss function” could formally combine safety and effectiveness considerations into a benefit-risk decision rule.
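One hedged way to read the loss-function idea is a two-action decision problem: approving an ineffective drug incurs one loss, failing to approve an effective drug incurs another, and the action with the lower posterior expected loss is chosen. The two-action setup, the function names, and the loss values below are illustrative assumptions, not FDA's.

```python
def approval_threshold(loss_false_approval, loss_false_rejection):
    """Posterior probability of effectiveness above which approving has lower expected loss."""
    if loss_false_approval <= 0 or loss_false_rejection <= 0:
        raise ValueError("losses must be positive")
    return loss_false_approval / (loss_false_approval + loss_false_rejection)

def decide(prob_effective, loss_false_approval, loss_false_rejection):
    # Expected loss of each action under the posterior.
    loss_if_approve = (1.0 - prob_effective) * loss_false_approval
    loss_if_reject = prob_effective * loss_false_rejection
    return "approve" if loss_if_approve < loss_if_reject else "do not approve"

common_disease = approval_threshold(39.0, 1.0)    # hypothetical 39:1 loss ratio
serious_disease = approval_threshold(9.0, 1.0)    # hypothetical 9:1 loss ratio
print(common_disease, serious_disease)
print(decide(0.95, 39.0, 1.0), "|", decide(0.95, 9.0, 1.0))
```

Under these assumptions a 39:1 loss ratio reproduces the familiar 0.975 threshold, while a lower ratio (for example, a serious disease with no approved therapy) lowers the bar. The same posterior probability of 0.95 leads to different decisions under the two ratios.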

The fourth accomplishment is clarity about priors. FDA states plainly that the prior distribution is the feature that distinguishes Bayesian from frequentist approaches. Noninformative or minimally informative priors may be used when no relevant prior information exists and will often be overwhelmed by adequate trial data. Informative priors require more justification because they can materially affect conclusions. Sponsors must pre-specify and justify the prior, explain its influence, and show that the design’s operating characteristics remain acceptable under the proposed prior.

The fifth accomplishment is a practical review checklist. The guidance requires attention to prior-data conflict, external-data relevance, quality and reliability, pre-specification of source selection, sensitivity analyses, software reliability, MCMC convergence, reproducibility, and reporting. FDA recommends that sponsors provide results using alternative reasonable priors and, when borrowing external information, show posterior results under different degrees of borrowing. It also asks for model checking, prior-data conflict assessment, convergence diagnostics, software version details, documented code, and even random seeds for MCMC analyses where needed for reproducibility.

The most important technical distinction: posterior credibility vs regulatory reliability

One of the most subtle and important parts of the guidance is FDA’s discussion of false positive conclusions. In frequentist analysis, Type I error is the probability of rejecting the null hypothesis when the null is true. In Bayesian analysis, FDA emphasizes that a false positive conclusion is conditional on a positive conclusion, not on a true null hypothesis. Put differently, the Bayesian regulatory question can resemble: among trials that declare success, how often is the conclusion wrong? That is closer in spirit to a positive predictive value framework than to the classical Type I error definition.

Haitao Pan’s commentary usefully sharpens this distinction. A posterior probability threshold such as Pr(θ > 0 | data) ≥ 0.975 is a single-trial statement: conditional on the observed data and the model, the posterior probability that the treatment effect is positive is at least 97.5%. But that is not the same as saying that, across repeated uses of the decision rule, only 2.5% of declared successes will be false. That cross-trial reliability question must be evaluated through design-level simulations, particularly when priors are informative, sample sizes are small, or early stopping rules effectively reduce information.

This is more than a statistical nicety. It is probably the point on which many sponsor misunderstandings will turn. A posterior probability threshold may look familiar because 0.975 resembles the complement of a one-sided 0.025 alpha. But the number alone does not guarantee frequentist Type I error control. The trial’s behavior depends on the prior, the likelihood, the data-generating assumptions, the stopping rules, the endpoint distribution, and the amount of borrowing. In Bayesian regulatory work, the threshold is only one piece of the decision architecture.
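A small worked calculation makes the point concrete. Assuming a normal prior and a normal likelihood (an illustrative model, with hypothetical numbers), the frequentist Type I error of the rule "declare success when Pr(θ > 0 | data) ≥ 0.975" can be computed exactly:

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def type_one_error(prior_mean, prior_sd, std_err, threshold=0.975):
    """Exact Pr(declare success | true effect = 0) for the rule
    Pr(effect > 0 | data) >= threshold, with a normal prior and likelihood."""
    prior_prec = 1.0 / prior_sd ** 2
    data_prec = 1.0 / std_err ** 2
    z_needed = STD_NORMAL.inv_cdf(threshold)
    # Success requires posterior_mean / posterior_sd >= z_needed;
    # solve that inequality for the smallest qualifying trial estimate.
    critical_estimate = (z_needed * (prior_prec + data_prec) ** 0.5
                         - prior_prec * prior_mean) / data_prec
    # Under the null, the trial estimate is distributed N(0, std_err^2).
    return 1.0 - STD_NORMAL.cdf(critical_estimate / std_err)

vague = type_one_error(prior_mean=0.0, prior_sd=100.0, std_err=0.15)
informative = type_one_error(prior_mean=0.30, prior_sd=0.15, std_err=0.15)
print(f"vague prior: {vague:.3f}  informative prior: {informative:.3f}")
```

In this sketch the same 0.975 threshold gives roughly 2.5% Type I error under the vague prior and over 20% under an optimistic prior carrying as much information as the trial itself.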

Borrowing: the promise and the trap

The most consequential use of Bayesian methods in this guidance is borrowing—that is, using prior or external information to strengthen inference in the current trial. Borrowing is appealing in precisely the settings where conventional trials are difficult: rare diseases, pediatric indications, small molecularly defined oncology populations, platform trials, and settings where ethical or practical constraints make large randomized controls difficult.

But borrowing is also where Bayesian methods can do the most damage. If historical or external information is genuinely exchangeable with the current trial, borrowing can increase precision, reduce sample size, expose fewer patients to inferior treatment, and accelerate learning. If the external data are biased, noncomparable, selectively chosen, or affected by temporal drift, borrowing can import bias into the primary analysis with a statistical elegance that makes the problem harder to see.

FDA addresses this by requiring sponsors to evaluate data quality and reliability, pre-specification, relevance, study design, and availability of patient-level data. It also emphasizes prior-data conflict: the possibility that the observed current data are notably inconsistent with the prior. FDA recommends exploring this conflict through simulations across plausible scenarios, including no effect in the target population, and planning sensitivity analyses accordingly.

Berry Consultants’ Kert Viele broadly praises the guidance as rigorous and constructive, but his commentary also shows how technically complex this territory is. He notes that borrowing works best when historical and current data are similar, and can mislead when they diverge. He also stresses that dynamic borrowing, hierarchical models, and mixture priors can help reduce the influence of prior information when current data disagree with it, whereas static borrowing may be more brittle.
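The dynamic-borrowing behavior described here can be sketched with a two-component mixture prior on a response rate: an informative Beta component built from historical data and a vague Beta(1, 1) component. The component weights, the historical counts, and the function names are hypothetical, and this is only one of several dynamic-borrowing constructions.

```python
from math import lgamma, log, exp

def log_beta_fn(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def updated_mixture(components, responders, n):
    """components: list of (weight, a, b) Beta components.
    Returns the posterior components after observing responders out of n."""
    log_terms = []
    for weight, a, b in components:
        # Log marginal likelihood of the data under this component
        # (the binomial coefficient is shared by all components and cancels).
        log_ml = log_beta_fn(a + responders, b + n - responders) - log_beta_fn(a, b)
        log_terms.append(log(weight) + log_ml)
    top = max(log_terms)
    unnormalized = [exp(t - top) for t in log_terms]
    total = sum(unnormalized)
    return [(u / total, a + responders, b + n - responders)
            for u, (_, a, b) in zip(unnormalized, components)]

# Hypothetical prior: 80% weight on history (30/100 responders), 20% on a vague component.
prior = [(0.8, 30.0, 70.0), (0.2, 1.0, 1.0)]
consistent = updated_mixture(prior, responders=15, n=50)    # current data agree with history
conflicting = updated_mixture(prior, responders=35, n=50)   # current data disagree
print(round(consistent[0][0], 3), round(conflicting[0][0], 5))
```

When the current data resemble the historical data, the weight on the informative component rises; when they conflict, that weight collapses toward zero and the analysis falls back on the vague component. A static prior has no such release valve.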

A useful regulatory heuristic is this: Bayesian borrowing is most defensible when the biology, endpoint, population, care pathway, ascertainment, and trial conduct are all sufficiently aligned that the borrowed information would be persuasive even before formal modeling. The model can quantify borrowing; it cannot make irrelevant data relevant.

Implications for sponsors

For sponsors, the guidance creates an opportunity but also raises the bar for statistical and clinical planning. A Bayesian proposal cannot be treated as a late-stage rescue strategy after a trial is underpowered. It must be designed prospectively. Sponsors will need to bring FDA a coherent package: the clinical rationale for borrowing, the external evidence-generation process, the prior-construction method, simulations, operating characteristics, sensitivity analyses, computational diagnostics, and comparison with simpler alternatives.

That comparison with simpler alternatives is important. FDA specifically recommends that complex designs using informative priors be compared against alternative and simpler designs so reviewers can understand the advantages and limitations of the proposed design features. This is a quiet but important discipline. A Bayesian design should not merely be clever; it should solve a concrete problem better than a simpler randomized or frequentist design.

The guidance also implies that companies will need stronger internal cross-functional teams. Prior construction is not a job for statisticians alone. FDA describes it as a multidisciplinary process requiring domain knowledge to judge relevance and statistical expertise to quantify uncertainty and select methods. In practice, the team may need clinical development, biostatistics, regulatory affairs, translational science, real-world evidence, data management, and sometimes payer-oriented evidence experts in the same conversation.

There is also an operational implication: early FDA interaction becomes even more important. If a sponsor plans to rely on an informative prior, waiting until NDA/BLA review is far too late. The central negotiation—what prior is acceptable, what external sources count, how much borrowing is allowed, and what sensitivity analyses will be persuasive—must occur before trial initiation.

Implications for FDA

For FDA, the guidance is both an invitation and a burden. It invites more sophisticated trial designs, especially in settings where conventional trial designs are slow, ethically awkward, or infeasible. But it also requires review divisions to develop consistent expectations. One risk is that Bayesian acceptability becomes division-specific or reviewer-specific: acceptable in one disease area, resisted in another; welcomed by one review team, treated with suspicion by another.

This is not a trivial issue. Bayesian methods make explicit what is often implicit in regulatory judgment: prior knowledge matters. FDA already uses prior knowledge informally in many ways—through disease natural history, class effects, surrogate endpoint experience, external controls, mechanistic plausibility, and prior trial results. The Bayesian framework offers a way to quantify and pre-specify some of that knowledge. But once quantified, it becomes contestable. Sponsors, FDA reviewers, academic methodologists, patients, and payers may not share the same prior.

That creates a governance problem. If Bayesian designs are to become more common in pivotal settings, FDA will need not only technical guidance but institutional habits: model libraries, review templates, simulation expectations, cross-division case studies, public statistical reviews, and perhaps more transparent precedent. Otherwise, the guidance may paradoxically increase uncertainty: everyone knows Bayesian designs are possible, but no one knows when a given review division will accept one.

Concerns from the evidentiary-standard perspective

The JAMA commentary by Evans, Fleming, Janes, and Dodd provides the most pointed counterweight. The authors acknowledge that Bayesian methods have value in diagnostic medicine, prediction, early-phase drug and device trials, and rare diseases, but warn that late-phase confirmatory use can compromise evidentiary standards if it dilutes randomized evidence, incorporates sponsor- or investigator-specific priors, or depends on strong assumptions that are difficult to verify.

Their central concern is the primacy of randomized evidence. Randomization protects against known and unknown confounding, supports unbiased treatment-effect estimation, and provides the foundation for valid inference. Bayesian methods do not necessarily compromise randomization; a trial can be randomized and Bayesian. The concern arises when external or prior information is combined with randomized data in a way that weakens the protection randomization was meant to provide.

They also emphasize objectivity of priors. Priors must be constructed from relevant, high-quality, objectively selected external data. Post hoc selection of favorable evidence is especially problematic. Even well-intended priors may differ among sponsors, regulators, payers, IRBs, patients, and data monitoring committees. This is an important reminder: the “prior” is not merely a technical input; it is a compressed evidentiary argument.

A further concern is robustness. Bayesian models can be elegant but assumption-dependent. Hierarchical models, commensurate priors, mixture priors, dynamic borrowing, and model-based adaptive designs can all behave well under some assumptions and poorly under others. Simulation helps, but simulations only test scenarios the designers thought to simulate. That is why sensitivity analysis and prior-data conflict diagnostics are not decorations; they are central to the credibility of the approach.

Drawbacks and unresolved issues

The guidance has several limitations.

First, it is still somewhat ambiguous about when Type I error calibration is expected and when a fully Bayesian framework will be acceptable. FDA distinguishes the two categories, but sponsors will want more disease-area-specific examples. Without such examples, the default posture may remain conservative: Bayesian methods will be welcomed in principle, but pivotal reliance on informative priors may still require substantial negotiation.

Second, the guidance may underemphasize predictive probabilities, a point made by Berry Consultants. Predictive probabilities are central to many adaptive trials and development decisions: should the trial stop for futility, expand a cohort, graduate an arm, or proceed to Phase 3? The guidance focuses heavily on borrowing and primary inference, but a fuller Bayesian regulatory framework would give predictive probability a more explicit role.
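For readers unfamiliar with the term, a predictive probability can be sketched for a single-arm trial with a binary endpoint. The sketch assumes a Beta prior on the response rate and assumes the final success rule has already been translated into a minimum responder count; both the cutoff and the interim numbers are hypothetical.

```python
from math import lgamma, exp

def log_beta_fn(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(k, n, a, b):
    """Pr(k responders among n future patients) when the response rate is Beta(a, b)."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta_fn(a + k, b + n - k) - log_beta_fn(a, b))

def predictive_probability(interim_responders, interim_n, remaining_n,
                           responders_needed, prior_a=1.0, prior_b=1.0):
    """Probability, given interim data, that the final responder count reaches
    responders_needed (a hypothetical pre-specified success cutoff)."""
    a = prior_a + interim_responders
    b = prior_b + interim_n - interim_responders
    still_needed = max(0, responders_needed - interim_responders)
    return sum(beta_binomial_pmf(k, remaining_n, a, b)
               for k in range(still_needed, remaining_n + 1))

pp = predictive_probability(interim_responders=12, interim_n=30,
                            remaining_n=30, responders_needed=27)
print(f"predictive probability of final success: {pp:.3f}")
```

A futility rule would compare this quantity with a pre-specified lower bound. The posterior probability asks what is believed now, whereas the predictive probability asks how likely the trial is to succeed if it continues.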

Third, Bayesian methods can create a communications problem. Posterior probability statements are intuitive but easily misread. A 97.5% posterior probability of benefit does not automatically mean the regulatory decision rule has a 2.5% false-positive rate. The distinction between single-trial posterior credibility and cross-trial decision reliability will need to be explained repeatedly to executives, clinical teams, investors, advisory committees, and possibly courts.

Fourth, the guidance may not fully address patient-level utility. Viele notes that operating characteristics are often framed statistically—power, bias, coverage, probability of correct decision—whereas Bayesian decision theory can also optimize patient-centered outcomes, such as expected lives saved or expected morbidity avoided. This is not a minor philosophical point. In some adaptive designs, the rule that maximizes information is not the same as the rule that maximizes patient benefit within or beyond the trial.

Fifth, there is a payer and HTA implication. FDA approval based partly on Bayesian borrowing may be scientifically defensible but still less persuasive to payers if the current randomized evidence base is small. A Bayesian approval package may therefore need a postmarket evidence strategy from the beginning. FDA may accept a posterior inference that a therapy is effective; payers may still ask whether the effect is robust, generalizable, clinically meaningful, and economically relevant in their covered population.

Closing thought capital: Bayesian FDA review as disciplined prior knowledge, not statistical indulgence

The most productive way to view the FDA guidance is not as “Bayes versus frequentism.” That debate is too scholastic for the regulatory problem. The real issue is how FDA should use prior knowledge without letting prior belief erode evidentiary standards.

FDA already uses prior knowledge. It does so when it accepts pediatric extrapolation, interprets single-arm oncology response rates, considers natural history controls, evaluates surrogate endpoints, uses class-wide safety expectations, or asks whether a subgroup result is biologically plausible. Bayesian methods can make these judgments more explicit. They can also make them more reproducible, auditable, and stress-tested.

But quantification is not purification. A biased prior remains biased after being written in mathematical notation. A poorly comparable external control remains poorly comparable after being embedded in a hierarchical model. A posterior probability can be precise and still be wrong if the model is wrong. The promise of Bayesian regulatory science is not that it bypasses hard evidentiary questions. The promise is that it forces those questions into the open.

For FDA, the mature position should be Bayesian pluralism with evidentiary hierarchy. Minimally informative Bayesian analyses should be broadly acceptable when they are well specified and computationally reliable. Bayesian adaptive designs should be encouraged when their operating characteristics are transparent and their decision rules protect trial integrity. Informative priors and external borrowing should be welcomed only when the external evidence is highly relevant, objectively selected, and sensitivity-tested. In confirmatory settings, randomized concurrent evidence should remain the anchor unless there is a compelling disease-specific reason to depart from it.

The deeper opportunity is to build a regulatory culture of model accountability. Every Bayesian pivotal design should answer five questions plainly: What is being borrowed? Why is it relevant? How much influence can it have? What happens if it is wrong? Would the conclusion survive a skeptical but plausible alternative prior? If the answer to the last question is no, the sponsor may still have an argument—but it should be understood as a Bayesian argument dependent on prior acceptance, not as an ordinary replication of the conventional pivotal-trial standard.

In that sense, FDA’s draft guidance is not permissive in the loose sense. It is permissive in the disciplined sense. It opens the door, but asks sponsors to show their work. For sophisticated drug and biologic development, especially in rare diseases, pediatrics, oncology subtypes, and platform trials, that is a valuable advance. The drawback is that the hardest questions remain judgment calls: relevance, exchangeability, acceptable uncertainty, and the regulatory consequences of being wrong. Those questions cannot be solved by Bayes’ theorem. But Bayesian methods, used carefully, can make the assumptions visible enough for regulators, sponsors, clinicians, and patients to argue about them honestly.

 


 

# # #

CLAUDE OPUS 4.7

 

The FDA's January 2026 Draft Guidance on Bayesian Methodology in Clinical Trials: Substance, Implications, and Critical Reflections

Background and Regulatory Context

On January 9, 2026, the FDA released a draft guidance titled Use of Bayesian Methodology in Clinical Trials of Drug and Biological Products, with formal Federal Register notice appearing on January 12, 2026 (Docket No. FDA-2025-D-3217). The document, jointly authored by the Office of Biostatistics in CDER and the Division of Biostatistics in CBER, satisfies a PDUFA VII performance commitment to issue draft guidance on Bayesian methodology, with the comment period closing March 13, 2026. The release fills a long-anticipated gap: while CDRH issued Bayesian guidance for medical devices in 2010, drug and biologic sponsors have operated without comparable formal guidance for primary inference, navigating Bayesian designs through case-by-case interactions with the agency.

The draft articulates FDA's current thinking on the use of Bayesian methods to support primary inference in pivotal trials of drugs and biologics, including trials submitted under INDs, NDAs, BLAs, and their supplements. It complements existing instruments such as the ICH E9 (1998) Statistical Principles, the ICH E9(R1) Addendum on estimands (2021), ICH E11A on pediatric extrapolation, the FDA's 2019 Adaptive Designs guidance, and the more recent ICH E20 draft on adaptive designs. As Berry Consultants' Kert Viele observed, the document "formalizes progress in the past two decades," providing what had previously been tacit knowledge accumulated through specific reviews and approvals.

What the Guidance Accomplishes

1. Doctrinal Framework: Two Inferential Regimes

The most consequential structural feature of the guidance is its bifurcation of Bayesian trial designs into two categories with distinct evidentiary frameworks:

Type I error-calibrated trials retain familiar frequentist operating characteristics—FWER controlled at 0.025 one-sided, power calculated under fixed parameter assumptions, bias and MSE of point estimates, and coverage of intervals. In this regime, Bayesian machinery functions, in Haitao Pan's apt formulation, as "an engineering mechanism for constructing decision rules with acceptable long-run error rates." The posterior probability serves as a computational device analogous to a test statistic. Success criteria take the form Pr(d > a) > c, with c calibrated—algebraically for simple conjugate models, via clinical trial simulation for complex adaptive designs—to maintain the desired Type I error.
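Calibrating c by simulation can be sketched as follows. The sketch assumes a one-sample normal endpoint with unit variance, a flat analysis prior, and a group-sequential design that declares success at any look where the posterior probability exceeds c. All design parameters and the function name are hypothetical.

```python
import random
from statistics import NormalDist

def calibrate_threshold(n_per_look=50, n_looks=2, alpha=0.025,
                        n_sims=20000, seed=2026):
    """Posterior-probability cutoff c such that, under the null, the rule
    'success if Pr(effect > 0 | data) > c at any look' fires about alpha of the time."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    max_probs = []
    for _ in range(n_sims):
        total, n, best = 0.0, 0, 0.0
        for _ in range(n_looks):
            # Sum of n_per_look null outcomes, each N(0, 1).
            total += rng.gauss(0.0, n_per_look ** 0.5)
            n += n_per_look
            # Flat prior: the posterior is N(total / n, 1 / n), so
            # Pr(effect > 0 | data) = Phi(total / sqrt(n)).
            best = max(best, std_normal.cdf(total / n ** 0.5))
        max_probs.append(best)
    max_probs.sort()
    return max_probs[int((1 - alpha) * n_sims)]

c_one_look = calibrate_threshold(n_looks=1)
c_two_looks = calibrate_threshold(n_looks=2)
print(f"one look: c = {c_one_look:.4f}   two looks: c = {c_two_looks:.4f}")
```

With a single look the calibrated c lands near 0.975, as the algebra predicts. Adding an interim look pushes c higher, because each extra opportunity to declare success must be paid for to keep the overall error rate at 0.025.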

Trials not calibrated to Type I error rate represent the more philosophically novel territory. Here, FDA explicitly acknowledges that when informative priors meaningfully shape inference, traditional Type I error control becomes "philosophically inconsistent given a prior which assumes a non-zero effect" (lines 756–758). Operating characteristics are instead calculated with reference to the prior distribution itself: Bayesian power (averaged over a design prior), probability of correct decision (a positive predictive value analog), expected bias, expected MSE, and expected interval coverage. The guidance introduces the formal distinction between an analysis prior (the single, pre-specified prior used for inference) and design priors or sampling priors (multiple distributions used to characterize sensitivity).
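Operating characteristics averaged over a design prior can be sketched by simulation. The example assumes a normal design prior on the true effect, a normally distributed trial estimate, and a flat analysis prior; the names and numbers are hypothetical, and a real submission would repeat the exercise across several design priors.

```python
import random
from statistics import NormalDist

def design_prior_characteristics(design_mean, design_sd, std_err,
                                 threshold=0.975, n_sims=20000, seed=7):
    """Average a success rule over a design prior on the true effect.
    The analysis uses a flat prior, so Pr(effect > 0 | data) = Phi(estimate / std_err)."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    successes = 0
    correct_successes = 0
    for _ in range(n_sims):
        true_effect = rng.gauss(design_mean, design_sd)   # draw from the design prior
        estimate = rng.gauss(true_effect, std_err)        # simulate the trial result
        if std_normal.cdf(estimate / std_err) >= threshold:
            successes += 1
            if true_effect > 0:
                correct_successes += 1
    bayesian_power = successes / n_sims
    prob_correct = correct_successes / successes if successes else float("nan")
    return bayesian_power, prob_correct

power, ppv = design_prior_characteristics(design_mean=0.2, design_sd=0.2, std_err=0.1)
print(f"Bayesian power = {power:.3f}, Pr(effect > 0 | success declared) = {ppv:.3f}")
```

The first number is power averaged over uncertainty about the effect rather than evaluated at one fixed alternative. The second is the positive-predictive-value analog: among simulated trials that declared success, the share in which the true effect really was positive.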

This bifurcation is, as Pan emphasizes, almost certainly deliberate rather than oversight. The guidance does not prescribe which regime a sponsor must adopt; instead, it shifts the justificatory burden. Sponsors abandoning Type I error calibration must demonstrate through carefully specified priors and comprehensive simulation that the decision rule produces sufficiently reliable conclusions.

2. Cataloguing Established Practice

Section III provides an evidentiary backbone showing that Bayesian methods are already embedded in FDA decision-making. Concrete examples include:

  • REBYOTA (fecal microbiota, 2022): Bayesian incorporation of phase 2 data into phase 3 inference for recurrent C. difficile infection.
  • GBM AGILE and Precision Promise: Bayesian platform trials in glioblastoma and pancreatic cancer using nonconcurrent controls with adjustment for temporal drift (the "Bayesian time machine" of Saville et al., 2022).
  • Empagliflozin and linagliptin pediatric T2D supplements: Bayesian extrapolation from adult data using pharmacometric bridging models.
  • LEADER trial of liraglutide: Bayesian hierarchical model (BHM) for regional subgroup analysis.
  • Dose-finding methods including CRM, BLRM, BOIN (with explicit fit-for-purpose designation), mTPI, and mTPI2.

This catalogue accomplishes something rhetorical as well as substantive: it neutralizes the implicit objection that Bayesian methods are novel or untested in regulatory submissions.

3. Prior Specification: A Structured Taxonomy

The guidance devotes the bulk of its substantive content (Section V) to prior distributions, organizing them into noninformative/minimally informative, skeptical, enthusiastic, and informative (borrowing) priors. Within the borrowing category, it discusses:

  • Static discounting (e.g., power priors with fixed discount factor)
  • Dynamic discounting, including commensurate priors, supervised power priors, mixture priors, Bayesian hierarchical models, and elastic priors (Jiang et al., 2023)
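Static discounting is the simplest of these to illustrate. A minimal sketch, assuming a binary endpoint and a Beta base prior: a power prior with a fixed discount raises the historical likelihood to a power between 0 and 1, which for binomial data just scales the historical counts. The historical and current counts are hypothetical.

```python
def power_prior_beta(hist_responders, hist_n, discount, base_a=1.0, base_b=1.0):
    """Beta prior obtained by raising the historical binomial likelihood to the
    power `discount` (0 = ignore history, 1 = pool fully) on a Beta(base_a, base_b) base."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must lie in [0, 1]")
    return (base_a + discount * hist_responders,
            base_b + discount * (hist_n - hist_responders))

def posterior_mean(prior_a, prior_b, responders, n):
    """Posterior mean response rate after observing responders out of n."""
    return (prior_a + responders) / (prior_a + prior_b + n)

# Hypothetical history: 30/100 responders. Hypothetical current trial: 20/50.
for discount in (0.0, 0.5, 1.0):
    a, b = power_prior_beta(hist_responders=30, hist_n=100, discount=discount)
    print(discount, (a, b), round(posterior_mean(a, b, responders=20, n=50), 3))
```

Because the discount is fixed in advance, the historical data contribute the same discount × 100 patients' worth of information whether or not the current data agree with them. Dynamic methods are designed to relax exactly this rigidity.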

Critically, the guidance frames prior construction as a multidisciplinary process requiring evidence synthesis principles akin to those used in systematic reviews and meta-analyses, with attention to data quality, pre-specification, relevance, study design (preferring randomized comparisons), and patient-level data availability. It explicitly addresses prior-data conflict (Evans and Moshonov, 2006), recommending that simulations cover plausible degrees of conflict.

4. Quantification and Sensitivity

The guidance recommends multiple metrics for quantifying prior influence: prior mean, effective sample size (ESS) with appropriate methods (Malec, 2001; Morita et al., 2008; Neuenschwander et al., 2020), and ratios of prior ESS to prospective trial size. Notably, it explicitly discourages using Type I error inflation as a measure of borrowing strength when informative priors are used—a position aligned with Berry Consultants' long-standing methodological commentary. Sensitivity analyses across alternative priors are required, with the document endorsing what Viele describes as "tipping point" analyses.
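A tipping-point analysis and an ESS ratio can be sketched together in a normal model. Assume the historical estimate enters as a normal prior whose precision is scaled by a discount between 0 and 1, and the success criterion is Pr(effect > 0 | data) ≥ 0.975. The estimates, standard errors, and function names are hypothetical, and the ESS ratio here is the simple precision ratio rather than any of the cited methods.

```python
from statistics import NormalDist

def prob_positive(hist_estimate, hist_se, estimate, std_err, discount):
    """Pr(effect > 0 | data) when the historical estimate enters as a normal
    prior whose precision is scaled by `discount` (0 = no borrowing)."""
    prior_prec = discount / hist_se ** 2
    data_prec = 1.0 / std_err ** 2
    post_prec = prior_prec + data_prec
    post_mean = (prior_prec * hist_estimate + data_prec * estimate) / post_prec
    return NormalDist().cdf(post_mean * post_prec ** 0.5)

def prior_to_trial_ess_ratio(hist_se, std_err, discount):
    """Prior information expressed as a fraction of the current trial's information."""
    return discount * std_err ** 2 / hist_se ** 2

def tipping_point(hist_estimate, hist_se, estimate, std_err,
                  threshold=0.975, steps=100):
    """Smallest discount on a grid at which the success criterion is first met."""
    for i in range(steps + 1):
        discount = i / steps
        if prob_positive(hist_estimate, hist_se, estimate, std_err, discount) >= threshold:
            return discount
    return None

tip = tipping_point(hist_estimate=0.30, hist_se=0.10, estimate=0.20, std_err=0.12)
ratio_at_tip = prior_to_trial_ess_ratio(hist_se=0.10, std_err=0.12, discount=tip)
print(f"tipping point: discount = {tip}, prior/trial information ratio = {ratio_at_tip:.2f}")
```

In this hypothetical the trial alone falls short of the criterion, and success first appears once the prior carries information equal to roughly a fifth of the current trial. Whether that much borrowing is defensible is the clinical judgment a tipping-point display is meant to provoke.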

5. Computational and Reporting Rigor

The guidance addresses MCMC convergence diagnostics, the responsibility for software reliability, and detailed documentation requirements for both protocol and clinical study report stages. It requires that sponsors submit simulation code, seed numbers (for reproducibility), and comprehensive simulation reports.
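One widely used convergence check is the Gelman-Rubin potential scale reduction factor, sketched below in its classic (non-split) form. Naming this particular diagnostic is an assumption, since the text above does not say which diagnostics are expected, and the "chains" here are seeded random draws standing in for real MCMC output.

```python
import random
from statistics import mean, variance

def gelman_rubin(chains):
    """Classic potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    if len(chains) < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length")
    chain_means = [mean(c) for c in chains]
    within = mean(variance(c) for c in chains)     # W: average within-chain variance
    between = n * variance(chain_means)            # B: between-chain variance
    pooled = (n - 1) / n * within + between / n    # pooled estimate of posterior variance
    return (pooled / within) ** 0.5

rng = random.Random(12345)                         # fixed seed so the run is reproducible
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
stuck = [[rng.gauss(shift, 1.0) for _ in range(1000)] for shift in (0.0, 0.0, 3.0, 3.0)]
print(f"well mixed: {gelman_rubin(mixed):.3f}   stuck in two modes: {gelman_rubin(stuck):.3f}")
```

Values close to 1 suggest the chains are sampling the same distribution, while values well above 1 flag chains that disagree. The fixed seed mirrors the reproducibility expectation: rerunning the script yields identical draws.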

Strategic and Operational Implications

Alignment with International Frameworks

The Alston & Bird advisory correctly highlights that this guidance aligns with the EMA's 2025–2026 workplan to accelerate clinical trials, with ICH E11A on pediatric extrapolation (finalized 2024), and with the draft ICH E20 on adaptive designs. For sponsors operating across jurisdictions, this convergence reduces regulatory arbitrage risk and increases the feasibility of single trials supporting multi-jurisdictional submissions.

Particular Relevance for Specific Indications

The guidance is most immediately consequential for:

  • Pediatric programs, where adult-to-pediatric extrapolation is increasingly expected and formally supported by ICH E11A
  • Rare diseases, where conventional sample sizes are infeasible
  • Oncology, particularly basket trials, platform trials, and dose-optimization studies under Project Optimus
  • Programs with substantial external data, including biosimilar bridging studies, supplements to approved products, and indications with rich historical control databases

The C3TI Infrastructure

The FDA's Center for Clinical Trial Innovation (C3TI), launched in 2024, provides operational pathways: the Demonstration Program for non-adaptive Bayesian trials and the Complex Innovative Trial Design (CID) Meeting Program for adaptive designs. The guidance crystallizes principles that C3TI engagements have been developing case-by-case.

Sponsor Implications

For sponsors, several practical implications follow:

  1. Early FDA engagement is essential. The guidance repeatedly emphasizes pre-trial alignment on priors, success criteria, and operating characteristics. The cost of misalignment late in development is substantial.
  2. Statistical capability must be deepened. Sponsors need expertise not only in Bayesian inference but in simulation-based design evaluation, MCMC diagnostics, and evidence synthesis.
  3. Documentation requirements are intensive. The reporting expectations—simulation code, convergence diagnostics, sensitivity analyses across multiple priors—exceed those typical for frequentist analyses.
  4. The benefit-risk framework opens new design space. Section IV.A.3, with its allusion to loss functions and decision-theoretic approaches, formally accommodates trial designs that incorporate utility functions (as in AWARD-5 for dulaglutide).

Drawbacks, Concerns, and Open Questions

The Evidentiary Standards Debate

The most pointed critique appears in JAMA (May 19, 2026) from Scott R. Evans (George Washington University), Thomas R. Fleming (University of Washington), Holly Janes (Fred Hutchinson), and Lori E. Dodd (NIAID). Their concerns warrant serious attention:

First, they argue that Bayesian methods used in confirmatory trials can compromise the benefits of randomization by combining external information with high-quality randomized data. Randomization's epistemic warrant (unbiased treatment effect estimates, valid CI coverage, error-rate control) is conditional on the integrity of the randomized comparison being preserved as the primary basis of inference.

Second, they emphasize the loss of objectivity introduced by sponsor- or investigator-specific priors, and the cognitive dissonance risk: when sponsors aim to establish rather than determine effectiveness, selection bias in prior construction is foreseeable. They invoke the STARTS-2 sildenafil trial, in which high-dose sildenafil increased pediatric mortality 3.95-fold despite adult data not predicting this signal—a cautionary example of the perils of cross-population extrapolation.

Third, they question the FDA's framing of the guidance as "modernizing statistical methods" in its press release. The implication that current methods are "outdated" misrepresents methods that "are principled, based on scientific fundamentals for protecting integrity and ensuring robustness."

The contrasting JAMA perspective by Gelman, van Zwet, and Więcek (also May 19, 2026) is broadly supportive but raises complementary concerns: priors should be clearly stated, their influence assessable, and the data model itself should be subject to posterior predictive checking. They note that the survey by Janiaud et al. (2021) of 912 FDA applications found inconsistent citation of prior reasoning and approvals reflecting new interpretations of existing evidence rather than new evidence—a problem they argue Bayesian formalism could ameliorate.

Process and Consistency Concerns

Even Berry Consultants, broadly enthusiastic about the guidance, identifies process gaps. Viele observes that the document is "vague on the process for obtaining this agreement" on priors and benefit-risk quantifications. The risks include:

  • Reviewer-to-reviewer inconsistency: Different statistical reviewers may have meaningfully different prior preferences, creating perceived inequity between sponsors.
  • Temporal drift in agreements: A prior agreed to early in development may face pressure for revision as related trials report.
  • Asymmetric leverage: Sponsors with substantial regulatory experience and resources will navigate the discretionary terrain more effectively than smaller developers.

Specific Technical Concerns

Berry's commentary identifies several specific technical limitations worth highlighting:

  • Predictive probabilities are largely absent from the document despite their centrality to many real-world Bayesian adaptive designs and phase transition decisions.
  • The discussion of skeptical and enthusiastic priors sits uneasily with the analysis prior / design prior framework introduced earlier; these concepts may be more coherently treated as design priors used to assess what would convince a range of stakeholders.
  • The framing that informative priors "have been used mostly in pediatrics and rare diseases" risks being read as a recommendation that they are best suited to those areas, when in fact common diseases with rich historical data may offer equally compelling use cases.
  • The treatment of exchangeability is technically correct but practically over-emphasized; what borrowing actually requires is closeness between historical and current data, not formal exchangeability.
  • Sections IV.A.2 and IV.A.3 (direct interpretation of posterior probability versus benefit-risk based criteria) are difficult to distinguish operationally and might be more usefully merged.

The Operating Characteristics Question

Pan's LinkedIn analysis identifies a subtle but important point in lines 295–297 of the guidance: "the concept of a false positive conclusion in a Bayesian framework is conditional on a positive conclusion and not, as in the frequentist framework, on a true null hypothesis." This shift—from Pr(reject H₀ | H₀ true) to Pr(H₀ true | reject H₀), conceptually akin to 1 − PPV—is fundamental but may not be widely appreciated by clinicians and reviewers more accustomed to frequentist error rates. The risk is that posterior probability thresholds (e.g., Pr(θ > 0 | data) ≥ 0.975) may be misinterpreted as long-run error bounds when they are in fact data-conditional statements whose long-run reliability depends on the realistic distribution of true effects, prior calibration, and design features such as interim stopping.
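The difference between the two conditional probabilities can be made concrete with Bayes' rule. The sketch below is illustrative only: the function name and the numerical values for the error rate, power, and prior probability of the null are assumptions chosen for the example, not figures from the guidance or from Pan's analysis.

```python
def prob_null_given_rejection(alpha, power, prior_null):
    """Pr(H0 true | reject H0), computed with Bayes' rule.

    alpha      -- Pr(reject H0 | H0 true), the frequentist type I error rate
    power      -- Pr(reject H0 | H0 false)
    prior_null -- assumed proportion of tested hypotheses for which H0 is true
    """
    false_positives = alpha * prior_null
    true_positives = power * (1.0 - prior_null)
    return false_positives / (false_positives + true_positives)


# Hypothetical scenarios: same alpha and power, different base rates of true nulls.
for prior_null in (0.1, 0.5, 0.9):
    p = prob_null_given_rejection(alpha=0.025, power=0.9, prior_null=prior_null)
    print(f"Pr(H0 true) = {prior_null:.1f} -> Pr(H0 true | reject) = {p:.3f}")
```

With the same one-sided alpha of 0.025, the probability that a "positive" trial is a false positive ranges from well under 1% to 20% across these assumed base rates, which is why a fixed posterior threshold should not be read as a long-run error bound.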

Patient-Level Data Pragmatics

The guidance prefers patient-level data for borrowing. Viele's commentary notes that this creates a practical tension: many historically informative studies are available only as summary statistics, and excluding them in favor of patient-level data from less relevant studies may be epistemically suboptimal. Real-world borrowing decisions involve trade-offs that the guidance acknowledges in principle but does not fully operationalize.

Mixture Prior Behavior

Viele also flags an underappreciated issue with certain test-then-pool and weight-as-function-of-discrepancy approaches: pathological cases can arise in which a weaker observed treatment effect would lead to a successful trial conclusion while the actual (stronger) observed effect would not, because the assumed discounting function compensates differently for the two scenarios. Such non-monotonic behavior is difficult to explain to clinical audiences and can undermine confidence in Bayesian conclusions.
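To show what "weight as a function of discrepancy" means in practice, here is a minimal sketch of a two-component mixture prior for a binomial response rate, in which the posterior weight on the informative component depends on how well the new data agree with it. It does not reproduce the pathological case Viele describes; the Beta(30, 70) and Beta(1, 1) components, the 50/50 prior weight, and the sample size are all hypothetical values chosen for illustration.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_marginal(y, n, a, b):
    """Log beta-binomial marginal likelihood of y responders out of n.

    The binomial coefficient is omitted because it is identical for
    both mixture components and cancels in the weight calculation.
    """
    return log_beta(a + y, b + n - y) - log_beta(a, b)


def posterior_informative_weight(y, n, w=0.5,
                                 informative=(30.0, 70.0),
                                 vague=(1.0, 1.0)):
    """Posterior weight on the informative component of the mixture prior."""
    log_inf = log_marginal(y, n, *informative)
    log_vag = log_marginal(y, n, *vague)
    shift = max(log_inf, log_vag)  # subtract the max for numerical stability
    num = w * exp(log_inf - shift)
    den = num + (1.0 - w) * exp(log_vag - shift)
    return num / den


# Hypothetical trial of n = 50: the historical prior expects about 30% responders.
for y in (5, 15, 25, 40):
    print(f"y = {y:2d}/50 -> weight on informative prior = "
          f"{posterior_informative_weight(y, 50):.3f}")
```

The weight is high when the observed rate is close to the historical 30% and collapses as the data move away from it. Because the amount of borrowing changes with the observed result, the final conclusion need not be a monotone function of the observed effect, which is the kind of behavior that is hard to explain to clinical audiences.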

Beyond the PDFs: Additional Considerations

The Borrowing-Replication Tension

The pharmaceutical evidence ecosystem has spent two decades grappling with replication concerns—the so-called replication crisis affecting biomedical research broadly. There is genuine tension between two reform impulses: (1) increased rigor through pre-registration, larger samples, and stricter error control, versus (2) increased efficiency through information borrowing and adaptive designs. Bayesian borrowing, by design, makes the primary trial's data less determinative of the conclusion. In a world where many published frequentist findings have failed to replicate, doubling down on borrowing from prior studies—whose own replicability may be uncertain—deserves epistemological scrutiny that the guidance does not fully provide. The historical priors used in extrapolation embed whatever biases (selection, measurement, publication) afflicted the original studies.

The Question of Calibration Across the Industry

If Bayesian designs become more common, the meta-question of how the industry's evidence base behaves in aggregate becomes pressing. Each individual trial's local operating characteristics may be acceptable, but the cumulative effect across many trials—each borrowing from a partially overlapping evidence base—is not addressed in the guidance. There is potential for correlated errors and feedback loops in which optimistic priors derived from early studies propagate forward into confirmatory trials whose results then reinforce the priors used by subsequent programs.

Real-World Data and the Prior Frontier

The guidance acknowledges real-world data (RWD) as a potential prior source but defers substantively to existing FDA RWD guidances. Yet the central scientific question—when can RWD legitimately substitute for randomized concurrent controls?—remains contested. The combination of Bayesian borrowing methodology with RWD risks compounding two streams of uncertainty (the relevance of historical/external data and the comparability of non-randomized populations) in ways that warrant more explicit treatment.

The Lecanemab Precedent and Confirmatory Trials

The accelerated approval of lecanemab and the broader debates over the strength of evidence in Alzheimer's drug development illustrate how disputes over confirmatory standards play out publicly. Bayesian methods could either improve such situations (by formalizing how prior evidence informs interpretation of borderline results) or worsen them (by providing additional degrees of freedom for advocates and skeptics to construct conflicting analyses). The guidance's emphasis on pre-specification is the principal defense, but it relies on a regulatory infrastructure with sufficient capacity to enforce it.

International Harmonization and the Limits of Convergence

While alignment with EMA workplans is a real benefit, harmonization is not uniformity. EMA and FDA may agree on the legitimacy of Bayesian approaches but disagree on specific prior choices, discounting strengths, or success criteria. PMDA (Japan), NMPA (China), and Health Canada operate with their own statistical traditions. Sponsors planning multi-jurisdictional programs should not assume that a Bayesian design acceptable to FDA will be uncritically accepted elsewhere—particularly when borrowed historical data comes from non-representative populations.

Workforce and Capability Gaps

The guidance's expectations (simulation-intensive design justification, MCMC diagnostics, hierarchical modeling, and prior elicitation) require statistical expertise that is unevenly distributed across industry and within the agency itself. FDA has built capability through C3TI and through hires of methodologists, but reviewer bandwidth remains finite. The guidance encourages early engagement, which has resource implications for both sponsors and the agency. Realizing the guidance's potential will require sustained investment in biostatistical training, including in academic programs that have historically emphasized frequentist methods.

Bayesian Methods for the Regulator's Own Decisions

Gelman, van Zwet, and Więcek raise a point worth amplifying: the guidance addresses how sponsors should conduct Bayesian analyses, but the FDA itself makes regulatory decisions that are implicitly Bayesian (combining trial evidence with prior beliefs about mechanism, related programs, population needs, and historical precedent). Making the agency's own decision framework more explicitly probabilistic and transparent—as is happening, in different ways, in some health technology assessment bodies internationally—could complement the guidance's industry-facing focus. The episode noted in JAMA regarding the Moderna mRNA influenza vaccine refusal-to-file letter and its reversal illustrates how informal regulatory reasoning can produce decision instability that a more explicitly Bayesian framework might reduce.

The Loss Function Question

The guidance opens the door to decision-theoretic approaches incorporating loss functions but does not develop this in depth. Yet loss functions encode value judgments—whose preferences enter, and with what weight—that have historically been left implicit in regulatory decision-making. Patient preferences, payer perspectives, and population health considerations might all be formalized within such frameworks. This is an opportunity for more transparent value-based decision-making but also a potential vector for capture by whoever specifies the loss function.
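A small sketch shows how a loss function turns a posterior probability into a decision, and how the value judgment hides inside the threshold. It assumes a normal posterior for the treatment effect and a simple two-outcome loss (a cost for approving an ineffective drug and a cost for rejecting an effective one). The function names, posterior summary, and loss ratios are hypothetical and are not taken from the guidance.

```python
from statistics import NormalDist


def approval_threshold(loss_false_approval, loss_false_rejection):
    """Posterior probability of benefit above which approval minimizes expected loss."""
    return loss_false_approval / (loss_false_approval + loss_false_rejection)


def decide(post_mean, post_sd, loss_false_approval, loss_false_rejection):
    """Return (Pr(theta > 0 | data), approve?) under a normal posterior."""
    p_benefit = 1.0 - NormalDist(post_mean, post_sd).cdf(0.0)
    expected_loss_approve = loss_false_approval * (1.0 - p_benefit)
    expected_loss_reject = loss_false_rejection * p_benefit
    return p_benefit, expected_loss_approve < expected_loss_reject


# Same hypothetical posterior, two different sets of value judgments.
for ratio in (39.0, 9.0):
    p, approve = decide(0.2, 0.12, loss_false_approval=ratio, loss_false_rejection=1.0)
    print(f"loss ratio {ratio:.0f}:1 -> threshold {approval_threshold(ratio, 1.0):.3f}, "
          f"Pr(benefit) = {p:.3f}, approve = {approve}")
```

Under this loss structure, a 39:1 ratio is equivalent to a 0.975 posterior threshold and a 9:1 ratio to a 0.90 threshold, so the same evidence leads to opposite decisions depending on who sets the losses.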

Conclusion

The FDA's January 2026 draft guidance is a substantial and largely welcome document that formalizes two decades of accumulated experience with Bayesian methods in drug and biologic regulation. It accomplishes three principal things: it provides sponsors with a clearer roadmap for proposing Bayesian designs; it articulates a defensible bifurcated framework distinguishing Type I error-calibrated designs from those evaluated within a fully Bayesian inferential paradigm; and it raises the documentation and pre-specification expectations sufficiently to mitigate—though not eliminate—the risks of bias and inconsistency that Bayesian methods can introduce.

The substantive critiques deserve engagement rather than dismissal. The Evans-Fleming-Janes-Dodd concerns about randomization integrity, objectivity, and robustness are not relics of statistical conservatism; they reflect lessons learned from decades of trials whose initially promising results failed to replicate. The Berry Consultants commentary, while supportive, identifies process gaps that the guidance will need to address as it moves toward finalization. Pan's analysis surfaces conceptual subtleties that practicing clinicians may underappreciate.

For PhD and MD-PhD professionals working in or around clinical development, the guidance creates both opportunity and obligation. The opportunity is real: well-designed Bayesian trials can be more efficient, more informative, and more aligned with how scientists actually update beliefs in light of evidence. The obligation is to use these methods with the same epistemic rigor that has made well-conducted randomized trials the gold standard of clinical evidence—pre-specifying priors with the same discipline applied to hypotheses, evaluating sensitivity comprehensively, and resisting the temptation to let methodological sophistication substitute for genuine evidence.

The guidance is, in the language of FDA practice, nonbinding. Its real influence will emerge through how it shapes the case-by-case interactions between sponsors and the agency over the coming years, and through whether the comment process, which closed March 13, 2026, surfaced sufficient critical engagement to refine the document before finalization.