Computational pathology involves special complexities and requires new thought capital, both for scientific validation and for regulatory-science review and approval. I understand this 2024 article by Gerrard pushes the thinking forward. I also include an actual 2024 FDA review (as context or background). Please discuss Gerrard's paper from an innovation and thought-capital perspective. Use deep thinking or agentic mode as necessary.
Publication here. (DEN240068 decision here.)
###
Bottom line: Gerrard's 2024 Contribution
Gerrard et al. is not just another “AI works on H&E slides” paper. Its real contribution is regulatory-science thought capital: it tries to answer the deceptively hard question, what does analytical validation mean when the “analyte” is not a molecule, not a stain, not a pathologist-recognizable morphology, but an AI-derived patient-level risk output? That is a major conceptual shift. The paper is valuable because it gives computational pathology a bridge from impressive retrospective AI performance toward something closer to a clinical laboratory test discipline.
The FDA review provides useful context because it shows the regulatory endpoint of this line of thinking: ArteraAI Prostate was reviewed as a software-only device analyzing H&E whole-slide images from prostate biopsies to provide 10-year risk estimates of distant metastasis and prostate cancer-specific mortality, intended to assist physician prognostic decision-making alongside standard clinicopathologic factors.
Why Gerrard’s paper matters
The key move in Gerrard is the reframing of analytical validation. In molecular diagnostics, analytical validation usually asks whether the assay accurately and reproducibly measures the target molecule: DNA sequence, RNA expression, protein epitope, methylation signal, copy number, fusion transcript, and so on. In traditional pathology AI, analytical validation may ask whether software detects a recognizable slide-level feature: tumor focus, mitoses, nuclei, Gleason pattern, HER2 staining, Ki-67 positivity, or some other human-interpretable morphology.
ArteraAI Prostate does something different. It uses routine H&E slides and generates patient-level prognostic and predictive outputs. Gerrard explicitly argues that for an H&E AI test using nonspecific stains, the meaningful “biomarker” is not the H&E input itself. It is the algorithm output. The paper states that for an AI algorithm using nonspecific probes such as H&E, “the output of the algorithm rather than the algorithm measured input is the only meaningful biomarker of interest.” It further distinguishes Artera from Paige Prostate: Paige points to slide-level regions of likely cancer, while Artera generates patient-level outputs associated with oncologic endpoints.
That is the innovation. Gerrard is saying, in effect: the biomarker is no longer a thing under the microscope; the biomarker is the reproducible computational inference from the digitized tissue.
The paper’s central conceptual advance: the “output as analyte” doctrine
The most important thought-capital phrase, even if not branded this way, is output as analyte. In ordinary lab medicine, the analyte is something prior to the assay: glucose, troponin, EGFR exon 19 deletion, PD-L1 staining, HER2 amplification. The assay detects or quantifies it.
In Artera-like computational pathology, the assay is not simply detecting an established entity. It is extracting high-dimensional image features, integrating them through a trained model, and producing a clinically meaningful risk estimate or classification. There may be no single pathologist-identifiable feature corresponding to the result. The “thing measured” is therefore not an object waiting to be found. It is a model-defined risk signal, validated against patient outcomes.
That creates a new regulatory-science problem. You cannot validate the test by asking whether the H&E stain binds its target, because H&E is deliberately nonspecific. You also cannot validate it by asking a pathologist to confirm the AI’s morphology, because the AI is not necessarily reporting a named morphology. Gerrard therefore recenters analytical validation on reproducibility of the AI output under relevant preanalytic, analytic, operator, scanner, and tissue-selection conditions.
This is a subtle but important departure from both molecular pathology and classic digital pathology.
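To make that reframing concrete, a reproducibility study of this kind is essentially a designed experiment over workflow conditions with the locked model held fixed. Here is a minimal illustrative sketch, not the authors' code: the condition levels, patient identifiers, baseline values, and the run_locked_model stub are all hypothetical stand-ins for the deployed pipeline.

```python
import random
from itertools import product
from statistics import mean, pstdev

random.seed(0)

# Hypothetical stand-in for the locked model inference call; in a real study this would be
# the deployed pipeline: biopsy -> H&E slide -> WSI scan -> locked AI model -> risk output.
BASELINE_RISK = {"pt_001": 0.12, "pt_002": 0.48, "pt_003": 0.83}  # invented values

def run_locked_model(patient_id: str, operator: str, scanner: str, replicate: int) -> float:
    # Simulate small run-to-run variation around a patient-level risk estimate.
    return BASELINE_RISK[patient_id] + random.gauss(0.0, 0.01)

OPERATORS = ["op_A", "op_B"]             # hypothetical operator levels
SCANNERS = ["scanner_1", "scanner_2"]    # hypothetical scanner/workflow levels
REPLICATES = [1, 2, 3]                   # repeat runs, e.g. on different days

results: dict[str, list[float]] = {p: [] for p in BASELINE_RISK}
for patient, operator, scanner, rep in product(BASELINE_RISK, OPERATORS, SCANNERS, REPLICATES):
    results[patient].append(run_locked_model(patient, operator, scanner, rep))

# Summarize the within-patient spread of the AI output across all conditions.
for patient, scores in results.items():
    print(f"{patient}: n={len(scores)}  range={max(scores) - min(scores):.3f}  "
          f"CV={pstdev(scores) / mean(scores):.1%}")
```

A real study would feed such paired outputs into formal reliability statistics, as in the ICC results discussed below, rather than a simple range and coefficient of variation.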
The hybrid validation strategy
Gerrard’s method is not invented from whole cloth. It is a hybrid. The authors looked for analogies in two directions: first, AI pathology devices such as Paige Prostate, because they also analyze H&E whole-slide images; second, prognostic molecular tests such as breast cancer recurrence-risk classifiers, because they produce patient-level risk outputs. Gerrard notes that Paige is methodologically similar but intended for slide-level localization, while gene-expression classifiers are clinically similar because they produce algorithmic patient-level prognostic results.
That hybridization is a major form of innovation. In regulatory science, progress often comes not from discovering a wholly new method, but from identifying which older validation concepts are portable and which are not. Gerrard’s answer is roughly:
From molecular diagnostics, borrow the seriousness of analytical validation, reproducibility, precision, and assay implementation.
From digital pathology, borrow concern for scanners, whole-slide image quality, operators, and tissue workflows.
From prognostic classifiers, borrow the idea that the clinically relevant result may be an algorithmic patient-level risk score rather than a visible feature.
From AI/ML software regulation, borrow attention to locked models, deployment environment, data flow, software controls, cybersecurity, and change control.
The FDA review shows this same convergence. It references software submission guidance, qualitative binary-output performance standards, cybersecurity guidance, off-the-shelf software guidance, ISO 14971 risk management, IEC 62304 software lifecycle processes, and usability engineering standards. This is not just pathology anymore; it is a composite field of pathology + software engineering + clinical prediction + laboratory validation + regulatory risk management.
What Gerrard actually validated
Gerrard assessed two AI biomarkers in a clinical laboratory setting: one prognostic algorithm and one predictive algorithm for likely benefit from short-term androgen deprivation therapy (ST-ADT). The paper evaluated analytical accuracy, intra-operator reliability, inter-operator reliability, and biopsy-set completeness reliability. The reported analytical accuracy ICCs were high: in the 0.991–0.993 range for the prognostic algorithm and 0.934 for the ST-ADT algorithm, with strong intra-operator and inter-operator reliability and somewhat lower but still substantial reliability when comparing one core versus three or six cores.
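The paper reports these results as intraclass correlation coefficients. For readers less familiar with the statistic, the following is a minimal sketch of one commonly used form, the two-way random-effects, absolute-agreement, single-measure ICC (Shrout–Fleiss ICC(2,1)), computed over paired patient-level AI outputs. The exact ICC model used by Gerrard is not restated here, and the data values below are invented.

```python
import numpy as np

def icc_2_1(scores: np.ndarray) -> float:
    """Two-way random-effects, absolute-agreement, single-measure ICC (Shrout-Fleiss ICC(2,1)).

    `scores` is an (n_subjects x k_conditions) matrix of continuous AI risk outputs,
    e.g. one column per operator, scanner, or repeat run.
    """
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)
    col_means = scores.mean(axis=0)

    # Two-way ANOVA mean squares (no replication).
    ms_rows = k * np.sum((row_means - grand) ** 2) / (n - 1)   # between subjects
    ms_cols = n * np.sum((col_means - grand) ** 2) / (k - 1)   # between conditions
    resid = scores - row_means[:, None] - col_means[None, :] + grand
    ms_err = np.sum(resid ** 2) / ((n - 1) * (k - 1))          # residual

    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# Hypothetical example: 5 patients, each scored under two conditions (e.g. two operators).
scores = np.array([
    [0.12, 0.14],
    [0.55, 0.53],
    [0.81, 0.80],
    [0.33, 0.36],
    [0.07, 0.09],
])
print(f"ICC(2,1) = {icc_2_1(scores):.3f}")
```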
The “biopsy completeness” experiment is especially interesting. In molecular pathology, one thinks about limit of detection, tumor fraction, nucleic acid quality, and extraction sufficiency. In computational pathology, the analogous issue may be: which tissue did the model see? One core versus several cores is not merely a specimen-handling issue. It becomes a biological heterogeneity issue. The Gerrard paper recognizes this: prostate cancer is multifocal and heterogeneous, and the AI result may depend on which core or cores are selected. This becomes a computational-pathology version of sampling error, but with a twist: the image-based AI may also have an operational advantage because it can often use the original diagnostic H&E slide rather than consuming additional tissue.
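The shape of that completeness experiment is easy to sketch: score the same patients from a single core and from the fuller core set, then quantify how much the patient-level output, and any categorical call derived from it, moves. The values and the 0.50 cut point below are invented for illustration and are not from the paper.

```python
import numpy as np

# Hypothetical paired patient-level risk outputs: the same patients scored using
# a single biopsy core versus the full core set (values invented for illustration).
risk_one_core = np.array([0.10, 0.52, 0.78, 0.31, 0.06])
risk_full_set = np.array([0.13, 0.49, 0.82, 0.35, 0.08])

diff = risk_one_core - risk_full_set
print(f"mean absolute difference: {np.mean(np.abs(diff)):.3f}")
print(f"Pearson r: {np.corrcoef(risk_one_core, risk_full_set)[0, 1]:.3f}")

# A completeness study would also ask how often the categorical call (e.g. risk group)
# assigned from a single core disagrees with the full-set result.
threshold = 0.50  # hypothetical risk-group cut point
discordant = np.sum((risk_one_core >= threshold) != (risk_full_set >= threshold))
print(f"discordant risk-group calls: {discordant}")
```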
The FDA review as the more formal regulatory endpoint
The FDA review is narrower and more regulatory in tone. The device is described as analyzing scanned WSIs of H&E-stained prostate needle biopsies using an AI/ML algorithm to provide 10-year prognostic risk estimates for distant metastasis and prostate cancer-specific mortality. The review describes a workflow in which a pathologist has already diagnosed prostate cancer, identifies the WSI containing the highest Gleason score, verifies image quality, uploads the WSI, and later reviews and releases the report.
This workflow matters because FDA is not simply approving an abstract algorithm. It is reviewing a specific clinical system: specimen type, scanner, magnification, user role, image quality controls, report generation, traceability, intended population, labeling, and physician use context. The FDA indication also specifies use with FDA-cleared interoperable scanners already authorized for the device, or additional 510(k)-cleared scanners qualified through a Predetermined Change Control Plan.
That PCCP component is important thought capital in its own right. Computational pathology depends on scanners, compression, color profiles, image formats, tissue processing, and software updates. A static one-time validation paradigm is poorly suited to a field where the ecosystem changes. The PCCP is a regulatory mechanism for controlled evolution: not “anything goes,” but also not “freeze the whole world forever.”
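One way to picture what a PCCP-style control might look like at the software level is a deployment-time check against a change-control manifest: only qualified scanners are accepted, and only the locked, validated model artifact can produce a reportable result. This is a hypothetical sketch of the idea, not the content of any actual PCCP or the ArteraAI implementation; all names and fields are invented.

```python
import hashlib
from pathlib import Path

# Hypothetical change-control manifest (illustrative only): which scanner models are
# qualified for the device and which locked model artifact is authorized.
MANIFEST = {
    "qualified_scanners": ["VendorA ScannerX", "VendorB ScannerY"],
    "locked_model_sha256": "<sha256-of-validated-model-artifact>",
    "model_version": "1.0.0",
}

def check_run(scanner_model: str, model_path: Path) -> None:
    """Refuse to produce a clinical result unless scanner and model match the manifest."""
    if scanner_model not in MANIFEST["qualified_scanners"]:
        raise RuntimeError(f"Scanner '{scanner_model}' is not on the qualified list")
    digest = hashlib.sha256(model_path.read_bytes()).hexdigest()
    if digest != MANIFEST["locked_model_sha256"]:
        raise RuntimeError("Model artifact does not match the locked, validated version")
    print(f"OK: scanner qualified, model version {MANIFEST['model_version']} verified")
```

The point is not the mechanism but the governance: the qualified-scanner list and the locked model digest are exactly the kinds of things a PCCP allows to change in a controlled, pre-specified way rather than freezing them forever.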
Innovation perspective: what is really new here?
The first innovation is biomarker abstraction. The paper moves from physical biomarker to computational biomarker. The H&E slide is not the biomarker in the way HER2 protein or EGFR mutation is. The biomarker is the AI-generated risk signal derived from the slide.
The second innovation is patient-level computational pathology. Much digital pathology AI has been assistive: find cancer, count cells, segment tissue, quantify staining, flag regions of interest. Artera-like AI instead asks a broader clinical question: what is the patient’s future risk? That moves the field from computer-assisted pathology interpretation toward AI-derived clinical prognosis.
The third innovation is analytical validation without a human gold-standard morphology. This is a profound shift. If the model is trained on outcomes rather than named morphology, a pathologist cannot necessarily say whether the model “saw the right thing.” The validation anchor becomes reproducibility of output and clinical association with outcomes, not visual agreement with a human observer.
The fourth innovation is scanner and workflow realism. Gerrard does not simply say, “The model works.” It asks whether the output is stable across operator, day, scanner/workflow, and tissue sampling. The FDA review goes further by embedding the device in a defined lab workflow and requiring attention to scanner qualification, image quality, software controls, traceability, and labeling.
The fifth innovation is regulatory translation. A research AI model can be impressive and still not be a regulated medical product. Gerrard’s paper is about crossing that valley: taking AI from publication and retrospective validation into a form that can live inside a clinical laboratory and, eventually, an FDA device file.
The deepest issue: analytical validity and clinical validity are intertwined but not identical
Gerrard’s paper helps separate two concepts that are easily blurred in AI pathology. Clinical validation asks whether the model predicts or stratifies clinically meaningful outcomes. The paper summarizes prior clinical validation in large randomized trial datasets, including validation for distant metastasis and prostate cancer-specific mortality, and a predictive ST-ADT interaction analysis.
Analytical validation, by contrast, asks whether the test produces the intended result reliably when performed in the real-world laboratory process. For AI pathology, this includes scanner behavior, image quality, operator handling, tissue selection, repeat scans, and reproducibility of the risk output.
The conceptual problem is that the “analyte” and the “clinical meaning” are close together. If the algorithm output is the biomarker, and the biomarker is defined by its relationship to outcome, then analytical validity can feel circular unless carefully handled. Gerrard’s solution is pragmatic: use the clinically validated model output as the thing to be reproduced, then stress-test the laboratory workflow for reproducibility. That is not philosophically perfect, but it may be the workable regulatory-science solution.
Why this matters for the broader field
This paper helps computational pathology mature from “AI can find patterns” to “AI can be a laboratory test.” That transition requires a new vocabulary. The field needs to define:
What is the analyte?
For Artera-like tests, it may be the model output.
What is the specimen?
Not just tissue, but tissue plus digitization plus image quality plus scanner environment.
What is the instrument?
Not just the scanner, and not just the software, but the scanner–image–algorithm–reporting pipeline.
What is analytical accuracy?
Not closeness to a molecular truth, but concordance with a locked or reference implementation under defined conditions (a minimal sketch of such a concordance check follows this list).
What is reproducibility?
Not just same sample/same result, but same patient-level computational inference despite ordinary variation in scanning, operator, day, and tissue subset.
What is change control?
Not just reagent lots, but scanner qualification, software versioning, AI model locking, file formats, color normalization, cybersecurity, and potentially PCCP-governed updates.
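To illustrate the “analytical accuracy as concordance with a locked or reference implementation” item above: such a comparison study reduces to paired outputs and a pre-specified acceptance criterion. The outputs, tolerance, and acceptance threshold below are invented for illustration and are not from the paper or the FDA review.

```python
import numpy as np

# Hypothetical paired outputs: the reference (locked, validated) implementation versus a
# candidate deployment of the same pipeline (e.g. a new lab or a newly qualified scanner).
reference_output = np.array([0.11, 0.47, 0.79, 0.30, 0.64])
candidate_output = np.array([0.12, 0.46, 0.81, 0.29, 0.66])

TOLERANCE = 0.05  # hypothetical pre-specified acceptance limit on the risk scale

abs_diff = np.abs(candidate_output - reference_output)
within = np.mean(abs_diff <= TOLERANCE)
print(f"max |difference| = {abs_diff.max():.3f}")
print(f"fraction within tolerance = {within:.0%}")

# An acceptance criterion might require, e.g., 95% of cases within tolerance and no
# risk-group reclassification, pre-specified before the comparison study is run.
assert within >= 0.95, "Concordance acceptance criterion not met"
```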
That is why Gerrard’s paper is important. It gives the field a proto-template for how to think.
A critical caveat
The paper is also company-authored and product-specific. It is not an independent standards document. It provides a strong example of how one company framed and executed analytical validation, but it should not be mistaken for a universal answer. The FDA review is also specific to the authorized device, intended use, scanner conditions, labeling, and risk mitigations. FDA’s benefit-risk summary emphasizes that erroneous results or incorrect interpretation could lead to inappropriate management, while also concluding that analytical and clinical validation plus labeling and special controls mitigate those risks sufficiently for Class II De Novo classification.
So the thought capital is real, but it is still early. The field will need independent replication, cross-platform studies, post-market monitoring, clearer approaches to race and subgroup performance, calibration drift management, tissue-processing variability studies, and standards for when a model update requires new validation.
My synthesis
Gerrard et al. pushes the field forward because it recognizes that computational pathology is not simply “digital pathology plus AI.” It is a new diagnostic category in which the clinical signal may be latent, high-dimensional, nonsemantic, and outcome-trained. That forces a new analytical validation logic. You cannot validate it like HER2 IHC, because there is no specific epitope. You cannot validate it like Paige Prostate, because it is not merely localizing cancer on a slide. You cannot validate it exactly like Oncotype or MammaPrint, because the input is not an extracted molecular specimen but a digitized tissue image. And you cannot validate it like ordinary software, because the software output is itself a medical biomarker.
That is the “new thought capital”: computational pathology needs its own regulatory science, and Gerrard offers one of the first serious attempts to build it.