
The Trust Gap
Pathology AI has crossed a critical threshold. These systems now achieve diagnostic accuracy rivaling human performance on benchmarks ranging from cancer detection [1] and grading [2] to tissue segmentation [3] and quantitative analysis. They can process high-volume, routine cases at speeds that free pathologists to review AI recommendations and dedicate their time to the complex diagnoses that require nuanced clinical judgement.
Yet despite this technical maturity, clinical adoption crawls forward at a frustrating pace. Many hospitals still grapple with foundational challenges: building adequate digital infrastructure, securing sufficient budget, and navigating complex regulatory requirements. But even institutions that overcome these hurdles will hit an unexpected wall–one built from inadequate evaluation tools and information gaps.
Each stakeholder group at a hospital or clinic has critical questions they need answered before signing off on adoption. Hospital administrators focus on ROI: they need concrete evidence of cost reduction and operational efficiency gains. Pathologists grapple with job displacement concerns alongside practical questions about tool reliability, workflow integration, and diagnostic quality. IT teams must evaluate compute capabilities and compliance overhead. Ultimately, all stakeholders share the same fundamental goal: better patient outcomes and a more effective healthcare system.
But trust requires evidence. Each group needs confidence that AI will genuinely advance this shared mission through their respective priorities, yet they lack the evaluation frameworks to build that trust, especially when dealing with a tool as complex and arcane as AI. This trust deficit elicits two extreme attitudes towards AI–blind opposition and unfounded fervor–two sides of the same coin born from inadequate information.
So, how do we build that trust?
Bridging the Gap
The answer lies in a transparent, evidence-based evaluation process between AI vendors and all hospital stakeholders. AI vendors have a responsibility to provide hospitals with comprehensive proof about model performance and limitations rather than marketing claims and generic benchmarks. Hospitals, in turn, must establish clear evaluation frameworks and success metrics for responsible AI adoption. The challenge is that these two parties often speak different languages and have misaligned incentives. This is where independent third-party evaluators can bridge the gap—providing objective assessment, translating technical capabilities into business outcomes, and ensuring that evaluation processes serve the interests of patient care rather than vendor sales cycles or administrative convenience.
Such thorough assessment demands a comprehensive AI governance framework to guide the process. An AI governance plan establishes clear guidelines and policies to manage the procurement, deployment, and ongoing monitoring of AI systems. This isn’t bureaucratic overhead; it’s a strategic framework that forces all stakeholders to confront the hard questions about data privacy, model bias, clinical efficacy, and long-term reliability before committing resources.
Effective AI governance operates on two levels: operational planning and technical assessment. The operational side addresses workflow integration, stakeholder alignment, and policy development. The technical side rigorously assesses what the AI model is actually doing–its capabilities, failure modes, and real-world performance.
Both components are essential for successful AI adoption, and developing this framework requires specialized expertise. If your team lacks AI experience, several resources are available to help. We offer educational materials and evaluation tools free of charge to hospitals, and can connect teams with experts at Northeastern University who specialize in AI governance frameworks. Organizations like the Coalition for Health AI (CHAI) and the College of American Pathologists (CAP) also provide valuable guidance for teams building their governance frameworks.
Operational Alignment
The goal of having this plan in place is both to evaluate initial adoption and to manage continued use. We need an evaluation and usage framework that gives each stakeholder the evidence they need to approve adoption and to stay confident in the tool once it is in routine use.
Here are a few of the questions a governance plan should address:
- Who is legally and morally liable when a patient is misdiagnosed with assistance from an AI tool?
Although the pathologist has the final say, did the usage of the tool induce a hastier, less thorough diagnosis, regardless of intention? These are complex ethical questions that require input from more than just the technical team. It is crucial to have a moral philosopher or ethicist guide these discussions. Decisions should be grounded in evidence from peer-reviewed studies on human-AI collaboration.
- How do we know the tool is ready for clinical use?
How long should the results be analyzed alongside pathologists before the AI can be trusted in real-world applications? What accuracy thresholds must it reach, and under what conditions? It is crucial for pathologists to actively use the model and understand its failure parameters to build trust.
For instance, an AI tool designed to detect cancer may achieve 99% accuracy on an academic benchmark. However, real-world testing by a pathologist might reveal that its accuracy drops to 70% on slides stained with a reagent used by your hospital's lab. This "failure parameter" must be understood and addressed with a clear protocol. One option is to have the AI flag these slides for mandatory human review before the tool can be safely deployed (a minimal sketch of such a routing rule follows this list).
- What is the necessary return on investment (ROI) for AI adoption to make sense for the hospital?
How do we measure ROI—number of cases handled by pathologists, dollars saved annually, or reduction in turnaround time? Should we track clinical metrics (diagnostic accuracy improvements), operational metrics (pathologist satisfaction scores), or focus purely on financial metrics (cost per diagnosis, payback period)?
What are the specific targets—a 25% increase in throughput, a 20% cost reduction, or breaking even within 36 months? Just as with any other technology procurement, you need to map out the goal for adoption and account for hidden costs like staff training and annual licensing fees (a back-of-the-envelope payback sketch follows this list). AI will not magically solve every problem on its own.
- How will this tool access patient data?
This question cuts two ways. First, where is the patient data going? Will it be stored in the cloud, sending sensitive information outside your hospital's network, or on-premises, keeping data behind your firewalls at the cost of significant computational requirements? Second, can the chosen setup actually support clinical use? Your IT team must ensure the tool complies with privacy laws such as HIPAA and performs fast enough that slow turnaround does not undermine its clinical value.
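Returning to the readiness question above: once local testing has identified a failure parameter, the deployment protocol can encode it directly. The sketch below is a minimal illustration rather than a production system; the stain name, confidence threshold, and routing labels are hypothetical placeholders that a real hospital would replace with values from its own validation study.

```python
# Minimal sketch of a routing rule that sends known weak spots to mandatory
# human review. All names and thresholds are hypothetical placeholders.
from dataclasses import dataclass

KNOWN_FAILURE_STAINS = {"rapid_hematoxylin_protocol_x"}  # found during local testing
MIN_CONFIDENCE = 0.90                                    # below this, a human decides

@dataclass
class Slide:
    slide_id: str
    stain_protocol: str

def route(slide: Slide, ai_confidence: float) -> str:
    """Decide how a case enters the pathologist's worklist."""
    if slide.stain_protocol in KNOWN_FAILURE_STAINS:
        return "mandatory_pathologist_review"    # documented failure parameter
    if ai_confidence < MIN_CONFIDENCE:
        return "mandatory_pathologist_review"    # model is unsure
    return "pathologist_signoff_with_ai_draft"   # normal assisted workflow
```

The important part is not the code but that the rule is explicit, reviewable, and tied to evidence gathered during the evaluation period.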
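And for the ROI question: the arithmetic itself is simple; the discipline lies in using measured numbers rather than vendor estimates. Every figure in the sketch below is a made-up placeholder.

```python
# Back-of-the-envelope payback calculation. All figures are hypothetical
# placeholders; replace them with numbers from your own pilot and contracts.
annual_license = 120_000           # USD per year
integration_and_training = 80_000  # one-time cost
cases_per_year = 30_000
minutes_saved_per_case = 2.5       # measured in your pilot, not quoted by the vendor
pathologist_cost_per_minute = 3.0  # fully loaded USD per minute

annual_savings = cases_per_year * minutes_saved_per_case * pathologist_cost_per_minute
net_annual_benefit = annual_savings - annual_license                 # 105,000 here
payback_months = 12 * integration_and_training / net_annual_benefit  # ~9.1 months

print(f"Annual time savings: ${annual_savings:,.0f}")
print(f"Net annual benefit:  ${net_annual_benefit:,.0f}")
print(f"Payback period:      {payback_months:.1f} months")
```

If pathologists end up re-reviewing a large fraction of cases, the minutes saved per case collapse and the payback period can stretch past the contract term.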
These are just a few of the important questions that need to be answered before adoption. Without a proper governance framework, it is impossible to responsibly adopt AI and use it to its full potential. This process always requires the right stakeholders at the table: administrators, pathologists, AI experts, and ethicists. You need all of them to set the goals and parameters for adopting a technology with as many health and moral implications as AI.
Technical Execution
Once clear criteria are laid out for the adoption and use of a tool, a technical evaluation must determine whether those criteria are met. To do this right, you must hire or consult an AI expert. As an example, understanding why an AI tool made an incorrect prediction is non-trivial. You need someone who can empirically or theoretically open up the model’s black box and give practitioners an understanding of what’s happening.
Consider the difference between a junior pathologist who relies on a "gut feeling" and an experienced pathologist who can point to specific features and explain their reasoning. A "black box" AI is like the former: it delivers a diagnosis without justification. Neural representation analysis lets us examine the model and determine which features, such as shapes, colors, and textures, it uses to make a decision, and whether those features have produced reliable diagnoses before. This doesn't make the AI perfectly transparent, but interpretability research is advancing rapidly and provides increasingly powerful tools for understanding. That understanding is what moves us from simply trusting a model to actually knowing what it does. With this analysis we can answer critical questions such as "Why did the AI misclassify this slide?" and "Is it focusing on the cancer or on an irrelevant detail?", turning a black box into an accountable tool that, like the experienced pathologist, can be asked to explain its reasoning.
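To make this concrete, here is one minimal sketch of such an analysis: capturing a model's internal representation of a slide and retrieving the previously validated cases it most resembles. This illustrates the general idea rather than any specific vendor's tooling; the model, the probed layer, and the embedding bank of pathologist-confirmed cases are all assumptions.

```python
# Minimal sketch: compare a slide's internal representation against an
# embedding bank built from cases whose diagnoses pathologists have already
# confirmed. The model, the probed layer, and the bank are hypothetical.
import torch
import torch.nn.functional as F

def get_embedding(model: torch.nn.Module, layer: torch.nn.Module,
                  x: torch.Tensor) -> torch.Tensor:
    """Capture the activation of `layer` during a forward pass."""
    captured = {}
    handle = layer.register_forward_hook(
        lambda _module, _inputs, output: captured.update(z=output.detach()))
    with torch.no_grad():
        model(x)
    handle.remove()
    return captured["z"].flatten(start_dim=1)        # shape: (batch, features)

def nearest_validated_cases(query: torch.Tensor, bank: torch.Tensor,
                            case_labels: list, k: int = 5):
    """Return the k confirmed cases whose embeddings most resemble the query."""
    sims = F.cosine_similarity(query, bank)          # query: (1, d), bank: (n, d)
    top = torch.topk(sims, k)
    return [(case_labels[int(i)], float(s)) for s, i in zip(top.values, top.indices)]
```

If a misclassified slide's nearest neighbors are all, say, over-stained benign cases, that is a concrete, testable hypothesis about what the model is keying on, and exactly the kind of explanation a pathologist can act on.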
This is where having the proper technical expertise and evaluation tools is critical.
Let’s look at a few scenarios:
- Scenario 1: You are given a 90% accuracy score on an academic tumor detection benchmark. This tells the pathologist very little about how trustworthy the tool is for detecting metastasis in breast tissue. Instead, a benchmark focused on breast metastasis should be used, and a stratified analysis should separate the cases the model handles reliably from those that are error-prone (see the stratified-evaluation sketch after these scenarios). The model's confidence values should also be examined to anticipate which previously unseen scenarios might produce unexpected failures.
- Scenario 2: A benchmark reports 98% recall for cell type classification on slides scanned with a Hamamatsu scanner and stained with a 3-minute hematoxylin protocol. Will the model generalize to your hospital’s slides, which are scanned on a 3DHistech scanner with an 8-minute hematoxylin process?
Hard to say. You should ideally test on your own hospital’s data. But do you have the infrastructure and technical expertise to do that? Do you have the time? If not, you can use a carefully vetted style-transfer model to capture the style of your hospital’s slides and transform an external benchmark to mimic your hospital’s data. It’s not perfect, but that would be better than blindly accepting the benchmark results.
- Scenario 3: A vendor presents impressive 95% accuracy metrics for their AI tool. Six months after deployment, your pathologists report spending excessive time reviewing certain cases. Analysis reveals the model is only 70% accurate on patients over 65—who represent 40% of your caseload. The promised efficiency gains have evaporated because pathologists must carefully review all these cases.
As a hospital admin calculating ROI, you need to understand exactly how pathologists will use the AI model in practice. Within the context of computer-aided diagnostics, different trust levels based on known model behavior produce different cognitive loads on pathologists using the results. If significant time must be spent overseeing results for large patient populations, your efficiency gains disappear. This real-world usage pattern—not just headline accuracy—needs to be modeled when computing ROI, but this information is rarely available before purchase.
- Scenario 4: Your hospital has been using an AI model for six months, and it has been working well. The vendor released an update, claiming "improved performance." Unlike with consumer software, where updates are routine, pathology AI updates can introduce dangerous regressions. What worked yesterday might fail today. How can you verify that this update won't compromise patient care?
The vendor should provide model drift analysis on all benchmarks and explain in practical terms what the changes mean for your hospital. They should use neural representation tracking to account for changes in model behavior. Without these guarantees, you're essentially beta-testing on patients. Your hospital must also maintain an internal evaluation set of both common and rare cases to independently verify that each update maintains or improves performance across all patient populations (a sketch of such a regression check follows these scenarios).
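As referenced in Scenario 1 (and illustrated painfully in Scenario 3), stratified evaluation is mostly careful bookkeeping rather than deep learning. The sketch below is a minimal illustration; the field names and example records are hypothetical, and in practice the records would come from your own labelled evaluation set.

```python
# Minimal sketch of a stratified evaluation: one headline accuracy number is
# replaced by a breakdown over the attributes that matter locally (age band,
# stain protocol, scanner). Field names and records are hypothetical.
from collections import defaultdict

def stratified_accuracy(records, key):
    """records: iterable of dicts with a boolean 'correct' plus metadata fields."""
    hits, totals = defaultdict(int), defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        hits[group] += int(record["correct"])
    return {group: hits[group] / totals[group] for group in totals}

records = [
    {"correct": True,  "age_band": "under_65", "scanner": "hamamatsu"},
    {"correct": True,  "age_band": "under_65", "scanner": "3dhistech"},
    {"correct": False, "age_band": "over_65",  "scanner": "3dhistech"},
    # ... the rest of your labelled evaluation set
]
print(stratified_accuracy(records, "age_band"))   # exposes an over-65 gap
print(stratified_accuracy(records, "scanner"))    # exposes scanner sensitivity
```

A 95% headline can coexist with a 70% subgroup; the breakdown is what tells you whether the review burden in Scenario 3 will erode your ROI.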
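And for Scenario 4, here is a minimal sketch of the internal regression check mentioned above: replay your own evaluation set through the old and new model versions and surface every case whose answer changed. The `predict_old` and `predict_new` callables and the case format are assumptions.

```python
# Minimal sketch of an update regression check over an internal evaluation set.
# `predict_old` and `predict_new` wrap the current and updated model versions.
def regression_report(cases, predict_old, predict_new):
    """cases: iterable of (case_id, model_input, ground_truth) tuples."""
    fixed, regressed, still_wrong, unchanged = [], [], [], 0
    for case_id, x, truth in cases:
        old, new = predict_old(x), predict_new(x)
        if old == new:
            unchanged += 1
        elif new == truth:
            fixed.append(case_id)         # the update corrected this case
        elif old == truth:
            regressed.append(case_id)     # the update broke a previously correct case
        else:
            still_wrong.append(case_id)   # the answer changed but is still wrong
    return {"unchanged": unchanged, "fixed": fixed,
            "regressed": regressed, "changed_still_wrong": still_wrong}
```

Any non-empty regressed list, especially one concentrated in a particular patient population, is grounds to hold the update until the vendor can explain it.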
Good technical evaluation comprises two components. The first is building the right datasets for the specific goal being evaluated; each data point is a piece of empirical evidence about how the model behaves. The second is explanation: just as physicists run experiments in order to develop theories, we must pair carefully designed benchmarks with neural representation analysis that mechanistically illuminates the black-box model. Only when stakeholders understand why a misclassification happened can they decide how to proceed. Just as you can ask a pathologist why they made a decision, we must be able to ask the same of our AI models.
Good technical evaluation doesn’t guarantee perfection. But it reduces risk as far as possible and keeps all stakeholders aligned around the same evidence.
The Future of Trust
We envision a future where AI adoption in pathology is based on earned trust rather than marketing promises or academic benchmarks alone. Real-world, high-stakes medical applications demand highly personalized evaluations that align with actual clinical objectives and patient outcomes. This requires a systematic bridging of the gap between AI vendors and hospitals through objective, data-driven evaluations.
Our role is providing the technical evaluation tools that allow both parties to speak the same language, interpreting model performance in terms that matter for specific clinical needs. When stakeholders can see exactly what is and isn’t good enough for adoption, procurement becomes a rational decision rather than a leap of faith.
The stakes extend beyond individual hospitals. Pathology continues to rely heavily on subjective interpretation, with diagnosis often based on consensus rather than objective standards. The technology exists to change this: AI tools that could provide consistency and standardization while improving patient outcomes. Through rigorous evaluation and explainable AI systems, we can move towards standardized biological interpretation that frees pathologists to re-focus on the field’s true purpose–the study of disease itself–advancing our understanding of pathophysiology and developing better treatments.
The tools for transformation are here. The question is whether we as an industry will act decisively or continue accepting inconsistent care when better solutions are within reach.