Beyond Benchmarks – A New Paradigm for Building Trustworthy Pathology AI
August 12, 2025

Scientific progress often starts with an observation, an idea, a hypothesis about how the universe, the world, or the human body works. From there, progress relies on a structured approach: repeated cycles of hypothesis-experiment-analysis where each iteration builds on the previous, refining our understanding through successive approximations until robust theories emerge. For centuries, researchers in fields like pathology, genetics, and the engineering sciences have followed this paradigm to great success. Yet in modern AI, we rarely formalize hypotheses, we blindly run tons of experiments, and we avoid the necessary analysis to understand exactly what’s happening.

We’re going about this the wrong way.

What’s the Problem?

As an industry, we let the low barriers to model experimentation push us toward rapid iteration rather than deliberate exploration. In contrast, wet labs constrained to monthly experiments optimize for deliberate hypothesis testing, where each result compounds into coherent theory.

This preference for speed over understanding is problematic for us in a number of ways.

Lack of principled direction

We've replaced the scientific method with computational lottery tickets. Rather than continuously building principled hypotheses about what might work and why, teams simply try everything: different architectures, datasets, hyperparameters, hoping that sufficient compute will reveal the winning combination. This works until it doesn't. When models exhibit unexpected behavior due to spurious correlations or distribution shifts, debugging becomes a nightmare. Without understanding the underlying mechanisms, teams face agonizingly long debug cycles, often rebuilding from scratch rather than systematically addressing root causes.

The result? Brittle models that frustrate developers—reducing them to hyperparameter tuners—while giving executives anxiety, never knowing if debugging will take weeks or months.

Benchmarks that Don’t Tell the Whole Story

When the stars finally align and we’re blessed with a working model, how do we determine efficacy? We chase percentage point improvements over competitors on benchmarks that are often very loose proxies for end-user goals.

A 2% accuracy improvement tells us nothing about systematic failures. Your medical imaging model could have sacrificed robust representations that generalize across hospitals in exchange for improved confidence on a single stain and scanner combination that was over-represented in your benchmark. We aren’t stratifying our evaluation sets today, and the consequences are predictable: unexpected brittleness and silent distribution shift failures. Instead of hiding behind coarse dataset-level metrics, we should be mining evaluation data for systematic failure modes that guide our choices. In this way, we can frame every decision we make as a deliberate trade-off.
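To make stratified evaluation concrete, here is a minimal sketch (the grouping keys and record fields are illustrative assumptions, not a prescribed schema) that reports accuracy per stain-and-scanner slice instead of a single dataset-level number:

```python
from collections import defaultdict

def sliced_accuracy(records):
    """Group evaluation records by (stain, scanner) and report per-slice accuracy.

    Each record is a dict with hypothetical fields:
    'stain', 'scanner', 'label', 'prediction'.
    """
    buckets = defaultdict(list)
    for r in records:
        buckets[(r["stain"], r["scanner"])].append(r["label"] == r["prediction"])
    return {slice_key: sum(hits) / len(hits) for slice_key, hits in buckets.items()}

# Example: an aggregate gain can hide a regression on an under-represented slice.
records = [
    {"stain": "H&E", "scanner": "A", "label": 1, "prediction": 1},
    {"stain": "H&E", "scanner": "A", "label": 0, "prediction": 0},
    {"stain": "H&E", "scanner": "B", "label": 1, "prediction": 0},
    {"stain": "IHC", "scanner": "B", "label": 0, "prediction": 0},
]
print(sliced_accuracy(records))
# {('H&E', 'A'): 1.0, ('H&E', 'B'): 0.0, ('IHC', 'B'): 1.0}
```

A per-slice report like this is the raw material for framing each modeling decision as an explicit trade-off between slices rather than a single headline number.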

“All models are wrong, but some are useful” – George Box

Box’s insight points us toward our real goal: every evaluation we run should be designed to illuminate a specific aspect of a model and translate directly into targeted guidance for improvement. We currently treat model evaluation as performance theater in search of “correctness” when we should be using it as a diagnostic tool that guides us toward “usefulness”.

Fragmented Knowledge that Doesn’t Scale

After the benchmarks are beaten and the model shipped, what happens to the hundreds of experiments that got us here? They’re left to sit in your Weights & Biases account, and the insights from them are scattered across researchers as bits and pieces of intuition that fade over time.

We discard two types of critical knowledge: the technical insights about how our models respond to different architectures, loss functions, and hyperparameter choices, as well as the human intuition about why certain experiments were worth trying.

Tribal knowledge must be institutionalized so that every developer can build on past insights. Failing to do this condemns your company’s output to rise and fall with the quality of researchers you currently employ rather than the legacy of work you’ve built up until now.

A New Paradigm

How do we fix this?

For developers, running systematic experiments feels daunting when management incentivizes getting lucky with rapid iteration. The result is a nagging sense that we're optimizing hyperparameters instead of advancing science, running experiments like machines rather than researchers, all without a clear goal. This isn't a critique of individual developers; it's a systemic problem of over-optimizing a metric, a textbook example of Goodhart's Law.

Shifting established practices is challenging, particularly when they serve immediate business needs. But meaningful progress doesn't require starting from scratch. We believe the solution has two components: hyper-experimentation and neural representation analysis, both beginning with something deceptively simple: proper data collection.

Hyper-Experimentation

We've already entered the age of rapid experimentation. With tools like Claude Code and Cursor, developer time is no longer the bottleneck for exploration; compute capacity and pattern recognition are. The question isn't whether to run more experiments; teams are already doing this instinctively. The question is whether we're learning from them.

Hyper-experimentation involves systematically mapping model behavior using a hypothesis-driven approach. Instead of running isolated experiments and discarding the insights, we create comprehensive behavioral maps along two critical dimensions: how individual models respond to different inputs, and how different model configurations perform across the same tasks.
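One lightweight way to start building such a map is to attach an explicit hypothesis and conclusion to every run, so the behavioral map accumulates as structured data rather than scattered intuition. The record format below is a sketch under our own assumptions, not a prescribed schema:

```python
from dataclasses import dataclass, field, asdict
from typing import Dict
import json, time

@dataclass
class ExperimentRecord:
    """A single entry in a behavioral map: what we tried, why, and what we observed."""
    hypothesis: str            # the belief this run was designed to test
    config: Dict[str, object]  # architecture, loss, hyperparameters, data version
    metrics: Dict[str, float]  # stratified metrics, not just one aggregate score
    conclusion: str = ""       # filled in after analysis: supported / refuted / unclear
    timestamp: float = field(default_factory=time.time)

def log_experiment(record: ExperimentRecord, path: str = "experiments.jsonl") -> None:
    """Append the record to a JSON-lines log so later runs can query it."""
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")

log_experiment(ExperimentRecord(
    hypothesis="Stain-color augmentation improves scanner-B accuracy without hurting scanner A",
    config={"augmentation": "hed_jitter", "lr": 3e-4, "backbone": "vit_s"},
    metrics={"acc/scanner_A": 0.94, "acc/scanner_B": 0.88},
    conclusion="supported: scanner B improved, scanner A unchanged",
))
```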

The raw material for this already exists in every ML pipeline. Teams routinely benchmark models and track performance across versions. The problem isn't data collection—it's that we ignore the intermediate experiments, failed attempts, and subtle patterns that emerge between model conception and finalization. These contain the very insights that could guide our next decisions.

Equally important is how we evaluate in the first place. Rather than generic accuracy metrics, we need targeted evaluation sets that probe specific model behaviors. Want to understand robustness? Design evaluations that systematically stress-test failure modes. Want to predict performance on novel inputs? Map the boundaries where your model breaks.
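As a sketch of what a targeted robustness probe might look like (the Gaussian-noise corruption and the `predict` interface are stand-in assumptions for domain-specific perturbations such as blur or stain shift), the snippet below sweeps increasing severity and records where accuracy begins to break down:

```python
import numpy as np

def stress_test(predict, images, labels, severities=(0.0, 0.02, 0.05, 0.1, 0.2)):
    """Map a model's failure boundary under increasing corruption severity.

    `predict` is any callable mapping a batch of images to predicted labels;
    additive Gaussian noise here stands in for domain-specific corruptions.
    """
    rng = np.random.default_rng(0)
    results = {}
    for sigma in severities:
        noisy = np.clip(images + rng.normal(0.0, sigma, images.shape), 0.0, 1.0)
        results[sigma] = float((predict(noisy) == labels).mean())
    return results  # accuracy per severity level, revealing where breakdown starts

# Usage with a toy threshold "model" on synthetic data:
images = np.random.default_rng(1).random((64, 8, 8))
labels = (images.mean(axis=(1, 2)) > 0.5).astype(int)
predict = lambda x: (x.mean(axis=(1, 2)) > 0.5).astype(int)
print(stress_test(predict, images, labels))
```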

When we collect this data systematically over time, patterns emerge. Certain hyperparameters consistently improve robustness. Systematic errors caused by spurious correlations become identifiable. These insights transform from tribal knowledge into a working theory of your model. Nothing here is groundbreaking; it’s an incremental step beyond what folks are already doing. But largely because of organizational incentives, developers today don’t have the time to run these analyses, even though they would produce outsized returns.

Neural Representation Analysis

Hyper-experimentation reveals what works, but to build truly robust models, you need to understand why. Behavioral patterns tell you that certain hyperparameters improve robustness, but they don't explain the underlying mechanisms that dictate model behavior.

Neural analysis addresses this gap. Even if your evaluation scores don’t change at all, a change in the neural representation of a subset of examples can signal a change in robustness. Your model might maintain the same accuracy on clean images while developing completely different internal representations that are more or less resilient to noise, blur, or distribution shift.
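One way to quantify such a representational change, shown as a hedged sketch below, is linear centered kernel alignment (CKA) between activations from two checkpoints on the same evaluation subset; the choice of CKA here is ours for illustration, not a prescription:

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear Centered Kernel Alignment between two activation matrices.

    X, Y: (n_examples, n_features) activations from two checkpoints on the same
    evaluation subset. Values near 1 mean similar representations; a drop flags
    internal drift even when accuracy is unchanged.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return float(hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))

# Compare penultimate-layer activations from two model versions on the same tiles.
rng = np.random.default_rng(0)
acts_v1 = rng.normal(size=(512, 256))
acts_v2 = acts_v1 @ rng.normal(size=(256, 256)) * 0.1 + rng.normal(size=(512, 256))
print(round(linear_cka(acts_v1, acts_v1), 3))  # 1.0: identical representations
print(round(linear_cka(acts_v1, acts_v2), 3))  # noticeably lower: representational drift
```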

Working with this data, we can transform debugging from trial-and-error into targeted intervention, building working theories about model behavior while accumulating the data foundation needed as interpretability tools continue advancing.

The Complete Picture

By pairing hyper-experimentation with neural representation analysis, every experiment automatically contributes to institutional knowledge. But this only works if you have the infrastructure to capture and analyze this data without adding overhead to your workflow.

Such a system would catch what you miss: out-of-distribution examples through representational drift, label errors through inconsistent clusters, spurious correlations before they break in production. When you're debugging, instead of guessing, you probe accumulated data: "What patterns correlate with robustness failures?" or "Show me the experiments where distribution shift was handled well."
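As a hedged sketch of one such probe, assuming per-example embeddings are stored alongside results, the snippet below flags incoming examples whose representations sit far from the training distribution, a simple form of representational drift detection:

```python
import numpy as np

def flag_representational_outliers(train_embeddings, new_embeddings, quantile=0.99):
    """Flag incoming examples whose embeddings sit far from the training distribution.

    Distance to the training centroid (scaled per dimension) is a deliberately simple
    drift score; the point is that it is computed and stored for every example, so it
    can be queried later during debugging.
    """
    mu = train_embeddings.mean(axis=0)
    sigma = train_embeddings.std(axis=0) + 1e-8
    train_dist = np.linalg.norm((train_embeddings - mu) / sigma, axis=1)
    threshold = np.quantile(train_dist, quantile)
    new_dist = np.linalg.norm((new_embeddings - mu) / sigma, axis=1)
    return np.where(new_dist > threshold)[0], new_dist

rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 128))
incoming = np.vstack([rng.normal(size=(95, 128)), rng.normal(loc=4.0, size=(5, 128))])
flagged, scores = flag_representational_outliers(train, incoming)
print(flagged)  # indices of the shifted examples
```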

This is exactly what Tessel's Neural Developer Tools provide—a platform to build models you can trust with clear guidance on where to improve next.

While these principles apply across all AI domains, certain fields face unique challenges that make systematic model development not just beneficial, but essential for survival. Medical imaging, and specifically digital pathology, is one such field.

Digital Pathology

AI-driven digital pathology is still very much in its infancy, with computer-aided diagnostics representing an even smaller subset. This creates an opportunity for model providers to establish themselves, but it also reveals a critical challenge unique to medical imaging AI.

Unlike mature industries like manufacturing, where every percentage point of accuracy improvement drives competitive advantage, pathology AI companies face a different priority: getting in the door first. The primary barrier isn't marginal performance gains—it's establishing trust with risk-averse healthcare institutions that need confidence their diagnostic tools will work reliably and translate from benchmarks to clinical outcomes. This is a tricky subject that involves different regulatory requirements for different workflows, split across LDTs (laboratory-developed tests) and FDA approvals, but trust is at the center of it all.

Once deployed, the challenge shifts to maintaining that trust through controlled improvement: How do you manage continuous improvement without risking performance degradation? How do you guarantee that updates won't introduce new failure modes in critical diagnostic scenarios?
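One way to make controlled improvement operational is a release gate that compares a candidate update against the deployed model on critical diagnostic slices and blocks the update if any slice regresses beyond a tolerance, regardless of aggregate gains. The slice names and thresholds below are illustrative assumptions:

```python
def release_gate(deployed_metrics, candidate_metrics, critical_slices, tolerance=0.01):
    """Return (approved, regressions) for a candidate model update.

    deployed_metrics / candidate_metrics: dicts mapping slice name -> score
    (e.g. sensitivity on a critical diagnostic scenario). The candidate is
    rejected if any critical slice drops by more than `tolerance`, even when
    its overall average is higher.
    """
    regressions = {
        s: (deployed_metrics[s], candidate_metrics[s])
        for s in critical_slices
        if candidate_metrics[s] < deployed_metrics[s] - tolerance
    }
    return len(regressions) == 0, regressions

deployed = {"overall": 0.91, "her2_low": 0.88, "small_lesions": 0.84}
candidate = {"overall": 0.93, "her2_low": 0.85, "small_lesions": 0.86}
approved, regressions = release_gate(deployed, candidate, ["her2_low", "small_lesions"])
print(approved, regressions)  # False, {'her2_low': (0.88, 0.85)} despite the higher overall score
```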

We believe this is the perfect opportunity for collaboration.

We’re currently working with Stanford Radiation Oncology on robustness benchmarking through empirical and neural representation analysis. Our recent work demonstrates this in practice—we automated label error detection in pathology datasets, and will share a separate post shortly showing how our SDK easily integrates into existing workflows to catch data quality issues that undermine model reliability. These are exactly the kind of hidden problems that contribute to brittle models and erode trust with healthcare institutions.

If you're a hospital looking to rigorously evaluate pathology AI models before deployment, or developing AI tools that need systematic robustness validation, let's discuss how we can help.

Roadmap

Our development roadmap follows the same progression we laid out in our vision: establish data foundations, enable rigorous evaluation, then build systematic analysis and remediation capabilities.

Stain and Scanner Robustness Benchmarks

We're developing benchmarks that help hospitals evaluate whether external AI models will work reliably on their patient data. Our approach uses neural style transfer to generate test images that match a hospital's scanner and staining characteristics, then validates model performance through neural representation alignment. These benchmarks can operate in federated settings, enabling rigorous pre-deployment testing that builds confidence in model generalization across different hospital environments.
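Our full pipeline uses neural style transfer, but a simplified sketch of the underlying idea is shown below: perturb test tiles in HED (hematoxylin, eosin, DAB) stain space to approximate a target site's staining characteristics, then measure how far the model's representations move. The scikit-image color conversions are standard; the perturbation magnitudes and the `embed` interface are illustrative assumptions.

```python
import numpy as np
from skimage.color import rgb2hed, hed2rgb

def stain_jitter(rgb_image, h_scale=1.05, e_scale=0.95, d_shift=0.0):
    """Perturb an RGB pathology tile in HED stain space.

    Scaling the hematoxylin/eosin channels approximates site-to-site staining
    differences; a full benchmark would fit these parameters to the target
    hospital's slides (e.g. via style transfer) rather than fixing them by hand.
    """
    hed = rgb2hed(rgb_image)
    hed[..., 0] *= h_scale   # hematoxylin
    hed[..., 1] *= e_scale   # eosin
    hed[..., 2] += d_shift   # DAB / residual channel
    return np.clip(hed2rgb(hed), 0.0, 1.0)

def representation_shift(embed, tiles, perturbed_tiles):
    """Mean cosine distance between embeddings of original and stain-perturbed tiles."""
    a, b = embed(tiles), embed(perturbed_tiles)
    cos = (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return float(1.0 - cos.mean())

# Usage: `embed` is any callable returning (n, d) embeddings for a batch of tiles;
# a large shift on stain-perturbed tiles is an early warning before deployment.
```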

Interpretable Neural Analysis

To build trust with clinicians and regulators, models must explain their decisions in clinically meaningful ways. We're actively developing interpretability tools that decompose model representations into human-interpretable concepts. This helps show not just that a model detected a tumor, but which visual features it relied on and whether those align with clinical knowledge.

Our current research focuses on automated concept discovery and connecting learned features to specific predictions. We're expanding on the current sparse autoencoder approaches and developing view-aware sparse autoencoders, a novel technique designed to cleanly isolate specific artifacts (scanner noise, staining variations) from model representations without compromising biological signal detection.
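To ground the idea, here is a generic sparse autoencoder sketch in PyTorch (not our view-aware variant): it decomposes a layer's activations into an overcomplete set of sparsely firing features, which is the starting point for mapping those features to human-interpretable concepts.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decompose model activations into an overcomplete dictionary of sparse features."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # sparse, non-negative codes
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(x, reconstruction, features, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparse feature usage."""
    return ((reconstruction - x) ** 2).mean() + l1_coeff * features.abs().mean()

# Train on activations harvested from a chosen layer of the pathology model.
sae = SparseAutoencoder(d_model=768, d_features=8192)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
activations = torch.randn(256, 768)          # stand-in for real harvested activations
recon, feats = sae(activations)
loss = sae_loss(activations, recon, feats)
loss.backward()
optimizer.step()
```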

The ultimate goal isn't just understanding what models do wrong, but systematically fixing it. Rather than retraining entire models when artifacts are discovered, we can leverage these feature isolation methods to selectively suppress problematic concepts while preserving clinical accuracy, enabling precise, controlled model improvements.
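Reusing the sparse autoencoder sketch above, a hedged illustration of such a targeted intervention looks like the following: zero out the feature dimensions identified as encoding a scanner or staining artifact, then decode back into the model's activation space. Which indices count as artifact features is exactly what the concept-discovery step has to establish; here they are assumed.

```python
import torch

@torch.no_grad()
def suppress_features(sae, activations, artifact_feature_ids):
    """Rewrite activations with selected (assumed artifact-encoding) features zeroed out."""
    features = torch.relu(sae.encoder(activations))
    features[:, artifact_feature_ids] = 0.0   # remove e.g. scanner-noise features
    return sae.decoder(features)              # edited activations, fed back into the model

# edited = suppress_features(sae, activations, artifact_feature_ids=[101, 4097])
```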

Vision For the Future

The future of machine learning will be grounded in models that know what they know and know what they don’t. Every experiment, novel data point, and piece of feedback will contribute to an evolving, comprehensive theory of model behavior.

This starts with the collection and mapping of empirical observations, which will pave the way for mechanistic analysis of model internals to reinforce hypotheses. Think beyond standard interpretability. Instead, it’s a systematic loop connecting hypotheses and experiments to model behavior, enabling us to understand the implications behind every perturbation and design choice. During both training and inference, our understanding of a model will evolve in real-time until we reach a coherent theory that accounts for every data point.

From developer experience to model robustness, proper evaluation to lifecycle management, what we expect of our AI models will fundamentally change. Trustworthy AI will require us to approach model development as methodically as we approach software engineering.

The future of trustworthy AI requires the same methodical rigor that drives progress in established sciences. Our SDK is currently in private beta with select partners. If you're ready to transform experimental chaos into systematic knowledge, reach out to us at team@tessel.ai.