
See what your AI tests missed.

BluelightAI reads your existing test and production traces at the concept level and surfaces the behavior your evals never described, so you can see why a model failed and catch it before it reaches production.

Built for the data science, risk, and engineering teams responsible for models in production across banking, financial services, insurance, healthcare, and defense.

The Problem

Your tests only cover the cases you imagined.

Every eval is a list of failures someone already knew to check for. The real world is not bound by that list. It keeps generating new problems: rare inputs, distribution shifts, and behaviors you never knew were relevant to the outcome.

That gap is where most production failures start. The cost is highest in banking, financial services, insurance, healthcare, and defense. By the time the failure surfaces, the model has already acted, and nothing in your tests explains it.

The Method

We compare what you tested to what your model actually does in production

01

Read the traces you already have

We work from the test and production traces your systems already record, using your existing evaluation and observability tools. No new instrumentation, and no change to your stack.

02

Analyze behavior, not just inputs and outputs

We read those traces at the concept level, identifying the concepts that actually drive the model's behavior, and map them against the coverage your tests already provide.

03

Show you the coverage gap

You see the concepts and cases your test suite never exercised, and the behavior that shows up in production but nowhere in your tests. Every finding describes behavior, with a traceable record.
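As a rough illustration of what a coverage gap means, the sketch below treats each trace as a list of concept labels and reports the concepts that appear in production but never in tests. The function name `coverage_gap`, the trace format, and the concept labels are all hypothetical, chosen only for this example; they are not BluelightAI's or Cobalt's API.

```python
from collections import Counter

def coverage_gap(test_traces, production_traces):
    """Return concepts seen in production but never in tests, with production counts.

    Each trace is assumed (for illustration only) to be a list of concept labels.
    """
    # Every concept that at least one test trace exercised.
    tested = {concept for trace in test_traces for concept in trace}
    # How often each concept shows up across production traces.
    prod_counts = Counter(c for trace in production_traces for c in trace)
    # Keep only the concepts that no test ever touched.
    return {c: n for c, n in prod_counts.items() if c not in tested}

# Hypothetical concept labels for a lending workload.
test_traces = [["income_verification", "address_match"], ["income_verification"]]
production_traces = [
    ["income_verification", "gig_income"],
    ["gig_income", "address_match"],
    ["foreign_currency"],
]

gap = coverage_gap(test_traces, production_traces)
print(gap)  # {'gig_income': 2, 'foreign_currency': 1}
```

In this toy data, two concepts drive production behavior that no test exercised. A real analysis would first have to extract concepts from the traces, which this sketch takes as given.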

How We Fit

Adding depth to your current stack

Your evaluation and observability tools run the tests you defined and log every run. We read what they record and do something they were not built for: compare that coverage against what the model actually does in production, at the concept level.

BluelightAI Advantage

We look past inputs and outputs to the behavior actually driving each decision. We surface the cases your tests never checked and the failures standard testing tools were never built to find.

Model Change

Every new model reopens the evaluation gap

A test suite grows to catch the weaknesses of a specific model. Swap or upgrade the model and the suite no longer targets the new model's weak points. Re-running the same test cases with the same metrics can also miss subtle differences in behavior. Whatever the reason for changing models, we can help make sure the transition goes smoothly.

Pre-deployment safety review

When a model has to be confirmed safe before launch, you need evidence on your workload, not a generic benchmark.

Avoiding single-model dependence

Keeping a backup model ready means understanding behavior across both models, not just the primary.

Applications

Where an unseen failure costs the most

We go deepest with teams in banking, financial services, insurance, healthcare, and defense, where an automated decision carries real consequences and has to hold up under scrutiny.

Mission & Intelligence

For models assessed before deployment in high-stakes settings, we characterize behavior on the real workload, including bias patterns and fragility no benchmark anticipated.

Fraud Detection

Scaling auto-close means trusting the model on cases no analyst reviewed. We surface the triage behavior your tests never covered, before it closes the wrong case.

Collections

When agents follow the model most of the time, its untested behavior propagates quietly. We show where production decisions diverge from anything your tests described.

Lending & Credit

Automated decisions need transparent reasoning, especially on non-traditional data. We surface behavior patterns that post-hoc explainability tools miss.

Across all of these, the starting point is the same: the failures already present in production, on the cases no one thought to test.

Fundamental Technology

How we read behavior other tools cannot

Two of the most rigorous methods for understanding AI systems, the same families of techniques used by leading AI research labs, applied to your production traces.

Topological Data Analysis

Reveals the shape of high-dimensional behavior without imposing assumptions, surfacing clusters, transitions, and failure modes standard testing misses. Co-invented by our founder, Dr. Gunnar Carlsson, at Stanford.
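To give a feel for the idea, the sketch below computes one of the simplest topological summaries: the connected components of a point cloud at a chosen distance scale. It is a minimal illustration under stated assumptions, not the method Cobalt uses; the function name `components_at_scale` and the sample points are hypothetical.

```python
from itertools import combinations
from math import dist

def components_at_scale(points, epsilon):
    """Group point indices into connected components, linking points within epsilon."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the component root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the components of every pair of points closer than epsilon.
    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= epsilon:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Two well-separated groups of hypothetical 2-D behavior embeddings.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]

clusters = components_at_scale(points, epsilon=0.5)
print(clusters)  # [[0, 1, 2], [3, 4]]
print(len(components_at_scale(points, epsilon=10.0)))  # 1
```

The number of components changes as the scale grows, and tracking which groupings persist across scales is the basic intuition behind topological summaries of data. Nothing about the grouping is imposed in advance beyond the distance measure.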

Mechanistic Interpretability

Decomposes model activations into interpretable features, using sparse autoencoders and cross-layer transcoders, so you see not just what the model predicts, but why.
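The sketch below shows only the forward pass of a sparse autoencoder: an activation vector is encoded into a larger set of non-negative features, most of which are zero, and then decoded back. The weights here are hand-picked for illustration rather than trained, and the names (`sae_encode`, `sae_decode`, `W_ENC`, `W_DEC`) are hypothetical; in practice the sparsity comes from training with a sparsity penalty, and cross-layer transcoders are not covered here.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def sae_encode(activation, w_enc, b_enc):
    """Encode an activation into feature strengths: ReLU(W_enc @ x + b_enc)."""
    pre = [p + b for p, b in zip(matvec(w_enc, activation), b_enc)]
    return [max(0.0, p) for p in pre]

def sae_decode(features, w_dec):
    """Reconstruct the activation as a weighted sum of feature directions."""
    return matvec(w_dec, features)

# Toy setup: a 2-dimensional activation decomposed into 3 candidate features.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # one row per feature
B_ENC = [0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0, -0.5], [0.0, 1.0, -0.5]]    # one row per activation dimension

activation = [2.0, 0.0]
features = sae_encode(activation, W_ENC, B_ENC)
reconstruction = sae_decode(features, W_DEC)
active = [i for i, f in enumerate(features) if f > 0.0]

print(features)        # [2.0, 0.0, 0.0]
print(reconstruction)  # [2.0, 0.0]
print(active)          # [0]
```

Only one of the three features fires for this input, and that single feature is enough to reconstruct the activation. Reading which features are active for a given input is what makes the representation inspectable.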


Verifiable by design

Together these produce reasoning your technical and risk teams can verify, not just trust.

LLM Explorer

Mechanistic Interpretability in Action

Our cross-layer transcoders for Qwen3 models reveal interpretable concepts inside an LLM's mind. These concepts connect into computational circuits, tracing how the model turns a prompt into an output, and how production diverges from what you tested.

With TDA, we can map relationships between features and concepts to show how they combine in hierarchical ways.

[Figure: Cobalt table view of dataset analysis]
Cobalt

The engine behind the analysis

Cobalt is our TDA-powered engine for AI interpretability. It helps data teams discover, inspect, and verify what a model is doing across test runs and production.

By relating model inputs, interpretable features, and decisions, Cobalt gives technical and compliance stakeholders a shared, verifiable view of where coverage is strong and where it is thin.

Install: pip install cobalt-ai

The Team

Sachin Khanna, CEO

Gunnar Carlsson, Founder

Jakob Hansen, Head of Data Science

John Carlsson, Principal Scientist

David Fooshee, Principal Scientist

BluelightAI was founded by Dr. Gunnar Carlsson, one of the inventors of Topological Data Analysis at Stanford. The founding team combines pioneering research in TDA and mechanistic interpretability with decades of enterprise software execution across global organizations.

Our advisory board brings deep BFSI credibility spanning tier-1 banking CTOs, AI governance leadership at global financial institutions, and PhD-level expertise in explanation-based AI. We navigate both the scientific complexity of interpretability and the operational reality of deploying AI in high-stakes environments.