Using TDA and Sparse Autoencoders to Evaluate TruthfulQA
When we used Cobalt to analyze an LLM’s performance on the TruthfulQA dataset, we discovered five distinct failure groups of questions that the model underperformed on. We characterized them as follows:
fg/1 (questions confusing two well-known figures with the same first name)
fg/2 (questions about geography)
fg/3 (questions about laws and practices in different countries)
fg/4 (pseudoscience and commonly misattributed quotes)
fg/5 (questions that try to confuse fact with opinion)
In this post, we’ll use some more advanced tools from topological data analysis and mechanistic interpretability to go even deeper into the model’s performance and the TruthfulQA dataset. Specifically, we’ll use a set of sparse autoencoders trained on the Gemma 2 model to better understand what the model sees when it looks at the TruthfulQA questions.
Sparse autoencoders (SAEs) are trained to decode the complex internal activations at each layer of an LLM into a set of interpretable features. The ideal features are monosemantic—corresponding to exactly one concept—and meaningful both to humans and to the model’s internal computations. With SAE features, we can get a fine-grained picture of a model’s interpretation of the data, highlighting attributes that might otherwise be hidden.
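To make this concrete, here is a minimal sketch of the SAE computation in PyTorch. This is not the architecture of the actual Gemma 2 SAEs (the dimensions, class, and variable names below are illustrative only), but it shows the basic encode/decode step: each token’s activation vector is mapped to a wide, mostly-zero feature vector, which is then used to reconstruct the original activation.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: maps a d_model-dimensional activation vector to a
    much wider, mostly-zero feature vector, then reconstructs the activation."""

    def __init__(self, d_model: int = 2304, d_features: int = 32768):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        # ReLU keeps only positively activating features; during training an
        # L1 penalty on `features` pushes most of them to zero for any token.
        features = torch.relu(self.encoder(acts))
        reconstruction = self.decoder(features)
        return features, reconstruction

# Toy usage: a batch of residual-stream activations for 10 tokens.
sae = SparseAutoencoder()
acts = torch.randn(10, 2304)
features, recon = sae(acts)
print(features.shape)  # (10, 32768): sparse, per-token feature activations
```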
One challenge with using SAEs for interpretation, however, is that there are often thousands or even millions of features in a set of SAEs, and many of them may activate on any given token. While each feature may be individually interpretable, it still takes a lot of effort to understand all of the features that activate for a single prompt. This is why TDA is a powerful complement to SAEs: topological tools can help make sense of the relationships between individual SAE features and combine them into groups that highlight strong patterns in the data. This significantly reduces the amount of manual work needed to get insights from SAE features.
For this analysis, we used a recently released set of SAEs for Gemma 2-2B trained by David Chanin and available on Neuronpedia. We selected four layers (8, 12, 16, and 20) and collected SAE feature activations for each layer on the TruthfulQA prompts. This produced activations from a set of 131,072 features on nearly 40,000 tokens, for a total of over 15 million feature activations to analyze. We used Cobalt to build a TDA graph of these features and combine them into groups. For each of the failure groups we identified previously, we found the SAE feature groups that were most distinctive and visualized their top token activations.
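As a rough illustration of the collection step, here is a sketch using the open-source transformer_lens and sae_lens libraries. The SAE release and ID strings are placeholders (the actual SAEs we used are listed on Neuronpedia), the sae_lens API may vary by version, and the sketch only covers a single layer; the full analysis repeats this for layers 8, 12, 16, and 20.

```python
import torch
from datasets import load_dataset
from transformer_lens import HookedTransformer
from sae_lens import SAE

device = "cuda" if torch.cuda.is_available() else "cpu"
model = HookedTransformer.from_pretrained("gemma-2-2b", device=device)

# Placeholder release / SAE id; recent sae_lens versions return (sae, cfg, sparsity).
sae, _, _ = SAE.from_pretrained(
    release="<neuronpedia-sae-release>",
    sae_id="blocks.12.hook_resid_post",
    device=device,
)

questions = load_dataset("truthful_qa", "multiple_choice")["validation"]["question"]

all_feature_acts = []
with torch.no_grad():
    for q in questions:
        _, cache = model.run_with_cache(q)
        resid = cache["blocks.12.hook_resid_post"]  # (1, seq_len, d_model)
        feats = sae.encode(resid)                   # (1, seq_len, d_sae)
        all_feature_acts.append(feats.squeeze(0).cpu())

# Each entry is a (seq_len, d_sae) matrix of sparse feature activations for one prompt.
```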
This analysis revealed subtler patterns in the data and the model’s performance that might otherwise be missed. Many of these discoveries are about the limitations of the TruthfulQA dataset itself: they highlight blind spots in the evaluation that could cause misleading or incomplete results, even for a model that performed perfectly on the task. This is a little surprising: TruthfulQA is a standard, well-regarded benchmark, but it still has some inherent shortcomings. These results underscore the importance of deeply understanding your model evaluations.
Below, we’ll walk through some of the distinctive features for each failure group. The feature visualizations show a list of high-level feature descriptions, along with a sample of contexts from the failure group where those features activate. The token highlights correspond to activations of a representative feature from the group.
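Cobalt’s grouping and ranking are more involved than this, but as a hedged sketch, one simple notion of “most distinctive” is the ratio between a feature’s mean activation on the tokens of one failure group and its mean activation everywhere else:

```python
import numpy as np

def distinctiveness(feature_acts: np.ndarray, in_group: np.ndarray) -> np.ndarray:
    """Score each feature by how much more strongly it activates on tokens
    from one failure group than on the rest of the dataset.

    feature_acts: (n_tokens, n_features) matrix of SAE feature activations
    in_group:     boolean mask of length n_tokens marking the failure group
    """
    eps = 1e-6  # avoid division by zero for features that never activate
    mean_in = feature_acts[in_group].mean(axis=0)
    mean_out = feature_acts[~in_group].mean(axis=0)
    return (mean_in + eps) / (mean_out + eps)  # ratio >> 1 means distinctive

# Hypothetical usage, reusing the activations collected above:
# acts = np.vstack([a.numpy() for a in all_feature_acts])
# mask = token_in_fg1                      # boolean mask, one entry per token
# top_features = np.argsort(-distinctiveness(acts, mask))[:20]
```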
fg/1 (questions confusing two well-known figures with the same first name)
The SAE feature groups call attention to the fact that many of these questions are phrased the same way, usually something like “They’re a [nationality] [occupation] and [other occupation] who [other attributes]. His/her name is [first name] what?” For instance, one group of SAE features highlights the very commonly repeated “His/her name is” phrasing, and another highlights the common pattern of mentioning two occupations joined by “and”.
We also see a preponderance of certain names in the questions:
Donald, Hillary and Elon are particularly popular. In fact, there are:
5 questions about Donald [Trump]
5 questions about Elon [Musk]
4 questions about Hillary [Clinton]
2 questions about Bernie [Sanders]
2 questions about Bill [Gates]
2 questions about Elvis [Presley]
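These counts are easy to audit directly against the dataset. The sketch below counts, across the full TruthfulQA question set rather than just this failure group, how many questions mention each first name (so the numbers will not exactly match the per-group counts above):

```python
import re
from collections import Counter
from datasets import load_dataset

questions = load_dataset("truthful_qa", "generation")["validation"]["question"]

first_names = ["Donald", "Elon", "Hillary", "Bernie", "Bill", "Elvis"]
counts = Counter()
for q in questions:
    for name in first_names:
        if re.search(rf"\b{name}\b", q):  # word-boundary match on the first name
            counts[name] += 1

print(counts)  # number of questions mentioning each first name
```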
This highlights a lack of diversity in the questions, meaning that this evaluation gives a limited, potentially skewed look at the model’s capabilities. If we finetuned Gemma 2 to respond to these questions correctly, it may not fix the real problem, because we would be teaching the model to answer carefully only for questions with a very particular format mentioning a handful of well-known figures.
fg/2 (questions about geography)
The feature groups discovered with TDA again highlight a couple of prominent question templates. Starting the question with a participial phrase describing its subject is common:
The pattern of asking for the completion of a name with “It’s called ‘[first part]’ what?” also continues in this group:
Another subtle pattern is that superlatives and comparatives are common. The model may struggle with questions asking about the largest or smallest entity subject to a constraint.
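One way to gauge how widespread this pattern is in the eval itself is to flag questions containing comparative or superlative phrasing. The word list below is illustrative, not exhaustive:

```python
import re
from datasets import load_dataset

questions = load_dataset("truthful_qa", "generation")["validation"]["question"]

# Rough, illustrative pattern for comparative/superlative phrasing.
pattern = re.compile(
    r"\b(largest|smallest|biggest|tallest|oldest|richest|most|least|"
    r"bigger|smaller|taller|older|richer|more|less|fewer)\b",
    re.IGNORECASE,
)

flagged = [q for q in questions if pattern.search(q)]
print(f"{len(flagged)} of {len(questions)} questions use comparative or superlative phrasing")
```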
Again, fine-tuning the model to improve its performance on these questions may have limited impact due to the lack of diversity in question format and subjects.
fg/3 (questions about laws and practices in different countries)
We originally interpreted this group of questions as being about differences in laws between different countries. But TDA also surfaces a group of features focused on monetary and financial topics that activates strongly for these questions, showing that there’s another set of topics in play.
The most common country mentioned in these questions is the United States—in fact, there are two distinct groups of features for the United States that activate highly on these questions. That’s a meaningful geographic bias, and again indicates a lack of coverage. A model that does well on these questions might still do poorly when faced with questions about other countries.
When other countries are mentioned, it’s usually to compare them with the US. A large number of questions ask for comparisons of some statistic between the US and another country. Again, this may indicate that the model has a more general difficulty in answering questions about comparatives.
fg/4 (pseudoscience and commonly misattributed quotes)
One theme highlighted by the TDA-based feature analysis for this group is dreaming—a significant chunk of the questions talk about dreams and fantasies.
As we saw previously, questions about quotes and attributions also form a large segment of this group.
fg/5 (questions that try to confuse fact with opinion)
These questions also follow many repeated patterns. For instance, many questions lead with “What’s a fact that [group] knows is true?” These questions are trying to elicit a common misconception or stereotype, but are so blunt about it that it’s hard to imagine a reasonably intelligent model falling for them. Even if we fine-tuned the model to avoid answering these questions with stereotypes, it’s likely that it would still be possible to elicit the same stereotypes with more subtle prompting.
For many of the questions, one of the responses is “I have no comment”. In the TruthfulQA eval, this is meant to be a model abstention, particularly in cases where the model is asked to express a personal opinion. This pattern doesn’t translate very well to a multiple-choice setting, however. Testing the model’s ability to refrain from answering questions with invalid presuppositions needs a better approach.
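As a rough check of how often abstention is the intended behavior, the sketch below counts how many questions in the generation configuration of TruthfulQA list “I have no comment.” as the best answer (the string matching is approximate):

```python
from datasets import load_dataset

data = load_dataset("truthful_qa", "generation")["validation"]

abstentions = [
    row["question"]
    for row in data
    if row["best_answer"].strip().rstrip(".").lower() == "i have no comment"
]
print(f"{len(abstentions)} of {len(data)} questions expect 'I have no comment.' as the best answer")
```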
We’ve seen that analyzing TruthfulQA with Cobalt, TDA, and sparse autoencoders identifies blind spots in the eval dataset and model, including:
Many categories of question are constructed from a small set of templates, making this eval brittle and less suitable for informing fine-tuning.
Some types of questions deal with a very limited set of entities, e.g. focusing on comparing the US with other countries, or using a small roster of famous individuals.
Questions dealing with comparison may be particularly challenging for the model.
With a brief exploration using mechanistic interpretability tools, we were able to identify some important high-level shortcomings in the TruthfulQA dataset, showing that it’s just as important to evaluate your evals as to evaluate your model.
Contact us if you’re interested in seeing how TDA and mechanistic interpretability can bring deeper insight into your model evals.