Next Generation AI Model Evaluation
Go beyond the leaderboard: How TDA uncovers what benchmark scores miss in model evaluation.
Model evaluation is critical to the artificial intelligence enterprise. Without a range of evaluation methods, we cannot tell whether models are doing what we want them to do, or what measures we should take to improve them. Good evaluation also matters after deployment: the input data, the way users interact with the model, and the users' reactions to its output all change over time. We therefore need evaluation not only when a model is built, but continually throughout its deployment lifecycle.
For the simplest kinds of models, such as classifiers, there are straightforward evaluation metrics: accuracy, precision, and recall. For large language models and other generative AI models, whose output is far richer than an assignment to a class, evaluation is much harder. How does one compare two different summaries of the same input document? Even for the simpler models, one often wants to know more about the nature of the failures: if we can describe groups of failures in human-understandable terms, that description guides both how the model should be used and how it can be improved. The importance of evaluation is reflected in the variety of performance measures reported on the Hugging Face leaderboard.
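To make these classifier metrics concrete, here is a minimal sketch using scikit-learn on a hypothetical set of binary labels and predictions; the label values are invented purely for illustration.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical ground-truth labels and model predictions for a binary classifier.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# Accuracy: fraction of predictions that match the ground truth.
print("accuracy :", accuracy_score(y_true, y_pred))
# Precision: of the items predicted positive, how many are truly positive.
print("precision:", precision_score(y_true, y_pred))
# Recall: of the truly positive items, how many were predicted positive.
print("recall   :", recall_score(y_true, y_pred))
```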
In this blog, we briefly describe some evaluation methods and then propose and illustrate how topological data analysis (TDA) can make them more powerful.
Large language models are one class of models for which evaluation is a challenging problem. One important general method for evaluating LLMs is benchmarking on tasks whose outcomes are precisely defined, so that answers can be verified automatically. Several of the metrics on the Hugging Face leaderboard are of this type, namely MATH, GPQA, and MMLU; each consists of a data set of questions with precisely defined answers, such as multiple-choice questions. IFEval also uses a data set of questions, but checks readily verifiable properties of the response, such as its length and language, rather than comparing against a fixed answer. The LLM is scored by the proportion of questions it answers correctly. Big-Bench Hard (BBH), a benchmark of especially challenging tasks from the BIG-bench suite, follows the same principle but mixes questions with precisely defined outcomes and open-ended tasks whose human-generated reference outputs lack a single correct answer; performance on the open-ended tasks is measured by similarity measures based on the vocabulary used in the responses. Another approach is to use LLMs to evaluate the output of other models on more specific tasks. This permits an effectively unlimited collection of metrics, constructed by prompting the evaluating LLM in various ways.
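As an illustration of scoring by the proportion of right answers, here is a minimal sketch of exact-match scoring for a multiple-choice benchmark; the items and the `model_answer` stub are hypothetical placeholders, not any benchmark's actual harness.

```python
# Hypothetical multiple-choice items in the style of MMLU or GPQA; the gold
# answers and the model_answer() stub are placeholders, not real benchmark data.
benchmark = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "gold": "B"},
    {"question": "Capital of France?", "choices": ["Paris", "Rome", "Lima", "Oslo"], "gold": "A"},
]

def model_answer(question, choices):
    # Stand-in for a call to the model under evaluation; returns a choice letter.
    return "B"

def exact_match_accuracy(items):
    correct = 0
    for item in items:
        prediction = model_answer(item["question"], item["choices"])
        correct += int(prediction == item["gold"])
    return correct / len(items)

print(exact_match_accuracy(benchmark))  # proportion of right answers
```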
Each of these measures provides a benchmark for an LLM, a broad evaluation of the model's performance. To make evaluation more fine-grained and more usable, it is important to understand the source of the failures, whether in the data or in the model itself. Also, given two models, say A and B, it may be that A outperforms B on the input data as a whole while B outperforms A on some group within the input data. A single benchmark score attached to each model cannot capture such distinctions. We will give some examples of how one can obtain more fine-grained evaluations of AI models.
Here is a brief discussion of topological data analysis; for more detail, look at this blog post or this book. Topological data analysis proceeds from the idea that data has shape, encoded in a dissimilarity measure, and that these shapes can be represented by graphs. Each node of the graph corresponds to a subset of the data points, and an edge joins two nodes when their subsets overlap. The graph supports “heat maps” of any quantity associated with the data points, where a node's color reflects the average value of the quantity over the subset of the data it represents.
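As a sketch of how such a graph might be built, here is a minimal Mapper-style construction, assuming a 2-D point cloud, a one-dimensional lens function, and DBSCAN for clustering within overlapping intervals; the data, the per-point failure indicator, and all parameters are illustrative rather than a production TDA pipeline.

```python
import numpy as np
import networkx as nx
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))           # hypothetical data points
errors = (X[:, 0] > 1.0).astype(float)  # hypothetical per-point failure indicator

lens = X[:, 0]                           # a simple one-dimensional lens (filter) function
intervals = np.linspace(lens.min(), lens.max(), 8)
overlap = 0.4 * (intervals[1] - intervals[0])

graph = nx.Graph()
node_members = {}
for i, lo in enumerate(intervals[:-1]):
    hi = intervals[i + 1]
    idx = np.where((lens >= lo - overlap) & (lens <= hi + overlap))[0]
    if len(idx) == 0:
        continue
    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X[idx])
    for lab in set(labels) - {-1}:       # skip DBSCAN noise points
        members = idx[labels == lab]
        node = (i, lab)
        node_members[node] = set(members)
        # Heat-map value: average failure rate over the node's data points.
        graph.add_node(node, error_rate=errors[members].mean())

# Connect nodes whose underlying point sets overlap.
nodes = list(node_members)
for a in range(len(nodes)):
    for b in range(a + 1, len(nodes)):
        if node_members[nodes[a]] & node_members[nodes[b]]:
            graph.add_edge(nodes[a], nodes[b])

print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "edges")
```

In a real analysis the lens, cover, and clustering parameters would be chosen to suit the data and the dissimilarity measure of interest.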
Example 1: Our first example is the examination of a model for classifying images of airplanes in a military context. The data consists of satellite images of various airplanes or groups of airplanes, and the purpose is to classify them into certain categories. The image on the lower left shows a graph model of the data set of images on which the model operates. The nodes are colored from yellow to blue, with dark blue indicating a high rate of misclassification. On the left there is a circled group of dark blue nodes; on the right are a few samples from the corresponding group of images. It is then relatively easy to see that these are large military transports which have been misclassified as civilian aircraft. The topological model (the graph) is very useful because it lets you identify groups of failures that share many characteristics.
Example 1
Example 2: Large language models are notoriously difficult to understand. Mechanistic interpretability is a rapidly developing subfield of artificial intelligence focused on developing a deep understanding of how these models work internally, and topological modeling has a lot to offer to this area of research. One direction of research is the construction of interpretable features from the internal layers of LLMs, using a deep learning method called Sparse Autoencoders, or SAEs. SAEs have been trained by OpenAI for GPT-2 small and by Google for Gemma 2. Another recently developed approach is Cross Layer Transcoders, or CLTs, which improve on SAEs for fine-grained analysis of model circuits. For example, here.
Example 2
Above you see two topological models of the SAE features from GPT-2. The topological methods highlight natural groups of features, which can then be interpreted using automatically generated summaries. We described this investigation in this earlier blog post. The graph model provides a good way to navigate the feature space and to interpret groups of features; we tend to see that groups of features are more stable in their interpretation than individual features. See our post for more general aspects of this philosophy. There is also a Colab notebook which you can use to explore these spaces.
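For readers who want to see the mechanics behind the features being modeled, here is a minimal sparse autoencoder sketch in PyTorch, assuming we already have a batch of residual-stream activations; the dimensions and the L1 sparsity weight are illustrative and are not the settings used for the GPT-2 or Gemma SAEs.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: an overcomplete feature dictionary trained with an L1 sparsity penalty."""
    def __init__(self, d_model=768, d_features=16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # sparse, non-negative feature activations
        reconstruction = self.decoder(features)
        return features, reconstruction

sae = SparseAutoencoder()
acts = torch.randn(32, 768)      # stand-in for a batch of residual-stream activations
features, recon = sae(acts)
l1_weight = 1e-3                 # illustrative sparsity coefficient
loss = ((recon - acts) ** 2).mean() + l1_weight * features.abs().mean()
loss.backward()
```

The learned feature activations (here `features`) are the objects that the topological models above organize into interpretable groups.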
Example 3: For LLM evaluation, one certainly has the various metrics computed on the Hugging Face leaderboard, where models are ranked by an average of these metrics. It turns out, though, that different metrics may disagree on the rankings. For example, model A may score higher on IFEval than model B, while model B scores higher on BBH. This kind of disagreement shows that while an average rating is useful for establishing a leaderboard, it is important to analyze the individual scores, not just the aggregate, when choosing a model, particularly since the individual benchmarks are trying to measure different things. What may be more surprising is that even within a single benchmark, analysis of the input data and of model performance may show that model A scores higher than model B overall while B scores higher on an interesting subset of the data. For example, model A may perform better on MMLU overall than model B, yet model B may still perform better on the questions that require quantitative reasoning. This is important information to have, since it could lead to a different choice of model or influence design decisions. This phenomenon was demonstrated in the context of retrieval models in our webinar with Zilliz. The lesson is that overall metrics can be misleading, and that it is important to understand how a metric varies across different groups of the input data. For this reason it is important to have ways of modeling the input data that surface meaningful groups, as one can do using TDA methods.
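Here is a small synthetic sketch of that phenomenon; the per-question correctness arrays and the "quantitative" subgroup mask are invented, but they show how a model can win overall while losing on a meaningful subgroup.

```python
import numpy as np

# Hypothetical per-question correctness (1 = right, 0 = wrong) for two models
# on the same 10-question benchmark, plus a mask marking a subgroup of interest
# (say, the questions requiring quantitative reasoning).
model_a = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
model_b = np.array([1, 1, 1, 1, 0, 0, 0, 1, 1, 1])
quantitative = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1], dtype=bool)

print("overall      A:", model_a.mean(), "B:", model_b.mean())
print("quantitative A:", model_a[quantitative].mean(), "B:", model_b[quantitative].mean())
# A wins overall (0.7 vs 0.6), but B wins on the quantitative subgroup (1.0 vs 0.0).
```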