Cobalt: Adapting LLMs with Topological Data Analysis

BlueLightAI Cobalt, powered by topological data analysis, transforms how large language models are adapted to specific domains. By enabling precise data curation for continual pre-training, a form of unsupervised fine tuning, Cobalt improves AI quality and accelerates production deployment.

Commercially Available Foundation Models

Every few months, the efficacy and capabilities of publicly available LLMs advance significantly.  In the last 12 months, more than half a dozen publicly available foundation models have posted MMLU scores above 80, and a score above 78 is often characterized as professor-level knowledge.  While MMLU is a fairly basic benchmark, it nonetheless offers a clear indication that publicly available foundation models are rapidly reaching the point where enterprises that pre-train their own LLMs will likely never match the general capability and range of models such as Llama 3 or GPT-4o.

Baseline models will continue to commoditize rapidly.  Differentiation will likely take the shape of cost characteristics, specialization, and ease of adaptation.  As is standard in commodity markets, most use cases will coalesce around a few leading foundation models that deliver specific advantages and capabilities over time.  For the foreseeable future, entities that pre-train their own LLMs will find that this rapid evolution accelerates the obsolescence of internally developed models.

The punchline to all of this is that unless you are an enterprise that possesses internet-scale data and is willing to retrain a model frequently, pre-training a production-grade LLM is going to be an expensive fool's errand.  At the same time, general-purpose foundation models cannot deliver the competitive advantages enterprises need to deploy effective AI applications in a production setting.

Domain Adaptation Challenges

General-purpose foundation models need domain-specific adaptation before they can be reliably deployed in business use cases. A host of techniques are in popular use today; the major groupings are Prompt Engineering, Retrieval Augmented Generation (RAG), and Fine Tuning.  Each requires a different degree of internal skill and expense, and each offers a similarly wide range of efficacy and performance.  All three approaches are generally good at building simple AI applications on data that changes infrequently.  Simpler approaches such as Prompt Engineering are inexpensive to implement, require less data, and demand fewer specialized skills, but are by definition more prone to inaccuracies in the form of hallucinations and inadequate answers.  More advanced techniques are far less error-prone, but they are more expensive, may require far more data, and rely on a team of highly skilled data scientists.
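As a concrete illustration of the middle ground, the sketch below shows a minimal RAG-style flow: retrieve the most relevant passages from a small document store and prepend them to the prompt before calling a model. It uses a simple TF-IDF retriever from scikit-learn purely for illustration; the documents, the query, and the downstream model call are placeholders, not any specific product's API.

```python
# Minimal RAG-style sketch: retrieve relevant context, then build a grounded prompt.
# Assumes scikit-learn is installed; the documents and query are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Our enterprise support plan covers 24/7 incident response.",
    "The Q3 pricing update applies to all new annual contracts.",
    "Employees accrue 1.5 vacation days per month of service.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query (TF-IDF cosine similarity)."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_vectors)[0]
    top_idx = scores.argsort()[::-1][:k]
    return [documents[i] for i in top_idx]

query = "How quickly does support respond to incidents?"
context = "\n".join(retrieve(query))

# The retrieved context grounds the model's answer, reducing hallucinations
# compared with prompt engineering alone.
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
print(prompt)  # In practice this prompt would be sent to an LLM of your choice.
```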

The existing suite of LLM “Augmentation” techniques is a good first step, but it is far from an optimal long-term solution for most critical use cases.  Aside from Prompt Engineering, most techniques are “supervised” in nature, meaning they require ample labeled data, which can be time-consuming and expensive to label or must be sourced from third parties.  Over time, this expense can become material, especially in areas where data “freshness” is a significant factor in accuracy and efficacy.  In practical applications, the highest-value use cases almost always depend on data that changes rapidly.  However, rapidly changing data exposes a structural limitation of LLM Augmentation: few current technologies can label it in real time and properly weight its impact on the model, especially as scale increases.

Continual Pre-Training

Techniques like continual pre-training <footnote: links to background on CPT> and unsupervised fine tuning address this structural limitation.  However, their implementation can present significant hurdles: traditional approaches are often time-consuming and costly while yielding suboptimal results (a minimal training-step sketch follows the list below). Key challenges include:

  • Identifying and selecting relevant domain-specific data from vast, diverse unlabeled datasets – especially burdensome for large corpora of unstructured text data

  • Eliminating noise and irrelevant information that would degrade model performance

  • Balancing the need for domain expertise with technical AI sophistication

  • Ensuring the adapted model maintains general capabilities while excelling in the target domain

  • Measuring and validating the effectiveness of domain adaptation efforts
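For reference, continual pre-training typically means running the same next-token (causal language modeling) objective used in pre-training, but over a curated domain corpus. The sketch below is a minimal illustration using the Hugging Face transformers and datasets libraries; the model name and corpus path are placeholders, and a real workload would add evaluation, checkpointing, and careful data mixing to preserve general capabilities.

```python
# Minimal continual pre-training sketch: causal-LM training on a curated domain corpus.
# Model name and data path are placeholders; assumes transformers + datasets are installed.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "gpt2"  # placeholder; in practice a larger foundation model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Curated, unlabeled domain text (one document per line) -- the output of data curation.
dataset = load_dataset("text", data_files={"train": "curated_domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cpt-out", num_train_epochs=1,
                           per_device_train_batch_size=4, learning_rate=2e-5),
    train_dataset=tokenized,
    # mlm=False selects the causal (next-token) objective used in pre-training.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```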

Overcoming these obstacles is crucial for organizations seeking to leverage the full potential of LLM Augmentation in enterprise applications, driving the need for innovative solutions like Cobalt.

Cobalt Data Curation

Leveraging advanced topological data analysis, Cobalt offers a uniquely helpful approach to organizing and understanding large datasets. Unlike conventional methods for data exploration such as t-SNE and UMAP, Cobalt provides groupings of data browsable at multiple levels of resolution. This breakthrough allows data scientists to:

  • Easily identify and select the most pertinent information for their specific domain

  • Eliminate irrelevant or low-quality data that could harm model performance and unnecessarily bloat model size

  • Curate groups of data with high natural similarity for pre-training

  • Automatically surface groups of problematic data to address with fine tuning and evaluation

  • Detect and address production drift

By guiding intelligent and targeted data curation, Cobalt significantly enhances the quality and relevance of training and tuning data, laying a solid foundation for high-quality (more accurate and domain-specific) AI models.
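To make the idea of multi-resolution groupings concrete without showing Cobalt's own API, the sketch below builds a simplified analogue with generic open-source tools: embed documents, cluster them hierarchically so the corpus can be cut at coarse or fine resolution, and label each group with its top keywords. Cobalt's topological approach goes well beyond this; the example only illustrates the kind of curation workflow it enables.

```python
# Generic illustration (not Cobalt's API): multi-resolution grouping of a text corpus.
# Assumes sentence-transformers, scipy, and scikit-learn are installed; texts are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "Steelers clinch the AFC North with a late field goal in Pittsburgh",
    "Fantasy football week 12 waiver wire targets",
    "Marathon training plan for first-time runners",
    "NBA trade deadline rumors heat up",
]

# Embed the documents and build a hierarchical clustering that can be cut at any resolution.
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)
tree = linkage(embeddings, method="ward")

def top_keywords(group_texts, n=3):
    """Label a group with its highest-weight TF-IDF terms."""
    tfidf = TfidfVectorizer(stop_words="english")
    weights = np.asarray(tfidf.fit_transform(group_texts).sum(axis=0)).ravel()
    terms = tfidf.get_feature_names_out()
    return [terms[i] for i in weights.argsort()[::-1][:n]]

# Coarse view (2 groups) vs. fine view (4 groups) of the same corpus.
for n_groups in (2, 4):
    labels = fcluster(tree, t=n_groups, criterion="maxclust")
    print(f"--- {n_groups} groups ---")
    for g in sorted(set(labels)):
        members = [t for t, lab in zip(texts, labels) if lab == g]
        print(g, top_keywords(members), len(members), "docs")
```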

Streamlining the Unsupervised Fine Tuning Process

With Cobalt, continual pre-training becomes a systematic endeavor rather than a haphazard approach. Instead of indiscriminately ingesting all available datasets, teams can now:

  • Intelligently match training and tuning data to the intended model purpose

  • Sift through the garbage to train on the gold, ensuring more robust and accurate models

  • Rapidly adapt foundation models for context-specific applications

This targeted approach significantly reduces the time and resources required for fine tuning foundational LLMs, thereby improving model quality and relevance to the intended domain.

Real-World Application

A compelling example of Cobalt's practical application involves a data scientist developing an AI assistant for American football broadcast commentary. Starting with a vast sports and athletics corpus, Cobalt automatically categorizes the data into related groups, enabling the data science team to focus on football-related content. This targeted approach ensures the model is trained on the most relevant data, leading to superior performance in its specific domain. 

For instance, Cobalt ingests the raw data and automatically highlights clusters labeled with related keywords such as "nfl, steelers, pittsburgh" or "fantasy, 12, 11," allowing the data scientists to identify football-related content and intelligently adjust the training data distribution to emphasize it, while de-emphasizing or excluding irrelevant sports information.
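Continuing the earlier illustration (again with generic tools rather than Cobalt's API), once groups are labeled, curation reduces to choosing which groups feed continual pre-training. The sketch below keeps football-related groups, drops clearly irrelevant ones, and down-weights the rest by sampling; the keyword rules and sampling rate are illustrative assumptions.

```python
# Generic illustration (not Cobalt's API): turn labeled groups into a curated training mix.
# Assumes each document already carries group keywords from a prior clustering step.
import random

corpus = [
    {"text": "Steelers beat the Ravens 17-10 ...", "group_keywords": ["nfl", "steelers", "pittsburgh"]},
    {"text": "Start your fantasy RB in week 12 ...", "group_keywords": ["fantasy", "12", "11"]},
    {"text": "Best running shoes for marathons ...", "group_keywords": ["marathon", "shoes", "running"]},
]

FOOTBALL_TERMS = {"nfl", "steelers", "pittsburgh", "fantasy", "quarterback"}  # illustrative
EXCLUDE_TERMS = {"cricket", "golf"}                                           # illustrative
OTHER_SPORTS_SAMPLE_RATE = 0.1  # keep a small slice of other sports for breadth

random.seed(0)
curated = []
for doc in corpus:
    kw = set(doc["group_keywords"])
    if kw & EXCLUDE_TERMS:
        continue                      # exclude irrelevant groups entirely
    if kw & FOOTBALL_TERMS:
        curated.append(doc["text"])   # emphasize football-related groups
    elif random.random() < OTHER_SPORTS_SAMPLE_RATE:
        curated.append(doc["text"])   # de-emphasize everything else

# 'curated' would then be written out as the corpus for continual pre-training.
print(len(curated), "of", len(corpus), "documents kept")
```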

The Cobalt Advantage

By leveraging the powerful capabilities of Cobalt, organizations can:

  1. Reduce Time-to-Production: Streamline the entire process, from pre-training data curation to confident model deployment.

  2. Improve Model Quality: Ensure models are trained and tuned on the most relevant and high-quality data.

  3. Increase ROI: Minimize investment costs while maximizing returns through more effective AI solutions.
