Curate Your Datasets with Cobalt for Higher Performing Models

Topic: Controlling your input training dataset for a fine-tuning task. 

Business Context: Any model serving customers carries a risk/reward tradeoff that must be weighed. The reward is the potential value the business receives: the automation of routine or mundane tasks, the associated cost savings, and so on. The risks of releasing the model to production come down to questions like: Will it scare away customers with offensive or biased output? Will it fail to answer specific questions accurately, and which questions? How important are those questions? Will it produce irrelevant output that damages the company's public image?

Everyone knows that artificial intelligence (AI) is complicated. But one point of broad agreement is common sense: your model is only as good as the data it's trained on.

And in business, this principle gets quite real. Companies want a model that performs well for their customers, answers their customers' questions, and carries minimal risk. A generally performant model without use-case-specific value contributes next to nothing to a company's bottom line. By taking a proactive role in aligning training datasets with the downstream use case, companies can head off problems before their models reach production.

Your team can use Cobalt to take greater control of your training datasets, reduce time to production, and improve the performance and reduce the risk of your models, ultimately increasing ROI.

Read on to learn more from a technical perspective. 

Technical 

The trouble is that curating datasets at the scale needed for LLMs is incredibly labor-intensive. Simply put, the content of datasets at this scale is opaque to humans. Judging the relevance or quality of a given document for a downstream task is a challenging question that teams generally choose to avoid answering. There are other options: Google offers classifiers trained to recognize illicit content, and the perplexity metric can be used to gauge the quality of a document, since text that a language model finds highly surprising tends to be noisy or off-distribution.
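To make the perplexity heuristic concrete, here is a minimal sketch of perplexity-based document filtering. It assumes the HuggingFace transformers library, uses GPT-2 as a stand-in scoring model, and the threshold value is purely illustrative, not a recommendation:

```python
# Minimal sketch: filter documents by language-model perplexity.
# Assumptions: HuggingFace transformers installed, GPT-2 as the scoring
# model, and an illustrative threshold to be tuned per corpus.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Return the model's perplexity on `text` (lower = more fluent)."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        # Passing input_ids as labels yields the mean cross-entropy loss
        # over the sequence; exponentiating it gives perplexity.
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

docs = [
    "The quarterly report shows revenue grew 12% year over year.",
    "asdf qwer zxcv uiop hjkl bnm,",  # low-quality noise
]

THRESHOLD = 500.0  # illustrative cutoff; tune on a held-out sample
kept = [d for d in docs if perplexity(d) < THRESHOLD]
print(f"kept {len(kept)} of {len(docs)} documents")
```

A heuristic like this can cheaply triage a large corpus, but it measures fluency rather than relevance, which is why relevance to the downstream task remains the harder curation question.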

Next: Cobalt: Adapting LLMs with Topological Data Analysis