Signal Processing for AI
Solving Model Errors at the Source
Signal processing is the extraction, from raw data, of information usable by a mathematical model. It is a critical element in many engineering domains, including imaging, audio, speech, radar, and many others. It covers filtering tasks that remove noise from, say, images or audio, Fourier transform techniques for audio, as well as more complex tasks such as locating and reconstructing objects using radar or sonar.
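To make the filtering idea concrete, here is a minimal sketch of one classic signal processing step: removing high-frequency noise from a 1-D signal with a low-pass filter in the Fourier domain. The signal, noise level, and cutoff frequency are illustrative choices, not values from any particular application.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 512, endpoint=False)
clean = np.sin(2 * np.pi * 5 * t)                  # a pure 5 Hz tone
noisy = clean + 0.5 * rng.standard_normal(t.size)  # tone buried in noise

# Move to the frequency domain, zero out everything above 20 Hz,
# and transform back: a crude but effective low-pass filter.
spectrum = np.fft.rfft(noisy)
freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])
spectrum[freqs > 20.0] = 0.0
denoised = np.fft.irfft(spectrum, n=t.size)
```

The filtered signal is much closer to the clean tone than the noisy input, because almost all of the noise power lives in the frequencies we discarded.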
There are also a number of examples in the human sensory system. For example, the primary visual cortex in the mammalian brain acts as a signal processor: it takes image data from the retina, performs edge and line detection on it, and feeds the result to higher layers in the visual pathway.
It is useful to represent these ideas in a diagram; see below.
In traditional signal processing, the models are often well-understood mathematical models, such as regression (linear and logistic), decision trees, support vector machines, and many others. Processes A and B above constitute the signal processing portion of the flow: A is strictly processing and produces new input, and B describes how the data is used to interface with the model. Let's generalize the notion of signal processing to mean the adaptation and augmentation of a model's input data so as to improve the model's performance. It does not include modifying the model internals, as is done with fine-tuning techniques such as LoRA, and it does not include modifications to the output layer.
We have performed signal processing in this sense for images and video for CNNs, as we have described in our earlier blogs Klein bottles and Klein CNNs, and the results included large improvements in learning efficiency and generalization for images, and in accuracy and data efficiency for video. Interestingly, Z. Hu et al. adapted the same general techniques for use in speech analysis, with a very clever approach that converts speech signals to images, where the image is a spectrogram. Their approach includes signal processing with a specialized feature set constructed specifically for spectrogram images.
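The speech-to-image conversion mentioned above can be sketched with a short-time Fourier transform: slice the audio into overlapping frames and take the magnitude spectrum of each, producing a 2-D frequency-by-time array that an image model can consume. This is a minimal illustration of the general idea, not the actual pipeline of Hu et al.; the frame length, hop size, and test signal are assumptions.

```python
import numpy as np

def stft_magnitude(x, frame_len=256, hop=128):
    """Magnitudes of a short-time Fourier transform: one FFT per
    overlapping frame, stacked into a (frequency x time) array."""
    frames = [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1)).T

fs = 8000                                           # samples per second
t = np.arange(0, 1.0, 1.0 / fs)
chirp = np.sin(2 * np.pi * (200 + 300 * t) * t)     # a rising tone

image = stft_magnitude(chirp)                       # a spectrogram "image"
```

Each column of `image` is the spectrum of one short slice of the sound, so the rising tone appears as a diagonal ridge, which is exactly the kind of 2-D structure a CNN is built to detect.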
We have also asked ourselves what kind of signal processing would be appropriate for large language models. We do not have complete answers yet, but in a BluelightAI blog we describe some experiments that develop features which can be used as part of the input to a language model. Briefly, these features are constructed using part-of-speech tagging. The idea is to associate to each word in the input text its part of speech (POS), thereby obtaining a sequence of parts of speech. This is a sequence in a much smaller vocabulary than the English language (8 tags plus a few special symbols vs. roughly 400K words). One can then train a transformer model to predict the next POS, and thereby obtain an embedding of the parts of speech. Via POS tagging, such an embedding also produces an embedding of words, which can be used as a feature set to augment the word embedding that is the input to the transformer. A learnable interaction between the POS embedding and the transformer embedding then gives a new model. In the above blog, we demonstrated that one can obtain a significant improvement in perplexity in a small language model using this method. We interpret this as saying that the added features are syntactic, rather than semantic, and that this a priori syntactic information is useful. There are many other small vocabularies one can construct, and we are experimenting with them. What we strongly believe is that such specialized vocabularies will be very useful in building specialized models, which come with their own specialized vocabularies.
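The first step of this construction, mapping text to a sequence over a tiny POS vocabulary, can be sketched as follows. A real system would use a trained part-of-speech tagger; here a tiny hand-rolled lookup table stands in for it, and the tag names are illustrative, not the tag set used in our experiments.

```python
# Illustrative coarse tag dictionary standing in for a real POS tagger.
COARSE_TAGS = {
    "the": "DET", "a": "DET",
    "cat": "NOUN", "dog": "NOUN", "mat": "NOUN",
    "sat": "VERB", "chased": "VERB",
    "on": "ADP",
}

def to_pos_sequence(words):
    """Map each word to its part of speech, producing a sequence over a
    vocabulary of a handful of tags instead of hundreds of thousands of
    words. Unknown words get the catch-all tag "X"."""
    return [COARSE_TAGS.get(w, "X") for w in words]

sentence = "the cat sat on the mat".split()
print(to_pos_sequence(sentence))
# → ['DET', 'NOUN', 'VERB', 'ADP', 'DET', 'NOUN']
```

A small transformer trained to predict the next tag in such sequences yields the POS embedding described above.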
In general, we advocate for training and fine-tuning using methods that augment or modify the input data. We see the following advantages to such methods.
They can be used both for fine-tuning and for ab initio training.
They enable the introduction of a priori knowledge into the model, likely making for more interpretable models.
One can imagine building a collection of “LEGO blocks” of features, which can be introduced in a systematic way when adapting LLMs to specific domains, or when training more specialized models from the ground up.
Our work on CNNs suggests that models constructed with features motivated by some outside understanding will tend to generalize better.
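The augmentation pattern running through all of this, mixing an auxiliary feature embedding (such as the POS embedding) into the model's own input embedding through a learnable interaction, can be sketched in a few lines. The dimensions, the linear projection, and the sigmoid gate are illustrative assumptions, not the interaction we actually trained.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_feat = 6, 16, 4

word_emb = rng.standard_normal((seq_len, d_model))  # the model's own embedding
feat_emb = rng.standard_normal((seq_len, d_feat))   # auxiliary feature embedding

# "Learnable" parameters (here just randomly initialized): a projection
# into model space and a per-dimension gate squashed into (0, 1).
W = 0.1 * rng.standard_normal((d_feat, d_model))
gate = 1.0 / (1.0 + np.exp(-rng.standard_normal(d_model)))

# Augmented input: word embedding plus gated, projected feature embedding.
augmented = word_emb + gate * (feat_emb @ W)
```

Because the augmentation lives entirely on the input side, the same block of code (with trained parameters) could be dropped in front of any model with a `d_model`-dimensional embedding, which is the "LEGO block" property we are after.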