Feature Engineering for Language Models
Using parts of speech to improve language model performance
V. Lado Naess, L. Sjögren, D. Fooshee, G. Carlsson
In our earlier post “Improving CNNs with Klein Networks: A Topological Approach to AI,” we observed that adding predefined interpretable features to a CNN, and even modifying its architecture accordingly, resulted in a significant improvement in its performance. Learning speed was of course much faster, but it turned out that generalization also improved substantially. Moreover, the insights thus obtained allowed us to construct new features for video that greatly improved performance on a video classification task. These observations led us to ask whether adding predefined interpretable features to Large Language Models could also improve their performance. In this post we report findings from experiments in which we adjoined features constructed from part-of-speech tagging of the input data to a language model.
Let’s first describe an approximate, high-level version of what we are doing, and then fill in the details concerning tokenization, tagging, etc. Suppose we have a language model being trained with word tokenization. Then, given a text data set, each word can be assigned a part of speech (POS). Sometimes a word may be assigned different parts of speech in different contexts; for example, the word “down” can act as a noun (eider down), a preposition (running down the hill), or an adjective (the down staircase). For simplicity, let’s ignore that for now. So, given a data set of text documents, we can break the data set into sentences, and to each sentence we can associate its corresponding sequence of POSs, which we’ll call a POS-sentence. As an example, the POS-sentence corresponding to the sentence
“She was a thoughtful and driven individual”
would be the sequence
Pronoun, verb, article, adjective, conjunction, adjective, noun
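For readers who want to reproduce this reduction, here is a minimal sketch using spaCy. The exact tags returned depend on the tagger and model version; en_core_web_sm is the small English model we also use for tagging below.

```python
# Minimal sketch: mapping a sentence to its POS-sentence with spaCy.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def pos_sentence(text: str) -> list[str]:
    """Return the coarse part-of-speech sequence for a piece of text."""
    return [tok.pos_ for tok in nlp(text)]

print(pos_sentence("She was a thoughtful and driven individual"))
# Output resembles ['PRON', 'AUX', 'DET', 'ADJ', 'CCONJ', 'ADJ', 'NOUN'],
# modulo the tagger's choices (e.g. "driven" may come back as VERB).
```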
Applying this reduction to word sequences produces a collection of sequences over a drastically reduced vocabulary, and one can train a transformer model to predict the next POS. Such a model can be trained much more cheaply because (a) the dimension of the learned embedding can be very much smaller while still being informative, (b) since the structure of the space of POS-sentences is expected to be simpler, one may need many fewer layers, and (c) the amount of data needed to begin to fill out the space is also smaller. The coordinates of this embedding can then be used as additional pre-trained input to a language model, providing a lightweight kind of fine-tuning of the model. This is the experiment we carried out; we found a significant improvement on the perplexity measure, and also smaller improvements in next-token prediction. We have also analyzed the embedding obtained for the POS-sentences, with some interesting initial results. Here are some of the details of our experiment.
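To make the idea concrete, here is a rough PyTorch sketch of such a POS-sequence model: a tiny causal transformer that predicts the next POS tag, whose pooled hidden state serves as the low-dimensional POS-sentence embedding. Apart from the ten-dimensional width, the sizes (layer count, head count, tag vocabulary, sequence length) are illustrative assumptions, not exact settings.

```python
# Sketch of a cheap next-POS prediction model (PyTorch). Only the
# 10-dimensional embedding width is taken from the post; the remaining
# hyperparameters are placeholders.
import torch
import torch.nn as nn

class POSSequenceModel(nn.Module):
    def __init__(self, n_tags=45, d_model=10, n_layers=2, n_heads=2, max_len=128):
        super().__init__()
        self.tag_emb = nn.Embedding(n_tags, d_model)       # POS-tag embedding
        self.pos_emb = nn.Embedding(max_len, d_model)      # positional embedding
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.next_tag = nn.Linear(d_model, n_tags)         # next-POS prediction head

    def forward(self, tag_ids):                            # tag_ids: (batch, seq)
        seq = torch.arange(tag_ids.size(1), device=tag_ids.device)
        x = self.tag_emb(tag_ids) + self.pos_emb(seq)
        causal = torch.triu(                               # mask out future positions
            torch.full((tag_ids.size(1),) * 2, float("-inf"), device=tag_ids.device),
            diagonal=1)
        h = self.blocks(x, mask=causal)
        logits = self.next_tag(h)                          # train with cross-entropy on shifted tags
        sentence_embedding = h.mean(dim=1)                 # pooled 10-d POS-sentence embedding
        return logits, sentence_embedding
```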
Data Set: We used the WikiText-2 dataset (Merity et al., 2016), 2 million tokens extracted from Wikipedia’s ‘Good’ and ‘Featured’ articles.
Tokenization: As the basic tokenizer, we used Byte Pair Encoding (BPE). Before applying BPE tokenization, raw text was normalized with the NFKC standard and accent stripping to normalize character representation. Whitespace pre-tokenization was used to separate putative word boundaries. The tokenizer preserves an end-of-word marker to facilitate alignment with word-level POS tags.
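As an illustration, the normalization, pre-tokenization, and end-of-word marker can be assembled with the Hugging Face tokenizers library roughly as follows; the vocabulary size, special tokens, training file name, and the particular </w> suffix are assumptions rather than the exact settings used.

```python
# Sketch: a BPE tokenizer with NFKC + accent-stripping normalization,
# whitespace pre-tokenization, and an end-of-word marker, built with the
# Hugging Face `tokenizers` library. File name and sizes are placeholders.
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFKC(), normalizers.StripAccents()])   # normalize character representation
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()    # putative word boundaries

trainer = trainers.BpeTrainer(
    vocab_size=16000,                # assumed vocabulary size
    special_tokens=["[UNK]"],
    end_of_word_suffix="</w>",       # marker used to align subtokens with words
)
tokenizer.train(files=["wikitext2_train.txt"], trainer=trainer)

print(tokenizer.encode("She was a thoughtful and driven individual").tokens)
```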
POS tagging: We used the Penn Treebank tag set, chosen to balance small size against sufficient POS complexity. For the tagging, we used spaCy’s en_core_web_sm model. Tagging is done at the word level, prior to subword tokenization. Once the tokenization is done, each subtoken of a word is assigned the POS attached to the full word of which it is a part. We also allowed for more than one POS attached to a word, up to 5 possible tags. The aggregation of this POS information occurs in the transformer’s embedding layer.
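Here is a minimal sketch of the word-to-subtoken tag propagation, assuming the tokenizer from the previous snippet. spaCy assigns a single Penn Treebank tag per token in context, so the fixed five-slot padding below merely stands in for however multiple candidate tags are collected; that padding scheme is illustrative.

```python
# Sketch: propagating word-level Penn Treebank tags to BPE subtokens.
# Assumes the `tokenizer` from the previous snippet; MAX_TAGS and the
# padding tag are illustrative choices.
import spacy

nlp = spacy.load("en_core_web_sm")
MAX_TAGS = 5       # up to 5 candidate tags per word, as described above
PAD_TAG = "NONE"   # filler when fewer than MAX_TAGS tags are available

def tag_subtokens(sentence: str):
    """Return (subtoken, tags) pairs, each subtoken inheriting its word's tag(s)."""
    pairs = []
    for word in nlp(sentence):
        tags = ([word.tag_] + [PAD_TAG] * MAX_TAGS)[:MAX_TAGS]   # pad to a fixed width
        for sub in tokenizer.encode(word.text).tokens:           # the word's BPE pieces
            pairs.append((sub, tags))
    return pairs

# tag_subtokens("She was a thoughtful and driven individual")
# e.g. [('She</w>', ['PRP', 'NONE', ...]), ('was</w>', ['VBD', 'NONE', ...]), ...]
# (exact pieces depend on the trained vocabulary)
```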
Model architectures: The basic model is a decoder-only transformer. It is modified by adding an input adapter, which begins with the concatenation of the baseline input with the POS embedding, followed by a learned linear projection back to the original baseline embedding space. Some preprocessing steps (POS aggregation when multiple POS tags apply to the same word, layer normalization, positional encoding) are also applied, and the diagram below describes the structure.
The transformer model has three transformer layers, each with two attention heads.
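A rough PyTorch sketch of the adapter and the small decoder is given below. The three layers and two heads follow the description above; the embedding dimensions, the averaging used to aggregate multiple tags per token, the learned positional embedding, and the use of per-token POS-tag embeddings as the POS features are illustrative assumptions.

```python
# Sketch of the POS input adapter and the small decoder-only model (PyTorch).
# Layer and head counts follow the post; other sizes and the tag-averaging
# aggregation are placeholders.
import torch
import torch.nn as nn

class POSInputAdapter(nn.Module):
    def __init__(self, vocab_size, n_tags, d_model=128, d_pos=10, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)     # baseline token embedding
        self.tag_emb = nn.Embedding(n_tags, d_pos)           # POS feature embedding
        self.pos_enc = nn.Embedding(max_len, d_model)        # positional encoding
        self.proj = nn.Linear(d_model + d_pos, d_model)      # back to the baseline space
        self.norm = nn.LayerNorm(d_model)

    def forward(self, token_ids, pos_tag_ids):
        # token_ids: (batch, seq); pos_tag_ids: (batch, seq, MAX_TAGS)
        tok = self.tok_emb(token_ids)
        pos = self.tag_emb(pos_tag_ids).mean(dim=2)          # aggregate multiple tags per token
        x = self.proj(torch.cat([tok, pos], dim=-1))         # concatenate, then project
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.norm(x + self.pos_enc(positions))

class SmallDecoder(nn.Module):
    def __init__(self, vocab_size, n_tags, d_model=128, n_layers=3, n_heads=2):
        super().__init__()
        self.adapter = POSInputAdapter(vocab_size, n_tags, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)  # made causal via the mask below
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids, pos_tag_ids):
        x = self.adapter(token_ids, pos_tag_ids)
        causal = torch.triu(torch.full((token_ids.size(1),) * 2, float("-inf"),
                                       device=token_ids.device), diagonal=1)
        return self.lm_head(self.blocks(x, mask=causal))      # next-token logits
```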
We found that with this model, we were able to improve the average perplexity by 15.1%, from 42.70 to 36.26.
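For reference, the perplexity quoted here is the exponential of the average per-token negative log-likelihood. A minimal evaluation sketch, assuming the model interface from the previous snippet and a hypothetical iterable of batches, looks like this.

```python
# Sketch: computing corpus perplexity as exp(average next-token NLL).
# `model` and `batches` follow the interface of the previous snippet and
# are assumptions about the evaluation setup.
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, batches):
    total_nll, total_tokens = 0.0, 0
    for token_ids, pos_tag_ids in batches:
        logits = model(token_ids[:, :-1], pos_tag_ids[:, :-1])  # predict the next token
        targets = token_ids[:, 1:]
        nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                              targets.reshape(-1), reduction="sum")
        total_nll += nll.item()
        total_tokens += targets.numel()
    return math.exp(total_nll / total_tokens)
```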
We were also very interested in the POS embedding itself. It is a sentence embedding, and it is ten-dimensional. It is not surprising that the length of the POS-sentence is a very important feature. We attempted to build a topological model of the entire embedding using BluelightAI’s Cobalt software, but found that length dominated the model. In order to gain a better understanding of it, we selected the middle quintile by length and constructed a Cobalt model of that subset. See the result below.
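The length-based subsetting is straightforward; here is a sketch, assuming embeddings is the N x 10 array of POS-sentence embeddings and lengths holds the corresponding POS-sentence lengths (the Cobalt modeling step itself is omitted).

```python
# Sketch: restricting to the middle quintile of POS-sentence length before
# building the topological model. `embeddings` (N x 10) and `lengths` (N,)
# are assumed NumPy arrays.
import numpy as np

lo, hi = np.percentile(lengths, [40, 60])        # middle-quintile boundaries
mask = (lengths >= lo) & (lengths <= hi)
mid_quintile_embeddings = embeddings[mask]       # subset passed to the Cobalt model
```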
In examining the graph, we decided to look at a couple of groups that appeared as flares in the graph, which you can see below. To get an understanding of the groups, we collected the actual sentences corresponding to members of each group (the members are POS-sentences, but we had retained the actual sentences corresponding to them) and used ChatGPT (version 4.1), asking it to summarize the two groups of sentences with an eye to syntax.
A summary description of group A is as follows.
Narrative sequencing: temporal and causal phrases leading to actions, e.g. “Following the trade deadline Howson announced that the team had attempted to trade Nash at the players request”
Mixed structures: fragments, title formats, headlines, conversational clauses e.g. “Wild Cherry Makes A Wish collaboration with Pippa Le Quesne Frederick Warne 2006.”
Implicit cohesion: pronouns and ellipsis expecting surrounding context, e.g. “They contain seemingly contradictory ideas each expressing a particular perspective on divine events”
The corresponding description of group B is as follows.
Expository compression: nesting of modifiers and dense terminology, e.g. “In 1989 Frederick Warne, a division of Penguin Books since 1983, acquired the Flower Fairies properties”
Historical/technical reporting: formality, passives, measurements, e.g. “The 150 pound Parrott rifle weighed 16,500 pounds (7,500 kg) and was 17 calibers long”
Explicit cohesion: self-contained sentences requiring minimal external context, e.g. “It became one of the top grossing productions of India that year and earned 1.86 billion (US$28 million) worldwide”
Parts of speech have been studied directly in numerous ways; see for example here and here. Our analysis represents a more purely data-driven approach to understanding sentence types. We believe our embedding should be viewed as capturing purely syntactic information, since it is essentially independent of the meaning of the words involved. We are working on a similar analysis for high-level conceptual semantic information. We believe that understanding the behavior of models of “lower resolution sequences” will give insight into the behavior of large language models.
As we pointed out above, the features arising in this embedding can be used as input to a simple tuning process for a language model, which yields a significant improvement in the perplexity of the model. We hope that one can ultimately produce a family of a priori features that will act as a toolkit for “signal processing” for large language models, in both pre- and post-training.