What are Transformer models?

Transformer-based models have revolutionized the way that machine learning models interact with text data, both in terms of analyzing (classifying) it and generating it. Transformers can be built from encoders only, decoders only, or a combination of the two. For the purposes of this article, we will focus on decoder-only models, because they are specifically trained for text-generation tasks. These models are also referred to as Generative Pre-trained Transformers (GPTs), which is where ChatGPT and the rest of OpenAI's GPT models get their name.

Pre-training a decoder-only transformer model primarily relies on a type of task called language modeling. This task essentially involves predicting the next word in a sequence given the previous words, which is also known as autoregressive language modeling. Let's unpack the training process involved in teaching a language model to predict the word "rain" in the sentence, "The river bank was flooded by heavy rain."
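To make the objective concrete, here is a minimal sketch (not taken from the article) of how that single sentence can be turned into next-word training pairs. Real models operate on subword tokens, but whole words keep the idea readable.

```python
# Minimal sketch: turning one sentence into autoregressive training pairs.
# Real models operate on subword tokens; words are used here for readability.
sentence = "The river bank was flooded by heavy rain"
words = sentence.split()

# Each prefix of the sentence is an input; the word that follows is the target.
training_pairs = [(words[:i], words[i]) for i in range(1, len(words))]

for context, target in training_pairs:
    print(f"{' '.join(context):40s} -> {target}")
# ...
# "The river bank was flooded by heavy"  -> "rain"
```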

Pre-training decoder-only Transformers for generative tasks

The first step in the pre-training process is tokenization, which breaks down the full text in the training data into individual words or subwords. These tokens are then passed through the input embedding layer of the model, which converts the tokens into high-dimensional vectors (arrays of numbers) that the model can understand. These vectors are then augmented with positional encodings, which give important information to the model about the order of the words in the sequence.
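As a rough PyTorch sketch of those first steps (the toy vocabulary, embedding size, and sinusoidal positional-encoding formula below are illustrative choices, not details of any particular model):

```python
import math
import torch
import torch.nn as nn

# Toy vocabulary and hyperparameters, chosen for illustration only.
vocab = {"<pad>": 0, "the": 1, "river": 2, "bank": 3, "was": 4,
         "flooded": 5, "by": 6, "heavy": 7, "rain": 8}
d_model = 16  # embedding dimension

# Tokenization: map each (lower-cased) word to an integer ID.
tokens = "the river bank was flooded by heavy".split()
token_ids = torch.tensor([[vocab[t] for t in tokens]])  # shape: (1, seq_len)

# Input embedding layer: integer IDs -> d_model-dimensional vectors.
embedding = nn.Embedding(len(vocab), d_model)
x = embedding(token_ids)  # shape: (1, seq_len, d_model)

# Sinusoidal positional encodings, added so the model knows word order.
seq_len = token_ids.size(1)
position = torch.arange(seq_len).unsqueeze(1)  # (seq_len, 1)
div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
pe = torch.zeros(seq_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

x = x + pe  # broadcast over the batch dimension
print(x.shape)  # torch.Size([1, 7, 16])
```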

These vectors are then fed to the main part of the model, which consists of several decoder blocks. Each of these blocks is equipped with a mechanism called masked self-attention, which allows the model to weigh the importance of each input word when predicting the next word; the mask ensures that each position can only attend to the words that come before it, so the model cannot peek at the words it is being trained to predict. Self-attention allows the model to capture intricate dependencies between words in a sentence, significantly improving its understanding of context and linguistic nuance, and it is a critical feature of the transformer architecture.
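A minimal single-head sketch of masked (causal) self-attention, again with illustrative shapes; production models use multiple attention heads, but the mechanics are the same:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 16
x = torch.randn(1, 7, d_model)  # stand-in for the embedded, position-encoded input

# Learned projections produce queries, keys, and values from the same input.
w_q, w_k, w_v = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
q, k, v = w_q(x), w_k(x), w_v(x)

# Scaled dot-product scores: how much each position attends to every other one.
scores = q @ k.transpose(-2, -1) / math.sqrt(d_model)  # (1, 7, 7)

# Causal mask: position i may only attend to positions <= i,
# so the model cannot look ahead at the words it is trying to predict.
seq_len = x.size(1)
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))

attention_weights = F.softmax(scores, dim=-1)  # each row sums to 1
output = attention_weights @ v                 # (1, 7, 16) context-aware vectors
```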

Next, the data is fed through a feed-forward neural network, which helps the model learn more complex representations of the data. Each of these components is followed by layer normalization, which balances the scale of the activations across the network's layers, stabilizing training and speeding up learning.
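Putting those pieces together, a simplified decoder block might look like the sketch below. It assumes a post-layer-norm arrangement with residual connections; real implementations vary in where normalization is applied and typically add dropout.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Simplified decoder block: masked self-attention plus a feed-forward
    network, each followed by a residual connection and layer normalization."""

    def __init__(self, d_model: int = 16, n_heads: int = 4, d_ff: int = 64):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),  # expand to a wider hidden layer
            nn.ReLU(),
            nn.Linear(d_ff, d_model),  # project back to the model dimension
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        # Causal mask: True entries are positions the model may not attend to.
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1
        )
        attn_out, _ = self.attention(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + attn_out)                  # residual + layer norm
        x = self.norm2(x + self.feed_forward(x))      # residual + layer norm
        return x

block = DecoderBlock()
out = block(torch.randn(1, 7, 16))
print(out.shape)  # torch.Size([1, 7, 16])
```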

The output of the final decoder block is passed through a linear layer, which is a simple transformation of the data, followed by a softmax operation. In effect, the output from the last decoder block is used to calculate the likelihood of every possible next word, and the model picks the word with the highest likelihood as its prediction. In this case, the model would pick the word "rain" to complete the sentence, "The river bank was flooded by heavy ___." The prediction is then compared against the actual next word in the training data. This comparison forms the basis of the loss (error), which is used to update the model's parameters through backpropagation. The process is repeated iteratively over multiple epochs (passes) of the data until the model's performance either plateaus or begins to degrade.
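A hedged sketch of this output stage and one training step; `decoder_output` stands in for the final decoder block's output, and the toy vocabulary size, targets, and optimizer settings are illustrative rather than prescriptive:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 9, 16                          # same toy vocabulary size as earlier
decoder_output = torch.randn(1, 7, d_model)          # stand-in for the last decoder block
target_ids = torch.tensor([[2, 3, 4, 5, 6, 7, 8]])   # the true next word at each position

# Linear layer maps each position's vector to a score for every vocabulary word.
lm_head = nn.Linear(d_model, vocab_size)
logits = lm_head(decoder_output)                     # (1, 7, vocab_size)

# Softmax turns scores into a probability distribution over the next word.
probs = F.softmax(logits, dim=-1)
predicted_ids = probs.argmax(dim=-1)                 # highest-likelihood word per position

# Cross-entropy loss compares the predictions with the words in the training data...
loss = F.cross_entropy(logits.view(-1, vocab_size), target_ids.view(-1))

# ...and backpropagation updates the parameters to reduce that loss.
optimizer = torch.optim.AdamW(lm_head.parameters(), lr=1e-3)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```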

Diagram illustrating individual steps to pre-train a generative transformer model

Running inference on transformers

After the model has been successfully pre-trained or fine-tuned, it's ready to be used to make predictions. Running inference on a transformer model is relatively straightforward. You start by feeding the model your text prompt. Just as in the pre-training process, the input text is transformed into something the model can understand through tokenization and positional encoding. The transformed input is then passed through the decoder blocks and the output layer, and the model generates its response one word at a time. Each predicted word is fed back into the decoder to predict the next word in the sequence, until the response is complete.
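In code, that loop might look like the following sketch, where `model` and `tokenizer` are hypothetical stand-ins for a trained decoder-only transformer and its matching tokenizer; real systems usually sample from the probability distribution rather than always taking the most likely word.

```python
import torch

def generate(model, tokenizer, prompt: str, max_new_tokens: int = 20) -> str:
    """Greedy autoregressive decoding: predict one token, append it, repeat.
    `model` and `tokenizer` are hypothetical stand-ins for a trained
    decoder-only transformer and its tokenizer."""
    token_ids = tokenizer.encode(prompt)       # tokenization, as in pre-training
    for _ in range(max_new_tokens):
        input_ids = torch.tensor([token_ids])
        logits = model(input_ids)              # (1, seq_len, vocab_size)
        next_id = int(logits[0, -1].argmax())  # most likely next token
        if next_id == tokenizer.eos_token_id:  # stop when the model signals it's done
            break
        token_ids.append(next_id)              # feed the prediction back in
    return tokenizer.decode(token_ids)
```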

Running inference on a pre-trained transformer in the form of a prompt

Advantages of Transformer models

Transformer models have revolutionized the field of natural language processing (NLP), a subset of AI that focuses on training models to understand, interpret, and generate human language. Decoder-only models are especially good at generating novel text, given a prompt or set of instructions.

Here are some NLP tasks at which decoder-only transformers perform particularly well:

Language generation - Generating coherent and contextually relevant text, such as in chatbots or language models.
Text summarization - Condensing long documents or pieces of text into shorter, more manageable summaries while retaining key information and meaning.
Text style transfer - Rewriting text to match the style of a specific piece of existing text.
Machine translation - Achieving state-of-the-art results in translating text between different languages.

Additionally, these models can be refined with smaller amounts of data to enhance their performance on specific tasks through the process of fine-tuning. This means that you can leverage the knowledge of large language models pre-trained on massive amounts of data (at someone else's expense) and fine-tune them using a comparatively smaller, task-specific dataset; a sketch of what that workflow can look like follows below. In fact, many of the best large language models (LLMs) available today are fine-tuned versions of previously released LLMs. Check out Sebastian Raschka's excellent blog post to learn more about fine-tuning.

Finetuning Large Language Models: An introduction to the core ideas and approaches
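As one hedged illustration of that fine-tuning workflow, the sketch below uses the Hugging Face transformers and datasets libraries; the base model, dataset file, and hyperparameters are placeholders rather than recommendations.

```python
# Illustrative fine-tuning sketch using Hugging Face transformers and datasets.
# The base model, data file, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "gpt2"                         # a small pre-trained decoder-only model
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token   # GPT-2 has no padding token by default
model = AutoModelForCausalLM.from_pretrained(base_model)

# A comparatively small, task-specific text dataset (placeholder file name).
dataset = load_dataset("text", data_files={"train": "my_task_data.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```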

Limitations of Transformer models

Transformers, despite their strengths, also face several prominent challenges and constraints. Historically, the top-performing generative text models have been the most expensive and difficult to pre-train. And while we are starting to see models that cost a fraction as much to train while retaining much of the accuracy of the best benchmark models, such as Stanford's Alpaca, these models are not perfect by any means and can still generate toxic content.

Here are some of the top challenges and limitations that decoder-only generative text models face:

Computational complexity - Pre-training transformers from scratch requires significant computational resources, including state-of-the-art hardware and high compute costs.
Training data requirements - Transformer models often require large amounts of training data, which may be difficult to access, especially if licensing and copyright issues are a concern.
Interpretability - The complex and distributed nature of transformer models can make it hard to understand how they arrive at their predictions; they are sometimes referred to as "black boxes."
Hallucination - When used to generate text, transformer-based models are known to produce wildly inaccurate (and sometimes toxic) responses in a convincing manner.
Model bias - Generative models tend to reflect, and can amplify, the biases present in their training data. The Brookings Institution found that ChatGPT exhibited a left-leaning political bias when presented with a sequence of statements on political and social matters.

In summary, decoder-only transformer models are excellent at generative text tasks because they are so good at "predicting the next word." That skill stems from their capacity to capture the contextual relationships between words and phrases, including their meaning and order in a piece of text. While top-tier generative text models are generally expensive to pre-train, the ability to fine-tune them means that businesses and researchers can train models that inherit the high-performance capabilities of larger models. And although we are living in a golden age of LLMs, we still need to be aware of their limitations.