Cave with intricate rock formations and smooth curvatures, meant to illustrate the result of LLM fine-tuning: smoothing out the wrinkles of pre-training for the task at hand.

Fine-Tuning LLMs: A Practical Overview

As we discussed in a past article, fine-tuning is a form of transfer learning that makes an AI model better at a very specific task. The AI application we want at the end determines the best fine-tuning approach to take. This post outlines specific fine-tuning approaches and how they can lead to useful AI applications.

What is Supervised Fine-Tuning?

When we talk about fine-tuning large language models (LLMs) with specific outcomes in mind, like designing a compassionate chatbot or a medical summarizer, we are typically talking about supervised fine-tuning. We’ll use supervised fine-tuning and fine-tuning as interchangeable terms throughout this post, even though there are also ways to perform unsupervised fine-tuning (Li et al., 2021).

Supervised fine-tuning is all about taking a pre-trained language model and aligning it for human interactions. The process is similar to model pre-training: we first source or generate a dataset that matches some fine-tuning task (described in the Task- and Multi-Task Fine-Tuning section below), and we then update some or all of the pre-trained model weights on this new dataset to minimize a loss function.
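To make that weight update concrete, here is a minimal sketch of a supervised fine-tuning loop, assuming a HuggingFace causal language model; the model name and the tiny in-memory dataset are illustrative placeholders, not a recommendation.

```python
# Minimal sketch of supervised fine-tuning with a HuggingFace causal LM.
# The model name and the toy (prompt, response) dataset are placeholders.
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; swap in the pre-trained model you are fine-tuning
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# A toy fine-tuning dataset of (prompt, response) pairs.
examples = [
    ("Summarize: The meeting covered Q3 revenue and hiring plans.",
     "The meeting reviewed Q3 revenue and upcoming hiring."),
    ("Translate to French: Good morning.", "Bonjour."),
]

optimizer = AdamW(model.parameters(), lr=5e-5)
model.train()
for prompt, response in examples:
    # Concatenate prompt and response; using the same token ids as labels
    # makes the loss next-token prediction over the full sequence.
    batch = tokenizer(prompt + " " + response, return_tensors="pt")
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()   # gradients of the cross-entropy loss
    optimizer.step()          # update the (unfrozen) model weights
    optimizer.zero_grad()
```

In practice you would batch examples, often mask the prompt tokens out of the loss so only the response is scored, and train for multiple epochs with a held-out validation set.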

 

Prompting and Fine-Tuning

What differs between model pre-training and supervised fine-tuning is that the model is trained on prompts, which contain the context for a user query and specify the type of response that should come out of the LLM.

An example of a prompt and a model response is shown below. Note that we can do things like ask the model a question and also request a specific format for the response (e.g. “Provide a single sentence response.”). The model actually follows the user input and responds appropriately.
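For instance, a prompt might read: “Explain what fine-tuning does to a pre-trained model. Provide a single sentence response.” and a fine-tuned model would reply with something like: “Fine-tuning further trains a pre-trained model on a smaller, task-specific dataset so it performs better on that task.”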

Why Prompting?

Before ChatGPT, users couldn’t interact with LLMs in the dynamic way they can today. The model inputs needed to be structured in a specific format (e.g. questions for question-answer tasks), and the programmer would then feed the model the data to perform a specific task. In the T5 paper from Google, the authors developed one of the first prompt-able models: the programmer could feed in a question, and T5 returned a relevant answer. Later, Google built upon this innovation with Flan-T5 to help an LLM generalize its responses across a range of scenarios.

These insights shortly led to ChatGPT, one of the first models where a user had a web interface that let them directly prompt a language model and get a response. This new ability to interact with LLMs has also led to prompt engineering, the art and science of designing prompts that guide the model to the desired outcomes. We will explore different aspects of prompt engineering in a future post, but the takeaway is that fine-tuning an LLM is all about training a model to take in a prompt as an input and return a relevant and useful response.

Now that we’ve reviewed what fine-tuning a model entails, we’ll discuss three fine-tuning methods that are commonly used to generate AI models with many commercial use-cases. We’ll also provide practical considerations for each fine-tuning method.

Task- and Multi-Task Fine-Tuning

Single Task Learning

Task-specific fine-tuning aims to train a model to perform a machine learning (ML) or natural language processing (NLP) task. Common NLP tasks are outlined in the table below.

| Task | Description | Representative models | Data requirements | Evaluation metric | Example application |
| --- | --- | --- | --- | --- | --- |
| Text summarization | Takes raw text and condenses it to its main points | T5 | Pairs of full texts and summaries | ROUGE, BLEU | Generating one-pagers for specific research topics |
| Language generation | Produces text that matches the user prompt | GPT, T5 | A prompt followed by an appropriate response | BLEU, perplexity | Chatbots |
| Question answering | Answers a question, given some context | BERT and BERT variants | Question-answer pairs with context | F1 score, Exact Match (deterministic), Fuzzy Matching (open-ended) | Semantic search engines |
| Language translation | Translates one language to another | T5 | Source and target language pairs | BLEU, METEOR | Real-time voice translators |
| Sentiment analysis | Determines the sentiment expressed in the text | BERT and BERT variants | Texts with associated sentiment labels | Accuracy, F1 score | Customer feedback reports |

There’s a lot to unpack here, but there are two main takeaways to focus on:

  1. Each NLP task that we want the LLM to learn requires a specific data structure for fine-tuning.
  2. Each NLP task has its own evaluation metrics, which can make fine-tuning LLMs on multiple tasks really difficult.

The Risks and Rewards of Fine-Tuning on a Single Task

We could train an LLM on just one of these tasks, which would allow us to meet specific technical requirements and is simpler to implement. However, there are two main drawbacks of fine-tuning on a single task:

  1. Overfitting on the specific task. The model doesn’t extract the general patterns in the data, but rather starts to memorize specific examples, resulting in subpar performance and a worse user experience.
  2. Catastrophic forgetting, where a model begins to forget what it previously learned. This is described in more depth below.

Multi-Task Learning and the Risk-Reward Tradeoff

To reduce the risk of overfitting on task-specific data, one could train the model on multiple tasks. While this is much more complicated, multi-task learning has multiple benefits, including:

  1. Knowledge transfer: LLMs have been shown to improve across multiple tasks when fine-tuned on multiple tasks, even on tasks they were not fine-tuned on.
  2. Regularization: By varying the structure of the data, we are introducing more variety in the data, which reduces the risk of overfitting or model memorization.
  3. Resource efficiency: AI systems can be composed of different models that all perform different tasks (e.g. running a sentiment analysis model and summarization model separately). Multi-task learning has the potential to make AI systems more efficient, if all of these functionalities are encoded in a single model.

However, we should note that the way you fine-tune on multiple tasks does matter! If you train your AI model sequentially on different tasks (e.g. fine-tuning on text summarization, then moving on to question answering), you run the risk of catastrophic forgetting, where the model’s performance on previously learned tasks degrades. Instead, it is better to interweave the tasks during training, but the practical implementation of this can be challenging.

The figure above illustrates the point. TL;DR: do not fine-tune models like panel (a), where you’re training the LLM task by task. Instead, aim to fine-tune models like panel (b), where you intersperse the tasks randomly during training. Figure adapted from van de Ven et al., 2024.
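As a rough sketch of what interleaving means in practice, one simple approach is to pool the examples from every task-specific dataset and shuffle them before batching, so each training step mixes tasks; the dataset contents below are placeholders.

```python
# Sketch of interleaved multi-task fine-tuning data: pool the task-specific
# datasets and shuffle, so training mixes tasks rather than running them one
# after another. The dataset contents are placeholders.
import random

summarization_data = [{"task": "summarize", "prompt": "...", "response": "..."}]
qa_data            = [{"task": "qa",        "prompt": "...", "response": "..."}]
translation_data   = [{"task": "translate", "prompt": "...", "response": "..."}]

mixed = summarization_data + qa_data + translation_data
random.shuffle(mixed)  # panel (b): tasks interspersed randomly during training

for example in mixed:
    pass  # feed each example into the same fine-tuning step shown earlier
```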

Further, we need to note that fine-tuning on multiple tasks requires considerable overhead. You need different data sources, or you need to build ways to structure your data for each task. You also need to allocate additional computational time and resources to fine-tune on more tasks. In short, multi-task fine-tuning requires considerable investment in time and capital.

Single or Multi-Task Learning? 

Should you fine-tune an LLM on a single task or multiple tasks? It depends on what the model is going to be used for, and there’s no one-size-fits-all solution. However, we’ll give you our perspective on this, which aligns with designing a minimum viable product.

If the model is going to have a single use (e.g. summarization only), we would recommend focusing on the requirements for that single use-case and using standard methods to mitigate the risk of overfitting, including data augmentation, regularization, early stopping, batch normalization, and so on. However, if you are aiming to design an AI system that is complex in functionality, or you’ve found that single-task learning does not get you the key performance indicators you need to release your product, we would suggest multi-task learning.

Instruction Fine-Tuning

Pre-ChatGPT, the authors of the FLAN paper noted that large pre-trained language models like the base GPT-3 model are poor at responding to human queries in a zero-shot manner. This means that you couldn’t get the response you wanted most of the time through a single prompt. Rather, the models had to be “warmed up” with a process called few-shot learning, where some examples of instructions and responses are provided before the actual task.

The Issue With Few Shot Learning

Below is an example of a prompt that you would feed to an LLM to perform few-shot prompting for an e-commerce customer support chatbot. The prompt shows a 3-shot problem, which contains 3 question-answer examples within the ### delimiters. The final user/system dialogue is meant to elicit a response from the chatbot to the real customer, who is asking “What’s your return policy?”.
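A sketch of what such a 3-shot prompt could look like; the three question-answer examples and the exact wording are illustrative placeholders.

```python
# A 3-shot prompt sketch for an e-commerce support chatbot. The three
# question-answer examples between the ### delimiters are illustrative.
few_shot_prompt = """You are a helpful e-commerce customer support assistant.

###
Customer: How long does shipping take?
Assistant: Standard shipping takes 3-5 business days.
###
Customer: Do you ship internationally?
Assistant: Yes, we ship to most countries; international orders take 7-14 business days.
###
Customer: How do I track my order?
Assistant: You can track your order using the link in your confirmation email.
###

Customer: What's your return policy?
Assistant:"""
```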

While this helps a model return a more relevant response, there are several practical issues with few-shot prompting.

  1. Most models have a fixed context window, which means you can only feed the model a certain number of tokens. If you need to add a few examples of how the model should respond, few-shot learning reduces the amount of context you can provide for the real task at hand. This can reduce the quality of the model output.
  2. When using an API or even hosting your own models for inference, more text in the prompt increases the inference cost, because a larger prompt consumes more computational resources. It also increases latency.
  3. Few-shot prompting is not scalable from an operational perspective. There can be an infinite number of user flows that you would need to optimize for with different few-shot prompting templates. Additionally, the model response is tied directly to the quality of the examples used in the prompt.

Instruction Fine-Tuning Gets Us to Zero-Shot Learning

Having the model learn how to respond directly to human queries, without any priming with additional (instruction, response) pairs, is known as zero-shot learning. Instruction fine-tuning is a way of enabling zero-shot learning by teaching an LLM how to respond to and follow a set of instructions. Below is a table with three examples of what a model could be instructed to do.

| Instruction | Input | Expected output |
| --- | --- | --- |
| Translate the text to French. | “Where is the nearest pharmacy?” | “Où est la pharmacie la plus proche?” |
| Summarize the content. | “This article discusses the impact of climate change on global agriculture, highlighting the need for sustainable practices to mitigate adverse effects.” | “The article outlines how climate change affects global agriculture and emphasizes the importance of sustainable practices.” |
| Provide a math solution. | “Solve 2x + 3 = 11.” | “The solution to the equation 2x + 3 = 11 is x = 4.” |
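To turn (instruction, input, output) triples like these into fine-tuning data, each row is typically rendered into a single training prompt with a fixed template. Here is a minimal sketch; the template wording is one common convention, not a requirement.

```python
# Sketch of rendering (instruction, input, output) rows into training prompts.
# The template wording is one common convention, not a fixed standard.
examples = [
    {"instruction": "Translate the text to French.",
     "input": "Where is the nearest pharmacy?",
     "output": "Où est la pharmacie la plus proche?"},
    {"instruction": "Provide a math solution.",
     "input": "Solve 2x + 3 = 11.",
     "output": "The solution to the equation 2x + 3 = 11 is x = 4."},
]

TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

training_texts = [TEMPLATE.format(**row) for row in examples]
print(training_texts[0])  # one formatted training example
```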

While this seems like a simple change in prompting, it resulted in massive improvements in zero-shot learning and interactivity with LLMs (specifics can be found in Ouyang et al., 2022). This was one of the critical design changes that allowed users to interact with AI in a user-friendly way, and it enabled ChatGPT and similar chatbots to go viral.

Challenges with Generating Instruction Datasets

A common pain point when trying to fine-tune custom LLMs is developing good instruction datasets. One approach involves manually constructing (instruction, response) pairs with subject matter experts. While this is the gold-standard way of creating high-quality datasets, it is very time- and labor-intensive.

Another approach is semi-automated: you generate synthetic examples using LLMs and open-source libraries.

  1. The Self-Instruct framework (Paper, GitHub) works by starting from a small seed instruction dataset and letting an LLM generate a larger instruction dataset (a rough sketch is shown after this list).
    1. Note that this approach was used to train the Alpaca model, an instruction fine-tuned LLaMA model with performance similar to OpenAI’s text-davinci-003 that cost less than $600 to fine-tune.
  2. Tuna (Blog, Replit) is similar to the Self-Instruct framework for generating instruction datasets, but the blog makes it easy to set up the dataset generation process with an OpenAI API key.
  3. Bonito (Paper, GitHub) is an open-source model that converts unannotated text into task-specific training datasets for instruction tuning.
    1. It integrates easily with common LLM libraries like HuggingFace transformers and vLLM, making this an appealing option.
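As a rough illustration of the semi-automated approach (not the actual Self-Instruct, Tuna, or Bonito code), the core loop is to show an LLM a few seed (instruction, response) pairs and ask it to write new ones. Here is a minimal sketch using the OpenAI Python client; the model name and prompt wording are placeholder choices.

```python
# Rough sketch of Self-Instruct-style dataset generation: seed examples go into
# a prompt and an LLM is asked to produce new (instruction, response) pairs.
# This is not the actual Self-Instruct/Tuna/Bonito code; the model name and
# prompt wording are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

seed_examples = [
    {"instruction": "Summarize the content.", "response": "A one-sentence summary of the text."},
    {"instruction": "Translate the text to French.", "response": "Bonjour, comment puis-je vous aider?"},
]

seed_text = "\n".join(
    f"Instruction: {ex['instruction']}\nResponse: {ex['response']}" for ex in seed_examples
)

completion = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{
        "role": "user",
        "content": (
            "Here are example instruction/response pairs:\n"
            f"{seed_text}\n\n"
            "Write 5 new, diverse instruction/response pairs in the same format."
        ),
    }],
)
print(completion.choices[0].message.content)  # review and filter before adding to your dataset
```

However you generate the pairs, a human review and filtering pass is still important before the synthetic data goes into fine-tuning.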

Torchstack's Recommendation for Dataset Generation

There’s no single right answer for which approach you should use to fine-tune your LLM. However, we side with a data-centric philosophy for generating custom fine-tuning datasets and AI models.

TL;DR: “Garbage-In, Garbage-Out”. To create better AI models that will please your customers, you need high-quality data.

What we recommend is creating a smaller, high-quality dataset (~100-1000 examples) per task, generated internally. This ensures that both business stakeholders and the tech team are aligned with what model they want to build. This gold-standard dataset will be used as the basis for training your custom AI models.

Then, we would augment your hand-made dataset with AI-generated data. You should evaluate and monitor your LLM’s performance to see whether the AI-generation method helped or hurt it, and establish a metric to track for ongoing performance evaluation. Once the model is fine-tuned to your standard, whether that standard is a capital-based constraint or a performance objective you’ve been maximizing, you would deploy your model into your application and start to acquire real customer data. This is the engine that allows you to improve your model further.

AI Alignment with Human Preferences and Instruction Fine-Tuning

We should briefly discuss AI alignment, which is related to instruction fine-tuning but is so important that we’ll describe it in the depth it deserves in a future post. LLMs are trained on a lot of human (and AI) generated text from the internet. With it, systemic biases (racial, gender, etc.), toxic interactions, misinformation, and other issues can be propagated en masse by AI.

Thus, an important step in deploying production-grade models is AI alignment: ensuring that AI systems behave in line with human intentions and values. This will be a key step in ensuring your customers have a great experience with your AI application.

A lot of what goes into aligning AI models with human preferences isn’t even about training. But from a fine-tuning perspective, methods such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are popular ways to align AI models with human preferences. For now, we leave you with a nice reference that discusses the principles and tactics of human-AI alignment (Ji et al., 2023).

Parameter Efficient Fine-Tuning (PEFT)

The Issue with Fine-Tuning Whole LLMs

While the intuition behind fine-tuning is fairly straightforward, there are challenges with implementing it in practice. One potential issue that you could face is the resource requirement to train a model, especially if you’re updating all the model parameters.

Let’s take the example of doing full fine-tuning for the GPT-3 model, which has 175 billion parameters. Each parameter requires 32 bits, or 4 bytes. 175 billion parameters x 4 bytes per parameter is 700 billion bytes, which is 700 GB of memory just for the weights. This doesn’t include other things we need to store during model training, including the gradients and the optimizer states for these parameters.
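As a back-of-the-envelope sketch (assuming fp32 weights and an Adam-style optimizer, which keeps two extra states per parameter), the memory adds up quickly:

```python
# Back-of-the-envelope memory estimate for full fine-tuning, assuming fp32
# (4 bytes per value) and an Adam-style optimizer with two states per parameter.
params = 175e9          # GPT-3-scale parameter count
bytes_per_value = 4     # fp32

weights   = params * bytes_per_value        # ~700 GB of model weights
gradients = params * bytes_per_value        # ~700 GB of gradients
optimizer = 2 * params * bytes_per_value    # ~1.4 TB of Adam moment estimates

total_gb = (weights + gradients + optimizer) / 1e9
print(f"Approximate training memory: {total_gb:,.0f} GB")  # ~2,800 GB, before activations
```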

The problem is clear: most of us don’t have access to servers with sufficient memory to perform full fine-tuning. And if you do rent compute from a cloud service provider like Amazon Web Services (AWS), the associated costs are high. For instance, the approximate cost to do full fine-tuning of GPT-3 would be around $4 million. If this were the only way to fine-tune LLMs, it would prevent individuals, startups, and small businesses from making their own custom LLMs.

Parameter Efficient Fine-Tuning (PEFT)

Fortunately, there are methods that enable us to fine-tune models more efficiently, collectively called Parameter Efficient Fine-Tuning (PEFT). The intuition is simple: an LLM carries a general representation of our language, which is stored in its weights. We want to use that general model for the most part and only tweak the parameters that are task-specific.

So instead of fine-tuning all model parameters, we freeze (in other words, don’t train) most of the weights in a pre-trained model, and we only fine-tune a small number of weights that correspond to task-specific activities. This approach has several benefits:

  1. By only fine-tuning a smaller subset of weights, we decrease the computational and storage cost significantly.
  2. There are methods where you freeze the entire model except for a smaller, task-specific layer called an adapter. By swapping out adapters during inference, you get great flexibility in model capabilities.
  3. With full fine-tuning, we run the risk of catastrophic forgetting, where the model forgets what it learned previously. With PEFT, we reduce this risk by only updating a smaller number of weights. This retains most of the behavior of the pre-trained LLM while allowing the model to adapt to the fine-tuning dataset.

PEFT and Low Rank Adaptation (LoRA)

There are many methods for PEFT, and we reference this review article for a much deeper, more technical dive into PEFT methods. However, we will give a special mention to a specific method called Low Rank Adaptation (LoRA). LoRA is a popular PEFT method for LLM fine-tuning because of the three advantages we mentioned previously:

  1. We fine-tune a smaller set of weights while freezing the pre-trained LLM weights, saving us computational resources and money.
  2. You can fine-tune multiple LoRA adapters for different tasks, and swap them out easily during inference to optimize the customer experience.
  3. Because the pre-trained LLM weights are frozen during fine-tuning, we largely remove the risk of catastrophic forgetting.

LoRA is supported by AI libraries like HuggingFace’s peft, allowing your AI team or developers to try out LoRA easily. You can check out the peft library on HuggingFace, which also supports other popular PEFT methods for LLM fine-tuning.
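Here is a minimal sketch of applying LoRA with the HuggingFace peft library; the base model name and the LoRA hyperparameters are illustrative choices, not recommendations.

```python
# Minimal LoRA sketch with HuggingFace peft: freeze the base model and train
# only small low-rank adapter matrices. Model name and hyperparameters are
# illustrative placeholders.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base model

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=32,              # scaling factor for the adapter output
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projections to adapt (model-specific)
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the total parameters
# Fine-tune `model` with your usual training loop; only the LoRA weights are updated.
```

Swapping adapters at inference time then amounts to loading a different set of LoRA weights on top of the same frozen base model.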

Summary

Fine-tuning large language models (LLMs) takes a pre-trained language model that knows the underlying patterns and correlations of language and tweaks it to align with our needs, such as performing specific language-based tasks or following instructions. A fundamental shift between traditional modeling and fine-tuning LLMs is the concept of prompting, where a language model learns how to respond to user inputs. This was an important innovation that opened up AI models to use by the general population.

Finally, we discussed different fine-tuning approaches, what they do to the model, and frameworks for deciding how to approach LLM fine-tuning. LLMs can be fine-tuned to perform different natural language tasks and to respond to user instructions, but different methods carry risks and costs that need to be accounted for in specific business use-cases. We also discussed Parameter Efficient Fine-Tuning, a way to save on fine-tuning AI models while not sacrificing much performance.

Together, these approaches are the main levers that you can pull to develop your own custom LLMs! While we discussed a lot of practical tips for LLM fine-tuning, there’s a lot more that goes into the process, including monitoring model fine-tuning, choosing or designing evaluation metrics for measuring model performance, prompt engineering, and much more.

If you’d like to work with us and help your team navigate the challenges associated with training custom AI models, reach out to info@torchstack.ai to set up a time to discuss your specific needs.

 


This Blog is Supported by Notion and Notion AI

Notion is an all-in-one organizational workspace and collaboration tool for your notes, to-do lists, calendars, and much more. Our team has used many other applications in the past for workspace organization, collaboration, and project management, like Evernote, Trello, and Microsoft OneNote, among many others. However, we’ve found that things get lost in the mix - there are just too many applications, and switching back and forth causes us to lose things. Once we stumbled on Notion, we began using it for everything:

  • Writing our blog posts
  • Jotting down notes, storing images, videos, and linking to other files and web pages
  • Sharing documents with team-specific sites and creating public web pages for our team to access at all times
  • Using their templates for project management, documentation, collaborations, and reporting

And much more!

Recently, they released Notion AI, which is directly integrated into the application and enables you to automate routine tasks like scheduling and document organization, analyze and summarize large documents, and draft work documents, all with one click of the spacebar.

Notion is our headquarters for our organization, and if you’re looking to have a single platform for all your work and collaboration needs, try out Notion today by clicking on the affiliate links above. We appreciate you supporting our business.
