A Taxonomy of PEFT Methods for LLM Fine-Tuning

Summary

Parameter-Efficient Fine-Tuning (PEFT) is a set of cost-effective methods for developing custom Large Language Models (LLMs). But like any other tool in the toolkit, you need to select the method that will work for your specific problem.

  • This blog walks through a taxonomy of PEFT methods, covering the five broad categories of PEFT methods.
  • We highlight representative PEFT methods and show how to deploy them with a simple fine-tuning example.
  • We describe the strengths and weaknesses of these PEFT methods and provide a framework and selection criteria for choosing which PEFT algorithms to experiment with.

The Taxonomy of PEFT Methods

Here we describe five broad categories of PEFT methods at a high level before diving into each specific method. Three are core categories (reparameterized, additive, and selective fine-tuning), while the remaining two are composite approaches (hybrid and unified fine-tuning).

  1. Reparameterized fine-tuning: The outputs from the pre-trained weights are slightly tuned by low rank matrices that are trained to be task specific.
  2. Additive fine-tuning: This method introduces additional parameters to the pre-trained model that learn the specific task. The pre-trained weights remain frozen.
  3. Selective fine-tuning: A subset of the model’s parameters are selected for fine-tuning, while the remaining parameters are frozen.
  4. Hybrid fine-tuning: Multiple fine-tuning approaches are combined together as modules.
  5. Unified fine-tuning: A single approach unifies various fine-tuning methods into a cohesive framework.

Reparameterization

Reparameterization PEFT methods introduce small, task-specific adjustments on top of the frozen pre-trained model. Many approaches achieve this by adding lightweight components, such as low rank matrices, to the model’s weights. We’ll first take a deeper dive into the original low rank decomposition method, LoRA, and then look at LoRA derivatives.

What does it mean to be a low-rank matrix?

Taking a step back into Linear Algebra 101, the rank of a matrix is the number of linearly independent rows or columns it contains. In other words, a matrix is full rank if no row (or column) can be written as a linear combination of the others.

Low rank matrices, on the other hand, contain rows and columns that are linearly dependent on one another. With LoRA, we try to approximate a high rank matrix with a low rank one. Why? Because with LoRA fine-tuning, we want to capture the most important features of the data while discarding redundant or insignificant information. We also want the learned low rank matrices to contain the task-specific information we need, while condensing that knowledge into computations that are more efficient to store and process.
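To make this concrete, here is a minimal sketch (my own illustration, not from the original post) using NumPy: we take a random matrix and build a rank-8 approximation from its truncated SVD, which is the same kind of low rank factorization LoRA learns during fine-tuning.

```python
import numpy as np

# Toy illustration: approximate a matrix with a rank-r factorization via truncated SVD.
rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512))          # a "full rank" weight matrix

r = 8                                        # target rank
U, S, Vt = np.linalg.svd(W, full_matrices=False)
B = U[:, :r] * S[:r]                         # 512 x r
A = Vt[:r, :]                                # r x 512
W_approx = B @ A                             # rank-r approximation of W

print(np.linalg.matrix_rank(W))              # 512
print(np.linalg.matrix_rank(W_approx))       # 8
# Storage: 512 * 512 = 262,144 values for W vs. 2 * 512 * 8 = 8,192 values for B and A
```

The storage comparison in the last comment is the key point: the pair of low rank factors carries the task-specific signal at a small fraction of the cost of a full weight matrix.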

Low Rank Decomposition Methods

One popular method is Low Rank Adaptation (LoRA), which introduces two trainable low rank matrices during model fine-tuning. The original pre-trained weights are frozen, while the LoRA matrices learn the task-specific information. For inference, the LoRA weights can be merged back into the pre-trained model, which lets you use the newly learned information with little to no change in inference time, as the sketch below illustrates.
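As a rough sketch (my own illustration, not the reference implementation), a LoRA-augmented linear layer keeps the frozen weight W and adds a trainable product B·A scaled by alpha/r:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low rank update (sketch)."""
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)   # freeze pre-trained weights
        self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # zero init: update starts at 0
        self.scaling = alpha / r

    def forward(self, x):
        # y = W x + (alpha / r) * B A x  -- only A and B receive gradients
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

    def merge(self):
        # Fold the learned update into the frozen weight for inference
        self.base.weight.data += self.scaling * (self.lora_B @ self.lora_A)
```

Because the update B·A has the same shape as W, calling merge() after training removes the extra matrix multiplications entirely at inference time.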

Practically, this affords several benefits. You can have multiple lightweight LoRA models that can be swapped easily for different downstream tasks, especially if you’re able to effectively route prompts to appropriate models.
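In practice, with the Hugging Face peft library, loading and swapping adapters looks roughly like this (the adapter names and paths below are placeholders, not from the original post):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Attach one LoRA adapter, then load a second one under a different name
model = PeftModel.from_pretrained(base, "path/to/summarization-lora", adapter_name="summarize")
model.load_adapter("path/to/sql-lora", adapter_name="sql")

# Route a request to the appropriate adapter by activating it before generation
model.set_adapter("sql")
```

Each adapter is only a few megabytes, so keeping several on disk and switching between them per request is far cheaper than hosting a separately fine-tuned model for every task.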

LoRA Derivatives

LoRA is a highly effective reparameterization method that affords several useful practical benefits for LLM fine-tuning. As a result, LoRA derivatives have emerged that improve upon the base LoRA method. We’ll discuss three specific methods here.

While LoRA is a powerful method, the original LoRA uses a fixed rank for the low-rank matrices. This fixed rank results in several issues:

  • The given rank may not suit a specific dataset’s size and complexity. With a smaller dataset, the rank may be too large, resulting in overfitting; conversely, the more complex a dataset is, the larger the rank needs to be.
  • If the rank is larger than necessary, there will be an excessive number of trainable parameters, which increases computational cost and memory usage.
  • Determining an appropriate rank manually is essentially hyperparameter tuning, and experimenting to optimize this value can be time-consuming and resource-intensive.

Adaptive Low Rank Adaptation (AdaLoRA) solves the fixed rank problem of the original LoRA method by dynamically adjusting the rank of the matrices, using a Singular Value Decomposition-style parameterization and pruning of unimportant singular values.
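With the peft library, AdaLoRA is configured much like LoRA; the sketch below starts from an initial rank and prunes down to a target rank over training (the parameter values are illustrative assumptions, not recommendations from the original post):

```python
from transformers import AutoModelForCausalLM
from peft import AdaLoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = AdaLoraConfig(
    init_r=12,          # initial rank given to every adapted weight matrix
    target_r=4,         # average rank to prune down to
    tinit=200,          # training steps before pruning starts
    tfinal=1000,        # steps over which the rank budget is annealed
    total_step=1500,    # total training steps the schedule assumes
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()
```

The point of the schedule is that the rank is no longer a single guess: important weight matrices keep more of their singular values, while less important ones are pruned more aggressively.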

Another general issue with LoRA is the large amount of memory still required to fine-tune pre-trained LLMs. Memory has always been a bottleneck in training deep learning models, and to address it, we (the AI field) have developed quantization methods that represent numbers with fewer bits while still capturing enough of the data patterns for modeling.

For some historical context, many smaller models are trained in full precision, where each floating point number is represented using 32 bits (FP32). More recent models, such as LLaMA 2, were trained in half precision, which uses 16 bits (FP16). While halving the precision already leads to significant memory savings, it is sometimes still not enough for fine-tuning very large language models. Further, as more bits are removed, the model’s calculations become less precise, which tends to degrade performance.
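A rough back-of-the-envelope calculation (my own illustration; exact figures vary with architecture and overhead) shows why precision matters for a 7-billion-parameter model:

```python
params = 7e9  # ~7B parameters

for name, bits in [("FP32", 32), ("FP16", 16), ("4-bit", 4)]:
    gigabytes = params * bits / 8 / 1e9
    print(f"{name}: ~{gigabytes:.0f} GB just to hold the weights")

# FP32: ~28 GB, FP16: ~14 GB, 4-bit: ~4 GB (before optimizer states,
# gradients, and activations, which add substantially more during training)
```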

Thus, Quantized Low Rank Adaptation (QLoRA) was an innovation that enabled fine-tuning on a 4-bit quantized pretrained LLM, while retaining the performance of 16-bit fine-tuning.
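With the Hugging Face stack, a QLoRA setup typically combines a 4-bit quantization config with a LoRA adapter on top. The sketch below reflects commonly used defaults (the model name and hyperparameters are illustrative assumptions):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the pre-trained model in 4-bit NF4, with compute done in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config
)

# Make the quantized model trainable and attach LoRA matrices on top
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Only the LoRA matrices are trained in higher precision; the frozen base weights stay quantized at 4 bits, which is what makes fine-tuning large models feasible on a single GPU.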
