Fine-Tuning LLMs: Sentiment Analysis With BERT

In a previous post, we discussed how fine-tuning can be used to build better artificial intelligence (AI) models for your specific business problems and use cases. Fine-tuned models can power new AI applications and products, and they can also be used internally to save time and effort on tasks and research.

This is the first part of a multi-part tutorial where we’ll walk through fine-tuning large language models from start to finish, and go over related technologies, business considerations, and best practices associated with AI fine-tuning. In this post, we’ll be fine-tuning a BERT model to perform sentiment analysis on financial Tweets to see if they appear to be bullish or bearish.

What is BERT?

In 2018, Google released the Bidirectional Encoder Representations from Transformers (BERT) model, which was one of the first encoder-only large language models (LLMs) that competed with other state-of-the-art architectures like OpenAI’s original GPT model. Some of the significant aspects of BERT are shown in the table below.

| Aspect | Detail |
| --- | --- |
| Release | 2018, by Google |
| Architecture | Encoder-only Transformer |
| Model sizes | BERT-base (~110M parameters), BERT-large (~340M parameters) |
| Pre-training objectives | Masked language modeling (MLM) and next-sentence prediction (NSP) |
| Pre-training data | BooksCorpus and English Wikipedia |

BERT was one of the first language models that could be used for many NLP tasks at once, while achieving good enough performance for production-grade systems.

Note that this post doesn’t go into the technical details of how BERT works, because others have already done a fantastic job at this! If you’re interested in how BERT works in greater detail, I would point you to HuggingFace’s BERT 101 post and Jay Alammar’s Illustrated BERT post.

Why BERT?

At this point, you may be thinking,

“Why are we talking about BERT? It sounds like old technology that isn’t that relevant compared to the latest AI models like OpenAI’s GPT4, Mistral AI’s models, or Google Gemini. Shouldn’t we be talking about fine-tuning those models?”

While those are valid points, BERT is still a valuable AI technology to learn and apply in a modern AI technology stack for the following reasons:

  • BERT models are small by modern LLM standards, which makes them cheap to fine-tune and serve, even on modest hardware.
  • BERT’s encoder-only architecture is well suited to classification tasks like sentiment analysis, where you need a label rather than generated text.
  • The model weights are openly available, so you can run BERT on-premises and keep sensitive data in-house.

Fine-Tuning BERT for Sentiment Analysis

This section goes into the details of fine-tuning a BERT model for a common machine learning (ML) task called sentiment analysis. We’ll largely focus on the higher-level concepts and core parts of the code used here, while the full Jupyter notebook for this tutorial can be found in this GitHub repo.

What is Sentiment Analysis?

Sentiment analysis is an NLP task that is used to determine whether a piece of text has some sort of sentiment associated with it. When applied to a commercial technology stack, sentiment analysis is typically used to understand the sentiment of user or customer generated text to do something downstream. Some examples are listed below:

  • Categorizing customer feedback responses to determine areas of improvement
  • Monitoring social media engagement to gauge public sentiment about brands/products/services and inform targeted marketing strategies
  • Flagging highly negative customer support tickets for immediate attention to increase retention

We used the Twitter Financial News dataset in our BERT fine-tuning example, which is available through the HuggingFace datasets API. This dataset consists of English tweets, each annotated with one of three labels: Bearish, Bullish, or Neutral. We’ll be using the validation set to benchmark both the base BERT model and the fine-tuned BERT model.

First, we’ll load some relevant libraries.

To load the dataset, we can use the following function call from the datasets API. We’ll also visualize a single data point.

Before we do any modeling, we need to tokenize the text dataset. Tokenizing is simply a way to convert string data (e.g. sentences, words, subwords, or characters) into a numerical representation that the model can use.

We’ll also apply a little shortcut for model fine-tuning here: we’ll fine-tune and evaluate our BERT model on only 1000 data points. Increasing the number of data points will likely yield much better model performance, but for demonstration purposes, this works.

Benchmarking the Base BERT Model

Now let’s load the base BERT model here.
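A sketch of loading the base checkpoint, assuming `bert-base-uncased` and the three labels in this dataset:

```python
from transformers import AutoModelForSequenceClassification

# Classification head with three outputs: Bearish, Bullish, and Neutral
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3
)
```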

Since sentiment analysis can be framed as a multi-class classification problem, we’ll use accuracy as the evaluation metric. We’ll use HuggingFace’s [evaluate](https://huggingface.co/docs/evaluate/en/index) API to calculate the accuracy.

The code that we’ll use for inference is in the predict function supplied below.
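The repo has the exact implementation; a minimal version of such a `predict` function might look like the following, looping over a tokenized split and taking the argmax of the logits:

```python
import torch

def predict(model, tokenized_split):
    """Return the predicted class index for each example in a tokenized split."""
    model.eval()
    predictions = []
    with torch.no_grad():
        for example in tokenized_split:
            inputs = {
                "input_ids": torch.tensor([example["input_ids"]]),
                "attention_mask": torch.tensor([example["attention_mask"]]),
            }
            # The class with the highest logit is the predicted sentiment
            logits = model(**inputs).logits
            predictions.append(int(torch.argmax(logits, dim=-1)))
    return predictions
```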

Let’s now assess how the base BERT model with no fine-tuning does with this dataset.

As you can see, the baseline accuracy measure isn’t that high. Let’s see if fine-tuning BERT helps.

Setting Up Model Fine-Tuning

We’ll need to set up a function that computes the accuracy during model training. This is provided below.

To set up the training configuration, we can use the TrainingArguments object from HuggingFace.

Finally, we’ll create the training pipeline that will execute BERT fine-tuning. This is done using the Trainer object from HuggingFace.
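The wiring is a short sketch that assumes the objects built in the earlier steps (the model, the training arguments, the 1000-example splits, and the `compute_metrics` function):

```python
from transformers import Trainer

trainer = Trainer(
    model=model,                      # base BERT with a 3-class head
    args=training_args,               # the TrainingArguments configured earlier
    train_dataset=small_train,        # tokenized 1000-example training split
    eval_dataset=small_eval,          # tokenized 1000-example validation split
    compute_metrics=compute_metrics,  # accuracy computed during training
)
```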

Fine-Tuning BERT and Results

The beautiful thing about HuggingFace is that it abstracts a lot of code for us to execute model training and fine-tuning. We can execute the fine-tuning process for BERT with a single line of code.
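That single line is simply:

```python
trainer.train()  # runs fine-tuning and writes checkpoints to output_dir
```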

Now if you’re like me, you probably stepped away from your machine while fine-tuning, and need to re-load the model. We’ll load the latest checkpoint that was generated.
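Reloading might look like the following; the checkpoint directory name is hypothetical and depends on how many steps your run produced:

```python
from transformers import AutoModelForSequenceClassification

# Path is hypothetical; pick the latest checkpoint-* directory in your output_dir
finetuned_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-finetuned-sentiment/checkpoint-500"
)
```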

Similar to the code we used for the BERT baseline accuracy, we can grab the fine-tuned BERT accuracy value.
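One minimal, library-free way to score any model's predictions against the reference labels is sketched below (a hypothetical helper, not the repo's exact code):

```python
def split_accuracy(predict_fn, examples, labels):
    """Fraction of examples whose predicted label matches the reference label."""
    preds = predict_fn(examples)
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)
```

Calling it with the fine-tuned model's `predict` and the validation subset mirrors the baseline measurement.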

We’ll report out the final accuracy, and the improvement between the base model and the fine-tuned model.
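As a sketch, with hypothetical accuracy values plugged in (substitute the numbers from your own runs):

```python
def report_improvement(base_acc, finetuned_acc):
    """Print final accuracy and the absolute improvement in percentage points."""
    improvement = (finetuned_acc - base_acc) * 100
    print(f"Fine-tuned accuracy: {finetuned_acc:.3f}")
    print(f"Improvement over base model: {improvement:.1f} percentage points")
    return improvement

# Hypothetical values for illustration only
report_improvement(0.352, 0.528)
```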

Here we can conclude that fine-tuning did improve the BERT sentiment analysis performance by 17.6%. We could certainly improve upon this if we wanted to create a production-grade AI system. For instance, we could use the full training dataset, or try out another BERT variant - both of these tasks would be interesting, but are out of the scope of this tutorial.

If you do these other fine-tuning activities, you can email me here and let me know your results and what you think of your model!

Takeaways

In this tutorial, we discussed both the theory and Python code that is involved with fine-tuning BERT.

In summary:

  • BERT is an OG but relevant language model that can be used for many NLP and LLM-based tasks in modern AI production systems.
  • We discussed how to fine-tune BERT to perform sentiment analysis, specifically using the Twitter Financial News dataset to infer the sentiment associated with a tweet.
  • Using the HuggingFace library, we were able to fine-tune BERT with a few lines of code and saw massive improvements in accuracy (although there is definitely room for improvement)!

We hope you found this tutorial useful for your own projects involving BERT or model fine-tuning. In later tutorials, we’ll implement other fine-tuning strategies for larger LLMs and discuss practical considerations for those use cases.


Whether you are building basic Retrieval Augmented Generation systems for your AI applications or you’re fine-tuning the next hotdog classifier, it’s always useful to have an experienced Data, ML, and AI team with you to call on. Please reach out to info@torchstack.ai to discuss the problems you’re trying to solve, the markets you’re trying to conquer, or just to grab a cup of coffee.
