I want to discuss a very important concept in ML that is often misunderstood: accuracy.

Most people have a reasonable assumption that they should aim for high accuracy when optimizing their machine learning models.

But what if I told you that high accuracy isn't always the best strategy? In fact, if you're optimizing for accuracy, you're likely doing something wrong.

As you'll see, high accuracy in AI can equate to high bias and bad predictions. Instead, you should be optimizing for F1 Score. This is a much more important metric and will give you better results in the end. So what are the risks of optimizing for accuracy instead of F1 Score? In this post, I'll explain why accuracy is overrated and how to optimize for F1 Score instead.

Accuracy explained

Let's take a look at accuracy through the lens of a real-world problem... identifying blueberry muffins amongst a bunch of chihuahuas 😏

Accuracy, as a measure, is the ratio of correct predictions to the total number of predictions.

accuracy = (correct predictions / total predictions) * 100

Now let’s imagine we created a classifier that could identify blueberry muffins with 80% accuracy.

80% accuracy may sound pretty good. However, on a dataset of ten images containing eight chihuahuas and two blueberry muffins, the classifier can achieve 80% accuracy simply by classifying everything as not a muffin.

This is because the dataset is imbalanced, with a lot more images of Chihuahuas than blueberry muffins. This example is contrived. However, the issue of imbalance is common in real-world classification problems.
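To make the arithmetic concrete, here's a minimal Python sketch of that do-nothing classifier. The split of 2 muffins and 8 chihuahuas is inferred from the figures used later in this post, and the label strings and the `accuracy` helper are illustrative names, not part of any library.

```python
# Assumed dataset: 2 blueberry muffins and 8 chihuahuas (10 images total).
actual = ["muffin"] * 2 + ["not muffin"] * 8

# A "classifier" that ignores its input and always answers "not muffin".
predicted = ["not muffin"] * len(actual)


def accuracy(actual, predicted):
    """Percentage of predictions that match the annotated label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return 100 * correct / len(actual)


print(accuracy(actual, predicted))  # 80.0
```

The classifier never finds a single muffin, yet it still scores 80%.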

For example, in disease screening, most patients will not have a disease. So accuracy is not a good measure to assess performance because if you predicted no one has the disease, you would have high accuracy.

Similarly, in NLP, there is an imbalance when identifying topics of conversation. When comparing mentions of a topic to all of the text amongst thousands of conversations, the topics you seek to predict aren't mentioned often. If you use the wrong measurements to assess performance, it's easy to create a biased classifier. It will appear as if it's accurately classifying topics, but it is not doing you much good in reality.

Exacerbating this problem is observation bias. People naturally judge AI accuracy based on their observation of the predictions. Observation bias is closely related to confirmation bias: it is the effect of seeing what you want to see in data. You let subjective thoughts control the input used to train your classifier with the intent to get higher accuracy. The result is a solution that biases predictions toward what you have observed and doesn't work well for data that you haven't observed.

So if you can't rely on accuracy to assess classifiers and observation as a tactic is a frowned-upon practice, how do you get reliable predictions? There are two essential steps to take –

  1. Meditate and do whatever you must to resist the urge to rely on observations of predictions to assess your classifier
  2. Create a dataset as a gold standard to test against and use precision, recall, and F1 Score as your measure of quality

More on what these metrics mean below. But first, let’s cover a few basics –

How to assemble a test dataset

Think of your test dataset as the gold standard for evaluating your classifier. However, there are no golden rules for how much data you should have in a test dataset to get reliable results. It all depends on the amount of available data and how many labels you’re classifying.

The good news is that we have a few simple guidelines you can follow to create reliable test datasets –

  1. Your test dataset should not contain any data that was used to train your model.
  2. Make sure your test dataset is well-balanced with similar annotated sample counts per label.
  3. As for the size of your test dataset, if you're training a model, it should be a minimum of 10% of the total dataset you've annotated. If you're applying prebuilt models, it's a bit trickier to determine the perfect size. We follow a rule of thumb of having at least 25 test samples before trusting the results.
  4. Seek variability in samples; don't pick 25 very similar samples.
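The first three guidelines can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `split_balanced_test_set` helper, the `(text, label)` sample format, and the example labels are all hypothetical, and random sampling does not by itself guarantee the variability asked for in guideline 4.

```python
import random
from collections import defaultdict


def split_balanced_test_set(samples, per_label, seed=0):
    """Hold out `per_label` samples of each label for testing.

    `samples` is a list of (text, label) pairs. Returns (train, test),
    which never share a sample.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, (_, label) in enumerate(samples):
        by_label[label].append(index)

    test_indices = set()
    for label, indices in by_label.items():
        if len(indices) <= per_label:
            raise ValueError(f"not enough samples for label {label!r}")
        # Same number of test samples for every label keeps the test set balanced.
        test_indices.update(rng.sample(indices, per_label))

    train = [s for i, s in enumerate(samples) if i not in test_indices]
    test = [s for i, s in enumerate(samples) if i in test_indices]
    return train, test


# Hypothetical annotated data: 200 samples, 50 "pricing" and 150 "other".
samples = [(f"comment {i}", "pricing" if i % 4 == 0 else "other") for i in range(200)]
train, test = split_balanced_test_set(samples, per_label=10)
print(len(train), len(test))  # 180 20
```

Here the 20 held-out samples are 10% of the 200 annotated ones, split evenly across the two labels.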

Understanding why your classifier gets confused

Before we can understand classifier quality, we need to know where our classifier gets confused when making predictions. A confusion matrix gives you insight into the types of errors your classifier makes.

To assemble a confusion matrix, you make predictions for each sample in your test dataset and count the number of True Positives, False Positives, False Negatives, and True Negatives by comparing the predicted label to the actual annotated label.

True Negatives are negative predictions for samples that are annotated negative.

False Negatives are negative predictions for samples that are annotated positive.

False Positives are positive predictions for samples that are annotated negative.

True Positives are positive predictions for samples that are annotated positive.

Let's fill in a confusion matrix using our Muffin classifier that achieves 80% accuracy by predicting everything as not a muffin.

The confusion matrix helps us identify precisely where our model is confused:

  • False Negatives - it’s falsely predicting images as not a blueberry muffin that are in fact blueberry muffins
  • It also didn’t predict any blueberry muffins, totaling 0 True Positives
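Counting the four cells is straightforward to sketch in Python. The `confusion_counts` helper and the label strings are illustrative names, and the dataset of 2 muffins and 8 chihuahuas is inferred from the numbers used in this post.

```python
def confusion_counts(actual, predicted, positive="muffin"):
    """Count TP, FP, FN, and TN by comparing predictions to annotations."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            counts["TP"] += 1  # predicted positive, annotated positive
        elif p == positive:
            counts["FP"] += 1  # predicted positive, annotated negative
        elif a == positive:
            counts["FN"] += 1  # predicted negative, annotated positive
        else:
            counts["TN"] += 1  # predicted negative, annotated negative
    return counts


# Assumed dataset: 2 muffins, 8 chihuahuas; the classifier says "not muffin" every time.
actual = ["muffin"] * 2 + ["not muffin"] * 8
predicted = ["not muffin"] * 10

print(confusion_counts(actual, predicted))
# {'TP': 0, 'FP': 0, 'FN': 2, 'TN': 8}
```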

Using our understanding of classifier confusion to calculate Precision, Recall, and F1-Score

Each of these metrics serves an essential purpose in assessing classifier quality.

Precision measures how many of the positive predictions made were correct. It is the ratio of True Positives to the total number of True Positives and False Positives. Often, when observation bias is introduced, precision can be high.

precision = true positives / (true positives + false positives)

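In our example, Precision would be 0 / (0 + 0). Strictly speaking, that is undefined because the classifier made no positive predictions, so by convention we treat it as 0.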

Recall measures the number of correct positive predictions made out of all the samples that are actually positive. It provides an indication of the positives that our classifier missed. Often, when observation bias is introduced, recall drops.

recall = true positives / (true positives + false negatives)

In our example above, Recall would be 0 / (0 + 2) = 0

We can see that our classifier, which labeled all images as not muffins, is not useful. Even though it had a high accuracy of 80%, its precision and recall are both 0 because there are zero True Positives.
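Here's a minimal Python sketch of both formulas. The function names are illustrative, and returning 0 when the denominator is 0 is a convention assumed here to avoid dividing by zero, not something the formulas themselves define.

```python
def precision(tp, fp):
    """True Positives divided by all positive predictions."""
    predicted_positive = tp + fp
    # Assumed convention: no positive predictions means a precision of 0.
    return tp / predicted_positive if predicted_positive else 0.0


def recall(tp, fn):
    """True Positives divided by all samples annotated positive."""
    actual_positive = tp + fn
    # Assumed convention: no positive samples means a recall of 0.
    return tp / actual_positive if actual_positive else 0.0


# The classifier that predicts "not muffin" for everything: TP=0, FP=0, FN=2.
print(precision(tp=0, fp=0), recall(tp=0, fn=2))  # 0.0 0.0
```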

Let's modify our classifier slightly to predict one muffin correctly, resulting in one True Positive.

Our precision will be 1.0, but our recall is low at 0.5.

And if we flip our classifier to the other extreme to predict everything as a muffin…

Our recall would be 1.0. However, our precision is very low at 0.2.

This is the relationship between precision and recall: as we increase precision, recall tends to decrease, and vice versa. Looking at both helps you assess where your classifier is getting confused. Is it classifying positive predictions incorrectly, or is it too precise and overlooking predictions that should be positive?
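One way to see this trade-off is to imagine the classifier outputs a muffin score per image and we move the cut-off up and down. The scores below are made up purely for illustration, and `precision_recall_at` is a hypothetical helper.

```python
# Hypothetical (score, is_muffin) pairs for 10 images: 2 muffins, 8 chihuahuas.
scored = [
    (0.95, True), (0.80, False), (0.60, True), (0.55, False), (0.40, False),
    (0.30, False), (0.20, False), (0.15, False), (0.10, False), (0.05, False),
]


def precision_recall_at(threshold, scored):
    """Predict 'muffin' when score >= threshold, then compute precision and recall."""
    tp = sum(1 for score, is_muffin in scored if score >= threshold and is_muffin)
    fp = sum(1 for score, is_muffin in scored if score >= threshold and not is_muffin)
    fn = sum(1 for score, is_muffin in scored if score < threshold and is_muffin)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


for threshold in (0.9, 0.5, 0.0):
    p, r = precision_recall_at(threshold, scored)
    print(f"threshold={threshold} precision={p:.2f} recall={r:.2f}")
# threshold=0.9 precision=1.00 recall=0.50
# threshold=0.5 precision=0.50 recall=1.00
# threshold=0.0 precision=0.20 recall=1.00
```

A strict cut-off reproduces the one-muffin classifier, and a cut-off of 0 reproduces the classifier that calls everything a muffin.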

In most cases, we want to find the optimal blend of precision and recall. We can combine the two metrics using the F1 Score when this is the case.

F1 Score is the harmonic mean of precision and recall, taking both metrics into account. Because it is a harmonic mean, a low value in either metric drags the score down.

F1 Score = (2 * precision * recall) / (precision + recall)

In both extreme versions of our classifier with high bias, our F1 Scores would be 0 and 0.33. Both scores are poor and tell us the model is very confused and needs improvement, which is a much more reliable signal than the 80% accuracy figure.
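As a quick check of those numbers, here's a small Python sketch that applies the F1 formula to the three versions of the muffin classifier described above. The `f1_score` function is an illustrative helper written for this post, and treating F1 as 0 when precision and recall are both 0 is an assumed convention.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    # Assumed convention: when both inputs are 0, F1 is 0.
    return 2 * precision * recall / total if total else 0.0


# (precision, recall) for each version of the muffin classifier in this post.
classifiers = {
    "everything is not a muffin": (0.0, 0.0),
    "one muffin found": (1.0, 0.5),
    "everything is a muffin": (0.2, 1.0),
}

for name, (p, r) in classifiers.items():
    print(f"{name}: F1 = {f1_score(p, r):.2f}")
# everything is not a muffin: F1 = 0.00
# one muffin found: F1 = 0.67
# everything is a muffin: F1 = 0.33
```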

Test and measure your classifiers with ease

With an understanding of how to assess classifier quality, you're ready to assemble a test dataset, annotate data, and calculate precision, recall, and F1 Score.

The easiest way to do this is with Caravel's classifier builder. You can use our integrations to sync your data sources to create test datasets, and our AI-guided annotation UI helps you tag your test data twice as fast.

Caravel's classifier builder will automatically calculate precision, recall, and F1 Score, so you don't have to worry about the confusion matrix and calculations.

We’ve put together a short demonstration of the classifier testing UI for you below👇

Caravel is your hub for customer feedback that uses AI to automatically analyze all of your responses and guide your teams to action.

Reach out here and we'd be delighted to help with any of your AI and customer feedback needs.