
Hugging Face Course 2025

Hugging Face

Hugging Face is an ecosystem that provides models, datasets, and other tools for developing NLP (Natural Language Processing) and other machine learning applications. Natural language is inherently ambiguous, which makes NLP challenging to learn, but the Hugging Face tools can make the process easier. For example, you can use a Hugging Face pretrained model instead of creating and training a model from scratch. Below you will find a brief introduction to Hugging Face Datasets, Tokenizers, and Models; you can explore these concepts in more detail in the Tasks section, where you'll learn how to use datasets, tokenizers, and models together. The process typically begins with preparing your data, followed by tokenizing the input. Then, you pass the tokenized data to the model; for some models, you decode the output as the final step. Keep in mind that Hugging Face has its own complexity, and it takes time to learn how to use it. Pretrained models follow different conventions, and you should know how to use each one. Using pipelines is the easiest way to get a result from a Hugging Face model, but you can also load a model directly. There are many models, and you should find the right model for your dataset (and the task). For some models, Scikit-learn metrics are also compatible. Some models perform better than others; you can test them on your own data.

We will use other Python libraries as well. If you want to learn about the Pandas library, please visit the Pandas tutorial; if you want to learn about the NumPy library, please visit the NumPy tutorial. If you're interested in a specific topic, feel free to jump straight to it; otherwise, every topic contains useful information. You will learn how to use the Hugging Face libraries locally, and you can use your own editor to test the code. Visual Studio Code is used for the examples below. You should be familiar with Python and PyTorch; no prior Hugging Face knowledge is needed for the course below. You can always consult the Hugging Face website for more information.

Virtual Environment setup

You should set up a Python virtual environment before installing the Hugging Face libraries. First, install virtualenv. If you are using pip, run the command below:

pip install virtualenv

If you are using pip3 or pipx, use pip3 or pipx instead of pip.

You need to create a virtual environment in your Python project folder. If you are using pip, run the command below:

python -m venv new_env

If you are using python3, use python3 instead of python. We named the virtual environment "new_env", but you can choose another name.

You can activate the environment:

source new_env/bin/activate

We will also need the following Python libraries, so install them as well. If you are using pip, run the command below:

pip install -U scikit-learn

To install scikit-learn using conda, check the official website.

After the installation, you may need to close and reopen your project folder (or restart your editor).

To check the version of the scikit-learn library:

import sklearn
print(sklearn.__version__)

We will use the pandas and numpy libraries. If you are using pip, run the commands below:

pip install pandas

pip install numpy

If you are using conda, run the commands below:

conda install pandas

conda install numpy

Import pandas and NumPy:

import pandas as pd
import numpy as np

Hugging Face Installation for Natural Language Processing

To use Hugging Face locally for Natural Language Processing, you need to install the datasets and tokenizers libraries:

pip install datasets

pip install tokenizers

Hugging Face datasets 3.5.0 and tokenizers 0.21.1 will be used in the tutorial below. If you installed a Hugging Face library successfully and want to check its version, you can run the code below:

import datasets
print(datasets.__version__)

You can use the syntax above for other Hugging Face libraries; you just need to change the library name.
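For example, to check the tokenizers version (a minimal sketch, assuming the library is already installed):

import tokenizers
print(tokenizers.__version__)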

The Transformers library is required to run the Transformer models:

pip install transformers

Transformers 4.51.0 will be used in the tutorial below.

You should install PyTorch as well:

pip install torch

PyTorch 2.6.0 will be used for the course below. If you want to run the code below without any installation, you can use Google Colab as well. However, some of the preinstalled libraries there, like datasets, may not be up to date and may not work as expected; you are likely to get ImportErrors.

Since they use older versions of the libraries, examples generated by AI tools may also be out of date. Therefore, errors may occur if you run them locally.

Device Types and Hugging Face

PyTorch will be used for the Hugging Face examples below. The compute platform for PyTorch depends on your device's operating system and hardware. macOS is used for the examples below, so "mps" will be used. If your device has an NVIDIA GPU, you should use "cuda". You can run the commands below to check your compute platform:

import torch
print(torch.backends.cuda.is_built())
print(torch.backends.mps.is_available())
print(torch.backends.cpu.get_cpu_capability())

If you still need help, you can visit the PyTorch website. Keep in mind that training models can take a long time, and the device you use largely determines how long training takes.
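As a rough sketch, you can also pick the compute device at runtime and fall back to the CPU when no accelerator is available:

import torch

# Pick the best available device: CUDA, then MPS, then CPU
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
print(device)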

Hugging Face Datasets

Hugging Face Datasets is a library for Natural Language Processing tasks. If you haven't installed it yet, run the command below:

pip install datasets

You can browse the full list of datasets on the Hugging Face Hub. You can use the "Tasks", "Libraries", and "Languages" categories to filter your dataset search. If you need a dataset for a specific task, you can choose the task and see your dataset options. You can read the dataset card to get more detailed information about a dataset. For example, you can go to the Tasks category and choose the "Question Answering" task under the Natural Language Processing title.

How to load a Hugging Face dataset

If you want to use a dataset locally, you need to load it. Go to the dataset's page and click "</> Use this dataset". You can copy and paste the generated code, or you can write it manually:

from datasets import load_dataset
ds = load_dataset("dataset_name")

Keep in mind that some datasets have multiple versions (configurations); load the version you want.

Running NLP models can take a long time. You can select a part of the dataset instead of a full dataset:

small_dataset = ds["train"].shuffle(seed=42).select(range(300))

Your dataset must match what the model expects. Reading dataset cards can be very helpful. You can find the links (and credits) in the comments or below the examples.
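Before feeding a dataset to a model, it can help to inspect its structure. A minimal sketch ("dataset_name" is a placeholder, as above):

from datasets import load_dataset

ds = load_dataset("dataset_name")  # replace with a real dataset name
print(ds)                    # available splits and number of rows
print(ds["train"].features)  # column names and types
print(ds["train"][0])        # the first example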

How to split Hugging Face dataset

You can split your data into training and testing sets using different methods:

ds = load_dataset("dataset_name", streaming=True, split="train")
ds = load_dataset("dataset_name", split="test")
dataset = load_dataset("dataset_name", split="test[:5%]")
dataset = load_dataset("dataset_name", split="train[:1000]")

Hugging Face Tokenizers

As mentioned earlier, you need to tokenize your data for the model. Hugging Face offers different types of tokenizers. If you haven't installed the tokenizers library yet, run the command below:

pip install tokenizers

You need to import and load your tokenizer:

from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

You can also use auto classes to tokenize your data:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

The tokenizer and model should always be from the same checkpoint. Depending on your task, you may need to decode your output using tokenizers.

Tokenizers return input_ids, attention_mask, and token_type_ids (the latter is not present for DistilBERT). Input IDs are lists of token IDs. The attention mask indicates whether a token is real text or padding. Some tokenizers may return additional outputs. Keep in mind that different models require different inputs.
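A small example of what a tokenizer returns for a single sentence (the exact keys depend on the checkpoint):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoding = tokenizer("Hugging Face makes NLP easier.")
print(encoding.keys())        # input_ids, token_type_ids, attention_mask for BERT
print(encoding["input_ids"])
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))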

Hugging Face Models

As mentioned earlier, Hugging Face has great pretrained models. You can use them instead of training your model from scratch. If you haven't installed the transformers library yet, run the command below:

pip install transformers

Transformers

The Transformer is a cutting-edge deep learning architecture that enables machines to understand language contextually. Transformers consider all words in a sentence simultaneously, allowing models to grasp meaning, relationships, and nuance more effectively. This approach has significantly advanced the field of Natural Language Processing and enabled more accurate translation, question answering, and text generation. You should find the best model for your task and dataset; the process is similar to finding a dataset. You can choose a task and select a suitable model, read the model card to get more detailed information, and click the "Use this model" button to see your options.

How to use a Hugging Face model

You can specify a task and use the pipeline() function:

from transformers import pipeline
qa_model = pipeline("question-answering")

The pipeline function carries out all stages of the process. You don't have to specify a model, as in the example above, although using a pipeline without a model name and revision is not recommended in production. You can also specify a model explicitly:

from transformers import pipeline
question_answerer = pipeline("question-answering", model='distilbert-base-cased-distilled-squad')

You can also load a model directly:

from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained("google-bert/bert-base-uncased")

To train or fine-tune a pretrained model, you need to load it directly.

You can also use auto classes. You should find the auto classes for the model you want to use and specify the model:

from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")

If you want to learn more about your model, you can check its configuration:

print(model.config)

Reading model cards can be very helpful. You can find the links (and credits) in the comments or below the examples. Unfortunately, some of the model cards lack sufficient information.

Tasks

Natural Language Processing Models

Table Question Answering

Table Question Answering using a pipeline

We will use the pipeline for Table Question Answering. We will create simple synthetic data. The table will display the names of the products and the number of products.

from transformers import pipeline
import pandas as pd

# prepare table
data = {"Products": ["jeans", "jackets", "shirts"], "Number of products": ["87", "53", "69"]}
table = pd.DataFrame.from_dict(data)

#prepare your question
question = "how many shirts are there?"

# pipeline model
tqa = pipeline(task="table-question-answering", model="google/tapas-large-finetuned-wtq", aggregator="SUM")

# result
print(tqa(table=table, query=question))

{'answer': 'SUM > 69', 'coordinates': [(2, 1)], 'cells': ['69'], 'aggregator': 'SUM'}

If we change the data and add a second row for shirts, the answer changes:

# new data
data = {"Products": ["jeans", "jackets", "shirts", "shirts"], "Number of products": ["87", "53", "69", "21"]}
table = pd.DataFrame.from_dict(data)
print(tqa(table=table, query=question))

The answer:

{'answer': 'COUNT > 69, 21', 'coordinates': [(2, 1), (3, 1)], 'cells': ['69', '21'], 'aggregator': 'COUNT'}

We can get the total number of shirts:

z = tqa(table=table, query=question)["cells"]
x= []
for i in z:
    x.append(int(i))
print(sum(x))

The answer is 90.

*google/tapas-large-finetuned-wtq model from Hugging Face — licensed under the Apache 2.0 License.

Table Question Answering Model

You can load the Table Question Answering Model directly. We will use the same data.

from transformers import TapasTokenizer, TapasForQuestionAnswering
import pandas as pd
import torch

# Load model and tokenizer
model_name = "google/tapas-base-finetuned-wtq"
tokenizer = TapasTokenizer.from_pretrained(model_name)
model = TapasForQuestionAnswering.from_pretrained(model_name)

# Example table
data = {"Products": ["jeans", "jackets", "shirts"], "Number of products": ["87", "53", "69"]}
table = pd.DataFrame.from_dict(data)

# Question
question = "how many shirts are there?"
# Tokenize inputs
inputs = tokenizer(table=table, queries=[question], return_tensors="pt")

# Forward pass
with torch.no_grad():
    outputs = model(**inputs)

# Decode predicted answer
logits = outputs.logits
logits_agg = outputs.logits_aggregation

# Get the most probable cell answer
predicted_answer_coordinates, predicted_aggregation_indices = tokenizer.convert_logits_to_predictions(inputs, outputs.logits, outputs.logits_aggregation)
# Extract the answer from the table
answers = []
for coordinates in predicted_answer_coordinates:
    if not coordinates:
        answers.append("No answer found.")
    else:
        cell_values = [table.iat[row, column] for row, column in coordinates]
        answers.append(", ".join(cell_values))

# Print the result
print("Answer:", answers[0])

Answer: 69

*google/tapas-base-finetuned-wtq model from Hugging Face — based on code from https://huggingface.co/google/tapas-large-finetuned-wtq (Apache 2.0)

The model's output can be a bit complex, so let's analyze it step by step. logits are the raw output scores. logits_aggregation contains the scores of the numeric aggregation operations; the model can perform basic operations like SUM on the table data.
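To see this, you can print the shapes of the two outputs (a small sketch that assumes the model, tokenizer, table, and question from the direct-loading example above are still defined):

# Assumes model, tokenizer, table, and question from the example above
inputs = tokenizer(table=table, queries=[question], return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits.shape)              # one score per token
print(outputs.logits_aggregation.shape)  # one score per aggregation operator (NONE, SUM, AVERAGE, COUNT)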

Zero-shot classification

Zero-shot classification using a pipeline

Zero-shot classification is used to predict the class of unknown data. Zero-shot classification models require text and labels. Let's see an example of Zero-shot classification using a pipeline:

from transformers import pipeline
classifier = pipeline("zero-shot-classification")
print(classifier( "Is this a good time to buy gold?", candidate_labels=["education", "politics", "business", "finance"] ))

{'sequence': 'Is this a good time to buy gold?', 'labels': ['finance', 'business', 'education', 'politics'], 'scores': [0.5152193307876587, 0.38664010167121887, 0.057615164667367935, 0.040525417774915695]}

The results are shown in descending order of score; the "finance" label has the highest score.

Zero-Shot classification model

You can load the model directly:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch.nn.functional as F

# Load model and tokenizer
model_name = "facebook/bart-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Input sentence
sequence = "The pi is the ratio of the circumference of any circle to the diameter of that circle"

# Candidate labels
labels = ["education", "psychology", "sports", "finance", "math"]

# Create NLI-style premise-hypothesis pairs
premise = sequence
hypotheses = [f"This text is about {label}." for label in labels]

# Tokenize and get model outputs for each hypothesis
inputs = tokenizer([premise]*len(hypotheses), hypotheses, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to probabilities (softmax over entailment class)
entailment_logits = logits[:, 2]
probabilities = F.softmax(entailment_logits, dim=0)
print(probabilities)

# Print results
for label, score in zip(labels, probabilities):
    print(f"{label}: {score:.4f}")

tensor([0.0125, 0.0091, 0.0089, 0.0109, 0.9586])
education: 0.0125
psychology: 0.0091
sports: 0.0089
finance: 0.0109
math: 0.9586

*facebook/bart-large-mnli model from Hugging Face — licensed under the MIT License.

The BART MNLI model takes a bit more setup, so let's break it down. We need to get model outputs for each hypothesis. There are 5 labels, so the premise (the "sequence") must be provided five times. The model returns logits for contradiction, neutral, and entailment; we are interested in entailment, whose index is 2. That's why we selected the logits at index 2.
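You can confirm the label order for this checkpoint from its configuration; a quick sketch:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("facebook/bart-large-mnli")
print(model.config.id2label)  # expected: {0: 'contradiction', 1: 'neutral', 2: 'entailment'}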

What's softmax?

softmax in PyTorch is applied to all slices along dim, and will re-scale them so that the elements lie in the range [0, 1] and sum to 1.
The sum of the scores [0.0125 + 0.0091 + 0.0089 + 0.0109 + 0.9586] in the example above is 1 and the "math" label has the highest score. For more information about softmax, visit the PyTorch docs.
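A minimal standalone example of softmax:

import torch
import torch.nn.functional as F

scores = torch.tensor([1.0, 2.0, 3.0])
probs = F.softmax(scores, dim=0)
print(probs)        # tensor([0.0900, 0.2447, 0.6652])
print(probs.sum())  # tensor(1.)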

Fill-Mask

Fill-Mask task using a pipeline

Fill-mask models predict the masked word (or words) in a sentence.

from transformers import pipeline
unmasker = pipeline("fill-mask")
print(unmasker("The most popular sport in the world is <mask>.", top_k=2))

[{'score': 0.11612111330032349, 'token': 4191, 'token_str': ' soccer', 'sequence': 'The most popular sport in the world is soccer.'},
{'score': 0.10927936434745789, 'token': 5630, 'token_str': ' cricket', 'sequence': 'The most popular sport in the world is cricket.'}]

Fill-Mask model

You can also load the model directly:

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained( "google-bert/bert-base-uncased" )
model = AutoModelForMaskedLM.from_pretrained(
"google-bert/bert-base-uncased", torch_dtype=torch.float16, device_map="auto", attn_implementation="sdpa" )

#See the device type explanation below
inputs = tokenizer("The most popular sport in the world is [MASK].", return_tensors="pt").to("mps")

with torch.no_grad():
    outputs = model(**inputs)
predictions = outputs.logits
masked_index = torch.where(inputs['input_ids'] == tokenizer.mask_token_id)[1]
predicted_token_id = predictions[0, masked_index].argmax(dim=-1)
prediction = tokenizer.decode(predicted_token_id)
print(f"The most popular sport in the world is {prediction}.")

The most popular sport in the world is football.

*google-bert/bert-base-uncased model from Hugging Face — licensed under the Apache 2.0 License.

You can use "mps" for macOS and "cuda" for devices compatible with CUDA. You can also remove it.

What's argmax?

argmax returns the index of the maximum value of all elements in the input tensor.
In the example above, it returns the index of the most likely token, which is then decoded. For more information about argmax, visit the PyTorch docs.
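A minimal standalone example of argmax:

import torch

t = torch.tensor([0.1, 2.5, 0.7])
print(torch.argmax(t))  # tensor(1), the index of the largest value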

Question Answering

Question Answering pipeline

There are different types of Question Answering (QA) tasks. If you use a pipeline for QA without specifying a model, the distilbert/distilbert-base-cased-distilled-squad model is used. It is used for extractive QA tasks. In other words, the model extracts the answer from a given text. Let's see an example of an extractive QA task using a pipeline:

from transformers import pipeline
question_answerer = pipeline("question-answering")
print(question_answerer(
question="Where does Julia live?",
context="Julia is 40 years old. She lives in London and she works as a nurse." ))

{'score': 0.9954689741134644, 'start': 36, 'end': 42, 'answer': 'London'}

Question Answering model

You can load the QA model directly:

from transformers import AutoTokenizer, BertForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained("deepset/bert-base-cased-squad2")
model = BertForQuestionAnswering.from_pretrained("deepset/bert-base-cased-squad2")

#question, text
question, text = "Where does Julia live?", "Julia is 40 years old. She lives in London and she works as a nurse."

#tokenize question and text
inputs = tokenizer(question, text, return_tensors="pt")
with torch.no_grad():
   outputs = model(**inputs)
answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()
predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]

result = tokenizer.decode(predict_answer_tokens, skip_special_tokens=True)
print(result)

London

*deepset/bert-base-cased-squad2 model from Hugging Face — licensed under the CC BY 4.0 License.

Translation

Translation using a pipeline

Our model will translate a sentence from French to English. There are models for other language pairs as well.

from transformers import pipeline
translator = pipeline("translation", "Helsinki-NLP/opus-mt-fr-en")
print(translator("C'est un beau roman."))

Translation model

We will use a model from the same family, Helsinki-NLP/opus-mt-en-fr, but this time it will translate a sentence from English to French:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-fr")

text = "The food is very delicious."
inputs = tokenizer(text, return_tensors="pt").input_ids
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-fr")
outputs = model.generate(inputs, max_new_tokens=40, do_sample=True, top_k=30, top_p=0.95)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

*"Helsinki-NLP/opus-mt-en-fr”model from Hugging Face — licensed under the Apache 2.0 License.

Summary

Summary using a pipeline

You can use summarization models to summarize a text:

from transformers import pipeline
from datasets import load_dataset

ds = load_dataset("dataset_name")
text = ds["train"][0]["context"]
classifier = pipeline("summarization", max_length=100)
print(classifier(text))

Summary model

We will summarize an article from the Hugging Face dataset abisee/cnn_dailymail. You can also use your own paragraph.

from transformers import AutoTokenizer, BartForConditionalGeneration

checkpoint = "facebook/bart-large-cnn"
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

from datasets import load_dataset
ds = load_dataset("abisee/cnn_dailymail", "1.0.0")
text = ds["train"][0]["article"]
inputs = tokenizer(text, max_length=100, truncation=True, return_tensors="pt")

# Generate Summary
summary_ids = model.generate(inputs["input_ids"], max_length=180, min_length=40, do_sample=False, no_repeat_ngram_size=3)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])

Harry Potter star Daniel Radcliffe turns 18 on Monday. He gains access to a reported $41.1 million fortune. Radcliffe says he has no plans to fritter his cash away on fast cars.

*"abisee/cnn_dailymail", "1.0.0" dataset from Hugging Face — licensed under the Apache 2.0 License.
*"facebook/bart-large-cnn" model from Hugging Face — licensed under the MIT License.

You can control how the model generates a summary. For example, you might set the minimum and maximum length of the output, as shown above.
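Other generation parameters work the same way. A hedged sketch that assumes the model, tokenizer, and inputs from the summarization example above, switching to beam search:

# Assumes model, tokenizer, and inputs from the summarization example above
summary_ids = model.generate(
    inputs["input_ids"],
    max_length=180,
    min_length=40,
    num_beams=4,          # beam search instead of greedy decoding
    length_penalty=2.0,   # > 0.0 promotes longer sequences during beam search
    no_repeat_ngram_size=3,
    early_stopping=True)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])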

Token Classification

Token Classification using a pipeline

Token classification models are used to identify entities in a text. What type of entities can a token classification model identify? It depends on the model. For example, dslim/bert-base-NER can identify four types of entities: location (LOC), organizations (ORG), person (PER), and miscellaneous (MISC).

from transformers import pipeline
classifier = pipeline("token-classification")
z = "I'm Alicia and I live in Milano."
d = classifier(z)
print(d)
for token in d:
    print(token["word"], token["entity"])

[{'entity': 'B-PER', 'score': np.float32(0.9941089), 'index': 4, 'word': 'Alicia', 'start': 4, 'end': 10},
{'entity': 'B-LOC', 'score': np.float32(0.9950382), 'index': 9, 'word': 'Milano', 'start': 25, 'end': 31}]
Alicia B-PER
Milano B-LOC

Token Classification Model

We can load the token classification model directly. We will use the same text with a different model:

import torch
from transformers import BertTokenizerFast, BertForTokenClassification

# Load model and tokenizer
model_name = "dslim/bert-base-NER"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForTokenClassification.from_pretrained(model_name)

# Sample input
text = "I'm Alicia and I live in Milano."
# Tokenize
tokens = tokenizer(text, return_tensors="pt", truncation=True, is_split_into_words=False)
# Forward pass
with torch.no_grad():
    outputs = model(**tokens)
    logits = outputs.logits # shape: (batch_size, seq_len, num_labels)
# Get predicted class indices
predictions = torch.argmax(logits, dim=2)

# Convert IDs to label names
id2label = model.config.id2label

# Token IDs
input_ids = tokens["input_ids"][0]
predicted_labels = [id2label[label_id.item()] for label_id in predictions[0]]
print(predicted_labels)

['O', 'O', 'O', 'O', 'B-PER', 'O', 'O', 'O', 'O', 'B-LOC', 'O', 'O']

*"dslim/bert-base-NER" model from Hugging Face — licensed under the MIT License.

B refers to the beginning of an entity: B-PER marks the beginning of a person's name right after another person's name, and B-LOC marks the beginning of a location right after another location.
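To see which word received which tag, you can pair the tokens with the predicted labels (a small sketch assuming the tokenizer, tokens, and predicted_labels variables from the example above):

# Assumes tokenizer, tokens, and predicted_labels from the example above
token_strings = tokenizer.convert_ids_to_tokens(tokens["input_ids"][0])
for token, label in zip(token_strings, predicted_labels):
    if label != "O":
        print(token, label)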

For more detailed information about the dslim/bert-base-NER model, please visit its model card on Hugging Face.

Text Classification

Text Classification using a pipeline

Text classification models are designed to categorize text into predefined labels. They are widely used in tasks like sentiment analysis, spam detection, and topic labeling. In the example below, the model will determine whether a given text expresses a positive or negative sentiment.

from transformers import pipeline
text = "Your dog is super cute."
pipe = pipeline("text-classification")
result = pipe(text)
print(result[0]["label"])

POSITIVE

Text Classification Model

We will load the same model directly:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased-finetuned-sst-2-english")
model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased-finetuned-sst-2-english")

inputs = tokenizer("Your dog is super cute.", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_class_id = logits.argmax().item()
print(model.config.id2label[predicted_class_id])

POSITIVE

*"distilbert-base-uncased-finetuned-sst-2-english" model from Hugging Face (Apache 2.0).

We used a simple text, but you can use the model for more complicated texts like reviews as well.
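A pipeline also accepts a list of texts, so you can classify several reviews at once; a small sketch:

from transformers import pipeline

pipe = pipeline("text-classification")
reviews = ["The delivery was late and the box was damaged.",
           "Great quality, I would definitely order again."]
for review, result in zip(reviews, pipe(reviews)):
    print(review, "->", result["label"])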

How to evaluate Hugging Face models

There are several ways to evaluate your Hugging Face models. Hugging Face provides a variety of evaluation metrics tailored to different tasks. You can use the evaluate library to assess the performance of Hugging Face models and datasets. To get started, you need to install the evaluate library:

pip install evaluate

Different models return different data types, so it's important to choose the right evaluation metric for each model. You can find a full list of available metrics in the Hugging Face documentation. In this section, you will learn how to use the Hugging Face evaluate library. We will reuse some of the earlier examples to gain a better understanding of how Hugging Face models work. Keep in mind that evaluating a full dataset may require slightly different syntax compared to evaluating a single sample. Feel free to revisit the earlier examples if you need a refresher on how the models work. We will be working with the stanfordnlp/sst2 dataset. Let's evaluate the sentiment analysis model using the following metrics: accuracy, F1 score, precision, and recall:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased-finetuned-sst-2-english")
model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased-finetuned-sst-2-english")
model.eval()

from datasets import load_dataset
ds = load_dataset("stanfordnlp/sst2")
sent = ds["validation"]["sentence"][:100]
labels = ds["validation"]["label"][:100]

inputs = tokenizer(sent, truncation=True, padding=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

results = [i.argmax().item() for i in logits]

import evaluate
accuracy = evaluate.load("accuracy")
result = accuracy.compute(predictions=results, references=labels)
print("Accuracy:", result)

from sklearn.metrics import precision_recall_fscore_support
precision, recall, f1, _ = precision_recall_fscore_support(labels, results, average='binary')
print("precision, recall, f1: ", precision, recall, f1)

Accuracy: {'accuracy': 0.94}
precision, recall, f1: 0.9259259259259259 0.9615384615384616 0.9433962264150944

We evaluated the AutoModelForSequenceClassification model using the "distilbert/distilbert-base-uncased-finetuned-sst-2-english" checkpoint. The evaluation included accuracy, precision, recall, and F1 scores. Accuracy is computed with the Hugging Face evaluate library. Alternatively, you can use Scikit-learn's built-in metrics to evaluate Hugging Face models; in the example above, we used Scikit-learn to compute precision, recall, and the F1 score.
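Precision, recall, and the F1 score can also be computed with the evaluate library instead of Scikit-learn; a sketch that reuses the results (predictions) and labels (references) lists from the example above:

import evaluate

# Assumes `results` (predictions) and `labels` (references) from the example above
precision = evaluate.load("precision")
recall = evaluate.load("recall")
f1 = evaluate.load("f1")
print(precision.compute(predictions=results, references=labels))
print(recall.compute(predictions=results, references=labels))
print(f1.compute(predictions=results, references=labels))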

Seqeval is a Python library designed for evaluating sequence labeling tasks such as Named Entity Recognition (NER) and Part-of-Speech (POS) tagging. It calculates precision, recall, and F1 score. You need to install the seqeval library to use its metrics:

pip install seqeval

We will evaluate the token classification model example above using the "tomaarsen/conll2003" dataset. We will use the seqeval metric's classification report for the evaluation:

import torch
from transformers import BertTokenizerFast, BertForTokenClassification
from datasets import load_dataset
from seqeval.metrics import classification_report
import datasets

# Load model/tokenizer
model_name = "dslim/bert-base-NER"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForTokenClassification.from_pretrained(model_name)
model.eval()

# Load dataset
dataset = load_dataset("tomaarsen/conll2003")
texts = dataset["validation"]["tokens"][:100]
true_tags = dataset["validation"]["ner_tags"][:100]
label_names = dataset["validation"].features["ner_tags"].feature.names
id2label = model.config.id2label
predicted_labels = []
true_labels = []

#Tokenize the Dataset
for tokens, tag_ids in zip(texts, true_tags):
    encoding = tokenizer(tokens, is_split_into_words=True, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**encoding)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=2)
    word_ids = encoding.word_ids()
    preds = []
    trues = []
    prev_word_id = None
    for idx, word_id in enumerate(word_ids):
        if word_id is None or word_id == prev_word_id:
            continue # Skip special tokens and subwords
        pred_id = predictions[0][idx].item()
        preds.append(id2label[pred_id])
        trues.append(label_names[tag_ids[word_id]])
        prev_word_id = word_id
    predicted_labels.append(preds)
    true_labels.append(trues)

# Evaluation
print(classification_report(true_labels, predicted_labels))

*The dataset used in the example above may cause errors in Google Colab. You can either test the code with a similar dataset or run it in your local environment.

As mentioned earlier, different models require different evaluation metrics. Question answering models have more complex syntax, which can make evaluation more challenging. We will use the rajpurkar/squad_v2 dataset to evaluate the question answering model. Let's evaluate the question answering model example above:

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, BertForQuestionAnswering
import evaluate
import torch

# Load the SQuAD validation set
dataset = load_dataset("rajpurkar/squad_v2", split="validation[:50]")

# Load pretrained model and tokenizer
model_name = "deepset/bert-base-cased-squad2"
tokenizer = AutoTokenizer.from_pretrained("deepset/bert-base-cased-squad2")
model = BertForQuestionAnswering.from_pretrained("deepset/bert-base-cased-squad2")
model.eval()

# Tokenize the question and the text
def preprocess(example):
    return tokenizer( example["question"], example["context"], truncation=True, padding="max_length", max_length=384, return_tensors="pt" )
tokenized_dataset = dataset.map(preprocess, batched=True)
answers = []
for example in tokenized_dataset:
    input_ids = example["input_ids"]
    attention_mask = example["attention_mask"]

# Convert to tensors
    inputs = {
        "input_ids": torch.tensor([input_ids]),
        "attention_mask": torch.tensor([attention_mask]) }
    with torch.no_grad():
        outputs = model(**inputs)
    start_idx = torch.argmax(outputs.start_logits)
    end_idx = torch.argmax(outputs.end_logits)

# Decode answer span
    answer_ids = inputs["input_ids"][0][start_idx:end_idx + 1]
    answer = tokenizer.decode(answer_ids, skip_special_tokens=True)
    answers.append(answer)

import evaluate
metric = evaluate.load("squad_v2")

# Format predictions and references
pred = [ {"id": dataset[i]["id"], "prediction_text": answers[i], 'no_answer_probability': 0.}
    for i in range(len(answers))]
references = [
{"id": dataset[i]["id"], "answers": dataset[i]["answers"]}
    for i in range(len(answers)) ]
results = metric.compute(predictions=pred, references=references)
print(results)

{'exact': 46.0, 'f1': 48.0407876230661, 'total': 50, 'HasAns_exact': 28.571428571428573, 'HasAns_f1': 33.430446721585966, 'HasAns_total': 21, 'NoAns_exact': 58.62068965517241, 'NoAns_f1': 58.62068965517241, 'NoAns_total': 29, 'best_exact': 64.0, 'best_exact_thresh': 0.0, 'best_f1': 64.22222222222223, 'best_f1_thresh': 0.0}

We evaluated the model using the SQuAD v2 metric.


What's the difference between SQuAD and SQuAD v2?

The squad metric wraps the official scoring script for version 1 of the Stanford Question Answering Dataset (SQuAD); you can find more details in the official SQuAD documentation. The squad_v2 metric wraps the official scoring script for version 2 of the dataset; for more information, please refer to the official SQuAD 2.0 documentation. To perform well on SQuAD 2.0, systems must not only provide answers when they are supported by the paragraph but also correctly identify when no answer is available and refrain from answering.

How to fine-tune a pretrained Hugging Face model

We explored various Hugging Face datasets and models and evaluated the performance of those models. Next, we'll learn how to fine-tune a pretrained model. Since you're now familiar with the examples above, we'll build on them for the fine-tuning process. Use the comments as a guide; they explain each section and help you track what's happening in the code. Since fine-tuning a Hugging Face model can take a significant amount of time, we'll shuffle the dataset and use a subset of the samples to reduce training time. We will use the Hugging Face Trainer and TrainingArguments classes. Hugging Face models for different tasks require different fine-tuning syntax, since each task uses a different model class with its own inputs and outputs. We will fine-tune a sequence classification, a token classification, and a question answering model. You don't need to evaluate your model, but the examples below also show how to do so after training. If the model doesn't perform well, you can make improvements.

We will use the AutoModelForSequenceClassification model with the distilbert/distilbert-base-uncased-finetuned-sst-2-english checkpoint to fine-tune the model. We will use the stanfordnlp/imdb dataset. We'll shuffle the dataset and select a smaller sample to speed up the process.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from sklearn.metrics import precision_recall_fscore_support
import numpy
from datasets import load_dataset
from datasets import Dataset

#1. Import and Load Model + Tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased-finetuned-sst-2-english")
model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased-finetuned-sst-2-english")

#2. Prepare Your Dataset
ds_train = load_dataset("stanfordnlp/imdb", split="train").shuffle(seed=42).select(range(100))
ds_test = load_dataset("stanfordnlp/imdb", split="test").shuffle(seed=42).select(range(30))
sent = ds_train["text"]
labels = ds_train["label"]
test_sent = ds_test["text"]
test_labels = ds_test["label"]
train_dataset = Dataset.from_dict({"text": sent, "label": labels})
val_dataset = Dataset.from_dict({"text": test_sent, "label": test_labels})

#3. Tokenize the Dataset
def tokenize(batch):
      return tokenizer(batch["text"], truncation=True, padding=True, return_tensors="pt")
train_dataset = train_dataset.map(tokenize, batched=True)
val_dataset = val_dataset.map(tokenize, batched=True)
train_dataset = train_dataset.remove_columns("text").with_format("torch")
val_dataset = val_dataset.remove_columns("text").with_format("torch")

#4. Define Metrics
from sklearn.metrics import accuracy_score, f1_score, precision_recall_fscore_support

#For a deeper understanding of argmax, see the explanation below.
def compute_metrics(pred):
      preds = pred.predictions.argmax(-1)
      labels = pred.label_ids
      return { "accuracy": accuracy_score(labels, preds), "f1": f1_score(labels, preds) }

#5. Set Up Training Arguments
from transformers import TrainingArguments
training_args = TrainingArguments(
        logging_strategy="epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        num_train_epochs=3,
        weight_decay=0.01 )

#6. Initialize Trainer
from transformers import Trainer
trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics)

#7. Train the Model
trainer.train()

#8. Evaluate After Training
print(trainer.evaluate())

{'eval_loss': 0.09679087996482849, 'eval_accuracy': 0.9666666666666667, 'eval_f1': 0.967741935483871, 'eval_runtime': 0.7008, 'eval_samples_per_second': 42.81, 'eval_steps_per_second': 5.708, 'epoch': 3.0}

argmax(-1) takes the index of the highest value (argmax) along the last dimension of the tensor.
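A tiny standalone example of argmax(-1) on a 2-D tensor of prediction scores:

import torch

preds = torch.tensor([[0.1, 0.9],
                      [0.8, 0.2]])
print(preds.argmax(-1))  # tensor([1, 0]): one class index per row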

We will fine-tune a token classification model using BertForTokenClassification. The column names of the dataset below are 'id', 'document_id', 'sentence_id', 'tokens', 'pos_tags', 'chunk_tags', and 'ner_tags'.

import torch
from transformers import BertTokenizerFast, BertForTokenClassification
from datasets import load_dataset
from seqeval.metrics import classification_report
import datasets

# Load model/tokenizer
model_name = "dslim/bert-base-NER"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForTokenClassification.from_pretrained(model_name)
id2label = model.config.id2label
label2id = {v: k for k, v in id2label.items()}

# Load dataset
dataset = load_dataset("tomaarsen/conll2003") # Disable caching
dataset2 = load_dataset("tomaarsen/conll2003", split="train").select(range(200))
dataset3 = load_dataset("tomaarsen/conll2003", split="validation").select(range(20))

#Tokenize the Dataset
def preprocess_function(examples):
      tokenized_inputs = tokenizer(examples["tokens"], truncation=True, padding=True, is_split_into_words=True,
return_tensors="pt")
      labels = []
      for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
          if word_idx is None:
            label_ids.append(-100)
          elif word_idx != previous_word_idx:
            label_ids.append(label[word_idx])
          else:
            label_ids.append(-100)
          previous_word_idx = word_idx
        labels.append(label_ids)
      tokenized_inputs["labels"] = labels
      return tokenized_inputs

predicted_labels = []
true_labels = []
print(dataset2.column_names)
train_data = dataset2.map(preprocess_function, batched=True, remove_columns=dataset2.column_names)
test_data = dataset3.map(preprocess_function, batched=True, remove_columns=dataset3.column_names)

import numpy as np
def align_predictions(predictions, label_ids):
      preds = np.argmax(predictions, axis=2)
      batch_size, seq_len = preds.shape
      true_labels = []
      true_preds = []
      for i in range(batch_size):
        pred_tags = []
        true_tags = []
        for j in range(seq_len):
          if label_ids[i][j] != -100:
            true_tags.append(id2label[label_ids[i][j]])
            pred_tags.append(id2label[preds[i][j]])
        true_labels.append(true_tags)
        true_preds.append(pred_tags)
      return true_preds, true_labels

from seqeval.metrics import classification_report, f1_score, precision_score, recall_score
import evaluate
seqeval = evaluate.load('seqeval')

def compute_metrics(p):
      predictions, labels = p.predictions, p.label_ids
      preds, trues = align_predictions(predictions, labels)
      results = seqeval.compute(predictions=preds, references=trues)
      print(results)
      return {
        "precision": precision_score(trues, preds),
        "recall": recall_score(trues, preds),
        "f1": f1_score(trues, preds) }

from transformers import TrainingArguments
training_args = TrainingArguments(
      output_dir="my_awesome_qa_model",
      learning_rate=2e-5,
      per_device_train_batch_size=16,
      per_device_eval_batch_size=16,
      num_train_epochs=3,
      weight_decay=0.01 )

from transformers import Trainer
trainer = Trainer(
      model,
      training_args,
      train_dataset=train_data,
      eval_dataset=test_data,
      processing_class=tokenizer )
trainer.train()
results = trainer.evaluate()
print("\nEvaluation results:", results)

# Optional: print detailed classification report
predictions = trainer.predict(test_data)
pred_tags, true_tags = align_predictions(predictions.predictions, predictions.label_ids)
print("\nDetailed Classification Report:\n")
print(classification_report(true_tags, pred_tags))

Evaluation results: {'eval_loss': 0.47007593512535095, 'eval_runtime': 0.2464, 'eval_samples_per_second': 81.158, 'eval_steps_per_second': 8.116, 'epoch': 3.0}

Let's fine-tune a question answering model. While fine-tuning a question-answering model can be time-consuming and complex, the process follows the same pattern as with previous models. If you'd like to refresh your memory on how the model works, refer to the example above. We will use the BertForQuestionAnswering model with the deepset/bert-base-cased-squad2 checkpoint.

from datasets import load_dataset, Dataset
from transformers import ( AutoTokenizer, BertForQuestionAnswering, TrainingArguments, Trainer, )
import collections
import evaluate
import numpy as np

#Load small SQuAD v2 dataset
dataset = load_dataset("rajpurkar/squad_v2")
dataset2 = load_dataset("rajpurkar/squad_v2", split="train").select(range(50))
dataset3 = load_dataset("rajpurkar/squad_v2", split="validation").select(range(20))

model_name = "deepset/bert-base-cased-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BertForQuestionAnswering.from_pretrained(model_name)
context = dataset["train"]["context"][:50]
question = dataset["train"]["question"][:50]
context_v = dataset["validation"]["context"][:20]
question_v = dataset["validation"]["question"][:20]
answers_v = dataset["validation"]["answers"][:20]

inputs = tokenizer(
      question,
      context,
      max_length=100,
      truncation="only_second",
      stride=50,
      return_overflowing_tokens=True,
      return_offsets_mapping=True )
answers = dataset["train"][:50]["answers"]
start_positions = []
end_positions = []
for i, offset in enumerate(inputs["offset_mapping"]):
      sample_idx = inputs["overflow_to_sample_mapping"][i]
      answer = answers[sample_idx]
      start_char = answer["answer_start"][0]
      end_char = answer["answer_start"][0] + len(answer["text"][0])
      sequence_ids = inputs.sequence_ids(i)

#Find the start and end of the context
      idx = 0
      while sequence_ids[idx] != 1:
            idx += 1
      context_start = idx
      while sequence_ids[idx] == 1:
            idx += 1
      context_end = idx - 1

#If the answer is not fully inside the context, label is (0, 0). Otherwise it's the start and end token positions.
      if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
      else:
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                  idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                  idx -= 1
            end_positions.append(idx + 1)

max_length = 384
stride = 128

def preprocess_training_examples(examples):
      questions = [q.strip() for q in examples["question"]]
      inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length" )

      offset_mapping = inputs.pop("offset_mapping")
      sample_map = inputs.pop("overflow_to_sample_mapping")
      answers = examples["answers"]
      start_positions = []
      end_positions = []

      for i, offset in enumerate(offset_mapping):
            sample_idx = sample_map[i]
            answer = answers[sample_idx]
            start_char = answer["answer_start"][0]
            end_char = answer["answer_start"][0] + len(answer["text"][0])
            sequence_ids = inputs.sequence_ids(i)

#Find the start and end of the context
            idx = 0
            while sequence_ids[idx] != 1:
                  idx += 1
            context_start = idx
            while sequence_ids[idx] == 1:
                  idx += 1
            context_end = idx - 1

#If the answer is not fully inside the context, label is (0, 0). Otherwise it's the start and end token positions.
            if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
                  start_positions.append(0)
                  end_positions.append(0)
            else:
                  idx = context_start
                  while idx <= context_end and offset[idx][0] <= start_char:
                        idx += 1
                  start_positions.append(idx - 1)

                  idx = context_end
                  while idx >= context_start and offset[idx][1] >= end_char:
                        idx -= 1
                  end_positions.append(idx + 1)

      inputs["start_positions"] = start_positions
      inputs["end_positions"] = end_positions
      return inputs
train_dataset = dataset2.map(
      preprocess_training_examples,
      batched=True,
      remove_columns=dataset2.column_names )
print(len(dataset2), len(train_dataset))

def preprocess_validation_examples(examples):
      questions = [q.strip() for q in examples["question"]]
      inputs = tokenizer(
            questions,
            examples["context"],
            max_length=max_length,
            truncation="only_second",
            stride=stride,
            return_overflowing_tokens=True,
            return_offsets_mapping=True,
            padding="max_length" )

      sample_map = inputs.pop("overflow_to_sample_mapping")
      example_ids = []

      for i in range(len(inputs["input_ids"])):
            sample_idx = sample_map[i]
            example_ids.append(examples["id"][sample_idx])

            sequence_ids = inputs.sequence_ids(i)
            offset = inputs["offset_mapping"][i]
            inputs["offset_mapping"][i] = [ o if sequence_ids[k] == 1 else None for k, o in enumerate(offset) ]

      inputs["example_id"] = example_ids
      return inputs
validation_dataset = dataset3.map(
      preprocess_validation_examples,
      batched=True,
      remove_columns=dataset3.column_names )
print(len(dataset3), len(validation_dataset))

import collections
import numpy as np
formatted_predictions = []
def postprocess_predictions(examples, features, raw_predictions, tokenizer, n_best_size=20, max_answer_length=30):
      all_start_logits, all_end_logits = raw_predictions
      example_id_to_index = {k["id"]: i for i, k in enumerate(examples)}
      features_per_example = collections.defaultdict(list)

      for i, feature in enumerate(features):
            features_per_example[example_id_to_index[feature["example_id"]]].append(i)

      predictions = collections.OrderedDict()

      for example in examples:
            example_index = example_id_to_index[example["id"]]
            feature_indices = features_per_example[example_index]
            min_null_score = None
            valid_answers = []
            context = example["context"]
            for feature_index in feature_indices:
                  start_logits = all_start_logits[feature_index]
                  end_logits = all_end_logits[feature_index]
                  offset_mapping = features[feature_index]["offset_mapping"]
                  input_ids = features[feature_index]["input_ids"]
                  cls_index = input_ids.index(tokenizer.cls_token_id)
                  feature_null_score = start_logits[cls_index] + end_logits[cls_index]
                  if min_null_score is None or feature_null_score < min_null_score:
                        min_null_score = feature_null_score

            start_indexes = np.argsort(start_logits)[-1: -n_best_size - 1: -1].tolist()
            end_indexes = np.argsort(end_logits)[-1: -n_best_size - 1: -1].tolist()

            for start_index in start_indexes:
                for end_index in end_indexes:
                        if (
                              start_index >= len(offset_mapping)
                              or end_index >= len(offset_mapping)
                              or offset_mapping[start_index] is None
                              or offset_mapping[end_index] is None
                              or end_index < start_index
                              or (end_index - start_index + 1) > max_answer_length ):
                              continue
                        start_char = offset_mapping[start_index][0]
                        end_char = offset_mapping[end_index][1]
                        answer_text = context[start_char:end_char]
                        score = start_logits[start_index] + end_logits[end_index]
                        valid_answers.append({"text": answer_text, "score": score})

            if valid_answers:
                  best_answer = max(valid_answers, key=lambda x: x["score"])
            else:
                  best_answer = {"text": ""}

            if min_null_score is not None and min_null_score > best_answer["score"]:
                  predictions[example["id"]] = ""
            else:
                  predictions[example["id"]] = best_answer["text"]

      formatted_predictions = [{"id": k, "prediction_text": v, "no_answer_probability": 1.0 if answers[i] == "" else 0.0} for k, v in predictions.items()]
      references = [{"id": ex["id"], "answers": ex["answers"]} for ex in examples]
      import evaluate
      metric = evaluate.load("squad_v2")
      results = metric.compute(predictions=formatted_predictions, references=references)
      print("results: ", results)
      return formatted_predictions, references

import evaluate
metric = evaluate.load("squad_v2")

def compute_metrics(eval_preds):
      features = validation_dataset # tokenized eval dataset
      examples = dataset3 # original eval examples (20 samples)
      raw_preds = eval_preds.predictions
      preds, refs = postprocess_predictions( examples=examples, features=features, raw_predictions=raw_preds, tokenizer=tokenizer )
      metrics = metric.compute(predictions=preds, references=refs)
      return metrics

from transformers import TrainingArguments

training_args = TrainingArguments(
            eval_strategy="epoch",
            learning_rate=2e-5,
            per_device_train_batch_size=16,
            per_device_eval_batch_size=16,
            num_train_epochs=3,
            weight_decay=0.01 )

trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=validation_dataset,
            tokenizer=tokenizer,
            compute_metrics=compute_metrics )
trainer.train()

trainer.save_model("./my_qa_model")
trainer.save_state() # saves optimizer & scheduler state
tokenizer.save_pretrained("./my_qa_model")
result = trainer.evaluate()
print("result: ", result)

#Evaluation
raw_preds = trainer.predict(validation_dataset).predictions
preds, refs = postprocess_predictions(dataset3, validation_dataset, raw_preds, tokenizer)
results = metric.compute(predictions=preds, references=refs)
print("Manual metrics:", results)

Manual metrics: {'exact': 75.0, 'f1': 75.0, 'total': 20, 'HasAns_exact': 90.0, 'HasAns_f1': 90.0, 'HasAns_total': 10, 'NoAns_exact': 60.0, 'NoAns_f1': 60.0, 'NoAns_total': 10, 'best_exact': 80.0, 'best_exact_thresh': 0.0, 'best_f1': 80.0, 'best_f1_thresh': 0.0}