Text Classification with Transformers

Build Your First NLP Model

A step-by-step tutorial to learn natural language processing with transformers, Hugging Face, and PyTorch.

What You'll Learn

  • Load and explore datasets from Hugging Face
  • Tokenize and prepare text data
  • Extract features from pre-trained models
  • Build a simple baseline classifier
  • Fine-tune transformer models
  • Deploy your model to the Hugging Face Hub

By the end of this tutorial, you'll have a working classifier that can identify the emotion expressed in a piece of text.

No prior transformer knowledge is required; a basic understanding of Python and machine learning is helpful.

What are Transformers?

Transformers are neural network architectures that revolutionized NLP with attention mechanisms, enabling models to understand context better than previous approaches.

Key Features

  • Attention mechanism - allows models to focus on relevant parts of the input (see the sketch after this list)
  • Pre-training & fine-tuning - learn general language understanding, then adapt to specific tasks
  • Parallelization - process entire sequences at once, unlike RNNs
  • Transfer learning - leverage knowledge from large datasets for your specific task
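
To make the attention idea concrete, here is a minimal, self-contained sketch of scaled dot-product attention using toy random tensors (illustrative only, not a real model):

import torch
import torch.nn.functional as F

# 4 tokens, each represented by an 8-dimensional vector
seq_len, dim = 4, 8
query = torch.randn(seq_len, dim)
key = torch.randn(seq_len, dim)
value = torch.randn(seq_len, dim)

# Each token scores its similarity to every other token...
scores = query @ key.T / dim ** 0.5
# ...turns the scores into weights that sum to 1...
weights = F.softmax(scores, dim=-1)
# ...and builds its output as a weighted sum of the value vectors
attended = weights @ value

print(weights.shape)   # torch.Size([4, 4]) - one attention row per token
print(attended.shape)  # torch.Size([4, 8])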

Popular Models

  • BERT (Bidirectional Encoder Representations from Transformers)
  • DistilBERT - Lighter, faster version of BERT with similar performance
  • RoBERTa - Optimized BERT training
  • GPT models - Generative pre-trained transformers

Setting Up Your Environment

# Install required packages
!pip install transformers[torch]
!pip install datasets
!pip install scikit-learn
!pip install huggingface_hub
!pip install pandas
!pip install matplotlib
!pip install torch

We'll use:

  • Python 3.8+
  • PyTorch
  • Hugging Face libraries
  • Scikit-learn for evaluation

Mini-Challenge

Verify your setup is complete! Run the following code and confirm that each library imports and prints its version:

import torch
import transformers
import datasets

print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {transformers.__version__}")
print(f"Datasets version: {datasets.__version__}")

Having trouble? Check the respective library's installation guide or try using Google Colab, which has many libraries pre-installed.

Loading and Exploring the Dataset

Hugging Face's Datasets library provides easy access to numerous public datasets. We'll use the dair-ai/emotion dataset, which contains short English texts labeled with one of six emotions.

Loading the Dataset

from datasets import load_dataset

# Load the emotions dataset
emotions = load_dataset("dair-ai/emotion")

# Check what we got
print(emotions)

Examining Dataset Structure

# Check dataset features
emotions["train"].features

# Check column names
emotions["train"].column_names

# Look at some examples
emotions["train"]["text"][:2]
emotions["train"]["label"][:2]

Converting to Pandas

import pandas as pd

# Convert to pandas
emotions.set_format(type="pandas")
df = emotions["train"][:]

# Convert numerical labels to strings
def convert_label_int_2_string(label_int):
    return emotions["train"].features["label"].int2str(label_int)

df["label_name"] = df["label"].apply(convert_label_int_2_string)

# View the dataframe
df.head()

Visualizing the Data

import matplotlib.pyplot as plt

# Visualize label distribution
df["label_name"].value_counts(ascending=True).plot.barh()
plt.title("Distribution of Emotions in Dataset")
plt.xlabel("Count")
plt.ylabel("Emotion")
plt.show()

Visualizing helps us understand the distribution of emotions in our dataset and identify any class imbalance issues.

About the Emotions Dataset

Dataset Structure:

  • Text: Sentences expressing emotions
  • Label: Emotion categories (0-5)
  • Split into train, validation, test sets

Emotion Categories:

  • Sadness
  • Joy
  • Love
  • Anger
  • Fear
  • Surprise

Mini-Challenge

Calculate and print the percentage of each emotion class in the training set:

# Calculate the percentage of each emotion
emotion_percentages = df["label_name"].value_counts(normalize=True) * 100
print(emotion_percentages)

# Which emotion is the most common? Which is the least common?

This helps identify if we need to address class imbalance in our model training.
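
If the percentages reveal a strong imbalance, one common remedy is to weight the classes during training. As an optional sketch (not required for the rest of the tutorial), scikit-learn can compute balanced class weights from the training labels:

from sklearn.utils.class_weight import compute_class_weight
import numpy as np

# Weight each emotion inversely to how often it appears in the training set
class_weights = compute_class_weight(
    class_weight="balanced",
    classes=np.unique(df["label"]),
    y=df["label"]
)
print(dict(zip(emotions["train"].features["label"].names, class_weights)))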

Tokenization and Data Preparation

Before feeding text to our model, we need to convert it into a numerical format that the model can understand. This process is called tokenization.

What is Tokenization?

Tokenization breaks text into smaller units (tokens) and converts them to numeric IDs.

Consider a simple example:

text = "My name is kareem"

# Manual tokenization (character level)
tokenized_text = list(text)
token2index = {ch:index for index, ch
               in enumerate(sorted(set(tokenized_text)))}
print(token2index)
# Output: {' ': 0, 'M': 1, 'a': 2, ..., 'y': 10}

In practice, we use more sophisticated tokenizers like WordPiece, BPE, or SentencePiece.

Using Hugging Face Tokenizers

from transformers import AutoTokenizer

# Load a pre-trained tokenizer
model_ckp = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckp)

# Tokenize a sample text
text = "My name is kareem"
encoded_text = tokenizer(text)
print(encoded_text)

# Convert IDs back to tokens
tokens = tokenizer.convert_ids_to_tokens(
    encoded_text.input_ids)
print(tokens)

# And back to text
print(tokenizer.convert_tokens_to_string(tokens))

Tokenizer Properties

# Explore tokenizer properties
print(f"Vocabulary size: {tokenizer.vocab_size}")
print(f"Model input names: {tokenizer.model_input_names}")
print(f"Maximum sequence length: {tokenizer.model_max_length}")

# The tokenizer outputs:
# 1. input_ids: Token IDs
# 2. attention_mask: Indicates which tokens should be attended to
# Special tokens like [CLS] and [SEP] are added automatically
# and show up inside input_ids

Tokenizing the Entire Dataset

# Define a function to tokenize batches
def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)

# Reset the dataset format first (we switched it to pandas earlier)
emotions.reset_format()

# Apply tokenization to the entire dataset
emotion_encoded = emotions.map(tokenize, batched=True, batch_size=None)

# Check new columns
print(emotion_encoded.column_names)

# Now we have:
# - input_ids
# - attention_mask
# - plus original columns

Tokenization Parameters

padding:

  • True: Pad sequences to the longest in the batch
  • 'max_length': Pad to the model's maximum length
  • Padding adds [PAD] tokens so all sequences in a batch have the same length

truncation:

  • True: Cut sequences longer than model's max
  • Prevents errors with sequences that are too long
  • May lose information from truncated text (both options are shown in the short example below)
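
To see both options in action, here is a short sketch (the sentences are just illustrative) that tokenizes two texts of different lengths in one batch:

batch = ["I am happy",
         "I am extremely happy about how well this tutorial is going"]

encoded = tokenizer(batch, padding=True, truncation=True)

# Both sequences now have the same length; padded positions
# get attention_mask = 0 so the model ignores them
for ids, mask in zip(encoded["input_ids"], encoded["attention_mask"]):
    print(len(ids), mask)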

Mini-Challenge

Tokenize a custom sentence and investigate the outputs:

# Tokenize this sentence
sentence = "I am feeling both happy and sad at the same time."

# 1. Encode the sentence
encoded = tokenizer(sentence)

# 2. Print the input_ids
print(encoded.input_ids)

# 3. Convert back to tokens and print
tokens = tokenizer.convert_ids_to_tokens(encoded.input_ids)
print(tokens)

# 4. Which tokens get split into subwords?

Understanding tokenization helps you debug issues with model inputs and explains how the model "sees" your text.

Feature Extraction from Pre-trained Models

Before fine-tuning a transformer, we can extract features from a pre-trained model and train a simple classifier on top of them. This approach is faster and requires fewer computational resources.

Loading the Pre-trained Model

from transformers import AutoModel
import torch

# Set device (GPU if available, else CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load pre-trained model
model_chk = "distilbert-base-uncased"
model = AutoModel.from_pretrained(model_chk).to(device)

# This loads the model without the classification head
# We'll use it as a feature extractor

The model is moved to GPU if available for faster processing.

Understanding Hidden States

Transformers produce contextualized embeddings called "hidden states" for each token:

  • Each token gets a vector representation
  • Final layer contains the most task-relevant features
  • The [CLS] token (first token) often summarizes the sentence

For classification, we typically use the [CLS] token's representation from the last layer.
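
A quick way to see this concretely is to pass one tokenized sentence through the model and inspect the shape of its output (this assumes the tokenizer and model loaded above):

text = "My name is kareem"
inputs = tokenizer(text, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)

# One vector per token: [batch_size, sequence_length, hidden_size]
# For DistilBERT the hidden size is 768
print(outputs.last_hidden_state.shape)

# The [CLS] representation is simply the first token's vector
print(outputs.last_hidden_state[:, 0].shape)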

Feature Extraction Function

# Define feature extraction function
def extract_hidden_states(batch):
    # Move inputs to the device
    inputs = {k: v.to(device) for k, v in batch.items()
              if k in tokenizer.model_input_names}

    # Don't calculate gradients
    with torch.no_grad():
        # Get the last hidden state
        last_hidden_state = model(**inputs).last_hidden_state

    # Return the [CLS] token's representation
    return {"hidden_state": last_hidden_state[:,0].cpu().numpy()}

This function extracts the final layer's hidden state for the [CLS] token, which works well for sentence classification.

Processing the Dataset

# Set format to PyTorch tensors
emotion_encoded.set_format("torch",
                          columns=["input_ids", "attention_mask", "label"])

# Extract features from all examples
emotions_hidden = emotion_encoded.map(extract_hidden_states, batched=True)

# Prepare data for scikit-learn
import numpy as np

X_train = np.array(emotions_hidden["train"]["hidden_state"])
X_valid = np.array(emotions_hidden["validation"]["hidden_state"])
y_train = np.array(emotions_hidden["train"]["label"])
y_valid = np.array(emotions_hidden["validation"]["label"])

print(f"Training features shape: {X_train.shape}")
print(f"Validation features shape: {X_valid.shape}")

Feature Extraction Benefits

Advantages:

  • Much faster than fine-tuning
  • Less computational resources required
  • Still leverages pre-trained knowledge
  • Works well for many tasks with small datasets

Limitations:

  • Not optimized for the specific task
  • May not capture task-specific nuances
  • Performance ceiling lower than fine-tuning

Mini-Challenge

Investigate the feature dimensions and visualize the embeddings:

# 1. Print the shape of the feature vectors
print(f"Feature vector size: {X_train.shape[1]}")

# 2. Calculate the average feature vector for each emotion
emotions_list = emotions["train"].features["label"].names
for i, emotion in enumerate(emotions_list):
    mean_vector = X_train[y_train == i].mean(axis=0)
    print(f"{emotion}: {mean_vector[:5]}...")  # First 5 values

# Optional: Visualize with UMAP or t-SNE
# !pip install umap-learn
# import umap
# reducer = umap.UMAP()
# X_train_umap = reducer.fit_transform(X_train)
# plt.scatter(X_train_umap[:, 0], X_train_umap[:, 1], c=y_train, cmap='viridis')
# plt.colorbar()

Visualizing the embeddings can help you understand if your features are capturing emotion differences.

Creating a Baseline Model with Logistic Regression

Before investing time in fine-tuning, it's good practice to create a simple baseline model to establish performance benchmarks.

Dummy Classifier

Let's start with the simplest possible model:

from sklearn.dummy import DummyClassifier

# Create a dummy classifier that always predicts
# the most frequent class
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X_train, y_train)

# Evaluate
dummy_score = dummy_clf.score(X_valid, y_valid)
print(f"Dummy classifier accuracy: {dummy_score:.4f}")

# This establishes the simplest baseline
# Any real model should perform better than this

Logistic Regression

from sklearn.linear_model import LogisticRegression

# Create a logistic regression classifier
# We increase max_iter to ensure convergence
lr_clf = LogisticRegression(max_iter=3000)

# Train the model
lr_clf.fit(X_train, y_train)

# Evaluate
lr_score = lr_clf.score(X_valid, y_valid)
print(f"Logistic regression accuracy: {lr_score:.4f}")

# Compare with dummy classifier
print(f"Improvement: {lr_score - dummy_score:.4f}")

Confusion Matrix

from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
import matplotlib.pyplot as plt

# Define function to plot confusion matrix
def plot_confusion_matrix(y_preds, y_true, labels):
    cm = confusion_matrix(y_true, y_preds, normalize="true")
    fig, ax = plt.subplots(figsize=(10, 10))
    disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                                 display_labels=labels)
    disp.plot(cmap="Blues", values_format=".2f", ax=ax, colorbar=False)
    plt.title("Normalized confusion matrix")
    plt.show()

# Get predictions
y_preds = lr_clf.predict(X_valid)

# Plot confusion matrix
labels = emotions["train"].features["label"].names
plot_confusion_matrix(y_preds, y_valid, labels)

The confusion matrix helps identify which emotions are most often confused with each other.

Additional Metrics

from sklearn.metrics import classification_report

# Generate a detailed classification report
report = classification_report(y_valid, y_preds,
                              target_names=labels)
print(report)

# This report shows precision, recall, and F1-score
# for each emotion class

Why Create a Baseline?

Benefits of a Simple Baseline:

  • Establishes minimum performance expectations
  • Quick to implement and evaluate
  • Helps identify easy vs. difficult classes
  • Reference point for more complex models

Interpreting Results:

  • High baseline performance may indicate dataset issues
  • Similar performance across models suggests feature limitations
  • Class imbalance often visible in baseline results

Mini-Challenge

Try a different classifier and compare with logistic regression:

from sklearn.ensemble import RandomForestClassifier

# Train a random forest classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

# Evaluate and compare
rf_score = rf_clf.score(X_valid, y_valid)
print(f"Random forest accuracy: {rf_score:.4f}")
print(f"Logistic regression accuracy: {lr_score:.4f}")

# Which emotions does random forest predict better?
rf_preds = rf_clf.predict(X_valid)
rf_report = classification_report(y_valid, rf_preds,
                                 target_names=labels)
print(rf_report)

Different classifiers may perform better for different emotion classes. This helps you understand your data better.

Fine-tuning a Transformer Model

Now let's fine-tune a pre-trained transformer model specifically for our emotion classification task. This adapts the entire model to our specific data.

Loading a Classification Model

from transformers import AutoModelForSequenceClassification
import torch

# Number of classes in our dataset
num_labels = 6  # The 6 emotions

# Load the pre-trained model with a classification head
model_chk = "distilbert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(
    model_chk,
    num_labels=num_labels
).to(device)

# This model has the same base as before but with
# an added classification layer on top
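
Optionally, you can also pass the human-readable label names when loading the model, so that later predictions (and the Hub inference widget) show emotion names such as "joy" instead of generic LABEL_0-style names. A small sketch of that variant:

# Map label IDs to emotion names and back
labels = emotions["train"].features["label"].names
id2label = {i: label for i, label in enumerate(labels)}
label2id = {label: i for i, label in enumerate(labels)}

model = AutoModelForSequenceClassification.from_pretrained(
    model_chk,
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id
).to(device)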

Evaluation Metrics

from sklearn.metrics import accuracy_score, f1_score

# Define our evaluation metrics
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)

    # Calculate accuracy
    acc = accuracy_score(labels, preds)

    # Calculate F1 (weighted average across classes)
    f1 = f1_score(labels, preds, average="weighted")

    return {
        "accuracy": acc,
        "f1": f1
    }

Setting Up the Trainer

from transformers import Trainer, TrainingArguments

# Set training parameters
batch_size = 64
logging_steps = len(emotion_encoded["train"]) // batch_size
model_name = f"{model_chk}-finetuned-emotion"

# Define training arguments
training_args = TrainingArguments(
    output_dir=model_name,
    num_train_epochs=2,
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    disable_tqdm=False,
    logging_steps=logging_steps,
    push_to_hub=True,  # requires a Hugging Face Hub login (see the deployment section)
    log_level="error"
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=emotion_encoded["train"],
    eval_dataset=emotion_encoded["validation"],
    tokenizer=tokenizer
)

Training the Model

# Start the training process
trainer.train()

# This will:
# 1. Train for the specified number of epochs
# 2. Report metrics after each epoch
# 3. Save checkpoints
# 4. Show a progress bar unless disabled

Training Arguments Explained

  • output_dir: Where to save model checkpoints
  • num_train_epochs: Number of training cycles
  • learning_rate: Controls step size during optimization
  • per_device_train_batch_size: Examples per batch
  • weight_decay: L2 regularization to prevent overfitting
  • evaluation_strategy: When to evaluate ("epoch", "steps")
  • push_to_hub: Whether to upload to Hugging Face Hub

Mini-Challenge

Modify the training arguments to improve performance:

# Try adjusting these parameters:

# 1. Change learning rate
training_args = TrainingArguments(
    # ... other arguments ...
    learning_rate=5e-5,  # Try higher or lower values
)

# 2. Add learning rate scheduler
training_args = TrainingArguments(
    # ... other arguments ...
    lr_scheduler_type="cosine",  # Try different schedulers
)

# 3. Try longer training with early stopping
from transformers import EarlyStoppingCallback

training_args = TrainingArguments(
    # ... other arguments ...
    num_train_epochs=5,
    load_best_model_at_end=True,
    save_strategy="epoch",          # must match evaluation_strategy
    metric_for_best_model="f1",     # needed by EarlyStoppingCallback
)

trainer = Trainer(
    # ... other arguments ...
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)

Experiment with these parameters to find the optimal settings for your task.

Evaluating Model Performance

After training, we need to evaluate our model thoroughly to understand its strengths and weaknesses.

Evaluating on Validation Set

# Evaluate the model
validation_results = trainer.evaluate()

print(f"Validation Accuracy: {validation_results['eval_accuracy']:.4f}")
print(f"Validation F1 Score: {validation_results['eval_f1']:.4f}")

# Compare with our baseline model
print(f"Baseline Accuracy: {lr_score:.4f}")
print(f"Improvement: {validation_results['eval_accuracy'] - lr_score:.4f}")

Confusion Matrix

# Get predictions on validation set
validation_preds = trainer.predict(emotion_encoded["validation"])

# Get predicted classes
y_preds = validation_preds.predictions.argmax(-1)
y_valid = validation_preds.label_ids

# Plot confusion matrix
labels = emotions["train"].features["label"].names
plot_confusion_matrix(y_preds, y_valid, labels)

Detailed Classification Report

from sklearn.metrics import classification_report

# Generate a detailed report
report = classification_report(
    y_valid, y_preds, target_names=labels)
print(report)

# This shows precision, recall, F1 for each emotion class
# Look for classes with lower performance

Error Analysis

# Get the original texts so we can inspect misclassified examples
validation_texts = emotions["validation"]["text"]

# Find misclassified examples
errors = []
for i, (pred, true) in enumerate(zip(y_preds, y_valid)):
    if pred != true:
        errors.append({
            "text": validation_texts[i],
            "true": labels[true],
            "predicted": labels[pred]
        })

# Look at the first few errors
for i, error in enumerate(errors[:5]):
    print(f"Example {i+1}:")
    print(f"Text: {error['text']}")
    print(f"True label: {error['true']}")
    print(f"Predicted: {error['predicted']}")
    print("---")

Understanding Model Performance

Key Metrics:

  • Accuracy: Overall percentage of correct predictions
  • Precision: How many of the predicted positives are actually positive
  • Recall: How many of the actual positives were correctly identified
  • F1 Score: Harmonic mean of precision and recall (see the small example below)
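
As a quick sanity check on these definitions, here is a tiny toy example (made-up binary labels) computed with scikit-learn:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 0, 1, 1, 1]   # toy ground-truth labels
y_pred = [0, 1, 1, 1, 0]   # toy predictions

print(accuracy_score(y_true, y_pred))    # 3 of 5 correct = 0.6
print(precision_score(y_true, y_pred))   # 2 of 3 predicted positives are right = 0.667
print(recall_score(y_true, y_pred))      # 2 of 3 actual positives were found = 0.667
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall = 0.667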

Common Patterns in Errors:

  • Confusion between similar emotions (love/joy)
  • Ambiguous expressions with multiple emotions
  • Sarcasm and figurative language
  • Cultural or contextual references

Mini-Challenge

Create a function to test your model on custom inputs:

def predict_emotion(text):
    # Tokenize the input
    inputs = tokenizer(text, return_tensors="pt").to(device)

    # Get prediction
    with torch.no_grad():
        outputs = model(**inputs)

    # Get probabilities
    probs = torch.nn.functional.softmax(outputs.logits, dim=-1)

    # Get predicted class
    predicted_class = torch.argmax(probs, dim=-1).item()

    # Get the emotion label
    emotion = labels[predicted_class]

    # Get the probability
    confidence = probs[0][predicted_class].item()

    return {
        "emotion": emotion,
        "confidence": confidence,
        "all_probs": {labels[i]: p.item() for i, p in enumerate(probs[0])}
    }

# Test with some examples
examples = [
    "I can't believe I aced my exam!",
    "I miss my family so much right now.",
    "That driver just cut me off! So rude!"
]

for example in examples:
    result = predict_emotion(example)
    print(f"Text: {example}")
    print(f"Predicted emotion: {result['emotion']} (Confidence: {result['confidence']:.4f})")
    print("All probabilities:")
    for emotion, prob in result['all_probs'].items():
        print(f"  {emotion}: {prob:.4f}")
    print("---")

This function lets you interactively test your model on new inputs to better understand its behavior.

Deploying the Model to Hugging Face Hub

Now that we have a trained model, let's share it with the world by uploading it to the Hugging Face Hub.

Setting Up Hugging Face Account

# Login to Hugging Face
from huggingface_hub import notebook_login

notebook_login()

# This will prompt you to enter your Hugging Face
# access token to authenticate

# You can get a token from:
# https://huggingface.co/settings/tokens

If you're running this in a script instead of a notebook, use huggingface-cli login in your terminal first.
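
If you prefer to authenticate programmatically, the hub client also provides a login helper (the token below is a placeholder; use your own access token):

from huggingface_hub import login

login(token="hf_xxx")  # placeholder token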

Pushing the Model to Hub

# Push the model to the Hub
trainer.push_to_hub(commit_message="Training completed!")

# This uploads:
# - Model weights
# - Tokenizer configuration
# - Model configuration
# - README with model card

# Your model will be available at:
# https://huggingface.co/YOUR_USERNAME/distilbert-base-uncased-finetuned-emotion

Creating a Model Card

# To create a more detailed model card, you can edit the README.md
# Either directly on the Hugging Face website, or by:

model_card = """
# Emotion Classification Model

This model can detect 6 emotions in text: sadness, joy, love, anger, fear, and surprise.

## Model Description

- Model architecture: DistilBERT (distilbert-base-uncased)
- Fine-tuned on the DAIR.AI Emotion dataset
- Training accuracy: {train_acc:.4f}
- Validation accuracy: {val_acc:.4f}
- F1 score: {f1:.4f}

## Intended Usage

This model is intended for sentiment analysis and emotion detection in English text.

## Limitations

- Only works for English text
- Struggles with sarcasm and ambiguous emotions
- May not perform well on very short texts

## Training procedure

- Trained for 2 epochs
- Learning rate: 2e-5
- Batch size: 64
""".format(train_acc=0.92, val_acc=validation_results['eval_accuracy'],
           f1=validation_results['eval_f1'])

# You can then push this card to the Hub, for example with the
# huggingface_hub ModelCard utility (one possible approach):
# from huggingface_hub import ModelCard
# ModelCard(model_card).push_to_hub(f"YOUR_USERNAME/{model_name}")

Using Your Model

# Once deployed, anyone can use your model with:
from transformers import pipeline

# Replace YOUR_USERNAME with your actual Hugging Face username
model_name = "YOUR_USERNAME/distilbert-base-uncased-finetuned-emotion"

# Load the model
classifier = pipeline("text-classification", model=model_name)

# Use the model
result = classifier("I'm so excited to see this working!")
print(result)
# [{'label': 'joy', 'score': 0.9874}]
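
If you want scores for all six emotions rather than only the top one, the text-classification pipeline accepts a top_k argument (passing top_k=None requests every class; exact behavior may vary slightly between transformers versions):

# Request a score for every emotion class
all_scores = classifier("I'm so excited to see this working!", top_k=None)
print(all_scores)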

Hugging Face Hub Benefits

Advantages:

  • Easy sharing with the community
  • Version control for model iterations
  • Automatic model cards and documentation
  • Widgets to test your model in the browser

Additional Features:

  • Model discussions and community feedback
  • Download statistics and usage metrics
  • Spaces for creating interactive demos
  • Integration with many ML frameworks

Mini-Challenge

Create a simple Gradio interface for your model (optional, requires internet access):

# Install gradio
# !pip install gradio

import gradio as gr

# Define the prediction function: return a score for every emotion
# so the Label component can display all six classes
def predict(text):
    results = classifier(text, top_k=None)
    return {r["label"]: r["score"] for r in results}

# Create the interface
demo = gr.Interface(
    fn=predict,
    inputs=gr.Textbox(placeholder="Enter text here..."),
    outputs=gr.Label(num_top_classes=6),
    title="Emotion Classifier",
    description="Detect emotions in text: sadness, joy, love, anger, fear, and surprise."
)

# Launch the demo
demo.launch()

# This creates an interactive web interface for your model
# You can also deploy this permanently on Hugging Face Spaces

Creating a demo makes your model more accessible to non-technical users and provides an easy way to showcase your work.