A step-by-step tutorial to learn natural language processing with transformers, Hugging Face, and PyTorch.
By the end of this tutorial, you'll have a functional emotion classification model that can identify emotions in text.
No prior transformer knowledge is required; a basic understanding of Python and machine learning is helpful.
Transformers are neural network architectures that revolutionized NLP with attention mechanisms, enabling models to understand context better than previous approaches.
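To make "attention" concrete before we touch any libraries, here is a minimal sketch of scaled dot-product attention, the core operation inside a transformer layer. This is illustrative only; real transformers add learned projections, multiple heads, and masking.
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    # Score every token against every other token, scaled by sqrt(dim)
    scores = query @ key.transpose(-2, -1) / (key.size(-1) ** 0.5)
    # Softmax turns scores into weights that sum to 1 for each token
    weights = F.softmax(scores, dim=-1)
    # Each output vector is a weighted mix of all token values
    return weights @ value

# Toy example: a batch of 1 sequence with 4 tokens and 8-dim embeddings
x = torch.randn(1, 4, 8)
print(scaled_dot_product_attention(x, x, x).shape)  # torch.Size([1, 4, 8])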
# Install required packages
!pip install transformers[torch]
!pip install datasets
!pip install scikit-learn
!pip install huggingface_hub
!pip install pandas
!pip install matplotlib
!pip install torch
We'll use:
- transformers for pre-trained models and tokenizers
- datasets for loading and processing data
- scikit-learn for baseline models and evaluation metrics
- huggingface_hub for sharing the trained model
- pandas and matplotlib for analysis and visualization
- torch (PyTorch) as the deep learning backend
Verify your setup is complete! Run these commands and make sure you get successful output:
import torch
import transformers
import datasets
print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {transformers.__version__}")
print(f"Datasets version: {datasets.__version__}")
Having trouble? Check the respective library's installation guide or try using Google Colab, which has many libraries pre-installed.
Hugging Face's Datasets library provides easy access to numerous public datasets. We'll use the emotions dataset, which contains text labeled with emotions.
from datasets import load_dataset
# Load the emotions dataset
emotions = load_dataset("dair-ai/emotion")
# Check what we got
print(emotions)
# Check dataset features
emotions["train"].features
# Check column names
emotions["train"].column_names
# Look at some examples
emotions["train"]["text"][:2]
emotions["train"]["label"][:2]
import pandas as pd
# Convert to pandas
emotions.set_format(type="pandas")
df = emotions["train"][:]
# Convert numerical labels to strings
def convert_label_int_2_string(row):
    return emotions["train"].features["label"].int2str(row)

df["label_name"] = df["label"].apply(convert_label_int_2_string)
# View the dataframe
df.head()
import matplotlib.pyplot as plt
# Visualize label distribution
df["label_name"].value_counts(ascending=True).plot.barh()
plt.title("Distribution of Emotions in Dataset")
plt.xlabel("Count")
plt.ylabel("Emotion")
plt.show()
Visualizing helps us understand the distribution of emotions in our dataset and identify any class imbalance issues.
Dataset Structure: the dataset comes pre-split into train (16,000 examples), validation (2,000), and test (2,000) sets, each with a text column and an integer label column.
Emotion Categories: the six labels are sadness (0), joy (1), love (2), anger (3), fear (4), and surprise (5).
Calculate and print the percentage of each emotion class in the training set:
# Calculate the percentage of each emotion
emotion_percentages = df["label_name"].value_counts(normalize=True) * 100
print(emotion_percentages)
# Which emotion is the most common? Which is the least common?
This helps identify if we need to address class imbalance in our model training.
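If the distribution turns out to be skewed, one common remedy is to weight the training loss by inverse class frequency. A minimal sketch with scikit-learn, assuming the df built above:
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# "balanced" weights are inversely proportional to class frequency
class_weights = compute_class_weight(
    class_weight="balanced",
    classes=np.unique(df["label"]),
    y=df["label"])
label_names = emotions["train"].features["label"].names
print({label_names[i]: w for i, w in zip(np.unique(df["label"]), class_weights)})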
Before feeding text to our model, we need to convert it into a numerical format that the model can understand. This process is called tokenization.
Tokenization breaks text into smaller units (tokens) and converts them to numeric IDs.
Consider a simple example:
text = "My name is kareem"
# Manual tokenization (character level)
tokenized_text = list(text)
token2index = {ch: index for index, ch in enumerate(sorted(set(tokenized_text)))}
print(token2index)
# Output: {' ': 0, 'M': 1, 'a': 2, ..., 'y': 10}
In practice, we use more sophisticated tokenizers like WordPiece, BPE, or SentencePiece.
from transformers import AutoTokenizer
# Load a pre-trained tokenizer
model_ckp = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckp)
# Tokenize a sample text
text = "My name is kareem"
encoded_text = tokenizer(text)
print(encoded_text)
# Convert IDs back to tokens
tokens = tokenizer.convert_ids_to_tokens(encoded_text.input_ids)
print(tokens)
# And back to text
print(tokenizer.convert_tokens_to_string(tokens))
# Explore tokenizer properties
print(f"Vocabulary size: {tokenizer.vocab_size}")
print(f"Model input names: {tokenizer.model_input_names}")
print(f"Maximum sequence length: {tokenizer.model_max_length}")
# The tokenizer outputs:
# 1. input_ids: Token IDs
# 2. attention_mask: Indicates which tokens should be attended to
# 3. special tokens: [CLS], [SEP], etc. that have special meaning
# Define a function to tokenize batches
def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)
# Reset the format back from pandas before tokenizing
emotions.reset_format()
# Apply tokenization to the entire dataset
emotion_encoded = emotions.map(tokenize, batched=True, batch_size=None)
# Check new columns
print(emotion_encoded.column_names)
# Now we have:
# - input_ids
# - attention_mask
# - plus original columns
- padding=True: pad sequences to the longest in the batch
- padding='max_length': pad to the model's max length
- truncation=True: cut sequences longer than the model's max
Tokenize a custom sentence and investigate the outputs:
# Tokenize this sentence
sentence = "I am feeling both happy and sad at the same time."
# 1. Encode the sentence
encoded = tokenizer(sentence)
# 2. Print the input_ids
print(encoded.input_ids)
# 3. Convert back to tokens and print
tokens = tokenizer.convert_ids_to_tokens(encoded.input_ids)
print(tokens)
# 4. Which tokens get split into subwords?
Understanding tokenization helps you debug issues with model inputs and explains how the model "sees" your text.
Before fine-tuning a transformer, we can extract features from a pre-trained model to use with a simple classifier. This approach is faster and requires less computational resources.
from transformers import AutoModel
import torch
# Set device (GPU if available, else CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load pre-trained model
model_chk = "distilbert-base-uncased"
model = AutoModel.from_pretrained(model_chk).to(device)
# This loads the model without the classification head
# We'll use it as a feature extractor
The model is moved to GPU if available for faster processing.
Transformers produce contextualized embeddings called "hidden states" for each token:
For classification, we typically use the [CLS] token's representation from the last layer.
# Define feature extraction function
def extract_hidden_states(batch):
    # Move inputs to the device
    inputs = {k: v.to(device) for k, v in batch.items()
              if k in tokenizer.model_input_names}
    # Don't calculate gradients
    with torch.no_grad():
        # Get the last hidden state
        last_hidden_state = model(**inputs).last_hidden_state
    # Return the [CLS] token's representation
    return {"hidden_state": last_hidden_state[:, 0].cpu().numpy()}
This function extracts the final layer's hidden state for the [CLS] token, which works well for sentence classification.
# Set format to PyTorch tensors
emotion_encoded.set_format("torch",
                           columns=["input_ids", "attention_mask", "label"])
# Extract features from all examples
emotions_hidden = emotion_encoded.map(extract_hidden_states, batched=True)
# Prepare data for scikit-learn
import numpy as np
X_train = np.array(emotions_hidden["train"]["hidden_state"])
X_valid = np.array(emotions_hidden["validation"]["hidden_state"])
y_train = np.array(emotions_hidden["train"]["label"])
y_valid = np.array(emotions_hidden["validation"]["label"])
print(f"Training features shape: {X_train.shape}")
print(f"Validation features shape: {X_valid.shape}")
Advantages: feature extraction is fast, needs no GPU to train the classifier, and lets you reuse the same frozen embeddings across many experiments.
Limitations: the features are not adapted to the task, so accuracy is usually lower than full fine-tuning.
Investigate the feature dimensions and visualize the embeddings:
# 1. Print the shape of the feature vectors
print(f"Feature vector size: {X_train.shape[1]}")
# 2. Calculate the average feature vector for each emotion
emotions_list = emotions["train"].features["label"].names
for i, emotion in enumerate(emotions_list):
    mean_vector = X_train[y_train == i].mean(axis=0)
    print(f"{emotion}: {mean_vector[:5]}...")  # First 5 values
# Optional: Visualize with UMAP or t-SNE
# !pip install umap-learn
# import umap
# reducer = umap.UMAP()
# X_train_umap = reducer.fit_transform(X_train)
# plt.scatter(X_train_umap[:, 0], X_train_umap[:, 1], c=y_train, cmap='viridis')
# plt.colorbar()
Visualizing the embeddings can help you understand if your features are capturing emotion differences.
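If you'd rather not install an extra package, scikit-learn's t-SNE gives a similar picture. A sketch assuming X_train and y_train from above (t-SNE is slow, so we subsample):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Subsample for speed; t-SNE scales poorly with dataset size
idx = np.random.RandomState(42).choice(len(X_train), 2000, replace=False)
X_2d = TSNE(n_components=2, random_state=42).fit_transform(X_train[idx])
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y_train[idx], cmap="viridis", s=5)
plt.colorbar(label="emotion label")
plt.title("t-SNE of [CLS] embeddings")
plt.show()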
Before investing time in fine-tuning, it's good practice to create a simple baseline model to establish performance benchmarks.
Let's start with the simplest possible model:
from sklearn.dummy import DummyClassifier
# Create a dummy classifier that always predicts
# the most frequent class
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X_train, y_train)
# Evaluate
dummy_score = dummy_clf.score(X_valid, y_valid)
print(f"Dummy classifier accuracy: {dummy_score:.4f}")
# This establishes the simplest baseline
# Any real model should perform better than this
from sklearn.linear_model import LogisticRegression
# Create a logistic regression classifier
# We increase max_iter to ensure convergence
lr_clf = LogisticRegression(max_iter=3000)
# Train the model
lr_clf.fit(X_train, y_train)
# Evaluate
lr_score = lr_clf.score(X_valid, y_valid)
print(f"Logistic regression accuracy: {lr_score:.4f}")
# Compare with dummy classifier
print(f"Improvement: {lr_score - dummy_score:.4f}")
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
import matplotlib.pyplot as plt
# Define function to plot confusion matrix
def plot_confusion_matrix(y_preds, y_true, labels):
    cm = confusion_matrix(y_true, y_preds, normalize="true")
    fig, ax = plt.subplots(figsize=(10, 10))
    disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                                  display_labels=labels)
    disp.plot(cmap="Blues", values_format=".2f", ax=ax, colorbar=False)
    plt.title("Normalized confusion matrix")
    plt.show()
# Get predictions
y_preds = lr_clf.predict(X_valid)
# Plot confusion matrix
labels = emotions["train"].features["label"].names
plot_confusion_matrix(y_preds, y_valid, labels)
The confusion matrix helps identify which emotions are most often confused with each other.
from sklearn.metrics import classification_report
# Generate a detailed classification report
report = classification_report(y_valid, y_preds, target_names=labels)
print(report)
# This report shows precision, recall, and F1-score
# for each emotion class
Benefits of a Simple Baseline: it sanity-checks the whole pipeline, sets a floor that any fine-tuned model must beat, and quickly exposes bugs in the features or labels.
Interpreting Results: the logistic regression should comfortably beat the dummy classifier; the per-class report and confusion matrix show which emotions (often the rarer ones) drag the score down.
Try a different classifier and compare with logistic regression:
from sklearn.ensemble import RandomForestClassifier
# Train a random forest classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
# Evaluate and compare
rf_score = rf_clf.score(X_valid, y_valid)
print(f"Random forest accuracy: {rf_score:.4f}")
print(f"Logistic regression accuracy: {lr_score:.4f}")
# Which emotions does random forest predict better?
rf_preds = rf_clf.predict(X_valid)
rf_report = classification_report(y_valid, rf_preds, target_names=labels)
print(rf_report)
Different classifiers may perform better for different emotion classes. This helps you understand your data better.
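To make that comparison concrete, you can put the per-class F1 scores of both baselines side by side. A small sketch assuming lr_clf, rf_clf, X_valid, y_valid, and labels from above:
from sklearn.metrics import f1_score

# average=None returns one F1 score per class
lr_f1 = f1_score(y_valid, lr_clf.predict(X_valid), average=None)
rf_f1 = f1_score(y_valid, rf_clf.predict(X_valid), average=None)
for name, lr_v, rf_v in zip(labels, lr_f1, rf_f1):
    better = "LR" if lr_v > rf_v else "RF"
    print(f"{name:>8}: LR={lr_v:.3f}  RF={rf_v:.3f}  ({better} better)")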
Now let's fine-tune a pre-trained transformer model specifically for our emotion classification task. This adapts the entire model to our specific data.
from transformers import AutoModelForSequenceClassification
import torch
# Number of classes in our dataset
num_labels = 6 # The 6 emotions
# Load the pre-trained model with a classification head
model_chk = "distilbert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(
    model_chk,
    num_labels=num_labels
).to(device)
# This model has the same base as before but with
# an added classification layer on top
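By default the new head's classes are just named LABEL_0 through LABEL_5, which is what pipelines and the Hub inference widget will display later. If you want human-readable outputs such as "joy", you can pass the label mappings when loading the model; a sketch using the dataset's own label names:
# Optional: give the classification head human-readable label names
label_names = emotions["train"].features["label"].names
id2label = {i: name for i, name in enumerate(label_names)}
label2id = {name: i for i, name in enumerate(label_names)}
model = AutoModelForSequenceClassification.from_pretrained(
    model_chk,
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id,
).to(device)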
from sklearn.metrics import accuracy_score, f1_score
# Define our evaluation metrics
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    # Calculate accuracy
    acc = accuracy_score(labels, preds)
    # Calculate F1 (weighted average across classes)
    f1 = f1_score(labels, preds, average="weighted")
    return {
        "accuracy": acc,
        "f1": f1
    }
from transformers import Trainer, TrainingArguments
# Set training parameters
batch_size = 64
logging_steps = len(emotion_encoded["train"]) // batch_size
model_name = f"{model_chk}-finetuned-emotion"
# Define training arguments
training_args = TrainingArguments(
    output_dir=model_name,
    num_train_epochs=2,
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    disable_tqdm=False,
    logging_steps=logging_steps,
    push_to_hub=True,
    log_level="error"
)
# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=emotion_encoded["train"],
    eval_dataset=emotion_encoded["validation"],
    tokenizer=tokenizer
)
# Start the training process
trainer.train()
# This will:
# 1. Train for the specified number of epochs
# 2. Report metrics after each epoch
# 3. Save checkpoints
# 4. Show a progress bar unless disabled
Modify the training arguments to improve performance:
# Try adjusting these parameters:
# 1. Change learning rate
training_args = TrainingArguments(
    # ... other arguments ...
    learning_rate=5e-5,  # Try higher or lower values
)
# 2. Add learning rate scheduler
training_args = TrainingArguments(
    # ... other arguments ...
    lr_scheduler_type="cosine",  # Try different schedulers
)
# 3. Try longer training with early stopping
from transformers import EarlyStoppingCallback
training_args = TrainingArguments(
    # ... other arguments ...
    num_train_epochs=5,
    evaluation_strategy="epoch",
    save_strategy="epoch",        # must match evaluation_strategy
    metric_for_best_model="f1",   # required by EarlyStoppingCallback
    load_best_model_at_end=True,
)
trainer = Trainer(
    # ... other arguments ...
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)
Experiment with these parameters to find the optimal settings for your task.
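If you want to explore these options systematically, a small manual sweep works well for a model of this size. A rough sketch assuming the setup from above; note that the model must be re-initialized for every run:
results = {}
for lr in [1e-5, 2e-5, 5e-5]:
    # Fresh model for each configuration, otherwise runs contaminate each other
    model = AutoModelForSequenceClassification.from_pretrained(
        model_chk, num_labels=num_labels).to(device)
    args = TrainingArguments(
        output_dir=f"sweep-lr-{lr}",
        num_train_epochs=2,
        learning_rate=lr,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        evaluation_strategy="epoch",
        push_to_hub=False,
        log_level="error")
    trainer = Trainer(model=model, args=args,
                      compute_metrics=compute_metrics,
                      train_dataset=emotion_encoded["train"],
                      eval_dataset=emotion_encoded["validation"],
                      tokenizer=tokenizer)
    trainer.train()
    results[lr] = trainer.evaluate()["eval_f1"]
print(results)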
After training, we need to evaluate our model thoroughly to understand its strengths and weaknesses.
# Evaluate the model
validation_results = trainer.evaluate()
print(f"Validation Accuracy: {validation_results['eval_accuracy']:.4f}")
print(f"Validation F1 Score: {validation_results['eval_f1']:.4f}")
# Compare with our baseline model
print(f"Baseline Accuracy: {lr_score:.4f}")
print(f"Improvement: {validation_results['eval_accuracy'] - lr_score:.4f}")
# Get predictions on validation set
validation_preds = trainer.predict(emotion_encoded["validation"])
# Get predicted classes
y_preds = validation_preds.predictions.argmax(-1)
y_valid = validation_preds.label_ids
# Plot confusion matrix
labels = emotions["train"].features["label"].names
plot_confusion_matrix(y_preds, y_valid, labels)
from sklearn.metrics import classification_report
# Generate a detailed report
report = classification_report(
    y_valid, y_preds, target_names=labels)
print(report)
# This shows precision, recall, F1 for each emotion class
# Look for classes with lower performance
# Get the actual texts and analyze errors
validation_texts = emotions["validation"]["text"]
validation_labels = emotions["validation"]["label"]
# Find misclassified examples
errors = []
for i, (pred, true) in enumerate(zip(y_preds, y_valid)):
    if pred != true:
        errors.append({
            "text": validation_texts[i],
            "true": labels[true],
            "predicted": labels[pred]
        })
# Look at the first few errors
for i, error in enumerate(errors[:5]):
    print(f"Example {i+1}:")
    print(f"Text: {error['text']}")
    print(f"True label: {error['true']}")
    print(f"Predicted: {error['predicted']}")
    print("---")
Key Metrics: overall accuracy and weighted F1 summarize performance, while per-class precision and recall reveal which emotions are handled well.
Common Patterns in Errors: semantically close pairs (joy vs. love, fear vs. surprise) are confused most often, and sarcastic or ambiguous texts are frequent failure cases.
Create a function to test your model on custom inputs:
def predict_emotion(text):
    # Tokenize the input
    inputs = tokenizer(text, return_tensors="pt").to(device)
    # Get prediction
    with torch.no_grad():
        outputs = model(**inputs)
    # Get probabilities
    probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
    # Get predicted class
    predicted_class = torch.argmax(probs, dim=-1).item()
    # Get the emotion label
    emotion = labels[predicted_class]
    # Get the probability
    confidence = probs[0][predicted_class].item()
    return {
        "emotion": emotion,
        "confidence": confidence,
        "all_probs": {labels[i]: p.item() for i, p in enumerate(probs[0])}
    }
# Test with some examples
examples = [
"I can't believe I aced my exam!",
"I miss my family so much right now.",
"That driver just cut me off! So rude!"
]
for example in examples:
    result = predict_emotion(example)
    print(f"Text: {example}")
    print(f"Predicted emotion: {result['emotion']} (Confidence: {result['confidence']:.4f})")
    print("All probabilities:")
    for emotion, prob in result['all_probs'].items():
        print(f"  {emotion}: {prob:.4f}")
    print("---")
This function lets you interactively test your model on new inputs to better understand its behavior.
Now that we have a trained model, let's share it with the world by uploading it to the Hugging Face Hub.
# Login to Hugging Face
from huggingface_hub import notebook_login
notebook_login()
# This will prompt you to enter your Hugging Face
# access token to authenticate
# You can get a token from:
# https://huggingface.co/settings/tokens
If you're running this in a script instead of a notebook, run huggingface-cli login in your terminal first.
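Alternatively, you can authenticate programmatically in any environment; a sketch assuming your token is stored in an HF_TOKEN environment variable (never hard-code real tokens):
import os
from huggingface_hub import login

# Reads the token from the environment rather than embedding it in code
login(token=os.environ["HF_TOKEN"])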
# Push the model to the Hub
trainer.push_to_hub(commit_message="Training completed!")
# This uploads:
# - Model weights
# - Tokenizer configuration
# - Model configuration
# - README with model card
# Your model will be available at:
# https://huggingface.co/YOUR_USERNAME/distilbert-base-uncased-finetuned-emotion
# To create a more detailed model card, you can edit the README.md
# Either directly on the Hugging Face website, or by:
model_card = """
# Emotion Classification Model
This model can detect 6 emotions in text: sadness, joy, love, anger, fear, and surprise.
## Model Description
- Model architecture: DistilBERT (distilbert-base-uncased)
- Fine-tuned on the DAIR.AI Emotion dataset
- Training accuracy: {train_acc:.4f}
- Validation accuracy: {val_acc:.4f}
- F1 score: {f1:.4f}
## Intended Usage
This model is intended for sentiment analysis and emotion detection in English text.
## Limitations
- Only works for English text
- Struggles with sarcasm and ambiguous emotions
- May not perform well on very short texts
## Training procedure
- Trained for 2 epochs
- Learning rate: 2e-5
- Batch size: 64
""".format(train_acc=0.92, val_acc=validation_results['eval_accuracy'],
f1=validation_results['eval_f1'])
# You can then push this to the hub
# trainer.push_to_hub(commit_message="Update model card", model_card=model_card)
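One way to publish the card is to write it to README.md and upload it with huggingface_hub; a sketch assuming the repo was already created by push_to_hub above (replace YOUR_USERNAME with your username):
from huggingface_hub import HfApi

# Write the card locally, then upload it to the model repo
with open("README.md", "w") as f:
    f.write(model_card)

api = HfApi()
api.upload_file(
    path_or_fileobj="README.md",
    path_in_repo="README.md",
    repo_id="YOUR_USERNAME/distilbert-base-uncased-finetuned-emotion",
    commit_message="Update model card")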
# Once deployed, anyone can use your model with:
from transformers import pipeline
# Replace YOUR_USERNAME with your actual Hugging Face username
model_name = "YOUR_USERNAME/distilbert-base-uncased-finetuned-emotion"
# Load the model
classifier = pipeline("text-classification", model=model_name)
# Use the model
result = classifier("I'm so excited to see this working!")
print(result)
# [{'label': 'joy', 'score': 0.9874}]
Advantages:
Additional Features:
Create a simple Gradio interface for your model (optional, requires internet access):
# Install gradio
# !pip install gradio
import gradio as gr
# Define the prediction function
def predict(text):
    result = classifier(text)[0]
    return {result["label"]: result["score"]}
# Create the interface
demo = gr.Interface(
    fn=predict,
    inputs=gr.Textbox(placeholder="Enter text here..."),
    outputs=gr.Label(num_top_classes=6),
    title="Emotion Classifier",
    description="Detect emotions in text: sadness, joy, love, anger, fear, and surprise."
)
# Launch the demo
demo.launch()
# This creates an interactive web interface for your model
# You can also deploy this permanently on Hugging Face Spaces
Creating a demo makes your model more accessible to non-technical users and provides an easy way to showcase your work.
Experiment with RoBERTa, BERT, or other transformer models as your base.
Optimize learning rate, batch size, and training epochs with grid or random search.
Generate additional training examples by synonym replacement, random insertion, or word swapping.
Handle class imbalance with techniques like weighted loss, oversampling, or synthetic example generation.
Learn how to design effective prompts for large language models.
Train models to perform multiple NLP tasks simultaneously.
Train effective models with very limited labeled examples.
Create smaller, faster models that retain performance.
Detect multiple emotions in the same text (e.g., both surprise and joy).
Predict not just the emotion but its intensity on a scale.
Build a web app that analyzes emotions in social media or customer feedback.
Create a model that works across multiple languages.
Documentation: the Hugging Face Transformers and Datasets docs, plus the PyTorch documentation, cover everything used in this tutorial in more depth.
Communities: the Hugging Face forums and Discord are good places to ask questions and share your models.
You've built your first text classification model with transformers!
You now have the foundation to tackle more complex NLP tasks and build powerful AI-powered applications.
Keep exploring, experimenting, and building amazing things with NLP!