Master the fundamentals of deep learning with this interactive tutorial on feedforward neural networks, multilayer perceptrons, and the XOR problem.
Understanding the quintessential deep learning models
Deep feedforward networks, also called feedforward neural networks or multilayer perceptrons (MLPs), are the quintessential deep learning models. The goal of a feedforward network is to approximate some function f*. For example, for a classifier, y = f*(x) maps an input x to a category y; a feedforward network defines a mapping y = f(x; θ) and learns the value of the parameters θ that results in the best function approximation.
These models are called feedforward because information flows through the function being evaluated from x, through the intermediate computations used to define f, and finally to the output y.
Feedforward networks: there are no feedback connections; information flows in one direction only, from input to output.
Recurrent networks: extend feedforward networks with feedback connections, in which outputs of the model are fed back into it.
Feedforward networks form the basis of many important commercial applications.
Convolutional networks used for object recognition are a specialized kind of feedforward network.
Feedforward networks are also a conceptual stepping stone on the path to recurrent networks, which underlie many natural language applications.
Understanding the fundamental building blocks
Feedforward neural networks are called networks because they are typically represented by composing together many different functions. The model is associated with a directed acyclic graph describing how the functions are composed together.
# Example: Three functions connected in a chain
f(x) = f³(f²(f¹(x)))
# Where:
# f¹ is the first layer
# f² is the second layer
# f³ is the third layer (the output layer)
# The length of the chain gives the depth of the model,
# which is where the name "deep learning" comes from
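As a minimal runnable sketch of this chain structure (the three layer functions below are arbitrary placeholders chosen only to illustrate composition, not layers from the text):
import numpy as np

# Hypothetical layer functions, chosen only to show how a chain composes
def f1(x):
    return np.maximum(0, x)        # first layer: an elementwise nonlinearity
def f2(h):
    return 2.0 * h + 1.0           # second layer: an affine map
def f3(h):
    return float(np.sum(h))        # third (output) layer: reduce to a scalar

def f(x):
    # Depth 3: information flows x -> f1 -> f2 -> f3 -> output
    return f3(f2(f1(x)))

print(f(np.array([1.0, -2.0, 3.0])))   # 11.0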
The training data provides us with noisy, approximate examples of f*(x) evaluated at different training points. Each example x is accompanied by a label y ≈ f*(x).
Output layer: the training examples specify directly what the output layer must do at each point x, namely produce a value close to y.
Hidden layers: the training data does not show the desired output for the intermediate layers; because their behavior is not specified directly, they are called "hidden" layers.
These networks are called neural because they are loosely inspired by neuroscience. Each hidden layer of the network is typically vector valued, and each element may be interpreted as playing a role analogous to a neuron.
Why we need nonlinear transformations
Linear models, such as logistic regression and linear regression, are appealing because they can be fit efficiently and reliably, either in closed form or with convex optimization.
To extend linear models to represent nonlinear functions of x, we can apply the linear model not to x itself but to a transformed input φ(x), where φ is a nonlinear transformation.
# Instead of: y = w^T x + b
# We use: y = w^T φ(x) + b
# Where φ(x) is a nonlinear transformation that:
# - Provides a set of features describing x
# - Provides a new representation for x
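As a small illustration (a sketch under my own assumptions; the quadratic feature map and variable names below are hypothetical, not taken from the text), a model that is linear in w can fit a function that is nonlinear in x once we give it a suitable φ(x):
import numpy as np

def phi(x):
    # Nonlinear transformation: represent the scalar x by the features [x, x^2]
    return np.stack([x, x**2], axis=1)

x = np.linspace(-2.0, 2.0, 50)
y = x**2                                           # target that is nonlinear in x

Phi = np.column_stack([phi(x), np.ones_like(x)])   # add a bias column
w = np.linalg.lstsq(Phi, y, rcond=None)[0]         # closed-form linear fit
print(w)                                           # approximately [0, 1, 0]: y ≈ 1·x²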
Three strategies for choosing φ(x)
Use a very generic φ, such as the infinite-dimensional φ that is implicitly used by kernel machines based on the RBF kernel.
Manually engineer φ. Until the advent of deep learning, this was the dominant approach.
Learn φ. This is the strategy of deep learning: we have a model y = f(x; θ, w) = φ(x; θ)ᵀw, with parameters θ used to learn φ and parameters w that map from φ(x) to the desired output.
The deep learning approach is the only one that gives up on the convexity of the training problem, but the benefits outweigh the harms. This approach parametrizes the representation as φ(x; θ) and uses optimization algorithms to find the θ that corresponds to a good representation.
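A structural sketch of this parametrization follows (shapes, names, and the choice of tanh are illustrative assumptions, not details from the text): φ(x; θ) is itself a parametrized nonlinear function, and the output remains linear in w.
import numpy as np

rng = np.random.default_rng(0)

theta = {"W": rng.normal(size=(3, 4)), "c": np.zeros(4)}   # parameters of phi
w = rng.normal(size=4)                                     # linear output weights
b = 0.0

def phi(x, theta):
    # A parametrized nonlinear representation of x (tanh is an arbitrary choice)
    return np.tanh(x @ theta["W"] + theta["c"])

def f(x, theta, w, b):
    # y = phi(x; theta)^T w + b: optimizing theta makes training non-convex,
    # but the representation is learned rather than hand-designed
    return phi(x, theta) @ w + b

x = rng.normal(size=(5, 3))    # a batch of five 3-dimensional inputs
print(f(x, theta, w, b))       # five scalar outputs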
A concrete example of feedforward networks in action
The XOR function ("exclusive or") is an operation on two binary values, x₁ and x₂. When exactly one of these binary values is equal to 1, the XOR function returns 1. Otherwise, it returns 0.
| x₁ | x₂ | XOR Output |
|----|----|------------|
| 0  | 0  | 0          |
| 0  | 1  | 1          |
| 1  | 0  | 1          |
| 1  | 1  | 0          |
Suppose we choose a linear model with θ consisting of w and b. Our model is defined as:
f(x; w, b) = x^T w + b
# Mean squared error loss, averaged over the four training points X = {[0,0]ᵀ, [0,1]ᵀ, [1,0]ᵀ, [1,1]ᵀ}:
J(θ) = (1/4) Σ_{x∈X} (f*(x) - f(x; θ))²
After solving the normal equations, we obtain w = 0 and b = 1/2: the linear model simply outputs 0.5 everywhere! The reason is that a linear model must apply a fixed coefficient w₂ to x₂; it cannot make the output increase with x₂ when x₁ = 0 and decrease with x₂ when x₁ = 1, which is exactly what XOR requires.
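A quick check (a minimal sketch assuming a NumPy environment; the variable names are mine): solving the least-squares problem on the four XOR points reproduces this result.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # XOR inputs
y = np.array([0, 1, 1, 0], dtype=float)                       # XOR targets

X_b = np.column_stack([X, np.ones(4)])    # append a ones column for the bias b

# Minimizing the MSE J(θ) is ordinary least squares
theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)
w, b = theta[:2], theta[2]
print(w, b)            # w ≈ [0, 0], b ≈ 0.5
print(X_b @ theta)     # the fitted model outputs 0.5 on every input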
We introduce a simple feedforward network with one hidden layer containing two hidden units.
# Network structure:
h = f¹(x; W, c) # Hidden layer
y = f²(h; w, b) # Output layer
# Complete model:
f(x; W, c, w, b) = f²(f¹(x))
# Hidden layer computation:
h = g(W^T x + c) # where g is activation function
Most neural networks use an affine transformation controlled by learned parameters, followed by a fixed nonlinear function called an activation function.
g(z) = max{0, z}   # the rectified linear unit (ReLU)
# Properties:
# - The default recommended activation function for feedforward networks
# - Piecewise linear: composed of two linear pieces
# - Because it is nearly linear, it preserves many of the properties that
#   make linear models easy to optimize with gradient-based methods
#   and that help them generalize well
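Concretely (a tiny sketch; the sample values are arbitrary), the ReLU zeroes out negative pre-activations and passes positive ones through unchanged:
import numpy as np

def relu(z):
    # g(z) = max{0, z}, applied elementwise
    return np.maximum(0, z)

print(relu(np.array([-2.0, -0.5, 0.0, 0.5, 2.0])))   # [0. 0. 0. 0.5 2.]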
How neural networks transform the problem space
In the XOR example, the hidden layer changes the representation of the input. In the feature space defined by h, the points x = [0, 1]ᵀ and x = [1, 0]ᵀ are mapped to a single point, and the four examples become a problem that a linear model can solve, so the linear output layer can finish the job.
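The sketch below makes this concrete (a minimal NumPy version; the parameter values W, c, w, b are one well-known hand-constructed exact solution, not values a particular training run is guaranteed to find):
import numpy as np

def relu(z):
    return np.maximum(0, z)

# One exact solution to XOR with two hidden units
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])      # hidden-layer weights
c = np.array([0.0, -1.0])       # hidden-layer biases
w = np.array([1.0, -2.0])       # output-layer weights
b = 0.0                         # output-layer bias

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

H = relu(X @ W + c)    # hidden representation h = g(Wᵀx + c) for each input
y = H @ w + b          # linear output layer applied to h

print(H)   # rows: [0,0], [1,0], [1,0], [2,1]; the inputs [0,1] and [1,0] collapse to one point
print(y)   # [0. 1. 1. 0.], exactly the XOR function
Note that the output layer here is just a linear model; all the nonlinearity needed to represent XOR is supplied by the learned representation h.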
Essential insights from deep feedforward networks
Linear models alone cannot represent even a simple function like XOR. Nonlinear activation functions are crucial for learning complex patterns.
Deep learning automatically learns useful representations, eliminating the need for manual feature engineering in most cases.
Multiple layers allow complex function approximation through composition of simpler functions, enabling deep networks to model intricate patterns.
Intermediate representations are not directly specified by training data, allowing the network to learn optimal internal representations.
Deep feedforward networks overcome the limitations of linear models through learned feature representations and nonlinear transformations. They provide the foundation for understanding more complex deep learning models and have revolutionized machine learning across numerous domains.