Residual Blocks

Introduction
Residual blocks, often referred to simply as “ResBlocks”, represent one of the most influential innovations in the field of deep neural networks. They were introduced by Kaiming He et al. in the paper “Deep Residual Learning for Image Recognition”, presented at the CVPR conference in 2016. Since then, the concept of residual learning has transformed the way we design deep neural networks, particularly in the area of computer vision.
True Function
In the realm of machine learning, the “true function” is often cited as the Holy Grail. Simply put, it describes the ideal relationship between inputs and outputs that we aim to capture in our model. Suppose we have an equation $ f(x) = y $, where $ x $ represents the features of a house and $ y $ is its price. In an ideal world, our “true function” would be this equation that, when provided with the features $ x $, would always give us the exact price $ y $ of any house, eliminating all uncertainty.
Imagine you want to predict the price of a house. If we consider features like size ($ t $), location ($ l $), and the number of rooms ($ q $), the true function might be something like $ \text{Price} = a \cdot t + b \cdot l + c \cdot q $. However, this is an oversimplification. In reality, unseen factors, measurement errors, and other variables make it nearly impossible to pin down a perfect “true function”.
Within machine learning, the journey is often less about finding the exact “true function” and more about getting as close to it as possible. Our best approximation might be $ f'(x) = y + \text{error} $, where $ \text{error} $ signifies the limitations of our model, such as the amount and quality of the data and the chosen network architecture.
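To make this concrete, here is a minimal sketch of the gap between a hypothetical “true function” and what we actually observe; the coefficients and the noise level are invented purely for illustration.

import random

random.seed(0)

# A hypothetical "true function": price = a*t + b*l + c*q (the coefficients are invented)
def true_price(t, l, q):
    return 3000.0 * t + 50000.0 * l + 10000.0 * q

# What we can actually observe: the true price plus unmodeled factors and measurement error
def observed_price(t, l, q):
    return true_price(t, l, q) + random.gauss(0.0, 20000.0)

t, l, q = 120.0, 2.0, 3.0   # size in square meters, a location score, number of rooms
print("true price:    ", true_price(t, l, q))
print("observed price:", observed_price(t, l, q))

Any model fitted to such noisy observations can only approximate the true function, which is precisely the $ f'(x) = y + \text{error} $ situation described above.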
Nested Functions
The idea of “nested function classes” can be best visualized through matryoshka dolls, the Russian nesting dolls. Imagine each doll represents a neural network, with the smallest being the simplest. We can represent this smallest doll by $ f_1(x) $. As we move to larger dolls, the networks become more complex, potentially represented by $ f_2(x) $, $ f_3(x) $, and so on.

Matryoshka Dolls
Each time we move to a larger doll (or box), we are essentially adding more capacity to our model. Moving from $ f_1(x) $ to $ f_2(x) $ means we’ve encompassed everything $ f_1(x) $ could do and added new capabilities.
In this analogy, the bigger the box we are working within, the closer we get to the “true function”. By considering a broader range of possibilities, we increase the likelihood of our function getting closer to the ideal.
Whether it’s through the pursuit of the “true function” or navigating between different nested function classes, the ultimate goal is always to craft models that faithfully and effectively represent the complexity and variability of the real world. The perfect equation might be unattainable, but the journey of approximation is what truly drives innovation in the field of machine learning.
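As a rough sketch of the nesting idea (the layer sizes here are arbitrary), a linear model over more features contains the smaller model as a special case: copy the smaller model’s weights and set the extra ones to zero.

import torch
import torch.nn as nn

torch.manual_seed(0)

# f1: the "smallest doll" -- a linear model over 5 features
f1 = nn.Linear(5, 1)

# f2: a "larger doll" -- a linear model over 8 features (the 5 original plus 3 extra)
f2 = nn.Linear(8, 1)

# Make f2 reproduce f1 exactly: copy f1's weights and zero out the extra ones
with torch.no_grad():
    f2.weight.zero_()
    f2.weight[:, :5] = f1.weight
    f2.bias.copy_(f1.bias)

x = torch.randn(4, 5)                     # 4 samples with the 5 original features
extra = torch.randn(4, 3)                 # 3 additional features f1 never sees
x_wide = torch.cat([x, extra], dim=1)     # the same samples, as seen by f2

print(torch.allclose(f1(x), f2(x_wide)))  # True: f2 can do everything f1 can

The same argument, applied to depth rather than width, is what motivates the identity function discussed next.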
Identity Function
In mathematical terms, the identity function is defined by the equation $ f(x) = x $. This means that for any input $ x $, the output is identical to the input itself. At first glance, this function might seem redundant or unnecessary, but in deep neural networks it serves a very special purpose.
When constructing deep neural networks with multiple layers, data scientists face a dilemma. Adding extra layers, although potentially beneficial, can also make training more challenging or even reduce the model’s efficacy. In this context, the identity function offers an intriguing solution. If a newly added layer acts merely as an “identity function,” it does not change the data, thereby preserving the model’s integrity.
Adding layers, in theory, poses the risk of harming overall performance. However, if a new layer is trained to act as the identity function, this ensures the model’s performance will not be compromised. In other words, even with an additional layer, the output remains consistent with the input, avoiding unwanted distortions.
The real brilliance of the identity function in deep neural networks is its ability to allow a more refined adjustment to training data. By adding layers that act as the identity function, the network has the flexibility to adjust and better adapt to data without the worry of distorting or impairing the outputs.
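As a quick illustration, a fully connected layer can be configured to compute the identity function exactly by giving it an identity weight matrix and a zero bias; PyTorch also provides nn.Identity as an explicit no-op layer. This is only a sketch of the idea, with an arbitrary feature size of 4.

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 4)

# An extra layer configured to behave as the identity function f(x) = x
identity_layer = nn.Linear(4, 4)
with torch.no_grad():
    identity_layer.weight.copy_(torch.eye(4))
    identity_layer.bias.zero_()

print(torch.allclose(identity_layer(x), x))  # True: the layer leaves the data unchanged
print(torch.equal(nn.Identity()(x), x))      # nn.Identity is the built-in no-op layer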
Residual Mapping: An Innovative Approach
Traditionally, networks attempt to learn a direct mapping between inputs and outputs. In contrast, residual blocks focus on learning the discrepancy or “residual” between these inputs and outputs. Mathematically, instead of zeroing in on $ f(x) $, a residual block directs its attention to the difference $ g(x) = f(x) - x $.
The residual is the difference between the desired output and the input that is carried forward unchanged by the identity connection. In other words, the residual is what the neural network needs to learn, on top of the input, to generate the desired output.
Residual blocks are built on the idea that if the desired output is already close to or even identical to the input, the function learned by that block might approximate an identity function. This means that the residual mapping $ g(x) $ could become very small or even zero.
A defining feature of residual blocks is the “shortcut connection.” This connection allows the original input $ x $ to be passed directly to the block’s output, where it’s added to $ g(x) $. Thus, the final output becomes $ f(x) = g(x) + x $.
A common question regarding residual blocks is why $ f(x) = g(x) + x $ can act as an identity function. The answer lies in the nature of the residual: if $ g(x) $ is very small (because the desired output is close to the input), the function effectively becomes $ f(x) = x $.
The magic of the residual block is that it aims to learn only the difference between the input and the desired output, rather than the direct mapping. This approach allows the neural network to easily approximate identity functions when needed, making training more robust.
The residual connection offers the network the flexibility to retain the original information if it’s beneficial for learning. In deep networks, this can sidestep common issues like gradient vanishing, ensuring more effective training.
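One informal way to see why the shortcut helps with vanishing gradients: since $ f(x) = g(x) + x $, the chain rule gives $ \partial f / \partial x = \partial g / \partial x + I $, where $ I $ is the identity. Even if the gradient flowing through $ g $ becomes very small across a deep stack of layers, the identity term preserves a direct path for the gradient back to earlier layers, which helps keep very deep residual networks trainable.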
Backpropagation and optimization algorithms work in tandem to update the layers’ weights. Depending on the need, the network can adjust the output of $ g(x) $ to approximate zero, making the residual block’s output resemble its input.
When training a neural network, residual blocks don’t ensure that $ g(x) $ is always zero. Instead, the training process determines whether useful features are learned in $ g(x) $. When combined with $ x $, these features can significantly improve information representation and, consequently, model performance.
Residual Block Example
import torch
import torch.nn as nn

# Defining a simple "function" (actually, a neural model) to represent g(x)
class SimpleFunction(nn.Module):
    def __init__(self):
        super(SimpleFunction, self).__init__()
        self.fc = nn.Linear(10, 10)  # A fully connected layer, just for illustration

    def forward(self, x):
        return self.fc(x)

# Defining the residual block
class ResidualBlock(nn.Module):
    def __init__(self):
        super(ResidualBlock, self).__init__()
        self.g = SimpleFunction()

    def forward(self, x):
        return self.g(x) + x

# Let's test the residual block
res_block = ResidualBlock()

# Defining loss function and optimizer
loss_function = nn.MSELoss()
optimizer = torch.optim.Adam(res_block.parameters(), lr=0.01)

# Creating a dummy dataset for training
inputs = torch.randn(100, 10)
targets = torch.randn(100, 10)

# Training
epochs = 1000
for epoch in range(epochs):
    # Forward pass
    outputs = res_block(inputs)

    # Calculating the loss
    loss = loss_function(outputs, targets)

    # Backward pass and optimization
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Print the loss every 10 epochs
    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}')

# Testing the residual block after training
x = torch.randn(1, 10)  # A tensor of size (1, 10) with random values
output = res_block(x)

print("\nInput (x):", x)
print("Output (g(x) + x) after training:", output)
- Definition of the simple function $ g(x) $: The code starts by defining a neural model called SimpleFunction, which represents the function $ g(x) $. As mentioned in the text, in a residual block the network focuses on the difference or “residual”, i.e., $ g(x) = f(x) - x $; in this code, SimpleFunction attempts to model that residual function. SimpleFunction is a simple neural network with a single fully connected layer.
- Definition of the residual block: The ResidualBlock class is defined next. As described in the text, the residual block uses a “shortcut connection”; the implementation does this by adding the original input $ x $ to the output of $ g(x) $, which is exactly what the line return self.g(x) + x in the forward method of this block does.
- Training: To train the residual block, a mean squared error loss function (MSELoss) and an Adam optimizer are used. This part of the code is a practical representation of the training mentioned in the text, where it is stated that “when training a neural network, residual blocks do not guarantee that $ g(x) $ will always be zero”. Through the loss function and the optimizer, the code adjusts the weights of the residual block to approach the desired output. Dummy data are used for training, and the loss is printed every 10 epochs for monitoring.
- Testing: After training, the residual block is tested with a random input. This simply demonstrates how to use the block after training.
The code is a direct representation of the concept of residual blocks. Instead of trying to learn a direct mapping between inputs and outputs, the network attempts to learn only the difference (the residual) between them. This is achieved by adding the original input to the output of the function $ g(x) $. Thus, the final output effectively becomes $ f(x) = g(x) + x $. As mentioned in the text, if the desired output is close to the input, then $ g(x) $ will approach zero, making $ f(x) $ effectively $ x $, representing an identity function.
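As a small follow-up experiment (assuming the ResidualBlock class defined above is still in scope), we can train the block to reproduce its own input; in that case the residual $ g(x) $ should shrink toward zero, which is exactly the identity behavior described in the text.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Train the block so that the desired output equals the input (an identity mapping)
block = ResidualBlock()
optimizer = torch.optim.Adam(block.parameters(), lr=0.01)
loss_function = nn.MSELoss()

inputs = torch.randn(100, 10)

for _ in range(1000):
    optimizer.zero_grad()
    loss = loss_function(block(inputs), inputs)  # targets are the inputs themselves
    loss.backward()
    optimizer.step()

# After training, the residual g(x) should be close to zero on average,
# so the block behaves approximately as the identity function.
with torch.no_grad():
    print("mean |g(x)|:    ", block.g(inputs).abs().mean().item())
    print("mean |f(x) - x|:", (block(inputs) - inputs).abs().mean().item())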
Applying Residual Blocks in a CNN
Convolutional networks (CNNs) have proven exceptionally successful in handling computer vision tasks. However, as we attempt to train deeper networks to capture more complex features, we can encounter issues such as gradient vanishing. One proposed solution to this problem is the introduction of residual blocks.
A typical residual block in a CNN involves:
- One or more convolutional layers.
- A non-linear activation (such as ReLU) after each convolutional layer.
- A “shortcut connection” that skips over these layers and connects directly to the output.
The shortcut, or residual connection, allows the original input to pass through directly and be added to the output of the last convolutional layer of the block. Mathematically, this can be represented as $ f(x) = g(x) + x $, where $ g(x) $ is the transformation learned by the convolutional layers.
Here’s a simplified example of how a residual block might be implemented in PyTorch:
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super(ResidualBlock, self).__init__()

        # The g(x) transformation is represented by these layers:
        self.g = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_channels)
        )

        # The shortcut connection
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        g_x = self.g(x)
        f_x = g_x + self.shortcut(x)
        return nn.ReLU()(f_x)
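A brief usage sketch for the block above (the channel counts, stride, and input size are chosen arbitrarily): when the stride is not 1 or the channel count changes, the 1x1 convolution in the shortcut projects $ x $ so that it can be added to $ g(x) $.

import torch

block = ResidualBlock(in_channels=32, out_channels=64, stride=2)
x = torch.randn(8, 32, 56, 56)   # a batch of 8 feature maps: 32 channels, 56x56 pixels
y = block(x)
print(y.shape)                   # torch.Size([8, 64, 28, 28]) -- both paths were downsampled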
Conclusion
Residual blocks represent a significant innovation in neural network architecture. By focusing on learning residuals and offering shortcut connections, these blocks facilitate the training of deep and complex models, opening up new possibilities in the field of deep learning.
Cite this article
You can cite this article in your academic work.
@article{rodrigues2023blocks,
  title   = {Residual Blocks},
  author  = {Rodrigues, Thiago Luiz},
  journal = {URL https://rodriguesthiago.me/posts/residual_blocks/},
  year    = {2023}
}