8.6. Residual Networks (ResNet) and ResNeXt

As we design increasingly deeper networks, it becomes imperative to understand how adding layers can increase the complexity and expressiveness of the network. Even more important is the ability to design networks where adding layers makes networks strictly more expressive rather than just different. To make some progress, we need a bit of mathematics.

8.6.1. Function Classes

Consider $\mathcal{F}$, the class of functions that a specific network architecture (together with learning rates and other hyperparameter settings) can reach. That is, for all $f \in \mathcal{F}$ there exists some set of parameters $\mathbf{W}$ that can be obtained through training on a suitable dataset. Assume that $f^*$ is the "truth" function we really would like to find.

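If $f^*$ is in $\mathcal{F}$, we are in good shape, but typically we will not be quite so lucky. Instead, we try to find some $f^*_\mathcal{F}$ that is our best bet within $\mathcal{F}$. For instance, given a dataset with features $\mathbf{X}$ and labels $\mathbf{y}$ and a loss $L$, we might find it by solving the following optimization problem: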

$$f^*_\mathcal{F} \stackrel{\mathrm{def}}{=} \mathop{\mathrm{argmin}}_f L(\mathbf{X}, \mathbf{y}, f) \text{ subject to } f \in \mathcal{F}.$$

It is only reasonable to assume that if we design a different and more powerful architecture $\mathcal{F}'$ we should arrive at a better outcome. In other words, we would expect that $f^*_{\mathcal{F}'}$ is "better" than $f^*_\mathcal{F}$. However, if $\mathcal{F} \not\subseteq \mathcal{F}'$ there is no guarantee that this happens; $f^*_{\mathcal{F}'}$ might even be worse.
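To make the role of nesting concrete, the following sketch replaces network architectures with tiny, finite function classes of the form f(x) = a·x. The data, the candidate slopes, and the helper names (`mse`, `best_in_class`) are illustrative assumptions, not anything taken from the ResNet paper. The point is only that the argmin over a superset can never have a higher loss, while a larger but non-nested class offers no such guarantee.

```python
# Toy illustration: a function class is a finite set of slopes a in f(x) = a * x.
# All numbers below are made up for illustration.
X = [1.0, 2.0, 3.0]
y = [2.1, 3.9, 6.2]   # roughly y = 2x


def mse(a, X, y):
    """Mean squared error of f(x) = a * x on the dataset."""
    return sum((a * xi - yi) ** 2 for xi, yi in zip(X, y)) / len(X)


def best_in_class(slopes, X, y):
    """Return (best slope, its loss): the argmin of the loss over the class."""
    best = min(slopes, key=lambda a: mse(a, X, y))
    return best, mse(best, X, y)


F_small = {1.0, 2.0}
F_nested = F_small | {2.05, 4.0}     # contains F_small
F_other = {0.5, 3.0, 3.5, 5.0}       # larger, but does not contain F_small

_, loss_small = best_in_class(F_small, X, y)
_, loss_nested = best_in_class(F_nested, X, y)
_, loss_other = best_in_class(F_other, X, y)

print(f"F_small:  {loss_small:.4f}")
print(f"F_nested: {loss_nested:.4f}  (never worse than F_small)")
print(f"F_other:  {loss_other:.4f}  (no guarantee; here it is worse)")
```

Because `F_nested` still contains every candidate of `F_small`, its best loss can only stay the same or decrease. `F_other` has more candidates but lacks the good one, so its best loss is higher.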

This is the key question that He et al. (2016) asked: in what circumstances is a larger function class guaranteed to contain the smaller one? The answer is: when larger networks can always subsume the role of smaller ones, i.e., when we can always set the additional parameters so that the added layers reproduce the identity mapping.

Fig. 8.6.1 Left: non-nested function classes — adding capacity does not guarantee improvement. Right: nested function classes as designed by ResNet — larger classes always contain the smaller class, ensuring the added layers can always express the identity mapping.

ResNet solves this problem by making it easy for every additional layer to contain the identity function as one of its elements. If the ideal mapping is $f(\mathbf{x}) = \mathbf{x}$, the residual mapping $g(\mathbf{x}) = f(\mathbf{x}) - \mathbf{x} = 0$ is easier to learn: we only need to push the weights and biases of the added layers towards zero.
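A small numerical sketch of this argument follows. It assumes a hypothetical one-layer branch g(x) = Wx + b acting on plain Python lists, and it leaves out batch normalization and the ReLU for simplicity; the helper names are not from the source. With W and b set to zero the residual layer returns its input unchanged, whereas a plain layer with the same zero parameters loses the input entirely.

```python
def linear(W, b, x):
    """Compute W @ x + b for a list-of-lists matrix W and list vectors b, x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


def plain_layer(W, b, x):
    """A plain layer has to learn the full mapping f(x) itself."""
    return linear(W, b, x)


def residual_layer(W, b, x):
    """A residual layer only learns g(x) = f(x) - x and adds x back."""
    return [x_i + g_i for x_i, g_i in zip(x, linear(W, b, x))]


x = [0.5, -1.0, 2.0]
W_zero = [[0.0] * 3 for _ in range(3)]
b_zero = [0.0] * 3

print(plain_layer(W_zero, b_zero, x))     # all zeros: the input is lost
print(residual_layer(W_zero, b_zero, x))  # the input passes through unchanged
```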

8.6.2. Residual Blocks

ResNet follows VGG's full 3×3 convolutional layer design. The residual block has two 3×3 convolutional layers with the same number of output channels. Each convolutional layer is followed by a batch normalization layer and a ReLU activation function. Then, we skip these two convolution operations and add the input directly before the final ReLU activation. This design requires that the two convolutional layers produce output of the same shape as the input so that they can be added together.

Fig. 8.6.2 A residual block. The skip connection (orange path) bypasses the two convolution layers, adding the input $\mathbf{x}$ directly to the learned residual $F(\mathbf{x})$. The network learns $F(\mathbf{x}) = H(\mathbf{x}) - \mathbf{x}$ instead of $H(\mathbf{x})$.

If we wish to change the number of channels, we need to introduce an additional 1×1 convolutional layer to transform the input in the skip connection into the desired shape for the addition operation. Let us have a look at the code below.

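import torch
from torch import nn
from torch.nn import functional as F
# d2l supplies the Classifier, Trainer, and FashionMNIST helpers used later.
from d2l import torch as d2l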


class Residual(nn.Module):
    """The Residual block of ResNet models."""

    def __init__(self, num_channels, use_1x1conv=False, strides=1):
        super().__init__()
        self.conv1 = nn.LazyConv2d(num_channels, kernel_size=3,
                                      padding=1, stride=strides)
        self.conv2 = nn.LazyConv2d(num_channels, kernel_size=3, padding=1)
        self.conv3 = nn.LazyConv2d(num_channels, kernel_size=1,
                                      stride=strides) if use_1x1conv else None
        self.bn1 = nn.LazyBatchNorm2d()
        self.bn2 = nn.LazyBatchNorm2d()

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3 is not None:
            X = self.conv3(X)
        Y += X
        return F.relu(Y)

import jax
from jax import numpy as jnp
from flax import linen as nn
# d2l supplies the Classifier base class used by the JAX ResNet later.
from d2l import jax as d2l


class Residual(nn.Module):
    """The Residual block of ResNet models."""
    num_channels: int
    use_1x1conv: bool = False
    strides: int = 1

    @nn.compact
    def __call__(self, X, training=False):
        Y = nn.Conv(self.num_channels, kernel_size=(3, 3),
                    strides=self.strides, padding=1)(X)
        Y = nn.BatchNorm(use_running_average=not training)(Y)
        Y = nn.activation.relu(Y)
        Y = nn.Conv(self.num_channels, kernel_size=(3, 3), padding=1)(Y)
        Y = nn.BatchNorm(use_running_average=not training)(Y)
        if self.use_1x1conv:
            X = nn.Conv(self.num_channels, kernel_size=(1, 1),
                        strides=self.strides)(X)
        return nn.activation.relu(Y + X)

import tensorflow as tf


class Residual(tf.keras.Model):
    """The Residual block of ResNet models."""

    def __init__(self, num_channels, use_1x1conv=False, strides=1):
        super().__init__()
        self.conv1 = tf.keras.layers.Conv2D(num_channels, kernel_size=3,
                                              padding='same', strides=strides)
        self.conv2 = tf.keras.layers.Conv2D(num_channels, kernel_size=3,
                                              padding='same')
        self.conv3 = tf.keras.layers.Conv2D(num_channels, kernel_size=1,
                       strides=strides) if use_1x1conv else None
        self.bn1 = tf.keras.layers.BatchNormalization()
        self.bn2 = tf.keras.layers.BatchNormalization()

    def call(self, X):
        Y = tf.keras.activations.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3 is not None:
            X = self.conv3(X)
        Y += X
        return tf.keras.activations.relu(Y)

This code generates two types of networks: one where we add the input to the output before applying the ReLU nonlinearity (use_1x1conv=False), and one where we first adjust the channels and resolution of the input by means of a 1×1 convolution before adding (use_1x1conv=True). The diagram below shows the shapes at each step.
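As a quick sanity check on the two cases, the sketch below applies the standard convolution output-size formula, floor((n + 2p - k) / s) + 1, to both branches of the block. The helper names and the 6×6 example input are assumptions chosen for illustration; the kernel sizes, paddings, and strides are those of the PyTorch `Residual` class above.

```python
def conv_out(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1


def residual_shapes(n, strides):
    """Return (main-branch size, skip-branch size) for an n x n input."""
    main = conv_out(n, kernel=3, stride=strides, padding=1)  # conv1
    main = conv_out(main, kernel=3, stride=1, padding=1)     # conv2
    # With strides > 1 the skip path needs the 1x1 convolution to match.
    skip = conv_out(n, kernel=1, stride=strides) if strides > 1 else n
    return main, skip


print(residual_shapes(6, strides=1))  # (6, 6): shapes match, plain addition
print(residual_shapes(6, strides=2))  # (3, 3): the 1x1 conv halves the skip path
```

The addition in the block is only defined when both entries agree, which is why a strided block must also set use_1x1conv=True.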

8.6.3. ResNet Model

The first two layers of ResNet are the same as those of GoogLeNet: a 7×7 convolutional layer with 64 output channels and a stride of 2, followed by a 3×3 max-pooling layer with a stride of 2. The difference is that ResNet adds a batch normalization layer after each convolutional layer.

class ResNet(d2l.Classifier):
    def __init__(self, arch, lr=0.1, num_classes=10):
        super(ResNet, self).__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(self.b1())
        for i, b in enumerate(arch):
            self.net.add_module(f'b{i+2}', self.block(*b, first_block=(i==0)))
        self.net.add_module('last', nn.Sequential(
            nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(),
            nn.LazyLinear(num_classes)))

    def b1(self):
        return nn.Sequential(
            nn.LazyConv2d(64, kernel_size=7, stride=2, padding=3),
            nn.LazyBatchNorm2d(), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

    def block(self, num_residuals, num_channels,
               first_block=False):
        blk = []
        for i in range(num_residuals):
            if i == 0 and not first_block:
                blk.append(Residual(num_channels,
                    use_1x1conv=True, strides=2))
            else:
                blk.append(Residual(num_channels))
        return nn.Sequential(*blk)


class ResNet18(ResNet):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__(
            ((2, 64), (2, 128), (2, 256), (2, 512)),
            lr, num_classes)

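# NOTE: the helper methods b1() and block() are not shown in this listing;
# they are assumed to mirror the PyTorch versions above.
class ResNet(d2l.Classifier):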
    arch: tuple
    lr: float = 0.1
    num_classes: int = 10

    @nn.compact
    def __call__(self, X, training=False):
        X = self.b1()(X, training=training)
        for i, b in enumerate(self.arch):
            X = self.block(*b, first_block=(i==0))(X, training=training)
        return nn.Dense(self.num_classes)(X.mean(axis=(1, 2)))
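To see where the name ResNet-18 comes from and how the feature maps shrink, the following sketch traces the `arch` tuple used by `ResNet18` for a 96×96 input (the size used in the training section). The helper name `conv_out` is hypothetical, and the size formula assumes the PyTorch-style paddings shown in `b1` and `Residual`.

```python
def conv_out(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer along one axis."""
    return (n + 2 * padding - kernel) // stride + 1


arch = ((2, 64), (2, 128), (2, 256), (2, 512))  # (num_residuals, num_channels)

# Stem b1: 7x7 conv (stride 2, padding 3), then 3x3 max pool (stride 2, padding 1).
size = conv_out(96, kernel=7, stride=2, padding=3)
size = conv_out(size, kernel=3, stride=2, padding=1)
print(f"after b1: 64 channels, {size}x{size}")

weight_layers = 1  # the 7x7 convolution in the stem
for i, (num_residuals, num_channels) in enumerate(arch):
    if i > 0:
        # Every stage except the first halves the resolution in its first block.
        size = conv_out(size, kernel=3, stride=2, padding=1)
    weight_layers += 2 * num_residuals  # two 3x3 convolutions per residual block
    print(f"after b{i + 2}: {num_channels} channels, {size}x{size}")

weight_layers += 1  # final fully connected layer
print(f"weight layers (excluding 1x1 shortcut convolutions): {weight_layers}")
```

The first stage keeps the resolution because the max-pooling layer in the stem has already reduced it, which is what the `first_block` flag encodes.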

8.6.4. Training

We train ResNet on the Fashion-MNIST dataset using the same setting as before. The skip connections give gradients a direct path around each pair of convolutions, which tends to make deep ResNets easier to optimize than plain networks of comparable depth.
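One way to see the effect on gradients, sketched under the simplifying assumption that we ignore the final ReLU so that a block computes $\mathbf{y} = \mathbf{x} + F(\mathbf{x})$, is to apply the chain rule to the loss $L$:

$$\frac{\partial L}{\partial \mathbf{x}} = \frac{\partial L}{\partial \mathbf{y}} \left( \mathbf{I} + \frac{\partial F(\mathbf{x})}{\partial \mathbf{x}} \right).$$

The identity term passes the upstream gradient through unchanged, so the signal reaching earlier layers does not depend solely on the Jacobian of the convolutional branch.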

model = ResNet18(lr=0.01)
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(96, 96))
model.apply_init([next(iter(data.get_dataloader(True)))[0]])
trainer.fit(model, data)

# Expected output:
# Epoch 10 | train loss: 0.014 | val accuracy: 0.906

8.6.5. ResNeXt

One of the challenges in the design of ResNet is the trade-off between nonlinearity and dimensionality within a given block. ResNeXt (Xie et al., 2017) offers a solution by combining the idea of VGG's repeated block structure with grouped convolutions.

In a ResNeXt block, instead of one wide convolution, we use many narrow convolutions (grouped convolutions) in parallel and sum their outputs:

$$\mathbf{y} = \mathbf{x} + \sum_{t=1}^{C} \mathcal{T}_t(\mathbf{x}),$$

where $C$ is called the cardinality of the transformation set, i.e., the number of parallel transformations $\mathcal{T}_t$. Xie et al. (2017) report that, at comparable computational cost, increasing cardinality is more effective than increasing width or depth.
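The cost side of this trade-off can be made concrete by counting weights. For a k×k convolution with c_in input and c_out output channels split into g groups, each group maps c_in/g channels to c_out/g channels, so the total is k·k·c_in·c_out/g weights. The helper below is an illustrative sketch that ignores biases; its name and the example channel counts are assumptions.

```python
def conv_weights(c_in, c_out, kernel, groups=1):
    """Number of weights (no biases) in a grouped kernel x kernel convolution."""
    assert c_in % groups == 0 and c_out % groups == 0
    per_group = kernel * kernel * (c_in // groups) * (c_out // groups)
    return groups * per_group


dense = conv_weights(128, 128, kernel=3)               # ordinary convolution
grouped = conv_weights(128, 128, kernel=3, groups=32)  # 32 groups

print(dense)             # 147456
print(grouped)           # 4608
print(dense // grouped)  # 32: the cost shrinks by the number of groups
```

At a fixed budget, the weights saved by grouping can be spent on more channels, which is the room that higher cardinality exploits.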

class ResNeXtBlock(nn.Module):
    """The ResNeXt block."""

    def __init__(self, num_channels, groups, bot_mul,
                 use_1x1conv=False, strides=1):
        super().__init__()
        bot_channels = int(round(num_channels * bot_mul))
        self.conv1 = nn.LazyConv2d(bot_channels, kernel_size=1, stride=1)
        self.conv2 = nn.LazyConv2d(bot_channels, kernel_size=3,
            stride=strides, padding=1, groups=bot_channels//groups)
        self.conv3 = nn.LazyConv2d(num_channels, kernel_size=1)
        self.bn1   = nn.LazyBatchNorm2d()
        self.bn2   = nn.LazyBatchNorm2d()
        self.bn3   = nn.LazyBatchNorm2d()
        if use_1x1conv:
            self.downsample = nn.Sequential(
                nn.LazyConv2d(num_channels, kernel_size=1, stride=strides),
                nn.LazyBatchNorm2d())
        else:
            self.downsample = None

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = F.relu(self.bn2(self.conv2(Y)))
        Y = self.bn3(self.conv3(Y))
        if self.downsample is not None:
            X = self.downsample(X)
        return F.relu(Y + X)
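Note how the constructor arguments interact in this listing: `conv2` is created with `groups=bot_channels//groups`, so as written the `groups` argument effectively sets the number of channels per group rather than the group count. The sketch below simply replays that arithmetic; `resnext_config` is a hypothetical helper and the example values are chosen for illustration.

```python
def resnext_config(num_channels, groups, bot_mul):
    """Replay the channel arithmetic from ResNeXtBlock.__init__."""
    bot_channels = int(round(num_channels * bot_mul))
    conv_groups = bot_channels // groups      # value passed to conv2's `groups`
    channels_per_group = bot_channels // conv_groups
    return bot_channels, conv_groups, channels_per_group


# 32 output channels, groups=16, no bottleneck reduction (bot_mul=1).
print(resnext_config(32, groups=16, bot_mul=1))    # (32, 2, 16)
# 256 output channels, groups=4, bottleneck halved (bot_mul=0.5).
print(resnext_config(256, groups=4, bot_mul=0.5))  # (128, 32, 4)
```

When choosing values, `bot_channels` should be divisible by `groups` so that the grouped convolution splits the channels evenly.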

Summary

Nested function classes are desirable: if the extra layers of a larger network can express the identity function, the larger function class contains the smaller one, so adding layers cannot reduce what the network is able to represent.
Residual mappings $F(\mathbf{x}) = H(\mathbf{x}) - \mathbf{x}$ are easier to learn than the original mapping when the desired mapping is close to the identity: the block only needs to push the weights of its convolutional branch towards zero.
The residual block is a fundamental building block of modern deep convolutional networks. It makes it practical to train very deep networks (100+ layers) that were previously very hard to optimize.
ResNeXt introduces grouped convolutions and cardinality as a third dimension (alongside depth and width) for scaling neural networks.
ResNet and its variants remain competitive baselines for image classification, object detection, and semantic segmentation tasks.

Exercises

What are the major differences between the Inception block in GoogLeNet (Section 8.4) and the residual block? After removing some paths in the Inception block, how are they related to each other?
Refer to Table 1 in the ResNet paper (He et al., 2016) to implement different variants of the network. Do they achieve the same accuracy on the Fashion-MNIST dataset?
For deeper networks, ResNet introduces a "bottleneck" architecture to reduce model complexity. Try to implement it. Refer to Figure 5 in the ResNet paper for details.
In subsequent versions of ResNet, the authors changed the "convolution, batch normalization, and activation" structure to the "batch normalization, activation, and convolution" structure. Make this improvement yourself. See Figure 1 in He et al. (2016b) for details.
Why can't we just increase the complexity of functions without bound, even if it fits the training data perfectly?
In ResNeXt, the number of groups $g$ and bottleneck ratio are hyperparameters. Using the Fashion-MNIST dataset, explore how varying cardinality affects accuracy vs. computation cost.