A ConvNet for the 2020s (ConvNeXt)
Main Idea
Since the emergence of ViT, which applied Transformer-based architectures to vision tasks, classification performance has improved sharply. Following ViT, methods such as Swin Transformer were developed to extend Transformers to segmentation and object detection. Swin Transformer notably takes a hybrid approach that leverages several ConvNet priors. Calling it "hybrid" is generous, however: it mainly borrows the power of Transformers rather than fully exploiting the inductive biases inherent to ConvNets.
The authors set out to verify the power of a pure ConvNet (without any Transformer components) by taking the standard ResNet (architecture + training methodology) and progressively incorporating recent innovations to train it more like a Vision Transformer. In particular, whenever a new vision Transformer model is introduced, it typically comes with new training methodologies that claim performance improvements — yet these training recipes are rarely applied back to existing ConvNets. Through a series of experiments, the authors show that applying modern training recipes and redesigning convolution block structure can yield a more intuitive architecture that approaches Transformer-level performance.
A common misconception about ConvNeXt is that it achieves comparable performance to Transformers with significantly fewer FLOPs. In reality, ConvNeXt is purely a ConvNet architecture (with its inherent inductive biases), but its model size is by no means smaller than that of comparable Transformers. However, since more compression methods have been developed for convolution layers than for Transformers, I personally believe its more intuitive structure leaves greater room for future compression.
Performance is primarily reported on ResNet-50, with each accuracy measurement averaged over 3 runs with different random seeds.
Background Knowledge
Examples of Representative ConvNets
VGGNet, Inceptions, ResNe(X)t, DenseNet, MobileNet, EfficientNet, and RegNet
Key Properties of ConvNets
The following properties arise from the "sliding window" mechanism used in convolution:
- Translation equivariance — particularly useful for tasks like object detection. For those who sometimes confuse equivariance and invariance: shifting the input shifts the output feature map by the same amount, while the feature values themselves are unchanged, so equivariance here can also be viewed as invariance with the output transformation g taken to be the identity mapping.
- Weight sharing
Change 1: Training Methodology
The authors apply training recipes similar to DeiT and Swin Transformer to ResNet-50 and examine the resulting performance improvements.
- Increased training epochs: 90 → 300
- Changed optimizer: Adam → AdamW
- Added data augmentation: Mixup, Cutmix, RandAugment, Random Erasing
- Added regularization schemes:
- Stochastic depth — randomly drops entire ResBlocks during training.
- Label smoothing
These changes yielded the following improvement:
- ResNet-50: 76.1% → 78.8% (+2.7%)
This suggests that much of the gap between traditional ConvNets and Vision Transformers may have stemmed from differences in training methodology rather than architecture.
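The stochastic depth regularization mentioned above can be sketched as follows (a minimal sketch assuming PyTorch; `StochasticDepth` and `drop_prob` are hypothetical names, following the common per-sample residual-branch formulation):

```python
import torch
import torch.nn as nn

class StochasticDepth(nn.Module):
    """Randomly drops the residual branch per sample during training.

    A minimal sketch of the regularization described above; `drop_prob`
    is a hypothetical hyperparameter name.
    """
    def __init__(self, drop_prob: float = 0.1):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training or self.drop_prob == 0.0:
            return x
        keep = 1.0 - self.drop_prob
        # One Bernoulli draw per sample; broadcast over C, H, W.
        mask = torch.rand(x.shape[0], 1, 1, 1, device=x.device) < keep
        return x * mask / keep  # rescale so the expected value is unchanged
```

In a residual network this wraps the residual branch, e.g. `x + StochasticDepth(p)(block(x))`, so dropped samples fall back to the identity path.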
Change 2: Macro-level Structural Changes
Changing Stage Compute Ratio to (3, 3, 9, 3)
The number of blocks assigned to each stage in the original ResNet was determined largely empirically.
Swin Transformer shows a slightly different stage ratio: (1, 1, 3, 1) for small models and (1, 1, 9, 1) for large models. To match Swin-T's FLOPs ratio, the authors changed the stage ratio from (3, 4, 6, 3) to (3, 3, 9, 3).
- ResNet-50: 78.8% → 79.4% (+0.6%, cumulative: +3.3%)
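Translated into code, the new stage ratio might look like this (assuming PyTorch; `make_stage` is a hypothetical helper, and the inner blocks are placeholders for the block design developed in later steps):

```python
import torch.nn as nn

def make_stage(dim: int, depth: int) -> nn.Sequential:
    # Placeholder blocks; the real ConvNeXt block is developed in later changes.
    return nn.Sequential(*[
        nn.Conv2d(dim, dim, kernel_size=3, padding=1) for _ in range(depth)
    ])

# ResNet-50 uses (3, 4, 6, 3); ConvNeXt adopts Swin-T's 1:1:3:1 ratio
# at matched FLOPs, i.e. (3, 3, 9, 3). Downsampling between stages is
# handled by separate layers, omitted here.
depths = (3, 3, 9, 3)
dims = (96, 192, 384, 768)  # per-stage widths after the later width change
stages = nn.ModuleList(make_stage(d, n) for d, n in zip(dims, depths))
```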
Changing the Stem Layer to Conv(ks=4, stride=4)
The stem cell design determines how the input image is first processed at the beginning of the architecture. ResNet used a 7x7 Conv layer (stride 2, for 2x downsample) followed by max-pooling (another 2x downsample) for a total 4x downsample. Vision Transformers use even more aggressive patching with very large kernel sizes (14x14 or 16x16) with non-overlapping convolution (kernel size equals stride).
The authors replaced the ResNet stem with a Swin-style "patchify" stem: a non-overlapping convolution with kernel size 4x4 and stride 4 (4x downsample).
- ResNet-50: 79.4% → 79.5% (+0.1%, cumulative: +3.4%)
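A minimal sketch of the patchify stem, assuming PyTorch and the 96-channel stage width used later:

```python
import torch
import torch.nn as nn

# "Patchify" stem: non-overlapping 4x4 convolution (kernel size == stride),
# a single 4x downsample replacing ResNet's 7x7 conv + max-pool pair.
stem = nn.Conv2d(3, 96, kernel_size=4, stride=4)

x = torch.randn(1, 3, 224, 224)
y = stem(x)  # spatial resolution 224 / 4 = 56
```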
Change 3: Applying the ResNeXt Idea
ResNeXt used grouped convolution in its bottleneck block to reduce FLOPs while increasing network width (number of channels in hidden layers). The authors push this idea to its extreme and use depthwise convolution, i.e. grouped convolution with the number of groups equal to the number of channels. Depthwise convolution performs a per-channel spatial weighted sum, similar to the weighted-sum operation in self-attention: it mixes information only along the spatial dimension, leaving channel mixing to the 1x1 convolutions.
Following Swin Transformer, the width was increased from 64 to 96.
- ResNet-50: 79.5% → 80.5% (+1.0%, cumulative: +4.4%)
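Depthwise convolution as described above is just a grouped convolution with `groups` equal to the channel count; a PyTorch sketch (framework assumed):

```python
import torch
import torch.nn as nn

dim = 96
# Depthwise convolution: groups == channels, so each channel is filtered
# independently. Information is mixed only along the spatial dimension;
# channel mixing is left entirely to the 1x1 convolutions.
dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

x = torch.randn(2, dim, 56, 56)
y = dwconv(x)  # shape is preserved
# Weight shape is (dim, 1, 3, 3): one 3x3 filter per channel,
# versus (dim, dim, 3, 3) for a dense convolution.
```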
(Figure: (a) the ResNeXt block, (b) the inverted bottleneck, (c) the inverted bottleneck with the depthwise conv layer moved up.)
Change 4: Applying Inverted Bottleneck
In Transformers, the inverted bottleneck refers to the MLP block, whose hidden dimension is 4x larger than the input dimension. In ConvNets, the inverted bottleneck has been widely used since MobileNetV2 and is conceptually similar to the Transformer version. While the depthwise convolution's FLOPs increase (it now runs at the expanded width), the overall network FLOPs decrease because the 1x1 shortcut conv layers in the downsampling residual blocks shrink significantly.
- ResNet-50: 80.5% → 80.6% (+0.1%, cumulative: +4.5%)
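A sketch of the inverted bottleneck ordering, assuming PyTorch and omitting normalization and activation layers for brevity (`block_b` is a hypothetical name matching variant (b) in the figure above):

```python
import torch
import torch.nn as nn

dim = 96
# Inverted bottleneck: the hidden width is 4x the block's input/output width,
# mirroring the Transformer MLP's 4x expansion. Note the depthwise conv now
# runs at the expanded width (4 * dim), which raises its FLOPs.
block_b = nn.Sequential(
    nn.Conv2d(dim, 4 * dim, kernel_size=1),                      # 1x1 expand
    nn.Conv2d(4 * dim, 4 * dim, kernel_size=3,
              padding=1, groups=4 * dim),                        # depthwise, wide
    nn.Conv2d(4 * dim, dim, kernel_size=1),                      # 1x1 project back
)

out = block_b(torch.randn(1, dim, 14, 14))  # shape is preserved
```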
Change 5: Using Larger Kernels
Self-attention in Vision Transformers has a non-local property, making the receptive field effectively global. In contrast, ConvNets have traditionally used small 3x3 kernels (since VGGNet) for efficient GPU computation. Swin Transformer applies local windows to self-attention with a minimum kernel size of 7x7.
To increase the kernel size, the depthwise conv must first be moved before the 1x1 conv. This mirrors how MSA (Multi-head Self Attention) precedes the MLP layer in Transformers.
Experiments showed that performance saturates beyond 7x7 kernels. FLOPs were reduced from 4.6G to 4.2G while maintaining the same accuracy.
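With the depthwise conv moved up and enlarged to 7x7, the ordering becomes (PyTorch sketch, normalization and activation again omitted; `block_c` is a hypothetical name):

```python
import torch
import torch.nn as nn

dim = 96
# Depthwise conv moved to the front, back at the narrow width (dim instead of
# 4 * dim), so enlarging its kernel to 7x7 stays cheap. This mirrors the
# Transformer ordering: MSA (spatial mixing) before the MLP (channel mixing).
block_c = nn.Sequential(
    nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim),  # 7x7 depthwise
    nn.Conv2d(dim, 4 * dim, kernel_size=1),                     # 1x1 expand
    nn.Conv2d(4 * dim, dim, kernel_size=1),                     # 1x1 project
)

out = block_c(torch.randn(1, dim, 14, 14))  # shape is preserved
```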
Change 6: Changing Activation and Normalization Layers
(Figure: candidate block designs; the rightmost ConvNeXt block is the final design chosen by the authors.)
Replacing ReLU with GELU
While ReLU is still widely used in ConvNets and was used in the original Transformer, later NLP Transformers (BERT, GPT-2) and ViT adopted GELU (Gaussian Error Linear Unit). Applying GELU yielded no performance improvement, but it demonstrated that GELU can work well in ConvNets.
Fewer Activation Functions
In Transformer blocks, activation is only applied once within the MLP block. Following this design, the authors removed all GELU activations from the residual block except for a single one between the two 1x1 layers.
- ResNet-50: 80.6% → 81.3% (+0.7%, cumulative: +5.2%) — matching Swin-T performance.
Fewer Normalization Layers
The authors kept only a single BN (Batch Normalization) before the 1x1 conv and removed all other normalization layers.
- ResNet-50: 81.3% → 81.4% (+0.1%, cumulative: +5.3%)
Replacing Batch Normalization with Layer Normalization
Transformers have demonstrated strong performance using LN. The authors confirmed that LN works well in this setting and even yields improvement.
- ResNet-50: 81.4% → 81.5% (+0.1%, cumulative: +5.4%)
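Putting Changes 3 through 6 together, the resulting block can be sketched as follows (assuming PyTorch; `ConvNeXtBlock` here is a simplified illustration rather than the authors' exact implementation, and stochastic depth and other training-time details are omitted):

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Sketch of the final block: 7x7 depthwise conv -> LN -> 1x1 expand
    -> GELU -> 1x1 project, plus a residual connection. There is a single
    normalization and a single activation, as in a Transformer MLP block.
    LayerNorm normalizes the channel dimension, hence the permutes."""
    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)           # the single normalization layer
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # 1x1 conv as Linear (channels-last)
        self.act = nn.GELU()                    # the single activation
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)               # (N, C, H, W) -> (N, H, W, C)
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)               # back to (N, C, H, W)
        return shortcut + x
```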
Separate Downsampling Layers
Swin Transformer merges patches by concatenating channels from 2x2 neighborhoods (4C) and projecting down to 2C. Expressed as a convolution, this is a kernel-size-2, stride-2 operation that doubles the channel count. The authors replaced ResNet's traditional 3x3 stride-2 downsampling with a separate layer of this form between stages. Naively substituting it caused training to diverge, but adding LN wherever the spatial resolution changes stabilized training (following Swin Transformer's approach).
- ResNet-50: 81.5% → 82.0% (+0.5%, cumulative: +5.9%)
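A sketch of such a separate downsampling layer between stages (assuming PyTorch; `Downsample` is a hypothetical name):

```python
import torch
import torch.nn as nn

class Downsample(nn.Module):
    """Separate downsampling layer between stages: LN (added for training
    stability, as in Swin) followed by a 2x2 stride-2 conv that halves the
    spatial resolution and doubles the channel count."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.reduction = nn.Conv2d(dim, 2 * dim, kernel_size=2, stride=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # LayerNorm over channels requires channels-last, hence the permutes.
        x = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        return self.reduction(x)
```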
Training Details
(Pre-)Training Settings
Finetuning Settings