Activate or Not: Learning Customized Activation (ACON)
Ma, Ningning, et al. "Activate or Not: Learning Customized Activation."
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
Prerequisites
Swish Activation Function
swish(x) := x · σ(βx) = x / (1 + e^(-βx))
- It smoothly interpolates between a linear function and ReLU, depending on β.
- When β = 0, it acts like a linear function f(x) = x/2.
- Conversely, when β → ∞, the sigmoid component acts as a 0-1 activation, making Swish behave like ReLU.
- When β = 1, it acts as the Sigmoid-weighted Linear Unit (SiL) function used in reinforcement learning.
- β can be a constant or, depending on the model, a trainable parameter.
- It is frequently used as a replacement for ReLU in generative models.
- Recently, Swish has also gained renewed attention in implicit representation networks.
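The limiting behaviors above are easy to check numerically. A minimal NumPy sketch (function names are my own, not from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x, beta=1.0):
    """Swish: x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)

x = np.linspace(-5, 5, 11)
# beta = 0: Swish collapses to the linear function x / 2
assert np.allclose(swish(x, beta=0.0), x / 2)
# large beta: Swish approaches ReLU = max(x, 0)
assert np.allclose(swish(x, beta=20.0), np.maximum(x, 0), atol=1e-3)
```

With β = 1 this is exactly the SiL function mentioned above.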
Sigmoid
σ(x) = 1 / (1 + e^(-x))
Here we note Sigmoid not as an activation per se, but as notation for compactly expressing equations. Remember that Swish can ultimately be expressed as the product of the input and the sigmoid of the input (with β multiplication).
Maxout Family
Maxout generalizes activations like ReLU by taking the maximum over learned linear functions. The paper by Goodfellow et al. shows that this maximum can approximate arbitrary convex functions.
Main Idea
ACON (Activate or Not) Activation Function
ACON-C(x) := (p1 - p2)x · σ(β(p1 - p2)x) + p2x
When using ACON activation, a particular layer's activation can either pass linearly or be activated non-linearly.
The authors propose an activation function called ACON (and further, Meta-ACON). ACON is a trainable activation that decides whether to activate a neuron or not, adapted to each layer's characteristics.
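As a sketch, ACON-C is a one-liner in NumPy; note that with p1 = 1, p2 = 0 it reduces exactly to Swish (illustrative code, not the authors' implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def acon_c(x, p1, p2, beta):
    """ACON-C: (p1 - p2) * x * sigmoid(beta * (p1 - p2) * x) + p2 * x."""
    d = (p1 - p2) * x
    return d * sigmoid(beta * d) + p2 * x

x = np.linspace(-3, 3, 13)
# p1 = 1, p2 = 0 recovers Swish: x * sigmoid(beta * x)
assert np.allclose(acon_c(x, 1.0, 0.0, 1.0), x * sigmoid(x))
# p1 = p2 collapses to the linear function p2 * x (no activation at all)
assert np.allclose(acon_c(x, 0.5, 0.5, 2.0), 0.5 * x)
```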
How Was the ACON Formula Derived?
First, we need to look at a smoothed version of the maximum function max(x1, ..., xn). The maximum itself is not differentiable, but it has a smooth, differentiable approximation: S_β(x1, ..., xn) = Σᵢ xᵢ · e^(βxᵢ) / Σᵢ e^(βxᵢ)
Here, β acts as a switching factor:
- When β → ∞, the function acts as the Maximum Function.
- When β → 0, the function acts as the Arithmetic Mean.
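Both limits can be checked directly (the shift by the maximum inside the exponent is a standard numerical-stability trick, not part of the paper's formula):

```python
import numpy as np

def smooth_max(xs, beta):
    """Smooth maximum: sum_i x_i * e^(beta*x_i) / sum_i e^(beta*x_i)."""
    xs = np.asarray(xs, dtype=float)
    w = np.exp(beta * (xs - xs.max()))  # shift for numerical stability
    return (xs * w).sum() / w.sum()

vals = [1.0, 2.0, 5.0]
# beta -> infinity: approaches the true maximum
assert abs(smooth_max(vals, beta=100.0) - 5.0) < 1e-6
# beta -> 0: approaches the arithmetic mean
assert abs(smooth_max(vals, beta=0.0) - np.mean(vals)) < 1e-12
```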
Common activation functions in neural networks can be expressed in Maxout form: max(ηa(x), ηb(x))
For example, ReLU can be thought of as ηa(x) = x, ηb(x) = 0, making it a member of the Maxout Family. Leaky ReLU, FReLU, and others also belong to this family.
The goal of this paper is to use the smooth Maximum Function to approximate each activation function in the Maxout Family with a smooth, differentiable counterpart.
Swish Lives Inside ACON
It has been intuitively understood that Swish is a smooth version of ReLU. When you substitute into the Smooth Maximum Function and expand, the Swish formula naturally emerges. The authors state that this formally shows Swish is a Smooth Approximation of ReLU.
PReLU (Parametric ReLU) also has a corresponding smooth function. Finally, when expressing the weights of each linear function as p1 and p2 (the most general form), mapping Maxout Family to ACON Family yields the generalized ACON-C formula.
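Written out for n = 2, the algebra is short. Dividing the numerator and denominator of the smooth maximum by e^(βη_b):

```latex
S_\beta(\eta_a, \eta_b)
  = \frac{\eta_a e^{\beta \eta_a} + \eta_b e^{\beta \eta_b}}
         {e^{\beta \eta_a} + e^{\beta \eta_b}}
  = (\eta_a - \eta_b)\,\sigma\big(\beta(\eta_a - \eta_b)\big) + \eta_b
```

Substituting η_a(x) = x, η_b(x) = 0 gives x · σ(βx), i.e. Swish; substituting η_a(x) = p1·x, η_b(x) = p2·x gives exactly the ACON-C formula.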
Properties of ACON
This graph shows ACON-C with p1=1.2, p2=-0.8 for various values of β.
- When β is large, it responds like a maximum function with non-linear characteristics.
- When β is close to 0, it approximates the mean function with linear characteristics.
This figure shows the ACON activation and its derivatives.
- Left: How the activation function varies with p1, p2 coefficients when β is fixed.
- Center: How the ACON derivative changes as β varies.
- Right: How the ACON derivative changes with p1, p2 coefficients when β is fixed.
From the derivatives, we can observe that:
- p1 and p2 determine the upper and lower bounds of the derivative, respectively.
- β determines how quickly the derivative converges to the upper/lower bounds set by p1 and p2.
In Swish, the only free parameter is β, which controls how quickly the derivative reaches its bounds; the bounds themselves are fixed. In ACON, p1 and p2 set the bound values, and they too can be learned. The authors argue that learnable bounds ease optimization, and they demonstrate this advantage through experiments.
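The bound claim can be verified numerically: far from the origin, the slope of ACON-C approaches p1 on the right and p2 on the left. A finite-difference sketch using the example values p1 = 1.2, p2 = −0.8 from the figure above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def acon_c(x, p1, p2, beta):
    d = (p1 - p2) * x
    return d * sigmoid(beta * d) + p2 * x

def num_grad(f, x, h=1e-4):
    """Central-difference approximation of the derivative."""
    return (f(x + h) - f(x - h)) / (2 * h)

p1, p2, beta = 1.2, -0.8, 1.0
f = lambda t: acon_c(t, p1, p2, beta)
# Far to the right the slope approaches p1; far to the left, p2
assert abs(num_grad(f, 50.0) - p1) < 1e-3
assert abs(num_grad(f, -50.0) - p2) < 1e-3
```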
Let the Network Decide Everything: Meta-ACON
Meta-ACON goes beyond making β a learnable parameter — it estimates β from the input feature map through FC layers.
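A rough sketch of the Meta-ACON switching-factor module: β is predicted per channel from a spatially pooled feature passed through two small linear maps. The weight shapes, the reduction ratio r, and the random values below are stand-ins for the learned layers, not the paper's trained parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def meta_acon(x, p1, p2, w1, w2):
    """Meta-ACON sketch: beta is predicted per channel from the input.

    x: feature map of shape (C, H, W); w1: (C, C//r); w2: (C//r, C).
    beta_c = sigmoid of the pooled feature pushed through w1 then w2,
    following the channel-wise design described in the paper."""
    pooled = x.mean(axis=(1, 2))       # spatial pooling -> (C,)
    beta = sigmoid(pooled @ w1 @ w2)   # per-channel switching factor in (0, 1)
    beta = beta[:, None, None]         # broadcast over H, W
    d = (p1 - p2) * x
    return d * sigmoid(beta * d) + p2 * x

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 4
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C, C // r)) * 0.1
w2 = rng.standard_normal((C // r, C)) * 0.1
y = meta_acon(x, p1=1.0, p2=0.0, w1=w1, w2=w2)
assert y.shape == x.shape
```

Because β now depends on the input, two different samples flowing through the same layer can receive different degrees of non-linearity, which is exactly the behavior illustrated in the figure below.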
This diagram compares ACON and Meta-ACON activations in the last BottleNeck layer of ResNet50, with 7 randomly sampled examples.
- With ACON, all 7 samples show the same β distribution.
- With Meta-ACON, the 7 samples show different β distributions. Smaller β values lead to more linear responses, while larger β values produce more non-linear responses.
Results
Looking at the ShuffleNetV2 results on ImageNet Classification, not only is training faster, but the error rate decreases when using Meta-ACON. Overall, the accuracy improvement grows as model size increases and Meta-ACON is applied.
Meta-ACON shows strong performance on ImageNet Classification compared to other activations. It also outperforms other activation functions on Object Detection and Semantic Segmentation for certain backbones.
Conclusion
Through the relationship between ReLU and Swish, the paper arrives at a generalized formula encompassing new activation functions (the ACON Family), and from this foundation derives a trainable activation function.
Trainable activation functions are not entirely new — ACON is not the first. Whether it can serve as a universally applicable activation across various sub-tasks remains uncertain. Nonetheless, this paper is significant for formally showing the relationship between ReLU and Swish through simple algebraic manipulation, while proposing a new Activation Family.
TL;DR
- Extends the generalization beyond the existing Maxout family to introduce the ACON Family concept.
- Makes the parameters that determine each activation in the ACON Family learnable, proposing ACON as a new activation function.
- Swish was previously found via NAS and known to work well, but without clear explanation why — the ACON Family framework provides a theoretical basis for understanding this.
References
- Ramachandran, Prajit, Barret Zoph, and Quoc V. Le. "Searching for activation functions." arXiv preprint arXiv:1710.05941 (2017). [paper]
- Goodfellow, Ian, et al. "Maxout networks." International conference on machine learning. PMLR, 2013. [paper]
- Han Cai, Chuang Gan, Ligeng Zhu, and Song Han. "TinyTL: Reduce Memory, Not Parameters for Efficient On-Device Learning." NeurIPS 2020. [paper]