
Activate or Not: Learning Customized Activation (ACON)

· paper-review

Ma, Ningning, et al. "Activate or Not: Learning Customized Activation."
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.

Prerequisites

Swish Activation Function

swish(x) := x × σ(βx) = x / (1 + e^(−βx))

Swish activation function

Sigmoid

σ(x) = 1 / (1 + e^(−x))

Here we introduce Sigmoid not as an activation per se, but as notation for expressing equations compactly. Remember that Swish can ultimately be written as the product of the input and the sigmoid of the input (scaled by β).
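As a quick sanity check, here is a minimal plain-Python sketch of Swish (the function names and the numerically stable sigmoid are my own, not from the paper):

```python
import math

def sigmoid(x):
    # Numerically stable logistic sigmoid
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def swish(x, beta=1.0):
    # swish(x) = x * sigma(beta * x)
    return x * sigmoid(beta * x)

# As beta grows, Swish approaches ReLU; as beta -> 0, it approaches x/2.
print(swish(2.0, beta=100.0))   # ~2.0 (ReLU-like on the positive side)
print(swish(-2.0, beta=100.0))  # ~0.0 (ReLU-like on the negative side)
print(swish(2.0, beta=1e-8))    # ~1.0 (linear x/2)
```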

Maxout Family

This is one of the foundational concepts behind activation functions like ReLU. The Maxout paper by Goodfellow et al. shows that taking the maximum over linear pieces can approximate arbitrary convex functions.

Main Idea

ACON (Activate or Not) Activation Function

ACON-C(x) := (p1 - p2)x · σ(β(p1 - p2)x) + p2x

ACON activation

When using ACON activation, a particular layer's activation can either pass linearly or be activated non-linearly.

The authors propose an activation function called ACON (and further, Meta-ACON). ACON is a trainable activation that decides whether to activate a neuron or not, adapted to each layer's characteristics.

How Was the ACON Formula Derived?

First, we need to look at a smoothed version of the maximum function max(x1, ..., xn). The maximum itself is not differentiable everywhere, but its smooth approximation is:

S_β(x1, ..., xn) = Σᵢ xᵢ · e^(βxᵢ) / Σᵢ e^(βxᵢ)

Here, β acts as a switching factor: as β → ∞, S_β converges to the maximum of the inputs, and as β → 0, it converges to their arithmetic mean.
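To see this switching behavior numerically, here is a small plain-Python sketch of the smooth maximum (the helper name and the max-subtraction stability trick are my own additions):

```python
import math

def smooth_max(xs, beta):
    # S_beta(x1..xn) = sum(x_i * e^(beta*x_i)) / sum(e^(beta*x_i))
    # Subtract the max before exponentiating for numerical stability
    # (the standard softmax trick); the ratio is unchanged.
    m = max(xs)
    ws = [math.exp(beta * (x - m)) for x in xs]
    return sum(x * w for x, w in zip(xs, ws)) / sum(ws)

xs = [1.0, 3.0, -2.0]
print(smooth_max(xs, beta=50.0))   # ~3.0, the true maximum
print(smooth_max(xs, beta=1e-8))   # ~0.6667, the arithmetic mean
```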

Common activation functions in neural networks can be expressed in Maxout form: max(ηa(x), ηb(x))

For example, ReLU can be thought of as ηa(x) = x, ηb(x) = 0, making it a member of the Maxout Family. Leaky ReLU, FReLU, and others also belong to this family.

The goal of this paper is to use the smooth Maximum Function to approximate each activation function in the Maxout Family with a smooth, differentiable counterpart.

Swish Lives Inside ACON

Maxout family vs ACON family

It has been intuitively understood that Swish is a smooth version of ReLU. When you substitute ReLU's Maxout form (ηa(x) = x, ηb(x) = 0) into the Smooth Maximum Function and expand, the Swish formula naturally emerges. The authors state that this formally shows Swish is a smooth approximation of ReLU.
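Written out, the substitution is a one-line computation:

```latex
S_\beta(x, 0)
  = \frac{x \cdot e^{\beta x} + 0 \cdot e^{\beta \cdot 0}}{e^{\beta x} + e^{\beta \cdot 0}}
  = \frac{x \, e^{\beta x}}{e^{\beta x} + 1}
  = \frac{x}{1 + e^{-\beta x}}
  = x \cdot \sigma(\beta x)
  = \mathrm{Swish}(x)
```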

PReLU (Parametric ReLU) also has a corresponding smooth function. Finally, when expressing the weights of each linear function as p1 and p2 (the most general form), mapping Maxout Family to ACON Family yields the generalized ACON-C formula.

Properties of ACON

ACON example

This graph shows ACON-C with p1=1.2, p2=-0.8 for various values of β.
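The curves in the graph can be reproduced directly from the ACON-C formula. A minimal plain-Python sketch, using the figure's p1 = 1.2, p2 = −0.8 (function names and the stable sigmoid are my own):

```python
import math

def sigmoid(x):
    # Numerically stable logistic sigmoid
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def acon_c(x, p1, p2, beta):
    # ACON-C(x) = (p1 - p2) * x * sigma(beta * (p1 - p2) * x) + p2 * x
    return (p1 - p2) * x * sigmoid(beta * (p1 - p2) * x) + p2 * x

# Large beta: approaches max(p1*x, p2*x); beta = 0: the mean ((p1+p2)/2)*x.
print(acon_c(2.0, 1.2, -0.8, beta=100.0))  # ~2.4 = 1.2 * 2
print(acon_c(2.0, 1.2, -0.8, beta=0.0))    # 0.4 = 0.2 * 2
```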

ACON properties

This figure shows the ACON activation and its derivatives.

From the derivatives, we can observe that the gradient of ACON-C approaches p1 as x → ∞ and p2 as x → −∞; in other words, p1 and p2 set the gradient bounds. In Swish, only the hyperparameter β controls how quickly the (fixed) bounds are reached. In ACON, p1 and p2 determine the bound values themselves, and they can be learned. The authors argue that learnable bounds make optimization easier and demonstrate this advantage through experiments.

Let the Network Decide Everything: Meta-ACON

Meta-ACON goes beyond making β a learnable parameter: it generates β dynamically from the input feature map through a small module of fully connected layers.
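A heavily simplified sketch of a per-channel β generator: the paper uses two FC (1×1 conv) layers with a channel bottleneck, whereas here each layer is reduced to a scalar weight per channel, and all names are illustrative rather than the paper's:

```python
import math

def sigmoid(x):
    # Numerically stable logistic sigmoid
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def meta_acon_beta(feature_map, w1, w2):
    """Compute a per-channel switching factor beta from the input.

    feature_map: list of channels, each an H x W list of lists.
    w1, w2: per-channel scalar weights standing in for the paper's
            two bottlenecked FC layers (a deliberate simplification).
    """
    betas = []
    for c, channel in enumerate(feature_map):
        # Global average pooling over the spatial dimensions
        h, w = len(channel), len(channel[0])
        pooled = sum(sum(row) for row in channel) / (h * w)
        # Two linear maps followed by a sigmoid gate
        betas.append(sigmoid(w2[c] * (w1[c] * pooled)))
    return betas

# One 2x2 input with two channels
fm = [[[1.0, 3.0], [5.0, 7.0]],   # channel 0: spatial mean 4.0
      [[0.0, 0.0], [0.0, 0.0]]]   # channel 1: spatial mean 0.0
print(meta_acon_beta(fm, w1=[1.0, 1.0], w2=[1.0, 1.0]))
```

Because β gates how "switched on" the activation is, a near-zero β pushes the layer toward a linear response for that input, which matches the paper's activate-or-not framing.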

Meta-ACON distribution

This diagram compares ACON and Meta-ACON activations in the last BottleNeck layer of ResNet50, with 7 randomly sampled examples.

Results

Looking at the ShuffleNetV2 results on ImageNet Classification, not only is training faster, but the error rate decreases when using Meta-ACON. Overall, the accuracy improvement grows as model size increases and Meta-ACON is applied.

Results - detection and segmentation

Results - comparison

Meta-ACON shows strong performance on ImageNet Classification compared to other activations. It also outperforms other activation functions on Object Detection and Semantic Segmentation for certain backbones.

Conclusion

Through the relationship between ReLU and Swish, the authors derive a generalized formula encompassing a family of activation functions (the ACON Family), and build a trainable activation function on that foundation.

Trainable activation functions are not entirely new — ACON is not the first. Whether it can serve as a universally applicable activation across various sub-tasks remains uncertain. Nonetheless, this paper is significant for formally showing the relationship between ReLU and Swish through simple algebraic manipulation, while proposing a new Activation Family.


References

  1. Ramachandran, Prajit, Barret Zoph, and Quoc V. Le. "Searching for activation functions." arXiv preprint arXiv:1710.05941 (2017). [paper]
  2. Goodfellow, Ian, et al. "Maxout networks." International conference on machine learning. PMLR, 2013. [paper]
  3. Han Cai, Chuang Gan, Ligeng Zhu, and Song Han. "TinyTL: Reduce Memory, Not Parameters for Efficient On-Device Learning." NeurIPS 2020. [paper]