ReLU Function and Derived Functions: A Review

The activation function plays an important role in training deep neural networks (DNNs) and improving their performance. The rectified linear unit (ReLU) function provides the non-linearity a DNN needs. However, few papers systematically organize and compare the various ReLU-style activation functions. Most work focuses on the efficiency and accuracy of a particular activation function within a given model, but pays little attention to the nature of these functions and the differences between them. This paper therefore organizes the ReLU function and its derived functions, and compares the accuracy of the different ReLU variants on the MNIST dataset. From the experimental results, the ReLU function performs best, while the SELU and ELU functions perform relatively poorly.


INTRODUCTION
In recent years, deep learning has become one of the most promising and fastest-growing research directions in artificial intelligence and in the computer field as a whole [1,25,26]. The convolutional neural network is a very important part of deep learning [24]. It can be seen everywhere in daily life, playing an important role in image classification, natural language processing, speech recognition, and text classification [19,20,21].
In the early days of deep learning research, the sigmoid and tanh functions were widely used in convolutional models. Both are S-shaped saturating functions, and the gradient-dispersion problem easily arises during training. Krizhevsky et al. employed the rectified linear unit (ReLU) as the activation function for the first time in the 2012 ImageNet ILSVRC competition [1,2]. The ReLU function has good sparsity, fast convergence, and simple computation, and it effectively alleviates the gradient-dispersion problem caused by sigmoid and tanh. However, since the gradient of ReLU is always zero for negative inputs, neurons may "die" during training [3]. From 2013 to 2015, scholars proposed improved functions such as Leaky ReLU, ELU, and PReLU to alleviate the "dying neuron" problem [4,6,5]. ELU may still suffer from gradient dispersion, and its derivative involves exponential operations that are costly to compute; Leaky ReLU and PReLU introduce an additional hyperparameter α that must be tuned differently for each classification scenario, which increases the difficulty of model training. This article gives a brief review of the ReLU function and its derived functions, and compares their forms, advantages, and disadvantages.
At the same time, this study uses sigmoid, ReLU, ELU, Leaky ReLU, SELU, and GELU in comparative experiments on the MNIST dataset to observe their accuracy at different epochs [22,23,5,4,11,13]. We hope this study can point out a direction for future improvement of activation functions.

ReLU activation function
The proposal and application of the ReLU activation function addresses the gradient explosion and vanishing problems of the sigmoid and tanh functions [6].

ReLU is defined as follows:

\[ \mathrm{ReLU}(x) = \begin{cases} x, & x > 0 \\ 0, & x \le 0 \end{cases} \]
The ReLU formula shows that if the input x is less than or equal to 0, the output is 0; if the input x is greater than 0, the output equals the input. The derivative of ReLU is:

\[ \mathrm{ReLU}'(x) = \begin{cases} 1, & x > 0 \\ 0, & x \le 0 \end{cases} \]

If the input x is greater than 0, the derivative equals 1; if the input is less than or equal to 0, the derivative is 0. When adopting the ReLU activation function, we therefore never obtain very small gradient values: the gradient is either 0 (so some gradients return nothing) or 1. Introducing the ReLU function into a neural network also introduces a great deal of sparsity [17]. In a neural network, this means the activation matrix contains many zeros. When a certain percentage (for example 50%) of activations are zero, we call the network sparse. This can improve efficiency in terms of time and space complexity (constant values usually require less space and lower computational cost) [18]. Yoshua Bengio et al. found that this property of ReLU can actually make neural networks perform better, in addition to the aforementioned efficiency gains in time and space [7].
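As a minimal NumPy sketch of the piecewise definition and derivative above (function names are illustrative):

```python
import numpy as np

def relu(x):
    # ReLU(x) = x for x > 0, 0 otherwise
    return np.maximum(0.0, x)

def relu_grad(x):
    # derivative: 1 for x > 0, 0 otherwise (so gradients are either passed through or dropped)
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0.  0.  0.  1.  1. ]
```

The zeros produced on the negative side are exactly the sparsity discussed above.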
ReLU can also be extended to Noisy ReLU, which adds Gaussian noise. It is used in restricted Boltzmann machines to solve computer vision tasks [8]. Although the sparsity of the ReLU function solves the gradient-vanishing problem caused by "S-shaped" soft-saturating activation functions, the hard saturation of ReLU's negative half-axis to 0 may lead to "dead neurons" and also makes the data distribution non-zero-mean. The model may therefore experience dead neurons during training.

ELU activation function
The exponential linear unit (ELU) was first proposed by Djork-Arné Clevert et al. [5]. ELU speeds up the learning of deep neural networks and improves classification accuracy; like the rectified linear unit, Leaky ReLU, and parametric ReLU, ELU uses the identity for positive inputs to mitigate the vanishing-gradient problem [9]. Compared with other activation functions, ELU has better learning characteristics. Compared with ReLU, ELU takes negative values, which pushes the mean unit activation closer to 0, similar to batch normalization but with lower computational complexity. The ELU formula is as follows:

\[ \mathrm{ELU}(x) = \begin{cases} x, & x > 0 \\ \alpha\,(e^{x} - 1), & x \le 0 \end{cases} \]

A mean activation close to zero brings the normal gradient closer to the unit natural gradient, reducing the bias-shift effect and thus speeding up learning.
Although LReLU and PReLU also take negative values, they cannot ensure a noise-robust deactivation state.
So far, however, ELU still has several problems: since it contains exponential calculations, computation takes longer; it cannot avoid the gradient-explosion problem; and the neural network cannot learn the value of α by itself.
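A minimal NumPy sketch of the ELU formula above, assuming α = 1 as in the original formulation:

```python
import numpy as np

def elu(x, alpha=1.0):
    # ELU(x) = x for x > 0, alpha * (exp(x) - 1) otherwise
    neg = alpha * (np.exp(np.minimum(x, 0.0)) - 1.0)  # clamp so exp is only evaluated on the negative side
    return np.where(x > 0, x, neg)

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(elu(x))         # negative inputs saturate smoothly towards -alpha
print(elu(x).mean())  # mean activation sits closer to zero than ReLU's on the same inputs
```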

Leaky ReLU activation function
Leaky ReLU was first proposed by Andrew L. Maas from the Department of Computer Science at Stanford University [10]. The research shows that, when a deep rectifier network is adopted as the acoustic model for a 300-hour conversational speech recognition task with a simple training procedure and no pre-training, Leaky ReLU achieves a 2% reduction in word error rate compared with its sigmoid counterpart. The experiments of He, K. et al. show that Leaky ReLU can achieve better results than ReLU and PReLU, but few people actually employ it [6].
In recent years, DNNs with Leaky ReLU have proven to be good acoustic models in speech recognition. Zeiler et al. (2013) trained rectifier networks with up to 12 hidden layers on a proprietary voice-search dataset containing hundreds of hours of training data [14]. After supervised training, the rectifier DNNs significantly outperformed sigmoid DNNs. Dahl et al. (2013) applied rectifier nonlinearities and dropout regularization to DNNs on a broadcast-news LVCSR task with 50 hours of training data [15]. The rectifier DNNs with dropout outperformed the sigmoid networks without dropout.
The Leaky ReLU formula is as follows:

\[ \mathrm{LReLU}(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases} \]

where α is a small fixed constant (for example 0.01). Similar to ELU, Leaky ReLU avoids the dead-ReLU problem because it keeps a small gradient on the negative side. Because it involves no exponential calculation, it is faster to compute than ELU. However, Leaky ReLU still cannot avoid the gradient-explosion problem, and the neural network cannot learn the α value.
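A short NumPy sketch of Leaky ReLU and its gradient, using the commonly quoted default slope of 0.01 (an assumption, since the text does not fix α):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # LeakyReLU(x) = x for x > 0, alpha * x otherwise; alpha is a fixed hyperparameter
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # the gradient never becomes exactly zero, so units cannot "die"
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, 0.0, 2.0])
print(leaky_relu(x), leaky_relu_grad(x))
```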

PReLU activation function
Parametric ReLU (PReLU) was first proposed by Kaiming He et al. in 2015. Their analysis shows that the biggest disadvantage of ReLU is that the x < 0 part causes neuron death [6]. If we want neurons not to die, this part of the function must produce gradients, i.e., it needs to be transformed. PReLU is another attempt to fix the "dying ReLU" problem. It gives the ReLU function a slope α on the negative side, so that when x < 0 the output is not 0 but a line with a small slope, where α is a learnable parameter. If α is a fixed constant, the function reduces to Leaky ReLU. PReLU converges faster and has lower training error, and introducing the parameter α into the activation function does not lead to overfitting [6]. In the experiments of Wei QingJie, a deep convolutional network combining regularization with the PReLU activation function was adopted for image retrieval, which improved retrieval accuracy and resolved overfitting issues [29].
Based on the learnable activation and an improved initialization, Kaiming He et al. achieved a 4.94% top-5 error rate on the ImageNet 2012 classification dataset, a 26% relative improvement over the ILSVRC 2014 winner (GoogLeNet, 6.66%) [16]. This result was the first on this dataset to exceed the reported human-level performance (5.1%).
The piecewise PReLU is expressed as:

\[ \mathrm{PReLU}(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases} \]

As long as α is not equal to 1, this not only guarantees non-linearity but also guarantees that neurons will not die. The α here can be learned, although in practice it is often simply initialized to a relatively small number.
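A small NumPy sketch of the idea that α is trainable: the gradient of PReLU with respect to α is x on the negative part and 0 elsewhere, so α can be updated alongside the other weights. The initial value 0.25 and the single SGD step below are illustrative assumptions, not the paper's training recipe.

```python
import numpy as np

def prelu(x, alpha):
    # PReLU(x) = x for x > 0, alpha * x otherwise
    return np.where(x > 0, x, alpha * x)

def prelu_grad_alpha(x):
    # dPReLU/d(alpha) = x on the negative part, 0 elsewhere
    return np.where(x > 0, 0.0, x)

alpha = 0.25                                    # illustrative initial value for the trainable slope
x = np.array([-2.0, -1.0, 3.0])
upstream = np.ones_like(x)                      # stand-in gradient flowing back from the next layer
alpha -= 0.1 * np.sum(upstream * prelu_grad_alpha(x))   # one illustrative SGD step on alpha
print(prelu(x, alpha), alpha)
```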

Randomized ReLU activation function
The randomized rectified linear unit (RReLU) was first proposed and used in the Kaggle National Data Science Bowl (NDSB) competition [28]. Its core idea is that during training the negative-part slope α is randomly sampled from a uniform distribution U(l, u), and at test time it is fixed to a corrected (expected) value. The experimental results show that RReLU outperforms ReLU, LReLU, and PReLU in specific experiments (see Figure 1 for a graphic depiction).
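A minimal NumPy sketch of this train/test behaviour; the bounds l = 1/8 and u = 1/3 are illustrative assumptions rather than values quoted in the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def rrelu(x, lower=1/8, upper=1/3, training=True):
    # Training: a negative-part slope is drawn from U(lower, upper) per activation.
    # Testing: the slope is fixed to its expectation (lower + upper) / 2.
    if training:
        alpha = rng.uniform(lower, upper, size=x.shape)
    else:
        alpha = (lower + upper) / 2.0
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -1.0, 0.5])
print(rrelu(x, training=True))   # negative slopes vary between calls
print(rrelu(x, training=False))  # deterministic at test time
```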

SELU activation function
The scaled exponential linear unit (SELU), used in self-normalizing neural networks, was first proposed by Klambauer, G. et al. [11]. The SELU function can be expressed as:

\[ \mathrm{SELU}(x) = \lambda \begin{cases} x, & x > 0 \\ \alpha\,(e^{x} - 1), & x \le 0 \end{cases} \]

where λ ≈ 1.0507 and α ≈ 1.6733. The SELU function induces self-normalizing properties. The paper shows [11] that activations close to zero mean and unit variance, propagated through many network layers, converge to zero mean and unit variance even in the presence of noise and perturbations; this activation function performs well in standard feedforward neural networks (FNNs), and vanishing or exploding gradients are impossible according to Theorems 2 and 3 of the paper [11]. However, this activation function is relatively new, and more papers are needed to comparatively explore its application in architectures such as CNNs and RNNs.
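A minimal NumPy sketch of the SELU formula with the fixed constants from the paper, quoted to a few decimal places:

```python
import numpy as np

LAMBDA = 1.0507   # scale constant from the SELU paper (rounded)
ALPHA = 1.6733    # alpha constant from the SELU paper (rounded)

def selu(x):
    # SELU(x) = lambda * x for x > 0, lambda * alpha * (exp(x) - 1) otherwise
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(np.minimum(x, 0.0)) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(selu(x))
```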

SERLU activation function
SERLU was first proposed in Guoqiang Zhang's paper [12]. The function retains the self-normalizing property of SELU but breaks its monotonicity. It has a bump in the negative part, but for larger negative inputs the output approaches 0, so, as with SELU, the mean also tends towards zero. SERLU is defined as:

\[ \mathrm{SERLU}(x) = \lambda \begin{cases} x, & x \ge 0 \\ \alpha\, x\, e^{x}, & \text{otherwise} \end{cases} \]

At the same time, to prevent overfitting, the author designed a dropout scheme suited to SERLU: shift-dropout.
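A NumPy sketch of the piecewise form above; the λ and α values below are illustrative stand-ins for the self-normalizing constants derived in [12], not exact quotes from the paper:

```python
import numpy as np

LAMBDA_SERLU = 1.0786   # illustrative stand-in for the paper's scale constant
ALPHA_SERLU = 2.9042    # illustrative stand-in for the paper's alpha constant

def serlu(x):
    # SERLU(x) = lambda * x for x >= 0, lambda * alpha * x * exp(x) otherwise;
    # the negative branch rises to a single bump and then decays back towards 0
    return np.where(x >= 0, LAMBDA_SERLU * x,
                    LAMBDA_SERLU * ALPHA_SERLU * x * np.exp(np.minimum(x, 0.0)))

x = np.array([-4.0, -1.0, 0.0, 1.0])
print(serlu(x))
```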

GELU activation function
Gaussian error linear units (GELUs) were first proposed by Dan Hendrycks et al. [13]. GELU appears to be the activation of choice in NLP, especially in Transformer models, and it avoids the vanishing-gradient problem. The GELU function expression is:

\[ \mathrm{GELU}(x) \approx 0.5\,x\left(1 + \tanh\!\left(\sqrt{2/\pi}\,\bigl(x + 0.044715\,x^{3}\bigr)\right)\right) \]

Its derivative is:

\[ \mathrm{GELU}'(x) \approx 0.5\tanh\!\left(0.0356774\,x^{3} + 0.797885\,x\right) + \bigl(0.0535161\,x^{3} + 0.398942\,x\bigr)\,\mathrm{sech}^{2}\!\left(0.0356774\,x^{3} + 0.797885\,x\right) + 0.5 \]
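A short sketch comparing the exact GELU, x·Φ(x) with Φ the standard normal CDF, against the tanh approximation quoted above:

```python
import numpy as np
from scipy.special import erf

def gelu_exact(x):
    # GELU(x) = x * Phi(x); Phi expressed via the error function
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

def gelu_tanh(x):
    # tanh approximation quoted in the text
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-4.0, 4.0, 9)
print(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))  # the two curves differ only slightly
```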

SignReLU activation function
The SignReLU function was first proposed by Guifang Lin and Wei Shen et al. in 2018 [27]. SignReLU is a new unsaturated piecewise activation function built on the characteristics of the ReLU and softsign functions. When the input is greater than zero, it uses the ReLU function; when the input is less than zero, it uses the softsign function, which retains negative-axis information and corrects the data distribution, making the function more fault-tolerant. The expression for SignReLU is:

\[ \mathrm{SignReLU}(x) = \begin{cases} x, & x \ge 0 \\ \alpha\,\dfrac{x}{1 + |x|}, & x < 0 \end{cases} \]

Building on a traditional convolutional neural network, the authors augment the data, apply local response normalization, and adopt methods such as max pooling. The SignReLU function is then trained and evaluated on the CIFAR-10 dataset. The experimental results show that this function works well for image classification, converges faster, effectively alleviates the gradient-diffusion problem, and improves the image-recognition accuracy of the neural network.
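A NumPy sketch of the piecewise form implied by the description (ReLU on the positive axis, a scaled softsign on the negative axis); treating α as a free scaling parameter is an assumption here:

```python
import numpy as np

def signrelu(x, alpha=1.0):
    # Positive part behaves like ReLU; negative part uses the bounded softsign
    # curve alpha * x / (1 + |x|), which keeps negative-axis information.
    return np.where(x >= 0, x, alpha * x / (1.0 + np.abs(x)))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(signrelu(x))  # negative outputs stay bounded in (-alpha, 0)
```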

COMPARATIVE ANALYSIS OF DIFFERENT ACTIVATION FUNCTIONS
In this paper, we conduct experiments with a deep convolutional neural network, adopting the MNIST dataset for testing. Its structure includes two 3×3 convolutional layers and two 2×2 max-pooling layers, with a stride of 1 pixel, followed by fully connected layers.
During training, we feed the 28×28 grayscale MNIST images into the network, employ the Adam optimizer with batch_size set to 128 and dropout_rate set to 0.2, use the cross-entropy loss function, and train for 100 epochs. The SELU function is a special case: we need to use the kernel initializer 'lecun_normal' and a special form of dropout, AlphaDropout(); everything else remains the same. Finally, we obtain the history object from model.fit() and plot the change in loss and accuracy for each activation function.
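A minimal tf.keras sketch of the setup described above; the layer widths (32/64 filters, 128 dense units) are not stated in the text and are assumptions here, and the SELU-specific initializer and AlphaDropout follow the note in the previous paragraph:

```python
import tensorflow as tf

def build_model(activation, selu_mode=False):
    """Small CNN for the MNIST comparison: two 3x3 conv layers, two 2x2 max-pooling layers,
    then fully connected layers. selu_mode switches to lecun_normal + AlphaDropout."""
    initializer = 'lecun_normal' if selu_mode else 'glorot_uniform'
    dropout_layer = tf.keras.layers.AlphaDropout if selu_mode else tf.keras.layers.Dropout
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(32, 3, activation=activation, kernel_initializer=initializer),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Conv2D(64, 3, activation=activation, kernel_initializer=initializer),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation=activation, kernel_initializer=initializer),
        dropout_layer(0.2),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train[..., None] / 255.0, x_test[..., None] / 255.0

model = build_model('relu')  # swap in 'sigmoid', 'elu', 'selu', etc. for the comparison
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
history = model.fit(x_train, y_train, batch_size=128, epochs=100,
                    validation_data=(x_test, y_test))
# history.history['accuracy'] and ['val_accuracy'] can then be plotted per activation
```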

CONCLUSION
This paper reviews the definitions, advantages, disadvantages, and experimental behaviour of several functions derived from ReLU. The ReLU function alleviates the gradient-vanishing problem but is prone to the dead-ReLU problem. ELU, LReLU, RReLU, SELU, and GELU can all address the dead-ReLU problem, but ELU includes exponential calculations that increase the computational cost, LReLU cannot learn its α value, and PReLU still needs further research. In recent years, deep neural networks have developed very quickly, especially in image classification and natural language processing, and the number of network layers keeps growing, which makes people pay more attention to training efficiency and accuracy. This has greatly stimulated the development of activation functions, and a variety of new ones have emerged. However, some activation functions were not studied in this paper, and further investigation and analysis are needed to refine the views presented here.

Figure 1: Graphic depiction of the RReLU function

Figure 2: Classification accuracy of different activation functions at different iterations on MNIST