ReLU, sigmoid and tanh: how activation functions affect your machine learning algorithms.

If you have been working with neural networks for a while, you already know that we need to use activation functions in the hidden layers (and also in the output layer) in order to achieve non-linearity.

However, I really enjoy understanding WHY we should use some activation functions instead of others. Furthermore, I like to know how different activation functions affect a given model.

In this post I will focus on classification problems; more specifically, I will only consider binary classification problems. Let's dive in.

Binary classification outputs.

Say you have a neural network that classifies elements into two different categories. For example, given an image, it could determine whether it shows a cat or a dog; in this case our output will be either 1 (cat) or 0 (dog). That’s where the sigmoid function comes in handy.

The formula for the sigmoid function is

\(sigmoid(x) = \frac{1}{1+e^{-x}}\\\)

And this is what it looks like when we plot it:

What makes the sigmoid function good for classification problems is that it outputs a value between 0 and 1 that changes in a uniform manner. This makes the sigmoid a great output function for a classification model, as we can then simply perform predictions such as

\(\hat{y} = 1 \text{ if } sigmoid(x) >= 0.5; \hat{y} = 0 \text{ if } sigmoid(x) < 0.5\\\)
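As a quick illustration, here is a minimal Python sketch of the sigmoid and the thresholding rule above (the function names are my own, not from any particular library):

```python
import math

def sigmoid(x):
    # Logistic sigmoid: squashes any real input into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def predict(x):
    # Apply the 0.5 threshold described above to get a hard 0/1 label
    return 1 if sigmoid(x) >= 0.5 else 0
```

For instance, `sigmoid(0)` is exactly 0.5, so an input of 0 sits right on the decision boundary.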

So, why not use the sigmoid as an activation function in the hidden layers of a neural network as well? That takes us to the next stage.

Backward propagation and weight updates.

Ultimately, the way we update our weights and biases is this:

\(W = W - \alpha \frac{\partial{Cost}}{\partial{W}} \\\)
\(b = b - \alpha \frac{\partial{Cost}}{\partial{b}} \\\)

Here \(\alpha\) represents the learning rate. What is important to notice is that if our derivatives are too small, then our updates to \(W\) and \(b\) will also be small. The derivative of the sigmoid function is

\( \frac{1}{1+e^{-x}} * (1 - \frac{1}{1+e^{-x}}) \\\)

which turns out to be simply

\(sigmoid(x) * (1-sigmoid(x)) \\\)

If we plot it, this is what we get:

Here we have our first problem: the maximum value we will ever get is 0.25, already quite low, but things get much worse as \(x\) moves away from 0, as the derivatives then get smaller and smaller. This ultimately means our updates to \(W\) and \(b\) will also be small, making the learning process slow.
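To make this vanishing effect concrete, here is a small sketch (the function names are illustrative) that evaluates the sigmoid derivative at a few points:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    # sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)); peaks at 0.25 when x = 0
    s = sigmoid(x)
    return s * (1.0 - s)
```

At `x = 0` the derivative is exactly 0.25, and by `x = 5` it has already shrunk below 0.01.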

The solution? Using a different function.

The tanh function to the rescue.

Instead of the sigmoid, we will have a look at the hyperbolic tangent, or simply \(tanh(x)\), which is defined as

\(tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \\\)

When plotted, it looks like this:

This one gives us an output between -1 and 1, but more interestingly, its derivative is

\(1 - (\frac{e^x - e^{-x}}{e^x + e^{-x}})^2 \\\)

or in simpler terms

\(1 - tanh^2(x) \\\)

If we plot it, we get:

Pay close attention to the vertical axis! Unlike the derivative of the sigmoid function, this one reaches a value of 1, which is so much better than the maximum of 0.25 we got with the sigmoid. This means that we will be updating our \(W\) and \(b\) values at a much quicker pace.
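Continuing with the same kind of sketch, the tanh derivative can be checked in a couple of lines (again, the function name is my own):

```python
import math

def tanh_prime(x):
    # tanh'(x) = 1 - tanh(x)^2; peaks at 1.0 when x = 0
    return 1.0 - math.tanh(x) ** 2
```

At `x = 0` this returns exactly 1.0, four times the sigmoid derivative's peak of 0.25.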

Simpler is better: the ReLU function.

But we still have another interesting candidate: the ReLU function. ReLU stands for Rectified Linear Unit, and it is defined simply as

\(relu(x) = max(0, x) \\\)

You can also define it piecewise:

\(relu(x) = \begin{cases}
0 & \text{if } x < 0 \\
x & \text{if } x \geq 0
\end{cases} \\\)

In any case, the ReLU function looks like this.
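The piecewise definition above translates almost word for word into code; here is a minimal sketch:

```python
def relu(x):
    # max(0, x): negative inputs become 0, positive inputs pass through unchanged
    return max(0.0, x)
```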

What I like about it is its simplicity. I have not done calculus in a while, but I still remember that

\(f(x) = x^a \\\)
\(f'(x) = a * x^{a-1} \\\)

With \(a = 1\), this means that the derivative of the ReLU function is simply \(1\) for every positive input. Let's plot it.

Notice that I have not plotted the value when \(x < 0\); in that case, the derivative is \(0\) (and at exactly \(x = 0\) it is undefined). For every positive input, however, the derivative is a constant 1.
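The derivative is just as easy to sketch as the function itself (returning 0 at exactly \(x = 0\) is a common convention, since the true derivative is undefined there):

```python
def relu_prime(x):
    # Derivative of ReLU: 1 for x > 0, 0 for x < 0
    # (undefined at exactly 0; returning 0 there is a common convention)
    return 1.0 if x > 0 else 0.0
```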

So what?

How does all of this affect the learning, then? Remember that our equations for updating \(W\) and \(b\) are

\(W = W - \alpha \frac{\partial{Cost}}{\partial{W}} \\\)
\(b = b - \alpha \frac{\partial{Cost}}{\partial{b}} \\\)

And also keep in mind that the derivatives of the functions differ; in particular, the derivative of the sigmoid is quite small compared to the other two functions.
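As a rough sketch of why the gradient magnitude matters (the gradient values below are illustrative, not taken from a real model), consider a single gradient-descent step on one weight:

```python
# One gradient-descent step on a single weight: W <- W - alpha * dCost/dW.
# The gradient magnitudes below are made up for illustration.
alpha = 0.1  # learning rate

def update(w, grad):
    return w - alpha * grad

small_step = update(1.0, 0.25)  # sigmoid-scale gradient (its derivative tops out at 0.25)
big_step = update(1.0, 1.0)     # tanh/ReLU-scale gradient (derivative can reach 1)
```

With the same learning rate, the tanh/ReLU-scale gradient moves the weight four times as far as the sigmoid-scale one.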

To see how this impacts the learning, I wrote a Python notebook, which you can check at my Kaggle account. There I used the well-known MNIST dataset, but only to classify the digits 0 and 1. I ran a simple neural network with 256 hidden units, using the different activation functions mentioned here. The results are pretty obvious:

Notice that in all cases, the learning rate, the number of hidden units, and the initial values of \(W\) and \(b\) were the same.

It is fairly impressive how the tanh and ReLU functions are much better candidates as activation functions in the hidden layers.

In conclusion:

  1. The sigmoid function is the one you should use as the output function for classification problems, as its output range \((0, 1)\) matches exactly what a binary classification problem needs.
  2. The sigmoid function will also work as an activation function for the hidden layers, but learning will not be as quick.
  3. The tanh function is pretty much always a better choice than the sigmoid for hidden layers.
  4. The ReLU function is also a good candidate for hidden layers as an activation function.

Happy coding.