ReLU, sigmoid and tanh: how activation functions affect your machine learning algorithms.

If you have been working with neural networks for a bit, you already know that we need activation functions in the hidden layers (and also in the output layer) in order to achieve non-linearity.

However, I really enjoy understanding WHY we should use some activation functions instead of others, and I also like to know how different activation functions affect a given model.

In this post I will focus on classification problems, and more specifically on binary classification problems. Let's dive in.

Binary classification outputs.

Say you have a neural network that classifies elements into two different categories: for example, given an image, it could determine whether it is a cat or a dog. In this case our output will be either 1 (cat) or 0 (dog). That’s where the sigmoid function comes in handy.

The formula for the sigmoid function is

\(sigmoid(x) = \frac{1}{1+e^{-x}}\\\)

And the way it looks if we plot it is this

What makes the sigmoid function good for classification problems is that it outputs a value between 0 and 1 that changes smoothly and monotonically. This makes the sigmoid a great function for the output layer of a classification model: we can then simply perform predictions such as

\(\hat{y} = 1 \text{ if } sigmoid(x) \geq 0.5; \quad \hat{y} = 0 \text{ if } sigmoid(x) < 0.5\\\)
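As a quick illustration, here is a minimal NumPy sketch (the function names are just for this example) of turning the sigmoid output into a 0/1 prediction:

```python
import numpy as np

def sigmoid(x):
    # Maps any real number into the (0, 1) range.
    return 1.0 / (1.0 + np.exp(-x))

def predict(z, threshold=0.5):
    # y_hat = 1 when sigmoid(z) >= threshold, otherwise 0.
    return (sigmoid(z) >= threshold).astype(int)

z = np.array([-3.0, -0.1, 0.0, 0.2, 4.0])
print(sigmoid(z))   # probabilities between 0 and 1
print(predict(z))   # [0 0 1 1 1]
```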

So, why not use the sigmoid also as an activation function in the hidden layers of a neural network? That takes us to the next section.

Backward propagation and weight updates.

Ultimately, the way we update our weights and biases is this:

\(
W = W - \alpha \frac{\partial{Cost}}{\partial{W}} \\
b = b - \alpha \frac{\partial{Cost}}{\partial{b}} \\
\)
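As a minimal sketch in Python (the shapes and the dW/db placeholders are made up just to show the form of the update, not taken from a real backward pass):

```python
import numpy as np

alpha = 0.01                    # learning rate
W = np.random.randn(256, 784)   # hypothetical weight matrix
b = np.zeros((256, 1))          # hypothetical bias vector

# dW and db stand in for dCost/dW and dCost/db; in a real network they
# come out of backward propagation. Random values here, just for shape.
dW = np.random.randn(*W.shape)
db = np.random.randn(*b.shape)

W = W - alpha * dW
b = b - alpha * db
```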

In these formulas, \(\alpha\) represents the learning rate. What is important to notice is that if our derivatives are too small, then our updates to \(W\) and \(b\) will also be small. The derivative of the sigmoid function is

\( \frac{1}{1+e^{-x}} * (1 - \frac{1}{1+e^{-x}}) \\\)

which turns out to be simply

\(sigmoid(x) * (1-sigmoid(x)) \\\)

If we plot it, this is what we get.

Here we have our first problem: the maximum value we will ever get is 0.25, which is already quite low, and things get much worse as \(x\) moves away from 0, since the derivative then gets smaller and smaller. This ultimately means our updates to \(W\) and \(b\) will also be small, which makes the learning process slow.
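You can check this numerically in a few lines of NumPy; the printed values below are approximate:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([0.0, 2.0, 5.0, 10.0])
print(sigmoid_derivative(x))
# roughly [0.25, 0.105, 0.0066, 0.000045] -> shrinks fast as x moves away from 0
```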

The solution? Using a different function.

tanh function to the rescue.

Instead of the sigmoid, we will have a look at the hyperbolic tangent, or simply \(tanh(x)\), which is defined as

\(tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \\\)

When plotted, it looks like this

This one gives us an output between -1 and 1, but more interestingly, its derivative is

\(1 - (\frac{e^x - e^{-x}}{e^x + e^{-x}})^2 \\\)

or in simpler terms

\(1 - tanh^2(x) \\\)

If we plot it, we get

Pay close attention to the vertical axis! Unlike the derivative of the sigmoid, this one reaches a maximum value of 1, much better than the 0.25 we got with the sigmoid. This means that we will be updating our \(W\) and \(b\) values at a much quicker pace.
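A quick numerical comparison (a small sketch, not part of the original notebook) shows the difference in the peak gradient:

```python
import numpy as np

x = np.linspace(-5, 5, 1001)

sigmoid = 1.0 / (1.0 + np.exp(-x))
sigmoid_grad = sigmoid * (1.0 - sigmoid)   # sigmoid(x) * (1 - sigmoid(x))
tanh_grad = 1.0 - np.tanh(x) ** 2          # 1 - tanh^2(x)

print(sigmoid_grad.max())  # 0.25
print(tanh_grad.max())     # 1.0
```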

Simpler is better: the ReLU function.

But we still have another interesting candidate: the ReLU function. ReLU stands for Rectified Linear Unit, and it is defined simply as

\(relu(x) = max(0, x) \\\)

You can also define it as

\(
relu(x) = \begin{cases}
0 & \text{ if } x < 0 \\
x & \text{ if } x \geq 0
\end{cases} \\
\)

In any case, the ReLU function looks like this.

What I like about it is its simplicity. I have not done calculus in a while, but I still remember the power rule:

\(
f(x)=x^a \\
f'(x) = a*x^{a-1} \\
\)

Applying that rule with \(a=1\), the derivative of \(relu(x) = x\) for positive inputs is simply the constant 1. Let's plot it.


Notice that I have not plotted the values where \(x<0\); in that case the derivative is \(0\). For any positive \(x\), however, we get a constant derivative of 1, which is a healthy gradient to propagate.
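In code this is about as small as an activation function gets; here is a NumPy sketch (treating the derivative at exactly \(x=0\) as 0, which is a common but arbitrary convention):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_derivative(x):
    # 1 for positive inputs, 0 otherwise (x = 0 is treated as 0 here).
    return (x > 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))             # [0.  0.  0.  0.5 3. ]
print(relu_derivative(x))  # [0. 0. 0. 1. 1.]
```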

So what?

How does this all affect the learning then? Remember that our equations for updating \(W\) and \(b\) are

\(
W = W - \alpha \frac{\partial{Cost}}{\partial{W}} \\
b = b - \alpha \frac{\partial{Cost}}{\partial{b}} \\
\)

Also keep in mind that the derivatives of the three functions are different; in particular, the derivative of the sigmoid is quite small compared to the other two.

To see how this impacts the learning, I wrote a Python notebook, which you can check at my Kaggle account. There I used the well-known MNIST dataset, but only to classify digits 0 and 1, and I ran a simple neural network with 256 hidden units using each of the activation functions mentioned here. The results are pretty obvious:

Notice that in all the cases the learning rate and the number of hidden units were the same, and the initial values of \(W\) and \(b\) were also the same.
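The actual notebook lives on my Kaggle account, but the setup was roughly the following. This is a simplified sketch using Keras rather than the exact notebook code; the 256 hidden units and the 0/1 subset of MNIST match the description above, while the optimizer, learning rate and number of epochs are placeholder choices.

```python
import tensorflow as tf

# Load MNIST and keep only the digits 0 and 1 for binary classification.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, y_train = x_train[y_train <= 1] / 255.0, y_train[y_train <= 1]
x_test, y_test = x_test[y_test <= 1] / 255.0, y_test[y_test <= 1]

def build_model(activation):
    # One hidden layer with 256 units and a sigmoid output for the 0/1 label.
    return tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(256, activation=activation),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

for activation in ["sigmoid", "tanh", "relu"]:
    tf.keras.utils.set_random_seed(0)  # same initial weights for every run
    model = build_model(activation)
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
                  loss="binary_crossentropy", metrics=["accuracy"])
    history = model.fit(x_train, y_train, epochs=5, verbose=0,
                        validation_data=(x_test, y_test))
    print(activation, history.history["val_accuracy"][-1])
```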

It is fairly impressive how much better the tanh and ReLU functions perform as activation functions in the hidden layers.

In conclusion:

  1. The sigmoid function is the one you should use as the output function for binary classification problems, as its value range \([0, 1]\) matches exactly what such a problem needs.
  2. The sigmoid function will also work as an activation function for the hidden layers, but learning will be slower.
  3. The tanh function is pretty much always a better choice than the sigmoid for the hidden layers.
  4. The ReLU function is also a good candidate for the hidden layers as an activation function.

Happy coding.

 

Machine learning basics: the cost function

Machine learning is ultimately a way to make a program perform a task and to get that task done better over time. Cost functions define how good or bad a program is at performing such a task, and pretty much every problem consists of getting the value of the cost function to be as small as possible.

For our example, we will use a very simple dataset which consists of two variables: car speed and the distance needed to stop. Our ultimate goal will be, given a speed we have never seen before, to predict the stopping distance.

Let's define some common vocabulary:

  • \(X\): The observations, in our case the car speed.
  • \(y\): The correct answers to our observations, in this case the distance.
  • \(\hat{y}\): Our own predictions given an \(X\).

Notice that all of the values above are actually vectors, or lists if you prefer (possibly a friendlier term for a developer). This means that each of them can be accessed by index, such as

\(y_i\)

This takes us to define another element

  • \(n\): The total number of observations, in this case how many elements we have in \(X\) and \(y\).

Now, this is the data we are going to work with

Speed (X)  Distance (y)
4          2
7          4
8          16
9          10
10         18
11         17
12         14
13         26
14         26
15         20
16         32
17         32
18         42
19         36
20         32
22         66
23         54
24         70
25         85
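For reference, here is the same data as plain Python lists, so you can follow the calculations later in code:

```python
speed = [4, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 22, 23, 24, 25]
distance = [2, 4, 16, 10, 18, 17, 14, 26, 26, 20, 32, 32, 42, 36, 32, 66, 54, 70, 85]
```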

We have a total of 19 observations. Now let's plot them.

The distance required to stop a car depending on its speed.

With all this in our hands, we can start defining our cost function. A good intuition would be to say that our cost function is simply the difference between our predictions and the actual values.

For example, at a speed of \(15\) km/h we need \(20\) meters to stop. Imagine that we have

  • \(ModelA\), which predicts that \(25\) meters are needed; the error would be \(25 - 20 = 5\).
  • \(ModelB\), which predicts that we need \(22\) meters to stop; the error would be \(22 - 20 = 2\), already smaller than the previous one.
  • \(ModelC\), which predicts that we need \(19\) meters to stop; the error would be \(19 - 20 = -1\). This is a bit weird, as we want our error to be close to 0, not negative. The solution is to use a squared error instead, so that the result is always positive. Let's recalculate the errors using squares:
  • \(ModelA = (25 - 20)^2 = 25 \)
  • \(ModelB = (22 - 20)^2 = 4 \)
  • \(ModelC = (19 - 20)^2 = 1 \)

More generally we can simply say \( Error = (\hat{y}_i - y_i)^2 \)

With this we can quickly conclude that the best model is the one with the smallest value for the cost function; in this case, that would be \(ModelC\).
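The same comparison in a few lines of Python (the model names are just labels for this example):

```python
actual = 20  # meters needed to stop at 15 km/h
predictions = {"ModelA": 25, "ModelB": 22, "ModelC": 19}

for name, predicted in predictions.items():
    # Squared error for each model's single prediction.
    print(name, (predicted - actual) ** 2)
# ModelA 25, ModelB 4, ModelC 1 -> ModelC has the smallest error
```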

The next step is to apply this to every point in the problem: our model should be able to predict the distance required to stop for any given speed, and we should be able to calculate the error of such a prediction. The solution? Apply exactly the same logic, but to the whole set of data. As we mentioned before, distance and speed are both vectors, so we can simply do

\( Error = (\hat{y}_1 - y_1)^2 + (\hat{y}_2 - y_2)^2 + \dots + (\hat{y}_n - y_n)^2 \)

Or, if we want to use more compact mathematical notation,

\(
\begin{equation*}
Error = \sum_{i=1}^{n} (\hat{y}_i - y_i)^2
\end{equation*}
\)

Do not let the math intimidate you: the term \( \sum_{i=1}^{n}\) is just a loop over the elements of the vectors.

We cannot simply keep adding terms, though. Think about it: if we have a dataset with a lot of observations, our error will grow just because we have more observations. The solution is to use the mean error instead, so let's add that to our formula.

\(
\begin{equation*}
Error = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2
\end{equation*}
\)

So now what we have is the mean of all the squared errors. This function is, unsurprisingly, called “Mean Squared Error”, or simply \(MSE\), and it will be an important concept for the rest of this post:

\(
\begin{equation*}
MSE = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2
\end{equation*}
\)
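Written as a function, MSE is almost as short as the formula; here is a small NumPy sketch:

```python
import numpy as np

def mse(y_hat, y):
    # Mean of the squared differences between predictions and actual values.
    y_hat, y = np.asarray(y_hat, dtype=float), np.asarray(y, dtype=float)
    return np.mean((y_hat - y) ** 2)

print(mse([25, 22, 19], [20, 20, 20]))  # 10.0, the mean of 25, 4 and 1
```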

Making predictions

Now, we have been talking a lot about \(\hat{y}\), but how can we calculate it? In linear regression this is done by applying a simple formula

\( \hat{y_i} = wX_i + b \)

If we want to generalize it to the whole vector, we can simply say

\( \hat{y} = wX + b \)

This introduces two new values, \(w\) and \(b\):

  • \(w\): Represents the weight that we need to calculate; this is the value by which we will multiply \(X\).
  • \(b\): Represents the bias; we will simply add this term and we will NOT multiply it by \(X\).

    An example will make this clearer. Let's say \(w=-1, b=10\).

This gives a terrible prediction: our red line (that is, our model) does not align at all with our actual observations. The interesting part is to quantify how bad it is; to do so, let's just have a look at the first \(5\) data points so we can do all the calculations by hand.

Speed (X)  Distance (y)  Prediction (ŷ)
4          2             6
7          4             3
8          16            2
9          10            1
10         18            0

We will take the data point where \(i=1\), that is, the first row. So

\(X_1=4; y_1=2; \hat{y}_1=6\) so \(Error_1 = (\hat{y}_1 - y_1)^2 = 16\)

If we apply \(MSE = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2\) we get \(MSE = \frac{1}{5}\left[(6-2)^2 + (3-4)^2 + (2-16)^2 + (1-10)^2 + (0-18)^2\right] = 123.6\)

Now, let's consider another model where \(w=3; b=-12\); then we get this

It is already obvious that this model is much better at predicting the distance; the question, however, is how much better? Again the answer lies in \(MSE\). The values are

Speed (X)  Distance (y)  Prediction (ŷ)
4          2             0
7          4             9
8          16            12
9          10            15
10         18            18

So we can again calculate \( MSE = \frac{1}{5}\left[(0-2)^2 + (9-4)^2 + (12-16)^2 + (15-10)^2 + (18-18)^2\right] = 14 \)

This gives us critical information: not only can we figure out which model is better, we can also quantify how much better it is, and that becomes very important; imagine, for example, how relevant this could be for autonomous driving.
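Putting the two models side by side on those first five data points (repeating the mse helper so the snippet is self-contained):

```python
import numpy as np

speed = np.array([4, 7, 8, 9, 10], dtype=float)
distance = np.array([2, 4, 16, 10, 18], dtype=float)

def mse(y_hat, y):
    return np.mean((y_hat - y) ** 2)

def predict(w, b, X):
    # Linear model: y_hat = w * X + b
    return w * X + b

print(mse(predict(-1, 10, speed), distance))  # 123.6
print(mse(predict(3, -12, speed), distance))  # 14.0
```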

Cost functions for other problems.

\(MSE\) is a good cost function, but it only helps us with regression problems, that is, problems where our output is a number: for example, predicting how warm a day will be based on some variables, or predicting the value of a security in the stock market.

However, there are many problems where we want to classify values instead. An example would be knowing whether or not a car can stop completely or whether it would have an accident. In this scenario \(MSE\) does not help us; we need another cost function.

The logistic function.

For binary classification problems, where our output can only take two possible values, we want to use this little function: \(logistic(\hat{y}) = \frac{1}{1+e^{-\hat{y}}}\). It does not look very intuitive, but if we actually plot it, we get this.

Logistic sigmoid function

The interesting thing about this function is that it outputs values between 0 and 1, so we can apply a similar measure of the error by simply comparing \(y\) with our \(\hat{y}\): \(\hat{y}\) will take values between 0 and 1, while \(y\) will be either 0 or 1.
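As a tiny sketch with made-up numbers, the logistic function squashes any raw prediction into that range, and the result can then be compared against labels that are exactly 0 or 1:

```python
import numpy as np

def logistic(y_hat):
    return 1.0 / (1.0 + np.exp(-y_hat))

raw = np.array([-4.0, -0.5, 0.0, 2.0])  # hypothetical raw model outputs
labels = np.array([0, 0, 1, 1])         # actual classes

probabilities = logistic(raw)
print(probabilities)                                 # strictly between 0 and 1
print((probabilities >= 0.5).astype(int) == labels)  # [ True  True  True  True]
```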

Conclusion

Cost functions are at the core of understanding machine learning, as they ultimately provide a measure of success for a given model; they are also at the center of fundamental algorithms such as gradient descent.

It really helped me in the early days to calculate some of these functions by hand to fully understand their meaning.

There are many other cost functions that one needs to be aware of, but these two are the core ones to start with. I strongly recommend going through a couple of examples with \(MSE\).

Happy coding.