Machine learning basics: the cost function

Machine learning is ultimately a way to make a program perform a task and get better at that task over time. A cost function defines how good or bad a program is at performing that task, and pretty much every machine learning problem consists of making the value of the cost function as small as possible.

For our example we will use a very simple dataset consisting of two variables: car speed and stopping distance. Our ultimate goal will be, given a speed we have never seen before, to predict the distance needed to stop.

Let's define some common vocabulary:

  • \(X\): These will be the observations; in our case they represent the car speed.
  • \(y\): The correct answers to our observations, in this case the distance.
  • \(\hat{y}\): Our own predictions given an \(X\).

Notice that all of the values above are actually vectors, or if you prefer, lists (possibly a friendlier term for a developer). This means that each of them can be accessed by an index.
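As a quick illustration (a Python sketch; the variable names and values here are just for this example), accessing elements by index looks like:

```python
# Hypothetical example: the first few observations of our dataset.
X = [4, 7, 8]   # car speeds
y = [2, 4, 16]  # stopping distances

# Each element can be accessed by its index.
print(X[0], y[0])  # first observation: speed 4, distance 2
```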


This takes us to defining another element:

  • \(n\): The total number of observations, in this case this means how many elements we have in \(X\) and \(y\)

Now, this is the data we are going to work with:

  Speed (X)   Distance (y)
  4           2
  7           4
  8           16
  9           10
  10          18
  11          17
  12          14
  13          26
  14          26
  15          20
  16          32
  17          32
  18          42
  19          36
  20          32
  22          66
  23          54
  24          70
  25          85

We have a total of \(19\) observations here. Now let's plot them.

Figure: the distance required to stop a car depending on its speed.
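For reference, assuming Python, the whole dataset can be written as two plain lists, one per column of the table above:

```python
# The full dataset from the table above: one list per column.
speed = [4, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 22, 23, 24, 25]
distance = [2, 4, 16, 10, 18, 17, 14, 26, 26, 20, 32, 32, 42, 36, 32, 66, 54, 70, 85]

n = len(speed)  # total number of observations
print(n)  # 19
```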

With all this in our hands, we can start defining our cost function. A good intuition would be to say that our cost function is simply the difference between our predictions and the actual values.

For example, at a speed of \(15\) km/h we need \(20\) meters to stop. Imagine that we have:

  • \(ModelA\) predicts that \(25\) meters are needed; the error would be \(25 - 20 = 5\).
  • \(ModelB\) predicts that we need \(22\) meters to stop; the error would be \(22 - 20 = 2\), which is already smaller than the previous one.
  • \(ModelC\) predicts that we need \(19\) meters to stop; the error would be \(19 - 20 = -1\). This is a bit weird, as we want our error to be close to \(0\), not negative. The solution is to use the squared error instead, so that the result is always positive. Let's recalculate the errors using squares:
  • \(ModelA: (25 - 20)^2 = 25\)
  • \(ModelB: (22 - 20)^2 = 4\)
  • \(ModelC: (19 - 20)^2 = 1\)

More generally, we can simply say \( Error_i = (\hat{y}_i - y_i)^2 \).

With this we can quickly conclude that the best model is the one with the smallest value for the cost function; in this case, that would be \(ModelC\).
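The three toy models above can be checked with a few lines of Python (the model names are just labels for this example):

```python
y_true = 20  # actual stopping distance at speed 15

# Predictions of our three hypothetical models.
predictions = {"ModelA": 25, "ModelB": 22, "ModelC": 19}

# Squared error for each model: (prediction - actual)^2.
errors = {name: (pred - y_true) ** 2 for name, pred in predictions.items()}
print(errors)  # {'ModelA': 25, 'ModelB': 4, 'ModelC': 1}

# The best model is the one with the smallest error.
best = min(errors, key=errors.get)
print(best)  # ModelC
```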

The next step is to apply this to every point in the problem. Our model should be able to predict the distance required to stop for any given speed, and we should be able to calculate the error of each prediction. The solution? Apply exactly the same logic to the whole dataset. As we mentioned before, distance and speed are both vectors, so we can simply do

\( Error = (\hat{y}_1 - y_1)^2 + (\hat{y}_2 - y_2)^2 + \dots + (\hat{y}_n - y_n)^2 \)

Or, if we want to use proper mathematical notation:

\( Error = \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 \)

Do not let the math intimidate you; the term \( \sum_{i=1}^{n} \) is just a loop over the elements of the vectors.

We cannot simply keep adding terms. Think about it: if we have a dataset with a lot of observations, our error will grow just because we have more observations. The solution is to use the mean error instead, so let's add that to our formula.

\( Error = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 \)

So now what we have is the mean of all the squared errors. This function is, unsurprisingly, called "Mean Squared Error", or simply \(MSE\), and it will be an important concept for the rest of this post.

\( MSE = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 \)
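Translated into code, the formula becomes a short function. Here is a Python sketch (`mse` is just a name chosen for this example):

```python
def mse(y_hat, y):
    """Mean Squared Error: the average of the squared differences."""
    n = len(y)
    return sum((y_hat[i] - y[i]) ** 2 for i in range(n)) / n

# A tiny sanity check with made-up numbers.
print(mse([2, 4], [1, 2]))  # ((2-1)^2 + (4-2)^2) / 2 = 2.5
```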

Making predictions

Now, we have been talking a lot about \(\hat{y}\), but how can we calculate it? In linear regression this is done by applying a simple formula:

\( \hat{y}_i = wX_i + b \)

if we want to generalize it we can simply say

\( \hat{y} = wX + b \)

This introduces two new values, \(w\) and \(b\):

  • \(w\): Represents the weight that we need to calculate; this is the value by which we will multiply \(X\).
  • \(b\): Represents the bias; we will simply add this term, and we will NOT relate it to \(X\).

An example will make this clearer. Let's say \(w = -1, b = 10\).
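In code, the linear model is a one-liner. Here is a sketch (in Python; `predict` is just a name chosen here) using those example values:

```python
def predict(X, w, b):
    """Linear model: y_hat = w * X + b, applied element-wise."""
    return [w * x + b for x in X]

speed = [4, 7, 8, 9, 10]  # the first five speeds from the dataset
y_hat = predict(speed, w=-1, b=10)
print(y_hat)  # [6, 3, 2, 1, 0]
```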

This shows a terrible prediction: our red line (that is, our model) does not align at all with our actual observations. The interesting part is to quantify how bad it is. In order to do so, let's look at the first \(5\) data points so we can do all the calculations by hand.

  Speed (X)   Distance (y)   Prediction \(\hat{y}\)
  4           2              6
  7           4              3
  8           16             2
  9           10             1
  10          18             0

We will take the data point where \(i = 1\), that is, the first row. So

\(X_1 = 4;\ y_1 = 2;\ \hat{y}_1 = 6\), so \(Error_1 = (\hat{y}_1 - y_1)^2 = 16\)

If we apply \(MSE = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2\) we get \(MSE = \frac{1}{5}\left[(6-2)^2 + (3-4)^2 + (2-16)^2 + (1-10)^2 + (0-18)^2\right] = 123.6\)
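We can verify this hand calculation with a few lines of Python (a sketch with the MSE written inline):

```python
y = [2, 4, 16, 10, 18]   # actual distances for the first five speeds
y_hat = [6, 3, 2, 1, 0]  # predictions of the model w=-1, b=10

# Mean of the squared differences.
error = sum((p - a) ** 2 for p, a in zip(y_hat, y)) / len(y)
print(error)  # 123.6
```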

Now, let's consider another model where \(w = 3;\ b = -12\); then we get this

It is already obvious that this model is much better at predicting the distance; however, the question is how much better? Again the answer lies in \(MSE\). The values are

  Speed (X)   Distance (y)   Prediction \(\hat{y}\)
  4           2              0
  7           4              9
  8           16             12
  9           10             15
  10          18             18

So we can again calculate \( MSE = \frac{1}{5}\left[(0-2)^2 + (9-4)^2 + (12-16)^2 + (15-10)^2 + (18-18)^2\right] = 14 \)
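Again, the arithmetic can be double-checked in Python:

```python
y = [2, 4, 16, 10, 18]     # actual distances
y_hat = [0, 9, 12, 15, 18] # predictions of the model w=3, b=-12

# Mean of the squared differences.
error = sum((p - a) ** 2 for p, a in zip(y_hat, y)) / len(y)
print(error)  # 14.0
```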

This gives us a critical piece of information: not only can we figure out which model is better, we can also quantify how much better it is. That becomes very important; imagine, for example, how relevant this could be for autonomous driving.

Cost functions for other problems.

\(MSE\) is a good cost function, but it only helps us with regression problems, that is, problems where our output is a number: for example, predicting how warm a day will be based on some variables, or predicting the value of a security in the stock market.

However, there are many problems where we want to classify values. An example would be knowing whether or not a car can stop completely, or whether it will have an accident. In this scenario \(MSE\) does not help us; we need another cost function.

The logistic function.

For binary classification problems, where our output can only take two possible values, we want to use this little function: \( logistic(\hat{y}) = \frac{1}{1+e^{-\hat{y}}} \). It does not look very intuitive, but if we actually plot it, we get:

Figure: the logistic (sigmoid) function.

The interesting thing about this function is that it only takes values between \(0\) and \(1\), so we can apply a similar measure of the error by simply comparing \(y\) with our \(\hat{y}\): the prediction will take values from \(0\) to \(1\), while \(y\) will be either \(0\) or \(1\).
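A minimal sketch of the logistic function in Python, using only the standard library:

```python
import math

def logistic(y_hat):
    """Squash any real number into the (0, 1) range."""
    return 1 / (1 + math.exp(-y_hat))

print(logistic(0))   # 0.5: right in the middle
print(logistic(5))   # close to 1
print(logistic(-5))  # close to 0
```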


Cost functions are at the core of understanding machine learning, as they ultimately provide a measure of success for a given model. They are also at the center of fundamental algorithms such as gradient descent.

It really helped me in the early days to calculate some of these functions by hand to fully understand their meaning.

There are many other cost functions that one needs to be aware of, but these two are the core ones to start with. I strongly recommend going through a couple of examples with \(MSE\).

Happy coding.