How to scale data for Machine Learning: standardize features

In this brief tutorial, you will see how to scale the features of your dataset. You will learn why it is important to standardize your data before using a machine learning model and see specific examples of how standardizing your data can improve the model’s performance.

This tutorial contains simple examples that data science beginners can follow to standardize the features of their dataset. You can also find the full code used to generate and analyze the data, so more advanced readers can still take advantage of this tutorial.

Whenever we have a distance-based machine learning algorithm, it is a good practice to standardize our data, to avoid issues with features having different units or ranges.


Overview

This short tutorial will show you the basics of standardizing data and how it affects the performance of simple machine learning models. Python code snippets are offered to help understand the specific implementation. If you want more details, feel free to check the code and images used during the tutorial at this GitHub repository.

This tutorial will cover:

  • Basics on standardizing data
  • How to use general libraries to standardize data
  • Effect of standardization on distance calculations
  • Impact of standardization on machine learning performance

Scaling Data

In a nutshell, a machine learning algorithm will read some input variables and will predict the value of one or more output properties. For example, we can imagine a simple model that can predict the object type in a picture by its size and color.

However, it is common for the input variables (a.k.a. features) to have different units and ranges, which might cause one of the features to outweigh the other ones, even when they play a similar role in predicting the output property.

This issue can be solved by re-scaling the input data, so all features have the same range, while keeping their relative values the same. A common way to do this is to standardize the data: each feature is re-scaled to have a mean of 0 and a standard deviation of 1. This can be done with the StandardScaler provided by scikit-learn in just a couple of lines:

from sklearn import preprocessing

scaler = preprocessing.StandardScaler().fit(X)   # X is an array with all our features
X_scaled = scaler.transform(X)
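To sanity-check the result, we can verify that each scaled feature ends up with (approximately) zero mean and unit standard deviation. A minimal sketch with a toy feature matrix (the values here are made up for illustration):

```python
import numpy as np
from sklearn import preprocessing

# Toy feature matrix: two features with very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 700.0]])

scaler = preprocessing.StandardScaler().fit(X)
X_scaled = scaler.transform(X)

print(X_scaled.mean(axis=0))  # approximately [0, 0]
print(X_scaled.std(axis=0))   # approximately [1, 1]
```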

Why scaling data matters in Machine Learning

As mentioned above, whenever we use a distance-based machine learning algorithm, it is good practice to standardize our data: features with different units or ranges would otherwise carry artificially more weight in the model. As an example, imagine a dataset with two features:

  • x1: in the (-10, 10) range
  • x2: in the (-1000, 1000) range

And we have an output property that is a function of x1 and x2: f(x1,x2)=sin(x1)+cos(x2). This random data is represented in the Figure below.

Below, we can see the code to generate such a dataset:

import random
import numpy as np

def generate_data():
    # initialize lists
    x1 = []
    x2 = []
    f  = []
    # set random seed for reproducibility
    random.seed(2020)
    # calculate f(x1,x2) for 400 (20*20) points
    for i in range(20):
        provi_x1 = []
        provi_x2 = []
        provi_f  = []
        for j in range(20):
            # set random x1 and x2 values
            item_x1 = random.uniform(-10, 10)
            item_x2 = random.uniform(-1000, 1000)
            # calculate f(x1,x2)
            item_f = np.sin(item_x1) + np.cos(item_x2)
            provi_x1.append(item_x1)
            provi_x2.append(item_x2)
            provi_f.append(item_f)
        x1.append(provi_x1)
        x2.append(provi_x2)
        f.append(provi_f)
    return x1, x2, f
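The nested loops above can also be written more compactly with NumPy. Below is a sketch of an equivalent generator (the name generate_data_np is hypothetical, and it uses NumPy's random Generator, so the exact values will not match the random.seed(2020) data above):

```python
import numpy as np

def generate_data_np(n=20, seed=2020):
    # Draw n*n random points and reshape into n x n grids
    rng = np.random.default_rng(seed)
    x1 = rng.uniform(-10, 10, size=(n, n))
    x2 = rng.uniform(-1000, 1000, size=(n, n))
    # calculate f(x1,x2) element-wise on the whole grid
    f = np.sin(x1) + np.cos(x2)
    return x1, x2, f
```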

It is important to note the different scales of the features. If we apply a standard scaler, we obtain the dataset shown in the Figure below. Note that the appearance of the data in this Figure is the same as before, except for the scale.

This different scale is significant because, in our raw non-scaled data, the distance between the right-most point and the left-most point will be approximately 20, but the distance between the top-most point and the bottom-most point will be around 2000. This means that our distance-based algorithm will consider the differences in the x2 feature as more important (since their absolute values are larger). Effectively, the algorithm sees data with the same x1 and x2 scales, as shown in the Figure below, where we can see how the absolute changes in x1 are much smaller than those in x2.
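We can make this concrete with a quick standalone check (not part of the tutorial's code): for two random points drawn from these raw feature ranges, almost all of the squared Euclidean distance comes from the x2 term.

```python
import random

random.seed(0)
# Two points drawn from the raw (non-scaled) feature ranges
p = (random.uniform(-10, 10), random.uniform(-1000, 1000))
q = (random.uniform(-10, 10), random.uniform(-1000, 1000))

# Squared contribution of each feature to the Euclidean distance
d_sq_x1 = (p[0] - q[0]) ** 2
d_sq_x2 = (p[1] - q[1]) ** 2
print(d_sq_x2 / (d_sq_x1 + d_sq_x2))  # fraction of squared distance due to x2
```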

Effect of scaling data in Machine Learning

It is illuminating to see in practice how standardization impacts the performance of our model.

For this example, we will be using one of the simplest machine learning algorithms available, k Nearest-Neighbors (kNN).

For our purposes in this tutorial, all we need to know is that kNN assigns the values of unknown points as a weighted average of nearby points. For this algorithm (and other distance-based algorithms), it is critical to calculate the distances between different points to identify the nearest neighbors.
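As a minimal illustration of that idea (with toy 1-D data, not the tutorial's dataset), a distance-weighted kNN regressor predicts an unseen point as a weighted average of its nearest training points:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy 1-D training data following y = 2*x
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([2.0, 4.0, 6.0, 8.0])

# Distance-weighted kNN, as used later in the tutorial
knn = KNeighborsRegressor(n_neighbors=2, weights='distance')
knn.fit(X_train, y_train)

# 2.5 sits exactly between the points x=2 and x=3, so the
# prediction is the average of their targets: (4 + 6) / 2 = 5
pred = knn.predict([[2.5]])
print(pred)  # [5.]
```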

In the Figure below, with raw non-scaled data, we can see how the range of the x2 coordinate (between -1000 and 1000) is much larger than the range of x1 (between -10 and 10). We have zoomed in on the greyed-out area and show three points of interest, whose coordinates are:

A: {x1 = -0.98, x2=-302.53}

B: {x1 = -2.16, x2=-286.79}

C: {x1 = -5.95, x2=-304.20}

We calculate the distances between these points as:

dAB = [ (-0.98 - (-2.16))² + (-302.53 - (-286.79))² ]^(1/2) = 15.8

dAC = [ (-0.98 - (-5.95))² + (-302.53 - (-304.20))² ]^(1/2) = 5.2

Figure 4.

This means that A is being detected as being closer to C ( dAC = 5.2) than to B (dAB = 15.8) because the absolute differences in x2 are larger than those in x1. However, when using machine learning, we want our descriptors to be balanced, and we want them to have similar importance in our algorithm by default.
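These distances are easy to verify numerically; Python's built-in math.dist (available since Python 3.8) computes the Euclidean distance between two points:

```python
import math

# Raw (non-scaled) coordinates of the three points of interest
A = (-0.98, -302.53)
B = (-2.16, -286.79)
C = (-5.95, -304.20)

d_AB = math.dist(A, B)
d_AC = math.dist(A, C)
print(round(d_AB, 1), round(d_AC, 1))  # 15.8 5.2
```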

This is when scaling our data becomes important, and we show in the Figure below the standardized data. In this case, we have our three points of interest with coordinates:

A: {x1 =-0.19560, x2 =-0.57645}

B: {x1 =-0.46416, x2 =-0.54846}

C: {x1 =-1.09678, x2 =-0.579408}

If we calculate the distance between these points, we obtain:

dAB = [ (-0.19560 - (-0.46416))² + (-0.57645 - (-0.54846))² ]^(1/2) = 0.3

dAC = [ (-0.19560 - (-1.09678))² + (-0.57645 - (-0.579408))² ]^(1/2) = 0.9

Figure 5.

In this case, point A is roughly three times closer to B (dAB = 0.3) than to C (dAC = 0.9), which we can observe in the Figure above.
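We can verify these scaled distances numerically with Python's built-in math.dist:

```python
import math

# Standardized coordinates of the three points of interest
A = (-0.19560, -0.57645)
B = (-0.46416, -0.54846)
C = (-1.09678, -0.579408)

d_AB = math.dist(A, B)
d_AC = math.dist(A, C)
print(round(d_AB, 1), round(d_AC, 1))  # 0.3 0.9
```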

We can go one step further and use a cross-validation approach to estimate the accuracy of our model with un-scaled and with scaled data. Below, I show the code we can use to calculate the leave-one-out cross-validation accuracy, measured by the root mean squared error (RMSE) and the Pearson correlation coefficient (r). For more details on cross-validation, feel free to visit my other post on this topic.

from math import sqrt
from scipy.stats import pearsonr
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_predict, LeaveOneOut
from sklearn.metrics import mean_squared_error

def kNN_function_LOO(X, y, scale_data=True):
    # scale data if requested
    if scale_data:
        scaler = preprocessing.StandardScaler().fit(X)
        X = scaler.transform(X)
    # calculate predicted values using leave-one-out cross-validation and kNN
    ML_algorithm = KNeighborsRegressor(n_neighbors=6, weights='distance')
    y_predicted = cross_val_predict(ML_algorithm, X, y, cv=LeaveOneOut())
    y_real = y
    r, _  = pearsonr(y_real, y_predicted)
    rmse  = sqrt(mean_squared_error(y_real, y_predicted))
    print('kNN leave-one-out cross-validation. RMSE: %.4f . r: %.4f' % (rmse, r))
    return y_real, y_predicted

We can actually see the impact of this data standardization in the Figure below, where I have used leave-one-out cross-validation to estimate the accuracy when using the raw non-scaled data. This results in a relatively large RMSE and a very small correlation coefficient.

However, when we do the same leave-one-out cross-validation with the standardized data, we see a significant improvement in the model performance, with a smaller RMSE and a much larger correlation coefficient.
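To reproduce the comparison end-to-end, here is a self-contained sketch that builds a dataset like the one above with f = sin(x1) + cos(x2) (using NumPy's random Generator, so the exact numbers will differ from the tutorial's figures) and runs the same leave-one-out evaluation with and without scaling:

```python
import numpy as np
from math import sqrt
from scipy.stats import pearsonr
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_predict, LeaveOneOut
from sklearn.metrics import mean_squared_error

# Re-create 400 random points with the tutorial's feature ranges
rng = np.random.default_rng(2020)
x1 = rng.uniform(-10, 10, 400)
x2 = rng.uniform(-1000, 1000, 400)
X = np.column_stack((x1, x2))
y = np.sin(x1) + np.cos(x2)

results = {}
for scale_data in (False, True):
    # Standardize features only in the scaled run
    Xs = preprocessing.StandardScaler().fit_transform(X) if scale_data else X
    model = KNeighborsRegressor(n_neighbors=6, weights='distance')
    y_pred = cross_val_predict(model, Xs, y, cv=LeaveOneOut())
    rmse = sqrt(mean_squared_error(y, y_pred))
    r, _ = pearsonr(y, y_pred)
    results[scale_data] = (rmse, r)
    print('scaled=%s  RMSE: %.4f  r: %.4f' % (scale_data, rmse, r))
```

On this kind of data, the scaled run should give a noticeably smaller RMSE and a larger correlation coefficient than the raw run, matching the behavior described above.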


Summary

In summary, we have seen how features with different scales might affect the performance of a machine learning algorithm. We showed how to standardize data, so all features have the same range, allowing for more accurate distances and improving the model’s performance.

If you are interested in following along, you can access the code to generate all data and images in this tutorial from this Github repository: https://github.com/marcosdelcueto/Tutorial_data_scaling_ML

Was this tutorial helpful to you? Let me know if you have any questions on how to standardize your data!


* Featured image by Carlos Muza on Unsplash


Marcos del Cueto, PhD

I use computational modeling and data-driven approaches to explain the behavior of complex systems in physics and chemistry. I enjoy programming and all things science.
