A Brief Guide to Cross-Validation: What It Is and How to Use It

We start by generating a two-dimensional dataset and creating a regression model with it. We then discuss how to validate the Machine Learning model to assess its accuracy.

Finally, we show different validation and cross-validation methods and discuss their advantages and disadvantages.

This tutorial contains simple examples that help data science beginners understand and implement cross-validation methods with readily available libraries. I provide the complete Python code used throughout the tutorial, so more advanced readers can still get something out of it and reuse the snippets for their specific applications.

It is not trivial to decide what data should be in the training and test sets. We could for example save a small amount of our data to test the accuracy of the model. But how can one choose these test data in the least biased way possible?


Overview

We offer a basic tutorial on cross-validation and how to use it. Snippets of the Python code are shown to help you understand the specific implementation. I also provide all the code and images in a public GitHub repository, so feel free to check it out if you want more details.

This short tutorial will cover:

  • How to generate a database
  • The need to validate the Machine Learning model
  • k-fold cross-validation
  • Leave-one-out cross-validation
  • Leave-one-out vs k-fold

Generate database

As an example, we will use data that follows the two-dimensional function f(x₁,x₂)=sin(x₁)+cos(x₂), plus a small random variation in the interval (-0.2,0.2) to slightly complicate the problem. Therefore, our data will follow the expression:

f(x_1,x_2) = sin(x_1) + cos(x_2) + rnd(-0.2, 0.2)

We generate our database as a 21 \times 21 grid with x_1 \in [-10, 10] and x_2 \in [-10, 10]. The image below shows a visual representation of the 441 points in the database.

We show below the code used to generate the dataset and prepare it for the ML regression model:

#!/usr/bin/env python3
# Import libraries
import random
import numpy as np
######################################################################################################
# Main function to generate X,y dataset
def main():
    # Create {x1,x2,f} dataset every 1.0 from -10 to 10, with a noise of +/- 0.2
    x1,x2,f=generate_data(-10,10,1.0,0.2)
    # Prepare X and y for ML    
    X,y = prepare_data_for_ML(x1,x2,f)
######################################################################################################
# Function to generate x1,x2,f data
def generate_data(xmin,xmax,Delta,noise):
    # Calculate f=sin(x1)+cos(x2)
    x1 = np.arange(xmin,xmax+Delta,Delta)   # generate x1 values from xmin to xmax
    x2 = np.arange(xmin,xmax+Delta,Delta)   # generate x2 values from xmin to xmax
    x1, x2 = np.meshgrid(x1,x2)             # make x1,x2 grid of points
    f = np.sin(x1) + np.cos(x2)             # calculate for all (x1,x2) grid
    # Add random noise to f
    random.seed(2020)                       # set random seed for reproducibility
    for i in range(len(f)):
        for j in range(len(f[0])):
            f[i][j] = f[i][j] + random.uniform(-noise,noise)  # add random noise to f(x1,x2)
    return x1,x2,f
######################################################################################################
# Function to transform x1,x2,f to numpy X,y
def prepare_data_for_ML(x1,x2,f):
    X = []
    for i in range(len(f)):
        for j in range(len(f[0])):
            X_term = []
            X_term.append(x1[i][j])
            X_term.append(x2[i][j])
            X.append(X_term)
    y=f.flatten()
    X=np.array(X)
    y=np.array(y)
    return X,y 

Need to validate the Machine Learning model

Our goal is to have a Machine Learning model that receives some feature values (x_1,x_2) as input and returns an accurate f value. Therefore, we need some data to validate how accurate our model is. In this context, we will refer to this set of data interchangeably as the validation or test set. The way one usually proceeds is to use a large portion of the data to train the Machine Learning model, and a small amount of the data (e.g. 10%) as test data.

It is not trivial to decide what data will be in the training and test sets. We could for example use the first 90% of our data to train the model, and the final 10% of the data to test the accuracy of the model, as shown in the picture below.

However, this train/test split is very biased, as the test data is not necessarily representative of the dataset as a whole. One can imagine how the test set could just as well have been any other randomly chosen 10% chunk.
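To make the contrast concrete, here is a minimal sketch of the naive sequential split described above next to a shuffled one, using scikit-learn's train_test_split on a small dummy dataset (a stand-in for the X and y arrays built earlier); the variable names are only for illustration.

#!/usr/bin/env python3
# Minimal sketch: sequential vs shuffled 90/10 split on dummy data
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy stand-in for the (X,y) arrays built by prepare_data_for_ML
X = np.arange(20).reshape(10,2)
y = np.arange(10)

# Naive split: the last 10% of the rows end up in the test set
X_train_seq, X_test_seq, y_train_seq, y_test_seq = train_test_split(X, y, test_size=0.1, shuffle=False)

# Shuffled split: the 10% test chunk is chosen at random (seed fixed for reproducibility)
X_train_rnd, X_test_rnd, y_train_rnd, y_test_rnd = train_test_split(X, y, test_size=0.1, shuffle=True, random_state=2020)

print('Sequential test targets:', y_test_seq)   # always the last row(s)
print('Shuffled test targets:', y_test_rnd)     # a randomly chosen row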

In the code snippet below, we use the function KRR_function, which creates a Kernel Ridge Regression model with 40 points in the test set and 401 points in the train set. To test the accuracy of our model, we predict the f value for all (x_1,x_2) configurations in the test set, and compare these predictions with the actual f values at those configurations.

#!/usr/bin/env python3
# Import libraries
import random
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import KFold
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_squared_error
from scipy.stats import pearsonr
######################################################################################################
# Main function to generate dataset and call KRR function to perform model validation
def main():
    # Create {x1,x2,f} dataset every 1.0 from -10 to 10, with a noise of +/- 0.2
    x1,x2,f=generate_data(-10,10,1.0,0.2)
    # Prepare X and y for ML
    X,y = prepare_data_for_ML(x1,x2,f)
    hyperparams = (0.01,1.5)
    KRR_function(hyperparams,X,y)
######################################################################################################
# Function to do validation 
def KRR_function(hyperparams,X,y):
    # Assign hyper-parameters
    alpha_value,gamma_value = hyperparams
    # Split data into test and train: random state fixed for reproducibility
    random.seed(a=2020)
    test_index = random.choices(range(len(y)),k=40) # Select 40 data points for the test set
    train_index = np.setdiff1d(range(len(y)),test_index) # Select train set as those not in test set
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Scale X_train and X_test
    scaler = preprocessing.StandardScaler().fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    # Fit KRR with (X_train_scaled, y_train), and predict X_test_scaled
    KRR = KernelRidge(kernel='rbf',alpha=alpha_value,gamma=gamma_value)
    y_pred = KRR.fit(X_train_scaled, y_train).predict(X_test_scaled)
    # Calculate error metric of test and predicted values
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r_pearson, _ = pearsonr(y_test, y_pred)
    # Print results
    print('KRR validation . RMSE: %.4f . r: %.4f' %(rmse,r_pearson))
    return

The code above prints some error metrics to judge the accuracy of the model, namely the root-mean-square error (RMSE) and the Pearson correlation coefficient r:

KRR validation . RMSE: 0.2194 . r: 0.9823

The RMSE value indicates that the average difference between the actual and predicted f values is quite small, and the r coefficient close to 1 tells us that the actual and predicted values follow a very similar trend.
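If you want to see exactly what these two metrics measure, the minimal sketch below recomputes them with plain NumPy on a few dummy values (not the KRR predictions above); it should agree with scikit-learn's mean_squared_error and SciPy's pearsonr.

#!/usr/bin/env python3
# Minimal sketch: RMSE and Pearson r computed by hand on dummy values
import numpy as np

y_test = np.array([0.8, -0.3, 1.1, 0.2])   # dummy actual values
y_pred = np.array([0.7, -0.1, 1.0, 0.3])   # dummy predicted values

rmse = np.sqrt(np.mean((y_test - y_pred)**2))   # root-mean-square error
r_pearson = np.corrcoef(y_test, y_pred)[0,1]    # Pearson correlation coefficient
print('RMSE: %.4f . r: %.4f' % (rmse, r_pearson))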

In the code above, I fixed the random seed, so the same 40 points are always chosen for the test set. However, this result may be significantly biased, and one could in principle choose any other test set. So, which of these train/test splits should we choose? This is precisely what cross-validation addresses.


k-fold cross-validation

One of the most common cross-validation methods is k-fold cross-validation, where we split the dataset into k chunks (folds) and perform k validation iterations. In iteration i, the i^{th} fold is used as the test set, and the rest is used as the training set.

It is advisable to assign the data of each fold randomly, so the train/test split is as unbiased as possible. In scikit-learn, we can do this simply by using the shuffle=True option when calling the k-fold cross-validator. For example, to have 10 randomly assigned folds, we can use:

KFold(n_splits=10,shuffle=True)
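Before plugging the cross-validator into the KRR model, it may help to see what it actually yields. The minimal sketch below (toy data and 3 folds instead of 10, just for readability) prints the train/test indices produced in each iteration.

#!/usr/bin/env python3
# Minimal sketch: what the k-fold cross-validator yields (toy data, 3 folds for readability)
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(6).reshape(6,1)   # 6 dummy samples
kf = KFold(n_splits=3, shuffle=True, random_state=2020)
for fold, (train_index, test_index) in enumerate(kf.split(X_toy)):
    print('Fold %d . train: %s . test: %s' % (fold, train_index, test_index))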

I show below the code with a new KRR_function_CV function implementing k-fold cross-validation. The main difference with respect to regular validation is that we now have an extra loop over each of our folds. After this loop, we compare the actual and predicted values to calculate the final error metrics.

#!/usr/bin/env python3
# Import libraries
import random
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import KFold
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_squared_error
from scipy.stats import pearsonr
######################################################################################################
# Main function to generate dataset and call KRR function to perform model validation
def main():
    # Create {x1,x2,f} dataset every 1.0 from -10 to 10, with a noise of +/- 0.2
    x1,x2,f=generate_data(-10,10,1.0,0.2)
    # Prepare X and y for ML
    X,y = prepare_data_for_ML(x1,x2,f)
    hyperparams = (0.01,1.5)
    KRR_function_CV(hyperparams,X,y)
######################################################################################################
# Function to do validation 
def KRR_function_CV(hyperparams,X,y):
    # Assign hyper-parameters
    alpha_value,gamma_value = hyperparams
    # Initialize lists with final results
    y_pred_total = []
    y_test_total = []
    # Split data into test and train: random state fixed for reproducibility
    kf = KFold(n_splits=10,shuffle=True,random_state=2020)
    # k-fold cross-validation loop
    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        # Scale X_train and X_test
        scaler = preprocessing.StandardScaler().fit(X_train)
        X_train_scaled = scaler.transform(X_train)
        X_test_scaled = scaler.transform(X_test)
        # Fit KRR with (X_train_scaled, y_train), and predict X_test_scaled
        KRR = KernelRidge(kernel='rbf',alpha=alpha_value,gamma=gamma_value)
        y_pred = KRR.fit(X_train_scaled, y_train).predict(X_test_scaled)
        # Append y_pred and y_test values of this k-fold step to list with total values
        y_pred_total.append(y_pred)
        y_test_total.append(y_test)
    # Flatten lists with test and predicted values
    y_pred_total = [item for sublist in y_pred_total for item in sublist]
    y_test_total = [item for sublist in y_test_total for item in sublist]
    # Calculate error metric of test and predicted values: rmse
    rmse = np.sqrt(mean_squared_error(y_test_total, y_pred_total))
    r_pearson,_=pearsonr(y_test_total,y_pred_total)
    print('KRR k-fold cross-validation . RMSE: %.4f . r: %.4f' %(rmse,r_pearson))
    return

With this code, we obtain the following result:

KRR k-fold cross-validation . RMSE: 0.1886 . r: 0.9840

Leave-one-out cross-validation

If we keep increasing the number of folds until it equals the total number N of data points in the dataset, we end up with a very unbiased cross-validation, in what is known as leave-one-out cross-validation.

Leave-one-out cross-validation is an extreme case of k-fold cross-validation, in which we perform N validation iterations. In iteration i, we train the model with all data points except the i^{th} one, and the test set consists only of that i^{th} data point.
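In scikit-learn, the LeaveOneOut cross-validator behaves like KFold with n_splits equal to the number of samples. The minimal sketch below (toy data) simply counts and prints the splits to illustrate that there is one validation iteration per data point.

#!/usr/bin/env python3
# Minimal sketch: LeaveOneOut produces one validation iteration per sample (toy data)
import numpy as np
from sklearn.model_selection import LeaveOneOut

X_toy = np.arange(5).reshape(5,1)   # 5 dummy samples
loo = LeaveOneOut()
print('Number of iterations:', loo.get_n_splits(X_toy))   # prints 5
for train_index, test_index in loo.split(X_toy):
    print('train:', train_index, '. test:', test_index)   # test set holds a single point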

It can be implemented just like we did with k-fold cross-validation:

#!/usr/bin/env python3
# Import libraries
import random
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import LeaveOneOut
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_squared_error
from scipy.stats import pearsonr
######################################################################################################
# Main function to generate dataset and call KRR function to perform model validation
def main():
    # Create {x1,x2,f} dataset every 1.0 from -10 to 10, with a noise of +/- 0.2
    x1,x2,f=generate_data(-10,10,1.0,0.2)
    # Prepare X and y for ML
    X,y = prepare_data_for_ML(x1,x2,f)
    hyperparams = (0.01,1.5)
    KRR_function_LOO(hyperparams,X,y)
######################################################################################################
# Function to do validation 
def KRR_function_LOO(hyperparams,X,y):
    # Assign hyper-parameters
    alpha_value,gamma_value = hyperparams
    # Initialize lists with final results
    y_pred_total = []
    y_test_total = []
    # Split data into test and train: random state fixed for reproducibility
    loo = LeaveOneOut()
    # Leave-one-out cross-validation loop
    for train_index, test_index in loo.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        # Scale X_train and X_test
        scaler = preprocessing.StandardScaler().fit(X_train)
        X_train_scaled = scaler.transform(X_train)
        X_test_scaled = scaler.transform(X_test)
        # Fit KRR with (X_train_scaled, y_train), and predict X_test_scaled
        KRR = KernelRidge(kernel='rbf',alpha=alpha_value,gamma=gamma_value)
        y_pred = KRR.fit(X_train_scaled, y_train).predict(X_test_scaled)
        # Append y_pred and y_test values of this leave-one-out step to list with total values
        y_pred_total.append(y_pred)
        y_test_total.append(y_test)
    # Flatten lists with test and predicted values
    y_pred_total = [item for sublist in y_pred_total for item in sublist]
    y_test_total = [item for sublist in y_test_total for item in sublist]
    # Calculate error metric of test and predicted values: rmse
    rmse = np.sqrt(mean_squared_error(y_test_total, y_pred_total))
    r_pearson,_=pearsonr(y_test_total,y_pred_total)
    print('KRR leave-one-out cross-validation . RMSE: %.4f . r: %.4f' %(rmse,r_pearson))
    plot_scatter(y_test_total,y_pred_total) # call to function to plot predictions
    return

The code above produces the following result:

KRR leave-one-out cross-validation . RMSE: 0.1775 . r: 0.9857

Finally, we can create a function to plot the actual and predicted results for each of the validation iterations of the leave-one-out cross-validation:

#!/usr/bin/env python3
# Import libraries
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import gridspec
from mpl_toolkits.mplot3d import Axes3D
from sklearn.metrics import mean_squared_error
######################################################################################################
def plot_scatter(x,y):
    x = np.array(x)
    y = np.array(y)
    fig = plt.figure()
    gs = gridspec.GridSpec(1, 1)
    rmse = np.sqrt(mean_squared_error(x,y))
    ma = np.max([x.max(), y.max()]) + 0.1 
    mi = np.min([x.min(), y.min()]) - 0.1 
    ax = plt.subplot(gs[0])
    ax.scatter(x, y, color="C0")
    ax.set_xlabel(r"Actual $f(x_1,x_2)$", size=14, labelpad=10)
    ax.set_ylabel(r"Predicted $f(x_1,x_2)$", size=14, labelpad=10)
    ax.set_xlim(mi, ma) 
    ax.set_ylim(mi, ma) 
    ax.set_aspect('equal')
    ax.plot(np.arange(mi, ma + 0.1, 0.1), np.arange(mi, ma + 0.1, 0.1), color="k", ls="--")
    ax.annotate(u'$RMSE$ = %.4f' % rmse, xy=(0.15,0.85), xycoords='axes fraction', size=12)
    # Print to file 
    file_name="prediction_loo.png"
    plt.savefig(file_name,dpi=600,bbox_inches='tight')

With the code above, we are able to generate the following figure, showing the actual value of each of the 441 points in our database and the corresponding value predicted by our KRR model. The dashed line indicates what a perfect 1:1 correlation would look like.


Leave-one-out vs k-fold

As seen in the example above, leave-one-out cross-validation tends to give a more optimistic evaluation, returning a lower RMSE, since in each of the N validation iterations the model is trained with more samples.

However, leave-one-out cross-validation can be more time-consuming, since more iterations are carried out, so sometimes one has to settle for k-fold cross-validation.

Finally, it is worth mentioning that leave-one-out cross-validation tends to be more robust, as the result of k-fold cross-validation may depend on which data end up in each fold, which can introduce bias into the validation.
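To get a rough feel for the cost difference, the sketch below times 10-fold versus leave-one-out cross-validation with scikit-learn's cross_val_score, wrapping the scaling and the KRR model in a Pipeline so the scaler is fitted inside each fold. It uses a randomly generated stand-in for our 441-point dataset, and the per-fold RMSE scores are simply averaged (not pooled as in the code above), so the numbers will not match the ones reported earlier; the exact timings depend on your machine.

#!/usr/bin/env python3
# Minimal sketch: rough timing of 10-fold vs leave-one-out cross-validation
import time
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut

# Randomly generated stand-in for the 441-point f(x1,x2) dataset used in this tutorial
rng = np.random.default_rng(2020)
X = rng.uniform(-10, 10, size=(441,2))
y = np.sin(X[:,0]) + np.cos(X[:,1]) + rng.uniform(-0.2, 0.2, size=441)

# Scaling inside the pipeline, so it is fitted on the training folds only
model = make_pipeline(StandardScaler(), KernelRidge(kernel='rbf', alpha=0.01, gamma=1.5))

for name, cv in [('10-fold', KFold(n_splits=10, shuffle=True, random_state=2020)),
                 ('leave-one-out', LeaveOneOut())]:
    start = time.time()
    scores = cross_val_score(model, X, y, cv=cv, scoring='neg_root_mean_squared_error')
    print('%s . mean RMSE: %.4f . model fits: %d . time: %.2f s'
          % (name, -scores.mean(), len(scores), time.time() - start))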


Conclusions

In this tutorial, we have seen a brief introduction to validation and cross-validation. We generated a simple two-dimensional database and built a simple Kernel Ridge Regression model. Then, we covered different (cross-)validation methods to estimate the accuracy of our Machine Learning model. Finally, we discussed some of the advantages and disadvantages of these methods.

If you are interested in how to use genetic algorithms to optimize the hyperparameters, making use of cross-validation, feel free to check my recent short article on that topic at Towards Data Science.

Remember that complete versions of the code used here are available in a public GitHub repository.

Was this tutorial helpful to you? Let me know if you were able to successfully use cross-validation on your model.


* Featured image by Timo Volz on Unsplash


Marcos del Cueto

I am a research associate at the University of Liverpool. I have a PhD in Theoretical Chemistry, and I am now diving into the applications of machine learning to materials discovery.
