Kernel Ridge Regression – Python Tutorial

We start by introducing linear regression. We then show how Kernel Ridge Regression (KRR) is much more flexible and can describe more complex data trends.

Finally, we describe how to optimize the model’s hyper-parameters to obtain an accurate non-linear regression.

This tutorial contains simple examples that data science beginners can follow to use Kernel Ridge Regression successfully. I also provide the complete Python code used throughout, so more advanced readers can still get something out of it and reuse the snippets for their own applications of KRR.



Overview

Anyone with basic Python knowledge can follow this tutorial. The complete Python code is shown to help you understand the specific implementation. I also provide all the code and images in a public GitHub repository, so feel free to play with them as much as you want!

We will use Python’s scikit-learn library, which provides easy access to kernel ridge regression. This tutorial will cover:

  • Linear regression
  • When linear regression fails
  • Kernel Ridge Regression to the rescue
  • Regularization parameter
  • Optimized KRR

Linear Regression

A large number of processes in nature follow a linear relationship, and it is a good place to start when trying to fit data to a function. If we have a target property y (aka dependent variable) that depends on the values of a feature x (aka independent variable), they follow a linear relationship if y can be approximated as

y = a + b \cdot x

where a represents the y-intercept and b is the slope of the line. Note that x and y can be vectors of any dimensionality. For simplicity, we will consider here that both x and y are one-dimensional. This means that we have a single target property that depends on the value of just one feature.

We show in the Table below an example of a data set formed by 11 points in the range [-5,5] that roughly follow a linear relationship.

x_n (arbitrary units)    y_n (arbitrary units)
-5.00                    -4.291
-4.00                    -2.860
-3.00                    -2.918
-2.00                    -1.954
-1.00                    -1.426
 0.00                     1.988
 1.00                     0.157
 2.00                     0.593
 3.00                     2.044
 4.00                     3.042
 5.00                     4.309
Table 1. (x,y) points forming our database.

These 11 points have been generated using the relation:

y = x + rnd(-2,2)

which means that the value of y equals that of x, plus a random variation that goes from -2 to +2. This allows us to create a simple data set that roughly follows a linear relationship.

I plan to write soon about how to build our own linear regressor, but for now, we will use scikit-learn’s built-in linear regressor. For our purposes, it is enough to know that we are trying to find the parameters a and b of the line y = a + b·x such that the sum of squared differences between the regression line and our data points y_n is minimized.
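
In this one-dimensional case, the least-squares values of b and a have a simple closed form:

b = \frac{\sum_n (x_n - \bar{x})(y_n - \bar{y})}{\sum_n (x_n - \bar{x})^2} , \qquad a = \bar{y} - b \cdot \bar{x}

where \bar{x} and \bar{y} are the means of the x_n and y_n values; scikit-learn’s LinearRegression computes this fit for us.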

In the figure below, we show our 11 data points in blue, and the resulting linear regression in orange. As expected, we can see how there is a strong linear correlation.

To measure the error of our regressions, we use the root-mean-square error (RMSE): the square root of the mean of the squared differences between the actual y_n values in our database and the values of the regression curve at the corresponding x_n points.
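
Explicitly, for a database with N points:

RMSE = \sqrt{\frac{1}{N} \sum_{n=1}^{N} \left( y_n - \hat{y}(x_n) \right)^2}

where \hat{y}(x_n) is the value predicted by the regression at x_n.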

Figure 1. Points in Table 1 represented in blue. Linear regression represented in orange.

The code to generate this data set and perform the linear regression is shown below.

#!/usr/bin/env python3
# Import libraries
import math
import random
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Initialize lists and set random seed
list_x = []
list_y = []
list_x_pred = []
random.seed(19)
# Create database with 11 points following quasi-linear relation in interval x:[-5,5]
for i in range(-5,6):
    x = i
    rnd_number = random.uniform(-2,2)
    y = x + rnd_number
    list_x.append(x)
    list_y.append(y)
    print(x,y)
# Create list with 101 points in interval x:[-5,5]
for i in range(-50,51):
    x = 0.1*i
    list_x_pred.append(x)
# Transform lists to np arrays
list_x = np.array(list_x).reshape(-1, 1)
list_x_pred = np.array(list_x_pred).reshape(-1, 1)
# Do linear regression using database with 11 points
regr = linear_model.LinearRegression()
regr.fit(list_x,list_y)
# Calculate value of linear regressor at 101 points in interval x:[-5,5]
list_y_pred = regr.predict(list_x_pred)
# Calculate value of linear regressor at 11 points in interval x:[-5,5]
new_y = regr.predict(list_x)
# Print rmse value
rmse = math.sqrt(mean_squared_error(new_y, list_y))
print('############################')
print('Root Mean Squared Error: %.1f' % rmse)
# Set axes and labels
fig = plt.figure()
ax = fig.add_subplot()
ax.set_xlim(-5.5,5.5)
ax.set_ylim(-5.5,5.5)
ax.xaxis.set_ticks(range(-5,6))
ax.yaxis.set_ticks(range(-5,6))
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.annotate(u'$RMSE$ = %.1f' % rmse, xy=(0.15,0.85), xycoords='axes fraction')
# Plot as orange line the regression line at interval
plt.plot(list_x_pred,list_y_pred,color='C1',linestyle='solid',linewidth=2)
# Plot as blue points the original database
plt.scatter(list_x, list_y,color='C0')
# Print plot to file
file_name='Figure_1.png'
plt.savefig(file_name,format='png',dpi=600)
plt.close()
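
If you want to inspect the fitted line itself, the intercept a and slope b are stored as attributes of the fitted regressor. The short snippet below assumes it is run after the script above, where regr has already been fitted:

# Inspect the fitted line y = a + b*x
a = regr.intercept_   # y-intercept a
b = regr.coef_[0]     # slope b
print('Fitted line: y = %.3f + %.3f*x' % (a, b))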

When linear regression fails

Linear regression is ubiquitous and it should be a first go-to when trying to fit data.

For the next example, we have generated a larger database, with 21 points, in which y is calculated as:

y = (x+4) \cdot (x+1) \cdot (x-1) \cdot (x-3) + rnd(-1,1)

This means that y is calculated as a 4th-order polynomial plus a random variation in the interval [-1,1]. This data set is shown in the Table below, and the resulting linear regression in the Figure that follows.

x_n (arbitrary units)    y_n (arbitrary units)
-5.00                    -192.239
-4.50                      71.537
-4.00                       0.537
-3.50                     -35.671
-3.00                     -48.052
-2.50                     -42.445
-2.00                     -29.914
-1.50                     -13.828
-1.00                      -0.084
-0.50                       8.668
 0.00                      11.419
 0.50                       8.778
 1.00                       0.209
 1.50                     -10.370
 2.00                     -18.790
 2.50                     -16.722
 3.00                      -0.018
 3.50                      42.934
 4.00                     120.075
 4.50                     245.389
 5.00                     431.082
Table 2. (x,y) points forming our database.
Figure 2. Points in Table 2 represented in blue. Linear regression represented in orange.

One clearly observes how the linear regression in orange fails to describe the trend followed by the blue points. This also results in a much larger RMSE.

#!/usr/bin/env python3
# Import libraries
import math
import random
import numpy as np
import matplotlib.pyplot as plt 
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Initialize lists and set random seed
list_x = []
list_y = []
list_x_pred = []
random.seed(2020)
# Create database with 21 points following a 4th-order polynomial relation in interval x:[-5,5]
for i in range(-10,11):
    x = i/2 
    rnd_number= random.uniform(-1,1)
    y = (x+4)*(x+1)*(x-1)*(x-3) + rnd_number
    list_x.append(x)
    list_y.append(y)
    print(x,y)
# Create list with 101 points in interval x:[-5,5]
for i in range(-50,51):
    x = 0.1*i
    list_x_pred.append(x)
# Transform lists to np arrays
list_x = np.array(list_x).reshape(-1, 1)
list_x_pred = np.array(list_x_pred).reshape(-1, 1)
# Do linear regression using database with 21 points
regr = linear_model.LinearRegression()
regr.fit(list_x,list_y)
# Calculate value of linear regressor at 101 points in interval x:[-5,5]
list_y_pred = regr.predict(list_x_pred)
# Calculate value of linear regressor at 21 points in interval x:[-5,5]
new_y = regr.predict(list_x)
# Print rmse value
rmse = math.sqrt(mean_squared_error(new_y, list_y))
print('############################')
print('Root Mean Squared Error: %.1f' % rmse)
# Set axes and labels
fig = plt.figure()
ax = fig.add_subplot()
ax.set_xlim(-5.5,5.5)
ax.set_ylim(-100,500)
ax.xaxis.set_ticks(range(-5,6))
ax.yaxis.set_ticks(range(-100,600,100))
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.annotate(u'$RMSE$ = %.1f' % rmse, xy=(0.15,0.85), xycoords='axes fraction')
# Plot as orange line the regression line at interval
plt.plot(list_x_pred,list_y_pred,color='C1',linestyle='solid',linewidth=2)
# Plot as blue points the original database
plt.scatter(list_x, list_y,color='C0')
# Print plot to file
file_name='Figure_2.png'
plt.savefig(file_name,format='png',dpi=600)
plt.close()

Kernel Ridge Regression to the rescue

When one is working with complex data, quite often linear regression is not enough to capture the peculiarities of the problem. In situations when linear regression fails, we should use non-linear regression methods that allow greater flexibility.

KRR uses the kernel trick to transform our dataset to the kernel space and then performs a linear regression in kernel space. Therefore, one should always choose a kernel appropriate to the problem at hand.
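
More concretely, instead of a straight line, the fitted model becomes a weighted sum of kernel functions centred on the training points,

f(x) = \sum_{n} w_n \cdot K(x, x_n)

where the weights w_n are obtained by solving a linear problem, just as in ordinary ridge regression.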

In this case, we will be using a polynomial kernel. The polynomial kernel for two vectors (two points in our one-dimensional example) x1 and x2 is:

K(x_1,x_2) = (\gamma \cdot x_1^T \cdot x_2 + c)^d

where γ is the kernel coefficient, c is the independent term and d is the degree of the polynomial. In this case, γ and c play a minor role, and their default values (both effectively 1.0 for our one-dimensional data) are adequate, so we will focus only on optimizing the polynomial degree d.
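
To make the “linear regression in kernel space” idea concrete, the sketch below is mathematically equivalent to what scikit-learn’s KernelRidge solves for our data: it builds the polynomial kernel matrix, solves the regularized linear system for the weights w_n, and uses them to predict new values. The function and variable names (polynomial_kernel, krr_fit_predict, x_train, y_train, x_new) are just illustrative.

import numpy as np

def polynomial_kernel(X1, X2, gamma=1.0, c=1.0, d=4):
    # K(x1,x2) = (gamma * x1^T x2 + c)^d, evaluated for all pairs of rows
    return (gamma * X1 @ X2.T + c) ** d

def krr_fit_predict(x_train, y_train, x_new, alpha=1.0, d=4):
    # Kernel matrix between all pairs of training points
    K = polynomial_kernel(x_train, x_train, d=d)
    # Solve (K + alpha*I) w = y for the weights w_n
    w = np.linalg.solve(K + alpha * np.eye(len(x_train)), y_train)
    # Prediction: weighted sum of kernels between new and training points
    return polynomial_kernel(x_new, x_train, d=d) @ w

# Example usage with the arrays built in the scripts above:
# y_sketch = krr_fit_predict(list_x, np.array(list_y), list_x_pred, alpha=1.0, d=4)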

I plan on writing about the importance of optimizing hyper-parameters, as well as different methods to do so in the near future.

In the figure below, we show the KRR regression using polynomial kernels of different degrees.

Figure 3. Four panels using a polynomial kernel of degree d to fit data in blue. Corresponding regression curves shown in orange.

We observe how the resulting RMSE with polynomials of degree 2 and 3 is still significant. This means that these kernels are not enough to capture the complexity of our problem, and we need higher-order polynomials. The RMSE is significantly reduced for polynomial kernels of degree 4 and above. Therefore, in this case, we would ideally use a polynomial kernel of degree 4.

The code used to perform these regressions and generate the figure above for the different polynomial degrees is shown below.

#!/usr/bin/env python3
# Import libraries
import math
import random
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.kernel_ridge import KernelRidge

# Initialize lists and set random seed
list_x = []
list_y = []
list_x_pred = []
list_y_real = []
random.seed(2020)
# Create database with 21 points following a 4th-order polynomial relation in interval x:[-5,5]
for i in range(-10,11):
    x = i/2
    rnd_number= random.uniform(-1,1)
    y = (x+4)*(x+1)*(x-1)*(x-3) + rnd_number
    list_x.append(x)
    list_y.append(y)
    print(x,y)
# Create list with 101 points in interval x:[-5,5]
for i in range(-50,51):
    x = 0.1*i
    list_x_pred.append(x)
    list_y_real.append((x+4)*(x+1)*(x-1)*(x-3))
# Transform lists to np arrays
list_x = np.array(list_x).reshape(-1, 1)
list_x_pred = np.array(list_x_pred).reshape(-1, 1)
# Initialize lists to store KRR predictions and RMSE values
list_y_pred = []
short_list_y_pred = []
rmse_list = []
# For each of the tested polynomial degree values
for degree_value in [2,3,4,5]:
    krr = KernelRidge(alpha=1.0,kernel='polynomial',degree=degree_value)
    krr.fit(list_x,list_y)
    list_y_pred.append(krr.predict(list_x_pred))
    new_y = krr.predict(list_x)
    short_list_y_pred.append(new_y)
    # Print rmse value
    rmse = math.sqrt(mean_squared_error(new_y, list_y))
    rmse_list.append(rmse)
    print('############################')
    print('Degree:', degree_value)
    print('Root Mean Squared Error: %.1f' % rmse)
# Set axes and labels
fig, axs = plt.subplots(2, 2)
for ax in axs.flat:
    ax.set(xlabel='x', ylabel='y')
# Hide x labels and tick labels for top plots and y ticks for right plots.
for ax in axs.flat:
    ax.label_outer()
# Subplot top-left
axs[0, 0].scatter(list_x, list_y,color='C0')
axs[0, 0].plot(list_x_pred,list_y_pred[0],color='C1')
axs[0, 0].set_title(r'$d = 2$')
axs[0, 0].set_xlim(-5.5,5.5)
axs[0, 0].set_ylim(-100,500)
axs[0, 0].annotate(u'$RMSE$ = %.1f' % rmse_list[0], xy=(0.15,0.85), xycoords='axes fraction')
# Subplot top-right
axs[0, 1].scatter(list_x, list_y,color='C0')
axs[0, 1].plot(list_x_pred,list_y_pred[1], color='C1')
axs[0, 1].set_title(r'$d = 3$')
axs[0, 1].set_xlim(-5.5,5.5)
axs[0, 1].set_ylim(-100,500)
axs[0, 1].annotate(u'$RMSE$ = %.1f' % rmse_list[1], xy=(0.15,0.85), xycoords='axes fraction')
# Subplot bottom-left
axs[1, 0].scatter(list_x, list_y,color='C0')
axs[1, 0].plot(list_x_pred,list_y_pred[2], color='C1')
axs[1, 0].set_title(r'$d = 4$')
axs[1, 0].set_xlim(-5.5,5.5)
axs[1, 0].set_ylim(-100,500)
axs[1, 0].annotate(u'$RMSE$ = %.1f' % rmse_list[2], xy=(0.15,0.85), xycoords='axes fraction')
# Subplot bottom-right
axs[1, 1].scatter(list_x, list_y,color='C0')
axs[1, 1].plot(list_x_pred,list_y_pred[3], color='C1')
axs[1, 1].set_title(r'$d = 5$')
axs[1, 1].set_xlim(-5.5,5.5)
axs[1, 1].set_ylim(-100,500)
axs[1, 1].annotate(u'$RMSE$ = %.1f' % rmse_list[3], xy=(0.15,0.85), xycoords='axes fraction')
# Print plot to file
file_name='Figure_3.png'
plt.savefig(file_name,format='png',dpi=600)
plt.close() 

Regularization parameter

The regularization parameter, α, should also be optimized. It controls the conditioning of the problem: larger α values produce smoother, more “general” models that ignore the peculiarities of the data. A larger α helps the model ignore noise in the system, but it may also leave the model blind to the actual trends in the data.
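
In terms of the weights w_n introduced above, α enters the fit as a penalty on their size: in kernel space the weights are obtained by solving

(K + \alpha I) \cdot w = y

so α is added to the diagonal of the kernel matrix K. A larger α makes this linear system better conditioned and the resulting curve smoother, at the price of a larger error on the training points.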

If we perform our kernel ridge regression for different α values, we can clearly see its effect, as shown below.

Figure 4. Four panels using KRR with regularization α to fit data in blue. Corresponding regression curves shown in orange.

In this case, a small α of approximately 0.1 results in a very accurate regression.

#!/usr/bin/env python3
# Import libraries
import math
import random
import numpy as np
import matplotlib.pyplot as plt 
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.kernel_ridge import KernelRidge

# Initialize lists and set random seed
list_x = []
list_y = []
list_x_pred = []
list_y_real = []
random.seed(2020)
# Create database with 21 points following a 4th-order polynomial relation in interval x:[-5,5]
for i in range(-10,11):
    x = i/2 
    rnd_number= random.uniform(-1,1)
    y = (x+4)*(x+1)*(x-1)*(x-3) + rnd_number
    list_x.append(x)
    list_y.append(y)
    print(x,y)
# Create list with 101 points in interval x:[-5,5]
for i in range(-50,51):
    x = 0.1*i
    list_x_pred.append(x)
    list_y_real.append((x+4)*(x+1)*(x-1)*(x-3))
# Transform lists to np arrays
list_x = np.array(list_x).reshape(-1, 1)
list_x_pred = np.array(list_x_pred).reshape(-1, 1)
# Initialize lists to store KRR predictions and RMSE values
list_y_pred = []
short_list_y_pred = []
rmse_list = []
# For each of the tested regularization values
for alpha_value in [0.0001,0.1,10000,100000000]:
    krr = KernelRidge(alpha=alpha_value,kernel='polynomial',degree=4)
    krr.fit(list_x,list_y)
    list_y_pred.append(krr.predict(list_x_pred))
    new_y = krr.predict(list_x)
    short_list_y_pred.append(new_y)
    # Print rmse value
    rmse = math.sqrt(mean_squared_error(new_y, list_y))
    rmse_list.append(rmse)
    print('###########################')
    print('Alpha:', alpha_value)
    print('Root Mean Squared Error: %.1f' % rmse)
# Set axes and labels
fig, axs = plt.subplots(2, 2)
for ax in axs.flat:
    ax.set(xlabel='x', ylabel='y')
# Hide x labels and tick labels for top plots and y ticks for right plots.
for ax in axs.flat:
    ax.label_outer()
# Subplot top-left
axs[0, 0].scatter(list_x, list_y,color='C0')
axs[0, 0].plot(list_x_pred,list_y_pred[0],color='C1')
axs[0, 0].set_title(r'$\alpha = 10^{-4}$')
axs[0, 0].set_xlim(-5.5,5.5)
axs[0, 0].set_ylim(-100,500)
axs[0, 0].annotate(u'$RMSE$ = %.1f' % rmse_list[0], xy=(0.15,0.85), xycoords='axes fraction')
# Subplot top-right
axs[0, 1].scatter(list_x, list_y,color='C0')
axs[0, 1].plot(list_x_pred,list_y_pred[1], color='C1')
axs[0, 1].set_title(r'$\alpha = 10^{-1}$')
axs[0, 1].set_xlim(-5.5,5.5)
axs[0, 1].set_ylim(-100,500)
axs[0, 1].annotate(u'$RMSE$ = %.1f' % rmse_list[1], xy=(0.15,0.85), xycoords='axes fraction')
# Subplot bottom-left
axs[1, 0].scatter(list_x, list_y,color='C0')
axs[1, 0].plot(list_x_pred,list_y_pred[2], color='C1')
axs[1, 0].set_title(r'$\alpha = 10^{4}$')
axs[1, 0].set_xlim(-5.5,5.5)
axs[1, 0].set_ylim(-100,500)
axs[1, 0].annotate(u'$RMSE$ = %.1f' % rmse_list[2], xy=(0.15,0.85), xycoords='axes fraction')
# Subplot bottom-right
axs[1, 1].scatter(list_x, list_y,color='C0')
axs[1, 1].plot(list_x_pred,list_y_pred[3], color='C1')
axs[1, 1].set_title(r'$\alpha = 10^{8}$')
axs[1, 1].set_xlim(-5.5,5.5)
axs[1, 1].set_ylim(-100,500)
axs[1, 1].annotate(u'$RMSE$ = %.1f' % rmse_list[3], xy=(0.15,0.85), xycoords='axes fraction')
# Print plot to file
file_name='Figure_4.png'
plt.savefig(file_name,format='png',dpi=600)
plt.close()
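
Instead of scanning d and α by hand as we did above, we can also let scikit-learn search over both hyper-parameters at once with cross-validation. The sketch below assumes that list_x and list_y have already been built as in the previous scripts; the grid values and the choice of 5 folds are just illustrative.

from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

# Illustrative grid of hyper-parameters to explore
param_grid = {'degree': [2, 3, 4, 5],
              'alpha': [0.0001, 0.01, 0.1, 1.0, 10.0]}
# 5-fold cross-validated search over the grid, scored by RMSE
search = GridSearchCV(KernelRidge(kernel='polynomial'),
                      param_grid,
                      scoring='neg_root_mean_squared_error',
                      cv=5)
search.fit(list_x, list_y)
print('Best hyper-parameters:', search.best_params_)
print('Best cross-validated RMSE: %.1f' % -search.best_score_)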

Optimized KRR

Finally, using the optimized hyper-parameters (d = 4 and α = 0.1), we can perform a Kernel Ridge Regression, as shown below, which results in a very accurate fit.

Figure 5. Kernel Ridge Regression of points in blue with optimized hyper-parameters. Regression curve shown in orange.

The final code to do this KRR and obtain Figure 5 is shown below:

#!/usr/bin/env python3
# Import libraries
import math
import random
import numpy as np
import matplotlib.pyplot as plt 
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.kernel_ridge import KernelRidge

# Initialize lists and set random seed
list_x = []
list_y = []
list_x_pred = []
random.seed(2020)
# Create database with 21 points following a 4th-order polynomial relation in interval x:[-5,5]
for i in range(-10,11):
    x = i/2 
    rnd_number= random.uniform(-1,1)
    y = (x+4)*(x+1)*(x-1)*(x-3) + rnd_number
    list_x.append(x)
    list_y.append(y)
    print(x,y)
# Create list with 101 points in interval x:[-5,5]
for i in range(-50,51):
    x = 0.1*i
    list_x_pred.append(x)
# Transform lists to np arrays
list_x = np.array(list_x).reshape(-1, 1)
list_x_pred = np.array(list_x_pred).reshape(-1, 1)
# Do kernel ridge regression using database with 21 points
krr = KernelRidge(alpha=0.1,kernel='polynomial',degree=4)
krr.fit(list_x,list_y)
# Calculate value of the KRR model at 101 points in interval x:[-5,5]
list_y_pred = krr.predict(list_x_pred)
# Calculate value of the KRR model at 21 points in interval x:[-5,5]
new_y = krr.predict(list_x)

# Print rmse value
rmse = math.sqrt(mean_squared_error(new_y, list_y))
print('############################')
print('Root Mean Squared Error: %.1f' % rmse)
# Set axes and labels
fig = plt.figure()
ax = fig.add_subplot()
ax.set_xlim(-5.5,5.5)
ax.set_ylim(-100,500)
ax.xaxis.set_ticks(range(-5,6))
ax.yaxis.set_ticks(range(-100,600,100))
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.annotate(u'$RMSE$ = %.1f' % rmse, xy=(0.15,0.85), xycoords='axes fraction')
# Plot as orange line the regression line at interval
plt.plot(list_x_pred,list_y_pred,color='C1',linestyle='solid',linewidth=2)
# Plot as blue points the original database
plt.scatter(list_x, list_y,color='C0')
# Print plot to file
file_name='Figure_5.png'
plt.savefig(file_name,format='png',dpi=600)
plt.close()             

Conclusions

In this tutorial, we have first seen a brief introduction to linear regression and Kernel Ridge Regression. We have generated simple one-dimensional databases and seen when linear regression can be useful. Then, we have covered how KRR can be helpful for more complex databases, and how to use a polynomial kernel. Finally, we have seen how to optimize the main hyper-parameters of the model to obtain accurate predictions.

If you are interested in Machine Learning applications, you can check my recent posts on k-nearest neighbors regression and the use of user-defined metrics in scikit-learn.

If you are curious about the application of KRR in real problems, you can check our recent work at the University of Liverpool, in collaboration with the Northeast Normal University. We used KRR, among other ML methods, to predict the efficiency of organic solar cells.

Was this tutorial helpful to you? Let me know if you were able to successfully use a Kernel Ridge Regression!




Marcos del Cueto

I am a research associate at the University of Liverpool. I have a PhD in Theoretical Chemistry, and I am now diving into the applications of machine learning to materials discovery.
