Multiple Linear Regression (MLR) in Python

This article explains multiple linear regression and how to program multiple linear regression models in Python.

5 min read · Dec 6, 2022

Multiple linear regression (MLR) is a statistical method used to model the relationship between a dependent variable and two or more independent variables. It is called “multiple” because it involves more than one independent variable, and “linear” because the dependent variable is modeled as a linear function of the independent variables. Fitting an MLR model means finding the line of best fit through the data (a plane or hyperplane when there are several independent variables). The intercept and coefficients are chosen so that the sum of the squared differences between the predicted values and the actual values is minimized.
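To make the least-squares idea concrete, here is a minimal sketch that fits two coefficients and an intercept with NumPy. The data is synthetic and the "true" values (3.0, 2.0, -1.0) are assumptions chosen only for illustration; they are not taken from the car dataset used later.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the article's dataset)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
true_intercept = 3.0
true_coefs = np.array([2.0, -1.0])
y = true_intercept + X @ true_coefs  # noise-free, so the fit recovers the values

# Add a column of ones so the intercept is estimated together with the coefficients
A = np.column_stack([np.ones(len(X)), X])

# Find b that minimizes the sum of squared residuals ||A b - y||^2
b, *_ = np.linalg.lstsq(A, y, rcond=None)

print("Intercept:", b[0])
print("Coefficients:", b[1:])
```

With real, noisy data the recovered values would only approximate the underlying relationship, but the minimization is the same.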

Formula and Calculation of Multiple Linear Regression

The formula for multiple linear regression is the same as the formula for simple linear regression, with the addition of more independent variables. The general form of the equation is:

Y = b0 + b1·X1 + b2·X2 + … + bn·Xn

**Figure 1:** General form of the regression equation

where Y is the dependent variable, X1, X2, … Xn are the independent variables, and b0, b1, b2, … bn are the coefficients (or weights) of the model.

The coefficients of the independent variables in the fitted model can be used to make predictions about the dependent variable, given a set of values for the independent variables. These predictions can then be used to make decisions or draw conclusions about the underlying relationship between the dependent and independent variables.

Now we can move on to the Python implementation.

We’ll be building an MLR model to predict the CO2 emissions of cars. Before building our model, it is necessary to import and process the data and identify variables for our regression model.

Importing Libraries

#Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# import warnings
import warnings
warnings.filterwarnings("ignore")

# We will use some methods from the sklearn module
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split, cross_val_score

# Reading the Dataset
df = pd.read_csv("/kaggle/input/cardataset/data.csv")

df.head()

df.shape

(36, 5)

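# numeric_only=True skips the non-numeric columns, which recent pandas versions would otherwise reject
print(df.corr(numeric_only=True))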

        Volume    Weight       CO2
Volume  1.000000  0.753537  0.592082
Weight  0.753537  1.000000  0.552150
CO2     0.592082  0.552150  1.000000

print(df.describe())

        Volume       Weight         CO2
count    36.000000    36.000000   36.000000
mean   1611.111111  1292.277778  102.027778
std     388.975047   242.123889    7.454571
min     900.000000   790.000000   90.000000
25%    1475.000000  1117.250000   97.750000
50%    1600.000000  1329.000000   99.000000
75%    2000.000000  1418.250000  105.000000
max    2500.000000  1746.000000  120.000000

Next, put the independent variables in a variable called X and the dependent variable in a variable called y.

By convention, the set of independent variables is named with an uppercase X and the dependent variable with a lowercase y.

Equation: CO2 = β0 + (β1 Weight) + (β2 Volume) + e
Setting the values for the independent (X) variables and the dependent (y) variable

# Setting the values for X and y
X = df[['Weight', 'Volume']]
y = df['CO2']

Checking for outliers

fig, axs = plt.subplots(2, figsize = (5,5))
plt1 = sns.boxplot(x = df['Weight'], ax = axs[0])
plt2 = sns.boxplot(x = df['Volume'], ax = axs[1])
plt.tight_layout()

Exploratory Data Analysis

Distribution of the target variable

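# histplot with kde=True replaces distplot, which is deprecated in recent seaborn versions
sns.histplot(df['CO2'], kde=True)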

Relationship of CO2 with other variables

sns.pairplot(df, x_vars=['Weight', 'Volume'], y_vars='CO2', height=4, aspect=1, kind='scatter')
plt.show()

Heatmap

The sns.heatmap() function draws a matrix of values as a color-coded grid; here it is used to visualize the correlation matrix of the dataset. The annot parameter controls whether the correlation values are written inside the cells: with annot = True, each cell displays its value.

# Create the correlation matrix (numeric columns only) and represent it as a heatmap.
sns.heatmap(df.corr(numeric_only=True), annot = True, cmap = 'coolwarm')
plt.show()

Model Building

Splitting the dataset into train and test set

We need to split our dataset into training and testing sets. We’ll do this with train_test_split, imported from the sklearn.model_selection module. A common choice is to keep 70% of the data in the training set and the remaining 30% in the test set.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 100)

y_train.shape
(25,)

y_test.shape
(11,)
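The 25/11 split above is not exactly 70/30, because 30% of 36 rows is 10.8 and train_test_split rounds the test count up. A small sketch on placeholder arrays (X_demo and y_demo are hypothetical stand-ins with the same number of rows as the car dataset) shows the same sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 36 rows, matching the row count of the dataset
X_demo = np.arange(72).reshape(36, 2)
y_demo = np.arange(36)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=100
)

# 0.3 * 36 = 10.8, rounded up to 11 test rows; the other 25 go to training
print(len(y_tr), len(y_te))
```

With so few test rows, the error metrics reported later can change noticeably with a different random_state.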

# Fitting the multiple linear regression model
reg_model = LinearRegression().fit(X_train, y_train)

#Printing the model coefficients
print('Intercept: ',reg_model.intercept_)

# pair the feature names with the coefficients
list(zip(X, reg_model.coef_))

Intercept:  74.33882836589245

[('Weight', 0.0171800645996374), ('Volume', 0.0025046399866402976)]
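The intercept and coefficients printed above are all that is needed to make a prediction by hand. The sketch below plugs them into the regression equation; the helper name predict_co2 and the example car (Weight 1000, Volume 1500) are hypothetical and chosen only for illustration.

```python
# Values copied from the fitted model's output above
intercept = 74.33882836589245
coef_weight = 0.0171800645996374
coef_volume = 0.0025046399866402976


def predict_co2(weight, volume):
    """Apply CO2 = b0 + b1 * Weight + b2 * Volume with the fitted values."""
    return intercept + coef_weight * weight + coef_volume * volume


# Hypothetical car, not a row from the dataset
prediction = predict_co2(weight=1000, volume=1500)
print(round(prediction, 2))  # roughly 95.28
```

Each coefficient reads as the change in predicted CO2 for a one-unit increase in that variable while the other is held fixed: about 0.017 per unit of Weight and about 0.0025 per unit of Volume.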

# Predicting the test and train set results
y_pred = reg_model.predict(X_test)
y_train_pred = reg_model.predict(X_train)

print("Prediction for test set: {}".format(y_pred))

Prediction for test set: [ 90.41571939 102.16323413  99.56363213 104.56661845 101.54657652
  95.94770019 108.64011848 102.22654214  92.80374837  97.27327129
  97.57074463]

#Actual value and the predicted value
reg_model_diff = pd.DataFrame({'Actual value': y_test, 'Predicted value': y_pred})
reg_model_diff

mae = metrics.mean_absolute_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
# The square root of the MSE is the root mean squared error (RMSE)
rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))

print('Mean Absolute Error:', mae)
print('Mean Square Error:', mse)
print('Root Mean Square Error:', rmse)

Mean Absolute Error: 6.901980901636316
Mean Square Error: 63.39765310998792
Root Mean Square Error: 7.9622643205301795
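To see what these three error metrics measure, they can be computed by hand with NumPy. The numbers below are made up for illustration and are not the model's predictions. R² (the coefficient of determination) is a separate quantity from RMSE and is included only for comparison.

```python
import numpy as np

# Made-up actual and predicted values (assumptions, not from the dataset)
y_true = np.array([100.0, 95.0, 110.0, 105.0])
y_hat = np.array([98.0, 97.0, 104.0, 105.0])

errors = y_true - y_hat            # [2, -2, 6, 0]

mae = np.mean(np.abs(errors))      # average absolute error: 2.5
mse = np.mean(errors ** 2)         # average squared error: 11.0
rmse = np.sqrt(mse)                # same units as the target: about 3.32

# R^2 compares the model's squared error with that of always predicting the mean
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot           # 1 - 44/125 = 0.648

print(mae, mse, rmse, r2)
```

RMSE is often the easiest to interpret because it is expressed in the same units as the dependent variable.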

Thanks for reading this article. You can find the full code for this project and my other projects on my GitHub or Kaggle account. Happy coding!

If you have any feedback, feel free to share it in the comments section or contact me if you need any further information.



Written by Emine Bozkus
