Multiple Linear Regression (MLR) in Python

This article explains multiple linear regression and how to program multiple linear regression models in Python.

Emine Bozkus
5 min read · Dec 6, 2022

Multiple linear regression (MLR) is a statistical method used to model the relationship between a dependent variable and two or more independent variables. It is called “multiple” because it involves more than one independent variable, and “linear” because the dependent variable is assumed to be a linear function of the independent variables. Fitting an MLR model means finding the line of best fit through the data (a plane or hyperplane once there are two or more independent variables). The intercept and slopes are chosen so that the sum of the squared differences between the predicted values and the actual values is minimized.
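To make the least-squares idea concrete, here is a minimal sketch using NumPy. The toy data and the variable names (X, y, A, coef, sse) are made up for illustration and are not part of the car dataset used later; the target is generated from known coefficients so the result can be checked.

```python
import numpy as np

# Hypothetical toy data: 5 observations, 2 independent variables.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])

# Target built exactly from y = 1 + 2*x1 + 3*x2, so the true coefficients are known.
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the intercept is estimated together with the slopes.
A = np.column_stack([np.ones(len(X)), X])

# lstsq returns the coefficients that minimize the sum of squared residuals.
coef, _, _, _ = np.linalg.lstsq(A, y, rcond=None)

# Sum of squared differences between actual and predicted values.
sse = float(np.sum((y - A @ coef) ** 2))

print("Intercept and slopes:", coef)
print("Sum of squared errors:", sse)
```

Because the toy target contains no noise, the recovered coefficients are approximately 1, 2 and 3 and the sum of squared errors is essentially zero. With real data the errors do not vanish, and the fitted coefficients are the ones that make that sum as small as possible.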

Formula and Calculation of Multiple Linear Regression

The formula for multiple linear regression is the same as the formula for simple linear regression, with the addition of more independent variables. The general form of the equation is:

Y = b0 + b1 X1 + b2 X2 + … + bn Xn

Figure 1: General form of the regression equation

where Y is the dependent variable, X1, X2, … Xn are the independent variables, and b0, b1, b2, … bn are the coefficients (or weights) of the model.

The coefficients of the independent variables in the fitted model can be used to make predictions about the dependent variable, given a set of values for the independent variables. These predictions can then be used to make decisions or draw conclusions about the underlying relationship between the dependent and independent variables.
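As a small illustration of how a prediction is computed from the coefficients, the sketch below evaluates the regression equation for one observation. The function name predict and all the numbers are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def predict(intercept, coefficients, values):
    """Return b0 + b1*x1 + ... + bn*xn for a single observation."""
    return intercept + sum(b * x for b, x in zip(coefficients, values))


# Hypothetical model with two independent variables.
b0 = 10.0          # intercept
b = [2.0, 0.5]     # coefficients b1, b2
x = [3.0, 4.0]     # values of X1, X2 for one observation

y_hat = predict(b0, b, x)
print(y_hat)  # 10 + 2*3 + 0.5*4 = 18.0
```

Each coefficient is the amount the prediction changes when its variable increases by one unit while the other variables stay fixed.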

Now we can move on to the Python implementation.

We’ll be building an MLR model to predict the CO2 emissions of cars. Before building our model, it is necessary to import and process the data and identify variables for our regression model.

Importing Libraries

#Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# import warnings
import warnings
warnings.filterwarnings("ignore")

# We will use some methods from the sklearn module
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split, cross_val_score
# Reading the Dataset
df = pd.read_csv("/kaggle/input/cardataset/data.csv")
df.head()
df.shape
(36, 5)
# numeric_only=True skips the non-numeric columns (required on pandas 2.0+)
print(df.corr(numeric_only=True))
          Volume    Weight       CO2
Volume  1.000000  0.753537  0.592082
Weight  0.753537  1.000000  0.552150
CO2     0.592082  0.552150  1.000000
print(df.describe())
            Volume       Weight         CO2
count    36.000000    36.000000   36.000000
mean   1611.111111  1292.277778  102.027778
std     388.975047   242.123889    7.454571
min     900.000000   790.000000   90.000000
25%    1475.000000  1117.250000   97.750000
50%    1600.000000  1329.000000   99.000000
75%    2000.000000  1418.250000  105.000000
max    2500.000000  1746.000000  120.000000

Then make a list of the independent values and call this variable X. Put the dependent values in a variable called y.

It is common to name the list of independent values with an uppercase X and the list of dependent values with a lowercase y.

  • Equation: CO2 = β0 + (β1 Weight) + (β2 Volume) + e
  • Setting the values for the independent (X) variables and the dependent (y) variable
# Setting the values for X and y
X = df[['Weight', 'Volume']]
y = df['CO2']

Checking for outliers

fig, axs = plt.subplots(2, figsize = (5,5))
# Pass the column as x explicitly so each box is drawn horizontally
plt1 = sns.boxplot(x = df['Weight'], ax = axs[0])
plt2 = sns.boxplot(x = df['Volume'], ax = axs[1])
plt.tight_layout()
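A boxplot flags points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The same check can be sketched numerically from the quartiles, minimum and maximum reported by df.describe() above. The helper iqr_bounds and the summary dictionary are illustrative names, and the result assumes the plot uses the same quartile values as describe().

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return the lower and upper limits beyond which a point counts as an outlier."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Values copied from the df.describe() output shown earlier.
summary = {
    "Weight": {"min": 790.0, "q1": 1117.25, "q3": 1418.25, "max": 1746.0},
    "Volume": {"min": 900.0, "q1": 1475.0, "q3": 2000.0, "max": 2500.0},
}

results = {}
for name, s in summary.items():
    low, high = iqr_bounds(s["q1"], s["q3"])
    # If even the extreme values fall inside the limits, no point is an outlier.
    has_outliers = s["min"] < low or s["max"] > high
    results[name] = (low, high, has_outliers)
    print(name, "limits:", low, high, "outliers:", has_outliers)
```

Under this rule neither Weight nor Volume has values outside the limits, so no rows need to be removed before modeling.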

Exploratory Data Analysis

Distribution of the target variable

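# distplot is deprecated in recent seaborn versions; histplot with a KDE curve is the closest replacement
sns.histplot(df['CO2'], kde=True, stat='density');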

Relationship of CO2 with other variables

sns.pairplot(df, x_vars=['Weight', 'Volume'], y_vars='CO2', height=4, aspect=1, kind='scatter')
plt.show()

Heatmap

The sns.heatmap() function draws a matrix as a color-coded grid; here we pass it the correlation matrix of the dataset. The annot parameter controls whether the correlation values are written inside the cells: when it is set to True, they are displayed.

# Create the correlation matrix and represent it as a heatmap.
sns.heatmap(df.corr(numeric_only=True), annot = True, cmap = 'coolwarm')
plt.show()

Model Building

Splitting the dataset into train and test set

We need to split our dataset into training and testing sets. We’ll do this with train_test_split from the sklearn.model_selection module. A common choice is to keep 70% of the data in the training set and the remaining 30% in the test set.

X_train,X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 100)
y_train.shape
(25,)

y_test.shape
(11,)
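The shapes above follow from how train_test_split rounds: with 36 rows and test_size = 0.3, the test set gets 11 rows and the training set gets the remaining 25. The sketch below reproduces this on placeholder arrays (X_demo and y_demo are made-up stand-ins with the same number of rows, not the car data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the dataset (36).
X_demo = np.arange(72).reshape(36, 2)
y_demo = np.arange(36)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=100
)

print(X_tr.shape, X_te.shape)  # (25, 2) (11, 2)

# Fixing random_state makes the split reproducible across runs.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=100
)
print(np.array_equal(X_te, X_te2))  # True
```

With only 11 test rows, the error metrics computed later can vary noticeably with the choice of random_state.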
#Fitting the Multiple Linear Regression model
reg_model = LinearRegression().fit(X_train, y_train)

#Printing the model coefficients
print('Intercept: ',reg_model.intercept_)

# pair the feature names with the coefficients
list(zip(X, reg_model.coef_))
Intercept:  74.33882836589245
[('Weight', 0.0171800645996374), ('Volume', 0.0025046399866402976)]
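To see what these numbers mean, the sketch below plugs the printed intercept and coefficients into the regression equation by hand. The example car (Weight 1300, Volume 1600) is a hypothetical input chosen for illustration, not a row taken from the dataset.

```python
import numpy as np

# Intercept and coefficients copied from the output above (order: Weight, Volume).
intercept = 74.33882836589245
coef = np.array([0.0171800645996374, 0.0025046399866402976])

# Hypothetical car: Weight = 1300, Volume = 1600.
car = np.array([1300.0, 1600.0])

# CO2 = b0 + b1*Weight + b2*Volume
co2_pred = intercept + car @ coef
print(round(co2_pred, 2))  # about 100.68

# Effect of adding 100 to Weight while Volume stays the same.
heavier = car + np.array([100.0, 0.0])
delta = (intercept + heavier @ coef) - co2_pred
print(round(delta, 3))  # 100 * 0.01718 = about 1.718
```

This is the same calculation reg_model.predict performs for each row of its input.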
#Predicting the Test and Train set result
y_pred = reg_model.predict(X_test)
y_train_pred = reg_model.predict(X_train)
print("Prediction for test set: {}".format(y_pred))
Prediction for test set: [ 90.41571939 102.16323413  99.56363213 104.56661845 101.54657652
95.94770019 108.64011848 102.22654214 92.80374837 97.27327129
97.57074463]
#Actual value and the predicted value
reg_model_diff = pd.DataFrame({'Actual value': y_test, 'Predicted value': y_pred})
reg_model_diff
mae = metrics.mean_absolute_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
# RMSE is the square root of the MSE (this is not the R² score)
rmse = np.sqrt(mse)

print('Mean Absolute Error:', mae)
print('Mean Square Error:', mse)
print('Root Mean Square Error:', rmse)
Mean Absolute Error: 6.901980901636316
Mean Square Error: 63.39765310998792
Root Mean Square Error: 7.9622643205301795
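The three metrics are simple averages of the prediction errors. The sketch below computes them by hand on a tiny made-up example and also shows R², a different quantity that measures the share of variance explained. The arrays y_true and y_hat are hypothetical and unrelated to the test set above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted values.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_hat              # [1, 0, -1, -2]

mae = np.mean(np.abs(errors))        # (1 + 0 + 1 + 2) / 4 = 1.0
mse = np.mean(errors ** 2)           # (1 + 0 + 1 + 4) / 4 = 1.5
rmse = np.sqrt(mse)                  # square root of 1.5

# R² = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)                      # 6
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # 20
r2 = 1 - ss_res / ss_tot                          # 0.7

print(mae, mse, rmse, r2)

# The hand-computed values agree with scikit-learn's functions.
print(np.isclose(mae, mean_absolute_error(y_true, y_hat)))
print(np.isclose(mse, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2, r2_score(y_true, y_hat)))
```

MAE and RMSE are expressed in the same units as the target, which makes them easy to compare with the spread of CO2 values in the data, while R² is unitless.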

Thanks for reading this article. You can find the full code for this project and my other projects on my GitHub or Kaggle account. Happy coding!

If you have any feedback, feel free to share it in the comments section or contact me if you need any further information.

