Multiple Linear Regression (MLR) in Python
This article explains multiple linear regression and how to program multiple linear regression models in Python.
Multiple linear regression (MLR) is a statistical method used to model the relationship between a dependent variable and two or more independent variables. It is called “multiple” because it involves more than one independent variable, and “linear” because the dependent variable is modeled as a linear combination of the independent variables. Fitting the model means choosing the intercept and the coefficients so that the sum of the squared differences between the predicted values and the actual values is minimized (ordinary least squares). With a single independent variable the fit is a line of best fit; with two it is a plane, and with more it is a hyperplane.
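To make the least-squares idea concrete, here is a minimal sketch on synthetic data. The data, the "true" coefficients, and the helper name sse are assumptions made up for illustration; they are not part of the car dataset used later in the article.

```python
import numpy as np

# Synthetic data (hypothetical, not the article's car dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                      # two independent variables
true_b = np.array([3.0, 1.5, -2.0])               # intercept, b1, b2 (made up)
y = true_b[0] + X @ true_b[1:] + rng.normal(scale=0.1, size=50)

# Prepend a column of ones so the intercept is estimated with the other coefficients
A = np.column_stack([np.ones(len(X)), X])

# Least-squares solution: minimizes the sum of squared residuals
b_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

def sse(b):
    """Sum of squared differences between actual and predicted values."""
    residuals = y - A @ b
    return float(residuals @ residuals)

print("Estimated coefficients:", b_hat)
print("SSE at the fit:", sse(b_hat))
print("SSE after shifting every coefficient by 0.1:", sse(b_hat + 0.1))
```

Any change to the fitted coefficients increases the sum of squared errors, which is what "line of best fit" means in this setting.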
Formula and Calculation of Multiple Linear Regression
The formula for multiple linear regression extends the formula for simple linear regression with one term for each additional independent variable. The general form of the equation is:
Y = b0 + b1X1 + b2X2 + … + bnXn
where Y is the dependent variable, X1, X2, … Xn are the independent variables, b0 is the intercept, and b1, b2, … bn are the coefficients (or weights) of the model.
The coefficients of the independent variables in the fitted model can be used to make predictions about the dependent variable, given a set of values for the independent variables. These predictions can then be used to make decisions or draw conclusions about the underlying relationship between the dependent and independent variables.
Now we can move on to the Python implementation.
We’ll be building an MLR model to predict the CO2 emissions of cars. Before building our model, it is necessary to import and process the data and identify variables for our regression model.
Importing Libraries
#Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# import warnings
import warnings
warnings.filterwarnings("ignore")
# We will use some methods from the sklearn module
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split, cross_val_score
# Reading the Dataset
df = pd.read_csv("/kaggle/input/cardataset/data.csv")
df.head()
df.shape
(36, 5)
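# numeric_only=True skips the text columns; recent pandas versions raise an error without it
print(df.corr(numeric_only=True))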
          Volume    Weight       CO2
Volume  1.000000  0.753537  0.592082
Weight  0.753537  1.000000  0.552150
CO2     0.592082  0.552150  1.000000
print(df.describe())
            Volume       Weight         CO2
count    36.000000    36.000000   36.000000
mean   1611.111111  1292.277778  102.027778
std     388.975047   242.123889    7.454571
min     900.000000   790.000000   90.000000
25%    1475.000000  1117.250000   97.750000
50%    1600.000000  1329.000000   99.000000
75%    2000.000000  1418.250000  105.000000
max    2500.000000  1746.000000  120.000000
Then make a list of the independent values and call this variable X. Put the dependent values in a variable called y.
It is common to name the list of independent values with an uppercase X and the list of dependent values with a lowercase y.
- Equation: CO2 = β0 + (β1 Weight) + (β2 Volume) + e
- Setting the values for the independent variables (X) and the dependent variable (y)
# Setting the values for X and y
X = df[['Weight', 'Volume']]
y = df['CO2']
Checking for outliers
fig, axs = plt.subplots(2, figsize = (5,5))
# Pass the column as x= so each box is drawn horizontally on its own axis
plt1 = sns.boxplot(x = df['Weight'], ax = axs[0])
plt2 = sns.boxplot(x = df['Volume'], ax = axs[1])
plt.tight_layout()
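The boxplots flag points that lie beyond the whiskers, which by default extend 1.5 times the interquartile range (IQR) past the quartiles. The same rule can be applied numerically. The sketch below is an illustration only: the helper name iqr_outliers and the sample values are hypothetical and do not come from the dataset.

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return s[(s < lower) | (s > upper)]

# Hypothetical weights, with one deliberately extreme value
weights = pd.Series([790, 1100, 1300, 1400, 1750, 5000])
print(iqr_outliers(weights))
```

With the article's data, the same function could be called as iqr_outliers(df['Weight']) and iqr_outliers(df['Volume']).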
Exploratory Data Analysis
Distribution of the target variable
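# distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(df['CO2'], kde=True);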
Relationship of CO2 with other variables
sns.pairplot(df, x_vars=['Weight', 'Volume'], y_vars='CO2', height=4, aspect=1, kind='scatter')
plt.show()
Heatmap
The sns.heatmap() function draws a rectangular dataset as a color-coded matrix; here it is given the correlation matrix of the dataset. The annot parameter controls whether the value of each cell is written inside it. If this parameter is set to True, the correlation values are displayed in the cells.
# Create the correlation matrix and represent it as a heatmap.
sns.heatmap(df.corr(numeric_only=True), annot = True, cmap = 'coolwarm')
plt.show()
Model Building
Splitting the dataset into train and test set
We need to split our dataset into training and testing sets. We’ll do this with train_test_split from the sklearn.model_selection module. A common choice is to keep 70% of the data in the training set and the remaining 30% in the test set.
X_train,X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 100)
y_train.shape
(25,)
y_test.shape
(11,)
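The 25/11 split follows from the 36 rows in the dataset: with test_size = 0.3, scikit-learn rounds the test share up (0.3 × 36 = 10.8, so 11 rows) and assigns the remaining 25 rows to training. The sketch below reproduces this on placeholder arrays (X_demo and y_demo are made-up stand-ins, not the car data) and also shows that a fixed random_state makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the dataset (36)
X_demo = np.arange(72).reshape(36, 2)
y_demo = np.arange(36)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=100
)
print(len(y_tr), len(y_te))

# Splitting again with the same random_state selects the same rows
_, _, _, y_te_again = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=100
)
print(np.array_equal(y_te, y_te_again))
```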
# Fitting the Multiple Linear Regression model
reg_model = LinearRegression().fit(X_train, y_train)
#Printing the model coefficients
print('Intercept: ',reg_model.intercept_)
# pair the feature names with the coefficients
list(zip(X, reg_model.coef_))
Intercept: 74.33882836589245
[('Weight', 0.0171800645996374), ('Volume', 0.0025046399866402976)]
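The printed intercept and coefficients are all that is needed to make a prediction by hand: multiply each input by its coefficient and add the intercept. The sketch below copies the fitted values from the output above and applies them to a hypothetical car; the weight of 1300 and volume of 1600 are illustrative inputs, not a row taken from the dataset.

```python
import numpy as np

# Values copied from the fitted model's printed output
intercept = 74.33882836589245
coefs = np.array([0.0171800645996374, 0.0025046399866402976])  # order: Weight, Volume

# Hypothetical car, chosen only for illustration
car = np.array([1300.0, 1600.0])  # Weight, Volume

# CO2 = b0 + b1 * Weight + b2 * Volume
co2_pred = intercept + coefs @ car
print(round(co2_pred, 2))
```

This gives a predicted CO2 value of about 100.7 and matches what reg_model.predict would return for the same inputs. Read this way, each coefficient is the change in predicted CO2 for a one-unit increase in that variable while the other is held fixed.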
#Predicting the Test and Train set result
y_pred = reg_model.predict(X_test)
y_train_pred = reg_model.predict(X_train)
print("Prediction for test set: {}".format(y_pred))
Prediction for test set: [ 90.41571939 102.16323413 99.56363213 104.56661845 101.54657652
95.94770019 108.64011848 102.22654214 92.80374837 97.27327129
97.57074463]
#Actual value and the predicted value
reg_model_diff = pd.DataFrame({'Actual value': y_test, 'Predicted value': y_pred})
reg_model_diff
mae = metrics.mean_absolute_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
# The root mean square error is the square root of the MSE (it is not R-squared)
rmse = np.sqrt(mse)
print('Mean Absolute Error:', mae)
print('Mean Square Error:', mse)
print('Root Mean Square Error:', rmse)
Mean Absolute Error: 6.901980901636316
Mean Square Error: 63.39765310998792
Root Mean Square Error: 7.9622643205301795
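The three error metrics are simple averages over the prediction errors, so they can be checked by hand. The following sketch uses small made-up arrays (y_true and y_hat are hypothetical, not the article's test set) and compares the manual results with the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up actual and predicted values, for illustration only
y_true = np.array([100.0, 95.0, 110.0, 105.0])
y_hat = np.array([98.0, 99.0, 104.0, 105.0])

errors = y_true - y_hat              # [2, -4, 6, 0]
mae = np.mean(np.abs(errors))        # average absolute error
mse = np.mean(errors ** 2)           # average squared error
rmse = np.sqrt(mse)                  # same units as the target variable

print(mae, mse, rmse)
print(np.isclose(mae, mean_absolute_error(y_true, y_hat)))
print(np.isclose(mse, mean_squared_error(y_true, y_hat)))
```

Because the RMSE is expressed in the same units as the target, it is usually the easiest of the three to compare against the spread of the CO2 values themselves.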
Thanks for reading this article. You can access the detailed code of this project and other projects on my GitHub account or Kaggle account. Happy coding!
If you have any feedback, feel free to share it in the comments section or contact me if you need any further information.