Flight Price Prediction - Data Visualization

In this post, I introduce Flight Price Predictor, which uses machine learning algorithms and data visualization techniques to predict ticket prices from features such as departure time, arrival time, flight duration, source, destination and more.

Flight ticket prices can be hard to guess. Given ticket prices for various airlines and city pairs collected over 5-6 months, we can build a model that predicts the price of a flight from several input features.

Let's see how we can analyze these ticket prices.

Create a conda environment and install required libraries

conda create -n fpp python=3.9
conda activate fpp
pip install flask flask_cors pandas seaborn scikit-learn openpyxl
flask run

Import Libraries

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import datetime as dt
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
import pickle

Read training data

train_data = pd.read_excel('Flight Dataset/Data_Train.xlsx')
train_data.head()

Check destination column values

train_data['Destination'].value_counts()

Merge Delhi and New Delhi

# Treat 'New Delhi' and 'Delhi' as the same destination
train_data['Destination'] = train_data['Destination'].replace('New Delhi','Delhi')

Check train data info

train_data.info()

Extract the journey day and month from the date column

train_data['Journey_day'] = pd.to_datetime(train_data['Date_of_Journey'],format='%d/%m/%Y').dt.day
train_data['Journey_month'] = pd.to_datetime(train_data['Date_of_Journey'],format='%d/%m/%Y').dt.month

train_data.drop('Date_of_Journey',inplace=True,axis=1)

train_data.head()

Extract hours/minutes from time

train_data['Dep_hour'] = pd.to_datetime(train_data['Dep_Time']).dt.hour
train_data['Dep_min'] = pd.to_datetime(train_data['Dep_Time']).dt.minute
train_data.drop('Dep_Time',axis=1,inplace=True)

train_data['Arrival_hour'] = pd.to_datetime(train_data['Arrival_Time']).dt.hour
train_data['Arrival_min'] = pd.to_datetime(train_data['Arrival_Time']).dt.minute
train_data.drop('Arrival_Time',axis=1,inplace=True)

train_data.head()
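To make the extraction step concrete, here is a minimal standard-library sketch of what pulling the hour and minute out of a single time string amounts to. It assumes the values start with an 'HH:MM' token, possibly followed by a date fragment; the helper name `hour_and_minute` and the sample strings are hypothetical, not taken from the dataset.

```python
from datetime import datetime

def hour_and_minute(text):
    """Return (hour, minute) from a string that starts with 'HH:MM'."""
    clock = text.split()[0]  # keep only the leading 'HH:MM' token
    parsed = datetime.strptime(clock, "%H:%M")
    return parsed.hour, parsed.minute

# Hypothetical sample values, for illustration only
print(hour_and_minute("22:20"))
print(hour_and_minute("01:10 22 Mar"))
```

Passing an explicit format avoids relying on format inference, which is what `pd.to_datetime` falls back to when no `format` is given.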

Extract duration column info

duration = list(train_data['Duration'])

for i in range(len(duration)):
    if len(duration[i].split()) != 2:
        if 'h' in duration[i]:
            duration[i] = duration[i] + ' 0m'
        else:
            duration[i] = '0h ' + duration[i]

duration_hour = []
duration_min = []

for i in duration:
    h,m = i.split()
    duration_hour.append(int(h[:-1]))
    duration_min.append(int(m[:-1]))

train_data['Duration_hours'] = duration_hour
train_data['Duration_mins'] = duration_min

train_data.drop('Duration',axis=1,inplace=True)
train_data.head()
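The two loops above pad each duration so it always has an hours part and a minutes part, then split it. The same idea can be sketched as one small function built on a regular expression. This is an alternative sketch, not the code used in the post; the name `parse_duration` is hypothetical, and it assumes durations look like '2h 50m', '19h' or '45m'.

```python
import re

# Optional hours part ("2h") followed by an optional minutes part ("50m")
_DURATION_RE = re.compile(r"^\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?\s*$")

def parse_duration(text):
    """Return (hours, minutes) for strings such as '2h 50m', '19h' or '45m'."""
    match = _DURATION_RE.match(text)
    if match is None or not any(match.groups()):
        raise ValueError(f"Unrecognised duration: {text!r}")
    hours, minutes = match.groups()
    # A missing part counts as zero
    return int(hours or 0), int(minutes or 0)

print(parse_duration("2h 50m"))  # (2, 50)
print(parse_duration("19h"))     # (19, 0)
print(parse_duration("45m"))     # (0, 45)
```

Unlike the loop version, a malformed value raises a clear error instead of failing later inside `int()`.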

Plot airline vs price

sns.catplot(x='Airline',y='Price',data=train_data.sort_values('Price',ascending=False),kind='boxen',aspect=3,height=6)

Create dummy columns from the Airline column

airline = train_data[['Airline']]
airline = pd.get_dummies(airline,drop_first=True)

Plot source vs price

sns.catplot(x='Source',y='Price',data=train_data.sort_values('Price',ascending=False),kind='boxen',aspect=3,height=4)

Create dummy columns from the Source column

source = train_data[['Source']]
source = pd.get_dummies(source,drop_first=True)
source.head()

Plot destination vs price

sns.catplot(x='Destination',y='Price',data=train_data.sort_values('Price',ascending=False),kind='boxen',aspect=3,height=4)

Create dummy columns from the Destination column

destination = train_data[['Destination']]
destination = pd.get_dummies(destination,drop_first=True)
destination.head()
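If the effect of `drop_first=True` is not obvious, the toy example below may help. It is only a sketch on made-up rows: the frame `toy` and its city values are placeholders, not data from the training file.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({"Source": ["Delhi", "Kolkata", "Mumbai", "Delhi"]})

encoded = pd.get_dummies(toy[["Source"]], drop_first=True)

# Categories are sorted, so 'Delhi' becomes the dropped baseline:
# a row with no indicator set in the remaining columns means Source == 'Delhi'.
print(list(encoded.columns))  # ['Source_Kolkata', 'Source_Mumbai']
print(encoded)
```

Dropping one level removes a column that is fully determined by the others. Tree-based models do not strictly need this, but it keeps the feature matrix a little smaller.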

Drop unused columns and check the values in the Total_Stops column

train_data.drop(['Route','Additional_Info'],inplace=True,axis=1)
train_data['Total_Stops'].value_counts()

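Convert the Total_Stops labels into numbers

# Assign the result back: replace(..., inplace=True) on a selected column may not update train_data in newer pandas
train_data['Total_Stops'] = train_data['Total_Stops'].replace({'non-stop':0,'1 stop':1,'2 stops':2,'3 stops':3,'4 stops':4})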
train_data.head()

Check the shapes of all data frames

print(airline.shape)
print(source.shape)
print(destination.shape)
print(train_data.shape)

Combine all data frames

data_train = pd.concat([train_data,airline,source,destination],axis=1)
data_train.drop(['Airline','Source','Destination'],axis=1,inplace=True)
data_train.head()

Separate the features and the target

X = data_train.drop('Price',axis=1)
X.head()

y = data_train['Price']
y.head()

Check correlations between columns

plt.figure(figsize=(10,10))
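# train_data still holds the text columns Airline, Source and Destination, so restrict corr() to numeric columns
sns.heatmap(train_data.corr(numeric_only=True),cmap='viridis',annot=True)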

Split data into training and testing data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

Train model for flight price prediction

n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1200, num = 12)]

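# 'auto' is no longer accepted by recent scikit-learn; for regressors it meant "use all features", i.e. 1.0
max_features = [1.0, 'sqrt']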

max_depth = [int(x) for x in np.linspace(5, 30, num = 6)]

min_samples_split = [2, 5, 10, 15, 100]

min_samples_leaf = [1, 2, 5, 10]

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}

rf_random = RandomizedSearchCV(estimator = RandomForestRegressor(), param_distributions = random_grid,
                               scoring='neg_mean_squared_error', n_iter = 10, cv = 5, 
                               verbose=1, random_state=42, n_jobs = 1)
rf_random.fit(X_train,y_train)
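`pickle` is imported at the top but not used in the snippets shown here. Presumably the fitted model is saved so it can be loaded elsewhere. Here is a minimal sketch of that round trip; the helper names, the file name and the stand-in object are all hypothetical.

```python
import os
import pickle
import tempfile

def save_model(model, path):
    """Serialise a fitted estimator (or any picklable object) to disk."""
    with open(path, "wb") as handle:
        pickle.dump(model, handle)

def load_model(path):
    """Load an object previously written by save_model."""
    with open(path, "rb") as handle:
        return pickle.load(handle)

# Stand-in object so the sketch runs on its own; in this post the call
# would look like save_model(rf_random, 'flight_rf.pkl') (hypothetical name).
placeholder = {"name": "placeholder model"}
model_path = os.path.join(tempfile.gettempdir(), "flight_rf_example.pkl")
save_model(placeholder, model_path)
restored = load_model(model_path)
```

Only unpickle files you created yourself or trust, since loading a pickle can execute arbitrary code.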

Take predictions and plot residuals

prediction = rf_random.predict(X_test)

plt.figure(figsize = (8,8))
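# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay gives a similar picture
sns.histplot(y_test-prediction,kde=True,stat='density')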
plt.show()
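A residual plot shows the shape of the errors, and a few summary numbers make it easier to compare models. Here is a minimal pure-Python sketch of three common regression metrics. The function name `regression_metrics` is hypothetical, and the sample numbers are made up rather than results from the model above.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 for two equal-length sequences of numbers."""
    y_true = [float(v) for v in y_true]
    y_pred = [float(v) for v in y_pred]
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    total_ss = sum((t - mean_true) ** 2 for t in y_true)
    # R^2 is undefined when every true value is identical
    r2 = 1 - (mse * n) / total_ss if total_ss else float("nan")
    return {"mae": mae, "rmse": math.sqrt(mse), "r2": r2}

# Made-up numbers, only to show the call
metrics = regression_metrics([100, 200, 300], [110, 190, 300])
print(metrics)
```

With the variables from this post, the call would be `regression_metrics(y_test, prediction)`. `sklearn.metrics` offers `mean_absolute_error`, `mean_squared_error` and `r2_score` for the same purpose.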

Plot test vs predictions
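One way to draw this comparison is a scatter plot of actual against predicted prices, with a diagonal marking perfect predictions. The sketch below is an assumption about what such a plot could look like, not the post's original code. The helper name `plot_actual_vs_predicted` and the sample values are hypothetical.

```python
import matplotlib.pyplot as plt

def plot_actual_vs_predicted(y_true, y_pred):
    """Scatter actual against predicted values with a y = x reference line."""
    y_true, y_pred = list(y_true), list(y_pred)
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.scatter(y_true, y_pred, alpha=0.5)
    # Points on this diagonal are perfect predictions
    low, high = min(y_true + y_pred), max(y_true + y_pred)
    ax.plot([low, high], [low, high], linestyle="--", color="grey")
    ax.set_xlabel("Actual price")
    ax.set_ylabel("Predicted price")
    return ax

# Made-up values so the sketch runs on its own
ax = plot_actual_vs_predicted([100, 200, 300], [110, 190, 300])
```

With the variables from this post, the call would be `plot_actual_vs_predicted(y_test, prediction)` followed by `plt.show()`. The closer the points sit to the dashed line, the better the predictions.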

Source code link - https://pythonsandbox.dev/w5a65xpeu4mj

That's a wrap.