Flight Price Prediction - Data Visualization
In this post, I introduce a flight price predictor built with machine learning algorithms and data visualization techniques. It uses features such as departure time, arrival time, flight duration, source, and destination.
Flight ticket prices can be hard to guess. Given ticket prices for various airlines over a period of 5-6 months and between various cities, we can build a model that predicts flight prices from these input features.
Let's see how we can analyze these ticket prices.
Create a conda environment and install required libraries
conda create -n fpp python=3.9
conda activate fpp
pip install flask flask_cors pandas seaborn scikit-learn openpyxl
flask run
Import Libraries
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import datetime as dt
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
import pickle
Read training data
train_data = pd.read_excel('Flight Dataset/Data_Train.xlsx')
train_data.head()
Check destination column values
train_data['Destination'].value_counts()
Merge Delhi and New Delhi
def newd(x):
    if x=='New Delhi':
        return 'Delhi'
    else:
        return x
train_data['Destination'] = train_data['Destination'].apply(newd)
Check train data info
train_data.info()
Extract the day and month from the journey date into separate columns
train_data['Journey_day'] = pd.to_datetime(train_data['Date_of_Journey'],format='%d/%m/%Y').dt.day
train_data['Journey_month'] = pd.to_datetime(train_data['Date_of_Journey'],format='%d/%m/%Y').dt.month
train_data.drop('Date_of_Journey',inplace=True,axis=1)
train_data.head()
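The `format='%d/%m/%Y'` argument matters because the dates are written day first. As a small illustration using only the standard library, here is the same format string applied to a single value. The helper name `journey_day_month` and the sample date are made up for this sketch and are not part of the original notebook.

```python
from datetime import datetime

def journey_day_month(date_text):
    """Return (day, month) from a day/month/year string."""
    parsed = datetime.strptime(date_text, "%d/%m/%Y")
    return parsed.day, parsed.month

# Made-up sample: with this format it reads as the 3rd of April,
# not the 4th of March.
print(journey_day_month("03/04/2019"))
```

Spelling out the format avoids leaving ambiguous dates like this one to the parser's guess.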
Extract hours/minutes from time
train_data['Dep_hour'] = pd.to_datetime(train_data['Dep_Time']).dt.hour
train_data['Dep_min'] = pd.to_datetime(train_data['Dep_Time']).dt.minute
train_data.drop('Dep_Time',axis=1,inplace=True)
train_data['Arrival_hour'] = pd.to_datetime(train_data['Arrival_Time']).dt.hour
train_data['Arrival_min'] = pd.to_datetime(train_data['Arrival_Time']).dt.minute
train_data.drop('Arrival_Time',axis=1,inplace=True)
train_data.head()
Extract duration column info
duration = list(train_data['Duration'])
for i in range(len(duration)):
    if len(duration[i].split()) != 2:
        if 'h' in duration[i]:
            duration[i] = duration[i] + ' 0m'
        else:
            duration[i] = '0h ' + duration[i]
duration_hour = []
duration_min = []
for i in duration:
    h,m = i.split()
    duration_hour.append(int(h[:-1]))
    duration_min.append(int(m[:-1]))
train_data['Duration_hours'] = duration_hour
train_data['Duration_mins'] = duration_min
train_data.drop('Duration',axis=1,inplace=True)
train_data.head()
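The loop above pads each duration so it always has an hours part and a minutes part, then strips the trailing letter. The same idea can be sketched as a small helper. `parse_duration` is a hypothetical name introduced here, and the sample strings are invented rather than taken from the dataset.

```python
import re

def parse_duration(text):
    """Return (hours, minutes) for strings such as '2h 50m', '19h' or '45m'."""
    hours = re.search(r"(\d+)\s*h", text)
    minutes = re.search(r"(\d+)\s*m", text)
    # A missing part counts as zero, which mirrors the padding step above.
    return (int(hours.group(1)) if hours else 0,
            int(minutes.group(1)) if minutes else 0)

for sample in ["2h 50m", "19h", "45m"]:
    print(sample, "->", parse_duration(sample))
```

With pandas, a helper like this could be applied through `train_data['Duration'].map(parse_duration)`, assuming every value follows one of these three shapes.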
Plot airline vs price
sns.catplot(x='Airline',y='Price',data=train_data.sort_values('Price',ascending=False),kind='boxen',aspect=3,height=6)
Create dummy columns out of airline columns
airline = train_data[['Airline']]
airline = pd.get_dummies(airline,drop_first=True)
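To see what `drop_first=True` does, here is a toy example. The airline names are placeholders, not values from the dataset. With three categories, two dummy columns are enough, because a row with neither of them set must belong to the dropped category.

```python
import pandas as pd

# Toy frame with placeholder airline names; not the real dataset.
toy = pd.DataFrame({"Airline": ["AirA", "AirB", "AirC", "AirA"]})

full = pd.get_dummies(toy, drop_first=False)
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))     # one column per category
print(list(reduced.columns))  # the first category (in sorted order) is dropped
```

Dropping one column removes redundant information. The same reasoning applies to the Source and Destination dummies later on.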
Plot source vs price
sns.catplot(x='Source',y='Price',data=train_data.sort_values('Price',ascending=False),kind='boxen',aspect=3,height=4)
Create dummy columns of source columns
source = train_data[['Source']]
source = pd.get_dummies(source,drop_first=True)
source.head()
Plot destination vs price
sns.catplot(x='Destination',y='Price',data=train_data.sort_values('Price',ascending=False),kind='boxen',aspect=3,height=4)
Create dummy columns of the destination column
destination = train_data[['Destination']]
destination = pd.get_dummies(destination,drop_first=True)
destination.head()
Drop the unused columns and check the values in the Total_Stops column
train_data.drop(['Route','Additional_Info'],inplace=True,axis=1)
train_data['Total_Stops'].value_counts()
Convert labels into numbers in the Total_Stops column
train_data['Total_Stops'] = train_data['Total_Stops'].replace({'non-stop':0,'1 stop':1,'2 stops':2,'3 stops':3,'4 stops':4})
train_data.head()
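A fixed dictionary only covers the labels listed in it. As an alternative sketch, the count can be read from the label itself, so an unexpected value fails loudly instead of slipping through as text. `stops_to_int` is a hypothetical helper, not part of the original code.

```python
def stops_to_int(label):
    """Map 'non-stop' to 0 and 'N stop(s)' to N; raise on anything else."""
    label = label.strip().lower()
    if label == "non-stop":
        return 0
    count, _, unit = label.partition(" ")
    if count.isdigit() and unit in ("stop", "stops"):
        return int(count)
    raise ValueError(f"unrecognised Total_Stops label: {label!r}")

for sample in ["non-stop", "1 stop", "4 stops"]:
    print(sample, "->", stops_to_int(sample))
```

This version assumes every value is a string, so missing values would need to be handled before applying it.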
Check the shapes of all data frames
print(airline.shape)
print(source.shape)
print(destination.shape)
print(train_data.shape)
Combine all data frames
data_train = pd.concat([train_data,airline,source,destination],axis=1)
data_train.drop(['Airline','Source','Destination'],axis=1,inplace=True)
data_train.head()
Separate the features (X) from the target labels (y)
X = data_train.drop('Price',axis=1)
X.head()
y = data_train['Price']
y.head()
Check correlations between columns
plt.figure(figsize=(10,10))
sns.heatmap(train_data.corr(numeric_only=True),cmap='viridis',annot=True)
Split data into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
Train model for flight price prediction
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1200, num = 12)]
max_features = [1.0, 'sqrt']  # 'auto' was removed in scikit-learn 1.3; 1.0 (all features) matches its old behaviour for regressors
max_depth = [int(x) for x in np.linspace(5, 30, num = 6)]
min_samples_split = [2, 5, 10, 15, 100]
min_samples_leaf = [1, 2, 5, 10]
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}
rf_random = RandomizedSearchCV(estimator = RandomForestRegressor(), param_distributions = random_grid,
                               scoring='neg_mean_squared_error', n_iter = 10, cv = 5,
                               verbose=1, random_state=42, n_jobs = 1)
rf_random.fit(X_train,y_train)
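It helps to see how small a slice of the grid the random search actually tries. The back-of-the-envelope sketch below uses only the number of options per parameter, taken from the lists above. The variable names are introduced here for illustration.

```python
from math import prod

# Number of candidate values per hyperparameter in random_grid.
grid_sizes = {
    "n_estimators": 12,       # 100, 200, ..., 1200
    "max_features": 2,
    "max_depth": 6,           # 5, 10, ..., 30
    "min_samples_split": 5,
    "min_samples_leaf": 4,
}

total_combinations = prod(grid_sizes.values())
n_iter, cv = 10, 5
fits = n_iter * cv  # each sampled combination is trained once per fold

print(total_combinations, fits)
```

So the search samples 10 of the possible combinations and runs 50 cross-validation fits. With scikit-learn's default `refit=True`, there is one more fit of the best candidate on the full training set.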
Take predictions and plot residuals
prediction = rf_random.predict(X_test)
plt.figure(figsize = (8,8))
sns.histplot(y_test-prediction, kde=True)  # distplot is deprecated in recent seaborn releases
plt.show()
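The residual plot gives a visual impression. A few summary numbers can complement it. The sketch below computes the mean absolute error, the root mean squared error, and R² by hand. `regression_report` is a hypothetical helper and the sample numbers are made up.

```python
import math

def regression_report(y_true, y_pred):
    """Return MAE, RMSE and R^2 for two equal-length sequences of numbers."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / n
    ss_res = sum(r * r for r in residuals)
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}

# Made-up prices, only to show the call.
report = regression_report([100, 200, 300], [110, 190, 300])
print(report)
```

For the model above, the call would look like `regression_report(list(y_test), list(prediction))`.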
Plot test vs predictions
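The original code for this step is not shown here, so the following is only a sketch of one common way to do it: a scatter plot of predicted against actual prices, with a dashed y = x line marking perfect predictions. The function name and axis labels are assumptions.

```python
import matplotlib.pyplot as plt

def plot_actual_vs_predicted(actual, predicted):
    """Scatter predicted against actual values with a y = x reference line."""
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.scatter(actual, predicted, alpha=0.5)
    # Diagonal spanning the full value range: points on it are exact predictions.
    lo = min(min(actual), min(predicted))
    hi = max(max(actual), max(predicted))
    ax.plot([lo, hi], [lo, hi], linestyle="--", color="gray")
    ax.set_xlabel("Actual price")
    ax.set_ylabel("Predicted price")
    return ax

# Made-up values, only to show the call.
ax = plot_actual_vs_predicted([1, 2, 3], [1.1, 1.9, 3.2])
```

With the variables from this post, the call would be `plot_actual_vs_predicted(y_test, prediction)` followed by `plt.show()`. The closer the points sit to the dashed line, the better the predictions.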
Source code link - https://pythonsandbox.dev/w5a65xpeu4mj
That's a wrap.