Flight Fare Prediction: Let's Find the Best Rates

Due to the recent global pandemic, air travel prices have increased significantly, and airlines worldwide have faced financial challenges. The resulting travel disruptions led to 2.3 million job losses in the aviation sector. Even before the pandemic, air ticket prices fluctuated because of the competitiveness of the aviation industry. As the only means of rapid global transportation, airlines are driven to attract customers by offering the best possible prices. It is therefore increasingly important for airlines to use data-driven analytics to automate their pricing so that they remain competitive while operating at optimum capacity.
Please visit the Project GitHub repository for code.
Objective
There are several potential applications for robust ML algorithms within the aviation industry, including determining ticket prices, selling personalized products, managing crew and optimizing fuel and routes.
The focus of the project is to use historical data to develop a predictive model that can forecast ticket prices within 10% of their true value.
Use cases and challenges
There are several reasons why businesses in the travel industry might want to use flight price prediction software. For example, online travel agencies (OTAs) might use it to help them attract more visitors by being able to show them competitive rates. Similarly, airlines might use the same technology to forecast rates for their competitors and then adjust their own pricing strategies accordingly. A third example of this is a passenger-side predictor, which suggests the best time to buy a ticket so that travelers can make informed decisions.
In all cases, the task is quite challenging because numerous internal and external factors influence airfares.
Internal factors include
- purchase and departure dates,
- the number of available airlines and flights,
- fare class,
- the current market demand,
- flight distance
External factors include events in the arrival or departure cities, such as
- holidays,
- concerts,
- sports competitions,
- festivals,
- terrorist attacks,
- natural disasters,
- epidemic outbreaks
Though it is impossible to cover every external eventuality (nothing in the middle of 2019 foreshadowed the 2020 coronavirus pandemic, for example), we can still predict quite a lot using the right data and advanced machine learning (ML) models.
Data
The dataset is secondary data publicly available on Kaggle and contains 300k instances of flight booking details for domestic travel between the busiest cities in India during 2022. It was originally scraped over 50 days with the “Octoparse” web scraping tool from Easemytrip, a leading Indian travel platform, with the aim of drawing meaningful insights through statistical hypothesis testing.
Although a cleaned version of the data is available, I used the uncleaned versions of the dataset and performed the cleaning myself as required.
Data Dictionary
Cleaning and Preprocessing
The key objective of this stage was to avoid common mistakes and information mishandling by first applying standard data science cleaning and preprocessing techniques. The initial step was to combine the two uncleaned datasets. The following steps were then carried out to clean the dataset:
- Checked for and appropriately dealt with duplicate information.
- Checked for missing values (none found).
- Converted the initial data formats to ML-ready formats.
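The cleaning steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the project's actual code: the two small frames stand in for the two uncleaned source files, and the column names and the comma-formatted `price` strings are assumptions made for the example.

```python
import pandas as pd

# Tiny stand-ins for the two uncleaned source files (hypothetical rows and columns).
economy = pd.DataFrame({
    "date": ["11-02-2022", "11-02-2022", "12-02-2022"],
    "airline": ["Vistara", "Vistara", "Air India"],
    "from": ["Delhi", "Delhi", "Mumbai"],
    "to": ["Mumbai", "Mumbai", "Delhi"],
    "price": ["5,953", "5,953", "6,120"],
})
business = pd.DataFrame({
    "date": ["11-02-2022"],
    "airline": ["Air India"],
    "from": ["Delhi"],
    "to": ["Mumbai"],
    "price": ["25,612"],
})

# Combine the two sources into a single frame.
flights = pd.concat([economy, business], ignore_index=True)

# Drop exact duplicate rows.
flights = flights.drop_duplicates().reset_index(drop=True)

# Count missing values per column (all zeros means nothing to impute).
missing_per_column = flights.isna().sum()

# Convert text columns to ML-ready types.
flights["price"] = flights["price"].str.replace(",", "", regex=False).astype(int)
flights["date"] = pd.to_datetime(flights["date"], format="%d-%m-%Y")

print(flights.dtypes)
print(missing_per_column)
```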
The following steps were carried out to preprocess the data:
- Performed feature extraction after converting the `date`, `dep_time` and `arr_time` variables to datetime format.
- One-hot encoded the `from` and `to` columns.
- Manually encoded the `stop` column.
- Performed target-guided mean encoding instead of one-hot encoding on the `airline` column to keep the dimensionality low.
- Treated outliers in the `target` column using median imputation.
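A rough sketch of these preprocessing steps on a toy frame is shown below. Several details are assumptions made for illustration only: `price` is used as the target column, the stop labels and their integer codes are hypothetical, and outliers are flagged with the common 1.5 × IQR rule (the write-up only states that median imputation was used). The sketch also treats outliers before computing the airline means so the means are not skewed, which is a choice of this example. In a real pipeline the airline means would be computed on the training split only, to avoid leaking target information into the test set.

```python
import pandas as pd

# Hypothetical sample using the column names mentioned above; `price` is the assumed target.
df = pd.DataFrame({
    "date": ["11-02-2022", "12-02-2022", "13-02-2022", "14-02-2022", "15-02-2022"],
    "dep_time": ["18:55", "06:20", "23:10", "09:45", "14:30"],
    "arr_time": ["21:05", "08:40", "01:25", "12:00", "19:50"],
    "from": ["Delhi", "Mumbai", "Delhi", "Kolkata", "Mumbai"],
    "to": ["Mumbai", "Delhi", "Kolkata", "Delhi", "Kolkata"],
    "stop": ["non-stop", "1-stop", "non-stop", "2+-stop", "1-stop"],
    "airline": ["Vistara", "Air India", "Vistara", "Indigo", "Air India"],
    "price": [6000, 5000, 6400, 4000, 90000],
})
df["price"] = df["price"].astype(float)

# 1. Feature extraction from the date and time columns.
df["date"] = pd.to_datetime(df["date"], format="%d-%m-%Y")
df["day"] = df["date"].dt.day
df["month"] = df["date"].dt.month
for col in ["dep_time", "arr_time"]:
    parsed = pd.to_datetime(df[col], format="%H:%M")
    df[f"{col}_hour"] = parsed.dt.hour
    df[f"{col}_minute"] = parsed.dt.minute

# 2. Manual ordinal encoding of the number of stops (assumed labels).
stop_map = {"non-stop": 0, "1-stop": 1, "2+-stop": 2}
df["stop"] = df["stop"].map(stop_map)

# 3. Outlier treatment: replace target values outside the 1.5*IQR fences with the median.
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)
df.loc[is_outlier, "price"] = df["price"].median()

# 4. Target-guided mean encoding: replace each airline with its mean price.
airline_means = df.groupby("airline")["price"].mean()
df["airline"] = df["airline"].map(airline_means)

# 5. One-hot encode the origin and destination cities.
df = pd.get_dummies(df, columns=["from", "to"], dtype=int)

print(df.drop(columns=["date", "dep_time", "arr_time"]).head())
```

Mean encoding produces a single numeric column for `airline` regardless of how many carriers appear, which is what keeps the dimensionality low compared with one-hot encoding.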
Some Insights from EDA
- What are the most popular airlines in the region?

- “Air India” and “Vistara” are the key performers in India's domestic aviation industry; other airlines could therefore revise their pricing strategies to offer better rates than these two.
- Period of day analysis
- Passengers can expect calmer travel late at night, while mornings and evenings appear to be the busiest.
- What is the cheapest period of the day for air travel?
- Passengers can expect cheaper fares late at night.
- Does flight duration have an impact on the ticket price?
- Flight prices increase with duration; since stopovers add to the duration, passengers should avoid stopover flights to get the best rates.
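As an illustration of how a period-of-day comparison like the one above could be computed, the following sketch buckets departure hours and aggregates flight counts and mean prices per bucket. The column names `dep_hour` and `price`, the bin edges, and the sample values are all assumptions for the example, not figures from the dataset.

```python
import pandas as pd

# Hypothetical rows; `dep_hour` and `price` are assumed column names.
df = pd.DataFrame({
    "dep_hour": [1, 7, 9, 13, 18, 20, 23, 2],
    "price": [3200, 6100, 5900, 5200, 6400, 6000, 6200, 3000],
})

# Bucket departure hours into periods of the day (bin edges chosen for illustration).
bins = [0, 6, 12, 18, 24]
labels = ["Late Night", "Morning", "Afternoon", "Evening"]
df["period"] = pd.cut(df["dep_hour"], bins=bins, labels=labels, right=False)

# Number of flights and mean price per period, cheapest first.
summary = (
    df.groupby("period", observed=True)["price"]
    .agg(flights="count", mean_price="mean")
    .sort_values("mean_price")
)
print(summary)
```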
Modeling Approach
Four machine learning models and a neural network model were trained, extensively tuned, and evaluated on standard regression metrics. The following ML and DL models were used:
- Linear Regression
Attempts to model the relationship between features and target by fitting a linear equation and predicting new values of our target variables.
- K-nearest Neighbors Regression
KNN uses the notion of similarity and distance to predict the numerical target by averaging the observations in the same neighbourhood.
- Random Forest Regression
It is an ensemble learning method that fits several decision tree regressors on various sub-samples of the dataset and averages their predictions to produce a more accurate prediction than a single tree.
- Histogram-Based Gradient Boosting Regression
It is a gradient-boosting algorithm that uses a collection of decision trees as base estimators and trains each estimator on the errors of the previous ones to make a more accurate prediction. Although similar to XGBoost, it bins the continuous input variables into a few hundred unique values, which makes it faster on large datasets.
- Deep Neural Network
A set of algorithms strung together, modeled loosely after the human brain by interconnected nodes in a layered structure, that are designed to recognize patterns.
Evaluation Metrics
- R-Squared Score - the coefficient of determination, which is the proportion of the variance in the dependent variable that is predictable from the independent variables.
- Adjusted R-Squared Score - the R-squared score adjusted for the number of independent variables, so that adding variables that do not help explain the dependent variable lowers the score.
- Mean Absolute Error (MAE) - the average magnitude of the errors in a set of predicted values, without considering direction.
- Root Mean Squared Error (RMSE) - the square root of the average squared error in a set of predicted values.
- Mean Absolute Percentage Error (MAPE) - the average magnitude of the errors, without considering direction, expressed as a percentage of the actual values. It can exceed 100%.
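To make the definitions above concrete, here is a small NumPy sketch that computes all five metrics from their formulas. The function name `regression_metrics` and the parameter `n_features` (the number of independent variables, needed for the adjusted score) are names chosen for this example.

```python
import numpy as np


def regression_metrics(y_true, y_pred, n_features):
    """Compute R2, adjusted R2, MAE, RMSE and MAPE (in percent)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    errors = y_true - y_pred

    ss_res = np.sum(errors ** 2)                    # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    # Penalize R2 for the number of predictors relative to the sample size.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

    return {
        "r2": r2,
        "adj_r2": adj_r2,
        "mae": np.mean(np.abs(errors)),
        "rmse": np.sqrt(np.mean(errors ** 2)),
        "mape": np.mean(np.abs(errors / y_true)) * 100,
    }


# Toy example with four fares and two (hypothetical) features.
actual = [5000, 6000, 7000, 8000]
predicted = [5500, 5700, 7000, 8400]
metrics = regression_metrics(actual, predicted, n_features=2)
print(metrics)
```

In this toy example the MAPE is 5%, so the predictions would satisfy a "within 10% of the true value" target on average, even though individual predictions can be further off.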
The general process followed for modeling is:
- Load the data
- Fit the model
- Make predictions and evaluate the model
To evaluate the model, I created a custom function that recorded metrics and picked the best-performing model to predict the ticket price.
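A helper of this kind might look like the sketch below. It is not the project's actual function: the name `evaluate_model`, the `results` list, the synthetic data, and the choice of lowest test MAPE as the selection rule are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)
from sklearn.model_selection import train_test_split

results = []  # one dict per evaluated model


def evaluate_model(name, model, X_train, X_test, y_train, y_test):
    """Fit `model`, record train/test metrics under `name`, and return the row."""
    model.fit(X_train, y_train)
    test_pred = model.predict(X_test)
    row = {
        "Model Name": name,
        "Train R2": r2_score(y_train, model.predict(X_train)),
        "Test R2": r2_score(y_test, test_pred),
        "Test RMSE": np.sqrt(mean_squared_error(y_test, test_pred)),
        "Test MAE": mean_absolute_error(y_test, test_pred),
        # scikit-learn returns MAPE as a fraction, so scale to percent.
        "Test MAPE": mean_absolute_percentage_error(y_test, test_pred) * 100,
    }
    results.append(row)
    return row


# Synthetic stand-in data with a price-like target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 3))
y = 3000 + 400 * X[:, 0] + 150 * X[:, 1] ** 2 + rng.normal(0, 50, size=400)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

evaluate_model("lin_reg", LinearRegression(), X_train, X_test, y_train, y_test)
evaluate_model("rf", RandomForestRegressor(n_estimators=100, random_state=0),
               X_train, X_test, y_train, y_test)

# Rank models by test MAPE and pick the one with the lowest error.
summary = pd.DataFrame(results).sort_values("Test MAPE").reset_index(drop=True)
best_model_name = summary.loc[0, "Model Name"]
print(summary)
print("Best model:", best_model_name)
```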
Models Summary
The table below summarizes the models and their performances.
- The `Model Name` column contains the variable name the model is saved to.
- The `Regressor` column tells us which regression algorithm was used.
- The `Scaling Method` column tells us whether the data was scaled before fitting the model and which scaler was used.
- The `Cross-validation` column tells us whether cross-validation was performed and, if so, which type.
- The `Train R2` column contains the R2 score on train data.
- The `Test R2` column contains the R2 score on test data.
- The `Test RMSE` column contains the RMSE for test data.
- The `Test MAE` column contains the MAE for test data.
- The `Test MAPE` column contains the MAPE for test data as a percentage.
Objective Achieved !!!
Our objective was to predict the flight fare within 10% of its true value, which we achieved with not just one but two models: a Random Forest model that predicts the flight fare within 6.33% of its true value, and a Neural Network that predicts within 6.81%.
If we are to pick a single best model, however, Random Forest is the best as it has the lowest error. The following graph shows the price predicted by the Random Forest model against the actual price.
- As intended, we observe a clear linear relationship between the predicted price and the actual price.
- The distributions of the actual price and the predicted price closely match, further supporting the predictive power of the model.
Model Deployment
The following video shows a practical implementation of the model as an interactive web application. It uses the best model and is capable of real-time feedback. This app was created using Flask and Bootstrap web development frameworks.
You can review the code for the web app here.
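As a rough idea of how a Flask app can serve predictions, the sketch below exposes a single JSON endpoint. It is not the project's web app: the `/predict` route, the feature names in `FEATURES`, and the toy linear model are all hypothetical stand-ins (the real app would load the trained Random Forest and render Bootstrap pages).

```python
import numpy as np
from flask import Flask, jsonify, request
from sklearn.linear_model import LinearRegression

app = Flask(__name__)

# Stand-in model fitted on toy data; hypothetical feature names.
FEATURES = ["duration_hours", "stops", "days_left"]
toy_X = np.array([[2, 0, 30], [5, 1, 10], [8, 2, 3], [3, 0, 15]], dtype=float)
toy_y = np.array([3500, 6200, 9800, 4300], dtype=float)
model = LinearRegression().fit(toy_X, toy_y)


@app.route("/predict", methods=["POST"])
def predict():
    """Return a predicted fare for the JSON features in the request body."""
    payload = request.get_json(silent=True) or {}
    missing = [name for name in FEATURES if name not in payload]
    if missing:
        return jsonify(error=f"missing fields: {missing}"), 400

    # Build a single-row feature matrix in the order the model expects.
    row = np.array([[float(payload[name]) for name in FEATURES]])
    price = float(model.predict(row)[0])
    return jsonify(predicted_price=round(price, 2))
```

The endpoint can be exercised without starting a server through Flask's built-in test client, for example `app.test_client().post("/predict", json={"duration_hours": 2, "stops": 0, "days_left": 30})`.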
What's Next?
The next step is to reduce the model error, which will increase the predictive accuracy. This can be done by adding more observations and features, such as ancillary charges, meal options, infotainment systems, etc., and by retraining and tuning the model. Additionally, the app can be used at a global scale by introducing international flight data prior to training.