Restaurant Revenue Prediction

This blog takes a detailed look at deciding when and where to open a new restaurant across different regional locations.

sai harish cherukuri
10 min read · Mar 5, 2021


Wise investment planning is essential before establishing a new restaurant.

This blog walks through an end-to-end, real-world machine learning case study. I share a few insights from my work and also take a glance at a full-fledged web app built on top of the ML model. Hope those who enjoy reading this give it a free clap!!!

Below are the sections followed through the entire blog:

  1. Business Problem
  2. Data Overview
  3. Evaluation Metric
  4. Exploratory Data Analysis And Feature Engineering
  5. Key Insights
  6. Transforming the data
  7. Machine Learning Models
  8. Model Comparison
  9. Deployment
  10. Future work
  11. References

The food industry plays a crucial part in a country's economy, especially in metropolitan cities, where restaurants are an essential part of social gatherings. In recent years, different varieties of quick-service restaurants, such as food trucks and takeaways, have emerged. With this rise in restaurant types, it is difficult to decide when and where to open a new restaurant.

Business Problem

With over 1,200 quick-service restaurants across the globe, TFI owns several well-known restaurant brands across different parts of Europe and Asia. The company employs over 20,000 people and makes significant daily investments in developing new restaurant sites. Four different types of restaurants are encountered here: inline, mobile, drive-thru, and food court. With these emerging quick-service formats, deciding where to open a new restaurant is challenging.

New restaurant sites require a large investment of time and capital, and geographical location and local culture also affect the long-term survival of the firm. With only subjective data, it is difficult to extrapolate where a new restaurant should be opened, so TFI needs a model that lets it invest effectively in new restaurant sites. The goal of this competition is to predict the annual sales of restaurants at 100,000 regional locations.

Data Overview

Source of the data

TFI has provided a dataset with 137 restaurants in the training set and 100,000 restaurants in the test set.

  • Id: Restaurant ID
  • Open Date: Opening date of a restaurant
  • City: City where the restaurant is located
  • City Group: Type of the city. Big cities, or Other.
  • Type: Type of the restaurant. FC: Food Court, IL: Inline, DT: Drive-Thru, MB: Mobile
  • P1-P37: Obfuscated data in three categories. Demographic data includes the population of a given area, age and gender distribution, and development scales. Real estate data relates to the m² of the location, the front facade of the location, and car park availability. Commercial data includes the existence of points of interest such as schools, banks, and other QSR operators.
  • Revenue: The revenue column indicates the transformed revenue of the restaurant in a given year and is the target of predictive analysis.

Evaluation Metric

Type of machine learning problem: We are asked to predict the revenue of a restaurant in a given year, so this problem is best framed as a regression problem.

Let's discuss the most commonly used evaluation metric for regression problems.
Root Mean Squared Error (RMSE):
RMSE is a popular evaluation metric that assumes the errors are unbiased and follow a normal distribution. RMSE is the square root of the average of the squared residuals/errors.

RMSE = √( (1/n) · Σ (yᵢ − ŷᵢ)² )

The square root in RMSE keeps the errors on the same scale as the target variable. Example: if the target variable 'revenue' is in dollars, then RMSE is also in dollars.
The squared term in RMSE prevents positive and negative errors from canceling each other out.

As RMSE is highly affected by outliers, it is important to handle outliers in the data before using this metric. The lower the value, the better the performance of the model.
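As a quick illustration, here is a minimal RMSE helper in NumPy (scikit-learn's mean_squared_error with squared=False gives the same result):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Square root of the mean of squared residuals."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Toy example: revenues in dollars, so the RMSE is also in dollars
print(rmse([4.2e6, 5.0e6, 3.1e6], [4.0e6, 5.5e6, 3.0e6]))
```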

Exploratory Data Analysis And Feature Engineering

Exploratory Data Analysis is an approach used to analyze the data and get some meaningful insights from it.

Feature Engineering is the process of extracting new features from the given raw data.

Analyzing the target variable 'revenue':

It is clear that the target variable is skewed. The 'revenue' variable is right-skewed/positively skewed, and this skewness is caused by outliers. These outliers might appear because a few restaurants genuinely have higher revenue, or because of mistakenly recorded values. We can apply a transformation to the variable to decrease the effect of these outliers.
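One common way to reduce this skew, sketched below assuming the Kaggle train.csv layout with a 'revenue' column, is a log transform of the target:

```python
import numpy as np
import pandas as pd

train = pd.read_csv("train.csv")  # assumed file name from the Kaggle competition

# log1p compresses the long right tail, reducing the influence of outliers;
# predictions made on the log scale are mapped back with np.expm1
train["log_revenue"] = np.log1p(train["revenue"])
```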

Analyzing the ‘Type’ column :

The above plots represent the count of each restaurant type in the training and testing datasets. The counts of 'DT' (drive-thru) and 'MB' (mobile) are very low, indicating that these two are the least preferred restaurant types, while 'FC' (food court) is the most preferred type.
Note: There is not a single occurrence of 'MB' (mobile) restaurants in the training dataset.

Analyzing ‘City’ column :

Approximately 60% of restaurants are opened in Istanbul, Ankara, and Izmir.

Type vs Revenue variable:

'FC' (Food Court) and 'IL' (Inline) are the preferred restaurant types; they have similar distributions and also provide good revenue.

City Group vs Revenue:

We can observe that a few restaurants in 'Big Cities' make a higher revenue margin, so opening a new restaurant in a 'Big City' is preferable for getting high revenue.

City vs Revenue:

Izmir, Istanbul, Elazig, and Edirne are a few cities where the average restaurant revenue is high.

Adding new features from ‘Open Date’ variable:
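A minimal sketch of how such date-based features might be extracted (the date format and the reference date are assumptions, not the author's exact choices):

```python
import pandas as pd

train = pd.read_csv("train.csv")  # assumed Kaggle layout
open_date = pd.to_datetime(train["Open Date"], format="%m/%d/%Y")

train["open_year"] = open_date.dt.year
train["open_month"] = open_date.dt.month
# Age of the restaurant relative to an assumed snapshot date
train["days_open"] = (pd.Timestamp("2015-01-01") - open_date).dt.days
```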

Analyzing Month:

Most new restaurants open during August and December.
Restaurants earn good revenue during the months of April and September.

City and Year (Vs) Revenue:

The key insight from the above graph is that, among all cities, 'Istanbul' provides the highest revenue, and new restaurants have preferred to open there almost every year since 1997. 'Ankara' and 'Izmir' also provide good revenue; after 'Istanbul', they are the most preferable cities for opening a new restaurant.

Type and Year (Vs) Revenue:

Almost every year, most new restaurants choose either 'FC' (Food Court) or 'IL' (Inline) as their restaurant type. The average revenue for both of these restaurant types is similar.

City Group and Year (Vs) Revenue:

We can observe that a few restaurants have opened in 'Big Cities' every year since 1997, and revenue has been good over the 1997-2012 period.

City and City Group (Vs) Revenue:

The key insight from this plot is that Ankara, Istanbul, and Izmir are the only Big Cities.
Among the other cities, Edirne, Elazig, and Trabzon provide higher revenue.

City and Type (Vs) Revenue:

In 'Big Cities' like Istanbul, Izmir, and Ankara, preference can be given to the 'FC' (Food Court) restaurant type to maximize revenue.

Key Insights

Type:

Food Court (FC) and Inline (IL) are the most preferred restaurant types, while Drive-Thru and Mobile are types that most cities do not prefer. The FC (food court) type yields high revenue in Big Cities.

City:

Around 60% of restaurants are opened in Istanbul, Ankara, and Izmir.
Istanbul is the city that provides the highest revenue and is the most preferred city for opening a new restaurant.

City Group:

The key insight is that the Big Cities group includes only Ankara, Istanbul, and Izmir.
Edirne, Elazig, and Trabzon are the Other cities that provide higher revenue.

Month:

Most new restaurants open during August and December.
Restaurants earn good revenue during the months of April and September.

P1-P37:

The dataset has a total of 37 P-variables (P1-P37), covering three categories of obfuscated data: demographic, real estate, and commercial. Some of these 'P' features are highly correlated with each other and have little or no correlation with the 'revenue' variable, so those features can be dropped.
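A sketch of one way to drop such features, using a pairwise-correlation threshold (the 0.9 cutoff is an assumption, not the value used in this project):

```python
import numpy as np
import pandas as pd

train = pd.read_csv("train.csv")  # assumed Kaggle layout
p_cols = [f"P{i}" for i in range(1, 38)]
corr = train[p_cols].corr().abs()

# Consider each pair only once (upper triangle) and drop one column
# from every pair whose absolute correlation exceeds the threshold
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
train = train.drop(columns=to_drop)
```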

Transforming the data

Applying Standard Scaler:

StandardScaler standardizes the data so that the mean of each feature is centered at zero and the feature is scaled to unit variance.

Applying Principal Component Analysis:

As the data has high dimensionality, we can use Principal Component Analysis (PCA). It reduces the dimensionality of the data and keeps the components with the largest variance.
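A sketch of both steps chained together with scikit-learn (the placeholder data and the number of components kept are assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder matrix standing in for the numeric columns of the dataset
X = np.random.default_rng(0).normal(size=(137, 37))

# Standardize first so every feature contributes on the same scale,
# then keep enough components to explain ~95% of the variance
pipeline = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_reduced = pipeline.fit_transform(X)
print(X_reduced.shape)
```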

Machine Learning Models

Let's discuss a few different types of ML models built on the given dataset after applying scaling or PCA.

Lasso regression or L1 Regularization:

Lasso shrinks the coefficients of the less important features to zero, thus removing some features altogether. This acts as an automatic feature-selection process, where only a subset of the most important features is left with non-zero weights. A small sketch contrasting Lasso and Ridge follows after the next section.

Ridge Regression or L2 Regularization:

This technique minimizes the coefficients of less important features but does not force them to zero. Shrinking their impact on the trained model yields a more accurate regression on test data.
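The sketch below contrasts the two regularizers on placeholder data; in the project the inputs would be the scaled/PCA-transformed features and the revenue target:

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(137, 20))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(size=137)  # only two informative features

lasso = LassoCV(cv=5).fit(X, y)                           # L1: drives weak coefficients to exactly zero
ridge = RidgeCV(alphas=np.logspace(-3, 3, 30)).fit(X, y)  # L2: shrinks them but keeps every feature

print("Lasso non-zero coefficients:", np.count_nonzero(lasso.coef_))
print("Ridge non-zero coefficients:", np.count_nonzero(ridge.coef_))
```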

Random Forest Regressor:

Random Forest is an ensemble technique in which multiple base models work together to form a more powerful model. It is the most popular bagging algorithm and can reduce the variance of a model without increasing its bias.
It follows a random bootstrap sampling approach, where each base model is trained on a different sample of rows and features.
RF = Decision Trees + bagging (row sampling with replacement) + column sampling (feature bagging) + aggregation (majority vote / mean / median)
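A minimal scikit-learn sketch on placeholder data (hyperparameters are illustrative, not the values used in this project):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = rng.normal(size=(137, 20)), rng.normal(size=137)  # placeholder data

rf = RandomForestRegressor(
    n_estimators=500,     # number of bagged decision trees
    max_features="sqrt",  # column sampling: each split considers a random subset of features
    random_state=42,
)
# Each tree sees a bootstrap sample of rows; predictions are averaged across trees
scores = cross_val_score(rf, X, y, scoring="neg_root_mean_squared_error", cv=5)
print("CV RMSE:", -scores.mean())
```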

XG Boost:

Boosting is a sequential ensemble technique in which each subsequent model attempts to correct the errors of the previous model.
XGBoost is one of the most powerful ensemble learning techniques for building supervised regression models. The predictions of all the trees are combined additively, with each new tree correcting the residual errors of the ensemble so far, to give the final prediction. It uses pre-sorted and histogram-based algorithms for computing the best split.

XGBoost = Boosting + Randomization

XGBoost = GBDT (pseudo-residuals + additive training) + row sampling + column sampling
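A minimal sketch with the xgboost package on placeholder data (hyperparameters are illustrative only):

```python
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X, y = rng.normal(size=(137, 20)), rng.normal(size=137)  # placeholder data

model = XGBRegressor(
    n_estimators=300,
    learning_rate=0.05,    # each new tree adds a small correction to the pseudo-residuals
    subsample=0.8,         # row sampling
    colsample_bytree=0.8,  # column sampling
)
model.fit(X, y)
```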


Light GBM:

LightGBM is a gradient boosting algorithm. While other algorithms grow trees horizontally (level-wise), LightGBM grows them vertically (leaf-wise).
Training is sped up with the Gradient-based One-Side Sampling (GOSS) technique, which focuses on the instances with large gradients.
Apart from being accurate, LightGBM is a lighter and faster algorithm than XGBoost.
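A minimal sketch with the lightgbm package on placeholder data (hyperparameters are illustrative only):

```python
import numpy as np
from lightgbm import LGBMRegressor

rng = np.random.default_rng(0)
X, y = rng.normal(size=(137, 20)), rng.normal(size=137)  # placeholder data

model = LGBMRegressor(
    n_estimators=300,
    learning_rate=0.05,
    num_leaves=31,  # leaf-wise growth is controlled by the number of leaves rather than tree depth
)
model.fit(X, y)
```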

Custom Stacking Classifier:

Stacking is a custom ensemble learning method that combines multiple machine learning algorithms via meta-learning: a newly trained meta-model learns how to best combine the predictions of multiple base models. Each base model predicts individually, and the meta-model uses those predictions to generate the final prediction. For this problem statement, I tried multiple base models and used LightGBM as the final meta-model.
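The blog does not list the exact base models, so the sketch below uses Ridge and Random Forest as stand-ins, with LightGBM as the meta-model as described; scikit-learn's StackingRegressor plays the role of the custom stacker here:

```python
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
X, y = rng.normal(size=(137, 20)), rng.normal(size=137)  # placeholder data

stack = StackingRegressor(
    estimators=[("ridge", RidgeCV()), ("rf", RandomForestRegressor(n_estimators=200))],
    final_estimator=LGBMRegressor(),  # meta-model that combines the base predictions
    cv=5,                             # base predictions are generated out-of-fold
)
stack.fit(X, y)
```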

Model Comparison

Kaggle Leaderboard score

Deployment

Finally, I built a sample web app using Streamlit and deployed the model on AWS.

Link to the web app:
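A minimal sketch of what such a Streamlit app might look like (the input fields, column names, and model file are assumptions, not the author's exact code):

```python
# app.py - run with: streamlit run app.py
import pickle

import pandas as pd
import streamlit as st

st.title("Restaurant Revenue Prediction")

city_group = st.selectbox("City Group", ["Big Cities", "Other"])
rest_type = st.selectbox("Type", ["FC", "IL", "DT", "MB"])
open_year = st.number_input("Opening year", min_value=1990, max_value=2021, value=2014)

if st.button("Predict revenue"):
    model = pickle.load(open("model.pkl", "rb"))  # hypothetical saved model file
    features = pd.DataFrame(
        [[city_group, rest_type, open_year]],
        columns=["City Group", "Type", "open_year"],
    )
    st.write("Predicted annual revenue:", model.predict(features)[0])
```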

Future work

  • The current project is built on several types of machine learning algorithms, which yield good predictions.
  • To further improve the model predictions and the evaluation metric, I need to work on a few more feature engineering techniques and will also try some deep learning models.

References

  1. https://www.appliedaicourse.com/
  2. https://www.kaggle.com/ruiyap/detailed-eda-data-processing-model-building
  3. https://www.kaggle.com/jackttai/revenue-prediction-using-lightgbm
  4. https://medium.com/@paul90.hn/restaurant-revenue-prediction-using-gradient-boosting-regression-and-principal-component-analysis-346287c0fab
  5. https://www.kaggle.com/jquesadar/restaurant-revenue-1st-place-solution
  6. https://www.streamlit.io/

Github Repository

The full implementation of the above ML models can be accessed through the Github link below.

LinkedIn Profile
