Python
Python was the main programming language used, and the project was developed in a Jupyter Notebook.
What are the factors and features of a listing that make an Airbnb listing more expensive?
Airbnb has provided many travellers with a great, easy and convenient place to stay during their travels. It has also given many people an opportunity to earn extra revenue by listing their properties for guests to stay in. However, with so many listings available at varying prices, how can an aspiring host know what type of property to invest in if his main aim is to list it on Airbnb and earn rental revenue? Similarly, if a traveller wants to find the cheapest listing available that still has certain features he prefers, like 'free parking', how does he know what aspects to look into to find a suitable listing? Many factors influence the price of a listing, which is why we aim to find the most important factors that affect the price and, more importantly, the features that are common among the most expensive listings. This will allow an aspiring Airbnb host to ensure that his listing is equipped with those important features so that he can charge a higher price without losing customers. A traveller will also know which factors to look into to get the lowest possible price while keeping the features he prefers.
The datasets for this project were obtained from Kaggle. As the project was part of a data science course, we used the Airbnb dataset for Seattle and analysed the listings there. The datasets can be obtained here.
As the dataset from Kaggle was not well suited to data analysis, we had to change the format of some of the data. We also had to prepare the data separately for exploratory analysis and for machine learning. Some of our data preparation steps included removing columns that were not useful for analysis, such as listing_url, scrape_id, etc. The main problem was broken into 3 sub-problems, each targeting a different aspect of the dataset.
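As an illustration, here is a minimal pandas sketch of this kind of preparation. The dropped column names come from the write-up above; the file name and the price-string cleaning step are assumptions based on the standard Kaggle download.

```python
import pandas as pd

# Load the Seattle listings dataset (file name assumed from the Kaggle download)
listings = pd.read_csv('listings.csv')

# Drop columns that carry no analytical value, e.g. listing_url and scrape_id
listings = listings.drop(columns=['listing_url', 'scrape_id'])

# Convert price strings such as "$85.00" into numeric values (format assumed)
listings['price'] = (
    listings['price']
    .replace(r'[\$,]', '', regex=True)
    .astype(float)
)
```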
Our first sub-problem focused on the physical features and facilities of the property itself. We wanted to see if there were any common features among the highly priced listings. We mainly focused on the listing's room type, property type, number of bedrooms and common amenities.
As can be seen from the countplot, most of the listings were for the entire home/apartment. There are almost twice as many entire home/apartment listings as private room listings. This gives a small insight into the types of listings available and the number of each type.
From the above graph, we can see that there are far more listings of apartments and full houses than of any other property type in Seattle. Together with the earlier discovery that hosts prefer to list their full property rather than just a room or shared room, it can be inferred that most listings in Seattle are entire apartments or entire houses.
From the above heatmap, with lighter colours representing lower prices and darker colours representing higher prices, we can see that shared rooms have the lightest colour and are hence the cheapest. Private rooms have a slightly darker colour, so they sit in the middle, and entire homes are the darkest and thus the most expensive. It is also important to note that houses and apartments, the two most common property types, actually have very similar prices within each room_type category. All of this tells us that room_type and property_type both play a very important role in the final price of a listing.
From the box plots above, it can be seen that listings have higher prices as the number of bedrooms increase.
The word cloud above was built from the top 100 listings by price. We can see that the listings with the highest prices have amenities such as Washer, Dryer, Heating, Wireless Internet, Smoke Detector, Free Parking and Kid Friendly.
So, an aspiring Airbnb host should ensure that his property offers these amenities so that he can charge a higher price. Similarly, if a traveller does not require any of these amenities, he can opt for a listing without them to save cost. Amenities and their influence on price are explored further in the machine learning section of the project.
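For reference, a word cloud like the ones in this section can be built with a library such as wordcloud; the sketch below is a minimal, assumed version of that process (the `listings` DataFrame and the `amenities` column name are assumptions, not the project's exact code).

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Take the 100 most expensive listings (DataFrame and column names assumed)
top100 = listings.sort_values('price', ascending=False).head(100)

# Join their amenities strings into one corpus and build the cloud
text = ' '.join(top100['amenities'].dropna())
cloud = WordCloud(width=800, height=400, background_color='white').generate(text)

plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()
```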
Our second sub-problem was to focus on the location of a listing. We wanted to see if there were any common neighbourhoods among the highly priced listings.
The map above shows clusters of the different types of listings. Yellow circles represent entire homes/apartments, red circles represent private rooms and blue circles represent shared rooms. From the points on the map, we can see that most of the yellow circles are concentrated in central Seattle. That is, most of the listings that offer the entire house/apartment are concentrated in central Seattle, particularly in neighbourhoods like Broadway, Belltown, Fremont, Pikepine Market and Queen Anne.
Additionally, we have already noted that the more expensive listings are those that list the entire property. You can drag around the map and zoom to observe the listings.
From the plots above, we can see that most of the listings appear in Broadway, Belltown, Fremont, East/North/West Queen Anne, etc. This gives us a good insight into the neighbourhoods with a higher number of listings that we can tap into. By analysing the number of listings and the prices for each neighbourhood, we get a clearer understanding of which neighbourhoods have many expensive listings. Looking at the analysis done so far, certain neighbourhoods are indeed more 'expensive' than others. However, some of those neighbourhoods do not have as many listings as other expensive neighbourhoods. Since our problem was to identify the factors that make a listing more expensive, we can infer that these neighbourhoods tend to have more expensive listings. A more thorough inference, however, is to identify neighbourhoods that have both a higher number of listings and higher prices, since a lower number of listings means fewer available options for a customer to choose from. As such, neighbourhoods like Belltown and West Queen Anne are neighbourhoods with a lot of expensive listings. (Apologies if the images are blurry; you can see the actual plots in the Jupyter notebook.)
Here we wanted to see if there is any relationship between the price and either the sentiment of the reviews or the text of a listing's summary.
This word cloud shows the most frequently used words in the summaries of the top 100 most expensive listings. We can see that they have 3 words in common in particular: seattle, home & view. Other words like kitchen, bedroom, walk and modern also commonly appear in those listings. But could it just be that all listings have these words in their summaries?
To find out, let's also analyse the summaries of the cheapest listings and see if we can infer anything from them.
Here we see the most common words in the summaries of the cheapest listings. As the word cloud shows, there are indeed overlapping words with the most expensive listings. Words like seattle, bedroom and home appear frequently in both, so they do not tell us anything special. However, words like view, modern & walk appear more frequently in expensive listings than in cheaper ones. So it turns out that there are indeed certain words which appear more frequently among expensive listings.
Python's NLTK library was used for the sentiment analysis. From the graphs above, we can conclude 3 things. First, most reviews do not carry much negativity; only a few have even a modicum of it, and in fact most reviews are classified with a negative sentiment of 0.0. Second, there are a lot of reviews with a reasonable amount of positivity. Finally, most of the reviews are largely neutral. If most reviews are highly neutral, we cannot infer much about the positivity or negativity of comments with respect to price, since the bulk of reviews fall in the neutral category. We can therefore conclude that most reviews are written with a neutral sentiment, although there is a very slight tilt towards positive sentiment.
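The write-up only says NLTK was used, so assuming its VADER analyzer (a common choice for review text), a minimal sentiment-scoring sketch looks like this:

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time download of the scoring lexicon
sia = SentimentIntensityAnalyzer()

# polarity_scores returns the proportions of negative, neutral and positive
# sentiment, plus an overall compound score in [-1, 1]
scores = sia.polarity_scores("Great location, the host was wonderful!")
print(scores)  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
```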
Exploratory analysis helped us look into the location, summary text and other features, but machine learning was used to find the most important features and amenities that affect a listing's price.
Regression models are used to predict a target value from independent variables, and they are mostly used for finding relationships between variables as well as for prediction/forecasting. Here, we use regression models to help predict the price based on the significant predictor variables of a property (other than location and reviews/summary) identified in the exploratory analysis. While trying to predict the price, we will also be able to confirm the most important features that affect it.
Linear Regression is a machine learning algorithm based on supervised learning. It performs a regression task to predict a dependent variable (in this case, price) from given independent variables (in this case, the identified predictor variables). It tries to find a linear relationship between the variables and predicts the price along that line. Here, we trained the model to follow the formula:

Price = 𝑎 × (Predictor Variables) + 𝑏

This is the general formula; since we have multiple predictor variables, there is one coefficient per predictor:

Price = 𝑎₁ × (Predictor₁) + 𝑎₂ × (Predictor₂) + … + 𝑎ₙ × (Predictorₙ) + 𝑏
Points that lie on or near the diagonal line mean that the values predicted by the Linear Regression model are highly accurate; points far from the diagonal have been poorly predicted.
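A minimal scikit-learn sketch of this step, assuming `X` holds the predictor variables and `y` the prices (variable names and the split ratio are assumptions):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Hold out a test set to evaluate the model on unseen listings
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

print('MSE :', mean_squared_error(y_test, y_pred))
print('R^2 :', r2_score(y_test, y_pred))

# True vs predicted prices: accurate predictions hug the diagonal
plt.scatter(y_test, y_pred, alpha=0.3)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel('True price')
plt.ylabel('Predicted price')
plt.show()
```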
Random Forest is an ensemble technique that can perform both Regression and Classification tasks using multiple decision trees and a technique called Bootstrap Aggregation (bagging). The idea behind this technique is to combine multiple decision trees in the prediction rather than relying on individual decision trees. Here, we use the RandomForestRegressor to help predict the price while also finding the most important variables (i.e. features). Importance provides a score that indicates how useful or valuable each feature was in the construction of the decision trees within the model. The more a variable is used to make key decisions within the trees, the higher its relative importance. As such, feature importance can be used to interpret our data and understand the most important features behind our predictions. In this case, looking at the bar chart above, a predictor variable with a taller bar has a higher importance in the Random Forest Regression model when predicting price.
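Continuing from the train/test split above, a sketch of the RandomForestRegressor and its importance scores (hyperparameter values are assumptions):

```python
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
import matplotlib.pyplot as plt

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print('R^2 on test set:', rf.score(X_test, y_test))

# feature_importances_ holds one score per predictor, in column order
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
importances.sort_values().plot.barh()
plt.show()
```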
XGBoost is an open-source library that provides a high-performance implementation of gradient boosted decision trees (similar to the decision trees we have learnt about). It is a machine learning model that can perform prediction tasks for both Regression and Classification. The key idea of gradient boosted decision trees is that they build a series of trees, where each tree is trained to correct the mistakes of the previous tree in the series.
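A minimal XGBoost regression sketch with the same assumed train/test split (the hyperparameter values are illustrative, not the project's actual settings):

```python
from xgboost import XGBRegressor

# Each new tree is fitted to the residual errors of the trees before it
xgb = XGBRegressor(n_estimators=200, learning_rate=0.1, max_depth=4)
xgb.fit(X_train, y_train)
print('R^2 on test set:', xgb.score(X_test, y_test))
```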
CatBoost is a high-performance, open-source library for gradient boosting on decision trees. It can be used to solve both Classification and Regression problems.
As with Random Forest, importance provides a score indicating how useful or valuable each feature was in the construction of the boosted decision trees within the model; the more a variable is used to make key decisions, the higher its relative importance. Looking at the bar chart above, a predictor variable with a longer bar has a higher importance in the CatBoost Regression model when predicting price.
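A minimal CatBoost sketch under the same assumptions (illustrative hyperparameters):

```python
from catboost import CatBoostRegressor

cb = CatBoostRegressor(iterations=500, learning_rate=0.05, verbose=0)
cb.fit(X_train, y_train)
print('R^2 on test set:', cb.score(X_test, y_test))

# get_feature_importance returns one score per feature, in column order
for name, score in zip(X_train.columns, cb.get_feature_importance()):
    print(f'{name}: {score:.2f}')
```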
Ridge Regression is meant to be an upgrade to linear regression. It is similar to linear regression in that it can be used for regression and classification. Ridge Regression is good at handling overfitting. The difference in the Ridge Regression equation is that it penalizes the RSS by adding another term (𝜆 times the sum of the squared coefficients) and minimizes that penalized sum. We can iterate over different 𝜆 values for this additional term to find the best fit for a Ridge Regression model.
From this graph, we can see that the most important predictor among the 5 is bedrooms. For Ridge Regression, the beta estimate of each predictor converges towards zero (but never reaches it) as lambda increases; the faster it converges to zero, the less important the predictor is. In this case, the most important predictor is bedrooms. The reason the beta estimates never reach zero is that Ridge Regression does not drop any predictors, unlike Lasso Regression, which we will observe later on.
Lasso Regression is similar to Ridge Regression: also meant as an upgrade to linear regression, it too can be used for Regression and Classification. Lasso Regression can be used for feature selection, where some predictors are dropped once lambda passes a certain value. Lasso Regression also requires iterating over 𝜆 values to find the best fit.
From this graph, we can see that the most important predictor among the 5 is also bedrooms. For Lasso Regression, the faster the beta estimate of a predictor reaches zero (i.e. the predictor is dropped) as lambda increases, the less important the predictor is. As we can see from the graph, the beta estimate of bedrooms does not even hit zero when 𝜆 has reached its highest value of 8, whereas the beta estimate of hot_tub_sauna_or_pool reached zero much faster.
With all the insights gained from exploratory analysis and the machine learning models used, we can conclude a few things by determining the best model as well as the most important features.
| Regression | Error Type | Error Value |
|---|---|---|
| Linear Regression (Test Set) | Mean Squared Error (MSE) | 3588.0181 |
| Linear Regression (Test Set) | R Squared (R^2) | 0.5173 |
| Random Forest Regression (Test Set) | Mean Squared Error (MSE) | 3466.0835 |
| Random Forest Regression (Test Set) | R Squared (R^2) | 0.5337 |
| XGBoost (Test Set) | Mean Squared Error (MSE) | 4285.6151 |
| XGBoost (Test Set) | R Squared (R^2) | 0.4234 |
| CatBoost (Test Set) | Mean Squared Error (MSE) | 4744.8753 |
| CatBoost (Test Set) | R Squared (R^2) | 0.3616 |
| Ridge Regression (Test Set) | Mean Squared Error (MSE) | 3582.975 |
| Ridge Regression (Test Set) | R Squared (R^2) | 0.518 |
| Lasso Regression (Test Set) | Mean Squared Error (MSE) | 3582.683 |
| Lasso Regression (Test Set) | R Squared (R^2) | 0.518 |
From the table above, it can be seen that the Random Forest Regression model has the lowest MSE and the highest R^2 value among all the models. Moreover, when comparing the True Values vs Predicted Values graphs for each regression model (shown under the headings of the different models), this conclusion holds: the Random Forest Regression scatter plot shows many points situated near the diagonal line, making it the most accurate. It is also interesting to note that the MSE and R^2 values of Linear Regression, Ridge Regression and Lasso Regression are very close to each other; this is most probably due to the similarities between the 3 models, as Ridge and Lasso Regression are meant to be upgraded versions of Linear Regression.
However, a random train/test split may not always be enough, as it can be subject to selection bias during the split process (even if the split is random). This is especially so if the dataset is small. Train/test splits can also produce over-fitted models, which affects the performance metrics. To overcome these pitfalls, K-Fold Cross Validation was also performed. Here, the whole dataset is used to calculate the performance of the regression models. This validation method is popular simply because it generally results in a less biased, less optimistic estimate of the model.
K-Fold Cross Validation splits the dataset into k folds, each of which takes a turn as the test set. Here, k = 10 was used, as it is a value that has been found to generally result in a model skill estimate with low bias and modest variance. Running the K-Fold Cross Validation produced results similar to the table above, confirming our initial conclusion that Random Forest Regression was the best model for predicting price.
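A minimal sketch of this validation step with scikit-learn (the scoring metric shown and the shuffle setting are assumptions):

```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestRegressor

kf = KFold(n_splits=10, shuffle=True, random_state=42)
rf = RandomForestRegressor(n_estimators=100, random_state=42)

# Each of the 10 folds takes a turn as the test set; the rest train the model
scores = cross_val_score(rf, X, y, cv=kf, scoring='r2')
print('R^2 per fold:', scores)
print('Mean R^2    :', scores.mean())
```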
We then did further analysis to find out which features were the most important in affecting the price of a listing. We first tested the model's prediction on a specific instance. Then, using a library called TreeInterpreter, we decomposed the Random Forest prediction into a sum of contributions from each feature:

Prediction = Bias + Feature₁ × Contribution₁ + … + Featureₙ × Contributionₙ
This shows how each individual feature contributed to the prediction for an individual instance. A positive value means the feature pushed the predicted price up, while a negative value pushed it down. If the predicted price is fairly accurate (compared to its true value), then the contributions of the individual features can also be deemed fairly reliable.
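A minimal sketch of this decomposition with the treeinterpreter library, assuming `rf` is the fitted RandomForestRegressor and `X_test` the held-out predictors from the earlier sketches:

```python
from treeinterpreter import treeinterpreter as ti

# Decompose the prediction for one instance into bias + per-feature contributions
instance = X_test.iloc[[0]]
prediction, bias, contributions = ti.predict(rf, instance.values)

print('Prediction:', prediction[0])
print('Bias (training set mean):', bias[0])
# Rank features by the magnitude of their contribution to this prediction
for name, contrib in sorted(zip(X_test.columns, contributions[0]),
                            key=lambda t: -abs(t[1])):
    print(f'{name}: {contrib:+.2f}')
```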
When the TreeInterpreter library was run, we found that amenities such as a hot tub/sauna/pool, gym and elevator, together with the number of bedrooms, were the most important in influencing the price of a listing. The code for this analysis, as well as the entire project, can be found in the Jupyter notebook.
From all the analysis done above, we can confidently answer our initial question of which factors make a listing more expensive. An aspiring Airbnb host investing in a new property in Seattle should focus on the following factors to maximise the price of his listing. Additionally, a traveller who wants to pay the lowest possible price might want to avoid these features in his prospective housing:

- Location: neighbourhoods such as Belltown or West Queen Anne have a high number of listings, and those listings tend to be expensive.
- Summary: the words view, modern & walk all frequently appear in the summaries of the more expensive listings.
- Amenities: Washer, Dryer, Heating, Wireless Internet, Smoke Detector, Free Parking, Kid Friendly, TV, Hot Tub/Sauna/Pool, Gym and Elevator are all common among the more expensive listings.

Various technologies and libraries were used throughout the project.
Some of the most frequently used libraries and technologies are listed below.
Scikit-learn was the main library used for machine learning. Most of the regression models came from this library.
Matplotlib was mainly used for most of the graph plots and visualizations in the project.
Python's NLTK was used during exploratory analysis to get further insights into the textual data.
XGBoost was one of the libraries used to implement gradient boost decision trees in the project.
CatBoost was another library used to implement gradient boost decision trees in the project.