Predicting Prices for Munich Airbnb

Airbnb is a prime example of a disruptive innovation and is now one of the largest marketplaces for accommodation, with over 7 million properties in more than 220 countries. In this project I use scraped data from Airbnb listings to carry out statistical analyses and ultimately predict the total cost for two people staying four nights in the city of Munich, Germany.

After initial cleaning and wrangling of the dataset, I carried out an exploratory data analysis (EDA) to investigate relationships between variables, especially within and between price, neighbourhood / region, room and property type, as well as reviews and cancellation policy. As I will explain in greater detail below, I grouped the neighbourhoods into zones based on both personal experience and an official City of Munich map of zones based on accommodation quality and price. Key observations from the EDA were that the distributions of price and reviews are heavily right-skewed, and that no clear linear relationship with price could be observed; this led me to use the log of the total price for 4 nights in the regression.

I progressively improved the regression model by investigating the effect of each variable, as indicated by its t-statistic and p-value. My final and best model includes the most extensive list of variables, including, for example, logical variables for the only two significant amenities (elevator and shampoo). I ultimately arrived at an adjusted R-squared of around 40%. Given that the correlation matrix and other early analyses showed rather weak relationships between variables, I believe this is a good result for the given dataset. Lastly, plots of residuals (i.e. QQ-plot, residuals vs. fitted) as well as variance inflation factor analyses indicated that the assumptions of linear regression (L-I-N-E) were met.

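#Loading the packages used throughout this analysis
library(tidyverse); library(vroom); library(janitor) #wrangling, fast import, clean_names()
library(mosaic); library(GGally) #favstats() and ggpairs()
listings <- vroom("http://data.insideairbnb.com/germany/bv/munich/2020-06-20/data/listings.csv.gz") %>% 
    clean_names()
#glimpse(listings) # checking variable headers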

Data Preprocessing

Selecting variables and changing to the relevant type

I first select all potentially relevant variables from my data frame. Character columns are then converted to numbers or factors where appropriate, ready for the Exploratory Data Analysis (EDA). The resulting raw dataset is called munich_listings.

#Selecting all the relevant variables
munich_listings<- listings %>% 
  select(id, 
         host_is_superhost,
         host_listings_count,
         neighbourhood_cleansed,
         latitude,
         longitude,
         property_type,
         room_type,
         accommodates,
         bathrooms,
         bedrooms,
         beds,
         bed_type,
          #square_feet, we noticed that a lot of values are missing so excluded this variable
         price,
         security_deposit,
         cleaning_fee,
         guests_included,
         extra_people,
         minimum_nights,
         maximum_nights,
         number_of_reviews,
         reviews_per_month,
         review_scores_rating,
         review_scores_accuracy,
         review_scores_cleanliness,
         review_scores_checkin,
         review_scores_communication,
         review_scores_location,
         review_scores_value,
         is_location_exact,
         amenities,
         instant_bookable,
         cancellation_policy,
         availability_365,
         availability_90,
         last_review,
         listing_url,
         last_scraped) %>% 
#Converting characters to "doubles" and factors where appropriate
  mutate(neighbourhood_cleansed=factor(neighbourhood_cleansed),
         room_type=factor(room_type),
         price=parse_number(price),
         security_deposit=parse_number(security_deposit),
         cleaning_fee=parse_number(cleaning_fee),
         extra_people=parse_number(extra_people),
         cancellation_policy=factor(cancellation_policy),
         bed_type=factor(bed_type),
         amenities_count= str_count(amenities, ",")) #number of commas in the amenities string
#Inspecting data frame to make sure all the variables are correctly attributed
glimpse(munich_listings) 
## Rows: 11,172
## Columns: 39
## $ id                          <dbl> 36720, 97945, 114695, 127383, 157808, 159…
## $ host_is_superhost           <lgl> FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, TR…
## $ host_listings_count         <dbl> 1, 1, 3, 2, 0, 1, 1, 1, 2, 1, 1, 1, 2, 2,…
## $ neighbourhood_cleansed      <fct> Ludwigsvorstadt-Isarvorstadt, Hadern, Ber…
## $ latitude                    <dbl> 48.1, 48.1, 48.1, 48.2, 48.2, 48.1, 48.1,…
## $ longitude                   <dbl> 11.6, 11.5, 11.6, 11.6, 11.6, 11.5, 11.5,…
## $ property_type               <chr> "Apartment", "Apartment", "Apartment", "A…
## $ room_type                   <fct> Entire home/apt, Entire home/apt, Entire …
## $ accommodates                <dbl> 2, 2, 5, 4, 2, 3, 4, 2, 2, 2, 2, 1, 16, 5…
## $ bathrooms                   <dbl> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1…
## $ bedrooms                    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ beds                        <dbl> 1, 1, 3, 1, 1, 1, 2, 1, 1, 0, 1, 1, 0, 3,…
## $ bed_type                    <fct> Futon, Real Bed, Real Bed, Real Bed, Real…
## $ price                       <dbl> 95, 80, 95, 120, 35, 55, 55, 65, 54, 67, …
## $ security_deposit            <dbl> 100, NA, 500, NA, 100, 0, 200, NA, 190, N…
## $ cleaning_fee                <dbl> 30, 10, 60, 28, 10, 60, 20, NA, 32, NA, 0…
## $ guests_included             <dbl> 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1,…
## $ extra_people                <dbl> 30, 10, 50, 0, 15, 30, 15, 0, 0, 0, 0, 20…
## $ minimum_nights              <dbl> 2, 2, 2, 2, 1, 3, 2, 3, 1, 2, 3, 2, 1, 1,…
## $ maximum_nights              <dbl> 730, 90, 30, 14, 36, 90, 1125, 14, 4, 30,…
## $ number_of_reviews           <dbl> 25, 131, 53, 84, 0, 33, 467, 64, 211, 89,…
## $ reviews_per_month           <dbl> 0.34, 1.23, 0.49, 0.76, NA, 0.31, 4.39, 0…
## $ review_scores_rating        <dbl> 98, 97, 95, 98, NA, 93, 99, 91, 97, 97, 9…
## $ review_scores_accuracy      <dbl> 10, 10, 9, 10, NA, 9, 10, 9, 10, 10, 10, …
## $ review_scores_cleanliness   <dbl> 10, 10, 10, 10, NA, 9, 10, 9, 10, 10, 9, …
## $ review_scores_checkin       <dbl> 10, 10, 10, 10, NA, 9, 10, 10, 10, 10, 10…
## $ review_scores_communication <dbl> 10, 10, 10, 10, NA, 10, 10, 10, 10, 10, 1…
## $ review_scores_location      <dbl> 10, 9, 9, 10, NA, 9, 10, 9, 10, 10, 10, 1…
## $ review_scores_value         <dbl> 9, 9, 9, 10, NA, 9, 10, 9, 9, 10, 9, 10, …
## $ is_location_exact           <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ amenities                   <chr> "{TV,\"Cable TV\",Internet,Wifi,Kitchen,H…
## $ instant_bookable            <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, …
## $ cancellation_policy         <fct> strict_14_with_grace_period, flexible, st…
## $ availability_365            <dbl> 0, 82, 59, 6, 0, 142, 260, 90, 0, 111, 1,…
## $ availability_90             <dbl> 0, 2, 48, 6, 0, 4, 46, 90, 0, 43, 0, 89, …
## $ last_review                 <date> 2017-07-22, 2019-10-03, 2019-10-06, 2020…
## $ listing_url                 <chr> "https://www.airbnb.com/rooms/36720", "ht…
## $ last_scraped                <date> 2020-06-21, 2020-06-20, 2020-06-21, 2020…
## $ amenities_count             <int> 10, 35, 36, 37, 24, 37, 32, 19, 31, 22, 2…

munich_listings has 11,172 rows and 39 columns.

Here are some noticeable changes I made:
  • neighbourhood_cleansed, room_type, cancellation_policy and bed_type are changed into factors.
  • price, security_deposit, cleaning_fee and extra_people are parsed into numbers.
  • amenities_count is a new integer variable that counts the commas in amenities.
So now I have:
  • logical variables: host_is_superhost, is_location_exact, instant_bookable
  • factor variables: neighbourhood_cleansed, room_type, bed_type, cancellation_policy
  • character variables: amenities, property_type
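The amenities_count column counts commas in the amenities string, which for a non-empty list is one less than the number of amenities listed. The following is a minimal base R sketch of that difference, using made-up amenities strings in the same `{a,b,c}` format. The helper names `count_commas` and `count_amenities` are hypothetical and not part of the original analysis.

```r
# Made-up amenities strings in the same "{a,b,c}" format as the dataset
amenities <- c('{TV,"Cable TV",Internet,Wifi}', '{Wifi}', '{}')

# Number of commas per string (what str_count(amenities, ",") returns)
count_commas <- function(x) {
  lengths(regmatches(x, gregexpr(",", x, fixed = TRUE)))
}

# Number of listed amenities: strip the braces, then commas + 1 unless empty
count_amenities <- function(x) {
  inner <- gsub("^\\{|\\}$", "", x)
  ifelse(inner == "", 0L, count_commas(inner) + 1L)
}

count_commas(amenities)     # 3 0 0
count_amenities(amenities)  # 4 1 0
```

Because the offset is the same for every non-empty list, the comma count still ranks listings by how many amenities they offer. The sketch assumes that no amenity name itself contains a comma.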

Data cleaning

I now create a new data frame called munich_listings_cleaned with some required changes. Here, I deal with missing values (NAs) and simplify the property type. I also filter the listings on minimum/maximum nights and on accommodates, so that they are suitable for 2 people staying 4 nights.

Filter dataset for two people and 4 nights
#Clean dataset for cleaning_fee, security_deposit, property_type, minimum_nights and accommodates
munich_listings_cleaned <- munich_listings %>%
  mutate(cleaning_fee = case_when(      #considering cleaning_fee as 0 if displayed as NA
    is.na(cleaning_fee) ~ 0, 
    TRUE ~ cleaning_fee),
    security_deposit = case_when(      #considering security_deposit as 0 if displayed as NA
    is.na(security_deposit) ~ 0, 
    TRUE ~ security_deposit),
    prop_type_simplified = case_when(   #regrouping of property_types: put all less popular property types into "Other"
    property_type %in% c("Apartment",
                         "House",
                         "Condominium",
                         "Loft")~ property_type , 
    TRUE ~ "Other"),
    prop_type_simplified=factor(prop_type_simplified)) %>% #creating factors
  filter(minimum_nights<=4, 
         maximum_nights>=4, 
         accommodates>=2) #filtering dataframe for 2 people and 4 nights
#Visually inspecting cleaned data set
#glimpse(munich_listings_cleaned)
#skim(munich_listings_cleaned)

For the NAs:
I treat NA as 0 for cleaning fee and security deposit, assuming that these listings can be booked without paying either charge. I therefore did not delete these rows.

For property_type:
I kept the most popular property types in Munich (Apartment, House, Condominium and Loft) and grouped all remaining types into "Other". I then converted the simplified variable into a factor.

Filtering:
I filter the listings on minimum_nights and maximum_nights so that they can be booked for a 4-night stay. Each listing must also accommodate at least 2 people.
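The cleaning rules described here can be illustrated on a toy data frame. This is a minimal base R sketch with made-up values, not the original pipeline; the data frame `toy` and its contents are assumptions for illustration only.

```r
# Made-up listings: one row per listing
toy <- data.frame(
  id             = 1:4,
  cleaning_fee   = c(30, NA, 10, NA),
  minimum_nights = c(2, 4, 1, 3),
  maximum_nights = c(30, 30, 3, 1125),
  accommodates   = c(2, 4, 2, 1)
)

# Treat a missing cleaning fee as "no fee charged"
toy$cleaning_fee[is.na(toy$cleaning_fee)] <- 0

# Keep listings that allow a 4-night stay for at least 2 people
bookable <- subset(
  toy,
  minimum_nights <= 4 & maximum_nights >= 4 & accommodates >= 2
)

bookable$id            # 1 2
bookable$cleaning_fee  # 30 0
```

Listing 3 is dropped because its maximum stay is 3 nights, and listing 4 because it accommodates only one person.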

Calculating total price

Next, I construct the total price for 4 nights in the data frame munich_listings_total_price: total_price_4_days is my target variable for the regression and represents the total price of a 4-night stay for two people. It is the nightly price times 4 plus the cleaning fee. The if_else statement adds the charge for 1 extra guest (extra_people times 4) when a listing has guests_included = 1, i.e. when the base price covers only one guest. The 14.2% service fee that Airbnb charges per booking (a multiplier of 1.142) is not applied in the code below, so total_price_4_days excludes it.

munich_listings_total_price<-munich_listings_cleaned %>% 
  mutate(total_price_4_days=price*4+ #calculating the total price for 4 days 2 guests
           cleaning_fee+
           if_else(guests_included==1, 
                   extra_people*4,0))
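As a worked check of this formula, the sketch below recomputes the total for two listings whose values appear in the glimpse() output earlier. The function `total_price` and its `fee_multiplier` argument are hypothetical helpers added for illustration; the default multiplier of 1 mirrors the mutate() call, which does not apply a service fee.

```r
# Hypothetical helper mirroring the total_price_4_days formula.
# fee_multiplier is an assumption for illustration; 1 means "no service fee".
total_price <- function(price, cleaning_fee, guests_included, extra_people,
                        nights = 4, fee_multiplier = 1) {
  # Charge for a second guest only when the base price covers one guest
  extra <- ifelse(guests_included == 1, extra_people * nights, 0)
  (price * nights + cleaning_fee + extra) * fee_multiplier
}

# Values of the first and seventh listings in the glimpse() output
first   <- total_price(price = 95, cleaning_fee = 30,
                       guests_included = 1, extra_people = 30)
seventh <- total_price(price = 55, cleaning_fee = 20,
                       guests_included = 2, extra_people = 15)

first    # 95*4 + 30 + 30*4 = 530
seventh  # 55*4 + 20 = 240

# The same stay with a 14.2% fee applied on top
with_fee <- total_price(95, 30, 1, 30, fee_multiplier = 1.142)
with_fee # 530 * 1.142 = 605.26
```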

Creating a new data frame for further analysis

I will now create a new data frame called “munich_listings_region”, grouping the Airbnbs geographically and adding some variables for the subsequent analysis.

Three kinds of variables are created:
  • region: neighbourhoods grouped into 5 zones by the average price of each neighbourhood
  • rating_group: listings grouped into 2 groups by whether the rating is below 90
  • Amenities: logical flags for individual amenities. Numerous amenity keywords were checked for significance and only two remained; interestingly, they are shampoo and elevator.
munich_listings_region <- munich_listings_total_price %>% 
  mutate(
region = case_when( #creating variable that clusters neighbourhoods for further analysis
      neighbourhood_cleansed=="Altstadt-Lehel"~"zone_1",
      neighbourhood_cleansed=="Ludwigsvorstadt-Isarvorstadt"~"zone_1",
      neighbourhood_cleansed=="Maxvorstadt"~"zone_1",
      neighbourhood_cleansed=="Schwabing-West"~"zone_2",
      neighbourhood_cleansed=="Au-Haidhausen"~"zone_2",
      neighbourhood_cleansed=="Sendling"~"zone_2",
      neighbourhood_cleansed=="Sendling-Westpark"~"zone_2",
      neighbourhood_cleansed=="Schwanthalerhöhe"~"zone_1",
      neighbourhood_cleansed=="Neuhausen-Nymphenburg"~"zone_3",
      neighbourhood_cleansed=="Moosach"~"zone_5",
      neighbourhood_cleansed=="Milbertshofen-Am Hart"~"zone_5",
      neighbourhood_cleansed=="Schwabing-Freimann"~"zone_3",
      neighbourhood_cleansed=="Bogenhausen"~"zone_4",
      neighbourhood_cleansed=="Berg am Laim"~"zone_4",
      neighbourhood_cleansed=="Tudering-Riem"~"zone_1",
      neighbourhood_cleansed=="Ramersdorf-Perlach"~"zone_5",
      neighbourhood_cleansed=="Obergiesing"~"zone_2",
      neighbourhood_cleansed=="Untergiesing-Harlaching"~"zone_4",
      neighbourhood_cleansed=="Thalkirchen-Obersendling-Forstenried-Fürstenried-Solln"~"zone_3",
      neighbourhood_cleansed=="Hadern"~"zone_5",
      neighbourhood_cleansed=="Pasing-Obermenzing"~"zone_3",
      neighbourhood_cleansed=="Aubing-Lochhausen-Langwied"~"zone_4",
      neighbourhood_cleansed=="Allach-Untermenzing"~"zone_3",
      neighbourhood_cleansed=="Feldmoching-Hasenbergl"~"zone_3",
      neighbourhood_cleansed=="Laim"~"zone_5"
      ),
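rating_group= case_when( #clustering review_scores_rating into 2 groups
  review_scores_rating <90 ~ "Under 90",
  TRUE ~ "Over 90"), #"Over 90" includes ratings of exactly 90; rows with missing ratings also land here but are dropped by na.omit() below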
# is_pool=case_when(
#   grepl("Pool", amenities, fixed=TRUE) ~ TRUE,
#   TRUE ~FALSE),
# is_gym=case_when(
#   grepl("Gym", amenities, fixed=TRUE) ~ TRUE,
#   TRUE ~FALSE),
# is_private_entrance=case_when(
#   grepl("Private entrance", amenities, fixed=TRUE) ~ TRUE,
#   TRUE ~FALSE),
# is_balcony=case_when(
#   grepl("balcony", amenities, fixed=TRUE) ~ TRUE,
#   TRUE ~FALSE),
# is_kitchen=case_when(
#   grepl("Kitchen", amenities, fixed=TRUE) ~ TRUE,
#   TRUE ~FALSE),
is_elevator=grepl("Elevator", amenities, fixed=TRUE), # turned out to be significant
# is_washer=case_when(
#   grepl("Washer", amenities, fixed=TRUE) ~ TRUE,
#   TRUE ~FALSE),
# is_dryer=case_when(
#   grepl("Dryer", amenities, fixed=TRUE) ~ TRUE,
#   TRUE ~FALSE),
# is_free_parking=case_when(
#   grepl("Free parking on premises", amenities, fixed=TRUE) ~ TRUE,
#   TRUE ~FALSE),
# is_paid_parking=case_when(
#   grepl("Paid parking off premises", amenities, fixed=TRUE) ~ TRUE,
#   TRUE ~FALSE),
# is_essentials=case_when(
#   grepl("Essentials", amenities, fixed=TRUE) ~ TRUE,
#   TRUE ~FALSE),
is_shampoo=grepl("Shampoo", amenities, fixed=TRUE)) #turned out to be significant
# is_host_greets_you=case_when(
#   grepl("Host greets you", amenities, fixed=TRUE) ~ TRUE,
#   TRUE ~FALSE),
# is_garden=case_when(
#   grepl("Garden or backyard", amenities, fixed=TRUE) ~ TRUE,
#   TRUE ~FALSE))


munich_listings_region <- munich_listings_region %>%  #cleaning dataframe from all the missing values
 na.omit()
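The neighbourhood-to-zone mapping can also be expressed as a lookup table, which keeps the assignment in one place. This is a minimal base R sketch of that alternative, not the approach used in the analysis; it copies only four of the neighbourhood assignments, and the names `zone_lookup` and `neighbourhoods` are hypothetical.

```r
# Named character vector: names are neighbourhoods, values are zones
# (only a subset of the mapping is reproduced here)
zone_lookup <- c(
  "Altstadt-Lehel" = "zone_1",
  "Maxvorstadt"    = "zone_1",
  "Sendling"       = "zone_2",
  "Hadern"         = "zone_5"
)

# Example input; a factor column would need as.character() first
neighbourhoods <- c("Hadern", "Altstadt-Lehel", "Sendling", "Not a district")

# Look each neighbourhood up by name; unknown names become NA
region <- unname(zone_lookup[neighbourhoods])
region  # "zone_5" "zone_1" "zone_2" NA
```

As with case_when() without a fallback branch, a neighbourhood that is missing from the table yields NA, so misspelt names are easy to spot before na.omit() silently removes those rows.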

Key variable descriptions

Here are descriptions of the key variables in my dataset:
    dependent variable:
  • total_price_4_days: total price of a 4-night stay for two people
    independent variables:
  • property_type: type of accommodation (House, Apartment, etc.)
  • prop_type_simplified: simplified type of accommodation (Apartment, House, Condominium, Loft, Other)
  • room_type:
    • Entire home/apt (guests have entire place to themselves)
    • Private room (guests have private room to sleep, all other rooms shared)
    • Shared room (guests sleep in room shared with others)
  • number_of_reviews: total number of reviews for the listing
  • reviews_per_month: number of reviews per month
  • review_scores_rating: average review score (0 - 100)
  • review_scores_*: review scores for individual aspects (accuracy, cleanliness, check-in, communication, location, value)
  • rating_group: review_scores_rating split into two groups at 90
  • longitude, latitude: geographical coordinates to help us locate the listing
  • region: zone the Airbnb is located in, grouped by price level (zones 1-5, from high price to low price)
  • availability_365: number of days the listing is available over the next 365 days
  • is_elevator and is_shampoo: whether the listing's amenities include an elevator or shampoo

Exploratory Data Analysis

Now that I have cleaned my data sets for the specific target (4 nights, 2 people), I will conduct an exploratory data analysis.

Summary statistics and favstats

#summary to check for NA's and general statistics
#summary(munich_listings_region)

#running favstats on some interesting variable combinations and keeping the most interesting ones
favstats(price~accommodates, data=munich_listings_region) 
accommodates   min     Q1  median        Q3       max      mean        sd     n  missing
           2    15     50      70       100       999      85.5      65.1  3648        0
           3    12     64      90       135     1e+03       111      79.9   893        0
           4    11     80     115       180     8e+03       154       288  1163        0
           5    35     94     139       200  1.12e+03       181       146   205        0
           6    32   96.2     172       300     1e+03       234       199   214        0
           7    34     89     140       215       700       181       131    44        0
           8    25    128     228       414       995       308       251    66        0
           9    65    125     226       288       950       311       325     6        0
          10    25    196     294       612  1.45e+03       437       375    18        0
          11   149    262     475  2.74e+03     9e+03  2.52e+03  4.32e+03     4        0
          12   125    285     325       551       800       409       221    10        0
          13    39     39      39        39        39        39        NA     1        0
          14   185    242     300       360       420       302       118     3        0
          16    35     35      35       111       839       145       250    10        0
favstats(price~neighbourhood_cleansed, data=munich_listings_region)
neighbourhood_cleansed                                  min    Q1  median    Q3       max  mean    sd    n  missing
Allach-Untermenzing                                      18    42      75   110       530   111   113   35        0
Altstadt-Lehel                                           25    80     120   180       800   153   111  222        0
Au-Haidhausen                                            25    60      85   120  1.45e+03   115   116  408        0
Aubing-Lochhausen-Langwied                               16  41.5      65   149       380    99  81.3   56        0
Berg am Laim                                             25    55    76.5   131       400   103  75.7  110        0
Bogenhausen                                              23    57      80   120       500  97.9  66.4  297        0
Feldmoching-Hasenbergl                                   25    45    62.5  98.2       350  88.4    68   74        0
Hadern                                                   15    45      79   100       350  84.4  58.4   73        0
Laim                                                     20    50      80   121       585   100  82.8  216        0
Ludwigsvorstadt-Isarvorstadt                             28    70     100   150     9e+03   172   499  717        0
Maxvorstadt                                              28    65      90   140       999   122   107  666        0
Milbertshofen-Am Hart                                    12    49      70   100       400    86  58.6  249        0
Moosach                                                  25    50      70   100       800   101   113  103        0
Neuhausen-Nymphenburg                                    21  52.2    79.5   120       899   104  83.7  434        0
Obergiesing                                              15    50      80   130       700   109  98.9  213        0
Pasing-Obermenzing                                       21    46      70   125       800   105   104  119        0
Ramersdorf-Perlach                                       15    45      60    90       420    75  49.4  215        0
Schwabing-Freimann                                       20    55      80   120     1e+03   106   101  339        0
Schwabing-West                                           11  56.2      80   120     1e+03   107  89.6  446        0
Schwanthalerhöhe                                         25    70     104   160     1e+03   136   115  260        0
Sendling                                                 25  59.2      90   135       590   113  88.5  258        0
Sendling-Westpark                                        20    52      80   120       990   109   105  208        0
Thalkirchen-Obersendling-Forstenried-Fürstenried-Solln   25    50      75   120  1.12e+03  98.8  93.5  211        0
Tudering-Riem                                            30    50      75   120       999   127   152  161        0
Untergiesing-Harlaching                                  28    60      80   120       500   107  77.7  195        0
favstats(price~host_is_superhost, data=munich_listings_region)
host_is_superhost  min  Q1  median   Q3    max  mean    sd     n  missing
FALSE               11  60      88  130  9e+03   119   206  5280        0
TRUE                18  50      75  110  1e+03   101  91.9  1005        0
favstats(price~prop_type_simplified, data=munich_listings_region)
prop_type_simplified  min    Q1  median   Q3    max  mean   sd     n  missing
Apartment              11    59      85  129  8e+03   113  157  5510        0
Condominium            19  55.2    89.5  150    995   139  149   182        0
House                  20    45      65  100    890  96.7  107   216        0
Loft                   35    75      99  144  9e+03   236  907    98        0
Other                  20    54      89  144    999   144  184   279        0
favstats(price~minimum_nights, data=munich_listings_region)
minimum_nights  min  Q1  median   Q3       max  mean    sd     n  missing
             1   15  53      80  125     8e+03   114   197  2351        0
             2   11  60      85  129     9e+03   116   216  2649        0
             3   15  60      89  144  1.45e+03   126   129   967        0
             4   23  60      90  147       800   116  85.8   318        0

Correlation Matrix

From the summary and favstats investigations, I have decided to conduct further exploratory data analysis through ggplot2. I will first build a correlation matrix to spot relationships between particular variables.

munich_listing_is_numeric<-munich_listings_region %>%
  select(where(is.numeric)) %>% #keeping only the numerical variables from the original dataframe in order to build the Correlation Matrix
  na.omit()

corMatrix <- as.data.frame(cor(munich_listing_is_numeric))
corMatrix$var1 <- rownames(corMatrix)
corMatrix2 <- corMatrix %>%
  gather(key = var2, value = r, -var1) #reshaping all correlation columns into long format, whatever their number
ggplot(corMatrix2,aes(x = var1, y = var2, fill = r)) +
  geom_tile() +
  geom_text(aes(label = round(r, 2)), size = 2) +
  scale_fill_gradient2(low = "#ff585d", #adding colour to matrix
                       high = "#00bf6f", 
                       mid = "white") +
  labs(title = "Correlation Matrix",y="",x="") +
theme_bw()+
  theme(panel.grid.major.y = element_line(color = "gray60", size = 0.1),
        strip.text= element_text(family="Montserrat", face = "plain"),
        panel.background = element_rect(fill = "white", colour = "white"),
        axis.line = element_line(size = 1, colour = "grey80"),
        axis.ticks = element_line(size = 3,colour = "grey80"),
        axis.ticks.length = unit(.20, "cm"),
        plot.title = element_text(color = "black",size=15,face="bold", family= "Montserrat"),
        axis.text.y=element_text(family="Montserrat", size=5),
        axis.title.y=element_blank(),
        axis.title.x=element_blank(),
        axis.text.x=element_text(family="Montserrat", angle = 90, hjust = 1,size=5),
        legend.text=element_text(family="Montserrat", size=5),
        legend.title=element_text(family="Montserrat", size=7, face="bold"),
        legend.position="bottom")
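The reshaping step behind this plot (numeric columns, then a correlation matrix, then one row per variable pair) can be sketched in base R alone. The example below uses the built-in iris data purely as a stand-in for the listings data; the object names `numeric_cols` and `cor_long` are hypothetical.

```r
# Keep only the numeric columns (iris also has a factor column, Species)
numeric_cols <- iris[sapply(iris, is.numeric)]

# Square correlation matrix, one row and one column per numeric variable
cor_matrix <- cor(numeric_cols)

# Long format: one row per (var1, var2) pair, ready for geom_tile()
cor_long <- as.data.frame(as.table(cor_matrix), stringsAsFactors = FALSE)
names(cor_long) <- c("var1", "var2", "r")

nrow(cor_long)  # 16 pairs for 4 numeric variables
head(cor_long, 3)
```

Going through as.table() avoids having to state the number of columns, so the same code keeps working if variables are added or removed.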

Further analysis for collinear variables

#munich_listing_is_numeric[,18:24]%>% #tried to spot the correlation between the review-related variables using ggpairs plot
#  ggpairs()

munich_listing_is_numeric%>% #used the ggpairs plot to further analyse the bottom left part of the correlation matrix
  select(accommodates,bathrooms,
         bedrooms,
         beds,
         cleaning_fee,
         extra_people,
         guests_included,
         total_price_4_days,
         security_deposit)%>%
  ggpairs()+
theme(panel.grid.major.y = element_line(color = "gray60", size = 0.2),
        strip.text= element_text(size=5, family="Montserrat", face = "bold"),
        panel.background = element_rect(fill = "white", colour = "white"),
        axis.line = element_line(size = 2, colour = "grey80"),
        axis.ticks = element_line(size = 3,colour = "grey80"),
        axis.ticks.length = unit(.20, "cm"),
        plot.title = element_text(color = "grey40",size=15,face="bold", family= "Montserrat"),
        plot.caption = element_text(color = "grey40", face="italic",size= 7,family= "Montserrat",hjust=0),
        axis.title.y = element_text(size = 4, angle = 90, family="Montserrat", face = "bold"),
        axis.text.y=element_text(family="Montserrat", size=4),
        axis.title.x = element_text(size = 4, family="Montserrat", face = "bold"),
        axis.text.x=element_text(family="Montserrat", size=4),
        legend.text=element_text(family="Montserrat", size=4),
        legend.title=element_text(family="Montserrat", size=4, face="bold"))

Key findings

The correlation matrix above displays two key ‘green zones’ where there are moderate to strong correlations between variables. In the upper right corner, the plot illustrates the positive correlations between the various review score components, indicating that when an Airbnb scores well on one criterion it will tend to also have a higher rating on the other criteria. The strongest correlation here is between the total review score and the review score for accuracy, at a level of 0.74. In the lower left corner we can see positive correlations between variables ranging from weak to strong. As one would expect, the number of people an Airbnb in Munich accommodates has a strong positive correlation with the number of beds and the number of bedrooms. There is a moderate positive correlation between the number of people accommodated and the cleaning fee. Lastly, there is a moderate positive correlation between the cleaning fee and the security deposit, likely attributable to the fact that these properties are of a higher standard, as is mentioned on Airbnb’s website (deposits are usually based on a home’s features).

Looking at the dependent variable of interest for this project, the total price for a 4-night stay for two people, I only find weak positive correlations when disregarding the obvious connection to daily price. With a level of 0.29 there is a weak to moderate positive correlation between the total price and the number of people an Airbnb can accommodate; this is further supported by weak correlations (0.22) between total price and the number of bedrooms and beds. I will now continue to investigate relationships between my variables, in particular categorical variables not included in the above matrix.

Informative visualisations

ggplot(listings,aes(x=number_of_reviews))+
  geom_histogram(binwidth = 4)+
    xlim(0,250)+
    ylim(0,1000)+
     labs(title="Most Airbnb listings have up to 20 reviews",
          subtitle="Histogram examining distribution of reviews",
          x="Number of Reviews", 
          y="Quantity")+
  theme_bw()+
  theme(panel.background = element_rect(fill = "white", colour = "white"),
        axis.line = element_line(size = 1, colour = "grey80"),
        axis.ticks = element_line(size = 3,colour = "grey80"),
        axis.ticks.length = unit(.20, "cm"),
        plot.title = element_text(color = "black",size=12,face="bold", family= "Montserrat"),
        plot.subtitle = element_text(color = "black",size=10,face="plain", family= "Montserrat"),
        axis.title.y = element_text(size = 8, angle = 90, family="Montserrat", face = "plain"),
        axis.text.y=element_text(family="Montserrat", size=7),
        axis.title.x = element_text(size = 8, family="Montserrat", face = "plain"),
        axis.text.x=element_text(family="Montserrat", size=7),
        legend.text=element_text(family="Montserrat", size=7),
        legend.title=element_text(family="Montserrat", size=8, face="bold"))

ggplot(munich_listing_is_numeric,
       aes(x=extra_people,y=total_price_4_days))+
  geom_point()+
  geom_smooth(method="lm")+
    ylim(0,3000)+
    xlim(0,100)+
      labs(title="The Higher the Extra People Charge, the Higher the Overall Price", 
           subtitle="Correlation between price per extra person and total price per 4 nights stay",
           x="Price per extra person", 
           y="Total price for 4 nights")+
  theme_bw()+
    theme(panel.grid.major.y = element_line(color = "gray60", size = 0.1),
        panel.background = element_rect(fill = "white", colour = "white"),
        axis.line = element_line(size = 1, colour = "grey80"),
        axis.ticks = element_line(size = 3,colour = "grey80"),
        axis.ticks.length = unit(.20, "cm"),
        plot.title = element_text(color = "black",size=12,face="bold", family= "Montserrat"),
        plot.subtitle = element_text(color = "black",size=10,face="plain", family= "Montserrat"),
        axis.title.y = element_text(size = 8, angle = 90, family="Montserrat", face = "plain"),
        axis.text.y=element_text(family="Montserrat", size=7),
        axis.title.x = element_text(size = 8, family="Montserrat", face = "plain"),
        axis.text.x=element_text(family="Montserrat", size=7),
        legend.text=element_text(family="Montserrat", size=7),
        legend.title=element_text(family="Montserrat", size=8, face="bold"))

ggplot(munich_listings_region, aes(x=total_price_4_days))+
  geom_density()+ #geom_density() has no bins argument
  xlim(0,4000) +
  labs(title="The density plot of total price for 4 nights is heavily right-skewed", 
       subtitle="Distribution of total price for 4 nights",
       x="Total price for 4 nights",  
       y="Density")+
  theme_bw()+
    theme(panel.grid.major.y = element_line(color = "gray60", size = 0.1),
        panel.background = element_rect(fill = "white", colour = "white"),
        axis.line = element_line(size = 1, colour = "grey80"),
        axis.ticks = element_line(size = 3,colour = "grey80"),
        axis.ticks.length = unit(.20, "cm"),
        plot.title = element_text(color = "black",size=10,face="bold", family= "Montserrat"),
        plot.subtitle = element_text(color = "black",size=10,face="plain", family= "Montserrat"),
        axis.title.y = element_text(size = 8, angle = 90, family="Montserrat", face = "plain"),
        axis.text.y=element_text(family="Montserrat", size=7),
        axis.title.x = element_text(size = 8, family="Montserrat", face = "plain"),
        axis.text.x=element_text(family="Montserrat", size=7),
        legend.text=element_text(family="Montserrat", size=7),
        legend.title=element_text(family="Montserrat", size=8, face="bold"))

The distribution of the price for a 4-night stay is heavily right-skewed. I will now examine the distribution of the logarithm of that price.

ggplot(munich_listings_total_price, aes(x=total_price_4_days))+
  geom_density()+ #geom_density() has no bins argument
  scale_x_log10()+ #note: the xlim() call below sets a new x scale, which replaces this log scale
  xlim(0,2500) +
  labs(title="Logarithmic Total Price Shows Nature of Price Clusters", 
              subtitle="Distribution of the log price for 4 nights stay",
       x="Total price for 4 nights",  
       y="Density")+
  theme_bw()+
    theme(panel.grid.major.y = element_line(color = "gray60", size = 0.1),
        panel.background = element_rect(fill = "white", colour = "white"),
        axis.line = element_line(size = 1, colour = "grey80"),
        axis.ticks = element_line(size = 3,colour = "grey80"),
        axis.ticks.length = unit(.20, "cm"),
        plot.title = element_text(color = "black",size=12,face="bold", family= "Montserrat"),
        axis.title.y = element_text(size = 8, angle = 90, family="Montserrat", face = "plain"),
        axis.text.y=element_text(family="Montserrat", size=7),
        axis.title.x = element_text(size = 8, family="Montserrat", face = "plain"),
        axis.text.x=element_text(family="Montserrat", size=7),
        legend.text=element_text(family="Montserrat", size=7),
        legend.title=element_text(family="Montserrat", size=8, face="bold"))

The log_price is heavily right-skewed as well.
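Skewness can also be checked numerically rather than by eye. The sketch below is a minimal base R illustration on simulated, log-normally distributed prices, not on the Airbnb data; the `skewness` helper (the moment-based sample skewness) and the simulation parameters are assumptions chosen for illustration.

```r
# Moment-based sample skewness: third central moment over variance^(3/2)
skewness <- function(x) {
  x <- x[!is.na(x)]
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

# Simulated right-skewed "prices" (log-normal), for illustration only
set.seed(42)
prices <- rlnorm(5000, meanlog = log(300), sdlog = 0.6)

skew_raw <- skewness(prices)       # clearly positive: long right tail
skew_log <- skewness(log(prices))  # close to 0: roughly symmetric

c(raw = skew_raw, log = skew_log)
```

Applying the same helper to total_price_4_days and to its logarithm would give a direct, plot-independent comparison of the two distributions.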

ggplot(munich_listings_total_price, aes(x=total_price_4_days))+
  geom_histogram(bins=100)+
  xlim(0,2500)+
  labs(title="Most Airbnbs cost around €300 for 4 Nights", 
       subtitle="Histogram of total price per 4 nights in Munich",
       x="Total price for 4 nights", 
       y= "Quantity")+
  theme_bw()+
    theme(panel.grid.major.y = element_line(color = "gray60", size = 0.1),
        panel.background = element_rect(fill = "white", colour = "white"),
        axis.line = element_line(size = 1, colour = "grey80"),
        axis.ticks = element_line(size = 3,colour = "grey80"),
        axis.ticks.length = unit(.20, "cm"),
        plot.title = element_text(color = "black",size=12,face="bold", family= "Montserrat"),
        plot.subtitle = element_text(color = "black",size=10,face="plain", family= "Montserrat"),
        axis.title.y = element_text(size = 8, angle = 90, family="Montserrat", face = "plain"),
        axis.text.y=element_text(family="Montserrat", size=7),
        axis.title.x = element_text(size = 8, family="Montserrat", face = "plain"),
        axis.text.x=element_text(family="Montserrat", size=7),
        legend.text=element_text(family="Montserrat", size=7),
        legend.title=element_text(family="Montserrat", size=8, face="bold"))

#Calculated mean price for 4 nights per room type
munich_listings_region %>%
  group_by(room_type) %>%
  summarize(mean_price_roomtype = mean(total_price_4_days)) %>%
  arrange(desc(mean_price_roomtype)) %>%
  ggplot(aes(y=reorder(room_type, mean_price_roomtype), x = mean_price_roomtype)) + 
    geom_col() +
      labs(title="Hotel Rooms Are the Most Expensive Airbnbs in Munich",
           subtitle="Mean price for 4 nights by room type",
           x="Average price for 4 nights per room",  
           y="Room type")+
  theme_bw()+
    theme(panel.grid.major.y = element_line(color = "gray60", size = 0.1),
        panel.background = element_rect(fill = "white", colour = "white"),
        axis.line = element_line(size = 1, colour = "grey80"),
        axis.ticks = element_line(size = 3,colour = "grey80"),
        axis.ticks.length = unit(.20, "cm"),
        plot.title = element_text(color = "black",size=12,face="bold", family= "Montserrat"),
        plot.subtitle = element_text(color = "black",size=10,face="plain", family= "Montserrat"),
        axis.title.y = element_text(size = 8, angle = 90, family="Montserrat", face = "plain"),
        axis.text.y=element_text(family="Montserrat", size=7),
        axis.title.x = element_text(size = 8, family="Montserrat", face = "plain"),
        axis.text.x=element_text(family="Montserrat", size=7),
        legend.text=element_text(family="Montserrat", size=7),
        legend.title=element_text(family="Montserrat", size=8, face="bold"))

#Calculated mean price for 4 nights per neighbourhood
munich_listings_region %>%
  group_by(neighbourhood_cleansed) %>%
  summarize(mean_price_neighbourhood = mean(total_price_4_days)) %>%
  arrange(desc(mean_price_neighbourhood)) %>%
  ggplot(aes(y=reorder(neighbourhood_cleansed, mean_price_neighbourhood), x=mean_price_neighbourhood)) +
    geom_col()+
     labs(title="Average price for 4 nights by neighbourhood", 
          x="Average price for 4 nights",  
          y="Neighbourhood")+
  theme_bw()+
  coord_flip()+
    theme(panel.grid.major.y = element_line(color = "gray60", size = 0.1),
        panel.background = element_rect(fill = "white", colour = "white"),
        axis.line = element_line(size = 1, colour = "grey80"),
        axis.ticks = element_line(size = 3,colour = "grey80"),
        axis.ticks.length = unit(.20, "cm"),
        plot.title = element_text(color = "black",size=12,face="bold", family= "Montserrat"),
        plot.subtitle = element_text(color = "black",size=10,face="plain", family= "Montserrat"),
        axis.title.y = element_text(size = 8, angle = 90, family="Montserrat", face = "plain"),
        axis.text.y=element_text(family="Montserrat", size=7),
        axis.title.x = element_text(size = 8, family="Montserrat",face = "plain"),
        axis.text.x=element_text(family="Montserrat", size=7,angle = 70, hjust = 1),
        legend.text=element_text(family="Montserrat", size=7),
        legend.title=element_text(family="Montserrat", size=8, face="bold"))

#Calculated mean price for 4 nights per property type
munich_listings_region %>%
  group_by(prop_type_simplified) %>%
  summarize(mean_price_property = mean(total_price_4_days)) %>%
  arrange(desc(mean_price_property)) %>%
  ggplot(aes(y=reorder(prop_type_simplified, mean_price_property), x = mean_price_property)) + 
    geom_col() +
      labs(title="Lofts Come at a Premium in Munich, Houses Present\n a Good Value Proposition", 
           x="Average price for 4 nights",  
           y="Property type")+
  theme_bw()+
    theme(panel.grid.major.y = element_line(color = "gray60", size = 0.1),
        panel.background = element_rect(fill = "white", colour = "white"),
        axis.line = element_line(size = 1, colour = "grey80"),
        axis.ticks = element_line(size = 3,colour = "grey80"),
        axis.ticks.length = unit(.20, "cm"),
        plot.title = element_text(color = "black",size=12,face="bold", family= "Montserrat"),
        axis.title.y = element_text(size = 8, angle = 90, family="Montserrat", face = "plain"),
        axis.text.y=element_text(family="Montserrat", size=7),
        axis.title.x = element_text(size = 8, family="Montserrat", face = "plain"),
        axis.text.x=element_text(family="Montserrat", size=7),
        legend.text=element_text(family="Montserrat", size=7),
        legend.title=element_text(family="Montserrat", size=8, face="bold"))

#Count listings per property type
munich_listings_region %>%
  # dplyr::count() returns one row per property type with its number of listings
  dplyr::count(prop_type_simplified, name = "count_property") %>%
  arrange(count_property) %>%
  ggplot(aes(x=reorder(prop_type_simplified, desc(count_property)), y = count_property)) + 
    geom_col() +
      labs(title="Apartments Dominate Airbnb's Listings", 
           x="Property type",  
           y="Quantity")+
  theme_bw()+
    theme(panel.grid.major.y = element_line(color = "gray60", size = 0.1),
        panel.background = element_rect(fill = "white", colour = "white"),
        axis.line = element_line(size = 1, colour = "grey80"),
        axis.ticks = element_line(size = 3,colour = "grey80"),
        axis.ticks.length = unit(.20, "cm"),
        plot.title = element_text(color = "black",size=12,face="bold", family= "Montserrat"),
        axis.title.y = element_text(size = 8, angle = 90, family="Montserrat", face = "plain"),
        axis.text.y=element_text(family="Montserrat", size=7),
        axis.title.x = element_text(size = 8, family="Montserrat", face = "plain"),
        axis.text.x=element_text(family="Montserrat", size=7),
        legend.text=element_text(family="Montserrat", size=7),
        legend.title=element_text(family="Montserrat", size=8, face="bold"))

#Price distribution for 4 nights by cancellation policy
munich_listings_region %>%
  ggplot(aes(x=reorder(cancellation_policy,total_price_4_days ), y = total_price_4_days)) + 
    geom_boxplot() +
      labs(title="Distribution of prices for 4 nights by cancellation policy", 
           y="Price (log scale)",  
           x="Cancellation policy")+
  scale_y_log10(limits=c(100,10000))+
  theme_bw()+
    theme(panel.grid.major.y = element_line(color = "gray60", size = 0.1),
        panel.background = element_rect(fill = "white", colour = "white"),
        axis.line = element_line(size = 1, colour = "grey80"),
        axis.ticks = element_line(size = 3,colour = "grey80"),
        axis.ticks.length = unit(.20, "cm"),
        plot.title = element_text(color = "black",size=12,face="bold", family= "Montserrat"),
        axis.title.y = element_text(size = 8, angle = 90, family="Montserrat", face = "plain"),
        axis.text.y=element_text(family="Montserrat", size=7),
        axis.title.x = element_text(size = 8, family="Montserrat", angle = 90,face = "plain"),
        axis.text.x=element_text(family="Montserrat", size=7),
        legend.text=element_text(family="Montserrat", size=7),
        legend.title=element_text(family="Montserrat", size=8, face="bold"))

Mapping

Next, I map the locations of the Airbnb listings in Munich. Each listing is coloured by the zone it belongs to, which gives a better sense of how densely each zone is covered with accommodation. The zones group neighbourhoods by mean rental price, since this grouping turned out to be the most significant in the models later on.

pallette <- colorFactor(c("red", "blue", "green", "yellow","purple"), domain = c("zone_1", "zone_2", "zone_3", "zone_4","zone_5"))

leaflet(data = munich_listings_region) %>% 
  addProviderTiles("OpenStreetMap.Mapnik") %>% 
  addCircleMarkers(lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 2,
                   color = ~pallette(region),
                   fillColor = ~pallette(region), # map the zone to a colour, not the raw zone label
                   group = ~region,
                   clusterId = ~region,
                   fillOpacity = 0.4,
                   popup = ~listing_url, 
                   label = ~paste(prop_type_simplified, "Min nights", "=", minimum_nights))

Regression

Now I will start building my models. I begin with only a few variables and gradually add more, aiming for the best fit and the highest possible adjusted R-squared. For each model I also check for collinearity so that redundant predictors can be removed. For this I use `car::vif(model_x)` to calculate the Variance Inflation Factor (VIF) of the predictors. A general guideline is that a VIF above 5 to 10 is large and indicates that the model may suffer from collinearity. If such a VIF occurs, I remove the variable in question and run the model again without it.
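
To make the VIF less of a black box, here is a minimal sketch that computes it by hand: each predictor is regressed on the remaining predictors, and the VIF is 1 / (1 - R²) of that auxiliary regression. The built-in `mtcars` data and the three predictors are illustrative assumptions, not the Airbnb data.

```r
# Minimal sketch: VIF computed by hand on the built-in mtcars data
# (dataset and predictors are illustrative assumptions, not the Airbnb data)
predictors <- c("wt", "hp", "disp")

manual_vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  # Auxiliary regression: predictor p explained by the remaining predictors
  aux <- lm(reformulate(others, response = p), data = mtcars)
  # VIF_p = 1 / (1 - R^2_p)
  1 / (1 - summary(aux)$r.squared)
})

round(manual_vif, 2)
```

For purely numeric predictors this should agree with `car::vif()`. For factors with several levels, `car::vif()` reports a generalised VIF instead, which is why the outputs in this section show the columns GVIF, Df and GVIF^(1/(2*Df)).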

For my models I use the log of total_price_4_days as the response, since its distribution is more bell-shaped than that of the raw price and is therefore better described by a linear model.
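
The effect of the log transformation on a right-skewed variable can be illustrated with simulated data. The sketch below draws log-normal "prices" (the parameters are arbitrary assumptions, not estimates from the Munich listings) and compares the sample skewness before and after taking logs.

```r
set.seed(42)
# Simulated right-skewed prices; meanlog and sdlog are arbitrary assumptions
price <- rlnorm(5000, meanlog = 6, sdlog = 0.6)

# Sample skewness: third standardised moment
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(price)       # clearly positive: long right tail
skew_log <- skewness(log(price))  # close to zero: roughly symmetric
c(raw = skew_raw, log = skew_log)
```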

Model 1

I will fit our first regression model called model1 with the following explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating.

#Regression on log(total_price_4_days), whose distribution is closer to normal than the raw price.
model1 <-lm(log(total_price_4_days)~prop_type_simplified+
              number_of_reviews+
              review_scores_rating, 
            data=munich_listings_region)
msummary(model1)
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      6.088376   0.082553   73.75  < 2e-16 ***
## prop_type_simplifiedCondominium  0.133395   0.044026    3.03   0.0025 ** 
## prop_type_simplifiedHouse       -0.120477   0.040560   -2.97   0.0030 ** 
## prop_type_simplifiedLoft         0.303971   0.059575    5.10  3.5e-07 ***
## prop_type_simplifiedOther        0.105398   0.035956    2.93   0.0034 ** 
## number_of_reviews               -0.001296   0.000154   -8.42  < 2e-16 ***
## review_scores_rating            -0.000460   0.000867   -0.53   0.5958    
## 
## Residual standard error: 0.584 on 6278 degrees of freedom
## Multiple R-squared:  0.0185, Adjusted R-squared:  0.0176 
## F-statistic: 19.7 on 6 and 6278 DF,  p-value: <2e-16
car::vif(model1)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.01  4               1
## number_of_reviews    1.01  1               1
## review_scores_rating 1.01  1               1
#review_scores_rating is insignificant (p = 0.60); 
#all prop_type_simplified levels are significant at the 1% level. 
#Dropping review_scores_rating in later models.

After running model1, we can see that “review_scores_rating” is not significant for the linear regression model, as its p-value (0.60) is well above 0.05. I will therefore drop it in later models. All levels of the dummy-coded variable “prop_type_simplified” (Condominium, House, Loft and Other, each compared with the baseline Apartment) are significant at the 1% level, so I will keep it. The adjusted R-squared of this model is only 1.76%, so I will add more variables to the model in order to improve the fit.

Below is an example of how to interpret the coefficients of a linear model with a log-transformed response.

The coefficient of review_scores_rating is -0.000460. Because the response is log(total_price_4_days), a one-point increase in review_scores_rating is associated with a change of 100 × (exp(-0.000460) - 1) ≈ -0.046% in total_price_4_days, holding the other variables constant. As noted above, this effect is not statistically significant.

The coefficients of prop_type_simplified are interpreted relative to the baseline category, Apartment, again holding the other variables constant:

- (property type: Apartment): all dummies are 0, so the prediction rests on the intercept alone. exp(6.088) ≈ 441, i.e. about €441 for a listing with zero reviews and a rating of zero; this lies outside the observed data and should not be read literally.
- (property type: Condominium): prop_type_simplifiedCondominium=1; total_price_4_days is about 14% higher (exp(0.133) - 1).
- (property type: House): prop_type_simplifiedHouse=1; total_price_4_days is about 11% lower (exp(-0.120) - 1).
- (property type: Loft): prop_type_simplifiedLoft=1; total_price_4_days is about 36% higher (exp(0.304) - 1).
- (property type: Other): prop_type_simplifiedOther=1; total_price_4_days is about 11% higher (exp(0.105) - 1).
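
As a small sketch of this interpretation, the property-type coefficients from the model1 output can be converted into percentage differences with 100 × (exp(b) - 1). The vector name `coefs` is only illustrative; the numbers are copied from the output above.

```r
# Coefficients copied from the model1 output (natural-log response)
coefs <- c(Condominium = 0.133395, House = -0.120477,
           Loft = 0.303971, Other = 0.105398)

# Percentage difference in total_price_4_days relative to the baseline (Apartment)
pct_change <- 100 * (exp(coefs) - 1)
round(pct_change, 1)
```

For small coefficients the shortcut "coefficient × 100" is a reasonable approximation, but for larger ones such as Loft (0.304) it noticeably understates the effect, so the exponentiated form is safer.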

Model 2

I want to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model. I will fit a regression model that includes all of the explanatory variables in model1 plus room_type.

model2 <-lm(log(total_price_4_days)~prop_type_simplified+
              number_of_reviews+
              review_scores_rating+
              room_type, 
            data=munich_listings_region)
msummary(model2)
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      6.149972   0.078743   78.10  < 2e-16 ***
## prop_type_simplifiedCondominium  0.102904   0.041903    2.46  0.01409 *  
## prop_type_simplifiedHouse        0.034654   0.039001    0.89  0.37429    
## prop_type_simplifiedLoft         0.215260   0.056691    3.80  0.00015 ***
## prop_type_simplifiedOther        0.118736   0.036437    3.26  0.00113 ** 
## number_of_reviews               -0.001216   0.000146   -8.32  < 2e-16 ***
## review_scores_rating             0.000413   0.000827    0.50  0.61778    
## room_typeHotel room              0.296750   0.101852    2.91  0.00359 ** 
## room_typePrivate room           -0.376066   0.014681  -25.62  < 2e-16 ***
## room_typeShared room            -0.252149   0.063355   -3.98    7e-05 ***
## 
## Residual standard error: 0.554 on 6275 degrees of freedom
## Multiple R-squared:  0.115,  Adjusted R-squared:  0.114 
## F-statistic: 90.8 on 9 and 6275 DF,  p-value: <2e-16
car::vif(model2)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.19  4            1.02
## number_of_reviews    1.01  1            1.00
## review_scores_rating 1.01  1            1.01
## room_type            1.19  3            1.03

Adding room_type raised the adjusted R-squared to 0.114. The p-value for each room type is below 0.05, so room_type is an important predictor and I will keep it in the model.
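
Looking at the individual dummy p-values is one way to judge a categorical predictor. An alternative is a partial F-test that compares the nested models and assesses the variable as a whole. The sketch below uses simulated stand-in data; the variable names mirror the analysis, but all values and effect sizes are made up for illustration.

```r
set.seed(1)
n <- 500
# Simulated stand-in data (values are assumptions, not the Munich listings)
sim <- data.frame(
  room_type = factor(sample(c("Entire home/apt", "Private room", "Shared room"),
                            n, replace = TRUE)),
  number_of_reviews = rpois(n, 20)
)
effect <- c("Entire home/apt" = 0, "Private room" = -0.4, "Shared room" = -0.3)
sim$log_price <- 6 - 0.001 * sim$number_of_reviews +
  unname(effect[as.character(sim$room_type)]) + rnorm(n, sd = 0.5)

small <- lm(log_price ~ number_of_reviews, data = sim)
large <- lm(log_price ~ number_of_reviews + room_type, data = sim)

# Partial F-test: does adding room_type improve the fit as a whole?
f_test <- anova(small, large)
f_test
```

With the real data, the equivalent call would be `anova(model1, model2)`, since model1 is nested in model2.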

Model 3

Are the number of bathrooms, bedrooms, beds, or the size of the house (accommodates) significant predictors of total_price_4_days?

model3 <-lm(log(total_price_4_days)~prop_type_simplified+
              number_of_reviews+
              room_type+
              bathrooms+
              bedrooms+
              beds+
              accommodates, 
            data=munich_listings_region)
msummary(model3)
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      5.572998   0.026251  212.30  < 2e-16 ***
## prop_type_simplifiedCondominium  0.071778   0.038345    1.87  0.06126 .  
## prop_type_simplifiedHouse       -0.111393   0.036073   -3.09  0.00202 ** 
## prop_type_simplifiedLoft         0.115665   0.052001    2.22  0.02616 *  
## prop_type_simplifiedOther        0.001071   0.033556    0.03  0.97455    
## number_of_reviews               -0.001522   0.000134  -11.36  < 2e-16 ***
## room_typeHotel room              0.539684   0.093272    5.79  7.5e-09 ***
## room_typePrivate room           -0.248913   0.014236  -17.48  < 2e-16 ***
## room_typeShared room            -0.296707   0.057856   -5.13  3.0e-07 ***
## bathrooms                        0.124320   0.023376    5.32  1.1e-07 ***
## bedrooms                         0.048444   0.012833    3.77  0.00016 ***
## beds                            -0.025160   0.008086   -3.11  0.00187 ** 
## accommodates                     0.148472   0.006718   22.10  < 2e-16 ***
## 
## Residual standard error: 0.506 on 6272 degrees of freedom
## Multiple R-squared:  0.263,  Adjusted R-squared:  0.261 
## F-statistic:  186 on 12 and 6272 DF,  p-value: <2e-16
car::vif(model3)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.26  4            1.03
## number_of_reviews    1.02  1            1.01
## room_type            1.35  3            1.05
## bathrooms            1.21  1            1.10
## bedrooms             1.93  1            1.39
## beds                 2.45  1            1.57
## accommodates         2.50  1            1.58

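Bathrooms, bedrooms, beds and accommodates are all significant, with absolute t-values above 2. However, “beds” has a counterintuitive negative coefficient and, together with “accommodates”, the highest VIF in the model (about 2.5), which suggests that it largely duplicates information already carried by the other size variables. In the following models I will keep “bedrooms”, “bathrooms” and “accommodates” and drop “beds”.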

Model 4

Do superhosts (host_is_superhost) command a pricing premium, after controlling for other variables?

model4 <-lm(log(total_price_4_days)~prop_type_simplified+
              number_of_reviews+
              room_type+
              bathrooms+
              bedrooms+
              accommodates+
              host_is_superhost, 
            data=munich_listings_region)
msummary(model4)
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      5.58557    0.02604  214.49  < 2e-16 ***
## prop_type_simplifiedCondominium  0.06643    0.03833    1.73   0.0831 .  
## prop_type_simplifiedHouse       -0.11569    0.03607   -3.21   0.0013 ** 
## prop_type_simplifiedLoft         0.11464    0.05205    2.20   0.0277 *  
## prop_type_simplifiedOther       -0.00874    0.03343   -0.26   0.7938    
## number_of_reviews               -0.00150    0.00014  -10.71  < 2e-16 ***
## room_typeHotel room              0.54283    0.09335    5.81  6.4e-09 ***
## room_typePrivate room           -0.24687    0.01427  -17.30  < 2e-16 ***
## room_typeShared room            -0.30343    0.05787   -5.24  1.6e-07 ***
## bathrooms                        0.11950    0.02334    5.12  3.1e-07 ***
## bedrooms                         0.03550    0.01215    2.92   0.0035 ** 
## accommodates                     0.13772    0.00576   23.92  < 2e-16 ***
## host_is_superhostTRUE           -0.01540    0.01827   -0.84   0.3995    
## 
## Residual standard error: 0.507 on 6272 degrees of freedom
## Multiple R-squared:  0.262,  Adjusted R-squared:  0.26 
## F-statistic:  185 on 12 and 6272 DF,  p-value: <2e-16
car::vif(model4)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.25  4            1.03
## number_of_reviews    1.10  1            1.05
## room_type            1.35  3            1.05
## bathrooms            1.21  1            1.10
## bedrooms             1.73  1            1.32
## accommodates         1.83  1            1.35
## host_is_superhost    1.10  1            1.05

Superhosts do not command a pricing premium in Munich (p = 0.40), so I will drop this variable from the following models. The VIFs for bedrooms and accommodates are slightly higher than for the other predictors, but still far too low to be a concern.

Model 5

Most owners advertise the exact location of their listing (is_location_exact == TRUE), while a non-trivial proportion do not.
After controlling for other variables, is a listing’s exact location a significant predictor of total_price_4_days?

model5 <-lm(log(total_price_4_days)~prop_type_simplified+
              number_of_reviews+
              room_type+bathrooms+
              bedrooms+accommodates+
              is_location_exact, 
            data=munich_listings_region)
msummary(model5)
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      5.581474   0.028604  195.13  < 2e-16 ***
## prop_type_simplifiedCondominium  0.066161   0.038340    1.73   0.0845 .  
## prop_type_simplifiedHouse       -0.116372   0.036065   -3.23   0.0013 ** 
## prop_type_simplifiedLoft         0.113322   0.052045    2.18   0.0295 *  
## prop_type_simplifiedOther       -0.008855   0.033432   -0.26   0.7911    
## number_of_reviews               -0.001531   0.000134  -11.41  < 2e-16 ***
## room_typeHotel room              0.541088   0.093351    5.80  7.1e-09 ***
## room_typePrivate room           -0.247594   0.014245  -17.38  < 2e-16 ***
## room_typeShared room            -0.302153   0.057904   -5.22  1.9e-07 ***
## bathrooms                        0.119077   0.023355    5.10  3.5e-07 ***
## bedrooms                         0.035545   0.012156    2.92   0.0035 ** 
## accommodates                     0.137657   0.005757   23.91  < 2e-16 ***
## is_location_exactTRUE            0.004097   0.016237    0.25   0.8008    
## 
## Residual standard error: 0.507 on 6272 degrees of freedom
## Multiple R-squared:  0.262,  Adjusted R-squared:  0.26 
## F-statistic:  185 on 12 and 6272 DF,  p-value: <2e-16
car::vif(model5)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.24  4            1.03
## number_of_reviews    1.02  1            1.01
## room_type            1.35  3            1.05
## bathrooms            1.21  1            1.10
## bedrooms             1.73  1            1.32
## accommodates         1.83  1            1.35
## is_location_exact    1.01  1            1.00

The variable “is_location_exact” does not have a significant influence on the price of an Airbnb in Munich (p-value bigger than 0.05). Therefore, I will drop it.

Model 6

Now I will use a variable that I created, "region", which clusters all neighbourhoods into 5 zones, to see how location affects the price of an Airbnb in the model.

model6 <-lm(log(total_price_4_days)~prop_type_simplified+
              number_of_reviews+
              room_type+
              bathrooms+
              bedrooms+
              accommodates+
              region, 
            data=munich_listings_region)
msummary(model6)
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      5.714978   0.027012  211.57  < 2e-16 ***
## prop_type_simplifiedCondominium  0.060742   0.037425    1.62   0.1046    
## prop_type_simplifiedHouse       -0.054203   0.035507   -1.53   0.1269    
## prop_type_simplifiedLoft         0.113490   0.050802    2.23   0.0255 *  
## prop_type_simplifiedOther        0.001530   0.032756    0.05   0.9628    
## number_of_reviews               -0.001616   0.000131  -12.34  < 2e-16 ***
## room_typeHotel room              0.497268   0.091203    5.45  5.2e-08 ***
## room_typePrivate room           -0.234415   0.013930  -16.83  < 2e-16 ***
## room_typeShared room            -0.284806   0.056523   -5.04  4.8e-07 ***
## bathrooms                        0.118175   0.022790    5.19  2.2e-07 ***
## bedrooms                         0.038658   0.011872    3.26   0.0011 ** 
## accommodates                     0.137573   0.005620   24.48  < 2e-16 ***
## regionzone_2                    -0.139446   0.016825   -8.29  < 2e-16 ***
## regionzone_3                    -0.195816   0.018061  -10.84  < 2e-16 ***
## regionzone_4                    -0.198716   0.022296   -8.91  < 2e-16 ***
## regionzone_5                    -0.334734   0.020304  -16.49  < 2e-16 ***
## 
## Residual standard error: 0.495 on 6269 degrees of freedom
## Multiple R-squared:  0.297,  Adjusted R-squared:  0.295 
## F-statistic:  176 on 15 and 6269 DF,  p-value: <2e-16
car::vif(model6)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.28  4            1.03
## number_of_reviews    1.02  1            1.01
## room_type            1.36  3            1.05
## bathrooms            1.21  1            1.10
## bedrooms             1.73  1            1.32
## accommodates         1.83  1            1.35
## region               1.04  4            1.00

The region of Munich has a significant influence on the price. The absolute t-value of every zone is far above 2 and the adjusted R-squared rose from 0.260 to 0.295, which suggests that model6 describes the data better than the previous models.

Model 7

What is the effect of cancellation_policy on total_price_4_days, after I control for other variables?

model7 <-lm(log(total_price_4_days)~prop_type_simplified+
              number_of_reviews+
              room_type+
              bathrooms+
              bedrooms+
              accommodates+
              region+
              cancellation_policy, 
            data=munich_listings_region)
msummary(model7)
##                                                Estimate Std. Error t value
## (Intercept)                                     5.68179    0.02810  202.18
## prop_type_simplifiedCondominium                 0.04985    0.03715    1.34
## prop_type_simplifiedHouse                      -0.06235    0.03524   -1.77
## prop_type_simplifiedLoft                        0.11069    0.05041    2.20
## prop_type_simplifiedOther                       0.01612    0.03253    0.50
## number_of_reviews                              -0.00168    0.00013  -12.87
## room_typeHotel room                             0.50907    0.09051    5.62
## room_typePrivate room                          -0.22810    0.01388  -16.43
## room_typeShared room                           -0.29699    0.05610   -5.29
## bathrooms                                       0.11434    0.02262    5.06
## bedrooms                                        0.04181    0.01179    3.55
## accommodates                                    0.13068    0.00562   23.26
## regionzone_2                                   -0.13437    0.01670   -8.05
## regionzone_3                                   -0.18655    0.01795  -10.39
## regionzone_4                                   -0.19215    0.02213   -8.68
## regionzone_5                                   -0.32519    0.02018  -16.12
## cancellation_policymoderate                     0.00772    0.01509    0.51
## cancellation_policystrict_14_with_grace_period  0.14262    0.01565    9.11
## cancellation_policysuper_strict_30              0.13203    0.49206    0.27
## cancellation_policysuper_strict_60              0.02713    0.49100    0.06
##                                                Pr(>|t|)    
## (Intercept)                                     < 2e-16 ***
## prop_type_simplifiedCondominium                 0.17960    
## prop_type_simplifiedHouse                       0.07691 .  
## prop_type_simplifiedLoft                        0.02813 *  
## prop_type_simplifiedOther                       0.62028    
## number_of_reviews                               < 2e-16 ***
## room_typeHotel room                             1.9e-08 ***
## room_typePrivate room                           < 2e-16 ***
## room_typeShared room                            1.2e-07 ***
## bathrooms                                       4.4e-07 ***
## bedrooms                                        0.00039 ***
## accommodates                                    < 2e-16 ***
## regionzone_2                                    1.0e-15 ***
## regionzone_3                                    < 2e-16 ***
## regionzone_4                                    < 2e-16 ***
## regionzone_5                                    < 2e-16 ***
## cancellation_policymoderate                     0.60895    
## cancellation_policystrict_14_with_grace_period  < 2e-16 ***
## cancellation_policysuper_strict_30              0.78846    
## cancellation_policysuper_strict_60              0.95594    
## 
## Residual standard error: 0.491 on 6265 degrees of freedom
## Multiple R-squared:  0.308,  Adjusted R-squared:  0.306 
## F-statistic:  147 on 19 and 6265 DF,  p-value: <2e-16
car::vif(model7)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.28  4            1.03
## number_of_reviews    1.03  1            1.01
## room_type            1.37  3            1.05
## bathrooms            1.21  1            1.10
## bedrooms             1.74  1            1.32
## accommodates         1.86  1            1.36
## region               1.05  4            1.01
## cancellation_policy  1.06  4            1.01

The strict_14_with_grace_period cancellation policy has a significant positive effect on the price for 4 nights, while the other policies do not differ significantly from the baseline. I will therefore keep the variable “cancellation_policy” in my model. The adjusted R-squared again went up by about one percentage point, from 0.295 to 0.306. Next, I will try adding more variables that may turn out to be significant for the model.
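
Since the models are compared on adjusted rather than plain R-squared, it may help to see how the adjustment penalises additional terms. The sketch below recomputes the adjusted R-squared of model7 from the rounded figures in its output; the helper name `adj_r2` is hypothetical.

```r
# Adjusted R-squared from R-squared, sample size n and number of slope terms p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Values read off the model7 output: R^2 = 0.308, 19 slope terms,
# 6265 residual degrees of freedom, so n = 6265 + 19 + 1 = 6285
adj_r2(r2 = 0.308, n = 6285, p = 19)
```

With more than 6,000 observations the penalty for an extra term is tiny, which is why R-squared and adjusted R-squared differ only in the third decimal place throughout this section.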

Final Model

Now I will build the final model from the variables that I checked to be relevant and significant, in order to arrive at the best-fitting regression model.
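
A note on formula syntax: in an R model formula, `a * b` does not multiply two columns. It is shorthand for both main effects plus their interaction, `a + b + a:b`. This is why rows such as `number_of_reviews:reviews_per_month` appear in the model output. A minimal sketch with placeholder names `y`, `a` and `b`:

```r
# a * b expands to the main effects plus the interaction term a:b
f_star <- y ~ a * b
f_long <- y ~ a + b + a:b

labels_star <- attr(terms(f_star), "term.labels")
labels_long <- attr(terms(f_long), "term.labels")

labels_star
identical(labels_star, labels_long)
```

An interaction lets the effect of one variable depend on the level of another, for example a different price per additional bedroom in private rooms than in entire homes. It does not by itself remove collinearity between the two variables.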

model_wild_west<-lm(log10(total_price_4_days)~ #log10 response here (earlier models used the natural log)
                      prop_type_simplified+
                      number_of_reviews* #interaction: both main effects plus number_of_reviews:reviews_per_month
                      reviews_per_month+
                      room_type*  #interaction: both main effects plus room_type:bedrooms
                      bedrooms+
                      bathrooms+
                      accommodates+
                      region+
                      cancellation_policy+
                      review_scores_value+
                      review_scores_cleanliness+
                      review_scores_checkin+
                      review_scores_location+
                      security_deposit+
                      rating_group+
                      instant_bookable+
                      availability_365+
                      availability_90+
                      maximum_nights+
                      minimum_nights+
                      is_elevator+
                      is_shampoo,
                    data=munich_listings_region)
msummary(model_wild_west) 
##                                                 Estimate Std. Error t value
## (Intercept)                                     2.36e+00   5.15e-02   45.75
## prop_type_simplifiedCondominium                 1.73e-02   1.52e-02    1.14
## prop_type_simplifiedHouse                      -2.75e-02   1.47e-02   -1.87
## prop_type_simplifiedLoft                        4.35e-02   2.07e-02    2.10
## prop_type_simplifiedOther                       1.35e-02   1.36e-02    1.00
## number_of_reviews                              -1.02e-03   1.26e-04   -8.09
## reviews_per_month                              -4.82e-02   3.91e-03  -12.33
## room_typeHotel room                             2.78e-01   6.37e-02    4.36
## room_typePrivate room                          -1.36e-03   1.06e-02   -0.13
## room_typeShared room                           -1.44e-01   2.32e-02   -6.20
## bedrooms                                        5.34e-02   5.46e-03    9.78
## bathrooms                                       4.79e-02   9.29e-03    5.16
## accommodates                                    5.04e-02   2.33e-03   21.66
## regionzone_2                                   -5.87e-02   6.86e-03   -8.56
## regionzone_3                                   -7.87e-02   7.42e-03  -10.61
## regionzone_4                                   -8.15e-02   9.12e-03   -8.94
## regionzone_5                                   -1.30e-01   8.45e-03  -15.38
## cancellation_policymoderate                     8.61e-03   6.25e-03    1.38
## cancellation_policystrict_14_with_grace_period  4.86e-02   6.56e-03    7.40
## cancellation_policysuper_strict_30             -4.57e-02   2.01e-01   -0.23
## cancellation_policysuper_strict_60             -6.72e-02   2.01e-01   -0.33
## review_scores_value                            -4.56e-02   3.41e-03  -13.38
## review_scores_cleanliness                       2.14e-02   3.24e-03    6.61
## review_scores_checkin                           1.17e-02   4.18e-03    2.79
## review_scores_location                          2.15e-02   4.18e-03    5.14
## security_deposit                                4.85e-05   7.24e-06    6.71
## rating_groupUnder 90                           -3.34e-02   9.15e-03   -3.65
## instant_bookableTRUE                            3.78e-02   5.66e-03    6.67
## availability_365                                1.47e-04   3.51e-05    4.20
## availability_90                                 6.93e-04   1.09e-04    6.35
## maximum_nights                                  3.67e-06   1.57e-06    2.33
## minimum_nights                                 -1.63e-02   3.20e-03   -5.09
## is_elevatorTRUE                                 1.09e-02   5.29e-03    2.07
## is_shampooTRUE                                  1.13e-02   5.27e-03    2.13
## number_of_reviews:reviews_per_month             1.90e-04   2.41e-05    7.89
## room_typeHotel room:bedrooms                   -1.04e-01   4.79e-02   -2.17
## room_typePrivate room:bedrooms                 -1.03e-01   8.24e-03  -12.47
##                                                Pr(>|t|)    
## (Intercept)                                     < 2e-16 ***
## prop_type_simplifiedCondominium                 0.25581    
## prop_type_simplifiedHouse                       0.06090 .  
## prop_type_simplifiedLoft                        0.03548 *  
## prop_type_simplifiedOther                       0.31864    
## number_of_reviews                               7.0e-16 ***
## reviews_per_month                               < 2e-16 ***
## room_typeHotel room                             1.3e-05 ***
## room_typePrivate room                           0.89790    
## room_typeShared room                            6.1e-10 ***
## bedrooms                                        < 2e-16 ***
## bathrooms                                       2.6e-07 ***
## accommodates                                    < 2e-16 ***
## regionzone_2                                    < 2e-16 ***
## regionzone_3                                    < 2e-16 ***
## regionzone_4                                    < 2e-16 ***
## regionzone_5                                    < 2e-16 ***
## cancellation_policymoderate                     0.16853    
## cancellation_policystrict_14_with_grace_period  1.5e-13 ***
## cancellation_policysuper_strict_30              0.82010    
## cancellation_policysuper_strict_60              0.73774    
## review_scores_value                             < 2e-16 ***
## review_scores_cleanliness                       4.2e-11 ***
## review_scores_checkin                           0.00526 ** 
## review_scores_location                          2.8e-07 ***
## security_deposit                                2.2e-11 ***
## rating_groupUnder 90                            0.00026 ***
## instant_bookableTRUE                            2.7e-11 ***
## availability_365                                2.7e-05 ***
## availability_90                                 2.2e-10 ***
## maximum_nights                                  0.01965 *  
## minimum_nights                                  3.7e-07 ***
## is_elevatorTRUE                                 0.03885 *  
## is_shampooTRUE                                  0.03289 *  
## number_of_reviews:reviews_per_month             3.5e-15 ***
## room_typeHotel room:bedrooms                    0.02973 *  
## room_typePrivate room:bedrooms                  < 2e-16 ***
## 
## Residual standard error: 0.2 on 6248 degrees of freedom
## Multiple R-squared:  0.391,  Adjusted R-squared:  0.387 
## F-statistic:  111 on 36 and 6248 DF,  p-value: <2e-16
model_wild_west_colinear<-lm(log10(total_price_4_days)~ #regressing log10(total_price_4_days) on the variables below
                      prop_type_simplified+
                      number_of_reviews+ #interaction with reviews_per_month removed for the VIF check
                      reviews_per_month+
                      room_type+ #interaction with bedrooms removed for the VIF check
                      bedrooms+
                      bathrooms+
                      accommodates+
                      region+
                      cancellation_policy+
                      review_scores_value+
                      review_scores_cleanliness+
                      review_scores_checkin+
                      review_scores_location+
                      security_deposit+
                      rating_group+
                      instant_bookable+
                      availability_365+
                      availability_90+
                      maximum_nights+
                      minimum_nights+
                      is_elevator+
                      is_shampoo,
                    data=munich_listings_region)
car::vif(model_wild_west_colinear) # car::vif() struggles with interaction terms, so a model without them is used for the check
##                           GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified      1.39  4            1.04
## number_of_reviews         2.45  1            1.56
## reviews_per_month         2.60  1            1.61
## room_type                 1.60  3            1.08
## bedrooms                  1.77  1            1.33
## bathrooms                 1.22  1            1.11
## accommodates              1.91  1            1.38
## region                    1.12  4            1.01
## cancellation_policy       1.14  4            1.02
## review_scores_value       1.79  1            1.34
## review_scores_cleanliness 1.69  1            1.30
## review_scores_checkin     1.48  1            1.22
## review_scores_location    1.41  1            1.19
## security_deposit          1.07  1            1.03
## rating_group              1.62  1            1.27
## instant_bookable          1.09  1            1.04
## availability_365          2.50  1            1.58
## availability_90           2.38  1            1.54
## maximum_nights            1.02  1            1.01
## minimum_nights            1.15  1            1.07
## is_elevator               1.07  1            1.03
## is_shampoo                1.05  1            1.03

In the final model I kept the variables that were significant in the previous models and tested many more variables that, in my opinion, could also affect total_price_4_days. I tested the variables connected to review scores, i.e. review_scores_value, review_scores_cleanliness, review_scores_checkin, review_scores_location etc. Only the ones mentioned turned out to be significant for the model.

Afterwards I checked security_deposit, rating_group, instant_bookable and the availability variables. Two of them (availability_60 and availability_30) turned out to be insignificant, so I dropped them.

Thereafter, I added host_listings_count, as I believe that the number of properties a host manages may affect the standard offered, perhaps create some economies of scale, and therefore affect the price. This factor also turned out to be significant.

Later I tested maximum_nights and minimum_nights. In the next step I tested whether particular amenities have a significant impact on the price. It turned out that two of them, elevator and shampoo (the latter often being part of a welcome pack), are also significant for the price prediction. Moreover, I added two interaction terms, room_type:bedrooms and number_of_reviews:reviews_per_month, as I believe these variables interact strongly. The final model has an adjusted R-squared of 38.7% and a residual standard error of 0.2 on the log10 scale. Checking the VIF, I can see that every GVIF value is well below 5 (the largest is 2.60), so collinearity does not appear to affect the model significantly.
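To make the VIF check more concrete, here is a minimal sketch of how a variance inflation factor can be computed by hand. It uses the built-in mtcars data rather than the Munich listings, and the helper name vif_by_hand is my own, hypothetical choice. For terms with one degree of freedom, the GVIF^(1/(2*Df)) column reported by car::vif() is simply the square root of this value.

```r
# Toy illustration on the built-in mtcars data (not the Munich listings).
# vif_by_hand() is a hypothetical helper, not part of the car package.
vif_by_hand <- function(predictor, others, data) {
  # Regress one predictor on all the others and use 1 / (1 - R^2).
  aux <- lm(reformulate(others, response = predictor), data = data)
  1 / (1 - summary(aux)$r.squared)
}

predictors <- c("wt", "hp", "disp")
vifs <- sapply(predictors, function(p) {
  vif_by_hand(p, setdiff(predictors, p), mtcars)
})
round(vifs, 2)

# Predictors above a chosen threshold (5 here, as in the text) would be flagged.
names(vifs)[vifs > 5]
```

A value of 1 means the predictor is uncorrelated with the others; the larger the value, the more its coefficient's variance is inflated by collinearity.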


Diagnostics

Checking Residuals

In the next step I will plot the residuals, analyze their behaviour and check whether they are approximately normally distributed. Afterwards I will compare all the models to see how they evolved.

#plotting residuals
autoplot(model_wild_west)+
  theme_bw()

# comparing significance of variables among model iterations
huxreg(model2, 
       model3, 
       model6, 
       model7, 
       model_wild_west)
                                                (1)         (2)         (3)         (4)         (5)
(Intercept)                                     6.150 ***   5.573 ***   5.715 ***   5.682 ***   2.357 ***
                                                (0.079)     (0.026)     (0.027)     (0.028)     (0.052)
prop_type_simplifiedCondominium                 0.103 *     0.072       0.061       0.050       0.017
                                                (0.042)     (0.038)     (0.037)     (0.037)     (0.015)
prop_type_simplifiedHouse                       0.035       -0.111 **   -0.054      -0.062      -0.028
                                                (0.039)     (0.036)     (0.036)     (0.035)     (0.015)
prop_type_simplifiedLoft                        0.215 ***   0.116 *     0.113 *     0.111 *     0.044 *
                                                (0.057)     (0.052)     (0.051)     (0.050)     (0.021)
prop_type_simplifiedOther                       0.119 **    0.001       0.002       0.016       0.014
                                                (0.036)     (0.034)     (0.033)     (0.033)     (0.014)
number_of_reviews                               -0.001 ***  -0.002 ***  -0.002 ***  -0.002 ***  -0.001 ***
                                                (0.000)     (0.000)     (0.000)     (0.000)     (0.000)
review_scores_rating                            0.000
                                                (0.001)
room_typeHotel room                             0.297 **    0.540 ***   0.497 ***   0.509 ***   0.278 ***
                                                (0.102)     (0.093)     (0.091)     (0.091)     (0.064)
room_typePrivate room                           -0.376 ***  -0.249 ***  -0.234 ***  -0.228 ***  -0.001
                                                (0.015)     (0.014)     (0.014)     (0.014)     (0.011)
room_typeShared room                            -0.252 ***  -0.297 ***  -0.285 ***  -0.297 ***  -0.144 ***
                                                (0.063)     (0.058)     (0.057)     (0.056)     (0.023)
bathrooms                                                   0.124 ***   0.118 ***   0.114 ***   0.048 ***
                                                            (0.023)     (0.023)     (0.023)     (0.009)
bedrooms                                                    0.048 ***   0.039 **    0.042 ***   0.053 ***
                                                            (0.013)     (0.012)     (0.012)     (0.005)
beds                                                        -0.025 **
                                                            (0.008)
accommodates                                                0.148 ***   0.138 ***   0.131 ***   0.050 ***
                                                            (0.007)     (0.006)     (0.006)     (0.002)
regionzone_2                                                            -0.139 ***  -0.134 ***  -0.059 ***
                                                                        (0.017)     (0.017)     (0.007)
regionzone_3                                                            -0.196 ***  -0.187 ***  -0.079 ***
                                                                        (0.018)     (0.018)     (0.007)
regionzone_4                                                            -0.199 ***  -0.192 ***  -0.082 ***
                                                                        (0.022)     (0.022)     (0.009)
regionzone_5                                                            -0.335 ***  -0.325 ***  -0.130 ***
                                                                        (0.020)     (0.020)     (0.008)
cancellation_policymoderate                                                         0.008       0.009
                                                                                    (0.015)     (0.006)
cancellation_policystrict_14_with_grace_period                                      0.143 ***   0.049 ***
                                                                                    (0.016)     (0.007)
cancellation_policysuper_strict_30                                                  0.132       -0.046
                                                                                    (0.492)     (0.201)
cancellation_policysuper_strict_60                                                  0.027       -0.067
                                                                                    (0.491)     (0.201)
reviews_per_month                                                                               -0.048 ***
                                                                                                (0.004)
review_scores_value                                                                             -0.046 ***
                                                                                                (0.003)
review_scores_cleanliness                                                                       0.021 ***
                                                                                                (0.003)
review_scores_checkin                                                                           0.012 **
                                                                                                (0.004)
review_scores_location                                                                          0.021 ***
                                                                                                (0.004)
security_deposit                                                                                0.000 ***
                                                                                                (0.000)
rating_groupUnder 90                                                                            -0.033 ***
                                                                                                (0.009)
instant_bookableTRUE                                                                            0.038 ***
                                                                                                (0.006)
availability_365                                                                                0.000 ***
                                                                                                (0.000)
availability_90                                                                                 0.001 ***
                                                                                                (0.000)
maximum_nights                                                                                  0.000 *
                                                                                                (0.000)
minimum_nights                                                                                  -0.016 ***
                                                                                                (0.003)
is_elevatorTRUE                                                                                 0.011 *
                                                                                                (0.005)
is_shampooTRUE                                                                                  0.011 *
                                                                                                (0.005)
number_of_reviews:reviews_per_month                                                             0.000 ***
                                                                                                (0.000)
room_typeHotel room:bedrooms                                                                    -0.104 *
                                                                                                (0.048)
room_typePrivate room:bedrooms                                                                  -0.103 ***
                                                                                                (0.008)
room_typeShared room:bedrooms

N                                               6285        6285        6285        6285        6285
R2                                              0.115       0.263       0.297       0.308       0.391
logLik                                          -5206.674   -4633.202   -4484.509   -4432.369   1207.271
AIC                                             10435.349   9294.405    9003.017    8906.737    -2338.542
*** p < 0.001; ** p < 0.01; * p < 0.05. Standard errors in parentheses.

The residuals behave broadly as expected, so I consider the model assumptions reasonably met, although there is a slight gradient in the Scale-Location plot and a slight trend in Residuals vs Fitted. In the Residuals vs Leverage plot the points cluster around zero, and the normal Q-Q plot is linear for the most part. These slight issues may stem from the quality of the scraped data.

From the table comparing all the models I can see that R-squared went up throughout the process of finding the best solution, from 0.115 in the first column to 0.391 in the last. I can also see which variables were added and dropped at which stage.
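One caveat when reading the logLik and AIC rows: these are only comparable between models fitted to the same response. The intercepts of about 5.6 in columns (1) to (4) versus 2.36 in column (5) suggest that the earlier models used the natural log and the final one log10, although this section does not state it, so treat that as an assumption. The sketch below uses the built-in mtcars data (not the Munich listings) to show what such a change of base does on its own.

```r
# Toy illustration on the built-in mtcars data (not the Munich listings).
fit_ln    <- lm(log(mpg)   ~ wt + hp, data = mtcars)  # natural-log response
fit_log10 <- lm(log10(mpg) ~ wt + hp, data = mtcars)  # base-10 log response
n <- nrow(mtcars)

# R-squared is unchanged by rescaling the response.
c(ln = summary(fit_ln)$r.squared, log10 = summary(fit_log10)$r.squared)

# Every coefficient shrinks by the same factor, log(10), about 2.303.
coef(fit_ln) / coef(fit_log10)

# The log-likelihood shifts by n * log(log(10)) purely because of the
# change of scale, and AIC shifts by -2 times that amount.
shift <- as.numeric(logLik(fit_log10)) - as.numeric(logLik(fit_ln))
c(observed = shift, expected = n * log(log(10)))
```

Under that assumption, with N = 6285 the change of base alone would add roughly 5,242 to the log-likelihood, so only part of the jump from -4432 to 1207 in the table would reflect a better fit. R-squared is not affected by the base of the logarithm and remains a fair comparison across all five columns.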

Applying the Model and Predicting the Outcome

Now I will predict the price of Airbnb apartments with a private room that have at least 10 reviews and an average rating of at least 90.

Since my regression is based on log10(total_price_4_days), the predict function returns values on the log10 scale. First, I will create a new table filtered according to the conditions above. Next, I will back-transform (anti-log) the predictions of model_wild_west by raising 10 to their power. Finally, I will predict the prices for the filtered accommodations and build confidence intervals for them. I will do this in two ways in order to compare the results.

munich_listings_predict<- munich_listings_region %>%
  mutate(price=log10(total_price_4_days)) %>% #log10 form, matching the model's response
  filter(room_type=="Private room" &
           number_of_reviews>=10 & 
           rating_group=="Over 90")


predict_df<-10^predict(model_wild_west, # converting from log form to nominal
                    newdata = munich_listings_predict, 
                    interval= "confidence")
#sanity check
summary(predict_df)
##       fit           lwr           upr     
##  Min.   :136   Min.   :116   Min.   :157  
##  1st Qu.:256   1st Qu.:242   1st Qu.:271  
##  Median :300   Median :282   Median :316  
##  Mean   :315   Mean   :296   Mean   :336  
##  3rd Qu.:358   3rd Qu.:338   3rd Qu.:377  
##  Max.   :762   Max.   :649   Max.   :958
#using broom augment
model_prediction <- broom::augment(model_wild_west, 
                                   newdata= munich_listings_predict)
model_prediction <- model_prediction %>% 
  mutate(lower_95=10^.fitted-1.96*abs(10^(.resid)),#band of 1.96 * back-transformed residual around the fitted price (not based on the standard error of the fit)
         upper_95=10^.fitted+1.96*abs(10^(.resid))) %>% 
  select(.fitted,
         lower_95,
         upper_95, 
         total_price_4_days) %>% 
  mutate(.fitted=10^.fitted)
#sanity check
summary(model_prediction)
##     .fitted       lower_95      upper_95   total_price_4_days
##  Min.   :136   Min.   :134   Min.   :138   Min.   :  92      
##  1st Qu.:256   1st Qu.:254   1st Qu.:258   1st Qu.: 217      
##  Median :300   Median :297   Median :302   Median : 280      
##  Mean   :315   Mean   :313   Mean   :317   Mean   : 327      
##  3rd Qu.:358   3rd Qu.:356   3rd Qu.:359   3rd Qu.: 380      
##  Max.   :762   Max.   :762   Max.   :763   Max.   :4036

Using both predict and augment I observe a mean predicted price of around 315, which is close to the actual mean total price of 327. The quartiles of the fitted values are identical for both methods (256, 300 and 358), as expected, since both use the same model. The differences appear in the lower and upper bounds: the mean bounds from predict (296 to 336) contain the actual mean, while the much narrower band from augment (313 to 317) misses it by €10. The augment band is built from each listing's back-transformed residual rather than from the standard error of the fit, which explains why it is so narrow. Overall, the closeness of the predicted and actual means gives me some confidence in the regression. The next step is a sanity check of the model using the RMSE.
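For reference, here is a minimal sketch of how a confidence interval for the mean response can be built by hand from the standard error of the fit and then back-transformed. It uses the built-in mtcars data, and the new_cars values are made up for illustration. It reproduces what predict(..., interval = "confidence") returns, which is the interval the augment-based calculation would need to match.

```r
# Toy illustration on the built-in mtcars data (not the Munich listings).
fit <- lm(log10(mpg) ~ wt + hp, data = mtcars)
new_cars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 180))  # made-up inputs

# Built-in 95% confidence interval for the mean response, on the log10 scale.
ci_log <- predict(fit, newdata = new_cars, interval = "confidence", level = 0.95)

# The same interval by hand: fit +/- t quantile * standard error of the fit.
p <- predict(fit, newdata = new_cars, se.fit = TRUE)
t_crit <- qt(0.975, df = p$df)
manual_log <- cbind(fit = p$fit,
                    lwr = p$fit - t_crit * p$se.fit,
                    upr = p$fit + t_crit * p$se.fit)

# Back-transform the bounds only after the interval is built on the log10 scale.
ci_original <- 10^manual_log
ci_original
```

Because 10^x is monotone, the back-transformed bounds stay ordered, but the interval is no longer symmetric around the fitted value on the original scale.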

Data Training and RMSE

In the next step I will split the filtered data into two parts, a training portion (70%) and a test portion (30%). I will then compute the RMSE of model_wild_west on each part and compare the results.

set.seed(1234)
train_test_split <- initial_split(munich_listings_predict, prop=0.7) # splitting dataset
munich_train<- training(train_test_split)
munich_test<- testing(train_test_split)

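rmse_train <- munich_train %>%  #RMSE on the training portion, log10 scale
  mutate(predictions=predict(model_wild_west,.)) %>% 
  summarise(sqrt(mean((predictions-log10(total_price_4_days))^2))) %>% #square each error, then average
  pull()
rmse_train
rmse_test <- munich_test %>%  #RMSE on the test portion, log10 scale
  mutate(predictions=predict(model_wild_west,.)) %>% 
  summarise(sqrt(mean((predictions-log10(total_price_4_days))^2))) %>% #square each error, then average
  pull()
rmse_test

Both RMSE values are on the log10 scale, because each error is the difference between the predicted and the actual log10(total_price_4_days). They should therefore be read against the residual standard error of 0.2 reported above rather than against prices in euros. A test RMSE close to the training RMSE indicates that the model performs similarly on both parts, which supports the view that, although our R^2 is low, the predictions are reasonably stable.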
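As a point of comparison, here is a minimal sketch of a train/test RMSE in which each error is squared before averaging and the model is fitted on the training rows only. It uses the built-in mtcars data rather than the Munich listings, and the helper name rmse is my own. It reports the RMSE both on the log10 scale and on the original scale.

```r
# Toy illustration on the built-in mtcars data (not the Munich listings).
# rmse() is a small helper defined here, not a function from a package.
rmse <- function(predicted, actual) sqrt(mean((predicted - actual)^2))

set.seed(1234)
train_rows <- sample(nrow(mtcars), size = floor(0.7 * nrow(mtcars)))
train <- mtcars[train_rows, ]
test  <- mtcars[-train_rows, ]

# Fit on the training rows only, so the test rows are genuinely unseen.
fit <- lm(log10(mpg) ~ wt + hp, data = train)

# RMSE on the log10 scale: comparable to the residual standard error.
rmse_log <- c(train = rmse(predict(fit, train), log10(train$mpg)),
              test  = rmse(predict(fit, test),  log10(test$mpg)))

# RMSE on the original scale: back-transform the predictions first.
rmse_orig <- c(train = rmse(10^predict(fit, train), train$mpg),
               test  = rmse(10^predict(fit, test),  test$mpg))
rmse_log
rmse_orig
```

Only the original-scale RMSE can be compared directly with prices; the log10-scale RMSE is the one to compare with the residual standard error from the model summary.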

Thank you for your interest in our study project. I hope you found it interesting.