Predicting Home Prices in Boulder, CO

The Zestimate, Zillow’s algorithm for predicting home values, has become an influential fixture of the residential real estate market, at times being mistaken for an appraisal tool. Homeowners sued Zillow in 2017 for allegedly undervaluing homes and creating an obstacle to their sale. More recently, Zillow shut down its home-flipping business after incurring more than $550 million in expected losses.

Our goal was to develop a machine learning pipeline to predict home prices in Boulder County, Colorado. We enhanced the model with open data on local features in addition to characteristics of the homes themselves, and we tested and adjusted it for spatial and socioeconomic biases to improve its generalizability and predictive power.

However, this task presented two important challenges. First, we did not take into account any previous sales or tax assessment data that could have better informed our model. Second, we conducted this analysis in Boulder County, Colorado, which is not a coherent urban unit but rather a mix of Denver’s outer suburbs and medium-sized towns, with almost half of its area covered in forest. Furthermore, Boulder is a homogeneous and distinctive county: according to 2019 estimates, its population is 91% white, its median income is $81,390 (compared to $62,843 nationwide), and 61% of residents aged 25 or older hold a bachelor’s degree or higher (compared to 31% nationwide).

We implemented a supervised machine learning model based on linear regression to predict house prices. We built a pipeline to clean, wrangle, and merge data from various sources; evaluate the data elements, keeping or removing each depending on its usefulness; and test the model for accuracy, obtaining metrics on the efficacy of each iteration. We repeated the process numerous times to refine the model and obtain the most accurate and generalizable results possible.
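To make the fit-and-evaluate loop concrete, here is a minimal sketch in plain Python. It is not our actual pipeline: it assumes a single hypothetical feature (square footage), made-up sale prices, and a simple random hold-out split, and the helper names fit_simple_ols and train_test_split are illustrative only.

```python
import random


def fit_simple_ols(xs, ys):
    """Closed-form least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Made-up (square_feet, sale_price) pairs, for illustration only.
homes = [
    (900, 310_000), (1200, 405_000), (1500, 520_000), (1800, 590_000),
    (2100, 700_000), (2400, 790_000), (2700, 905_000), (3000, 980_000),
]

train, test = train_test_split(homes)
intercept, slope = fit_simple_ols(
    [sqft for sqft, _ in train], [price for _, price in train]
)

# Predict on the held-out homes only, so the error reflects unseen data.
predictions = [intercept + slope * sqft for sqft, _ in test]
for (sqft, price), predicted in zip(test, predictions):
    print(f"{sqft} sq ft: actual ${price:,}, predicted ${predicted:,.0f}")
```

Each iteration of a real pipeline would swap features in or out and rerun the same split, fit, and scoring steps, which is what makes successive model versions comparable.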

Ultimately, we were able to create a model that predicted home prices fairly well overall, with a mean absolute percentage error (MAPE) solidly under 15 percent. The model performed very well on homes sold for under $1 million, which represented the majority of homes sold. Its most glaring shortcomings were at the very high end of the home price distribution, with a MAPE of around 40 percent for homes sold for more than $2 million. These homes constituted a small minority of the sample, but the size of the errors points to the need for more refinement before this model could be considered production-ready.
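For reference, MAPE averages the absolute error of each prediction as a share of the actual sale price. The sketch below shows one way to compute it overall and by price band. The sale prices, predictions, and band labels are invented for illustration and are not our results; the band cutoffs simply mirror the $1 million and $2 million thresholds discussed above.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, expressed in percent."""
    errors = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


def mape_by_band(actual, predicted, bands):
    """Compute MAPE separately for each [low, high) band of actual price."""
    results = {}
    for label, low, high in bands:
        pairs = [(a, p) for a, p in zip(actual, predicted) if low <= a < high]
        if pairs:  # skip bands that contain no sales
            results[label] = mape(
                [a for a, _ in pairs], [p for _, p in pairs]
            )
    return results


# Hypothetical price bands matching the thresholds in the text.
BANDS = [
    ("under $1M", 0, 1_000_000),
    ("$1M to $2M", 1_000_000, 2_000_000),
    ("over $2M", 2_000_000, float("inf")),
]

# Invented sale prices and predictions, for illustration only.
actual_prices = [400_000, 650_000, 900_000, 1_500_000, 3_000_000]
predicted_prices = [420_000, 600_000, 950_000, 1_300_000, 1_900_000]

print(f"overall MAPE: {mape(actual_prices, predicted_prices):.1f}%")
for band, value in mape_by_band(actual_prices, predicted_prices, BANDS).items():
    print(f"{band}: {value:.1f}%")
```

Reporting the metric by band matters because an overall average dominated by lower-priced homes can hide large errors among the few most expensive ones.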

We believe this model is a promising first step toward accurately predicting home prices in Boulder County. Further work is necessary, however, to improve its accuracy at the high end of the price range. Doing so would reduce the risk of further litigation and negative publicity arising from multi-million-dollar prediction errors for the county’s most expensive homes.

Work done in collaboration with Elisabeth Ericson.

predictive-modeling linear-regression real-estate machine-learning zillow geospatial