Development Risk Prediction for Vacant Lots

In recent years, the City of Philadelphia has been foreclosing on tax-delinquent vacant properties in bulk. Many of these properties are cherished community assets, including public parks, urban gardens, and side yards. Often, neighbors who have tended these properties for years do not find out that they are up for sale until they have already been sold. A variety of grassroots and legal advocacy organizations have taken up the mantle of ensuring that vacant lots remain in the care of the neighbors and communities who have invested in them for years.

In collaboration with our client, Philadelphia Legal Assistance, we developed a story map to summarize this complex issue and a Data Dashboard where users can explore the conditions and predicted development risk of vacant lots in Philadelphia.

The story map and dashboard can be accessed here. A brief presentation for this project is available here.

This project was made in collaboration with Max Masuda-Farkas and Gillian Xuezhu Zhao.

Real-Estate-Development Machine-Learning Predictive-Modelling

Multimodal bike-bus interactions in NYC

What is the relationship between bus arrivals and departures at a bus stop in New York City and bike-share use in its vicinity? Is there a way to assess the interactions between the two systems in order to benefit users who transfer between them?

By simultaneously retrieving real-time geolocated data from both New York City’s MTA Bus API and Citi Bike’s Open Data live feed, it is possible to measure the interaction between the two modes. To do so, three bus stop and bike dock pairs were selected along a single MTA bus route, the northbound M15 on Manhattan’s First Avenue.
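As an illustration of how paired observations from the two feeds might be compared, the sketch below takes a list of bus arrival times at one stop and a list of timestamped bike counts at its paired dock, then reports how the bike count changed in a short window after each arrival. It is a minimal sketch, not the project's actual method: the function name, the five-minute window, and the sample data are all assumptions made for illustration, and the data is synthetic rather than retrieved from either API.

```python
from datetime import datetime, timedelta


def bikes_change_around_arrivals(bus_arrivals, dock_snapshots,
                                 window=timedelta(minutes=5)):
    """Return (arrival, change in bikes available) for each bus arrival.

    bus_arrivals: list of datetimes for one bus stop.
    dock_snapshots: list of (datetime, bikes_available) for the paired dock.
    The change compares the last snapshot at or before the arrival with the
    last snapshot inside the window after it (window length is an assumption).
    """
    snapshots = sorted(dock_snapshots)
    results = []
    for arrival in sorted(bus_arrivals):
        before = [n for t, n in snapshots if t <= arrival]
        after = [n for t, n in snapshots if arrival < t <= arrival + window]
        if not before or not after:
            continue  # not enough observations around this arrival
        results.append((arrival, after[-1] - before[-1]))
    return results


# Synthetic example: one bus arrival and five dock snapshots.
t0 = datetime(2021, 1, 1, 8, 0)
dock = [(t0 + timedelta(minutes=m), n)
        for m, n in [(0, 10), (2, 10), (4, 8), (6, 7), (8, 7)]]
arrivals = [t0 + timedelta(minutes=3)]
changes = bikes_change_around_arrivals(arrivals, dock)
```

A negative change means bikes left the dock shortly after the bus arrived, which is the kind of signal a transfer from bus to bike would produce.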

The complete project report can be found here.

transportation multimodal bikeshare transit data-analysis geospatial

Citi Bike demand predictive model for Brooklyn

Bike-sharing programs in US metropolitan centers are one of the few tech-enabled innovations in urban transportation of the last decade that, despite receiving less attention than automated vehicles or delivery drones, have actually been successfully implemented and integrated with other forms of transportation. They have grown steadily in number of programs and stations, propelled by their widespread popularity with urban residents.

However, one of the main challenges these systems face is their reliance on a fleet of trucks to redistribute bikes across stations and counterbalance the aggregate flow of people’s origin-destination trips throughout the city. The uneven distribution of bikes that results from these flows can rapidly disrupt the system’s operations: when a docking station is completely full or completely empty, it is rendered unusable.

The goal of this project is to develop a space-time regression model that predicts the use of New York City’s Citi Bike bike-share system in the borough of Brooklyn and helps plan daily bike redistribution by forecasting hourly demand through the following week. Such forecasts would allow Citi Bike’s logistics operation to move bikes from stations with low forecasted demand but more than half of their docks occupied to stations with high forecasted demand, over the shortest possible distance and on time, effectively expanding the dock capacity of heavily used stations.
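The redistribution rule described above can be sketched as a simple greedy matching. This is only an illustration under stated assumptions: the station records, the `low` and `high` demand thresholds, and the one-donor-per-receiver rule are hypothetical, and the forecasts are made-up numbers rather than output from the project's model.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def plan_moves(stations, low=2.0, high=8.0):
    """Pair each high-demand station with its nearest eligible donor.

    A donor has forecasted demand at or below `low` and more than half of
    its docks occupied; a receiver has forecasted demand at or above `high`.
    Both thresholds are illustrative assumptions.
    """
    donors = [s for s in stations
              if s["forecast"] <= low and s["bikes"] > s["docks"] / 2]
    receivers = [s for s in stations if s["forecast"] >= high]
    moves = []
    for receiver in receivers:
        if not donors:
            break
        nearest = min(donors,
                      key=lambda s: haversine_km(s["loc"], receiver["loc"]))
        donors.remove(nearest)  # each donor is used at most once
        moves.append((nearest["id"], receiver["id"]))
    return moves


# Hypothetical stations with made-up forecasts (trips per hour).
stations = [
    {"id": "A", "loc": (40.700, -73.990), "forecast": 1.0, "bikes": 18, "docks": 20},
    {"id": "B", "loc": (40.680, -73.970), "forecast": 1.0, "bikes": 5, "docks": 20},
    {"id": "C", "loc": (40.701, -73.990), "forecast": 10.0, "bikes": 2, "docks": 20},
    {"id": "D", "loc": (40.650, -73.950), "forecast": 1.0, "bikes": 15, "docks": 20},
]
moves = plan_moves(stations)
```

Station B is skipped as a donor because fewer than half of its docks are occupied, and station A is chosen over D because it is closer to the receiving station. A greedy pass like this does not guarantee the minimum total travel distance; it is meant only to make the selection rule concrete.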

The complete project report can be found here.

bike-share transportation predictive-modelling citi-bike

Predicting Home Prices in Boulder, CO

The Zestimate, Zillow’s algorithm for predicting home values, has become an influential fixture of the residential real estate market, at times being mistaken for an appraisal tool. Homeowners sued Zillow in 2017 for allegedly undervaluing homes and creating an obstacle to their sale. More recently, Zillow shut down its home-flipping business after incurring more than $550 million in expected losses.

Our goal was to develop a machine learning pipeline to predict home prices in Boulder County, Colorado, enhancing our model with open data on local features in addition to characteristics of the homes themselves, while also testing and adjusting the model for spatial and socioeconomic biases to ensure generalizability and strengthen its predictive power.

However, this task presented two important challenges. First, we did not take into account any previous sales or tax assessment data that could have better informed our model. Second, we conducted this analysis in Boulder County, Colorado, a county that is not itself a coherent urban unit but is instead composed of Denver’s outer suburbs and medium-sized towns, with almost half of its area covered in forest. Furthermore, Boulder is a very homogeneous and particular county: according to 2019 estimates, its population is 91% white, its median income is $81,390 (compared to $62,843 nationwide), and 61% of people 25 or older hold a bachelor’s degree or higher (compared to 31% nationwide).

We implemented a supervised machine learning model based on linear regression to predict house prices. We built a pipeline to clean, wrangle, and merge data from various sources; evaluate it, keeping or removing data elements depending on their usefulness; and test it for accuracy, obtaining metrics on the performance of each model iteration. We repeated the process numerous times to refine the model and obtain the most accurate and generalizable results possible.

Ultimately, we were able to create a model that predicted home prices fairly well across the board, with a mean absolute percentage error (MAPE) solidly under 15 percent. The model performed very well on homes sold for under $1 million, which represented the majority of homes sold. Its most glaring shortcomings were at the very high end of the home price distribution, with MAPE around 40 percent for homes sold for more than $2 million. These homes constituted a small minority of the sample, but the size of the errors points to the need for more refinement before this model could be considered production-ready.
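To make the metric concrete, the sketch below computes MAPE overall and within price bands, which is the kind of breakdown that reveals weaker performance on expensive homes. The function names, the band boundaries, and the three sample sales are assumptions chosen for illustration; they are not the project's data or results.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, expressed in percent."""
    errors = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return 100 * sum(errors) / len(errors)


def mape_by_band(actual, predicted, bands):
    """MAPE for each (label, lower, upper) price band, using actual prices.

    A sale falls in a band when lower <= actual price < upper.
    Bands with no sales are omitted from the result.
    """
    result = {}
    for label, lower, upper in bands:
        pairs = [(a, p) for a, p in zip(actual, predicted) if lower <= a < upper]
        if pairs:
            band_actual, band_predicted = zip(*pairs)
            result[label] = mape(band_actual, band_predicted)
    return result


# Made-up sale prices and predictions, for illustration only.
actual = [400_000, 500_000, 2_500_000]
predicted = [440_000, 450_000, 1_500_000]
bands = [
    ("under $1M", 0, 1_000_000),
    ("$1M to $2M", 1_000_000, 2_000_000),
    ("$2M and over", 2_000_000, float("inf")),
]
overall = mape(actual, predicted)
per_band = mape_by_band(actual, predicted, bands)
```

Because each error is divided by the actual sale price, a large dollar miss on an expensive home and a small dollar miss on a modest home can contribute equally, so a high MAPE in the top band indicates errors that are large even relative to those prices.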

We believe this model is a promising first step toward accurately predicting home prices in Boulder County. Further work is necessary, however, to improve its accuracy at the high end of the price range and reduce the risk of further litigation and negative publicity arising from multi-million-dollar prediction errors for the county’s most expensive homes.

Work made in collaboration with Elisabeth Ericson.

predictive-modeling linear-regression real-estate machine-learning zillow geospatial

Visualizing Retaliatory Evictions in Philadelphia

The goal of this project was to visualize the different aspects of retaliatory evictions in the City of Philadelphia.

A retaliatory eviction occurs when a property owner who has failed to comply with the rental agreement, or to maintain basic sanitary or building conditions in a rental property, has a building violation filed against them by a tenant and, in return, evicts or threatens to evict that tenant. This has become common practice among landlords throughout Philadelphia, especially in low-income neighborhoods, since landlords usually have legal representation in court and their tenants do not.

The interactive maps and report for this project are available here.

geospatial data-analysis evictions Philadelphia housing python

Forecasting Domestic Violence in Chicago

Family violence is one of the few clear examples of phenomena that, because of their nature, can be prevented more effectively by adaptive predictive policing tools than by conventional policing strategies. This is due to three of its inherent characteristics:

First, family violence is widespread. According to a 2005 report by the US Department of Justice Bureau of Justice Statistics, family violence amounted to 11% of all the reported and unreported violent incidents between 1998 and 2002.

Second, family violence often goes unnoticed even though it tends to be chronic, given that it happens almost exclusively in private spaces and within family structures in which it is difficult for others to intervene. Also, because these incidents usually go unreported and are often ingrained in family dynamics, they tend to recur in the same household.

Third, and most important, family violence is systematically underreported for a myriad of reasons, including the victim’s financial dependency on the offender, psychological intimidation, public shame, or religion. According to the same DoJ report, two out of five family violence incidents go unreported, with the most common reasons being that the incident was a “private/personal matter” (34% of the time) and to “protect the offender” (12% of the time). Another important reason for not reporting family violence could be fear of retaliation, especially since only 36% of the incidents reported to police between 1998 and 2002 resulted in an arrest.

In order to produce an algorithm that can predict the risk of family violence, or rather its most commonly reported manifestation, domestic battery, we used the city of Chicago as a testing ground. We took relevant open data from the city that could be translated into possible predictors of the latent risk of family violence, following the main axiom of environmental criminology that crime “is patterned according to the criminogenic nature of the environment.”

The positive aspect of a model that predicts domestic battery is that it can be translated into preventive measures that offer an alternative to the continuing over-policing practices commonly put in place in the majority non-white neighborhoods of Chicago. For example, this tool could inform the allocation of social worker services, targeted information about services for reporting physical abuse, or community workshops that tackle family violence on a more approachable and preventive level.

The complete project report can be found here.

machine-learning chicago family-violence domestic-battery