How well do urban models predict the future of cities?


In a previous post, I explored the topic of model usefulness. In this post, I’ll focus on evaluating model performance.

At UrbanSim, we're building simulation models of urban development and using AI techniques to improve their accuracy in forecasting how development patterns evolve over time. Since we help many public agencies inform plans with substantial public impacts, such as regional transportation plans that may involve infrastructure investments of massive scale and cost, we need to make sure the models being used perform well enough to provide solid insights for these operational planning processes.

Over the past several months we have been developing a robust benchmarking suite to assist in evaluating models. One metric is never enough. We recommend that if you take an interest in any urban model, you consider using similar benchmarks to inform your assessments.

Specifically, we wanted to learn how well an urban model can predict spatial dynamics over a time frame long enough to experience substantial change, such as ten years. Some models, including UrbanSim, simulate change in annual time steps, so a 10-year simulation chains together ten one-year simulations, leaving a lot of room for errors to propagate over time. And many people have argued in the past that the more microscopic and fine-grained a model is, the more likely errors are to compound over time, to the point that model accuracy might be too low to support practical planning decisions. We wanted to put these kinds of critiques to the test.

To evaluate different model versions, we set up a benchmarking suite to understand how any change affects model performance. We used 2010 as the starting point for these simulations, and used the available 2020 Census data as the observed data targets to assess goodness of fit. The model represents the dynamics of the entire real estate market from year to year, at a census block level of detail. We will use the proposed benchmarks to evaluate our models as we change any important aspect of their design and specification.
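To make the comparison concrete, here is a minimal sketch of one such benchmark check, assuming hypothetical per-tract CSV files (census_2020_tracts.csv and simulated_2020_tracts.csv, each with tract_id and households columns). This is illustrative, not UrbanSim's actual benchmarking code:

```python
import pandas as pd

# Hypothetical inputs: one row per census tract, observed vs. simulated
observed = pd.read_csv("census_2020_tracts.csv")      # columns: tract_id, households
predicted = pd.read_csv("simulated_2020_tracts.csv")  # columns: tract_id, households

merged = observed.merge(predicted, on="tract_id", suffixes=("_obs", "_pred"))

# Two headline numbers; one metric is never enough
mae = (merged["households_pred"] - merged["households_obs"]).abs().mean()
corr = merged["households_obs"].corr(merged["households_pred"])  # Pearson by default
print(f"MAE: {mae:.1f} households per tract; Pearson r: {corr:.3f}")
```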

A Cautionary Note about Model Assessment

Models used for urban planning purposes, predicting transportation patterns and urban development outcomes, should be evaluated carefully. Ask whether the model training (or calibration) used geographic constants to absorb errors in its predictions. And be wary when a model is trained on one year and evaluated on that same year: the results can be quite misleading. Let's say we train a model on year 2000 data, add constants for all the geographies we over- or under-predict, and then calibrate those constants to minimize the difference between our predictions and the observed data. Sounds reasonable? Not so fast. If you use a lot of calibrated geographic constants, you are sweeping the errors of the model under the rug, so to speak. Worse yet, those constants inhibit the model's sensitivity to the behavioral variables that are supposed to drive it. And you gain no insight into how well the model predicts where it counts: the future.
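A toy illustration of why this is misleading, with entirely made-up numbers: if we fit one constant per geography on the training year, the in-sample fit becomes perfect by construction, regardless of how bad the behavioral model is.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "observed" households per tract in the training year (2000)
observed_2000 = rng.poisson(1500, size=200).astype(float)

# A behavioral model's raw predictions, with substantial error
model_2000 = observed_2000 + rng.normal(0, 120, size=200)

# One calibrated constant per geography, chosen to zero out the residuals
constants = observed_2000 - model_2000

# "Evaluating" on the training year now shows a perfect fit -- by construction
calibrated = model_2000 + constants
print(np.allclose(calibrated, observed_2000))  # True: the errors were absorbed
```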

In the models we evaluate below, we use no geographic constants, only behavioral variables and parameters, and we evaluate the models over time. Let's look at what a more principled and robust assessment of urban models looks like. Below we use a model under development for the Kansas City region for illustrative purposes, focusing on measures related to predicting spatial changes in households.

Assessment 1: Simulate 10 Years and Compare to Observed Data

How does the predicted number of households by tract in 2020, after 10 years of simulation from 2010, compare to the recently released 2020 Census data? One way to examine this is a scatterplot of observed vs. predicted values: if predictions were perfect, every point would fall on the 45-degree line. Points above the line indicate over-prediction, and points below it indicate under-prediction.

[Figure: scatterplot of predicted vs. observed households by census tract]

Predicted vs. Observed Households by Census Tract in 2020, after Simulating from 2010 at the Census Block Level. Kansas City Region.
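A sketch of how such a plot can be produced, reusing the (hypothetical) merged frame from the benchmarking sketch above:

```python
import matplotlib.pyplot as plt

# Assumed inputs: aligned per-tract arrays from the earlier sketch
obs = merged["households_obs"].to_numpy()
pred = merged["households_pred"].to_numpy()

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(obs, pred, s=8, alpha=0.5)

# 45-degree reference line: points on it are predicted exactly
lim = max(obs.max(), pred.max())
ax.plot([0, lim], [0, lim], linestyle="--", color="gray")

ax.set_xlabel("Observed households (2020 Census)")
ax.set_ylabel("Predicted households (2020 simulation)")
plt.show()
```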

A second way to interpret the data behind the scatterplot is to compute a Pearson correlation coefficient between the observed and predicted values in 2020, which provides a single scalar value ranging from -1 (perfect negative correlation) through 0 (no correlation) to 1 (perfect positive correlation). The correlation between predicted and observed households in 2020, after 10 years of simulation, is 0.974.
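Computing this takes one line with SciPy, using the same assumed arrays as above:

```python
from scipy.stats import pearsonr

# r ranges from -1 to 1; obs and pred are the arrays from the sketch above
r, _ = pearsonr(obs, pred)
print(f"Pearson r, 2020 totals: {r:.3f}")  # the Kansas City model reports 0.974
```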

One should be cautious about over-interpreting these initial benchmarks, however, since many areas may not have changed dramatically over a ten-year period. If a tract looks much the same in 2020 as it did in 2010, you would expect a high correlation between its 2010 and 2020 values, and even a model that simply carried the 2010 pattern forward would score well.
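One way to put the headline correlation in context is to compute a naive "no change" persistence baseline. The sketch below assumes a hypothetical census_2010_tracts.csv aligned to the same tract order as the arrays above:

```python
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical 2010 counts, aligned to the same tract order as obs and pred
hh_2010 = pd.read_csv("census_2010_tracts.csv")["households"].to_numpy()

r_baseline, _ = pearsonr(hh_2010, obs)  # how well 2010 alone "predicts" 2020
r_model, _ = pearsonr(pred, obs)
print(f"persistence baseline r: {r_baseline:.3f}  model r: {r_model:.3f}")
```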

Assessment 2: Simulate 10 Years and Compare Changes Over Time

While these are reassuringly high prediction accuracies after 10 years of simulation, we can probe more deeply to see how well the model predicts the amount of change from 2010 to 2020 by census tract. Let's look at household change to illustrate.

[Figure: bar chart of predicted vs. observed change in households]

Comparison of Predicted Change in Households from 2010 to 2020 to Observed Change, by Census Tract. Kansas City Region.

In this figure we can see the observed and predicted distributions of change in households by census tract. Note that there are a fair number of census tracts in Kansas City in which households declined. This is not as unusual as one might think; even in relatively fast-growing sunbelt cities you will find locations where household counts decline. Many models are structurally incapable of predicting decline because they do not model households moving out; they can only add households to the previous count.

The following figure shows the same information as the bar chart, but represents it as a kernel density (in which the area under each curve sums to 1). This view shows how the predicted and observed distributions of household change over the decade compare to each other, without imposing the bins used to produce the bar chart view. It also provides more insight into the scale of growth and decline by census tract, shown on the x-axis.

[Figure: kernel density plot of predicted vs. observed change in households]

Observed and Predicted Change in Households by Census Tract from 2010 to 2020, as a Kernel Density Plot.
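A sketch of this kind of density view, building on the assumed arrays from the earlier sketches:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Per-tract changes over the decade, from the hypothetical arrays above
delta_obs = obs - hh_2010
delta_pred = pred - hh_2010

xs = np.linspace(min(delta_obs.min(), delta_pred.min()),
                 max(delta_obs.max(), delta_pred.max()), 500)

for label, values in (("Observed", delta_obs), ("Predicted", delta_pred)):
    plt.plot(xs, gaussian_kde(values)(xs), label=label)  # each curve integrates to 1

plt.xlabel("Change in households by tract, 2010 to 2020")
plt.ylabel("Density")
plt.legend()
plt.show()
```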

The Pearson correlation coefficient comparing observed and predicted change in households over the decade provides a more challenging metric than the initial one on 2020 totals, since it subtracts the 2010 values from the 2020 data and focuses only on the changes. For Kansas City, the correlation between predicted and observed household change was 0.734, a very high correlation for tract-level changes in households simulated over a decade.
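In code, the only difference from the earlier correlation is that it operates on the deltas:

```python
from scipy.stats import pearsonr

# Subtracting the shared 2010 base removes the "easy" persistence signal,
# so this correlation scores only the change the model actually predicted
r_change, _ = pearsonr(delta_obs, delta_pred)
print(f"Pearson r, 2010-2020 change: {r_change:.3f}")  # reported above as 0.734
```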

In short, what we would like to see is a model that can realistically capture the growth, stability, and decline of neighborhoods.

Our Methodology

How are we achieving these kinds of results? We will save some of that discussion for another post, but for now, suffice it to say that we are pioneering the use of artificial intelligence to train behavioral models. The goal is to attain some of the predictive accuracy offered by AI techniques while retaining the behavioral realism of our simulation models, so that they can support planning applications that require transparency and explainability, such as counterfactual experiments evaluating alternative transportation infrastructure projects and local land use regulations.

We will be discussing these innovations much more in the coming weeks, as we begin to launch our next-generation UrbanSim platform in more metropolitan areas.


Acknowledgments: Although many people have contributed to the ideas and the benchmarking suite described in this post, we want to particularly recognize Frank Lenk at Mid-America Regional Council for his generosity of time, ideas and feedback as we have shared numerous versions of the model under development for the Kansas City region.
