Abstract:
Methods for predicting ridership on future urban rail systems or extensions are often inaccurate.
One study found that forecasts overestimated ridership by about 50%, on average, across a
broad sample of urban rail systems worldwide. The ridership estimates produced by most transit
agencies in the United States are not based on regression models. This thesis presents a framework
for feature generation and regression modeling for estimating urban rail ridership in the United
States. Features are generated using publicly available data from the US Census Bureau at the
ZIP code level. Monte Carlo geographic sampling from ZIP code shapefiles generates features for
each station on a rail network, representing characteristics within walking distance of that station.
Network connections and travel times are used to generate a second set of features representing
characteristics within commuting distance of each station. Several models are developed using
different regression types and are compared in terms of accuracy and selected features. Some of
the generated models provide system-level ridership predictions within 20% of the true value for a
sample set of six US urban rail systems.
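
As an illustration of the Monte Carlo geographic sampling step, the following is a minimal sketch and not the thesis implementation. It assumes planar coordinates, a circular walking catchment, and zones given as simple polygons that each carry one attribute value. The function names, the catchment radius, and the toy zone values are all hypothetical. Random points are drawn uniformly within the catchment, each point is assigned to the zone that contains it, and the zone values are averaged, which approximates an area-weighted mean over the catchment.

```python
import math
import random


def point_in_polygon(x, y, polygon):
    """Ray-casting test. `polygon` is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def sample_station_feature(station, radius, zones, n_samples=10000, seed=0):
    """Estimate the area-weighted mean of a zone attribute near a station.

    `zones` is a list of (polygon, value) pairs (hypothetical structure).
    Sample points that fall in no zone are ignored.
    """
    rng = random.Random(seed)
    sx, sy = station
    total, hits = 0.0, 0
    for _ in range(n_samples):
        # The square root makes the points uniform over the disk's area.
        r = radius * math.sqrt(rng.random())
        theta = 2.0 * math.pi * rng.random()
        px = sx + r * math.cos(theta)
        py = sy + r * math.sin(theta)
        for polygon, value in zones:
            if point_in_polygon(px, py, polygon):
                total += value
                hits += 1
                break
    return total / hits if hits else float("nan")


# Toy data (assumed): two adjacent square zones with made-up attribute values.
zones = [
    ([(-10, -10), (0, -10), (0, 10), (-10, 10)], 1000.0),
    ([(0, -10), (10, -10), (10, 10), (0, 10)], 3000.0),
]

# A station on the shared boundary draws roughly equally from both zones.
boundary_estimate = sample_station_feature((0.0, 0.0), 0.8, zones)
# A station well inside one zone recovers that zone's value.
interior_estimate = sample_station_feature((5.0, 0.0), 0.8, zones)
```

In this sketch a station on the boundary between two zones yields a value near the midpoint of the two zone values, while a station whose catchment lies inside a single zone returns that zone's value. Real shapefile geometry, map projection, and the choice of walking distance would need to follow the thesis itself.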