Title: Enhancing the Quality of Transferred Household Travel Survey Data: A Bayesian Updating Approach Using MCMC with Gibbs Sampling
- Yongping Zhang
- Kouros Mohammadian, PhD
- Department of Civil and Materials Engineering
- University of Illinois at Chicago
The 11th TRB National Transportation Planning Applications Conference, May 7, 2007
2. Data Transferability
- The idea is to use data collected in one context in a new context. This can reduce or eliminate the need for a large data collection effort in the application context.
- Previous Studies
  - ITE trip generation tables
  - NCHRP 365 (Nancy McGuckin, et al.)
    - Highly aggregate
  - ORNL's NPTS/NHTS transferability study (Pat Hu, et al.)
    - Aggregate (census tract level)
  - Data simulation (Stopher and Greaves)
    - Disaggregate (HH level), CRT classification method, limited number of independent variables
3. Project Approach
- Consider a larger set of variables
  - NHTS and CTPP datasets
- Use quantifiable variables that can be easily predicted or are available from other sources (e.g., PUMS)
- Consider variables representing land use, urban form, and transportation system characteristics
- Advanced clustering, updating, and simulation approaches
4. Data
- Data Sources
  - 2001 NHTS, 2000 CTPP, PUMS, 2003 TTI, TIGER/Line GIS data files
- Data Cleaning
- 33 variables of demographics, socio-economics, and land use
  - Individual level: age group, race/ethnicity, education, occupation
  - Household level: HH size, income, adults, vehicles, drivers, workers
  - Census tract level: housing, employment, and population densities
- New Variables
5. New Variables
- Intersection density (TIGER/Line)
  - No. of intersections / Area
- Road density (TIGER/Line)
  - Road length / Area
- Pedestrian environment (TIGER/Line)
  - Block size = Road length / No. of intersections
- Transit-friendly environment (CTPP)
  - Transit users / Total no. of workers
  - Transit trips / Total no. of trips
- Congestion factor
  - Travel time index (TTI report for 85 MSAs)
  - Avg. travel time / Free-flow travel time in that region
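The ratios listed on this slide can be sketched as a small helper. This is only an illustration: the function name, argument names, units (miles, square miles, minutes), and the example tract values are assumptions, not taken from the project.

```python
def contextual_variables(n_intersections, road_length, area,
                         transit_workers, total_workers,
                         avg_travel_time, free_flow_time):
    """Compute the tract-level ratios named on the slide.

    Assumed units (hypothetical): road_length in miles, area in square
    miles, travel times in minutes.
    """
    if min(n_intersections, area, total_workers, free_flow_time) <= 0:
        raise ValueError("all denominators must be positive")
    return {
        "intersection_density": n_intersections / area,
        "road_density": road_length / area,
        # Block size proxy for the pedestrian environment
        "block_size": road_length / n_intersections,
        # Share of workers commuting by transit
        "transit_worker_share": transit_workers / total_workers,
        # Congestion factor: average over free-flow travel time
        "travel_time_index": avg_travel_time / free_flow_time,
    }


# Made-up census tract, for illustration only
tract = contextual_variables(
    n_intersections=120, road_length=30.0, area=1.5,
    transit_workers=400, total_workers=2000,
    avg_travel_time=39.0, free_flow_time=30.0,
)
```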
6. Dependent Variables
- Travel characteristics (from the NHTS trip file, aggregated to HH level)
  - VMT for each household
  - No. of trips
  - No. of mandatory trips
  - No. of maintenance trips
  - No. of discretionary trips
  - No. of transit trips in the HH
  - No. of private vehicle trips
  - No. of non-motorized (bicycle and walk) trips
  - No. of tours
  - Average trips per tour
  - Average trip distance in miles for all HH members
  - No. of transit users in the HH
  - No. of carpool users in the HH
  - Percentage of public transit usage in the HH
  - Percentage of carpool usage among workers in the HH
  - Total commute distance in the HH
  - Average commute distance in the HH
8. Clustering
- The classification schema is a critical issue
- Clustering methods tested include K-Means, hierarchical, CRT, TwoStep, and ANN
- 11 clusters were generated using the TwoStep clustering method
- Only national data is used
9. Clusters
- Rich and Smart
  - middle-aged families
  - professional or managerial white-collar jobs
  - graduate degrees
  - high incomes
  - majority live in suburbs
  - mostly White, but also some Asian
- Young Achievers
  - young couples without children or mainly with pre-school children
  - college degrees
  - white-collar jobs in sales, service, technical, and professional fields
  - mid-range income
  - higher percentages live in suburban or rural areas
- Kids-centered Families
  - middle-aged, working-class families
  - pre-school and school-age children
  - usually have college education
  - mid-range to high income
  - primarily White and live in suburbs or towns
10. Clusters, cont.
- Rural Blues
  - working-class, middle-aged families
  - pre-school and school-age children
  - mainly high school graduates
  - blue-collar jobs (farming, manufacturing, etc.)
  - low to mid-range income
  - mostly White and mainly live in rural areas or small towns
- Working Mixing Pot
  - working-class White, Black, Asian, or Hispanic
  - single adults or couples
  - college or high school education
  - low to mid-range income
- Mainstream Families
  - mid-scale, upper-middle-aged, White
  - large working-class couples or families with older children
  - college or high school education
  - mid-range to high income
  - suburban or rural areas
11. Clusters, cont.
- Senior Couples
  - senior couples
  - majority working, some retired
  - mostly White, but including some Black, Asian, or American Indian
  - suburban or rural areas
- Sustaining Minority Families
  - low income
  - middle-aged, working-class families
  - mainly Hispanic or Black, but also some Asian and White
  - majority have not finished high school
  - service, sales, manufacturing, farming, or construction jobs
- Forever Youngs
  - White senior couples, empty nesters
  - mostly retired, but some have sales, service, or managerial jobs
  - low to mid-range income
12. Clusters, cont.
- Traditional Seniors
  - mainly retired single individuals and some retired couples
  - low income
  - majority are White, but some are Black, Asian, or American Indian
- Neo Urbans
  - small families/couples or single individuals
  - dense urban areas
  - college education
  - low to mid-range income
  - sales, service, or professional jobs
  - dominant race is White, but a significant number are Black, Asian, and Hispanic
13. Cluster-Based Travel Characteristics
15. Transferability
- An ANN model (with genetic algorithm) is used to simulate cluster membership as a function of 11 factors for each HH in the add-on datasets
- The model has 92.4% prediction potential
- Travel characteristics are transferred from national clusters to the add-on data according to their cluster membership
- Weighted observed and predicted travel characteristics are compared
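The transfer and comparison steps on this slide can be sketched as follows. It is a minimal illustration that assumes each household already has a predicted cluster (from the ANN step, which is not reproduced here) and that the transferred value is the weighted national cluster mean. The record layout, field names, and toy values are hypothetical.

```python
from collections import defaultdict


def weighted_mean(values, weights):
    """Weighted average of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def national_cluster_means(national, attribute):
    """Weighted mean of one travel attribute per national cluster."""
    sums, weights = defaultdict(float), defaultdict(float)
    for hh in national:
        sums[hh["cluster"]] += hh[attribute] * hh["weight"]
        weights[hh["cluster"]] += hh["weight"]
    return {c: sums[c] / weights[c] for c in sums}


def transfer(addon, means):
    """Give each add-on household the national mean of its cluster."""
    return [means[hh["cluster"]] for hh in addon]


# Toy records (made up): cluster label, survey weight, trips per person
national = [
    {"cluster": 1, "weight": 2.0, "trips": 4.0},
    {"cluster": 1, "weight": 1.0, "trips": 1.0},
    {"cluster": 2, "weight": 1.0, "trips": 6.0},
]
addon = [
    {"cluster": 1, "weight": 1.0, "trips": 3.5},
    {"cluster": 2, "weight": 3.0, "trips": 5.0},
]

means = national_cluster_means(national, "trips")
predicted = transfer(addon, means)
addon_weights = [hh["weight"] for hh in addon]
observed_mean = weighted_mean([hh["trips"] for hh in addon], addon_weights)
predicted_mean = weighted_mean(predicted, addon_weights)
```

Comparing `observed_mean` with `predicted_mean`, overall or cluster by cluster, corresponds to the weighted comparison described in the last bullet.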
16. Comparison of Weighted Trip Count per Person
17. Comparison of Weighted Mandatory Trips per Person
18. Original Comparison of Transit Usage: Not So Good, Some Clusters Need Improvement
- Compared to the number of trips, the prediction of transit usage is not as good.
- Clusters 5, 8, 10, and 11 show significant differences and need improvement.
19. Improvement to Clusters Using CRT
- 1. The first level of the tree is grown on the number of vehicles in the household (owns a vehicle or not).
- 2. The improvement of the model due to this level is defined as improvement / (variance of Node 0).
- For example, here 0.0017 corresponds to 13.3%, 0.009 to 7.05%, and 0.0002 to 1.57%.
- Total model improvement is about 22%.
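The three shares quoted on this slide (13.3, 7.05, and 1.57) add up to 21.92, which matches the stated total of about 22. The sketch below shows how such a share can be computed for a single binary split. It assumes the common regression-tree definition of improvement (root variance minus the size-weighted variance of the child nodes), which may differ in detail from the software used in the study. The transit-usage values are made up.

```python
import statistics


def split_improvement(left, right):
    """Variance reduction of a binary split and its share of root variance.

    Assumes improvement = Var(node 0) - weighted child variances,
    using population variances.
    """
    root = left + right
    n = len(root)
    root_var = statistics.pvariance(root)
    child_var = (len(left) / n) * statistics.pvariance(left) \
        + (len(right) / n) * statistics.pvariance(right)
    improvement = root_var - child_var
    return improvement, improvement / root_var


# Hypothetical transit-usage shares for households without / with a vehicle
no_vehicle = [0.4, 0.6]
has_vehicle = [0.0, 0.2]
improvement, share = split_improvement(no_vehicle, has_vehicle)
```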
20. Considering Distributions: Trip Rate
A nice match is shown; however, this is not always the case. How can the transferability be improved?
21. Considering Distributions: Trip Distance
Not so good; needs to be improved.
22. Considering Distributions
- Various distributions were fitted to the dataset, including:
  - Normal, Gamma, Weibull, Exponential, Max Extreme, Lognormal, Logistic, Student's t, Min Extreme, Triangular, General Beta, Pareto, Uniform, Binomial, Geometric, Hypergeometric, and Poisson.
- The fitting results are interpreted by:
  - examining the rankings of the three fit statistics: A-D, K-S, and Chi-squared
  - visually judging the plots, density, and cumulative curves
  - p-values and critical values at different significance levels
- Non-normal distributions are dominant (e.g., Gamma)
23. Gamma Distribution
Gamma function
- k > 0 is the shape parameter
- θ > 0 is the scale parameter
- the location parameter determines where the origin is located
PDF
CDF
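The formulas for this slide did not survive in the transcript. For reference, the standard three-parameter Gamma distribution with shape k, scale θ, and a location parameter can be written as below. The symbol L for the location parameter is an assumption; the slide does not name it.

```latex
% Gamma function
\Gamma(k) = \int_0^{\infty} t^{k-1} e^{-t}\, dt

% PDF, for x > L
f(x; k, \theta, L) = \frac{(x - L)^{k-1}\, e^{-(x - L)/\theta}}{\Gamma(k)\, \theta^{k}}

% CDF, with \gamma(\cdot,\cdot) the lower incomplete gamma function
F(x; k, \theta, L) = \frac{\gamma\!\left(k, \frac{x - L}{\theta}\right)}{\Gamma(k)}

% Mean and variance
\mathrm{E}[X] = L + k\theta, \qquad \mathrm{Var}[X] = k\theta^{2}
```

The mean formula agrees with the fitted values reported in the summary of updating results: for the national trip rate of cluster 2, -0.83 + 5.42 × 0.88 ≈ 3.94.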
24. Fitted Distribution with Parameters for Each Variable by Cluster
26. Bayesian Updating
- Local updating can significantly improve the quality of the transferred data
  - Bayesian updating is used
- Traditionally, the transferability literature has studied only variables with normal distributions, because the posterior is simple to calculate from a normal prior and likelihood.
- In practice, the variables of interest (i.e., the likelihood) can take various distributional forms.
27. Bayesian Updating
- f(x|θ) is the probability function for the observed data x (i.e., the local sample), given the unknown parameter θ,
- g(θ) is the prior distribution for θ,
- k(θ|x) is the posterior distribution for θ given the observed data x
- The technique can be expanded to situations when no prior data is available.
- The analyst can do successive updating,
  - using the new information without losing the gains from the old one.
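The equation relating these three quantities is missing from the transcript. In the slide's notation, Bayes' theorem takes the standard form below. The second line illustrates successive updating, under the added assumption that the two samples x_1 and x_2 are conditionally independent given θ.

```latex
k(\theta \mid x) = \frac{f(x \mid \theta)\, g(\theta)}{\int f(x \mid \theta)\, g(\theta)\, d\theta}
\;\propto\; f(x \mid \theta)\, g(\theta)

% Successive updating: the posterior from x_1 serves as the prior for x_2
k(\theta \mid x_1, x_2) \;\propto\; f(x_2 \mid \theta)\, k(\theta \mid x_1)
```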
28. Bayesian Updating (2)
- The national sample of NHTS 2001 is used as the source for the prior information
- A small local sample is randomly selected from the NY add-on, leaving the rest for validation
- The bootstrap method is used to resample the data and justify the prior distribution assumptions of the parameters of interest (i.e., scale and shape for Normal distribution),
  - Normal distribution is fitted to each of the resample datasets.
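One possible reading of the bootstrap step is sketched below: resample the national data with replacement, estimate the Gamma shape and scale on each resample, and summarize each parameter's bootstrap distribution by a Normal mean and standard deviation that can serve as a prior. This reading, the method-of-moments estimator, the location fixed at zero, and the synthetic data are all assumptions; the slides do not specify these details.

```python
import random
import statistics


def gamma_moments_fit(sample):
    """Method-of-moments Gamma fit with location fixed at 0 (assumed)."""
    mean = statistics.fmean(sample)
    var = statistics.variance(sample)
    return mean * mean / var, var / mean  # (shape, scale)


def bootstrap_prior(sample, n_resamples=500, seed=1):
    """Normal (mean, sd) summary of bootstrapped shape and scale."""
    rng = random.Random(seed)
    n = len(sample)
    fits = [gamma_moments_fit(rng.choices(sample, k=n))
            for _ in range(n_resamples)]
    shapes = [shape for shape, _ in fits]
    scales = [scale for _, scale in fits]
    return {
        "shape": (statistics.fmean(shapes), statistics.stdev(shapes)),
        "scale": (statistics.fmean(scales), statistics.stdev(scales)),
    }


# Synthetic stand-in for one cluster of the national sample (not real data)
data_rng = random.Random(7)
national_cluster = [data_rng.gammavariate(3.0, 1.2) for _ in range(400)]
prior = bootstrap_prior(national_cluster)
```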
29. Bayesian Updating (3)
- Then, Markov Chain Monte Carlo (MCMC) simulation with Gibbs sampling is used to update the prior with the small local sample.
- Assuming the updated variables of interest are still Gamma distributed, the posteriors of the parameters are used to derive the updated means and SDs of the variables.
- Updated parameters are then compared with the validation data and national data to test the effectiveness of the updating procedure.
- The comparisons show that significant improvement is achieved.
- The improvement increases with the local sample size
  - a relatively cost-effective sample size is suggested
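The slides do not give the sampler's details, so the following is only a sketch of how such an update could look for Gamma-distributed data. It swaps in a Metropolis-within-Gibbs scheme: the rate (1/scale) is drawn from its conjugate Gamma conditional, and the shape is updated with a random-walk Metropolis step because its conditional has no standard form. The Normal prior on the shape, the Gamma prior on the rate, the zero location, and all numeric values are assumptions.

```python
import math
import random
import statistics


def update_gamma(local, shape_prior, rate_prior,
                 n_iter=4000, burn_in=1000, step=0.3, seed=0):
    """Posterior draws of (shape, scale) for Gamma data with location 0.

    shape_prior: (mean, sd) of a Normal prior on the shape, kept > 0.
    rate_prior:  (a0, b0) of a Gamma prior on the rate = 1 / scale.
    """
    rng = random.Random(seed)
    n = len(local)
    sum_x = sum(local)
    sum_log_x = sum(math.log(x) for x in local)  # requires x > 0
    m0, s0 = shape_prior
    a0, b0 = rate_prior

    def log_cond_shape(k, rate):
        # Log of the shape's full conditional, up to a constant
        if k <= 0.0:
            return -math.inf
        return (n * k * math.log(rate) - n * math.lgamma(k)
                + (k - 1.0) * sum_log_x - 0.5 * ((k - m0) / s0) ** 2)

    k = m0
    draws = []
    for it in range(n_iter):
        # Gibbs step: rate | shape, data ~ Gamma(a0 + n*k, rate b0 + sum_x)
        rate = rng.gammavariate(a0 + n * k, 1.0 / (b0 + sum_x))
        # Metropolis step for the shape (symmetric random-walk proposal)
        proposal = k + rng.gauss(0.0, step)
        log_ratio = log_cond_shape(proposal, rate) - log_cond_shape(k, rate)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            k = proposal
        if it >= burn_in:
            draws.append((k, 1.0 / rate))
    return draws


# Synthetic local sample of 75 households (not real data)
sample_rng = random.Random(42)
local_sample = [sample_rng.gammavariate(3.0, 1.2) for _ in range(75)]
draws = update_gamma(local_sample, shape_prior=(3.0, 1.0), rate_prior=(1.0, 1.0))

# Updated mean and SD of the variable, assuming it stays Gamma distributed
updated_mean = statistics.fmean(k * s for k, s in draws)
updated_sd = statistics.fmean(math.sqrt(k) * s for k, s in draws)
```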
30.
- Root Mean Square Error (RMSE) decreases as the sample size increases.
- There is instability when the sample size within each cluster is smaller than 45 observations.
- A sample size of 75 per cluster seems to be the most cost-effective plan.
31. Updating Results
- Updated mean values move significantly closer to the validation data.
32. Summary of Updating Results

Trip Rates per Person
        | National                      | National-updated              | State of New York
Cluster | Location  Shape  Scale   Mean | Location  Shape  Scale   Mean | Location  Shape  Scale   Mean
2       |    -0.83   5.42   0.88   3.94 |    -0.83   5.15   0.92   3.91 |    -0.30   3.47   1.14   3.66
3       |    -3.13  12.31   0.61   4.38 |    -3.13  12.05   0.61   4.22 |    -1.66   8.44   0.67   3.99
4       |    -0.99   6.42   0.77   3.95 |    -0.99   6.05   0.80   3.85 |    -0.42   4.43   0.89   3.53
8       |    -0.13   3.14   1.15   3.48 |    -0.13   2.90   1.12   3.13 |     0.18   2.40   1.24   3.16
11      |     0.04   2.52   1.47   3.75 |     0.04   2.44   1.45   3.58 |     0.32   2.20   1.40   3.39

Trip Distance per Person
        | National                      | National-updated              | State of New York
Cluster | Location  Shape  Scale   Mean | Location  Shape  Scale   Mean | Location  Shape  Scale   Mean
2       |    -0.09   1.45  21.28  30.67 |    -0.09   1.34  21.04  28.10 |    -0.07   1.32  20.84  27.33
3       |    -0.49   1.68  18.91  31.18 |    -0.49   1.62  18.93  30.18 |     0.11   1.53  19.31  29.62
4       |    -0.22   1.61  18.55  29.59 |    -0.22   1.45  19.98  28.75 |    -0.02   1.30  20.59  26.67
5       |    -0.09   1.20  24.93  29.93 |    -0.09   1.20  24.03  28.84 |    -0.09   1.19  23.97  28.36
6       |    -0.43   1.91  18.12  34.18 |    -0.43   1.89  18.22  34.01 |    -0.08   1.58  21.40  33.69
7       |     0.11   1.48  22.69  33.58 |     0.11   1.54  21.69  33.51 |    -0.08   1.52  20.75  31.55
8       |    -0.12   1.06  24.08  25.38 |    -0.12   1.03  24.03  24.63 |    -0.09   0.90  22.91  20.53
9       |    -0.09   1.16  21.43  24.72 |    -0.09   1.16  22.23  25.65 |    -0.03   1.17  22.17  25.91
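As a rough check on the trip-rate means in this summary, the sketch below computes the RMSE of the national and the national-updated means against the State of New York means across the five listed clusters. Treating the New York column as the reference and taking the RMSE over cluster means are choices made here for illustration; the RMSE reported earlier in the deck may be defined differently.

```python
import math

# Trip rate means per cluster, copied from the summary table
national = {2: 3.94, 3: 4.38, 4: 3.95, 8: 3.48, 11: 3.75}
updated = {2: 3.91, 3: 4.22, 4: 3.85, 8: 3.13, 11: 3.58}
new_york = {2: 3.66, 3: 3.99, 4: 3.53, 8: 3.16, 11: 3.39}


def rmse(estimate, reference):
    """Root mean square error over the clusters in the reference."""
    squared = [(estimate[c] - reference[c]) ** 2 for c in reference]
    return math.sqrt(sum(squared) / len(squared))


rmse_national = rmse(national, new_york)
rmse_updated = rmse(updated, new_york)
```

On these five clusters, the error falls from about 0.36 before updating to about 0.23 after updating.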
34. Population Synthesizing and Travel Data Simulation
- Using PUMS data, the NYC population is synthesized.
- All of the contextual factors were calculated for each HH.
- A synthetic population with all 33 required variables was generated.
- Using the ANN model, cluster memberships are obtained.
- Travel data are simulated for each HH using Monte Carlo simulation of each travel attribute with the updated parameters of the fitted distributions.
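The last step can be sketched as a draw from a three-parameter Gamma distribution for each synthetic household, using the parameters of its cluster. The parameter values below are the national-updated trip-rate parameters from the summary table. The function name, the household list, and the way cluster labels are supplied are hypothetical, and the population synthesis and ANN steps are not reproduced.

```python
import random

# cluster: (location, shape, scale), national-updated trip rate per person
UPDATED_TRIP_RATE = {
    2: (-0.83, 5.15, 0.92),
    3: (-3.13, 12.05, 0.61),
    4: (-0.99, 6.05, 0.80),
    8: (-0.13, 2.90, 1.12),
    11: (0.04, 2.44, 1.45),
}


def simulate_trip_rates(households, params, seed=0):
    """Draw one trip rate per household from its cluster's Gamma fit.

    households: iterable of (household_id, cluster) pairs.
    Note: with a negative location a draw can fall below zero; the
    slides do not say how such draws were handled.
    """
    rng = random.Random(seed)
    simulated = []
    for hh_id, cluster in households:
        location, shape, scale = params[cluster]
        simulated.append((hh_id, location + rng.gammavariate(shape, scale)))
    return simulated


# Hypothetical synthetic households, all assigned to cluster 2
synthetic = [(i, 2) for i in range(20000)]
rates = simulate_trip_rates(synthetic, UPDATED_TRIP_RATE)
mean_rate = sum(rate for _, rate in rates) / len(rates)
```

With enough households, `mean_rate` should approach location + shape × scale for the cluster, which is about 3.91 for cluster 2.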
35. Comparison of Simulated and Add-on NYC Samples (Trips per Person)
36. Comparison of Simulated and Add-on NYC Samples (Trip Distance per Person)