Title: Machine Vision for Urban Model Capture: Exploiting Scale, Achieving Automation

1. Machine Vision for Urban Model Capture: Exploiting Scale, Achieving Automation
Seth Teller, MIT Graphics Group
Joint work with Eric Amram, Matthew Antone, Michael Bosse, Satyan Coorg, Doug DeCouto, Manish Jethwa, Neel Master, Ivan Petrakiev, Qixiang Sun, Franck Taillandier, Stefano Totaro, Xiaoguang Wang, et al.
2. Example Dataset
- 500 nodes spanning 500 meters
- 10,000 HDR images (Debevec & Malik '97)
- 50,000 raw megapixel images
- Most node pairs are entirely unrelated!
3. Tackling Scale, Extent, Automation
- Scale: number of input images, features
- Extent: size and scope of the acquisition region
- Automation: end-to-end from sensor to CAD, from 2D images to a 3D model
4. Motivation
- Why automate? Good interactive tools exist.
  - Yes, but they are either scale-limited or very expensive, and they produce models valid only for a restricted set of viewpoints.
- The system bottleneck is human interaction time:
  - Façade: 80 man-hours to model 20 buildings representing a tower and its surround (Debevec '01)
  - LA Basin: 100,000 man-hours (50 man-years) to model 15,000 urban structures (Jepson '00)
- All require skilled operators and careful view planning.
- Interactive tools are not on the technology curve! Automation: where do you go from here?
5. Four Key Ideas
Sensor:
1. Metadata captured with each image
2. Omni-directional (wide-FOV) images
Algorithmics:
3. Projective, probabilistic uncertainty
4. Asymptotically linear running times
Tradeoff: extract one expensive human; add many cheap images.
6. 1. Camera Metadata
- Intuition: metadata identifies images likely to overlap.
- Without metadata: N images take O(N²) time to discover overlaps.
- With metadata: N images take O(N) time to discover overlaps.
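The O(N) overlap discovery can be sketched as a spatial-hash query: only images whose recorded camera positions fall in the same or adjacent grid cells are candidate overlaps. This is a minimal illustrative sketch, not the talk's implementation; the function name and 2D positions are hypothetical.

```python
from collections import defaultdict

def find_candidate_overlaps(nodes, radius):
    """Bucket camera positions into a grid of cell size `radius`;
    only nodes in the same or adjacent cells can possibly overlap,
    so total work stays near-linear in the number of images."""
    grid = defaultdict(list)
    for i, (x, y) in enumerate(nodes):
        grid[(int(x // radius), int(y // radius))].append(i)
    pairs = set()
    for (cx, cy), members in grid.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in grid.get((cx + dx, cy + dy), []):
                    for i in members:
                        if i < j:
                            pairs.add((i, j))
    return pairs
```

With bounded node density per cell, the candidate set, and hence the matching work, grows linearly with N instead of quadratically.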
7. 2. Omni-directional Imagery
- Intuition: avoids the classical aperture problem.
- Integrating the surround is a fundamental advantage.
- Yields superior robustness and accuracy.
8. 3. Projective, Probabilistic Uncertainty
- Intuition: account for noisy features and noisy pose.
- Bingham densities (1974) generalize Gaussians to the sphere.
- Completely specified by parameters k1 ≤ k2 ≤ k3 = 0.
- Advantages:
  - Appropriate from a theoretical standpoint (projective)
  - Can fuse noisy features into accurate aggregate estimates
  - Can defer or avoid hard (deterministic) decisions
(Density classes: Symmetric Polar, Symmetric Equatorial, Asymmetric Polar, Asymmetric Equatorial, Uniform.)
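The Bingham family can be sketched concretely: up to normalization, p(x) ∝ exp(Σᵢ κᵢ (vᵢᵀx)²) for a unit vector x, orthonormal axes vᵢ, and concentrations κ1 ≤ κ2 ≤ κ3 = 0. The density is antipodally symmetric, which suits projective directions. A minimal, unnormalized sketch (the axes and κ values below are illustrative, not from the talk):

```python
import numpy as np

def bingham_unnorm(x, V, kappa):
    """Unnormalized Bingham density p(x) ∝ exp(Σ κ_i (v_iᵀ x)²) for a
    unit vector x; columns of V are orthonormal axes, κ1 ≤ κ2 ≤ κ3 = 0.
    Antipodally symmetric: p(x) = p(-x), as a projective density must be."""
    x = np.asarray(x, float)
    return float(np.exp(sum(k * (V[:, i] @ x) ** 2 for i, k in enumerate(kappa))))

V = np.eye(3)                 # principal axes (illustrative)
kappa = (-50.0, -50.0, 0.0)   # tight concentration about the ±z axis
```

With these parameters the mass concentrates near ±z (a "polar" density); making κ1 ≪ κ2 instead produces equatorial, girdle-like densities.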
9. 4. Linear Asymptotics
- Intuition: crucial for spatial extent and free viewpoint.
- Capture a km² at 1-cm resolution: 10^10 cm² fragments.
- 10^4 megapixel images are needed just to observe each fragment once!
- Need several views, and several pixels, per fragment.
- Bottom line: need at least 10^5 images per km².
(Figure: 1-cm² fragments; megapixel camera; 1 kilometer.)
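The counting argument above can be checked directly; the script reuses the slide's numbers (1 km², 1-cm fragments, megapixel images) and takes three views of three pixels each as an illustrative lower bound.

```python
AREA_CM2 = 100_000 ** 2        # 1 km² = 10^5 cm on a side → 10^10 cm² fragments
PIXELS_PER_IMAGE = 10 ** 6     # megapixel camera
fragments = AREA_CM2           # one 1-cm² fragment per cm²

# Images needed to observe every fragment exactly once, one pixel each:
images_single_cover = fragments // PIXELS_PER_IMAGE        # 10^4

# Several views (≥3) and several pixels (≥3) per fragment:
images_needed = images_single_cover * 3 * 3                # ≈ 10^5
```

Rounding 9×10^4 up to the nearest order of magnitude gives the slide's bound of at least 10^5 images per km².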
10. Talk Overview
- Introduction & Motivation
- Key Ideas
- Automated 6-DOF Image Registration
- Large-scale image datasets
- Large-scale model extraction
- Long-term goals, vision
- Conclusion
11. Image Registration (Bundle Adjustment)
- Traditional method: manual correspondence.
- Several limitations:
  - Infeasible for large data sets
  - Numerical instability
  - Human inability to match
12. Automated Registration (with Matthew Antone)
- Intuition: each node
  - detects a local, rigid frame in the scene,
  - then aligns itself to its neighbors.
- Break the 6 DOFs into 3 + 2 + 1 DOFs.
13. Detecting the Rigid 3-DOF Frame
- Assume 2+ distinct vanishing points are visible; VPs are found in many scenes (Coughlan and Yuille, NIPS 2000).
- Dualize each edge to a point on the unit sphere.
(CVPR 2000)
14. Transforming every edge from a single image yields evident band structure on the sphere, but not enough structure to reliably estimate (or even identify!) the scene vanishing points.
15. However, transforming edges from every image in a wide-FOV node yields strong band structure!
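The edge-to-sphere dualization can be sketched as follows: an image edge, together with the camera center, spans a plane, and the plane's unit normal is the edge's dual point on the sphere. Edges converging on a common vanishing direction d all satisfy n · d = 0, so their duals lie on d's great circle, producing the band structure. A minimal sketch (illustrative, not the talk's code):

```python
import numpy as np

def edge_normal(p1, p2):
    """Dual of an image edge: the unit normal of the plane through the
    camera center (origin) and the two edge-endpoint ray directions."""
    n = np.cross(p1, p2)
    return n / np.linalg.norm(n)

# Rays toward any 3D line with direction d give duals n with n · d = 0,
# i.e. points on the great circle whose pole is the vanishing direction d.
```

Estimating a vanishing point thus reduces to fitting a great circle (equivalently, its pole) to a band of dual points.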
16. Registration Algorithm (3 DOFs)
- Reduce each node to a few accurately estimated VPs.
- Cost: tens of CPU-seconds per node (1000s of edges).
17. Propagation Step: Alignment to Neighbors
- This is O(n⁴) in the number of VPs (usually n < 5).
- Achieves rotational registration to 0.05° (about 1 pixel).
- Note: no individual feature matching; outperforms a human!
18. Advantages of Wide-FOV Images
- More VPs (3.2–3.6 per node vs. 0.3–0.7 per image)
- More accurate VPs (by a factor of 1 to 3)
(Figure: Hough-transform peak strength; VP variance.)
- Registration is more robust and more accurate with wide FOV (contrast to Collins & Weiss '90; Becker '95; Leung & McLean '96).
19. Translational Registration (2 DOFs)
- Requires persistent, local observations; we use sub-pixel point features, taken from valid VPs only.
20. Spherical Epipolar Geometry
- Any two rotationally registered nodes are related by a pure (3-DOF) translation.
- All world points must move on a pencil of epipolar great circles through the (2-DOF) translation direction.
- Problem: we don't know the direction, or the matches!
21. Registering Translations
- Position A: point features on the Gaussian sphere.
- Position B: point features on the Gaussian sphere.
- Overlaid point features: rotationally aligned, with significant overlap.
22. Finding the Translation Direction (2 DOFs)
- Hough-transform all plausible point pairs to great circles.
- Great circles reinforce at/near the true baseline direction.
- Yields the direction (up to scale) relating nodes A and B.
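The great-circle voting can be sketched concretely: after rotational alignment, matched unit rays p (node A) and q (node B) must be coplanar with the baseline t, so t · (p × q) = 0; each plausible pair therefore votes for every candidate direction near its great circle. This is a minimal illustrative sketch with a coarse candidate list, not the talk's accumulator:

```python
import numpy as np

def hough_baseline(matches, candidates, tol=0.05):
    """Each plausible match (p, q) of unit rays votes for the candidate
    baseline directions t lying near its great circle, i.e. those with
    (p × q) · t ≈ 0; votes reinforce at the true baseline direction."""
    votes = np.zeros(len(candidates))
    for p, q in matches:
        n = np.cross(p, q)
        votes += np.abs(candidates @ (n / np.linalg.norm(n))) < tol
    return candidates[int(np.argmax(votes))]
```

Even pairs that are not true matches vote, but their circles spread diffusely over the sphere while correct pairs pile up at (plus or minus) the baseline.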
23. Probabilistic Matching
- As in Wells (IJCV 1997), Chui et al. (CVPR 2000), Dellaert et al. (CVPR 2000), but our approach:
  - Handles unknown numbers of features and matches
  - Handles unknown occlusion and outliers
  - Searches only 2 DOFs (not 5: rotation + translation)
  - Scales to hundreds of thousands of features
24. Probabilistic Matching (cont.)
- For each adjacent node pair, the i/j feature match is represented as M_ij ∈ [0,1].
- Swap, split, and join operations mutate valid matches and add/remove outliers.
- The matrix is mutated 10–100,000 times in a Monte-Carlo E-M process.
- Within an MC step, new binary matrices are produced as a Markov chain, with transition probabilities based on geometric error; averaging these produces M_ij.
- In practice, soft matching of 2,000 features converges within 10 E-M steps.
- Strong prior from the HT: 30 sec.; one E-M step: 20 sec.
25. Baseline refinement steps:
- Initial baseline estimate; overlaid point features.
- Fix the baseline estimate; form plausible match weights.
- Fix the baseline estimate; mutate the match weights.
- Fix the match weights; re-fit the baseline estimate.
- Result: an accurate baseline.
26. Error Study (Synthetic Data)
End-to-end baseline error is linear in feature noise.
27. Robustness Study (Synthetic Data)
Baseline estimation is robust up to 90 percent outliers.
28. Advantage of Wide FOV (Real Data)
Hough-transform peak strength as FOV increases.
29. Final DOF: Fix Absolute Scale
- Input: baseline directions. Output: edge lengths.
- Solve a linear system with C-LAPACK (uses about 1 CPU-minute).
- Register to initial GPS position estimates for geo-referencing.
30. End-to-End Consistency Measures
Width of distributions at 95% confidence:
- Variety of node sets, spanning hundreds of meters
- Typical inter-node baseline: 10–20 meters
- Hundreds of thousands of edge and point features
- Node positions consistent to 5 centimeters
- Node orientations consistent to 0.1 degrees
- Epipolar geometry consistent to 4 pixels
31. End-to-End Epipolar Geometry
- After rotational alignment
- After 6-DOF alignment
32. Comparison to Manual Bundle Adjustment
Manual
Automated
33. Performance in the Presence of Clutter
Automated
Manual
34. Performance with Poor Initial Pose
Initial
Refined
Robust to initial pose errors of up to 7 meters and 17 degrees.
35. Currently Placing All Data On-line
- URL: http://city.lcs.mit.edu/data
- Time-stamped, calibrated, HDR (log-radiance) images
- Intrinsic calibration; sub-pixel edge and point features
- Geo-referenced 6-DOF pose (ECEF metric units)
- Interactive browsing (download facility underway)
36. Application: 3D Reconstruction
- Voxels, silhouettes (Potmesil '87; Szeliski '93; Kanade et al. '95)
- Space-sweep, voxel coloring (Collins '96; Szeliski & Golland '98; Seitz & Dyer '99; Kutulakos & Seitz '00; etc.)
- T(N,V) = O(NV) with N images, V voxels
- This grows with the square of the reconstruction volume
- K-S '00: N = 16, V = 5×10^7, T = 4 CPU-hours
- Our dataset: N = 10^4, V = 10^11 (campus @ 10 cm)
- Extrapolating, T(N,V) ≈ 100 CPU-years
37. Asymptotic Improvement (with Manish Jethwa)
- Intuition: distant image pairs shouldn't interact.
- Let each image affect only a constant number of voxels.
- Now T(N,V) = O(N + V).
- Grows linearly with the reconstruction volume and the number of images.
- For our dataset, we estimate a few CPU-weeks.
- Must also handle unknown background and clutter.
38. What's Next (1)
- Scalable 3D reconstruction
39. What's Next (2)
- New operating regimes (with Michael Black)
- Omni-video
  - Different: 30 Hz, low resolution, short baselines
- Architectural interiors
  - Different: dimensions, illumination, clutter
40. What's Next (3)
- Robotic image acquisition (with Draper Labs)
- Autonomous helicopter with 6-DOF navigation
- On-board omni-directional video camera
- Eventually: simultaneous cooperative capture
41. Conclusions
- Ideas for automation, scaling, view freedom
- Metadata, Wide-FOV, Uncertainty, Asymptotics
- Enable fundamentally new capability
- Controlled image acquisition over wide areas
- Datasets of interest to IBR, vision communities
- Long-term project vision, goals
- More general operating regimes
- Robotic image acquisition, model capture
42. Further Information
- http://city.lcs.mit.edu
- http://city.lcs.mit.edu/data
- http://graphics.lcs.mit.edu
- http://graphics.lcs.mit.edu/publications.html
Thanks to: NSF, DARPA, ONR; Intel, Interval, NTT.
44. Conclusions
- Fully automated model acquisition is possible.
- An augmented sensor allows spatial and input scaling, and replaces human-aided initialization in classical algorithms.
- Spherical images are a fundamental enabling technique (more than simply a practical advantage).
- Ensemble features and low-DOF optimization eliminate the need for hard feature correspondence.
- Large numbers of images can overcome even severe clutter and occlusion, efficiently.
- The end-to-end architecture provides an effective testbed for algorithms for registration, reconstruction, and high-fidelity (domain-specific) scene element extraction.
45. System Limitations
- Limited spatial extent and number of structures. In progress: acquisition of the MIT campus (1 km²).
- Vertical façades only; rooftops procedural. In progress: richer shape primitives (model selection).
- Diffuse lighting, diffuse surfaces. In progress: directional, inverse global illumination; use of prior knowledge about common materials.
- Foliage removed via median statistics and masking. In progress: foliage segmentation, tree modeling.
- Validation of results. In progress: independent navigation, structure survey.
46. Metadata: Operational Advantages
- If scene-relative:
  - O(N) asymptotics; parallel image capture
  - Makes interactive initialization unnecessary
- If Earth-relative:
  - Sun direction from time; geo-referencing
  - Output can be overlaid with existing GIS data
47. Projective Features
- Antipodal equivalence
- Natural duality:
  - 3D edge ↔ 1-D family of coplanar points
  - 3D point ↔ 1-D family of copunctual lines
(Figure: a line and its dual; 3-D edge; 3-D point; focal point; pencil of lines; image plane; great circle.)
73. Maps Are Fundamental
John Speed, 1626. Image courtesy Norman B. Leventhal Collection, Boston.
- People make sense of their environment by creating and using maps.
74. 3D Geometric Models: Simulation
(From the MIT/UCB CityWalk project)
Example: shadow studies for architecture and urban planning.
75. Models Are an Essential Starting Point!
- With urban models, one can simulate (e.g.):
  - Touring the space (tourists, customers, students)
  - Virtual sets (ads, movies, games, socializing)
  - Emergencies (fires, terrorism, floods, etc.)
  - Military operations (people, vehicles, sightlines)
  - Traffic (pedestrian, bicycle, vehicle, etc.)
  - Construction (views, shadows, wind, energy use)
  - Utilities infrastructure (gas, power, water, data)
  - Path planning (for the physically or visually impaired)
- But where do these models come from?
76. Satellite and Aerial Photogrammetry
- E.g., Moffitt & Mikhail '80; Slama '80; Ackermann '80; McKeown & McGlone '93; Mayer '98; Ascender (Marengoni et al. '99)
- Limitations:
  - High-altitude images have low spatial resolution
  - Nadir views are highly oblique for vertical surfaces
  - Side views are occluded due to urban canyons
77. Terrestrial Computer Vision
- Foundational work in various settings:
  - Camera calibration (intrinsic parameters): e.g., Faugeras & Toscani '86; Tsai '87
  - Exterior orientation, scene structure (point clouds): Kruppa '13; Ullman '79; Longuet-Higgins '81
  - Stereo (dense depth maps from image pairs, triples): Marr & Poggio '79; Baker & Binford '81; Grimson '81; Shashua '97
  - Structure from closely spaced image sequences/sets: Tomasi & Kanade '92; Azarbayejani & Pentland '95; Collins '96; Beardsley et al. '97; Baillard & Zisserman '99
78. Spatial and Combinatorial Scaling Limitations
- No prior algorithm demonstrated on all of:
  - Thousands of images; extended spatial area; wide baselines; general illumination; significant occlusion and clutter
- Short baselines and tracking failures limit spatial scale
- An underlying O(n²) assumption limits combinatorial scale
- Private coordinate systems enable only serial acquisition
79. Alternative: Human-Operated Modeling Tools
- E.g., Knopp '94; Becker '95; Taylor & Kriegman '95; Debevec et al. '96; Jepson '96; Shum et al. '98; Gibson '99; Cipolla et al. '99; Gruen & Wang '99
- Rely on a human operator to do one or more of:
  - Establish the working coordinate system and units
  - Roughly create and situate a block model of the scene
  - Roughly place and orient each camera
  - Indicate common structure among images (points, edges, faces, blocks, higher-order shapes)
  - Indicate subject and clutter portions of each image (i.e., paint away trees, people, cars, etc.)
- The human frames, initializes, constrains, and classifies.
80. Example Interactive System: Façade (Debevec et al. '96)
Tasks done by the human operator vs. tasks done by the computer:
- Frame. Human: acquire/select a related set of images; establish working units and coordinate system. Computer: provide the user interface; manage images, geometry, and constraints.
- Initialize. Human: specify a rough block model of scene structure; roughly place and orient cameras. Computer: optimize feature, structure, and camera estimates.
- Constrain. Human: indicate visible structure in each image (points, edges, faces, blocks, etc.). Computer: combine manually masked textures from different viewpoints; render the final model.
- Classify. Human: segment each image into subject and clutter (i.e., paint away trees, people, other buildings, cars, etc.).
Images courtesy Paul Debevec; used with permission.
81. Human-Operated Modeling Tools
- Good results from a small number of images, but require:
  - Uncluttered views of isolated structures
  - Significant camera standoff (tens of meters)
  - Overlapping structure, or surveyed fiducial points or other ground control, to register multiple datasets
  - Hours of skilled human effort per building
- Scaling limitations here too!
  - Limited number of input images
  - Limited occlusion and visual clutter
  - Limited number of output structures
  - Limited parallelism
82. Urban Models: Design Targets, Scaling
- Capture a km² (about ½ square mile) to a feature size of one centimeter (about ½ inch): 10^10 cm² total.
- A digital image yields 10^6 pixels, so 10^4 images are needed to observe each cm² fragment just once!
- In practice, need 3–10 views of each surface fragment, and 3–10 pixels per observation.
- Bottom line: need at least 10^5 images per km².
83. Our Approach to Urban Model Capture
- Acquire 1000s of geo-referenced images.
- Insert the images into a spatial index; establish approximate image adjacency; revise the 6-DOF alignment.
- Extract a model of coarse and fine geometry and appearance.
System development strategy: breadth-first, not depth-first!
(IUW '97; Pacific Graphics '98; ISPRS '99; ECCV/SMILE 2000, submitted)
84. Rationale
Challenges and solutions:
1. Can't expend O(n²) time; most image pairs are unrelated; serial acquisition. Solution: a geo-referenced smart camera for framing and initialization; hierarchical spatial-index algorithms for scaling and parallelism.
2. Narrow-FOV imagery (aperture problem, estimation failure). Solution: use high-resolution, super-hemispherical imagery.
3. Feature matching under wide baselines and general illumination is difficult. Solution: avoid feature matching; use ensemble features and soft matching techniques instead.
4. Can't rely on a human to identify/remove clutter. Solution: acquire thousands of images; use consensus methods and robust statistics.
85. Thrusts of This Effort
- Increasing scale, generality, automation: fragment → building → office park → campus → city; general illumination; clutter and occlusion.
- Increasing fidelity: richer geometry; texture, lighting; windows, trees, ...
86. Talk Overview
- Motivation
- Scaling issues, context
- Smart pose-camera
- Ensemble features and 6-DOF registration
- Reconstruction without correspondence
- Removing clutter
- Increasing fidelity
- Conclusions
88. Geo-Referenced Digital Pose Camera
(With Doug DeCouto)
Designed in concert with Peace River Studios, Cambridge, MA.
89. Motorized Pan-Tilt Head for Mosaic Acquisition
(Analogous to QTVR)
90. Two Individual Mosaics
Each is about 75 megapixels, but can be acquired at arbitrarily high resolution (at the cost of time and CPU). Our design target calls for 1K pixels per radian (57°) at a typical viewing standoff of about 10 meters.
91. Image Acquisition: Early Dataset
An early prototype of the pose camera, deployed in and around Tech Square (4 structures): 81 nodes, 4,000 geo-located images, 20 GB.
Adjacency graph. (CVPR '98)
92. Image Alignment (Exterior Orientation)
- A 1-megapixel camera with a 1-radian FOV has 1 mrad resolution (3 arc-min, 1/20 deg., 1 cm @ 10 m standoff).
- For registration to one pixel, we must localize camera position to 1 cm, and orientation to 1 mrad (1/20 degree).
(Figure: GPS satellites; feature at 10 m; cameras; GPS receiver; Earth.)
- Differential GPS receivers claim accuracy of 2 cm.
- So attach a GPS receiver; log position (latitude, longitude, altitude) and time with each image.
93. Sensor: GPS/IMU Navigation Estimates
- Good to about 2 meters in position and 2 degrees in heading (>100 pixels); we still have a registration problem!
94. Sensor: GPS/IMU Navigation Estimates
- Raw estimates (from nav sensors)
- Refined estimates (desired)
95. Imagery Control: Exterior Orientation
Each node must be controlled, or registered, in a common, global (Earth) coordinate system.
An image-assisted user interface auto-corresponds point features (requires several hours of user time).
Mosaicing: a significant engineering advantage.
Goal: full automation of the geo-referencing process.
96. Manual Correspondence: Disadvantages
- Infeasible for large data sets
- Potential for human error
- Unstable solutions
97. Global Registration: DOF Argument
- Input: directional constraints. Output: a position (x, y, z) for each of the V input nodes, satisfying those constraints.
- Each node adds 3 DOFs (position).
- Each adjacency fixes 2 DOFs (a 3D direction, up to scale).
- Necessary condition: 2E + 4 ≥ 3V.
- Sufficiency: every edge is part of a triangle edge-adjacent to another triangle.
- A new, open question in rigidity theory; previous work used joint angles and/or lengths.
- Solve a linear system with C-LAPACK (uses about 1 CPU-minute).
- Then register with (unbiased) GPS position estimates.
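The counting condition above follows directly from the DOF bookkeeping: 3V position unknowns, 2E directional constraints, and a gauge freedom of 4 (global translation plus scale) that no direction constraint can remove. A one-line illustrative check (the function name is ours, not the talk's):

```python
def directions_may_determine_positions(num_nodes, num_edges):
    """Necessary counting condition from the DOF argument: each node
    contributes 3 position DOFs; each adjacency (a known 3D direction,
    up to scale) removes 2; global translation (3) plus scale (1) can
    never be recovered. So we need 2E + 4 >= 3V."""
    return 2 * num_edges + 4 >= 3 * num_nodes
```

A triangle of nodes passes (6 + 4 ≥ 9); an open path of three nodes fails (4 + 4 < 9), matching the intuition that its directions leave an edge length free. The condition is only necessary; the triangle-adjacency condition on the slide addresses sufficiency.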
98. End-to-End Registration: VPs, Points
99. End-to-End Registration: Epipoles
- After rotational alignment
- Registered
100. Comparison to Manual Bundle Adjustment
Manual
Auto
101. Performance in the Presence of Clutter
Manual
Auto
102. Performance with Poor Pose Initialization
Initial
Refined
103. Registered Pose-Image Dataset (>4,000 images, 25 GB, six billion pixel observations)
Dominant cost: mosaics (8 CPU-hours; <1 hour real time).
104. Structure Extraction Without Correspondence
A histogramming algorithm identifies the orientations of significant vertical façades in the vicinity of the cameras (with Satyan Coorg). (CVPR '99)
105. Façade Detection
A sweep-plane algorithm identifies the location and spatial extent of each (coarse) vertical façade.
106. Recovered Coarse Façades
False positives removed with an absolute area threshold.
107. Result for Example Dataset
Dominant cost: plane sweep (8 CPU-hours on this data). Generalizes to other shapes, given sufficient CPU.
108. Texture-Mapping from Images
One can map the closest image onto each surface, but several problems arise:
- Lighting conditions, shadows, and reflections are inherited from the image.
- Cluttering elements (trees, people, cars) are pasted onto the surface.
- Off-plane relief (window moldings, etc.) is not modeled.
109. Texture Estimation Challenges
110. Iterative Consensus Texture Estimation
A robust, weighted-median-statistics algorithm estimates texture/BRDF for each building façade: weighted xyY median, then sharpening and masking.
The algorithm removes structural occlusion, foliage, blur (obliquity), and color and lighting variations! (See also inverse global illumination, Yu et al. '99.) (CVPR '99)
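The consensus step can be sketched as a per-pixel weighted median over all images observing a façade, with per-pixel occlusion weights: transient occluders (foliage, people, cars) appear in a minority of views and are rejected by the median. A minimal illustrative sketch operating on grayscale arrays, not the talk's xyY-color implementation:

```python
import numpy as np

def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half the total."""
    order = np.argsort(values)
    v, w = np.asarray(values)[order], np.asarray(weights)[order]
    cum = np.cumsum(w)
    return v[np.searchsorted(cum, 0.5 * cum[-1])]

def consensus_texture(observations, masks):
    """observations: list of HxW arrays, one per image of the façade;
    masks: matching per-pixel occlusion weights in [0, 1]. Returns the
    per-pixel weighted-median consensus, which rejects values seen in
    only a minority of the views (e.g., a tree in front of one image)."""
    obs, wts = np.stack(observations), np.stack(masks)   # (K, H, W)
    H, W = obs.shape[1:]
    out = np.empty((H, W))
    for r in range(H):
        for c in range(W):
            out[r, c] = weighted_median(obs[:, r, c], wts[:, r, c])
    return out
```

In the iterative scheme on the next slides, the consensus is correlated back against each image to re-estimate the occlusion masks, and the two steps alternate.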
111. Masking Away Occlusion, Clutter
(With Eric Amram, Stefano Totaro, Franck Taillandier)
112. Without masking
With masking
113. Texture Estimation Results
- Input: raw photograph. Output: synthetic texture.
- Made possible by many observations, and by a sensor and aggregation algorithm that effectively see through complex foliage and clutter.
114. Textured Model (with Overlaid Aerial Image)
115. Increasing Scale: 3 Campus Datasets
East Campus
Full Campus
117. Thrusts of This Effort
- Increasing scale, generality, automation: fragment → building → office park → campus → city; general illumination; clutter and occlusion.
- Increasing fidelity: richer geometry; texture, lighting; windows, trees, ...
118. Capturing Surface Relief
- Idea: assume surfaces are nearly planar; recover deviations using generalized stereo (Szeliski '94; Sawhney '94; Kumar et al. '94; Debevec et al. '96).
- Based on the terrain reconstruction algorithms (and implementations) of Fua and LeClerc.
121. Symbolic Window Extraction
- Based on Wang et al. (Proc. SPIE '97)
- An oriented region-growing technique
- Applied to composite façade images, after removal of occlusion and shadows
- Planned applications:
  - Mesh regularization (quantized depth)
  - Modeling color from multiple distributions
131. Capturing 3D Models of Existing Trees (with Ilya Shlyakhter and Max Rozenoer; co-advised by Julie Dorsey)
- Input: pose-images. Output: a 3D tree model.
132. Reconstruction Steps
- Segment the tree region
- Reconstruct the 3D shape
- Infer major branches
- Grow minor branches and leaves to fill the 3D shape
- Assign colors from the images
133. Reconstructing the Tree's 3D Shape
- Volumetric intersection:
  - Extrude each silhouette to a polyhedral cone
  - Intersect the cones to obtain the 3D shape
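Silhouette-cone intersection has a simple discrete analog that is easy to sketch: keep a voxel iff its projection falls inside every silhouette. This toy version, with orthographic projection functions supplied by the caller, is illustrative only and not the talk's polyhedral implementation:

```python
import numpy as np

def carve(silhouettes, projections, grid):
    """grid: (M, 3) voxel centers. projections: one function per view,
    mapping a 3D point to integer (row, col) pixel coordinates in the
    matching silhouette (an HxW boolean array). A voxel survives iff
    every view sees it inside its silhouette -- the discrete analog of
    intersecting the extruded silhouette cones."""
    keep = np.ones(len(grid), dtype=bool)
    for sil, proj in zip(silhouettes, projections):
        for k, p in enumerate(grid):
            if keep[k]:
                r, c = proj(p)
                inside = 0 <= r < sil.shape[0] and 0 <= c < sil.shape[1] and sil[r, c]
                keep[k] = bool(inside)
    return grid[keep]
```

The polyhedral-cone version used for the trees produces a watertight hull directly, rather than a voxel approximation, but the membership test is the same.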
134. Infer Plausible Branch Structure
- Find the 3D medial axis
- Fix nodes at terminal points (branch tips)
- Use vertices from even-order convex hulls (Rappoport '92)
Simple example: 1st-order hull ABCD; 2nd-order hull CED. Nontrivial even-order hulls correspond to branch tips.
135. Grow the Remainder of the Tree
- Procedural model: L-systems
  - Rewriting rules specify branching
  - Normally, one starts with a single shoot
  - Here, we start with the complete skeleton
Simple example: the 1st rule directs growth; the 2nd rule directs branching.
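The two-rule example can be sketched as a plain string-rewriting L-system; the specific alphabet and rules below are a standard textbook toy, not the talk's grammar. `F` draws a segment, `+`/`-` turn, and `[`/`]` push and pop turtle state, so the second rule is what introduces branching.

```python
def lsystem(axiom, rules, steps):
    """Apply the rewriting rules to every symbol in parallel, `steps` times;
    symbols without a rule (here +, -, [, ]) are copied unchanged."""
    s = axiom
    for _ in range(steps):
        s = "".join(rules.get(ch, ch) for ch in s)
    return s

# Toy example in the spirit of the slide:
rules = {"F": "FF",            # 1st rule: directs growth (segments lengthen)
         "X": "F[+X][-X]"}     # 2nd rule: directs branching (two side shoots)
```

Starting from the full reconstructed skeleton rather than a single axiom symbol is what lets the same machinery fill in minor branches on an observed tree.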
136. Coloring the Leaves
- View-dependent mapping: colors are back-projected from the image most closely matching the viewpoint.
- Alternative: match the color distribution to that found in the input images.
137. Matched Stills
138. Detail Views
140. Take-Home Messages
- Fully automated model acquisition is possible, in principle and in practice.
- An augmented sensor removes the combinatorial bottleneck inherent in classical algorithms.
- Spherical images are a fundamental enabling technique (more than simply a practical advantage).
- Ensemble, rather than individual, features largely eliminate the need for correspondence.
- Large numbers of images can overcome even severe clutter and occlusion, efficiently.
- Even surprisingly complex structures (e.g., trees) can be plausibly modeled from observation.
141. Acquisition Is the Application!
- End-to-end system for model acquisition:
  - From sensor directly to textured CAD/GIS
  - Remove the human from the loop
- Advantage: removes scaling and throughput limits.
- Tradeoff: instrumentation; limited domain.
- Evaluate:
  - Costs (development, one-time, ongoing)
  - Efficiency (computation, storage resources)
  - Fidelity (faithfulness to a reference model)
  - Utility (to military, maintenance, simulators)
142. Overview: Urban Model Acquisition
- Automation
  - Automatic exterior calibration of imagery
- Generalization & Aggregation
  - 3D reconstruction, merging
  - Texture, occlusion, relief estimation
    - Collaboration with Fua (EPFL) and LeClerc (SRI)
  - Symbolic window extraction
    - Collaboration with Wang and Hansen (UMass)
- Scale and Throughput
  - Data acquisition, distributed processing
- Input and Output Validation
  - Sensor improvement; surveying efforts
143. Project Goals: End-to-End
- Develop a sensor and algorithms to extract geodetic, textured CAD models from initially uncontrolled imagery, without a human in the loop.
- Five parts:
  - Develop a novel sensor: a pose camera for imagery and approximate exterior orientation
  - Deploy the sensor to acquire pose mosaics
  - Refine estimates of exterior orientation
  - Extract geometry and textures (BRDFs)
  - Evaluate and validate models, cost, etc.
144. Research/Engineering Footprints
Comparison of Ascender¹, Façade², and MIT/City:
- Number of images: tens; tens; thousands
- Imagery type: aerial; near-ground; near-ground
- 6-DOF camera pose: from human; from human; instruments + optimization
- Structure extraction: roof-matching + optimization; by human + optimization; automatic detection + optimization
- Number of structures: scores; one to tens; arbitrary
- Output coordinate system: specified by operator; specified by operator; geodetic (Earth) coordinates
- Texture: procedural matching; manual segmentation; automatic with robust statistics
- Scaling capability: unclear; unclear; spatial index
- Parallel model acquisition and merging: none; none; use of geodetic coordinates, index
¹ UMass  ² Berkeley
145. Engineering Rationale, Choices
- General vs. restricted environment class
- Few vs. many images
- Satellite/aerial vs. ground imagery
- Video vs. single-frame camera
- Resolution vs. large field of view
  - Optics, CCD, pan/tilt rig: both
- Geo-referencing data with each image
  - 6 DOF: three translation, three rotation, in Earth coordinates (lat, long, alt; NED)
- Breadth-first (not depth-first) development
146. Urban Model Capture
Vertical façade extraction:
- Step 1. Low-level feature detection (horizontal line segment identification)
- Space sweep finds the dominant façades
Sparse reconstruction:
- Step 2. Computing frustums
- Step 3. Compute vertex and line extrusions
- Step 4. Matching vertex extrusions (vertex extrusions corresponding to a vertex element)
- Step 5. Matching line extrusions (line extrusions corresponding to a line element)
- Step 6. Computing surfaces
Variational surface evolution: 1. single-camera projection; 2. multiple-camera projection; 3. images mis-aligned; 4. surface evolution; 5. alignment improves.
Geometry extraction and aggregation: 1. data; 2. spatial index; 3. surface patches; 4. extended surfaces; 5. final surface.
(Figure: the reconstructed model of a portion of Technology Square.)
Citations: Coorg, CVPR 1999; Mellor, IUW 1997; Cutler, MIT MEng 1999; Chou, IUW 1997; Amram, MIT MSc 1998; Faugeras and Keriven, SS 1997.
147. Edge, Line, Corner Detection
148. Sensor Challenges (Low → High)
- Fuse data streams from diverse sensors (GPS, IMU, omnicam, etc.)
- Achieve meaningful error bounds
- Effectively incorporate high-level knowledge about platform motion
  - Translations, rotations, stops
- Disambiguate GPS noise from multipath
- Bootstrap from crude model capture
149. Feature Detection Challenges
- Characterize and achieve theoretically optimal estimation of edges
- Effectively combat local clutter (e.g., obscurations of a single edge)
- Effectively combat false positives (e.g., tree limbs disguised as edges)
- Propagate useful error bounds (e.g., to downstream algorithms for vanishing point estimation)
150. 3D Reconstruction Challenges
- Expressiveness of the template
  - Polyhedra, surfaces of revolution, etc.
- Variations in feature size
  - From signage lettering to large buildings
- A principled idea of when to believe in a multiply-reinforced element
- Validation: a goodness metric, etc.
151. Why We Need Texture
- Show buildings with raw imagery mapped onto them: trees are stuck onto the buildings!
- Challenges of multiple views:
  - Differing lighting
  - Distinct occlusion in each image
  - The surface is non-planar; each image sees a different piece
- Can undo lighting (Yu '99), but how to deal with occlusion?
- Strategies to generate textures:
  - Assume no occlusion (simple averaging)
  - Rely on a human user to paint away textures
152. Alignment Is the Automation Bottleneck
- Overview of the end-to-end pipeline
- Recovering rotation and translation for acquired hemispherical images
- Short-baseline techniques are not applicable
- Exploit navigation information and a large number (1000s) of images
- Tack: decouple rotation and translation; solve independently (with Matt Antone)
153. Registration Challenges
- Robust VP estimation from edge classes (rather than single edges)
- Robust translation directions from low-level features (edges, points)
- Use of dense (area) information?
- Allocation of error: orbit of the optical center; intrinsics; spherical mosaicing error; noise in feature extraction; etc.
154. Texture Estimation Challenges
- How best to incorporate:
  - Billions of pixel observations
  - Non-planar surface geometry
  - Appearance models of varying power (diffuse, specular, BRDF, etc.)
  - High-level knowledge of repetitive structure, common material types
- How to validate our results?
155. Increasing Scale, Throughput
- Scale: sensor, spatial infrastructure
  - Sensor node time reduced to 1 minute, from 5 minutes in 1998, including HDR
  - Input: second MIT dataset, several hundred nodes across East Campus
- Throughput: map algorithms to a parallel, distributed Linux cluster
  - 1–32 CPUs with near-linear speedups
  - Currently limited by I/O bandwidth
158. Evaluation Criteria
- Throughput
- Complexity
- Fidelity (Geometric, Photometric)
- Adoption of tools and models by users
- Assessment of results by community
159. Module Improvements
- 10× area implies 10× data size
  - Data scaling, naming conventions
- Speed
  - Pose-camera (Argus) improvements
  - Parallel distributed processing pipeline
- Accuracy, validation
  - 6-DOF raw navigation data
  - Imagery control (exterior calibration)
  - Derived features: points, edges, faces
  - Relief extraction
  - Symbolic windows
160. Texture Algorithm
- Four steps in the algorithm:
  - Initialize the per-image occlusion mask to 0.5
  - Texture + occlusion mask yields a consensus
  - Correlate the consensus with the images to re-form the occlusion masks
  - Iterate
- Show several steps of the algorithm: image, mask, image × mask, consensus.
161. Generalization and Aggregation
- Several 3D reconstruction techniques:
  - Large planar surfaces (Coorg)
  - Small surfels (Mellor)
  - Bottom-up surface inference (Chou)
- Aggregation phase (Cutler): principled merging of the algorithms' outputs to produce a single consistent CAD model
162. Next Step: Extracting Geometry
- Several approaches, with overlapping, partially complementary operating regimes:
  - Vertical façade extraction: finds large vertical surfaces from horizontal edges
  - Low-level feature hypothesis and promotion: bottom-up, from sparse point and edge features
  - Dense surfel optimization: treats the world as a dense cloud of surface patches
- Aggregation phase
165. Geometry from Sparse Features (with George Chou)
171. Geometry from Dense Surfels (with J.P. Mellor)
- Generalizes Collins' space sweep ('96)
- Related to Kutulakos and Seitz's space carving ('98)
(IUW '97; Mellor '99)
173. Geometry Aggregation (with Barb Cutler)
(Cutler '99)
174. Toward Automated Exterior Registration
With Manish Jethwa, Neel Master
175. Preliminary Results (with Overlaid Aerial Image)
The model represents about 1 CPU-day at 200 MHz.
Next: acquire the full MIT campus; compare to a reference model captured via traditional surveying.
176. Validation
- Input (Mike Bosse)
  - Survey waypoints to characterize the precision and accuracy of the navigation sensor
  - Suppress sub-systems (GPS, inertial, odometry) to gauge the contribution of each
- Output (Qixiang Sun)
  - Synthetic inputs, idealized results
  - Real inputs, optimization residuals
  - Compare reconstructed models to surveyed, hand-solved models
177Evaluation Criteria
- Throughput
- Complexity
- Fidelity (Geometric, Photometric)
- Adoption of tools and models by users
- Assessment of results by community
178From the East
From the South
179Connections to other communities
- MIT Physical Plant
- Well-maintained 2D CAD
- City of Cambridge Planning Dept.
- Surveying, GIS expertise/expectations
- MIT Depts of Architecture, Urban Planning
- GIS software, demographic data
180From maps to models
- A model is any dataset in an electronic form
suitable for manipulation by a computer program
Map (paper chart, or scanned image)
Model (city locations, explicit road networks)
181Models enable visualization and simulation!
Route planning
(Examples from MapQuest)
182Image acquisition: first dataset
Early prototype of pose camera deployed in and
around Tech Square (4 structures): 81 nodes,
4,000 geo-located images, 20 GB
(CVPR 98)
183Four design tradeoffs
- The general computer vision problem is hard, so focus
on urban environments for now
- Previous approaches use few images; instead,
acquire thousands of images (the only way to overcome
clutter automatically)
- Can't assume O(n²) image pairs are related, so use a
sensor that identifies, by proximity and
direction, those images that are likely to be
related
- Human-operated modeling tools use negligible CPU;
instead we use massive parallelism and I/O
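The third tradeoff, replacing the all-pairs O(n²) search with metadata-driven pruning, can be sketched with a simple spatial grid over camera positions. The 50 m overlap radius and the (id, x, y) node format are illustrative assumptions:

```python
def candidate_pairs(nodes, max_dist=50.0):
    """Prune image pairs by pose-metadata proximity (sketch).

    nodes: list of (id, x, y) camera positions from the navigation sensor.
    Returns pairs of node ids close enough to plausibly share scenery,
    avoiding the O(N^2) all-pairs search over unrelated images.
    """
    cell = max_dist
    grid = {}
    for nid, x, y in nodes:
        grid.setdefault((int(x // cell), int(y // cell)), []).append((nid, x, y))
    pairs = set()
    for (cx, cy), bucket in grid.items():
        # Each node is compared only against its own and neighboring cells,
        # so far-apart nodes are never examined at all.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for a in bucket:
                    for b in grid.get((cx + dx, cy + dy), []):
                        if a[0] < b[0] and \
                           (a[1] - b[1]) ** 2 + (a[2] - b[2]) ** 2 <= max_dist ** 2:
                            pairs.add((a[0], b[0]))
    return pairs
```

With roughly uniform node density, each node touches a bounded number of neighbors, so the number of candidate pairs grows linearly in N rather than quadratically.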
184Parameterized tree generators
- Biology-based: enforce botanical growth laws
- Good fidelity to sunlight availability, other factors
- Hard to control final shape
- Examples: Lindenmayer 68, Prusinkiewicz 88
- Geometry-based: specify detailed geometric parameters
- More direct control of shape
- Fidelity to biology, environment not enforced
- Examples: Bloomenthal 85, Greene 89
- Both generator classes yield one of a family of
trees, not a particular, observed tree
185Hybrid Approach
- Geometry-based: infer plausible branch structure
directly from observations
- Biology-based: use a growth model to fill the tree
volume with minor branches, leaves
- Texture/coloration step: color the leaves
according to original image observations
186Segmentation (identifying tree pixels)
- Currently manual
- Preliminary filter implemented (also Haering,
Lobo 1999)
187Tree Reconstruction Summary
- Preliminary solution for one instance of a hard
inverse problem: forcing a procedural model to
reproduce an existing object
- The hybrid approach allows direct control of final
shape while relegating details to a procedural
model that enforces biological fidelity
188Urban models: design targets, scaling
- Capture a km² (about ½ sq. mile) to a feature
size of one centimeter (about ½ inch): 10¹⁰ cm²
total
- Using a 1-megapixel digital camera, 10⁴ images are
needed just to observe each cm² fragment once!
- In practice, need 3-10 views of each surface
fragment, and 3-10 pixels per observation
- Bottom line: need at least 10⁵ images per km²
- Quadratic-time algorithms clearly not applicable
- Human operators can't even look at 10⁵ images,
let alone manipulate them. So what do we do?
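The slide's arithmetic can be checked directly. A minimal sketch using the stated numbers, taking the low end of the 3-10 view range:

```python
km_in_cm = 100_000                       # 1 km = 10^5 cm
area_cm2 = km_in_cm ** 2                 # 10^10 cm^2 per km^2
pixels_per_image = 1_000_000             # 1-megapixel camera
# At one pixel per cm^2 fragment, 10^4 images cover the area once
images_once = area_cm2 // pixels_per_image
views = 3                                # 3-10 views per fragment (low end)
pixels_per_obs = 10                      # 3-10 pixels per observation
images_needed = images_once * views * pixels_per_obs
# images_needed is 300,000: comfortably above the 10^5 lower bound
```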
189Classical ambiguity: rotation vs. translation
- Caused by limited camera FOV
- To first order, rotation and translation are
indistinguishable when the FOE is far outside the image
- Show example
190History: Computer Vision
- Camera calibration (intrinsic parameters)
- E.g., Faugeras, Toscani 86; Tsai 87
- Exterior orientation, scene structure (point clouds)
- Kruppa 13; Ullman 79; Longuet-Higgins 81
- Stereo (dense depth maps from image pairs, triples)
- Marr, Poggio 79; Baker, Binford 81; Grimson 81;
Shashua 97
- Structure from closely-spaced image sequences
- Tomasi, Kanade 92; Azarbayejani, Pentland 95;
Collins 96; Beardsley et al. 97; Baillard,
Zisserman 99
191Model capture: an analogy
- How can one capture an existing physical document
into a word processor, for editing?
192Option 1: Type it in
Uses existing skills, hardware, and software
Accurate (depending on input, operator skills)
- Requires skilled human operator(s), proportional
to the number of pages to be input
Increasing computer speed doesn't generally
increase system throughput. Thus, the human is
eventually the bottleneck
193Option 2: Scan and OCR the document
- A) Acquire a digital photograph of the printed pages
- B) Apply OCR (Optical Character Recognition)
algorithms to extract a model of letters, words
- C) Output the document in machine-readable form
194Scanning: Advantages, disadvantages
No human in the loop!
Throughput increases with technology
Parallel capture (scanners, processing) possible
Accurate (depending on input, algorithms)
Someone must develop the scanner, algorithms
(Possibly years until commercial viability)
195Back to Geometric Model Capture
- We are developing a scanner and a suite of
extraction algorithms for urban environments!
Input: pictures of the urban environment
Output: textured 3D CAD model in Earth
coordinates (lat., long., alt., and orientation)
196Vision: Calibration, Correspondence, Structure,
Appearance
197Hidden assumption: quadratic complexity
- Most algorithms assume all input images are related!
- Expend O(n²) time searching for correlations
- But for extended terrestrial imagery, overlap is sparse
- This is impractical (and wasteful) for large n
- Aperture problem (limited FOV) makes things worse
198Hidden assumption: private coordinates
- Local coordinate system used for each image set
- These algorithms cannot use parallel inputs
- No clear way to combine models generated across
runs
(Diagram: three image sets, each with its own local x-y coordinate frame)
199Integration barrier: disjoint operating regimes
- Hundreds of algorithms exist for particular
sub-tasks of the computer vision problem
- Feature detection, camera calibration,
short-baseline exterior orientation and structure
from motion, feature correspondence, etc.
- However, operating assumptions are restrictive,
or unstated, or both, making composition hard!
- Examples: short vs. long baseline; orthographic
vs. perspective; controlled vs. diffuse vs.
general illumination; known vs. unknown camera
calibration; local vs. global processing/consistency
- Not simply a systems integration problem
200Scale constrains us severely
- Can't control illumination conditions
- Can't instrument the environment with fiducials
- Can't precisely control camera placement (in
contrast to, e.g., stereo rigs or other gantries)
- Can't afford O(n²)-time algorithms
- Can't assume a single, serial image sequence
- Can't have a human in the processing loop
201A fundamental optical tradeoff: resolution vs.
field of view
CCD array (1K x 1K pixels)
Wide-angle (e.g., fisheye) lens: large field of
view, but low angular resolution
Long (e.g., telephoto) lens: high angular
resolution, but small field of view
Images courtesy Helmut Dersch; used with permission
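The tradeoff is easy to quantify: with a fixed pixel count, angular resolution is inversely proportional to field of view. A sketch assuming pixels spread uniformly across the FOV (the 180° and 10° lens FOVs are illustrative):

```python
def angular_res_deg_per_pixel(fov_deg, pixels):
    """Approximate angular resolution of a lens/CCD combination,
    assuming pixels are spread uniformly across the field of view."""
    return fov_deg / pixels

# A 1K-pixel row behind a fisheye vs. a telephoto lens
fisheye = angular_res_deg_per_pixel(180.0, 1024)   # ~0.18 deg/pixel
telephoto = angular_res_deg_per_pixel(10.0, 1024)  # ~0.01 deg/pixel
```

The fisheye sees everything coarsely; the telephoto sees a sliver finely. Tiling many narrow-FOV images into a spherical mosaic gets both.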
202Mosaic generation (with Satyan Coorg)
Each node is 25-250 images tiling a sphere
about a mechanically fixed optical center
Each node is correlated to form a spherical mosaic
Camera internal parameters are auto-calibrated
Computation is fully automated (no human in loop)
Per node (50 images): 20 CPU-minutes @ 200 MHz
CVPR 98; IJCV (to appear)
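The geometry behind tiling a sphere with images can be sketched as mapping each pixel ray into spherical coordinates given the image's orientation about the fixed optical center. This is a simplified illustration (roll is omitted, and centered pixel coordinates are assumed); it is not the talk's calibration procedure:

```python
import math

def pixel_to_sphere(u, v, f, yaw, pitch):
    """Map centered pixel (u, v) of an image with focal length f and
    known yaw/pitch to a spherical direction (azimuth, elevation).
    Camera looks down +z; yaw is about the y-axis, pitch about x."""
    # Ray through the pixel in the camera frame, normalized
    x, y, z = u, v, f
    n = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / n, y / n, z / n
    # Rotate by pitch (about x), then yaw (about y)
    y, z = (y * math.cos(pitch) - z * math.sin(pitch),
            y * math.sin(pitch) + z * math.cos(pitch))
    x, z = (x * math.cos(yaw) + z * math.sin(yaw),
            -x * math.sin(yaw) + z * math.cos(yaw))
    return math.atan2(x, z), math.asin(y)
```

Once every pixel of every image maps to a direction on the common sphere, overlapping images can be correlated in that shared parameterization to refine the rotations.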
203Two engineering problems
- First: obscuration, multi-path, and electronic
noise degrade GPS accuracy to about 20 m, and
make it only intermittently available
- Second: GPS is a 3-DOF position sensor only; it
gives no information about (3-DOF) heading
(Diagram: urban canyon with clear line-of-sight, obscured, and multi-path GPS signal paths)
204GPS/Inertial Navigation (with Michael Bosse)
- GPS is unbiased, but only intermittently
available - Inertial is continuously available, but drifts
- Strategy: combine sensors to achieve a continuous
2 m, 2° solution; then refine to 1 cm, 0.05° using
images
- Decoupled GPS, inertial vs. coupled GPS, inertial
GPS World, April 2000; ECCV/SMILE 2000 (submitted)
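The strategy of combining an unbiased but intermittent sensor with a continuous but drifting one can be sketched as a one-dimensional complementary filter. This is an illustration of the idea, not the system's actual estimator; the blend gain `alpha` is an assumed value:

```python
def fuse(gps_fixes, imu_deltas, alpha=0.05):
    """1-D complementary-filter sketch of the GPS/inertial strategy.

    gps_fixes: per-step position fix, or None during a GPS outage.
    imu_deltas: per-step displacement increments (continuous, drifts).
    """
    x = 0.0
    track = []
    for fix, dx in zip(gps_fixes, imu_deltas):
        x += dx                    # inertial: always available, drifts
        if fix is not None:        # GPS: unbiased but intermittent
            x += alpha * (fix - x)  # pull the estimate toward the fix
        track.append(x)
    return track
```

Inertial integration bridges GPS outages, while each GPS fix bleeds off the accumulated drift, keeping the error bounded instead of growing without limit.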
205Focus of expansion / contraction
- The FOE is the special point from which the entire
world looms as you move toward it
- Matched by a second, antipodal point, the focus
of contraction (usually not in view)
Image courtesy Steve Mann; used with permission
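Geometrically, the FOE is just the perspective projection of the translation direction onto the image plane; a minimal sketch (camera looking down +z, focal length f assumed):

```python
def focus_of_expansion(t, f=1.0):
    """Project translation direction t = (tx, ty, tz) onto the image
    plane to get the focus of expansion. The antipodal focus of
    contraction is the projection of -t."""
    tx, ty, tz = t
    if tz == 0:
        return None  # FOE at infinity: pure sideways motion
    return (f * tx / tz, f * ty / tz)
```

This also shows why a narrow FOV causes the rotation/translation ambiguity: near-sideways motion puts the FOE far outside the image, where its flow field looks locally like a rotation.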
206Demonstration
207Increasing Throughput
- Scale: sensor, spatial infrastructure
- Sensor node time reduced to 1 minute from 5
minutes, including HDR imagery
- Processing MIT dataset: nearly a thousand nodes
spanning the entire campus
- Currently mapping algorithms to a parallel,
distributed Linux cluster
- 1-64 CPUs with near-linear speedups
- Current bottleneck is disk I/O bandwidth
211Animation with novel views