Title: Machine Vision for Urban Model Capture: Exploiting Scale, Achieving Automation

1. Machine Vision for Urban Model Capture: Exploiting Scale, Achieving Automation
Seth Teller, MIT Graphics Group
Joint work with Eric Amram, Matthew Antone, Michael Bosse, Satyan Coorg, Doug DeCouto, Manish Jethwa, Neel Master, Ivan Petrakiev, Qixiang Sun, Franck Taillandier, Stefano Totaro, Xiaoguang Wang, et al.
2. Example Dataset
- 500 nodes spanning 500 meters
- 10,000 HDR images (Debevec & Malik '97)
- 50,000 raw megapixel images
- Most node pairs are entirely unrelated!
3. Tackling Scale, Extent, Automation
- Scale: number of input images, features
- Extent: size and scope of the acquisition region
- Automation: end-to-end from sensor to CAD, from 2D images to a 3D model
4. Motivation
- Why automate? Good interactive tools exist.
  - Yes, but they are either scale-limited or very expensive, and they produce models valid only for a restricted set of viewpoints.
- The system bottleneck is human interaction time:
  - Façade: 80 man-hours to model 20 buildings representing a tower and its surround (Debevec '01)
  - LA Basin: 100,000 man-hours (50 man-years) to model 15,000 urban structures (Jepson '00)
- All require skilled operators and careful view planning.
- Interactive tools are not on the technology curve! Automation: where do you go from here?
5. Four Key Ideas
Sensor:
1. Metadata captured with each image
2. Omni-directional (wide-FOV) images
Algorithmics:
3. Projective, probabilistic uncertainty
4. Asymptotically linear running times
Tradeoff: extract one expensive human; add many cheap images.
6. 1. Camera Metadata
- Intuition: metadata identifies images likely to overlap.
- Without metadata: N images take O(N²) time to discover overlaps.
- With metadata: N images take O(N) time to discover overlaps.
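The O(N) overlap discovery can be sketched as a spatial-hash query: only images whose recorded camera positions fall in the same or adjacent grid cells are candidate overlaps. This is a minimal illustrative sketch, not the talk's implementation; the function name and 2D positions are hypothetical.

```python
from collections import defaultdict

def find_candidate_overlaps(nodes, radius):
    """Bucket camera positions into a grid of cell size `radius`;
    only nodes in the same or adjacent cells can possibly overlap,
    so total work stays near-linear in the number of images."""
    grid = defaultdict(list)
    for i, (x, y) in enumerate(nodes):
        grid[(int(x // radius), int(y // radius))].append(i)
    pairs = set()
    for (cx, cy), members in grid.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in grid.get((cx + dx, cy + dy), []):
                    for i in members:
                        if i < j:
                            pairs.add((i, j))
    return pairs
```

With bounded node density per cell, the candidate set, and hence the matching work, grows linearly with N instead of quadratically.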
7. 2. Omni-directional Imagery
- Intuition: avoids the classical aperture problem.
- Integrating the surround is a fundamental advantage.
- Yields superior robustness and accuracy.
8. 3. Projective, Probabilistic Uncertainty
- Intuition: account for noisy features and noisy pose.
- Bingham densities (1974) generalize Gaussians to the sphere.
- Completely specified by parameters k1 ≤ k2 ≤ k3 = 0.
- Advantages:
  - Appropriate from a theoretical standpoint (projective)
  - Can fuse noisy features into accurate aggregate estimates
  - Can defer or avoid hard (deterministic) decisions
(Density classes: Symmetric Polar, Symmetric Equatorial, Asymmetric Polar, Asymmetric Equatorial, Uniform.)
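The Bingham family can be sketched concretely: up to normalization, p(x) ∝ exp(Σᵢ κᵢ (vᵢᵀx)²) for a unit vector x, orthonormal axes vᵢ, and concentrations κ1 ≤ κ2 ≤ κ3 = 0. The density is antipodally symmetric, which suits projective directions. A minimal, unnormalized sketch (the axes and κ values below are illustrative, not from the talk):

```python
import numpy as np

def bingham_unnorm(x, V, kappa):
    """Unnormalized Bingham density p(x) ∝ exp(Σ κ_i (v_iᵀ x)²) for a
    unit vector x; columns of V are orthonormal axes, κ1 ≤ κ2 ≤ κ3 = 0.
    Antipodally symmetric: p(x) = p(-x), as a projective density must be."""
    x = np.asarray(x, float)
    return float(np.exp(sum(k * (V[:, i] @ x) ** 2 for i, k in enumerate(kappa))))

V = np.eye(3)                 # principal axes (illustrative)
kappa = (-50.0, -50.0, 0.0)   # tight concentration about the ±z axis
```

With these parameters the mass concentrates near ±z (a "polar" density); making κ1 ≪ κ2 instead produces equatorial, girdle-like densities.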
9. 4. Linear Asymptotics
- Intuition: crucial for spatial extent and free viewpoint.
- Capture a km² at 1-cm resolution: 10^10 cm² fragments.
- 10^4 megapixel images are needed just to observe each fragment once!
- Need several views, and several pixels, per fragment.
- Bottom line: need at least 10^5 images per km².
(Figure: 1-cm² fragments; megapixel camera; 1 kilometer.)
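The counting argument above can be checked directly; the script reuses the slide's numbers (1 km², 1-cm fragments, megapixel images) and takes three views of three pixels each as an illustrative lower bound.

```python
AREA_CM2 = 100_000 ** 2        # 1 km² = 10^5 cm on a side → 10^10 cm² fragments
PIXELS_PER_IMAGE = 10 ** 6     # megapixel camera
fragments = AREA_CM2           # one 1-cm² fragment per cm²

# Images needed to observe every fragment exactly once, one pixel each:
images_single_cover = fragments // PIXELS_PER_IMAGE        # 10^4

# Several views (≥3) and several pixels (≥3) per fragment:
images_needed = images_single_cover * 3 * 3                # ≈ 10^5
```

Rounding 9×10^4 up to the nearest order of magnitude gives the slide's bound of at least 10^5 images per km².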
10. Talk Overview
- Introduction & Motivation
- Key Ideas
- Automated 6-DOF Image Registration
- Large-scale image datasets
- Large-scale model extraction
- Long-term goals, vision
- Conclusion
11. Image Registration (Bundle Adjustment)
- Traditional method: manual correspondence.
- Several limitations:
  - Infeasible for large data sets
  - Numerical instability
  - Human inability to match
12. Automated Registration (with Matthew Antone)
- Intuition: each node
  - detects a local, rigid frame in the scene,
  - then aligns itself to its neighbors.
- Break the 6 DOFs into 3 + 2 + 1 DOFs.
13. Detecting the Rigid 3-DOF Frame
- Assume 2+ distinct vanishing points are visible; VPs are found in many scenes (Coughlan and Yuille, NIPS 2000).
- Dualize each edge to a point on the unit sphere.
(CVPR 2000)
14. Transforming every edge from a single image yields evident band structure on the sphere, but not enough structure to reliably estimate (or even identify!) the scene vanishing points.
15. However, transforming edges from every image in a wide-FOV node yields strong band structure!
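The edge-to-sphere dualization can be sketched as follows: an image edge, together with the camera center, spans a plane, and the plane's unit normal is the edge's dual point on the sphere. Edges converging on a common vanishing direction d all satisfy n · d = 0, so their duals lie on d's great circle, producing the band structure. A minimal sketch (illustrative, not the talk's code):

```python
import numpy as np

def edge_normal(p1, p2):
    """Dual of an image edge: the unit normal of the plane through the
    camera center (origin) and the two edge-endpoint ray directions."""
    n = np.cross(p1, p2)
    return n / np.linalg.norm(n)

# Rays toward any 3D line with direction d give duals n with n · d = 0,
# i.e. points on the great circle whose pole is the vanishing direction d.
```

Estimating a vanishing point thus reduces to fitting a great circle (equivalently, its pole) to a band of dual points.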
16. Registration Algorithm (3 DOFs)
- Reduce each node to a few accurately estimated VPs.
- Cost: tens of CPU-seconds per node (1000s of edges).
17. Propagation Step: Alignment to Neighbors
- This is O(n⁴) in the number of VPs (usually n < 5).
- Achieves rotational registration to 0.05° (about 1 pixel).
- Note: no individual feature matching; outperforms a human!
18. Advantages of Wide-FOV Images
- More VPs (3.2–3.6 per node vs. 0.3–0.7 per image)
- More accurate VPs (by a factor of 1 to 3)
(Figure: Hough-transform peak strength; VP variance.)
- Registration is more robust and more accurate with wide FOV (contrast to Collins & Weiss '90; Becker '95; Leung & McLean '96).
19. Translational Registration (2 DOFs)
- Requires persistent, local observations; we use sub-pixel point features, taken from valid VPs only.
20. Spherical Epipolar Geometry
- Any two rotationally registered nodes are related by a pure (3-DOF) translation.
- All world points must move on a pencil of epipolar great circles through the (2-DOF) translation direction.
- Problem: we don't know the direction, or the matches!
21. Registering Translations
- Position A: point features on the Gaussian sphere.
- Position B: point features on the Gaussian sphere.
- Overlaid point features: rotationally aligned, with significant overlap.
22. Finding the Translation Direction (2 DOFs)
- Hough-transform all plausible point pairs to great circles.
- Great circles reinforce at/near the true baseline direction.
- Yields the direction (up to scale) relating nodes A and B.
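The great-circle voting can be sketched concretely: after rotational alignment, matched unit rays p (node A) and q (node B) must be coplanar with the baseline t, so t · (p × q) = 0; each plausible pair therefore votes for every candidate direction near its great circle. This is a minimal illustrative sketch with a coarse candidate list, not the talk's accumulator:

```python
import numpy as np

def hough_baseline(matches, candidates, tol=0.05):
    """Each plausible match (p, q) of unit rays votes for the candidate
    baseline directions t lying near its great circle, i.e. those with
    (p × q) · t ≈ 0; votes reinforce at the true baseline direction."""
    votes = np.zeros(len(candidates))
    for p, q in matches:
        n = np.cross(p, q)
        votes += np.abs(candidates @ (n / np.linalg.norm(n))) < tol
    return candidates[int(np.argmax(votes))]
```

Even pairs that are not true matches vote, but their circles spread diffusely over the sphere while correct pairs pile up at (plus or minus) the baseline.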
23. Probabilistic Matching
- As in Wells (IJCV 1997), Chui et al. (CVPR 2000), Dellaert et al. (CVPR 2000), but our approach:
  - Handles unknown numbers of features and matches
  - Handles unknown occlusion and outliers
  - Searches only 2 DOFs (not 5: rotation + translation)
  - Scales to hundreds of thousands of features
24. Probabilistic Matching (cont.)
- For each adjacent node pair, the i/j feature match is represented as M_ij ∈ [0,1].
- Swap, split, and join operations mutate valid matches and add/remove outliers.
- The matrix is mutated 10–100,000 times in a Monte-Carlo E-M process.
- Within an MC step, new binary matrices are produced as a Markov chain, with transition probabilities based on geometric error; averaging these produces M_ij.
- In practice, soft matching of 2,000 features converges within 10 E-M steps.
- Strong prior from the HT: 30 sec.; one E-M step: 20 sec.
25. Baseline refinement steps:
- Initial baseline estimate; overlaid point features.
- Fix the baseline estimate; form plausible match weights.
- Fix the baseline estimate; mutate the match weights.
- Fix the match weights; re-fit the baseline estimate.
- Result: an accurate baseline.
26. Error Study (Synthetic Data)
End-to-end baseline error is linear in feature noise.
27. Robustness Study (Synthetic Data)
Baseline estimation is robust up to 90 percent outliers.
28. Advantage of Wide FOV (Real Data)
Hough-transform peak strength as FOV increases.
29. Final DOF: Fix Absolute Scale
- Input: baseline directions. Output: edge lengths.
- Solve a linear system with C-LAPACK (uses about 1 CPU-minute).
- Register to initial GPS position estimates for geo-referencing.
30. End-to-End Consistency Measures
Width of distributions at 95% confidence:
- Variety of node sets, spanning hundreds of meters
- Typical inter-node baseline: 10–20 meters
- Hundreds of thousands of edge and point features
- Node positions consistent to 5 centimeters
- Node orientations consistent to 0.1 degrees
- Epipolar geometry consistent to 4 pixels
31. End-to-End Epipolar Geometry
- After rotational alignment
- After 6-DOF alignment
32. Comparison to Manual Bundle Adjustment
Manual
Automated
33. Performance in the Presence of Clutter
Automated
Manual
34. Performance with Poor Initial Pose
Initial
Refined
Robust to initial pose errors of up to 7 meters and 17 degrees.
35. Currently Placing All Data On-line
- URL: http://city.lcs.mit.edu/data
- Time-stamped, calibrated, HDR (log-radiance) images
- Intrinsic calibration; sub-pixel edge and point features
- Geo-referenced 6-DOF pose (ECEF metric units)
- Interactive browsing (download facility underway)
36. Application: 3D Reconstruction
- Voxels, silhouettes (Potmesil '87; Szeliski '93; Kanade et al. '95)
- Space-sweep, voxel coloring (Collins '96; Szeliski & Golland '98; Seitz & Dyer '99; Kutulakos & Seitz '00; etc.)
- T(N,V) = O(NV) with N images, V voxels
- This grows with the square of the reconstruction volume
- K-S '00: N = 16, V = 5×10^7, T = 4 CPU-hours
- Our dataset: N = 10^4, V = 10^11 (campus @ 10 cm)
- Extrapolating, T(N,V) ≈ 100 CPU-years
37. Asymptotic Improvement (with Manish Jethwa)
- Intuition: distant image pairs shouldn't interact.
- Let each image affect only a constant number of voxels.
- Now T(N,V) = O(N + V).
- Grows linearly with the reconstruction volume and the number of images.
- For our dataset, we estimate a few CPU-weeks.
- Must also handle unknown background and clutter.
38. What's Next (1)
- Scalable 3D reconstruction
39. What's Next (2)
- New operating regimes (with Michael Black)
- Omni-video
  - Different: 30 Hz, low resolution, short baselines
- Architectural interiors
  - Different: dimensions, illumination, clutter
40. What's Next (3)
- Robotic image acquisition (with Draper Labs)
- Autonomous helicopter with 6-DOF navigation
- On-board omni-directional video camera
- Eventually: simultaneous cooperative capture
41. Conclusions
- Ideas for automation, scaling, view freedom
- Metadata, Wide-FOV, Uncertainty, Asymptotics
- Enable fundamentally new capability
- Controlled image acquisition over wide areas
- Datasets of interest to IBR, vision communities
- Long-term project vision, goals
- More general operating regimes
- Robotic image acquisition, model capture
42. Further Information
- http://city.lcs.mit.edu
- http://city.lcs.mit.edu/data
- http://graphics.lcs.mit.edu
- http://graphics.lcs.mit.edu/publications.html
Thanks to: NSF, DARPA, ONR; Intel, Interval, NTT.
44. Conclusions
- Fully automated model acquisition is possible.
- An augmented sensor allows spatial and input scaling, and replaces human-aided initialization in classical algorithms.
- Spherical images are a fundamental enabling technique (more than simply a practical advantage).
- Ensemble features and low-DOF optimization eliminate the need for hard feature correspondence.
- Large numbers of images can overcome even severe clutter and occlusion, efficiently.
- The end-to-end architecture provides an effective testbed for algorithms for registration, reconstruction, and high-fidelity (domain-specific) scene element extraction.
45. System Limitations
- Limited spatial extent and number of structures. In progress: acquisition of the MIT campus (1 km²).
- Vertical façades only; rooftops procedural. In progress: richer shape primitives (model selection).
- Diffuse lighting, diffuse surfaces. In progress: directional, inverse global illumination; use of prior knowledge about common materials.
- Foliage removed via median statistics and masking. In progress: foliage segmentation, tree modeling.
- Validation of results. In progress: independent navigation, structure survey.
46. Metadata: Operational Advantages
- If scene-relative:
  - O(N) asymptotics; parallel image capture
  - Makes interactive initialization unnecessary
- If Earth-relative:
  - Sun direction from time; geo-referencing
  - Output can be overlaid with existing GIS data
47. Projective Features
- Antipodal equivalence
- Natural duality:
  - 3D edge ↔ 1-D family of coplanar points
  - 3D point ↔ 1-D family of copunctual lines
(Figure: a line and its dual; 3-D edge; 3-D point; focal point; pencil of lines; image plane; great circle.)
73. Maps Are Fundamental
John Speed, 1626. Image courtesy Norman B. Leventhal Collection, Boston.
- People make sense of their environment by creating and using maps.
74. 3D Geometric Models: Simulation
(From the MIT/UCB CityWalk project)
Example: shadow studies for architecture and urban planning.
75. Models Are an Essential Starting Point!
- With urban models, one can simulate (e.g.):
  - Touring the space (tourists, customers, students)
  - Virtual sets (ads, movies, games, socializing)
  - Emergencies (fires, terrorism, floods, etc.)
  - Military operations (people, vehicles, sightlines)
  - Traffic (pedestrian, bicycle, vehicle, etc.)
  - Construction (views, shadows, wind, energy use)
  - Utilities infrastructure (gas, power, water, data)
  - Path planning (for the physically or visually impaired)
- But where do these models come from?
76. Satellite and Aerial Photogrammetry
- E.g., Moffitt & Mikhail '80; Slama '80; Ackermann '80; McKeown & McGlone '93; Mayer '98; Ascender (Marengoni et al. '99)
- Limitations:
  - High-altitude images have low spatial resolution
  - Nadir views are highly oblique for vertical surfaces
  - Side views are occluded due to urban canyons
77. Terrestrial Computer Vision
- Foundational work in various settings:
  - Camera calibration (intrinsic parameters): e.g., Faugeras & Toscani '86; Tsai '87
  - Exterior orientation, scene structure (point clouds): Kruppa '13; Ullman '79; Longuet-Higgins '81
  - Stereo (dense depth maps from image pairs, triples): Marr & Poggio '79; Baker & Binford '81; Grimson '81; Shashua '97
  - Structure from closely spaced image sequences/sets: Tomasi & Kanade '92; Azarbayejani & Pentland '95; Collins '96; Beardsley et al. '97; Baillard & Zisserman '99
78. Spatial and Combinatorial Scaling Limitations
- No prior algorithm demonstrated on all of:
  - Thousands of images; extended spatial area; wide baselines; general illumination; significant occlusion and clutter
- Short baselines and tracking failures limit spatial scale
- An underlying O(n²) assumption limits combinatorial scale
- Private coordinate systems enable only serial acquisition
79. Alternative: Human-Operated Modeling Tools
- E.g., Knopp '94; Becker '95; Taylor & Kriegman '95; Debevec et al. '96; Jepson '96; Shum et al. '98; Gibson '99; Cipolla et al. '99; Gruen & Wang '99
- Rely on a human operator to do one or more of:
  - Establish the working coordinate system and units
  - Roughly create and situate a block model of the scene
  - Roughly place and orient each camera
  - Indicate common structure among images (points, edges, faces, blocks, higher-order shapes)
  - Indicate subject and clutter portions of each image (i.e., paint away trees, people, cars, etc.)
- The human frames, initializes, constrains, and classifies.
80. Example Interactive System: Façade (Debevec et al. '96)
Tasks done by the human operator vs. tasks done by the computer:
- Frame. Human: acquire/select a related set of images; establish working units and coordinate system. Computer: provide the user interface; manage images, geometry, and constraints.
- Initialize. Human: specify a rough block model of scene structure; roughly place and orient cameras. Computer: optimize feature, structure, and camera estimates.
- Constrain. Human: indicate visible structure in each image (points, edges, faces, blocks, etc.). Computer: combine manually masked textures from different viewpoints; render the final model.
- Classify. Human: segment each image into subject and clutter (i.e., paint away trees, people, other buildings, cars, etc.).
Images courtesy Paul Debevec; used with permission.
81. Human-Operated Modeling Tools
- Good results from a small number of images, but require:
  - Uncluttered views of isolated structures
  - Significant camera standoff (tens of meters)
  - Overlapping structure, or surveyed fiducial points or other ground control, to register multiple datasets
  - Hours of skilled human effort per building
- Scaling limitations here too!
  - Limited number of input images
  - Limited occlusion and visual clutter
  - Limited number of output structures
  - Limited parallelism
82. Urban Models: Design Targets, Scaling
- Capture a km² (about ½ square mile) to a feature size of one centimeter (about ½ inch): 10^10 cm² total.
- A digital image yields 10^6 pixels, so 10^4 images are needed to observe each cm² fragment just once!
- In practice, need 3–10 views of each surface fragment, and 3–10 pixels per observation.
- Bottom line: need at least 10^5 images per km².
83. Our Approach to Urban Model Capture
- Acquire 1000s of geo-referenced images.
- Insert the images into a spatial index; establish approximate image adjacency; revise the 6-DOF alignment.
- Extract a model of coarse and fine geometry and appearance.
System development strategy: breadth-first, not depth-first!
(IUW '97; Pacific Graphics '98; ISPRS '99; ECCV/SMILE 2000, submitted)
84. Rationale
Challenges and solutions:
1. Can't expend O(n²) time; most image pairs are unrelated; serial acquisition. Solution: a geo-referenced smart camera for framing and initialization; hierarchical spatial-index algorithms for scaling and parallelism.
2. Narrow-FOV imagery (aperture problem, estimation failure). Solution: use high-resolution, super-hemispherical imagery.
3. Feature matching under wide baselines and general illumination is difficult. Solution: avoid feature matching; use ensemble features and soft matching techniques instead.
4. Can't rely on a human to identify/remove clutter. Solution: acquire thousands of images; use consensus methods and robust statistics.
85. Thrusts of This Effort
- Increasing scale, generality, automation: fragment → building → office park → campus → city; general illumination; clutter and occlusion.
- Increasing fidelity: richer geometry; texture, lighting; windows, trees, ...
86. Talk Overview
- Motivation
- Scaling issues, context
- Smart pose-camera
- Ensemble features and 6-DOF registration
- Reconstruction without correspondence
- Removing clutter
- Increasing fidelity
- Conclusions
88. Geo-Referenced Digital Pose Camera
(With Doug DeCouto)
Designed in concert with Peace River Studios, Cambridge, MA.
89. Motorized Pan-Tilt Head for Mosaic Acquisition
(Analogous to QTVR)
90. Two Individual Mosaics
Each is about 75 megapixels, but can be acquired at arbitrarily high resolution (at the cost of time and CPU). Our design target calls for 1K pixels per radian (57°) at a typical viewing standoff of about 10 meters.
91. Image Acquisition: Early Dataset
An early prototype of the pose camera, deployed in and around Tech Square (4 structures): 81 nodes, 4,000 geo-located images, 20 GB.
Adjacency graph. (CVPR '98)
92. Image Alignment (Exterior Orientation)
- A 1-megapixel camera with a 1-radian FOV has 1 mrad resolution (3 arc-min, 1/20 deg., 1 cm @ 10 m standoff).
- For registration to one pixel, we must localize camera position to 1 cm, and orientation to 1 mrad (1/20 degree).
(Figure: GPS satellites; feature at 10 m; cameras; GPS receiver; Earth.)
- Differential GPS receivers claim accuracy of 2 cm.
- So attach a GPS receiver; log position (latitude, longitude, altitude) and time with each image.
93. Sensor: GPS/IMU Navigation Estimates
- Good to about 2 meters in position and 2 degrees in heading (>100 pixels); we still have a registration problem!
94. Sensor: GPS/IMU Navigation Estimates
- Raw estimates (from nav sensors)
- Refined estimates (desired)
95. Imagery Control: Exterior Orientation
Each node must be controlled, or registered, in a common, global (Earth) coordinate system.
An image-assisted user interface auto-corresponds point features (requires several hours of user time).
Mosaicing: a significant engineering advantage.
Goal: full automation of the geo-referencing process.
96. Manual Correspondence: Disadvantages
- Infeasible for large data sets
- Potential for human error
- Unstable solutions
97. Global Registration: DOF Argument
- Input: directional constraints. Output: a position (x, y, z) for each of the V input nodes, satisfying those constraints.
- Each node adds 3 DOFs (position).
- Each adjacency fixes 2 DOFs (a 3D direction, up to scale).
- Necessary condition: 2E + 4 ≥ 3V.
- Sufficiency: every edge is part of a triangle edge-adjacent to another triangle.
- A new, open question in rigidity theory; previous work used joint angles and/or lengths.
- Solve a linear system with C-LAPACK (uses about 1 CPU-minute).
- Then register with (unbiased) GPS position estimates.
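The counting condition above follows directly from the DOF bookkeeping: 3V position unknowns, 2E directional constraints, and a gauge freedom of 4 (global translation plus scale) that no direction constraint can remove. A one-line illustrative check (the function name is ours, not the talk's):

```python
def directions_may_determine_positions(num_nodes, num_edges):
    """Necessary counting condition from the DOF argument: each node
    contributes 3 position DOFs; each adjacency (a known 3D direction,
    up to scale) removes 2; global translation (3) plus scale (1) can
    never be recovered. So we need 2E + 4 >= 3V."""
    return 2 * num_edges + 4 >= 3 * num_nodes
```

A triangle of nodes passes (6 + 4 ≥ 9); an open path of three nodes fails (4 + 4 < 9), matching the intuition that its directions leave an edge length free. The condition is only necessary; the triangle-adjacency condition on the slide addresses sufficiency.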
98. End-to-End Registration: VPs, Points
99. End-to-End Registration: Epipoles
- After rotational alignment
- Registered
100. Comparison to Manual Bundle Adjustment
Manual
Auto
101. Performance in the Presence of Clutter
Manual
Auto
102. Performance with Poor Pose Initialization
Initial
Refined
103. Registered Pose-Image Dataset (>4,000 images, 25 GB, six billion pixel observations)
Dominant cost: mosaics (8 CPU-hours; <1 hour real time).
104. Structure Extraction Without Correspondence
A histogramming algorithm identifies the orientations of significant vertical façades in the vicinity of the cameras (with Satyan Coorg). (CVPR '99)
105. Façade Detection
A sweep-plane algorithm identifies the location and spatial extent of each (coarse) vertical façade.
106. Recovered Coarse Façades
False positives removed with an absolute area threshold.
107. Result for Example Dataset
Dominant cost: plane sweep (8 CPU-hours on this data). Generalizes to other shapes, given sufficient CPU.
108. Texture-Mapping from Images
One can map the closest image onto each surface, but several problems arise:
- Lighting conditions, shadows, and reflections are inherited from the image.
- Cluttering elements (trees, people, cars) are pasted onto the surface.
- Off-plane relief (window moldings, etc.) is not modeled.
109. Texture Estimation Challenges
110. Iterative Consensus Texture Estimation
A robust, weighted-median-statistics algorithm estimates texture/BRDF for each building façade: weighted xyY median, then sharpening and masking.
The algorithm removes structural occlusion, foliage, blur (obliquity), and color and lighting variations! (See also inverse global illumination, Yu et al. '99.) (CVPR '99)
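The consensus step can be sketched as a per-pixel weighted median over all images observing a façade, with per-pixel occlusion weights: transient occluders (foliage, people, cars) appear in a minority of views and are rejected by the median. A minimal illustrative sketch operating on grayscale arrays, not the talk's xyY-color implementation:

```python
import numpy as np

def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half the total."""
    order = np.argsort(values)
    v, w = np.asarray(values)[order], np.asarray(weights)[order]
    cum = np.cumsum(w)
    return v[np.searchsorted(cum, 0.5 * cum[-1])]

def consensus_texture(observations, masks):
    """observations: list of HxW arrays, one per image of the façade;
    masks: matching per-pixel occlusion weights in [0, 1]. Returns the
    per-pixel weighted-median consensus, which rejects values seen in
    only a minority of the views (e.g., a tree in front of one image)."""
    obs, wts = np.stack(observations), np.stack(masks)   # (K, H, W)
    H, W = obs.shape[1:]
    out = np.empty((H, W))
    for r in range(H):
        for c in range(W):
            out[r, c] = weighted_median(obs[:, r, c], wts[:, r, c])
    return out
```

In the iterative scheme on the next slides, the consensus is correlated back against each image to re-estimate the occlusion masks, and the two steps alternate.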
111. Masking Away Occlusion, Clutter
(With Eric Amram, Stefano Totaro, Franck Taillandier)
112. Without masking
With masking
113. Texture Estimation Results
- Input: raw photograph. Output: synthetic texture.
- Made possible by many observations, and by a sensor and aggregation algorithm that effectively see through complex foliage and clutter.
114. Textured Model (with Overlaid Aerial Image)
115. Increasing Scale: 3 Campus Datasets
East Campus
Full Campus
117. Thrusts of This Effort
- Increasing scale, generality, automation: fragment → building → office park → campus → city; general illumination; clutter and occlusion.
- Increasing fidelity: richer geometry; texture, lighting; windows, trees, ...
118. Capturing Surface Relief
- Idea: assume surfaces are nearly planar; recover deviations using generalized stereo (Szeliski '94; Sawhney '94; Kumar et al. '94; Debevec et al. '96).
- Based on the terrain reconstruction algorithms (and implementations) of Fua and LeClerc.
121. Symbolic Window Extraction
- Based on Wang et al. (Proc. SPIE '97)
- An oriented region-growing technique
- Applied to composite façade images, after removal of occlusion and shadows
- Planned applications:
  - Mesh regularization (quantized depth)
  - Modeling color from multiple distributions
131. Capturing 3D Models of Existing Trees (with Ilya Shlyakhter and Max Rozenoer; co-advised by Julie Dorsey)
- Input: pose-images. Output: a 3D tree model.
132. Reconstruction Steps
- Segment the tree region
- Reconstruct the 3D shape
- Infer major branches
- Grow minor branches and leaves to fill the 3D shape
- Assign colors from the images
133. Reconstructing the Tree's 3D Shape
- Volumetric intersection:
  - Extrude each silhouette to a polyhedral cone
  - Intersect the cones to obtain the 3D shape
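Silhouette-cone intersection has a simple discrete analog that is easy to sketch: keep a voxel iff its projection falls inside every silhouette. This toy version, with orthographic projection functions supplied by the caller, is illustrative only and not the talk's polyhedral implementation:

```python
import numpy as np

def carve(silhouettes, projections, grid):
    """grid: (M, 3) voxel centers. projections: one function per view,
    mapping a 3D point to integer (row, col) pixel coordinates in the
    matching silhouette (an HxW boolean array). A voxel survives iff
    every view sees it inside its silhouette -- the discrete analog of
    intersecting the extruded silhouette cones."""
    keep = np.ones(len(grid), dtype=bool)
    for sil, proj in zip(silhouettes, projections):
        for k, p in enumerate(grid):
            if keep[k]:
                r, c = proj(p)
                inside = 0 <= r < sil.shape[0] and 0 <= c < sil.shape[1] and sil[r, c]
                keep[k] = bool(inside)
    return grid[keep]
```

The polyhedral-cone version used for the trees produces a watertight hull directly, rather than a voxel approximation, but the membership test is the same.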
134. Infer Plausible Branch Structure
- Find the 3D medial axis
- Fix nodes at terminal points (branch tips)
- Use vertices from even-order convex hulls (Rappoport '92)
Simple example: 1st-order hull ABCD; 2nd-order hull CED. Nontrivial even-order hulls correspond to branch tips.
135. Grow the Remainder of the Tree
- Procedural model: L-systems
  - Rewriting rules specify branching
  - Normally, one starts with a single shoot
  - Here, we start with the complete skeleton
Simple example: the 1st rule directs growth; the 2nd rule directs branching.
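The two-rule example can be sketched as a plain string-rewriting L-system; the specific alphabet and rules below are a standard textbook toy, not the talk's grammar. `F` draws a segment, `+`/`-` turn, and `[`/`]` push and pop turtle state, so the second rule is what introduces branching.

```python
def lsystem(axiom, rules, steps):
    """Apply the rewriting rules to every symbol in parallel, `steps` times;
    symbols without a rule (here +, -, [, ]) are copied unchanged."""
    s = axiom
    for _ in range(steps):
        s = "".join(rules.get(ch, ch) for ch in s)
    return s

# Toy example in the spirit of the slide:
rules = {"F": "FF",            # 1st rule: directs growth (segments lengthen)
         "X": "F[+X][-X]"}     # 2nd rule: directs branching (two side shoots)
```

Starting from the full reconstructed skeleton rather than a single axiom symbol is what lets the same machinery fill in minor branches on an observed tree.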
136. Coloring the Leaves
- View-dependent mapping: colors are back-projected from the image most closely matching the viewpoint.
- Alternative: match the color distribution to that found in the input images.
137. Matched Stills
138. Detail Views
140. Take-Home Messages
- Fully automated model acquisition is possible, in principle and in practice.
- An augmented sensor removes the combinatorial bottleneck inherent in classical algorithms.
- Spherical images are a fundamental enabling technique (more than simply a practical advantage).
- Ensemble, rather than individual, features largely eliminate the need for correspondence.
- Large numbers of images can overcome even severe clutter and occlusion, efficiently.
- Even surprisingly complex structures (e.g., trees) can be plausibly modeled from observation.
141. Acquisition Is the Application!
- End-to-end system for model acquisition:
  - From sensor directly to textured CAD/GIS
  - Remove the human from the loop
- Advantage: removes scaling and throughput limits.
- Tradeoff: instrumentation; limited domain.
- Evaluate:
  - Costs (development, one-time, ongoing)
  - Efficiency (computation, storage resources)
  - Fidelity (faithfulness to a reference model)
  - Utility (to military, maintenance, simulators)
142. Overview: Urban Model Acquisition
- Automation
  - Automatic exterior calibration of imagery
- Generalization & Aggregation
  - 3D reconstruction, merging
  - Texture, occlusion, relief estimation
    - Collaboration with Fua (EPFL) and LeClerc (SRI)
  - Symbolic window extraction
    - Collaboration with Wang and Hansen (UMass)
- Scale and Throughput
  - Data acquisition, distributed processing
- Input and Output Validation
  - Sensor improvement; surveying efforts
143. Project Goals: End-to-End
- Develop a sensor and algorithms to extract geodetic, textured CAD models from initially uncontrolled imagery, without a human in the loop.
- Five parts:
  - Develop a novel sensor: a pose camera for imagery and approximate exterior orientation
  - Deploy the sensor to acquire pose mosaics
  - Refine estimates of exterior orientation
  - Extract geometry and textures (BRDFs)
  - Evaluate and validate models, cost, etc.
144. Research/Engineering Footprints
Comparison of Ascender¹, Façade², and MIT/City:
- Number of images: tens; tens; thousands
- Imagery type: aerial; near-ground; near-ground
- 6-DOF camera pose: from human; from human; instruments + optimization
- Structure extraction: roof-matching + optimization; by human + optimization; automatic detection + optimization
- Number of structures: scores; one to tens; arbitrary
- Output coordinate system: specified by operator; specified by operator; geodetic (Earth) coordinates
- Texture: procedural matching; manual segmentation; automatic with robust statistics
- Scaling capability: unclear; unclear; spatial index
- Parallel model acquisition and merging: none; none; use of geodetic coordinates, index
¹ UMass  ² Berkeley
145. Engineering Rationale, Choices
- General vs. restricted environment class
- Few vs. many images
- Satellite/aerial vs. ground imagery
- Video vs. single-frame camera
- Resolution vs. large field of view
  - Optics, CCD, pan/tilt rig: both
- Geo-referencing data with each image
  - 6 DOF: three translation, three rotation, in Earth coordinates (lat, long, alt; NED)
- Breadth-first (not depth-first) development
146. Urban Model Capture
Vertical façade extraction:
- Step 1. Low-level feature detection (horizontal line segment identification)
- Space sweep finds the dominant façades
Sparse reconstruction:
- Step 2. Computing frustums
- Step 3. Compute vertex and line extrusions
- Step 4. Matching vertex extrusions (vertex extrusions corresponding to a vertex element)
- Step 5. Matching line extrusions (line extrusions corresponding to a line element)
- Step 6. Computing surfaces
Variational surface evolution: 1. single-camera projection; 2. multiple-camera projection; 3. images mis-aligned; 4. surface evolution; 5. alignment improves.
Geometry extraction and aggregation: 1. data; 2. spatial index; 3. surface patches; 4. extended surfaces; 5. final surface.
(Figure: the reconstructed model of a portion of Technology Square.)
Citations: Coorg, CVPR 1999; Mellor, IUW 1997; Cutler, MIT MEng 1999; Chou, IUW 1997; Amram, MIT MSc 1998; Faugeras and Keriven, SS 1997.
147. Edge, Line, Corner Detection
148. Sensor Challenges (Low → High)
- Fuse data streams from diverse sensors (GPS, IMU, omnicam, etc.)
- Achieve meaningful error bounds
- Effectively incorporate high-level knowledge about platform motion
  - Translations, rotations, stops
- Disambiguate GPS noise from multipath
- Bootstrap from crude model capture
149. Feature Detection Challenges
- Characterize and achieve theoretically optimal estimation of edges
- Effectively combat local clutter (e.g., obscurations of a single edge)
- Effectively combat false positives (e.g., tree limbs disguised as edges)
- Propagate useful error bounds (e.g., to downstream algorithms for vanishing point estimation)
150. 3D Reconstruction Challenges
- Expressiveness of the template
  - Polyhedra, surfaces of revolution, etc.
- Variations in feature size
  - From signage lettering to large buildings
- A principled idea of when to believe in a multiply-reinforced element
- Validation: a goodness metric, etc.
151. Why We Need Texture
- Show buildings with raw imagery mapped onto them: trees are stuck onto the buildings!
- Challenges of multiple views:
  - Differing lighting
  - Distinct occlusion in each image
  - The surface is non-planar; each image sees a different piece
- Can undo lighting (Yu '99), but how to deal with occlusion?
- Strategies to generate textures:
  - Assume no occlusion (simple averaging)
  - Rely on a human user to paint away textures
152. Alignment Is the Automation Bottleneck
- Overview of the end-to-end pipeline
- Recovering rotation and translation for acquired hemispherical images
- Short-baseline techniques are not applicable
- Exploit navigation information and a large number (1000s) of images
- Tack: decouple rotation and translation; solve independently (with Matt Antone)
153. Registration Challenges
- Robust VP estimation from edge classes (rather than single edges)
- Robust translation directions from low-level features (edges, points)
- Use of dense (area) information?
- Allocation of error: orbit of the optical center; intrinsics; spherical mosaicing error; noise in feature extraction; etc.
154. Texture Estimation Challenges
- How best to incorporate:
  - Billions of pixel observations
  - Non-planar surface geometry
  - Appearance models of varying power (diffuse, specular, BRDF, etc.)
  - High-level knowledge of repetitive structure, common material types
- How to validate our results?
155. Increasing Scale, Throughput
- Scale: sensor, spatial infrastructure
  - Sensor node time reduced to 1 minute, from 5 minutes in 1998, including HDR
  - Input: second MIT dataset, several hundred nodes across East Campus
- Throughput: map algorithms to a parallel, distributed Linux cluster
  - 1–32 CPUs with near-linear speedups
  - Currently limited by I/O bandwidth
158. Evaluation Criteria
- Throughput
- Complexity
- Fidelity (Geometric, Photometric)
- Adoption of tools and models by users
- Assessment of results by community
159. Module Improvements
- 10× area implies 10× data size
  - Data scaling, naming conventions
- Speed
  - Pose-camera (Argus) improvements
  - Parallel distributed processing pipeline
- Accuracy, validation
  - 6-DOF raw navigation data
  - Imagery control (exterior calibration)
  - Derived features: points, edges, faces
  - Relief extraction
  - Symbolic windows
160. Texture Algorithm
- Four steps in the algorithm:
  - Initialize the per-image occlusion mask to 0.5
  - Texture + occlusion mask yields a consensus
  - Correlate the consensus with the images to re-form the occlusion masks
  - Iterate
- Show several steps of the algorithm: image, mask, image × mask, consensus.
161. Generalization and Aggregation
- Several 3D reconstruction techniques:
  - Large planar surfaces (Coorg)
  - Small surfels (Mellor)
  - Bottom-up surface inference (Chou)
- Aggregation phase (Cutler): principled merging of the algorithms' outputs to produce a single consistent CAD model
162. Next Step: Extracting Geometry
- Several approaches, with overlapping, partially complementary operating regimes:
  - Vertical façade extraction: finds large vertical surfaces from horizontal edges
  - Low-level feature hypothesis and promotion: bottom-up, from sparse point and edge features
  - Dense surfel optimization: treats the world as a dense cloud of surface patches
- Aggregation phase
165. Geometry from Sparse Features (with George Chou)
171. Geometry from Dense Surfels (with J.P. Mellor)
- Generalizes Collins' space sweep ('96)
- Related to Kutulakos and Seitz's space carving ('98)
(IUW '97; Mellor '99)
173. Geometry Aggregation (with Barb Cutler)
(Cutler '99)
174. Toward Automated Exterior Registration
With Manish Jethwa, Neel Master
175. Preliminary Results (with Overlaid Aerial Image)
The model represents about 1 CPU-day at 200 MHz.
Next: acquire the full MIT campus; compare to a reference model captured via traditional surveying.
176. Validation
- Input (Mike Bosse)
  - Survey waypoints to characterize the precision and accuracy of the navigation sensor
  - Suppress sub-systems (GPS, inertial, odometry) to gauge the contribution of each
- Output (Qixiang Sun)
  - Synthetic inputs, idealized results
  - Real inputs, optimization residuals
  - Compare reconstructed models to surveyed, hand-solved models
177Evaluation Criteria
- Throughput
- Complexity
- Fidelity (Geometric, Photometric)
- Adoption of tools and models by users
- Assessment of results by community
178From the East
From the South
179Connections to other communities
- MIT Physical Plant
- Well-maintained 2D CAD
- City of Cambridge Planning Dept.
- Surveying, GIS expertise/expectations
- MIT Depts of Architecture, Urban Planning
- GIS software, demographic data
180From maps to models
- A model is any dataset in an electronic form
suitable for manipulation by a computer program
Map (paper chart, or scanned image)
Model (city locations, explicit road networks)
181Models enable visualization and simulation!
Route planning
(Examples from MapQuest)
182Image acquisition: first dataset
Early prototype of pose camera deployed in and
around Tech Square (4 structures): 81 nodes,
4,000 geo-located images, 20 GB
(CVPR 98)
183Four design tradeoffs
- The general computer vision problem is hard, so focus
on urban environments for now
- Previous approaches use few images; instead,
acquire thousands of images (the only way to overcome
clutter automatically)
- Can't assume O(n²) image pairs are related, so use a
sensor that identifies, by proximity and
direction, those images that are likely to be
related
- Human-operated modeling tools use negligible CPU;
instead we use massive parallelism and I/O
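The third tradeoff, replacing the all-pairs O(n²) search with metadata-driven pruning, can be sketched with a simple spatial grid over camera positions. The 50 m overlap radius and the (id, x, y) node format are illustrative assumptions:

```python
def candidate_pairs(nodes, max_dist=50.0):
    """Prune image pairs by pose-metadata proximity (sketch).

    nodes: list of (id, x, y) camera positions from the navigation sensor.
    Returns pairs of node ids close enough to plausibly share scenery,
    avoiding the O(N^2) all-pairs search over unrelated images.
    """
    cell = max_dist
    grid = {}
    for nid, x, y in nodes:
        grid.setdefault((int(x // cell), int(y // cell)), []).append((nid, x, y))
    pairs = set()
    for (cx, cy), bucket in grid.items():
        # Each node is compared only against its own and neighboring cells,
        # so far-apart nodes are never examined at all.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for a in bucket:
                    for b in grid.get((cx + dx, cy + dy), []):
                        if a[0] < b[0] and \
                           (a[1] - b[1]) ** 2 + (a[2] - b[2]) ** 2 <= max_dist ** 2:
                            pairs.add((a[0], b[0]))
    return pairs
```

With roughly uniform node density, each node touches a bounded number of neighbors, so the number of candidate pairs grows linearly in N rather than quadratically.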
184Parameterized tree generators
- Biology-based: enforce botanical growth laws
- Good fidelity to sunlight availability, other factors
- Hard to control final shape
- Examples: Lindenmayer 68, Prusinkiewicz 88
- Geometry-based: specify detailed geometric parameters
- More direct control of shape
- Fidelity to biology, environment not enforced
- Examples: Bloomenthal 85, Greene 89
- Both generator classes yield one of a family of
trees, not a particular, observed tree
185Hybrid Approach
- Geometry-based: infer plausible branch structure
directly from observations
- Biology-based: use a growth model to fill the tree
volume with minor branches, leaves
- Texture/coloration step: color the leaves
according to original image observations
186Segmentation (identifying tree pixels)
- Currently manual
- Preliminary filter implemented (also Haering,
Lobo 1999)
187Tree Reconstruction Summary
- Preliminary solution for one instance of a hard
inverse problem: forcing a procedural model to
reproduce an existing object
- The hybrid approach allows direct control of final
shape while relegating details to a procedural
model that enforces biological fidelity
188Urban models: design targets, scaling
- Capture a km² (about ½ sq. mile) to a feature
size of one centimeter (about ½ inch): 10¹⁰ cm²
total
- Using a 1-megapixel digital camera, 10⁴ images are
needed just to observe each cm² fragment once!
- In practice, need 3-10 views of each surface
fragment, and 3-10 pixels per observation
- Bottom line: need at least 10⁵ images per km²
- Quadratic-time algorithms clearly not applicable
- Human operators can't even look at 10⁵ images,
let alone manipulate them. So what do we do?
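The slide's arithmetic can be checked directly. A minimal sketch using the stated numbers, taking the low end of the 3-10 view range:

```python
km_in_cm = 100_000                       # 1 km = 10^5 cm
area_cm2 = km_in_cm ** 2                 # 10^10 cm^2 per km^2
pixels_per_image = 1_000_000             # 1-megapixel camera
# At one pixel per cm^2 fragment, 10^4 images cover the area once
images_once = area_cm2 // pixels_per_image
views = 3                                # 3-10 views per fragment (low end)
pixels_per_obs = 10                      # 3-10 pixels per observation
images_needed = images_once * views * pixels_per_obs
# images_needed is 300,000: comfortably above the 10^5 lower bound
```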
189Classical ambiguity: rotation vs. translation
- Caused by limited camera FOV
- To first order, rotation and translation are
indistinguishable when the FOE is far outside the image
- Show example
190History: Computer Vision
- Camera calibration (intrinsic parameters)
- E.g., Faugeras, Toscani 86; Tsai 87
- Exterior orientation, scene structure (point clouds)
- Kruppa 13; Ullman 79; Longuet-Higgins 81
- Stereo (dense depth maps from image pairs, triples)
- Marr, Poggio 79; Baker, Binford 81; Grimson 81;
Shashua 97
- Structure from closely-spaced image sequences
- Tomasi, Kanade 92; Azarbayejani, Pentland 95;
Collins 96; Beardsley et al. 97; Baillard,
Zisserman 99
191Model capture: an analogy
- How can one capture an existing physical document
into a word processor, for editing?
192Option 1: Type it in
Uses existing skills, hardware, and software
Accurate (depending on input, operator skills)
- Requires skilled human operator(s), proportional
to the number of pages to be input
Increasing computer speed doesn't generally
increase system throughput. Thus, the human is
eventually the bottleneck
193Option 2: Scan and OCR the document
- A) Acquire a digital photograph of the printed pages
- B) Apply OCR (Optical Character Recognition)
algorithms to extract a model of letters, words
- C) Output the document in machine-readable form
194Scanning: Advantages, disadvantages
No human in the loop!
Throughput increases with technology
Parallel capture (scanners, processing) possible
Accurate (depending on input, algorithms)
Someone must develop the scanner, algorithms
(Possibly years until commercial viability)
195Back to Geometric Model Capture
- We are developing a scanner and a suite of
extraction algorithms for urban environments!
Input: pictures of the urban environment
Output: textured 3D CAD model in Earth
coordinates (lat., long., alt., and orientation)
196Vision: Calibration, Correspondence, Structure,
Appearance
197Hidden assumption: quadratic complexity
- Most algorithms assume all input images are related!
- Expend O(n²) time searching for correlations
- But for extended terrestrial imagery, overlap is sparse
- This is impractical (and wasteful) for large n
- Aperture problem (limited FOV) makes things worse
198Hidden assumption: private coordinates
- Local coordinate system used for each image set
- These algorithms cannot use parallel inputs
- No clear way to combine models generated across
runs
(Diagram: three image sets, each with its own local x-y coordinate frame)
199Integration barrier: disjoint operating regimes
- Hundreds of algorithms exist for particular
sub-tasks of the computer vision problem
- Feature detection, camera calibration,
short-baseline exterior orientation and structure
from motion, feature correspondence, etc.
- However, operating assumptions are restrictive,
or unstated, or both, making composition hard!
- Examples: short vs. long baseline; orthographic
vs. perspective; controlled vs. diffuse vs.
general illumination; known vs. unknown camera
calibration; local vs. global processing/consistency
- Not simply a systems integration problem
200Scale constrains us severely
- Can't control illumination conditions
- Can't instrument the environment with fiducials
- Can't precisely control camera placement (in
contrast to, e.g., stereo rigs or other gantries)
- Can't afford O(n²)-time algorithms
- Can't assume a single, serial image sequence
- Can't have a human in the processing loop
201A fundamental optical tradeoff: resolution vs.
field of view
CCD array (1K x 1K pixels)
Wide-angle (e.g., fisheye) lens: large field of
view, but low angular resolution
Long (e.g., telephoto) lens: high angular
resolution, but small field of view
Images courtesy Helmut Dersch; used with permission
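The tradeoff is easy to quantify: with a fixed pixel count, angular resolution is inversely proportional to field of view. A sketch assuming pixels spread uniformly across the FOV (the 180° and 10° lens FOVs are illustrative):

```python
def angular_res_deg_per_pixel(fov_deg, pixels):
    """Approximate angular resolution of a lens/CCD combination,
    assuming pixels are spread uniformly across the field of view."""
    return fov_deg / pixels

# A 1K-pixel row behind a fisheye vs. a telephoto lens
fisheye = angular_res_deg_per_pixel(180.0, 1024)   # ~0.18 deg/pixel
telephoto = angular_res_deg_per_pixel(10.0, 1024)  # ~0.01 deg/pixel
```

The fisheye sees everything coarsely; the telephoto sees a sliver finely. Tiling many narrow-FOV images into a spherical mosaic gets both.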
202Mosaic generation (with Satyan Coorg)
Each node is 25-250 images tiling a sphere
about a mechanically fixed optical center
Each node is correlated to form a spherical mosaic
Camera internal parameters are auto-calibrated
Computation is fully automated (no human in loop)
Per node (50 images): 20 CPU-minutes @ 200 MHz
CVPR 98; IJCV (to appear)
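The geometry behind tiling a sphere with images can be sketched as mapping each pixel ray into spherical coordinates given the image's orientation about the fixed optical center. This is a simplified illustration (roll is omitted, and centered pixel coordinates are assumed); it is not the talk's calibration procedure:

```python
import math

def pixel_to_sphere(u, v, f, yaw, pitch):
    """Map centered pixel (u, v) of an image with focal length f and
    known yaw/pitch to a spherical direction (azimuth, elevation).
    Camera looks down +z; yaw is about the y-axis, pitch about x."""
    # Ray through the pixel in the camera frame, normalized
    x, y, z = u, v, f
    n = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / n, y / n, z / n
    # Rotate by pitch (about x), then yaw (about y)
    y, z = (y * math.cos(pitch) - z * math.sin(pitch),
            y * math.sin(pitch) + z * math.cos(pitch))
    x, z = (x * math.cos(yaw) + z * math.sin(yaw),
            -x * math.sin(yaw) + z * math.cos(yaw))
    return math.atan2(x, z), math.asin(y)
```

Once every pixel of every image maps to a direction on the common sphere, overlapping images can be correlated in that shared parameterization to refine the rotations.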
203Two engineering problems
- First: obscuration, multi-path, and electronic
noise degrade GPS accuracy to about 20 m, and
make it only intermittently available
- Second: GPS is a 3-DOF position sensor only; it
gives no information about (3-DOF) heading
(Diagram: urban canyon with clear line-of-sight, obscured, and multi-path GPS signal paths)
204GPS/Inertial Navigation (with Michael Bosse)
- GPS is unbiased, but only intermittently
available - Inertial is continuously available, but drifts
- Strategy: combine sensors to achieve a continuous
2 m, 2° solution; then refine to 1 cm, 0.05° using
images
- Decoupled GPS, inertial vs. coupled GPS, inertial
GPS World, April 2000; ECCV/SMILE 2000 (submitted)
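The strategy of combining an unbiased but intermittent sensor with a continuous but drifting one can be sketched as a one-dimensional complementary filter. This is an illustration of the idea, not the system's actual estimator; the blend gain `alpha` is an assumed value:

```python
def fuse(gps_fixes, imu_deltas, alpha=0.05):
    """1-D complementary-filter sketch of the GPS/inertial strategy.

    gps_fixes: per-step position fix, or None during a GPS outage.
    imu_deltas: per-step displacement increments (continuous, drifts).
    """
    x = 0.0
    track = []
    for fix, dx in zip(gps_fixes, imu_deltas):
        x += dx                    # inertial: always available, drifts
        if fix is not None:        # GPS: unbiased but intermittent
            x += alpha * (fix - x)  # pull the estimate toward the fix
        track.append(x)
    return track
```

Inertial integration bridges GPS outages, while each GPS fix bleeds off the accumulated drift, keeping the error bounded instead of growing without limit.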
205Focus of expansion / contraction
- The FOE is the special point from which the entire
world looms as you move toward it
- Matched by a second, antipodal point, the focus
of contraction (usually not in view)
Image courtesy Steve Mann; used with permission
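Geometrically, the FOE is just the perspective projection of the translation direction onto the image plane; a minimal sketch (camera looking down +z, focal length f assumed):

```python
def focus_of_expansion(t, f=1.0):
    """Project translation direction t = (tx, ty, tz) onto the image
    plane to get the focus of expansion. The antipodal focus of
    contraction is the projection of -t."""
    tx, ty, tz = t
    if tz == 0:
        return None  # FOE at infinity: pure sideways motion
    return (f * tx / tz, f * ty / tz)
```

This also shows why a narrow FOV causes the rotation/translation ambiguity: near-sideways motion puts the FOE far outside the image, where its flow field looks locally like a rotation.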
206Demonstration
207Increasing Throughput
- Scale: sensor, spatial infrastructure
- Sensor node time reduced to 1 minute from 5
minutes, including HDR imagery
- Processing MIT dataset: nearly a thousand nodes
spanning the entire campus
- Currently mapping algorithms to a parallel,
distributed Linux cluster
- 1-64 CPUs with near-linear speedups
- Current bottleneck is disk I/O bandwidth
211Animation with novel views