Machine Vision for Urban Model Capture: Exploiting Scale, Achieving Automation Seth Teller MIT Graph

1
Machine Vision for Urban Model Capture:
Exploiting Scale, Achieving Automation
Seth Teller, MIT Graphics Group
Joint work with Eric Amram, Matthew Antone, Michael Bosse, Satyan
Coorg, Doug DeCouto, Manish Jethwa, Neel Master,
Ivan Petrakiev, Qixiang Sun, Franck Taillandier,
Stefano Totaro, Xiaoguang Wang et al.
2
Example Dataset
500 nodes spanning 500 meters
10,000 HDR images (Debevec, Malik 97)
50,000 raw megapixel images
  • Most node pairs are entirely unrelated!

3
Tackling Scale, Extent, Automation
  • Scale: number of input images, features
  • Extent: size, scope of acquisition region
  • Automation: end-to-end from sensor to CAD

2D images
3D model
4
Motivation
  • Why automate? Good interactive tools exist.
    Yes, but they are either scale-limited or very
    expensive, and produce models for a restricted
    set of viewpoints
  • System bottleneck is human interaction time
  • Façade: 80 man-hours to model 20 buildings
    representing tower and surround (Debevec 01)
  • LA Basin: 100,000 man-hours (50 man-years)
    to model 15,000 urban structures (Jepson 00)
  • All require skilled operators, careful view
    planning
  • Interactive tools are not on technology curve!

Automation.
Where do you go from here?
5
Four Key Ideas
Sensor:
1. Metadata captured with each image
2. Omni-directional (wide-FOV) images
Algorithmics:
3. Projective, probabilistic uncertainty
4. Asymptotically linear running times
Tradeoff: extract one expensive human, add many cheap
images
6
1. Camera Metadata
Intuition: identifies images likely to overlap
[Figure: node positions on x-y axes]
Without metadata: N images => O(N^2) time to discover overlaps
With metadata: N images => O(N) time to discover overlaps
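One way to see how position metadata can cut overlap discovery from all-pairs to roughly linear time is a uniform grid over node positions. The sketch below is illustrative only and is not the system's actual spatial index: the function name `candidate_pairs`, the 2D positions, and the fixed overlap radius are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import product


def candidate_pairs(positions, radius):
    """Return sorted index pairs (i, j), i < j, whose 2D positions lie within `radius`.

    Hypothetical helper: buckets nodes into square cells of side `radius`,
    so each node is compared only against nodes in its 3x3 cell neighborhood.
    """
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(positions):
        grid[(int(x // radius), int(y // radius))].append(idx)

    pairs = set()
    for (cx, cy), members in grid.items():
        for dx, dy in product((-1, 0, 1), repeat=2):
            # .get() avoids creating empty cells while iterating the grid.
            for j in grid.get((cx + dx, cy + dy), ()):
                for i in members:
                    if i < j:
                        xi, yi = positions[i]
                        xj, yj = positions[j]
                        if (xi - xj) ** 2 + (yi - yj) ** 2 <= radius ** 2:
                            pairs.add((i, j))
    return sorted(pairs)


nodes = [(0, 0), (5, 0), (100, 100), (104, 103), (500, 500)]
print(candidate_pairs(nodes, radius=10))
```

If node density is bounded, each node inspects a constant number of neighbors, so the total work grows about linearly with the number of nodes; distant pairs are never examined.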
7
2. Omni-directional Imagery
Intuition: avoids classical aperture problem
Integrating surround is fundamental advantage
Yields superior robustness and accuracy
8
3. Projective, probabilistic uncertainty
  • Intuition: account for noisy features, pose
  • Bingham densities (1974) generalize Gaussians
  • Completely specified by parameters k1 <= k2 <= k3 = 0
  • Advantages
  • Appropriate from theoretical standpoint
    (projective)
  • Can fuse noisy features into accurate aggregate
    estimates
  • Can defer or avoid hard (deterministic)
    decisions

[Figure panels: Symmetric Polar, Symmetric Equatorial, Asymmetric Polar, Asymmetric Equatorial, Uniform]
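To make the Bingham density concrete, the sketch below evaluates its unnormalized form, exp(sum_i k_i (m_i . x)^2), for a unit vector x, three orthonormal axes m_i, and concentrations with the largest fixed at 0. The normalizing constant is deliberately omitted, and the axis and parameter values are assumptions chosen for illustration rather than values from the talk.

```python
import math


def bingham_unnormalized(x, axes, kappas):
    """Unnormalized Bingham density exp(sum_i k_i * (m_i . x)^2) for unit vector x."""
    total = 0.0
    for axis, kappa in zip(axes, kappas):
        d = sum(a * b for a, b in zip(axis, x))
        total += kappa * d * d
    return math.exp(total)


# Assumed example: axes aligned with x, y, z; mass concentrated at +/- z.
AXES = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
POLAR = (-50.0, -50.0, 0.0)   # k1 = k2 << k3 = 0
UNIFORM = (0.0, 0.0, 0.0)     # all concentrations zero

print(bingham_unnormalized((0.0, 0.0, 1.0), AXES, POLAR))
print(bingham_unnormalized((0.0, 0.0, -1.0), AXES, POLAR))
print(bingham_unnormalized((1.0, 0.0, 0.0), AXES, POLAR))
```

Because every term is squared, x and -x receive the same density, which is the antipodal symmetry that makes this family suitable for projective features such as edge duals and vanishing points.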
9
4. Linear Asymptotics
  • Intuition: crucial for spatial extent, free
    viewpoint
  • Capture a km^2 to 1-cm resolution: 10^10 cm^2
    fragments
  • 10^4 images needed to observe each fragment just
    once!
  • Need several views, several pixels per fragment
  • Bottom line: need at least 10^5 images per km^2

[Figure labels: 1 cm^2 fragments, mega-pixel camera, 1 kilometer]
10
Talk Overview
  • Introduction Motivation
  • Key Ideas
  • Automated 6-DOF Image Registration
  • Large-scale image datasets
  • Large-scale model extraction
  • Long-term goals, vision
  • Conclusion

11
Image registration (bundle-adjustment)
  • Traditional method: manual correspondence
  • Several limitations
  • Infeasible for large data sets
  • Numerical instability
  • Human inability to match

?
12
Automated Registration (with Matthew Antone)
  • Intuition: each node
  • detects a local, rigid frame in scene
  • then aligns itself to its neighbors
  • Break 6 DOFs into 3 + 2 + 1 DOFs

13
Detecting rigid 3-DOF frame
Assume 2 distinct vanishing points visible
VPs found in many scenes
(Coughlan and Yuille, NIPS 2000)
  • Dualize each edge to a point on the unit sphere

CVPR 2000
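The dualization step can be sketched as follows: an image edge and the focal point span a plane, and the unit normal of that plane is the edge's dual point on the sphere. Edges that come from parallel 3D lines have duals lying on one great circle, whose pole is the vanishing direction. The code assumes an ideal pinhole camera with the principal point at the image origin; the function names are illustrative and not taken from the original system.

```python
import math


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def edge_dual(p1, p2, focal):
    """Unit normal of the plane through the focal point and an image edge."""
    ray1 = (p1[0], p1[1], focal)
    ray2 = (p2[0], p2[1], focal)
    return normalize(cross(ray1, ray2))


def vanishing_direction(dual_a, dual_b):
    """Direction orthogonal to two edge duals (defined up to sign)."""
    return normalize(cross(dual_a, dual_b))


# Two image edges that are projections of 3D lines parallel to (1, 0, 1), focal = 1.
dual_a = edge_dual((-0.25, 0.25), (1 / 6, 1 / 6), 1.0)
dual_b = edge_dual((0.0, -0.2), (2 / 7, -1 / 7), 1.0)
print(vanishing_direction(dual_a, dual_b))
```

With only two edges the estimate is exact in the noise-free case; with thousands of noisy edges per node, a fit over the whole band of duals would take the place of the single cross product.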
14
Transforming every edge from a single image
yields evident band structure on the sphere
but not enough structure to reliably estimate
(or even identify!) the scene vanishing points
15
However, transforming edges from every image in
wide-FOV node yields strong band structure!
16
Registration algorithm (3 DOFs)
  • Reduce each node to a few accurately-estimated
    VPs
  • Cost: tens of CPU-seconds per node (1000s of
    edges)

17
Propagation step alignment to neighbors
  • This is O(n^4) in the number of VPs (usually n < 5)
  • Achieves rotational registration to 0.05° (1
    pixel)
  • Note: no individual feature matching; outperforms
    human!

18
Advantages of wide-FOV images
  • More VPs (3.2-3.6 per node vs. 0.3-0.7 per
    image)
  • More accurate VPs (by factor of 1 to 3)

[Figure panels: H-T peak strength, VP variance]
Registration more robust, more accurate with
wide-FOV (contrast to Collins, Weiss 90;
Becker 95; Leung, McLean 96)
19
Translational registration (2 DOFs)
Requires persistent, local observations; we use
sub-pixel point features
  • From edge intersections

From valid VPs only
20
Spherical Epipolar Geometry
  • Since any two rotationally registered nodes are
    related by a pure (3DOF) translation

All world points must move on a pencil of
epipolar great circles through the (2-DOF) translation
direction. Problem: we don't know direction, or
matches!
21
Registering Translations
  • Position A: point features on Gaussian sphere

Position B: point features on Gaussian sphere
Overlaid point features: rotationally
aligned, significant overlap
22
Finding the translation direction (2 DOFs)
  • Hough transform all plausible point pairs to
    great circles
  • Great circles reinforce at/near true baseline
    direction
  • Yields direction (up to scale) relating nodes A, B
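A minimal sketch of the voting idea, under stated assumptions: once two nodes are rotationally registered, a true match (p, q) and the baseline direction t are coplanar, so t lies on the great circle with normal p x q. Each plausible pair votes for every candidate direction within a small angular tolerance of its great circle. The candidate set (a golden-angle spiral), the bin count, and the tolerance `eps` are choices made for this example, not parameters from the talk.

```python
import math


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    length = math.sqrt(dot(v, v))
    return tuple(c / length for c in v)


def fibonacci_sphere(n):
    """n roughly uniform unit vectors on the sphere (golden-angle spiral)."""
    golden = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n
        r = math.sqrt(1.0 - z * z)
        points.append((r * math.cos(golden * i), r * math.sin(golden * i), z))
    return points


def vote_translation_direction(pairs, n_bins=4000, eps=0.06):
    """pairs: (p, q) unit vectors from nodes A and B. Returns (direction, votes)."""
    normals = []
    for p, q in pairs:
        n = cross(p, q)
        if dot(n, n) > 1e-18:          # skip degenerate (parallel) pairs
            normals.append(normalize(n))
    best, best_votes = None, -1
    for d in fibonacci_sphere(n_bins):
        votes = sum(1 for n in normals if abs(dot(n, d)) < eps)
        if votes > best_votes:
            best, best_votes = d, votes
    return best, best_votes
```

The result is a direction only, defined up to sign and scale, which matches the slide's statement that this step recovers 2 of the translational DOFs. Wrong pairs add scattered votes, while correct pairs all reinforce near the true baseline direction.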

23
Probabilistic Matching
  • As in Wells (IJCV 1997), Chui et al. (CVPR 2000),
    Dellaert et al. (CVPR 2000), but
  • Our approach
  • Handles unknown numbers of features, matches
  • Handles unknown occlusion and outliers
  • Searches only 2 DOFs (not 5: rot. + trans.)
  • Scales to hundreds of thousands of features

24
Probabilistic Matching (cont.)
  • For each adjacent node pair, i/j feature match
    represented as M_ij in [0,1]
  • Swap, split, join operations mutate valid
    matches, add/remove outliers
  • Matrix mutated 10 - 100,000 times in Monte-Carlo
    E-M process
  • Within MC step, new binary matrices produced as a
    Markov Chain, with transition probabilities based
    on geometric error; averaging these produces M_ij
  • In practice, soft matching of 2,000 features
    converges within 10 E-M steps
  • Strong prior from HT: 30 sec.; 1 E-M step: 20
    sec.

25
Initial baseline estimate: overlaid point features
Fix baseline estimate; form plausible match
weights
Fix baseline estimate; mutate match weights
Fix match weights; re-fit baseline estimate
Accurate baseline
26
Error study (synthetic data)
End-to-end baseline error is linear in feature
noise
27
Robustness study (synthetic data)
Baseline estimation robust up to 90 percent
outliers
28
Advantage of wide-FOV (real data)
Hough transform peak as FOV increases
29
Final DOF: fix absolute scale
Output: edge lengths
Input
  • Solve linear system with C-LAPACK (uses 1
    CPU-minute)
  • Register to initial GPS position estimates for
    geo-referencing

30
End-to-end consistency measures
Width of distributions at 95% confidence
  • Variety of node sets, spanning hundreds of meters
  • Typical inter-node baseline 10-20 meters
  • Hundreds of thousands of edge, point features
  • Node positions consistent to 5 centimeters
  • Node orientations consistent to 0.1 degrees
  • Epipolar geometry consistent to 4 pixels

31
End-to-end Epipolar geometry
After rot. alignment
After 6-DOF alignment
32
Comparison to manual bundle adjustment
Manual
Automated
33
Performance in the presence of clutter
Automated
Manual
34
Performance with poor initial pose
Initial
Refined
Robust to initial pose errors of up to 7 meters,
17 degrees
35
Currently placing all data on-line
  • URL is http://city.lcs.mit.edu/data
  • Time-stamped, calibrated, HDR (log-radiance)
    images
  • Intrinsic calibration; sub-pixel edge, point
    features
  • Geo-referenced 6-DOF pose (ECEF metric units)
  • Interactive browsing (download facility underway)

36
Application: 3D reconstruction
  • Voxel, silhouettes (Potmesil 87, Szeliski 93,
    Kanade et al. 95)
  • Space-sweep, voxel-coloring (Collins 96;
    Szeliski, Golland 98; Seitz, Dyer 99;
    Kutulakos, Seitz 00; etc.)
  • T(N,V) = O(NV) with N images, V voxels
  • This grows with square of reconstruction volume
  • KS 00: N = 16, V = 5x10^7, T = 4 CPU-hours
  • Our dataset: N = 10^4, V = 10^11 (campus @ 10 cm)
  • Extrapolating, T(N, V) ~ 100 CPU-years
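The extrapolation on this slide can be reproduced with a few lines of arithmetic. The sketch assumes that running time scales strictly in proportion to N times V and reads the reference figures as N = 16, V = 5x10^7, T = 4 CPU-hours; both the proportional model and the helper name are assumptions made for illustration.

```python
HOURS_PER_YEAR = 24 * 365


def extrapolate_cpu_hours(ref_hours, ref_n, ref_v, n, v):
    """Scale a reference running time in proportion to N * V."""
    return ref_hours * (n / ref_n) * (v / ref_v)


hours = extrapolate_cpu_hours(4.0, 16, 5e7, 1e4, 1e11)
years = hours / HOURS_PER_YEAR
print(hours, years)
```

Strict proportional scaling gives several hundred CPU-years, the same order of magnitude as the slide's figure of 100 CPU-years, and far beyond what was practical for a single dataset.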

37
Asymptotic improvement (w/ Manish Jethwa)
  • Intuition: distant image pairs shouldn't interact
  • Let each image affect only a constant number of voxels
  • Now T(N,V) = O(N + V)
  • Grows linearly with recon. volume, number of images
  • For our dataset, we estimate a few CPU-weeks
  • Also must handle unknown background, clutter

38
What's next (1)
  • Scalable 3D reconstruction

39
What's next (2)
  • New operating regimes (with Michael Black)
  • Omni-video
  • Different: 30 Hz, low resolution, short baselines
  • Architectural interiors
  • Different: dimensions, illumination, clutter

40
What's next (3)
  • Robotic image acquisition (with Draper Labs)
  • Autonomous helicopter w/ 6-DOF navigation
  • On-board omni-directional video camera
  • Eventually simultaneous cooperative capture

41
Conclusions
  • Ideas for automation, scaling, view freedom
  • Metadata, Wide-FOV, Uncertainty, Asymptotics
  • Enable fundamentally new capability
  • Controlled image acquisition over wide areas
  • Datasets of interest to IBR, vision communities
  • Long-term project vision, goals
  • More general operating regimes
  • Robotic image acquisition, model capture

42
Further information
  • http://city.lcs.mit.edu
  • http://city.lcs.mit.edu/data
  • http://graphics.lcs.mit.edu
  • http://graphics.lcs.mit.edu/publications.html

Thanks to
NSF, DARPA, ONR Intel, Interval, NTT
43
(No Transcript)
44
Conclusions
  • Fully-automated model acquisition is possible
  • Augmented sensor allows spatial and input
    scaling, and replaces human-aided initialization
    in classical algorithms
  • Spherical images are a fundamental enabling
    technique (more than simply a practical
    advantage)
  • Ensemble features and low-DOF optimization
    eliminate need for hard feature correspondence
  • Large numbers of images can overcome even severe
    clutter and occlusion, efficiently
  • End-to-end architecture provides effective
    testbed for algorithms for registration,
    reconstruction, and high-fidelity
    (domain-specific) scene element extraction

45
System Limitations
Limited spatial extent, number of structures
  In progress: acquisition of MIT campus (1 km^2)
Vertical facades only; rooftops procedural
  In progress: richer shape primitives (model selection)
Diffuse lighting, diffuse surfaces
  In progress: directional, inverse global illumination;
  use of prior knowledge about common materials
Foliage removed via median statistics and masking
  In progress: foliage segmentation, tree modeling
Validation of results
  In progress: independent navigation, structure survey
46
Metadata Operational Advantages
  • If scene-relative
  • O(N) asymptotics; parallel image capture
  • Makes interactive initialization unnecessary
  • If Earth-relative
  • Sun direction from time, geo-referencing
  • Output can be overlaid with existing GIS data

47
Projective Features
  • Antipodal equivalence
  • Natural duality
  • 3D Edge: 1-D family of coplanar points
  • 3D Point: 1-D family of copunctual lines

[Figure labels: Line Dual, 3-D Edge, 3-D Point, Focal Point, Pencil of Lines, Image Plane, Great Circle]
73
Maps are fundamental
John Speed, 1626 Image Courtesy Norman B.
Leventhal Collection, Boston
  • People make sense of their environment by
    creating and using maps

74
3D Geometric Models Simulation
(From MIT/UCB CityWalk project)
Example shadow studies for architecture, urban
planning
75
Models are an essential starting point!
  • With urban models, one can simulate (e.g.)
  • Touring the space (tourists, customers, students)
  • Virtual sets (ads, movies, games, socializing)
  • Emergencies (fires, terrorism, floods etc.)
  • Military operations (people, vehicles,
    sightlines)
  • Traffic (pedestrian, bicycle, vehicle etc.)
  • Construction (views, shadows, wind, energy use)
  • Utilities infrastructure (gas, power, water,
    data)
  • Path-planning (for physically or visually
    impaired)
  • But where do these models come from ?

76
Satellite and Aerial Photogrammetry
  • E.g., Moffitt, Mikhail 80; Slama 80; Ackermann
    80; McKeown, McGlone 93; Mayer 98; Ascender
    (Marengoni et al. 99)
  • Limitations
  • High-altitude images have low spatial resolution
  • Nadir views are highly oblique for vertical
    surfaces
  • Side views occluded due to urban canyons

77
Terrestrial Computer Vision
  • Foundational work in various settings
  • Camera calibration (intrinsic parameters)
  • E.g., Faugeras, Toscani 86; Tsai 87
  • Exterior orientation, scene structure (point
    clouds)
  • Kruppa 13, Ullman 79, Longuet-Higgins 81
  • Stereo (dense depth maps from image pairs,
    triples)
  • Marr, Poggio 79; Baker, Binford 81; Grimson
    81; Shashua 97
  • Structure from closely-spaced image
    sequences/sets
  • Tomasi, Kanade 92; Azarbayejani, Pentland
    95; Collins 96; Beardsley et al. 97; Baillard,
    Zisserman 99

78
Spatial and Combinatorial Scaling Limitations
  • No prior algorithm demonstrated on all of:
  • Thousands of images; extended spatial area; wide
    baselines; general illumination; significant
    occlusion and clutter
  • Short baselines, tracking failures limit spatial
    scale
  • Underlying O(n^2) assumption limits combinatorial
    scale
  • Private coordinate systems enable only serial
    acquisition

79
Alternative: Human-operated modeling tools
  • E.g., Knopp 94; Becker 95; Taylor, Kriegman 95;
    Debevec et al. 96; Jepson 96; Shum et al. 98;
    Gibson 99; Cipolla et al. 99; Gruen-Wang 99
  • Rely on human operator to do one or more of:
  • Establish working coordinate system and units
  • Roughly create and situate block model of scene
  • Roughly place and orient each camera
  • Indicate common structure among images (points,
    edges, faces, blocks, higher-order shapes)
  • Indicate subject and clutter portions of each
    image (i.e., paint away trees, people, cars,
    etc.)
  • Human frames, initializes, constrains, classifies

80
Example Interactive System: Façade (Debevec et
al. 96)

Tasks done by human operator:
  • Frame: acquire/select related set of images;
    establish working units, coordinate system
  • Initialize: specify rough block model of scene
    structure; roughly place and orient cameras
  • Constrain: indicate visible structure in each
    image (points, edges, faces, blocks, etc.)
  • Classify: segment each image into subject and
    clutter (i.e., paint away trees, people,
    other buildings, cars, etc.)

Tasks done by computer:
  • Provide user interface
  • Manage images, geometry, and constraints
  • Optimize feature, structure, and camera estimates
  • Combine manually masked textures from different
    viewpoints
  • Render final model

Images courtesy Paul Debevec; used with
permission
81
Human-operated modeling tools
  • Good results from small number of images, but require
  • Uncluttered views of isolated structures
  • Significant camera standoff (tens of meters)
  • Overlapping structure, or surveyed fiducial
    points or other ground control, to register
    multiple datasets
  • Hours of skilled human effort per building
  • Scaling limitations here too !
  • Limited number of input images
  • Limited occlusion and visual clutter
  • Limited number of output structures
  • Limited parallelism

82
Urban models: design targets, scaling
  • Capture a km^2 (about ½ sq. mile) to a feature
    size of one centimeter (about ½ inch): 10^10 cm^2
    total
  • Digital image yields 10^6 pixels, so 10^4 images
    are needed to observe each cm^2 fragment just
    once!
  • In practice, need 3-10 views of each surface
    fragment, and 3-10 pixels per observation
  • Bottom line: need at least 10^5 images per km^2
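The image-count estimate can be checked directly from the slide's numbers. This is a back-of-the-envelope sketch that assumes every pixel lands on a distinct surface fragment with no waste; the function name is illustrative.

```python
def images_needed(area_cm2, pixels_per_image, views, pixels_per_observation):
    """Images required to cover `area_cm2` of 1 cm^2 fragments."""
    return area_cm2 * views * pixels_per_observation / pixels_per_image


KM2_IN_CM2 = (10 ** 5) ** 2      # 1 km = 10^5 cm, so 1 km^2 = 10^10 cm^2
MEGAPIXEL = 10 ** 6

print(images_needed(KM2_IN_CM2, MEGAPIXEL, 1, 1))    # one view, one pixel each
print(images_needed(KM2_IN_CM2, MEGAPIXEL, 3, 3))    # low end of 3-10 range
print(images_needed(KM2_IN_CM2, MEGAPIXEL, 10, 10))  # high end of 3-10 range
```

The low end of the range already gives about 9x10^4 images, which is where the rounded figure of at least 10^5 images per square kilometer comes from.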

83
Our approach to urban model capture
Acquire 1000s of georeferenced images
Insert images into spatial index; establish
approximate image adjacency; revise 6-DOF
alignment
Extract model of coarse, fine geometry and
appearance
System development strategy: breadth-first, not
depth-first!
IUW 97; Pacific Graphics 98; ISPRS 99;
ECCV/SMILE 2000 (submitted)
84
Rationale
  • Challenge / Solutions
  • 1. Can't expend O(n^2) time; most image pairs
    unrelated. Serial acquisition, algorithms.
    Solutions: geo-referenced smart camera for
    framing, initialization; hierarchical spatial
    index for scaling, parallelism
  • 2. Narrow-FOV imagery (aperture problem,
    estimation failure).
    Solution: use high-resolution, super-hemispherical
    imagery
  • 3. Feature matching under wide baselines and
    general illumination is difficult.
    Solutions: avoid feature matching; use ensemble
    features, soft matching techniques instead
  • 4. Can't rely on human to identify/remove clutter.
    Solutions: acquire thousands of images; use
    consensus methods and robust statistics

85
Thrusts of this effort
[Diagram. Axis of increasing scale, generality, automation: Fragment, Building, Office Park, Campus, City; clutter and occlusion; general illumination. Axis of increasing fidelity: texture, lighting; richer geometry; windows, trees.]
Talk overview
  • Motivation
  • Scaling issues context
  • Smart pose-camera
  • Ensemble features and 6-DOF registration
  • Reconstruction without correspondence
  • Removing clutter
  • Increasing fidelity
  • Conclusions

87
(No Transcript)
88
Geo-referenced digital pose camera
(With Doug DeCouto)
Designed in concert with Peace River Studios,
Cambridge MA
89
Motorized pan-tilt head for mosaic acquisition
(analogous to QTVR)
90
Two Individual Mosaics
Each is about 75 Mega-Pixels, but can be acquired
at arbitrarily high resolution (at the cost of
time, CPU). Our design target calls for 1K pixels
per radian (57°) at a typical viewing standoff
of about 10 meters
91
Image acquisition early dataset
Early prototype of pose camera deployed in and
around Tech Square (4 structures): 81 nodes,
4,000 geo-located images, 20 Gb
Adjacency graph
(CVPR 98)
92
Image alignment (exterior orientation)
  • 1 Mega-pixel camera with a 1-radian FOV has 1
    mrad resolution (3 arc-min, 1/20 deg., 1 cm @ 10 m
    standoff)
  • For registration to one pixel, we must localize
    camera position to 1 cm, and orientation to 1 mrad
    (1/20 degree)
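The per-pixel resolution figures follow from simple arithmetic, sketched below. It assumes a square sensor, so that one megapixel means 1000 pixels across the 1-radian field of view; the function name is illustrative.

```python
import math


def pixel_footprint(pixels_across, fov_rad, standoff_m):
    """Return (radians, degrees, meters at standoff) subtended by one pixel."""
    rad_per_pixel = fov_rad / pixels_across
    return rad_per_pixel, math.degrees(rad_per_pixel), rad_per_pixel * standoff_m


rad, deg, meters = pixel_footprint(1000, 1.0, 10.0)
print(rad, deg * 60.0, meters)
```

One pixel subtends 1 mrad, which is about 3.4 arc-minutes (roughly 0.057 degrees, rounded on the slide to 1/20 degree) and covers 1 cm at a 10 m standoff.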

GPS satellites
feature
10m
cameras
GPS receiver
earth
  • Differential GPS receivers claim accuracy of 2cm
  • So attach GPS receiver, log position (latitude,
    longitude, altitude) and time with each image

93
Sensor GPS/IMU navigation estimates
  • Good to about 2 meters position, 2 degrees
    heading (> 100 pixels); still have a registration
    problem!

94
Sensor GPS/IMU navigation estimates
  • Raw estimates
  • (from nav sensors)

Refined estimates (desired)
95
Imagery Control Exterior Orientation
Each node must be controlled, or registered, in
a common, global (Earth) coordinate system
Image-assisted user interface auto-corresponds
point features (requires several hours of user
time)
Mosaicing: significant engineering advantage
Goal: full automation of geo-referencing process
96
  • Manual Correspondence Disadvantages
  • Infeasible for large data sets
  • Potential for human error
  • Unstable solutions

?
97
Global registration: DOF argument
Output: position x,y,z for each of the V input
nodes, satisfying the directional constraints
Input
  • Each node adds 3 DOFs (position)
  • Each adjacency fixes 2 DOFs (3D direction, up to
    scale)
  • Necessary condition: 2E + 4 >= 3V
  • Sufficiency: every edge part of a triangle edge-adjacent
    to another triangle
  • New, open question in rigidity theory
  • Previously: joint angles and/or lengths
  • Solve linear system with C-LAPACK (uses 1
    CPU-minute)
  • Then register with (unbiased) GPS position
    estimates
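The counting argument can be written as a one-line check. The sketch reads the constant 4 as the freedom that direction constraints can never remove, namely a global translation (3 DOFs) plus a global scale (1 DOF); that reading is an interpretation of the slide, and the function name is hypothetical. The check is necessary only, not sufficient.

```python
def satisfies_dof_count(num_nodes, num_adjacencies):
    """Necessary (not sufficient) condition for solving node positions."""
    unknowns = 3 * num_nodes            # x, y, z per node
    constraints = 2 * num_adjacencies   # one 3D direction per adjacency
    gauge = 4                           # assumed: global translation + scale
    return constraints + gauge >= unknowns


print(satisfies_dof_count(3, 3))   # triangle of nodes
print(satisfies_dof_count(4, 3))   # chain of four nodes
```

A triangle passes the count while a four-node chain fails, since the lengths of successive links in a chain are unconstrained relative to one another.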

98
End-to-end Registration VPs, Points
99
End-to-end Registration Epipoles
After rot. alignment
Registered
100
Comparison to manual bundle adjustment
Manual
Auto
101
Performance in presence of clutter
Manual
Auto
102
Performance with poor pose initialization
Initial
Refined
103
Registered pose-image dataset (> 4,000 images,
25 Gb, six billion pixel observations)
Dominant cost: mosaics (8 CPU-hours; < 1 hour
real time)
104
Structure extraction without correspondence
Histogramming algorithm identifies orientations
of significant vertical façades in vicinity of
cameras (with Satyan Coorg)
CVPR 99
105
Façade detection
Sweep-plane algorithm identifies location and
spatial extent of each (coarse) vertical façade
106
Recovered coarse façades
False positives removed with absolute area
threshold
107
Result for example dataset
Dominant cost: plane sweep (8 CPU-hours on this
data)
Generalizes to other shapes, given
sufficient CPU
108
Texture-mapping from images
Can map closest image onto surface, but several
problems:
Inherits lighting conditions, shadows,
reflections from image
Cluttering elements (trees, people, cars)
pasted onto surface
Off-plane relief (window moldings,
etc.) not modeled
109
Texture estimation challenges
110
Iterative Consensus Texture Estimation
Robust, weighted median-statistics algorithm
estimates texture/BRDF for each building façade
[Figure labels: weighted xyY median; sharpening, masking]
Algorithm removes structural occlusion; foliage;
blur (obliquity); color and lighting
variations! (Also inverse global illumination,
Yu et al. 99)
CVPR 99
111
Masking away occlusion, clutter
  • (With Eric Amram, Stefano Totaro, Franck
    Taillandier)

112
Without masking
With masking
113
Texture estimation results
Input Raw photograph
Output Synthetic texture
  • Made possible by many observations
  • A sensor and aggregation algorithm that
    effectively see through complex foliage and
    clutter

114
Textured model (with overlaid aerial image)
115
Increasing scale 3 Campus datasets
East Campus
Full Campus
116
(No Transcript)
117
Thrusts of this effort
[Diagram. Axis of increasing scale, generality, automation: Fragment, Building, Office Park, Campus, City; clutter and occlusion; general illumination. Axis of increasing fidelity: texture, lighting; richer geometry; windows, trees.]
118
Capturing surface relief
  • Idea: assume surface is nearly planar; recover
    deviations using generalized stereo (Szeliski
    94, Sawhney 94, Kumar et al. 94, Debevec et
    al. 96)
  • Based on terrain reconstruction algorithms (and
    implementations) of Fua and LeClerc

119
(No Transcript)
120
(No Transcript)
121
Symbolic Window Extraction
  • Based on Wang et al. (Proc. SPIE 97)
  • Oriented region-growing technique
  • Applied to composite façade images
  • After removal of occlusion, shadows
  • Planned application to
  • Mesh regularization (quantized depth)
  • Modeling color from multiple distributions

122
(No Transcript)
123
(No Transcript)
124
(No Transcript)
125
(No Transcript)
126
(No Transcript)
127
(No Transcript)
128
(No Transcript)
129
(No Transcript)
130
(No Transcript)
131
Capturing 3D Models of Existing Trees (with Ilya
Shlyakhter and Max Rozenoer; co-advised by Julie
Dorsey)
  • Input: pose-images. Output: 3D tree model.

132
Reconstruction Steps
  • Segment tree region
  • Reconstruct 3D shape
  • Infer major branches
  • Grow minor branches, leaves to fill 3D shape
  • Assign colors from images

133
Reconstructing Tree's 3D Shape
  • Volumetric intersection
  • Extrude each silhouette to polyhedral cone
  • Intersect cones to obtain 3D shape

134
Infer plausible branch structure
  • Find 3D medial axis
  • Fix nodes at terminal points (branch tips)
  • Use vertices from even-order convex hulls
    (Rappoport92)

1st order hull: ABCD. 2nd order hull:
CED. Nontrivial even-order hulls correspond to
branch tips
135
Grow Remainder of Tree
  • Procedural model: L-systems
  • Rewriting rules specify branching
  • Normally, start with one shoot
  • Here, start with complete skeleton

Simple example: 1st rule directs growth; 2nd rule
directs branching
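A minimal L-system rewriter illustrates the two kinds of rule mentioned above. The symbols and rules are a generic textbook-style example, not the rules used for the tree models in this talk: `F` is a branch segment, `X` a growing tip, `[` and `]` push and pop a branch, and `+` and `-` turn.

```python
def rewrite(axiom, rules, generations):
    """Apply context-free rewriting rules to every symbol, in parallel."""
    state = axiom
    for _ in range(generations):
        state = "".join(rules.get(symbol, symbol) for symbol in state)
    return state


RULES = {
    "F": "FF",            # growth rule: every segment lengthens
    "X": "F[+X][-X]",     # branching rule: a tip forks into two tips
}

print(rewrite("X", RULES, 1))
print(rewrite("X", RULES, 2))
```

Starting from a single shoot means using `"X"` as the axiom. Starting from a recovered skeleton, as the slide describes, would correspond to passing a longer axiom string that already encodes the major branches and letting the same rules fill in the rest.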
136
Coloring the leaves
  • View-dependent mapping: colors back-projected
    from image most closely matching viewpoint
  • Alternative: match color distribution to that
    found in input images

137
Matched stills
138
Detail views
139
(No Transcript)
140
Take-home messages
  • Fully-automated model acquisition is possible,
    in principle and in practice
  • Augmented sensor removes combinatorial bottleneck
    inherent in classical algorithms
  • Spherical images are a fundamental enabling
    technique (more than simply a practical
    advantage)
  • Ensemble, rather than individual,
    features largely eliminate need for
    correspondence
  • Large numbers of images can overcome even severe
    clutter and occlusion, efficiently
  • Even surprisingly complex structures (e.g.,
    trees) can be plausibly modeled from observation

141
Acquisition is the application !
  • End-to-end system for model acquisition
  • From sensor directly to textured CAD/GIS
  • Remove human from the loop
  • Advantage: removes scaling, throughput limits
  • Tradeoff: instrumentation; limited domain
  • Evaluate
  • Costs (development, one-time, ongoing)
  • Efficiency (computation, storage resources)
  • Fidelity (faithfulness to reference model)
  • Utility (to military, maintenance, simulators)

142
Overview Urban model acquisition
  • Automation
  • Automatic exterior calibration of imagery
  • Generalization Aggregation
  • 3D reconstruction, merging
  • Texture, occlusion, relief estimation
  • Collaboration with Fua (EPFL) and LeClerc (SRI)
  • Symbolic window extraction
  • Collaboration with Wang and Hansen (UMass)
  • Scale and Throughput
  • Data acquisition, distributed processing
  • Input and Output Validation
  • Sensor improvement surveying efforts

143
Project Goals End-to-End
  • Develop a sensor and algorithms to extract
    geodetic textured CAD models from initially
    uncontrolled imagery, without a human in the loop
  • Five parts
  • Develop novel sensor: pose camera for imagery
    and approximate exterior orientation
  • Deploy sensor to acquire pose mosaics
  • Refine estimates of exterior orientation
  • Extract geometry, textures (BRDFs)
  • Evaluate and validate models, cost, etc.

144
Research/Engineering Footprints
Systems compared, in order: Ascender [1]; Façade [2]; MIT/City
Number of images: tens; tens; thousands
Imagery type: aerial; near-ground; near-ground
6-DOF camera pose: from human; from human; instruments, optim.
Structure extraction: roof-matching, optimization; by human, optimization; automatic detection, optimization
Number of structures: scores; one to tens; arbitrary
Output coordinate system: specified by operator; specified by operator; geodetic (Earth) coordinates
Texture: procedural matching; manual segmentation; automatic w/ robust statistics
Scaling capability: unclear; unclear; spatial index
Parallel model acquisition and merging: none; none; use of geodetic coordinates, index
[1] UMASS  [2] Berkeley
145
Engineering rationale, choices
  • General vs. restricted environment class
  • Few vs. many images
  • Satellite, aerial vs. ground imagery
  • Video vs. single-frame camera
  • Resolution vs. large field of view
  • Optics, CCD, pan/tilt rig: both
  • Geo-referencing data w/ each image
  • 6-DOF: three translation, three rotation, in
    Earth coordinates (lat, long, alt, NED)
  • Breadth-first (not depth-first) development

146
Urban Model Capture
[Poster panels]
Vertical Façade Extraction: horizontal line segment identification; space sweep finds dominant facades
Sparse Reconstruction:
Step 1. Low-level feature detection
Step 2. Computing frustums
Step 3. Compute vertex, line extrusions
Step 4. Matching vertex extrusions
Step 5. Matching line extrusions
Step 6. Computing surfaces
Captions: the reconstructed model of a portion of Technology Square; vertex extrusions corresponding to a vertex element; line extrusions corresponding to a line element
Variational Surface Evolution: 1. Single camera projection; 2. Multiple cameras projection; 3. Images mis-aligned; 4. Surface evolution; 5. Alignment improves
Geometry Extraction
Geometry Aggregation: 1. Data; 2. Spatial Index; 3. Surface Patches; 4. Extended Surfaces; 5. Final Surface
Citations: Coorg, CVPR 1999; Mellor, IUW 1997; Cutler, MIT MEng 1999; Chou, IUW 1997; Amram, MIT MSc 1998; Faugeras and Keriven, SS 1997
147
Edge, line, corner detection
  • With Manish Jethwa

148
Sensor challenges (low -> high)
  • Fuse data streams from diverse sensors (GPS, IMU,
    omnicam, etc.)
  • Achieve meaningful error bounds
  • Effectively incorporate high-level knowledge
    about platform motion
  • Translations, rotations, stops
  • Disambiguate GPS noise from multipath
  • Bootstrap from crude model capture

149
Feature detection challenges
  • Characterize, achieve theoretically optimal
    estimation of edges
  • Effectively combat local clutter
  • E.g., obscurations of single edge
  • Effectively combat false positives
  • E.g., tree limbs disguised as edges
  • Propagate useful error bounds
  • E.g., to downstream algorithms for vanishing
    point estimation

150
3D Reconstruction challenges
  • Expressiveness of template
  • Polyhedra, surfaces of revolution etc.
  • Variations in feature size
  • From signage lettering to large buildings
  • Principled idea of when to believe in a
    multiply-reinforced element
  • Validation goodness metric, etc.

151
Why we need texture
  • Show buildings with raw imagery mapped onto them:
    trees are stuck on to buildings!
  • Challenges: multiple views
  • Differing lighting
  • Distinct occlusion in each image
  • Surface is non-planar: each image sees different
    piece
  • Can undo lighting (Yu 99), but how to deal with
    occlusion?
  • Strategies to generate textures
  • Assume no occlusion (simple averaging)
  • Rely on human user to paint away textures

152
Alignment is automation bottleneck
  • Overview of end-to-end pipeline
  • Recovering rotation and translation for acquired
    hemispherical images
  • Short-baseline techniques not applicable
  • Exploit navigation information, large number
    (1000s) of images
  • Tack: decouple rotation, translation; solve
    independently (w/ Matt Antone)

153
Registration challenges
  • Robust VP estimation from edge classes (rather
    than single edges)
  • Robust translation directions from low-level
    features (edges, points)
  • Use of dense (area) information?
  • Allocation of error
  • Orbit of optical center; intrinsics; spherical
    mosaicing error; noise in feature extraction; etc.

154
Texture estimation challenges
  • How best to incorporate
  • Billions of pixel observations
  • Non-planar surface geometry
  • Appearance models of varying power (diffuse,
    specular, BRDF, etc.)
  • High-level knowledge of repetitive structure,
    common material types
  • How to validate our results?

155
Increasing Scale, Throughput
  • Scale: sensor, spatial infrastructure
  • Sensor: node time reduced to 1 minute from 5
    minutes in 1998, including HDR
  • Input: second MIT dataset, several hundred nodes
    across East Campus
  • Throughput: map algorithms to parallel,
    distributed Linux cluster
  • 1-32 CPUs with near-linear speedups
  • Currently limited by I/O bandwidth

156
(No Transcript)
157
(No Transcript)
158
Evaluation Criteria
  • Throughput
  • Complexity
  • Fidelity (Geometric, Photometric)
  • Adoption of tools and models by users
  • Assessment of results by community

159
Module improvements
  • 10x area implies 10x data size
  • Data scaling, naming conventions
  • Speed
  • Pose-camera (Argus) improvements
  • Parallel distributed processing pipeline
  • Accuracy, validation
  • 6-DOF raw navigation data
  • Imagery control (exterior calibration)
  • Derived feature points, edges, faces
  • Relief extraction
  • Symbolic windows

160
Texture algorithm
  • Show four steps in algorithm
  • Initialize per-image occlusion mask to 0.5
  • Texture × occlusion mask yields consensus
  • Then correlate consensus with images to re-form
    occlusion mask
  • Show several steps of the algorithm!
  • Image, mask, image × mask, consensus
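The loop on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' implementation: it assumes the images are already warped onto a common surface and stored as flat lists of intensities in [0, 1], and it substitutes a per-pixel Gaussian similarity (the hypothetical `sigma` parameter) for the correlation step. The function name `estimate_texture` is also made up for this example.

```python
import math

def estimate_texture(images, iterations=5, sigma=0.1):
    """Iteratively estimate a consensus texture and per-image occlusion masks.

    images: list of equal-length lists of intensities in [0, 1], all assumed
            to be registered to the same surface (an assumption of this sketch).
    Returns (consensus, masks); mask values near 1 mean "trusted", near 0
    mean "probably occluded".
    """
    n_pixels = len(images[0])
    # Step 1: every pixel of every image starts as "maybe occluded".
    masks = [[0.5] * n_pixels for _ in images]
    consensus = [0.0] * n_pixels
    for _ in range(iterations):
        # Step 2: the mask-weighted average of the images is the consensus.
        for p in range(n_pixels):
            total_w = sum(m[p] for m in masks)
            if total_w > 0.0:
                weighted = sum(img[p] * m[p] for img, m in zip(images, masks))
                consensus[p] = weighted / total_w
        # Step 3: re-form each mask from agreement with the consensus.
        # Gaussian similarity is a stand-in for the correlation step.
        for img, m in zip(images, masks):
            for p in range(n_pixels):
                d = img[p] - consensus[p]
                m[p] = math.exp(-(d * d) / (2.0 * sigma * sigma))
    return consensus, masks

# Three views of a facade of brightness 0.8; a dark tree hides pixel 1 in view 3.
views = [[0.8, 0.8, 0.8], [0.8, 0.8, 0.8], [0.8, 0.2, 0.8]]
texture, occlusion = estimate_texture(views)
print(texture, occlusion[2])
```

With simple averaging the tree would darken pixel 1 of the texture; here the mask for that observation shrinks on each pass, so the consensus moves toward the value the other views agree on.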

161
Generalization and Aggregation
  • Several 3D reconstruction techniques
  • Large planar surfaces (Coorg)
  • Small surfels (Mellor)
  • Bottom-up surface inferences (Chou)
  • Aggregation phase (Cutler)
  • Principled merging of algorithm outputs to
    produce single consistent CAD model

162
Next step: extracting geometry
  • Several approaches, with overlapping, partially
    complementary operating regimes
  • Vertical façade extraction
  • Finds large vertical surfaces from horizontal
    edges
  • Low-level feature hypothesis and promotion
  • Bottom-up, from sparse point and edge features
  • Dense surfel optimization
  • Treats world as dense cloud of surface patches
  • Aggregation phase

163
(No Transcript)
164
(No Transcript)
165
Geometry from sparse features (with George Chou)
166
(No Transcript)
167
(No Transcript)
168
(No Transcript)
169
(No Transcript)
170
(No Transcript)
171
Geometry from dense surfels (with J.P. Mellor)
  • Generalizes Collins' space-sweep (96)
  • Related to Kutulakos and Seitz's space-carving
    (98)

IUW 97; Mellor 99
172
(No Transcript)
173
Geometry aggregation (with Barb Cutler)
Cutler 99
174
Toward Automated Exterior Registration
With Manish Jethwa, Neel Master
175
Preliminary results (with overlaid aerial image)
Model represents about 1 CPU-Day at 200 MHz
Next: acquire full MIT campus; compare to
reference model captured via traditional
surveying
176
Validation
  • Input (Mike Bosse)
  • Survey waypoints to characterize precision,
    accuracy of navigation sensor
  • Suppress sub-systems (GPS, inertial, odometry) to
    gauge contribution of each
  • Output (Qixiang Sun)
  • Synthetic inputs, idealized results
  • Real inputs, optimization residuals
  • Compare reconstructed models to surveyed,
    hand-solved models

177
Evaluation Criteria
  • Throughput
  • Complexity
  • Fidelity (Geometric, Photometric)
  • Adoption of tools and models by users
  • Assessment of results by community

178
From the East
From the South
179
Connections to other communities
  • MIT Physical Plant
  • Well-maintained 2D CAD
  • City of Cambridge Planning Dept.
  • Surveying, GIS expertise/expectations
  • MIT Depts of Architecture, Urban Planning
  • GIS software, demographic data

180
From maps to models
  • A model is any dataset in an electronic form
    suitable for manipulation by a computer program

Map (paper chart, or scanned image)
Model (city locations, explicit road networks)
181
Models enable visualization and simulation!
Route planning
(Examples from MapQuest)
182
Image acquisition: First dataset
Early prototype of pose camera deployed in and
around Tech Square (4 structures); 81 nodes;
4,000 geo-located images; 20 Gb
(CVPR 98)
183
Four design tradeoffs
  • General computer vision problem is hard. So focus
    on urban environments for now
  • Previous approaches use few images. Instead
    acquire thousands of images (only way to overcome
    clutter automatically)
  • Can't assume O(n^2) image pairs related. So use
    sensor that identifies, by proximity and
    direction, those images that are likely to be
    related
  • Human-operated modeling tools use negligible CPU.
    Instead we use massive parallelism, I/O

184
  • Parameterized tree generators
  • Biology-based: enforce botanical growth laws
  • Good fidelity to sunlight availability, other
    factors
  • Hard to control final shape
  • Examples: Lindenmayer 68, Prusinkiewicz 88
  • Geometry-based: specify detailed geometric
    parameters
  • More direct control of shape
  • Fidelity to biology, environment not enforced
  • Examples: Bloomenthal 85, Greene 89
  • Both generator classes yield one of a family of
    trees, not a particular, observed tree

185
Hybrid Approach
  • Geometry-based: infer plausible branch structure
    directly from observations
  • Biology-based: use growth model to fill tree
    volume with minor branches, leaves
  • Texture/coloration step: color the leaves
    according to original image observations

186
Segmentation (identifying tree pixels)
  • Currently manual
  • Preliminary filter implemented (also Haering,
    Lobo 1999)

187
Tree Reconstruction Summary
  • Preliminary solution for one instance of a hard
    inverse problem: forcing a procedural model to
    reproduce an existing object
  • Hybrid approach allows direct control of final
    shape while relegating details to a procedural
    model that enforces biological fidelity

188
Urban models: design targets, scaling
  • Capture a km^2 (about ½ sq. mile) to a feature
    size of one centimeter (about ½ inch): 10^10 cm^2
    total
  • Using a 1-MegaPixel digital camera, 10^4 images
    needed just to observe each cm^2 fragment once!
  • In practice, need 3-10 views of each surface
    fragment, and 3-10 pixels per observation
  • Bottom line: need at least 10^5 images per km^2
  • Quadratic-time algorithms clearly not applicable
  • Human operators can't even look at 10^5 images,
    let alone manipulate them. So what do we do?
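The arithmetic behind these targets is easy to check. The sketch below assumes an idealized capture in which every pixel lands on a distinct, useful surface fragment (no sky, no overlap, no wasted pixels), so the real image count would be higher. The function name `images_needed` is invented for this example.

```python
CM_PER_KM = 100_000            # 1 km = 10^5 cm, so 1 km^2 = 10^10 cm^2
PIXELS_PER_IMAGE = 1_000_000   # 1-megapixel camera

def images_needed(area_km2, views, pixels_per_observation):
    """Images required to see every cm^2 fragment `views` times,
    spending `pixels_per_observation` pixels on it each time."""
    fragments = area_km2 * CM_PER_KM ** 2
    pixels = fragments * views * pixels_per_observation
    return pixels // PIXELS_PER_IMAGE

once = images_needed(1, 1, 1)      # one pixel per fragment, one view: 10^4
low = images_needed(1, 3, 3)       # 3 views x 3 pixels: 9 x 10^4
high = images_needed(1, 10, 10)    # 10 views x 10 pixels: 10^6
pairs = low * (low - 1) // 2       # image pairs a quadratic-time method visits
print(once, low, high, pairs)
```

The low end of the practical range rounds to the 10^5 figure on the slide, and even at that size the number of image pairs runs into the billions, which is why quadratic-time algorithms are ruled out.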

189
Classical ambiguity: rotation vs. translation
  • Caused by limited camera FOV
  • To first order, rotation, translation
    indistinguishable when FOE (focus of expansion)
    is far outside image
  • Show example

190
History: Computer Vision
  • Camera calibration (intrinsic parameters)
  • E.g., Faugeras, Toscani 86; Tsai 87
  • Exterior orientation, scene structure (point
    clouds)
  • Kruppa 13, Ullman 79, Longuet-Higgins 81
  • Stereo (dense depth maps from image pairs,
    triples)
  • Marr, Poggio 79; Baker, Binford 81; Grimson
    81; Shashua 97
  • Structure from closely-spaced image sequences
  • Tomasi, Kanade 92; Azarbayejani, Pentland
    95; Collins 96; Beardsley et al. 97; Baillard,
    Zisserman 99

(Figure labels: Cam 1, Cam 2)
191
Model capture: An analogy
  • How can one capture an existing physical document
    into a word processor, for editing?

?
192
Option 1: Type it in
Uses existing skills, hardware, and software
Accurate (depending on input, operator skills)
  • Requires skilled human operator(s), proportional
    to number of pages to be input.
    Increasing computer speed doesn't generally
    increase system throughput. Thus, human is
    eventually the bottleneck

193
Option 2: Scan and OCR document
  • A) Acquire digital photograph of printed pages
  • B) Apply OCR (Optical Character Recognition)
    algorithms to extract a model of letters, words
  • C) Output document in machine-readable form

OCR Algorithms
Scanner
194
Scanning: Advantages and disadvantages
No human in the loop!
Throughput increases with technology
Parallel capture (scanners, processing) possible
Accurate (depending on input, algorithms)
Someone must develop scanner, algorithms
(Possibly years until commercial viability)
195
Back to Geometric Model Capture
  • We are developing a scanner and a suite of
    extraction algorithms for urban environments!

Input: pictures of urban environment
Output: textured 3D CAD model in Earth
coordinates
(Lat, long, alt., and orientation)
Vision algorithms
Novel Camera
196
Vision: Calibration, Correspondence, Structure,
Appearance
197
Hidden assumption: quadratic complexity
  • Most algorithms assume all input images are
    related!
  • Expend O(n^2) time searching for correlations
  • But for extended terrestrial imagery, overlap is
    sparse
  • This is impractical (and wasteful) for large n
  • Aperture problem (limited FOV) makes things worse
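One way to avoid the all-pairs search, assuming each image node carries an approximate position from its navigation metadata, is to bucket nodes into a spatial grid and compare only neighbors. The sketch below is illustrative only: `candidate_pairs` and its `radius` parameter are hypothetical, and it uses proximity alone, ignoring the viewing direction that a real sensor would also report.

```python
from collections import defaultdict
from itertools import product

def candidate_pairs(positions, radius):
    """Return index pairs (i, j), i < j, of nodes within `radius` of each other.

    positions: list of (x, y) node locations in meters. Nodes are bucketed
    into radius-sized grid cells, so each node is compared only against
    nodes in its own cell and the eight adjacent cells.
    """
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(positions):
        grid[(int(x // radius), int(y // radius))].append(idx)
    pairs = set()
    for (cx, cy), members in grid.items():
        for dx, dy in product((-1, 0, 1), repeat=2):
            # .get() avoids creating empty cells while iterating.
            for j in grid.get((cx + dx, cy + dy), ()):
                for i in members:
                    if i < j:
                        xi, yi = positions[i]
                        xj, yj = positions[j]
                        if (xi - xj) ** 2 + (yi - yj) ** 2 <= radius ** 2:
                            pairs.add((i, j))
    return sorted(pairs)

# Two clusters of nodes about 100 m apart; only nearby nodes are paired.
nodes = [(0.0, 0.0), (5.0, 0.0), (100.0, 0.0), (103.0, 4.0)]
print(candidate_pairs(nodes, radius=10.0))
```

When nodes are spread over a large area and the radius is small, the work grows roughly with the number of nodes times the number of neighbors per node, not with the square of the number of nodes.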

198
Hidden assumption: private coordinates
  • Local coordinate system used for each image set
  • These algorithms cannot use parallel inputs
  • No clear way to combine models generated across
    runs

(Figure: image sets 1, 2, 3, each with its own local x, y axes)
199
Integration barrier: disjoint operating regimes
  • Hundreds of algorithms exist for particular
    sub-tasks of the computer vision problem
  • Feature detection, camera calibration,
    short-baseline exterior orientation and structure
    from motion, feature correspondence, etc.
  • However, operating assumptions are restrictive,
    or unstated, or both, making composition hard!
  • Examples: short vs. long baseline; orthographic
    vs. perspective; controlled vs. diffuse vs.
    general illumination; known vs. unknown camera
    calibration; local vs. global
    processing/consistency
  • Not simply a systems integration problem

200
Scale constrains us severely
  • Can't control illumination conditions
  • Can't instrument environment with fiducials
  • Can't precisely control camera placement (in
    contrast to, e.g., stereo rigs or other gantries)
  • Can't afford O(n^2)-time algorithms
  • Can't assume a single, serial image sequence
  • Can't have human in the processing loop

201
A fundamental optical tradeoff: resolution vs.
field of view
CCD array (1K x 1K pixels)
Wide-angle (e.g. fisheye) lens: large field of
view, but low angular resolution
Long (e.g. telephoto) lens: high angular
resolution, but small field of view
Images courtesy Helmut Dersch; used with
permission
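The tradeoff can be put in numbers with a rough sketch. It treats the CCD as about 1000 pixels across (the slide says 1K), spreads the field of view evenly over those pixels, and computes a lower bound on how many square-FOV images are needed to cover the full sphere. The bound ignores overlap and packing losses, the lens angles are illustrative values rather than the actual camera's, and both function names are invented for this example.

```python
import math

def degrees_per_pixel(fov_deg, pixels_across=1000):
    """Average angle subtended by one pixel across the field of view."""
    return fov_deg / pixels_across

def min_images_to_tile_sphere(fov_deg):
    """Lower bound on images needed to cover the sphere with a square FOV."""
    half = math.radians(fov_deg) / 2.0
    # Solid angle of a square pyramid with apex angle fov_deg on each side.
    solid_angle = 4.0 * math.asin(min(1.0, math.sin(half) ** 2))
    # The small epsilon guards against floating-point noise at exact ratios.
    return math.ceil(4.0 * math.pi / solid_angle - 1e-9)

for name, fov in (("fisheye", 180.0), ("normal", 40.0), ("telephoto", 10.0)):
    print(name, degrees_per_pixel(fov), min_images_to_tile_sphere(fov))
```

As a sanity check on the solid-angle formula, a 90-degree square FOV gives exactly six images, the six faces of a cube. Narrowing the lens buys finer angular resolution at the cost of many more images per node.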
202
Mosaic generation (with Satyan Coorg)
Each node is 25-250 images tiling a sphere about
a mechanically fixed optical center
Each node correlated to form spherical mosaic
Camera internal parameters auto-calibrated
Computation is fully automated (no human in loop)
Per node (50 images), 20 CPU-minutes @ 200 MHz
CVPR 98; IJCV (to appear)
203
Two engineering problems
  • First: obscuration, multi-path, and electronic
    noise degrade GPS accuracy to about 20m, and
    make it only intermittently available
  • Second: GPS is a 3-DOF position sensor only; it
    gives no information about (3-DOF) heading

multi-path
clear line of sight
obscuration
urban canyon
204
GPS/Inertial Navigation (with Michael Bosse)
  • GPS is unbiased, but only intermittently
    available
  • Inertial is continuously available, but drifts
  • Strategy: combine sensors to achieve continuous
    2 m, 2° solution, then refine to 1 cm, 0.05° using
    images
  • Decoupled GPS, Inertial; Coupled GPS, Inertial

GPS World April 2000; ECCV/SMILE 2000 (submitted)
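The idea of pairing a drifting but continuous sensor with an unbiased but intermittent one can be illustrated with a one-dimensional Kalman-style filter. This is a generic textbook sketch, not the navigation filter used in the project: the function `fuse`, the variance values, and the test trajectory are all assumptions made for illustration.

```python
def fuse(steps, gps_fixes, step_var=0.04, gps_var=4.0):
    """Track 1-D position from dead-reckoned steps and sparse GPS fixes.

    steps:     per-interval displacements reported by the inertial unit (m)
    gps_fixes: same length; a position in meters, or None when GPS is
               unavailable for that interval
    Returns a list of (estimate, variance) after each interval.
    """
    est, var = 0.0, 0.0
    history = []
    for step, fix in zip(steps, gps_fixes):
        # Predict: integrate the inertial step; uncertainty grows (drift).
        est += step
        var += step_var
        # Update: blend in a GPS fix when one is available.
        if fix is not None:
            gain = var / (var + gps_var)
            est += gain * (fix - est)
            var *= 1.0 - gain
        history.append((est, var))
    return history

# True motion is 1.0 m per interval; the inertial unit over-reports by 10%.
reported = [1.1] * 10
no_gps = fuse(reported, [None] * 10)
sparse_gps = fuse(reported, [None] * 4 + [5.0] + [None] * 4 + [10.0])
print(no_gps[-1], sparse_gps[-1])
```

Without fixes the estimate drifts steadily and its variance only grows; each GPS fix pulls the estimate back toward the true position and reduces the variance, which is the behavior the combined sensor is meant to provide before image-based refinement.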
205
Focus of expansion and contraction
  • FOE is special point from which entire world
    looms as you move toward it
  • Matched by a second, antipodal point, the focus
    of contraction (usually not in view)

Image courtesy Steve Mann; used with permission
206
Demonstration
207
Increasing Throughput
  • Scale: sensor, spatial infrastructure
  • Sensor: node time reduced to 1 minute from 5
    minutes, including HDR imagery
  • Processing: MIT dataset, nearly a thousand nodes
    spanning entire campus
  • Currently mapping algorithms to parallel,
    distributed Linux cluster
  • 1-64 CPUs with near-linear speedups
  • Current bottleneck is disk I/O bandwidth

208
(No Transcript)
209
(No Transcript)
210
(No Transcript)
211
Animation with novel views