Title: Towards direct spatial manipulation of virtual 3D objects using vision-based tracking and gesture rec
1Towards direct spatial manipulation of virtual
3D objects using vision-based tracking and
gesture recognition of unmarked hands
Master's thesis (dissertação de mestrado) defense slides, March 28, 2008
Advisor: Prof. Marcelo Gattass
Co-advisor: Prof. Alberto Barbosa Raposo
2Motivation
- Manipulating 3D objects
- in an intuitive fashion
- using bare hands
3Objective
Implementing five basic 3D manipulation
operations
- Selection
- Deselection
- Translation
- Rotation
- Scaling
4Selected Past Systems
(marked, instrumented and unmarked hands)
5Cutler et al. (1997)
6Forsberg et al. (1998)
7Segen, Kumar (2000)
8Schkolne (2001)
9Bettio et al. (2007)
10Our approach
11Workplace
12Stereo rig calibration
13Stereo rig calibration
Jean-Yves Bouguet
14Stereo rig calibration
- Zhang's method
- Recovers
  - Intrinsic parameters
    - Two focal lengths fx, fy
    - The principal point (ox, oy)
    - Distortion parameters (k1, k2, k3, k4)
  - Extrinsic parameters
    - Rotation matrix R
    - Translation vector
15Stereo rig calibration
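- The extrinsic parameters of the stereo rig (rotation matrix R and translation vector) are then
computed from the extrinsic parameters of the left and right cameras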
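As a rough illustration of this step, the sketch below combines the two cameras' extrinsic parameters into the relative pose of the rig. It assumes (this is not stated on the slide) that each camera maps world points as X_cam = R_cam X_world + T_cam and that the rig pose maps left-camera coordinates to right-camera coordinates. The names `stereo_extrinsics` and `rot_z` and all numeric values are made up for the example; other conventions lead to a differently arranged but equivalent formula.

```python
import numpy as np

def stereo_extrinsics(R_l, T_l, R_r, T_r):
    """Relative pose (R, T) such that X_right = R @ X_left + T.

    Assumed convention: X_cam = R_cam @ X_world + T_cam for each camera.
    """
    R = R_r @ R_l.T
    T = T_r - R @ T_l
    return R, T

def rot_z(angle):
    """Rotation about the z axis (helper used only to build example poses)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Hypothetical per-camera extrinsics with respect to the calibration board.
R_l, T_l = rot_z(0.10), np.array([0.0, 0.0, 5.0])
R_r, T_r = rot_z(-0.05), np.array([-0.2, 0.0, 5.0])
R, T = stereo_extrinsics(R_l, T_l, R_r, T_r)

# Consistency check on one world point seen by both cameras.
X_world = np.array([0.3, -0.1, 1.0])
X_left = R_l @ X_world + T_l
X_right = R_r @ X_world + T_r
```

Mapping `X_left` through `(R, T)` should reproduce `X_right`, which is a quick way to check that the chosen convention is applied consistently.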
16Stereo rig calibration
- Obtaining the fundamental matrix F
- Needed later on for triangulation based on stereo vision
- The epipolar constraint $x'^{\top} F x = 0$ holds for any pair of corresponding points $x$, $x'$
17Stereo rig calibration
Finding fundamental matrix F of the stereo rig
Left camera view
Right camera view
Views differ slightly (stereo disparity)
18Stereo rig calibration
Finding fundamental matrix F of the stereo rig
Left camera view
Right camera view
Find corresponding points (the corners where the squares meet)
There are 7 x 6 = 42 corresponding points for an
8x7 checkerboard
19Stereo rig calibration
Finding fundamental matrix F of the stereo rig
Hartley, Zisserman MVG
20Stereo rig calibration
Finding fundamental matrix F of the stereo
rig (normalized 8-point algorithm for F)
Hartley, Zisserman MVG
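The following is a minimal NumPy sketch of the normalized 8-point algorithm as described by Hartley and Zisserman, not the code used in the thesis. The function names, the synthetic cameras and the 42 random scene points are assumptions chosen for illustration, and the convention used is x_right^T F x_left = 0.

```python
import numpy as np

def normalize_points(pts):
    """Move the centroid to the origin and scale the mean distance to sqrt(2)."""
    centroid = pts.mean(axis=0)
    mean_dist = np.linalg.norm(pts - centroid, axis=1).mean()
    s = np.sqrt(2.0) / mean_dist
    T = np.array([[s, 0.0, -s * centroid[0]],
                  [0.0, s, -s * centroid[1]],
                  [0.0, 0.0, 1.0]])
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    return (T @ pts_h.T).T, T

def fundamental_matrix_8point(pts_left, pts_right):
    """Estimate F from N >= 8 matches given as N x 2 pixel arrays."""
    xl, T_l = normalize_points(pts_left)
    xr, T_r = normalize_points(pts_right)
    # One row per match: [x'x, x'y, x', y'x, y'y, y', x, y, 1]
    A = np.column_stack([xr[:, 0:1] * xl, xr[:, 1:2] * xl, xl])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)           # least-squares solution of A f = 0
    U, S, Vt = np.linalg.svd(F)
    S[2] = 0.0                         # enforce rank 2
    F = U @ np.diag(S) @ Vt
    F = T_r.T @ F @ T_l                # undo the normalization
    return F / np.linalg.norm(F)

def project(K, R, t, X):
    """Project N x 3 scene points with camera K [R | t]; returns N x 2 pixels."""
    x = (K @ (R @ X.T + t[:, None])).T
    return x[:, :2] / x[:, 2:3]

# Synthetic, noise-free stand-in for the 42 checkerboard correspondences.
rng = np.random.default_rng(0)
X = rng.uniform([-1, -1, 4], [1, 1, 8], size=(42, 3))
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
pts_l = project(K, np.eye(3), np.zeros(3), X)
pts_r = project(K, np.eye(3), np.array([-0.1, 0.0, 0.0]), X)

F = fundamental_matrix_8point(pts_l, pts_r)
xl_h = np.column_stack([pts_l, np.ones(len(pts_l))])
xr_h = np.column_stack([pts_r, np.ones(len(pts_r))])
residuals = np.abs(np.sum(xr_h * (xl_h @ F.T), axis=1))  # |x'^T F x| per match
```

With noise-free matches the residuals should be close to zero; with real corner detections they give a quick indication of how well the estimated F fits the data.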
21State switching using gestures
Each gesture is a switch triggering an event
Left hand
Right hand
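One way to read "each gesture is a switch triggering an event" is sketched below: an event is emitted only when the recognized posture of a hand changes. The class name `GestureSwitch`, the event tuple layout and the example stream are hypothetical; the posture labels OPEN, POINTING and FIST are the ones evaluated later in these slides, and the mapping from events to manipulation operations is deliberately left out because the slides do not specify it.

```python
class GestureSwitch:
    """Turns per-frame posture labels into change events, one state per hand."""

    def __init__(self):
        self.current = {"left": None, "right": None}

    def update(self, hand, posture):
        """Return (hand, previous, new) when the posture changes, else None."""
        previous = self.current[hand]
        if posture == previous:
            return None
        self.current[hand] = posture
        return (hand, previous, posture)

# Hypothetical stream of per-frame detections.
stream = [("right", "OPEN"), ("right", "OPEN"), ("right", "FIST"),
          ("left", "POINTING"), ("right", "FIST")]

switch = GestureSwitch()
events = []
for hand, posture in stream:
    event = switch.update(hand, posture)
    if event is not None:
        events.append(event)
```

Repeated detections of the same posture produce no event, so an application would react once per switch rather than once per frame.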
22State switching using gestures
Examples of manipulation operations
23Viola-Jones detection method
(applied to hand detection and gesture recognition)
Hit
Hit and false hit
24Viola-Jones detection method
(applied to hand detection and gesture recognition)
Hit and multiple false hits
Miss
25Viola-Jones detection method
- invariance with regard to background
- insensitivity to changes in illumination/lighting
- invariance with regard to camera
- invariance with regard to scale
- fast execution (15x faster than previous best methods)
- works with grayscale images only; color is not needed
- very long training times (up to several days for one object or hand posture on a 30-node cluster)
26Viola-Jones detection method
Originally developed for face detection
However, it works for any type of object
27Viola-Jones detection method
Extended Viola-Jones method by Lienhart, Maydt
28Viola-Jones detection method
Extended Viola-Jones method by Lienhart, Maydt
29Viola-Jones detection method
Strong classifier obtained by AdaBoost
A linear combination of weak classifiers
$h_t(x)$ (a weak classifier = a rectangular feature)
Extended Viola-Jones method by Lienhart, Maydt
30Viola-Jones detection method
Example: a strong classifier consisting of two
weak classifiers
Extended Viola-Jones method by Lienhart, Maydt
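A toy sketch of a strong classifier built from two weak classifiers, assuming the decision rule of the original Viola-Jones formulation: accept when the weighted vote reaches half of the total weight. The four-pixel "window", the two feature functions, the thresholds and the weights are invented for illustration; in practice AdaBoost selects them during training.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class WeakClassifier:
    feature: Callable[[Sequence[float]], float]  # stand-in for a rectangular feature
    threshold: float
    polarity: int   # +1 or -1, direction of the inequality
    alpha: float    # weight of this classifier in the linear combination

    def predict(self, window) -> int:
        value = self.feature(window)
        return 1 if self.polarity * value < self.polarity * self.threshold else 0

def strong_classify(weak_classifiers, window) -> bool:
    """Weighted vote of the weak classifiers against half of the total weight."""
    score = sum(h.alpha * h.predict(window) for h in weak_classifiers)
    return score >= 0.5 * sum(h.alpha for h in weak_classifiers)

# Toy 2x2 window stored as [top-left, bottom-left, top-right, bottom-right].
def left_minus_right(w):
    return (w[0] + w[1]) - (w[2] + w[3])

def top_minus_bottom(w):
    return (w[0] + w[2]) - (w[1] + w[3])

weak = [WeakClassifier(left_minus_right, -100.0, +1, alpha=0.8),
        WeakClassifier(top_minus_bottom, 50.0, +1, alpha=0.4)]

dark_left = [10, 10, 200, 200]    # both weak classifiers fire
dark_right = [200, 200, 10, 10]   # only the second, lower-weight one fires
accepted_a = strong_classify(weak, dark_left)    # True
accepted_b = strong_classify(weak, dark_right)   # False
```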
31Viola-Jones detection method
- Strong classifiers can be arbitrarily accurate but tend to become slow as more weak classifiers are added during the learning process
- Way out: cascades of strong classifiers
- Basically, several strong classifiers linked into a chain
Extended Viola-Jones method by Lienhart, Maydt
32Viola-Jones detection method
A cascade of strong classifiers
Extended Viola-Jones method by Lienhart, Maydt
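The chaining idea can be sketched as follows: a window is passed from stage to stage and rejected as soon as one stage says no, so most background windows cost only one or two cheap evaluations. The three stage functions here are simple placeholders, not trained strong classifiers, and the function name `cascade_classify` is hypothetical.

```python
def cascade_classify(stages, window):
    """Return (accepted, number_of_stages_evaluated) for one image window."""
    for index, stage in enumerate(stages, start=1):
        if not stage(window):
            return False, index        # early rejection
    return True, len(stages)

# Placeholder stages, ordered from cheapest and most permissive to strictest.
stages = [
    lambda w: sum(w) / len(w) > 50,    # bright enough
    lambda w: max(w) - min(w) > 30,    # enough contrast
    lambda w: w[0] < w[-1],            # brightness increases across the window
]

background = [10, 12, 11, 9]
hand_like = [60, 90, 120, 150]

result_background = cascade_classify(stages, background)  # (False, 1)
result_hand = cascade_classify(stages, hand_like)         # (True, 3)
```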
332D hand tracking using Flocks of KLT features
342D hand tracking using Flocks of KLT features
352D hand tracking using Flocks of KLT features
- Hand's mean position = average of the KLT features'
positions
362D hand tracking using Flocks of KLT features
- Two conditions enforced at each frame
  - No two KLT features can be closer to each other than some threshold distance
  - No KLT feature can be farther from the feature median than a second threshold distance
Kölsch, Turk (2005) - Hand tracking with Flocks of
Features
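The two flocking conditions and the mean-position estimate can be sketched in plain Python as below. This only flags features that violate a condition; what happens to flagged features afterwards (for example re-seeding them elsewhere) is left out. The function names, the thresholds and the sample coordinates are assumptions made for the example.

```python
import math
from statistics import median

def flock_violations(features, min_dist, max_dist):
    """Indices of features that break either flocking condition.

    features: list of (x, y) positions of the tracked KLT features.
    """
    med = (median(x for x, _ in features), median(y for _, y in features))
    bad = set()
    for i, p in enumerate(features):
        if math.dist(p, med) > max_dist:          # too far from the feature median
            bad.add(i)
            continue
        for j in range(i):
            if j not in bad and math.dist(p, features[j]) < min_dist:
                bad.add(i)                        # too close to an earlier feature
                break
    return sorted(bad)

def hand_position(features):
    """Hand position as the average of the feature positions."""
    n = len(features)
    return (sum(x for x, _ in features) / n, sum(y for _, y in features) / n)

# Hypothetical flock: four well-spread features, one near-duplicate, one stray.
features = [(100, 100), (110, 100), (100, 110), (110, 110),
            (100.5, 100.5), (300, 300)]
violations = flock_violations(features, min_dist=3.0, max_dist=50.0)   # [4, 5]
kept = [p for i, p in enumerate(features) if i not in violations]
position = hand_position(kept)                                         # (105.0, 105.0)
```

Using the median as the reference point keeps a single stray feature, such as the one at (300, 300), from pulling the reference toward itself.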
373D reconstruction of the hands' position using
triangulation
- A 3D point in the scene is projected onto both
the left and right screens (image planes)
3D point
Hartley, Zisserman MVG
383D reconstruction of the hands' position using
triangulation
- Ideal case: rays back-projected from measured
pixel points do meet in space
Hartley, Zisserman MVG
393D reconstruction of the hands' position using
triangulation
- Real life: rays back-projected from imperfectly
measured pixel points do not meet in space
Hartley, Zisserman MVG
403D reconstruction of the hands' position using
triangulation
- Solution: the mid-point method; the intersection is
estimated as the point of minimum distance from
both rays
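A small NumPy sketch of the mid-point method: find the closest points on the two back-projected rays and take the point halfway between them. Camera centers, ray directions and the noise added to the second ray are invented for the example, and the function name is hypothetical.

```python
import numpy as np

def midpoint_triangulation(c1, d1, c2, d2):
    """Mid-point of the shortest segment between rays c1 + s*d1 and c2 + t*d2.

    Returns the estimated 3D point and the length of that segment (the gap).
    """
    # Least-squares solution of s*d1 - t*d2 = c2 - c1 gives the closest points.
    A = np.column_stack([d1, -d2])
    (s, t), *_ = np.linalg.lstsq(A, c2 - c1, rcond=None)
    p1 = c1 + s * d1
    p2 = c2 + t * d2
    return 0.5 * (p1 + p2), np.linalg.norm(p1 - p2)

# Hypothetical rig: camera centers 10 cm apart, hand about 1 m away.
c_left = np.array([0.0, 0.0, 0.0])
c_right = np.array([0.1, 0.0, 0.0])
P_true = np.array([0.05, 0.02, 1.0])

# Ideal case: both rays pass exactly through the point.
exact_point, exact_gap = midpoint_triangulation(
    c_left, P_true - c_left, c_right, P_true - c_right)

# Real-life case: the right ray is slightly off, so the rays no longer meet.
noisy_dir = P_true - c_right + np.array([0.0, 0.002, 0.0])
noisy_point, noisy_gap = midpoint_triangulation(
    c_left, P_true - c_left, c_right, noisy_dir)
```

The returned gap is zero (up to rounding) in the ideal case and positive in the noisy one, so it can double as a rough sanity check on a pair of 2D measurements.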
413D reconstruction of the hands' position using
triangulation
- Better: minimize the geometric error by finding
points $\hat{x}$, $\hat{x}'$ so that $\hat{x}'^{\top} F \hat{x} = 0$
Hartley, Zisserman MVG
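423D reconstruction of the hands' position using
triangulation
- That is, minimize the cost function $C(x, x') = d(x, \hat{x})^2 + d(x', \hat{x}')^2$ subject to $\hat{x}'^{\top} F \hat{x} = 0$
- Having $\hat{x}$, $\hat{x}'$, use any triangulation method
(e.g. mid-point) to find the originating 3D point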
Hartley, Zisserman MVG
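The exact minimization is more involved, so the sketch below swaps in the first-order Sampson approximation, which Hartley and Zisserman describe as a cheaper alternative: the measured pair is nudged toward a pair that satisfies the epipolar constraint x'^T F x = 0. This is an illustration only, not the method used in the thesis. The fundamental matrix is that of an idealized rectified rig, where corresponding points share the same image row, and the pixel values are made up.

```python
import numpy as np

def sampson_correct(F, x, xp):
    """First-order correction of a measured match (x, xp), given as 2D pixels."""
    xh = np.array([x[0], x[1], 1.0])
    xph = np.array([xp[0], xp[1], 1.0])
    Fx = F @ xh
    Ftxp = F.T @ xph
    error = xph @ Fx                                  # epipolar residual x'^T F x
    # Gradient of the residual with respect to (x, y, x', y').
    J = np.array([Ftxp[0], Ftxp[1], Fx[0], Fx[1]])
    delta = -error / (J @ J) * J
    x_hat = np.array(x, dtype=float) + delta[:2]
    xp_hat = np.array(xp, dtype=float) + delta[2:]
    return x_hat, xp_hat

def epipolar_residual(F, x, xp):
    return np.array([xp[0], xp[1], 1.0]) @ F @ np.array([x[0], x[1], 1.0])

# Idealized rectified rig: the constraint reduces to y = y'.
F_rectified = np.array([[0.0, 0.0, 0.0],
                        [0.0, 0.0, -1.0],
                        [0.0, 1.0, 0.0]])

x_meas, xp_meas = (100.0, 52.0), (90.0, 50.0)         # rows disagree by 2 pixels
x_hat, xp_hat = sampson_correct(F_rectified, x_meas, xp_meas)
# x_hat is (100, 51) and xp_hat is (90, 51): the row error is split evenly.
```

The corrected pair can then be passed to any triangulation method, mid-point included.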
433D reconstruction of the hands' position using
triangulation
Hartley, Zisserman MVG
443D reconstruction of the hands' position using
triangulation
Hartley, Zisserman MVG
45Basic ingredients summary
- A well-defined workplace setup with a calibrated stereo rig
- Viola-Jones method for hand detection and recognition
- Flocks of KLT features for 2D hand tracking (in both cameras' views)
- Triangulation (based on stereo vision) for recovery of the third hand coordinate (depth) using two tracked 2D positions
46Tests and Results
47Tracing lines
48Detector performance (hand posture OPEN)
49Detector performance (hand posture POINTING)
50Detector performance (hand posture FIST)
51Video
52Contributions
531) 3D TRACKING OF UP TO TWO UNMARKED HANDS
- Key ingredients
- 2D flock-of-KLT-features hand tracking
- Triangulation based on stereo vision for
extracting the hands' third dimension (depth)
542) A NOVEL SPATIAL-INPUT DEVICE
- Key ingredients
- The aforementioned 3D unmarked hand tracking
- Use of the Viola-Jones detection method for state switching
- Two hands give a 2 x 3 = 6 d.o.f. spatial input device
553) FREE-HAND SPATIAL MANIPULATION
- In conjunction with the aforementioned spatial input device, the prototype developed enables the user to
- Manipulate 3D virtual objects using free-hand motion
- In other words, there is no need to instrument the user's hands in any way in order to perform 3D manipulation operations
56Limitations
57Flocks-of-features sometimes drift to surrounding
objects
- Flocks of tracked features sometimes drift to other objects
- This can happen especially on cluttered desks
58(Still) overly high false hit rates
- The false hit rates achieved by our hand detectors are still too high
- Longer training sessions and more powerful computing resources are needed
- Various heuristics come to the rescue (e.g. the average posture over the last 1000 milliseconds)
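One possible reading of the "average posture" heuristic is a majority vote over the detections of the last 1000 milliseconds, sketched below. The class name `PostureSmoother`, the 100 ms frame spacing and the sample detections are assumptions; the slides do not say how the average is actually computed.

```python
from collections import Counter, deque

class PostureSmoother:
    """Reports the most frequent posture seen within a sliding time window."""

    def __init__(self, window_ms=1000):
        self.window_ms = window_ms
        self.history = deque()          # (timestamp_ms, posture) pairs

    def update(self, timestamp_ms, posture):
        self.history.append((timestamp_ms, posture))
        # Drop detections that fell out of the time window.
        while self.history[0][0] < timestamp_ms - self.window_ms:
            self.history.popleft()
        counts = Counter(p for _, p in self.history)
        return counts.most_common(1)[0][0]

# Hypothetical detections every 100 ms; the OPEN at 200 ms is a false hit.
raw = [(0, "FIST"), (100, "FIST"), (200, "OPEN"), (300, "FIST"), (400, "FIST")]
smoother = PostureSmoother(window_ms=1000)
smoothed = [smoother.update(t, p) for t, p in raw]
# smoothed == ["FIST", "FIST", "FIST", "FIST", "FIST"]
```

The price of this kind of smoothing is latency: a genuine posture change is only reported once it dominates the window.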
59Future work
60Richer set of manipulations and deformations
Going beyond the basic set of manipulations: we
want deformations too
61Advanced (volumetric) topological data structures
Needed to support advanced deformation operations
62Increasing robustness of detection
- The goal isn't to add more gestures; the goal is to increase the robustness of detection of the existing 2-3 gestures
- Too many gestures lead to cognitive overload for users (at least in the beginning)
- 2-3 gestures suffice to implement A LOT of functionality
63Improving sense of where
- Improving the sense of position and orientation in the fishtank-VR by adding spatial cues
- Shadows cast by 3D objects
- 2D projections of 3D objects onto the planes XY, YZ, ZX
64Wide-angle/fisheye cameras
Increasing the workspace
65Cameras towards the user
Workspace (use hands here)
Desktop computer users
66Cameras towards the user
Notebook users
67Cameras towards the user
camera built into the cellphone
use hands here
Mobile platform users → 3D modeling on cell phones
68Model-based (3D) hand tracking
3D hand tracking for more expressive manipulation
M. Bray, E. Koller-Meier, L. Van Gool (2007)
69Dynamic gestures
Dynamic gestures for more expressive manipulation
Hand posture changes in space AND time
70Human factors
Comfort zones for hand actions, while standing
Kölsch 2004
71Human factors
osha.gov
72Human factors
Achieving comfort: putting elbows on chair
supports
73Natural fit: head-mounted displays
(user can reach into the display volume in front
of her/him)
Eyewear (personal displays) by Lumus Inc.
(www.lumus-optical.com/)
74Integrating other ways to recover the hands' depth
hand
ZCam: such a camera would eliminate the
calibration and triangulation steps
3DV Systems' ZCam depth-sensing camera
75Thank you