Title: Screenplay Alignment for Closed-System Speaker Identification and Analysis of Feature Films
1Screenplay Alignment for Closed-System Speaker
Identification and Analysis of Feature Films
- Robert Turetsky
- rob_at_ee.columbia.edu
- Columbia Univ / Philips Research
- ICME 2004 - June 29, 2004
2Talk Organization
- Motivation for Content-based Analysis
- Unsupervised Speaker ID on films
- Concerning Screenplays
- Screenplay Alignment for Label Generation
- Experiments, Evaluation, Future Work
3Technology Impacts Film
Production
Distribution
Consumption
4Content-Based Analysis Motivation
- With so much content out there, users suffer from information overload
- How can they find exactly what they want? How can they find new things they like?
- Content-based analysis attempts to find what is important and what is unique
5Voice Fingerprinting for Speaker Identification
- Extremely difficult on film audio!
- Many different emotional contexts
- Different acoustic environments (room tone)
- Noise assumptions do not hold
- Sound design/FX leads to burst noises
- Noise is correlated with speech (soundtrack)
- SNR can be low with soundtrack
- Very little published work on film audio!
6Deliverable
- Closed-system speaker identification on any main character (> 5% of dialogue)
- Completely self-referential, requires no user intervention
- Takes advantage of supervised learning methods
- Can be combined with face ID for robust character detection
7The Screenplay
- Used as a map of the movie for every member of the cast and crew
- Contains description of scenes, characters, costumes, action and dialogue
- Usually formatted very regularly
- Available for thousands of movies
- An untapped resource in the automatic film analysis community
Example Screenplay
8(No Transcript)
9Challenges with Screenplays
- No timecode associated with events
- Lines/scenes are often cut, shuffled or added
- Formatting is a guideline not a standard
- Proposed Solution:
  - Parse the screenplay into a data structure
  - Align the screenplay with timestamped subtitles
  - Use timestamped dialogues as ground truth for multimodal statistical models of salient objects within the film
10Character ID Architecture
[Diagram: Audio Features, Statistical Model, Video signal, Closed Captions, Alignment, Character ID, Screenplay, Actor Identification, IMDb.com]
11Screenplay parsing
- SCENE: .* | SCENE | DIAL_START | SLUG | TRANSITION
- DIAL_START: \t <CHAR NAME> (V.O.|O.S.)? \n
- \t DIALOGUE | PAREN
- DIALOGUE: \t .*? \n\n
- PAREN: \t (.*?)
- TRANSITION: \t <TRANS NAME>
- SLUG: <SCENE #>?. <INT/EXT><ERNAL.>? - <LOC> <- TIME>?
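The grammar above can be approximated with a handful of regular expressions. The following is a minimal sketch, not the parser used in the talk: it assumes tab-indented character cues in upper case, slug lines starting with INT or EXT, and blank lines between dialogue blocks. The patterns, the `parse_screenplay` name and the sample text are all illustrative assumptions.

```python
import re

# Hypothetical patterns loosely following the grammar above; real screenplays
# deviate from any single convention, so treat these as a starting point.
SLUG_RE = re.compile(r"^\s*(?:\d+\.?\s+)?(INT|EXT)[A-Z./]*\s+(.+?)(?:\s+-\s+(.+))?$")
CUE_RE = re.compile(r"^\t+([A-Z][A-Z .'\-]+?)\s*(?:\((V\.O\.|O\.S\.)\))?\s*$")
PAREN_RE = re.compile(r"^\t+\((.*?)\)\s*$")

def parse_screenplay(text):
    """Return a list of scenes; each scene holds its slug and (speaker, dialogue) pairs."""
    scenes = [{"slug": None, "dialogues": []}]
    speaker, lines = None, []
    for raw in text.splitlines() + [""]:      # trailing "" flushes the last dialogue
        if not raw.strip():                   # a blank line ends a dialogue block
            if speaker and lines:
                scenes[-1]["dialogues"].append((speaker, " ".join(lines)))
            speaker, lines = None, []
        elif speaker is not None:             # inside a dialogue block
            if not PAREN_RE.match(raw):       # skip parentheticals such as (nervous)
                lines.append(raw.strip())
        elif SLUG_RE.match(raw):
            scenes.append({"slug": raw.strip(), "dialogues": []})
        elif CUE_RE.match(raw):
            speaker = CUE_RE.match(raw).group(1).strip()
        # transitions and action lines are ignored in this sketch
    return [s for s in scenes if s["slug"] or s["dialogues"]]

# Invented sample text, for illustration only.
sample = (
    "INT. LESTER'S OFFICE - DAY\n\n"
    "Craig enters.\n\n"
    "\t\tCRAIG\n"
    "\t(nervous)\n"
    "\tI'm here about the job.\n\n"
    "\t\tLESTER (O.S.)\n"
    "\tCome in.\n\n"
    "\t\t\tCUT TO:\n"
)
scenes = parse_screenplay(sample)
for scene in scenes:
    print(scene["slug"], scene["dialogues"])
```

Because formatting is a guideline rather than a standard, a real parser would likely need per-script tuning of the indentation and name patterns.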
12Closed Captions Capture
- Subtitles stored on DVD as MPEG movie overlay
- SubRip 1.17.1 performs video OCR, with timestamps
- Manual training period per font
- Confusion: I and l
- Alternative: closed captions from UDF
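SubRip writes its output as `.srt` text: a numeric index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time line, then the caption text, with blank lines between entries. A minimal sketch of reading such a file into timestamped captions follows; the `parse_srt` name and the sample captions are illustrative assumptions, and OCR cleanup (such as the I/l confusion) is not handled.

```python
import re

TIME_RE = re.compile(
    r"(\d+):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2}),(\d{3})"
)

def _seconds(h, m, s, ms):
    """Convert the four captured time fields to seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

def parse_srt(text):
    """Return a list of (start_sec, end_sec, caption_text) tuples."""
    captions = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            m = TIME_RE.search(line)
            if m:
                g = m.groups()
                body = " ".join(l.strip() for l in lines[i + 1:])
                captions.append((_seconds(*g[:4]), _seconds(*g[4:]), body))
                break
    return captions

# Invented sample captions, for illustration only.
sample_srt = """1
00:01:02,500 --> 00:01:04,000
I'm here about the job.

2
00:01:05,250 --> 00:01:06,750
Come in.
"""
captions = parse_srt(sample_srt)
for start, end, text in captions:
    print(start, end, text)
```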
13Alignment Similarity Matrix
- Binary array: compare each word in CC/screenplay
- Dialogues that align form diagonal lines
- Noise: common or repeated words
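The binary word-by-word comparison can be sketched in a few lines. This is an illustrative sketch under simple assumptions (lowercasing and punctuation stripping as the only normalization); the `tokenize` and `similarity_matrix` names and the sample sentences are not from the talk.

```python
import re

def tokenize(text):
    """Lowercase and split into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

def similarity_matrix(screenplay_words, caption_words):
    """sim[i][j] is 1 when screenplay word i equals caption word j, else 0."""
    return [[1 if s == c else 0 for c in caption_words] for s in screenplay_words]

# Invented example: the caption inserts one word the screenplay lacks.
screenplay_words = tokenize("I'm here about the job. Come in.")
caption_words = tokenize("I'm here about the job! Come on in.")
sim = similarity_matrix(screenplay_words, caption_words)

# Print the matrix; matching dialogue shows up as a diagonal of '#'.
for row in sim:
    print("".join("#" if v else "." for v in row))
```

For a full film the matrix is large (thousands of words on each axis), so a practical version would likely store only the matching coordinates rather than a dense array.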
14Screenplay vs. CC Distance Matrix
- Dynamic programming: find strong diagonal segments
- Median filter on slope to identify properly aligned long dialogues and discard spurious short matches
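One way to picture this step is sketched below. It is a simplification, not the procedure from the talk: a longest-common-subsequence table stands in for the dynamic-programming pass, and a plain minimum-run-length threshold stands in for the median filter on slope. The function names, the `min_len` parameter and the word lists are illustrative assumptions.

```python
def lcs_pairs(a, b):
    """Longest common subsequence by dynamic programming; returns matched (i, j) pairs."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:                    # trace one optimal path
        if a[i] == b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs

def diagonal_runs(pairs, min_len=3):
    """Group matches into slope-1 runs and keep only runs of at least min_len words."""
    runs, current = [], []
    for p in pairs:
        if current and p[0] == current[-1][0] + 1 and p[1] == current[-1][1] + 1:
            current.append(p)
        else:
            if len(current) >= min_len:
                runs.append(current)
            current = [p]
    if len(current) >= min_len:
        runs.append(current)
    return runs

# Invented example: "the" matches in isolation and should be discarded.
screenplay = "come in sit down the nice to meet you".split()
captions = "come in sit down well well the um um um nice to meet you".split()
pairs = lcs_pairs(screenplay, captions)
runs = diagonal_runs(pairs, min_len=3)
print(runs)
```

The isolated match on a common word is dropped while the two long diagonal runs survive, which is the behavior the slide describes for spurious short matches.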
15Screenplay Alignment Result
- Obtain time-stamped dialogues from closed-caption times
- Identify speaker by screenplay label
Screenplay Alignment, Wall Street
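Once words are aligned, transferring labels is mostly bookkeeping: each caption inherits its own timestamps and the speaker named in the screenplay for the words that matched it. The sketch below assumes word-level matches are already available; the data layout, the `label_captions` name and the toy data are illustrative assumptions.

```python
from collections import Counter

def label_captions(pairs, word_speaker, word_caption, captions):
    """Label each caption with the speaker whose screenplay words matched it most often.

    pairs:        (screenplay_word_index, caption_word_index) matches
    word_speaker: speaker name for every screenplay word index
    word_caption: caption index for every caption word index
    captions:     (start_sec, end_sec, text) tuples
    """
    votes = {}
    for i, j in pairs:
        votes.setdefault(word_caption[j], Counter())[word_speaker[i]] += 1
    labeled = []
    for k, (start, end, text) in enumerate(captions):
        if k in votes:                        # captions with no aligned words stay unlabeled
            speaker = votes[k].most_common(1)[0][0]
            labeled.append((start, end, speaker, text))
    return labeled

# Invented toy data: two captions, seven aligned words.
captions = [(62.5, 64.0, "I'm here about the job."), (65.25, 66.75, "Come in.")]
word_caption = [0, 0, 0, 0, 0, 1, 1]
word_speaker = ["CRAIG"] * 5 + ["LESTER"] * 2
pairs = [(k, k) for k in range(7)]
labeled = label_captions(pairs, word_speaker, word_caption, captions)
print(labeled)
```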
16Analysis of Label Accuracy
         CRAIG  LESTER  LOTTE  MALKO  MAXINE  OTHER
CRAIG       82       0      1      1       0     11
LESTER       0      41      0      0       0      0
LOTTE        0       0     40      0       0      2
MALKO        0       0      0     25       0      2
MAXINE       0       0      1      0      71      4
- Being John Malkovich: 311/334 (93.1%) of closed captions successfully labeled
17Coverage: The need for Statistical Methods
Movie                  CCs   Accuracy    Coverage
Being John Malkovich  1436   311 (93%)   334 (23%)
L.A. Confidential     1666   522 (95%)   548 (33%)
Wall Street           2342   850 (89%)   954 (41%)
Magnolia              2672   747 (89%)   843 (32%)
- For each film, alignment accuracy is about 90%, but only about 30% of dialogue is aligned!
- Need statistical methods to learn from labeled examples to label all dialogues
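The percentages in the table can be reproduced from the raw counts. The sketch below assumes the columns read as: accuracy = correctly labeled captions / aligned captions, and coverage = aligned captions / all closed captions. The `rates` helper is illustrative; the counts are those in the table.

```python
# (total closed captions, correctly labeled, aligned) per film, from the table above
films = {
    "Being John Malkovich": (1436, 311, 334),
    "L.A. Confidential": (1666, 522, 548),
    "Wall Street": (2342, 850, 954),
    "Magnolia": (2672, 747, 843),
}

def rates(total, correct, aligned):
    """Return (accuracy %, coverage %) under the column reading assumed above."""
    accuracy = 100.0 * correct / aligned
    coverage = 100.0 * aligned / total
    return accuracy, coverage

for name, counts in films.items():
    acc, cov = rates(*counts)
    print(f"{name:22s} accuracy {acc:5.1f}%  coverage {cov:5.1f}%")
```

Rounded to whole percentages, these values agree with the figures in the table, which supports that reading of the columns.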
18Speaker Identification
- Alignment allows unsupervised speaker ID using a (much easier!) supervised classifier
- Best-performing feature: 0.5 sec (40 frames) of MFCC
- 8-component GMM per main character, trained using EM
- Winner-take-all likelihood for final voting
         CRAIG  LESTER  LOTTE  MALKO  MAXINE
CRAIG       57      17     21     27      28
LESTER       5      77      4     15       1
LOTTE        8       0     49     10      13
MALKO        7       2      8     31       4
MAXINE      23       4     19     17      55
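The scoring step can be sketched as follows, reading "winner-take-all" as: the character whose model gives the highest total likelihood for a segment is chosen. This is one interpretation, not a confirmed description of the system. EM training is omitted; the two-component, two-dimensional models and the feature values are made-up stand-ins for the 8-component GMMs over MFCC features.

```python
import math

def gmm_log_likelihood(x, gmm):
    """Log-likelihood of feature vector x under a diagonal-covariance GMM.

    gmm is a list of (weight, means, variances) components.
    """
    logs = []
    for weight, means, variances in gmm:
        ll = math.log(weight)
        for xi, mu, var in zip(x, means, variances):
            ll += -0.5 * (math.log(2 * math.pi * var) + (xi - mu) ** 2 / var)
        logs.append(ll)
    top = max(logs)                           # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(l - top) for l in logs))

def identify_speaker(frames, models):
    """Score a segment against every character model and return the best one."""
    totals = {
        name: sum(gmm_log_likelihood(x, gmm) for x in frames)
        for name, gmm in models.items()
    }
    return max(totals, key=totals.get), totals

# Hypothetical toy models; real parameters would come from EM on labeled dialogue.
models = {
    "CRAIG": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "MAXINE": [(0.5, [5.0, 5.0], [1.0, 1.0]), (0.5, [6.0, 4.0], [1.0, 1.0])],
}
frames = [[0.2, -0.1], [0.9, 1.2], [0.4, 0.6]]   # invented feature vectors
winner, totals = identify_speaker(frames, models)
print(winner)
```

Summing log-likelihoods over frames treats the frames as independent, which is a common simplification rather than something stated in the talk.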
19Summary and Conclusions
- The screenplay contains a wealth of detail on film semantics and the intentions of the filmmakers
- The screenplay can be time-stamped and mined for salient objects (e.g. characters) and story descriptors
- Incomplete alignment can be used to create models of objects for further analysis