Tools for Sound, Speech, and Multi-modal Interaction - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Tools for Sound, Speech, and Multi-modal
Interaction
  • Johnny Lee
  • 05-830 Advanced UI Software

2
Sound
3
Sound
  • Authoring Tools
  • Recording, Playback
  • SFX libraries
  • Editing, Mixing
  • MIDI
  • Developer Tools
  • Software APIs
  • FFT libraries

4
Recording Sound
  • Most laptops have built-in mono microphones

(Schoeps)
5
Recording Sound
6
Recording Sound
7
Playing Sound
  • Most laptops have built-in speakers

8
Multichannel Audio
  • ProTools by Digidesign: up to 64 channels of
    24-bit, 48 kHz audio I/O

9
Multichannel Audio
10
(No Transcript)
11
Sound Libraries
  • Sound Ideas (http://www.sound-ideas.com/)
  • General 6000
  • Hanna-Barbera (http://gs304.sp.cs.cmu.edu/sfx/)
  • Lots of other smaller suppliers of stock sound
    libraries

12
Editing/Mixing Sounds
  • LogicAudio, SoundForge, Peak, SoundEdit16, many
    others.
  • Edits sound somewhat like a text editor edits
    text.
  • Sophisticated DSP (some real-time)
  • Synchronization with video and MIDI support

13
MIDI
  • Musical Instrument Digital Interface
  • Hardware communication layer
  • 5-pin DIN, uni-directional with pass-thru
  • Software protocol layer
  • MIDI commands are 2-3 bytes (see the note-on
    sketch after this list)
  • Note specification
  • Device configuration (128 controllers)
  • Device control/synchronization
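
As a rough, minimal sketch of how compact these commands are (the values below, channel 0, middle C, velocity 93, are arbitrary examples rather than anything from the slides), the standard javax.sound.midi classes can send a 3-byte note-on to the default synthesizer:

    import javax.sound.midi.*;

    public class MidiNoteOn {
        public static void main(String[] args) throws Exception {
            // Open the default synthesizer and get a Receiver to send messages to.
            Synthesizer synth = MidiSystem.getSynthesizer();
            synth.open();
            Receiver recv = synth.getReceiver();

            // A note-on is 3 bytes: status (NOTE_ON + channel), key number, velocity.
            ShortMessage noteOn = new ShortMessage();
            noteOn.setMessage(ShortMessage.NOTE_ON, 0, 60, 93);
            recv.send(noteOn, -1);                         // -1 = deliver immediately

            Thread.sleep(1000);                            // let the note sound

            ShortMessage noteOff = new ShortMessage();     // the matching 3-byte note-off
            noteOff.setMessage(ShortMessage.NOTE_OFF, 0, 60, 0);
            recv.send(noteOff, -1);
            synth.close();
        }
    }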

14
MIDI
  • Lots of general purpose fields
  • Simple electronics (2 resistors and a PIC
    processor)
  • Semi-popular option for simple control/robotics
    applications.

15
MOD files
  • File size can be tiny if a MIDI synthesizer is
    used at playback time.
  • Playback quality depends on the quality of the
    synthesizer (see the playback sketch after this
    list).
  • MOD files (module format) combine MIDI data with
    WAV samples to produce high quality consistent
    playback in a relatively small file.
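
For the MIDI half of that trade-off, a rough sketch using the standard javax.sound.midi Sequencer (the file name song.mid is hypothetical): the file stores only note and controller events, so it stays small, and whatever synthesizer is available at playback time determines how it actually sounds.

    import java.io.File;
    import javax.sound.midi.*;

    public class PlayMidi {
        public static void main(String[] args) throws Exception {
            // Load a Standard MIDI File and play it through the default synthesizer.
            Sequence seq = MidiSystem.getSequence(new File("song.mid"));
            Sequencer sequencer = MidiSystem.getSequencer();
            sequencer.open();
            sequencer.setSequence(seq);
            sequencer.start();

            // Poll until playback finishes, then release the device.
            while (sequencer.isRunning()) Thread.sleep(200);
            sequencer.close();
        }
    }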

16
(No Transcript)
17
Software APIs for sound
18
Microsoft DirectX 9.0
  • DirectX is:
  • DirectDraw: 2D drawing
  • Direct3D: 3D drawing
  • DirectInput: input/haptic devices
  • DirectPlay: network gaming
  • DirectShow: video streams
  • DirectSound: wave audio I/O
  • DirectMusic: soundtrack management and MIDI
  • DirectSetup: DirectX installation routines

19
DirectSound
  • WAV capture
  • Multi-channel sound playback
  • Full duplex
  • 3D specification of sound sources.
  • Some real-time DSP: Chorus, Compression, Flange,
    Distortion, Echo, Reverb

20
DirectMusic
  • Coordinates several sound files (MIDI, wav, etc.)
    into soundtracks.
  • Sequencing (timelines, cueing, and
    synchronization).
  • Supports dynamic composition, variation, and
    transitioning between songs/parts.
  • Dynamic content authored in DirectMusic Producer

21
DirectMusic
  • Compositions can be made with DLS (downloadable
    sound) files, a cross-platform smart audio
    file format designed for dynamic loading in
    interactive applications.
  • DLS = MIDI + WAV for interactive apps

22
MacOS X Core Audio
23
MacOS X Core Audio
  • Sound Manager: routines for resource management
    and playing/recording sound
  • AudioToolbox: sophisticated DSP architecture,
    sequencing/composition
  • MIDI Services: device abstraction, control, and
    patching
  • Audio HAL: medium-level I/O access (real-time,
    low-latency, multi-channel, floating-point access
    is standard)
  • IOKit: low-level device access
  • Drivers, Hardware - blarg
  • Full Java API provided

24
Java
  • Basic data structures and routines for loading,
    playing, and stopping sounds (see the Clip sketch
    after this list).
  • java.applet.AudioClip
  • javax.sound.midi
  • javax.sound.midi.spi
  • javax.sound.sampled
  • javax.sound.sampled.spi
  • I/O device access is somewhat limited.
  • I've been told that synchronization is a problem
    in Java.
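
A minimal playback sketch with javax.sound.sampled (the file name beep.wav is hypothetical; error handling omitted):

    import java.io.File;
    import javax.sound.sampled.*;

    public class PlayClip {
        public static void main(String[] args) throws Exception {
            // Decode the file, ask for a Clip that matches its format, and play it.
            AudioInputStream in = AudioSystem.getAudioInputStream(new File("beep.wav"));
            DataLine.Info info = new DataLine.Info(Clip.class, in.getFormat());
            Clip clip = (Clip) AudioSystem.getLine(info);
            clip.open(in);                 // loads the whole sound into memory
            clip.start();                  // playback runs on its own thread

            // start() returns immediately, so wait out the clip's length before exiting.
            Thread.sleep(clip.getMicrosecondLength() / 1000);
            clip.close();
        }
    }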

25
Voice as Sound
  • "Voice as Sound: Using Non-verbal Voice Input
    for Interactive Control." Takeo Igarashi, John F.
    Hughes. UIST 2001, pp. 155-156.
  • STFT, FFT analysis
  • Extension to SUITEKeys

26
Fourier Transform (FT)
  • Simple properties of a sound (duration, volume)
    can be obtained directly from the data file.
  • More interesting analysis requires some DSP,
    mainly the Fourier transform.

27
Fourier Transform
  • FT extracts the frequency content from a given
    segment of audio.

28
Fourier Transform
29
Fast Fourier Transform (FFT)
  • The FFT is a fast algorithm for computing the
    discrete Fourier transform (DFT); a naive DFT
    sketch follows this list.
  • Implementations available in most languages.
  • Good reference: Numerical Recipes in C
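
As a rough illustration of what the transform gives you (a naive O(n^2) DFT written out directly; a real application would use an FFT library routine as suggested above), the magnitude spectrum of a pure sine wave puts all of its energy in a single frequency bin:

    public class Dft {
        // Magnitude of each frequency bin (up to Nyquist) for a real-valued signal.
        static double[] magnitudes(double[] x) {
            int n = x.length;
            double[] mag = new double[n / 2];
            for (int k = 0; k < n / 2; k++) {
                double re = 0, im = 0;
                for (int t = 0; t < n; t++) {
                    double angle = 2 * Math.PI * k * t / n;
                    re += x[t] * Math.cos(angle);
                    im -= x[t] * Math.sin(angle);
                }
                mag[k] = Math.sqrt(re * re + im * im);
            }
            return mag;
        }

        public static void main(String[] args) {
            int n = 64;
            double[] x = new double[n];
            for (int t = 0; t < n; t++)
                x[t] = Math.sin(2 * Math.PI * 5 * t / n);   // 5 cycles per window
            double[] mag = magnitudes(x);
            for (int k = 0; k < mag.length; k++)
                System.out.printf("bin %2d: %6.2f%n", k, mag[k]);  // peak at bin 5
        }
    }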

30
Speech (spech)
31
Speech Synthesis
  • Three categories of speech synthesizers:
  • Articulatory synth - uses a physical model of the
    physiology of speech production and the physics
    of sound generation in the vocal apparatus
  • Formant synth - acoustic-phonetic approach to
    synthesis. Applies hundreds of filters loosely
    associated with the movement of articulators,
    using rules.
  • Concatenative synth - segmental database that
    reflects the major phonological features of a
    language. Creates smooth transitions and basic
    processing to match prosodic patterns
  • (http://cslu.cse.ogi.edu/HLTsurvey/ch5node4.html)

32
AT&T Natural Voices
  • US English, UK English, French, Spanish, German,
    Korean
  • Can build a new voice font from an existing
    person
  • Examples
  • Male Voice
  • Custom UK English
  • Voice Font
  • French

33
Phoenix Semantic Frame Parser
  • Center for Spoken Language Research, University
    of Colorado, Boulder
  • http://communicator.colorado.edu/phoenix/license.html
  • System for processing and parsing natural
    language

34
Phoenix
35
Phoenix
Details and syntax for creating frames and
networks: http://communicator.colorado.edu/phoenix/Phoenix_Manual.pdf
36
Universal Speech Interfaces
Universal Speech Interfaces. Ronald Rosenfeld,
Dan Olsen, Alex Rudnicky. Interactions, October
2001, Volume 8, Issue 6.
  • In essence, we attempt to do for speech what
    Palm's Graffiti has done for mobile text entry.
  • http://www-2.cs.cmu.edu/usi/USI-manifesto.htm
  • Speech is an ambient medium.
  • Speech is descriptive rather than referential.
  • Speech requires modest physical resources.
  • Only speech will scale as digital technology
    progresses.
  • 3 speech interaction techniques: Natural Language
    (NLI, NLP), Dialog Trees, Command and Control

37
(No Transcript)
38
Universal Speech Interfaces
  • Look and Feel -> Sound and Say
  • Universal Metaphors: familiar ways of doing
    things across applications.
  • Universal User Primitives: standard dialog
    interaction techniques; detecting and recovering
    from errors, asking for help, navigation, etc.
  • Universal Machine Primitives: standardize
    machine responses and meanings to increase user
    understanding.

39
Java Speech
  • JSAPI: Java Speech API
  • Speech Generation (a minimal synthesis sketch
    follows this list)
  • Structure Analysis: Java Speech Markup Language
    (JSML)
  • Text Pre-Processing: abbreviations, acronyms,
    numbers (e.g., 1998)
  • Text-to-Phoneme Conversion
  • Prosody Analysis
  • Waveform Production
  • Speech Recognition
  • Grammar Design - Java Speech Grammar Format
    (JSGF)
  • Signal Processing
  • Phoneme Recognition
  • Word Recognition
  • Result Generation
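
A minimal synthesis sketch against JSAPI 1.0. Assumptions worth flagging: JSAPI is only an interface specification, so this runs only if a JSAPI-compliant engine is installed and registered, and the spoken text is an arbitrary example.

    import java.util.Locale;
    import javax.speech.Central;
    import javax.speech.synthesis.Synthesizer;
    import javax.speech.synthesis.SynthesizerModeDesc;

    public class Speak {
        public static void main(String[] args) throws Exception {
            // Ask the JSAPI registry for any English synthesizer engine.
            Synthesizer synth = Central.createSynthesizer(
                    new SynthesizerModeDesc(Locale.ENGLISH));
            synth.allocate();                                  // acquire engine resources
            synth.resume();

            synth.speakPlainText("Hello from the Java Speech API", null);
            synth.waitEngineState(Synthesizer.QUEUE_EMPTY);    // block until spoken
            synth.deallocate();
        }
    }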

40
Windows .NET Speech SDK
  • Basically the .NET-ified SAPI 5.1 (Speech API)
  • Continuous Speech Recognition (US English,
    Japanese, and Simplified Chinese)
  • Concatenative Speech Synthesis (US English and
    Simplified Chinese)
  • Interface is broken into two components:
  • Application Programming Interface (API)
  • Device Driver Interface (DDI)

41
Windows .NET Speech SDK
  • Speech Synthesis API
  • ISpVoice::Speak("my text", voice)
  • Speech Synthesis DDI
  • Parses text into an XML doc
  • Calls the TTSEngine
  • Manages sound and threading details

42
Windows .NET Speech SDK
  • Speech Recognition API
  • Define context
  • Define grammar
  • Request type (dictation or command/control)
  • Event is fired when recognized
  • Speech Recognition DDI
  • Interfacing and configuring the SREngine
  • Manages sound and threading details.

43
Windows .NET Speech SDK
  • Speech Application Language Tags (SALT):
    extension to HTML for integrating speech into
    web pages
  • Speech Recognition Grammar Specification (SRGS):
    support for field parsing
  • Telephony Controls: interfaces with telephone
    technology to develop voice-only apps.

44
MacOS X Speech
  • Barely changed since 1996, MacInTalk 3
  • US English only
  • Full Java API
  • Speech Synthesis Manager (PlainTalk)
  • algorithmic voice generation
  • Speech Recognition Manager
  • OS wide push-to-talk Command/Control
  • Customizable vocabulary w/scripting
  • Uses Language Model grammar
  • No dictation support

45
Dragon Naturally Speaking
  • Commercial Recognition software
  • Dictation
  • Command and control
  • API available for developers for application
    integration
  • http://www.scansoft.com/naturallyspeaking/

46
Sphinx
  • Open source speech recognizer from CMU
    (http://fife.speech.cs.cmu.edu/sphinx/)
  • Auto-builds language model/grammar/vocabulary
    from example sentences
  • CMU-Cambridge Statistical Language Modeling
    Toolkit: semi-machine-learning algorithms for
    digesting a large example corpus into a usable
    model
  • Uses CMU Pronouncing Dictionary
  • SphinxTrain - builds new acoustic models
  • Audio recording, transcript, pronunciation
    dictionary/vocabulary, phoneme list

47
SUITEKeys
  • Manaris, B., McCauley, R., MacGyvers, V., "An
    Intelligent Interface for Keyboard and Mouse
    Control--Providing Full Access to PC
    Functionality via Speech", Proceedings of the 14th
    International Florida AI Research Symposium
    (www.cs.cofc.edu/manaris/)
  • Developed for individuals with motor
    disabilities.
  • Interface layer that generates keyboard and mouse
    events for the OS
  • Recognizes keyboard strokes/operations:
    backspace, function twelve, control-alt-delete,
    page down, press, release
  • Recognizes mouse buttons and movement:
    left-click, move down, stop, 2 units above
    clock, move to 5-18

48
Suede
Scott R. Klemmer, Anoop K. Sinha, Jack Chen,
James A. Landay, Nadeem Aboobaker, Annie Wang.
Proceedings of the 13th annual ACM symposium on
User Interface Software and Technology, November
2000
  • Wizard of Oz tool for prototyping speech
    interfaces
  • Allows the developer to quickly generate a state
    machine representing the possible paths through a
    speech interface and stores recorded system
    responses.
  • Operator simulates a functional system during
    evaluation by stepping through the state machine.
  • Runtime transcripts are recorded for later
    analysis.

49
(No Transcript)
50
Multimodal Interaction
51
Multimodal Interaction
  • According to Scott: "The term multi-modal
    interface usually refers to speech and
    something else, because speech alone wasn't good
    enough."
  • Though, it should probably mean more than one
    (simultaneous?) input modality
  • Point, click, gesture, type, speak, write, touch,
    look, bite, shake, think, sweat, etc. (lots of
    sensing techniques).

52
Multimodal Interaction
  • Lots of things have used them, but there are no
    real tools, or they weren't simultaneous.
  • Cohen, P.R., Cheyer, A., Wang, M., and Baeg, S.C.
    An open agent architecture. AAAI '94 Spring
    Symposium Series on Software Agents, AAAI (Menlo
    Park, CA, 1994); reprinted in Readings in Agents,
    Morgan Kaufmann, 1997, 197-204.
  • Brad Myers, Robert Malkin, Michael Bett, Alex
    Waibel, Ben Bostwick, Robert C. Miller, Jie Yang,
    Matthias Denecke, Edgar Seemann, Jie Zhu, Choon
    Hong Peck, Dave Kong, Jeffrey Nichols, Bill
    Scherlis. "Flexi-modal and Multi-Machine User
    Interfaces", IEEE Fourth International Conference
    on Multimodal Interfaces, Pittsburgh, PA,
    October 14-16, 2002, pp. 343-348.

53
Multimodal Interfaces
  • A common concept is "mode-ing" or modifying
    interaction.
  • Gives extra context for recognizers (e.g. point
    and speak)
  • Multiplies the functionality of an interaction
    (e.g. ToolStone, left/right/no click)
  • Rekimoto, J., Sciammarella, E. (2000).
    ToolStone: effective use of the physical
    manipulation vocabularies of input devices.
    Proceedings of the ACM Symposium on User
    Interface Software and Technology, pp. 109-117,
    November 2000
  • Also, a need for an input interpretation layer
    for widgets that can be specified in multiple
    ways.