
1
Modeling Music with Words: a multi-class naïve Bayes approach
  • Douglas Turnbull
  • Luke Barrington
  • Gert Lanckriet
  • Computer Audition Laboratory
  • UC San Diego
  • ISMIR 2006
  • October 11, 2006

Image from vintageguitars.org.uk
2
People use words to describe music
  • How would one describe I'm a Believer by The Monkees?
  • We might use words related to:
  • Genre: Pop, Rock, 60s
  • Instrumentation: tambourine, male vocals, electric piano
  • Adjectives: catchy, happy, energetic
  • Usage: getting ready to go out
  • Related Sounds: The Beatles, The Turtles, Lovin' Spoonful
  • We learn to associate certain words with the
    music we hear.

Image www.twang-tone.de/45kicks.html
3
Modeling music and words
  • Our goal is to design a statistical system that learns a relationship between music and words.
  • Given such a system, we can perform:
  • Annotation: given the audio content of a song, we can annotate the song with semantically meaningful words.
  • song → words
  • Retrieval: given a text-based query, we can retrieve relevant songs based on their audio content.
  • words → songs

Image from http://www.lacoctelera.com/
4
Modeling images and words
  • Content-based image annotation and retrieval has been a hot topic in recent years [CV05, FLM04, BJ03, BDF02, …].
  • This application has benefited from and inspired recent developments in machine learning.

How can MIR benefit from and inspire new
developments in machine learning?
Images from [CV05], www.oldies.com
5
Related work
  • Modeling music and words is at the heart of MIR research:
  • jointly modeling semantic labels and audio content
  • genre, emotion, style, and usage classification
  • music similarity analysis
  • Whitman et al. have produced a large body of work that is closely related to ours [Whi05, WE04, WR05].
  • Others have looked at joint models of words and sound effects.
  • Most focus on non-parametric models (kNN): SAR [Sla02], AudioClas [CK04]

Images from www.sixtiescity.com
6
Representing music and words
  • Consider a vocabulary and a heterogeneous data set of song-caption pairs:
  • Vocabulary - a predefined set of words
  • Song - a set of audio feature vectors (X = {x1, …, xT})
  • Caption - a binary document vector (y)
  • Example:
  • I'm a Believer by The Monkees is a happy pop song that features tambourine.
  • Given the vocabulary {pop, jazz, tambourine, saxophone, happy, sad}:
  • X = the set of MFCC vectors extracted from the audio track
  • y = [1, 0, 1, 0, 1, 0] (see the sketch below)
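
To make the representation concrete, here is a minimal Python sketch (illustrative only, not the authors' code) that turns the example caption into a binary document vector over the toy vocabulary:

```python
# Toy vocabulary and caption from the example above.
vocabulary = ["pop", "jazz", "tambourine", "saxophone", "happy", "sad"]
caption = ("I'm a Believer by The Monkees is a happy pop song "
           "that features tambourine.")

# Mark each vocabulary word that appears in the caption with a 1.
tokens = caption.lower().replace(".", "").split()
y = [1 if word in tokens else 0 for word in vocabulary]
print(y)  # -> [1, 0, 1, 0, 1, 0]
```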

Image from www.bluesforpeace.com
7
Overview of our system: Representation
[System diagram, "Data" and "Features" stages: the training data pairs each song with a caption and a vocabulary; captions become document vectors (y) and songs pass through audio-feature extraction (X)]
8
Probabilistic model for music and words
  • Consider a vocabulary and a set of song-caption pairs:
  • Vocabulary - a predefined set of words
  • Song - a set of audio feature vectors (X = {x1, …, xT})
  • Caption - a binary document vector (y)
  • For the i-th word in our vocabulary, we estimate a word distribution P(x | i):
  • a probability distribution over the audio feature vector space
  • modeled with a Gaussian Mixture Model (GMM)
  • the GMM is estimated using Expectation Maximization (EM)
  • Key idea: the training data for each word distribution is the set of all feature vectors from all songs that are labeled with that word.
  • Multiple-instance learning: this includes some irrelevant feature vectors
  • Weakly labeled data: this excludes some relevant feature vectors
  • Our probabilistic model is the set of word distributions (GMMs); see the sketch below.
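
A minimal sketch of this modeling step, assuming scikit-learn's GaussianMixture (which is fit with EM) and songs stored as NumPy arrays of feature vectors; the function and data layout are illustrative, not the authors' implementation:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_word_distributions(songs, captions, n_components=8):
    """Fit one GMM P(x | i) per vocabulary word.

    songs:    list of (T_s, d) arrays of audio feature vectors
    captions: list of binary document vectors, one per song
    Assumes every word labels at least one song.
    """
    n_words = len(captions[0])
    word_gmms = {}
    for i in range(n_words):
        # Key idea from the slide: pool every feature vector from
        # every song whose caption contains word i.
        X_i = np.vstack([X for X, y in zip(songs, captions) if y[i] == 1])
        word_gmms[i] = GaussianMixture(
            n_components=n_components, covariance_type="diag").fit(X_i)
    return word_gmms
```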

Image from www.freewebs.com
9
Overview of our system: Modeling
[System diagram, extended with a "Modeling" stage: the document vectors (y) and audio features (X) feed parameter estimation (the EM algorithm), producing the parametric model: a set of GMMs]
10
Overview of our system: Annotation
[System diagram, extended: a novel song passes through feature extraction and inference against the set of GMMs, producing a caption (annotation)]
11
Inference: Annotation
Given the word distributions P(x | i) and a query song (x1, …, xT), we annotate with the words i that maximize the posterior

  P(i | x1, …, xT) = P(x1, …, xT | i) P(i) / P(x1, …, xT)

Naïve Bayes assumption: we assume the feature vectors xs and xt are conditionally independent given i, so that

  P(x1, …, xT | i) = P(x1 | i) × … × P(xT | i)

Assuming a uniform prior P(i) and taking a log transform, we score each word by

  ∑t log P(xt | i)

Using this score, we annotate the query song with the top N words.
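
Continuing the sketch above, annotation reduces to summing per-frame log-likelihoods under each word's GMM and keeping the top N words (score_samples returns log P(xt | i) for each frame):

```python
def annotate(word_gmms, X, top_n=10):
    """Annotate a query song X, a (T, d) array of feature vectors.

    With the naive Bayes assumption and a uniform word prior,
    ranking words by sum_t log P(x_t | i) ranks them by posterior.
    """
    scores = {i: gmm.score_samples(X).sum() for i, gmm in word_gmms.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```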

www.cascadeblues.org
12
Overview of our system: Annotation
[Same system diagram as slide 10]
13
Overview of our system: Retrieval
[System diagram, extended: a text query drives inference against the set of GMMs to retrieve relevant songs (retrieval), alongside the annotation path]
14
Inference: Retrieval
  • We would like to rank test songs by the probability P(x1, …, xT | q) given a query word q.
  • Problem: this results in almost the same ranking for all query words.
  • There are two reasons:
  • Length bias
  • Longer songs have proportionately lower log-likelihoods, owing to the sum over additional log terms.
  • This results from the naïve Bayes assumption of conditional independence between audio feature vectors [RQD00].

Image from www.rockakademie-owl.de
15
Inference: Retrieval
  • We would like to rank test songs by the probability P(x1, …, xT | q) given a query word q.
  • Problem: this results in almost the same ranking for all query words.
  • There are two reasons:
  • Length bias
  • Song bias
  • Many conditional word distributions P(x | q) are similar to the generic song distribution P(x).
  • Songs with high probability under P(x) (i.e., generic songs) often have high probability under P(x | q).
  • Solution: rank by the posterior P(q | x1, …, xT) instead (see the sketch below).
  • Normalize P(x1, …, xT | q) by P(x1, …, xT).
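
A sketch of the normalized ranking, reusing the GMMs from the earlier sketches. The generic song model P(x), called generic_gmm here, is assumed to be a GMM fit on feature vectors pooled from all training songs; averaging over frames removes the length bias, and subtracting the generic score removes the song bias:

```python
def retrieve(word_gmms, generic_gmm, songs, q):
    """Rank songs (a list of (T, d) arrays) for query word q."""
    scores = []
    for X in songs:
        # Per-frame average of log P(x | q) - log P(x), i.e. the
        # likelihood under word q normalized by the generic model.
        scores.append(word_gmms[q].score_samples(X).mean()
                      - generic_gmm.score_samples(X).mean())
    # Song indices, most relevant first.
    return sorted(range(len(songs)), key=lambda s: scores[s], reverse=True)
```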

Image from www.rockakademie-owl.de
16
Overview of our system
[Complete system diagram: data and features feed EM parameter estimation to produce the set of GMMs; inference then maps a novel song to a caption (annotation) and a text query to a ranked list of songs (retrieval)]
17
Overview of our system: Evaluation
[System diagram, extended with an "Evaluation" stage that scores the annotation and retrieval outputs for novel songs]
18
Experimental Setup
  • Data: 2,131 song-review pairs
  • Audio: popular western music from the last 60 years
  • DMFCC feature vectors [MB03]
  • Each feature vector summarizes 3/4 of a second of audio content.
  • Each song is represented by between 320 and 1,920 feature vectors.
  • Text: song reviews from the AMG Allmusic database
  • We create a vocabulary of 317 musically relevant unigrams and bigrams.
  • A review is a natural-language document written by a music expert.
  • Each review is converted into a binary document vector.
  • 80% training set: used for parameter estimation
  • 20% testing set: used for model evaluation

Image from www.chrisbarber.net
19
Experimental Setup
  • Tasks:
  • Annotation: annotate each test song with 10 words
  • Retrieval: rank-order all test songs given a query word
  • Metrics: we adopt evaluation metrics developed for image annotation and retrieval [CV05].
  • Annotation:
  • mean per-word precision and recall
  • Retrieval:
  • mean average precision
  • mean area under the ROC curve (see the sketch below)
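
A sketch of the retrieval metrics, assuming scikit-learn; the relevance and score matrices are illustrative stand-ins for the test-set ground truth and the model's rankings:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def retrieval_metrics(relevance, scores):
    """Mean average precision and mean AROC over the vocabulary.

    relevance: (n_songs, n_words) binary ground-truth matrix
    scores:    (n_songs, n_words) model scores used for ranking
    """
    aps, arocs = [], []
    for i in range(relevance.shape[1]):
        y_true, y_score = relevance[:, i], scores[:, i]
        if y_true.min() == y_true.max():
            continue  # AROC is undefined for an all-0 or all-1 word
        aps.append(average_precision_score(y_true, y_score))
        arocs.append(roc_auc_score(y_true, y_score))
    return np.mean(aps), np.mean(arocs)
```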

Image from www.chrisbarber.net
20
Quantitative Results

              Annotation            Retrieval
              Recall    Precision   maPrec    AROC
  Our Model   .072      .119        .109      0.61
  Baseline    .032      .060        .072      0.50

  • Our model performs significantly better than random on all metrics.
  • one-sided paired t-test with α = 0.1 (see the sketch below)
  • recall and precision are bounded above by a value less than 1
  • AROC is perhaps the most intuitive metric
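
For illustration, the significance test might look like the following sketch; the per-word scores below are synthetic stand-ins, not the study's data (the alternative= keyword needs SciPy >= 1.6):

```python
import numpy as np
from scipy import stats

# Synthetic per-word AROC scores, one per vocabulary word.
rng = np.random.default_rng(0)
baseline_aroc = rng.normal(0.50, 0.05, size=317)
model_aroc = baseline_aroc + rng.normal(0.11, 0.05, size=317)

# One-sided paired t-test (H1: the model's mean score is greater).
t, p = stats.ttest_rel(model_aroc, baseline_aroc, alternative="greater")
print(f"t = {t:.2f}, p = {p:.4f}")  # significant at alpha = 0.1 if p < 0.1
```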

Image from sesentas.ururock.com
21
Discussion
  • 1. Music is inherently subjective.
  • Different people will use different words to describe the same song.
  • 2. We learn and evaluate using a very noisy text corpus.
  • Reviewers do not make explicit decisions about the relationships between individual words when reviewing a song.
  • "This song does not rock." (the word "rock" appears, yet does not describe the song)
  • Mining the web may not suffice.
  • Solution: manually label data (e.g., MoodLogic, Pandora).

Image from www.16-bits.com.ar
22
Discussion
  • 3. Our system performs much better when we annotate and retrieve sound effects:
  • BBC sound effects library
  • a more objective task
  • a cleaner text corpus
  • area under the ROC curve: 0.80 (compared with 0.61 for music)
  • 4. The best results for content-based image annotation and retrieval are comparable to our sound-effect results.

Image from www.16-bits.com.ar
23
"Talking about music is like dancing about architecture" - origins unknown
  • Please send your questions and comments to:
  • Douglas Turnbull - dturnbul@cs.ucsd.edu

Image from vintageguitars.org.uk
24
References

25
References