
1
Modeling Music with Words: a multi-class naïve Bayes approach
  • Douglas Turnbull
  • Luke Barrington
  • Gert Lanckriet
  • Computer Audition Laboratory
  • UC San Diego
  • ISMIR 2006
  • October 11, 2006

Image from vintageguitars.org.uk
2
People use words to describe music
  • How would one describe I'm a Believer by The Monkees?
  • We might use words related to:
  • Genre: Pop, Rock, 60s
  • Instrumentation: tambourine, male vocals, electric piano
  • Adjectives: catchy, happy, energetic
  • Usage: getting ready to go out
  • Related Sounds: The Beatles, The Turtles, Lovin' Spoonful
  • We learn to associate certain words with the
    music we hear.

Image www.twang-tone.de/45kicks.html
3
Modeling music and words
  • Our goal is to design a statistical system that learns a relationship between music and words.
  • Given such a system, we can perform:
  • Annotation: given the audio content of a song, we can annotate the song with semantically meaningful words.
  • song → words
  • Retrieval: given a text-based query, we can retrieve relevant songs based on their audio content.
  • words → songs

Image from http://www.lacoctelera.com/
4
Modeling images and words
  • Content-based image annotation and retrieval has been a hot topic in recent years [CV05, FLM04, BJ03, BDF02, …].
  • This application has benefited from and inspired recent developments in machine learning.

How can MIR benefit from and inspire new
developments in machine learning?
Images from [CV05], www.oldies.com
5
Related work
  • Modeling music and words is at the heart of MIR research:
  • jointly modeling semantic labels and audio content
  • genre, emotion, style, and usage classification
  • music similarity analysis
  • Whitman et al. have produced a large body of work that is closely related to ours [Whi05, WE04, WR05].
  • Others have looked at joint models of words and sound effects.
  • Most focus on non-parametric models (kNN): SAR [Sla02], AudioClas [CK04]

Images from www.sixtiescity.com
6
Representing music and words
  • Consider a vocabulary and a heterogeneous data set of song-caption pairs:
  • Vocabulary - a predefined set of words
  • Song - a set of audio feature vectors (X = {x1, …, xT})
  • Caption - a binary document vector (y)
  • Example:
  • I'm a Believer by The Monkees is a happy pop song that features tambourine.
  • Given the vocabulary {pop, jazz, tambourine, saxophone, happy, sad}:
  • X = the set of MFCC vectors extracted from the audio track
  • y = [1, 0, 1, 0, 1, 0] (see the sketch below)
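
To make the representation concrete, here is a minimal Python sketch (illustrative only, not the authors' code) that turns the example caption into a binary document vector over the toy vocabulary:

```python
# Toy vocabulary and caption from the example above.
vocabulary = ["pop", "jazz", "tambourine", "saxophone", "happy", "sad"]
caption = ("I'm a Believer by The Monkees is a happy pop song "
           "that features tambourine.")

# Mark each vocabulary word that appears in the caption with a 1.
tokens = caption.lower().replace(".", "").split()
y = [1 if word in tokens else 0 for word in vocabulary]
print(y)  # -> [1, 0, 1, 0, 1, 0]
```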

Image from www.bluesforpeace.com
7
Overview of our system: Representation
[System diagram, "Data" and "Features" stages: the training data pairs each song with a caption and a vocabulary; captions become document vectors (y) and songs pass through audio-feature extraction (X)]
8
Probabilistic model for music and words
  • Consider a vocabulary and a set of song-caption pairs:
  • Vocabulary - a predefined set of words
  • Song - a set of audio feature vectors (X = {x1, …, xT})
  • Caption - a binary document vector (y)
  • For the i-th word in our vocabulary, we estimate a word distribution P(x | i):
  • a probability distribution over the audio feature vector space
  • modeled with a Gaussian Mixture Model (GMM)
  • the GMM is estimated using Expectation Maximization (EM)
  • Key idea: the training data for each word distribution is the set of all feature vectors from all songs that are labeled with that word.
  • Multiple-instance learning: this includes some irrelevant feature vectors
  • Weakly labeled data: this excludes some relevant feature vectors
  • Our probabilistic model is the set of word distributions (GMMs); see the sketch below.
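
A minimal sketch of this modeling step, assuming scikit-learn's GaussianMixture (which is fit with EM) and songs stored as NumPy arrays of feature vectors; the function and data layout are illustrative, not the authors' implementation:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_word_distributions(songs, captions, n_components=8):
    """Fit one GMM P(x | i) per vocabulary word.

    songs:    list of (T_s, d) arrays of audio feature vectors
    captions: list of binary document vectors, one per song
    Assumes every word labels at least one song.
    """
    n_words = len(captions[0])
    word_gmms = {}
    for i in range(n_words):
        # Key idea from the slide: pool every feature vector from
        # every song whose caption contains word i.
        X_i = np.vstack([X for X, y in zip(songs, captions) if y[i] == 1])
        word_gmms[i] = GaussianMixture(
            n_components=n_components, covariance_type="diag").fit(X_i)
    return word_gmms
```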

Image from www.freewebs.com
9
Overview of our system: Modeling
[System diagram, extended with a "Modeling" stage: the document vectors (y) and audio features (X) feed parameter estimation (the EM algorithm), producing the parametric model: a set of GMMs]
10
Overview of our system: Annotation
[System diagram, extended: a novel song passes through feature extraction and inference against the set of GMMs, producing a caption (annotation)]
11
Inference: Annotation
Given the word distributions P(x | i) and a query song (x1, …, xT), we annotate with the words i that maximize the posterior

  P(i | x1, …, xT) = P(x1, …, xT | i) P(i) / P(x1, …, xT)

Naïve Bayes assumption: we assume the feature vectors xs and xt are conditionally independent given i, so that

  P(x1, …, xT | i) = P(x1 | i) × … × P(xT | i)

Assuming a uniform prior P(i) and taking a log transform, we score each word by

  ∑t log P(xt | i)

Using this score, we annotate the query song with the top N words.
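
Continuing the sketch above, annotation reduces to summing per-frame log-likelihoods under each word's GMM and keeping the top N words (score_samples returns log P(xt | i) for each frame):

```python
def annotate(word_gmms, X, top_n=10):
    """Annotate a query song X, a (T, d) array of feature vectors.

    With the naive Bayes assumption and a uniform word prior,
    ranking words by sum_t log P(x_t | i) ranks them by posterior.
    """
    scores = {i: gmm.score_samples(X).sum() for i, gmm in word_gmms.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```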

www.cascadeblues.org
12
Overview of our system: Annotation
[Same system diagram as slide 10]
13
Overview of our system: Retrieval
[System diagram, extended: a text query drives inference against the set of GMMs to retrieve relevant songs (retrieval), alongside the annotation path]
14
Inference: Retrieval
  • We would like to rank test songs by the probability P(x1, …, xT | q) given a query word q.
  • Problem: this results in almost the same ranking for all query words.
  • There are two reasons:
  • Length bias
  • Longer songs have proportionately lower log-likelihoods, owing to the sum over additional log terms.
  • This results from the naïve Bayes assumption of conditional independence between audio feature vectors [RQD00].

Image from www.rockakademie-owl.de
15
Inference: Retrieval
  • We would like to rank test songs by the probability P(x1, …, xT | q) given a query word q.
  • Problem: this results in almost the same ranking for all query words.
  • There are two reasons:
  • Length bias
  • Song bias
  • Many conditional word distributions P(x | q) are similar to the generic song distribution P(x).
  • Songs with high probability under P(x) (i.e., generic songs) often have high probability under P(x | q).
  • Solution: rank by the posterior P(q | x1, …, xT) instead (see the sketch below).
  • Normalize P(x1, …, xT | q) by P(x1, …, xT).
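
A sketch of the normalized ranking, reusing the GMMs from the earlier sketches. The generic song model P(x), called generic_gmm here, is assumed to be a GMM fit on feature vectors pooled from all training songs; averaging over frames removes the length bias, and subtracting the generic score removes the song bias:

```python
def retrieve(word_gmms, generic_gmm, songs, q):
    """Rank songs (a list of (T, d) arrays) for query word q."""
    scores = []
    for X in songs:
        # Per-frame average of log P(x | q) - log P(x), i.e. the
        # likelihood under word q normalized by the generic model.
        scores.append(word_gmms[q].score_samples(X).mean()
                      - generic_gmm.score_samples(X).mean())
    # Song indices, most relevant first.
    return sorted(range(len(songs)), key=lambda s: scores[s], reverse=True)
```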

Image from www.rockakademie-owl.de
16
Overview of our system
[Complete system diagram: data and features feed EM parameter estimation to produce the set of GMMs; inference then maps a novel song to a caption (annotation) and a text query to a ranked list of songs (retrieval)]
17
Overview of our system: Evaluation
[System diagram, extended with an "Evaluation" stage that scores the annotation and retrieval outputs for novel songs]
18
Experimental Setup
  • Data: 2,131 song-review pairs
  • Audio: popular western music from the last 60 years
  • DMFCC feature vectors [MB03]
  • Each feature vector summarizes 3/4 of a second of audio content.
  • Each song is represented by between 320 and 1,920 feature vectors.
  • Text: song reviews from the AMG Allmusic database
  • We create a vocabulary of 317 musically relevant unigrams and bigrams.
  • A review is a natural-language document written by a music expert.
  • Each review is converted into a binary document vector.
  • 80% training set: used for parameter estimation
  • 20% testing set: used for model evaluation

Image from www.chrisbarber.net
19
Experimental Setup
  • Tasks:
  • Annotation: annotate each test song with 10 words
  • Retrieval: rank-order all test songs given a query word
  • Metrics: we adopt evaluation metrics developed for image annotation and retrieval [CV05].
  • Annotation:
  • mean per-word precision and recall
  • Retrieval:
  • mean average precision
  • mean area under the ROC curve (see the sketch below)
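
A sketch of the retrieval metrics, assuming scikit-learn; the relevance and score matrices are illustrative stand-ins for the test-set ground truth and the model's rankings:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def retrieval_metrics(relevance, scores):
    """Mean average precision and mean AROC over the vocabulary.

    relevance: (n_songs, n_words) binary ground-truth matrix
    scores:    (n_songs, n_words) model scores used for ranking
    """
    aps, arocs = [], []
    for i in range(relevance.shape[1]):
        y_true, y_score = relevance[:, i], scores[:, i]
        if y_true.min() == y_true.max():
            continue  # AROC is undefined for an all-0 or all-1 word
        aps.append(average_precision_score(y_true, y_score))
        arocs.append(roc_auc_score(y_true, y_score))
    return np.mean(aps), np.mean(arocs)
```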

Image from www.chrisbarber.net
20
Quantitative Results

              Annotation            Retrieval
              Recall    Precision   maPrec    AROC
  Our Model   .072      .119        .109      0.61
  Baseline    .032      .060        .072      0.50

  • Our model performs significantly better than random on all metrics.
  • one-sided paired t-test with α = 0.1 (see the sketch below)
  • recall and precision are bounded above by a value less than 1
  • AROC is perhaps the most intuitive metric
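
For illustration, the significance test might look like the following sketch; the per-word scores below are synthetic stand-ins, not the study's data (the alternative= keyword needs SciPy >= 1.6):

```python
import numpy as np
from scipy import stats

# Synthetic per-word AROC scores, one per vocabulary word.
rng = np.random.default_rng(0)
baseline_aroc = rng.normal(0.50, 0.05, size=317)
model_aroc = baseline_aroc + rng.normal(0.11, 0.05, size=317)

# One-sided paired t-test (H1: the model's mean score is greater).
t, p = stats.ttest_rel(model_aroc, baseline_aroc, alternative="greater")
print(f"t = {t:.2f}, p = {p:.4f}")  # significant at alpha = 0.1 if p < 0.1
```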

Image from sesentas.ururock.com
21
Discussion
  • 1. Music is inherently subjective.
  • Different people will use different words to describe the same song.
  • 2. We learn and evaluate using a very noisy text corpus.
  • Reviewers do not make explicit decisions about the relationships between individual words when reviewing a song.
  • "This song does not rock." (the word "rock" appears, yet does not describe the song)
  • Mining the web may not suffice.
  • Solution: manually label data (e.g., MoodLogic, Pandora).

Image from www.16-bits.com.ar
22
Discussion
  • 3. Our system performs much better when we annotate and retrieve sound effects:
  • BBC sound effects library
  • a more objective task
  • a cleaner text corpus
  • area under the ROC curve: 0.80 (compared with 0.61 for music)
  • 4. The best results for content-based image annotation and retrieval are comparable to our sound-effect results.

Image from www.16-bits.com.ar
23
"Talking about music is like dancing about architecture" - origins unknown
  • Please send your questions and comments to:
  • Douglas Turnbull - dturnbul@cs.ucsd.edu

Image from vintageguitars.org.uk
24
References

25
References