Title: A Social Network Approach to Unsupervised Induction of Syntactic Clusters for Bengali
1A Social Network Approach to Unsupervised
Induction of Syntactic Clusters for Bengali
- Monojit Choudhury
- Microsoft Research India
- monojitc_at_microsoft.com
2Co-authors
- Niloy Ganguly, Joydeep Nath, Animesh Mukherjee (Indian Institute of Technology Kharagpur)
- Chris Biemann (University of Leipzig)
3Language: A Complex System
- Structure
- phones → words, words → phrases, phrases → sentences, sentences → discourse
- Function: communication through
- recursive syntax
- compositional semantics
- Dynamics
- Evolution
- Language change
4Computational Linguistics
- Study of language using computers
- Study of language-using computers
- Natural Language Processing
- Speech recognition
- Machine translation
- Automatic summarization
- Spell checkers, information retrieval and extraction, ...
5Labeling of Text
- Lexical Category (POS tags)
- Syntactic Category (Phrases, chunks)
- Semantic Role (agent, theme, ...)
- Sense
- Domain-dependent labeling (genes, proteins, ...)
- How to define the set of labels?
- How to (learn to) predict them automatically?
6Distributional Hypothesis
- "A word is characterized by the company it keeps" (Firth, 1957)
- Syntax: function words (Harris, 1968)
- Semantics: content words
7Outline
- Defining Context
- Syntactic Network of Words
- Complex Network Theory and Applications
- Chinese Whispers: Clustering the Network
- Experiments
- Topological Properties of the Networks
- Evaluation
- Future work
8Feature Words
- Estimate the unigram frequencies
- Feature words: the m most frequent words
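The selection step can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the corpus is a list of whitespace-tokenized sentences, and the function name `top_feature_words` and the toy corpus are hypothetical, not taken from the talk.

```python
from collections import Counter

def top_feature_words(sentences, m):
    """Return the m most frequent tokens (the feature words), most frequent first."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    return [word for word, _ in counts.most_common(m)]

# Toy corpus, for illustration only.
corpus = [
    "the sky is blue",
    "the blood is red",
    "the weight of the box",
]
feature_words = top_feature_words(corpus, m=2)
print(feature_words)
```

In the experiments reported later, m takes the values 25, 50, 100 and 200.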
9Feature Vector
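- Example sentence: "From the familiar to the exotic, the collection is a delight"
- One-hot context vectors for the word "familiar" at positions p-2, p-1, p1, p2 over the feature words:
       fw1   fw2   ...   fw199   fw200
       the   to    ...   is      from
p-2    0     0     ...   0       1
p-1    1     0     ...   0       0
p1     0     1     ...   0       0
p2     1     0     ...   0       0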
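A small sketch of how such a context vector might be built for one occurrence of a word. It assumes a window of two positions on each side and four stand-in feature words in place of the full list of 200. The function name `context_vector`, the lowercasing, and the dropped punctuation are assumptions made for the example.

```python
def context_vector(tokens, i, feature_words, window=2):
    """Concatenate one-hot vectors for positions i-window..i+window, skipping i itself."""
    index = {w: j for j, w in enumerate(feature_words)}
    vec = []
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        one_hot = [0] * len(feature_words)
        pos = i + offset
        # Positions outside the sentence, or holding a non-feature word, stay all-zero.
        if 0 <= pos < len(tokens) and tokens[pos] in index:
            one_hot[index[tokens[pos]]] = 1
        vec.extend(one_hot)
    return vec

# Stand-ins for fw1, fw2, fw199, fw200 (hypothetical short feature list).
feature_words = ["the", "to", "is", "from"]
tokens = "from the familiar to the exotic the collection is a delight".split()
vec = context_vector(tokens, tokens.index("familiar"), feature_words)
print(vec)
```

For a whole corpus, the per-occurrence vectors of a word would presumably be summed over all its occurrences. That aggregation step is an assumption and is not shown here.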
10Syntactic Network of Words
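[Figure: word network over the nodes color, sky, weight, light, blue, blood, heavy, red, with example edge weights 1, 20, 100; edge weight between red and blue = 1 / (1 - cos(red, blue))]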
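The cos(red, blue) label on this slide suggests that edges are weighted by the cosine similarity of the two words' context vectors. The sketch below assumes the weight takes the form 1 / (1 - cos), so that near-identical contexts get very large weights. That exact form, the function names, and the toy vectors are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def edge_weight(u, v):
    """Assumed weighting 1 / (1 - cos(u, v)); infinite when the directions coincide."""
    c = cosine(u, v)
    if c >= 1.0:
        return math.inf
    return 1.0 / (1.0 - c)

# Toy context vectors, for illustration only.
red = [1, 0]
blue = [1, 1]
sky = [0, 1]
print(edge_weight(red, blue), edge_weight(red, sky))
```

A network would then keep an edge only when its weight exceeds some threshold. The threshold value is not given on the slide.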
11The Chinese Whispers Algorithm
[Figure: example word network with nodes color, sky, weight, light, blue, blood, heavy, red and edge weights 0.9, 0.8, -0.5, 0.7, 0.9, 0.5]
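A compact sketch of Chinese Whispers as it is usually described: every node starts in its own class, then nodes are visited in random order and each adopts the class with the largest total edge weight among its neighbors. The edge list below reuses the words from the figure, but which pairs are connected, the fixed iteration count, and the choice to ignore non-positive weights are all assumptions made for the example.

```python
import random
from collections import defaultdict

def chinese_whispers(edges, iterations=20, seed=0):
    """Cluster an undirected weighted graph given as (u, v, weight) triples."""
    rng = random.Random(seed)
    neighbors = defaultdict(dict)
    for u, v, w in edges:
        if w > 0:  # assumption: non-positive weights are ignored
            neighbors[u][v] = w
            neighbors[v][u] = w
    nodes = sorted({n for u, v, _ in edges for n in (u, v)})
    label = {n: n for n in nodes}  # every node starts in its own class
    for _ in range(iterations):
        order = nodes[:]
        rng.shuffle(order)
        for node in order:
            scores = defaultdict(float)
            for nb, w in neighbors[node].items():
                scores[label[nb]] += w
            if scores:
                best = max(scores.values())
                # Ties between equally strong classes are broken at random.
                label[node] = rng.choice(sorted(c for c, s in scores.items() if s == best))
    return label

# Hypothetical edges over the words shown in the figure.
edges = [
    ("color", "sky", 0.9), ("weight", "blood", 0.8), ("color", "weight", 0.7),
    ("red", "blue", 0.9), ("light", "heavy", 0.5), ("blue", "light", 0.7),
    ("sky", "blue", -0.5),
]
labels = chinese_whispers(edges)
print(labels)
```

Because classes only spread along positive edges in this sketch, the noun-like and adjective-like words can never end up sharing a class here. Whether each group collapses into a single class depends on the random visiting order.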
14Experiments
- Corpus: Anandabazar Patrika (17M words)
- We build networks $G_{n,m}$
- n: corpus size (1M, 2M, 5M, 10M, 17M)
- m: number of feature words (25, 50, 100, 200)
- Number of nodes: 5000
- Number of edges: 150,000
15Topological Properties: Cumulative Degree Distribution
- Network: $G_{17M,50}$
- CDD: $P_k$ is the probability that a randomly chosen node has degree at least k
[Figure: $P_k$ vs. k]
- $P_k \propto -\log k$, hence $p_k = -\,dP_k/dk \propto 1/k$
- Zipfian distribution!
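A short sketch of how a cumulative degree distribution can be computed from an edge list. It reads P_k as the fraction of nodes with degree at least k, which is the reading consistent with p_k being its negative derivative. The function name and the toy graph are illustrative assumptions.

```python
from collections import Counter

def cumulative_degree_distribution(edges):
    """Return {k: P_k}, where P_k is the fraction of nodes with degree >= k."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n = len(degree)
    max_k = max(degree.values())
    return {
        k: sum(1 for d in degree.values() if d >= k) / n
        for k in range(1, max_k + 1)
    }

# Toy unweighted graph, for illustration only.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
cdd = cumulative_degree_distribution(edges)
print(cdd)
```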
16Topological Properties: Clustering Coefficient
- Measures transitivity of the network or, equivalently, the proportion of triangles
- Very small for random graphs, high for social networks
- Mean CC for $G_{17M,50}$: 0.53
[Figure: CC vs. degree]
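For reference, the standard unweighted local clustering coefficient can be sketched as below: for each node, the fraction of pairs of its neighbors that are themselves connected. Whether the talk used an unweighted or a weighted variant is not stated, so the unweighted form here is an assumption, as are the function name and the toy graph.

```python
from itertools import combinations

def clustering_coefficients(edges):
    """Local clustering coefficient per node of an undirected, unweighted graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    cc = {}
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            cc[node] = 0.0  # no neighbor pairs, so no triangles are possible
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        cc[node] = 2.0 * links / (k * (k - 1))
    return cc

# A triangle a-b-c with a pendant node d attached to c.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
cc = clustering_coefficients(edges)
mean_cc = sum(cc.values()) / len(cc)
print(cc, mean_cc)
```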
17Topological Properties: Cluster Size Distribution
[Figure: cluster size vs. rank, two panels: variation with n (m = 50) and variation with m (n = 17M)]
18Evaluation: Tag Entropy
- Word w with possible tags {t1, t6, t9}
- Tag_w = 1 0 0 0 0 1 0 0 1 0
- Cluster C = {w1, w2, w3, w4}
Tag_w1 = 1 0 0 0 0 1 0 0 1 0
Tag_w2 = 0 0 1 0 0 1 0 0 1 0
Tag_w3 = 0 0 0 0 0 1 0 0 1 0
Tag_w4 = 1 0 1 0 0 1 0 0 1 0
- Per-tag entropy over C: 1 0 1 0 0 0 0 0 0 0
- TE(C) = 2
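The slide does not spell out the formula for TE(C). One reading that reproduces its value of 2 for the four 10-bit tag vectors of the cluster is to sum, over tag positions, the binary entropy of the fraction of words carrying that tag. The sketch below implements that reading, and both the formula and the function name are assumptions.

```python
import math

def tag_entropy(tag_vectors):
    """Sum over tag positions of the binary entropy (in bits) of the fraction of words with that tag."""
    n = len(tag_vectors)
    total = 0.0
    for column in zip(*tag_vectors):
        p = sum(column) / n
        if 0.0 < p < 1.0:  # positions where all words agree contribute nothing
            total -= p * math.log2(p) + (1 - p) * math.log2(1 - p)
    return total

# The four tag vectors of the cluster shown on the slide.
cluster = [
    [1, 0, 0, 0, 0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0, 1, 0, 0, 1, 0],
]
te = tag_entropy(cluster)
print(te)
```

Only positions 1 and 3 are split (two words out of four each), so each contributes one bit and the total is 2.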
19Mean Tag Entropy
- MTE $= \frac{1}{N} \sum_{i=1}^{N} TE(C_i)$
- Weighted MTE (WMTE) $= \sum_{i} |C_i| \, TE(C_i) \,/\, \sum_{i} |C_i|$
- Caveat: putting every word in a separate cluster gives 0 MTE and WMTE
- Baseline: every word in a single cluster
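The two averages can be sketched as follows, taking per-cluster tag entropies and cluster sizes as given. Weighting by cluster size is the natural reading of the weighted formula but is still an assumption, and the numbers below are made up for illustration.

```python
def mean_tag_entropy(entropies):
    """Unweighted mean of the per-cluster tag entropies."""
    return sum(entropies) / len(entropies)

def weighted_mean_tag_entropy(entropies, sizes):
    """Mean of the per-cluster tag entropies weighted by cluster size."""
    return sum(s * te for te, s in zip(entropies, sizes)) / sum(sizes)

# Hypothetical tag entropies and sizes for three clusters.
entropies = [2.0, 0.0, 1.0]
sizes = [4, 1, 5]
mte = mean_tag_entropy(entropies)
wmte = weighted_mean_tag_entropy(entropies, sizes)
print(mte, wmte)
```

The weighted version keeps many tiny, trivially pure clusters from dominating the score, which relates to the caveat about singleton clusters.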
20Tag Entropy vs. Corpus Size
- m = 50
Reduction in Tag Entropy
        1M      2M      5M      10M     17M
MTE     74.49   75.14   76.09   78.29   74.94
WMTE    17.46   18.68   24.23   27.56   30.60
21The Bigger, the Worse!
[Figure: tag entropy vs. cluster size]
22Clusters
- Big ones → bad ones → mix of everything!
- Medium-sized clusters are good
- http://banglaposclusters.googlepages.com/home
Rank   Size   Type
5      596    Proper nouns, titles and posts
6      352    Possessive case of nouns (common, proper, verbal) and pronouns
8      133    Nouns (common, verbal) forming compounds with "do" or "be"
11     44     Number-classifier (e.g. 1-TA, ekaTA)
12     84     Adjectives
23More Observations
- Words are split into:
- First names vs. surnames
- Animate nouns-poss vs. inanimate nouns-poss
- Nouns-acc vs. nouns-poss vs. nouns-loc
- Verb-finite vs. verb-infinitive
- Syntactic or semantic?
- Nouns related to professions, months, days of the week, stars, players, etc.
24Advantages
- No labeled data required: a good solution to resource scarcity
- No prior class information: circumvents issues related to tag set definition
- Computational definition of class
- Understanding the structure of language (syntax) and its evolution
25Danke für Ihre Aufmerksamkeit.
Thank you for your attention
- Dieses ist vom Übersetzer übersetzt worden, der von Phasen Microsoft Beta ist.
- This has been translated by "Translator Beta" from Microsoft Live.
26Related Work
- Harris, 68: Distributional hypothesis for syntactic classes
- Miller and Charles, 91: Function words as features
- Finch and Chater, 92; Schütze, 93, 95; Clark, 00; Rapp, 05; Biemann, 06: The general technique
- Haghighi and Klein, 06; Goldwater and Griffiths, 07: Bayesian approach to unsupervised POS tagging
- Dasgupta and Ng, 07: Bengali POS induction through morphological features
27Medium and Low Frequency Words
- Neighboring (window 4) co-occurrences ranked by log-likelihood, thresholded by ?
- Two words are connected iff they share at least 4 neighbors
Language   English   Finnish   German
Nodes      52857     85627     137951
Edges      691241    702349    1493571
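The connection rule can be sketched as below. It assumes the set of significant neighbors of each word (after the log-likelihood ranking and thresholding) has already been computed, which is the part not shown. The function name and the toy neighbor sets are hypothetical.

```python
from itertools import combinations

def shared_neighbor_edges(neighbors, min_shared=4):
    """Connect two words iff their neighbor sets overlap in at least min_shared items."""
    edges = []
    for a, b in combinations(sorted(neighbors), 2):
        if len(neighbors[a] & neighbors[b]) >= min_shared:
            edges.append((a, b))
    return edges

# Toy significant-neighbor sets, for illustration only.
neighbors = {
    "apple": {"eat", "red", "juicy", "tree", "pie"},
    "pear": {"eat", "juicy", "tree", "pie", "green"},
    "car": {"drive", "red", "fast", "road"},
}
edges = shared_neighbor_edges(neighbors)
print(edges)
```

Comparing all pairs is quadratic in the number of words. For vocabularies of the sizes in the table, an inverted index from neighbor to words would likely be needed instead.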
28Construction of Lexicon
- Each word is assigned a unique tag based on the word class it belongs to
- Class 1: sky, color, blood, weight
- Class 2: red, blue, light, heavy
- Ambiguous words
- High and medium frequency words that formed singleton clusters
- Possible tags of neighboring clusters
29Training and Evaluation
- Unsupervised training of a trigram HMM using the clusters and lexicon
- Evaluation
- Tag a text for which a gold standard is available
- Estimate the conditional entropy $H(T|C)$ and the related perplexity $2^{H(T|C)}$
- Final results
- English: 2.05 (619/345), Finnish: 3.22 (625/466), German: 1.79 (781/440)
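The evaluation measure can be sketched as follows: given aligned gold tags T and induced cluster labels C for the same tokens, estimate the conditional entropy of T given C from relative frequencies and exponentiate it to get the perplexity. The function name and the four-token example are illustrative assumptions.

```python
import math
from collections import Counter

def conditional_entropy(gold_tags, clusters):
    """H(T|C) in bits from aligned per-token gold tags and cluster labels."""
    n = len(gold_tags)
    joint = Counter(zip(gold_tags, clusters))
    cluster_counts = Counter(clusters)
    h = 0.0
    for (tag, cluster), count in joint.items():
        # p(t, c) * log2 p(t | c), both estimated from counts
        h -= (count / n) * math.log2(count / cluster_counts[cluster])
    return h

# Toy tagging: cluster C1 is pure, cluster C2 mixes two gold tags.
gold = ["At", "NN", "At", "JJ"]
pred = ["C1", "C2", "C1", "C2"]
h = conditional_entropy(gold, pred)
perplexity = 2 ** h
print(h, perplexity)
```

A perplexity of 1 would mean the gold tag is fully determined by the induced cluster, and larger values mean more remaining ambiguity.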
30Example
"From the familiar to the exotic, the collection is a delight"
Word         POS tag   Cluster
From         Prep      C200
the          At        C1
familiar     JJ        C331
to           Prep      C5
the          At        C1
exotic       JJ        C331
the          At        C1
collection   NN        C221
is           V         C3
a            At        C1
delight      NN        C220