A Social Network Approach to Unsupervised Induction of Syntactic Clusters for Bengali

1
A Social Network Approach to Unsupervised
Induction of Syntactic Clusters for Bengali

Monojit Choudhury
Microsoft Research India
monojitc_at_microsoft.com

2
Co-authors
Niloy Ganguly, Joydeep Nath, Animesh Mukherjee (Indian Institute of Technology Kharagpur)
Chris Biemann (University of Leipzig)
3
Language: A Complex System

Structure
phones → words, words → phrases, phrase → sentence, sentence → discourse
Function: communication through
recursive syntax
compositional semantics
Dynamics
Evolution
Language change

4
Computational Linguistics

Study of language using computers
Study of language-using computers
Natural Language Processing
Speech recognition
Machine translation
Automatic summarization
Spell checkers, information retrieval and extraction, ...

5
Labeling of Text

Lexical Category (POS tags)
Syntactic Category (Phrases, chunks)
Semantic Role (agent, theme, ...)
Sense
Domain-dependent labeling (genes, proteins, ...)
How to define the set of labels?
How to (learn to) predict them automatically?

6
Distributional Hypothesis

"A word is characterized by the company it keeps" (Firth, 1957)
Syntax: function words (Harris, 1968)
Semantics: content words

7
Outline

Defining Context
Syntactic Network of Words
Complex Network Theory Applications
Chinese Whispers: Clustering the Network
Experiments
Topological Properties of the Networks
Evaluation
Future work

8
Feature Words

Estimate the unigram frequencies
Feature words: the m most frequent words
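
A minimal sketch of this selection step, assuming the corpus is already tokenized into a list of words. The function name `select_feature_words` and the toy token list are illustrative and not taken from the slides.

```python
from collections import Counter

def select_feature_words(tokens, m):
    """Return the m most frequent word types in a token list."""
    counts = Counter(tokens)                      # unigram frequencies
    return [word for word, _ in counts.most_common(m)]

# Toy input (hypothetical); a real run would use the full corpus.
tokens = "from the familiar to the exotic the collection is a delight".split()
feature_words = select_feature_words(tokens, m=1)
```

With m = 200, as on the next slide, the same call would return the 200 most frequent words of the corpus.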

9
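Feature Vector

Example sentence: "From the familiar to the exotic, the collection is a delight"

Context vector over the feature words fw1 ... fw200, one row per context position (the rows shown correspond to the word "familiar"):

       the   to          is      from
       fw1   fw2   ...   fw199   fw200
p-2    0     0     ...   0       1
p-1    1     0     ...   0       0
p+1    0     1     ...   0       0
p+2    1     0     ...   0       0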
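
The construction on this slide can be sketched as follows, assuming a window of two words on each side and binary entries. The function `context_vector` and the four-word feature list (standing in for fw1, fw2, fw199, fw200) are hypothetical; how occurrences of the same word are aggregated over the corpus is not shown on the slide and is left out here.

```python
def context_vector(tokens, index, feature_words, offsets=(-2, -1, 1, 2)):
    """Binary context vector for the token at `index`.

    The vector has one block of len(feature_words) entries per offset;
    an entry is 1 when that feature word occurs at that relative position.
    """
    slot = {word: i for i, word in enumerate(feature_words)}
    m = len(feature_words)
    vec = [0] * (len(offsets) * m)
    for block, offset in enumerate(offsets):
        pos = index + offset
        if 0 <= pos < len(tokens) and tokens[pos] in slot:
            vec[block * m + slot[tokens[pos]]] = 1
    return vec

tokens = "from the familiar to the exotic the collection is a delight".split()
feature_words = ["the", "to", "is", "from"]   # stand-ins for fw1, fw2, fw199, fw200
vec = context_vector(tokens, tokens.index("familiar"), feature_words)
```

For "familiar", the four blocks of `vec` are 0 0 0 1, 1 0 0 0, 0 1 0 0 and 1 0 0 0, matching the rows on the slide.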
10
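Syntactic Network of Words

[Figure: weighted network over the words color, sky, weight, light, blue, blood, heavy, red, with example edge weights 1, 20 and 100]
Edge weight: $\frac{1}{1 - \cos(\text{red}, \text{blue})}$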
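
The edge-weight expression on this slide reads most naturally as 1 / (1 - cos(red, blue)), where cos is the cosine similarity of the two words' feature vectors. The sketch below works under that assumption; the vectors are made-up values, and the small `eps` guard against division by zero is an addition that does not come from the slides.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def edge_weight(u, v, eps=1e-9):
    """1 / (1 - cos(u, v)); eps keeps identical vectors from dividing by zero."""
    return 1.0 / max(1.0 - cosine(u, v), eps)

# Hypothetical feature vectors, for illustration only.
red = [3, 0, 1, 2]
blue = [2, 0, 1, 2]
sky = [0, 4, 0, 1]
```

Under this reading, orthogonal vectors get weight 1 and the weight grows without bound as two words' contexts become more alike, which would fit the spread of weights (1, 20, 100) in the figure.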
11
The Chinese Whispers Algorithm

[Figure: network over the words color, sky, weight, light, blue, blood, heavy, red, with edge weights 0.9, 0.8, -0.5, 0.7, 0.9 and 0.5]
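
The slides show the algorithm only as an animation, so here is a minimal sketch of Chinese Whispers as it is usually described: every node starts with its own label, then nodes are visited in random order and each adopts the label with the largest total edge weight among its neighbors. The edge list is a made-up toy graph rather than the one in the figure, and the alphabetical tie-break and fixed random seed are choices made here for repeatability.

```python
import random
from collections import defaultdict

def chinese_whispers(edges, max_iterations=20, seed=0):
    """Cluster an undirected weighted graph given as (u, v, weight) triples."""
    neighbors = defaultdict(dict)
    for u, v, w in edges:
        neighbors[u][v] = w
        neighbors[v][u] = w

    labels = {node: node for node in neighbors}   # every node starts alone
    nodes = sorted(neighbors)
    rng = random.Random(seed)

    for _ in range(max_iterations):
        rng.shuffle(nodes)
        changed = False
        for node in nodes:
            # Total edge weight voting for each label in the neighborhood.
            votes = defaultdict(float)
            for other, w in neighbors[node].items():
                votes[labels[other]] += w
            # Ties go to the alphabetically smallest label (a choice made here).
            best = max(sorted(votes), key=votes.get)
            if votes[best] > 0 and best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels

# Hypothetical toy graph: two tight groups joined by one weak edge.
example_edges = [
    ("sky", "color", 0.9), ("color", "blood", 0.9), ("sky", "blood", 0.9),
    ("red", "blue", 0.9), ("blue", "light", 0.9), ("red", "light", 0.9),
    ("sky", "blue", 0.1),
]
labels = chinese_whispers(example_edges)
```

On this toy graph the three nouns end up sharing one label and the three adjectives another. The number of clusters is not given in advance; it falls out of the propagation.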
14
Experiments

Corpus: Anandabazaar Patrika (17M words)
We build networks $G_{n,m}$
n: corpus size (1M, 2M, 5M, 10M, 17M)
m: number of feature words (25, 50, 100, 200)
Number of nodes: 5000
Number of edges: 150,000

15
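Topological Properties: Cumulative Degree Distribution

[Figure: cumulative degree distribution of $G_{17M,50}$, $P_k$ vs. $k$]
CDD: $P_k$ is the probability that a randomly chosen node has degree $\geq k$
$P_k \propto -\log(k) \;\Rightarrow\; p_k = -dP_k/dk \propto 1/k$
Zipfian distribution!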
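
A small sketch of how a cumulative degree distribution can be computed from a list of node degrees, taking P_k in the usual cumulative sense (the fraction of nodes with degree at least k). The degree list is invented for illustration.

```python
from collections import Counter

def cumulative_degree_distribution(degrees):
    """Map each observed degree k to P_k, the fraction of nodes with degree >= k."""
    n = len(degrees)
    counts = Counter(degrees)
    cdd = {}
    remaining = n                 # nodes with degree >= the current k
    for k in sorted(counts):
        cdd[k] = remaining / n
        remaining -= counts[k]
    return cdd

degrees = [1, 1, 2, 3, 3, 3, 5]   # hypothetical degrees of seven nodes
cdd = cumulative_degree_distribution(degrees)
```

Plotting `cdd` against log k and checking for a straight line would be one way to test the logarithmic form on the slide.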
16
Topological Properties: Clustering Coefficient

Measures transitivity of the network or, equivalently, the proportion of triangles
Very small for random graphs, high for social networks
Mean CC for $G_{17M,50}$: 0.53

[Figure: CC vs. degree]
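
For reference, a sketch of the standard unweighted clustering coefficient: for each node, the fraction of pairs of its neighbors that are themselves connected. Whether the 0.53 on the slide uses this variant or a weighted one is not stated, so this is an assumption; the adjacency map is a toy graph.

```python
from itertools import combinations

def local_clustering(adjacency, node):
    """Fraction of neighbor pairs of `node` that are linked to each other."""
    neighbors = adjacency[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adjacency[a])
    return 2.0 * links / (k * (k - 1))

def mean_clustering(adjacency):
    """Average of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adjacency, n) for n in adjacency) / len(adjacency)

# Hypothetical graph: a triangle a-b-c plus a pendant node d attached to a.
adjacency = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
mean_cc = mean_clustering(adjacency)
```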
17
Topological Properties: Cluster Size Distribution

[Figure: cluster size vs. rank, two panels: variation with n (m = 50) and variation with m (n = 17M)]
18
Evaluation: Tag Entropy

Word w with tags t1, t6, t9: Tag_w = 1 0 0 0 0 1 0 0 1 0
Cluster C = {w1, w2, w3, w4}

w1    1 0 0 0 0 1 0 0 1 0
w2    0 0 1 0 0 1 0 0 1 0
w3    0 0 0 0 0 1 0 0 1 0
w4    1 0 1 0 0 1 0 0 1 0

Tags on which the words disagree: 1 0 1 0 0 0 0 0 0 0
TE(C) = 2
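
The slide gives the example but not the formula. One definition that reproduces the value 2 is the sum, over tags, of the binary entropy of the fraction of cluster words carrying that tag. The sketch below uses that definition as an assumption; it may not be the one the authors used.

```python
import math

def tag_entropy(tag_vectors):
    """Sum over tags of the binary entropy of the tag's frequency in the cluster."""
    n = len(tag_vectors)
    total = 0.0
    for column in zip(*tag_vectors):          # one column per tag
        p = sum(column) / n                   # fraction of words with this tag
        if 0.0 < p < 1.0:                     # p = 0 or 1 contributes nothing
            total -= p * math.log2(p) + (1 - p) * math.log2(1 - p)
    return total

# The four binary tag vectors from the slide's example cluster.
cluster = [
    [1, 0, 0, 0, 0, 1, 0, 0, 1, 0],  # w1
    [0, 0, 1, 0, 0, 1, 0, 0, 1, 0],  # w2
    [0, 0, 0, 0, 0, 1, 0, 0, 1, 0],  # w3
    [1, 0, 1, 0, 0, 1, 0, 0, 1, 0],  # w4
]
te = tag_entropy(cluster)
```

Tags t1 and t3 are each carried by half of the words and contribute one bit apiece; every other tag is carried by all or none of the words and contributes nothing, so `te` comes out as 2.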
19
Mean Tag Entropy

$MTE = \frac{1}{N} \sum_i TE(C_i)$
Weighted MTE: $WMTE = \frac{\sum_i |C_i| \, TE(C_i)}{\sum_i |C_i|}$
Caveat: with every word in a separate cluster, MTE and WMTE are both 0
Baseline: every word in a single cluster
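
A short sketch of the two averages, reading the weighted version as an average of cluster tag entropies weighted by cluster size. Each cluster is represented as a hypothetical (size, tag entropy) pair; the numbers are illustrative.

```python
def mean_tag_entropy(clusters):
    """Unweighted mean of TE over clusters; clusters is a list of (size, te) pairs."""
    return sum(te for _, te in clusters) / len(clusters)

def weighted_mean_tag_entropy(clusters):
    """Mean of TE weighted by cluster size."""
    total_size = sum(size for size, _ in clusters)
    return sum(size * te for size, te in clusters) / total_size

clusters = [(4, 2.0), (1, 0.0)]           # one mixed cluster, one singleton
mte = mean_tag_entropy(clusters)
wmte = weighted_mean_tag_entropy(clusters)

singletons = [(1, 0.0)] * 5               # the caveat: every word on its own
```

The singleton case shows why a low score alone says little: both measures are 0 there even though nothing has been grouped.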

20
Tag Entropy vs. Corpus Size (m = 50)

Reduction in Tag Entropy
        1M      2M      5M      10M     17M
MTE     74.49   75.14   76.09   78.29   74.94
WMTE    17.46   18.68   24.23   27.56   30.60
21
The Bigger, the Worse!

[Figure: tag entropy vs. cluster size]
22
Clusters

Big ones → bad ones → mix of everything!
Medium-sized clusters are good
http://banglaposclusters.googlepages.com/home

Rank   Size   Type
5      596    Proper nouns, titles and posts
6      352    Possessive case of nouns (common, proper, verbal) and pronouns
8      133    Nouns (common, verbal) forming compounds with "do" or "be"
11     44     Number-classifier (e.g. 1-TA, ekaTA)
12     84     Adjectives
23
More Observations

Words are split into
First name vs. Surnames
Animate nouns-poss vs. Inanimate noun-poss
Nouns-acc vs. Nouns-poss vs. Nouns-loc
Verb-finite vs. Verb-infinitive
Syntactic or semantic?
Nouns related to professions, months, days of
week, stars, players etc.

24
Advantages

No labeled data required: a good solution to resource scarcity
No prior class information: circumvents issues related to tag set definition
Computational definition of class
Understanding the structure of language (syntax) and its evolution

25
Danke für Ihre Aufmerksamkeit.
Thank you for your attention

Dieses ist vom Übersetzer übersetzt worden, der
von Phasen Microsoft Beta ist.

This has been translated by "Translator Beta"
from Microsoft Live.
26
Related Work

Harris, 68: Distributional hypothesis for syntactic classes
Miller and Charles, 91: Function words as features
Finch and Chater, 92; Schütze, 93, 95; Clark, 00; Rapp, 05; Biemann, 06: The general technique
Haghighi and Klein, 06; Goldwater and Griffiths, 07: Bayesian approach to unsupervised POS tagging
Dasgupta and Ng, 07: Bengali POS induction through morphological features

27
Medium and Low Frequency Words

Neighboring (window 4) co-occurrences ranked by
log-likelihood thresholded by ?
Two words are connected iff they share at least 4
neighbors

Language   English   Finnish   German
Nodes      52857     85627     137951
Edges      691241    702349    1493571
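
The connection rule on this slide (two words are linked iff they share at least 4 neighbors) can be sketched as below. The sketch assumes each word's set of significant neighbors has already been computed and thresholded; the function name and the toy sets are hypothetical, and the all-pairs loop is only meant for small inputs.

```python
from itertools import combinations

def shared_neighbor_edges(neighbor_sets, min_shared=4):
    """Connect two words iff their neighbor sets share at least min_shared items."""
    edges = {}
    for a, b in combinations(sorted(neighbor_sets), 2):
        shared = len(neighbor_sets[a] & neighbor_sets[b])
        if shared >= min_shared:
            edges[(a, b)] = shared    # store the overlap size with the edge
    return edges

# Hypothetical significant-neighbor sets, for illustration only.
neighbor_sets = {
    "cat": {"the", "a", "black", "sat", "my"},
    "dog": {"the", "a", "black", "my", "barked"},
    "run": {"to", "will", "fast"},
}
edges = shared_neighbor_edges(neighbor_sets)
```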
28
Construction of Lexicon

Each word is assigned a unique tag based on the word class it belongs to
Class 1: sky, color, blood, weight
Class 2: red, blue, light, heavy
Ambiguous words
High and medium frequency words that formed singleton clusters
Possible tags of neighboring clusters

29
Training and Evaluation

Unsupervised training of trigram HMM using the clusters and lexicon
Evaluation
Tag a text for which a gold standard is available
Estimate the conditional entropy $H(T|C)$ and the related perplexity $2^{H(T|C)}$
Final Results
English: 2.05 (619/345), Finnish: 3.22 (625/466), German: 1.79 (781/440)
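
A minimal sketch of the evaluation measure, assuming it is the empirical conditional entropy of the gold tag T given the induced cluster C, computed from token-level (tag, cluster) pairs, with perplexity 2^H. The input is the tagged sentence from the Example slide of this deck; the function name is illustrative.

```python
import math
from collections import Counter

def conditional_entropy(pairs):
    """H(T|C) in bits from (gold_tag, cluster) pairs, one pair per token."""
    pairs = list(pairs)
    n = len(pairs)
    joint = Counter(pairs)
    cluster_counts = Counter(cluster for _, cluster in pairs)
    h = 0.0
    for (tag, cluster), count in joint.items():
        # p(t, c) * log2 p(t | c), summed with a minus sign
        h -= (count / n) * math.log2(count / cluster_counts[cluster])
    return h

gold = "Prep At JJ Prep At JJ At NN V At NN".split()
clusters = "C200 C1 C331 C5 C1 C331 C1 C221 C3 C1 C220".split()
h = conditional_entropy(zip(gold, clusters))
perplexity = 2 ** h
```

In this one-sentence example every cluster maps to a single gold tag, so the entropy is 0 and the perplexity is 1. On a real test corpus, a cluster that mixes tags pushes the perplexity above 1, as in the final results listed on the slide.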

30
Example
Word      From   the   familiar   to     the   exotic   the   collection   is   a    delight
Tag       Prep   At    JJ         Prep   At    JJ       At    NN           V    At   NN
Cluster   C200   C1    C331       C5     C1    C331     C1    C221         C3   C1   C220
