A Social Network Approach to Unsupervised Induction of Syntactic Clusters for Bengali

1
A Social Network Approach to Unsupervised
Induction of Syntactic Clusters for Bengali
  • Monojit Choudhury
  • Microsoft Research India
  • monojitc_at_microsoft.com

2
Co-authors
  • Chris Biemann, University of Leipzig
  • Niloy Ganguly, Joydeep Nath, Animesh Mukherjee, Indian Institute of Technology Kharagpur

3
Language: A Complex System
  • Structure
  • phones → words, words → phrases, phrase →
    sentence, sentence → discourse
  • Function: Communication through
  • recursive syntax
  • compositional semantics
  • Dynamics
  • Evolution
  • Language change

4
Computational Linguistics
  • Study of language using computers
  • Study of language-using computers
  • Natural Language Processing
  • Speech recognition
  • Machine translation
  • Automatic summarization
  • Spell checkers, information retrieval and
    extraction, ...

5
Labeling of Text
  • Lexical Category (POS tags)
  • Syntactic Category (Phrases, chunks)
  • Semantic Role (Agent, theme, ...)
  • Sense
  • Domain dependent labeling (genes, proteins, ...)
  • How to define the set of labels?
  • How to (learn to) predict them automatically?

6
Distributional Hypothesis
  • "A word is characterized by the company it keeps"
    (Firth, 1957)
  • Syntax: function words (Harris, 1968)
  • Semantics: content words

7
Outline
  • Defining Context
  • Syntactic Network of Words
  • Complex Network Theory Applications
  • Chinese Whispers: Clustering the Network
  • Experiments
  • Topological Properties of the Networks
  • Evaluation
  • Future work

8
Feature Words
  • Estimate the unigram frequencies
  • Feature words: the m most frequent words
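
The two steps above can be sketched in a few lines of Python. This is only an illustration under simple assumptions (lowercased, whitespace-tokenized text); the function name `top_feature_words` is hypothetical and does not come from the slides.

```python
from collections import Counter

def top_feature_words(tokens, m):
    """Return the m most frequent tokens, used here as 'feature words'."""
    counts = Counter(tokens)  # unigram frequencies
    return [word for word, _ in counts.most_common(m)]

# Toy corpus: the example sentence used later in the talk.
tokens = "from the familiar to the exotic the collection is a delight".split()
feature_words = top_feature_words(tokens, m=1)
print(feature_words)  # ['the']
```

On a real corpus, m would be one of the values used in the experiments (25, 50, 100, 200).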

9
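Feature Vector
  • "From the familiar to the exotic, the collection
    is a delight"
  • Context of the word "familiar", one block of feature-word indicators per position:

Position  the (fw1)  to (fw2)  ...  is (fw199)  from (fw200)
p-2       0          0         ...  0           1
p-1       1          0         ...  0           0
p1        0          1         ...  0           0
p2        1          0         ...  0           0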
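
A minimal sketch of how such a context vector could be assembled for one occurrence of a word, assuming a window of two positions on each side and one one-hot block per position. The function name `context_vector` and the four-word feature list are illustrative stand-ins (the slide uses 200 feature words), and in practice the vectors would presumably be accumulated over all occurrences of the target word.

```python
def context_vector(tokens, index, feature_words, offsets=(-2, -1, 1, 2)):
    """Concatenate one indicator block per context position around tokens[index]."""
    vector = []
    for offset in offsets:
        pos = index + offset
        # Positions outside the sentence contribute an all-zero block.
        neighbor = tokens[pos] if 0 <= pos < len(tokens) else None
        vector.extend(1 if neighbor == fw else 0 for fw in feature_words)
    return vector

tokens = "from the familiar to the exotic the collection is a delight".split()
feature_words = ["the", "to", "is", "from"]  # stand-ins for fw1, fw2, fw199, fw200
vec = context_vector(tokens, tokens.index("familiar"), feature_words)
print(vec)  # [0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
```

The four blocks of the printed vector correspond to positions p-2, p-1, p1 and p2 around "familiar".
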
10
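Syntactic Network of Words
[Figure: weighted word network with nodes color, sky, weight, light, blue, blood, heavy, red; example edge weights 1, 20, 100]
Edge weight between red and blue: $1 / (1 - \cos(\mathrm{red}, \mathrm{blue}))$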
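
As a rough illustration of the edge weighting, the sketch below computes the cosine between two feature vectors and turns it into a weight, assuming the weight is 1 / (1 - cos(u, v)) as the slide's formula appears to read. The function names and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def edge_weight(u, v):
    """Assumed weighting 1 / (1 - cos); unbounded when the vectors are parallel."""
    c = cosine(u, v)
    if c >= 1.0:
        return math.inf
    return 1.0 / (1.0 - c)

print(edge_weight([1, 0], [0, 1]))  # 1.0 (orthogonal contexts)
print(edge_weight([2, 0], [1, 0]))  # inf (identical direction)
```

Under this weighting, words with nearly identical context distributions get very large weights, which is consistent with the values 20 and 100 shown in the figure.
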
11-13
The Chinese Whispers Algorithm
[Figure, repeated over three slides: word network with nodes color, sky, weight, light, blue, blood, heavy, red; edge weights 0.9, 0.8, -0.5, 0.7, 0.9, 0.5]
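
The slides show the algorithm only as an animation. The sketch below is a simplified Python version of Chinese Whispers as it is commonly described: every node starts in its own class and then repeatedly adopts the class with the highest total edge weight among its neighbors. Two things are assumptions made for this example: ties are broken deterministically (the usual formulation breaks them at random), and the toy graph and its weights are invented for illustration.

```python
import random
from collections import defaultdict

def chinese_whispers(edges, iterations=20, seed=0):
    """Cluster an undirected weighted graph given as (u, v, weight) triples."""
    graph = defaultdict(dict)
    for u, v, w in edges:
        graph[u][v] = w
        graph[v][u] = w
    labels = {node: node for node in graph}  # each node starts in its own class
    rng = random.Random(seed)
    nodes = sorted(graph)
    for _ in range(iterations):
        rng.shuffle(nodes)  # visit nodes in random order
        changed = False
        for node in nodes:
            scores = defaultdict(float)
            for neighbor, w in graph[node].items():
                scores[labels[neighbor]] += w
            # Strongest class wins; ties go to the larger label (a simplification).
            best = max(scores, key=lambda label: (scores[label], label))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels

# Hypothetical toy graph: two tight groups joined by one weak edge.
edges = [
    ("sky", "color", 1.0), ("sky", "blood", 1.0), ("color", "blood", 1.0),
    ("red", "blue", 1.0), ("red", "heavy", 1.0), ("blue", "heavy", 1.0),
    ("blood", "red", 0.1),
]
labels = chinese_whispers(edges)
print(labels)
```

The algorithm needs no preset number of clusters, which fits the goal of inducing classes without a predefined tag set.
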
14
Experiments
  • Corpus: Anandabazaar Patrika (17M words)
  • We build networks G_{n,m}
  • n: corpus size (1M, 2M, 5M, 10M, 17M)
  • m: number of feature words (25, 50, 100, 200)
  • Number of nodes: 5000
  • Number of edges: 150,000

15
Topological Properties: Cumulative Degree Distribution
[Figure: P_k vs. k for G_{17M,50}]
CDD: P_k is the probability that a randomly chosen node has degree at least k
$P_k \propto -\log(k)$, so $p_k = -\,dP_k/dk \propto 1/k$
Zipfian distribution!
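
For readers who want to reproduce such a plot, here is a small sketch of how a cumulative degree distribution can be computed from a list of node degrees. It assumes the cumulative convention (fraction of nodes with degree at least k); the function name is hypothetical.

```python
from collections import Counter

def cumulative_degree_distribution(degrees):
    """Map each observed degree k to the fraction of nodes with degree >= k."""
    n = len(degrees)
    counts = Counter(degrees)
    cdd = {}
    remaining = n  # number of nodes with degree >= current k
    for k in sorted(counts):
        cdd[k] = remaining / n
        remaining -= counts[k]
    return cdd

print(cumulative_degree_distribution([1, 1, 2, 3]))
# {1: 1.0, 2: 0.5, 3: 0.25}
```

Plotting these values against log(k) would give a roughly straight line if the logarithmic relation above holds.
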
16
Topological Properties: Clustering Coefficient
  • Measures transitivity of the network or
    equivalently the proportion of triangles
  • Very small for random graphs, high for social
    networks
  • Mean CC for G_{17M,50}: 0.53

[Figure: CC vs. degree]
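
A minimal sketch of the local and mean clustering coefficient for an unweighted, undirected graph stored as adjacency sets. Ignoring edge weights is a simplification made here; the slides do not say whether the reported value uses weights.

```python
from itertools import combinations

def local_clustering(adj, node):
    """Fraction of neighbor pairs of `node` that are themselves connected."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return links / (k * (k - 1) / 2)

def mean_clustering(adj):
    """Average of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, node) for node in adj) / len(adj)

triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
star = {"hub": {"x", "y", "z"}, "x": {"hub"}, "y": {"hub"}, "z": {"hub"}}
print(mean_clustering(triangle))  # 1.0
print(mean_clustering(star))      # 0.0
```
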
17
Topological Properties: Cluster Size Distribution
[Figure, two panels: cluster size vs. rank. Panels: variation with n (m = 50); variation with m (n = 17M)]

18
Evaluation: Tag Entropy
  • Word w with tags t1, t6, t9
  • Tag_w = (1 0 0 0 0 1 0 0 1 0)
  • Cluster C = {w1, w2, w3, w4}
      w1: 1 0 0 0 0 1 0 0 1 0
      w2: 0 0 1 0 0 1 0 0 1 0
      w3: 0 0 0 0 0 1 0 0 1 0
      w4: 1 0 1 0 0 1 0 0 1 0
  • TE(C) = 2, from per-tag values 1 0 1 0 0 0 0 0 0 0
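
The formula for TE(C) did not survive in the transcript. One reading that reproduces the slide's numbers is to sum, over all tags, the binary entropy of the fraction of cluster words carrying that tag. The sketch below implements that assumed definition on the four tag vectors of the example; it is a reconstruction, not necessarily the authors' exact formula.

```python
import math

def tag_entropy(tag_vectors):
    """Sum over tags of the binary entropy of the share of words having the tag."""
    n = len(tag_vectors)
    total = 0.0
    for column in zip(*tag_vectors):  # one column per tag
        p = sum(column) / n
        if 0.0 < p < 1.0:  # unanimous columns contribute nothing
            total -= p * math.log2(p) + (1 - p) * math.log2(1 - p)
    return total

cluster = [
    [1, 0, 0, 0, 0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0, 1, 0, 0, 1, 0],
]
print(tag_entropy(cluster))  # 2.0
```

Only tags 1 and 3 are split evenly across the four words, so each contributes one bit and all other tags contribute zero.
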
19
Mean Tag Entropy
  • MTE = $\frac{1}{N} \sum_i TE(C_i)$
  • Weighted MTE = $\sum_i |C_i| \, TE(C_i) / \sum_i |C_i|$
  • Caveat: putting every word in a separate cluster gives 0 MTE
    and WMTE
  • Baseline: every word in a single cluster
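
The two averages can be sketched as follows, taking per-cluster tag entropies and cluster sizes as given. The example values are hypothetical and only show how the size weighting changes the result.

```python
def mean_tag_entropy(entropies):
    """Unweighted mean of per-cluster tag entropies."""
    return sum(entropies) / len(entropies)

def weighted_mean_tag_entropy(entropies, sizes):
    """Mean of per-cluster tag entropies weighted by cluster size."""
    return sum(s * te for s, te in zip(sizes, entropies)) / sum(sizes)

entropies = [2.0, 0.0, 1.0]  # hypothetical TE values for three clusters
sizes = [4, 1, 5]            # hypothetical cluster sizes
print(mean_tag_entropy(entropies))                  # 1.0
print(weighted_mean_tag_entropy(entropies, sizes))  # 1.3
```

With every word in its own cluster all entropies are zero, so both measures are zero, which is the caveat noted above.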

20
Tag Entropy vs. Corpus Size
Reduction in Tag Entropy (m = 50)
Corpus size   1M      2M      5M      10M     17M
MTE           74.49   75.14   76.09   78.29   74.94
WMTE          17.46   18.68   24.23   27.56   30.60

21
The Bigger, the Worse!
[Figure: tag entropy vs. cluster size]

22
Clusters
  • Big ones → bad ones → mix of everything!
  • Medium sized clusters are good
  • http://banglaposclusters.googlepages.com/home

Rank  Size  Type
5     596   Proper nouns, titles and posts
6     352   Possessive case of nouns (common, proper, verbal) and pronouns
8     133   Nouns (common, verbal) forming compounds with do or be
11    44    Number-Classifier (e.g. 1-TA, ekaTA)
12    84    Adjectives

23
More Observations
  • Words are split into
  • First name vs. Surnames
  • Animate nouns-poss vs. Inanimate noun-poss
  • Nouns-acc vs. Nouns-poss vs. Nouns-loc
  • Verb-finite vs. Verb-infinitive
  • Syntactic or semantic?
  • Nouns related to professions, months, days of
    week, stars, players etc.

24
Advantages
  • No labeled data required: a good solution to
    resource scarcity
  • No prior class information: circumvents issues
    related to tag set definition
  • Computational definition of class
  • Understanding the structure of language (syntax)
    and its evolution

25
Danke für Ihre Aufmerksamkeit.
Thank you for your attention
  • Dieses ist vom Übersetzer übersetzt worden, der
    von Phasen Microsoft Beta ist.

This has been translated by "Translator Beta"
from Microsoft Live.
26
Related Work
  • Harris, 68: Distributional hypothesis for
    syntactic classes
  • Miller and Charles, 91: Function words as
    features
  • Finch and Chater, 92; Schütze, 93, 95; Clark, 00;
    Rapp, 05; Biemann, 06: The general technique
  • Haghighi and Klein, 06; Goldwater and Griffiths,
    07: Bayesian approach to unsupervised POS tagging
  • Dasgupta and Ng, 07: Bengali POS induction
    through morphological features

27
Medium and Low Frequency Words
  • Neighboring (window 4) co-occurrences ranked by
    log-likelihood thresholded by ?
  • Two words are connected iff they share at least 4
    neighbors

Language  English  Finnish  German
Nodes     52857    85627    137951
Edges     691241   702349   1493571
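
The connection rule for medium and low frequency words can be sketched as below. The sketch assumes that each word's set of significant neighbors has already been computed (the log-likelihood ranking and thresholding are not shown); the function name and the toy neighbor sets are hypothetical, and storing the shared-neighbor count on the edge is just for illustration.

```python
from itertools import combinations

def shared_neighbor_edges(neighbors, min_shared=4):
    """Connect two words iff they share at least `min_shared` neighbors."""
    edges = {}
    for a, b in combinations(sorted(neighbors), 2):
        shared = len(neighbors[a] & neighbors[b])
        if shared >= min_shared:
            edges[(a, b)] = shared
    return edges

# Hypothetical significant-neighbor sets.
neighbors = {
    "cat": {"the", "a", "sat", "ran", "big"},
    "dog": {"the", "a", "sat", "ran", "barked"},
    "of": {"the", "a"},
}
print(shared_neighbor_edges(neighbors))  # {('cat', 'dog'): 4}
```

Comparing all word pairs is quadratic, so a corpus-scale version would need an inverted index from neighbors to words.
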
28
Construction of Lexicon
  • Each word assigned a unique tag based on the word
    class it belongs to
  • Class 1: sky, color, blood, weight
  • Class 2: red, blue, light, heavy
  • Ambiguous words
  • High and medium frequency words that formed
    singleton cluster
  • Possible tags of neighboring clusters

29
Training and Evaluation
  • Unsupervised training of trigram HMM using the
    clusters and lexicon
  • Evaluation
  • Tag a text, for which gold standard is available
  • Estimate the conditional entropy H(T|C) and the
    related perplexity 2^{H(T|C)}
  • Final Results
  • English 2.05 (619/345), Finnish 3.22
    (625/466), German 1.79 (781/440)
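
The evaluation measure can be illustrated with a short sketch that estimates the conditional entropy of gold tags T given induced clusters C from a list of (tag, cluster) pairs, and then the perplexity as 2 raised to that entropy. The function name and the toy pairs are hypothetical; the cluster labels only mimic the style of the talk's example.

```python
import math
from collections import Counter

def conditional_entropy(pairs):
    """H(T|C) in bits, estimated from (gold_tag, cluster) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    cluster_counts = Counter(cluster for _, cluster in pairs)
    h = 0.0
    for (tag, cluster), count in joint.items():
        # p(t, c) * log2 p(t | c)
        h -= (count / n) * math.log2(count / cluster_counts[cluster])
    return h

pairs = [("At", "C1"), ("At", "C1"), ("NN", "C2"), ("V", "C2")]
h = conditional_entropy(pairs)
print(h)       # 0.5
print(2 ** h)  # perplexity
```

A perplexity of 1 would mean that each cluster maps to exactly one gold tag; higher values mean the clusters mix tags.
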

30
Example
"From the familiar to the exotic, the collection is a delight"

Word        Tag   Cluster
From        Prep  C200
the         At    C1
familiar    JJ    C331
to          Prep  C5
the         At    C1
exotic      JJ    C331
the         At    C1
collection  NN    C221
is          V     C3
a           At    C1
delight     NN    C220