Title: Human Computer Studies
1. Human Computer Studies
2004: a good year for Computational Grammar Induction
January 2005, Pieter Adriaans
Universiteit van Amsterdam, pietera@science.uva.nl
http://turing.wins.uva.nl/pietera/ALS/
2. GI Research Questions
- Research Question: What is the complexity of human language?
- Research Question: Can we make a formal model of the language development of young children that allows us to understand
  - why the process is efficient?
  - why the process is discontinuous?
- Underlying Research Question: Can we learn natural language efficiently from text? How much text is needed? How much processing is needed?
- Research Question: Semantic learning, e.g. can we construct ontologies for specific domains from (scientific) text?
3. Chomsky Hierarchy and the Complexity of Human Language
4. Complexity of Natural Language: Zipf distribution
[Figure: Zipf rank-frequency plot with a structured high-frequency core and a heavy low-frequency tail]
5. Observations
- Word frequencies in human utterances are dominated by power laws (see the sketch below)
  - high-frequency core
  - low-frequency heavy tail
  - open versus closed word classes (function words)
- Natural language is open; grammar is elastic. The occurrence of new words is a natural phenomenon. Syntactic/semantic bootstrapping must play an important role in language learning.
- Bootstrapping will be important for ontology learning as well as for child language acquisition.
- A better understanding of NL distributions is necessary.
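The power-law claim is easy to check on any corpus. A minimal sketch (not from the talk; it assumes a plain-text file corpus.txt) prints rank-frequency pairs; under a Zipf distribution, rank times frequency is roughly constant:

from collections import Counter

def rank_frequency(text):
    """Return (rank, frequency) pairs for the words in `text`."""
    counts = Counter(text.lower().split())
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))

text = open("corpus.txt").read()   # any plain-text corpus (assumption)
for rank, freq in rank_frequency(text)[:20]:
    # Under Zipf's law, rank * freq stays roughly constant across ranks.
    print(rank, freq, rank * freq)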
6. Learn NL from text: probabilistic versus recursion-theoretic approach
- 1967, Gold: Any language class more complex than the super-finite sets (including the regular languages and everything above them in the Chomsky hierarchy) cannot be learned from positive data.
- 1969, Horning: Probabilistic context-free grammars can be learned from positive data. Given a text T and two grammars G1 and G2, we are able to approximate max(P(G1|T), P(G2|T)) (see the sketch below).
- ICGI, > 1990: empirical approach. Just build algorithms and try them. Approximate NL from below: Finite → Regular → Context-free → Context-sensitive.
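The Horning-style comparison can be pictured with a small Bayesian sketch. The toy "grammars" g1 and g2 below are hypothetical stand-ins for real PCFGs (which would compute P(sentence|G) with a parser, e.g. the inside algorithm); the point is only that posteriors P(G|T) ∝ P(G) · P(T|G) can be compared on a text T:

import math

def g1(sentence):
    # Hypothetical grammar preferring short sentences.
    return 0.5 ** len(sentence.split())

def g2(sentence):
    # Hypothetical grammar: uniform over a 1000-sentence toy language.
    return 1.0 / 1000

def log_posterior(grammar, prior, text):
    """log P(G|T) up to a constant: log P(G) + sum over sentences of log P(s|G)."""
    return math.log(prior) + sum(math.log(grammar(s)) for s in text)

text = ["john likes tea", "john makes coffee"]
scores = {"G1": log_posterior(g1, 0.5, text), "G2": log_posterior(g2, 0.5, text)}
print(max(scores, key=scores.get), scores)   # pick the grammar maximizing P(G|T)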
7. Situation < 2004
- GI seems to be hard
- No identification in the limit
- Ill-understood power laws dominate (word) frequencies in human communication
- Machine learning algorithms have difficulties in these domains
- PAC learning does not converge on these domains
- Nowhere near learning natural languages
- We were running out of ideas
8. Situation < 2004: Learning Regular Languages
- Reasonable success in learning regular languages of moderate complexity (Evidence-Driven State Merging, Blue-Fringe)
- Transparent representation: Deterministic Finite Automata (DFA)
- DEMO (a minimal sketch of the starting point follows below)
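As a rough illustration of that transparent representation (a sketch of my own, not the actual demo): state-merging learners such as Blue-Fringe start from the prefix-tree acceptor of the positive sample, a DFA whose states are sample prefixes, and then fold it by merging states that the evidence cannot distinguish.

def build_pta(positives):
    """Prefix-tree acceptor: states are prefixes of the sample strings."""
    trans, accepting = {}, set()
    for word in positives:
        state = ""
        for ch in word:
            trans[(state, ch)] = state + ch
            state = state + ch
        accepting.add(state)
    return trans, accepting

def accepts(trans, accepting, word):
    """Run the DFA on `word`."""
    state = ""
    for ch in word:
        if (state, ch) not in trans:
            return False
        state = trans[(state, ch)]
    return state in accepting

trans, accepting = build_pta(["ab", "abab", "ababab"])
print(accepts(trans, accepting, "abab"))   # True
print(accepts(trans, accepting, "aba"))    # False
# A state-merging learner would now fold this tree into a 2-state DFA
# for (ab)+, merging states that no negative example distinguishes.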
9. Situation < 2004: Learning Context-free Languages
- A number of approaches: learning probabilistic CFGs, the Inside-Outside algorithm, EMILE, ABL.
- No transparent representation: Push-Down Automata (PDA) are not really helpful for modelling the learning process.
- No adequate convergence on interesting real-life corpora.
- Problem of sparse data sets.
- Complexity issues ill-understood.
10. EMILE: natural language allows bootstrapping
- Lewis Carroll's famous poem 'Jabberwocky' starts with:
  - 'Twas brillig, and the slithy toves
  - Did gyre and gimble in the wabe:
  - All mimsy were the borogoves,
  - And the mome raths outgrabe.
11. EMILE: Characteristic Expressions and Contexts
- An expression of a type T is characteristic for T if it only appears with contexts of type T.
- Similarly, a context of a type T is characteristic for T if it only appears with expressions of type T.
- Let G be a grammar (context-free or otherwise) of a language L. G has context separability if each type of G has a characteristic context, and expression separability if each type of G has a characteristic expression.
- Natural languages seem to be context- and expression-separable.
- This is nothing but stating that languages can define their own concepts internally ("... is a noun", "... is a verb").
12. EMILE: Natural languages are shallow
- A class of languages C is shallow if for each language L it is possible to find a context- and expression-separable grammar G, and a set of sentences S inducing characteristic contexts and expressions for all the types of G, such that the size of S and the length of the sentences of S are logarithmic in the descriptive length of L (relative to C).
- This seems to hold for natural languages: large dictionaries, low thickness.
13. Regular versus context-free: merging-clustering
[Diagram: state merging in an automaton for regular languages versus two-dimensional clustering of contexts and expressions for context-free languages]
14. The EMILE learning algorithm
- One can prove that, using clustering techniques, shallow CFGs can be learned efficiently from positive examples drawn under a distribution m.
- General idea: every sentence splits into an expression and a surrounding context (see the sketch below).
[Diagram: a sentence decomposed into an expression and its left/right context]
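A minimal sketch of that split (illustrative names, not EMILE's actual code): enumerate every (context, expression) decomposition of a tokenized sentence.

def split_pairs(sentence):
    """All (context, expression) pairs of a tokenized sentence:
    the expression is a contiguous slice, the context the remainder."""
    toks = sentence.split()
    pairs = []
    for i in range(len(toks)):
        for j in range(i + 1, len(toks) + 1):
            expression = " ".join(toks[i:j])
            context = (" ".join(toks[:i]), " ".join(toks[j:]))
            pairs.append((context, expression))
    return pairs

for ctx, expr in split_pairs("John likes tea"):
    print(ctx, "->", expr)
# e.g. (('John', 'tea'), 'likes'): the context "John ... tea" around "likes"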
15. EMILE 4.1 (2000), Vervoort
- Unsupervised
- Two-dimensional clustering: random search for maximized blocks in the matrix
- Incremental thresholds for the filling degree of blocks
- Simple (but sloppy) rule induction using characteristic expressions
16. Clustering (2-dimensional)
- John makes tea
- John likes tea
- John likes eating
- John makes coffee
- John likes coffee
- John is eating
(A sketch of the clustering on exactly this corpus follows below.)
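A simplified sketch of the two-dimensional clustering on this corpus (single-word expressions only; EMILE 4.1 itself uses a randomized search for maximized blocks): expressions that share several contexts surface as a filled block, i.e. a candidate type.

from collections import defaultdict

sentences = ["John makes tea", "John likes tea", "John likes eating",
             "John makes coffee", "John likes coffee", "John is eating"]

# Occurrence matrix: which single-word expression appears in which context?
matrix = defaultdict(set)          # context -> set of expressions
for s in sentences:
    toks = s.split()
    for i, expr in enumerate(toks):
        ctx = (" ".join(toks[:i]), " ".join(toks[i + 1:]))
        matrix[ctx].add(expr)

# Expressions sharing several contexts form a candidate type (a filled block).
shared = defaultdict(set)          # frozenset of expressions -> their contexts
for ctx, exprs in matrix.items():
    if len(exprs) > 1:
        shared[frozenset(exprs)].add(ctx)

for exprs, ctxs in shared.items():
    print(sorted(exprs), "share contexts", sorted(ctxs))
# e.g. ['likes', 'makes'] share ('John', 'tea') and ('John', 'coffee'):
# a verb-like type emerges purely from the distribution.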
17. EMILE 4.1: Clustering Sparse Matrices of Contexts and Expressions
[Figure: sparse matrix of expressions against contexts; filled blocks mark the characteristic expressions and characteristic contexts of a type]
18. EMILE guaranteed to find types with the right settings
- Let T be a type with a characteristic context c_ch and a characteristic expression e_ch. Suppose that the maximum lengths for primary contexts and expressions are set to at least |c_ch| and |e_ch|, and suppose that the total_support, expression_support and context_support settings are all set to 100%. Let T<max,C and T<max,E be the sets of contexts and expressions of T that are small enough to be used as primary contexts and expressions. If EMILE is given a sample containing all combinations of contexts from T<max,C and expressions from T<max,E, then EMILE will find type T. (Vervoort 2000)
19. Original grammar
- S → NP V_i ADV | NP_a VP_a | NP_a V_s that S
- NP → NP_a | NP_p
- VP_a → V_t NP | V_t NP P NP_p
- NP_a → John | Mary | the man | the child
- NP_p → the car | the city | the house | the shop
- P → with | near | in | from
- V_i → appears | is | seems | looks
- V_s → thinks | hopes | tells | says
- V_t → knows | likes | misses | sees
- ADV → large | small | ugly | beautiful
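The talk does not say how the training examples were generated; assuming uniform choice among alternatives and a depth cap on the recursive S rule, a sampler for the grammar above might look like this:

import random

# The slide's grammar, with uniform choice between alternatives (an assumption).
grammar = {
    "S":    [["NP", "V_i", "ADV"], ["NP_a", "VP_a"], ["NP_a", "V_s", "that", "S"]],
    "NP":   [["NP_a"], ["NP_p"]],
    "VP_a": [["V_t", "NP"], ["V_t", "NP", "P", "NP_p"]],
    "NP_a": [["John"], ["Mary"], ["the man"], ["the child"]],
    "NP_p": [["the car"], ["the city"], ["the house"], ["the shop"]],
    "P":    [["with"], ["near"], ["in"], ["from"]],
    "V_i":  [["appears"], ["is"], ["seems"], ["looks"]],
    "V_s":  [["thinks"], ["hopes"], ["tells"], ["says"]],
    "V_t":  [["knows"], ["likes"], ["misses"], ["sees"]],
    "ADV":  [["large"], ["small"], ["ugly"], ["beautiful"]],
}

def sample(symbol="S", depth=0):
    """Expand `symbol` by uniform random choice; terminals are returned as-is."""
    if symbol not in grammar:
        return symbol
    # Cap recursion through S -> NP_a V_s that S at depth 4 (an assumption).
    options = grammar[symbol][:2] if (symbol == "S" and depth > 4) else grammar[symbol]
    return " ".join(sample(sym, depth + 1) for sym in random.choice(options))

for _ in range(5):
    print(sample())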
20. Learned grammar after 100,000 examples
- [0] → [17] [6]
- [0] → [17] [22] [17] [6]
- [0] → [17] [22] [17] [22] [17] [22] [17] [6]
- [6] → misses [17] | likes [17] | knows [17] | sees [17]
- [6] → [22] [17] [6]
- [6] → appears [34] | looks [34] | is [34] | seems [34]
- [6] → [6] near [17] | [6] from [17] | [6] in [17] | [6] with [17]
- [17] → the child | Mary | the city | the man | John | the car | the house | the shop
- [22] → tells that | thinks that | hopes that | says that
- [22] → [22] [17] [22]
- [34] → small | beautiful | large | ugly
21. Bible books
- King James version
- 31,102 verses on 82,935 lines
- 4.8 MB of English text
- 001001 In the beginning God created the heaven and the earth.
- 66 experiments with increasing sample size
- Initially the Book of Genesis, then the Book of Exodus, ...
- Full run: 40 minutes, 500 MB on an Ultra-2 SPARC
22. Bible books
23. GI on the Bible
- 0 → Thou shalt not [582]
- 0 → Neither shalt thou [582]
- [582] → eat it
- [582] → kill .
- [582] → commit adultery .
- [582] → steal .
- [582] → bear false witness against thy neighbour .
- [582] → abhor an Edomite
24. Knowledge base in the Bible
- Dictionary Type 76: Esau, Isaac, Abraham, Rachel, Leah, Levi, Judah, Naphtali, Asher, Benjamin, Eliphaz, Reuel, Anah, Shobal, Ezer, Dishan, Pharez, Manasseh, Gershon, Kohath, Merari, Aaron, Amram, Mushi, Shimei, Mahli, Joel, Shemaiah, Shem, Ham, Salma, Laadan, Zophah, Elpaal, Jehieli
- Dictionary Type 362: plague, leprosy
- Dictionary Type 414: Simeon, Judah, Dan, Naphtali, Gad, Asher, Issachar, Zebulun, Benjamin, Gershom
- Dictionary Type 812: two, three, four
- Dictionary Type 1056: priests, Levites, porters, singers, Nethinims
- Dictionary Type 978: afraid, glad, smitten, subdued
- Dictionary Type 2465: holy, rich, weak, prudent
- Dictionary Type 3086: Egypt, Moab, Dumah, Tyre, Damascus
- Dictionary Type 4082: heaven, Jerusalem
25. Evaluation
- (+) Works efficiently on large corpora
- (+) Learns (partial) grammars
- (+) Unsupervised
- (-) EMILE 4.1 needs a lot of input.
- (-) Convergence to meaningful syntactic types is rarely observed.
- (-) Types seem to be semantic rather than syntactic.
- Why? Hypothesis: the distribution in real-life text is semantic, not syntactic.
- But, most of all: sparse data!!!
26. 2004: Omphalos Competition (Starkie & van Zaanen)
- Unsupervised learning of context-free grammars
- Deliberately constructed to be beyond the current state of the art
- A theoretical brute-force learner constructs all possible CFGs consistent with a given set of positive examples O.
- Complexity measure for CFGs.
- There are only 2^(Σ_i(2|O_i|-2)+1) · (Σ_i(2|O_i|-2)+1)^|T(O)| of these grammars, where T(O) is the set of terminals!!
27. 2004: Omphalos Competition (Starkie & van Zaanen)
- Let Σ be an alphabet and Σ* the set of all strings over Σ.
- L(G) = S ⊆ Σ* is the language generated by a grammar G.
- C_G ⊆ S is a characteristic sample for G.
[Diagram: nested sets Σ* (infinite) ⊃ S (infinite) ⊃ the characteristic sample C_G and the finite Omphalos sample O, with |O| < 20% of |C_G|]
28. Bad news for distributional analysis (EMILE, ABL, Inside-Outside)
[Diagram: two push-down automata. The first accepts w = a^n e b^n (sample strings: aeb, aaebb, aaaebbb) with transitions a,S → push X; a,X → push X; e,X → no-op; b,X → pop X; λ,X → pop. The second accepts w = {a,c}^n e {b,d}^n (sample strings: aeb, ceb, aed, ced, aaebb, aaebd, caebb, caebd, acebb, aaaebbb), adding c,S → push X; c,X → push X; d,X → pop X. A recognizer sketch for the first follows below.]
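A minimal simulation of the first automaton as reconstructed above (state names are illustrative):

def accepts_aneb_n(w):
    """Simulate the slide's PDA for a^n e b^n: push X on each a,
    switch on e (no-op), pop X on each b, accept on empty stack."""
    stack, state = [], "reading_a"
    for ch in w:
        if state == "reading_a" and ch == "a":
            stack.append("X")                  # a,S / a,X -> push X
        elif state == "reading_a" and ch == "e":
            state = "reading_b"                # e,X -> no-op
        elif state == "reading_b" and ch == "b" and stack:
            stack.pop()                        # b,X -> pop X
        else:
            return False
    return state == "reading_b" and not stack  # accept on empty stack

print([w for w in ["aeb", "aaebb", "aaaebbb", "aabb", "aebb"]
       if accepts_aneb_n(w)])                  # ['aeb', 'aaebb', 'aaaebbb']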
29. Bad news for distributional analysis (EMILE, ABL, Inside-Outside)
[Examples of nested strings such as a aeb b, a aeb d, c aeb b, c aeb d, a ceb b, a ceb d, c ceb b, c ceb d, a aaebb b, ...]
We need large corpora to make distributional analysis work. The Omphalos samples are way too small!!
30. Omphalos won by Alexander Clark: some good ideas!
- Approach: exploit useful properties that randomly generated grammars are likely to have.
- Identifying constituents: measure the local mutual information between the symbol before and the symbol after a candidate (Clark, 2001); see the sketch below. More reliable than other information-theoretic constituent boundary tests (Lamb, 1961; Brill et al., 1990).
- Under benign distributions, candidates that cross constituent boundaries will have zero mutual information; structures that do not cross constituent boundaries will have non-zero mutual information.
- Analysis of cliques of strings that might be constituents (much like clusters in EMILE).
- Most hard problems in Omphalos are still open!!
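A sketch of that flanking-symbol statistic (my reconstruction of the idea, not Clark's code): collect the symbol before and after each occurrence of a candidate and compute the mutual information between the two. As the previous slide stresses, reliable use needs far larger corpora than this toy example.

import math
from collections import Counter

def flank_mi(sentences, candidate):
    """Mutual information between the symbols immediately before and
    after each occurrence of `candidate` ('#' marks sentence edges)."""
    cand, pairs = candidate.split(), []
    for s in sentences:
        toks = ["#"] + s.split() + ["#"]
        for i in range(1, len(toks) - len(cand)):
            if toks[i:i + len(cand)] == cand:
                pairs.append((toks[i - 1], toks[i + len(cand)]))
    if not pairs:
        return 0.0
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(c / n * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

sentences = ["the dog sleeps in the house", "a dog sleeps in a house",
             "the cat sleeps in the house", "a cat sleeps in a house"]
print(flank_mi(sentences, "dog sleeps in"))  # flanks co-vary: MI = log 2 > 0
print(flank_mi(sentences, "sleeps in the")) # flanks independent: MI = 0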
31. But is Omphalos the right challenge? What about NL?
[Chart: grammars plotted by log of the number of terminals (x-axis) against the complexity of the grammar, |P|/|N| (y-axis). Natural languages lie in the shallow region, where larger samples are needed; the Omphalos problems lie in the harder-to-learn region.]
32. ADIOS (Automatic DIstillation Of Structure), Solan et al. 2004
- Representation of a corpus (of sentences) as paths over a graph whose vertices are lexical elements (words)
- Motif Extraction (MEX) procedure for establishing new vertices, thus progressively redefining the graph in an unsupervised fashion
- Recursive generalization
- Zach Solan, David Horn, Eytan Ruppin (Tel Aviv University), Shimon Edelman (Cornell)
- http://www.tau.ac.il/zsolan
33. The Model (Solan et al. 2004)
- Graph representation with words as vertices and sentences as paths (see the sketch below).
- Is that a dog?
- Is that a cat?
- Where is the dog?
- And is that a horse?
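A minimal sketch of this graph representation on the slide's four sentences (the BEGIN/END vertices are an assumption):

from collections import defaultdict

sentences = ["is that a dog", "is that a cat",
             "where is the dog", "and is that a horse"]

# Words become vertices; each sentence is a path from BEGIN to END.
edges = defaultdict(list)            # vertex -> list of (next vertex, path id)
for pid, s in enumerate(sentences):
    path = ["BEGIN"] + s.split() + ["END"]
    for a, b in zip(path, path[1:]):
        edges[a].append((b, pid))

for vertex, out in edges.items():
    print(vertex, "->", out)
# Shared sub-paths such as "is that a" are traversed by several sentence
# paths; this overlap is what the MEX procedure later exploits.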
34. The MEX (motif extraction) procedure (Solan et al. 2004)
35. Generalization (Solan et al. 2004)
36. From MEX to ADIOS (Solan et al. 2004)
- Apply MEX to a search path consisting of a given data path.
- On the same search path, within a given window size, allow for the occurrence of an equivalence class, i.e. define a generalized search path of the type e1 → e2 → ... → E → ... → ek. Apply MEX to this window.
- Choose patterns P, including equivalence classes E, according to the MEX ranking. Add nodes.
- Repeat the above for all search paths.
- Repeat the procedure to obtain higher-level generalizations.
- Express the structures as syntactic trees.
(A much-simplified sketch of the MEX ranking follows below.)
37. First pattern formation
- Higher hierarchies: patterns (P) constructed from other Ps, equivalence classes (E) and terminals (T)
- Trees are to be read from top to bottom and from left to right
- Final stage: root pattern
- CFG = context-free grammar
38. Solan et al. 2004
- The ADIOS algorithm has been evaluated on artificial grammars containing thousands of rules, on natural languages as diverse as English and Chinese, on regulatory and coding regions in DNA sequences, and on functionally relevant structures in protein data.
- The complexity of ADIOS on large NL corpora seems to be linear in the size of the corpus.
- Allows mildly context-sensitive learning.
- This is the first time an unsupervised algorithm has been shown capable of learning complex syntax and of scoring well on standard language proficiency tests!! (Training set: 300,000 sentences from CHILDES; ADIOS scored at intermediate level (58%) on the Göteborg/ESL test.)
39. ADIOS learning from ATIS-CFG (4,592 rules), using different numbers of learners and different window lengths L
40. Where does ADIOS fit in?
[Chart: the same axes as slide 31 — log of the number of terminals against grammar complexity |P|/|N| — with ADIOS placed between the natural languages and the Omphalos problems.]
41. GI Research Questions
- Research Question: What is the complexity of human language?
- Research Question: Can we make a formal model of the language development of young children that allows us to understand
  - why the process is efficient?
  - why the process is discontinuous?
- Underlying Research Question: Can we learn natural language efficiently from text? How much text is needed? How much processing is needed?
- Research Question: Semantic learning, e.g. can we construct ontologies for specific domains from (scientific) text?
42. Conclusions & Further work
- We are starting to crack the code of unsupervised learning of human languages.
- ADIOS is the first algorithm capable of learning complex syntax and of scoring well on standard language proficiency tests.
- We have better statistical techniques to separate constituents from non-constituents.
- Good ideas: pseudo-graph representation, MEX, sliding windows.
- To be done:
  - Can MEX help us in DFA induction?
  - Better understanding of the complexity issues. When does MEX collapse?
  - Better understanding of semantic learning
  - Incremental learning with background knowledge
  - Use GI to learn ontologies