Title: Examining Higher Order Transformations for Scale-free Small World Graphs
1. Examining Higher Order Transformations for Scale-free Small World Graphs
- Uwe Quasthoff, Chris Biemann
- Universität Leipzig, Institut für Informatik
- quasthoff,biemann_at_informatik.uni-leipzig.de
2. Background: Word co-occurrences
- Given a lot of sentences in one language (typically, millions of sentences), we ask:
  - Which words appear significantly often together within a sentence? (Examples: Dresden and Semper, dog and cat)
  - Which words appear significantly often as next neighbors? (Examples: Semper and Opera, hot and dog)
  - Significance is measured using the log-likelihood ratio.
- Size of the German corpus:
  - Sentences: 50 M
  - Words: 11 M (nodes for both graphs)
  - Sentence co-occurrences: 180 M (edges for the sentence co-occurrence graph)
  - NN co-occurrences: 34 M (edges for the NN co-occurrence graph)
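The log-likelihood significance mentioned above can be computed from a 2x2 contingency table of sentence counts. The following is a minimal sketch of Dunning's log-likelihood ratio; the function name and count layout are illustrative, not taken from the authors' implementation:

```python
import math

def xlogx(x):
    """x * ln(x), with the usual convention 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0

def log_likelihood(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table:
    k11 = sentences containing both words,
    k12 = sentences containing word A but not B,
    k21 = sentences containing word B but not A,
    k22 = sentences containing neither word."""
    n = k11 + k12 + k21 + k22
    return 2.0 * (xlogx(k11) + xlogx(k12) + xlogx(k21) + xlogx(k22)
                  + xlogx(n)
                  - xlogx(k11 + k12) - xlogx(k21 + k22)   # row sums
                  - xlogx(k11 + k21) - xlogx(k12 + k22))  # column sums
```

If the two words are statistically independent the score is near zero; strong association yields large positive values, which serve as edge weights in the co-occurrence graph.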
3. Sample word co-occurrences: space
- Significant co-occurrences of space within a sentence:
  - disk (2629), shuttle (2618), square (1163),
station (991), NASA (920), feet (822), memory
(718), address (653), Space (602), leased (567),
launch (505), storage (479), astronauts (473),
Challenger (420), represented (412), manned
(406), lessor (390), mission (385), office (382),
Discovery (341), hard (336), Mir (335), rocket
(329), orbit (326), program (308), RAM (307),
free (300), NASA's (297), flight (293), Atlantis
(291), cosmonauts (275), files (261), Earth
(239), satellite (238), amount (230), into (226),
requires (223)
- Significant left neighbors of space:
- disk (4073), address (1157), office (953),
storage (685), desk (323), manned (306), outer
(305), free (293), shelf (257), floor (230),
memory (229), into (219), hard-disk (208), phase
(198), breathing (194), presentation (179), white
(149), Mir (145), open (140), Soviet (130),
parking (122), retail (122), tuple (120),
industrial (109), air (105), warehouse (104),
extra (103), empty (102), save (96), less (82),
NASA's (79), parameter (78), blank (77), moduli
(77), much (74), orbiting (70), crawl (69),
enough (63), Hilbert (62), more (60), swap (59)
- Significant right neighbors of space:
- shuttle (2967), station (1516), agency
(385), program (312), heating (258), for (229),
exploration (217), between (183), flight (181),
at (179), shuttles (167), probe (145), bar (139),
is (125), available (108), telescope (105), on
(98), center (96), missions (79), requirements
(76), charge (74), agency's (73), probes (63),
shuttle's (57), heaters (48), mission (47),
required (45), limitations (44), science (39),
than (37), walk (37), capsule (35), travel (33),
constraints (32), allocation (31), heater (30),
endurance (27), character (26)
4. The co-occurrence graph for space
- Local sentence co-occurrence graph: distance 1 from space
- The different meanings are clearly visible.
5. http://corpora.informatik.uni-leipzig.de/
6. http://corpora.informatik.uni-leipzig.de/
7. Pruning of our graphs
- By construction, all edges are weighted by significance.
- We apply the following additional pruning:
  - Remove all edges.
  - For each node, re-insert the N (here, N = 3 or N = 10) strongest edges (if not yet inserted).
- Note that the degree of a node is not bounded by this pruning.
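This pruning step can be sketched as follows; the representation of the graph as a dict of significance-weighted edges is an assumption for illustration, not the authors' implementation:

```python
from collections import defaultdict

def prune(edges, n=3):
    """Remove all edges, then re-insert each node's n strongest edges.
    edges: dict mapping (u, v) -> significance weight, with u != v.
    An edge survives if it is among the top-n edges of u OR of v,
    so the resulting node degrees are not bounded by n."""
    incident = defaultdict(list)
    for (u, v), w in edges.items():
        incident[u].append(((u, v), w))
        incident[v].append(((u, v), w))
    kept = set()
    for node, inc in incident.items():
        inc.sort(key=lambda e: -e[1])       # strongest first
        for edge, _ in inc[:n]:
            kept.add(edge)                  # "if not yet inserted"
    return {e: edges[e] for e in kept}
```

Because every endpoint re-inserts its own top-n list, a hub node can keep far more than n incident edges, exactly as the slide notes.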
8. Statistical properties
9. Comparison with random graphs
- The following random graph models are usually associated with natural language:
- Barabási-Albert (1999)
  - In the BA model, a graph is constructed by preferential attachment: a new vertex connects to existing vertices with probability proportional to their degree.
- Dorogovtsev-Mendes (2001)
  - A new vertex is connected to a preferentially chosen existing vertex, but additionally edges among the existing vertices are introduced with probability proportional to the product of their degrees.
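A minimal sketch of the BA preferential-attachment process (the DM model would additionally add edges among existing vertices with probability proportional to the product of their degrees); parameters and the seed-graph choice are illustrative:

```python
import random

def barabasi_albert(n, m=2, seed=0):
    """Grow a graph of n vertices by preferential attachment:
    each new vertex connects to m existing vertices, chosen with
    probability proportional to their current degree."""
    rng = random.Random(seed)
    # start from a small complete core of m + 1 vertices
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # every edge contributes both endpoints; sampling uniformly from
    # this list is equivalent to degree-proportional sampling
    endpoints = [v for e in edges for v in e]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:          # m distinct attachment targets
            targets.add(rng.choice(endpoints))
        for t in targets:
            edges.append((new, t))
            endpoints += [new, t]
    return edges
```

The repeated-endpoints trick keeps the sampling degree-proportional without maintaining explicit degree counts; rejecting duplicate targets introduces a slight bias that is irrelevant for this sketch.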
10. Degree distribution for BA and DM (left) and sentence-based word co-occurrences (right)
11. Searching for similar words
- Word co-occurrences represent all kinds of semantic relations.
- Sometimes we find similar words among the strongest sentence co-occurrences, but mostly we do not.
- Good example: significant co-occurrences of zinc
- copper (369), lead (323), cadmium (212),
nickel (145), iron (94), metals (89), silver
(79), manganese (76), tonnes (71), oxide (67),
Dollars (55), chromium (54), Cominco (53), mine
(52), chloride (48), ...
- Typical example: significant co-occurrences of refrigerator
  - kitchen (41), freezer (38), magnets (24),
cold (24), magnet (23), helium (23), stove (19),
heat (19), compressors (18), door (18), oven
(17), her (15), cooling (14), Store (14),
microwave (14), stored (14), food (14), water
(13), ice (13), ...
12. Co-occurrences of higher order
- Idea: if the process of calculating significant co-occurrences gives us some similar words, iterating the process will give us more similar words.
- But: how do we iterate?
- Answer: the co-occurrence sets produced in the first step replace the sentences.
- Sample "sentences" are:
  - copper lead cadmium nickel iron metals silver manganese tonnes oxide Dollars chromium Cominco mine chloride ...
  - kitchen freezer magnets cold magnet helium stove heat compressors door oven her cooling Store microwave stored food water ice ...
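The iteration can be sketched as a two-function pipeline: count co-occurrences over units, then turn each word's co-occurrence set into a pseudo-sentence and count again. For brevity this sketch uses raw pair counts with a frequency threshold instead of the log-likelihood significance used on the real data:

```python
from collections import Counter

def cooccurrences(units, min_count=2):
    """First-order step: count how often two words share a unit
    (a sentence, or in later iterations a co-occurrence set)."""
    counts = Counter()
    for unit in units:
        words = sorted(set(unit))
        for i, a in enumerate(words):
            for b in words[i + 1:]:
                counts[(a, b)] += 1
    return {pair: c for pair, c in counts.items() if c >= min_count}

def neighbor_sets(pairs):
    """Turn co-occurrence pairs back into pseudo-sentences:
    each word together with all of its co-occurrents."""
    nbrs = {}
    for a, b in pairs:
        nbrs.setdefault(a, {a}).add(b)
        nbrs.setdefault(b, {b}).add(a)
    return list(nbrs.values())
```

Second-order co-occurrences are then simply `cooccurrences(neighbor_sets(cooccurrences(sentences)))`, and the construction iterates in the obvious way.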
13. The usual co-occurrences for Auto, using sentences
fahren (1396), Wagen (979), prallte (914), Fahrer (809), seinem (723), fuhr (709), fährt (638), Polizei (609), erfaßt (587), gefahren (485)
14. Co-occurrences of second order for Auto, using sentence co-occurrences
Wagen (114), Fahrzeug (54), Fahrer (41), Fahrbahn (35), prallte (35), Polizei (28), verletzt (27), Schleudern (24), fuhr (24), Richtung (21), ...
15. Co-occurrences of second order for Auto, using NN co-occurrences
Wagen (35), Lastwagen (14), Fahrzeug (13), Autos (9), Personenwagen (9), Bus (8), Zug (7), Haus (5), Lkw (5), Pkw (5)
16. First iteration step
- The two black nodes A and B get connected in this step if there are many nodes C that are connected to both A and B.
- The more such Cs, the higher the weight of the new edge.
- Figure legend: existing connection; new connection.
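One transformation step can be sketched directly on an adjacency-set representation: connect A and B with a weight equal to their number of common neighbors, keeping only edges above a threshold. The representation and threshold parameter are assumptions for illustration:

```python
from itertools import combinations

def higher_order_step(adj, min_shared=2):
    """Connect A and B with weight |N(A) & N(B)| (number of common
    neighbors C); keep edges with at least min_shared shared neighbors.
    adj: dict mapping node -> set of neighbor nodes."""
    new_adj = {v: set() for v in adj}
    weights = {}
    for a, b in combinations(sorted(adj), 2):
        shared = len(adj[a] & adj[b])
        if shared >= min_shared:
            new_adj[a].add(b)
            new_adj[b].add(a)
            weights[(a, b)] = shared
    return new_adj, weights
```

Note that this is the graph-theoretic view of the same operation as replacing sentences by co-occurrence sets: two words become neighbors when their neighbor sets overlap strongly.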
17. Second iteration step
- The two black nodes A and B get connected in this step if there are many (dark gray) nodes D that are connected to both A and B.
- The connections between the D nodes and the nodes A and B were themselves constructed because of (light gray) nodes E and F, respectively.
- Figure legend: former connection; existing connection; new connection.
18. Collapsing bridging nodes
- An upper bound for the path length in iteration n is 2^n.
- However, some of the bridging nodes collapse, giving rise to self-sustaining clusters of arbitrary path length, which are invariant under iteration.
- Figure: the upper 5 nodes form an invariant cluster; A and B are being absorbed by this cluster.
19. Examples of Iterated Co-occurrences
20. Where are the fixed points?
- As expected, the dynamics often lead to fixed points, i.e. sets of nodes invariant under iteration. Usually there are several strongly attracting fixed points. One can also observe attracting cycles.
- In the case of words, the words in the fixed point are often not semantically related to the starting point; hence the first steps seem more interesting.
- Stronger thresholds lead to fewer fixed points and cycles; the empty set may then be the only attractor.
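The fixed-point and cycle behavior can be probed with a small driver that applies the transformation until the edge structure repeats. Here `step` stands for whichever higher-order transformation is used; the driver itself is a generic sketch, not the authors' experimental setup:

```python
def iterate_to_fixed_point(adj, step, max_iter=20):
    """Apply the higher-order transformation `step` repeatedly.
    Returns (final_graph, outcome) where outcome is 'fixed point'
    if the graph stops changing, 'cycle' if a previously seen graph
    recurs, or 'cap reached' after max_iter iterations.
    adj: dict mapping node -> set of neighbor nodes."""
    seen = []
    current = adj
    for _ in range(max_iter):
        frozen = {v: frozenset(ns) for v, ns in current.items()}
        if seen and frozen == seen[-1]:
            return current, "fixed point"     # invariant under iteration
        if frozen in seen:
            return current, "cycle"           # attracting cycle
        seen.append(frozen)
        current = step(current)
    return current, "cap reached"
```

With strong thresholds in `step`, the empty graph is an obvious fixed point, matching the observation that the empty set may be the only attractor.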
21. Generalization to arbitrary networks: the co-occurrence mapping
- Given a large set of nodes (here: words) and some connected subgraphs, shown in bold. The upper graph represents a sentence, the lower one the collocation set corresponding to the central element.
- The subgraphs are completed, producing collocation edges. Both nodes and edges are weighted.
- The collocation mapping removes most of the edges because they are considered noise.
- The result is another graph, which can be used to iterate the process.
22. Orders 2 and 3 for the random graph models and for word co-occurrences
23. Conclusions
- Natural-language word co-occurrence networks differ from networks created by the BA and DM models.
- The difference is due to longer-range dependencies in language, given by syntax and semantics.
- The higher order transformation discussed here
  - shows interesting dynamics and
  - maps similar nodes onto next neighbors.
- Future work is necessary to
  - investigate hyperbolic fixed points and
  - understand the dynamics.
24. Thank you!