Title: CS626/449 : Speech, NLP and the Web/Topics in AI Programming (Lecture 9: Resnick
1. CS626/449: Speech, NLP and the Web / Topics in AI Programming (Lecture 9: Resnik's measure of word similarity; coverage of Jiang and Conrath, 1997)
- Pushpak Bhattacharyya, CSE Dept., IIT Bombay
2. Path length based similarity between house and lock
- House has 12 senses
[Figure: sense 1 of house, with Has-part links connecting the nodes house, study, wall, door, doorway and lock]
3. Properties that a path length based measure should satisfy
- Zero property
- Self distance is 0: $d(A,A) = 0$
- Symmetric property
- $d(A,B) = d(B,A)$
- Positive property
- $d$ is always non-negative
- Triangular inequality
- $d(A,C) \le d(A,B) + d(B,C)$
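The four properties can be checked mechanically for shortest-path distance on a small graph. The sketch below is only an illustration: the edge list is hypothetical (it is not taken from WordNet), and `path_length` and `satisfies_metric_properties` are names invented here.

```python
from collections import deque
from itertools import product

# Hypothetical undirected links; illustrative only, not the real WordNet structure.
EDGES = [("house", "study"), ("house", "wall"), ("house", "doorway"),
         ("doorway", "door"), ("door", "lock")]

def build_adjacency(edges):
    """Turn an edge list into an undirected adjacency map."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def path_length(adj, source, target):
    """Number of edges on a shortest path, found by breadth-first search."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("no path between the two nodes")

def satisfies_metric_properties(adj):
    """Check the zero, symmetric, positive and triangle properties on all nodes."""
    nodes = list(adj)
    d = {(a, b): path_length(adj, a, b) for a, b in product(nodes, repeat=2)}
    zero = all(d[a, a] == 0 for a in nodes)
    symmetric = all(d[a, b] == d[b, a] for a, b in d)
    positive = all(v >= 0 for v in d.values())
    triangle = all(d[a, c] <= d[a, b] + d[b, c]
                   for a, b, c in product(nodes, repeat=3))
    return zero and symmetric and positive and triangle

ADJ = build_adjacency(EDGES)
print(path_length(ADJ, "house", "lock"))
print(satisfies_metric_properties(ADJ))
```

Shortest-path length on an undirected, connected graph satisfies all four properties by construction, which is one reason it is a natural baseline distance.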
4. Motivating Resnik's measure through the hypernymy (is-a) hierarchy
- Sense 1
- lock -- (a fastener fitted to a door or drawer to keep it firmly closed)
=> fastener, fastening, holdfast, fixing -- (restraint that attaches to something or holds something in place)
=> restraint, constraint -- (a device that retards something's motion; "the car did not have proper restraints fitted")
=> device -- (an instrumentality invented for a particular purpose; "the device is small enough to wear on your wrist"; "a device intended to conserve water")
=> instrumentality, instrumentation -- (an artifact (or system of artifacts) that is instrumental in accomplishing some end)
=> artifact, artefact -- (a man-made object taken as a whole)
=> whole, unit -- (an assemblage of parts that is regarded as a single entity; "how big is that part compared to the whole?"; "the team is a unit")
=> object, physical object -- (a tangible and visible entity; an entity that can cast a shadow; "it was full of rackets, balls and other objects")
=> physical entity -- (an entity that has physical existence)
=> entity -- (that which is perceived or known or inferred to have its own distinct existence (living or nonliving))
5. House sense 1
- house -- (a dwelling that serves as living quarters for one or more families; "he has a house on Cape Cod"; "she felt she had to get out of the house")
=> dwelling, home, domicile, abode, habitation, dwelling house -- (housing that someone is living in; "he built a modest dwelling near the pond"; "they raise money to provide homes for the homeless")
=> housing, lodging, living accommodations -- (structures collectively in which people are housed)
=> structure, construction -- (a thing constructed; a complex entity constructed of many parts; "the structure consisted of a series of arches"; "she wore her hair in an amazing construction of whirls and ribbons")
=> artifact, artefact -- (a man-made object taken as a whole)
=> whole, unit -- (an assemblage of parts that is regarded as a single entity; "how big is that part compared to the whole?"; "the team is a unit")
=> object, physical object -- (a tangible and visible entity; an entity that can cast a shadow; "it was full of rackets, balls and other objects")
=> physical entity -- (an entity that has physical existence)
=> entity -- (that which is perceived or known or inferred to have its own distinct existence (living or nonliving))
Overlap with lock (sense 1): artifact, whole, object, physical entity and entity are shared hypernyms
6. House sense 2
- Sense 2
- house -- (an official assembly having legislative powers; "a bicameral legislature has two houses")
=> legislature, legislative assembly, legislative, general assembly, law-makers -- (persons who make or amend or repeal laws)
=> assembly -- (a group of persons gathered together for a common purpose)
=> gathering, assemblage -- (a group of persons together in one place)
=> social group -- (people sharing some social relation)
=> group, grouping -- (any number of entities (members) considered as a unit)
=> abstraction -- (a general concept formed by extracting common features from specific examples)
=> abstract entity -- (an entity that exists only abstractly)
=> entity -- (that which is perceived or known or inferred to have its own distinct existence (living or nonliving))
7. House sense 11
- Sense 11
- sign of the zodiac, star sign, sign, mansion, house, planetary house -- ((astrology) one of 12 equal areas into which the zodiac is divided)
=> region, part -- (the extended spatial location of something; "the farming regions of France"; "religions in all parts of the world"; "regions of outer space")
=> location -- (a point or extent in space)
=> object, physical object -- (a tangible and visible entity; an entity that can cast a shadow; "it was full of rackets, balls and other objects")
=> physical entity -- (an entity that has physical existence)
=> entity -- (that which is perceived or known or inferred to have its own distinct existence (living or nonliving))
Overlap with lock (sense 1): only object, physical entity and entity are shared hypernyms
8. Measures of Semantic Relatedness: Resnik
- The Resnik measure
- Information content (IC) based relatedness measure
- Concepts specific to particular topics have higher information content; more general concepts have lower information content
- Carving fork: HIGH IC; entity: LOW IC
- The idea is that two concepts are semantically related in proportion to the amount of information they share
9. Sense marked corpora: SemCor
- <s snum=3>
- <wf cmd=ignore pos=PRP>He</wf>
- <wf cmd=done pos=VB lemma=succeed wnsn=2 lexsn=2:41:01::>succeeds</wf>
- <wf cmd=done rdf=person pos=NNP lemma=person wnsn=1 lexsn=1:03:00:: pn=person>Buck_Shaw</wf>
- <punc>,</punc>
- <wf cmd=ignore pos=WP>who</wf>
- <wf cmd=done pos=VB lemma=retire wnsn=1 lexsn=2:41:01::>retired</wf>
- <wf cmd=ignore pos=IN>at</wf>
- <wf cmd=ignore pos=DT>the</wf>
- <wf cmd=done pos=NN lemma=end wnsn=2 lexsn=1:28:00::>end</wf>
- <wf cmd=ignore pos=IN>of</wf>
- <wf cmd=done pos=JJ lemma=last wnsn=1 lexsn=5:00:00:past:00>last</wf>
- <wf cmd=done pos=NN lemma=season wnsn=1 lexsn=1:28:02::>season</wf>
- <punc>.</punc>
- </s>
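Sense-tagged text of this kind is what makes it possible to count how often each sense occurs. The following is a minimal sketch of pulling (lemma, sense number) pairs out of such markup, assuming the word-form elements carry unquoted `attr=value` pairs with `cmd`, `lemma` and `wnsn` attributes as in SemCor. The sample lines and the helper name `tagged_senses` are illustrative assumptions, not part of any SemCor tooling.

```python
import re
from collections import Counter

# Assumed sample in SemCor-style markup; attribute values are unquoted.
SAMPLE = """<wf cmd=ignore pos=PRP>He</wf>
<wf cmd=done pos=VB lemma=succeed wnsn=2 lexsn=2:41:01::>succeeds</wf>
<wf cmd=done pos=NN lemma=season wnsn=1 lexsn=1:28:02::>season</wf>
<punc>.</punc>"""

WF = re.compile(r"<wf ([^>]*)>([^<]*)</wf>")   # one word-form element
ATTR = re.compile(r"(\w+)=(\S+)")              # one attr=value pair

def tagged_senses(text):
    """Yield (lemma, wnsn) for every word form that carries a sense tag."""
    for attrs, _surface in WF.findall(text):
        fields = dict(ATTR.findall(attrs))
        if fields.get("cmd") == "done" and "wnsn" in fields:
            yield fields["lemma"], fields["wnsn"]

# Frequency of each (lemma, sense number) pair in the sample.
counts = Counter(tagged_senses(SAMPLE))
print(counts)
```

A regular expression is used rather than an XML parser because unquoted attribute values are not well-formed XML.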
10. Measures of Semantic Relatedness
- Considers the position of nouns in the is-a hierarchy
- Semantic relatedness (SR) is determined by the information content of the lowest common concept that subsumes both concepts
- For example, Nickel and Dime are subsumed by Coin; Nickel and Credit card by Medium of Exchange
- $P(c)$ is the probability of encountering concept $c$
- If $a$ is-a $b$, then $P(a) \le P(b)$
- Information content is calculated by the formula
- $IC(c) = -\log P(c)$
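The effect of the negative logarithm is easy to see numerically: as probability grows up the is-a chain, information content falls, reaching zero at the root. The probabilities below are made-up values chosen only to respect the monotonicity condition; they are not corpus estimates.

```python
import math

# Hypothetical probabilities that increase from the specific to the general concept.
P = {"nickel": 0.001, "coin": 0.01, "medium_of_exchange": 0.05, "entity": 1.0}

def information_content(p):
    """IC = -log P, written as log(1/P); a concept with P = 1 carries no information."""
    return math.log(1.0 / p)

chain = ["nickel", "coin", "medium_of_exchange", "entity"]
ic_values = [information_content(P[c]) for c in chain]
for concept, value in zip(chain, ic_values):
    print(f"{concept}: {value:.3f}")
```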
11. Measures of Semantic Relatedness
- Thus relatedness is given by
- $sim_{res}(c_1, c_2) = IC(LCS(c_1, c_2))$, where LCS is the lowest common subsumer
- Does not consider the information content of the concepts themselves, nor the path length
- A problem is that many concept pairs may have the same subsumer and thus get the same score
- May give high scores on the basis of inappropriate word senses, e.g. tobacco and horse
- Newer methods include the Jiang-Conrath, Lin and Leacock-Chodorow measures
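A small sketch of the measure on the Nickel / Dime / Credit card example follows. It assumes a single-parent taxonomy (WordNet allows multiple parents, in which case the most informative common subsumer would be chosen). The `PARENT` links and probabilities are hypothetical.

```python
import math

# Hypothetical is-a links (child -> parent) and concept probabilities.
PARENT = {
    "nickel": "coin",
    "dime": "coin",
    "coin": "medium_of_exchange",
    "credit_card": "medium_of_exchange",
    "medium_of_exchange": "entity",
}
P = {"nickel": 0.001, "dime": 0.001, "coin": 0.01,
     "credit_card": 0.002, "medium_of_exchange": 0.05, "entity": 1.0}

def subsumers(concept):
    """The concept itself followed by all of its ancestors, bottom-up."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def lowest_common_subsumer(c1, c2):
    """First concept on c1's upward chain that also subsumes c2."""
    ancestors_of_c2 = set(subsumers(c2))
    for candidate in subsumers(c1):  # bottom-up, so the first hit is the lowest
        if candidate in ancestors_of_c2:
            return candidate
    raise ValueError("no common subsumer")

def sim_resnik(c1, c2):
    """Information content of the lowest common subsumer."""
    return math.log(1.0 / P[lowest_common_subsumer(c1, c2)])

print(lowest_common_subsumer("nickel", "dime"), sim_resnik("nickel", "dime"))
print(lowest_common_subsumer("nickel", "credit_card"), sim_resnik("nickel", "credit_card"))
```

In this toy taxonomy `sim_resnik("coin", "dime")` equals `sim_resnik("nickel", "dime")`, because both pairs share the subsumer coin. That is the "same score" problem noted above.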
12. In case of multiple senses
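The formula for this slide is not present in the text. A common word-level form, assumed here, takes the maximum concept similarity over all pairs of senses of the two words $w_1$ and $w_2$:

$$sim(w_1, w_2) = \max_{c_1 \in sen(w_1),\; c_2 \in sen(w_2)} sim(c_1, c_2)$$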
where sen(w) denotes the set of possible senses
for word w.
13. Relevant formulae
Classes(w) is the number of senses the word w has
Words(c) is the set of words subsumed (directly or indirectly) by the class c
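The formulae themselves are not present in the text. One formulation consistent with these two definitions, given here as an assumption rather than as the slide's exact content, spreads each word's corpus count evenly over its senses and normalizes by the total number of counted noun tokens $N$:

$$freq(c) = \sum_{w \in Words(c)} \frac{freq(w)}{Classes(w)}, \qquad P(c) = \frac{freq(c)}{N}$$

Here $freq(w)$ is the corpus count of word $w$; both $freq$ and $N$ are notation introduced for this sketch.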
14. Example of Resnik similarity in action
15. Structural characteristics of a hierarchical network
- Local network density (the number of child links that span out from a parent node)
- In the plant/flora section of WordNet, the hierarchy is very dense
- Depth of a node in the hierarchy
- Distance shrinks as one descends the hierarchy, since differentiation is based on finer and finer details
- Type of link
- The strength of an edge (link): corpus statistics have a role to play; theoretical soundness and computational efficiency are both needed
16. Link strength: probability and IC theoretic
- The strength of a child link is proportional to the conditional probability of encountering an instance of the child concept $c_i$ given an instance of its parent concept $p$
- $P(c_i \mid p)$
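Spelled out, and assuming that every instance of a child concept also counts as an instance of its parent (so that $P(c_i \cap p) = P(c_i)$), the conditional probability and the link strength $LS$ in the form used by Jiang and Conrath (1997) would read:

$$P(c_i \mid p) = \frac{P(c_i \cap p)}{P(p)} = \frac{P(c_i)}{P(p)}$$

$$LS(c_i, p) = -\log P(c_i \mid p) = IC(c_i) - IC(p)$$

On this reading, the strength of a link is the difference in information content between the child and its parent.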
17. Link strength
Intuition
Formulation
Actual formula
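The formulas for this slide are not present in the text. As a minimal sketch of how link strengths could combine into a distance, the code below assumes each link is weighted by IC(child) - IC(parent) and ignores the density, depth and link-type factors. Summing these weights along the path through the lowest common subsumer then gives IC(c1) + IC(c2) - 2 IC(LCS), the simplified Jiang-Conrath distance. The taxonomy, probabilities and function names are hypothetical.

```python
import math

# Hypothetical is-a links (child -> parent) and concept probabilities.
PARENT = {"nickel": "coin", "dime": "coin", "coin": "medium_of_exchange",
          "credit_card": "medium_of_exchange", "medium_of_exchange": "entity"}
P = {"nickel": 0.001, "dime": 0.001, "coin": 0.01,
     "credit_card": 0.002, "medium_of_exchange": 0.05, "entity": 1.0}

def ic(concept):
    """Information content, -log P(concept)."""
    return math.log(1.0 / P[concept])

def link_strength(child):
    """-log P(child | parent), which equals IC(child) - IC(parent)."""
    return ic(child) - ic(PARENT[child])

def path_to_root(concept):
    """The concept followed by its ancestors, bottom-up."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def lcs(c1, c2):
    """Lowest common subsumer in a single-parent taxonomy."""
    up2 = set(path_to_root(c2))
    return next(c for c in path_to_root(c1) if c in up2)

def jc_distance(c1, c2):
    """Sum of link strengths on the path from c1 up to the LCS and down to c2."""
    up1, up2 = path_to_root(c1), path_to_root(c2)
    common = lcs(c1, c2)
    # Each listed node contributes the link between itself and its parent.
    climbed = up1[:up1.index(common)] + up2[:up2.index(common)]
    return sum(link_strength(c) for c in climbed)

print(jc_distance("nickel", "dime"))
print(jc_distance("nickel", "credit_card"))
```

Unlike the Resnik score, this distance depends on the information content of the two concepts themselves, so pairs that share a subsumer no longer have to tie.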
18. What does all this buy us?
19. Correlations
20. PageRank
- Developed by Larry Page and Sergey Brin
- Link analysis algorithm that assigns a numerical weighting to each document in a hyperlinked set
- Measures the relative importance of a page within the set
- A link to a page is a vote of support that increases the rank of that page
- It is a probability distribution representing the likelihood that a person randomly clicking on links will end up on a specific page
21. PageRank based algorithm
- Assume the universe has 4 pages: A, B, C and D
- The initial value of every page is 0.25
- Now suppose B, C and D link only to A
- Rank of A is given by
- If B links to other pages as well, then rank of A is given by
- L(B) is the number of outbound links from B
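The two formulas referred to in this list are not present in the text. Under the stated assumptions, and writing $PR(X)$ for the rank of page X (notation assumed here), they would read as follows. When B, C and D link only to A, each passes its whole rank to A:

$$PR(A) = PR(B) + PR(C) + PR(D) = 0.25 + 0.25 + 0.25 = 0.75$$

When pages have several outbound links, each page's rank is split evenly among them:

$$PR(A) = \frac{PR(B)}{L(B)} + \frac{PR(C)}{L(C)} + \frac{PR(D)}{L(D)}$$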
22. PageRank based algorithm (contd.)
- The PageRank of U depends on the rank of each page V linking to U, divided by the number of links from V
- PageRank can be given by a general formula
- The sum in the formula runs over the pages that link to U
- Thus the PageRank values of all pages in the corpus sum to 1
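The general formula is not present in the text. In the usual notation, assumed here, with $B_u$ the set of pages that link to $u$ and $L(v)$ the number of outbound links from $v$, it is:

$$PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)}$$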
23. PageRank based algorithm (contd.)
- Damping factor: an imaginary surfer will stop clicking on links after some time
- d is the probability that the user will continue clicking
- The damping factor is estimated at 0.85 here
- The new PageRank formula using this is
- To get the actual rank of a page, this formula has to be iterated many times
- Problem of dangling links
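The damped formula is not present in the text. A commonly used form, assumed here, is $PR(u) = \frac{1-d}{N} + d \sum_{v \in B_u} \frac{PR(v)}{L(v)}$, with $N$ the number of pages. The sketch below iterates it on the four-page example. The handling of dangling pages, whose rank is shared equally among all pages, is one common fix and an assumption, not necessarily the one intended by the slide.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate the damped PageRank update; `links` maps a page to its outbound targets."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # uniform start, e.g. 0.25 for 4 pages
    for _ in range(iterations):
        # Rank held by dangling pages (no outbound links) is spread over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - d) / n + d * dangling / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            for target in targets:
                # Each outbound link passes on an equal share of the page's rank.
                new_rank[target] += d * rank[page] / len(targets)
        rank = new_rank
    return rank

# B, C and D link only to A; A has no outbound links, so it is a dangling page.
LINKS = {"A": [], "B": ["A"], "C": ["A"], "D": ["A"]}
ranks = pagerank(LINKS)
for page, value in ranks.items():
    print(page, round(value, 4))
```

Because the dangling rank is redistributed rather than dropped, the ranks keep summing to 1 on every iteration.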