Title: GLSA Server PARC


1
GLSA Server @ PARC
  • Christiaan Royer, Ayman Farahat, Peter Pirolli
  • Presenter: Raluca Budiu (budiu@parc.com)

2
Functionality
http://glsa.parc.com
5
Outline
  • Similarity
  • PMI and strength of association
  • Dimensionality reduction
  • Corpus
  • Parameters
  • ACT-R Interface
  • Future work

6
Strength of Association

The association between words reflects their
odds of occurring together
7
Semantic Similarity Math
(Pointwise Mutual Information)
(Farahat, Pirolli, & Markova, 2004; Pirolli, 2005)
8
Information Retrieval
  • Pointwise mutual information between two words

In general, pointwise mutual information is the
reduction in uncertainty of one random variable
due to knowing about the other
(Manning and Schütze, 1999)
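The formula from this slide did not survive in the transcript. For reference, the standard textbook definition of pointwise mutual information between two words x and y is given below; the symbols P(x), P(y), and P(x, y) are generic notation and not necessarily the notation used on the original slide.

```latex
\mathrm{PMI}(x, y) = \log \frac{P(x, y)}{P(x)\,P(y)}
```

The value is zero when the two words occur independently, positive when they co-occur more often than chance would predict, and negative when they co-occur less often.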
9
Computing PMIs
  • Estimate probabilities using frequency counts
  • of words in a large corpus of N documents
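A minimal sketch of this estimate, assuming that a word's probability is the fraction of the N documents that contain it and that the joint probability is the fraction of documents containing both words. The function names, the whitespace tokenization, and the base-2 logarithm are illustrative assumptions, not details taken from the GLSA server.

```python
import math
from collections import Counter
from itertools import combinations


def document_counts(docs):
    """Count, for each word and each word pair, the documents containing it."""
    word_df = Counter()
    pair_df = Counter()
    for doc in docs:
        words = set(doc.lower().split())  # assumption: whitespace tokenization
        word_df.update(words)
        pair_df.update(frozenset(p) for p in combinations(sorted(words), 2))
    return word_df, pair_df


def pmi(x, y, word_df, pair_df, n_docs):
    """PMI of words x and y, estimated from document frequencies."""
    if word_df[x] == 0 or word_df[y] == 0:
        raise ValueError("both words must occur in the corpus")
    joint = word_df[x] if x == y else pair_df[frozenset((x, y))]
    if joint == 0:
        return float("-inf")  # the words never co-occur
    # log [ (joint/N) / ((df_x/N) * (df_y/N)) ] = log [ joint * N / (df_x * df_y) ]
    return math.log2(joint * n_docs / (word_df[x] * word_df[y]))


# Toy corpus (hypothetical data, for illustration only)
docs = ["banana fruit", "banana fruit", "car road", "car engine"]
word_df, pair_df = document_counts(docs)
print(pmi("banana", "fruit", word_df, pair_df, len(docs)))
print(pmi("banana", "car", word_df, pair_df, len(docs)))
```

In practice, zero co-occurrence counts are common in a sparse corpus, so a real system would need some policy for them (smoothing or a floor value); the sketch simply returns negative infinity.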

10
Dimensionality Reduction
Given a vocabulary V of v words:
1. Build a v x v matrix R of strengths of association (PMIs)
2. Reduce the dimension of R to v x k (k = number of eigenvectors kept)
3. Compute similarity using the cosine measure
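The three steps above can be sketched in plain Python. This is an illustration under stated assumptions rather than the server's implementation: the association matrix is taken to be symmetric, the top k eigenvectors are found by power iteration with deflation (a simple stand-in for a library eigensolver), and each word's reduced vector is its row of eigenvector components scaled by the matching eigenvalue. All function names and the toy matrix are hypothetical.

```python
import math
import random


def mat_vec(matrix, vec):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def top_eigenpairs(r, k, iters=500, seed=0):
    """Top-k eigenpairs (largest magnitude) of a symmetric matrix r,
    computed by power iteration with deflation."""
    n = len(r)
    work = [list(row) for row in r]  # copy, so r itself is not modified
    rng = random.Random(seed)
    pairs = []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = mat_vec(work, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        lam = sum(a * b for a, b in zip(v, mat_vec(work, v)))  # Rayleigh quotient
        pairs.append((lam, v))
        # Deflate: remove the component just found before looking for the next one.
        for i in range(n):
            for j in range(n):
                work[i][j] -= lam * v[i] * v[j]
    return pairs


def reduce_rows(r, k):
    """Step 2: map each of the v words to a k-dimensional vector."""
    pairs = top_eigenpairs(r, k)
    return [[lam * vec[i] for lam, vec in pairs] for i in range(len(r))]


def cosine(a, b):
    """Step 3: cosine similarity between two reduced word vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


# Step 1 stand-in: a tiny symmetric association matrix (hypothetical values).
R = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 0.0],
     [0.0, 0.0, 0.5]]
vectors = reduce_rows(R, k=2)
print(cosine(vectors[0], vectors[1]))
```

For this toy matrix the first two words end up with a cosine of about 0.8 once the smallest eigen-direction is dropped. Power iteration is adequate for a demonstration, but for a vocabulary of realistic size a sparse eigensolver or truncated SVD would be the usual choice.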
11
Dimensionality Reduction
  • Other techniques of dimensionality reduction can
    be used (e.g., Hellinger metric)
  • Dimensionality reduction is a smoothing that
    takes into account similarity to other terms
    while measuring the similarity between a specific
    pair of terms

12
The Corpus
  • The first 10 million pages of the Stanford
    Webbase project (generated by web crawling)

www.microsoft.com     10002
www.google.com        10002
www.w3.org            10001
www.whitehouse.gov    10002
www.apple.com         2
www.epa.gov           10002
www.yahoo.com         237
www.cdc.gov           10002
www.pbs.org           10002
www.un.org            10001
www.access.gpo.gov    2

http://dbpubs.stanford.edu:8091/testbed/doc2/WebBase/
13
Number of Eigen Values
  • Too high: overfitting (i.e., taking too much
    noise into account)
  • Too low: overgeneralization
  • For several synonymy problem spaces,
    n = 150-400 gave the best results

14
Why Not Use Just PMIs?
  • GLSA was compared with PMI on several synonymy
    tests (TOEFL, TS1, TS2), using a different
    corpus made of New York Times articles
  • It achieved a performance of 70% or better (80% for
    TOEFL and TS2; 70% for TS1)
  • PMI alone was consistently worse (65-70%),
    although still comparable with humans

15
ACT-R Format
(chunk-type meaning)

(defun sym-add-ia-fct (chunk1 chunk2 ia)
  (if ia
      (first (eval `(no-output (add-ia (,chunk1 ,chunk2 ,ia)
                                       (,chunk2 ,chunk1 ,ia)))))
      0))

(defmacro sym-add-ia (chunk1 chunk2 ia)
  `(sym-add-ia-fct ',chunk1 ',chunk2 ',ia))

(add-dm (banana isa meaning))
(add-dm (fruit isa meaning))
(add-dm (vegetable isa meaning))
(add-dm (citrus isa meaning))

(sym-add-ia banana banana 1.0000008)
(sym-add-ia banana fruit 0.08276595)
(sym-add-ia banana vegetable -0.013203714)
(sym-add-ia banana citrus 0.06277088)

16
;; Declare a meaning chunk type
(chunk-type meaning)

;; Declare a function that maps GLSA values onto IAs
;; (it can be modified)
(defun sym-add-ia-fct (chunk1 chunk2 ia)
  (if ia
      (first (eval `(no-output (add-ia (,chunk1 ,chunk2 ,ia)
                                       (,chunk2 ,chunk1 ,ia)))))
      0))

;; Add words to memory as chunks of type meaning
(add-dm (banana isa meaning))
(add-dm (fruit isa meaning))

;; Set IAs
(sym-add-ia banana banana 1.0000008)
(sym-add-ia banana fruit 0.08276595)
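If similarity values are already available as (word, word, value) triples, the ACT-R commands shown on this slide can be generated mechanically. The following is a small sketch; the function name actr_source and the triple layout are hypothetical, and the output assumes that the sym-add-ia definitions have been loaded separately in the model.

```python
def actr_source(triples):
    """Build ACT-R source text from (word1, word2, similarity) triples."""
    # Collect each distinct word once, in order of first appearance.
    words = []
    for w1, w2, _ in triples:
        for w in (w1, w2):
            if w not in words:
                words.append(w)
    lines = ["(chunk-type meaning)"]
    lines += [f"(add-dm ({w} isa meaning))" for w in words]
    lines += [f"(sym-add-ia {w1} {w2} {value!r})" for w1, w2, value in triples]
    return "\n".join(lines)


# Values copied from the slide above
triples = [
    ("banana", "banana", 1.0000008),
    ("banana", "fruit", 0.08276595),
]
print(actr_source(triples))
```

Declaring every word with add-dm before any sym-add-ia call keeps the generated file loadable in one pass.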
17
Text Format
  • More convenient if you want to parse it into your
    own format

banana banana 1.0000008
banana fruit 0.08276595
banana vegetable -0.013203714
banana citrus 0.06277088
banana tropical 0.066639036
banana nut 0.035970457
banana grain 0.020719754
banana monkey 0.081182115
banana elephant -0.028544774
banana eat -0.0042282715
banana person 0.03531764
fruit fruit 1.0000001
fruit vegetable 0.24325794
fruit citrus 0.18533087
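Parsing the text format into your own structure could look like the sketch below. It assumes the output is a whitespace-separated stream of "word word value" triples, so it does not depend on where the line breaks fall, and it assumes the similarity is symmetric, as the cosine measure implies. The function name parse_glsa_text is hypothetical.

```python
def parse_glsa_text(text):
    """Parse 'word1 word2 value' triples into a dict keyed by word pair."""
    tokens = text.split()
    if len(tokens) % 3 != 0:
        raise ValueError("expected a sequence of 'word word value' triples")
    sims = {}
    for i in range(0, len(tokens), 3):
        w1, w2 = tokens[i], tokens[i + 1]
        value = float(tokens[i + 2])
        sims[(w1, w2)] = value
        sims[(w2, w1)] = value  # assumption: similarity is symmetric
    return sims


# Sample values copied from the slide
sample = """
banana banana 1.0000008
banana fruit 0.08276595
banana vegetable -0.013203714
fruit vegetable 0.24325794
"""
sims = parse_glsa_text(sample)
print(sims[("fruit", "banana")])
```

Storing both orderings of each pair costs extra memory but makes lookups independent of argument order.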
18
Future Work
  • Corpus expansion and modification
  • Collect misses and add them periodically to the
    corpus
  • See what documents need to be removed from the
    corpus and what documents need to be added
  • Analyze word frequency and decide whether it is
    representative of the web's or a person's vocabulary
  • TASA corpus as an option
  • How does raw PMI compare with GLSA?
  • Have a raw PMI server available
  • Add word frequency counts
  • Add the option to upload a file

19
Suggestions?