Semantic Vectors: A Scalable Open Source Package and Online Technology Management Application

Provided by: lrec
Learn more at: http://www.lrec-conf.org

1
Semantic Vectors: A Scalable Open Source Package and Online Technology Management Application
  • LREC Conference
  • 28th May, 2008

Dominic Widdows, Google, Inc., widdows_at_google.com
Kathleen Ferraro, University of Pittsburgh, kaf1_at_pitt.edu
2
Natural Language Software Engineering: Three Problems
  • Software is often hard to use / unreliable
      • t(fiddling with computers) >> t(analysing data)
  • Does it scale?
      • Moore's Law of Data: any algorithm more costly than linear hurts more every day!
  • What is it for?
      • Systems / components
      • Interesting (science) / useful (engineering)

3
Semantic Vector Models
  • Count how many times words occur in some context
      • Term-Document matrix
      • LSA (Latent Semantic Analysis)
  • Or count how many times words co-occur with one another
      • HAL (Hyperspace Analogue to Language)
  • Normally we reduce dimensions somehow
      • SVD, NNMR, LDA.
  • Many uses
      • IR, WSD, OL / LA, DS / TDT, OCIM, DC, ... , Acronym Resolution.
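
The counting step on this slide can be sketched in a few lines of Java. This is a minimal illustration, not code from the Semantic Vectors package: the class name `TermDocumentSketch`, its methods, and the three-sentence toy corpus are all assumptions made for the example. It builds a term-document count matrix and compares two terms by the cosine of their rows, with no dimension reduction applied.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: class and method names are hypothetical,
// not part of the Semantic Vectors package.
class TermDocumentSketch {

    /** Maps each term to a row of counts, one column per document. */
    static Map<String, int[]> termDocumentMatrix(List<String> docs) {
        Map<String, int[]> matrix = new TreeMap<>();
        for (int d = 0; d < docs.size(); d++) {
            for (String token : docs.get(d).toLowerCase().split("\\W+")) {
                if (token.isEmpty()) {
                    continue; // split can yield an empty leading token
                }
                int[] row = matrix.computeIfAbsent(token, t -> new int[docs.size()]);
                row[d]++;
            }
        }
        return matrix;
    }

    /** Cosine similarity between two count rows; 0 if either row is all zeros. */
    static double cosine(int[] a, int[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        // Toy corpus, invented for this example.
        List<String> docs = Arrays.asList(
            "the cat sat on the mat",
            "the dog sat on the log",
            "stocks fell on the market");
        Map<String, int[]> matrix = termDocumentMatrix(docs);
        System.out.println("the -> " + Arrays.toString(matrix.get("the")));
        System.out.println("cos(cat, sat) = " + cosine(matrix.get("cat"), matrix.get("sat")));
        System.out.println("cos(cat, dog) = " + cosine(matrix.get("cat"), matrix.get("dog")));
    }
}
```

In a real corpus the rows have as many columns as there are documents, which is why the dimension reduction mentioned on this slide matters.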

4
Semantic Vectors Package
  • http://semanticvectors.google.com/
  • Created by University of Pittsburgh and MAYA Design
  • All Java (with some Perl / Python / PHP wrappers)
  • Maintained by Google 20% project + other contributors
  • BSD license: you can use it.
  • Nearly 1000 downloads
  • Developer group, Wiki, mailing list, ...
  • Child of Infomap, with lessons learned

5
Challenge 1: Make it Easy!
  • 100% Java
  • Dependencies include Apache Lucene
  • Installation
      • User
          • Download jarfiles
          • Add to your CLASSPATH
          • Assemble a corpus (example provided)
          • Type java pitt.search.semanticvectors.BuildModel
      • Developer
          • Install SVN, Ant
          • Check out source (Google Code helps)
          • Install JUnit for testing
  • We have had no reports of difficulty yet!

6
Challenge 2: Make it Scale!
  • Dimension reduction and parallelization are key
  • Random Projection
      • Geometric alternatives: SVD (orthogonality)
      • Probabilistic alternatives: PLSA, LDA (generative models)
  • Sparse Random Vectors, e.g.
      • 0,0,0,1,0,-1,0,0,0,-1,0,0,0,0,1,0,0
      • 0,-1,0,0,0,1,0,0,0,0,1,0,0,0,0,-1,0
  • On average, dot products are nearly zero, so vectors are nearly orthogonal.
  • Approximate benefits of SVD, with none of the cost!
  • Believed to be trivially parallelizable and incremental (TODO)
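
The near-orthogonality claim can be checked empirically with a small Java sketch. This is an illustration under assumptions, not the package's implementation: the class name `SparseRandomVectorSketch`, the dimension (1000), the number of non-zero entries (10), and the seed are all chosen for the example. It generates sparse vectors with entries in {+1, 0, -1} and reports the average absolute cosine between random pairs.

```java
import java.util.Random;

// Illustrative sketch only: names and parameter values are hypothetical,
// not taken from the Semantic Vectors package.
class SparseRandomVectorSketch {

    /** Returns a vector with nonZero entries set alternately to +1 and -1 at random positions. */
    static int[] sparseRandomVector(int dimension, int nonZero, Random rng) {
        if (nonZero > dimension) {
            throw new IllegalArgumentException("nonZero must not exceed dimension");
        }
        int[] v = new int[dimension];
        int placed = 0;
        while (placed < nonZero) {
            int i = rng.nextInt(dimension);
            if (v[i] != 0) {
                continue; // position already used, draw again
            }
            v[i] = (placed % 2 == 0) ? 1 : -1;
            placed++;
        }
        return v;
    }

    /** Cosine similarity; 0 if either vector is all zeros. */
    static double cosine(int[] a, int[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        Random rng = new Random(42); // fixed seed so runs are repeatable
        int pairs = 200;
        double total = 0;
        for (int p = 0; p < pairs; p++) {
            int[] a = sparseRandomVector(1000, 10, rng);
            int[] b = sparseRandomVector(1000, 10, rng);
            total += Math.abs(cosine(a, b));
        }
        System.out.println("average |cosine| over " + pairs + " pairs: " + total / pairs);
    }
}
```

Because each vector is generated independently of the others, new documents or terms can be assigned vectors without recomputing existing ones, which is the intuition behind the "incremental" remark on the slide.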

7
Challenge 3: Make it Useful!
  • Hardest of the three problems
  • Technology Matching at UPitt
      • http://real.hsls.pitt.edu/
      • Matches technology disclosures to documents harvested from company websites
      • Traditionally needs much more than keywords
  • Does your data meet your needs?

8
Features and Demos ...
  • Negation, Disjunction
      • Quantum / Vector Logic
  • Translation
      • Bilingual Vector Models
  • Semantic Vector Products
      • Direct, Tensor, Convolution, Subspace
  • Clustering
      • k-means
  • Context Window Approach (HAL)
      • Thanks to Trevor Cohen, ASU, Biomedical Informatics
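
Vector negation is commonly formulated as projection onto the orthogonal complement: "a NOT b" is the part of a that is orthogonal to b. The sketch below shows that formulation in Java. It is an assumption that this matches what the slide has in mind, and the class name `VectorNegationSketch` and its methods are hypothetical, not the package's API.

```java
// Illustrative sketch only: a hypothetical helper, not the Semantic Vectors API.
class VectorNegationSketch {

    static double dot(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Returns a minus its projection onto b, so the result is orthogonal to b. */
    static double[] negate(double[] a, double[] b) {
        double bb = dot(b, b);
        double[] result = a.clone();
        if (bb == 0) {
            return result; // nothing to remove when b is the zero vector
        }
        double scale = dot(a, b) / bb;
        for (int i = 0; i < result.length; i++) {
            result[i] -= scale * b[i];
        }
        return result;
    }

    public static void main(String[] args) {
        double[] a = {1.0, 1.0, 0.0};
        double[] b = {0.0, 1.0, 0.0};
        double[] aNotB = negate(a, b);
        System.out.println(java.util.Arrays.toString(aNotB));
        System.out.println("dot with b: " + dot(aNotB, b));
    }
}
```

In a search setting, a would be a query term vector and b the vector of a sense or topic to exclude; results are then ranked against the negated vector.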

9
Mathematics, Technology, Cognition
  • Geometry, Probability, Logic Intersection
      • "That one term should be included in another as in a whole is the same as for the other to be predicated of all of the first."
      • Prior Analytics (Bk I, Ch 1)
  • The equations work ... does the method?
      • "It is the mark of an educated man to look for precision in each class of things just so far as the nature of the subject admits; it is evidently equally foolish to accept probable reasoning from a mathematician and to demand from a rhetorician scientific proofs."
      • Nicomachean Ethics (Bk I, Ch 3)
  • What do people do?
      • "By nature animals are born with the faculty of sensation, and from sensation memory is produced in some of them, though not in others ... Now from memory experience is produced in humans; for the several memories of the same thing produce finally the capacity for a single experience."
      • Metaphysics (Bk I, Ch 1)

10
Many Thanks ...
  • ELRA and the LREC conference
  • Developers of Java, Lucene, Ant, JUnit, ...
  • Google, University of Pittsburgh
  • Harris, Firth, Van Rijsbergen, Salton, McGill, Landauer, Deerwester, Berry, Dumais, Schütze, Lund, Burgess, Sahlgren, Karlgren, Kaufmann, Dorow, Cederberg, Hofmann, Kanerva, Plate, Papadimitriou, McArthur, Bruza, ...

11
And finally ...
  • http://infomap.stanford.edu/book
      • Introduction to Vectors, WordSpace, Quantum Logic, etc.
      • A few for sale here ... 150 ?.?.
  • Download the package ...
      • Google(Semantic Vectors)