Title: Semantic Vectors: A Scalable Open Source Package and Online Technology Management Application
1 Semantic Vectors: A Scalable Open Source Package and Online Technology Management Application
- LREC Conference
- 28th May, 2008
Dominic Widdows, Google, Inc., widdows_at_google.com
Kathleen Ferraro, University of Pittsburgh, kaf1_at_pitt.edu
2 Natural Language Software Engineering: Three Problems
- Software is often hard to use / unreliable
- t (fiddling with computers) >> t (analysing data)
- Does it scale?
- Moore's Law of Data: any algorithm more costly than linear hurts more every day!
- What is it for?
- Systems / components
- Interesting (science) / useful (engineering)
3 Semantic Vector Models
- Count how many times words occur in some context
- Term-Document matrix
- LSA (Latent Semantic Analysis)
- Or count how many times words co-occur with one another
- HAL (Hyperspace Analogue to Language)
- Normally we reduce dimensions somehow
- SVD, NNMR, LDA.
- Many uses
- IR, WSD, OL / LA, DS / TDT, OCIM, DC, ... ,
Acronym Resolution.
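The counting step on this slide can be sketched in a few lines of Java. This is a minimal illustration, not code from the Semantic Vectors package: the class name `TermDocumentSketch`, the whitespace tokenizer, and the three toy documents are all assumptions made for the example. Each term gets a row of per-document counts, and rows are compared by cosine similarity.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a raw term-document count matrix, no dimension reduction.
final class TermDocumentSketch {

    /** Maps each term to a vector of counts, one entry per document. */
    static Map<String, int[]> termDocumentMatrix(List<String> docs) {
        Map<String, int[]> matrix = new LinkedHashMap<>();
        for (int d = 0; d < docs.size(); d++) {
            // Naive tokenizer (assumption): lowercase and split on whitespace.
            for (String token : docs.get(d).toLowerCase().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                matrix.computeIfAbsent(token, t -> new int[docs.size()])[d]++;
            }
        }
        return matrix;
    }

    /** Cosine similarity of two count vectors; 0.0 if either vector is all zeros. */
    static double cosine(int[] a, int[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        // Toy corpus, invented for illustration.
        List<String> docs = Arrays.asList(
            "river bank water",
            "bank money loan",
            "money loan interest");
        Map<String, int[]> m = termDocumentMatrix(docs);
        // "money" and "loan" occur in exactly the same documents.
        System.out.println(cosine(m.get("money"), m.get("loan")));
        // "river" and "money" never share a document.
        System.out.println(cosine(m.get("river"), m.get("money")));
    }
}
```

With a real corpus the matrix is far too large and sparse to use directly, which is why the slide goes on to mention dimension reduction.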
4 Semantic Vectors Package
- http://semanticvectors.google.com/
- Created by University of Pittsburgh and MAYA Design
- All Java (with some Perl / Python / PHP wrappers)
- Maintained by Google 20% project and other contributors
- BSD license: you can use it.
- Nearly 1000 downloads
- Developer group, Wiki, mailing list, ...
- Child of Infomap with lessons learned
5 Challenge 1: Make it Easy!
- 100% Java
- Dependencies include Apache Lucene
- Installation
- User
- Download jarfiles
- Add to your CLASSPATH
- Assemble a corpus (example provided)
- Type java pitt.search.semanticvectors.BuildModel
- Developer
- Install SVN, Ant
- Checkout source (Google code helps)
- Install JUnit for testing
- We have had no reports of difficulty yet!
6 Challenge 2: Make it Scale!
- Dimension reduction and parallelization are key
- Random Projection
- Geometric alternatives: SVD (orthogonality)
- Probabilistic alternatives: PLSA, LDA (generative models)
- Sparse Random Vectors, e.g.
- 0,0,0,1,0,-1,0,0,0,-1,0,0,0,0,1,0,0
- 0,-1,0,0,0,1,0,0,0,0,1,0,0,0,0,-1,0
- On average, dot products are nearly zero, so vectors are nearly orthogonal.
- Approximate benefits of SVD, with none of the cost!
- Believed to be trivially parallelizable and incremental (TODO)
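The near-orthogonality claim can be checked with a small experiment. The sketch below is an illustration under stated assumptions, not the package's own implementation: the class name `SparseRandomVectorSketch`, the dimension of 1000, the five +1 and five -1 entries per vector, and the fixed seed are all hypothetical choices. It generates pairs of sparse ternary vectors like the two shown on the slide and reports the mean absolute cosine between them. For reference, the two 17-dimensional vectors on the slide overlap in one position and have dot product -1.

```java
import java.util.Random;

// Hypothetical sketch: sparse ternary random vectors and their pairwise cosines.
final class SparseRandomVectorSketch {

    /** Returns a vector with `seeds` entries set to +1, `seeds` set to -1, the rest 0. */
    static int[] sparseRandomVector(int dim, int seeds, Random rng) {
        if (2 * seeds > dim) {
            throw new IllegalArgumentException("need 2 * seeds <= dim");
        }
        int[] v = new int[dim];
        int placed = 0;
        while (placed < 2 * seeds) {
            int i = rng.nextInt(dim);
            if (v[i] != 0) {
                continue; // position already taken, draw again
            }
            v[i] = (placed < seeds) ? 1 : -1;
            placed++;
        }
        return v;
    }

    /** Cosine similarity; 0.0 if either vector is all zeros. */
    static double cosine(int[] a, int[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        Random rng = new Random(42);               // fixed seed, arbitrary
        int dim = 1000, seeds = 5, trials = 1000;  // illustrative parameters
        double sumAbs = 0;
        for (int t = 0; t < trials; t++) {
            int[] a = sparseRandomVector(dim, seeds, rng);
            int[] b = sparseRandomVector(dim, seeds, rng);
            sumAbs += Math.abs(cosine(a, b));
        }
        System.out.println("mean |cosine| over random pairs = " + sumAbs / trials);
    }
}
```

Because each vector is generated independently of the others, nothing has to be factorized or recomputed when new data arrives, which is consistent with the slide's remark that the approach should parallelize and update incrementally.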
7 Challenge 3: Make it Useful!
- Hardest of the three problems
- Technology Matching at UPitt
- http://real.hsls.pitt.edu/
- Matches technology disclosures to documents harvested from company websites
- Traditionally needs much more than keywords
- Does your data meet your needs?
8 Features and Demos ...
- Negation, Disjunction
- Quantum / Vector Logic
- Translation
- Bilingual Vector Models
- Semantic Vector Products
- Direct, Tensor, Convolution, Subspace
- Clustering
- k-Means
- Context Window Approach (HAL)
- Thanks to Trevor Cohen, ASU, Biomedical
Informatics
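The negation feature listed above is commonly formulated in vector logic as orthogonal projection: "a NOT b" is the part of a that is orthogonal to b. The sketch below shows that formulation only; it is not taken from the package, and the class name `VectorNegationSketch`, the method name `negate`, and the toy 3-dimensional vectors are assumptions for illustration.

```java
import java.util.Arrays;

// Hypothetical sketch: vector negation as projection onto the orthogonal complement.
final class VectorNegationSketch {

    static double dot(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Returns a minus its projection onto b, so the result is orthogonal to b. */
    static double[] negate(double[] a, double[] b) {
        double bb = dot(b, b);
        if (bb == 0) {
            return a.clone(); // b is the zero vector: nothing to remove
        }
        double scale = dot(a, b) / bb;
        double[] out = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] - scale * b[i];
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up vectors: "bank" has a "river" component and one other component.
        double[] bank = {1.0, 1.0, 0.0};
        double[] river = {1.0, 0.0, 0.0};
        double[] bankNotRiver = negate(bank, river);
        System.out.println(Arrays.toString(bankNotRiver));
        System.out.println(dot(bankNotRiver, river)); // orthogonal to "river"
    }
}
```

A query vector built this way has zero similarity to the negated term, so the unwanted sense is removed from the query itself instead of being filtered out of the results afterwards.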
9 Mathematics, Technology, Cognition
- Geometry, Probability, Logic Intersection
- "That one term should be included in another as in a whole is the same as for the other to be predicated of all of the first."
- Prior Analytics (Bk I, Ch 1)
- The equations work ... does the method?
- "It is the mark of an educated man to look for precision in each class of things just so far as the nature of the subject admits; it is evidently equally foolish to accept probable reasoning from a mathematician and to demand from a rhetorician scientific proofs."
- Nicomachean Ethics (Bk I, Ch 3)
- What do people do?
- "By nature animals are born with the faculty of sensation, and from sensation memory is produced in some of them, though not in others ... Now from memory experience is produced in humans; for the several memories of the same thing produce finally the capacity for a single experience."
- Metaphysics (Bk I, Ch 1)
10 Many Thanks ...
- ELRA and the LREC conference
- Developers of Java, Lucene, Ant, JUnit, ...
- Google, University of Pittsburgh
- Harris, Firth, Van Rijsbergen, Salton, McGill,
Landauer, Deerwester, Berry, Dumais, Schütze,
Lund, Burgess, Sahlgren, Karlgren, Kaufmann,
Dorow, Cederberg, Hofmann, Kanerva, Plate,
Papadimitriou, McArthur, Bruza, ...
11 And finally ...
- http://infomap.stanford.edu/book
- Introduction to Vectors, WordSpace, Quantum Logic, etc.
- A few for sale here ... 150 ?.?.
- Download the package ...
- Google(Semantic Vectors)