CSC 9010: Text Mining Applications - Document-Level Techniques
1
CSC 9010 Text Mining Applications
Document-Level Techniques
  • Dr. Paula Matuszek
  • Paula_A_Matuszek@glaxosmithkline.com
  • (610) 270-6851

2
Dealing with Documents
  • Sometimes our information need is not for
    something specific which we can capture in a
    clear-cut knowledge model
  • What is the current research in secure networks?
  • What are our competitors working on?
  • Who should review this paper?
  • These kinds of questions are more typically
    answered by techniques which look at the entire
    document, or set of documents.
  • Categorizing
  • Clustering
  • Visualizing

3
Document Categorization
  • Document categorization
  • Assign documents to pre-defined categories
  • Examples
  • Process email into work, personal, junk
  • Process documents from a newsgroup into
    interesting, not interesting, spam and
    flames
  • Process transcripts of bugged phone calls into
    relevant and irrelevant
  • Issues
  • Real-time?
  • How many categories/document? Flat or
    hierarchical?
  • Categories defined automatically or by hand?

4
Document Categorization
  • Usually
  • relatively few categories
  • well defined: a person could do the task easily
  • Categories don't change quickly
  • Flat vs Hierarchy
  • Simple categorization is into mutually-exclusive
    document collections
  • Richer categorization is into hierarchy with
    multiple inheritance
  • broader and narrower categories
  • documents can go more than one place
  • merges into a search engine with category browsers

5
Categorization -- Automatic
  • Statistical approaches similar to search engine
  • Set of training documents define categories
  • Underlying representation of document is a bag of
    words / TF-IDF variant
  • Category description is created using neural
    nets, regression trees, other Machine Learning
    techniques
  • Individual documents categorized by the net,
    inferred rules, etc. (see the sketch below)
  • Requires relatively little effort to create
    categories
  • Accuracy is heavily dependent on "good" training
    examples
  • Typically limited to flat, mutually exclusive
    categories
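
A minimal sketch of this statistical pipeline, assuming
scikit-learn (the slides name no particular toolkit, and the
categories and training texts below are invented):

    # Sketch: bag-of-words / TF-IDF representation feeding a simple
    # machine-learning classifier trained on hand-labeled examples.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_docs = [
        "quarterly budget meeting moved to Tuesday",   # work
        "dinner with the family this weekend",         # personal
        "you have won a free prize, click here now",   # junk
    ]
    train_labels = ["work", "personal", "junk"]

    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(train_docs, train_labels)
    print(model.predict(["team meeting about the project budget"]))

In practice the training set would contain many labeled
documents per category; accuracy hinges on how representative
those examples are.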

6
Categorization -- Manual
  • Natural Language/linguistic techniques
  • Categories are defined by people
  • underlying representation of document is stream
    of tokens
  • category description contains
  • ontology of terms and relations
  • pattern-matching rules
  • individual documents categorized by
    pattern-matching
  • Defining categories can be very time-consuming
  • Typically takes some experimentation to "get it
    right"
  • Can handle much more complex structures

7
Document Classification
  • Document classification
  • Cluster documents based on similarity
  • Examples
  • Group samples of writing in an attempt to
    determine author(s)
  • Look for hot spots in customer feedback
  • Find new trends in a document collection
    (outliers, hard to classify)
  • Getting into areas where we don't know ahead of
    time what we will have: true mining

8
Document Classification -- How
  • Typical process is
  • Describe each document
  • Assess similarities among documents
  • Establish classification scheme which creates
    optimal "separation"
  • One typical approach
  • document is represented as term vector
  • cosine similarity for measuring association
  • bottom-up pairwise combining of documents to get
    clusters (sketched below)
  • Assumes you have the corpus in hand
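
One way to realize this, sketched in Python (scikit-learn,
SciPy, and the mini-corpus are all assumptions): TF-IDF term
vectors, cosine similarity as the association measure, and
bottom-up pairwise merging into clusters.

    # Sketch: term vectors + cosine similarity + agglomerative clustering.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from scipy.spatial.distance import pdist
    from scipy.cluster.hierarchy import linkage, fcluster

    docs = [  # corpus assumed to be in hand
        "secure network protocols and encryption research",
        "encryption keys for secure networks",
        "golf tournament results and player rankings",
    ]

    vectors = TfidfVectorizer().fit_transform(docs).toarray()
    distances = pdist(vectors, metric="cosine")       # 1 - cosine similarity

    tree = linkage(distances, method="average")       # bottom-up pairwise merging
    print(fcluster(tree, t=2, criterion="maxclust"))  # one cluster label per document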

9
Document Clustering
  • Approaches vary a great deal in
  • document characteristics used to describe the
    document (linguistic or semantic? BOW?)
  • methods used to define "similar"
  • methods used to create clusters
  • Other relevant factors
  • Number of clusters to extract is variable
  • Often combined with visualization tools based on
    similarity and/or clusters
  • Sometimes important that approach be incremental
  • Useful approach when you don't have a handle on
    the domain or it's changing

10
Document Visualization
  • Visualization
  • Visually display relationships among documents
  • Examples
  • hyperbolic viewer based on document similarity to
    browse a field of scientific documents
  • map-based techniques showing peaks, valleys,
    outliers
  • graphs showing relationships between companies
    and research areas
  • Highly interactive, intended to aid a human in
    finding interrelationships and new knowledge in
    the document set.

11
Latent Semantic Analysis
  • Bag of Words methods we have looked at ignore
    syntax -- A document is "about" the words in it
  • People interpret documents in a richer context
  • a document is about some domain
  • reflected in the vocabulary
  • but not limited to it

12
Match Topic and Phrase
  • I saw Pathfinder on Mars with a telescope.
  • The Pathfinder photograph mars our perception of
    a lifeless planet.
  • The Pathfinder photograph from Ford has arrived.
  • When a Pathfinder fords a river it sometimes mars
    its paint job.
  • Astronomy
  • Automobiles
  • Biology

13
Domain-Based Processing
  • This task is relatively easy because we know a
    lot about all of the domains, and can
    disambiguate using that knowledge.
  • It's not completely trivial: the biology choice
    could also have been astronomy.
  • Information Extraction systems like GATE and
    AeroText model the domain knowledge explicitly,
    but this takes a lot of effort.
  • Is there an easier way?

14
Word Co-Occurrences
  • BOW approaches assume meaning is carried by
    vocabulary, ignore syntax
  • Domain modeling approaches capture detailed
    knowledge about the meaning
  • An intermediate position is to look at vocabulary
    groups: what words tend to occur together?
  • Still a statistical approach, but richer
    representation than single terms

15
Examples of What We Would Like
  • Looking for articles about Tiger Woods in an AP
    newswire database brings up stories about the
    golfer, followed by articles about golf
    tournaments that don't mention his name.
  • Constraining the search to days when no articles
    were written about Tiger Woods still brings up
    stories about golf tournaments and well-known
    players.
  • So we are recognizing that Tiger Woods is about
    golf.
  • javelina.cet.middlebury.edu/lsa/out/lsa_definition.htm

16
Example
  • Tiger Woods takes some drama out of cut streak
    with opening round at Funai.
  • Every player on the money list is at Disney
    trying to make it to the Tour Championship.
    Tiger Woods has no such worries.
  • Going into this week's event, Tiger Woods has
    made the cut in 113 successive events. He tied
    the PGA Tour's consecutive cut record two weeks
    ago at the Funai Classic in Orlando, Florida,
    while Cink finished second.
  • Stewart Cink finished second at the Funai Classic
    at Walt Disney World.

17
Example, Cont.
  • Woods tended to occur in the same articles as Funai.
  • Cink also tended to occur in articles about Funai.
  • So there is a relationship between Woods and Cink
    which is stronger than is indicated just by the
    one article in which they are both mentioned.
  • It has to do with cuts, Funai, and the
    championship tour.
  • So by creating a term-document matrix and
    examining it we can find potential relationships
    which are latent, or hidden (illustrated below).
    They are tied together by the meaning, or
    semantics, of the terms.
  • This is the basic concept of Latent Semantic
    Analysis and Latent Semantic Indexing.
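
As a small illustration (the toolkit is an assumption, and
four passages are far too few for real latent structure), the
four example passages from the previous slide can be turned
into a term-document matrix and inspected:

    # Sketch: term-document matrix for the four example passages. Woods
    # and Cink share only one passage directly, but both co-occur with
    # Funai, which is the kind of indirect tie LSA exploits.
    from sklearn.feature_extraction.text import CountVectorizer

    passages = [
        "Tiger Woods takes some drama out of cut streak with opening "
        "round at Funai.",
        "Every player on the money list is at Disney trying to make it "
        "to the Tour Championship. Tiger Woods has no such worries.",
        "Going into this week's event, Tiger Woods has made the cut in "
        "113 successive events. He tied the PGA Tour's consecutive cut "
        "record two weeks ago at the Funai Classic in Orlando, Florida, "
        "while Cink finished second.",
        "Stewart Cink finished second at the Funai Classic at Walt "
        "Disney World.",
    ]

    vectorizer = CountVectorizer()
    doc_term = vectorizer.fit_transform(passages).toarray()  # documents x terms
    vocab = vectorizer.vocabulary_

    for term in ("woods", "cink", "funai"):
        print(term, doc_term[:, vocab[term]])  # counts in each of the 4 passages

On a real collection, this matrix (suitably weighted and
reduced) is what LSA works from.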

18
Problem: Very High Dimensionality
  • A vector of TF-IDF weights representing a document
    is high-dimensional.
  • If we start looking at a matrix of terms by
    documents, it gets even worse.
  • Need some way to trim words looked at
  • First, throw away anything "not useful"
  • Second, identify clusters and pick representative
    terms

19
Throw Away
  • Most domain semantics carried by nouns,
    adjectives, verbs, adverbs
  • throw away prepositions, articles, conjunctions,
    pronouns
  • Very frequent words don't add to domain
    semantics.
  • throw away common verbs (go, be, see),
    adjectives (big, good, bad), adverbs (very)
  • throw away words which appear in most documents
  • Very infrequent words don't add to domain
    semantics either
  • throw away terms which only appear in one
    document (see the sketch below)
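
These pruning rules map directly onto vectorizer options; a
sketch assuming scikit-learn (the thresholds are illustrative,
and part-of-speech filtering would need a separate tagger):

    # Sketch: trim the vocabulary before building the term-document matrix.
    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [  # hypothetical documents
        "secure network research in encrypted networks",
        "network encryption research for secure protocols",
        "golf tournament players and the golf tour",
        "the tour championship attracts golf players",
    ]

    vectorizer = TfidfVectorizer(
        stop_words="english",  # drop articles, prepositions, pronouns, ...
        max_df=0.9,            # drop terms appearing in most documents
        min_df=2,              # drop terms appearing in only one document
    )
    matrix = vectorizer.fit_transform(corpus)
    print(sorted(vectorizer.vocabulary_))  # surviving, presumably meaningful terms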

20
What's Left
  • A condensed matrix where we can assume that most
    terms are meaningful.
  • It's still very large, and very sparse.
  • Basic index table for a keyword search tool.
  • Where can we go now?
  • We have fewer concepts than terms
  • So move from terms to concepts
  • So identify clusters and pick representative
    terms

21
Singular Value Decomposition
  • One approach to this is called Singular Value
    Decomposition.
  • We have a term space of thousands of dimensions,
    with each document a vector in that space.
  • Want to project or map those dimensions onto a
    smaller number of dimensions in such a way that
    relative distance among vectors is preserved as
    much as possible.
  • We end up with a much smaller number of
    dimensions, and a vector for each document of its
    values on those dimensions (sketched below)
  • For a detailed explanation
  • http://www.acm.org/sigmm/MM98/electronic_proceedings/huang/node4.html
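
A minimal sketch using a truncated SVD (scikit-learn's
TruncatedSVD is one common way to do this; the toolkit, the
toy corpus, and the choice of k are all assumptions here):

    # Sketch: project TF-IDF document vectors from thousands of term
    # dimensions down to k latent dimensions with a truncated SVD.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    corpus = [  # hypothetical documents
        "the feline climbed upon the roof",
        "a cat leapt onto a house",
        "the final exam will be on a thursday",
        "students study for the final test",
    ]

    tfidf = TfidfVectorizer().fit_transform(corpus)  # documents x terms
    svd = TruncatedSVD(n_components=2)               # k = 2 latent dimensions
    doc_vectors = svd.fit_transform(tfidf)           # one k-vector per document
    print(doc_vectors.shape)                         # (4, 2)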

22
Dimension Reduction
  • For an n (words) x m (documents) matrix M
  • SVD finds the least-squares best U (n x k)
  • Rows of U map input features (words) to encoded
    features (concept clusters)
  • Closely related to
  • symmetric eigenvalue decomposition
  • factor analysis
  • principal component analysis
  • Subroutine in many math packages.

Source (slides 22-33): www.cs.princeton.edu/courses/archive/fall01/cs302/notes/12-5/LSI.ppt
23
LSI/LSA
  • Latent semantic indexing is the application of
    SVD to IR.
  • Latent semantic analysis is the more general
    term.
  • Features are words, examples are text passages.
  • Latent: not visible on the surface
  • Semantic: word meanings

24
Geometric View
  • Words embedded in high-d space.

[Figure: words exam, test, and fish shown in the space, with the values 0.02, 0.42, and 0.01]
25
Comparison to VSM
  • AThe feline climbed upon the roof
  • BA cat leapt onto a house
  • CThe final will be on a Thursday
  • How similar?
  • Vector space model sim(A,B)0
  • LSI sim(A,B).49sim(A,C).45
  • Non-zero sim with no words in common by overlap
    in reduced representation.
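
A toy reconstruction of the comparison (scikit-learn is an
assumption; with such a tiny corpus the numbers will not match
the .49 and .45 above, which came from a model trained on a
large collection):

    # Sketch: raw vector-space similarity vs similarity in a reduced
    # LSA space. The background passages are invented so the SVD has
    # some co-occurrence evidence to work with.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    A = "The feline climbed upon the roof"
    B = "A cat leapt onto a house"
    C = "The final will be on a Thursday"
    background = [
        "the cat sat on the roof of the house",
        "a feline is a kind of cat",
        "the final exam is on thursday",
    ]

    tfidf = TfidfVectorizer(stop_words="english").fit_transform([A, B, C] + background)
    print("VSM sim(A,B):", cosine_similarity(tfidf[0], tfidf[1])[0, 0])  # 0.0, no shared words

    lsa = TruncatedSVD(n_components=2).fit_transform(tfidf)
    print("LSA sim(A,B):", cosine_similarity(lsa[:1], lsa[1:2])[0, 0])   # typically non-zero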

26
What Does LSI Do?
  • Let's send it to school.

27
Plato's Problem
  • A 7th grader learns 10-15 new words today, fewer
    than 1 by direct instruction. Perhaps 3 were
    even encountered. How can this be?
  • Plato: You already knew them.
  • LSA: Many weak relationships combined (data to
    back it up!)
  • Rate comparable to students.

28
Vocabulary
  • TOEFL synonym test
  • Choose alternative with highest similarity score.
  • LSA correct on 64 of 80 items.
  • Matches the average applicant to a US college. Its
    mistakes correlate with people's mistakes (r = .44).
  • Best solo measure of intelligence

29
Multiple Choice Exam
  • Trained on psych textbook.
  • Given same test as students.
  • LSA: 60%, lower than average, but passes.
  • Has trouble with hard ones.

30
Essay Test
  • LSA can't write.
  • If you can't do, judge.
  • Students write essays, LSA trained on related
    text.
  • Compare similarity and length with graded essays
    (labeled).
  • Cosine-weighted average of top 10. Regression to
    combine similarity and length (sketched below).
  • Correlation .64-.84. Better than human. Bag of
    words!?
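
A rough sketch of such a grader (a reconstruction from the
bullets above, not the authors' code; the names and setup are
assumptions):

    # Sketch: score a new essay as the cosine-weighted average grade of
    # its 10 most similar already-graded essays (all vectors living in
    # the same LSA space), then combine that score with essay length.
    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    def similarity_score(new_vec, graded_vecs, grades, k=10):
        grades = np.asarray(grades)
        sims = cosine_similarity(new_vec.reshape(1, -1), graded_vecs)[0]
        top = np.argsort(sims)[-k:]          # the k most similar graded essays
        return np.dot(sims[top], grades[top]) / sims[top].sum()

    # Final grade: regress human grades on [similarity_score, essay length]
    # over a held-out set of graded essays (e.g. a simple linear regression).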

31
Digit Representations
  • Look at similarities of all pairs from one to
    nine.
  • Look at the best fit of these similarities in one
    dimension: they come out in order!
  • Similar experiments with cities in Europe in two
    dimensions.

32
Word Sense
  • The chemistry student knew this was not a good
    time to forget how to calculate volume and mass.
  • heavy? .21
  • church? .14
  • LSI picks best p

33
LSApplications
  • Improve IR.
  • Cross-language IR. Train on parallel collection.
  • Measure text coherency.
  • Use essays to pick educational text.
  • Grade essays.
  • Visualize word clusters
  • Demos at http://LSA.colorado.edu

34
LSI Background Reading
  • Landauer, Laham, and Foltz (1998). Learning
    human-like knowledge by Singular Value
    Decomposition: A Progress Report. Advances in
    Neural Information Processing Systems 10,
    pp. 44-51.
  • http://lsa.colorado.edu/papers/nips.ps