Text Data Mining - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: Text Data Mining


1
Text Data Mining
  • Prof. Marti Hearst
  • UC Berkeley SIMS
  • Guest Lecture, ME 290M
  • Prof. Agogino
  • May 4, 1999

2
There's Lots of Text Out There
  • Is it Information Overload?

3
  • Why not TURBO-Text?
  • How can we SYNTHESIZE what's there to make new
    discoveries?

4
Talk Outline
  • Definitions
  • What is Data Mining?
  • What is Text Data Mining?
  • Text data mining examples
  • Lexical knowledge acquisition
  • Merging textual records
  • Finding cures for diseases (from medical
    literature)
  • Future Directions

5
What is Data Mining? (Fayyad & Uthurusamy 96,
Fayyad 97)
  • Fitting models to or determining patterns from
    very large datasets.
  • A regime which enables people to interact
    effectively with massive data stores.
  • Deriving new information from data.
  • finding patterns across large datasets
  • discovering heretofore unknown information

6
What is Data Mining?
  • Potential point of confusion
  • The extracting ore from rock metaphor does not
    really apply to the practice of data mining
  • If it did, then standard database queries would
    fit under the rubric of data mining
  • Find all employee records in which the employee
    earns $300/month less than their manager
  • In practice, DM refers to
  • finding patterns across large datasets
  • discovering heretofore unknown information

7
Why Data Mining?
  • Because the data is there.
  • Because current DBMS technology does not support
    data analysis.
  • Because
  • larger disks
  • faster cpus
  • high-powered visualization
  • networked information
  • are becoming widely available.

8
DM Touchstone Applications (CACM 39 (11) Special
Issue)
  • Finding patterns across data sets
  • Reports on changes in retail sales
  • to improve sales
  • Patterns of sizes of TV audiences
  • for marketing
  • Patterns in NBA play
  • to alter, and so improve, performance
  • Deviations in standard phone calling behavior
  • to detect fraud
  • for marketing

9
DM Touchstone Applications (CACM 39 (11) Special
Issue)
  • Separating signal from noise
  • Classifying faint astronomical objects
  • Finding genes within DNA sequences
  • Discovering novel tectonic activity

10
What is Text Data Mining?
  • People's first thought
  • Make it easier to find things on the Web.
  • This is information retrieval!
  • The metaphor of extracting ore from rock does
    make sense for extracting documents of interest
    from a huge pile.
  • But does not reflect notions of DM in practice
  • finding patterns across large collections
  • discovering heretofore unknown information

11
Text DM ≠ IR
  • Data Mining
  • Patterns, Nuggets, Exploratory Analysis
  • Information Retrieval
  • Finding and ranking documents that match the
    user's information need
  • ad hoc query
  • filtering/standing query
  • Rarely Patterns, Exploratory Analysis

12
Real Text DM
  • The point:
  • Discovering heretofore unknown information is not
    what we usually do with text.
  • (If it weren't known, it could not have been
    written by someone.)
  • However
  • There is a field whose goal is to learn about
    patterns in text for its own sake ...

13
Computational Linguistics
  • Goal: automated language understanding
  • this isn't possible
  • instead, go for subgoals, e.g.,
  • word sense disambiguation
  • phrase recognition
  • semantic associations
  • Current approach
  • statistical analyses of very large text
    collections

14
WordNet: A Lexical Database
A list of hypernyms for each sense of "crow" (a sketch of querying this follows)
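The slide itself is a WordNet screenshot. Purely as an illustration, and assuming
NLTK's modern WordNet interface (not part of the original 1999 talk), a minimal
sketch that prints one hypernym chain for each noun sense of "crow":

  # Minimal sketch; assumes NLTK is installed and the WordNet corpus is available.
  import nltk
  nltk.download("wordnet", quiet=True)
  from nltk.corpus import wordnet as wn

  for synset in wn.synsets("crow", pos=wn.NOUN):
      print(synset.name(), "-", synset.definition())
      # One path from the root of the noun hierarchy to this sense,
      # printed upward, e.g. crow -> corvine bird -> ... -> entity.
      path = synset.hypernym_paths()[0]
      print("   " + " -> ".join(s.name() for s in reversed(path)))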
15
Lexicographic Knowledge Acquisition
  • Given a large lexical database ...
  • WordNet: Miller, Fellbaum et al. at Princeton
  • http://www.cogsci.princeton.edu/wn
  • and a huge text collection
  • How to automatically add new relations?

16
Idea: Use Simple Lexico-Syntactic Analysis
  • Patterns of the following type work (a regex
    sketch follows this slide)
  • NP0 such as NP1, NP2, ..., (and | or) NPi
  • i > 1, implies
  • for all NPi, i > 1, hyponym(NPi, NP0)
  • Example
  • Agar is a substance prepared from a mixture of
    red algae, such as Gelidium, for laboratory or
    industrial use.
  • implies hyponym(Gelidium, red algae)
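As a rough sketch only, the "such as" pattern can be approximated with a regular
expression in Python; single words stand in for noun phrases here, so this skips
the phrase parsing a real system needs:

  import re

  # "NP0 such as NP1, NP2, ..., (and|or) NPi" approximated with a regex;
  # NPs are crudely treated as one or two words, so this is only a sketch.
  SUCH_AS = re.compile(
      r"(?P<np0>\w+(?:\s+\w+)?)\s*,?\s+such\s+as\s+"
      r"(?P<nps>\w+(?:\s*,\s*\w+)*(?:\s*,?\s*(?:and|or)\s+\w+)?)",
      re.IGNORECASE,
  )

  def hyponym_candidates(sentence):
      """Return (hyponym, hypernym) pairs suggested by one 'such as' match."""
      m = SUCH_AS.search(sentence)
      if not m:
          return []
      np0 = m.group("np0")
      # Normalize "A, B, and C" / "A and B" into a plain comma list.
      nps = re.sub(r"\s+(and|or)\s+", ", ", m.group("nps"))
      return [(np.strip(), np0) for np in nps.split(",") if np.strip()]

  print(hyponym_candidates(
      "Felonies such as shootings and stabbings rose sharply last year."))
  # -> [('shootings', 'Felonies'), ('stabbings', 'Felonies')]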

17
More Examples
  • "Felonies, such as shootings and stabbings ..."
    implies
  • hyponym(shootings, felonies)
  • hyponym(stabbings, felonies)
  • Is this in the WordNet hierarchy?

18
Linking Killing to Felonies
19
Another Example
  • Einstein is (was) a physicist.
  • Is/was he a genius?

20
Making Einstein a Genius
21
Results from the "such as" lexico-syntactic relation
22
Results with the "or other" lexico-syntactic
relation
23
Procedure
  • Discover a pattern that indicates a lexical
    relationship
  • Scan through a large collection; extract
    sentences that match the pattern
  • Extract the NPs from the sentence
  • requires some phrase parsing
  • Check if the suggested relation is in WordNet or
    not (a sketch of this check follows)
  • this part not automated, but could be
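The WordNet check in the last step could look roughly like the following sketch,
again assuming NLTK's WordNet interface (an assumption, not part of the original
work):

  from nltk.corpus import wordnet as wn

  def in_wordnet(hyponym, hypernym):
      """True if some noun sense of hyponym already has hypernym
      among its (transitive) hypernyms in WordNet."""
      hyper_synsets = set(wn.synsets(hypernym, pos=wn.NOUN))
      for syn in wn.synsets(hyponym, pos=wn.NOUN):
          # closure() walks the hypernym links transitively toward the root.
          if set(syn.closure(lambda s: s.hypernyms())) & hyper_synsets:
              return True
      return False

  print(in_wordnet("crow", "bird"))        # True: already in the hierarchy
  print(in_wordnet("shooting", "felony"))  # likely False: a candidate new relation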

24
Discovering New Patterns
  • Suggested algorithm (a sketch follows this list)
  • Decide on a lexical relation of interest, e.g.,
    hyponymy
  • Derive a list of word pairs from WordNet that are
    known to hold that relation
  • e.g., (crow, bird)
  • Extract sentences from the text collection in
    which both terms occur
  • Find commonalities among the lexico-syntactic
    contexts
  • Test these out against other word pairs known to
    hold the relationship in WordNet
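A toy sketch of the middle steps, assuming that "sentences" is some iterable over
the collection and that crude string matching is acceptable; all names and data
here are hypothetical:

  import re
  from collections import Counter

  def candidate_contexts(sentences, word_pairs):
      """Tally the text found between known hyponym/hypernym word pairs;
      frequent contexts (e.g. 'such as') are candidate new patterns."""
      contexts = Counter()
      for sentence in sentences:
          lowered = sentence.lower()
          for hypo, hyper in word_pairs:
              # Try both orders: "... hypernym <ctx> hyponym ..." and the reverse.
              for a, b in ((hyper, hypo), (hypo, hyper)):
                  m = re.search(
                      rf"\b{re.escape(a)}s?\b(.{{1,40}}?)\b{re.escape(b)}s?\b",
                      lowered)
                  if m:
                      contexts[m.group(1).strip()] += 1
      return contexts

  pairs = [("crow", "bird"), ("ant", "insect")]
  docs = ["Birds such as crows and ravens are highly social.",
          "Insects such as ants and bees are everywhere."]
  print(candidate_contexts(docs, pairs).most_common())
  # -> [('such as', 2)]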

25
Text Merging Example: Discovering Hypocritical
Congresspersons
26
Discovering Hypocritical Congresspersons
  • Feb 1, 1996
  • US House of Reps votes to pass the
    Telecommunications Reform Act
  • this contains the CDA (Communications Decency
    Act)
  • violators subject to fines of $250,000 and 5
    years in prison
  • eventually struck down by the courts

27
Discovering Hypocritical Congresspersons
  • Sept 11, 1998
  • US House of Reps votes to place the Starr report
    online
  • the content would (most likely) have violated the
    CDA
  • 365 people were members for both votes
  • 284 members voted aye both times
  • 185 (94%) Republicans voted aye both times
  • 96 (57%) Democrats voted aye both times

28
(No Transcript)
29
(No Transcript)
30
How to find Hypocritical Congresspersons?
  • This must have taken a lot of work
  • Hand cutting and pasting
  • Lots of picky details
  • Some people voted on one but not the other bill
  • Some people share the same name
  • Check for different county/state
  • Still messed up on Bono
  • Taking stats at the end on various attributes
  • Which state
  • Which party
  • Tools should help streamline, reuse results

31
How to find Hypocritical Congresspersons?
  • The hard part?
  • Knowing to compare these two sets of voting
    records (a toy record-merging sketch follows).
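A toy sketch of that comparison, with made-up vote records keyed by (name, state)
to reduce the shared-name problem noted on the previous slide:

  from collections import Counter

  # Hypothetical vote tables: (name, state) -> (party, vote).
  cda_votes = {
      ("Smith", "TX"): ("R", "aye"),
      ("Jones", "CA"): ("D", "aye"),
      ("Lee",   "NY"): ("D", "nay"),
  }
  starr_votes = {
      ("Smith", "TX"): ("R", "aye"),
      ("Jones", "CA"): ("D", "nay"),
      ("Lee",   "NY"): ("D", "aye"),
  }

  # Members present for both votes who voted aye both times, tallied by party.
  both_aye = Counter()
  for member in cda_votes.keys() & starr_votes.keys():
      party, vote1 = cda_votes[member]
      _, vote2 = starr_votes[member]
      if vote1 == vote2 == "aye":
          both_aye[party] += 1

  print(dict(both_aye))  # -> {'R': 1}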

32
How to find causes of disease? Don Swanson's
Medical Work
  • Given
  • medical titles and abstracts
  • a problem (incurable rare disease)
  • some medical expertise
  • find causal links among titles
  • symptoms
  • drugs
  • results

33
Swanson Example (1991)
  • Problem: Migraine headaches (M)
  • stress associated with M
  • stress leads to loss of magnesium
  • calcium channel blockers prevent some M
  • magnesium is a natural calcium channel blocker
  • spreading cortical depression (SCD) implicated
    in M
  • high levels of magnesium inhibit SCD
  • M patients have high platelet aggregability
  • magnesium can suppress platelet aggregability
  • All extracted from medical journal titles (a toy
    linking sketch follows)
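A toy sketch of the linking idea, with invented title strings and a crude notion
of a term: words that occur in both the migraine titles and the magnesium titles
are candidate intermediate links (Swanson's "B" terms):

  import re
  from itertools import chain

  # Invented journal-title strings, standing in for two literatures.
  migraine_titles = [
      "Stress and the onset of migraine attacks",
      "Spreading cortical depression implicated in migraine",
      "Platelet aggregability in migraine patients",
  ]
  magnesium_titles = [
      "Stress-induced loss of magnesium",
      "Magnesium inhibits spreading cortical depression",
      "Magnesium suppresses platelet aggregability",
  ]

  def terms(title):
      # Crude term extraction: lowercase words longer than four letters.
      return {w for w in re.findall(r"[a-z]+", title.lower()) if len(w) > 4}

  a_terms = set(chain.from_iterable(terms(t) for t in migraine_titles))
  c_terms = set(chain.from_iterable(terms(t) for t in magnesium_titles))

  # Intermediate terms appearing in both literatures suggest an indirect link.
  print(sorted((a_terms & c_terms) - {"migraine", "magnesium"}))
  # -> ['aggregability', 'cortical', 'depression', 'platelet', 'spreading', 'stress']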

34
Swanson's TDM
  • Two of his hypotheses have received some
    experimental verification.
  • His technique
  • Only partially automated
  • Required medical expertise
  • Few people are working on this.

35
How to Automate This?
  • Idea: mixed-initiative interaction
  • User applies tools to help explore the hypothesis
    space
  • System runs suites of algorithms to help explore
    the space, suggest directions

36
Our Proposed Approach
  • Three main parts
  • UI for building/using strategies
  • Backend for interfacing with various databases
    and translating different formats
  • Content analysis/machine learning for figuring
    out good hypotheses and throwing out bad ones

37
The UI part
  • Need support for building strategies
  • Mixed-initiative system
  • Trade-off between user-initiated hypothesis
    exploration and system-initiated suggestions
  • Information visualization
  • Another way to show lots of choices

38
(Interface mockup: Candidate Associations, Suggested Strategies, Current Retrieval Results)
39
LINDI: Linking Information for Novel Discovery
and Insight
  • Just starting up now (fall 98)
  • Initial work: Hao Chen, Ketan Mayer-Patel,
    Shankar Raman

40
Ore-Filled Text Collections
  • Congressional Voting Records
  • Answer questions like
  • Who are the most hypocritical congresspeople?
  • Medical Articles
  • Create hypotheses about causes of rare diseases
  • Create hypotheses about gene function
  • Patent Law
  • Answer questions like
  • Is government funding of research worthwhile?

41
Summary
  • Text Data Mining
  • Extracting heretofore undiscovered information
    from large text collections
  • Not the same as information retrieval
  • Examples
  • Lexicographic knowledge acquisition
  • Merging of text representations
  • Linking related information
  • The truth is out there!