Text Tango: A New Text Data Mining Project


1
Text Tango: A New Text Data Mining Project
  • Marti A. Hearst
  • GUIR Meeting, Sept 17, 1998

2
Talk Outline
  • What is Data Mining?
  • What isn't Text Data Mining?
  • What is Text Data Mining?
  • Examples
  • A proposal for a system for Text Data Mining

3
What is Data Mining? (Fayyad & Uthurusamy 96, Fayyad 97)
  • Fitting models to or determining patterns from
    very large datasets.
  • A regime which enables people to interact
    effectively with massive data stores.
  • Deriving new information from data.
  • finding patterns across large datasets
  • discovering heretofore unknown information

4
What is Data Mining?
  • Potential point of confusion
  • The "extracting ore from rock" metaphor does not
    really apply to the practice of data mining
  • If it did, then standard database queries would
    fit under the rubric of data mining
  • Find all employee records in which the employee earns
    300/month less than their manager
  • In practice, DM refers to
  • finding patterns across large datasets
  • discovering heretofore unknown information
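
For contrast, the employee/manager request on this slide is an ordinary database query, not data mining. The sketch below shows one way it might look using Python's built-in sqlite3 module. The table layout, column names, and sample rows are assumptions made up for illustration, and "300/month less" is read here as "at least 300 less".

```python
import sqlite3

def underpaid_vs_manager(rows, gap=300):
    """Return names of employees earning at least `gap` less per month than their manager."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, "
        "monthly_salary INTEGER, manager_id INTEGER)"
    )
    con.executemany("INSERT INTO employee VALUES (?, ?, ?, ?)", rows)
    # A self-join pairs each employee with their manager's record.
    cur = con.execute(
        "SELECT e.name FROM employee e "
        "JOIN employee m ON e.manager_id = m.id "
        "WHERE m.monthly_salary - e.monthly_salary >= ? "
        "ORDER BY e.name",
        (gap,),
    )
    names = [name for (name,) in cur]
    con.close()
    return names

# Hypothetical sample data: (id, name, monthly_salary, manager_id)
SAMPLE = [
    (1, "Avery", 5000, None),
    (2, "Blake", 4800, 1),
    (3, "Casey", 4600, 1),
    (4, "Devon", 4400, 3),
]

print(underpaid_vs_manager(SAMPLE))
```

The answer is fully determined by the query and the stored rows. Nothing new is discovered, which is why such queries fall outside DM as the term is used in practice.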

5
DM Touchstone Applications (CACM 39 (11) Special Issue)
  • Finding patterns across data sets
  • Reports on changes in retail sales
  • to improve sales
  • Patterns of sizes of TV audiences
  • for marketing
  • Patterns in NBA play
  • to alter, and so improve, performance
  • Deviations in standard phone calling behavior
  • to detect fraud
  • for marketing

6
What is Text Data Mining?
  • People's first thought:
  • Make it easier to find things on the Web.
  • This is information retrieval!
  • The metaphor of extracting ore from rock does
    make sense for extracting documents of interest
    from a huge pile.
  • But does not reflect notions of DM in practice
  • finding patterns across large collections
  • discovering heretofore unknown information

7
Text DM != IR
  • Data Mining
  • Patterns, Nuggets, Exploratory Analysis
  • Information Retrieval
  • Finding and ranking documents that match users'
    information needs
  • ad hoc query
  • filtering/standing query

8
Real Text DM
  • What would finding a pattern across a large text
    collection really look like?

9
Bill Gates MS-DOS in the Bible!
From "The Internet Diary of the man who cracked the Bible Code,"
Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil
(William Gates, agitator, leader)

10
From "The Internet Diary of the man who cracked the Bible Code,"
Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil

11
Real Text DM
  • The point
  • Discovering heretofore unknown information is not
    what we usually do with text.
  • (If it weren't known, it could not have been
    written by someone.)
  • However
  • There are some interesting problems of this type!

12
Combining Data Types for Novel Tasks
  • Text + Links to find authority pages
    (Kleinberg at Cornell, Page at Stanford)
  • Usage + Time + Links to study evolution of web
    and information use (Pitkow et al. at PARC)

13
Ore-Filled Text Collections
  • Congressional Voting Records
  • Answer questions like
  • Who are the most hypocritical congresspeople?
  • Medical Articles
  • Create hypotheses about causes of rare diseases
  • Create hypotheses about gene function
  • Patent Law
  • Answer questions like
  • Is government funding of research worthwhile?

14
(No Transcript)

15
(No Transcript)

16
How to find Hypocritical Congresspersons?
  • This must have taken a lot of work
  • Hand cutting and pasting
  • Lots of picky details
  • Some people voted on one but not the other bill
  • Some people share the same name
  • Check for different county/state
  • Still messed up on Bono
  • Taking stats at the end on various attributes
  • Which state
  • Which party
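
The picky details listed on this slide (people who voted on only one bill, namesakes told apart by state, tallies by state and party) can be sketched in a few lines. This is only an illustration: the record layout, the sample legislators, and the rule "voted differently on the two bills" are hypothetical stand-ins for whatever comparison an analyst has in mind.

```python
from collections import Counter

def inconsistent_voters(bill_a, bill_b):
    """Compare two roll calls given as lists of (name, state, party, vote).

    Legislators are keyed by (name, state) so that namesakes from different
    states stay distinct. Anyone who voted on only one bill is skipped.
    Returns the flagged keys plus tallies by party and by state.
    """
    votes_a = {(n, s): (p, v) for n, s, p, v in bill_a}
    votes_b = {(n, s): (p, v) for n, s, p, v in bill_b}
    both = votes_a.keys() & votes_b.keys()
    flagged = sorted(k for k in both if votes_a[k][1] != votes_b[k][1])
    by_party = Counter(votes_a[k][0] for k in flagged)
    by_state = Counter(k[1] for k in flagged)
    return flagged, by_party, by_state

# Entirely fictitious roll calls, for illustration only.
BILL_A = [
    ("Smith", "CA", "D", "yea"),
    ("Smith", "TX", "R", "yea"),
    ("Jones", "NY", "D", "nay"),
    ("Lee", "OH", "R", "yea"),   # did not vote on the second bill
]
BILL_B = [
    ("Smith", "CA", "D", "yea"),
    ("Smith", "TX", "R", "nay"),
    ("Jones", "NY", "D", "nay"),
]

flagged, by_party, by_state = inconsistent_voters(BILL_A, BILL_B)
print(flagged, dict(by_party), dict(by_state))
```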

17
How to find causes of disease? Don Swanson's Medical Work
  • Given
  • medical titles and abstracts
  • a problem (incurable rare disease)
  • some medical expertise
  • find causal links among titles
  • symptoms
  • drugs
  • results

18
Swanson Example (1991)
  • Problem: Migraine headaches (M)
  • stress associated with M
  • stress leads to loss of magnesium
  • calcium channel blockers prevent some M
  • magnesium is a natural calcium channel blocker
  • spreading cortical depression (SCD) implicated in
    M
  • high levels of magnesium inhibit SCD
  • M patients have high platelet aggregability
  • magnesium can suppress platelet aggregability
  • All extracted from medical journal titles
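
The chain of links in this example pairs statements about migraine with statements about magnesium through a shared intermediate term. A minimal sketch of that idea follows; it is not Swanson's actual procedure, which the talk describes as only partially automated. The titles are paraphrases of the bullets above rather than real journal titles, the vocabulary list is an assumption, and matching is by plain substring.

```python
def bridge_terms(titles, a_term, c_term, vocabulary):
    """Return vocabulary terms that co-occur with a_term in some title
    and with c_term in some (possibly different) title."""
    a_side, c_side = set(), set()
    for title in titles:
        text = title.lower()
        present = {term for term in vocabulary if term in text}
        if a_term in text:
            a_side |= present
        if c_term in text:
            c_side |= present
    return sorted((a_side & c_side) - {a_term, c_term})

# Paraphrased from the slide, not actual article titles.
TITLES = [
    "Stress associated with migraine",
    "Stress leads to loss of magnesium",
    "Calcium channel blockers prevent some migraine",
    "Magnesium is a natural calcium channel blocker",
    "Spreading cortical depression implicated in migraine",
    "High levels of magnesium inhibit spreading cortical depression",
    "Migraine patients have high platelet aggregability",
    "Magnesium can suppress platelet aggregability",
]
VOCABULARY = [
    "stress",
    "calcium channel blocker",
    "spreading cortical depression",
    "platelet aggregability",
]

print(bridge_terms(TITLES, "migraine", "magnesium", VOCABULARY))
```

No single title mentions both migraine and magnesium. The connection only appears when the two sets of titles are put side by side.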

19
Swanson's TDM
  • Two of his hypotheses have received some
    experimental verification.
  • His technique
  • Only partially automated
  • Required medical expertise
  • Few people are working on this.

20
How to find functions of genes?
  • Important problem in molecular biology
  • Have the genetic sequence
  • Don't know what it does
  • But
  • Know which genes it coexpresses with
  • Some of these have known function
  • So: infer function based on the function of
    co-expressed genes
  • This is new work by Michael Walker and others at
    Incyte Pharmaceuticals
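
The inference step described on this slide can be sketched as a simple tally: collect the known functions of the genes that co-express with the unknown gene and rank them by how often they occur. This is a toy illustration, not the method used at Incyte. The gene names and the function labels "F1" and "F2" are placeholders with no biological meaning.

```python
from collections import Counter

def guess_functions(gene, coexpressed_with, known_function):
    """Rank candidate functions for `gene` by how many of its
    co-expressed genes carry each known annotation."""
    neighbours = coexpressed_with.get(gene, ())
    counts = Counter(
        known_function[g] for g in neighbours if g in known_function
    )
    return counts.most_common()

# Placeholder data: "mystery" co-expresses with four genes,
# three of which have a (made-up) known function label.
COEXPRESSED = {"mystery": ["geneA", "geneB", "geneC", "geneD"]}
KNOWN = {"geneA": "F1", "geneB": "F1", "geneC": "F2"}

print(guess_functions("mystery", COEXPRESSED, KNOWN))
```

A ranked list rather than a single answer fits the rest of the talk, where the output is a hypothesis to be checked against the literature and in the lab.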

21
Gene Co-expression: Role in the genetic pathway
[Diagram: pathway sketches with nodes Kall., PSA, PAP, and unknown genes g? and h?]
Other possibilities as well

22
Make use of the literature
  • Look up what is known about the other genes.
  • Different articles in different collections
  • Look for commonalities
  • Similar topics indicated by Subject Descriptors
  • Similar words in titles and abstracts
  • adenocarcinoma, neoplasm, prostate, prostatic
    neoplasms, tumor markers, antibodies ...
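
Looking for similar words across the literature on each gene can be approximated with a term count. The sketch below reports the words that appear in the documents of at least a given number of genes. The document snippets and the stopword list are invented for illustration, and a real system would presumably also use the Subject Descriptors mentioned above.

```python
import re
from collections import Counter

STOPWORDS = {"in", "and", "the", "of", "against"}

def shared_terms(docs_by_gene, min_genes=2):
    """Return words found in the literature of at least `min_genes` genes."""
    genes_per_term = Counter()
    for docs in docs_by_gene.values():
        terms = set()
        for text in docs:
            words = re.findall(r"[a-z]+", text.lower())
            terms.update(w for w in words if w not in STOPWORDS)
        # Each gene contributes at most one count per term.
        genes_per_term.update(terms)
    return sorted(t for t, n in genes_per_term.items() if n >= min_genes)

# Made-up title fragments, keyed by the gene they were retrieved for.
DOCS = {
    "Kall.": ["Tumor markers in prostate tissue"],
    "PSA": ["Prostate neoplasms and tumor markers"],
    "PAP": ["Antibodies against prostate antigens"],
}

print(shared_terms(DOCS))
print(shared_terms(DOCS, min_genes=3))
```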

23
Developing Strategies
  • Different strategies seem needed for different
    situations
  • First see what is known about Kallikrein.
  • 7341 documents. Too many
  • AND the result with disease category
  • If result is non-empty, this might be an
    interesting gene
  • Now get 803 documents
  • AND the result with PSA
  • Get 11 documents. Better!
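
The narrowing strategy on this slide is a sequence of Boolean ANDs, with the result size checked after each step. A minimal sketch over an in-memory index follows. The index contents and document ids are toy values and do not reproduce the counts quoted above.

```python
def narrow(index, terms):
    """AND the document sets for `terms` in order and record
    the size of the running result after each step."""
    result = None
    trace = []
    for term in terms:
        postings = index.get(term, set())
        result = set(postings) if result is None else result & postings
        trace.append((term, len(result)))
    return result, trace

# Toy inverted index: term -> set of document ids (hypothetical).
INDEX = {
    "kallikrein": {1, 2, 3, 4, 5, 6},
    "disease": {2, 3, 4, 9},
    "psa": {3, 4, 7},
}

docs, trace = narrow(INDEX, ["kallikrein", "disease", "psa"])
print(trace)
print(sorted(docs))
```

Keeping the trace makes it easy to see which step cut the result set down to a size a person can scan.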

24
Developing Strategies
  • Look for commonalities among these documents
  • Manual scan through 100 category labels
  • Would have been better if
  • Automatically organized
  • Intersections of important categories scanned
    for first

25
Try a new tack
  • Researcher uses knowledge of field to realize
    these are related to prostate cancer and
    diagnostic tests
  • New tack: intersect search on all three known
    genes
  • Hope they all talk about diagnostics and prostate
    cancer
  • Fortunately, 7 documents returned
  • Bingo! A relation to regulation of this cancer

26
Formulate a Hypothesis
  • Hypothesis: mystery gene has to do with
    regulation of expression of genes leading to
    prostate cancer
  • New tack: do some lab tests
  • See if mystery gene is similar in molecular
    structure to the others
  • If so, it might do some of the same things they
    do

27
Strategies again
  • In hindsight, combining all three genes was a
    good strategy.
  • Store this for later
  • Might not have worked
  • Need a suite of strategies
  • Build them up via experience and a good UI

28
The System
  • Doing the same query with slightly different
    values each time is time-consuming and tedious
  • Same goes for cutting and pasting results
  • IR systems don't support varying queries like
    this very well.
  • Each situation is a bit different
  • Some automatic processing is needed in the
    background to eliminate/suggest hypotheses
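
One way to take the tedium out of repeating a query with slightly different values is to let a script enumerate the variants and keep only the promising ones. The sketch below tries every combination of two or more terms and keeps the non-empty result sets that are small enough to read. It is an assumption about how such background processing might work, not a description of the proposed system. The index and the `max_hits` cutoff are hypothetical.

```python
from itertools import combinations

def try_combinations(index, terms, max_hits=10):
    """Run the AND query for every combination of two or more terms;
    keep combinations whose result is non-empty and at most max_hits."""
    kept = []
    for size in range(2, len(terms) + 1):
        for combo in combinations(terms, size):
            hits = set.intersection(*(index.get(t, set()) for t in combo))
            if 0 < len(hits) <= max_hits:
                kept.append((combo, sorted(hits)))
    return kept

# Toy index: term -> set of document ids (hypothetical).
GENE_INDEX = {
    "kall": {1, 2, 3},
    "psa": {2, 3, 4},
    "pap": {3, 5},
}

for combo, hits in try_combinations(GENE_INDEX, ["kall", "psa", "pap"], max_hits=1):
    print(combo, hits)
```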

29
The System
  • Three main parts
  • UI for building/using strategies
  • Backend for interfacing with various databases
    and translating different formats
  • Content analysis/machine learning for figuring
    out good hypotheses/throwing out bad ones

30
The UI part
  • Need support for building strategies
  • Lots of info lying around, so a nice option is
    ...
  • Two-handed interface
  • Big table display
  • Mixed-initiative system
  • Trade off between user-initiated hypotheses
    exploration and system-initiated suggestions
  • Information visualization
  • Another way to show lots of choices

31
Candidate Associations
Suggested Strategies
Current Retrieval Results

32
Other applications
  • Patent example
  • Political example
  • The truth's out there!

33
Text Tango
  • Just starting up now.
  • Let me know if you'd like to work on it!