CIS392 Text Retrieval - PowerPoint PPT Presentation

1
CIS392 Text Retrieval and Mining
  • Exploiting the Structure of Text
  • Material: Sullivan Ch 3 (excluding integration
    with data warehouses and the WWW) and Ch 8

2
Text-Oriented Business Intelligence
  • How do business intelligence analysts work?
  • Summarizing documents
  • Classifying and routing documents to interested
    readers
  • Answering questions
  • Searching and browsing by topic and theme
  • Searching with topic
  • Browsing by topic
  • Searching by example

3
Summarizing Documents
  • FEDERAL RESERVE POLICY FACILITATED BY MARKET
    PRICE INDICATORS ACCORDING TO NEW JEC STUDY
  • http://www.house.gov/jec/press/2000/10-18-0.htm
  • Summarized text: http://web.njit.edu/wu/teaching/
    sp03/CIS392/JECPressRelease.htm

4
Summarization
  • Problems:
  • News story summaries tend to read reasonably, but
    automatic summaries of other documents may lack
    logical flow.
  • Points of an argument can appear in the wrong
    order.
  • A 20% rule can still yield a long summary, but
    anything less than 5% is not understandable.

5
Undirected summarization
  • Does not use patterns or templates.
  • Selects and copies the most important sentences
    from the original document.
  • Methods:
  • Add up frequency of words and select sentences
    with high total frequency
  • Find trigger words or phrases, e.g. "in
    conclusion".
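The word-frequency method above can be sketched in a few lines of Python. This is a toy illustration: splitting sentences on periods and the function name `summarize` are assumptions for this sketch, not part of the course material.

```python
from collections import Counter

def summarize(text, n=1):
    """Undirected summarization sketch: score each sentence by the
    total corpus frequency of its words, then keep the top-n
    sentences in their original order."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    # Add up word frequencies across the whole document.
    freq = Counter(w.lower() for s in sentences for w in s.split())
    # Rank sentences by their summed word frequency.
    ranked = sorted(sentences, reverse=True,
                    key=lambda s: sum(freq[w.lower()] for w in s.split()))
    keep = set(ranked[:n])
    return ". ".join(s for s in sentences if s in keep) + "."
```

Sentences that repeat the document's frequent words score highest, which is the intuition behind selecting "important" sentences without any template.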

6
Directed summarization
  • Also called information extraction
  • Items (key phrases) to find are pre-defined.
  • Templates and patterns are pre-defined.
  • Processing involves noun phrase identification,
    pattern matching, and template filling.

7
NJIT CIS 634 Information Retrieval Fall 2002
  • Information Extraction
  • Material:
  • "Information Extraction: Techniques and
    Challenges," by Ralph Grishman

8
What do people want from IE?
  • Lists of relevant entities rather than lists of
    relevant documents.
  • How many companies filed for bankruptcy in year 2001?
  • How many universities are there in the United
    States?

9
Definitions
  • IE is the identification of instances of a
    particular class of events or relationships in a
    natural language text, and the extraction of the
    relevant arguments of the event or relationship.
  • It involves the creation of a structured
    representation of selected information drawn from
    the text.

10
Example
  • Text: 19 March. A bomb went off this morning
    near a power tower in San Salvador leaving a
    large part of the city without energy, but no
    casualties have been reported. According to
    unofficial sources, the bomb allegedly
    detonated by urban guerrilla commandos blew up
    a power tower in the northwestern part of San
    Salvador at 0650 (1250 GMT).

11
Results
  • INCIDENT TYPE: bombing
  • DATE: March 19
  • LOCATION: El Salvador: San Salvador (city)
  • PERPETRATOR: urban guerrilla commandos
  • PHYSICAL TARGET: power tower
  • HUMAN TARGET: -
  • EFFECT ON PHYSICAL TARGET: destroyed
  • EFFECT ON HUMAN TARGET: no injury or death
  • INSTRUMENT: bomb

12
Top Level Overview of Processes
  • Facts are extracted from text through local text
    analysis.
  • Facts are integrated, producing larger facts or
    new facts.
  • Facts are translated into the required format.
  • Domain vs. scenario vs. template.

13
Desired outputs
  • Scenario: Sam Schwartz retired as executive vice
    president of the famous hot dog manufacturer,
    Hupplewhite, Inc. He will be succeeded by Harry
    Himmelfarb.
  • Templates:
  • Event: start job
  • Person: Harry Himmelfarb
  • Position: executive vice president
  • Company: Hupplewhite Inc.
  • --------------------------------------------------
  • Event: leave job
  • Person: Sam Schwartz
  • Position: executive vice president
  • Company: Hupplewhite Inc.

14
Pattern creation and template structure building
  • Create sets of expression patterns:
  • person retires as position
  • person is succeeded by person
  • Structures for templates:
  • Entities
  • Events
  • (The role of patterns is to extract events or
    relationships relevant to the scenario.)
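Patterns like "person retires as position" can be prototyped with regular expressions. The sketch below is an illustrative assumption in every detail: the regexes, the named groups, and the event-dict layout are not from the course material, but they show how a pattern turns a sentence into a filled template.

```python
import re

# "person retired as position": person = two capitalized words,
# position = lowercase words ending at " of " or punctuation.
RETIRE = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) retired as "
    r"(?P<position>[a-z ]+?)(?: of |[.,])")
# "X will be succeeded by person".
SUCCEED = re.compile(
    r"(?P<old>\w[\w ]*?) will be succeeded by "
    r"(?P<new>[A-Z][a-z]+ [A-Z][a-z]+)")

def extract_events(text):
    """Apply the scenario patterns and emit event templates."""
    events = []
    for m in RETIRE.finditer(text):
        events.append({"type": "leave-job", "person": m.group("person"),
                       "position": m.group("position").strip()})
    for m in SUCCEED.finditer(text):
        events.append({"type": "succeed", "person1": m.group("new"),
                       "person2": m.group("old")})
    return events
```

On the slide's Hupplewhite sentence this yields a leave-job event for Sam Schwartz and a succeed event whose second argument is still the unresolved pronoun "He", which is exactly what the later coreference step must fix.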

15
Local text analysis, step 1: Lexical Analysis
  • Text is first divided into sentences and into
    tokens.
  • Each token is looked up in the dictionaries
    (general vs specialized) to determine its
    possible parts-of-speech and features.
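Step 1 can be sketched as tokenization plus dictionary lookup. The tiny lexicon and the "unknown" fallback below are assumptions for illustration; real systems use large general and specialized dictionaries.

```python
def lexical_analysis(sentence, lexicon):
    """Split a sentence into tokens and look each one up in a
    dictionary of possible parts of speech."""
    tokens = sentence.rstrip(".").split()
    # Unknown tokens get a placeholder part-of-speech list.
    return [(t, lexicon.get(t.lower(), ["unknown"])) for t in tokens]
```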

16
Local text analysis, steps 2 and 3
  • Name Recognition
  • Identifying various types of proper names and
    other special forms (e.g. dates, currency).
  • Syntactic Structure
  • Arguments are mostly noun phrases.
  • Relationships: grammatical functional relations.
  • Examples: company-description, company-name,
    position of company.

17
Example of syntactic structure
  • [np e1 Sam Schwartz] [vg retired] as [np e2
    executive vice president] of [np e3 the famous
    hot dog manufacturer], [np e4 Hupplewhite, Inc.]
    [np e5 He] [vg will be succeeded] by [np e6
    Harry Himmelfarb].

18
Example (cont)
  • Semantic Entities:
  • Entity e1: type=person, name="Sam Schwartz"
  • Entity e2: type=position, value="executive vice
    president"
  • Entity e3: type=manufacturer
  • Entity e4: type=company, name="Hupplewhite Inc."
  • Entity e5: type=person
  • Entity e6: type=person, name="Harry Himmelfarb"
  • Updated according to the pattern "position of
    company":
  • Entity e1: type=person, name="Sam Schwartz"
  • Entity e2: type=position, value="executive vice
    president", company=e3
  • Entity e3: type=manufacturer, name="Hupplewhite
    Inc."
  • Entity e5: type=person
  • Entity e6: type=person, name="Harry Himmelfarb"

19
Local text analysis, step 4: Scenario Pattern
Matching
  • Extract the events or relationships relevant to
    the scenario, which is executive succession in
    this case.
  • Person (A) is succeeded by person (B).
  • Entity e1: type=person, name="Sam Schwartz"
  • Entity e2: type=position, value="executive vice
    president"
  • Entity e3: type=manufacturer, name="Hupplewhite
    Inc."
  • Entity e5: type=person
  • Entity e6: type=person, name="Harry Himmelfarb"
  • Event e7: type=leave-job, person=e1, position=e2
  • Event e8: type=succeed, person1=e6, person2=e5

20
Discourse analysis, step 1: Coreference Analysis
  • Resolving anaphoric references by pronouns and
    definite noun phrases
  • e5: type=person (the pronoun "he")
  • It is replaced by the most recent previously
    mentioned entity of type person, which is e1, Sam
    Schwartz.
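The "most recent person" heuristic can be sketched directly. The entity-dict layout below mirrors the slides but is an assumption; real coreference resolvers also check gender, number, and syntactic constraints.

```python
def resolve_pronoun(entities, pronoun_id):
    """Replace a pronoun entity with the most recent preceding
    entity of type 'person' that carries a name."""
    idx = next(i for i, e in enumerate(entities) if e["id"] == pronoun_id)
    # Scan backwards from the pronoun for a named person.
    for e in reversed(entities[:idx]):
        if e["type"] == "person" and "name" in e:
            return e
    return entities[idx]  # no antecedent found: leave it unresolved
```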

21
Discourse analysis, step 2: Inferencing and Event
Merging
  • leave-job(X-person, Y-job) and succeed(Z-person,
    X-person)
  • => start-job(Z-person, Y-job)
  • start-job(X-person, Y-job) and succeed(X-person,
    Z-person)
  • => leave-job(Z-person, Y-job)

22
Inferencing and Event Merging (cont)
  • Entity e1: type=person, name="Sam Schwartz"
  • Entity e2: type=position, value="executive vice
    president", company=e3
  • Entity e3: type=manufacturer, name="Hupplewhite
    Inc."
  • Entity e6: type=person, name="Harry Himmelfarb"
  • Event e7: type=leave-job, person=e1, position=e2
  • Event e8: type=succeed, person1=e6, person2=e1
  • Event e9: type=start-job, person=e6, position=e2

23
(No Transcript)
24
Design Issues
  • To parse or not to parse: linguistic complexity
    is involved.
  • Portability: low
  • Performance: not satisfactory

25
Classifying and routing docs
  • Process: classify docs, then route them to
    specific users.
  • Classify docs according to a thesaurus, subject
    hierarchy, taxonomy, or ontology.

26
Answering questions
  • Also called question answering
  • For very specific and straightforward questions,
    extract the related noun phrase from the text.
  • Example: what is the capital of Denmark?
  • Solution: find a document containing "capital" and
    "Denmark" that also has "Copenhagen" near them
    (note the C is in upper case, meaning it is a
    proper name).
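The proximity idea can be sketched as follows. The window size and the "capitalized word = proper-name candidate" heuristic are simplifying assumptions for illustration.

```python
def find_answers(doc_words, keywords, window=10):
    """Return capitalized words (proper-name candidates) that
    occur within `window` positions of every query keyword."""
    positions = {k: [i for i, w in enumerate(doc_words) if w.lower() == k]
                 for k in keywords}
    answers = []
    for i, w in enumerate(doc_words):
        # Skip the keywords themselves; keep capitalized candidates.
        if w[0].isupper() and w.lower() not in keywords:
            if all(any(abs(i - p) <= window for p in pos)
                   for pos in positions.values()):
                answers.append(w)
    return answers
```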

27
Answering questions
  • For complicated questions, a one-word or
    one-phrase answer is not enough; background info
    is needed.
  • Example: what is document warehousing?
  • If no answers are found, provide alternate
    questions to users.

28
Searching and browsing by topic
  • Ad hoc searching with topics
  • Search within a category (select a domain first):
    http://dir.yahoo.com/Business_and_Economy/
  • Browsing by topic
  • Effectiveness depends on the breadth and depth of
    the subject hierarchy.
  • Browse Yahoo!'s main page and narrow down the
    topic.
  • Commercial DBs have incorporated text processing.

29
Searching by example
  • Also called query by example
  • Google's "Similar pages" and Page-Specific Search
    (on the Advanced Search page) are examples of
    query by example.
  • It works well for very narrow and specific
    topics.

30
Full text searching
  • Boolean operators (AND, OR, NOT)
  • Proximity operators: ("Food and Drug
    Administration" OR FDA) NEAR "clinical trials"
  • Weighting operators: commodity AND wheat (with
    wheat weighted 3)
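A Boolean AND query can be evaluated as a set intersection over an inverted index. The index layout (term mapped to a set of doc ids) is an illustrative assumption.

```python
def boolean_and(index, terms):
    """Intersect the posting sets of all query terms."""
    result = set(index.get(terms[0], set()))
    for t in terms[1:]:
        result &= index.get(t, set())
    return result
```

OR and NOT work the same way with set union and difference; proximity operators additionally need word positions stored in the postings.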

31
Clustering Definitions
  • "Discovering group structure amongst the cases of
    an n by p matrix." (Venables, W. N., and Ripley,
    B. D. (1997). Modern Applied Statistics with
    S-Plus (2nd ed.). Statistics and Computing
    Series. New York: Springer.)
  • Clustered groups:
  • In a group, each object has a majority of the
    attributes, and each attribute is shared by a
    majority of the objects.
  • Resultant groups should be as distant from each
    other as possible.
  • Inside a group, members should be as close to
    each other as possible.

32
Document Clustering
  • Unlike classification schemes, it does not use a
    pre-defined set of terms to group documents.
  • Theoretically, documents are grouped together
    because their contents are similar.
  • Closely associated documents tend to be relevant
    to the same query, so they are likely to be
    wanted together.
  • Documents in the same clustered group are treated
    the same until further examined individually.

33
Document Clustering
  • Steps:
  • Find attributes, i.e. a set of keywords (columns
    in the next slide), from documents (rows in the
    next slide).
  • Vector representation: e.g. the vector for object
    2 is (1, 1, 0, 1, 0, 0, 0, 0).
  • Calculate distances between document pairs.
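The steps above can be sketched with binary keyword vectors and two standard distance measures. Object 2's vector matches the slide; object 1's vector is made up for illustration.

```python
import math

# Binary keyword vectors: rows = documents, columns = keywords.
docs = {
    1: [1, 0, 1, 0, 1, 0, 0, 0],  # illustrative vector
    2: [1, 1, 0, 1, 0, 0, 0, 0],  # object 2 from the slide
}

def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

On binary vectors, the Manhattan distance is simply the number of keywords on which the two documents disagree.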

34
(No Transcript)
35
Document Space and Clustering
  • [Figure: documents Doc1-Doc4 plotted as points in
    a document space]
36
The Use of Clustering in IR
  • Choosing a clustering method:
  • The method should produce stable results under
    growth of the document collection.
  • Small errors in the descriptions should lead to
    small changes in the clustering.
  • The method should be independent of the initial
    ordering of the objects.

37
The use of clustering in IR
  • Can be used for filtering and routing.
  • Can be used for creating categories for retrieval.

38
Clustering Routines (optional, won't be in exams)
  • K-means
  • PAM
  • CLARA
  • Hierarchical clustering: AGNES, DIANA, and MONA
  • FANNY
  • Model-based clustering: mclust
  • See Kaufman and Rousseeuw (1990) for details.

39
Dissimilarity Metrics
  • DAISY: a routine for calculating dissimilarities
    using either Euclidean or Manhattan distance.
  • The following clustering routines are all based
    on distance measures.

40
K-means
  • The number of clusters needs to be pre-specified.
  • An initial clustering is created.
  • Iterative relocation by moving objects from one
    group to another if this reduces the sum of
    squares.
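The iterative-relocation idea can be sketched as Lloyd-style k-means on 2-D points. This is one common realization, written as an illustrative assumption rather than the exact routine the slides refer to.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Pick k initial centers, then alternate nearest-center
    assignment and center recomputation."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest current center.
            nearest = min(range(k),
                          key=lambda c: (p[0] - centers[c][0]) ** 2
                                        + (p[1] - centers[c][1]) ** 2)
            clusters[nearest].append(p)
        new_centers = []
        for c, cl in enumerate(clusters):
            if cl:  # recompute the center as the cluster mean
                new_centers.append((sum(x for x, _ in cl) / len(cl),
                                    sum(y for _, y in cl) / len(cl)))
            else:   # keep the old center if a cluster emptied out
                new_centers.append(centers[c])
        centers = new_centers
    return clusters
```

Each reassignment-plus-recomputation round can only lower (or keep) the within-cluster sum of squares, which is the relocation criterion the slide describes.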

41
PAM (Partitioning Around Medoids)
  • The number of clusters needs to be pre-specified.
  • The algorithm computes k representative objects,
    called medoids, which together determine a
    clustering.
  • Each object will be assigned to the nearest
    medoid according to dissimilarity value.
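The assignment step shared by PAM and CLARA can be sketched in one line. Using absolute difference on 1-D points as the dissimilarity is a simplifying assumption.

```python
def assign_to_medoids(points, medoids):
    """Assign each object to its nearest medoid by dissimilarity."""
    return {p: min(medoids, key=lambda m: abs(p - m)) for p in points}
```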

42
CLARA (Clustering Large Applications)
  • It deals with large data sets by considering data
    subsets of fixed size.
  • Each sub-dataset is partitioned into k clusters
    using the same algorithm as in the PAM function.
    The remaining objects in the original dataset are
    assigned to the nearest medoid.
  • The procedure is repeated several times, and the
    best result is kept.

43
FANNY (Fuzzy Analysis)
  • PAM and CLARA are crisp clustering methods:
    each object belongs to exactly one cluster.
  • FANNY spreads objects over groups.
  • A membership value is used to determine how
    strongly an object belongs to each group.

44
AGNES (Agglomerative Nesting)
  • At first, each object is a cluster. Then repeat
    the following two steps:
  • Merge two clusters that have the smallest
    between-cluster dissimilarity.
  • Compute the dissimilarity between the new cluster
    and all remaining clusters.
  • AC: Agglomerative Coefficient
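The two merge steps above can be sketched as follows. Single linkage on 1-D points is a simplifying assumption for the between-cluster dissimilarity; AGNES also supports other linkage choices.

```python
def agnes(points, k):
    """Start with one cluster per object; repeatedly merge the
    pair with the smallest between-cluster dissimilarity."""
    clusters = [[p] for p in points]

    def dist(a, b):  # single linkage: closest pair across clusters
        return min(abs(x - y) for x in a for y in b)

    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return clusters
```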

45
DIANA (Divisive Analysis)
  • It starts with a large cluster, which contains
    ALL objects.
  • The cluster is split into two smaller clusters
    according to the distance measure, until finally
    all clusters contain only one object.
  • DC: Divisive Coefficient

46
(No Transcript)
47
MONA (Monothetic Analysis)
  • It is a divisive hierarchical method that
    operates on matrices of binary variables.
  • For each split, MONA uses one variable at a time.
  • Repeat the following steps:
  • Select the variable that has the largest total
    association with the other variables.
  • Then the cluster is divided into two groups: one
    with all objects having value 1 for that
    variable, the other with objects having value 0.
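One MONA-style split can be sketched like this. Measuring association between two binary variables as |a*d - b*c| from their 2x2 contingency table is an assumption for this sketch, as is representing objects as dicts of 0/1 values.

```python
def association(objects, u, v):
    """|a*d - b*c| from the 2x2 table of binary variables u, v."""
    a = sum(1 for o in objects if o[u] == 1 and o[v] == 1)
    b = sum(1 for o in objects if o[u] == 1 and o[v] == 0)
    c = sum(1 for o in objects if o[u] == 0 and o[v] == 1)
    d = sum(1 for o in objects if o[u] == 0 and o[v] == 0)
    return abs(a * d - b * c)

def mona_split(objects):
    """Pick the variable with the largest total association to
    the others, then divide the objects on its 0/1 values."""
    variables = list(objects[0].keys())
    best = max(variables,
               key=lambda u: sum(association(objects, u, v)
                                 for v in variables if v != u))
    return ([o for o in objects if o[best] == 1],
            [o for o in objects if o[best] == 0], best)
```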

48
mclust (model-based clustering)
  • Assumption: there is an underlying probability
    distribution in the data; clusters have different
    orientations, shapes, and sizes.
  • The mclust function can suggest an optimal number
    of clusters.