CIS392 Text Retrieval - PowerPoint PPT Presentation

1
CIS392 Text Retrieval and Mining
  • Exploiting the Structure of Text
  • Material: Sullivan Ch 3 (excluding integration
    with data warehouses and the WWW) and Ch 8

2
Text-Oriented Business Intelligence
  • How do business intelligence analysts work?
  • Summarizing documents
  • Classifying and routing documents to interested
    readers
  • Answering questions
  • Searching and browsing by topic and theme
  • Searching with topic
  • Browsing by topic
  • Searching by example

3
Summarizing Documents
  • FEDERAL RESERVE POLICY FACILITATED BY MARKET
    PRICE INDICATORS ACCORDING TO NEW JEC STUDY
  • http://www.house.gov/jec/press/2000/10-18-0.htm
  • Summarized text: http://web.njit.edu/wu/teaching/
    sp03/CIS392/JECPressRelease.htm

4
Summarization
  • Problems:
  • News story summaries tend to read reasonably, but
    automatic summaries of other documents may lack
    logical flow.
  • Points of an argument can appear in the wrong
    order.
  • A 20% rule can still yield a long summary, but
    anything less than 5% is not understandable.

5
Undirected summarization
  • Does not use patterns or templates.
  • Selects and copies the most important sentences
    from the original document.
  • Methods:
  • Add up frequency of words and select sentences
    with high total frequency
  • Find trigger words or phrases, e.g. "in
    conclusion".
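The word-frequency method above can be sketched in a few lines of Python. This is a toy illustration: splitting sentences on periods and the function name `summarize` are assumptions for this sketch, not part of the course material.

```python
from collections import Counter

def summarize(text, n=1):
    """Undirected summarization sketch: score each sentence by the
    total corpus frequency of its words, then keep the top-n
    sentences in their original order."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    # Add up word frequencies across the whole document.
    freq = Counter(w.lower() for s in sentences for w in s.split())
    # Rank sentences by their summed word frequency.
    ranked = sorted(sentences, reverse=True,
                    key=lambda s: sum(freq[w.lower()] for w in s.split()))
    keep = set(ranked[:n])
    return ". ".join(s for s in sentences if s in keep) + "."
```

Sentences that repeat the document's frequent words score highest, which is the intuition behind selecting "important" sentences without any template.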

6
Directed summarization
  • Also called information extraction
  • Items (key phrases) to find are pre-defined.
  • Templates and patterns are pre-defined.
  • Processing involves noun phrase identification,
    pattern matching, and template filling.

7
NJIT CIS 634 Information Retrieval Fall 2002
  • Information Extraction
  • Material:
  • "Information Extraction: Techniques and
    Challenges," by Ralph Grishman

8
What do people want from IE?
  • Lists of relevant entities rather than lists of
    relevant documents.
  • How many companies filed for bankruptcy in year 2001?
  • How many universities are there in the United
    States?

9
Definitions
  • IE is the identification of instances of a
    particular class of events or relationships in a
    natural language text, and the extraction of the
    relevant arguments of the event or relationship.
  • It involves the creation of a structured
    representation of selected information drawn from
    the text.

10
Example
  • Text: 19 March. A bomb went off this morning
    near a power tower in San Salvador leaving a
    large part of the city without energy, but no
    casualties have been reported. According to
    unofficial sources, the bomb allegedly
    detonated by urban guerrilla commandos blew up
    a power tower in the northwestern part of San
    Salvador at 0650 (1250 GMT).

11
Results
  • INCIDENT TYPE: bombing
  • DATE: March 19
  • LOCATION: El Salvador: San Salvador (city)
  • PERPETRATOR: urban guerrilla commandos
  • PHYSICAL TARGET: power tower
  • HUMAN TARGET: -
  • EFFECT ON PHYSICAL TARGET: destroyed
  • EFFECT ON HUMAN TARGET: no injury or death
  • INSTRUMENT: bomb

12
Top Level Overview of Processes
  • Facts are extracted from text through local text
    analysis.
  • Facts are integrated, producing larger facts or
    new facts.
  • Facts are translated into the required format.
  • Domain vs. scenario vs. template.

13
Desired outputs
  • Scenario: Sam Schwartz retired as executive vice
    president of the famous hot dog manufacturer,
    Hupplewhite, Inc. He will be succeeded by Harry
    Himmelfarb.
  • Templates:
  • Event: start job
  • Person: Harry Himmelfarb
  • Position: executive vice president
  • Company: Hupplewhite Inc.
  • --------------------------------------------------
  • Event: leave job
  • Person: Sam Schwartz
  • Position: executive vice president
  • Company: Hupplewhite Inc.

14
Pattern creation and template structure building
  • Create sets of expression patterns:
  • person retires as position
  • person is succeeded by person
  • Structures for templates:
  • Entities
  • Events
  • (The role of patterns is to extract events or
    relationships relevant to the scenario.)
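Patterns like "person retires as position" can be prototyped with regular expressions. The sketch below is an illustrative assumption in every detail: the regexes, the named groups, and the event-dict layout are not from the course material, but they show how a pattern turns a sentence into a filled template.

```python
import re

# "person retired as position": person = two capitalized words,
# position = lowercase words ending at " of " or punctuation.
RETIRE = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) retired as "
    r"(?P<position>[a-z ]+?)(?: of |[.,])")
# "X will be succeeded by person".
SUCCEED = re.compile(
    r"(?P<old>\w[\w ]*?) will be succeeded by "
    r"(?P<new>[A-Z][a-z]+ [A-Z][a-z]+)")

def extract_events(text):
    """Apply the scenario patterns and emit event templates."""
    events = []
    for m in RETIRE.finditer(text):
        events.append({"type": "leave-job", "person": m.group("person"),
                       "position": m.group("position").strip()})
    for m in SUCCEED.finditer(text):
        events.append({"type": "succeed", "person1": m.group("new"),
                       "person2": m.group("old")})
    return events
```

On the slide's Hupplewhite sentence this yields a leave-job event for Sam Schwartz and a succeed event whose second argument is still the unresolved pronoun "He", which is exactly what the later coreference step must fix.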

15
Local text analysis, step 1: Lexical Analysis
  • Text is first divided into sentences and into
    tokens.
  • Each token is looked up in the dictionaries
    (general vs specialized) to determine its
    possible parts-of-speech and features.
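Step 1 can be sketched as tokenization plus dictionary lookup. The tiny lexicon and the "unknown" fallback below are assumptions for illustration; real systems use large general and specialized dictionaries.

```python
def lexical_analysis(sentence, lexicon):
    """Split a sentence into tokens and look each one up in a
    dictionary of possible parts of speech."""
    tokens = sentence.rstrip(".").split()
    # Unknown tokens get a placeholder part-of-speech list.
    return [(t, lexicon.get(t.lower(), ["unknown"])) for t in tokens]
```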

16
Local text analysis, steps 2 and 3
  • Name Recognition
  • Identifying various types of proper names and
    other special forms (e.g. dates, currency).
  • Syntactic Structure
  • Arguments are mostly noun phrases.
  • Relationships: grammatical functional relations.
  • Examples: company-description, company-name,
    position of company.

17
Example of syntactic structure
  • [np e1 Sam Schwartz] [vg retired] as [np e2
    executive vice president] of [np e3 the famous
    hot dog manufacturer], [np e4 Hupplewhite, Inc.]
    [np e5 He] [vg will be succeeded] by [np e6
    Harry Himmelfarb].

18
Example (cont)
  • Semantic Entities:
  • Entity e1: type=person, name="Sam Schwartz"
  • Entity e2: type=position, value="executive vice
    president"
  • Entity e3: type=manufacturer
  • Entity e4: type=company, name="Hupplewhite Inc."
  • Entity e5: type=person
  • Entity e6: type=person, name="Harry Himmelfarb"
  • Updated according to the pattern "position of
    company":
  • Entity e1: type=person, name="Sam Schwartz"
  • Entity e2: type=position, value="executive vice
    president", company=e3
  • Entity e3: type=manufacturer, name="Hupplewhite
    Inc."
  • Entity e5: type=person
  • Entity e6: type=person, name="Harry Himmelfarb"

19
Local text analysis, step 4: Scenario Pattern
Matching
  • Extract the events or relationships relevant to
    the scenario, which is executive succession in
    this case.
  • Person (A) is succeeded by person (B).
  • Entity e1: type=person, name="Sam Schwartz"
  • Entity e2: type=position, value="executive vice
    president"
  • Entity e3: type=manufacturer, name="Hupplewhite
    Inc."
  • Entity e5: type=person
  • Entity e6: type=person, name="Harry Himmelfarb"
  • Event e7: type=leave-job, person=e1, position=e2
  • Event e8: type=succeed, person1=e6, person2=e5

20
Discourse analysis, step 1: Coreference Analysis
  • Resolving anaphoric references by pronouns and
    definite noun phrases
  • e5: type=person (the pronoun "he")
  • It is replaced by the most recent previously
    mentioned entity of type person, which is e1, Sam
    Schwartz.
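The "most recent person" heuristic can be sketched directly. The entity-dict layout below mirrors the slides but is an assumption; real coreference resolvers also check gender, number, and syntactic constraints.

```python
def resolve_pronoun(entities, pronoun_id):
    """Replace a pronoun entity with the most recent preceding
    entity of type 'person' that carries a name."""
    idx = next(i for i, e in enumerate(entities) if e["id"] == pronoun_id)
    # Scan backwards from the pronoun for a named person.
    for e in reversed(entities[:idx]):
        if e["type"] == "person" and "name" in e:
            return e
    return entities[idx]  # no antecedent found: leave it unresolved
```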

21
Discourse analysis, step 2: Inferencing and Event
Merging
  • leave-job(X-person, Y-job) and succeed(Z-person,
    X-person)
  • => start-job(Z-person, Y-job)
  • start-job(X-person, Y-job) and succeed(X-person,
    Z-person)
  • => leave-job(Z-person, Y-job)

22
Inferencing and Event Merging (cont)
  • Entity e1: type=person, name="Sam Schwartz"
  • Entity e2: type=position, value="executive vice
    president", company=e3
  • Entity e3: type=manufacturer, name="Hupplewhite
    Inc."
  • Entity e6: type=person, name="Harry Himmelfarb"
  • Event e7: type=leave-job, person=e1, position=e2
  • Event e8: type=succeed, person1=e6, person2=e1
  • Event e9: type=start-job, person=e6, position=e2

23
(No Transcript)
24
Design Issues
  • To parse or not to parse: linguistic complexity
    is involved.
  • Portability: low
  • Performance: not satisfactory

25
Classifying and routing docs
  • Process: classify docs, then route them to
    specific users.
  • Classify docs according to a thesaurus, subject
    hierarchy, taxonomy, or ontology.

26
Answering questions
  • Also called question answering
  • For very specific and straightforward questions,
    extract the related noun phrase from the text.
  • Example: what is the capital of Denmark?
  • Solution: find a document containing "capital" and
    "Denmark" that also has "Copenhagen" near them
    (note the C is in upper case, meaning it is a
    proper name).
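The proximity idea can be sketched as follows. The window size and the "capitalized word = proper-name candidate" heuristic are simplifying assumptions for illustration.

```python
def find_answers(doc_words, keywords, window=10):
    """Return capitalized words (proper-name candidates) that
    occur within `window` positions of every query keyword."""
    positions = {k: [i for i, w in enumerate(doc_words) if w.lower() == k]
                 for k in keywords}
    answers = []
    for i, w in enumerate(doc_words):
        # Skip the keywords themselves; keep capitalized candidates.
        if w[0].isupper() and w.lower() not in keywords:
            if all(any(abs(i - p) <= window for p in pos)
                   for pos in positions.values()):
                answers.append(w)
    return answers
```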

27
Answering questions
  • For complicated questions, a one-word or
    one-phrase answer is not enough; background info
    is needed.
  • Example: what is document warehousing?
  • If no answers are found, provide alternate
    questions to users.

28
Searching and browsing by topic
  • Ad hoc searching with topics
  • Search within a category (select a domain first):
    http://dir.yahoo.com/Business_and_Economy/
  • Browsing by topic
  • Effectiveness depends on the breadth and depth of
    the subject hierarchy.
  • Browse Yahoo!'s main page and narrow down the
    topic.
  • Commercial DBs have incorporated text processing.

29
Searching by example
  • Also called query by example
  • Google's "Similar pages" and Page-Specific Search
    (on the Advanced Search page) are examples of
    query by example.
  • It works well for very narrow and specific
    topics.

30
Full text searching
  • Boolean operators (AND, OR, NOT)
  • Proximity operators: ("Food and Drug
    Administration" OR FDA) NEAR "clinical trials"
  • Weighting operators: commodity AND wheat (with
    wheat weighted 3)
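A Boolean AND query can be evaluated as a set intersection over an inverted index. The index layout (term mapped to a set of doc ids) is an illustrative assumption.

```python
def boolean_and(index, terms):
    """Intersect the posting sets of all query terms."""
    result = set(index.get(terms[0], set()))
    for t in terms[1:]:
        result &= index.get(t, set())
    return result
```

OR and NOT work the same way with set union and difference; proximity operators additionally need word positions stored in the postings.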

31
Clustering Definitions
  • "Discovering group structure amongst the cases of
    an n by p matrix." (Venables, W. N., and Ripley,
    B. D. (1997). Modern Applied Statistics with
    S-Plus (2nd ed.). Statistics and Computing
    Series. New York: Springer.)
  • Clustered groups:
  • In a group, each object has a majority of the
    attributes, and each attribute is shared by a
    majority of the objects.
  • Resultant groups should be as distant from each
    other as possible.
  • Inside a group, members should be as close to
    each other as possible.

32
Document Clustering
  • Unlike classification schemes, it does not use a
    pre-defined set of terms to group documents.
  • Theoretically, documents are grouped together
    because their contents are similar.
  • Closely associated documents tend to be relevant
    to the same query, so they are likely to be
    wanted together.
  • Documents in the same clustered group are treated
    the same until further examined individually.

33
Document Clustering
  • Steps:
  • Find attributes, i.e. a set of keywords (columns
    in the next slide), from documents (rows in the
    next slide).
  • Vector representation: e.g. the vector for object
    2 is (1, 1, 0, 1, 0, 0, 0, 0).
  • Calculate distances between document pairs.
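The steps above can be sketched with binary keyword vectors and two standard distance measures. Object 2's vector matches the slide; object 1's vector is made up for illustration.

```python
import math

# Binary keyword vectors: rows = documents, columns = keywords.
docs = {
    1: [1, 0, 1, 0, 1, 0, 0, 0],  # illustrative vector
    2: [1, 1, 0, 1, 0, 0, 0, 0],  # object 2 from the slide
}

def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

On binary vectors, the Manhattan distance is simply the number of keywords on which the two documents disagree.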

34
(No Transcript)
35
Document Space and Clustering
  • [Figure: documents Doc1-Doc4 plotted as points in
    a document space]
36
The Use of Clustering in IR
  • Choosing a clustering method:
  • The method should produce stable results under
    growth of the document collection.
  • Small errors in the descriptions should lead to
    small changes in the clustering.
  • The method should be independent of the initial
    ordering of the objects.

37
The use of clustering in IR
  • Can be used for filtering and routing.
  • Can be used for creating categories for retrieval.

38
Clustering Routines (optional, won't be in exams)
  • K-means
  • PAM
  • CLARA
  • Hierarchical clustering: AGNES, DIANA, and MONA
  • FANNY
  • Model-based clustering: mclust
  • See Kaufman and Rousseeuw (1990) for details.

39
Dissimilarity Metrics
  • DAISY: a routine for calculating dissimilarities
    using either Euclidean or Manhattan distance.
  • The following clustering routines are all based
    on distance measures.

40
K-means
  • The number of clusters needs to be pre-specified.
  • An initial clustering is created.
  • Iterative relocation by moving objects from one
    group to another if this reduces the sum of
    squares.
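The iterative-relocation idea can be sketched as Lloyd-style k-means on 2-D points. This is one common realization, written as an illustrative assumption rather than the exact routine the slides refer to.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Pick k initial centers, then alternate nearest-center
    assignment and center recomputation."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest current center.
            nearest = min(range(k),
                          key=lambda c: (p[0] - centers[c][0]) ** 2
                                        + (p[1] - centers[c][1]) ** 2)
            clusters[nearest].append(p)
        new_centers = []
        for c, cl in enumerate(clusters):
            if cl:  # recompute the center as the cluster mean
                new_centers.append((sum(x for x, _ in cl) / len(cl),
                                    sum(y for _, y in cl) / len(cl)))
            else:   # keep the old center if a cluster emptied out
                new_centers.append(centers[c])
        centers = new_centers
    return clusters
```

Each reassignment-plus-recomputation round can only lower (or keep) the within-cluster sum of squares, which is the relocation criterion the slide describes.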

41
PAM (Partitioning Around Medoids)
  • The number of clusters needs to be pre-specified.
  • The algorithm computes k representative objects,
    called medoids, which together determine a
    clustering.
  • Each object will be assigned to the nearest
    medoid according to dissimilarity value.
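The assignment step shared by PAM and CLARA can be sketched in one line. Using absolute difference on 1-D points as the dissimilarity is a simplifying assumption.

```python
def assign_to_medoids(points, medoids):
    """Assign each object to its nearest medoid by dissimilarity."""
    return {p: min(medoids, key=lambda m: abs(p - m)) for p in points}
```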

42
CLARA (Clustering Large Applications)
  • It deals with large data sets by considering data
    subsets of fixed size.
  • Each sub-dataset is partitioned into k clusters
    using the same algorithm as in the PAM function.
    The remaining objects in the original dataset are
    assigned to the nearest medoid.
  • The procedure is repeated several times, and the
    best result is kept.

43
FANNY (Fuzzy Analysis)
  • PAM and CLARA are crisp clustering methods:
    each object belongs to exactly one cluster.
  • FANNY spreads objects over groups.
  • A membership value is used to determine how
    strongly an object belongs to each group.

44
AGNES (Agglomerative Nesting)
  • At first, each object is a cluster. Then repeat
    the following two steps:
  • Merge two clusters that have the smallest
    between-cluster dissimilarity.
  • Compute the dissimilarity between the new cluster
    and all remaining clusters.
  • AC: Agglomerative Coefficient
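The two merge steps above can be sketched as follows. Single linkage on 1-D points is a simplifying assumption for the between-cluster dissimilarity; AGNES also supports other linkage choices.

```python
def agnes(points, k):
    """Start with one cluster per object; repeatedly merge the
    pair with the smallest between-cluster dissimilarity."""
    clusters = [[p] for p in points]

    def dist(a, b):  # single linkage: closest pair across clusters
        return min(abs(x - y) for x in a for y in b)

    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return clusters
```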

45
DIANA (Divisive Analysis)
  • It starts with a large cluster, which contains
    ALL objects.
  • The cluster is split into two smaller clusters
    according to the distance measure, until finally
    all clusters contain only one object.
  • DC: Divisive Coefficient

46
(No Transcript)
47
MONA (Monothetic Analysis)
  • It is a divisive hierarchical method that
    operates on matrices of binary variables.
  • For each split, MONA uses one variable at a time.
  • Repeat the following steps:
  • Select the variable that has the largest total
    association with the other variables.
  • Then the cluster is divided into two groups: one
    with all objects having value 1 for that
    variable, the other with objects having value 0.
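One MONA-style split can be sketched like this. Measuring association between two binary variables as |a*d - b*c| from their 2x2 contingency table is an assumption for this sketch, as is representing objects as dicts of 0/1 values.

```python
def association(objects, u, v):
    """|a*d - b*c| from the 2x2 table of binary variables u, v."""
    a = sum(1 for o in objects if o[u] == 1 and o[v] == 1)
    b = sum(1 for o in objects if o[u] == 1 and o[v] == 0)
    c = sum(1 for o in objects if o[u] == 0 and o[v] == 1)
    d = sum(1 for o in objects if o[u] == 0 and o[v] == 0)
    return abs(a * d - b * c)

def mona_split(objects):
    """Pick the variable with the largest total association to
    the others, then divide the objects on its 0/1 values."""
    variables = list(objects[0].keys())
    best = max(variables,
               key=lambda u: sum(association(objects, u, v)
                                 for v in variables if v != u))
    return ([o for o in objects if o[best] == 1],
            [o for o in objects if o[best] == 0], best)
```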

48
mclust (model-based clustering)
  • Assumption: there is an underlying probability
    distribution in the data; clusters have different
    orientations, shapes, and sizes.
  • The mclust function can suggest an optimal number
    of clusters.