Chapter 7 Text Operations

1
Chapter 7 Text Operations

Hsin-Hsi Chen
Department of Computer Science and Information
Engineering
National Taiwan University

2
Logical View of a Document
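[Figure: logical view of a document. Text operations between the document and its index terms: structure recognition; accents, spacing, etc.; stopwords; noun groups; stemming; automatic or manual indexing. Views shown: text and structure, text, full text, index terms.]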
3
Document Preprocessing

Lexical analysis
Elimination of stopwords
Stemming of the remaining words
Selection of index terms
Construction of term categorization structures

4
Lexical Analysis for Automatic Indexing

Lexical analysis: convert an input stream of characters into a stream of words or tokens.
What is a word or a token? Tokens consist of letters.
digits: most numbers are not good index terms. Counterexamples: case numbers in a legal database; B6 and B12 in a vitamin database.
hyphens
break hyphenated words: state-of-the-art -> state of the art
keep hyphenated words as a token: Jean-Claude, F-16

5
Lexical Analysis for Automatic Indexing (Continued)

punctuation marks: often used as parts of terms, e.g., OS/2, 510B.C.
case: usually not significant in index terms
Issues: recall and precision
breaking up hyphenated terms: increases recall but decreases precision
preserving case distinctions: enhances precision but decreases recall
commercial information systems: usually take a recall-enhancing approach (numbers and words containing digits are index terms, and all are case insensitive)
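
The choices on the two slides above can be sketched as a small tokenizer. This is only an illustration, not the deck's own analyzer: the regular expression and the switches split_hyphens and fold_case are assumptions introduced here so that the recall-oriented and precision-oriented policies can be compared.

```python
import re

# Assumed token pattern: letters and digits, optionally joined by an embedded
# hyphen, slash or period (F-16, OS/2, 510B.C). Trailing punctuation is dropped.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:[-/.][A-Za-z0-9]+)*")

def tokenize(text, split_hyphens=False, fold_case=True):
    """Turn a character stream into index-term candidates."""
    tokens = []
    for match in TOKEN.finditer(text):
        tok = match.group(0)
        if fold_case:
            tok = tok.lower()          # case folding: recall-enhancing
        if split_hyphens and "-" in tok:
            tokens.extend(tok.split("-"))  # breaking hyphens: recall-enhancing
        else:
            tokens.append(tok)
    return tokens

print(tokenize("The F-16 ran OS/2."))
print(tokenize("state-of-the-art", split_hyphens=True))
```

Keeping digits inside tokens follows the recall-enhancing approach that the slide attributes to commercial systems.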

6
Lexical Analysis for Query Processing

Tasks
depend on the design strategies of the lexical
analyzer for automatic indexing (search terms
must match index terms)
distinguish operators like Boolean operators,
stemming or truncating operators, and weighting
functions
distinguish grouping indicators like parentheses
and brackets

7
stoplist (negative dictionary)

Avoid retrieving almost every item in a database
regardless of its relevance.
Examples
conservative approach (ORBIT Search Service): and, an, by, from, of, the, with
list derived from the Brown corpus (425 words): a, about, above, across, after, again, against, all, almost, alone, along, already, also, although, always, among, an, and, another, any, anybody, anyone, anything, anywhere, are, area, areas, around, as, ask, asked, asking, asks, at, away, b, back, backed, backing, backs, be, because, became, ...
Articles, prepositions, conjunctions, ...

8
Chinese Stop Words?

? Neu 58388 ? Nh 40332 ? D 39014
? Di 31873 ? Nh 30025 ? D 29646
? D 29211 ? Na 24269 ? D 20403
? VE 19625 ?? Nh 18152 ? Nh 17298
? D 15955 ? D 14066 ? Dfa 13013
? VH 11577 ? D 11125 ? Di 11026
? Nh 10776 ? D 9698 ?? D 9670
? Dfa 9416 ?? Nh 9069 ? D 8992
? D 8869 ?? Nh 8818 ? Neu 8692
? D 8508 ? VG 8369 ? VH 8304
? D 8037 ? D 7858 ? D 7298
? Da 7266 ? D 7256 ...

9
Implementing Stoplists

approaches
(1) examine lexical analyzer output and remove any stopwords
Every token must be looked up in the stoplist, and removed from further analysis if found.
A standard list-searching problem.
(2) remove stopwords as part of lexical analysis
best implementation of stoplist
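
A minimal sketch of the first approach (filtering the lexical analyzer's output), assuming the stoplist is held in a hash set so that each lookup takes constant expected time. The list used is the ORBIT example from the previous slide; the function name remove_stopwords is illustrative.

```python
# ORBIT Search Service stoplist as quoted on the stoplist slide.
ORBIT_STOPLIST = frozenset({"and", "an", "by", "from", "of", "the", "with"})

def remove_stopwords(tokens, stoplist=ORBIT_STOPLIST):
    """Drop every token that appears in the stoplist (case-insensitive)."""
    return [tok for tok in tokens if tok.lower() not in stoplist]

print(remove_stopwords(["the", "art", "of", "indexing"]))
```

The second approach on the slide would fold this test into the tokenizer itself, so stopwords are never emitted as tokens.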

10
Stemming

stem
the portion of a word which is left after the removal of its affixes
connect <- connected, connecting, connection, connections
benefits of stemming?
Some favor the usage of stemming
Many Web search engines do not adopt any stemming
algorithm
issues
correctness
retrieval performance
compression performance

11
Stemmers

programs that relate morphologically similar
indexing and search terms
stem at indexing time
advantage: efficiency and index file compression
disadvantage: information about the full terms is lost
example (CATALOG system), stem at search time
Look for: system users
Search Term: users
Term        Occurrences
1. user     15
2. users    1
3. used     3
4. using    2
Which terms (0 = none, CR = all):

The user selects the terms he wants by numbers.
12
Conflation Methods

manual
automatic (stemmers)
affix removal: longest match vs. simple removal
successor variety
table lookup
n-gram
evaluation
correctness
retrieval effectiveness
compression performance

Term           Stem
engineering    engineer
engineered     engineer
engineer       engineer
13
Successor Variety

Definition (successor variety of a string): the number of different characters that follow it in words in some body of text
Example: a body of text: able, axle, accident, ape, about
successor variety of apple
1st letter (a): 4 (b, x, c, p)
2nd letter (ap): 1 (e)

14
Successor Variety (Continued)

Idea: the successor variety of substrings of a term will decrease as more characters are added until a segment boundary is reached, i.e., the successor variety will sharply increase.
Example
Test word: READABLE
Corpus: ABLE, BEATABLE, FIXABLE, READS, READABLE, READING, RED, ROPE, RIPE
Prefix      Successor Variety   Letters
R           3                   E, O, I
RE          2                   A, D
REA         1                   D
READ        3                   A, I, S
READA       1                   B
READAB      1                   L
READABL     1                   E
READABLE    1                   blank
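
The table above can be recomputed with a few lines of Python. This is a sketch under one assumption that the slide's last row suggests: a corpus word that ends exactly at the prefix contributes a "blank" successor (represented here by the empty string). The function name is illustrative.

```python
def successor_varieties(word, corpus):
    """Return (prefix, variety, successor letters) for every prefix of word."""
    rows = []
    for n in range(1, len(word) + 1):
        prefix = word[:n]
        followers = set()
        for other in corpus:
            if other.startswith(prefix):
                # A word that ends at the prefix is followed by a blank ("").
                followers.add(other[n] if len(other) > n else "")
        rows.append((prefix, len(followers), sorted(followers)))
    return rows

CORPUS = ["ABLE", "BEATABLE", "FIXABLE", "READS", "READABLE",
          "READING", "RED", "ROPE", "RIPE"]

for prefix, variety, letters in successor_varieties("READABLE", CORPUS):
    print(prefix, variety, letters)
```

The same function applied to the earlier corpus (able, axle, accident, ape, about) gives 4 and then 1 for the first two letters of "apple".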

15
The successor variety stemming process

Determine the successor variety for a word.
Use this information to segment the word.
cutoff method: a boundary is identified whenever the cutoff value is reached
peak and plateau method: a boundary is placed after a character whose successor variety exceeds that of the character immediately preceding it and the character immediately following it
complete word method: a segment is a complete word
entropy method
Select one of the segments as the stem.
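
A sketch of the peak and plateau segmentation step, assuming the successor varieties have already been computed for each prefix of the word. The function name and the decision to cut immediately after the peak character are illustrative choices; how the stem is then chosen among the segments is not shown.

```python
def peak_plateau_segments(word, varieties):
    """Split word after each prefix whose variety is a strict local peak.

    varieties[i] is the successor variety of word[:i + 1].
    """
    cuts = []
    for i in range(1, len(varieties) - 1):
        if varieties[i] > varieties[i - 1] and varieties[i] > varieties[i + 1]:
            cuts.append(i + 1)  # length of the prefix that ends at the peak
    segments, start = [], 0
    for cut in cuts:
        segments.append(word[start:cut])
        start = cut
    segments.append(word[start:])
    return segments

# Successor varieties of READABLE taken from the table on the previous slide.
print(peak_plateau_segments("READABLE", [3, 2, 1, 3, 1, 1, 1, 1]))
```

For READABLE the only peak is at READ (variety 3 between two 1s), so the word splits into READ and ABLE.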

16
n-gram stemmers

digram: a pair of consecutive letters
shared digram method: association measures are calculated between pairs of terms
S = 2C / (A + B)
where A = the number of unique digrams in the first word,
B = the number of unique digrams in the second,
C = the number of unique digrams shared by the two words.

17
n-gram stemmers (Continued)

Example
statistics => st ta at ti is st ti ic cs
unique digrams => at cs ic is st ta ti
statistical => st ta at ti is st ti ic ca al
unique digrams => al at ca ic is st ta ti
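
The example can be checked in code. The sketch below assumes the association measure is Dice's coefficient, S = 2C / (A + B), with A, B and C defined as on the previous slide; the function names are illustrative.

```python
def unique_digrams(word):
    """Set of distinct pairs of consecutive letters in word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def digram_similarity(first, second):
    """Dice coefficient over unique digrams: 2C / (A + B)."""
    a, b = unique_digrams(first), unique_digrams(second)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

print(sorted(unique_digrams("statistics")))
print(sorted(unique_digrams("statistical")))
print(digram_similarity("statistics", "statistical"))
```

With 7 unique digrams in "statistics", 8 in "statistical" and 6 shared, the measure is 2 * 6 / (7 + 8) = 0.8.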

18
n-gram stemmers (Continued)

similarity matrix: determine the similarity measures for all pairs of terms in the database
          word1   word2   word3   ...   wordn-1
word1
word2     S21
word3     S31     S32
...
wordn     Sn1     Sn2     Sn3     ...   Sn(n-1)
terms are clustered using a single link
clustering method
more a term clustering procedure than a stemming
one

19
Affix Removal Stemmers

procedure: remove suffixes and/or prefixes from terms leaving a stem, and transform the resultant stem. E.g., Porter algorithm
example: plural forms
If a word ends in "ies" but not "eies" or "aies", then "ies" -> "y"
If a word ends in "es" but not "aes", "ees", or "oes", then "es" -> "e"
If a word ends in "s" but not "us" or "ss", then "s" -> NULL
ambiguity
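
The three plural rules can be written down directly. This sketch assumes the rules are tried in the order listed and that a word falls through to the next rule when an exception blocks the current one; the slide does not say how blocked words are handled, so that part is an assumption.

```python
def strip_plural(word):
    """Apply the three plural-form rules from the slide, in order."""
    w = word.lower()
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"      # "ies" -> "y"
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-2] + "e"      # "es" -> "e"
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]            # "s" -> NULL
    return w

for word in ["ponies", "horses", "cats", "class", "corpus"]:
    print(word, "->", strip_plural(word))
```

Such simple rules conflate plurals cheaply, but they can also produce forms that are not words, which is one source of the ambiguity the slide mentions.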

20
Affix Removal Stemmers (Continued)

longest match stemmer: remove the longest possible string of characters from a word according to a set of rules
recoding: AxC -> AyC, e.g., ki -> ky
partial matching: only the n initial characters of stems are used in comparing
different versions: Lovins, Salton, Dawson, Porter, ...
Students can refer to the rules listed in the appendix of the textbook (pp. 433-436)

21
Index Term Selection (see Chapter 2)
22
Fast Statistical Parsing of Noun Phrases for
Document Indexing

Chengxiang Zhai
Laboratory for Computational Linguistics
Carnegie Mellon University
(ANLP97, pp. 312-319)

23
Phrases for Document Indexing

Indexing by single words
single words are often ambiguous and not specific
enough for accurate discrimination of documents
bank terminology vs. terminology bank
Indexing by phrases
Syntactic phrases are almost always more specific
than single words
Indexing by single words and phrases

24
No significant improvement?

Fagan, Joel L., Experiments in Automatic Phrase Indexing for Document Retrieval: A Comparison of Syntactic and Non-syntactic Methods, Ph.D. thesis, Cornell University, 1987.
Lewis, D., Representation and Learning in
Information Retrieval, Ph.D. thesis, University
of Massachusetts, 1991.
Many syntactic phrases have very low frequency
and tend to be over-weighted by normal weighting
method.

25
Author's points

A larger document collection may increase the
frequency of most phrases, and thus alleviate the
problem of low frequency.
The phrases are used only to supplement, not
replace the single words for indexing.
The new issue is how to parse gigabytes of text in practically feasible time (133 MHz DEC Alpha workstation: 8 hours/GB, 20 hours of training with 1 GB of text).

26
Experiment Design

CLARIT commercial retrieval system
original document set --> CLARIT NP Extractor --> Raw Noun Phrases --> Statistical NP Parser, Phrase Extractor --> Indexing Term Set --> CLARIT Retrieval Engine

27
Different Indexing Units

example
heavy construction industry group (WSJ90)
single words
heavy, construction, industry, group
head modifier pairs
heavy construction, construction industry,
industry group
full noun phrases
heavy construction industry group

28
Different Indexing Units (Continued)

WD-SET
single word only (no phrases, baseline)
WD-HM-SET
single word + head-modifier pair
WD-NP-SET
single word + full NP
WD-HM-NP-SET
single word + head-modifier pair + full NP

29
Result Analysis

Collection: Tipster Disk 2 (250MB)
Query: TREC-5 ad hoc topics (251-300)
relevance feedback: top 10 documents returned from initial retrieval
evaluation
total number of relevant documents retrieved
highest level of precision over all the points of
recall
average precision

30
Effects of phrases with feedback and TREC-5
31
Summary

When only one kind of phrase is used to
supplement the single words, each can lead to a
great improvement in precision.
When we combine the two kinds of phrases, the
effect is a greater improvement in recall rather
than precision.
How to combine and weight different phrases
effectively becomes an important issue.

32
Thesaurus Construction

IR thesaurus (coordinate indexing and retrieval): a list of terms (words or phrases) along with relationships among them
INSPEC thesaurus (1979): physics, EE, electronics, computer and control
cesium (Cs)
    USE caesium (the preferred form)
computer-aided instruction
    see also education (cross-referenced terms)
    UF teaching machines (a set of alternatives)
    BT educational computing (broader terms, cf. NT)
    TT computer applications (root node/top term)
    RT education, teaching (related terms)
    CC C7810C (subject area)
    FC C7810Cf (subject area)

For indexer and searcher
33
Roget thesaurus

example
cowardly, adjective
Ignobly lacking in courage: cowardly turncoats
Syns: chicken (slang), chicken-hearted, craven, dastardly, faint-hearted, gutless, lily-livered, pusillanimous, unmanly, yellow (slang), yellow-bellied (slang)

34
Functions of thesauri

Provide a standard vocabulary for indexing and
searching
Assist users with locating terms for proper query
formulation
Provide classified hierarchies that allow the
broadening and narrowing of the current query
request

35
Usage

Indexing: select the most appropriate thesaurus entries for representing the document.
Searching: design the most appropriate search strategy.
If the search does not retrieve enough documents,
the thesaurus can be used to expand the query.
If the search retrieves too many items, the
thesaurus can suggest more specific search
vocabulary.

36
Features of Thesauri
Construction of phrases from individual terms

Coordination Level
pre-coordination: phrases
phrases are available for indexing and retrieval
advantage: reducing ambiguity in indexing and searching
disadvantage: the searcher has to know the phrase formulation rules
post-coordination: words
phrases are constructed while searching
advantage: users do not worry about the exact word ordering
disadvantage: the search precision may fall, e.g., library school vs. school library
intermediate level: phrases and single words
the higher the level of coordination, the greater the precision of the vocabulary but the larger the vocabulary size

length of phrases? Two or three words or more
37
Features of Thesauri (Continued)

Term Relationships
Aitchison and Gilchrist (1972)
equivalence relationships
synonymy: trade names, popular and local usage, superseded terms
quasi-synonymy, e.g., harshness and tenderness
hierarchical relationships, e.g., genus-species (dog and German shepherd), BT vs. NT
nonhierarchical relationships, e.g., thing-part (bus and seat), thing-attribute (rose and fragrance)

38
Features of Thesauri (Continued)

Wang, Vandendorpe, and Evens (1985)
parts-wholes, e.g., set-element, count-mass
collocation relations: words that frequently co-occur in the same phrase or sentence
paradigmatic relations: words that have the same semantic core, e.g., moon and lunar
taxonomy and synonymy
antonymy relations

39
Features of Thesauri (Continued)

Number of entries for each term
homographs: words with multiple meanings
each homograph entry is associated with its own set of relations
problem: how to select between alternative meanings
Specificity of vocabulary
the precision associated with the component terms
a highly specific vocabulary promotes precision
in retrieval (rules of phrase construction)

40
Features of Thesauri (Continued)

Control on term frequency of class members
for statistical thesaurus construction methods
terms included in the same thesaurus class have
roughly equal frequencies
the total frequency in each class should also be
roughly similar
Normalization of vocabulary
terms should be in noun form
noun phrases should avoid prepositions unless
they are commonly known
a limited number of adjectives should be used
singularity, ordering, spelling, capitalization,
transliteration, abbreviations, ...

41
Thesaurus Construction

manual thesaurus construction
define the boundaries of the subject area
collect the terms for each subarea
sources: indexes, encyclopedias, handbooks, textbooks, journal titles and abstracts, catalogues, relevant thesauri, vocabulary systems, ...
organize the terms and their relationship into
structures
review (and refine) the entire thesaurus for
consistency
automatic thesaurus construction
from a collection of document items
by merging existing thesauri

42
Thesaurus Construction from Texts
1. Construction of vocabulary
normalization and selection of terms
phrase construction depending on the coordination level desired
2. Similarity computations between terms
identify the significant statistical associations between terms
3. Organization of vocabulary
organize the selected vocabulary into a hierarchy on the basis of the associations computed in step 2
43
Construction of Vocabulary

Objective: identify the most informative terms (words and phrases)
Procedure
(1) Identify an appropriate document collection. The document collection should be sizable and representative of the subject area.
(2) Determine the required specificity for the thesaurus.
(3) Normalize the vocabulary terms.
(a) Eliminate very trivial words such as prepositions and conjunctions.
(b) Stem the vocabulary.
(4) Select the most interesting stems, and create interesting phrases for a higher coordination level.

44
Stem evaluation and selection

selection by frequency of occurrence
each term may belong to category of high, medium
or low frequency
terms in the mid-frequency range are the best for
indexing and searching

45
Stem evaluation and selection (Continued)

selection by discrimination value (DV)
the more discriminating a term, the higher its
value as an index term
procedure
Compute the average inter-document similarity in
the collection
Remove the term K from the indexing vocabulary,
and recompute the average similarity
DV(K) = (average similarity without K) - (average similarity with K)
The DV for good discriminators is positive.
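
A small sketch of the discrimination-value procedure. Cosine similarity over term-weight dictionaries is an assumed choice of inter-document similarity (the slide does not fix one), and the three toy documents are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def average_similarity(docs):
    """Average similarity over all document pairs (needs >= 2 documents)."""
    pairs = list(combinations(docs, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def discrimination_value(docs, term):
    """DV(K) = average similarity without K - average similarity with K."""
    without = [{t: w for t, w in d.items() if t != term} for d in docs]
    return average_similarity(without) - average_similarity(docs)

# Hypothetical collection: "the" occurs everywhere, "cat" in one document.
DOCS = [{"the": 1, "cat": 1}, {"the": 1, "dog": 1}, {"the": 1, "fish": 1}]
print(discrimination_value(DOCS, "the"))   # negative: poor discriminator
print(discrimination_value(DOCS, "cat"))   # positive: good discriminator
```

Removing a term that every document shares makes the documents look less alike, so its DV is negative; removing a rare, distinguishing term makes them look more alike, so its DV is positive.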

46
Phrase Construction
Decrease the frequency of high-frequency terms
and increase their value of retrieval

Salton and McGill procedure
1. Compute pairwise co-occurrence for high-frequency words.
2. If this co-occurrence is lower than a threshold, then do not consider the pair any further.
3. For pairs that qualify, compute the cohesion value.
COHESION(ti, tj) = co-occurrence-frequency / sqrt(frequency(ti) * frequency(tj))
COHESION(ti, tj) = size-factor * co-occurrence-frequency / (frequency(ti) * frequency(tj))
where size-factor is the size of the thesaurus vocabulary
4. If cohesion is above a second threshold, retain the phrase.
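
The four steps can be sketched as follows. Both cohesion formulas from the slide are included; the selection function uses the square-root form, and the counts and thresholds in the usage lines are hypothetical.

```python
import math

def cohesion_sqrt(cooccurrence, freq_i, freq_j):
    """co-occurrence-frequency / sqrt(frequency(ti) * frequency(tj))."""
    return cooccurrence / math.sqrt(freq_i * freq_j)

def cohesion_scaled(cooccurrence, freq_i, freq_j, size_factor):
    """size-factor * co-occurrence-frequency / (frequency(ti) * frequency(tj))."""
    return size_factor * cooccurrence / (freq_i * freq_j)

def select_phrases(pair_counts, freq, cooccur_min, cohesion_min):
    """Keep pairs passing the co-occurrence and cohesion thresholds."""
    kept = []
    for (ti, tj), co in pair_counts.items():
        if co < cooccur_min:
            continue                      # step 2: drop rare pairs early
        if cohesion_sqrt(co, freq[ti], freq[tj]) > cohesion_min:
            kept.append((ti, tj))         # step 4: retain the phrase
    return kept

# Hypothetical counts for illustration only.
FREQ = {"information": 9, "retrieval": 16, "system": 4}
PAIRS = {("information", "retrieval"): 6, ("information", "system"): 1}
print(select_phrases(PAIRS, FREQ, cooccur_min=2, cohesion_min=0.4))
```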

(vs. syntactic/semantic methods)
47
Phrase Construction (Continued)

Choueka procedure
1. Select the range of length allowed for each collocational expression, e.g., 2-6 words.
2. Build a list of all potential expressions from the collection with the prescribed length that have a minimum frequency.
3. Delete sequences that begin or end with a trivial word (e.g., prepositions, pronouns, articles, conjunctions, etc.).
4. Delete expressions that contain high-frequency nontrivial words.
5. Given an expression, evaluate any potential sub-expressions for relevance, e.g., abcd -> abc and bcd. Discard any that are not sufficiently relevant.
6. Try to merge smaller expressions into larger and more meaningful ones.

48
Similarity Computation

Cosine: compute the number of documents associated with both terms divided by the square root of the product of the number of documents associated with the first term and the number of documents associated with the second term.
Dice: compute the number of documents associated with both terms divided by the sum of the number of documents associated with one term and the number associated with the other.
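
Both measures can be computed from the sets of documents in which each term occurs. The sketch follows the wording of the slide; note that the Dice coefficient is more commonly written with a factor of 2 in the numerator, which this version omits because the slide does. The document sets are hypothetical.

```python
import math

def cosine_association(docs_a, docs_b):
    """|A and B| / sqrt(|A| * |B|) over sets of document identifiers."""
    denom = math.sqrt(len(docs_a) * len(docs_b))
    return len(docs_a & docs_b) / denom if denom else 0.0

def dice_association(docs_a, docs_b):
    """|A and B| / (|A| + |B|), as worded on the slide (no factor of 2)."""
    denom = len(docs_a) + len(docs_b)
    return len(docs_a & docs_b) / denom if denom else 0.0

# Hypothetical postings: term A occurs in 4 documents, term B in 9, 2 shared.
A = {1, 2, 3, 4}
B = {3, 4, 5, 6, 7, 8, 9, 10, 11}
print(cosine_association(A, B))   # 2 / sqrt(36)
print(dice_association(A, B))     # 2 / 13
```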

49
Vocabulary Organization
Assumptions
(1) High-frequency words have broad meaning, while low-frequency words have narrow meaning.
(2) If the density functions of two terms have the same shape, then the two words have similar meaning.
1. Identify a set of frequency ranges.
2. Group the vocabulary terms into different classes based on their frequencies and the ranges selected in step 1.
3. The highest frequency class is assigned level 0, the next, level 1, and so on.
4. Parent-child links are determined between adjacent levels as follows. For each term t in level i, compute the similarity between t and every term in level i-1. Term t becomes the child of the most similar term in level i-1. If more than one term in level i-1 qualifies for this, then each becomes a parent of t. In other words, a term is allowed to have multiple parents.
5. After all terms in level i have been linked to level i-1 terms, check the level i-1 terms and identify those that have no children. Propagate such terms to level i by creating an identical dummy term as its child.
6. Perform steps 4 and 5 for each level starting with level 1.
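
Steps 4 to 6 can be sketched as below, assuming the frequency classes (steps 1 to 3) are already given as a list of levels and that a term-term similarity function is available. The function name, the (level, term) keys and the toy similarity table are illustrative assumptions.

```python
def organize_vocabulary(levels, similarity):
    """Link each level to the one above it.

    levels[0] holds the highest-frequency terms.
    Returns {(level, term): [parent terms in level - 1]}.
    """
    levels = [list(level) for level in levels]  # copy: dummies get appended
    parents = {}
    for i in range(1, len(levels)):
        for term in levels[i]:
            scores = [(similarity(term, cand), cand) for cand in levels[i - 1]]
            best = max(score for score, _ in scores)
            # Ties are kept, so a term may have several parents (step 4).
            parents[(i, term)] = [c for score, c in scores if score == best]
        linked = {p for term in levels[i] for p in parents[(i, term)]}
        for cand in levels[i - 1]:
            if cand not in linked:
                # Step 5: propagate a childless term as its own dummy child.
                levels[i].append(cand)
                parents[(i, cand)] = [cand]
    return parents

# Hypothetical similarities for illustration only.
SIM = {("software", "computer"): 0.8, ("hardware", "computer"): 0.7,
       ("compiler", "software"): 0.9, ("compiler", "hardware"): 0.2}
LEVELS = [["computer"], ["software", "hardware"], ["compiler"]]
tree = organize_vocabulary(LEVELS, lambda a, b: SIM.get((a, b), 0.0))
print(tree)
```

In the toy data "hardware" has no child in level 2, so it is carried down as a dummy term whose parent is itself.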
50
Merging Existing Thesauri

simple merge
link hierarchies wherever they have terms in common
complex merge
link terms from different hierarchies if they are
similar enough.
similarity is a function of the number of parent
and child terms in common

51
Document Clustering

Searching vs. Browsing
Disadvantages in using inverted index files
information pertaining to a document is scattered
among many different inverted-term lists
information relating to different documents with
similar term assignment is not in close proximity
in the file system
Approaches
inverted-index files (for searching) together with a clustered document collection (for browsing)
clustered file organization (for searching and browsing)

52
Typical Clustered File Organization
[Figure: clustered file organization over the complete document space. Levels from top to bottom: hypercentroid, supercentroids (superclusters), centroids (clusters), documents.]
53
Search Strategy for Clustered Documents
[Figure: search tree with levels highest-level centroid, supercentroids, centroids, documents; a typical search path descends from the highest-level centroid through centroids to documents.]
54
Cluster Generation VS Cluster Search

Cluster structure is generated only once.
Cluster maintenance can be carried out at
relatively infrequent intervals.
Cluster generation process may be slower and more
expensive.
Cluster search operations may have to be
performed continually.
Cluster search operations must be carried out
efficiently.

55
Hierarchical Cluster Generation

Two strategies
pairwise item similarities
heuristic methods
Models
Divisive Clustering (top down)
The complete collection is assumed to represent
one complete cluster.
Then the collection is subsequently broken down
into smaller pieces.
Agglomerative Clustering (bottom up)
Individual item similarities are used as a
starting point.
A gluing operation collects similar items, or
groups, into larger group.

56
Term clustering: from the column viewpoint
Document clustering: from the row viewpoint
57
A Naive Program for Hierarchical Agglomerative
Clustering
1. Compute all pairwise document-document similarity coefficients (N(N-1)/2 coefficients).
2. Place each of the N documents into a class of its own.
3. Form a new cluster by combining the most similar pair of current clusters i and j;
update the similarity matrix by deleting the rows and columns corresponding to i and j;
calculate the entries in the row corresponding to the new cluster i+j.
4. Repeat step 3 if the number of clusters left is greater than 1.
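
A compact sketch of the naive procedure. The linkage argument is an assumption of this sketch: passing max gives the single-link update and min the complete-link update used in the worked example later in the deck. The pair values are read off that example's matrices and step lists; AC is taken as 0.5, as in the single-link matrices (the complete-link step list shows 0.6 for this pair, which does not change the complete-link merge levels).

```python
from itertools import combinations

def agglomerate(items, pair_sim, linkage=max):
    """Naive agglomerative clustering; returns [(merged cluster, level)]."""
    clusters = [(item,) for item in items]
    sim = {frozenset((a, b)): pair_sim[frozenset((a[0], b[0]))]
           for a, b in combinations(clusters, 2)}
    merges = []
    while len(clusters) > 1:
        best = max(sim, key=sim.get)     # most similar pair of clusters
        i, j = tuple(best)
        merged = tuple(sorted(i + j))
        merges.append((merged, sim[best]))
        rest = [c for c in clusters if c != i and c != j]
        # New row: combine the two old rows with the linkage rule.
        new_row = {frozenset((merged, c)):
                   linkage(sim[frozenset((i, c))], sim[frozenset((j, c))])
                   for c in rest}
        # Delete the rows and columns of i and j, then add the new row.
        sim = {k: v for k, v in sim.items() if i not in k and j not in k}
        sim.update(new_row)
        clusters = rest + [merged]
    return merges

PAIRS = ("AF .9 AE .8 BF .8 BE .7 AD .6 AC .5 BD .5 CE .5 "
         "BC .4 DE .4 AB .3 CD .3 EF .3 CF .2 DF .1").split()
PAIR_SIM = {frozenset(PAIRS[k]): float(PAIRS[k + 1])
            for k in range(0, len(PAIRS), 2)}

print([level for _, level in agglomerate("ABCDEF", PAIR_SIM, max)])
print([level for _, level in agglomerate("ABCDEF", PAIR_SIM, min)])
```

Ties (for example AE and BF at 0.8) are broken by dictionary order here, so the order of equal-level merges may differ from the slides while the merge levels stay the same.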
58
How to Combine Clusters?

Single-link clustering
Each document must have a similarity exceeding a
stated threshold value with at least one other
document in the same class.
similarity between a pair of clusters is taken to
be the similarity between the most similar pair
of items
each cluster member will be more similar to at
least one member in that same cluster than to any
member of another cluster

[Figure: two clusters, c1 = {e11, e12, e13} and c2 = {e21, e22, e23, e24}]
Let (e13, e21) be the most similar pair between c1 and c2, and let its distance be dist(e13, e21).
∀ p ∈ c1 (c2), ∃ q ∈ c1 (c2), p ≠ q, such that dist(p, q) < dist(e13, e21).
Because dist(e13, e21) is the smallest distance between the two clusters, dist(p, q) < dist(e13, e21) means that dist(p, q) is smaller than dist(p, r) for every r ∈ c2 (r ≠ q).
59
How to Combine Clusters? (Continued)

Complete-link Clustering
Each document has a similarity to all other
documents in the same class that exceeds the
threshold value.
similarity between the least similar pair of
items from the two clusters is used as the
cluster similarity
each cluster member is more similar to the most
dissimilar member of that cluster than to the
most dissimilar member of any other cluster

[Figure: two clusters, c1 = {e11, e12, e13} and c2 = {e21, e22, e23, e24}]
Let (e12, e24) be the least similar pair between c1 and c2, and let its distance be dist(e12, e24).
∀ p ∈ c1 (c2), let q be the member of c1 most dissimilar to p, i.e., dist(p, q) > dist(p, r) ∀ r ∈ c1 (r ≠ q).
Because dist(p, q) < dist(e12, e24), dist(p, r) < dist(e12, e24).
60
How to Combine Clusters? (Continued)

Group-average clustering
a compromise between the extremes of single-link
and complete-link systems
each cluster member has a greater average
similarity to the remaining members of that
cluster than it does to all members of any other
cluster

61
Example for Agglomerative Clustering
A-F (6 items): 6(6-1)/2 = 15 pairwise similarities, in decreasing order
62
Single-link Clustering
1. AF 0.9
[Dendrogram: A and F joined at 0.9]
sim(AF,X) = max(sim(A,X), sim(F,X))
      AF   B    C    D    E
AF    .    .8   .5   .6   .8
B     .8   .    .4   .5   .7
C     .5   .4   .    .3   .5
D     .6   .5   .3   .    .4
E     .8   .7   .5   .4   .
2. AE 0.8
[Dendrogram: A and F joined at 0.9; E joined at 0.8]
sim(AEF,X) = max(sim(AF,X), sim(E,X))
63
Single-link Clustering (Continued)
      AEF  B    C    D
AEF   .    .8   .5   .6
B     .8   .    .4   .5
C     .5   .4   .    .3
D     .6   .5   .3   .
3. BF 0.8
[Dendrogram: A and F joined at 0.9; E and B joined at 0.8]
sim(ABEF,X) = max(sim(AEF,X), sim(B,X))
Note: E and B are on the same level (i.e., same similarity value)
      ABEF C    D
ABEF  .    .5   .6
C     .5   .    .3
D     .6   .3   .
4. BE 0.7
[Dendrogram: unchanged; B and E are already in the same cluster]
sim(ABDEF,X) = max(sim(ABEF,X), sim(D,X))
64
Single-link Clustering (Continued)
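5. AD 0.6
[Dendrogram: A and F joined at 0.9; E and B joined at 0.8; D joined at 0.6]
      ABDEF C
ABDEF .     .5
C     .5    .
6. AC 0.5
[Dendrogram: A and F joined at 0.9; E and B joined at 0.8; D joined at 0.6; C joined at 0.5]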
65
Single-Link Clusters

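Similarity level 0.7 (i.e., similarity threshold)
[Figure: A, B, E and F form one cluster, linked at .9, .8, .8 and .7; C and D remain unclustered]
Similarity level 0.5 (i.e., similarity threshold)
[Figure: all six items form one cluster, linked at .9, .8, .8, .7, .6 and .5]
Small number of large, poorly linked clusters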
66
Complete-link cluster generation
Columns: step number, similarity pair, check operations, complete-link structure, pairs covered, similarity matrix
1. AF 0.9: new cluster; A and F joined at 0.9
sim(AF,X) = min(sim(A,X), sim(F,X))
2. AE 0.8: check EF?
pairs covered: (A,E) (A,F)
3. BF 0.8: check AB?
pairs covered: (A,E) (A,F) (B,F)
67
Complete-link cluster generation (Continued)
Similarity matrix after step 1:
      AF   B    C    D    E
AF    .    .3   .2   .1   .3
B     .3   .    .4   .5   .7
C     .2   .4   .    .3   .5
D     .1   .5   .3   .    .4
E     .3   .7   .5   .4   .
4. BE 0.7: new cluster; B and E joined at 0.7
5. AD 0.6: check DF?
pairs covered: (A,D) (A,E) (A,F) (B,E) (B,F)
6. AC 0.6: check CF?
pairs covered: (A,C) (A,D) (A,E) (A,F) (B,E) (B,F)
7. BD 0.5: check DE?
pairs covered: (A,C) (A,D) (A,E) (A,F) (B,D) (B,E) (B,F)
68
Complete-link cluster generation (Continued)
Similarity matrix after step 4:
      AF   BE   C    D
AF    .    .3   .2   .1
BE    .3   .    .4   .4
C     .2   .4   .    .3
D     .1   .4   .3   .
8. CE 0.5: check BC?
pairs covered: (A,C) (A,D) (A,E) (A,F) (B,D) (B,E) (B,F) (C,E)
9. BC 0.4: check CE 0.5 (in the checklist); C joins B and E at 0.4
10. DE 0.4: check BD 0.5, CD?
pairs covered: (A,C) (A,D) (A,E) (A,F) (B,C) (B,D) (B,E) (B,F) (C,E) (D,E)
11. AB 0.3: check AC 0.5, AE 0.8, BF 0.8, CF?, EF?
pairs covered: (A,B) (A,C) (A,D) (A,E) (A,F) (B,C) (B,D) (B,E) (B,F) (C,E) (D,E)
69
Complete-link cluster generation (Continued)
Similarity matrix after step 9:
      AF   BCE  D
AF    .    .2   .1
BCE   .2   .    .3
D     .1   .3   .
12. CD 0.3: check BD 0.5, DE 0.4; D joins B, C and E at 0.3
13. EF 0.3: check BF 0.8, CF?, DF?
pairs covered: (A,B) (A,C) (A,D) (A,E) (A,F) (B,C) (B,D) (B,E) (B,F) (C,D) (C,E) (D,E) (E,F)
14. CF 0.2: check BF 0.8, EF 0.3, DF?
pairs covered: (A,B) (A,C) (A,D) (A,E) (A,F) (B,C) (B,D) (B,E) (B,F) (C,D) (C,E) (C,F) (D,E) (E,F)
70
Complete-link cluster generation (Continued)
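Similarity matrix after step 12:
      AF   BCDE
AF    .    .1
BCDE  .1   .
15. DF 0.1: last pair; AF and BCDE joined at 0.1
[Dendrogram: A and F joined at 0.9; B and E joined at 0.7; C joined at 0.4; D joined at 0.3; the two groups joined at 0.1]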
71
Complete link clusters
Similarity level 0.7
[Figure: clusters {A, F} (0.9) and {B, E} (0.7); C and D unclustered]
Similarity level 0.4
[Figure: clusters {A, F} and {B, C, E}; D unclustered]
Similarity level 0.3
[Figure: clusters {A, F} and {B, C, D, E}]
Larger number of small, tightly linked clusters
72
The Behavior of Single-Link Cluster

The single-link process tends to produce a small
number of large clusters that are characterized
by a chaining effect.
Each element is usually attached to only one
other member of the same cluster at each
similarity level.
It is sufficient to remember the list of
previously clustered single items.

73
The Behavior of Complete-Link Cluster

Complete-link process produces a much larger
number of small, tightly linked groupings.
Each item in a complete-link cluster is
guaranteed to resemble all other items in that
cluster at the stated similarity level.
It is necessary to remember the list of all item
pairs previously considered in the clustering
process.

74
The Behavior of Complete-Link Cluster (Continued)

The complete-link clustering system may be better
adapted to retrieval than the single-link
clusters.
A complete-link cluster generation is more
expensive to perform than a comparable
single-link process.

75
How to Generate Similarity
Di = (di1, di2, ..., dit): document vector for Di
Lj = (lj1, lj2, ..., ljnj): inverted list for term Tj
lji denotes the document identifier of the ith document listed under term Tj
nj denotes the number of postings for term Tj (the number of documents containing term Tj)
for j = 1 to t (for each of t possible terms)
    for i = 1 to nj (for all nj entries on the jth list)
        compute sim(D(lji), D(ljk)), i+1 <= k <= nj
    end for
end for
76
Similarity without Recomputation
for j = 1 to N (for each document in collection)
    set S(j) = 0, 1 <= j <= N
    for k = 1 to nj (for each term in document)
        take up inverted list Lk
        for i = 1 to nk (for each document identifier on list)
            if i < j or if Sji = 1
                take up next document Di
            else
                compute sim(Dj, Di)
                set Sji = 1
        end for
    end for
end for
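
The idea of the loop above, sketched in Python: only documents that share at least one term are compared, and a per-document flag set prevents a pair from being recomputed when it shares several terms. The data layout (a dict from document identifier to term set), the function names, and the use of i <= j to skip a document's comparison with itself are assumptions of this sketch.

```python
def pairwise_via_inverted_lists(doc_terms, sim):
    """Compute sim(Dj, Di) once for every pair sharing at least one term."""
    inverted = {}
    for doc_id in sorted(doc_terms):
        for term in doc_terms[doc_id]:
            inverted.setdefault(term, []).append(doc_id)

    results = {}
    for j in sorted(doc_terms):
        done = set()                       # plays the role of the flags Sji
        for term in doc_terms[j]:
            for i in inverted[term]:
                if i <= j or i in done:
                    continue               # take up next document
                results[(j, i)] = sim(doc_terms[j], doc_terms[i])
                done.add(i)
    return results

# Hypothetical collection; similarity here is just the number of shared terms.
DOCS = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"d"}, 4: {"a", "b", "c"}}
calls = []

def overlap(u, v):
    calls.append(1)
    return len(u & v)

print(pairwise_via_inverted_lists(DOCS, overlap))
```

Document 3 shares no term with the others, so it is never compared; documents 1 and 4 share two terms but are compared only once.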
77
Heuristic Clustering Methods

Hierarchical clustering strategies
use all pairwise similarities between items
cluster generation is relatively expensive
produce a unique set of well-formed clusters for
each set of data, regardless of the order in
which the similarity pairs are introduced into
the clustering process
Heuristic clustering methods
produce rough cluster arrangements at relatively
little expense

78
Single-Pass Heuristic Clustering Methods

Item 1 is first taken and placed into a cluster
of its own.
Each subsequent item is then compared against all
existing clusters.
It is placed in a previously existing cluster
whenever it is similar to any existing cluster.
Compute the similarities between all existing
centroids and the new incoming item.
When an item is added to an existing cluster, the
corresponding centroid must then be appropriately
updated.
If a new item is not sufficiently similar to any
existing cluster, the new item forms a cluster of
its own.
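
A sketch of the single-pass method. Three choices are assumptions of this sketch rather than statements from the slide: cosine similarity against the centroid, assignment to the single most similar cluster at or above the threshold, and a centroid kept as the running average of member vectors. The four toy items are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def single_pass(items, threshold):
    """Cluster items in arrival order; returns lists of item indexes."""
    clusters = []   # each cluster: {"members": [...], "centroid": {...}}
    for idx, vec in enumerate(items):
        best, best_sim = None, threshold
        for cluster in clusters:
            s = cosine(vec, cluster["centroid"])
            if s >= best_sim:
                best, best_sim = cluster, s
        if best is None:
            clusters.append({"members": [idx], "centroid": dict(vec)})
        else:
            n = len(best["members"])
            terms = set(best["centroid"]) | set(vec)
            # Update the centroid as the average of all member vectors.
            best["centroid"] = {
                t: (best["centroid"].get(t, 0.0) * n + vec.get(t, 0.0)) / (n + 1)
                for t in terms}
            best["members"].append(idx)
    return [cluster["members"] for cluster in clusters]

ITEMS = [{"a": 1}, {"a": 1, "b": 0.1}, {"c": 1}, {"c": 1, "d": 0.2}]
print(single_pass(ITEMS, threshold=0.8))
```

Because each item is seen once and compared only with centroids, the result depends on the order of the items, which is the weakness discussed on the next slide.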

79
Single-Pass Heuristic Clustering Methods (Continued)

Produce uneven cluster structures.
Solutions
cluster splitting: controls cluster sizes
variable similarity thresholds: control the number of clusters, and the overlap among clusters
Produce cluster arrangements that vary according
to the order of individual items.

80
Cluster Splitting
[Figure: cluster splitting]
Addition of one more item to cluster A
Splitting cluster A into two pieces A' and A''
Splitting supercluster S into two pieces S' and S''
81
Cluster Searching

Cluster centroid: the average vector of all the documents in a given cluster
strategies
top down: the query is first compared with the highest-level centroids
bottom up: only the lowest-level centroids are stored; the higher-level cluster structure is disregarded

82
Top-down entire-clustering search
1. Initialize by adding the top item to the active node list.
2. Take the centroid with the highest query similarity from the active node list.
If the number of singleton items in the subtree headed by that centroid is not larger than the number of items wanted, then retrieve these singleton items and eliminate the centroid from the active node list;
else eliminate the centroid with the highest query similarity from the active node list and add its sons to the active node list.
3. If the number of retrieved items >= the number wanted, then stop; else repeat step 2.
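
A sketch of the top-down search. The tree layout (a dict from centroid to its children, with any node absent from the dict treated as a singleton item) is an assumption. The centroid numbers, query similarities and subtree sizes follow the trace table on the next slide; the item names under centroids 3, 6 and 9 are hypothetical, since the slide names only I, J, A and B.

```python
def top_down_search(tree, query_sim, root, wanted):
    """Retrieve items by descending from the root centroid."""
    def leaves(node):
        if node not in tree:
            return [node]                 # a singleton item
        return [leaf for child in tree[node] for leaf in leaves(child)]

    active, retrieved = [root], []
    while active and len(retrieved) < wanted:
        best = max(active, key=lambda n: query_sim.get(n, 0.0))
        active.remove(best)
        items = leaves(best)
        if len(items) <= wanted:
            retrieved.extend(items)       # subtree is small enough
        else:
            active.extend(tree[best])     # too big: expand to its sons
    return retrieved

TREE = {1: [2, 4, 3], 2: [5, 6], 4: [8, 9],
        5: ["A", "B"], 6: ["C", "D"], 8: ["I", "J"],
        9: ["K", "L", "M", "N"], 3: ["E", "F", "G", "H"]}
QUERY_SIM = {1: 0.2, 2: 0.5, 3: 0.0, 4: 0.7, 5: 0.6, 6: 0.5, 8: 0.8, 9: 0.3}

print(top_down_search(TREE, QUERY_SIM, root=1, wanted=3))
```

With three items wanted, the search expands centroids 1 and 4, retrieves I and J from centroid 8, expands centroid 2, and stops after retrieving A and B from centroid 5.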
83
Active node list                      Number of single items in subtree   Retrieved items
(1,0.2)                               14 (too big)
(2,0.5), (4,0.7), (3,0)               6 (too big)
(2,0.5), (8,0.8), (9,0.3), (3,0)      2                                   I, J
(2,0.5), (9,0.3), (3,0)               4 (too big)
(5,0.6), (6,0.5), (9,0.3), (3,0)      2                                   A, B
84
Bottom-up Individual-Cluster Search
Take a specified number of low-level centroids.
If there are enough singleton items in those clusters to equal the number of items wanted, then retrieve the number of items wanted in ranked order;
else add additional low-level centroids to the list and repeat the test.
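
A sketch of the bottom-up search. The centroid and document similarities are those listed on the next slide; which documents belong to which centroid is not recoverable from the transcript, so the membership table below is hypothetical, as is the parameter name initial (the "specified number" of centroids taken first).

```python
def bottom_up_search(centroid_sims, members, doc_sims, wanted, initial=1):
    """Take the best low-level centroids until enough items are available."""
    ranked = sorted(centroid_sims, key=centroid_sims.get, reverse=True)
    taken = max(1, initial)
    while True:
        docs = [d for c in ranked[:taken] for d in members[c]]
        if len(docs) >= wanted or taken >= len(ranked):
            docs.sort(key=doc_sims.get, reverse=True)   # ranked order
            return docs[:wanted]
        taken += 1                                      # add another centroid

CENTROID_SIMS = {8: 0.8, 4: 0.7, 5: 0.6}
MEMBERS = {8: ["I", "J"], 4: ["K", "L", "M", "N"], 5: ["A", "B"]}  # hypothetical
DOC_SIMS = {"I": 0.9, "L": 0.8, "A": 0.8, "K": 0.6,
            "B": 0.5, "J": 0.4, "N": 0.4, "M": 0.2}

print(bottom_up_search(CENTROID_SIMS, MEMBERS, DOC_SIMS, wanted=3, initial=3))
```

Starting from all three centroids and asking for three items returns I, L and A, the retrieved items shown on the slide.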
85
Active centroid list: (8,.8), (4,.7), (5,.6)
Ranked documents from clusters: (I,.9), (L,.8), (A,.8), (K,.6), (B,.5), (J,.4), (N,.4), (M,.2)
Retrieved items: I, L, A
