1
Text Databases
2
Outline
  • Spatial Databases
  • Temporal Databases
  • Spatio-temporal Databases
  • Data Mining
  • Multimedia Databases
  • Text databases
  • Image and video databases
  • Time Series databases

3
Text - Detailed outline
  • Text databases
  • problem
  • full text scanning
  • inversion
  • signature files (a.k.a. Bloom Filters)
  • Vector model and clustering
  • information filtering and LSI

4
Vector Space Model and Clustering
  • Keyword (free-text) queries (vs. Boolean)
  • each document -> vector (HOW?)
  • each query -> vector
  • search for similar vectors

5
Vector Space Model and Clustering
  • main idea: each document is a vector of size d,
    where d is the number of different terms in the
    database

(figure: a document vector with one position per vocabulary
term, e.g. 'aaron', ..., 'data', ..., 'indexing', ..., 'zoo';
d = vocabulary size)
6
Document Vectors
  • Documents are represented as bags of words
  • Represented as vectors when used computationally
  • A vector is like an array of floating-point
    numbers
  • It has direction and magnitude
  • Each vector holds a place for every term in the
    collection
  • Therefore, most vectors are sparse

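A minimal Python sketch (not from the slides) of turning
documents into term-frequency vectors over a shared vocabulary:

    from collections import Counter

    docs = {"A": "nova galaxy heat nova nova",
            "B": "nova galaxy galaxy"}

    # one vector slot per distinct term in the whole collection
    vocab = sorted({t for text in docs.values() for t in text.split()})

    def tf_vector(text):
        counts = Counter(text.split())
        return [counts.get(term, 0) for term in vocab]

    vectors = {doc_id: tf_vector(text) for doc_id, text in docs.items()}
    # vocab == ['galaxy', 'heat', 'nova']; vectors["A"] == [1, 1, 3]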
7
Document Vectors: One location for each word.

doc  nova  galaxy  heat  hwood  film  role  diet  fur
A     10     5      3
B      5    10
C           10              8     7
D                           9    10     5
E                                      10    10
F            9                         10
G      5     7                          9
H      6    10                    2                 8
I                           7     5           1     3

"Nova" occurs 10 times in text A, "Galaxy" occurs 5 times in
text A, "Heat" occurs 3 times in text A. (Blank means 0
occurrences.)
8
Document Vectors: One location for each word.

(same table as on the previous slide)

"Hollywood" occurs 7 times in text I, "Film" occurs 5 times in
text I, "Diet" occurs 1 time in text I, "Fur" occurs 3 times
in text I.
9
Document Vectors
(same table again; the row labels A through I are document ids)
10
We Can Plot the Vectors
(figure: documents plotted in a 2-d space with axes 'Star' and
'Diet'; documents about movie stars, about astronomy, and about
mammal behavior form three separate groups)
11
Vector Space Model and Clustering
  • Then, group nearby vectors together
  • Q1: cluster search?
  • Q2: cluster generation?
  • Two significant contributions:
  • ranked output
  • relevance feedback

12
Vector Space Model and Clustering
  • cluster search: visit the (k) closest
    superclusters; continue recursively

(figure: clusters of CS technical reports and MD technical
reports)
13
Vector Space Model and Clustering
  • ranked output: easy!

(figure: the same CS TRs / MD TRs clusters)
14
Vector Space Model and Clustering
  • relevance feedback (brilliant idea) [Rocchio, 73]

(figure: the same CS TRs / MD TRs clusters)
15
Vector Space Model and Clustering
  • relevance feedback (brilliant idea) [Rocchio, 73]
  • How?

(figure: the same CS TRs / MD TRs clusters)
16
Vector Space Model and Clustering
  • How? A: by adding the 'good' vectors and
    subtracting the 'bad' ones (see the sketch below)

(figure: the same CS TRs / MD TRs clusters)
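A minimal sketch of that update (the alpha, beta, gamma weights
are the conventional Rocchio parameters, assumed here, not
given on the slides):

    def rocchio(query, relevant, nonrelevant,
                alpha=1.0, beta=0.75, gamma=0.15):
        """Move the query toward relevant vectors, away from bad ones."""
        d = len(query)
        new_q = [alpha * q for q in query]
        for vecs, weight in ((relevant, beta), (nonrelevant, -gamma)):
            if not vecs:
                continue
            for i in range(d):
                centroid_i = sum(v[i] for v in vecs) / len(vecs)
                new_q[i] += weight * centroid_i
        # negative components are usually clipped to zero
        return [max(x, 0.0) for x in new_q]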
17
Cluster generation
  • Problem:
  • given N points in V dimensions,
  • group them

18
Cluster generation
  • Problem:
  • given N points in V dimensions,
  • group them (typically k-means or AGNES is used;
    a k-means sketch follows)

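A k-means sketch (Lloyd's algorithm), assuming points are given
as a list of coordinate tuples; AGNES would instead merge
clusters bottom-up:

    import random

    def kmeans(points, k, iters=20, seed=0):
        random.seed(seed)
        centroids = random.sample(points, k)
        for _ in range(iters):
            # assignment step: each point joins its nearest centroid
            clusters = [[] for _ in range(k)]
            for p in points:
                j = min(range(k),
                        key=lambda c: sum((a - b) ** 2
                                          for a, b in zip(p, centroids[c])))
                clusters[j].append(p)
            # update step: move each centroid to its cluster's mean
            for j, cl in enumerate(clusters):
                if cl:
                    centroids[j] = tuple(sum(xs) / len(cl)
                                         for xs in zip(*cl))
        return centroids, clusters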
19
Assigning Weights to Terms
  • Binary Weights
  • Raw term frequency
  • tf x idf
  • Recall the Zipf distribution
  • We want to weight terms highly if they are
  • frequent in relevant documents, BUT
  • infrequent in the collection as a whole

20
Binary Weights
  • Only the presence (1) or absence (0) of a term is
    included in the vector

21
Raw Term Weights
  • The frequency of occurrence for the term in each
    document is included in the vector

22
Assigning Weights
  • tf x idf measure:
  • term frequency (tf)
  • inverse document frequency (idf) -- a way to deal
    with the problems of the Zipf distribution
  • Goal: assign a tf x idf weight to each term in
    each document

23
tf x idf
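(the slide's formula did not survive extraction; the standard
tf x idf weight it refers to is

    w_ij = tf_ij * log(N / df_j)

where tf_ij is the frequency of term j in document i, df_j is
the number of documents containing term j, and N is the number
of documents in the collection)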
24
Inverse Document Frequency
  • IDF provides high values for rare words and low
    values for common words

(table: example idf values for a collection of 10000 documents;
e.g. a term appearing in all 10000 documents gets idf
log(10000/10000) = 0, while a term appearing in a single
document gets idf log(10000/1) = 4)
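A minimal sketch of the whole tf x idf computation (log base 10
is assumed, matching the 10000-document example above):

    import math

    def tfidf(docs):
        """docs: dict of doc_id -> list of terms.
        Returns doc_id -> {term: tf x idf weight}."""
        n = len(docs)
        df = {}
        for terms in docs.values():
            for t in set(terms):
                df[t] = df.get(t, 0) + 1
        weights = {}
        for doc_id, terms in docs.items():
            weights[doc_id] = {t: terms.count(t) * math.log10(n / df[t])
                               for t in set(terms)}
        return weights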
25
Similarity Measures for document vectors
  • Simple matching (coordination level match)
  • Dice's Coefficient
  • Jaccard's Coefficient
  • Cosine Coefficient
  • Overlap Coefficient
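A sketch of two of the listed measures (cosine is the one the
following slides use; both operate on the weight vectors):

    import math

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    def jaccard(u, v):
        # set-based Jaccard over the terms with non-zero weight
        su = {i for i, a in enumerate(u) if a}
        sv = {i for i, b in enumerate(v) if b}
        return len(su & sv) / len(su | sv) if su | sv else 0.0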
26
tf x idf normalization
  • Normalize the term weights (so longer documents
    are not unfairly given more weight)
  • 'normalize' usually means forcing all values to
    fall within a certain range, usually between 0
    and 1, inclusive

27
Vector space similarity (use the weights to compare the
documents)
28
Computing Similarity Scores
29
Vector Space with Term Weights and Cosine Matching
D_i = (d_i1, w_di1; d_i2, w_di2; ...; d_it, w_dit)
Q   = (q_i1, w_qi1; q_i2, w_qi2; ...; q_it, w_qit)

(figure: Term A on the horizontal axis, Term B on the vertical
axis, both running from 0 to 1.0; Q = (0.4, 0.8),
D1 = (0.8, 0.3), D2 = (0.2, 0.7))
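A quick check of the figure's numbers with the standard cosine
(nothing here beyond the coordinates comes from the slide):

    import math

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.hypot(*u) * math.hypot(*v))

    q, d1, d2 = (0.4, 0.8), (0.8, 0.3), (0.2, 0.7)
    print(round(cosine(q, d1), 2))  # 0.73
    print(round(cosine(q, d2), 2))  # 0.98 -> D2 matches Q better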
30
Text - Detailed outline
  • Text databases
  • problem
  • full text scanning
  • inversion
  • signature files (a.k.a. Bloom Filters)
  • Vector model and clustering
  • information filtering and LSI

31
Information Filtering LSI
  • [Foltz, 92] Goal:
  • users specify interests (= keywords)
  • the system alerts them on suitable news documents
  • Major contribution: LSI = Latent Semantic
    Indexing
  • latent ('hidden') concepts

32
Information Filtering LSI
  • Main idea:
  • map each document into some 'concepts'
  • map each term into some 'concepts'
  • 'Concept': a set of terms, with weights, e.g.
  • data (0.8), system (0.5), retrieval (0.6)
    -> DBMS_concept

33
Information Filtering LSI
  • Pictorially: the term-document matrix (BEFORE)

34
Information Filtering LSI
  • Pictorially: the concept-document matrix, and...

35
Information Filtering LSI
  • ... and concept-term matrix

36
Information Filtering LSI
  • Q: How to search, e.g., for 'system'?

37
Information Filtering LSI
  • A: find the corresponding concept(s) and the
    corresponding documents

38
Information Filtering LSI
  • A: find the corresponding concept(s) and the
    corresponding documents

39
Information Filtering LSI
  • Thus it works like an (automatically constructed)
    thesaurus:
  • we may retrieve documents that DON'T have the
    term 'system', but that contain almost everything
    else ('data', 'retrieval')

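A toy sketch of exactly this effect (the matrix entries and
term names are made up; numpy's SVD plays the role of the
concept mapping):

    import numpy as np

    # rows = documents, columns = terms: data, system, retrieval
    A = np.array([[1.0, 1.0, 1.0],
                  [2.0, 2.0, 2.0],
                  [5.0, 0.0, 5.0]])   # doc 3 never mentions 'system'
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Vk = Vt[:1].T                     # keep the strongest concept

    q = np.array([0.0, 1.0, 0.0])     # query: just 'system'
    q_c = q @ Vk                      # query in concept space
    doc_c = A @ Vk                    # documents in concept space
    # ranking by doc_c . q_c can now retrieve doc 3, which
    # lacks the term 'system' but shares the latent concept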
40
SVD - Detailed outline
  • Motivation
  • Definition - properties
  • Interpretation
  • Complexity
  • Case studies
  • Additional properties

41
SVD - Motivation
  • problem 1: text - LSI: find 'concepts'
  • problem 2: compression / dimensionality reduction

42
SVD - Motivation
  • problem 1: text - LSI: find 'concepts'

43
SVD - Motivation
  • problem 2: compress / reduce dimensionality

44
Problem - specs
  • 10^6 rows; 10^3 columns; no updates
  • random access to any cell(s); small error OK

45
SVD - Motivation
46
SVD - Motivation
47
SVD - Detailed outline
  • Motivation
  • Definition - properties
  • Interpretation
  • Complexity
  • Case studies
  • Additional properties

48
SVD - Definition
  • A[n x m] = U[n x r] L[r x r] (V[m x r])^T
  • A: n x m matrix (e.g., n documents, m terms)
  • U: n x r matrix (n documents, r concepts)
  • L: r x r diagonal matrix (strength of each
    concept) (r = rank of the matrix)
  • V: m x r matrix (m terms, r concepts)

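A minimal numpy check of this definition (the matrix entries
are made up):

    import numpy as np

    A = np.array([[1.0, 1.0, 0.0],     # n = 3 'documents'
                  [2.0, 2.0, 0.0],
                  [0.0, 0.0, 3.0]])    # m = 3 'terms'
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    L = np.diag(s)                      # strengths, sorted decreasing
    assert np.allclose(A, U @ L @ Vt)            # A = U L V^T
    assert np.allclose(U.T @ U, np.eye(3))       # U column-orthonormal
    assert np.allclose(Vt @ Vt.T, np.eye(3))     # ditto for V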
49
SVD - Properties
  • THEOREM [Press+92]: it is always possible to
    decompose a matrix A into A = U L V^T, where
  • U, L, V: unique
  • U, V: column-orthonormal (i.e., columns are unit
    vectors, orthogonal to each other)
  • U^T U = I; V^T V = I (I: the identity matrix)
  • L: eigenvalues are positive, and sorted in
    decreasing order

50
SVD - Example
  • A = U L V^T - example:

(figure: a document-term matrix over the terms 'data', 'inf.',
'retrieval', 'brain', 'lung'; the top rows are CS documents,
the bottom rows MD documents, written as the product
U x L x V^T)
51
SVD - Example
  • A = U L V^T - example:

(same figure; the two columns of U are labeled 'CS-concept'
and 'MD-concept')
52
SVD - Example
  • A = U L V^T - example:

(same figure; U is annotated as the doc-to-concept similarity
matrix)
53
SVD - Example
  • A = U L V^T - example:

(same figure; the first diagonal entry of L is annotated as the
strength of the CS-concept)
54
SVD - Example
  • A = U L V^T - example:

(same figure; V^T is annotated as the term-to-concept
similarity matrix)
55
SVD - Example
  • A = U L V^T - example:

(same figure; the row of V^T corresponding to the CS-concept is
highlighted)
56
SVD - Detailed outline
  • Motivation
  • Definition - properties
  • Interpretation
  • Complexity
  • Case studies
  • Additional properties

57
SVD - Interpretation 1
  • documents, terms and concepts:
  • U: document-to-concept similarity matrix
  • V: term-to-concept similarity matrix
  • L: its diagonal elements give the strength of
    each concept

58
SVD - Interpretation 2
  • best axis to project on ('best' = minimum sum of
    squares of projection errors)

59
SVD - Motivation
60
SVD - interpretation 2
SVD gives the best axis to project on:
  • minimum RMS error

(figure: a 2-d point cloud with the first singular vector v1 as
the projection axis)

61
SVD - Interpretation 2
62
SVD - Interpretation 2
  • A = U L V^T - example

63
SVD - Interpretation 2
  • A = U L V^T - example:

(figure: the decomposition, annotated with the variance
('spread') on the v1 axis)
64
SVD - Interpretation 2
  • A = U L V^T - example:
  • U L gives the coordinates of the points on the
    projection axis
65
SVD - Interpretation 2
  • More details:
  • Q: how exactly is dim. reduction done?

66
SVD - Interpretation 2
  • More details:
  • Q: how exactly is dim. reduction done?
  • A: set the smallest eigenvalues to zero

(figure: the running example with the smallest eigenvalue
zeroed out)
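A sketch of that truncation in numpy (the data matrix is random
just for illustration):

    import numpy as np

    A = np.random.rand(6, 5)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2                            # concepts to keep
    s_trunc = s.copy()
    s_trunc[k:] = 0.0                # zero out the smallest values
    A_k = U @ np.diag(s_trunc) @ Vt  # rank-k approximation of A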
67
SVD - Interpretation 2
68
SVD - Interpretation 2
69
SVD - Interpretation 2
70
SVD - Interpretation 2

71
SVD - Interpretation 2
  • Equivalent:
  • 'spectral' decomposition of the matrix

(figure: A written as the product U x L x V^T)
72
SVD - Interpretation 2
  • Equivalent:
  • 'spectral' decomposition of the matrix

(figure: the same product, with the columns of U labeled u1 and
u2, the diagonal entries of L labeled l1 and l2, and the rows
of V^T labeled v1 and v2)
73
SVD - Interpretation 2
  • Equivalent:
  • 'spectral' decomposition of the matrix:

    A[n x m] = l1 u1 v1^T + l2 u2 v2^T + ...
74
SVD - Interpretation 2
  • 'spectral' decomposition of the matrix:

    A[n x m] = l1 u1 v1^T + l2 u2 v2^T + ...  (r terms)

    where each u_i is an n x 1 column vector and each
    v_i^T is a 1 x m row vector
75
SVD - Interpretation 2
  • approximation / dim. reduction:
  • by keeping the first few terms (Q: how many?)

    A ~ l1 u1 v1^T + l2 u2 v2^T + ...

    assuming l1 > l2 > ...
76
SVD - Interpretation 2
  • A: (heuristic - [Fukunaga]) keep 80-90% of the
    'energy' (= the sum of squares of the l_i's); see
    the sketch below

    assuming l1 > l2 > ...
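A sketch of the heuristic (90% is one choice inside the slide's
80-90% range):

    import numpy as np

    def rank_for_energy(eigenvalues, fraction=0.90):
        """Smallest k whose first k values carry `fraction` of
        the energy (sum of squares); assumes sorted decreasing."""
        energy = np.cumsum(np.asarray(eigenvalues) ** 2)
        return int(np.searchsorted(energy, fraction * energy[-1])) + 1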
77
SVD - Interpretation 3
  • finds non-'zero' blobs in a data matrix

(figure: a block-structured data matrix and its decomposition)
78
SVD - Interpretation 3
  • finds non-'zero' blobs in a data matrix

(same figure)
79
SVD - Interpretation 3
  • Drill: find the SVD, by inspection!
  • Q: rank = ??

(figure: the drill matrix, with the U, L, and V^T factors shown
as '??')
80
SVD - Interpretation 3
  • A: rank = 2 (2 linearly independent rows/cols)

(figure: the factors, still partly '??')
81
SVD - Interpretation 3
  • A: rank = 2 (2 linearly independent rows/cols)

(figure: the factors filled in; are the columns orthogonal??)
82
SVD - Interpretation 3
  • column vectors: orthogonal - but not unit vectors

(figure: the decomposition written out with explicit zero
entries)
83
SVD - Interpretation 3
  • and the eigenvalues are:

(figure: the diagonal matrix of eigenvalues, with explicit zero
entries elsewhere)
84
SVD - Interpretation 3
  • A: SVD properties:
  • the matrix product should give back matrix A
  • matrix U should be column-orthonormal, i.e.,
    columns should be unit vectors, orthogonal to
    each other
  • ditto for matrix V
  • matrix L should be diagonal, with positive values

85
SVD - Detailed outline
  • Motivation
  • Definition - properties
  • Interpretation
  • Complexity
  • Case studies
  • Additional properties

86
SVD - Complexity
  • O(n x m x m) or O(n x n x m), whichever is less
  • less work, if we just want the eigenvalues
  • or if we want the first k eigenvectors
  • or if the matrix is sparse [Berry]
  • Implemented in every linear algebra package
    (LINPACK, Matlab, S-Plus, Mathematica, ...)

87
SVD - Complexity
  • Faster algorithms for approximate eigenvector
    computations exist:
  • Alan Frieze, Ravi Kannan, Santosh Vempala: Fast
    Monte-Carlo Algorithms for Finding Low-Rank
    Approximations. Proc. 39th FOCS, p. 370,
    November 8-11, 1998
  • Sudipto Guha, Dimitrios Gunopulos, Nick Koudas:
    Correlating synchronous and asynchronous data
    streams. KDD 2003: 529-534

88
SVD - conclusions so far
  • SVD: A = U L V^T (unique)
  • U: document-to-concept similarities
  • V: term-to-concept similarities
  • L: strength of each concept
  • dim. reduction: keep the first few strongest
    eigenvalues (80-90% of the 'energy')
  • SVD picks up linear correlations
  • SVD picks up non-'zero' blobs

89
References
  • Berry, Michael: http://www.cs.utk.edu/~lsi/
  • Fukunaga, K. (1990). Introduction to Statistical
    Pattern Recognition. Academic Press.
  • Press, W. H., S. A. Teukolsky, et al. (1992).
    Numerical Recipes in C. Cambridge University
    Press.