1
Text and Web Search
2
Text Databases and IR
  • Text databases (document databases)
  • Large collections of documents from various
    sources: news articles, research papers, books,
    digital libraries, e-mail messages, web pages,
    library databases, etc.
  • Information retrieval
  • A field developed in parallel with database
    systems
  • Information is organized into (a large number of)
    documents
  • Information retrieval problem: locating relevant
    documents based on user input, such as keywords
    or example documents

3
Information Retrieval
  • Typical IR systems
  • Online library catalogs
  • Online document management systems
  • Information retrieval vs. database systems
  • Some DB problems are not present in IR, e.g.,
    update, transaction management, complex objects
  • Some IR problems are not addressed well in DBMS,
    e.g., unstructured documents, approximate search
    using keywords and relevance

4
Basic Measures for Text Retrieval
  • Precision: the percentage of retrieved documents
    that are in fact relevant to the query (i.e.,
    correct responses)
  • Recall: the percentage of documents that are
    relevant to the query and were, in fact, retrieved
    (see the formulas below)

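For reference, the standard formulas (not spelled out on the slide above):

\[
\text{precision} = \frac{|\,\text{relevant} \cap \text{retrieved}\,|}{|\,\text{retrieved}\,|}
\qquad
\text{recall} = \frac{|\,\text{relevant} \cap \text{retrieved}\,|}{|\,\text{relevant}\,|}
\]

For example, if 8 of 10 retrieved documents are relevant and the collection
contains 20 relevant documents in total, precision = 0.8 and recall = 0.4.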
5
Information Retrieval Techniques
  • Index Terms (Attribute) Selection
  • Stop list
  • Word stem
  • Index terms weighting methods
  • Terms x Documents frequency matrices
  • Information Retrieval Models
  • Boolean Model
  • Vector Model
  • Probabilistic Model

6
Problem - Motivation
  • Given a database of documents, find the documents
    containing the terms "data", "retrieval"
  • Applications
  • Web
  • law / patent offices
  • digital libraries
  • information filtering

7
Problem - Motivation
  • Types of queries
  • boolean (data AND retrieval AND NOT ...)
  • additional features (data ADJACENT retrieval)
  • keyword queries (data, retrieval)
  • How to search a large collection of documents?

8
Full-text scanning
  • for single term
  • (naive O(NM))

text:    ABRACADABRA
pattern: CAB
9
Full-text scanning
  • for single term
  • (naive O(NM))
  • Knuth, Morris and Pratt (77)
  • build a small FSA; visit every text letter once
    only, by carefully shifting by more than one step

text:    ABRACADABRA
pattern: CAB
10
Full-text scanning
text:    ABRACADABRA
pattern: CAB (shifted one position at a time: CAB, CAB, ..., CAB)
11
Full-text scanning
  • for single term
  • (naive O(NM))
  • Knuth Morris and Pratt (77)
  • Boyer and Moore (77)
  • preprocess the pattern; compare from right to
    left and skip!

text:    ABRACADABRA
pattern: CAB
12
Text - Detailed outline
  • text
  • problem
  • full text scanning
  • inversion
  • signature files
  • clustering
  • information filtering and LSI

13
Text Inverted Files
14
Text Inverted Files
Q: space overhead?
A: mainly, the postings lists
15
Text Inverted Files
  • how to organize the dictionary?
  • stemming: Y/N?
  • keep only the root of each word, e.g., inverted,
    inversion -> invert
  • insertions?

16
Text Inverted Files
  • how to organize dictionary?
  • B-tree, hashing, TRIEs, PATRICIA trees, ...
  • stemming Y/N?
  • insertions?

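A toy Python sketch of an inverted file, with the dictionary kept as a plain
hash map (no stemming, no stop list; all names here are made up for
illustration):

    from collections import defaultdict

    def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
        """Map each term to a sorted postings list of document ids."""
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for term in text.lower().split():     # crude whitespace tokenizer
                index[term].add(doc_id)
        return {term: sorted(ids) for term, ids in index.items()}

    docs = {1: "data retrieval systems", 2: "brain and lung data", 3: "information retrieval"}
    index = build_inverted_index(docs)
    print(index["data"])         # -> [1, 2]
    print(index["retrieval"])    # -> [1, 3]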
17
Text Inverted Files
  • postings lists follow a Zipf distribution, e.g.,
    the rank-frequency plot of the Bible

(figure: log(freq) vs. log(rank); approximately freq = 1 / (rank * ln(1.78 V)))
18
Text Inverted Files
  • postings lists
  • [Cutting & Pedersen]
  • (keep first 4 in B-tree leaves)
  • how to allocate space: [Faloutsos, '92]
  • geometric progression
  • compression (Elias codes) [Zobel]: down to 2%
    overhead!
  • Conclusions: needs space overhead (2-300%), but
    it is the fastest

19
Vector Space Model and Clustering
  • Keyword (free-text) queries (vs Boolean)
  • each document -> vector (HOW?)
  • each query -> vector
  • search for similar vectors

20
Vector Space Model and Clustering
  • main idea: each document is a vector of size d,
    where d is the number of distinct terms in the
    database (the vocabulary size)

(figure: a document containing "...data..." mapped to a d-dimensional
 vector with one coordinate per vocabulary term: aaron, ..., data, ...,
 indexing, ..., zoo)
21
Document Vectors
  • Documents are represented as bags of words
  • Represented as vectors when used computationally
  • A vector is like an array of floating-point
    values
  • It has direction and magnitude
  • Each vector holds a place for every term in the
    collection
  • Therefore, most vectors are sparse

22
Document Vectors: One location for each word.

  (term counts per document; the columns are
   nova, galaxy, heat, hwood, film, role, diet, fur)
  A: 10 5 3
  B: 5 10
  C: 10 8 7
  D: 9 10 5
  E: 10 10
  F: 9 10
  G: 5 7 9
  H: 6 10 2 8
  I: 7 5 1 3

Nova occurs 10 times in text A, Galaxy occurs 5 times
in text A, Heat occurs 3 times in text A.
(Blank means 0 occurrences.)
23
Document Vectors: One location for each word.

  (same term-count table as the previous slide)

Hollywood occurs 7 times in text I, Film occurs 5 times
in text I, Diet occurs 1 time in text I, Fur occurs
3 times in text I.
24
Document Vectors
  • Document ids: the row labels A ... I of the same
    term-count table
25
We Can Plot the Vectors
(figure: documents plotted in a 2-d term space with axes "Star" and
 "Diet"; a doc about movie stars, a doc about astronomy, and a doc
 about mammal behavior fall in different regions)
26
Vector Space Model and Clustering
  • Then, group nearby vectors together
  • Q1: cluster search?
  • Q2: cluster generation?
  • Two significant contributions
  • ranked output
  • relevance feedback

27
Vector Space Model and Clustering
  • cluster search: visit the (k) closest
    superclusters; continue recursively

(figure: clusters of MD technical reports and CS technical reports)
28
Vector Space Model and Clustering
  • ranked output: easy!

(figure: MD TRs and CS TRs clusters)
29
Vector Space Model and Clustering
  • relevance feedback (brilliant idea) [Rocchio, '73]

(figure: MD TRs and CS TRs clusters)
30
Vector Space Model and Clustering
  • relevance feedback (brilliant idea) [Rocchio, '73]
  • How?

(figure: MD TRs and CS TRs clusters)
31
Vector Space Model and Clustering
  • How? A: by adding the 'good' vectors and
    subtracting the 'bad' ones (see the formula below)

(figure: MD TRs and CS TRs clusters)
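One standard way to write this update is the classic Rocchio formula (the
exact weights alpha, beta, gamma are tuning parameters, not taken from the
slide):

\[
\vec{q}_{new} = \alpha\,\vec{q}
 + \beta\,\frac{1}{|D_{rel}|}\sum_{\vec{d}\in D_{rel}}\vec{d}
 - \gamma\,\frac{1}{|D_{nonrel}|}\sum_{\vec{d}\in D_{nonrel}}\vec{d}
\]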
32
Cluster generation
  • Problem
  • given N points in V dimensions,
  • group them

33
Cluster generation
  • Problem
  • given N points in V dimensions,
  • group them (typically k-means or AGNES is used;
    see the sketch below)

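A minimal k-means sketch in Python/numpy (a toy illustration of the usual
k-means procedure, not an implementation referenced by the slides):

    import numpy as np

    def kmeans(points: np.ndarray, k: int, iters: int = 20, seed: int = 0) -> np.ndarray:
        """Return a cluster label for each of the N points (V dimensions)."""
        rng = np.random.default_rng(seed)
        centers = points[rng.choice(len(points), size=k, replace=False)]   # random init
        for _ in range(iters):
            # assign each point to its nearest center
            dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # move each center to the mean of its assigned points
            for j in range(k):
                if np.any(labels == j):
                    centers[j] = points[labels == j].mean(axis=0)
        return labels

    pts = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
    print(kmeans(pts, k=2))    # e.g. [0 0 1 1]: two tight clusters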
34
Assigning Weights to Terms
  • Binary Weights
  • Raw term frequency
  • tf x idf
  • Recall the Zipf distribution
  • Want to weight terms highly if they are
  • frequent in relevant documents BUT
  • infrequent in the collection as a whole

35
Binary Weights
  • Only the presence (1) or absence (0) of a term is
    included in the vector

36
Raw Term Weights
  • The frequency of occurrence for the term in each
    document is included in the vector

37
Assigning Weights
  • tf x idf measure
  • term frequency (tf)
  • inverse document frequency (idf) -- a way to deal
    with the problems of the Zipf distribution
  • Goal: assign a tf x idf weight to each term in
    each document

38
tf x idf
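A standard form of the weight (the slide's own formula is not reproduced
here; tf(t,d) is the frequency of term t in document d, df(t) the number of
documents containing t, and N the collection size):

\[
w_{t,d} = tf_{t,d} \times \log\!\left(\frac{N}{df_t}\right)
\]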
39
Inverse Document Frequency
  • IDF provides high values for rare words and low
    values for common words

For a collection of 10000 documents
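For example, assuming idf is computed as a base-10 logarithm (the base used
on the original slide is an assumption):

\[
idf_t = \log_{10}\!\left(\frac{N}{n_t}\right)
\]

With N = 10,000: a term appearing in 1 document gets idf = 4, a term in 100
documents gets idf = 2, and a term in all 10,000 documents gets idf = 0.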
40
Similarity Measures for document vectors
  • Simple matching (coordination level match)
  • Dice's coefficient
  • Jaccard's coefficient
  • Cosine coefficient
  • Overlap coefficient
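The usual set-based forms of these coefficients, stated here for reference
(Q and D denote the term sets of the query and the document):

\[
\begin{aligned}
\text{Simple matching:}&\quad |Q \cap D|\\[2pt]
\text{Dice:}&\quad \frac{2\,|Q \cap D|}{|Q| + |D|}\\[2pt]
\text{Jaccard:}&\quad \frac{|Q \cap D|}{|Q \cup D|}\\[2pt]
\text{Cosine:}&\quad \frac{|Q \cap D|}{\sqrt{|Q|\,|D|}}\\[2pt]
\text{Overlap:}&\quad \frac{|Q \cap D|}{\min(|Q|, |D|)}
\end{aligned}
\]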
41
tf x idf normalization
  • Normalize the term weights (so longer documents
    are not unfairly given more weight)
  • normalize usually means force all values to fall
    within a certain range, usually between 0 and 1,
    inclusive.

42
Vector space similarity (use the weights to
compare the documents)
43
Computing Similarity Scores
44
Vector Space with Term Weights and Cosine Matching
Di = (di1, wdi1; di2, wdi2; ...; dit, wdit)
Q  = (qi1, wqi1; qi2, wqi2; ...; qit, wqit)

(figure: Term A on the x-axis, Term B on the y-axis, both from 0 to 1.0,
 with Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7) plotted)
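A quick check of the picture in Python, using the coordinates shown above
(cosine of the angle between each document vector and Q):

    import math

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

    Q, D1, D2 = (0.4, 0.8), (0.8, 0.3), (0.2, 0.7)
    print(round(cosine(Q, D1), 2))   # ~0.73
    print(round(cosine(Q, D2), 2))   # ~0.98: D2 points in almost the same direction as Q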
45
Text - Detailed outline
  • Text databases
  • problem
  • full text scanning
  • inversion
  • signature files (a.k.a. Bloom Filters)
  • Vector model and clustering
  • information filtering and LSI

46
Information Filtering LSI
  • [Foltz, '92] Goal:
  • users specify their interests (= keywords)
  • the system alerts them on suitable news documents
  • Major contribution: LSI (Latent Semantic
    Indexing)
  • latent (= hidden) concepts

47
Information Filtering LSI
  • Main idea:
  • map each document into some concepts
  • map each term into some concepts
  • Concept: a set of terms, with weights, e.g.
  • data (0.8), system (0.5), retrieval (0.6)
    -> DBMS_concept

48
Information Filtering LSI
  • Pictorially: the term-document matrix (BEFORE)

49
Information Filtering LSI
  • Pictorially: the concept-document matrix and...

50
Information Filtering LSI
  • ... and concept-term matrix

51
Information Filtering LSI
  • Q: How to search, e.g., for 'system'?

52
Information Filtering LSI
  • A: find the corresponding concept(s) and the
    corresponding documents

53
Information Filtering LSI
  • A: find the corresponding concept(s) and the
    corresponding documents

54
Information Filtering LSI
  • Thus it works like an (automatically constructed)
    thesaurus
  • we may retrieve documents that DON'T have the
    term 'system', but contain almost everything
    else ('data', 'retrieval')

55
SVD
  • LSI: find concepts

56
SVD - Definition
  • A[n x m] = U[n x r] L[r x r] (V[m x r])T
  • A: n x m matrix (e.g., n documents, m terms)
  • U: n x r matrix (n documents, r concepts)
  • L: r x r diagonal matrix (strength of each
    concept) (r = rank of the matrix)
  • V: m x r matrix (m terms, r concepts)

57
SVD - Example
  • A = U L VT - example

(figure: the term-document matrix A, with terms data, inf., retrieval,
 brain, lung as columns and the CS and MD document groups as rows,
 written as the product U x L x VT)
58
SVD - Example
  • A = U L VT - example

(same figure; the two concept columns are labeled CS-concept and MD-concept)
59
SVD - Example
  • A = U L VT - example
  • U: the doc-to-concept similarity matrix

(same figure, with U highlighted)
60
SVD - Example
  • A = U L VT - example
  • the diagonal of L gives the strength of each
    concept (e.g., the strength of the CS-concept)

(same figure, with L highlighted)
61
SVD - Example
  • A = U L VT - example
  • VT: the term-to-concept similarity matrix

(same figure, with VT highlighted)
62
SVD - Example
  • A = U L VT - example
  • VT: the term-to-concept similarity matrix

(same figure as the previous slide)
63
SVD for LSI
  • documents, terms and concepts:
  • U: document-to-concept similarity matrix
  • V: term-to-concept similarity matrix
  • L: its diagonal elements give the strength of
    each concept

64
SVD for LSI
  • Need to keep all the eigenvectors?
  • NO, just keep the first k (concepts) - see the
    sketch below

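A minimal numpy sketch of "keep only the first k concepts" (the matrix below
is a made-up toy term-document matrix, not the exact example pictured on the
slides):

    import numpy as np

    # rows = documents, columns = terms; the first 3 columns behave like
    # "CS" terms and the last 2 like "MD" terms
    A = np.array([[1, 1, 1, 0, 0],
                  [2, 2, 2, 0, 0],
                  [1, 1, 1, 0, 0],
                  [5, 5, 5, 0, 0],
                  [0, 0, 0, 2, 2],
                  [0, 0, 0, 3, 3],
                  [0, 0, 0, 1, 1]], dtype=float)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    k = 2                                           # keep only the first k concepts
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # rank-k approximation of A
    print(np.round(s[:k], 2))                       # strengths of the two strongest concepts
    print(np.round(A_k, 2))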
65
Web Search
  • What about web search?
  • First you need to get all the documents of the
    web: crawlers.
  • Then you have to index them (inverted files, etc.)
  • Find the web pages that are relevant to the query
  • Report the pages with their links in a sorted
    order
  • Main difference from IR: web pages have links
  • it may be possible to exploit the link structure
    for sorting the relevant documents

66
Kleinberg's Algorithm (HITS)
  • Main idea: in many cases, when you search the web
    using some terms, the most relevant pages may not
    contain these terms (or contain them only a few
    times)
  • Harvard: www.harvard.edu
  • Search engines: yahoo, google, altavista
  • Authorities and hubs

67
Kleinberg's algorithm
  • Problem definition: given the web and a query,
  • find the most authoritative web pages for this
    query
  • Step 0: find all pages containing the query terms
    (root set)
  • Step 1: expand by one move forward and backward
    (base set)

68
Kleinberg's algorithm
  • Step 1: expand by one move forward and backward

69
Kleinberg's algorithm
  • on the resulting graph, give a high score
    (= authorities) to nodes that many important nodes
    point to
  • give a high importance score (= hubs) to nodes
    that point to good authorities

(figure: hubs on the left pointing to authorities on the right)
70
Kleinberg's algorithm
  • observations
  • recursive definition!
  • each node (say, the i-th node) has both an
    authoritativeness score ai and a hubness score hi

71
Kleinberg's algorithm
  • Let E be the set of edges and A be the adjacency
    matrix
  • the (i,j) entry is 1 if the edge from i to j exists
  • Let h and a be n x 1 vectors with the
    hubness and authoritativeness scores
  • Then:

72
Kleinberg's algorithm
  • Then:
  • ai = hk + hl + hm
  • that is:
  • ai = Sum (hj) over all j such that the (j,i) edge
    exists
  • or
  • a = AT h

(figure: nodes k, l, m all point to node i)
73
Kleinberg's algorithm
  • symmetrically, for the hubness:
  • hi = an + ap + aq
  • that is:
  • hi = Sum (aj) over all j such that the (i,j) edge
    exists
  • or
  • h = A a

(figure: node i points to nodes n, p, q)
74
Kleinberg's algorithm
  • In conclusion, we want vectors h and a such that:
  • h = A a
  • a = AT h

Start with a and h set to all 1's. Then apply the
following trick: h = A a = A (AT h) = (A AT) h = ...
= (A AT)^2 h = ... = (A AT)^k h, and similarly
a = (AT A)^k a. (See the sketch below.)
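A small numpy sketch of this iteration (HITS power iteration; the example
graph is made up for illustration):

    import numpy as np

    def hits(A: np.ndarray, iters: int = 50):
        """A[i, j] = 1 if edge i -> j. Returns (authority, hub) score vectors."""
        n = A.shape[0]
        a = np.ones(n)
        h = np.ones(n)
        for _ in range(iters):
            a = A.T @ h                    # a = AT h
            h = A @ a                      # h = A a
            a /= np.linalg.norm(a)         # normalize so the scores don't blow up
            h /= np.linalg.norm(h)
        return a, h

    # tiny example: nodes 0 and 1 both point to node 2
    A = np.array([[0, 0, 1],
                  [0, 0, 1],
                  [0, 0, 0]], dtype=float)
    a, h = hits(A)
    print(np.round(a, 2))   # node 2 gets the highest authority score
    print(np.round(h, 2))   # nodes 0 and 1 get the highest hub scores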
75
Kleinberg's algorithm
  • In short, the solutions to
  • h = A a
  • a = AT h
  • are the left- and right- eigenvectors of the
    adjacency matrix A.
  • Starting from random a and iterating, we'll
    eventually converge
  • (Q: to which of all the eigenvectors? why?)

76
Kleinberg's algorithm
  • (Q: to which of all the eigenvectors? why?)
  • A: to the ones of the strongest eigenvalue,
    because of the property
  • (AT A)^k v ~ (constant) v1

So, we can find the a and h vectors, and the pages
with the highest a values are reported!
77
Kleinberg's algorithm - results
  • E.g., for the query "java":
  • 0.328 www.gamelan.com
  • 0.251 java.sun.com
  • 0.190 www.digitalfocus.com (the java developer)

78
Kleinberg's algorithm - discussion
  • the authority score can be used to find pages
    similar to a page p
  • closely related to citation analysis and social
    networks / small-world phenomena

79
google/page-rank algorithm
  • closely related: the Web is a directed graph of
    connected nodes
  • imagine a particle randomly moving along the
    edges (*)
  • compute its steady-state probabilities. That
    gives the PageRank of each page (the importance
    of this page)
  • (*) with occasional random jumps

80
PageRank Definition
  • Assume a page A and pages T1, T2, ..., Tm that
    point to A. Let d be a damping factor, PR(A) the
    PageRank of A, and C(A) the out-degree of A. Then:

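The formula, as given in the Brin & Page (1998) paper cited in the
references:

\[
PR(A) = (1 - d) + d\left(\frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_m)}{C(T_m)}\right)
\]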
81
google/page-rank algorithm
  • Compute the PR of each page: identical problem to
    "given a Markov chain, compute the steady-state
    probabilities p1 ... p5"

(figure: a 5-node example graph, nodes 1-5)
82
Computing PageRank
  • Iterative procedure
  • Also: navigate the web by randomly following links,
    or with prob. d jumping to a random page. Let A be
    the adjacency matrix (n x n) and ci the out-degree
    of page i
  • Prob(Ai -> Aj) = d n^-1 + (1-d) ci^-1 Aij
  • Ai,j = Prob(Ai -> Aj)   (see the sketch below)

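A minimal numpy sketch of the iterative procedure, using the transition
probabilities defined above (the 3-page graph and the function name are made
up for illustration):

    import numpy as np

    def pagerank(adj: np.ndarray, d: float = 0.15, iters: int = 100) -> np.ndarray:
        """Power iteration with Prob(Ai -> Aj) = d/n + (1-d) * Aij / ci."""
        n = adj.shape[0]
        c = adj.sum(axis=1)                      # out-degrees ci
        c = np.where(c > 0, c, 1.0)              # guard against pages with no out-links
        P = d / n + (1 - d) * adj / c[:, None]   # row-stochastic transition matrix
        p = np.full(n, 1.0 / n)
        for _ in range(iters):
            p = p @ P                            # steady state satisfies p = p P
        return p / p.sum()

    adj = np.array([[0, 1, 1],
                    [1, 0, 0],
                    [0, 1, 0]], dtype=float)     # made-up 3-page example
    print(np.round(pagerank(adj), 3))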
83
google/page-rank algorithm
  • Let A be the transition matrix (= adjacency
    matrix, row-normalized: the sum of each row = 1)

(figure: the 5-node example graph and its row-normalized transition matrix A)
84
google/page-rank algorithm
  • AT p = p

(figure: the equation AT p = p illustrated on the 5-node example)
85
google/page-rank algorithm
  • AT p = p
  • thus, p is the eigenvector that corresponds to
    the highest eigenvalue (= 1, since the matrix is
    row-normalized)

86
Kleinberg/google - conclusions
  • SVD helps in graph analysis:
  • hub/authority scores: strongest left- and right-
    eigenvectors of the adjacency matrix
  • random walk on a graph: the steady-state
    probabilities are given by the strongest
    eigenvector of the transition matrix

87
References
  • Brin, S. and L. Page (1998). The Anatomy of a
    Large-Scale Hypertextual Web Search Engine. 7th
    Intl. World Wide Web Conf.