INFO 624 Week 5: Text Properties and Operations

1
INFO 624 -- Week 5: Text Properties and Operations
  • Dr. Min Song
  • College of Information Science and Technology
  • Drexel University

2
Objectives of Assignment 1
  • Practice basic Web skills
  • Get familiar with a few search engines
  • Learn to describe features of search engines
  • Learn to compare search engines

3
Grading Sheet for Assignment 1
  • Memo
  • Selection of Search engines
  • Is it downloadable?
  • Can it be controlled by a small business?
  • Quality of reviews
  • Formats of review pages, including metadata.
  • Appropriate links in reviews and in the
    registered page

4
What's missing?
  • Who are the current users of the selected search
    engine?
  • Hands-on experience with the selected search
    engines
  • Personal observation or experience
  • Some testing on demos or customer sites.
  • Convincing statements on the differences between
    the search engines.

5
Properties of Text
  • Classic theories
  • Zipf's Law
  • Information Theory
  • Benford's Law
  • Bradford's Law
  • Heaps' Law
  • English letter/word frequencies

6
Zipf's Law (1945)
  • In a large, well-written English document,
  • r × f = c
  • where
  • r is the rank of a word by frequency,
  • f is the number of times the given word is
    used in the document, and c is a constant.
  • Different collections may have different c.
    English text tends to have c ≈ N/10, where N is
    the number of words in the collection.

7
  • Zipf's Law is an empirical observation that holds
    only approximately.
  • Examples
  • Word frequencies in Alice in Wonderland
  • Time magazine collection
  • Zipf's Law has been verified over many years
    on many different collections.
  • There are also many revised versions of Zipf's
    Law.

8
Example
  • The word "the" is the most frequently occurring
    word in the novel "Moby Dick," occurring 1450
    times.
  • The word "with" is the second-most frequently
    occurring word in that novel.
  • How many times would we expect "with" to occur?
  • How many times would we expect the third most
    frequently occurring word to appear?
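A minimal C sketch of the arithmetic, assuming Zipf's Law holds
exactly with the constant c fixed by the rank-1 word:

    #include <stdio.h>

    /* Zipf's Law: r * f = c.  With the rank-1 word ("the") at
     * 1450 occurrences, c = 1 * 1450, so the expected count of
     * the rank-r word is c / r.                                */
    int main(void)
    {
        const double c = 1.0 * 1450.0;
        for (int r = 2; r <= 3; r++)
            printf("rank %d: about %.0f occurrences\n", r, c / r);
        return 0;
    }

So "with" would be expected about 725 times, and the third-ranked
word about 483 times.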

9
Information Theory
  • Entropy (1948)
  • Use the distribution of symbols to predict the
    amount of information in a text.
  • Quantified measure for information
  • Useful for (physical) data transfer
  • And compression
  • Not directly applicable to IR
  • Example
  • Which letter is most likely to appear after the
    letter "c" is received?
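A minimal C sketch of the entropy computation H = -Σ p·log2(p);
the five probabilities are the top letter frequencies from the
next slide, so this covers only part of the distribution and the
value is illustrative (compile with -lm):

    #include <math.h>
    #include <stdio.h>

    /* Shannon entropy: H = -sum(p_i * log2(p_i)) over a symbol
     * distribution; here only the five most frequent letters.  */
    int main(void)
    {
        const double p[] = { 0.124, 0.089, 0.080, 0.076, 0.070 };
        const int n = sizeof p / sizeof p[0];
        double h = 0.0;
        for (int i = 0; i < n; i++)
            h -= p[i] * log2(p[i]);
        printf("partial entropy: %.3f bits\n", h);
        return 0;
    }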

10
English Letter Usage Statistics
  • Letter use frequencies (count, percent of all letters)
  • E 72881 12.4
  • T 52397 8.9
  • A 47072 8.0
  • O 45116 7.6
  • N 41316 7.0
  • I 39710 6.7
  • H 38334 6.5

11
  • Doubled letter frequencies (count, percent)
  • LL 2979 20.6
  • EE 2146 14.8
  • SS 2128 14.7
  • OO 2064 14.3
  • TT 1169 8.1
  • RR 1068 7.4
  • -- 701 4.8
  • PP 628 4.3
  • FF 430 2.9

12
  • Initial letter frequencies (count, percent)
  • T 20665 15.2
  • A 15564 11.4
  • H 11623 8.5
  • W 9597 7.0
  • I 9468 6.9
  • S 9376 6.9
  • O 8205 6.0
  • M 6293 4.6
  • B 5831 4.2

13
  • Ending letter frequencies (count, percent)
  • E 26439 19.4
  • D 17313 12.7
  • S 14737 10.8
  • T 13685 10.0
  • N 10525 7.7
  • R 9491 6.9
  • Y 7915 5.8
  • O 6226 4.5

14
Benford's Law
  • If we randomly select a number from a table of
    statistical data, the probability that the first
    digit will be a "1" is about 0.301, rather than
    0.1 as we might expect if all digits were equally
    likely.
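The 0.301 figure follows from the logarithmic first-digit law:

    P(d) = log10(1 + 1/d),   so   P(1) = log10(2) ≈ 0.301

while, for example, P(9) = log10(10/9) ≈ 0.046.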

15
Bradford's Law
  • On a given subject, a few core journals will
    provide 1/3 of the articles on that subject, a
    medium number of secondary journals will provide
    another 1/3 of the articles on that subject, and
    a large number of peripheral journals will provide
    the final 1/3 of the articles on that subject.

16
For example
  • If you found 300 citations for IR,
  • 100 of those citations likely came from a core
    group of 5 journals,
  • another 100 citations came from a group of 25
    journals,
  • and the final 100 citations came from 125
    peripheral journals.
  • Bradford expressed his law as the ratio
    1 : n : n² (in the example above, n = 5).

17
Heaps' Law
  • The relationship between vocabulary size and text
    size is V = K·n^b, where V is the number of
    unique words, n is the text size (in words), and
    K and b are collection-dependent constants.
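A worked illustration (K = 44 and b = 0.49 are assumed, commonly
quoted values for English corpora, not figures from this deck):

    V = 44 × (1,000,000)^0.49 ≈ 38,000

so a one-million-word text would hold roughly 38,000 distinct
words under these assumptions.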
18
Computerized Text Analysis
  • Word (token) extraction
  • Stop words
  • Stemming
  • Frequency counts
  • Clustering

19
Word Extraction
  • Basic problems
  • Digits
  • Hyphens
  • Punctuation
  • Cases
  • Lexical analyzer
  • Map all possible characters into a finite state
    machine
  • Specify which states should end a token.
  • Example
  • Parser.c
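Parser.c itself is not reproduced here; the following is a
minimal sketch of such a lexical analyzer, with a simplified
policy assumed (digits kept, hyphens and all other punctuation
treated as token breaks, everything lowercased):

    #include <ctype.h>
    #include <stdio.h>

    #define MAXTOK 256

    /* Read the next token from fp into buf; return 1 on success,
     * 0 at end of input.  Any non-alphanumeric character ends a
     * token; overlong tokens are silently truncated.            */
    int next_token(FILE *fp, char *buf, int size)
    {
        int c, len = 0;
        while ((c = fgetc(fp)) != EOF) {
            if (isalnum(c)) {
                if (len < size - 1)
                    buf[len++] = (char)tolower(c);
            } else if (len > 0) {
                break;              /* boundary: emit the token */
            }
        }
        buf[len] = '\0';
        return len > 0;
    }

    int main(void)
    {
        char tok[MAXTOK];
        while (next_token(stdin, tok, MAXTOK))
            puts(tok);
        return 0;
    }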

20
Stop words
  • Many of the most frequently used words in English
    are worthless for indexing; these words are
    called stop words.
  • the, of, and, to, ...
  • Typically about 400 to 500 such words
  • Why do we need to remove stop words?
  • Reduce indexing file size
  • stop words account for 20-30% of total word
    counts.
  • Improve efficiency
  • stop words are not useful for searching
  • stop words always have a large number of hits

21
Stop words
  • Potential problems of removing stop words
  • small stop list does not improve indexing much
  • large stop list may eliminate some words that
    might be useful for someone or for some purposes
  • stopwords might be part of phrases
  • stop word removal must be applied to both
    indexing and queries.
  • Examples
  • Lommoncommon.c
  • commonwords

22
Stemming
  • Techniques used to find the root/stem of a word
  • Example lookup for "user" and "engineering":
  •   user   15     engineering 12
  •   users   4     engineered  23
  •   used    5     engineer    12
  •   using   5
  • stem:  use      engineer

23
Advantages of stemming
  • improving effectiveness
  • matching similar words
  • reducing indexing size
  • combining words with the same root may reduce
    indexing size by as much as 40-50%.
  • Criteria for stemming
  • correctness
  • retrieval effectiveness
  • compression performance

24
Basic stemming methods
  • Use tables and rules
  • remove endings
  • if a word ends with a consonant other than s,
    followed by an s, then delete the s.
  • if a word ends in es, drop the s.
  • if a word ends in ing, delete the ing unless the
    remaining word consists of only one letter or of
    th.
  • if a word ends with ed, preceded by a consonant,
    delete the ed unless this leaves only a single
    letter.
  • ...

25
  • transform the remaining word
  • if a word ends in ies, but not eies or aies,
    then change ies to y (a C sketch of these rules
    follows).
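A minimal C sketch of two of the rules above (the ies → y
transform and the es rule); a production stemmer would encode the
full rule table:

    #include <stdio.h>
    #include <string.h>

    /* Two of the listed rules, applied in place:
     *   ies -> y  (unless the word ends in eies or aies)
     *   es  -> e  (drop the final s)                       */
    void stem(char *w)
    {
        size_t n = strlen(w);
        if (n >= 4 && strcmp(w + n - 3, "ies") == 0
                && w[n - 4] != 'e' && w[n - 4] != 'a')
            strcpy(w + n - 3, "y");      /* ponies -> pony    */
        else if (n >= 3 && strcmp(w + n - 2, "es") == 0)
            w[n - 1] = '\0';             /* classes -> classe */
    }

    int main(void)
    {
        char a[16] = "ponies", b[16] = "classes";
        stem(a);
        stem(b);
        printf("%s %s\n", a, b);         /* prints: pony classe */
        return 0;
    }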

26
Example 1: The Porter Stemming Algorithm
  • A set of condition/action rules
  • condition on the stem
  • condition on the suffix
  • condition on the rules
  • different combinations of conditions activate
    different rules.
  • Implementation (stem.c)
  • Stem(word)
  •   ...
  •   ReplaceEnd(word, step1a_rule);
  •   rule = ReplaceEnd(word, step1b_rule);
  •   if (rule == 106 || rule == 107)
  •       ReplaceEnd(word, step1b1_rule);

27
Example 2: Sound-based stemming
  • Soundex rules
  • Letter                     Numeric equivalent
  • B, F, P, V                 1
  • C, G, J, K, Q, S, X, Z     2
  • D, T                       3
  • L                          4
  • M, N                       5
  • R                          6
  • A, E, I, O, U, W, Y        not coded
  • Words that sound similar often have the same code
  • The code is not unique
  • high compression rate
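A minimal C sketch of the coding scheme above (classic Soundex
also treats H and W specially, which is omitted here); it assumes
a non-empty alphabetic input word:

    #include <ctype.h>
    #include <stdio.h>

    /* Map a letter to its Soundex digit, or 0 if not coded. */
    static char code(char c)
    {
        switch (toupper(c)) {
        case 'B': case 'F': case 'P': case 'V':            return '1';
        case 'C': case 'G': case 'J': case 'K': case 'Q':
        case 'S': case 'X': case 'Z':                      return '2';
        case 'D': case 'T':                                return '3';
        case 'L':                                          return '4';
        case 'M': case 'N':                                return '5';
        case 'R':                                          return '6';
        default:                                           return 0;
        }
    }

    /* Keep the first letter, then append digits for coded
     * letters, skipping adjacent repeats, padding to 4 chars. */
    void soundex(const char *w, char out[5])
    {
        int len = 0;
        char prev = code(w[0]);
        out[len++] = (char)toupper(w[0]);
        for (int i = 1; w[i] != '\0' && len < 4; i++) {
            char d = code(w[i]);
            if (d != 0 && d != prev)
                out[len++] = d;
            prev = d;
        }
        while (len < 4)
            out[len++] = '0';
        out[4] = '\0';
    }

    int main(void)
    {
        char s[5];
        soundex("Robert", s);
        printf("%s\n", s);   /* prints R163 */
        return 0;
    }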

28
Frequency counts
  • The idea
  • Counting numbers is what a computer does best
  • count the number of times a word occurs in a
    document
  • count the number of documents in a collection
    that contain a word
  • Use occurrence frequencies to indicate the
    relative importance of a word in a document
  • if a word appears often in a document, the
    document likely deals with subjects related to
    the word.

29
  • Using occurrence frequencies to select most
    useful words to index a document collection
  • if a word appears in every document, it is not a
    good indexing word
  • if a word appears in only one or two documents,
    it may not be a good indexing word
  • If a word appears in titles, each occurrence
    should be counted 5 (or 10) times.
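A minimal sketch of in-document frequency counting, using a
linear-scan table for clarity; real indexers use hash tables or
tries:

    #include <stdio.h>
    #include <string.h>

    #define MAXWORDS 1000

    struct entry { char word[32]; int count; };
    static struct entry table[MAXWORDS];
    static int nwords = 0;

    /* Increment the count for w, adding it on first sight. */
    void count_word(const char *w)
    {
        for (int i = 0; i < nwords; i++) {
            if (strcmp(table[i].word, w) == 0) {
                table[i].count++;
                return;
            }
        }
        if (nwords < MAXWORDS) {
            strncpy(table[nwords].word, w, 31);
            table[nwords].count = 1;
            nwords++;
        }
    }

    int main(void)
    {
        const char *doc[] = { "text", "properties", "text" };
        for (int i = 0; i < 3; i++)
            count_word(doc[i]);
        for (int i = 0; i < nwords; i++)
            printf("%s %d\n", table[i].word, table[i].count);
        return 0;
    }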

30
Automatic indexing
  • 1. Parse individual words (tokens)
  • 2. Remove stop words.
  • 3. Stemming
  • 4. Use frequency data
  • decide the high-frequency (head) threshold
  • decide the low-frequency (tail) threshold
  • decide the variance of counting

31
  • 5. Create the indexing structure
  • inverted index
  • other structures
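A minimal sketch of one inverted-index entry: a term pointing to
a postings list of (document id, frequency) pairs; the field
names and fixed sizes are illustrative assumptions:

    #include <stdio.h>

    struct posting { int doc_id; int freq; };

    struct term_entry {
        const char *term;
        int ndocs;                   /* documents containing term */
        struct posting postings[8];  /* fixed size for the sketch */
    };

    int main(void)
    {
        /* "retrieval" occurs in doc 1 (3 times) and doc 4 (once). */
        struct term_entry e = {
            "retrieval", 2, { { 1, 3 }, { 4, 1 } }
        };
        printf("%s:", e.term);
        for (int i = 0; i < e.ndocs; i++)
            printf(" (doc %d, tf %d)", e.postings[i].doc_id,
                   e.postings[i].freq);
        putchar('\n');
        return 0;
    }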

32
Term Associations
  • Counting word pairs
  • If two words appear together very often, they are
    likely to be a phrase
  • Counting document pairs
  • if two documents have many common words, they are
    likely related

33
More Counting
  • Counting citation pairs
  • If documents A and B both cite documents C and D,
    then A and B might be related.
  • If documents C and D are often cited together,
    they are likely related.
  • Counting link patterns
  • Get all pages that have links to my pages.
  • Get all pages that contain similar links to my
    pages

34
Google Search Engine
  • Link analysis
  • PageRank -- the ranking of a web page is based on
    the number of links that refer to that page
  • If page A has a link to B, page A casts one vote
    for B.
  • The more votes a page gets, the more useful the
    page is.
  • If page A itself receives many votes, its vote
    for B counts more heavily.
  • Combines link analysis with word matching (a
    sketch of the voting iteration follows).
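A minimal sketch of the voting iteration on a three-page toy
graph; the graph, the damping factor 0.85, and the iteration
count are illustrative assumptions, not Google's actual
parameters:

    #include <stdio.h>

    #define N 3   /* pages in this toy web */

    /* adj[i][j] = 1 if page i links to page j (invented graph). */
    static const int adj[N][N] = {
        { 0, 1, 1 },
        { 0, 0, 1 },
        { 1, 0, 0 },
    };

    int main(void)
    {
        const double d = 0.85;                 /* damping factor */
        double pr[N] = { 1.0 / N, 1.0 / N, 1.0 / N }, next[N];

        for (int iter = 0; iter < 50; iter++) {
            for (int j = 0; j < N; j++) {
                double votes = 0.0;
                for (int i = 0; i < N; i++) {
                    int out = 0;
                    for (int k = 0; k < N; k++)
                        out += adj[i][k];      /* i's out-degree */
                    if (adj[i][j] && out > 0)
                        votes += pr[i] / out;  /* i's split vote */
                }
                next[j] = (1.0 - d) / N + d * votes;
            }
            for (int j = 0; j < N; j++)
                pr[j] = next[j];
        }
        for (int j = 0; j < N; j++)
            printf("page %d: rank %.3f\n", j, pr[j]);
        return 0;
    }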

35
ConceptLink
  • Use terms co-occurring frequencies
  • to predict semantic relationships
  • to build concept clusters
  • to suggest search terms
  • Visualization of term relationships
  • Link displays
  • Map displays
  • Drag-and-drop interface for searching

36
Document clustering
  • Grouping similar documents into sets
  • Create a similarity matrix
  • Apply a hierarchical clustering algorithm (see
    the sketch below)
  • 1 Identify the two closest documents and combine
    them into a cluster
  • 2 Identify the next two closest documents or
    clusters and combine them into a cluster
  • 3 If more than one cluster remains, return to
    step 1
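A minimal sketch of that loop on a four-document similarity
matrix (values invented for illustration); each pass merges the
most similar pair of clusters:

    #include <stdio.h>

    #define N 4

    /* Pairwise similarities for four documents (invented). */
    static const double sim[N][N] = {
        { 1.0, 0.8, 0.1, 0.2 },
        { 0.8, 1.0, 0.3, 0.1 },
        { 0.1, 0.3, 1.0, 0.9 },
        { 0.2, 0.1, 0.9, 1.0 },
    };

    int main(void)
    {
        int cluster[N];             /* cluster label per document */
        for (int i = 0; i < N; i++)
            cluster[i] = i;

        /* N-1 merges: join the most similar pair each pass. */
        for (int step = 0; step < N - 1; step++) {
            int bi = 0, bj = 1;
            double best = -1.0;
            for (int i = 0; i < N; i++)
                for (int j = i + 1; j < N; j++)
                    if (cluster[i] != cluster[j] && sim[i][j] > best) {
                        best = sim[i][j];
                        bi = i;
                        bj = j;
                    }
            printf("merge doc %d with doc %d (sim %.1f)\n",
                   bi, bj, best);
            int old = cluster[bj];
            for (int k = 0; k < N; k++)   /* relabel merged set */
                if (cluster[k] == old)
                    cluster[k] = cluster[bi];
        }
        return 0;
    }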

37
Application of Document Clustering
  • Vivisimo
  • Cluster search results on the fly
  • Hierarchical categories for drill-down capability
  • AltaVista
  • Refine search
  • Cluster related words into different groups based
    on their co-occurrence rates in documents.

38
AltaVista
39
Document Similarity
  • Documents
  • D1 = (t11, t12, t13, ..., t1n)
  • D2 = (t21, t22, t23, ..., t2n)
  • tik is either 0 or 1.
  • Simple measurements of difference/similarity
  • w = the number of times t1k = 1 and t2k = 1.
  • x = the number of times t1k = 1 and t2k = 0.
  • y = the number of times t1k = 0 and t2k = 1.
  • z = the number of times t1k = 0 and t2k = 0.

40
Similarity Measure
  • Cosine Coefficient
  • cos(D1, D2) = (D1 · D2) / (|D1| |D2|)
  • For binary vectors, this is the same as
    w / sqrt(n1 × n2)

41
  • D1's terms only: n1 = w + x (the number of times
    t1k = 1)
  • D2's terms only: n2 = w + y (the number of times
    t2k = 1)
  • Sameness count: sc = (w + z)/(n1 + n2)
  • Difference count: dc = (x + y)/(n1 + n2)
  • Rectangular distance: rd = MAX(n1, n2)
  • Conditional probability: cp = min(n1, n2)
  • mean: (n1 + n2)/2

42
Similarity Measure
  • Dice's Coefficient
  • Dice(D1, D2) = 2w/(n1 + n2)
  • where w is the number of terms that D1 and D2
    have in common; n1, n2 are the numbers of terms
    in D1 and D2.
  • Jaccard Coefficient
  • Jaccard(D1, D2) = w/(N - z)
  •                 = w/(n1 + n2 - w)
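A minimal sketch computing w, x, y, z from two binary term
vectors (invented for illustration) and then the cosine, Dice,
and Jaccard coefficients defined above (compile with -lm):

    #include <math.h>
    #include <stdio.h>

    #define N 6   /* vocabulary size for this toy example */

    int main(void)
    {
        /* Binary term vectors for two documents. */
        const int d1[N] = { 1, 1, 1, 0, 0, 1 };
        const int d2[N] = { 1, 0, 1, 1, 0, 0 };

        int w = 0, x = 0, y = 0, z = 0;
        for (int k = 0; k < N; k++) {
            if (d1[k] && d2[k])       w++;   /* both 1     */
            else if (d1[k] && !d2[k]) x++;   /* only in D1 */
            else if (!d1[k] && d2[k]) y++;   /* only in D2 */
            else                      z++;   /* both 0     */
        }
        int n1 = w + x, n2 = w + y;

        printf("cosine  = %.3f\n", w / sqrt((double)n1 * n2));
        printf("dice    = %.3f\n", 2.0 * w / (n1 + n2));
        printf("jaccard = %.3f\n", (double)w / (n1 + n2 - w));
        return 0;
    }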

43
Similarity Metric
  • A metric has three defining properties
  • Its values are non-negative
  • It is symmetric
  • It satisfies the triangle inequality
  • d(A, C) ≤ d(A, B) + d(B, C)

44
Lp Metrics
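The slide's figure is not reproduced here; for reference, the
general form of the Lp metric between vectors X and Y is

    Lp(X, Y) = ( Σ |xi - yi|^p )^(1/p)

where p = 1 gives the Manhattan distance, p = 2 the Euclidean
distance, and p → ∞ the maximum-coordinate distance.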
45
Similarity Matrix
  • Pairwise similarities among a group of
    documents
  • S11 S12 S13 S14 S15 S16 S17 S18
  • S21 S22 S23 S24 S25 S26 S27 S28
  • S31 S32 S33 S34 S35 S36 S37 S38
  • S41 S42 S43 S44 S45 S46 S47 S48
  • S51 S52 S53 S54 S55 S56 S57 S58
  • S61 S62 S63 S64 S65 S66 S67 S68
  • S71 S72 S73 S74 S75 S76 S77 S78
  • S81 S82 S83 S84 S85 S86 S87 S88

46
MetaData
  • Data about data
  • Descriptive Data
  • External to the meaning of the document
  • Dublin Core Metadata Element Set
  • Author, title, publisher, etc.
  • Semantic Metadata
  • Subject indexing
  • Challenge: automatic generation of metadata for
    documents

47
Markup Languages
  • [Diagram] Metalanguages (SGML, XML) and the
    languages defined with them: HTML and HyTime
    from SGML; RDF, MathML, and SMIL from XML;
    XSL and CSS as stylesheet languages; pointing
    toward the Semantic Web(?)
48
Midterm
  • Concepts
  • What is information retrieval?
  • Data, information, text, and documents
  • Two abstraction principles
  • Users' information needs
  • Queries and query formats
  • Precision and Recall
  • Relevance
  • Zipf's Law, Benford's Law

49
Midterm
  • Procedures and problem solving
  • How to translate a request into a query?
  • How to expand queries
  • for better recall or better precision?
  • How to create an inverted index?
  • How to create a vector space?
  • How to calculate similarities of documents?
  • How to match a query to documents in a vector
    space?

50
  • Discussions
  • Challenges of IR
  • Advantages and disadvantages of Boolean search
    (vector space, automatic indexing,
    association-based queries, etc.)
  • Evaluation of IR systems
  • With or without using precision/recall.
  • Difference between data retrieval and information
    retrieval