INFO 624 Week 5: Text Properties and Operations

1
INFO 624 -- Week 5: Text Properties and Operations
  • Dr. Min Song
  • College of Information Science and Technology
  • Drexel University

2
Objectives of Assignment 1
  • Practice basic Web skills
  • Get familiar with a few search engines
  • Learn to describe features of search engines
  • Learn to compare search engines

3
Grading Sheet for Assignment 1
  • Memo
  • Selection of Search engines
  • Is it downloadable?
  • Can it be controlled by a small business?
  • Quality of reviews
  • Formats of review pages, including metadata.
  • Appropriate links in reviews and in the
    registered page

4
What's missing?
  • Who are the current users of the selected search
    engine?
  • Hands-on experience with the selected search
    engines
  • Personal observation or experience
  • Some testing on demos or customer sites.
  • Convincing statements on the differences between
    the search engines.

5
Properties of Text
  • Classic theories
  • Zipf's Law
  • Information Theory
  • Benford's Law
  • Bradford's Law
  • Heaps' Law
  • English letter/word frequencies

6
Zipf's Law (1945)
  • In a large, well-written English document,
  • r × f = c
  • where
  • r is the rank of a word by frequency,
  • f is the number of times the given word is
    used in the document, and c is a constant.
  • Different collections may have different c.
    English text tends to have c ≈ N/10, where N is
    the number of words in the collection.

7
  • Zipf's Law is an empirical observation that holds
    only approximately.
  • Examples
  • Word frequencies in Alice in Wonderland
  • Time magazine collection
  • Zipf's Law has been verified over many years
    on many different collections.
  • There are also many revised versions of Zipf's
    Law.

8
Example
  • The word "the" is the most frequently occurring
    word in the novel "Moby Dick," occurring 1450
    times.
  • The word "with" is the second-most frequently
    occurring word in that novel.
  • How many times would we expect "with" to occur?
  • How many times would we expect the third most
    frequently occurring word to appear?
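A minimal C sketch of the arithmetic, assuming Zipf's Law holds
exactly with the constant c fixed by the rank-1 word:

    #include <stdio.h>

    /* Zipf's Law: r * f = c.  With the rank-1 word ("the") at
     * 1450 occurrences, c = 1 * 1450, so the expected count of
     * the rank-r word is c / r.                                */
    int main(void)
    {
        const double c = 1.0 * 1450.0;
        for (int r = 2; r <= 3; r++)
            printf("rank %d: about %.0f occurrences\n", r, c / r);
        return 0;
    }

So "with" would be expected about 725 times, and the third-ranked
word about 483 times.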

9
Information Theory
  • Entropy (1948)
  • Use the distribution of symbols to predict the
    amount of information in a text.
  • Quantified measure for information
  • Useful for (physical) data transfer
  • And compression
  • Not directly applicable to IR
  • Example
  • Which letter is most likely to appear after the
    letter "c" is received?
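A minimal C sketch of the entropy computation H = -Σ p·log2(p);
the five probabilities are the top letter frequencies from the
next slide, so this covers only part of the distribution and the
value is illustrative (compile with -lm):

    #include <math.h>
    #include <stdio.h>

    /* Shannon entropy: H = -sum(p_i * log2(p_i)) over a symbol
     * distribution; here only the five most frequent letters.  */
    int main(void)
    {
        const double p[] = { 0.124, 0.089, 0.080, 0.076, 0.070 };
        const int n = sizeof p / sizeof p[0];
        double h = 0.0;
        for (int i = 0; i < n; i++)
            h -= p[i] * log2(p[i]);
        printf("partial entropy: %.3f bits\n", h);
        return 0;
    }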

10
English Letter Usage Statistics
  • Letter use frequencies (count, percent of all letters)
  • E 72881 12.4
  • T 52397 8.9
  • A 47072 8.0
  • O 45116 7.6
  • N 41316 7.0
  • I 39710 6.7
  • H 38334 6.5

11
  • Doubled letter frequencies (count, percent)
  • LL 2979 20.6
  • EE 2146 14.8
  • SS 2128 14.7
  • OO 2064 14.3
  • TT 1169 8.1
  • RR 1068 7.4
  • -- 701 4.8
  • PP 628 4.3
  • FF 430 2.9

12
  • Initial letter frequencies (count, percent)
  • T 20665 15.2
  • A 15564 11.4
  • H 11623 8.5
  • W 9597 7.0
  • I 9468 6.9
  • S 9376 6.9
  • O 8205 6.0
  • M 6293 4.6
  • B 5831 4.2

13
  • Ending letter frequencies (count, percent)
  • E 26439 19.4
  • D 17313 12.7
  • S 14737 10.8
  • T 13685 10.0
  • N 10525 7.7
  • R 9491 6.9
  • Y 7915 5.8
  • O 6226 4.5

14
Benford's Law
  • If we randomly select a number from a table of
    statistical data, the probability that the first
    digit will be a "1" is about 0.301, rather than
    0.1 as we might expect if all digits were equally
    likely.
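The 0.301 figure follows from the logarithmic first-digit law:

    P(d) = log10(1 + 1/d),   so   P(1) = log10(2) ≈ 0.301

while, for example, P(9) = log10(10/9) ≈ 0.046.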

15
Bradford's Law
  • On a given subject, a few core journals will
    provide 1/3 of the articles on that subject, a
    medium number of secondary journals will provide
    another 1/3 of the articles on that subject, and
    a large number of peripheral journals will provide
    the final 1/3 of the articles on that subject.

16
For example
  • If you found 300 citations for IR,
  • 100 of those citations likely came from a core
    group of 5 journals,
  • another 100 citations came from a group of 25
    journals,
  • and the final 100 citations came from 125
    peripheral journals.
  • Bradford expressed his law as the ratio
    1 : n : n² (in the example above, n = 5).

17
Heaps' Law
  • The relationship between vocabulary size and text
    size is V = K·n^b, where V is the number of
    unique words, n is the text size (in words), and
    K and b are collection-dependent constants.
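A worked illustration (K = 44 and b = 0.49 are assumed, commonly
quoted values for English corpora, not figures from this deck):

    V = 44 × (1,000,000)^0.49 ≈ 38,000

so a one-million-word text would hold roughly 38,000 distinct
words under these assumptions.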
18
Computerized Text Analysis
  • Word (token) extraction
  • Stop words
  • Stemming
  • Frequency counts
  • Clustering

19
Word Extraction
  • Basic problems
  • Digits
  • Hyphens
  • Punctuation
  • Cases
  • Lexical analyzer
  • Map all possible characters into a finite state
    machine
  • Specify which states should end a token.
  • Example
  • Parser.c
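Parser.c itself is not reproduced here; the following is a
minimal sketch of such a lexical analyzer, with a simplified
policy assumed (digits kept, hyphens and all other punctuation
treated as token breaks, everything lowercased):

    #include <ctype.h>
    #include <stdio.h>

    #define MAXTOK 256

    /* Read the next token from fp into buf; return 1 on success,
     * 0 at end of input.  Any non-alphanumeric character ends a
     * token; overlong tokens are silently truncated.            */
    int next_token(FILE *fp, char *buf, int size)
    {
        int c, len = 0;
        while ((c = fgetc(fp)) != EOF) {
            if (isalnum(c)) {
                if (len < size - 1)
                    buf[len++] = (char)tolower(c);
            } else if (len > 0) {
                break;              /* boundary: emit the token */
            }
        }
        buf[len] = '\0';
        return len > 0;
    }

    int main(void)
    {
        char tok[MAXTOK];
        while (next_token(stdin, tok, MAXTOK))
            puts(tok);
        return 0;
    }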

20
Stop words
  • Many of the most frequently used words in English
    are worthless for indexing; these words are
    called stop words.
  • the, of, and, to, ...
  • Typically about 400 to 500 such words
  • Why do we need to remove stop words?
  • Reduce indexing file size
  • stop words account for 20-30% of total word
    counts.
  • Improve efficiency
  • stop words are not useful for searching
  • stop words always have a large number of hits

21
Stop words
  • Potential problems of removing stop words
  • small stop list does not improve indexing much
  • large stop list may eliminate some words that
    might be useful for someone or for some purposes
  • stopwords might be part of phrases
  • stop word removal must be applied to both
    indexing and queries.
  • Examples
  • Lommoncommon.c
  • commonwords

22
Stemming
  • Techniques used to find the root/stem of a word
  • Example lookup for "user" and "engineering":
  •   user   15     engineering 12
  •   users   4     engineered  23
  •   used    5     engineer    12
  •   using   5
  • stem:  use      engineer

23
Advantages of stemming
  • improving effectiveness
  • matching similar words
  • reducing indexing size
  • combining words with the same root may reduce
    indexing size by as much as 40-50%.
  • Criteria for stemming
  • correctness
  • retrieval effectiveness
  • compression performance

24
Basic stemming methods
  • Use tables and rules
  • remove endings
  • if a word ends with a consonant other than s,
    followed by an s, then delete the s.
  • if a word ends in es, drop the s.
  • if a word ends in ing, delete the ing unless the
    remaining word consists of only one letter or of
    th.
  • if a word ends with ed, preceded by a consonant,
    delete the ed unless this leaves only a single
    letter.
  • ...

25
  • transform the remaining word
  • if a word ends in ies, but not eies or aies,
    then change ies to y (a C sketch of these rules
    follows).
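A minimal C sketch of two of the rules above (the ies → y
transform and the es rule); a production stemmer would encode the
full rule table:

    #include <stdio.h>
    #include <string.h>

    /* Two of the listed rules, applied in place:
     *   ies -> y  (unless the word ends in eies or aies)
     *   es  -> e  (drop the final s)                       */
    void stem(char *w)
    {
        size_t n = strlen(w);
        if (n >= 4 && strcmp(w + n - 3, "ies") == 0
                && w[n - 4] != 'e' && w[n - 4] != 'a')
            strcpy(w + n - 3, "y");      /* ponies -> pony    */
        else if (n >= 3 && strcmp(w + n - 2, "es") == 0)
            w[n - 1] = '\0';             /* classes -> classe */
    }

    int main(void)
    {
        char a[16] = "ponies", b[16] = "classes";
        stem(a);
        stem(b);
        printf("%s %s\n", a, b);         /* prints: pony classe */
        return 0;
    }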

26
Example 1: The Porter Stemming Algorithm
  • A set of condition/action rules
  • condition on the stem
  • condition on the suffix
  • condition on the rules
  • different combinations of conditions activate
    different rules.
  • Implementation (stem.c)
  • Stem(word)
  •   ...
  •   ReplaceEnd(word, step1a_rule);
  •   rule = ReplaceEnd(word, step1b_rule);
  •   if (rule == 106 || rule == 107)
  •       ReplaceEnd(word, step1b1_rule);

27
Example 2: Sound-based stemming
  • Soundex rules
  • Letter                     Numeric equivalent
  • B, F, P, V                 1
  • C, G, J, K, Q, S, X, Z     2
  • D, T                       3
  • L                          4
  • M, N                       5
  • R                          6
  • A, E, I, O, U, W, Y        not coded
  • Words that sound similar often have the same code
  • The code is not unique
  • high compression rate
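A minimal C sketch of the coding scheme above (classic Soundex
also treats H and W specially, which is omitted here); it assumes
a non-empty alphabetic input word:

    #include <ctype.h>
    #include <stdio.h>

    /* Map a letter to its Soundex digit, or 0 if not coded. */
    static char code(char c)
    {
        switch (toupper(c)) {
        case 'B': case 'F': case 'P': case 'V':            return '1';
        case 'C': case 'G': case 'J': case 'K': case 'Q':
        case 'S': case 'X': case 'Z':                      return '2';
        case 'D': case 'T':                                return '3';
        case 'L':                                          return '4';
        case 'M': case 'N':                                return '5';
        case 'R':                                          return '6';
        default:                                           return 0;
        }
    }

    /* Keep the first letter, then append digits for coded
     * letters, skipping adjacent repeats, padding to 4 chars. */
    void soundex(const char *w, char out[5])
    {
        int len = 0;
        char prev = code(w[0]);
        out[len++] = (char)toupper(w[0]);
        for (int i = 1; w[i] != '\0' && len < 4; i++) {
            char d = code(w[i]);
            if (d != 0 && d != prev)
                out[len++] = d;
            prev = d;
        }
        while (len < 4)
            out[len++] = '0';
        out[4] = '\0';
    }

    int main(void)
    {
        char s[5];
        soundex("Robert", s);
        printf("%s\n", s);   /* prints R163 */
        return 0;
    }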

28
Frequency counts
  • The idea
  • Counting numbers is what a computer does best
  • count the number of times a word occurs in a
    document
  • count the number of documents in a collection
    that contain a word
  • Use occurrence frequencies to indicate the
    relative importance of a word in a document
  • if a word appears often in a document, the
    document likely deals with subjects related to
    the word.

29
  • Using occurrence frequencies to select most
    useful words to index a document collection
  • if a word appears in every document, it is not a
    good indexing word
  • if a word appears in only one or two documents,
    it may not be a good indexing word
  • If a word appears in titles, each occurrence
    should be counted 5 (or 10) times.
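A minimal sketch of in-document frequency counting, using a
linear-scan table for clarity; real indexers use hash tables or
tries:

    #include <stdio.h>
    #include <string.h>

    #define MAXWORDS 1000

    struct entry { char word[32]; int count; };
    static struct entry table[MAXWORDS];
    static int nwords = 0;

    /* Increment the count for w, adding it on first sight. */
    void count_word(const char *w)
    {
        for (int i = 0; i < nwords; i++) {
            if (strcmp(table[i].word, w) == 0) {
                table[i].count++;
                return;
            }
        }
        if (nwords < MAXWORDS) {
            strncpy(table[nwords].word, w, 31);
            table[nwords].count = 1;
            nwords++;
        }
    }

    int main(void)
    {
        const char *doc[] = { "text", "properties", "text" };
        for (int i = 0; i < 3; i++)
            count_word(doc[i]);
        for (int i = 0; i < nwords; i++)
            printf("%s %d\n", table[i].word, table[i].count);
        return 0;
    }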

30
Automatic indexing
  • 1. Parse individual words (tokens)
  • 2. Remove stop words.
  • 3. Stemming
  • 4. Use frequency data
  • decide the high-frequency (head) threshold
  • decide the low-frequency (tail) threshold
  • decide the variance of counting

31
  • 5. Create the indexing structure
  • inverted index
  • other structures
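A minimal sketch of one inverted-index entry: a term pointing to
a postings list of (document id, frequency) pairs; the field
names and fixed sizes are illustrative assumptions:

    #include <stdio.h>

    struct posting { int doc_id; int freq; };

    struct term_entry {
        const char *term;
        int ndocs;                   /* documents containing term */
        struct posting postings[8];  /* fixed size for the sketch */
    };

    int main(void)
    {
        /* "retrieval" occurs in doc 1 (3 times) and doc 4 (once). */
        struct term_entry e = {
            "retrieval", 2, { { 1, 3 }, { 4, 1 } }
        };
        printf("%s:", e.term);
        for (int i = 0; i < e.ndocs; i++)
            printf(" (doc %d, tf %d)", e.postings[i].doc_id,
                   e.postings[i].freq);
        putchar('\n');
        return 0;
    }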

32
Term Associations
  • Counting word pairs
  • If two words appear together very often, they are
    likely to be a phrase
  • Counting document pairs
  • if two documents have many common words, they are
    likely related

33
More Counting
  • Counting citation pairs
  • If documents A and B both cite documents C and D,
    then A and B might be related.
  • If documents C and D are often cited together,
    they are likely related.
  • Counting link patterns
  • Get all pages that have links to my pages.
  • Get all pages that contain similar links to my
    pages

34
Google Search Engine
  • Link analysis
  • PageRank -- the ranking of a web page is based on
    the number of links that refer to that page
  • If page A has a link to B, page A casts one vote
    for B.
  • The more votes a page gets, the more useful the
    page is.
  • If page A itself receives many votes, its vote
    for B counts more heavily.
  • Combines link analysis with word matching (a
    sketch of the voting iteration follows).
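A minimal sketch of the voting iteration on a three-page toy
graph; the graph, the damping factor 0.85, and the iteration
count are illustrative assumptions, not Google's actual
parameters:

    #include <stdio.h>

    #define N 3   /* pages in this toy web */

    /* adj[i][j] = 1 if page i links to page j (invented graph). */
    static const int adj[N][N] = {
        { 0, 1, 1 },
        { 0, 0, 1 },
        { 1, 0, 0 },
    };

    int main(void)
    {
        const double d = 0.85;                 /* damping factor */
        double pr[N] = { 1.0 / N, 1.0 / N, 1.0 / N }, next[N];

        for (int iter = 0; iter < 50; iter++) {
            for (int j = 0; j < N; j++) {
                double votes = 0.0;
                for (int i = 0; i < N; i++) {
                    int out = 0;
                    for (int k = 0; k < N; k++)
                        out += adj[i][k];      /* i's out-degree */
                    if (adj[i][j] && out > 0)
                        votes += pr[i] / out;  /* i's split vote */
                }
                next[j] = (1.0 - d) / N + d * votes;
            }
            for (int j = 0; j < N; j++)
                pr[j] = next[j];
        }
        for (int j = 0; j < N; j++)
            printf("page %d: rank %.3f\n", j, pr[j]);
        return 0;
    }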

35
ConceptLink
  • Use terms co-occurring frequencies
  • to predict semantic relationships
  • to build concept clusters
  • to suggest search terms
  • Visualization of term relationships
  • Link displays
  • Map displays
  • Drag-and-drop interface for searching

36
Document clustering
  • Grouping similar documents into sets
  • Create a similarity matrix
  • Apply a hierarchical clustering algorithm (see
    the sketch below)
  • 1 Identify the two closest documents and combine
    them into a cluster
  • 2 Identify the next two closest documents or
    clusters and combine them into a cluster
  • 3 If more than one cluster remains, return to
    step 1
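A minimal sketch of that loop on a four-document similarity
matrix (values invented for illustration); each pass merges the
most similar pair of clusters:

    #include <stdio.h>

    #define N 4

    /* Pairwise similarities for four documents (invented). */
    static const double sim[N][N] = {
        { 1.0, 0.8, 0.1, 0.2 },
        { 0.8, 1.0, 0.3, 0.1 },
        { 0.1, 0.3, 1.0, 0.9 },
        { 0.2, 0.1, 0.9, 1.0 },
    };

    int main(void)
    {
        int cluster[N];             /* cluster label per document */
        for (int i = 0; i < N; i++)
            cluster[i] = i;

        /* N-1 merges: join the most similar pair each pass. */
        for (int step = 0; step < N - 1; step++) {
            int bi = 0, bj = 1;
            double best = -1.0;
            for (int i = 0; i < N; i++)
                for (int j = i + 1; j < N; j++)
                    if (cluster[i] != cluster[j] && sim[i][j] > best) {
                        best = sim[i][j];
                        bi = i;
                        bj = j;
                    }
            printf("merge doc %d with doc %d (sim %.1f)\n",
                   bi, bj, best);
            int old = cluster[bj];
            for (int k = 0; k < N; k++)   /* relabel merged set */
                if (cluster[k] == old)
                    cluster[k] = cluster[bi];
        }
        return 0;
    }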

37
Application of Document Clustering
  • Vivisimo
  • Cluster search results on the fly
  • Hierarchical categories for drill-down capability
  • AltaVista
  • Refine search
  • Cluster related words into different groups based
    on their co-occurrence rates in documents.

38
AltaVista
39
Document Similarity
  • Documents
  • D1 = (t11, t12, t13, ..., t1n)
  • D2 = (t21, t22, t23, ..., t2n)
  • tik is either 0 or 1.
  • Simple measurements of difference/similarity
  • w = the number of times t1k = 1 and t2k = 1.
  • x = the number of times t1k = 1 and t2k = 0.
  • y = the number of times t1k = 0 and t2k = 1.
  • z = the number of times t1k = 0 and t2k = 0.

40
Similarity Measure
  • Cosine Coefficient
  • cos(D1, D2) = (D1 · D2) / (|D1| |D2|)
  • For binary vectors, this is the same as
    w / sqrt(n1 × n2)

41
  • D1's terms only: n1 = w + x (the number of times
    t1k = 1)
  • D2's terms only: n2 = w + y (the number of times
    t2k = 1)
  • Sameness count: sc = (w + z)/(n1 + n2)
  • Difference count: dc = (x + y)/(n1 + n2)
  • Rectangular distance: rd = MAX(n1, n2)
  • Conditional probability: cp = min(n1, n2)
  • mean: (n1 + n2)/2

42
Similarity Measure
  • Dice's Coefficient
  • Dice(D1, D2) = 2w/(n1 + n2)
  • where w is the number of terms that D1 and D2
    have in common; n1, n2 are the numbers of terms
    in D1 and D2.
  • Jaccard Coefficient
  • Jaccard(D1, D2) = w/(N - z)
  •                 = w/(n1 + n2 - w)
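A minimal sketch computing w, x, y, z from two binary term
vectors (invented for illustration) and then the cosine, Dice,
and Jaccard coefficients defined above (compile with -lm):

    #include <math.h>
    #include <stdio.h>

    #define N 6   /* vocabulary size for this toy example */

    int main(void)
    {
        /* Binary term vectors for two documents. */
        const int d1[N] = { 1, 1, 1, 0, 0, 1 };
        const int d2[N] = { 1, 0, 1, 1, 0, 0 };

        int w = 0, x = 0, y = 0, z = 0;
        for (int k = 0; k < N; k++) {
            if (d1[k] && d2[k])       w++;   /* both 1     */
            else if (d1[k] && !d2[k]) x++;   /* only in D1 */
            else if (!d1[k] && d2[k]) y++;   /* only in D2 */
            else                      z++;   /* both 0     */
        }
        int n1 = w + x, n2 = w + y;

        printf("cosine  = %.3f\n", w / sqrt((double)n1 * n2));
        printf("dice    = %.3f\n", 2.0 * w / (n1 + n2));
        printf("jaccard = %.3f\n", (double)w / (n1 + n2 - w));
        return 0;
    }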

43
Similarity Metric
  • A metric has three defining properties
  • Its values are non-negative
  • It is symmetric
  • It satisfies the triangle inequality
  • d(A, C) ≤ d(A, B) + d(B, C)

44
Lp Metrics
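The slide's figure is not reproduced here; for reference, the
general form of the Lp metric between vectors X and Y is

    Lp(X, Y) = ( Σ |xi - yi|^p )^(1/p)

where p = 1 gives the Manhattan distance, p = 2 the Euclidean
distance, and p → ∞ the maximum-coordinate distance.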
45
Similarity Matrix
  • Pairwise similarities among a group of
    documents
  • S11 S12 S13 S14 S15 S16 S17 S18
  • S21 S22 S23 S24 S25 S26 S27 S28
  • S31 S32 S33 S34 S35 S36 S37 S38
  • S41 S42 S43 S44 S45 S46 S47 S48
  • S51 S52 S53 S54 S55 S56 S57 S58
  • S61 S62 S63 S64 S65 S66 S67 S68
  • S71 S72 S73 S74 S75 S76 S77 S78
  • S81 S82 S83 S84 S85 S86 S87 S88

46
MetaData
  • Data about data
  • Descriptive Data
  • External to the meaning of the document
  • Dublin Core Metadata Element Set
  • Author, title, publisher, etc.
  • Semantic Metadata
  • Subject indexing
  • Challenge: automatic generation of metadata for
    documents

47
Markup Languages
  • [Diagram] Metalanguages (SGML, XML) and the
    languages defined with them: HTML and HyTime
    from SGML; RDF, MathML, and SMIL from XML;
    XSL and CSS as stylesheet languages; pointing
    toward the Semantic Web(?)
48
Midterm
  • Concepts
  • What is information retrieval?
  • Data, information, text, and documents
  • Two abstraction principles
  • Users' information needs
  • Queries and query formats
  • Precision and Recall
  • Relevance
  • Zipf's Law, Benford's Law

49
Midterm
  • Procedures and problem solving
  • How to translate a request into a query?
  • How to expand queries
  • for better recall or better precision?
  • How to create an inverted index?
  • How to create a vector space?
  • How to calculate similarities of documents?
  • How to match a query to documents in a vector
    space?

50
  • Discussions
  • Challenges of IR
  • Advantages and disadvantages of Boolean search
    (vector space, automatic indexing,
    association-based queries, etc.)
  • Evaluation of IR systems
  • With or without using precision/recall.
  • Difference between data retrieval and information
    retrieval