INSYS 300 - PowerPoint PPT Presentation
About This Presentation
Title:

INSYS 300

Slides: 19
Provided by: xlin2

Transcript and Presenter's Notes

Title: INSYS 300


1
INSYS 300 Week 5: Association Analysis
  • Dr. Xia Lin
  • Associate Professor
  • College of Information Science and Technology
  • Drexel University

2
Automatic indexing
  • 1. Parse individual words (tokens).
  • 2. Remove stop words.
  • 3. Apply stemming.
  • 4. Use frequency data:
    • decide the head (high-frequency) threshold
    • decide the tail (low-frequency) threshold
    • decide variance of counting

3
  • 5. Create an indexing structure:
    • inverted index
    • other structures
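
The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not the course's own implementation: the stop list, the crude suffix-stripping `stem` function, and the `min_df`/`max_df` parameters (standing in for the tail and head thresholds, applied here to document frequency) are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical stop list and suffix rules, chosen only for illustration.
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "in", "is"}
SUFFIXES = ("ing", "es", "s")


def stem(token):
    """Crude suffix stripping; a real system would use a proper stemmer."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Steps 1-3: tokenize, drop stop words, stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


def build_inverted_index(docs, min_df=1, max_df=None):
    """Steps 4-5: keep terms whose document frequency lies in
    [min_df, max_df], then map each term to the sorted ids of the
    documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in index_terms(text):
            postings[term].add(doc_id)
    if max_df is None:
        max_df = len(docs)
    return {
        term: sorted(ids)
        for term, ids in postings.items()
        if min_df <= len(ids) <= max_df
    }


docs = {
    1: "The truth about cats and dogs",
    2: "Raining cats and dogs",
    3: "Handbook of small animal practice",
}
index = build_inverted_index(docs)
print(index["cat"])  # ids of the documents containing the stem "cat"
```

Raising `min_df` drops rare terms and lowering `max_df` drops very common ones, which is one simple way to apply the two thresholds.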

4
Term Associations
  • Counting word pairs
    • If two words appear together very often, they are
      likely to be a phrase.
  • Counting document pairs
    • If two documents have many common words, they are
      likely related.
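
Both kinds of counting can be sketched briefly. In this example "appear together" is taken to mean "appear next to each other", and the shared-word count uses plain whitespace tokens; both are simplifying assumptions, and the function names are made up for the illustration.

```python
from collections import Counter


def adjacent_pair_counts(texts):
    """Count how often each ordered pair of adjacent words occurs."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        counts.update(zip(words, words[1:]))
    return counts


def shared_word_count(text_a, text_b):
    """Count the distinct words two documents have in common."""
    return len(set(text_a.lower().split()) & set(text_b.lower().split()))


titles = [
    "I love hot dogs",
    "hot dogs and cool cats",
    "hot dogs oil-less air compressor",
]
pair_counts = adjacent_pair_counts(titles)
top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count)
print(shared_word_count(titles[0], titles[1]))
```

The most frequent adjacent pair is a phrase candidate, and a high shared-word count suggests that two documents are related.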

5
More Counting
  • Counting citation pairs
    • If documents A and B both cite documents C and D,
      then A and B might be related.
    • If documents C and D are often cited together,
      they are likely related.
  • Counting link patterns
    • Get all pages that have links to my pages.
    • Get all pages that contain links similar to those
      on my pages.
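
The two citation counts are commonly called bibliographic coupling (A and B share references) and co-citation (C and D are cited by the same documents). A small sketch with made-up citation data, where `cites` maps each document to the set of documents it cites:

```python
from collections import Counter
from itertools import combinations

# Hypothetical citation data: document -> set of documents it cites.
cites = {
    "A": {"C", "D"},
    "B": {"C", "D", "E"},
    "F": {"E"},
}


def bibliographic_coupling(cites):
    """For each pair of citing documents, count the references they share."""
    return {
        (a, b): len(cites[a] & cites[b])
        for a, b in combinations(sorted(cites), 2)
    }


def co_citation(cites):
    """For each pair of cited documents, count how many documents cite both."""
    counts = Counter()
    for refs in cites.values():
        counts.update(combinations(sorted(refs), 2))
    return counts


coupling = bibliographic_coupling(cites)
cocited = co_citation(cites)
print(coupling[("A", "B")], cocited[("C", "D")])
```

The same two functions apply to web links if `cites` is read as "page links to pages".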

6
Google Search Engine
  • Link analysis
    • PageRank: the ranking of a web page is based on
      the number of links that refer to that page.
    • If page A has a link to B, page A casts one vote for
      B.
    • The more votes a page gets, the more useful the
      page is considered to be.
    • If page A itself receives many votes, its vote for
      B will count more heavily.
  • Combining link analysis with word matching.
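
The voting idea can be sketched as a simple power iteration. The damping factor of 0.85, the fixed iteration count, and the handling of pages without outgoing links are assumptions of this sketch; the slide does not specify them, and this is not a description of Google's production system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its (weighted) vote among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a page with no outgoing links votes for every page.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: A and C both link to B.
links = {"A": ["B"], "C": ["B"], "B": ["D"], "D": ["A"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Because each vote is multiplied by the voter's own rank, a link from a highly ranked page counts more than a link from an obscure one.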

7
Other similarity measurements
  • Assume each $t_{ik}$ is either 0 or 1.
  • D1's terms only: $n_1 = w + x$ (the number of times
    $t_{1k} = 1$)
  • D2's terms only: $n_2 = w + y$ (the number of times
    $t_{2k} = 1$)
  • Sameness count: $sc = (w + z)/(n_1 + n_2)$
  • Difference count: $dc = (x + y)/(n_1 + n_2)$
  • Rectangular distance: $rd = \max(n_1, n_2)$
  • Conditional probability: $cp = \min(n_1, n_2)$
  • Mean: $mean = (n_1 + n_2)/2$
8
  • Simple measurement of difference/similarity
  • $w$ = the number of times $t_{1k} = 1$, $t_{2k} = 1$.
  • $x$ = the number of times $t_{1k} = 1$, $t_{2k} = 0$.
  • $y$ = the number of times $t_{1k} = 0$, $t_{2k} = 1$.
  • $z$ = the number of times $t_{1k} = 0$, $t_{2k} = 0$.
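
The four counts can be computed directly from two binary term vectors. A minimal sketch; the vectors `d1` and `d2` are made-up examples, and the function name is hypothetical.

```python
def contingency_counts(d1, d2):
    """Return (w, x, y, z) for two equal-length binary term vectors."""
    if len(d1) != len(d2):
        raise ValueError("vectors must have the same length")
    w = sum(1 for a, b in zip(d1, d2) if a == 1 and b == 1)  # in both
    x = sum(1 for a, b in zip(d1, d2) if a == 1 and b == 0)  # in D1 only
    y = sum(1 for a, b in zip(d1, d2) if a == 0 and b == 1)  # in D2 only
    z = sum(1 for a, b in zip(d1, d2) if a == 0 and b == 0)  # in neither
    return w, x, y, z


d1 = [1, 1, 0, 1, 0, 0]
d2 = [1, 0, 1, 1, 0, 0]
w, x, y, z = contingency_counts(d1, d2)
n1, n2 = w + x, w + y  # number of terms present in each document
print(w, x, y, z, n1, n2)
```

The counts always sum to the vector length, and `w + x` and `w + y` equal the number of 1s in each vector.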

9
Similarity Measure
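  • Dice's Coefficient
    • $\mathrm{Dice}(D_1, D_2) = 2w/(n_1 + n_2)$
    • where $w$ is the number of terms that D1 and D2
      have in common; $n_1$, $n_2$ are the numbers of terms
      in D1 and D2.
  • Jaccard Coefficient
    • $\mathrm{Jaccard}(D_1, D_2) = w/(N - z) = w/(n_1 + n_2 - w)$
    • where $N = w + x + y + z$ is the total number of terms,
      so $N - z$ counts the terms in at least one document.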
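
With documents represented as sets of terms, both coefficients take a few lines. A minimal sketch; the term sets are invented, and returning 0.0 for two empty documents is a convention chosen for this example.

```python
def dice(terms1, terms2):
    """Dice's coefficient: 2w / (n1 + n2)."""
    n1, n2 = len(terms1), len(terms2)
    if n1 + n2 == 0:
        return 0.0  # assumed convention for two empty documents
    w = len(terms1 & terms2)
    return 2 * w / (n1 + n2)


def jaccard(terms1, terms2):
    """Jaccard coefficient: w / (n1 + n2 - w)."""
    w = len(terms1 & terms2)
    union_size = len(terms1) + len(terms2) - w
    return w / union_size if union_size else 0.0


d1 = {"information", "retrieval", "index", "term"}
d2 = {"retrieval", "term", "cluster"}
print(dice(d1, d2), jaccard(d1, d2))
```

Here w = 2, n1 = 4, and n2 = 3, so Dice is 4/7 and Jaccard is 2/5. Dice is never smaller than Jaccard for the same pair.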

10
Similarity Measure
  • Cosine Coefficient
  • The same as
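
The formula on this slide did not survive in the transcript. As a sketch under the standard definition, which may differ in notation from the slide, the cosine coefficient is the dot product of the two term vectors divided by the product of their lengths. For binary vectors this reduces to w / sqrt(n1 * n2), using the counts defined on the earlier slides.

```python
import math


def cosine(v1, v2):
    """Cosine of the angle between two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm1 * norm2)


d1 = [1, 1, 0, 1, 0, 0]
d2 = [1, 0, 1, 1, 0, 0]
print(cosine(d1, d2))
```

For these vectors w = 2 and n1 = n2 = 3, so the result is 2/3.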

11
Similarity Metric
  • A metric has three defining properties:
    • Its values are non-negative.
    • It is symmetric.
    • It satisfies the triangle inequality:
      $AC \le AB + BC$

12
Lp Metrics
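
The body of this slide is not in the transcript. As a sketch under the usual definition, the Lp (Minkowski) distance between two vectors is the p-th root of the sum of the p-th powers of the absolute coordinate differences. p = 1 gives the city-block distance, p = 2 the Euclidean distance, and the limit as p grows gives the largest coordinate difference. The function names and sample points below are made up for the illustration.

```python
def lp_distance(v1, v2, p=2.0):
    """Minkowski (Lp) distance between two equal-length vectors, for p >= 1."""
    if p < 1:
        raise ValueError("p must be at least 1 for the triangle inequality to hold")
    return sum(abs(a - b) ** p for a, b in zip(v1, v2)) ** (1.0 / p)


def chebyshev_distance(v1, v2):
    """Limit of the Lp distance as p goes to infinity."""
    return max(abs(a - b) for a, b in zip(v1, v2))


a, b, c = [0, 0], [3, 4], [6, 0]
print(lp_distance(a, b, p=1), lp_distance(a, b, p=2), chebyshev_distance(a, b))
# Triangle inequality: AC <= AB + BC
print(lp_distance(a, c) <= lp_distance(a, b) + lp_distance(b, c))
```
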

13
ConceptLink
  • Use term co-occurrence frequencies
    • to predict semantic relationships
    • to build concept clusters
    • to suggest search terms
  • Visualization of term relationships
    • Link displays
    • Map displays
  • Drag-and-drop interface for searching

14
Similarity Matrix
  • Pairwise coupling of similarities among a group
    of documents
  • S11 S12 S13 S14 S15 S16 S17 S18
  • S21 S22 S23 S24 S25 S26 S27 S28
  • S31 S32 S33 S34 S35 S36 S37 S38
  • S41 S42 S43 S44 S45 S46 S47 S48
  • S51 S52 S53 S54 S55 S56 S57 S58
  • S61 S62 S63 S64 S65 S66 S67 S68
  • S71 S72 S73 S74 S75 S76 S77 S78
  • S81 S82 S83 S84 S85 S86 S87 S88
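
A similarity matrix like the one above can be filled in by applying any pairwise measure to every pair of documents. This sketch uses the Jaccard coefficient on invented term sets and assumes the measure is symmetric, so only the upper triangle is computed.

```python
def jaccard(a, b):
    """Jaccard coefficient of two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(docs, sim):
    """Return S with S[i][j] = sim(docs[i], docs[j]); sim is assumed symmetric."""
    n = len(docs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            matrix[i][j] = matrix[j][i] = sim(docs[i], docs[j])
    return matrix


docs = [
    {"cats", "dogs", "pets"},
    {"cats", "dogs"},
    {"food", "kids"},
]
S = similarity_matrix(docs, jaccard)
for row in S:
    print([round(value, 2) for value in row])
```

The diagonal entries are 1.0 for non-empty documents, and the matrix is symmetric, so S12 equals S21.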

15
Document Similarity
  • Documents
  • $D_1 = (t_{11}, t_{12}, t_{13}, \ldots, t_{1n})$
  • $D_2 = (t_{21}, t_{22}, t_{23}, \ldots, t_{2n})$

16
Document clustering
  • Grouping similar documents into different sets
  • Create a similarity matrix
  • Apply a hierarchical clustering algorithm:
    • 1. Identify the two closest documents and combine
      them into a cluster.
    • 2. Identify the next two closest documents or
      clusters and combine them into a cluster.
    • 3. If more than one cluster remains, return to
      step 2.
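
The merging loop can be sketched as follows. The slide does not say how the closeness of two clusters is measured; this example assumes single linkage (the similarity of the most similar pair of members), and the similarity values in `S` are invented.

```python
from itertools import combinations


def agglomerate(similarity, target_clusters=1):
    """Single-linkage agglomerative clustering over a similarity matrix.

    Returns (merges, clusters): the merges performed, each as
    (members_a, members_b, similarity), and the clusters that remain.
    """
    clusters = [frozenset([i]) for i in range(len(similarity))]
    merges = []

    def linkage(c1, c2):
        # Single linkage: similarity of the closest pair across the clusters.
        return max(similarity[i][j] for i in c1 for j in c2)

    while len(clusters) > max(target_clusters, 1):
        a, b = max(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((sorted(a), sorted(b), linkage(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges, clusters


S = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
merges, clusters = agglomerate(S, target_clusters=2)
print(merges)
print(sorted(sorted(c) for c in clusters))
```

Running the loop down to one cluster and recording the merges yields the full hierarchy; stopping earlier, as here, yields a flat grouping.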

17
Application of Document Clustering
  • Northern Light
    • Cluster search results into different groups
  • AltaVista
    • Refine search
    • Cluster related words into different groups based
      on their co-occurrence rates in documents.

18
  • 1. TI: The Truth About Cats and Dogs
    DE: Video, Pets, Cats, Dogs,
  • 2. TI: National Geographic's Really Wild Animals: Hot Dogs and Cool Cats
    DE: Video, Animals
  • 3. TI: Dogs, Cats & Kids
    DE: Pets, Pictures, Kids
  • 4. TI: Makita Horsepower 2-1/2 Gallon Hot Dogs Oil-Less Air Compressor
    DE: Products, Food
  • 5. TI: I Love Hot Dogs
    DE: Fictions, Food, Kids & Families
  • 6. TI: It's Raining Cats and Dogs
    DE: Fictions, Pictures, Kids & Families
  • 7. TI: Infectious Diseases of the Dogs & Cats
    DE: Non-fictions, cats, dogs,
  • 8. TI: Handbook of Small Animal Practice
    DE: Non-fictions, Animals, Cats, dogs,
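
As an illustration of how the descriptor (DE) fields can separate records whose titles share the words "cats" and "dogs", the sketch below ranks a few of the records above by Jaccard similarity to record 8. The descriptors are lowercased and taken as transcribed; only records 1, 2, 4, 7, and 8 are included, and the function names are made up for the example.

```python
# Descriptor sets for some of the sample records, lowercased.
records = {
    1: {"video", "pets", "cats", "dogs"},
    2: {"video", "animals"},
    4: {"products", "food"},
    7: {"non-fictions", "cats", "dogs"},
    8: {"non-fictions", "animals", "cats", "dogs"},
}


def jaccard(a, b):
    """Jaccard coefficient of two descriptor sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_by_similarity(target_id, records):
    """Rank all other records by descriptor similarity to the target record."""
    target = records[target_id]
    scores = [
        (other_id, jaccard(target, terms))
        for other_id, terms in records.items()
        if other_id != target_id
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


ranking = rank_by_similarity(8, records)
for record_id, score in ranking:
    print(record_id, round(score, 2))
```

Record 7 comes out closest to record 8, while the air compressor (record 4) shares no descriptors with it even though its title contains "Dogs".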