1
Web Search, Web Community and Web Mining
  • The lecture notes are edited based on the
    lecture notes of "Mining the Web: Discovering
    Knowledge from Hypertext Data" by Soumen
    Chakrabarti; a talk by Soumen Chakrabarti et al.
    on "Mining the Link Structure of the World Wide
    Web"; a talk by Sanjay Kumar Madria on "Web
    Mining: A Bird's Eye View"; a talk by Chen Li on
    "Searching and Integrating Information on the
    Web"; a talk given by Yanchun Zhang on "Web
    Search and Web Community"; and a talk by S. D.
    Kamvar et al. on "Extrapolation Methods for
    Accelerating PageRank Computations"

2
Web: Bigger Than We Think
  • The Web is expanding continuously
  • Today's search engines only cover a fraction of
    the existing pages
  • The Web is 500 times larger than the segment
    covered by standard search engines such as Yahoo!
    and AltaVista
  • The Web holds about 550 billion documents; search
    engines index a combined total of 1 billion
    pages (CNET, 26 July 2000)

3
Web Search Problem
  • Web search tools such as Yahoo!, AltaVista, and
    Google return more information than users require
  • Example: for the search "Java Programming", Google
    returns 1,330,000 (more than 1 million) Web
    pages, and AltaVista returns 16,921,862 (more than
    16 million) Web pages
  • Users only care about a small, interesting
    portion of the Web search results

4
Closer Look at the Problems
  • Lacking the concept of the importance of each page
    on each topic
  • E.g., my homepage is not as important as
    Yahoo's main page
  • A link from Yahoo is more important than a link
    from a personal homepage
  • But how to capture the importance of a page?
  • A guess: # of hits? → Where to get that info?
  • # of inlinks to a page → Google's main idea

5
(Google) PageRank
  • Intuition
  • The importance of each page should be decided by
    what other pages say about this page
  • One naïve implementation: count the # of pages
    pointing to each page (i.e., the # of inlinks)
  • Problem
  • We can easily fool this technique by generating
    many dummy pages that point to our class page

6
Google (PageRank)
  • Assumption
  • The importance of a page is proportional to the
    sum of the prestige scores of the pages linking to it
  • Random surfer on a strongly connected web graph

[Diagram: A's home page is linked by 2 important pages (Yahoo!, CNN); B's home page is linked by 1 unimportant page (DB Pub Server).]
7
Page Importance
  • The importance of a page is given by the
    importance of the pages that link to it.

$$r_i = \sum_{j \to i} \frac{r_j}{N_j}$$

where $r_i$ is the importance of page $i$, $r_j$ the importance of page $j$, $N_j$ the number of outlinks from page $j$, and the sum runs over the pages $j$ that link to page $i$.
8
Link Counts Example
[Diagram: link counts for pages A and B, with links from Yahoo!, CNN, and the DB Pub Server.]
9
Computing PageRank
  • Initialize: $r_i = 1/n$ for every page $i$
  • Repeat until convergence:

$$r_i \leftarrow \sum_{j \to i} \frac{r_j}{N_j}$$

where $r_i$ is the importance of page $i$, $N_j$ is the number of outlinks from page $j$, and the sum runs over the pages $j$ that link to page $i$.
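A minimal sketch of this iteration in Python; the graph representation, function name, and convergence tolerance are illustrative assumptions, not from the slides:

```python
def pagerank(out_links, tol=1e-8):
    """Plain power iteration: r_i = sum over j -> i of r_j / N_j.
    Assumes every page has at least one outlink (see slides 15-16
    for what goes wrong otherwise)."""
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # initialize r_i = 1/n
    while True:
        new_rank = {p: 0.0 for p in pages}
        for j, targets in out_links.items():
            share = rank[j] / len(targets)      # r_j / N_j
            for i in targets:
                new_rank[i] += share            # page i collects from j -> i
        if max(abs(new_rank[p] - rank[p]) for p in pages) < tol:
            return new_rank
        rank = new_rank

# The three-site MiniWeb of slide 13 (links reconstructed from the
# stated behavior and the final result on slide 14):
mini_web = {"Ne": ["Ne", "Am"], "MS": ["Am"], "Am": ["Ne", "MS"]}
print(pagerank(mini_web))  # converges to Ne = Am = 0.4, MS = 0.2
```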
10
PageRank Diagram
[Diagram: initialize all nodes to rank 1/6 ≈ 0.167; propagate ranks across links, multiplying by link weights.]
11
PageRank Diagram
[Diagram: node ranks after one round of propagation.]
12
PageRank: The Surfing Model
  • Correspondence between the surfer model and the
    notion of importance
  • Page v has high prestige if its visit rate is
    high
  • This happens if there are many neighbors u with
    high visit rates leading to v

[Diagram: after a while, the visit rates converge to 0.4, 0.4, and 0.2.]
13
Example MiniWeb
  • Our MiniWeb has only three web sites: Netscape
    (Ne), Microsoft (MS), and Amazon (Am)
  • Their weights are represented as a vector

[Diagram: the link structure among Ne, MS, and Am.]
For instance, in each iteration, half of the
weight of Am goes to Ne, and half goes to MS.
(Materials courtesy of Jeff Ullman)
14
Iterative computation
  • Final result:
  • Netscape and Amazon have the same importance, and
    twice the importance of Microsoft
  • Does it capture the intuition? Yes.
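As a worked sketch, assuming the link structure of Ullman's classic example (reconstructed from the caption on slide 13 and the final result: Ne links to itself and Am, MS links to Am, Am links to Ne and MS), the iteration on the weight vector $x = (ne, ms, am)^T$ runs:

```latex
x_{k+1} = M x_k,\qquad
M = \begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 0 & 1/2 \\ 1/2 & 1 & 0 \end{pmatrix},\qquad
\begin{pmatrix} 1/3 \\ 1/3 \\ 1/3 \end{pmatrix} \to
\begin{pmatrix} 1/3 \\ 1/6 \\ 1/2 \end{pmatrix} \to
\begin{pmatrix} 5/12 \\ 1/4 \\ 1/3 \end{pmatrix} \to \cdots \to
\begin{pmatrix} 2/5 \\ 1/5 \\ 2/5 \end{pmatrix}
```

The fixed point $(2/5, 1/5, 2/5)$ matches the claim above: Ne and Am end with the same weight, twice that of MS.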
15
Problem 1: Dead Ends!
  • MS does not point to anybody
  • Result: the weights of the Web leak out

16
Problem 2: Spider Traps
  • MS only points to itself
  • Result: all weights go to MS!
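Both problems are conventionally handled by mixing link-following with random jumps (the damping factor of the PageRank paper; the slides do not cover this fix). A hedged sketch extending the earlier pagerank function, with d = 0.85 as the commonly cited value:

```python
def pagerank_damped(out_links, d=0.85, tol=1e-8):
    """With probability d the surfer follows a link; with probability
    1 - d it jumps to a random page. Weight sitting on dead ends is
    redistributed uniformly, so nothing leaks out (problem 1) and no
    spider trap can absorb everything (problem 2)."""
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    while True:
        leaked = sum(rank[j] for j, ts in out_links.items() if not ts)
        new_rank = {p: (1 - d) / n + d * leaked / n for p in pages}
        for j, targets in out_links.items():
            for i in targets:
                new_rank[i] += d * rank[j] / len(targets)
        if max(abs(new_rank[p] - rank[p]) for p in pages) < tol:
            return new_rank
        rank = new_rank
```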

17
PageRank at Google
  • The ranking of pages matters more than the exact
    values of $p_i$
  • Page ranks converged in 52 iterations for a
    crawl with 322 million links
  • Pre-compute and store the PageRank of each page
  • PageRank is independent of any query or textual
    content
  • The ranking scheme combines PageRank with textual
    match
  • Unpublished
  • Many empirical parameters, human effort, and
    regression testing
  • Criticism: ad-hoc coupling and decoupling
    between relevance and importance

18
Hubs and Authorities
  • Motivation: find web pages related to a topic
  • E.g., find all web sites about automobiles
  • Authority: a page that offers info about a
    topic
  • E.g., DBLP is a page about papers
  • E.g., google.com, aj.com, teoma.com, lycos.com
  • Hub: a page that doesn't provide much info, but
    tells us where to find pages about a topic
  • E.g., www.searchenginewatch.com is a hub of
    search engines

19
Two Values of a Page
  • Each page has a hub value and an authority value
  • In PageRank, each page has only one value (its weight)
  • Two vectors:
  • H: hub values
  • A: authority values

20
HITS: Find Hubs and Authorities
  • First step: find pages related to the topic
    (e.g., automobile), and construct the
    corresponding focused subgraph, as sketched below
  • Find the pages S containing the keyword
    (automobile)
  • Find all pages these S pages point to, i.e.,
    their forward neighbors
  • Find all pages that point to S pages, i.e., their
    backward neighbors
  • Compute the subgraph of these pages

[Diagram: the root set expanded into the focused subgraph.]
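A minimal sketch of the expansion step; search, forward, and backward are assumed helper functions standing in for a keyword search engine and a link index, none of which are specified on the slides:

```python
def focused_subgraph(search, forward, backward, query, root_size=200):
    """Build the HITS base set: a keyword root set S, plus every page
    S points to (forward neighbors) and every page pointing into S
    (backward neighbors)."""
    root = set(search(query)[:root_size])  # pages containing the keyword
    base = set(root)
    for page in root:
        base.update(forward(page))         # forward neighbors
        base.update(backward(page))        # backward neighbors
    return base                            # nodes of the focused subgraph
```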
21
Computing H and A
  • Initially, set every hub and authority value to 1
  • In each iteration, the hub value of a page is the
    total authority value of its forward neighbors
  • The authority value of each page is the total hub
    value of its backward neighbors
  • Iterate until convergence

[Diagram: hubs on one side pointing to authorities on the other.]
22
HITS: How to Count?
  • Hubs count OUT-links
  • Authorities count IN-links

23
HITS: Calculating Values
  • Authority value: $a_i = \sum_j A(j, i)\, h_j$
  • Hub value: $h_i = \sum_j A(i, j)\, a_j$
  • Matrix notation ($A$ = adjacency matrix, with
    $A(i, j) = 1$ if the $i$-th page points to the
    $j$-th page): $a = A^T h$, $h = A\, a$
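A minimal sketch of the HITS iteration in Python; the graph representation, iteration count, and the choice of Euclidean normalization are illustrative assumptions:

```python
import math

def hits(out_links, iterations=50):
    """HITS updates in matrix terms: a = A^T h, then h = A a.
    Assumes the graph is closed (every link target is a node)."""
    pages = list(out_links)
    hub = {p: 1.0 for p in pages}    # initial hub values (slide 21)
    auth = {p: 1.0 for p in pages}   # initial authority values
    for _ in range(iterations):
        # authority of i = total hub value of its backward neighbors
        auth = {i: sum(hub[j] for j in pages if i in out_links[j])
                for i in pages}
        # hub of j = total authority value of its forward neighbors
        hub = {j: sum(auth[i] for i in out_links[j]) for j in pages}
        # normalize so values converge rather than grow without bound
        for vec in (auth, hub):
            norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
            for p in vec:
                vec[p] /= norm
    return hub, auth
```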

24
PageRank vs HITS
  • PageRank advantages over HITS
  • Query-time cost is low
  • HITS computes an eigenvector for every query
  • Less susceptible to localized link spam
  • HITS advantages over PageRank
  • HITS ranking is sensitive to the query
  • HITS has the notion of hubs and authorities

25
Web Community
  • Suppose one is familiar with some Web pages on a
    specific topic, such as sports
  • Find more pages about the same topic
  • Web community: an entity of related web pages
    (centers)

26
What is a cyber-community?
  • A community on the web is a group of web pages
    sharing a common interest
  • E.g., a group of web pages talking about pop music
  • E.g., a group of web pages interested in
    data mining
  • Main properties
  • Pages in the same community should be similar to
    each other in content
  • The pages in one community should differ from the
    pages in another community
  • Similar to a cluster

27
Two different types of communities
  • Explicitly-defined communities
  • These are well-known ones, such as the resources
    listed by Yahoo!
  • Implicitly-defined communities
  • These are communities unexpected by or invisible to
    most users
  • How to find them?!

[Diagram: an explicit hierarchy, e.g., Arts → Music, Painting; Music → Classic, Pop. An implicit community: e.g., the group of web pages interested in a particular singer.]
28
Similarity of Web Pages
  • Discovering web communities is similar to
    clustering. For clustering, we must define the
    similarity of two nodes
  • Method I
  • For page A and page B, A is related to B if there
    is a hyperlink from A to B, or from B to A
  • Not so good. Consider the home pages of IBM and
    Microsoft: as competitors, they don't point to
    each other.
29
Similarity of Web Pages
  • Method II (from bibliometrics)
  • Co-citation: the similarity of A and B is
    measured by the number of pages that cite both A
    and B
  • Bibliographic coupling: the similarity of A and B
    is measured by the number of pages cited by both
    A and B
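A small sketch of both measures in Python, assuming a dict mapping each page to the set of pages it links to (citation = hyperlink here):

```python
def co_citation(links, a, b):
    """Number of pages that point to both a and b."""
    return sum(1 for targets in links.values()
               if a in targets and b in targets)

def bibliographic_coupling(links, a, b):
    """Number of pages pointed to by both a and b."""
    return len(links.get(a, set()) & links.get(b, set()))
```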
30
Methods of Clustering
  • These clustering methods can discover meaningful
    communities
  • But they are too expensive to apply to the whole
    World Wide Web, with its billions of web pages

31
An Effective Method
  • The method is from Ravi Kumar, Prabhakar Raghavan,
    Sridhar Rajagopalan, and Andrew Tomkins
  • IBM Almaden Research Center
  • They call their method communities trawling (CT)
  • They implemented it on a graph of 200 million
    pages, and it worked very well

32
Basic idea of CT
  • Dense directed bipartite subgraphs
  • Bipartite graph: the nodes are partitioned into two
    sets, F and C
  • Every directed edge in the graph is directed from
    a node u in F to a node v in C
  • The graph is dense if many of the possible edges
    between F and C are present

[Diagram: edges directed from the set F to the set C.]
33
Basic idea of CT
  • Bipartite cores
  • A complete bipartite subgraph with at least i
    nodes from F and at least j nodes from C
  • i and j are tunable parameters
  • An (i, j) bipartite core
  • Every community has such a core for certain i
    and j
  • A bipartite core is the identity of a community
  • To extract all the communities is to enumerate
    all the bipartite cores on the web, as sketched
    below

[Diagram: an (i=3, j=3) bipartite core.]
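A naive sketch of the core test; real trawling prunes aggressively to cope with 200 million pages, while this brute-force version (with an assumed adjacency-set representation) is only meant to make the definition concrete:

```python
from itertools import combinations

def find_cores(out_links, i=3, j=3):
    """Yield (F, C) pairs forming a complete (i, j) bipartite core:
    every one of the i fans in F links to every one of the j centers
    in C. Exponential; suitable for tiny graphs only."""
    pages = list(out_links)
    for F in combinations(pages, i):
        # candidate centers: pages linked to by *every* fan in F
        common = set.intersection(*(set(out_links[f]) for f in F)) - set(F)
        for C in combinations(sorted(common), j):
            yield F, C
```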
34
Experiment on CT
  • 200 million web pages
  • An IBM PC with an Intel 300 MHz Pentium II
    processor and 512 MB of memory, running Linux
  • i from 3 to 10 and j from 3 to 20
  • 200k potential communities were discovered
  • 29% of them cannot be found in Yahoo!

35
Mining the World-Wide Web
  • The WWW is a huge, widely distributed, global
    information source for
  • Information services: news, advertisements,
    consumer information, financial management,
    education, government, e-commerce, etc.
  • Hyperlink information
  • Access and usage information
  • Web site contents and organization

36
Web Mining Taxonomy
Web Mining
  • Web Content Mining
  • Web Usage Mining
  • Web Structure Mining
37
Web Content Mining
  • Discovery of useful information from web contents
    / data / documents
  • Web data contents: text, image, audio, video,
    metadata, and hyperlinks
  • Information Retrieval View (structured and
    semi-structured data)
  • Assist / improve information finding
  • Filter information for users based on user
    profiles
  • Database View
  • Model the data on the web
  • Integrate it for more sophisticated queries

38
Web Structure Mining
  • Discover the link structure of the hyperlinks
    at the inter-document level to generate a
    structural summary about the Web site and its Web
    pages
  • Direction 1: based on the hyperlinks,
    categorize the Web pages and generate related
    information
  • Direction 2: discover the structure of the Web
    document itself
  • Direction 3: discover the nature of the
    hierarchy or network of hyperlinks in the Web site
    of a particular domain

39
Web Structure Mining
  • Finding authoritative Web pages
  • Retrieving pages that are not only relevant, but
    also of high quality, or authoritative on the
    topic
  • Hyperlinks can be used to infer the notion of
    authority
  • The Web consists not only of pages, but also of
    hyperlinks pointing from one page to another
  • These hyperlinks contain an enormous amount of
    latent human annotation
  • A hyperlink pointing to another Web page can
    be considered the author's endorsement of
    the other page

40
Web Usage Mining
  • Also known as Web log mining
  • Applying mining techniques to discover interesting
    usage patterns from the secondary data derived
    from users' interactions while surfing the web

41
Web Usage Mining
  • Applications
  • Target potential customers for electronic
    commerce
  • Enhance the quality and delivery of Internet
    information services to the end user
  • Improve Web server system performance
  • Identify potential prime advertisement locations
  • Facilitate personalization / adaptive sites
  • Improve site design
  • Fraud / intrusion detection
  • Predict users' actions (allows prefetching)