Extracting knowledge from the World Wide Web presentation

1
Extracting knowledge from the World Wide Web

Presentation by Özgür Deniz Aydin
2
Overview

3
Abstract

Discusses methods for extracting knowledge from the
web by randomly sampling and analyzing hosts and
pages, and by analyzing the link structure of the
web and how links accumulate over time.
Information is collected on the distribution of web
pages over domains, the distribution of interest in
different areas, communities related to different
topics, the nature of competition in different
categories of sites, and the degree of communication
between different communities or countries.

4
Sampling the Web

5
Sampling Random Walk

6
Sampling Random Walk

Problems
We assume that we can select a page at random,
which is the very problem we are trying to build a
model for.
Many pages have outlinks to other pages in the
same domain, so the walk is very likely to get stuck.

7
Modified Random Walk

Method
Choice of initial page is uniformly random.
When at page p:
With prob. d, follow an outlink from p, chosen
uniformly at random.
With prob. 1-d, jump to a random host among the
hosts visited so far, and then to a randomly
selected page out of the visited pages on that
host.
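
The method above can be sketched on a toy link graph. This is a minimal illustration, not the implementation behind the slides: the graph, the host names, the default d = 0.85, and the rule of jumping when a page has no outlinks are all assumptions made for the example, and the start page is passed in rather than sampled.

```python
import random

def modified_random_walk(graph, host_of, start, d=0.85, steps=1000, seed=0):
    """Return the sequence of pages visited by the modified random walk.

    graph:   dict page -> list of outlinked pages (hypothetical toy data)
    host_of: dict page -> host name
    d:       probability of following an outlink (0.85 is an assumed default)
    """
    rng = random.Random(seed)
    visited_by_host = {}   # host -> distinct pages visited on that host
    page = start
    path = []
    for _ in range(steps):
        path.append(page)
        pages = visited_by_host.setdefault(host_of[page], [])
        if page not in pages:
            pages.append(page)
        outlinks = graph.get(page, [])
        if outlinks and rng.random() < d:
            # With prob. d: follow an outlink chosen uniformly at random.
            page = rng.choice(outlinks)
        else:
            # With prob. 1-d (or at a dead end, an assumption of this sketch):
            # pick a visited host, then a visited page on that host.
            host = rng.choice(list(visited_by_host))
            page = rng.choice(visited_by_host[host])
    return path

# Hypothetical four-page web spread over two hosts.
toy_graph = {"a1": ["a2"], "a2": ["a1", "b1"], "b1": ["b2"], "b2": []}
toy_hosts = {"a1": "a.example", "a2": "a.example",
             "b1": "b.example", "b2": "b.example"}
path = modified_random_walk(toy_graph, toy_hosts, start="a1", steps=50)
```

Choosing the host first and the page second keeps a host with many visited pages from dominating the jumps, which is the point of the modification.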

8
Random Walk - Results
9
IP Address Sampling

10
IP Address Sampling

Possible Problems
Firewalls and authentication requirements
Default page (no page) responses
"Coming soon" pages
Same site on multiple IPs (large/critical sites do
this for load balancing and redundancy)
Multiple sites on a single IP (virtual hosting)
IPs that do not serve web sites (fax servers, etc.)

12
Discussion on Sampling

13
Analyzing and Modelling Web Growth

Link distribution of pages follows a power law.
Prob. that a randomly selected web page has k
inlinks is proportional to $k^{-\gamma}$, where the
exponent $\gamma \approx 2.1$.
Prob. that this web page has k outlinks is
proportional to $k^{-\gamma}$, where the exponent
$\gamma \approx 2.72$.
Models for discussion: Preferential Attachment
and Competition Varies.
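
To make the two quoted exponents (about 2.1 for inlinks, 2.72 for outlinks) concrete, the sketch below normalizes a power law into a probability distribution. The cutoff `k_max` and the function name are assumptions for illustration only; the slides give no normalization.

```python
def power_law_pmf(exponent, k_max=10_000):
    """P(k) proportional to k**(-exponent) for k = 1..k_max (k_max is assumed)."""
    weights = [k ** -exponent for k in range(1, k_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]   # index 0 holds P(k = 1)

inlink_pmf = power_law_pmf(2.1)
outlink_pmf = power_law_pmf(2.72)

# Doubling k divides the probability by 2**exponent, whatever the cutoff is.
inlink_ratio = inlink_pmf[0] / inlink_pmf[1]
```

The larger outlink exponent means the outlink distribution falls off faster, so pages with very many outlinks are rarer than pages with equally many inlinks.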

14
Preferential Attachment

Rich get richer: a node becomes more likely to
get an edge from a new node if it has a larger
number of edges (undirected graph model).
Growth: starting with a small number m0 of nodes,
in every time step introduce a new node with m
edges, m less than or equal to m0.
p: prob. that the new node will be connected to
node u. It depends on the degree k_u of u such that
$p = k_u / \sum_{w} k_w$, summing over all existing nodes w.
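
The growth rule above can be simulated in a few lines. This is a sketch under stated assumptions: the m0 seed nodes are fully connected so that every degree is non-zero (the slide does not say how they are wired), and the defaults for `n`, `m0`, and `m` are arbitrary.

```python
import random

def preferential_attachment(n, m0=3, m=2, seed=0):
    """Grow an undirected graph to n nodes; return dict node -> degree."""
    if not (1 <= m <= m0 and 2 <= m0 < n):
        raise ValueError("need 1 <= m <= m0, m0 >= 2 and m0 < n")
    rng = random.Random(seed)
    # Assumption: the m0 seed nodes form a complete graph.
    degree = {i: m0 - 1 for i in range(m0)}
    # Each node appears once per incident edge, so a uniform draw from this
    # list selects node u with probability k_u / sum_w k_w.
    endpoints = [i for i in range(m0) for _ in range(m0 - 1)]
    for new in range(m0, n):
        targets = set()
        while len(targets) < m:          # m distinct existing nodes
            targets.add(rng.choice(endpoints))
        for t in sorted(targets):
            degree[t] += 1
            endpoints.extend((t, new))
        degree[new] = m
    return degree

degrees = preferential_attachment(200)
```

Sorting `degrees.values()` in descending order shows a few early nodes holding a large share of the edges, which is the "rich get richer" effect.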

15
Pref. Attch. - Results

16
Competition Varies

In earlier models, most members of a community
fare poorly, with few or no inlinks.
In actual distributions, this is not the case:
e.g. the mode for inlinks can go up to 800 for
universities.
New method by Pennock, mixing uniform and
preferential attachment, accounts for
connectivity distributions within communities as
well as for the web itself.
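
The idea of mixing uniform and preferential attachment can be sketched as follows. This is not Pennock's exact formulation: the mixing weight `alpha`, the seed graph (a complete graph on m + 1 nodes), and the defaults are assumptions chosen to keep the example short.

```python
import random

def mixed_attachment(n, m=1, alpha=0.5, seed=0):
    """Grow a graph to n nodes; each edge end is chosen preferentially with
    probability alpha and uniformly otherwise. Returns dict node -> degree."""
    rng = random.Random(seed)
    m0 = m + 1
    # Assumption: seed graph is complete on m + 1 nodes (each has degree m).
    degree = {i: m for i in range(m0)}
    endpoints = [i for i in range(m0) for _ in range(m)]
    for new in range(m0, n):
        targets = set()
        while len(targets) < m:
            if rng.random() < alpha:
                targets.add(rng.choice(endpoints))   # preferential choice
            else:
                targets.add(rng.randrange(new))      # uniform over existing nodes
        for t in sorted(targets):
            degree[t] += 1
            endpoints.extend((t, new))
        degree[new] = m
    return degree

uniform_only = mixed_attachment(500, alpha=0.0)
preferential_only = mixed_attachment(500, alpha=1.0)
```

Varying `alpha` between 0 and 1 moves the degree distribution between the two extremes, which is how a single model can be fitted to communities with different levels of competition.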

17
Competition Varies
18
The Hostgraph Model

Idea: the web is a hierarchically nested graph,
with domains, hosts, and pages introducing
different levels of affiliation. Instead of
modeling the web at the level of pages, one can
also model it at the host or domain level.
Each node represents a host.
Bharat's model: power law distribution, with an
exponent of about 1.62 for inlinks and about 1.67
for outlinks.
In reality, the number of hosts with few inlinks is
considerably smaller than predicted by the model.

19
The Hostgraph Model

Bharat's model
At each step, with prob. β, select a random
already existing node u;
with prob. 1-β, create a new node u. Add d
additional outlinks to u. The choice of outlinks is
made as follows:
First choose an existing node v at random.
Pick d random outgoing edges from v.
Then, for j = 1, 2, . . . , d, the jth link of u
points to a random existing node with probability
α, and to the destination of v's jth link with
probability 1-α.
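
The steps of this copy-based model can be sketched as below. Several details are assumptions of the sketch, not of the slides: the seed graph (d + 1 hosts that all link to each other, so every host always has at least d outlinks to copy from), the value of the node-selection probability (`beta=0.5` is arbitrary), and the names `alpha` and `beta` for the two probabilities.

```python
import random
from collections import Counter

def hostgraph_copy_model(steps, d=7, alpha=0.05, beta=0.5, seed=0):
    """Return dict host -> list of outlink destinations after `steps` steps."""
    rng = random.Random(seed)
    # Assumption: seed graph of d + 1 hosts, each linking to all the others.
    out = {u: [w for w in range(d + 1) if w != u] for u in range(d + 1)}
    for _ in range(steps):
        n = len(out)                      # hosts existing before this step
        if rng.random() < beta:
            u = rng.randrange(n)          # reuse a random existing host
        else:
            u = n                         # create a new host
            out[u] = []
        v = rng.randrange(n)              # prototype host to copy from
        copied = rng.sample(out[v], d)    # d random outgoing edges of v
        for j in range(d):
            if rng.random() < alpha:
                out[u].append(rng.randrange(n))   # random existing host
            else:
                out[u].append(copied[j])          # destination of v's jth link
    return out

hosts = hostgraph_copy_model(2000)
inlinks = Counter(t for targets in hosts.values() for t in targets)
```

Because most links are copied from a prototype, hosts that already have many inlinks tend to collect more, while the occasional random link lets other hosts gain their first inlinks.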

20
The Hostgraph Model
Model with 1,000,000 nodes, d = 7 and α = 0.05
22
Communities on the Web

23
Communities on the Web
24
Communities on the Web

Problems with the regular approach:
Cannot cluster fully connected graphs/subgraphs
(identical to non-connected graphs).
Introduce seed nodes and identify communities
around the seeds with a polynomial-time algorithm.
Bipartite subgraph, cocitation, and bibliographic
coupling methods are good for identifying narrow
portions of the web only.
HITS and PageRank define large collections as
communities, with less accuracy.

25
Summary

The World Wide Web offers both opportunities and
challenges.
Many areas are open for further research.
Interesting results may come from updated
analysis.
Sampling is still an important issue: which pages
should be counted? How to reduce bias?
Growth models: how to model or refine for
accuracy, while keeping models simple and easy to
analyse?
Communities: how to define, how to model?
