1
WEB SEARCH and P2P
  • Advisor: Dr. Sushil Prasad
  • Presented By: DM Rasanjalee Himali

2
OUTLINE
  • Introduction to web search engines
  • What is a web search engine?
  • Web Search engine architecture
  • How does a web search engine work?
  • Relevance and Ranking
  • Limitations in current Web Search Engines
  • P2P Web Search Engines
  • YouSearch
  • Coopeer
  • ODISSEA
  • Conclusion

3
What is a web search engine?
  • A Web search engine is a search engine designed
    to search for information on the World Wide Web.
  • Information may consist of web pages, images and
    other types of files.
  • Some search engines also mine data available in
    newsgroups, databases, or open directories

4
History

Company | Millions of searches | Relative market share (%)
Google | 28,454 | 46.47
Yahoo! | 10,505 | 17.16
Baidu | 8,428 | 13.76
Microsoft | 7,880 | 12.87
NHN | 2,882 | 4.71
eBay | 2,428 | 3.9
Time Warner | 1,062 | 1.6
Ask.com | 728 | 1.1
Yandex | 566 | 0.9
Alibaba.com | 531 | 0.8
Total | 61,221 | 100.0
  • Before there were search engines, there was a complete list of all webservers.
  • The very first tool used for searching on the Internet was Archie
  • it downloaded directory listings of files on FTP sites
  • it did not index the contents of these sites
  • Soon after, many search engines appeared
  • Excite, Infoseek, Northern Light, AltaVista, Yahoo!, Google, MSN Search

5
How Web Search Engine Work
  • A search engine operates in the following order:
  • Web crawling
  • Indexing
  • Searching

6
Web Crawling
  • A web crawler
  • a program which browses the World Wide Web in a methodical, automated manner.
  • a means of providing up-to-date data
  • create a copy of all the visited pages for later
    processing by a search engine
  • starts with a list of URLs to visit, called the
    seeds.
  • As the crawler visits these URLs, it identifies
    all the hyperlinks in the page and adds them to
    the list of URLs to visit, called the crawl
    frontier.
  • URLs from the frontier are recursively visited
    according to a set of policies.
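As a concrete illustration of the seed/frontier loop above, here is a minimal Python sketch (an illustrative toy, not any engine's actual crawler; real crawlers add politeness delays, robots.txt checks, and URL filtering policies):

import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

def crawl(seeds, max_pages=100):
    """Minimal crawl loop: visit URLs from the frontier, collect new links."""
    frontier = deque(seeds)          # the crawl frontier, seeded with start URLs
    visited, pages = set(), {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue                 # skip unreachable or malformed URLs
        pages[url] = html            # keep a copy for later processing/indexing
        # Naive link extraction; a real crawler would use an HTML parser.
        for href in re.findall(r'href="([^"#]+)"', html):
            link = urljoin(url, href)
            if link.startswith("http") and link not in visited:
                frontier.append(link)
    return pages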

7
Robot Exclusion Protocol
  • also known as the robots.txt protocol
  • is a convention to prevent cooperating web robots
    from accessing all or part of a website which is
    otherwise publicly viewable.
  • Example robots.txt:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
Sitemap: http://www.example.com/sitemap.xml.gz
Crawl-delay: 10
Allow: /folder1/myfile.html
Request-rate: 1/5        # maximum rate is one page every 5 seconds
Visit-time: 0600-0845    # only visit between 06:00 and 08:45 UTC (GMT)
  • It relies on the cooperation of the web robot, so
    that marking an area of a site out of bounds with
    robots.txt does not guarantee privacy.
  • The standard complements Sitemaps, a robot
    inclusion standard for websites.
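Since compliance is voluntary, a cooperating crawler must check the rules itself before fetching. A minimal sketch using Python's standard urllib.robotparser (the example.com URLs are placeholders):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()   # fetch and parse the site's robots.txt

# A cooperating crawler consults the parsed rules before every fetch.
print(rp.can_fetch("*", "http://www.example.com/folder1/myfile.html"))  # allowed
print(rp.can_fetch("*", "http://www.example.com/tmp/page.html"))        # disallowed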

8
SiteMap Protocol
  • allows a webmaster to inform search engines about
    URLs on a website that are available for
    crawling.
  • A Sitemap is an XML file that lists the URLs for
    a site.
  • It allows webmasters to include additional
    information about each URL
  • when it was last updated,
  • how often it changes, and
  • how important it is in relation to other URLs in
    the site.
  • This allows search engines to crawl the site more
    intelligently.
  • Sitemaps are a URL inclusion protocol and complement robots.txt, a URL exclusion protocol

lt?xml version"1.0" encoding"UTF-8"?gt lturlset
xmlns"http//www.sitemaps.org/schemas/sitemap/0.9
"gt     lturlgt        ltlocgthttp//www.example.com/lt
/locgt       ltlastmodgt2005-01-01lt/lastmodgt
       ltchangefreqgtmonthlylt/changefreqgt
       ltprioritygt0.8lt/prioritygt     lt/urlgt
lt/urlsetgt
9
Distributed Web Crawling
  • Internet search engines employ many computers to
    index the Internet via web crawling.
  • Dynamic assignment
  • a central server assigns new URLs to different
    crawlers dynamically.
  • allows the central server to dynamically balance the load of each crawler.
  • Static assignment
  • there is a fixed rule stated from the beginning
    of the crawl that defines how to assign new URLs
    to the crawlers.
  • Google uses thousands of individual computers in
    multiple locations to crawl the Web.
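As an illustration of static assignment (an assumed scheme, not Google's actual one), a fixed rule can hash each URL's hostname so that every crawler independently agrees on which crawler owns any newly discovered URL, with no central server:

import hashlib
from urllib.parse import urlparse

NUM_CRAWLERS = 16  # assumed cluster size

def assign_crawler(url: str) -> int:
    """Static assignment: hash the hostname so all crawlers agree on ownership."""
    host = urlparse(url).netloc
    return int(hashlib.md5(host.encode()).hexdigest(), 16) % NUM_CRAWLERS

# All URLs on the same host map to the same crawler, which also helps politeness.
print(assign_crawler("http://www.example.com/page1"))
print(assign_crawler("http://www.example.com/page2"))  # same crawler as page1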

10
Indexing
  • The purpose of storing an index is to optimize
    speed and performance in finding relevant
    documents for a search query
  • Search engine indexing collects, parses, and
    stores data to facilitate fast and accurate
    information retrieval.
  • The contents of each page are analyzed to
    determine how it should be indexed
  • Ex: words are extracted from the titles,
    headings, or special fields called meta tags
  • Meta search engines reuse the indices of other
    services and do not store a local index

11
Challenges in Parallelism
  • A major challenge in the design of search engines
    is the management of parallel computing
    processes.
  • There are many opportunities for race conditions and coherence faults.
  • Ex: a new document is added to the corpus and the
    index must be updated, but the index
    simultaneously needs to continue responding to
    search queries.
  • the search engine's architecture may involve
    distributed computing, where the search engine
    consists of several machines operating in unison.
  • This increases the possibilities for incoherency
    and makes it more difficult to maintain a
    fully-synchronized, distributed, parallel
    architecture.

12
Inverted Indices
  • inverted index stores a list of the documents
    containing each word
  • search engine can use direct access to find the
    documents associated with each word in the query
    to retrieve the matching documents quickly

Word | Documents
the | Document 1, Document 3, Document 4, Document 5
cow | Document 2, Document 3, Document 4
says | Document 5
moo | Document 7
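A minimal Python sketch of building such an index and answering a conjunctive query by intersecting posting sets (toy corpus assumed):

from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {1: "the cat", 5: "the cow says moo", 7: "moo"}  # toy corpus
index = build_inverted_index(docs)
print(index["moo"])                                     # {5, 7}

# Multi-word query: intersect the posting sets of the query words.
print(set.intersection(*(index[w] for w in "cow moo".split())))  # {5}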
13
Searching
  • web search query
  • a query that a user enters into a web search engine to satisfy his or her information needs.
  • it is distinctive in that it is unstructured and often ambiguous
  • web queries vary greatly from standard query languages, which are governed by strict syntax rules.

14
Searching
  • Three broad categories that cover most web search
    queries
  • Informational queries
  • Queries that cover a broad topic (e.g., colorado
    or trucks) for which there may be thousands of
    relevant results.
  • Navigational queries
  • Queries that seek a single website or web page of
    a single entity (e.g., youtube or delta
    airlines).
  • Transactional queries
  • Queries that reflect the intent of the user to
    perform a particular action, like purchasing a
    car or downloading a screen saver.

15
Web search engine architecture
[Figure: Web search engine architecture, from "The Anatomy of a Large-Scale Hypertextual Web Search Engine" by Sergey Brin and Lawrence Page. Crawlers fetch pages from a URL list and compress them into a store; the indexer reads and uncompresses the repository, parses documents into hit lists distributed into barrels by docID, and parses out links into an anchors file; anchor text and relative URLs are resolved to absolute URLs and then docIDs; the sorter resorts the barrels by wordID so that, together with the lexicon, the partially sorted forward index becomes the inverted index; PageRank (PR) is calculated for all docs and used with the inverted index to answer queries.]
16
Important Properties Of Commercial Web Search
  • To be successful, a commercial search engine must address all of these issues/properties:
  • Millions of heterogeneous users
  • Goal is to make money
  • UI is extremely important
  • Real-time/fast expectation
  • Content of web page not sufficient to imply
    meaning
  • Result ranking cannot assume independence
  • Must consider maliciousness
  • No quality control on pages (quality varies)
  • Web is large (practically infinite)

17
Relevance and Ranking
  • Exactly how a particular search engine's
    algorithm works is a closely-kept trade secret.
  • However, all major search engines follow the
    general rules below.
  • Location, Location, Location...and Frequency
  • Location
  • Search engines check to see whether the search keywords appear near the top of a web page, such as in the headline or in the first few paragraphs of text. They assume that any page relevant to the topic will mention those words right from the beginning.
  • Frequency
  • A search engine will analyze how often keywords
    appear in relation to other words in a web page.
    Those with a higher frequency are often deemed
    more relevant than other web pages.

18
Precision and Recall
  • two widely used measures for evaluating the
    quality of results in Information Retrieval
  • Precision
  • the fraction of the documents retrieved that are relevant to the user's information need:
    Precision = (number of relevant documents retrieved by a search) / (total number of documents retrieved by that search)
  • Recall
  • the fraction of the documents that are relevant to the query that are successfully retrieved:
    Recall = (number of relevant documents retrieved by a search) / (total number of existing relevant documents that should have been retrieved)
  • Often, there is an inverse relationship between Precision and Recall
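A small sketch of both measures as code:

def precision_recall(retrieved: set, relevant: set):
    """Precision and recall for a single query."""
    hits = retrieved & relevant                        # relevant docs retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 4 documents retrieved, 3 of them relevant, 6 relevant overall.
print(precision_recall({1, 2, 3, 4}, {2, 3, 4, 7, 8, 9}))  # (0.75, 0.5)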

19
Relevance and Ranking
  • webmasters constantly rewrite their web pages in
    an attempt to gain better rankings.
  • Some sophisticated webmasters may even "reverse
    engineer" the location/frequency systems used by
    a particular search engine
  • Because of this, all major search engines now
    also make use of "off the page" ranking criteria

20
Relevance and Ranking
  • Off the page factors
  • those that a webmaster cannot easily influence
  • Link analysis
  • The search engine analyzes how pages link to each other
  • Helps to determine what a page is about and
    whether that page is "important" and thus
    deserving of a ranking boost
  • Click through measurement
  • a search engine watches which results someone selects for a particular search,
  • eventually drops high-ranking pages that aren't attracting clicks,
  • and promotes lower-ranking pages that do pull in visitors.

21
Limitations in current web search engines
  • Centralized search engines have limited
    scalability.
  • Crawler-based indices are stale and incomplete
  • Fundamental issue: how much of the web is crawlable?
  • If you follow the rules, many sites say "robots get lost"
  • What about Dynamic content? (Deep Web)
  • The deep web is around 500 times larger than the surface web. These deep web resources mainly comprise data held in databases, which can be accessed only through queries. Since crawlers discover resources only through links, they cannot discover these resources.
  • There's no guarantee that current search engines index, or even crawl, the total surface web space

22
Limitations in current web search engines
  • Single point of failure
  • Ambiguous words
  • Polysemy - words with multiple meanings ("train car" vs. "train a neural network")
  • Synonymy - multiple words with the same meaning ("the neural network is trained as follows" vs. "the neural network learns as follows")
  • What about phrases? Searches are not just bags of words
  • Positional information? Structural information (case and punctuation are thrown out)?
  • Non-text content: which data is worth storing?
  • Most web search engines today crawl only the surface web.

23
P2P Web Search
  • The last few years have seen an explosion of activity in the area of peer-to-peer (P2P) systems
  • Since an increasing amount of content now resides
    in P2P networks, it becomes necessary to provide
    search facilities within P2P networks.
  • The significant computing resources provided by a
    P2P system could also be used to implement search
    and data mining functions for content located
    outside the system
  • e.g., for search and mining tasks across large
    intranets or global enterprises, or even to build
    a P2P-based alternative to the current major
    search engines.

24
P2P Web Search
  • Characteristics that distinguish P2P systems from previous technologies:
  • low maintenance overhead
  • improved scalability
  • improved reliability
  • synergistic performance
  • increased autonomy and privacy
  • dynamism

25
P2P Web Search Engines
  • YouSearch
  • Coopeer
  • ODISSEA

26
YouSearch
  • YouSearch
  • is a distributed search application for personal
    webservers operating within a shared context
  • Allow peers to aggregate into groups and users to
    search over specific groups
  • Goal
  • Provide fast, fresh and complete results to users

27
YouSearch
  • System Overview
  • participants in YouSearch
  • Peer-nodes
  • run YouSearch enabled clients
  • Browsers
  • search YouSearch enabled content through their
    web browsers
  • Registrar
  • centralized light-weight service that
  • acts like a blackboard on which peer nodes store and look up (summarized) network state.

28
YouSearch
  • System Overview
  • Search System
  • Each peer node closely monitors its own content
    to maintain a fresh local index
  • A bloom filter content summary is created by each
    peer and pushed to the registrar.
  • When a browser issues a search query at a peer p, the peer first queries the summaries at the registrar to obtain the set of peers R in the network that are hosting relevant documents.
  • The peers in R are then contacted directly with the query to obtain the URLs for the results.
  • To quickly satisfy any subsequently issued
    queries with identical terms, the results from
    each query issued at a peer p are cached for a
    limited time at p

29
YouSearch
  • Indexing
  • Indexing is periodically executed at every peer
    node.
  • Inspector examines each shared file for its last
    modification date and time.
  • If the file is new or the file has changed, the
    file is passed to the Indexer.
  • The Indexer maintains a disk-based inverted-index
    over the shared content.
  • The name and path information of the file are
    indexed as well.

30
YouSearch
  • Indexing
  • Summarizer
  • The Summarizer obtains a list of terms T from the Indexer and creates a Bloom filter from them in the following way.
  • A bit vector V of length L is created with each bit set to 0.
  • A specified hash function H with range {1, ..., L} is used to hash each term t in T, and the bit at position H(t) in V is set to 1
  • YouSearch uses k independent hash functions H1, H2, ..., Hk and constructs k different Bloom filters, one for each hash function
  • In YouSearch,
  • the length of each Bloom filter is L = 64 Kbits and
  • the number of Bloom filters k is set to 3
  • The Summary Manager at the registrar aggregates these Bloom filters into a structure that maps each bit position to the set of peers whose Bloom filters have the corresponding bit set
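A minimal sketch of this summarization step with the stated parameters (the choice of salted SHA-1 as the k hash functions is an assumption; the paper does not specify them):

import hashlib

L = 64 * 1024   # length of each Bloom filter in bits (64 Kbits)
K = 3           # number of independent hash functions / filters

def h(term: str, i: int) -> int:
    """i-th hash function, derived by salting SHA-1 (illustrative choice)."""
    return int(hashlib.sha1(f"{i}:{term}".encode()).hexdigest(), 16) % L

def summarize(terms):
    """Build K Bloom filters over a peer's index terms (bit vectors as ints)."""
    filters = [0] * K
    for t in terms:
        for i in range(K):
            filters[i] |= 1 << h(t, i)   # set bit H_i(t) in filter i
    return filters

filters = summarize(["p2p", "search", "bloom"])
# A term may be present only if its bit is set in every filter.
print(all(filters[i] >> h("p2p", i) & 1 for i in range(K)))  # True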

31
YouSearch
  • Querying

[Figure: YouSearch query flow. A query issued at a peer is processed as follows: the peer computes the hash of the query keywords and sends them to the registrar; the registrar looks up its bit-position-to-IP-address mapping for the corresponding bits of each of the k Bloom filters and determines the intersection of the matching peer IDs; the querying peer then contacts each of the peers in the returned list and obtains a list of URLs for matching documents.]
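On the registrar side, the lookup in this flow reduces to set intersections over the bit-to-peer mapping; a sketch reusing h() and K from the summarization sketch above (the data layout is assumed for illustration):

# bit_to_peers[i][b]: set of peer addresses whose i-th Bloom filter has bit b
# set; built by the Summary Manager from the summaries peers push.
def candidate_peers(keywords, bit_to_peers):
    """Peers that may host documents matching ALL the query keywords."""
    result = None
    for kw in keywords:
        for i in range(K):
            peers = bit_to_peers[i].get(h(kw, i), set())
            result = peers if result is None else result & peers
    return result or set()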
32
YouSearch
  • Caching
  • Every time a global query is answered with non-zero results, the querying peer caches the result set of URLs U (temporarily)
  • The peer then informs the registrar of this fact.
  • The registrar adds a mapping from the query to
    the IP-address of the caching peer in its cache
    table

33
YouSearch
  • Limitations
  • False positive results: 17.38%
  • Central registrar => single point of failure
  • No extensive phrase search
  • No attention has been given to query ranking
  • No human user collaboration

34
Coopeer
  • Coopeer
  • is a P2P web search engine where each user's computer stores a part of the web model used for indexing and retrieving web resources in response to queries
  • Goal
  • complement centralized search engines (CSEs) to provide more humanized and personalized results by utilizing users' collaboration

35
Coopeer
  • (a) Collaboration
  • One may look for interesting web pages in the P2P knowledge repository consisting of shared web pages.
  • A novel collaborative filtering technique called PeerRank is presented to rank pages in proportion to the votes from relevant peers
  • (b) Humanization
  • Coopeer uses a query-based representation for documents,
  • where the relevant words are not directly extracted from page content but introduced by human users with a high proficiency in their expertise domains.
  • (c) Personalization
  • Similar users are self-organized according to the semantic content of their search sessions.
  • Thus, a requestor peer can extend routing paths along its neighbors, rather than just taking a blind shot.
  • User-customized results can be obtained along personal routing paths, in contrast with CSEs.

36
Coopeer
  • System Overview
  • The requestor forwards the query based on semantic routing.
  • Peers maintain a local index about the semantic content of remote peers.
  • On receiving a query message from a remote peer, the current peer checks it against its local store.
  • To facilitate this, a novel query-based representation of documents is introduced.
  • Based on the query representation, the cosine similarity between the new query and documents can be computed.
  • A document is considered relevant if the similarity exceeds a certain threshold.
  • These results are then returned to the requestor.
  • On receiving the returned results, the requestor peer ranks them in terms of the preferences of its human owner using the PeerRank method.

37
Coopeer
  • The Coopeer client consists of four main software
    agents
  • The User Agent
  • is responsible for interacting with the users.
  • It provides a friendly user interface, so that
    users can conveniently manage and manipulate the
    whole search sessions.
  • The Web-searcher Agent
  • is the resource of the P2P knowledge repository.
  • It performs the user's individual searching with several search engines from the Internet.
  • The Collaborator Agent
  • is the key component for performing users' real-time collaborative searching.
  • It facilitates maintaining the P2P knowledge repository, including information sharing, searching, and fusion.
  • The Manager Agent
  • is the key component of Coopeer, which coordinates and manages the other types of agents.
  • It is also responsible for updating and maintaining data.

38
Coopeer
  • PeerRank
  • All the users are taken as a Referrer Network.
  • Determines a page's relevance by examining a radiating network of referrers.
  • Documents with more referrers gain higher ranks.
  • Obtains a better rank order, as collaborative evaluation by human users is much more precise than descriptions based on term frequency or link counts.
  • Prevents spam, since it is difficult to fake evaluations from human users.

39
Coopeer
  • PeerRank
  • For a given search session, we first compute the similarity between the requestor's favorite list and the referrers' lists,
  • then the similarity is used as the baseline of the recommending degree of each referrer.
  • Firstly, as shown in equation (1), the similarity of the local list and a recommended list is given by the Kendall measure.
  • Secondly, we convert the rank of a given URL in its recommended list to a moderate score
  • $R(e)$ - weight of URL $e$
  • $C(e)$ - set constituted by $e$'s referrers
  • $Z$ - constant $> 1$
  • $p$ - local peer
  • $p_i$ - a remote peer
  • $L_p$, $L_{p_i}$ - lists of $p$ and $p_i$ respectively
  • $K^{(r)}(L_p, L_{p_i})$ - Kendall function measuring the distance of the local list and the recommended list
  • $r$ - decay factor
  • $S_{L_{p_i}}(e)$ - score of $e$ in the recommended list
  • $R_e$ - rank of $e$
  • $R_{Max}$ - highest rank of list $p_i$, i.e., the length of the list
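The equation images did not survive this transcript. A plausible reconstruction from the variable definitions above (an assumption, not the paper's verified formula): each URL's weight sums the votes of its referrers, each vote scaled by how similar the referrer's list is to the local list and by how highly the referrer ranks the URL:

R(e) = \sum_{p_i \in C(e)} K^{(r)}(L_p, L_{p_i}) \cdot S_{L_{p_i}}(e),
\qquad
S_{L_{p_i}}(e) = Z^{-R_e / R_{Max}}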

40
Coopeer
  • Kendall Measure
  • Kendall is used to measure the distance between two lists of the same length.
  • The paper extends it to measure two lists of different lengths.
  • Kendall function:
  • $t_1$ and $t_2$ - two lists composed of URLs
  • $K^{(r)}(t_1, t_2)$ - the distance between $t_1$ and $t_2$
  • $r$ - fixed parameter with $0 \le r \le 1$
  • $C^{2}_{2L}$ - used for normalization; it is the maximum possible value of the distance
  • $U(t_1, t_2)$ - set consisting of all the URLs in $t_1$ and $t_2$
  • $\bar{K}^{(r)}_{i,j}(t_1, t_2)$ - the penalty of the URL pair $(i, j)$
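The formula image is likewise missing. Based on the definitions above, and on the Kendall-tau measure with a penalty parameter proposed by Fagin et al. for comparing top-k lists (which this appears to follow), a plausible form is:

K^{(r)}(t_1, t_2) = \frac{1}{C^{2}_{2L}} \sum_{(i,j) \in U(t_1, t_2)} \bar{K}^{(r)}_{i,j}(t_1, t_2)

where the pair penalty $\bar{K}^{(r)}_{i,j}$ is 0 if $i$ and $j$ appear in the same relative order in both lists, 1 if they appear in opposite orders, and $r$ if their relative order cannot be determined because one or both URLs are missing from a list.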

41
Coopeer
  • Query Based Representation
  • A novel type of representation based on relevant words introduced by human users with a high proficiency in their expertise domains.
  • It is efficient on the P2P platform, as the users' evaluations can be utilized easily through the client application.
  • It represents and organizes the local documents for responding to remote queries

42
Coopeer
  • Each peer maintains
  • an inverted index table
  • to represent local documents for responding to remote queries
  • its values are the IDs of the documents that were returned in reply to each query
  • its keys are terms extracted from the previous queries
  • Ex: peer j issues two queries, "P2P Overlay" and "P2P Routing", and obtains two sets of documents, {d1, d2, d3} and {d3, d4}, respectively.
  • The retrieved documents are recorded under their corresponding query terms.
  • When any other peer issues a query about "Overlay Routing Algorithm", peer j looks up relevant documents in the inverted index using VSM cosine similarity as the ranking algorithm, and d3 gains the highest ranking.
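A Python sketch of this example (binary term weights assumed; the paper's exact weighting is not specified here):

import math
from collections import defaultdict

# Query-based inverted index: query term -> IDs of documents returned for
# queries containing that term (built from the example above).
index = defaultdict(set)
for query, docs in [("p2p overlay", {"d1", "d2", "d3"}),
                    ("p2p routing", {"d3", "d4"})]:
    for term in query.split():
        for d in docs:
            index[term].add(d)

def cosine_rank(query):
    """Rank documents by cosine similarity between the new query's terms and
    the query terms previously associated with each document."""
    q_terms = set(query.lower().split())
    doc_terms = defaultdict(set)
    for term, docs in index.items():
        for d in docs:
            doc_terms[d].add(term)
    return sorted(
        ((d, len(q_terms & t) / (math.sqrt(len(q_terms)) * math.sqrt(len(t))))
         for d, t in doc_terms.items()),
        key=lambda dt: -dt[1])

print(cosine_rank("overlay routing algorithm"))  # d3 ranks first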

43
Coopeer
  • Semantic Routing Algorithm
  • each Coopeer client maintains a local Topic
    Neighbor Index
  • The index records the observed performance of remote peers that have topics similar to the local peer's.
  • These search sessions' queries are used to represent the peers' semantic content
  • session 1 is the local peer, which has two topics (queries)
  • the other sessions denote remote peers that the local peer is interested in, in some aspect.
  • sessions 2 and 3 are relevant to the local peer's "P2P Routing" topic, while the others are about "Pattern Recognition".
  • The peers on a same topic are kept in descending order of rate.
  • Peers providing more interesting resources move toward the top of an individual's local index

44
Coopeer
  • With the query-based inverted index, the precision of matching results across different subjects was almost 100%
  • The system uses information coming from centralized search engines, so it is not aimed at replacing CSEs but at complementing them.

45
Coopeer
  • Query-based representation is efficient in P2P because users' evaluations can be utilized easily through the client application.
  • It is inefficient in CSEs because gaining user evaluations through a web browser is inefficient, and it is impractical to store and index documents for every user's query.
  • Prevents spam, since it is difficult to fake evaluations from human users.
  • Uses human searching experience → better results

46
ODISSEA
  • A distributed global indexing and query execution
    service
  • Maintains a global index structure under document
    insertions and updates and node joins and
    failures
  • the inverted index for a particular term (word)
    is located at a single node, or partitioned over
    a small number of nodes in some hybrid
    organizations.
  • Assumes a two-tier architecture.
  • The system is implemented on top of an underlying
    global address space provided by a DHT structure

47
ODISSEA
  • The system provides the lower tier of the two-tier architecture.
  • In the upper tier, there are two classes of clients that interact with this P2P-based lower tier:
  • Update clients
  • insert new or updated documents into the system,
    which stores and indexes them.
  • An update client could be a crawler inserting
    crawled pages, a web server pushing documents
    into the index, or a node in a file sharing
    system.
  • Query clients
  • design optimized query execution plans, based on
    statistics about term frequencies and
    correlations, and issue them to the lower tier.

48
ODISSEA
49
ODISSEA
  • Global Index
  • An inverted index for a document collection is a data structure that contains, for each word in the collection, a list of all its occurrences, or a list of postings.
  • Each posting contains the document ID of the occurrence of the word, its position inside the document, and other information (in title? bold face?)
  • Each node holds the complete global postings list for a subset of the words, as determined by a hash function.
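A sketch of the global partitioning idea: hashing a term (as a DHT would) determines the single node that stores that term's complete postings list (the Posting layout and ring size here are illustrative assumptions):

import hashlib
from typing import NamedTuple

class Posting(NamedTuple):
    doc_id: int      # document containing the word
    position: int    # offset of this occurrence inside the document
    in_title: bool   # example of extra per-occurrence information

NUM_NODES = 8  # assumed number of nodes in the DHT

def home_node(term: str) -> int:
    """The hash of a term picks the node holding its global postings list."""
    return int(hashlib.sha1(term.encode()).hexdigest(), 16) % NUM_NODES

print(home_node("peer"))  # every posting for "peer" lives on this node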

50
ODISSEA
  • Query Processing
  • a ranking function is a function $F$ that, given a query consisting of a set of search terms $q_0, q_1, \ldots, q_{m-1}$, assigns to each document $d$ a score $F(d, q_0, q_1, \ldots, q_{m-1})$. The top-$k$ ranking problem is then the problem of identifying the $k$ documents in the collection with the highest scores.

51
ODISSEA
  • We focus on two families of ranking functions,
  • The first family includes the common term-based ranking functions used in IR, where we add up the scores of each document with respect to all the words in the query.
  • The second family adds a query-independent value $g(d)$ to the score of each page
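In symbols, a standard formulation consistent with this description (the notation is assumed and may differ from the paper's):

F(d, q_0, \ldots, q_{m-1}) = \sum_{i=0}^{m-1} f(d, q_i)
\qquad \text{or} \qquad
F(d, q_0, \ldots, q_{m-1}) = g(d) + \sum_{i=0}^{m-1} f(d, q_i)

where $f(d, q_i)$ is the term-based score of document $d$ for term $q_i$ and $g(d)$ is a query-independent score such as PageRank.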

52
ODISSEA
  • Fagin's Algorithm
  • Consider the inverted lists for a search query with two terms $q_0$ and $q_1$.
  • Assume they are located on the same machine, and that the postings in the lists are pairs $(d, f(d, q_i))$, $i \in \{0, 1\}$, where $d$ is an integer identifying the document and $f(d, q_i)$ is real-valued.
  • Assume each inverted list is sorted by the second attribute, so that documents with the largest $f(d, q_i)$ are at the start of the list.
  • Then the following algorithm, called FA, computes the top-$k$ results:

53
ODISSEA
  • FA
  • (1) Scan both lists from the beginning, reading one element from each list in every step, until there are $k$ documents that have each been encountered in both of the lists.
  • (2) Compute the scores of these documents. Also, for each document that was encountered in only one of the lists, perform a lookup into the other list to determine the score of the document. Return the $k$ documents with the highest scores.
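A runnable Python sketch of FA for two lists (illustrative; each input list holds (doc, score) pairs sorted by score descending, and the aggregate score is the sum):

def fagin_top_k(list0, list1, k):
    """Fagin's Algorithm over two sorted posting lists."""
    scores0, scores1 = dict(list0), dict(list1)   # random-access lookups
    seen0, seen1, in_both = set(), set(), set()
    depth = 0
    # Phase 1: sorted access in lockstep until k docs are seen in both lists.
    while len(in_both) < k and depth < max(len(list0), len(list1)):
        for lst, seen, other in ((list0, seen0, seen1), (list1, seen1, seen0)):
            if depth < len(lst):
                d = lst[depth][0]
                seen.add(d)
                if d in other:
                    in_both.add(d)
        depth += 1
    # Phase 2: random access completes the score of every document seen.
    totals = {d: scores0.get(d, 0.0) + scores1.get(d, 0.0)
              for d in seen0 | seen1}
    return sorted(totals, key=totals.get, reverse=True)[:k]

l0 = [("d3", 0.9), ("d1", 0.7), ("d2", 0.2)]
l1 = [("d1", 0.8), ("d3", 0.7), ("d4", 0.4)]
print(fagin_top_k(l0, l1, 2))  # ['d3', 'd1']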

54
Conclusion
  • Still no P2P web search engine has outperformed
    Google!
  • (+) Lots of resources for complex data mining tasks and for crawling the whole surface web
  • (+) The emergence of semantic communities also has a positive impact on P2P web search performance
  • (-) Lack of global knowledge
  • (-) Smart crawling strategies beyond BFS are hard to implement in a P2P environment without a centralized scheduler.

55
Some Open Problems
  • How can one uniformly sample web pages on a web site without an exhaustive list of these pages?
  • Bar-Yossef et al. converted the web graph into an undirected, connected, and regular graph.
  • The equilibrium of a random walk on this graph is
    the uniform distribution.
  • It is not clear how many steps such a walk needs
    to perform.
  • A more significant problem, however, is that
    there is no reliable way of converting the web
    graph into an undirected graph.

56
Some Open Problems
  • Data Streams
  • The query logs of a web search engine contain all
    the queries issued at this search engine.
  • The most frequent queries change only slowly over
    time.
  • However, the queries with the largest increase or decrease from one time period to the next show interesting trends in user interests. We call them the top gainers and losers.
  • Since the number of queries is huge, the top
    gainers and losers need to be computed by making
    only one pass over the query logs.
  • This leads to the following data stream problem: find the top gainers and losers across two consecutive periods of the query log in a single pass.
  • Another interesting variant is to find all items
    above a certain frequency whose relative increase
    (i.e., their increase divided by their frequency
    in the first sequence) is the largest.

57
References
  • S. Brin and L. Page. "The Anatomy of a Large-Scale Hypertextual Web Search Engine." Computer Networks and ISDN Systems, 30(1-7), 1998.
  • M. Bawa, R. J. Bayardo Jr., S. Rajagopalan, and E. J. Shekita. "Make It Fresh, Make It Quick: Searching a Network of Personal Webservers." International World Wide Web Conference, Budapest, Hungary, 2003.
  • J. Zhou, K. Li, and L. Tang. "Towards a Fully Distributed P2P Web Search Engine." Proceedings of the 10th IEEE International Workshop on Future Trends of Distributed Computing Systems, 2004.
  • T. Suel, C. Mathur, J. Wu, J. Zhang, A. Delis, M. Kharrazi, X. Long, and K. Shanmugasundaram. "ODISSEA: A Peer-to-Peer Architecture for Scalable Web Search and Information Retrieval." 2003.
  • B. Bloom. "Space/Time Trade-offs in Hash Coding with Allowable Errors." Communications of the ACM, 13(7):422-426, 1970.
  • en.wikipedia.org

58
Extra Slides
59
Bloom Filters
  • a space-efficient probabilistic data structure
    that is used to test whether an element is a
    member of a set.
  • False positives are possible, but false negatives
    are not.
  • The more elements that are added to the set, the
    larger the probability of false positives.

60
Bloom Filters
  • An empty Bloom filter is a bit array of m bits,
    all set to 0.
  • There must also be k different hash functions
    defined, each of which maps a key value to one of
    the m array positions.
  • To add an element, feed it to each of the k hash
    functions to get k array positions. Set the bits
    at all these positions to 1.
  • To query for an element (test whether it is in
    the set), feed it to each of the k hash functions
    to get k array positions.
  • If any of the bits at these positions are 0, the element is not in the set; if it were, then all the bits would have been set to 1 when it was inserted.
  • If all are 1, then either the element is in the
    set, or the bits have been set to 1 during the
    insertion of other elements.
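A compact, self-contained sketch of these add/query rules (single m-bit array; salted SHA-1 stands in for the k hash functions, an illustrative choice):

import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = 0                        # m-bit array stored as an integer

    def _positions(self, item):
        # k hash functions derived by salting one digest (illustrative choice)
        for i in range(self.k):
            yield int(hashlib.sha1(f"{i}:{item}".encode()).hexdigest(), 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos            # set all k positions to 1

    def __contains__(self, item):
        # any 0 bit -> definitely absent; all 1s -> possibly present
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = BloomFilter()
for item in ("x", "y", "z"):
    bf.add(item)
print("x" in bf, "w" in bf)  # True False ("w" could be a false positive)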

61
Bloom Filters
An example of a Bloom filter representing the set {x, y, z}. The colored arrows show the positions in the bit array that each set element is mapped to. The element w, not in the set, is detected as a nonmember because it is mapped to a position containing a 0.