Search Engines -- based on material written by Paul Ryan

1
Search Engines -- based on material written by Paul Ryan
2
Agenda
  • Search Engines
  • Where did they come from?
  • How do they work?
  • Who's the biggest?
  • Why GoTo is the coolest.
  • What type of stuff do you need to support the
    web's 2nd largest search engine?
  • Architecture, infrastructure, nuts and bolts
  • Performance
  • Operations
  • What kind of people (and how many) do you need to
    do this kind of business?
  • Where is the Internet going? What's going to
    happen to search engines?
  • Don't quote me

3
Ancient History
  • The Precursors
  • Archie (1990): FTP-based file indexing and
    retrieval
  • Gopher (1992): document network (non-FTP)
  • The early bots (1992-1993)
  • WWW Wanderer (Wandex): servers, then URLs
  • Aliweb: indexed the web like Archie, with site
    index retrieval
  • Then came the spiders (1993)
  • WWW Worm
  • Excite (Architext), 2/93 from Stanford

4
All Done? Wrong!
  • Problems with Spiders
  • Get lots of data, but no intelligence to map
    pages to concept space
  • The problem still exists today (spamming)
  • The Solution? Searchable directories:
    human-crafted hierarchies.
  • Tradewave Galaxy (1/94)
  • Yahoo! (4/94), Filo and Yang of Stanford

5
I Give Up: Let's Search Everyone!
  • Here Come the Metasearchers!
  • MetaCrawler, go2net, Dogpile (1995)
  • Mamma
  • Search.com (CNet)
  • Spray out searches to several engines, then
    combine the results
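
The "spray and combine" step can be sketched roughly as below. This is only an illustration: the engine names, the canned result lists, and the 1/rank merge rule are assumptions made for the example, not a description of how MetaCrawler, Dogpile, or any other metasearcher actually merged results.

```python
from collections import defaultdict


def merge_results(results_by_engine):
    """Combine ranked URL lists from several engines into one ranking.

    Each URL earns 1/rank points from every engine that returned it
    (an assumed scoring rule), so URLs found by several engines rise.
    """
    scores = defaultdict(float)
    for ranked_urls in results_by_engine.values():
        for rank, url in enumerate(ranked_urls, start=1):
            scores[url] += 1.0 / rank
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical canned responses standing in for live engine queries.
canned = {
    "engine_a": ["http://a.example/", "http://b.example/", "http://c.example/"],
    "engine_b": ["http://b.example/", "http://d.example/"],
}
merged = merge_results(canned)
print(merged)
```

A real metasearcher would also have to issue the queries in parallel, time out slow engines, and de-duplicate near-identical URLs; none of that is shown here.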

6
The Universe Divides (kinda)
  • The Crawler-based Search Engines
  • Lycos (7/94): the wolf spider
  • Infoseek (4/94)
  • AltaVista (12/95)
  • Inktomi (Slurp), HotBot (5/96): the Plains
    Indians spider myth
  • Google, Northern Light, Excite, FAST, Direct
    Hit, and more
  • The Directory/Editorial based Search Engines
  • Yahoo! (4/94)
  • LookSmart (5/95)
  • Snap.com
  • ODP (NewHoo) -- dmoz (1/98)
  • Ask Jeeves (4/97)
  • GoTo (6/98)

7
How Crawlers Work (or don't)
  • Start with a list of URLs (submitted, or
    generated from somewhere)
  • For each site:
  • Get the base page
  • Catalog the page based on crawler-specific
    implementation
  • Follow links on page and recurse
  • Some Details
  • META tags
  • <META NAME="ROBOTS" CONTENT="ALL | NONE | NOINDEX | NOFOLLOW">
  • Robots.txt
  • # /robots.txt file for http://goto.com/
  • # disallow all robots from crawling GoTo
  • User-agent: *
  • Disallow: /

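
The crawl loop and the two exclusion mechanisms on this slide can be sketched as follows. It is a minimal illustration under stated assumptions: the `fetch` callable, the `example-bot` agent name, and the in-memory `web` dictionary are hypothetical stand-ins for real HTTP fetching, the "catalog" step simply stores the page, and the slide's "recurse" is written as a queue rather than actual recursion.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


class PageScanner(HTMLParser):
    """Collect link targets and the robots META directives of one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.robots = ""

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.robots = (attrs.get("content") or "").lower()


def crawl(seeds, fetch, robots_txt, agent="example-bot"):
    """Crawl from the seed URLs, honoring robots.txt and robots META tags."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    index, queue, seen = {}, list(seeds), set(seeds)
    while queue:
        url = queue.pop(0)
        if not rules.can_fetch(agent, url):
            continue                      # excluded by robots.txt
        html = fetch(url)
        if html is None:
            continue                      # page could not be retrieved
        page = PageScanner()
        page.feed(html)
        directives = {d.strip() for d in page.robots.split(",")}
        if not directives & {"noindex", "none"}:
            index[url] = html             # "catalog" step: just store the page
        if not directives & {"nofollow", "none"}:
            for href in page.links:
                link = urljoin(url, href)
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
    return index


# Hypothetical four-page site used instead of real network access.
web = {
    "http://site.example/": '<a href="/a.html">a</a> <a href="/private/x.html">x</a>',
    "http://site.example/a.html": '<meta name="robots" content="noindex, follow">'
                                  '<a href="/b.html">b</a>',
    "http://site.example/b.html": "<p>leaf page</p>",
    "http://site.example/private/x.html": "<p>not for robots</p>",
}
index = crawl(["http://site.example/"], web.get, "User-agent: *\nDisallow: /private/")
print(sorted(index))
```

With a robots.txt like the GoTo one on the slide (`User-agent: *` followed by `Disallow: /`), the same function fetches nothing and returns an empty index.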
8
Some Search Engine Examples
  • Inktomi
  • Infrastructure only: you pay for the search
    results
  • Used to power Yahoo! (now powered by Google),
    HotBot, and many others
  • Now typically a fall-through placement (bidded
    or other paid inclusion first, then Inktomi
    results)
  • Google
  • Sergey and Larry
  • Powers Yahoo!, virgin.net, and some others
  • Searching for a revenue model

9
Inktomi Slurp Crawler
  • Slurp Characteristics
  • Starts with active submitted URLs
  • Hierarchy of Importance
  • Page Title
  • Description meta
  • Keyword meta
  • Text in document (not in images)
  • No frames
  • Looks for spoofing tricks (drop page)
  • 4 week full cycle (constant incremental)
  • Many different indices created (for various
    customers), different depths, etc.

10
Some Cataloging Approaches (cont.)
  • Google
  • Backrub/Googlebot crawler
  • PageRank
  • Page A; pages linking to A: T1..Tn; number of
    links on A: C(A)
  • PR(A) = (1-d) + d(PR(T1)/C(T1) + ... +
    PR(Tn)/C(Tn))
  • A probability distribution: the chance that a
    random surfer hits a page based on links
  • Cache the documents (no kidding)
  • All kinds of tweaks to the PageRank, including
  • Domain tweaks (.org, .gov, .edu)
  • Serious bias against large pages
  • Bias against dynamic pages (.asp, .jhtml, .jsp)
  • Check out http://www.searchengineworld.com/google
  • Original design at
    http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm
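
The PageRank formula on this slide, PR(A) = (1-d) + d(PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), can be computed by simple repeated substitution. The sketch below assumes a damping factor d of 0.85, a fixed iteration count, and a three-page toy graph; it shows only the basic formula, with none of the tweaks listed above.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    `links` maps each page to the list of pages it links to, so
    C(T) is len(links[T]).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}      # arbitrary starting values
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            inbound = sum(
                pr[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * inbound
        pr = new_pr
    return pr


# Toy web (assumed for illustration): A links to B and C, B to C, C to A.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph C ends up ranked highest because it receives links from both A and B, while B receives only half of A's vote.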

11
Who's the Biggest Search Engine?
  • What is big?
  • Number of documents indexed (SearchEngineWatch,
    11/8/200)
  • Key: GG = Google, FAST = FAST, WT = WebTop.com,
    INK = Inktomi, AV = AltaVista, NL = Northern
    Light, EX = Excite, Go = Go (Infoseek).

12
Who's the Biggest Search Engine?
  • What is big?
  • Searches/day: total web 500mm/day (ptr
    estimate)
  • Google: 150mm (5/02, both Google and its
    partners)
  • Inktomi: 80mm (8/01)
  • AltaVista: 50mm (3/00)
  • Direct Hit: 20mm (4/01)
  • FAST: 12mm (10/00)
  • Overture (GoTo): 6.5mm (4/02)
  • Ask Jeeves: 4mm (3/00)
  • Everyone else: 4mm or fewer
  • Where's GoTo? Hint ?

13
Let's Talk About GoTo
  • Basic business model: middlemen for textual
    advertisements (search results)
  • Advertisers provide us Search Listings (Title,
    URL, Description, bid) for a search term
  • We charge advertisers for user clicks on Search
    Listings
  • We serve search listings to our own site
    (www.goto.com, 5%) and to other partners' sites
    (affiliates like AltaVista, AOL, Netscape, CNet,
    etc., 95%)
  • Since we make money when people search (and
    click), we pay for sites to include our listings
  • Live auction for search results
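
The listing-and-bid model above can be sketched as below. The field names follow the slide (Title, URL, Description, bid); everything else is assumed for illustration, including the advertisers, the amounts, ranking purely by bid, and charging exactly the bid on each click.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    advertiser: str
    title: str
    url: str
    description: str
    bid_cents: int  # price the advertiser offers per click


def rank_listings(listings):
    """Order the listings for one search term, highest bid first."""
    return sorted(listings, key=lambda item: item.bid_cents, reverse=True)


def charge_click(balances_cents, listing):
    """Debit the advertiser by its bid when a user clicks (assumed pricing rule)."""
    balances_cents[listing.advertiser] -= listing.bid_cents
    return balances_cents[listing.advertiser]


# Hypothetical listings competing for the same search term.
listings = [
    Listing("acme", "Acme Flowers", "http://acme.example/", "Fresh flowers", 25),
    Listing("bloom", "Bloom Co.", "http://bloom.example/", "Same-day bouquets", 40),
    Listing("petal", "Petal Shop", "http://petal.example/", "Roses and more", 10),
]
results = rank_listings(listings)
balances = {"acme": 1000, "bloom": 1000, "petal": 1000}
remaining = charge_click(balances, results[0])
print([item.advertiser for item in results], remaining)
```

Amounts are kept in integer cents so that repeated charges do not accumulate floating-point error.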

14
The Scale of Operations
  • Search volume: 70mm/day, capacity for 210mm/day
  • 300mm impressions/day
  • 10mm clicks/day: on the scale of a medium/large
    phone company
  • 6mm search listings
  • 40,000 advertisers
  • Wow

15
It Can't be that Simple, Right?
  • Right!

16
Architecture Features
  • High availability -- Noah's Ark approach: no
    single point of failure
  • Load balancers
  • State migration
  • Scalability: no architectural changes to scale
    serving capacity.
  • Extensibility: can add search features
    incrementally.
  • Distributed content: multiple sites currently
    serving all partners.

17
Advertiser Management
18
Advertiser Tools
  • DirecTraffic Center
  • Functions: manage account balance, report on
    activity, real-time bid charges,
    add/modify/delete search listings
  • ATG/Dynamo (jhtml)/Java, EJB search listing
    services (BEA/WebLogic), custom cache reporting
    scheme based on Oracle 8i

19
How do you do this in near-real-time?
  • Data Pipeline
  • The backbone of fraud detection
  • A flexible array (30) of commodity machines that
    perform simple aggregations and other arithmetic
    calculations in a networked and coordinated way
  • A control and processing language used to
    describe the required calculations, and processed
    by the data pipeline machines.
  • Click Scoring
  • Assignment of a click score for click events that
    classifies them into various buckets of
    validity.
  • Formulas that define the buckets based on
    historical patterns of behavior of the site, and
    analysis of previous fraudulent attempts.
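
The idea of sorting click events into buckets of validity can be sketched as below. The slide does not give the actual formulas, so the rule here (counting repeat clicks from the same IP address on the same listing), the bucket names, and the thresholds are all hypothetical stand-ins.

```python
from collections import Counter


def score_clicks(clicks, suspect_after=3, invalid_after=10):
    """Label each click by how often its (ip, listing) pair has been seen so far.

    `clicks` is an ordered sequence of (ip_address, listing_id) pairs.
    The thresholds are assumed values, not taken from the source.
    """
    seen = Counter()
    labels = []
    for ip, listing_id in clicks:
        seen[(ip, listing_id)] += 1
        count = seen[(ip, listing_id)]
        if count > invalid_after:
            labels.append("invalid")      # far too many repeats: do not bill
        elif count > suspect_after:
            labels.append("suspect")      # unusual: hold for review
        else:
            labels.append("valid")
    return labels


# Hypothetical click stream: one address hammering listing L1.
stream = [("10.0.0.1", "L1")] * 5 + [("10.0.0.2", "L1")]
labels = score_clicks(stream, suspect_after=3, invalid_after=4)
print(labels)
```

A production system would derive its bucket boundaries from historical behavior and known fraud patterns, as the slide describes, rather than from fixed constants.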

20
Search Serving Systems
21
Search Serving Systems