Large Scale Internet Search at Ask.com - PowerPoint PPT Presentation

Slides: 38
Provided by: taoy
Learn more at: https://cs.nyu.edu
Transcript and Presenter's Notes


1
Large Scale Internet Search at Ask.com
  • Tao Yang

2
Outline
  • Overview of the company and search products
  • Core techniques for page ranking
  • ExpertRank
  • Challenges in building scalable search services
  • Neptune clustering middleware.
  • Fault isolation. Fast detection (TAMP)
  • Communication

4
Image/video search
  • Image search is very popular.
  • Video search is getting popular.
  • Driven by significant growth of broadband users
  • Popularity of video sharing: YouTube.
  • Major content providers continue to make
    significant investments:
  • Disney, Fox, ABC/NBC/CNN/CBS, ESPN, AOL/Time
    Warner.
  • Traditional publishers are also moving to video
    content: NY Times, LA Times, Wall Street Journal,
    etc.
  • Significant growth in the online advertisement market
    and

7
Ask.com Focused on Delivering a Better Search
Experience
  • Innovative search technologies help people find
    what they're looking for faster
  • For text, image, video, map, city search, etc.
  • #6 U.S. web property, #8 global in terms of user
    coverage (formerly Ask Jeeves)
  • 28.5% reach: active North American audience with
    48.8 million unique users
  • 133 million global unique users
  • 6% share of North American searches
  • A division of IAC, a Fortune 500 company with
    over 60 brands, 28,000 employees.

8
Sectors of IAC
  • Retailing
  • Services
  • Media & Advertising
  • Membership & Subscriptions
  • Emerging Businesses

9
7 Billion Videos Per Month
10
Site Features: Smart Answer
11
Topic Zooming with Search Suggestions
12
AskCity
13
Ask Competitive Strengths
  • Deeper topic view of the Internet
  • Query-specific link and text analysis with
    behavior analysis
  • Differentiated clustering technology
  • Natural Language Processing
  • Better understanding/analysis of queries and user
    behavior
  • Integration of Structured Data with Web search.

14
Behind Ask.com: Data Indexing and Mining
(Diagram labels: Internet, Web documents, Crawler (x3), Document DB (x3), Inverted index generation (x3), Online Database, Content classification, Link graph generation, Web graph generation, Spammer removal, Duplicate removal)
15
Engine Architecture
(Diagram labels: Client queries, Traffic load balancer, Frontend (x4), Neptune Clustering Middleware, Hierarchical Result Cache, Page index, Document Abstract (x3), Document description, Structured DB)
16
Concept: Link-based Popularity for Ranking
  • A is a connectivity matrix among web pages.
    A(i,j) = 1 for an edge from i to j.
  • Query-independent popularity.
  • Query-specific popularity
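
As a rough illustration of the concept above, the following Python sketch builds the connectivity matrix A for a small made-up graph and derives a query-independent popularity score with a PageRank-style power iteration. The four-page graph, the damping factor of 0.85, and the function names are assumptions chosen for the example, not details from the talk.

```python
def connectivity_matrix(n, edges):
    """A[i][j] = 1 if there is an edge (link) from page i to page j."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1
    return A


def pagerank(A, damping=0.85, iterations=100):
    """Power iteration; the damping value is an assumed, commonly used default."""
    n = len(A)
    out_degree = [sum(row) for row in A]
    rank = [1.0 / n] * n
    for _ in range(iterations):
        new_rank = [(1.0 - damping) / n] * n
        for i in range(n):
            if out_degree[i] == 0:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[i] / n
                for j in range(n):
                    new_rank[j] += share
            else:
                share = damping * rank[i] / out_degree[i]
                for j in range(n):
                    if A[i][j]:
                        new_rank[j] += share
        rank = new_rank
    return rank


# Hypothetical 4-page web: pages 0, 1 and 3 link to page 2; page 2 links to page 0.
edges = [(0, 2), (1, 2), (3, 2), (2, 0)]
A = connectivity_matrix(4, edges)
scores = pagerank(A)
most_popular = max(range(4), key=lambda p: scores[p])
```

Page 2 collects the most in-links here, so it ends up with the highest score regardless of any query; a query-specific variant would run a similar computation on a graph built per query.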

17
Approaches for Page Ranking
  • PageRank [Brin/Page 98]: offline computation of
    query-independent popularity, iteratively.
  • HITS [Kleinberg 98], IBM Clever:
  • Build a query-based connectivity matrix on the
    fly. H, R are hub and authority weights of
    pages.
  • Repeat until H, R converge:
  • R = A^T H = A^T A R
  • Normalize H, R.
  • ExpertRank: compute query-specific communities
    and ranking in real time.
  • Started from Teoma and evolved at Ask.com

18
Steps of ExpertRank at Ask.com
Search the index for a query
19
Index search and web graph generation
  • Search the index and identify relevant candidates
    for a given query.
  • Generate a query-specific link graph
    dynamically.
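
One simple way to picture these two steps is sketched below: look up candidates in an inverted index, then keep only the links that run between candidates. The tiny index, the link table, the example URLs, and the function name are all hypothetical; the production system's data structures are not described in the slides.

```python
def query_link_graph(candidates, web_links):
    """Keep only links whose source and target are both query candidates."""
    keep = set(candidates)
    graph = {page: [] for page in candidates}
    for source in candidates:
        for target in web_links.get(source, []):
            if target in keep and target != source:
                graph[source].append(target)
    return graph


# Hypothetical inverted index (term -> pages) and global link table (page -> out-links).
inverted_index = {
    "bat": ["zoo.example/bats", "sports.example/bats", "wiki.example/bat"],
    "cave": ["zoo.example/bats"],
}
web_links = {
    "wiki.example/bat": ["zoo.example/bats", "sports.example/bats", "news.example"],
    "zoo.example/bats": ["wiki.example/bat"],
    "sports.example/bats": [],
}

candidates = inverted_index.get("bat", [])
graph = query_link_graph(candidates, web_links)
```

The link to `news.example` is dropped because that page is not a candidate for the query, so the resulting graph is specific to "bat".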

20
Multi-stage Cluster Refinement with Integrated
Link/Topic Analysis
  • Derive link-guided page communities.
  • Cluster refinement with topic purification
  • Decompose through text classification and NLP
  • Restructure through topic similarity analysis

21
Subject-specific ranking
  • Examples
  • bat: flying mammals vs. baseball bat.
  • microwave dish: food recipes/cookware vs.
    satellite TV reception.
  • For each topic group, identify experts for page
    recommendation, and remove spamming links.

22
Integrated Ranking with User Intention Analysis
  • Score weighting from multiple topic groups.
  • Authoritativeness and freshness assessment.
  • User intention analysis.
  • Result diversification.
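
To make the first and last bullets concrete, here is a small Python sketch of score weighting across topic groups and a round-robin form of result diversification. The group names, weights, scores, and the round-robin policy itself are assumptions for illustration; the slides do not say how Ask.com combines or diversifies results.

```python
def blend_scores(group_scores, group_weights):
    """Weighted sum of per-topic-group scores for each page."""
    blended = {}
    for group, scores in group_scores.items():
        weight = group_weights.get(group, 0.0)
        for page, score in scores.items():
            blended[page] = blended.get(page, 0.0) + weight * score
    return blended


def diversify(group_scores, limit):
    """Round-robin over topic groups, taking each group's best unseen page."""
    ranked = {
        group: sorted(scores, key=scores.get, reverse=True)
        for group, scores in group_scores.items()
    }
    results, seen = [], set()
    while len(results) < limit and any(ranked.values()):
        for group in ranked:
            while ranked[group]:
                page = ranked[group].pop(0)
                if page not in seen:
                    seen.add(page)
                    results.append(page)
                    break
            if len(results) == limit:
                break
    return results


# Hypothetical per-group scores for the ambiguous query "bat".
group_scores = {
    "mammal": {"zoo": 0.9, "wiki": 0.6},
    "baseball": {"shop": 0.8, "wiki": 0.5},
}
group_weights = {"mammal": 0.7, "baseball": 0.3}

blended = blend_scores(group_scores, group_weights)
top_results = diversify(group_scores, limit=3)
```

Blending favors the dominant interpretation, while the round-robin pass makes sure the top results still include a page from each topic group.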

23
Scalability Challenges
  • Data scalability
  • From millions of pages to billions of pages.
  • Clean datasets vs. datasets with lots of noise.
  • Infrastructure scalability
  • Tens of thousands of machines.
  • Tens of millions of users.
  • Impact on response time, throughput,
    availability, and data center
    power/space/networking.
  • People scalability: from a few people to many
    engineers with non-uniform experience.

24
Examples of Scalability Problems
  • Mining question answers from web.
  • Computing with irregular data structures.
    Level-1/2 cache.
  • Large-scale memory management: 32 bits vs. 64
    bits.
  • Incremental cluster expansion and topology-aware
    management.
  • High-throughput write/read traffic: reliability
    vs. performance. Logging and checkpointing.
  • Fast and reliable data propagation across
    networks.
  • Architecture optimization for low power
    consumption.
  • Software engineering
  • Updating large software and data on a live platform.
  • Distributed debugging across thousands of machines.

25
Some Lessons Learned
  • Data
  • Data methods can behave differently with
    different data sizes/noise levels.
  • Data-driven approaches with iterative refinement
    to track positive/negative effectiveness
  • Architecture & Software
  • Distributed service-oriented architectures
  • Middleware support.

26
The Neptune Clustering Middleware
  • A simple/flexible programming model
  • Aggregating and replicating application modules
    with persistent data.
  • Shielding complexity of service discovery, load
    balancing, consistency, and failover management
  • Providing inter-service communication.
  • Providing quality-aware request scheduling for
    service differentiation
  • Started at UCSB. Evolved with Teoma, Ask.com.

27
Programming Model and Cluster-level
Parallelism/Redundancy in Neptune
  • Request-driven processing model.
  • SPMD model (single program/multiple data) while
    large data sets are partitioned and replicated.
  • Location-transparent service access with
    consistency support.
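
The combination of partitioning, replication, and location-transparent access can be pictured with the toy router below. The class name, node names, CRC32-based partitioning, and round-robin replica choice are all assumptions for the sketch; Neptune's actual API is not shown in the slides.

```python
import zlib


class ServiceCluster:
    """Hypothetical stand-in for a partitioned, replicated service."""

    def __init__(self, partitions):
        # partitions[p] is the list of replica node names holding partition p.
        self.partitions = partitions
        self.down = set()
        self._next = [0] * len(partitions)

    def partition_of(self, key):
        # Stable hash so the same key always maps to the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def route(self, key):
        """Pick a live replica of the key's partition, round-robin."""
        p = self.partition_of(key)
        replicas = self.partitions[p]
        for _ in range(len(replicas)):
            node = replicas[self._next[p] % len(replicas)]
            self._next[p] += 1
            if node not in self.down:
                return node
        raise RuntimeError(f"no live replica for partition {p}")


cluster = ServiceCluster([["node-a1", "node-a2"], ["node-b1", "node-b2"]])
p = cluster.partition_of("doc-42")
first = cluster.route("doc-42")
second = cluster.route("doc-42")

# Simulate a replica failure: requests keep flowing to the surviving replica.
cluster.down.add(first)
after_failure = cluster.route("doc-42")
```

The caller only names a key; which partition and which replica serve the request is decided by the routing layer, which is the sense of "location-transparent" used in this sketch.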

(Diagram labels: Service cluster, Request, Provider module (x2), Service method, Clustering by Neptune, Data)
28
Neptune architecture for cluster-based services
  • Symmetric and decentralized
  • Each node can host multiple services, acting as a
    service provider.
  • Each node can also subscribe to internal services
    on other nodes, acting as a consumer.
  • Supports multi-tier or nested service architectures

Service consumer/provider
Client requests
29
Inside a Neptune Server Node
Service Consumers
Service Handling Module
30
Impact of Component Failure in Multi-tier services
  • Failure of one replica: 7s - 12s
  • Service unavailable: 10s - 13s

31
Problems that affect availability
  • Tradeoff: bounded pools in multi-threaded
    services.
  • Threads are blocked with slow service dependency.
  • Fault detection speed.

(Diagram labels: Requests, Queue, Thread Pool, Service A (Healthy), Service B Replica 2 (From healthy to unresponsive))
32
Dependency Isolation
  • Per-dependency management with capsules.
  • Isolate their performance impact.
  • Maintain dependency-specific feedback information
    for QoS control.
  • Programming support with automatic recognition of
    dependency states.
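
A minimal sketch of per-dependency isolation follows, assuming a "capsule" can be approximated by a per-dependency concurrency cap that fails fast when the dependency is saturated. The class name, the cap of one in-flight call, and the rejection counter are illustrative assumptions; the slides do not describe how capsules are implemented.

```python
import threading


class DependencyCapsule:
    """Caps how many caller threads one dependency may occupy at a time."""

    def __init__(self, name, max_in_flight):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self.rejected = 0  # feedback counter that a QoS policy could read
        self._lock = threading.Lock()

    def call(self, func, *args):
        # Fail fast instead of blocking when the dependency is saturated.
        if not self._slots.acquire(blocking=False):
            with self._lock:
                self.rejected += 1
            raise RuntimeError(f"dependency {self.name} is saturated")
        try:
            return func(*args)
        finally:
            self._slots.release()


release = threading.Event()
started = threading.Event()


def slow_lookup():
    """Stands in for a dependency that has become unresponsive."""
    started.set()
    release.wait(timeout=5)
    return "slow"


capsule_b = DependencyCapsule("service-b", max_in_flight=1)
capsule_c = DependencyCapsule("service-c", max_in_flight=1)

worker = threading.Thread(target=capsule_b.call, args=(slow_lookup,))
worker.start()
started.wait(timeout=5)

try:
    capsule_b.call(slow_lookup)
    outcome = "accepted"
except RuntimeError:
    outcome = "rejected"

# Calls to a different dependency are unaffected by the stalled one.
healthy = capsule_c.call(lambda: "fast")
release.set()
worker.join()
```

Because the stalled dependency can hold at most one thread, further calls to it are rejected immediately and the remaining threads stay free for healthy dependencies.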

33
Fast Fault Detection and Information Propagation
for Large-Scale Cluster-Based Services
  • Complex 24x7 network topology in service
    clusters.
  • Frequent events: failures, structure changes, and
    new services.
  • Yellow-page directory
  • Discovery of services and their attributes
  • Server aliveness
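
The directory and aliveness ideas above can be illustrated with a heartbeat table whose entries expire after a timeout. The class name, the 3-second timeout, the attribute layout, and the explicit `now` arguments (used instead of a real clock so the example is deterministic) are assumptions for the sketch.

```python
class YellowPages:
    """Service directory whose entries expire without fresh heartbeats."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.entries = {}  # node -> (attributes, time of last heartbeat)

    def heartbeat(self, node, attributes, now):
        self.entries[node] = (attributes, now)

    def alive(self, now):
        """Nodes whose last heartbeat is within the timeout window."""
        return sorted(
            node for node, (_, seen) in self.entries.items()
            if now - seen <= self.timeout
        )

    def lookup(self, service, now):
        """Live nodes that advertise the given service."""
        return [
            node for node in self.alive(now)
            if service in self.entries[node][0].get("services", [])
        ]


directory = YellowPages(timeout=3.0)
directory.heartbeat("node-1", {"services": ["index"]}, now=0.0)
directory.heartbeat("node-2", {"services": ["index", "cache"]}, now=0.0)
directory.heartbeat("node-2", {"services": ["index", "cache"]}, now=4.0)

# node-1 has been silent for 5 seconds, so it is treated as down.
live_index_nodes = directory.lookup("index", now=5.0)
```

The timeout sets the detection speed: a shorter window notices failures sooner but needs more frequent heartbeats, which is the bandwidth cost a membership protocol has to manage.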

34
TAMP: Topology-Adaptive Membership Protocol
  • Highly efficient: optimizes bandwidth and number
    of packets
  • Topology-aware
  • Form a hierarchical tree according to network
    topology
  • Localize traffic within switches and adapt to
    changes of switch architecture.
  • Topology-adaptive
  • Network changes: switches
  • Scalable: scales to tens of thousands of nodes.
    Easy to operate.
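
The benefit of a topology-aware hierarchy can be shown with a back-of-the-envelope Python sketch: group nodes by switch, pick one leader per switch, and compare message counts. This is not the TAMP protocol itself; the leader rule (lowest node name), the all-to-all exchange among leaders, and the node and switch names are assumptions made for illustration.

```python
def build_membership_tree(node_to_switch):
    """Group nodes by switch and pick one leader per group (lowest name)."""
    groups = {}
    for node, switch in node_to_switch.items():
        groups.setdefault(switch, []).append(node)
    tree = {}
    for switch, members in groups.items():
        members.sort()
        tree[switch] = {"leader": members[0], "members": members}
    return tree


def cross_switch_messages(tree):
    """Messages per round if only leaders talk across switches (all-to-all)."""
    leaders = len(tree)
    return leaders * (leaders - 1)


def flat_messages(node_count):
    """Messages per round if every node heartbeats every other node."""
    return node_count * (node_count - 1)


# Hypothetical cluster: six nodes spread over two switches.
topology = {
    "n1": "sw1", "n2": "sw1", "n3": "sw1",
    "n4": "sw2", "n5": "sw2", "n6": "sw2",
}
tree = build_membership_tree(topology)
hierarchical = cross_switch_messages(tree)
flat = flat_messages(len(topology))

# Adapting to a topology change: n3 is re-cabled to sw2, so the tree is rebuilt.
topology["n3"] = "sw2"
rebuilt = build_membership_tree(topology)
```

With six nodes the flat scheme sends 30 messages per round in total, while the hierarchy sends only 2 across switches and keeps the remaining traffic local to each switch.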

35
Reliable Communication for Large-Scale
Data-Intensive Computing
Senders
Receivers
  • Small messages or large files
  • Membership dynamicity and fault masking
  • Easy programming

36
Solution for large-scale NxM communication
Receivers
Senders
Mediation Layer
37
Concluding Remarks
  • Ask.com is focused on leading-edge technology
    for Internet search.
  • Various solutions developed for ranking, mining,
    and infrastructure support.
  • There are still many open/challenging problems to
    be solved: exciting opportunities.