Internet Research: What - PowerPoint PPT Presentation

About This Presentation
Title:

Internet Research: What

Description:

Internet Research: What's hot in Search, Advertising – PowerPoint PPT presentation

Slides: 60
Provided by: chuckne
Category:
Tags: cec | internet | research


Transcript and Presenter's Notes

1
Internet Research: What's hot in Search,
Advertising & Cloud Computing
Rajeev Rastogi, Yahoo! Labs Bangalore
2
The most visited site on the internet
  • 600 million users per month
  • Super popular properties
  • News, finance, sports
  • Answers, flickr, del.icio.us
  • Mail, messaging
  • Search

3
Unparalleled scale
  • 25 terabytes of data collected each day
  • Over 4 billion clicks every day
  • Over 4 billion emails per day
  • Over 6 billion instant messages per day
  • Over 20 billion web documents indexed
  • Over 4 billion images searchable

No other company on the planet processes as much
data as we do!
4
Yahoo! Labs Bangalore
  • Focus is on basic and applied research
  • Search
  • Advertising
  • Cloud computing
  • University relations
  • Faculty research grants
  • Summer internships
  • Sharing data/computing infrastructure
  • Conference sponsorships
  • PhD co-op program

5
Web Search
6
What does search look like today?
7
Search results of the future: Structured abstracts
yelp.com
Gawker
babycenter
New York Times
epicurious
LinkedIn
answers.com
webmd
8
Search results of the future: Query refinement
9
Search results of the future: Rich media
10
Technologies that are enabling search
transformation
  • Information extraction (structured abstracts)
  • Web page classification (query refinement)
  • Multimedia search (rich media)

11
Information extraction (IE)
  • Goal: Extract structured records from Web pages

Name
Category
Address
Map
Phone
Price
Reviews
12
Multiple verticals
  • Business, social networking, video, ...

13
One schema per vertical
14
IE on the Web is a hard problem
  • Web pages are noisy
  • Pages belonging to different Web sites have
    different layouts

Noise
15
Web page types
  • Template-based
  • Hand-crafted
16
Template-based pages
  • Pages within a Web site are generated using scripts
    and have very similar structure
  • Can be leveraged for extraction
  • 30% of crawled Web pages
  • Information rich, frequently appear in the top
    results of search queries
  • E.g., search query "Chinese Mirch New York"
  • 9 template-based pages in the top 10 results

17
Wrapper Induction
  • Enables extraction from template-based pages

[Figure: wrapper induction workflow.
Learn: sample pages from the Web site, annotate the sampled
pages, and learn wrappers (XPath rules) from the annotations.
Extract: apply the wrappers to the Web site's pages to
extract records.]
18
Example
XPath: /html/body/div/div/div/div/div/div/span
Generalize: /html/body//div//span
19
Filters
  • Apply filters to prune from multiple candidates
    that match XPath expression

XPath: /html/body//div//span
Regex filter (Phone): \([0-9]{3}\) [0-9]{3}-[0-9]{4}
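
A minimal sketch of the two steps above, assuming Python's standard library: the generalized XPath selects candidate spans, and the phone regex prunes them. The sample page, the hyphenated phone format, and the function name `extract_phone` are illustrative assumptions, not part of the talk. ElementTree supports only a subset of XPath, so the rule is written relative to the root `<html>` element.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (real Web pages would need an HTML parser).
PAGE = """<html><body>
<div><div><span>Chinese Mirch</span></div></div>
<div><div><div>
  <span>120 Lexington Avenue</span>
  <span>(212) 532-3663</span>
</div></div></div>
</body></html>"""

# Phone filter: (ddd) ddd-dddd
PHONE = re.compile(r"\([0-9]{3}\) [0-9]{3}-[0-9]{4}")


def extract_phone(page_xml):
    """Return the text of spans that match the XPath rule and the phone filter."""
    root = ET.fromstring(page_xml)  # root is the <html> element
    seen, candidates = set(), []
    # Equivalent of /html/body//div//span, relative to <html>.
    for span in root.findall("./body//div//span"):
        # A span nested in several divs can be returned more than once.
        if id(span) not in seen:
            seen.add(id(span))
            candidates.append((span.text or "").strip())
    # The XPath alone matches every span; the filter keeps only phone-like text.
    return [text for text in candidates if PHONE.fullmatch(text)]


print(extract_phone(PAGE))
```

The XPath on its own returns three candidates here (name, address, phone); the filter is what narrows them to the phone field.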
20
Limitations of wrappers
  • Won't work across Web sites due to different page
    layouts
  • Scaling to thousands of sites can be a challenge
  • Need to learn a separate wrapper for each site
  • Annotating example pages from thousands of sites
    can be time-consuming and expensive

21
Research challenge
  • Unsupervised IE: Extract attribute values from
    pages of a new Web site without annotating a
    single page from the site
  • Only annotate pages from a few sites initially as
    training data

22
Conditional Random Fields (CRFs)
  • Models conditional probability distribution of
    label sequence y = (y1, ..., yn) given input sequence
    x = (x1, ..., xn)
  • fk: features, λk: weights
  • Choose λk to maximize log-likelihood of training
    data
  • Use Viterbi algorithm to compute label sequence y
    with highest probability
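
For reference, the standard linear-chain CRF is usually written as below, with features f_k and weights λ_k as in the bullets above. The normalizer Z(x), the position index i, and the training-example index j are standard textbook notation assumed here, not taken from the talk.

```latex
\begin{align*}
p(y \mid x) &= \frac{1}{Z(x)} \exp\Big( \sum_{i=1}^{n} \sum_{k} \lambda_k\, f_k(y_{i-1}, y_i, x, i) \Big) \\
Z(x) &= \sum_{y'} \exp\Big( \sum_{i=1}^{n} \sum_{k} \lambda_k\, f_k(y'_{i-1}, y'_i, x, i) \Big) \\
\hat{\lambda} &= \arg\max_{\lambda} \sum_{j} \log p\big(y^{(j)} \mid x^{(j)}\big) \\
y^{*} &= \arg\max_{y}\; p(y \mid x)
\end{align*}
```

The third line is the log-likelihood objective used to choose the weights, and the last line is the decoding problem that the Viterbi algorithm solves.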

23
CRFs-based IE
  • Web pages can be viewed as labeled sequences
  • Train CRF using pages from a few Web sites
  • Then use trained CRF to extract from remaining
    sites

24
Drawbacks of CRFs
  • Require too many training examples
  • Have been used previously to segment short
    strings with similar structure
  • However, may not work too well across Web sites
    that
      • contain long pages with lots of noise
      • have very different structure

25
An alternate approach that exploits site knowledge
  • Build attribute classifiers for each attribute
      • Use pages from a few initial Web sites
  • For each page from a new Web site
      • Segment page into sequence of fields (using
        static repeating text)
      • Use attribute classifiers to assign attribute
        labels to fields
      • Use constraints to disambiguate labels
  • Constraints
      • Uniqueness: an attribute occurs at most once in
        a page
      • Proximity: attribute values appear close
        together in a page
      • Structural: relative positions of attributes
        are identical across pages of a Web site

26
Attribute classifiers + constraints example
Page1
  • Name: Chinese Mirch
  • Category: Chinese, Indian
  • Address: 120 Lexington Avenue, New York, NY 10016
  • Phone: (212) 532 3663
Page2
  • Name: Jewel of India
  • Category: Indian
  • Address: 15 W 44th St, New York, NY 10016
  • Phone: (212) 869 5544
Page3 (labels from attribute classifiers)
  • Name, Noise: 21 Club
  • Category, Name: American
  • Address: 21 W 52nd St, New York, NY 10019
  • Phone: (212) 582 7200
Uniqueness constraint: Name
Precedence constraint: Name < Category
Page3 (after applying constraints)
  • Name: 21 Club
  • Category: American
  • Address: 21 W 52nd St, New York, NY 10019
  • Phone: (212) 582 7200
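
A minimal sketch of how the uniqueness and precedence constraints can resolve ambiguous classifier output for Page3. The classifier scores, the brute-force search, and the choice to maximize the sum of scores are assumptions made for illustration; the talk does not specify how constraints are enforced.

```python
from itertools import product


def disambiguate(candidates,
                 unique=("Name", "Category", "Address", "Phone"),
                 precedence=(("Name", "Category"),)):
    """Pick one label per field, maximizing total score under the constraints.

    candidates: one dict {label: score} per field, in page order.
    """
    best, best_score = None, float("-inf")
    for labels in product(*(c.keys() for c in candidates)):
        # Uniqueness: each attribute occurs at most once in a page.
        if any(labels.count(attr) > 1 for attr in unique):
            continue
        # Precedence: when both labels occur, `first` must precede `second`.
        if any(first in labels and second in labels
               and labels.index(first) > labels.index(second)
               for first, second in precedence):
            continue
        score = sum(c[label] for c, label in zip(candidates, labels))
        if score > best_score:
            best, best_score = list(labels), score
    return best


# Hypothetical classifier scores for the four fields of Page3.
page3 = [
    {"Name": 0.60, "Noise": 0.40},     # "21 Club"
    {"Category": 0.45, "Name": 0.55},  # "American"
    {"Address": 1.0},                  # "21 W 52nd St, New York, NY 10019"
    {"Phone": 1.0},                    # "(212) 582 7200"
]

print(disambiguate(page3))
```

Taking the top label per field independently would assign Name to both "21 Club" and "American"; the uniqueness constraint rules that out, and the search settles on Name, Category, Address, Phone.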
27
Other IE scenarios: Browse page extraction
Similar-structured records
28
IE big picture/taxonomy
  • Things to extract from
      • Template-based, browse, hand-crafted pages, text
  • Things to extract
      • Records, tables, lists, named entities
  • Techniques used
      • Structure-based (HTML tags, DOM tree paths),
        e.g. wrappers
      • Content-based (attribute values/models), e.g.
        dictionaries
      • Structure + Content (sequential/hierarchical
        relationships among attribute values), e.g.
        hierarchical CRFs
  • Level of automation
      • Manual, supervised, unsupervised

29
Web Page Classification Requirements
  • Quality
  • High Precision and Recall
  • Leverage structured input (links, co-citations)
    and output (taxonomy)
  • Scalability
  • Large numbers of training Examples, Features and
    Classes
  • Complex Structured input and output
  • Cost
  • Small human effort (for labeling of pages)
  • Compact classifier model
  • Low prediction time

30
Structured Output Learning
  • Structured Output Examples
  • Multi-class
  • Taxonomy
  • Naïve approach
  • Separate binary classifier per class
  • Separate classifier for each taxonomy level
  • Better approach: single (SVM) classifier
  • Higher accuracy, more efficient
  • Sequential Dual Method (SDM)
  • Visit each example sequentially and solve
    associated QP problem (in dual) efficiently
  • Order of magnitude faster

[Figure: example taxonomy. Health: Fitness, Medicine.
Sport: Cricket, Soccer. Cricket: One-day, Test.]
31
Classification With Relational Information
Similar structure
  • Relational Information
  • Web page links, structural similarity
  • Graph representation
  • Pages as nodes (with labels)
  • Edge weights s(j,k): page similarity,
    out-link/co-citation existence, etc.
  • Classification can be expressed as an
    optimization problem

Co-citation
Link
32
Multimedia Search
  • Availability and consumption of multimedia content
    on the Internet is increasing
  • 500 billion images will be captured in 2010
  • Leveraging content and metadata is important for
    MM search
  • Some big technical challenges are
  • Results diversity
  • Relevance
  • Image Classification, e.g., pornography

33
Near-Duplicate Detection
  • Multiple near-similar versions of an image exist
    on the internet
  • scaled, cropped, captioned, small scene change,
    etc.
  • Near-duplicates adversely impact user experience
  • Can we use a compact description and dedup in
    constant time?
  • Fourier-Mellin Transform (FMT): translation,
    rotation, and scale invariant
  • Signature generation using a small number of
    low-frequency coefficients of FMT

34
Filtering noisy tags to improve relevance
  • Measures such as IDF may assign high weights to
    noisy tags
  • Treat Tag-Sets as Bag-of-words, random collection
    of terms
  • Boosting weights of tags based on their
    co-occurrence with other tags can filter out
    noise

idf
co-occur
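
A minimal sketch contrasting IDF with a co-occurrence score on a toy collection. The tag sets, the function name `score_tags`, and the particular co-occurrence measure (average of P(tag | other tag) over the other tags in the same set) are assumptions for illustration; the talk does not give the exact weighting.

```python
import math
from collections import Counter
from itertools import combinations


def score_tags(tag_sets, index):
    """Return {tag: (idf, co-occurrence score)} for the tag set at `index`."""
    n = len(tag_sets)
    sets = [set(tags) for tags in tag_sets]
    df = Counter(tag for s in sets for tag in s)  # document frequency per tag
    pair = Counter()                              # co-occurrence counts per tag pair
    for s in sets:
        for a, b in combinations(sorted(s), 2):
            pair[(a, b)] += 1

    scores = {}
    for tag in sets[index]:
        others = [u for u in sets[index] if u != tag]
        # P(tag | u): fraction of tag sets containing u that also contain tag.
        total = sum(pair[tuple(sorted((tag, u)))] / df[u] for u in others)
        cooccur = total / len(others) if others else 0.0
        scores[tag] = (math.log(n / df[tag]), cooccur)
    return scores


# Hypothetical tag sets; "dsc0042" stands in for a noisy tag.
tag_sets = [
    ["beach", "sunset", "sea"],
    ["beach", "sunset", "dsc0042"],
    ["beach", "sea"],
    ["sunset", "sea"],
]

scores = score_tags(tag_sets, 1)
for tag, (idf, cooccur) in sorted(scores.items()):
    print(f"{tag:8s} idf={idf:.3f} co-occur={cooccur:.3f}")
```

In this toy data the rare tag gets the highest IDF but the lowest co-occurrence score, which is the effect the slide describes.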
35
Online Advertising
36
Sponsored search ads
Search query
Ad
37
How it works
Ad Index
Advertiser
Sponsored search engine
  • Engine decides when/where to show this ad on
    search results page
  • Advertiser pays only if user clicks on ad

38
Ad selection criterion
  • Problem: which ads to show from among ads
    containing keyword?
  • Ads with highest bid may not maximize revenue
  • Choose ads with maximum expected revenue
  • Weight bid amount by click probability
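
A minimal sketch of ranking by expected revenue, that is, bid multiplied by click probability. The ad identifiers, bids, and click probabilities are made-up values, and `select_ads` is a hypothetical helper name.

```python
def select_ads(ads, slots=1):
    """ads: list of (ad_id, bid, click_probability) tuples.

    Returns the ids of the `slots` ads with the highest expected revenue.
    """
    ranked = sorted(ads, key=lambda ad: ad[1] * ad[2], reverse=True)
    return [ad_id for ad_id, _bid, _p_click in ranked[:slots]]


# Hypothetical ads: A has the highest bid but is rarely clicked.
ads = [
    ("A", 2.00, 0.01),  # expected revenue 0.02
    ("B", 0.50, 0.08),  # expected revenue 0.04
    ("C", 1.00, 0.03),  # expected revenue 0.03
]

print(select_ads(ads))
print(select_ads(ads, slots=2))
```

Ranking by bid alone would pick A; weighting by click probability picks B, then C.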

39
Contextual Advertising
Ads
40
Contextual ads
  • Similar to sponsored search, but now ads are
    shown on general Web pages as opposed to only
    search pages
  • Advertisers bid on keywords
  • Advertiser pays only if user clicks; Y! and
    publisher share the paid amount
  • Ad matching engine ranks ads based on expected
    revenue (bid amount × click probability)

41
Estimating click probability
  • Use logistic regression model
  • p(click | ad, page, user)
  • fi: ith feature for (ad, page, user)
  • wi: weight for feature fi
  • Training data: ad click logs (all clicks +
    non-click samples)
  • Optimize log-likelihood to learn weights
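
For reference, the standard logistic regression model and its log-likelihood, written with the symbols f_i and w_i from the bullets above. The click indicator c_j and the shorthand p_j (predicted click probability for training sample j) are notation assumed here.

```latex
\begin{align*}
p(\text{click} \mid \text{ad}, \text{page}, \text{user})
  &= \frac{1}{1 + \exp\big(-\sum_i w_i f_i\big)} \\
\ell(w) &= \sum_{j} \Big[ c_j \log p_j + (1 - c_j) \log (1 - p_j) \Big]
\end{align*}
```

Here c_j is 1 for a click sample and 0 for a non-click sample; the weights w are chosen to maximize ℓ(w).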

42
Features
  • Ad: bid terms, title, body, category, ...
  • Page: url, title, keywords in body, category, ...
  • User
      • Geographic (location, time)
      • Demographic (age, gender)
      • Behavioral
  • Combine above to get (billions of) richer
    features
      • E.g., (apple ∈ ad title) ∧ (ipod ∈ page body) ∧
        (20 < user age < 30)
  • Select subset that leads to improvement in
    likelihood
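
A minimal sketch of generating conjunction features from ad, page, and user attributes. The dictionary layout, the ten-year age buckets, and the feature-string format are assumptions for illustration; the subset-selection step from the last bullet is not shown.

```python
from itertools import product


def conjunction_features(ad, page, user):
    """Cross every ad-title word with every page-body word and the user's age bucket."""
    ad_feats = [f"{w} in ad title" for w in ad["title"].lower().split()]
    page_feats = [f"{w} in page body" for w in page["body"].lower().split()]
    low = user["age"] // 10 * 10  # ten-year bucket, e.g. 25 -> [20, 30)
    user_feats = [f"{low} <= user age < {low + 10}"]
    return {" AND ".join(combo)
            for combo in product(ad_feats, page_feats, user_feats)}


# Hypothetical inputs.
feats = conjunction_features(
    {"title": "Apple"},
    {"body": "ipod review"},
    {"age": 25},
)
for feat in sorted(feats):
    print(feat)
```

The number of conjunctions grows multiplicatively with the vocabulary of each source, which is why the feature count reaches billions and a selection step is needed.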

43
Banner ads
  • Show Web page with display ads

Ad
Creates Brand Awareness
44
How it works
Ad Index
Advertiser
"I want 1M impressions on finance.yahoo.com,
gender = male, age = 20-30, during the month of
April 2009"
Banner Ad Engine
  • Engine guarantees 1M impressions
  • Advertiser pays a fixed price
  • No dependence on clicks
  • Engine does admission control, decides allocation
    of ads to pages

45
Allocation Example
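[Figure: two allocations of the same demand to the same supply.
SUPPLY is given as (Qty, Price) and DEMAND as (Target, Qty);
the demands are (Gender = Male, 12M) and (Age > 30, 12M).
Suboptimal: unallocated supply (6M, 10), value = 60M.
Optimal: unallocated supply (6M, 20), value = 120M.]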
46
Research problem
  • Goal: Allocate demands so that the value of
    unallocated inventory is maximized
  • Similar to transportation problem

47
Transportation problem
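[Figure: bipartite graph. Demand nodes 1, 2, ..., i with
quantities d1, d2, ..., di; supply regions 1, 2, ..., j with
supplies s1, s2, ..., sj and prices p1, p2, ..., pj.
Demand i has edges to the regions in Ri.]
xij: units of demand i allocated to region j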
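
One way to write the allocation goal as a linear program, using the symbols d_i, s_j, p_j, x_ij, and R_i from the figure. This formulation is a sketch under the assumption that every demand must be met in full and that x_ij is defined only for regions j in R_i; the talk does not show the exact program.

```latex
\begin{align*}
\max_{x}\quad & \sum_{j} p_j \Big( s_j - \sum_{i:\, j \in R_i} x_{ij} \Big) \\
\text{s.t.}\quad & \sum_{j \in R_i} x_{ij} = d_i \qquad \forall i \\
& \sum_{i:\, j \in R_i} x_{ij} \le s_j \qquad \forall j \\
& x_{ij} \ge 0
\end{align*}
```

Since the total value of the supply is a constant, maximizing the value of unallocated inventory is equivalent to minimizing the total price of allocated inventory, which is the classical transportation objective with cost p_j on every edge into region j.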
48
Ads taxonomy
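Online Ads
  • Search pages: Sponsored search
  • Web pages: Contextual, Banner
Targeting, guarantees (G = guaranteed, NG = non-guaranteed),
and pricing model by ad type
  • Sponsored search: Keywords, NG, CPC
  • Contextual: Keywords, NG, CPC
  • Banner (G): Attributes, G, CPM
  • Banner (NG): Attributes, NG, CPM/CPC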
49
Major trend: Ads convergence
  • Today
      • Separate systems for contextual and display
      • Contextual: CPC
      • Display: CPM
  • Tomorrow
      • Unified Ads marketplace
      • Unify contextual and display
      • Increase supply and demand
      • Enable better matching
      • CPC, CPM ads compete

Advertiser: Creates demand
Publisher: Creates supply of pages
50
Research challenge
  • Which ad to select between competing CPC, CPM
    ads?
  • Use eCPM
      • For CPM ads: eCPM = bid
      • For CPC ads: eCPM = bid × Pr(click)
  • Select ad with max eCPM to maximize revenue
  • Problem: ad with highest actual eCPM may not get
    selected
  • eCPMs estimated based on historical data, which
    can differ from actual eCPMs
  • Variance in estimated eCPMs higher for CPC ads
  • Selection gets biased towards ads which have
    higher variance as they have higher probability
    of over-estimated eCPMs

[Figure: estimated vs. actual eCPM for a CPC ad and a CPM ad]
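
A small simulation sketch of the selection bias described above. All numbers are assumptions chosen for illustration: one CPM ad whose eCPM (1.0) is known exactly, and five CPC ads whose true eCPM (0.9) is lower but is estimated with Gaussian noise.

```python
import random


def simulate(trials=10_000, n_cpc=5, cpm_ecpm=1.0, cpc_ecpm=0.9,
             noise=0.2, seed=0):
    """Return (share of auctions won by a CPC ad, average realized eCPM)."""
    rng = random.Random(seed)
    cpc_wins, revenue = 0, 0.0
    for _ in range(trials):
        # The CPM ad's eCPM is its bid; each CPC estimate is truth plus noise.
        best_cpc_estimate = max(cpc_ecpm + rng.gauss(0, noise)
                                for _ in range(n_cpc))
        if best_cpc_estimate > cpm_ecpm:
            cpc_wins += 1
            revenue += cpc_ecpm   # realized value is the true eCPM
        else:
            revenue += cpm_ecpm
    return cpc_wins / trials, revenue / trials


share, avg_revenue = simulate()
print(f"CPC ad selected in {share:.0%} of auctions")
print(f"average realized eCPM: {avg_revenue:.3f} (best possible: 1.000)")
```

Although the CPM ad is truly the best in every auction, a CPC ad is selected most of the time because at least one of the noisy estimates is usually over-estimated, and realized revenue falls below what always picking the CPM ad would yield.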
51
Cloud Computing
52
Much of the stuff we do is compute/data-intensive
  • Search
  • Index 100 billion crawled Web pages
  • Build Web graph, compute PageRank
  • Advertising
  • Construct ML models to predict click probability
  • Cluster, classify Web pages
  • Improve search relevance, ad matching
  • Data mining
  • Analyze TBs of Web logs to compute correlations
    between (billions of) user profiles and page views

53
Solution: Cloud computing
  • A cloud consists of
  • 1000s of commodity machines (e.g., Linux PCs)
  • Software layer for
  • Distributing data across machines
  • Parallelizing application execution across
    cluster
  • Detecting and recovering from failures
  • Yahoo!'s software layer is based on Hadoop (open
    source)

54
Cloud computing benefits
  • Enables processing of massive compute-intensive
    tasks
  • Reduces computing and storage costs
  • Resource sharing leads to efficient utilization
  • Commodity hardware, open source
  • Shields application developers from complexity of
    building in reliability, scalability in their
    programs
  • In large clusters, machines fail every day
  • Parallel programming is hard

55
Cloud computing at Yahoo!
  • 10,000s of nodes running Hadoop, TBs of RAM, PBs
    of disk
  • Multiple clusters, largest is a 1600 node cluster

56
Hadoop's Map/Reduce Framework
  • Framework for parallel computation over massive
    data sets on large clusters
  • As an example, consider the problem of creating
    an index for word search.
  • Input: Thousands of documents/web pages
  • Output: A mapping of word to document IDs

57
Hadoop's Map/Reduce
Index example (contd.)
[Figure: data flow from input splits to map tasks, then
intermediate output (sorted), shuffle, and reduce tasks]
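
A minimal in-memory sketch of the index example, mirroring the map, shuffle, and reduce stages in plain Python. It does not use the Hadoop API; the function names and the sample documents are assumptions for illustration.

```python
from collections import defaultdict


def map_task(doc_id, text):
    """Map: emit one (word, doc_id) pair per distinct word in a document."""
    return [(word, doc_id) for word in set(text.lower().split())]


def shuffle(pairs):
    """Shuffle: sort the intermediate pairs and group document IDs by word."""
    groups = defaultdict(list)
    for word, doc_id in sorted(pairs):
        groups[word].append(doc_id)
    return groups


def reduce_task(word, doc_ids):
    """Reduce: produce the final posting list for one word."""
    return word, sorted(set(doc_ids))


def build_index(docs):
    """Run all map tasks, shuffle their output, then run one reduce per word."""
    pairs = [pair for doc_id, text in docs.items()
             for pair in map_task(doc_id, text)]
    return dict(reduce_task(word, ids) for word, ids in shuffle(pairs).items())


# Hypothetical input documents keyed by document ID.
docs = {1: "web search", 2: "web ads"}
index = build_index(docs)
print(index)
```

In a real cluster each map task would process one input split on the node holding that data, and the shuffle would route all pairs for a given word to the same reduce task.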
58
Research challenges
Data Distribution and Replication
  • Compute nodes in racks (Rack 1, Rack 2, ..., Rack i,
    ..., Rack n)
  • Data blocks for a given job distributed and
    replicated across nodes in a rack and across racks
  • Challenges
      • Optimize distribution to provide maximum
        locality
      • Optimize replication to provide best fault
        tolerance

Job Scheduling
  • Job queues based on priorities and SLAs
  • Challenges
      • Schedule jobs to maximize resource utilization
        while preserving SLAs
      • Schedule jobs to maximize data locality
      • Performance modeling

[Figure: job queues L1 (SDS, Q1, 40), L2 (YST, Q2, 35), ...,
Lm (ATG, Qm, 25)]
59
Summary
  • Internet is an exciting place, plenty of research
    needed to improve
      • User experience
      • Monetization
      • Scalability
  • Search -> Information extraction, classification,
    ...
  • Advertising -> Click prediction, ad placement, ...
  • Cloud computing -> Job scheduling, perf modeling,
    ...
  • Solving problems will require techniques from
    multiple disciplines: ML, statistics, economics,
    algos, systems, ...