Large Scale Applications on Hadoop in Yahoo

1
Large Scale Applications on Hadoop in Yahoo
Vijay K Narayanan, Yahoo! Labs
04.26.2010

Massive Data Analytics Over the Cloud (MDAC 2010)
2
Outline

Hadoop in Yahoo!
Common types of applications on Hadoop
Sample applications in
Content Analysis
Web Graph
Mail Spam Filtering
Search
Advertising
User Modeling on Hadoop
Challenges and Practical Considerations

Hadoop in Yahoo

4
By the Numbers

About 30,000 nodes in tens of clusters
1 node = 4 x 1 TB disks, 8 cores, 16 GB RAM as a typical configuration
Largest single cluster of about 4000 nodes
4 tiers of clusters
Application research and development
Production clusters
Hadoop platform development and testing
Proof of concepts and ad-hoc work
Over 1000 users across research, engineering,
operations etc.
Running more than 100,000 jobs per day
More than 3 PB of data
Compressed and un-replicated volume
Currently running Hadoop 0.20

5
Advantages

Wide applicability of the M/R computing model
Many problems in internet domain can be solved by
data parallelism
High throughput
Stream through 100 TB of data in less than 1 hour
Applications that took weeks earlier complete in
hours
Research prototyping, development, and production
deployment systems are (almost) identical
Scalable, economical, fault-tolerant
Shared resource with common infrastructure
operations

6
Entities in internet eco-system
Leverage Hadoop extensively in all of these domains in Yahoo!
[Diagram labels: Content (pages, blogs etc.); Content/Display Advertising; Search Engine; Browses; Searches; Interacts; Ads (Text, Display etc.); Queries; Search Advertising]
7

Common Types of Applications

8
Common applications on Hadoop in Yahoo!

1. Near real-time data pipelines
Backbone for analytics, reporting, research etc.
Multi-step pipelines to create data feeds from logs
Web servers: page content, layout and links, clicks, queries etc.
Ad servers: ad serving opportunity data, impressions
Clicks, beacons, conversion data servers
Process large volume of events
Tens of billions of events/day
Tens of TB (compressed) of data/day
Latencies of tens of minutes to a few hours
Continuous runs of jobs working on chunks of data

9
Example Data Pipelines
Input: tens of billions of events/day
Processing steps
Parse and transform event streams
Join clicks with views
Filter out robots
Aggregate, sort, partition
Data quality checks
[Diagram labels for the consumers of the pipeline; grouping is approximate]
Analytics: network analytics, experiment reporting
User sessions: optimize traffic engagement, user session click-stream, path and funnel analysis
User profiles: user segment analysis, interest
Ads and content: measurements, modeling and scoring, experimentation
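
The processing steps named on this slide (parse and transform, join clicks with views, filter out robots, aggregate) can be sketched on a single machine as follows. This is only an illustration of the sequence of steps, not the production pipeline: the tab-separated record layout, the field names, and the `is_robot` user-agent rule are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical raw log lines: kind \t view_id \t user \t user_agent \t page
RAW = [
    "view\tv1\tu1\tMozilla\tfrontpage",
    "view\tv2\tu2\tcrawler-bot\tfrontpage",
    "view\tv3\tu1\tMozilla\tnews",
    "click\tv1\tu1\tMozilla\tfrontpage",
    "click\tv2\tu2\tcrawler-bot\tfrontpage",
]

def parse(line):
    """Parse and transform: split one raw line into a record."""
    kind, view_id, user, agent, page = line.split("\t")
    return {"kind": kind, "view_id": view_id, "user": user,
            "agent": agent, "page": page}

def is_robot(event):
    """Assumed robot rule: the user agent contains 'bot'."""
    return "bot" in event["agent"].lower()

def run_pipeline(lines):
    # Parse, then filter out robots.
    events = [e for e in map(parse, lines) if not is_robot(e)]
    # Join clicks with views on view_id.
    clicked = {e["view_id"] for e in events if e["kind"] == "click"}
    # Aggregate views and clicks per page, sorted by page.
    stats = defaultdict(lambda: {"views": 0, "clicks": 0})
    for e in events:
        if e["kind"] == "view":
            stats[e["page"]]["views"] += 1
            stats[e["page"]]["clicks"] += int(e["view_id"] in clicked)
    return dict(sorted(stats.items()))

report = run_pipeline(RAW)
```

On Hadoop each step would typically be one or more MapReduce or Pig jobs over chunks of the event stream; the in-memory set used for the join here only works because the sample is tiny.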

10
Common applications on Hadoop in Yahoo!

2. High throughput engine for ETL and reporting applications
Put large data sources (e.g. logs) on HDFS
Run canned aggregations, transformations, normalizations
Load reports to RDBMS/data marts
Hourly and daily batch jobs
3. Exploratory data research
Ad-hoc analysis and insights into data
Leveraging Pig and custom MapReduce scripts
Pig is based on Pig Latin (upcoming support for SQL)
Procedural language, designed for data parallelism
Supports nested relational data structures

11
Common applications on Hadoop in Yahoo!

4. Indexing for efficient retrieval
Build and update indices of content, ads etc.
Updated in batch mode and pushed for online serving
Efficient retrieval of content and ads during serving
5. Offline modeling
Supervised and unsupervised learning algorithms
Outlier detection methods
Association rule mining techniques
Graph analysis methods
Time series analysis etc.

12
Common applications on Hadoop in Yahoo!

6. Batch and near real-time scoring applications
Offline model scoring for upload to serving applications
Frequency: hourly or daily jobs
7. Near real-time feedback from serving systems
Update features and model weights based on feedback from serving
Periodically push these updates to online scoring and serving
Typical updates in minutes or hours
8. Monitoring and performance dashboards
Analyze scoring and serving logs for
Monitoring end to end performance of scoring and serving systems
Measurements of model performance and measurements

Sample Applications

14
Application: Content Analysis

Web data
Information about every web site, page, and link crawled by Yahoo
Growing corpus of more than 100 TB of data from tens of billions of documents
Document processing pipeline on Hadoop
Enrich with features from page, site etc.
Page segmentation
Term document vector and weighted variants
Entity analysis
Detection, disambiguation, resolution of entities in page
Concepts and topic modeling and clustering
Page quality analysis

15
Application: Web graph analysis

Directed graph of the web
Aggregated views by different dimensions
Sites, domains, hosts etc.
Large scale analysis of this graph
2 trillion links
Jobs utilize 100,000 maps, 10,000 reduces
300 TB compressed output

Attribute | Before Hadoop | With Hadoop
Time | 1 month | Days
Maximum number of URLs | Order of 100 billion | Many times larger
16
Application: Mail spam filtering

Scale of the problem
25B connections, 5B deliveries per day
450M mailboxes
User feedback on spam is often late, noisy and not always actionable

Problem | Algorithm | Data size | Running time on Hadoop
Detecting spam campaigns | Frequent itemset mining | 20 MM spam votes | 1 hour
Gaming of spam IP votes by spammers | Connected components (squaring a bipartite graph) | 500K spammers, 500K spam IPs | 1 hour
17
Application: Mail Spam Filtering - Campaigns

9 2595 (IPTYPEnone,FROMUSERsales,SUBJIt's Important You Know,FROMDOMdappercom.info,URLdappercom.info,ip_D66.206.14.77,)
9 2457 (IPTYPEnone,FROMUSERsales,SUBJSave On Costly Repairs,FROMDOMaftermoon.info,URLaftermoon.info,ip_D66.206.14.78,)
9 2447 (IPTYPEnone,FROMUSERsales,SUBJCar-Dealers-Compete-On-New-Vehicles,FROMDOMsherge.info,URLsherge.info,ip_D66.206.25.227,)
9 2432 (IPTYPEnone,FROMUSERsales,SUBJJanuary 18th CreditReport Update,FROMDOMzaninte.info,URLzaninte.info,ip_D66.206.25.227,)
9 2376 (IPTYPEnone,FROMUSERhealth,SUBJFinally. Coverage for the whole family,FROMDOMfiatchimera.com,URLarticulatedispirit.com,ip_D216.218.201.149,)
9 2184 (IPTYPEnone,FROMUSERhealth,SUBJFinally. Coverage for the whole family,FROMDOMfiatchimera.com,URLstratagemnepheligenous.com,ip_D216.218.201.149,)
9 1990 (IPTYPEnone,FROMUSERsales,SUBJCloseout 2008-2009-2010 New Cars,FROMDOMsastlg.info,URLsastlg.info,ip_D66.206.25.227,)
9 1899 (IPTYPEnone,FROMUSERsales,FROMDOMbrunhil.info,SUBJ700-CreditScore-What-Is-Yours?,URLbrunhil.info,ip_D66.206.25.227,)
9 1743 (IPTYPEnone,FROMUSERsales,SUBJNow exercise can be fun,FROMDOMaccordpac.info,URLaccordpac.info,ip_D66.206.14.78,)
9 1706 (IPTYPEnone,FROMUSERsales,SUBJCloseout 2008-2009-2010 New Cars,FROMDOMrionel.info,URLrionel.info,ip_D66.206.25.227,)
9 1693 (IPTYPEnone,FROMUSERsales,SUBJJanuary 18th CreditReport Update,FROMDOMastroom.info,URLastroom.info,ip_D66.206.25.227,)
9 1689 (IPTYPEnone,FROMUSERsales,SUBJeBay Work_at_Home w/Solid-Income-Strategies,FROMDOMstamine.info,URLstamine.info,ip_D66.165.232.203,)

2432 (IPTYPEnone,FROMUSERsales,SUBJJanuary 18th CreditReport Update,FROMDOMzaninte.info,URLzaninte.info,ip_D66.206.25.227,)
2447 (IPTYPEnone,FROMUSERsales,SUBJCar-Dealers-Compete-On-New-Vehicles,FROMDOMsherge.info,URLsherge.info,ip_D66.206.25.227,)
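
The campaign records above look like attribute combinations that recur across many spam votes, which is what frequent itemset mining finds. A minimal single-machine sketch of a level-wise (Apriori-style) search is given below. It is not the implementation behind the slide: the `KEY=value` item encoding, the sample votes, and the `min_support` and `max_size` parameters are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=3):
    """Level-wise search: an itemset of size k is only counted if all of
    its subsets of size k-1 were frequent at the previous level."""
    frequent = {}
    prev = None
    for k in range(1, max_size + 1):
        counts = Counter()
        for items in transactions:
            for combo in combinations(sorted(items), k):
                if prev is None or all(
                    sub in prev for sub in combinations(combo, k - 1)
                ):
                    counts[combo] += 1
        prev = {c for c, n in counts.items() if n >= min_support}
        if not prev:
            break
        frequent.update({c: counts[c] for c in prev})
    return frequent

# Hypothetical spam votes, each reduced to a set of message attributes.
VOTES = [
    {"FROMUSER=sales", "IP=66.206.25.227", "SUBJ=credit report"},
    {"FROMUSER=sales", "IP=66.206.25.227", "SUBJ=new cars"},
    {"FROMUSER=sales", "IP=66.206.25.227", "SUBJ=credit report"},
    {"FROMUSER=health", "IP=216.218.201.149", "SUBJ=coverage"},
]

campaigns = frequent_itemsets(VOTES, min_support=2)
```

Here the sender and IP pair is supported by three votes, so it surfaces as a candidate campaign signature even though the subject lines differ.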
18
Application: Search Ranking

Rank web pages based on relevance to queries
Features based on content of page, site, queries, web graph etc.
Train machine learning models to rank relevant pages for queries
Periodically learn new models

Dimension | Before Hadoop | Using Hadoop
Features | 100s | 1000s
Running time | Days to weeks | Hours
19
Application: Search Assist™

Related concepts occur together. Analyze 3 years of logs
Build dictionaries on Hadoop and push to online serving

Dimension | Before Hadoop | Using Hadoop
Time | 4 weeks | < 30 minutes
Language | C | Python
Development time | 2-3 weeks | 2-3 days
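
The slide gives only the idea that related concepts occur together in the logs. One plausible reading, sketched below, is to count how often two queries appear in the same session and keep the most frequent partners of each query as its dictionary entry. The session data, the `build_related` helper, and the `top_n` cutoff are assumptions for illustration, not details of Search Assist.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_related(sessions, top_n=2):
    """Count co-occurring query pairs per session, then keep the
    top_n most frequent partners for every query."""
    pair_counts = Counter()
    for session in sessions:
        for a, b in combinations(sorted(set(session)), 2):
            pair_counts[(a, b)] += 1

    partners = defaultdict(list)
    for (a, b), n in pair_counts.items():
        partners[a].append((n, b))
        partners[b].append((n, a))

    # Sort by descending count, ties broken alphabetically.
    return {
        term: [t for _, t in sorted(cands, key=lambda x: (-x[0], x[1]))[:top_n]]
        for term, cands in partners.items()
    }

# Hypothetical query sessions.
SESSIONS = [
    ["hadoop", "mapreduce", "hdfs"],
    ["hadoop", "mapreduce"],
    ["hadoop", "pig"],
]

related = build_related(SESSIONS)
```

In a MapReduce setting the pair counting maps naturally onto a map phase that emits pairs per session and a reduce phase that sums them.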
20
Applications in Advertising

Expanding sets of seed keywords for matching with
text ads
Analyze text corpus, user query sessions,
clustering keywords etc.
Indexing ads for fast retrieval
Build and update index of more than a billion
text ads
Response prediction and Relevance modeling
Categorization of pages and queries to help in
matching
Adult pages, gambling pages etc.
Forecasting of ad inventory
User modeling
Model performance dashboards

User Modeling on Hadoop

22
User activities

Large dimensionality of possible user activities
But a typical user has a sparse activity vector
Attributes of the events change over time
Building a pipeline on Hadoop to model user interests from activities

Attribute | Possible values | Typical values per user
Pages | MM | 10 - 100
Queries | 100s of MM | Few
Ads | 100s of thousands | 10s
23
User Modeling Pipeline

5 main components to train, score and evaluate models
1. Data Generation
1a. Data Acquisition
1b. Feature and Target Generation
2. Model Training
3. Offline Scoring and Evaluation
4. Batch scoring and upload to online serving
5. Dashboard to monitor the online performance

24
Overview of User Modeling Pipeline
[Architecture diagram. Components on Hadoop: Data Generation, Modeling Engine, Scoring and Evaluation, and a Work Flow Manager. Operations shown: merging, projection, join, filtering, aggregations, transformations, model training, scoring, score graph based eval. Data on HDFS: user event history files, feature and target set, model files, scores and reports. Output to Online Serving Systems: models and scores]
25
1a. Data Acquisition

Input
Multiple user event feeds (browsing activities, search etc.) per time period

User | Time | Event | Source
U1 | T0 | visited autos.yahoo.com | Web server logs
U1 | T1 | searched for car insurance | Search logs
U1 | T2 | browsed stock quotes | Web server logs
U1 | T3 | saw an ad for discount brokerage, but did not click | Ad logs
U1 | T4 | checked Yahoo Mail | Web server logs
U1 | T5 | clicked on an ad for auto insurance | Ad logs, click server logs
26
1a. Data Acquisition

Tag and Transform
Categorization
Topic
...

Map Operations
Project relevant event attributes
Filter irrelevant events
[Diagram labels: User event (repeated); Event Feeds; HDFS; Normalized Events (NE)]
27
1a. Data Acquisition

Output
Single normalized feed containing all events for all users per time period

User | Time | Event | Tag
U1 | T0 | Content browsing | Autos, Mercedes Benz
U2 | T2 | Search query | Category: Auto Insurance
... | ... | ... | ...
U23 | T23 | Mail usage | Drop event
U36 | T36 | Ad click | Category: Auto Insurance
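
The map operations of the data acquisition step (project relevant attributes, filter irrelevant events, tag with a category) might look roughly like the function below. This is a sketch under stated assumptions: the raw event fields, the keyword lookup in `CATEGORY_BY_KEYWORD`, and the choice to drop mail events are invented for the example, and a real system would use trained categorizers rather than substring matching.

```python
# Assumed keyword-to-category lookup standing in for a real categorizer.
CATEGORY_BY_KEYWORD = {
    "autos": "Autos",
    "car insurance": "Auto Insurance",
    "auto insurance": "Auto Insurance",
}
# Assumed list of event types that are irrelevant for modeling.
DROPPED_EVENT_TYPES = {"mail"}

def normalize(raw):
    """Map one raw event to a normalized event, or None if it is filtered."""
    if raw["type"] in DROPPED_EVENT_TYPES:
        return None  # filter irrelevant events
    text = raw["text"].lower()
    tag = next(
        (cat for kw, cat in CATEGORY_BY_KEYWORD.items() if kw in text), None
    )
    # Project only the attributes needed downstream.
    return {"user": raw["user"], "time": raw["time"],
            "event": raw["type"], "tag": tag}

# Hypothetical raw events modeled on the input table for user U1.
RAW_EVENTS = [
    {"user": "U1", "time": 0, "type": "browse", "text": "visited autos.yahoo.com", "server": "web"},
    {"user": "U1", "time": 1, "type": "search", "text": "searched for car insurance", "server": "search"},
    {"user": "U1", "time": 4, "type": "mail", "text": "checked Yahoo Mail", "server": "web"},
    {"user": "U1", "time": 5, "type": "ad_click", "text": "clicked on an ad for auto insurance", "server": "ad"},
]

normalized = [n for n in map(normalize, RAW_EVENTS) if n is not None]
```

Because each event is handled independently, this step needs no reducer, which matches the slide's description of it as map operations.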
28
1b. Feature and Target Generation

Features
Summaries of user activities over a time window
Aggregates, Moving averages, Rates etc. over
moving time windows
Support online updates to existing features
Targets
Constructed in the offline model training phase
Typically user actions in the future time period
indicating interest
Clicks/Click-through rates on ads and content
Site and page visits
Conversion events
Purchases, Quote requests etc.
Sign-ups to newsletters, Registrations etc.

29
1b. Feature and Target Windows
[Timeline diagram. Labels: T0; Query; Visit Y! Finance; Interest event; Time; Moving Window; Feature Window; Target Window]
30
1b. Feature Generation
Sample normalized events for one user:
U1 | T0 | Content browsing | Autos, Mercedes Benz
U1 | T2 | Search query | Category: Auto Insurance
U1 | T3 | Click on search result | Category: Insurance premiums
U1 | T4 | Ad click | Category: Auto Insurance
[Diagram: Map 1 to Map 3 read the normalized events (NE 1 to NE 9) from HDFS and emit (user, event) pairs such as (U1, Event 1) and (U2, Event 2). The aggregated normalized events are grouped so that Reduce 1 receives all events for U1 and Reduce 2 all events for U2. The reducers compute summaries over the user event history (aggregates within window, time and event weighted averages, event rates, ...) and write the Feature Set]
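
The map and reduce roles in feature generation can be imitated in a few lines: the map step keys each normalized event by user, a grouping step plays the part of the shuffle, and the reduce step summarizes each user's events inside the feature window. The sketch below is an assumption-laden illustration: the event fields, integer day timestamps, the exponential half-life weighting, and the specific summaries are choices made here, not details from the slides.

```python
from collections import defaultdict

def map_event(ne):
    """Map: key each normalized event by user."""
    return ne["user"], (ne["time"], ne["event"])

def reduce_user(user, events, now, window, half_life):
    """Reduce: summarize one user's events inside [now - window, now)."""
    in_window = [(t, e) for t, e in events if now - window <= t < now]
    counts = defaultdict(int)
    decayed = defaultdict(float)
    for t, e in in_window:
        counts[e] += 1
        # Time-weighted count: an event loses half its weight per half_life.
        decayed[e] += 0.5 ** ((now - t) / half_life)
    return {
        "user": user,
        "counts": dict(counts),
        "decayed": dict(decayed),
        "rate_per_day": len(in_window) / window,
    }

def generate_features(events, now, window=7, half_life=7):
    groups = defaultdict(list)  # stands in for the shuffle phase
    for key, value in map(map_event, events):
        groups[key].append(value)
    return {u: reduce_user(u, evs, now, window, half_life)
            for u, evs in groups.items()}

# Hypothetical normalized events; time is measured in days.
EVENTS = [
    {"user": "U1", "time": 0, "event": "browse"},
    {"user": "U1", "time": 1, "event": "browse"},
    {"user": "U1", "time": 5, "event": "search"},
    {"user": "U2", "time": 6, "event": "ad_click"},
]

features = generate_features(EVENTS, now=8)
```

The event at time 0 falls outside the seven-day window ending at day 8, so it does not contribute to U1's features.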
31
1b. Joining Features and Targets

Low target rates
Typical response rates are in the range of 0.01% to 1%
Many users have no interest activities in the
target window
First construct the targets
Compute the feature vector only for users with
targets
Reduces the need for computing features for users
without target actions
Allows stratified sampling of users with
different target and feature attributes
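
Constructing targets first and computing features only for the selected users can be sketched as below. The slide does not say how users without target actions enter the training set, so this example assumes one common approach: keep every responder and a random sample of non-responders. The `negative_rate` parameter, the helper names, and the sample data are all hypothetical.

```python
import random

def build_targets(all_users, responders, negative_rate, seed=0):
    """Keep every responder (target 1) and a random sample of
    non-responders (target 0): sampling stratified by target value."""
    rng = random.Random(seed)
    targets = {}
    for user in all_users:
        if user in responders:
            targets[user] = 1
        elif rng.random() < negative_rate:
            targets[user] = 0
    return targets

def join_features(targets, events_by_user, featurize):
    """Compute features only for users present in the target set."""
    return [(featurize(events_by_user.get(user, [])), label)
            for user, label in sorted(targets.items())]

# Hypothetical population with a very low response rate.
ALL_USERS = [f"u{i}" for i in range(1000)]
RESPONDERS = {"u1", "u2"}
EVENTS_BY_USER = {"u1": ["click", "visit"], "u2": ["visit"]}

targets = build_targets(ALL_USERS, RESPONDERS, negative_rate=0.1)
rows = join_features(targets, EVENTS_BY_USER, featurize=len)
```

With a response rate this low, the expensive feature computation then runs for roughly a tenth of the population instead of all of it.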

32
2. Model Training

Supervised models trained using a variety of
techniques
Regressions
Different flavors: linear, logistic, Poisson etc.
Constraints on weights
Different regularizations: L1 and L2
Decision trees
Used for both regression and ranking problems
Boosted trees
Naïve Bayes
Support vector machines
Commonly used in text classification, query
categorization etc.
Online learning algorithms

33
2. Model Training

Maximum Entropy modeling
Log-linear link function.
Classification problems in large dimensional,
sparse features
Conditional Random Fields
Sequence labeling and named-entity recognition problems
Some of these algorithms are implemented in
Mahout
Not all algorithms are easy to implement in MR
framework
Train one model per node.
Each node can train model for one target response

34
3. Offline Scoring and Evaluation

Apply weights from model training phase to
features from Feature generation component
Mapper operations only
Janino equation editor
Embedded compiler can compile arbitrary scoring
equations.
Can also embed any class invoked during scoring
Can modify features on the fly before scoring
Evaluation metrics
Sort by scores and compute metrics in reducer
Precision vs. Recall curve
Lift charts

http://docs.codehaus.org/display/JANINO/Home
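
The split between mapper-only scoring and reducer-side evaluation can be illustrated as follows. This sketch assumes a logistic model over sparse features and does not use Janino; the weights, feature names, and labeled examples are made up, and `lift_at` is one common definition of lift (response rate in the top fraction of scores divided by the overall rate).

```python
import math

def score(weights, features):
    """Mapper-side scoring: logistic function of a sparse dot product."""
    z = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def precision_recall(scored):
    """Reducer-side evaluation: sort by score, then sweep the cutoff.
    Assumes at least one positive label."""
    ranked = sorted(scored, key=lambda pair: -pair[0])
    total_pos = sum(label for _, label in ranked)
    curve, hits = [], 0
    for k, (_, label) in enumerate(ranked, start=1):
        hits += label
        curve.append((hits / k, hits / total_pos))  # (precision, recall)
    return curve

def lift_at(scored, fraction):
    """Response rate in the top `fraction` of scores over the overall rate."""
    ranked = sorted(scored, key=lambda pair: -pair[0])
    top = ranked[: max(1, int(len(ranked) * fraction))]
    overall = sum(label for _, label in ranked) / len(ranked)
    return (sum(label for _, label in top) / len(top)) / overall

# Hypothetical model weights and labeled examples (features, target).
WEIGHTS = {"auto_clicks": 2.0, "mail_visits": -1.0}
EXAMPLES = [
    ({"auto_clicks": 2}, 1),
    ({"auto_clicks": 1}, 1),
    ({"mail_visits": 1}, 0),
    ({}, 0),
]

scored = [(score(WEIGHTS, f), y) for f, y in EXAMPLES]
curve = precision_recall(scored)
lift = lift_at(scored, fraction=0.5)
```

Scoring is independent per record, which is why it needs only mappers, while the global sort by score is what forces the evaluation into a reducer.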
35
Modeling Workflow
[Diagram labels: Training Phase; Targets; Features]
36
4. Batch Scoring
[Diagram label: Online Serving Systems]
37
User modeling pipeline system
Component | Data processed | Time
Data Acquisition | 1 TB per time period | 2 - 3 hours
Feature and Target Generation | 1 TB x size of feature window | 4 - 6 hours
Model Training | 50 - 100 GB | 1 - 2 hours for 100s of models
Scoring | 500 GB | 1 hour
38

Challenges and Practical Considerations

39
Current challenges

Limited size of name-node
File and block metadata in HDFS is held in RAM on the name-node
On a name-node with 64 GB RAM
100 million file blocks and 60 million files
Upper limit of 4000 nodes per cluster
Adding more reducers leads to a large number of small files
Copying data in/out of HDFS
Limited by read/write rates of external file systems
High latency for small jobs
Overhead to set up may be large for small jobs
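
A quick back-of-the-envelope calculation shows why small files matter for the name-node. The first part uses only the figures quoted above and treats all 64 GB as available for metadata, so it is an upper bound on RAM per object rather than a measured value. The second part is a hypothetical job (hourly runs with 10,000 reducers, one output file per reducer) added purely for illustration.

```python
# Figures quoted on the slide.
NAME_NODE_RAM_BYTES = 64 * 2**30   # 64 GB, read here as GiB
FILE_BLOCKS = 100_000_000
FILES = 60_000_000

# Upper bound on RAM per metadata object if all RAM held metadata.
bytes_per_object = NAME_NODE_RAM_BYTES / (FILE_BLOCKS + FILES)

# Hypothetical job: every run writes one output file per reducer.
REDUCERS = 10_000
RUNS_PER_DAY = 24
DAYS = 30
files_per_month = REDUCERS * RUNS_PER_DAY * DAYS
share_of_file_budget = files_per_month / FILES

print(f"{bytes_per_object:.0f} bytes per object at most")
print(f"{files_per_month:,} files per month, "
      f"{share_of_file_budget:.0%} of the 60 million file budget")
```

Under these assumptions a single such job would consume a noticeable share of the file budget each month unless its outputs are merged or expired.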

40
Practical considerations

Reduce amount of data transfer from mapper to reducer
There is still disk write/read in going from mapper to reducer
Mapper output (reducer input) files can become large
Can run out of disk space for intermediate storage
Project a subset of relevant attributes in mapper to send to reducer
Use combiners
Compress intermediate data
Distribution of keys
Reducer can become a bottleneck for common keys
Use Partitioner to control distribution of map records to reducers
E.g. distribute mapper records with common keys across multiple reducers in a round robin manner
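
The round robin idea for common keys can be simulated outside Hadoop. In the sketch below, ordinary keys are hash-partitioned while keys listed as hot are dealt out to reducers in turn; each reducer then holds only a partial sum for a hot key, so a final merge is needed. This is a plain Python simulation, not the Hadoop `Partitioner` API, and the set of hot keys is assumed to be known in advance.

```python
import zlib
from collections import Counter
from itertools import count

def make_partitioner(num_reducers, hot_keys):
    """Hash-partition ordinary keys; spread hot keys round robin."""
    ticket = count()

    def partition(key):
        if key in hot_keys:
            return next(ticket) % num_reducers
        # crc32 gives a hash that is stable across runs.
        return zlib.crc32(key.encode()) % num_reducers

    return partition

def run_job(records, num_reducers, hot_keys=frozenset()):
    partition = make_partitioner(num_reducers, hot_keys)
    partials = [Counter() for _ in range(num_reducers)]
    load = [0] * num_reducers
    for key, value in records:
        r = partition(key)
        partials[r][key] += value   # each reducer sums what it receives
        load[r] += 1
    # A second pass merges the partial sums of keys that were spread out.
    totals = sum(partials, Counter())
    return totals, load

# Hypothetical skewed input: one key dominates.
RECORDS = [("yahoo.com", 1)] * 8 + [("a.com", 1), ("b.com", 1)]

skewed_totals, skewed_load = run_job(RECORDS, num_reducers=4)
balanced_totals, balanced_load = run_job(
    RECORDS, num_reducers=4, hot_keys={"yahoo.com"}
)
```

The trade-off is the extra merge step: spreading a key only works for aggregations whose partial results can be combined afterwards, such as sums and counts.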

41
Practical considerations

Judicious partitioning of data
Multiple files help parallelism, but hit name-node limits
Smaller number of files keeps name-node happy but at the expense of parallelism
Less ideal for distributed computing algorithms requiring communications (e.g. distributed decision trees)
MPI on top of the cluster for communication

42
Acknowledgment

Numerous wonderful colleagues!
Questions?

Appendix
More Applications

44
Application: Content Optimization

Optimizing content across the Yahoo portal pages
Rank articles from an editorial pool of articles based on interest
Yahoo Front Page, Yahoo News etc.
Customizing feeds in My Yahoo portal page
Top buzzing queries
Content recommendations (RSS feeds)
Use Hadoop for feature aggregates and model weight updates
Computed in near real-time and uploaded to online serving

45
Yahoo Front Page Case Study
[Figure callouts: Search Index; Ads Optimization; Machine Learned Spam filters; RSS Feed Recos.]
46
Application: Search Logs Analysis

Analyze search result view and click logs
Reporting and measurement of user click response
User session analysis
Enrich, expand and re-write queries
Spelling corrections
Suggesting related queries
Traffic quality and protection
Detect and filter out fraudulent traffic and
clicks

47
Mail Spam Filtering: Connected Components

Y1 = Yahoo user 1, Y2 = Yahoo user 2
IP1 = IP address of the host Y1 voted not-spam from
48
Mail Spam Filtering: Connected Components Voting
[Bipartite graph diagram with Yahoo IDs y1, y2, y3 and IPs IP1, IP2, IP3, IP4. Labels: set of Yahoo IDs voting notspam; set of voted-from IPs; set of voted-on IPs; set of IPs/YIDs used exclusively for voting notspam; set of (likely new) spamming IPs which are worth voting for]
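
Squaring the bipartite graph of Yahoo IDs and voted-from IPs links two IDs whenever they share an IP; connected components of that squared graph are then groups of accounts that may be voting in concert. The sketch below finds those components with union-find on one machine. It illustrates the idea only: the slides do not describe the Hadoop implementation, and the sample votes (including `y4` and `IP9`) are hypothetical.

```python
from collections import defaultdict

def voter_components(votes):
    """votes: iterable of (yahoo_id, ip_voted_from) pairs.
    Two IDs are linked when they share an IP, i.e. they are adjacent in
    the squared bipartite graph. Components are found with union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_user_on_ip = {}
    for user, ip in votes:
        find(user)  # register the user even if the IP is new
        if ip in first_user_on_ip:
            parent[find(user)] = find(first_user_on_ip[ip])
        else:
            first_user_on_ip[ip] = user

    groups = defaultdict(set)
    for user in parent:
        groups[find(user)].add(user)
    return sorted(sorted(group) for group in groups.values())

# Hypothetical not-spam votes: (Yahoo ID, IP the vote came from).
VOTES = [
    ("y1", "IP1"),
    ("y2", "IP1"),
    ("y2", "IP2"),
    ("y3", "IP2"),
    ("y4", "IP9"),
]

components = voter_components(VOTES)
```

Here y1 and y3 never share an IP directly but end up in the same component through y2, which is the kind of indirect link that a simple per-IP count would miss.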
