Large Scale Applications on Hadoop in Yahoo

1
Large Scale Applications on Hadoop in Yahoo
Vijay K Narayanan, Yahoo! Labs
04.26.2010

Massive Data Analytics Over the Cloud (MDAC 2010)
2
Outline
  • Hadoop in Yahoo!
  • Common types of applications on Hadoop
  • Sample applications in
  • Content Analysis
  • Web Graph
  • Mail Spam Filtering
  • Search
  • Advertising
  • User Modeling on Hadoop
  • Challenges and Practical Considerations

3
  • Hadoop in Yahoo

4
By the Numbers
  • About 30,000 nodes in tens of clusters
  • 1 node = 4 × 1 TB disks, 8 cores, 16 GB RAM as a typical configuration
  • Largest single cluster of about 4000 nodes
  • 4 tiers of clusters
  • Application research and development
  • Production clusters
  • Hadoop platform development and testing
  • Proof of concepts and ad-hoc work
  • Over 1000 users across research, engineering,
    operations etc.
  • Running more than 100,000 jobs per day
  • More than 3 PB of data
  • Compressed and un-replicated volume
  • Currently running Hadoop 0.20

5
Advantages
  • Wide applicability of the M/R computing model
  • Many problems in internet domain can be solved by
    data parallelism
  • High throughput
  • Stream through 100 TB of data in less than 1 hour
  • Applications that took weeks earlier complete in
    hours
  • Research prototyping, development, and production
    deployment systems are (almost) identical
  • Scalable, economical, fault-tolerant
  • Shared resource with common infrastructure
    operations

6
Entities in internet eco-system
Leverage Hadoop extensively in all of these domains in Yahoo!
[Diagram: a user browses content (pages, blogs, etc.), sends queries to the search engine, and interacts with ads (text, display, etc.); content/display advertising and search advertising tie into these interactions]
7
  • Common Types of Applications

8
Common applications on Hadoop in Yahoo!
  • 1. Near real-time data pipelines
  • Backbone for analytics, reporting, research etc.
  • Multi-step pipelines to create data feeds from
    logs
  • Web servers: page content, layout and links, clicks, queries, etc.
  • Ad servers: ad-serving opportunity data, impressions
  • Click, beacon, and conversion data servers
  • Process large volume of events
  • Tens of billions events/day
  • Tens of TB (compressed) data/day
  • Latencies of tens of minutes to a few hours.
  • Continuous runs of jobs working on chunks of data

9
Example Data Pipelines
[Diagram: event pipelines on Hadoop feeding Analytics, User Sessions, User Profiles, and Ads and Content]
  • Tens of billions of events/day
  • Parse and transform event streams
  • Join clicks with views (see the sketch below)
  • Filter out robots
  • Aggregate, sort, partition
  • Data quality checks
  • Analytics: network analytics, experiment reporting
  • User sessions: optimize traffic engagement, user-session click-stream, path and funnel analysis
  • User profiles: user segment analysis, interests
  • Ads and content: measurements, modeling and scoring, experimentation
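A minimal Hadoop Streaming sketch of the join-clicks-with-views step with a simple robot filter. The field layout (record type, event id, user, user agent), the ROBOT_AGENTS list, and the two-script structure are illustrative assumptions, not the production pipeline.

    #!/usr/bin/env python
    # join_mapper.py - tag view and click records with the event id as the key.
    # Assumed input: "view<TAB>event_id<TAB>user<TAB>user_agent" or "click<TAB>event_id<TAB>user".
    import sys

    ROBOT_AGENTS = ("bot", "spider", "crawler")       # toy robot filter (assumption)

    for line in sys.stdin:
        parts = line.rstrip("\n").split("\t")
        if parts[0] == "view" and len(parts) >= 4:
            event_id, user, agent = parts[1], parts[2], parts[3]
            if any(tok in agent.lower() for tok in ROBOT_AGENTS):
                continue                              # drop robot traffic early, in the mapper
            print("%s\tV\t%s" % (event_id, user))
        elif parts[0] == "click" and len(parts) >= 3:
            print("%s\tC\t%s" % (parts[1], parts[2]))

    #!/usr/bin/env python
    # join_reducer.py - records for one event id arrive together; count views and clicks.
    import sys

    def flush(event_id, views, clicks):
        if event_id is not None:
            print("%s\tviews=%d\tclicks=%d" % (event_id, views, clicks))

    current, views, clicks = None, 0, 0
    for line in sys.stdin:
        event_id, tag, user = line.rstrip("\n").split("\t")
        if event_id != current:
            flush(current, views, clicks)
            current, views, clicks = event_id, 0, 0
        if tag == "V":
            views += 1
        else:
            clicks += 1
    flush(current, views, clicks)

Run with something like the standard streaming jar (hadoop jar hadoop-streaming.jar -mapper join_mapper.py -reducer join_reducer.py -file join_mapper.py -file join_reducer.py ...); the framework sorts by event id between the two scripts.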

10
Common applications on Hadoop in Yahoo!
  • 2. High throughput engine for ETL and reporting applications
  • Put large data sources (e.g. logs) on HDFS
  • Run canned aggregations, transformations,
    normalizations
  • Load reports to RDBMS/data marts
  • Hourly and Daily batch jobs
  • 3. Exploratory data research
  • Ad-hoc analysis and insights into data
  • Leveraging Pig and custom Map Reduce scripts
  • Pig is based on Pig Latin (upcoming support for SQL)
  • Procedural language, designed for data
    parallelism
  • Supports nested relational data structures

11
Common applications on Hadoop in Yahoo!
  • 4. Indexing for efficient retrieval
  • Build and update indices of content, ads etc.
  • Updated in batch mode and pushed for online
    serving
  • Efficient retrieval of content and ads during
    serving
  • 5. Offline modeling
  • Supervised and unsupervised learning algorithms
  • Outlier detection methods
  • Association rule mining techniques
  • Graph analysis methods
  • Time series analysis etc.

12
Common applications on Hadoop in Yahoo!
  • 6. Batch and near real-time scoring applications
  • Offline model scoring for upload to serving
    applications
  • Frequency: hourly or daily jobs
  • 7. Near real-time feedback from serving systems
  • Update features and model weights based on
    feedback from serving
  • Periodically push these updates to online scoring
    and serving
  • Typical updates in minutes or hours
  • 8. Monitoring and performance dashboards
  • Analyze scoring and serving logs for:
  • Monitoring end-to-end performance of the scoring and serving systems
  • Measurements of model performance

13
  • Sample Applications

14
Application: Content Analysis
  • Web data
  • Information about every web site, page, and link
    crawled by Yahoo
  • Growing corpus of more than 100 TB of data from tens of billions of documents
  • Document processing pipeline on Hadoop
  • Enrich with features from page, site etc.
  • Page segmentation
  • Term document vector and weighted variants
  • Entity analysis
  • Detection, disambiguation, resolution of entities
    in page
  • Concepts and topic modeling and clustering
  • Page quality analysis

15
Application: Web graph analysis
  • Directed graph of the web
  • Aggregated views by different dimensions
  • Sites, Domains, Hosts etc.
  • Large scale analysis of this graph
  • 2 trillion links
  • Jobs utilize 100,000 maps, 10,000 reduces
  • 300 TB compressed output

Attribute | Before Hadoop | With Hadoop
Time | 1 month | Days
Maximum number of URLs | Order of 100 billion | Many times larger
16
Application: Mail spam filtering
  • Scale of the problem
  • 25B Connections, 5B deliveries per day
  • 450M mailboxes
  • User feedback on spam is often late, noisy and
    not always actionable

Problem | Algorithm | Data size | Running time on Hadoop
Detecting spam campaigns | Frequent itemset mining | 20 MM spam votes | 1 hour
Gaming of spam IP votes by spammers | Connected components (squaring a bipartite graph) | 500K spammers, 500K spam IPs | 1 hour
17
Application: Mail Spam Filtering - Campaigns
  • 9 2595 (IPTYPE=none, FROMUSER=sales, SUBJ=It's Important You Know, FROMDOM=dappercom.info, URL=dappercom.info, ip_D=66.206.14.77)
  • 9 2457 (IPTYPE=none, FROMUSER=sales, SUBJ=Save On Costly Repairs, FROMDOM=aftermoon.info, URL=aftermoon.info, ip_D=66.206.14.78)
  • 9 2447 (IPTYPE=none, FROMUSER=sales, SUBJ=Car-Dealers-Compete-On-New-Vehicles, FROMDOM=sherge.info, URL=sherge.info, ip_D=66.206.25.227)
  • 9 2432 (IPTYPE=none, FROMUSER=sales, SUBJ=January 18th CreditReport Update, FROMDOM=zaninte.info, URL=zaninte.info, ip_D=66.206.25.227)
  • 9 2376 (IPTYPE=none, FROMUSER=health, SUBJ=Finally. Coverage for the whole family, FROMDOM=fiatchimera.com, URL=articulatedispirit.com, ip_D=216.218.201.149)
  • 9 2184 (IPTYPE=none, FROMUSER=health, SUBJ=Finally. Coverage for the whole family, FROMDOM=fiatchimera.com, URL=stratagemnepheligenous.com, ip_D=216.218.201.149)
  • 9 1990 (IPTYPE=none, FROMUSER=sales, SUBJ=Closeout 2008-2009-2010 New Cars, FROMDOM=sastlg.info, URL=sastlg.info, ip_D=66.206.25.227)
  • 9 1899 (IPTYPE=none, FROMUSER=sales, FROMDOM=brunhil.info, SUBJ=700-CreditScore-What-Is-Yours?, URL=brunhil.info, ip_D=66.206.25.227)
  • 9 1743 (IPTYPE=none, FROMUSER=sales, SUBJ=Now exercise can be fun, FROMDOM=accordpac.info, URL=accordpac.info, ip_D=66.206.14.78)
  • 9 1706 (IPTYPE=none, FROMUSER=sales, SUBJ=Closeout 2008-2009-2010 New Cars, FROMDOM=rionel.info, URL=rionel.info, ip_D=66.206.25.227)
  • 9 1693 (IPTYPE=none, FROMUSER=sales, SUBJ=January 18th CreditReport Update, FROMDOM=astroom.info, URL=astroom.info, ip_D=66.206.25.227)
  • 9 1689 (IPTYPE=none, FROMUSER=sales, SUBJ=eBay Work_at_Home w/Solid-Income-Strategies, FROMDOM=stamine.info, URL=stamine.info, ip_D=66.165.232.203)

18
Application: Search Ranking
  • Rank web-pages based on relevance to queries
  • Features based on content of page, site, queries,
    web graph etc.
  • Train machine learning models to rank relevant
    pages for queries
  • Periodically learn new models

Dimension | Before Hadoop | Using Hadoop
Features | 100s | 1000s
Running time | Days to weeks | Hours
19
Application: Search Assist™
  • Related concepts occur together; analyze 3 years of logs
  • Build dictionaries on Hadoop and push to online serving (a co-occurrence counting sketch follows the table below)

Dimension | Before Hadoop | Using Hadoop
Time | 4 weeks | < 30 minutes
Language | C | Python
Development time | 2-3 weeks | 2-3 days
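One way to build such dictionaries is to count how often queries co-occur within a user session. The sketch below is a hedged illustration in Hadoop Streaming style; it assumes one session per input line with queries separated by tabs, which is not necessarily the format used in the real pipeline.

    #!/usr/bin/env python
    # cooccur_mapper.py - emit every unordered pair of distinct queries in a session.
    import itertools
    import sys

    for line in sys.stdin:
        queries = sorted(set(q for q in line.rstrip("\n").lower().split("\t") if q))
        for a, b in itertools.combinations(queries, 2):
            print("%s|%s\t1" % (a, b))

    #!/usr/bin/env python
    # cooccur_reducer.py - sum the counts for each query pair (keys arrive sorted).
    import sys

    current, total = None, 0
    for line in sys.stdin:
        key, count = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = key, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

Pairs with high counts become candidate "related concept" entries; thresholding and ranking would happen in a later pass.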
20
Applications in Advertising
  • Expanding sets of seed keywords for matching with
    text ads
  • Analyze text corpus, user query sessions,
    clustering keywords etc.
  • Indexing ads for fast retrieval
  • Build and update index of more than a billion
    text ads
  • Response prediction and Relevance modeling
  • Categorization of pages and queries to help in
    matching
  • Adult pages, gambling pages etc.
  • Forecasting of ad inventory
  • User modeling
  • Model performance dashboards

21
  • User Modeling on Hadoop

22
User activities
  • Large dimensionality of possible user activities
  • But a typical user has a sparse activity vector
  • Attributes of the events change over time
  • Building a pipeline on Hadoop to model user
    interests from activities

Attribute | Possible values | Typical values per user
Pages | MM | 10 - 100
Queries | 100s of MM | Few
Ads | 100s of thousands | 10s
23
User Modeling Pipeline
  • 5 main components to train, score and evaluate
    models
  • Data Generation
  • Data Acquisition
  • Feature and Target Generation
  • Model Training
  • Offline Scoring and Evaluation
  • Batch scoring and upload to online serving
  • Dashboard to monitor the online performance

24
Overview of User Modeling Pipeline
[Diagram: on Hadoop/HDFS, a work-flow manager orchestrates data generation (merging, projection, joins, filtering of user event history files), feature and target generation (aggregations, transformations producing the feature and target set), model training (the modeling engine producing model files), and scoring and evaluation (scores, score-graph based evaluation, reports); models and scores are pushed to online serving systems]
25
1a. Data Acquisition
  • Input
  • Multiple user event feeds (browsing activities,
    search etc.) per time period

User | Time | Event | Source
U1 | T0 | Visited autos.yahoo.com | Web server logs
U1 | T1 | Searched for car insurance | Search logs
U1 | T2 | Browsed stock quotes | Web server logs
U1 | T3 | Saw an ad for discount brokerage, but did not click | Ad logs
U1 | T4 | Checked Yahoo Mail | Web server logs
U1 | T5 | Clicked on an ad for auto insurance | Ad logs, click server logs
26
1a. Data Acquisition
  • Tag and Transform
  • Categorization, topics, etc.
[Diagram: map operations read raw user events from the event feeds on HDFS, project the relevant event attributes, filter irrelevant events, and write normalized events (NE); a map-only sketch follows below]
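A map-only sketch of the tag-and-transform step in Hadoop Streaming style. The raw field layout, the DROP_EVENTS set, and the trivial tag() stub are assumptions for illustration; the real pipeline plugs categorization and topic taggers in here.

    #!/usr/bin/env python
    # normalize_mapper.py - map-only: project relevant attributes, drop irrelevant events.
    # Assumed raw record: user<TAB>timestamp<TAB>event_type<TAB>payload...
    import sys

    DROP_EVENTS = {"heartbeat", "mail_usage"}        # illustrative drop list

    def tag(event_type, payload):
        # stand-in for categorization / topic tagging of the event
        return payload.split("|")[0] if payload else "unknown"

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue                                 # skip malformed records
        user, ts, event_type = fields[0], fields[1], fields[2]
        payload = fields[3]                          # project only what downstream needs
        if event_type in DROP_EVENTS:
            continue                                 # filter irrelevant events
        print("%s\t%s\t%s\t%s" % (user, ts, event_type, tag(event_type, payload)))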
27
1a. Data Acquisition
  • Output
  • Single normalized feed containing all events for
    all users per time period

User | Time | Event | Tag
U1 | T0 | Content browsing | Autos, Mercedes Benz
U2 | T2 | Search query | Category: Auto Insurance
... | ... | ... | ...
U23 | T23 | Mail usage | Drop event
U36 | T36 | Ad click | Category: Auto Insurance
28
1b. Feature and Target Generation
  • Features
  • Summaries of user activities over a time window
  • Aggregates, Moving averages, Rates etc. over
    moving time windows
  • Support online updates to existing features
  • Targets
  • Constructed in the offline model training phase
  • Typically user actions in the future time period
    indicating interest
  • Clicks/Click-through rates on ads and content
  • Site and page visits
  • Conversion events
  • Purchases, Quote requests etc.
  • Sign-ups to newsletters, Registrations etc.

29
1b. Feature and Target Windows
[Diagram: user events (query, visit to Y! Finance, interest event) on a time axis starting at T0; a moving window is divided into a feature window followed by a target window]
30
1b. Feature Generation
[Diagram: mappers read normalized events (NE 1 - NE 9) from HDFS and emit (user, event) pairs, e.g. U1: content browsing (Autos, Mercedes Benz), search query (Category Auto Insurance), click on a search result (Category Insurance premiums), ad click (Category Auto Insurance); all events for a given user (U1, U2, ...) are grouped onto one reducer, which computes summaries over the user's event history (aggregates within the window, time- and event-weighted averages, event rates, ...) and writes the feature set to HDFS; a reducer sketch follows below]
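A hedged sketch of the reduce side in Hadoop Streaming style: with the user id as the key, all of a user's normalized events reach one reducer, which computes simple per-category aggregates. The input layout, WINDOW_END, and the half-life decay are illustrative assumptions rather than the production feature definitions.

    #!/usr/bin/env python
    # feature_reducer.py - all events for a user arrive together (key = user);
    # compute per-category counts and a time-decayed activity score.
    # Assumed input: user<TAB>epoch_ts<TAB>category
    import sys

    WINDOW_END = 1272240000.0        # end of the feature window, epoch seconds (assumed)
    HALF_LIFE_DAYS = 7.0             # decay half-life (assumed)

    def emit(user, counts, decayed):
        for cat, n in counts.items():
            print("%s\t%s\tcount=%d\tdecayed=%.3f" % (user, cat, n, decayed[cat]))

    current, counts, decayed = None, {}, {}
    for line in sys.stdin:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3:
            continue
        user, ts, cat = parts[:3]
        if user != current:
            if current is not None:
                emit(current, counts, decayed)
            current, counts, decayed = user, {}, {}
        age_days = max(0.0, (WINDOW_END - float(ts)) / 86400.0)
        counts[cat] = counts.get(cat, 0) + 1
        decayed[cat] = decayed.get(cat, 0.0) + 0.5 ** (age_days / HALF_LIFE_DAYS)
    if current is not None:
        emit(current, counts, decayed)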
31
1b. Joining Features and Targets
  • Low target rates
  • Typical response rates are in the range of 0.01 - 1%
  • Many users have no interest activities in the
    target window
  • First construct the targets
  • Compute the feature vector only for users with targets (see the filtering sketch below)
  • Reduces the need for computing features for users
    without target actions
  • Allows stratified sampling of users with
    different target and feature attributes
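A minimal sketch of the targets-first idea: build the target user set in a prior job, ship it to the mappers (e.g. with the streaming -file option), and drop events for users without targets before any feature computation. Loading the whole set into memory is an assumption that only holds when the target population is small; otherwise a reduce-side join is the usual alternative.

    #!/usr/bin/env python
    # filter_by_targets_mapper.py - keep only events for users that have a target,
    # so features are computed just for those users.
    # targets.txt (user id in the first column) is assumed to be shipped to each mapper.
    import sys

    with open("targets.txt") as f:
        TARGET_USERS = set(line.strip().split("\t")[0] for line in f)

    for line in sys.stdin:
        user = line.split("\t", 1)[0]
        if user in TARGET_USERS:
            sys.stdout.write(line)   # pass through; feature generation runs downstream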

32
2. Model Training
  • Supervised models trained using a variety of
    techniques
  • Regressions
  • Different flavors: linear, logistic, Poisson, etc.
  • Constraints on weights
  • Different regularizations: L1 and L2
  • Decision trees
  • Used for both regression and ranking problems
  • Boosted trees
  • Naïve Bayes
  • Support vector machines
  • Commonly used in text classification, query
    categorization etc.
  • Online learning algorithms

33
2. Model Training
  • Maximum Entropy modeling
  • Log-linear link function.
  • Classification problems in large dimensional,
    sparse features
  • Conditional Random Fields
  • Sequence labeling and named-entity recognition
    problems
  • Some of these algorithms are implemented in
    Mahout
  • Not all algorithms are easy to implement in the MR framework
  • Train one model per node
  • Each node can train a model for one target response

34
3. Offline Scoring and Evaluation
  • Apply weights from the model training phase to features from the feature generation component
  • Mapper operations only (a map-only scoring sketch follows below)
  • Janino equation editor
  • Embedded compiler can compile arbitrary scoring
    equations.
  • Can also embed any class invoked during scoring
  • Can modify features on the fly before scoring
  • Evaluation metrics
  • Sort by scores and compute metrics in reducer
  • Precision vs. Recall curve
  • Lift charts

http://docs.codehaus.org/display/JANINO/Home
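The slide describes compiling arbitrary scoring equations with Janino inside the mappers; the sketch below hard-codes a logistic model instead, just to show the map-only shape. The model file format (feature<TAB>weight), the "__bias__" entry, and the input layout (user<TAB>feature:value pairs) are assumptions.

    #!/usr/bin/env python
    # score_mapper.py - map-only scoring: w . x through a logistic link.
    import math
    import sys

    WEIGHTS = {}
    with open("model.txt") as f:          # model file shipped to each mapper (assumed)
        for line in f:
            feat, w = line.strip().split("\t")
            WEIGHTS[feat] = float(w)

    for line in sys.stdin:
        user, features = line.rstrip("\n").split("\t", 1)
        z = WEIGHTS.get("__bias__", 0.0)
        for item in features.split():
            feat, val = item.split(":")
            z += WEIGHTS.get(feat, 0.0) * float(val)
        z = max(min(z, 35.0), -35.0)      # clamp to avoid overflow in exp()
        score = 1.0 / (1.0 + math.exp(-z))
        print("%s\t%.6f" % (user, score))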
35
Modeling Workflow
[Diagram: training-phase workflow joining targets and features]
36
4. Batch Scoring
[Diagram: batch scores computed on Hadoop are uploaded to online serving systems]
37
User modeling pipeline system
Component | Data processed | Time
Data Acquisition | 1 TB per time period | 2 - 3 hours
Feature and Target Generation | 1 TB × size of feature window | 4 - 6 hours
Model Training | 50 - 100 GB | 1 - 2 hours for 100s of models
Scoring | 500 GB | 1 hour
38
  • Challenges and Practical Considerations

39
Current challenges
  • Limited size of name-node
  • File and block meta-data in HDFS is in RAM on
    name-node
  • On a name-node with 64 GB RAM
  • 100 million file blocks and 60 million files
  • Upper limit of about 4,000 nodes per cluster
  • Adding more reducers leads to a large number of
    small files
  • Copying data in/out of HDFS
  • Limited by read/write rates of external file
    systems
  • High latency for small jobs
  • Overhead to set up may be large for small jobs

40
Practical considerations
  • Reduce amount of data transfer from mapper to
    reducer
  • There is still disk write/read in going from
    mapper to reducer
  • Mapper-output / reducer-input files can become large
  • Can run out of disk space for intermediate
    storage
  • Project a subset of relevant attributes in mapper
    to send to reducer
  • Use combiners
  • Compress intermediate data
  • Distribution of keys
  • A reducer can become a bottleneck for common keys
  • Use a Partitioner to control the distribution of map records to reducers
  • E.g. distribute mapper records with common keys across multiple reducers in a round-robin manner (see the key-salting sketch below)
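A minimal sketch of spreading a hot key over several reducers by salting it; HOT_KEYS and NUM_SALTS are illustrative assumptions. A small follow-up job strips the "#salt" suffix and merges the partial aggregates; a combiner plus map-output compression further cut shuffle volume.

    #!/usr/bin/env python
    # salt_mapper.py - spread records for hot keys over several reducers by appending
    # a salt to the key; partial results are merged in a small follow-up job.
    import random
    import sys

    HOT_KEYS = {"yahoo.com", "mail.yahoo.com"}   # keys known to dominate the input (assumed)
    NUM_SALTS = 8

    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key in HOT_KEYS:
            key = "%s#%d" % (key, random.randint(0, NUM_SALTS - 1))
        print("%s\t%s" % (key, value))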

41
Practical considerations
  • Judicious partitioning of data
  • Multiple files help parallelism, but hit name-node limits
  • Smaller number of files keeps name-node happy but
    at the expense of parallelism
  • Less ideal for distributed computing algorithms
    requiring communications (e.g. distributed
    decision trees)
  • MPI on top of the cluster for communication

42
Acknowledgment
  • Numerous wonderful colleagues!
  • Questions?

43
  • Appendix
  • More Applications

44
Application: Content Optimization
  • Optimizing content across the Yahoo portal pages
  • Rank articles from an editorial pool of articles
    based on interest
  • Yahoo Front Page,
  • Yahoo News etc.
  • Customizing feeds in My Yahoo portal page
  • Top buzzing queries
  • Content recommendations (RSS feeds)
  • Use Hadoop for feature aggregates and model weight updates
  • Updated in near real-time and uploaded to online serving

45
Yahoo Front Page Case Study
[Screenshot: the Yahoo! front page annotated with Hadoop-backed components: search index, ads optimization, machine-learned spam filters, RSS feed recommendations]
46
Application: Search Logs Analysis
  • Analyze search result view and click logs
  • Reporting and measurement of user click response
  • User session analysis
  • Enrich, expand and re-write queries
  • Spelling corrections
  • Suggesting related queries
  • Traffic quality and protection
  • Detect and filter out fraudulent traffic and
    clicks

47
Mail Spam Filtering: Connected Components
  • Y1 = Yahoo user 1, Y2 = Yahoo user 2
  • IP1 = IP address of the host Y1 voted not-spam from

48
Mail Spam Filtering: Connected Components Voting
[Diagram: a bipartite voting graph; the set of Yahoo IDs voting not-spam (y1, y2, y3) and the set of voted-from IPs connect to the set of voted-on IPs (IP1 - IP4); connected components expose the set of IPs/YIDs used exclusively for voting not-spam and the set of (likely new) spamming IPs which are worth voting for; a label-propagation sketch follows below]
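A hedged sketch of finding these components by iterative min-label propagation in Hadoop Streaming style (the slide's "squaring a bipartite graph" formulation is a different but related approach). Each node (YID or IP) starts with its own id as its label; one mapper/reducer round pushes the smallest label across edges, and a driver repeats the round until no label changes. The adjacency-list input format is an assumption.

    #!/usr/bin/env python
    # cc_mapper.py - one round of min-label propagation for connected components.
    # Input per node: node<TAB>label<TAB>comma-separated neighbor list.
    import sys

    for line in sys.stdin:
        node, label, neighbors = line.rstrip("\n").split("\t")
        print("%s\tN\t%s\t%s" % (node, label, neighbors))   # node's own record
        for nb in filter(None, neighbors.split(",")):
            print("%s\tL\t%s\t" % (nb, label))              # propose label to neighbor

    #!/usr/bin/env python
    # cc_reducer.py - keep each node's neighbor list and adopt the smallest label seen.
    import sys

    def flush(node, label, neighbors):
        if node is not None:
            print("%s\t%s\t%s" % (node, label, neighbors))

    current, best, adj = None, None, ""
    for line in sys.stdin:
        node, kind, label, neighbors = line.rstrip("\n").split("\t")
        if node != current:
            flush(current, best, adj)
            current, best, adj = node, label, ""
        if best is None or label < best:
            best = label
        if kind == "N":
            adj = neighbors
    flush(current, best, adj)

After convergence, nodes sharing a label form one component, e.g. a cluster of YIDs and IPs used only for not-spam voting.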