The Evolution and Architecture of IMDb.com

Provided by: radlabCs

1
The Evolution and Architecture of IMDb.com
  • Charles Gordon
  • Principal Engineer, Amazon.com

2
Who Am I?
  • Joined Amazon.com in Jan. 2002
  • Worked on the Digital Technologies team
  • Lead engineer for Amazon's Search Inside the Book
    project, 2003-2005
  • Moved to IMDb in 2006

3
Overview
  • Overview of IMDb's Business
  • IMDb's Technical History
  • Business Requirements
  • IMDb's Current Architecture
  • Technical Requirements
  • RDBMS, Embedded DB or Both?

4
IMDb's Business

5
IMDb Services
  • Browse Data
  • Search Data
  • MyMovies
  • Localized TV Listings
  • Vote/Rating Tracking
  • Pro Site for Industry Professionals
  • Resume Services
  • XML Web Services

6
IMDb Data Characteristics
  • Origin
    • User submissions
    • External data feeds
    • Wiki
  • Size
    • Core data: dozens of GB
    • Feed data: hundreds of GB of TV listings, news
      articles, images, theatrical trailers, etc.
  • Other Characteristics
    • Highly interrelated (numerous many-to-many
      relationships)
    • Hierarchical (episodes)

7
IMDb Traffic
  • Around 100 million requests per day
  • Peak traffic reaches thousands of hits per second
  • Millions of unique users per day
  • Views follow power law with long tail
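
As a rough sanity check on these figures, the average request rate implied by 100 million requests per day can be worked out directly. The daily total is the slide's own approximation, so the result is only an order-of-magnitude estimate.

```python
# Back-of-the-envelope average rate implied by the slide's figure.
REQUESTS_PER_DAY = 100_000_000  # "around 100 million", per the slide
SECONDS_PER_DAY = 24 * 60 * 60

average_rps = REQUESTS_PER_DAY / SECONDS_PER_DAY
print(f"average: {average_rps:.0f} requests/second")
```

An average a little above 1,000 requests per second is consistent with peaks reaching thousands of hits per second.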

8
Overview
  • Overview of IMDb's Business
  • IMDb's Technical History
  • Business Requirements
  • IMDb's Current Architecture
  • Technical Requirements
  • RDBMS, Embedded DB or Both?
  • Future Directions

9
Tech History: rec.arts.movies
  • THE ACTORS LIST (Dr-Hai) PART 3 of 8

    Name                    Movie List
    ----                    ---------------------------------
    Drago, Billy            China White
                            Cutter's Way
                            Dark Before Dawn
                            Delta Force 2: Operation Stranglehold
                            Diplomatic Immunity
                            Freeway
                            Guncrazy (1992) (TV)
                            Hero and the Terror
                            In Self Defense (TV)
                            Invasion U.S.A. (1985)
                            No Other Love (TV)
                            Pale Rider
                            Prime Suspect (1989)
                            Secret Games
                            True Blood
                            Untouchables, The
                            Vamp
                            "Chisholms, The" (1979) (mini)

    Drake, Larry            Dark Night of the Scarecrow (TV)
                            Darkman
                            Dr. Giggles
                            For Keeps
                            Karate Kid, The
                            Murder in New Hampshire: The Pamela Smart Story (TV)
                            Too Good to Be True (TV)
                            "L.A. Law"

    Dreyfuss, Richard       Always (1989)
                            American Graffiti (C:GGN)

  • PHASE ONE: Creating the database
    --------------------------------

    (1) Unpack the shell archive at the end of this message into a
        directory. If you already have saved copies of the lists,
        unpack it in the same directory. Otherwise, create a new one.
        The archive contains four main scripts: 'gendb' to generate
        the databases and 'list', 'listall' and 'lindex' to search
        them. The other programs are used internally by the main
        scripts.

    (2) Place a copy of each list in a file as outlined below:

          List            File
          ------------------------
          Actors List     actors
          Actress List    actress
          Directors List  directs
          Dead List       dead
          Writers List    writers

        You should remove all headers and trailing lines from each
        file. All lines in the files should either be blank or of the
        form

          <name>                    <movie/tv>

        or

                                    <movie/tv>

    (3) To create the databases, enter the following command at the
        shell prompt:

          gendb

        This will create five database files (actors.dbs,
        actress.dbs, dead.dbs, writers.dbs and directs.dbs)
        corresponding to the lists. You can browse these files using
        your favourite editor.

    (4) Each time a new release of one of the lists is posted, strip
        off the headers, save it to the appropriate file and re-run
        'gendb'.
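
The list format described in step (2) is simple enough to parse in a few lines. The following is a minimal sketch in Python, not the original 'gendb' script (which the post does not show). The function name `parse_list` is hypothetical, and so is the assumption that a name and its first title are separated by a tab or by a run of two or more spaces.

```python
import re
from collections import defaultdict

def parse_list(lines):
    """Group titles under the most recent name line.

    Assumed format (from the post): '<name>  <movie/tv>' starts an entry,
    an indented '<movie/tv>' continues it, and blank lines are ignored.
    """
    credits = defaultdict(list)
    current = None
    for raw in lines:
        line = raw.rstrip()
        if not line.strip():
            continue  # blank line
        if line[0].isspace():
            # Continuation line: a title belonging to the current name.
            if current is not None:
                credits[current].append(line.strip())
            continue
        # Name line: split on a tab or a run of 2+ spaces (an assumption).
        parts = re.split(r"\t+|\s{2,}", line, maxsplit=1)
        current = parts[0]
        if len(parts) == 2:
            credits[current].append(parts[1].strip())
    return dict(credits)

sample = [
    "Drake, Larry            Darkman",
    "                        Dr. Giggles",
    "",
    "Dreyfuss, Richard       Always (1989)",
]
credits = parse_list(sample)
```

Splitting only on wide gaps keeps the single space inside "Drake, Larry" from being mistaken for the column separator.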

10
Tech History: First Web Site

11
Overview
  • Overview of IMDb's Business
  • IMDb's Technical History
  • Business Requirements
  • IMDb's Current Architecture
  • Technical Requirements
  • RDBMS, Embedded DB or Both?
  • Future Directions

12
Business Requirements
  • Near real time updates for all data
  • High availability for reads
  • High availability for writes from the website
    (backend writes can be queued)
  • Eventual consistency for most writes, need option
    for read-after-write for some writes
  • Low latency for most operations
  • Enable rapid development/deployment of Web 2.0
    features
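
One common way to offer read-after-write on top of an eventually consistent store is to route a session's reads to the authoritative copy for a short window after that session writes. The sketch below illustrates the idea with in-memory dictionaries. The class name, the `pin_seconds` window, and the sample key are all hypothetical, and nothing here is taken from IMDb's actual implementation.

```python
import time

class ReadAfterWriteRouter:
    """Send reads to a replica unless this session wrote recently."""

    def __init__(self, primary, replica, pin_seconds=5.0, clock=time.monotonic):
        self.primary = primary          # authoritative store (dict-like)
        self.replica = replica          # lagging copy (dict-like)
        self.pin_seconds = pin_seconds  # assumed upper bound on replica lag
        self.clock = clock
        self._last_write = {}           # session id -> time of last write

    def write(self, session, key, value):
        self.primary[key] = value
        self._last_write[session] = self.clock()

    def read(self, session, key, need_own_writes=False):
        wrote_at = self._last_write.get(session)
        pinned = wrote_at is not None and self.clock() - wrote_at < self.pin_seconds
        # Only callers that ask for read-after-write pay the cost of the primary.
        store = self.primary if (need_own_writes and pinned) else self.replica
        return store.get(key)

primary, replica = {}, {}
router = ReadAfterWriteRouter(primary, replica)
router.write("alice", "vote:example", 9)
stale = router.read("bob", "vote:example")  # the replica has not caught up
fresh = router.read("alice", "vote:example", need_own_writes=True)
```

Here `stale` is `None` and `fresh` is `9`: most reads accept eventual consistency, while the writer can opt in to seeing their own update.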

13
Overview
  • Overview of IMDb's Business
  • IMDb's Technological History
  • Business Requirements
  • IMDb's Current Architecture
  • Technical Requirements
  • RDBMS, Embedded or Both?
  • Future Directions

14
Architecture Overview
[Diagram; components shown: Data Input (submissions, feeds),
Data Master, Daily Build, Embedded DB Master, MySQL Cluster,
Web Server (x2)]

15
Architecture: Data Master
[Diagram; components shown: Email Submissions, Website
Submissions, Third-party Feeds, Editor Workflow, Backend
Processing, Data Master, MySQL]

16
Data Master Pain Points
  • No schema
  • No ad-hoc querying
  • Keys are split across multiple systems
  • Submissions come in via email (!)

17
Architecture: Daily Build
[Diagram; components shown: Data Master, Build Master,
Build Slave (x3), Web Server (x2)]

18
Daily Build Pain Points
  • Rebuilds everything, but only 0.5% changes
  • Pushes everything over the network, every day
  • Automatic generation of views can go haywire very
    easily
  • Master/Slave communication happens via NFS, which
    can be very flaky
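
The first two pain points suggest an incremental build: fingerprint each record, compare against the previous run, and rebuild and push only what differs. The sketch below shows that idea under stated assumptions. The function names, the JSON-based fingerprint, and the sample records are hypothetical and do not describe IMDb's build system.

```python
import hashlib
import json

def fingerprint(record):
    """Stable hash of a record's content."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def changed_keys(records, previous_manifest):
    """Return (keys needing a rebuild, new manifest of fingerprints)."""
    manifest = {key: fingerprint(rec) for key, rec in records.items()}
    changed = sorted(
        key for key, digest in manifest.items()
        if previous_manifest.get(key) != digest
    )
    return changed, manifest

# First run: everything is new, so everything is built.
yesterday = {"title/1": {"name": "Darkman"}, "title/2": {"name": "Vamp"}}
_, manifest = changed_keys(yesterday, {})

# Next run: only the record whose content changed is selected.
today = {"title/1": {"name": "Darkman"}, "title/2": {"name": "Vamp (1986)"}}
to_rebuild, manifest = changed_keys(today, manifest)
```

After the second run `to_rebuild` contains only `"title/2"`. Deleted records are not handled in this sketch; a real build would also need to diff the key sets.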

19
Architecture: MySQL
[Diagram; components shown: Master, Standby, Slave (x2),
Web Server (x2)]

20
MySQL Pain Points
  • Easily overloaded by rogue queries
  • Requires dedicated maintenance personnel
  • No automatic failover
  • Replication has bugs
  • Single-copy consistency only available from
    master (slaves are eventual)

21
Architecture: Embedded DBs
[Diagram; components shown: Master (x2), Published XML,
Web Server (x2)]

22
Embedded DBs Pain Points
  • Currently has no schema
  • Query language not as simple to use as SQL in
    most cases
  • Consistency is always eventual
  • Doesn't store normalized data, and wouldn't
    automatically denormalize if it did
  • Master servers are a bottleneck and a SPOF

23
Overview
  • Overview of IMDb's Business
  • IMDb's Technical History
  • Technical/Business Requirements
  • IMDb's Current Architecture
  • Technical Requirements
  • RDBMS, Embedded DB or Both?
  • Future Directions

24
Technical Requirements
  • Normalized data store for core data
  • Denormalized views for web templates
  • Must handle high read traffic, significantly less
    write traffic
  • Ability to query data from both normalized store
    and views
  • Simple scalability
  • Low maintenance
  • Remote serving capabilities
  • Redundancy for load balancing and failover
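
The first two requirements pair a normalized store, where each fact lives once, with denormalized views shaped for page templates. The sketch below shows one way to derive such a view from a many-to-many credits relation. The table layout, the function name `build_person_views`, and the sample rows are illustrative assumptions, not IMDb's schema.

```python
from collections import defaultdict

# Normalized core data: each fact is stored once (sample rows are illustrative).
titles = {1: "Darkman", 2: "Dr. Giggles"}
people = {10: "Drake, Larry"}
credits = [(10, 1, "actor"), (10, 2, "actor")]  # (person_id, title_id, role)

def build_person_views(titles, people, credits):
    """Denormalize the many-to-many credits into one view per person."""
    filmography = defaultdict(list)
    for person_id, title_id, role in credits:
        filmography[person_id].append({"title": titles[title_id], "role": role})
    return {
        pid: {"name": name, "filmography": filmography.get(pid, [])}
        for pid, name in people.items()
    }

views = build_person_views(titles, people, credits)
```

A template can render `views[10]` without any joins at request time. The cost is that a change to a single title must be propagated to every view that embeds it.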

25
Overview
  • Overview of IMDb's Business
  • IMDb's Technical History
  • Business Requirements
  • IMDb's Current Architecture
  • RDBMS, Embedded DB or Both?
  • Technical Requirements
  • Future Directions

26
LAMP Considered Harmful
[Diagram; components shown: Master (x2), Standby (x2),
Slave (x2), Caching, Web Server (x2)]

27
LAMP Considered Harmful
  • Data model needs to be completely denormalized to
    handle traffic
  • Requires partitioning (vertical/horizontal)
  • Availability is low
  • Requires dedicated personnel to manage complexity
  • RDBMSs are overly general
  • Shared resource, requires a lot of planning and
    oversight to use correctly
  • Scaling can be quite difficult

28
Distributed XML Database
[Diagram; components shown: Master (x2), Web Service,
Web Server (x2)]

29
Distributed XML Database
  • No way to get read-after-write
  • Space on web servers is at a premium
  • No plan for storing normalized data that can be
    queried
  • Requires extra work to use as a remote service
  • Still requires planning for denormalized storage
    (to maximize caching)

30
Hybrid Approach
[Diagram; components shown: Master, Standby, Denormalization,
Web Service, Web Server (x2)]

31
Hybrid Approach
  • Not easy to get read-after-write
  • Much more complicated and will require a lot of
    work to write the glue for the two layers
  • Needs an automated way to optimize the views
    (reduce redundancies, plan for what is in the
    cache)