Provided by: reddyr3
Title: HBASE


1
HBASE
  • ReddyRaja

2
RDBMS scaling
  • Cannot scale for large distributed data sets
  • Vendors offer replication and partitioning
    solutions to grow the database beyond the
    confines of a single node, but these are generally
    complicated to install and maintain
  • Such techniques compromise RDBMS features such as
    joins, complex queries, views, triggers and
    foreign key constraints
  • These queries become expensive

3
HBASE
  • Designed from the ground up with scaling in mind
  • Scales linearly by adding more nodes
  • HBase is not relational and does not support SQL
  • However, it can host very large, sparsely
    populated tables on clusters made from commodity
    hardware

4
HBASE use case
  • WebTable - A table of crawled web pages and
    their attributes, keyed by web page URL
  • The web table is large, with row counts that run
    into the millions or billions
  • Batch analytics and parsing programs are run
    against it for later indexing by a search engine
  • Concurrently, the table is randomly accessed by
    crawlers running at various rates, updating random
    rows
  • Cached web pages are served in real time, in
    random order, as users click on a website's
    cached-page feature

5
HBASE
  • Started toward the end of 2006 by Chad Walters
    and Jim Kellerman of Powerset
  • Modelled after BigTable from Google
  • In 2008 HBase became a Hadoop subproject at Apache
  • HBase has been in production since late 2007 at
    Powerset
  • Other production users: WorldLingo, Streamy.com,
    OpenPlaces, Yahoo and Adobe

6
Conceptual overview

7
Physical Storage View - HStore

8
Data Model
  • Data is stored as rows in labeled tables
  • A data row has a sortable key and an arbitrary
    number of columns
  • A table is stored sparsely, so that rows in the
    same table can have widely varying numbers of
    columns
  • A column name has the form family:label
  • Family and label are arbitrary byte arrays
  • Changing the set of families is done by performing
    administrative operations on the table
  • However, new labels can be used in a write without
    announcing them in advance
  • HBase stores column families physically close on
    disk
  • Items in a given column family have similar
    characteristics and contain similar data
  • Only a single row may be locked at a time
  • Row writes are always atomic
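
The sparse, family:label layout described on this slide can be pictured with a small in-memory model. This is only an illustrative sketch: the `table` dictionary, the row keys and the helper names are hypothetical and are not part of the HBase client API.

```python
# Hypothetical in-memory picture of a sparse table; not the HBase API.
# Each row stores only the columns it actually has.
table = {
    b"com.example/index.html": {
        b"contents:html": b"<html>home</html>",
        b"anchor:home": b"Home",
    },
    b"org.example/about": {
        b"contents:html": b"<html>about</html>",
    },
}

def split_column(name):
    """Split a 'family:label' column name at the first colon."""
    family, _, label = name.partition(b":")
    return family, label

def columns_in_family(row, family):
    """Return only the cells of one row that belong to the given family."""
    return {c: v for c, v in row.items() if split_column(c)[0] == family}

# Iterating over sorted keys mimics reading rows in row-key order.
for key in sorted(table):
    print(key, sorted(table[key]))
```

The two rows carry different numbers of columns, and a new label such as `anchor:footer` could be written to either row without any schema change, as long as the `anchor` family exists.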

9
Data Model
  • Data is stored in tables
  • Tables have rows and columns
  • A table cell is identified by a row and a column
  • Cell content is versioned
  • Cell content is an uninterpreted array of bytes
  • Table row keys are also byte arrays
  • Anything can serve as the row key (the primary
    key): strings, longs, binary representations of
    longs or serialized data structures
  • Table rows are sorted by row key by default
  • The sort is byte ordered
  • All table accesses are via the row key
  • Row columns are grouped into families
  • All members of a column family share a common
    prefix
  • The column family prefix must consist of printable
    characters
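
Because the sort is byte ordered, the way a key is encoded decides the order of the rows. The sketch below compares two encodings of the same numbers using only the Python standard library; the choice of a big-endian 8-byte encoding is an assumption made for illustration, not something the slides prescribe.

```python
import struct

numbers = (2, 9, 10)

# Keys encoded as decimal text: byte order compares character by character,
# so b"10" sorts before b"2".
as_text = sorted(str(n).encode() for n in numbers)

# Keys encoded as big-endian 8-byte integers: for non-negative values the
# byte order matches the numeric order.
as_binary = sorted(struct.pack(">q", n) for n in numbers)
decoded = [struct.unpack(">q", b)[0] for b in as_binary]

print(as_text)   # [b'10', b'2', b'9']
print(decoded)   # [2, 9, 10]
```

Negative numbers would not sort correctly with this encoding, because the sign bit makes their leading byte larger than that of any positive number.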

10
Data Model ..continued
  • In HBase, a table's column families have to be
    specified up front
  • New column families can be added later (an
    administrative operation)
  • New columns can only be added to an existing
    family
  • All members of a column family are stored
    together on the file system
  • HBase is more accurately a column-family-oriented
    store
  • Tuning and storage specifications are set at the
    column family level
  • It is advised that all members of a column family
    have the same general access pattern and size
    characteristics

11
Data Model Summary..
  • HBase tables are similar to RDBMS tables, with
    these differences
  • Rows are sorted by row key
  • Cells are versioned
  • Columns can be added on the fly by the client as
    long as the column family they belong to preexists
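
The two rules in this summary (cells are versioned, and a column can be added on the fly only if its family already exists) can be sketched as a toy class. Everything here is hypothetical: `SketchTable`, the `max_versions` setting and the explicit integer timestamps are assumptions chosen to keep the example small, and none of it is the HBase API.

```python
class SketchTable:
    """Toy model: versioned cells, column families fixed at creation."""

    def __init__(self, families, max_versions=3):
        self.families = set(families)
        self.max_versions = max_versions   # assumed limit for this sketch
        self.cells = {}   # (row, column) -> [(timestamp, value)], newest first

    def put(self, row, column, value, ts):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError("unknown column family: " + family)
        versions = self.cells.setdefault((row, column), [])
        versions.append((ts, value))
        versions.sort(key=lambda v: v[0], reverse=True)
        del versions[self.max_versions:]   # drop the oldest versions

    def get(self, row, column, versions=1):
        return self.cells.get((row, column), [])[:versions]


t = SketchTable(["info"], max_versions=2)
t.put("row1", "info:a", "v1", ts=1)
t.put("row1", "info:a", "v2", ts=2)
t.put("row1", "info:a", "v3", ts=3)
print(t.get("row1", "info:a", versions=5))   # [(3, 'v3'), (2, 'v2')]
```

A write to `other:x` raises `KeyError` in this sketch, because the `other` family was never declared.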

12
HBase Cluster Members
  • Diagram components: Master Server, ZooKeeper
    cluster, Region Servers, HDFS cluster; within a
    Region Server: Region, HStores, Map Files

13
Region
  • Tables are automatically partitioned horizontally
    into regions
  • Each region comprises a subset of a table's rows
  • A region is defined by its first row (inclusive)
    and last row (exclusive), plus a randomly
    generated region identifier
  • Initially a table has one region. When it grows
    past a size threshold, it splits into two regions
    of roughly equal size
  • As the table grows, the number of regions grows
  • Regions are the units that get distributed over
    an HBase cluster
  • A table that is too big for one server can be
    carried by a cluster of servers, each node hosting
    a subset of the table's regions
  • Load on the table also gets distributed
  • At any time, the sorted set of all regions makes
    up the table's content
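
Since regions cover contiguous, sorted row ranges, finding the region for a row key is a search over the region start keys. The sketch below assumes each region's start key is inclusive and its end key is exclusive; the start keys and region names are made up for illustration.

```python
import bisect

# Hypothetical start keys of three regions, in sorted order.
# The first region starts at the empty key, which sorts before every row.
region_starts = [b"", b"g", b"p"]
region_names = ["region-1", "region-2", "region-3"]

def find_region(row_key):
    """Return the region whose start key is the largest one <= row_key."""
    i = bisect.bisect_right(region_starts, row_key) - 1
    return region_names[i]

print(find_region(b"apple"))   # region-1
print(find_region(b"g"))       # region-2 (start key is inclusive)
print(find_region(b"zebra"))   # region-3
```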

14
HBase Implementation - Brief
  • HDFS: NameNode, DataNode
  • MapReduce: JobTracker, TaskTracker
  • HBase: Master Server, Region Server

15
HBase Implementation .. brief
  • Depends on ZooKeeper
  • HBase Master Server
  • Orchestrates a cluster of region servers
  • Assigns regions to registered region servers
  • Recovers from region server failures
  • Lightly loaded
  • Region Servers
  • Carry zero or more regions
  • Service client read/write requests
  • Manage region splits
  • HBase persists data through the Hadoop FileSystem
    API
  • Can persist to the local file system, HDFS,
    Amazon S3 or KFS

16
META and ROOT Tables
  • META
  • The META table stores information about every
    user region in HBase
  • Stores each region's start and end row keys and
    whether the region is off-line or on-line
  • Stores the address of the region server holding
    the region
  • The META table can grow as the number of regions
    grows
  • ROOT
  • Confined to a single region
  • Maps all regions in the META table
  • Contains the location of the region server that
    is serving each META region
  • Each row in ROOT and META is approximately 1 KB
    in size
  • Default region size is 256 MB
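
The last two figures on this slide bound how many regions the two-level catalog can address. The arithmetic below is a back-of-the-envelope sketch; it assumes every catalog row is exactly 1 KB and every catalog region is filled to exactly 256 MB, which are simplifications.

```python
ROW_SIZE = 1024                  # about 1 KB per catalog row (from the slide)
REGION_SIZE = 256 * 1024 ** 2    # 256 MB default region size (from the slide)

# How many 1 KB rows fit in one 256 MB region.
rows_per_region = REGION_SIZE // ROW_SIZE

# ROOT is a single region with one row per META region.
meta_regions = rows_per_region

# Each META region holds one row per user region.
user_regions = meta_regions * rows_per_region

print(rows_per_region)   # 262144
print(user_regions)      # 68719476736
```

Under these assumptions ROOT can map about 2.6 x 10^5 META regions, which together can map about 6.9 x 10^10 user regions.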

17
HBase in Operation
  • Special catalog tables: -ROOT- and .META.
  • They maintain the current list, state, recent
    history and location of all regions
  • ROOT holds the list of .META regions
  • .META regions hold the list of all user-space
    regions
  • Catalog tables are updated whenever a region
    transitions, so that the state of all regions is
    kept current
  • Region transitions include
  • Region splits
  • Regions being disabled/enabled
  • Regions redeployed to balance load
  • Regions redeployed due to a region server crash

18
Region Server
  • Responsible for serving client read and write
    requests
  • Tells the master that it is alive and gets a list
    of regions to serve
  • Instructions from the master are piggybacked on
    the heartbeat messages

19
Region Server Write Requests
  • Write data is first written to a write-ahead
    (commit) log
  • All write requests for every region the region
    server is serving are written to the same log
  • Data is then stored in an in-memory cache called
    the memcache
  • When the cache fills, its content is flushed to
    the file system
  • The commit log is hosted on HDFS, so it remains
    available through a region server crash
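
The write path on this slide (log first, then memcache, then flush when the cache fills) can be sketched as follows. `SketchRegionServer` is a hypothetical toy, not HBase code: it keeps one memcache per region and flushes after a fixed number of entries, whereas a real server flushes based on memory size.

```python
class SketchRegionServer:
    """Toy write path: shared commit log, per-region memcache, flush files."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []     # one log shared by all regions on this server
        self.memcache = {}       # region -> {row: value}
        self.flush_files = {}    # region -> list of flushed files, oldest first
        self.flush_threshold = flush_threshold   # assumed entry-count limit

    def put(self, region, row, value):
        # 1. Record the edit in the commit log before anything else.
        self.commit_log.append((region, row, value))
        # 2. Apply the edit to the region's in-memory cache.
        cache = self.memcache.setdefault(region, {})
        cache[row] = value
        # 3. When the cache is full, write it out sorted by row and empty it.
        if len(cache) >= self.flush_threshold:
            flushed = dict(sorted(cache.items()))
            self.flush_files.setdefault(region, []).append(flushed)
            cache.clear()


server = SketchRegionServer(flush_threshold=2)
server.put("region-1", "b", "2")
server.put("region-2", "x", "9")
server.put("region-1", "a", "1")
print(len(server.commit_log))            # 3
print(server.flush_files["region-1"])    # [{'a': '1', 'b': '2'}]
```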

20
Region Server Read Requests
  • The region server's memcache is consulted first
  • If sufficient versions are found there, they are
    returned
  • Otherwise, flush files are consulted from newest
    to oldest until sufficient versions are found
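
The read order described here can be sketched as a single function. The data layout is an assumption made for the example: the memcache and each flush file are plain lists of `(row, timestamp, value)` tuples, and `wanted` is the number of versions the caller asks for.

```python
def read_versions(memcache, flush_files, row, wanted=1):
    """Collect up to `wanted` versions of a row.

    memcache: list of (row, timestamp, value) tuples.
    flush_files: list of such lists, ordered oldest first.
    """
    # Consult the memcache first.
    found = [(ts, v) for r, ts, v in memcache if r == row]
    # Then walk the flush files from newest to oldest, stopping early
    # once enough versions have been collected.
    for flush_file in reversed(flush_files):
        if len(found) >= wanted:
            break
        found += [(ts, v) for r, ts, v in flush_file if r == row]
    found.sort(key=lambda item: item[0], reverse=True)
    return found[:wanted]


memcache = [("row1", 5, "e")]
flush_files = [[("row1", 1, "a")], [("row1", 3, "c")]]
print(read_versions(memcache, flush_files, "row1", wanted=1))  # [(5, 'e')]
print(read_versions(memcache, flush_files, "row1", wanted=2))  # [(5, 'e'), (3, 'c')]
```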

21
Region Server Crash Recovery
  • The master notices a region server crash
  • It splits the dead server's commit log by region
  • On reassignment, each region replays its share of
    the log before it opens
  • This brings all edits on that region up to date
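
Splitting a shared commit log by region and replaying the edits can be sketched in a few lines. The log format used here, a list of `(region, row, value)` tuples in write order, is a hypothetical simplification.

```python
from collections import defaultdict

def split_log_by_region(commit_log):
    """Group the edits of a shared log per region, keeping write order."""
    per_region = defaultdict(list)
    for region, row, value in commit_log:
        per_region[region].append((row, value))
    return dict(per_region)

def replay(edits):
    """Apply edits in order; a later edit to the same row wins."""
    state = {}
    for row, value in edits:
        state[row] = value
    return state


log = [("region-1", "a", "1"), ("region-2", "x", "9"), ("region-1", "a", "2")]
split = split_log_by_region(log)
print(split["region-1"])           # [('a', '1'), ('a', '2')]
print(replay(split["region-1"]))   # {'a': '2'}
```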

22
Region Server - Compaction
  • When the number of map files exceeds a threshold,
    a minor compaction is performed
  • A major compaction is performed periodically
  • Compactions run in parallel with the region
    server processing read and write requests
  • Reads and writes are suspended only briefly, just
    before the new map file is moved into place and
    until it has been added to the list of active map
    files
  • The map files that were merged are then removed
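
A compaction merges several sorted map files into one. The sketch below is a simplification under stated assumptions: each map file is a dictionary from row to value, the files are listed oldest first, every file is merged at once, and the threshold of 3 is an arbitrary example value.

```python
def compact(map_files):
    """Merge map files (oldest first) into one file sorted by row."""
    merged = {}
    for map_file in map_files:   # oldest to newest, so newer values win
        merged.update(map_file)
    return dict(sorted(merged.items()))

def maybe_minor_compact(map_files, threshold=3):
    """Compact only when the number of map files exceeds the threshold."""
    if len(map_files) > threshold:
        return [compact(map_files)]
    return map_files


files = [{"a": 1, "c": 1}, {"b": 2}, {"a": 3}, {"d": 4}]
print(maybe_minor_compact(files))   # [{'a': 3, 'b': 2, 'c': 1, 'd': 4}]
```

Reads get cheaper after the merge because a lookup has one file to consult instead of four.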

23
Region Server Region splits
  • When the aggregate size of the map files reaches
    256 MB, a region split is requested
  • A region split divides the row range in half
  • The parent region is taken off-line
  • The region server records the new child regions
    in the META region
  • The master is informed about the split
  • The master can then assign the child regions to
    region servers
  • If the split message is lost, the master
    discovers the new regions by periodically scanning
    the META region
  • The parent region is closed for read and write
    requests
  • A client can detect a region split and retry
    after the new regions are available
  • Parent regions are garbage collected
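
Dividing a row range in half can be sketched by picking the middle row key as the boundary between the two children. This is an illustration only: choosing the median of an in-memory key list is an assumption, and ranges are modelled as `(start, end)` pairs with an inclusive start and an exclusive end, where `None` means unbounded.

```python
def split_region(rows, start_key, end_key):
    """Split a region at its middle row key.

    rows: sorted list of the row keys currently in the region.
    Returns the (start, end) ranges of the two child regions.
    """
    mid_key = rows[len(rows) // 2]
    return (start_key, mid_key), (mid_key, end_key)

def in_range(row, key_range):
    """True if row falls in [start, end); end=None means no upper bound."""
    start, end = key_range
    return start <= row and (end is None or row < end)


rows = [b"a", b"c", b"m", b"t"]
lower, upper = split_region(rows, b"", None)
print(lower)   # (b'', b'm')
print(upper)   # (b'm', None)
```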

24
Clients
  • Clients connect to the ZooKeeper cluster to get
    the location of ROOT
  • From ROOT, the client finds the .META region
    whose scope covers the requested row
  • The client does a lookup against the found .META
    region to figure out the user-space region and
    the location of the region server that contains
    the desired row range
  • The client then interacts directly with that
    region server
  • The client caches what it learns (region
    locations and user-space region start and stop
    rows), so it does not have to go back to the ROOT
    and META regions
  • The client goes back to ROOT and .META only if a
    fault occurs
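
The caching behaviour on this slide can be sketched with a toy client. `SketchClient` and its `catalog_lookup` callback are hypothetical: the callback stands in for the whole ZooKeeper, ROOT and .META walk and returns a `(start, end, server)` entry, with an inclusive start, an exclusive end, and `None` meaning unbounded.

```python
class SketchClient:
    """Toy client that caches region locations between lookups."""

    def __init__(self, catalog_lookup):
        self.catalog_lookup = catalog_lookup   # row -> (start, end, server)
        self.cache = []                        # cached (start, end, server)
        self.catalog_trips = 0                 # how often the catalog was used

    @staticmethod
    def _covers(entry, row):
        start, end, _ = entry
        return start <= row and (end is None or row < end)

    def locate(self, row):
        for entry in self.cache:
            if self._covers(entry, row):
                return entry[2]            # cache hit: no catalog round trip
        self.catalog_trips += 1            # cache miss: consult the catalog
        entry = self.catalog_lookup(row)
        self.cache.append(entry)
        return entry[2]

    def on_fault(self, row):
        """Forget stale entries for this row, then look it up again."""
        self.cache = [e for e in self.cache if not self._covers(e, row)]
        return self.locate(row)
```

Repeated lookups for rows in an already-seen range never touch the catalog in this sketch; only a miss or an explicit `on_fault` call does.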

25
Updates
  • Row updates are atomic, no matter how many
    columns constitute the row-level transaction

26
HBASE vs RDBMS
  • HBase is a distributed, column-oriented data
    storage system
  • Provides random reads and writes on top of the
    Hadoop file system
  • The number of rows can run into the millions
  • The number of columns can also scale
  • Horizontally partitioned and replicated across
    thousands of commodity machines
  • Table schemas mirror the physical storage,
    creating a system for efficient data structure
    serialization, storage and retrieval

27
RDBMS
  • Fixed Schema (table and columns)
  • Emphasis on strong consistency, referential
    integrity, abstraction from the physical layer,
    and complex queries through the SQL language
  • Can perform outer and inner joins

28
Small databases
  • For small databases, the RDBMS is king
  • There is no substitute for its maturity,
    flexibility and powerful feature set
  • However, it does not scale well once the data
    grows to millions of records
  • The RDBMS soon becomes a limitation, and
    distribution becomes difficult
  • Techniques such as partitioning do exist, but
    they lose parts of the RDBMS feature set
  • Overall it becomes complex to manage

29
Example
  • RDBMS

30
Example continued
  • HBASE