HBASE - PowerPoint PPT Presentation


About This Presentation

Title:

HBASE



Slides: 31

Provided by: reddyr3


Transcript and Presenter's Notes

Title: HBASE

1
HBASE

ReddyRaja

2
RDBMS scaling

An RDBMS does not scale well to large distributed data sets
Vendors offer replication and partitioning
solutions to grow the database beyond the
confines of a single node, but these are generally
complicated to install and maintain
Such techniques compromise
RDBMS features such as
joins, complex queries, views, triggers and
foreign key constraints
Queries that rely on these features become expensive

3
HBASE

Designed from the ground up with scaling in mind
Scales linearly by adding more nodes
HBase is not relational and does not support SQL
However, it can host
very large, sparsely populated tables on clusters
made from commodity hardware

4
HBASE use case

WebTable: a table of crawled web pages and
their attributes, keyed by web page URL
The table is large, with row counts that run into
the millions or billions
Batch analytics and parsing programs are run over it for
later indexing by a search engine
Concurrently, the table is randomly accessed by
crawlers running at various rates and updating random
rows
Web pages are served randomly in real time as
users click on a website's cached-page feature

5
HBASE

Started toward the end of 2006 by
Chad Walters and Jim Kellerman of Powerset
Modelled after Google's Bigtable
In 2008, HBase became a Hadoop subproject at Apache
HBase has been in production since late 2007 at
Powerset
WorldLingo
Streamy.com
OpenPlaces
Yahoo and
Adobe

6
Conceptual overview

7
Physical Storage View - HStore

8
Data Model

Data rows are stored in labeled tables
A data row has a sortable key and an arbitrary number
of columns
A table is stored sparsely, so that rows in the
same table can have widely varying numbers of
columns
A column name has the form
family:label
Family and label are arbitrary byte arrays
Changing the set of families is done by performing
administrative operations on the table
However, new labels can be added without
pre-announcing them
HBase stores column families physically close on
disk
Items in a given column family have similar
characteristics and contain similar data
Only a single row may be locked at any point in
time
Row writes are always atomic

9
Data Model

Data is stored in tables
Tables have rows and columns
A table cell is identified by a row and a column
Cell content is versioned
Cell content is an uninterpreted array of bytes
Table row keys are also byte arrays
Anything can serve as the row key (primary key)
Strings, longs, binary representations of longs or
serialized data structures
Table rows are sorted by row key by default
The sort is byte-ordered
All table accesses are via the row key
Row columns are grouped into families
All members of a column family share a common prefix
The column family prefix must consist of printable characters
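
Because rows sort by the raw bytes of the key, the way a number is encoded changes where its row lands in the table. A minimal Python sketch of that effect, using plain byte strings rather than any HBase client API (the ids are made up for illustration):

```python
import struct

# Hypothetical numeric ids that we want to use as row keys.
ids = [2, 10, 1]

# Encoded as decimal strings, byte order is not numeric order: b"10" < b"2".
string_keys = sorted(str(i).encode("ascii") for i in ids)

# Encoded as fixed-width big-endian longs, byte order matches numeric order
# (for non-negative values).
binary_keys = sorted(struct.pack(">q", i) for i in ids)
decoded = [struct.unpack(">q", k)[0] for k in binary_keys]

print(string_keys)  # [b'1', b'10', b'2']
print(decoded)      # [1, 2, 10]
```

This is one reason to prefer a fixed-width binary encoding (or zero-padded strings) when range scans over numeric keys matter.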

10
Data Model ..continued

HBase
A table's column families have to be specified
up front
New column families can be added
New columns can only be added to an existing
family
All members of a column family are stored together on the
file system
It is more accurately a column-family-oriented store
Tuning and storage specifications are set at the column
family level
It is advised that all members of a column family have the same
general access pattern and size characteristics

11
Data Model Summary..

HBase tables are similar to RDBMS tables, with some
differences
Rows are sorted by row key
Cells are versioned
Columns can be added on the fly by the client as long
as the column family they belong to preexists
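
The data model summarized above can be pictured as a sorted map of maps. The following is a toy in-memory sketch, not the HBase API; the class name `TableSketch`, its methods and the sample keys are all hypothetical:

```python
class TableSketch:
    """Toy model: row key -> column ("family:label") -> timestamp -> value."""

    def __init__(self, families):
        self.families = set(families)   # column families are fixed up front
        self.rows = {}

    def put(self, row, column, timestamp, value):
        family = column.split(b":", 1)[0]
        if family not in self.families:
            # New labels are fine, new families need an administrative change.
            raise KeyError(f"unknown column family: {family!r}")
        self.rows.setdefault(row, {}).setdefault(column, {})[timestamp] = value

    def get(self, row, column):
        """Return the newest version of a cell."""
        versions = self.rows[row][column]
        return versions[max(versions)]

    def scan(self):
        """Row keys in byte order, as a scan would return them."""
        return sorted(self.rows)


table = TableSketch([b"contents", b"anchor"])
table.put(b"com.example/b", b"contents:html", 1, b"<v1>")
table.put(b"com.example/b", b"contents:html", 2, b"<v2>")
table.put(b"com.example/a", b"anchor:home", 1, b"Home")  # new label, existing family
print(table.get(b"com.example/b", b"contents:html"))     # b'<v2>'
print(table.scan())
```

Rows that never receive a given column simply have no entry for it, which is what makes the table sparse.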

12
HBase Cluster Members

[Diagram labels: Master Server; ZooKeeper Cluster; Region Servers (three shown); HDFS Cluster; Region; HStores; Map Files]

13
Region

Tables are automatically partitioned horizontally
into regions
Each region comprises a subset of a table's rows
It is defined by its first row (inclusive) and last row (exclusive)
plus a randomly generated region identifier
Initially a table has one region. As it grows and
crosses a size threshold, it splits into two
regions of roughly equal size
As the table grows, the number of regions grows
Regions are the units that are distributed over an HBase
cluster
A table that is too big for one server can be carried
by a cluster of servers
with each node hosting a subset of the regions
Load on the table also gets distributed
At any time, the sorted set of all its regions makes up the
table content
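
Since regions hold contiguous, sorted row ranges, finding the region for a row is a search over region start keys. A minimal sketch, assuming each region covers the range from its start key (inclusive) up to the next region's start key (exclusive); the keys and server names are made up:

```python
import bisect

# Hypothetical table split into three regions, identified by start key.
# The first region starts at the empty key, so every row has a home.
region_start_keys = [b"", b"g", b"p"]
region_servers = ["server-a", "server-b", "server-c"]  # made-up host names


def find_region(row_key: bytes) -> int:
    """Return the index of the region whose row range contains row_key."""
    # bisect_right finds the first start key greater than row_key;
    # the region just before it is the one that holds the row.
    return bisect.bisect_right(region_start_keys, row_key) - 1


def server_for(row_key: bytes) -> str:
    """Return the server hosting the region that contains row_key."""
    return region_servers[find_region(row_key)]


print(server_for(b"apple"))  # server-a
print(server_for(b"hbase"))  # server-b
print(server_for(b"zebra"))  # server-c
```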

14
Hbase Implementation - Brief

HDFS
NameNode, DataNode
MapReduce
JobTracker, TaskTracker
HBASE
Master Server, Region Server

15
HBase Implementation .. brief

Depends on ZooKeeper
HBase Master Server
Orchestrates a cluster of region servers
Assigns regions to registered region servers
Recovers from region server failures
Lightly loaded
Region Servers
Carry zero or more regions
Serve client read/write requests
Manage region splits
Use the Hadoop FileSystem API
Can persist to the local file system, HDFS, Amazon S3
or KFS

16
META and ROOT Tables

META
The META table stores information about every user
region in HBase
Stores the start and end row keys and whether the region is
off-line or on-line
Stores the address of the region server holding the region
The META table can grow as the number of regions grows
ROOT
Confined to a single region
Maps all regions in the META table
Contains the location of the region server that is
serving each META region
Each row in ROOT and META is about 1 KB in size
Default region size is 256 MB
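
The two figures above (1 KB per catalog row, 256 MB per region) bound how much data the two-level catalog can address. A short worked calculation in Python, assuming every catalog row maps exactly one region and binary units (1 KB = 1024 bytes):

```python
KB = 1024
MB = 1024 * KB

row_size = 1 * KB        # size of one catalog row, from the slide
region_size = 256 * MB   # default region size, from the slide

# Number of catalog rows (one per mapped region) that fit in one region.
rows_per_region = region_size // row_size

meta_regions = rows_per_region                 # mapped by the single ROOT region
user_regions = meta_regions * rows_per_region  # mapped by all META regions together
addressable_bytes = user_regions * region_size

print(rows_per_region)               # 262144
print(user_regions)                  # 68719476736
print(addressable_bytes == 2 ** 64)  # True
```

Under these assumptions the single ROOT region can map about 2.6 x 10^5 META regions, which together map about 6.9 x 10^10 user regions, or roughly 2^64 bytes of user data.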

17
Hbase in Operation

Special catalog tables
-ROOT- and .META.
Maintain the current
list, state, recent history and
location of all regions
ROOT holds the list of .META regions
.META regions hold the list of all user-space
regions
Catalog tables are updated on every region transition so that the state of all
regions is kept current
A region transitions when it is
split
disabled/enabled
redeployed to balance load
redeployed
due to a region server crash

18
Region Server

Responsible for client read and write requests
Tells the master that it is alive and gets a list of
regions to serve
Instructions are piggybacked on the heartbeat
messages

19
Region Server Write Requests

Write Requests
Write data is first written to a write-ahead log (the commit log)
All write requests for every region the region
server is serving are written to the same log
The data is then stored in an in-memory cache called the
memcache
When the cache fills, its content is flushed to the file
system
The commit log is hosted on HDFS
so it remains available through a region server crash
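
The write path above (log first, then memory, then flush) can be sketched as a toy class. This is an illustration only: `RegionServerSketch` is a hypothetical name, the log is a Python list standing in for a file on HDFS, and the flush threshold counts entries rather than bytes:

```python
class RegionServerSketch:
    """Toy model of the write path: log first, then memory, then flush."""

    def __init__(self, flush_threshold: int = 3):
        self.wal = []          # stands in for the commit log hosted on HDFS
        self.memcache = {}     # in-memory cache of recent writes
        self.flush_files = []  # each flush produces one immutable sorted file
        self.flush_threshold = flush_threshold  # assumption: entries, not bytes

    def put(self, region: str, row: bytes, value: bytes) -> None:
        # 1. Append to the shared write-ahead log (one log for all regions).
        self.wal.append((region, row, value))
        # 2. Apply the write to the in-memory cache.
        self.memcache[(region, row)] = value
        # 3. Flush when the cache is full.
        if len(self.memcache) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        # Write the cache out as a sorted, immutable file and start afresh.
        self.flush_files.append(sorted(self.memcache.items()))
        self.memcache = {}


rs = RegionServerSketch(flush_threshold=2)
rs.put("region-1", b"a", b"1")
rs.put("region-2", b"b", b"2")  # cache reaches the threshold and is flushed
rs.put("region-1", b"c", b"3")  # stays in the memcache for now
print(len(rs.wal), len(rs.flush_files), len(rs.memcache))  # 3 1 1
```

Because every write reaches the log before the cache, anything still in memory at the time of a crash can be recovered from the log.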

20
Region Server Read Requests

Read Requests
The region server's memcache is consulted first
If sufficient versions are found there, they are returned
Otherwise, flush files are consulted from newest to
oldest until sufficient versions are found
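
A minimal sketch of that read order, assuming cells are simple `(row, timestamp, value)` tuples and that each newer source only holds newer timestamps; the function name and data layout are hypothetical:

```python
def read_versions(row, memcache, flush_files, max_versions=1):
    """Collect up to max_versions values for row, newest first.

    memcache:    list of (row, timestamp, value) cells held in memory
    flush_files: list of such lists, ordered oldest to newest
    """
    found = []
    # Memcache first, then flush files from newest to oldest.
    for source in [memcache] + flush_files[::-1]:
        hits = [cell for cell in source if cell[0] == row]
        hits.sort(key=lambda cell: cell[1], reverse=True)
        for cell in hits:
            found.append(cell[2])
            if len(found) == max_versions:
                return found  # enough versions, stop consulting older files
    return found


memcache = [(b"r1", 30, b"v3")]
flush_files = [
    [(b"r1", 10, b"v1")],                      # oldest flush file
    [(b"r1", 20, b"v2"), (b"r2", 21, b"x")],   # newest flush file
]
print(read_versions(b"r1", memcache, flush_files, max_versions=1))  # [b'v3']
print(read_versions(b"r1", memcache, flush_files, max_versions=3))
```

With `max_versions=1` the read is answered from memory alone; asking for more versions forces the older files to be opened.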

21
Regions Server Crash recovery

Master notices a region server crash
Splits the dead server's commit log by region
Each reassigned region replays its share of the log
before it opens for requests
so that all edits on that region are up to date
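
Splitting a shared commit log by region and replaying each part can be sketched as follows. The log format (a list of `(region, row, value)` tuples) and both function names are assumptions made for illustration:

```python
from collections import defaultdict


def split_log_by_region(commit_log):
    """Group the edits of a shared commit log by region, preserving order."""
    per_region = defaultdict(list)
    for region, row, value in commit_log:
        per_region[region].append((row, value))
    return dict(per_region)


def replay(edits):
    """Rebuild a region's in-memory state by replaying its edits in order."""
    state = {}
    for row, value in edits:
        state[row] = value  # later edits overwrite earlier ones
    return state


log = [("r1", b"a", b"1"), ("r2", b"x", b"9"), ("r1", b"a", b"2")]
split = split_log_by_region(log)
print(split["r1"])          # [(b'a', b'1'), (b'a', b'2')]
print(replay(split["r1"]))  # {b'a': b'2'}
```

Keeping the original order within each region's share is what lets the replay end in the same state the region had just before the crash.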

22
Region Server - Compaction

When the number of map files exceeds a threshold,
a minor compaction is performed
A major compaction is performed periodically
Compactions run in parallel with the region
server processing read and write requests
Just before the new map file is moved into place, reads and writes are
briefly suspended until it has been added to the list of active map files
and the map files that were merged have been removed
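
As a rough illustration of a compaction, the sketch below merges several map files into one. It is deliberately simplified: each map file is modeled as a dict holding a single value per key, the newest file wins on conflicts, and the threshold of 3 is an arbitrary assumption rather than an HBase default:

```python
def compact(map_files):
    """Merge several map files into one sorted list of (row_key, value).

    map_files: list of dicts {row_key: value}, ordered oldest to newest.
    The newest value for each key wins.
    """
    merged = {}
    for map_file in map_files:  # oldest first, so newer files overwrite
        merged.update(map_file)
    return sorted(merged.items())


def maybe_compact(map_files, threshold=3):
    """Compact once the number of files exceeds threshold (an assumed value)."""
    if len(map_files) > threshold:
        return [dict(compact(map_files))]
    return map_files


files = [{b"a": 1, b"b": 2}, {b"b": 3}, {b"c": 4}, {b"a": 5}]
print(compact(files))             # [(b'a', 5), (b'b', 3), (b'c', 4)]
print(len(maybe_compact(files)))  # 1
```

Fewer files means fewer places a read has to look, which is the point of compacting.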

23
Region Server Region splits

When the aggregate size of the map files reaches 256 MB,
a region split is requested
A region split divides the row range in half
The parent region is taken off-line
The region server records the new child regions in the META
region
The master is informed about the split
The master can then assign the child regions to region
servers
If the split message is lost, the master discovers the new
regions by periodically scanning the META region
The parent region is closed for read and write
requests
A client can detect a region split and retry
after the new regions are available
Parent regions are garbage collected
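
Dividing a row range in half can be sketched as picking a middle key and handing each child one side of it. This sketch assumes the split point is the middle row of the sorted data, with an inclusive start key and exclusive end key; the function name is hypothetical:

```python
def split_region(start_key, end_key, rows):
    """Split a region's sorted rows at the middle key.

    rows is a sorted list of (row_key, value) pairs with at least two rows.
    Returns two child regions as (start_key, end_key, rows) tuples;
    the start key is inclusive and the end key exclusive.
    """
    mid = len(rows) // 2
    mid_key = rows[mid][0]
    lower = (start_key, mid_key, rows[:mid])
    upper = (mid_key, end_key, rows[mid:])
    return lower, upper


rows = [(b"a", 1), (b"c", 2), (b"m", 3), (b"x", 4)]
lower, upper = split_region(b"a", b"z", rows)
print(lower)  # (b'a', b'm', [(b'a', 1), (b'c', 2)])
print(upper)  # (b'm', b'z', [(b'm', 3), (b'x', 4)])
```

The two children together cover exactly the parent's range, so the parent can be retired once both are recorded and assigned.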

24
Clients

Clients connect to the ZooKeeper cluster
to get the location of ROOT, then consult ROOT for the .META region
whose scope covers that of the requested row
The client does a lookup against the found .META
region to figure out
the user-space region and
the location of the region server that hosts the
desired row range
The client then interacts directly with the region server
The client caches what it learns
so it does not have to go back to the ROOT and META regions
It caches locations and
user-space region start and stop rows
It goes back to ROOT and .META only if a fault occurs
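
The lookup-then-cache behaviour can be sketched with a toy client. Everything here is hypothetical: `ClientSketch` is not an HBase class, the `root` argument stands in for what a real client learns via ZooKeeper, and an empty end key is taken to mean "no upper bound":

```python
class ClientSketch:
    """Toy client that resolves row -> region server and caches the answer."""

    def __init__(self, root, meta_regions):
        # root: list of (start, end, meta_region_name); names are made up.
        # meta_regions: {meta_region_name: [(start, end, region_server), ...]}
        self.root = root
        self.meta_regions = meta_regions
        self.cache = []    # cached (start, end, region_server) entries
        self.lookups = 0   # counts full ROOT/.META traversals

    @staticmethod
    def _covers(start, end, row):
        # Start is inclusive, end is exclusive; b"" as end means no upper bound.
        return start <= row and (end == b"" or row < end)

    def locate(self, row):
        for start, end, server in self.cache:
            if self._covers(start, end, row):
                return server  # cache hit, no catalog access
        self.lookups += 1
        meta = next(m for s, e, m in self.root if self._covers(s, e, row))
        entry = next(r for r in self.meta_regions[meta]
                     if self._covers(r[0], r[1], row))
        self.cache.append(entry)
        return entry[2]

    def invalidate(self, row):
        """Drop cached entries for row after a fault, forcing a fresh lookup."""
        self.cache = [c for c in self.cache
                      if not self._covers(c[0], c[1], row)]


client = ClientSketch(
    root=[(b"", b"", "meta-1")],
    meta_regions={"meta-1": [(b"", b"m", "rs-a"), (b"m", b"", "rs-b")]},
)
print(client.locate(b"apple"), client.locate(b"banana"), client.lookups)  # rs-a rs-a 1
```

The second call is served from the cache because both rows fall in the same region, so only one catalog traversal is counted.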

25
Updates

Row updates are atomic,
no matter how many columns constitute the
row-level transaction

26
HBASE vs RDBMS

HBase is a distributed, column-oriented data
storage system
Provides random reads and writes on top of the Hadoop
file system (HDFS)
The number of rows can run into the millions
The number of columns can also scale
Horizontally partitioned and replicated across
thousands of commodity machines
Table schemas mirror the physical storage,
creating a system for efficient data structure
serialization, storage and retrieval

27
RDBMS

Fixed schema (tables and columns)
Emphasis on strong consistency, referential
integrity and abstraction from the physical layer
Complex queries through the SQL language
Can perform inner and outer joins

28
Small data bases

For small databases, the RDBMS is king
There is no substitute for its maturity, flexibility and
powerful feature set
But when scaling up to millions of records, it cannot
keep up
The RDBMS soon becomes a limitation and
distribution becomes difficult
Techniques such as partitioning do exist,
but they lose much of the RDBMS feature set
Overall it becomes complex to manage

29
Example