1
HBase at Xiaomi
Liang Xie / Honghua Feng
{xieliang, fenghonghua}@xiaomi.com
2
About Us
Honghua Feng
Liang Xie
3
Outline
  • Introduction
  • Latency practice
  • Some patches we contributed
  • Some ongoing patches
  • Q&A

4
About Xiaomi
  • Mobile internet company founded in 2010
  • Sold 18.7 million phones in 2013
  • Over $5 billion in revenue in 2013
  • Sold 11 million phones in Q1 2014

5
Hardware
6
Software
7
Internet Services
8
About Our HBase Team
  • Founded in October 2012
  • 5 members
  • Liang Xie
  • Shaohui Liu
  • Jianwei Cui
  • Liangliang He
  • Honghua Feng
  • Resolved 130 JIRAs so far

9
Our Clusters and Scenarios
  • 15 clusters: 9 online / 2 processing / 4 test
  • Scenarios
  • MiCloud
  • MiPush
  • MiTalk
  • Perf Counter

10
Our Latency Pain Points
  • Java GC
  • Stable page write in OS layer
  • Slow buffered IO (FS journal IO)
  • Read/Write IO contention

11
HBase GC Practice
  • Bucket cache with off-heap mode
  • Xmn/SurvivorRatio/MaxTenuringThreshold
  • PretenureSizeThreshold vs. replication source size
  • GC concurrent thread number

GC time per day: 2500-3000s -> 300-600s !!!
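A minimal hbase-env.sh sketch of the kind of settings these bullets refer to; the concrete values are illustrative assumptions, not Xiaomi's exact numbers, and the off-heap bucket cache itself is assumed to be enabled separately via hbase.bucketcache.ioengine=offheap and hbase.bucketcache.size in hbase-site.xml:

    # hbase-env.sh (sketch): CMS / young-gen tuning for RegionServers.
    # -Xmn / SurvivorRatio / MaxTenuringThreshold: let short-lived objects die in young gen
    # -XX:PretenureSizeThreshold: pretenure large allocations (e.g. replication source buffers)
    # -XX:ConcGCThreads: concurrent GC thread number
    # -XX:MaxDirectMemorySize: headroom for the off-heap bucket cache
    export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
      -Xmx8g -Xmn1g -XX:SurvivorRatio=4 -XX:MaxTenuringThreshold=6 \
      -XX:PretenureSizeThreshold=2m \
      -XX:+UseConcMarkSweepGC -XX:ConcGCThreads=4 \
      -XX:MaxDirectMemorySize=16g"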
12
Write Latency Spikes
  • HBase client put
  • -> HRegion.batchMutate
  • -> HLog.sync
  • -> SequenceFileLogWriter.sync
  • -> DFSOutputStream.flushOrSync
  • -> DFSOutputStream.waitForAckedSeqno  <- stuck here often!

  • DataNode pipeline write, in BlockReceiver.receivePacket()
  • -> receiveNextPacket
  • -> mirrorPacketTo(mirrorOut)  // write packet to the mirror
  • -> out.write/flush  // write data to local disk  <- buffered IO
  • Instrumentation added in HDFS-6110 showed the stalled local write was the culprit; strace results confirmed it (see the client-side timing sketch below)
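A minimal client-side sketch of how such spikes show up: time each put and log the slow ones. The table/family names and the 100ms threshold are placeholders; the calls are the standard HBase client API (newer than the deck-era client).

    // Sketch: time HBase puts from the client and log the outliers.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PutLatencyProbe {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("latency_test"))) {
          byte[] cf = Bytes.toBytes("d");
          for (int i = 0; i < 100000; i++) {
            Put put = new Put(Bytes.toBytes("row-" + i));
            put.addColumn(cf, Bytes.toBytes("q"), new byte[3200]);
            long start = System.nanoTime();
            table.put(put);                      // -> HLog.sync -> waitForAckedSeqno
            long ms = (System.nanoTime() - start) / 1_000_000;
            if (ms > 100) {                      // flag spikes above 100ms
              System.out.println("slow put #" + i + ": " + ms + "ms");
            }
          }
        }
      }
    }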

13
Root Cause of Write Latency Spikes
  • write() is expected to be fast
  • But it is sometimes blocked by page write-back!

14
Stable page write issue workaround
Workaround: kernel 2.6.32.279 (6.3) -> 2.6.32.220 (6.2), or 2.6.32.279 (6.3) -> 2.6.32.358 (6.4).
Try to avoid deploying RHEL 6.3 / CentOS 6.3 in an extremely latency-sensitive HBase cluster!
15
Root Cause of Write Latency Spikes
  • ...
  • 0xffffffffa00dc09d do_get_write_access+0x29d/0x520 [jbd2]
  • 0xffffffffa00dc471 jbd2_journal_get_write_access+0x31/0x50 [jbd2]
  • 0xffffffffa011eb78 __ext4_journal_get_write_access+0x38/0x80 [ext4]
  • 0xffffffffa00fa253 ext4_reserve_inode_write+0x73/0xa0 [ext4]
  • 0xffffffffa00fa2cc ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
  • 0xffffffffa00fa6c4 ext4_generic_write_end+0xe4/0xf0 [ext4]
  • 0xffffffffa00fdf74 ext4_writeback_write_end+0x74/0x160 [ext4]
  • 0xffffffff81111474 generic_file_buffered_write+0x174/0x2a0 [kernel]
  • 0xffffffff81112d60 __generic_file_aio_write+0x250/0x480 [kernel]
  • 0xffffffff81112fff generic_file_aio_write+0x6f/0xe0 [kernel]
  • 0xffffffffa00f3de1 ext4_file_write+0x61/0x1e0 [ext4]
  • 0xffffffff811762da do_sync_write+0xfa/0x140 [kernel]
  • 0xffffffff811765d8 vfs_write+0xb8/0x1a0 [kernel]
  • 0xffffffff81176fe1 sys_write+0x51/0x90 [kernel]

XFS on recent kernels can relieve the journal IO blocking issue and is more friendly to metadata-heavy scenarios like HBase over HDFS
16
Write Latency Spikes Testing
  • 8 YCSB threads write 20 million rows of 3200 bytes each; 3 DataNodes, kernel 3.12.17
  • Count the stalled write() calls that cost > 100ms (see the measurement sketch below)

The largest write() latency on ext4: 600ms!
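A rough sketch of the kind of measurement behind this test: time each buffered write() to a local file and count the ones that stall past 100ms. The file path is an illustrative placeholder; the record size and row count loosely mirror the test setup.

    // Sketch: measure buffered write() stalls on a local filesystem.
    import java.io.FileOutputStream;

    public class BufferedWriteStallProbe {
      public static void main(String[] args) throws Exception {
        byte[] record = new byte[3200];                 // ~3200 bytes per row, as in the test
        int stalls = 0;
        long maxMs = 0;
        try (FileOutputStream out = new FileOutputStream("/data1/stall_probe.bin")) {
          for (int i = 0; i < 20_000_000; i++) {
            long start = System.nanoTime();
            out.write(record);                          // buffered IO; may block on write-back
            long ms = (System.nanoTime() - start) / 1_000_000;
            if (ms > 100) stalls++;                     // count write()s stalled > 100ms
            maxMs = Math.max(maxMs, ms);
          }
        }
        System.out.println("stalled writes: " + stalls + ", max latency: " + maxMs + "ms");
      }
    }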
17
Hedged Read (HDFS-5776)
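HDFS-5776 adds client-side hedged reads: if a read from one DataNode is slow, a second read is fired against another replica and the first response wins. A minimal sketch of enabling it in the client configuration; the pool size and threshold values are illustrative, not recommendations.

    // Sketch: enable DFSClient hedged reads (HDFS-5776) for a latency-sensitive client.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class HedgedReadConfig {
      public static Configuration create() {
        Configuration conf = HBaseConfiguration.create();
        // A non-zero thread pool size turns hedged reads on.
        conf.setInt("dfs.client.hedged.read.threadpool.size", 20);
        // Fire the second (hedged) read if the first hasn't returned within 50ms.
        conf.setLong("dfs.client.hedged.read.threshold.millis", 50);
        return conf;
      }
    }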
18
Other Meaningful Latency Work
  • Long first put issue (HBASE-10010)
  • Token invalid (HDFS-5637)
  • Retry/timeout setting in DFSClient
  • Reduce write traffic? (HLog compression)
  • HDFS IO Priority (HADOOP-10410)

19
Wish List
  • Real-time HDFS, especially priority-related work
  • GC-friendly core data structures
  • More off-heap; Shenandoah GC
  • TCP/disk IO characteristic analysis
  • Need more eyes on the OS
  • Stay tuned

20
Some Patches Xiaomi Contributed
  • New write thread model (HBASE-8755)
  • Reverse scan (HBASE-4811)
  • Per table/CF replication (HBASE-8751)
  • Block index key optimization (HBASE-7845)

21
1. New Write Thread Model
Old model


[Diagram: 256 WriteHandler threads each append to the local buffer, then each writes to HDFS and syncs to HDFS itself]
Problem: each WriteHandler does everything itself, causing severe lock contention!
22
New Write Thread Model
New model


[Diagram: 256 WriteHandler threads append to the local buffer; 1 AsyncWriter writes to HDFS; 4 AsyncSyncer threads sync to HDFS; 1 AsyncNotifier notifies the waiting writers]
23
New Write Thread Model
  • Low load: no improvement
  • Heavy load: huge improvement (3.5x) (see the handoff sketch below)
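A very rough, simplified sketch of the handoff the new model uses (not the HBASE-8755 code itself): many handlers append to a shared buffer and wait, while a single background thread drains the buffer, writes and syncs it to HDFS, and then wakes every handler whose edit is durable.

    // Simplified sketch of the HBASE-8755-style handoff (not the real implementation).
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.atomic.AtomicLong;

    public class WalHandoffSketch {
      private final List<byte[]> buffer = new ArrayList<>();
      private final AtomicLong lastAssignedSeq = new AtomicLong(0);
      private volatile long lastSyncedSeq = 0;
      private final Object bufferLock = new Object();
      private final Object syncedLock = new Object();

      /** Called by many WriteHandler threads: append the edit, then wait until it is durable. */
      public void append(byte[] edit) throws InterruptedException {
        long seq;
        synchronized (bufferLock) {
          buffer.add(edit);
          seq = lastAssignedSeq.incrementAndGet();
          bufferLock.notify();                    // wake the writer/syncer thread
        }
        synchronized (syncedLock) {
          while (lastSyncedSeq < seq) {
            syncedLock.wait();                    // block until our edit has been synced
          }
        }
      }

      /** One background thread playing the AsyncWriter + AsyncSyncer + AsyncNotifier roles. */
      public void startWriterSyncer() {
        Thread t = new Thread(() -> {
          while (true) {
            List<byte[]> batch;
            long upToSeq;
            synchronized (bufferLock) {
              while (buffer.isEmpty()) {
                try { bufferLock.wait(); } catch (InterruptedException e) { return; }
              }
              batch = new ArrayList<>(buffer);    // drain everything appended so far
              buffer.clear();
              upToSeq = lastAssignedSeq.get();
            }
            writeAndSyncToHdfs(batch);            // hlog append + sync in the real code
            synchronized (syncedLock) {
              lastSyncedSeq = upToSeq;
              syncedLock.notifyAll();             // AsyncNotifier role: wake waiting handlers
            }
          }
        }, "AsyncWriterSyncer");
        t.setDaemon(true);
        t.start();
      }

      private void writeAndSyncToHdfs(List<byte[]> batch) {
        // Placeholder for the actual HDFS write + sync of the batched WAL edits.
      }
    }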

24
2. Reverse Scan
1. All scanners seek to their previous rows (SeekBefore)
2. Figure out the next row: the max of those previous rows
3. All scanners seek to the first KV of the next row (SeekTo)
[Diagram: KVs of Row1-Row6 spread across several store files/scanners: Row1 kv1-kv2, Row2 kv1-kv3, Row3 kv1-kv4, Row4 kv1-kv6, Row5 kv2-kv3, Row6 kv1]
Performance: ~70% of forward scan (see the client example below)
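Client-side usage is just Scan.setReversed(true); a minimal sketch with placeholder table and row names:

    // Sketch: reverse scan from a client (HBASE-4811), placeholder names.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ReverseScanExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("t1"))) {
          Scan scan = new Scan();
          scan.setReversed(true);                   // scan rows in descending order
          scan.setStartRow(Bytes.toBytes("Row4"));  // reverse scans start from the larger row
          scan.setStopRow(Bytes.toBytes("Row1"));   // and stop at the smaller one (exclusive)
          try (ResultScanner scanner = table.getScanner(scan)) {
            for (Result r : scanner) {
              System.out.println(Bytes.toString(r.getRow()));
            }
          }
        }
      }
    }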
25
3. Per Table/CF Replication
  • If PeerB creates only T2, replication can't work!
  • If PeerB creates both T1 and T2, all data gets replicated!

[Diagram: source cluster with T1 (cfA, cfB) and T2 (cfX, cfY), replicating to PeerA (full backup) and PeerB, which only wants T2:cfX]
Need a way to specify which data to replicate!
26
Per Table/CF Replication
  • add_peer 'PeerA', 'PeerA_ZK'            -> PeerA replicates everything
  • add_peer 'PeerB', 'PeerB_ZK', 'T2:cfX'  -> PeerB replicates only T2:cfX

[Diagram: source cluster with T1 (cfA, cfB) and T2 (cfX, cfY); PeerA receives all tables, PeerB receives only T2:cfX]
27
4. Block Index Key Optimization
Before: Block 2's block index key is "ah, hello world"
Now: Block 2's block index key is "ac" (k1 < key < k2)

k1 = "ab" (last key of Block 1)
k2 = "ah, hello world" (first key of Block 2)

  • Reduces block index size
  • Saves seeking into the previous block when the search key falls between "ac" and "ah, hello world" (see the separator-key sketch below)
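A small self-contained sketch of the idea (not the actual HBASE-7845 code): derive the shortest "fake" key that is greater than the last key of the previous block and no greater than the first key of the next block, e.g. "ab" and "ah, hello world" give "ac".

    // Sketch of the idea behind HBASE-7845: compute a short separator key k with
    // lastKeyOfPrevBlock < k <= firstKeyOfNextBlock to store in the block index
    // instead of the full first key of the next block.
    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class ShortSeparatorKey {
      static byte[] shortSeparator(byte[] left, byte[] right) {
        // Find the first position where the two keys differ.
        int i = 0;
        while (i < left.length && i < right.length && left[i] == right[i]) {
          i++;
        }
        if (i == left.length) {
          return right;                             // left is a prefix of right: keep right
        }
        // If bumping left[i] by one still stays below right[i], the (i+1)-byte
        // prefix with that bumped byte separates the two keys.
        if ((left[i] & 0xff) + 1 < (right[i] & 0xff)) {
          byte[] sep = Arrays.copyOf(left, i + 1);
          sep[i] = (byte) ((left[i] & 0xff) + 1);
          return sep;
        }
        return right;                               // fall back to the full key
      }

      public static void main(String[] args) {
        byte[] k1 = "ab".getBytes(StandardCharsets.UTF_8);
        byte[] k2 = "ah, hello world".getBytes(StandardCharsets.UTF_8);
        // Prints "ac": a 2-byte index key instead of "ah, hello world".
        System.out.println(new String(shortSeparator(k1, k2), StandardCharsets.UTF_8));
      }
    }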

28
Some ongoing patches
  • Cross-table cross-row transaction (HBASE-10999)
  • HLog compactor (HBASE-9873)
  • Adjusted delete semantic (HBASE-8721)
  • Coordinated compaction (HBASE-9528)
  • Quorum master (HBASE-10296)

29
1. Cross-Row Transactions: Themis
http://github.com/xiaomi/themis
  • Based on Google Percolator: "Large-scale Incremental Processing Using Distributed Transactions and Notifications"
  • Two-phase commit: strong cross-table/cross-row consistency
  • Global timestamp server: globally, strictly incremental timestamps
  • No changes to HBase internals: built on the HBase client and coprocessors
  • Read: ~90%, Write: ~23% of raw HBase performance (the same degradation as Google Percolator)
  • More details: HBASE-10999 (see the flow sketch below)
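A conceptual sketch of the Percolator-style two-phase commit flow; every class and method name below is a hypothetical illustration, NOT the real Themis API (see the repo above and HBASE-10999 for the actual client).

    // Percolator-style two-phase commit, conceptually. Hypothetical names only.
    public class TwoPhaseCommitSketch {

      interface TimestampOracle { long next(); }            // global strictly incremental timestamps

      interface CrossRowTxn {
        void prewrite(String table, String row, String cf, String q,
                      byte[] value, long startTs);          // lock + data, checked by coprocessor
        void commitPrimary(long startTs, long commitTs);    // the commit point
        void commitSecondaries(long startTs, long commitTs);// may be completed lazily
      }

      static void transfer(TimestampOracle oracle, CrossRowTxn txn) {
        long startTs = oracle.next();                       // 1. get the start timestamp
        // 2. prewrite every mutated cell; conflicting locks/newer writes abort the txn
        txn.prewrite("account", "alice", "f", "balance", new byte[]{ 9 }, startTs);
        txn.prewrite("account", "bob",   "f", "balance", new byte[]{ 11 }, startTs);
        long commitTs = oracle.next();                      // 3. get the commit timestamp
        txn.commitPrimary(startTs, commitTs);               // 4. committing the primary = commit point
        txn.commitSecondaries(startTs, commitTs);           // 5. roll forward the remaining rows
      }
    }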

30
2. HLog Compactor
HLogs 1, 2, 3
Region x: few writes, but scattered across many HLogs

[Diagram: Regions 1, 2, ..., x with their memstores and HFiles, all sharing HLogs 1-3]

PeriodicMemstoreFlusher flushes old memstores forcefully
  • flushCheckInterval/flushPerChanges are hard to configure
  • Results in tiny HFiles
  • HBASE-10499: a problematic region can't be flushed!

31
HLog Compactor
HLogs 1, 2, 3, 4
  • Compact HLogs 1, 2, 3, 4 -> HLog x
  • Archive HLogs 1, 2, 3, 4

[Diagram: after compaction only HLog x remains, holding the still-unflushed edits of Region x; Regions 1, 2, ..., x keep their memstores and HFiles]
32
3. Adjusted Delete Semantic
Scenario 1:
1. Write kvA at t0
2. Delete kvA at t0, flush to an HFile
3. Write kvA at t0 again
4. Read kvA
Result: kvA can't be read out
Scenario 2:
1. Write kvA at t0
2. Delete kvA at t0, flush to an HFile
3. Major compact
4. Write kvA at t0 again
5. Read kvA
Result: kvA can be read out
Fix: a delete can't mask KVs with a larger mvcc (put later) (see the client-level sketch below)
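A client-level sketch of scenario 1 with the standard client API; table, family and timestamp values are placeholders.

    // Sketch of scenario 1: put at t0, delete at t0 and flush, put at t0 again.
    // Before HBASE-8721 the second put stays masked by the flushed delete marker
    // until a major compaction removes the marker.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DeleteSemanticRepro {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        long t0 = 1000L;                                   // an explicit, reused timestamp
        byte[] row = Bytes.toBytes("rowA"), cf = Bytes.toBytes("f"), q = Bytes.toBytes("q");
        TableName tn = TableName.valueOf("t1");
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(tn);
             Admin admin = conn.getAdmin()) {
          table.put(new Put(row).addColumn(cf, q, t0, Bytes.toBytes("v1")));  // 1. write kvA at t0
          table.delete(new Delete(row).addColumns(cf, q, t0));                // 2. delete kvA at t0
          admin.flush(tn);                                                    //    flush to an HFile
          Thread.sleep(3000);                              // give the async flush a moment (sketch only)
          table.put(new Put(row).addColumn(cf, q, t0, Bytes.toBytes("v2")));  // 3. write kvA at t0 again
          Result r = table.get(new Get(row).addColumn(cf, q));                // 4. read kvA
          System.out.println("value: " + (r.isEmpty() ? "<masked>"            //    empty before the fix
                                                      : Bytes.toString(r.getValue(cf, q))));
        }
      }
    }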
33
4. Coordinated Compaction
[Diagram: several RegionServers compacting at the same time against the shared HDFS -> compaction storm!]

  • Compaction consumes a global resource (HDFS), but each RegionServer decides locally whether to compact!

34
Coordinated Compaction
[Diagram: each RegionServer asks the Master "Can I?" before compacting; the Master answers OK or NO, throttling access to HDFS (the global resource)]

  • Compaction is scheduled by the Master, so compaction storms no longer occur (see the sketch below)
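A hypothetical sketch of the idea behind HBASE-9528 (not the actual patch): the Master grants at most a fixed number of concurrent compactions cluster-wide, so individual RegionServers can no longer pile up into a compaction storm. The class and method names are illustrative.

    // Hypothetical master-side throttle for cluster-wide compactions.
    import java.util.concurrent.Semaphore;

    public class CompactionCoordinatorSketch {
      private final Semaphore clusterWideSlots;

      public CompactionCoordinatorSketch(int maxConcurrentCompactions) {
        this.clusterWideSlots = new Semaphore(maxConcurrentCompactions);
      }

      /** Master side: a RegionServer asks "Can I compact?"; answer OK (true) or NO (false). */
      public boolean requestCompaction(String regionServer, String region) {
        boolean granted = clusterWideSlots.tryAcquire();
        System.out.println(regionServer + " asks to compact " + region + " -> "
            + (granted ? "OK" : "NO"));
        return granted;
      }

      /** Master side: the RegionServer reports the compaction finished, freeing the slot. */
      public void compactionFinished(String regionServer, String region) {
        clusterWideSlots.release();
      }
    }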

35
5. Quorum Master
[Diagram: an active Master (A) and a standby Master (X) coordinate through a ZooKeeper ensemble (zk1-zk3); on failover the standby must read info/states back from ZooKeeper and the RegionServers]
  • While the active master serves, the standby master stays essentially idle
  • When the standby master becomes active, it has to rebuild the in-memory state

36
Quorum Master
[Diagram: a quorum of Masters (Master 1, 2, 3) with one active (A); RegionServers talk to the master quorum directly, with no ZooKeeper in the picture]
  • Better master failover performance: no phase to rebuild the in-memory state
  • Better restart performance for a BIG cluster (10K regions)
  • No external (ZooKeeper) dependency
  • No potential consistency issues
  • Simpler deployment

37
Acknowledgement
Hangjun Ye, Zesheng Wu, Peng Zhang, Xing Yong, Hao Huang, Hailei Li, Shaohui Liu, Jianwei Cui, Liangliang He, Dihao Chen
38
Thank You!
xieliang@xiaomi.com
fenghonghua@xiaomi.com
www.mi.com