Deconstructing Commodity Storage Clusters

Transcript and Presenter's Notes

1
Deconstructing Commodity Storage Clusters
  • Haryadi S. Gunawi, Nitin Agrawal,
  • Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau

Univ. of Wisconsin - Madison
Jiri Schindler
EMC Corporation
2
Storage system
  • Storage systems
  • Important components of large-scale systems
  • Multi-billion dollar industry
  • Often composed of high-end storage servers
  • A big box with lots of disks inside
  • The simple question
  • How does a storage server work?
  • Simple but hard to answer: the storage subsystem design is closed

3
Why do we need to know?
  • Better modeling
  • How the system behaves under different workloads
  • Example from the storage industry: a capacity model for capacity planning
  • A model is limited if the available information is limited
  • Product validation
  • Validate what the product specs say
  • Performance numbers alone cannot confirm them
  • Critical evaluation of design and implementation choices
  • Control over what is occurring inside

4
Traditionally a black box
  • Highly customized and proprietary hardware and OS
  • Hitachi Lightning, NetApp Filers, EMC Symmetrix
  • EMC Symmetrix disk/cache manager, proprietary OS
  • Internal information is hidden behind standard
    interfaces

[Diagram: a client sends requests to the storage box and receives acks; the box's internals are a question mark]
5
Modern gray-box storage systems
  • Cluster of commodity PCs running a commodity OS
  • Google FS cluster, HP FAB, EMC Centera
  • Advantages of commodity storage clusters
  • Direct internal observation: visible probe points
  • Leverage existing standardized tools

[Diagram: the client connects through switches to a cluster of commodity PCs that make up the storage system; internal "Update DB" messages between the PCs are visible]
6
Intra-box Techniques
  • Two Intra-box techniques
  • Observation
  • System perturbation
  • Two components of analysis
  • Deduce structure of main communication protocol
  • Object Read and Write protocol
  • Internal policy decisions
  • Caching, prefetching, write buffering, load
    balancing, etc.

7
Goal and EMC Author
  • Objectives
  • Feasibility of deconstructing commodity storage clusters without source code
  • Results achieved without EMC assistance
  • EMC Author
  • Evaluates the correctness of our findings
  • Gives insight into the reasoning behind their design decisions

8
Outline
  • Introduction
  • EMC Centera Overview
  • Intra-box tools
  • Deducing Protocol
  • Observation and Delay Perturbation
  • Inferring Policies
  • System Perturbation
  • Conclusion

9
Centera Topology
[Diagram: clients reach access nodes AN 1 and AN 2 over the WAN; the access nodes connect over a LAN to storage nodes SN 1 through SN 6]
10
Commodity OS
[Diagram: software stacks. Client: Client SDK over TCP, reaching the system across the WAN. Access node and storage node: Centera software running on Linux with ReiserFS, TCP/UDP, and an IDE driver, connected by the LAN]
11
Probe Points: Observation

[Diagram: the client, access node, and storage node stacks, with tcpdump tapping the network traffic at each node and a pseudo device driver sitting between the storage node's filesystem and its IDE drives]
  • Internal probe points
  • Trace traffic using standardized tools
  • tcpdump: traces network traffic (capture sketch below)
  • Pseudo device driver: traces disk traffic

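Below is a minimal sketch (Python) of how such a capture can be driven with the standard tcpdump tool; the interface name, output file, and 60-second workload window are illustrative assumptions, not details taken from the presentation.

# A minimal sketch, assuming root access and an "eth0" interface; it simply
# wraps the standard tcpdump tool rather than anything Centera-specific.
import signal
import subprocess
import time

def start_capture(iface="eth0", outfile="node_trace.pcap"):
    # -s 0 captures full packets; -w writes raw packets for offline correlation
    return subprocess.Popen(
        ["tcpdump", "-i", iface, "-s", "0", "-w", outfile, "tcp", "or", "udp"])

if __name__ == "__main__":
    cap = start_capture()
    time.sleep(60)                  # run the write workload in this window
    cap.send_signal(signal.SIGINT)  # let tcpdump flush and close the trace file
    cap.wait()
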
12
Probe Points: Perturbation

[Diagram: the same stacks, now with user-level load processes ("while(1)" CPU load, "cp fX fY" disk load) on the nodes, modified NistNet inserting delay on each network path, a pseudo device driver delaying I/O in front of the IDE drives, and tcpdump still tapping each link]
  • Perturbing the system at probe points
  • Modified NistNet: delays particular messages
  • Pseudo device driver: delays disk I/O traffic
  • Additional load (load-generation sketch below)
  • CPU load: a high-priority while loop
  • Disk load: a background file copy

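A minimal sketch of the two load generators in Python; the priority value and file names (fX, fY) are illustrative, while the underlying loads, a while(1) loop and a file copy, come from the slide.

# A minimal sketch of the CPU-load and disk-load perturbations named above;
# the niceness value and file paths are assumptions, not Centera specifics.
import multiprocessing
import os
import shutil

def cpu_load():
    try:
        os.nice(-10)            # raise priority; needs root, skipped otherwise
    except PermissionError:
        pass
    while True:                 # approximates a high-priority while(1) loop
        pass

def disk_load(src="/tmp/fX", dst="/tmp/fY"):
    # src is assumed to be a pre-created large file
    while True:
        shutil.copyfile(src, dst)   # repeated background file copy

if __name__ == "__main__":
    multiprocessing.Process(target=cpu_load, daemon=True).start()
    multiprocessing.Process(target=disk_load, daemon=True).start()
    input("Perturbation running; press Enter to stop.")
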
13
Outline
  • Introduction
  • EMC Centera Overview
  • Deducing Protocol
  • Observation and Delay Perturbation
  • Inferring Policies
  • System Perturbation
  • Conclusion

14
Understanding the protocol
  • Understanding the Read/Write protocol
  • Read and Write implementations in large distributed storage systems are not simple
  • Deconstruct the protocol structure
  • Which pieces are involved?
  • Where is data sent?
  • Is data reliably stored, mirrored, striped?

15
Observing Write Protocol
  • Deconstruct the protocol using passive observation
  • Run a series of write workloads
  • Observe network and disk traffic
  • Correlation tools convert the traces into the protocol structure (see the sketch below)

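The correlation step can be pictured as merging the timestamped network and disk traces into one ordered event stream. The sketch below assumes the traces have already been parsed into (timestamp, location, event) tuples; this format is an illustrative assumption, not the paper's actual tool.

# A minimal sketch of trace correlation over pre-parsed probe-point records.
def correlate(network_events, disk_events):
    # Merge both traces on timestamp to recover the ordering of protocol steps
    merged = sorted(network_events + disk_events, key=lambda rec: rec[0])
    for t, where, what in merged:
        print(f"{t:.6f}s  {where:<12}  {what}")

correlate(
    network_events=[(0.001, "client->an1", "write request"),
                    (0.004, "an1->sn1", "data transfer")],
    disk_events=[(0.006, "sn1", "synchronous disk write")])
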
[Diagram: the client sends writes into the EMC Centera cluster, which consists of access nodes an1 and an2 and storage nodes sn1 through sn6]
16
Observation Results
  • Object Write protocol findings (restated in the sketch below)
  • Phase 1: write request establishment
  • Phase 2: data transfer
  • Phase 3: disk write, notify other SNs, commit
  • Phase 4: series of acknowledgements
  • Determine general properties
  • The primary SN handles generation of the 2nd copy
  • Two new TCP connections per object write

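For reference, the deduced write structure restated as data; the phase labels simply paraphrase the bullets above and are not Centera's internal message names.

# A sketch restating the deduced object-write phases; labels are paraphrases.
OBJECT_WRITE_PHASES = [
    (1, "write request establishment (TCP setup, write requests, request acks)"),
    (2, "data transfer (with transfer ack)"),
    (3, "synchronous disk write, notify other SNs, commit"),
    (4, "series of acknowledgements; write complete returned to the client"),
]
for number, description in OBJECT_WRITE_PHASES:
    print(f"Phase {number}: {description}")
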
[Timeline diagram: TCP setup and write requests flow from the access node to the primary SN and on to the secondary SN; request acks return; data transfer and a transfer ack follow; other storage nodes (SNx, SNy, SNv, SNw) are notified; write-commits propagate back and a write-complete reaches the client]
17
Resolving Dependencies
  • Dependencies cannot be concluded from observation alone
  • B occurring after A does not imply that B depends on A
  • Must delay A and see whether B is also delayed (see the sketch below)

[Timeline diagram: from observation only, the primary commit (pc) appears to depend on the secondary commit and the synchronous disk write at the primary SN]
  • Conclude causality by delaying
  • disk write traffic and
  • secondary commit

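The causality test can be stated as a simple rule: inject a known delay into A and check whether B shifts by a comparable amount. The sketch below uses made-up timestamps and a 50% tolerance, both illustrative assumptions.

# A minimal sketch of the delay-based causality test.
def depends_on(t_b_baseline, t_b_perturbed, injected_delay, tolerance=0.5):
    # If B shifts by roughly the delay injected into A, conclude B depends on A
    shift = t_b_perturbed - t_b_baseline
    return shift >= injected_delay * tolerance

# Example: primary commit normally seen at 10 ms, seen at 110 ms once the
# secondary commit was delayed by 100 ms -> the primary commit depends on it.
print(depends_on(t_b_baseline=0.010, t_b_perturbed=0.110, injected_delay=0.100))
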
18
Delaying a Particular Message
  • Need to delay one particular message
  • Leverage packet sizes
  • Modify NistNet
  • Delay a specific message, not the whole link
  • Example: delay the secondary commit (sc, 90 bytes); a sketch of the rule follows

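The rule applied by the modified NistNet can be illustrated at user level as a size match. The sketch below is not NistNet's kernel interface, and the amount of injected delay is an assumption; the 90-byte size for the secondary commit comes from the slide.

# Illustrative user-level sketch of size-based message delay; the real change
# lives inside the NistNet kernel module.
TARGET_SIZE_BYTES = 90     # secondary-commit (sc) message size from the trace
EXTRA_DELAY_SECONDS = 0.1  # assumed amount of injected delay

def forwarding_delay(payload_size, base_delay=0.0):
    # Only the targeted message type is delayed, not the whole link
    if payload_size == TARGET_SIZE_BYTES:
        return base_delay + EXTRA_DELAY_SECONDS
    return base_delay

print(forwarding_delay(90), forwarding_delay(509))   # 0.1 vs. 0.0
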
[Timeline diagram annotated with observed message sizes (299, 509, 161, 289, 375, 321, 539, and 4 bytes); the secondary commit (sc) is the 90-byte message and the primary commit is the 539-byte message]
19
Delaying the secondary commit
  • Resolving the first dependency
  • Delaying the secondary commit causes the primary commit to be delayed as well
  • Therefore, the primary commit depends on the receipt of the secondary commit

[Timeline diagram: AN, primary SN, and secondary SN; the injected delay on the secondary commit shifts the primary commit]
20
Delaying disk I/O traffic
  • Delay disk writes at the primary storage node

[Timeline diagram: the disk write at the primary SN is delayed while the secondary commit arrives normally]

From observation and delay: the primary commit depends on both the secondary-commit message and the synchronous disk write
21
Ability to analyze internal designs
  • Intra-box techniques: observation and perturbation by delay
  • Able to deduce the Object Write protocol
  • Gives the ability to analyze internal design decisions
  • Serial vs. parallel
  • The primary SN handles generation of the 2nd copy (serial)
  • vs.
  • The AN handles both the 1st and 2nd copies (parallel)
  • For EMC Centera, write throughput is more important
  • Decreasing load on the access nodes increases write throughput
  • New TCP connections (internally) per object write
  • vs. using persistent connections to remove TCP setup cost
  • EMC Centera prefers simplicity: no need to manage persistent connections for all requests

22
Outline
  • Introduction
  • EMC Centera Overview
  • Deducing Protocol
  • Inferring Policies
  • Various system perturbations
  • Conclusion

23
Inferring internal policies
  • Write policies
  • Level of replication, load balancing, caching/buffering
  • Read policies
  • Caching, prefetching, load balancing
  • Try to infer
  • Is a particular policy implemented?
  • At which level is it implemented?
  • Example: is read caching done at the client, the access node, or the storage node? (see the sketch below)

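One way to picture the level-of-caching inference: re-read the same object and compare the traffic seen at each probe point. The sketch below is a hypothetical decision helper over such counts, not the paper's tooling; in practice the counts come from the tcpdump and pseudo-driver traces.

# A minimal sketch of inferring at which level read caching is implemented.
def caching_level(net_packets_to_sn_on_reread, disk_reads_on_reread):
    if net_packets_to_sn_on_reread == 0:
        return "cached at the client or access node"
    if disk_reads_on_reread == 0:
        return "cached at the storage node (filesystem cache)"
    return "no caching observed"

# The finding summarized later matches the middle case: re-reads still reach
# the storage node over the network, but generate no disk I/O there.
print(caching_level(net_packets_to_sn_on_reread=12, disk_reads_on_reread=0))
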
24
System Perturbation
  • Perturb the system
  • Delay and extra load
  • 4 common load-balancing factors
  • CPU load
  • High priority while loop
  • Disk load
  • Background file copy
  • Active TCP connection
  • Network delay

25
Write Load Balancing
  • What factors determine which storage nodes are selected?
  • Experiment
  • Observe which primary storage nodes are selected (counting sketch below)
  • Without load, writes are balanced
  • With load, writes skew toward the unloaded nodes

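The bookkeeping behind this experiment is simply counting which node is selected as the primary for each write; the sketch below uses made-up selections to show the balanced and skewed cases, and in practice the selections are extracted from the traces.

# A minimal sketch of tallying primary-SN selections across a write series.
from collections import Counter

def primary_selection(primaries_per_write):
    return Counter(primaries_per_write)

print(primary_selection(["sn1", "sn2", "sn1", "sn2"]))   # no load: balanced
print(primary_selection(["sn1", "sn1", "sn1", "sn2"]))   # sn2 loaded: skewed toward sn1
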
[Diagram: the AN chooses between sn1 (unloaded) and sn2, shown both unloaded and loaded]
26
Write Load Balancing Results
[Bar charts of primary-SN selection (sn1 vs. sn2) under five conditions: no perturbation, additional CPU load, disk load, network load (active TCP connections), and incoming network delay]
27
Summary of findings
Write Policies
  • Replication: two copies on two nodes attached to different power sources (reliability)
  • Load balancing: based on CPU usage (a locally observable status); network status is not incorporated
  • Write buffering: none; storage nodes write synchronously
  • Overall: EMC Centera favors simplicity and reliability

Read Policies
  • Caching: storage node only (via the commodity filesystem); the access node and client do not cache
  • Prefetching: storage node only (via the commodity filesystem); the access node and client do not prefetch
  • Load balancing: not implemented in this earlier version; reads still go to busy nodes
28
Conclusion
  • Intra-box techniques
  • Observe and perturb
  • Deconstruct the protocol and infer policies
  • No access to source code
  • The power of probe points
  • More places to observe
  • Ability to control the system
  • Systems built with more externally visible probe points
  • are more readily understood, analyzed, and debugged
  • and lead to higher-performing, more robust, and more reliable computer systems

29
Questions?