Research @ Northeastern University

1
Research @ Northeastern University
  • I/O storage modeling and performance
  • David Kaeli
  • Soft error modeling and mitigation
  • Mehdi B. Tahoori

2
I/O Storage Research at Northeastern University
  • David Kaeli
  • Yijian Wang
  • Department of Electrical and Computer Engineering
  • Northeastern University
  • Boston, MA
  • kaeli@ece.neu.edu

3
Outline
  • Motivation to study file-based I/O
  • Profile-driven partitioning for parallel file I/O
  • I/O Qualification Laboratory @ NU
  • Areas for future work

4
Important File-based I/O Workloads
  • Many subsurface sensing and imaging workloads
    involve file-based I/O
  • Cellular biology: in-vitro fertilization with
    NU biologists
  • Medical imaging: cancer therapy with MGH
  • Underwater mapping: multi-sensor fusion with the
    Woods Hole Oceanographic Institution
  • Ground-penetrating radar: toxic waste tracking
    with Idaho National Labs

5
The Impact of Profile-guided Parallelization on
SSI Applications
  • Reduced the runtime of a single-body Steepest
    Descent Fast Multipole Method (SDFMM) application
    by 74% on a 32-node Beowulf cluster
  • Hot-path parallelization
  • Data restructuring
  • Reduced the runtime of a Monte Carlo scattered
    light simulation by 98% on a 16-node Silicon
    Graphics Origin 2000
  • Matlab-to-C compilation
  • Hot-path parallelization
  • Obtained superlinear speedup of the Ellipsoid
    Algorithm run on a 16-node IBM SP2
  • Matlab-to-C compilation
  • Hot-path parallelization

6
Limits of Parallelization
  • For compute-bound workloads, Beowulf clusters can
    be used effectively to overcome computational
    barriers
  • Middleware (e.g., MPI and MPI-IO) can
    significantly reduce the programming effort on
    parallel systems
  • Multiple clusters can be combined using Grid
    middleware (the Globus Toolkit)
  • For file-based I/O-bound workloads, Beowulf
    clusters and Grid systems are presently
    ill-suited to exploiting the potential I/O
    parallelism present in these systems

7
Outline
  • Motivation to study file-based I/O
  • Profile-driven partitioning for parallel file I/O
  • I/O Qualification Laboratory @ NU
  • Areas for future work

8
Parallel I/O Acceleration
  • The I/O bottleneck: the growing gap between the
    speed of processors, networks, and underlying I/O
    devices
  • Many imaging and scientific applications access
    disks very frequently
  • Two classes of I/O-intensive applications
  • Out-of-core applications: work on large datasets
    that cannot fit in main memory
  • File-intensive applications: access file-based
    datasets frequently, issuing a large number of
    file operations

9
Introduction
  • Storage architectures
  • Direct Attached Storage (DAS): the storage device
    is directly attached to the computer
  • Network Attached Storage (NAS): the storage
    subsystem is attached to a network of servers,
    and file requests are passed through a parallel
    filesystem to the centralized storage device
  • Storage Area Network (SAN): a dedicated network
    providing an any-to-any connection between
    processors and disks

10
I/O Partitioning
[Diagram: an I/O-intensive application process (P) accessing a disk]
11
I/O Partitioning
  • I/O is parallelized at both the application level
    (using MPI and MPI-IO; see the sketch below) and
    the disk level (using file partitioning)
  • Ideally, every process will only access files on
    its local disk (though this is typically not
    possible due to data sharing)
  • How do we recognize the access patterns?
  • A profile-guided approach
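
For concreteness, here is a minimal sketch of application-level parallel I/O with MPI-IO, in which every rank writes its own contiguous chunk of a shared file at a rank-determined offset (the file name and chunk size are illustrative, not taken from this work):

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int CHUNK = 1 << 20;        /* 1 MB per rank (illustrative) */
    char *buf = malloc(CHUNK);
    memset(buf, rank, CHUNK);

    /* All ranks open the same file collectively */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "data.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Rank-determined offset, so no two ranks touch the same bytes */
    MPI_File_write_at(fh, (MPI_Offset)rank * CHUNK, buf, CHUNK,
                      MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}

This access pattern matches the Perf workload described later (each rank writes a 1 MB chunk at a rank-determined location); with file partitioning, each such chunk would ideally reside on the disk local to the rank that accesses it.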

12
Profile Generation
  1. Run the application
  2. Capture I/O execution profiles (see the capture
     sketch below)
  3. Apply our partitioning algorithm
  4. Rerun the tuned application
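
The capture mechanism is not detailed here; for MPI codes, one low-overhead way to collect such profiles is the standard PMPI profiling interface, sketched below for writes (the trace format is an assumption):

#include <mpi.h>
#include <stdio.h>

/* Intercept MPI_File_write_at: log the access, then forward the call
   to the real implementation through the PMPI entry point. */
int MPI_File_write_at(MPI_File fh, MPI_Offset off, const void *buf,
                      int count, MPI_Datatype type, MPI_Status *status)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Type_size(type, &size);
    fprintf(stderr, "rank=%d op=W off=%lld bytes=%lld t=%.6f\n",
            rank, (long long)off, (long long)count * size, MPI_Wtime());
    return PMPI_File_write_at(fh, off, buf, count, type, status);
}

A read wrapper is analogous; a file identifier (recorded at MPI_File_open) would also be logged, as listed on the next slide.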
13
I/O traces and partitioning
  • For every process and every contiguous file
    access, we capture the following I/O profile
    information (see the record sketch after this
    list)
  • Process ID
  • File ID
  • Address
  • Chunk size
  • I/O operation (read/write)
  • Timestamp
  • Generate a partition for every process
  • Optimal partitioning is NP-complete, so we
    develop a greedy algorithm
  • We have found we can use partial profiles to
    guide partitioning
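
One record per contiguous access could be represented as below (field names and types are illustrative assumptions, not the authors' format):

#include <stddef.h>
#include <sys/types.h>

typedef enum { IO_READ, IO_WRITE } io_op_t;

typedef struct {
    int     process_id;  /* MPI rank issuing the access              */
    int     file_id;     /* which file is touched                    */
    off_t   address;     /* starting offset of the contiguous access */
    size_t  chunk_size;  /* bytes transferred                        */
    io_op_t op;          /* read or write                            */
    double  timestamp;   /* e.g., MPI_Wtime() at the call            */
} io_trace_record_t;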

14
Greedy File Partitioning Algorithm
for each I/O process, create a partition
for each contiguous data chunk:
    total up the number of read/write accesses on a
      process-ID basis
    if the chunk is accessed by only one process:
        assign the chunk to the associated partition
    else if the chunk is read (but never written) by
      multiple processes:
        duplicate the chunk in all partitions where read
    else if the chunk is written by one process, but
      later read by multiple:
        assign the chunk to all partitions where read,
          and broadcast the updates on writes
    else:
        assign the chunk to a shared partition
for each partition:
    sort chunks based on the earliest timestamp of
      each chunk
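
A compact C rendering of the classification step is shown below; the access-summary structure and placement codes are hypothetical, not the authors' implementation:

#include <stdbool.h>

#define MAX_PROCS 32

typedef struct {
    bool read_by[MAX_PROCS];     /* processes that read this chunk  */
    bool written_by[MAX_PROCS];  /* processes that write this chunk */
} chunk_access_t;

typedef enum {
    PLACE_SINGLE,     /* one accessor: its local partition            */
    PLACE_REPLICATE,  /* read-only sharing: copy to all reader disks  */
    PLACE_BROADCAST,  /* one writer, many readers: replicate + update */
    PLACE_SHARED      /* multiple writers: shared partition           */
} placement_t;

placement_t classify_chunk(const chunk_access_t *c, int *owner)
{
    int writers = 0, accessors = 0;
    for (int p = 0; p < MAX_PROCS; p++) {
        writers += c->written_by[p];
        if (c->read_by[p] || c->written_by[p]) {
            accessors++;
            *owner = p;   /* meaningful only when accessors == 1 */
        }
    }
    if (accessors == 1) return PLACE_SINGLE;
    if (writers == 0)   return PLACE_REPLICATE;
    if (writers == 1)   return PLACE_BROADCAST;
    return PLACE_SHARED;
}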
15
Parallel I/O Workloads
  • NAS Parallel Benchmarks (NPB 2.4)/BT
  • Computational fluid dynamics
  • Generates a file (1.6 GB) dynamically and then
    reads it back
  • Writes/reads sequentially in chunk sizes of 2040
    Bytes
  • SPEChpc96/seismic
  • Seismic processing
  • Generates a file (1.5 GB) dynamically and then
    reads it back
  • Writes sequential chunks of 96 KB and reads
    sequential chunks of 2 KB
  • Tile-IO
  • From the Parallel I/O Benchmarking Consortium
  • Tiled access to a two-dimensional matrix (1 GB)
    with overlap
  • Writes/reads sequential chunks of 32 KB, with
    2 KB of overlap
  • Perf
  • Parallel I/O test program within MPICH
  • Writes a 1 MB chunk at a location determined by
    rank, no overlap
  • Mandelbrot
  • An image processing application that includes
    visualization
  • Chunk size is dependent on the number of processes

16
Beowulf Cluster
[Diagram: P2-350MHz compute nodes, each with a local
PCI-IDE disk, connected through a 10/100Mb Ethernet
switch to two RAID nodes]
17
Hardware Specifics
  • DAS configuration
  • Linux box, Western Digital WD800BB (IDE), 80GB,
    7200RPM
  • Beowulf cluster (base configuration)
  • Fast Ethernet, 100 Mbits/sec
  • Network-attached RAID: Morstor TF200 with 6 x 9GB
    Seagate SCSI disks, 7200RPM, RAID-5
  • Locally attached IDE disks: IBM UltraATA-350840,
    5400RPM
  • Fibre Channel disks: Seagate Cheetah X15
    ST-336752FC, 15000RPM

18
Write/Read Bandwidth
[Charts: NPB 2.4/BT and SPEChpc96/seismic]
19
Write/Read Bandwidth
[Charts: MPI-Tile-IO, Perf, and Mandelbrot]
20
(No Transcript)
21
Profile training sensitivity analysis
  • We have found that I/O access patterns are
    independent of file-based data values
  • When we increase the problem size or reduce the
    number of processes, either
  • the number of I/Os increases, but access patterns
    and chunk size remain the same (SPEChpc96,
    Mandelbrot), or
  • the number of I/Os and the I/O access patterns
    remain the same, but the chunk size increases
    (NPB/BT, Tile-IO, Perf)
  • Re-profiling can therefore be avoided

22
Execution-driven Parallel I/O Modeling
  • Growing need to process large, complex datasets
    in high performance parallel computing
    applications
  • Efficient implementation of storage architectures
    can significantly improve system performance
  • Goal: an accurate simulation environment in which
    users can test and evaluate different storage
    architectures and applications

23
Execution-driven I/O Modeling
  • Target applications: parallel scientific programs
    (MPI)
  • Target machine/host machine: Beowulf clusters
  • Use DiskSim as the underlying disk drive
    simulator
  • Direct execution to model CPU and network
    communication
  • We execute the real parallel I/O accesses while
    calculating the simulated I/O response time
    (see the sketch below)
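
Conceptually, direct execution pairs every real I/O call with a simulated service time, as in the sketch below; simulated_service_time() is a hypothetical stand-in for the hand-off to DiskSim:

#define _XOPEN_SOURCE 500  /* for pread */
#include <unistd.h>

static double sim_clock = 0.0;  /* simulated time, in seconds */

/* Hypothetical stand-in for the simulator callout; a real integration
   would pass the request to DiskSim and use its modeled service time. */
static double simulated_service_time(off_t off, size_t n, int is_write)
{
    (void)off; (void)is_write;
    return 5e-3 + (double)n / 40e6;  /* toy model: seek + transfer */
}

/* Direct execution: perform the real read so the application still
   computes correct results, but charge the simulated latency. */
ssize_t traced_read(int fd, void *buf, size_t n, off_t off)
{
    ssize_t got = pread(fd, buf, n, off);
    sim_clock += simulated_service_time(off, n, 0);
    return got;
}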

24
Validation: Synthetic I/O Workload on DAS
25
Simulation Framework: NAS
26
(No Transcript)
27
Simulation Framework: SAN direct
  • A variant of SAN in which disks are distributed
    across the network and each server is directly
    connected to a single device
  • File partitioning
  • Utilize I/O profiling and data partitioning
    heuristics to distribute portions of files to
    disks close to the processing nodes

28
(No Transcript)
29
Hardware Specifications
30
(No Transcript)
31
(No Transcript)
32
Publications
  • Profile-guided File Partitioning on Beowulf
    Clusters, Journal of Cluster Computing, Special
    Issue on Parallel I/O, to appear 2005.
  • Execution-Driven Simulation of Network Storage
    Systems, Proceedings of the 12th ACM/IEEE
    International Symposium on Modeling, Analysis and
    Simulation of Computer and Telecommunication
    Systems (MASCOTS), October 2004, pp. 604-611.
  • Profile-Guided I/O Partitioning, Proceedings of
    the 17th ACM International Symposium on
    Supercomputing, June 2003, pp. 252-260.
  • Source Level Transformations to Apply I/O Data
    Partitioning, Proceedings of the IEEE Workshop
    on Storage Network Architecture And Parallel IO,
    Oct. 2003, pp. 12-21.
  • Profile-Based Characterization and Tuning for
    Subsurface Sensing and Imaging Applications,
    International Journal of Systems, Science and
    Technology, September 2002, pp. 40-55.

33
Summary of Cluster-based Work
  • Many imaging applications are dominated by
    file-based I/O
  • Parallel systems can only be effectively utilized
    if I/O is also parallelized
  • Developed a profile-guided approach to I/O data
    partitioning
  • Impacting clinical trials at MGH
  • Reduced overall execution time by 27-82% over
    MPI-IO
  • Execution-driven I/O model is highly accurate and
    provides significant modeling flexibility

34
Outline
  • Motivation to study file-based I/O
  • Profile-driven partitioning for parallel file I/O
  • I/O Qualification Laboratory @ NU
  • Areas for future work

35
I/O Qualification Laboratory
  • Working with the Enterprise Strategy Group
  • Develop a state-of-the-art facility to provide
    independent performance qualification of
    Enterprise Storage systems
  • Provide a quarterly report to the ES customer
    base on the status of current ES offerings
  • Work with leading ES vendors to provide them with
    custom early performance evaluation of their beta
    products

36
I/O Qualification Laboratory
  • Contacted by IOIntegrity and SANGATE for product
    qualification
  • Developed relationships with potential partners
    that are leaders in the ES field
  • Initial proposals already reviewed by IBM,
    Hitachi and other ES vendors
  • Looking for initial endorsement from industry

37
I/O Qualification Laboratory
  • Why @ NU?
  • Track record with industry (EMC, IBM, Sun)
  • Experience with benchmarking and I/O
    characterization
  • Interesting set of applications (medical,
    environmental, etc.)
  • Great opportunity to work within the cooperative
    education model

38
Outline
  • Motivation to study file-based I/O
  • Profile-driven partitioning for parallel file I/O
  • I/O Qualification Laboratory @ NU
  • Areas for future work

39
Areas for Future Work
  • Designing a Peer-to-Peer storage system on a Grid
    system by partitioning datasets across
    geographically distributed storage devices

[Diagram: head nodes of two geographically distributed clusters connected across the Grid]
40
(No Transcript)
41
Areas for Future Work
  • Reduce simulation time by identifying
    characteristic phases in I/O workloads
  • Apply machine learning algorithms to identify
    clusters of representative I/O behavior
  • Utilize K-Means and Multinomial clustering to
    obtain high fidelity in simulation runs that use
    sampled I/O behavior (a k-means sketch follows)
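
A minimal k-means over per-interval I/O feature vectors might look like the following (the feature set, k, and centroid seeding are assumptions; the Multinomial model in the paper below is a different, probabilistic formulation):

#include <float.h>
#include <string.h>

#define D 3  /* features per interval, e.g., read fraction,
                mean chunk size, request rate (illustrative) */

static double dist2(const double *a, const double *b)
{
    double s = 0.0;
    for (int j = 0; j < D; j++)
        s += (a[j] - b[j]) * (a[j] - b[j]);
    return s;
}

/* x: n feature vectors; c: k centroids, seeded by the caller;
   label: output cluster index for each vector. */
void kmeans(const double (*x)[D], int n, double (*c)[D], int k,
            int *label, int iters)
{
    for (int it = 0; it < iters; it++) {
        /* Assignment step: nearest centroid for each interval. */
        for (int i = 0; i < n; i++) {
            double best = DBL_MAX;
            for (int j = 0; j < k; j++) {
                double d = dist2(x[i], c[j]);
                if (d < best) { best = d; label[i] = j; }
            }
        }
        /* Update step: recompute each centroid as its cluster mean. */
        double sum[k][D];
        int cnt[k];
        memset(sum, 0, sizeof sum);
        memset(cnt, 0, sizeof cnt);
        for (int i = 0; i < n; i++) {
            cnt[label[i]]++;
            for (int j = 0; j < D; j++)
                sum[label[i]][j] += x[i][j];
        }
        for (int j = 0; j < k; j++)
            if (cnt[j] > 0)
                for (int d = 0; d < D; d++)
                    c[j][d] = sum[j][d] / cnt[j];
    }
}

Simulating only one representative interval per cluster, weighted by cluster size, is what makes the sampled runs fast.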

A Multinomial Clustering Model for Fast
Simulation of Architecture Designs, submitted to
the 2005 ACM KDD Conference.