Title: Parallelism and Distributed Applications


1
Parallelism and Distributed Applications
  • Daniel S. Katz
  • Director, Cyberinfrastructure and User Services,
    Center for Computation & Technology
  • Associate Research Professor, Electrical and
    Computer Engineering Department

2
Context
  • Scientific/Engineering applications
  • Complex, multi-physics, multiple time scales,
    multiple spatial scales
  • Physics components
  • Elements such as I/O, solvers, etc.
  • Computer Science components
  • Parallelism across components
  • Parallelism within components, particularly
    physics components
  • Goal: efficient application execution on both
    parallel and distributed platforms
  • Goal: simple, reusable programming

3
Types of Systems
  • A lot of levels/layers to be aware of
  • Individual computers
  • Many layers of memory hierarchy
  • Multi-core -> many-core CPUs
  • Clusters
  • Used to be reasonably-tightly coupled computers
    (1 CPU per node) or SMPs (multiple CPUs per node)
  • Grids: elements
  • Individual computers
  • Clusters
  • Networks
  • Instruments
  • Data stores
  • Visualization systems
  • Etc

4
Types of Applications
  • Applications can be broken up into pieces
    (components)
  • Size (granularity) and relationship of pieces is
    key
  • Fairly large pieces, no dependencies
  • Parameter sweeps, Monte Carlo analysis, etc.
  • Fairly large pieces, some dependencies
  • Multi-stage applications - PHOEBUS
  • Workflow applications - Montage
  • Data grid apps?
  • Large pieces, tight dependencies (coupling,
    components?)
  • Distributed viz, coupled apps - Climate
  • Small pieces, no dependencies
  • Small pieces, some dependencies
  • Dataflow?
  • Small pieces, tight dependencies
  • MPI apps
  • Hybrids?

5
Parallelism within programs
  • Initial parallelism: bitwise/vector (SIMD)
  • Highly computational tasks often contain
    substantial amounts of concurrency. At LLL the
    majority of these programs use very large,
    two-dimensional arrays in a cyclic set of
    instructions. In many cases, all new array values
    could be computed simultaneously, rather than
    stepping through one position at a time. To
    date, vectorization has been the most effective
    scheme for exploiting this concurrency. However,
    pipelining and independent multiprocessing forms
    of concurrency are also available in these
    programs, but neither the hardware nor the
    software exist to make it workable. (James R.
    McGraw, Data Flow Computing: The VAL Language,
    MIT Computational Structures Group Memo 188,
    1980)
  • Westinghouse's Solomon introduced vector
    processing, early 1960s
  • Continued in ILLIAC IV, 1970s
  • Goodyear MPP, 128x128 array of 1-bit processors,
    1980
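
To make the quoted observation concrete, here is a minimal
sketch (illustrative, not taken from any LLL code) of the kind
of cyclic two-dimensional array update McGraw describes: every
new value in a sweep is independent of the others, so the inner
loop can be vectorized (SIMD).

    /* Hedged sketch: all new array values in one sweep can be computed
       simultaneously, so a vectorizing compiler can exploit the concurrency.
       Array names and sizes are illustrative. */
    #define NX 1024
    #define NY 1024

    void sweep(double newa[NX][NY], double olda[NX][NY])
    {
        for (int i = 1; i < NX - 1; i++) {
            /* every iteration of this inner loop is independent: SIMD-friendly */
            for (int j = 1; j < NY - 1; j++) {
                newa[i][j] = 0.25 * (olda[i - 1][j] + olda[i + 1][j] +
                                     olda[i][j - 1] + olda[i][j + 1]);
            }
        }
    }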

6
Unhappy with your programming model?
7
Parallelism across programs
  • Co-operating Sequential Processes (CSP) - E. W.
    Dijkstra, The Structure of the THE-Multiprogramming
    System, 1968
  • We have given full recognition to the fact that
    in a single sequential process only the time
    succession of the various states has a logical
    meaning, but not the actual speed with which the
    sequential process is performed. Therefore we
    have arranged the whole system as a society of
    sequential processes, progressing with undefined
    speed ratios. To each user program corresponds
    a sequential process
  • This enabled us to design the whole system in
    terms of these abstract "sequential processes".
    Their harmonious co-operation is regulated by
    means of explicit mutual synchronization
    statements. The fundamental consequence of this
    approach is that the harmonious co-operation of
    a set of such sequential processes can be
    established by discrete reasoning; as a further
    consequence, the whole harmonious society of
    co-operating sequential processes is independent
    of the actual number of processors available to
    carry out these processes, provided the
    processors available can switch from process to
    process.
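
A minimal sketch of Dijkstra's idea in modern terms
(illustrative, not from the slides): two sequential processes,
realized here as POSIX threads, co-operate only through an
explicit synchronization statement (a semaphore), so their
relative speeds do not matter.

    /* Two "sequential processes" whose harmonious co-operation is regulated
       by explicit synchronization, independent of their actual speeds. */
    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    static sem_t data_ready;
    static int shared_value;

    static void *producer(void *arg)
    {
        (void)arg;
        shared_value = 42;        /* produce a result            */
        sem_post(&data_ready);    /* explicit synchronization    */
        return NULL;
    }

    static void *consumer(void *arg)
    {
        (void)arg;
        sem_wait(&data_ready);    /* proceed only after producer */
        printf("consumed %d\n", shared_value);
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;
        sem_init(&data_ready, 0, 0);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        sem_destroy(&data_ready);
        return 0;
    }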

8
Parallelism within programs (2)
  • MIMD
  • Taxonomy from Flynn, 1972
  • Dataflow parallelism
  • The data flow concept incorporates these forms
    of concurrency in one basic graph-oriented
    system. Every computation is represented by a
    data flow graph. The nodes represent
    operations, the directed arcs represent data
    paths. (McGraw, ibid)
  • The ultimate goal of data flow software must be
    to help identify concurrency in algorithms and
    map as much as possible into the graphs. (McGraw,
    ibid)
  • Transputer - 1984
  • programmed in occam
  • Uses CSP formalism, communication through named
    channels
  • MPPs - mid 1980s
  • Explicit message passing (CSP)
  • Other models: actors, Petri nets, ...
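
As a concrete instance of the explicit message-passing style
the MPPs adopted (an illustrative sketch, not code from the
presentation), here is a minimal MPI exchange in C; like
occam's named channels, communication is explicit, but
addressed by process rank rather than by channel.

    /* Minimal MPI sketch: explicit message passing between two processes. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 123;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* explicit send */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                         /* explicit receive */
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }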

9
PHOEBUS
[Figure: mesh partitioned across an MPP machine, and the
resulting system of equations]
  • This matrix problem is filled and solved by
    PHOEBUS
  • The K submatrix is a sparse finite element matrix
  • The Z submatrices are integral equation matrices
  • The C submatrices are coupling matrices between
    the FE and IE equations
  • 1996! - 3 executables, 2 programming models,
    executables run sequentially (block structure
    sketched below)
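
A sketch of the coupled block system the bullets describe (the
exact partitioning and symmetry are assumptions, not taken from
the slide): the sparse finite-element block K and the
integral-equation blocks Z are tied together by the coupling
blocks C, so the FE and IE unknowns appear in a single system,

    \[
    \begin{bmatrix} K & C \\ C^{T} & Z \end{bmatrix}
    \begin{bmatrix} x_{\mathrm{FE}} \\ x_{\mathrm{IE}} \end{bmatrix}
    =
    \begin{bmatrix} b_{\mathrm{FE}} \\ b_{\mathrm{IE}} \end{bmatrix}
    \]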

Credit: Katz, Cwik, Zuffada, Jamnejad
10
Cholesky Factorization
  • SuperMatrix work - Chan and van de Geijn, Univ.
    of Texas, in progress
  • Based on FLAME library
  • Aimed at NUMA systems, OpenMP programming model
  • Initial realization: poor performance of LAPACK
    (w/ multithreaded BLAS) could be fixed by
    choosing a different variant

Credit: Ernie Chan
11
Cholesky Factorization
[Figure: first three iterations of the blocked Cholesky
factorization]
  • Can represent as DAG

Credit: Ernie Chan
12
Cholesky SuperMatrix
  • Execute DAG tasks in parallel, possibly
    out-of-order
  • Similar in concept to Tomasulo's algorithm and
    instruction-level parallelism on blocks of
    computation
  • Superscalar -> SuperMatrix (see the task sketch below)
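
A minimal sketch of the idea (assuming OpenMP 4.0+ task
dependences as the runtime, not the actual SuperMatrix
implementation): the tile kernels below are hypothetical
wrappers around BLAS/LAPACK operations, and the depend clauses
expose the DAG so tasks may execute in parallel, possibly out
of order, as soon as their inputs are ready.

    /* Blocked right-looking Cholesky expressed as a task DAG.  The pointer
       elements A[i][j] (each a b-by-b tile) double as dependence tags. */
    void tile_potrf(double *akk, int b);                    /* hypothetical */
    void tile_trsm(const double *akk, double *aik, int b);  /* hypothetical */
    void tile_syrk(const double *aik, double *aii, int b);  /* hypothetical */
    void tile_gemm(const double *aik, const double *ajk,
                   double *aij, int b);                     /* hypothetical */

    void chol_tasks(int nt, int b, double *A[nt][nt])
    {
        #pragma omp parallel
        {
            #pragma omp single
            {
                for (int k = 0; k < nt; k++) {
                    #pragma omp task depend(inout: A[k][k])
                    tile_potrf(A[k][k], b);              /* factor diagonal tile */

                    for (int i = k + 1; i < nt; i++) {
                        #pragma omp task depend(in: A[k][k]) depend(inout: A[i][k])
                        tile_trsm(A[k][k], A[i][k], b);  /* triangular solve */
                    }
                    for (int i = k + 1; i < nt; i++) {
                        #pragma omp task depend(in: A[i][k]) depend(inout: A[i][i])
                        tile_syrk(A[i][k], A[i][i], b);  /* diagonal update */
                        for (int j = k + 1; j < i; j++) {
                            #pragma omp task depend(in: A[i][k], A[j][k]) depend(inout: A[i][j])
                            tile_gemm(A[i][k], A[j][k], A[i][j], b);
                        }
                    }
                }
            }   /* implicit barrier: waits for all tasks */
        }
    }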

Credit: Ernie Chan
13
Uintah Framework
  • de St. Germain, McCorquodale, Parker, Johnson at
    SCI Institute, Univ. of Utah
  • Based on task graph model
  • Each algorithm defines a description of
    computation
  • Required inputs and outputs
  • Callbacks to perform a task on a single region of
    space
  • Communication performed at graph edges
  • Graph created by Uintah

14
Uintah Tensor Product Task Graph
  • Each task is replicated over regions in space
  • Expresses data parallelism and task parallelism
  • Resulting detailed graph is tensor product of
    master graph and spatial regions
  • Efficient
  • Detailed tasks not replicated on all processors
  • Scalable
  • Control structure known globally
  • Communication structure known locally
  • Dependencies specified implicitly w/ simple
    algebra
  • Spatial dependencies
  • Computes
  • Variable (name, type)
  • Patch subset
  • Requires
  • Variable (name, type)
  • Patch subset
  • Halo specification
  • Other dependencies: AMR, others
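
A hedged sketch (hypothetical C types, not the Uintah API,
which is C++ and considerably richer) of the kind of task
description the bullets outline: what the task computes and
requires, each as a (variable, patch subset) pair plus a halo
specification, and a callback that performs the work on one
region of space.

    /* Hypothetical task-graph node in the spirit of Uintah's
       computes/requires declarations. */
    typedef struct {
        const char *var_name;    /* variable (name, type) */
        const char *var_type;
        const int  *patches;     /* patch subset this dependency covers */
        int         n_patches;
        int         halo_width;  /* halo specification (requires only)  */
    } VarDependency;

    typedef void (*TaskCallback)(int patch_id, void *task_data);

    typedef struct {
        const char    *name;
        VarDependency *requires;  /* inputs, satisfied along graph edges */
        int            n_requires;
        VarDependency *computes;  /* outputs produced on each patch      */
        int            n_computes;
        TaskCallback   callback;  /* performs the task on one region     */
    } TaskDescription;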

Credit: Steve Parker
15
Uintah - How It Works
Credit: Steve Parker
16
Uintah - More Details
  • Task graphs can be complex
  • Can include loops, nesting, recursion
  • Optimal scheduling is NP-hard
  • "Optimal enough" scheduling isn't too hard
  • Creating schedule can be expensive
  • But may not be done too often
  • Overall, good scaling and performance has been
    obtained with this approach

Credit: Steve Parker
17
Applications and Grids
  • How to map applications to grids?
  • Some applications are Grid-unaware - they just
    want to run fast
  • May run on Grid-aware (Grid-enabled?) programming
    environments, e.g. MPICH-G2, MPIg
  • Other apps are Grid-aware themselves
  • This is where SAGA fits in, as an API to permit
    the apps to interact with the middleware

[Diagram: layers - Grid-unaware applications and Grid-aware
applications, Grid-enabled tools/environments, Simple API
(SAGA), middleware, and Grid resources, services, platforms]
Credit: Thilo Kielmann
18
Common Grid Applications
  • Data processing
  • Data exists on the grid, possibly replicated
  • Data is staged to a single set of resources
  • Application starts on that set of resources
  • Parameter sweeps
  • Lots of copies of a sequential/parallel job
    launched on independent resources, with different
    inputs
  • Controlling process starts jobs and gathers
    outputs (a minimal sketch follows below)
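
A minimal sketch of such a controlling process (assuming a
POSIX system and a hypothetical ./worker executable that takes
one parameter and writes its own output file): it simply
launches one job per sweep point and waits to gather them all.

    /* Hedged parameter-sweep controller: fork/exec one worker per value,
       then wait for every child.  "./worker" is a hypothetical executable. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        const int n_jobs = 8;

        for (int i = 0; i < n_jobs; i++) {
            pid_t pid = fork();
            if (pid == 0) {                          /* child: one sweep point */
                char param[32];
                snprintf(param, sizeof param, "%d", i);
                execl("./worker", "worker", param, (char *)NULL);
                perror("execl");                     /* reached only on failure */
                _exit(1);
            }
        }
        for (int i = 0; i < n_jobs; i++) {           /* gather: wait for all */
            int status;
            wait(&status);
        }
        puts("all sweep points finished");
        return 0;
    }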

19
More Common Grid Applications
  • Workflow applications
  • Multiple units of work, either sequential or
    parallel, either small or large
  • Data often transferred between tasks by files
  • Task sequence described as a graph, possibly a
    DAG
  • Abstract graph doesn't include resource
    information
  • Concrete graph does
  • Some process/service converts graph from abstract
    to concrete
  • Often all at once, ahead of job start - static
    mapping
  • Perhaps more gradually (JIT?) - dynamic mapping
  • Pegasus from ISI is an example of this, currently
    static
  • (Note: parameter sweeps are very simple workflows)
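
To make the abstract-versus-concrete distinction more tangible,
here is a hedged C sketch (hypothetical structures, not Pegasus
data types): the abstract task names only logical work and
data, while the concrete task adds the resource binding a
planner has chosen.

    /* Hypothetical sketch of an abstract workflow task and its concrete form. */
    typedef struct {
        const char  *task_id;
        const char  *executable;      /* logical name, no resource information  */
        const char **input_files;     /* logical file names                     */
        const char **output_files;
        const char **parents;         /* DAG edges: tasks that must finish first */
    } AbstractTask;

    typedef struct {
        AbstractTask  abstract;       /* everything above, plus...              */
        const char   *site;           /* resource chosen by the planner         */
        const char   *physical_exe;   /* path of the executable on that site    */
        const char  **staged_inputs;  /* physical replicas selected for staging */
    } ConcreteTask;

    /* A planner maps AbstractTask -> ConcreteTask, either all at once before
       the run (static) or more gradually as resources become known (dynamic). */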

20
Montage - a Workflow App
  • An astronomical image mosaic service for the
    National Virtual Observatory
  • http://montage.ipac.caltech.edu/
  • Delivers custom, science grade image mosaics
  • Image mosaics combine many images so that they
    appear to be a single image from a single
    telescope or spacecraft
  • User specifies projection, coordinates, spatial
    sampling, mosaic size, image rotation
  • Preserves astrometry (to 0.1 pixels) and flux (to
    0.1%)
  • Modular, portable toolbox design
  • Loosely-coupled engines
  • Each engine is an executable compiled from ANSI C

[Image: 100 µm sky, an aggregation of COBE and IRAS maps
(Schlegel, Finkbeiner and Davis, 1998), covering
360 x 180 degrees in CAR projection]
[Image: Supernova remnant S147, from IPHAS, the INT/WFC
Photometric H-alpha Survey of the Northern Galactic Plane]
[Image: David Hockney, Pearblossom Highway, 1986]
21
Montage Workflow
22
Montage on the Grid Using Pegasus (Planning for
Execution on Grids)
23
Montage Performance
  • MPI version on a single cluster is baseline
  • Grid version on a single cluster has similar
    performance for large problems
  • Grid version on multiple clusters has performance
    dominated by data transfer between stages

24
Workflow Application Issues
  • Apps need to map processing to clusters
  • Depending on mapping, various data movement is
    needed, so the mapping either leads to networking
    requirements or is dependent on the available
    networking
  • Prediction (and mapping) needs some intelligence
  • One way to do this is through Pegasus, which
    currently does static mapping of an abstract
    workflow to a concrete workflow, but will do more
    dynamic mapping at some future point
  • Networking resources and availability could be
    inputs to Pegasus, or Pegasus could be used to
    request network resources at various times during
    a run.

25
Making Use of Grids
  • In general, groups of users (communities) want to
    run applications
  • Code/User/Infrastructure is aware of environment
    and does:
  • Discover resources available now (or perhaps
    later)
  • Start my application
  • Have access to data and storage
  • Monitor and possibly steer the application
  • Other things that could be done
  • Migrate app to faster resources that are now
    available
  • Recover from hardware failure by continuing with
    fewer processors or by restarting from checkpoint
    on different resources
  • Use networks as needed (reserve them for these
    times)

Credit: Thilo Kielmann and Gabrielle Allen
26
Less Common Grid Applications
  • True distributed MPI application over multiple
    resources/clusters
  • Other applications that use multiple coupled
    clusters
  • Uncommon because these jobs run poorly without
    sufficient network bandwidth, and there has been
    no good way for users to reserve bandwidth when
    needed

27
SPICE
  • Used for analyzing RNA translocation through
    protein pores
  • Using standard molecular dynamics would need
    millions of CPU hours
  • Instead, use Steered Molecular Dynamics and
    Jarzynski's equation (SMD-JE)
  • Uses static visualization to understand
    structural features
  • Uses interactive simulations to determine
    near-optimal parameters
  • Uses Haptic interaction - requires low-latency
    bi-directional communication between user and
    simulation
  • Uses near-optimal parameters and many large
    parallel simulations to determine optimal
    parameters
  • 75 simulations on 128/256 processors
  • Uses optimal parameters to calculate full free
    energy profile along axis of pore
  • 100 simulations on 2500 processors

Credit: Shantenu Jha, et al.
28
NEKTAR
  • Simulates arterial blood flow
  • Uses hybrid approach
  • 3D detailed CFD computed at bifurcations
  • Waveform coupling between bifurcations modeled w/
    reduced set of 1D equations
  • 55 largest arteries in human body w/ 27
    bifurcations would require about 7 TB memory
  • Parallelized across and within clusters

Credit: Shantenu Jha, et al.
29
Cactus
  • Freely available, modular, portable and
    manageable environment for collaboratively
    developing parallel, high-performance
    multi-dimensional simulations (components-based)
  • Developed for numerical relativity, but now
    general framework for parallel computing (CFD,
    astro, climate, chem. eng., quantum gravity,
    etc.)
  • Finite difference, AMR, FE/FV, multipatch
  • Active user and developer communities, main
    development now at LSU and AEI
  • Science-driven design issues
  • Open source, documentation, etc.
  • Just over 10 years old

Credit: Gabrielle Allen
30
Cactus Structure
[Diagram: Core Flesh (ANSI C, extensible APIs) connected to
Plug-In Thorns (modules, Fortran/C/C++); labeled elements
include remote steering, parameters, driver, scheduling,
equations of state, input/output, error handling, black holes,
interpolation, make system, boundary conditions, grid
variables, SOR solver, coordinates, multigrid, wave evolvers]
Credit: Gabrielle Allen
31
Cactus and Grids
  • HTTPD thorn, allows web browser to connect to
    running simulation, examine state of running
    simulation, change parameters
  • Worm thorn, makes Cactus app self-migrating
  • Spawner thorn, any routine can be done on another
    resource
  • TaskFarm thorn, allows distribution of apps on the Grid
  • Run a single app using distributed MPI

Credit: Gabrielle Allen, Erik Schnetter
32
EnLIGHTened
  • Network research, driven by concrete application
    projects, all of which critically require
    progress in network technologies and tools that
    utilize them
  • EnLIGHTened testbed 10 Gbps optical networks
    running over NLR. Four all-photonic Calient
    switches are interconnected via Louisiana Optical
    Network Initiative (LONI), EnLIGHTened wave, and
    the Ultralight wave, all using GMPLS control
    plane technologies.
  • Global alliance of partners
  • Will develop, test, and disseminate advanced
    software and underlying technologies to
  • Provide generic applications with the ability to
    be aware of their network, Grid environment and
    capabilities, and to make dynamic, adaptive and
    optimized use (monitor, abstract, request,
    control) of networks connecting various high-end
    resources
  • Provide vertical integration from the application
    to the optical control plane, including extending
    GMPLS
  • Will examine how to distribute the network
    intelligence among the network control plane,
    management plane, and the Grid middleware

33
EnLIGHTened Team
  • Savera Tanwir
  • Harry Perros
  • Mladen Vouk
  • Yufeng Xin
  • Steve Thorpe
  • Bonnie Hurst
  • Joel Dunn
  • Gigi Karmous-Edwards
  • Mark Johnson
  • John Moore
  • Carla Hunt
  • Lina Battestilli
  • Andrew Mabe
  • Ed Seidel
  • Gabrielle Allen
  • Seung Jong Park
  • Jon MacLaren
  • Andrei Hutanu
  • Lonnie Leger
  • Dan Katz
  • Olivier Jerphagnon
  • John Bowers
  • Steven Hunter
  • Rick Schlichting
  • John Strand
  • Matti Hiltunen
  • Javad Boroumand
  • Russ Gyurek
  • Wayne Clark
  • Kevin McGrattan
  • Peter Tompsu
  • Yang Xia
  • Xun Su
  • Dan Reed
  • Alan Blatecky
  • Chris Heermann

34
EnLIGHTened Testbed
[Map: EnLIGHTened testbed - EnLIGHTened wave (Cisco/NLR),
Cisco/UltraLight wave, LONI wave, and CAVE wave over NLR,
linking sites including SEA, POR, BOI, SVL, OGD, DEN, TUL,
KAN, DAL, HOU, CHI, CLE, PIT, WDC, and VCL @ NCSU, with
connections to Canada, Asia, and Europe]
  • International Partners
  • Phosphorus - EC
  • G-lambda - Japan
  • GLIF
  • Members
  • MCNC GCNS
  • LSU CCT
  • NCSU
  • RENCI
  • Official Partners
  • AT&T Research
  • SURA
  • NRL
  • Cisco Systems
  • Calient Networks
  • IBM

  • NSF Project Partners
  • OptIPuter
  • UltraLight
  • DRAGON
  • Cheetah

35
HARC: Highly Available Robust Co-allocator
  • Extensible, open-sourced co-allocation system
  • Can already reserve
  • Time on supercomputers (advance reservation), and
  • Dedicated paths on GMPLS-based networks with
    simple topologies
  • Uses Paxos Commit to atomically reserve multiple
    resources, while providing a highly-available
    service
  • Used to coordinate bookings across EnLIGHTened
    and G-lambda testbeds in largest demonstration of
    its kind to date (more later)
  • Used for setting up the network for Thomas
    Sterling's HPC class, which goes out live in HD
    (more later)
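
A simplified sketch of the co-allocation idea (hypothetical
resource-manager calls; HARC itself uses Paxos Commit so that
the decision survives coordinator failure, which this naive
version does not): tentatively reserve every resource, and
confirm only if all of them accept.

    /* Naive all-or-nothing co-allocation across multiple resources. */
    #include <stdbool.h>

    bool reserve_prepare(int resource, long start, long duration);  /* hypothetical */
    void reserve_commit(int resource);                              /* hypothetical */
    void reserve_abort(int resource);                               /* hypothetical */

    bool coallocate(const int *resources, int n, long start, long duration)
    {
        int prepared = 0;
        for (int i = 0; i < n; i++) {
            if (!reserve_prepare(resources[i], start, duration))
                break;                        /* one refusal dooms the booking */
            prepared++;
        }
        if (prepared == n) {
            for (int i = 0; i < n; i++)
                reserve_commit(resources[i]); /* everyone accepted: confirm    */
            return true;
        }
        for (int i = 0; i < prepared; i++)
            reserve_abort(resources[i]);      /* roll back the partial booking */
        return false;
    }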

Credit: Jon MacLaren
36
[Diagram: co-allocated demo spanning US and Japan resources -
an MPI application coupled to a visualization application]
37
Data grid applications
  • Remote visualization
  • Data is somewhere, needs to flow quickly and
    smoothly to a visualization app
  • Data could be simulation results, or measured data

38
Distributed Viz/Collaboration
  • iGrid 2005 demo
  • Visualization at LSU
  • Interaction among San Diego, LSU, Brno
  • Data on remote LONI machines

39
Video for visualization
  • But also for videoconference between the three
    sites
  • 1080i (1920x1080, 60fps interlaced)
  • 1.5 Gbps per unidirectional stream, 4.5 Gbps per
    site (two incoming streams, one outgoing)
  • Jumbo frames (9000 bytes), Layer 2 lossless (more
    or less) dedicated network
  • Hardware capture
  • DVS Centaurus (HD-SDI), DVI -> HD-SDI
    converter from Doremi

Credit: Andrei Hutanu
40
Hardware setup one site
Credit: Andrei Hutanu
41
Video distribution
  • Done in software (multicast not up to speed,
    optical multicast complicated to set up). Can do
    1-to-4 distribution with high-end Opteron
    workstations.
  • HD class 1-to-n
  • Only one stream is distributed - the one showing
    the presenter (Thomas Sterling) - others are just
    to LSU

42
Data analysis
  • Future scenario motivated by increases in network
    speed
  • Simulations' ability to store all results locally
    is limited
  • Downsampling the output, not storing all data
  • Use remote (distributed, possibly virtual)
    storage
  • Can store all data
  • This will enable new types of data analysis

Credit: Andrei Hutanu
43
Components
  • Storage
  • high-speed distributed file systems or virtual
    RAM disks
  • potential use cases: global checkpointing
    facility; data analysis using the data from this
    storage
  • distribution could be determined by the
    analysis routines
  • Data access
  • Various data selection routines gather data
    from the distributed storage elements (storage
    supports app-specific operations)
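
A hedged sketch (hypothetical interface, not an existing
system) of a storage element that supports application-specific
selection, so analysis codes gather only the subset of the
distributed data they need.

    /* Hypothetical selection interface for a distributed storage element. */
    #include <stddef.h>

    typedef struct {
        double xmin, xmax, ymin, ymax, zmin, zmax;  /* region of interest   */
        int    stride;                              /* take every Nth point */
    } Selection;

    typedef struct StorageElement StorageElement;   /* opaque handle */

    /* Gather only the selected subset of a named dataset into buf;
       returns bytes written, or a negative value on error. */
    long storage_select(StorageElement *se, const char *dataset,
                        const Selection *sel, void *buf, size_t buf_size);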

Credit: Andrei Hutanu
44
More Components
  • Data transport
  • Components of the storage are connected by
    various networks. May need to use different
    transport protocols
  • Analysis (visualization or numerical analysis)
  • Initially single-machine but can also be
    distributed
  • Data source
  • computed in advance and preloaded on the
    distributed storage initially
  • or live streaming from the distributed simulation

Credit: Andrei Hutanu
45
Conclusions
  • Applications exist where infrastructure exists
    that enables them
  • Very few applications (and application authors)
    can afford to get ahead of the infrastructure
  • We can run the same (grid-unaware) applications
    on more resources
  • Perhaps add features such as fault tolerance
  • Use SAGA to help here?

46
SAGA
  • Intent: SAGA is to grid apps what MPI is to
    parallel apps
  • Questions/metrics
  • Does SAGA enable rapid development of new apps?
  • Does it allow complex apps with less code?
  • Is it used in libraries?
  • Roots: RealityGrid (ReG Steering Library),
    GridLab (GAT), and others came together at GGF
  • Strawman API
  • Uses SIDL (from Babel, CCA)
  • Language independent spec.
  • OO base design - can adapt to procedural
    languages
  • Status
  • Started between GGF 11 and GGF 12 (July/Aug 2004)
  • Draft API submitted to OGF early Oct. 2006
  • Currently, responding to comments

47
More Conclusions
  • Infrastructure is getting better
  • Middleware developers are working on some of the
    right problems
  • If we want to keep doing the same things better
  • And add some new things (grid-aware apps)
  • Web 3.1 is coming soon
  • We're not driving the distributed computing world
  • Have to keep trying new things