Parallel and Distributed Programming Models and Languages

Provided by: dong152
Learn more at: https://cs.login.cmu.edu

1
Parallel and Distributed Programming Models and Languages
  • 15-740/18-740 Computer Architecture
  • In-Class Discussion
  • Dong Zhou
  • Kun Li
  • Mike Ralph

2
Why distributed computations?
  • Buzzword: Big Data
  • Take sorting as an example: amount of data that can be sorted in 60 seconds
  • One computer can read 60 MB/sec from one disk
  • 2012 world record: 1470 GB, by Flat Datacenter Storage (Ed Nightingale et al.)
  • 256 heterogeneous nodes, 1033 disks
  • Google indexes 100 billion web pages
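
The figures on this slide can be turned into a quick back-of-the-envelope comparison. The sketch below only rearranges the slide's numbers; decimal units (1 GB = 1000 MB) and the variable names are assumptions, and the per-disk figure is an average over the sorted volume, not a measured disk throughput.

```python
# Numbers taken from the slide; decimal units (1 GB = 1000 MB) are assumed.
DISK_MB_PER_SEC = 60   # one computer reading from one disk
WINDOW_SEC = 60        # length of the sorting window
RECORD_GB = 1470       # 2012 record (Flat Datacenter Storage)
RECORD_DISKS = 1033    # disks used for the record

# Data one computer can even read from one disk within the window.
single_disk_gb = DISK_MB_PER_SEC * WINDOW_SEC / 1000

# Aggregate rate of the record run, and the average share per disk.
aggregate_gb_per_sec = RECORD_GB / WINDOW_SEC
per_disk_mb_per_sec = RECORD_GB * 1000 / WINDOW_SEC / RECORD_DISKS

# How many times more data the record sorted than one disk can read.
ratio = RECORD_GB / single_disk_gb

print(f"one disk, 60 s: {single_disk_gb:.1f} GB")
print(f"record run:     {aggregate_gb_per_sec:.1f} GB/s in aggregate")
print(f"per disk:       {per_disk_mb_per_sec:.1f} MB/s on average")
print(f"record / single disk: {ratio:.0f}x")
```

A single disk caps out at a few gigabytes per minute, so a result of this size is only reachable by spreading the work over many nodes and disks.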

3
Solution: use many nodes
  • Grid computing
  • Hundreds of supercomputers connected by a
    high-speed network
  • Cluster computing
  • Thousands or tens of thousands of PCs connected
    by high-speed LANs
  • 1000 nodes potentially give 1000x speedup

4
Distributed computations are difficult to program
  • Sending data to/from nodes
  • Coordinating among nodes
  • Recovering from node failure
  • Optimizing for locality
  • Debugging

5
MapReduce
  • A programming model for large-scale computations
  • Process large amounts of input, produce output
  • No side-effects or persistent state
  • MapReduce is implemented as a runtime library
  • Automatic parallelization
  • Load balancing
  • Locality optimization
  • Handling of machine failures

6
MapReduce design
  • Input data is partitioned into M splits
  • Map: extract information from each split
  • Each map produces R partitions
  • Shuffle and sort
  • Bring the matching partition from each of the M maps to the same reducer
  • Reduce: aggregate, summarize, filter, or transform
  • Output is in R result files
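
The design above can be sketched as a single-process simulation. This is a minimal sketch, not the actual runtime: `run_mapreduce`, the round-robin split, the CRC32-based partition choice, and the temperature data are all assumptions made for illustration.

```python
from collections import defaultdict
import zlib


def run_mapreduce(records, map_fn, reduce_fn, M=3, R=2):
    """Simulate map / shuffle-and-sort / reduce in one process."""
    # 1. Partition the input into M splits (round-robin is an assumption).
    splits = [records[i::M] for i in range(M)]

    # 2. Each map task writes its output into R partitions.
    map_outputs = []
    for split in splits:
        partitions = [[] for _ in range(R)]
        for key, value in split:
            for k2, v2 in map_fn(key, value):
                r = zlib.crc32(str(k2).encode("utf-8")) % R
                partitions[r].append((k2, v2))
        map_outputs.append(partitions)

    # 3. Shuffle and sort: reducer r collects partition r from every map task.
    result_files = []
    for r in range(R):
        grouped = defaultdict(list)
        for partitions in map_outputs:
            for k2, v2 in partitions[r]:
                grouped[k2].append(v2)
        # 4. Reduce each key group; one "result file" (a list) per reducer.
        result_files.append(
            [(k2, reduce_fn(k2, grouped[k2])) for k2 in sorted(grouped)]
        )
    return result_files


# Hypothetical input: (station id, "city,temperature") records.
records = [
    ("s1", "Pittsburgh,12"),
    ("s2", "Seattle,9"),
    ("s3", "Pittsburgh,15"),
    ("s4", "Seattle,11"),
    ("s5", "Boston,7"),
]


def map_fn(key, value):
    city, temp = value.split(",")
    yield city, int(temp)


def reduce_fn(key, values):
    return max(values)


result_files = run_mapreduce(records, map_fn, reduce_fn, M=3, R=2)
merged = dict(pair for result in result_files for pair in result)
print(merged)
```

Every key lands in exactly one of the R result lists, which is why the outputs can be merged without resolving conflicts.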

7
More specifically
  • Programmer specifies two methods
  • map(k, v) → <k', v'>
  • reduce(k', <v'>) → <k'', v''>
  • All v' with same k' are reduced together
  • Usually also specify
  • partition(k', total partitions) → partition for
    k'
  • often a simple hash of the key
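
A partition function of the "simple hash of the key" kind might look like the sketch below. The use of CRC32 is an assumption chosen because it is deterministic across processes; it is not taken from the slides.

```python
import zlib


def partition(key: str, total_partitions: int) -> int:
    """Map an intermediate key k' to a reducer index in [0, total_partitions)."""
    # zlib.crc32 gives the same value in every process. Python's built-in
    # hash() of a str is salted per process, so different workers could
    # disagree about where a key belongs.
    return zlib.crc32(key.encode("utf-8")) % total_partitions


R = 4  # total partitions (one per reducer); an illustrative value
keys = ["to", "be", "or", "not", "to", "be"]
assignment = {k: partition(k, R) for k in keys}
print(assignment)
```

Because the function depends only on the key, all values for the same k' reach the same reducer regardless of which map task emitted them.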

8
Runtime
9
MapReduce is widely applicable
  • Distributed grep
  • Distributed clustering
  • Web link graph reversal
  • Detecting approx. duplicate web pages
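
As one illustration of these applications, web link graph reversal fits the model with very little code: map emits a (target, source) pair for every link, and reduce collects the sources per target. The sketch below runs in one process; the function names and the sample pages are hypothetical.

```python
from collections import defaultdict


def map_links(source_url, outgoing_links):
    """Emit (target, source) for every link found on the source page."""
    for target in outgoing_links:
        yield target, source_url


def reduce_links(target_url, sources):
    """Collect all pages that link to target_url."""
    return sorted(sources)


def reverse_link_graph(pages):
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for source, links in pages.items():
        for target, src in map_links(source, links):
            grouped[target].append(src)
    return {t: reduce_links(t, s) for t, s in grouped.items()}


# Hypothetical pages and their outgoing links.
pages = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
}
reversed_graph = reverse_link_graph(pages)
print(reversed_graph)
```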

10
Dryad
  • Similar goals as MapReduce
  • Focus on throughput, not latency
  • Automatic management of scheduling, distribution,
    fault tolerance
  • Computations expressed as a graph
  • Vertices are computations
  • Edges are communication channels
  • Each vertex has several input and output edges
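
The graph model can be sketched with plain Python data structures: vertices are functions, edges record which outputs a vertex consumes, and the job runs in dependency order. This is an illustrative sketch, not Dryad's API; the vertex names and `run_job` are assumptions, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Vertices are computations: each takes a dict of its inputs, keyed by
# the name of the upstream vertex, and returns its output.
vertices = {
    "read_a": lambda inputs: [3, 1, 2],
    "read_b": lambda inputs: [6, 5, 4],
    "sort_a": lambda inputs: sorted(inputs["read_a"]),
    "sort_b": lambda inputs: sorted(inputs["read_b"]),
    "merge": lambda inputs: sorted(inputs["sort_a"] + inputs["sort_b"]),
}

# Edges are channels: each vertex maps to the vertices it reads from.
edges = {
    "sort_a": {"read_a"},
    "sort_b": {"read_b"},
    "merge": {"sort_a", "sort_b"},
}


def run_job(vertices, edges):
    """Run every vertex once all of its upstream vertices have finished."""
    outputs = {}
    for name in TopologicalSorter(edges).static_order():
        inputs = {pred: outputs[pred] for pred in edges.get(name, ())}
        outputs[name] = vertices[name](inputs)
    return outputs


outputs = run_job(vertices, edges)
print(outputs["merge"])
```

Here the two sort vertices have no edge between them, so a real scheduler would be free to run them on different machines at the same time.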

11
Why use a dataflow graph?
  • Many programs can be represented as a distributed
    dataflow graph
  • The programmer may not have to know this
  • SQL-like queries: LINQ
  • Dryad will run them for you

12
Runtime
  • Vertices (V) run arbitrary app code
  • Vertices exchange data through files, TCP pipes, etc.
  • Vertices communicate with the JM to report status
  • Daemon process (D) executes vertices
  • Job Manager (JM) consults the name server (NS) to discover available machines
  • JM maintains the job graph and schedules vertices

13
Job = Directed Acyclic Graph
[Figure: job DAG. Labels: Inputs, Processing vertices, Channels (file, pipe, shared memory), Outputs]

14
Advantages of DAG over MapReduce
  • Big jobs are more efficient with Dryad
  • MapReduce: a big job runs as > 1 MR stages
  • Reducers of each stage write to replicated
    storage
  • Output of reduce: 2 network copies, 3 disks
  • Dryad: each job is represented as a DAG
  • Intermediate vertices write to local files
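
The difference can be made concrete with a rough count of how often stage outputs are written. The sketch below uses the slide's figures (3 disks and 2 network copies per reduce output). It assumes, without support from the slides, that Dryad replicates only the final output in the same way, and it ignores the cost of downstream vertices reading their inputs.

```python
def mapreduce_copies(stages):
    """Writes caused by stage outputs when a job is a chain of MR stages."""
    # Every stage ends in a reduce that writes to replicated storage:
    # 3 disk writes and 2 network copies (figures from the slide).
    return {"disk_writes": 3 * stages, "network_copies": 2 * stages}


def dryad_copies(stages):
    """Same job as one DAG; intermediate outputs go to one local file each."""
    # Assumption: only the final output is replicated (3 disks, 2 copies).
    return {"disk_writes": (stages - 1) * 1 + 3, "network_copies": 2}


stages = 4  # illustrative job length
print("MapReduce:", mapreduce_copies(stages))
print("Dryad:    ", dryad_copies(stages))
```

Under these assumptions the gap grows with the number of stages, since each extra MapReduce stage adds another replicated write.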

15
Pig Latin
  • High-level procedural abstraction of MapReduce
  • Contains SQL-like primitives
  • Example
  • good_urls = FILTER urls BY pagerank > 0.2;
  • groups = GROUP good_urls BY category;
  • big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
  • Output = FOREACH big_groups GENERATE category,
    AVG(good_urls.pagerank);
  • Plus user-defined functions (UDFs)
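
To show what each step of the example computes, here is a plain Python equivalent that runs on a small in-memory list. It is a sketch of the query's semantics only, not of how Pig compiles it to MapReduce; the function name, the record layout, and the sample data are assumptions, and the group-size threshold is a parameter so the demo can use a small value.

```python
from collections import defaultdict
from statistics import mean


def top_categories(urls, min_pagerank=0.2, min_group_size=10**6):
    # good_urls = FILTER urls BY pagerank > 0.2
    good_urls = [u for u in urls if u["pagerank"] > min_pagerank]

    # groups = GROUP good_urls BY category
    groups = defaultdict(list)
    for u in good_urls:
        groups[u["category"]].append(u)

    # big_groups = FILTER groups BY COUNT(good_urls) > 10^6
    big_groups = {c: us for c, us in groups.items() if len(us) > min_group_size}

    # FOREACH big_groups GENERATE category, AVG(good_urls.pagerank)
    return {c: mean(u["pagerank"] for u in us) for c, us in big_groups.items()}


# Hypothetical sample data.
urls = [
    {"url": "a.example", "category": "news", "pagerank": 0.9},
    {"url": "b.example", "category": "news", "pagerank": 0.5},
    {"url": "c.example", "category": "news", "pagerank": 0.1},
    {"url": "d.example", "category": "sports", "pagerank": 0.4},
]
result = top_categories(urls, min_group_size=1)
print(result)
```

Each Pig Latin statement maps to one step here, which reflects the procedural, step-by-step style of the language.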

16
Value
  • Reduces development time
  • Procedural vs. declarative
  • Overhead/performance costs worthwhile?

17
Green-Marl
  • High-level graph analysis language/compiler
  • Uses basic data types and graph primitives
  • Built-in graph functions
  • BFS, RBFS, DFS
  • Uses domain-specific optimizations
  • Both architecture-independent and architecture-specific
  • Compiler translates Green-Marl to another
    high-level language (e.g. C)
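
For readers unfamiliar with the traversals named above, the sketch below shows in Python what a breadth-first traversal computes. It is not Green-Marl syntax and says nothing about the code the Green-Marl compiler generates; the function name and the sample graph are assumptions.

```python
from collections import deque


def bfs_levels(graph, root):
    """Return {node: hop distance from root} using breadth-first search."""
    level = {root: 0}
    frontier = deque([root])
    while frontier:
        node = frontier.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in level:  # first visit fixes the BFS level
                level[neighbor] = level[node] + 1
                frontier.append(neighbor)
    return level


# Hypothetical directed graph as adjacency lists.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
levels = bfs_levels(graph, "a")
print(levels)
```

A language with BFS as a primitive lets the compiler choose how to parallelize a loop like this one, which is where the domain-specific optimizations come in.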

18
Tradeoffs
  • Achieve speedup over hand-tuned parallel
    equivalents
  • Tested only on single workstation
  • Only works with graph representations
  • Difficulty representing certain data sets and
    computations
  • Domain-specific vs. general-purpose languages
  • Future work for more architectures, user-defined
    data structures

19
Questions and Discussion
20
Example: count word frequencies in web pages
  • Input is files with one doc per record
  • Map parses each document into words
  • key = document URL
  • value = document contents
  • Input to map: <"doc1", "to be or not to be">
  • Output of map: <"to", "1">, <"be", "1">, <"or", "1">, <"not", "1">,
    <"to", "1">, <"be", "1">

21
Example: count word frequencies in web pages
  • Reduce computes the sum for each key
  • key = "be", values = "1", "1" → "2"
  • key = "not", values = "1" → "1"
  • key = "or", values = "1" → "1"
  • key = "to", values = "1", "1" → "2"
  • Output of reduce is saved: <"to", "2">, <"be", "2">, <"or", "1">, <"not", "1">

22
Example: Pseudo-code
  Map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
      EmitIntermediate(w, "1");

  Reduce(String key, Iterator intermediate_values):
    // key: a word, same for input and output
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
      result += ParseInt(v);
    Emit(AsString(result));
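
The pseudo-code translates almost line for line into runnable Python. In the sketch below, `map_fn` and `reduce_fn` mirror Map and Reduce, while `word_count` is a hypothetical single-process driver that stands in for the runtime's shuffle-and-sort step. Splitting words on whitespace is an assumption.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(input_key, input_value):
    # input_key: document name, input_value: document contents
    for word in input_value.split():
        yield word, "1"  # EmitIntermediate(w, "1")


def reduce_fn(key, intermediate_values):
    # key: a word, intermediate_values: an iterator over counts (strings)
    result = 0
    for v in intermediate_values:
        result += int(v)
    return str(result)  # Emit(AsString(result))


def word_count(documents):
    # Map phase: run map_fn on every document.
    intermediate = [
        pair
        for name, contents in documents.items()
        for pair in map_fn(name, contents)
    ]
    # Shuffle and sort: bring equal keys next to each other.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: one reduce_fn call per distinct word.
    return {
        word: reduce_fn(word, (v for _, v in group))
        for word, group in groupby(intermediate, key=itemgetter(0))
    }


counts = word_count({"doc1": "to be or not to be"})
print(counts)
```

Counts travel as strings, as in the pseudo-code, so the reducer parses each value before adding it.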