1
Case Studies of Using Condor for Scientists
Barcelona, 2006
2
Agenda
  • Extended user's tutorial
  • Advanced Uses of Condor
  • Java programs
  • DAGMan
  • Stork
  • MW
  • Grid Computing
  • Case studies, and a discussion of your
    applications' needs

3
BLAST
4
Background
  • Each species has a genetic encoding within its
    cells
  • Humans are made of approximately $10^{14}$ cells

5
Background
  • The nucleus of each human cell contains 46
    chromosomes
  • Each chromosome contains between 231 and 2958
    genes
  • Each chromosome is made of somewhere between 25
    million and 237 million (approximately) base pairs

7
Base Pairs (Simplified)
  • Each base pair is one of 4 nucleotides
  • Each nucleotide is represented by one letter
  • A C G T

8
The Science Issue
  • Scientists ask many questions and pose
    computationally difficult issues
  • map a species' genome - build a huge database of
    information
  • understand evolution at a genetic level - answer
    homology and related questions
  • identify mutations and genes to develop
    diagnoses and medical treatments

9
BLAST
  • Basic Local Alignment Search Tool
  • A really good pattern-matching program
  • An answer to the science questions often requires
    queries such as
  • Does the following nucleotide sequence (1000
    pairs), or something close, appear in the database
    (several billion pairs)? With what certainty is
    there a match?
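
The "or something close" query above can be illustrated with a deliberately naive sketch. This is not BLAST's algorithm (BLAST uses heuristics to avoid scanning every position); it only shows what an approximate match means. The function name and the use of mismatch count as the closeness measure are assumptions made for illustration.

```python
def best_approximate_match(query, database):
    """Return (offset, mismatches) for the database window closest to query.

    Naive sliding-window comparison: every window of len(query) letters is
    compared position by position. Returns None if the query is longer than
    the database.
    """
    best = None
    for offset in range(len(database) - len(query) + 1):
        window = database[offset:offset + len(query)]
        # Count positions where the two sequences disagree.
        mismatches = sum(a != b for a, b in zip(query, window))
        if best is None or mismatches < best[1]:
            best = (offset, mismatches)
    return best


if __name__ == "__main__":
    print(best_approximate_match("ACGT", "TTACCTGG"))
```

The cost of this scan grows with the product of the query and database lengths, which is why a query against billions of base pairs needs something smarter.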

10
The Biological Magnetic Resonance Data Bank
  • Department of Biochemistry at University of
    Wisconsin-Madison
  • Part of the Center for Eukaryotic Structural
    Genomics (CESG)
  • Working on three-dimensional protein structure

11
The BMRB and BLAST
  • The BMRB (with the help of the Condor Team) has a
    weekly set of automated BLAST runs
  • These BLAST runs compare progress on the BMRB set
    of working proteins to the Protein Data Bank

12
Serial versus Parallel
  • Too slow: The BMRB working set could be input to
    a single BLAST program execution
  • Load the Protein Data Bank database
  • Serially query the database with each protein in
    the working set
  • Faster: Divide the working set into pieces that
    allow parallel executions of BLAST
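
A minimal sketch of the "divide the working set into pieces" idea, assuming the working set is just a list of protein identifiers. The function name and the round-robin strategy are illustrative assumptions, not the BMRB's actual method.

```python
def split_working_set(proteins, pieces):
    """Deal the proteins round-robin into `pieces` lists of near-equal size.

    Each resulting list could become the input of one independent BLAST job.
    """
    return [proteins[i::pieces] for i in range(pieces)]


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    working_set = ["p1", "p2", "p3", "p4", "p5", "p6", "p7"]
    for number, piece in enumerate(split_working_set(working_set, 3)):
        print(number, piece)
```

Round-robin keeps the pieces within one item of each other in size; it does not account for proteins that take longer to query than others.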

13
Weekly BMRB Runs
  1. Obtain and install the BLAST executable and
    Protein Data Bank database
  2. Decide on the best way to split the BMRB working
    set of proteins to minimize the parallel
    execution time
  3. Make a custom DAG for this split
  4. Produce a report on the BMRB run

14
The Custom DAG
[Figure: the custom DAG. Legend: B is BLAST; E is Extract results.]
15
An Economics Application
  • Computations are done at points on a coordinate
    plane
  • Initial values are known along the axes
  • Computation of one point at a time is too slow
    (serial execution)
  • Each point is dependent on 2 neighboring points
  • (x,y) can be computed knowing (x-1,y) and (x,y-1)
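
The dependency rule above implies that all points with the same value of x + y can be computed at the same time, because their inputs lie on the previous diagonal. A short sketch of that grouping follows; the function name is an assumption.

```python
def wavefronts(n):
    """Group the points (x, y), 1 <= x, y <= n, into waves.

    Every point in a wave depends only on points in earlier waves (or on
    the known axis values), so a whole wave can run in parallel.
    """
    waves = []
    for total in range(2, 2 * n + 1):  # x + y is constant along a diagonal
        first_x = max(1, total - n)
        last_x = min(n, total - 1)
        waves.append([(x, total - x) for x in range(first_x, last_x + 1)])
    return waves


if __name__ == "__main__":
    for step, wave in enumerate(wavefronts(3), start=1):
        print(step, wave)
```

For an n by n grid this gives 2n - 1 waves instead of n x n serial steps, provided enough machines are available to run each wave at once.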

16
The Coordinate Plane
[Figure: coordinate grid, both axes numbered 1 to 6. Legend: known result.]
17-21
The Coordinate Plane
[Figures: the same grid repeated over five slides. Legend: known result, inputs ready.]
22
The DAG
[Figure: DAG with nodes 1-1, 1-2, 2-1, 1-3, 2-2, 3-1, 1-4, 2-3, 3-2, 4-1, etc.]
23
Use DAGMan
  • Write a program to generate the DAG input file
  • The submit description file (and the executable)
    is the same for each node in the DAG

24
DAG Input File
  • Job 1-1 gonkulate.submit
  • Job 1-2 gonkulate.submit
  • Parent 1-1 Child 1-2
  • Job 2-1 gonkulate.submit
  • Parent 1-1 Child 2-1
  • Job 1-3 gonkulate.submit
  • Parent 1-2 Child 1-3
  • Job 2-2 gonkulate.submit
  • Parent 1-2 2-1 Child 2-2
  • Vars 2-2 left="file1-2"
  • Vars 2-2 below="file2-1"
  • Vars 2-2 result="file2-2"
  • . . .
  • DAG input file, continued
  • Job 3-4 gonkulate.submit
  • Parent 2-4 3-3 Child 3-4
  • Vars 3-4 left="file2-4"
  • Vars 3-4 below="file3-3"
  • Vars 3-4 result="file3-4"
  • . . .
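
One way to "write a program to generate the DAG input file" is sketched below. It follows the node naming on the slides (x-y) and uses DAGMan's macro="value" form for Vars lines. Two things are assumptions: the slides do not show Vars lines for nodes next to the axes, so this sketch names the known axis values with a 0 index (for example file0-2), and the function name dag_lines is made up for illustration.

```python
def dag_lines(n, submit_file="gonkulate.submit"):
    """Return the lines of a DAG input file for an n x n grid of nodes."""
    lines = []
    for x in range(1, n + 1):
        for y in range(1, n + 1):
            node = f"{x}-{y}"
            left, below = f"{x - 1}-{y}", f"{x}-{y - 1}"
            lines.append(f"Job {node} {submit_file}")
            # Axis values are already known, so they are inputs, not parents.
            parents = ([left] if x > 1 else []) + ([below] if y > 1 else [])
            if parents:
                lines.append(f"Parent {' '.join(parents)} Child {node}")
            # Assumed file naming: "file" followed by the node name.
            lines.append(f'Vars {node} left="file{left}"')
            lines.append(f'Vars {node} below="file{below}"')
            lines.append(f'Vars {node} result="file{node}"')
    return lines


if __name__ == "__main__":
    print("\n".join(dag_lines(4)))
```

Because x is the outer loop, every parent's Job line is written before the Parent line that refers to it.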

25
Submit Description File
  • In gonkulate.submit
  • universe = vanilla
  • executable = gonkulate
  • output = $(result)
  • should_transfer_files = YES
  • when_to_transfer_output = ON_EXIT
  • transfer_input_files = $(left), $(below)
  • log = gonkulate.log
  • notification = Never
  • queue

26
Nug30
27
Description of Nug30
  • nug30 (a Quadratic Assignment Problem instance of
    size 30) had been the holy grail of
    computational QAP research since 1968
  • In 2000, Anstreicher, Brixius, Goux, and Linderoth
    set out to solve this problem
  • Using a mathematically sophisticated and
    well-engineered algorithm, they still estimated
    that they would require 11 CPU years to solve the
    problem.

28
Nugent's Problem
  • There is a set of N locations and a set of N
    facilities, and each facility must be assigned a
    location. To measure the cost of each possible
    assignment, the flow between each pair of
    facilities is multiplied by the distance between
    the pair's assigned locations, and then a sum is
    taken over all of the pairs.
  • For Nug30, N = 30

29
QAP Definition
  • The formal definition of the quadratic assignment
    problem is
  • Given two sets, P ("facilities") and L
    ("locations"), of equal size, together with a
    weight function $w : P \times P \to \mathbb{R}$ and a
    distance function $d : L \times L \to \mathbb{R}$, find
    the bijection $f : P \to L$ (assignment) such that
    the cost function
  • $\sum_{a,b \in P} w(a,b) \cdot d(f(a), f(b))$
  • is minimized.
  • Usually the weight and distance functions are viewed
    as square real-valued matrices.

Wikipedia
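
The cost function in this definition can be sketched directly when the weight and distance functions are given as square matrices. The matrices below are made-up 3 x 3 values for illustration, and trying every assignment is only feasible for very small N.

```python
from itertools import permutations


def qap_cost(weight, distance, assignment):
    """Sum of w(a, b) * d(f(a), f(b)) over all pairs of facilities.

    assignment[a] is the location f(a) given to facility a.
    """
    n = len(assignment)
    return sum(weight[a][b] * distance[assignment[a]][assignment[b]]
               for a in range(n) for b in range(n))


def brute_force(weight, distance):
    """Try every bijection and return the cheapest assignment."""
    n = len(weight)
    return min(permutations(range(n)),
               key=lambda f: qap_cost(weight, distance, f))


# Hypothetical flows between facilities and distances between locations.
WEIGHT = [[0, 5, 2], [5, 0, 3], [2, 3, 0]]
DISTANCE = [[0, 1, 4], [1, 0, 2], [4, 2, 0]]

if __name__ == "__main__":
    best = brute_force(WEIGHT, DISTANCE)
    print(best, qap_cost(WEIGHT, DISTANCE, best))
```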
30
Scope of the Problem
  • This QAP problem is difficult due to the
    excessively large number of possible facility
    assignments.
  • The number of possible assignments is factorial
    in the number of facilities.
  • $N! = N \times (N-1) \times (N-2) \times \cdots \times 2 \times 1$
  • 30! is approximately $2.65 \times 10^{32}$
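
The factorial growth above is easy to check with the standard library. The function name is an assumption made for this sketch.

```python
import math


def count_assignments(n):
    """Number of ways to assign n facilities to n locations (n!)."""
    return math.factorial(n)


if __name__ == "__main__":
    for n in (5, 10, 20, 30):
        print(n, f"{count_assignments(n):.2e}")
```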

31
The Simplified Approach
  • Method of choice is branch and bound
  • The complete tree has 30! nodes as leaves
  • Branching grows the tree
  • Bounding results in pruning the tree
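
A toy sketch of branch and bound for a small QAP follows. It is not the algorithm used for Nug30: the bound here is simply the cost of the facilities placed so far, which is a valid lower bound only under the assumption that all weights and distances are non-negative. All names are illustrative.

```python
def branch_and_bound(weight, distance):
    """Return (cost, assignment) minimizing the QAP cost.

    Assumes non-negative matrix entries, so the cost of a partial
    assignment never exceeds the cost of any completion of it.
    """
    n = len(weight)
    best = {"cost": float("inf"), "assignment": None}

    def partial_cost(assignment):
        k = len(assignment)
        return sum(weight[a][b] * distance[assignment[a]][assignment[b]]
                   for a in range(k) for b in range(k))

    def visit(assignment, free):
        bound = partial_cost(assignment)
        if bound >= best["cost"]:
            return  # bounding: this subtree cannot beat the best found
        if not free:
            best["cost"], best["assignment"] = bound, list(assignment)
            return
        for location in sorted(free):
            # branching: place the next facility at each free location
            visit(assignment + [location], free - {location})

    visit([], frozenset(range(n)))
    return best["cost"], best["assignment"]


if __name__ == "__main__":
    weight = [[0, 5, 2], [5, 0, 3], [2, 3, 0]]    # hypothetical flows
    distance = [[0, 1, 4], [1, 0, 2], [4, 2, 0]]  # hypothetical distances
    print(branch_and_bound(weight, distance))
```

A tighter bound prunes more of the tree, which is why the quality of the bounding technique matters so much for a problem of size 30.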

32
The Nug30 Solution
  • Used a new algorithm, called the quadratic
    programming bound, developed by Anstreicher and
    Brixius
  • Sequential execution would have taken 7 years, so
    parallelization of the algorithm was important
  • Used MW

33
Nug30 Computational Grid
Number  Arch/OS        Location
   414  Intel/Linux    Argonne
    96  SGI/Irix       Argonne
  1024  SGI/Irix       NCSA
    16  Intel/Linux    NCSA
    45  SGI/Irix       NCSA
   246  Intel/Linux    Wisconsin
   146  Intel/Solaris  Wisconsin
   133  Sun/Solaris    Wisconsin
   190  Intel/Linux    Georgia Tech
    94  Intel/Solaris  Georgia Tech
    54  Intel/Linux    Italy (INFN)
    25  Intel/Linux    New Mexico
    12  Sun/Solaris    Northwestern
     5  Intel/Linux    Columbia U.
    10  Sun/Solaris    Columbia U.
  • Used tricks to make it look like one Condor pool
  • Flocking
  • Glidein
  • 2510 CPUs total

34
Workers Over Time
35
Nug30 solved
Wall Clock Time: 6 days, 22:04:31
Avg Machines: 653
CPU Time: 11 years
Parallel Efficiency: 93%
36
The Football Pool Problem
37
Win By Gambling
  • Each week, 6 games are played
  • The outcome of each game is
  • win
  • lose
  • tie

38
Bet, and win
  • Get 5 of the 6 games correctly predicted, and you
    win
  • What is the minimum number of predictions you
    must make to guarantee winning?

39
Known Values
Number of games  Minimum predictions
              3                    5
              4                    9
              5                   27
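
For very small numbers of games, the known values above can be checked by exhaustive search. The sketch below assumes that a ticket wins when at most one game is predicted wrongly, and it looks for the smallest set of tickets that covers every possible outcome. The search grows far too quickly to reach 6 games; the function name is an assumption.

```python
from itertools import combinations, product


def min_tickets(games):
    """Smallest number of tickets such that every outcome of `games`
    matches some ticket in all but at most one game."""
    outcomes = list(product("WLT", repeat=games))
    index = {outcome: i for i, outcome in enumerate(outcomes)}
    everything = (1 << len(outcomes)) - 1

    def coverage(ticket):
        # Bitmask of outcomes for which this ticket wins.
        mask = 1 << index[ticket]
        for pos in range(games):
            for result in "WLT":
                if result != ticket[pos]:
                    near = ticket[:pos] + (result,) + ticket[pos + 1:]
                    mask |= 1 << index[near]
        return mask

    masks = [coverage(ticket) for ticket in outcomes]
    for size in range(1, len(outcomes) + 1):
        for chosen in combinations(masks, size):
            covered = 0
            for mask in chosen:
                covered |= mask
            if covered == everything:
                return size


if __name__ == "__main__":
    print(min_tickets(3))
```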
40
Problem Description
  • A covering code
  • An NP-hard problem
  • Many years of research and effort for 6 games
    lead to
  • $65 \le \text{minimum number of predictions} \le 73$
  • An integer programming problem
  • Best solver is the commercial application CPLEX

41
Why the Problem is Difficult
  • Number of tickets possible: $6! \times 3^6$
  • The tree that represents the problem (and
    solutions) has many isomorphic branches. This
    makes it difficult to prune the tree.
  • New techniques have been developed that
    narrow the interval containing the solution
  • The latest and greatest approach solves many smaller
    problems using MW

42
Solution!
  • Not yet. . .
  • The first effort (many CPU years' worth of time)
    had a very small error in its input
  • Second effort is still in progress.
  • All this to improve the lower bound from 65 to
    70, thereby reducing the range for the solution