Title: NGS induction case study: the BRIDGES project
Micha Bayer, Grid Services Developer, BRIDGES project
National e-Science Centre, Glasgow Hub
2. The BRIDGES project
- Biomedical Research Informatics Delivered by Grid-Enabled Services
- 2-year e-Science project, started 1st October 2003
- aim: provide data integration and grid-based compute power for the Cardiovascular Functional Genomics (CFG) project
- CFG project investigates genetic predisposition for hypertensive heart disease
- my role on the project: develop grid applications for end users
3. BRIDGES requirements and the NGS
- functional
  - high-throughput compute tasks, e.g. large BLAST jobs
- non-functional
  - interfaces to applications should be targeted at the less computer literate: users range in computer literacy from fairly advanced to mildly technophobic
  - security requirements should not cause any extra work or inconvenience for users, as this may put them off altogether
  - resources provided by BRIDGES compete with familiar, similar resources already on offer at established bioinformatics institutions (EBI, NCBI, EMBL) -> need to make things palatable so people do use them
4. How to get your job onto the NGS: standard solutions
- NGS portal
- GSI-SSH
[Diagram: NGS clusters at Leeds, Oxford, RAL and Manchester]
5. Custom grid applications
- if possible/appropriate, get a developer to write a bespoke interface to a grid app running on NGS
- only worthwhile if the application is used frequently and/or by many users and is relatively unchanging/simple
- best to hide the complexity of the grid from users altogether
- users should not even have to choose between resources
- automatic scheduling of jobs to resources that currently have spare capacity is desirable
- best option for delivery is a portlet in a project-specific web portal: users then just need a web browser for access
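The idea of scheduling jobs automatically to whichever resource has spare capacity can be sketched as follows. This is only an illustration of the principle, not the BRIDGES scheduler: the `Resource` class, the `assign_subjobs` function and all CPU figures are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    total_cpus: int
    busy_cpus: int

    @property
    def free_cpus(self) -> int:
        # Never report negative capacity if a cluster is oversubscribed.
        return max(self.total_cpus - self.busy_cpus, 0)


def assign_subjobs(n_subjobs, resources):
    """Greedily give each subjob to the resource with the most spare CPUs."""
    if not resources:
        raise ValueError("no resources available")
    spare = {r.name: r.free_cpus for r in resources}
    plan = {r.name: 0 for r in resources}
    for _ in range(n_subjobs):
        best = max(spare, key=spare.get)
        plan[best] += 1
        spare[best] -= 1  # may go negative: surplus subjobs simply queue
    return plan


# Hypothetical load figures, for illustration only.
clusters = [
    Resource("Leeds", total_cpus=64, busy_cpus=60),
    Resource("Oxford", total_cpus=64, busy_cpus=32),
    Resource("RAL", total_cpus=64, busy_cpus=64),
]
plan = assign_subjobs(10, clusters)
print(plan)
```

With these made-up figures all ten subjobs go to the cluster with the most idle CPUs; the user never has to pick a resource.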
6. Project web portals
- portals are configurable, personalized collections of web applications delivered to a web browser as a single page
- NGS encourage projects to maintain their own web portals to deliver apps to their users
- applications can then be provided through user-friendly, specific portlet interfaces
- allows the hiding of grid complexity from users
- requires developer time
- BRIDGES portal currently uses IBM WebSphere (free to academia)
7. More on portals
- increasingly important technology, not just for grid computing (cf. Yahoo)
- gives end users a customized view of software and hardware resources specific to their particular application domain
- also provides a single point of access to Grid-based resources following user authentication (single sign-on)
- content is provided by portlets (Java servlet extension); the JSR168 standard provides for exchangeability
- some portal packages currently available: IBM WebSphere, GridSphere, Jetspeed, uPortal, Jportlet, Apache Pluto
8. Authentication and User Management (1)
- model adopted in BRIDGES
  - requirement was for users not to have to obtain and manage certificates
  - we applied for a single project account at NGS: users do not need individual NGS accounts
  - this account maps to a single user (BRIDGES) on the NGS with home directories on all nodes (like normal users)
  - authentication for this user on NGS is by means of the host certificate of the machine from which the jobs are submitted (under control of the BRIDGES project)
  - users authenticate via the BRIDGES web portal using standard username and password pairs
9. Authentication and User Management (2)
- users can create accounts for themselves in the BRIDGES WebSphere portal (self-care)
- alternatively one could of course give the users usernames and passwords
- information gathered is kept in WebSphere's secure user database
- current info is very basic but will be extended to include more detail (e.g. URL of the user's project or departmental website where the user is listed)
- provides at least a basic means of accounting for user activity
- no need for physically visiting the Registration Authority/presenting ID
- may need to resort to stricter security if the system is abused, e.g. if impersonation takes place
10. Authorisation with PERMIS
- PERMIS: grid authorisation software developed at Salford University (http://sec.isi.salford.ac.uk/permis/)
- BRIDGES uses PERMIS to differentially allow users access to resources
- typical use is with a GT3.3 service, but lookup-type use is also possible with other services (in our case GT3.0.2)
- code in our service calls a PERMIS authorisation service running on a machine at NeSC
- user's roles are queried and access to a resource is permitted or denied accordingly
- gives BRIDGES staff full control over who is allowed to use NGS resources through our applications
[Diagram: end user; resources ScotGRID, NeSC Condor Pool and NGS (Leeds, Oxford, RAL, Manchester)]
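The lookup-type use described above (query a user's roles, then permit or deny access to a resource) can be sketched in a few lines. This is not the PERMIS API: the role names, user names and the policy table are invented for illustration, and only the resource names are taken from the slide.

```python
# Hypothetical policy: which role may use which resource.
ROLE_POLICY = {
    "bridges_staff": {"ScotGRID", "NeSC Condor Pool", "NGS"},
    "cfg_researcher": {"ScotGRID", "NGS"},
    "guest": {"NeSC Condor Pool"},
}

# Hypothetical role assignments; a real deployment would query the
# authorisation service instead of a local dictionary.
USER_ROLES = {
    "alice": {"cfg_researcher"},
    "bob": {"guest"},
}


def permitted_resources(username):
    """Union of the resources granted by every role the user holds."""
    resources = set()
    for role in USER_ROLES.get(username, set()):
        resources |= ROLE_POLICY.get(role, set())
    return resources


def is_permitted(username, resource):
    """Permit only if one of the user's roles grants the resource."""
    return resource in permitted_resources(username)


print(is_permitted("alice", "NGS"))
print(is_permitted("bob", "NGS"))
```

Unknown users hold no roles and are therefore denied everything, which keeps the default on the safe side.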
11. Security in BRIDGES: summary
[Diagram: end user; BRIDGES web portal; NeSC grid server with host credentials; NeSC machine with PERMIS authorisation service (GT3.3); NGS clusters (Leeds, Oxford, RAL, Manchester)]
- authenticate at BRIDGES web portal with username and password only
- job request is passed on securely with username
- get user authorisations
- make host proxy, authenticate with NGS and submit job
12. Host authentication for job submission
- allows us to submit jobs to NGS as user BRIDGES
- apply for a host certificate for the grid server machine as normal (UK e-Science Certification Authority)
- results in a passwordless private key and host certificate for the machine
- Java CoG Kit code can then be used to generate a host proxy locally
- this is used for job submission
13. Use case: Microarray reporter sequence BLAST jobs
Job processing, please wait.... (and wait.... and wait....)
- microarray chips contain up to 400,000 reporter sequences
- these need to be compared to existing annotated sequence databases
- takes approx. 3 weeks to compute against the human genome on an average desktop machine
14. BLAST
- Basic Local Alignment Search Tool
- used for comparing biological sequences (DNA, protein) against a set of target sequences
- returns a sorted list of matches
- most widely used algorithm for this sort of thing
- compute intensive
15. How do I get my application to run efficiently on a grid?
- applications to be deployed on a compute grid need to be parallelised to really benefit (can of course just run them as single jobs too)
- for this one must be able to partition a job into several subjobs
- these then get processed separately at the same time on multiple processors
- need to combine results of individual subjobs at the end
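The partition, process and combine steps above can be sketched as below. Local threads stand in for the remote processors, and `run_subjob` is a placeholder that only reports each sequence's length; both are assumptions made for the sake of a small runnable example, not the way jobs actually reach a grid.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, n_parts):
    """Partition items into at most n_parts contiguous, near-equal chunks."""
    n_parts = max(1, min(n_parts, len(items)))
    size, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def run_subjob(chunk):
    """Placeholder for the real work done on one processor."""
    return [(seq, len(seq)) for seq in chunk]


def run_parallel(items, n_parts):
    """Split the job, run the subjobs concurrently, then combine in order."""
    chunks = split(items, n_parts)
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        partial_results = list(pool.map(run_subjob, chunks))
    # pool.map preserves chunk order, so concatenating restores input order.
    return [hit for part in partial_results for hit in part]


sequences = ["ACGT", "AC", "A", "ACGTAC", "GG"]
combined = run_parallel(sequences, n_parts=3)
print(combined)
```

The combine step here is a plain concatenation; an application whose subjob results overlap or need re-sorting would need a more careful merge.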
16. Parallel BLAST grid style
- partition your job by putting one or several query sequences into a separate input file (= 1 subjob)
- distribute all input files, the executable and target data onto your grid clusters (stage-in)
- results are returned to the server and combined there
- if 100 free processors are available, and 100 subjobs are to be run, the time taken is 1/100th of the time it would have taken to run the whole job on a single machine (plus overheads for scheduling, data transfer and result combining)
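The arithmetic in the last bullet can be written out as a rough estimate. The sketch assumes that all subjobs take equally long and that overheads can be lumped into one fixed figure; the function name, its parameters and the half-hour overhead are illustrative assumptions. The 504 hours correspond to the approx. 3 weeks quoted for the microarray use case.

```python
import math


def estimated_wall_time(serial_hours, n_subjobs, n_processors, overhead_hours=0.0):
    """Rough wall-clock estimate for a job split into equal subjobs.

    Subjobs run in 'waves': with fewer processors than subjobs,
    several rounds are needed before everything has finished.
    """
    if n_subjobs < 1 or n_processors < 1:
        raise ValueError("need at least one subjob and one processor")
    hours_per_subjob = serial_hours / n_subjobs
    waves = math.ceil(n_subjobs / n_processors)
    return waves * hours_per_subjob + overhead_hours


three_weeks = 3 * 7 * 24  # 504 hours on a single desktop machine
ideal = estimated_wall_time(three_weeks, n_subjobs=100, n_processors=100)
with_overhead = estimated_wall_time(three_weeks, 100, 100, overhead_hours=0.5)
half_the_cpus = estimated_wall_time(three_weeks, 100, 50)
print(ideal, with_overhead, half_the_cpus)
```

Under these assumptions the three-week job drops to roughly five hours on 100 free processors, and to about ten hours if only 50 are free.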
17. To stage or not to stage?
- file staging is the copying, at runtime, of files onto the remote resource
- example: BLAST jobs
  - we need
    - input file
    - target data file (database; really a flat text file)
    - executable (BLAST)
  - target files and executable are unchanging components for this kind of job
  - it is best to store these locally on the remote resources to avoid staging overhead (target data are in the region of several GB in size and growing exponentially)
  - rather than individual users keeping multiple copies of publicly available data in their home directories, get sys admins to put up copies visible to all
  - must stage in input files since these vary from job to job
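The staging decision described above comes down to one rule: copy a file at runtime only if it varies per job or is not already installed on the remote resource. A minimal sketch, in which the `JobFile` class, the file names and all sizes are hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobFile:
    name: str
    size_bytes: int
    varies_per_job: bool


def files_to_stage(job_files, preinstalled):
    """Stage a file if it changes per job or is missing on the remote side."""
    return [f for f in job_files
            if f.varies_per_job or f.name not in preinstalled]


def staged_bytes(job_files, preinstalled):
    """Total volume that has to be copied at runtime."""
    return sum(f.size_bytes for f in files_to_stage(job_files, preinstalled))


GB = 1024 ** 3
# Hypothetical file names and sizes, for illustration only.
job = [
    JobFile("queries_001.fasta", 50 * 1024, varies_per_job=True),
    JobFile("blastall", 5 * 1024 ** 2, varies_per_job=False),
    JobFile("human_genome.db", 3 * GB, varies_per_job=False),
]
nothing_installed = staged_bytes(job, preinstalled=set())
shared_install = staged_bytes(job, preinstalled={"blastall", "human_genome.db"})
print(nothing_installed, shared_install)
```

With the executable and target data installed once per cluster, only the small query file travels with each subjob.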
18. BRIDGES GridBLAST Job Submission
[Diagram, components: end user machine (GridBLAST client); NeSC Grid Server (Titania) with Apache Tomcat, GT 3 core grid service, BRIDGES Meta-Scheduler, PBS wrapper and GT2.4 wrapper; ScotGRID masternode (PBS, server-side BLAST); ScotGRID worker nodes; NGS. Arrows: send job request; jobs farmed out to compute nodes; return result.]
19. Current status of our system
- software is still at prototype stage: we haven't benchmarked any really big jobs yet
- Java Web Start client (launched from portal) connects to the service; needs to be changed to a portlet
- user registration needs to be revised and users re-registered
- happy to share portlet code etc. with others once finished
20. How we worked with the NGS
- BRIDGES was one of the first projects doing bio stuff on NGS
- we established a basic infrastructure needed for BLAST on the NGS clusters in collaboration with NGS user support
- good collaboration on our security requirements: very helpful and accommodating
- our project account is the first of its kind and we jointly tailored a solution that would fit BRIDGES
- ask for what you need! things are not cast in stone and it is supposed to be a public service
21. Public bioinformatics infrastructure on NGS: current status
- we are in the process of establishing an infrastructure for BLAST jobs that can be used by all
- this includes
  - making BLAST and mpiBLAST executables publicly available
  - mirroring the entire NCBI BLAST databases repository
- currently trialling this on the Leeds node; will be replicated at other nodes eventually
- data replication on all nodes necessary to avoid severe performance hits
- input from others needed and welcome!
22. Contact details
- BRIDGES website: http://www.brc.dcs.gla.ac.uk/projects/bridges/
- Code repository (available soon): http://www.brc.dcs.gla.ac.uk/projects/bridges/public/code.htm
- BRIDGES web portal: http://europa.nesc.gla.ac.uk:9081/wps/portal
- Contacts
  - Micha Bayer at NeSC in Glasgow: michab_at_dcs.gla.ac.uk
  - Richard Sinnott at NeSC in Glasgow: ros_at_dcs.gla.ac.uk