Title: ANU data plans and National Data Services at the APAC National Facility
1. ANU data plans and National Data Services at the APAC National Facility
- Dr Ben Evans
- APAC National Facility
- ANU Supercomputer Facility
- (Ben.Evans_at_anusf.anu.edu.au)
2. ANU Data Investments
- Background
- 1995
- Mass Data Storage System installed in 1995 (with many upgrades since)
- Solution for large, long-term scientific data collections, using infrastructure beyond the capability of any one area
- 2006
- Expanded infrastructure role. Now includes
- complex data management software environments
- data analysis
- expanded expertise and research consultants
- better Campus/National and International integration
- support for research activity and good management practices
3. ANU data plans for the Institution
- ANU established an eResearch Task Force to scope university requirements and the structures needed to address future needs.
- Outcomes (for data-related activities)
- Continuing need for high-end computational services
- Increased growth in data-enabled research activities
- Structures for encouraging data access
- Integration with centralised infrastructure and support for managing digital assets
- Value of researchers providing e-Research enabled services
- Support for ANU's ongoing commitment to the APAC National Facility and its expanding role in data services.
4. APAC-NF Data Services
- Expanding Merit processes to support nationally merited data projects. Assessed on a yearly basis, but the expectation is support for the medium term, perhaps longer.
- Project plan for the year, including the value of the data project, the requirements for the project and (in some cases) funding.
- Provide infrastructure/framework, environment
- - at internationally competent levels
- - well linked to National and International research activities
- Goal to broaden access, cohesion and support for National priority research activities and networks
5. Roles and Engagement
- Principal Investigator
- overall responsibility for project research engagement, plans and outcomes
- includes appropriately nominated archivists and data curators
- PI is the data custodian and the interface to APAC-NF, assisting with policy implementation
- APAC
- Competent, well-managed infrastructure growth with appropriate connectivity
- advanced software environments to support specialised projects.
- Software development may happen in research teams (central or distributed)
- Deploying grid-enabled data expertise via APAC
- Access to consultants for best practice in data management
- National/International exposure and support for linkage
- Centralised versus distributed data depends on the project's technical and policy issues, and on what is best for overall support and good management
6. APAC-NF Data System infrastructure
- Data transfer, web access, virtual hosting, video streaming
- Access by grid software, specialised software, command line, API
- Dedicated Real-time Relational Database engines
- Data Analysis Cluster: big and little endian (future attachment)
- Disk and tape pools
- Fast, on-line global filesystem
7. Software environments
- Large toolbox of packages, compilers, libraries (including file formats) and other software is available at
- http://nf.apac.edu.au/facilities/software/
- Data projects: specialised toolkits and integration with infrastructure
- Eg
- Astronomy - data ingest, data search, VO enabled
- Earth Systems - OpenDAP, experimental and modelled datasets
- High energy physics - tiered storage, SRB -> SRM
- Humanities - Babble grid-ingest repository management, annotation software
- Materials Sciences - Plexus project, microCT experimental data with abstract structures, GRANI
- Social Sciences - NESSTAR, Leximancer, VOSON
- Terrestrial - categorisation, analysis and visualisation (eg GIS)
8. Data Discovery/Publishing
- Providing a repository of references to datasets
- http://nf.apac.edu.au/facilities/software/dataset.php
- Some fields have VO registries, and these will be harvested and registered (eg NVO in astronomy, Geographic, Humanities, Social Sciences)
- Work with APSR for a general discovery service, starting with APAC.
9. Data Lifecycle management
- Storage media has a useful lifetime in capacity, speed and maintenance.
- 6 generations of tape drives, 5 generations of disks
- Processes for assisting with the Data project life cycle
- multiple generations of data standards in metadata, data and software
- Access method changes (and change from protected to public)
- Software development needs production and development instances (may use virtualisation and data replication)
- Data may change from large (archival) to complex and data-intensive
10. Data Trends
- Large Data projects continue to get larger.
- Eg telescopes/instruments
- Requirement has changed from near-line to on-line.
- Reduce complexity of software
- Increase speed of access
- Enable analysis next to data
- Typical large scientific areas are being joined with humanities
- Complex data management tools and opportunities are being realised in nearly all disciplines, inclusive of large and small datasets (eg skymapper)
- Managing long-term software infrastructure is becoming more complex, especially on an ongoing basis.
- Data management issues are beyond the capabilities of an individual research group, and trends go beyond an individual institution.
11. Data Protection
- Frequent scheduling of archival copies.
- Standard practice of multiple archival copies of all data, number and frequency handled by policy.
- HA being established for some RDBMS and web services
- Monitoring and Instrumentation of performance
- Audit logs and backups of data
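
The policy-driven handling of archival copies mentioned on this slide can be illustrated with a small check. This is only a sketch under stated assumptions: the `ArchivePolicy` class, its fields and the `policy_violations` function are hypothetical names invented for illustration, and the dates are made up. It does not describe the actual APAC-NF tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ArchivePolicy:
    """Hypothetical per-project policy: how many copies, and how often."""
    min_copies: int
    max_age: timedelta  # longest allowed gap since the newest archival copy


def policy_violations(copy_times, policy, now):
    """Return a list of readable policy violations (empty if compliant)."""
    problems = []
    if len(copy_times) < policy.min_copies:
        problems.append(
            f"only {len(copy_times)} of {policy.min_copies} required copies"
        )
    if not copy_times or now - max(copy_times) > policy.max_age:
        problems.append("newest archival copy is older than the policy allows")
    return problems


# Made-up example: two copies required, refreshed at least weekly.
policy = ArchivePolicy(min_copies=2, max_age=timedelta(days=7))
now = datetime(2006, 6, 15)
copies = [datetime(2006, 6, 1), datetime(2006, 6, 14)]
print(policy_violations(copies, policy, now))  # compliant: prints []
```

Keeping the number and frequency in a policy object, rather than in the check itself, mirrors the slide's point that both are set per project by policy.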
12. Assistance with Co-scheduled Computational and Data workflow
[Workflow diagram. Labels: Search/query and presentation; Computation; Dataset Access]
13. APAC National Grid
[Diagram of grid sites: QPSF (JCU), QPSF, APAC National Facility, IVEC, ac3, ANU, SAPAC, CSIRO, VPAC, TPAC. Legend: Computing Systems - Peak, Mid-range, Special]
14. National Data transfer backbone for data workflow
- Transfer of data from repository using managed data transfer backbone
- Tuned transfer systems
- Connection via high-bandwidth scalable pipes (in progress)
- GridFTP, SRB, dCache, and more generic tools
15. Grid Infrastructure for data/compute workflow
- Establish data transfer metrics for National Grid and International transfer performance across the interconnecting fabric.
- Lead to improvements in services over the network, in conjunction with network providers and local institutions.
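
As one illustration of what a data transfer metric might look like, the sketch below turns individual transfer records into per-link throughput summaries. It rests on assumptions: the record layout, the function names and the sample figures are all hypothetical, and the slides do not say which metrics were actually collected.

```python
from collections import defaultdict
from statistics import median


def throughput_mbps(num_bytes, seconds):
    """Throughput in megabits per second (1 Mbit = 10**6 bits)."""
    if seconds <= 0:
        raise ValueError("transfer duration must be positive")
    return num_bytes * 8 / seconds / 1e6


def summarise(transfers):
    """Group (source, destination, bytes, seconds) records by link and
    report the median and minimum throughput seen on each link."""
    by_link = defaultdict(list)
    for src, dst, num_bytes, seconds in transfers:
        by_link[(src, dst)].append(throughput_mbps(num_bytes, seconds))
    return {
        link: {"median_mbps": median(rates), "min_mbps": min(rates)}
        for link, rates in by_link.items()
    }


# Made-up sample measurements between two grid sites.
sample = [
    ("ANU", "VPAC", 1_000_000_000, 100.0),  # 80 Mbit/s
    ("ANU", "VPAC", 1_000_000_000, 200.0),  # 40 Mbit/s
]
print(summarise(sample))
```

Reporting the minimum alongside the median is a design choice for this sketch: a link whose worst case falls well below its typical rate is a natural candidate to raise with the network provider.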