Dr. Xiao Liu PowerPoint PPT Presentation


1
Overview: Cloud Computing and Workflow Research in the NGSP Group
  • Dr. Xiao Liu
  • Sessional Lecturer, Research Fellow
  • Centre of SUCCESS
  • Swinburne University of Technology
  • Melbourne, Australia

2
Outline
  • SUCCESS Centre and NGSP Group
  • Background: Cloud Computing and Workflow
  • Research Topics
  • Performance Management in Scientific Workflows
  • Data Management in Scientific Cloud Workflows
  • Security and Privacy Protection in the Cloud
  • Data Reliability Assurance in the Cloud
  • SwinDeW-C Cloud Workflow System
  • Future Work and Conclusions

3
The Centre of SUCCESS
  • SUCCESS: Swinburne University Centre for Computing and Engineering Software Systems
  • SUCCESS is the No. 1 software engineering centre in Australia
  • SUCCESS is one of the seven Tier 1 centres at Swinburne University of Technology (Times World Ranking 351-400)
  • The ambition of the Centre is to become the top centre for software research in the Southern Hemisphere within the next five years, achieving world-renowned software innovation and engineering with a balanced theoretic, applied, industry and education impact across the Centre

4
SUCCESS
  • Research Focus Areas
  • Knowledge and Data Intensive Systems
  • Nature of Software
  • Next Generation Software Platforms
  • SE Education and IBL/RBL
  • Software Analysis and Testing
  • Software R&D Group
  • http://www.swinburne.edu.au/ict/success/research-expertise/

5
NGSP (Small) Group Overview
  • This group conducts research into cloud computing
    and workflow technologies for complex software
    systems and services.
  • Members

  • Leader: Prof Yun Yang (PC member for ICSE 07/08, FSE 09, ICSE 10/11/12)
  • Researchers: A/Prof Jinjun Chen (UTS), Dr Xiao Liu (Postdoc), Dr Dong Yuan (Postdoc), Gaofeng Zhang, Wenhao Li, Dahai Cao, Xuyun Zhang, Chang Liu, Jofry Hadi Sutanto
  • Others: Prof John Grundy, Prof Chengfei Liu
  • Visitors: Prof Lee Osterweil, Prof Lori Clarke, Prof Ivan Stojmenovic, Prof Paola Inverardi, Prof Amit Sheth, Prof Wil van der Aalst, Prof Hai Zhuge
6
R&D Projects & Grants
  • Primary projects
  • (Cloud) workflow technology
  • ARC LP0990393 (Y Yang, R Kotagiri, J Chen, C Liu)
  • Cloud computing
  • ARC DP110101340 (Y Yang, J Chen, J Grundy)
  • Secondary project
  • Management control systems for effective
    information sharing and security in government
    organisations
  • ARC LP110100228 (S Cugenasen, Y Yang)

7
R&D Projects Overview
  • SwinDeW workflow family including SwinDeW-C
  • Architectures / Models (D Cao)
  • Scheduling / Data and service management (D Yuan,
    X Liu)
  • Verification / Exception handling (X Liu)
  • Cloud computing
  • Data management (D Yuan, X Liu, W Li)
  • Privacy and Security (G Zhang, X Zhang, C Liu)

8
Some Recent ERA A Ranked Publications
  • J. Chen and Y. Yang, Temporal Dependency based
    Checkpoint Selection for Dynamic Verification of
    Temporal Constraints in Scientific Workflow
    Systems. ACM Transactions on Software Engineering
    and Methodology, 20(3), 2011
  • X. Liu, Y. Yang, Y. Jiang and J. Chen, Preventing Temporal Violations in Scientific Workflows: Where and How. IEEE Transactions on Software Engineering, 37(6):805-825, Nov./Dec. 2011.
  • D. Yuan, Y. Yang, X. Liu and J. Chen, On-demand Minimum Cost Benchmarking for Intermediate Datasets Storage in Scientific Cloud Workflow Systems. Journal of Parallel and Distributed Computing, 71:316-332, 2011.
  • J. Chen and Y. Yang, Localising Temporal Constraints in Scientific Workflows. Journal of Computer and System Sciences, Elsevier, 76(6):464-474, Sept. 2010.
  • G. Zhang, Y. Yang and J. Chen, A Historical
    Probability based Noise Generation Strategy for
    Privacy Protection in Cloud Computing. Journal of
    Computer and System Sciences, Elsevier, published
    online, Dec. 2011.

9
Outline
  • SUCCESS Centre and NGSP Group
  • Background: Cloud Computing and Workflow
  • Research Topics
  • Performance Management in Scientific Workflows
  • Data Management in Scientific Cloud Workflows
  • Security and Privacy Protection in the Cloud
  • Data Reliability Assurance in the Cloud
  • SwinDeW-C Cloud Workflow System
  • Future Work and Conclusions

10
Background: Cloud Computing
  • What is cloud computing?
  • R. Buyya: "A Cloud is a type of parallel and distributed system consisting of a collection of inter-connected and virtualised computers that are dynamically provisioned and presented as one or more unified computing resources based on service-level agreements established through negotiation between the service provider and consumers."
  • I. Foster: "Cloud computing is a large-scale distributed computing paradigm that is driven by economies of scale, in which a pool of abstracted, virtualised, dynamically-scalable, managed computing power, storage, platforms, and services are delivered on demand to external customers over the Internet."
  • UC Berkeley: Cloud computing is utility computing plus SaaS.

11
Why Cloud Computing
  • Data explosion
  • TB (10^12), PB (10^15), exabyte (EB, 10^18), zettabyte (ZB, 10^21), yottabyte (YB, 10^24)
  • The total amount of global data in 2010: 1.2 ZB
  • Google processed 24 PB of data every day in 2009
  • Every day: Facebook 10 TB, Twitter 7 TB, YouTube 4.5 TB
  • Moore's law vs. the speed of the data explosion
  • Buzzwords: data storage, data processing, parallel, distributed, virtualisation, commodity machines, energy consumption, data centres, utility computing, software (everything) as a service
12
Benefits of Clouds
  • No upfront infrastructure investment
  • No procuring hardware, setup, hosting, power, etc.
  • On-demand access
  • Lease what you need, when you need it
  • Efficient Resource Allocation
  • Globally shared infrastructure
  • Nice Pricing
  • Based on usage, QoS, supply and demand, loyalty, etc.
  • Application Acceleration
  • Parallelism for large-scale data analysis
  • High Availability, Scalability and Energy Efficiency
  • Supports Creation of 3rd-Party Services
  • Seamless offerings that build on the infrastructure and follow a similar business model to the cloud

13
Success Stories
  • Google
  • Animoto: 750,000 sign-ups in three days, 25,000 accesses in one hour, ten times the capacity required, on Amazon
  • NY Times: articles from 1851 to 1980, accomplished in 24 hours at a cost of only US$240
  • Facebook, Salesforce CRM, IBM Research Compute Cloud
  • ...

14
Cloud Computing Classification
  • Cloud Services
  • IaaS: Infrastructure as a Service, e.g. Amazon S3, EC2
  • PaaS: Platform as a Service, e.g. Google App Engine
  • SaaS: Software as a Service, e.g. Salesforce.com
  • Cloud Types
  • Public/Internet clouds
  • Private/enterprise clouds
  • Hybrid/mixed clouds

15
Example (PaaS): Hadoop Project
  • The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Hadoop provides a reliable shared storage and analysis system.
  • Storage is provided by HDFS, a distributed file system that provides high-throughput access to application data
  • Analysis is provided by MapReduce, a software framework for distributed processing of large data sets on compute clusters (a minimal sketch follows this list)
  • Hadoop powers Yahoo! search
  • Hadoop: The Definitive Guide (by Tom White)
  • http://hadoop.apache.org/
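To make the MapReduce model concrete, here is a minimal word-count sketch in the Hadoop Streaming style (Streaming runs any stdin/stdout program as mapper or reducer; the file names and the local test pipeline are illustrative assumptions, not part of the slides):

    # mapper.py - emit "word<TAB>1" for every word read from stdin
    import sys
    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

    # reducer.py - sum the counts per word (Hadoop sorts mapper output by key)
    import sys
    from itertools import groupby
    pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(word + "\t" + str(sum(int(count) for _, count in group)))

The same pair can be tested locally with an ordinary shell pipeline: cat input.txt | python mapper.py | sort | python reducer.py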

16
Cloud in Australia
  • Gartner estimated the global demand for cloud computing in 2009 at $46 billion, rising to $150 billion by 2013
  • The Australian Government's business operations have ICT costs of around $4.3 billion p.a.
  • Australian Government ICT Sustainability Plan 2010-2015: an energy-efficient technology direction for the Australian Government Data Centre Strategy
  • The Department of Finance and Deregulation estimated that costs of $1 billion could be avoided by developing a data centre strategy for the next 15 years.
  • Australian Taxation Office (ATO), Department of Immigration and Citizenship (DIAC) and Australian Maritime Safety Authority (AMSA): proof-of-concept initiatives
  • The Australian Academy of Technological Sciences and Engineering (ATSE): opportunities and challenges for government, universities and business
  • Westpac, Telstra, MYOB, Commonwealth Bank, Australia and New Zealand Banking Group and SAP: initiatives to support the migration and running of their business applications in the cloud

17
Cloud in China
  • The national Twelfth Five-Year Plan
  • http://www.chinacloud.cn/
  • http://www.china-cloud.com/
  • http://www.cloudcomputing-china.cn/

18
Background: Workflow
  • "The automation of a business process, in whole or part, during which documents, information or tasks are passed from one participant to another for action, according to a set of procedural rules."
  • "A Workflow Management System is a system that provides procedural automation of a business process by managing the sequence of work activities and by managing the required resources (people, data, applications) associated with the various activity steps."
  • -- Workflow Management Coalition

19
Why Workflow
  • Originated from office automation
  • Business process management, business agility
  • Business process analysis, re-design
  • Separation of workflow management system from
    software applications
  • Just like the separation of database management systems from software applications
  • Software component reuse, Web-services
  • Programming by scripting the composition of
    software components

20
Workflow Applications
  • Office automation: review-and-approval processes
  • Business process management systems, ERP systems
  • Machine shops, job shops and flow shops
  • Flight booking, insurance claim, tax refund
  • Scientific workflows
  • IBM WebSphere Workflow
  • Microsoft Windows Workflow Foundation
  • http://wm.microsoft.com/ms/msdn/netframework/introwf.wmv

21
Workflow Reference Model
22
Example: Pulsar Searching Workflow
  • Astrophysics: pulsar searching
  • Pulsars: the collapsed cores of stars that were once more massive than 6-10 times the mass of the Sun
  • http://astronomy.swin.edu.au/cosmos/P/Pulsar
  • Parkes Radio Telescope (http://www.parkes.atnf.csiro.au/)
  • The Swinburne Astrophysics group (http://astronomy.swinburne.edu.au/) has been conducting pulsar searching surveys (http://astronomy.swin.edu.au/pulsar/) based on the observation data from the Parkes Radio Telescope.
  • A typical scientific workflow involving a large number of data- and computation-intensive activities. For a single searching process, the average data volume (not including the raw stream data from the telescope) is over 4 terabytes, and the average execution time is about 23 hours on the Swinburne high-performance supercomputing facility (http://astronomy.swinburne.edu.au/supercomputing/).

23
Pulsar Searching Workflow
24
Outline
  • SUCCESS Centre and NGSP Group
  • Background: Cloud Computing and Workflow
  • Research Topics
  • Performance Management in Scientific Workflows
  • Data Management in Scientific Cloud Workflows
  • Security and Privacy Protection in the Cloud
  • Data Reliability Assurance in the Cloud
  • SwinDeW-C Cloud Workflow System
  • Future Work and Conclusions

25
Research Topics
Performance Management in Scientific Workflows
  • Dr. Xiao Liu
  • xliu@swin.edu.au
  • http://www.ict.swin.edu.au/personal/xliu/

26
Workflow QoS
  • QoS dimensions
  • time, cost, fidelity, reliability, security
  • QoS of Cloud Services
  • Workflow QoS
  • the overall QoS for a collection of cloud
    services
  • but they do not simply add up!

27
Temporal QoS
  • System performance
  • Response time
  • Throughput
  • Temporal constraints
  • Global constraints: deadlines
  • Local constraints: milestones, individual activity durations
  • Satisfactory temporal QoS
  • High performance: fast response, high throughput
  • On-time completion: low temporal violation rate

28
Problem Analysis
  • Setting temporal constraints
  • Coarse-grained and fine-grained temporal
    constraints
  • Prerequisite: effective forecasting of activity durations
  • Monitoring temporal consistency states
  • Monitor the workflow execution state
  • Detect potential temporal violations
  • Temporal violation handling
  • Where to conduct violation handling
  • What strategies to use

29
Ultimate Goal
  • Achieving on-time completion
  • Measurements
  • Temporal correctness
  • Cost effectiveness

30
Temporal Consistency Model
  • Temporal correctness: workflow execution towards the satisfaction of temporal constraints
  • A temporal consistency model defines the system's running state at a specific workflow activity point (i.e. a temporal checkpoint) against specific temporal constraints
  • Basic elements: the real workflow running time (before and including the checkpoint), the estimated running time of the uncompleted workflow (after the checkpoint), and the temporal constraints (a minimal sketch follows)
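As a minimal sketch of such a consistency check (assuming, as in the model above, normally distributed and independent activity durations; the function name and inputs are illustrative, not from the slides):

    from math import erf, sqrt

    def temporal_consistency(elapsed, remaining_means, remaining_sigmas, deadline):
        # P(elapsed + remaining work <= deadline), where the remaining work is
        # modelled as a normal variable N(sum of means, sum of variances)
        mu = elapsed + sum(remaining_means)
        var = sum(s * s for s in remaining_sigmas)
        if var == 0:
            return 1.0 if mu <= deadline else 0.0
        z = (deadline - mu) / sqrt(var)
        return 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF at z

    # Example: 10 hours elapsed, three activities remaining, 23-hour deadline
    p = temporal_consistency(10.0, [4.0, 3.0, 2.0], [0.5, 0.4, 0.3], 23.0)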

31
Probability Based Temporal Consistency Model
  • Time attributes for workflow activity ai
  • Maximum activity duration D(ai)
  • Mean activity duration M(ai)
  • Minimum activity duration d(ai)
  • Runtime activity duration R(ai)
  • 3-sigma rule: for a normal distribution, 99.73% of samples fall within (µ-3σ, µ+3σ); R(ai) ~ N(µ, σ²)
  • D(ai) = µ+3σ, M(ai) = µ, d(ai) = µ-3σ (a small sketch follows)
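A small sketch of deriving these attributes from historical duration samples (an illustrative helper, not from the slides):

    from statistics import mean, stdev

    def time_attributes(samples):
        # derive d(ai), M(ai) and D(ai) from past durations via the 3-sigma rule
        mu, sigma = mean(samples), stdev(samples)
        return {"d": mu - 3 * sigma, "M": mu, "D": mu + 3 * sigma}

    attrs = time_attributes([30.5, 28.1, 33.0, 29.4, 31.2, 30.8])  # minutes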

32
Probability Based Temporal Consistency Model
  • Types of temporal constraints
  • Upper-bound temporal constraint, U(W)
  • Lower-bound temporal constraint, L(W)
  • Fixed-time temporal constraint, F(W)
  • Relationships
  • Upper bound and lower bound are symmetric
  • A fixed-time constraint is a special case of an upper-bound constraint
  • Choice
  • Upper-bound/lower-bound constraints for workflow build time
  • Fixed-time constraints for workflow runtime

33
Probability Based Temporal Consistency Model
34
Probability Based Temporal Consistency Model
35
Temporal Framework
36
Temporal Framework
  • Component 1: Temporal Constraint Setting
  • Forecasting workflow activity durations
  • Setting coarse-grained temporal constraints
  • Setting fine-grained temporal constraints
  • Component 2: Temporal Consistency Monitoring
  • Temporal checkpoint selection
  • Temporal verification
  • Component 3: Temporal Violation Handling
  • Temporal violation handling point selection
  • Temporal violation handling

37
Component 1: Temporal Constraint Setting
38
Forecasting Activity Durations
  • Statistical time-series pattern based forecasting
    strategies
  • Selected Publications
  • X. Liu, Z. Ni, D. Yuan, Y. Jiang, Z. Wu, J. Chen,
    Y. Yang, A Novel Statistical Time-Series Pattern
    based Interval Forecasting Strategy for Activity
    Durations in Workflow Systems, Journal of Systems
    and Software (JSS), vol. 84, no. 3, Pages
    354-376, March 2011.
  • X. Liu, J. Chen, K. Liu and Y. Yang, Forecasting
    Duration Intervals of Scientific Workflow
    Activities based on Time-Series Patterns, Proc.
    of 4th IEEE International Conference on e-Science
    (e-Science08), pages 23-30, Indianapolis, USA,
    Dec. 2008.

39
Setting Temporal Constraints
  • Probability based temporal consistency model
  • Time analysis based on Stochastic Petri Nets
  • Selected Publications
  • X. Liu, Z. Ni, J. Chen, Y. Yang, A Probabilistic
    Strategy for Temporal Constraint Management in
    Scientific Workflow Systems, Concurrency and Computation: Practice and Experience (CCPE), Wiley, 23(16):1893-1919, Nov. 2011.
  • X. Liu, J. Chen and Y. Yang, A Probabilistic
    Strategy for Setting Temporal Constraints in
    Scientific Workflows, Proc. 6th International
    Conference on Business Process Management
    (BPM2008), Lecture Notes in Computer Science,
    Vol. 5240, pages 180-195, Milan, Italy, Sept.
    2008.

40
Component 2: Temporal Consistency Monitoring
41
Temporal Consistency Monitoring
  • Minimum (Probability) Time Redundancy based
    Checkpoint Selection Strategy
  • Temporal Dependency based Checkpoint Selection
    Strategy
  • Selected Publications
  • X. Liu, Y. Yang, Y. Jiang and J. Chen, Preventing Temporal Violations in Scientific Workflows: Where and How. IEEE Transactions on Software Engineering, 37(6):805-825, Nov./Dec. 2011.
  • J. Chen and Y. Yang, Temporal Dependency based
    Checkpoint Selection for Dynamic Verification of
    Temporal Constraints in Scientific Workflow
    Systems. ACM Transactions on Software Engineering
    and Methodology, 20(3), 2011

42
Component 3: Temporal Violation Handling
43
Violation Handling
  • Violation handling point selection
  • (Probability) time deficit allocation
  • Workflow local rescheduling strategies: ACO, GA, PSO
  • Selected Publications
  • X. Liu, Z. Ni, Z. Wu, D. Yuan, J. Chen and Y.
    Yang, A Novel General Framework for Automatic and
    Cost-Effective Handling of Recoverable Temporal
    Violations in Scientific Workflow Systems,
    Journal of Systems and Software, vol. 84, no. 3,
    pp. 492-509, 2011
  • X. Liu, Y. Yang, Y. Jiang and J. Chen, Do We Need
    to Handle Every Temporal Violation in Scientific
    Workflow Systems, submitted to ACM Transactions
    on Software Engineering and Methodology

44
Experimental Results on Temporal Violation Rates
45
Cost Analysis
46
Yearly Cost and Time Reduction
Yearly cost reduction for the pulsar searching
workflow
Yearly time reduction for the pulsar searching
workflow
47
Research Topics
Data Management in Scientific Cloud Workflows
Dr. Dong Yuan, Dr. Xiao Liu
dyuan@swin.edu.au, xliu@swin.edu.au
http://www.ict.swin.edu.au/personal/dyuan/
48
Data Management in Cloud Computing
  • Scientific applications in cloud computing
  • Computation and data intensive applications
  • Massive computation and storage resources
  • Pay-as-you-go model
  • Computation and storage trade-off
  • Some datasets should be stored (storage cost)
  • Some datasets can be regenerated (computation cost)
  • Data Placement

49
Data Dependency Graph (DDG)
  • A classification of the application data
  • Original data and generated data
  • Data provenance
  • A kind of meta-data that records how data are
    generated.
  • DDG

50
Attributes of a Dataset in DDG
  • A dataset di in a DDG has the attributes ⟨xi, yi, fi, vi, provSeti, CostRi⟩
  • xi ($) denotes the generation cost of dataset di from its direct predecessors.
  • yi ($/t) denotes the cost of storing dataset di in the system per time unit.
  • fi (Boolean) is a flag which denotes whether dataset di is stored or deleted in the system.
  • vi (Hz) denotes the usage frequency, which indicates how often di is used.

51
Attributes of a Dataset in DDG
  • provSeti denotes the set of stored provenance datasets that are needed when regenerating dataset di.
  • CostRi ($/t) is di's cost rate, i.e. the average cost per time unit of di in the system.
  • Cost = Computation + Storage
  • Computation: the total cost of computation resources
  • Storage: the total cost of storage resources

52
Cost Model of Datasets Storage in the Cloud
  • Total cost rate for storing the datasets in a DDG, where S is the storage strategy of the DDG (a sketch of this model follows this list)
  • This cost model also represents the trade-off between computation and storage in the cloud
  • For a DDG with n datasets, there are 2^n different storage strategies
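A hedged sketch of this cost model for a linear DDG (the names and data layout are illustrative assumptions): a stored dataset contributes its storage cost rate yi, while a deleted one contributes its regeneration cost (its own xi plus the xi of every deleted predecessor back to the nearest stored dataset) multiplied by its usage frequency vi.

    def total_cost_rate(datasets, stored):
        # datasets: list of dicts with keys "x" ($), "y" ($/t), "v" (Hz)
        # stored: set of indices kept in storage by the strategy S
        total = 0.0
        for i, d in enumerate(datasets):
            if i in stored:
                total += d["y"]                    # storage cost rate
            else:
                regen = d["x"]                     # own generation cost
                j = i - 1
                while j >= 0 and j not in stored:  # plus deleted predecessors
                    regen += datasets[j]["x"]
                    j -= 1
                total += regen * d["v"]            # paid at the usage frequency
        return total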

53
Minimum Cost Benchmark
  • What is the minimum cost benchmark?
  • The minimum cost for storing and regenerating
    datasets in the cloud
  • The best trade-off between computation and
    storage in the cloud
  • We need to find the Minimum Cost Storage Strategy
    (MCSS) for the application datasets
  • Significance of the minimum cost benchmark
  • Due to the pay-as-you-go model,
    cost-effectiveness is very important to users for
    deploying their applications in the cloud
  • The minimum cost benchmark is for users to
    evaluate the cost-effectiveness of their storage
    strategies.

54
Static On-Demand Minimum Cost Benchmarking
  • Static benchmarking is provided as an on-demand service for users
  • Whenever a benchmarking request arrives, the corresponding algorithms are triggered to calculate the minimum cost benchmark, which is a one-time-only computation.
  • This approach suits situations where benchmarking is only requested occasionally.
  • CTT-SP algorithm
  • A novel algorithm designed to find the MCSS of a DDG with polynomial time complexity
  • CTT-SP: Cost Transitive Tournament Shortest Path

55
Linear CTT-SP Algorithm
  • The CTT-SP algorithm for a linear DDG
  • Essence of the algorithm
  • Construct a Cost Transitive Tournament (CTT) based on the DDG
  • In the CTT, every path from the start to the end represents a storage strategy of the DDG.
  • The paths have a one-to-one mapping to the storage strategies.

56
Linear CTT-SP Algorithm
  • Set weights on the edges of the CTT
  • We denote the weight of the edge from di to dj as ω⟨di, dj⟩, defined as the sum of the cost rates of dj and of the datasets between di and dj, supposing that only di and dj are stored and all the datasets between them are deleted.
  • Formally: ω⟨di, dj⟩ = yj + Σ genCost(dk)·vk over all dk between di and dj, where genCost(dk) is dk's regeneration cost from the stored dataset di.
  • The length of each path then equals the TCR (Total Cost Rate) of the corresponding storage strategy.
57
Linear CTT-SP Algorithm
  • Find the shortest path from ds to de in the CTT
  • The MCSS Smin stores exactly the datasets that the shortest path Pmin⟨ds, de⟩ traverses.
  • The minimum cost benchmark is the length of Pmin⟨ds, de⟩, i.e. the total cost rate of Smin (see the sketch after this list).
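A compact sketch of the linear CTT-SP idea (illustrative code under the definitions above, not the authors' implementation): virtual start and end nodes bracket the chain, each edge weight follows the definition on the previous slide, and a shortest-path pass over the resulting DAG yields the MCSS together with the minimum cost benchmark.

    def mcss_linear(datasets):
        # datasets: list of dicts with keys "x" ($), "y" ($/t), "v" (Hz)
        n = len(datasets)

        def weight(i, j):
            # edge i -> j: dj is stored (j == n is the virtual end); the
            # datasets strictly between i and j are deleted and must be
            # regenerated from the stored dataset at position i
            w = datasets[j]["y"] if j < n else 0.0
            for k in range(i + 1, j):
                regen = sum(datasets[m]["x"] for m in range(i + 1, k + 1))
                w += regen * datasets[k]["v"]
            return w

        dist, prev = {-1: 0.0}, {}
        for j in list(range(n)) + [n]:        # nodes in chain order
            best, arg = float("inf"), None
            for i in [-1] + list(range(j)):   # every possible earlier node
                if dist[i] + weight(i, j) < best:
                    best, arg = dist[i] + weight(i, j), i
            dist[j], prev[j] = best, arg

        stored, j = set(), prev[n]            # backtrack the shortest path
        while j is not None and j != -1:
            stored.add(j)
            j = prev[j]
        return stored, dist[n]                # MCSS and minimum cost benchmark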

58
General CTT-SP Algorithm
  • Take the simple DDG below as an example (with a block)
  • For a general DDG, we select one branch from the first dataset to the last dataset as the main branch (e.g. d1, d2, d5, d6, d7, d8) to construct the CTT.
  • The rest of the datasets are denoted as sub-branches (e.g. d3, d4).

59
General CTT-SP Algorithm
  • The general CTT-SP algorithm is recursive
  • For the sub-branches, the MCSS differs depending on which predecessors and successors are stored, so it cannot be calculated at the beginning.
  • The general CTT-SP algorithm is therefore called recursively on the sub-branches, and their cost rates are dynamically added to the edges in the CTT of the main branch

60
Dynamic on-the-fly Minimum Cost Benchmarking
  • The benchmarking service is delivered on the fly, instantly responding to benchmarking requests
  • By saving and utilising pre-calculated results, whenever the application cost changes in the cloud we can dynamically calculate the new minimum cost and keep the benchmark updated.
  • This approach suits situations where benchmarking is requested more frequently at runtime.
  • Partitioned Solution Space (PSS)
  • A PSS saves all the possible MCSSs of a DDG segment.
  • For a DDG segment, given particular stored predecessors and successors, we can quickly locate the corresponding MCSS in the PSS.

61
PSS for a DDG_LS (Linear DDG Segment)
  • A DDG_LS has different MCSSs according to the storage statuses of its preceding and succeeding datasets.
  • CTT for a DDG_LS
  • Different selections of the start and end datasets (ds and de) may lead to different MCSSs for the segment.

62
PSS for a DDG_LS
  • Partition of the solution space
  • Let Si,j and Si',j' be two MCSSs in the solution space with SCRi,j < SCRi',j'. The border between Si,j and Si',j' is where, for particular X and V, the TCRs of storing the DDG_LS with Si,j and with Si',j' are equal.
  • Hence, the border of Si,j and Si',j' in the solution space is a straight line.

63
PSS for a DDG_LS
  • With a further simplifying assumption, the border equation can be simplified even more.
  • The figure below demonstrates the partition of the solution space.

64
PSS for a DDG_LS
  • We can calculate the partition lines of all the potential MCSSs in the solution space, which together form the PSS.
  • With the PSS, given any X and V, we can quickly locate the corresponding MCSS for the DDG_LS (see the sketch after this list).
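Because the borders are straight lines, each candidate strategy's total cost rate is linear in X and V, so locating the MCSS reduces to taking the minimum of a handful of linear functions. A hedged sketch (the tuple layout and the numbers are illustrative assumptions):

    def locate_mcss(candidates, X, V):
        # candidates: (name, base, a, b) with TCR(X, V) = base + a*X + b*V
        return min(candidates, key=lambda c: c[1] + c[2] * X + c[3] * V)

    pss = [("S1", 5.0, 0.2, 0.1), ("S2", 4.0, 0.5, 0.3)]  # illustrative PSS
    best = locate_mcss(pss, X=2.0, V=1.5)                 # picks "S2" here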

65
Dynamic on-the-fly Minimum Cost Benchmarking
  • PSS-based benchmarking approach (key ideas)
  • Merge the PSSs of the DDG_LSs to derive the PSS of the whole DDG, from which the minimum cost benchmark can be obtained.
  • Save all the PSSs calculated along the way in a hierarchy.
  • Whenever the application cost changes, the new minimum cost benchmark can be quickly derived from the saved PSSs.
  • Hence the minimum cost benchmark is kept dynamically updated, so that benchmarking requests can be responded to instantly, on the fly.

66
Saving PSSs
  • We save all the PSSs of a DDG in a hierarchy
  • The level number indicates how many DDG_LSs are merged in the PSS at that level.
  • A link between two PSSs at levels i and i+1 in the hierarchy means that the DDG segment of the PSS at level i+1 contains the DDG segment of the PSS at level i.

67
Cost-Effective Storage Strategies
  • Cost Rate based Storage Strategy
  • The strategy directly compares the generation cost rate and the storage cost rate of every dataset to decide its storage status.
  • It guarantees that all the datasets stored in the system are necessary.
  • It dynamically checks whether re-generated datasets need to be stored and, if so, adjusts the storage strategy accordingly.
  • This strategy is highly efficient with fairly reasonable cost-effectiveness.

68
Cost-Effective Storage Strategies
  • Local-Optimisation based Storage Strategy
  • The strategy divides a DDG with a large number of application datasets into small linear segments (DDG_LS).
  • It utilises the linear CTT-SP algorithm to find the MCSS of every segment, hence achieving local optimisation.
  • This strategy is highly cost-effective with very reasonable runtime efficiency.

69
Pulsar Searching Application Case Study
  • In analysing one piece of the observation data, six datasets are generated.
  • We directly utilise the on-demand benchmarking approach
  • The MCSS is: store d2, d4 and d6; delete d1, d3 and d5.
  • The minimum cost benchmark is $0.51 per day.

70
PSSs merging process
There are two phases in the execution: 1) Files Preparation; 2) Seeking Candidates. Two DDG_LSs are generated correspondingly.
71
Pulsar Searching Application Case Study
72
Pulsar Searching Application Case Study
Strategy | Extracted beam | De-dispersion files | Accelerated de-dispersion files | Seek results | Pulsar candidates | XML files
1) Store no datasets | Deleted | Deleted | Deleted | Deleted | Deleted | Deleted
2) Store all datasets | Stored | Stored | Stored | Stored | Stored | Stored
3) Generation cost based strategy | Deleted | Stored | Stored | Deleted | Deleted | Stored
4) Usage based strategy | Deleted | Stored | Deleted | Deleted | Deleted | Deleted
5) Cost rate based strategy | Deleted | Stored (deleted initially) | Deleted | Stored | Deleted | Stored
6) Local-optimisation based strategy | Deleted | Stored | Deleted | Stored | Deleted | Stored
7) Minimum cost benchmark | Deleted | Stored | Deleted | Stored | Deleted | Stored
73
Data Placement
  • Compute near big data!
  • In scientific cloud workflows, large amounts of application data need to be stored in distributed data centres. A data manager must intelligently select the data centres in which these data reside, by considering:
  • The dependencies between datasets
  • The movement of large datasets
  • The fact that some data have fixed locations

74
A matrix based k-means clustering strategy
  • Build time: group the existing datasets into k data centres based on data dependencies
  • Step 1: Set up and cluster the dependency matrix
  • Step 2: Partition and distribute the datasets
  • Runtime: dynamically cluster newly generated datasets to the most appropriate data centres based on dependencies
  • Step 1: Pre-allocate data using the clustering algorithm
  • Step 2: Adjust the data placement among data centres when new workflows are deployed or some data centres become overloaded (a minimal clustering sketch follows this list)
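A minimal k-means sketch over rows of the dependency matrix (pure Python and purely illustrative: the matrix, k and the distance measure are assumptions, not the published strategy):

    import random

    def kmeans(rows, k, iters=20):
        # rows: one dependency-matrix row per dataset; datasets whose rows
        # are close end up in the same cluster, i.e. the same data centre
        centres = random.sample(rows, k)
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for r in rows:
                i = min(range(k), key=lambda c: sum((a - b) ** 2
                        for a, b in zip(r, centres[c])))
                clusters[i].append(r)
            centres = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centres[i]
                       for i, cl in enumerate(clusters)]
        return clusters

    # Example: four datasets and their pairwise dependency counts
    dep = [[0, 5, 1, 0], [5, 0, 0, 1], [1, 0, 0, 4], [0, 1, 4, 0]]
    print(kmeans(dep, k=2))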

75
(No Transcript)
76
Research Topics
Security and Privacy Protection in the Cloud
Gaofeng Zhang, gzhang@swin.edu.au
77
Background
  • Data security vs. data privacy
  • Privacy in cloud computing
  • Massive data is stored and computed in an open cloud environment
  • Customers cannot control what happens inside the cloud
  • The severity of privacy risks in cloud computing
  • One specific privacy risk in cloud computing
  • Indirect private information (collective information)
  • Normal service processes and functions (no disruption)
  • The approach: noise obfuscation for privacy protection

78
Privacy Protection in Cloud
  • Roles in the view of privacy in a regular IT system
  • Privacy owner, privacy user and privacy thief
  • Keep things safe between the privacy owner and the privacy user!
79
Privacy Protection in Cloud
  • Microsoft's view of the cloud ecosystem
  • "Powerful, Green and Smart Cloud" (IBM)

80
Privacy Protection in Cloud
  • Roles in the view of privacy in the cloud
  • Privacy owner, privacy user and privacy thief
  • Virtualisation disables the safe separation between the privacy owner and the privacy user!
81
Noise Obfuscation(1)
  • Background
  • Massive data is stored and computed in open cloud environments.
  • Customers cannot control what happens inside the cloud.
  • Main idea: dilute real private information with noise information
  • Noise information, not a noise signal!

82
Noise Obfuscation(2)
  • A motivating example
  • One customer who often travels to one city in Australia, say Sydney, regularly checks the weather report from a weather service in a cloud environment before departure. The frequent appearance of service requests about the weather report for "Sydney" can reveal the private fact that the customer usually goes to Sydney. But if a system aids the customer by injecting other requests, such as "Perth" or "Darwin", into the "Sydney" queue, the service provider cannot distinguish which requests are real and which are noise, as it just sees a similar style of service request. All these requests are responded to, yet they no longer reveal the customer's location privacy. In such cases, privacy can in general be protected by noise obfuscation (a minimal sketch follows below).

From data privacy to process privacy!
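A minimal sketch of the historical-probability idea (illustrative only, not the published strategy; the weighting scheme is an assumption): favour noise requests for the cities the provider has seen least, so that the observed request distribution flattens towards uniform.

    import random
    from collections import Counter

    def next_noise_request(history, cities):
        observed = Counter(history)
        top = max(observed.values(), default=0)
        # under-represented cities get higher weight, flattening the distribution
        weights = [top - observed.get(c, 0) + 1 for c in cities]
        return random.choices(cities, weights=weights)[0]

    history = ["Sydney", "Sydney", "Perth", "Sydney"]
    noise = next_noise_request(history, ["Sydney", "Perth", "Darwin"])  # "Darwin" most likely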
83
Research Topics
  • Noise Generation
  • Historical probability based noise generation
    strategy
  • Time-series pattern based noise generation
    strategy
  • Association probability based noise generation
    strategy
  • Noise Utilisation
  • Trust model and injection strategy for noise
    obfuscation
  • Noise Cooperation Mechanism
  • Privacy protection framework under noise
    obfuscation

84
Research Topics
Cost-Effective Data Reliability Assurance in the
Cloud
Wenhao Li, wli@swin.edu.au
85
Background
  • The growth of cloud data
  • It is estimated that by 2015 the data stored in the cloud will reach 0.8 ZB, while even more data are stored or processed temporarily on their journey. (IDC)
  • The size of cloud applications is also expanding
  • Challenge
  • How to reduce the data storage cost of using cloud storage services without sacrificing data reliability assurance

86
Research issues
  • Data reliability modelling in the cloud
  • Replication-based cost-effective data reliability
    management approaches
  • Data loss detection and data recovery

87
Replication-based Approaches
  • Incremental replication strategy: CIR (Cost-effective Incremental Replication)
  • The generation of replicas follows an incremental pattern: a replica is created only when the current replicas cannot provide sufficient data reliability assurance to meet the user's requirement.
  • Data reliability management mechanism based on proactive replica checking: PRCR (Proactive Replica Checking for Reliability)
  • According to their different data reliability requirements, each file has no more than two replicas stored in the cloud.
  • A replica-checking process is conducted proactively to detect data loss and recover lost replicas (a minimal reliability sketch follows this list).
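To illustrate the incremental idea, a hedged sketch (the exponential loss model and all numbers are assumptions for illustration): if each replica is lost at rate lam per year, one replica survives a duration t with probability exp(-lam*t), and k independent replicas give reliability 1 - (1 - exp(-lam*t))^k; a new replica is created only when this falls short of the requirement.

    import math

    def reliability(k, lam, t):
        # probability that at least one of k replicas survives duration t
        return 1 - (1 - math.exp(-lam * t)) ** k

    def replicas_needed(lam, t, requirement):
        # smallest replica count that meets the reliability requirement
        k = 1
        while reliability(k, lam, t) < requirement:
            k += 1
        return k

    # Example: 1% annual loss rate, one-year duration, 99.99% requirement -> 2
    print(replicas_needed(0.01, 1.0, 0.9999))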

88
  • CIR can reduce up to two-thirds of the current cloud storage cost, especially for data with short storage durations and low reliability requirements.
  • PRCR can reduce one-third to two-thirds of the current cloud storage cost, especially when the amount of data is large.

89
Research Topics
Cloud Workflow System Design and Development
Dahai Cao, dcao@swin.edu.au
90
SwinCloud Cloud Computing Testbed
  • SwinCloud

91
Prototype: SwinDeW-C Cloud Workflow System
  • SwinDeW-C

92
New Progress
  • Successfully deployed on the Amazon cloud
  • Eucalyptus as the cloud infrastructure platform

93
Call for Papers and Call for Workshops
  • 2012 International Conference on Cloud and Green Computing, Nov. 1-3, 2012, Xiangtan, Hunan, China
  • http://kpnm.hnust.cn/confs/cgc2012/
  • Important dates
  • Workshop proposals: ongoing, as received
  • Submission deadline: June 30, 2012
  • Author notification: July 30, 2012
  • Final manuscript due: August 10, 2012
  • Registration due: August 18, 2012

94
End - Q&A
  • Thanks for your attention!