Dr. Xiao Liu PowerPoint PPT Presentation


1
Overview: Cloud Computing and Workflow Research in the NGSP Group
  • Dr. Xiao Liu
  • Sessional Lecturer, Research Fellow
  • Centre of SUCCESS
  • Swinburne University of Technology
  • Melbourne, Australia

2
Outline
  • SUCCESS Centre and NGSP Group
  • Background: Cloud Computing and Workflow
  • Research Topics
  • Performance Management in Scientific Workflows
  • Data Management in Scientific Cloud Workflows
  • Security and Privacy Protection in the Cloud
  • Data Reliability Assurance in the Cloud
  • SwinDeW-C Cloud Workflow System
  • Future Work and Conclusions

3
The Centre of SUCCESS
  • SUCCESS: Swinburne University Centre for Computing and Engineering Software Systems
  • SUCCESS is the No. 1 software engineering centre in Australia
  • SUCCESS is one of the seven Tier 1 centres at Swinburne University of Technology (Times World Ranking 351-400)
  • The ambition of the Centre is to become the top centre for software research in the Southern Hemisphere within the next five years, achieving world-renowned software innovation and engineering with a balanced theoretic, applied, industry and education impact across the Centre

4
SUCCESS
  • Research Focus Areas
  • Knowledge and Data Intensive Systems
  • Nature of Software
  • Next Generation Software Platforms
  • SE Education and IBL/RBL
  • Software Analysis and Testing
  • Software R&D Group
  • http://www.swinburne.edu.au/ict/success/research-expertise/

5
NGSP (Small) Group Overview
  • This group conducts research into cloud computing
    and workflow technologies for complex software
    systems and services.
  • Members

  • Leader: Prof Yun Yang (PC member for ICSE 07/08, FSE 09, ICSE 10/11/12)
  • Researchers: A/Prof Jinjun Chen (UTS), Dr Xiao Liu (Postdoc), Dr Dong Yuan (Postdoc), Gaofeng Zhang, Wenhao Li, Dahai Cao, Xuyun Zhang, Chang Liu, Jofry Hadi Sutanto
  • Others: Prof John Grundy, Prof Chengfei Liu
  • Visitors: Prof Lee Osterweil, Prof Lori Clarke, Prof Ivan Stojmenovic, Prof Paola Inverardi, Prof Amit Sheth, Prof Wil van der Aalst, Prof Hai Zhuge
6
R&D Projects & Grants
  • Primary projects
  • (Cloud) workflow technology
  • ARC LP0990393 (Y Yang, R Kotagiri, J Chen, C Liu)
  • Cloud computing
  • ARC DP110101340 (Y Yang, J Chen, J Grundy)
  • Secondary project
  • Management control systems for effective
    information sharing and security in government
    organisations
  • ARC LP110100228 (S Cugenasen, Y Yang)

7
R&D Projects Overview
  • SwinDeW workflow family including SwinDeW-C
  • Architectures / Models (D Cao)
  • Scheduling / Data and service management (D Yuan,
    X Liu)
  • Verification / Exception handling (X Liu)
  • Cloud computing
  • Data management (D Yuan, X Liu, W Li)
  • Privacy and Security (G Zhang, X Zhang, C Liu)

8
Some Recent ERA A Ranked Publications
  • J. Chen and Y. Yang, Temporal Dependency based
    Checkpoint Selection for Dynamic Verification of
    Temporal Constraints in Scientific Workflow
    Systems. ACM Transactions on Software Engineering
    and Methodology, 20(3), 2011
  • X. Liu, Y. Yang, Y. Jiang and J. Chen, Preventing Temporal Violations in Scientific Workflows: Where and How. IEEE Transactions on Software Engineering, 37(6):805-825, Nov./Dec. 2011.
  • D. Yuan, Y. Yang, X. Liu and J. Chen, On-demand Minimum Cost Benchmarking for Intermediate Datasets Storage in Scientific Cloud Workflow Systems. Journal of Parallel and Distributed Computing, 71:316-332, 2011.
  • J. Chen and Y. Yang, Localising Temporal Constraints in Scientific Workflows. Journal of Computer and System Sciences, Elsevier, 76(6):464-474, Sept. 2010.
  • G. Zhang, Y. Yang and J. Chen, A Historical
    Probability based Noise Generation Strategy for
    Privacy Protection in Cloud Computing. Journal of
    Computer and System Sciences, Elsevier, published
    online, Dec. 2011.

9
Outline
  • SUCCESS Centre and NGSP Group
  • Background: Cloud Computing and Workflow
  • Research Topics
  • Performance Management in Scientific Workflows
  • Data Management in Scientific Cloud Workflows
  • Security and Privacy Protection in the Cloud
  • Data Reliability Assurance in the Cloud
  • SwinDeW-C Cloud Workflow System
  • Future Work and Conclusions

10
Background: Cloud Computing
  • What is cloud computing?
  • R. Buyya: "A Cloud is a type of parallel and distributed system consisting of a collection of inter-connected and virtualised computers that are dynamically provisioned and presented as one or more unified computing resources based on service-level agreements established through negotiation between the service provider and consumers."
  • I. Foster: "Cloud computing is a large-scale distributed computing paradigm that is driven by economies of scale, in which a pool of abstracted, virtualised, dynamically-scalable, managed computing power, storage, platforms, and services are delivered on demand to external customers over the Internet."
  • UC Berkeley: Cloud computing is utility computing plus SaaS.

11
Why Cloud Computing
  • Data explosion
  • TB (10^12), PB (10^15), exabyte (EB, 10^18), zettabyte (ZB, 10^21), yottabyte (YB, 10^24)
  • The total amount of global data in 2010: 1.2 ZB
  • Google processed 24 PB of data every day in 2009
  • Every day: Facebook 10 TB, Twitter 7 TB, YouTube 4.5 TB
  • Moore's law vs. the speed of the data explosion
  • Buzzwords: data storage, data processing, parallel, distributed, virtualisation, commodity machines, energy consumption, data centres, utility computing, software (everything) as a service
12
Benefits of Clouds
  • No upfront infrastructure investment
  • No procuring hardware, setup, hosting, power, etc.
  • On-demand access
  • Lease what you need, when you need it
  • Efficient Resource Allocation
  • Globally shared infrastructure
  • Nice Pricing
  • Based on usage, QoS, supply and demand, loyalty, etc.
  • Application Acceleration
  • Parallelism for large-scale data analysis
  • High Availability, Scalability and Energy Efficiency
  • Supports Creation of 3rd-Party Services
  • Seamless offerings that build on the infrastructure and follow a similar business model to the cloud

13
Success Stories
  • Google
  • Animoto: 750,000 sign-ups in three days, 25,000 accesses in one hour, ten times the capacity required, on Amazon
  • NY Times: articles from 1851 to 1980, accomplished in 24 hours at a cost of only US$240
  • Facebook, Salesforce CRM, IBM Research Compute Cloud
  • ...

14
Cloud Computing Classification
  • Cloud Services
  • IaaS: Infrastructure as a Service, e.g. Amazon S3, EC2
  • PaaS: Platform as a Service, e.g. Google App Engine
  • SaaS: Software as a Service, e.g. Salesforce.com
  • Cloud Types
  • Public/Internet clouds
  • Private/enterprise clouds
  • Hybrid/mixed clouds

15
Example (PaaS): Hadoop Project
  • The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Hadoop provides a reliable shared storage and analysis system.
  • Storage is provided by HDFS, a distributed file system that provides high-throughput access to application data
  • Analysis is provided by MapReduce, a software framework for distributed processing of large data sets on compute clusters (a minimal sketch follows this list)
  • Hadoop powers Yahoo! search
  • Hadoop: The Definitive Guide (by Tom White)
  • http://hadoop.apache.org/
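To make the MapReduce model concrete, here is a minimal word-count sketch in the Hadoop Streaming style (Streaming runs any stdin/stdout program as mapper or reducer; the file names and the local test pipeline are illustrative assumptions, not part of the slides):

    # mapper.py - emit "word<TAB>1" for every word read from stdin
    import sys
    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

    # reducer.py - sum the counts per word (Hadoop sorts mapper output by key)
    import sys
    from itertools import groupby
    pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(word + "\t" + str(sum(int(count) for _, count in group)))

The same pair can be tested locally with an ordinary shell pipeline: cat input.txt | python mapper.py | sort | python reducer.py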

16
Cloud in Australia
  • Gartner estimated the global demand for cloud computing in 2009 at $46 billion, rising to $150 billion by 2013
  • The Australian Government's business operations have ICT costs of around $4.3 billion p.a.
  • Australian Government ICT Sustainability Plan 2010-2015: an energy-efficient technology direction for the Australian Government Data Centre Strategy
  • The Department of Finance and Deregulation estimated that costs of $1 billion could be avoided by developing a data centre strategy for the next 15 years.
  • Australian Taxation Office (ATO), Department of Immigration and Citizenship (DIAC) and Australian Maritime Safety Authority (AMSA): proof-of-concept initiatives
  • The Australian Academy of Technological Sciences and Engineering (ATSE): opportunities and challenges for government, universities and business
  • Westpac, Telstra, MYOB, Commonwealth Bank, Australia and New Zealand Banking Group and SAP: initiatives to support the migration and running of their business applications in the cloud

17
Cloud in China
  • The national Twelfth Five-Year Plan
  • http://www.chinacloud.cn/
  • http://www.china-cloud.com/
  • http://www.cloudcomputing-china.cn/

18
Background: Workflow
  • "The automation of a business process, in whole or part, during which documents, information or tasks are passed from one participant to another for action, according to a set of procedural rules."
  • "A Workflow Management System is a system that provides procedural automation of a business process by managing the sequence of work activities and by managing the required resources (people, data, applications) associated with the various activity steps."
  • -- Workflow Management Coalition

19
Why Workflow
  • Originated from office automation
  • Business process management, business agility
  • Business process analysis, re-design
  • Separation of workflow management system from
    software applications
  • Just like the separation of database management systems from software applications
  • Software component reuse, Web-services
  • Programming by scripting the composition of
    software components

20
Workflow Applications
  • Office automation: review-and-approval processes
  • Business process management systems, ERP systems
  • Machine shops, job shops and flow shops
  • Flight booking, insurance claim, tax refund
  • Scientific workflows
  • IBM WebSphere Workflow
  • Microsoft Windows Workflow Foundation
  • http://wm.microsoft.com/ms/msdn/netframework/introwf.wmv

21
Workflow Reference Model
22
Example: Pulsar Searching Workflow
  • Astrophysics: pulsar searching
  • Pulsars: the collapsed cores of stars that were once more massive than 6-10 times the mass of the Sun
  • http://astronomy.swin.edu.au/cosmos/P/Pulsar
  • Parkes Radio Telescope (http://www.parkes.atnf.csiro.au/)
  • The Swinburne Astrophysics group (http://astronomy.swinburne.edu.au/) has been conducting pulsar searching surveys (http://astronomy.swin.edu.au/pulsar/) based on the observation data from the Parkes Radio Telescope.
  • A typical scientific workflow involving a large number of data- and computation-intensive activities. For a single searching process, the average data volume (not including the raw stream data from the telescope) is over 4 terabytes, and the average execution time is about 23 hours on the Swinburne high-performance supercomputing facility (http://astronomy.swinburne.edu.au/supercomputing/).

23
Pulsar Searching Workflow
24
Outline
  • SUCCESS Centre and NGSP Group
  • Background: Cloud Computing and Workflow
  • Research Topics
  • Performance Management in Scientific Workflows
  • Data Management in Scientific Cloud Workflows
  • Security and Privacy Protection in the Cloud
  • Data Reliability Assurance in the Cloud
  • SwinDeW-C Cloud Workflow System
  • Future Work and Conclusions

25
Research Topics
Performance Management in Scientific Workflows
  • Dr. Xiao Liu
  • xliu@swin.edu.au
  • http://www.ict.swin.edu.au/personal/xliu/

26
Workflow QoS
  • QoS dimensions
  • time, cost, fidelity, reliability, security
  • QoS of Cloud Services
  • Workflow QoS
  • the overall QoS for a collection of cloud
    services
  • but they do not simply add up!

27
Temporal QoS
  • System performance
  • Response time
  • Throughput
  • Temporal constraints
  • Global constraints: deadlines
  • Local constraints: milestones, individual activity durations
  • Satisfactory temporal QoS
  • High performance: fast response, high throughput
  • On-time completion: low temporal violation rate

28
Problem Analysis
  • Setting temporal constraints
  • Coarse-grained and fine-grained temporal
    constraints
  • Prerequisite: effective forecasting of activity durations
  • Monitoring temporal consistency states
  • Monitor the workflow execution state
  • Detect potential temporal violations
  • Temporal violation handling
  • Where to conduct violation handling
  • What strategies to use

29
Ultimate Goal
  • Achieving on-time completion
  • Measurements
  • Temporal correctness
  • Cost effectiveness

30
Temporal Consistency Model
  • Temporal correctness: workflow execution towards the satisfaction of temporal constraints
  • A temporal consistency model defines the system's running state at a specific workflow activity point (i.e. a temporal checkpoint) against specific temporal constraints
  • Basic elements: the real workflow running time (before and including the checkpoint), the estimated running time of the uncompleted workflow (after the checkpoint), and the temporal constraints (a minimal sketch follows)
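As a minimal sketch of such a consistency check (assuming, as in the model above, normally distributed and independent activity durations; the function name and inputs are illustrative, not from the slides):

    from math import erf, sqrt

    def temporal_consistency(elapsed, remaining_means, remaining_sigmas, deadline):
        # P(elapsed + remaining work <= deadline), where the remaining work is
        # modelled as a normal variable N(sum of means, sum of variances)
        mu = elapsed + sum(remaining_means)
        var = sum(s * s for s in remaining_sigmas)
        if var == 0:
            return 1.0 if mu <= deadline else 0.0
        z = (deadline - mu) / sqrt(var)
        return 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF at z

    # Example: 10 hours elapsed, three activities remaining, 23-hour deadline
    p = temporal_consistency(10.0, [4.0, 3.0, 2.0], [0.5, 0.4, 0.3], 23.0)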

31
Probability Based Temporal Consistency Model
  • Time attributes for workflow activity ai
  • Maximum activity duration D(ai)
  • Mean activity duration M(ai)
  • Minimum activity duration d(ai)
  • Runtime activity duration R(ai)
  • 3-sigma rule: for a normal distribution, 99.73% of samples fall within (µ-3σ, µ+3σ); R(ai) ~ N(µ, σ²)
  • D(ai) = µ+3σ, M(ai) = µ, d(ai) = µ-3σ (a small sketch follows)
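A small sketch of deriving these attributes from historical duration samples (an illustrative helper, not from the slides):

    from statistics import mean, stdev

    def time_attributes(samples):
        # derive d(ai), M(ai) and D(ai) from past durations via the 3-sigma rule
        mu, sigma = mean(samples), stdev(samples)
        return {"d": mu - 3 * sigma, "M": mu, "D": mu + 3 * sigma}

    attrs = time_attributes([30.5, 28.1, 33.0, 29.4, 31.2, 30.8])  # minutes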

32
Probability Based Temporal Consistency Model
  • Types of temporal constraints
  • Upper-bound temporal constraint, U(W)
  • Lower-bound temporal constraint, L(W)
  • Fixed-time temporal constraint, F(W)
  • Relationships
  • Upper bound and lower bound are symmetric
  • A fixed-time constraint is a special case of an upper-bound constraint
  • Choice
  • Upper-bound/lower-bound constraints for workflow build time
  • Fixed-time constraints for workflow runtime

33
Probability Based Temporal Consistency Model
34
Probability Based Temporal Consistency Model
35
Temporal Framework
36
Temporal Framework
  • Component 1: Temporal Constraint Setting
  • Forecasting workflow activity durations
  • Setting coarse-grained temporal constraints
  • Setting fine-grained temporal constraints
  • Component 2: Temporal Consistency Monitoring
  • Temporal checkpoint selection
  • Temporal verification
  • Component 3: Temporal Violation Handling
  • Temporal violation handling point selection
  • Temporal violation handling

37
Component 1: Temporal Constraint Setting
38
Forecasting Activity Durations
  • Statistical time-series pattern based forecasting
    strategies
  • Selected Publications
  • X. Liu, Z. Ni, D. Yuan, Y. Jiang, Z. Wu, J. Chen,
    Y. Yang, A Novel Statistical Time-Series Pattern
    based Interval Forecasting Strategy for Activity
    Durations in Workflow Systems, Journal of Systems
    and Software (JSS), vol. 84, no. 3, Pages
    354-376, March 2011.
  • X. Liu, J. Chen, K. Liu and Y. Yang, Forecasting
    Duration Intervals of Scientific Workflow
    Activities based on Time-Series Patterns, Proc.
    of 4th IEEE International Conference on e-Science
    (e-Science08), pages 23-30, Indianapolis, USA,
    Dec. 2008.

39
Setting Temporal Constraints
  • Probability based temporal consistency model
  • Time analysis based on Stochastic Petri Nets
  • Selected Publications
  • X. Liu, Z. Ni, J. Chen, Y. Yang, A Probabilistic
    Strategy for Temporal Constraint Management in
    Scientific Workflow Systems, Concurrency and Computation: Practice and Experience (CCPE), Wiley, 23(16):1893-1919, Nov. 2011.
  • X. Liu, J. Chen and Y. Yang, A Probabilistic
    Strategy for Setting Temporal Constraints in
    Scientific Workflows, Proc. 6th International
    Conference on Business Process Management
    (BPM2008), Lecture Notes in Computer Science,
    Vol. 5240, pages 180-195, Milan, Italy, Sept.
    2008.

40
Component 2: Temporal Consistency Monitoring
41
Temporal Consistency Monitoring
  • Minimum (Probability) Time Redundancy based
    Checkpoint Selection Strategy
  • Temporal Dependency based Checkpoint Selection
    Strategy
  • Selected Publications
  • X. Liu, Y. Yang, Y. Jiang and J. Chen, Preventing Temporal Violations in Scientific Workflows: Where and How. IEEE Transactions on Software Engineering, 37(6):805-825, Nov./Dec. 2011.
  • J. Chen and Y. Yang, Temporal Dependency based
    Checkpoint Selection for Dynamic Verification of
    Temporal Constraints in Scientific Workflow
    Systems. ACM Transactions on Software Engineering
    and Methodology, 20(3), 2011

42
Component 3: Temporal Violation Handling
43
Violation Handling
  • Violation handling point selection
  • (Probability) time deficit allocation
  • Workflow local rescheduling strategies: ACO, GA, PSO
  • Selected Publications
  • X. Liu, Z. Ni, Z. Wu, D. Yuan, J. Chen and Y.
    Yang, A Novel General Framework for Automatic and
    Cost-Effective Handling of Recoverable Temporal
    Violations in Scientific Workflow Systems,
    Journal of Systems and Software, vol. 84, no. 3,
    pp. 492-509, 2011
  • X. Liu, Y. Yang, Y. Jiang and J. Chen, Do We Need
    to Handle Every Temporal Violation in Scientific
    Workflow Systems, submitted to ACM Transactions
    on Software Engineering and Methodology

44
Experimental Results on Temporal Violation Rates
45
Cost Analysis
46
Yearly Cost and Time Reduction
Yearly cost reduction for the pulsar searching
workflow
Yearly time reduction for the pulsar searching
workflow
47
Research Topics
Data Management in Scientific Cloud Workflows
Dr. Dong Yuan, Dr. Xiao Liu
dyuan@swin.edu.au, xliu@swin.edu.au
http://www.ict.swin.edu.au/personal/dyuan/
48
Data Management in Cloud Computing
  • Scientific applications in cloud computing
  • Computation and data intensive applications
  • Massive computation and storage resources
  • Pay-as-you-go model
  • Computation and storage trade-off
  • Some datasets should be stored (storage cost)
  • Some datasets can be regenerated (computation cost)
  • Data Placement

49
Data Dependency Graph (DDG)
  • A classification of the application data
  • Original data and generated data
  • Data provenance
  • A kind of meta-data that records how data are
    generated.
  • DDG

50
Attributes of a Dataset in DDG
  • A dataset di in a DDG has the attributes ⟨xi, yi, fi, vi, provSeti, CostRi⟩
  • xi ($) denotes the generation cost of dataset di from its direct predecessors.
  • yi ($/t) denotes the cost of storing dataset di in the system per time unit.
  • fi (Boolean) is a flag which denotes whether dataset di is stored or deleted in the system.
  • vi (Hz) denotes the usage frequency, which indicates how often di is used.

51
Attributes of a Dataset in DDG
  • provSeti denotes the set of stored provenance datasets that are needed when regenerating dataset di.
  • CostRi ($/t) is di's cost rate, i.e. the average cost per time unit of di in the system.
  • Cost = Computation + Storage
  • Computation: the total cost of computation resources
  • Storage: the total cost of storage resources

52
Cost Model of Datasets Storage in the Cloud
  • Total cost rate for storing the datasets in a DDG, where S is the storage strategy of the DDG (a sketch of this model follows this list)
  • This cost model also represents the trade-off between computation and storage in the cloud
  • For a DDG with n datasets, there are 2^n different storage strategies
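A hedged sketch of this cost model for a linear DDG (the names and data layout are illustrative assumptions): a stored dataset contributes its storage cost rate yi, while a deleted one contributes its regeneration cost (its own xi plus the xi of every deleted predecessor back to the nearest stored dataset) multiplied by its usage frequency vi.

    def total_cost_rate(datasets, stored):
        # datasets: list of dicts with keys "x" ($), "y" ($/t), "v" (Hz)
        # stored: set of indices kept in storage by the strategy S
        total = 0.0
        for i, d in enumerate(datasets):
            if i in stored:
                total += d["y"]                    # storage cost rate
            else:
                regen = d["x"]                     # own generation cost
                j = i - 1
                while j >= 0 and j not in stored:  # plus deleted predecessors
                    regen += datasets[j]["x"]
                    j -= 1
                total += regen * d["v"]            # paid at the usage frequency
        return total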

53
Minimum Cost Benchmark
  • What is the minimum cost benchmark?
  • The minimum cost for storing and regenerating
    datasets in the cloud
  • The best trade-off between computation and
    storage in the cloud
  • We need to find the Minimum Cost Storage Strategy
    (MCSS) for the application datasets
  • Significance of the minimum cost benchmark
  • Due to the pay-as-you-go model,
    cost-effectiveness is very important to users for
    deploying their applications in the cloud
  • The minimum cost benchmark is for users to
    evaluate the cost-effectiveness of their storage
    strategies.

54
Static On-Demand Minimum Cost Benchmarking
  • Static benchmarking is provided as an on-demand service for users
  • Whenever a benchmarking request arrives, the corresponding algorithms are triggered to calculate the minimum cost benchmark, which is a one-time-only computation.
  • This approach suits situations where benchmarking is only requested occasionally.
  • CTT-SP algorithm
  • A novel algorithm designed to find the MCSS of a DDG with polynomial time complexity
  • CTT-SP: Cost Transitive Tournament Shortest Path

55
Linear CTT-SP Algorithm
  • The CTT-SP algorithm for a linear DDG
  • Essence of the algorithm
  • Construct a Cost Transitive Tournament (CTT) based on the DDG
  • In the CTT, every path from the start to the end represents a storage strategy of the DDG.
  • The paths have a one-to-one mapping to the storage strategies.

56
Linear CTT-SP Algorithm
  • Set weights on the edges of the CTT
  • We denote the weight of the edge from di to dj as ω⟨di, dj⟩, defined as the sum of the cost rates of dj and of the datasets between di and dj, supposing that only di and dj are stored and all the datasets between them are deleted.
  • Formally: ω⟨di, dj⟩ = yj + Σ genCost(dk)·vk over all dk between di and dj, where genCost(dk) is dk's regeneration cost from the stored dataset di.
  • The length of each path then equals the TCR (Total Cost Rate) of the corresponding storage strategy.
57
Linear CTT-SP Algorithm
  • Find the shortest path from ds to de in the CTT
  • The MCSS Smin stores exactly the datasets that the shortest path Pmin⟨ds, de⟩ traverses.
  • The minimum cost benchmark is the length of Pmin⟨ds, de⟩, i.e. the total cost rate of Smin (see the sketch after this list).
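A compact sketch of the linear CTT-SP idea (illustrative code under the definitions above, not the authors' implementation): virtual start and end nodes bracket the chain, each edge weight follows the definition on the previous slide, and a shortest-path pass over the resulting DAG yields the MCSS together with the minimum cost benchmark.

    def mcss_linear(datasets):
        # datasets: list of dicts with keys "x" ($), "y" ($/t), "v" (Hz)
        n = len(datasets)

        def weight(i, j):
            # edge i -> j: dj is stored (j == n is the virtual end); the
            # datasets strictly between i and j are deleted and must be
            # regenerated from the stored dataset at position i
            w = datasets[j]["y"] if j < n else 0.0
            for k in range(i + 1, j):
                regen = sum(datasets[m]["x"] for m in range(i + 1, k + 1))
                w += regen * datasets[k]["v"]
            return w

        dist, prev = {-1: 0.0}, {}
        for j in list(range(n)) + [n]:        # nodes in chain order
            best, arg = float("inf"), None
            for i in [-1] + list(range(j)):   # every possible earlier node
                if dist[i] + weight(i, j) < best:
                    best, arg = dist[i] + weight(i, j), i
            dist[j], prev[j] = best, arg

        stored, j = set(), prev[n]            # backtrack the shortest path
        while j is not None and j != -1:
            stored.add(j)
            j = prev[j]
        return stored, dist[n]                # MCSS and minimum cost benchmark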

58
General CTT-SP Algorithm
  • Take the simple DDG below as an example (with a block)
  • For a general DDG, we select one branch from the first dataset to the last dataset as the main branch (e.g. d1, d2, d5, d6, d7, d8) to construct the CTT.
  • The rest of the datasets are denoted as sub-branches (e.g. d3, d4).

59
General CTT-SP Algorithm
  • The general CTT-SP algorithm is recursive
  • For the sub-branches, the MCSS differs depending on which predecessors and successors are stored, so it cannot be calculated at the beginning.
  • The general CTT-SP algorithm is therefore called recursively on the sub-branches, and their cost rates are dynamically added to the edges in the CTT of the main branch

60
Dynamic on-the-fly Minimum Cost Benchmarking
  • The benchmarking service is delivered on the fly, instantly responding to benchmarking requests
  • By saving and utilising pre-calculated results, whenever the application cost changes in the cloud we can dynamically calculate the new minimum cost and keep the benchmark updated.
  • This approach suits situations where benchmarking is requested more frequently at runtime.
  • Partitioned Solution Space (PSS)
  • A PSS saves all the possible MCSSs of a DDG segment.
  • For a DDG segment, given particular stored predecessors and successors, we can quickly locate the corresponding MCSS in the PSS.

61
PSS for a DDG_LS (Linear DDG Segment)
  • A DDG_LS has different MCSSs according to the storage statuses of its preceding and succeeding datasets.
  • CTT for a DDG_LS
  • Different selections of the start and end datasets (ds and de) may lead to different MCSSs for the segment.

62
PSS for a DDG_LS
  • Partition of the solution space
  • Let Si,j and Si',j' be two MCSSs in the solution space with SCRi,j < SCRi',j'. The border between Si,j and Si',j' is where, for particular X and V, the TCRs of storing the DDG_LS with Si,j and with Si',j' are equal.
  • Hence, the border of Si,j and Si',j' in the solution space is a straight line.

63
PSS for a DDG_LS
  • With a further simplifying assumption, the border equation can be simplified even more.
  • The figure below demonstrates the partition of the solution space.

64
PSS for a DDG_LS
  • We can calculate the partition lines of all the potential MCSSs in the solution space, which together form the PSS.
  • With the PSS, given any X and V, we can quickly locate the corresponding MCSS for the DDG_LS (see the sketch after this list).
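Because the borders are straight lines, each candidate strategy's total cost rate is linear in X and V, so locating the MCSS reduces to taking the minimum of a handful of linear functions. A hedged sketch (the tuple layout and the numbers are illustrative assumptions):

    def locate_mcss(candidates, X, V):
        # candidates: (name, base, a, b) with TCR(X, V) = base + a*X + b*V
        return min(candidates, key=lambda c: c[1] + c[2] * X + c[3] * V)

    pss = [("S1", 5.0, 0.2, 0.1), ("S2", 4.0, 0.5, 0.3)]  # illustrative PSS
    best = locate_mcss(pss, X=2.0, V=1.5)                 # picks "S2" here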

65
Dynamic on-the-fly Minimum Cost Benchmarking
  • PSS-based benchmarking approach (key ideas)
  • Merge the PSSs of the DDG_LSs to derive the PSS of the whole DDG, from which the minimum cost benchmark can be obtained.
  • Save all the PSSs calculated along the way in a hierarchy.
  • Whenever the application cost changes, the new minimum cost benchmark can be quickly derived from the saved PSSs.
  • Hence the minimum cost benchmark is kept dynamically updated, so that benchmarking requests can be responded to instantly, on the fly.

66
Saving PSSs
  • We save all the PSSs of a DDG in a hierarchy
  • The level number indicates how many DDG_LSs are merged in the PSS at that level.
  • A link between two PSSs at levels i and i+1 in the hierarchy means that the DDG segment of the PSS at level i+1 contains the DDG segment of the PSS at level i.

67
Cost-Effective Storage Strategies
  • Cost Rate based Storage Strategy
  • The strategy directly compares the generation cost rate and the storage cost rate of every dataset to decide its storage status.
  • It guarantees that all the datasets stored in the system are necessary.
  • It dynamically checks whether re-generated datasets need to be stored and, if so, adjusts the storage strategy accordingly.
  • This strategy is highly efficient with fairly reasonable cost-effectiveness.

68
Cost-Effective Storage Strategies
  • Local-Optimisation based Storage Strategy
  • The strategy divides a DDG with a large number of application datasets into small linear segments (DDG_LS).
  • It utilises the linear CTT-SP algorithm to find the MCSS of every segment, hence achieving local optimisation.
  • This strategy is highly cost-effective with very reasonable runtime efficiency.

69
Pulsar Searching Application Case Study
  • In analysing one piece of the observation data, six datasets are generated.
  • We directly utilise the on-demand benchmarking approach
  • The MCSS is: store d2, d4 and d6; delete d1, d3 and d5.
  • The minimum cost benchmark is $0.51 per day.

70
PSSs merging process
There are two phases in the execution: 1) Files Preparation; 2) Seeking Candidates. Two DDG_LSs are generated correspondingly.
71
Pulsar Searching Application Case Study
72
Pulsar Searching Application Case Study
Strategy | Extracted beam | De-dispersion files | Accelerated de-dispersion files | Seek results | Pulsar candidates | XML files
1) Store no datasets | Deleted | Deleted | Deleted | Deleted | Deleted | Deleted
2) Store all datasets | Stored | Stored | Stored | Stored | Stored | Stored
3) Generation cost based strategy | Deleted | Stored | Stored | Deleted | Deleted | Stored
4) Usage based strategy | Deleted | Stored | Deleted | Deleted | Deleted | Deleted
5) Cost rate based strategy | Deleted | Stored (deleted initially) | Deleted | Stored | Deleted | Stored
6) Local-optimisation based strategy | Deleted | Stored | Deleted | Stored | Deleted | Stored
7) Minimum cost benchmark | Deleted | Stored | Deleted | Stored | Deleted | Stored
73
Data Placement
  • Compute near big data!
  • In scientific cloud workflows, large amounts of application data need to be stored in distributed data centres. A data manager must intelligently select the data centres in which these data reside, by considering:
  • The dependencies between datasets
  • The movement of large datasets
  • The fact that some data have fixed locations

74
A matrix based k-means clustering strategy
  • Build time: group the existing datasets into k data centres based on data dependencies
  • Step 1: Set up and cluster the dependency matrix
  • Step 2: Partition and distribute the datasets
  • Runtime: dynamically cluster newly generated datasets to the most appropriate data centres based on dependencies
  • Step 1: Pre-allocate data using the clustering algorithm
  • Step 2: Adjust the data placement among data centres when new workflows are deployed or some data centres become overloaded (a minimal clustering sketch follows this list)
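A minimal k-means sketch over rows of the dependency matrix (pure Python and purely illustrative: the matrix, k and the distance measure are assumptions, not the published strategy):

    import random

    def kmeans(rows, k, iters=20):
        # rows: one dependency-matrix row per dataset; datasets whose rows
        # are close end up in the same cluster, i.e. the same data centre
        centres = random.sample(rows, k)
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for r in rows:
                i = min(range(k), key=lambda c: sum((a - b) ** 2
                        for a, b in zip(r, centres[c])))
                clusters[i].append(r)
            centres = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centres[i]
                       for i, cl in enumerate(clusters)]
        return clusters

    # Example: four datasets and their pairwise dependency counts
    dep = [[0, 5, 1, 0], [5, 0, 0, 1], [1, 0, 0, 4], [0, 1, 4, 0]]
    print(kmeans(dep, k=2))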

75
(No Transcript)
76
Research Topics
Security and Privacy Protection in the Cloud
Gaofeng Zhang, gzhang@swin.edu.au
77
Background
  • Data security vs. data privacy
  • Privacy in cloud computing
  • Massive data is stored and computed in an open cloud environment
  • Customers cannot control what happens inside the cloud
  • The severity of privacy risks in cloud computing
  • One specific privacy risk in cloud computing
  • Indirect private information (collective information)
  • Normal service processes and functions (no disruption)
  • The approach: noise obfuscation for privacy protection

78
Privacy Protection in Cloud
  • Roles in the view of privacy in a regular IT system
  • Privacy owner, privacy user and privacy thief
  • Keep things safe between the privacy owner and the privacy user!
79
Privacy Protection in Cloud
  • Microsoft's view of the cloud ecosystem
  • "Powerful, Green and Smart Cloud" (IBM)

80
Privacy Protection in Cloud
  • Roles in the view of privacy in the cloud
  • Privacy owner, privacy user and privacy thief
  • Virtualisation disables the safe separation between the privacy owner and the privacy user!
81
Noise Obfuscation(1)
  • Background
  • Massive data is stored and computed in open cloud environments.
  • Customers cannot control what happens inside the cloud.
  • Main idea: dilute real private information with noise information
  • Noise information, not a noise signal!

82
Noise Obfuscation(2)
  • A motivating example
  • One customer who often travels to one city in Australia, say Sydney, regularly checks the weather report from a weather service in a cloud environment before departure. The frequent appearance of service requests about the weather report for "Sydney" can reveal the private fact that the customer usually goes to Sydney. But if a system aids the customer by injecting other requests, such as "Perth" or "Darwin", into the "Sydney" queue, the service provider cannot distinguish which requests are real and which are noise, as it just sees a similar style of service request. All these requests are responded to, yet they no longer reveal the customer's location privacy. In such cases, privacy can in general be protected by noise obfuscation (a minimal sketch follows below).

From data privacy to process privacy!
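A minimal sketch of the historical-probability idea (illustrative only, not the published strategy; the weighting scheme is an assumption): favour noise requests for the cities the provider has seen least, so that the observed request distribution flattens towards uniform.

    import random
    from collections import Counter

    def next_noise_request(history, cities):
        observed = Counter(history)
        top = max(observed.values(), default=0)
        # under-represented cities get higher weight, flattening the distribution
        weights = [top - observed.get(c, 0) + 1 for c in cities]
        return random.choices(cities, weights=weights)[0]

    history = ["Sydney", "Sydney", "Perth", "Sydney"]
    noise = next_noise_request(history, ["Sydney", "Perth", "Darwin"])  # "Darwin" most likely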
83
Research Topics
  • Noise Generation
  • Historical probability based noise generation
    strategy
  • Time-series pattern based noise generation
    strategy
  • Association probability based noise generation
    strategy
  • Noise Utilisation
  • Trust model and injection strategy for noise
    obfuscation
  • Noise Cooperation Mechanism
  • Privacy protection framework under noise
    obfuscation

84
Research Topics
Cost-Effective Data Reliability Assurance in the
Cloud
Wenhao Li, wli@swin.edu.au
85
Background
  • The growth of cloud data
  • It is estimated that by 2015 the data stored in the cloud will reach 0.8 ZB, while even more data are stored or processed temporarily on their journey. (IDC)
  • The size of cloud applications is also expanding
  • Challenge
  • How to reduce the data storage cost of using cloud storage services without sacrificing data reliability assurance

86
Research issues
  • Data reliability modelling in the cloud
  • Replication-based cost-effective data reliability
    management approaches
  • Data loss detection and data recovery

87
Replication-based Approaches
  • Incremental replication strategy: CIR (Cost-effective Incremental Replication)
  • The generation of replicas follows an incremental pattern: a replica is created only when the current replicas cannot provide sufficient data reliability assurance to meet the user's requirement.
  • Data reliability management mechanism based on proactive replica checking: PRCR (Proactive Replica Checking for Reliability)
  • According to their different data reliability requirements, each file has no more than two replicas stored in the cloud.
  • A replica-checking process is conducted proactively to detect data loss and recover lost replicas (a minimal reliability sketch follows this list).
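To illustrate the incremental idea, a hedged sketch (the exponential loss model and all numbers are assumptions for illustration): if each replica is lost at rate lam per year, one replica survives a duration t with probability exp(-lam*t), and k independent replicas give reliability 1 - (1 - exp(-lam*t))^k; a new replica is created only when this falls short of the requirement.

    import math

    def reliability(k, lam, t):
        # probability that at least one of k replicas survives duration t
        return 1 - (1 - math.exp(-lam * t)) ** k

    def replicas_needed(lam, t, requirement):
        # smallest replica count that meets the reliability requirement
        k = 1
        while reliability(k, lam, t) < requirement:
            k += 1
        return k

    # Example: 1% annual loss rate, one-year duration, 99.99% requirement -> 2
    print(replicas_needed(0.01, 1.0, 0.9999))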

88
  • CIR can reduce up to two-thirds of the current cloud storage cost, especially for data with short storage durations and low reliability requirements.
  • PRCR can reduce one-third to two-thirds of the current cloud storage cost, especially when the amount of data is large.

89
Research Topics
Cloud Workflow System Design and Development
Dahai Cao, dcao@swin.edu.au
90
SwinCloud Cloud Computing Testbed
  • SwinCloud

91
Prototype: SwinDeW-C Cloud Workflow System
  • SwinDeW-C

92
New Progress
  • Successfully deployed on the Amazon cloud
  • Eucalyptus as the cloud infrastructure platform

93
Call for Papers and Call for Workshops
  • 2012 International Conference on Cloud and Green Computing, Nov. 1-3, 2012, Xiangtan, Hunan, China
  • http://kpnm.hnust.cn/confs/cgc2012/
  • Important dates
  • Workshop proposals: ongoing, as received
  • Submission deadline: June 30, 2012
  • Author notification: July 30, 2012
  • Final manuscript due: August 10, 2012
  • Registration due: August 18, 2012

94
End - Q&A
  • Thanks for your attention!