Data Warehouse Operational Issues Potential Research Directions presentation

About This Presentation

Transcript and Presenter's Notes

Title: Data Warehouse Operational Issues Potential Research Directions

1
Data Warehouse Operational IssuesPotential
Research Directions
2
Query processingGeneral issues

Query processing for OLAP queries has been
addressed extensively in the past, including
standard OLAP operations, ad-hoc queries, etc.
Research Challenges
There is a need for a standardized query
language for OLAP, a Cube algebra
For several queries it would be sufficient to
provide approximate answers. We need to identify
approximation measures and formal techniques to
evaluate them.
Integrate what-if analysis facilities in
current OLAP tools, with particular emphasis on
improved performance, in order to avoid using
more than one tools for this purpose (as
currently happens)
Support holistic aggregation (i.e., aggregation
that cannot exploit differentials) efficiently.
Examples involve the computation of top-K, or
median values in the presence of streaming
insertions and deletions.

3
Stream data

Streams are sequences of data, continuously
flowing from a data source with the particular
characteristic that, due to their volume, each
tuple is available only for a limited time window
for querying. Stream examples would involve
Stock rates extracted from the web
Packets going through a router
Clickstreams from a web site

4
Query processingOLAP aggregation and stream data

Example these are the standard OLAP queries we
know (cube operator, etc)
Research challenges
Issues related to processing OLAP queries over
streams streams are eventually stored in the DW
(updated consistently) an interesting issue is
defining consistency between stream and DW.
How to split query processing between on-line
queries on stream data and queries on the
contents of the warehouse it might also be
necessary to store some of the results back in
the DW to allow for future queries.
Synchronization of multiple incoming streams with
different arrival rates, e.g., when we need to
combine them with some join operation and
aggregation, in the presence of consistency
constraints.

5
Query processingSubscription queries

Example Set of indicators define whether or not
business is going well depending on whether
stream values exceed some threshold
Subscription queries are defined in order to
continuously monitor the DW. One way to do that
is by defining ECA rules on the data warehouse
evaluation of triggers requires access to the DW
when event takes place, execute queries against
the warehouse.
Research challenges
Modeling of subscription queries (is it a new
language or standard trigger languages are
enough?)
What kind of extensions do we need at the
physical level to support that?
How to optimize trigger condition checking
(difficult in light of heavy aggregation)

6
Query processingCached queries

Example Compute the sum per country and per
month and save it because you can compute based
on this the sum per continent and year.
Important component, should exist in any data
warehousing architecture.
Research challenges
How to and under what conditions can you do query
rewriting? cube algebra would be useful here.
Work on all caching issues updating/refreshing,
deciding whether or not to cache, replacement
policies in the case of bounded space, etc.
Work on the same problem but for streams (fast
arriving data) and for particular types of
queries that are less sensitive to temporal
changes or compute approximate answers (otherwise
caching is of no interest as data changes too
fast)

7
Query processingPersonalization

Example Profile (query) set by the user denotes
that she is interested in sales between 2k and 4k
in August in Germany and does not care about
sales between 4k and 5k in July in Italy. She
wishes is to see in the results highlighted
regions of the cube that satisfy her profile.
Research challenges
Can personalization be used in the DW context?
Does it make sense in the context of aggregation
and what kind of predicates based on aggregations
are interesting?
How to express profiles?
How to process profiles? Particular emphasis
should be put on the scalability of profile
management in the presence of multiple users.
How could profiles be used for the design of the
DW so as to avoid unnecessary processing?
Is this topic really new Research or
Re-engineering of existing solutions for simple
queries?

8
Query processingData mining

Example a user specifies an ECA rule that
involves the identification of potential
interesting opportunities, based on the values of
a cube. The condition part has to do with the
identification of a pattern more than 5 times
within the last 2 days. The pattern is computed
by an online data mining algorithm
Research challenges
The integration of data mining algorithms in OLAP
tools, especially in the presence of novel types
of data and data streams. Particular emphasis is
put on the efficiency of the algorithms as well
as their smooth integration within the overall
OLAP environment.

9
ETLTraditional ETL

Definition Traditional ETL processes are
responsible for the extraction of data from
several sources, their cleansing, customization
and insertion into a DW. These tasks are repeated
on a regular basis and in most cases, they are
asynchronous.
Example integration of data extracted from OLTP
and legacy systems to a DW
State of the art So far, the research community
has only partially dealt with the problem of
designing and managing ETL processes.
Typically, research approaches concern (a)
stand-alone problems (e.g., the problem of
duplicate detection) in an isolated setting and
(b) problems mostly related to web
data.Recently, research on data streams has
brought up the possibility of giving an
alternative look to the problem of ETL.
Nevertheless, for the moment, research in data
streaming has focused on different topics, such
as on-the-fly computation of queries.

10
ETLTraditional ETL (cont.)

Research Challenges
A formal description of ETL processes with
particular emphasis on an algebra (for
optimization purposes) and a formal declarative
language.
Optimization of ETL processes in logical and
physical level. A challenge will be either the
optimization of the whole ETL process or of any
individual transformation. Parallel processing of
ETL processes is of particular importance.
Propagation of changes back to the sources.
Potential quality problems observed at the
end-user level can lead to clean data being
propagated back to the sources, in order to avoid
the repetition of several tasks in future
application of the ETL process.
Standard-based metadata for ETL processes. There
does not exist common model for the metadata of
ETL processes.
Integration of ETL with XML wrappers, EAI
(Enterprise Application Integration) tools data
quality tools.
Extension of the ETL mechanisms for
non-traditional data, like XML/HTML, spatial and
biomedical data.
Security issues in ETL source data and data in
transit are security risks.

11
ETLStream ETL (cont.)

Definition
Stream ETL is an ETL process involving the
possible filtering, value conversion and
transformation of the incoming information in a
desirable format.
Although, streams in their entirety cannot
always be stored (also do not want to go over
them again and again in ETL processing), some
patterns or snapshot aggregates of them can be
stored for subsequent querying.
Research Challenges
An issue that arises concerns correctness
guarantees.
Also, we need cost models for the tuning of the
incoming stream within specified time window.
Another issue is the audit of the incoming
stream data for several constraints or business
rules, also with respect to stored data (e.g.,
primary key violations).

12
ETLOn-Demand ETL

Example
Some users request data to be brought in from
the web. The administrator/programmer is assigned
the task of constructing an ETL process that
extracts the dates from the specified sites,
transforms them and ultimately stores them in the
of the warehouse. Any time the user needs this
data, this on-demand ETL process brings in the
relevant information.
Definition
An ETL process executed sporadically, manually
initiated by some user demand. The process is
responsible for retrieving external data and
loading them in the DW after the appropriate
transformations.
Research Cahllenges
Since this process is mostly focused towards web
data, there is a need for the appropriate
operators.
Also, additional challenges involve
Minimum effort/time/resources for the
construction of the process
Easy adaptability to the changes of the external
data
Efficient algorithms.

Write a Comment

User Comments (0)

About PowerShow.com

Data Warehouse Operational Issues Potential Research Directions PowerPoint PPT Presentation