Data Warehouse Operational Issues Potential Research Directions - PowerPoint PPT Presentation

About This Presentation
Title:

Data Warehouse Operational Issues Potential Research Directions

Description:

Data Warehouse Operational Issues Potential Research Directions Query processing General issues Query processing for OLAP queries has been addressed extensively in ... – PowerPoint PPT presentation

Number of Views:73
Avg rating:3.0/5.0
Slides: 13
Provided by: DT180
Learn more at: http://cci.drexel.edu
Category:

less

Transcript and Presenter's Notes

Title: Data Warehouse Operational Issues Potential Research Directions


1
Data Warehouse Operational IssuesPotential
Research Directions
2
Query processingGeneral issues
  • Query processing for OLAP queries has been
    addressed extensively in the past, including
    standard OLAP operations, ad-hoc queries, etc.
  • Research Challenges
  • There is a need for a standardized query
    language for OLAP, a Cube algebra
  • For several queries it would be sufficient to
    provide approximate answers. We need to identify
    approximation measures and formal techniques to
    evaluate them.
  • Integrate what-if analysis facilities in
    current OLAP tools, with particular emphasis on
    improved performance, in order to avoid using
    more than one tools for this purpose (as
    currently happens)
  • Support holistic aggregation (i.e., aggregation
    that cannot exploit differentials) efficiently.
    Examples involve the computation of top-K, or
    median values in the presence of streaming
    insertions and deletions.

3
Stream data
  • Streams are sequences of data, continuously
    flowing from a data source with the particular
    characteristic that, due to their volume, each
    tuple is available only for a limited time window
    for querying. Stream examples would involve
  • Stock rates extracted from the web
  • Packets going through a router
  • Clickstreams from a web site

4
Query processingOLAP aggregation and stream data
  • Example these are the standard OLAP queries we
    know (cube operator, etc)
  • Research challenges
  • Issues related to processing OLAP queries over
    streams streams are eventually stored in the DW
    (updated consistently) an interesting issue is
    defining consistency between stream and DW.
  • How to split query processing between on-line
    queries on stream data and queries on the
    contents of the warehouse it might also be
    necessary to store some of the results back in
    the DW to allow for future queries.
  • Synchronization of multiple incoming streams with
    different arrival rates, e.g., when we need to
    combine them with some join operation and
    aggregation, in the presence of consistency
    constraints.

5
Query processingSubscription queries
  • Example Set of indicators define whether or not
    business is going well depending on whether
    stream values exceed some threshold
  • Subscription queries are defined in order to
    continuously monitor the DW. One way to do that
    is by defining ECA rules on the data warehouse
    evaluation of triggers requires access to the DW
    when event takes place, execute queries against
    the warehouse.
  • Research challenges
  • Modeling of subscription queries (is it a new
    language or standard trigger languages are
    enough?)
  • What kind of extensions do we need at the
    physical level to support that?
  • How to optimize trigger condition checking
    (difficult in light of heavy aggregation)

6
Query processingCached queries
  • Example Compute the sum per country and per
    month and save it because you can compute based
    on this the sum per continent and year.
  • Important component, should exist in any data
    warehousing architecture.
  • Research challenges
  • How to and under what conditions can you do query
    rewriting? cube algebra would be useful here.
  • Work on all caching issues updating/refreshing,
    deciding whether or not to cache, replacement
    policies in the case of bounded space, etc.
  • Work on the same problem but for streams (fast
    arriving data) and for particular types of
    queries that are less sensitive to temporal
    changes or compute approximate answers (otherwise
    caching is of no interest as data changes too
    fast)

7
Query processingPersonalization
  • Example Profile (query) set by the user denotes
    that she is interested in sales between 2k and 4k
    in August in Germany and does not care about
    sales between 4k and 5k in July in Italy. She
    wishes is to see in the results highlighted
    regions of the cube that satisfy her profile.
  • Research challenges
  • Can personalization be used in the DW context?
    Does it make sense in the context of aggregation
    and what kind of predicates based on aggregations
    are interesting?
  • How to express profiles?
  • How to process profiles? Particular emphasis
    should be put on the scalability of profile
    management in the presence of multiple users.
  • How could profiles be used for the design of the
    DW so as to avoid unnecessary processing?
  • Is this topic really new Research or
    Re-engineering of existing solutions for simple
    queries?

8
Query processingData mining
  • Example a user specifies an ECA rule that
    involves the identification of potential
    interesting opportunities, based on the values of
    a cube. The condition part has to do with the
    identification of a pattern more than 5 times
    within the last 2 days. The pattern is computed
    by an online data mining algorithm
  • Research challenges
  • The integration of data mining algorithms in OLAP
    tools, especially in the presence of novel types
    of data and data streams. Particular emphasis is
    put on the efficiency of the algorithms as well
    as their smooth integration within the overall
    OLAP environment.

9
ETLTraditional ETL
  • Definition Traditional ETL processes are
    responsible for the extraction of data from
    several sources, their cleansing, customization
    and insertion into a DW. These tasks are repeated
    on a regular basis and in most cases, they are
    asynchronous.
  • Example integration of data extracted from OLTP
    and legacy systems to a DW
  • State of the art So far, the research community
    has only partially dealt with the problem of
    designing and managing ETL processes.
    Typically, research approaches concern (a)
    stand-alone problems (e.g., the problem of
    duplicate detection) in an isolated setting and
    (b) problems mostly related to web
    data.Recently, research on data streams has
    brought up the possibility of giving an
    alternative look to the problem of ETL.
    Nevertheless, for the moment, research in data
    streaming has focused on different topics, such
    as on-the-fly computation of queries.

10
ETLTraditional ETL (cont.)
  • Research Challenges
  • A formal description of ETL processes with
    particular emphasis on an algebra (for
    optimization purposes) and a formal declarative
    language.
  • Optimization of ETL processes in logical and
    physical level. A challenge will be either the
    optimization of the whole ETL process or of any
    individual transformation. Parallel processing of
    ETL processes is of particular importance.
  • Propagation of changes back to the sources.
    Potential quality problems observed at the
    end-user level can lead to clean data being
    propagated back to the sources, in order to avoid
    the repetition of several tasks in future
    application of the ETL process.
  • Standard-based metadata for ETL processes. There
    does not exist common model for the metadata of
    ETL processes.
  • Integration of ETL with XML wrappers, EAI
    (Enterprise Application Integration) tools data
    quality tools.
  • Extension of the ETL mechanisms for
    non-traditional data, like XML/HTML, spatial and
    biomedical data.
  • Security issues in ETL source data and data in
    transit are security risks.

11
ETLStream ETL (cont.)
  • Definition
  • Stream ETL is an ETL process involving the
    possible filtering, value conversion and
    transformation of the incoming information in a
    desirable format.
  • Although, streams in their entirety cannot
    always be stored (also do not want to go over
    them again and again in ETL processing), some
    patterns or snapshot aggregates of them can be
    stored for subsequent querying.
  • Research Challenges
  • An issue that arises concerns correctness
    guarantees.
  • Also, we need cost models for the tuning of the
    incoming stream within specified time window.
  • Another issue is the audit of the incoming
    stream data for several constraints or business
    rules, also with respect to stored data (e.g.,
    primary key violations).

12
ETLOn-Demand ETL
  • Example
  • Some users request data to be brought in from
    the web. The administrator/programmer is assigned
    the task of constructing an ETL process that
    extracts the dates from the specified sites,
    transforms them and ultimately stores them in the
    of the warehouse. Any time the user needs this
    data, this on-demand ETL process brings in the
    relevant information.
  • Definition
  • An ETL process executed sporadically, manually
    initiated by some user demand. The process is
    responsible for retrieving external data and
    loading them in the DW after the appropriate
    transformations.
  • Research Cahllenges
  • Since this process is mostly focused towards web
    data, there is a need for the appropriate
    operators.
  • Also, additional challenges involve
  • Minimum effort/time/resources for the
    construction of the process
  • Easy adaptability to the changes of the external
    data
  • Efficient algorithms.
Write a Comment
User Comments (0)
About PowerShow.com