Data Warehousing - PowerPoint PPT Presentation

Provided by: michae570

Title: Data Warehousing


1
Data Warehousing and Data Mining
2
Some Definitions
  • A data warehouse (DW) is a collection of
    integrated databases designed to support a DSS
  • An operational data store (ODS) stores data for a
    specific application. It feeds the data
    warehouse a stream of desired raw data.
  • A data mart is a lower-cost, scaled-down version
    of a data warehouse, usually designed to support
    a small group of users (rather than the entire
    firm)
  • Metadata is information that is kept about
    the warehouse
  • Online Analytical Processing (OLAP) is the broad
    category of software technology that enables
    multidimensional analysis of enterprise data

3
Business Intelligence and Analytics
  • Business intelligence (BI)
  • Acquisition of data and information for use in
    decision-making activities
  • Business analytics (BA)
  • Models and solution methods
  • Web intelligence
  • Application of business intelligence techniques
    to Web sites
  • Web analytics
  • Application of business analytics to Web sites
  • Data mining
  • Applying models and methods to data to identify
    patterns and trends

4
Data Warehouse
  • Subject-oriented (as opposed to
    application-oriented)
  • Data is organised based on its intended use
  • Scrubbed and cleansed so that data from
    heterogeneous sources are standardised
  • Time series, historical data
  • Non-volatile (read only)
  • Summarised in decision-usable format
  • Data from both internal and external sources is
    present
  • Metadata included
  • Business metadata
  • Semantic metadata

5
Data Warehouse Environment
  • The organisation's legacy systems and data stores
    provide data to the data warehouse (DW) or mart
  • During the transfer of data from the various
    sources, cleansing or transformation may occur,
    so the data in the DW is more uniform
  • Simultaneously, metadata is recorded
  • Finally, the DW or mart may be used to create one
    or more personal warehouses

6
Data Warehouse Environment
7
Integration of Data Sources
  • Access needed to multiple sources
  • Often enterprise-wide
  • Disparate and heterogeneous databases
  • XML becoming language standard
  • External data sources: Web
  • Intelligent agents
  • Document management systems
  • Content management systems
  • External data sources: commercial databases
  • Might buy / sell access to specialised databases

8
Integration of Data Sources
9
Data Marts
  • Dependent
  • Created from warehouse
  • Replicated
  • Functional subset of warehouse
  • Independent
  • Scaled down, less expensive version of data
    warehouse
  • Designed for a department or SBU
  • Organisation may have multiple data marts
  • Difficult to integrate

10
Migrating Data
  • Business rules
  • Stored in metadata repository
  • Applied to data warehouse centrally
  • Data extracted from all relevant sources
  • Loaded through data-transformation tools or
    programs
  • Separate operation and decision support
    environments
  • Correct problems in quality before data stored
  • Cleanse and organise in consistent manner

11
Data Quality
  • Quality is critical
  • Quality determines usefulness
  • Often neglected or casually handled
  • Problems exposed when data is summarised

12
Data Quality
13
Data Quality
  • Cleanse data
  • When populating warehouse
  • Data quality action plan
  • Best practices for data quality
  • Measure results
  • Data integrity issues
  • Uniformity
  • Version
  • Completeness check
  • Conformity check
  • Genealogy or drill-down

14
Advantages of Data Warehousing
  • Simplicity
  • a data warehouse provides a single image of
    business reality by integrating various data
  • Better quality data, improved productivity
  • consistency and accuracy lead to better and more
    productive decision-making; end-user computing
    boosts productivity
  • Fast access
  • necessary data is in one place, so system
    response time is cut
  • Easy to use
  • designed for specific informational needs of end
    users
  • Separate decision-support operation from
    production operation
  • speeds access, avoids conflict and integrity
    problems

15
Advantages of Data Warehousing
  • Gives competitive advantage
  • through better management and utilisation of
    corporate knowledge
  • Ultimate distributed database
  • a data warehouse pulls together information from
    disparate and potentially incompatible locations
    throughout the organisation
  • Information flow management
  • a data warehouse, especially the metadata, is
    helpful in the continual task of incrementally
    refining process workflows in a changing business
    environment
  • Enables parallel processing
  • users can ask questions that were too
    process-intensive to answer before and a data
    warehouse can handle more users, transactions,
    queries, and messages
  • Robust processing engines
  • data warehouses allow users to directly obtain
    and refine data from different software
    applications without affecting the operational
    databases
  • Security
  • since clients of the data warehouses cannot
    directly query the production databases, the
    security of the production databases is increased

16
Disadvantages of Data Warehousing
  • Complexity and anticipation in development
  • you cannot just buy a data warehouse; you have to
    build one because each warehouse has a unique
    architecture and a set of requirements that
    spring from the individual needs of the
    organisation
  • Takes time to build
  • Expensive to build
  • End-user training
  • It is necessary to create a new mind-set among
    all employees, who must be prepared to capitalise
    upon the innovative data analysis provided by
    data warehouses
  • Complexity involved in symmetric
    multiprocessing (SMP) and massively parallel
    processing (MPP)

17
The Future of Data Warehousing
  • As the DW becomes a standard part of an
    organisation, there will be efforts to find new
    ways to use the data. This will likely bring
    with it several new challenges
  • Regulatory constraints may limit the ability to
    combine sources of disparate data (e.g. Data
    Protection Act)
  • These disparate sources are likely to contain
    unstructured data, which is hard to store
  • The Internet makes it possible to access data
    from virtually anywhere. Of course, this just
    increases the disparity.

18
Data Mining
  • Definition: the analysis of data to discover
    previously unknown relationships that provide
    useful information (Hand et al.)
  • Data mining makes use of statistical and
    visualisation techniques to discover and present
    information in a form that is easily
    comprehensible
  • Data mining can be applied to tasks such as
    decision support, forecasting, estimation, and
    uncovering and understanding relationships among
    data elements

19
Data Mining
  • Traditionally, the task of identifying and
    utilising information hidden in data has been
    achieved through some form of statistical
    method
  • Typically, this involves a user formulating a
    guess about a possible relationship in the data
    and evaluating this hypothesis via a statistical
    test. This is a largely time-intensive,
    user-driven, top-down approach to data analysis.
  • With data mining, the interrogation of the data
    is done by the data mining algorithm rather than
    by the user
  • Data mining is a self-organising,
    data-influenced, bottom-up approach to data
    analysis
  • Simply put, what data mining does is sort through
    masses of data to uncover patterns and
    relationships, then build models to predict
    behaviours

20
Web Mining
  • Web mining is a special case of data mining where
    the mining occurs over a Website
  • It enhances the website with intelligent
    behaviour, such as suggesting related links or
    recommending new products
  • It allows you to unobtrusively learn the
    interests of the visitors and modify their user
    profiles in real time
  • It also allows you to match resources to the
    interests of the visitor

21
Data Mining: Why the Growth in Popularity?
  • One reason is that we keep getting more and more
    data all the time and need tools to understand it
  • We also are aware that the human brain has
    trouble processing multidimensional data
  • A third reason is that machine learning
    techniques are becoming more affordable and more
    refined at the same time

22
Verification -v- Knowledge Data Discovery
  • In the past, decision support activities were
    primarily based on the concept of verification
  • This required a great deal of prior knowledge on
    the decision-maker's part in order to verify a
    suspected relationship
  • With the advance of technology, the concept of
    verification began to turn into knowledge data
    discovery

23
Knowledge Data Discovery
  • Knowledge data discovery (KDD) techniques
    include statistical analysis, neural networks,
    fuzzy logic, intelligent agents, and data
    visualisation
  • KDD techniques not only discover useful patterns
    in the data, but also can be used to develop
    predictive models

24
The Knowledge Discovery Search Process
  • Define the business problem and obtain the data
    to study it
  • Use data mining software to model the problem
  • Mine the data to search for patterns of interest
  • Review the mining results and refine them by
    re-specifying the model
  • Once validated, make the model available to other
    users of the DW

25
Analytic Systems
  • Real-time queries and analysis
  • Real-time decision-making
  • Real-time data warehouses updated daily or more
    frequently
  • Updates may be made while queries are active
  • Not all data updated continuously
  • Deployment of business analytic applications

26
On-line Analytical Processing (OLAP)
  • Activities performed by end users in on-line
    (i.e. live multi-user) systems
  • Specific, open-ended query generation (e.g. SQL)
  • Ad hoc reports
  • Statistical analysis
  • Building DSS applications
  • Modeling and visualisation capabilities
  • Special class of tools
  • DSS, BI, BA, DBMS, GIS, etc.

27
Multidimensional OLAP (MOLAP)
  • Data can be viewed across several dimensions.
    Here sales are arrayed by region and product
  • A fourth dimension could be added by using
    several graphs, perhaps at different time points
  • Most analyses have many more dimensions than
    this. MOLAP handles data as an n-dimensional
    hypercube

28
Relational OLAP (ROLAP)
  • A large relational database server replaces the
    multidimensional one
  • The database contains both detailed and
    summarised data, allowing drill down techniques
    to be applied
  • SQL interfaces allow vendors to build tools, both
    portable and scalable
  • This requires databases with many relational
    tables which may lead to substantial processor
    overhead on complex joins

29
Data Mining Technologies
  • Statistics: the most mature data mining
    technology, but often not applicable because it
    needs clean data. In addition, many
    statistical procedures assume linear
    relationships, which limits their use.
  • Neural networks, genetic algorithms, fuzzy logic:
    these technologies are able to work with
    complicated and imprecise data. Their broad
    applicability has made them popular in the field.

30
Data Mining Technologies
  • Decision trees: these technologies are
    conceptually simple and have gained in popularity
    as better tree growing software was introduced.
    Because of the way they are used, they are
    perhaps better called classification trees.

31
Data Mining Techniques
  • Paralleling the popularity of data mining itself,
    the development of new techniques is exploding as
    well
  • Many innovations are vendor-specific, which
    sometimes does little to advance the state of the
    art
  • Regardless, data-mining techniques tend to fall
    into four major categories:
  • classification
  • association
  • sequencing
  • clustering

32
Classification Methods
  • The goal is to discover rules that define whether
    an item belongs to a particular subset or class
    of data
  • For example, if we are trying to determine which
    households will respond to a direct mail
    campaign, we will want rules that separate the
    probables from the not probables.
  • These IF-THEN rules often are portrayed in a
    tree-like structure

33
Sequencing Methods
  • These methods are applied to time series data in
    an attempt to find hidden trends
  • If found, these can be useful predictors of
    future events
  • For example, customer groups that tend to
    purchase products tied-in with hit movies would
    be targeted with promotional campaigns timed to
    release dates

34
Clustering Techniques
  • Clustering techniques attempt to create
    partitions in the data according to some
    distance metric
  • Clustering aims to segment a diverse group into a
    number of similar subgroups or clusters
  • The clusters formed are data grouped together
    simply by their similarity to their neighbours
  • By examining the characteristics of each cluster,
    it may be possible to establish rules for
    classification
  • In clustering, there are no predefined classes
    and no examples. The records are grouped together
    on the basis of self-similarity.

35
Association Methods
  • These techniques search all transactions from a
    system for patterns of occurrence
  • A common method is market basket analysis, in
    which the sets of products purchased by thousands
    of consumers are examined
  • It finds affinity groupings that discover what
    items are usually purchased with others,
    predicting the frequency with which certain items
    are purchased at the same time
  • Results are then portrayed as percentages, for
    example, 30% of the people that buy steaks also
    buy charcoal

36
Association: Market Basket Analysis
  • This is the most widely used and, in many ways,
    most successful data mining algorithm
  • It essentially determines what products people
    purchase together
  • Retailers can use this information to place these
    products in the same area
  • Direct marketers can use this information to
    determine which new products to offer to their
    current customers
  • Inventory policies can be improved if reorder
    points reflect the demand for the complementary
    products

37
Market Basket Analysis Method
  • We first need a list of transactions to see what
    was purchased. This can be easily obtained from
    cash registers / POS devices.
  • Next, we choose a list of products to analyse,
    and tabulate how many times each was purchased
    with the others

38
A Convenience Store Example
  • Consider the following simple example about five
    transactions at a convenience store
  • Transaction 1: Pizza, cola, milk
  • Transaction 2: Milk, potato chips
  • Transaction 3: Cola, pizza
  • Transaction 4: Milk, biscuits
  • Transaction 5: Cola, biscuits
  • These need to be cross tabulated and displayed in
    a table

39
A Convenience Store Example
  • Pizza and Cola sell together more often than any
    other combination: a cross-marketing opportunity?
  • Milk sells well with everything: people probably
    come here specifically to buy it

40
Market Basket Analysis: Using the Results
  • The tabulations can immediately be translated
    into association rules and the numerical measures
    computed
  • Comparing this week's table to last week's table
    can immediately show the effect of this week's
    promotional activities
  • Some rules are going to be trivial (e.g. hot dogs
    and buns sell together) or inexplicable /
    spurious (e.g. wheelbarrows sell best on
    Wednesdays?)

41
Market Basket Analysis: Limitations
  • A large number of real transactions are needed to
    do an effective basket analysis, but the data's
    accuracy is compromised if all the products do
    not occur with similar frequency
  • The analysis can sometimes capture results that
    were due to the success of previous marketing
    campaigns (and not natural tendencies of
    customers)
  • (Have a look at Amazon.com to see it in action)

42
Data Visualisation
  • Data visualisation is so powerful because the
    human visual cortex converts objects into
    information so quickly
  • See an example on the next slide where height and
    shading add additional dimensions to the figure

43
Data Visualisation: An Enlivened Risk Analysis Report
44
Data Visualisation
  • Technologies which support visualisation and
    interpretation include
  • Digital imaging, GIS, GUI, tables,
    multi-dimensions, graphs, VR, 3D, animation
  • Helps to visually identify relationships and
    trends
  • Data manipulation allows real-time inspection of
    performance data / CPI benchmarks

45
Geographical Information Systems (GIS)
  • A Geographical Information System (GIS) is a
    special purpose database that contains a spatial
    co-ordinate system
  • Computerised system for managing and manipulating
    data with digitised maps
  • Used for modeling and simulations
  • A comprehensive GIS requires
  • Data input from maps, aerial photos, etc.
  • Data storage, retrieval and query
  • Data transformation and modeling
  • Data reporting (maps, reports and plans)

46
GIS Sample Applications
47
Capabilities of a GIS
  • In general, a GIS contains two types of data
  • Spatial data: these elements correspond to a
    uniquely-defined location on earth. They could
    be in point, line or polygon form
  • Attribute data: these are the data that will be
    portrayed at the geographic references
    established by spatial data
  • Example (next slide): data from an opinion poll
    is displayed for multiple regions in the USA.
    Clicking on an area allows the user to drill down
    to the results for smaller areas.

48
Sample GIS Application: Telephone Polling Results
On the live map, clicking on an area allows the
user to drill down and see results for smaller
areas
49
Data Mining: Some Applications
  • Pharmaceuticals: Massive amounts of biological
    and clinical information can be analysed with
    data mining methods to discover new uses for
    existing drugs
  • Healthcare: Hospitals are using data mining to
    perform utilisation analysis and pricing
    analysis, to estimate outcomes, to
    improve preventive care, and to detect fraud and
    questionable practices
  • Banking: Data mining tools help banks to
    understand customer behaviour, conduct
    profitability analysis, improve cross-selling
    efforts, identify credit risk, identify customers
    for loan campaigns, tailor financial products to
    meet customer needs, seek new customers, and
    enhance customer service
  • Credit card companies: Predictors for credit card
    customer attrition and fraud are frequently
    identified via data mining. Successful users of
    data mining include American Express and
    Citibank.
  • Financial services: Security analysts are using
    data mining extensively to analyse large volumes
    of financial data in order to build trading and
    risk models for developing investment strategies

50
Data Mining: Some Applications
  • Telemarketing and direct marketing: In this
    sector, companies have gained big savings and are
    able to target customers more accurately by using
    data mining. Direct marketers are configuring and
    mailing their product catalogs based on
    customers' purchase history and demographic data.
  • Airlines: As the competition in the airline
    business increases, understanding customers'
    needs has become imperative. Airlines capture
    customer data in order to make strategic
    moves such as expanding their services to new
    routes.
  • Manufacturers: Data mining is widely used in
    manufacturing industries to control and schedule
    technical production processes.
  • Insurance companies: The insurance industry is
    data intensive. Data mining has recently provided
    insurers with a wealth of useful information
    extracted from huge databases for decision
    making.

51
Data Mining: Some Applications
  • Telecommunications: By applying the insights
    learned through data mining, telecommunications
    companies can identify products and services that
    maximise value and then use this information to
    establish marketing campaigns to improve market
    share. A common example in this industry is
    identifying factors that influence customer
    retention. In the US, telephone companies were
    famous for their price-cutting strategy in the
    past, but the new strategy is to know their
    customers better. Using data mining, telephone
    companies are able to provide customers with a
    great variety of new services they are likely to
    purchase.
  • Distribution and retailing: With the huge amount
    of consumer data flowing in daily from different
    sources, especially from e-commerce Web sites,
    data mining helps companies learn more about
    their customers and develop insights into their
    buying habits. Knowing the behaviours (e.g. likes
    and dislikes) of customers leads to better
    customer service and allows companies to create
    one-to-one relationships with customers,
    hopefully prolonging loyalty and prompting repeat
    business. As such, data mining is used
    extensively in the area of customer relationship
    management. Large users of data mining in the
    retailing industry include Wal-Mart and
    Victoria's Secret.
  • Remotely sensed data: Huge amounts of remotely
    sensed data are taken in every day from satellite
    images and other related sources. Data mining is
    used in prediction of weather, monitoring and
    reasoning about ozone depletion, etc.

52
Advantages of Data Mining
  • Provide better information to achieve competitive
    edge
  • This advantage is the primary motivation for data
    mining. Data mining has a powerful analytical
    ability to generate information, which allows an
    organisation to better understand itself, its
    customers, and the marketplace it competes in.
    When used as a marketing tool, data mining often
    results in sharper competitive edge, an
    evidence-based selling approach, a
    customer-oriented marketing plan, shorter selling
    cycles, and reduced operational costs.
  • Add value to a data warehouse
  • A data warehouse by itself is just a large
    repository of unstructured data, and data mining
    is the process of analysing the data and
    transforming it into useful information.
    Organisations have experienced a payback of 10 to
    70 times their data warehouse investment after
    data mining components are added.
  • Increase operating efficiency
  • Data mining's ability to quickly organise and
    analyse a large pool of data has dramatically
    increased workplace efficiency. It allows users
    to create complex financial statements in minutes
    compared with weeks by traditional methods.

53
Advantages of Data Mining
  • Provide flexibility in using data
  • With data mining, users gain control over the
    data. Instead of letting the system push the
    data, users are now able to pull the data they
    need. Users can let their imagination run and
    manipulate data in various ways to answer their
    questions. The easy-to-use interface of data
    mining tools and client/server technology has
    made the information directly accessible by
    individual users.
  • Reduce operating costs
  • Modern data mining tools are made of highly
    sophisticated hardware and software components.
    They allow these tools to analyse massive data
    sets efficiently with reduced operating costs.
    (e.g. the high costs faced by public sector
    organisations such as healthcare providers when
    asked to answer a parliamentary question raised
    in the Oireachtas could be reduced by the use of
    data warehouses and data mining)
  • Ready-to-use
  • Unlike traditional data analysis methods, data
    mining hardly requires pre-processing of data
    prior to analysis. It can use a mixture of
    numeric, categorical, and date data, and can
    tolerate missing and noisy data. The results are
    in the form of ready-to-use business rules with
    almost no statistical expertise and guesswork
    needed.
  • Solve research bottleneck
  • In many social science and business situations,
    conducting real experiments is almost impossible.
    Data mining is able to provide these research
    agendas with a more limited set of working
    hypotheses for further investigation based on
    large, unstructured data sets.

54
Disadvantages of Data Mining
  • No definitive answer
  • Data mining yields useful insights and clues but
    no definitive answers. The definitive answers
    need to be achieved through much more rigorous
    scientific experimentation. Experiences from Wall
    Street have shown that this technology may not
    outperform traditional methods. Therefore, users
    should have a realistic expectation of the
    results of data mining.
  • High cost
  • The cost of implementing data mining is quite
    high; thus, it may not be appropriate in some
    business environments. The ROI needs to be
    justified by cost-benefit analysis.
  • Complex and lengthy project
  • Experience from data mining system developers has
    shown that it takes a long time to get the
    project right. Developers suggest focusing on
    incremental development and benefits.
  • Privacy
  • The detailed data about individuals used in data
    mining might involve a violation of privacy. This
    problem worsens when the World Wide Web is
    involved, because detailed personal information
    is easily accessible and can fall into wrong
    hands.

55
Disadvantages of Data Mining
  • Knowledge requirement of user
  • Despite its increasingly simple interface and
    automation of the thinking processes, data mining
    is more suitable for people with statistical,
    operations research, and management science
    backgrounds. The ease of use becomes a critical
    factor for attracting more businesses to invest
    in this technology.
  • Unmanageable database
  • Many authors have suggested that organisations
    must increase the size of their databases
    tremendously in order to do data mining. However,
    some are concerned that this will result in
    unmanageable and unnecessary databases.
  • Wrong information from errors in data
  • The massive data used in data mining inevitably
    contains mistakes caused by human errors.
    Information generated should be used with caution
    to avoid lawsuits in areas such as hiring.
    Experts suggest using only relevant information
    for mining to reduce such risks.

56
Additional Resources
  • See case studies of successful implementations
    at http://www.sas.com/success/technology.html
  • See product demos at
    http://www.sap.com/solutions/analytics/
  • CIO Magazine - ERP Resources:
    http://www.cio.com/enterprise/erp/
  • White papers available from
    http://www.datawarehousing.com/papers.asp
  • Industry research reports available from
    http://www.datawarehousingonline.com
  • The Data Warehousing Information Center:
    http://www.dwinfocenter.org