Unit Five The nature and sources of data - PowerPoint PPT Presentation


Slides: 76
Provided by: A150

Unit Five: The nature and sources of data
  • Data: items about things, events, activities,
    and transactions that are recorded, classified,
    and stored but are not organized to convey any
    specific meaning.
  • Data items can be numeric, alphanumeric, figures,
    sounds, or images.
  • Information: data that have been organized in a
    manner that gives them meaning for the recipient.

  • Information confirms something the recipient
    already knows, or has surprise value by revealing
    something not known.
  • Knowledge: data items and/or information
    organized and processed to convey understanding,
    experience, accumulated learning, and expertise
    that are applicable to a current problem or
    activity.
  • Knowledge can be the application of data and
    information in making decisions.

  • Internal data
  • Stored in one or more places; they are about
    people, products, services, and processes
    (for example, student data are stored in a
    university DB).
  • External data
  • Have many sources: commercial DBs and data
    collected by sensors and satellites.

  • Available on CD, DVD, the Internet, and from
    statistical bureaus, banks, and chambers of
    commerce.
  • Data collection problems and quality
  • The need to collect data from internal and
    external sources makes MSS building complicated.
  • In some cases it is necessary to collect raw data
    in the field.
  • In other cases it is necessary to get data from
    people or to find it on the Internet.
  • Data must be validated and filtered.

  • Methods for collecting raw data
  • 1- manually: observations, questionnaires,
    interviews, soliciting information from experts.
  • 2- with instruments: sensors and scanners, for
    example for biometrics.
  • Data problems
  • 1- data are not correct: generated carelessly or
    entered inaccurately.
  • 2- data are not timely: methods for generating
    data are not fast enough to meet the need for
    data.

  • 3- data are not measured properly: gathered
    inconsistently with the purposes of the analysis.
  • 4- needed data do not exist: no one ever stored
    the data that are needed now.
  • Data quality
  • Quality determines usefulness of data as well as
    the quality of the decisions based on them.

  • Data quality problems
  • 1- contextual DQ: relevancy, timeliness,
  • 2- intrinsic DQ: accuracy, objectivity,
    believability, reputation.
  • 3- accessibility DQ: accessibility, access
    security.
  • 4- representation DQ: interpretability, ease of
    understanding, consistent representation.

Data Integrity
  • Older filing systems may lack integrity. If a
    change is made in the file in one place, it may
    not be made in the file in other places or
    departments, which results in conflicting data.
  • Data integrity considers the following issues
  • 1- uniformity: during data capture, uniformity
    checks ensure that data are within specific

  • 2- version: checks are performed when the data
    are transformed through the use of metadata to
    ensure that the format of the original data has
    not been changed.
  • 3- completeness: the check ensures that the
    summaries are correct and that all values needed
    to create the summary are included.

  • 4- conformity: the check ensures that during data
    analysis and reporting, correlations are run
    between the value reported and previous values
    for the same numbers.
  • Sudden changes can indicate a basic change in the
    business, an analysis error, or bad data.
  • 5- genealogy check (drill down): trace back to
    the data source through its various
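
The uniformity, completeness, and conformity checks above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the age range 0 to 120, and the 50% change threshold are hypothetical and do not come from the slides, and the uniformity check is read here as a simple range check.

```python
def uniformity_check(values, low, high):
    """Return the values that fall outside the allowed [low, high] range."""
    return [v for v in values if not (low <= v <= high)]


def completeness_check(detail_values, reported_total):
    """True when the reported summary equals the sum of its detail values."""
    return sum(detail_values) == reported_total


def conformity_check(current, previous, max_change=0.5):
    """True when the relative change from the previous value is within max_change."""
    if previous == 0:
        return current == 0
    return abs(current - previous) / abs(previous) <= max_change


# Hypothetical sample data, for illustration only.
ages = [23, 41, -5, 37, 140]
out_of_range = uniformity_check(ages, 0, 120)            # values violating the range
totals_match = completeness_check([100, 250, 50], 400)   # summary vs. its details
sudden_change = not conformity_check(current=900, previous=400)
```

A value flagged by `sudden_change` is not necessarily wrong; as the slide notes, it may reflect a real change in the business, so a flag like this is a prompt for review rather than an automatic rejection.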

Data Access and Integration
  • How to reach data in its storage area?
  • Data access can be done using one of the
    following methods
  • relational database tables, XML documents,
    electronic data messages, COBOL records, the
    Internet, which has thousands of databases all
    over the world accessible through the
  • commercial data banks, which are online
    database services selling access to
    specialized databases,

  • they can add external data to the MSS in a timely
    manner and at reasonable cost. An example is GIS.

  • A DBMS supplements the standard operating system
    by allowing for greater integration of data,
    complex file structures, quick retrieval and
    changes, and better data security.
  • It is a set of SW programs for adding information
    to a DB and for updating, deleting, manipulating,
    storing, and retrieving information.

DB types
  • 1- relational 2-hierarchical 3-network
  • 4- Object oriented DB
  • MSS applications may require access to complex
    data, which may include pictures that cannot be
    handled by the previous types.
  • An object-oriented DB (OODB), which uses a
    graphical representation, may be used to handle
    pictures.

  • It is based on OOP, combining characteristics
    of OOP (modeled, for example, with UML) with
    mechanisms for data storage and access.
  • An OODBMS allows data to be analyzed at a
    conceptual level that emphasizes the natural
    relationships between objects using encapsulation
    and
  • An OODBMS defines data as objects and encapsulates
    data along with their relevant structure and
  • 5- Multimedia-Based DB
  • MMDBMS manage data in a variety of formats in
    addition to text and numbers.
  • other formats include images such as digitized
    photographs, forms of bit-mapped graphics such as
    maps or .PIC files, hypertext images, videos,
    clips, sounds, and virtual reality
    (multidimensional images).

Data Warehousing
  • A data warehouse is one or several databases that
    contain the information needed for tactical or
    strategic decisions. It is a collection of data
    designed to support decision making. It contains
    data that present a coherent picture of business
    conditions at a single point in time.

  • Data warehousing can be
  • 1- utilized to support decision making.
  • 2- used to analyze large amounts of data from
    various sources to provide fast results that
    support critical processes.
  • Organizations (public and private) continuously
    collect data, information, and knowledge and store
    them in computerized systems.

  • As the amount of data increases
  • 1- updating, retrieving, using, and removing
    information becomes complicated.
  • 2- the number of data uses increases as a result
    of improved reliability and availability of network
  • The warehouse gets data from external and internal
    sources, organized consistently with the
    organization's needs.
  • The data WH has access to all information relevant
    to the organization, which can come from internal
    or external sources.


Data Warehouse characteristics
  • 1- subject-oriented
  • Data are organized by detailed subjects
    (customer, policy type in an insurance company).
  • Data contain only information relevant for
    decision support.
  • Enables users to determine how their business is
    performing and why it is performing that way.

  • It provides a more comprehensive view of the
    organization than an operational DB, which is
    product oriented and handles transactions.
  • 2- integrated
  • Data at different source locations may be
    encoded differently. For example, a person's
    gender may be encoded as 0 or 1 in one place and
    as F or M in another. In the data warehouse they
    are scrubbed (cleaned) into one format, which
    makes them standard and

  • 3- time variant (time series)
  • Data do not provide current status.
  • Data are kept for several years and are used for
    trends, forecasting and comparison.
  • Time is the one important dimension that all data
    warehouses must support.
  • Data for analysis from multiple sources contain
    multiple time points (daily, weekly, monthly
    views).

  • 4- nonvolatile
  • Once entered into the warehouse, data are
    read-only; they cannot be changed or updated.
  • Obsolete data are discarded, and changes are
    recorded as new data.
  • 5- summarized: operational data are aggregated
    into summaries.

  • 6- not normalized
  • Data in a data warehouse are not normalized and
    are highly redundant.
  • 7- sources: all data are present, both internal
    and external.
  • 8- metadata: data about data are included in data

  • Metadata describe the structure of and some
    meaning about the data, which affects its
    effective or
  • The key to making users comfortable with
  • Metadata involve knowledge; capturing it and
    making it accessible throughout the organization
    has become an important success factor.

  • With metadata and a metadata repository,
    organizations can improve their use of
    information and application development.
  • Business benefits from metadata as follows
  • 1- reduction of IT-related problems.
  • 2- increased system value to the business.
  • 3- improved business decision making.

  • Business metadata comprise information that
    increases our understanding of traditional
    (structured) reported data.
  • The primary purpose is to provide context to the
    data, enriching information and leading to
    knowledge. Context does not have to be the same
    for all
  • It assists in the conversion of data and
    information into knowledge.

  • Data about data.
  • Metadata describes how and when and by whom a
    particular set of data was collected, and how the
    data is formatted. Metadata is essential for
    understanding information stored in data
    warehouses and has become increasingly important
    in XML-based Web applications.
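
The "integrated" characteristic described earlier (gender stored as 0/1 in one source and F/M in another, then scrubbed into one format) can be sketched as follows. The mapping of 0 to F and 1 to M is an assumption made only for illustration; the slides do not say which number stands for which value, and the names `GENDER_MAP` and `scrub_gender` are hypothetical.

```python
# Hypothetical mapping from source-specific codes to one standard code.
# Which numeric code means which gender is an assumption for this sketch.
GENDER_MAP = {"0": "F", "1": "M", "F": "F", "M": "M"}


def scrub_gender(raw):
    """Return the standard code, or None when the raw value is not recognized."""
    key = str(raw).strip().upper()
    return GENDER_MAP.get(key)


source_a = [0, 1, 1]            # numeric encoding used by one source
source_b = ["F", "m", " M "]    # letter encoding with inconsistent case and spacing
integrated = [scrub_gender(v) for v in source_a + source_b]
```

Returning `None` for unknown codes, rather than guessing, keeps bad values visible so they can be corrected before loading.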

Data Warehouse architecture and process
  • The architecture can have one, two, or three
    layers.
  • A DWH can be divided into three parts
  • 1- the DWH itself, which contains the data and
    associated SW.
  • 2- data acquisition SW, which extracts data from
    legacy systems and external sources,
    consolidates and summarizes them, and loads them
    into the DWH.
  • 3- client SW, which allows users to access and
    analyze data in the warehouse.

  • The three-layer architecture contains
  • 1- the operational systems, which contain the
    data and the SW for data acquisition in one
    server (layer).
  • 2- the DWH, which is another layer.
  • 3- a third layer that includes the decision
    support/business intelligence, business analytics
    engine and the client. The advantage is that it
    separates the functions of the data WH, which
    eliminates resource constraints and makes it
    possible to create data marts easily.

  • In the two-layer architecture
  • The DSS engine is on the same platform as the WH,
    which makes it more economical than the
    three-layer structure.
  • Some issues to consider when selecting an
  • 1- which DBMS to use? Most DWHs are built using
    a relational DBMS such as Oracle or SQL Server,
    which support client-server and Web-server
    architectures.
  • 2- will parallel processing or partitioning be
  • Parallel processing enables multiple CPUs to
    process data WH query requests at the same time.
    Partitioning splits the DB tables into smaller
    ones to improve access efficiency.

  • 3- will data migration tools be used to load the
  • 4- what tools will be used to support data
    retrieval and analysis?

Data Warehouse Development
  • The process of migrating data to the DWH involves
    extraction of data from all relevant sources.
  • Data sources consist of the following
  • 1- files extracted from online transaction
    processing (OLTP).
  • 2- spreadsheets
  • 3- personal DBs (MS Access)
  • 4- external files.

  • DWH contains a number of business rules that
    define the following
  • 1- how the data will be used.
  • 2-summarization rules.
  • 3-standardization of encoded attributes.
  • 4-calculation rules.
  • Data quality issues need to be corrected before
    the data are loaded into the DWH.

  • One of the well-defined DWH benefits is that
    these rules can be stored in a metadata
    repository and applied to the DWH centrally.
  • In OLTP, rules are scattered all over the system.
  • The load process into the DWH can be performed
    either by
  • 1- data transformation tools, which provide a
    graphical user interface (GUI) to help in the
    development and maintenance of business rules.
  • 2- developing programs or utilities to load the
    data WH using programming languages such as
    PL/SQL, C, or .NET.
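
The second option above, a custom program that loads the warehouse, can be sketched as a small extract-transform-load script. The slides name PL/SQL, C, and .NET; Python with its standard `sqlite3` module is used here only to keep the sketch short and runnable. The table name, column names, sample rows, and the specific rules are all hypothetical.

```python
import sqlite3

# Hypothetical rows extracted from an OLTP system: (order_id, region_code, amount)
extracted = [
    (1, "n", 120.0),
    (2, "N", 80.0),
    (3, "s", 50.0),
    (4, "S", None),   # bad record: missing amount
]


def transform(rows):
    """Apply business rules: drop bad rows, standardize codes, summarize by region."""
    totals = {}
    for _order_id, region, amount in rows:
        if amount is None:            # data quality rule: skip incomplete rows
            continue
        region = region.upper()       # standardization of encoded attributes
        totals[region] = totals.get(region, 0.0) + amount   # summarization rule
    return sorted(totals.items())


def load(conn, summary):
    """Create the warehouse table and insert the summarized rows."""
    conn.execute("CREATE TABLE sales_by_region (region TEXT PRIMARY KEY, total REAL)")
    conn.executemany("INSERT INTO sales_by_region VALUES (?, ?)", summary)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extracted))
loaded = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()
```

Keeping every rule inside `transform` mirrors the point made above: the rules live in one place and are applied centrally, instead of being scattered across the source systems.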

  • Several issues affect whether to build a data
    transformation tool or buy one:
  • 1- the cost of a transformation tool.
  • 2- the tool may take time to learn.
  • 3- it is difficult to measure how the IT
    organization is doing until it has learned to use
    the tool.

  • Benefits of transformation tools
  • 1- simplifying the maintenance of the
    organization's DWH.
  • 2- effective detection and scrubbing (removal)
    of bad data.

Star Schema
  • DWH design is based on dimensional modeling.
  • Dimensional modeling is a retrieval-based model;
    it supports a high volume of query access.
  • The star schema is how dimensional modeling is
    implemented.
  • A star schema contains a central fact table which
  • 1- the attributes needed to perform decision
  • 2- descriptive attributes used for query
  • 3- foreign keys to link to dimensional tables.
  • Decision analysis attributes consist of

  • A- performance measure
  • B- operational metrics
  • C-aggregate measures
  • D- other metrics needed to analyze org.
  • The fact table addresses what the DWH supports
    for decision analysis.
  • A dimensional table contains attributes that
    describe the data contained in the fact table.
  • Dimensional tables address how data will be

  • The grain of the data WH defines the highest
    level of detail supported; the grain indicates
    whether the DWH is highly summarized or includes
    detailed transaction data.
  • If the grain is defined too high, the WH may not
    support detailed requests to drill down into
  • Drill-down analysis is the process of probing
    beyond a summarized value to investigate each of
    the detail transactions that comprise the summary.
  • Low-level granularity results in more data being
    stored in the DWH.
  • Larger amounts of detail may affect query
    performance, making response time longer.
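
A minimal star schema, with one fact table linked by foreign keys to two dimensional tables, can be sketched with Python's standard `sqlite3` module. All table names, column names, and sample rows are hypothetical. The grain here is one row per sale, which is what makes the drill-down query at the end possible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, month TEXT);
-- Central fact table: a measure (amount) plus foreign keys to each dimension.
CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    amount REAL
);
INSERT INTO dim_product VALUES (1, 'TV'), (2, 'DVD player');
INSERT INTO dim_date VALUES (1, '2024-01'), (2, '2024-02');
INSERT INTO fact_sales VALUES
    (1, 1, 1, 500.0), (2, 1, 1, 450.0), (3, 2, 1, 90.0), (4, 1, 2, 520.0);
""")

# Summarized view: total sales per product and month.
summary = conn.execute("""
    SELECT p.name, d.month, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_date d ON d.date_id = f.date_id
    GROUP BY p.name, d.month
    ORDER BY p.name, d.month
""").fetchall()

# Drill-down: the detail transactions behind one summarized value.
detail = conn.execute("""
    SELECT f.sale_id, f.amount
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_date d ON d.date_id = f.date_id
    WHERE p.name = 'TV' AND d.month = '2024-01'
    ORDER BY f.sale_id
""").fetchall()
```

If the fact table stored only monthly totals (a higher grain), the `summary` query would still work but the `detail` query would have nothing to return, which is the trade-off between storage and drill-down described above.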

Implementing Data Warehouse
  • DWH projects can be identified as either data
    centric or application centric.
  • Data centric WH
  • -based on a data model that is independent of any
  • -designed to support a variety of user needs and
  • -supports flexibility, since an organization's
    information needs constantly change. The more
    dynamic the business, the more its data needs
    will change.

  • Application centric WH
  • -designed to support a single initiative or a
    small set of initiatives.
  • -preferred for independent data mart development.
  • -provides a more focused scope, increasing the
    chance of a successful DWH implementation.
  • Its disadvantage is that critical data needs may
    be left out during the initial development, so
    multiple iterations are necessary.

  • Factors that play a big role in the successful
    implementation of a DWH can be categorized into
    organizational, project, and technical issues.
    The factors are
  • 1-management support 2-champion
  • 3-resources 4-user participation
  • 5-team skills 6-source system
  • 7-development technology

  • Implementation of a Web-based DWH (Webhousing)
    makes it easier to access large amounts of data,
    but it is difficult to determine the hard
    benefits of a DWH.
  • Hard benefits: organizational benefits that can
    be expressed in monetary terms (an organization
    has priorities when it comes to money).
  • A project champion helps ensure that the DWH
    project will receive the necessary resources for
    successful implementation.
  • Resources can be costly: a DWH requires powerful
    processors and a large increase in direct-access
    storage devices, and a Web-based WH has special
    security requirements.

  • User participation
  • -participation in data modeling and access
  • During data modeling, expertise is required to
    determine the following
  • 1- what data are needed?
  • 2- what are the business rules for the data?
  • 3- what aggregations and calculations are needed?
  • Access modeling is needed to determine
  • 1- how data are to be retrieved from the DWH.
  • 2- the physical definition of the WH, including
    which data need indexing.
  • 3- whether data marts are needed to facilitate
    information retrieval.

  • Team skills require in-depth knowledge of DB
    technologies and development tools.
  • Source system and development technology refer to
    the many inputs and processes used to load and
    maintain the DWH.
  • Ubiquitous

Best practices for DWH implementation
  • The project must fit with corporate strategy and
    business objectives.
  • There must be complete buy-in to the project
    (executives, managers, users)
  • Manage expectations.
  • The DWH must be built incrementally.
  • The project must be managed by both IT and
    business managers.
  • Load only cleaned, quality data.
  • Do not overlook training requirements.

DWH Risks
  • There are many risks in WH projects; they are
    serious because DWHs are large-scale and
    expensive projects. Some risks are
  • Quality of source data is not known.
  • Skills are not in place.
  • Inadequate budget.
  • Lack of supporting SW.
  • Weak or lost sponsor.
  • Users are not computer literate.
  • Unrealistic user expectations.
  • Key people may leave the project.
  • Too much new technology.
  • Team geography, language, and culture.

Mistakes to avoid in developing a successful DWH
  • Be aware of the following problems
  • 1- setting expectations that you cannot meet.
  • 2- loading the WH with any available data.
  • 3- choosing DWH managers who are technology
    oriented rather than user oriented.
  • 4- focusing on ad hoc data mining and periodic
    reporting instead of alerts.
  • The natural progression in a DWH is
  • 1- extract data from legacy systems, clean them,
    and load them into the WH.
  • 2- support ad hoc reporting until you learn what
    people want.
  • 3- convert the ad hoc reports into regularly
    scheduled reports.

Massive data WH and scalability
  • DWH needs to be flexible and support scalability.
  • Scalability deals with
  • -the amount of data in WH.
  • -how quickly the WH is expected to grow
  • -number of concurrent users
  • -the complexity of user queries
  • The DWH grows as a function of data growth.
  • The need to expand WH is to support new business
  • Data size is measured in terabytes and petabytes;
    huge data sizes need powerful computers and
    smart indexing and searching methods.

User capabilities and benefits
  • Users of a DWH are
  • Managers, analysts, executives, administrative
    assistants, professionals.
  • A DWH solution should
  • -provide ready access to critical data.
  • -insulate the operational DB from ad hoc
    processing.
  • -provide high-level summary information and
    drill-down capabilities.
  • These capabilities
  • -improve business knowledge.
  • -provide competitive advantage.
  • -enhance customer service and satisfaction.
  • -facilitate decision making.
  • -improve worker productivity.
  • -help streamline business processes.

Data Marts
  • A subset of DWH
  • Consists of a single subject area
  • ( marketing, sales, operations etc)
  • It can be either dependent or independent
  • Dependent Data Marts
  • Subset created directly from the data WH.
  • Advantages
  • -uses a consistent data model.
  • -provides quality data.
  • -supports the concept of a single enterprise-wide
    data model.
  • -ensures that the end user is viewing the same
    version of the data that is accessible by all
    other DWH users.
  • Disadvantage
  • -because of the high cost of a DWH, their use is
    limited to large organizations.

  • Independent Data Marts
  • Lower cost, scaled down version of DWH.
  • Designed for strategic business units
  • Advantages of Data marts
  • -low cost
  • -shorter time to implement
  • -controlled locally
  • -contain less information, which decreases the
    response
  • -allow the building of private DSS without
    relying on a centralized IS department.

Business Intelligence
  • Describes the basic components of the business
    intelligence environment, ranging from business
    process modeling and data modeling to business
    rules systems, data profiling, information
    compliance and data quality, data WH and data
  • It involves acquiring data and information from a
    wide variety of sources and utilizing them in
    decision making.

  • Business analytics deals with models and solution
  • Business intelligence methods
  • -provide charts and graphs of multidimensional
  • -they access data from the DWH and bring them to
    a local DB system, such as OLAP methods.
  • OLAP methods allow users to slice and dice data
    and observe graphs and tables that reflect the
    dimension being observed.
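
"Slice and dice" can be illustrated on a tiny in-memory cube. This is a sketch under assumptions, not how any particular OLAP product works: the three dimensions (region, product, quarter), the sample figures, and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical cube cells: (region, product, quarter, sales)
cells = [
    ("North", "TV", "Q1", 100),
    ("North", "DVD", "Q1", 40),
    ("South", "TV", "Q1", 70),
    ("North", "TV", "Q2", 120),
    ("South", "DVD", "Q2", 30),
]


def slice_cube(rows, quarter):
    """Slice: fix one dimension (quarter) to a single value."""
    return [r for r in rows if r[2] == quarter]


def dice_cube(rows, regions, products):
    """Dice: keep a sub-cube by restricting several dimensions at once."""
    return [r for r in rows if r[0] in regions and r[1] in products]


def rollup(rows, dim_index):
    """Aggregate sales along one dimension, for example total per region."""
    totals = defaultdict(int)
    for r in rows:
        totals[r[dim_index]] += r[3]
    return dict(totals)


q1_by_region = rollup(slice_cube(cells, "Q1"), 0)
north_tv = dice_cube(cells, {"North"}, {"TV"})
```

The table or graph a user sees after each operation corresponds to a result such as `q1_by_region`: the chosen dimension becomes the axis, and the remaining cells are aggregated.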

Data Mining Methods
  • Apply statistical and deterministic models and AI
    methods to data to identify hidden relationships
    or discover knowledge among various data or text
  • Dashboards
  • -provide managers with exactly the information
    they need in the correct format at the correct
  • -the foundation of dashboards is Business
  • Provide real-time views of data (daily, weekly,

  • Dashboard is a user-interface feature Apple
    introduced with the release of Mac OS X 10.4
    Tiger. It allows access to all kinds of "widgets"
    that show the time, weather, stock prices, phone
    numbers, and other useful data. With the Tiger
    operating system, Apple included widgets that do
    all these things, plus a calculator, language
    translator, dictionary, address book, calendar,
    unit converter, and iTunes controller. Besides
    the bundled widgets, there are also hundreds of
    other widgets available from third parties that
    allow users to play games, check traffic
    conditions, and view sports scores, just to name
    a few.

Online Analytical Processing (OLAP)
  • Online transaction processing (OLTP) concentrates
    on building mission-critical systems that
  • -support the organization's transaction
    processing.
  • -are fault-tolerant.
  • -provide fast response.
  • OLTP systems concentrate on distributed
    relational DBs.
  • OLAP refers to a variety of activities usually
    performed by end users in online systems, such
    as
  • -generating queries.
  • -requesting ad hoc reports and graphs.
  • -conducting statistical analysis and building DSS
    and multimedia applications.
  • In OLAP, users ask specific, open-ended questions.
  • OLAP tools
  • -query tools, spreadsheets, data mining tools,
    data visualization tools.

OLAP Tools
  • Four types of processing are performed by
    analysts within an organization
  • 1- categorical analysis is a static analysis
    based upon historical data. It is based on the
    premise that past performance is an indicator of
    the future.
  • 2- exegetical analysis is also based on
    historical data, with the added ability of
    drill-down analysis (the ability to query further
    into the data to determine the detail data used
    to determine a derived value).
  • 3- contemplative analysis allows a user to change
    a single value to determine its impact.
  • 4- formulaic analysis permits changes to multiple

Data Mining
  • A term used to describe knowledge discovery in
  • It is a process that uses statistical,
    mathematical, artificial intelligence, and
    machine-learning techniques to extract and
    identify useful information and subsequent
    knowledge from large databases.
  • It is the process of finding mathematical
    patterns in usually large sets of data. Patterns
    can be rules, affinities, correlations, trends,
    or prediction models.
  • Used when relationships between system variables
    cannot be expressed mathematically and modeling
    is not possible.

Data Mining Activities
  • 1-knowledge extraction.
  • 2-data archaeology
  • 3- data exploration
  • 4-data pattern processing
  • 5-data dredging
  • 6-information harvesting

DM characteristics and Objectives
  • 1- data are often buried deep within very large
    DBs, which sometimes contain data from several
    years that are cleaned and consolidated in a DWH.
  • 2- the DM environment is usually a client-server
    or Web-based architecture.
  • 3- new tools help to extract the information ore
    buried in files or archive records. Finding it
    requires synchronization of data to get the right
  • 4- the miner is an end user who uses powerful
    query tools to ask ad hoc questions and obtain
    answers quickly with little or no programming
    skill.
  • 5- DM tools are combined with spreadsheets, which
    makes analyzing and processing data easy and
  • Because of the large amounts of data and massive
    search efforts, it is necessary to use parallel
    processing for DM.

How does DM work?
  • Intelligent DM discovers information within the
    DWH that queries and reports cannot effectively
  • DM tools find patterns in data and may infer
    rules from them.
  • Ways to identify patterns in data
  • 1- simple models: SQL-based query, OLAP, human
  • 2- intermediate models: regression, decision
    trees,
  • 3- complex models: neural networks.

Data Mining application classes
  • Each class is supported by a set of algorithmic
    approaches to extract the relevant relationships
    in the data. These classes are
  • 1- classification: infers the defining
    characteristics of a certain group. Example:
    customers who have been lost to competitors.
  • 2- clustering: identifies groups of items that
    share certain characteristics (no predefined
    characteristics are given). Example: classes of
    customers with certain needs to be met.
  • 3- association: identifies relationships between
    events that occur at one time. Example: what
    products sell with other ones, and to what degree.

  • 4- sequencing: identifies relationships between
    events that occur over a period of time.
    Example: repeat visits to a supermarket.
  • 5- regression: used to map data to a prediction
    value using linear and nonlinear techniques. It
    is a form of estimation.
  • 6- forecasting: estimates future values based on
    patterns within large sets of data.
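
The association class ("what products sell with other ones, and to what degree") is commonly quantified with support and confidence. The slides do not name these measures, so treating them as the "degree" is an assumption of this sketch, and the baskets are made-up illustrative data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# How often each item, and each unordered pair of items, appears in a basket.
item_counts = Counter(item for b in baskets for item in b)
pair_counts = Counter(
    pair for b in baskets for pair in combinations(sorted(b), 2)
)


def support(pair):
    """Fraction of all baskets that contain both items of the pair."""
    return pair_counts[tuple(sorted(pair))] / len(baskets)


def confidence(antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    pair = tuple(sorted((antecedent, consequent)))
    return pair_counts[pair] / item_counts[antecedent]


bread_milk_support = support(("bread", "milk"))
bread_to_milk = confidence("bread", "milk")
```

Support says how common the combination is overall; confidence says how reliably one product accompanies the other. A rule is usually interesting only when both are reasonably high.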

  • DM can be either hypothesis driven or discovery
    driven.
  • Hypothesis-driven data mining begins with a
    proposition by the user and then seeks to
    validate the truthfulness of the proposition.
    For example, a marketing manager may pose the
    proposition "Are DVD player sales related to TV
    set sales?"
  • Discovery-driven data mining finds patterns and
    relationships among the data. It can uncover
    facts that are unknown by the organization.
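
One simple way to test the hypothesis-driven example (are DVD player sales related to TV set sales?) is to compute a correlation coefficient. Choosing Pearson correlation is an assumption of this sketch, since the slides do not specify a method, and the monthly sales figures below are invented purely for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical monthly unit sales, for illustration only.
tv_sales = [10, 12, 15, 20, 22]
dvd_sales = [4, 5, 7, 9, 10]
r = pearson(tv_sales, dvd_sales)
```

A value of `r` close to 1 supports the proposition that the two sales series move together; it does not show that one causes the other.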

  • DM uses many tools to discover patterns and
    relationships in data to make accurate
  • Steps of data mining
  • 1- define the business problem.
  • 2- build (find or acquire) the DM database.
  • 3- explore the data.
  • 4- prepare the data for modeling.
  • 5- build or find models.
  • 6- evaluate the models.
  • 7- act on the results.

Data Mining misconceptions
  • Results of DM can increase revenues, decrease
    costs, identify fraud, identify opportunities, and
    offer new competitive advantage. But there are a
    number of misconceptions about DM, which are
  • 1- DM provides instant, crystal-ball predictions.
    In fact, it is a multi-step process that requires
    deliberate, proactive design and use.
  • 2- DM is not available for business applications.
    In fact, it is ready to go for any business.
  • 3- DM requires a separate, dedicated database.
    In fact, dedicated DBs are not required, but they
    are desirable.
  • 4- DM requires professionals. In fact, any trained
    user can use
  • 5- DM is only for large firms with huge data. In
    fact, if the data are accurate, DM can be used by
    any firm regardless of its size.

Data Mining Tools and techniques
  • 1- statistical methods: these include linear and
    nonlinear regression and point estimation.
  • 2- decision trees: used in classification and
    clustering methods.
  • 3- genetic algorithms: work on the principle of
    expansion of possible outcomes. Given a fixed
    number of possible outcomes, they seek to define
    new and better solutions.

Text Mining
  • It is the application of DM to unstructured text
    files. DM takes advantage of the infrastructure
    of stored data to extract additional useful
    information. For example, by data mining a
    customer DB, an analyst may find that customers
    who buy product A also buy products B and C, but
    six months later.

  • Text mining helps organizations to
  • 1- find the hidden content of documents and
    additional useful information.
  • 2- relate documents across previously unnoticed
    divisions. Example: finding that customers in
    two different product divisions have the same
  • 3- group documents by common subjects. Example:
    all customers of an insurance firm who have the
    same complaints and cancel their policies.

Data Visualization
  • It refers to technologies that support
    visualization and interpretation of data and
    information at several points.
  • It includes digital images, geographic
    information systems, graphical user interfaces,
    virtual reality, tables and graphs,
    multidimensional views, and animations.