data mining - PowerPoint PPT Presentation

1
Data Mining-PART I By
M.Dhilsath Fathima
2
Topics to cover..
  • Introduction
  • Types of Data
  • Data Mining Functionalities
  • Interestingness of Patterns
  • Classification of Data Mining Systems
  • Data Mining Task Primitives
  • Integration of a Data Mining System with a Data
    Warehouse
  • Issues
  • Data Preprocessing.

3
What is a Database?
A database is any organized collection of data.

4
Examples
Co-workers
5
Examples
Patient Information
6
Examples
Airline reservation system
7
DATABASE
  • Database: A shared collection of logically related
    data (and a description of this data), designed
    to meet the information needs of an organization.
  • Database Management System (DBMS): A software system
    that enables users to define, create, and
    maintain the database and that provides
    controlled access to this database.

8
Who does it, and how?
  • A Database Management System (DBMS) does this job.
  • Software tools include Access, FileMaker, Lotus
    Notes, Oracle, and SQL Server.
  • A DBMS includes tools to add, modify, or delete data
    in the database, ask questions (queries)
    about the data stored in the database, and produce
    reports summarizing selected contents.
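
The operations listed above (add, modify, delete, query, report) can be sketched with Python's built-in sqlite3 module. The staff table, its columns, and the sample rows are made up for illustration; any DBMS named on the slide would offer equivalent operations.

```python
import sqlite3


def run_dbms_demo():
    """Add, modify, delete, query, and report on a small in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staff (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")

    # Add data (hypothetical sample rows).
    conn.executemany(
        "INSERT INTO staff (id, name, dept) VALUES (?, ?, ?)",
        [(1, "Asha", "Sales"), (2, "Ben", "Sales"), (3, "Chen", "Support")],
    )
    # Modify data: move one person to another department.
    conn.execute("UPDATE staff SET dept = ? WHERE id = ?", ("Support", 2))
    # Delete data: remove one record.
    conn.execute("DELETE FROM staff WHERE id = ?", (1,))
    # Query: ask a question about the stored data.
    names = [row[0] for row in conn.execute("SELECT name FROM staff ORDER BY id")]
    # Report: summarize selected contents (head count per department).
    report = dict(conn.execute("SELECT dept, COUNT(*) FROM staff GROUP BY dept"))
    conn.close()
    return names, report


names, report = run_dbms_demo()
print(names, report)
```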

9
Why do we need a database?
  • Keep records of our clients, staff, and volunteers
  • Keep a record of activities
  • Keep sales records
  • Develop reports
  • Perform queries

10
Data vs. information
  • What is data?
  • Data is unprocessed information.
  • What is information?
  • Information is data that have been organized and
    communicated in a logical and meaningful manner.

11
Purpose of Database System/Stages of Database
System
  • Data is converted into information, and
    information is converted into knowledge.
  • Knowledge is information evaluated and organized so
    that it can be used purposefully.

The purpose is to transform:
Data (unprocessed information) ->
Information (processed data) ->
Knowledge (information evaluated using measures) ->
Action (data analysis, future prediction)
12
Data Mining works with Warehouse Data
  • Data Warehousing provides the Enterprise with a
    memory.
  • Data Mining provides the Enterprise with
    intelligence

13
Data Mining works with Warehouse Data
14
What is Data Mining?
  • Nowadays, huge data sets have become available
    due to advances in technology.
  • As a result, there is increasing interest in
    various scientific communities in exploring the use
    of emerging data mining techniques for the
    analysis of these large data sets.
  • Data mining is the extraction of implicit,
    previously unknown, and potentially useful
    information, patterns, and associations from data.
  • Data mining is the exploration and analysis, by
    automatic or semi-automatic means, of large
    quantities of data in order to discover
    meaningful patterns.

15
WHO USES DATA MINING?
  • Banking: future prediction
  • Amazon.com (online stores): recommendation
  • Facebook: predicting how active a user will be
    after 3 months

16
Datamining is
17
DATA MINING IS NOT
  • Data warehousing
  • SQL / ad hoc queries / reporting
  • Online Analytical Processing (OLAP)
  • Data visualization
DATA MINING IS
  • Exploring data
  • Finding patterns
  • Performing prediction

18
KDD Process
  • Knowledge discovery in databases (KDD) is a
    multistep process of finding useful information and
    patterns in data.
  • Data Mining is the use of algorithms to extract
    information and patterns derived by the KDD
    process.
  • Many texts treat KDD and Data Mining as the same
    process, but it is also possible to think of Data
    Mining as the discovery part of KDD.

19
Steps of KDD Process
20
STEPS OF KDD PROCESS
  • 1. Selection
  • Data extraction: obtaining data from
    heterogeneous data sources such as databases, data
    warehouses, the World Wide Web, or other information
    repositories.
  • 2. Preprocessing
  • Data cleaning: incomplete, noisy, or
    inconsistent data is cleaned. Missing data may
    be ignored or predicted; erroneous data may be
    deleted or corrected.
  • 3. Transformation
  • Data integration: combines data from
    multiple sources into a coherent store. Data can
    be encoded in common formats, normalized, or reduced.

21
Steps of KDD Process
  • 4. Data mining
  • Apply algorithms to the transformed data and
    extract patterns.
  • 5. Pattern interpretation/evaluation
  • Pattern evaluation: evaluate the
    interestingness of the resulting patterns, or apply
    interestingness measures to filter the discovered
    patterns.
  • Knowledge presentation: present the mined
    knowledge; visualization techniques can be used.
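
The five KDD steps can be sketched end to end as a small pipeline. This is only a toy illustration: the source names, the records, the mean-fill rule for missing values, and the "high/low spender" pattern are all assumptions chosen to keep the example short, not part of the KDD definition.

```python
from statistics import mean

# Hypothetical raw data held in two separate sources.
RAW_SOURCES = {
    "store_db": [{"customer": "c1", "amount": 120.0}, {"customer": "c2", "amount": None}],
    "web_log": [{"customer": "c3", "amount": 80.0}, {"customer": "c4", "amount": -5.0}],
}


def select(sources):
    # 1. Selection: pull records out of every source.
    return [dict(row, source=name) for name, rows in sources.items() for row in rows]


def preprocess(records):
    # 2. Preprocessing: delete erroneous (negative) amounts, predict missing ones
    #    with the mean of the known amounts.
    valid = [r for r in records if r["amount"] is None or r["amount"] >= 0]
    fill = mean(r["amount"] for r in valid if r["amount"] is not None)
    return [dict(r, amount=fill if r["amount"] is None else r["amount"]) for r in valid]


def transform(records):
    # 3. Transformation: min-max normalize the amounts into [0, 1].
    amounts = [r["amount"] for r in records]
    low, high = min(amounts), max(amounts)
    span = (high - low) or 1.0
    return [dict(r, scaled=(r["amount"] - low) / span) for r in records]


def mine(records):
    # 4. Data mining: a toy "algorithm" that labels each customer.
    return [(r["customer"], "high" if r["scaled"] >= 0.5 else "low") for r in records]


def evaluate(found, keep="high"):
    # 5. Evaluation: keep only the patterns considered interesting.
    return [p for p in found if p[1] == keep]


patterns = evaluate(mine(transform(preprocess(select(RAW_SOURCES)))))
print(patterns)
```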

22
Types of Data / What Kinds of Data Can Be Mined
  • Data mining should be applicable to any kind of
    information repository. However, algorithms and
    approaches may differ when applied to different
    types of data.
  • Relational databases
  • Data warehouses
  • Transaction databases
  • Advanced DB systems and information repositories:
    spatial databases, time-series data, multimedia
    databases, and the WWW

23
Relational Databases
  • A relational database consists of a set of
    tables containing either values of entity
    attributes, or values of attributes from entity
    relationships.
  • Tables have columns and rows, where columns
    represent attributes and rows represent tuples.
  • A tuple in a relational table corresponds to
    either an object or a relationship between
    objects and is identified by a set of attribute
    values representing a unique key.

24
Data Warehouse
  • A data warehouse, as a storehouse, is a repository
    of data collected from multiple data sources
    (often heterogeneous) and is intended to be used
    as a whole under the same unified schema. A data
    warehouse gives the option to analyze data from
    different sources under the same roof.
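
A minimal sketch of what "the same unified schema" can mean in practice: two sources with different field names and units are mapped into one common record layout before analysis. The source names, field names, and values are hypothetical.

```python
# Two heterogeneous sources: different field names and different units.
CRM_ROWS = [{"cust_name": "Asha", "total_usd": 120.0}]
WEB_ROWS = [{"user": "Ben", "amount_cents": 8000}]


def from_crm(row):
    # Map a CRM record into the unified schema (customer, amount, source).
    return {"customer": row["cust_name"], "amount": row["total_usd"], "source": "crm"}


def from_web(row):
    # Map a web record into the same schema, converting cents to dollars.
    return {"customer": row["user"], "amount": row["amount_cents"] / 100, "source": "web"}


# The "warehouse" is the combined store under one schema.
warehouse = [from_crm(r) for r in CRM_ROWS] + [from_web(r) for r in WEB_ROWS]

# Data from both sources can now be analyzed together.
total = sum(r["amount"] for r in warehouse)
print(warehouse, total)
```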

25
Transaction Databases
  • A transaction database is a set of records
    representing transactions, each with a time
    stamp, an identifier and a set of items.
    Associated with the transaction files could also
    be descriptive data for the items.
  • Transactions are usually stored in flat files or
    in two normalized transaction tables, one
    for the transactions and one for the transaction
    items.
  • Applications: airline reservation, railway
    reservation, log records, etc.
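
The two-table layout described above can be sketched with sqlite3: one table holds each transaction's identifier and time stamp, the other holds its items. The table names, column names, and sample rows are assumptions for illustration.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE transactions (trans_id INTEGER PRIMARY KEY, ts TEXT);
    CREATE TABLE transaction_items (trans_id INTEGER, item TEXT);
    INSERT INTO transactions VALUES
        (1, '2024-01-01T09:00'), (2, '2024-01-01T09:05'), (3, '2024-01-01T09:10');
    INSERT INTO transaction_items VALUES
        (1, 'bread'), (1, 'milk'), (2, 'bread'),
        (3, 'bread'), (3, 'milk'), (3, 'eggs');
    """
)

# Join the two tables to rebuild each transaction's set of items.
baskets = {}
rows = conn.execute(
    "SELECT t.trans_id, i.item FROM transactions t "
    "JOIN transaction_items i ON i.trans_id = t.trans_id "
    "ORDER BY t.trans_id, i.item"
)
for trans_id, item in rows:
    baskets.setdefault(trans_id, set()).add(item)
conn.close()

# Count in how many transactions each item appears.
item_counts = Counter(item for items in baskets.values() for item in items)
print(baskets, item_counts)
```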

26
MULTIMEDIA DATABASE
  • Multimedia databases include video, images,
    audio, sound clips, and text data. They can be
    stored in extended object-relational or
    object-oriented databases, or simply in a file
    system.
  • Ex: digital music players, social media,
    electronic publishing.

27
Spatial Databases
  • A spatial database is a database that is enhanced
    to store and access spatial data that defines a
    geometric space.
  • These data are often associated with geographic
    locations and features, or constructed features
    like cities. Data in spatial databases are stored
    as coordinates, points, lines, polygons, and
    topology.
  • Ex: storing geographical information such as maps
    and global or regional positioning.

28
Time Series Database
  • A time-series database is a database that
    contains data for each point in time.
  • Examples: weather data, stock market data,
    browser activity logs, ocean tides.
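
As a small illustration of time-stamped data, the sketch below stores one reading per point in time and smooths it with a moving average, a common first step when looking for trends. The hourly temperature readings and the window size are made up.

```python
def moving_average(values, window):
    """Return the average of each run of `window` consecutive values."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical hourly temperature readings: (time stamp, value).
readings = [("09:00", 20.0), ("10:00", 22.0), ("11:00", 24.0), ("12:00", 23.0)]
temperatures = [value for _, value in readings]

smoothed = moving_average(temperatures, window=2)
print(smoothed)
```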

29
Time Series Database-Example
30
World Wide Web
  • The World Wide Web is the most heterogeneous and
    dynamic repository available. 
  • Data in the World Wide Web is organized in
    inter-connected documents. These documents can be
    text, audio, video, raw data, and even
    applications. 

31
Typical Architecture of Data Mining System
32
Integration of a Data Mining System with a
Database/Data Warehouse System
  • The integration schemes are as follows:
  • No coupling: In this scheme, the data mining
    system does not utilize any of the database or
    data warehouse functions. It fetches the data
    directly from a particular source and processes
    that data using some data mining algorithms. The
    data mining result is stored in another file. (Ex:
    data collected directly from a transactional
    database.)
  • Loose coupling/semi-tight coupling: In this
    scheme, the data mining system may use some of
    the functions of the database and data warehouse
    system. It fetches the data from the data
    repository managed by these systems and performs
    data mining on that data, or fetches it directly from
    particular sources. (Ex: data taken from a
    transactional database or DWH.)
  • Tight coupling: In this scheme, the data mining
    system is smoothly integrated into the database
    or data warehouse system. The data mining
    subsystem is treated as one functional component
    of an information system.
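
A rough sketch of how the three schemes differ in where the work happens, using sqlite3 as a stand-in for the database system. Counting items stands in for a real mining algorithm, and the flat-file contents, the sales table, and the function names are assumptions. The tight-coupling function is only a loose analogy: it pushes the computation into the database engine rather than showing a true integrated mining subsystem.

```python
import csv
import io
import sqlite3
from collections import Counter

# Hypothetical flat file with one sold item per line.
FLAT_FILE = "item\nbread\nmilk\nbread\n"


def make_db():
    # Hypothetical database holding the same sales data.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (item TEXT)")
    conn.executemany("INSERT INTO sales VALUES (?)", [("bread",), ("milk",), ("bread",)])
    return conn


def no_coupling(flat_text):
    # No coupling: read the source directly; no database function is used.
    rows = csv.DictReader(io.StringIO(flat_text))
    return Counter(row["item"] for row in rows)


def loose_coupling(conn):
    # Loose coupling: use the database to fetch data, mine in the application.
    return Counter(item for (item,) in conn.execute("SELECT item FROM sales"))


def tight_coupling(conn):
    # Tight coupling (analogy): the counting runs inside the database engine.
    return Counter(dict(conn.execute("SELECT item, COUNT(*) FROM sales GROUP BY item")))


conn = make_db()
results = [no_coupling(FLAT_FILE), loose_coupling(conn), tight_coupling(conn)]
conn.close()
print(results)
```

All three functions produce the same counts; what changes is how much of the work is delegated to the database system.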

33
Integrated architecture of a Data Mining with
DWH/ AN OLAM SYSTEM ARCHITECTURE
34
Data Mining Task Primitives
  • We can specify a data mining task in the form of
    a data mining query.
  • This query is input to the system.
  • A data mining query is defined in terms of data
    mining task primitives.
  • Note - These primitives allow us to communicate
    in an interactive manner with the data mining
    system. Here is the list of Data Mining Task
    Primitives -
  • Kind of knowledge to be mined.
  • Set of task relevant data to be mined.
  • Representation for visualizing the discovered
    patterns.
  • Background knowledge to be used in discovery
    process.
  • Interestingness measures and thresholds for
    pattern evaluation.

35
Data Mining Task Primitives-Example of Data
mining query
  • use database AllElectronics_db
    use state_location_hierarchy for B.address
    mine characteristics as customerPurchasing
    analyze count%
    in relevance to C.age, I.type, I.place_made
    from customer C, item I, purchase P, items_sold S, branch B
    where I.item_ID = S.item_ID and P.cust_ID = C.cust_ID
      and P.method_paid = "AmEx" and B.address = "Canada"
      and I.price >= 100
    with noise threshold = 5%
    display as table

36
Data Mining Task Primitives-cont..
  • 1. Kind of knowledge to be mined
  • This refers to the kind of functions to be
    performed. These functions are:
  • Characterization
  • Association and Correlation Analysis
  • Classification
  • Prediction
  • Clustering
  • Outlier Analysis
  • 2. Set of task-relevant data to be mined
  • This is the portion of the database in which the user
    is interested. This portion includes the
    following:
  • Database attributes
  • Data warehouse dimensions of interest

37
Data Mining Task Primitives-cont..
  • 3. Representation for visualizing the discovered
    patterns
  • This refers to the form in which discovered
    patterns are to be displayed. These
    representations may include the following -
  • Rules
  • Tables
  • Charts
  • Graphs
  • Decision Trees
  • Cubes

38
Data Mining Task Primitives-cont..
  • 4. Background knowledge
  • Background knowledge allows data to be mined
    at multiple levels of abstraction. For example,
    concept hierarchies are one form of background
    knowledge that allows data to be mined at
    multiple levels of abstraction.
  • 5. Interestingness measures and thresholds for
    pattern evaluation
  • These are used to evaluate the patterns that are
    discovered by the process of knowledge discovery.
    There are different interestingness measures for
    different kinds of knowledge.
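
For association rules, two widely used interestingness measures are support and confidence. The sketch below computes both for one rule and checks them against thresholds. The transactions and the threshold values (0.4 and 0.6) are made up for illustration; other kinds of knowledge would call for other measures.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also
    contain the consequent. Returns 0.0 if the antecedent never occurs."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base


# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread"},
    {"bread", "milk", "eggs"},
    {"milk"},
]

# Rule under evaluation: bread => milk.
rule_support = support(transactions, {"bread", "milk"})
rule_confidence = confidence(transactions, {"bread"}, {"milk"})

# Thresholds decide whether the rule is kept (values are assumptions).
is_interesting = rule_support >= 0.4 and rule_confidence >= 0.6
print(rule_support, rule_confidence, is_interesting)
```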

39
Classification of Data Mining Systems
40
Classification of Data Mining Systems (Cont.)
  • Data to be mined
  • Relational, data warehouse, transactional,
    stream, object-oriented/relational, active,
    spatial, time-series, text, multi-media,
    heterogeneous, legacy, WWW
  • Knowledge to be mined
  • Characterization, discrimination, association,
    classification, clustering, trend/deviation,
    outlier analysis, etc.
  • Multiple/integrated functions and mining at
    multiple levels
  • Techniques utilized
  • Database-oriented, data warehouse (OLAP), machine
    learning, statistics, visualization, etc.
  • Applications adapted
  • Retail, telecommunication, banking, fraud
    analysis, bio-data mining, stock market analysis,
    text mining, Web mining, etc.