Knowledge Engineering - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Knowledge Engineering


1

 
Knowledge Engineering

Data mining
2
We are deluged by data!
scientific data, medical data, demographic
data, financial data, and marketing data.
People have no time to look at this data!
⇒ we must find a tool to automatically
analyze the data, classify it, summarize it,
discover and characterize trends in it, and
flag anomalies.
This magic tool is "Data Mining".
3
The data explosion
Increase in use of electronic data gathering
devices e.g. point-of-sale, remote sensing
devices etc. Data storage became easier and
cheaper with increasing computing power
4
What is Data Mining?
< Definition >
the non-trivial extraction of implicit, previously
unknown, and potentially useful information from
data
OR
the variety of techniques used to identify nuggets of
information or decision-making knowledge in
bodies of data, and to extract these in such a
way that they can be put to use in areas such
as decision support, prediction, forecasting, and
estimation. The data is often voluminous but, as
it stands, of low value, since no direct use can be
made of it; it is the hidden information in the
data that is useful
OR
the extraction of hidden predictive information from
large databases
5
Data Mining and DBMS
DBMS
Queries based on the data held, e.g.
  • last month's sales for each product
  • sales grouped by customer age
  • list of customers who lapsed their policy
Data Mining
Infer knowledge from the data held to answer
queries, e.g.
  • what characteristics do customers who lapsed
    their policies share, and how do they differ from
    those who renewed their policies?
  • why is the Cleveland division so profitable?
6
Characteristics of a Data Mining System
  • Large quantities of data
  • volume of data so great it has to be analyzed by
    automated techniques e.g. satellite information,
    credit card transactions etc.
  • Noisy, incomplete data
  • imprecise data is characteristic of all data
    collection
  • databases - usually contaminated by errors,
    cannot assume that the data they contain is
    entirely correct e.g. some attributes rely on
    subjective or measurement judgments
  • Complex data structure - conventional statistical
    analysis not possible
  • Heterogeneous data stored in legacy systems

7
Data Mining Goals
  • Classification
  • Association
  • Sequence / Temporal analysis
  • Cluster outlier analysis

8
Data Mining and Machine Learning
  • Data Mining, or Knowledge Discovery in Databases
    (KDD), is about finding understandable knowledge;
    Machine Learning is concerned with improving the
    performance of an agent, e.g. training a neural
    network to balance a pole is part of ML, but not
    of KDD
  • DM is concerned with very large, real-world
    databases; ML typically looks at smaller data sets
  • DM deals with real-world data, which tends to have
    problems such as missing values, dynamic data,
    noise, and pre-existing data; ML has
    laboratory-type examples for the training set
  • Efficiency and scalability of the algorithm are
    more important in DM or KDD
9
Issues in Data Mining
  • Noisy data
  • Missing values
  • Static data
  • Sparse data
  • Dynamic data
  • Relevance
  • Interestingness
  • Heterogeneity
  • Algorithm efficiency
  • Size and complexity of data

10
Data Mining Process
  • Data pre-processing
  • heterogeneity resolution
  • data cleansing
  • data warehousing
  • Data Mining Tools applied
  • extraction of patterns from the pre-processed
    data
  • Interpretation and evaluation
  • user bias i.e. can direct DM tools to areas of
    interest
  • attributes of interest in databases
  • goal of discovery
  • domain knowledge
  • prior knowledge or belief about the domain
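The pre-process → mine → interpret pipeline above can be sketched in a few lines. This is a minimal illustration only; the function names, the toy records, and the "drop rows with missing values" cleansing rule are all assumptions, not a specific tool's behavior:

```python
# Minimal sketch of the three-stage data mining process described above.
# All names and the toy cleansing rule are illustrative assumptions.

def preprocess(records):
    """Data pre-processing: drop records with missing values (data cleansing)."""
    return [r for r in records if all(v is not None for v in r.values())]

def mine(records):
    """Pattern extraction: count how often each (attribute, value) pair occurs."""
    counts = {}
    for r in records:
        for key, value in r.items():
            counts[(key, value)] = counts.get((key, value), 0) + 1
    return counts

def interpret(patterns, min_support):
    """Interpretation/evaluation: keep only patterns frequent enough to be interesting."""
    return {p: n for p, n in patterns.items() if n >= min_support}

raw = [{"age": 30, "plan": "gold"}, {"age": None, "plan": "gold"},
       {"age": 41, "plan": "gold"}, {"age": 29, "plan": "basic"}]
patterns = interpret(mine(preprocess(raw)), min_support=2)
print(patterns)  # {('plan', 'gold'): 2}
```

The `min_support` argument plays the role of user bias: it directs the tool toward patterns the user considers interesting.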

11
Techniques
  • Object-oriented database methods
  • Statistics
  • Clustering
  • Visualization
  • Neural networks
  • Rule Induction

12
Techniques
  • Object-oriented approaches/Databases
  • Making use of DBMSs to discover knowledge; SQL
    alone is limiting
  • Advantages
  • Easier maintenance. Objects may be understood as
    stand-alone entities
  • Objects are appropriate reusable components
  • For some systems, there may be an obvious mapping
    from real world entities to system objects

13
Techniques
  • Statistics
  • Can be used in several data mining stages
  • data cleansing i.e. the removal of erroneous or
    irrelevant data known as outliers
  • EDA, exploratory data analysis e.g. frequency
    counts, histograms etc.
  • data selection - sampling facilities that reduce
    the scale of computation
  • attribute re-definition e.g. Body Mass Index,
    BMI, which is Weight/Height²
  • data analysis - measures of association and
    relationships between attributes, interestingness
    of rules, classification etc.
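Two of these stages, attribute re-definition (BMI) and data cleansing (outlier removal), can be sketched as follows. The records, the attribute names, and the one-standard-deviation cutoff (deliberately aggressive for such a tiny sample) are all illustrative assumptions:

```python
import statistics

# Illustrative records; the attribute names are assumptions for this sketch.
people = [
    {"weight_kg": 70.0, "height_m": 1.75},
    {"weight_kg": 82.0, "height_m": 1.80},
    {"weight_kg": 500.0, "height_m": 1.70},  # an erroneous outlier
    {"weight_kg": 61.0, "height_m": 1.65},
]

# Attribute re-definition: BMI = Weight / Height²
for p in people:
    p["bmi"] = p["weight_kg"] / p["height_m"] ** 2

# Data cleansing: remove outliers, here anything more than one
# standard deviation from the mean (aggressive, for this tiny sample)
bmis = [p["bmi"] for p in people]
mean, sd = statistics.mean(bmis), statistics.stdev(bmis)
clean = [p for p in people if abs(p["bmi"] - mean) <= sd]

# EDA: a frequency count of BMI rounded to the nearest integer
freq = {}
for p in clean:
    freq[round(p["bmi"])] = freq.get(round(p["bmi"]), 0) + 1
print(len(clean), freq)  # 3 {23: 1, 25: 1, 22: 1}
```

The erroneous 500 kg record produces a BMI near 173 and is dropped; the three plausible records survive into the frequency count.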

14
Techniques
  • Visualization

Visualization enhances EDA and makes patterns
more visible: 1-D, 2-D, and 3-D visualizations.
Example: NETMAP, a commercial data mining tool,
uses this technique
15
Techniques
  • Cluster outlier analysis
  • Clustering according to similarity.
  • Partitioning the database so that each partition
    or group is similar according to some criterion or
    metric.
  • Appears in many disciplines, e.g. in chemistry the
    clustering of molecules
  • Data mining applications make use of it, e.g. to
    segment a client/customer base.
  • Provides sub-groups of a population for further
    analysis or action - very important when dealing
    with very large databases
  • Can be used for profile generation for target
    marketing
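"Partitioning the database so that each group is similar according to some metric" can be sketched with a tiny one-dimensional k-means, using distance to the group mean as the metric. The customer ages and the choice of k are illustrative assumptions:

```python
import random

# Minimal 1-D k-means sketch: partition points into k groups so each
# point sits near its group's mean. Data and k are illustrative.

def kmeans_1d(points, k, iterations=20, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iterations):
        # Assign each point to its nearest center
        groups = [[] for _ in range(k)]
        for x in points:
            nearest = min(range(k), key=lambda i: abs(x - centers[i]))
            groups[nearest].append(x)
        # Move each center to the mean of its group
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return sorted(centers)

# Two obvious sub-groups: customer ages around 25 and around 60
ages = [22, 24, 25, 27, 58, 60, 61, 63]
print(kmeans_1d(ages, k=2))  # [24.5, 60.5]
```

The two recovered centers identify the sub-groups of the customer base that could then be profiled for target marketing.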

16
Techniques
  • Artificial Neural Networks (ANN)
  • A trained ANN can be thought of as an "expert"
    in the category of information it has been given
    to analyze.
  • It provides projections given new situations of
    interest and answers "what if" questions.
  • Problems include
  • the resulting network is viewed as a black box
  • no explanation of the results is given i.e.
    difficult for the user to interpret the results
  • difficult to incorporate user intervention
  • slow to train due to their iterative nature
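The smallest possible sketch of such a network is a single neuron trained by the classic perceptron rule. Everything here (the AND task, learning rate, epoch count) is an illustrative assumption, and the iterative weight-update loop is exactly what makes training slow at scale:

```python
# Train a single artificial neuron (a perceptron) on the AND function.
# The task and hyperparameters are illustrative assumptions.

def train_perceptron(samples, epochs=10, lr=0.1):
    w0, w1, bias = 0.0, 0.0, 0.0
    for _ in range(epochs):                      # iterative training
        for (x0, x1), target in samples:
            out = 1 if w0 * x0 + w1 * x1 + bias > 0 else 0
            err = target - out                   # perceptron update rule
            w0 += lr * err * x0
            w1 += lr * err * x1
            bias += lr * err
    return w0, w1, bias

def predict(weights, x0, x1):
    w0, w1, bias = weights
    return 1 if w0 * x0 + w1 * x1 + bias > 0 else 0

and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights = train_perceptron(and_samples)
print([predict(weights, a, b) for (a, b), _ in and_samples])  # [0, 0, 0, 1]
```

Note the black-box problem in miniature: the three learned numbers classify correctly, but offer the user no explanation of why.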

17
Techniques
Artificial Neural Networks (ANN): a data mining
example using neural networks.
18
Techniques
  • Decision trees
  • Built using a training set of data and can then
    be used to classify new objects
  • Description
  • internal node is a test on an attribute
  • branch represents an outcome of the test, e.g.
    Color = red
  • leaf node represents a class label or class
    label distribution
  • At each node, one attribute is chosen to split
    training examples into distinct classes as much
    as possible
  • a new case is classified by following a matching
    path to a leaf node

19
Techniques
  • Decision trees
  • Building a decision tree
  • Top-down tree construction
  • At start, all training examples are at the root.
  • Partition the examples recursively by choosing
    one attribute each time.
  • Bottom-up tree pruning
  • Remove sub-trees or branches, in a bottom-up
    manner, to improve the estimated accuracy on new
    cases
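"Choosing one attribute each time" is typically done by information gain, as in ID3-style top-down construction: pick the attribute whose split leaves the class labels least mixed (lowest entropy). The toy weather records below are illustrative assumptions:

```python
import math
from collections import Counter

# Pick the splitting attribute with the highest information gain,
# as in top-down decision-tree construction. Toy data is illustrative.

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(rows, attr, label="play"):
    """Entropy reduction achieved by splitting rows on attr."""
    base = entropy([r[label] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[label] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

rows = [
    {"outlook": "sunny",    "windy": False, "play": "no"},
    {"outlook": "sunny",    "windy": True,  "play": "no"},
    {"outlook": "overcast", "windy": False, "play": "yes"},
    {"outlook": "rain",     "windy": False, "play": "yes"},
    {"outlook": "rain",     "windy": True,  "play": "no"},
]
best = max(["outlook", "windy"], key=lambda a: information_gain(rows, a))
print(best)  # outlook
```

Here splitting on outlook isolates the pure "sunny → no" and "overcast → yes" groups, so it wins over windy and becomes the root test.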

20
Techniques
  • Decision trees

Example: Decision Tree for Play?
  • Outlook = sunny → test Humidity: high → No,
    normal → Yes
  • Outlook = overcast → Yes
  • Outlook = rain → test Windy: true → No,
    false → Yes
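The tree above can be encoded directly as nested dictionaries and a new case classified by following the matching path to a leaf, as the slide describes. The dict encoding is a sketch of ours, not any particular tool's format:

```python
# The "Play?" decision tree encoded as nested dicts: internal nodes
# test an attribute, leaves hold the class label.
tree = {
    "attr": "outlook",
    "branches": {
        "sunny":    {"attr": "humidity",
                     "branches": {"high": "No", "normal": "Yes"}},
        "overcast": "Yes",
        "rain":     {"attr": "windy",
                     "branches": {"true": "No", "false": "Yes"}},
    },
}

def classify(node, case):
    """Follow the matching path from the root down to a leaf node."""
    while isinstance(node, dict):           # still at an internal node
        node = node["branches"][case[node["attr"]]]
    return node                              # leaf: the class label

print(classify(tree, {"outlook": "sunny", "humidity": "high"}))  # No
print(classify(tree, {"outlook": "rain", "windy": "false"}))     # Yes
```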
21
Techniques
  • Rules

The extraction of useful if-then rules from data
based on statistical significance.
Example format: If X Then Y
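A rule of this If X Then Y form can be represented as a condition/conclusion pair and applied to a record. The insurance-style rules and the customer records are made-up illustrations:

```python
# If-then rules as (condition, conclusion) pairs. The sample rules
# and customer records are illustrative assumptions.
rules = [
    (lambda c: c["age"] < 25 and c["claims"] >= 2, "high_risk"),
    (lambda c: c["claims"] == 0,                   "low_risk"),
]

def apply_rules(customer):
    """Return the Then-part of every rule whose If-part matches."""
    return [conclusion for condition, conclusion in rules if condition(customer)]

print(apply_rules({"age": 22, "claims": 3}))  # ['high_risk']
print(apply_rules({"age": 40, "claims": 0}))  # ['low_risk']
```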
22
Techniques
  • Frames
  • Frames are templates for holding clusters of
    related knowledge about a very particular subject.
  • They are a natural way to represent knowledge.
  • They take a taxonomy approach.
  • Problem: they are more complex than rule
    representation.

23
Techniques
  • Frames
  • Example

24
Data Warehousing
Definition: Any centralized data repository
which can be queried for business benefit.
  • Warehousing makes it possible to
  • extract archived operational data .
  • overcome inconsistencies between different legacy
    data formats .
  • integrate data throughout an enterprise,
    regardless of location, format, or communication
    requirements .
  • incorporate additional or expert information .

25
Characteristics of a Data Warehouse
  • subject-oriented - data organized by subject
    instead of application e.g.
  • an insurance company would organize their data by
    customer, premium, and claim, instead of by
    different products (auto, life, etc.)
  • contains only the information necessary for
    decision support processing
  • integrated - encoding of data is often
    inconsistent e.g.
  • gender might be coded as "m" and "f" or 0 and 1
    but when data are moved from the operational
    environment into the data warehouse they assume a
    consistent coding convention
  • time-variant - the data warehouse contains a
    place for storing data that are five to 10 years
    old, or older e.g.
  • this data is used for comparisons, trends, and
    forecasting
  • these data are not updated
  • non-volatile - data are not updated or changed in
    any way once they enter the data warehouse
  • data are only loaded and accessed
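The "integrated" property above, i.e. moving data to a single consistent coding convention on load, can be sketched as follows. The gender example comes from the slide; the mapping table and record layout are illustrative assumptions:

```python
# Encoding integration on load into the warehouse: gender arrives as
# "m"/"f" from one operational system and 0/1 from another, and is
# mapped to one convention. The mapping itself is an assumption.
GENDER_MAP = {"m": "M", "f": "F", 0: "M", 1: "F", "M": "M", "F": "F"}

def integrate(record):
    """Return a copy of the record using the warehouse coding convention."""
    cleaned = dict(record)
    cleaned["gender"] = GENDER_MAP[record["gender"]]
    return cleaned

operational = [{"id": 1, "gender": "m"}, {"id": 2, "gender": 1}]
warehouse = [integrate(r) for r in operational]
print([r["gender"] for r in warehouse])  # ['M', 'F']
```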

26
Data Warehousing Processes
  • insulate data - i.e. the current operational
    information
  • preserves the security and integrity of
    mission-critical OLTP applications
  • gives access to the broadest possible base of
    data
  • retrieve data - from a variety of heterogeneous
    operational databases
  • data is transformed and delivered to the data
    warehouse/store based on a selected model (or
    mapping definition)
  • metadata - information describing the model and
    definition of the source data elements
  • data cleansing - removal of certain aspects of
    operational data, such as low-level transaction
    information, which slow down the query times.
  • transfer - processed data transferred to the data
    warehouse, a large database on a high performance
    box

27
Data warehouse Architecture
28
Criteria for Data Warehouses
1. Load Performance - requires incremental loading
of new data on a periodic basis; must not
artificially constrain the volume of data
2. Load Processing - data conversions, filtering,
reformatting, integrity checks, physical storage,
indexing, and metadata update
3. Data Quality Management - ensure local
consistency, global consistency, and referential
integrity despite "dirty" sources and massive
database size
4. Query Performance - must not be slowed or
inhibited by the performance of the data
warehouse RDBMS
5. Terabyte Scalability - data warehouse sizes are
growing at astonishing rates, so the RDBMS must
have no architectural limitations; it must support
modular and parallel management
29
Criteria for Data Warehouses
6. Mass User Scalability - access to warehouse
data must not be limited to an elite few; must
support hundreds, even thousands, of concurrent
users while maintaining acceptable query
performance
7. Networked Data Warehouse - data warehouses
rarely exist in isolation; users must be able to
look at and work with multiple warehouses from a
single client workstation
8. Warehouse Administration - the large scale and
time-cyclic nature of the data warehouse demands
administrative ease and flexibility
9. The RDBMS must Integrate Dimensional Analysis -
dimensional support must be inherent in the
warehouse RDBMS to provide the highest performance
for relational OLAP tools
10. Advanced Query Functionality - end users
require advanced analytic calculations, sequential
and comparative analysis, and consistent access to
detailed and summarized data
30
Problems with Data Warehousing
The rush of companies to jump on the bandwagon:
these companies have slapped "data warehouse"
labels on traditional transaction-processing
products and co-opted the lexicon of the
industry in order to be considered players in
this fast-growing category. - Chris Erickson,
Red Brick
31
Data Warehousing and OLTP
32
OLTP Systems
Designed to maximize transaction capacity. But
they
  • cannot be repositories of facts and historical
    data for business analysis
  • cannot quickly answer ad hoc queries
  • rapid retrieval is almost impossible
  • data is inconsistent and changing, duplicate
    entries exist,
  • entries can be missing
  • OLTP offers large amounts of raw data which is
    not easily understood
  • Typical OLTP query is a simple aggregation e.g.
  • what is the current account balance for this
    customer?
  • Data Warehouse Systems
  • Data warehouses are interested in query
    processing
  • as opposed to transaction processing

33
OLAP On-Line Analytical Processing
  • The problem is how to process larger and larger
    databases; OLAP involves many data items (many
    thousands or even millions) which are involved in
    complex relationships.
  • Fast response is crucial in OLAP.
  • Difference between OLAP and OLTP
  • OLTP servers handle mission-critical production
    data accessed through simple queries
  • OLAP servers handle management-critical data
    accessed through an iterative analytical
    investigation

34
The end
We hope you enjoyed it.
Thanks!