PPT – DATA, TEXT, PowerPoint presentation | free to download - id: 5bf538-MTdkM

Transcript and Presenter's Notes

Chapter 7

- DATA, TEXT,
- AND WEB MINING

Learning Objectives

- Define data mining and list its objectives and benefits
- Understand different purposes and applications of data mining
- Understand different methods of data mining, especially clustering and decision tree models
- Build expertise in use of some data mining software

Learning Objectives

- Learn the process of data mining projects
- Understand data mining pitfalls and myths
- Define text mining and its objectives and benefits
- Appreciate use of text mining in business applications
- Define Web mining and its objectives and benefits

Data Mining Concepts and Applications

- Six factors behind the sudden rise in popularity of data mining
  - General recognition of the untapped value in large databases
  - Consolidation of database records tending toward a single customer view
  - Consolidation of databases, including the concept of an information warehouse
  - Reduction in the cost of data storage and processing, providing for the ability to collect and accumulate data
  - Intense competition for a customer's attention in an increasingly saturated marketplace
  - The movement toward the de-massification of business practices

Data Mining Concepts and Applications

- Data mining (DM)
  - A process that uses statistical, mathematical, artificial intelligence and machine-learning techniques to extract and identify useful information and subsequent knowledge from large databases

Data Mining Concepts and Applications

- Major characteristics and objectives of data mining
  - Data are often buried deep within very large databases, which sometimes contain data from several years; sometimes the data are cleansed and consolidated in a data warehouse
  - The data mining environment is usually a client/server architecture or a Web-based architecture

Data Mining Concepts and Applications

- Major characteristics and objectives of data mining
  - Sophisticated new tools help to remove the information ore buried in corporate files or archival public records; finding it involves massaging and synchronizing the data to get the right results
  - The miner is often an end user, empowered by data drills and other power query tools to ask ad hoc questions and obtain answers quickly, with little or no programming skill

Data Mining Concepts and Applications

- Major characteristics and objectives of data mining
  - Striking it rich often involves finding an unexpected result and requires end users to think creatively
  - Data mining tools are readily combined with spreadsheets and other software development tools; the mined data can be analyzed and processed quickly and easily
  - Parallel processing is sometimes used because of the large amounts of data and massive search efforts

Data Mining Concepts and Applications

- How data mining works
  - Data mining tools find patterns in data and may even infer rules from them
  - Three methods are used to identify patterns in data
    - Simple models
    - Intermediate models
    - Complex models

Data Mining Concepts and Applications

- Classification
  - Supervised induction used to analyze the historical data stored in a database and to automatically generate a model that can predict future behavior
  - Common tools used for classification are
    - Neural networks
    - Decision trees
    - If-then-else rules

Data Mining Concepts and Applications

- Clustering
  - Cluster analysis is an exploratory data analysis tool that aims at sorting different objects into groups so that the degree of association between two objects is maximal if they belong to the same group and minimal otherwise
  - Cluster analysis simply discovers structures in data without explaining why they exist
  - The term cluster analysis (first used by Tryon, 1939) encompasses a number of different algorithms and methods for grouping objects of similar kind into respective categories
  - Example: people and animal classification
  - Methods: Joining (Tree Clustering), Two-way Joining (Block Clustering), and k-Means Clustering

Data Mining Concepts and Applications

- k-Means Clustering: the k-means method will produce exactly k different clusters of greatest possible distinction
- Algorithm
  - Given a set of observations $(x_1, x_2, \ldots, x_n)$, where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets ($k \le n$), $S = \{S_1, S_2, \ldots, S_k\}$, so as to minimize the within-cluster sum of squares (WCSS):

    $$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$

    where $\mu_i$ is the mean of points in $S_i$
  - See paper.

Data Mining Concepts and Applications

- 1) k initial "means" (in this case k = 3) are randomly generated within the data domain (shown in color).
- 2) k clusters are created by associating every observation with the nearest mean. The partitions here represent the Voronoi diagram generated by the means.
- 3) The centroid of each of the k clusters becomes the new mean.
- 4) Steps 2 and 3 are repeated until convergence has been reached.
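
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the function names `kmeans` and `wcss`, the sample points, and the choice to pass the initial means in explicitly (instead of generating them randomly, so the run is reproducible) are all assumptions made for the example.

```python
import math

def kmeans(points, means, max_iter=100):
    """Alternate assignment and centroid update until the means stop moving."""
    clusters = []
    for _ in range(max_iter):
        # Step 2: associate every observation with its nearest mean.
        clusters = [[] for _ in means]
        for p in points:
            nearest = min(range(len(means)), key=lambda i: math.dist(p, means[i]))
            clusters[nearest].append(p)
        # Step 3: the centroid of each cluster becomes the new mean
        # (an empty cluster keeps its previous mean).
        new_means = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else m
            for cluster, m in zip(clusters, means)
        ]
        # Step 4: stop once the means no longer change.
        if new_means == means:
            break
        means = new_means
    return means, clusters

def wcss(means, clusters):
    """Within-cluster sum of squares for a given partition."""
    return sum(math.dist(p, m) ** 2 for m, cluster in zip(means, clusters) for p in cluster)

# Hypothetical 2-D observations forming two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
means, clusters = kmeans(points, means=[(0.0, 0.0), (10.0, 10.0)])
total = wcss(means, clusters)
```

With these starting means the procedure converges after one update, leaving the first three points in one cluster and the last three in the other.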

Data Mining Concepts and Applications

- EM clustering on an artificial dataset ("mouse"). The tendency of k-means to produce equi-sized clusters leads to bad results, while EM benefits from the Gaussian distribution present in the data set

Data Mining Concepts and Applications

- EM (Expectation Maximization) clustering: used to detect clusters in observations (or variables) and to assign those observations to the clusters
- A typical example application: a number of consumer-behavior-related variables are measured for a large sample of respondents

Data Mining Concepts and Applications

- Association
  - A category of data mining algorithm that establishes relationships about items that occur together in a given record
  - These powerful exploratory techniques have a wide range of applications in many areas of business practice and research, from the analysis of consumer preferences or human resource management to the history of language
  - These techniques enable analysts and researchers to uncover hidden patterns in large data sets, such as "customers who order product A often also order product B or C" or "employees who said positive things about initiative X also frequently complain about issue Y but are happy with issue Z"
  - For example: if (Car = Porsche and Gender = Male and Age < 20) then (Risk = High and Insurance = High). Another example is book store recommendation
  - The implementation of the so-called a-priori algorithm (see Agrawal and Swami, 1993; Agrawal and Srikant, 1994; Han and Lakshmanan, 2001; see also Witten and Frank, 2000) allows us to rapidly process huge data sets for such associations, based on predefined "threshold" values for detection
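
To make the role of the "threshold" concrete, the following is a small level-wise frequent-itemset search in the spirit of the a-priori algorithm. It is a sketch under assumptions: the function names, the five toy transactions, and the 0.5 minimum-support threshold are invented for illustration and are not taken from the cited papers.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset meeting the support threshold."""
    n = len(transactions)
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    while candidates:
        # Keep only the candidates whose support reaches the threshold.
        level = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t) / n
            if support >= min_support:
                level[c] = support
        frequent.update(level)
        # Build the next, larger candidates from this level and prune any
        # candidate that has an infrequent subset.
        size = len(next(iter(level))) + 1 if level else 0
        candidates = {
            a | b
            for a in level
            for b in level
            if len(a | b) == size
            and all(frozenset(s) in level for s in combinations(a | b, size - 1))
        }
    return frequent

def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    return frequent[antecedent | consequent] / frequent[antecedent]

# Hypothetical market baskets.
transactions = [{"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"B", "C"}, {"A", "B"}]
frequent = frequent_itemsets(transactions, min_support=0.5)
rule_conf = confidence(frequent, frozenset({"A"}), frozenset({"B"}))
```

Here the pair {A, B} is the only itemset larger than one item that survives the threshold, and the rule "customers who order A also order B" holds in three of the four baskets that contain A.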

Data Mining Concepts and Applications

- Association
  - Sequence analysis. Sequence analysis is concerned with a subsequent purchase of a product or products given a previous buy. For instance, buying an extended warranty is more likely to follow (in that specific sequential order) the purchase of a TV or other electric appliances. Sequence rules, however, are not always that obvious, and sequence analysis helps you to extract such rules no matter how hidden they may be in your market basket data.
  - Link analysis. In retailing or marketing, knowledge of purchase "patterns" can help with the direct marketing of special offers to the "right" or "ready" customers (i.e., those who, according to the rules, are most likely to purchase specific items given their observed past consumption patterns). The term "link analysis" is often used when these techniques, for extracting sequential or non-sequential association rules, are applied to organize complex "evidence." It is easy to see how the "transactions" or "shopping basket" metaphor can be applied to situations where individuals engage in certain actions, open accounts, contact other specific individuals, and so on.
  - Unique data analysis requirements. Crosstabulation tables, and in particular Multiple Response tables

Data Mining Concepts and Applications

- Visualization can be used in conjunction with data mining to gain a clearer understanding of many underlying relationships

Data Mining Concepts and Applications

- a-priori algorithm
- See paper.

Data Mining Concepts and Applications

- Regression is a well-known statistical technique that is used to map data to a prediction value
- Forecasting estimates future values based on patterns within large sets of data
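
As one concrete instance of mapping data to a prediction value, the sketch below fits a straight line by ordinary least squares and uses it to forecast the next value. The helper name `fit_line` and the four data points are hypothetical; real forecasting problems usually need more than a single linear trend.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: period number and the value observed in that period.
periods = [1, 2, 3, 4]
values = [3, 5, 7, 9]

slope, intercept = fit_line(periods, values)
forecast = slope * 5 + intercept  # predicted value for period 5
```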

Data Mining Concepts and Applications

- Hypothesis-driven data mining
  - Begins with a proposition by the user, who then seeks to validate the truthfulness of the proposition
- Discovery-driven data mining
  - Finds patterns, associations, and relationships among the data in order to uncover facts that were previously unknown or not even contemplated by an organization

Data Mining Concepts and Applications

Data mining applications

- Marketing
- Banking
- Retailing and sales
- Manufacturing and production
- Brokerage and securities trading
- Insurance

- Computer hardware and software
- Government and defense
- Airlines
- Health care
- Broadcasting
- Police
- Homeland security

Data Mining Techniques and Tools

- Data mining tools and techniques can be classified based on the structure of the data and the algorithms used
  - Statistical methods
  - Decision trees
    - Defined as a root followed by internal nodes. Each node (including the root) is labeled with a question, and the arcs associated with each node cover all possible responses

Data Mining Techniques and Tools

- Data mining tools and techniques can be classified based on the structure of the data and the algorithms used
  - Case-based reasoning
  - Neural computing
  - Intelligent agents
  - Genetic algorithms
  - Other tools
    - Rule induction
    - Data visualization

Data Mining Techniques and Tools

- A general algorithm for building a decision tree
  - Create a root node and select a splitting attribute.
  - Add a branch to the root node for each split candidate value and label it.
  - Take the following iterative steps
    - Classify data by applying the split value.
    - If a stopping point is reached, then create a leaf node and label it. Otherwise, build another subtree.
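
The general algorithm can be sketched as a short recursive function. Everything specific here is an assumption for illustration: the names `build_tree` and `classify`, the toy loan records, and in particular the placeholder rule of always splitting on the first remaining attribute, where a real builder would choose the attribute with a purity measure.

```python
from collections import Counter

def build_tree(rows, attributes, target):
    """Recursively build a tree of nested dicts; leaves are class labels."""
    labels = [row[target] for row in rows]
    # Stopping point: the node is pure or no attributes remain,
    # so create a leaf labelled with the majority class.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attribute = attributes[0]  # placeholder selection rule (assumption)
    node = {"attribute": attribute, "branches": {}}
    # Add one branch per observed value, then build a subtree for each branch.
    for value in sorted({row[attribute] for row in rows}):
        subset = [row for row in rows if row[attribute] == value]
        node["branches"][value] = build_tree(subset, attributes[1:], target)
    return node

def classify(tree, row):
    """Follow the branches matching the row until a leaf label is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["attribute"]]]
    return tree

# Hypothetical training records.
rows = [
    {"employed": "yes", "owns_home": "yes", "risk": "Low"},
    {"employed": "yes", "owns_home": "no", "risk": "Low"},
    {"employed": "no", "owns_home": "yes", "risk": "Low"},
    {"employed": "no", "owns_home": "no", "risk": "High"},
]
tree = build_tree(rows, ["employed", "owns_home"], "risk")
prediction = classify(tree, {"employed": "no", "owns_home": "no"})
```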

Data Mining Techniques and Tools

- Gini index
  - Used in economics to measure the diversity of the population. The same concept can be used to determine the purity of a specific class as a result of a decision to branch along a particular attribute/variable
  - Formula
    - $Gini(S) = 1 - \sum_{j=1}^{n} p_j^2$
    - where S is a data set that contains examples from n classes
    - $p_j$ is the relative frequency of class j in S
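
The formula translates directly into code. In this small sketch the function name `gini` and the label lists are illustrative assumptions, and exact fractions are used so the results can be read as ratios.

```python
from collections import Counter
from fractions import Fraction

def gini(labels):
    """Gini(S) = 1 - sum of squared relative class frequencies in S."""
    n = len(labels)
    return 1 - sum(Fraction(count, n) ** 2 for count in Counter(labels).values())

pure = gini(["High", "High", "High"])  # a single class: perfectly pure
mixed = gini(["High", "Low"])          # two classes in equal shares
```

A pure set scores 0, and an even two-class mix scores 1/2, the largest value possible with two classes.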

Data Mining Techniques and Tools

- Example

Sample patterns for training a decision tree to predict loan risk

Pattern  Income  Credit Rating  Loan Risk
0        23      High           High
1        17      Low            High
2        43      Low            High
3        68      High           Low
4        32      Moderate       Low
5        20      High           High

There are only two classes, High and Low. If the data set S has p High and n Low elements, then the Gini computation is as follows

Data Mining Techniques and Tools

- $p_{High} = p/(p + n)$, $p_{Low} = n/(p + n)$
- $Gini(S) = 1 - p_{High}^2 - p_{Low}^2$
- If data set S is split into $S_1$ and $S_2$, the splitting index is defined as follows
- $Gini_{SPLIT}(S) = \frac{p_1 + n_1}{p + n} Gini(S_1) + \frac{p_2 + n_2}{p + n} Gini(S_2)$
- where $p_1$, $n_1$ ($p_2$, $n_2$) denote the $p_1$ High elements and $n_1$ Low elements in the data set $S_1$ ($S_2$)
- In this definition, the best split point is the one with the lowest value of the $Gini_{SPLIT}$ index. For our example, reorder the data according to income

Income  Pattern  Loan Risk
17      1        High
20      5        High
23      0        High
32      4        Low
43      2        High
68      3        Low

Data Mining Techniques and Tools

- Possible values of a split point for the Income attribute are Income ≤ 17, Income ≤ 20, Income ≤ 23, Income ≤ 32, Income ≤ 43, and Income ≤ 68.
- Now we can compute the Gini index for each of these levels of splits
- Consider the choice of dividing the data at Income ≤ 17. We have the following classification counts

Pattern Count  High  Low
Income ≤ 17    1     0
Income > 17    3     2

So the Gini index for Income ≤ 17 and Income > 17 will be

$G(\text{Income} \le 17) = 1 - [(\text{proportion of records with High risk})^2 + (\text{proportion of records with Low risk})^2] = 1 - (1^2 + 0^2) = 0$

Similarly, $G(\text{Income} > 17) = 1 - ((3/5)^2 + (2/5)^2) = 12/25$

Data Mining Techniques and Tools

- Gini index for the split choice is computed as follows
- $G_{SPLIT} = (\text{proportion of records at Income} \le 17) \times G(\text{Income} \le 17) + (\text{proportion of records at Income} > 17) \times G(\text{Income} > 17)$
- That is, $G_{SPLIT} = (1/6) \times 0 + (5/6) \times (12/25) = 2/5$.
- Now consider the choice Income ≤ 20.

Pattern Count  High  Low
Income ≤ 20    2     0
Income > 20    2     2

So the Gini index for Income ≤ 20 and Income > 20 will be

$G(\text{Income} \le 20) = 1 - (1^2 + 0^2) = 0$

$G(\text{Income} > 20) = 1 - ((2/4)^2 + (2/4)^2) = 1/2$

$G_{SPLIT} = (2/6) \times 0 + (4/6) \times (1/2) = 1/3$

Data Mining Techniques and Tools

- For the choice of a split at Income ≤ 23

Pattern Count  High  Low
Income ≤ 23    3     0
Income > 23    1     2

$G(\text{Income} \le 23) = 1 - (1^2 + 0^2) = 0$

$G(\text{Income} > 23) = 1 - ((1/3)^2 + (2/3)^2) = 4/9$

$G_{SPLIT} = (3/6) \times 0 + (3/6) \times (4/9) = 2/9$

- For the choice of a split at Income ≤ 32

Pattern Count  High  Low
Income ≤ 32    3     1
Income > 32    1     1

$G(\text{Income} \le 32) = 1 - ((3/4)^2 + (1/4)^2) = 3/8$

$G(\text{Income} > 32) = 1 - ((1/2)^2 + (1/2)^2) = 1/2$

$G_{SPLIT} = (4/6) \times (3/8) + (2/6) \times (1/2) = 5/12$

Data Mining Techniques and Tools

- The lowest value of $G_{SPLIT}$ is for Income ≤ 23. So we take the two nearest values and average them. Thus, we have a split point at Income = (23 + 32)/2 = 27.5.
- Attribute lists are divided at the split point. That is, we expect to have a rule that says
  - If Income ≤ 27.5
  - Then
  - Else if Income > 27.5
  - Then
- The following is the attribute list for Income ≤ 27.5

Income  Pattern  Loan Risk  Credit Rating
17      1        High       Low
20      5        High       High
23      0        High       High

So the conclusion is that if Income ≤ 27.5, the loan risk is High.
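
The search for the income split point can be reproduced with a short script. This is a sketch: the helper names are assumptions, the records are the six (Income, Loan Risk) pairs from the training table, and the largest income is skipped as a threshold because it would leave one side of the split empty.

```python
from collections import Counter
from fractions import Fraction

def gini(labels):
    """Gini index of a list of class labels, as an exact fraction."""
    n = len(labels)
    return 1 - sum(Fraction(count, n) ** 2 for count in Counter(labels).values())

def gini_split(left, right):
    """Size-weighted Gini index of a two-way split."""
    n = len(left) + len(right)
    return Fraction(len(left), n) * gini(left) + Fraction(len(right), n) * gini(right)

# (Income, Loan Risk) pairs from the training table, ordered by income.
records = [(17, "High"), (20, "High"), (23, "High"), (32, "Low"), (43, "High"), (68, "Low")]
incomes = [income for income, _ in records]

scores = {}
for threshold in incomes[:-1]:
    left = [risk for income, risk in records if income <= threshold]
    right = [risk for income, risk in records if income > threshold]
    scores[threshold] = gini_split(left, right)

best = min(scores, key=scores.get)  # threshold with the lowest split index
next_income = incomes[incomes.index(best) + 1]
split_point = (best + next_income) / 2  # midpoint between neighbouring incomes
```

The lowest split index is obtained at the threshold 23, and averaging it with the next income, 32, gives the split point 27.5.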

Data Mining Techniques and Tools

- But what about Income > 27.5?
- The following table suggests that Income > 27.5 is not a definitive indicator of Loan Risk.

Income  Pattern  Loan Risk  Credit Rating
32      4        Low        Moderate
43      2        High       Low
68      3        Low        High

So we can turn to examining credit rating to develop the subtree for the Income > 27.5 case. However, credit rating is a categorical variable, and the rules for a categorical variable are slightly different from those for a continuous variable. The Gini index formula will be

$Gini(\text{two proportions}) = 1 - p_{\text{one proportion}}^2 - p_{\text{the other proportion}}^2$

Data Mining Techniques and Tools

- In the case of a categorical variable, one proportion is the set of records with Credit Rating = Low, and the other proportion is the set of records with Credit Rating not Low, that is, Credit Rating ∈ {Moderate, High}. Thus we have to compute the proportion of each category and its complement.

Pattern Count             Loan Risk High  Loan Risk Low
Credit Rating = Low       1               0
Credit Rating = Moderate  0               1
Credit Rating = High      0               1

First, compute the Gini index for each category

$G(\text{Credit Rating} = \text{Low}) = 1 - (1^2 + 0^2) = 0$

$G(\text{Credit Rating} = \text{Moderate}) = 1 - (0^2 + 1^2) = 0$

$G(\text{Credit Rating} = \text{High}) = 1 - (0^2 + 1^2) = 0$

Data Mining Techniques and Tools

- Next, compute the Gini index for complement categories
  - $G(\text{Credit Rating} \in \{\text{Low}, \text{Moderate}\}) = 1 - ((1/2)^2 + (1/2)^2) = 1/2$
  - $G(\text{Credit Rating} \in \{\text{Low}, \text{High}\}) = 1/2$
  - $G(\text{Credit Rating} \in \{\text{Moderate}, \text{High}\}) = 1 - (0^2 + 1^2) = 0$

Third, compute the Gini index for possible branches. For the branch choice of Credit Rating = Low versus Credit Rating ∈ {Moderate, High}, we would have

$G_{SPLIT} = (\text{proportion of records with Credit Rating} = \text{Low}) \times G(\text{Credit Rating} = \text{Low}) + (\text{proportion of records with Credit Rating} \in \{\text{Moderate}, \text{High}\}) \times G(\text{Credit Rating} \in \{\text{Moderate}, \text{High}\})$

$G_{SPLIT}(\text{Credit Rating} = \text{Low}) = (1/3) \times 0 + (2/3) \times 0 = 0$

Data Mining Techniques and Tools

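- Last, compute the Gini index for the other categories
  - $G_{SPLIT}(\text{Credit Rating} = \text{Moderate}) = (1/3) \times 0 + (2/3) \times (1/2) = 1/3$
  - $G_{SPLIT}(\text{Credit Rating} = \text{High}) = (1/3) \times 0 + (2/3) \times (1/2) = 1/3$
  - $G_{SPLIT}(\text{Credit Rating} \in \{\text{Low}, \text{Moderate}\}) = (2/3) \times (1/2) + (1/3) \times 0 = 1/3$
  - $G_{SPLIT}(\text{Credit Rating} \in \{\text{Low}, \text{High}\}) = (2/3) \times (1/2) + (1/3) \times 0 = 1/3$
  - $G_{SPLIT}(\text{Credit Rating} \in \{\text{Moderate}, \text{High}\}) = (2/3) \times 0 + (1/3) \times 0 = 0$
- The lowest value of the Gini index for the split is zero, at Credit Rating = Low versus Credit Rating ∈ {Moderate, High}; thus this is the split point, and these are the next branches of the subtree. See figure.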
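
For a categorical attribute the same index can be evaluated for each "one category versus the rest" partition. The sketch below does this for the three Income > 27.5 records, using the credit ratings and loan risks of patterns 4, 2 and 3 from the training table; the helper and variable names are assumptions. With only three categories, every two-way partition is a one-versus-rest partition, so three evaluations cover all the cases.

```python
from collections import Counter
from fractions import Fraction

def gini(labels):
    """Gini index of a list of class labels, as an exact fraction."""
    n = len(labels)
    return 1 - sum(Fraction(count, n) ** 2 for count in Counter(labels).values())

# (Credit Rating, Loan Risk) for patterns 4, 2 and 3 of the training table.
subset = [("Moderate", "Low"), ("Low", "High"), ("High", "Low")]
categories = ["Low", "Moderate", "High"]

scores = {}
for category in categories:
    inside = [risk for rating, risk in subset if rating == category]
    outside = [risk for rating, risk in subset if rating != category]
    n = len(subset)
    # Size-weighted Gini index of the partition {category} versus the rest.
    scores[category] = (
        Fraction(len(inside), n) * gini(inside)
        + Fraction(len(outside), n) * gini(outside)
    )

best = min(scores, key=scores.get)
```

Separating Credit Rating = Low from the other two ratings gives a split index of zero, while the other two partitions each give 1/3.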

Data Mining Techniques and Tools

- The ID3 algorithm decision tree approach
  - Entropy
    - Measures the extent of uncertainty or randomness in a data set. If all the data in a subset belong to just one class, then there is no uncertainty or randomness in that data set, and therefore the entropy is zero
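
The slide gives no formula, so the sketch below assumes the standard Shannon entropy in bits, which is the measure ID3 uses. The function name `entropy` and the label lists are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return sum(-(count / n) * math.log2(count / n) for count in Counter(labels).values())

single_class = entropy(["High", "High", "High"])  # no uncertainty
even_mix = entropy(["High", "Low"])               # maximum for two classes
```

A subset containing one class has entropy zero, as stated above, while an even two-class mix has an entropy of one bit.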

Data Mining Techniques and Tools

- Cluster analysis for data mining
  - Cluster analysis is an exploratory data analysis tool for solving classification problems
  - The object is to sort cases into groups so that the degree of association is strong between members of the same cluster and weak between members of different clusters

Data Mining Techniques and Tools

- Cluster analysis results may be used to
  - Help identify a classification scheme
  - Suggest statistical models to describe populations
  - Indicate rules for assigning new cases to classes for identification, targeting, and diagnostic purposes
  - Provide measures of definition, size, and change in what were previously broad concepts
  - Find typical cases to represent classes

Data Mining Techniques and Tools

- Cluster analysis methods
  - Statistical methods
  - Optimal methods
  - Neural networks
  - Fuzzy logic
  - Genetic algorithms
- Each of these methods generally works with one of two general method classes
  - Divisive
  - Agglomerative

Data Mining Techniques and Tools

- Hierarchical clustering method and example
  - 1) Decide which data to record from the items
  - 2) Calculate the distances between all initial clusters. Store the results in a distance matrix
  - 3) Search through the distance matrix and find the two most similar clusters
  - 4) Fuse those two clusters together to produce a cluster that has at least two items
  - 5) Calculate the distances between this new cluster and all the other clusters
  - 6) Repeat steps 3 to 5 until you have reached the prespecified maximum number of clusters
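
A compact agglomerative version of this procedure is sketched below. Several choices are assumptions made for brevity: the function names, the sample points, the use of single linkage (distance between the closest members) as the similarity measure, and recomputing distances on every pass instead of maintaining a stored distance matrix.

```python
import math
from itertools import combinations

def single_link(a, b):
    """Distance between two clusters: their closest pair of members."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerate(items, target_clusters):
    """Merge the two most similar clusters until target_clusters remain."""
    clusters = [[item] for item in items]  # every item starts as its own cluster
    while len(clusters) > target_clusters:
        # Find the two most similar clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: single_link(clusters[pair[0]], clusters[pair[1]]),
        )
        # Fuse them; distances to the new cluster are recomputed on the next pass.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical 2-D items.
items = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
result = agglomerate(items, target_clusters=3)
```

On these items the two nearby pairs are fused first, leaving the isolated point as a cluster of its own.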

Data Mining Techniques and Tools

- Classes of data mining tools and techniques as they relate to information and business intelligence (BI) technologies
  - Mathematical and statistical analysis packages
  - Personalization tools for Web-based marketing
  - Analytics built into marketing platforms
  - Advanced CRM tools
  - Analytics added to other vertical industry-specific platforms
  - Analytics added to database tools (e.g., OLAP)
  - Standalone data mining tools

Data Mining Project Processes

- Knowledge discovery in databases (KDD)
  - A comprehensive process of using data mining methods to find useful information and patterns in data

Data Mining Project Processes

- KDD process
- Selection
- Preprocessing
- Transformation
- Data mining
- Interpretation/evaluation

Text Mining

- Text mining
  - Application of data mining to nonstructured or less structured text files. It entails the generation of meaningful numerical indices from the unstructured text and then processing these indices using various data mining algorithms

Text Mining

- Text mining helps organizations
  - Find the hidden content of documents, including additional useful relationships
  - Relate documents across previously unnoticed divisions
  - Group documents by common themes

Text Mining

- Applications of text mining
  - Automatic detection of e-mail spam or phishing through analysis of the document content
  - Automatic processing of messages or e-mails to route a message to the most appropriate party to process that message
  - Analysis of warranty claims, help desk calls/reports, and so on to identify the most common problems and relevant responses

Text Mining

- Applications of text mining
  - Analysis of related scientific publications in journals to create an automated summary view of a particular discipline
  - Creation of a relationship view of a document collection
  - Qualitative analysis of documents to detect deception

Text Mining

- How to mine text
  - Eliminate commonly used words (stop-words)
  - Replace words with their stems or roots (stemming algorithms)
  - Consider synonyms and phrases
  - Calculate the weights of the remaining terms
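
Three of these steps can be sketched with the standard library alone. This is a toy illustration with loud assumptions: the stop-word list is a tiny sample, `stem` is a naive suffix stripper (not a real stemming algorithm such as Porter's), TF-IDF is used as one common choice of term weight, the sample documents are invented, and synonym and phrase handling is left out.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "is", "to", "in"}  # tiny illustrative list

def stem(word):
    """Naive stemmer: strip one common suffix from sufficiently long words."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lower-case, drop stop-words, and stem the remaining words."""
    return [stem(word) for word in text.lower().split() if word not in STOP_WORDS]

def tfidf(documents):
    """Weight each remaining term by term frequency times inverse document frequency."""
    tokenized = [tokenize(doc) for doc in documents]
    n = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    return [
        {
            term: (count / len(tokens)) * math.log(n / doc_freq[term])
            for term, count in Counter(tokens).items()
        }
        for tokens in tokenized
    ]

# Hypothetical help desk messages.
documents = [
    "the printer is jamming",
    "printer jammed and the warranty claims",
    "warranty claims processed",
]
weights = tfidf(documents)
top_term = max(weights[2], key=weights[2].get)
```

In the third message the term that appears in no other document receives the highest weight, which is the behaviour a term-weighting step is meant to produce.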

Web Mining

- Web mining
  - The discovery and analysis of interesting and useful information from the Web, about the Web, and usually through Web-based tools

Web Mining

- Web content mining
  - The extraction of useful information from Web pages
- Web structure mining
  - The development of useful information from the links included in the Web documents
- Web usage mining
  - The extraction of useful information from the data being generated through webpage visits, transactions, etc.

Web Mining

- Uses for Web mining
- Determine the lifetime value of clients
- Design cross-marketing strategies across products
- Evaluate promotional campaigns
- Target electronic ads and coupons at user groups
- Predict user behavior
- Present dynamic information to users

Copyright 2018 CrystalGraphics, Inc. — All rights Reserved. PowerShow.com is a trademark of CrystalGraphics, Inc.

The PowerPoint PPT presentation: "DATA, TEXT," is the property of its rightful owner.
