???????Data miningCh. 1 Introduction

- Major Interdisciplinary program of the

integrated biotechnology - Graduate school of bio- information technology
- Youngil Lim (N110), Lab. FACS
- phone 82 31 670 5200 (secretary), 82 31 670

5207 (direct) - Fax 82 31 670 5209, mobile phone 82 10 7665

5207 - Email limyi_at_hknu.ac.kr, homepage

http//facs.maru.net

Outline

Course Course name Time Time Room

Data mining Thu. 9-12? N130/N116

Overview In recent year, there has been stunning progress in data mining and machine learning. The synthesis of statistics, machine learning, information theory and computing has created a solid science, with a firm mathematical base, and with very powerful tools. This lecture presents the basic theory of automatically extracting models from experimental data, and then validating those models. This lecture includes a multivariate linear regression, training/testing/validation techniques, principle component analysis (PCA), partial least square (PLS) and artificial neural network (ANN) algorithms. Matlab or Weka toolkit is used for computational practices. This lecture is given in English. In recent year, there has been stunning progress in data mining and machine learning. The synthesis of statistics, machine learning, information theory and computing has created a solid science, with a firm mathematical base, and with very powerful tools. This lecture presents the basic theory of automatically extracting models from experimental data, and then validating those models. This lecture includes a multivariate linear regression, training/testing/validation techniques, principle component analysis (PCA), partial least square (PLS) and artificial neural network (ANN) algorithms. Matlab or Weka toolkit is used for computational practices. This lecture is given in English. In recent year, there has been stunning progress in data mining and machine learning. The synthesis of statistics, machine learning, information theory and computing has created a solid science, with a firm mathematical base, and with very powerful tools. This lecture presents the basic theory of automatically extracting models from experimental data, and then validating those models. This lecture includes a multivariate linear regression, training/testing/validation techniques, principle component analysis (PCA), partial least square (PLS) and artificial neural network (ANN) algorithms. Matlab or Weka toolkit is used for computational practices. This lecture is given in English. In recent year, there has been stunning progress in data mining and machine learning. The synthesis of statistics, machine learning, information theory and computing has created a solid science, with a firm mathematical base, and with very powerful tools. This lecture presents the basic theory of automatically extracting models from experimental data, and then validating those models. This lecture includes a multivariate linear regression, training/testing/validation techniques, principle component analysis (PCA), partial least square (PLS) and artificial neural network (ANN) algorithms. Matlab or Weka toolkit is used for computational practices. This lecture is given in English.

Method Lecture(?), Seminar (?), Computational practice (?), Factory tour (?), Beam projector(?) Lecture(?), Seminar (?), Computational practice (?), Factory tour (?), Beam projector(?) Lecture(?), Seminar (?), Computational practice (?), Factory tour (?), Beam projector(?) Lecture(?), Seminar (?), Computational practice (?), Factory tour (?), Beam projector(?)

Evaluation Attendance 8, homework 20, Mid-exam 30, Final-exam 30, Presentation 12 Attendance 8, homework 20, Mid-exam 30, Final-exam 30, Presentation 12 Attendance 8, homework 20, Mid-exam 30, Final-exam 30, Presentation 12 Attendance 8, homework 20, Mid-exam 30, Final-exam 30, Presentation 12

Text Main Witten and Frank, Data mining practical machine learning tools and techniques, Elsevier, 2005. Main Witten and Frank, Data mining practical machine learning tools and techniques, Elsevier, 2005. Main Witten and Frank, Data mining practical machine learning tools and techniques, Elsevier, 2005. Main Witten and Frank, Data mining practical machine learning tools and techniques, Elsevier, 2005.

Weekly Lecture Plan

Week Contents Remarks

1 Introduction

2 EndNote 13, Uses and practices Presentation 1 EndNote

3 Part I. Machine learning tools and techniques, Ch. 1 what is it all about?

4 Part I. Machine learning tools and techniques, Ch. 2 input data

5 Part I. Machine learning tools and techniques, Ch. 3 output data

6 Field trip (Factory tour, October 7, 2010) KITECH, Biomass gasifier (Dr. Lee Uen-Do)

7 Part I. Machine learning tools and techniques, Ch. 4 algorithms Presentation 2 ch. 4

8 Part I. Machine learning tools and techniques, Ch. 4 algorithms

9 Mid-term exam.

10 Part II. The weka program, Ch. 9 Introduction to Weka

11 Part II. The weka program, Ch. 10 Explorer of Weka ?

12 Field trip (Factory tour)

13 Part II. The weka program, Ch. 10 Explorer of Weka

14 Ammonia emission problem (Lim et al., 2007) Analysis of Lim et al. (2007)

15 Final exam. (Report on the ammonia emission problem)

Overview of this lecture

output (ch. 3)

input (ch. 2)

Information (data, database)

Data mining (extraction of useful information)

Relationships? Modeling Structural

patterns Technical tools machine learning

Knowledge (understanding, application,

prediction)

- - Machine learning acquisition of structural

descriptions automatically or semi-auto. - (it is similar as the brain development from

repeating experiences) - Weka written in JAVA (object-oriented

programming language) - (JAVA is free to OS and its calculation is 2-3

times slower than C, C and Fortran - - Java compiler (Java virtual machine) translate

the byte-code into machine code

Outline of this lecture

- Part I. Machine learning tools and techniques
- - Level 1 Ch 1. Applications, common problems
- Ch 2. Input, concepts, instances and

attributes - Ch 3. Output, knowledge

representation - - Level 2 Ch 4. Numerical algorithms, the basic

methods - - Level 3 Ch 5-6 (advanced topics)
- Part II. Weka manual (ftp//facs/lim/lecture_relat

ed/weka3.4.exe) - - Level 1 Ch 9. Introduction of Weka
- Ch 10. Explorer
- - Level 2 Ch 11-15 (advanced options in Weka)

But, you need to read those chapters to make a

paper on data mining

Ch. 1. Whats it all about

Life and death It is up to Machine Learning

Cow breeding of farmers - 1/5 cows to be abated -

4/5 cows to be bred What is the decision criteria?

- Human in vitro fertilization
- 60 embryos fertilized ? select just 1 embryo
- What is the decision criteria?

inputs (attributes)

outputs (results)

inputs (attributes)

outputs (results)

- - age
- - health
- calving
- -

Live or die

- morphology - oocyte -

Live or die

1.1 data mining and machine learning

- Data mining process of discovering structural

patterns in data. - Machine learning technical tools for finding

structural patterns in data automatically or

semi-automatically.

Machine learning

Describing structural patterns

- learning and training
- training mindless learning
- Machine learning includes numerical algorithms

for automatic calculations

- See Table 1.1
- there are 4 attributes (descriptors)
- there are 3 decisions (outputs)
- 24 possibilities and 24 instances
- no data missing
- no noise in data
- perfect prediction is possible
- It is an ideal and fictitious example

1.1 data mining and machine learning

Table 1.1

1.2 sample examples weather problems and others

- Different datasets tend to expose new issues,

challenges, and different numerical algorithms

(case by case).

2. Contact lenses ideal case

1. The weather problem

- machine learning is to
- identify the data structure
- predict for new cases
- see Table 1.1, Figure 1.1, and Figure 1.2
- which representation is better understandable

between Figure 1.1 and Figure 1.2?

- See Table 1.2
- there are 4 attributes (descriptors)
- there are 2 decisions (outputs)
- 36 possibilities and 14 instances
- decision list in order (see p11)
- numeric-attribute problem
- mixed-attribute problem
- The classification rule is one of the
- association rules, and it is the best rule.

In Ch. 3, We will learn more classification/associ

ation rules

1.2 Examples

Table 1.2

Table 1.3

1.2 Examples

Figure 1.1

Figure 1.2 Decision Tree

1.2 sample examples weather problems and others

3. Irises numerical dataset

4. CPU performance numeric prediction

- See Table 1.4 (Fisher, 1935)
- there are 4 attributes
- numeric-attribute problem
- see p16.
- - Output is the 3 categories
- More compact rule is in Ch. 3.
- That is we use the following statement
- if then
- else if then
- end if

- see Table 1.5
- there are 6 attributes
- Output is estimated by multi-variable linear

regression (MLR) - prediction of numeric performance
- Numerical algorithms will be reviewed in ch. 4

1.2 examples

Table 1.4

1.2 examples

Table 1.5

1.2 sample examples weather problems and others

5. Labor negotiation

- see Table 1.6
- there are 16 attributes and 40 examples
- missing data but realistic case
- 2 decisions (acceptable or not)
- see Figure 1.3 (a) and (b)
- (a) simple and intuitive decision tree
- (b) complex and accurate representation
- which is the artifact and overfited?
- Ch. 5 and 6 concern about cross-validation and

missing data

6. Soybean classification

- see Table 1.7
- a successful story of ML techniques
- soybean disease diagnosing
- 35 attributes and 680 examples
- 19 outputs (categories)
- 97.5 accuracy over 72 of expert
- the expert adopted ML rules

1.2 examples

Table 1.6

1.2 examples

Table 1.7

1.3 Fielded applications

The previous examples are speculative and toy

problems. Where is the beef?

1. Decision of loan company

- for borderline applicants of loan (sub-prime

loan) - there are about 20 attributes
- there are 2 decisions (accept or not)
- the borrowers pay off or default
- correct predictions from ML are related to the

profit of the loan company.

2. Screening satellite images for oil slicks

detection

- For warning ecological disasters
- oil slicks appear as dark regions in the image
- the detection is an expensive manual process
- this problem is challenged because
- scarcity of training data
- a very small fraction are actual oil slicks
- batch process (case by case)

1.3 Fielded applications

3. Power load forecasting in electricity supply

industry

- for estimating Max. and Min. of load for hour,

day, month, season, and year - there are many attributes (day/night, holiday,

weekend, weather ) - we need a dynamic model
- ML system is far quicker than the trained human

forecasters. - a few seconds or a few days

4. Diagnosis of machines and devices

- For determination of the kind of fault,
- the diagnosis process is too labor intensive
- 1000 different devices, and noisy data
- outputs are 600 faults
- low level attributes from vibration records of

devices - derived attributes from Fouriers analysis
- ML performance is slightly superior to that of

expert

1.3 Fielded applications

5. Marketing and sales

- for planning store layouts, special discounts,

offering coupons, - attributes costumers purchase records (by

membership card) - ? market basket analysis
- econometrics
- detecting customers who is fickle and defect

6. Other applications

- control problems of plants
- biology identification of genes
- biomedicine prediction of drug activity, and 3D

structure - astronomy
- chemistry structure identification of certain

organic compounds - automation

1.4 Machine learning and statistics

What is the difference between ML and statistics?

- ML statistics marketing - Statistics ML

that has arisen out of computer science - But,

two perspectives have converged - Statistical

tests are used to validate ML models and to

evaluate ML algorithms

1.6 data mining and ethics

Data mining is used for people, it provokes

ethical problems such as racial, sexual and

religious

1.5 Generalization as search

ML and statistics generalization as search

This section is optional, as indicated by the

gray bar !

Field trip report (as a scientific report)

- Components or contents
- Introduction backgrounds, states-of-art, aims,

and short overview of the report - Main body Knowledge or information from field

trips, applications, results, and analyses - Conclusion summary, and perspectives
- Reference books, papers, patents, reports,

websites - Appendix accessory information

An example of field trip report

- Application of data mining to a process in

Petro-chemical company, Samsung Total - Miso Kim (misokim_at_hknu.ac.kr), 200720111
- Dept. Chemical engineering, Hankyong National

University - Gyonggi-do Anseong Jungangno 167, 456-749 Korea
- 1. Introduction
- 1.1 What is data mining?
- 1.2 Aims of this report
- 1.3 Overview of this report
- 2. Main processes of Petro-chemical plant of

Samsung Total. - 2.1 PE and PP processes
- 2.2 BTX processes
- 2.3
- Each table and each figure have own number and

title. Those tables and figures should be well

explained in the text. - 3. Application of data mining tools