Title: David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com
1Data Mining (and machine learning)
- DM Lecture 1: Overview of DM, and overview of
the DM part of the DMML module
- Some of these slides are derived from Nick
Taylor's slides used for this module in previous
years
2Overview of My Lectures
- All at http://www.macs.hw.ac.uk/dwcorne/Teaching/dmml.html
- Lecture 1: about data and data mining
- Lectures 2 and 3: Basic and useful ways to
process and understand data
- Lectures 4, 5, 6, 7, 8: Details of useful
algorithms for finding knowledge from data
- Lecture 9: overview of what else there is.
3Module assessment
- 100% by coursework
- Two main items of coursework, 50% each
- Four small items of coursework, worth nothing,
but if you don't do them adequately you fail the
module.
4This Semester
- PDW lectures on Mondays (machine learning)
- DWC lectures on Thursdays (data mining)
- Friday slot usually unused; we may use it, and
will let you know in advance
- All coursework set by DWC
5Coursework submission
- ALL coursework must be submitted as follows
- as PDF
- by email to dwcorne_at_gmail.com
- the c/w is an attachment
- Subject line: DMML Coursework A
- (or B, C, D, 1, 2)
- Body of the email includes your Name and your
Course (e.g. Joe Smith, BSc CS; Jill Brown, MSc
AI)
6DWC lectures and c/w, key dates
Thur Sep 17th   This lecture   Handout C/W A
Thur Sep 24th   Lecture        Handout C/W B
Thur Oct 1st    Lecture        Handout Main C/W 1 (50%)
Thur Oct 8th    Lecture
Thur Oct 15th   Lecture
Thur Oct 22nd   Lecture        Handout Main C/W 2 (50%)
Thur Oct 29th   NO LECTURE     (hand in C/W A, B and 1 on Fri 30th)
Thur Nov 5th    NO LECTURE
Thur Nov 12th   Lecture        Handout C/W C; C/W 1 vivas on Fri 13th
Thur Nov 19th   Lecture        Handout C/W D
Thur Nov 26th   Lecture        (hand in C/W C, D and 2 on Fri 27th)
Thur Dec 3rd    C/W 2 vivas
7At last, the lecture
8What some people think can be done with data
- Answer simple questions like
- How many female clients do we have?
- How much paint did we sell in 2007?
- Which is the most profitable branch of our
supermarket?
- Which postcodes suffered the most dropped calls
in July?
9that is so
10that is so
11More interesting things that can be done with data
- Answer difficult and valuable questions like
- How can we predict ovarian cancer early enough to
treat it successfully?
- How can I make significant profit on the stock
market next month?
- Two different authors claim to have written this
story; how can we resolve the dispute?
- How can we get our customers to spend more money
in the store?
- Is this loan applicant a good credit risk?
- Is this sonar image a mine, or a rock?
- What other websites will this browser be
interested in?
12Data Mining - Definition and Goal
- Definition
- Data Mining is the exploration and analysis of
large quantities of data in order to discover
meaningful patterns and rules
- Goal
- To permit some other goal to be achieved or
performance to be improved through a better
understanding of the data
13Some examples of large databases
- Retail basket data: much commercial DM is done
with this. In one store, 18,000 baskets per month
- Tesco has >500 stores. Per year, 100,000,000
baskets?
- The Internet: >15,000,000,000 pages
- Lots of datasets: UCI Machine Learning
repository
- How can we begin to understand and exploit such
datasets? Especially the big ones?
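As a rough check on the Tesco estimate, the figures quoted on this slide can simply be multiplied out. This is only a back-of-envelope sketch: it assumes every store sees the 18,000 baskets per month quoted for one store, and it uses exactly 500 stores although the slide says more than 500. The constant names are made up for this example.

```python
# Back-of-envelope estimate of yearly retail baskets.
# Assumption: every store matches the single-store figure from the slide.
BASKETS_PER_STORE_PER_MONTH = 18_000
STORES = 500            # the slide says "more than 500", so this is a lower bound
MONTHS_PER_YEAR = 12

baskets_per_year = BASKETS_PER_STORE_PER_MONTH * STORES * MONTHS_PER_YEAR
print(f"{baskets_per_year:,} baskets per year")
```

Under these assumptions the product is 108,000,000, which is in line with the rough 100,000,000 figure on the slide.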
14Like this
15and this
16and this
17or this
http://websom.hut.fi/websom/milliondemo/html/root.html
18Data Mining - Basics
- Data Mining is the process of discovering
patterns and inferring associations in raw data
- Data Mining is a collection of techniques
intended to analyse small or large amounts of
data
- There is no single Data Mining approach
- Data Mining can employ a range of techniques,
either individually or in combination with each
other
19Data Mining: Why is it important?
- Data are being generated in enormous quantities
- Data are being collected over long periods of
time
- Data are being kept for long periods of time
- Computing power is formidable and cheap
- A variety of Data Mining software is available
- All of these data contain hidden knowledge
(facts, rules, patterns) that can be usefully
exploited if we can find it.
20Data Mining: History
- The approach has its roots over 40 years ago
- In the early 1960s Data Mining was called
statistical analysis, and the pioneers were
statistical software companies such as SPSS
- By the late 1980s these traditional techniques
had been augmented by new methods such as machine
induction, artificial neural networks,
evolutionary computing, etc.
22Some basic terminology
Gender   Weight   Height   Age in mths   100m time
Male     52kg     1.71m    243           13.7s
Male     89kg     1.92m    388           22.3s
Female   48kg     1.67m    219           14.6s
Male     86kg     1.96m    274           9.58s
Male     80kg     1.88m    260           10.56s
etc      etc      etc      etc           etc
23Each row of the table is called a data instance,
or a record, or just a line of data
24Each column is called a field or an attribute;
for example, the value of the Age field in the 4th
record is 274
25Usually we are interested in predicting the value
of a particular field, given the values of the
other fields. What we want to predict is called
the class field, or the target class
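The terminology above (record, field, class field) can be made concrete with a small sketch in Python. The values are the five rows of the example table with the units dropped. The field names such as `age_months` are invented for this sketch, and treating the 100m time as the class field is purely an assumption for illustration, since the slides do not say which field is the target.

```python
# The example table as a list of records (one dict per line of data).
# Field names are hypothetical; units (kg, m, months, seconds) are dropped.
records = [
    {"gender": "Male", "weight_kg": 52, "height_m": 1.71, "age_months": 243, "time_100m_s": 13.7},
    {"gender": "Male", "weight_kg": 89, "height_m": 1.92, "age_months": 388, "time_100m_s": 22.3},
    {"gender": "Female", "weight_kg": 48, "height_m": 1.67, "age_months": 219, "time_100m_s": 14.6},
    {"gender": "Male", "weight_kg": 86, "height_m": 1.96, "age_months": 274, "time_100m_s": 9.58},
    {"gender": "Male", "weight_kg": 80, "height_m": 1.88, "age_months": 260, "time_100m_s": 10.56},
]

# A data instance (record) is one row; a field (attribute) is one column.
fourth_record = records[3]
print(fourth_record["age_months"])  # value of the Age field in the 4th record

# Assumption for illustration only: treat the 100m time as the class field.
TARGET = "time_100m_s"

def split_record(record, target=TARGET):
    """Return (other fields, value of the class field) for one record."""
    inputs = {name: value for name, value in record.items() if name != target}
    return inputs, record[target]

inputs, target_value = split_record(fourth_record)
print(inputs, "->", target_value)
```

Splitting each record this way mirrors the prediction setting described above: the other fields are what we are given, and the class field is what we want to predict.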
26Some data-mining related projects that I am
currently working on (either myself, or with a
PhD student or RA)
- Predicting whether or not two textures will be
considered similar by humans.
- Predicting which of two or more writers is the
author of a given piece of text (you will do some
work on this)
- Discovering which subsets of many thousands of
genes play a role in specific diseases (cancer,
diabetes, etc) (you will do a little work on this
too)
- Discovering technical trading rules for stock
market trading (you will do a little work on this
too)
27Which pair of textures is most similar?
28Which pair of textures is most similar?
A line of data:
0.23 1.88 9.64 3.22 7.1 1086.9 2.23 0.76
Each line holds 5,000 features for texture1, 5,000
features for texture2, and the percentage of people
who think they are similar (the 0.76 value)
29Who wrote text chunk 4?
0.4   0.2    0.001   0.002   0.6   AuthorA
0.3   0.15   0       0.1     0.5   AuthorA
0.2   0.2    0.001   0.002   0.5   AuthorB
0.2   0.15   0       0.002   0.6   ?
Each row is a word-usage fingerprint of a 1,000
word chunk of text
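One simple way to attack a question like this is to give the unknown chunk the author of the most similar labelled fingerprint. The sketch below is a minimal illustration of that idea (1-nearest-neighbour with Euclidean distance), not necessarily the method used later in the module. It also assumes a particular reading of the slide: the first two fingerprints belong to AuthorA, the third to AuthorB, and the fourth is the unknown chunk. The function and variable names are invented for this example.

```python
import math

# Assumed reading of the slide: three labelled chunks and one unknown chunk.
labelled_chunks = [
    ((0.4, 0.2, 0.001, 0.002, 0.6), "AuthorA"),
    ((0.3, 0.15, 0.0, 0.1, 0.5), "AuthorA"),
    ((0.2, 0.2, 0.001, 0.002, 0.5), "AuthorB"),
]
chunk_4 = (0.2, 0.15, 0.0, 0.002, 0.6)

def nearest_author(fingerprint, labelled):
    """Return the author of the labelled fingerprint closest to `fingerprint`."""
    closest = min(labelled, key=lambda item: math.dist(fingerprint, item[0]))
    return closest[1]

print(nearest_author(chunk_4, labelled_chunks))
```

With these five values the closest labelled chunk is the third one, so the sketch prints AuthorB. With only three labelled examples that says very little about the true author; it only shows the mechanics.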
30Did the Dow Jones go up or down in the following
week?
31Down
32Will the Dow Jones go up or down tomorrow?
33Data Mining: Two Major Types
- Directed (Farming): Attempts to explain or
categorise some particular target field such as
income, medical disorder, genetic characteristic,
etc.
- Undirected (Exploring): Attempts to find
patterns or similarities among groups of records
without the use of a particular target field or
collection of predefined classes
- Compare with Supervised and Unsupervised systems
in machine learning
34Data Mining Tasks
- Classification - Example: high risk for cancer or
not
- Estimation - Example: household income
- Prediction - Example: credit card balance
transfer average amount
- Affinity Grouping - Example: people who buy X
often also buy Y, with a probability of Z
- Clustering - similar to classification but no
predefined classes
- Description and Profiling - Identifying
characteristics which explain behaviour.
Example: More men watch football on TV than
women
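The affinity grouping example ("people who buy X often also buy Y with a probability of Z") can be sketched as a simple count over baskets. The baskets below are invented purely for illustration, and `confidence` is a hypothetical helper name; the sketch estimates Z as the fraction of baskets containing X that also contain Y.

```python
# Hypothetical baskets, invented purely for illustration.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

def confidence(baskets, x, y):
    """Estimate P(y in basket | x in basket) from the observed baskets."""
    with_x = [basket for basket in baskets if x in basket]
    if not with_x:
        return 0.0  # x never bought, so there is nothing to estimate from
    with_both = [basket for basket in with_x if y in basket]
    return len(with_both) / len(with_x)

z = confidence(baskets, "bread", "butter")
print(f"People who buy bread also buy butter with probability {z:.2f}")
```

Here four baskets contain bread and three of those also contain butter, so Z comes out as 0.75. Real basket data would need far more records before such an estimate could be trusted.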
35Data Warehousing
- Note that Data Mining is very generic and can be
used for detecting patterns in almost any data
- Retail data
- Genomes
- Climate data
- Etc.
- Data Warehousing, on the other hand, is almost
exclusively used to describe the storage of data
in the commercial sector
36What you should do this week
- Browse the UCI Machine Learning repository
datasets and associated information; get
acquainted with data
- Browse the statlib datasets archive; get
acquainted with that too.
- And then
37Coursework A (0 marks, but you fail if you don't
submit an adequate attempt)
- Find three other dataset repositories as follows
- One that specialises in financial data
- One that specialises in time series data
- One that specialises in anything else.
- For each of these three, tell me the URL, and
write one paragraph, 100 words, in your own
words, describing the contents of this
repository.
- Submit on or before Friday October 30th
38Au revoir
39If time available
- Some slides about data warehousing. I don't
consider this an essential part of this module,
but they are here in case you want to know what
data warehousing is
40Data Warehousing - Definitions
- A subject-oriented, integrated, time-variant and
nonvolatile collection of data in support of
management's decision making process
- W. H. Inmon, "What is a Data Warehouse?" Prism
Tech Topic, Vol. 1, No. 1, 1995 -- a very
influential definition.
- A copy of transaction data, specifically
structured for query and analysis
- Ralph Kimball, from his 2000 book, The Data
Warehouse Toolkit
41Data Warehouse: why?
- For organisational learning to take place, data
from many sources must be gathered together over
time and organised in a consistent and useful way
- Data Warehousing allows an organisation to
remember its data and what it has learned about
its data
- Data Mining techniques make use of the data in a
Data Warehouse and subsequently add their results
to it
43Data Warehouse - Contents
- A Data Warehouse is a copy of transaction data
specifically structured for querying, analysis
and reporting
- The data will normally have been transformed
when it was copied into the Data Warehouse
- The contents of a Data Warehouse, once acquired,
are fixed and cannot be updated or changed later
by the transaction system, but they can be added
to, of course
44Data Marts
- A Data Mart is a smaller, more focused Data
Warehouse: a mini-warehouse
- A Data Mart will normally reflect the business
rules of a specific business unit within an
enterprise, identifying data relevant to that
unit's activities
45From Data Warehousing to Machine Learning, via
Data Marts
46The Big Challenge for Data Mining
- The largest challenge that a Data Miner may face
is the sheer volume of data in the Data Warehouse
- It is very important, then, that summary data
also be available to get the analysis started
- The sheer volume of data may mask the important
relationships in which the Data Miner is
interested
- Being able to overcome the volume and interpret
the data is essential to successful Data Mining
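As a small illustration of what "summary data" might look like, the sketch below reduces one numeric field to a handful of summary values. The `summarise` helper is hypothetical, and the input is the five 100m times from the terminology example earlier in the lecture. A real Data Warehouse would hold such summaries for far larger volumes, and how it computes them is not covered in these slides.

```python
from statistics import mean

def summarise(values):
    """Return a small summary (count, min, max, mean) of one numeric field."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }

# 100m times in seconds, taken from the terminology example table.
times_100m = [13.7, 22.3, 14.6, 9.58, 10.56]
summary = summarise(times_100m)
print(summary)
```

Even this tiny summary gives an analysis a starting point: the 22.3s maximum stands well apart from the other four values.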
47What happens in practice
- Data Miners, both farmers and explorers, are
expected to utilise Data Warehouses to give
guidance and answer a limitless variety of
questions
- The value of a Data Warehouse and Data Mining
lies in a new and changed appreciation of the
meaning of the data
- There are limitations though: a Data Warehouse
cannot correct problems with its data, although
it may help to identify them more clearly