Title: David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com
1Data Mining and Machine Learning
- Lecture 1: Why data is useful, and an overview of DMML
2Overview of My Lectures
- http://www.macs.hw.ac.uk/dwcorne/Teaching/dmml.html
3Module assessment
- 100% by coursework
- Three main items of coursework:
- CW 1: 30%, CW 2: 40%, CW 3: 30%
- Two small items of coursework (A and B), worth 0%, but if you don't do them adequately you fail the module.
4Coursework submission
- ALL coursework must be submitted as follows:
- as PDF
- by email to dwcorne_at_gmail.com
- the c/w is an attachment
- Subject line: DMML Coursework A (or B, 1, 2, 3)
- Body of the email includes your Name and your Course (e.g. Joe Smith, BSc CS; Jill Brown, MSc AI)
5Office Hour Doodle Poll
http://doodle.com/ndb69faydc6ivttw
6At last, the lecture
7What some people think can be done with data
- Answer simple questions like:
- How many female clients do we have?
- How much paint did we sell in 2007?
- Which is the most profitable branch of our supermarket?
- Which postcodes suffered the most dropped calls in July?
8that is so
9that is so
10More interesting things that can be done with data
- Answer difficult and valuable questions like:
- How can we predict ovarian cancer early enough to treat it successfully?
- How can I make significant profit on the stock market next month?
- Two different authors claim to have written this story; how can we resolve the dispute?
- How can we get our customers to spend more money in the store?
- Is this loan applicant a good credit risk?
- Is this sonar image a mine, or a rock?
- What other websites will this browser be interested in?
11Some competitions at
12Data Mining - Definition and Goal
- Definition
- Data Mining is the exploration and analysis of (often) large quantities of data in order to discover meaningful patterns and rules
- Goal
- To permit some other goal to be achieved or performance to be improved through a better understanding of the data
13Some examples of large databases
- Retail basket data: much commercial DM is done with this. In one store, 18,000 baskets per month
- Tesco has >500 stores. Per year, 100,000,000 baskets?
- The Internet: >20,000,000,000 pages
- Lots of datasets: UCI Machine Learning repository
- How can we begin to understand and exploit such datasets? Especially the big ones?
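The yearly basket figure can be sanity-checked with a quick back-of-envelope calculation. The sketch below assumes, purely for illustration, that each of 500 stores sees the 18,000 baskets per month quoted for one store; the variable names are invented for this example.

```python
# Back-of-envelope estimate of yearly basket volume.
# Assumption (not stated on the slide): every store matches the single-store rate.
baskets_per_store_per_month = 18_000
stores = 500            # the slide says "more than 500"
months_per_year = 12

baskets_per_year = baskets_per_store_per_month * stores * months_per_year
print(f"{baskets_per_year:,} baskets per year")
```

Under these assumptions the result is of the same order as the 100,000,000 baskets suggested on the slide.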
14Like this
15and this
16or this
http://websom.hut.fi/websom/milliondemo/html/root.html
17Or this
18Data Mining and Machine Learning - Basics
- Data Mining is the process of discovering patterns and inferring associations in raw data
- a collection of techniques intended to analyse small or large amounts of data
- can employ a range of techniques, either individually or in combination with each other
- Machine Learning is the same, but the term ML emphasises a range of more sophisticated algorithms that try to learn accurate predictive models of data
19Data Mining Why is it important?
- Data are being generated in enormous quantities
- Data are being collected over long periods of time
- Data are being kept for long periods of time
- Computing power is formidable and cheap
- A variety of Data Mining software is available
- All of these data contain hidden knowledge: facts, rules, patterns, that can be usefully exploited if we can find them.
21Some basic terminology
Gender  Weight  Height  Age (months)  100m time
Male    52kg    1.71m   243           13.7s
Male    89kg    1.92m   388           22.3s
Female  48kg    1.67m   219           14.6s
Male    86kg    1.96m   274           9.58s
Male    80kg    1.88m   260           10.56s
etc     etc     etc     etc           etc
22Each row is called a data instance, or a record, or just a line of data
Gender  Weight  Height  Age (months)  100m time
Male    52kg    1.71m   243           13.7s
Male    89kg    1.92m   388           22.3s
Female  48kg    1.67m   219           14.6s
Male    86kg    1.96m   274           9.58s
Male    80kg    1.88m   260           10.56s
etc     etc     etc     etc           etc
23Each column is called a field or an attribute; the value of the Age field in the 4th record is 274
Gender  Weight  Height  Age (months)  100m time
Male    52kg    1.71m   243           13.7s
Male    89kg    1.92m   388           22.3s
Female  48kg    1.67m   219           14.6s
Male    86kg    1.96m   274           9.58s
Male    80kg    1.88m   260           10.56s
etc     etc     etc     etc           etc
24Usually we are interested in predicting the value
of a particular field, given the values of the
other fields. What we want to predict is called
the class field, or the target class
Gender  Weight  Height  Age (months)  100m time
Male    52kg    1.71m   243           13.7s
Male    89kg    1.92m   388           22.3s
Female  48kg    1.67m   219           14.6s
Male    86kg    1.96m   274           9.58s
Male    80kg    1.88m   260           10.56s
etc     etc     etc     etc           etc
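The terminology from these slides can be illustrated with a minimal Python sketch. The field names (such as `age_months`) and the choice of the 100m time as the class field are assumptions made for this example; the slides do not say which field is the target.

```python
# Each dict is one data instance (record); each key is a field (attribute).
records = [
    {"gender": "Male",   "weight_kg": 52, "height_m": 1.71, "age_months": 243, "time_100m_s": 13.7},
    {"gender": "Male",   "weight_kg": 89, "height_m": 1.92, "age_months": 388, "time_100m_s": 22.3},
    {"gender": "Female", "weight_kg": 48, "height_m": 1.67, "age_months": 219, "time_100m_s": 14.6},
    {"gender": "Male",   "weight_kg": 86, "height_m": 1.96, "age_months": 274, "time_100m_s": 9.58},
    {"gender": "Male",   "weight_kg": 80, "height_m": 1.88, "age_months": 260, "time_100m_s": 10.56},
]

# Value of the Age field in the 4th record (index 3, because Python counts from 0).
age_of_fourth = records[3]["age_months"]

# Assumption for illustration: the 100m time is the class field (target).
target_field = "time_100m_s"
inputs = [{k: v for k, v in r.items() if k != target_field} for r in records]
targets = [r[target_field] for r in records]

print(age_of_fourth)
print(targets)
```

A predictive model would then be trained to map each entry of `inputs` to the corresponding entry of `targets`.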
25Some data-mining related projects that I am currently working on (either myself, or with a PhD student or RA)
- Analysing flow cytometry data to detect the presence of specific contaminants in sea-water samples
- Predicting which of two or more writers is the author of a given piece of text
- Discovering which subsets of many thousands of genes play a role in specific diseases (cancer, diabetes, etc)
- Discovering technical trading rules for stock market trading
26Who wrote text chunk 4?
Word usage: a fingerprint of each 1,000-word chunk of text, followed by its author
0.4  0.2   0.001  0.002  0.6   AuthorA
0.3  0.15  0      0.1    0.5   AuthorA
0.2  0.2   0.001  0.002  0.5   AuthorB
0.2  0.15  0      0.002  0.6   ?
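One simple way to attack this kind of question is a nearest-neighbour guess: give the unknown chunk the author of the labelled fingerprint it most resembles. The sketch below is only an illustration, not necessarily the method used in the project. It assumes that each author label on the slide belongs to the row of numbers printed before it, and it uses plain Euclidean distance; the function name `nearest_author` is invented for this example.

```python
import math

# Assumption: each author label belongs to the row of numbers printed before it.
labelled_chunks = [
    ([0.4, 0.2, 0.001, 0.002, 0.6], "AuthorA"),
    ([0.3, 0.15, 0.0, 0.1, 0.5], "AuthorA"),
    ([0.2, 0.2, 0.001, 0.002, 0.5], "AuthorB"),
]
unknown_chunk = [0.2, 0.15, 0.0, 0.002, 0.6]  # text chunk 4


def nearest_author(fingerprint, labelled):
    """Return the label of the labelled fingerprint closest in Euclidean distance."""
    closest = min(labelled, key=lambda pair: math.dist(fingerprint, pair[0]))
    return closest[1]


guess = nearest_author(unknown_chunk, labelled_chunks)
print(guess)
```

With only three labelled chunks the answer carries very little weight; a real study would use many chunks per author and check the accuracy on held-out text.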
27Did the Dow Jones go up or down in the following
week?
28Down
29Will the Dow Jones go up or down tomorrow?
30Data Mining Tasks
- Classification. Example: high risk for cancer or not
- Estimation/Prediction. Example: household income / sales
- Association Rules. Example: people who buy X often also buy Y, with a probability of Z
- Clustering: similar to classification but with no predefined classes; identifies meaningful segments of a dataset, discovers structure in data
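The association-rule example ("people who buy X often also buy Y with a probability of Z") can be made concrete with a small sketch. The baskets below are invented for illustration, and `rule_confidence` is a hypothetical helper that estimates Z as the fraction of baskets containing X that also contain Y.

```python
# Hypothetical shopping baskets, invented for this example.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]


def rule_confidence(baskets, x, y):
    """Estimate P(y in basket | x in basket) as a fraction of baskets."""
    with_x = [b for b in baskets if x in b]
    if not with_x:
        return 0.0  # x never bought: no evidence for the rule
    return sum(1 for b in with_x if y in b) / len(with_x)


z = rule_confidence(baskets, "bread", "butter")
print(f"people who buy bread also buy butter with probability {z:.2f}")
```

Here bread appears in four baskets and butter in three of those, so the estimated Z is 0.75.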
31Data Warehousing
- Note that Data Mining is very generic and can be used for detecting patterns in almost any data
- Retail data
- Genomes
- Climate data
- Etc.
- Data Warehousing, on the other hand, is almost exclusively used to describe the storage of data in the commercial sector
32What you should do this week
- Browse the UCI Machine Learning repository datasets and associated information; get acquainted with data
- Browse the statlib datasets archive; get acquainted with that too.
- Browse the http://www.kaggle.com/ website, to give you some idea of how hot data mining is
- And then
33Coursework A (0 marks, but you fail if you don't submit an adequate attempt)
- Find three other dataset repositories as follows:
- One that specialises in sports data
- One that specialises in time series data
- One that specialises in anything else that is interesting.
- For each of these three, tell me the URL, and write one paragraph, 100 words, in your own words, describing the contents of this repository.
- Submit on or before 23:59 on Friday October 11th
34Au revoir
35If interested
- Some slides about data warehousing. I don't consider this an essential part of this module, but they are here in case you want to know what data warehousing is
36Data Warehousing - Definitions
- "A subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management's decision making process"
- W. H. Inmon, "What is a Data Warehouse?" Prism Tech Topic, Vol. 1, No. 1, 1995 (a very influential definition).
- "A copy of transaction data, specifically structured for query and analysis"
- Ralph Kimball, from his 2000 book, The Data Warehouse Toolkit
37Data Warehouse why?
- For organisational learning to take place, data from many sources must be gathered together over time and organised in a consistent and useful way
- Data Warehousing allows an organisation to remember its data and what it has learned about its data
- Data Mining techniques make use of the data in a Data Warehouse and subsequently add their results to it
39Data Warehouse - Contents
- A Data Warehouse is a copy of transaction data specifically structured for querying, analysis and reporting
- The data will normally have been transformed when it was copied into the Data Warehouse
- The contents of a Data Warehouse, once acquired, are fixed and cannot be updated or changed later by the transaction system, but they can be added to of course
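These three properties (a transformed copy, fixed once loaded, but open to additions) can be sketched as a toy append-only store. Everything here is hypothetical and chosen for illustration: the `Warehouse` and `SaleRecord` names, the fields, and the particular transformation are assumptions, not a description of any real warehouse product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a stored record cannot be modified afterwards
class SaleRecord:
    store: str
    item: str
    amount_pence: int


class Warehouse:
    """Toy append-only store: records can be added, never changed or removed."""

    def __init__(self):
        self._records = []

    def load(self, raw_transaction):
        # Transform on the way in (here: tidy the text, convert pounds to pence).
        record = SaleRecord(
            store=raw_transaction["store"].strip().upper(),
            item=raw_transaction["item"].strip().lower(),
            amount_pence=round(raw_transaction["amount_pounds"] * 100),
        )
        self._records.append(record)

    def records(self):
        return tuple(self._records)  # read-only view for querying


warehouse = Warehouse()
warehouse.load({"store": " edinburgh ", "item": "Paint", "amount_pounds": 12.5})
warehouse.load({"store": "glasgow", "item": "Brushes", "amount_pounds": 3.0})
print(warehouse.records())
```

Queries read from `records()`, while the only way to change the contents is to `load` further transactions.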
40Data Marts
- A Data Mart is a smaller, more focused Data Warehouse: a mini-warehouse
- A Data Mart will normally reflect the business rules of a specific business unit within an enterprise, identifying data relevant to that unit's activities
41From Data Warehousing to Machine Learning, via Data Marts
42The Big Challenge for Data Mining
- The largest challenge that a Data Miner may face is the sheer volume of data in the Data Warehouse
- It is very important, then, that summary data also be available to get the analysis started
- The sheer volume of data may mask the important relationships in which the Data Miner is interested
- Being able to overcome the volume and interpret the data is essential to successful Data Mining
43What happens in practice
- Data Miners, both farmers and explorers, are expected to utilise Data Warehouses to give guidance and answer a limitless variety of questions
- The value of a Data Warehouse and Data Mining lies in a new and changed appreciation of the meaning of the data
- There are limitations though: a Data Warehouse cannot correct problems with its data, although it may help to more clearly identify them