David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com - PowerPoint PPT Presentation

1
Data Mining and Machine Learning
  • DM Lecture 1: Overview of DM, and overview of
    the DM part of the DMML module

2
Overview of My Lectures
  • Bookmark this:
  • http://www.macs.hw.ac.uk/dwcorne/Teaching/dmml.html
  • Lecture 1: about data and data mining
  • Lectures 2 and 3: Basic and useful ways to
    process and understand data
  • Lectures 4, 5, 6, 7, 8: Details of useful
    algorithms for finding knowledge from data
  • Lecture 9: overview of what else there is.

3
Module assessment
  • 100% by coursework
  • Three main items of coursework:
  • 40% (DM), 40% (DM), 20% (ML)
  • Two small items of coursework (A and B), worth
    nothing, but if you don't do them adequately you
    fail the module.

4
This Semester
  • DWC lectures on Mondays (data mining)
  • PC lectures on Fridays (machine learning)
  • Thursday slot usually unused; we may use it, and
    will let you know in advance

5
Coursework submission
  • ALL coursework must be submitted as follows:
  • as PDF
  • by email to dwcorne_at_gmail.com
  • the c/w is an attachment
  • Subject line: DMML Coursework A
  • (or B, 1, 2, 3)
  • Body of the email includes your Name and your
    Course (e.g. Joe Smith, BSc CS; Jill Brown, MSc
    AI)

6
DWC lectures and c/w, key dates
week  date               Monday 12:15 EM183  Thursday 13:15 EM183  Friday 12:15 EM183  coursework
1     w/b Mon 12th Sep   David               -                     Paul                DC Coursework A handout
2     w/b Mon 19th Sep   -                   -                     Paul                -
3     w/b Mon 26th Sep   David               -                     Paul                DC Coursework B handout
4     w/b Mon 3rd Oct    David               -                     Paul                DC Coursework 1 handout
5     w/b Mon 10th Oct   David               -                     Paul                DC Courseworks A and B hand-in; PC Coursework 2 handout?
6     w/b Mon 17th Oct   -                   -                     -                   -
7     w/b Mon 24th Oct   -                   -                     Paul                DC Coursework 1 hand-in
8     w/b Mon 31st Oct   David               -                     Paul                DC Coursework 3 handout
9     w/b Mon 7th Nov    David               -                     Paul                PC Coursework 2 hand-in
10    w/b Mon 14th Nov   David               -                     -                   -
11    w/b Mon 21st Nov   David               -                     -                   -
12    w/b Mon 28th Nov   David               -                     -                   DC Coursework 3 hand-in
7
At last, the lecture
8
What some people think can be done with data
  • Answer simple questions like:
  • How many female clients do we have?
  • How much paint did we sell in 2007?
  • Which is the most profitable branch of our
    supermarket?
  • Which postcodes suffered the most dropped calls
    in July?

9
that is so

10
that is so
  • Boring

11
More interesting things that can be done with data
  • Answer difficult and valuable questions like:
  • How can we predict ovarian cancer early enough to
    treat it successfully?
  • How can I make significant profit on the stock
    market next month?
  • Two different authors claim to have written this
    story; how can we resolve the dispute?
  • How can we get our customers to spend more money
    in the store?
  • Is this loan applicant a good credit risk?
  • Is this sonar image a mine, or a rock?
  • What other websites will this browser be
    interested in?

12
Data Mining - Definition and Goal
  • Definition
  • Data Mining is the exploration and analysis of
    large quantities of data in order to discover
    meaningful patterns and rules
  • Goal
  • To permit some other goal to be achieved or
    performance to be improved through a better
    understanding of the data

13
Some examples of large databases
  • Retail basket data: much commercial DM is done
    with this. In one store, 18,000 baskets per month
  • Tesco has >500 stores. Per year, 100,000,000
    baskets?
  • The Internet: >15,000,000,000 pages
  • Lots of datasets: UCI Machine Learning
    repository
  • How can we begin to understand and exploit such
    datasets? Especially the big ones?
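
As a rough check of the Tesco figure above, the per-store rate can be scaled up to a year and to 500 stores. This is only back-of-envelope arithmetic: it assumes every store sees the same 18,000 baskets per month, which the slide does not claim.

```python
# Back-of-envelope estimate; the uniform per-store rate is an assumption.
baskets_per_store_per_month = 18_000
stores = 500           # the slide says ">500", so this is a lower bound
months_per_year = 12

baskets_per_year = baskets_per_store_per_month * months_per_year * stores
print(f"{baskets_per_year:,}")  # 108,000,000
```

Under that assumption the total is about 108 million baskets per year, the same order of magnitude as the 100,000,000 quoted on the slide.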

14
Like this
15
and this
16
and this
17
or this
  • see http://websom.hut.fi/websom/milliondemo/html/root.html
18
Data Mining - Basics
  • Data Mining is the process of discovering
    patterns and inferring associations in raw data
  • Data Mining is a collection of techniques
    intended to analyse small or large amounts of
    data
  • Data Mining can employ a range of techniques,
    either individually or in combination with each
    other

19
Data Mining: Why is it important?
  • Data are being generated in enormous quantities
  • Data are being collected over long periods of
    time
  • Data are being kept for long periods of time
  • Computing power is formidable and cheap
  • A variety of Data Mining software is available
  • All of these data contain hidden knowledge:
    facts, rules, patterns, that can be usefully
    exploited if we can find them.

20
Data Mining: History
  • The approach has its roots over 40 years ago
  • In the early 1960s Data Mining was called
    statistical analysis, and the pioneers were
    statistical software companies such as SPSS
  • By the late 1980s these traditional techniques
    had been augmented by new methods such as machine
    induction, artificial neural networks,
    evolutionary computing, etc.

21
(No Transcript)
22
Some basic terminology
Gender  weight  height  Age in mths  100m time
Male    52kg    1.71m   243          13.7s
Male    89kg    1.92m   388          22.3s
Female  48kg    1.67m   219          14.6s
Male    86kg    1.96m   274          9.58s
Male    80kg    1.88m   260          10.56s
etc     etc     etc     etc          etc
23
This is called a data instance, or a record, or
just a line of data
Gender  weight  height  Age in mths  100m time
Male    52kg    1.71m   243          13.7s
Male    89kg    1.92m   388          22.3s
Female  48kg    1.67m   219          14.6s
Male    86kg    1.96m   274          9.58s
Male    80kg    1.88m   260          10.56s
etc     etc     etc     etc          etc
24
This is called a field or an attribute; the value
of the Age field in the 4th record is 274
Gender  weight  height  Age in mths  100m time
Male    52kg    1.71m   243          13.7s
Male    89kg    1.92m   388          22.3s
Female  48kg    1.67m   219          14.6s
Male    86kg    1.96m   274          9.58s
Male    80kg    1.88m   260          10.56s
etc     etc     etc     etc          etc
25
Usually we are interested in predicting the value
of a particular field, given the values of the
other fields. What we want to predict is called
the class field, or the target class.
Gender  weight  height  Age in mths  100m time
Male    52kg    1.71m   243          13.7s
Male    89kg    1.92m   388          22.3s
Female  48kg    1.67m   219          14.6s
Male    86kg    1.96m   274          9.58s
Male    80kg    1.88m   260          10.56s
etc     etc     etc     etc          etc
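
The terminology on the last few slides can be made concrete with a short sketch. The names `FIELDS`, `records` and `CLASS_FIELD` are hypothetical, and treating the 100m time as the class field is an assumption made only for illustration; the slides do not say which field is the target.

```python
# Field (attribute) names for the table; the identifiers are illustrative.
FIELDS = ["gender", "weight_kg", "height_m", "age_months", "time_100m_s"]

# Each tuple is one record (data instance, line of data).
records = [
    ("Male",   52, 1.71, 243, 13.7),
    ("Male",   89, 1.92, 388, 22.3),
    ("Female", 48, 1.67, 219, 14.6),
    ("Male",   86, 1.96, 274, 9.58),
    ("Male",   80, 1.88, 260, 10.56),
]

# The value of the Age field in the 4th record.
fourth_record = records[3]
age_of_fourth = fourth_record[FIELDS.index("age_months")]
print(age_of_fourth)  # 274

# Assumed class field: the value we would like to predict from the others.
CLASS_FIELD = "time_100m_s"
class_index = FIELDS.index(CLASS_FIELD)

# Split every record into its input fields and its class value.
inputs = [r[:class_index] + r[class_index + 1:] for r in records]
targets = [r[class_index] for r in records]
print(targets)  # [13.7, 22.3, 14.6, 9.58, 10.56]
```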
26
Some data-mining related projects that I am
currently working on (either myself, or with a
PhD student or RA)
  • Predicting whether or not two textures will be
    considered similar by humans.
  • Predicting which of two or more writers is the
    author of a given piece of text (you will do
    some work on this)
  • Discovering which subsets of many thousands of
    genes play a role in specific diseases (cancer,
    diabetes, etc) (you may do a little work on
    this too)
  • Discovering technical trading rules for stock
    market trading (you may do a little work on
    this too)
27
Which pair of textures is most similar?
28
Which pair of textures is most similar?
A line of data:
0.23 1.88 9.64 3.22 7.1 1086.9 2.23 0.76
  • 5,000 features for texture1
  • 5,000 features for texture2
  • Final value (0.76 in this line): percentage of
    people who think they are similar
29
Who wrote text chunk 4?
Chunk  Word usage fingerprint          Author
1      0.4   0.2   0.001  0.002  0.6   AuthorA
2      0.3   0.15  0      0.1    0.5   AuthorA
3      0.2   0.2   0.001  0.002  0.5   AuthorB
4      0.2   0.15  0      0.002  0.6   ?
Word usage: fingerprint of a 1,000 word chunk
of text
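
One simple way to attack this question is a nearest-neighbour guess: give chunk 4 the author of the labelled chunk whose fingerprint is closest. The sketch below reads the slide as four rows of five word-usage values, each row ending in its author label. Nearest-neighbour with Euclidean distance is an illustrative choice here, not necessarily the method used in the lectures or the coursework, and `nearest_author` is a hypothetical helper name.

```python
import math

# Word-usage fingerprints from the slide; the authors of chunks 1-3 are known.
labelled = [
    ((0.4, 0.2, 0.001, 0.002, 0.6), "AuthorA"),
    ((0.3, 0.15, 0.0, 0.1, 0.5), "AuthorA"),
    ((0.2, 0.2, 0.001, 0.002, 0.5), "AuthorB"),
]
chunk4 = (0.2, 0.15, 0.0, 0.002, 0.6)


def nearest_author(fingerprint, examples):
    """Return the author of the labelled example closest in Euclidean distance."""
    _, author = min(examples, key=lambda ex: math.dist(fingerprint, ex[0]))
    return author


print(nearest_author(chunk4, labelled))  # AuthorB
```

With only three labelled chunks this is a toy illustration of the idea, not evidence about who actually wrote the text.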
30
Did the Dow Jones go up or down in the following
week?
31
Down
32
Will the Dow Jones go up or down tomorrow?
33
Data Mining: Two Major Types
  • Directed (Farming): Attempts to explain or
    categorise some particular target field such as
    income, medical disorder, genetic characteristic,
    etc.
  • Undirected (Exploring): Attempts to find
    patterns or similarities among groups of records
    without the use of a particular target field or
    collection of predefined classes
  • Compare with Supervised and Unsupervised systems
    in machine learning
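
The difference between the two types can be sketched with the gender and height fields from the terminology table earlier in the lecture. In the directed case a target field (gender, chosen here only as an example) guides the analysis. In the undirected case the target is ignored and the records are grouped by a crude rule: split the sorted heights at the widest gap. That rule is an illustrative stand-in for a real clustering algorithm, not a method from the lecture.

```python
# (gender, height in metres) taken from the terminology table.
records = [
    ("Male", 1.71),
    ("Male", 1.92),
    ("Female", 1.67),
    ("Male", 1.96),
    ("Male", 1.88),
]

# Directed: use the target field (gender) to summarise height per class.
by_class = {}
for gender, height in records:
    by_class.setdefault(gender, []).append(height)
class_means = {g: sum(hs) / len(hs) for g, hs in by_class.items()}

# Undirected: ignore gender and split the sorted heights at the widest gap.
heights = sorted(height for _, height in records)
gaps = [b - a for a, b in zip(heights, heights[1:])]
cut = gaps.index(max(gaps)) + 1
groups = [heights[:cut], heights[cut:]]

print(class_means)
print(groups)  # [[1.67, 1.71], [1.88, 1.92, 1.96]]
```

Note that the undirected grouping puts one male record with the female record: groups found without a target field need not line up with any predefined class.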

34
Data Mining Tasks
  • Classification - Example: high risk for cancer or
    not
  • Estimation - Example: household income
  • Prediction - Example: credit card balance
    transfer average amount
  • Affinity Grouping - Example: people who buy X,
    often also buy Y with a probability of Z
  • Clustering - similar to classification but no
    predefined classes
  • Description and Profiling - Identifying
    characteristics which explain behaviour.
    Example: More men watch football on TV than
    women
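
For the affinity grouping task, the "probability of Z" can be estimated as a relative frequency: of the baskets that contain X, what fraction also contain Y? The baskets below are invented purely for illustration, and `confidence` is a hypothetical helper name.

```python
# Toy baskets, invented for illustration only.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]


def confidence(x, y, baskets):
    """Estimate P(y in basket | x in basket) as a relative frequency."""
    with_x = [b for b in baskets if x in b]
    if not with_x:
        return 0.0  # x never bought, so there is nothing to estimate from
    return sum(1 for b in with_x if y in b) / len(with_x)


z = confidence("bread", "butter", baskets)
print(z)  # 0.75
```

Here four baskets contain bread and three of those also contain butter, so people who buy bread also buy butter with an estimated probability of 0.75.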

35
Data Warehousing
  • Note that Data Mining is very generic and can be
    used for detecting patterns in almost any data
  • Retail data
  • Genomes
  • Climate data
  • Etc.
  • Data Warehousing, on the other hand, is almost
    exclusively used to describe the storage of data
    in the commercial sector

36
What you should do this week
  • Browse the UCI Machine Learning repository
    datasets and associated information; get
    acquainted with data
  • Browse the statlib datasets archive, get
    acquainted with that too.
  • Browse the http://www.kaggle.com/ website - to
    give you some idea of how hot data mining is
  • And then

37
Coursework A (0 marks, but you fail if you don't
submit an adequate attempt)
  • Find three other dataset repositories as follows:
  • One that specialises in financial data
  • One that specialises in time series data
  • One that specialises in anything else.
  • For each of these three, tell me the URL, and
    write one paragraph, 100 words, in your own
    words, describing the contents of this
    repository.
  • Submit on or before Friday October 14th

38
Au revoir
39
If time available
  • Some slides about data warehousing. I don't
    consider this an essential part of this module,
    but in case you want to know what data
    warehousing is ...

40
Data Warehousing - Definitions
  • A subject-oriented, integrated, time-variant and
    nonvolatile collection of data in support of
    management's decision making process
  • W. H. Inmon, "What is a Data Warehouse?" Prism
    Tech Topic, Vol. 1, No. 1, 1995 -- a very
    influential definition.
  • A copy of transaction data, specifically
    structured for query and analysis
  • Ralph Kimball, from his 2000 book, The Data
    Warehouse Toolkit

41
Data Warehouse: why?
  • For organisational learning to take place, data
    from many sources must be gathered together over
    time and organised in a consistent and useful way
  • Data Warehousing allows an organisation to
    remember its data and what it has learned about
    its data
  • Data Mining techniques make use of the data in a
    Data Warehouse and subsequently add their results
    to it

42
(No Transcript)
43
Data Warehouse - Contents
  • A Data Warehouse is a copy of transaction data
    specifically structured for querying, analysis
    and reporting
  • The data will normally have been transformed
    when it was copied into the Data Warehouse
  • The contents of a Data Warehouse, once acquired,
    are fixed and cannot be updated or changed later
    by the transaction system - but they can be added
    to of course

44
Data Marts
  • A Data Mart is a smaller, more focused Data
    Warehouse: a mini-warehouse
  • A Data Mart will normally reflect the business
    rules of a specific business unit within an
    enterprise, identifying data relevant to that
    unit's activities

45
From Data Warehousing to Machine Learning, via
Data Marts
46
The Big Challenge for Data Mining
  • The largest challenge that a Data Miner may face
    is the sheer volume of data in the Data Warehouse
  • It is very important, then, that summary data
    also be available to get the analysis started
  • The sheer volume of data may mask the important
    relationships in which the Data Miner is
    interested
  • Being able to overcome the volume and interpret
    the data is essential to successful Data Mining

47
What happens in practice
  • Data Miners, both farmers and explorers, are
    expected to utilise Data Warehouses to give
    guidance and answer a limitless variety of
    questions
  • The value of a Data Warehouse and Data Mining
    lies in a new and changed appreciation of the
    meaning of the data
  • There are limitations though - A Data Warehouse
    cannot correct problems with its data, although
    it may help to more clearly identify them