David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com - PowerPoint PPT Presentation


1
Data Mining (and machine learning)
  • DM Lecture 1: Overview of DM, and overview of
    the DM part of the DMML module
  • Some of these slides are derived from Nick
    Taylor's slides used for this module in previous
    years

2
Overview of My Lectures
  • All at http://www.macs.hw.ac.uk/dwcorne/Teaching/dmml.html
  • Lecture 1: about data and data mining
  • Lectures 2 and 3: basic and useful ways to
    process and understand data
  • Lectures 4, 5, 6, 7, 8: details of useful
    algorithms for finding knowledge from data
  • Lecture 9: overview of what else there is.

3
Module assessment
  • 100% by coursework
  • Two main items of coursework, 50% each
  • Four small items of coursework, worth nothing,
    but if you don't do them adequately you fail the
    module.

4
This Semester
  • PDW lectures on Mondays (machine learning)
  • DWC lectures on Thursdays (data mining)
  • Friday slot usually unused; we may use it, and
    will let you know in advance
  • All coursework set by DWC

5
Coursework submission
  • ALL coursework must be submitted as follows:
  • as PDF
  • by email to dwcorne_at_gmail.com
  • the c/w is an attachment
  • Subject line: DMML Coursework A
  • (or B, C, D, 1, 2)
  • Body of the email includes your Name and your
    Course (e.g. Joe Smith, BSc CS; Jill Brown, MSc
    AI)

6
DWC lectures and c/w, key dates
  • Thur Sep 17th: This lecture. Handout: C/W A
  • Thur Sep 24th: Lecture. Handout: C/W B
  • Thur Oct 1st: Lecture. Handout: Main C/W 1 (50%)
  • Thur Oct 8th: Lecture
  • Thur Oct 15th: Lecture
  • Thur Oct 22nd: Lecture. Handout: Main C/W 2 (50%)
  • Thur Oct 29th: NO LECTURE (hand in C/W A, B and 1 on Fri 30th)
  • Thur Nov 5th: NO LECTURE
  • Thur Nov 12th: Lecture. Handout: C/W C. C/W 1 vivas on Fri 13th
  • Thur Nov 19th: Lecture. Handout: C/W D
  • Thur Nov 26th: Lecture (hand in C/W C, D and 2 on Fri 27th)
  • Thur Dec 3rd: C/W 2 vivas
7
At last, the lecture
8
What some people think can be done with data
  • Answer simple questions like:
  • How many female clients do we have?
  • How much paint did we sell in 2007?
  • Which is the most profitable branch of our
    supermarket?
  • Which postcodes suffered the most dropped calls
    in July?

9
that is so

10
that is so
  • Boring

11
More interesting things that can be done with data
  • Answer difficult and valuable questions like:
  • How can we predict ovarian cancer early enough to
    treat it successfully?
  • How can I make significant profit on the stock
    market next month?
  • Two different authors claim to have written this
    story; how can we resolve the dispute?
  • How can we get our customers to spend more money
    in the store?
  • Is this loan applicant a good credit risk?
  • Is this sonar image a mine, or a rock?
  • What other websites will this browser be
    interested in?

12
Data Mining - Definition & Goal
  • Definition:
  • Data Mining is the exploration and analysis of
    large quantities of data in order to discover
    meaningful patterns and rules
  • Goal:
  • To permit some other goal to be achieved or
    performance to be improved through a better
    understanding of the data

13
Some examples of large databases
  • Retail basket data: much commercial DM is done
    with this. In one store, 18,000 baskets per month
  • Tesco has >500 stores. Per year, 100,000,000
    baskets?
  • The Internet: >15,000,000,000 pages
  • Lots of datasets: UCI Machine Learning
    repository
  • How can we begin to understand and exploit such
    datasets? Especially the big ones?
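As a rough check on the Tesco estimate above, the arithmetic can be sketched as follows. It assumes, purely for illustration, that every one of the 500 stores sees the same 18,000 baskets per month quoted for a single store; the variable names are made up for this sketch.

```python
# Back-of-envelope estimate of retail baskets per year.
# Assumption (not from the slides): every store matches the single-store figure.
baskets_per_store_per_month = 18_000
stores = 500            # the slide says "more than 500", so this is a lower bound
months_per_year = 12

baskets_per_year = baskets_per_store_per_month * months_per_year * stores
print(f"{baskets_per_year:,} baskets per year")
```

Under these assumptions the total comes out slightly above the 100,000,000 figure on the slide, so the slide's number is a plausible order-of-magnitude estimate.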

14
Like this
15
and this
16
and this
17
or this
  • see http://websom.hut.fi/websom/milliondemo/html/root.html
18
Data Mining - Basics
  • Data Mining is the process of discovering
    patterns and inferring associations in raw data
  • Data Mining is a collection of techniques
    intended to analyse small or large amounts of
    data
  • There is no single Data Mining approach
  • Data Mining can employ a range of techniques,
    either individually or in combination with each
    other

19
Data Mining - Why is it important?
  • Data are being generated in enormous quantities
  • Data are being collected over long periods of
    time
  • Data are being kept for long periods of time
  • Computing power is formidable and cheap
  • A variety of Data Mining software is available
  • All of these data contain hidden knowledge:
    facts, rules, patterns, that can be usefully
    exploited if we can find them.

20
Data Mining - History
  • The approach has its roots over 40 years ago
  • In the 1960s Data Mining was called
    statistical analysis, and the pioneers were
    statistical software companies such as SPSS
  • By the late 1980s these traditional techniques
    had been augmented by new methods such as machine
    induction, artificial neural networks,
    evolutionary computing, etc.

22
Some basic terminology
Gender  Weight  Height  Age (months)  100m time
Male    52kg    1.71m   243           13.7s
Male    89kg    1.92m   388           22.3s
Female  48kg    1.67m   219           14.6s
Male    86kg    1.96m   274           9.58s
Male    80kg    1.88m   260           10.56s
etc     etc     etc     etc           etc
23
Each row of the table is called a data instance,
or a record, or just a line of data
Gender  Weight  Height  Age (months)  100m time
Male    52kg    1.71m   243           13.7s
Male    89kg    1.92m   388           22.3s
Female  48kg    1.67m   219           14.6s
Male    86kg    1.96m   274           9.58s
Male    80kg    1.88m   260           10.56s
etc     etc     etc     etc           etc
24
Each column is called a field or an attribute; for
example, the value of the Age field in the 4th
record is 274
Gender  Weight  Height  Age (months)  100m time
Male    52kg    1.71m   243           13.7s
Male    89kg    1.92m   388           22.3s
Female  48kg    1.67m   219           14.6s
Male    86kg    1.96m   274           9.58s
Male    80kg    1.88m   260           10.56s
etc     etc     etc     etc           etc
25
Usually we are interested in predicting the value
of a particular field, given the values of the
other fields. What we want to predict is called
the class field, or the target class
Gender  Weight  Height  Age (months)  100m time
Male    52kg    1.71m   243           13.7s
Male    89kg    1.92m   388           22.3s
Female  48kg    1.67m   219           14.6s
Male    86kg    1.96m   274           9.58s
Male    80kg    1.88m   260           10.56s
etc     etc     etc     etc           etc
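The terminology above (record, field, target class) can be sketched in code. This is only an illustration: the dictionary keys are invented names for the table's columns, and treating the 100m time as the field to predict is an assumption made for the example, not something the slides state.

```python
# The five example records from the table, one dict per record (row).
# Each key is a field (attribute); the key names are hypothetical.
records = [
    {"gender": "Male",   "weight_kg": 52, "height_m": 1.71, "age_months": 243, "time_100m_s": 13.7},
    {"gender": "Male",   "weight_kg": 89, "height_m": 1.92, "age_months": 388, "time_100m_s": 22.3},
    {"gender": "Female", "weight_kg": 48, "height_m": 1.67, "age_months": 219, "time_100m_s": 14.6},
    {"gender": "Male",   "weight_kg": 86, "height_m": 1.96, "age_months": 274, "time_100m_s": 9.58},
    {"gender": "Male",   "weight_kg": 80, "height_m": 1.88, "age_months": 260, "time_100m_s": 10.56},
]

# Assumption for illustration: the 100m time is the field we want to predict.
target_field = "time_100m_s"


def split_record(record, target):
    """Separate one record into its input fields and its target value."""
    inputs = {name: value for name, value in record.items() if name != target}
    return inputs, record[target]


# Value of the Age field in the 4th record (index 3, since Python counts from 0).
age_of_fourth = records[3]["age_months"]

inputs, target_value = split_record(records[3], target_field)
print(age_of_fourth)
print(inputs)
print(target_value)
```

Any other field could be chosen as the target instead; which one is "the class field" depends on the question being asked of the data.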
26
Some data-mining related projects that I am
currently working on (either myself, or with a
PhD student or RA)
  • Predicting whether or not two textures will be
    considered similar by humans.
  • Predicting which of two or more writers is the
    author of a given piece of text (you will do
    some work on this)
  • Discovering which subsets of many thousands of
    genes play a role in specific diseases (cancer,
    diabetes, etc.) (you will do a little work on
    this too)
  • Discovering technical trading rules for stock
    market trading (you will do a little work on
    this too)
27
Which pair of textures is most similar?
28
Which pair of textures is most similar?
A line of data:
0.23 1.88 9.64 3.22 7.1 1086.9 2.23 0.76
  • 5,000 features for texture1
  • 5,000 features for texture2
  • Last value (0.76): percentage of people who think they are similar
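A minimal sketch of how one such line of data might be assembled, assuming the layout is the 5,000 features of the first texture, then the 5,000 features of the second, then the similarity figure as the final value. The ordering, the function name, and the random placeholder features are all assumptions for illustration; real texture features would come from image analysis.

```python
import random

N_FEATURES = 5_000  # features per texture, as stated on the slide


def make_line(features1, features2, fraction_similar):
    """Concatenate both textures' features and append the target value."""
    if len(features1) != N_FEATURES or len(features2) != N_FEATURES:
        raise ValueError("each texture needs exactly 5,000 features")
    return list(features1) + list(features2) + [fraction_similar]


# Placeholder feature values: random numbers stand in for real texture features.
rng = random.Random(0)
texture1 = [rng.random() for _ in range(N_FEATURES)]
texture2 = [rng.random() for _ in range(N_FEATURES)]

line = make_line(texture1, texture2, 0.76)
print(len(line))   # number of values in one line of data
print(line[-1])    # the target: how similar people judged the pair
```

Each line therefore has 10,001 values, which gives a sense of why datasets like this are hard to inspect by eye.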
29
Who wrote text chunk 4?
0.4 0.2 0.001 0.002 0.6    AuthorA
0.3 0.15 0 0.1 0.5         AuthorA
0.2 0.2 0.001 0.002 0.5    AuthorB
0.2 0.15 0 0.002 0.6       ?
Word usage: fingerprint of a 1,000-word chunk
of text
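One simple way to attack this question, not necessarily the method used in this module, is to give the unknown chunk the author of the most similar labelled chunk (a nearest-neighbour rule). The sketch below assumes each row on the slide is five word-usage values followed by its author label, and uses Euclidean distance as the similarity measure; both are assumptions for illustration.

```python
import math

# Labelled word-usage fingerprints read from the slide: (feature vector, author).
labelled = [
    ((0.4, 0.2, 0.001, 0.002, 0.6), "AuthorA"),
    ((0.3, 0.15, 0.0, 0.1, 0.5), "AuthorA"),
    ((0.2, 0.2, 0.001, 0.002, 0.5), "AuthorB"),
]
unknown = (0.2, 0.15, 0.0, 0.002, 0.6)  # text chunk 4


def nearest_author(query, examples):
    """Return the author of the labelled example closest to the query."""
    closest = min(examples, key=lambda ex: math.dist(query, ex[0]))
    return closest[1]


predicted = nearest_author(unknown, labelled)
print(predicted)
```

With only three labelled chunks the answer carries very little weight; a real study would use many chunks per author and far more than five word-usage features.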
30
Did the Dow Jones go up or down in the following
week?
31
Down
32
Will the Dow Jones go up or down tomorrow?
33
Data Mining - Two Major Types
  • Directed (Farming): Attempts to explain or
    categorise some particular target field such as
    income, medical disorder, genetic characteristic,
    etc.
  • Undirected (Exploring): Attempts to find
    patterns or similarities among groups of records
    without the use of a particular target field or
    collection of predefined classes
  • Compare with Supervised and Unsupervised systems
    in machine learning

34
Data Mining Tasks
  • Classification - Example: high risk for cancer or
    not
  • Estimation - Example: household income
  • Prediction - Example: credit card balance
    transfer average amount
  • Affinity Grouping - Example: people who buy X
    often also buy Y, with a probability of Z
  • Clustering - similar to classification but with no
    predefined classes
  • Description and Profiling - Identifying
    characteristics which explain behaviour.
    Example: more men watch football on TV than
    women
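The affinity grouping example ("people who buy X often also buy Y with a probability of Z") can be made concrete with a small sketch. The baskets and item names below are invented for illustration, and Z is estimated as a plain relative frequency; real affinity analysis would use far more data and additional measures.

```python
# Hypothetical shopping baskets; the items are invented for illustration.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]


def probability_also_buys(baskets, x, y):
    """Estimate P(y in basket | x in basket) as a simple relative frequency."""
    with_x = [b for b in baskets if x in b]
    if not with_x:
        return 0.0  # x never bought, so there is nothing to estimate from
    return sum(1 for b in with_x if y in b) / len(with_x)


z = probability_also_buys(baskets, "bread", "butter")
print(z)
```

Here four baskets contain bread and three of those also contain butter, so the estimate of Z for "bread, then butter" is 3/4.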

35
Data Warehousing
  • Note that Data Mining is very generic and can be
    used for detecting patterns in almost any data:
  • Retail data
  • Genomes
  • Climate data
  • Etc.
  • Data Warehousing, on the other hand, is almost
    exclusively used to describe the storage of data
    in the commercial sector

36
What you should do this week
  • Browse the UCI Machine Learning repository
    datasets and associated information; get
    acquainted with data
  • Browse the StatLib datasets archive; get
    acquainted with that too.
  • And then...

37
Coursework A (0 marks, but you fail if you don't
submit an adequate attempt)
  • Find three other dataset repositories as follows:
  • One that specialises in financial data
  • One that specialises in time series data
  • One that specialises in anything else.
  • For each of these three, tell me the URL, and
    write one paragraph, 100 words, in your own
    words, describing the contents of this
    repository.
  • Submit on or before Friday October 30th

38
Au revoir
39
If time available
  • Some slides about data warehousing. I don't
    consider this an essential part of this module,
    but they are here in case you want to know what
    data warehousing is

40
Data Warehousing - Definitions
  • A subject-oriented, integrated, time-variant and
    nonvolatile collection of data in support of
    management's decision making process
  • W. H. Inmon, "What is a Data Warehouse?" Prism
    Tech Topic, Vol. 1, No. 1, 1995 -- a very
    influential definition.
  • A copy of transaction data, specifically
    structured for query and analysis
  • Ralph Kimball, from his 2000 book, The Data
    Warehouse Toolkit

41
Data Warehouse - why?
  • For organisational learning to take place, data
    from many sources must be gathered together over
    time and organised in a consistent and useful way
  • Data Warehousing allows an organisation to
    remember its data and what it has learned about
    its data
  • Data Mining techniques make use of the data in a
    Data Warehouse and subsequently add their results
    to it

43
Data Warehouse - Contents
  • A Data Warehouse is a copy of transaction data
    specifically structured for querying, analysis
    and reporting
  • The data will normally have been transformed
    when it was copied into the Data Warehouse
  • The contents of a Data Warehouse, once acquired,
    are fixed and cannot be updated or changed later
    by the transaction system, but they can be added
    to, of course

44
Data Marts
  • A Data Mart is a smaller, more focused Data
    Warehouse: a mini-warehouse
  • A Data Mart will normally reflect the business
    rules of a specific business unit within an
    enterprise, identifying data relevant to that
    unit's activities

45
From Data Warehousing to Machine Learning, via
Data Marts
46
The Big Challenge for Data Mining
  • The largest challenge that a Data Miner may face
    is the sheer volume of data in the Data Warehouse
  • It is very important, then, that summary data
    also be available to get the analysis started
  • The sheer volume of data may mask the important
    relationships in which the Data Miner is
    interested
  • Being able to overcome the volume and interpret
    the data is essential to successful Data Mining

47
What happens in practice
  • Data Miners, both farmers and explorers, are
    expected to utilise Data Warehouses to give
    guidance and answer a limitless variety of
    questions
  • The value of a Data Warehouse and Data Mining
    lies in a new and changed appreciation of the
    meaning of the data
  • There are limitations though: a Data Warehouse
    cannot correct problems with its data, although
    it may help to identify them more clearly