David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com - PowerPoint PPT Presentation

Data Mining and Machine Learning, Lecture 1: Why data is useful, and overview of DMML

1
Data Mining and Machine Learning
  • Lecture 1 Why data is useful, and overview of
    DMML

2
Overview of My Lectures
  • http://www.macs.hw.ac.uk/dwcorne/Teaching/dmml.html

3
Module assessment
  • 100% by coursework
  • Three main items of coursework:
  • CW 1: 30%; CW 2: 40%; CW 3: 30%
  • Two small items of coursework (A and B), worth
    0%, but if you don't do them adequately you fail
    the module.

4
Coursework submission
  • ALL coursework must be submitted as follows:
  • as PDF
  • by email to dwcorne_at_gmail.com
  • the c/w is an attachment
  • Subject line: DMML Coursework A
  • (or B, 1, 2, 3)
  • Body of the email includes your Name and your
    Course (e.g. Joe Smith, BSc CS; Jill Brown, MSc
    AI)

5
Office Hour Doodle Poll
http://doodle.com/ndb69faydc6ivttw

6
At last, the lecture
7
What some people think can be done with data
  • Answer simple questions like
  • How many female clients do we have?
  • How much paint did we sell in 2007?
  • Which is the most profitable branch of our
    supermarket?
  • Which postcodes suffered the most dropped calls
    in July?

8
that is so

9
that is so
  • Boring

10
More interesting things that can be done with data
  • Answer difficult and valuable questions like
  • How can we predict Ovarian cancer early enough to
    treat it successfully?
  • How can I make significant profit on the stock
    market next month?
  • Two different authors claim to have written this
    story; how can we resolve the dispute?
  • How can we get our customers to spend more money
    in the store?
  • Is this loan applicant a good credit risk?
  • Is this sonar image a mine, or a rock?
  • What other websites will this browser be
    interested in?

11
Some competitions at
12
Data Mining - Definition and Goal
  • Definition
  • Data Mining is the exploration and analysis of
    (often) large quantities of data in order to
    discover meaningful patterns and rules
  • Goal
  • To permit some other goal to be achieved or
    performance to be improved through a better
    understanding of the data

13
Some examples of large databases
  • Retail basket data: much commercial DM is done
    with this. In one store, 18,000 baskets per month
  • Tesco has >500 stores. Per year, 100,000,000
    baskets?
  • The Internet: >20,000,000,000 pages
  • Lots of datasets: UCI Machine Learning
    repository
  • How can we begin to understand and exploit such
    datasets? Especially the big ones?
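The yearly basket figure can be sanity-checked from the other numbers on this slide. The sketch below is only a back-of-envelope estimate: it assumes exactly 500 stores (the slide says more than 500) and that every store sees the same 18,000 baskets per month as the one quoted.

```python
# Figures quoted on the slide. Treating ">500 stores" as exactly 500, and every
# store as identical to the quoted one, are simplifying assumptions.
BASKETS_PER_STORE_PER_MONTH = 18_000
STORES = 500
MONTHS_PER_YEAR = 12


def baskets_per_year(per_store_per_month, stores, months=MONTHS_PER_YEAR):
    """Rough yearly basket count across all stores."""
    return per_store_per_month * stores * months


estimate = baskets_per_year(BASKETS_PER_STORE_PER_MONTH, STORES)
print(f"{estimate:,} baskets per year")
```

Under those assumptions the estimate lands in the same order of magnitude as the 100,000,000 quoted above.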

14
Like this
15
and this
16
or this
  • see http://websom.hut.fi/websom/milliondemo/html/root.html
17
Or this
18
Data Mining and Machine Learning - Basics
  • Data Mining is the process of discovering
    patterns and inferring associations in raw data
  • a collection of techniques intended to analyse
    small or large amounts of data
  • can employ a range of techniques, either
    individually or in combination with each other
  • Machine Learning is the same, but the term ML
    emphasises a range of more sophisticated
    algorithms that try to learn accurate predictive
    models of data

19
Data Mining: Why is it important?
  • Data are being generated in enormous quantities
  • Data are being collected over long periods of
    time
  • Data are being kept for long periods of time
  • Computing power is formidable and cheap
  • A variety of Data Mining software is available
  • All of these data contain hidden knowledge:
    facts, rules and patterns that can be usefully
    exploited if we can find them.

21
Some basic terminology
Gender   Weight   Height   Age in mths   100m time
Male     52kg     1.71m    243           13.7s
Male     89kg     1.92m    388           22.3s
Female   48kg     1.67m    219           14.6s
Male     86kg     1.96m    274           9.58s
Male     80kg     1.88m    260           10.56s
etc      etc      etc      etc           etc
22
This is called a data instance or a record or
just a line of data
Gender   Weight   Height   Age in mths   100m time
Male     52kg     1.71m    243           13.7s
Male     89kg     1.92m    388           22.3s
Female   48kg     1.67m    219           14.6s
Male     86kg     1.96m    274           9.58s
Male     80kg     1.88m    260           10.56s
etc      etc      etc      etc           etc
23
This is called a field or an attribute; the value
of the Age field in the 4th record is 274
Gender   Weight   Height   Age in mths   100m time
Male     52kg     1.71m    243           13.7s
Male     89kg     1.92m    388           22.3s
Female   48kg     1.67m    219           14.6s
Male     86kg     1.96m    274           9.58s
Male     80kg     1.88m    260           10.56s
etc      etc      etc      etc           etc
24
Usually we are interested in predicting the value
of a particular field, given the values of the
other fields. What we want to predict is called
the class field, or the target class
Gender   Weight   Height   Age in mths   100m time
Male     52kg     1.71m    243           13.7s
Male     89kg     1.92m    388           22.3s
Female   48kg     1.67m    219           14.6s
Male     86kg     1.96m    274           9.58s
Male     80kg     1.88m    260           10.56s
etc      etc      etc      etc           etc
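The terminology from the last few slides can be written down as a small Python sketch. The values are the ones in the table; the variable and function names are illustrative, and picking the 100m time as the class field is an assumption made for the example (the slide does not say which field is the target).

```python
# Column names follow the slide's table; these identifiers are illustrative.
FIELDS = ["gender", "weight_kg", "height_m", "age_months", "time_100m_s"]

# Each tuple is one data instance (a record, a line of data).
records = [
    ("Male", 52, 1.71, 243, 13.7),
    ("Male", 89, 1.92, 388, 22.3),
    ("Female", 48, 1.67, 219, 14.6),
    ("Male", 86, 1.96, 274, 9.58),
    ("Male", 80, 1.88, 260, 10.56),
]


def field_value(records, record_number, field):
    """Value of `field` in the given record, counting records from 1."""
    return records[record_number - 1][FIELDS.index(field)]


def split_target(records, target):
    """Separate the class field (what we want to predict) from the other fields."""
    t = FIELDS.index(target)
    inputs = [row[:t] + row[t + 1:] for row in records]
    targets = [row[t] for row in records]
    return inputs, targets


age_of_fourth = field_value(records, 4, "age_months")

# Assumption: the 100m time is the field we want to predict.
inputs, targets = split_target(records, "time_100m_s")

print(age_of_fourth)
print(inputs[0], "->", targets[0])
```

Any other field could be chosen as the target in the same way, for example predicting gender from the remaining four fields.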
25
Some data-mining related projects that I am
currently working on (either myself, or with a
PhD student or RA)
  • Analysing flow cytometry data to detect the
    presence of specific contaminants in sea-water
    samples
  • Predicting which of two or more writers is the
    author of a given piece of text
  • Discovering which subsets of many thousands of
    genes play a role in specific diseases (cancer,
    diabetes, etc.)
  • Discovering technical trading rules for stock
    market trading
26
Who wrote text chunk 4?
Word usage "fingerprint" of a 1,000 word chunk
of text
Chunk   Fingerprint                        Author
1       0.4   0.2    0.001   0.002   0.6   AuthorA
2       0.3   0.15   0       0.1     0.5   AuthorA
3       0.2   0.2    0.001   0.002   0.5   AuthorB
4       0.2   0.15   0       0.002   0.6   ?
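One simple way to attack this question is a nearest-neighbour guess: give the unknown chunk the author of the labelled chunk whose fingerprint is closest. This is a sketch of that idea only, not necessarily the method used in the project. It reads the slide as four rows of five numbers, with each author label belonging to the row of numbers printed just before it, which is an assumption about the original table layout.

```python
import math

# Rows as read from the slide. Pairing each author label with the row of numbers
# printed just before it is an assumption about the original table layout.
labelled_chunks = [
    ((0.4, 0.2, 0.001, 0.002, 0.6), "AuthorA"),
    ((0.3, 0.15, 0.0, 0.1, 0.5), "AuthorA"),
    ((0.2, 0.2, 0.001, 0.002, 0.5), "AuthorB"),
]
chunk_4 = (0.2, 0.15, 0.0, 0.002, 0.6)


def nearest_author(query, labelled):
    """Label of the labelled fingerprint closest to `query` (Euclidean distance)."""
    _, author = min(labelled, key=lambda item: math.dist(query, item[0]))
    return author


guess = nearest_author(chunk_4, labelled_chunks)
print(guess)
```

With only three labelled chunks this is a toy; a real attribution study would use many more chunks per author and many more word-usage features.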
27
Did the Dow Jones go up or down in the following
week?
28
Down
29
Will the Dow Jones go up or down tomorrow?
30
Data Mining Tasks
  • Classification - Example: high risk for cancer or
    not
  • Estimation/Prediction - Example: household income
    / sales
  • Association Rules - Example: people who buy X
    often also buy Y, with a probability of Z
  • Clustering - similar to classification but with no
    predefined classes; identifies meaningful
    segments of a dataset, discovers structure in
    data
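To make the association-rule example concrete, the sketch below estimates Z for a rule "people who buy X often also buy Y" as the fraction of baskets containing X that also contain Y (usually called the confidence of the rule). The baskets are invented for illustration, and the function names are hypothetical.

```python
# Hypothetical baskets, invented purely for illustration.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "jam"},
]


def count(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets)


def support(itemset, baskets):
    """Fraction of all baskets that contain every item in `itemset`."""
    return count(itemset, baskets) / len(baskets)


def confidence(x, y, baskets):
    """Of the baskets containing X, the fraction that also contain Y.

    Raises ZeroDivisionError if no basket contains X.
    """
    return count(x | y, baskets) / count(x, baskets)


rule_support = support({"bread", "butter"}, baskets)
z = confidence({"bread"}, {"butter"}, baskets)
print(f"bread -> butter: support {rule_support:.2f}, confidence {z:.2f}")
```

Support says how often the rule applies at all; confidence is the "probability of Z" in the bullet above. A rule is normally only interesting when both are reasonably high.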

31
Data Warehousing
  • Note that Data Mining is very generic and can be
    used for detecting patterns in almost any data
  • Retail data
  • Genomes
  • Climate data
  • Etc.
  • Data Warehousing, on the other hand, is almost
    exclusively used to describe the storage of data
    in the commercial sector

32
What you should do this week
  • Browse the UCI Machine Learning repository
    datasets and associated information; get
    acquainted with data
  • Browse the statlib datasets archive; get
    acquainted with that too.
  • Browse the http://www.kaggle.com/ website - to
    give you some idea of how hot data mining is
  • And then

33
Coursework A (0 marks, but you fail if you don't
submit an adequate attempt)
  • Find three other dataset repositories as follows:
  • One that specialises in sports data
  • One that specialises in time series data
  • One that specialises in anything else that is
    interesting.
  • For each of these three, tell me the URL, and
    write one paragraph of 100 words, in your own
    words, describing the contents of the
    repository.
  • Submit on or before 23:59 on Friday October 11th

34
Au revoir
35
If interested
  • Some slides about data warehousing. I don't
    consider this an essential part of this module,
    but they are here in case you want to know what
    data warehousing is

36
Data Warehousing - Definitions
  • A subject-oriented, integrated, time-variant and
    nonvolatile collection of data in support of
    management's decision making process
  • W. H. Inmon, "What is a Data Warehouse?" Prism
    Tech Topic, Vol. 1, No. 1, 1995 -- a very
    influential definition.
  • A copy of transaction data, specifically
    structured for query and analysis
  • Ralph Kimball, from his 2000 book, The Data
    Warehouse Toolkit

37
Data Warehouse - why?
  • For organisational learning to take place, data
    from many sources must be gathered together over
    time and organised in a consistent and useful way
  • Data Warehousing allows an organisation to
    remember its data and what it has learned about
    its data
  • Data Mining techniques make use of the data in a
    Data Warehouse and subsequently add their results
    to it

39
Data Warehouse - Contents
  • A Data Warehouse is a copy of transaction data
    specifically structured for querying, analysis
    and reporting
  • The data will normally have been transformed
    when it was copied into the Data Warehouse
  • The contents of a Data Warehouse, once acquired,
    are fixed and cannot be updated or changed later
    by the transaction system - but they can be added
    to, of course

40
Data Marts
  • A Data Mart is a smaller, more focused Data
    Warehouse: a mini-warehouse
  • A Data Mart will normally reflect the business
    rules of a specific business unit within an
    enterprise, identifying data relevant to that
    unit's activities

41
From Data Warehousing to Machine Learning, via
Data Marts
42
The Big Challenge for Data Mining
  • The largest challenge that a Data Miner may face
    is the sheer volume of data in the Data Warehouse
  • It is very important, then, that summary data
    also be available to get the analysis started
  • The sheer volume of data may mask the important
    relationships in which the Data Miner is
    interested
  • Being able to overcome the volume and interpret
    the data is essential to successful Data Mining
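As a rough illustration of what "summary data" can mean here, the sketch below collapses raw transaction records into one small summary row per branch, which is far easier to scan than the raw volume. The records, branch names and function name are all hypothetical.

```python
from collections import defaultdict

# Hypothetical transaction records: (branch, product, sale value in pounds).
transactions = [
    ("Edinburgh", "paint", 12.50),
    ("Glasgow", "paint", 8.00),
    ("Edinburgh", "brushes", 4.25),
    ("Glasgow", "paint", 16.00),
    ("Aberdeen", "paint", 9.75),
]


def summarise_by_branch(transactions):
    """Collapse raw transactions into one (sale count, total value) pair per branch."""
    running = defaultdict(lambda: [0, 0.0])
    for branch, _product, value in transactions:
        running[branch][0] += 1
        running[branch][1] += value
    return {branch: tuple(stats) for branch, stats in running.items()}


summary = summarise_by_branch(transactions)
for branch, (n_sales, total) in sorted(summary.items()):
    print(f"{branch}: {n_sales} sales, total {total:.2f}")
```

In a real warehouse such summaries would typically be precomputed and stored alongside the detailed data, so that analysis can start from them rather than from every individual transaction.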

43
What happens in practice
  • Data Miners, both farmers and explorers, are
    expected to utilise Data Warehouses to give
    guidance and answer a limitless variety of
    questions
  • The value of a Data Warehouse and Data Mining
    lies in a new and changed appreciation of the
    meaning of the data
  • There are limitations though - A Data Warehouse
    cannot correct problems with its data, although
    it may help to more clearly identify them