Dr Robert Sanderson - PowerPoint PPT Presentation

About This Presentation
Title:

Dr Robert Sanderson

Description:

ARM: A Priori and Data Structures. ARM: Improvements. ARM: Advanced Techniques ... assessment exercises. Software: ... Lecture Notes, Assignments, Exercises: ... – PowerPoint PPT presentation

Slides: 343
Provided by: cscL6
Learn more at: http://www.csc.liv.ac.uk
Transcript and Presenter's Notes

Title: Dr Robert Sanderson


1
COMP527 Data Mining
  • Dr Robert Sanderson
  • (azaroth@liv.ac.uk)
  • Dept. of Computer Science
  • University of Liverpool
  • 2008
  • These are the full course notes, but they are
    not quite complete. You should come to the
    lectures anyway. Really.

Introduction to the Course January
18, 2008 Slide 1

2
COMP527 Data Mining
  • Introduction to the Course
  • Introduction to Data Mining
  • Introduction to Text Mining
  • General Data Mining Issues
  • Data Warehousing
  • Classification: Challenges, Basics
  • Classification: Rules
  • Classification: Trees
  • Classification: Trees 2
  • Classification: Bayes
  • Classification: Neural Networks
  • Classification: SVM
  • Classification: Evaluation
  • Classification: Evaluation 2
  • Regression, Prediction
  • Input Preprocessing
  • Attribute Selection
  • Association Rule Mining
  • ARM: A Priori and Data Structures
  • ARM: Improvements
  • ARM: Advanced Techniques
  • Clustering: Challenges, Basics
  • Clustering: Agglomerative/Divisive
  • Clustering: Advanced Algorithms
  • Hybrid Approaches
  • Graph Mining, Web Mining
  • Text Mining: Challenges, Basics
  • Text Mining: Text-as-Data
  • Text Mining: Text-as-Language
  • Revision for Exam

Introduction to the Course January
18, 2008 Slide 2

3
COMP527 Data Mining
  • Me, You: Introductions
  • Lectures
  • Tutorials
  • References
  • Course Summary
  • Assessment
  • Something Fun (or at least more fun,
    hopefully)

Introduction to the Course January
18, 2008 Slide 3

4
COMP527 Data Mining
  • Dr. Robert Sanderson
  • Office: 1.04, Ashton Building
  • Extension: 54252 (external: 795 4252)
  • Email: azaroth@liv.ac.uk
  • Web: http://www.csc.liv.ac.uk/azaroth/
  • Hours: 10:00 to 18:00, not Thursday
  • Email for a time, or show up at any time knowing
    that I might not be there.
  • Where's your accent from? New Zealand

Introduction to the Course January
18, 2008 Slide 4

5
COMP527 Data Mining
So you went to Waikato?
Your PhD is in Data Mining? ... Computer Science?
... Science? Math? Engineering?
You at least write Java? ... C?
What sort of CS Lecturer are you?!
Introduction to the Course January
18, 2008 Slide 5

6
COMP527 Data Mining
Went to University of Canterbury (NZ, not Kent)
... But I do know Ian Witten quite well.
PhD is in French/History ... But focused on
Computing in the Humanities/Informatics
Python!
Information Science: Information Retrieval,
Data Mining, Text Mining, XML, Databases,
Interoperability, Grid Processing, Digital
Preservation ...
Introduction to the Course January
18, 2008 Slide 6

7
COMP527 Data Mining
...
Introduction to the Course January
18, 2008 Slide 7

8
COMP527 Data Mining
Lecture Slots:
  • Monday 10-11am: Here
  • Tuesday 10-11am: Here
  • Friday 2-3pm: Here
Course requirement: 30 hours of lectures
Semester Timetable: 8 weeks class, 3 weeks Easter,
4 weeks class.
Dates:
  • 21st January to 11th of March (Rob at conference
    on 14th)
  • 7th April to 21st April (but may run to 25th?)
Introduction to the Course January
18, 2008 Slide 8

9
COMP527 Data Mining
Location: Lab 6, Tuesdays 3-4pm (just before
departmental seminar)
Aims:
  • Provide time for practical experience
  • Answer any questions from lectures/reading
  • Informal self-assessment exercises
Software:
  • Data mining 'workbench' software WEKA, installed
    on Windows image. May be available under Linux.
  • Freely downloadable from University of Waikato:
    http://www.cs.waikato.ac.nz/ml/weka/
Introduction to the Course January
18, 2008 Slide 9

10
COMP527 Data Mining
  • Departmental Home Page:
    http://www.csc.liv.ac.uk/teaching/modules/newmscs2/comp527.html
  • Lecture Notes, Assignments, Exercises:
    http://www.csc.liv.ac.uk/azaroth/courses/current/comp527/

Introduction to the Course January
18, 2008 Slide 10

11
COMP527 Data Mining
  • Witten, Ian and Eibe Frank, Data Mining:
    Practical Machine Learning Tools and Techniques,
    Second Edition, Morgan Kaufmann, 2005
  • Dunham, Margaret H, Data Mining: Introductory and
    Advanced Topics, Prentice Hall, 2003
Introduction to the Course January
18, 2008 Slide 11

12
COMP527 Data Mining
  • Han and Kamber, Data Mining: Concepts and
    Techniques, Second Edition, Morgan Kaufmann, 2006
  • Berry, Browne, Lecture Notes in Data Mining,
    World Scientific, 2006
  • Berry and Linoff, Data Mining Techniques, Second
    Edition, Wiley, 2004
  • Zhang, Association Rule Mining, Springer, 2002
  • Konchady, Text Mining Application Programming,
    Thomson, 2006
  • Weiss et al., Text Mining: Predictive Methods for
    Analyzing Unstructured Information, Springer,
    2005
  • Inmon, Building the Data Warehouse, Wiley, 1993
  • KDD (http://www.kdd2007.com)
  • PAKDD (http://lamda.nju.edu.cn/conf/PAKDD07/)
  • PKDD (http://www.ecmlpkdd2008.org/)

Introduction to the Course January
18, 2008 Slide 12

13
COMP527 Data Mining
  • CiteSeer: http://citeseer.ist.psu.edu/
  • KDNuggets: http://www.kdnuggets.com/
  • UCI Repository: http://kdd.ics.uci.edu/ (plus
    follow link to Machine Learning Archive)
  • Wikipedia: http://en.wikipedia.org/wiki/Data_mining
  • MathWorld: http://mathworld.wolfram.com/
  • Google Scholar: http://scholar.google.com/
  • NaCTeM: http://www.nactem.ac.uk/

Introduction to the Course January
18, 2008 Slide 13

14
COMP527 Data Mining
  • Introduction, Basics: 4 lectures
  • Data Warehousing: 1 lecture
  • Classification: 10 lectures
  • Input Preprocessing: 2 lectures
  • Association Rule Mining: 4 lectures
  • Clustering: 3 lectures
  • Hybrid Approaches: 1 lecture
  • Graph Mining: 1 lecture
  • Text Mining: 3 lectures
  • Revision: 1 lecture
  • Total: 30 lectures

Introduction to the Course January
18, 2008 Slide 14

15
COMP527 Data Mining
  • 75% End of Year Exam
  • 2 ½ hours
  • Short Answer and/or Essays
  • Choose 4 of 5 sections
  • 25% Continuous Assessment
  • 12% Assignment 1 (Due 2008-03-10 16:00:00)
  • 13% Assignment 2 (Due 2008-04-25 16:00:00)
  • Self assessment exercises
  • Weekly (or as desired) during tutorial session

Introduction to the Course January
18, 2008 Slide 15

16
COMP527 Data Mining
... what you've all been waiting for ...
Something Fun!
(Or more fun than the rest of the lecture at
least, your mileage may vary, opinions expressed
herein bla bla bla)
Introduction to the Course January
18, 2008 Slide 16

17
COMP527 Data Mining
  • The Rules
  • Each player is dealt 7 cards by the dealer
  • The first person to have no cards in hand wins
  • Every turn, each player discards a card
  • Play starts with the person to the left of the
    dealer and proceeds to the left
  • The dealer and then the winner of each round
    makes a secret rule
  • If you break a rule, you receive a penalty from
    the rule's creator
  • The penalty is: You must draw one card

Introduction to the Course January
18, 2008 Slide 17

18
COMP527 Data Mining
  • Later rules may overturn earlier rules, either
    completely or in part
  • Each rule may only change one aspect of the game
    play
  • Penalty conditions for breaking rules include:
  • Illegal card played (eg black on red)
  • Procedural error (eg playing out of turn)
  • Incorrect penalty (eg when a later rule
    enables a play)
  • Each rule is numbered (eg Procedural error under
    Rule 3)
  • When taking a penalty for playing out of turn, or
    discarding multiple cards, you must first return
    the game to the state it was in before the
    infraction, and then the penalty is incurred.

Introduction to the Course January
18, 2008 Slide 18

19
COMP527 Data Mining
  • Introduction to the Course
  • Introduction to Data Mining
  • Introduction to Text Mining
  • General Data Mining Issues
  • Data Warehousing
  • Classification: Challenges, Basics
  • Classification: Rules
  • Classification: Trees
  • Classification: Trees 2
  • Classification: Bayes
  • Classification: Neural Networks
  • Classification: SVM
  • Classification: Evaluation
  • Classification: Evaluation 2
  • Regression, Prediction
  • Input Preprocessing
  • Attribute Selection
  • Association Rule Mining
  • ARM: A Priori and Data Structures
  • ARM: Improvements
  • ARM: Advanced Techniques
  • Clustering: Challenges, Basics
  • Clustering: Improvements
  • Clustering: Advanced Algorithms
  • Hybrid Approaches
  • Graph Mining, Web Mining
  • Text Mining: Challenges, Basics
  • Text Mining: Text-as-Data
  • Text Mining: Text-as-Language
  • Revision for Exam

Introduction to Data Mining January
18, 2008 Slide 19

20
COMP527 Data Mining
  • What is Data Mining?
  • Definitions
  • Views on the Process
  • Basic Functions
  • Why would you do this?
  • Motivations
  • Applications
  • WEKA: Waikato Environment for Knowledge
    Analysis (And a cute little bird!)

Introduction to Data Mining January
18, 2008 Slide 20

21
COMP527 Data Mining
  • Some Definitions
  • "The nontrivial extraction of implicit,
    previously unknown, and potentially useful
    information from data" (Piatetsky-Shapiro)
  • "...the automated or convenient extraction of
    patterns representing knowledge implicitly stored
    or captured in large databases, data warehouses,
    the Web, ... or data streams." (Han, pg xxi)
  • "...the process of discovering patterns in data.
    The process must be automatic or (more usually)
    semiautomatic. The patterns discovered must be
    meaningful..." (Witten, pg 5)
  • "...finding hidden information in a database."
    (Dunham, pg 3)
  • "...the process of employing one or more computer
    learning techniques to automatically analyse and
    extract knowledge from data contained within a
    database." (Roiger, pg 4)

Introduction to Data Mining January
18, 2008 Slide 21

22
COMP527 Data Mining
  • Keywords from each definition
  • "The nontrivial extraction of implicit,
    previously unknown, and potentially useful
    information from data" (Piatetsky-Shapiro)
  • "...the automated or convenient extraction of
    patterns representing knowledge implicitly stored
    or captured in large databases, data warehouses,
    the Web, ... or data streams." (Han, pg xxi)
  • "...the process of discovering patterns in data.
    The process must be automatic or (more usually)
    semiautomatic. The patterns discovered must be
    meaningful..." (Witten, pg 5)
  • "...finding hidden information in a database."
    (Dunham, pg 3)
  • "...the process of employing one or more computer
    learning techniques to automatically analyze and
    extract knowledge from data contained within a
    database." (Roiger, pg 4)

Introduction to Data Mining January
18, 2008 Slide 22

23
COMP527 Data Mining
  • Many texts treat KDD and Data Mining as the same
    process, but it is also possible to think of Data
    Mining as the discovery part of KDD.
  • Dunham: KDD is the process of finding useful
    information and patterns in data. Data Mining is
    the use of algorithms to extract information and
    patterns derived by the KDD process.
  • For this course, we will discuss the entire
    process (KDD) but focus mostly on the algorithms
    used for discovery.

Introduction to Data Mining January
18, 2008 Slide 23

24
COMP527 Data Mining

KDD process (as tweaked by Dunham):
Initial Data --> Selection --> Target Data -->
Preprocessing --> Preprocessed Data -->
Transformation --> Transformed Data -->
Data Mining --> Data Model --> Interpretation -->
Knowledge
Introduction to Data Mining January
18, 2008 Slide 24

25
COMP527 Data Mining

Introduction to Data Mining January
18, 2008 Slide 25

26
COMP527 Data Mining
  • All Data Mining functions can be thought of as
    attempting to find a model to fit the data.
  • Each function needs Criteria to prefer one model
    over another.
  • Each function needs a technique to Compare the
    data.
  • Two types of model:
  • Predictive models predict unknown values based on
    known data
  • Descriptive models identify patterns in data
  • Each type has several sub-categories, each of
    which has many algorithms. We won't have time to
    look at ALL of them in detail.

Introduction to Data Mining January
18, 2008 Slide 26

27
COMP527 Data Mining

Data Mining
Predictive (Supervised Learning):
  • Classification: Maps data into predefined classes
  • Regression: Maps data into a function
  • Prediction: Predict future data states
  • Time Series Analysis: Analyze data over time
Descriptive (Unsupervised Learning):
  • Clustering: Find groups of similar items
  • Association Rules: Find relationships between
    items
  • Characterisation: Derive representative
    information
  • Sequence Discovery: Find sequential patterns
Introduction to Data Mining January
18, 2008 Slide 27

28
COMP527 Data Mining
  • The aim of classification is to create a model
    that can predict the 'type' or some category for
    a data instance that doesn't have one.
  • Two phases:
  • 1. Given labelled data instances, learn a model
    for how to predict the class label for them.
    (Training)
  • 2. Given an unlabelled, unseen instance, use the
    model to predict the class label. (Prediction)
  • Some algorithms predict only a binary split
    (yes/no), some can predict 1 of N classes, some
    give probabilities for each of N classes.
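
The two phases can be sketched in a few lines of Python. This is only an illustration: the 1-nearest-neighbour method, the function names `train` and `predict`, and the toy measurements are assumptions chosen for brevity, not something taken from the slides.

```python
import math

def train(labelled_instances):
    # Phase 1 (training): a 1-nearest-neighbour "model" simply
    # remembers the labelled instances it was given.
    return list(labelled_instances)

def predict(model, instance):
    # Phase 2 (prediction): return the label of the stored instance
    # that lies closest to the unseen one.
    nearest = min(model, key=lambda pair: math.dist(pair[0], instance))
    return nearest[1]

# Hypothetical labelled data: (attribute values, class label)
training_data = [
    ((5.1, 3.5), "small"),
    ((4.9, 3.0), "small"),
    ((6.7, 3.1), "large"),
    ((6.3, 2.9), "large"),
]

model = train(training_data)
print(predict(model, (5.0, 3.4)))
```

This sketch predicts exactly one of N classes; an algorithm that gives probabilities would return a score per class instead.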

Introduction to Data Mining January
18, 2008 Slide 28

29
COMP527 Data Mining
  • The aim of clustering is similar to
    classification, but without predefined classes.
  • Clustering attempts to find clusters of data
    instances which are more similar to each other
    than to instances outside of the cluster.
  • Unsupervised Learning: learning by observation,
    rather than by example.
  • Some algorithms must be told how many clusters to
    find, others try to find an 'appropriate' number
    of clusters.
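
As a rough sketch of an algorithm that must be told how many clusters to find, the following implements a very small k-means on one-dimensional values. The choice of k-means, the naive starting centres and the sample values are all assumptions made for illustration.

```python
def kmeans_1d(values, k, iterations=10):
    ordered = sorted(values)
    # Naive deterministic start: k centres spread across the sorted data.
    centres = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        # Assign every value to its nearest centre.
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        # Move each centre to the mean of its cluster (keep it if empty).
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return clusters

clusters = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 10.1], k=2)
print(clusters)
```

No class labels are supplied anywhere; the groups emerge only from how similar the values are to each other.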

Introduction to Data Mining January
18, 2008 Slide 29

30
COMP527 Data Mining
  • The aim of association rule mining is to find
    patterns that occur in the data set frequently
    enough to be interesting. Hence it looks for the
    association or correlation of data attributes
    within instances, rather than between instances.
  • These correlations are then expressed as rules:
    if X and Y appear in an instance, then Z also
    appears.
  • Most algorithms are extensions of a single base
    algorithm known as 'A Priori', however a few
    others also exist.
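
To make "frequently enough to be interesting" concrete, here is a minimal sketch that checks one candidate rule against a handful of invented shopping baskets. It is not the A Priori algorithm, only the counting that such algorithms rely on. The measure names support and confidence are the usual terms in the literature; the function name and data are hypothetical.

```python
# Hypothetical instances: each basket is a set of items.
baskets = [
    {"bread", "butter", "jam"},
    {"bread", "butter", "jam", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "jam"},
]

def rule_stats(instances, antecedent, consequent):
    # Instances containing all of X and Y (the "if" part of the rule).
    with_antecedent = [i for i in instances if antecedent <= i]
    # Of those, the instances that also contain Z (the "then" part).
    with_both = [i for i in with_antecedent if consequent <= i]
    support = len(with_both) / len(instances)
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence

# Rule: if bread and butter appear in a basket, then jam also appears.
support, confidence = rule_stats(baskets, {"bread", "butter"}, {"jam"})
print(support, confidence)
```

A rule would normally be kept only if both numbers exceed thresholds chosen by the analyst.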

Introduction to Data Mining January
18, 2008 Slide 30

31
COMP527 Data Mining
  • That all sounds ... complicated. Why should I
    learn about Data Mining?
  • What's wrong with just a relational database? Why
    would I want to go through these extra
    complicated steps?
  • Isn't it expensive? It sounds like it takes a lot
    of skill, programming, computational time and
    storage space. Where's the benefit?
  • Data Mining isn't just a cute academic exercise,
    it has very profitable real world uses.
    Practically all large companies and many
    governments perform data mining as part of their
    planning and analysis.

Introduction to Data Mining January
18, 2008 Slide 31

32
COMP527 Data Mining
  • The rate of data creation is accelerating each
    year. In 2003, UC Berkeley estimated that the
    previous year generated 5 exabytes of data, of
    which 92% was stored on electronically accessible
    media. Mega < Giga < Tera < Peta < Exa ... All
    the data in all the books in the US Library of
    Congress is 136 Terabytes. So 37,000 new
    Libraries of Congress in 2002.
  • VLBI Telescopes produce 16 Gigabytes of data
    every second.
  • Each engine of each plane of each company
    produces 1 Gigabyte of data every trans-atlantic
    length journey.
  • Google searches 18 billion accessible web pages.
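
The Library of Congress comparison above can be checked with a line of arithmetic. The sketch assumes decimal prefixes (1 terabyte = 10^12 bytes, 1 exabyte = 10^18 bytes); the variable names are illustrative.

```python
TERABYTE = 10**12
EXABYTE = 10**18

data_2002 = 5 * EXABYTE                 # estimated data generated in 2002
library_of_congress = 136 * TERABYTE    # all the books in the Library of Congress

libraries = data_2002 / library_of_congress
print(round(libraries))  # 36765, i.e. roughly 37,000
```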

Introduction to Data Mining January
18, 2008 Slide 32

33
COMP527 Data Mining
  • As the amount of data increases, the proportion
    of information decreases.
  • As more and more data is generated automatically,
    we need to find automatic solutions to turn those
    stored raw results into information.
  • Companies need to turn stored data into profit
    ... otherwise why are they storing it?
  • Let's look at some real world examples.

Introduction to Data Mining January
18, 2008 Slide 33

34
COMP527 Data Mining
  • The data generated by airplane engines can be
    used to determine when an engine needs to be
    serviced. By discovering the patterns that are
    indicative of problems, companies can service
    working engines less often (increasing profit)
    and discover faults before they materialise
    (increasing safety).
  • Loan companies can give you results in minutes
    by classifying you into a good credit risk or a
    bad risk, based on your personal information and
    a large supply of previous, similar customers.
  • Cell phone companies can classify customers into
    those likely to leave, and hence need enticement,
    and those that are likely to stay regardless.

Introduction to Data Mining January
18, 2008 Slide 34

35
COMP527 Data Mining
  • Discover previously unknown groups of
    customers/items.
  • By finding clusters of customers, companies can
    then determine how best to handle that particular
    cluster.
  • For example, this could be used for targeted
    advertising, special offers, transferring
    information gathered by association rule mining
    to other members of the cluster, and so forth.
  • The concept of 'Similarity' is often used for
    determining other items that you might be
    interested in, eg 'More Like This' links.

Introduction to Data Mining January
18, 2008 Slide 35

36
COMP527 Data Mining
  • By finding association rules from shopping
    baskets, supermarkets can use this information
    for many things, including
  • Product placement in the store
  • What to put on sale
  • What to create as 'joint special offers'
  • What to offer the customer in terms of coupons
  • What to advertise together
  • It shouldn't be surprising that your Tesco
    coupons are for things that you sometimes buy,
    rather than things you always or never buy.
  • Wal-Mart in the US records every transaction at
    every store -- petabytes of information to sift
    through. (TeraData)

Introduction to Data Mining January
18, 2008 Slide 36

37
COMP527 Data Mining
  • Note well that data mining applications have no
    wisdom. They cannot apply the knowledge that
    they discover appropriately.
  • For example, a data mining application may tell
    you that there is a correlation between buying
    music magazines and beer, but it doesn't tell you
    how to use that knowledge. Should you put the
    two close together to reinforce the tendency, or
    should you put them far apart as people will buy
    them anyway and thus stay in the store longer?
  • Data mining can help managers plan strategies for
    a company, it does not give them the strategies.

Introduction to Data Mining January
18, 2008 Slide 37

38
COMP527 Data Mining
Introduction to Data Mining January
18, 2008 Slide 38

39
COMP527 Data Mining
Introduction to Data Mining January
18, 2008 Slide 39

40
COMP527 Data Mining
  • Witten: Chapter 1
  • Dunham: Chapter 1
  • Han: Chapter 1; Sections 6.1, 7.1
  • Berry and Linoff: Chapters 1, 2
  • http://en.wikipedia.org/wiki/Data_mining and
    linked pages

Introduction to Data Mining January
18, 2008 Slide 40

41
COMP527 Data Mining
  • Introduction to the Course
  • Introduction to Data Mining
  • Introduction to Text Mining
  • General Data Mining Issues
  • Data Warehousing
  • Classification: Challenges, Basics
  • Classification: Rules
  • Classification: Trees
  • Classification: Trees 2
  • Classification: Bayes
  • Classification: Neural Networks
  • Classification: SVM
  • Classification: Evaluation
  • Classification: Evaluation 2
  • Regression, Prediction
  • Input Preprocessing
  • Attribute Selection
  • Association Rule Mining
  • ARM: A Priori and Data Structures
  • ARM: Improvements
  • ARM: Advanced Techniques
  • Clustering: Challenges, Basics
  • Clustering: Improvements
  • Clustering: Advanced Algorithms
  • Hybrid Approaches
  • Graph Mining, Web Mining
  • Text Mining: Challenges, Basics
  • Text Mining: Text-as-Data
  • Text Mining: Text-as-Language
  • Revision for Exam

Introduction to Text Mining January
18, 2008 Slide 41

42
COMP527 Data Mining
  • Information Retrieval (IR)
  • What is IR?
  • Typical IR Process
  • Data Mining on Text
  • Text Mining
  • What is Text Mining?
  • Typical Text Mining Process
  • Applications

Introduction to Text Mining January
18, 2008 Slide 42

43
COMP527 Data Mining
IR is concerned with retrieving textual records,
not data items like relational databases, nor
(specifically) with finding patterns like data
mining.
Examples:
  • SQL: Find rows where the text column LIKE
    "information retrieval"
  • DM: Find a model in order to classify document
    topics.
  • IR: Find documents with text that contains the
    words Information adjacent to Retrieval, Protocol
    or SRW, but not Google.
Introduction to Text Mining January
18, 2008 Slide 43

44
COMP527 Data Mining
  • IR focuses on finding the most appropriate or
    relevant records to the user's request.
  • The supremacy of Google can be attributed
    primarily to its PageRank algorithm for ranking
    web pages in order of relevance to the user's
    query. $741.79 a share (on 2007-11-06, up from
    $471.80 on 2006-11-03) says this topic is
    important to understand!
  • IR also focuses on finding these records as
    quickly as possible.
  • Not only does Google find relevant pages, it
    finds them Fast, for many thousands (maybe
    millions?) of concurrent users.

Introduction to Text Mining January
18, 2008 Slide 44

45
COMP527 Data Mining
  • So is Google the answer to the question of
    Information Retrieval?
  • No! Google has a good answer for how to search
    the web, but there are many more sources of data,
    and many more interesting questions.
  • Many other examples, including
  • Library catalogues
  • XML searching
  • Distributed searching
  • Query languages

Introduction to Text Mining January
18, 2008 Slide 45

46
COMP527 Data Mining
Research topics exist for each box and arrow!
[Diagram boxes: Search Engine, User, Need, Query,
Information]
Introduction to Text Mining January
18, 2008 Slide 46

47
COMP527 Data Mining
Compare to the KDD process we looked at last time!
[Diagram boxes: Documents, Search Engine, Target
Documents, Preprocessed Documents, Records,
Information]
Introduction to Text Mining January
18, 2008 Slide 47

48
COMP527 Data Mining
What information do we need to store?
Query: Documents containing Information and
Retrieval but not Protocol
Need to find which documents contain which words.
Could perform this query using a document/term
matrix
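
A minimal sketch of answering that query from a boolean document/term matrix follows. The three documents, their text and the helper name `query` are invented for illustration; a real IR system would use an inverted index rather than a dense matrix.

```python
# Hypothetical documents
documents = {
    "doc1": "information retrieval finds relevant documents",
    "doc2": "a protocol for information retrieval",
    "doc3": "data mining finds patterns",
}

# Columns of the matrix: every distinct term in the collection.
vocabulary = sorted({w for text in documents.values() for w in text.split()})

# Rows of the matrix: for each document, whether it contains each term.
matrix = {
    name: {term: term in text.split() for term in vocabulary}
    for name, text in documents.items()
}

def query(matrix, required, excluded):
    # Keep documents that contain every required term and no excluded term.
    return [
        name for name, row in matrix.items()
        if all(row.get(t, False) for t in required)
        and not any(row.get(t, False) for t in excluded)
    ]

hits = query(matrix, ["information", "retrieval"], ["protocol"])
print(hits)
```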
Introduction to Text Mining January
18, 2008 Slide 48

49
COMP527 Data Mining
Also useful to know is the frequency of the term
in the document. Each row in the matrix is a
vector, and useful for data mining functions as
the document has been reduced to a series of
numbers rather than words. Our new matrix might
look like
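
As a minimal sketch of what such a term-frequency matrix could contain (the documents and counts here are invented for illustration, not taken from the original slide):

```python
from collections import Counter

# Hypothetical documents
documents = {
    "doc1": "information retrieval retrieval",
    "doc2": "data mining data data",
}

# One column per distinct term, in a fixed order.
vocabulary = sorted({w for text in documents.values() for w in text.split()})

def frequency_vector(text, vocabulary):
    # Count each term, then read the counts off in vocabulary order.
    counts = Counter(text.split())
    return [counts[term] for term in vocabulary]

matrix = {name: frequency_vector(text, vocabulary)
          for name, text in documents.items()}

print(vocabulary)
for name, row in matrix.items():
    print(name, row)
```

Each row is now a plain vector of numbers, which is the form most data mining functions expect.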
Introduction to Text Mining January
18, 2008 Slide 49

50
COMP527 Data Mining
Common evaluation for IR relevance ranking:
Precision and Recall
Precision = Number Relevant and Retrieved / Number
Retrieved
Recall = Number Relevant and Retrieved / Number
Relevant
F Score = (recall * precision) / ((recall +
precision) / 2)
Ideal situation is all and only relevant documents
retrieved. Also used in Data Mining evaluation.
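
These three measures can be computed directly from two sets of document identifiers. The sketch below follows the formulas on this slide; the function name and the example sets are assumptions for illustration.

```python
def precision_recall_f(relevant, retrieved):
    # Documents that are both relevant and retrieved.
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        f_score = 0.0
    else:
        # F Score = (recall * precision) / ((recall + precision) / 2)
        f_score = (recall * precision) / ((recall + precision) / 2)
    return precision, recall, f_score

# Hypothetical query result: 4 relevant documents exist, 3 were retrieved,
# and 2 of the retrieved ones are relevant.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d1", "d2", "d5"}

precision, recall, f_score = precision_recall_f(relevant, retrieved)
print(precision, recall, f_score)
```

In the ideal situation (all and only relevant documents retrieved) all three values are 1.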
Introduction to Text Mining January
18, 2008 Slide 50

51
COMP527 Data Mining
  • Format Processing: Extraction of text from
    different file formats
  • Indexing: Efficient extraction/storage of terms
    from text
  • Query Languages: Formulation of queries against
    those indexes
  • Protocols: Transporting queries from client to
    server
  • Relevance Ranking: Determining the relevance of a
    document to the user's query
  • Metasearch: Cross-searching multiple document
    sets with the same query
  • GridIR: Using the grid (or other massively
    parallel infrastructure) to perform IR processes
  • Multimedia IR: IR techniques on multimedia
    objects, compound digital objects...
Introduction to Text Mining January
18, 2008 Slide 51

52
COMP527 Data Mining
All of the Data Mining functions can be applied
to textual data, using term as the attribute and
frequency as the value.
  • Classification: Classify a text into subjects,
    genres, quality, reading age, ...
  • Clustering: Cluster together similar texts
  • Association Rule Mining: Find words that
    frequently appear together. Find texts that are
    frequently cited together.
Key challenge is the very large number of terms
(eg the number of different words across all
documents)
Introduction to Text Mining January
18, 2008 Slide 52

53
COMP527 Data Mining
So, we've looked at Data Mining and IR... What's
Text Mining then?
Good question. No canonical definition yet, but a
definition similar to the one for Data Mining
could be applied:
The non-trivial extraction of previously unknown,
interesting facts from an (invariably large)
collection of texts.
So it sounds like a combination of IR and Data
Mining, but actually the process involves many
other steps too. Before we look at what actually
happens, let's look at why it's different...
Introduction to Text Mining January
18, 2008 Slide 53

54
COMP527 Data Mining
  • Data Mining finds a model for the data based on
    the attributes of the items. The only attributes
    of text are the words that make up the text.
  • As we looked at for IR, this creates a very
    sparse matrix.
  • Even if we create that matrix, what sort of
    patterns could we find?
  • Classification: We could classify texts into
    pre-defined classes (eg spam / not spam)
  • Association Rule Mining: Finding frequent sets
    of words. (eg if 'computer' appears 3 times,
    then 'data' appears at least once)
  • Clustering: Finding groups of similar documents
    (IR?)
  • None of these fit our definition of Text Mining.

Introduction to Text Mining January
18, 2008 Slide 54

55
COMP527 Data Mining
Information Retrieval finds documents that match
the user's query. Even if we matched at a
sentence level rather than document, all we do is
retrieve matching sentences, we're not
discovering anything new. The relevance ranking
is important, but it still just matches
information we already knew... it just orders it
appropriately. IR (typically) treats a
document as a big bag of words... but doesn't
care about the meaning of the words, just if they
exist in the document. IR doesn't fit our
definition of Text Mining either.
Introduction to Text Mining January
18, 2008 Slide 55

56
COMP527 Data Mining
  • How would one find previously unknown facts from
    a bunch of text?
  • Need to understand the meaning of the text!
  • Part of speech of words
  • Subject/Verb/Object/Preposition/Indirect Object
  • Need to determine that two entities are the same
    entity.
  • Need to find correlations of the same entity.
  • Form logical chains: Milk contains Magnesium.
    Magnesium stimulates receptor activity. Inactive
    receptors cause Headaches. --> Milk is good for
    Headaches. (fictional example!)

Introduction to Text Mining January
18, 2008 Slide 56

57
COMP527 Data Mining
First we need to tag the text with the parts of
speech for each word.
eg: Rob/noun teaches/verb the/article course/noun
How could we do this? By learning a model for the
language!
Essentially a data mining classification problem
-- should the system classify the word as a noun,
a verb, an adjective, etc.?
Lots of different tags, often based on a set
called the Penn Treebank. (NN = Noun, VB = Verb,
JJ = Adjective, RB = Adverb, etc)
Introduction to Text Mining January
18, 2008 Slide 57

58
COMP527 Data Mining
Now we need to discover the phrases and parts of
each clause.
Rob/noun teaches/verb the/article course/noun
(Subject: Rob Verb: teaches (Object: the course))
The phrase sections are often expressed as trees:
( TOP ( S ( NP ( DT This ) ( JJ crazy ) ( NN
sentence ) ) ( VP ( VBD amused ) ( NP ( NNP Rob )
) ( PP ( IN for ) ( NP ( DT a ) ( JJ few ) ( NNS
minutes ) )
Introduction to Text Mining January
18, 2008 Slide 58

59
COMP527 Data Mining
Once we've parsed the text for linguistic
structure, we need to identify the real world
objects referred to.
Rob teaches the course
  • Rob = Me (Sanderson, Robert D. b.1976-07-20
    Rangiora/New Zealand)
  • the course = Comp527 2006/2007, University of
    Liverpool, UK
This is typically done via lookups in very large
thesauri or 'ontologies', specific to the domain
being processed (eg medical, historical, current
events, etc.)
Introduction to Text Mining January
18, 2008 Slide 59

60
COMP527 Data Mining
There will normally be a lot more text to parse:
Rob Sanderson, a lecturer at the University of
Liverpool, teaches a masters level course on data
mining (Comp527)
  • Rob is a lecturer
  • Rob is at the University of Liverpool
  • Rob teaches a course
  • The course is called Comp527
  • The course is masters level
  • The course is about data mining
Introduction to Text Mining January
18, 2008 Slide 60

61
COMP527 Data Mining
Rob Sanderson, a lecturer at the University of
Liverpool, teaches a masters level course on data
mining (Comp527)
Data mining is about finding models to describe
data sets.
--> The University of Liverpool has a course about
finding models to describe data sets.
(Not very interesting or novel in this case, but
that's the process)
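
The joining step can be sketched as a lookup over (subject, relation, object) facts. This is a deliberately naive illustration: the triple representation, the function name `chain` and the rule of keeping the first fact's relation are assumptions, and real text mining systems reason far more carefully.

```python
# Facts already extracted from the text, as (subject, relation, object).
facts = [
    ("The University of Liverpool", "has a course about", "data mining"),
    ("data mining", "is about", "finding models to describe data sets"),
]

def chain(facts):
    # Join two facts when the object of one is the subject of another,
    # producing a new fact that was not stated directly.
    derived = []
    for s1, r1, o1 in facts:
        for s2, _r2, o2 in facts:
            if o1 == s2:
                derived.append((s1, r1, o2))
    return derived

derived = chain(facts)
for subject, relation, obj in derived:
    print(subject, relation, obj)
```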
Introduction to Text Mining January
18, 2008 Slide 61

62
COMP527 Data Mining
Search engines of all types are based on IR. But
where would you use text mining?
Most research so far is on medical data sets ...
because this is the most profitable! If you could
correlate facts to find a cure for cancer, you
would be very VERY rich! So ... lots of people are
trying to do just that for various values of
'cancer'.
Also because of the wide availability of
ontologies and datasets, in particular abstracts
for medical journal articles (PubMed/Medline)
Introduction to Text Mining January
18, 2008 Slide 62

63
COMP527 Data Mining
More application areas:
  • News feeds
  • Terrorism detection
  • Social sciences analysis
  • Historical text analysis
  • Corpus linguistics
  • 'Net Nanny' filters
  • etc.
Introduction to Text Mining January
18, 2008 Slide 63

64
COMP527 Data Mining
  • Weiss et al: Chapter 1 (and 2 if you're
    interested)
  • Baeza-Yates, Modern Information Retrieval,
    Chapter 1
  • Jackson and Moulinier, Natural Language
    Processing for Online Applications, Chapter 1
  • http://www.jisc.ac.uk/publications/publications/pub_textmining.aspx
  • http://people.ischool.berkeley.edu/hearst/text-mining.html

Introduction to Text Mining January
18, 2008 Slide 64

65
COMP527 Data Mining
  • Introduction to the Course
  • Introduction to Data Mining
  • Introduction to Text Mining
  • General Data Mining Issues
  • Data Warehousing
  • Classification: Challenges, Basics
  • Classification: Rules
  • Classification: Trees
  • Classification: Trees 2
  • Classification: Bayes
  • Classification: Neural Networks
  • Classification: SVM
  • Classification: Evaluation
  • Classification: Evaluation 2
  • Regression, Prediction
  • Input Preprocessing
  • Attribute Selection
  • Association Rule Mining
  • ARM: A Priori and Data Structures
  • ARM: Improvements
  • ARM: Advanced Techniques
  • Clustering: Challenges, Basics
  • Clustering: Improvements
  • Clustering: Advanced Algorithms
  • Hybrid Approaches
  • Graph Mining, Web Mining
  • Text Mining: Challenges, Basics
  • Text Mining: Text-as-Data
  • Text Mining: Text-as-Language
  • Revision for Exam

General Data Mining Issues January
18, 2008 Slide 65

66
COMP527 Data Mining
  • Machine Learning?
  • Input to Data Mining Algorithms
  • Data types
  • Missing values
  • Noisy values
  • Inconsistent values
  • Redundant values
  • Number of values
  • Over-fitting / Under-fitting
  • Scalability
  • Human Interaction
  • Ethical Data Mining

General Data Mining Issues January
18, 2008 Slide 66

67
COMP527 Data Mining
  • What do we mean by 'learning' when applied to
    machines?
  • Not just committing to memory (= storage)
  • Can't require consciousness
  • Learn facts (data), or processes (algorithms)?
  • Things learn when they change their behaviour in
    a way that makes them perform better (Witten)
  • Ties to future performance, not the act itself
  • But things change behaviour for reasons other
    than 'learning'
  • Can a machine have the Intent to perform better?

General Data Mining Issues January
18, 2008 Slide 67

68
COMP527 Data Mining
The aim of data mining is to learn a model for
the data. This could be called a concept of the
data, so our outcome will be a concept
description.
Eg: the task is to classify emails as spam/not spam.
The concept to learn is the concept of 'what is spam?'
Input comes as instances. Eg: the individual emails.
Instances have attributes. Eg: sender, date, recipient,
words in text.
General Data Mining Issues January
18, 2008 Slide 68

69
COMP527 Data Mining
Use attributes to determine what about an
instance means that it should be classified as a
particular class. Learning!
Obvious input structure: table of instances (rows) and
attributes (columns).
General Data Mining Issues January
18, 2008 Slide 69

70
COMP527 Data Mining
@relation Iris
@attribute sepal_length numeric
@attribute sepal_width numeric
@attribute petal_length numeric
@attribute petal_width numeric
@data
5.1, 3.5, 1.4, 0.2
4.9, 3.0, 1.4, 0.2
4.7, 3.2, 1.3, 0.2
5.0, 3.6, 1.4, 0.2
...
But what about non-numeric data?
General Data Mining Issues January
18, 2008 Slide 70

71
COMP527 Data Mining
Nominal: Prespecified, finite number of values
  eg cat, fish, dog, squirrel
  Includes boolean (true, false) and all enumerations.
Ordinal: Orderable, but no concept of distance
  eg hot > warm > cool > cold
  Domain specific ordering, but no notion of how much
  hotter warm is compared to cool.
General Data Mining Issues January
18, 2008 Slide 71

72
COMP527 Data Mining
Interval: Ordered, fixed unit
  eg 1990 < 1995 < 2000 < 2005
  Difference between values makes sense (1995 is 5 years after 1990).
  Sum does not make sense (1990 + 1995 = year 3985??).
Ratio: Ordered, fixed unit, relative to a zero point
  eg 1m, 2m, 3m, 5m
  Difference makes sense (3m is 1m greater than 2m).
  Sum makes sense (1m + 2m = 3m).
General Data Mining Issues January
18, 2008 Slide 72

73
COMP527 Data Mining
Nominal: @attribute name {option1, option2, ... optionN}
Numeric: @attribute name numeric -- real values
Other:   @attribute name string -- text fields
         @attribute name date -- date fields (ISO-8601 format)
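To make the declarations above concrete, here is a minimal sketch of a reader for a small subset of ARFF. The function name `parse_arff` and the sample text are hypothetical, written only for illustration; the sketch ignores quoting, sparse data and date parsing, and it is not how Weka itself reads files. It treats a `?` in the data as a missing value.

```python
def parse_arff(text):
    """Parse a tiny subset of ARFF: @relation, @attribute and @data."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = []
            for (name, kind), value in zip(attributes, values):
                if value == "?":              # missing value marker
                    row.append(None)
                elif kind == "numeric":
                    row.append(float(value))
                else:                         # nominal, string or date: keep text
                    row.append(value)
            rows.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):          # nominal: list of allowed options
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            else:
                kind = spec.lower()
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


# Illustrative input only; the 'species' attribute is an assumption.
SAMPLE = """@relation Iris
@attribute sepal_length numeric
@attribute species {setosa, versicolor}
@data
5.1, setosa
?, versicolor
"""

relation, attributes, rows = parse_arff(SAMPLE)
```

Each row comes back as a list in attribute order, which matches the 'table of instances and attributes' view of the input.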
General Data Mining Issues January
18, 2008 Slide 73

74
COMP527 Data Mining
The following issues will come up over and over
again, but different algorithms have different
requirements.
What happens if we don't know the value for a
particular attribute in an instance? For example,
the data was never stored, was lost, or could not
be represented. Maybe that data was important!
ARFF records missing values with a ? in the table.
How should we process missing values?
General Data Mining Issues January
18, 2008 Slide 74

75
COMP527 Data Mining
  • Possible 'solutions' for dealing with missing
    values:
  • Ignore the instance completely (eg class
    missing in training data set). Not a very useful
    solution if it is in test data to be classified!
  • Fill in values by hand. Could be very slow, and
    likely to be impossible.
  • Global 'missingValue' constant. Possible for
    enumerations, but what about numeric data?
  • Replace with attribute mean
  • Replace with class's attribute mean
  • Train new classifier to predict missing value!
  • Just leave as missing and require algorithm to
    apply appropriate technique
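Two of the options listed above, replacing a missing value with the attribute mean or with the mean for the instance's class, can be sketched as follows. The function name, the dictionary-per-instance layout and the sample values are assumptions made for illustration. Missing values are represented as `None`, and the sketch assumes every class has at least one known value for the attribute.

```python
from collections import defaultdict


def impute_with_mean(rows, attr, class_attr=None):
    """Return copies of rows with None in `attr` replaced by a mean.

    With class_attr set, the mean is taken over rows of the same class;
    otherwise it is taken over all rows with a known value.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        if row[attr] is not None:
            key = row[class_attr] if class_attr else "all"
            sums[key] += row[attr]
            counts[key] += 1
    filled = []
    for row in rows:
        new_row = dict(row)                   # leave the input untouched
        if new_row[attr] is None:
            key = row[class_attr] if class_attr else "all"
            new_row[attr] = sums[key] / counts[key]
        filled.append(new_row)
    return filled


# Illustrative data only.
instances = [
    {"length": 1.0, "cls": "a"},
    {"length": 3.0, "cls": "a"},
    {"length": 8.0, "cls": "b"},
    {"length": None, "cls": "b"},
]
by_attribute = impute_with_mean(instances, "length")
by_class = impute_with_mean(instances, "length", class_attr="cls")
```

Here the attribute mean fills in 4.0, while the class mean fills in 8.0, which shows how much the choice of strategy can change the value an algorithm sees.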

General Data Mining Issues January
18, 2008 Slide 75

76
COMP527 Data Mining
  • By 'noisy data' we mean random errors scattered
    in the data.
  • For example, due to inaccurate recording, data
    corruption.
  • Some noise will be very obvious:
  • data has incorrect type (string in numeric
    attribute)
  • data does not match enumeration (maybe in yes/no
    field)
  • data is very dissimilar to all other entries (10
    in an attr otherwise 0..1)
  • Some incorrect values won't be obvious at all.
    Eg typing 0.52 at data entry instead of 0.25.

General Data Mining Issues January
18, 2008 Slide 76

77
COMP527 Data Mining
  • Some possible solutions
  • Manual inspection and removal
  • Use clustering on the data to find instances or
    attributes that lie outside the main body
    (outliers) and remove them
  • Use regression to determine function, then remove
    those that lie far from the predicted value
  • Ignore all values that occur below a certain
    frequency threshold
  • Apply smoothing function over known-to-be-noisy
    data
  • If noise is removed, can apply missing value
    techniques on it. If it is not removed, it may
    adversely affect the accuracy of the model.
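The regression option above can be sketched with a straight-line least-squares fit: fit the line, then drop points whose distance from the predicted value exceeds a threshold. The function names, the sample points and the threshold of 5.0 are assumptions for illustration; a straight line is only one possible choice of function.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = a*x + b to (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def remove_far_points(points, max_residual):
    """Keep only points within max_residual of the fitted line."""
    a, b = fit_line(points)
    return [(x, y) for x, y in points if abs(y - (a * x + b)) <= max_residual]


# Illustrative data: y = x, except for one noisy entry at x = 5.
points = [(x, x) for x in range(10)]
points[5] = (5, 25)
cleaned = remove_far_points(points, max_residual=5.0)
```

Note that the noisy point also pulls the fitted line towards itself, so with few points or a loose threshold this simple approach can remove good data or keep bad data.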

General Data Mining Issues January
18, 2008 Slide 77

78
COMP527 Data Mining
Some values may be recorded in different
ways. For example 'coke', 'coca cola',
'coca-cola', 'Coca Cola' etc. In this case,
the data should be normalised to a single form.
Can be treated as a special case of noise.
Some values may be recorded inaccurately on
purpose!
Email address: r.d.nospam.sanderson@...
Spike in early census data for births on
11/11/1911. Had to put in some value, so
defaulted to 1s everywhere. Ooops! (Possibly
urban legend?)
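Normalising the 'coke' variants mentioned above to a single form could look something like this. The lookup table `CANONICAL` and the choice of 'coca-cola' as the canonical form are assumptions for illustration; in practice the table would have to be built for the domain.

```python
import re

# Hypothetical lookup from a cleaned-up variant to its canonical form.
CANONICAL = {
    "coke": "coca-cola",
    "coca cola": "coca-cola",
    "coca-cola": "coca-cola",
}


def normalise(value):
    """Lower-case, collapse whitespace, then map known variants."""
    key = re.sub(r"\s+", " ", value.strip().lower())
    return CANONICAL.get(key, key)


variants = ["coke", "coca cola", "coca-cola", "Coca Cola"]
normalised = {normalise(v) for v in variants}
```

Values that are not in the table are still lower-cased and trimmed, so they at least agree with themselves.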
General Data Mining Issues January
18, 2008 Slide 78

79
COMP527 Data Mining
Just because the base data includes an attribute
doesn't make it worth giving to the data mining
task. For example, denormalise a typical
commercial database and you might have:
ProductId, ProductName, ProductPrice,
SupplierId, SupplierAddress...
SupplierAddress is dependent on SupplierId
(remember SQL normalisation rules?) so they will
always appear together. A 100% confident, 100%
support association rule is not very interesting!
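One way to spot this kind of redundancy before mining is to check whether one attribute always determines another. The sketch below is an assumption-laden illustration: the function name and the sample rows are invented, and it only tests a single pair of attributes.

```python
def always_determines(rows, left, right):
    """True if each value of `left` co-occurs with exactly one value of `right`."""
    seen = {}
    for row in rows:
        # setdefault stores the first `right` value seen for this `left` value
        if seen.setdefault(row[left], row[right]) != row[right]:
            return False
    return True


# Illustrative denormalised rows only.
products = [
    {"ProductId": 1, "ProductPrice": 2.50, "SupplierId": "S1", "SupplierAddress": "1 Dock Road"},
    {"ProductId": 2, "ProductPrice": 4.00, "SupplierId": "S1", "SupplierAddress": "1 Dock Road"},
    {"ProductId": 3, "ProductPrice": 2.50, "SupplierId": "S2", "SupplierAddress": "9 Hill Street"},
]
address_is_redundant = always_determines(products, "SupplierId", "SupplierAddress")
price_is_redundant = always_determines(products, "SupplierId", "ProductPrice")
```

If the check succeeds, one of the two attributes can probably be dropped without losing information for the mining task.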
General Data Mining Issues January
18, 2008 Slide 79

80
COMP527 Data Mining
Is there any harm in putting in redundant values?
Yes for association rule mining, and ... yes for
other data mining tasks too.
Can treat text as thousands of numeric attributes:
term/frequency from our inverted indexes. But not
all of those terms are useful for determining (for
example) if an email is spam. 'the' does not
contribute to spam detection.
The number of attributes in the table will affect
the time it takes the data mining process to run.
It is often the case that we want to run it many
times, so getting rid of unnecessary attributes is
important.
General Data Mining Issues January
18, 2008 Slide 80

81
COMP527 Data Mining
  • Called 'dimensionality reduction'.
  • We'll look at techniques for this later in the
    course, but some simplistic versions:
  • Apply upper and lower thresholds of frequency
  • Noise removal functions
  • Remove redundant attributes
  • Remove attributes below a threshold of
    contribution to classification (Eg if attribute
    is evenly distributed, adds no knowledge)
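The first of the simplistic versions above, upper and lower frequency thresholds, can be sketched for text as follows. The function name, the whitespace tokenisation, the sample documents and the threshold values are all assumptions for illustration.

```python
from collections import Counter


def select_terms(documents, min_fraction, max_fraction):
    """Keep terms whose document frequency lies within the two thresholds."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.lower().split()))  # count each term once per document
    n = len(documents)
    return sorted(term for term, count in doc_freq.items()
                  if min_fraction <= count / n <= max_fraction)


# Illustrative documents only.
emails = [
    "the cheap pills offer",
    "the meeting is at noon",
    "the cheap offer ends at noon",
    "the project meeting",
]
kept = select_terms(emails, min_fraction=0.5, max_fraction=0.75)
```

With these thresholds 'the' is dropped for appearing in every document, and terms seen in only one document are dropped as too rare to generalise from.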

General Data Mining Issues January
18, 2008 Slide 81

82
COMP527 Data Mining
Learning a concept must stop at the appropriate
time. For example, could express the concept
of 'Is Spam?' as a list of spam emails. Any
email identical to those is spam. Accuracy: 0% on
new data, 100% on training data. Ooops! This is
called Over-Fitting. The concept has been
tailored too closely to the training data.
Story: US Military trained a neural
network to distinguish tanks vs rocks. It would
shoot the US tanks they trained it on very
consistently and never shot any rocks ... or
enemy tanks. (Probably fiction, but amusing.)
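The 'list of spam emails' classifier described above is easy to write down, which makes the over-fitting visible. The function names and the sample emails are assumptions for illustration.

```python
def train_memoriser(spam_emails):
    """'Learn' spam by memorising the exact spam emails seen in training."""
    known_spam = set(spam_emails)
    return lambda email: email in known_spam


def accuracy(classifier, labelled):
    """Fraction of (email, is_spam) pairs the classifier gets right."""
    correct = sum(1 for email, is_spam in labelled if classifier(email) == is_spam)
    return correct / len(labelled)


# Illustrative data only.
training = [("buy pills now", True), ("win a prize", True), ("lunch at noon?", False)]
unseen_spam = [("buy pills today", True), ("win a big prize", True)]

is_spam = train_memoriser(email for email, label in training if label)
train_accuracy = accuracy(is_spam, training)
unseen_accuracy = accuracy(is_spam, unseen_spam)
```

The memoriser is perfect on the training set and wrong on every new spam email, because nothing it stored generalises beyond exact matches.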
General Data Mining Issues January
18, 2008 Slide 82

83
COMP527 Data Mining
Extreme case of over-fitting: Algorithm tries to
learn a set of rules to determine class.
Rule1: attr1=val1/1 and attr2=val2/1 and attr3=val3/1 => class1
Rule2: attr1=val1/2 and attr2=val2/2 and attr3=val3/2 => class2
Urgh. One rule for each instance is useless.
Need to prevent the learning from becoming too
specific to the training set, but also don't want
it to be too broad. Complicated!
General Data Mining Issues January
18, 2008 Slide 83

84
COMP527 Data Mining
Extreme case of under-fitting: Always pick the
most frequent class, ignore the data
completely. Eg if one class makes up 99% of
the data, then a 'classifier' that always picks
this class will be correct 99% of the time! But
probably the aim of the exercise is to determine
the 1%, not the 99%... making it accurate 0% of
the time when you need it.
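The always-pick-the-majority 'classifier' above can be sketched in a few lines. The class names 'common' and 'rare' and the 99/1 split are assumptions chosen to mirror the example.

```python
from collections import Counter


def train_majority(labels):
    """Ignore the attributes entirely and always predict the most frequent class."""
    most_common = Counter(labels).most_common(1)[0][0]
    return lambda instance: most_common


# Illustrative labels only: 99 'common' instances and 1 'rare' one.
labels = ["common"] * 99 + ["rare"]
predict = train_majority(labels)
predictions = [predict(None) for _ in labels]

overall_accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
rare_found = sum(p == t == "rare" for p, t in zip(predictions, labels))
```

Overall accuracy looks excellent, yet not a single 'rare' instance is found, which is why accuracy alone is a poor measure on skewed data.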
General Data Mining Issues January
18, 2008 Slide 84

85
COMP527 Data Mining
  • We may be able to reduce the number of
    attributes, but most of the time we're not
    interested in small 'toy' databases, but huge
    ones.
  • When there are millions of instances, and
    thousands of attributes, that's a LOT of data to
    try to find a model for.
  • Very important that data mining algorithms scale
    well.
  • Can't keep all data in memory
  • Might not be able to keep all results in memory
    either
  • Might have access to distributed processing?
  • Might be able to train on a sample of the data?
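If training on a sample is acceptable and the data cannot all be held in memory, one standard way to draw the sample in a single pass is reservoir sampling. The slides do not name a sampling method, so this is an assumption about how it might be done; the function name and parameters are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from an iterable of unknown length.

    Only k items are held in memory at any time.
    """
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)        # inclusive at both ends
            if j < k:
                sample[j] = item         # replace with probability k / (i + 1)
    return sample


# Illustrative use: sample 10 instances from a stream of 1000.
sample = reservoir_sample(iter(range(1000)), 10, random.Random(0))
```

If the stream has fewer than k items, the sketch simply returns all of them.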

General Data Mining Issues January
18, 2008 Slide 85

86
COMP527 Data Mining
  • Problem Exists Between Keyboard And Chair.
  • Data Mining experts are probably no