comp3776: Data Mining and Text Analytics - PowerPoint PPT Presentation



comp3776: Data Mining and Text Analytics


Title: PowerPoint Presentation Author: Eric Atwell Last modified by: Eric Atwell Created Date: 9/6/2002 9:04:01 AM Document presentation format: 35mm Slides – PowerPoint PPT presentation

Slides: 24
Provided by: EricA109
Transcript and Presenter's Notes

Title: comp3776: Data Mining and Text Analytics

comp3776: Data Mining and Text Analytics
  • Intro to Data Mining
  • By Eric Atwell, School of Computing, University
    of Leeds
  • (including re-use of teaching resources from
    other sources, esp. Comp3740 Knowledge Management
    and Adaptive Systems, School of Computing,
    University of Leeds)

What has Machine Learning got to do with
Computing / Information Systems?
  • "Most international organizations produce more
    information in a week than many people could read
    in a lifetime" (Adriaans and Zantinge)

Objectives of knowledge discovery or machine
learning or data mining
  • Data mining is about discovering patterns in
    data
  • For this we need:
  • KD/DM techniques, algorithms, tools, eg BootCat,
  • A methodological framework to guide us, in
    collecting data and applying the best algorithms

Data Mining, Machine Learning, Knowledge
Discovery, Text Mining
  • Data Mining was originally about learning
    patterns from DataBases, data structured as
    Records, Fields
  • Knowledge Discovery is an exotic term for DM???
  • Increasingly, data is unstructured text (WWW), so
  • Text Mining is a new subfield of DM, focussing on
    Knowledge Discovery from unstructured text data

define data mining
  • Data mining, also known as knowledge-discovery in
    databases (KDD), is the practice of automatically
    searching large stores of data for patterns. To
    do this, data mining uses computational
    techniques from artificial intelligence,
    statistics and pattern recognition.

define text mining
  • Text mining, also known as intelligent text
    analysis, text data mining or knowledge-discovery
    in text (KDT), refers generally to the process of
    extracting interesting and non-trivial
    information and knowledge from unstructured text.
    Text mining is a young interdisciplinary field
    which draws on information retrieval, data
    mining, machine learning, statistics and
    computational linguistics.

define knowledge discovery
  • Knowledge discovery is the process of finding
    novel, interesting, and useful patterns in data.
    Data mining is a subset of knowledge discovery.
    It lets the data suggest new hypotheses to test.
  • Data mining, also known as knowledge-discovery in
    databases (KDD), is the practice of automatically
    searching large stores of data for patterns. To
    do this, data mining uses computational
    techniques from AI, statistics and pattern
    recognition.

Data Mining Overview
Concepts, Instances or examples, Attributes
Data Mining
Concept Descriptions
Each instance is an example of the concept to be
learned or described. The instance may be
described by the values of its attributes.
  • Input to a data mining algorithm is in the form
    of a set of examples, or instances.
  • Each instance is represented as a set of features
    or attributes.
  • Usually in DB data mining this set takes the form
    of a flat file: each instance is a record in the
    file, and each attribute is a field in the record.
  • In text mining, an instance may be a word/term in
    its context (surrounding words/document)
  • The concepts to be learned are formed from
    patterns discovered within the set of instances.
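
As a rough illustration of the text-mining case above, the sketch below turns a short text into instances, one per word, each described by attributes taken from its surrounding words. The function name `context_instances`, the attribute names (`left1`, `right1`) and the window size are assumptions made for this example; they are not taken from the course materials.

```python
def context_instances(text, window=1):
    """Return one instance (a dict of attributes) per word in the text.

    Each instance records the term itself plus the words found up to
    `window` positions to its left and right (None at the text edges).
    """
    tokens = text.lower().split()
    instances = []
    for i, term in enumerate(tokens):
        instance = {"term": term}
        for offset in range(1, window + 1):
            left = i - offset
            right = i + offset
            instance[f"left{offset}"] = tokens[left] if left >= 0 else None
            instance[f"right{offset}"] = tokens[right] if right < len(tokens) else None
        instances.append(instance)
    return instances


instances = context_instances("data mining finds patterns in data")
for inst in instances:
    print(inst)
```

Each dict plays the role of a record in a flat file, and each key plays the role of a field.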

  • The types of concepts we try to learn include:
  • Key indicators: features or terms specific to
    our domain
  • Clusters or natural partitions
  • E.g. we might cluster customers according to their
    shopping habits.
  • E.g. is this web-page British or American English?
  • Rules for classifying examples into pre-defined
    classes
  • E.g. mature students studying information systems
    with a high grade for General Studies A level are
    likely to get a 1st class degree
  • General associations
  • E.g. people who buy nappies are in general likely
    also to buy beer
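
A general association such as "nappies, so also beer" is usually judged by two numbers: support (how often the items occur together) and confidence (how often the consequent occurs when the antecedent does). The sketch below computes both. The shopping baskets are invented purely for illustration, and the helper names `support` and `confidence` are assumptions of this example.

```python
# Hypothetical shopping baskets, invented for illustration only.
baskets = [
    {"nappies", "beer", "crisps"},
    {"nappies", "beer"},
    {"nappies", "milk"},
    {"beer", "crisps"},
    {"milk", "bread"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that
    also contain the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


rule_support = support({"nappies", "beer"}, baskets)
rule_confidence = confidence({"nappies"}, {"beer"}, baskets)
print(f"nappies -> beer: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

With these made-up baskets, two of the five contain both items, and two of the three nappy-buying baskets also contain beer.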

More concepts
  • The types of concepts we try to learn include:
  • Unexpected (suspicious?) associations or
  • E.g. known suspects A, B, C all phoned D last week
  • Numerical prediction
  • E.g. look for rules to predict what salary a
    graduate will get, given A level results, age,
    gender, programme of study and degree result;
    this may give us an equation:
  • Salary = a × A-level + b × Age + c × Gender + d × Prog
  • (but are Gender, Programme really numbers???)
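
One common answer to the question above is to turn each nominal attribute into a set of 0/1 indicator values (one-hot encoding), so that the single weights c and d each become a small group of weights. The sketch below shows this under stated assumptions: the category lists, the example graduate, the weights and the intercept are all hypothetical, and the slides do not say which encoding the course used.

```python
# Hypothetical category lists for the two nominal attributes.
GENDERS = ["female", "male"]
PROGRAMMES = ["computing", "information systems", "maths"]


def one_hot(value, categories):
    """Encode a nominal value as a list of 0/1 indicators."""
    return [1.0 if value == category else 0.0 for category in categories]


def encode(graduate):
    """Turn one graduate record into a purely numeric feature vector."""
    return (
        [float(graduate["a_level"]), float(graduate["age"])]
        + one_hot(graduate["gender"], GENDERS)
        + one_hot(graduate["programme"], PROGRAMMES)
    )


def predict_salary(graduate, weights, intercept=0.0):
    """Linear prediction: intercept plus the weighted sum of the features."""
    return intercept + sum(w * x for w, x in zip(weights, encode(graduate)))


# Invented example record and weights, for illustration only.
graduate = {"a_level": 300, "age": 22, "gender": "female",
            "programme": "information systems"}
weights = [10.0, 100.0, 0.0, 0.0, 500.0, 1000.0, 0.0]

features = encode(graduate)
salary = predict_salary(graduate, weights, intercept=15000.0)
print(features)
print(salary)
```

In practice the weights would be learned from data (for example by least squares) rather than written by hand.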

DB Example: weather to play?
  • @relation weather.symbolic
  • @attribute outlook {sunny, overcast, rainy}
  • @attribute temperature {hot, mild, cool}
  • @attribute humidity {high, normal}
  • @attribute windy {TRUE, FALSE}
  • @attribute play {yes, no}
  • @data
  • sunny,hot,high,FALSE,no
  • sunny,hot,high,TRUE,no
  • overcast,hot,high,FALSE,yes
  • rainy,mild,high,FALSE,yes
  • rainy,cool,normal,FALSE,yes
  • rainy,cool,normal,TRUE,no
  • overcast,cool,normal,TRUE,yes
  • sunny,mild,high,FALSE,no
  • sunny,cool,normal,FALSE,yes
  • rainy,mild,normal,FALSE,yes
  • sunny,mild,normal,TRUE,yes
  • overcast,mild,high,TRUE,yes
  • overcast,hot,normal,FALSE,yes
  • rainy,mild,high,TRUE,no
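
To make the "instance = record, attribute = field" idea concrete, the sketch below reads the nominal weather data above into a list of records and counts the `play` outcomes for each `outlook` value. The reader `parse_arff` is a hand-written helper assumed for this example; it handles only this simple subset of the format (no quoted values, no sparse data) and is not a general ARFF parser.

```python
from collections import Counter, defaultdict

ARFF = """\
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
"""


def parse_arff(text):
    """Read attribute names and data rows from a simple ARFF-style text."""
    attributes, instances, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        lower = line.lower()
        if lower.startswith("@attribute"):
            attributes.append(line.split()[1])  # second token is the name
        elif lower.startswith("@data"):
            in_data = True
        elif in_data:
            values = [value.strip() for value in line.split(",")]
            instances.append(dict(zip(attributes, values)))
    return attributes, instances


attributes, instances = parse_arff(ARFF)

# Count how often each play outcome occurs for each outlook value.
counts = defaultdict(Counter)
for instance in instances:
    counts[instance["outlook"]][instance["play"]] += 1

for outlook, outcome_counts in counts.items():
    print(outlook, dict(outcome_counts))
```

Counts like these are the raw material for simple classification rules, for instance a rule that predicts `play` from `outlook` alone.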
  • @relation weather
  • @attribute outlook {sunny, overcast, rainy}
  • @attribute temperature real
  • @attribute humidity real
  • @attribute windy {TRUE, FALSE}
  • @attribute play {yes, no}
  • @data
  • sunny,85,85,FALSE,no
  • sunny,80,90,TRUE,no
  • overcast,83,86,FALSE,yes
  • rainy,70,96,FALSE,yes
  • rainy,68,80,FALSE,yes
  • rainy,65,70,TRUE,no
  • overcast,64,65,TRUE,yes
  • sunny,72,95,FALSE,no
  • sunny,69,70,FALSE,yes
  • rainy,75,80,FALSE,yes

Text mining example: Which English dominates the
WWW, UK or US?
  • "First catch your rabbit" (Mrs Beeton's
    cookbook). Other tools are possible, but
    WWW-BootCat was easier to use
  • First sign up for Domain, SketchEngine account,
    Google key; download seeds-en from
  • (see comp3740 specifications and lecture notes)
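
Once a corpus has been collected, one naive way to ask "UK or US?" of a piece of text is to count spelling variants. The sketch below is only an illustration of that idea, not the WWW-BootCat procedure referred to above; the tiny `VARIANTS` list and the function names are assumptions of this example, and a real study would need a far larger word list.

```python
import re
from collections import Counter

# Small hypothetical list of UK -> US spelling pairs.
VARIANTS = {
    "colour": "color",
    "centre": "center",
    "organise": "organize",
    "analyse": "analyze",
}


def count_variants(text):
    """Return (uk_count, us_count) of known spelling variants in the text."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    uk = sum(words[uk_form] for uk_form in VARIANTS)
    us = sum(words[us_form] for us_form in VARIANTS.values())
    return uk, us


def guess_variety(text):
    """Label the text by whichever spelling variety is more frequent."""
    uk, us = count_variants(text)
    if uk > us:
        return "UK"
    if us > uk:
        return "US"
    return "undecided"


print(guess_variety("The colour of the city centre"))
print(count_variants("We analyze color; they analyse colour and colour"))
```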

Example 2: Data Mining for an ontology
  • Ontology: the concepts in a discipline, and
    meaning-relationships between these concepts
  • "Concepts" roughly equates to terminology:
    specialist words and phrases in a discipline
  • WordNet is freely available for general English
  • What about other languages? EuroWordNet,
    BalkaNet (but not ALL languages!)
  • What about specific domains? Domain-specific
    ONTOLOGIES have to be devised (by experts)
  • What about my own specific domain/language?
  • Automatic extraction of key words / concepts from
    example documents (machine learning / knowledge
    discovery)
Automatic terminology extraction
  • Terminology extraction thesaurus construction
  • based on documents (either retrieved set or the
    whole collection) as Corpus training text set
  • define a measure of how close one index term is
    to another in meaning-space, ?or literal
  • for each term, form a neighbourhood comprising
    the nearest n terms
  • treat these neighbourhoods like related
    thesaurus classes
  • terms with similar neighbourhoods are treated as

Finding coordinate terms
  • One attempt to define how close a term is to
    another:
  • If two terms both index the same document, and
    this happens many times across the collection,
    then they are deemed to be close.
  • From the document-term matrix, compute the
    term-correlation matrix
  • The term-correlation matrix can be normalised so
    that terms that index a lot of documents don't
    have an unfair chance: reduce the weight of common
    terms
Other ways to find specialist terms
  • Other ways to find domain-specific terms and
  • Collect a domain corpus, and find terms that differ
    from a generic gold standard corpus, e.g. the
    British National Corpus
  • Collocation-groups: for each term, collect its
    collocations in the Corpus, i.e. the other words it
    appears next to (or near to). If two terms have
    similar collocation-sets, then they are deemed to
    be close.
  • Association matrix based on proximity: compute the
    average distance between pairs of terms (no. of
    words between them, literally), and use this as the
    closeness metric
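
The collocation-group approach can be sketched as below: collect, for every term, the set of words that appear within a small window around it, then compare two terms by the overlap of their sets. Jaccard overlap is used here as the similarity measure; that choice, the window size, the function names and the toy sentence are all assumptions of this example.

```python
def collocation_sets(tokens, window=1):
    """Map each term to the set of words seen within `window` positions of it."""
    sets = {}
    for i, term in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        neighbours = tokens[lo:i] + tokens[i + 1:hi]
        sets.setdefault(term, set()).update(neighbours)
    return sets


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy corpus, invented for illustration.
tokens = ("drink hot tea daily drink hot coffee daily "
          "buy fast computers online").split()
sets = collocation_sets(tokens)

print(sets["tea"], sets["coffee"], sets["computers"])
print(jaccard(sets["tea"], sets["coffee"]))
print(jaccard(sets["tea"], sets["computers"]))
```

Here "tea" and "coffee" share the same neighbours and so count as close, while "tea" and "computers" share none.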

Why build a thesaurus?
  • a thesaurus or ontology can be used to normalise
    a vocabulary and queries (?or documents?)
  • it can be used (with some human intervention) to
    increase recall and precision
  • generic thesaurus/ontology may not be effective
    in specialized collections and/or queries
  • Semi-automatic construction of thesaurus/ontology
    based on the retrieved set of documents has
    produced some promising results, e.g. Semantic

Data Mining: Key points
  • Knowledge Discovery (Data Mining) tools
    semi-automate the process of discovering patterns
    in data.
  • Tools differ in terms of what concepts they
    discover (differences, key-terms, clusters,
    decision-trees, rules)
  • and in terms of the output they provide (eg
    clustering algorithms provide a set of
  • Selecting the right tools for the job is based on
    business objectives: what is the USE for the
    knowledge discovered?

A Data Mining consultant
  • You should be able to:
  • Decide which is the appropriate data mining
    technique for a given problem defined in terms
    of business objectives.
  • Decide which is the most appropriate form of
    input (which attributes/features will be useful
    for learning) and output (what does your client
    want to see?)