1
Text Mining
  • May 23, 2006

2
Text Mining Definition
  • Many definitions in the literature
  • The non-trivial extraction of implicit,
    previously unknown, and potentially useful
    information from (large amounts of) textual data
  • An exploration and analysis of textual
    (natural-language) data by automatic and
    semi-automatic means to discover new knowledge
  • What is previously unknown information?
  • Strict definition
  • Information that not even the writer knows
  • Lenient definition
  • Rediscover the information that the author
    encoded in the text

3
Text Mining: Views from T2K and ThemeWeaver
4
Text Mining: ThemeScape and ThemeRiver
  • Visualizing Relationships Between Documents

Images from Pacific Northwest National Laboratory
5
Text Characteristics (1)
  • Large textual databases
  • Enormous wealth of textual information on the Web
  • Publications are electronic
  • High dimensionality
  • Consider each word/phrase as a dimension
  • Noisy data
  • Spelling mistakes
  • Abbreviations
  • Acronyms
  • Text messages are very dynamic
  • Web pages are constantly being generated
    (removed)
  • Web pages are generated from database queries
  • Not well-structured text
  • Email/Chat rooms
  • r u available ?
  • Hey whazzzzzz up
  • Speech

6
Text Characteristics (2)
  • Dependency
  • Relevant information is a complex conjunction of
    words/phrases
  • Order of words in the query
  • hot dog stand in the amusement park
  • hot amusement stand in the dog park
  • Ambiguity
  • Word ambiguity
  • Pronouns (he, she, ...)
  • Synonyms (buy, purchase)
  • Words with multiple meanings (bat may refer to
    baseball or to the mammal)
  • Semantic ambiguity
  • The king saw the rabbit with his glasses.
    (multiple possible meanings)
  • Authority of the source
  • IBM is more likely to be an authoritative source
    than my distant second cousin

7
Text Mining Process
  • Text Preprocessing
  • Syntactic/Semantic Text Analysis
  • Part of Speech (POS) Tagging
  • Features Generation
  • Bag of Words
  • Feature Selection
  • Simple Counting
  • Statistics
  • Selection based on POS
  • Text/Data Mining
  • Classification (Supervised Learning)
  • Clustering (Unsupervised Learning)
  • Information Extraction
  • Analyzing Results
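
The steps above map naturally onto a standard text-classification
pipeline. Below is a minimal sketch using scikit-learn; scikit-learn
is an assumption here (the deck's own tooling is D2K/T2K), and the
corpus, labels, and parameter choices are invented for illustration.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB

docs = ["the game went into overtime", "cheap pills buy now",
        "the team won the game", "win money now click here"]
labels = ["news", "spam", "news", "spam"]

pipe = Pipeline([
    ("bow",    CountVectorizer(stop_words="english")),  # feature generation
    ("select", SelectKBest(chi2, k=5)),                 # feature selection
    ("model",  MultinomialNB()),                        # supervised learning
])
pipe.fit(docs, labels)
print(pipe.predict(["cheap pills for the team"]))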

8
Text Preprocessing: Syntactic/Semantic Text Analysis
  • Part Of Speech (POS) Tagging
  • Find the corresponding POS for each word
  • e.g., John (noun) gave (verb) the (det) ball
    (noun)
  • Word Sense Disambiguation
  • Context-based or proximity-based
  • Can be very accurate
  • Parsing
  • Generates a parse tree (graph) for each sentence
  • Each sentence is a stand-alone graph
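
A minimal sketch of the POS-tagging example above using NLTK (an
assumption: the deck itself uses the Brill tagger bundled with T2K).
The tokenizer and tagger models must be downloaded once.

import nltk

# one-time setup:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
tokens = nltk.word_tokenize("John gave the ball")
print(nltk.pos_tag(tokens))
# [('John', 'NNP'), ('gave', 'VBD'), ('the', 'DT'), ('ball', 'NN')]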

9
Feature Generation Bag of Words
  • Text document is represented by the words it
    contains (and their occurrences)
  • e.g., "Lord of the rings" → {the, Lord,
    rings, of}
  • Highly efficient
  • Makes learning far simpler and easier
  • Order of words is not that important for certain
    applications
  • Stemming
  • Reduce dimensionality
  • Identifies a word by its root
  • e.g., flying, flew → fly (see the sketch below)
  • Stop words
  • Identifies the most common words that are
    unlikely to help with text mining
  • e.g., the, a, an, you
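
A hedged sketch of the three ideas above: a bag-of-words count,
stop-word removal, and reducing words to a root. Note that mapping
"flew" to "fly" strictly requires lemmatization rather than a
rule-based stemmer (a Porter stemmer yields "fli"); the WordNet data
must be downloaded once via nltk.download("wordnet").

from collections import Counter
from nltk.stem import WordNetLemmatizer

stop_words = {"the", "a", "an", "you", "of"}

text = "the Lord of the rings"
tokens = [t.lower() for t in text.split() if t.lower() not in stop_words]
print(Counter(tokens))                   # Counter({'lord': 1, 'rings': 1})

# root reduction: the flew -> fly example needs verb lemmatization
lem = WordNetLemmatizer()
print(lem.lemmatize("flying", pos="v"))  # fly
print(lem.lemmatize("flew", pos="v"))    # fly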

10
Feature Selection
  • Reduce Dimensionality
  • Learners have difficulty addressing tasks with
    high dimensionality
  • Irrelevant Features
  • Not all features help!
  • Remove features that occur in only a few
    documents
  • Reduce features that occur in too many documents
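
Both frequency rules can be expressed as document-frequency
thresholds. A minimal sketch with scikit-learn's CountVectorizer, on
an invented corpus:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["spam spam offer", "offer meeting today",
        "meeting agenda today", "spam offer today"]

vec = CountVectorizer(min_df=2,    # drop terms in fewer than 2 documents
                      max_df=0.9)  # drop terms in more than 90% of documents
vec.fit(docs)
print(vec.get_feature_names_out())  # 'agenda' (1 document) is dropped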

11
Text Mining: General Application Areas
  • Information Retrieval
  • Indexing and retrieval of textual documents
  • Finding a set of (ranked) documents that are
    relevant to the query
  • Information Extraction
  • Extraction of partial knowledge in the text
  • Web Mining
  • Indexing and retrieval of textual documents and
    extraction of partial knowledge using the web
  • Classification
  • Predict a class for each text document
  • Clustering
  • Generating collections of similar text documents

12
Text Mining Applications
  • Email: Spam filtering
  • News Feeds: Discover what is interesting
  • Medical: Identify relationships and link
    information from different medical fields
  • Homeland Security
  • Marketing: Discover distinct groups of potential
    buyers and make suggestions for other products
  • Industry: Identifying groups of competitors'
    web pages
  • Job Seeking: Identify parameters in searching
    for jobs

13
Text Mining: Supervised vs. Unsupervised Learning
  • Supervised learning (Classification)
  • Data (observations, measurements, etc.) are
    accompanied by labels indicating the class of the
    observations
  • Split into training data and test data for model
    building process
  • New data is classified based on the model built
    with the training data
  • Unsupervised learning (Clustering)
  • Class labels of training data are unknown
  • Given a set of measurements, observations, etc.
    with the aim of establishing the existence of
    classes or clusters in the data

14
Text Mining: Classification Definition
  • Given: A collection of labeled records
  • Each record contains a set of features
    (attributes) and the true class (label)
  • Create a training set to build the model
  • Create a testing set to test the model
  • Find: A model for the class as a function of
    the values of the features
  • Goal: Assign a class (as accurately as possible)
    to previously unseen records
  • Evaluation: What is good classification?
  • Correct classification
  • Known label of a test example is identical to
    the predicted class from the model
  • Accuracy ratio
  • Percent of test-set examples that are correctly
    classified by the model
  • A distance measure between classes can be used
  • e.g., classifying a "football" document as
    "basketball" is not as bad as classifying it
    as "crime" (see the sketch below)
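
A tiny worked sketch of both evaluation ideas, with invented labels
and an invented distance table:

true = ["football", "football", "basketball", "crime"]
pred = ["football", "basketball", "basketball", "football"]

# accuracy ratio: fraction of test examples classified correctly
print(sum(t == p for t, p in zip(true, pred)) / len(true))   # 0.5

# distance between classes: a near miss costs less than a bad miss
distance = {("football", "basketball"): 1, ("crime", "football"): 3}
cost = sum(distance.get((t, p), 0) for t, p in zip(true, pred))
print(cost)   # 1 + 3 = 4; a lower total cost is better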

15
Text Mining: Clustering Definition
  • Given: A set of documents and a similarity
    measure among documents
  • Find: Clusters such that
  • Documents in one cluster are more similar to one
    another
  • Documents in separate clusters are less similar
    to one another
  • Goal: Finding a correct set of clusters
  • Similarity Measures
  • Euclidean distance if attributes are continuous
  • Other problem-specific measures
  • e.g., how many words two documents have in
    common
  • Evaluation: What is good clustering?
  • Produce high quality clusters with
  • high intra-class similarity
  • low inter-class similarity
  • Quality of a clustering method is also measured
    by its ability to discover some or all of the
    hidden patterns
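
A minimal sketch of document clustering, assuming scikit-learn and an
invented four-document corpus; k-means uses Euclidean distance, which
on L2-normalized TF-IDF vectors ranks documents much like cosine
similarity, i.e., by shared words.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["the team won the game", "the team lost the game",
        "stocks fell on Monday", "stocks rallied on Friday"]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)   # cluster id per document, e.g., [0 0 1 1]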

16
Classification Techniques
  • Bayesian classification
  • Decision trees
  • Neural networks
  • Instance-Based Methods
  • Support Vector Machines

17
What is Information Extraction?
Advisory Programmer - Oracle (Austin, TX)
Response Code 1008-0074-97-iexc-jcn
Responsibilities This is an exciting opportunity
with Siemens Wireless Terminals a start-up
venture fully capitalized by a Global Leader in
Advanced Technologies. Qualified candidates will
Responsible for assisting with requirements
definition, analysis, design and implementation
that meet objectives, codes difficult and
sophisticated routines . Develops project plans,
schedules and cost data. Develop test plans and
implement physical design of databases. Develop
shell scripts for administrative and background
tasks, stored procedures and triggers. Using
Oracles Designer 2000, assist with Data Model
maintenance and assist with applications
development using Oracle Forms. Qualifications
BSCS, BSMIS or closely related field or related
equivalent knowledge normally obtained through
technical education programs. 5-8 years of
professional experience in development, system
design analysis, programming, installation using
Oracle development
  • Given
  • Source of textual documents
  • Well-defined, limited query (text based)
  • Find
  • Sentences with relevant information
  • Extract the relevant information and ignore
    non-relevant information (important!)
  • Link related information and output in a
    predetermined format

Example from Dan Roth's web page
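
A hedged sketch of the "extract and link relevant information" step
on the posting above, using simple regular expressions; the patterns
are illustrative only, not a general information-extraction system.

import re

posting = ("Advisory Programmer - Oracle (Austin, TX)\n"
           "Response Code 1008-0074-97-iexc-jcn\n"
           "Qualifications ... 5-8 years of professional experience ...")

title = re.search(r"^(.+) - (\w+) \((.+)\)", posting)
code = re.search(r"Response Code ([\w-]+)", posting)
years = re.search(r"(\d+-\d+) years", posting)

# link the extracted fields into a predetermined format
print({"title": title.group(1), "company": title.group(2),
       "location": title.group(3), "code": code.group(1),
       "experience": years.group(1)})
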
18
T2K and Using GATE
  • Load GATE_IE_Viz itinerary to see Information
    Extraction in action

19
Information Extraction from Streaming Text
  • Information extraction
  • The process of using advanced, automated
    machine-learning approaches
  • to identify entities in text documents
  • and to extract this information along with the
    relationships these entities may have in the
    text documents
  • This project demonstrates information extraction
    of names, places and organizations from real-time
    news feeds. As news articles arrive, the
    information is extracted and displayed.

20
Bayesian Classification
  • Idea: assign to example X the class label C such
    that P(C|X) is maximal
  • Computes, for an input value, the probability of
    each class; e.g., given variable X with value xi,
    assign Class A if the probability of Class A is
    greater than that of Class B

Mathematically speaking: if the class-conditional
likelihoods P(xi | cj) and the prior probabilities
P(cj) are known, the classifier assigns to datum xi
the class cj with the highest posterior probability
P(cj | xi) = P(xi | cj) P(cj) / P(xi).
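
A tiny worked instance of this rule with invented numbers: two
classes with equal priors and known likelihoods for one observed
value x.

likelihood = {"A": 0.8, "B": 0.2}   # P(x | C)
prior      = {"A": 0.5, "B": 0.5}   # P(C)

evidence = sum(likelihood[c] * prior[c] for c in prior)   # P(x) = 0.5
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}
print(posterior)                          # {'A': 0.8, 'B': 0.2}
print(max(posterior, key=posterior.get))  # assign class A
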
21
Bayesian Classification Why?
  • Probabilistic learning: Calculates explicit
    probabilities for hypotheses; among the most
    practical approaches to certain types of
    learning problems
  • Incremental: Each training example can
    incrementally increase/decrease the probability
    that a hypothesis is correct
  • Prior knowledge: Can be combined with observed
    data
  • Standard
  • Provides a standard of optimal decision making
    against which other methods can be measured
  • In a simpler form, provides a baseline against
    which other methods can be measured

22
Naïve Bayesian Classification
  • Naïve assumption
  • Feature independence
  • P(xi | C) is estimated as the relative frequency
    of examples having value xi as a feature in
    class C (see the sketch below)
  • Computationally easy!!!
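
A from-scratch sketch of that estimate on two invented one-message
"classes", with add-one smoothing (an assumption not stated on the
slide) so unseen words do not zero out the product:

from collections import Counter

counts = {"spam": Counter("buy cheap pills buy now".split()),
          "ham":  Counter("meeting agenda for the team".split())}
prior = {"spam": 0.5, "ham": 0.5}
vocab = set().union(*counts.values())

def p_word(word, cls, alpha=1.0):
    # relative frequency of the word in the class, with smoothing
    c = counts[cls]
    return (c[word] + alpha) / (sum(c.values()) + alpha * len(vocab))

def score(doc, cls):
    p = prior[cls]
    for w in doc.split():
        p *= p_word(w, cls)   # naive assumption: features are independent
    return p

print(max(prior, key=lambda c: score("buy pills", c)))   # spam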

23
Classification by Decision Tree
  • Decision tree
  • Flow-chart-like tree structure
  • Internal node denotes a test on an attribute
  • Branch represents an outcome of the test
  • Leaf nodes represent class labels or class
    distribution
  • Decision tree generation consists of two phases
  • Tree construction
  • Tree pruning
  • Identify and remove branches that reflect noise
    or outliers
  • Use of decision tree
  • Classifying an unknown example
  • Test the attribute of the example against the
    decision tree
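
A minimal sketch of such a tree over bag-of-words counts, assuming
scikit-learn and an invented four-document corpus; each internal node
tests one word-count attribute and the leaves carry class labels.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

docs = ["cheap pills now", "meeting at noon",
        "cheap offer now", "team meeting"]
labels = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(docs)

tree = DecisionTreeClassifier(max_depth=2).fit(X, labels)
print(export_text(tree, feature_names=list(vec.get_feature_names_out())))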

24
  • Text Mining in D2K
  • Email Classification
  • Naïve Bayesian

25
Email Classification
  • Input
  • Multiple mailboxes where each mailbox represents
    a class
  • Output
  • Results of the model on the testing set
  • Model that classifies future email

26
Mailbox Files
  • MONO.mbx
  • Mono Developer Discussion List
  • mono-list@lists.ximian.com
  • 216 messages
  • SPAM.mbx
  • Spam Mailbox
  • 100 messages
  • JINI.mbx
  • JINI-Users mail list
  • JINI-USERS@JAVA.SUN.COM
  • 104 messages

27
Opening the Itinerary
  • Click on the Itinerary Pane in the Resource
    Panel
  • Expand the T2K directory with a single click
  • Double click on EmailClassification-T2K

28
D2K A Few Features
  • Properties indicate that a module has settings
    that can be changed before execution
  • Indicated by a P in the lower left corner of a
    module
  • e.g., filename, maximum iterations, etc.
  • Resource Manager
  • Loads data that is accessible by all modules

29
EmailClassification Itinerary
  • Use of D2K's Resource Manager to store data that
    will serve as a dictionary
  • Contextual Rule File
  • Lexical Rule File
  • Stop words
  • Lexicon

30
EmailClassification Itinerary (2)
  • Load the mailbox data
  • Input File Name
  • Specify a directory by changing properties of
  • ReadFileNames
  • Sends each filename in this directory as output
  • MBX Email Parser
  • Parses the mailbox files
  • Email -> Document
  • Converts the email document to the standard
    document object
  • Flags control whether to include sender/receiver
    info
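
For comparison, a rough sketch of the same load-and-parse steps using
Python's standard mailbox module; this assumes the files are present
locally and are in standard Unix mbox format (in the deck itself the
parsing is done by D2K's MBX Email Parser module).

import mailbox

for name in ["MONO.mbx", "SPAM.mbx", "JINI.mbx"]:
    for msg in mailbox.mbox(name):        # parse the mailbox file
        sender = msg["From"]              # sender/receiver info is optional
        if not msg.is_multipart():
            document = msg.get_payload()  # email -> plain-text document
            print(name, sender, len(document))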

31
EmailClassification Itinerary (3)
  • Pre-Process text data
  • Tokenizer
  • Forms word tokens for each word or symbol
  • Brill Pre-Tagger
  • Assigns part of speech tag to each token
  • Can be used without the following two modules
  • Brill Lexical Tagger
  • Adjusts tag based on lexical rules
  • Lexical must precede Contextual
  • Brill Contextual Tagger
  • Adjusts tag based on contextual rules

32
EmailClassification Itinerary (4)
  • More Text Pre-Processing
  • Filter Stop Words
  • Removes Stop Words
  • Stemmer
  • Transforms words into their word stem
  • Removes plurals, etc.
  • Select Tokens By Speech Tag
  • Removes tokens that do not match speech tag of
    interest
  • Document -> TermList
  • Counts the frequency of terms in the document
  • Adjusts counts for title weighting

33
EmailClassification Itinerary (5)
  • Creation of Sparse Table for learning
  • Add Series of Ints
  • Counts all values it receives
  • Outputs sum
  • TermLists -> SparseTable
  • Creates a Sparse Table to be used for mining
  • Term counts across documents are sparse
  • Conserves memory and computation
  • Feature Filter
  • Eliminates terms that occur in only one document
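
A minimal sketch of the same two steps, assuming scikit-learn on an
invented corpus: the term-document table is built as a scipy sparse
matrix (zero cells cost no memory), and min_df=2 plays the role of
the feature filter.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["jini users list", "mono developer list",
        "spam offer", "jini list archive"]

vec = CountVectorizer(min_df=2)   # eliminate single-document terms
X = vec.fit_transform(docs)       # sparse term-document matrix

print(vec.get_feature_names_out())  # ['jini' 'list']
print(X.shape, X.nnz)               # (4, 2) with 5 nonzero counts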

34
EmailClassification Itinerary (6)
  • Selecting input and output attributes for
    classification
  • Choose Attributes
  • Select the input attributes
  • Select the output attribute classification_DOCPROP

35
EmailClassification Itinerary (7)
  • Setting testing and training sets
  • Simple Train Test
  • Property window to set train and test percentages

36
EmailClassification Itinerary (8)
  • Model building and testing
  • Naïve Bayes Text Model
  • Builds a Naïve Bayesian classification model
  • Model Predict
  • Applies the model to the testing data

37
EmailClassification Itinerary (9)
  • Results of model on testing data
  • Prediction Table Report
  • Shows classification error
  • Shows confusion matrix

38
EmailClassification Itinerary (10)
  • Table Viewer
  • Shows original data
  • Shows the predicted column

39
Execute Itinerary
  • Check Properties for the Input File Name module
  • Click the Run button

40
Results
  • Results of model on testing data
  • Prediction Table Report
  • Shows classification error
  • Shows confusion matrix
  • Original table had 6718 attributes
  • After filtering the table had 2830 attributes

41
Other Scenarios
  • Change the weight for words in titles
  • Change whether or not sender/receiver info is
    included
  • Take the filter module out
  • What happens to the accuracy of the model?
  • What happens to performance?

42
Scenario: Verbs Only
  • Original table had 1020 attributes
  • After filtering the table had 636 attributes

43
T2K in Review
  • Review T2K modules
  • Review T2K itineraries

44
The ALG Team
  • Staff
  • Bernie Acs
  • Loretta Auvil
  • David Clutter
  • Vered Goren
  • Eugene Grois
  • Luigi Marini
  • Robert McGrath
  • Chris Navarro
  • Greg Pape
  • Barry Sanders
  • Andrew Shirk
  • David Tcheng
  • Michael Welge
  • Students
  • Chen Chen
  • Hong Cheng
  • Yaniv Eytani
  • Fang Guo
  • Govind Kabra
  • Chao Liu
  • Haitao Mo
  • Xuanhui Wang
  • Qian Yang
  • Feida Zhu