Data Mining with Unstructured Data: A Study and Implementation of Industry Product(s)

1
Data Mining with Unstructured Data A Study And
Implementation of Industry Product(s)
  • Samrat Sen

2
Goals
  • Issues in Text Mining with Unstructured Data
  • Analysis of Data Mining products
  • Study of a Real Life Classification Problem
  • Strategy for solving the problem

3
Issues in Text Mining
  • Different from KDD and DM techniques in
    structured databases
  • Problems with those techniques
  • 1. Concerned with predefined fields
  • 2. Based on learning from an attribute-value
    database
  • e.g. the tables and induced rules on the next slide

4
Issues in Text Mining
Potential Customer Table
Person   Age  Sex  Income  Customer
Ann S    32   F    10,000  yes
Jane G   53   F    20,000  no
Sri S    35   M    65,000  yes
Egor     25   M    10,000  yes

Married to Table
Husband  Wife
Egor     Ann S
Sri H    Jane

Induced Rules
  • If Married(Person, Spouse) and Income(Person) > 25,000
    Then Potential-Customer(Spouse)
  • If Married(Person, Spouse) and Potential-Customer(Person)
    Then Potential-Customer(Spouse)

5
Issues in Text Mining
  • Algorithm techniques like
  • Association Extraction from indexed data,
  • Prototypical Document Extraction from full
    text
  • Industry standard data mining tools cannot be
    used directly
  • e.g. a usual process has to have a Text
    Transformer, a Text Analyzer, and a Summary Generator

6
Issues in Text Mining
  • The input and output interfaces and the file
    formats may cost time and money.
  • Exhaustive domains have to be set up for
    classification.
  • Costs and benefits have to be weighed before
    model selection.
  • 1. Gain from a correct positive prediction
  • 2. Loss from an incorrect positive
    prediction (false positive)
  • 3. Benefit from a correct negative
    prediction
  • 4. Cost of an incorrect negative
    prediction (false negative)
  • 5. Cost of project time (a better
    product/algorithm may come up)
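
The five cost and benefit items can be combined into a single figure per candidate model. The following is a minimal sketch, not a method taken from any of the products discussed; all payoff values, counts, and names (`Payoffs`, `ConfusionCounts`, `net_benefit`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Outcome counts for a binary classifier on a labelled test set."""
    true_pos: int
    false_pos: int
    true_neg: int
    false_neg: int


@dataclass
class Payoffs:
    """Hypothetical per-record payoffs, in arbitrary currency units."""
    gain_true_pos: float     # 1. gain from a correct positive prediction
    loss_false_pos: float    # 2. loss from a false positive
    benefit_true_neg: float  # 3. benefit from a correct negative prediction
    cost_false_neg: float    # 4. cost of a false negative


def net_benefit(counts: ConfusionCounts, payoffs: Payoffs,
                project_cost: float = 0.0) -> float:
    """Total payoff on the test set minus a fixed project cost (item 5)."""
    return (
        counts.true_pos * payoffs.gain_true_pos
        - counts.false_pos * payoffs.loss_false_pos
        + counts.true_neg * payoffs.benefit_true_neg
        - counts.false_neg * payoffs.cost_false_neg
        - project_cost
    )


# Hypothetical numbers: two candidate models scored on the same 1,000 records.
payoffs = Payoffs(gain_true_pos=50.0, loss_false_pos=10.0,
                  benefit_true_neg=0.0, cost_false_neg=25.0)
model_a = ConfusionCounts(true_pos=80, false_pos=40, true_neg=860, false_neg=20)
model_b = ConfusionCounts(true_pos=60, false_pos=5, true_neg=895, false_neg=40)

benefit_a = net_benefit(model_a, payoffs, project_cost=500.0)
benefit_b = net_benefit(model_b, payoffs, project_cost=500.0)
print(benefit_a, benefit_b)
```

With these made-up payoffs the model with more false positives still comes out ahead, because a missed customer is assumed to cost more than a wasted contact.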

7
Data Mining Products/Tools
  • DARWIN from Oracle
  • Intelligent Data Miner from IBM
  • Intermedia Text with Oracle Database, with context
    query feature
    (theme-based document retrieval)

FOR MORE INFO...
http://www.oracle.com/ip/analyze/warehouse/datamining/
http://www-4.ibm.com/software/data/iminer/
8
Data Mining Products/Tools
  • New Specification being proposed by SUN for a
    Data Mining API
  • SQL Server 2000 data mining and English Query
    writing features
  • Verity Knowledge Organizer

FOR MORE INFO...
http://java.sun.com/aboutJava/communityprocess/jsr/jsr_073_dmapi.html
3 additional text mining sites:
1. http://textmining.krdl.org.sg/resourves.html
2. www.intext.de/TEXTANAE.htm
3. www.cs.uku.fi/kuikka/systems.html
9
DARWIN
  • Functions
  • Prediction (from known values)
  • Classification (into categories)
  • Forecasting (future predictions)
  • Approach
  • Plan
  • Prepare Dataset
  • Build and Use models

10
DARWIN
  • The problem is defined in terms of data fields
    and data records
  • The fields are classified as follows
  • - Categorical and Ordered Fields
  • - Predictive Fields
  • - Target Fields
  • DARWIN dataset file has to be created containing
    all the records in the problem domain (using a
    descriptor file)

11
DARWIN - Models
  • Tree model: based on a classification and
    regression tree algorithm
  • Net model: a feed-forward multilayer neural
    network
  • Match model: memory-based reasoning model, using
    a k-nearest neighbor algorithm
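
To make the memory-based reasoning idea concrete, here is a minimal sketch of a generic, unweighted k-nearest neighbor classifier. It is not DARWIN's implementation. The training records reuse the age, income, and customer columns of the Potential Customer Table shown earlier (income in thousands); the function name `knn_predict` and the query record are hypothetical.

```python
from collections import Counter
from math import dist


def knn_predict(training, query, k=3):
    """Classify `query` by majority vote among its k nearest training records.

    `training` is a list of (feature_vector, label) pairs; distance is
    plain Euclidean distance with no per-field weights.
    """
    nearest = sorted(training, key=lambda rec: dist(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# (age, income in thousands) -> potential customer?
training = [
    ((32, 10), "yes"),  # Ann S
    ((53, 20), "no"),   # Jane G
    ((35, 65), "yes"),  # Sri S
    ((25, 10), "yes"),  # Egor
]

prediction = knn_predict(training, (30, 12), k=3)
print(prediction)
```

A real tool would normally scale or weight the fields first, since otherwise the field with the largest numeric range dominates the distance.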

12
DARWIN Tree Model
Create Tree
Training Data
Test/Evaluate
Tree (Information on error rates of pruned
sub-trees)
I/P Prediction Dataset
Predict with Tree (using the selected
sub-tree)
Merged I/P O/P prediction dataset
Analyze Results
13
DARWIN Net Model
Training Dataset
Neural Network Model
Create Net
Train Net (Information on
error rates of pruned sub-trees)
I/P Prediction Dataset
Trained Neural Network
Prediction Dataset
Merged I/P O/P prediction dataset
Analyze Results
14
DARWIN Match Model
Training Data
Create Match Model
Optimize match weights
I/P Prediction Dataset
Predict with Match
Merged I/P O/P prediction dataset
Analyze Results
15
DARWIN Analyzing
  • Evaluate: evaluates the performance of a given
    model on a given dataset, when working on known
    data for test or evaluation purposes.
  • Summarize Data: provides a statistical summary of
    the values taken by the data in the specified
    fields of a dataset
  • Frequency Count: provides information on the
    frequency with which particular data values
    appear in a dataset
16
DARWIN Analyzing
  • Performance Matrix: can be used to compare simple
    fields or simple functions of fields
  • Sensitivity: provides a model showing the relative
    importance of attributes used in building a model
17
DARWIN Code Generation
  • Darwin can generate C, C++, and Java code for a
    Tree or Net model so that a prediction function
    can be called from an application program
  • Java code can also be generated to embed a
    model in a Web applet

FOR MORE INFO...
http://technet.oracle.com/docs/products/datamining/doc_index.htm
18
DARWIN
  • For more info
  • http://technet.oracle.com/software/products/intermedia/software_index.html
  • 1. Oracle Data Mining Data sheet
  • 2. Oracle Data Mining Solutions
  • http://www.oracle.com/ip/analyze/warehouse/datamining/
  • http://www.oracle.com/oramag/oracle/98-Jan/fast.html
  • 1. Managing Unstructured Data with Oracle8
  • http://technet.oracle.com/products/datamining/
  • 1. Product manuals

19
DARWIN
20
Oracle Intermedia Text
  • Ranking technique called theme proving is used
  • Documents grouped into categories and
    subcategories
  • Integrated with the Oracle 8 database.
  • Absolutely no training or tuning required

21
Oracle Intermedia Text
  • Lexical Knowledge Base
  • - 200,000 concepts from very broad domains
  • - 2000 major categories
  • - Concepts mapped into one or more
    words/phrases in canonical form
  • - Each of these has alternate inflectional
    variations, acronyms, and synonyms stored
  • - Total vocabulary of 450,000 terms
  • - Each entry has other parameters like part
    of speech

22
Oracle Intermedia Text
  • Theme Extraction
  • -Themes are assigned initial ranks based on
  • structure of the document and the frequency of
    the theme.
  • - All the ancestor themes also included in the
    result
  • - Theme proving done before final ranking
  • Queries
  • Direct match, phrase search (contains),
    case-sensitive query, misspellings and fuzzy
    match, inflections (about), compound queries,
    Boolean operators, Natural language query

23
Oracle Intermedia Text
  • Oracle at TREC-8
  • (Eighth Text Retrieval Conference - http://otn.oracle.com/products/intermedia/htdocs/imt_trec8pap.htm)
  • Recall at 1000: 71.57%
    (3384/4728)
  • Average precision: 41.30%
  • Initial precision (at recall 0.0): 92.79%
  • Final precision (at recall 1.0): 7.91%

24
Intermedia Text-Model
25
Interface Options
26
Language Selection
  • Java for robot
  • PL/SQL for data retrieval

27
Code Execution
28
Overview of the System
Intermedia Text
Customer Browser
Client Browser
Web Server
Oracle 8i
Listening at port 80
Server process
Tag stripper
JDBC
29
Intermedia Text
  • Steps for Building an application
  • Load the documents
  • Index the document
  • Issue Queries
  • Present the documents that satisfy the query

30
Loading Methods
  • Loading Methods
  • Insert statements
  • SQL*Loader
  • Ctxsrv: a server daemon process which builds
    the index at regular intervals
  • Ctxload utility, used for
  • Thesaurus import/export
  • Text loading
  • Document updating/exporting

31
Create and Populate a Simple Table
CREATE TABLE quick (
  quick_id NUMBER CONSTRAINT quick_pk PRIMARY KEY,
  text     VARCHAR2(80) );
INSERT INTO quick VALUES ( 1, 'The cat sat on the mat' );
INSERT INTO quick VALUES ( 2, 'The fox jumped over the dog' );
INSERT INTO quick VALUES ( 3, 'The dog barked like a dog' );
COMMIT;

32
Run a Text Query
SELECT text FROM quick
  WHERE CONTAINS ( text, 'sat on the mat' ) > 0;

DRG-10599: column is not indexed
  • You must have a Text index on a column before you
    can do a CONTAINS query on it

33
Create the Text Index
CREATE INDEX quick_text ON quick ( text )
  INDEXTYPE IS CTXSYS.CONTEXT;
  • CTXSYS is the system user for interMedia Text
  • The INDEXTYPE keyword is a feature of the
    Extensible Indexing Framework

34
Run a Text Query
SELECT text FROM quick
  WHERE CONTAINS ( text, 'sat on the mat' ) > 0;

TEXT
-----------------------
The cat sat on the mat
  • You should regard the CONTAINS function as
    boolean in meaning
  • It is implemented as a number since SQL does not
    have a boolean datatype
  • The only sensible way to use it is with > 0

35
Run a Text Query
SELECT SCORE(42) s, text FROM quick
  WHERE CONTAINS ( text, 'dog', 42 ) > 0  /* just for teaching purposes! */
  ORDER BY s DESC;

 S TEXT
-- ---------------------------
 7 The dog barked like a dog
 4 The fox jumped over the dog
  • The better the match, the higher the score
  • The value can be used in ORDER BY but has no
    absolute significance
  • The score is zero when the query is not matched

36
Intermedia Text - Indexing Pipeline
[Diagram: Database -> (column data) -> Datastore -> (doc data) -> Filter -> (filtered doc text) -> Sectioner -> (plain text) -> Lexer -> (tokens) -> Engine -> (index data) -> Database; the Sectioner also sends section offsets to the Engine]
  • First step is creating an index
  • Datastore: reads the data out of the table (a URL
    datastore performs a GET)

37
Intermedia Text - Indexing Pipeline
  • Filter: transforms the data to some text
    type; this is needed because some formats may be
    binary, as when storing doc, pdf, or HTML types
  • Sectioner: converts to plain text, removing tags
    and invisible info.
  • Lexer: splits the text into discrete tokens.
  • Engine: takes the tokens from the lexer, the offsets
    from the sectioner, and a stoplist to
    build an index.
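
The later stages of the pipeline can be imitated in a few lines. This is a rough sketch of the idea only, not interMedia Text code: the filter step (binary format to text) is omitted, the regular expressions are simplistic, and the stoplist contents and function names are hypothetical.

```python
import re

STOPLIST = {"the", "a", "on", "of"}  # hypothetical stoplist


def strip_tags(html):
    """Sectioner-like step: drop markup and keep the visible text."""
    return re.sub(r"<[^>]+>", " ", html)


def tokenize(text):
    """Lexer-like step: split plain text into upper-cased word tokens."""
    return [tok.upper() for tok in re.findall(r"[A-Za-z0-9]+", text)]


def index_terms(html):
    """Engine-like step: keep (position, token) pairs for non-stoplist tokens.

    Positions count every token, stop words included, so the gaps left by
    removed stop words are preserved.
    """
    tokens = tokenize(strip_tags(html))
    return [
        (position, token)
        for position, token in enumerate(tokens, start=1)
        if token.lower() not in STOPLIST
    ]


terms = index_terms("<p>The <b>cat</b> sat on the mat</p>")
print(terms)
```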

38
Intermedia Text - Indexing Pipeline
  • Example of index creation
  • Statements
  • INSERT INTO docs VALUES (1, 'first document');
  • INSERT INTO docs VALUES (2, 'second document');
  • Produces an index
  • DOCUMENT -> doc 1 position 2, doc 2 position 2
  • FIRST -> doc 1 position 1
  • SECOND -> doc 2 position 1
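
The index in this example is an inverted index: each token maps to the documents and positions where it occurs. A minimal sketch that rebuilds it from the two example rows follows; the function name `build_index` is hypothetical, and whitespace splitting stands in for the real lexer.

```python
from collections import defaultdict


def build_index(docs):
    """Map each upper-cased token to a list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id, text in docs:
        for position, token in enumerate(text.split(), start=1):
            index[token.upper()].append((doc_id, position))
    return dict(index)


docs = [(1, "first document"), (2, "second document")]
index = build_index(docs)
for token in sorted(index):
    print(token, index[token])
```

Keeping positions as well as document ids is what lets a phrase query check that the words are adjacent.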

39
Testing procedure
  • Document set from newsgroups
  • 122 documents from a text mining site
  • Loaded using insert statements
  • File datastore used
  • Documents(HTML) from browsing
  • 20 documents
  • Loaded from server process
  • URL datastore used

40
Newsgroup Results
  1. Religion, Atheism - 15
     on bible, islam, religious beliefs
  2. Comp-os-ms-windows-misc - 17
     about operating systems, protocols, installation
  3. Comp.graphics - 27
     on hardware and software for computer graphics
  4. Ice Hockey - 18
  5. Computer hardware - 12
     on installation of different peripheral devices
  6. Mideast.politics - 14
     on political development in the Mideast
  7. Science.space - 19
     on various space programs, devices, theories

41
Newsgroup Results
42
Newsgroup Results
43
Newsgroup Results
Recall = (# of correct positive predictions) / (# of positive examples)
Precision = (# of correct positive predictions) / (# of positive predictions)
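
Both measures are simple ratios and can be computed as in the sketch below. The first call reuses the TREC-8 recall figure quoted earlier (3384 relevant documents retrieved out of 4728). The second uses the 27 Comp.graphics documents from the newsgroup set together with hypothetical retrieval counts (30 documents returned, 24 of them correct), which are not results from this study.

```python
def recall(true_pos, false_neg):
    """Correct positive predictions divided by all positive examples."""
    return true_pos / (true_pos + false_neg)


def precision(true_pos, false_pos):
    """Correct positive predictions divided by all positive predictions."""
    return true_pos / (true_pos + false_pos)


# TREC-8 figure from the earlier slide: 3384 of 4728 relevant documents found.
trec_recall = recall(true_pos=3384, false_neg=4728 - 3384)
print(round(100 * trec_recall, 2))

# Hypothetical query against the 27 Comp.graphics documents:
# 30 documents returned, 24 of them really from that category.
graphics_recall = recall(true_pos=24, false_neg=27 - 24)
graphics_precision = precision(true_pos=24, false_pos=30 - 24)
print(graphics_recall, graphics_precision)
```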
44
Query
Syntax Binary Operators
  • AND &
  • OR |
  • EQUIV =
  • MINUS -
  • NOT ~
  • ACCUM ,

cat & dog
cat | dog
cat = dog
cat - dog
cat ~ dog
cat , dog
45
Semantics Binary Operators
  • The semantics of all the binary operators are
    defined in terms of SCORE
  • However, the score for even the simplest query
    expression - a single word - is calculated by a
    subtle rule
  • the score is higher for a document where the
    query word occurs more frequently than for one
    where it occurs less frequently
  • but when word1 occurs N times in document D,
    its score is lower than when word2 occurs N
    times in document D if word1 occurs more often
    in the whole document set than word2
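
The slides do not spell out how each operator combines the scores of its operands. The sketch below is an assumption based on how Oracle's documentation describes four of them: AND (&) takes the lower score, OR (|) the higher, MINUS (-) subtracts the right score from the left, and NOT (~) keeps the left score only when the right term is absent. The clamping at zero, the function names, and the per-term scores are hypothetical.

```python
def score_and(left, right):
    """AND (&): a document is only as relevant as its weaker term."""
    return min(left, right)


def score_or(left, right):
    """OR (|): a document is as relevant as its stronger term."""
    return max(left, right)


def score_minus(left, right):
    """MINUS (-): the right-hand term lowers the score; clamped at 0 (assumption)."""
    return max(0, left - right)


def score_not(left, right):
    """NOT (~): keep the left score only when the right term does not match."""
    return left if right == 0 else 0


# Hypothetical per-term scores for one document.
cat, dog = 5, 7
results = {
    "cat & dog": score_and(cat, dog),
    "cat | dog": score_or(cat, dog),
    "cat - dog": score_minus(cat, dog),
    "cat ~ dog": score_not(cat, dog),
}
print(results)
```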

46
The Salton Algorithm
  • interMedia Text uses an algorithm which is
    similar to the Salton Algorithm, widely used in
    text retrieval products
  • The score for a word is proportional to
    f ( 1 + log ( N/n ) ), where
  • f is the frequency of the search term in the
    document
  • N is the total number of documents
  • and n is the number of documents which contain
    the search term
  • The score is converted into an integer in the
    range 0 - 100.
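
A minimal sketch of this scoring rule follows. The slide only says the score is proportional to f ( 1 + log ( N/n ) ); the scale factor of 3, the base-10 logarithm, and rounding to the nearest integer are assumptions, chosen because together they reproduce the scores 7 and 4 shown earlier for the 'dog' query on the three-row quick table. The function names are hypothetical.

```python
from math import log10


def salton_weight(f, N, n):
    """Raw weight f * (1 + log(N / n)); the base-10 log is an assumption."""
    if f == 0 or n == 0:
        return 0.0
    return f * (1 + log10(N / n))


def score(f, N, n, scale=3.0):
    """Scale the raw weight (assumed factor), round, and cap at 100."""
    return min(100, round(scale * salton_weight(f, N, n)))


# 'dog' in the quick table: N = 3 rows, n = 2 rows contain the term.
print(score(f=2, N=3, n=2))  # 'The dog barked like a dog'
print(score(f=1, N=3, n=2))  # 'The fox jumped over the dog'
print(score(f=0, N=3, n=2))  # 'The cat sat on the mat'
```

Under the same assumptions a one-document set needs 34 occurrences of the term before the capped score reaches 100.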

47
The Salton Algorithm
Assumption
  • Inverse frequency scoring assumes that frequently
    occurring terms in a document set are noise terms,
    and so these terms are scored lower.
  • For a document to score high, the query term must
    occur frequently in the document but infrequently
    in the document set as a whole.

48
The Salton Algorithm
  • This table assumes that only one document in the
    set contains the query term.

  # of Documents in Document Set   Occurrences of Term in Document Needed to Score 100
  1                                34
  5                                20
  10                               17
  50                               13
  100                              12
  500                              10
  1,000                            9
  10,000                           7
  100,000                          5
  1,000,000                        4

49
Summary of operators
  • Binary operators

& | = - ~ ,
  • Built-in expansion...

? !
  • Thesaurus...

BT, BTG, BTP, BTI, NT, NTG, NTP, NTI, PT, RT,
SYN, TR, TRSYN, TT
50
Summary of operators
  • Stored query expression...

SQE
  • Grouping and escaping...

() \
  • Special...

NEAR, WITHIN, ABOUT
51
Application Details- Customer profile Analyzer
The HTTP server (for user web page caching) is
started. The Oracle web server is also started.
52
Log In Screen- Customer User
Log-in screen, used by both the customer and the
users.
The Oracle web server takes care of the
secure connections, while for the HTTP
server the user id is common for the
session; no user can invoke a document from
the server without a user id.
53
Customer Interface Http Server
The user uses the interface provided by the
custom HTTP server.
54
Main User Screen
The user can choose the type of data to be
analyzed. Two types of data exist:
1. Newsgroups
2. User-browsed URLs
55
Selection of Category and options
The user chooses a category and other options, such as
generating themes, generating a gist, generating
marked-up text, and a date range.
56
Results Page Gist Generation
This page can be used for drilling down to the
actual document, which opens up in the browser
(generated by the filter option). Themes and a gist
can also be generated from this screen.
57
Search Screen
The search screen has advanced options like fuzzy
search, ABOUT search, etc. A chain of
expressions can be joined with conjunctions
(NOT, OR, AND, etc.).
58
Conclusion
  • New estimation methods are trying to find more
    meaning from text.
  • Industry has great text mining products and is
    constantly improving technology.
  • Unstructured data mining has a long way to go.