Data Mining with Unstructured Data A Study And Implementation of Industry Product(s)


Provided by: cedarB

Learn more at: https://cedar.buffalo.edu


1
Data Mining with Unstructured Data: A Study and Implementation of Industry Product(s)

Samrat Sen

2
Goals

Issues in Text Mining with Unstructured Data
Analysis of Data Mining products
Study of a Real Life Classification Problem
Strategy for solving the problem

3
Issues in Text Mining

Different from KDD and DM techniques in structured databases
Problems
1. Concerned with predefined fields
2. Based on learning from an attribute-value database
e.g. (see next slide)

4
Issues in Text Mining
Potential Customer Table

Person   Age  Sex  Income  Customer
Ann S    32   F    10,000  yes
Jane G   53   F    20,000  no
Sri S    35   M    65,000  yes
Egor     25   M    10,000  yes

Married To Table

Husband  Wife
Egor     Ann S
Sri H    Jane

Induced Rules

If Married(Person, Spouse) and Income(Person) > 25,000
Then Potential-Customer(Spouse)
If Married(Person, Spouse) and Potential-Customer(Person)
Then Potential-Customer(Spouse)
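
The two induced rules can be tried out on the slide's tables with a minimal sketch. It rests on assumptions that are not in the slides: people are matched across the two tables by first name only (the tables spell some names differently), Married is treated as symmetric, and the function name apply_rules is hypothetical.

```python
# Customer table from the slide, keyed by first name (an assumption of this sketch).
customers = {
    "Ann":  {"age": 32, "sex": "F", "income": 10_000, "customer": True},
    "Jane": {"age": 53, "sex": "F", "income": 20_000, "customer": False},
    "Sri":  {"age": 35, "sex": "M", "income": 65_000, "customer": True},
    "Egor": {"age": 25, "sex": "M", "income": 10_000, "customer": True},
}

# Married-to table as (husband, wife) pairs.
married = [("Egor", "Ann"), ("Sri", "Jane")]


def apply_rules(customers, married, income_threshold=25_000):
    """Return the people predicted to be potential customers by the two rules."""
    predicted = set()
    # Treat Married(Person, Spouse) as symmetric (assumption): check both directions.
    pairs = married + [(wife, husband) for husband, wife in married]
    for person, spouse in pairs:
        record = customers[person]
        # Rule 1: Married(Person, Spouse) and Income(Person) > threshold
        if record["income"] > income_threshold:
            predicted.add(spouse)
        # Rule 2: Married(Person, Spouse) and Potential-Customer(Person)
        if record["customer"]:
            predicted.add(spouse)
    return predicted


predicted = apply_rules(customers, married)
print(sorted(predicted))
```

Under these assumptions the rules also flag Jane, whom the table lists as not a customer. An induced rule can disagree with individual training rows.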

5
Issues in Text Mining

Algorithm techniques like
- Association extraction from indexed data
- Prototypical document extraction from full text
Industry-standard data mining tools cannot be used directly,
e.g. a usual process has to have a Text Transformer, a Text Analyzer, and a Summary Generator

6
Issues in Text Mining

Adapting the input and output interfaces and the file formats may cost time and money.
Exhaustive domains have to be set up for classification.
Costs and benefits have to be weighed before model selection:
1. Gain from a correct positive prediction
2. Loss from an incorrect positive prediction (false positive)
3. Benefit from a correct negative prediction
4. Cost of an incorrect negative prediction (false negative)
5. Cost of project time (a better product/algorithm may come up)
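
One way to weigh the five items above is to put a value on each outcome and total them over a test set. The sketch below does this. The counts, the monetary values, and the function name net_value are all hypothetical, chosen only to show the arithmetic.

```python
def net_value(counts, values, project_cost=0.0):
    """Total gains minus losses over a test set, minus the project cost.

    counts: number of true/false positives and negatives (tp, fp, tn, fn)
    values: value of each outcome, all given as non-negative numbers
    """
    return (
        counts["tp"] * values["gain_tp"]        # 1. gain from correct positives
        - counts["fp"] * values["loss_fp"]      # 2. loss from false positives
        + counts["tn"] * values["benefit_tn"]   # 3. benefit from correct negatives
        - counts["fn"] * values["cost_fn"]      # 4. cost of false negatives
        - project_cost                          # 5. cost of project time
    )


# Hypothetical figures for two candidate models evaluated on the same 100 cases.
values = {"gain_tp": 100, "loss_fp": 20, "benefit_tn": 1, "cost_fn": 50}
model_a = {"tp": 40, "fp": 10, "tn": 45, "fn": 5}
model_b = {"tp": 30, "fp": 2, "tn": 53, "fn": 15}

value_a = net_value(model_a, values, project_cost=1000)
value_b = net_value(model_b, values, project_cost=1000)
print(value_a, value_b)
```

With these made-up figures the model with more false positives still comes out ahead, because a missed positive costs more than a false alarm.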

7
Data Mining Products/Tools

DARWIN from Oracle
Intelligent Data Miner from IBM
interMedia Text with Oracle Database, with context query feature
(theme-based document retrieval)

FOR MORE INFO...
http://www.oracle.com/ip/analyze/warehouse/datamining/
http://www-4.ibm.com/software/data/iminer/
8
Data Mining Products/Tools

New specification being proposed by Sun for a Data Mining API
SQL Server 2000 data mining and English query writing features
Verity Knowledge Organizer

FOR MORE INFO...
http://java.sun.com/aboutJava/communityprocess/jsr/jsr_073_dmapi.html
3 additional text mining sites:
1. http://textmining.krdl.org.sg/resourves.html
2. www.intext.de/TEXTANAE.htm
3. www.cs.uku.fi/kuikka/systems.html
9
DARWIN

Functions
Prediction (from known values)
Classification (into categories)
Forecasting (future predictions)
Approach
Plan
Prepare Dataset
Build and Use models

10
DARWIN

The problem is defined in terms of data fields
and data records
The fields are classified as follows
- Categorical and Ordered Fields
- Predictive Fields
- Target Fields
DARWIN dataset file has to be created containing
all the records in the problem domain (using a
descriptor file)

11
DARWIN - Models

Tree model: Based on the classification and regression tree algorithm
Net model: A feed-forward multilayer neural network
Match model: Memory-based reasoning model, using a K-nearest neighbor algorithm
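
To make the idea behind the Match model concrete, here is a minimal generic K-nearest neighbor classifier. It is not Darwin's implementation. The data points and the name knn_predict are hypothetical, and the match-weight optimization that Darwin performs is left out.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """Classify `query` by majority vote among its k nearest training records.

    training: list of (feature_vector, label) pairs kept in memory as-is,
              which is why this family of methods is called memory-based.
    """
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-field records with a categorical target field.
training = [
    ((1.0, 1.0), "a"),
    ((1.0, 2.0), "a"),
    ((5.0, 5.0), "b"),
    ((6.0, 5.0), "b"),
    ((5.0, 6.0), "b"),
]

print(knn_predict(training, (1.5, 1.5)))
print(knn_predict(training, (5.5, 5.5)))
```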

12
DARWIN Tree Model
Create Tree
Training Data
Test/Evaluate Tree (information on error rates of pruned sub-trees)
I/P Prediction Dataset
Predict with Tree (using the selected sub-tree)
Merged I/P O/P prediction dataset
Analyze Results
13
DARWIN Net Model
Training Dataset
Neural Network Model
Create Net
Train Net (information on error rates of pruned sub-trees)
I/P Prediction Dataset
Trained Neural Network
Prediction Dataset
Merged I/P O/P prediction dataset
Analyze Results
14
DARWIN Match Model
Training Data
Create Match Model
Optimize match weights
I/P Prediction Dataset
Predict with Match
Merged I/P O/P prediction dataset
Analyze Results
15
DARWIN Analyzing
Evaluate: Evaluates the performance of a given model on a given dataset, when working on known data for test or evaluation purposes.
Summarize Data: Provides a statistical summary of the values taken by the data in the specified fields of a dataset.
Frequency Count: Provides information on the frequency with which particular data values appear in a dataset.
16
DARWIN Analyzing
Performance Matrix: Can be used to compare simple fields or simple functions of fields.
Sensitivity: Provides a model showing the relative importance of the attributes used in building a model.
17
DARWIN Code Generation

Darwin can generate C, C++, or Java code for a Tree or Net model so that a prediction function can be called from an application program.
Java code can also be generated to embed a model in a Web applet.

FOR MORE INFO...
http://technet.oracle.com/docs/products/datamining/doc_index.htm
18
DARWIN

For more info
http://technet.oracle.com/software/products/intermedia/software_index.html
1. Oracle Data Mining Data sheet
2. Oracle Data Mining Solutions
http://www.oracle.com/ip/analyze/warehouse/datamining/
http://www.oracle.com/oramag/oracle/98-Jan/fast.html
1. Managing Unstructured Data with Oracle8
http://technet.oracle.com/products/datamining/
1. Product manuals

19
DARWIN
20
Oracle Intermedia Text

Ranking technique called theme proving is used
Documents grouped into categories and
subcategories
Integrated with the Oracle 8 database.
Absolutely no training or tuning required

21
Oracle Intermedia Text

Lexical Knowledge Base
- 200,000 concepts from very broad domains
- 2000 major categories
- Concepts mapped into one or more words/phrases in canonical form
- Each of these has alternate inflectional variations, acronyms, and synonyms stored
- Total vocabulary of 450,000 terms
- Each entry has other parameters, like parts of speech

22
Oracle Intermedia Text

Theme Extraction
- Themes are assigned initial ranks based on the structure of the document and the frequency of the theme.
- All the ancestor themes are also included in the result.
- Theme proving is done before final ranking.
Queries
Direct match, phrase search (contains), case-sensitive query, misspellings and fuzzy match, inflections (about), compound queries, Boolean operators, natural language query

23
Oracle Intermedia Text

Oracle at TREC-8
(Eighth Text Retrieval Conference - http://otn.oracle.com/products/intermedia/htdocs/imt_trec8pap.htm)

Recall at 1000                      71.57% (3384/4728)
Average precision                   41.30%
Initial precision (at recall 0.0)   92.79%
Final precision (at recall 1.0)     07.91%

24
Intermedia Text-Model
25
Interface Options
26
Language Selection

Java for robot
PL/SQL for data retrieval

27
Code Execution
28
Overview of the System
Intermedia Text
Customer Browser
Client Browser
Web Server
Oracle 8i
Listening at port 80
Server process
Tag stripper
JDBC
29
Intermedia Text

Steps for Building an application
Load the documents
Index the documents
Issue Queries
Present the documents that satisfy the query

30
Loading Methods

Insert statements
SQL*Loader
Ctxsrv: a server daemon process which builds the index at regular intervals
Ctxload utility, used for
- Thesaurus import/export
- Text loading
- Document updating/exporting

31
Create and Populate a Simple Table

CREATE TABLE quick (
  quick_id NUMBER CONSTRAINT quick_pk PRIMARY KEY,
  text     VARCHAR2(80) );
INSERT INTO quick
  VALUES ( 1, 'The cat sat on the mat' );
INSERT INTO quick
  VALUES ( 2, 'The fox jumped over the dog' );
INSERT INTO quick
  VALUES ( 3, 'The dog barked like a dog' );
COMMIT;

32
Run a Text Query

SELECT text FROM quick
  WHERE CONTAINS ( text, 'sat on the mat' ) > 0;

DRG-10599: column is not indexed

You must have a Text index on a column before you can do a CONTAINS query on it

33
Create the Text Index
CREATE INDEX quick_text ON quick ( text )
  INDEXTYPE IS CTXSYS.CONTEXT;

CTXSYS is the system user for interMedia Text
The INDEXTYPE keyword is a feature of the Extensible Indexing Framework

34
Run a Text Query

SELECT text FROM quick
  WHERE CONTAINS ( text, 'sat on the mat' ) > 0;

TEXT
-----------------------
The cat sat on the mat

You should regard the CONTAINS function as boolean in meaning
It is implemented as a number since SQL does not have a boolean datatype
The only sensible way to use it is with > 0

35
Run a Text Query

SELECT SCORE(42) s, text FROM quick
  WHERE CONTAINS ( text, 'dog', 42 ) > 0 /* just for teaching purposes! */
  ORDER BY s DESC;

 S TEXT
-- ---------------------------
 7 The dog barked like a dog
 4 The fox jumped over the dog

The better the match, the higher the score
The value can be used in ORDER BY but has no absolute significance
The score is zero when the query is not matched

36
Intermedia Text - Indexing Pipeline
[Diagram: Database -> (column data) -> Datastore -> (doc data) -> Filter -> (filtered doc text) -> Sectioner -> (plain text) -> Lexer -> (tokens) -> Engine -> (index data). The Sectioner also passes section offsets to the Engine.]

The first step is creating an index
Datastore: Reads the data out of the table (for the URL datastore, performs a GET)

37
Intermedia Text - Indexing Pipeline

Filter: The data is transformed to some text type. This is needed because some formats may be binary, as when storing doc, pdf, or HTML types.
Sectioner: Converts to plain text, removes tags and invisible info.
Lexer: Splits the text into discrete tokens.
Engine: Takes the tokens from the lexer, the offsets from the sectioner, and a stoplist (a list of stop words) to build an index.

38
Intermedia Text - Indexing Pipeline

Example of index creation
Statements
INSERT INTO docs VALUES (1, 'first document');
INSERT INTO docs VALUES (2, 'second document');
Produces an index
DOCUMENT -> doc 1 position 2, doc 2 position 2
FIRST -> doc 1 position 1
SECOND -> doc 2 position 1
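
The index in this example can be reproduced with a short sketch of the lexer and engine stages. It only illustrates the idea of an inverted index with word positions and is not how interMedia Text stores its index. The function name build_index and the regular-expression tokenizer are assumptions of this sketch.

```python
import re
from collections import defaultdict


def build_index(docs, stoplist=frozenset()):
    """Map each token to a list of (doc_id, position) pairs.

    docs: dict of doc_id -> text
    stoplist: upper-case tokens to leave out of the index
    """
    index = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        # "Lexer": split the text into upper-cased word tokens.
        tokens = re.findall(r"\w+", text.upper())
        # "Engine": record where each non-stoplist token occurs (positions start at 1).
        for position, token in enumerate(tokens, start=1):
            if token not in stoplist:
                index[token].append((doc_id, position))
    return dict(index)


docs = {1: "first document", 2: "second document"}
index = build_index(docs)
for token in sorted(index):
    print(token, "->", index[token])
```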

39
Testing procedure

Document set from newsgroups
- 122 documents from a text mining site
- Loaded using insert statements
- File datastore used
Documents (HTML) from browsing
- 20 documents
- Loaded from server process
- URL datastore used

40
Newsgroup Results

1. Religion, Atheism - 15
   on bible, islam, religious beliefs
2. Comp-os-ms-windows-misc - 17
   about operating systems, protocols, installation
3. Comp.graphics - 27
   on hardware and software for computer graphics
4. Ice Hockey - 18
5. Computer hardware - 12
   on installation of different peripheral devices
6. Mideast.politics - 14
   on political developments in the Mideast
7. Science.space - 19
   on various space programs, devices, theories

41
Newsgroup Results
42
Newsgroup Results
43
Newsgroup Results
Recall = (# of correct positive predictions) / (# of positive examples)
Precision = (# of correct positive predictions) / (# of positive predictions)
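
The two ratios can be computed directly, as in the small sketch below. The recall call reuses the 3384/4728 figure from the TREC-8 slide. The counts passed to precision are hypothetical, and both function names are this sketch's own.

```python
def recall(correct_positive, positive_examples):
    """Fraction of the truly positive examples that were predicted positive."""
    return correct_positive / positive_examples if positive_examples else 0.0


def precision(correct_positive, positive_predictions):
    """Fraction of the positive predictions that were actually correct."""
    return correct_positive / positive_predictions if positive_predictions else 0.0


# Figures from the TREC-8 slide: 3384 relevant documents retrieved out of 4728.
trec_recall = recall(3384, 4728)
print(f"recall = {100 * trec_recall:.2f}%")

# Hypothetical category: 10 documents predicted, 8 of them correct.
print(f"precision = {100 * precision(8, 10):.2f}%")
```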
44
Query Syntax: Binary Operators

Operator  Symbol  Example
AND       &       cat & dog
OR        |       cat | dog
EQUIV     =       cat = dog
MINUS     -       cat - dog
NOT       ~       cat ~ dog
ACCUM     ,       cat , dog
45
Semantics: Binary Operators

The semantics of all the binary operators is defined in terms of SCORE
However, the score for even the simplest query expression - a single word - is calculated by a subtle rule:
the score is higher for a document where the query word occurs more frequently than for one where it occurs less frequently,
but when word1 occurs N times in document D, its score is lower than when word2 occurs N times in document D if word1 occurs more often in the whole document set than word2

46
The Salton Algorithm

interMedia Text uses an algorithm which is similar to the Salton Algorithm - widely used in Text Retrieval products
The score for a word is proportional to
  f * ( 1 + log ( N/n ) )
where
f is the frequency of the search term in the document,
N is the total number of documents,
and n is the number of documents which contain the search term
The score is converted into an integer in the range 0 - 100.
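
The behaviour of this formula can be explored with a few lines of code. The sketch computes only the quantity the slide gives, f * (1 + log(N/n)). The base-10 logarithm is an assumption, the scaling to an integer in the 0 - 100 range is not reproduced, and the name salton_weight is hypothetical.

```python
import math


def salton_weight(f, N, n):
    """Unscaled term weight f * (1 + log10(N / n)).

    f: occurrences of the search term in the document
    N: total number of documents in the set
    n: number of documents that contain the search term
    """
    if f == 0 or n == 0:
        return 0.0  # term absent: the query is not matched
    return f * (1 + math.log10(N / n))


# Same in-document frequency, but the first term is rare in the document set.
rare = salton_weight(f=2, N=100, n=1)
common = salton_weight(f=2, N=100, n=50)
print(rare, common)
```

A term found in only 1 of 100 documents is weighted more heavily than one found in 50 of them, which is the inverse-frequency effect the slides describe.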

47
The Salton Algorithm
Assumption

Inverse frequency scoring assumes that frequently occurring terms in a document set are noise terms, and so these terms are scored lower. For a document to score high, the query term must occur frequently in the document but infrequently in the document set as a whole.

48
The Salton Algorithm

This table assumes that only one document in the set contains the query term.

# of Documents in Document Set   Occurrences of Term in Document Needed to Score 100
1                                34
5                                20
10                               17
50                               13
100                              12
500                              10
1,000                            9
10,000                           7
100,000                          5
1,000,000                        4

49
Summary of operators

Binary operators

& | = - ~ ,

Built-in expansion...

? !

Thesaurus...

BT, BTG, BTP, BTI, NT, NTG, NTP, NTI, PT, RT,
SYN, TR, TRSYN, TT
50
Summary of operators

Stored query expression...

SQE

Grouping and escaping...

() \

Special...

NEAR  WITHIN  ABOUT
51
Application Details- Customer profile Analyzer
The HTTP server (for user web page caching) is started. The Oracle web server is also started.
52
Log In Screen- Customer User
The log-in screen is used by both the customer and the users.
The Oracle web server takes care of the secure connections, while for the HTTP server the user id is common for the session. No user can invoke a document from the server without a user id.
53
Customer Interface Http Server
The user uses the interface provided by the custom HTTP server.
54
Main User Screen
The user can choose the type of data to be analyzed. Two types of data exist:
1. Newsgroups
2. User-browsed URLs
55
Selection of Category and options
The user chooses a category and other options, like:
- Generating theme
- Generating gist
- Generating marked-up text
- Date range
56
Results Page Gist Generation
This page can be used for drilling down to the actual document, which opens up in the browser (generated by the filter option). Theme and gist can also be generated from this screen.
57
Search Screen
The search screen has advanced options like fuzzy search, about search, etc. A chain of expressions can be used along with conjunctions (like not, or, and, etc.) for joining the statements.
58
Conclusion