Title: Data Mining with Unstructured Data: A Study and Implementation of Industry Product(s)
1Data Mining with Unstructured Data: A Study and Implementation of Industry Product(s)
2Goals
- Issues in Text Mining with Unstructured Data
- Analysis of Data Mining products
- Study of a Real Life Classification Problem
- Strategy for solving the problem
3Issues in Text Mining
- Different from KDD and DM techniques in structured databases
- Problems with those techniques:
- 1. Concerned with predefined fields
- 2. Based on learning from an attribute-value database
- e.g. the tables and induced rules on the next slide
4Issues in Text Mining
Potential Customer Table
Person   Age  Sex  Income  Customer
Ann S    32   F    10,000  yes
Jane G   53   F    20,000  no
Sri S    35   M    65,000  yes
Egor     25   M    10,000  yes
Married to Table
Husband  Wife
Egor     Ann S
Sri H    Jane
Induced Rules
- If Married(Person, Spouse) and Income(Person) > 25,000, then Potential-Customer(Spouse)
- If Married(Person, Spouse) and Potential-Customer(Person), then Potential-Customer(Spouse)
5Issues in Text Mining
- Algorithmic techniques include:
- Association extraction from indexed data
- Prototypical document extraction from full text
- Industry-standard data mining tools cannot be used directly
- e.g. a typical process also needs a text transformer, a text analyzer, and a summary generator
6Issues in Text Mining
- The input and output interfaces and the file formats may cost time and money.
- Exhaustive domains have to be set up for classification.
- Costs and benefits have to be weighed before model selection:
- 1. Gain from a correct positive prediction
- 2. Loss from an incorrect positive prediction (false positive)
- 3. Benefit from a correct negative prediction
- 4. Cost of an incorrect negative prediction (false negative)
- 5. Cost of project time (a better product/algorithm may come up)
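The cost/benefit weighing listed above can be sketched as a simple payoff calculation over a model's confusion-matrix counts. This is only an illustration: the `Payoffs` class, the `net_benefit` function, and every number below are hypothetical and not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class Payoffs:
    gain_tp: float             # 1. gain from a correct positive prediction
    loss_fp: float             # 2. loss from a false positive
    gain_tn: float             # 3. benefit from a correct negative prediction
    loss_fn: float             # 4. cost of a false negative
    project_cost: float = 0.0  # 5. fixed cost of project time


def net_benefit(tp, fp, tn, fn, p):
    """Total payoff of a model given its confusion-matrix counts."""
    return (tp * p.gain_tp - fp * p.loss_fp
            + tn * p.gain_tn - fn * p.loss_fn
            - p.project_cost)


# Hypothetical payoffs and counts for two candidate models.
payoffs = Payoffs(gain_tp=50.0, loss_fp=5.0, gain_tn=0.0,
                  loss_fn=20.0, project_cost=1000.0)
model_a = net_benefit(tp=80, fp=40, tn=860, fn=20, p=payoffs)
model_b = net_benefit(tp=60, fp=10, tn=890, fn=40, p=payoffs)
print(model_a, model_b)  # 2400.0 1150.0
```

With these made-up numbers model A wins despite producing more false positives, because a missed positive is assumed to cost more than a wasted contact.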
7Data Mining Products/Tools
- DARWIN from Oracle
- Intelligent Miner from IBM
- Intermedia Text with Oracle Database, with context query feature (theme-based document retrieval)
FOR MORE INFO...
http://www.oracle.com/ip/analyze/warehouse/datamining/
http://www-4.ibm.com/software/data/iminer/
8Data Mining Products/Tools
- New specification being proposed by Sun for a Data Mining API
- SQL Server 2000 data mining and English Query writing features
- Verity Knowledge Organizer
FOR MORE INFO...
http://java.sun.com/aboutJava/communityprocess/jsr/jsr_073_dmapi.html
Additional text mining sites:
1. http://textmining.krdl.org.sg/resourves.html
2. www.intext.de/TEXTANAE.htm
3. www.cs.uku.fi/kuikka/systems.html
9DARWIN
- Functions
- Prediction (from known values)
- Classification (into categories)
- Forecasting (future predictions)
- Approach
- Plan
- Prepare Dataset
- Build and Use models
10DARWIN
- The problem is defined in terms of data fields and data records
- The fields are classified as follows:
- - Categorical and Ordered Fields
- - Predictive Fields
- - Target Fields
- A DARWIN dataset file has to be created containing all the records in the problem domain (using a descriptor file)
11DARWIN - Models
- Tree model: based on the classification and regression tree algorithm
- Net model: a feed-forward multilayer neural network
- Match model: a memory-based reasoning model, using a k-nearest neighbor algorithm
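To make the idea behind the Match model concrete, here is a minimal k-nearest neighbor classifier. It is a generic sketch of the technique, not DARWIN's implementation; the function name `knn_predict` and the training records are hypothetical.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """Label `query` by majority vote among its k nearest training records.

    training: list of (feature_vector, label) pairs.
    """
    by_distance = sorted(training, key=lambda row: math.dist(row[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical records: (age, income in thousands) -> potential customer?
training = [
    ((25, 10), "no"), ((30, 15), "no"), ((28, 12), "no"),
    ((45, 60), "yes"), ((50, 65), "yes"), ((40, 70), "yes"),
]
print(knn_predict(training, (48, 62)))  # yes
print(knn_predict(training, (27, 11)))  # no
```

A memory-based model keeps the training records and defers all work to prediction time. In practice the fields would be scaled or weighted so that no single field dominates the distance.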
12DARWIN Tree Model
Training Data -> Create Tree -> Test/Evaluate Tree (information on error rates of pruned sub-trees)
I/P Prediction Dataset -> Predict with Tree (using the selected sub-tree) -> Merged I/P and O/P prediction dataset -> Analyze Results
13DARWIN Net Model
Training Dataset -> Create Net -> Neural Network Model -> Train Net (information on error rates) -> Trained Neural Network
I/P Prediction Dataset -> Trained Neural Network -> Prediction Dataset -> Merged I/P and O/P prediction dataset -> Analyze Results
14DARWIN Match Model
Training Data -> Create Match Model -> Optimize match weights
I/P Prediction Dataset -> Predict with Match -> Merged I/P and O/P prediction dataset -> Analyze Results
15DARWIN Analyzing
- Evaluate: evaluates the performance of a given model on a given dataset, when working on known data for test or evaluation purposes.
- Summarize Data: provides a statistical summary of the values taken by the data in the specified fields of a dataset.
- Frequency Count: provides information on the frequency with which particular data values appear in a dataset.
16DARWIN Analyzing
- Performance Matrix: can be used to compare simple fields or simple functions of fields.
- Sensitivity: provides a model showing the relative importance of the attributes used in building a model.
17DARWIN Code Generation
- Darwin can generate C, C++, or Java code for a Tree or Net model so that a prediction function can be called from an application program
- Java code can also be generated to embed a model in a Web applet
FOR MORE INFO...
http://technet.oracle.com/docs/products/datamining/doc_index.htm
18DARWIN
- For more info
- http://technet.oracle.com/software/products/intermedia/software_index.html
- 1. Oracle Data Mining Data sheet
- 2. Oracle Data Mining Solutions
- http://www.oracle.com/ip/analyze/warehouse/datamining/
- http://www.oracle.com/oramag/oracle/98-Jan/fast.html
- 1. Managing Unstructured Data with Oracle8
- http://technet.oracle.com/products/datamining/
- 1. Product manuals
19DARWIN
20Oracle Intermedia Text
- A ranking technique called theme proving is used
- Documents are grouped into categories and subcategories
- Integrated with the Oracle 8 database
- No training or tuning required
21Oracle Intermedia Text
- Lexical Knowledge Base
- - 200,000 concepts from very broad domains
- - 2000 major categories
- - Concepts mapped into one or more words/phrases in canonical form
- - Each of these has alternate inflectional variations, acronyms, and synonyms stored
- - Total vocabulary of 450,000 terms
- - Each entry has other parameters like parts of speech
22Oracle Intermedia Text
- Theme Extraction
- - Themes are assigned initial ranks based on the structure of the document and the frequency of the theme.
- - All the ancestor themes are also included in the result
- - Theme proving is done before final ranking
- Queries
- Direct match, phrase search (contains), case-sensitive query, misspellings and fuzzy match, inflections (about), compound queries, Boolean operators, natural language query
23Oracle Intermedia Text
- Oracle at TREC 8
- (Eighth Text Retrieval Conference: http://otn.oracle.com/products/intermedia/htdocs/imt_trec8pap.htm)
- Recall at 1000: 71.57% (3384/4728)
- Average precision: 41.30%
- Initial precision (at recall 0.0): 92.79%
- Final precision (at recall 1.0): 7.91%
24Intermedia Text-Model
25Interface Options
26Language Selection
- Java for robot
- PL/SQL for data retrieval
27Code Execution
28Overview of the System
[Figure: system overview. Labels: Intermedia Text, Customer Browser, Client Browser, Web Server, Oracle 8i, Listening at port 80, Server process, Tag stripper, JDBC]
29Intermedia Text
- Steps for Building an application
- Load the documents
- Index the document
- Issue Queries
- Present the documents that satisfy the query
30Loading Methods
- Loading Methods
- Insert statements
- SQL*Loader
- Ctxsrv: a server daemon process which builds the index at regular intervals
- Ctxload utility, used for:
- Thesaurus import/export
- Text loading
- Document updating/exporting
31Create and Populate a Simple Table
CREATE TABLE quick (
  quick_id NUMBER CONSTRAINT quick_pk PRIMARY KEY,
  text     VARCHAR2(80) );
INSERT INTO quick VALUES ( 1, 'The cat sat on the mat' );
INSERT INTO quick VALUES ( 2, 'The fox jumped over the dog' );
INSERT INTO quick VALUES ( 3, 'The dog barked like a dog' );
COMMIT;
32Run a Text Query
SELECT text FROM quick
  WHERE CONTAINS ( text, 'sat on the mat' ) > 0;
DRG-10599: column is not indexed
- You must have a Text index on a column before you can do a CONTAINS query on it
33Create the Text Index
CREATE INDEX quick_text ON quick ( text )
  INDEXTYPE IS CTXSYS.CONTEXT;
- CTXSYS is the system user for interMedia Text
- The INDEXTYPE keyword is a feature of the Extensible Indexing Framework
34Run a Text Query
SELECT text FROM quick
  WHERE CONTAINS ( text, 'sat on the mat' ) > 0;
TEXT
-----------------------
The cat sat on the mat
- You should regard the CONTAINS function as boolean in meaning
- It is implemented as a number since SQL does not have a boolean datatype
- The only sensible way to use it is with > 0
35Run a Text Query
SELECT SCORE(42) s, text FROM quick
  WHERE CONTAINS ( text, 'dog', 42 ) > 0 /* just for teaching purposes! */
  ORDER BY s DESC;
 S TEXT
-- ---------------------------
 7 The dog barked like a dog
 4 The fox jumped over the dog
- The better the match, the higher the score
- The value can be used in ORDER BY but has no absolute significance
- The score is zero when the query is not matched
36Intermedia Text - Indexing Pipeline
[Figure: indexing pipeline. Database -(column data)-> Datastore -(doc data)-> Filter -(filtered doc text)-> Sectioner -(plain text)-> Lexer -(tokens)-> Engine -(index data)-> index; the Sectioner also passes section offsets to the Engine]
- First step is creating an index
- Datastore: reads the data out of the table (for a URL datastore, performs a GET)
37Intermedia Text - Indexing Pipeline
- Filter: transforms the data into some text type; this is needed because some formats may be binary, as when storing doc, pdf, or HTML types
- Sectioner: converts to plain text, removing tags and invisible info.
- Lexer: splits the text into discrete tokens.
- Engine: takes the tokens from the lexer, the offsets from the sectioner, and a list of stop words (the stoplist) to build an index.
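The sectioner, lexer, and stoplist stages described above can be imitated in a few lines. This is a rough stand-in built on Python's standard library, not interMedia Text's own components; the names `TagStripper`, `to_plain_text`, `lex`, `index_tokens`, and the contents of `STOPLIST` are hypothetical.

```python
import re
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Sectioner stand-in: keeps text between tags, drops the tags.

    Simplification: script and style contents are not filtered out.
    """

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def to_plain_text(html):
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()
    return " ".join(stripper.chunks)


def lex(text):
    """Lexer stand-in: (position, TOKEN) pairs, positions starting at 1."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    return [(pos, word.upper()) for pos, word in enumerate(words, start=1)]


STOPLIST = {"THE", "ON", "A", "OF"}  # hypothetical stop words


def index_tokens(html):
    """Tokens the engine would index: stop words dropped, positions kept."""
    return [(pos, tok) for pos, tok in lex(to_plain_text(html))
            if tok not in STOPLIST]


page = "<html><body><p>The cat sat on the <b>mat</b></p></body></html>"
print(index_tokens(page))  # [(2, 'CAT'), (3, 'SAT'), (6, 'MAT')]
```

Positions are assigned before stop words are removed, so phrase queries can still account for the gaps they leave.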
38Intermedia Text - Indexing Pipeline
- Example of index creation
- Statements
- INSERT INTO docs VALUES (1, 'first document');
- INSERT INTO docs VALUES (2, 'second document');
- Produces an index
- DOCUMENT -> doc 1 position 2, doc 2 position 2
- FIRST -> doc 1 position 1
- SECOND -> doc 2 position 1
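The index in this example is an inverted index: each token maps to the documents and positions where it occurs. A minimal sketch that reproduces the three entries above follows; `build_index` and `lookup` are hypothetical names, and the real index is stored in database tables rather than a dictionary.

```python
from collections import defaultdict


def build_index(docs):
    """docs: {doc_id: text}. Returns {TOKEN: [(doc_id, position), ...]}."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for position, word in enumerate(docs[doc_id].split(), start=1):
            index[word.upper()].append((doc_id, position))
    return dict(index)


def lookup(index, word):
    """Ids of the documents that contain `word`."""
    return sorted({doc_id for doc_id, _ in index.get(word.upper(), [])})


docs = {1: "first document", 2: "second document"}
index = build_index(docs)
for token in sorted(index):
    print(token, "->", index[token])
# DOCUMENT -> [(1, 2), (2, 2)]
# FIRST -> [(1, 1)]
# SECOND -> [(2, 1)]
print(lookup(index, "document"))  # [1, 2]
```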
39Testing procedure
- Document set from newsgroups
- 122 documents from a text mining site
- Loaded using insert statements
- File datastore used
- Documents(HTML) from browsing
- 20 documents
- Loaded from server process
- URL datastore used
40Newsgroup Results
- 1. Religion, Atheism: 15
- on bible, islam, religious beliefs
- 2. Comp-os-ms-windows-misc: 17
- about operating systems, protocols, installation
- 3. Comp.graphics: 27
- on hardware and software for computer graphics
- 4. Ice Hockey: 18
- 5. Computer hardware: 12
- on installation of different peripheral devices
- 6. Mideast.politics: 14
- on political development in the Mideast
- 7. Science.space: 19
- on various space programs, devices, theories
41Newsgroup Results
42Newsgroup Results
43Newsgroup Results
Recall = (# of correct positive predictions) / (# of positive examples)
Precision = (# of correct positive predictions) / (# of positive predictions)
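The two ratios above translate directly into code. The counts in the first example are hypothetical; the last line re-derives the TREC 8 recall figure quoted earlier from its own numerator and denominator (3384 of 4728).

```python
def recall(true_pos, false_neg):
    """Correct positive predictions divided by all positive examples."""
    positives = true_pos + false_neg
    return true_pos / positives if positives else 0.0


def precision(true_pos, false_pos):
    """Correct positive predictions divided by all positive predictions."""
    predicted = true_pos + false_pos
    return true_pos / predicted if predicted else 0.0


# Hypothetical counts for one category.
tp, fp, fn = 12, 3, 4
print(round(recall(tp, fn), 2))     # 0.75
print(round(precision(tp, fp), 2))  # 0.8

# 3384 relevant documents retrieved out of 4728 relevant in total.
print(round(recall(3384, 4728 - 3384), 4))  # 0.7157
```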
44Query Syntax Binary Operators
Operator  Symbol  Example
AND       &       cat & dog
OR        |       cat | dog
EQUIV     =       cat = dog
MINUS     -       cat - dog
NOT       ~       cat ~ dog
ACCUM     ,       cat , dog
45Semantics Binary Operators
- The semantics of all the binary operators is defined in terms of SCORE
- However, the score for even the simplest query expression (a single word) is calculated by a subtle rule
- The score is higher for a document where the query word occurs more frequently than for one where it occurs less frequently
- But when word1 occurs N times in document D, its score is lower than when word2 occurs N times in document D if word1 occurs more often in the whole document set than word2
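As an illustration of what "defined in terms of SCORE" can mean, the sketch below combines per-document scores for two words. It assumes that AND takes the minimum of its operands' scores and OR the maximum. The slides do not state this, so treat it as an assumption; the other operators are not modelled, and the scores are hypothetical.

```python
def score_and(left, right):
    # Assumption: AND scores a document by its weaker operand.
    return min(left, right)


def score_or(left, right):
    # Assumption: OR scores a document by its stronger operand.
    return max(left, right)


def combine(op, left_scores, right_scores):
    """Apply a binary score operator document by document."""
    return {doc: op(left_scores[doc], right_scores[doc])
            for doc in left_scores}


# Hypothetical single-word scores per document id.
cat = {1: 5, 2: 0, 3: 3}
dog = {1: 2, 2: 7, 3: 0}

and_scores = combine(score_and, cat, dog)
or_scores = combine(score_or, cat, dog)
print(and_scores)  # {1: 2, 2: 0, 3: 0}
print(or_scores)   # {1: 5, 2: 7, 3: 3}

# A document matches when its combined score is above zero.
print([doc for doc, s in and_scores.items() if s > 0])  # [1]
```

Under these assumptions a zero score on either side of AND yields zero, so the usual CONTAINS(...) > 0 test behaves like a boolean conjunction.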
46The Salton Algorithm
- interMedia Text uses an algorithm which is similar to the Salton Algorithm, widely used in Text Retrieval products
- The score for a word is proportional to f (1 + log(N/n)), where
- f is the frequency of the search term in the document
- N is the total number of documents
- and n is the number of documents which contain the search term
- The score is converted into an integer in the range 0 - 100.
47The Salton Algorithm
Assumption
- Inverse frequency scoring assumes that frequently occurring terms in a document set are noise terms, and so these terms are scored lower. For a document to score high, the query term must occur frequently in the document but infrequently in the document set as a whole.
48The Salton Algorithm
- This table assumes that only one document in the set contains the query term.
# of Documents in Document Set   Occurrences of Term in Document Needed to Score 100
1                                34
5                                20
10                               17
50                               13
100                              12
500                              10
1,000                            9
10,000                           7
100,000                          5
1,000,000                        4
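The scoring rule can be tried numerically. The slides only say the score is proportional to f (1 + log(N/n)), so the scale factor of 3, the base-10 logarithm, and the truncate-and-cap conversion to 0 - 100 used below are all assumptions. They were picked because they reproduce the table rows for 1 through 10,000 documents; for the two largest sets they give 6 and 5 occurrences rather than the listed 5 and 4, so this is an approximation at best. The function names are hypothetical.

```python
import math

SCALE = 3  # assumed constant of proportionality


def salton_score(f, N, n):
    """Score for a term occurring f times in a document.

    N: total number of documents, n: documents containing the term.
    """
    raw = SCALE * f * (1 + math.log10(N / n))
    return min(100, int(raw))  # assumed conversion to the 0 - 100 range


def occurrences_needed(N, n=1):
    """Smallest f whose raw score reaches 100."""
    return math.ceil(100 / (SCALE * (1 + math.log10(N / n))))


for N in (1, 5, 10, 50, 100, 500, 1000, 10000):
    print(N, occurrences_needed(N))
# 1 34
# 5 20
# 10 17
# 50 13
# 100 12
# 500 10
# 1000 9
# 10000 7
```

The trend matches the inverse-frequency assumption: the rarer the term is across the document set, the fewer occurrences a single document needs to reach the top score.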
49Summary of operators
- ,
? !
BT, BTG, BTP, BTI, NT, NTG, NTP, NTI, PT, RT,
SYN, TR, TRSYN, TT
50Summary of operators
- Stored query expression...
SQE
() \
NEAR, WITHIN, ABOUT
51Application Details- Customer profile Analyzer
The HTTP server (for user web page caching) is started. The Oracle web server is also started.
52Log In Screen- Customer User
The log-in screen is used by both the customer and the users.
The Oracle web server takes care of the secure connections, while for the HTTP server the user id is common for the session; no user can invoke a document from the server without a user id.
53Customer Interface Http Server
The user uses the interface provided by the custom HTTP server.
54Main User Screen
The user can choose the type of data to be analyzed. Two types of data exist:
1. Newsgroups
2. User-browsed URLs
55Selection of Category and options
The user chooses a category and other options, such as:
- Generating themes
- Generating a gist
- Generating marked-up text
- Date range
56Results Page Gist Generation
This page can be used for drilling down to the actual document, which opens up in the browser (generated by the filter option). Themes and gists can also be generated from this screen.
57Search Screen
The search screen has advanced options like fuzzy search and about search. A chain of expressions can be used, joined with conjunctions (like not, or, and).
58Conclusion
- New estimation methods are trying to find more meaning from text.
- Industry has great text mining products and is constantly improving technology.
- Unstructured data mining still has a long way to go.