GammaWare Technology June 2002 - PowerPoint PPT Presentation


1
GammaWare Technology, June 2002
  • Yiftach Ravid, VP R&D
  • GammaSite Inc.
  • yiftach_at_GammaSite.com

2
Overview
- The challenge
- Taxonomies
- Classification
- Categorization
- Focused Crawler
- Q&A
3
The challenge: Generate Structured Taxonomies of text repositories
[Diagram: Business-Relevant Content. Labels: Unstructured Data, Internal DB, Information, Word, Structured Data, Application, Web, Forms, XML, Services, Catalogues, Mail, Domino]
  • Generate a structured taxonomy of huge text
    repositories

4
Taxonomy
5
What is a Taxonomy
  • Taxonomy
  • Taxis: arrangement or division
  • Nomos: law
  • The science of classification according to a
    pre-determined system
  • Best-known use of taxonomy is in Biology
  • taxonomies of animals and plants

6
Web Taxonomy
  • Best-known use of taxonomies
  • Web portals or Directories
  • Internet sites classified into hierarchical
    topics
  • General
  • Yahoo! http://www.yahoo.com/
  • Open Directory http://www.dmoz.org/
  • LookSmart http://www.looksmart.com/r?country=uk
  • Topical
  • Business.Com http://www.business.com/
  • HealthWeb http://www.healthweb.org/
  • Education Planet http://www.educationplanet.com/

7
Taxonomy - Sample
8
Taxonomy vs. Thesaurus
9
Classification
10
What is a Classifier
  • Concept (Topic, Subject)
  • An abstract or generic idea generalized from
    particular instances (Merriam-Webster)
  • Classifier
  • A function on a concept (category) and on an
    object (document)
  • Returns a number between 0 and 1 called
    confidence rate
  • Confidence rate: measures the confidence that
    the object (document) belongs (should be
    classified) to the concept (category)
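
The definition above can be sketched in a few lines of Python. This is only an illustration of the interface (concept and document in, confidence rate between 0 and 1 out); the keyword lists and the scoring rule are invented for the example and are not GammaWare's method.

```python
from typing import Callable, Dict, Set

# A classifier, as defined above: a function of a concept (category) and an
# object (document) that returns a confidence rate between 0 and 1.
Classifier = Callable[[str, str], float]

# Hypothetical keyword lists, invented for this illustration only.
CONCEPT_KEYWORDS: Dict[str, Set[str]] = {
    "tea": {"tea", "herbal", "brew", "leaf"},
    "crawling": {"crawler", "link", "web", "fetch"},
}


def keyword_classifier(concept: str, document: str) -> float:
    """Toy confidence rate: fraction of the concept's keywords found in the document."""
    keywords = CONCEPT_KEYWORDS.get(concept, set())
    if not keywords:
        return 0.0  # unknown concept: no evidence either way, report zero confidence
    words = set(document.lower().split())
    return len(keywords & words) / len(keywords)


confidence = keyword_classifier("tea", "Herbal tea specialist")
```

Here `confidence` is 0.5, because two of the four hypothetical "tea" keywords occur in the document.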

11
Methods for Automatic Classification
  • Rule based
  • Pre-defined set of rules
  • Advantage
  • incorporating prior knowledge
  • Disadvantages
  • extreme reliance on man-made rules
  • costly in terms of man-hours
  • Linguistics
  • Use of morphology, syntax and semantics
  • Not multilingual; demands many training examples
  • Machine Learning

12
What is Machine Learning
  • Machine Learning is the study of computer
    algorithms that automatically improve performance
    through experience

13
Sample for Machine Learning
14
Discriminating Features
Q1: Who is this person?
Q2: What are the most discriminating features?
15
Discriminating Features
  • Answer
  • Lips
  • Eyes

16
Discriminating Features
The Margaret Thatcher effect
17
Supervised Inductive Learning
  • A process where
  • A learning algorithm is provided with a set of
    labeled instances, positive and negative examples
    (a training set)
  • Using the training set, the learning algorithm
    generates a classifier
  • The quality of the classifier is measured via its
    ability to perform well on novel instances (a
    test set)
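
The process above can be sketched as a tiny Python program. It is a minimal sketch under stated assumptions: the word-voting learner, the example texts, and the 0.5 decision cut-off are all invented for illustration and stand in for whatever learning algorithm is actually used.

```python
from collections import Counter


def train(training_set):
    """Learning algorithm: build a classifier from (text, is_positive) pairs."""
    pos, neg = Counter(), Counter()
    for text, is_positive in training_set:
        (pos if is_positive else neg).update(text.lower().split())

    def classifier(text):
        """Confidence in [0, 1]: share of informative words seen more often in positives."""
        words = text.lower().split()
        votes = [w for w in words if pos[w] != neg[w]]
        if not votes:
            return 0.5  # no evidence either way
        return sum(1 for w in votes if pos[w] > neg[w]) / len(votes)

    return classifier


# Labeled instances: positive and negative examples (the training set).
training_set = [
    ("green tea and herbal tea blends", True),
    ("how to brew herbal tea", True),
    ("football match results", False),
    ("league football scores and results", False),
]
# Novel instances, never shown to the learner (the test set).
test_set = [
    ("herbal blends to brew", True),
    ("football league scores", False),
]

classifier = train(training_set)
correct = sum((classifier(text) >= 0.5) == label for text, label in test_set)
test_accuracy = correct / len(test_set)
```

The quality figure (`test_accuracy`) is computed only on the test set, which is the point of the third bullet: performance on the training examples themselves says little about novel documents.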

18
Supervised Inductive Learning Example
Training
Test
19
Evaluating a Classifier
Classifier
Category
20
Recall and Precision
Use a confusion matrix to count:
Precision (P) = GY / (GY + GN) = 70 / (70 + 50) = 0.58
Recall (R) = GY / (GY + BY) = 70 / (70 + 30) = 0.70
Accuracy (A) = (GY + BN) / (GY + GN + BY + BN) = 220 / 300 = 0.73
F-measure (F) = 2 / (1/P + 1/R) = 2GY / (GY + GN + GY + BY) = 2*70 / (100 + 120) = 0.64
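
As a check on the arithmetic, the four measures can be computed directly from the confusion-matrix counts. The sketch below uses the slide's counts (GY = 70, GN = 50, BY = 30); BN = 150 is not stated explicitly and is inferred here from the accuracy figure of 220 / 300. From the formulas, GY plays the role of true positives, GN of false positives, BY of false negatives, and BN of true negatives.

```python
def evaluate(gy, gn, by, bn):
    """Return (precision, recall, accuracy, f_measure) from confusion-matrix counts."""
    precision = gy / (gy + gn)
    recall = gy / (gy + by)
    accuracy = (gy + bn) / (gy + gn + by + bn)
    f_measure = 2 / (1 / precision + 1 / recall)  # harmonic mean of P and R
    return precision, recall, accuracy, f_measure


# bn=150 is an inferred value: 220 correct decisions minus the 70 counted in GY.
p, r, a, f = evaluate(gy=70, gn=50, by=30, bn=150)
```

With these counts the F-measure equals 140 / 220, about 0.636, which sits between precision (about 0.583) and recall (0.70), as a harmonic mean must.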
21
Supervised Statistical Machine Learning
  • A Supervised Inductive Learning method that is
    based on statistics obtained from the training
    set
  • Benefits
  • Generality and flexibility
  • Successfully applied across a broad spectrum of
    problems
  • Multi lingual
  • Low labor costs

22
How to Classify documents
  • Pre defined fields ( Structured data )
  • Author
  • Title
  • Date
  • Content ( Unstructured data )
  • From title, main text, emphasized text
  • All words
  • All 2 words, All 3 words, etc.
  • Phrases, Synonyms, etc.
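
A minimal sketch of turning unstructured content into features, assuming that "All 2 words, All 3 words" means contiguous word sequences (n-grams) and that features are tagged with the field they came from. Both assumptions, and the function names, are illustrative and not taken from the product.

```python
def word_ngrams(text, n):
    """All contiguous n-word sequences in the text, lowercased."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def extract_features(title, body, max_n=3):
    """Single words, 2-word and 3-word sequences, prefixed with their source field."""
    features = []
    for field_name, text in (("title", title), ("body", body)):
        for n in range(1, max_n + 1):
            features += [f"{field_name}:{gram}" for gram in word_ngrams(text, n)]
    return features


features = extract_features("Herbal tea", "Herbal tea specialist in town")
```

Keeping the field prefix lets a learner weight a word in the title differently from the same word in the main text, which is one way to use the "from title, main text, emphasized text" distinction.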

23
Getting Started
24
GammaWare Work Flow
Improve Classifiers
Check Seed
25
Requirements
  • Initial parameters and decisions
  • Level of percolation - affects
  • Recall
  • Precision
  • Multi label
  • Maximum number of categories into which a
    document can be classified
  • Types of training documents
  • Full text, Keywords
  • Different types per category
  • List of Stop Words
  • Common words in the language used and also in
    the topic
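
The multi-label setting can be illustrated with a short sketch: given a confidence rate per category, keep at most a fixed number of categories. The confidence threshold used here is an assumption added for the example (the slides do not name such a parameter); raising it generally favors precision and lowering it generally favors recall.

```python
def assign_categories(confidences, threshold=0.5, max_labels=2):
    """Keep the highest-confidence categories, at most max_labels of them."""
    ranked = sorted(confidences.items(), key=lambda item: item[1], reverse=True)
    return [category for category, conf in ranked if conf >= threshold][:max_labels]


# Hypothetical confidence rates for one document.
rates = {"Tea": 0.9, "Health": 0.7, "Sports": 0.1, "Food": 0.6}
labels = assign_categories(rates)
```

With the default settings, `labels` is `["Tea", "Health"]`: "Food" passes the threshold but is cut by the two-category maximum.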

26
Taxonomy
  • A Taxonomy is constructed according to
  • User/Business needs
  • who will be using the taxonomy
  • Data
  • content of documents for classification
  • Good taxonomy
  • requires critical attention to both the
    definition and application of categories and
    their labels
  • simple and intuitive
  • How: Using the Expert Tool

27
Seeding process
  • Seeding process: each category within the
    taxonomy needs to be given a few examples of
    relevant documents of the same type that the user
    seeks to catalog
  • An average of 3-6 relevant documents per category
  • Seeds can either be positive seeds or negative
    seeds for each category
  • For better results - training documents should be
    in a similar structure as the documents for
    classification
  • How: Using the Expert Tool

28
Check Seed
  • Check seed: Classify the seeds into the taxonomy
  • Output: An HTML page (browsed by the Expert tool)
  • For each category, shows the cataloging results
    for all the relevant seeds.
  • Why: Helps in locating seeding problems
  • Seeds that are multi labeled
  • Problems in taxonomy structure
  • How: Using the GammaWare Manager

29
Train Classifiers
  • Train: Train classifiers for all categories
  • Output: A classifier file (gcl extension) for
    each category
  • Why: The classifiers are used for categorization.
  • How: Using the GammaWare Manager

30
Classify Documents
  • Categorization: Catalogue documents into a
    Taxonomy
  • Output: A table in a database
  • Why: This is why we are here.
  • How: Using the GammaWare Manager

31
Improve Classifiers
  • Methods to improve classification results using
    the Expert Tool.
  • Re-design the taxonomy
  • Seed problems
  • More examples
  • Add new seeds
  • drag and drop documents from classification view
  • Negative seeds
  • Modify Categorization and Train parameters

32
Categorization
33
Hierarchical Categorization
  • Goal: Classify a document into the appropriate
    sub-topic(s) in the taxonomy
  • Difficulties
  • Many sub-topics
  • A document may fall into several sub-topics
  • Classifiers are not perfect
  • Must control Recall and Precision according
    to the client's needs

34
Hierarchical Categorization
  • Divide and Conquer solution
  • Solve the problem Level by Level
  • At each level decompose the problem into several,
    smaller sized classification sub-problems
  • Note: ignoring interactions between sub-problems
    can yield poor results
  • Patent Pending on Categorization
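
The level-by-level idea can be sketched generically as a top-down walk over the taxonomy. This is a textbook-style illustration, not the patent-pending GammaWare method: the `Category` type, the word-matching classifiers, and the fixed threshold are all assumptions, and the sketch deliberately ignores interactions between sub-problems, which the slide warns can hurt results.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Category:
    name: str
    classifier: Callable[[str], float]  # returns a confidence rate in [0, 1]
    children: List["Category"] = field(default_factory=list)


def categorize(document, node, threshold=0.5, path=()):
    """Descend one level at a time; return the paths of the deepest accepted sub-topics."""
    here = path + (node.name,)
    # Each level is its own small classification sub-problem over the node's children.
    accepted = [c for c in node.children if c.classifier(document) >= threshold]
    if not accepted:
        return [here]
    results = []
    for child in accepted:  # a document may fall into several sub-topics
        results.extend(categorize(document, child, threshold, here))
    return results


def has_word(word):
    """Hypothetical stand-in for a trained classifier: 1.0 if the word occurs, else 0.0."""
    return lambda doc: 1.0 if word in doc.lower().split() else 0.0


taxonomy = Category("Root", lambda doc: 1.0, [
    Category("Food", has_word("tea"), [
        Category("Herbal Tea", has_word("herbal")),
        Category("Coffee", has_word("coffee")),
    ]),
    Category("Sports", has_word("football")),
])

paths = categorize("Herbal tea specialist", taxonomy)
```

Here `paths` contains the single path `("Root", "Food", "Herbal Tea")`. Only the children of accepted categories are ever scored, so the number of classifier calls grows with the depth of the accepted branches rather than with the total number of sub-topics.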

35
Focused Crawler
36
Topic Specific Crawling
  • Retrieve all documents that are relevant to a
    specific topic of interest
  • Hyper-linked networks (Intranet, Internet)
  • Two options
  • Crawl the network. Then apply classification
    schemes to filter relevant documents.
  • Using classification schemes crawl the network
    while teaching the crawler to imitate
    (intelligent) human surfing strategies

37
Simple Crawling
  • The Network is huge
  • Storage
  • Network
  • Time
  • Good for general-purpose search engines
  • Crawling: The process of retrieving documents
    from the net

38
Focused Crawling via Link Classifiers
  • Analyze the context of the link

Herbal tea specialist
My brother new born child
Link is irrelevant
  • Link classifier: Decision according to the
    context of the link
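
A minimal sketch of a focused crawl driven by a link classifier, run over a small in-memory link graph instead of a real network. The graph, the word-overlap scoring, and the threshold are assumptions made for the example; a real link classifier would be trained and would look at more context than the anchor text.

```python
import heapq

# Hypothetical in-memory "web": page -> list of (link context, target page).
WEB = {
    "home": [("Herbal tea specialist", "tea-shop"),
             ("My brother's newborn child", "family")],
    "tea-shop": [("Green tea catalogue", "catalogue")],
    "family": [("Baby photos", "photos")],
    "catalogue": [],
    "photos": [],
}


def link_classifier(context, topic_words):
    """Toy link classifier: fraction of topic words present in the link's context."""
    words = set(context.lower().split())
    return len(words & topic_words) / len(topic_words)


def focused_crawl(start, topic_words, threshold=0.3):
    """Visit pages in order of link score; never follow links judged irrelevant."""
    visited, order = set(), []
    frontier = [(-1.0, start)]  # max-heap via negated scores
    while frontier:
        _, page = heapq.heappop(frontier)
        if page in visited:
            continue
        visited.add(page)
        order.append(page)
        for context, target in WEB.get(page, []):
            score = link_classifier(context, topic_words)
            if score >= threshold and target not in visited:
                heapq.heappush(frontier, (-score, target))
    return order


crawl_order = focused_crawl("home", {"herbal", "tea"})
```

The crawl visits `home`, `tea-shop`, and `catalogue` and never fetches `family` or `photos`, because the link leading there is classified as irrelevant before any download happens. That saving in storage, network, and time is the contrast with simple crawling followed by filtering.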

39
Focused Crawler The Learning Process
Herbal tea specialist
  • Crawler Classifier: Checks if the document is
    good for crawling

40
GammaWare API
41
Architecture - Basic
42
Multiple Servers
  • Scalability and Availability

43
Q&A