Relevant Document Discovery Through Document Metadata and Indexing Thesis Defense By Hiu Shan Yau Ch

1
Relevant Document Discovery Through Document
Metadata and Indexing
Thesis Defense
By Hiu Shan Yau (Christina)
10-28-03
2
Statement of the Problem
  • Developers and users of space systems need to
  • Meet extensive technical standards requirements
  • Incorporate prior experience into new projects
    and products
  • NASA Technical Standards Program (NTSP)
  • Provides links to technical standards documents
    and hosts a database of experience information
  • Provides only simple search
  • NASA desires to have links between all standards
    and related lessons learned
  • Domain experts search with NTSP, read document
    content, and identify relationships
  • Only a small amount of related information is
    identified and linked
  • Difficulty and low productivity in matching
    related documents

3
Purpose of Study
  • To investigate the design and implementation of a
    software component which helps domain experts in
    the discovery of related technical standards
    specifications and lessons learned, i.e.,
    document matching
  • Techniques
  • To model and capture knowledge in standards and
    lessons learned documents as metadata
  • To develop metadata generation tools
  • To use document metadata and index data to
    perform document matching

4
Research Questions and Hypothesis
  • How can a document be represented? What are the
    document elements?
  • Title, author, subject, description, abstract,
    table of contents
  • Reference Dublin Core (DC)
  • Which standards and lessons learned document
    elements can be captured as metadata to be used
    for matching?
  • Reference DCMES (Dublin Core Metadata Element
    Set) recommendation
  • Index and frequency of word occurrence
  • What algorithm should be used to return the best
    document match search results?
  • Compare title / abstract / keywords / index
  • Can index words and the frequency of occurrence
    be an important lead to the subjects, especially
    if keywords are not specified?
  • Assign scores to words according to location and
    frequency
  • Title and subject keyword elements carry more
    weight

5
Research Design
  • Document element and metadata modeling
  • Identify document metadata elements
  • Construct metadata and relation model
  • Develop document metadata schema
  • Metadata and index generation and storage
  • Develop form-based template to assist generation
    of metadata according to the metadata schema
  • Use indexer to index documents and their
    individual metadata elements for matching
  • Store metadata as XML files and index words in
    matching element index files
  • Metadata index matching, result display and
    relations update
  • Derive algorithm for matching indexed words
    generated from document and metadata
  • Find additional relevant document discovery
    strategies
  • Develop reports and user interface for
    presentation of match result

6
Document Element and Metadata Modeling
  • Document metadata elements identification
  • 10 DCMES elements + 1 other DC-recommended
    element selected
  • With DCMI refinement and encoding scheme
  • Metadata and relation model construction
  • Concept behind the use of matching to find the
    relevant documents
  • if the metadata and indices of two documents are
    related, then the subject contents of the two
    documents are likely to be related (figure 1)
  • Document schema development
  • XML metadata schema generated (figure 2)

7
  • Fig 1. Finding document relationships via
    metadata relationships

8
Figure 2. Extract of code for SA_MetaMatch
Metadata XML Schema
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:dcterms="http://purl.org/dc/terms/">
    <xs:annotation>
      <xs:documentation xml:lang="en">
        Simple XML Schema for Metadata Model, 2003-01-20
      </xs:documentation>
    </xs:annotation>
    <xs:import namespace="http://www.w3.org/XML/1998/namespace"
        schemaLocation="http://www.w3.org/2001/03/xml.xsd"></xs:import>
    <xs:group name="metadata">
      <xs:sequence>
        <dc:title name="title" type="xs:string" />
        <xs:element name="creator" type="creatorType" ref="dc:creator" />
        <xs:element name="subject" type="subjectType" ref="dc:subject" />
        <xs:element name="description" type="descriptionType"
            ref="dc:description" />
        <dc:publisher name="publisher" type="xs:string"
            ref="dc:description" />
        <xs:element name="date" type="dateType" ref="dc:date" />
        <dc:format name="format" type="xs:string" />
        <dc:identifier name="identifier" type="xs:string" />
        <xs:element name="relation" type="relationType"
            ref="dc:relation" />

9
Metadata and Index Generation and Storage
  • Form-based template development
  • SA_MetaMatch Generate / Edit Metadata Interface
    Screen for Capturing Metadata
  • (figures 3 and 4)
  • XML metadata generated (figure 5)
  • Metadata and document indexing
  • Generate index with Swish-e indexer, Perl script
    (figure 6)
  • Swish-e configured to remove stopwords, numbers,
    and words of length < 3
  • Perl script to extract words and their absolute
    frequency, calculate relative frequency, filter
    words, and sort by frequency (figures 7 and 8)
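
The post-processing step described above can be sketched in Python. This is a minimal illustration, not the thesis's Perl script: the function name `build_index`, the tiny stopword set, and the tokenizing regular expression are assumptions. Only the steps themselves (drop stopwords, numbers, and words shorter than 3 characters, then compute absolute and relative frequency, filter, and sort) come from the slide.

```python
import re
from collections import Counter

# Hypothetical stopword list; the actual Swish-e configuration is not shown in the slides.
STOPWORDS = {"the", "and", "for", "with", "that", "this"}


def build_index(text, min_rel_freq=0.0):
    """Return [(word, abs_freq, rel_freq_percent)] sorted by descending frequency."""
    words = re.findall(r"[a-z0-9][a-z0-9-]*", text.lower())
    # Drop short words, pure numbers, and stopwords before counting.
    kept = [
        w for w in words
        if len(w) >= 3 and not w.isdigit() and w not in STOPWORDS
    ]
    counts = Counter(kept)
    total = sum(counts.values())
    rows = []
    for word, freq in counts.items():
        rel = 100.0 * freq / total  # relative frequency in percent
        if rel >= min_rel_freq:
            rows.append((word, freq, round(rel, 2)))
    # Highest frequency first; ties are listed alphabetically.
    rows.sort(key=lambda r: (-r[1], r[0]))
    return rows
```

The relative frequency here is the word's count divided by the total count of indexed words, which agrees with Figure 8, where "ground" has 99 of 5341 indexed occurrences, or about 1.85%.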

10
Metadata and Index Generation and Storage
  • Metadata and document indexing
  • As only the most important (high word score)
    words are matched to find relevant documents, the
    lower frequency words are filtered and removed
    from the indices.
  • The minimum index percentage parameter is defined
    in the configuration file.
  • Metadata storage
  • Preliminarily as XML files (figure 5)
  • Can be stored into a metadata repository for
    better searching and management

11
Figure 3. SA_MetaMatch Generate / Edit Metadata
Interface Screen
Figure 4. SA_MetaMatch Generate / Edit Metadata
Interface Screen (contd)
12
Figure 5. Extract of Metadata Generated
  <?xml version="1.0" ?>
  <Metadata>
    <title>Electrical Grounding Architecture For Unmanned Spacecraft</title>
    <subject>
      <controlledKeyword>Spacecraft</controlledKeyword>
      <uncontrolledKeyword>Grounding architecture, Design, Rationale</uncontrolledKeyword>
      <classification scheme="NASA PTSD">4000</classification>
    </subject>
    <description>
      <abstract>This handbook describes spacecraft grounding
        architecture options at the system level. Implementation of good
        ... basis for understanding those choices and tradeoffs.</abstract>
    </description>
    <publisher>NASA</publisher>
    <date dateType="">February 17, 1998</date>
    <format>
      <extent>257K</extent>
      <medium>text/pdf</medium>
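
A metadata file of this shape can be read with Python's standard `xml.etree.ElementTree` module. The sketch below is only an assumption about how a consumer might pull out the elements used for matching: the helper name `read_matching_fields` is hypothetical, and `SAMPLE` is a shortened copy of the extract with closing tags added so that it is well-formed.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the Figure 5 extract (closing tags added).
SAMPLE = """<?xml version="1.0" ?>
<Metadata>
  <title>Electrical Grounding Architecture For Unmanned Spacecraft</title>
  <subject>
    <controlledKeyword>Spacecraft</controlledKeyword>
    <uncontrolledKeyword>Grounding architecture, Design, Rationale</uncontrolledKeyword>
  </subject>
  <publisher>NASA</publisher>
</Metadata>"""


def read_matching_fields(xml_text):
    """Return the title and the list of subject keywords from a metadata document."""
    root = ET.fromstring(xml_text)
    keywords = []
    for tag in ("controlledKeyword", "uncontrolledKeyword"):
        for node in root.findall(f"subject/{tag}"):
            # Keyword elements may hold a comma-separated list.
            keywords += [k.strip() for k in (node.text or "").split(",") if k.strip()]
    return {"title": root.findtext("title", default="").strip(), "subject": keywords}
```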

13
Figure 6. Index generation
14
Figure 7. Extract of Index File Generated From
Swish-e Utility
Index output for alphabetical sorted listing of
the terms with location
  -----> WORD INFO in index NASA-HDBK-4001.idx <-----
  100m 1 1 1 (2348/1)
  100m-1000m 1 1 1 (2347/1)
  1553b 1 1 2 (2536/1 2539/1)
  3k-ohm 1 1 1 (4940/1)
  . . . omit for display
  adequacy 1 1 1 (1604/1)
  adjacent 1 1 1 (2131/1)
  adjusting 1 1 1 (3994/1)
  administration 1 1 1 (10/1)
  advantage 1 1 3 (3357/1 4520/1 4692/1)
  aeronautics 1 1 1 (5/1)
  aerospace 1 1 1 (879/1)
  . . . omit for display
  years 1 1 4 (106/1 2326/1 2332/1 3565/1)
  zero 1 1 4 (934/1 963/1 1610/1 1689/1)
  zero-potential 1 1 1 (1114/1)
  zero-to-low 1 1 1 (3896/1)
Figure 8. Extract of Index File Processed with
Perl Script
  Doc No. NASA-HDBK-4001
  Index words with frequency >= 1 in descending order
  Term            Freq   Rel. Freq (%)
  ---------------------------------------
  ground            99   1.85
  power             94   1.76
  spacecraft        86   1.61
  grounding         81   1.52
  chassis           77   1.44
  isolated          74   1.39
  isolation         74   1.39
  system            62   1.16
  interface         36   0.67
  reference         35   0.66
  architecture      33   0.62
  handbook          33   0.62
  design            32   0.60
  . . . omit for display
  mil-b-5087b        1   0.02
  difficulties       1   0.02
  requires           1   0.02
  ---------------------------------------
  1449 non blank lines
  Total freq of indexed word 5341
15
Metadata Index Matching and Result Display
  • Weighting scheme
  • Weights are assigned to different metadata
    elements according to the importance of the
    elements
  • 3 different matching algorithms
  • Comparison of index terms between same element
  • Comparison of index terms between alternative
    elements in addition to same element match
  • Matching word score and sum of word score for
    comparison

16
Matching Algorithms
  • Matching Algorithm 1
  • Compare index terms between same element
  • for each element in matching list
  • score = element_weight * element_match (i.e.
    match_word / target_element_word)
  • E.g. percentage relevant score = (4 * (% match
    in title)
  • + 3 * (% match in subject)
  • + 2 * (% match in scope / abstract)
  • + 1 * (% match in index))
  • / (4 + 3 + 2 + 1) * 100
  • for title weight = 4, subject weight = 3,
    abstract weight = 2, doc index weight = 1
  • Matching Algorithm 2
  • Compare index terms between same and alternative
    elements
  • alternative element match with discounted weight
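
Matching Algorithm 1, as described on this slide, can be sketched as below. This is an illustrative reading rather than the SA_MetaMatch implementation: the function name, the dictionary-of-sets data layout, and the choice to skip target elements that have no index words are assumptions. The weights are the example values from the slide (title 4, subject 3, abstract 2, index 1).

```python
# Element weights from the slide's example.
WEIGHTS = {"title": 4, "subject": 3, "abstract": 2, "index": 1}


def algorithm1_score(target, candidate, weights=WEIGHTS):
    """Percentage relevance of `candidate` to `target`, same-element matching only.

    `target` and `candidate` map element names to sets of index words.
    """
    score = 0.0
    for element, weight in weights.items():
        target_words = target.get(element, set())
        if not target_words:
            continue  # assumption: an empty target element contributes nothing
        matched = target_words & candidate.get(element, set())
        element_match = len(matched) / len(target_words)  # fraction of target words matched
        score += weight * element_match
    # Normalize by the total weight and express as a percentage.
    return score / sum(weights.values()) * 100
```

For example, matching half of the title words, all subject words, a quarter of the abstract words, and all index words gives (4 * 0.5 + 3 * 1 + 2 * 0.25 + 1 * 1) / 10 * 100 = 65.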

17
Matching Algorithms
  • Matching Algorithm 2
  • if (all words in a target element found matching
    in same comparing element)
  • score = element_weight (100% match)
  • no need to compare with other elements
  • else
  • element_match = match_word / target_element_word
  • score = element_weight * element_match
  • compare the unmatched words with other elements,
    starting from the closest weighted one, until a
    matching word is found in an element or none is
    found in any element
  • for each alternative match element alt_element,
    calculate the additional score add_score for
    matching words found in that element
  • wt_diff = abs(target_element_weight -
    match_element_weight)
  • wt_diff_ratio = min((wt_diff /
    target_element_weight), 1)
  • alt_element_weight = weight * (1 - wt_diff_ratio)
  • alt_element_match = match_word_with_alt_element
    / target_element_word
  • add_score = alt_element_weight *
    alt_element_match
  • score += add_score
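
The steps of Matching Algorithm 2 for a single target element might look like the following sketch. It is one possible interpretation, not the thesis code: the function name and data layout are hypothetical, `weight` in the `alt_element_weight` line is assumed to mean the target element's weight, and each unmatched word is assumed to be credited only in the first alternative element where it is found.

```python
# Example element weights (title 4, subject 3, abstract 2, index 1).
WEIGHTS = {"title": 4, "subject": 3, "abstract": 2, "index": 1}


def algorithm2_element_score(element, target, candidate, weights=WEIGHTS):
    """Score one target element, crediting matches found in alternative elements.

    `target` and `candidate` map element names to sets of index words.
    """
    target_words = target.get(element, set())
    if not target_words:
        return 0.0
    weight = weights[element]
    matched = target_words & candidate.get(element, set())
    score = weight * len(matched) / len(target_words)  # equals weight on a 100% match
    unmatched = target_words - matched
    # Try the other elements, closest weight first, until no word is left unmatched.
    others = sorted(
        (e for e in weights if e != element),
        key=lambda e: abs(weight - weights[e]),
    )
    for alt in others:
        if not unmatched:
            break
        found = unmatched & candidate.get(alt, set())
        if not found:
            continue
        wt_diff_ratio = min(abs(weight - weights[alt]) / weight, 1)
        alt_weight = weight * (1 - wt_diff_ratio)  # discounted weight
        score += alt_weight * len(found) / len(target_words)
        unmatched = unmatched - found
    return score
```

Under these assumptions, a title word found only in the candidate's subject is credited at weight 4 * (1 - 1/4) = 3 instead of 4, while an index word found only in the candidate's title earns nothing, because the weight difference (3) exceeds the index weight (1).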

18
Matching Algorithms
  • Matching Algorithm 3
  • Calculate a score for each individual matching
    word and derive relevance from the sum of word
    scores
  • word list of the target document (all the words
    listed) and matched document (only match words)
  • a score calculated for each word according to the
    frequency and location (i.e. type of element)
  • for each indexed element
  • // word refers to all words for the target doc;
    for the searched doc it refers only to words
  • // found to match those in the target doc
  • word score = word freq in that element *
    element_weight
  • document score = sum of (word score)
  • relevance score of matching document = matching
    word score
  • / (target word score + matching word score) *
    100
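
A sketch of Matching Algorithm 3 follows. Several points are assumptions rather than facts from the slides: the function names and the `{element: {word: frequency}}` layout are hypothetical, and the denominator is read as the sum of the target word score and the matching word score, because the operator between them is not legible in the transcript. The weights are the 50, 40, 30, 1 values reported in the results.

```python
# Element weights reported in the results for the third algorithm.
WEIGHTS = {"title": 50, "subject": 40, "abstract": 30, "index": 1}


def word_scores(doc, weights=WEIGHTS, only=None):
    """Map each word to the sum over elements of (frequency * element weight).

    `doc` maps element names to {word: frequency}. If `only` is given,
    words outside that set are ignored.
    """
    scores = {}
    for element, freqs in doc.items():
        for word, freq in freqs.items():
            if only is None or word in only:
                scores[word] = scores.get(word, 0) + freq * weights[element]
    return scores


def algorithm3_relevance(target, candidate, weights=WEIGHTS):
    """Relevance of `candidate` to `target` as a percentage (assumed formula)."""
    target_scores = word_scores(target, weights)
    # Only candidate words that also occur in the target are scored.
    match_scores = word_scores(candidate, weights, only=set(target_scores))
    target_total = sum(target_scores.values())
    match_total = sum(match_scores.values())
    if target_total + match_total == 0:
        return 0.0
    return match_total / (target_total + match_total) * 100
```

Keeping the per-word dictionary from `word_scores` is what lets a reviewer see which words drove the result. Under the assumed denominator, a candidate whose matching words score exactly as high as the target's words receives 50.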

19
Additional Relevant Document Discovery Strategy
  • Recognize special pattern words employed
  • E.g. NASA-STD-7002 indicates a relation with
    another NASA standard document, according to the
    naming conventions of SDOs (reference figure 9)
  • Other strategies studied
  • Usage of a word in parts of speech (verbs, nouns,
    ...)
  • Remove irrelevant words selectively
  • Apply semantically related terms in an ontology
  • including synonyms in a thesaurus, or broader or
    narrower terms in a concept hierarchy (taxonomy)
  • Apply other ontology relationships
  • including homonym, antonym, etc.
  • Apply more ontology (domain model) processing of
    related terms and constraints
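
Recognizing special pattern words such as NASA-STD-7002 can be sketched with a regular expression. The pattern below is an assumption modeled only on the identifiers that appear in these slides (NASA-STD-7002, NASA-STD-5003, NASA-HDBK-4001). Real SDO naming conventions vary and would need their own patterns, and `find_document_ids` is a hypothetical helper name.

```python
import re

# Assumed pattern: organization prefix, document type (STD or HDBK), and a number.
DOC_ID = re.compile(r"\b[A-Z]{2,}-(?:STD|HDBK)-\d+[A-Z]?\b")


def find_document_ids(text, own_id=None):
    """Return the sorted, de-duplicated document IDs cited in `text`.

    `own_id` (the target document's own identifier) is excluded if given.
    """
    ids = sorted(set(DOC_ID.findall(text)))
    return [i for i in ids if i != own_id]
```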

20
Reports and User Interface for Match Result
Display
  • Result for matching was initially generated as a
    text report
  • relevant doc list sorted by the relevance score
  • the target document ID, a list of the matching
    document IDs, the target words and their
    locations, as well as the matching words and
    their locations
  • third algorithm with score for each term added
  • Screens for the third algorithm have been
    developed
  • to present the result data in a more interactive
    and more readable way

21
Reports and User Interface for Match Result
Display
Figure 9. Result Display for Matching
22
Figure 10. Word Summary Interface Screen for
Target Document (Standards document titled
Dynamic Environmental Criteria)
Figure 11. Word Summary Interface Screen for
Matched Document (Lesson Learned document titled
Environmental Test Sequencing)
23
Results and Findings
  • Document matching elements
  • DCMES recommendations provided a very useful
    reference for document metadata elements to be
    selected for the SA_MetaMatch schema
  • Matching index words from the selected
    descriptive metadata elements title, subject and
    abstract (from description) were found to be
    useful in identifying relevant documents
  • Index words from document content alone were also
    found to be useful as an important lead to the
    document subjects most of the time
  • When the metadata index and the document content
    index were combined, they were more powerful and
    gave better results in identifying relevant
    documents

24
Results and Findings
  • Different algorithms applied
  • Second and third algorithms were close in
    performance, and better than first algorithm
  • Third algorithm is preferred
  • provides an individual word score which helps
    the domain expert assess each document's relevance
  • Weight of element
  • Ratio of title : subject : abstract : index =
    50 : 40 : 30 : 1 for the third algorithm gave
    good results
  • Relevant documents found
  • SA_MetaMatch was found to be helpful for the
    domain expert in finding relevant documents
    (table 1)

25
Results and Findings
  • Relevant documents found

Table 1. Number of Previously Known and
Newly-Discovered Relevant Links in the Collection
of LLIS Documents For Each Target Standard
Document
26
Results and Findings
  • Distributions of relevant documents rank

Table 2. Distribution of Ranks for the Known
Relevant Documents to Each of the Target Standard
Documents With 25% Word Filtering
27
Results and Findings
  • Distributions of relevant documents rank

Table 3. Distribution of Ranks for the Known
Relevant Documents to Each of the Target Standard
Documents With 50% Word Filtering
28
Results and Findings
  • Distributions of relevant documents rank
  • On average, about 70% of the known relevant
    documents rank in the top 25th percentile and
    over 90% of the known relevant documents rank in
    the top 50th percentile in both configurations of
    word filtering
  • Filtering
  • Good performance was found when
  • the minimum relative frequency for index words
    was set at 0.2%
  • the low score word filtering out percentage was
    set at 25-50%,
  • depending on the length of the document
  • Selective removal of irrelevant words
  • The average share of known relevant documents
    ranking in the top 25th percentile increased
    to about 80%, and that in the top 50th percentile
    to about 95%
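
The percentile figures above are shares of the known relevant documents whose rank falls within the top part of the ranked result list. A minimal sketch of that arithmetic follows. The function name is hypothetical, and the exact cutoff convention used in the thesis is not stated, so a rank at or below `collection_size * percentile / 100` is assumed to count as a hit.

```python
def share_in_top(ranks, collection_size, percentile):
    """Percentage of `ranks` that fall within the top `percentile` of a ranked list.

    `ranks` are 1-based positions of the known relevant documents.
    """
    if not ranks:
        return 0.0
    cutoff = collection_size * percentile / 100
    hits = sum(1 for r in ranks if r <= cutoff)
    return 100.0 * hits / len(ranks)
```

With made-up ranks 2, 10, 24, and 40 in a collection of 100 documents, three of the four fall in the top 25th percentile (75%) and all four fall in the top 50th percentile (100%).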

29
Results and Findings
  • Selective removal of irrelevant words

Table 4. Distribution of Ranks for the Known
Relevant Documents to Each of the Target Standard
Documents With Selective Word Removal
30
Results and Findings
  • Use of special pattern words
  • The special pattern word recognition strategy
    (for example, recognizing NASA-STD-5003) was
    found to be useful in discovering relevant
    standards specifications mentioned in the target
    document.

31
Conclusions
  • Metadata
  • Dublin Core Metadata Element Set (DCMES)
    recommendations, DCMI recommendations, and the
    DCMI recommended guidelines provided very useful
    references for the XML metadata schema modeling
    and construction
  • Index
  • Indices for metadata and document content have
    been found useful in finding relevant documents
  • SA_MetaMatch prototype
  • Useful to NASA
  • Has improved the productivity of the document
    matching and linking process

32
Recommendations for Further Study
  • Apply stemming
  • many errors found in the Porter Stemmer algorithm
  • Implement word phrase index term
  • Analyze usage of a word and natural language
    processing
  • usage of a word in parts of speech (verbs, nouns,
    adjective, or others)
  • usage of a word in heading title, figure caption,
    etc.
  • Apply related terms and constraints
  • expand or refine index terms by applying
    semantically related terms in domain ontology
  • include synonyms in a thesaurus, or broader or
    narrower terms in a concept hierarchy (taxonomy)

33
Recommendations for Further Study
  • Other ontology relationships
  • including homonym, antonym, etc
  • can be applied to distinguish words of different
    context and to eliminate unrelated documents
  • Future metadata and ontology language
  • RDF, DAML+OIL, OWL or another ontology language
  • Learn from query
  • search query feedback

34
The End
  • Thank you!!
  • Questions and Suggestions?