1. Relevant Document Discovery Through Document Metadata and Indexing
Thesis Defense
By Hiu Shan Yau (Christina)
10-28-03
2. Statement of the Problem
- Developers and users of space systems need to
  - Meet extensive technical standards requirements
  - Incorporate prior experience into new projects and products
- NASA Technical Standards Program (NTSP)
  - Provides links to technical standards documents and hosts a database of experience information
  - Provides only simple search
- NASA desires to have links between all standards and related lessons learned
  - Domain experts search with NTSP, read document content, and identify relationships
  - Only a small amount of related information is identified and linked
  - Difficulty and low productivity in matching related documents
3. Purpose of Study
- To investigate the design and implementation of a software component that helps domain experts discover related technical standards specifications and lessons learned, i.e., document matching
- Techniques
  - To model and capture knowledge in standards and lessons learned documents as metadata
  - To develop metadata generation tools
  - To use document metadata and index data to perform document matching
4. Research Questions and Hypothesis
- How can a document be represented? What are the document elements?
  - Title, author, subject, description, abstract, table of contents
  - Reference: Dublin Core (DC)
- Which standards and lessons learned document elements can be captured as metadata to be used for matching?
  - Reference: DCMES (Dublin Core Metadata Element Set) recommendation
  - Index and frequency of word occurrence
- What algorithm should be used to return the best document match search result?
  - Compare title / abstract / keywords / index
- Can index words and their frequency of occurrence be an important lead to the subjects, especially if keywords are not specified?
  - Assign scores to words according to location and frequency
  - Title and subject keyword elements weigh more
5. Research Design
- Document element and metadata modeling
  - Identify document metadata elements
  - Construct metadata and relation model
  - Develop document metadata schema
- Metadata and index generation and storage
  - Develop form-based template to assist generation of metadata according to the metadata schema
  - Use indexer to index documents and their individual metadata elements for matching
  - Store metadata as XML files and index words in matching element index files
- Metadata index matching, result display and relations update
  - Derive algorithm for matching indexed words generated from document and metadata
  - Find additional relevant document discovery strategies
  - Develop reports and user interface for presentation of match result
6. Document Element and Metadata Modeling
- Document metadata elements identification
  - 10 DCMES elements and 1 other DC recommended element selected
  - With DCMI refinement and encoding scheme
- Metadata and relation model construction
  - Concept behind the use of matching to find the relevant documents: if the metadata and indices of two documents are related, then the subject contents of the two documents are likely to be related (figure 1)
- Document schema development
  - XML metadata schema generated (figure 2)
7. Figure 1. Finding document relationships via metadata relationships
8. Figure 2. Extract of code for SA_MetaMatch Metadata XML Schema
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/">
  <xs:annotation>
    <xs:documentation xml:lang="en">
      Simple XML Schema for Metadata Model, 2003-01-20
    </xs:documentation>
  </xs:annotation>
  <xs:import namespace="http://www.w3.org/XML/1998/namespace" schemaLocation="http://www.w3.org/2001/03/xml.xsd"></xs:import>
  <xs:group name="metadata">
    <xs:sequence>
      <dc:title name="title" type="xs:string" />
      <xs:element name="creator" type="creatorType" ref="dc:creator" />
      <xs:element name="subject" type="subjectType" ref="dc:subject" />
      <xs:element name="description" type="descriptionType" ref="dc:description" />
      <dc:publisher name="publisher" type="xs:string" ref="dc:description" />
      <xs:element name="date" type="dateType" ref="dc:date" />
      <dc:format name="format" type="xs:string" />
      <dc:identifier name="identifier" type="xs:string" />
      <xs:element name="relation" type="relationType" ref="dc:relation" />
9. Metadata and Index Generation and Storage
- Form-based template development
  - SA_MetaMatch Generate / Edit Metadata Interface Screen for Capturing Metadata (figures 3 and 4)
  - XML metadata generated (figure 5)
- Metadata and document indexing
  - Generate index with Swish-e indexer and Perl script (figure 6)
  - Swish-e configured to remove stopwords, numbers, and words of length < 3
  - Perl script to extract words and their absolute frequencies, calculate relative frequencies, filter words, and sort by frequency (figures 7 and 8)
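The post-processing step in the last bullet above (absolute frequency, relative frequency, filtering, sorting) can be sketched roughly as follows. The thesis used a Perl script over Swish-e output; this Python version is only an illustration, and the function name `relative_frequencies`, the `min_rel_freq` parameter, and the sample word list are assumptions, not part of the original tool.

```python
from collections import Counter


def relative_frequencies(words, min_rel_freq=0.0):
    """Return (term, freq, rel_freq_percent) rows, most frequent first.

    words        -- iterable of already-indexed words (stopwords removed)
    min_rel_freq -- hypothetical threshold, in percent; rarer terms are dropped
    """
    counts = Counter(words)
    total = sum(counts.values())
    rows = []
    for term, freq in counts.items():
        rel = 100.0 * freq / total  # relative frequency as a percentage
        if rel >= min_rel_freq:
            rows.append((term, freq, round(rel, 2)))
    # Sort by descending frequency, then alphabetically for a stable listing.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


# Tiny made-up word list, only to exercise the function.
words = ["ground"] * 3 + ["power"] * 2 + ["chassis"]
rows = relative_frequencies(words, min_rel_freq=20.0)
for term, freq, rel in rows:
    print(f"{term:<12}{freq:>5}{rel:>8.2f}")
```

Here "chassis" (1 of 6 words, about 16.67%) falls below the 20% threshold and is dropped, which mirrors the idea of removing low-frequency words before matching.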
10. Metadata and Index Generation and Storage
- Metadata and document indexing
  - As only the most important (high word score) words are matched to find relevant documents, the lower frequency words are filtered and removed from the indices.
  - The minimum index percentage parameter is defined in the configuration file.
- Metadata storage
  - Preliminarily as XML files (figure 5)
  - Can be stored in a metadata repository for better searching and management
11. Figure 3. SA_MetaMatch Generate / Edit Metadata Interface Screen
Figure 4. SA_MetaMatch Generate / Edit Metadata Interface Screen (cont'd)
12. Figure 5. Extract of Metadata Generated
<?xml version="1.0" ?>
<Metadata>
  <title>Electrical Grounding Architecture For Unmanned Spacecraft</title>
  <subject>
    <controlledKeyword>Spacecraft</controlledKeyword>
    <uncontrolledKeyword>Grounding architecture, Design, Rationale</uncontrolledKeyword>
    <classification scheme="NASA PTSD">4000</classification>
  </subject>
  <description>
    <abstract>This handbook describes spacecraft grounding architecture options at the system level. Implementation of good . . . basis for understanding those choices and tradeoffs.</abstract>
  </description>
  <publisher>NASA</publisher>
  <date dateType="">February 17, 1998</date>
  <format>
    <extent>257K</extent>
    <medium>text/pdf</medium>
13. Figure 6. Index generation
14. Index output for alphabetical sorted listing of the terms with location

-----> WORD INFO in index NASA-HDBK-4001.idx <-----
100m 1 1 1 (2348/1)
100m-1000m 1 1 1 (2347/1)
1553b 1 1 2 (2536/1 2539/1)
3k-ohm 1 1 1 (4940/1)
. . . omit for display
adequacy 1 1 1 (1604/1)
adjacent 1 1 1 (2131/1)
adjusting 1 1 1 (3994/1)
administration 1 1 1 (10/1)
advantage 1 1 3 (3357/1 4520/1 4692/1)
aeronautics 1 1 1 (5/1)
aerospace 1 1 1 (879/1)
. . . omit for display
years 1 1 4 (106/1 2326/1 2332/1 3565/1)
zero 1 1 4 (934/1 963/1 1610/1 1689/1)
zero-potential 1 1 1 (1114/1)
zero-to-low 1 1 1 (3896/1)

Figure 7. Extract of Index File Generated From Swish-e Utility

Doc No. NASA-HDBK-4001
Index words with frequency > 1 in descending order

Term            Freq   Rel. Freq (%)
---------------------------------------
ground            99   1.85
power             94   1.76
spacecraft        86   1.61
grounding         81   1.52
chassis           77   1.44
isolated          74   1.39
isolation         74   1.39
system            62   1.16
interface         36   0.67
reference         35   0.66
architecture      33   0.62
handbook          33   0.62
design            32   0.60
. . . omit for display
mil-b-5087b        1   0.02
difficulties       1   0.02
requires           1   0.02
---------------------------------------
1449 non blank lines
Total freq of indexed word 5341

Figure 8. Extract of Index File Processed with Perl Script
15. Metadata Index Matching and Result Display
- Weighting scheme
  - Weights are assigned to different metadata elements according to the importance of the elements
- 3 different matching algorithms
  - Comparison of index terms between same element
  - Comparison of index terms between alternative elements in addition to same element match
  - Matching word score and sum of word score for comparison
16. Matching Algorithms
- Matching Algorithm 1
  - Compare index terms between same element
  - for each element in matching list
    - score = element_weight * element_match (i.e., match_word / target_element_word)
  - E.g., percentage relevance score = [4 * (match in title) + 3 * (match in subject) + 2 * (match in scope / abstract) + 1 * (match in index)] / (4 + 3 + 2 + 1) * 100
    - for title weight = 4, subject weight = 3, abstract weight = 2, doc index weight = 1
- Matching Algorithm 2
  - Compare index terms between same and alternative elements
  - Alternative element match with discounted weight
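Matching Algorithm 1 on this slide can be sketched as below. This is a minimal illustration, not the SA_MetaMatch code: the representation of a document as a dict of element name to a set of index words, the function name `algorithm1_score`, and the choice to let an empty target element contribute zero while still counting in the denominator are all assumptions. The weights 4, 3, 2, 1 come from the slide's example.

```python
# Element weights taken from the example on the slide.
WEIGHTS = {"title": 4, "subject": 3, "abstract": 2, "index": 1}


def algorithm1_score(target, candidate, weights=WEIGHTS):
    """Percentage relevance of `candidate` to `target`, same-element matches only.

    target, candidate -- dict mapping element name -> set of index words
    """
    score = 0.0
    for element, weight in weights.items():
        target_words = target.get(element, set())
        if not target_words:
            continue  # nothing to match in this element
        matched = target_words & candidate.get(element, set())
        # element_match = match_word / target_element_word
        score += weight * len(matched) / len(target_words)
    return 100.0 * score / sum(weights.values())


# Made-up documents, only to exercise the function.
target = {
    "title": {"grounding", "architecture"},
    "subject": {"spacecraft"},
    "index": {"ground", "power", "chassis", "system"},
}
candidate = {
    "title": {"grounding"},
    "subject": {"spacecraft"},
    "index": {"ground", "power"},
}
relevance = algorithm1_score(target, candidate)
print(relevance)
```

In this example the title matches at 1/2, the subject at 1/1 and the index at 2/4, so the weighted sum is 4 * 0.5 + 3 * 1 + 1 * 0.5 = 5.5 out of a possible 10.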
17. Matching Algorithms
- Matching Algorithm 2
  - if (all words in a target element are found in the same comparing element)
    - score = element_weight (100% match)
    - no need to compare with other elements
  - else
    - element_match = match_word / target_element_word
    - score = element_weight * element_match
    - compare the unmatched words with other elements, starting from the closest weighted one, until a matching word is found in an element or none is found in any element
    - for each alternative match element alt_element, calculate the additional score add_score for matching words found in that element
      - wt_diff = abs(target_element_weight - match_element_weight)
      - wt_diff_ratio = min((wt_diff / target_element_weight), 1)
      - alt_element_weight = weight * (1 - wt_diff_ratio)
      - alt_element_match = match_word_with_alt_element / target_element_word
      - add_score = alt_element_weight * alt_element_match
      - score += add_score
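One possible reading of Matching Algorithm 2 is sketched below. Several points are assumptions rather than statements from the slides: documents are dicts of element name to word sets, `weight` in the discount formula is taken to be the target element's weight, each unmatched word is credited once at the closest-weighted alternative element in which it appears, and the function returns the raw weighted score rather than a percentage.

```python
WEIGHTS = {"title": 4, "subject": 3, "abstract": 2, "index": 1}


def algorithm2_score(target, candidate, weights=WEIGHTS):
    """Weighted match score with discounted credit for alternative elements."""
    total = 0.0
    for element, weight in weights.items():
        target_words = target.get(element, set())
        if not target_words:
            continue
        n = len(target_words)
        matched = target_words & candidate.get(element, set())
        total += weight * len(matched) / n  # same-element score
        unmatched = target_words - matched
        # Try the other elements, closest weight first.
        alternatives = sorted(
            (e for e in weights if e != element),
            key=lambda e: abs(weights[e] - weight),
        )
        for alt in alternatives:
            if not unmatched:
                break  # every target word has been matched somewhere
            alt_matched = unmatched & candidate.get(alt, set())
            if not alt_matched:
                continue
            wt_diff = abs(weight - weights[alt])
            wt_diff_ratio = min(wt_diff / weight, 1)
            alt_weight = weight * (1 - wt_diff_ratio)  # discounted weight
            total += alt_weight * len(alt_matched) / n
            unmatched -= alt_matched
    return total


# Made-up documents: "architecture" is missing from the candidate's title
# but appears in its subject keywords.
target = {"title": {"grounding", "architecture"}}
candidate = {"title": {"grounding"}, "subject": {"architecture"}}
score = algorithm2_score(target, candidate)
print(score)
```

With title weight 4 and subject weight 3, the discounted weight is 4 * (1 - 1/4) = 3, so the word found in the subject adds 3 * 1/2 = 1.5 to the same-element score of 4 * 1/2 = 2.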
18. Matching Algorithms
- Matching Algorithm 3
  - Calculate a score for each individual matching word and derive relevance from the sum of word scores
  - word list of the target document (all the words listed) and matched document (only matching words)
  - a score is calculated for each word according to its frequency and location (i.e., type of element)
  - for each indexed element
    - // word refers to all words for the target doc, and to only the words
    - // in the searched doc found to match those in the target doc
    - word score = word freq of that element * element_weight
  - document score = sum of (word score)
  - relevance score of matching document = matching word score / (target word score + matching word score) * 100
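Matching Algorithm 3 can be sketched as follows. The data layout (element name to a word-frequency dict) and the function names are hypothetical. The operators in the slide's formulas did not survive extraction, so this sketch assumes that a word score is frequency times element weight and that the relevance denominator is the sum of the target and matching word scores. The weights 50 : 40 : 30 : 1 are the ratio reported as giving good results for this algorithm.

```python
WEIGHTS = {"title": 50, "subject": 40, "abstract": 30, "index": 1}


def word_scores(doc, weights, only=None):
    """Return word -> score, where score sums freq * element_weight over elements.

    doc  -- dict mapping element name -> {word: frequency}
    only -- optional set; words outside it are ignored (used for the matched doc)
    """
    scores = {}
    for element, weight in weights.items():
        for word, freq in doc.get(element, {}).items():
            if only is not None and word not in only:
                continue
            scores[word] = scores.get(word, 0) + freq * weight
    return scores


def algorithm3_relevance(target, candidate, weights=WEIGHTS):
    """Return (relevance percentage, per-word scores of the matching words)."""
    target_scores = word_scores(target, weights)
    matching_scores = word_scores(candidate, weights, only=set(target_scores))
    t = sum(target_scores.values())
    m = sum(matching_scores.values())
    if t + m == 0:
        return 0.0, matching_scores
    return 100.0 * m / (t + m), matching_scores


# Made-up documents, only to exercise the functions.
target = {"title": {"grounding": 1}, "index": {"grounding": 50, "power": 50}}
candidate = {"title": {"grounding": 1}, "index": {"chassis": 10}}
relevance, per_word = algorithm3_relevance(target, candidate)
print(relevance, per_word)
```

Returning the per-word scores alongside the total reflects the reason the slides give for preferring this algorithm: the domain expert can see which words drove a match.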
19. Additional Relevant Document Discovery Strategy
- Recognize special pattern words employed
  - E.g., NASA-STD-7002 indicates a relation with another NASA standard document, according to the naming conventions of SDOs (reference figure 9)
- Other strategies studied
  - Usage of a word in parts of speech (verbs, nouns, ...)
  - Remove irrelevant words selectively
  - Apply semantically related terms in an ontology, including synonyms in a thesaurus, or broader or narrower terms in a concept hierarchy (taxonomy)
  - Apply other ontology relationships, including homonym, antonym, etc.
  - Apply more ontology (domain model) processing of related terms and constraints
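The special pattern word strategy in the first bullet of this slide could be approximated with a regular expression. The pattern below is an assumption built only from the identifiers that appear in this presentation (NASA-STD-7002, NASA-STD-5003, NASA-HDBK-4001); the actual naming conventions handled by SA_MetaMatch are not given in the slides, and the function name and sample text are hypothetical.

```python
import re

# Hypothetical pattern: "NASA", a document type (STD or HDBK), then a number
# with an optional revision letter.
DOC_ID_PATTERN = re.compile(r"\bNASA-(?:STD|HDBK)-\d+[A-Z]?\b")


def find_document_ids(text, exclude=None):
    """Return the sorted, de-duplicated document identifiers mentioned in `text`.

    exclude -- identifier of the document being scanned, so that a document
               is not reported as related to itself
    """
    ids = {match.group(0) for match in DOC_ID_PATTERN.finditer(text)}
    ids.discard(exclude)
    return sorted(ids)


# Made-up sentence, only to exercise the function.
text = (
    "Testing shall follow NASA-STD-7002 and NASA-HDBK-4001; "
    "see also NASA-STD-5003 and NASA-STD-7002."
)
related = find_document_ids(text, exclude="NASA-HDBK-4001")
print(related)
```

Other standards organizations would need their own patterns, which is one reason a table of patterns per SDO may be easier to maintain than a single expression.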
20. Reports and User Interface for Match Result Display
- Result for matching was initially generated as a text report
  - relevant doc list sorted by the relevance score
  - the target document ID, a list of the matching document IDs, the target words and their locations, as well as the matching words and their locations
  - for the third algorithm, a score for each term is added
- Screens for the third algorithm have been developed
  - to present the result data in a more interactive and more readable way
21. Reports and User Interface for Match Result Display
Figure 9. Result Display for Matching
22. Figure 10. Word Summary Interface Screen for Target Document (Standards document titled "Dynamic Environmental Criteria")
Figure 11. Word Summary Interface Screen for Matched Document (Lesson Learned document titled "Environmental Test Sequencing")
23. Results and Findings
- Document matching elements
  - DCMES recommendations provided a very useful reference for document metadata elements to be selected for the SA_MetaMatch schema
  - Matching index words from the selected descriptive metadata elements title, subject and abstract (from description) were found to be useful in identifying relevant documents
  - Index words from document content alone were also found to be useful as an important lead to the document subjects most of the time
  - When the metadata index and the document content index were combined, they were more powerful and gave better results in identifying relevant documents
24. Results and Findings
- Different algorithms applied
  - Second and third algorithms were close in performance, and better than the first algorithm
  - Third algorithm is preferred
    - provides an individual word score, which helps the domain expert assess each document's relevancy
- Weight of element
  - A ratio of title : subject : abstract : index = 50 : 40 : 30 : 1 for the third algorithm gave good results
- Relevant documents found
  - SA_MetaMatch was found to be helpful for the domain expert in finding relevant documents (table 1)
25. Results and Findings
Table 1. Number of Previously Known and Newly-Discovered Relevant Links in the Collection of LLIS Documents for Each Target Standard Document
26. Results and Findings
- Distributions of relevant documents rank
Table 2. Distribution of Ranks for the Known Relevant Documents to Each of the Target Standard Documents With 25% Word Filtering
27. Results and Findings
- Distributions of relevant documents rank
Table 3. Distribution of Ranks for the Known Relevant Documents to Each of the Target Standard Documents With 50% Word Filtering
28. Results and Findings
- Distributions of relevant documents rank
  - On average, about 70% of the known relevant documents rank in the top 25th percentile and over 90% of the known relevant documents rank in the top 50th percentile in both configurations of word filtering
- Filtering
  - Good performance was found when
    - the minimum relative frequency for index words was set at 0.2
    - the low score word filtering-out percentage was set at 25-50%, depending on the length of the document
- Selective removal of irrelevant words
  - The average share of known relevant documents ranking in the top 25th percentile increased to about 80%, and that in the top 50th percentile to about 95%
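The rank-distribution figures on this slide measure what share of the already-known relevant documents land near the top of the ranked match list. A minimal sketch of that calculation is given below; the function name, the rounding of the cutoff with `math.ceil`, and the document IDs are assumptions for illustration and are not taken from the thesis data.

```python
import math


def share_in_top_percentile(ranked_ids, known_relevant, percentile):
    """Percentage of known relevant documents within the top `percentile`
    percent of a ranked result list (best match first)."""
    known = set(known_relevant)
    if not known:
        return 0.0
    cutoff = math.ceil(len(ranked_ids) * percentile / 100)
    top = set(ranked_ids[:cutoff])
    return 100.0 * len(known & top) / len(known)


# Made-up ranking of 20 documents with 4 known relevant ones.
ranked_ids = [f"D{i}" for i in range(1, 21)]
known_relevant = ["D2", "D4", "D9", "D15"]
top25 = share_in_top_percentile(ranked_ids, known_relevant, 25)
top50 = share_in_top_percentile(ranked_ids, known_relevant, 50)
print(top25, top50)
```

In this toy ranking, two of the four known documents fall in the first five positions and three fall in the first ten.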
29. Results and Findings
- Selective removal of irrelevant words
Table 4. Distribution of Ranks for the Known Relevant Documents to Each of the Target Standard Documents With Selective Word Removal
30. Results and Findings
- Use of special pattern words
  - The special pattern words recognition strategy (for example, recognizing NASA-STD-5003) was found to be useful in discovering relevant standards specifications mentioned in the target document.
31. Conclusions
- Metadata
  - Dublin Core Metadata Element Set (DCMES) recommendations, DCMI recommendations, and the DCMI recommended guidelines provided very useful references for the XML metadata schema modeling and construction
- Index
  - Indices for metadata and document content have been found useful in finding relevant documents
- SA_MetaMatch prototype
  - Useful to NASA
  - Has improved the productivity of the document matching and linking process
32. Recommendations for Further Study
- Apply stemming
  - many errors found in the Porter Stemmer algorithm
- Implement word phrase index term
- Analyze usage of a word and natural language processing
  - usage of a word in parts of speech (verbs, nouns, adjectives, or others)
  - usage of a word in heading title, figure caption, etc.
- Apply related terms and constraints
  - expand or refine index terms by applying semantically related terms in domain ontology
  - include synonyms in a thesaurus, or broader or narrower terms in a concept hierarchy (taxonomy)
33. Recommendations for Further Study
- Other ontology relationships
  - including homonym, antonym, etc.
  - can be applied to distinguish words of different context and to eliminate unrelated documents
- Future metadata and ontology language
  - RDF, DAML+OIL, OWL or another ontology language
- Learn from query
  - search query feedback
34. The End
- Thank you!!
- Questions and Suggestions?