Relevant Document Discovery Through Document Metadata and Indexing Thesis Defense By Hiu Shan Yau Ch

1
Relevant Document Discovery Through Document Metadata and Indexing
Thesis Defense
By Hiu Shan Yau (Christina)
10-28-03
2
Statement of the Problem

Developers and users of space systems need to:
  Meet extensive technical standards requirements
  Incorporate prior experience into new projects and products
NASA Technical Standards Program (NTSP)
  Provides links to technical standards documents and hosts a database of experience information
  Provides only simple search
NASA desires to have links between all standards and related lessons learned
  Domain experts search with NTSP, read document content, and identify relationships
  Only a small number of related documents are identified and linked
  Matching related documents is difficult and productivity is low

3
Purpose of Study

To investigate the design and implementation of a software component that helps domain experts discover related technical standards specifications and lessons learned, i.e., document matching
Techniques
  To model and capture knowledge in standards and lessons learned documents as metadata
  To develop metadata generation tools
  To use document metadata and index data to perform document matching

4
Research Questions and Hypothesis

How can a document be represented? What are the document elements?
  Title, author, subject, description, abstract, table of contents
  Reference: Dublin Core (DC)
Which standards and lessons learned document elements can be captured as metadata to be used for matching?
  Reference: DCMES (Dublin Core Metadata Element Set) recommendation
  Index and frequency of word occurrence
What algorithm should be used to return the best document match search results?
  Compare title / abstract / keywords / index
Can index words and their frequency of occurrence be an important lead to the subjects, especially if keywords are not specified?
  Assign scores to words according to location and frequency
  Title and subject keyword elements are weighted more heavily

5
Research Design

Document element and metadata modeling
  Identify document metadata elements
  Construct metadata and relation model
  Develop document metadata schema
Metadata and index generation and storage
  Develop form-based template to assist generation of metadata according to the metadata schema
  Use indexer to index documents and their individual metadata elements for matching
  Store metadata as XML files and index words in matching element index files
Metadata index matching, result display and relations update
  Derive algorithm for matching indexed words generated from document and metadata
  Find additional relevant document discovery strategies
  Develop reports and user interface for presentation of match result

6
Document Element and Metadata Modeling

Document metadata elements identification
  10 DCMES elements + 1 other DC-recommended element selected
  With DCMI refinements and encoding schemes
Metadata and relation model construction
  Concept behind the use of matching to find the relevant documents:
  if the metadata and indices of two documents are related, then the subject contents of the two documents are likely to be related (figure 1)
Document schema development
  XML metadata schema generated (figure 2)

7
Figure 1. Finding document relationships via metadata relationships

8
Figure 2. Extract of code for SA_MetaMatch Metadata XML Schema

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/">
  <xs:annotation>
    <xs:documentation xml:lang="en">
      Simple XML Schema for Metadata Model, 2003-01-20
    </xs:documentation>
  </xs:annotation>
  <xs:import namespace="http://www.w3.org/XML/1998/namespace"
      schemaLocation="http://www.w3.org/2001/03/xml.xsd"></xs:import>
  <xs:group name="metadata">
    <xs:sequence>
      <dc:title name="title" type="xs:string" />
      <xs:element name="creator" type="creatorType" ref="dc:creator" />
      <xs:element name="subject" type="subjectType" ref="dc:subject" />
      <xs:element name="description" type="descriptionType" ref="dc:description" />
      <dc:publisher name="publisher" type="xs:string" ref="dc:description" />
      <xs:element name="date" type="dateType" ref="dc:date" />
      <dc:format name="format" type="xs:string" />
      <dc:identifier name="identifier" type="xs:string" />
      <xs:element name="relation" type="relationType" ref="dc:relation" />

9
Metadata and Index Generation and Storage

Form-based template development
  SA_MetaMatch Generate / Edit Metadata Interface Screen for capturing metadata (figures 3 and 4)
  XML metadata generated (figure 5)
Metadata and document indexing
  Generate index with Swish-e indexer and Perl script (figure 6)
  Swish-e configured to remove stopwords, numbers, and words of length < 3
  Perl script extracts words and their absolute frequency, calculates relative frequency, filters words, and sorts by frequency (figures 7 and 8)
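
The post-processing step described above (absolute frequency, relative frequency, filtering, sorting) can be sketched as follows. This is a minimal Python illustration, not the Perl script used in the thesis; the function name, its parameters, and the example threshold are assumptions. Relative frequency is taken as a percentage of the total indexed word count, which is consistent with the values in Figure 8 (99 occurrences out of 5341 gives 1.85).

```python
def relative_frequencies(freqs, total=None, min_rel_freq=0.0):
    """Return (term, freq, rel_freq_percent) rows, highest frequency first.

    freqs        -- dict mapping index term -> absolute frequency
    total        -- total frequency of all indexed words (defaults to sum of freqs)
    min_rel_freq -- hypothetical threshold: rows below this percentage are dropped
    """
    if total is None:
        total = sum(freqs.values())
    rows = [(term, n, 100.0 * n / total) for term, n in freqs.items()]
    # Drop low-frequency terms so only the most important words remain.
    rows = [row for row in rows if row[2] >= min_rel_freq]
    # Sort by descending frequency, then alphabetically for stable output.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


# Three terms taken from Figure 8; 5341 is the total reported there.
sample = {"power": 94, "requires": 1, "ground": 99}
rows = relative_frequencies(sample, total=5341, min_rel_freq=0.2)
for term, freq, rel in rows:
    print(f"{term:<12}{freq:>5}{rel:>8.2f}")
```

With a threshold of 0.2, the single-occurrence term `requires` (about 0.02) is filtered out, while `ground` and `power` are kept.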

10
Metadata and Index Generation and Storage

Metadata and document indexing
  As only the most important (high word score) words are matched to find relevant documents, the lower-frequency words are filtered and removed from the indices.
  The minimum index percentage parameter is defined in the configuration file.
Metadata storage
  Preliminarily as XML files (figure 5)
  Can be stored in a metadata repository for better searching and management

11
Figure 3. SA_MetaMatch Generate / Edit Metadata Interface Screen
Figure 4. SA_MetaMatch Generate / Edit Metadata Interface Screen (cont'd)
12
Figure 5. Extract of Metadata Generated

<?xml version="1.0" ?>
<Metadata>
  <title>Electrical Grounding Architecture For Unmanned Spacecraft</title>
  <subject>
    <controlledKeyword>Spacecraft</controlledKeyword>
    <uncontrolledKeyword>Grounding architecture, Design, Rationale</uncontrolledKeyword>
    <classification scheme="NASA PTSD">4000</classification>
  </subject>
  <description>
    <abstract>This handbook describes spacecraft grounding architecture options at the system level. Implementation of good basis for understanding those choices and tradeoffs.</abstract>
  </description>
  <publisher>NASA</publisher>
  <date dateType="">February 17, 1998</date>
  <format>
    <extent>257K</extent>
    <medium>text/pdf</medium>
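
A record shaped like the Figure 5 extract can be read back into per-element word sets for matching. The sketch below uses Python's standard `xml.etree.ElementTree`; the sample record is a shortened, closed version of the extract, and the stopword list, tokenizer, and function names are illustrative assumptions rather than the SA_MetaMatch implementation. The length and number filters mirror the Swish-e settings mentioned on the indexing slide.

```python
import re
import xml.etree.ElementTree as ET

# Shortened, well-formed record modeled on Figure 5 (illustrative only).
SAMPLE = """<Metadata>
  <title>Electrical Grounding Architecture For Unmanned Spacecraft</title>
  <subject>
    <controlledKeyword>Spacecraft</controlledKeyword>
    <uncontrolledKeyword>Grounding architecture, Design, Rationale</uncontrolledKeyword>
  </subject>
  <description>
    <abstract>This handbook describes spacecraft grounding architecture options at the system level.</abstract>
  </description>
</Metadata>"""

STOPWORDS = {"the", "for", "this", "and"}  # hypothetical, deliberately tiny


def terms(text):
    """Lower-case tokens, dropping stopwords, pure numbers, and words shorter than 3."""
    words = re.findall(r"[a-z0-9][a-z0-9-]*", text.lower())
    return {w for w in words
            if len(w) >= 3 and not w.isdigit() and w not in STOPWORDS}


def element_terms(xml_text):
    """Map each matching element (title, subject, abstract) to its set of terms."""
    root = ET.fromstring(xml_text)
    paths = {"title": "title", "subject": "subject",
             "abstract": "description/abstract"}
    result = {}
    for name, path in paths.items():
        node = root.find(path)
        text = " ".join(node.itertext()) if node is not None else ""
        result[name] = terms(text)
    return result


record = element_terms(SAMPLE)
print(sorted(record["title"]))
```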

13
Figure 6. Index generation
14
Figure 7. Extract of Index File Generated From Swish-e Utility

  Index output for alphabetical sorted listing of the terms with location
  -----> WORD INFO in index NASA-HDBK-4001.idx <-----
  100m 1 1 1 (2348/1)
  100m-1000m 1 1 1 (2347/1)
  1553b 1 1 2 (2536/1 2539/1)
  3k-ohm 1 1 1 (4940/1)
  . . . omit for display
  adequacy 1 1 1 (1604/1)
  adjacent 1 1 1 (2131/1)
  adjusting 1 1 1 (3994/1)
  administration 1 1 1 (10/1)
  advantage 1 1 3 (3357/1 4520/1 4692/1)
  aeronautics 1 1 1 (5/1)
  aerospace 1 1 1 (879/1)
  . . . omit for display
  years 1 1 4 (106/1 2326/1 2332/1 3565/1)
  zero 1 1 4 (934/1 963/1 1610/1 1689/1)
  zero-potential 1 1 1 (1114/1)
  zero-to-low 1 1 1 (3896/1)

Figure 8. Extract of Index File Processed with Perl Script

  Doc No. NASA-HDBK-4001
  Index words with frequency >= 1 in descending order
  Term            Freq   Rel. Freq (%)
  ---------------------------------------
  ground            99   1.85
  power             94   1.76
  spacecraft        86   1.61
  grounding         81   1.52
  chassis           77   1.44
  isolated          74   1.39
  isolation         74   1.39
  system            62   1.16
  interface         36   0.67
  reference         35   0.66
  architecture      33   0.62
  handbook          33   0.62
  design            32   0.60
  . . . omit for display
  mil-b-5087 b       1   0.02
  difficulties       1   0.02
  requires           1   0.02
  ---------------------------------------
  1449 non blank lines
  Total freq of indexed word: 5341
15
Metadata Index Matching and Result Display

Weighting scheme
  Weights are assigned to different metadata elements according to the importance of the elements
3 different matching algorithms
  Comparison of index terms between the same elements
  Comparison of index terms between alternative elements, in addition to the same-element match
  Matching word score and sum of word scores for comparison

16
Matching Algorithms

Matching Algorithm 1
  Compare index terms between same element
    for each element in matching list
      score += element_weight * element_match
      (i.e. element_match = match_word / target_element_word)
  E.g. percentage relevant score = (4 * (% match in title)
                                  + 3 * (% match in subject)
                                  + 2 * (% match in scope / abstract)
                                  + 1 * (% match in index))
                                  / (4 + 3 + 2 + 1) * 100
  for title weight = 4, subject weight = 3, abstract weight = 2, doc index = 1
Matching Algorithm 2
  Compare index terms between same and alternative elements
  Alternative element match with discounted weight
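
Matching Algorithm 1 can be sketched in a few lines. This is an illustrative Python version, not the thesis code: documents are assumed to be dictionaries mapping an element name to a set of index terms, the function name is hypothetical, and the weights are the example values from the slide (title 4, subject 3, abstract 2, index 1).

```python
WEIGHTS = {"title": 4, "subject": 3, "abstract": 2, "index": 1}


def algorithm1_score(target, candidate, weights=WEIGHTS):
    """Percentage relevance from same-element term overlap, weighted per element."""
    score = 0.0
    for element, weight in weights.items():
        target_words = target.get(element, set())
        if not target_words:
            continue  # an empty target element contributes nothing
        matched = target_words & candidate.get(element, set())
        # element_match = match_word / target_element_word
        score += weight * len(matched) / len(target_words)
    return 100.0 * score / sum(weights.values())


target = {
    "title": {"dynamic", "criteria"},
    "subject": {"vibration"},
    "abstract": {"loads", "levels"},
    "index": {"test", "shock", "loads", "levels"},
}
candidate = {
    "title": {"dynamic", "sequencing"},
    "subject": {"vibration"},
    "abstract": {"thermal"},
    "index": {"test", "thermal"},
}
print(algorithm1_score(target, candidate))
```

Here the title matches 1 of 2 words, the subject 1 of 1, the abstract 0 of 2, and the index 1 of 4, so the weighted sum is 2 + 3 + 0 + 0.25 = 5.25 out of 10.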

17
Matching Algorithms

Matching Algorithm 2
  if (all words in a target element are found matching in the same comparing element)
    score += element_weight (100% match)
    no need to compare with other elements
  else
    element_match = match_word / target_element_word
    score += element_weight * element_match
    compare the unmatched words with other elements, starting from the closest weighted one, until a matching word is found in an element or none is found in any element
    for each alternative match element alt_element, calculate the additional score add_score for matching words found in that element
      wt_diff = abs(target_element_weight - match_element_weight)
      wt_diff_ratio = min((wt_diff / target_element_weight), 1)
      alt_element_weight = weight * (1 - wt_diff_ratio)
      alt_element_match = match_word_with_alt_element / target_element_word
      add_score = alt_element_weight * alt_element_match
      score += add_score
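
The alternative-element step of Matching Algorithm 2 is easier to follow in code. The sketch below is a hedged Python reading of the pseudocode, with several assumptions: `weight` in the discount formula is taken to be the target element's weight, each unmatched word is credited once to the closest-weighted alternative element that contains it, and the raw (unnormalized) score is returned. Names and data layout are illustrative.

```python
WEIGHTS = {"title": 4, "subject": 3, "abstract": 2, "index": 1}


def algorithm2_score(target, candidate, weights=WEIGHTS):
    """Same-element match plus discounted credit for matches in other elements."""
    score = 0.0
    for element, weight in weights.items():
        target_words = target.get(element, set())
        if not target_words:
            continue
        matched = target_words & candidate.get(element, set())
        score += weight * len(matched) / len(target_words)
        unmatched = target_words - matched
        if not unmatched:
            continue  # 100% match: no need to look at other elements
        # Try alternative elements, closest in weight first.
        alternatives = sorted((e for e in weights if e != element),
                              key=lambda e: abs(weight - weights[e]))
        for alt in alternatives:
            found = unmatched & candidate.get(alt, set())
            if not found:
                continue
            wt_diff = abs(weight - weights[alt])
            wt_diff_ratio = min(wt_diff / weight, 1)
            alt_weight = weight * (1 - wt_diff_ratio)
            score += alt_weight * len(found) / len(target_words)
            unmatched -= found  # assumption: credit each word only once
            if not unmatched:
                break
    return score


target = {"title": {"grounding", "architecture"}}
candidate = {"title": {"grounding"}, "subject": {"architecture"}}
print(algorithm2_score(target, candidate))
```

In this example the title matches half its words directly (4 * 0.5 = 2.0), and the remaining word is found in the subject element, whose weight differs by 1, giving a discounted weight of 4 * (1 - 1/4) = 3 and an additional 3 * 0.5 = 1.5.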

18
Matching Algorithms

Matching Algorithm 3
  Calculate a score for each individual matching word and derive relevance from the sum of word scores
  Word list of the target document (all the words listed) and matched document (only matching words)
  A score is calculated for each word according to its frequency and location (i.e. type of element)
    for each indexed element
      // "word" refers to all words for the target doc, and only to those words
      // in the searched doc found to match words in the target doc
      word score = word freq in that element * element_weight
    document score = sum of (word score)
    relevance score of matching document = matching word score
        / (target word score + matching word score) * 100
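
A small Python sketch of Matching Algorithm 3 follows. It is an illustration under stated assumptions, not the thesis implementation: documents are dictionaries mapping element name to a word-frequency dictionary, a candidate word counts as matching if it appears anywhere in the target document, the denominator is read as the sum of the target and matching word scores, and the weights are the 50 : 40 : 30 : 1 ratio reported in the results slides.

```python
WEIGHTS = {"title": 50, "subject": 40, "abstract": 30, "index": 1}


def word_scores(doc, weights=WEIGHTS, only_words=None):
    """Score each word as freq * element_weight, summed over the elements it occurs in."""
    scores = {}
    for element, freqs in doc.items():
        weight = weights.get(element, 0)
        for word, freq in freqs.items():
            if only_words is not None and word not in only_words:
                continue  # for the searched doc, keep matching words only
            scores[word] = scores.get(word, 0) + freq * weight
    return scores


def relevance(target, candidate, weights=WEIGHTS):
    """Relevance (%) = matching score / (target score + matching score) * 100."""
    target_scores = word_scores(target, weights)
    match_scores = word_scores(candidate, weights, only_words=set(target_scores))
    t = sum(target_scores.values())
    m = sum(match_scores.values())
    return 100.0 * m / (t + m) if (t + m) else 0.0


target = {"title": {"dynamic": 1, "environmental": 1}, "index": {"test": 5}}
candidate = {"title": {"environmental": 1, "test": 1},
             "index": {"test": 3, "sequencing": 2}}
print(word_scores(target))
print(round(relevance(target, candidate), 2))
```

Keeping the per-word scores (rather than only the total) is what lets a domain expert see which words drove a match, as the word summary screens do.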

19
Additional Relevant Document Discovery Strategy

Recognize special pattern words employed
  E.g. NASA-STD-7002 indicates a relation with another NASA standard document, according to the naming conventions of SDOs (see figure 9)
Other strategies studied
  Usage of a word in parts of speech (verbs, nouns, ...)
  Remove irrelevant words selectively
  Apply semantically related terms in an ontology, including synonyms in a thesaurus, or broader or narrower terms in a concept hierarchy (taxonomy)
  Apply other ontology relationships, including homonym, antonym, etc.
  Apply more ontology (domain model) processing of related terms and constraints
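
Recognizing special pattern words such as NASA-STD-7002 can be approximated with a regular expression. The pattern below is an assumption made for illustration (upper-case prefix, one or more upper-case segments, then a number of at least three digits with an optional revision letter); it does not reproduce the actual SDO naming rules or the pattern used in SA_MetaMatch.

```python
import re

# Hypothetical pattern for identifiers like NASA-STD-7002 or MIL-B-5087B.
DOC_ID = re.compile(r"\b[A-Z]{2,}(?:-[A-Z]+)+-\d{3,}[A-Z]?\b")


def find_document_ids(text):
    """Return standards-style document identifiers in order of first appearance."""
    seen = []
    for match in DOC_ID.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


text = ("Test levels follow NASA-STD-7002; bonding is covered in "
        "NASA-HDBK-4001 and MIL-B-5087B. See NASA-STD-7002 again.")
ids = find_document_ids(text)
print(ids)
```

Each identifier found in a target document can then be treated as a candidate link to the document it names.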

20
Reports and User Interface for Match Result Display

Result for matching was initially generated as a text report
  Relevant document list sorted by relevance score
  Includes the target document ID, a list of the matching document IDs, the target words and their locations, and the matching words and their locations
  For the third algorithm, the score for each term is added
Screens for the third algorithm have been developed
  to present the result data in a more interactive and more readable way

21
Reports and User Interface for Match Result Display
Figure 9. Result Display for Matching
22
Figure 10. Word Summary Interface Screen for
Target Document (Standards document titled
Dynamic Environmental Criteria)
Figure 11. Word Summary Interface Screen for
Matched Document (Lesson Learned document titled
Environmental Test Sequencing)
23
Results and Findings

Document matching elements
  DCMES recommendations provided a very useful reference for selecting document metadata elements for the SA_MetaMatch schema
  Matching index words from the selected descriptive metadata elements (title, subject, and abstract from description) was found to be useful in identifying relevant documents
  Index words from document content alone were also found, most of the time, to be an important lead to the document subjects
  When the metadata index and the document content index were combined, they were more powerful and gave better results in identifying relevant documents

24
Results and Findings

Different algorithms applied
  Second and third algorithms were close in performance, and better than the first algorithm
  Third algorithm is preferred
    It provides an individual word score, which helps the domain expert assess each document's relevancy
Weight of element
  Ratio of title : subject : abstract : index = 50 : 40 : 30 : 1 for the third algorithm gave good results
Relevant documents found
  SA_MetaMatch was found to be helpful for the domain expert in finding relevant documents (table 1)

25
Results and Findings

Relevant documents found

Table 1. Number of Previously Known and
Newly-Discovered Relevant Links in the Collection
of LLIS Documents For Each Target Standard
Document
26
Results and Findings

Distribution of relevant document ranks

Table 2. Distribution of Ranks for the Known Relevant Documents to Each of the Target Standard Documents With 25% Word Filtering
27
Results and Findings

Distribution of relevant document ranks

Table 3. Distribution of Ranks for the Known Relevant Documents to Each of the Target Standard Documents With 50% Word Filtering
28
Results and Findings

Distribution of relevant document ranks
  On average, about 70% of the known relevant documents rank in the top 25th percentile, and over 90% rank in the top 50th percentile, in both word-filtering configurations
Filtering
  Good performance was found when
    the minimum relative frequency for index words was set at 0.2%
    the low-score word filtering percentage was set at 25-50%, depending on the length of the document
Selective removal of irrelevant words
  The average share of known relevant documents ranking in the top 25th percentile increased to about 80%, and that in the top 50th percentile to about 95%
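
The percentile figures above summarize where known relevant documents land in each ranked result list. A minimal sketch of that calculation is shown below; the function name, the document IDs, and the choice to round the cutoff up are assumptions for illustration, and the numbers are not data from the study.

```python
import math


def share_in_top_percentile(ranked_ids, known_relevant, percentile):
    """Percentage of known relevant documents ranked within the top `percentile` percent."""
    if not ranked_ids or not known_relevant:
        return 0.0
    # Assumption: the cutoff position is rounded up to a whole document.
    cutoff = math.ceil(len(ranked_ids) * percentile / 100)
    top = set(ranked_ids[:cutoff])
    hits = sum(1 for doc_id in known_relevant if doc_id in top)
    return 100.0 * hits / len(known_relevant)


# Hypothetical ranked result list (best match first) and known relevant links.
ranked = ["LL-01", "LL-02", "LL-03", "LL-04", "LL-05", "LL-06", "LL-07", "LL-08"]
known = {"LL-01", "LL-02", "LL-04", "LL-07"}
print(share_in_top_percentile(ranked, known, 25))
print(share_in_top_percentile(ranked, known, 50))
```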

29
Results and Findings

Selective removal of irrelevant words

Table 4. Distribution of Ranks for the Known
Relevant Documents to Each of the Target Standard
Documents With Selective Word Removal
30
Results and Findings

Use of special pattern words
  The special pattern word recognition strategy (for example, recognizing NASA-STD-5003) was found to be useful in discovering relevant standards specifications mentioned in the target document.

31
Conclusions

Metadata
  Dublin Core Metadata Element Set (DCMES) recommendations, DCMI recommendations, and the DCMI recommended guidelines provided very useful references for the XML metadata schema modeling and construction
Index
  Indices for metadata and document content have been found useful in finding relevant documents
SA_MetaMatch prototype
  Useful to NASA
  Has improved the productivity of the document matching and linking process

32
Recommendations for Further Study

Apply stemming
  Many errors were found in the Porter Stemmer algorithm
Implement word-phrase index terms
Analyze usage of a word and natural language processing
  Usage of a word in parts of speech (verbs, nouns, adjectives, or others)
  Usage of a word in heading, title, figure caption, etc.
Apply related terms and constraints
  Expand or refine index terms by applying semantically related terms in a domain ontology
  Include synonyms in a thesaurus, or broader or narrower terms in a concept hierarchy (taxonomy)

33
Recommendations for Further Study

Other ontology relationships
  Including homonym, antonym, etc.
  Can be applied to distinguish words used in different contexts and to eliminate unrelated documents
Future metadata and ontology language
  RDF, DAML+OIL, OWL, or another ontology language
Learn from query
  Search query feedback

34
The End