Lexical and Cooccurrence Evidence for Subject Vocabulary Reconciliation in ADS Databases

1
Lexical and Co-occurrence Evidence for Subject Vocabulary Reconciliation in ADS Databases
Jonghoon Lee, University of Illinois
Michael Kurtz, Harvard-Smithsonian Center for Astrophysics
David Dubin, University of Illinois
2
PRESENTATION OUTLINE
  • INTRODUCTION
  • HETEROGENEOUS SUBJECT INDEXING
  • LEXICAL MATCHING APPROACH
  • SPREADING ACTIVATION MODEL
  • EXPERIMENT
  • PRELIMINARY RESULTS
  • FUTURE STUDY

3
INTRODUCTION
  • NASA ADS DATABASES
  • Many Astronomers use ADS Abstract Service
  • Search options: title, author, object name,
    text, ...
  • No keyword-only searching capability
  • Subject descriptors (keywords): central concepts
  • Problems
  • ADS uses several different indexing languages
    (STI, ApJ, etc.)
  • Searching one form of a concept will miss
    documents
  • (M31 vs. Andromeda)
  • Ways to merge or reconcile these inconsistent
    indexing languages

4
Heterogeneous Subject Indexing
  • Subject descriptors
  • - controlled vocabulary (keywords/index terms)
  • - standardized labels of concepts
  • - central important concepts in a document
  • - improves information retrieval
  • Indexing inconsistency
  • - different indexing schemes applied
  • - term-specificity, precoordination, ...
  • - different terms used to index the same concept
  • - journal-specific indexing vocabulary

5
Heterogeneous Subject Indexing
  • Inconsistency of Indexing in ADS
  • - STI (Scientific and Technical Information)
    index terms from the NASA Thesaurus (1975 to 1995)
  • - Journal-specific subject descriptors:
    a variety of indexing vocabularies since 1995,
    e.g. author-assigned keywords in ApJ
  • Examples
    STI                ApJ
    LOW MASS STARS     STARS LOW-MASS
    DWARF NOVAE        STARS DWARF NOVAE
    ANDROMEDA GALAXY   GALAXIES INDIVIDUAL MESSIER NUMBER M31
    ELECTRIC FIELDS    ???
    ???                HIGH-LATITUDE OBJECTS
    COSMOLOGY          COSMOLOGY
                       COSMOLOGY EARLY UNIVERSE
                       COSMOLOGY THEORY
                       COSMOLOGY OBSERVATIONS

6
VOCABULARY MERGING
  • How to reconcile different controlled
    vocabularies
  • Vocabulary Switching System based upon Term
    Mapping
  • Issues
  • - how to identify term relationship
  • - automatic vs. manual
  • Sources of Evidence
  • - lexical resemblance: spelling variants
  • - syndetic structure: thesaurus
  • - co-occurrence data: consistent assignment of
    descriptors to the same document

7
LEXICAL MATCHING APPROACH
  • Clustering Method based on Lexical Similarity
  • Term signature: galaxies individual m31 ->
    31galaxym
  • Can merge terms like the following
    galaxies individual (m 31)   1
    galaxies individual(m 31)    1
    galaxies m 31                5
    galaxies individual m 31     9
  • Results were presented at LISAIII in April
  • Simple and computationally inexpensive
  • But, lexical similarity is not enough
  • Need to combine with other evidence
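
The signature idea on this slide can be sketched in a few lines of Python. The slide gives only one example (galaxies individual m31 -> 31galaxym), so the normalization rules used here are assumptions chosen to reproduce that example: lowercasing, splitting letters from digits, reducing "-ies" plurals, dropping the generic word "individual", and sorting the remaining tokens. The names `signature`, `_stem`, and `GENERIC_WORDS` are hypothetical.

```python
import re

# Assumption: words treated as too generic to distinguish terms.
GENERIC_WORDS = {"individual"}


def _stem(token):
    """Very rough plural reduction (assumption): 'galaxies' -> 'galaxy'."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"
    return token


def signature(term):
    """Return an order- and punctuation-insensitive key for an index term."""
    # Split into runs of letters and runs of digits, ignoring punctuation.
    tokens = re.findall(r"[a-z]+|[0-9]+", term.lower())
    kept = {_stem(t) for t in tokens if t not in GENERIC_WORDS}
    # Sorting makes the key independent of word order.
    return "".join(sorted(kept))


variants = [
    "galaxies individual (m 31)",
    "galaxies individual(m 31)",
    "galaxies m 31",
    "galaxies individual m 31",
]
signatures = {signature(v) for v in variants}
```

Under these assumptions all four variants listed above collapse to a single signature, so they would fall into the same cluster.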

8
SPREADING ACTIVATION MODEL
  • Connectionist Approach
  • A Model of Human Associative Memory
  • A Spreading Activation Model based on
    Co-occurrence Data
  • Combine the networks of two databases through the
    commonly indexed documents
  • Terms from different vocabularies (e.g., STI and
    ApJ) consistently assigned to the same documents
    are identified.
  • Network Representation (3-layer)
  • - input term layer, document layer (hidden),
    output term layer
  • Activation Process
  • - feed-forward network
  • - activation is spread from one term layer to
    the other.

9
Information Retrieval 2-Layer Network
[Figure 1. Network Representation of Information Retrieval: term layer (T1, T2, ..., Tm), document layer (D1, D2, D3, ..., Dn), link weights w11, w12, ..., wij]
[Figure 2. Spreading Activation Process during Information Retrieval: term layer (T1, T2, ..., Tm), document layer (D1, D2, D3, ..., Dn)]
  • Activation Rule
  • $D_j = \sum_{i=1}^{m} T_i \, w_{ij}$
  • Ti: activation of term i
  • Dj: activation of document j
  • wij: weight between term i and document j
10
Vocabulary Merging 3-Layer Network
[Figure 5. Network representation of vocabulary merging: input term layer (T1(I), T2(I), ..., Tl(I)), document layer (D1, D2, D3, ..., Dm), output term layer (T1(O), T2(O), ..., Tn(O))]
  • Activation Rule
  • $D_j = \sum_{i=1}^{l} T_i^{(I)} \, w_{ij}$
  • $T_k^{(O)} = \sum_{j=1}^{m} D_j \, w_{jk}$
  • Ti(I): activation of an input term i
  • Dj: activation of a document j
  • Tk(O): activation of an output term k
  • wij: weight between an input term i and a
    document j
  • wjk: weight between an output term k and a
    document j

11
Weighting Scheme
  • Binary Indexing System
  • - $w_{ij} = 1$ if document j is indexed by term i,
    0 otherwise
  • Conservation of Activation
  • - Total activation is divided up in amounts
    proportional to the link weights
  • - $w_{ij} \leftarrow w_{ij} / \sum_{i=1}^{m} w_{ij}$
  • - Σ input activation = Σ output activation
  • Simple Weighting Rule
  • - $w_{ij} = 1 / N_i$ (where $N_i$ = the number of
    links from input term i to the document nodes)
  • - $w_{jk} = 1 / N_j$ (where $N_j$ = the number of
    links from document j to the output term nodes)
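
A minimal Python sketch of one feed-forward pass under the simple weighting rule may help make the bookkeeping concrete. It assumes binary indexing and a toy pair of indexes over co-indexed documents. The function name `spread_activation` and the dictionary layout are hypothetical, not taken from the presentation.

```python
from collections import defaultdict


def spread_activation(source_terms, input_index, output_index):
    """Spread activation input terms -> documents -> output terms.

    source_terms: {input term: initial activation}
    input_index / output_index: {document id: set of terms} for the
    two vocabularies (hypothetical layout). Uses w_ij = 1/N_i and
    w_jk = 1/N_j.
    """
    # Invert the input index, keeping only co-indexed documents.
    docs_by_term = defaultdict(set)
    for doc, terms in input_index.items():
        if doc in output_index:
            for term in terms:
                docs_by_term[term].add(doc)

    # Term layer -> document layer: each term splits its activation
    # equally over its N_i linked documents.
    doc_activation = defaultdict(float)
    for term, activation in source_terms.items():
        docs = docs_by_term.get(term, ())
        for doc in docs:
            doc_activation[doc] += activation / len(docs)

    # Document layer -> output term layer: each document splits its
    # activation equally over its N_j output terms.
    term_activation = defaultdict(float)
    for doc, activation in doc_activation.items():
        out_terms = output_index[doc]
        for term in out_terms:
            term_activation[term] += activation / len(out_terms)

    # Rank output terms by activation level, highest first.
    return sorted(term_activation.items(), key=lambda kv: -kv[1])


# Toy data: input term "A" indexes two documents.
toy_input = {"d1": {"A"}, "d2": {"A"}}
toy_output = {"d1": {"x", "y", "z"}, "d2": {"x"}}
ranked = dict(spread_activation({"A": 1.0}, toy_input, toy_output))
```

In the toy data, "x" collects 0.5/3 from d1 and 0.5 from d2, and the output activations sum to the input activation of 1.0, which is the conservation property stated above.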

12
Weighting Scheme (continued)
  • Bi-Directionality
  • - $w_{ij} \neq w_{ji}$
  • - e.g. $w_{ij} = 0.33$, $w_{ji} = 1.0$
  • Probability Interpretation
[Figure: link weights wij and wji]
13
Weighting Scheme (continued)
  • Computation of Activation

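  • $T_i^{(I)} = 1.0$
  • $w_{ij} = 1 / N_i = 0.5$
  • $D_j = \sum_{i=1}^{l} T_i^{(I)} \, w_{ij} = 0.5$
  • $w_{jk} = 1 / N_j = 0.33$
  • $T_k^{(O)} = \sum_{j=1}^{m} D_j \, w_{jk} = 0.5 \times 0.33 = 0.165$
[Figure: activation path Ti(I) -> wij -> Dj -> wjk -> Tk(O)]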
14
OUTPUT THRESHOLD FUNCTION
  • Output of spreading activation
  • - terms in the target vocabulary ranked by
    activation level
  • - terms above a cutoff value are selected
  • Cut-off criterion
  • - Mexican-Hat function (second-derivative of the
    Gaussian)
  • - to find the deepest drop in the slope of
    activation levels
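
The cutoff step can be illustrated with a deliberately simplified stand-in. The sketch below does not implement the Mexican-hat filter named on the slide; it only cuts at the largest drop between consecutive ranked activation levels, which is an assumption about the intended effect. The function name `cutoff_by_largest_drop` is hypothetical. The activation values are the leading entries of the two sample outputs shown later in this presentation.

```python
def cutoff_by_largest_drop(activations):
    """Return how many top-ranked terms lie above the largest drop.

    Simplified stand-in (assumption): the slide's criterion uses a
    Mexican-hat function; here we only look for the biggest gap
    between neighbouring activation levels.
    """
    ranked = sorted(activations, reverse=True)
    if len(ranked) < 2:
        return len(ranked)
    drops = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    # Cut immediately after the position with the largest drop.
    return drops.index(max(drops)) + 1


# Leading activation levels for input term ANDROMEDA GALAXY (STI -> ApJ).
andromeda = [0.171, 0.057, 0.035, 0.031, 0.028, 0.026, 0.023, 0.019]
# Leading activation levels for input term ISM ABUNDANCES (ApJ -> STI).
ism_abundances = [0.059, 0.057, 0.028, 0.023, 0.021, 0.017]

n_andromeda = cutoff_by_largest_drop(andromeda)
n_ism = cutoff_by_largest_drop(ism_abundances)
```

On these two lists the stand-in selects one and two terms respectively, which agrees with the "number of terms above the cutoff point" reported in the sample outputs; that agreement does not show the two criteria coincide in general.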

15
Experiment
  • Material I
  • - Two indexing languages from ADS database (1983
    - 1998)
  • - STI (Scientific and Technical Information)
    terms
  • - ApJ (Astrophysical Journal) terms
  • Two types of merging: ApJ -> STI, STI -> ApJ
  • Database statistics
                            STI      ApJ
    # of documents          39,366   22,139
    # of indexing terms     10,200   3,335
    avg. # of postings      163      25
    avg. # of terms         10.2     3.7
                            STI-M    ApJ-M
    # of co-indexed doc.    14,956   14,956
    # of indexing terms     4,120    2,305
    avg. # of postings      34       23
    avg. # of terms         9.6      3.5

16
Experiment
  • Material II
  • - Two indexing languages from ADS database
  • - Consistent and small (1998)
  • - AJ (Astronomical Journal) terms
  • - ApJ terms
  • Two types of merging: AJ -> ApJ, ApJ -> AJ
  • Database statistics
                            AJ     ApJ
    # of documents          221    221
    # of indexing terms     259    276
    avg. # of postings      2.8    2.5
    avg. # of terms         3.3    3.2

17
RESULTS
  • 1. TERM RELATIONSHIP
  • Term mappings identified by spreading activation
    model
  • GALAXIES THE GALAXY -> MILKY WAY GALAXY
  • GALAXIES ISM -> INTERSTELLAR MATTER
  • ISM ABUNDANCES -> ABUNDANCE,
    INTERSTELLAR MATTER
  • ANDROMEDA GALAXY -> GALAXIES INDIVIDUAL MESSIER
    NUMBER M31

18
  • Types of Term Mapping
  • Exact match
  • - ASTROMETRY -> ASTROMETRY
  • Different word order
  • - CLUSTERS GLOBULAR -> GLOBULAR CLUSTERS
  • Spelling variants
  • - GALACTIC CLUSTERS -> GALAXIES CLUSTERING
  • Pre-coordinated term to its components
  • - COSMIC RAYS ABUNDANCES -> ABUNDANCE,
    COSMIC RAYS
  • Term omission
  • - COSMOLOGY DARK MATTER -> DARK MATTER
  • - CATALOGS -> ASTRONOMICAL CATALOGS
  • Semantic factoring
  • - COSMOLOGY THEORY -> COSMOLOGY,
    ASTRONOMICAL MODELS

19
A Sample Output (STI -> ApJ)
  • INPUT TERM (STI): ANDROMEDA GALAXY
  • STATISTICS
  • - Number of co-indexed document(s): 92
  • - Number of active terms (a ≥ 0.01): 171 (23)
  • - Number of terms above the cutoff point: 1
  • OUTPUT TERM (ApJ)
    GALAXIES INDIVIDUAL MESSIER NUMBER M31   0.171
    GALAXIES STELLAR CONTENT                 0.057
    GALAXIES NUCLEI                          0.035
    GALAXIES LOCAL GROUP                     0.031
    GALAXIES DISTANCES                       0.028
    STARS NOVAE                              0.026
    GALAXIES STRUCTURE                       0.023
    CLUSTERS GLOBULAR                        0.019
    GALAXIES INTERNAL MOTIONS                0.019
    GALAXIES PHOTOMETRY                      0.019
    GALAXIES INTERSTELLAR MATTER             0.018
    GAMMA RAYS BURSTS                        0.018
    GALAXIES MAGELLANIC CLOUDS               0.017
    INTERSTELLAR MOLECULES                   0.016
    STARS ABUNDANCES                         0.016
    COSMOLOGY                                0.014
    BLACK HOLES                              0.013
    NEBULAE PLANETARY                        0.013
    STARS NEUTRON                            0.012
    GALAXIES REDSHIFTS                       0.012
    GALAXIES ISM                             0.011
20
INPUT TERM (STI): ANDROMEDA GALAXY
21
A Sample Output (ApJ -> STI)
  • INPUT TERM (APJ): ISM ABUNDANCES
  • STATISTICS
  • - Number of co-indexed document(s): 177
  • - Number of active terms (a ≥ 0.01): 500 (15)
  • - Number of terms above the cutoff point: 2
  • OUTPUT TERM (STI)
    ABUNDANCE                   0.059
    INTERSTELLAR MATTER         0.057
    MOLECULAR CLOUDS            0.028
    INTERSTELLAR GAS            0.023
    ABSORPTION SPECTRA          0.021
    LINE SPECTRA                0.017
    ASTRONOMICAL SPECTROSCOPY   0.016
    H II REGIONS                0.015
    MILKY WAY GALAXY            0.014
    ULTRAVIOLET ASTRONOMY       0.013
    EMISSION SPECTRA            0.012
    STAR FORMATION              0.011
    ASTRONOMICAL MODELS         0.011
    SPECTRUM ANALYSIS           0.010
    ULTRAVIOLET SPECTRA         0.010
22
INPUT TERM (APJ): ISM ABUNDANCES
23
CUTOFF DATA ANALYSIS
  • STI -> APJ
  • - TOTAL STI TERMS: 900
  • - AVERAGE CUTOFF POINT: 1.90
  • - AVERAGE ACTIVATION: 0.17
  • - AVERAGE NUMBER OF ACTIVATED TERMS: 134.78
  • APJ -> STI
  • - TOTAL APJ TERMS: 338
  • - AVERAGE CUTOFF POINT: 1.62
  • - AVERAGE ACTIVATION: 0.10
  • - AVERAGE NUMBER OF ACTIVATED TERMS: 353.93

24
ACTIVATION LEVEL BY NUMBER OF DOCUMENTS
  • ApJ -> STI
    NUMBER OF   DOC    ACT    AVG     ACT    CUT    AVG
    DOCUMENTS   FREQ   TERM   ACT     LEV    OFF    CUT
    - 5         1737   13     0.073   0.45   4.32   0.10
    5- 10       154    47     0.021   0.16   2.20   0.07
    10- 20      76     82     0.012   0.12   1.83   0.07
    20- 50      104    155    0.006   0.10   1.84   0.06
    50-100      82     272    0.004   0.09   1.54   0.06
    100-200     72     402    0.002   0.09   1.50   0.06
    200-        80     652    0.002   0.10   1.55   0.06

25
RESULTS
  • 2. TERM RANKING PREDICTION
  • - Document 1983ApJ...265..760W
  • - Input terms (ApJ)
  • GALAXIES THE GALAXY
  • - Target term list (STI)
  • GALACTIC NUCLEI
  • INFRARED ASTRONOMY
  • INTERSTELLAR EXTINCTION
  • INTERSTELLAR GAS
  • MILKY WAY GALAXY
  • - Output terms (STI)
  • MILKY WAY GALAXY 0.095
  • GALACTIC NUCLEI 0.041
  • GALACTIC STRUCTURE 0.030
  • INTERSTELLAR MATTER 0.026
  • MOLECULAR CLOUDS 0.022

26
  • INPUT: STI, OUTPUT: ApJ
    DOCU   SOURCE   TARGET   ACTIVE   SHARE   ACTIV   RANKINGS FOR THE
    _ID    TERM     TERM     TERM     DOC     LEVEL   ACTIVE TARGET TERMS
    1      5        1        947      2601    0.02    ( 3 )
    2      7        1        854      2188    0.12    ( 1 )
    3      10       2        646      2578    0.09    ( 1 5 )
    4      7        1        978      4454    0.03    ( 1 )
    5      7        2        609      2386    0.07    ( 3 6 )
    6      7        2        651      1710    0.35    ( 1 2 )
    7      9        4        937      4449    0.22    ( 1 2 3 11 )
    8      7        2        829      3648    0.08    ( 1 7 )
    9      9        4        939      4702    0.20    ( 1 2 3 9 )
    10     8        2        775      3460    0.13    ( 1 3 )
  • INPUT: ApJ, OUTPUT: STI
    DOCU   SOURCE   TARGET   ACTIVE   SHARE   ACTIV   RANKINGS FOR THE
    _ID    TERM     TERM     TERM     DOC     LEVEL   ACTIVE TARGET TERMS
    1      1        5        456      201     0.16    (1 2 6 37 146)
    2      1        7        563      265     0.12    (1 6 25 28 36 120 275)
    3      2        10       897      711     0.14    (1 6 9 14 17 96 102 143
    4      1        7        181      46      0.17    (1 5 7 14 25 62 62)

27
TERM RANKING (AJ vs. ApJ)
  • AJ -> ApJ
    DOCU   SOURCE   TARGET   ACTIVE   SHARE   ACTIV   RANKINGS FOR THE
    _ID    TERM     TERM     TERM     DOC     LEVEL   ACTIVE TARGET TERMS
    1      2        2        14       8       0.53    ( 1 2 )
    2      3        3        19       10      0.65    ( 1 2 3 )
    3      4        3        31       17      0.42    ( 1 2 3 )
    4      3        3        18       11      0.73    ( 1 2 3 )
    5      2        2        15       7       0.71    ( 1 2 )
    6      2        2        25       14      0.45    ( 1 2 )
    7      2        2        32       14      0.22    ( 1 3 )
    8      5        6        49       23      0.64    ( 1 2 3 4 4 4 )
    9      3        3        38       15      0.41    ( 1 2 3 )
    10     2        2        5        2       0.75    ( 1 1 )
  • ApJ -> AJ
    DOCU   SOURCE   TARGET   ACTIVE   SHARE   ACTIV   RANKINGS FOR THE
    _ID    TERM     TERM     TERM     DOC     LEVEL   ACTIVE TARGET TERMS
    1      2        2        12       7       0.53    ( 1 2 )
    2      3        3        18       9       0.66    ( 1 2 3 )

28
EFFECT OF LEAVING OUT THE GIVEN DOCUMENT
  • 1. including the given document in the network
    DOCU   SOURCE   TARGET   ACTIVE   SHARE   ACTIV   RANKINGS FOR THE
    _ID    TERM     TERM     TERM     DOC     LEVEL   ACTIVE TARGET TERMS
    1      1        5        456      201     0.16    ( 1 2 6 37 146 )
    2      1        7        563      265     0.12    ( 1 6 25 28 36 120 275 )
    3      2        10       897      711     0.14    ( 1 6 9 14 17 96 102 143 158 219 )
    4      1        7        181      46      0.17    ( 1 5 7 14 25 62 62 )
    5      2        7        586      205     0.12    ( 1 6 11 12 13 22 97 )
    6      2        7        779      430     0.14    ( 1 2 23 51 55 175 233 )
    7      4        9        1467     1875    0.18    ( 1 2 3 6 9 17 68 98 115 )
    8      2        7        987      688     0.16    ( 1 2 3 5 17 108 336 )
    9      4        9        1527     2088    0.19    ( 1 2 3 7 9 12 17 72 190 )
    10     2        8        1093     1165    0.15    ( 1 2 4 5 79 93 163 210 )
  • 2. leaving the given document out of the network
    DOCU   SOURCE   TARGET   ACTIVE   SHARE   ACTIV   RANKINGS FOR THE
    _ID    TERM     TERM     TERM     DOC     LEVEL   ACTIVE TARGET TERMS

29
Ranking: 1 2 3 4 5 6 7 8 9
  • DOCUMENT ID 135: 1986ApJ...301..240D, 325
    co-indexed document(s)
  • TERM LIST FROM THE SOURCE VOC. (APJ)
  • - STARS ECLIPSING BINARIES
  • - STARS INDIVIDUAL ALPHANUMERIC KPD 1911+1212
  • - STARS INDIVIDUAL CONSTELLATION NAME
    V1315 AQUILAE
  • - STARS NOVAE
  • TERM LIST FROM THE TARGET VOC. (STI)
  • - ACCRETION DISKS
  • - ASTRONOMICAL PHOTOMETRY
  • - BALMER SERIES
  • - CATACLYSMIC VARIABLES
  • - ECLIPSING BINARY STARS
  • - EMISSION SPECTRA
  • - NOVAE
  • - STELLAR PHYSICS
  • - STELLAR SPECTRA
  • ACTIVE TERMS: 636
    ECLIPSING BINARY STARS   0.080
    CATACLYSMIC VARIABLES    0.070
    NOVAE                    0.069
    STELLAR SPECTRA          0.063
    ACCRETION DISKS          0.062
    EMISSION SPECTRA         0.062

30
Ranking: 3 8 9 19 19 19 19 19
  • DOCUMENT ID 652: 1986ApJ...306..532L, 9
    co-indexed document(s)
  • TERM LIST FROM THE SOURCE VOC. (APJ)
  • - NEBULAE INDIVIDUAL NGC NUMBER NGC 7538
  • TERM LIST FROM THE TARGET VOC. (STI)
  • - ABUNDANCE
  • - ASTRONOMICAL SPECTROSCOPY
  • - ELECTRON DENSITY (CONCENTRATION)
  • - INFRARED ASTRONOMY
  • - LIGHT SCATTERING
  • - LINE SPECTRA
  • - NEBULAE
  • - VISIBLE SPECTRUM
  • ACTIVE TERMS: 73
    H II REGIONS                   0.088
    MOLECULAR CLOUDS               0.084
    LINE SPECTRA                   0.048
    INTERSTELLAR MASERS            0.048
    INFRARED SOURCES (ASTRONOMY)   0.047
    EMISSION SPECTRA               0.036
    AMMONIA                        0.029

31
Ranking: 63
  • DOCUMENT ID 652: 1986ApJ...306..532L, 4363
    co-indexed document(s)
  • TERM LIST FROM THE SOURCE VOC. (STI)
  • - ABUNDANCE
  • - ASTRONOMICAL SPECTROSCOPY
  • - ELECTRON DENSITY (CONCENTRATION)
  • - INFRARED ASTRONOMY
  • - LIGHT SCATTERING
  • - LINE SPECTRA
  • - NEBULAE
  • - VISIBLE SPECTRUM
  • TERM LIST FROM THE TARGET VOC. (APJ)
  • - NEBULAE INDIVIDUAL NGC NUMBER NGC 7538
  • ACTIVE TERMS: 1430
    INTERSTELLAR MOLECULES      0.028
    STARS PRE--MAIN-SEQUENCE    0.018
    INFRARED SOURCES            0.018
    NEBULAE H II REGIONS        0.016
    QUASARS                     0.015
    GALAXIES NUCLEI             0.015

32
Rocchio's Measure
  • Normalized recall measure
  • Difference between ideal ranking and actual
    ranking
  • - Σ (actual rankings) - Σ (ideal rankings)
  • Normalization
  • - n1 (N - n1), where n1 = number of target terms
    and N = number of active terms
  • Example
  • - Ideal ranking: (1 2 3 4 5); actual
    ranking: (1 2 5 6 10)
  • - Total active terms: 40; target terms: 5
  • - Σ (actual rankings) - Σ (ideal rankings) = 24 - 15 = 9
  • - n1 (N - n1) = 5 (40 - 5) = 175
  • - Rocchio measure = 1 - (9/175) = .95
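
The worked example above can be checked with a short Python sketch. The function name `rocchio_normalized_recall` is hypothetical. It assumes rankings are 1-based and that the ideal ranking places the n1 target terms at ranks 1 through n1.

```python
def rocchio_normalized_recall(actual_ranks, n_active):
    """Normalized recall: 1 - (sum actual - sum ideal) / (n1 * (N - n1)).

    actual_ranks: 1-based ranks at which the target terms appeared.
    n_active: N, the total number of active (ranked) terms.
    """
    n1 = len(actual_ranks)
    if n1 == 0 or n_active <= n1:
        raise ValueError("need 0 < number of target terms < active terms")
    # The ideal ranking puts the target terms at ranks 1..n1.
    ideal_sum = n1 * (n1 + 1) // 2
    difference = sum(actual_ranks) - ideal_sum
    return 1 - difference / (n1 * (n_active - n1))


# Slide example: actual ranks (1 2 5 6 10) among 40 active terms.
score = rocchio_normalized_recall([1, 2, 5, 6, 10], 40)
```

A score of 1.0 means every target term was ranked at the top; the score falls toward 0 as the target terms sink to the bottom of the active list.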

33
Rocchio Measure (ApJ -> STI)
[Figure; axis label: Number of Cases]
34
Rocchio Measure (STI -> ApJ)
[Figure; axis label: Number of Cases]
35
Future study
  • User evaluation
  • Compatibility measures
  • Comparison to other methods
  • Visualization of term relationships

36
[Figure; term labels: DYNAMIC MODELS, KINEMATICS, ROTATING DISKS, GALAXIES KINEMATICS AND DYNAMICS, STARS FORMATION, INSTABILITIES, GALAXIES CLUSTERING, ACCRETION, GALACTIC STRUCTURE]
37
THANK YOU!!!