Refined Online Citation Matching and Adaptive Canonical Metadata Construction

Refined Online Citation Matching and Adaptive Canonical Metadata Construction CSE 598B Course Project Report Huajing Li – PowerPoint PPT presentation

Slides: 20

Provided by: Huaj150

Learn more at: https://www.cse.psu.edu

1
Refined Online Citation Matching and Adaptive
Canonical Metadata Construction

CSE 598B Course Project Report
Huajing Li

2
Outline

Introduction
Matching Citations and Documents
Learning from Observations
Cluster Repair
Evaluation

3
Introduction

In research repositories, citations represent
important knowledge regarding work contexts.
The citation relationships form a data structure
generally known as a citation graph, where
documents are vertices and citations are directed
edges between citing and cited documents.
Methods for constructing citation graphs
manual information extraction
autonomous citation indexing (ACI)
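
As a rough illustration of the citation graph described above, the sketch below stores documents as vertices and citations as directed edges from the citing document to the cited one. The class name, method names, and document ids are hypothetical and are not taken from CiteSeer.

```python
from collections import defaultdict


class CitationGraph:
    """Directed graph: an edge u -> v means document u cites document v."""

    def __init__(self):
        self.cites = defaultdict(set)     # citing id -> ids it cites
        self.cited_by = defaultdict(set)  # cited id -> ids citing it

    def add_citation(self, citing, cited):
        # Record the edge in both directions so either end can be queried.
        self.cites[citing].add(cited)
        self.cited_by[cited].add(citing)

    def citation_count(self, doc_id):
        # Number of incoming edges, i.e. how often the document is cited.
        return len(self.cited_by.get(doc_id, ()))


graph = CitationGraph()
graph.add_citation("paperA", "paperC")
graph.add_citation("paperB", "paperC")
graph.add_citation("paperC", "paperD")
```

Keeping a reverse index (cited_by) makes "who cites this document" a direct lookup, which is the query a most-cited list needs.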

4
Introduction

Popular ACI systems
The CiteSeer Digital Library (a collection of
over 725,000 documents with over 8 million
citations)
Google Scholar (433 million document and citation
records)
Typical ACI process
Extract citations from research papers
Parse subfields to build accurate metadata for
each citation
Link citations to documents

5
Introduction

Typical problems in the ACI process
Citation parsers are error-prone and often
produce noisy results
Errors in the citation text (such as typos)
Identity uncertainty in document matching
For an automatic DL system, the identity of
documents is uncertain (canonical metadata of the
document can be incomplete or inaccurate)
In such cases, citation metadata can be used to
correct the canonical metadata of documents

6
Our Research Goals

Provide better document metadata
Reduce the cost of maintenance
Allow the development of flexible APIs into the
CiteSeer citation graph system
Maintain data security despite an open, wiki-like
approach to user-contributed metadata changes
Provide better citation matching compared to the
current system

7
Matching Citations and Documents

Current offline approach
Citations are grouped according to their
extracted metadata
Each citation group is linked to a real document,
which may already exist inside the ACI system or
may not yet have been collected

8
Matching Citations and Documents

Remember that citations are themselves documents
Treating citations and documents differently
brings a lot of unnecessary complications into
the system
Citations pointing to a document in the ACI
system can be represented by the document's
identity
To represent a document that a citation points to
but that is not yet in the system, we use the
notion of a virtual document, which takes on the
extracted metadata of the citation.

9
Matching Citations and Documents

Once the document enters the system, the
corresponding virtual record is then updated with
a pointer to the document file, making it a
real document record.
There are no citation edges pointing to an
external unknown resource. All edges are
internal in the document database and real and
virtual documents can be searched in the same
index space.
We use Lucene to match documents online.
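
The virtual document idea can be sketched as a single record type whose file pointer is empty until the document is collected. Everything named here (DocumentRecord, DocumentStore, the exact title-and-year lookup key) is a hypothetical stand-in. In particular, the real system matches through a Lucene index, not through the exact-key scan assumed below.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class DocumentRecord:
    doc_id: int
    metadata: Dict[str, str]
    file_path: Optional[str] = None           # None means a virtual document
    cites: Set[int] = field(default_factory=set)

    @property
    def is_virtual(self) -> bool:
        return self.file_path is None


class DocumentStore:
    def __init__(self):
        self.records: Dict[int, DocumentRecord] = {}
        self._next_id = 1

    @staticmethod
    def _key(metadata):
        # Assumed matching rule: normalized title plus year.
        return (metadata.get("title", "").strip().lower(), metadata.get("year", ""))

    def find(self, metadata):
        key = self._key(metadata)
        for rec in self.records.values():
            if self._key(rec.metadata) == key:
                return rec
        return None

    def _create(self, metadata, file_path=None):
        rec = DocumentRecord(self._next_id, dict(metadata), file_path)
        self.records[rec.doc_id] = rec
        self._next_id += 1
        return rec

    def add_citation(self, citing_id, citation_metadata):
        # Every citation edge ends at a record: an existing one, or a new virtual one.
        target = self.find(citation_metadata) or self._create(citation_metadata)
        self.records[citing_id].cites.add(target.doc_id)
        return target

    def add_document(self, metadata, file_path):
        rec = self.find(metadata)
        if rec is None:
            return self._create(metadata, file_path)
        rec.file_path = file_path  # a virtual record becomes a real one
        return rec


store = DocumentStore()
citing = store.add_document({"title": "A Survey", "year": "2005"}, "/docs/survey.pdf")
target = store.add_citation(citing.doc_id, {"title": "Original Method", "year": "1999"})
was_virtual = target.is_virtual
promoted = store.add_document({"title": "original method ", "year": "1999"}, "/docs/method.pdf")
```

Because the virtual record is updated in place, the edge created by add_citation stays valid after the cited document is collected.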

10
Learning from Observations

The problem is to generate beliefs in the identity
of a document based on observational evidence.
Records may be linked with many information
sources:
Extracted document metadata
Extracted citations
External records (from DBLP, ACM)
User correction
We focus on metadata elements with small
variability in correct representations, such as
names, titles, dates, etc.

11
Learning from Observations

We use a Bayesian belief network to construct
canonical metadata.
Decide the canonical value of X from all
observations on X.
Each network BEL(X) develops degrees of belief in
each possible value of X, and the canonical value
is the one with the largest belief score.
Given a prior belief vector BEL(x), BEL(x) can be
updated with a new observation using only a
local computation.

12
Learning from Observations

An example
An example observation vector o(x) may be (0, 0,
1, 0), indicating that o(2), the third entry
counting from zero, is the observed value for x.
This vector must then be adjusted based on our
confidence in the observation. This is achieved
using a confidence matrix.
Assigning C = 0.7 to the observation results in an
actual message of (0.1, 0.1, 0.7, 0.1) sent to X.
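
The example above can be reproduced with a small sketch. Two assumptions are made that the slides do not spell out: the confidence matrix gives the observed value weight C and splits 1 - C evenly over the other values, and the local update multiplies the prior belief by the message elementwise and renormalizes. The function names are hypothetical.

```python
def soften(observation, confidence):
    """Turn a one-hot observation into a message: the observed value gets
    `confidence`, the others share (1 - confidence) evenly.
    Assumes at least two candidate values."""
    rest = (1.0 - confidence) / (len(observation) - 1)
    return [confidence if o == 1 else rest for o in observation]


def update_belief(belief, message):
    """Assumed local update: elementwise product, then renormalize."""
    product = [b * m for b, m in zip(belief, message)]
    total = sum(product)
    return [p / total for p in product]


def canonical_index(belief):
    """Index of the value with the largest belief score."""
    return max(range(len(belief)), key=belief.__getitem__)


prior = [0.25, 0.25, 0.25, 0.25]        # no preference among four candidate values
message = soften([0, 0, 1, 0], 0.7)     # approximately (0.1, 0.1, 0.7, 0.1)
posterior = update_belief(prior, message)
best = canonical_index(posterior)       # index of the current canonical value
```

Each update touches only the belief vector of one metadata field, so a new observation never requires recomputing the whole network.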

13
Cluster Repair

Adjusting metadata dynamically in response to new
evidence can lead to inconsistencies in citation
groups.
repairCluster(R)
  Find matching citations M for R
  For each citation C in GR
    If C is not contained in M
      Add C to REVOKE
  Set GR = M
  Reset belief vectors
  For each citation C in GR
    If C is not contained in REVOKE
      Update belief vectors using C
  If metadata changes
    repairCluster(R)

14
Cluster Repair

Voting privilege
To prevent unbounded iterations, once a citation
C1 is removed from GR, it can return to GR but it
cannot influence metadata belief vectors for the
remainder of the repairCluster iterations.
At the end of a repairCluster call stack, the
non-voting citations regain voting privileges.
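
A minimal sketch of repairCluster with voting privileges follows. It makes several simplifications that are not in the slides: only the title field is repaired, a word-overlap test stands in for the index query, and a plain vote count stands in for the belief vectors. All names and the sample data are hypothetical.

```python
from collections import Counter


def matches(record_title, citation_title):
    """Hypothetical matcher: the citation shares at least half of the
    record's title words (a stand-in for an index query)."""
    rec = set(record_title.lower().split())
    cit = set(citation_title.lower().split())
    return len(rec & cit) >= max(1, len(rec) / 2)


def repair_cluster(record, all_citations, revoked=None):
    if revoked is None:
        revoked = set()  # fresh REVOKE set for each top-level call
    # Find matching citations M for R.
    matched = {cid for cid, title in all_citations.items()
               if matches(record["title"], title)}
    # Citations that fell out of the group lose voting privileges.
    revoked |= record["group"] - matched
    record["group"] = matched
    # Reset beliefs: start from the record's own extracted title.
    votes = Counter({record["extracted_title"]: 1})
    for cid in sorted(matched):
        if cid not in revoked:
            votes[all_citations[cid]] += 1
    new_title = votes.most_common(1)[0][0]
    if new_title != record["title"]:
        record["title"] = new_title
        repair_cluster(record, all_citations, revoked)
    # REVOKE lives only for this call stack, so a later top-level call
    # starts with every citation voting again.
    return revoked


all_citations = {
    1: "Support Vector Networks",
    2: "Support Vector Networks",
    3: "Support Vector Networks",
    4: "Suport Vector Machines Tutorial",
}
record = {
    "title": "Suport Vector Networks",            # current canonical title (misspelled)
    "extracted_title": "Suport Vector Networks",  # as extracted from the document
    "group": {1, 2, 3, 4},
}
revoked = repair_cluster(record, all_citations)
```

In this run the first pass corrects the title, the second pass finds that citation 4 no longer matches and revokes it, and the metadata then stops changing.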

15
Evaluation

Ten frequently referenced document records were
selected from the top of CiteSeer's most-cited
document list along with all corresponding
citations.
9,121 citations were used in the final test set.
The data set was run through a noise generation
program to deliberately add noise to the
citation records:
Randomly insert a word into the title.
Randomly delete a word from the title.
Randomly insert an author name.
Randomly delete an author name.
Randomly misspell a word in the title.
Randomly misspell an author name.
Introduce a mistake in the publication year attribute.
Corresponding parameters control the probability
with which each category of noise occurs, varying
from 0 to 1. A noise rate of 0 means the original
citation text is used without any intentional
modification. A noise rate of 1 means that type of
noise always occurs.
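
The noise generation step might look like the sketch below, with one rate per noise category. The original program is not shown in the slides, so the rate names, the filler vocabulary, the character-swap misspelling model, and the off-by-one year error are all assumptions.

```python
import random

FILLER_WORDS = ("data", "system", "analysis")  # hypothetical inserted title words
FILLER_AUTHORS = ("J. Smith", "A. Kumar")      # hypothetical inserted author names


def misspell(word, rng):
    """Swap two adjacent characters (assumed misspelling model)."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def add_noise(citation, rates, rng):
    """Return a noisy copy of `citation`; `rates` maps a noise category to
    its probability in [0, 1]. Missing categories default to 0."""
    title = citation["title"].split()
    authors = list(citation["authors"])
    year = citation["year"]

    def fires(kind):
        # rng.random() is in [0, 1): rate 0 never fires, rate 1 always fires.
        return rng.random() < rates.get(kind, 0.0)

    if fires("insert_title_word"):
        title.insert(rng.randrange(len(title) + 1), rng.choice(FILLER_WORDS))
    if fires("delete_title_word") and title:
        del title[rng.randrange(len(title))]
    if fires("insert_author"):
        authors.insert(rng.randrange(len(authors) + 1), rng.choice(FILLER_AUTHORS))
    if fires("delete_author") and authors:
        del authors[rng.randrange(len(authors))]
    if fires("misspell_title") and title:
        i = rng.randrange(len(title))
        title[i] = misspell(title[i], rng)
    if fires("misspell_author") and authors:
        i = rng.randrange(len(authors))
        authors[i] = misspell(authors[i], rng)
    if fires("wrong_year"):
        year += rng.choice((-1, 1))
    return {"title": " ".join(title), "authors": authors, "year": year}


rng = random.Random(0)
clean = {"title": "Support Vector Networks",
         "authors": ["C. Cortes", "V. Vapnik"], "year": 1995}
untouched = add_noise(clean, {}, rng)
year_only = add_noise(clean, {"wrong_year": 1.0}, rng)
shorter = add_noise(clean, {"delete_title_word": 1.0, "delete_author": 1.0}, rng)
```

Passing an explicit random.Random instance keeps a noisy test set reproducible across runs.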

16
Index-Based Citation Clustering

Lucene's fuzzy query is used to match
citations to documents. We vary the similarity
threshold to observe the precision and recall.
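
The threshold sweep can be illustrated as follows. Python's difflib.SequenceMatcher is swapped in for Lucene's fuzzy query purely for illustration, so the similarity scores will not match Lucene's. The sample titles and the set of relevant citations are made up.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (stand-in for Lucene)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_citations(doc_title, citations, threshold):
    """Ids of citations whose title similarity reaches the threshold."""
    return {cid for cid, title in citations.items()
            if similarity(doc_title, title) >= threshold}


def precision_recall(predicted, relevant):
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(relevant) if relevant else 1.0
    return precision, recall


doc_title = "Support Vector Networks"
citations = {
    1: "Support Vector Networks",
    2: "Suport vector network",             # noisy citation of the same paper
    3: "Reinforcement Learning: A Survey",  # unrelated paper
}
relevant = {1, 2}

# Sweep the threshold from permissive to strict.
results = {
    threshold: precision_recall(match_citations(doc_title, citations, threshold), relevant)
    for threshold in (0.0, 0.5, 0.8, 1.0)
}
```

A threshold of 0 accepts every citation, which maximizes recall at the cost of precision. A threshold of 1 accepts only exact title matches, which does the opposite.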

17
Index-Based Citation Clustering

Noise is introduced into the citation data to
test the capability of the matching algorithm to
handle inaccurate inputs.

18
Metadata Determination and Cluster Repair

Confidence in the document metadata was
arbitrarily set at 0.8, and confidence in
citation data was set at 0.5.
The cluster repair algorithm was then used to
iteratively query the citation index and repair
the document's metadata until convergence.
Only title, author, and year metadata were tested
for accuracy.

19
Metadata Determination and Cluster Repair
