Title: Formal Models for Expert Finding on DBLP Bibliography Data
1. Formal Models for Expert Finding on DBLP Bibliography Data
ICDM 2008
- Presented by Hongbo Deng
- Joint work with Irwin King and Michael R. Lyu
- Department of Computer Science and Engineering
- The Chinese University of Hong Kong
- Dec. 16, 2008
2. Introduction
- Traditional information retrieval
- Data mining
3. Outline
- Introduction
- Related work
- Methodology
  - Modeling Expertise
    - Statistical language model
    - Topic-based model
    - Hybrid model
- Experiments
- Conclusions
4. Introduction
II. Introduction
- Expert finding has received increasing interest
  - W3C collection in 2005 and 2006 (introduced and used by TREC)
  - CSIRO collection in 2007
  - Nearly all of the work has been evaluated on the W3C collection
- We address the expert finding task in a real-world academic field
  - An important practical problem
  - Some special problems and difficulties
5. Problems
II. Introduction
- How to represent the expertise of a researcher?
  - The publications of a researcher
- How to identify experts for a given query?
  - Relevance between a query and publications
  - Publications act as the bridge between the query and experts
- What dataset can be used?
  - DBLP bibliography (limited information)
  - Use Google Scholar as a data supplement
- How to measure the relevance between a query and documents?
  - Language model, vector space model, etc.
- Should we treat each publication equally?
6. Our Work
II. Introduction
- Our setting: DBLP bibliography and Google Scholar
  - More than 955,000 articles with over 574,000 authors
  - About 20 GB of metadata crawled from Google Scholar
- Differs from the W3C setting
  - Covers a wider range of topics
  - Contains many more candidate experts
- Applications
  - Find experts for consultation on a new research field
  - Assign papers to reviewers automatically
  - Recommend panels of reviewers for grant applications
7. Related Work
- Document model and candidate model (Balog et al., SIGIR '06, SIGIR '07)
- Hierarchical language models (Petkova and Croft, ICTAI '06)
- Voting model (Macdonald and Ounis, CIKM '06)
- Author-Persona-Topic model (Mimno and McCallum, KDD '07)
- These approaches do not consider the importance of documents, and they are hard to apply to large-scale expert finding.
8. Expertise Modeling
III. Methodology
- Expert finding
  - p(ca|q): what is the probability of a candidate ca being an expert given the query topic q?
  - Rank candidates ca according to this probability.
- Approach
  - Using Bayes' theorem, p(ca|q) = p(ca, q) / p(q), where p(ca, q) is the joint probability of a candidate and a query, and p(q) is the probability of the query. Since p(q) does not depend on the candidate, ranking by p(ca|q) is equivalent to ranking by p(ca, q).
9. Expertise Modeling
III. Methodology
- Problem: how to estimate p(ca, q)?
- Model 1: statistical language model
  - Document-based approach
  - Find the experts through the associated publications
- Model 2: topic-based model
  - Association between the query and several similar topics
- Model 3: hybrid model
  - Combination of Model 1 and Model 2
10. Basic Language Model
III. Model 1: statistical language model
[Fig. 1: Baseline model (language model; query terms conditionally independent)]
- Find documents relevant to the query
- Model the knowledge of an expert from the associated documents
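The equation behind Fig. 1 is not recoverable from this export. As a minimal sketch of the document-based scoring described above, assuming a smoothed unigram language model with conditionally independent query terms and a uniform document weight p(d|ca) (the function names and the smoothing scheme are illustrative, not the paper's exact formulation):

```python
from collections import Counter

def score_candidate(query_terms, candidate_docs, collection_tf, collection_len, lam=0.5):
    """p(ca, q) ~ sum_d p(q|d) * p(d|ca): score a candidate's expertise
    from the associated documents (each doc is a list of terms)."""
    score = 0.0
    for doc in candidate_docs:
        tf = Counter(doc)
        p_q_d = 1.0
        for t in query_terms:
            # Jelinek-Mercer smoothing: mix the document model with the
            # collection (background) model so unseen terms keep mass > 0
            p_ml = tf[t] / len(doc) if doc else 0.0
            p_bg = collection_tf[t] / collection_len
            p_q_d *= lam * p_ml + (1 - lam) * p_bg
        score += p_q_d / len(candidate_docs)  # uniform p(d|ca)
    return score
```

A candidate whose publications match the query terms scores higher than one whose publications do not, which is exactly the "publications as bridge" idea from slide 5.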
11. Weighted Language Model
III. Model 1: statistical language model
[Fig. 3: Weighted model]
[Fig. 2: A query example]
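The weighted model of Fig. 3 did not survive this export; the slides only say that publications should not be treated equally and that citation counts were crawled. A hedged sketch, assuming the weight is a citation-based document prior (the log-damping and the normalization here are assumptions, not the paper's exact formula):

```python
import math

def citation_priors(citations):
    """Illustrative document prior p(d): log-damped citation counts,
    normalized over the candidate's publications."""
    raw = [math.log(1 + c) for c in citations]
    total = sum(raw)
    if total == 0:
        return [1.0 / len(raw)] * len(raw)  # fall back to uniform
    return [r / total for r in raw]

def weighted_score(p_q_d, citations):
    """p(ca, q) ~ sum_d p(q|d) * p(d): replace the uniform document
    weight of the baseline with a citation-based prior."""
    return sum(p * w for p, w in zip(p_q_d, citation_priors(citations)))
```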
12. Topic-based Model
III. Model 2: topic-based model
- Observation: researchers usually describe their expertise as a combination of several topics
- Each candidate is represented as a weighted sum of multiple topics Z
[Fig. 4: Topic-based model]
- Similarity between the query and topics: treat each topic z as a query and estimate p(q|z)
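Fig. 4's formula is not recoverable here. As a minimal sketch of the scoring idea described above, assuming the query-to-topic similarities p(q|z) and the candidate's per-topic weights are already estimated (all names are illustrative):

```python
def topic_score(query_topic_sim, candidate_topics):
    """p(ca, q) ~ sum over topics z of p(q|z) * p(ca|z): query-to-topic
    similarity weighted by the candidate's strength on each topic."""
    return sum(sim * candidate_topics.get(z, 0.0)
               for z, sim in query_topic_sim.items())
```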
13. Topic-based Model
III. Model 2: topic-based model
- Google Scholar records representing topic z:
  1. Introduction to Modern Information Retrieval
  2. Information retrieval
  3. Modern Information Retrieval
  5. A language modeling approach to information retrieval
  7. Information filtering and information retrieval
  99. Cross-language information retrieval
  100. On modeling information retrieval with probabilistic inference
- Topic z: Information retrieval
14. Topic-based Model
III. Model 2: topic-based model
- Challenge: which similar topics should be selected?
- T1: calculate p(q|z) and select the top K ranked topics
  - Assumes topics are independent
- Ideal similar topics
  - Include topics from many different subtopics
  - Do not include topics with high redundancy
- Define a conditional probability function to quantify the novelty and penalize the redundancy of a topic
  - T2
  - T3
15. Topic Selection Algorithm
III. Model 2: topic-based model
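The algorithm figure on this slide did not survive the export. What follows is only a hedged sketch of a greedy selection in the spirit described on the previous slide: prefer topics with high p(q|z) while penalizing redundancy against topics already chosen. The cosine-similarity penalty and the trade-off parameter alpha are assumptions, not the paper's T2/T3 definitions.

```python
def select_topics(p_q_z, topic_vecs, k, alpha=0.5):
    """Greedily pick up to k topics, trading off relevance p(q|z)
    against redundancy with already-selected topics (MMR-style)."""
    def cosine(u, v):
        keys = set(u) | set(v)
        dot = sum(u.get(t, 0) * v.get(t, 0) for t in keys)
        nu = sum(x * x for x in u.values()) ** 0.5
        nv = sum(x * x for x in v.values()) ** 0.5
        return dot / (nu * nv) if nu and nv else 0.0

    selected = []
    candidates = set(p_q_z)
    while candidates and len(selected) < k:
        def gain(z):
            # novelty term: most similar already-selected topic
            redundancy = max((cosine(topic_vecs[z], topic_vecs[s])
                              for s in selected), default=0.0)
            return alpha * p_q_z[z] - (1 - alpha) * redundancy
        best = max(candidates, key=gain)
        if gain(best) <= 0 and selected:
            break  # automatic cutoff: remaining topics are too redundant
        selected.append(best)
        candidates.remove(best)
    return selected
```

This also illustrates the automatic cutoff mentioned on slide 23: once every remaining topic is more redundant than relevant, selection stops before reaching K.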
16. Hybrid Model
III. Model 3: hybrid model
- Aggregate the advantages of the language-model score p_l and the topic-based score p_t
- Defined as a linear combination: p(ca, q) = λ p_l(ca, q) + (1 - λ) p_t(ca, q), with mixing weight λ
17. Experiments
IV. Experiments
- DBLP Collection
  - Limitation
    - No abstracts or index terms
    - Hard to represent the documents
- Representation for documents
  - Use Google Scholar for data supplementation
  - Title as query; crawled the top 10 returned records
  - Up to 20 GB of metadata (HTML pages)
  - The citation number of each publication
18. Topic Collection
IV. Experiments
- 2,498 well-defined topics from eventseer
- Crawled the top 100 returned records from Google Scholar
19. Benchmark Dataset
IV. Experiments
- A benchmark dataset with 7 topics and expert lists
20. Evaluation Metrics
IV. Experiments
- Precision at rank n (P@n)
- Mean Average Precision (MAP)
- Bpref: a preference-based score computed from the number of non-relevant candidates ranked above each relevant candidate
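A self-contained sketch of the three metrics above, assuming binary relevance and that every retrieved non-relevant candidate was judged (a simplification of Bpref, which formally counts only judged non-relevant items):

```python
def precision_at_n(ranked, relevant, n):
    """Fraction of the top-n ranked candidates that are relevant."""
    return sum(1 for c in ranked[:n] if c in relevant) / n

def average_precision(ranked, relevant):
    """Mean of P@k over the ranks k where a relevant candidate appears;
    MAP is this value averaged over queries."""
    hits, total = 0, 0.0
    for k, c in enumerate(ranked, 1):
        if c in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def bpref(ranked, relevant):
    """Bpref: for each relevant candidate, penalize by the number of
    non-relevant candidates ranked above it (capped at |relevant|)."""
    R = len(relevant)
    if R == 0:
        return 0.0
    nonrel_seen, score = 0, 0.0
    for c in ranked:
        if c in relevant:
            score += 1 - min(nonrel_seen, R) / R
        else:
            nonrel_seen += 1
    return score / R
```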
21. Preliminary Experiments
IV. Experiments
- Performed on two corpora using the basic language model (B1)
  - Title corpus: only using the title
  - GS corpus: the representation from Google Scholar
- Evaluation results on the two corpora
- It is more effective to represent d using Google Scholar
22. Model 1: Statistical Language Models
IV. Experiments
- Evaluation results of the language models
- Weighted language models B2 and B3 outperform B1
- It is important to consider the prior probability
23. Model 2: Topic-based Models
IV. Experiments
- Vary the number of topics (K) from 5 to 100
- Results using different values of K; the number of topics is cut off automatically for T2 and T3
24. Model 2: Topic-based Models
IV. Experiments
- Comparison of the three topic-based models
25. Model 3: Hybrid Models
IV. Experiments
- Evaluation results of the hybrid model
- The hybrid model outperforms the pure language model and the topic-based model on most of the metrics
26. Conclusions and Future Work
- Conclusions
  - Address the expert finding task in a real-world academic field
  - Propose a weighted language model
  - Investigate a topic-based model to interpret the expert finding task
  - Integrate the language model with the topic-based model
  - Demonstrate that the hybrid model achieves the best performance in the evaluation results
- Future work
  - Take into account other types of information
  - Refine the results by utilizing social network analysis
27. Q&A
28. Comparison to Other Systems
- Evaluation results of our language models and the method TS
29. Example results