Title: Formal Models for Expert Finding on DBLP Bibliography Data
1. Formal Models for Expert Finding on DBLP Bibliography Data
ICDM 2008
- Presented by Hongbo Deng
- Joint work with Irwin King and Michael R. Lyu
- Department of Computer Science and Engineering
- The Chinese University of Hong Kong
- Dec. 16, 2008
2. Introduction
- Traditional information retrieval
- Data mining
3. Outline
- Introduction
- Related work
- Methodology
  - Modeling Expertise
    - Statistical language model
    - Topic-based model
    - Hybrid model
- Experiments
- Conclusions
4. Introduction
II. Introduction
- Expert finding has received increasing interest
  - W3C collection in 2005 and 2006 (introduced and used by TREC)
  - CSIRO collection in 2007
  - Nearly all of the work has been evaluated on the W3C collection
- We address the expert finding task in a real-world academic field
  - An important practical problem
  - Some special problems and difficulties
5. Problems
II. Introduction
- How to represent the expertise of a researcher?
  - The publications of a researcher
- How to identify experts for a given query?
  - Relevance between a query and publications
  - Publications act as the bridge between the query and experts
- What dataset can be used?
  - DBLP bibliography (limited information)
  - Use Google Scholar as a data supplement
- How to measure the relevance between a query and documents?
  - Language model, vector space model, etc.
- Should we treat each publication equally?
6. Our Work
II. Introduction
- Our setting: DBLP bibliography and Google Scholar
  - More than 955,000 articles with over 574,000 authors
  - About 20 GB of metadata crawled from Google Scholar
- Differs from the W3C setting
  - Covers a wider range of topics
  - Contains many more candidate experts
- Applications
  - Find experts for consultation on a new research field
  - Assign papers to reviewers automatically
  - Recommend panels of reviewers for grant applications
7. Related Work
- Document model and candidate model (Balog et al., SIGIR '06, SIGIR '07)
- Hierarchical language models (Petkova and Croft, ICTAI '06)
- Voting model (Macdonald and Ounis, CIKM '06)
- Author-Persona-Topic model (Mimno and McCallum, KDD '07)
- These approaches do not consider the importance of documents, and they are hard to apply to large-scale expert finding.
8. Expertise Modeling
III. Methodology
- Expert finding
  - p(ca|q): what is the probability of a candidate ca being an expert given the query topic q?
  - Rank candidates ca according to this probability.
- Approach
  - Using Bayes' theorem, p(ca|q) = p(ca, q) / p(q), where p(ca, q) is the joint probability of a candidate and a query, and p(q) is the probability of the query. Since p(q) does not depend on the candidate, ranking by p(ca|q) is equivalent to ranking by p(ca, q).
9. Expertise Modeling
III. Methodology
- Problem: how to estimate p(ca, q)?
- Model 1: statistical language model
  - Document-based approach
  - Find the experts through the associated publications
- Model 2: topic-based model
  - Association between the query and several similar topics
- Model 3: hybrid model
  - Combination of Model 1 and Model 2
10. Basic Language Model
III. Model 1: statistical language model
[Fig. 1: Baseline model (language model; query terms conditionally independent)]
- Find documents relevant to the query
- Model the knowledge of an expert from the associated documents
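The equation behind Fig. 1 is not recoverable from this export. As a minimal sketch of the document-based scoring described above, assuming a smoothed unigram language model with conditionally independent query terms and a uniform document weight p(d|ca) (the function names and the smoothing scheme are illustrative, not the paper's exact formulation):

```python
from collections import Counter

def score_candidate(query_terms, candidate_docs, collection_tf, collection_len, lam=0.5):
    """p(ca, q) ~ sum_d p(q|d) * p(d|ca): score a candidate's expertise
    from the associated documents (each doc is a list of terms)."""
    score = 0.0
    for doc in candidate_docs:
        tf = Counter(doc)
        p_q_d = 1.0
        for t in query_terms:
            # Jelinek-Mercer smoothing: mix the document model with the
            # collection (background) model so unseen terms keep mass > 0
            p_ml = tf[t] / len(doc) if doc else 0.0
            p_bg = collection_tf[t] / collection_len
            p_q_d *= lam * p_ml + (1 - lam) * p_bg
        score += p_q_d / len(candidate_docs)  # uniform p(d|ca)
    return score
```

A candidate whose publications match the query terms scores higher than one whose publications do not, which is exactly the "publications as bridge" idea from slide 5.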
11. Weighted Language Model
III. Model 1: statistical language model
[Fig. 3: Weighted model]
[Fig. 2: A query example]
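The weighted model of Fig. 3 did not survive this export; the slides only say that publications should not be treated equally and that citation counts were crawled. A hedged sketch, assuming the weight is a citation-based document prior (the log-damping and the normalization here are assumptions, not the paper's exact formula):

```python
import math

def citation_priors(citations):
    """Illustrative document prior p(d): log-damped citation counts,
    normalized over the candidate's publications."""
    raw = [math.log(1 + c) for c in citations]
    total = sum(raw)
    if total == 0:
        return [1.0 / len(raw)] * len(raw)  # fall back to uniform
    return [r / total for r in raw]

def weighted_score(p_q_d, citations):
    """p(ca, q) ~ sum_d p(q|d) * p(d): replace the uniform document
    weight of the baseline with a citation-based prior."""
    return sum(p * w for p, w in zip(p_q_d, citation_priors(citations)))
```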
12. Topic-based Model
III. Model 2: topic-based model
- Observation: researchers usually describe their expertise as a combination of several topics
- Each candidate is represented as a weighted sum of multiple topics Z
[Fig. 4: Topic-based model]
- Similarity between the query and topics: treat each topic z as a query and estimate p(q|z)
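Fig. 4's formula is not recoverable here. As a minimal sketch of the scoring idea described above, assuming the query-to-topic similarities p(q|z) and the candidate's per-topic weights are already estimated (all names are illustrative):

```python
def topic_score(query_topic_sim, candidate_topics):
    """p(ca, q) ~ sum over topics z of p(q|z) * p(ca|z): query-to-topic
    similarity weighted by the candidate's strength on each topic."""
    return sum(sim * candidate_topics.get(z, 0.0)
               for z, sim in query_topic_sim.items())
```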
13. Topic-based Model
III. Model 2: topic-based model
- Google Scholar records representing topic z:
  1. Introduction to Modern Information Retrieval
  2. Information retrieval
  3. Modern Information Retrieval
  5. A language modeling approach to information retrieval
  7. Information filtering and information retrieval
  99. Cross-language information retrieval
  100. On modeling information retrieval with probabilistic inference
- Topic z: Information retrieval
14. Topic-based Model
III. Model 2: topic-based model
- Challenge: which similar topics should be selected?
- T1: calculate p(q|z) and select the top K ranked topics
  - Assumes topics are independent
- Ideal similar topics
  - Include topics from many different subtopics
  - Do not include topics with high redundancy
- Define a conditional probability function to quantify the novelty and penalize the redundancy of a topic
  - T2
  - T3
15. Topic Selection Algorithm
III. Model 2: topic-based model
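The algorithm figure on this slide did not survive the export. What follows is only a hedged sketch of a greedy selection in the spirit described on the previous slide: prefer topics with high p(q|z) while penalizing redundancy against topics already chosen. The cosine-similarity penalty and the trade-off parameter alpha are assumptions, not the paper's T2/T3 definitions.

```python
def select_topics(p_q_z, topic_vecs, k, alpha=0.5):
    """Greedily pick up to k topics, trading off relevance p(q|z)
    against redundancy with already-selected topics (MMR-style)."""
    def cosine(u, v):
        keys = set(u) | set(v)
        dot = sum(u.get(t, 0) * v.get(t, 0) for t in keys)
        nu = sum(x * x for x in u.values()) ** 0.5
        nv = sum(x * x for x in v.values()) ** 0.5
        return dot / (nu * nv) if nu and nv else 0.0

    selected = []
    candidates = set(p_q_z)
    while candidates and len(selected) < k:
        def gain(z):
            # novelty term: most similar already-selected topic
            redundancy = max((cosine(topic_vecs[z], topic_vecs[s])
                              for s in selected), default=0.0)
            return alpha * p_q_z[z] - (1 - alpha) * redundancy
        best = max(candidates, key=gain)
        if gain(best) <= 0 and selected:
            break  # automatic cutoff: remaining topics are too redundant
        selected.append(best)
        candidates.remove(best)
    return selected
```

This also illustrates the automatic cutoff mentioned on slide 23: once every remaining topic is more redundant than relevant, selection stops before reaching K.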
16. Hybrid Model
III. Model 3: hybrid model
- Aggregate the advantages of the language-model score p_l and the topic-based score p_t
- Defined as a linear combination: p(ca, q) = λ p_l(ca, q) + (1 - λ) p_t(ca, q), with mixing weight λ
17. Experiments
IV. Experiments
- DBLP Collection
  - Limitation
    - No abstracts or index terms
    - Hard to represent the documents
- Representation for documents
  - Use Google Scholar for data supplementation
  - Title as query; crawled the top 10 returned records
  - Up to 20 GB of metadata (HTML pages)
  - The citation number of each publication
18. Topic Collection
IV. Experiments
- 2,498 well-defined topics from eventseer
- Crawled the top 100 returned records from Google Scholar
19. Benchmark Dataset
IV. Experiments
- A benchmark dataset with 7 topics and expert lists
20. Evaluation Metrics
IV. Experiments
- Precision at rank n (P@n)
- Mean Average Precision (MAP)
- Bpref: a preference-based score computed from the number of non-relevant candidates ranked above each relevant candidate
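A self-contained sketch of the three metrics above, assuming binary relevance and that every retrieved non-relevant candidate was judged (a simplification of Bpref, which formally counts only judged non-relevant items):

```python
def precision_at_n(ranked, relevant, n):
    """Fraction of the top-n ranked candidates that are relevant."""
    return sum(1 for c in ranked[:n] if c in relevant) / n

def average_precision(ranked, relevant):
    """Mean of P@k over the ranks k where a relevant candidate appears;
    MAP is this value averaged over queries."""
    hits, total = 0, 0.0
    for k, c in enumerate(ranked, 1):
        if c in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def bpref(ranked, relevant):
    """Bpref: for each relevant candidate, penalize by the number of
    non-relevant candidates ranked above it (capped at |relevant|)."""
    R = len(relevant)
    if R == 0:
        return 0.0
    nonrel_seen, score = 0, 0.0
    for c in ranked:
        if c in relevant:
            score += 1 - min(nonrel_seen, R) / R
        else:
            nonrel_seen += 1
    return score / R
```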
21. Preliminary Experiments
IV. Experiments
- Performed on two corpora using the basic language model (B1)
  - Title corpus: only using the title
  - GS corpus: the representation from Google Scholar
- Evaluation results on the two corpora
- It is more effective to represent d using Google Scholar
22. Model 1: Statistical Language Models
IV. Experiments
- Evaluation results of the language models
- Weighted language models B2 and B3 outperform B1
- It is important to consider the prior probability
23. Model 2: Topic-based Models
IV. Experiments
- Vary the number of topics (K) from 5 to 100
- Results using different values of K; the number of topics is cut off automatically for T2 and T3
24. Model 2: Topic-based Models
IV. Experiments
- Comparison of the three topic-based models
25. Model 3: Hybrid Models
IV. Experiments
- Evaluation results of the hybrid model
- The hybrid model outperforms the pure language model and the topic-based model on most of the metrics
26. Conclusions and Future Work
- Conclusions
  - Address the expert finding task in a real-world academic field
  - Propose a weighted language model
  - Investigate a topic-based model to interpret the expert finding task
  - Integrate the language model with the topic-based model
  - Demonstrate that the hybrid model achieves the best performance in the evaluation results
- Future work
  - Take into account other types of information
  - Refine the results by utilizing social network analysis
27. Q&A
28. Comparison to Other Systems
- Evaluation results of our language models and the method TS
29. Example results