1
Formal Models for Expert Finding on DBLP Bibliography Data
ICDM 2008
  • Presented by Hongbo Deng
  • Joint work with Irwin King and Michael R. Lyu
  • Department of Computer Science and Engineering
  • The Chinese University of Hong Kong
  • Dec. 16, 2008

2
Introduction
  • Traditional information retrieval: given a query, find relevant documents
  • Expert finding task: given a query, find people with relevant expertise
[Figure: the example query "data mining" under both settings]
3
Outline
  • Introduction
  • Related work
  • Methodology: modeling expertise
    • Statistical language model
    • Topic-based model
    • Hybrid model
  • Experiments
  • Conclusions

4
Introduction
II. Introduction
  • Expert finding has received increasing interest
    • W3C collection in 2005 and 2006 (introduced and used by TREC)
    • CSIRO collection in 2007
    • Nearly all prior work has been evaluated on the W3C collection
  • We address the expert finding task in a real-world academic field
    • An important practical problem
    • Poses some special problems and difficulties

5
Problems
II. Introduction
  • How to represent the expertise of a researcher?
    • By the publications of the researcher
  • How to identify experts for a given query?
    • By the relevance between the query and publications
    • Publications act as the bridge between the query and experts
  • What dataset can be used?
    • The DBLP bibliography (limited information)
    • Google Scholar as a data supplement
  • How to measure the relevance between a query and documents?
    • Language model, vector space model, etc.
  • Should we treat each publication equally?

6
Our Work
II. Introduction
  • Our setting: the DBLP bibliography and Google Scholar
    • More than 955,000 articles by over 574,000 authors
    • About 20 GB of metadata crawled from Google Scholar
  • Differs from the W3C setting
    • Covers a wider range of topics
    • Contains many more candidate experts
  • Applications
    • Find experts for consultation in a new research field
    • Assign papers to reviewers automatically
    • Recommend panels of reviewers for grant applications

7
Related Work
  • Document model and candidate model (Balog et al., SIGIR 2006, SIGIR 2007)
  • Hierarchical language models (Petkova and Croft, ICTAI 2006)
  • Voting model (Macdonald and Ounis, CIKM 2006)
  • Author-Persona-Topic model (Mimno and McCallum, KDD 2007)
  • These approaches do not consider the importance of individual documents, and they are hard to apply to large-scale expert finding.

8
Expertise Modeling
III. Methodology
  • Expert finding
    • p(ca | q): what is the probability of a candidate ca being an expert given the query topic q?
    • Rank candidates ca according to this probability
  • Approach
    • Using Bayes' theorem,

      p(ca | q) = p(ca, q) / p(q)

      where p(ca, q) is the joint probability of a candidate and a query, and p(q) is the probability of the query. Since p(q) is constant for a given query, candidates can be ranked by p(ca, q).
9
Expertise Modeling
III. Methodology
  • Problem: how to estimate p(ca, q)?
  • Model 1: Statistical language model
    • A document-based approach
    • Finds experts through their associated publications
  • Model 2: Topic-based model
    • Associates the query with several similar topics
  • Model 3: Hybrid model
    • A combination of Model 1 and Model 2

10
Basic Language Model
III. Model 1 Statistical language model
  • The probability p_l(ca, q) is estimated from the candidate's associated documents, assuming q and ca are conditionally independent given a document d:

    p_l(ca, q) = Σ_d p(q | d) · p(ca | d) · p(d)

[Fig. 1: Baseline model]
  • Find the documents relevant to the query
  • Model the knowledge of an expert from the associated documents
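A minimal sketch of this document-based scoring in Python, assuming a unigram query likelihood p(q | d) with Jelinek-Mercer smoothing and uniform p(ca | d) and p(d); the function names, data layout, and smoothing choice are illustrative, not from the paper.

    from collections import Counter

    def query_likelihood(query_terms, doc_terms, coll_tf, coll_len, lam=0.5):
        # p(q | d): unigram language model over the document's terms,
        # smoothed against collection statistics (Jelinek-Mercer).
        tf = Counter(doc_terms)
        p = 1.0
        for t in query_terms:
            p_doc = tf[t] / len(doc_terms) if doc_terms else 0.0
            p_coll = coll_tf.get(t, 0) / coll_len
            p *= lam * p_doc + (1.0 - lam) * p_coll
        return p

    def basic_lm_score(cand_docs, n_authors, query_terms, coll_tf, coll_len, n_docs):
        # p_l(ca, q) = sum over the candidate's documents d of
        # p(q | d) * p(ca | d) * p(d), with uniform p(ca | d) and p(d).
        score = 0.0
        for doc_id, doc_terms in cand_docs:      # the candidate's publications
            p_q_d = query_likelihood(query_terms, doc_terms, coll_tf, coll_len)
            p_ca_d = 1.0 / n_authors[doc_id]     # uniform over a paper's co-authors
            p_d = 1.0 / n_docs                   # uniform document prior (B1)
            score += p_q_d * p_ca_d * p_d
        return score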

11
Weighted Language Model
III. Model 1 Statistical language model
[Fig. 2: A query example. Fig. 3: Weighted model, where each document d contributes with a prior weight p(d) rather than being treated equally]
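Slide 17 notes that the citation number of each publication is collected, and slide 22 stresses the document prior. A hedged sketch of one plausible citation-based prior that could replace the uniform p(d) above; the paper's exact weighting may differ.

    def citation_prior(citations, total_citations, n_docs, mu=1.0):
        # Assumed form: a document's smoothed share of all citations,
        # so heavily cited papers get more prior weight and uncited
        # papers still keep some mass. Not the paper's exact formula.
        return (citations + mu) / (total_citations + mu * n_docs)

Passing this value in place of the uniform p_d in basic_lm_score gives a weighted variant in the spirit of the B2/B3 runs on slide 22.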
12
Topic-based Model
III. Model 2 Topic-based model
  • Observation: researchers usually describe their expertise as a combination of several topics
  • Each candidate is represented as a weighted sum of multiple topics z

[Fig. 4: Topic-based model]
  • Compute the similarity between the query and each topic
  • Treat each topic z as a query and estimate p(ca | z)
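A hedged sketch of the resulting score, assuming it sums, over the selected topics, the query-topic similarity times the candidate's language-model score with the topic used as the query; the exact combination in the paper may differ.

    def topic_based_score(candidate_score_fn, selected_topics):
        # p_t(ca, q) ~ sum over topics z of sim(q, z) * score(ca, z),
        # where each topic's term list is scored exactly like a query.
        # Assumed form, for illustration only.
        return sum(sim_q_z * candidate_score_fn(topic_terms)
                   for topic_terms, sim_q_z in selected_topics)

Here candidate_score_fn could be basic_lm_score with the candidate's documents already bound, so a topic is scored the same way as a query.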
13
Topic-based Model
III. Model 2 Topic-based model
  • A topic z, e.g. "information retrieval", is represented by the records returned by Google Scholar for z as the query:
    1. Introduction to Modern Information Retrieval
    2. Information Retrieval
    3. Modern Information Retrieval
    5. A Language Modeling Approach to Information Retrieval
    7. Information Filtering and Information Retrieval
    ...
    99. Cross-Language Information Retrieval
    100. On Modeling Information Retrieval with Probabilistic Inference
14
Topic-based Model
III. Model 2 Topic-based model
  • Challenge: which similar topics should be selected?
  • T1: calculate p(q | z) and select the top-K ranked topics
    • Assumes topics are independent
  • Ideal similar topics
    • Include topics from many different subtopics
    • Do not include topics with high redundancy
  • Define a conditional probability function to quantify the novelty and penalize the redundancy of a topic
  • T2
  • T3

15
Topic Selection Algorithm
III. Model 2 Topic-based model
  • T2 and T3 select topics with the novelty/redundancy criterion above (a sketch follows)
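The algorithm itself did not survive extraction. Below is a hedged, MMR-style sketch of a greedy selection consistent with the criteria above: favor topics similar to the query, penalize redundancy with topics already chosen, and stop when nothing sufficiently novel remains. The gain function is illustrative, not the paper's exact T2/T3 definitions.

    def select_topics(q_sim, z_sim, k, beta=0.5):
        # q_sim[z] is sim(query, z); z_sim[a][b] is sim(topic a, topic b).
        # Gain = relevance minus beta * worst redundancy with selected topics.
        selected, remaining = [], set(q_sim)
        while remaining and len(selected) < k:
            def gain(z):
                redundancy = max((z_sim[z][s] for s in selected), default=0.0)
                return q_sim[z] - beta * redundancy
            best = max(remaining, key=gain)
            if gain(best) <= 0:   # cut off once only redundant topics remain
                break             # (cf. the automatic cutoff noted on slide 23)
            selected.append(best)
            remaining.remove(best)
        return selected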

16
Hybrid Model
III. Model 3 Hybrid model
  • Aggregates the advantages of p_l and p_t
  • Defined as a combination of the two models
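The defining equation did not survive extraction. A common way to combine two such scores, and a plausible reading of the slide, is a linear interpolation with a mixing weight λ (an assumed form, with λ to be tuned empirically):

    p_h(ca, q) = λ · p_l(ca, q) + (1 − λ) · p_t(ca, q)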

17
Experiments
IV. Experiments
  • DBLP collection
  • Limitations
    • No abstracts or index terms
    • Hard to represent a document
  • Representation of documents
    • Use Google Scholar for data supplementation
    • Use each title as a query and crawl the top 10 returned records
    • Up to 20 GB of metadata (HTML pages)
    • Includes the citation number of each publication

18
Topic Collection
IV. Experiments
  • 2,498 well-defined topics from eventseer
  • Crawl the top 100 returned records from Google
    Scholar

19
Benchmark Dataset
IV. Experiments
  • A benchmark dataset with 7 topics and expert lists

20
Evaluation Metrics
IV. Experiments
  • Precision at rank n (P@n)
  • Mean Average Precision (MAP)
  • Bpref: a score based on the number of judged non-relevant candidates ranked above relevant ones
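A minimal sketch of these metrics for one query's ranked candidate list; relevant and nonrelevant are the judged sets, and all names are illustrative.

    def precision_at_n(ranked, relevant, n):
        # P@n: fraction of the top-n candidates that are relevant.
        return sum(1 for c in ranked[:n] if c in relevant) / n

    def average_precision(ranked, relevant):
        # AP: mean of P@k over the ranks k that hold a relevant candidate;
        # MAP is this value averaged over all queries.
        hits, total = 0, 0.0
        for k, c in enumerate(ranked, start=1):
            if c in relevant:
                hits += 1
                total += hits / k
        return total / len(relevant) if relevant else 0.0

    def bpref(ranked, relevant, nonrelevant):
        # Bpref (TREC): credit each relevant candidate by how few judged
        # non-relevant candidates were ranked above it.
        r = len(relevant)
        denom = min(r, len(nonrelevant)) or 1
        score, nonrel_seen = 0.0, 0
        for c in ranked:
            if c in relevant:
                score += 1.0 - min(nonrel_seen, r) / denom
            elif c in nonrelevant:
                nonrel_seen += 1
        return score / r if r else 0.0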

21
Preliminary Experiments
IV. Experiments
  • Performed on two corpora using the basic language model (B1)
    • Title corpus: uses only the title
    • GS corpus: uses the Google Scholar representation
  • Evaluation results on the two corpora (%) [table in slides]
  • It is more effective to represent d using Google Scholar

22
Model 1 Statistical Language Models
IV. Experiments
  • Evaluation results of the language models [table in slides]
  • The weighted language models B3 and B2 outperform B1
  • It is important to consider the prior probability

23
Model 2 Topic-based Models
IV. Experiments
  • Vary the number of topics K from 5 to 100
  • Results using different values of K [figure in slides]
  • The number of topics is cut off automatically for T2 and T3
24
Model 2 Topic-based Models
IV. Experiments
  • Comparison of the three topic-based models [table in slides]

25
Model 3 Hybrid Models
IV. Experiments
  • Evaluation results of the hybrid model [table in slides]
  • The hybrid model outperforms the pure language model and the topic-based model on most metrics

26
Conclusions and Future Work
  • Conclusions
    • Addressed the expert finding task in a real-world academic field
    • Proposed a weighted language model
    • Investigated a topic-based model to interpret the expert finding task
    • Integrated the language model with the topic-based model
    • Demonstrated that the hybrid model achieves the best performance in the evaluation
  • Future work
    • Take other types of information into account
    • Refine the results using social network analysis

27
Q&A
  • Thanks!

28
Comparison to Other Systems
  • Evaluation results of our language models and the TS method [table in slides]

29
Example results