Title: Statistical Machine Translation Models for Personalized Search

1
Statistical Machine Translation Models for
Personalized Search

Rohini U
AOL India R&D, Bangalore, India
Rohini.uppuluri_at_corp.aol.com
Vamshi Ambati
Language Technologies Institute
Carnegie Mellon University, Pittsburgh, USA
vamshi_at_cs.cmu.edu
Vasudeva Varma
SIEL, LTRC, IIIT Hyderabad, India
vv_at_iiit.ac.in

2
Agenda

Introduction
Related Work
Background
User Profile as Translation Model
Personalized Search
Learning User Profile
Re-ranking
Experiments
Conclusions and Future Work

3
Introduction

Current Web search engines
Provide users with documents relevant to their information need
Issues
Information overload
Must cater to hundreds of millions of users
Terabytes of data
Poor description of information need
Short queries: difficult to understand
Word ambiguities
Users only see the top few results
Relevance
Subjective: depends on the user
One size fits all?

4
Continued..

Search is not a solved problem!
Poorly described information need
Java (Java island / Java programming language)
Jaguar (cat / car)
Lemur (animal / Lemur toolkit)
SBH (State Bank of Hyderabad / Syracuse Behavioral Healthcare)
Given prior information
"I am into biology": best guess for Jaguar?
Past queries "information retrieval", "language modeling": best guess for Lemur?

5
Review of Personalized Search

Personalized Search approaches
Query logs
Machine learning
Language modeling
Community based
Others

6
Statistical Language Modeling based Approaches
Introduction

Statistical language modeling: the task of estimating a probability distribution that captures the statistical regularities of natural language
Applied to a number of problems: speech, machine translation, IR, summarization

7
Statistical Language Modeling based Approaches
Background
Given a query, which document is most likely to be the Ideal Document?
$P(Q/D) = P(q_1 \ldots q_n/D) = \prod_{i=1}^{n} P(q_i/D)$
[Diagram labels: User Information Need, Ideal Document, Query Formulation Model, Query (example: "Lemur")]
In spite of the progress, there has been little work to capture, model, and integrate user context.
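
A minimal sketch of the query likelihood score P(Q/D) as a product of per-word probabilities, computed in log space. The function name `query_log_likelihood`, the parameter `lam`, and the use of Jelinek-Mercer smoothing against the whole collection are assumptions made for illustration; the slides do not say which smoothing method was used.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection, lam=0.5):
    """Return log P(Q/D) = sum_i log P(q_i/D).

    P(q_i/D) is a mix of the document and collection relative
    frequencies (Jelinek-Mercer smoothing, an assumption here).
    """
    doc_tf = Counter(doc)
    coll_tf = Counter(collection)
    score = 0.0
    for q in query:
        p_doc = doc_tf[q] / len(doc) if doc else 0.0
        p_coll = coll_tf[q] / len(collection)
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0.0:
            # Query word unseen everywhere: the product is zero.
            return float("-inf")
        score += math.log(p)
    return score

d_animal = "lemur encyclopedia gives a brief description of this animal".split()
d_toolkit = "the lemur toolkit for language modeling and information retrieval".split()
collection = d_animal + d_toolkit
query = "lemur toolkit".split()

scores = {
    "animal": query_log_likelihood(query, d_animal, collection),
    "toolkit": query_log_likelihood(query, d_toolkit, collection),
}
```

With the unambiguous query "lemur toolkit" the toolkit page scores higher. With the bare query "lemur" both pages tie, which is the gap that user context is meant to fill.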
8
Noisy Channel based approach
Motivation

[Diagram labels: Ideal Document, Query Generation Process (Noisy Channel), Retrieval]
9
Similar to Statistical Machine Translation

Given an English sentence, translate it into French
Given a query, retrieve documents closer to the ideal document

Noisy channel 1: English sentence and French sentence, P(e/f)
Noisy channel 2: Ideal Document and Query, P(q/w)
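
The analogy can be made explicit with Bayes' rule. This is the standard noisy channel decomposition, written in the slides' slash notation for conditioning; it is not spelled out on the slide itself, and the document prior P(D) is an assumption (it is often taken to be uniform).

```latex
\hat{e} = \arg\max_{e} P(e/f) = \arg\max_{e} P(f/e)\, P(e)
\qquad
\hat{D} = \arg\max_{D} P(D/Q) = \arg\max_{D} P(Q/D)\, P(D)
```

In both cases the observed output (French sentence, query) is fixed, so its own probability drops out of the maximization.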
10
Learning user profile

User profile as a translation model
Triples (qw, dw, p(qw/dw)): query word, document word, translation probability
Use Statistical Machine Translation methods
Learning a user profile: training a translation model
In SMT, a translation model is trained
From parallel texts
Using the EM algorithm
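
A minimal sketch of EM training for an IBM Model 1 style translation table, which yields the (qw, dw, p(qw/dw)) triples described above. The function name `train_ibm_model1` and the toy pairs are hypothetical, and the NULL source word used in the full model is omitted for brevity.

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=10):
    """pairs: list of (query_words, doc_words) token lists.

    Returns a dict t with t[(qw, dw)] = p(qw/dw).
    """
    query_vocab = {qw for q_words, _ in pairs for qw in q_words}
    # Uniform initialisation over the query vocabulary.
    t = defaultdict(lambda: 1.0 / len(query_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of (qw, dw) alignments
        total = defaultdict(float)   # expected count of dw being aligned
        # E-step: distribute each query word over the document words.
        for q_words, d_words in pairs:
            for qw in q_words:
                norm = sum(t[(qw, dw)] for dw in d_words)
                for dw in d_words:
                    frac = t[(qw, dw)] / norm
                    count[(qw, dw)] += frac
                    total[dw] += frac
        # M-step: renormalise so that sum over qw of p(qw/dw) is 1.
        t = defaultdict(float, {
            (qw, dw): c / total[dw] for (qw, dw), c in count.items()
        })
    return dict(t)

# Toy "parallel text": (query, snippet of a clicked document).
pairs = [
    (["lemur", "toolkit"], ["lemur", "retrieval"]),
    (["lemur"], ["lemur"]),
]
profile = train_ibm_model1(pairs)
```

On this toy data the second pair pins "lemur" to "lemur", so EM shifts the probability mass of "toolkit" onto "retrieval".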

11
Learning User profile

Extracting parallel texts
From queries and the corresponding snippets of clicked documents
Training a translation model
GIZA++: an open source toolkit widely used for training translation models in Statistical Machine Translation research.
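
A minimal sketch of turning a click log into parallel texts, one (query, snippet) pair per clicked result. The log layout (a list of dicts with `query` and `clicked_snippets` keys) and the function names are hypothetical; the slides do not describe the log format or the tokenization used.

```python
import re

def tokenize(text):
    """Lowercase and keep alphanumeric tokens only (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def extract_parallel_texts(click_log):
    """Return two line-aligned lists: query side and snippet side."""
    query_side, snippet_side = [], []
    for entry in click_log:
        q_tokens = tokenize(entry["query"])
        for snippet in entry["clicked_snippets"]:
            s_tokens = tokenize(snippet)
            if q_tokens and s_tokens:
                query_side.append(" ".join(q_tokens))
                snippet_side.append(" ".join(s_tokens))
    return query_side, snippet_side

click_log = [
    {"query": "Lemur",
     "clicked_snippets": ["The Lemur toolkit for language modeling and "
                          "information retrieval is documented."]},
    {"query": "jaguar", "clicked_snippets": []},  # no clicks, no pairs
]
query_side, snippet_side = extract_parallel_texts(click_log)
```

Line i of the query side pairs with line i of the snippet side, the usual layout for a parallel corpus handed to an alignment tool.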

12
Sample user profile
13
Reranking

Recall, in general LM for IR
Noisy Channel based approach

$P(Q/D) = \prod_{i=1}^{n} P(q_i/D)$

lemur
P(lemur/retrieval)
Lemur encyclopedia brief
Lemur toolkit information retrieval
Lemur - Encyclopedia gives a brief description of
the physical traits of this animal.
The Lemur toolkit for language modeling and
information retrieval is documented and made
available for download.
D1
D4
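
A minimal sketch of re-ranking with a user profile. It assumes the translation-based form P(Q/D) = prod_i sum_dw p(q_i/dw) P(dw/D), with P(dw/D) taken as relative frequency; the slides do not spell out the exact scoring formula. The profile values, the `floor` constant, and the function names are hypothetical.

```python
import math
from collections import Counter

def profile_score(query, doc, profile, floor=1e-6):
    """log P(Q/D) under a profile of p(qw/dw) values.

    floor avoids log(0) when no document word translates to a query word.
    """
    tf = Counter(doc)
    n = len(doc)
    log_p = 0.0
    for qw in query:
        p = sum(profile.get((qw, dw), 0.0) * c / n for dw, c in tf.items())
        log_p += math.log(max(p, floor))
    return log_p

def rerank(query, docs, profile):
    """docs: dict of name -> token list. Returns names, best first."""
    return sorted(docs, key=lambda name: profile_score(query, docs[name], profile),
                  reverse=True)

# Hypothetical profile of a user whose past queries were about IR.
profile = {
    ("lemur", "lemur"): 0.4,
    ("lemur", "retrieval"): 0.3,
    ("lemur", "toolkit"): 0.2,
    ("lemur", "animal"): 0.01,
}
docs = {
    "D1": ("lemur encyclopedia gives a brief description of the "
           "physical traits of this animal").split(),
    "D4": ("the lemur toolkit for language modeling and information retrieval "
           "is documented and made available for download").split(),
}
ranking = rerank(["lemur"], docs, profile)
```

For this user the toolkit page (D4) moves above the encyclopedia page (D1), because p(lemur/retrieval) and p(lemur/toolkit) add to its score.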
14
Experiments

Performed evaluation on explicit feedback data
collected from 7 users
Experiments
Comparison with Contextless Ranking
Comparison between different training models and
contexts

15

Data and Set up

Data
Explicit feedback data collected from 7 users
For each query, each user examined the top 10 documents and identified the relevant ones
Collected the top 10 results for all queries.
Total: 3469 documents
Set up
Created a Lucene index over the 3469 documents.
For reranking, first retrieve the results using Lucene and then rerank them using the noisy channel approach.
We perform 10-fold cross-validation.
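
A minimal sketch of a 10-fold split over one user's queries. Splitting at the query level, per user, is an assumption; the slides only state that 10-fold cross-validation was performed. The function name `k_fold_splits` is hypothetical.

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) lists; each item is tested exactly once."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Example: 37 query ids for one user.
queries = list(range(37))
splits = list(k_fold_splits(queries))
```

In each round the profile would be learned from the training queries and the re-ranker evaluated on the held-out ones.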

16
Data
User   Queries (Q)   Unique words in Q   Total Rel   Avg. Rel
1      37            89                  236         6.378
2      50            68.42               178         3.56
3      61            82.63               298         4.885
4      26            86.95               101         3.884
5      33            80.76               134         4.06
6      29            78.08               98          3.379
7      29            88.31               115         3.965
17
Metrics

Precision@n
Number of relevant documents in the top n / n
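
A minimal sketch of the metric, assuming relevance judgments are available as a set of document ids. The function name `precision_at_n` and the example ids are hypothetical.

```python
def precision_at_n(ranked_docs, relevant, n=10):
    """Fraction of the top n ranked documents that are relevant."""
    if n <= 0:
        raise ValueError("n must be positive")
    top = ranked_docs[:n]
    return sum(1 for d in top if d in relevant) / n

ranked = ["D4", "D1", "D2", "D3"]
relevant = {"D4", "D2"}
p_at_2 = precision_at_n(ranked, relevant, n=2)
p_at_4 = precision_at_n(ranked, relevant, n=4)
```

Dividing by n rather than by the number of results returned means a short result list is penalized for the missing positions.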

18
Set up
[Diagram labels: Data, Train Data, Test Data, User Profile Learner, User Profiles, Reranker, Reranked Results]
19
User   Contextless   Proposed
1      0.1433        0.2445
2      0.1426        0.2445
3      0.1016        0.1216
4      0.0557        0.1541
5      0.1887        0.3933
6      0.1566        0.3941
7      0.1           0.1833
Avg    0.1268        0.2332
20
Results
Training model   IBM Model 1                      GIZA++
                 Document Train   Snippet Train   Document Train   Snippet Train
Document Test    0.2062           0.2333          0.1799           0.2075
Snippet Test     0.2028           0.2488          0.1834           0.2034
21
Results
I - Document Training and Document Testing
II - Document Training and Snippet Testing
III - Snippet Training and Document Testing
IV - Snippet Training and Snippet Testing
22
Conclusions and Future Work

Proposed a statistical MT based approach for user modeling
Captures richer context: relations between q and w.
In future:
N-gram based method: trigrams etc.
Noisy channel based method: bigrams

Questions?

Thank you
