Personalized Profile Based Search Interface With Ranked and Clustered Display

Slides: 51

Provided by: sac6

1

Personalized Profile Based Search Interface With
Ranked and Clustered Display

M.S. Thesis Defense: Sachin Kumar
Advisor: Dr. Vipin Kumar
2
Acknowledgements

I would like to thank
Prof. Vipin Kumar
For his continuous guidance and invaluable
inspiration.
Eui-Hong Han
For working closely, suggesting innovative ideas,
and providing implementation level guidance.
B. Uygar Oztekin
For providing a command-line version of the
meta-search engine Mearf tailored to the specific
needs of the Scout system.
Levent Ertoz
For providing the SNN algorithm for clustering and
the brute-force phrase generation algorithm for
indexing.
Saurabh Singhal and Eric Eilertson
For testing, and for providing feedback and ideas to
improve the system.

3
Objective

Design and implement a search interface
capable of re-ranking results
based on the user's profile
Extend the system with the additional
capability of clustering or indexing the
original results and of performing relevance
feedback
Keep the design modular so that future developers
can easily plug in and test their algorithms.

4
Problem Definition

Huge Amount of Online Information
In response to a query, thousands of documents
are matched, and a few hundreds are provided by
search engines in a ranked order
A typical user can only browse through the top few
items
Top-ranked documents may be irrelevant in the case of
Broad-topic queries
Multi-modal queries
Imprecise queries, etc.

5
Possible Solutions-reranking

Customize the Ranking
Profile built for each user
Example query on language
User 1 (Software Engineer) sees programming
language related pages at the top
User 2 (from Liberal Arts) sees natural language
related pages at the top
Drawbacks of profile-based reranking
Profile may saturate over time
User will have difficulty searching in a
field that contrasts with his/her profile

6
Possible Solutions-clustering

Document Clustering
Different themes in the different clusters
of documents
Number of clusters would be much less than the
number of documents.
Easy to browse.
Potential problems of Clustering
Label selection
Snippet tolerance
Only one path to reach the document

7
Possible Solutions-indexing

Indexing
Main phrases or topics among documents are
chosen as index
All documents containing the index terms are put
under that index.
Index can be further divided into sub indices
The same document can be present at different
leaf nodes
Drawback of Indexing
Redundancy
May be very computation intensive to generate the
phrases or topics, especially when we are working
with whole documents

8
Possible Solutions-relevance Feedback

Relevance Feedback
Helpful when initial query is vague
User specifies his/her likes and dislikes
A new query is suggested to the user and a new set of
results is displayed
Drawbacks of relevance feedback
Requires extra effort from the user to give
feedback.
If a document contains junk words, they might affect
the new set of results.

9
Profiles- Past Research

Various personalized information filtering agents
and systems
Content based Filtering systems
Syskill & Webert
Learns by user explicitly telling about the good
links
Agent tells if the links on present page are
interesting
Amalthaea (A. Moukas and P. Maes, 1998)
Learns from the behavior of the user
Uses filtering agents and discovery agents. Filtering
agents help discovery agents
Social filtering or collaborative systems
Let's Browse (Lieberman, Van Dyke, Vivacqua, 1999)
Learns from the behavior of the users
Users linked by infrared transmitter. Local
reconnaissance
Adaptive web site agents (Pazzani & Billsus,
1999)
Limited to a single web site. Learns from past
and current users' actions. Suggests
documents as the user browses

10
Profiles- Past Research

Continued
Assisted Browsing Systems
Letizia(Lieberman H, 1995)
Assists only in browsing. Learns by browsing
pattern of user
Keeps quiet until it finds enough information to
tell. Guides through its own window.
WebMate(Liren Chen, Katia Sycara, 1998)
Learns by explicit relevance feedback
Consists of stand-alone proxy and applet
controller as an interface while user browses
WebGlimpse (Udi Manber Mike, 1997)
Learns while user browses.
Creates neighborhood and adds recommendations to
the current pages
WebWatcher (Armstrong, Freitag, Joachims, 1995)
Learns by browsing pattern of user
Search and browse guide, guides user by inserting
its markup

11
Profiles- Past Research

Continued
Other systems
ARIADNE (Twidale Michael B et al, 1995)
GroupWeb (Greenberg, S. and Roseman, M., 1996)
Jasper (Davies, Weeks, Revett, 1996)
Pluribus (Schapira, 1999)
SearchPad (Krishna Bharat, 2001)
Select (Alton-Scheidl et al., 1999)
WebHound (Lashkari Y. 1995)

12
Clustering- Past Research

This approach has been investigated in various papers
and incorporated in a few meta-search engines
Grouper (O. Zamir, O. Etzioni, 1998)
Meta search engine based on STC Algorithm
Manjara (Kannan R. and Vinay V., Yale)
a meta search engine that uses SVD-based
clustering technique
Interactive Track Interface using Scatter/Gather
(Cutting Douglass R., Karger David R., 1996)
Query based browsing system based on
scatter-gather paradigm.
The Paraphrase interface (P.Anick,
S.Vaithyanathan,1997)
Based on SVD algorithm

13
Indexing- Past Research

Used by meta-search engines e.g.
Vivisimo (www.vivisimo.com)
uses a form of conceptual clustering
MSEEC-Multi Search Engine with Multiple
Clustering
theme detection algorithm and LZW compression
method
Infogistics (www.infonetware.com)
Statistical, linguistic and conceptual analysis
to break document collection into topics and sub
topics
Morphological and syntactic transformations to
unify phrases according to a language grammar.

14
Introduction to Scout

A meta-search engine with an integrated interface
offering profile-based re-ranking,
clustering, indexing and relevance feedback.
Makes it easy to plug in and test new algorithms
Advanced users can control the various parameters
through an advanced interface.
The Scout architecture is given on the next slide

15
(Scout architecture diagram; no transcript)
16
Brief introduction to Mearf

Scout is powered by Mearf, an optimized meta-search
engine based on expert agreement and content-based
reranking, which has the ability to combine
multiple methods and quality measures
Reference: mearf.cs.umn.edu
(Submitted to CIKM 2001) B. U. Oztekin,
G. Karypis, V. Kumar
Supports various search engines: Google, Excite,
Altavista, Directhit
Intelligent advertisement removal and duplicate
elimination module

17
Profile Based Reranking

Profile Storage
Each user is given a unique user ID.
For each user, in each search session, query and
all the documents visited are appended to the
past search session for that user.
Old profile entries are phased out to keep the profile
unsaturated and adapting to the changing interests of
the user.
Example of a profile stored on the server:
992729098 militari intellig 0 1 2
992728760 militari data mine 3 4 5 6
0 992729229 http://www.oss.net/Papers/hackers/Infofare.html inform warfar librarian frontlin imageri nation secur agenc data mine internet joint nation militari intellig command line librarian
1 992729173 http://www.dnd.ca/somalia/vol3/v3c25ce.htm militari plan system major threat indiscrimin mine former barr armi mission constitut failur militari intellig col labb

18
Profile Based Reranking

Profile Vectors
The process acquires the profile information from the
server to create the profile vectors.
Profile vectors consist of the user's profile vector
and the group profile vector.
Creation
Pick the snippets from the profile file(s)
whose corresponding query partially matches the
present query
Each such snippet is converted into a vector
based on the vector space model.
Each vector thus obtained is compared with the
present query and the top 100 documents are selected
The centroid of these top documents represents the
profile vector.
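The creation steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `to_vector`, `centroid`, and `profile_vector` are hypothetical, raw term frequencies stand in for whatever term weighting Scout actually uses, and "partially matches" is interpreted as sharing at least one (stemmed) term with the present query.

```python
import math
from collections import Counter

def to_vector(text):
    """Bag-of-words term-frequency vector for an already-stemmed snippet."""
    return Counter(text.split())

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def centroid(vectors):
    """Component-wise mean of a list of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)  # Counter.update adds counts term by term
    n = len(vectors)
    return {t: w / n for t, w in total.items()} if n else {}

def profile_vector(query, profile, top_k=100):
    """profile is a list of (past_query, [snippet, ...]) pairs (assumed layout)."""
    q_terms = set(query.split())
    q_vec = to_vector(query)
    # Keep snippets whose past query shares a term with the present query.
    candidates = [to_vector(s)
                  for past_q, snippets in profile
                  if q_terms & set(past_q.split())
                  for s in snippets]
    # Rank candidates against the present query and average the top ones.
    candidates.sort(key=lambda v: cosine(v, q_vec), reverse=True)
    return centroid(candidates[:top_k])
```

The same routine could be run once over a single user's profile and once over the pooled profiles of a group to obtain the two profile vectors.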

19
Profile Based Reranking

Original rank vector
It is obtained by taking the centroid of the top (at
present 10) snippet vectors from the present
session.
Reranking Measure
Reranking measure(doc_i) =
alpha * cosine(doc_i, user profile vector)
+ beta * cosine(doc_i, group profile vector)
+ gamma * cosine(doc_i, original rank vector)
Where
doc_i is the i-th document in the present session
alpha, beta and gamma are constants with the
values chosen as 3, 2 and 1 respectively.
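A minimal sketch of this weighted-cosine reranking, assuming each document is a `(title, vector)` pair with the vector stored as a term-to-weight dict. The function name `rerank` and the data layout are illustrative choices, not taken from the Scout implementation; only the weights 3, 2, 1 come from the slide.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rerank(docs, user_vec, group_vec, original_vec,
           alpha=3.0, beta=2.0, gamma=1.0):
    """Sort (title, vector) pairs by the weighted sum of three cosines."""
    def score(vec):
        return (alpha * cosine(vec, user_vec)
                + beta * cosine(vec, group_vec)
                + gamma * cosine(vec, original_vec))
    return sorted(docs, key=lambda d: score(d[1]), reverse=True)

docs = [
    ("travel", {"travel": 1, "languag": 1}),
    ("python", {"python": 1, "languag": 1}),
]
user = {"python": 1, "program": 1}       # software-oriented user profile
group = {"languag": 1}
original = {"travel": 1, "languag": 1}   # centroid of the original top results
ranked = rerank(docs, user, group, original)
```

With these toy vectors the programming-related page moves ahead of the travel page, because the user-profile term carries the largest weight.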

20
Interactive Profile Reranking

Instead of the stored profile, the user can request
reranking on user-specified keywords through the
interface.
Reranking Measure
Reranking measure(doc_i) =
alpha * cosine(doc_i, interactive vector)
+ beta * cosine(doc_i, original rank vector)
Where
doc_i is the i-th document in the present session
alpha and beta are constants with the values
chosen as 1 and 1 respectively after experiments.

21
Clustering Algorithms

The following algorithms were implemented
K-means
Bisecting K-means
SNN algorithm
Uses the overlap between the nearest neighbors of
each pair of snippets as a measure of similarity
Built-in noise removal mechanism
Allows overlapping clusters
Reference: Ertöz, L., Steinbach, M., Kumar, V.,
"Finding Topics in Collections of Documents: A
Shared Nearest Neighbor Approach", First SIAM
International Workshop on Data Mining, 2001
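The shared-nearest-neighbor similarity mentioned above can be illustrated as follows. This is a simplified sketch, not the algorithm from the cited paper: it only computes the overlap between k-nearest-neighbor lists, and it omits the noise removal and cluster formation steps. The function names and the use of cosine similarity to find neighbors are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def knn_lists(vectors, k):
    """For each snippet, the set of indices of its k most similar snippets."""
    neighbors = []
    for i, v in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        others.sort(key=lambda j: cosine(v, vectors[j]), reverse=True)
        neighbors.append(set(others[:k]))
    return neighbors

def snn_similarity(vectors, k):
    """Pairwise similarity = number of shared k-nearest neighbors."""
    nn = knn_lists(vectors, k)
    n = len(vectors)
    sim = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim[i][j] = sim[j][i] = len(nn[i] & nn[j])
    return sim
```

Snippets whose shared-neighbor counts fall below a threshold could then be treated as noise, which is one way to read the "built-in noise removal" item.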

22
Clustering Algorithms (continued)

For the k-means algorithm, these variations were
tried out.
Nouns in cluster titles
Non-noun words were filtered out of the titles
Noun phrases in cluster titles
Each vector was augmented with noun phrases so
that they could appear in cluster titles.
The above two variations used the Brill tagger for
generating nouns and noun phrases.
Modifying clusters by changing word weights
Through the interface, the user could modify the weight
of title words to influence the clustering
algorithm. This gave the user some control over the
algorithm.

23
Indexing

Index Terms Generation Algorithms
Brute force sequential phrase generation
Sequential phrase generation using word
clustering
All snippets containing the index term are
grouped under that index.
Documents can be present under more than one
index term
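A rough sketch of brute-force sequential phrase generation and the grouping step, under stated assumptions: every contiguous word sequence up to `max_len` words is a candidate phrase, and a phrase becomes an index term only if it occurs in at least `min_docs` snippets. Both parameters and the function names are hypothetical; the slides do not specify how Scout selects index terms.

```python
from collections import defaultdict

def phrases(tokens, max_len=3):
    """Yield every contiguous word sequence of 1..max_len tokens."""
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def build_index(snippets, max_len=3, min_docs=2):
    """Map each retained phrase to the sorted ids of snippets containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(snippets):
        for p in phrases(text.lower().split(), max_len):
            index[p].add(doc_id)
    # Keep only phrases shared by enough snippets to be useful index terms.
    return {p: sorted(ids) for p, ids in index.items() if len(ids) >= min_docs}

snippets = [
    "Data Mining Software Tools",
    "Predictive Data Mining Software",
    "Python Language Website",
]
index = build_index(snippets)
```

Here snippets 0 and 1 both appear under "data mining", "software", and "data mining software", which shows how one document can sit under several index terms.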

24
Relevance Feedback Module

Current algorithm used
Rocchio's query reformulation algorithm, which is
described here in brief.
$$Q_1 = Q_0 + \beta \sum_{i} R_i / n_1 - \gamma \sum_{i} S_i / n_2$$
where
$Q_1$ is the vector for the final query
$Q_0$ is the vector for the initial query
$R_i$ is the vector for the relevant document $i$
$S_i$ is the vector for the non-relevant
document $i$
$n_1$ is the number of relevant documents chosen
$n_2$ is the number of non-relevant documents
chosen
$\beta$, $\gamma$ are weights that adjust the contributions of
the relevant and non-relevant document vectors.
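Rocchio's update can be sketched over sparse term-weight dicts as below. The default weights 0.75 and 0.25 are illustrative values only (the slides do not state the ones used in Scout), and dropping terms whose weight ends up non-positive is a common convention assumed here rather than something the slides confirm.

```python
from collections import defaultdict

def rocchio(q0, relevant, non_relevant, beta=0.75, gamma=0.25):
    """Return the reformulated query vector Q1 as a term -> weight dict."""
    q1 = defaultdict(float)
    for term, w in q0.items():
        q1[term] += w
    # Add the mean of the relevant document vectors, scaled by beta.
    for doc in relevant:
        for term, w in doc.items():
            q1[term] += beta * w / len(relevant)
    # Subtract the mean of the non-relevant document vectors, scaled by gamma.
    for doc in non_relevant:
        for term, w in doc.items():
            q1[term] -= gamma * w / len(non_relevant)
    # Assumption: terms with non-positive weight are dropped from the query.
    return {t: w for t, w in q1.items() if w > 0}

q0 = {"language": 1.0}
relevant = [{"xml": 1.0, "language": 1.0}, {"python": 1.0}]
non_relevant = [{"travel": 1.0}]
q1 = rocchio(q0, relevant, non_relevant)
```

The highest-weighted terms of the resulting vector could then be offered to the user as the suggested new query.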

25
Qualitative Evaluation
26
(No Transcript)
27
(No Transcript)
28
Query language, (Original results)
29
Query language, (Original results)

1. Foreign Languages for Travelers
2. travlang Foreign Language for Travelers
3. The Human-Languages Page is now iLoveLanguages
4. World Language Resources
5. Search Language Language Resources Directory
6. Jennifer's language page
7. Language -Learning.net
8. Internet Activities for Foreign Language
Classes
9. Python Language Website
10. HTML Hyper Text Markup Language
Out of the top 10 results, only 3 are relevant.
Precision: 3/10 = 30%

30
Influence of user past profile and interactive
profile on the ranking of results.
31
Profile based reranking

Query on Language (reranked results)
1. Extensible markup language (XML) (54)
2. Extensible markup language (XML) (88)
3. Java Language Specification (62)
4. Foreign language for travelers (1)
5. Travlang (2)
6. Python Language Website (44)
7. HTML Hyper Text Markup Language (10)
8. FAQ about Extensible Markup (32)
9. Intelligent User Interfaces (53)
10. Python Language Website (9)
Out of 10, 7 links are related to software.
Precision: 7/10 = 70%

32
Profile based reranking

All italicized, highlighted links are related to
baseball by their content.
Improvement in precision: from 20% to 60%

33
Profile based reranking

Improvement in precision: from 20% to 70%

34
Use of relevance feedback where the user
reformulates the query and gets new result set
35
Relevance Feedback (Language)

After relevance feedback
Query: extension markup xml language python
Fetched links:
1. Java XML 112
2. XML News From Robin Cover
3. Custom perl, python, java
4. JSX 0.8.4
5. Search Results for xmldir.com
6. XML XML XML Distributed computing resources
7. XML Spy
8. Tamino - the leading XML Server! Download
FREE Trial
9. Comments for Extensible Markup Language (XML)
10. irt.org - Extensible Markup Language

36
Relevance Feedback (Mining)

After relevance feedback
Query: software of the data mining
Fetched links:
Mining Data?
CodeWeb Data Mining Software Development
Experience
Knowledge Based Systems, Knowledge Management and
Data Mining
Data Mining Software , the key to business
intelligence
Siftware Tools for Data Mining and Knowledge
Discovery
AZMY Thinkware -- Data Analysis and Mining
Software Tools
Predictive Data Mining Software
KDnuggets Data Mining , Web Mining , and
Knowledge Discovery ... ...
Data Mining Software ...
Xanalys Intelligence Software Providing Data
Mining , Data Analysis

37

Qualitative Evaluation

Clustering
No other search engine is available online against
which to compare the clustering results.
Evaluation
Tested by running some queries and visually inspecting
the results.
Some of the results are shown here.

38
Clustering on query computer science along with
the original results for this query.
39
Clustering on query Sports

Clusters generated by the Scout

40
Clustering on query raging bull

Clusters generated by Scout

41

Qualitative Evaluation

Indexing
Evaluation
Tested by inspecting the results of some queries.
The same queries were also run on other existing
systems such as www.vivisimo.com and
www.infonetware.com, and the results were compared
visually.
Some of the results are shown here.

42
Indexing on query computer science along with
original results for this query.
43
Indexing on query computer science along with
original results for this query.
44
Indexing on query computer science along with
original results for this query.
45
Indexing on query Sports Vivisimo Scout
Infonetware
46
Indexing on query raging bull Vivisimo
Scout Infonetware
47
Quantitative analysis

Quantitative analysis is difficult
because there is no established parameter to measure, but
the following approaches are suggested.
Statistical analysis
Procedure
The system could be exposed to a large number of
users and, as an experiment, their actions could
be monitored over time. In the end, the data can
be analyzed to find out which method obtained the
maximum click-through.
Problem
A short M.S. thesis project does not leave
enough time to popularize the
system among people and to run it long enough
to gather sufficient data.

48
Quantitative analysis (Continued)

Simulation Method
Procedure
The Scout interface could be coupled to a primary
search engine built on top of the TREC dataset
(an experimental data set meant for IR research).
Queries defined in the data set could be
performed, and the retrieved documents could be
compared with the expected retrieval declared by
TREC.
Problem
No such search engine is available right now. We
are in the process of building our own primary
search engine for research purposes in line with
this present work. In the future we would be
able to perform such tests after indexing the
TREC or similar experimental text data.

49
Conclusion

Presented the architecture of Scout
The system allows the user to effectively explore a
large number of results in various ways using
reranking, clustering and indexing
Allows the presentation of results to be
influenced by the user's profile
The Scout architecture allows various
algorithms to be plugged in easily

50
Future Work