Title: Personalized Profile Based Search Interface With Ranked and Clustered Display
1- Personalized Profile Based Search Interface With
Ranked and Clustered Display
M.S. Thesis Defense: Sachin Kumar
Advisor: Dr. Vipin Kumar
2Acknowledgements
- I would like to thank
- Prof. Vipin Kumar
- For his continuous guidance and invaluable inspiration.
- Eui-Hong Han
- For working closely, suggesting innovative ideas, and providing implementation-level guidance.
- B. Uygar Oztekin
- For providing a command-line version of the meta-search engine Mearf, adapted to the specific needs of the Scout system.
- Levent Ertoz
- For providing the SNN algorithm for clustering and the brute-force phrase generation algorithm for indexing.
- Saurabh Singhal and Eric Eilertson
- For testing the system and providing feedback and ideas to improve it.
3Objective
- Design and implement a search interface capable of re-ranking results based on the user's profile
- Develop a system with the additional capability of clustering or indexing the original results and of performing relevance feedback
- Modular design that lets future developers easily plug in and test their algorithms.
4Problem Definition
- Huge amount of online information
- In response to a query, thousands of documents are matched, and a few hundred are returned by search engines in ranked order
- A typical user can browse only the top few items
- Top-ranked documents may be irrelevant in the case of
- Broad-topic queries
- Multi-modal queries
- Imprecise queries, etc.
5Possible Solutions-reranking
- Customize the ranking
- A profile is built for each user
- Example: query on "language"
- User 1 (software engineer) sees programming-language pages at the top
- User 2 (from liberal arts) sees natural-language pages at the top
- Drawbacks of profile-based reranking
- The profile may saturate over time
- The user will have difficulty searching in a field that contrasts with his or her profile
6Possible Solutions-clustering
- Document clustering
- Different themes fall into different clusters of documents
- The number of clusters is much smaller than the number of documents.
- Easy to browse.
- Potential problems of clustering
- Label selection
- Snippet tolerance
- Only one path to reach a document
7Possible Solutions-indexing
- Indexing
- The main phrases or topics among the documents are chosen as index terms
- All documents containing an index term are put under that index.
- An index can be further divided into sub-indices
- The same document can be present at different leaf nodes
- Drawbacks of indexing
- Redundancy
- Generating the phrases or topics may be very computation intensive, especially when working with whole documents
8Possible Solutions-relevance Feedback
- Relevance feedback
- Helpful when the initial query is vague
- The user specifies his or her likes and dislikes
- A new query is suggested to the user and a new set of results is displayed
- Drawbacks of relevance feedback
- Requires extra effort from the user to give feedback.
- If a document contains junk words, they might affect the new set of results.
9Profiles- Past Research
- Various personalized information filtering agents and systems
- Content-based filtering systems
- Syskill & Webert
- Learns from the user explicitly marking good links
- The agent tells whether the links on the present page are interesting
- Amalthaea (A. Moukas and P. Maes, 1998)
- Learns from the behavior of the user
- Filtering agents and discovery agents. Filtering agents help discovery agents
- Social filtering or collaborative systems
- Let's Browse (Lieberman, Van Dyke, Vivacqua, 1999)
- Learns from the behavior of the users
- Users linked by infrared transmitter. Local reconnaissance
- Adaptive web site agents (Pazzani and Billsus, 1999)
- Limited to a single web site. Learns from the actions of past and current users, and suggests documents as the user browses
10Profiles- Past Research
- Continued
- Assisted browsing systems
- Letizia (Lieberman H., 1995)
- Assists only in browsing. Learns from the browsing pattern of the user
- Keeps quiet until it finds enough information to tell. Guides through its own window.
- WebMate (Liren Chen, Katia Sycara, 1998)
- Learns from explicit relevance feedback
- Consists of a stand-alone proxy and an applet controller as an interface while the user browses
- WebGlimpse (Udi Manber Mike, 1997)
- Learns while the user browses.
- Creates a neighborhood and adds recommendations to the current pages
- WebWatcher (Armstrong, Freitag, Joachims, 1995)
- Learns from the browsing pattern of the user
- Search and browse guide; guides the user by inserting its own markup
11Profiles- Past Research
- Continued
- Other systems
- ARIADNE (Twidale Michael B et al, 1995)
- GroupWeb (Greenberg, S. and Roseman, M., 1996)
- Jasper (Davies, Weeks, Revett, 1996)
- Pluribus (Schapira, 1999)
- SearchPad (Krishna Bharat, 2001)
- Select (Alton-Scheidl et al., 1999)
- WebHound (Lashkari Y. 1995)
12Clustering- Past Research
- This approach has been investigated in various papers and incorporated in a few meta-search engines
- Grouper (O. Zamir, O. Etzioni, 1998)
- Meta-search engine based on the STC algorithm
- Manjara (Kannan R. and Vinay V., Yale)
- A meta-search engine that uses an SVD-based clustering technique
- Interactive Track Interface using Scatter/Gather (Cutting Douglass R., Karger David R., 1996)
- Query-based browsing system based on the Scatter/Gather paradigm.
- The Paraphrase interface (P. Anick, S. Vaithyanathan, 1997)
- Based on the SVD algorithm
13Indexing- Past Research
- Used by meta-search engines, e.g.
- Vivisimo (www.vivisimo.com)
- Uses a form of conceptual clustering
- MSEEC (Multi Search Engine with Multiple Clustering)
- Theme detection algorithm and LZW compression method
- Infogistics (www.infonetware.com)
- Statistical, linguistic and conceptual analysis to break the document collection into topics and subtopics
- Morphological and syntactic transformations to unify phrases according to a language grammar.
14Introduction to Scout
- A meta-search engine with an integrated interface offering profile-based re-ranking, clustering, indexing and relevance feedback.
- Makes it easy to plug in and test new algorithms
- Advanced users can control the various parameters through an advanced interface.
- The Scout architecture is given on the next slide
15Scout Architecture (figure, no transcript)
16Brief introduction to Mearf
- Scout is powered by Mearf, an optimized meta-search engine based on expert agreement and content-based reranking, which can combine multiple methods and quality measures
- Reference: mearf.cs.umn.edu
- (Submitted to CIKM 2001) B. U. Oztekin, G. Karypis, V. Kumar
- Supports various search engines: Google, Excite, Altavista, Directhit
- Intelligent advertisement removal and duplicate elimination module
17Profile Based Reranking
- Profile Storage
- Each user is given a unique user ID.
- For each user, in each search session, the query and all the documents visited are appended to the past search sessions for that user.
- Old profile entries are phased out to keep the profile unsaturated and adaptive to the changing interests of the user.
- Example of a profile stored on the server
- 992729098 militari intellig 0 1 2
- 992728760 militari data mine 3 4 5 6
- 0 992729229 http://www.oss.net/Papers/hackers/Infofare.html inform warfar librarian frontlin imageri nation secur agenc data mine internet joint nation militari intellig command line librarian
- 1 992729173 http://www.dnd.ca/somalia/vol3/v3c25ce.htm militari plan system major threat indiscrimin mine former barr armi mission constitut failur militari intellig col labb
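The stored profile shown above could be read back with a small parser. The sketch below is only one plausible reading of the example: it assumes that a query line holds a timestamp, the stemmed query terms, and the IDs of the documents visited, and that a document line holds an ID, a timestamp, a URL, and the stemmed snippet terms. The class and function names, and the example URL, are hypothetical and not part of Scout.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueryEntry:
    timestamp: int
    terms: List[str]    # stemmed query terms
    doc_ids: List[int]  # assumed: IDs of documents visited for this query


@dataclass
class DocEntry:
    doc_id: int
    timestamp: int
    url: str
    terms: List[str]    # stemmed snippet terms


def parse_query_line(line: str) -> QueryEntry:
    tokens = line.split()
    timestamp = int(tokens[0])
    rest = tokens[1:]
    # Assumption: the trailing all-digit tokens are document IDs.
    split = len(rest)
    while split > 0 and rest[split - 1].isdigit():
        split -= 1
    return QueryEntry(timestamp, rest[:split], [int(t) for t in rest[split:]])


def parse_doc_line(line: str) -> DocEntry:
    tokens = line.split()
    return DocEntry(int(tokens[0]), int(tokens[1]), tokens[2], tokens[3:])


query = parse_query_line("992728760 militari data mine 3 4 5 6")
doc = parse_doc_line("0 992729229 http://example.org/page.html inform warfar librarian")
print(query.terms, query.doc_ids)
print(doc.url, doc.terms)
```

A query whose last stemmed term happens to be a number would be misread by this heuristic, so a real format would need an explicit separator.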
18Profile Based Reranking
- Profile Vectors
- A process acquires the profile information from the server to create the profile vectors.
- The profile vectors consist of the user's profile vector and the group profile vector.
- Creation
- Pick the snippets from the profile file(s) whose corresponding query partially matches the present query
- Each such snippet is converted into a vector based on the vector space model.
- Each vector thus obtained is compared with the present query, and the top 100 documents are selected
- The centroid of these top documents represents the profile vector.
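The creation steps above can be sketched as follows. This is a minimal illustration, not Scout's code: it assumes raw term-frequency vectors, interprets "partially matches" as sharing at least one term with the present query, and uses hypothetical names (`to_vector`, `build_profile_vector`) throughout.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Vector = Dict[str, float]


def to_vector(terms: List[str]) -> Vector:
    """Term-frequency vector in the vector space model (weighting is an assumption)."""
    return {term: float(count) for term, count in Counter(terms).items()}


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors: List[Vector]) -> Vector:
    if not vectors:
        return {}
    total: Dict[str, float] = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def build_profile_vector(
    profile: List[Tuple[List[str], List[str]]],  # (past query terms, snippet terms)
    query_terms: List[str],
    top_n: int = 100,
) -> Vector:
    query_vec = to_vector(query_terms)
    query_set = set(query_terms)
    # Assumption: a past query "partially matches" if it shares a term with the present one.
    candidates = [to_vector(snippet) for past_query, snippet in profile
                  if query_set & set(past_query)]
    candidates.sort(key=lambda vec: cosine(vec, query_vec), reverse=True)
    return centroid(candidates[:top_n])


profile = [
    (["militari", "intellig"], ["militari", "intellig", "command"]),
    (["data", "mine"], ["data", "mine", "internet"]),
    (["cook"], ["recip"]),
]
print(build_profile_vector(profile, ["militari", "data"]))
```

The same routine could produce the group profile vector by feeding it the pooled profiles of a group of users, though the slides do not say how the group profile is assembled.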
19Profile Based Reranking
- Original rank vector
- It is obtained by taking the centroid of the top (at present 10) snippet vectors from the present session.
- Reranking Measure
- $$\mathrm{RerankingMeasure}(doc_i) = \alpha \cdot \mathrm{cosine}(doc_i, \text{user profile vector}) + \beta \cdot \mathrm{cosine}(doc_i, \text{group profile vector}) + \gamma \cdot \mathrm{cosine}(doc_i, \text{original rank vector})$$
- Where
- $doc_i$ is the $i$th document in the present session
- $\alpha$, $\beta$ and $\gamma$ are constants with the values chosen as 3, 2 and 1 respectively.
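As a rough illustration of the reranking measure, the sketch below scores each document as a weighted sum of three cosine similarities, with the weights 3, 2 and 1 from the slide. Vectors are plain term-to-weight dictionaries; the function names and the toy data are assumptions made for the example.

```python
import math
from typing import Dict, List, Tuple

Vector = Dict[str, float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank(
    docs: List[Tuple[str, Vector]],  # (document ID, snippet vector)
    user_profile: Vector,
    group_profile: Vector,
    original_rank: Vector,
    alpha: float = 3.0,
    beta: float = 2.0,
    gamma: float = 1.0,
) -> List[str]:
    def score(vec: Vector) -> float:
        return (alpha * cosine(vec, user_profile)
                + beta * cosine(vec, group_profile)
                + gamma * cosine(vec, original_rank))

    # sorted() is stable, so documents with equal scores keep their original order.
    ranked = sorted(docs, key=lambda item: score(item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked]


docs = [
    ("natural", {"speech": 1.0, "grammar": 1.0}),
    ("programming", {"java": 1.0, "compil": 1.0}),
]
order = rerank(
    docs,
    user_profile={"java": 1.0},
    group_profile={"compil": 1.0},
    original_rank={"speech": 1.0, "grammar": 1.0},
)
print(order)
```

With these toy vectors the programming page overtakes the page that the original ranking favoured, because the user and group terms carry more weight than the original rank vector.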
20Interactive Profile Reranking
- Instead of the stored profile, the user can request reranking on user-specified keywords through the interface.
- Reranking Measure
- $$\mathrm{RerankingMeasure}(doc_i) = \alpha \cdot \mathrm{cosine}(doc_i, \text{interactive vector}) + \beta \cdot \mathrm{cosine}(doc_i, \text{original rank vector})$$
- Where
- $doc_i$ is the $i$th document in the present session
- $\alpha$ and $\beta$ are constants, both set to 1 after experiments.
21Clustering Algorithms
- Implemented the following algorithms
- K-means
- Bisecting k-means
- SNN algorithm
- Uses the overlap between the nearest neighbors of each pair of snippets as a measure of similarity
- Built-in noise removal mechanism
- Allows overlapping clusters
- Reference: Ertöz, L., Steinbach, M., Kumar, V., "Finding Topics in Collections of Documents: A Shared Nearest Neighbor Approach," First SIAM International Workshop on Data Mining, 2001
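The shared-nearest-neighbor similarity mentioned above can be illustrated with a short sketch. It covers only the similarity step (the overlap of two snippets' k-nearest-neighbor sets), not the full SNN clustering algorithm with its noise removal; the function names and the choice of `k` are assumptions.

```python
from typing import List, Set


def k_nearest(sim: List[List[float]], k: int) -> List[Set[int]]:
    """For each item, the set of its k most similar other items."""
    n = len(sim)
    neighbors = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: sim[i][j], reverse=True)
        neighbors.append(set(others[:k]))
    return neighbors


def snn_similarity(sim: List[List[float]], k: int) -> List[List[int]]:
    """SNN similarity of i and j = number of nearest neighbors they share."""
    nn = k_nearest(sim, k)
    n = len(sim)
    return [[len(nn[i] & nn[j]) if i != j else 0 for j in range(n)]
            for i in range(n)]


# Two well-separated groups of three snippets each (toy similarities).
groups = [0, 0, 0, 1, 1, 1]
sim = [[1.0 if i == j else (0.9 if groups[i] == groups[j] else 0.1)
        for j in range(6)] for i in range(6)]
snn = snn_similarity(sim, k=2)
print(snn[0][1], snn[0][3])
```

Snippets in the same group share a neighbor, while snippets from different groups share none, which is the property that lets SNN separate themes even when raw snippet similarities are noisy.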
22Clustering Algorithms (continued)
- For the k-means algorithm, these variations were tried out.
- Nouns in cluster titles
- Non-noun words were filtered out of the titles
- Noun phrases in cluster titles
- Each vector was augmented with noun phrases so that they could appear in cluster titles.
- The above two variations used the Brill tagger to generate nouns and noun phrases.
- Modifying clusters by changing word weights
- Through the interface, the user could modify the weights of title words to influence the clustering algorithm. This gave the user some control over the algorithm.
23Indexing
- Index term generation algorithms
- Brute-force sequential phrase generation
- Sequential phrase generation using word clustering
- All snippets containing an index term are grouped under that index.
- Documents can be present under more than one index term
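One way to picture brute-force sequential phrase generation is to enumerate every contiguous word sequence in each snippet and keep those that occur in enough snippets. The sketch below does that under assumed parameters (`max_len`, `min_docs`); it is an interpretation of the slide, not the algorithm used in Scout.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_index(
    snippets: List[List[str]],
    max_len: int = 3,   # assumed maximum phrase length in words
    min_docs: int = 2,  # assumed minimum number of snippets per index term
) -> Dict[str, List[int]]:
    postings: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, words in enumerate(snippets):
        for n in range(1, max_len + 1):
            for start in range(len(words) - n + 1):
                phrase = " ".join(words[start:start + n])
                postings[phrase].add(doc_id)
    # Keep only phrases shared by enough snippets; each becomes an index term.
    return {phrase: sorted(ids) for phrase, ids in postings.items()
            if len(ids) >= min_docs}


snippets = [
    ["data", "mining", "software"],
    ["data", "mining", "tools"],
    ["web", "mining"],
]
index = build_index(snippets, max_len=2, min_docs=2)
for phrase, doc_ids in sorted(index.items()):
    print(phrase, doc_ids)
```

Snippet 0 ends up under "data", "mining" and "data mining" at once, which shows both the multiple-path benefit and the redundancy drawback of indexing.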
24Relevance Feedback Module
- Current algorithm used
- Rocchio's query reformulation algorithm, described here in brief.
- $$Q_1 = Q_0 + \beta \sum_{i=1}^{n_1} \frac{R_i}{n_1} - \gamma \sum_{i=1}^{n_2} \frac{S_i}{n_2}$$
- where
- $Q_1$ = the vector for the final query
- $Q_0$ = the vector for the initial query
- $R_i$ = the vector for the relevant document $i$
- $S_i$ = the vector for the non-relevant document $i$
- $n_1$ = the number of relevant documents chosen
- $n_2$ = the number of non-relevant documents chosen
- $\beta$, $\gamma$ = weights that adjust the contributions of the relevant and non-relevant document vectors.
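A minimal sketch of the Rocchio update described above, with vectors as term-to-weight dictionaries. The slides do not give the weight values, so the ones in the usage example are purely illustrative, and the function name is hypothetical.

```python
from typing import Dict, List

Vector = Dict[str, float]


def rocchio(
    q0: Vector,
    relevant: List[Vector],
    non_relevant: List[Vector],
    beta: float,
    gamma: float,
) -> Vector:
    """Q1 = Q0 + beta * mean(relevant) - gamma * mean(non_relevant)."""
    q1 = dict(q0)
    for docs, weight in ((relevant, beta), (non_relevant, -gamma)):
        if not docs:
            continue  # nothing marked in this category
        for doc in docs:
            for term, value in doc.items():
                q1[term] = q1.get(term, 0.0) + weight * value / len(docs)
    return q1


q1 = rocchio(
    q0={"language": 1.0},
    relevant=[{"xml": 1.0, "language": 1.0}, {"python": 1.0}],
    non_relevant=[{"travel": 1.0}],
    beta=1.0,   # illustrative value
    gamma=0.5,  # illustrative value
)
print(q1)
```

Terms that end up with the highest positive weights are natural candidates for the suggested new query, while terms with negative weights would normally be dropped.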
25Qualitative Evaluation
28Query language, (Original results)
29Query language, (Original results)
- 1. Foreign Languages for Travelers
- 2. travlang Foreign Language for Travelers
- 3. The Human-Languages Page is now iLoveLanguages
- 4. World Language Resources
- 5. Search Language Language Resources Directory
- 6. Jennifer's language page
- 7. Language -Learning.net
- 8. Internet Activities for Foreign Language Classes
- 9. Python Language Website
- 10. HTML Hyper Text Markup Language
- Out of the top 10 results, only 3 are relevant.
- Precision: 3/10 = 30%
30Influence of user past profile and interactive
profile on the ranking of results.
31Profile based reranking
- Query on "language" (reranked results; original rank in parentheses)
- 1. Extensible markup language (XML) (54)
- 2. Extensible markup language (XML) (88)
- 3. Java Language Specification (62)
- 4. Foreign language for travelers (1)
- 5. Travlang (2)
- 6. Python Language Website (44)
- 7. HTML Hyper Text Markup Language (10)
- 8. FAQ about Extensible Markup (32)
- 9. Intelligent User Interfaces (53)
- 10. Python Language Website (9)
- Out of 10, 7 links are related to software.
- Precision: 7/10 = 70%
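The precision figures quoted for the "language" query are precision over the top 10 results. A small sketch of the arithmetic follows; the relevance flags only reproduce the counts reported on the slides (3 and 7 relevant links), not which individual links were judged relevant.

```python
from typing import List


def precision_at_k(relevant_flags: List[bool], k: int = 10) -> float:
    """Fraction of the top-k results that were judged relevant."""
    top = relevant_flags[:k]
    return sum(top) / len(top) if top else 0.0


original = [True] * 3 + [False] * 7  # 3 of the top 10 judged relevant
reranked = [True] * 7 + [False] * 3  # 7 of the top 10 judged relevant
print(precision_at_k(original), precision_at_k(reranked))
```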
32Profile based reranking
- All italicized, highlighted links are related to baseball by content.
- Improvement in precision: from 20% to 60%
33Profile based reranking
- Improvement in precision: from 20% to 70%
34Use of relevance feedback, where the user reformulates the query and gets a new result set
35Relevance Feedback (Language)
- after relevance feedback
- Query: "extension markup xml language python"
- Fetched links
- 1. Java XML 112
- 2. XML News From Robin Cover
- 3. Custom perl, python, java
- 4. JSX 0.8.4
- 5. Search Results for xmldir.com
- 6. XML XML XML Distributed computing resources
- 7. XML Spy
- 8. Tamino - the leading XML Server! Download FREE Trial
- 9. Comments for Extensible Markup Language (XML)
- 10. irt.org - Extensible Markup Language
36Relevance Feedback (Mining)
- after relevance feedback
- Query: "software of the data mining"
- Fetched links
- Mining Data?
- CodeWeb Data Mining Software Development Experience
- Knowledge Based Systems, Knowledge Management and Data Mining
- Data Mining Software, the key to business intelligence
- Siftware Tools for Data Mining and Knowledge Discovery
- AZMY Thinkware -- Data Analysis and Mining Software Tools
- Predictive Data Mining Software
- KDnuggets Data Mining, Web Mining, and Knowledge Discovery ...
- Data Mining Software ...
- Xanalys Intelligence Software Providing Data Mining, Data Analysis
37 Qualitative Evaluation
- Clustering
- No other search engine was available online against which to compare the clustering results.
- Evaluation
- Tested by running some queries and visually inspecting the results.
- Some of the results are shown here.
38Clustering on query computer science along with
the original results for this query.
39Clustering on query Sports
- Clusters generated by the Scout
40Clustering on query raging bull
- Clusters generated by Scout
41 Qualitative Evaluation
- Indexing
- Evaluation
- Tested by inspecting the results of some queries. The same queries were also run on other existing systems, such as www.vivisimo.com and www.infonetware.com, and the results were compared visually.
- Some of the results are shown here.
42Indexing on query computer science along with
original results for this query.
43Indexing on query computer science along with
original results for this query.
44Indexing on query computer science along with
original results for this query.
45Indexing on query Sports: Vivisimo, Scout, Infonetware
46Indexing on query raging bull: Vivisimo, Scout, Infonetware
47Quantitative analysis
- Quantitative analysis is difficult because there is no established parameter to measure, but the following approaches are suggested.
- Statistical analysis
- Procedure
- The system could be exposed to a large number of users and, as an experiment, their actions could be monitored over time. At the end, the data could be analyzed to find out which method obtained the most click-throughs.
- Problem
- A short M.S. thesis project does not allow enough time to popularize the system and then leave it running long enough to gather sufficient data.
48Quantitative analysis (Continued)
- Simulation method
- Procedure
- The Scout interface could be coupled to a primary search engine built on top of the TREC dataset (an experimental data set meant for IR research).
- The queries defined in the data set could be run, and the retrieved documents could be compared with the expected retrieval declared by TREC.
- Problem
- No such search engine is available right now. We are in the process of building our own primary search engine for research purposes in line with the present work. In the future we would be able to perform such tests after indexing TREC or similar experimental text data.
49Conclusion
- Presented the architecture of Scout
- The system allows the user to effectively explore a large number of results in various ways using reranking, clustering and indexing
- Allows the presentation of results to be influenced by the profile
- The Scout architecture allows us to plug in various algorithms easily
50Future Work
- New and improved version of Scout now available at http://scout.cs.umn.edu with
- Reduced client-server communication
- All code in C
- Reduced file I/O and temporary file storage
- New methods for maintaining user and group profiles
- Use of the profile in clustering and indexing