Personalized Profile Based Search Interface With Ranked and Clustered Display

1
  • Personalized Profile Based Search Interface With
    Ranked and Clustered Display

M.S. Thesis Defense: Sachin Kumar
Advisor: Dr. Vipin Kumar
2
Acknowledgements
  • I would like to thank
  • Prof. Vipin Kumar
  • For his continuous guidance and invaluable
    inspiration.
  • Eui-Hong Han
  • For working closely with me, suggesting innovative
    ideas, and providing implementation-level guidance.
  • B. Uygar Oztekin
  • For providing a command-line version of the
    meta-search engine Mearf tailored to the specific
    needs of the Scout system.
  • Levent Ertoz
  • For providing the SNN algorithm for clustering and
    the brute-force phrase generation algorithm for
    indexing.
  • Saurabh Singhal and Eric Eilertson
  • For testing the system and providing feedback and
    ideas to improve it.

3
Objective
  • Design and implement a search interface capable
    of re-ranking results based on the user's
    profile
  • Give the system the additional capability of
    clustering or indexing the original results and
    of performing relevance feedback
  • Use a modular design so that future developers can
    easily plug in and test their algorithms.

4
Problem Definition
  • Huge Amount of Online Information
  • In response to a query, thousands of documents
    are matched, and a few hundreds are provided by
    search engines in a ranked order
  • A typical user can browse through only the top
    few items
  • Top-ranked documents may be irrelevant in the case of
  • Broad-topic queries
  • Multi-modal queries
  • Imprecise queries, etc.

5
Possible Solutions-reranking
  • Customize the Ranking
  • Profile built for each user
  • Example query on language
  • User 1 (Software Engineer) sees programming
    language related pages at the top
  • User 2 (from Liberal Arts) sees natural language
    related pages at the top
  • Drawbacks of profile-based reranking
  • The profile may saturate over time
  • The user will have difficulty searching in a
    field that contrasts with his or her profile

6
Possible Solutions-clustering
  • Document Clustering
  • Different themes in the different clusters
    of documents
  • Number of clusters would be much less than the
    number of documents.
  • Easy to browse.
  • Potential problems of Clustering
  • Label selection
  • Snippet tolerance
  • Only one path to reach the document

7
Possible Solutions-indexing
  • Indexing
  • Main phrases or topics among documents are
    chosen as index
  • All documents containing the index terms are put
    under that index.
  • Index can be further divided into sub indices
  • The same document can be present at different
    leaf nodes
  • Drawback of Indexing
  • Redundancy
  • May be very computation intensive to generate the
    phrases or topics, especially when we are working
    with whole documents

8
Possible Solutions-relevance Feedback
  • Relevance Feedback
  • Helpful when initial query is vague
  • User specifies his/her likes and dislikes
  • A new query is suggested to the user and a new
    set of results is displayed
  • Drawbacks of relevance feedback
  • Requires extra effort on the part of the user to
    give feedback.
  • If a document contains junk words, they might
    affect the new set of results.

9
Profiles- Past Research
  • Various personalized information filtering agents
    and systems
  • Content based Filtering systems
  • Syskill & Webert
  • Learns from the user explicitly identifying good
    links
  • Agent tells whether the links on the present page
    are interesting
  • Amalthaea (A. Moukas and P. Maes, 1998)
  • Learns from the behavior of the user
  • Filtering agents and discovery agents. Filtering
    agents help discovery agents
  • Social filtering or collaborative systems
  • Let's Browse (Lieberman, Van Dyke, Vivacqua, 1999)
  • Learns from the behavior of the users
  • Users linked by infrared transmitter. Local
    connaissance
  • Adaptive web site agents (Pazzani & Billsus,
    1999)
  • Limited to a single web site. Learns from past
    and current users' actions. Suggests
    documents as the user browses

10
Profiles- Past Research
  • Continued
  • Assisted Browsing Systems
  • Letizia (Lieberman, H., 1995)
  • Assists only in browsing. Learns from the browsing
    pattern of the user
  • Keeps quiet until it finds enough information to
    tell. Guides through its own window.
  • WebMate (Liren Chen, Katia Sycara, 1998)
  • Learns from explicit relevance feedback
  • Consists of a stand-alone proxy and an applet
    controller as an interface while the user browses
  • WebGlimpse (Udi Manber Mike, 1997)
  • Learns while the user browses.
  • Creates a neighborhood and adds recommendations to
    the current pages
  • WebWatcher (Armstrong, Freitag, Joachims, 1995)
  • Learns from the browsing pattern of the user
  • Search and browse guide; guides the user by
    inserting its markup

11
Profiles- Past Research
  • Continued
  • Other systems
  • ARIADNE (Twidale Michael B et al, 1995)
  • GroupWeb (Greenberg, S. and Roseman, M., 1996)
  • Jasper (Davies, Weeks & Revett, 1996)
  • Pluribus (Schapira, 1999)
  • SearchPad (Krishna Bharat, 2001)
  • Select (Alton-Scheidl et al., 1999)
  • WebHound (Lashkari, Y., 1995)

12
Clustering- Past Research
  • This approach has been investigated in various
    papers and incorporated in a few meta-search engines
  • Grouper (O. Zamir, O. Etzioni, 1998)
  • Meta search engine based on STC Algorithm
  • Manjara (Kannan R. and Vinay V., Yale)
  • a meta search engine that uses SVD-based
    clustering technique
  • Interactive Track Interface using Scatter/Gather
    (Cutting Douglass R., Karger David R., 1996)
  • Query based browsing system based on
    scatter-gather paradigm.
  • The Paraphrase interface (P.Anick,
    S.Vaithyanathan,1997)
  • Based on SVD algorithm

13
Indexing- Past Research
  • Used by meta-search engines e.g.
  • Vivisimo (www.vivisimo.com)
  • uses a form of conceptual clustering
  • MSEEC-Multi Search Engine with Multiple
    Clustering
  • theme detection algorithm and LZW compression
    method
  • Infogistics (www.infonetware.com)
  • Statistical, linguistic and conceptual analysis
    to break document collection into topics and sub
    topics
  • Morphological and syntactic transformations to
    unify phrases according to a language grammar.

14
Introduction to Scout
  • A meta-search engine with an integrated interface
    that offers profile-based re-ranking,
    clustering, indexing, and relevance feedback.
  • Makes it easy to plug in and test new algorithms
  • Advanced users can control the various parameters
    through an advanced interface.
  • The Scout architecture is given on the next slide

15
(No Transcript)
16
Brief introduction to
  • Scout is powered by Mearf, an optimized meta-search
    engine based on expert agreement and content-based
    reranking, which can combine multiple methods and
    quality measures
  • Reference: mearf.cs.umn.edu
  • (Submitted to CIKM 2001) B. U. Oztekin,
    G. Karypis, V. Kumar
  • Supports various search engines: Google, Excite,
    AltaVista, DirectHit
  • Intelligent advertisement removal and duplicate
    elimination module

17
Profile Based Reranking
  • Profile Storage
  • Each user is given a unique user ID.
  • For each user, in each search session, the query
    and all the documents visited are appended to
    that user's past search sessions.
  • Old profile entries are phased out to keep the
    profile unsaturated and adaptive to the changing
    interests of the user.
  • Example of a profile stored on the server
  • 992729098 militari intellig 0 1 2
  • 992728760 militari data mine 3 4 5 6
  • 0 992729229
    http://www.oss.net/Papers/hackers/Infofare.html
    inform warfar librarian frontlin imageri nation
    secur agenc data mine internet joint nation
    militari intellig command line librarian
  • 1 992729173
    http://www.dnd.ca/somalia/vol3/v3c25ce.htm
    militari plan system major threat indiscrimin
    mine former barr armi mission constitut failur
    militari intellig col labb

18
Profile Based Reranking
  • Profile Vectors
  • A process acquires the profile information from the
    server to create the profile vectors.
  • The profile vectors consist of the user's profile
    vector and the group profile vector.
  • Creation
  • Pick the snippets from the profile file(s)
    whose corresponding query partially matches the
    present query
  • Each such snippet is converted into a vector
    based on the vector space model.
  • Each vector thus obtained is compared with the
    present query, and the top 100 documents are selected
  • The centroid of these top documents represents the
    profile vector.
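
The creation steps above can be sketched in Python as follows. This is only an illustrative sketch, not the Scout implementation: it assumes raw term-frequency weights, treats "partially matches" as "shares at least one term", and the names `to_vector`, `cosine`, `centroid`, and `profile_vector` are hypothetical.

```python
import math
from collections import Counter

def to_vector(text):
    """Bag-of-words term-frequency vector for an already stemmed snippet."""
    return Counter(text.split())

def cosine(u, v):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """Component-wise average of a list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    n = len(vectors)
    return {t: w / n for t, w in total.items()} if n else {}

def profile_vector(query, sessions, top_k=100):
    """sessions: list of (past_query, [snippet, ...]) pairs from the profile."""
    q_terms = set(query.split())
    q_vec = to_vector(query)
    # Assumption: a past query "partially matches" if it shares a term.
    candidates = [to_vector(snippet)
                  for past_query, snippets in sessions
                  if q_terms & set(past_query.split())
                  for snippet in snippets]
    # Keep the top_k snippets most similar to the present query.
    candidates.sort(key=lambda vec: cosine(vec, q_vec), reverse=True)
    return centroid(candidates[:top_k])
```

For example, with one past session for the query "militari intellig" and another for "data mine", a present query "militari plan" would draw snippets only from the first session.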

19
Profile Based Reranking
  • Original rank vector
  • It is obtained by taking the centroid of the top (at
    present 10) snippet vectors from the present
    session.
  • Reranking measure
  • $\mathrm{rerank}(doc_i) = \alpha \cos(doc_i, \text{user profile vector}) + \beta \cos(doc_i, \text{group profile vector}) + \gamma \cos(doc_i, \text{original rank vector})$
  • Where
  • $doc_i$ is the $i$th document in the present session
  • $\alpha$, $\beta$ and $\gamma$ are constants with the
    values chosen as 3, 2, 1 respectively.
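
A minimal Python sketch of this weighted-cosine reranking is given below. It assumes documents and profiles are sparse term-weight dictionaries and that the three cosine terms are summed; the function names are hypothetical and not taken from Scout.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rerank(docs, user_vec, group_vec, original_vec,
           alpha=3.0, beta=2.0, gamma=1.0):
    """docs: list of (title, vector) pairs. Returns titles, best score first."""
    def score(vec):
        return (alpha * cosine(vec, user_vec)
                + beta * cosine(vec, group_vec)
                + gamma * cosine(vec, original_vec))
    ranked = sorted(docs, key=lambda d: score(d[1]), reverse=True)
    return [title for title, _ in ranked]
```

Under the same assumptions, a two-term measure (one keyword vector plus the original rank vector) can be obtained by passing the keyword vector as `user_vec` and setting `alpha=1`, `beta=0`, `gamma=1`.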

20
Interactive Profile Reranking
  • Instead of the stored profile, the user can request
    reranking on user-specified keywords through the
    interface.
  • Reranking measure
  • $\mathrm{rerank}(doc_i) = \alpha \cos(doc_i, \text{interactive vector}) + \beta \cos(doc_i, \text{original rank vector})$
  • Where
  • $doc_i$ is the $i$th document in the present session
  • $\alpha$ and $\beta$ are constants with the values
    chosen as 1 and 1 respectively after experiments.

21
Clustering Algorithms
  • The following algorithms were implemented
  • k-means
  • Bisecting k-means
  • SNN algorithm
  • Uses overlap between the nearest neighbors of
    each pair of snippets as a measure of similarity
  • Built-in noise removal mechanism
  • Allows overlapping clusters
  • Reference: Ertöz, L., Steinbach, M., Kumar, V.,
    "Finding Topics in Collections of Documents: A
    Shared Nearest Neighbor Approach", First SIAM
    International Workshop on Data Mining, 2001
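
The shared-nearest-neighbor similarity described above (overlap between the neighbor lists of two snippets) can be illustrated with a short Python sketch. It covers only the similarity step under simple assumptions (cosine similarity on sparse vectors, a fixed neighborhood size `k`); the noise removal and cluster formation steps of the referenced algorithm are not shown, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def k_nearest(vectors, k):
    """For each snippet, the set of indices of its k most similar snippets."""
    neighbors = []
    for i, u in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        others.sort(key=lambda j: cosine(u, vectors[j]), reverse=True)
        neighbors.append(set(others[:k]))
    return neighbors

def snn_similarity(vectors, k):
    """Similarity of a pair = number of nearest neighbors the pair shares."""
    nn = k_nearest(vectors, k)
    n = len(vectors)
    return {(i, j): len(nn[i] & nn[j])
            for i in range(n) for j in range(i + 1, n)}
```

Snippets on the same theme tend to share neighbors and so receive a positive score, while snippets on unrelated themes share none.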

22
Clustering Algorithms (continued)
  • For the k-means algorithm, these variations were
    tried out.
  • Nouns in cluster titles
  • Non-noun words were filtered out of the titles
  • Noun phrases in cluster titles
  • Each vector was augmented with noun phrases so
    that they could appear in cluster titles.
  • The above two variations used the Brill tagger to
    generate nouns and noun phrases.
  • Modifying clusters by changing word weights
  • Through the interface, the user could modify the
    weight of title words to influence the clustering
    algorithm. This gave the user some control over the
    algorithm.

23
Indexing
  • Index Terms Generation Algorithms
  • Brute force sequential phrase generation
  • Sequential phrase generation using word
    clustering
  • All snippets containing the index term are
    grouped under that index.
  • Documents can be present under more than one
    index term
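
As a rough illustration of brute-force sequential phrase generation followed by grouping, the Python sketch below enumerates every contiguous word sequence up to a maximum length and files each snippet under every phrase it contains. The slides do not spell out the exact procedure, so the phrase length limit (`max_len`), the rule that a phrase must occur in at least `min_snippets` snippets, and the function names are all assumptions.

```python
from collections import defaultdict

def sequential_phrases(tokens, max_len=3):
    """All contiguous word sequences of 1..max_len tokens."""
    return {" ".join(tokens[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(tokens) - n + 1)}

def build_index(snippets, max_len=3, min_snippets=2):
    """Map each candidate index term to the ids of snippets containing it."""
    index = defaultdict(set)
    for sid, text in enumerate(snippets):
        for phrase in sequential_phrases(text.lower().split(), max_len):
            index[phrase].add(sid)
    # Assumption: keep only phrases shared by at least min_snippets snippets.
    return {phrase: sorted(ids)
            for phrase, ids in index.items() if len(ids) >= min_snippets}
```

In this sketch a snippet such as "Data mining software tools" ends up under several index terms at once ("data mining", "mining software", and so on), which matches the point that a document can appear under more than one index term.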

24
Relevance Feedback Module
  • Current algorithm used
  • Rocchio's query reformulation algorithm, which is
    described here in brief.
  • $Q_1 = Q_0 + \beta \sum_i R_i / n_1 - \gamma \sum_i S_i / n_2$,
  • where
  • $Q_1$ = the vector for the final query
  • $Q_0$ = the vector for the initial query
  • $R_i$ = the vector for the relevant document $i$
  • $S_i$ = the vector for the non-relevant
    document $i$
  • $n_1$ = the number of relevant documents chosen
  • $n_2$ = the number of non-relevant documents
    chosen
  • $\beta$, $\gamma$ = weights to adjust the contributions
    of relevant and non-relevant document vectors.
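
The Rocchio update can be sketched in a few lines of Python. The sketch assumes sparse term-weight dictionaries; the default values of `beta` and `gamma` are placeholders, since the slides do not state the weights used, and dropping terms with non-positive weight is a common convention rather than something the slides specify.

```python
def rocchio(q0, relevant, non_relevant, beta=1.0, gamma=1.0):
    """Return Q1 = Q0 + beta * mean(relevant) - gamma * mean(non_relevant)."""
    terms = set(q0)
    for vec in relevant + non_relevant:
        terms |= set(vec)
    q1 = {}
    for t in terms:
        pos = (sum(v.get(t, 0.0) for v in relevant) / len(relevant)
               if relevant else 0.0)
        neg = (sum(v.get(t, 0.0) for v in non_relevant) / len(non_relevant)
               if non_relevant else 0.0)
        weight = q0.get(t, 0.0) + beta * pos - gamma * neg
        if weight > 0:  # assumption: discard terms pushed to zero or below
            q1[t] = weight
    return q1
```

The highest-weighted terms of the returned vector could then be offered to the user as the suggested new query.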

25
Qualitative Evaluation
26
(No Transcript)
27
(No Transcript)
28
Query language, (Original results)
29
Query language, (Original results)
  • 1. Foreign Languages for Travelers
  • 2. travlang Foreign Language for Travelers
  • 3. The Human-Languages Page is now iLoveLanguages
  • 4. World Language Resources
  • 5. Search Language Language Resources Directory
  • 6. Jennifer's language page
  • 7. Language -Learning.net
  • 8. Internet Activities for Foreign Language
    Classes
  • 9. Python Language Website
  • 10. HTML Hyper Text Markup Language
  • Out of the top 10 results, only 3 are relevant.
  • Precision: 3/10 = 30%

30
Influence of user past profile and interactive
profile on the ranking of results.
31
Profile based reranking
  • Query on Language (reranked results)
  • 1. Extensible markup language (XML) (54)
  • 2. Extensible markup language (XML) (88)
  • 3. Java Language Specification (62)
  • 4. Foreign language for travelers (1)
  • 5. Travlang (2)
  • 6. Python Language Website (44)
  • 7. HTML Hyper Text Markup Language (10)
  • 8. FAQ about Extensible Markup (32)
  • 9. Intelligent User Interfaces (53)
  • 10. Python Language Website (9)
  • Out of 10, 7 links are related to software.
  • Precision: 7/10 = 70%

32
Profile based reranking
  • All italicized, highlighted links are related to
    baseball by content.
  • Improvement in precision: from 20% to 60%

33
Profile based reranking
  • Improvement in precision: from 20% to 70%

34
Use of relevance feedback where the user
reformulates the query and gets new result set
35
Relevance Feedback (Language)
  • after relevance feedback
  • Query: extension markup xml language python
  • Fetched links
  • 1. Java XML 112
  • 2. XML News From Robin Cover
  • 3. Custom perl, python, java
  • 4. JSX 0.8.4
  • 5. Search Results for xmldir.com
  • 6. XML XML XML Distributed computing resources
  • 7. XML Spy
  • 8. Tamino - the leading XML Server! Download
    FREE Trial
  • 9. Comments for Extensible Markup Language (XML)
  • 10. irt.org - Extensible Markup Language

36
Relevance Feedback (Mining)
  • after relevance feedback
  • Query is software of the data mining
  • Fetched links
  • Mining Data?
  • CodeWeb Data Mining Software Development
    Experience
  • Knowledge Based Systems, Knowledge Management and
    Data Mining
  • Data Mining Software , the key to business
    intelligence
  • Siftware Tools for Data Mining and Knowledge
    Discovery
  • AZMY Thinkware -- Data Analysis and Mining
    Software Tools
  • Predictive Data Mining Software
  • KDnuggets Data Mining , Web Mining , and
    Knowledge Discovery ... ...
  • Data Mining Software ...
  • Xanalys Intelligence Software Providing Data
    Mining , Data Analysis

37

Qualitative Evaluation
  • Clustering
  • No other search engine was available online to
    compare clustering results against.
  • Evaluation
  • Tested by running some queries and visually
    inspecting the results.
  • Some of the results are shown here.

38
Clustering on query computer science along with
the original results for this query.
39
Clustering on query Sports
  • Clusters generated by the Scout

40
Clustering on query raging bull
  • Clusters generated by Scout

41

Qualitative Evaluation
  • Indexing
  • Evaluation
  • Tested by inspecting the results of some queries.
    The same queries were also run on other existing
    systems such as www.vivisimo.com and
    www.infonetware.com, and the results were compared
    visually.
  • Some of the results are shown here.

42
Indexing on query computer science along with
original results for this query.
43
Indexing on query computer science along with
original results for this query.
44
Indexing on query computer science along with
original results for this query.
45
Indexing on query Sports: Vivisimo, Scout,
Infonetware
46
Indexing on query raging bull: Vivisimo,
Scout, Infonetware
47
Quantitative analysis
  • Quantitative analysis is difficult because there
    is no established parameter to measure, but the
    following approaches are suggested.
  • Statistical analysis
  • Procedure
  • The system could be exposed to a large number of
    users and, as an experiment, their actions could
    be monitored over time. In the end, the data can
    be analyzed to find out which method obtained the
    maximum click-through.
  • Problem
  • A short M.S. thesis project does not allow enough
    time to popularize the system among people and
    leave it running long enough to gather sufficient
    data.

48
Quantitative analysis (Continued)
  • Simulation Method
  • Procedure
  • The Scout interface could be coupled to a primary
    search engine built on top of the TREC dataset
    (an experimental data set meant for IR research).
  • Queries defined in the data set could be
    performed, and the retrieved documents could be
    compared with the expected retrieval declared by
    TREC.
  • Problem
  • No such search engine is available right now. We
    are in the process of building our own primary
    search engine for research purposes in line with
    the present work. In the future we would be
    able to perform such tests after indexing the
    TREC or similar experimental text data.
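
The comparison of retrieved documents against declared relevance judgments could, for instance, use precision over the top k results, the same measure reported earlier for the top 10 links. The Python sketch below is only illustrative; the document ids and judgments in any real run would come from the test collection, and the function name is hypothetical.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k retrieved documents that are judged relevant."""
    if k <= 0:
        return 0.0
    top = ranked_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / k
```

Averaging this value over all queries defined in the data set, once for the original ranking and once for each reranked or reorganized view, would give one number per method to compare.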

49
Conclusion
  • Presented the architecture of Scout
  • System allows user to effectively explore the
    large number of results in various ways using
    reranking, clustering and indexing
  • Allows the presentation of results to be
    influenced by profile
  • Scout architecture allows us to plug in various
    algorithms easily

50
Future Work
  • New and improved version of Scout now available
    at http://scout.cs.umn.edu with
  • Reduced client server communication
  • All code in C
  • Reducing the file I/O and temporary file storage
  • New methods for maintaining user and group
    profile
  • Use of profile in clustering and indexing