Title: Memex: A Browsing Assistant for Collaborative Archiving and Mining of Surf Trails
1 Memex: A Browsing Assistant for Collaborative Archiving and Mining of Surf Trails
- Soumen Chakrabarti, Sandeep Srivastava, Mallela Subramanyam, Mitul Tiwari
- Indian Institute of Technology Bombay
2 Sources of Web information
- Sources already exploited
  - Text on pages (keyword search)
  - Links between pages (popularity rating)
  - Topic taxonomies (query expansion)
- Sources not exploited enough yet
  - Public surfing history
  - Public bookmarks
- Collaboration is central to hypertext
- Lack of trust limits collaboration on the Web
3 Our goals
- Infrastructure to support spontaneous formation of topic-based collaborative Web communities
  - Browsing assistant client
  - Community server
- Mining algorithms for personal and community-level topic management and collaborative resource discovery
- Extensible API for plugging in additional hypertext analysis tools
4 1. Create a Memex account (password sent by email)
5 Function tabs
Memex client applet attaches to browser
Privacy choice
6 Preparing to import initial bookmarks
7 Bookmarks imported
8-9 For Memex to suggest an initial topic organization, select all bookmarks and send them to the clustering tab
10 Switch to the clustering tab
URLs to be clustered appear here
11 Submit the URLs to the server-side Memex clustering demon
12 Check later if the server has completed the clustering task
13 Two top-level clusters about software and music
14 Expanding the software cluster to study it in more detail
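The clustering step shown in slides 11 to 14 can be illustrated with a minimal sketch. The slides do not say which algorithm the Memex clustering demon uses, so the greedy one-pass grouping by title-word overlap below is an assumption, as are the `cluster_bookmarks` name, the threshold value, and the sample bookmarks.

```python
def tokens(text):
    """Lowercased set of words in a bookmark title."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_bookmarks(bookmarks, threshold=0.15):
    """Greedy one-pass grouping (an illustrative stand-in, not Memex's method).

    A bookmark joins the first cluster that already holds a sufficiently
    similar title; otherwise it starts a new cluster.
    """
    clusters = []
    for url, title in bookmarks:
        words = tokens(title)
        for cluster in clusters:
            if any(jaccard(words, tokens(t)) >= threshold for _, t in cluster):
                cluster.append((url, title))
                break
        else:
            clusters.append([(url, title)])
    return clusters


# Hypothetical bookmarks: (URL, page title)
bookmarks = [
    ("http://example.org/a", "java compiler download"),
    ("http://example.org/b", "jazz music archive"),
    ("http://example.org/c", "java applet tutorial"),
    ("http://example.org/d", "classical music archive"),
]
clusters = cluster_bookmarks(bookmarks)
for cluster in clusters:
    print([url for url, _ in cluster])
```

With this sample input the software-related and music-related bookmarks end up in two separate groups, mirroring the two top-level clusters in the screenshots. A real system would likely cluster on page contents rather than titles alone.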
15-17 User can freely reorganize URL placement using cut-and-paste
18-20 Moving an entire folder from the cluster tab to the folder tab, together with example URLs
21 Folder names can be edited to taste; this also gives Memex additional clues about the folders' contents
22-23 New folders can be created to hold clusters found in the cluster tab
24-25 A topic hierarchy that is too detailed for the user can be flattened
26-27 Groups of closely related URLs can be moved back to folders in the folder tab
28 Memex helps the user derive a starting topic hierarchy from unstructured bookmarks
29 The user then continues browsing in multiple sessions. Relevant pages found by other members of the community and made public are available for collaborative surfing
30-31 If permission is granted, the Memex applet monitors the trail that the surfer follows and uploads it to the server for further analysis and mining
32 Such surf trails, together with page contents, are valuable inputs to the Memex server-side hypertext mining and resource discovery demons
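A minimal sketch of permission-gated trail recording follows. The real Memex client is a Java applet and its data format is not described in the slides; the `Visit` and `TrailRecorder` names, their fields, and the batching behaviour are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Visit:
    url: str
    referrer: Optional[str]  # page the surfer came from, if known
    timestamp: float


@dataclass
class TrailRecorder:
    """Hypothetical client-side buffer for a surf trail."""
    permission_granted: bool = False
    pending: List[Visit] = field(default_factory=list)

    def record_visit(self, url, referrer, timestamp):
        """Buffer a visit only if the user has granted permission."""
        if self.permission_granted:
            self.pending.append(Visit(url, referrer, timestamp))

    def flush(self):
        """Return the buffered trail for upload and clear the buffer."""
        batch, self.pending = self.pending, []
        return batch


recorder = TrailRecorder(permission_granted=False)
recorder.record_visit("http://example.org/", None, 0.0)  # dropped: no permission
recorder.permission_granted = True
recorder.record_visit("http://example.org/a", "http://example.org/", 1.0)
batch = recorder.flush()
print([v.url for v in batch])
```

Keeping the referrer alongside each URL is what turns a flat visit log into a trail: the server can reconstruct which page led to which.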
33 A "?" indicates that Memex is not sure about the folder assignment. Users can easily correct mistakes, and these corrections form additional valuable training data.
In the background, the Memex classifier finds the most suitable folder for each history item. History is never deleted (disk is cheap). When the user refreshes the view, surf history from others and from herself appears categorized into the user's familiar topic tree.
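The behaviour on this slide (assign each page to a folder, flag unsure assignments with "?", and learn from corrections) can be sketched as follows. The slides do not name the classifier Memex uses; the multinomial naive Bayes model, the `FolderClassifier` class, the 0.7 confidence cutoff, and the training texts are assumptions for illustration only.

```python
import math
from collections import Counter, defaultdict


class FolderClassifier:
    """Tiny naive Bayes text classifier with a confidence flag (illustrative)."""

    def __init__(self, min_confidence=0.7):
        self.min_confidence = min_confidence
        self.word_counts = defaultdict(Counter)  # folder -> word -> count
        self.doc_counts = Counter()              # folder -> number of training pages
        self.vocab = set()

    def train(self, folder, text):
        """Add one labeled page; user corrections are fed in the same way."""
        words = text.lower().split()
        self.word_counts[folder].update(words)
        self.doc_counts[folder] += 1
        self.vocab.update(words)

    def posteriors(self, text):
        """Posterior probability of each folder, with add-one smoothing."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for folder, n_docs in self.doc_counts.items():
            counts = self.word_counts[folder]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            log_scores[folder] = score
        top = max(log_scores.values())
        weights = {f: math.exp(s - top) for f, s in log_scores.items()}
        norm = sum(weights.values())
        return {f: w / norm for f, w in weights.items()}

    def classify(self, text):
        """Return (folder, sure); sure=False corresponds to the "?" marker."""
        post = self.posteriors(text)
        folder = max(post, key=post.get)
        return folder, post[folder] >= self.min_confidence


clf = FolderClassifier()
clf.train("Software", "java compiler source code")
clf.train("Music", "jazz piano concert recording")

confident = clf.classify("java source download")
unsure = clf.classify("concert code")

# The user moves the page to "Music"; the correction becomes training data.
clf.train("Music", "concert code")
after_correction = clf.classify("concert code")
print(confident, unsure, after_correction)
```

In this toy run the ambiguous page is initially flagged as unsure, and after the user's correction the same page is assigned to the corrected folder with confidence above the cutoff.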
34 Automatic collaborative classification also lets users return to a topic-restricted surfing context quickly, and replay the last few surfing actions within that topic of interest.
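Once every visit carries a topic label, returning to a topic-restricted context amounts to keeping one chronological list per topic. The sketch below assumes a hypothetical `TopicHistory` class; the slides do not describe how Memex stores this.

```python
from collections import defaultdict


class TopicHistory:
    """Append-only browsing history indexed by topic (illustrative)."""

    def __init__(self):
        self._by_topic = defaultdict(list)  # topic -> URLs in visit order

    def add(self, topic, url):
        """Record a visit under its classified topic; nothing is ever deleted."""
        self._by_topic[topic].append(url)

    def replay(self, topic, last_n=3):
        """Return the last `last_n` URLs visited within `topic`, oldest first."""
        if last_n <= 0:
            return []
        return list(self._by_topic.get(topic, []))[-last_n:]


history = TopicHistory()
history.add("Software", "http://example.org/java")
history.add("Music", "http://example.org/jazz")
history.add("Software", "http://example.org/compiler")
history.add("Software", "http://example.org/applet")

recent_software = history.replay("Software", last_n=2)
recent_travel = history.replay("Travel")
print(recent_software, recent_travel)
```

Visits to other topics that were interleaved in time do not appear in the replay, which is the difference from a browser's single linear history list.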
35 Personalized topic-based history management is far superior to the one-dimensional history list provided by popular browsers
36-37 Users can switch topics with a single click, and browsing is not limited by the linear back-and-forward paradigm supported by browsers.
38-39 A flexible interactive search lets the user locate any page ever visited from anywhere using this account, combining content with popularity, site selections and timeliness
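One way to picture a search that combines content, popularity, site selection and timeliness is a simple additive score. The slides do not give Memex's ranking formula; the `Page` fields, the weights, and the scoring function below are assumptions chosen only to make the idea concrete.

```python
import math
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Page:
    url: str
    text: str
    visits: int              # community visit count, used as a popularity proxy
    days_since_visit: float  # smaller means more recent


def score(page, query, site=None, w_pop=0.5, w_time=0.5):
    """Hypothetical relevance score; 0.0 means the page does not match."""
    # Site selection acts as a hard filter on the URL's host name.
    if site is not None and urlparse(page.url).hostname != site:
        return 0.0
    words = page.text.lower().split()
    terms = query.lower().split()
    content = sum(words.count(t) for t in terms) / max(len(words), 1)
    if content == 0:
        return 0.0
    popularity = math.log1p(page.visits)
    timeliness = 1.0 / (1.0 + page.days_since_visit)
    return content + w_pop * popularity + w_time * timeliness


def search(pages, query, site=None):
    """Return URLs of matching pages, best score first."""
    scored = [(score(p, query, site), p) for p in pages]
    hits = [(s, p) for s, p in scored if s > 0]
    hits.sort(key=lambda sp: sp[0], reverse=True)
    return [p.url for _, p in hits]


pages = [
    Page("http://example.org/jazz", "jazz concert review", 1, 30),
    Page("http://example.com/jazz-news", "jazz festival news", 20, 1),
    Page("http://example.org/java", "java compiler", 50, 0),
]
all_sites = search(pages, "jazz")
one_site = search(pages, "jazz", site="example.org")
print(all_sites, one_site)
```

The two jazz pages match the query equally well on content, so popularity and recency decide their order; restricting the search to one site removes the other page entirely.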
40 Close integration of the Memex client with the browser is non-trivial to implement but adds greatly to comfort and ease of use
41 Memex system diagram
[Figure: Memex system diagram. Labels in the figure: Browser, Memex server, Visit, Client JAR, Taxonomy synthesis, Resource discovery, Search, Attach, Recommendation, Folder, Download, Context, Classification, Mining demons, Running client applet, Event-handler servlets, Archive, Clustering, Relational metadata, Text index, Topic models. Caption: Memex client-server protocol and workload sharing negotiations.]
42 Document workflow
[Figure: document workflow. Components: Browser, Memex client, Crawler, NODE table, Per-document version queue, Demon Registry, Search indexer, Classifier service, Clustering service, Garbage collector. Annotations: Page visit and bookmarking events logged; Push new version; Pop and discard old version.]
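The labels on this slide suggest that each document has a queue of versions, that registered demons consume those versions, and that a garbage collector discards old ones. The exact protocol is not given, so the sketch below is one plausible reading: the `VersionQueue` class, its method names, and the rule "discard a version once every registered demon has processed it, but keep the newest" are assumptions.

```python
from collections import deque


class VersionQueue:
    """Versions of one document, oldest first (illustrative interpretation)."""

    def __init__(self, demons):
        self.demons = set(demons)  # names taken from a demon registry
        self.versions = deque()    # entries: (content, set of demons done)

    def push(self, content):
        """Push a newly fetched version onto the queue."""
        self.versions.append((content, set()))

    def next_for(self, demon):
        """Hand `demon` the oldest version it has not seen and mark it processed."""
        for content, seen in self.versions:
            if demon not in seen:
                seen.add(demon)
                return content
        return None

    def collect(self):
        """Pop and discard old versions that every registered demon has processed.

        The newest version is always kept.
        """
        discarded = []
        while len(self.versions) > 1 and self.versions[0][1] >= self.demons:
            discarded.append(self.versions.popleft()[0])
        return discarded


queue = VersionQueue(demons=["search indexer", "classifier service"])
queue.push("version 1")
queue.push("version 2")

first = queue.next_for("search indexer")
too_early = queue.collect()           # classifier has not seen version 1 yet
second = queue.next_for("classifier service")
collected = queue.collect()           # now version 1 can be discarded
print(first, too_early, second, collected, len(queue.versions))
```

Tracking which demons have processed each version lets slow and fast demons work at their own pace without losing a version that one of them still needs.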
43 Autonomous topic organization
- Bookmarks often collected into topics
- Surfers use personal topic organization
- One-size-fits-all taxonomy inadequate
  - Many topics over-developed for most of us
    - http://dmoz.org/Sports/Hockey/Underwater_Hockey/
  - But deeper interests often underdeveloped
- Structure reorganization also desirable
- Best taxonomy depends on community behavior as well as page content
44 Autonomy and collaboration
- Personalization ≠ picking Yahoo nodes
- Complex relations between topics
- Need simplest common ground
- Coalesce similar topics where possible, without sacrificing individual taste
[Figure: topic trees of User1, User2, User3 and Yahoo, in which folders such as Sports, Cycling, Hiking, Biz, Shops and Bikeshops are arranged differently in each tree. Annotated relations: Subsumption, Tree inversion.]
45 Taxonomy synthesis example
[Figure: folders and the URLs they contain.
Media: kpfa.org, bbc.co.uk, kron.com
Broadcasting: channel4.com, kcbs.com
Entertainment: foxmovies.com, miramax.com
Studios: lucasfilms.com]
- Generating themes makes map simpler
- But distorts contents of original folders
- Joint optimization gives best themes
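The trade-off in these bullets (fewer themes give a simpler map, but merged themes distort the original folders) can be made concrete with a toy cost function. This is not the joint optimization used in Memex, which the slides do not specify. The descriptive terms attached to each folder, the `lam` weight, the distortion measure, and the exhaustive search over groupings are all assumptions, and the search is only feasible for a handful of folders.

```python
def partitions(items):
    """Yield every way to split `items` into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        yield [[first]] + part  # `first` forms its own group
        for i in range(len(part)):  # or joins one of the existing groups
            yield part[:i] + [[first] + part[i]] + part[i + 1:]


def cost(partition, folders, lam):
    """Map complexity (lam per theme) plus distortion of each folder.

    A folder's distortion is the fraction of its theme's terms that the
    folder itself did not contain.
    """
    total = lam * len(partition)
    for group in partition:
        theme_terms = set().union(*(folders[name] for name in group))
        for name in group:
            total += 1.0 - len(folders[name]) / len(theme_terms)
    return total


# Hypothetical descriptive terms for the four folders in the example figure.
folders = {
    "Media": {"radio", "tv", "news"},
    "Broadcasting": {"tv", "radio", "channel"},
    "Entertainment": {"movies", "film", "studio"},
    "Studios": {"film", "studio", "production"},
}

best = min(partitions(list(folders)), key=lambda p: cost(p, folders, lam=1.0))
print(best, cost(best, folders, lam=1.0))
```

With these made-up terms, keeping four separate folders costs 4.0, collapsing everything into one theme costs 3.5, and pairing Media with Broadcasting and Entertainment with Studios costs 3.0, so the search settles on two themes. Raising or lowering `lam` shifts the balance between a simpler map and fidelity to individual folders.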
46 Summary and project status
- Collaborative resource discovery and topic management system
- Testbed for hypertext mining research
- Signed Java2 client
  - Netscape 4.5 available
  - IE5 planned
- Server for Unix and Windows
  - IBM UDB, Berkeley DB, servlets
  - Non-trivial to install and manage
  - Simple-to-use RPMs being planned
- http://www.cse.iitb.ernet.in/soumen