Title: A Platform for Personal Information Integration (Xin Luna Dong, Alon Halevy)
1 A Platform for Personal Information Integration
Xin (Luna) Dong, Alon Halevy
lunadong, alon_at_cs.washington.uga.edu
University of Washington
- Paper Review by Delroy Cameron, 2/23/2006
2 A Platform for Personal Information Integration
Introduction
- Digital Information Explosion
- Search
- WWW Search (e.g. Google, Yahoo)
- Personal Search (Desktop data)
- SEMEX (SEMantic EXplorer)
- Goals of Personal Information Management (PIM)
- Uses Semantic Associations
- Browsing and Querying
- Automatic Creation/Detection of Associations
3 A Platform for Personal Information Integration
1. Browsing by Semantic Association
- SEMEX Technique
- Requires a Logical View instead of a hierarchical one
- Objects and relations between those Objects
- e.g. Person, Book, AuthoredBy, AttachedTo
- Instantiation of Logical View
- Association Database or Personal Information Space
- Current Problem
- Keyword Based Search
- Data stored in Directory Hierarchies on PCs
- Time to traverse trees to find relationships
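As a rough illustration of the logical view above, objects and their relations can be held as (subject, relation, object) triples and browsed one association at a time. This is only a minimal sketch: the `AssociationDB` class, its method names, and the sample identifiers are hypothetical and are not taken from SEMEX itself.

```python
from collections import defaultdict

class AssociationDB:
    """Toy association database of (subject, relation, object) triples."""

    def __init__(self):
        self.classes = {}                    # object id -> class name
        self.outgoing = defaultdict(list)    # subject -> [(relation, object)]
        self.incoming = defaultdict(list)    # object -> [(relation, subject)]

    def add_object(self, obj_id, cls):
        self.classes[obj_id] = cls

    def associate(self, subject, relation, obj):
        self.outgoing[subject].append((relation, obj))
        self.incoming[obj].append((relation, subject))

    def browse(self, obj_id):
        """Everything one association away, in either direction."""
        links = list(self.outgoing[obj_id])
        # Inverse links are marked with a "^-1" suffix (an assumed convention).
        links += [(rel + "^-1", other) for rel, other in self.incoming[obj_id]]
        return sorted(links)


db = AssociationDB()
db.add_object("book1", "Book")        # hypothetical instances
db.add_object("luna", "Person")
db.add_object("msg7", "Message")
db.associate("book1", "AuthoredBy", "luna")
db.associate("book1", "AttachedTo", "msg7")

print(db.browse("luna"))
print(db.browse("book1"))
```

Browsing starts from an object and follows its associations, rather than walking a directory tree to discover the same relationship.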
4 A Platform for Personal Information Integration
SEMEX Data Association Techniques
- Association by File type
- Simple Case
- Email Clients
- e.g. senders and recipients
- More Complex Case
- e.g. AuthorOf
- Associations by LaTeX Types and PPT
- Association using External Source
- e.g. List of all Graduate Students in a University
- Association by Integration
- Multiple sources with simpler associations
- Spreadsheets, WWW
- Reconcile References
5 A Platform for Personal Information Integration
2. Automatic Creation/Detection of Associations
- Data Integration using SEMEX
- Import data into the user's personal information space
- From WWW
- From local files
- Formatting Data
- Scraping from Files or Web pages
- Form Associations
- External sources and user's personal Domain Model
- Import Data
- Reconcile references
- Analyze data for pattern matching
- Derive new associations
6 A Platform for Personal Information Integration
3. PIM Challenges
- Handling Long-lived/Evolving Data
- Data consistency, seamless updating
- Reference Reconciliation
- Schema Mapping
- Right Granularity for personal Data
- Keep models simple, users not technically savvy
- Develop user-oriented vs. database-oriented design
- Let the system fit the user's habitat
- Do not fit user activities into a database environment
- Combining structured/unstructured data
- Seamless conversion, transparent to the user
7 A Platform for Personal Information Integration
4. SEMEX Architecture
- Domain Model
- Ontology of personal information, with objects and their associations
- Data Repository
- Association database or Personal Information Space
- Reference Reconciliation
- Ontology of personal information
- Associations and Instances
- Simple already stored
- Extracted rich objects, e.g. PowerPoint
- External Sources
- Defined similar to views in a database
9 SEMEX Architecture
10 SEMEX Interface
11 A Platform for Personal Information Integration
4.1 Browsing and Querying
- SEMEX Keyword Search
- All documents mentioning the keyword
- Returns Heterogeneous data
- From many different Classes
- SEMEX Selection Queries
- Specify a Class and values for given Attributes
- SEMEX Association Queries
- Conjunctive over triplets: a pair of objects and their relation
- Returns links much like web browsing
- e.g. Search all Bernstein publications
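An association query of this kind can be read as a conjunction of triple patterns. The sketch below is one possible way to evaluate such a conjunction over a small in-memory list of triples; the function names, the `?variable` syntax, and the sample data are assumptions made for illustration, not the SEMEX query language.

```python
def match(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` agrees with `triple`."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):                 # a variable
            if new.setdefault(term, value) != value:
                return None                      # conflicts with earlier binding
        elif term != value:                      # a constant must match exactly
            return None
    return new


def association_query(patterns, triples):
    """Return every variable binding that satisfies all patterns at once."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for t in triples
                    if (b2 := match(pattern, t, b)) is not None]
    return bindings


# Hypothetical triples: (object, relation, object)
TRIPLES = [
    ("pub1", "AuthoredBy", "Bernstein"),
    ("pub2", "AuthoredBy", "Bernstein"),
    ("pub2", "AuthoredBy", "coauthor_x"),
    ("pub3", "AuthoredBy", "coauthor_x"),
]

# All Bernstein publications
print(association_query([("?p", "AuthoredBy", "Bernstein")], TRIPLES))

# Publications written by Bernstein together with coauthor_x
print(association_query([("?p", "AuthoredBy", "Bernstein"),
                         ("?p", "AuthoredBy", "coauthor_x")], TRIPLES))
```

Each result is a binding of variables to objects, which an interface could render as links to follow.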
12 A Platform for Personal Information Integration
5. Reference Reconciliation
- Mesh External Data to user's Domain Model
- e.g. Mike Carey, M. Carey refer to the same person
- Previous Techniques
- Reconciling tuple references in DB Table
- Assume References have same attributes
- Each attribute has a single value
- Challenges
- Heterogeneous Data, different set of attributes
- References have many attributes
- Each attribute may have multiple values
13 A Platform for Personal Information Integration
Reference Reconciliation cont'd
- SEMEX Approach
- Important Definitions
- Reference
- One or several representations of an object
- Class
- May have several Keys
- A Key is a set of Attributes that uniquely defines an object in a class
- e.g. Person Class Keys: email, (fname, lname)
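To make the key definition concrete, here is a minimal sketch of a key check between two references. It assumes the Person example means two keys, the email alone and the first name together with the last name; `PERSON_KEYS`, `share_key`, and the sample values are hypothetical names chosen for this example.

```python
# Assumed reading of the Person example: two keys, each a set of attributes.
PERSON_KEYS = [("email",), ("fname", "lname")]


def share_key(ref_a, ref_b, keys=PERSON_KEYS):
    """True if both references agree on every attribute of at least one key."""
    for key in keys:
        if all(attr in ref_a and attr in ref_b and ref_a[attr] == ref_b[attr]
               for attr in key):
            return True
    return False


r1 = {"fname": "Mike", "lname": "Carey"}
r2 = {"fname": "Mike", "lname": "Carey", "email": "carey@example.org"}
r3 = {"fname": "M.", "lname": "Carey"}

print(share_key(r1, r2))   # agree on (fname, lname)
print(share_key(r1, r3))   # last name alone is not a key
```

A reference that lacks some attribute of a key simply cannot match on that key, which is why references with different attribute sets need the further steps described later.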
14 A Platform for Personal Information Integration
Reference Reconciliation cont'd
- SEMEX Approach
- Pairwise Decisions
- Enrich references when they match with others
- Contain more information about the domain object
- Support more sophisticated decision matching
- An Enriched reference contains a set of values
- e.g. multiple spellings for last name
- References may be grouped
- e.g. Publications with multiple references to each of its authors
- Group so that all references point to a single author
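Enrichment can be pictured as merging two matched references into one whose attributes each hold a set of values. The following is a small sketch under that assumption; the `enrich` function and the example email address are hypothetical.

```python
def enrich(ref_a, ref_b):
    """Merge two matched references; every attribute keeps all values seen."""
    merged = {}
    for ref in (ref_a, ref_b):
        for attr, values in ref.items():
            merged.setdefault(attr, set()).update(values)
    return merged


a = {"fname": {"Mike"}, "lname": {"Carey"}}
b = {"fname": {"M."}, "lname": {"Carey"}, "email": {"carey@example.org"}}

person = enrich(a, b)
print(person)
```

The enriched reference now carries both spellings of the first name plus an email address, so a later comparison against a third reference has more evidence to work with than either original reference had alone.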
15 A Platform for Personal Information Integration
Reference Reconciliation cont'd
- Algorithm
- Step 1 - Based on Shared Keys
- Merge input references on a key value
- e.g. Person: email, fname, lname
- Step 2 - Based on String Similarity
- Use edit distance to measure similarity
- An independent heuristic, uses known formats
- e.g. phone, email
- Step 3 - Applying Global Knowledge
- Time Series Comparison
- Collects references judged to be similar, gets their time stamps, and merges them if there is little or no overlap
- Step 4 - Search Engine Analysis
- Feeds text into Google and compares top hits
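Steps 2 and 3 can be sketched with two small helpers: a standard Levenshtein edit distance turned into a 0 to 1 similarity score, and a check for whether two references were in use during mostly disjoint time spans. The normalisation, the `max_overlap` parameter, and the use of years as time stamps are assumptions for illustration, not details taken from the slides.

```python
def edit_distance(s, t):
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                 # delete cs
                            curr[j - 1] + 1,             # insert ct
                            prev[j - 1] + (cs != ct)))   # substitute
        prev = curr
    return prev[-1]


def similarity(s, t):
    """Scale the distance to a score in [0, 1]; 1.0 means identical strings."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1 - edit_distance(s, t) / longest


def little_overlap(span_a, span_b, max_overlap=0):
    """Spans are (first_seen, last_seen) numbers, e.g. years of use."""
    overlap = min(span_a[1], span_b[1]) - max(span_a[0], span_b[0])
    return overlap <= max_overlap


print(edit_distance("Carey", "Cary"))
print(similarity("Carey", "Cary"))
print(little_overlap((1998, 2001), (2002, 2005)))
```

Two similar references whose usage periods barely overlap are then candidates for merging, in the spirit of the time series comparison in Step 3.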
16 A Platform for Personal Information Integration
Objects from Multiple Classes
17 Case Study
18 A Platform for Personal Information Integration
19 A Platform for Personal Information Integration
Reference Reconciliation cont'd
- Objects in Multiple Classes
- Reconcile each class in isolation
- Create a Dependency Graph
- A Node for each candidate pair of references
- Each node has similarity score (0 to 1)
- An edge between nodes means we must reconsider one node's similarity if we re-compute the other's
- Using the Dependency Graph
- If c1 and c2 merge, pa1 and pa2 also merge
- pe1, pe2, pe3, and pe4 may merge
- i1 and i2 merge
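The propagation idea can be sketched as a work queue over the dependency graph: when a candidate pair is merged, the pairs that depend on it are re-scored and may cross the merge threshold themselves. The fixed `boost` used as the re-scoring rule, the `threshold` value, and the sample scores are all simplifying assumptions; the slides do not state how similarity is actually recomputed.

```python
from collections import deque


def propagate(scores, edges, boost=0.2, threshold=0.8):
    """scores: candidate pair -> similarity in [0, 1].
    edges: candidate pair -> pairs whose similarity depends on it.
    Returns the set of merged pairs and the final scores."""
    scores = dict(scores)
    merged = set()
    queue = deque(n for n, s in scores.items() if s >= threshold)
    while queue:
        node = queue.popleft()
        if node in merged:
            continue
        merged.add(node)
        for dep in edges.get(node, ()):
            if dep not in merged:
                # Assumed re-scoring rule: a merged neighbour adds evidence.
                scores[dep] = min(1.0, scores[dep] + boost)
                if scores[dep] >= threshold:
                    queue.append(dep)
    return merged, scores


# Hypothetical scores for candidate pairs named after the slide's labels
scores = {("c1", "c2"): 0.9, ("pa1", "pa2"): 0.7, ("pe1", "pe2"): 0.5}
edges = {("c1", "c2"): [("pa1", "pa2")], ("pa1", "pa2"): [("pe1", "pe2")]}

merged, final = propagate(scores, edges)
print(sorted(merged))
```

With these toy numbers, merging c1 and c2 pushes pa1 and pa2 over the threshold, while pe1 and pe2 gain evidence but stay unmerged.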
20 A Platform for Personal Information Integration
Reference Reconciliation cont'd
- Evolving Objects
- Publication may change author, title, etc.
- Blurs the line of when to model as single or multiple references
- Distinguish granularity
- Coarse-grained references compiled from fine-grained ones based on certain similarities
21 A Platform for Personal Information Integration
Automatic Integration
- Integrate External Sources
- Approach
- Import instances and associations from external source into personal information space
- Mark imported data as Temporary or Permanent
- Pose query to find intersection of data
- Export this intersected data to the spreadsheet
- Previous Work
- Schema Mapping
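The import, intersect, and export steps of the approach above might look roughly like the sketch below. It treats instances as plain names, marks newly imported ones as temporary, and writes the intersection as CSV text; the function names and the sample lists are hypothetical, and real instances would first need reference reconciliation rather than exact string matching.

```python
import csv
import io


def integrate(personal, external):
    """Import external instances and report which ones were already known."""
    status = {name: "permanent" for name in personal}
    for name in external:
        status.setdefault(name, "temporary")    # newly imported, removable later
    common = sorted(set(personal) & set(external))
    return status, common


def to_csv(names, header="name"):
    """Render the intersected instances as spreadsheet-ready CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([header])
    writer.writerows([n] for n in names)
    return buffer.getvalue()


my_contacts = ["ann", "bob", "cho"]        # hypothetical personal instances
grad_students = ["bob", "cho", "dee"]      # hypothetical external source
status, common = integrate(my_contacts, grad_students)
print(to_csv(common))
```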
22 Browsing Associations with Semex
LUNA DONG
Different references to the same person
AuthorOfArticles
MentionedIn
SenderOfEmails
RecipientOfEmails
Reference reconciliation identifies all
references to the same real-world object
Coauthors
23 Who are Working on Semex? Keyword Search Returns Associated Instances
Search Semex
3 Conferences for publishing Semex papers
105 Images in Semex papers
2398 Messages 2 Presentations 65 Articles
15 Persons working on Semex (though they are not named Semex)
24 How do I Get to Know this Person? Semex Provides Lineage Information
Susan Dumais
Latest Lineage
Shortest Lineage
User: Do I know this paper of Susan Dumais?
Semex: Yes, you once cited it.
The last time we mentioned Susan Dumais was in an email
Earliest Lineage
I got to know Susan Dumais by citing her paper
25 A Platform for Personal Information Integration