Title: Efficiently Linking Text Documents with Relevant Structured Information Prasan Roy with Venkat Chakaravarthy, Himanshu Gupta and Mukesh Mohania IBM India Research Lab, New Delhi
1. Efficiently Linking Text Documents with Relevant Structured Information
Prasan Roy, with Venkat Chakaravarthy, Himanshu Gupta and Mukesh Mohania
IBM India Research Lab, New Delhi
2. Structured and Unstructured Information
- Information content in an enterprise can be structured or unstructured
  - Structured content: transaction data, payroll, sales orders, invoices, customer profiles, etc.
  - Unstructured content: emails, reports, web pages, complaints, etc.
- Historically, structured and unstructured data retrieval technologies have evolved separately ⇒ artificial separation between these two kinds of information
- Enterprises are realizing the need to bridge this separation, and are demanding integrated retrieval, management and analysis of both structured and unstructured content
[Figure: Structured data retrieval (SQL query, RDBMS, SQL result) alongside unstructured data retrieval (keyword search query, search engine, search result)]
3. EROCS: Entity RecOgnition in Context of Structured data
- Exploit partial information contained in a document to automatically identify and link relevant structured data
- Main idea
  - View the structured data as a set of pre-defined entities
  - Identify the entities from this set that best match the document, and also find embeddings of the identified entities in the document
4. Entity Templates
- Specify the locations (within the relational database) of
  - The candidate entities to be matched in the document
  - For each entity, the relevant context information to be exploited to perform the match
- Specified by a domain expert
  - Default: all tables reachable from the identified root node
  - Future work: automatic identification of relevant templates
5. Example
<movie>
  <name>Rebecca</name>
  <actor> <name>Bruce, Nigel</name> <actedas>Major Giles Lacy</actedas> </actor>
  <actor> <name>Olivier, Laurence</name> <actedas>'Maxim' de Winter</actedas> </actor>
  <actor> <name>Collier, Constance</name> <actedas>Rebecca</actedas> </actor>
  <actor> <name>Cooper, Gladys</name> <actedas>Beatrice Lacy</actedas> </actor>
  <director>Hitchcock, Alfred</director>
</movie>
<movie>
  <name>Godfather, The</name>
  <actor> <name>Brando, Marlon</name> <actedas>Don Vito Corleone</actedas> </actor>
  <actor> <name>Caan, James</name> <actedas>'Sonny' Corleone</actedas> </actor>
  <actor> <name>Duvall, Robert (I)</name> <actedas>Tom Hagen</actedas> </actor>
  <actor> <name>Lettieri, Al</name> <actedas>Virgil Sollozzo</actedas> </actor>
  <actor> <name>Rendina, Victor</name> <actedas>Philip Tattaglia</actedas> </actor>
  <director>Coppola, Francis F.</director>
</movie>
[Figure label: EROCS]
6. Enables Effective Search
- Existing search engines
  - Recall is inherently limited by the terms actually present in the document
  - For instance, the term "Godfather", though relevant, does not appear in the document ⇒ existing search engines would not return this document in response to the query "Godfather"
- EROCS value-add
  - Automatically retrieves relevant information from a structured database and associates it with the document as additional metadata
  - Search can exploit this metadata ⇒ improves recall and precision
    - A search on "Godfather" would return the example document
    - "Show documents about movies with characters Rebecca and Corleone" would NOT return the example document
  - Enables more complex XML Fragments/XPath queries on the documents
  - The associated metadata can also be used to gauge similarity between documents and to enable or complement sophisticated text analysis
7. Enables OLAP on Structured + Unstructured Data
- Current OLAP tools are restricted to structured data
- EROCS can incorporate unstructured data in the analysis
- Example query: "Find the store with the greatest upsurge in complaints on high-value transactions"
[Figure: the example query drawing on both unstructured data and structured data]
8. Outline
- Introduction ✓
- Framework
- Identifying Best-Matching Entities
- Context Cache
- Exploiting the Context Cache
- Experimental Study
- Conclusion
9. Entity and Document Models
[Figure label: Pivot table]
- Entity
  - Each row e in the pivot table is identified as an entity
  - Context of the entity e: the set of terms present in the row e as well as in the rows of the context tables that have a path to e
- Document
  - A sequence of sentences, where each sentence is a bag of terms
  - The actual implementation runs a parser on the document and retains only the noun phrases
  - Could be enhanced by further disambiguating the terms using NER (identifying customer names, organization names, product names, etc.)
- Segment
  - A sequence of one or more consecutive sentences in the document
10. Entity-Document Matching
- The score of an entity e with respect to a segment d is defined as
  $$ score(e, d) = \sum_{t \in T(e, d)} tf(t, d) \cdot w(t) $$
  where
  - T(e, d): the set of terms common between the entity e and segment d
  - tf(t, d): the number of times the term t appears in the segment d
  - w(t): the weight of the term t
- In our implementation, w(t) is defined as [equation not preserved in this transcript]
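The scoring rule on this slide can be sketched in a few lines of Python. This is only an illustration: the term weights and the toy entity context below are invented, and because the slide's own definition of w(t) is not preserved, the weight is passed in as a function.

```python
from collections import Counter

def score(entity_context, segment_terms, w):
    """score(e, d): sum of tf(t, d) * w(t) over terms shared by e and d."""
    tf = Counter(segment_terms)              # tf(t, d)
    common = set(tf) & set(entity_context)   # T(e, d)
    return sum(tf[t] * w(t) for t in common)

# Hypothetical data, for illustration only.
context = {"corleone", "brando", "coppola"}
segment = ["corleone", "family", "corleone", "brando"]
weights = {"corleone": 2.0, "brando": 3.0, "coppola": 3.0, "family": 0.5}

print(score(context, segment, lambda t: weights.get(t, 0.0)))  # 7.0
```

Here "corleone" occurs twice (2 x 2.0) and "brando" once (1 x 3.0); "family" is not in the entity's context and contributes nothing.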
11. Identifying the Best Matching Annotation
- Input: document D, set of entities E
- An annotation for the document D is a pair (S, F) where
  - S is a set of non-overlapping segments of D
  - F: S → E maps each segment d ∈ S to an entity F(d) ∈ E
- Score of an annotation (S, F) defined as
  $$ score(S, F) = \sum_{d \in S} \bigl( score(F(d), d) - \lambda \bigr) $$
  - where λ ≥ 0 is a tunable parameter
  - Ensures that score(F(d), d) ≥ λ for each segment d in the solution
[Figure: segments of D mapped to entities in E]
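To see how the per-segment penalty affects the choice between a coarse and a fine segmentation, here is a small Python sketch. It reads the annotation score as the sum of per-segment scores, each reduced by the tunable parameter; the segment scores and entity names are made up for illustration.

```python
def annotation_score(annotation, lam):
    """annotation: list of (segment, entity, score(F(d), d)) triples."""
    return sum(s - lam for _segment, _entity, s in annotation)

# Two hypothetical annotations of the same four-sentence document.
one_segment = [((1, 4), "m1", 9.0)]
two_segments = [((1, 2), "m1", 6.0), ((3, 4), "m2", 5.0)]

print(annotation_score(one_segment, 4.0), annotation_score(two_segments, 4.0))
print(annotation_score(one_segment, 0.0), annotation_score(two_segments, 0.0))
```

With a penalty of 4 the single-segment annotation wins (5.0 against 3.0); with no penalty the finer annotation wins (11.0 against 9.0), which is why the parameter discourages weakly supported segments.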
12. Pruning the Search Space
- An annotation (S, F) is termed canonical iff
  - S is a partition of D
  - F maps each segment d ∈ S to its best matching entity
- Claim
  - For any document D, there exists a canonical annotation that is an optimal annotation for D
  - ⇒ We restrict the search space to canonical annotations, without loss of generality
[Figure: each segment of D linked to its best match in E]
13. Algorithm
- Let D_{i,j} be the segment in D containing sentence i to sentence j, 0 < i ≤ j ≤ |D|
- Let e_{i,j} be the best matching entity for D_{i,j}, with s_{i,j} = score(e_{i,j}, D_{i,j})
- Let (S_k, F_k) be the best annotation for D_{1,k}, with r_k = score(S_k, F_k)
- Then
  $$ r_k = \max_{0 \le j \le k-1} \bigl( r_j + s_{j+1,k} - \lambda \bigr) $$
- Procedure BestAnnot(D)
  - Input: document D
  - Output: best annotation
  - For i = 1 to |D|
    - For j = i to |D|
      - Let e_{i,j} = argmax_{e ∈ E} score(e, D_{i,j})
      - Let s_{i,j} = score(e_{i,j}, D_{i,j})
  - Let S_0 = ∅
  - Let r_0 = 0
  - For k = 1 to |D|
    - Let j = argmax_{0 ≤ j ≤ k-1} (r_j + s_{j+1,k} − λ)
    - Let S_k = S_j ∪ {D_{j+1,k}}
    - Let r_k = r_j + s_{j+1,k} − λ
  - For each d ∈ S_{|D|}
    - Let F_{|D|}(d) = argmax_{e ∈ E} score(e, d)
  - Return (S_{|D|}, F_{|D|})
[Figure: the optimal annotation for sentences 1..k is the optimal annotation for sentences 1..j extended with the segment D_{j+1,k}, which maps to its best matching entity]
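The dynamic program above can be sketched in Python as follows. This is a naive in-memory version under stated assumptions: entities are given as a dictionary from a name to a set of context terms, the weight function and the toy data are invented, and the entity search for every segment is done by brute force.

```python
from collections import Counter

def score(context, terms, w):
    tf = Counter(terms)
    return sum(tf[t] * w(t) for t in set(tf) & context)

def best_annot(sentences, entities, w, lam):
    """sentences: list of term lists; entities: dict name -> set of context terms."""
    n = len(sentences)
    best = {}  # (i, j) -> (s_ij, e_ij), with 1-based inclusive sentence indices
    for i in range(1, n + 1):
        terms = []
        for j in range(i, n + 1):
            terms = terms + sentences[j - 1]
            best[i, j] = max((score(ctx, terms, w), name)
                             for name, ctx in entities.items())
    r = [0.0] * (n + 1)   # r[k]: score of the best annotation of sentences 1..k
    back = [0] * (n + 1)  # back[k]: the j chosen for k
    for k in range(1, n + 1):
        j = max(range(k), key=lambda j: r[j] + best[j + 1, k][0] - lam)
        r[k] = r[j] + best[j + 1, k][0] - lam
        back[k] = j
    annotation, k = [], n
    while k > 0:          # walk the back-pointers to recover the segments
        j = back[k]
        annotation.append(((j + 1, k), best[j + 1, k][1]))
        k = j
    annotation.reverse()
    return r[n], annotation

# Hypothetical toy input.
entities = {"godfather": {"corleone", "brando"}, "rebecca": {"winter", "hitchcock"}}
sentences = [["corleone"], ["brando"], ["winter"], ["hitchcock"]]
result = best_annot(sentences, entities, lambda t: 3.0, lam=4.0)
print(result)  # (4.0, [((1, 2), 'godfather'), ((3, 4), 'rebecca')])
```

The nested loop over (i, j) performs one entity search per segment, a number of searches quadratic in the document length.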
14. Issue: Performance
- The algorithm involves an entity search for every segment in the document
  - If done naively, this is likely to be a performance bottleneck
- Possible solutions
  - Cache the result of the entity search
    - A document is unlikely to have repeated segments ⇒ not effective
  - Materialize and index the context of each entity
    - High computation, maintenance and storage overheads
- The remainder of the talk develops efficient, low-overhead techniques
15. Context Cache
- Stores associations between entities and terms in the document
  - A collection of pairs of the form (e, t), meaning that the term t is contained in the context of the entity e
  - Indexed on both entities and terms
- Database access primitives
  - GetEntities(t)
    - Retrieves the set of entities that contain the term t in their context
    - Inserts (e, t) in the cache for each e in the set
  - GetTerms(e)
    - Retrieves the set of terms in the context of the entity e
    - Inserts (e, t) in the cache for each t in the set
[Figure: the context cache drawn as a 0/1 matrix over entities and terms]
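A minimal Python sketch of such a cache is given below. The class name, attribute names and the two database callables are hypothetical; the slides do not describe the actual implementation or how the database is queried.

```python
from collections import defaultdict

class ContextCache:
    """Set of (entity, term) pairs, indexed on both entities and terms."""

    def __init__(self, db_get_entities, db_get_terms):
        # Hypothetical callables standing in for the two database primitives.
        self._db_get_entities = db_get_entities
        self._db_get_terms = db_get_terms
        self.terms_of = defaultdict(set)      # entity -> cached terms
        self.entities_of = defaultdict(set)   # term -> cached entities
        self._fetched_terms = set()
        self._fetched_entities = set()
        self.db_calls = 0

    def _insert(self, e, t):
        self.terms_of[e].add(t)
        self.entities_of[t].add(e)

    def get_entities(self, t):
        if t not in self._fetched_terms:      # go to the database only once per term
            self.db_calls += 1
            for e in self._db_get_entities(t):
                self._insert(e, t)
            self._fetched_terms.add(t)
        return set(self.entities_of[t])

    def get_terms(self, e):
        if e not in self._fetched_entities:   # go to the database only once per entity
            self.db_calls += 1
            for t in self._db_get_terms(e):
                self._insert(e, t)
            self._fetched_entities.add(e)
        return set(self.terms_of[e])

# Toy in-memory "database", for illustration only.
movies = {"godfather": {"corleone", "brando"}, "rebecca": {"winter"}}
cache = ContextCache(
    lambda t: {e for e, ctx in movies.items() if t in ctx},
    lambda e: movies[e],
)
```

The sketch tracks which terms and entities have been fetched separately from the pairs themselves, because a pair inserted by GetTerms(e) does not mean that all entities containing that term are already known.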
16. Eliminating Repeated Database Access
- Baseline approach: the most straightforward use of the cache
  - Eliminates repeated access to the database for repeated invocations of GetEntities(t) and GetTerms(e) for the same term t and entity e, respectively
- AllTerms
  - Populate the cache by invoking GetEntities(t) for each term t in the document
  - Determine the best matching entity for each segment in the document using the information in the cache
  - Call BestAnnot(D) to compute the best annotation
- Does not scale well, in terms of both time and space overheads
- Reason
  - Invokes GetEntities(t) for every term in the document
  - Includes terms that are present in a very large number of entities
    - Low weight ⇒ make little difference
    - Large result size ⇒ GetEntities is computationally expensive
- Need to avoid calling GetEntities on such terms, while ensuring that the best matching entity for each segment is retrieved
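17. Cache-based Score Bounds
- Given a segment d
  - Let T_C(d) ⊆ T(d) be the set of terms on which GetEntities has been invoked so far
- Then, for any e ∈ E
  - Lower bound on score(e, d): score^-(e, d) [equation not preserved in this transcript]
  - Upper bound on score(e, d): score^+(e, d) [equation not preserved in this transcript]
[Figure: entity/term cache matrix with the terms of d split into T_C(d) and T(d) − T_C(d)]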
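The slide's own bound formulas are missing from the transcript, so the following Python sketch shows only one reading that is consistent with the BestMatchEntity procedure later in the talk. It assumes the cache has been populated by GetEntities calls alone: the lower bound counts the looked-up terms known to be in the entity's context, and the upper bound adds the total weight of the terms not yet looked up. All names and data are hypothetical.

```python
from collections import Counter

def score_bounds(entity, segment_terms, looked_up_terms, cached_pairs, w):
    """Return (lower, upper) bounds on score(entity, segment) from cache contents.

    looked_up_terms: terms on which GetEntities was invoked, i.e. T_C(d)
    cached_pairs:    set of (entity, term) pairs currently in the cache
    """
    tf = Counter(segment_terms)
    lower = sum(tf[t] * w(t) for t in tf
                if t in looked_up_terms and (entity, t) in cached_pairs)
    # Weight of terms whose entities are still unknown: T(d) - T_C(d).
    pending = sum(tf[t] * w(t) for t in tf if t not in looked_up_terms)
    return lower, lower + pending

weights = {"corleone": 2.0, "family": 0.5, "brando": 3.0}
segment = ["corleone", "corleone", "family", "brando"]
looked_up = {"brando"}
pairs = {("godfather", "brando")}

print(score_bounds("godfather", segment, looked_up, pairs, weights.get))  # (3.0, 7.5)
print(score_bounds("rebecca", segment, looked_up, pairs, weights.get))    # (0, 4.5)
```

An entity never returned by any lookup has a lower bound of zero, and its true score cannot exceed the pending weight.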
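18. Cache Completeness
- Let E_C(d) ⊆ E be the set of entities retrieved by at least one GetEntities() call so far
- The context cache is termed complete with respect to a segment d iff the best matching entity for d is guaranteed to be present in E_C(d)
- For any e ∉ E_C(d), no term of T_C(d) is in the context of e, and therefore
  $$ score(e, d) \le \sum_{t \in T(d) - T_C(d)} tf(t, d) \cdot w(t) $$
- ⇒ The context cache is complete if there exists an e ∈ E_C(d) such that
  $$ score^-(e, d) > \sum_{t \in T(d) - T_C(d)} tf(t, d) \cdot w(t) $$
[Figure: entity/term cache matrix with the entities split into E_C(d) and e ∉ E_C(d), and the terms split into T_C(d) and T(d) − T_C(d)]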
19. Term Pruning Strategy
- Procedure BestMatchEntity(d)
  - Input: segment d
  - Output: best matching entity
  - Let E_C(d) = ∅
  - Let w_C = Σ_{t ∈ T(d)} tf(t, d)·w(t)
  - Repeat
    - Let t = argmax_{t ∈ T(d) − T_C(d)} tf(t, d)·w(t)
    - Let w_C = w_C − tf(t, d)·w(t)
    - Let E_C(d) = E_C(d) ∪ GetEntities(t)
    - Let e = argmax_{e ∈ E_C(d)} score^-(e, d)
  - Until score^-(e, d) > w_C
  - Let E(d) = E_C(d), s* = 0
  - Repeat
    - Call GetTerms(e)
    - Let s = score(e, d)
    - If s* < s then let s* = s, e* = e
    - Let E(d) = E(d) − {e}
    - Let e = argmax_{e ∈ E(d)} score^-(e, d)
  - Until score^-(e, d) + w_C < s*
  - Return e*
- The first loop calls GetEntities on t ∈ T(d) in decreasing order of tf(t, d)·w(t), stopping when the cache becomes complete
- AllSegments
  - For each segment d in D
    - Call BestMatchEntity(d) to compute the best matching entity
  - Call BestAnnot(D) to compute the best annotation
- Does not scale well with document size
- Reason
  - Computes the best matching entities for all segments in the document
- Need to effectively prune the set of segments
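The BestMatchEntity procedure can be sketched in Python as below. The two lookup functions, the toy movie data and the weights are all hypothetical stand-ins for the database primitives; the lower bound used here is simply the weight of the looked-up terms that an entity is known to contain.

```python
from collections import Counter

def best_match_entity(segment_terms, get_entities, get_terms, w):
    tf = Counter(segment_terms)
    weight = {t: tf[t] * w(t) for t in tf}   # tf(t, d) * w(t)
    lower = {}                               # score^-(e, d) for e in E_C(d)
    w_c = sum(weight.values())               # weight of terms not yet looked up
    # Phase 1: look up terms in decreasing weight until the cache is complete.
    for t in sorted(weight, key=weight.get, reverse=True):
        w_c -= weight[t]
        for e in get_entities(t):
            lower[e] = lower.get(e, 0.0) + weight[t]
        if lower and max(lower.values()) > w_c:
            break
    # Phase 2: compute exact scores, most promising candidate first.
    best_e, best_s = None, 0.0
    candidates = dict(lower)
    while candidates:
        e = max(candidates, key=candidates.get)
        if candidates[e] + w_c < best_s:     # no remaining candidate can win
            break
        s = sum(weight[t] for t in get_terms(e) if t in weight)
        if best_e is None or s > best_s:
            best_e, best_s = e, s
        del candidates[e]
    return best_e, best_s

# Toy in-memory "database", for illustration only.
db = {
    "godfather": {"corleone", "brando", "coppola"},
    "rebecca": {"winter", "hitchcock", "olivier"},
    "apocalypse now": {"brando", "coppola", "kurtz"},
}
lookups = []

def get_entities(t):
    lookups.append(t)
    return {e for e, ctx in db.items() if t in ctx}

def get_terms(e):
    return db[e]

w = {"corleone": 3.0, "brando": 2.0, "family": 0.5}
result = best_match_entity(["corleone", "brando", "corleone", "family"],
                           get_entities, get_terms, lambda t: w[t])
print(result, lookups)  # ('godfather', 8.0) ['corleone']
```

In this toy run a single GetEntities call on the heaviest term is enough: the remaining terms weigh 2.5 in total, less than the 6.0 already credited to the leading candidate.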
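20. Cache-based Best-Annotation Score Bounds
- Modify BestAnnot to compute the best annotation using only the cache contents
  - In-memory computation: no DB access
- Let (S, F) be the best annotation for the document D
- Let (S_C, F_C) be the annotation returned by BestAnnotC(D)
- Then
  $$ score^-(S_C, F_C) \le score(S, F) \le score^+(S_C, F_C) $$
  - where [definitions of the annotation-level bounds not preserved in this transcript]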
- Procedure BestAnnotC(D)
  - Input: document D
  - Output: best annotation
  - For i = 1 to |D|
    - For j = i to |D|
      - Let e_{i,j} = argmax_{e ∈ E} score(e, D_{i,j})
      - Let s_{i,j} = score(e_{i,j}, D_{i,j})
  - Let S_0 = ∅
  - Let r_0 = 0
  - For k = 1 to |D|
    - Let j = argmax_{0 ≤ j ≤ k-1} (r_j + s_{j+1,k} − λ)
    - Let S_k = S_j ∪ {D_{j+1,k}}
    - Let r_k = r_j + s_{j+1,k} − λ
  - For each d ∈ S_{|D|}
    - Let F_{|D|}(d) = argmax_{e ∈ E} score(e, d)
  - Return (S_{|D|}, F_{|D|})
21. Iterative Cache Refinement: The EROCS Algorithm
- Iteratively refine the cache contents so that the slack between the lower and upper bounds of successive (S_C, F_C) decreases with every iteration
  - On termination, score^-(S_C, F_C) = score^+(S_C, F_C) ⇒ (S_C, F_C) is the best annotation (S, F)
  - An incremental version of BestAnnotC is used to compute successive annotations efficiently
- Procedure BestAnnotEROCS(D)
  - Input: document D
  - Output: best annotation
  - Initialize the context cache as empty
  - Let (S_C, F_C) = BestAnnotC(D)
  - While score^-(S_C, F_C) < score^+(S_C, F_C)
    - Call UpdateCache(S_C, F_C)
    - Let (S_C, F_C) = BestAnnotC(D)
  - Return (S_C, F_C)
[Callout on UpdateCache: cache refinement policy]
22. Cache Refinement
- Define the slack of annotation (S_C, F_C) as
  $$ slack(S_C, F_C) = score^+(S_C, F_C) - score^-(S_C, F_C) $$
- Let (S^1_C, F^1_C) and (S^2_C, F^2_C) be the annotations computed by BestAnnotC(D) on two consecutive invocations
- Aim: choose the intervening cache update such that the decrease in slack, slack(S^1_C, F^1_C) − slack(S^2_C, F^2_C), is maximized
- Observations
  - If only GetEntities is allowed, then the optimal policy is to invoke GetEntities on the term t ∈ T(D) − T_C(D) such that tf(t, D)·w(t) is maximum
  - If only GetTerms is allowed (and the context cache is complete with respect to each segment in S^1_C), a good heuristic is to invoke GetTerms on the best matching entity for the segment in S^1_C with the maximum slack
- EROCS follows a hybrid policy
23. Cache Refinement Policy
- Procedure UpdateCache(S_C, F_C)
  - Input: current best annotation
  - If the current cache is complete with respect to each d ∈ S_C
    - Let d = argmax_{d ∈ S_C} (score^+(F_C(d), d) − score^-(F_C(d), d))
    - Call GetTerms(F_C(d))
  - Else
    - Let t = argmax_{t ∈ T(D) − T_C(D)} tf(t, D)·w(t)
    - Call GetEntities(t)
- Favors calling GetEntities initially and GetTerms later
  - Justified since, initially, terms cause a greater decrease in the slack
- Does not intermix GetEntities and GetTerms for an annotation
  - Avoids redundant work
- Works well in practice
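The decision made by UpdateCache can be sketched as a small Python function that returns which primitive to call next. Everything passed in (the completeness test, the per-segment slack, the best-entity lookup and the term weights) is a hypothetical callable; the sketch also assumes that some term is still pending whenever the cache is incomplete.

```python
def choose_update(segments, is_complete, slack, best_entity,
                  pending_terms, term_weight):
    """Return ("GetTerms", entity) or ("GetEntities", term), hybrid policy."""
    if all(is_complete(d) for d in segments):
        # Cache complete for every segment: refine the entity of the
        # segment whose bounds are furthest apart.
        d = max(segments, key=slack)
        return ("GetTerms", best_entity(d))
    # Otherwise look up the heaviest term not yet looked up.
    t = max(pending_terms, key=term_weight)
    return ("GetEntities", t)

# Hypothetical state, for illustration only.
segments = ["d1", "d2"]
slacks = {"d1": 1.0, "d2": 3.5}
entity_of = {"d1": "rebecca", "d2": "godfather"}
term_weights = {"corleone": 6.0, "family": 0.5}

incomplete = choose_update(segments, lambda d: False, slacks.get,
                           entity_of.get, term_weights, term_weights.get)
complete = choose_update(segments, lambda d: True, slacks.get,
                         entity_of.get, term_weights, term_weights.get)
print(incomplete, complete)
```

While any segment's cache is incomplete the function picks GetEntities on "corleone"; once all are complete it picks GetTerms on the entity of "d2", the segment with the larger slack.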
24. Experimental Setup
- Structured dataset
  - Subset of the IMDB dataset
  - Entities: movies; context: actors, directors, producers, writers, editors
  - 401660 movies, 2 GB across 8 tables
- Document dataset
  - Movie reviews/storylines downloaded from http://www.filmsite.org
  - Noun phrases identified as relevant terms
  - Removed the name of the movie from the text
  - Decomposed the reviews into 8-sentence segments
  - Classified each segment as good/bad based on average term weight
  - Given parameters K and a, generated a random document by
    - Picking a random sequence of K distinct movies
    - For each movie in the sequence, including a good segment with probability a and a bad segment with probability 1 − a
  - Final repository included 50 docs for each K = 1, 2, ..., 10 and a = 0.0, 0.1, ..., 1.0
- Accuracy metric
  - Harmonic mean of the average precision and recall over the sentences in the document
- Parameter setting: λ = 4 (results robust with respect to λ)
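The document generation step can be sketched in Python as follows. The data layout (one dictionary of good segments and one of bad segments per movie) and the function name are assumptions made for illustration; the slides do not say how the segments were stored or sampled.

```python
import random

def generate_document(good, bad, K, a, rng):
    """Pick K distinct movies; for each, a good segment with probability a.

    good, bad: dict movie -> non-empty list of segments (hypothetical layout)
    Returns a list of (movie, segment) pairs in document order.
    """
    movies = rng.sample(sorted(good), K)  # random sequence of K distinct movies
    doc = []
    for m in movies:
        pool = good[m] if rng.random() < a else bad[m]
        doc.append((m, rng.choice(pool)))
    return doc

# Toy segment pools, for illustration only.
good = {"m1": ["g1"], "m2": ["g2"], "m3": ["g3"]}
bad = {"m1": ["b1"], "m2": ["b2"], "m3": ["b3"]}

all_good = generate_document(good, bad, 2, 1.0, random.Random(0))
all_bad = generate_document(good, bad, 3, 0.0, random.Random(0))
```

With a = 1.0 every chosen segment is a good one and with a = 0.0 every one is bad, which matches the two ends of the range used in the experiments.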
25. Experimental Results: Efficacy of Fine-Grained Entity Matching
[Figures: results for a = 0.8 and for K = 10]
- Top-K picks the top K entities that best match the entire document (considered as a single segment)
26. Experimental Results: Efficacy of EROCS
[Figure: results for a = 0.8]
27. Experimental Results: Efficacy of EROCS (contd.)
[Figure: results for K = 10]
28. Related Work
- Semantic integration [AI Magazine, Special Issue on Semantic Integration (2005)]
  - Identifying common concepts across heterogeneous data sources
  - Mainstay: heterogeneous structured databases (DB), within and across text documents (IR)
- Keyword-based search in relational databases [DBXplorer, BANKS (ICDE02), Discover (VLDB03)]
  - Few keywords, all presumed relevant to a single entity
- Named-entity recognition [Mansuri and Sarawagi; Chandel et al. (ICDE06); Agichtein and Ganti (SIGKDD04)]
  - Entities recognized only if explicitly mentioned in the document
- Please refer to the paper for a more complete discussion
29. Summary
- EROCS: a system for inter-linking information across structured databases and documents
- An effective iterative-improvement algorithm that tries to keep the information retrieved from the database small
- The linkages discovered can be used to enrich several techniques and applications in the database-centric and IR-centric domains