Efficiently Linking Text Documents with Relevant Structured Information Prasan Roy with Venkat Chakaravarthy, Himanshu Gupta and Mukesh Mohania IBM India Research Lab, New Delhi - PowerPoint PPT Presentation

1
Efficiently Linking Text Documents with Relevant Structured Information
Prasan Roy
with Venkat Chakaravarthy, Himanshu Gupta and Mukesh Mohania
IBM India Research Lab, New Delhi
2
Structured and Unstructured Information

Information content in an enterprise can be structured or unstructured
  Structured content: transaction data, payroll, sales orders, invoices, customer profiles, etc.
  Unstructured content: emails, reports, web pages, complaints, etc.
Historically, structured and unstructured data retrieval technologies have evolved separately → artificial separation between these two kinds of information
Enterprises are realizing the need to bridge this separation, and are demanding integrated retrieval, management and analysis of both structured and unstructured content

[Figure: structured data retrieval (SQL query → RDBMS → SQL result) alongside unstructured data retrieval (keyword search query → search engine → search result)]
3
EROCS: Entity RecOgnition in Context of Structured data

Exploit partial information contained in a
document to automatically identify and link
relevant structured data
Main Idea
View the structured data as a set of pre-defined
entities
Identify the entities from this set that best
match the document, and also find embeddings of
the identified entities in the document

4
Entity Templates

Specify the locations (within the relational database) of:
  The candidate entities to be matched in the document
  For each entity, the relevant context information to be exploited to perform the match
Specified by a domain expert
  Default: all tables reachable from the identified root node
Future work: automatic identification of relevant templates

5
Example
<movie>
  <name>Rebecca</name>
  <actor>
    <name>Bruce, Nigel</name>
    <actedas>Major Giles Lacy</actedas>
  </actor>
  <actor>
    <name>Olivier, Laurence</name>
    <actedas>'Maxim' de Winter</actedas>
  </actor>
  <actor>
    <name>Collier, Constance</name>
    <actedas>Rebecca</actedas>
  </actor>
  <actor>
    <name>Cooper, Gladys</name>
    <actedas>Beatrice Lacy</actedas>
  </actor>
  <director>Hitchcock, Alfred</director>
</movie>
<movie>
  <name>Godfather, The</name>
  <actor>
    <name>Brando, Marlon</name>
    <actedas>Don Vito Corleone</actedas>
  </actor>
  <actor>
    <name>Caan, James</name>
    <actedas>'Sonny' Corleone</actedas>
  </actor>
  <actor>
    <name>Duvall, Robert (I)</name>
    <actedas>Tom Hagen</actedas>
  </actor>
  <actor>
    <name>Lettieri, Al</name>
    <actedas>Virgil Sollozzo</actedas>
  </actor>
  <actor>
    <name>Rendina, Victor</name>
    <actedas>Philip Tattaglia</actedas>
  </actor>
  <director>Coppola, Francis F.</director>
</movie>
EROCS
6
Enables Effective Search

Existing Search Engines
  Recall is inherently limited by the terms actually present in the document
  For instance, the term "Godfather", though relevant, does not appear in the example document → existing search engines would not return this document in response to the query "Godfather"
EROCS Value-Add
  Automatically retrieves relevant information from a structured database and associates it with the document as additional metadata
  Search can exploit this metadata → improves recall and precision
    A search on "Godfather" would return the example document
    "Show documents about movies with characters Rebecca and Corleone" would NOT return the example document
  Enables more complex XML Fragments/XPath queries on the documents
  The associated metadata can also be used to gauge similarity between documents and to enable or complement sophisticated text analysis

7
Enables OLAP on Structured + Unstructured Data

Current OLAP tools are restricted to structured data
EROCS can incorporate unstructured data in the analysis
Example query: "Find the store with the greatest upsurge in complaints on high-value transactions" (complaints are unstructured data; stores and transactions are structured data)
8
Outline

Introduction ✓
Framework
Identifying Best-Matching Entities
Context Cache
Exploiting the Context Cache
Experimental Study
Conclusion

9
Entity and Document Models
[Figure: pivot table]

Entity
  Each row e in the pivot table is identified as an entity
  Context of the entity e: the set of terms present in the row e as well as in rows of the context tables having a path to e
Document
  A sequence of sentences, where each sentence is a bag of terms
  The actual implementation runs a parser on the document and retains only the noun phrases
  Could be enhanced by further disambiguating the terms using NER (identifying customer names, organization names, product names, etc.)
Segment
  A sequence of one or more consecutive sentences in the document
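
The document model above can be sketched in a few lines of Python. This is only an illustration under assumptions: sentences arrive already tokenized into terms (the slide's noun-phrase parser is not reproduced), and the helper names `make_document`, `segment_bag`, and `all_segments` are hypothetical, not taken from the talk.

```python
from collections import Counter
from typing import List, Tuple


def make_document(sentences: List[List[str]]) -> List[Counter]:
    """Turn a list of tokenized sentences into a list of term bags."""
    return [Counter(terms) for terms in sentences]


def segment_bag(doc: List[Counter], i: int, j: int) -> Counter:
    """Bag of terms of the segment covering sentences i..j (1-based, inclusive)."""
    bag: Counter = Counter()
    for sentence in doc[i - 1:j]:
        bag.update(sentence)
    return bag


def all_segments(n: int) -> List[Tuple[int, int]]:
    """Every run of one or more consecutive sentences in an n-sentence document."""
    return [(i, j) for i in range(1, n + 1) for j in range(i, n + 1)]


# Toy document with three sentences (hypothetical data).
doc = make_document([["godfather", "corleone"], ["corleone"], ["brando"]])
first_two = segment_bag(doc, 1, 2)
```

A document of n sentences has n(n+1)/2 segments, which is why the later slides worry about the cost of running an entity search per segment.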

10
Entity-Document Matching

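The score of an entity $e$ with respect to a segment $d$ is defined as
  $\mathrm{score}(e, d) = \sum_{t \in T(e, d)} \mathrm{tf}(t, d) \cdot w(t)$
where
  $T(e, d)$: the set of terms common to the context of entity $e$ and the segment $d$
  $\mathrm{tf}(t, d)$: the number of times the term $t$ appears in the segment $d$
  $w(t)$: the weight of the term $t$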
In our implementation, w(t) is defined as
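
A small Python sketch of the scoring step, assuming the score is the tf-weighted sum over the terms shared by the entity context and the segment, as the definitions of T(e, d), tf(t, d) and w(t) suggest. The weight formula itself did not survive in this transcript, so `idf_weight` below is an assumed inverse-frequency stand-in (log of the number of entities over the number of entities containing the term), not the authors' definition. All names and data are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Dict, Set


def idf_weight(term: str, contexts: Dict[str, Set[str]]) -> float:
    """ASSUMED weight: log(N / n(t)); 0 if no entity contains the term."""
    n_t = sum(1 for ctx in contexts.values() if term in ctx)
    return math.log(len(contexts) / n_t) if n_t else 0.0


def score(entity_context: Set[str], segment: Counter,
          weight: Callable[[str], float]) -> float:
    """Sum of tf(t, d) * w(t) over terms shared by the entity and the segment."""
    return sum(tf * weight(t) for t, tf in segment.items() if t in entity_context)


# Toy contexts (hypothetical data).
contexts = {
    "godfather": {"brando", "corleone", "coppola"},
    "rebecca": {"olivier", "hitchcock", "rebecca"},
}
segment = Counter(["corleone", "corleone", "brando", "hitchcock"])
w = lambda t: idf_weight(t, contexts)
godfather_score = score(contexts["godfather"], segment, w)
rebecca_score = score(contexts["rebecca"], segment, w)
```

With this toy data the segment shares three term occurrences with "godfather" and one with "rebecca", so the first entity scores higher.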

11
Identifying the Best Matching Annotation

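Input: document $D$, set of entities $E$
An annotation for the document $D$ is a pair $(S, F)$ where
  $S$ is a set of non-overlapping segments of $D$
  $F : S \to E$ maps each segment $d \in S$ to an entity $F(d) \in E$
Score of an annotation $(S, F)$ is defined as
  $\mathrm{score}(S, F) = \sum_{d \in S} \left( \mathrm{score}(F(d), d) - \lambda \right)$
where $\lambda \ge 0$ is a tunable parameter
  The $-\lambda$ term ensures that $\mathrm{score}(F(d), d) \ge \lambda$ for each segment $d$ in the solution
[Figure: segments of $D$ mapped to entities in $E$]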
12
Pruning the Search Space

An annotation $(S, F)$ is termed canonical iff
  $S$ is a partition of $D$
  $F$ maps each segment $d \in S$ to its best matching entity
Claim
  For any document $D$, there exists a canonical annotation that is an optimal annotation for $D$
  → We restrict the search space to canonical annotations, without loss of generality
[Figure: each segment of $D$ mapped to its best match in $E$]
13
Algorithm

Let $D_{i,j}$ be the segment of $D$ containing sentence $i$ to sentence $j$, $0 < i \le j \le |D|$
Let $e_{i,j}$ be the best matching entity for $D_{i,j}$, with $s_{i,j} = \mathrm{score}(e_{i,j}, D_{i,j})$
Let $(S_k, F_k)$ be the best annotation for $D_{1,k}$, with $r_k = \mathrm{score}(S_k, F_k)$
Then
  $r_k = \max_{0 \le j \le k-1} \left( r_j + s_{j+1,k} - \lambda \right)$

Procedure BestAnnot(D)
Input: document D
Output: best annotation
  For i = 1 to |D|
    For j = i to |D|
      Let e[i,j] = argmax over e in E of score(e, D[i,j])
      Let s[i,j] = score(e[i,j], D[i,j])
  Let S[0] = {}
  Let r[0] = 0
  For k = 1 to |D|
    Let j = argmax over 0 <= j <= k-1 of (r[j] + s[j+1,k] - lambda)
    Let S[k] = S[j] U {D[j+1,k]}
    Let r[k] = r[j] + s[j+1,k] - lambda
  For each d in S[|D|]
    Let F[|D|](d) = argmax over e in E of score(e, d)
  Return (S[|D|], F[|D|])

[Figure: the optimal annotation of $D_{1,k}$ consists of the optimal annotation of $D_{1,j}$ plus the segment $D_{j+1,k}$, which maps to its best matching entity]
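
The dynamic program on this slide can be sketched in Python as follows. This is a naive illustration, not the authors' implementation: it assumes the segment score is the tf-weighted sum over shared terms, takes term weights as a plain dictionary, and searches all entities for every segment (the bottleneck the later slides address). The names `best_annot`, `contexts`, `weight`, and `lam` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def best_annot(doc: List[Counter], contexts: Dict[str, Set[str]],
               weight: Dict[str, float], lam: float):
    """Return (score, [((i, j), entity), ...]) for the best canonical annotation."""
    n = len(doc)
    best: Dict[Tuple[int, int], Tuple[float, str]] = {}
    for i in range(1, n + 1):
        bag: Counter = Counter()
        for j in range(i, n + 1):
            bag.update(doc[j - 1])  # bag of terms of segment i..j
            best[(i, j)] = max(
                ((sum(tf * weight.get(t, 0.0) for t, tf in bag.items() if t in ctx), e)
                 for e, ctx in contexts.items()),
                key=lambda pair: pair[0],
            )
    r = [0.0] * (n + 1)   # r[k]: best annotation score for sentences 1..k
    back = [0] * (n + 1)  # back[k]: split point j chosen for r[k]
    for k in range(1, n + 1):
        j = max(range(k), key=lambda j: r[j] + best[(j + 1, k)][0] - lam)
        back[k] = j
        r[k] = r[j] + best[(j + 1, k)][0] - lam
    annotation = []
    k = n
    while k > 0:  # walk the split points back from the last sentence
        j = back[k]
        annotation.append(((j + 1, k), best[(j + 1, k)][1]))
        k = j
    annotation.reverse()
    return r[n], annotation


# Toy data (hypothetical): two sentences per movie, unit weights.
doc = [Counter(["corleone"]), Counter(["brando"]),
       Counter(["hitchcock"]), Counter(["olivier"])]
contexts = {"godfather": {"corleone", "brando"}, "rebecca": {"hitchcock", "olivier"}}
weight = {"corleone": 1.0, "brando": 1.0, "hitchcock": 1.0, "olivier": 1.0}
total, annotation = best_annot(doc, contexts, weight, lam=0.5)
```

The per-segment penalty `lam` is what keeps the program from splitting the document into one segment per sentence: merging two sentences about the same entity saves one penalty.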
14
Issue: Performance

The algorithm involves an entity search for every segment in the document
  If done naively, this is likely to be a performance bottleneck
Possible Solutions
  Cache the result of the entity search
    A document is unlikely to have repeated segments → not effective
  Materialize and index the context of each entity
    High computation, maintenance and storage overheads
The remainder of the talk develops efficient, low-overhead techniques


15
Context Cache

Stores associations between entities and terms in the document
  Collection of pairs of the form (e, t), meaning that the term t is contained in the context of the entity e
  Indexed both on entities and terms
Database Access Primitives
  GetEntities(t)
    Retrieves the set of entities that contain term t in their context
    Inserts (e, t) in the cache for each e in the set
  GetTerms(e)
    Retrieves the set of terms in the context of the entity e
    Inserts (e, t) in the cache for each t in the set

[Figure: the context cache as an entities × terms matrix, with a 1 marking each cached pair (e, t)]
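
A minimal Python sketch of such a cache, indexed on both entities and terms. The "database" here is an in-memory dictionary standing in for the relational store, and the class and attribute names (`ContextCache`, `db_calls`, and so on) are hypothetical; the slide only specifies the two primitives and the (e, t) pairs they insert.

```python
from collections import defaultdict
from typing import Dict, Set


class ContextCache:
    """Cache of (entity, term) pairs, indexed on both entities and terms."""

    def __init__(self, db_contexts: Dict[str, Set[str]]):
        self._db = db_contexts                 # stand-in for the database
        self.terms_of = defaultdict(set)       # entity -> cached terms
        self.entities_of = defaultdict(set)    # term -> cached entities
        self._probed_terms: Set[str] = set()   # terms GetEntities was run on
        self._probed_entities: Set[str] = set()  # entities GetTerms was run on
        self.db_calls = 0

    def _insert(self, entity: str, term: str) -> None:
        self.terms_of[entity].add(term)
        self.entities_of[term].add(entity)

    def get_entities(self, term: str) -> Set[str]:
        """GetEntities(t): entities whose context contains the term."""
        if term not in self._probed_terms:
            self.db_calls += 1
            for entity, ctx in self._db.items():
                if term in ctx:
                    self._insert(entity, term)
            self._probed_terms.add(term)
        return set(self.entities_of[term])

    def get_terms(self, entity: str) -> Set[str]:
        """GetTerms(e): all terms in the context of the entity."""
        if entity not in self._probed_entities:
            self.db_calls += 1
            for term in self._db.get(entity, ()):
                self._insert(entity, term)
            self._probed_entities.add(entity)
        return set(self.terms_of[entity])


# Toy database (hypothetical data).
cache = ContextCache({
    "godfather": {"brando", "corleone", "coppola"},
    "rebecca": {"olivier", "hitchcock"},
})
first = cache.get_entities("corleone")
second = cache.get_entities("corleone")  # served from the cache
terms = cache.get_terms("godfather")
```

Repeating a call for the same term or entity is answered from the cache, so `db_calls` counts only the distinct database accesses.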
16
Eliminating Repeated Database Access

Baseline approach: the most straightforward use of the cache
  Eliminates repeated access to the database for repeated invocations of GetEntities(t) and GetTerms(e) for the same term t and entity e, respectively
AllTerms
  Populate the cache by invoking GetEntities(t) for each term t in the document
  Determine the best matching entity for each segment in the document using the information in the cache
  Call BestAnnot(D) to compute the best annotation
Does not scale well, in terms of both time and space overheads
Reason
  Invokes GetEntities(t) for every term in the document
  This includes terms that are present in a very large number of entities
    Low weight → make little difference to the score
    Large result size → GetEntities is computationally expensive
Need to avoid calling GetEntities on such terms, while ensuring that the best matching entity for each segment is retrieved

17
Cache-based Score Bounds

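Given a segment $d$
  Let $T_C(d) \subseteq T(d)$ be the set of terms on which GetEntities has been invoked so far
Then, for any $e \in E$
  Lower bound on $\mathrm{score}(e, d)$: $\mathrm{score}^-(e, d) = \sum_{t \in T_C(e, d)} \mathrm{tf}(t, d) \cdot w(t)$
  Upper bound on $\mathrm{score}(e, d)$: $\mathrm{score}^+(e, d) = \mathrm{score}^-(e, d) + w_C(d)$
where
  $T_C(e, d) \subseteq T(e, d)$ is the set of terms of $d$ that the cache associates with $e$
  $w_C(d) = \sum_{t \in T(d) - T_C(d)} \mathrm{tf}(t, d) \cdot w(t)$
[Figure: entities × terms matrix with the term columns split into $T_C(d)$ and $T(d) - T_C(d)$]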
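
A hedged Python sketch of cache-based bounds. The exact formulas are not legible in this transcript, so the following is an assumed reading: the lower bound counts the segment terms the cache already associates with the entity, and the upper bound adds the weight of every segment term that has not been probed with GetEntities and is not already counted. The function and parameter names are hypothetical.

```python
from collections import Counter
from typing import Dict, Set, Tuple


def score_bounds(segment: Counter, weight: Dict[str, float],
                 cached_terms: Set[str], probed_terms: Set[str]) -> Tuple[float, float]:
    """Return (lower, upper) bounds on score(e, d) from cache contents only.

    cached_terms: terms the cache associates with entity e
    probed_terms: terms of d on which GetEntities has been invoked
    """
    contrib = lambda t: segment[t] * weight.get(t, 0.0)
    # Terms known to be in the context of e contribute for certain.
    lower = sum(contrib(t) for t in segment if t in cached_terms)
    # Unprobed terms might or might not be in the context of e.
    unknown = sum(contrib(t) for t in segment
                  if t not in probed_terms and t not in cached_terms)
    return lower, lower + unknown


# Toy data (hypothetical): only "corleone" has been probed so far.
segment = Counter({"corleone": 2, "brando": 1, "hitchcock": 1})
weight = {"corleone": 1.0, "brando": 1.0, "hitchcock": 1.0}
seen = score_bounds(segment, weight, cached_terms={"corleone"}, probed_terms={"corleone"})
unseen = score_bounds(segment, weight, cached_terms=set(), probed_terms={"corleone"})
```

In the toy data an entity retrieved for "corleone" is bounded between 2 and 4, while an entity the cache has never seen is bounded between 0 and 2, which is the kind of gap that lets unseen entities be ruled out without touching the database.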
18
Cache Completeness

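Let $E_C(d) \subseteq E$ be the set of entities retrieved by at least one GetEntities() call so far
The context cache is termed complete with respect to a segment $d$ iff the best matching entity for $d$ is guaranteed to be present in $E_C(d)$
Let $\mathrm{score}^-$ and $\mathrm{score}^+$ denote the cache-based lower and upper bounds on the score, and let $w_C(d) = \sum_{t \in T(d) - T_C(d)} \mathrm{tf}(t, d) \cdot w(t)$ be the total weight of the terms of $d$ not yet probed
For any $e \notin E_C(d)$
  $\mathrm{score}^-(e, d) = 0$
and, therefore
  $\mathrm{score}(e, d) \le \mathrm{score}^+(e, d) = w_C(d)$
→ The context cache is complete if there exists an $e \in E_C(d)$ such that
  $\mathrm{score}^-(e, d) > w_C(d)$
[Figure: entities × terms matrix with rows split into $E_C(d)$ and $e \notin E_C(d)$, and columns split into $T_C(d)$ and $T(d) - T_C(d)$]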
19
Term Pruning Strategy

Procedure BestMatchEntity(d)
Input: segment d
Output: best matching entity
  Let EC(d) = {}
  Let wC = sum over t in T(d) of tf(t, d).w(t)
  Repeat
    Let t = argmax over t in T(d) - TC(d) of tf(t, d).w(t)
    Let wC = wC - tf(t, d).w(t)
    Let EC(d) = EC(d) U GetEntities(t)
    Let e = argmax over e in EC(d) of score-(e, d)
  Until score-(e, d) > wC
  Let E'(d) = EC(d), s* = 0
  Repeat
    Call GetTerms(e)
    Let s = score(e, d)
    If s* < s then let s* = s, e* = e
    Let E'(d) = E'(d) - {e}
    Let e = argmax over e in E'(d) of score-(e, d)
  Until score-(e, d) + wC < s*
  Return e*

Call GetEntities on t in T(d) in decreasing order of tf(t, d).w(t), stopping when the cache becomes complete
AllSegments
  For each segment d in D
    Call BestMatchEntity(d) to compute the best matching entity
  Call BestAnnot(D) to compute the best annotation
Does not scale well with document size
Reason
  Computes the best matching entities for all segments in the document
Need to effectively prune the set of segments
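
The term pruning strategy for a single segment can be sketched in Python as below. It is an illustration under assumptions: the database is an in-memory dictionary, term weights are a plain dictionary, the cache is local to one call, and guards are added for the cases where terms or candidates run out (the slide's Repeat/Until loops leave these implicit). All names are hypothetical.

```python
from collections import Counter
from typing import Dict, Set


def best_match_entity(segment: Counter, weight: Dict[str, float],
                      db: Dict[str, Set[str]]):
    """Return (entity, score, call_counts) using term pruning."""
    calls = {"GetEntities": 0, "GetTerms": 0}
    known: Dict[str, Set[str]] = {}  # cache: entity -> terms known in its context

    def contrib(t: str) -> float:
        return segment[t] * weight.get(t, 0.0)

    def get_entities(t: str) -> Set[str]:
        calls["GetEntities"] += 1
        found = {e for e, ctx in db.items() if t in ctx}
        for e in found:
            known.setdefault(e, set()).add(t)
        return found

    def get_terms(e: str) -> None:
        calls["GetTerms"] += 1
        known.setdefault(e, set()).update(db[e])

    def lower(e: str) -> float:  # cache-based lower bound on score(e, d)
        return sum(contrib(t) for t in segment if t in known.get(e, set()))

    # Phase 1: probe terms in decreasing tf*w until the cache is complete.
    order = sorted(segment, key=contrib, reverse=True)
    candidates: Set[str] = set()
    w_c = sum(contrib(t) for t in order)
    for idx, t in enumerate(order):
        w_c = sum(contrib(u) for u in order[idx + 1:])  # weight still unprobed
        candidates |= get_entities(t)
        if candidates and max(lower(e) for e in candidates) > w_c:
            break

    # Phase 2: resolve exact scores, most promising candidate first.
    best_e, best_s = None, 0.0
    remaining = set(candidates)
    while remaining:
        e = max(remaining, key=lower)
        if best_e is not None and lower(e) + w_c < best_s:
            break  # no remaining candidate can beat the best exact score
        get_terms(e)
        s = lower(e)  # exact once the full context is cached
        if best_e is None or s > best_s:
            best_e, best_s = e, s
        remaining.discard(e)
    return best_e, best_s, calls


# Toy data (hypothetical): "movie" has no weight, "corleone" is rare.
db = {"godfather": {"corleone", "brando", "coppola"},
      "rebecca": {"hitchcock", "olivier"},
      "apocalypse": {"brando", "coppola"}}
segment = Counter({"corleone": 2, "brando": 1, "movie": 5})
entity, s, calls = best_match_entity(segment, {"corleone": 3.0, "brando": 1.0}, db)
```

In this toy run a single GetEntities call on the heaviest term is enough to make the cache complete, so the low-weight terms are never sent to the database.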

20
Cache-based Best-Annotation Score Bounds

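Modify BestAnnot to compute the best annotation using only the cache contents
  In-memory computation: no DB access
Let $(S^*, F^*)$ be the best annotation for the document $D$
Let $(S_C, F_C)$ be the annotation returned by BestAnnotC(D)
Then
  $\mathrm{score}^-(S_C, F_C) \le \mathrm{score}(S^*, F^*) \le \mathrm{score}^+(S_C, F_C)$
where
  $\mathrm{score}^{\pm}(S, F) = \sum_{d \in S} \left( \mathrm{score}^{\pm}(F(d), d) - \lambda \right)$, using the cache-based lower and upper bounds on $\mathrm{score}(e, d)$

Procedure BestAnnotC(D)
Input: document D
Output: best annotation
  For i = 1 to |D|
    For j = i to |D|
      Let e[i,j] = argmax over e in E of score+(e, D[i,j])
      Let s[i,j] = score+(e[i,j], D[i,j])
  Let S[0] = {}
  Let r[0] = 0
  For k = 1 to |D|
    Let j = argmax over 0 <= j <= k-1 of (r[j] + s[j+1,k] - lambda)
    Let S[k] = S[j] U {D[j+1,k]}
    Let r[k] = r[j] + s[j+1,k] - lambda
  For each d in S[|D|]
    Let F[|D|](d) = argmax over e in E of score+(e, d)
  Return (S[|D|], F[|D|])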

21
Iterative Cache Refinement The EROCS Algorithm

Iteratively refine the cache contents so that the slack between the lower and upper bounds of successive $(S_C, F_C)$ decreases with every iteration
On termination, $\mathrm{score}^-(S_C, F_C) = \mathrm{score}^+(S_C, F_C)$ → $(S_C, F_C)$ is the best annotation $(S^*, F^*)$
An incremental version of BestAnnotC is used to compute successive annotations efficiently

Procedure BestAnnotEROCS(D)
Input: document D
Output: best annotation
  Initialize the context cache as empty
  Let (SC, FC) = BestAnnotC(D)
  While score-(SC, FC) < score+(SC, FC)
    Call UpdateCache(SC, FC)    (the cache refinement policy)
    Let (SC, FC) = BestAnnotC(D)
  Return (SC, FC)
22
Cache Refinement

Define the slack of annotation $(S_C, F_C)$ as the gap between the cache-based upper and lower bounds on its score
  $\mathrm{slack}(S_C, F_C) = \mathrm{score}^+(S_C, F_C) - \mathrm{score}^-(S_C, F_C)$
Let $(S^1_C, F^1_C)$ and $(S^2_C, F^2_C)$ be the annotations computed by BestAnnotC(D) on two consecutive invocations
Aim: choose the intervening cache update such that the decrease in slack, $\mathrm{slack}(S^1_C, F^1_C) - \mathrm{slack}(S^2_C, F^2_C)$, is maximized
Observations
  If only GetEntities is allowed, then the optimal policy is to invoke GetEntities on the term $t \in T(D) - T_C(D)$ such that $\mathrm{tf}(t, D) \cdot w(t)$ is maximum
  If only GetTerms is allowed (and the context cache is complete with respect to each segment in $S^1_C$), a good heuristic is to invoke GetTerms on the best matching entity for the segment in $S^1_C$ with the maximum slack
EROCS follows a hybrid policy

23
Cache Refinement Policy

Procedure UpdateCache(SC, FC)
Input: current best annotation
  If the current cache is complete with respect to each d in SC
    Let d = argmax over d in SC of (score+(FC(d), d) - score-(FC(d), d))
    Call GetTerms(FC(d))
  Else
    Let t = argmax over t in T(D) - TC(D) of tf(t, D).w(t)
    Call GetEntities(t)

Favors calling GetEntities initially and GetTerms later
  Justified since, initially, terms cause a greater decrease in the slack
Does not intermix GetEntities and GetTerms for an annotation
  Avoids redundant work
Works well in practice
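
The decision made by the refinement policy can be sketched as a small Python function. Only the choice of the next database call is shown; the bounds and the completeness flag per segment are assumed to be computed elsewhere, and `SegmentState` and `choose_update` are hypothetical names introduced for illustration.

```python
from typing import Dict, List, NamedTuple, Tuple


class SegmentState(NamedTuple):
    entity: str      # entity currently assigned to the segment
    lower: float     # cache-based lower bound on its score
    upper: float     # cache-based upper bound on its score
    complete: bool   # is the cache complete with respect to this segment?


def choose_update(segments: List[SegmentState],
                  unprobed: Dict[str, float]) -> Tuple[str, str]:
    """Pick the next database call: ("GetEntities", term) or ("GetTerms", entity).

    unprobed maps each term not yet probed to its tf * w over the document.
    Expects a non-empty list of segments.
    """
    if unprobed and not all(s.complete for s in segments):
        # Cache incomplete somewhere: probe the heaviest unprobed term.
        return ("GetEntities", max(unprobed, key=unprobed.get))
    # Cache complete everywhere: tighten the segment with the largest slack.
    widest = max(segments, key=lambda s: s.upper - s.lower)
    return ("GetTerms", widest.entity)


# Toy states (hypothetical data).
incomplete = [SegmentState("godfather", 6.0, 7.0, True),
              SegmentState("rebecca", 2.0, 5.0, False)]
complete = [SegmentState("godfather", 6.0, 7.0, True),
            SegmentState("rebecca", 2.0, 5.0, True)]
first_call = choose_update(incomplete, {"hitchcock": 1.5, "olivier": 3.0})
later_call = choose_update(complete, {"hitchcock": 1.5})
```

Early on, some segment is usually incomplete, so the function keeps returning GetEntities calls; once every segment is complete it switches to GetTerms, matching the "GetEntities initially, GetTerms later" behaviour described above.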

24
Experimental Setup

Structured Dataset
  Subset of the IMDB dataset
  Entities: movies; context: actors, directors, producers, writers, editors
  401,660 movies, 2 GB across 8 tables
Document Dataset
  Movie reviews/storylines downloaded from http://www.filmsite.org
    Noun phrases identified as relevant terms
    Removed the name of the movie from the text
  Decomposed the reviews into 8-sentence segments
    Classified each segment as good/bad based on average term weight
  Given parameters K and a, generated a random document by
    Picking a random sequence of K distinct movies
    For each movie in the sequence, including a good segment with probability a and a bad segment with probability 1 - a
  Final repository included 50 docs for each K = 1, 2, ..., 10 and a = 0.0, 0.1, ..., 1.0
Accuracy Metric
  Harmonic mean of the average precision and recall over the sentences in the document
Parameter Setting: $\lambda = 4$ (results robust with respect to $\lambda$)
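
The document generation procedure and the accuracy metric can be sketched in Python as follows. This is an assumed reading of the setup: each movie is taken to have one pre-classified good segment and one bad segment, and the function names, the dictionaries, and the seeded generator are hypothetical conveniences for a reproducible illustration.

```python
import random
from typing import Dict, List, Tuple


def generate_document(movies: List[str], good: Dict[str, str], bad: Dict[str, str],
                      k: int, a: float, rng: random.Random) -> List[Tuple[str, str, str]]:
    """Pick k distinct movies; for each, take its good segment with probability a."""
    chosen = rng.sample(movies, k)
    doc = []
    for movie in chosen:
        if rng.random() < a:
            doc.append((movie, "good", good[movie]))
        else:
            doc.append((movie, "bad", bad[movie]))
    return doc


def harmonic_mean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy repository (hypothetical data).
movies = ["rebecca", "godfather", "vertigo", "psycho"]
good = {m: f"good segment of {m}" for m in movies}
bad = {m: f"bad segment of {m}" for m in movies}
all_good = generate_document(movies, good, bad, k=3, a=1.0, rng=random.Random(0))
all_bad = generate_document(movies, good, bad, k=3, a=0.0, rng=random.Random(0))
```

Setting a to 1.0 or 0.0 gives documents made only of good or only of bad segments, which are the two ends of the range swept in the experiments.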

25
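Experimental Results: Efficacy of Fine-Grained Entity Matching
[Plots not captured in the transcript; panels labeled a = 0.8 and K = 10]
Top-K: picks the top K entities that best match the entire document (considered as a single segment)
26
Experimental Results: Efficacy of EROCS
[Plot not captured in the transcript; panel labeled a = 0.8]
27
Experimental Results: Efficacy of EROCS (contd.)
[Plot not captured in the transcript; panel labeled K = 10]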
Related Work

Semantic Integration: AI Magazine, Special Issue on Semantic Integration (2005)
  Identifying common concepts across heterogeneous data sources
  Mainstay: heterogeneous structured databases (DB), within and across text documents (IR)
Keyword-based search in relational databases: DBXplorer, BANKS (ICDE'02), Discover (VLDB'03)
  Few keywords, all presumed relevant to a single entity
Named-entity recognition: Mansuri and Sarawagi, Chandel et al. (ICDE'06), Agichtein and Ganti (SIGKDD'04)
  Entities recognized only if explicitly mentioned in the document
Please refer to the paper for a more complete discussion

29
Summary

EROCS: a system for inter-linking information across structured databases and documents
An effective iterative improvement algorithm that tries to keep the information retrieved from the database small
The linkages discovered can be used to enrich several techniques and applications in the database-centric and IR-centric domains
