Efficiently Linking Text Documents with Relevant Structured Information. Prasan Roy, with Venkat Chakaravarthy, Himanshu Gupta and Mukesh Mohania. IBM India Research Lab, New Delhi

1
Efficiently Linking Text Documents with Relevant
Structured Information
Prasan Roy, with Venkat Chakaravarthy, Himanshu Gupta
and Mukesh Mohania
IBM India Research Lab, New Delhi
2
Structured and Unstructured Information
  • Information content in an enterprise can be
    structured or unstructured
  • Structured content: transaction data, payroll,
    sales orders, invoices, customer profiles, etc.
  • Unstructured content: emails, reports, web pages,
    complaints, etc.
  • Historically, the structured and unstructured
    data retrieval technologies have evolved
    separately ⇒ artificial separation between these
    two kinds of information
  • Enterprises are realizing the need to bridge this
    separation, and are demanding integrated
    retrieval, management and analysis of both the
    structured and unstructured content

[Figure: structured data retrieval (SQL query → RDBMS → SQL result) shown beside unstructured data retrieval (keyword search query → search engine → search result)]
3
EROCS: Entity RecOgnition in Context of
Structured data
  • Exploit partial information contained in a
    document to automatically identify and link
    relevant structured data
  • Main idea
      • View the structured data as a set of pre-defined
        entities
      • Identify the entities from this set that best
        match the document, and also find embeddings of
        the identified entities in the document

4
Entity Templates
  • Specify the locations (within the relational
    database) of
      • The candidate entities to be matched in the
        document
      • For each entity, the relevant context information
        to be exploited to perform the match
  • Specified by a domain expert
  • Default: all tables reachable from the identified
    root node
  • Future work: automatic identification of relevant
    templates

5
Example
<movie>
  <name>Rebecca</name>
  <actor><name>Bruce, Nigel</name> <actedas>Major Giles Lacy</actedas></actor>
  <actor><name>Olivier, Laurence</name> <actedas>'Maxim' de Winter</actedas></actor>
  <actor><name>Collier, Constance</name> <actedas>Rebecca</actedas></actor>
  <actor><name>Cooper, Gladys</name> <actedas>Beatrice Lacy</actedas></actor>
  <director>Hitchcock, Alfred</director>
</movie>
<movie>
  <name>Godfather, The</name>
  <actor><name>Brando, Marlon</name> <actedas>Don Vito Corleone</actedas></actor>
  <actor><name>Caan, James</name> <actedas>'Sonny' Corleone</actedas></actor>
  <actor><name>Duvall, Robert (I)</name> <actedas>Tom Hagen</actedas></actor>
  <actor><name>Lettieri, Al</name> <actedas>Virgil Sollozzo</actedas></actor>
  <actor><name>Rendina, Victor</name> <actedas>Philip Tattaglia</actedas></actor>
  <director>Coppola, Francis F.</director>
</movie>
EROCS
6
Enables Effective Search
  • Existing search engines
      • Recall is inherently limited by the terms actually
        present in the document
      • For instance, the term "Godfather", though
        relevant, does not appear in the document ⇒
        existing search engines would not return this
        document in response to the query "Godfather"
  • EROCS value-add
      • Automatically retrieves relevant information from
        a structured database and associates it with the
        document as additional metadata
      • Search can exploit this metadata ⇒ improves
        recall and precision
      • A search on "Godfather" would return the example
        document
      • "Show documents about movies with characters
        Rebecca and Corleone" would NOT return the
        example document
      • Enables more complex XML Fragments/XPath queries
        on the documents
      • The associated metadata can also be used to gauge
        similarity between documents, and to enable or
        complement sophisticated text analysis

7
Enables OLAP on Structured + Unstructured Data
  • Current OLAP tools restricted to structured data
  • EROCS can incorporate unstructured data in the
    analysis

[Figure: example query, "Find the store with the greatest upsurge in complaints on high-value transactions", which draws on both unstructured data and structured data]
8
Outline
  • Introduction ✓
  • Framework
  • Identifying Best-Matching Entities
  • Context Cache
  • Exploiting the Context Cache
  • Experimental Study
  • Conclusion

9
Entity and Document Models
Pivot table
  • Entity
      • Each row e in the pivot table is identified as an
        entity
      • Context of the entity e: the set of
        terms present in the row e as well as in rows
        of the context tables that have a path to e
  • Document
      • A sequence of sentences, where each sentence is a
        bag of terms
      • The actual implementation runs a parser on the
        document and retains only the noun phrases
      • Could be enhanced by further disambiguating the
        terms using NER (identifying customer names,
        organization names, product names, etc.)
  • Segment
      • A sequence of one or more consecutive sentences
        in the document

10
Entity-Document Matching
  • The score of an entity e with respect to a
    segment d is defined as
    score(e, d) = Σ_{t ∈ T(e, d)} tf(t, d) · w(t)
  • where
      • T(e, d): the set of terms common to the
        entity e and the segment d
      • tf(t, d): the number of times the term t appears
        in the segment d
      • w(t): the weight of the term t
  • In our implementation, w(t) is defined as
    (formula not transcribed)
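
The scoring rule on this slide can be sketched in a few lines of Python. This is only an illustration: the transcript does not preserve the weight formula, so the sketch assumes an IDF-style weight, log(N / n(t)), where N is the number of entities and n(t) the number of entities whose context contains t. The names `term_weight`, `score`, and the toy contexts (including the entity "Psycho") are hypothetical.

```python
import math
from collections import Counter

def term_weight(term, entity_contexts):
    """Assumed IDF-style weight: log(N / n(t)), or 0 if no entity contains t."""
    n_t = sum(1 for ctx in entity_contexts.values() if term in ctx)
    return math.log(len(entity_contexts) / n_t) if n_t else 0.0

def score(entity, segment_terms, entity_contexts):
    """score(e, d): sum of tf(t, d) * w(t) over terms shared by e and d."""
    tf = Counter(segment_terms)
    context = entity_contexts[entity]
    return sum(tf[t] * term_weight(t, entity_contexts) for t in tf if t in context)

# Toy contexts: entity -> set of context terms (hypothetical data).
contexts = {
    "Rebecca": {"hitchcock", "olivier", "maxim"},
    "Godfather, The": {"coppola", "brando", "corleone"},
    "Psycho": {"hitchcock", "perkins"},
}
segment = ["corleone", "brando", "corleone", "hitchcock"]
scores = {e: score(e, segment, contexts) for e in contexts}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Under the assumed weight, "hitchcock" appears in two of the three contexts and so counts for less than "corleone" or "brando", which each appear in only one.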

11
Identifying the Best Matching Annotation
  • Input: document D, set of entities E
  • An annotation for the document D is a pair (S, F)
    where
      • S is a set of non-overlapping segments of D
      • F: S → E maps each segment d ∈ S to an entity
        F(d) ∈ E
  • Score of an annotation (S, F) is defined as
    score(S, F) = Σ_{d ∈ S} (score(F(d), d) − λ)
  • where λ ≥ 0 is a tunable parameter
  • The λ term ensures that score(F(d), d) ≥ λ for each
    segment d in the solution

[Figure: segments of the document D mapped to entities in E]
12
Pruning the Search Space
  • An annotation (S, F) is termed canonical iff
      • S is a partition of D
      • F maps each segment d ∈ S to its best matching
        entity
  • Claim
      • For any document D, there exists a canonical
        annotation that is an optimal annotation
        for D
  • ⇒ We restrict the search space to canonical
    annotations, without loss of generality

[Figure: each segment of the document D mapped to its best match in E]
13
Algorithm
  • Let D_{i,j} be the segment of D containing sentences i
    through j, 0 < i ≤ j ≤ |D|
  • Let e_{i,j} be the best matching entity for D_{i,j},
    with s_{i,j} = score(e_{i,j}, D_{i,j})
  • Let (S_k, F_k) be the best annotation for D_{1,k},
    with r_k = score(S_k, F_k)
  • Then
      • r_k = max_{0 ≤ j ≤ k−1} (r_j + s_{j+1,k} − λ)
  • Procedure BestAnnot(D)
      • Input: document D
      • Output: best annotation
      • For i = 1 to |D|
          • For j = i to |D|
              • Let e_{i,j} = argmax_{e ∈ E} score(e, D_{i,j})
              • Let s_{i,j} = score(e_{i,j}, D_{i,j})
      • Let S_0 = ∅
      • Let r_0 = 0
      • For k = 1 to |D|
          • Let j = argmax_{0 ≤ j ≤ k−1} (r_j + s_{j+1,k} − λ)
          • Let S_k = S_j ∪ {D_{j+1,k}}
          • Let r_k = r_j + s_{j+1,k} − λ
      • For each d ∈ S_{|D|}
          • Let F_{|D|}(d) = argmax_{e ∈ E} score(e, d)
      • Return (S_{|D|}, F_{|D|})

[Figure: the optimal annotation of D_{1,k} is the optimal annotation of D_{1,j} extended with the segment D_{j+1,k}, which maps to its best matching entity]
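
The dynamic program above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the scoring function is passed in as a callable, the penalty is the parameter `lam`, and the toy example uses a hypothetical unit-weight score (the number of segment terms found in an entity's context).

```python
def best_annot(sentences, entities, score, lam):
    """Best canonical annotation of a document via the recurrence for r_k.

    sentences: list of sentences, each a list of terms
    entities:  iterable of candidate entities
    score:     callable (entity, list_of_terms) -> float
    lam:       per-segment penalty
    Returns (segments, mapping, total); segments are 1-based inclusive
    (i, j) sentence ranges, mapping sends each range to its best entity.
    """
    n = len(sentences)
    entities = list(entities)
    best = {}  # (i, j) -> (s_ij, e_ij)
    for i in range(1, n + 1):
        terms = []
        for j in range(i, n + 1):
            terms = terms + sentences[j - 1]
            e = max(entities, key=lambda ent: score(ent, terms))
            best[i, j] = (score(e, terms), e)
    r = [0.0] * (n + 1)   # r[k]: best annotation score for sentences 1..k
    back = [0] * (n + 1)  # back[k]: the split point j chosen for r[k]
    for k in range(1, n + 1):
        back[k] = max(range(k), key=lambda j: r[j] + best[j + 1, k][0] - lam)
        r[k] = r[back[k]] + best[back[k] + 1, k][0] - lam
    segments, k = [], n
    while k > 0:          # walk the split points back from the last sentence
        segments.append((back[k] + 1, k))
        k = back[k]
    segments.reverse()
    return segments, {seg: best[seg][1] for seg in segments}, r[n]

# Toy data with a hypothetical unit-weight score.
contexts = {"Godfather, The": {"brando", "corleone"},
            "Rebecca": {"hitchcock", "olivier"}}

def unit_score(entity, terms):
    return float(sum(1 for t in terms if t in contexts[entity]))

sentences = [["brando"], ["corleone"], ["hitchcock"], ["olivier"]]
segments, mapping, total = best_annot(sentences, contexts, unit_score, lam=0.5)
print(segments, mapping, total)
```

The sketch performs one entity search per segment, which is quadratic in the number of sentences.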
14
Issue: Performance
  • The algorithm involves an entity search for every
    segment in the document
      • If done naively, this is likely to be a performance
        bottleneck
  • Possible solutions
      • Cache the results of the entity search
          • A document is unlikely to have repeated
            segments ⇒ not effective
      • Materialize and index the context of each entity
          • High computation, maintenance and storage
            overheads
  • The remainder of the talk develops efficient,
    low-overhead techniques

15
Context Cache
  • Stores associations between entities and terms in
    the document
      • A collection of pairs of the form (e, t), meaning
        that the term t is contained in the context of
        the entity e
      • Indexed on both entities and terms
  • Database access primitives
      • GetEntities(t)
          • Retrieves the set of entities that contain the
            term t in their context
          • Inserts (e, t) in the cache for each e in the set
      • GetTerms(e)
          • Retrieves the set of terms in the context of the
            entity e
          • Inserts (e, t) in the cache for each t in the set

[Figure: the context cache drawn as a matrix of entities against terms, with a 1 for each cached (e, t) pair]
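
A context cache with these two primitives might look like the sketch below. It is an assumption-laden illustration: the relational database is replaced by a plain dictionary (`db_contexts`, hypothetical), and the class and attribute names are invented for the example. It also skips the database on repeated calls for the same term or entity.

```python
from collections import defaultdict

class ContextCache:
    """Cache of (entity, term) pairs, indexed on both entities and terms."""

    def __init__(self, db_contexts):
        # db_contexts stands in for the database: entity -> set of context terms.
        self._db = db_contexts
        self.terms_of = defaultdict(set)     # entity -> cached terms
        self.entities_of = defaultdict(set)  # term -> cached entities
        self.probed_terms = set()            # terms already passed to get_entities
        self.probed_entities = set()         # entities already passed to get_terms
        self.db_calls = 0                    # number of simulated database accesses

    def _insert(self, entity, term):
        self.terms_of[entity].add(term)
        self.entities_of[term].add(entity)

    def get_entities(self, term):
        """GetEntities(t): entities whose context contains the term."""
        if term not in self.probed_terms:
            self.db_calls += 1
            for entity, ctx in self._db.items():
                if term in ctx:
                    self._insert(entity, term)
            self.probed_terms.add(term)
        return set(self.entities_of.get(term, ()))

    def get_terms(self, entity):
        """GetTerms(e): all terms in the context of the entity."""
        if entity not in self.probed_entities:
            self.db_calls += 1
            for term in self._db.get(entity, ()):
                self._insert(entity, term)
            self.probed_entities.add(entity)
        return set(self.terms_of.get(entity, ()))

db = {"Rebecca": {"hitchcock", "olivier"},
      "Godfather, The": {"coppola", "brando"}}
cache = ContextCache(db)
first = cache.get_entities("hitchcock")
second = cache.get_entities("hitchcock")   # served from the cache
terms = cache.get_terms("Rebecca")
print(first, terms, cache.db_calls)
```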
16
Eliminating Repeated Database Access
  • Baseline approach: the most straightforward use
    of the cache
      • Eliminates repeated access to the database for
        repeated invocations of GetEntities(t) and
        GetTerms(e) for the same term t and entity e,
        respectively
  • AllTerms
      • Populate the cache by invoking GetEntities(t) for
        each term t in the document
      • Determine the best matching entity for each
        segment in the document using the information in
        the cache
      • Call BestAnnot(D) to compute the best annotation
  • Does not scale well, in terms of either time or
    space overheads
  • Reason
      • Invokes GetEntities(t) for every term in the
        document
      • This includes terms that are present in a very large
        number of entities
          • Low weight ⇒ make little difference
          • Large result size ⇒ GetEntities is
            computationally expensive
  • Need to avoid calling GetEntities on such terms,
    while ensuring that the best matching entity for
    each segment is retrieved
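
The AllTerms baseline can be illustrated with a short Python sketch. The helper names, the toy database, and the unit term weights are all hypothetical. The term "film" occurs in every entity's context, so looking it up returns the whole entity set without helping to tell entities apart.

```python
from collections import Counter

def all_terms_index(sentences, get_entities):
    """Call get_entities once per distinct document term; return term -> entities."""
    index = {}
    for sentence in sentences:
        for term in sentence:
            if term not in index:
                index[term] = set(get_entities(term))
    return index

def best_entity_from_index(segment_terms, index, weight):
    """Best matching entity for a segment, using only the in-memory index."""
    totals = Counter()
    for term, tf in Counter(segment_terms).items():
        for entity in index.get(term, ()):
            totals[entity] += tf * weight(term)
    return max(totals, key=totals.get) if totals else None

db = {"Godfather, The": {"brando", "corleone", "film"},
      "Rebecca": {"hitchcock", "film"}}
calls = []  # records every simulated database lookup

def get_entities(term):
    calls.append(term)
    return {e for e, ctx in db.items() if term in ctx}

sentences = [["brando", "film"], ["corleone", "film"]]
index = all_terms_index(sentences, get_entities)
best = best_entity_from_index(sentences[0] + sentences[1], index, lambda t: 1.0)
print(calls, best)
```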

17
Cache-based Score Bounds
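  • Given a segment d
      • Let T_C(d) ⊆ T(d) be the set of terms on which
        GetEntities has been invoked so far
  • Then, for any e ∈ E
      • Lower bound on score(e, d):
        score⁻(e, d) = Σ_{t ∈ T(e, d) ∩ T_C(d)} tf(t, d) · w(t)
      • Upper bound on score(e, d):
        score⁺(e, d) = score⁻(e, d) + w_C(d)
      • where
        w_C(d) = Σ_{t ∈ T(d) − T_C(d)} tf(t, d) · w(t)

[Figure: matrix of entities against terms, with the terms split into T_C(d) and T(d) − T_C(d)]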
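
A sketch of cache-based bounds in Python follows. It rests on a stated assumption: the lower bound counts only probed terms known to lie in the entity's context, and the upper bound adds the total weight of all unprobed terms of the segment. The function name, argument layout, and toy weights are hypothetical.

```python
from collections import Counter

def score_bounds(entity, segment_terms, probed, cached_entities, weight):
    """Return (lower, upper) bounds on score(entity, segment).

    probed:          terms on which GetEntities has been invoked so far
    cached_entities: dict term -> set of entities returned for that term
    weight:          callable term -> w(t)
    """
    tf = Counter(segment_terms)
    lower = sum(tf[t] * weight(t) for t in tf
                if t in probed and entity in cached_entities.get(t, ()))
    # Total weight of terms not yet probed: any of them might match the entity.
    unprobed = sum(tf[t] * weight(t) for t in tf if t not in probed)
    return lower, lower + unprobed

weights = {"corleone": 3.0, "brando": 2.0, "film": 0.5}
segment = ["corleone", "brando", "film", "film"]
probed = {"corleone"}
cached = {"corleone": {"Godfather, The"}}
godfather = score_bounds("Godfather, The", segment, probed, cached, weights.get)
rebecca = score_bounds("Rebecca", segment, probed, cached, weights.get)
print(godfather, rebecca)
```

After a single probe, the lower bound of the first entity already equals the upper bound of the second, so no more than a tie is possible for the second entity.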
18
Cache Completeness
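  • Let E_C(d) ⊆ E be the set of entities retrieved by
    at least one GetEntities() call so far
  • The context cache is termed complete with respect to a
    segment d iff the best matching entity for d is
    guaranteed to be present in E_C(d)
  • For any e ∉ E_C(d)
    score⁻(e, d) = 0
  • and, therefore
    score(e, d) ≤ w_C(d), the total tf(t, d) · w(t) over
    the terms t ∈ T(d) − T_C(d)
  • ⇒ The context cache is complete if there exists
    an e ∈ E_C(d) such that
    score⁻(e, d) > w_C(d)

[Figure: matrix of entities against terms; entities e ∉ E_C(d) have no entries under T_C(d), only possibly under T(d) − T_C(d)]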
19
Term Pruning Strategy
  • Procedure BestMatchEntity(d)
      • Input: segment d
      • Output: best matching entity
      • Let E_C(d) = ∅
      • Let w_C = Σ_{t ∈ T(d)} tf(t, d) · w(t)
      • Repeat
          • Let t = argmax_{t ∈ T(d) − T_C(d)} tf(t, d) · w(t)
          • Let w_C = w_C − tf(t, d) · w(t)
          • Let E_C(d) = E_C(d) ∪ GetEntities(t)
          • Let e = argmax_{e ∈ E_C(d)} score⁻(e, d)
      • Until score⁻(e, d) > w_C
      • Let E(d) = E_C(d), s* = 0
      • Repeat
          • Call GetTerms(e)
          • Let s = score(e, d)
          • If s* < s then let s* = s, e* = e
          • Let E(d) = E(d) − {e}
          • Let e = argmax_{e ∈ E(d)} score⁻(e, d)
      • Until score⁻(e, d) + w_C < s*
  • The first loop calls GetEntities on t ∈ T(d) in
    decreasing order of tf(t, d) · w(t), stopping when
    the cache becomes complete
  • AllSegments
      • For each segment d in D
          • Call BestMatchEntity(d) to compute the
            best matching entity
      • Call BestAnnot(D) to compute the best annotation
  • Does not scale well with document size
  • Reason
      • Computes the best matching entities for all
        segments in the document
  • Need to effectively prune the set of segments
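
The term-pruning search can be sketched in Python as below. This is an illustrative reading of the procedure rather than the authors' code: the database is simulated by a dictionary, the weights are made-up numbers, and the function additionally returns the list of probed terms so the pruning is visible.

```python
from collections import Counter

def best_match_entity(segment_terms, db, weight):
    """Term-pruning search for the best matching entity of one segment.

    db:     dict entity -> set of context terms (stand-in for the database)
    weight: callable term -> w(t)
    Returns (entity, score, probed_terms).
    """
    tf = Counter(segment_terms)
    gain = {t: tf[t] * weight(t) for t in tf}
    remaining = sorted(gain, key=gain.get, reverse=True)  # heaviest term first
    w_c = sum(gain.values())   # total weight of the terms not yet probed
    lower = Counter()          # entity -> lower-bound score from probed terms
    probed = []
    # Phase 1: probe terms until a retrieved entity beats every unseen entity.
    while remaining:
        t = remaining.pop(0)
        probed.append(t)
        w_c -= gain[t]
        for entity, ctx in db.items():      # simulated GetEntities(t)
            if t in ctx:
                lower[entity] += gain[t]
        if lower and max(lower.values()) > w_c:
            break
    # Phase 2: compute exact scores, most promising candidate first.
    best_e, best_s = None, 0.0
    for entity in sorted(lower, key=lower.get, reverse=True):
        if lower[entity] + w_c < best_s:
            break                           # upper bound too low: stop
        s = sum(gain[t] for t in tf if t in db[entity])  # simulated GetTerms(e)
        if s > best_s:
            best_e, best_s = entity, s
    return best_e, best_s, probed

db = {"Godfather, The": {"corleone", "brando", "film"},
      "Rebecca": {"hitchcock", "film"},
      "Psycho": {"hitchcock", "film"}}      # "Psycho" is a hypothetical entity
weights = {"corleone": 3.0, "brando": 2.0, "film": 0.25}
result = best_match_entity(["corleone", "brando", "film", "film"], db, weights.get)
print(result)
```

In this toy run only the heaviest term is probed. The common term "film", which would return every entity, is never sent to the database.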

20
Cache-based Best-Annotation Score Bounds
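  • Modify BestAnnot to compute the best annotation
    using only the cache contents
      • In-memory computation: no DB access
  • Let (S, F) be the best annotation for the
    document D
  • Let (S_C, F_C) be the annotation returned by
    BestAnnotC(D)
  • Then
    score⁻(S_C, F_C) ≤ score(S, F) ≤ score⁺(S_C, F_C)
  • where score⁻(S_C, F_C) and score⁺(S_C, F_C) are the
    annotation score computed with the cache-based lower
    and upper bounds on score(F_C(d), d), respectively
  • Procedure BestAnnotC(D)
      • Input: document D
      • Output: best annotation
      • For i = 1 to |D|
          • For j = i to |D|
              • Let e_{i,j} = argmax_{e ∈ E} score⁺(e, D_{i,j})
              • Let s_{i,j} = score⁺(e_{i,j}, D_{i,j})
      • Let S_0 = ∅
      • Let r_0 = 0
      • For k = 1 to |D|
          • Let j = argmax_{0 ≤ j ≤ k−1} (r_j + s_{j+1,k} − λ)
          • Let S_k = S_j ∪ {D_{j+1,k}}
          • Let r_k = r_j + s_{j+1,k} − λ
      • For each d ∈ S_{|D|}
          • Let F_{|D|}(d) = argmax_{e ∈ E} score⁺(e, d)
      • Return (S_{|D|}, F_{|D|})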

21
Iterative Cache Refinement The EROCS Algorithm
  • Iteratively refine the cache contents so that
    the slack between the lower and upper bounds of
    successive (S_C, F_C) decreases with every
    iteration
  • On termination, score⁻(S_C, F_C) = score⁺(S_C, F_C)
    ⇒ (S_C, F_C) is the best annotation (S, F)
  • An incremental version of BestAnnotC is used to compute
    successive annotations efficiently
  • Procedure BestAnnotEROCS(D)
      • Input: document D
      • Output: best annotation
      • Initialize the context cache as empty
      • Let (S_C, F_C) = BestAnnotC(D)
      • While score⁻(S_C, F_C) < score⁺(S_C, F_C)
          • Call UpdateCache(S_C, F_C)
          • Let (S_C, F_C) = BestAnnotC(D)
      • Return (S_C, F_C)
  • UpdateCache implements the cache refinement policy
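
The control flow of the refinement loop can be sketched in Python as follows. Only the loop is shown: the annotation step and the cache update are passed in as callables, and the stubs in the example are purely hypothetical stand-ins that shrink a numeric slack by one per update.

```python
def best_annot_erocs(best_annot_c, update_cache):
    """Refine the cache until the annotation's lower and upper bounds meet.

    best_annot_c: callable () -> (annotation, lower_bound, upper_bound),
                  computed from the cache contents only
    update_cache: callable (annotation) -> None, performs one database access
    Returns (annotation, number_of_cache_updates).
    """
    annotation, lower, upper = best_annot_c()
    updates = 0
    while lower < upper:
        update_cache(annotation)
        annotation, lower, upper = best_annot_c()
        updates += 1
    return annotation, updates

# Hypothetical stubs: each cache update removes one unit of slack.
state = {"slack": 3}

def stub_best_annot_c():
    return ("annotation", 10 - state["slack"], 10)

def stub_update_cache(annotation):
    state["slack"] -= 1

result = best_annot_erocs(stub_best_annot_c, stub_update_cache)
print(result)
```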
22
Cache Refinement
  • Define the slack of annotation (S_C, F_C) as
    slack(S_C, F_C) = score⁺(S_C, F_C) − score⁻(S_C, F_C)
  • Let (S¹_C, F¹_C) and (S²_C, F²_C) be the annotations
    computed by BestAnnotC(D) on two consecutive
    invocations
  • Aim: choose the intervening cache update such
    that the decrease in slack, slack(S¹_C, F¹_C) −
    slack(S²_C, F²_C), is maximized
  • Observations
      • If only GetEntities is allowed, then the optimal
        policy is to invoke GetEntities on the term t ∈
        T(D) − T_C(D) such that tf(t, D) · w(t) is maximum
      • If only GetTerms is allowed (and the context cache
        is complete with respect to each segment in S¹_C),
        a good heuristic is to invoke GetTerms on the
        best matching entity for the segment in S¹_C with
        the maximum slack
  • EROCS follows a hybrid policy

23
Cache Refinement Policy
  • Procedure UpdateCache(S_C, F_C)
      • Input: current best annotation
      • If the current cache is complete with respect to
        each d ∈ S_C
          • Let d = argmax_{d ∈ S_C} (score⁺(F_C(d), d) −
            score⁻(F_C(d), d))
          • Call GetTerms(F_C(d))
      • Else
          • Let t = argmax_{t ∈ T(D) − T_C(D)} tf(t, D) · w(t)
          • Call GetEntities(t)
  • Favors calling GetEntities initially and GetTerms
    later
      • Justified since, initially, terms cause a greater
        decrease in the slack
  • Does not intermix GetEntities and GetTerms for an
    annotation
      • Avoids redundant work
  • Works well in practice
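
The hybrid policy can be written as a small decision function. The sketch below is illustrative only: the per-segment dictionaries and the `unprobed_gain` mapping are hypothetical data layouts. The fallback to GetTerms when no unprobed term remains is an added assumption, not something stated on the slide.

```python
def choose_cache_update(segments, unprobed_gain):
    """Pick the next database access under the hybrid policy.

    segments:      list of dicts with keys "entity", "lower", "upper",
                   "complete", one per segment of the current annotation
    unprobed_gain: dict term -> tf(t, D) * w(t) for terms not yet probed
    Returns ("GetTerms", entity) or ("GetEntities", term).
    """
    # Assumption: also fall back to GetTerms when no unprobed term is left.
    if all(seg["complete"] for seg in segments) or not unprobed_gain:
        widest = max(segments, key=lambda seg: seg["upper"] - seg["lower"])
        return ("GetTerms", widest["entity"])
    term = max(unprobed_gain, key=unprobed_gain.get)
    return ("GetEntities", term)

early = choose_cache_update(
    [{"entity": "Rebecca", "lower": 0.0, "upper": 6.0, "complete": False}],
    {"corleone": 3.0, "film": 0.5},
)
late = choose_cache_update(
    [{"entity": "Rebecca", "lower": 4.0, "upper": 4.5, "complete": True},
     {"entity": "Godfather, The", "lower": 3.0, "upper": 5.0, "complete": True}],
    {"film": 0.5},
)
print(early, late)
```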

24
Experimental Setup
  • Structured dataset
      • Subset of the IMDB dataset
      • Entities: movies; context: actors, directors,
        producers, writers, editors
      • 401,660 movies, 2 GB across 8 tables
  • Document dataset
      • Movie reviews/storylines downloaded from
        http://www.filmsite.org
      • Noun phrases identified as relevant terms
      • Removed the name of the movie from the text
      • Decomposed the reviews into 8-sentence segments
      • Classified each segment as good/bad based on
        average term weight
      • Given parameters K and α, generated a random
        document by
          • Picking a random sequence of K distinct movies
          • For each movie in the sequence, including a good
            segment with probability α and a bad segment with
            probability 1 − α
      • Final repository included 50 docs for each K = 1,
        2, …, 10 and α = 0.0, 0.1, …, 1.0
  • Accuracy metric
      • Harmonic mean of the average precision and recall
        over the sentences in the document
  • Parameter setting: λ = 4 (results robust with
    respect to λ)
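
The document-generation step can be sketched in Python as follows. The pool layout, function name, and toy segments are hypothetical, and the sketch assumes every movie has at least one good and one bad segment available.

```python
import random

def generate_document(segments_by_movie, k, alpha, rng):
    """Build a synthetic document from K distinct movies.

    segments_by_movie: dict movie -> {"good": [...], "bad": [...]} segment pools
    alpha:             probability of picking a good segment for each movie
    rng:               a random.Random instance
    Returns a list of (movie, quality, segment) triples in document order.
    """
    movies = rng.sample(sorted(segments_by_movie), k)  # K distinct movies
    document = []
    for movie in movies:
        quality = "good" if rng.random() < alpha else "bad"
        segment = rng.choice(segments_by_movie[movie][quality])
        document.append((movie, quality, segment))
    return document

pools = {
    "Rebecca": {"good": ["good segment R"], "bad": ["bad segment R"]},
    "Godfather, The": {"good": ["good segment G"], "bad": ["bad segment G"]},
    "Psycho": {"good": ["good segment P"], "bad": ["bad segment P"]},
}
all_good = generate_document(pools, 2, 1.0, random.Random(0))
all_bad = generate_document(pools, 3, 0.0, random.Random(0))
print(all_good, all_bad)
```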

25
Experimental Results: Efficacy of Fine-Grained
Entity Matching
[Plots: one panel at α = 0.8, one panel at K = 10]
Top-K picks the top K entities that best match
the entire document (considered as a single
segment)
26
Experimental Results: Efficacy of EROCS
[Plot: α = 0.8]
27
Experimental Results: Efficacy of EROCS (contd.)
[Plot: K = 10]
28
Related Work
  • Semantic integration [AI Magazine, Special Issue
    on Semantic Integration (2005)]
      • Identifying common concepts across heterogeneous
        data sources
      • Mainstay: heterogeneous structured databases
        (DB), within and across text documents (IR)
  • Keyword-based search in relational databases
    [DBXplorer, BANKS (ICDE02), Discover (VLDB03)]
      • Few keywords, all presumed relevant to a single
        entity
  • Named-entity recognition [Mansuri and Sarawagi,
    Chandel et al. (ICDE06), Agichtein and Ganti
    (SIGKDD04)]
      • Entities recognized only if explicitly mentioned
        in the document
  • Please refer to the paper for a more complete
    discussion

29
Summary
  • EROCS: a system for inter-linking information
    across structured databases and documents
  • An effective iterative improvement algorithm that
    tries to keep the information retrieved from the
    database small
  • The linkages discovered can be used to enrich
    several techniques and applications in the
    database-centric and IR-centric domains

30
(No Transcript)