
Title: Information Extraction with Tree Automata Induction


1
Information Extraction with Tree Automata Induction
  • Raymond Kosala¹, Jan Van den Bussche², Maurice Bruynooghe¹, Hendrik Blockeel¹
  • ¹ Katholieke Universiteit Leuven, Belgium
  • ² University of Limburg, Belgium

2
Outline
  • Introduction
  • information extraction (IE)
  • grammatical inference
  • Approach
  • k-testable and g-testable algorithms
  • Preliminary result
  • Further work

3
IE from unstructured documents
  • Extract certain fields of interest from a text
  • Learner is trained with (positive) examples
  • Each learner focuses on a single field marked
    with x

[Figure labels: Company Name, Job Title, Requirement 1, Requirement 2]

4
Grammatical inference
  • Σ: finite alphabet
  • Regular language L ⊆ Σ*
  • Given: set of examples (pos. or neg.)
  • Task: infer a DFA compatible with examples
  • Quality criterion:
    • Exact learning in the limit
    • PAC
    • etc.
  • Large body of work

5
IE with grammatical inference
  • Mark field x as special token
  • Infer DFA for the language
    L = { S over (Σ ∪ {x}) | the field to be extracted is marked by x }
  • Only positive examples

6
IE from structured documents
  • Previous work learns string languages
  • XML or HTML data: tree structured
  • Natural extension is to learn a tree language
  • Example: x has a b-brother (a sibling node labeled b)
  • Extraction of a field can depend on structural
    context

7
Learning process

8
Testing process

9
Page examples

10
Why do we need the context?
  • [Figure: tree fragments of two table rows (tr, td and b nodes) with the CDATA values "lastupdate" / "12/4/98" and "organization" / "ABC"]
  • Not enough to differentiate the fields of
    interest that depend on the structural context.
  • Can be chosen automatically.

11
Tree automata
  • Ranked alphabet Σ: finite set of function symbols with arities.
  • E.g. Σ = { a(2), b(2), c(0) }
  • Tree: ground term over Σ
  • a(c, a(b(c,c), b(c,c))): tree with depth 3
  • Tree automaton M = (Σ, Q, δ, F).
  • δ is a set of transitions of the form v(q1, …, qn) → q
  • Where v ∈ Σ, n is the arity of v, and qi, q ∈ Q

12
Example
  • Given an automaton M with the following
    transitions
  • δ1: c → q0
  • δ2: a(q0, q0) → q0
  • δ3: b(q0, q0) → accept
  • M accepts a tree t ⇔ t has a b-node as the root and no b-node anywhere else
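The run of this automaton can be sketched as a bottom-up pass over the tree. This is a minimal illustration, not code from the presentation: the nested-tuple tree representation and the names `run` and `accepts` are assumptions, and the final-state set is assumed to be {accept}.

```python
# Trees are nested tuples: (label, child, child, ...); a leaf is (label,).
# Transitions map (label, state of child 1, ..., state of child n) to a state.
TRANSITIONS = {
    ("c",): "q0",
    ("a", "q0", "q0"): "q0",
    ("b", "q0", "q0"): "accept",
}
FINAL_STATES = {"accept"}  # assumption: "accept" is the only final state


def run(tree, transitions):
    """Return the state reached at the root, or None if no transition applies."""
    label, *children = tree
    child_states = [run(child, transitions) for child in children]
    if None in child_states:
        return None
    return transitions.get((label, *child_states))


def accepts(tree, transitions=TRANSITIONS, final_states=FINAL_STATES):
    return run(tree, transitions) in final_states


c = ("c",)
print(accepts(("b", c, ("a", c, c))))  # True: b at the root, only a and c below
print(accepts(("a", c, c)))            # False: the root reaches q0, not accept
print(accepts(("b", c, ("b", c, c))))  # False: no transition reads "accept" as a child state
```

The third call shows why a b-node below the root blocks the run: the inner b reaches accept, and no transition takes accept as the state of a child.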

13
Unranked trees
  • XML/HTML
  • E.g. bib(paper, report, book, paper)
  • The number of children is not fixed by the label
  • Two approaches:
    1. Generalize the notion of tree automaton to unranked trees, with transition rules v(e) → q, where e is a regular expression over Q
    2. Encode as ranked tree

14
Encoding of unranked trees
  • There are well-known methods of encoding to binary trees; we use:
  • encode(T) = encodef(T)
  • encodef(v(F1), F2) =
      v                               if F1 = F2 = ε
      vright(encodef(F2))             if F1 = ε, F2 ≠ ε
      vleft(encodef(F1))              if F1 ≠ ε, F2 = ε
      v(encodef(F1), encodef(F2))     otherwise
  • Where T ::= v(F), v ∈ Σ; F ::= ε; F ::= T, F
  • [Figure: an example unranked tree and the binary tree it becomes]
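The encoding can be sketched as a short recursive function over forests. This is an illustrative reading of the four cases on this slide, not the authors' code: the (label, children) representation, the `_left` / `_right` label suffixes and the helper `show` are assumptions.

```python
# An unranked tree is (label, [children]); a forest is a list of such trees.

def encode_forest(forest):
    """Encode a non-empty forest v(F1), F2 as a tree with at most two children."""
    (label, children), rest = forest[0], forest[1:]
    if not children and not rest:            # F1 and F2 both empty
        return (label, [])
    if not children:                         # only following siblings (F2)
        return (label + "_right", [encode_forest(rest)])
    if not rest:                             # only children (F1)
        return (label + "_left", [encode_forest(children)])
    return (label, [encode_forest(children), encode_forest(rest)])


def encode(tree):
    return encode_forest([tree])


def show(tree):
    """Render a tree as a term, e.g. a(b, c)."""
    label, children = tree
    if not children:
        return label
    return f"{label}({', '.join(show(child) for child in children)})"


leaf = lambda name: (name, [])
bib = ("bib", [leaf("paper"), leaf("report"), leaf("book"), leaf("paper")])
print(show(encode(bib)))
# bib_left(paper_right(report_right(book_right(paper))))

nested = ("a", [("b", [leaf("c")]), leaf("d")])
print(show(encode(nested)))
# a_left(b(c, d))
```

In the encoded tree the first child of a node stands for its first child in the original tree and the second child stands for its next sibling, which is why a flat list of siblings turns into a chain.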

15
k-testable tree languages
  • Languages in which membership can be checked by
    just looking at subtrees of length k-1 that
    appear in the tree.
  • k-roots
  • k-forks
  • k-subtrees

16
k-testable tree languages (cont.)
  • An example: t = html(head(title), body(h1, table))
  • f2(t) = { html(head, body), head(title), body(h1, table) }
  • r2(t) = html(head, body)
  • s2(t) = { body(h1, table), head(title), table, title, h1 }
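The three sets for the example tree can be computed with a few recursive functions. This is a sketch under assumptions: the definitions below are one reading of the k-roots, k-forks and k-subtrees of Rico-Juan et al., chosen so that they reproduce the example on this slide; a leaf is taken to have depth 0, and the function names are made up for illustration.

```python
# Trees are nested tuples: (label, child, child, ...); a leaf is (label,).

def depth(tree):
    """Depth of a tree; a leaf has depth 0 (assumed convention)."""
    return 1 + max(depth(child) for child in tree[1:]) if len(tree) > 1 else 0


def roots(k, tree):
    """r_k: the top k levels of the tree."""
    if k <= 1:
        return (tree[0],)
    return (tree[0], *(roots(k - 1, child) for child in tree[1:]))


def forks(k, tree):
    """f_k: the k-roots of every subtree of depth at least k - 1."""
    found = set().union(*(forks(k, child) for child in tree[1:]))
    if depth(tree) >= k - 1:
        found.add(roots(k, tree))
    return found


def subtrees(k, tree):
    """s_k: every subtree of depth at most k - 1."""
    found = set().union(*(subtrees(k, child) for child in tree[1:]))
    if depth(tree) <= k - 1:
        found.add(tree)
    return found


t = ("html", ("head", ("title",)), ("body", ("h1",), ("table",)))
print(roots(2, t))          # ('html', ('head',), ('body',))
print(len(forks(2, t)))     # 3
print(len(subtrees(2, t)))  # 5
```

With k = 2 the whole tree t is too deep to be a 2-subtree, so s2(t) holds only head(title), body(h1, table) and the three leaves.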
17
k-testable algorithm (Rico-Juan et al.)
  • Given a set of positive examples T and a positive integer k
  • Q, FS, δ = Ø
  • For each t ∈ T:
    • Let R = r_{k-1}(t), F = f_k(t), S = s_{k-1}(t)
    • Q = Q ∪ R ∪ r_{k-1}(F) ∪ S
    • FS = FS ∪ R
    • ∀ v(t1, …, tm) ∈ S: δ = δ ∪ { δm(v(t1, …, tm)) = v(t1, …, tm) }
    • ∀ v(t1, …, tm) ∈ F: δ = δ ∪ { δm(v(t1, …, tm)) = r_{k-1}(v(t1, …, tm)) }
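The steps above can be sketched as follows for k ≥ 2. This is a minimal illustration under assumptions, not the reference implementation: states are represented by the trees themselves, a leaf has depth 0, and the helper definitions of r_k, f_k and s_k are one reading of Rico-Juan et al. All function names are made up for the example.

```python
# Trees are nested tuples: (label, child, child, ...); a leaf is (label,).

def depth(tree):
    return 1 + max(depth(child) for child in tree[1:]) if len(tree) > 1 else 0

def roots(k, tree):
    if k <= 1:
        return (tree[0],)
    return (tree[0], *(roots(k - 1, child) for child in tree[1:]))

def forks(k, tree):
    found = set().union(*(forks(k, child) for child in tree[1:]))
    if depth(tree) >= k - 1:
        found.add(roots(k, tree))
    return found

def subtrees(k, tree):
    found = set().union(*(subtrees(k, child) for child in tree[1:]))
    if depth(tree) <= k - 1:
        found.add(tree)
    return found

def infer(examples, k):
    """Build (Q, FS, delta) from positive examples; states are trees."""
    states, final_states, delta = set(), set(), {}
    for t in examples:
        R, F, S = {roots(k - 1, t)}, forks(k, t), subtrees(k - 1, t)
        states |= R | {roots(k - 1, f) for f in F} | S
        final_states |= R
        for s in S:                    # small subtrees are their own state
            delta[s] = s
        for f in F:                    # a fork leads to its (k-1)-root
            delta[f] = roots(k - 1, f)
    return states, final_states, delta

def accepts(tree, final_states, delta):
    def run(node):
        child_states = [run(child) for child in node[1:]]
        if None in child_states:
            return None
        return delta.get((node[0], *child_states))
    return run(tree) in final_states

t = ("html", ("head", ("title",)), ("body", ("h1",), ("table",)))
Q, FS, delta = infer([t], k=2)
print(accepts(t, FS, delta))  # True
swapped = ("html", ("head", ("title",)), ("body", ("table",), ("h1",)))
print(accepts(swapped, FS, delta))  # False: the fork body(table, h1) was never seen
```

Because a transition key has the shape (label, child state, …) and states are trees, each key is itself a tree, which is why the sketch can use the elements of S and F directly as dictionary keys.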

18
g-testable algorithm
  • Idea: generalize state transitions from forks that are not important for the extraction.
  • Important forks are those that contain x and (possibly) the distinguishing context.

[Figure: a tree t and its generalizations gen(t,1) and gen(t,2)]

19
g-testable algorithm
  • Given a set of positive examples T and positive integers k and l
  • FS, δ, Ss, CF, OF, OF′ = Ø
  • For each t ∈ T:
    • Let R = r_{k-1}(t), F = f_k(t), S = s_{k-1}(t)
    • Ss = Ss ∪ S
    • FS = FS ∪ R
    • ∀ v(t1, …, tm) ∈ S: δ = δ ∪ { δm(v(t1, …, tm)) = v(t1, …, tm) }
    • CF = CF ∪ { f | f ∈ F, f contains x }
    • OF = OF ∪ { f | f ∈ F, f does not contain x }
  • For each of ∈ OF:
    • of′ = gen(of, l)
    • If of′ covers one of CF then OF′ = OF′ ∪ {of} else OF′ = OF′ ∪ {of′}
  • Let F = OF′ ∪ CF
  • Q = Q ∪ F ∪ r_{k-1}(F) ∪ Ss
  • ∀ v(t1, …, tm) ∈ F: δ = δ ∪ { δm(v(t1, …, tm)) = r_{k-1}(v(t1, …, tm)) }

20
g-testable algorithm example
  • A set of examples T (with k = 2 and l = 1):
    html(head(title), body(h1, x, table)) and html(head(title), body(h1, x))
  • F = f2(T) = { html(head, body), head(title), body(h1, x), body(h1, x, table) }
  • FS = { html }
  • OF = { html(head, body), head(title) }
  • CF = { body(h1, x), body(h1, x, table) }
  • OF′ = { html(*, *), head(*) }
  • R = r1(T) = { html }
  • Ss = s1(T) = { table, title, h1, x }

21
g-testable algorithm example (cont.)
  • F = { html(*, *), head(*), body(h1, x), body(h1, x, table) }
  • Transitions from the trees in the subtrees s1(T):
    • δ(table) = table
    • δ(title) = title
    • δ(h1) = h1
    • δ(x) = x
  • Transitions from the trees in the generalized forks F:
    • δ(html(*, *)) = html
    • δ(head(*)) = head
    • δ(body(h1, x)) = body
    • δ(body(h1, x, table)) = body
  • Q = { html, head, body, table, title, h1, x }
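The generalization step of this example can be sketched in a few lines. Several points here are assumptions rather than statements from the slides: gen(t, l) is taken to replace every subtree at depth l by a wildcard (written `*`), a generalized fork is taken to cover a fork when it matches it node by node with the wildcard matching anything, and all function names are invented for the illustration.

```python
# Trees are nested tuples: (label, child, child, ...); a leaf is (label,).
WILDCARD = ("*",)  # assumed wildcard symbol


def gen(tree, level):
    """Replace every subtree at depth `level` by the wildcard."""
    if level <= 0:
        return WILDCARD
    return (tree[0], *(gen(child, level - 1) for child in tree[1:]))


def covers(pattern, tree):
    """True if `pattern` matches `tree`, with the wildcard matching any subtree."""
    if pattern == WILDCARD:
        return True
    return (pattern[0] == tree[0]
            and len(pattern) == len(tree)
            and all(covers(p, c) for p, c in zip(pattern[1:], tree[1:])))


def contains(tree, label):
    return tree[0] == label or any(contains(child, label) for child in tree[1:])


def generalize_forks(forks, level, target="x"):
    """Keep forks that contain the target; generalize the others when safe."""
    context_forks = {f for f in forks if contains(f, target)}
    result = set(context_forks)
    for fork in forks - context_forks:
        generalized = gen(fork, level)
        # Keep the original fork if its generalization would also match a context fork.
        clash = any(covers(generalized, c) for c in context_forks)
        result.add(fork if clash else generalized)
    return result


F = {
    ("html", ("head",), ("body",)),
    ("head", ("title",)),
    ("body", ("h1",), ("x",)),
    ("body", ("h1",), ("x",), ("table",)),
}
for fork in sorted(generalize_forks(F, level=1)):
    print(fork)
```

On the four forks of the example this yields html(\*, \*), head(\*) and the two unchanged body forks, which is the set F listed at the top of this slide.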

22
Experiment
  • Two benchmark datasets: Internet Address Finder (IAF) and Quote Server (QS).
  • Comparison with HMM, Stalker, and BWI.
  • The highlights of our method:
    • More expressive.
    • Doesn't require:
      • manual specification of the window lengths of the prefix and suffix of the target field (HMM and BWI)
      • special tokens for the delimiters, such as ">" (Stalker and BWI)
      • an embedded catalog tree (Stalker)
  • The limitations:
    • The field that can be extracted is limited to a whole node
    • Slower when extracting

23
Experiment results
  • The results in % are (each cell: Prec / Rec / F1):

    Method     | IAF-altname        | IAF-org            | QS-date            | QS-vol             | Shakespeare
    -----------|--------------------|--------------------|--------------------|--------------------|-------------------
    HMM        | 1.7 / 90 / 3.4     | 16.8 / 89.7 / 28.4 | 36.3 / 100 / 53.3  | 18.4 / 96.2 / 30.9 |
    Stalker    | 100 / - / -        | 48.0 / - / -       | 0 / - / -          | 0 / - / -          |
    BWI        | 90.9 / 43.5 / 58.8 | 77.5 / 45.9 / 57.7 | 100 / 100 / 100    | 100 / 61.9 / 76.5  |
    k-testable | 100 / 73.9 / 85    | 100 / 57.9 / 73.3  | 100 / 60.5 / 75.4  | 100 / 73.6 / 84.8  | 56.2 / 90 / 69.2
    g-testable | 100 / 73.9 / 85    | 100 / 82.6 / 90.5  | 100 / 60.5 / 75.4  | 100 / 73.6 / 84.8  | 69.2 / 90 / 78.2
    Parameters | 4 and (5,2)        | 2 and (3,2)        | 2 and (3,2)        | 5 and (6,5)        | 3 and (4,2)

  • F1 is the harmonic mean of recall and precision
  • The results of HMM, Stalker and BWI are adopted from Freitag & Kushmerick
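As a quick check of how the F1 column relates to the other two, the harmonic mean can be computed directly. The function name `f1` is just an illustrative choice; the inputs are precision and recall values from the results on this slide.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both in %)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f1(100, 73.9), 1))   # 85.0  (k-testable, IAF-altname)
print(round(f1(90.9, 43.5), 1))  # 58.8  (BWI, IAF-altname)
print(round(f1(100, 82.6), 1))   # 90.5  (g-testable, IAF-org)
```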

24
Further work
  • More generalization is achieved while using a bigger context, but sometimes the binarisation moves the context far from the field of interest ⇒ the generalization cannot go very far
  • Work on an algorithm that can work directly with unranked trees.