Annotation Free Information Extraction

1
Annotation Free Information Extraction
  • Chia-Hui Chang
  • Department of Computer Science & Information
    Engineering
  • National Central University
  • chia_at_csie.ncu.edu.tw
  • 10/4/2002

2
IEPAD: Information Extraction based on Pattern
Discovery
  • C.H. Chang.
  • National Central University
  • WWW10

3
Semi-structured Information Extraction
  • Information Extraction (IE)
  • Input: HTML pages
  • Output: a set of records

4
Pattern Discovery based IE
  • Motivation
  • Display of multiple records often forms a
    repeated pattern
  • The occurrences of the pattern are spaced
    regularly and adjacently
  • Now the problem becomes ...
  • Find regular and adjacent repeats in a string

5
IEPAD Architecture
6
The Pattern Generator
  • Translator
  • PAT tree construction
  • Pattern validator
  • Rule Composer

7
1. Web Page Translation
  • Encoding of HTML source
  • Rule 1: Each tag is encoded as a token
  • Rule 2: Any text between two tags is translated
    to a special token called TEXT (denoted by an
    underscore)
  • HTML Example
  • <B>Congo</B><I>242</I><BR>
  • <B>Egypt</B><I>20</I><BR>
  • Encoded token string
  • T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
  • T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
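
The two translation rules can be sketched in a few lines of Python. This is only an illustration, not the IEPAD translator itself: the function name encode_html, the choice to drop tag attributes, and the upper-casing of tag names are assumptions made for the example (the slides mention several encoding schemes without detailing them here).

```python
import re

TEXT = "_"  # the special TEXT token, written as an underscore on the slide

# A piece is either a whole tag or a run of text between tags.
PIECE_RE = re.compile(r"<[^>]+>|[^<]+")
# Captures an optional "/" and the tag name, ignoring attributes (assumed scheme).
TAG_NAME_RE = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)")


def encode_html(html):
    """Translate an HTML string into a list of tokens."""
    tokens = []
    for piece in PIECE_RE.findall(html):
        if piece.startswith("<"):
            match = TAG_NAME_RE.match(piece)
            if match:  # Rule 1: each tag becomes one token
                tokens.append(f"<{match.group(1)}{match.group(2).upper()}>")
        elif piece.strip():  # Rule 2: any non-blank text becomes TEXT
            tokens.append(TEXT)
    return tokens


row1 = encode_html("<B>Congo</B><I>242</I><BR>")
row2 = encode_html("<B>Egypt</B><I>20</I><BR>")
print(row1)
print(row1 == row2)
```

Both example rows produce the same token list, which is what makes the record pattern show up as a repeat in the encoded string.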

8
Various Encoding Schemes
9
2. PAT Tree Construction
  • PAT tree: binary suffix tree
  • A Patricia tree constructed over all possible
    suffix strings of a text
  • Example
  • T(<B>) 000
  • T(</B>) 001
  • T(<I>) 010
  • T(</I>) 011
  • T(<BR>) 100
  • T(_) 110

T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>) T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
000110001010110011100 000110001010110011100
10
The Constructed PAT Tree
11
Definition of Maximal Repeats
  • Let α occur in S at positions p1, p2, p3, ..., pk
  • α is left maximal if there exists at least one
    (i, j) pair such that S[pi-1] ≠ S[pj-1]
  • α is right maximal if there exists at least one
    (i, j) pair such that S[pi+|α|] ≠ S[pj+|α|]
  • α is a maximal repeat if it is both left maximal
    and right maximal

12
Finding Maximal Repeats
  • Definition
  • Let's call character S[pi-1] the left character
    of suffix pi
  • A node v is left diverse if at least two leaves
    in v's subtree have different left characters
  • Lemma
  • The path label of an internal node v in a PAT
    tree is a maximal repeat if and only if v is left
    diverse
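
To make the definition concrete, here is a brute-force sketch that applies the left-maximal and right-maximal tests directly. It does not build a PAT tree, so it is quadratic in the number of substrings and only meant for small inputs. The function name maximal_repeats and the treatment of the sequence boundary as a distinct "character" are assumptions of this example.

```python
from collections import defaultdict


def maximal_repeats(seq, min_count=2):
    """Return {repeat: [start positions]} for every maximal repeat in seq."""
    n = len(seq)
    positions = defaultdict(list)
    # Record every substring together with all of its start positions.
    for start in range(n):
        for end in range(start + 1, n + 1):
            positions[tuple(seq[start:end])].append(start)

    result = {}
    for sub, starts in positions.items():
        if len(starts) < min_count:
            continue
        length = len(sub)
        # Left and right neighbours of each occurrence; None marks a boundary.
        lefts = {seq[p - 1] if p > 0 else None for p in starts}
        rights = {seq[p + length] if p + length < n else None for p in starts}
        # Left maximal: two occurrences differ on the left; same on the right.
        if len(lefts) > 1 and len(rights) > 1:
            result[sub] = starts
    return result


print(maximal_repeats("xabcyabcz"))
```

In the string "xabcyabcz", "ab" is repeated but always followed by "c", so only "abc" is reported as a maximal repeat.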

13
3. Pattern Validator
  • Suppose the occurrences of a maximal repeat α are
    ordered by position such that p1 < p2 < p3 < ... < pk,
    where pi denotes the position of each suffix in
    the encoded token sequence.
  • Characteristics of a Pattern
  • Regularity: variance coefficient
  • Adjacency: density

14
Pattern Validator (Cont.)
  • Basic Screening
  • For each maximal repeat α, compute V(α) and D(α)
  • a) check if the pattern's variance V(α) < 0.5
  • b) check if the pattern's density 0.25 < D(α) <
    1.5
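
The slides do not spell out the formulas for V and D. The sketch below assumes the usual reading: V is the standard deviation of the gaps between adjacent occurrences divided by their mean, and D is (k-1) times the pattern length divided by (pk - p1). Both formulas, the use of the population standard deviation, and all function names are assumptions of this example; only the thresholds 0.5, 0.25 and 1.5 come from the slide.

```python
from statistics import mean, pstdev


def variance_coefficient(positions):
    """Std. deviation of the gaps between adjacent occurrences over their mean."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return pstdev(gaps) / mean(gaps)


def density(positions, pattern_len):
    """Share of the span p1..pk that is covered by occurrences of the pattern."""
    k = len(positions)
    return (k - 1) * pattern_len / (positions[-1] - positions[0])


def passes_basic_screening(positions, pattern_len,
                           max_v=0.5, min_d=0.25, max_d=1.5):
    """Apply checks a) and b) from the slide to one maximal repeat."""
    if len(positions) < 2:
        return False
    v = variance_coefficient(positions)
    d = density(positions, pattern_len)
    return v < max_v and min_d < d < max_d


# Four back-to-back occurrences of a 7-token pattern: regular and dense.
print(passes_basic_screening([0, 7, 14, 21], 7))
# Same pattern with one far-away occurrence: the gaps are too irregular.
print(passes_basic_screening([0, 7, 100], 7))
```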

15
4. Rule Composer
  • Occurrence partition
  • Flexible variance threshold control
  • Multiple string alignment
  • Increase density of a pattern

16
Occurrence Partition
  • Problem
  • Some patterns are divided into several blocks
  • Ex.: Lycos, Excite with large regularity
  • Solution
  • Clustering of the occurrences of such a pattern

[Flowchart: pattern P → clustering → V(P) < 0.1? No: discard. Yes: check density.]
17
Multiple String Alignment
  • Problem
  • Patterns with density less than 1 can extract
    only part of the information
  • Solution
  • Align k-1 substrings among the k occurrences
  • A natural generalization of alignment for two
    strings which can be solved in O(nm) by dynamic
    programming where n and m are string lengths.
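
For reference, the two-string case mentioned above can be sketched with the standard edit-distance dynamic program, which fills an (n+1) x (m+1) table and then traces back one optimal alignment. The unit costs for gaps and substitutions, the function name align, and the use of one character per token are assumptions of this example, not details taken from IEPAD.

```python
def align(a, b, gap="-"):
    """Globally align strings a and b in O(nm) time; returns the padded pair."""
    n, m = len(a), len(b)
    # cost[i][j] = minimum number of edits needed to align a[:i] with b[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = cost[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            cost[i][j] = min(substitute, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            out_a.append(a[i - 1])
            out_b.append(gap)
            i -= 1
        else:
            out_a.append(gap)
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


print(align("adcwbd", "adcxb"))
```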

18
Multiple String Alignment (Cont.)
  • Suppose adc is the discovered pattern for token
    string adcwbdadcxbadcxbdadcb
  • If we have the following multiple alignment for
    strings "adcwbd", "adcxb" and "adcxbd"
  • a d c w b d
  • a d c x b -
  • a d c x b d
  • The extraction pattern can be generalized as
    adc[w|x]b[d|-]
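
Once the occurrences are aligned, the generalization step is a column-by-column merge. The sketch below assumes the rows are already padded to equal length with "-" and writes alternatives as [x|y]; the function name generalize and that bracket notation are choices made for this example.

```python
def generalize(aligned_rows):
    """Merge aligned rows into one pattern, listing alternatives per column."""
    parts = []
    for column in zip(*aligned_rows):
        # Keep each distinct symbol once, in order of first appearance.
        symbols = list(dict.fromkeys(column))
        if len(symbols) == 1:
            parts.append(symbols[0])
        else:
            parts.append("[" + "|".join(symbols) + "]")
    return "".join(parts)


print(generalize(["adcwbd", "adcxb-", "adcxbd"]))
```

A "-" inside a bracket marks a position that may be missing, so the generalized rule also matches the shorter occurrence.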

19
Pattern Viewer
  • Java-application based GUI
  • Web based GUI
  • http://www.csie.ncu.edu.tw/chia/WebIEPAD/

20
The Extractor
  • Matching the pattern against the encoded token
    string
  • Knuth-Morris-Pratt algorithm
  • Boyer-Moore algorithm
  • Alternatives in a rule
  • Matching the longest pattern
  • What is extracted?
  • The whole record
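
As an illustration of the matching step, here is a plain Knuth-Morris-Pratt search over a list of tokens rather than characters. It handles a fixed pattern only; alternatives in a rule and longest-match selection are left out. The function name kmp_search is an assumption of this example.

```python
def kmp_search(tokens, pattern):
    """Return the start index of every occurrence of pattern in tokens."""
    if not pattern:
        return []
    # fail[i] = length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix of it.
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k

    matches = []
    k = 0  # number of pattern tokens matched so far
    for i, token in enumerate(tokens):
        while k > 0 and token != pattern[k]:
            k = fail[k - 1]
        if token == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]  # keep going, allowing overlapping matches
    return matches


record = ["<B>", "_", "</B>", "<I>", "_", "</I>", "<BR>"]
print(kmp_search(record * 2, record))
```

Each reported index is the start of one record in the encoded page; the text of the record can then be recovered from the corresponding span of the original HTML.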

21
Experiment Setup
  • Fourteen sources: search engines
  • Performance measures
  • Number of patterns
  • Retrieval rate and Accuracy rate
  • Parameters
  • Encoding scheme
  • Thresholds control

22
Translation
  • Average page length is 22.7KB

23
Accuracy and Retrieval Rate
24
Problems
  • Guarantees a high retrieval rate rather than a
    high accuracy rate
  • A generalized rule can extract more than the
    desired data
  • Currently applicable only when a Web page
    contains several records

25
ROADRUNNER: Towards Automatic Data Extraction
from Large Web Sites
  • Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo
  • VLDB 2001

26
Observations
  • 1. Wrapper generators rely on additional
    information (labeled samples).
  • 2. Wrapper induction systems assume some a priori
    knowledge about the page organization.
  • 3. Systems generate a wrapper by examining
    one HTML page at a time.

27
ROADRUNNER: new perspective
  • 1. Does not rely on any interaction with the user
    (completely automatic).
  • 2. No a priori knowledge
  • The HTML schema is inferred along with the wrapper.
  • Can handle any nested structure.
  • 3. Works with two HTML pages at a time (based on
    the study of similarities and dissimilarities
    between the pages).

28
(No Transcript)
29
Theoretical Background
  • Site generation: encoding of database content
  • Data extraction: decoding
  • The problem is based on a close correspondence
    between nested types and union-free regular
    expressions (UFREs).

30
Delimiter
  • PCDATA maps to strings
  • + maps to (possibly nested) lists, being the iterator
  • ? maps to nullable fields (optional patterns)
  • Finding the schema and extracting the data
    amounts to finding a minimal UFRE.

31
Matching Technique
  • It is based on a matching technique called ACME
    (Align, Collapse under Mismatch, and Extract).
  • HTML → XHTML → tokens
  • The matching algorithm works on two objects
  • A list of tokens, called the sample
  • A wrapper (one UFRE)
  • The wrapper is generalized by solving mismatches
    between the wrapper and the sample.

32
(No Transcript)
33
Mismatches
  • 1. String mismatches
  • Can only be due to different values of a database
    field.
  • These mismatches are used to discover fields
    (PCDATA).
  • Ex.: "John Smith" and "Paul Jones" at token 4
  • 2. Tag mismatches
  • Optional patterns
  • Iterative patterns
34
Discovering Optionals
  • Strategy: look for repeated patterns as a
    first step and, if this attempt fails, try
    to identify an optional pattern.
  • Two steps
  • 1. Optional Pattern Location by Cross-Search
  • Mismatch at token 6: <UL> and <IMG/>
  • Assume the optional pattern is located on the
    wrapper or on the sample.
  • 2. Wrapper Generalization
  • ( <IMG src/> )?

35
Discovering Iterators
  • 1. Square Location by Terminal Tag Search
  • Both the wrapper and the sample contain at least
    one occurrence of the square.
  • Terminal tag: the token immediately before the mismatch
  • In this example it is </LI>
  • Test which is the initial tag of the square
  • </UL> </LI> vs. <LI> </LI>
  • Finally, we can infer that the sample contains
    one candidate occurrence of the square at tokens
    20-25.

36
Discovering Iterators (cont)
  • 2. Square Matching
  • Try to match the candidate square occurrence
    (tokens 20-25).
  • Backward matching: compare tokens 25 and 19, then
    move to 24 and 18, and so on.
  • 3. Wrapper Generalization
  • If we denote the newly found square by s, we
    replace the repeated pattern by (s)+

37
More Complex Example
  • First mismatch at token 15 (external mismatch)
  • Find iterators
  • Terminal tag: </LI>
  • Candidate square found: <LI> </LI> at tokens
    15-28
  • Backward match: second mismatch at tokens 23 and
    9 (internal mismatch) → solve the mismatch
    recursively

38
Recursively solve mismatch
  • Internal mismatch at tokens 23 and 9
  • Solve it the same way as an external mismatch.
  • But instead of comparing one wrapper and one
    sample, compare two different portions of the
    same object.
  • Terminal tag: <B>
  • Candidate square: </B><B>, tokens 23-18
  • Backward match: mismatch at tokens 20 and 26
  • Tokens 20-22 are found to be an optional pattern.

39
(No Transcript)
40
Matching as an AND-OR tree
  • Finding one solution to match(w,s) corresponds to
    finding one visit of the AND-OR tree.
  • (i) match(w,s) must solve all external mismatches
    encountered during parsing (AND node)
  • (ii) A mismatch is solved by introducing either
    one field, one iterator, or one optional (OR)
  • (iii) The search may be done on either the wrapper
    or the sample (OR)
  • (iv) There are various candidates for iterators
    and optionals (OR)
  • (v) Discovering iterators may require recursively
    solving several internal mismatches (AND)

41
AND-OR tree
42
Experimental Results
43
Experimental Results (cont)
44
Extracting Structured Data from Web Pages
  • Arvind Arasu, Hector Garcia-Molina
  • ACM SIGMOD 2003

45
Cue
  • Keywords: schema, template
  • Web pages belonging to the same site are
    generated by encoding data of the same schema
    with a common template
  • → pages are produced by plugging values into a
    common template

46
Figuration
47
Goal and Challenge
  • Previous IE techniques rely on heuristics
    supplied by humans, e.g. wrappers
  • Time consuming and error-prone
  • Optional attributes are ignored
  • Goal: to deduce the template without human input
  • Challenge
  • No obvious way of differentiating what text is
    template and what is data
  • The schema of data in pages isn't flat but more
    complex and semi-structured

48
Model, Problem Formulation
  • Structured Data
  • Model of Page Creation
  • Optionals and Disjunctions
  • Problem Statement
  • Miscellaneous Terminology, Definition

49
Structured Data
  • Token: a basic unit of text
  • Structured data: any set of data values
    conforming to a common schema or type
  • Define Type
  • 1. Basic type (β): a string of tokens
  • e.g. <html>, text
  • 2. Ordered list type: tuple constructor of order n
  • e.g. <T1, T2, ..., Tn>, where T1, T2, ..., Tn are types
  • 3. Set type: set constructor
  • e.g. {T}, where T is a type

50
Define term value and example
  • Define instance
  • 1. An instance of the basic type β is a string of
    tokens
  • 2. An instance of type <T1, T2, ..., Tn> is a
    tuple of the form <i1, i2, ..., in>, whose attributes
    i1, i2, ..., in are instances of types T1,
    T2, ..., Tn
  • 3. An instance of type {T} is any set of
    elements {e1, e2, ..., em} such that each ei is an
    instance of type T
  • Instance ↔ value, string ↔ tokens
  • Example
  • Schema S1
  • Value

51
Model of Page Creation
  • Definition: A template T for a schema S (written
    TS) is defined as a function that maps each type
    constructor τ of S into an ordered set of strings
    T(τ), such that
  • if τ is a tuple constructor of order n, T(τ) is an
    ordered set of n+1 strings <C1, ..., Cn+1>
  • if τ is a set constructor, T(τ) is a single
    string Sτ

λ(T, x): the encoding of a value x that is an instance of a
sub-schema of S
52
Encoding of a value x of S
  • 1. if x is of basic type β, then λ(T, x) = x
  • 2. if x = <x1, x2, ..., xn>τt, with T(τt) = <C1, ..., Cn+1>, then
  • λ(T, x) = C1 λ(T, x1) C2 ... λ(T, xn) Cn+1
  • 3. if x = {e1, e2, ..., em}τs, with τs in S and T(τs) = S, then
  • λ(T, x) = λ(T, e1) S λ(T, e2) S ... S λ(T, em)
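
The three encoding rules translate directly into a recursive function. In this sketch a value is a nested Python tuple tagged "basic", "tuple" or "set", and a template is a dict from constructor names to either a list of n+1 strings (tuple constructor) or a single separator string (set constructor). That representation, the function name encode, and the small book/review template are assumptions of this example, not the data structures used by EXALG.

```python
def encode(template, value):
    """Encode a nested value as a page string under the given template."""
    kind = value[0]
    if kind == "basic":  # rule 1: a basic value encodes to itself
        return value[1]
    if kind == "tuple":  # rule 2: C1 x1 C2 ... xn C(n+1)
        _, name, children = value
        strings = template[name]
        if len(strings) != len(children) + 1:
            raise ValueError(f"{name} needs {len(children) + 1} template strings")
        parts = [strings[0]]
        for child, following in zip(children, strings[1:]):
            parts.append(encode(template, child))
            parts.append(following)
        return "".join(parts)
    if kind == "set":  # rule 3: elements joined by the separator string
        _, name, elements = value
        return template[name].join(encode(template, e) for e in elements)
    raise ValueError(f"unknown value kind: {kind!r}")


# Hypothetical template: a book title followed by a list of reviews.
template = {
    "book": ["<b>", "</b><ol>", "</ol>"],
    "reviews": "",
    "review": ["<li>", ": ", "</li>"],
}
page_value = ("tuple", "book", [
    ("basic", "Databases"),
    ("set", "reviews", [
        ("tuple", "review", [("basic", "Ann"), ("basic", "7")]),
        ("tuple", "review", [("basic", "Bob"), ("basic", "9")]),
    ]),
])
print(encode(template, page_value))
```

The extraction problem stated later in the deck is the inverse of this function: recover the template and the values given only the output pages.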

53
Example of Schema S1
54
Optionals and Disjunctions
  • Optional
  • If T is a type, the optional type (T)? = {T}τ
  • with |τ| = 0 or 1 (the set has at most one element)
  • Disjunction
  • If T1 and T2 are types, the disjunction type
  • (T1 | T2) = <{T1}τ1, {T2}τ2>τ
  • with |τ1| + |τ2| = 1

55
Problem Statement
  • EXTRACT problem: given n pages pi = λ(T, xi)
    (1 ≤ i ≤ n), created from some unknown
    template T and values x1, ..., xn, deduce
    T and the values from the set of pages alone

56
Example of correct solution of EXTRACT (cont.)
57
Example of correct solution of EXTRACT (cont.)
58
Miscellaneous Terminology, Definition
  • An occurrence of a token in template is called a
    template-token
  • An occurrence of a token in value is called a
    value-token
  • An occurrence of a token in page is called a
    page-token
  • Two page-tokens in Pe have the same role iff they
    have been generated by the same template-token

59
Overview Approach - EXALG
(ECGM)
60
EXALG - ECGM: FINDEQ (step 2)
  • The module that computes equivalence
    classes: sets of tokens having the same
    frequency of occurrence in every page of Pe
  • Ex. ee1: <html>, <body>, Book, Reviews, <ol>,
    </ol>, </body>, </html>
  • Ex. ee3: <li>, Reviewer, Rating, Text, </li>
  • EXALG retains only the equivalence classes that
    are large and frequently occurring (LFEQs)

61
EXALG - ECGM: HANDINV (step 3)
  • The module that detects and removes invalid
    LFEQs, i.e. those that are not formed by tokens
    associated with a type constructor

62
DIFFFORM (step 1) and DIFFEQ (step 4)
  • The modules that add more tokens to LFEQs by
    differentiating roles
  • Ex.: Name has multiple roles: one occurs in Book
    Name and the other in Reviewer Name
  • Differentiating the multiple roles
  • The occurrences lie on different paths from the
    root in the HTML parse tree (DIFFFORM)
  • The occurrences lie in different positions
    with respect to LFEQ ee1 (DIFFEQ)
  • dtoken, ex. Name5 and Name14
  • Regard NameA and NameB as different tokens

63
Review ECGM
64
Example After ECGM Process
  • ee1: <html>, <body>, <b>, Book, Name, </b>,
    <b>, Reviews, </b>, <ol>, </ol>, </body>, </html>
  • 8 → 13
  • ee3: <li>, <b>, Reviewer, Name, </b>, <b>,
    Rating, </b>, <b>, Text, </b>, </li>
  • 5 → 12
  • Positions: empty and non-empty

65
Construct Schema from ECGM
  • Construct schema S from ee1
  • The 1st non-empty position is of basic type β
  • The 2nd non-empty position holds ee3, generated
    by the set type constructor τe3
  • ⇒ T(τe1) = <C11, C12, C13>, S = <β, S τe2>τe1
  • ⇒ T(τe2) = S <C31, C32, C33, C34>
  • ⇒ T(τe3) = <C31, C32, C33, C34>, <β, β, β>τe3
  • ⇒ S = <β, {<β, β, β>τe3}τe2>τe1

66
Equivalence Classes (Cont.)
  • Pages P = {p1, ..., pn}, pi = λ(TS, xi)
  • TS: {τ1, ..., τk} type constructors
  • Definition: all tokens of an equivalence class
    have the same occurrence vector
  • ex. ee1: <1,1,1,1>, ee3: <1,2,1,0>
  • Observation 1: Tokens associated with the same
    type constructor τj in T that have unique roles
    occur in the same equivalence class (used to
    decide whether an EQ class is valid)
  • Support of a token: number of pages containing it
  • Size of an EQ class: number of tokens in it
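
Grouping tokens by occurrence vector, then filtering by size and support, can be sketched as follows. The sketch works on plain tokens and skips role differentiation (dtokens) and the handling of invalid classes. The function names, the three tiny pages, and the threshold values used in the example call are assumptions; the support of a class is taken here as the number of pages in which its tokens occur.

```python
from collections import Counter, defaultdict


def equivalence_classes(pages):
    """Group tokens by occurrence vector (one count per page)."""
    counts = [Counter(page) for page in pages]
    vocabulary = set().union(*pages)
    classes = defaultdict(list)
    for token in sorted(vocabulary):
        vector = tuple(c[token] for c in counts)
        classes[vector].append(token)
    return dict(classes)


def large_frequent_classes(classes, size_thres, sup_thres):
    """Keep classes with enough tokens (size) and enough pages (support)."""
    kept = {}
    for vector, tokens in classes.items():
        support = sum(1 for count in vector if count > 0)
        if len(tokens) >= size_thres and support >= sup_thres:
            kept[vector] = tokens
    return kept


# Three hypothetical pages with 1, 2 and 0 reviews.
pages = [
    "<html> Book a <ol> <li> Reviewer r1 </li> </ol> </html>".split(),
    "<html> Book b <ol> <li> Reviewer r2 </li> <li> Reviewer r3 </li> </ol> </html>".split(),
    "<html> Book c <ol> </ol> </html>".split(),
]
classes = equivalence_classes(pages)
print(large_frequent_classes(classes, size_thres=3, sup_thres=2))
```

The data tokens b, r2 and r3 also share an occurrence vector, but they appear in a single page, so the support threshold filters that accidental class out.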

67
Equivalence Classes (Cont.)
  • Observation 2: for real pages, an equivalence
    class of large size and support is usually valid
  • Properties of an EQ class <t1, ..., tm>
  • Ordered
  • Nested: for a pair of classes, either all
    occurrences of one lie within some fixed
    position p of the other, or their spans don't
    overlap
  • Observation 3: A valid equivalence class is
    ordered, and a pair of two valid equivalence
    classes is nested

68
Handling Invalid Equivalence classes
  • Detect the existence of invalid LFEQs through
    violations of the ordered and nested properties
  • If found, discard some of the LFEQs and break
    others into smaller LFEQs

Differentiating roles of tokens
  • By path: tokens with different roles lie on
    different paths of the HTML parse tree
  • By position: tokens with different roles are
    located at different (non-empty) positions

69
Equivalence Class Generation Module
  • OUTPUT: set of LFEQs of dtokens, and pages
    represented as strings of dtokens
  • FINDEQ: 2 parameters are used to select
    LFEQs (SIZETHRES, SUPTHRES)
  • On the running example
  • SIZETHRES = SUPTHRES = 3
  • In iteration 2, ee1 and ee3 are found

70
Building Template and Extracting Values
  • Input to this module is e1, e2, ..., em
  • The ANALYSIS consists of 2 modules: CONSTTEMP and
    EXVAL
  • CONSTTEMP: ei = <d1, d2, ..., dl>
  • Start with the basic e1: <html>, <body>, ..., </body>,
    </html>
  • Recursively construct a template Tei
    corresponding to ei, and a template Tei,p
    corresponding to each non-empty position p of ei
  • Check whether the set of strings PosString(ei, p)
    has some recognizable pattern

71
Example
  • In the running example, PosString(ee1, 6) is a
    string of dtokens for every occurrence of ee1,
    which matches Pattern 5 of the table; PosString(ee1,
    10) is always a string of 0 or more occurrences
    of ee3, which matches Pattern 1
  • ee1: <html>, <body>, <b>, Book, Name, </b>,
    <b>, Reviews, </b>, <ol>, </ol>,
    </body>, </html>

72
Assumption
  • The 4 assumptions
  • (A1) A large number of tokens occurring in the
    template have unique roles
  • (A2) The EQ class derived from a type constructor
    is recognized as an LFEQ
  • (A3) Irregularity in encoded data that leads to
    an invalid EQ class
  • (A4) Separators surround data values; in
    this model, the strings associated with type
    constructors are non-empty

73
Evaluation
  • Leaf attribute Am in schema Sm
  • Correct: the set of values of Am in the page
    equals the set of extracted values Ae
  • Partially correct: the set of values of Am is
    not equal to the set of extracted values Ae,
    but is contained as part of the values of Ae
  • Incorrect: neither correct nor partially correct

74
Result
  • For 18 (40%) of the input collections, the system
    correctly extracted all the attributes
  • Around 80% of the attributes were extracted
    correctly
  • Normalized average
  • Input size < 10
  • Parameter 3

75
Conclusion
  • EXALG uses 2 novel concepts, equivalence classes
    and differentiating roles, to discover the
    template
  • The impact of a failed assumption is limited to
    a few attributes
  • Future work
  • Develop techniques for crawling, indexing, and
    providing querying support for the structured
    pages in the web
  • Develop techniques for automatically annotating
    the extracted data, possibly using the words that
    appear in the template

76
References
  • C.H. Chang and S.C. Lui. IEPAD: Information
    Extraction based on Pattern Discovery. WWW 2001,
    pp. 681-688.
  • Valter Crescenzi, Giansalvatore Mecca, Paolo
    Merialdo. RoadRunner: Towards Automatic Data
    Extraction from Large Web Sites. VLDB 2001,
    pp. 109-118.
  • Arvind Arasu, Hector Garcia-Molina. Extracting
    Structured Data from Web Pages. SIGMOD 2003,
    pp. 337-348.