Annotation Free Information Extraction

1
Annotation Free Information Extraction

Chia-Hui Chang
Department of Computer Science and Information
Engineering
National Central University
chia@csie.ncu.edu.tw
10/4/2002

2
IEPAD: Information Extraction based on Pattern
Discovery

C.H. Chang
National Central University
WWW10

3
Semi-structured Information Extraction

Information Extraction (IE)
Input: HTML pages
Output: A set of records

4
Pattern Discovery based IE

Motivation
Display of multiple records often forms a
repeated pattern
The occurrences of the pattern are spaced
regularly and adjacently
Now the problem becomes ...
Find regular and adjacent repeats in a string

5
IEPAD Architecture
6
The Pattern Generator

Translator
PAT tree construction
Pattern validator
Rule Composer

7
1. Web Page Translation

Encoding of HTML source
Rule 1: Each tag is encoded as a token
Rule 2: Any text between two tags is translated
to a special token called TEXT (denoted by an
underscore)
HTML Example
<B>Congo</B><I>242</I><BR>
<B>Egypt</B><I>20</I><BR>
Encoded token string
T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
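
The two encoding rules can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function name `encode_html` and the regular expression are hypothetical, tags are kept verbatim as tokens, and whitespace-only text between tags is ignored.

```python
import re

# A tag is anything between "<" and ">"; everything else is text.
TAG_OR_TEXT = re.compile(r"<[^>]+>|[^<]+")

def encode_html(html):
    """Encode an HTML fragment as a list of tokens (Rule 1 and Rule 2)."""
    tokens = []
    for piece in TAG_OR_TEXT.findall(html):
        if piece.startswith("<"):
            tokens.append(piece)      # Rule 1: each tag is one token
        elif piece.strip():
            tokens.append("_")        # Rule 2: text between tags becomes TEXT
    return tokens

print(encode_html("<B>Congo</B><I>242</I><BR>"))
```

Both example records (Congo and Egypt) encode to the same token sequence, which is what makes the repeat visible to the pattern discovery step.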

8
Various Encoding Schemes
9
2. PAT Tree Construction

PAT tree: binary suffix tree
A Patricia tree constructed over all possible
suffix strings of a text
Example
T(<B>) 000
T(</B>) 001
T(<I>) 010
T(</I>) 011
T(<BR>) 100
T(_) 110

T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>) T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
000110001010110011100 000110001010110011100
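
As a rough illustration of this step, the sketch below turns a token list into the binary string using the code table above and lists the suffixes a PAT tree would index. The helper names are hypothetical, and starting a suffix only at token boundaries is an assumption made here; the tree construction itself is not shown.

```python
# Code table copied from the slide: one 3-bit code per token class.
CODES = {"<B>": "000", "</B>": "001", "<I>": "010",
         "</I>": "011", "<BR>": "100", "_": "110"}

def to_binary(tokens):
    """Concatenate the fixed-width binary codes of the tokens."""
    return "".join(CODES[t] for t in tokens)

def token_suffixes(bits, width=3):
    """Suffixes of the binary string that start at a token boundary."""
    return [bits[i:] for i in range(0, len(bits), width)]

record = ["<B>", "_", "</B>", "<I>", "_", "</I>", "<BR>"]
bits = to_binary(record * 2)          # the two example records
print(bits)
print(len(token_suffixes(bits)))      # one suffix per token
```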
10
The Constructed PAT Tree
11
Definition of Maximal Repeats

Let α occur in S at positions p1, p2, p3, ..., pk
α is left maximal if there exists at least one
(i, j) pair such that S[pi-1] ≠ S[pj-1]
α is right maximal if there exists at least one
(i, j) pair such that S[pi+|α|] ≠ S[pj+|α|]
α is a maximal repeat if it is both left maximal
and right maximal
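
The definition can be checked directly with a brute-force sketch. This is not the PAT-tree method of the slides: it enumerates every substring, so it is only suitable for short strings. The function name is hypothetical, and treating the string boundary as a character different from all others is an assumption.

```python
from collections import defaultdict

def maximal_repeats(s):
    """Return {substring: positions} for every maximal repeat of s."""
    n = len(s)
    occurrences = defaultdict(list)
    for i in range(n):
        for j in range(i + 1, n + 1):
            occurrences[s[i:j]].append(i)

    result = {}
    for sub, positions in occurrences.items():
        if len(positions) < 2:
            continue                      # not a repeat
        # Characters just before and just after each occurrence;
        # None stands for the boundary of the string.
        lefts = {s[p - 1] if p > 0 else None for p in positions}
        rights = {s[p + len(sub)] if p + len(sub) < n else None
                  for p in positions}
        if len(lefts) > 1 and len(rights) > 1:
            result[sub] = positions       # left maximal and right maximal
    return result

print(maximal_repeats("xabcyabcz"))
```

In `xabcyabcz` the substring `ab` repeats but is always followed by `c`, so only `abc` is reported.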

12
Finding Maximal Repeats

Definition
Let's call character S[pi-1] the left character
of suffix pi
A node v is left diverse if at least two leaves
in v's subtree have different left characters
Lemma
The path label of an internal node v in a PAT
tree is a maximal repeat if and only if v is left
diverse

13
3. Pattern Validator

Suppose the occurrences of a maximal repeat α are ordered by
position such that p1 < p2 < p3 < ... < pk,
where pi denotes the position of each suffix in
the encoded token sequence.
Characteristics of a Pattern
Regularity: variance coefficient
Adjacency: density

14
Pattern Validator (Cont.)

Basic Screening
For each maximal repeat α, compute V(α) and D(α)
a) check if the pattern's variance V(α) < 0.5
b) check if the pattern's density 0.25 < D(α) < 1.5
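
The slides do not spell out V and D, so the sketch below rests on assumed definitions: V is the standard deviation of the gaps between adjacent occurrences divided by their mean, and D is (k-1)·|α| / (pk - p1). Function names are hypothetical; the thresholds are the ones from the basic screening above.

```python
from statistics import mean, pstdev

def variance_coefficient(positions):
    """Assumed V: std. deviation of adjacent gaps divided by their mean."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return pstdev(gaps) / mean(gaps)

def density(positions, pattern_len):
    """Assumed D: (k - 1) * |pattern| / (p_k - p_1)."""
    return (len(positions) - 1) * pattern_len / (positions[-1] - positions[0])

def passes_basic_screening(positions, pattern_len):
    """Apply the two threshold checks from the slide."""
    v = variance_coefficient(positions)
    d = density(positions, pattern_len)
    return v < 0.5 and 0.25 < d < 1.5

# A 7-token pattern occurring back to back is perfectly regular and dense.
print(passes_basic_screening([0, 7, 14], 7))
```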

15
4. Rule Composer

Occurrence partition
Flexible variance threshold control
Multiple string alignment
Increase density of a pattern

16
Occurrence Partition

Problem
Some patterns are divided into several blocks
Ex.: Lycos, Excite with large regularity
Solution
Clustering of the occurrences of such a pattern

Flowchart: P -> Clustering -> V(P) < 0.1? No: discard. Yes: check density.
17
Multiple String Alignment

Problem
Patterns with density less than 1 can extract
only part of the information
Solution
Align k-1 substrings among the k occurrences
A natural generalization of the alignment of two
strings, which can be solved in O(nm) by dynamic
programming, where n and m are the string lengths.

18
Multiple String Alignment (Cont.)

Suppose "adc" is the discovered pattern for token
string "adcwbdadcxbadcxbdadcb"
If we have the following multiple alignment for
strings "adcwbd", "adcxb" and "adcxbd"
a d c w b d
a d c x b -
a d c x b d
The extraction pattern can be generalized as
"adc[w|x]b[d|-]"
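
Given an alignment like the one above, the generalization step can be sketched column by column. The function name is hypothetical and the output is written in the bracketed form adc[w|x]b[d|-]; computing the alignment itself is not shown.

```python
def generalize(aligned_rows):
    """Merge aligned strings of equal length into one pattern.

    A column with a single symbol is copied as is; a column with several
    symbols becomes an alternative such as [w|x]. "-" marks a gap.
    """
    pattern = []
    for column in zip(*aligned_rows):
        symbols = list(dict.fromkeys(column))   # distinct, first-seen order
        if len(symbols) == 1:
            pattern.append(symbols[0])
        else:
            pattern.append("[" + "|".join(symbols) + "]")
    return "".join(pattern)

print(generalize(["adcwbd", "adcxb-", "adcxbd"]))
```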

19
Pattern Viewer

Java-application based GUI
Web based GUI
http://www.csie.ncu.edu.tw/chia/WebIEPAD/

20
The Extractor

Matching the pattern against the encoded token
string
Knuth-Morris-Pratt algorithm
Boyer-Moore algorithm
Alternatives in a rule
Match the longest pattern
What is extracted?
The whole record
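
A small sketch of the matching step, using Python's `re` module as a stand-in for the Knuth-Morris-Pratt or Boyer-Moore matchers named above. It assumes each token is written as one character and that the generalized rule with alternatives w|x and an optional d is expressed as the regular expression `adc[wx]bd?`; the greedy `?` gives the longest match.

```python
import re

def extract_records(rule, token_string):
    """Return every non-overlapping match of the rule, left to right."""
    return [m.group(0) for m in re.finditer(rule, token_string)]

records = extract_records(r"adc[wx]bd?", "adcwbdadcxbadcxbdadcb")
print(records)
```

The trailing `adcb` is not extracted because it matches neither alternative at the fourth token, which illustrates how a rule bounds what is returned as a whole record.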

21
Experiment Setup

Fourteen sources: search engines
Performance measures
Number of patterns
Retrieval rate and Accuracy rate
Parameters
Encoding scheme
Thresholds control

22
Translation

Average page length is 22.7KB

23
Accuracy and Retrieval Rate
24
Problems

Guarantees a high retrieval rate rather than a high
accuracy rate
A generalized rule can extract more than the
desired data
Currently only applicable when there are several
records in a Web page

25
ROADRUNNER: Towards Automatic Data Extraction
from Large Web Sites

Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo
VLDB2001

26
Observations

1. Wrapper generators work by using additional
information (labeled samples).
2. Wrapper induction systems have some a priori
knowledge about the page organization.
3. Finally, systems generate a wrapper by examining
one HTML page at a time.

27
ROADRUNNER: a new perspective

1. Does not rely on any interaction with the user
(completely automatic)
2. No a priori knowledge
The HTML schema is inferred along with the wrapper.
Can handle any nested structures.
3. Works with two HTML pages at a time (based on
the study of similarities and dissimilarities
between the pages)

28
(No Transcript)
29
Theoretical Background

Site generation: encoding of database content
Data extraction: decoding
The problem is based on a close correspondence
between nested types and union-free regular
expressions (UFRE).

30
Delimiter

#PCDATA maps to string fields
+ maps to lists (possibly nested), acting as an iterator
? maps to nullable fields, i.e. optional patterns
Finding the schema and extracting the data amounts
to finding the minimal UFRE.

31
Matching Technique

It is based on a matching technique called ACME
(Align, Collapse under Mismatch, and Extract)
HTML → XHTML → tokens
The matching algorithm works on two objects
A list of tokens, called the sample
A wrapper (one UFRE)
Matching is done by solving mismatches between the
wrapper and the sample.

32
(No Transcript)
33
Mismatches

1. String mismatches
May be due only to different values of a database
field.
These mismatches are used to discover fields
(#PCDATA)
Ex.: "John Smith" and "Paul Jones" at token 4
2. Tag mismatches
Optional patterns
Iterative patterns
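
The two kinds of mismatch can be illustrated with a deliberately simplified sketch. It only walks the wrapper and the sample in parallel, turns string mismatches into #PCDATA fields, and stops at the first tag mismatch; resolving that mismatch with optionals or iterators is not attempted. The function names and the token lists are illustrative assumptions.

```python
def is_tag(token):
    return token.startswith("<") and token.endswith(">")

def align_until_tag_mismatch(wrapper, sample):
    """Return (generalized prefix, index of first tag mismatch or None)."""
    generalized = []
    for pos, (w, s) in enumerate(zip(wrapper, sample)):
        if w == s:
            generalized.append(w)
        elif not is_tag(w) and not is_tag(s):
            generalized.append("#PCDATA")   # string mismatch: a field
        else:
            return generalized, pos         # tag mismatch: needs more work
    return generalized, None

wrapper = ["<HTML>", "Books of:", "<B>", "John Smith", "</B>", "<UL>"]
sample = ["<HTML>", "Books of:", "<B>", "Paul Jones", "</B>", "<IMG/>"]
print(align_until_tag_mismatch(wrapper, sample))
```

With 1-based token numbering, the string mismatch falls on token 4 and the tag mismatch on token 6.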

34
Discovering Optionals

Strategy: look for repeated patterns as a
first step, and then, if this attempt fails,
try to identify an optional pattern.
Two steps
1. Optional Pattern Location by Cross-Search
Mismatch at token 6: <UL> and <IMG/>
Assume the optional pattern is located on the wrapper or
the sample.
2. Wrapper Generalization
( <IMG src/> )?

35
Discovering Iterators

1. Square Location by Terminal Tag Search
Both the wrapper and the sample contain at least one
occurrence of the square.
Terminal tag: the tag at the position before the mismatch
In this example it is </LI>
Test which is the square's initial tag:
</UL> ... </LI> vs. <LI> ... </LI>
Finally, we can infer that the sample contains
one candidate occurrence of the square at tokens
20-25.

36
Discovering Iterators (cont)

2. Square Matching
Try to match the candidate square occurrence
(tokens 20-25).
Backward matching: compare tokens 25 and 19, then move
to 24 and 18, and so on.
3. Wrapper Generalization
If we denote the newly found square by s, we
replace the repeated pattern by (s)+

37
More Complex Example

First mismatch at token 15 (external mismatch)
Find iterators
Terminal tag </LI>
Candidate square <LI> ... </LI> is found at tokens
15-28
Backward match: second mismatch at tokens 23 and
9 (internal mismatch) → solve the mismatch
recursively

38
Recursively solve mismatch

Internal mismatch at tokens 23 and 9
Solve it in the same way as an external mismatch,
but instead of comparing one wrapper and one
sample, compare two different portions of the same
object.
Terminal tag <B>
Candidate square is </B><B>, tokens 23-18
Backward match: mismatch at tokens 20 and 26
Tokens 20-22 are found to be an optional pattern.

39
(No Transcript)
40
Matching as an AND-OR tree

Finding one solution to match(w,s) corresponds to
finding one visit of the AND-OR tree.
(i) match(w,s) must solve all external mismatches
encountered during the parsing (AND node)
(ii) Each mismatch is solved by introducing either one
field, one iterator, or one optional (OR)
(iii) The search may be done on either the wrapper or
the sample (OR)
(iv) There are various candidate iterators and
optionals (OR)
(v) Discovering an iterator may require recursively
solving several internal mismatches (AND)

41
AND-OR tree
42
Experimental Results
43
Experimental Results (cont)
44
Extracting Structured Data from Web Pages

Arvind Arasu, Hector Garcia-Molina
ACM SIGMOD 2003

45
Cue

Keywords: schema, template
Web pages belonging to the same site are
generated by encoding data of the same schema
with a common template
=> a common template with values plugged in

46
Figuration
47
Goal and Challenge

Previous IE techniques rely on human heuristics,
e.g. wrappers
Goal: to deduce the template without human input
Time consuming and error-prone
Optional attributes are ignored
Challenge
No obvious way of differentiating which text is
template and which is data
The schema of the data in pages isn't flat, but a more
complex, semi-structured arrangement of attributes

48
Model, Problem Formulation

Structured Data
Model of Page Creation
Optionals and Disjunctions
Problem Statement
Miscellaneous Terminology, Definition

49
Structured Data

Token: a token is some basic unit of text
Structured data: any set of data values
conforming to a common schema or type
Type definitions
1. Basic type (ß): a string of tokens
e.g. <html>, text
2. Ordered list type: tuple constructor of order n
e.g. <T1, T2, ..., Tn>, where T1, T2, ..., Tn are types
3. Set type: set constructor
e.g. {T}, where T is a type

50
Define term value and example

Define instance
1. An instance of the basic type ß is a string of tokens
2. An instance of type <T1, T2, ..., Tn> is a
tuple of the form <i1, i2, ..., in>, where the attributes
i1, i2, ..., in are instances of types T1,
T2, ..., Tn
3. An instance of type {T} is any set of
elements
{e1, e2, ..., em} such that each ei is an instance of
type T
Terminology: "value" means instance; "string" means string of tokens
Example
Schema S1
Value

51
Model of Page Creation

Definition: A template T for a schema S (written
TS) is defined as a function that maps each type
constructor τ of S into an ordered set of strings
T(τ), such that:
if τ is a tuple constructor of order n, T(τ) is an
ordered set of n+1 strings
if τ is a set constructor, T(τ) is a single
string Sτ

λ(T, x): the encoding of a value x that is an instance of a
sub-schema of S
52
Encoding of a value x ∈ S

1. if x ∈ ß, then λ(T, x) = x
2. if x = <x1, x2, ..., xn>τt, then
λ(T, x) = C1 λ(T, x1) C2 ... λ(T, xn) Cn+1, where T(τt) = <C1, ..., Cn+1>
3. if x = {e1, e2, ..., em}τs, τs ∈ S, then
λ(T, x) = λ(T, e1) S λ(T, e2) S ... S λ(T, em), where T(τs) is the string S
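
The three encoding rules translate almost directly into a recursive function. The sketch below makes several illustrative assumptions: values are modelled with two hypothetical classes, `TupleValue` and `SetValue`, each carrying the name of its type constructor; the template is a dictionary from constructor name to its separator strings; and all strings in the example template are invented.

```python
from dataclasses import dataclass

@dataclass
class TupleValue:
    ctor: str      # name of the tuple constructor
    items: list    # n component values

@dataclass
class SetValue:
    ctor: str      # name of the set constructor
    items: list    # m elements

def encode(template, x):
    """Encode value x with the template, following rules 1 to 3."""
    if isinstance(x, str):                       # rule 1: basic type
        return x
    if isinstance(x, TupleValue):                # rule 2: C1 x1 C2 ... xn Cn+1
        seps = template[x.ctor]                  # n + 1 strings
        out = seps[0]
        for item, sep in zip(x.items, seps[1:]):
            out += encode(template, item) + sep
        return out
    if isinstance(x, SetValue):                  # rule 3: e1 S e2 S ... em
        return template[x.ctor].join(encode(template, e) for e in x.items)
    raise TypeError("unsupported value")

template = {"t1": ["<b>", "</b><ol>", "</ol>"],
            "t2": "",
            "t3": ["<li>", ":", "</li>"]}
book = TupleValue("t1", ["Databases", SetValue("t2", [
    TupleValue("t3", ["Ann", "7"]), TupleValue("t3", ["Bob", "9"])])])
print(encode(template, book))
```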

53
Example of Schema S1
54
Optionals and Disjunctions

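Optional
If T is a type, the optional type (T)? = {T}τ,
where |τ| = 0 or 1
(|τ| is the cardinality of any instantiation of τ)
Disjunction
If T1 and T2 are types, the disjunction type
(T1 | T2) = <{T1}τ1, {T2}τ2>τ,
where |τ1| + |τ2| = 1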

55
Problem Statement

EXTRACT problem: given n pages pi = λ(T, xi)
(1 ≤ i ≤ n), created from some unknown
template T and values x1, ..., xn, deduce the
template T and the values from the set of pages alone

56
Example of correct solution of EXTRACT (cont.)
57
Example of correct solution of EXTRACT (cont.)
58
Miscellaneous Terminology, Definition

An occurrence of a token in template is called a
template-token
An occurrence of a token in value is called a
value-token
An occurrence of a token in page is called a
page-token
Two page-tokens in Pe have the same role iff they
have been generated by the same template-token

59
Overview Approach - EXALG
(ECGM)
60
EXALG - ECGM: FINDEQ (step 2)

The module used to compute equivalence
classes: sets of tokens having the same
frequency of occurrence in every page of Pe
Ex. ee1: <html>, <body>, Book, Reviews, <ol>,
</ol>, </body>, </html>
Ex. ee3: <li>, Reviewer, Rating, Text, </li>
EXALG retains only EQ classes that are large and
frequently occurring (LFEQs)

61
EXALG - ECGM: HANDINV (step 3)

The module used to detect and remove invalid
LFEQs: those that are not formed by tokens
associated with a type constructor

62
DIFFFORM (step 1) and DIFFEQ (step 4)

The modules used to add more tokens to LFEQs by
differentiating roles
Ex.: Name has multiple roles: one occurs in Book
Name and the other occurs in Reviewer Name
Differentiating the multiple roles
The occurrences lie on different paths from the
root in the HTML parse tree (DIFFFORM)
The occurrences lie at different positions
with respect to LFEQ ee1 (DIFFEQ)
dtoken, e.g. Name5 and Name14
regard NameA and NameB as different tokens
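
Differentiation by path can be sketched with Python's standard `html.parser`: each text word is paired with the path of open tags from the root, so two occurrences of the same word on different paths become different dtokens. The class and function names are hypothetical, only text words are annotated, and the sample page is invented for illustration.

```python
from html.parser import HTMLParser

class PathTokenizer(HTMLParser):
    """Collect (word, path-from-root) pairs for every text word."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open tags
        self.dtokens = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        path = "/".join(self.stack)
        for word in data.split():
            self.dtokens.append((word, path))

def path_dtokens(html):
    parser = PathTokenizer()
    parser.feed(html)
    parser.close()
    return parser.dtokens

page = ("<html><body><b>Book Name</b>"
        "<ol><li><b>Reviewer Name</b></li></ol></body></html>")
for dtoken in path_dtokens(page):
    print(dtoken)
```

The two occurrences of Name end up with the paths `html/body/b` and `html/body/ol/li/b`, so they can be treated as different tokens.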

63
Review ECGM
64
Example After ECGM Process

ee1: <html>, <body>, <b>, Book, Name, </b>,
<b>, Reviews, </b>, <ol>, </ol>, </body>, </html>
size 8 → 13
ee3: <li>, <b>, Reviewer, Name, </b>, <b>,
Rating, </b>, <b>, Text, </b>, </li>
size 5 → 12
Positions: empty and non-empty

65
Construct Schema from ECGM

Construct schema S from ee1
The 1st non-empty position is of basic type ß
The 2nd non-empty position holds ee3, generated
by the set type constructor τe3
=> T(τe1) = <C11, C12, C13>, S = <ß, {S}τe2>τe1
=> T(τe2) = S <C31, C32, C33, C34>
=> T(τe3) = <C31, C32, C33, C34>, <ß, ß, ß>τe3
=> S = <ß, {<ß, ß, ß>τe3}τe2>τe1

66
Equivalence Classes (Cont.)

Pages P = {p1, ..., pn}, pi = λ(TS, xi)
Type constructors of TS: τ1, ..., τk
Definition: all tokens of an equivalence class have
the same occurrence vector
ex. ee1: <1,1,1,1>, ee3: <1,2,1,0>
Observation 1: Tokens associated with the same
type constructor τj in T that have unique roles
occur in the same equivalence class (used to
decide whether an EQ class is valid or not)
Support of a token: number of pages that contain it
Size of an EQ class: number of tokens in the class
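
Grouping tokens by occurrence vector, and then keeping only the large and frequent classes, can be sketched as follows. The function name and the toy pages are illustrative; whether the thresholds are applied with >= or > is an assumption, and differentiation of roles is ignored here.

```python
from collections import Counter, defaultdict

def find_lfeqs(pages, size_thres, sup_thres):
    """Group tokens by occurrence vector and keep large, frequent classes."""
    counts = [Counter(page) for page in pages]
    vocabulary = sorted({tok for page in pages for tok in page})

    classes = defaultdict(list)
    for tok in vocabulary:
        vector = tuple(c[tok] for c in counts)   # occurrences per page
        classes[vector].append(tok)

    lfeqs = {}
    for vector, tokens in classes.items():
        support = sum(1 for n in vector if n > 0)   # pages containing the class
        if len(tokens) >= size_thres and support >= sup_thres:
            lfeqs[vector] = tokens
    return lfeqs

pages = [
    ["<html>", "<ol>", "<li>", "A", "</li>", "</ol>", "</html>"],
    ["<html>", "<ol>", "<li>", "B", "</li>", "<li>", "C", "</li>", "</ol>", "</html>"],
    ["<html>", "<ol>", "</ol>", "</html>"],
]
for vector, tokens in find_lfeqs(pages, size_thres=2, sup_thres=2).items():
    print(vector, tokens)
```

In this toy input the data words B and C share the vector (0, 1, 0), but the class is dropped because it is supported by only one page.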

67
Equivalence Classes (Cont.)

Observation 2: for real pages, an equivalence
class of large size and support is usually valid
Properties of an EQ class <t1, ..., tm>
Ordered
Nested: the spans of all occurrences of ei either lie
within some fixed position p of the other class, or
do not overlap with it
Observation 3: A valid equivalence class is
ordered, and a pair of two valid equivalence
classes is nested

68
Handling Invalid Equivalence classes

Detect the existence of invalid LFEQs using
violations of the ordered and nested properties
If found, discard some of the LFEQs and break others into
smaller LFEQs

Differentiating roles of tokens

By path: different roles of a token lie on
different paths of the HTML parse tree
By position: different roles of a token are located
at different (non-empty) positions

69
Equivalence Class Generation Module

OUTPUT: set of LFEQs of dtokens, and pages
represented as strings of dtokens
FINDEQ: 2 parameters used to select
LFEQs (SIZETHRES, SUPTHRES)
On the running example
SIZETHRES = SUPTHRES = 3
In iteration 2, ee1 and ee3 are found

70
Building Template and Extracting Values

Input to this module is {e1, e2, ..., em}
The ANALYSIS consists of 2 modules: CONSTTEMP and
EXVAL
CONSTTEMP: ei = <d1, d2, ..., dl>
Starts from the basic class e1: <html>, <body>, ..., </body>,
</html>
Recursively constructs a template T_ei
corresponding to ei, and a template T_ei,p
corresponding to each non-empty position p of ei
Checks if the set of strings PosString(ei, p)
has some recognizable pattern

71
Example

In the running example, PosString(ee1, 6) is a
string of dtokens for every occurrence of ee1,
which matches Pattern 5 of the table; PosString(ee1,
10) is always a string of 0 or more occurrences
of ee3, which matches Pattern 1
ee1: <html>, <body>, <b>, Book, Name, </b>,
<b>, Reviews, </b>, <ol>, </ol>,
</body>, </html>

72
Assumption

The 4 assumptions
(A1) A large number of tokens occurring in the
template have unique roles
(A2) The EQ class derived from a type constructor
is recognized as an LFEQ
(A3) Irregularity in encoded data that leads to
invalid EQ classes
(A4) The separators are around data values. In
this model, strings associated with type
constructors are non-empty

73
Evaluation

Leaf attribute Am in schema Sm
Correct: the set of values of Am in the page is equal to
the set of extracted values Ae in the page
Partially correct: the set of values of Am in the page is
not equal to the set of extracted values Ae in the
page, but its values appear as part of the values of Ae
Incorrect: neither correct nor partially correct
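
The three grades can be sketched as a small function. The reading of "partially correct" used here is an assumption: every manually identified value must appear as a substring of some extracted value. The function name and the sample values are hypothetical.

```python
def grade_attribute(manual_values, extracted_values):
    """Grade one leaf attribute on one page."""
    manual = set(manual_values)
    extracted = set(extracted_values)
    if manual == extracted:
        return "correct"
    # Assumed reading: each manual value is contained in an extracted value.
    if manual and all(any(m in e for e in extracted) for m in manual):
        return "partially correct"
    return "incorrect"

print(grade_attribute(["Ann", "Bob"], ["Ann", "Bob"]))
print(grade_attribute(["Ann", "Bob"], ["Ann 7", "Bob 9"]))
print(grade_attribute(["Ann"], ["Carl"]))
```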

74
Result

For 18 (or 40%) of the input collections, our system
correctly extracted all the attributes
Around 80% of the attributes were extracted
correctly
Normalized average
Input size: < 10
Parameter: 3

75
Conclusion

EXALG uses 2 novel concepts, equivalence classes
and differentiating roles, to discover the
template
The impact of a failed assumption is limited to a few
attributes
Future work
Develop techniques for crawling, indexing, and
providing querying support for the structured
pages in the web
Develop techniques for automatically annotating
the extracted data, possibly using the words that
appear in the template

76
References

C.H. Chang and S.C. Lui. IEPAD: Information
Extraction based on Pattern Discovery. WWW2001,
pp. 681-688.
Valter Crescenzi, Giansalvatore Mecca, Paolo
Merialdo. RoadRunner: Towards Automatic Data
Extraction from Large Web Sites. VLDB2001,
pp. 109-118.
Arvind Arasu, Hector Garcia-Molina. Extracting
Structured Data from Web Pages. SIGMOD2003,
pp. 337-348.