Title: Structured Data Extraction From Web Based on Partial Tree Alignment


1
Structured Data Extraction From Web Based on
Partial Tree Alignment
  • by Yanhong Zhai and Bing Liu

2
Introduction
  • A large amount of information on the Web is
    contained in regularly structured data objects,
    which are data records retrieved from databases.
  • Such Web data records are important because
    they often present the essential information of
    their host pages, e.g., lists of products and
    services.
  • Applications: integrated and value-added
    services, e.g., comparative shopping, meta-search
    query, etc.

3
Example (a)
4
Years   Persons   ()    Persons    ()    Persons   ()
40-49    51,000   0.1    80,000    0.2   131,000   0.3
50-59    45,000   0.1   102,000    0.3   147,000   0.4
60-69    59,000   0.3   178,000    0.9   235,000   1.2
70-79   134,000   0.8   471,000    3.0   605,000   3.8
>80     648,000   7.0   1,532,00   16    905,00    23
5
Existing Methods
  • Wrapper programming languages
  • This approach provides some languages to
    facilitate the construction of data extraction
    programs.
  • Wrapper induction
  • This approach uses machine learning techniques
    to learn data extraction rules from a set
    of manually labeled examples.
  • Automatic extraction
  • This approach is based on the idea of automatic
    pattern discovery.

6
Proposed Method
  • DEPTA (Data Extraction based on Partial Tree
    Alignment)
  • This method consists of two steps:
  • 1) Identifying individual records in a
    page.
  • 2) Aligning and extracting data items
    from the identified records.

7
Architecture of DEPTA System
Input: a Web page
DOM Tree Builder
Data Region Identifier
Data Records Identifier
Data Items Extractor
Output: data tables


8
DATA RECORD IDENTIFICATION
  • MDR: Mining Data Records
  • Given a single page with multiple data records,
    MDR extracts data records, but not data
    items (step 1).
  • MDR is based on
  • two observations about data records in a Web
    page, and
  • a tree matching algorithm.
  • It considers both
  • contiguous and
  • non-contiguous records.

9
Two Observations
  • A group of data records is presented
    in a contiguous region (a data region) of a page
    and is formatted using similar HTML tags.
  • A set of similar data records is formed by some
    child subtrees of the same parent node.

10
DOM tree of the previous page

[Figure: DOM tree of the previous page, made of TABLE, TBODY, TR, and TD
nodes, with two subtrees marked as Data record1 and Data record2.]
11
The approach
  • Given a page:
  • Building the DOM trees based on
    visual information
  • Mining data regions
  • Identifying data records
  • Rendering (or visual) information is very
    useful in the whole process.

12
Building DOM Trees Based on Visual Information

<table>
  <tr>
    <td>data1</td>
    <td>data2</td>
  <tr>
    <td>data3</td>
    <td>data4</td>
  </tr>
</table>

[Figure: rendering coordinates (left, right, top, bottom) of the table, tr,
and td elements, and the DOM tree built from them. Coordinate rows as they
appear in the transcript:]
Left  Right  Top  Bottom
100   300    200  400
100   300    200  300
100   300    200  400
200   300    200  300
100   300    300  400
100   200    300  400
200   300    200  400
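
The idea of nesting elements by their rendered rectangles can be sketched as follows. This is a minimal illustration, not the DEPTA implementation: the `Element` class, the `build_tree` and `outline` helpers, and the coordinate values are all assumptions made for the example. It assumes elements arrive in document order and that a child's rectangle lies inside its parent's.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    tag: str
    left: int
    right: int
    top: int
    bottom: int
    children: List["Element"] = field(default_factory=list)

    def contains(self, other: "Element") -> bool:
        """True if other's rectangle lies inside this element's rectangle."""
        return (self.left <= other.left and other.right <= self.right
                and self.top <= other.top and other.bottom <= self.bottom)


def build_tree(elements: List[Element]) -> Element:
    """Nest elements (document order, root first) by rectangle containment."""
    root = elements[0]
    open_ancestors = [root]
    for elem in elements[1:]:
        # Close every open element whose rectangle does not contain this one.
        while len(open_ancestors) > 1 and not open_ancestors[-1].contains(elem):
            open_ancestors.pop()
        open_ancestors[-1].children.append(elem)
        open_ancestors.append(elem)
    return root


def outline(node: Element) -> str:
    """Serialize the tree as nested tags, e.g. 'tr(td td)'."""
    if not node.children:
        return node.tag
    return f"{node.tag}({' '.join(outline(c) for c in node.children)})"


# Hypothetical rectangles for a 2x2 table whose first <tr> is never closed.
page = [
    Element("table", 100, 300, 200, 400),
    Element("tr", 100, 300, 200, 300),
    Element("td", 100, 200, 200, 300),
    Element("td", 200, 300, 200, 300),
    Element("tr", 100, 300, 300, 400),
    Element("td", 100, 200, 300, 400),
    Element("td", 200, 300, 300, 400),
]
print(outline(build_tree(page)))  # table(tr(td td) tr(td td))
```

Because nesting is decided by geometry, the missing closing tag of the first row does not matter here: the second row's rectangle is not inside the first row's, so it becomes a sibling.
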
13
Enhanced Simple Tree Matching
[Figure (a): Two trees, T1 and T2, both rooted at p, with more than one
possible match.]
[Figure (b): Alignment using tags only can produce wrong alignments. The
data rows "data1 data2 data3" and "data2 data3 data4" are shown under a
wrong alignment and under the correct alignment.]
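
For reference, the plain Simple Tree Matching recurrence that the enhanced version builds on can be sketched as below. This is the standard tags-only algorithm, not the enhancement from the slide, so it has exactly the limitation noted above. The tuple-based tree encoding, the function names, and the normalization by average tree size are assumptions made for this example.

```python
from typing import List, Tuple

# A tree is a (tag, children) pair; this encoding is an assumption.
Tree = Tuple[str, List["Tree"]]


def simple_tree_matching(a: Tree, b: Tree) -> int:
    """Size of the largest order-preserving match between two trees."""
    if a[0] != b[0]:
        return 0  # different root tags: nothing below can match
    kids_a, kids_b = a[1], b[1]
    m, n = len(kids_a), len(kids_b)
    # best[i][j]: best match between the first i and first j child subtrees.
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(
                best[i][j - 1],
                best[i - 1][j],
                best[i - 1][j - 1]
                + simple_tree_matching(kids_a[i - 1], kids_b[j - 1]),
            )
    return best[m][n] + 1  # +1 for the matched roots


def size(t: Tree) -> int:
    """Number of nodes in the tree."""
    return 1 + sum(size(c) for c in t[1])


def normalized_match(a: Tree, b: Tree) -> float:
    """Match score divided by the average size of the two trees."""
    return simple_tree_matching(a, b) / ((size(a) + size(b)) / 2)


t1 = ("p", [("a", []), ("b", [("c", [])]), ("d", [])])
t2 = ("p", [("a", []), ("b", [("c", []), ("e", [])]), ("f", [])])
print(simple_tree_matching(t1, t2))  # 4: p, a, b and c are matched
```
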
14
Mining Data Regions
  • Find every data region with similar data records.
  • Definition: A generalized node (or a node
    combination) of length r consists of r (r ≥ 1)
    nodes in the HTML tag tree with the following
    two properties:
  • 1. the nodes all have the same parent, and
  • 2. the nodes are adjacent.
  • Definition: A data region is a collection of two
    or more generalized nodes with the following
    properties:
  • 1. The generalized nodes all have the same
    parent.
  • 2. The generalized nodes are all adjacent.
  • 3. Adjacent generalized nodes are similar.

15
Determining Data Regions
  • To find each data region, the algorithm needs to
    answer the following:
  • 1. Where does the first generalized node of the
    data region start?
  • Try to start from each child node under a parent.
  • 2. How many tag nodes or components does a
    generalized node have?
  • Try one-node, two-node, ..., K-node combinations.
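
The two questions above amount to a search over start positions and combination lengths. A simplified sketch follows. It is not the MDR algorithm itself: each child is reduced to a string of tags, similarity is measured with Python's `difflib.SequenceMatcher` as a stand-in, and the names `similar`, `run_length`, `find_data_regions`, the threshold of 0.8, and `max_len` (playing the role of K) are all assumptions for illustration.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def similar(g1: List[str], g2: List[str], threshold: float) -> bool:
    """Compare two generalized nodes by the tag tokens of their children."""
    tokens1 = [t for child in g1 for t in child.split()]
    tokens2 = [t for child in g2 for t in child.split()]
    return SequenceMatcher(None, tokens1, tokens2).ratio() >= threshold


def run_length(children: List[str], start: int, k: int, threshold: float) -> int:
    """Count adjacent, pairwise similar generalized nodes of length k."""
    count, i = 1, start
    while i + 2 * k <= len(children) and similar(
        children[i:i + k], children[i + k:i + 2 * k], threshold
    ):
        count += 1
        i += k
    return count


def find_data_regions(
    children: List[str], max_len: int = 3, threshold: float = 0.8
) -> List[Tuple[int, int, int]]:
    """Return regions as (start index, generalized-node length, node count)."""
    regions = []
    start = 0
    while start < len(children):
        best_k, best_count = 0, 1
        for k in range(1, max_len + 1):  # one-node, two-node, ..., K-node
            count = run_length(children, start, k, threshold)
            # Keep the combination length that covers the most children.
            if count >= 2 and count * k > best_count * best_k:
                best_k, best_count = k, count
        if best_k:
            regions.append((start, best_k, best_count))
            start += best_k * best_count
        else:
            start += 1  # no region starts here; try the next child
    return regions


# Children of one parent, each written as its tag string (hypothetical page).
kids = ["h1", "tr td td", "tr td td", "tr td td", "hr",
        "div a", "div p", "div a", "div p"]
print(find_data_regions(kids))  # [(1, 1, 3), (5, 2, 2)]
```

In the sample, the first region has three one-node generalized nodes, and the second has two generalized nodes of two children each, which is the case a one-node-only search would miss.
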

16
An illustration of generalized nodes and data
regions

[Figure: a tag tree with numbered nodes. Shaded nodes are generalized
nodes, grouped into three data regions: Region 1, Region 2, and Region 3.]
17
Identifying Data Records
  • A generalized node may not
    be a data record.
  • Extra mechanisms are
    needed to identify true
    atomic objects.
  • Some highlights:
  • contiguous data records
  • non-contiguous data records

Contiguous data records:
Name1 Description of object 1
Name2 Description of object 2
Name3 Description of object 3
Name4 Description of object 4

Non-contiguous data records:
Name1                     Name2
Description of object 1   Description of object 2
Name3                     Name4
Description of object 3   Description of object 4
18
DEPTA: Extract Data from Data Records
  • Once a list of data records is identified, we
    can align and extract the data items in them.
  • Multiple tree alignment
  • We need multiple alignment as we have multiple
    data records.
  • Most multiple alignment methods work like
    hierarchical clustering and require n^2 pairwise
    matchings.
  • Too expensive
  • Optimal alignment/matching is exponential.
  • A partial tree matching algorithm is proposed in
    DEPTA to perform multiple tree alignment.

19
The Partial Tree Alignment Approach
  • Choose a seed tree: a seed tree, denoted by Ts,
    is picked with the maximum number of data items.
  • Tree matching:
  • For each unmatched tree Ti (i ≠ s),
  • match Ts and Ti.
  • Each pair of matched nodes is linked (aligned).
  • For each unmatched node nj in Ti,
  • expand Ts by inserting nj into Ts if a position
    for insertion can be uniquely determined in Ts.
  • The expanded seed tree Ts is then used in
    subsequent matching.
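
The steps above can be sketched for the simplest case, where every tree is just a root with a flat list of labelled children. This is an illustrative simplification rather than the DEPTA algorithm: nodes are matched by equal labels using a longest-common-subsequence alignment, trees that cannot be fully placed are retried after the seed has grown, and all function names are assumptions.

```python
from typing import Dict, List, Tuple


def match_children(seed: List[str], tree: List[str]) -> Dict[int, int]:
    """Order-preserving label match; maps tree index -> seed index."""
    m, n = len(seed), len(tree)
    lcs = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if seed[i] == tree[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    pairs, i, j = {}, 0, 0
    while i < m and j < n:
        if seed[i] == tree[j]:
            pairs[j] = i
            i, j = i + 1, j + 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def expand_seed(seed: List[str], tree: List[str]) -> Tuple[List[str], bool, bool]:
    """Insert unmatched runs of tree into seed where the position is unique.

    Returns (new seed, whether anything was inserted, whether all of tree
    is now represented in the seed).
    """
    pairs = match_children(seed, tree)
    insertions, complete, j = [], True, 0
    while j < len(tree):
        if j in pairs:
            j += 1
            continue
        k = j
        while k < len(tree) and k not in pairs:
            k += 1
        left, right = pairs.get(j - 1), pairs.get(k)  # matched neighbours
        if left is not None and right is not None and right == left + 1:
            insertions.append((right, tree[j:k]))      # between adjacent nodes
        elif left is None and right == 0:
            insertions.append((0, tree[j:k]))          # before the first node
        elif right is None and left == len(seed) - 1:
            insertions.append((len(seed), tree[j:k]))  # after the last node
        else:
            complete = False                           # position is ambiguous
        j = k
    new_seed = list(seed)
    for pos, labels in reversed(insertions):  # right to left keeps indices valid
        new_seed[pos:pos] = labels
    return new_seed, bool(insertions), complete


def partial_tree_alignment(trees: List[List[str]]) -> Tuple[List[str], List[List[str]]]:
    """Grow a seed from the largest tree; return (seed, trees left unplaced)."""
    s = max(range(len(trees)), key=lambda i: len(trees[i]))
    seed = list(trees[s])
    pending = [t for i, t in enumerate(trees) if i != s]
    progress = True
    while pending and progress:
        progress, retry = False, []
        for tree in pending:
            seed, inserted, complete = expand_seed(seed, tree)
            progress = progress or inserted
            if not complete:
                retry.append(tree)
        pending = retry
    return seed, pending


print(expand_seed(["a", "b", "e"], ["b", "c", "d", "e"])[0])  # ['a', 'b', 'c', 'd', 'e']
print(expand_seed(["a", "b", "e"], ["a", "x", "e"])[0])       # ['a', 'b', 'e']
```

In the second call, x sits between a and e, which are not adjacent in the seed, so it could go before or after b and is left out.
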

20
Illustration of partial tree alignment
Insertion is possible:
  Ts = p(a, b, e)
  Ti = p(b, c, d, e)
  New Ts = p(a, b, c, d, e), where c and d are the new part of Ts

Insertion is not possible:
  Ts = p(a, b, e)
  Ti = p(a, x, e)
21
A complete example
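Ts = T1 = p(..., x, b, d)
T2 = p(b, n, c, k, g)
T3 = p(b, c, d, h, k)

  • T2 is matched with Ts: no node inserted.
    Ts = p(..., x, b, d)
  • T3 is matched with Ts: c, h and k inserted.
    New Ts = p(..., x, b, c, d, h, k)
  • T2 is matched again: n and g inserted.
    Ts = p(..., x, b, n, c, d, h, k, g)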
22
Output data table

      ...   x   b   n   c   d   h   k   g
T1    ...   1   1           1
T2              1   1   1           1   1
T3              1       1   1   1   1

  • The final tree may also be used to match and
    extract data from other similar pages.
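
Once the final seed tree fixes the column order, each record can be laid out as one row of the output table. The sketch below assumes each record is a mapping from an aligned node label to its data item; the `to_table` name, the labels, and the sample values are hypothetical.

```python
from typing import Dict, List


def to_table(columns: List[str], records: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Place each record's items under the columns of the final seed tree."""
    return {
        name: [items.get(col, "") for col in columns]  # "" marks a missing item
        for name, items in records.items()
    }


# Column order taken from a final seed tree; the item values are made up.
columns = ["x", "b", "n", "c", "d", "h", "k", "g"]
records = {
    "T1": {"x": "x1", "b": "b1", "d": "d1"},
    "T2": {"b": "b2", "n": "n2", "c": "c2", "k": "k2", "g": "g2"},
    "T3": {"b": "b3", "c": "c3", "d": "d3", "h": "h3", "k": "k3"},
}
for name, row in to_table(columns, records).items():
    print(name, row)
```
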

23
Conclusion
  • Existing techniques are either inaccurate or make
    several assumptions.
  • Our method does not make these assumptions.
  • Our technique consists of two steps:
  • Identifying data records
  • Aligning corresponding data items from multiple
    data records
  • Step 1 is based on visual cues.
  • Step 2 is based on partial tree alignment.
