Extracting Structured Data from Web Pages

1
Extracting Structured Data from Web Pages
  • By Arsun ARTEL, Özgün ÖZISIKYILMAZ
  • 05.11.2003
  • Instructor: Prof. Taflan Gündem

2
Presentation Outline
  • Motivation
  • Example Pages
  • Model Problem Formulation
  • Approach in Detail
  • Experimental Results
  • Conclusion

3
What is next?
  • Motivation
  • Example Pages
  • Model Problem Formulation
  • Approach in Detail
  • Experimental Results
  • Conclusion

4
Motivation
  • There are many web sites that contain a large
    collection of structured pages.
  • Extracting structured data from the web pages is
    useful, since it enables us to pose complex
    queries over the data.
  • This paper focuses on the problem of
    automatically extracting structured data from a
    collection of pages.

5
What is next?
  • Motivation
  • Example Pages
  • Model Problem Formulation
  • Approach in Detail
  • Experimental Results
  • Conclusion

6
Example Pages
  • In the real world there are many examples of
    structured web pages.
  • For example, the Amazon and eBay web sites.
  • Two examples from www.amazon.com:
  • My System
  • An Eternal Golden Braid

7
Example Pages (My System 21st Century Edition)
8
Example Pages (An Eternal Golden Braid)
9
What is next?
  • Motivation
  • Example Pages
  • Model Problem Formulation
  • Approach in Detail
  • Experimental Results
  • Conclusion

10
Underlying Problems
  • Complex schema: The schema of the information
    encoded in the web pages can be very complex,
    with arbitrary levels of nesting. For instance,
    each book page can contain a set of authors, with
    each author having a set of addresses, and so on.
  • Template vs. data: Syntactically, nothing
    distinguishes text that is part of the template
    from text that is part of the data.

11
How is a page created with a template?
12
Basic Type, Tuples and Sets
  • Basic type b: a basic unit of text
  • Tuple: an ordered list of types, <T1, T2, ..., Tn>
  • Set: {T1}

<C Programming Language, {<Brian, Kernighan>,
<Dennis, Ritchie>}, 30.00>
13
Schema and Instance
  • <C Programming Language, {<Brian, Kernighan>,
    <Dennis, Ritchie>}, 30.00>
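
The type constructors above can be illustrated with a small sketch. The encoding is an assumption made for this example, not something the slides specify: a basic type is a Python str, a tuple is a Python tuple, and a set is a frozenset. The function name schema_of is hypothetical, and the two author tuples are assumed to form a set.

```python
# Assumed encoding (not from the slides):
#   basic type b -> str, tuple <...> -> tuple, set {...} -> frozenset
def schema_of(value):
    """Return the schema of an instance as a string, e.g. '<b, {<b, b>}, b>'."""
    if isinstance(value, str):
        return "b"
    if isinstance(value, tuple):
        return "<" + ", ".join(schema_of(v) for v in value) + ">"
    if isinstance(value, frozenset):
        member_schemas = {schema_of(v) for v in value}
        # The element type cannot be inferred from an empty or mixed set.
        if len(member_schemas) != 1:
            raise ValueError("a set must be non-empty and homogeneous")
        return "{" + member_schemas.pop() + "}"
    raise TypeError(f"unsupported value: {value!r}")


# The book instance from the slide, with the authors modelled as a set.
book = (
    "C Programming Language",
    frozenset({("Brian", "Kernighan"), ("Dennis", "Ritchie")}),
    "30.00",
)

print(schema_of(book))  # <b, {<b, b>}, b>
```

The instance carries the values; the schema only records how basic types, tuples, and sets are nested.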

14
Template Definition
  • Own example
  • Schema S = <b, b, b>
  • Template TS = <A B E C D>
  • A Title, B Presented by, C
    Cost, D , E and
  • Instance of TS
  • Title Extracting Structured Data Presented by
    Arsun and Özgün Cost 1hr

15
Template
16
What is next?
  • Motivation
  • Example Pages
  • Model Problem Formulation
  • Approach in Detail
  • Experimental Results
  • Conclusion

17
General Description of EXALG
18
Multiple Pages
19
Correct Solution for those pages
20
Some Terminology (1)
  • The occurrence-vector of a token t is defined as
    the vector <f1, f2, ..., fn>, where fi is the
    number of occurrences of t in the i-th page.
  • An equivalence class is a maximal set of tokens
    having the same occurrence-vector.
  • A token is said to have a unique role if all
    occurrences of the token in the pages are
    generated by a single template-token.
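
The first two definitions can be sketched in a few lines. This is a minimal illustration, assuming each page has already been split into a list of tokens; the function names and the toy pages are hypothetical and do not come from the presentation.

```python
from collections import Counter, defaultdict


def occurrence_vectors(pages):
    """Map each token to its occurrence-vector (f1, ..., fn) over n pages."""
    counts = [Counter(page) for page in pages]
    vocabulary = set().union(*counts)
    return {tok: tuple(c[tok] for c in counts) for tok in vocabulary}


def equivalence_classes(vectors):
    """Group tokens that share the same occurrence-vector."""
    groups = defaultdict(set)
    for tok, vec in vectors.items():
        groups[vec].add(tok)
    return dict(groups)


# Two toy pages, already tokenized (assumed input format).
pages = [
    ["<html>", "Title", "A", "</html>"],
    ["<html>", "Title", "B", "B", "</html>"],
]

vectors = occurrence_vectors(pages)
classes = equivalence_classes(vectors)
print(vectors["B"])     # (0, 2)
print(classes[(1, 1)])  # the tokens that occur exactly once in every page
```

Grouping by the full vector makes each class maximal by construction: every token with that vector ends up in the same set.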

21
Some Terminology (2)
22
Some Terminology (3)
  • For real pages, an equivalence class of large
    size and support is usually valid, where support
    of a token is defined as the number of pages in
    which the token occurs.
  • Example of an invalid equivalence class:
  • {Data, Mining, Jeff, 2, Jane, 6} has occurrence
    vector <0, 1, 0, 0>

23
Some Terminology (4)
  • The equivalence classes with large size and
    support are called LFEQs (for Large and Frequent
    EQuivalence class). LFEQs are rarely formed by
    chance.
  • Thresholds for size and support are set by the
    user (SizeThres, SupThres).
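
A minimal sketch of the LFEQ filter follows, assuming equivalence classes are given as a mapping from occurrence-vector to token set. The names support and find_lfeqs, the sample classes, and the choice of inclusive (>=) thresholds are assumptions made for illustration.

```python
def support(vector):
    """Number of pages in which tokens with this occurrence-vector occur."""
    return sum(1 for f in vector if f > 0)


def find_lfeqs(classes, size_thres, sup_thres):
    """Keep the classes that are both large (size) and frequent (support).

    Assumption: both thresholds are inclusive.
    """
    return {
        vec: toks
        for vec, toks in classes.items()
        if len(toks) >= size_thres and support(vec) >= sup_thres
    }


# Hypothetical equivalence classes over four pages.
classes = {
    (1, 1, 1, 1): {"<html>", "<body>", "Title", "</body>", "</html>"},
    (0, 1, 0, 0): {"Data", "Mining", "Jeff"},  # large enough, but support 1
    (1, 1, 1, 0): {"Reviewer"},                # frequent, but size 1
}

lfeqs = find_lfeqs(classes, size_thres=3, sup_thres=3)
print(sorted(lfeqs))  # [(1, 1, 1, 1)]
```

Only the first class passes both thresholds; the other two are each rejected by one of them.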

24
Some Terminology (5)
  • Valid equivalence class properties: ordering and
    nesting
  • Back to own example
  • Template TS = <A B E C D>
  • A Title, B Presented by, C
    Cost, D , E and
  • Ordered: A > B > C > D
  • Nesting: B > E > C

25
Important Observations
  • In practice, two page-tokens with different
    occurrence-paths in the HTML parse tree have
    different roles.
  • Two page-tokens that have the same occurrence-path
    but different neighbours also have different
    roles.
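
The first observation can be illustrated with Python's standard html.parser module. This is a sketch under simplifying assumptions: the class PathTokenizer and the function tokens_with_paths are hypothetical names, the input is assumed to be well formed, and void elements such as an unclosed br tag are not handled.

```python
from html.parser import HTMLParser


class PathTokenizer(HTMLParser):
    """Collect (token, occurrence-path) pairs while parsing a page."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.tokens = []  # (token, path) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.tokens.append((f"<{tag}>", "/".join(self.stack)))

    def handle_endtag(self, tag):
        self.tokens.append((f"</{tag}>", "/".join(self.stack)))
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        path = "/".join(self.stack)
        for word in data.split():
            self.tokens.append((word, path))


def tokens_with_paths(html):
    parser = PathTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


page = "<html><body><b>Name</b><p>Name</p></body></html>"
for token, path in tokens_with_paths(page):
    if token == "Name":
        print(token, path)
# Name html/body/b
# Name html/body/p
```

Treating the pair (token, path) as the unit of comparison keeps the two occurrences of "Name" apart, so they are not forced into the same role.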

26
Explanation of observations
27
Modules and their operations
28
Constructing Template (1)
  • The extraction algorithm determines the positions
    between consecutive tokens of an equivalence
    class that are non-empty.
  • A position between two consecutive tokens is
    empty if the two tokens always occur
    contiguously, and non-empty otherwise.
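
The empty versus non-empty test can be sketched as follows, assuming pages are token lists and the tokens of one equivalence class are given in their template order. The function names and sample pages are hypothetical, and the sketch ignores nesting.

```python
def position_is_empty(pages, left, right):
    """True if `right` immediately follows every occurrence of `left`."""
    for page in pages:
        for i, tok in enumerate(page):
            if tok == left:
                if i + 1 >= len(page) or page[i + 1] != right:
                    return False
    return True


def classify_positions(pages, ordered_tokens):
    """Label the position between each pair of consecutive class tokens."""
    return [
        (a, b, "empty" if position_is_empty(pages, a, b) else "non-empty")
        for a, b in zip(ordered_tokens, ordered_tokens[1:])
    ]


# Two toy pages; X, Y and Z stand for data values.
pages = [
    ["<b>", "Title", "</b>", "X", "<i>"],
    ["<b>", "Title", "</b>", "Y", "Z", "<i>"],
]
eq_class = ["<b>", "Title", "</b>", "<i>"]

for left, right, label in classify_positions(pages, eq_class):
    print(left, right, label)
# <b> Title empty
# Title </b> empty
# </b> <i> non-empty
```

In this toy input only the gap between the closing bold tag and the italic tag ever contains other tokens, so that is the one position where data would be extracted.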

29
Constructing Template (2)
  • The tokens connected by empty positions belong to
    the template.
  • A non-empty position holds either a basic type
    (a string extracted from the database) or a more
    complex type.
  • This unknown type can be determined by inspecting
    the input pages.

30
Constructing Template (3)
31
What is next?
  • Motivation
  • Example Pages
  • Model Problem Formulation
  • Approach in Detail
  • Experimental Results
  • Conclusion

32
Experimental Results (1)
  • This work is compared with RoadRunner; note,
    however, that RoadRunner makes simplifying
    assumptions.
  • The first six web pages are taken from the
    RoadRunner site.
  • The last three web pages have a more complex
    structure.

33
Experimental Results (2)
34
What is next?
  • Motivation
  • Example Pages
  • Model Problem Formulation
  • Approach in Detail
  • Experimental Results
  • Conclusion

35
Concluding Remarks
  • EXALG first discovers the unknown template that
    generated the pages and uses the discovered
    template to extract the data from the input
    pages.
  • Besides producing very good results, EXALG does
    not fail completely when some of its assumptions
    are not met by the input collection; it still
    extracts some of the data.
  • No human intervention: the template and the data
    are obtained automatically.

36
Future Work
  • Automatically locate collections of pages that
    are structured
  • Check whether it is feasible to generate a
    large database from these pages

37
Questions & Answers