A Survey of WEB Information Extraction Systems - PowerPoint PPT Presentation

1
A Survey of WEB Information Extraction Systems
  • Chia-Hui Chang
  • National Central University
  • Sep. 22, 2005

2
Introduction
  • Abundant information on the Web
  • Static Web pages
  • Searchable databases (the Deep Web)
  • Information Integration
  • Information for life
  • e.g. shopping agents, travel agents
  • Data for research purpose
  • e.g. bioinformatics, auction economy

3
Introduction (Cont.)
  • Information Extraction (IE)
  • identifies relevant information in documents,
    pulling it from a variety of sources and
    aggregating it into a homogeneous form
  • An IE task is defined by its input and output

4
An IE Task
5
Web Data Extraction
Data Record
6
IE Systems
  • Wrappers
  • Programs that perform the task of IE are referred
    to as extractors or wrappers.
  • Wrapper Induction
  • IE systems are software tools that are designed
    to generate wrappers.

7
Various IE Surveys
  • Muslea
  • Hsu and Dung
  • Chang
  • Kushmerick
  • Laender
  • Sarawagi
  • Kuhlins and Tredwell

8
Related Work: Time
  • MUC Approaches
  • AutoSlog [Riloff, 1993], LIEP [Huffman, 1996],
    PALKA [Kim, 1995], HASTEN [Krupka, 1995], and
    CRYSTAL [Soderland, 1995]
  • Post-MUC Approaches
  • WHISK [Soderland, 1999], RAPIER [Califf, 1998],
    SRV [Freitag, 1998], WIEN [Kushmerick, 1997],
    SoftMealy [Hsu, 1998] and STALKER [Muslea, 1999]

9
Related Work: Automation Degree
  • Hsu and Dung 1998
  • hand-crafted wrappers using general programming
    languages
  • specially designed programming languages or tools
  • heuristic-based wrappers, and
  • WI approaches

10
Related Work: Automation Degree
  • Chang and Kuo 2003
  • systems that need programmers,
  • systems that need annotation examples,
  • annotation-free systems and
  • semi-supervised systems

11
Related Work: Input and Extraction Rules
  • Muslea 1999
  • IE from free text using extraction patterns that
    are mainly based on syntactic/semantic
    constraints.
  • The second class is wrapper induction systems,
    which rely on delimiter-based rules.
  • The third class also processes IE from online
    documents however the patterns of these tools
    are based on both delimiters and
    syntactic/semantic constraints.

12
Related Work: Extraction Rules
  • Kushmerick 2003
  • Finite-state tools (regular expressions)
  • Relational learning tools (logic rules)

13
Related Work: Techniques
  • Laender 2002
  • languages for wrapper development
  • HTML-aware tools
  • NLP-based tools
  • Wrapper induction tools (e.g., WIEN, SoftMealy
    and STALKER),
  • Modeling-based tools
  • Ontology-based tools
  • New Criteria
  • degree of automation, support for complex
    objects, page contents, availability of a GUI,
    XML output, support for non-HTML sources,
    resilience and adaptiveness.

14
Related Work: Output Targets
  • Sarawagi VLDB 2002
  • Record-level
  • Page-level
  • Site-level

15
Related Work: Usability
  • Kuhlins and Tredwell 2002
  • Commercial
  • Noncommercial

16
Three Dimensions
  • Task Domain
  • Input (Unstructured, semi-structured)
  • Output Targets (record-level, page-level,
    site-level)
  • Automation Degree
  • Programmer-involved, learning-based or
    annotation-free approaches
  • Techniques
  • Regular expression rules vs Prolog-like logic
    rules
  • Deterministic finite-state transducer vs
    probabilistic hidden Markov models

17
Task Domain: Input
18
Task Domain: Output
  • Missing Attributes
  • Multi-valued Attributes
  • Multiple Permutations
  • Nested Data Objects
  • Various Templates for an attribute
  • Common Templates for various attributes
  • Untokenized Attributes

19
Classification by Automation Degree
  • Manually
  • TSIMMIS, Minerva, WebOQL, W4F, XWrap
  • Supervised
  • WIEN, Stalker, Softmealy
  • Semi-supervised
  • IEPAD, OLERA
  • Unsupervised
  • DeLa, RoadRunner, EXALG

20
Automation Degree
  • Page-fetching Support
  • Annotation Requirement
  • Output Support
  • API Support

21
Technologies
  • Scan passes
  • Extraction rule types
  • Learning algorithms
  • Tokenization schemes
  • Features used

22
A Survey of Contemporary IE Systems
  • Manually-constructed IE tools
  • Programmer-aided
  • Supervised IE systems
  • Labeled based
  • Semi-supervised IE systems
  • Unsupervised IE systems
  • Annotation-free

23
(No Transcript)
24
Manually-constructed IE Systems
  • TSIMMIS [Hammer et al., 1997]
  • Minerva [Crescenzi, 1998]
  • WebOQL [Arocena and Mendelzon, 1998]
  • W4F [Sahuguet and Azavant, 2001]
  • XWrap [Liu et al., 2000]

25
A Running Example
26
TSIMMIS
  • Each command is of the form
  • [variables, source, pattern], where
  • source specifies the input text to be considered,
  • pattern specifies how to find the text of
    interest within the source, and
  • variables are a list of variables that hold the
    extracted results.
  • Note
  • '#' means save in the variable
  • '*' means discard
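The save/discard convention above can be sketched as a tiny interpreter (a hypothetical Python re-implementation for illustration; TSIMMIS's actual wrapper specification language is richer):

```python
import re

def run_command(variables, source, pattern):
    """Interpret a TSIMMIS-style [variables, source, pattern] command.

    In the pattern, '#' captures text into the next variable and '*'
    skips text, mirroring the save/discard convention above.
    """
    regex = ""
    for part in re.split(r"([#*])", pattern):
        if part == "#":
            regex += "(.*?)"        # save into a variable
        elif part == "*":
            regex += "(?:.*?)"      # discard
        else:
            regex += re.escape(part)  # literal delimiter text
    m = re.search(regex, source, re.DOTALL)
    return dict(zip(variables, m.groups())) if m else None

page = "<b>Rating</b> 7 <b>Text</b> great book"
print(run_command(["rating"], page, "<b>Rating</b>#<b>"))
# → {'rating': ' 7 '}
```

The captured value keeps surrounding whitespace; a real extractor would normalize it.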

27
Minerva
  • The grammar used by Minerva is defined in an EBNF
    style

28
WebOQL
  • Select Z'.Text
  • From x in browse("pe2.html"), y in x', Z in y'
  • Where x.Tag = "ol" and Z.Text = "Reviewer Name"

29
W4F
  • WYSIWYG support
  • Java toolkit
  • Extraction rule
  • HTML parse tree (DOM object)
  • e.g. html.body.ol[0].li.pcdata[0].txt
  • Regular expression to address finer pieces of
    information
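W4F's two-step addressing (a DOM path to a node, then a regular expression for the finer piece) can be imitated with Python's standard html.parser; the class and path below are illustrative stand-ins, not W4F's actual Java API:

```python
from html.parser import HTMLParser
import re

class PathExtractor(HTMLParser):
    """Collect the text of every node reached by a fixed DOM path
    (a hypothetical stand-in for a W4F extraction rule)."""
    def __init__(self, path):
        super().__init__()
        self.path = path        # e.g. ["html", "body", "ol", "li"]
        self.stack = []
        self.texts = []
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
    def handle_endtag(self, tag):
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass
    def handle_data(self, data):
        if self.stack == self.path and data.strip():
            self.texts.append(data.strip())

page = "<html><body><ol><li>Rating 7</li><li>Rating 2</li></ol></body></html>"
p = PathExtractor(["html", "body", "ol", "li"])
p.feed(page)
# A regular expression then addresses the finer piece inside each node.
ratings = [re.search(r"\d+", t).group() for t in p.texts]
print(ratings)
# → ['7', '2']
```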

30
Supervised IE systems
  • SRV [Freitag, 1998]
  • Rapier [Califf and Mooney, 1998]
  • WIEN [Kushmerick, 1997]
  • WHISK [Soderland, 1999]
  • NoDoSE [Adelberg, 1998]
  • Softmealy [Hsu and Dung, 1998]
  • Stalker [Muslea, 1999]
  • DEByE [Laender, 2002b]

31
SRV
  • Single-slot information extraction
  • Top-down (general to specific) relational
    learning algorithm
  • Positive examples
  • Negative examples
  • Learning algorithm works like FOIL
  • Token-oriented features
  • Logic rule

Rating extraction rule:
Length(= 1), Every(numeric true), Every(in_list true).
32
Rapier
  • Field-level (Single-slot) data extraction
  • Bottom-up (specific to general)
  • The extraction rules consist of 3 parts
  • Pre-filler
  • Slot-filler
  • Post-filler

Book Title extraction rule:
Pre-filler:  word = Book, word = <b>
Slot-filler: Length 2, Tag = nn, nns
Post-filler: word = </b>
33
WIEN
  • LR Wrapper
  • (Reviewer Name</b>, <b>, Rating</b>, <b>,
    Text</b>, </li>)
  • HLRT Wrapper (Head LR Tail)
  • OCLR Wrapper (Open-Close LR)
  • HOCLRT Wrapper
  • N-LR Wrapper (Nested LR)
  • N-HLRT Wrapper (Nested HLRT)
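A minimal sketch of how an LR wrapper executes: for each attribute, scan to its left delimiter, capture up to its right delimiter, and repeat per record (delimiters modeled on the slide's example; the function name is hypothetical):

```python
def lr_extract(page, rules):
    """Execute a WIEN-style LR wrapper.

    rules is a list of (left, right) delimiter pairs, one per
    attribute; records are extracted by repeating the sequence
    until a delimiter can no longer be found.
    """
    records, pos = [], 0
    while True:
        record = []
        for left, right in rules:
            start = page.find(left, pos)
            if start < 0:
                return records
            start += len(left)
            end = page.find(right, start)
            if end < 0:
                return records
            record.append(page[start:end].strip())
            pos = end + len(right)
        records.append(record)

page = ("<li><b>Name</b> Jeff <b>Rating</b> 2 "
        "<li><b>Name</b> Jane <b>Rating</b> 6 <li>")
rules = [("Name</b>", "<b>"), ("Rating</b>", "<li>")]
print(lr_extract(page, rules))
# → [['Jeff', '2'], ['Jane', '6']]
```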

34
WHISK
  • Top-down (general to specific) learning
  • Example
  • To generate 3-slot book reviews, it starts with
    the empty rule ( ) ( ) ( )
  • Each pair of parentheses indicates a phrase to be
    extracted
  • The phrase in the first pair of parentheses is
    bound to variable $1, the second to $2, etc.
  • The extraction logic is similar to the LR wrapper
    for WIEN.

Pattern: Reviewer Name</b> (Person) <b> (Digit)
<b>Text</b> (*) </li>
Output: BookReview {Name $1} {Rating $2} {Comment $3}
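The pattern above behaves much like a regular expression with one capture group per slot; a rough Python equivalent (an illustration, not WHISK's actual rule engine):

```python
import re

# (Person) and (Digit) become capture groups; the literals are the
# delimiters from the slide's example pattern.
pattern = re.compile(
    r"Reviewer Name</b>\s*(\w+)\s*<b>.*?(\d+)\s*<b>Text</b>(.*?)</li>",
    re.DOTALL)

page = ('<li><b>Reviewer Name</b> Jeff <b>Rating</b> 2 '
        '<b>Text</b> good read </li>')
m = pattern.search(page)
# $1, $2, $3 correspond to groups 1, 2, 3 of the match.
result = {"Name": m.group(1), "Rating": m.group(2),
          "Comment": m.group(3).strip()}
print(result)
# → {'Name': 'Jeff', 'Rating': '2', 'Comment': 'good read'}
```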
35
NoDoSE
  • Assumes the order of attributes within a record
    is fixed
  • The user interacts with the system to decompose
    the input.
  • For the running example
  • a book title (an attribute of type string) and
  • a list of Reviewer
  • RName (string), Rate (integer), and Text
    (string).

36
Softmealy
  • Finite transducer
  • Contextual rules

s<,R>L:  HTML(<b>) C1Alph(Rating) HTML(</b>)
s<,R>R:  Spc(-) Num(-)
s<R,>L:  Num(-)
s<R,>R:  NL(-) HTML(<b>)
37
Stalker
  • Embedded Category Tree
  • Multipass Softmealy

38
DEByE
  • Bottom-up extraction strategy
  • Comparison
  • DEByE the user marks only atomic (attribute)
    values to assemble nested tables
  • NoDoSE the user decomposes the whole document in
    a top-down fashion

39
Semi-supervised Approaches
  • IEPAD [Chang and Lui, 2001]
  • OLERA [Chang and Kuo, 2003]
  • Thresher [Hogue, 2005]

40
IEPAD
  • Encoding of the input page
  • Multiple-record pages
  • Pattern Mining by PAT Tree
  • Multiple string alignment
  • For the running example
  • <li><b>T</b>T<b>T</b>T<b>T</b>T</li>
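A toy sketch of IEPAD's two steps: encode the page into abstract tokens, then find a repeated token sequence. A naive repeated-substring search stands in for PAT-tree mining, and the function names are hypothetical:

```python
import re

def encode(page):
    """Encode a page as IEPAD does: each HTML tag becomes a token,
    every text run becomes the abstract token 'T'."""
    tokens = []
    for tag, text in re.findall(r"(<[^>]+>)|([^<]+)", page):
        if tag:
            tokens.append(tag.lower())
        elif text.strip():
            tokens.append("T")
    return tokens

def maximal_repeat(tokens):
    """Toy stand-in for PAT-tree mining: the longest contiguous
    token subsequence that occurs at least twice."""
    n = len(tokens)
    for length in range(n // 2, 0, -1):
        for i in range(n - 2 * length + 1):
            cand = tokens[i:i + length]
            rest = tokens[i + length:]
            if any(rest[j:j + length] == cand
                   for j in range(len(rest) - length + 1)):
                return cand
    return []

page = ("<li><b>Jeff</b>2<b>good</b>x</li>"
        "<li><b>Jane</b>6<b>fine</b>y</li>")
print(maximal_repeat(encode(page)))
# → ['<li>', '<b>', 'T', '</b>', 'T', '<b>', 'T', '</b>', 'T', '</li>']
```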

41
OLERA
  • Online extraction rule analysis
  • Enclosing
  • Drill-down / Roll-up
  • Attribute Assignment

42
Thresher
  • Works similarly to OLERA
  • Applies tree alignment instead of string alignment

43
Unsupervised Approaches
  • Roadrunner [Crescenzi, 2001]
  • DeLa [Wang, 2002; 2003]
  • EXALG [Arasu and Garcia-Molina, 2003]
  • DEPTA [Zhai et al., 2005]

44
Roadrunner
  • Input: multiple pages generated from the same
    template
  • Matches two input pages at a time

Sample page:
01 <html><body>
02 <b>
03 Book Name
04 </b>
05 Data mining
06 <b>
07 Reviews
08 </b>
09 <OL>
10 <LI>
11 <b>Reviewer Name</b>
12 Jeff
13 <b>Rating</b>
14 2
15 <b>Text</b>
16 ...
17 </LI>
18 <LI>
19 <b>Reviewer Name</b>
20 Jane
21 <b>Rating</b>
22 6
23 <b>Text</b>
24 ...
25 </LI>
26 </OL>
27 </body></html>

Wrapper (initially):
01 <html><body>
02 <b>
03 Book Name
04 </b>
05 Databases
06 <b>
07 Reviews
08 </b>
09 <OL>
10 <LI>
11 <b>Reviewer Name</b>
12 John
13 <b>Rating</b>
14 7
15 <b>Text</b>
16 ...
17 </LI>
18 </OL>
19 </body></html>

Parsing yields four string mismatches (Databases vs. Data
mining, John vs. Jeff, 7 vs. 2, and the review texts) and
one tag mismatch (the wrapper's </OL> vs. the sample's
second <LI>)
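The matching idea can be sketched on token streams: aligned text tokens that disagree become data fields, while tag mismatches signal the optionals and iterators that RoadRunner must discover (a simplified illustration, not the actual RoadRunner algorithm):

```python
def match_pages(wrapper_tokens, sample_tokens):
    """Walk two token streams in parallel.

    Where text tokens disagree (a string mismatch), emit the
    wildcard '#PCDATA' marking a data field.  Tag mismatches,
    which RoadRunner resolves by generalizing the wrapper with
    optionals and iterators, are out of scope for this toy.
    """
    out = []
    for w, s in zip(wrapper_tokens, sample_tokens):
        if w == s:
            out.append(w)                 # tokens agree: template
        elif w.startswith("<") or s.startswith("<"):
            raise ValueError("tag mismatch: needs generalization")
        else:
            out.append("#PCDATA")         # string mismatch: data field
    return out

wrapper = ["<b>", "Book Name", "</b>", "Databases", "<b>", "Reviews", "</b>"]
sample  = ["<b>", "Book Name", "</b>", "Data mining", "<b>", "Reviews", "</b>"]
print(match_pages(wrapper, sample))
# → ['<b>', 'Book Name', '</b>', '#PCDATA', '<b>', 'Reviews', '</b>']
```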
45
DeLa
  • Similar to IEPAD
  • Works for one input page
  • Handles nested data structures
  • Example
  • <P><A>T</A><A>T</A>T</P><P><A>T</A>T</P>
  • <P><A>T</A>T</P><P><A>T</A>T</P>
  • (<P>(<A>T</A>)*T</P>)*
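Applying such a nested pattern can be sketched with two levels of regular-expression matching, one per repetition level (an illustration only; DeLa derives the pattern itself automatically):

```python
import re

# Apply the DeLa-style nested pattern (<P>(<A>T</A>)*T</P>)* to a
# token string: outer matches are records, inner matches are the
# nested list inside each record.
page = "<P><A>a</A><A>b</A>c</P><P><A>d</A>e</P>"
records = []
for body in re.findall(r"<P>(.*?)</P>", page):
    inner = re.findall(r"<A>(.*?)</A>", body)      # the (<A>T</A>)* part
    tail = re.sub(r"<A>.*?</A>", "", body)         # the trailing T
    records.append((inner, tail))
print(records)
# → [(['a', 'b'], 'c'), (['d'], 'e')]
```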

46
EXALG
  • Input: multiple pages generated from the same
    template
  • Techniques
  • Differentiating token roles
  • Equivalence class (EC) form a template
  • Tokens with the same occurrence vector
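The occurrence-vector idea can be sketched directly: count each token per page and group tokens with identical vectors (a strong simplification of EXALG, which also differentiates token roles; the function name is hypothetical):

```python
from collections import defaultdict

def equivalence_classes(pages):
    """Group tokens by occurrence vector: the tuple of per-page
    occurrence counts.  Tokens sharing a vector form an equivalence
    class; large, frequent classes suggest template tokens."""
    vectors = defaultdict(list)
    vocab = {t for p in pages for t in p}
    for token in vocab:
        vec = tuple(p.count(token) for p in pages)
        vectors[vec].append(token)
    return {vec: sorted(toks) for vec, toks in vectors.items()}

p1 = ["<b>", "Name", "</b>", "Jeff", "<b>", "Rating", "</b>", "2"]
p2 = ["<b>", "Name", "</b>", "Jane", "<b>", "Rating", "</b>", "6"]
classes = equivalence_classes([p1, p2])
# Template tokens recur identically on every page:
print(classes[(2, 2)])   # → ['</b>', '<b>']
print(classes[(1, 1)])   # → ['Name', 'Rating']
```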

47
DEPTA
  • Identify data region
  • Allow mismatch between data records
  • Identify data record
  • Data records may not be contiguous
  • Identify data items
  • By partial tree alignment

48
Comparison
  • How do we differentiate template tokens from data
    tokens?
  • DeLa and DEPTA assume HTML tags are template
    tokens while all other tokens are data tokens
  • IEPAD and OLERA leave the problem to the user
  • How to apply the information from multiple pages?
  • DeLa and DEPTA conduct the mining from a single
    page
  • Roadrunner and EXALG do the analysis from
    multiple pages

49
Comparison (Cont.)
  • Techniques improvement
  • From string alignment (IEPAD, RoadRunner) to tree
    alignment (DEPTA, Thresher)
  • From full alignment (IEPAD) to partial alignment
    (DEPTA)

50
Task domain comparison
  • Page type
  • structured, semi-structured or free-text Web
    pages
  • Non-HTML support
  • Extraction level
  • Field level, record-level, page-level

51
Task domain comparison (Cont.)
  • Extraction target variation
  • Missing attributes, multiple-value attributes,
    multi-order attribute permutation
  • Template variation
  • Untokenized Attributes

52
(No Transcript)
53
Technique-based comparison
  • Scan pass
  • Single pass vs. multiple passes
  • Extraction rule type
  • Regular expression vs. logic rules
  • Features used
  • DOM tree information, POS tags, etc.
  • Learning algorithm
  • Machine learning vs pattern mining
  • Tokenization schemes

54
(No Transcript)
55
(No Transcript)
56
Conclusion
  • Criteria for evaluating IE systems from the task
    domain
  • Comparison of IE systems from various automation
    degree
  • The use of various techniques in IE systems

57
Future Work
  • Page Fetching
  • XWrap, W4F, WNDL
  • Schema Mapping
  • Full information
  • Partial information
  • Query Interface Integration

58
References
  • C.-H. Chang, M. Kayed, M. R. Girgis, K. Shaalan,
    A survey of Web Information Extraction Systems.