Title: Understanding Web Query Interfaces: Best-Effort Parsing with Hidden Syntax
1. Understanding Web Query Interfaces: Best-Effort Parsing with Hidden Syntax
- Zhen Zhang, Bin He and Kevin C. Chang
2. MetaQuerier Goals: Exploring and integrating the deep Web
FIND sources
QUERY sources
- Integrator
- source selection
- schema integration
- query mediation
- Explorer
- source discovery
- source modeling
- source indexing
Cars.com
Amazon.com
411localte.com
Apartments.com
The Deep Web: Databases on the Web
3. Problem: Source capability extraction, or query interface understanding
Book sources
Music sources
4. Form understanding: What are the essential tasks?
- Output all the conditions; for each condition:
- Group elements (into query conditions)
- Tag elements with their semantic roles: attribute, operator, value
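The output of form understanding can be pictured as a list of condition records. The sketch below is only an illustration: the `Condition` class, the `tag_group` helper, and the sample data are hypothetical names introduced here, not part of the MetaQuerier system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Condition:
    """One query condition recovered from a form (hypothetical container)."""
    attribute: str
    operators: List[str] = field(default_factory=list)
    values: List[str] = field(default_factory=list)


def tag_group(group):
    """Turn one group of (role, text) pairs into a Condition.

    `group` stands for the result of the grouping task; the role strings
    are the three semantic roles named on the slide.
    """
    cond = Condition(attribute="")
    for role, text in group:
        if role == "attribute":
            cond.attribute = text
        elif role == "operator":
            cond.operators.append(text)
        elif role == "value":
            cond.values.append(text)
        else:
            raise ValueError(f"unknown role: {role}")
    return cond


# Illustrative group for a title search box (made-up example data).
title_cond = tag_group([
    ("attribute", "title"),
    ("operator", "title words"),
    ("value", "string"),
])
```

Grouping decides which elements end up in the same `group`; tagging decides the role attached to each element.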
5. Demo summary
Query form
Understanding form structure
Multiple interpretations
6. Certainly not a trivial task
- Recall the butterfly ballot in the 2000 U.S. election.
Even just grouping can be hard!
7. Baseline approach? The problem seems to be rather heuristic in nature
- There seem to be no clear criteria, only fuzzy heuristics
- Grouping is hard: it is often n-ary
- Heuristic: Group two elements if they are close
- But
- Tagging is hard: no semantic labeling in HTML forms
- Heuristic: Tag the closest text as the attribute
- But
- We need many such heuristics!
- Goal: A principled mechanism to encode and use the various heuristics systematically?
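To make the baseline concrete, here is a minimal sketch of the "tag the closest text as the attribute" heuristic. The `Elem` type, the coordinates, and both sample forms are assumptions made up for illustration; the second form shows one way the heuristic can go wrong.

```python
import math
from typing import NamedTuple


class Elem(NamedTuple):
    name: str
    kind: str  # "text" or "input"
    x: float
    y: float


def closest_text(target, elems):
    """Return the text element nearest to `target` (Euclidean distance)."""
    texts = [e for e in elems if e.kind == "text"]
    return min(texts, key=lambda t: math.dist((t.x, t.y), (target.x, target.y)))


def tag_by_proximity(elems):
    """Map each input element to the name of its closest text label."""
    return {e.name: closest_text(e, elems).name
            for e in elems if e.kind == "input"}


# A simple two-row form: the heuristic works.
simple_form = [
    Elem("Author", "text", 0, 0), Elem("author_box", "input", 60, 0),
    Elem("Title", "text", 0, 20), Elem("title_box", "input", 60, 20),
]

# A range condition "Price [low] to [high]": the heuristic picks "to".
range_form = [
    Elem("Price", "text", 0, 0), Elem("low_box", "input", 60, 0),
    Elem("to", "text", 110, 0), Elem("high_box", "input", 130, 0),
]

simple_tags = tag_by_proximity(simple_form)
range_tags = tag_by_proximity(range_form)
```

In the range form, both boxes are tagged with "to" rather than "Price", which is the kind of failure that forces ever more special-case heuristics.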
8. Our observation: Concerted structures of query interfaces (QI)
- Condition patterns as building blocks
- Convergence: condition patterns
9. Our insight: Cope with form complexity through composition patterns.
- Lego-like building blocks
- Pattern of elements composed into conditions
- Pattern of conditions composed into a form
- So, how to realize our divide-and-conquer idea?
Any computation paradigm?
Source
Q-Form
Lego Building Blocks
?
Semantic Structure
10. Our Hypothesis: Existence of Hidden Syntax
- Query-form creation is guided by hidden syntax
Semantic Structure (Query Conditions)
Presentation (Query Interface)
Attr: title; Operator: title words, ...; Value: string
Parsing is thus a principled mechanism for the inverse: recovering the semantic structure from the presentation
11. This language paradigm enables a principled solution to a seemingly heuristic problem
- Essential notions: Grammar and Parser
- Grammar: Pattern specification
- Declarative
- No need to hard-code heuristics
- Collective
- Capture both micro and macro patterns
- Parser: Pattern recognition
- Global
- Coherently interpret an entire query form
- Systematic
- Systematically assembles the building blocks
12. However, the hidden-syntax hypothesis itself entails challenges in its realization
- Hidden syntax is only hypothetical
- We must derive a grammar in its place
- What should be captured in a derived grammar?
- 2P Grammar: Productions + Preferences
- Productions for patterns; preferences for their precedence
- Derived grammar is secondary to any input
- Inherently incomplete and ambiguous
- What should be the machinery of a soft parser?
- Best-effort parser
- Multiple, maximal-partial parse trees
13. Our Paradigm: Best-Effort Visual Language Parsing Framework
Input: HTML query form
2P Grammar: Productions, Preferences
BE-Parser: Ambiguity Resolution, Error Handling
Output: semantic structure
14. Grammar: Layout-based
Traditional grammar (sequential, 1-D)
Presentation: 3 + 5
Grammar: E → E + E, or E → sequential(E, +, E)
Our grammar (layout-based, 2-D)
Grammar: TextCond → left(TextAttr, TextVal) ∨ above(TextAttr, TextVal) ∧ above(TextVal, TextOp)
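The layout-based production for TextCond can be sketched with spatial predicates over bounding boxes. Everything concrete below is an assumption for illustration: the `Box` type, the pixel thresholds, the overlap tests, and the reading of the production with ∧ binding tighter than ∨.

```python
from typing import NamedTuple, Optional


class Box(NamedTuple):
    name: str
    x0: float  # left edge
    y0: float  # top edge
    x1: float  # right edge
    y1: float  # bottom edge


def left_of(a: Box, b: Box, gap: float = 50) -> bool:
    """a lies to the left of b, within `gap`, and they overlap vertically."""
    return 0 <= b.x0 - a.x1 <= gap and a.y0 < b.y1 and b.y0 < a.y1


def above(a: Box, b: Box, gap: float = 30) -> bool:
    """a lies above b, within `gap`, and they overlap horizontally."""
    return 0 <= b.y0 - a.y1 <= gap and a.x0 < b.x1 and b.x0 < a.x1


def is_text_cond(attr: Box, val: Box, op: Optional[Box] = None) -> bool:
    """left(attr, val), or above(attr, val) and above(val, op)."""
    if left_of(attr, val):
        return True
    return op is not None and above(attr, val) and above(val, op)


# Horizontal layout: label to the left of its text box.
h_attr = Box("Author", 0, 0, 50, 10)
h_val = Box("author_box", 60, 0, 160, 10)

# Vertical layout: label above the box, operator choices below it.
v_attr = Box("Title", 0, 0, 50, 10)
v_val = Box("title_box", 0, 15, 100, 25)
v_op = Box("title_ops", 0, 30, 100, 40)
```

Unlike a 1-D grammar, the constraints here refer to positions rather than to the order of tokens in a stream, so the same production covers elements wherever they appear in the HTML source.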
15. Parser: Logic-programming style
- Traditional parsing
- Scan input sequentially
- Our parsing
- Nonlinear input
- Arbitrary constraints
Parse trees
. . .
16. That's not all: complications of hypothetical syntax
- Hidden syntax is only hypothetical!
- Ambiguous grammar: the parser yields multiple parse trees
- Incomplete grammar: the parser yields partial parse trees
17. Ambiguity
TextCond → Below(Attr, Selection)
RButton → Left(radio, text)
- Grammar
- Preferences to capture the conventional precedence
- e.g., precedence between RButton and TextCond
- Parser
- Just-in-time pruning by preference
- Multiple trees possible
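Pruning by preference can be sketched as follows. This is a simplified batch version, assuming that two instances conflict when they share a token and that RButton wins over TextCond; the slide does not spell out the direction, and the `Instance` type and token names are hypothetical. A just-in-time parser would apply the same check as soon as instances are created rather than at the end.

```python
from typing import FrozenSet, List, NamedTuple, Set, Tuple


class Instance(NamedTuple):
    symbol: str              # grammar symbol, e.g. "RButton"
    tokens: FrozenSet[str]   # form tokens this instance covers


def prune(instances: List[Instance],
          preferences: Set[Tuple[str, str]]) -> List[Instance]:
    """Drop every instance that loses to a conflicting preferred instance.

    `preferences` holds (winner_symbol, loser_symbol) pairs.
    """
    losers = set()
    for a in instances:
        for b in instances:
            conflict = bool(a.tokens & b.tokens)
            if conflict and (a.symbol, b.symbol) in preferences:
                losers.add(b)
    return [i for i in instances if i not in losers]


rb = Instance("RButton", frozenset({"radio1", "text_new"}))
tc = Instance("TextCond", frozenset({"text_new", "textbox1"}))
other = Instance("TextCond", frozenset({"text_title", "textbox2"}))

# Assumed preference: RButton takes precedence over TextCond.
kept = prune([rb, tc, other], {("RButton", "TextCond")})
```

Here `tc` competes with `rb` for the token `text_new` and is pruned, while `other` shares no token with `rb` and survives.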
18. Incompleteness
- Grammar
- Cannot capture all patterns
- Parser
- Cannot interpret entire query interfaces
- Interpret as much as possible
- Greedily choose the maximal parse trees
- Reasoning: they look at the big picture and consider more context
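One way to read "maximal" is by token coverage: a partial parse tree is kept unless another tree covers a strict superset of its tokens. The sketch below assumes that reading; the tree names and token sets are made up for illustration.

```python
from typing import Dict, FrozenSet, List


def maximal_trees(trees: Dict[str, FrozenSet[str]]) -> List[str]:
    """Return the names of trees whose coverage is not strictly contained
    in the coverage of any other tree."""
    return [name for name, cov in trees.items()
            if not any(cov < other for other in trees.values())]


partial_trees = {
    "t1": frozenset({"a", "b", "c"}),
    "t2": frozenset({"a", "b"}),   # subsumed by t1
    "t3": frozenset({"d"}),        # covers a different part of the form
}

chosen = maximal_trees(partial_trees)
```

Keeping `t1` and `t3` interprets as much of the form as possible even though no single tree covers every token.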
19. Error Handling: The best-effort parser can output multiple and partial parse trees
- Union all the conditions interpreted by all the parse trees
- Report both conflict and missing-element errors
Parsing
Union
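The union-and-report step can be sketched as below. The definitions used here are assumptions: a conflict is a token claimed by more than one condition, and a missing element is a token covered by no condition. The condition tuples and token names are illustrative only.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Cond = Tuple[str, FrozenSet[str]]  # (attribute, tokens used)


def merge_trees(trees: List[List[Cond]], all_tokens: Set[str]):
    """Union conditions from all parse trees, then report errors."""
    conditions: List[Cond] = []
    for tree in trees:
        for cond in tree:
            if cond not in conditions:  # union without duplicates
                conditions.append(cond)

    # Record which conditions claim each token.
    owners: Dict[str, List[str]] = {}
    for attr, tokens in conditions:
        for tok in tokens:
            owners.setdefault(tok, []).append(attr)

    conflicts = {tok: attrs for tok, attrs in owners.items() if len(attrs) > 1}
    missing = set(all_tokens) - set(owners)
    return conditions, conflicts, missing


tree1 = [("author", frozenset({"lbl_author", "box1"}))]
tree2 = [("title", frozenset({"lbl_title", "box1"}))]
form_tokens = {"lbl_author", "lbl_title", "box1", "box2"}

conds, conflicts, missing = merge_trees([tree1, tree2], form_tokens)
```

In this example `box1` is reported as a conflict between "author" and "title", and `box2` is reported as missing.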
20. Experiment: How will a global grammar do?
- Global grammar
- Derived from the Basic dataset; captures 21 patterns
- 82 productions, 39 non-terminals, 16 terminals
- Datasets
- Basic: 3 domains (Airfare, Autos, Books), 150 sources
- NewSource: same domains, 30 sources
- NewDomain: 6 new domains (Music, ...), 42 sources
- Random: 30 sources (from invisible-web.net)
- Correctness judgment
- Number of correctly identified (grouping and
tagging) conditions
21. Conclusion: Syntactic Parsing for Interface Understanding
- Query interface understanding by syntactic parsing with hidden grammars
- Insight
- Exploit how semantics connects to presentation, in a syntactic way
- Future work
- Constructing the grammar automatically
- Developing a more sophisticated preference framework
- Extending the framework to other applications
22. Thank you!
- For more information
- Online demo at the MetaQuerier project Web site
- http://metaquerier.cs.uiuc.edu
- We invite you to our MetaQuerier demo in the afternoon