Title: A System for Understanding Imaged Infographics and Its Applications
1 A System for Understanding Imaged Infographics and Its Applications
Weihua Huang, Chew Lim Tan
School of Computing, National University of Singapore
2 Outline
- Introduction
- Syntactic and semantic information in scientific charts
- Chart recognition
- Chart interpretation
- Applications
- Experiment results
- Conclusion
3 Introduction
- Information graphics (infographics) are frequently used in various kinds of documents.
- Recognition and interpretation of infographics are important for automatic document processing and information retrieval.
- What are the elements/components in an infographic? (Recognition task)
- What does an infographic try to tell? (Interpretation task)
- This paper focuses on one type of infographic: scientific charts.
4 Introduction
- Imaged infographics are harder to recognize and interpret, because everything is in pixels!
6 Scientific Charts
7 Scientific Charts
- Semantic information
- Recognition and interpretation are the reverse process
9 Chart Recognition
- Preprocessing
- Text/graphics separation: connected component analysis
- Edge detection: Canny edge detector
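The text/graphics separation step can be illustrated with a minimal sketch. It labels 8-connected components in a binary image and splits them by bounding-box size. The function names, the `max_text_side` threshold, and the rule "small components are text" are assumptions made for illustration; the slides do not state the system's actual criteria.

```python
from collections import deque

def connected_components(grid):
    """Return the foreground components of a binary grid (8-connectivity)."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 0 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components

def bounding_box(pixels):
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    return min(ys), min(xs), max(ys), max(xs)

def separate_text_graphics(grid, max_text_side=3):
    """Assumed rule: components whose longer side is small are treated as text."""
    text, graphics = [], []
    for pixels in connected_components(grid):
        top, left, bottom, right = bounding_box(pixels)
        height, width = bottom - top + 1, right - left + 1
        target = text if max(height, width) <= max_text_side else graphics
        target.append((top, left, bottom, right))
    return text, graphics

# Toy image: an L-shaped pair of axes plus one small blob standing in for a character.
IMAGE = [
    [1, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
]
text_boxes, graphic_boxes = separate_text_graphics(IMAGE)
```

On a real chart image the threshold would have to be chosen relative to the image resolution and the expected character size.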
10 Chart Recognition
- Graphical symbol construction
- Vectorization
- Detection of coordinate lines
- Geometric constraint between candidate lines
- Coverage of other lines in the candidate plot area
- Attachment of text blocks
[Figure: vectorization of the edge map. DSCC yields straight segments; ellipse fitting yields circular and elliptic arcs.]
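A rough sketch of how coordinate lines might be detected from vectorized segments, using the first two cues listed above: a geometric constraint between candidate lines (here, near-perpendicular segments sharing an endpoint) and the coverage of other lines in the candidate plot area. The tolerances, the midpoint-based coverage score, and all function names are assumptions; the attachment of text blocks is not modeled.

```python
import math
from itertools import combinations

def angle_deg(seg):
    """Orientation of a segment in degrees, folded into [0, 180)."""
    (x1, y1), (x2, y2) = seg
    return math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180.0

def is_perpendicular(a, b, tol=5.0):
    return abs(abs(angle_deg(a) - angle_deg(b)) - 90.0) <= tol

def share_endpoint(a, b, tol=5.0):
    return any(math.dist(p, q) <= tol for p in a for q in b)

def coverage(a, b, others):
    """Fraction of other segments whose midpoints fall inside the bounding
    box spanned by the two candidate axes (an assumed scoring rule)."""
    if not others:
        return 0.0
    xs = [p[0] for p in a + b]
    ys = [p[1] for p in a + b]
    inside = 0
    for (x1, y1), (x2, y2) in others:
        mx, my = (x1 + x2) / 2, (y1 + y2) / 2
        if min(xs) <= mx <= max(xs) and min(ys) <= my <= max(ys):
            inside += 1
    return inside / len(others)

def tilt(seg):
    """Angular distance from horizontal, in degrees."""
    ang = angle_deg(seg)
    return min(ang, 180.0 - ang)

def find_axes(segments):
    """Return (x_axis, y_axis) for the best-covering candidate pair, or None."""
    best, best_score = None, -1.0
    for a, b in combinations(segments, 2):
        if not (is_perpendicular(a, b) and share_endpoint(a, b)):
            continue
        others = [s for s in segments if s is not a and s is not b]
        score = coverage(a, b, others)
        if score > best_score:
            best, best_score = (a, b), score
    if best is None:
        return None
    # The more horizontal segment of the pair is taken as the x-axis.
    return tuple(sorted(best, key=tilt))

# Segment coordinates in the style of recognized chart output (image pixels).
X_AXIS = ((239, 304), (994, 290))
Y_AXIS = ((239, 304), (236, 100))
BAR_EDGES = [
    ((281, 249), (281, 302)),
    ((345, 248), (346, 301)),
    ((281, 249), (345, 248)),
]
SEGMENTS = BAR_EDGES + [X_AXIS, Y_AXIS]
axes = find_axes(SEGMENTS)
```

Edges of a single bar also form perpendicular pairs with a shared corner, which is why the coverage score is needed to prefer the true axes.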
11 Chart Recognition
- Graphical symbol construction (cont.)
- Construction of data components
- Bottom-up process with the vectorized edges and intersections
- Model-based parsing rules using the domain knowledge
- Example:
BarChart = {x-axis, y-axis, BarSet}, where
BarSet = {Bar}, where number of elements 2, and
Bar = {l1, l2, l3 | l1 ⊥ l3, l2 ⊥ l3, l3 ∥ x-axis, CE(l1, l3), CE(l2, l3), EL(l1, x-axis), EL(l2, x-axis)}
Constraints:
a ∥ b: line a is parallel to line b.
a ⊥ b: line a is perpendicular to line b.
CE(a, b): shapes a and b share one common endpoint.
EL(a, b): one endpoint of shape a lies on shape b.
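The Bar rule and its four constraints can be sketched as predicates over line segments. This is an illustrative reading of the rule, not the system's parser: the angular and distance tolerances are assumptions needed because imaged lines are never exactly parallel or perpendicular, and the function names are hypothetical.

```python
import math

def angle_deg(seg):
    """Orientation of a segment in degrees, folded into [0, 180)."""
    (x1, y1), (x2, y2) = seg
    return math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180.0

def parallel(a, b, tol=5.0):
    diff = abs(angle_deg(a) - angle_deg(b))
    return min(diff, 180.0 - diff) <= tol

def perpendicular(a, b, tol=5.0):
    return abs(abs(angle_deg(a) - angle_deg(b)) - 90.0) <= tol

def point_to_segment(p, seg):
    """Euclidean distance from point p to the closest point on a segment."""
    (x1, y1), (x2, y2) = seg
    dx, dy = x2 - x1, y2 - y1
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.dist(p, (x1, y1))
    t = ((p[0] - x1) * dx + (p[1] - y1) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.dist(p, (x1 + t * dx, y1 + t * dy))

def ce(a, b, tol=3.0):
    """CE(a, b): exactly one endpoint pair of a and b coincides (within tol)."""
    return sum(math.dist(p, q) <= tol for p in a for q in b) == 1

def el(a, b, tol=3.0):
    """EL(a, b): at least one endpoint of a lies on b (within tol)."""
    return any(point_to_segment(p, b) <= tol for p in a)

def is_bar(l1, l2, l3, x_axis):
    """Check the Bar rule: two sides l1, l2 standing on the x-axis, joined by top l3."""
    return (perpendicular(l1, l3) and perpendicular(l2, l3)
            and parallel(l3, x_axis)
            and ce(l1, l3) and ce(l2, l3)
            and el(l1, x_axis) and el(l2, x_axis))

# Pixel coordinates in the style of recognized chart output.
X_AXIS = ((239, 304), (994, 290))
Y_AXIS = ((239, 304), (236, 100))
LEFT = ((281, 249), (281, 302))
RIGHT = ((345, 248), (346, 301))
TOP = ((281, 249), (345, 248))

accepted = is_bar(LEFT, RIGHT, TOP, X_AXIS)
rejected = is_bar(LEFT, RIGHT, Y_AXIS, X_AXIS)
```

A full parser would try candidate triples of segments bottom-up and collect every accepted bar into the BarSet.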
12 Chart Recognition
- Text grouping
- Yuan's method to group connected components
- Text recognition
- ScanSoft OmniPage Capture SDK 12.0
- Errors are manually corrected.
13 Chart Recognition
Green bars:
Bar1: (281,249), (345,248), (346,301), (281,302)
Bar2: (430,109), (494,108), (499,298), (435,299)
Bar3: (581,134), (645,132), (648,296), (585,298)
Red axes:
X: (239,304) to (994,290)
Y: (239,304) to (236,100)
Type: bar chart
15 Chart Interpretation
- Associating text with graphics
- Assign a syntactic role to each text block
- Label graphical symbols using the text blocks
- 11 roles of text in scientific charts are identified
- The problem is modeled as classification of text blocks
16 Chart Interpretation
- Associating text with graphics (cont.)
- To train the classifier and classify a new text block, 4 features are defined:
- Distance to the nearest graphical symbol
- Type of the nearest graphical symbol
- Relative position of the text block and the graphical symbol
- Type of the text string itself
- Centricity of a text block
- The C4.5 learning algorithm is used for building the decision tree.
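C4.5 chooses the feature to split on by gain ratio (information gain divided by split information). The sketch below computes that criterion for categorical features only; it does not cover C4.5's handling of continuous attributes or pruning. The feature names, role names, and the four training rows are hypothetical stand-ins, since the slides do not list the 11 roles or any training data.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())

def gain_ratio(rows, feature, target):
    """Gain ratio of splitting `rows` on a categorical `feature`."""
    labels = [row[target] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    total = len(rows)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy([row[feature] for row in rows])
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical training rows: two categorical features and a made-up role.
ROWS = [
    {"nearest_symbol": "axis", "string_type": "number", "role": "axis_label"},
    {"nearest_symbol": "axis", "string_type": "word", "role": "axis_label"},
    {"nearest_symbol": "bar", "string_type": "number", "role": "data_label"},
    {"nearest_symbol": "bar", "string_type": "word", "role": "data_label"},
]
FEATURES = ["nearest_symbol", "string_type"]
best_feature = max(FEATURES, key=lambda f: gain_ratio(ROWS, f, "role"))
```

In this toy set the nearest-symbol type separates the roles perfectly, so it would become the root of the tree.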
17 Chart Interpretation
- Obtaining the tabular data
- Assign a label to each data entry if its label is not directly presented.
D1: distance to the nearest label on the left.
D2: distance to the nearest label on the right.
If (D1 < D2) label = L1
Else if (D1 > D2) label = L2
Else label = L1 L2
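The labeling rule can be sketched as follows. The sketch assumes L1 and L2 are the nearest labels on the left and right, that positions are horizontal pixel coordinates, and that a tie returns both labels as a pair; how the system actually combines L1 and L2 in the tie case is not stated. The function name and sample labels are hypothetical.

```python
import math

def assign_label(entry_x, labels):
    """labels: list of (x_position, text). Returns a tuple of label texts."""
    if not labels:
        raise ValueError("at least one label is required")
    # Distances to labels on the left (D1) and on the right (D2) of the entry.
    left = [(entry_x - x, text) for x, text in labels if x <= entry_x]
    right = [(x - entry_x, text) for x, text in labels if x > entry_x]
    d1, l1 = min(left) if left else (math.inf, None)
    d2, l2 = min(right) if right else (math.inf, None)
    if d1 < d2:
        return (l1,)
    if d1 > d2:
        return (l2,)
    return (l1, l2)  # tie: keep both labels (assumed representation)

# Hypothetical x-axis labels and data entry positions.
LABELS = [(100, "1983"), (200, "1984")]
near_left = assign_label(120, LABELS)
near_right = assign_label(190, LABELS)
midway = assign_label(150, LABELS)
```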
18 Chart Interpretation
- Obtaining the tabular data (cont.)
- Calculate the value for each data entry if its value is not directly presented.
H1: data height
H2: unit height
Value per unit height: 30
Data value = H1 × 30 / H2
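The value calculation is a simple proportion, shown here as a small helper. The function name and the pixel heights in the example are hypothetical; only the value of 30 per unit height comes from the slide.

```python
def data_value(data_height, unit_height, value_per_unit):
    """Scale a measured height (pixels) to a data value.

    data_height    -- H1, height of the data component
    unit_height    -- H2, height of one axis unit
    value_per_unit -- value represented by one unit height
    """
    if unit_height <= 0:
        raise ValueError("unit height must be positive")
    return data_height * value_per_unit / unit_height

# Hypothetical measurements: a bar 120 px tall, with 40 px per axis unit.
value = data_value(120, 40, 30)
```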
19 Chart Interpretation
- Generating chart description
- XML format description
- Keeping data in the tabular form
- Good for querying on data value or label
- Natural language description
- Fact-based sentences generated from templates
- Good for factoid questions
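Both description forms can be sketched from the same tabular data. The XML element and attribute names, the sentence template, and the sample values below are all hypothetical; the slides do not show the system's actual schema or templates.

```python
import xml.etree.ElementTree as ET

def chart_to_xml(chart_type, x_title, y_title, entries):
    """Serialize tabular chart data to XML (element names are assumed)."""
    chart = ET.Element("chart", {"type": chart_type})
    ET.SubElement(chart, "xaxis", {"title": x_title})
    ET.SubElement(chart, "yaxis", {"title": y_title})
    data = ET.SubElement(chart, "data")
    for label, value in entries:
        ET.SubElement(data, "entry", {"label": str(label), "value": str(value)})
    return ET.tostring(chart, encoding="unicode")

def chart_to_sentences(x_title, y_title, entries):
    """Generate one fact-based sentence per data entry from a fixed template."""
    template = "In {x} {label}, the {y} was {value}."
    return [template.format(x=x_title, label=label, y=y_title, value=value)
            for label, value in entries]

# Made-up values, used only to exercise the two functions.
ENTRIES = [("1983", 20), ("1984", 25)]
xml_text = chart_to_xml("bar", "year", "number of fatalities", ENTRIES)
sentences = chart_to_sentences("year", "number of fatalities", ENTRIES)
```

The XML form keeps labels and values addressable for queries, while the sentences can be appended to document text for factoid question answering.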
21 Applications
- Enriching OCR output
- Traditional OCR output: text (electronic) and figures (image format)
- The information in figures is not extracted
- The proposed system helps to extract more information
- The tabular data obtained can be used to reproduce the document in machine-readable form.
22 Applications
- Enriching OCR output (cont.)
- Approach
- Question: where to insert the infographics?
- Clue: look for the figure number in the text.
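The figure-number clue can be sketched with a regular expression over the OCR'd paragraphs. The accepted reference forms ("Figure N", "Fig. N"), the choice to insert right after the first referencing paragraph, and the fallback of appending at the end are assumptions for illustration.

```python
import re

def find_figure_reference(paragraphs, figure_number):
    """Index of the first paragraph that mentions the figure, or None."""
    number = re.escape(str(figure_number))
    pattern = re.compile(rf"\b(?:Figure|Fig\.?)\s*{number}\b", re.IGNORECASE)
    for index, paragraph in enumerate(paragraphs):
        if pattern.search(paragraph):
            return index
    return None

def insert_description(paragraphs, figure_number, description):
    """Return a new paragraph list with the description placed after the
    first paragraph that references the figure (appended if none does)."""
    index = find_figure_reference(paragraphs, figure_number)
    position = len(paragraphs) if index is None else index + 1
    return paragraphs[:position] + [description] + paragraphs[position:]

# Hypothetical OCR output.
PARAGRAPHS = [
    "Road safety has improved.",
    "Figure 21 shows vehicle counts.",
    "Fatalities per year are plotted in Fig. 2.",
]
enriched = insert_description(PARAGRAPHS, 21, "[description of figure 21]")
```

The trailing word boundary keeps a search for figure 2 from matching "Figure 21".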
23 Applications
- Assisting QA systems
- Question type 1: factoid question
- Example: How many fatalities were there in the year 1984?
- Solution: Add the NL description of the infographics into the original text
- Question parsing and answer extraction: Cui et al.'s method based on soft pattern matching
24 Applications
- Assisting QA systems (cont.)
- Question type 2: query-like question
- Example: What is the maximum number of fatalities among all years?
- Solution: Translate the question into one of the pre-defined queries
- Question translation: Semantic parser proposed by Mooney et al.
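Once a question has been translated, answering it reduces to running a pre-defined query over the tabular data. The sketch below covers only that last step and assumes the translation has already produced a query name; the set of queries, their names, and the sample values are hypothetical, and the semantic parser itself is not modeled.

```python
# Pre-defined queries over a {label: value} table (names are assumed).
QUERIES = {
    "max": lambda table: max(table.items(), key=lambda item: item[1]),
    "min": lambda table: min(table.items(), key=lambda item: item[1]),
    "total": lambda table: sum(table.values()),
}

def run_query(query_name, table):
    """Execute one of the pre-defined queries on the chart's tabular data."""
    if query_name not in QUERIES:
        raise KeyError(f"unknown query: {query_name}")
    return QUERIES[query_name](table)

# Made-up table of fatalities per year.
TABLE = {"1983": 20, "1984": 25, "1985": 18}
# "What is the maximum number of fatalities among all years?" -> "max"
maximum = run_query("max", TABLE)
```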
26 Experiment Results
- Chart recognition and classification using 200 scientific chart images collected
27 Experiment Results
- Text block classification using 200 scientific chart images collected
28 Experiment Results
- Question answering using 10 scanned document pages from the UW database I
30 Conclusion
- A system for recognizing and interpreting imaged infographics is introduced.
- Current focus is on scientific charts, a commonly used type of infographics.
- The system can be generalized to handle a greater variety of infographics.
- The system can be enhanced to handle more complex layouts, special effects, etc.
31 Thank you!