ITEC810 Final Report: Inferring Document Structure

1
ITEC810 Final Report: Inferring Document Structure
  • Wieyen Lin/41348133
  • Supervised by Jette Viethen

2
Outline
  • Part A: Introduction, Related work
  • Part B: Material, Methodology
  • Part C: Implementation, Conclusion

3
Part A Introduction
4
Introduction
5
Introduction (cont'd)
  • Research Objective: analyze a document image and detect its logical
    structure, annotated with labels
  • Project Scope: focus on academic articles
  • Source Corpus: Association for Computational Linguistics (ACL)
    Anthology Corpus

6
Related Work
  • Physical Layout Analysis: top-down methods, bottom-up methods
  • Logical Structure Analysis: syntactic methods, rule-based methods

7
Part B Methodology
8
Material: XML Source by Text
An example input file for the project
9
Methodology
10
Methodology (cont'd)
11
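Methodology (cont'd)
Algorithm for aggregating blocks in Phase II
  • Read in 3 lines at a time and check the dominant font size of each line
  • Notation: A, B, C are lines of text with different dominant font
    sizes; A1, A2, A3 are lines of text with the same dominant font size;
    s1 is the spacing between A1 and A2; s2 is the spacing between A2 and A3
  • A B C: A, B, and C each form a separate block
  • A1 B A2: A1, B, and A2 each form a separate block
  • A B B: A is a separate block; B B belong to the same block
  • A A B: A A belong to the same block; B is a separate block
  • A1 A2 A3: check spacing. If s1 = s2, A1 A2 A3 belong to the same
    block; if s1 > s2, A1 is a separate block and A2 A3 belong to the same
    block; if s1 < s2, A1 A2 belong to the same block and A3 is a separate
    block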
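
The aggregation rules on this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the `Line` class, the helper names `dominant_font_size` and `group_three`, the tolerance `tol`, the downward-growing y-coordinates, and the reading of "dominant" as the font size that covers the most characters are all hypothetical. The slide does not say how successive 3-line windows are combined, so the sketch handles a single window only.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Line:
    text: str
    font_size: float  # dominant font size of the line (assumed precomputed)
    top: float        # y-coordinate of the top edge (assumed to grow downward)
    bottom: float     # y-coordinate of the bottom edge


def dominant_font_size(fragments: List[Tuple[str, float]]) -> float:
    """Return the font size covering the most characters (assumed definition)."""
    counts: Counter = Counter()
    for text, size in fragments:
        counts[size] += len(text)
    return counts.most_common(1)[0][0]


def group_three(a: Line, b: Line, c: Line, tol: float = 0.5) -> List[List[Line]]:
    """Split three consecutive lines into blocks by font size, then spacing."""
    same_ab = abs(a.font_size - b.font_size) <= tol
    same_bc = abs(b.font_size - c.font_size) <= tol

    if same_ab and same_bc:
        # Pattern A1 A2 A3: fall back to the vertical gaps s1 and s2.
        s1 = b.top - a.bottom
        s2 = c.top - b.bottom
        if abs(s1 - s2) <= tol:
            return [[a, b, c]]      # s1 = s2: one block
        if s1 > s2:
            return [[a], [b, c]]    # s1 > s2: A1 | A2 A3
        return [[a, b], [c]]        # s1 < s2: A1 A2 | A3
    if same_ab:
        return [[a, b], [c]]        # pattern A A B
    if same_bc:
        return [[a], [b, c]]        # pattern A B B
    # Patterns A B C and A1 B A2: the middle line separates its neighbours.
    return [[a], [b], [c]]
```

For example, an 18-point title followed by two 12-point lines falls under the A B B pattern, so the title ends up in its own block and the two smaller lines are grouped together.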
12
Part C Outcomes
13
Current Outcome
Original PDF document
14
Current Outcome (cont'd)
Logical structure outcome in HTML
15
Implementation: Class Diagram
16
Implementation: User Interfaces
17
Conclusion: Information Evaluation
Error type                                                        | Errors found | Detection accuracy
Incorrect or missing title                                        | 1            | 97.5% (39/40)
Incorrect or missing Abstract heading                             | 4            | 90.0% (36/40)
Incorrect or missing Abstract                                     | 4            | 90.0% (36/40)
Incorrect or missing affiliation(s)                               | 11           | 72.5% (29/40)
More than 50% of page numbers missing, or erroneous page numbers  | 15           | 62.5% (25/40)
More than 50% of section headings missing, or erroneous headings  | 11           | 72.5% (29/40)
Summary of detection results for 40 randomly selected documents
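
The accuracy figures in the table are consistent with counting each error once per document and dividing the error-free documents by the 40 sampled. A small Python sketch of that arithmetic is given here; the function name `detection_accuracy` and the dictionary keys are hypothetical labels, and the one-error-per-document reading is an assumption inferred from the numbers, not stated on the slide.

```python
def detection_accuracy(errors: int, total: int = 40) -> float:
    """Percentage of sampled documents without the given error (assumed reading)."""
    if total <= 0 or not 0 <= errors <= total:
        raise ValueError("errors must lie between 0 and total, and total must be positive")
    return 100.0 * (total - errors) / total


# Error counts as reported on the slide; the keys are shorthand labels.
error_counts = {
    "title": 1,
    "abstract heading": 4,
    "abstract": 4,
    "affiliation": 11,
    "page numbers": 15,
    "section headings": 11,
}

for name, errors in error_counts.items():
    print(f"{name}: {detection_accuracy(errors):.1f}% ({40 - errors}/40)")
```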
18
Conclusion: Future Work
  • Improving algorithms: aggregation of homogeneous blocks; detection of
    abstract headings, section headings, and paragraphs
  • Removing noise: incomplete table contents; incomplete mathematical
    formulas