Document Image Content Inventories - PowerPoint PPT Presentation

1
Document Image Content Inventories
  • Henry S. Baird
  • Michael A. Moll
  • Chang An
  • Matthew R. Casey

DRR XIV, February 1, 2007
2
Document Image Content
  • Given an image of a document
  • Find regions containing handwriting,
    machine-print text, graphics, line-art, logos,
    photographs, noise, etc.
  • Challenges: select images to cover a vast variety
  • Bitonal, Greyscale and Color
  • English, Arabic, Chinese
  • Simple to Complex Layouts
  • Historical and Modern Documents
  • Low Quality Distorted Images to Higher Quality
    Clear Images
  • Machine Print, Handwriting, Photographs and Blank
    Content
  • Combinations of all of these types

3
Test Document Examples
4
More Document Examples
5
More Document Examples
6
Classification Algorithms
  • Consider brute-force 5-Nearest Neighbors as gold
    standard
  • Hashed k-D Trees used to approximate 5-NN
  • Very fast, large speedup
  • Small loss in accuracy
  • Classifiers are discussed in detail in previously
    published work (DRR XIII, 2006)
  • A sample is a single pixel in a document image
  • Not classifying entire regions
  • Classifiers run to completion in CPU hours
  • Just fast enough to allow us to push accuracy,
    etc
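
As a rough illustration of the brute-force 5-NN baseline named on this slide, the sketch below scans every training sample and takes a majority vote among the five closest. The two-feature toy samples, the squared Euclidean distance, and the function name `knn_classify` are assumptions made for the example; the hashed k-D tree approximation is not shown.

```python
from collections import Counter

def knn_classify(train_feats, train_labels, query, k=5):
    """Brute-force k-NN: scan all training samples, vote among the k closest."""
    # Squared Euclidean distance from the query to every training sample
    # (the distance measure is an assumption; the slides do not name one).
    dists = [
        (sum((a - b) ** 2 for a, b in zip(feat, query)), label)
        for feat, label in zip(train_feats, train_labels)
    ]
    dists.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy data: each sample is one pixel described by two hypothetical features.
train_feats = [
    (0.90, 0.10), (0.95, 0.05), (0.85, 0.12),                # blank (BL)
    (0.20, 0.80), (0.25, 0.70), (0.10, 0.90), (0.15, 0.85),  # machine print (MP)
]
train_labels = ["BL", "BL", "BL", "MP", "MP", "MP", "MP"]

label_dark = knn_classify(train_feats, train_labels, (0.20, 0.75))
label_light = knn_classify(train_feats, train_labels, (0.90, 0.10))
```

Every query touches every training sample, which is why an approximate index or a smaller training set matters when each sample is a single pixel.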

7
Classification Example
8
Classification Example
9
Choosing the Feature Set
  • Each test point is a single pixel in image
  • We do not classify regions
  • Therefore we do not restrict region shape
  • We allow complex non-rectangular shapes
  • Represented by scalar features
  • Extracted from Luminosity channel of HSL image
  • Extracted over small window surrounding the pixel
  • Windows from ±11 pixels to ±20 pixels
  • Trial and error exploration of over 60 features
  • We now use a set of 26 features for testing

10
Features Used
  • Region Luminosity
  • 1x1 region
  • Line Luminosity Average
  • Horizontal and Vertical
  • Line length of 25 pixels
  • Line Average Difference
  • Line length of 25 pixels
  • Line Luminosity Average Difference
  • Diagonals only
  • Line length of 25 pixels
  • Line Luminosity Max Difference
  • Four directions
  • Line length of 41 pixels
  • Revised Distance to Max-Difference Pair
  • Eight directions
  • Line length of 41 pixels
  • Revised Distance to Max-Difference Pixel
  • Eight directions
  • Line length of 41 pixels
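
A minimal sketch of one feature from this list, the line luminosity average, may help make the idea concrete. It assumes the luminosity channel is available as a list of rows of floats and that pixels falling outside the image are simply skipped; the function name `line_average` and the border handling are illustrative choices, not details taken from the slides.

```python
def line_average(lum, row, col, d_row, d_col, length=25):
    """Mean luminosity along a line of `length` pixels centred on (row, col)."""
    half = length // 2
    height, width = len(lum), len(lum[0])
    values = []
    for step in range(-half, half + 1):
        r, c = row + step * d_row, col + step * d_col
        # Assumption: pixels that fall outside the image are skipped.
        if 0 <= r < height and 0 <= c < width:
            values.append(lum[r][c])
    return sum(values) / len(values)

# Toy 5x5 luminosity image with a dark vertical stroke in column 2.
lum = [[1.0, 1.0, 0.0, 1.0, 1.0] for _ in range(5)]

horizontal = line_average(lum, 2, 2, 0, 1, length=5)  # crosses the stroke
vertical = line_average(lum, 2, 2, 1, 0, length=5)    # runs along the stroke
```

The direction is passed as a (d_row, d_col) step, so the same helper covers horizontal (0, 1), vertical (1, 0), and diagonal (1, 1) or (1, -1) lines.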

11
Experimental Design
  • Training set of 31 images and testing set of 86
    images
  • Each image in the test set has at least one similar
    image in the training set (from the same source)
  • We are not testing strong generalizing ability of
    classifier to different images
  • Testing set consists of 178,793,163 samples

12
Speed-Ups by Decimation
  • Trials and intuition show that large speed-ups in
    classification (regardless of method) can be
    obtained by randomly throwing away training data
  • Since we are classifying pixels, we expect a high
    redundancy in training data
  • Partially due to isogeny: the tendency for data
    in the same image to have been generated by the
    same source and process
  • Following slide shows example
  • Classifier trained on one image and tested on
    second image from same source

13
Speed-Ups by Decimation
Decimation factor   1      10     100    500     1000
Speed-up            0      7.9    57.9   212.5   354.2
Accuracy (%)        80.4   72.9   76.2   70.0    66.6
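
Decimation as described here amounts to keeping a random subset of the training samples. The sketch below assumes uniform sampling without replacement and a fixed seed for repeatability; the name `decimate` and the stand-in training list are hypothetical.

```python
import random

def decimate(samples, factor, seed=0):
    """Keep roughly 1/factor of the training samples, chosen at random."""
    rng = random.Random(seed)
    keep = max(1, len(samples) // factor)
    # Uniform sampling without replacement (an assumption for this sketch).
    return rng.sample(samples, keep)

training = list(range(10000))      # stand-in for per-pixel training samples
reduced = decimate(training, 100)  # decimation factor of 100
```

With a brute-force nearest-neighbour search, the work per test pixel grows with the number of training samples kept, so the factor trades classification time against the accuracy loss shown in the table.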
14
Example Analysis
  • Consider this image and its output
  • Per-pixel accuracy: 62%

15
Per-pixel Classification versus Inventory
  • Per-pixel Classification Confusion Matrix
  • Per-page Inventory Fraction of Content

         BL       HW        MP        PH        Type1
BL       0.0661   0.0194    0.00217   0.00183   0.0234
HW       0        0         0         0         0
MP       0.0863   0.0603    0.417     0.0464    0.193
PH       0.0119   0.00848   0.0734    0.207     0.0938
Type2    0.0982   0.0882    0.0756    0.04823   0.3706

Content   True (%)   Classifier (%)   Accuracy (%)
BL        6.817      24.18            20.96
HW        0          11.3             0
MP        46.75      42.14            75.85
PH        23.06      22.38            70.91
16
Per-Pixel Accuracy
  • Fraction of all pixels in a document that are
    correctly classified
  • Class label matches class specified by ground
    truth
  • Objective and Quantitative
  • However, somewhat arbitrary because of the methods
    of ground-truthing and their inconsistencies
  • Per-pixel accuracy score is prone to be worse than
    the image may subjectively appear
  • For our test set, across all images, average
    per-pixel accuracy score was 62.4%
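
The definition above can be written down directly. This sketch assumes the classifier output and the ground truth are both available as same-sized grids of class labels; the function name and the tiny example grids are made up for illustration.

```python
def per_pixel_accuracy(predicted, truth):
    """Fraction of pixels whose predicted class equals the ground-truth class."""
    total = 0
    correct = 0
    for pred_row, true_row in zip(predicted, truth):
        for p, t in zip(pred_row, true_row):
            total += 1
            if p == t:
                correct += 1
    return correct / total

# Two toy 2x3 label maps using the class codes from the slides.
truth = [["BL", "MP", "MP"], ["PH", "PH", "HW"]]
predicted = [["BL", "MP", "HW"], ["PH", "BL", "BL"]]

accuracy = per_pixel_accuracy(predicted, truth)  # 3 of 6 pixels agree
```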

17
Confusion Matrix
          BL       HW        MP       PH        Type 1
BL        0.159    0.0279    0.0318   0.00539   0.0651
HW        0.0283   0.0231    0.0135   0.00128   0.0431
MP        0.0456   0.0291    0.353    0.0390    0.114
PH        0.0228   0.00739   0.0465   0.167     0.0767
Type 2    0.0967   0.0644    0.0918   0.0457    0.386
18
Per-Page Inventory
  • For each content class we measure the fraction of
    each page that is classified as that class
  • Allows a user to query a database of page images
    in a variety of useful and natural ways
  • For example, answer a query like
  • Find all pages containing ≥ 70% Photograph
  • and ≥ 10% Machine Print
  • This is an information retrieval problem for
    which precision and recall are natural evaluation
    metrics
  • Most images in the test set are of mixed content
    and do not contain a majority of any one class
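
To make the inventory idea concrete, the sketch below computes per-class fractions for a page, answers a minimum-fraction query, and scores the retrieved set with precision and recall. The page names, the toy label maps, and the convention of returning 1.0 when a set is empty are all assumptions for the example.

```python
from collections import Counter

CLASSES = ("BL", "HW", "MP", "PH")

def inventory(label_map):
    """Fraction of a page's pixels assigned to each content class."""
    counts = Counter(label for row in label_map for label in row)
    total = sum(counts.values())
    return {cls: counts[cls] / total for cls in CLASSES}

def query(inventories, minimums):
    """Names of pages whose inventory meets every minimum fraction."""
    return [
        name for name, inv in inventories.items()
        if all(inv[cls] >= frac for cls, frac in minimums.items())
    ]

def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against the relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Assumption: an empty set scores 1.0 rather than being undefined.
    precision = hits / len(retrieved) if retrieved else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall

# Hypothetical classifier output for two one-row "pages" of ten pixels each.
pages = {
    "page_a": [["PH"] * 8 + ["MP"] * 2],
    "page_b": [["PH"] * 5 + ["BL"] * 5],
}
inventories = {name: inventory(m) for name, m in pages.items()}
matches = query(inventories, {"PH": 0.70, "MP": 0.10})

# Suppose ground truth says page_a and a third page, page_c, are relevant.
precision, recall = precision_recall(matches, ["page_a", "page_c"])
```

The relevant set would come from applying the same query to inventories computed from the ground-truth labels rather than from the classifier output.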

19
Per-Page Inventory
  • We tried queries on our complete test set, e.g.
  • Find all images that contain
  • at least a 20% fraction of MP pixels

20
Precision and Recall Curves
21
Machine Print PR Curves
22
Per-Page Inventory
  • If we assume all thresholds are equally likely,
    we can estimate expected precision and recall

      Recall (%)   Precision (%)
BL    96.7         55.6
HW    45.1         80.1
MP    80.9         77.2
PH    76.0         78.8
Vs. 62.4% per-pixel accuracy
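
One way to read "all thresholds are equally likely" is to average precision and recall over an evenly weighted set of thresholds. The sketch below does that for a single content class. It assumes per-page true and estimated fractions are given as parallel lists and skips thresholds where either metric would be undefined; both choices are assumptions, not details from the slides.

```python
def expected_precision_recall(true_fracs, est_fracs, thresholds):
    """Average precision and recall over equally weighted thresholds."""
    precisions, recalls = [], []
    for t in thresholds:
        relevant = {i for i, f in enumerate(true_fracs) if f >= t}
        retrieved = {i for i, f in enumerate(est_fracs) if f >= t}
        if not relevant or not retrieved:
            continue  # assumption: skip thresholds where a metric is undefined
        hits = len(relevant & retrieved)
        precisions.append(hits / len(retrieved))
        recalls.append(hits / len(relevant))
    if not precisions:
        raise ValueError("no threshold produced a defined precision and recall")
    n = len(precisions)
    return sum(precisions) / n, sum(recalls) / n

# Hypothetical fractions of one class on three pages.
true_fracs = [0.1, 0.5, 0.9]  # from ground truth
est_fracs = [0.3, 0.4, 0.8]   # from the classifier's inventory
exp_precision, exp_recall = expected_precision_recall(
    true_fracs, est_fracs, thresholds=[0.2, 0.45, 0.7]
)
```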
23
Discussion of Results
24
Discussion of Results
25
Discussion of Results
26
Discussion of Results
27
Discussion of Results
28
Discussion of Results
29
Discussion of Results
30
Conclusion
  • Modest per-pixel classification accuracies (e.g.
    60-70%) support useful recall and precision
    rates (e.g. 80-90%) for retrieval queries of
    document collections seeking pages containing a
    given minimum fraction of a certain content type
  • On a per-page basis, inventories tend to be more
    accurate than per-pixel classification

31
Future Work
  • Analysis of relationship between per-pixel
    accuracy scores and per-page inventory queries
  • Under what assumptions can we relate the two?
  • Which is more useful/descriptive?
  • Building classifiers on classified images
    (iterated classification)
  • Content class masks
  • Automated feature selection
  • Massive tests with GRID computing

32
Iterated Classification
33
Content Class Masks
HW MP PH
34
Thank You!
  • Henry S. Baird
  • Michael A. Moll
  • Chang An
  • Matthew R. Casey

35
Raw Pixel Counts
          BL         HW         MP         PH         Total 1
BL        28385810   4992050    5678089    963871     40019820
HW        5054560    4128702    2422925    228225     11834412
MP        8150037    5196000    63137187   6968964    83452188
PH        4080809    1321661    8310500    29773773   43486743
Total 2   45671216   15638413   79548701   37934833   178793163
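
The raw counts can be turned into fractions with a few lines of arithmetic. The sketch below assumes rows are ground-truth classes and columns are classifier outputs, a reading that is consistent with the HW-to-BL figure quoted on the next slide; the variable names are illustrative.

```python
CLASSES = ["BL", "HW", "MP", "PH"]

# Assumption: rows are ground-truth classes, columns are classifier outputs.
counts = [
    [28385810, 4992050, 5678089, 963871],    # true BL
    [5054560, 4128702, 2422925, 228225],     # true HW
    [8150037, 5196000, 63137187, 6968964],   # true MP
    [4080809, 1321661, 8310500, 29773773],   # true PH
]

total = sum(sum(row) for row in counts)

# Each cell as a fraction of all test pixels.
fractions = [[n / total for n in row] for row in counts]

# Pooled accuracy over all pixels: the diagonal divided by the total.
overall_accuracy = sum(counts[i][i] for i in range(len(CLASSES))) / total

# Share of each true class sent to each output class (row-normalised).
row_rates = [[n / sum(row) for n in row] for row in counts]
hw_as_bl = row_rates[CLASSES.index("HW")][CLASSES.index("BL")]
```

Pooling all pixels this way gives about 0.70, which weights large images more heavily than an average of per-image accuracy scores does, so the two figures need not agree.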
36
Analyzing the Confusion Matrix
  • Classifier is best at classifying PH and MP
  • Some trouble with BL and lots of trouble with HW
  • 43% of HW misclassified as BL
  • However, the amount of content classified as each
    class is similar to the amount labeled in ground truth
  • Suggests problems with zoning, not necessarily
    with classification

37
Photograph PR Curves
38
Discussion of Results
  • Locates HW in detail, not just as rectangular
    blocks like zoning
  • Some difficulty in separating handwriting from
    background (lines on legal pad) that were
    included in zoning

39
Discussion of Results
  • Good segmentation of machine print, some
    confusions with handwriting
  • In the photograph of a football player, identifies
    jersey letters as MP

40
Discussion of Results
  • Discriminates text from photos very well
  • Does very good job of discovering actual text
    layout
  • Trouble distinguishing HW from MP

41
Discussion of Results
Both show remarkable segmentation of MP from PH,
regardless of background, etc
42
Discussion of Results
  • Complex magazine layouts
  • Non-rectilinear text layouts
  • Text overlapping photographs, etc.

43
Discussion of Results
  • On left, a particularly complex layout that
    reveals classifier making many correct small
    distinctions between photograph and machine print
  • On right, interesting case where blank background
    indistinguishably blends into photograph at
    bottom

44
Discussion of Results
  • Image on left shows methodological problem of
    zoning large, dim areas of photographs that are
    statistically indistinguishable from blank
  • Image on right shows excellent subjective
    segmentation for MP and PH but confusion with HW
    and BL