Title: From Anthrax to ZIP Codes: The Handwriting is on the Wall
1From Anthrax to ZIP Codes: The Handwriting is on the Wall
- Venu Govindaraju
- Dept. of Computer Science and Engineering
- University at Buffalo
- Venu_at_cedar.buffalo.edu
2Outline
- Success in Postal Application
- Role of Handwritten Word Recognition
- Word Recognition
- Lexicon Driven Word Recognition
- Lexicon Free Word Recognition
- New Models
- Interactive Cognitive Models
- New Research Areas
- Lexicon density
- Lexicon Reduction and Combination
- Other Applications
3USPS HWAI Background
- Postal Sponsorship Started 1984
- 370 Academic Articles Published
- Millions of Letters Examined
- Many Experimental Systems Built and Tested
- Migrated from Hardware to Software System
- Only Postal Research Continuously Funded
4Pattern Recognition Tasks
- Items to be Recognized, Read, and Evaluated (Machine printed and Script)
- Delivery address, sender's address, endorsements
- Linear Codes, Mail Class
- Indicia (2D-Codes, Meter Marks)
5Deployed..
- USA
- 250 PDC sites
- 27 Remote Encoding Centers
- 25 Billion Images Processed Annually
- 89% Automated Bar-coding
- UK
- 67 Processing Centers
- 27 Million Pieces Per Day
- 9.7 Million Pieces Per Hour Peak
- Australia
6Scope - Others
- Royal Mail
- 67 Processing Centers
- 27 Million Pieces Per Day
- 9.7 Million Pieces Per Hour Peak
- Australia Post
- Similar to Royal Mail
8RCR Overview
9The Right Technology
- Technological Nexus
- Sophisticated Algorithms
- High Speed Processors
- Large Disk Capacities
- High Speed Memories
10At the Right Price
- Processing Type: Cost/1000 Pieces
- Manual: $47.78
- Mechanized: $27.46
- Automated: $5.30
11 80% encode rate and counting!
12Impact
- Applications of CEDAR research helping to automate tasks at IRS and USPS
- 1st year that USPS used CEDAR-developed software to read handwritten addresses on envelopes: saved $100 million
- 1997-1999 USPS deployment of CEDAR-developed RCRs: USPS saved 12 million work hours and over $340 million
- 500 scientific publications and 10 patents
15Handwritten Address Interpretation (HWAI)
[Flowchart of the HWAI pipeline; recoverable labels follow]
- Input: Address Block Image
- Processing blocks: Chaincode Generation, Line Segmentation, Word Separation, Parsing (a) shape, (b) syntax, Pre-scan with Digit Recognizer, Digit String Recognition, Phrase Recognition, Database Queries, Encoding Strategy
- Decision: Finalized? (Yes / No)
- Adaptive Image Enhancement; Pass 1 or Pass 2
- Output: 5, 9, or 11 digit encode OR reject (example: 14221 3851 11)
16Context Provided by Postal Directories
- Create street name lexicon
- DPF yields 8 street names
- ZIP+4 yields 31 street names (on average about 5 times more)
- Street names listed on the slide, each with its accompanying number: HAWLEY RD 1034; NEWGATE RD 1533; BEE MOUNTAIN RD 1615; DORMAN RD 1642; BOWERS HILL RD 1757; FREEMAN RD 1781; PUNKUP RD 1784; PARK RD 6124
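A minimal sketch of how a street name lexicon might be pulled from directory records. It assumes the query is keyed by a recognized ZIP Code and, optionally, a recognized primary number; the `DeliveryPoint` class, the `street_lexicon` function, and all record values are hypothetical and are not the actual DPF layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeliveryPoint:
    """Hypothetical, simplified directory record."""
    zip_code: str
    street_name: str
    primary_number: str


def street_lexicon(records, zip_code, primary_number=None):
    """Return the sorted street names in a ZIP Code, optionally restricted
    to records that carry the recognized primary number."""
    names = {
        r.street_name
        for r in records
        if r.zip_code == zip_code
        and (primary_number is None or r.primary_number == primary_number)
    }
    return sorted(names)


# Made-up records for illustration only.
records = [
    DeliveryPoint("00000", "HAWLEY RD", "12"),
    DeliveryPoint("00000", "NEWGATE RD", "12"),
    DeliveryPoint("00000", "PARK RD", "40"),
    DeliveryPoint("11111", "DORMAN RD", "12"),
]

print(street_lexicon(records, "00000", "12"))
print(street_lexicon(records, "00000"))
```

In this toy data, adding the primary number to the query shrinks the lexicon, which is the kind of context the directories are used for on this slide.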
17CEDAR
Delivery Point File
- One record per delivery point in USA
- Provided weekly by USPS, San Mateo
- Raw DPF
- 138 million records
- 15 GB (114 bytes per record)
- 41,889 ZIP Code files
- Fields of interest to HWAI
- ZIP Code, record type (e.g., street, firm, PO Box, ...), street name, primary number, secondary number, add-on
18CEDAR
Relevant Statistics
- ZIP Code
- 30% of ZIP Codes contain a single street name
- 5% of ZIP Codes contain a single primary number
- 2% of ZIP Codes contain a single add-on
-
- Maximum number of records returned is 3,071
-
- Maximum number of records returned is 3,070
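20Handwriting Recognition
[Diagram: a Word Recognition Engine takes a Signal and a Context Lexicon and produces a ranked lexicon with distance scores]
- Context Lexicon: Boston, Buffalo, Williamsville, Bidwell, James, Bryant, ...
- Ranked lexicon with distance scores: Bryant 2.3, Boston 1.8, Bidwell 2.6, James 4.7, Buffalo 8.9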
21WMR
Distance between the first character "w" of the lexicon entry "word" and the image between:
- segments 1 and 4 is 5.0
- segments 1 and 3 is 7.2
- segments 1 and 2 is 7.6
Find the best way of accounting for the characters "w", "o", "r", "d" by consuming all segments 1 to 8 in the process.
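The matching described on this slide can be sketched as a small dynamic program. This is an illustration under stated assumptions, not the WMR implementation: each character is assumed to consume between 1 and `max_span` consecutive segments, the three "w" distances are taken from the slide, and every other distance in `toy_distances` (and the default of 9.0) is made up.

```python
import math


def match_word(word, n_segments, dist, max_span=4):
    """Minimum total distance of explaining segments 1..n_segments with the
    characters of `word`, each character covering 1..max_span segments."""
    inf = math.inf
    # best[i][j]: cost of matching the first i characters to segments 1..j
    best = [[inf] * (n_segments + 1) for _ in range(len(word) + 1)]
    best[0][0] = 0.0
    for i, ch in enumerate(word, start=1):
        for j in range(1, n_segments + 1):
            for span in range(1, min(max_span, j) + 1):
                prev = best[i - 1][j - span]
                if prev < inf:
                    # character ch covers segments (j - span + 1)..j inclusive
                    cost = prev + dist(ch, j - span + 1, j)
                    best[i][j] = min(best[i][j], cost)
    return best[len(word)][n_segments]


toy_distances = {
    ("w", 1, 4): 5.0,  # from the slide
    ("w", 1, 3): 7.2,  # from the slide
    ("w", 1, 2): 7.6,  # from the slide
    ("o", 5, 5): 2.0,  # hypothetical
    ("r", 6, 6): 3.0,  # hypothetical
    ("d", 7, 8): 1.5,  # hypothetical
}


def toy_dist(ch, first, last):
    # Any character/segment pairing not listed gets a large made-up distance.
    return toy_distances.get((ch, first, last), 9.0)


print(match_word("word", 8, toy_dist))
```

Running the same function for every lexicon entry and sorting by the returned distance would give a ranked lexicon; the choice of `max_span` bounds how many segments a single character may absorb.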
22CMR
- Image from segment 1 to 3 is a u with 0.5 confidence
- Image from segment 1 to 4 is a w with 0.7 confidence
- Image from segment 1 to 5 is a w with 0.6 confidence and an m with 0.3 confidence
[Segmentation graph; edge labels give character hypotheses with confidences: w .6, m .3; w .7; d .8; o .5; u .5, v .2; i .8, l .8; i .7; r .4; u .3; m .2; m .1]
Find the best path in the graph from segment 1 to 8: w o r d
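A minimal sketch of the best-path search over such a graph. It assumes a path is scored by the product of its edge confidences, which is one possible choice and not necessarily the scoring used by CMR. The edges for segments 1 to 4 and 1 to 5 come from the slide; the placement of every other edge is hypothetical, since the figure itself is not recoverable.

```python
def best_path(edges, n_segments):
    """Return (score, text) of the highest-scoring path covering segments
    1..n_segments; a path's score is the product of its edge confidences."""
    # best[j] = (score, text) for the best explanation of segments 1..j
    best = {0: (1.0, "")}
    for end in range(1, n_segments + 1):
        for (first, last), hypotheses in edges.items():
            if last != end or first - 1 not in best:
                continue
            prev_score, prev_text = best[first - 1]
            for char, conf in hypotheses:
                candidate = (prev_score * conf, prev_text + char)
                if end not in best or candidate[0] > best[end][0]:
                    best[end] = candidate
    return best.get(n_segments)


# Keys are (first_segment, last_segment), inclusive.
toy_edges = {
    (1, 4): [("w", 0.7)],              # from the slide
    (1, 5): [("w", 0.6), ("m", 0.3)],  # from the slide
    (5, 6): [("o", 0.5)],              # placement hypothetical
    (6, 6): [("u", 0.3)],              # placement hypothetical
    (7, 7): [("r", 0.4)],              # placement hypothetical
    (7, 8): [("m", 0.2)],              # placement hypothetical
    (8, 8): [("d", 0.8)],              # placement hypothetical
}

score, text = best_path(toy_edges, 8)
print(text, round(score, 3))
```

Note that a plain product favors paths with fewer edges, so a practical scorer would likely normalize for path length; that refinement is left out here.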
24Multiple Choice Paradigm
- a) Amherst
- b) Buffalo
- c) Boston
- d) None of the above
25Grapheme Models
26Stochastic Models and Continuous Attributes
27Results
28Interactive Models (McClelland and Rumelhart, Psychological Review, 1981)
[Diagram: three-level network with a Words level (ABLE, TRIP, TRAP), a Letters level (A, T, N), and a Features level]
29Cognitive Handwritten Word Recognition
- Lexicon 1: West Central Street, West Main Street, Sunset Avenue
- Lexicon 2: West Central Street, East Central Street, Sunset Avenue
- Lexicon 3: West Central Street, West Central Avenue, Sunset Avenue
[Diagram: image, features, Interactive Model]
- Features: T-crossings, loops, ascenders, descenders, length
30Adaptive Character Recognition (Park and Govindaraju, IEEE CVPR 2000)
- Adaptive selection of features
- Adaptive number of features
- Adaptive resolutions
- Adaptive sequencing of features
- Adaptive termination conditions
31Features
4 gradient features
5 moment features
Vector code book
32Feature Space
- V x Nc x Ixy
- 2^9 x 10 x 85 (quad tree, 4 levels)
- Recognition rate and feature V
- GSC: V = 2^512
- Trade-offs: space vs. accuracy
- Hierarchical space with additional resolution and features as needed
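The size figures on this slide can be checked with a few lines of arithmetic. The sketch assumes the first factor is 2^9, that is, a code book over 9 binary features (the 4 gradient and 5 moment features of the previous slide), that Nc is the 10 digit classes, and that the third factor counts the nodes of a 4-level quad tree. All variable names are illustrative.

```python
n_binary_features = 4 + 5                 # gradient + moment features (assumed binary)
codebook_size = 2 ** n_binary_features    # possible feature vectors per sub-image
n_classes = 10                            # digit classes
quad_tree_levels = 4

# A quad tree has 4**level nodes at each level: 1 + 4 + 16 + 64.
n_subimages = sum(4 ** level for level in range(quad_tree_levels))

table_entries = codebook_size * n_classes * n_subimages
print(codebook_size, n_subimages, table_entries)  # 512 85 435200
```

Under these assumptions the whole space has 435,200 entries, which is tiny next to a 2^512 code book over 512 binary features.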
33Active Recognition Using Quad Trees
34Experimental Results
36Results
10-class digit recognition: 25,656 training and 12,242 test samples (Postal NIST)
37Outline
- Success in Postal Application
- Role of Handwritten Word Recognition
- Word Recognition
- Lexicon Driven Word Recognition
- Lexicon Free Word Recognition
- New Models
- Interactive Cognitive Models
- New Research Areas
- Lexicon Reduction and Combination
- Lexicon Density and Prediction of Performance
- Other Applications
38Combination and Dynamic Selection (Govindaraju and Ianakiev, MCS 2000)
[Diagram: the image and the Lexicon enter WR 1, which passes its Top 50 entries to WR 2, which passes its Top 5 entries to WR 3, which outputs 1]
- Optimization problem
- Combinatorial explosion in: arrangement of recognizers; lexicon reduction levels
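A minimal sketch of a serial arrangement in which each word recognizer ranks the surviving lexicon entries and passes only the top few to the next stage. The three toy recognizers operate on a noisy string standing in for the image and are purely hypothetical; the reduction levels (4, 2, 1) are scaled down from the 50, 5, 1 shown on the slide to fit the toy lexicon.

```python
import difflib


def cascade(image, lexicon, stages):
    """Run recognizers in sequence. Each stage is (recognizer, keep), where
    recognizer(image, word) returns a distance (lower is better) and `keep`
    is how many top-ranked entries survive to the next stage."""
    candidates = list(lexicon)
    for recognizer, keep in stages:
        candidates = sorted(candidates, key=lambda w: recognizer(image, w))[:keep]
    return candidates


# Hypothetical stand-ins for WR 1, WR 2, WR 3, cheapest first.
def wr1(image, word):
    return abs(len(image) - len(word))


def wr2(image, word):
    return (image[0] != word[0]) + (image[-1] != word[-1])


def wr3(image, word):
    return 1.0 - difflib.SequenceMatcher(None, image.lower(), word.lower()).ratio()


lexicon = ["Boston", "Buffalo", "Bidwell", "James", "Bryant", "Williamsville"]
result = cascade("Bufalo", lexicon, [(wr1, 4), (wr2, 2), (wr3, 1)])
print(result)
```

Choosing the order of the recognizers and the `keep` value at each stage is the optimization problem the slide refers to: every ordering and every set of reduction levels is a different candidate configuration.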
39Lexicon Density (Govindaraju, Slavik, and Xue, IEEE PAMI 2002)
- Lexicon 1: Me, He, So, To, In
- Lexicon 2: Me, Memo, Memory, Memoirs, Mellon
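The two lexicons have the same size but differ in how close their entries are to one another. As a rough illustration only, the sketch below uses length-normalized edit distance as a stand-in for closeness; this is an assumption, not the lexicon density measure defined in the cited paper, and the function names are hypothetical.

```python
from itertools import combinations


def levenshtein(a, b):
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def mean_normalized_distance(lexicon):
    """Average edit distance over all entry pairs, each divided by the
    length of the longer word in the pair."""
    pairs = list(combinations(lexicon, 2))
    total = sum(levenshtein(a, b) / max(len(a), len(b)) for a, b in pairs)
    return total / len(pairs)


lexicon_1 = ["Me", "He", "So", "To", "In"]
lexicon_2 = ["Me", "Memo", "Memory", "Memoirs", "Mellon"]

print(mean_normalized_distance(lexicon_1))
print(mean_normalized_distance(lexicon_2))
```

Under this stand-in, the entries of the second lexicon sit closer together than those of the first, even though both lexicons contain five words.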
40Classifier Performance Prediction (Xue and Govindaraju, IEEE PAMI 2002)
- q: probability that the recognizer makes a unit distance error
- D: average distance between any two words in the lexicon
- n: lexicon size
- p: performance
- a, k: model parameters
$\ln(-\ln p) = (\ln q)\,D + a \ln \ln n + \ln k$
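A minimal sketch of evaluating this model numerically. It assumes the relation on the slide reads ln(-ln p) = (ln q) D + a ln ln n + ln k, which is equivalent to p = exp(-k q^D (ln n)^a). The parameter values used below are hypothetical and are not fitted values from the paper.

```python
import math


def predicted_performance(q, D, n, a, k):
    """Solve ln(-ln p) = (ln q) * D + a * ln(ln n) + ln k for p.
    Requires 0 < q < 1, n > 1, and k > 0."""
    return math.exp(-k * (q ** D) * (math.log(n) ** a))


# Hypothetical parameter values, for illustration only.
q, a, k = 0.5, 1.0, 1.0
for n in (10, 100, 1000):
    print(n, round(predicted_performance(q, D=3.0, n=n, a=a, k=k), 3))
```

With q below 1 and a positive, the predicted performance falls as the lexicon grows and rises as the average distance D between lexicon words increases.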
42Bank Check Recognition
43PCR Trend Analysis
44NYS EMS PCR Form
- NYS PCR Example
- Thousands are filed a day.
- Passed from EMS to Hospital.
- PCR Purpose
- Medical care/diagnosis
- Legal Documentation
- Quality Assurance
- EMS Abbreviations
- COPD: Chronic Obstructive Pulmonary Disease
- CHF: Congestive Heart Failure
- D/S: Dextrose in Saline
- PID: Pelvic Inflammatory Disease
- GSW: Gunshot Wound
- NKA: No known allergies
- KVO: Keep vein open
- NaCl: Sodium Chloride
45Medical Text Recognition and Data Mining
46Reading Census Forms
Lexicon Anomalies
- Space: "sales man" and "salesman"
- Morphology and Abbreviation: "acct manager" and "account management"
- Plural: "school" and "schools"
- Typographical: "managar" and "manager"
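A minimal sketch of tolerating some of these anomalies when looking an entry up in a lexicon. It assumes that stripping spaces, lowercasing, and approximate string matching with the standard library's `difflib` are acceptable stand-ins; the function names, the toy lexicon, and the 0.8 cutoff are hypothetical. It covers the space, plural, and typographical cases only; abbreviation and morphology would need extra resources such as an abbreviation table or a stemmer.

```python
import difflib


def normalize(entry):
    """Lowercase and drop spaces so 'sales man' and 'salesman' compare equal."""
    return entry.lower().replace(" ", "")


def match_entry(entry, lexicon, cutoff=0.8):
    """Return the lexicon word closest to `entry`, or None if nothing is
    similar enough according to difflib's ratio."""
    keys = {normalize(word): word for word in lexicon}
    hits = difflib.get_close_matches(normalize(entry), list(keys), n=1, cutoff=cutoff)
    return keys[hits[0]] if hits else None


lexicon = ["salesman", "manager", "school"]
for raw in ("sales man", "managar", "schools", "plumber"):
    print(raw, "->", match_entry(raw, lexicon))
```

The cutoff trades missed matches against false ones: lowering it accepts more distorted entries but also more wrong lexicon words.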
47Binarization
48Historic Manuscripts
49Mapping Snippets with Transcribed Text
50Summary
- Handwriting recognition technology
- Pattern recognition task
- Lexicon holds domain specific knowledge
- Adaptive methods
- Classifier combination methods
- Many applications