Data Mining, Information Theory and Image Interpretation

Slides: 35
Provided by: wenjan
Transcript and Presenter's Notes

Title: Data Mining, Information Theory and Image Interpretation


1
Data Mining, Information Theory and Image
Interpretation
  • Sargur N. Srihari
  • Center of Excellence for Document Analysis and
    Recognition
  • and
  • Department of Computer Science and Engineering
  • State University of New York at Buffalo
  • Buffalo, NY 14260
  • USA

2
Data Mining
  • Search for Valuable Information in Large Volumes
    of Data
  • Knowledge Discovery in Databases (KDD)
  • Discovery of Hidden Knowledge, Unexpected
    Patterns and new rules from Large Databases

3
Information Theory
  • Definitions of Information
  • Communication Theory
  • Entropy (Shannon-Weaver)
  • Stochastic Uncertainty
  • Bits
  • Information Science
  • Data of Value in Decision Making

4
Image Interpretation
  • Use of knowledge in assigning meaning to an image
  • Pattern Recognition using Knowledge
  • Processing Atoms (Physical) as Bits (Information)

5
Address Interpretation Model
6
Typical American Address
Address Directory Size: 139 million records
7
Assignment Strategies
Typical street address
Database query
Address encoding
Results
Word Recognizer selects (after lexicon expansion)
Delivery point 142213557
8
Australian Address
Delivery Point ID: 66568882
Postal Directory Size: 9.4 million records
9
Canadian Address
Postal code: H1X 3B3
Postal Directory: 12.7 million records
10
United Kingdom Address
Postcode: TN23 1EU (unique postcode)
Delivery Point Suffix: 1A (default)
Address Directory Size: 26 million records
11
Motivation for Information Theoretic Study
  • Understand information interaction in postal
    address fields to overcome uncertainty in fields
  • Compare the efficiency of assignment strategies
  • Rank processing priority for determining a
    component value
  • Select most effective component to help recover
    an ambiguous component

12
Address Fields in a US Postal Address
  • Address fields

Sargur N. Srihari
f1 city name
f2 state abbr.
f3 5-digit ZIP Code
f4 4-digit ZIP+4 add-on
f5 primary number
f6 street name
f7 secondary designator abbr.
f8 secondary number
  • Delivery point 142282583

13
Probability Distribution of Street Name Lexicon
Size f6
14
Number of Address Records for Different Countries
15
Definitions
  • A component c is an address field fi, a portion
    of fi (e.g., a digit), or a combination of
    components.
  • 1. Entropy H(x): information provided by
    component x (assuming a uniform distribution)
  • $H(x) = \log_2 |x|$ bits, where |x| is the number
    of values of x
  • 2. Conditional entropy Hx(y): uncertainty of
    component y when component x is known
  • $H_x(y) = -\sum_i \sum_j p_{ij} \log_2 \left( p_{ij} / \sum_k p_{ik} \right) = H(x, y) - H(x)$
  • where xi is a value of component x, yj is a
    value of component y, and
  • pij is the joint probability p(xi, yj)
  • 3. Redundancy of component x to y
  • $R_x(y) = (H(x) + H(y) - H(x, y)) / H(y)$
  • $0 \le R_x(y) \le 1$
  • A higher value of Rx(y) indicates that more
    information in y is shared by x.
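
The three measures defined on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it estimates probabilities from record counts (an assumption, since the slide states a uniform distribution for H(x)), and the `states` and `digits` lists are hypothetical toy data.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


def conditional_entropy(xs, ys):
    """H_x(y) = H(x, y) - H(x): uncertainty left in y once x is known."""
    return entropy(list(zip(xs, ys))) - entropy(xs)


def redundancy(xs, ys):
    """R_x(y) = (H(x) + H(y) - H(x, y)) / H(y); requires H(y) > 0."""
    h_y = entropy(ys)
    return (entropy(xs) + h_y - entropy(list(zip(xs, ys)))) / h_y


# Hypothetical toy records: state abbreviation and first ZIP-Code digit.
states = ["NY", "NY", "CA", "CA"]
digits = ["1", "1", "9", "9"]

print(entropy(states))                      # information provided by the state
print(conditional_entropy(states, digits))  # uncertainty of the digit given the state
print(redundancy(states, digits))           # share of the digit's information held by the state
```

When every value is equally frequent, `entropy` reduces to the log2 of the number of values, which matches the uniform-distribution definition on the slide.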

16
Example of Information Measure
Value sets
pa10 = 1/5, pae = 2/5, etc.
Address records
Information measure
17
Measure of Information from National City State
File, D1 (July 1997)
  • Measure
  • H(x), where x is any combination of f1, f2, and f3i
  • Hx(f3), where x is any combination of f1, f2, and f3i

18
Measure of Information from Delivery Point Files,
D2 (July 1997)
  • Measure
  • H(x), where x is any combination of f3, f4, f5, f6, f7,
    f8, and f9
  • Hx(f4), where x is f3 with any combination of f3 f9

19
Measure of Information from D
Uncertainty in ZIP Code when City, State or a
digit is known
Uncertainty in component
  • To determine f3 (5-digit ZIP) from f1, f2 and
    f3i
  • - City name reduces uncertainty the most
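
A comparison of this kind can be sketched by ranking each known component x by the conditional entropy Hx(f3) it leaves in the ZIP Code. The records below are hypothetical toy data, not the National City State File, and the function name `rank_components` is introduced here only for illustration.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    n = sum(counts.values())
    return -sum((c / n) * log2(c / n) for c in counts.values())


def conditional_entropy(xs, ys):
    """H_x(y) = H(x, y) - H(x)."""
    return entropy(list(zip(xs, ys))) - entropy(xs)


def rank_components(records, target, candidates):
    """Sort candidate components by the uncertainty they leave in `target`."""
    ys = [r[target] for r in records]
    scores = {
        name: conditional_entropy([r[name] for r in records], ys)
        for name in candidates
    }
    # Lowest remaining uncertainty first: the most informative component leads.
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical toy directory (illustrative values only).
records = [
    {"city": "Buffalo", "state": "NY", "zip": "14260"},
    {"city": "Buffalo", "state": "NY", "zip": "14228"},
    {"city": "Albany", "state": "NY", "zip": "12207"},
    {"city": "Austin", "state": "TX", "zip": "78701"},
]

ranking = rank_components(records, "zip", ["city", "state"])
for name, remaining_bits in ranking:
    print(name, round(remaining_bits, 3))
```

On this toy data the city name leaves less uncertainty than the state abbreviation, which mirrors the slide's observation, although four records prove nothing about the real directory.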

20
Propagation of Uncertainty for Assignment
Strategies
21
Ranking Processing Priority for Confirming ZIP
Code
f1: City name; f2: State abbreviation; f3: ZIP Code
Processing flow: city, 5th, 4th, 3rd, state
22
Modeling Processing Cost
  • For component y
  • Location rate: $0 \le l(y) \le 1$
  • Recognition rate: $0 \le r(y) \le 1$
  • Processing speed s(y) in msec
  • Existence rate: $0 \le e(y) \le 1$
  • Patron rate: $0 \le p(y) \le 1$
  • Lexicon size of y, given x: $|y_x| = 2^{H(x, y) - H(x)}$
  • Cost of processing component y given component x
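
The lexicon-size term above can be sketched as follows. The sketch assumes the uniform-distribution entropy from the Definitions slide, H = log2 of the number of distinct values, under which the lexicon size is the average number of y values per x value. The cost formula itself is not reproduced in this transcript, so it is not modeled here. The `states` and `cities` lists are hypothetical.

```python
from math import log2


def uniform_entropy(values):
    """H = log2(number of distinct values): entropy under a uniform assumption."""
    return log2(len(set(values)))


def lexicon_size(xs, ys):
    """Expected lexicon size of y given x: 2 ** (H(x, y) - H(x))."""
    h_xy = uniform_entropy(list(zip(xs, ys)))
    h_x = uniform_entropy(xs)
    return 2 ** (h_xy - h_x)


# Hypothetical toy records: state abbreviation (x) and city name (y).
states = ["NY", "NY", "NY", "TX"]
cities = ["Buffalo", "Albany", "Buffalo", "Austin"]

# Three distinct (state, city) pairs over two distinct states.
print(lexicon_size(states, cities))
```

A smaller lexicon generally makes word recognition easier, which is why the size of y given x is a natural input to a cost model.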

23
Example Cost Table
24
Ranking Processing Priority for Confirming ZIP
Code Based on Cost
Processing flow based on cost: 2nd, city, 5th, 4th, 3rd, 1st
Processing flow based on Hx(y): city, 5th, 4th, 3rd, state
25
Recovery of 1st ZIP-Code Digit, f31, from State
Abbr. (f2) and Other ZIP-Code Digits (f32-f35)
  • Usage: if recognition of a component (e.g., f31)
    fails, it has the highest probability of recovery
    from the known component with the largest
    redundancy (here, f2).
  • There are 62 state abbreviations. In 60 of them,
    the 1st ZIP digit is unique.
  • For NY and TX, there are two valid 1st ZIP-Code
    digits.
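
The recovery step can be sketched as a lookup from state abbreviation to the set of valid first ZIP-Code digits. This is an illustrative sketch only: the directory entries are hypothetical toy data, and the function names are not from the slides.

```python
from collections import defaultdict


def build_first_digit_index(directory):
    """Map each state abbreviation to the set of first ZIP digits seen for it."""
    index = defaultdict(set)
    for state, zip_code in directory:
        index[state].add(zip_code[0])
    return index


def recover_first_digit(index, state):
    """Return the first ZIP digit if the state determines it uniquely, else None."""
    candidates = index.get(state, set())
    if len(candidates) == 1:
        return next(iter(candidates))
    return None  # ambiguous or unknown state: another component is needed


# Hypothetical toy directory of (state abbreviation, 5-digit ZIP Code) pairs.
directory = [
    ("CA", "90210"),
    ("CA", "94105"),
    ("NY", "14260"),
    ("NY", "00501"),
]

index = build_first_digit_index(directory)
print(recover_first_digit(index, "CA"))
print(recover_first_digit(index, "NY"))
```

For a state with two valid first digits, the lookup returns None and the remaining ZIP-Code digits (f32 to f35) would have to narrow the choice.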

26
Measure of Information from Mail Stream, S
  • Eighteen sets of mail pieces, each from a mail
    processing site
  • We measure
  • Information provided: H(f2), H(f3i)
  • Uncertainty of f3: Hf2(f3), Hf3i(f3)
  • Each set is measured separately
  • Results are reported as the average over these sets

27
Comparison of ZIP-Code Uncertainty from D and S
28
Comparison of Results from D and S
  • ZIP-Code uncertainty
  • from S
  • Information from S is more effective for
    determining a ZIP Code
  • The most effective processing flow for using f3i
    and f2 to determine f3 (consistent between S
    and D) is
  • f2 - f35 - f34 - f33 - f32 - f31

29
UK Address Interpretation: Field Recognition
Database Query
  • Fields of interest: locality, post town, county,
    outward postcode
  • Target: outward postcode
  • Control flow: based on data mining

30
UK Address Interpretation: Last Line Parsing
Resolution
31
Discussion (Reliability of information)
  • For selecting an effective processing flow in
    address interpretation, the prediction is
    accurate when the information is
    representative of the current processing
    situation
  • Using unreliable information to determine a
    candidate value may cause errors.
  • A processing flow chosen using unreliable
    information is less effective.

32
Reliability of information
  • Measure of information from D
  • Not reflecting the current processing situation
  • Full coverage of all valid values
  • Measure of information from S
  • Assuming that site specific preceding history
    represents current processing situation
  • Mail distribution could be season-specific
  • Should consider the coverage of valid samples
  • Should consider the information bias if valid
    samples are from AI engine

33
Complexity of collecting mail information (S)
  • Information from mail streams should be collected
    automatically, and only high-confidence
    information should be collected
  • Address interpretation is not ideal
  • Some error cases would be collected
  • Address interpretation may consistently reject
    certain patterns of mail pieces, resulting in
    biased collected information

34
Conclusion
  • Information content of postal addresses can be
    measured
  • The efficiency of assignment strategies can be
    compared
  • Redundancy of two components can be measured
  • An uncertain component has higher probability of
    recovery when another component with larger
    redundancy is known
  • Information measures can suggest the most effective
    processing flow
  • Information Theory is an effective tool for Data
    Mining