Title: eMail and Records Management with IBM Classification Module
1 eMail and Records Management with IBM Classification Module
- Jon Dellaria, IBM Certified ECM Information
Technology Specialist
2 What is Classification?
Definition: clas·si·fi·ca·tion [klas-uh-fi-key-shuhn], n.: the act of assigning an element (a document, for example) to a category.
3 IBM Leadership in Text Analysis and Classification
- IBM has a 50-year history in text analysis and discovery
- As early as 1957, IBM published pioneering research on text classification (and related topics, such as text search and automatic creation of text abstracts)
- IBM invests $50M annually in research and development for search and text analytics
- 200 people actively engaged in R&D
- IBM holds over 200 patents in information access, with more each year
4 Options for Implementing the Classification Process
5 IBM Classification Module: Implementing the classification process in ECM
- Intelligent application of policies via automatic, advanced classification
- Combines the best automatic methods: context-sensitive and rule-based
- Flexible automation levels accelerate adoption and acceptance
- Incorporates user feedback in real time to improve understanding
- Integrates with the IBM ECM architecture or runs as a free-standing service
- 12 languages and 3 more on the way!
ICM
6 Advanced Classification is Key to Compliant Information Management
7 Advanced Classification: The Facts
1. Fact: In controlled tests, humans provide, at best, marginally better accuracy in executing classification.
   Implication: Compliance professionals hold the incorrect assumption that humans are the best option for piece-by-piece decision-making.
2. Fact: Business users find forced manual classification burdensome, and at least 50% will not participate.
   Implication: Results of human-reliant filing are inconsistent and inaccurate, resulting in effective accuracy of 50%, at best.
3. Fact: Every manual classification forced on your users will cost your organization 17 cents in productivity.
   Implication: Widespread adoption of archiving or records management in your organization will lead to large, measurable productivity loss.
4. Fact: Unstructured content makes up 80% of the volume of information in the average enterprise, and that segment is growing 30% annually.
   Implication: Deploying an archiving or records management initiative is an increasingly important, large-scale and difficult problem.
8 Critical Dimensions of Classification
Automated
Manual
X
92
50 80
Accuracy
46
0.17
< 0.01
Cost (per doc)
Consistency
100
<50
Increasing Volume
9 Participation Impacts Accuracy
- National Archives and Records Administration Study
  - Electronic Records Management initiative focused on user-driven records declaration
  - 6-month study
  - 60% drop-off in participation in months after training
  - End users frequently outright refuse to categorize content
Participation in Manual Filing by Month
- Manual classification and an emphasis on user training are outdated, providing inconsistent and inaccurate results
Inconsistent participation from humans is the critical factor in evaluating different classification methods
10 Manual Classification
With paper
With rudimentary electronics
Today's advanced electronics
11 Rules-based Classification
To: Bob Smith <Bob.Smith@hotmail.com>
From: Bill Roker <broker@financialadv.com>
Subject: Market Movement

Bob,
Hope you're doing well. I've got a sure thing going with the stock we spoke about on the phone. I think it's time to pull the trigger for my client. The client's name is John Doe. His social is 123-45-6789. He's totally on board and he's excited to take advantage of this new offer.
Talk to you tomorrow,
Bill

Bill Roker
212-555-1234
Financial Advisors, Inc.

Simple Rules: Does the body contain the phrase "sure thing"? Did the CFO send the email?
Complex Policies: Does the body contain the phrase "sure thing" in the same sentence as "stock"? Does the sender belong to the broker email group, and did they send an email externally using the phrase "sure thing" in the body?
Metadata extraction: Does the body of the email have anything that matches the pattern XXX-YY-ZZZZ?
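The three rule types on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not how IBM Classification Module expresses rules: the `Email` class, the `BROKER_GROUP` set, the `INTERNAL_DOMAIN` value, and all function names are hypothetical, and the sentence splitter is deliberately naive.

```python
import re
from dataclasses import dataclass

# Hypothetical values for illustration; not taken from any real deployment.
INTERNAL_DOMAIN = "financialadv.com"
BROKER_GROUP = {"broker@financialadv.com"}

# Matches the XXX-YY-ZZZZ digit pattern from the slide.
SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


@dataclass
class Email:
    sender: str
    recipient: str
    body: str


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def contains_sure_thing(email):
    """Simple rule: the body contains the phrase 'sure thing'."""
    return "sure thing" in email.body.lower()


def sure_thing_near_stock(email):
    """Complex policy: 'sure thing' and 'stock' occur in the same sentence."""
    return any(
        "sure thing" in s.lower() and "stock" in s.lower()
        for s in split_sentences(email.body)
    )


def broker_sent_externally(email):
    """Complex policy: a broker-group sender used 'sure thing' in external mail."""
    is_broker = email.sender.lower() in BROKER_GROUP
    is_external = not email.recipient.lower().endswith("@" + INTERNAL_DOMAIN)
    return is_broker and is_external and contains_sure_thing(email)


def extract_ssn_like(email):
    """Metadata extraction: return every XXX-YY-ZZZZ match in the body."""
    return SSN_LIKE.findall(email.body)


# Abbreviated version of the sample email from the slide.
sample = Email(
    sender="broker@financialadv.com",
    recipient="Bob.Smith@hotmail.com",
    body=(
        "Hope you're doing well. I've got a sure thing going with the stock "
        "we spoke about on the phone. His social is 123-45-6789. "
        "Bill Roker 212-555-1234"
    ),
)
print(contains_sure_thing(sample), sure_thing_near_stock(sample))
print(broker_sent_externally(sample), extract_ssn_like(sample))
```

Note that the phone number in the signature does not match the extraction pattern, because its middle group has three digits rather than two.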
12 Rule-based Classification's Achilles Heel: Rule Maintenance, Accuracy and Cost
Accuracy
Changes in business
Effort to adjust rules to new environment
Time
13 Context Sensitive Classification
Category 1
Category 2
Statistic-Based Categorization
Category 3
Unclassified text
14 Context Sensitive Classification
Simple rules or keyword based analysis can be too
coarse to make fine distinctions between
long-form texts with very different intent
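To make the contrast with keyword rules concrete, here is a minimal sketch of statistics-based categorization using a multinomial Naive Bayes classifier with add-one smoothing. This is a stand-in technique chosen for brevity; the slides do not say which algorithm IBM Classification Module uses, and the class name, training sentences, and categories below are invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesCategorizer:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word frequencies
        self.doc_counts = Counter()              # category -> number of examples
        self.vocab = set()

    def train(self, text, category):
        tokens = tokenize(text)
        self.word_counts[category].update(tokens)
        self.doc_counts[category] += 1
        self.vocab.update(tokens)

    def scores(self, text):
        """Return a log-probability score for each known category."""
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        result = {}
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            denom = sum(counts.values()) + len(self.vocab)
            log_p = math.log(n_docs / total_docs)
            for token in tokens:
                log_p += math.log((counts[token] + 1) / denom)
            result[category] = log_p
        return result

    def classify(self, text):
        scores = self.scores(text)
        return max(scores, key=scores.get)


# Invented training examples; a real knowledge base would use many per category.
kb = NaiveBayesCategorizer()
kb.train("quarterly budget invoice payment approved", "Finance")
kb.train("invoice for travel expenses and budget", "Finance")
kb.train("contract review and litigation hold notice", "Legal")
kb.train("attorney comments on the contract terms", "Legal")

print(kb.classify("please approve the invoice payment"))
print(kb.classify("notice about the contract litigation"))
```

Unlike a keyword rule, the score depends on the whole distribution of words in the text, so the decision shifts gradually as the wording changes rather than flipping on a single phrase.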
15 Choosing the Right Classification Method
- Combined approaches provide the maximum accuracy from automation, at a slight productivity cost
- Automated methods slash the costs
- Manual methods have high costs associated with them
- Manual methods suffer from lack of participation, hampering their overall viability
Chart: Accuracy (Low to High) versus Cost Savings / Productivity (Low to High). Methods plotted: Manual Classification; Authoring Templates; Rules Based Classification (Simple Rules, Complex Policies); Context Based Classification; Multiple Methods; Consistent Participation Enforcement.
16 Enterprise Compliance Vision: Integrated Agile ECM Platform for Compliant Information Management
IBM ECM
Content Collection
17 Reclassification Records Management
18 US Army Email and Records Manager Pilot
- GOAL
  - Provide a means to address the Army's requirement for the successful records management of email
- Challenges faced
  - Lack of records management follow-through from end users
  - Need to capture records and transactional activities from email
  - Need to capture records without user intervention
19 US Army Email and Records Manager Pilot
- Success Criteria for pilot
  - Correctly capture and retrieve email provided
  - Ensure information is secure
  - Determine whether email can be accurately auto-categorized by the IBM Categorization Module (ICM)
    - Goal of 90% or better accuracy
  - Show how ICM learns and improves accuracy over time
  - Place categorized record emails under correct Army records disposition
20 Army Email Pilot Concept of Operations (CONOPS)
21 Concept of Operations
Tasks Phase I Phase II Phase III
Identification of Records Categories ✓
Delivery of .pst files ✓ ✓ ✓
Organization of .pst files to build knowledge base ✓
Ingesting of Emails - Build Corpus ✓
Ingesting of Emails - Auto Cat Runs ✓ ✓ ✓
Auditing ✓ ✓ ✓
complete
complete
complete
22 Pilot Phases
- Pre-Phase Activity
  - Teach the system by building the knowledge base (Corpus)
- Phase I
  - Process the first run of sample .pst files
  - Review and Audit the results
- Phase II (30 days later)
  - Process the second run of sample .pst files
  - Review and Audit the results
- Phase III (30 days later)
  - Process the third run of sample .pst files
  - Review and Audit the results
23 Knowledge Base (Corpus) Training
PST Inboxes
Organized Email
User 1 Email
Record Category Marketing
User 2 Email
Record Category Legal
Army Records Managers
Record Category Finance
. . .
. . .
Record Category R&D
User n Email
24 Outlook Configuration
25 Building the Knowledge Base for Email Categorization
26 Reports
27 Training Knowledge Base - The Results
Adjusted Data
Raw Data
28 Pilot Project Pre-Phase Activities
- Build Categorization Knowledge Base
  - Work with Army Records Managers to define the most appropriate records categories and identify example emails for them
- Goal
  - Find examples of email records for each of the record categories
  - Find 15-20 examples for each category
- Results
  - 54 records categories were identified as being associated with the assigned offices
  - 28 categories have 15 or more examples
  - 26 categories have 14 or fewer examples
29 Army Email Pilot Phase I-III Auto Categorization Steps
IBM P8 eMail Manager
.PST Files
IBM Categorization Module
P8 InBox Folder
Review Audit
1 Army Records Manager
30 Pilot Project Phase I-III Activities
- First Pass of Categorization (process .pst files)
  - Take the knowledge base created by Army Records Managers and apply it to the bulk of email
  - Measure categorization results returned and begin Audit and Review process
- Audit and Review process
  - Audit: used to confirm the accuracy of categorization via a random sampling of categorized results. If necessary, the chosen category may be modified, which serves to retrain the knowledge base for the future
  - Review: items that do not meet the defined thresholds for categorization are available for further analysis and categorization by records personnel
  - The result of Audit and Review is improved accuracy of the knowledge base and therefore improved categorization for future email ingest
- Post Audit/Review reprocessing of email to measure categorization improvements
- Measure results for the completion of each Phase
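The Audit and Review loop described above can be sketched as follows. This is an assumed shape for the workflow, not the pilot's actual implementation: the `Categorized` record, the `score` field, the 0.8 threshold, and the sampling rate are all hypothetical, since the slides do not give the thresholds or sample sizes that were used.

```python
import random
from dataclasses import dataclass


@dataclass
class Categorized:
    email_id: str
    category: str
    score: float  # hypothetical classifier confidence in [0, 1]


def route(results, threshold=0.8):
    """Split results into auto-accepted items and items needing Review."""
    accepted = [r for r in results if r.score >= threshold]
    needs_review = [r for r in results if r.score < threshold]
    return accepted, needs_review


def audit_sample(accepted, rate=0.1, seed=None):
    """Draw a random sample of accepted items for the Audit step."""
    if not accepted:
        return []
    k = min(len(accepted), max(1, round(len(accepted) * rate)))
    return random.Random(seed).sample(accepted, k)


def corrections_to_training(items, corrected):
    """Pair each audited or reviewed item with its confirmed category.

    `corrected` maps email_id -> category chosen by records personnel;
    items without an entry keep the classifier's category.
    """
    return [(r.email_id, corrected.get(r.email_id, r.category)) for r in items]


# Invented results for illustration.
results = [
    Categorized("m1", "Finance", 0.95),
    Categorized("m2", "Legal", 0.55),
    Categorized("m3", "Finance", 0.88),
]
accepted, needs_review = route(results)
audited = audit_sample(accepted, rate=0.5, seed=0)
feedback = corrections_to_training(needs_review, {"m2": "Finance"})
```

The `feedback` pairs are what would be fed back into the knowledge base as new training examples before the email is reprocessed.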
31 Pilot Project Activities
- Focus on email from 16 different offices across Army
- Demonstrate ability to categorize emails across Army enterprise
- PST files from 398 pre-selected users
- 581,634 emails in total in Phase I
- 581,256 emails in total in Phase II
- 735,333 emails in total in Phase III
- 1,898,232 total emails through Phase III
- PST files transferred to the pilot system via secure connection
32 Phase I Categorization Results
                        First Pass   Post Audit/Review
Total Categorized       84.5%        98.8%
Total Not Categorized   15.5%        1.2%

Phase II Categorization Results
                        First Pass   Post Audit/Review
Total Categorized       99.01%       99.9%
Total Not Categorized   0.9%         0.1%

Phase III Categorization Results
                        First Pass   Post Audit/Review
Total Categorized       98.4%        99.9%
Total Not Categorized   1.6%         0.1%
33 Army Records Manager Observations
- As a records manager with a 25-year background in federal and civilian records management, I believe the automatic categorization of information is the next logical evolution in managing the records of an organization.
- The classifier correctly identifies categories of records based on information from office file plans. Since office file plans are incorporated within an agency records manual, the initial input for the system is nominal. The office file plan becomes the document classifier.
- Because the classifier retains information on document retrieval activity, it may be appropriate for use in many other information management program areas, including the Freedom of Information and Privacy Act.
34 Demo
35 Thank You
36 IBM Records Manager with Army File Plan