Characteristic Identifier Scoring and Clustering for Email Classification

About This Presentation

Slides: 14

Provided by: MKC4

Learn more at: http://web.stanford.edu

Transcript and Presenter's Notes

1
Characteristic Identifier Scoring and Clustering
for Email Classification

By
Mahesh Kumar Chhaparia

2
Email Clustering

Given a set of unclassified emails, the objective is to produce high-purity clusters while keeping the training requirements low.
Outline
Characteristic Identifier Scoring and Clustering (CISC)
Identifier Set
Scoring
Clustering
Directed Training
Comparison of CISC with some of the traditional ideas in email clustering
Comparison of CISC with POPFile (Naïve Bayes classifier)
Caveats
Conclusion

3
Evaluation

Evaluation on the Enron Email Dataset for the following users (purity measured w.r.t. the grouping already available)

User         Number of folders  Number of messages  Messages in smallest folder  Messages in largest folder
Lokay-M      11                 2489                6                            1159
Beck-S       101                1971                3                            166
Sanders-R    30                 1188                4                            420
Williams-w3  18                 2769                3                            1398
Farmer-D     25                 3672                5                            1192
Kitchen-L    47                 4015                5                            715
Kaminski-V   41                 4477                3                            547
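
The slides report purity against each user's existing folders but do not spell out the formula. A minimal sketch, assuming the standard definition (each cluster is credited with its most common folder, and purity is the credited fraction of all messages); the function name and the toy data are illustrative only:

```python
from collections import Counter

def purity(cluster_ids, folder_labels):
    """Fraction of messages that sit in the majority folder of their cluster."""
    if len(cluster_ids) != len(folder_labels):
        raise ValueError("inputs must have the same length")
    if not cluster_ids:
        return 0.0
    # Count folder labels separately inside each cluster.
    by_cluster = {}
    for cid, label in zip(cluster_ids, folder_labels):
        by_cluster.setdefault(cid, Counter())[label] += 1
    # Credit each cluster with the size of its largest folder group.
    majority_total = sum(max(c.values()) for c in by_cluster.values())
    return majority_total / len(cluster_ids)

# Toy data (hypothetical, not from the Enron evaluation).
clusters = [0, 0, 0, 1, 1, 2]
folders = ["legal", "legal", "hr", "hr", "hr", "legal"]
print(round(100 * purity(clusters, folders), 2))  # 83.33
```

Under this definition many small clusters push purity up, which is why purity figures are best read together with the number of clusters.
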
4
CISC Identifier Set

Sender and Recipients
Words from the subject starting with uppercase
Tokens from the message body
Word sequences with each word starting in uppercase (length 2 to 5 only), split about stopwords (excluding them)
Acronyms (length 2 to 5 only)
Words followed by an apostrophe and s, e.g. TW's extracted to TW
Words or phrases in quotes, e.g. "Trans Western"
Words where any character (excluding the first) is in uppercase, e.g. eSpeak, ThinkBank etc.
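
The identifier rules above can be sketched with a few regular expressions. This is an illustration under stated assumptions, not the author's extractor: the stopword list is a tiny placeholder, "length 2 to 5" is read as 2 to 5 words for sequences and 2 to 5 letters for acronyms, and the function names are hypothetical.

```python
import re

# Tiny placeholder stopword list (assumption; the slides do not give one).
STOPWORDS = {"the", "a", "an", "of", "for", "and", "to", "in", "on", "is", "at"}

def subject_identifiers(subject):
    """Words from the subject that start with an uppercase letter."""
    words = re.findall(r"[A-Za-z][A-Za-z0-9]*", subject)
    return [w for w in words if w[0].isupper()]

def body_identifiers(body):
    """Characteristic tokens from the message body."""
    ids = set()
    # Words or phrases in double quotes.
    ids.update(m.strip() for m in re.findall(r'"([^"]+)"', body))
    # Words followed by an apostrophe and s: TW's -> TW.
    ids.update(re.findall(r"\b([A-Za-z]+)'s\b", body))
    # Acronyms of 2 to 5 uppercase letters.
    ids.update(re.findall(r"\b[A-Z]{2,5}\b", body))
    # Words with an uppercase character after the first (eSpeak, ThinkBank).
    for w in re.findall(r"\b[A-Za-z]+\b", body):
        if any(c.isupper() for c in w[1:]) and not w.isupper():
            ids.add(w)
    # Runs of 2 to 5 capitalized words; stopwords and punctuation end a run.
    run = []
    for tok in re.findall(r"[A-Za-z]+|[^A-Za-z\s]", body) + ["."]:
        if tok[0].isupper() and tok.lower() not in STOPWORDS:
            run.append(tok)
        else:
            if 2 <= len(run) <= 5:
                ids.add(" ".join(run))
            run = []
    return ids

subject = "Re: TW Capacity update for Friday"
body = ('Please review the TW Capacity Release Report before the eSpeak '
        'session. TW\'s numbers are in "Trans Western" files.')
print(subject_identifiers(subject))
print(sorted(body_identifiers(body)))
```

Sender and recipient identifiers would come straight from the message headers, so they are not shown here.
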

5
CISC Scoring

Sender
Initial idea: generate clusters of email addresses with frequency of communication above some threshold
(+) Identifies good clusters of communication
(-) Difficult to score when an email has addresses spread across more than one cluster
(-) Fixed partitioning and difficult to update

6
CISC Scoring (Contd)

Sender
Need a notion of soft clustering with both recipients and content
Generate a measure of the sender's non-variability with respect to the addresses it co-occurs with or the content it discusses in emails
Example
1 → {2,3}, {3,4}, {2,3,4} in Folder 1
2 → {1}, {3}, {4}, {1}, {3}, {1,3} in Folder 2
Emphasizes social clusters {1,2,3}, {1,3,4}
Classify 2 → {1,3,4}
Traditionally: Folder 2 (address frequency based)
CISC: Folder 1 (social cluster based)
Difficult to say upfront which is better!
Efficacy discussed later
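
The example can be replayed in a few lines. This sketch assumes that each space-separated group on the slide is the recipient list of one message. The address-frequency score follows the "traditional" idea directly; the social score is only a stand-in (best Jaccard overlap between participant sets), since the slides do not give the actual CISC measure. All names are hypothetical.

```python
from collections import Counter

# Each message is (sender, recipients); data read off the slide's example.
FOLDERS = {
    "Folder 1": [(1, {2, 3}), (1, {3, 4}), (1, {2, 3, 4})],
    "Folder 2": [(2, {1}), (2, {3}), (2, {4}), (2, {1}), (2, {3}), (2, {1, 3})],
}

def participants(message):
    """Sender and recipients of a message as one set of addresses."""
    sender, recipients = message
    return {sender} | set(recipients)

def frequency_score(message, folder_messages):
    """Traditional idea: sum how often each address appears in the folder."""
    counts = Counter()
    for m in folder_messages:
        counts.update(participants(m))
    return sum(counts[a] for a in participants(message))

def social_score(message, folder_messages):
    """Stand-in for social-cluster scoring: best Jaccard overlap between the
    message's participant set and any participant set seen in the folder."""
    target = participants(message)
    best = 0.0
    for m in folder_messages:
        group = participants(m)
        best = max(best, len(target & group) / len(target | group))
    return best

def classify(message, score):
    """Return the folder name with the highest score for this message."""
    return max(FOLDERS, key=lambda name: score(message, FOLDERS[name]))

new_message = (2, {1, 3, 4})
print(classify(new_message, frequency_score))  # Folder 2
print(classify(new_message, social_score))     # Folder 1
```

Address 2 dominates Folder 2 by raw count (13 against 10), while the group {1,2,3,4} has only been seen together in Folder 1, which is the disagreement the slide points out.
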

7
CISC Scoring (Contd)

Words or Phrases
Generate a measure of the importance of each word or phrase
Using context captured through the co-occurring text
Sample scenarios for score generation
Different functional groups in a company mentioning Conference Room → Low score
A single shipment discussion for company CERN → High score
Several different topic discussions (financial, operational etc.) for company TW → Low score
Clustering
Pair each message with its highest-similarity message, and merge clusters sharing at least one message to produce disjoint clusters
Directed Training
For each cluster, identify a message likely to belong to the majority class
Ask the user to classify this message
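
The clustering and directed-training steps can be sketched as follows, assuming a precomputed symmetric message-similarity matrix. Linking each message to its nearest neighbour and merging linked groups is done here with a union-find structure. How CISC picks the message to show the user is not stated in the slides; choosing the member with the highest total similarity to its cluster is an assumption made for illustration.

```python
def nearest_neighbor_clusters(sim):
    """Link every message to its most similar other message, then merge
    linked groups so the result is a set of disjoint clusters."""
    n = len(sim)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        others = [j for j in range(n) if j != i]
        if not others:
            continue
        best = max(others, key=lambda j: sim[i][j])
        parent[find(i)] = find(best)  # merge the two groups

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

def suggest_message(cluster, sim):
    """Assumed heuristic: the member most similar to the rest of its cluster
    is the one most likely to belong to the majority class."""
    return max(cluster, key=lambda i: sum(sim[i][j] for j in cluster if j != i))

# Hypothetical similarity matrix for five messages.
sim = [
    [1.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 1.0, 0.3, 0.2, 0.1],
    [0.8, 0.3, 1.0, 0.1, 0.2],
    [0.1, 0.2, 0.1, 1.0, 0.7],
    [0.0, 0.1, 0.2, 0.7, 1.0],
]
clusters = nearest_neighbor_clusters(sim)
print(clusters)                                    # [[0, 1, 2], [3, 4]]
print([suggest_message(c, sim) for c in clusters])  # [0, 3]
```

Only one label per cluster is requested from the user, which is how this style of training keeps the number of labelled messages low.
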

8
Efficacy of TF-IDF Cosine Similarity

Clustering using the traditional TF-IDF cosine similarity measure is not very effective for emails!
Note
Both the TF-IDF and CISC figures use only word and phrase tokens
The number of clusters is different in the two cases, but the purity figures indicate the discriminative capability of the respective algorithms

User         TF-IDF (% purity before merging)  TF-IDF (% purity)  CISC (% purity)
Lokay-M      57.69                             46.68              77.54
Beck-S       51.44                             9.63               59.66
Sanders-R    61.53                             37.45              70.03
Williams-w3  58.92                             61.71              90.61
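
For reference, the TF-IDF cosine baseline can be sketched in plain Python. The exact weighting used in the comparison is not given in the slides; raw term frequency times log(N/df) is assumed here, and the token lists are made up.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {token: weight} dict per doc,
    using raw term frequency times log(N / document frequency)."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical tokenized messages.
docs = [
    ["capacity", "release", "report"],
    ["capacity", "release", "schedule"],
    ["conference", "room", "booking"],
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```

Every non-stopword contributes to this similarity, in contrast to the small set of high-selectivity identifiers that CISC keeps.

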
9
Efficacy of Social Cluster Based Scoring

Results

User         CISC with social clusters (% purity)  CISC without social clusters (% purity)
Lokay-M      84.21                                 77.54
Beck-S       60.52                                 59.66
Sanders-R    78.28                                 70.03
Williams-w3  93.31                                 90.61
10
CISC vs. POPFile

Results
Purity may sometimes (marginally) decrease with increasing training set in POPFile!

Training Messages  Lokay-M      Beck-S       Sanders-R    Williams-w3
100                62.63        15.60        15.39        6.75
200                66.62        19.70        31.79        35.50
300                69.53        20.01        50.68        35.72
1000               72.68        24.63        36.51        18.40
CISC               80.47 (265)  52.81 (218)  75.67 (146)  91.40 (153)
                   85.22 (614)  71.47 (587)  84.79 (332)  93.38 (365)
11
Conclusion

Given a set of unclassified emails, the proposed strategy obtains higher clustering purity with lower training requirements than POPFile and the TF-IDF based method.
Key differentiators
Incorporates a combination of communication cluster and content variability based scoring for senders instead of the usual TF-IDF scoring or Naïve Bayes word model (POPFile)
Picks a set of high-selectivity features for the final message similarity model rather than retaining most content of messages (i.e. all non-stopwords)
Observes and uses the fact that any email in a class may be close to only a small number of emails rather than to all in that class
Finally, helps lower training requirements through directed training rather than indiscriminate training over as many emails as possible

12
Future Work

Design and evaluation for non-corporate datasets
Tuning of message similarity scoring
Different weights for the score components
Different range normalization for different components to boost proportionally
Test feature score proportional to its length
Richer feature set
Phrases following "the"
Test with substring-free collection, e.g. "TW Capacity Release Report" and "TW" are replaced with "Capacity Release Report" and "TW"
Hierarchical word scoring to change granularity of clustering
Online classification using training-directed feature extraction
Merging high-purity clusters effectively to further reduce training requirements

13
Q & A
