Title: Characteristic Identifier Scoring and Clustering for Email Classification
1Characteristic Identifier Scoring and Clustering
for Email Classification
- By
- Mahesh Kumar Chhaparia
2Email Clustering
- Given a set of unclassified emails, the objective
is to produce high purity clusters keeping the
training requirements low. - Outline
- Characteristic Identifier Scoring and Clustering
(CISC), - Identifier Set
- Scoring
- Clustering
- Directed Training
- Comparison of CISC with some of the traditional
ideas in email clustering - Comparison of CISC with POPFile (Naïve-Bayes
classifier), - Caveats
- Conclusion
3Evaluation
- Evaluation on Enron Email Dataset for the
following users (purity measured w.r.t the
grouping already available)
User Number of folders Number of Messages Messages in smallest folder Messages in largest folder
Lokay-M 11 2489 6 1159
Beck-S 101 1971 3 166
Sanders-R 30 1188 4 420
Williams-w3 18 2769 3 1398
Farmer-D 25 3672 5 1192
Kitchen-L 47 4015 5 715
Kaminski-V 41 4477 3 547
4CISC Identifier Set
- Sender and Recipients
- Words from the subject starting with uppercase
- Tokens from the message body
- Word sequences with each word starting in
uppercase (length 2,5 only) split about
stopwords (excluding them) - Acronyms (length 2,5 only)
- Words followed by an apostrophe and s e.g. TWs
extracted to TW - Words or phrases in quotes e.g. Trans Western
- Words where any character (excluding first is in
uppercase) e.g. eSpeak, ThinkBank etc.
5CISC Scoring
- Sender
- Initial idea generate clusters of email
addresses with frequency of communication above
some threshold, - () Identifies good clusters of communication
- (-) Difficult to score when an email has
addresses spread across more than one cluster - (-) Fixed partitioning and difficult to update
6CISC Scoring (Contd)
- Sender
- Need a notion of soft clustering with both
recipients and content - Generate a measure of its non-variability with
respect to the addresses it co-occurs with or the
content it discusses in emails - Example
- 1 ? 2,3 3,4 2,3,4 in Folder 1
- 2 ? 1 3 4 1 3 1,3 in Folder 2
- Emphasizes social clusters 1,2,3 1,3,4
- Classify 2 ? 1,3,4
- Traditionally Folder 2 (address frequency based)
- CISC Folder 1 (social cluster based)
- Difficult to say upfront which is better !
- Efficacy discussed later
7CISC Scoring (Contd)
- Words or Phrases
- Generate a measure of its importance
- Using context captured through the co-occurring
text - Sample scenarios for score generation
- Different functional groups in a company
mentioning Conference Room ? Low score - A single shipment discussion for company CERN ?
High score - Several different topic discussions (financial,
operational etc.) for company TW ? Low score - Clustering Pair with highest similarity message
and merge clusters sharing atleast one message to
produce disjoint clusters - Directed Training
- For each cluster, identify a message likely to
belong to majority class - Suggest the user to classify this message
8Efficacy of TF-IDF Cosine Similarity
- Clustering using the traditional TF-IDF cosine
similarity measure for emails not very effective
! - Note
- Both TF-IDF and CISC figures with only word and
phrase tokens - Number of clusters is different in both cases,
but the purity figures indicate the
discriminative capability of the respective
algorithms
User TF-IDF ( Purity before merging) TF-IDF ( Purity) CISC ( Purity)
Lokay-M 57.69 46.68 77.54
Beck-S 51.44 9.63 59.66
Sanders-R 61.53 37.45 70.03
Williams-w3 58.92 61.71 90.61
9Efficacy of Social Cluster Based Scoring
User CISC (with social clusters) ( Purity) CISC (without social clusters) ( Purity)
Lokay-M 84.21 77.54
Beck-S 60.52 59.66
Sanders-R 78.28 70.03
Williams-w3 93.31 90.61
10CISC vs. POPFile
- Results
- Purity may sometimes (marginally) decrease with
increasing training set in POPFile !
Training Messages Lokay-M Beck-S Sanders-R Williams-w3
100 62.63 15.60 15.39 6.75
200 66.62 19.70 31.79 35.50
300 69.53 20.01 50.68 35.72
1000 72.68 24.63 36.51 18.40
CISC 80.47 (265) 52.81 (218) 75.67 (146) 91.40 (153)
85.22 (614) 71.47 (587) 84.79 (332) 93.38 (365)
11Conclusion
- Given a set of unclassified emails, the proposed
strategy obtains higher clustering purity with
lower training requirements than POPFile and
TF-IDF based method. - Key differentiators
- Incorporates a combination of communication
cluster and content variability based scoring for
senders instead of the usual tf-idf scoring or
naïve-bayes word model (POPFile), - Picks a set of high-selectivity features for
final message similarity model than retaining
most content of messages (i.e. all
non-stopwords), - Observes and uses the fact that any email in a
class may be close to only a small number of
emails than to all in that class, - Finally, helps lower training requirements
through directed training than indiscriminate
training over as many emails as possible.
12Future Work
- Design and evaluation for non-corporate datasets
- Tuning of message similarity scoring
- Different weights for the score components
- Different range normalization for different
components to boost proportionally - Test feature score proportional to its length
- Richer feature set
- Phrases following the
- Test with substring-free collection e.g. TW
Capacity Release Report and TW are replaced
with Capacity Release Report and TW - Hierarchical word scoring to change granularity
of clustering - Online classification using training directed
feature extraction - Merging high purity clusters effectively to
further reduce training requirements
13Q A