Characteristic Identifier Scoring and Clustering for Email Classification

1
Characteristic Identifier Scoring and Clustering
for Email Classification
  • By
  • Mahesh Kumar Chhaparia

2
Email Clustering
  • Given a set of unclassified emails, the objective
    is to produce high-purity clusters while keeping
    the training requirements low.
  • Outline
  • Characteristic Identifier Scoring and Clustering
    (CISC)
  • Identifier Set
  • Scoring
  • Clustering
  • Directed Training
  • Comparison of CISC with some of the traditional
    ideas in email clustering
  • Comparison of CISC with POPFile (a Naïve-Bayes
    classifier)
  • Caveats
  • Conclusion

3
Evaluation
  • Evaluation on the Enron Email Dataset for the
    following users (purity measured w.r.t. the folder
    grouping already available)

User          Folders   Messages   Smallest folder (msgs)   Largest folder (msgs)
Lokay-M          11       2489              6                      1159
Beck-S          101       1971              3                       166
Sanders-R        30       1188              4                       420
Williams-w3      18       2769              3                      1398
Farmer-D         25       3672              5                      1192
Kitchen-L        47       4015              5                       715
Kaminski-V       41       4477              3                       547
4
CISC Identifier Set
  • Sender and recipients
  • Words from the subject starting with uppercase
  • Tokens from the message body:
  • Word sequences (length 2-5 only) with each word
    starting in uppercase, split about stopwords
    (excluding them)
  • Acronyms (length 2-5 only)
  • Words followed by an apostrophe and "s", e.g.
    "TW's" extracted to "TW"
  • Words or phrases in quotes, e.g. "Trans Western"
  • Words where any character excluding the first is
    in uppercase, e.g. eSpeak, ThinkBank, etc.
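The extraction rules above can be sketched as follows. This is a minimal illustration, not the authors' code: the stopword list and regexes are assumptions, and the rules are applied loosely (e.g. acronyms and mixed-case words may overlap).

```python
import re

# Illustrative stopword list (the actual list used by CISC is not given)
STOPWORDS = {"the", "of", "and", "for", "in", "on", "a", "an", "to"}

def extract_identifiers(subject, body):
    """Sketch of the CISC identifier-extraction rules."""
    ids = set()
    # Words from the subject starting with uppercase
    ids.update(w for w in subject.split() if w[:1].isupper())
    # Capitalized word sequences (length 2-5), split about stopwords
    run = []
    for t in body.split() + [""]:          # sentinel flushes the last run
        if t[:1].isupper() and t.lower() not in STOPWORDS:
            run.append(t)
        else:
            if 2 <= len(run) <= 5:
                ids.add(" ".join(run))
            run = []
    # Acronyms of length 2-5
    ids.update(re.findall(r"\b[A-Z]{2,5}\b", body))
    # Possessives: "TW's" contributes "TW"
    ids.update(re.findall(r"\b(\w+)'s\b", body))
    # Quoted words or phrases
    ids.update(re.findall(r'"([^"]+)"', body))
    # Mixed-case words such as eSpeak, ThinkBank
    ids.update(re.findall(r"\b[A-Za-z][a-z0-9]*[A-Z][A-Za-z0-9]*\b", body))
    return ids
```
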

5
CISC Scoring
  • Sender
  • Initial idea: generate clusters of email
    addresses with frequency of communication above
    some threshold
  • (+) Identifies good clusters of communication
  • (-) Difficult to score when an email has
    addresses spread across more than one cluster
  • (-) Fixed partitioning and difficult to update

6
CISC Scoring (Contd)
  • Sender
  • Need a notion of soft clustering over both
    recipients and content
  • Generate a measure of a sender's non-variability
    with respect to the addresses it co-occurs with or
    the content it discusses in emails
  • Example
  • 1 → {2,3}, {3,4}, {2,3,4} in Folder 1
  • 2 → {1}, {3}, {4}, {1}, {3}, {1,3} in Folder 2
  • Emphasizes social clusters {1,2,3}, {1,3,4}
  • Classify 2 → {1,3,4}
  • Traditionally: Folder 2 (address-frequency based)
  • CISC: Folder 1 (social-cluster based)
  • Difficult to say upfront which is better!
  • Efficacy discussed later
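One plausible reading of the "non-variability" measure on this slide, using its recipient-set example: score a sender by the average pairwise Jaccard overlap of the recipient sets it has used. The exact CISC formula is not given here, so this is an illustrative stand-in; a sender whose recipient sets overlap heavily (a stable social cluster) scores high, a scattered one scores low.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard overlap of two recipient sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def sender_consistency(recipient_sets):
    """Average pairwise Jaccard similarity of a sender's recipient sets:
    one plausible 'non-variability' measure (not the exact CISC formula)."""
    pairs = list(combinations(recipient_sets, 2))
    if not pairs:
        return 1.0  # a single recipient set is trivially consistent
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

On the slide's example, sender 1 (sets {2,3}, {3,4}, {2,3,4}) scores higher than sender 2 (singletons plus {1,3}), matching the intuition that CISC favours the tighter social cluster.
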

7
CISC Scoring (Contd)
  • Words or Phrases
  • Generate a measure of each one's importance
  • Using context captured through the co-occurring
    text
  • Sample scenarios for score generation
  • Different functional groups in a company
    mentioning "Conference Room" → low score
  • A single shipment discussion for company "CERN" →
    high score
  • Several different topic discussions (financial,
    operational, etc.) for company "TW" → low score
  • Clustering: pair each message with its
    highest-similarity message, and merge clusters
    sharing at least one message to produce disjoint
    clusters
  • Directed Training
  • For each cluster, identify a message likely to
    belong to the majority class
  • Suggest that the user classify this message
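The clustering step above (link each message to its most similar peer, then merge links that share a message) can be sketched with a union-find structure. This is a sketch under the stated reading of the slide, not the authors' implementation; `similarity(i, j)` stands in for whatever message-similarity function is used, and at least two messages are assumed.

```python
def cluster_by_best_match(n, similarity):
    """Pair each of n messages with its highest-similarity peer, then
    merge pairs sharing a message into disjoint clusters (union-find)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Link every message to its highest-similarity peer
    for i in range(n):
        best = max((j for j in range(n) if j != i),
                   key=lambda j: similarity(i, j))
        parent[find(i)] = find(best)

    # Collect the resulting disjoint clusters
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(c) for c in clusters.values())
```

Directed training would then pick, per cluster, one representative message (e.g. the one closest to the cluster centroid) and ask the user to label it.
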

8
Efficacy of TF-IDF Cosine Similarity
  • Clustering with the traditional TF-IDF cosine
    similarity measure is not very effective for
    emails!
  • Note
  • Both the TF-IDF and CISC figures use only word
    and phrase tokens
  • The number of clusters differs in the two cases,
    but the purity figures indicate the
    discriminative capability of the respective
    algorithms

User          TF-IDF (% purity, before merging)   TF-IDF (% purity)   CISC (% purity)
Lokay-M                   57.69                         46.68              77.54
Beck-S                    51.44                          9.63              59.66
Sanders-R                 61.53                         37.45              70.03
Williams-w3               58.92                         61.71              90.61
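For reference, the baseline being compared here is plain TF-IDF cosine similarity. A textbook sketch (not the authors' code; documents are token lists assumed to be drawn from the corpus):

```python
import math
from collections import Counter

def tfidf_cosine(doc_a, doc_b, corpus):
    """Standard TF-IDF cosine similarity between two token lists,
    with document frequencies taken from `corpus`."""
    n = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))  # document frequency per term

    def vec(doc):
        tf = Counter(doc)
        # tf * idf weight; terms outside the corpus are assumed absent
        return {t: tf[t] * math.log(n / df[t]) for t in tf}

    va, vb = vec(doc_a), vec(doc_b)
    dot = sum(va[t] * vb.get(t, 0.0) for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```
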
9
Efficacy of Social Cluster Based Scoring
  • Results

User          CISC with social clusters (% purity)   CISC without social clusters (% purity)
Lokay-M                     84.21                                    77.54
Beck-S                      60.52                                    59.66
Sanders-R                   78.28                                    70.03
Williams-w3                 93.31                                    90.61
10
CISC vs. POPFile
  • Results
  • Purity may sometimes decrease (marginally) with
    an increasing training set in POPFile!

Training messages   Lokay-M       Beck-S        Sanders-R     Williams-w3
100                  62.63         15.60          15.39           6.75
200                  66.62         19.70          31.79          35.50
300                  69.53         20.01          50.68          35.72
1000                 72.68         24.63          36.51          18.40
CISC                 80.47 (265)   52.81 (218)    75.67 (146)    91.40 (153)
                     85.22 (614)   71.47 (587)    84.79 (332)    93.38 (365)
11
Conclusion
  • Given a set of unclassified emails, the proposed
    strategy obtains higher clustering purity with
    lower training requirements than POPFile and the
    TF-IDF based method.
  • Key differentiators
  • Incorporates a combination of communication
    cluster and content variability based scoring
    for senders, instead of the usual TF-IDF scoring
    or Naïve-Bayes word model (POPFile)
  • Picks a set of high-selectivity features for the
    final message-similarity model, rather than
    retaining most content of messages (i.e. all
    non-stopwords)
  • Observes and uses the fact that any email in a
    class may be close to only a small number of
    emails rather than to all in that class
  • Finally, helps lower training requirements
    through directed training rather than
    indiscriminate training over as many emails as
    possible

12
Future Work
  • Design and evaluation for non-corporate datasets
  • Tuning of message similarity scoring
  • Different weights for the score components
  • Different range normalization for different
    components to boost them proportionally
  • Test a feature score proportional to its length
  • Richer feature set
  • Phrases following "the"
  • Test with a substring-free collection, e.g. "TW
    Capacity Release Report" and "TW" are replaced
    with "Capacity Release Report" and "TW"
  • Hierarchical word scoring to change the
    granularity of clustering
  • Online classification using training-directed
    feature extraction
  • Merging high-purity clusters effectively to
    further reduce training requirements

13
Q & A