Title: Text Classification of USENET messages for a Conversation Visualisation System: Final Year Project Final Presentation
1. Text Classification of USENET messages for a
Conversation Visualisation System: Final Year
Project Final Presentation
- Jolyon Hunter
- cs91jh_at_surrey.ac.uk
- www.jrth.co.uk
- Tuesday 6th May 2003
2. Introduction
- Aim
- To investigate how messages and conversations on USENET newsgroups can be classified automatically as part of a system to visually represent online discussions.
- Objectives
- To review systems which visualise online discussions, enabling the identification of phenomena to be visualised
- To analyse a 250,000 word corpus of text and try to identify potential cues for classification
- To specify and design a system for automatic classification of messages/conversations
- To implement, test and evaluate this system
3. Conversation Visualisation Systems? For example:
PeopleGarden
Xiong, Rebecca & Donath, Judith 1999
PeopleGarden: Creating Data Portraits for Users
MIT Media Laboratory http://smg.media.mit.edu/becca/
- Others include Loom (Donath et al), Netscan (Smith), Conversation Map (Sack) and CodeZebra (Diamond et al)
4. Phenomena to Visualise and how to do it!
- Emotion (Happy, Sad)
- Agreement/Disagreement (Argument)
- Involvement, Sense of Community
- Character traits of users, and many more
- How to Classify?
- Automated Text Analysis
- Smokey (Spertus)
- WebSOM (Kohonen)
- CLUTO (Karypis)
5. Analysis Overview
- HOW? Initial observations (phenomena, features), then in-depth corpus analysis
- WHAT? 6000 messages from various newsgroups (4 million words)
- UniS/CodeZebra Workshop: features (words)
- Using System Quirk to extract words and count frequencies (Kontext) >> Relative Frequencies
- Using gCLUTO to visualise data for interpretation
- WHY? Formulate programmable rules to code into a system
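The frequency-counting step can be sketched as follows. This is a minimal illustration in Python rather than the project's own tooling (System Quirk/Kontext) or Perl. It assumes that "relative frequency" means a word's count divided by the total number of words in the message; the slides do not define the term. The function name `relative_frequencies` and the simple tokeniser are hypothetical.

```python
import re
from collections import Counter


def relative_frequencies(text):
    """Return {word: count / total_words} for one message (assumed definition)."""
    # Hypothetical tokeniser: lower-cased runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    if total == 0:
        return {}  # empty message: nothing to count
    counts = Counter(words)
    return {word: n / total for word, n in counts.items()}


# Example message invented for illustration only.
freqs = relative_frequencies("I agree. I really do agree with you.")
```

Values such as `freqs["agree"]` can then be compared against a threshold when writing rules.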
6. gCLUTO Visualisations
- Visualise clusters and the relationships between clusters
- Possible to see patterns or heuristics to help derive rules
- CLUTO has potential for future use within a system to automatically classify text, e.g. real-time clustering
7. Analysis: Creating Rules
- Possible to derive example rules from analysis
- More analysis: random sample using 6 classes
- Similar patterns emerge
- Example rules also >>> SYSTEM!
8. System Development
- Process Model of Software Engineering: Requirements, Design, Implementation, Testing and Evaluation
- System: System Quirk > Rules > Program > CLASSIFICATION
- Rule-Based Processor: IF..THEN.. rules coded into Perl program to produce classifications
9. Generic Conversation Visualisation System
10. Message Text Analysis Module
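11. Perl Code: Key points
- IF..THEN rules (as seen earlier) increment a class counter:
    # a cue word that is frequent enough counts towards its class
    if (($word eq "agree") && ($relativeword > 0.003)) {
        $AGREEMENT++;
    }
- Classifications:
    # a class is assigned once its counter exceeds 2
    if ($AGREEMENT > 2) {
        $classification = "AGREEMENT";
    }
    if ($ARGUMENT > 2) {
        $classification = "ARGUMENT";
    }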
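The two stages above (rules incrementing class counters, then counters deciding the classification) might be put together as in the sketch below. It is written in Python rather than the project's Perl, purely as an illustration. Only the cue word "agree", the 0.003 threshold and the "counter greater than 2" test come from the slide. The other cue words, the `RULES` table, the `classify` function and the "INCONCLUSIVE" fallback are assumptions.

```python
# Hypothetical rule table: (cue word, minimum relative frequency) per class.
# Only ("agree", 0.003) is taken from the slide; the rest are illustrative.
RULES = {
    "AGREEMENT": [("agree", 0.003), ("yes", 0.003), ("exactly", 0.003)],
    "ARGUMENT": [("wrong", 0.003), ("disagree", 0.003), ("nonsense", 0.003)],
}


def classify(relative, min_hits=2):
    """Classify one message from its {word: relative frequency} mapping."""
    counters = {label: 0 for label in RULES}
    for label, rules in RULES.items():
        for word, threshold in rules:
            if relative.get(word, 0.0) > threshold:
                counters[label] += 1  # one rule fired for this class

    classification = "INCONCLUSIVE"  # assumed fallback when no class qualifies
    for label, hits in counters.items():
        if hits > min_hits:
            # Later classes override earlier ones, as sequential ifs would.
            classification = label
    return classification
```

For example, `classify({"agree": 0.01, "yes": 0.01, "exactly": 0.02})` fires three AGREEMENT rules, while a message matching only one rule stays "INCONCLUSIVE".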
12. Testing & Evaluation
- Ten sample messages, each either Agreement or Disagreement
- Small sample
- Key excerpts given to human testers (ten people), who were asked to rate them
- System vs. Humans!
- System correct 3 times, most inconclusive
- Human responses correlate with system, but ambiguities also exist
- Conclusions? Results not conclusive but show promise > Larger sample, more research
13. Recap: Mission Accomplished?
- Aim
- To investigate how messages and conversations on USENET newsgroups can be classified automatically as part of a system to visually represent online discussions.
- Objectives
- To review systems which visualise online discussions, enabling the identification of phenomena to be visualised
- To analyse a 250,000 word corpus of text and try to identify potential cues for classification
- To specify and design a system for automatic classification of messages/conversations
- To implement, test and evaluate this system
14. Text Classification of USENET messages for a
Conversation Visualisation System
- Thanks for listening
- Any Questions?
15. Final Report
- The Final Report for this project is also available online at www.jrth.co.uk
16. REFERENCES
- Loom (Judith Donath): Donath, Judith 2002. A Semantic Approach to Visualising Online Conversation. Communications of the ACM 45(4), 45-49. http://web.media.mit.edu/kkarahal/loom/index.html
- Conversation Map (Warren Sack): Sack, Warren 2000. Design for Very Large-Scale Conversations. Ph.D. Thesis, February 2000, MIT Media Laboratory. http://www.sims.berkeley.edu/sack/cm/
- Netscan (Marc Smith): Smith, Marc 2001. Netscan: A tool for measuring and mapping social cyberspaces. http://netscan.research.microsoft.com
- PeopleGarden (Rebecca Xiong & Judith Donath): Xiong, Rebecca & Donath, Judith 1999. PeopleGarden: Creating Data Portraits for Users. MIT Media Laboratory. http://smg.media.mit.edu/becca/
- CodeZebra (Sara Diamond): Diamond, Sara (Project Leader), Banff New Media Institute, Canada, plus many others (inc. Dr. A. Salway, University of Surrey). http://www.codezebra.net
- Smokey (Ellen Spertus): Spertus, Ellen 1997. "Smokey: Automatic Recognition of Hostile Messages", Innovative Applications of Artificial Intelligence 97. http://www.spertus.com/ellen/
- WebSOM (Teuvo Kohonen): Kohonen, T. 1996 onwards. More details at http://websom.hut.fi/websom/
- CLUTO (George Karypis): Karypis, George 2002. CLUTO, gCLUTO and wCLUTO. University of Minnesota, MN, USA. Software available from http://www-users.cs.umn.edu/karypis/cluto/