Text Classification of USENET messages for a Conversation Visualisation System Final Year Project Final Presentation

1
Text Classification of USENET messages for a
Conversation Visualisation System Final Year
Project Final Presentation

Jolyon Hunter
cs91jh_at_surrey.ac.uk
www.jrth.co.uk
Tuesday 6th May 2003

2
Introduction

Aim
To investigate how messages and conversations on USENET newsgroups can be classified automatically as part of a system to visually represent online discussions.
Objectives
To review systems which visualise online discussions, enabling the identification of phenomena to be visualised
To analyse a 250,000-word corpus of text to identify potential cues for classification
To specify and design a system for automatic classification of messages/conversations
To implement, test and evaluate this system

3
Conversation Visualisation Systems? For example: PeopleGarden
Xiong, Rebecca and Donath, Judith (1999). PeopleGarden: Creating Data Portraits for Users. MIT Media Laboratory. http://smg.media.mit.edu/becca/

Others include Loom (Donath et al.), Netscan (Smith), Conversation Map (Sack) and CodeZebra (Diamond et al.)

4
Phenomena to Visualise and how to do it!

Emotion (Happy, Sad)
Agreement/Disagreement (Argument)
Involvement & Sense of Community
Character traits of users, and many more
How to Classify?
Automated Text Analysis
Smokey (Spertus)
WebSOM (Kohonen)
CLUTO (Karypis)

5
Analysis Overview

HOW? Initial observations (phenomena & features), then in-depth corpus analysis
WHAT? 6000 messages from various newsgroups (4 million words)
UniS/CodeZebra Workshop: features (words)
Using System Quirk to extract words and count frequencies (Kontext) >> relative frequencies
Using gCLUTO to visualise data for interpretation
WHY? To formulate programmable rules to code into a system
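
The relative frequencies mentioned on this slide could be computed along the following lines. This is a minimal Python sketch, not the System Quirk/Kontext implementation. It assumes that "relative frequency" means a word's count divided by the total number of words in a message, and the function name and tokenising pattern are illustrative choices only.

```python
import re
from collections import Counter


def relative_frequencies(text):
    """Return {word: count / total_words} for one message (assumed definition)."""
    # Crude tokeniser: lower-case runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(words).items()}


freqs = relative_frequencies("I agree. I really do agree with you.")
print(freqs["agree"])  # 2 occurrences out of 8 words
```

Dividing by message length makes counts from short and long messages comparable, which is what allows a single numeric threshold to be used in a rule.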

6
gCLUTO Visualisations

Visualise clusters and the relationships between
clusters
Possible to see patterns or heuristics to help
derive rules
CLUTO has potential for future use within a
system to automatically classify text - e.g.
real-time clustering

7
Analysis: Creating Rules

Possible to derive example rules from analysis
More analysis: random sample using 6 classes
Similar patterns emerge
Example rules also >>> SYSTEM!

8
System Development

Process Model of Software Engineering: Requirements, Design, Implementation, Testing and Evaluation
System: System Quirk > Rules > Program > CLASSIFICATION
Rule-Based Processor: IF..THEN.. rules coded into Perl program to produce classifications

9
Generic Conversation Visualisation System

10
Message Text Analysis Module

11
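Perl Code: Key points

IF..THEN RULES (as seen earlier): CLASS COUNTER
# Each matching rule adds one to the counter for its class
if (($word eq "agree") && ($relativeword > 0.003)) {
    $AGREEMENT++;
}
CLASSIFICATIONS
# A class is assigned once its counter reaches the threshold
if ($AGREEMENT >= 2) {
    $classification = "AGREEMENT";
}
if ($ARGUMENT >= 2) {
    $classification = "ARGUMENT";
}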
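
The counter-and-threshold idea on this slide can be sketched end to end as follows. Python is used here for brevity rather than the project's Perl, and this is an illustration under stated assumptions, not the project's program. Only the cue word "agree" and the 0.003 relative-frequency cut-off come from the slide. Every other cue word, the `RULES` table, the `classify` function and the "INCONCLUSIVE" fallback label are hypothetical.

```python
# Cue words per class, each with a minimum relative frequency.
# Only ("agree", 0.003) is taken from the slide; all other entries are invented.
RULES = {
    "AGREEMENT": {"agree": 0.003, "yes": 0.003, "exactly": 0.003},
    "ARGUMENT": {"disagree": 0.003, "wrong": 0.003, "nonsense": 0.003},
}
THRESHOLD = 2  # rule hits needed before a class is assigned (assumed value)


def classify(rel_freq):
    """Classify one message from its {word: relative_frequency} mapping."""
    # Class counters: one point for every cue word above its cut-off.
    counters = {
        label: sum(1 for word, minimum in cues.items()
                   if rel_freq.get(word, 0.0) > minimum)
        for label, cues in RULES.items()
    }
    classification = "INCONCLUSIVE"
    # Checked in a fixed order, so a later class overrides an earlier one.
    for label in ("AGREEMENT", "ARGUMENT"):
        if counters[label] >= THRESHOLD:
            classification = label
    return classification, counters


label, counters = classify({"agree": 0.01, "yes": 0.02, "wrong": 0.004})
print(label, counters)
```

Keeping the rules in a table rather than in separate IF statements means new cue words can be added without touching the classification logic.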

12
Testing & Evaluation

Ten sample messages, each either Agreement or Disagreement
Small sample
Key excerpts given to human testers (ten people), who were asked to rate them
System vs. Humans!
System correct 3 times, most inconclusive
Human responses correlate with system, but ambiguities also exist
Conclusions? Results not conclusive but show promise > larger sample, more research
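
A comparison of system output against human ratings could be tallied roughly as below. This is a hedged Python sketch. The majority-vote rule, the function names and the three-way outcome split are assumptions, and the sample data is placeholder input, not the project's test messages or results.

```python
from collections import Counter


def majority_label(ratings):
    """Most common human rating, or None when there is no clear majority."""
    ranked = Counter(ratings).most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # top two labels tie
    return ranked[0][0]


def tally(system_labels, human_ratings):
    """Count correct, incorrect and inconclusive system classifications."""
    outcome = Counter()
    for system, ratings in zip(system_labels, human_ratings):
        humans = majority_label(ratings)
        if system == "INCONCLUSIVE" or humans is None:
            outcome["inconclusive"] += 1
        elif system == humans:
            outcome["correct"] += 1
        else:
            outcome["incorrect"] += 1
    return outcome


# Placeholder data for illustration only.
system_labels = ["AGREEMENT", "INCONCLUSIVE", "ARGUMENT"]
human_ratings = [
    ["AGREEMENT", "AGREEMENT", "ARGUMENT"],
    ["ARGUMENT", "ARGUMENT", "ARGUMENT"],
    ["AGREEMENT", "AGREEMENT", "ARGUMENT"],
]
result = tally(system_labels, human_ratings)
print(dict(result))
```

Counting inconclusive cases separately keeps them from being mistaken for errors, which matters when most messages fall below the rule thresholds.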

13
Recap: Mission Accomplished?

Aim
To investigate how messages and conversations on USENET newsgroups can be classified automatically as part of a system to visually represent online discussions.
Objectives
To review systems which visualise online discussions, enabling the identification of phenomena to be visualised
To analyse a 250,000-word corpus of text to identify potential cues for classification
To specify and design a system for automatic classification of messages/conversations
To implement, test and evaluate this system

14
Text Classification of USENET messages for a
Conversation Visualisation System

Thanks for listening
Any Questions?

15
Final Report

The Final Report for this project is also
available online at
www.jrth.co.uk

16
REFERENCES

Loom (Judith Donath): Donath, Judith (2002). A Semantic Approach to Visualising Online Conversation. Communications of the ACM 45(4), 45-49. http://web.media.mit.edu/kkarahal/loom/index.html
Conversation Map (Warren Sack): Sack, Warren (2000). Design for Very Large-Scale Conversations. Ph.D. Thesis, February 2000, MIT Media Laboratory. http://www.sims.berkeley.edu/sack/cm/
Netscan (Marc Smith): Smith, Marc (2001). Netscan: A tool for measuring and mapping social cyberspaces. http://netscan.research.microsoft.com
PeopleGarden (Rebecca Xiong and Judith Donath): Xiong, Rebecca and Donath, Judith (1999). PeopleGarden: Creating Data Portraits for Users. MIT Media Laboratory. http://smg.media.mit.edu/becca/
CodeZebra (Sara Diamond): Diamond, Sara (Project Leader), Banff New Media Institute, Canada, plus many others (inc. Dr. A. Salway, University of Surrey). http://www.codezebra.net
Smokey (Ellen Spertus): Spertus, Ellen (1997). "Smokey: Automatic Recognition of Hostile Messages". Innovative Applications of Artificial Intelligence 97. http://www.spertus.com/ellen/
WebSOM (Teuvo Kohonen): Kohonen, T. (1996 onwards). More details at http://websom.hut.fi/websom/
CLUTO (George Karypis): Karypis, George (2002). CLUTO, gCLUTO and wCLUTO. University of Minnesota, MN, USA. Software available from http://www-users.cs.umn.edu/karypis/cluto/