Implementing Neural Networks for Text Classification: Data Sets presentation

1
Implementing Neural Networks for Text Classification: Data Sets

Prerak Sanghvi
Computer Science and Engineering Department
State University of New York at Buffalo

2
Data Set Selection

There are two types of data sets that can be used:
A compilation of documents gathered manually from the web and other sources, specifically for this project.
An existing data set that other researchers have already worked on.

3
Advantages of Standard Data Sets

We don't have to do the work of obtaining the data.
The distribution of documents across classes in these corpora is even, and the documents are well classified.
Results can be compared with those of other researchers, which gives a comparative evaluation of the classification algorithm being used.

4
Most popular corpora

The most popular corpora used for text-classification research are:
Reuters-21578 data set (21,578 newswire articles from Reuters, available as SGML files with 1,000 documents in each file)
20-newsgroups data set (20,000 newsgroup postings from 20 newsgroups, available as text files with one document per file)
WebKB database (web pages from 4 universities, grouped by class)
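
Because Reuters-21578 packs many documents into each SGML file, a loader first has to split a file into documents and pull out the category tags. The following is a minimal sketch, not the presentation's own code. It assumes the tag layout of the public Reuters-21578 distribution (one REUTERS element per document, category sets such as TOPICS holding D entries, and the article text in BODY); the function name parse_reuters is hypothetical.

```python
import re

# Assumed tag names for the five category sets in the SGML files.
CATEGORY_SETS = ("TOPICS", "PLACES", "PEOPLE", "ORGS", "EXCHANGES")


def parse_reuters(sgml):
    """Split one SGML file's text into documents with labels and body text."""
    docs = []
    for match in re.finditer(r"<REUTERS\b[^>]*>(.*?)</REUTERS>", sgml, re.DOTALL):
        block = match.group(1)
        labels = {}
        for name in CATEGORY_SETS:
            inner = re.search(rf"<{name}>(.*?)</{name}>", block, re.DOTALL)
            # Each category is wrapped in <D>...</D>; an empty set yields [].
            labels[name] = re.findall(r"<D>(.*?)</D>", inner.group(1)) if inner else []
        body = re.search(r"<BODY>(.*?)</BODY>", block, re.DOTALL)
        docs.append({"labels": labels, "body": body.group(1).strip() if body else ""})
    return docs
```

A regular-expression pass like this is enough for a first look at the data; it does not decode character entities or clean up trailing markers in the body text, which a real loader would probably need to handle.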

5
Reuters-21578 data set

Data is classified into five groups of classes (category sets):

Category set   Categories   With 1+ occurrences   With 20+ occurrences
EXCHANGES              39                    32                      7
ORG                    56                    32                      9
PEOPLE                267                   114                     14
PLACES                175                   147                     60
TOPICS                135                   120                     47
TOTAL                 672                   445                    137

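
The arithmetic behind the table can be checked with a short script. This is only a sketch: the counts are copied from the table, the last column is read as "categories with 20 or more occurrences" (an assumption about the flattened header), and the function names are hypothetical.

```python
# Counts copied from the Reuters-21578 table in this presentation.
# Each tuple: (categories, categories with 1+ occurrences,
#              categories with 20+ occurrences).
CATEGORY_SETS = {
    "EXCHANGES": (39, 32, 7),
    "ORG": (56, 32, 9),
    "PEOPLE": (267, 114, 14),
    "PLACES": (175, 147, 60),
    "TOPICS": (135, 120, 47),
}


def column_totals(table):
    """Sum each column across all category sets."""
    return tuple(sum(column) for column in zip(*table.values()))


def share_with_20_plus(table):
    """Fraction of each set's categories that have 20 or more occurrences."""
    return {
        name: with_20 / total
        for name, (total, _with_1, with_20) in table.items()
    }


if __name__ == "__main__":
    print(column_totals(CATEGORY_SETS))
    for name, share in share_with_20_plus(CATEGORY_SETS).items():
        print(f"{name}: {share:.0%}")
```

The column sums reproduce the TOTAL row (672, 445, 137), so only about a fifth of all categories (137 of 672) have 20 or more occurrences.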
6
Reuters-21578 data set

Categories are overlapping and non-exhaustive.
Overlapping: one document can be classified into more than one category. E.g., a document can be about NASDAQ (EXCHANGES) and about the USA (PLACES) in general.
Non-exhaustive: there are categories into which no documents fall, and there are documents that do not fall into any category.
Too few categories have 20 or more occurrences. An ANN approach would probably not work with so few examples.
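
One common way to represent overlapping, non-exhaustive labels for a neural network is a multi-hot target vector: one position per category, with any number of positions set to 1, including none. The presentation does not say which encoding was used, so this is only an illustrative sketch; the category list and the function name multi_hot are hypothetical.

```python
def multi_hot(doc_labels, categories):
    """Return a 0/1 target vector with one position per category."""
    position = {name: i for i, name in enumerate(categories)}
    target = [0] * len(categories)
    for label in doc_labels:
        target[position[label]] = 1
    return target


# Hypothetical category list mixing category sets, as in the NASDAQ/USA example.
categories = ["nasdaq", "usa", "cocoa"]
overlapping = multi_hot(["nasdaq", "usa"], categories)  # two categories at once
uncategorized = multi_hot([], categories)               # document in no category
```

Here the "cocoa" position stays 0 for both documents, which mirrors a category into which no documents fall.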

7
Example of a Reuters-21578 document

8
20-newsgroups data set

Each document is in a separate text file.
There are 1,000 documents from each newsgroup.
Each document has only one source newsgroup, so each document falls into only one category.
The classification task is to determine the source newsgroup of each document.
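
Since each document sits in its own file and has exactly one source newsgroup, labels can come straight from the file system. A minimal sketch, assuming the archive unpacks into one subdirectory per newsgroup (for example a directory named alt.atheism holding that group's postings); the directory layout, the function name load_newsgroups, and the latin-1 decoding are assumptions, not details given in the presentation.

```python
from pathlib import Path


def load_newsgroups(root):
    """Read every posting under root and label it with its directory name."""
    texts, labels = [], []
    group_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    for group_dir in group_dirs:
        for doc_path in sorted(p for p in group_dir.iterdir() if p.is_file()):
            # latin-1 maps every byte to a character, so decoding never fails
            # on old postings with unknown encodings.
            texts.append(doc_path.read_text(encoding="latin-1"))
            labels.append(group_dir.name)
    return texts, labels
```

Because every document has a single label, the result is a plain single-label classification problem with 20 classes.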

9
20-newsgroups data set
alt.atheism                 rec.sport.hockey
comp.graphics               sci.crypt
comp.os.ms-windows.misc     sci.electronics
comp.sys.ibm.pc.hardware    sci.med
comp.sys.mac.hardware       sci.space
comp.windows.x              soc.religion.christian
misc.forsale                talk.politics.guns
rec.autos                   talk.politics.mideast
rec.motorcycles             talk.politics.misc
rec.sport.baseball          talk.religion.misc
10
Example of a 20-newsgroups document

Newsgroups: alt.atheism
Path: cantaloupe.srv.cs.cmu.edu!magnesium.club.cc.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!zaphod.mps.ohio-state.edu!uwm.edu!psuvax1!psuvm!smm125
Organization: Penn State University
Date: Fri, 23 Apr 1993 18:54:23 EDT
From: <SMM125@psuvm.psu.edu>
Message-ID: <93113.185423SMM125@psuvm.psu.edu>
Subject: Re: YOU WILL ALL GO TO HELL!!!
References: <93106.155002JSN104@psuvm.psu.edu> <1qq837cm6@usenet.INS.CWRU.Edu>
Lines: 1

jsn104 is jeremy scott noonan
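
A posting like this one has a header block followed by a blank line and the body, so the two parts can be separated before feature extraction. The sketch below uses Python's standard email parser and assumes the headers are in the usual "Name: value" form; the function name split_article is hypothetical, and the sample is abbreviated from the example posting.

```python
from email import message_from_string


def split_article(raw_article):
    """Split a raw posting into a header dict and the body text."""
    message = message_from_string(raw_article)
    headers = dict(message.items())  # later duplicates overwrite earlier ones
    body = message.get_payload()     # a string for plain, non-multipart posts
    return headers, body


# Abbreviated copy of the example posting, for illustration only.
raw = (
    "Newsgroups: alt.atheism\n"
    "Organization: Penn State University\n"
    "Lines: 1\n"
    "\n"
    "jsn104 is jeremy scott noonan\n"
)
headers, body = split_article(raw)
```

Note that the Newsgroups header names the source newsgroup directly, so a classifier that sees the headers could read the answer off them. Training on the body alone is one reasonable way to avoid that, although the presentation does not state which parts of the posting were used.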
