Title: Implementing Neural Networks for Text Classification: Data Sets
1 Implementing Neural Networks for Text Classification: Data Sets
- Prerak Sanghvi
- Computer Science and Engineering Department
- State University of New York at Buffalo
2 Data Set Selection
- There are two types of data sets that can be used:
  - Compilation of documents from the web, etc., done manually and specifically for this project
  - Use of an existing data set that has been worked on by other researchers
3 Advantages of Standard Data Sets
- We don't have to do the work of obtaining the data.
- The distribution of documents in the corpora used is even. Further, the documents are well classified.
- Results can be compared with those of other researchers. This gives a comparative evaluation of the algorithm being used for classification.
4 Most popular corpora
- The most popular corpora used for text-classification research are:
  - Reuters-21578 data set (a set of 21,578 newswire articles from Reuters, available as SGML documents; 1000 documents in each file)
  - 20-newsgroups data (a set of 20,000 newsgroup postings from 20 newsgroups, available as text files; one document per file)
  - WebKB database (web pages from 4 universities class)
5 Reuters-21578 data set
- Data is classified into five groups of classes:

Category set   Number of categories   Categories with 1+ occurrences   Categories with 20+ occurrences
EXCHANGES       39                     32                                 7
ORG             56                     32                                 9
PEOPLE         267                    114                                14
PLACES         175                    147                                60
TOPICS         135                    120                                47
TOTAL          672                    445                               137
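
The three count columns above can be derived from per-document labels. The following is a minimal sketch, assuming the labels of each document for one category set are already available as a Python list. The function name `category_counts` and the toy data are hypothetical and are not part of the Reuters distribution.

```python
from collections import Counter

def category_counts(all_categories, doc_labels, threshold=20):
    """Return (total categories, categories with >= 1 occurrence,
    categories with >= threshold occurrences)."""
    # Count each category at most once per document.
    occurrences = Counter(
        label for labels in doc_labels for label in set(labels)
    )
    at_least_one = sum(1 for c in all_categories if occurrences[c] >= 1)
    at_least_threshold = sum(
        1 for c in all_categories if occurrences[c] >= threshold
    )
    return len(all_categories), at_least_one, at_least_threshold

# Toy data (hypothetical): three defined categories, one never used.
categories = ["nasdaq", "nyse", "lse"]
docs = [["nasdaq"], ["nasdaq", "nyse"], []]
print(category_counts(categories, docs, threshold=2))  # (3, 2, 1)
```

With the real corpus, `threshold=20` would correspond to the last column of the table.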
6 Reuters-21578 data set
- Categories are overlapping and non-exhaustive.
- Overlapping: one document can be classified into more than one category. E.g., a document can be about nasdaq (EXCHANGES) and about USA (PLACES) in general.
- Non-exhaustive: there are categories into which no documents fall, and there are documents that do not fall into any category.
- Categories with 20 or more occurrences are too few. An ANN approach would probably not work with so few examples.
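
One possible way to represent overlapping, non-exhaustive categories as network targets is a multi-hot vector with one position per category. This is only a sketch under that assumption; the slides do not specify a target encoding, and the function name `multi_hot` and the toy labels are hypothetical.

```python
def multi_hot(doc_labels, categories):
    """Encode each document's label list as a 0/1 vector over categories."""
    index = {c: i for i, c in enumerate(categories)}
    vectors = []
    for labels in doc_labels:
        row = [0] * len(categories)
        for label in labels:
            if label in index:  # ignore labels outside the chosen category list
                row[index[label]] = 1
        vectors.append(row)
    return vectors

categories = ["nasdaq", "usa", "japan"]
docs = [["nasdaq", "usa"], [], ["japan"]]
targets = multi_hot(docs, categories)
# A document in two categories gets two 1s; an uncategorized document is all zeros.
print(targets)  # [[1, 1, 0], [0, 0, 0], [0, 0, 1]]
```

Categories with too few occurrences could simply be left out of `categories` before encoding.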
7 Example of a Reuters-21578 document
8 20-newsgroups data set
- Each document is in a separate text file.
- There are 1000 documents from each newsgroup.
- Each document has only one source newsgroup, so each document falls into only one category.
- The task of classification pertains to determining the source newsgroup of the document.
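
Reading such a corpus into (text, label) pairs could look like the sketch below. It assumes a directory layout with one subdirectory per newsgroup and one file per document; that layout, the function name `load_newsgroups`, and the choice of latin-1 decoding are assumptions made for illustration.

```python
from pathlib import Path

def load_newsgroups(root):
    """Return (texts, labels, label_names) from root/<newsgroup>/<document file>."""
    root = Path(root)
    # Sorted subdirectory names give a stable label numbering.
    label_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    texts, labels = [], []
    for label_id, name in enumerate(label_names):
        for path in sorted((root / name).iterdir()):
            if path.is_file():
                # latin-1 maps every byte to a character, so decoding never fails.
                texts.append(path.read_text(encoding="latin-1"))
                labels.append(label_id)
    return texts, labels, label_names
```

Because each document has exactly one source newsgroup, a single integer label per document is enough here.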
9 20-newsgroups data set
alt.atheism rec.sport.hockey
comp.graphics sci.crypt
comp.os.ms-windows.misc sci.electronics
comp.sys.ibm.pc.hardware sci.med
comp.sys.mac.hardware sci.space
comp.windows.x soc.religion.christian
misc.forsale talk.politics.guns
rec.autos talk.politics.mideast
rec.motorcycles talk.politics.misc
rec.sport.baseball talk.religion.misc
10 Example of a 20-newsgroups document
- Newsgroups: alt.atheism
- Path: cantaloupe.srv.cs.cmu.edu!magnesium.club.cc.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!zaphod.mps.ohio-state.edu!uwm.edu!psuvax1!psuvm!smm125
- Organization: Penn State University
- Date: Fri, 23 Apr 1993 18:54:23 EDT
- From: <SMM125@psuvm.psu.edu>
- Message-ID: <93113.185423SMM125@psuvm.psu.edu>
- Subject: Re: YOU WILL ALL GO TO HELL!!!
- References: <93106.155002JSN104@psuvm.psu.edu> <1qq837cm6@usenet.INS.CWRU.Edu>
- Lines: 1
- jsn104 is jeremy scott noonan
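
A posting like this has header lines followed by the message text. A minimal sketch of separating the two with Python's standard `email` module follows, assuming the file uses the usual `Header: value` lines and a blank line before the body. The helper name `split_posting` and the shortened sample text are hypothetical.

```python
from email import message_from_string

# Shortened sample posting (a few headers only, for illustration).
raw = (
    "Newsgroups: alt.atheism\n"
    "Organization: Penn State University\n"
    "Subject: Re: YOU WILL ALL GO TO HELL!!!\n"
    "Lines: 1\n"
    "\n"
    "jsn104 is jeremy scott noonan\n"
)

def split_posting(text):
    """Return (headers as a dict, body text) for one newsgroup posting."""
    msg = message_from_string(text)
    return dict(msg.items()), msg.get_payload()

headers, body = split_posting(raw)
print(headers["Newsgroups"])
print(body)
```

Since the Newsgroups header names the source newsgroup directly, it may be worth dropping it (and possibly other headers) before training, so that the classifier is not simply reading the answer.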