Implementing Neural Networks for Text Classification: Data Sets




1
Implementing Neural Networks for Text
Classification: Data Sets
  • Prerak Sanghvi
  • Computer Science and Engineering Department
  • State University of New York at Buffalo

2
Data Set Selection
  • Two types of data sets can be used:
  • A compilation of documents gathered manually
    (from the web, etc.) specifically for this project
  • An existing data set that other researchers have
    already worked on

3
Advantages of Standard Data Sets
  • We don't have to do the work of obtaining the data
  • The distribution of documents in the corpora used is
    even. Further, the documents are well classified
  • Results can be compared with those of other
    researchers. This gives a comparative evaluation
    of the algorithm being used for classification.

4
Most popular corpora
  • The most popular corpora used for text-classification
    research are:
  • Reuters-21578 data set (a set of 21,578 newswire
    articles from Reuters, available as SGML
    documents, 1000 documents in each file)
  • 20-newsgroups data (a set of 20,000 newsgroup
    postings from 20 newsgroups, available as text
    files, one document per file)
  • WebKB database (web pages from 4 universities
    class)

5
Reuters-21578 data set
  • Data is classified into five groups of classes

Category set   Number of     Categories with    Categories with
               categories    1+ occurrences     20+ occurrences
EXCHANGES          39              32                   7
ORG                56              32                   9
PEOPLE            267             114                  14
PLACES            175             147                  60
TOPICS            135             120                  47
TOTAL             672             445                 137
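
To make the table easier to read, the share of categories in each set that have enough examples can be computed directly. This is only a small sketch: the counts are copied from the table, the last column is read as "20 or more occurrences" (an assumption), and the names `category_sets` and `totals` are illustrative.

```python
# Counts copied from the table above:
# (number of categories, with >= 1 occurrence, with >= 20 occurrences).
# Reading the last column as "20 or more" is an assumption.
category_sets = {
    "EXCHANGES": (39, 32, 7),
    "ORG": (56, 32, 9),
    "PEOPLE": (267, 114, 14),
    "PLACES": (175, 147, 60),
    "TOPICS": (135, 120, 47),
}

# Column-wise sums reproduce the TOTAL row of the table.
totals = tuple(sum(column) for column in zip(*category_sets.values()))

for name, (total, used, frequent) in category_sets.items():
    print(f"{name:10s} {frequent / total:6.1%} of categories have 20+ documents")

overall = totals[2] / totals[0]
print(f"{'TOTAL':10s} {overall:6.1%} of categories have 20+ documents")
```

By these counts, roughly one category in five has 20 or more documents.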
6
Reuters-21578 data set
  • Categories are overlapping and non-exhaustive.
  • Overlapping: one document can be classified into
    more than one category. E.g. a document can be
    about NASDAQ (EXCHANGES) and about the USA
    (PLACES) in general.
  • Non-exhaustive: there are categories into which
    no documents fall, and there are documents that
    do not fall into any category.
  • Too few categories have 20 or more occurrences. An
    ANN approach would probably not work with so few
    examples.
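
One common way to handle overlapping, non-exhaustive categories is to give each document a 0/1 target vector with one slot per category, after dropping categories that have too few documents. The sketch below is an illustration, not the method used in this project: the function names, the toy label lists, and the category name "earn" are assumptions.

```python
from collections import Counter


def frequent_categories(label_lists, min_count=20):
    """Return categories assigned to at least min_count documents, sorted by name."""
    counts = Counter(label for labels in label_lists for label in set(labels))
    return sorted(c for c, n in counts.items() if n >= min_count)


def multi_hot(labels, categories):
    """Encode one document's label set as a 0/1 vector, one slot per category."""
    wanted = set(labels)
    return [1 if c in wanted else 0 for c in categories]


# Hypothetical toy data: one label list per document.
# A document may have several labels (overlapping) or none (non-exhaustive).
label_lists = [["nasdaq", "usa"], ["usa"], [], ["earn", "usa"]]

# Toy threshold of 2 so the example is small; the slides suggest 20.
categories = frequent_categories(label_lists, min_count=2)
vectors = [multi_hot(labels, categories) for labels in label_lists]
print(categories, vectors)
```

A document with no surviving category simply gets an all-zero vector, so it can still be used as a negative example for every category.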

7
Example of a Reuters-21578 document
8
20-newsgroup data set
  • Each document is in a separate text file.
  • There are 1000 documents from each newsgroup.
  • Each document has only one source newsgroup, so
    each document falls into only one category.
  • The classification task is to determine the
    source newsgroup of each document.
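
Because each document sits in its own file and has exactly one source newsgroup, the label can be taken from the directory that holds the file. The sketch below assumes a layout of one sub-directory per newsgroup under a root directory. That layout, the function name `load_one_label_corpus`, and the `latin-1` encoding are assumptions, not details given in the slides.

```python
from pathlib import Path


def load_one_label_corpus(root):
    """Read a corpus laid out as root/<newsgroup>/<document file>.

    Returns (texts, targets, labels): targets[i] is the index into
    labels of the newsgroup that texts[i] came from.
    """
    root = Path(root)
    # One class per sub-directory, in a stable alphabetical order.
    labels = sorted(p.name for p in root.iterdir() if p.is_dir())
    label_ids = {name: i for i, name in enumerate(labels)}

    texts, targets = [], []
    for name in labels:
        for doc in sorted((root / name).iterdir()):
            if doc.is_file():
                # latin-1 never fails to decode; the real encoding is an assumption.
                texts.append(doc.read_text(encoding="latin-1"))
                targets.append(label_ids[name])
    return texts, targets, labels
```

Calling `load_one_label_corpus` on the unpacked data set directory would give parallel lists of document texts and integer class targets, which is the usual input shape for a single-label classifier.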

9
20-newsgroups data set
alt.atheism                 rec.sport.hockey
comp.graphics               sci.crypt
comp.os.ms-windows.misc     sci.electronics
comp.sys.ibm.pc.hardware    sci.med
comp.sys.mac.hardware       sci.space
comp.windows.x              soc.religion.christian
misc.forsale                talk.politics.guns
rec.autos                   talk.politics.mideast
rec.motorcycles             talk.politics.misc
rec.sport.baseball          talk.religion.misc
10
Example of a 20-newsgroup document
  • Newsgroups: alt.atheism
  • Path: cantaloupe.srv.cs.cmu.edu!magnesium.club.cc.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!zaphod.mps.ohio-state.edu!uwm.edu!psuvax1!psuvm!smm125
  • Organization: Penn State University
  • Date: Fri, 23 Apr 1993 18:54:23 EDT
  • From: <SMM125@psuvm.psu.edu>
  • Message-ID: <93113.185423SMM125@psuvm.psu.edu>
  • Subject: Re: YOU WILL ALL GO TO HELL!!!
  • References: <93106.155002JSN104@psuvm.psu.edu>
    <1qq837cm6@usenet.INS.CWRU.Edu>
  • Lines: 1
  • jsn104 is jeremy scott noonan
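
A posting like this one is a block of "Name: value" header lines, a blank line, and then the body, so Python's standard `email` module can split it. The sketch below is a minimal illustration: the `raw` string is a shortened copy of the example with the header colons written out, and the helper name `split_posting` is hypothetical.

```python
from email import message_from_string

# Shortened copy of the example posting, with the header colons written out.
raw = (
    "Newsgroups: alt.atheism\n"
    "Organization: Penn State University\n"
    "Subject: Re: YOU WILL ALL GO TO HELL!!!\n"
    "Lines: 1\n"
    "\n"
    "jsn104 is jeremy scott noonan\n"
)


def split_posting(raw_text):
    """Return (newsgroup label, subject, body text) for one posting."""
    msg = message_from_string(raw_text)
    label = msg["Newsgroups"]          # the class to predict
    subject = msg["Subject"]
    body = msg.get_payload().strip()   # everything after the blank line
    return label, subject, body


label, subject, body = split_posting(raw)
print(label, "|", subject, "|", body)
```

Since the Newsgroups header states the answer, it would normally be removed from the text before features are extracted. Otherwise the classifier could read the label straight from its input.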