Text Mining Tools

Provided by: aditya7

1
Text Mining Tools
  • 22C196
  • Text Retrieval Text Mining Seminar

2
Tools
  • WordNet
  • MxTerminator
  • LingPipe
  • Stanford TP Tools
  • Stanford-NER
  • SVMLight
  • Rainbow Toolkit
  • Manjal

3
WordNet
  • http://wordnet.princeton.edu/
  • English lexical database
  • Developed at Princeton Univ. by George A. Miller et al.
  • Organized as Synsets
  • Cognitive synonym sets
  • Synsets for Nouns, Verbs, Adjectives and Adverbs

4
WordNet
  • Synsets interlinked via lexical and
    conceptual-semantic relations
  • Network of meaningfully related concepts and
    words
  • Available online and can also be freely
    downloaded
  • Perl and Java packages available to interface
    with WordNet
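
As a rough illustration of what "synsets interlinked via relations" means, the sketch below models a few synsets as nodes with named links and walks a hypernym chain. The entries are toy data invented for this example, not copied from the WordNet database, and the `Synset` class and `hypernym_chain` helper are hypothetical names rather than part of any WordNet package.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """A set of synonymous words plus named links to other synsets."""
    name: str
    words: list
    relations: dict = field(default_factory=dict)  # relation name -> synset names


# Toy entries for illustration only; not taken from WordNet itself.
SYNSETS = {
    "dog": Synset("dog", ["dog", "domestic dog"], {"hypernym": ["canine"]}),
    "canine": Synset("canine", ["canine", "canid"], {"hypernym": ["animal"]}),
    "animal": Synset("animal", ["animal", "creature"]),
}


def hypernym_chain(name, synsets=SYNSETS):
    """Follow the first 'hypernym' link repeatedly and return the path taken."""
    chain = [name]
    while synsets[chain[-1]].relations.get("hypernym"):
        chain.append(synsets[chain[-1]].relations["hypernym"][0])
    return chain


print(hypernym_chain("dog"))
```

Each synset groups words that share a sense, and the relation links are what turn the word list into a network of related concepts.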

5
WordNet
  • WordNet 2.0 on sulu and geordi
  • Command line interface
  • Example
  • /usr/local/WordNet-2.0/bin/wn <word> -over
  • Provides overview of the various senses of <word>
  • /usr/local/WordNet-2.0/bin/wn <word> -synsn
  • Provides list of noun synonyms

6
MxTerminator
  • http://www.id.cbs.dk/dh/corpus/tools/MXTERMINATOR.html
  • Java sentence boundary detection tool
  • Algorithm described in
  • J.C. Reynar and A. Ratnaparkhi. A Maximum Entropy
    Approach to Identifying Sentence Boundaries.
    1997.

7
MxTerminator
  • Installed on sulu and geordi
  • Command-line interface
  • Requires two parameters
  • Trained model directory
  • Text File to parse
  • Syntax
  • /usr/local/mxterminator/mxterminator modeldir textfile
  • Comes with pre-trained model
  • /usr/local/mxterminator/eos.project
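
To make the sentence boundary detection task concrete, here is a naive rule-based baseline. It is not the maximum-entropy method of Reynar and Ratnaparkhi and not how MxTerminator works internally; it only shows why the problem needs more than "split on every period". The abbreviation list and the `split_sentences` name are assumptions made for this sketch.

```python
import re

# Hypothetical, deliberately small abbreviation list for the example.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "univ.", "etc."}


def split_sentences(text):
    """Split text at '.', '!' or '?' using two simple rules."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]", text):
        end = match.end()
        rest = text[end:]
        # Rule 1: a boundary must be followed by whitespace and an
        # uppercase letter or quote, unless it is the end of the text.
        if rest and not re.match(r"\s+[A-Z\"']", rest):
            continue
        # Rule 2: a known abbreviation does not end a sentence.
        last_token = text[start:end].split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


# One sentence per line, as a sentence splitter would emit.
for sentence in split_sentences("Dr. Smith went home. He slept."):
    print(sentence)
```

A trained model replaces hand-written rules like these with features learned from data, which is what the pre-trained model directory supplies.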

8
MxTerminator
  • New models can be trained
  • trainmxterminator takes two arguments
  • A newly created model directory
  • Training data with one sentence per line
  • Package also includes mxpost
  • part-of-speech tagger
  • /usr/local/mxterminator/mxpost modeldir wordfile
  • Pre-built model - /usr/local/mxterminator/tagger.project
  • wordfile - contains words one sentence per line

9
LingPipe
  • http://www.alias-i.com/lingpipe/
  • Suite of Java libraries for different kinds of
    analyses
  • Sentence detection
  • Part-of-speech tagging
  • Named-entity extraction
  • Phrase extraction
  • Entity co-reference
  • Spell checker
  • Clustering
  • Chinese language support

10
LingPipe
  • Also contains tools for database text mining
  • Works directly off a database such as MySQL
  • Package contains demos, tutorials, pre-trained
    models and javadoc
  • Widely used in text mining community
  • Especially for general and biomedical
    named-entity recognition
  • Website has links to blogs and developer
    discussion forum

11
Stanford TP Tools
  • http://nlp.stanford.edu/software/index.shtml
  • Variety of text processing tools
  • Made available by Stanford NLP group
  • All tools are implemented in Java
  • Freely downloadable

12
Stanford TP Tools
  • Parser
  • POS Tagger
  • Named Entity Recognizer
  • Chinese word segmenter
  • Classifier
  • Tregex and Tsurgeon
  • Matching patterns in trees

13
Stanford-NER
  • Based on CRFs
  • Contains demo programs
  • 4 pre-built models
  • 3 class basic model trained on US and UK Newswire
    data from CoNLL, MUC and ACE
  • Labels PERSON, ORGANIZATION and LOCATION
  • 4 class model trained on CoNLL training data
  • Additionally labels MISC
  • 2 more accurate distsim versions of above models

14
Stanford-NER
  • Example
  • java -mx600m -cp ./stanford-ner.jar:. stanfordNER
    ner-eng-ie.crf-3-all2006-distsim.ser.gz text
  • Advanced distsim model
  • Example
  • java -mx300m -cp stanford-ner.jar
    edu.stanford.nlp.ie.crf.CRFClassifier -textFile
    sample.txt
  • Default basic model
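
Once text has been labeled, the entity mentions usually need to be pulled out of the tagged stream. The sketch below assumes the tagger output is whitespace-separated `token/LABEL` items with `O` marking non-entity tokens. That format is an assumption for this example, so check the output of the installed version before relying on it. `extract_entities` is a hypothetical helper, not part of the Stanford distribution.

```python
def extract_entities(tagged_text, outside_label="O"):
    """Group consecutive tokens that share a label into (mention, label) pairs."""
    entities = []
    tokens, label = [], None  # the entity currently being built
    for item in tagged_text.split():
        # Split on the last '/' so tokens that contain '/' keep their text.
        token, _, tag = item.rpartition("/")
        if tag != label and tokens:
            entities.append((" ".join(tokens), label))
            tokens = []
        if tag == outside_label:
            label = None
        else:
            label = tag
            tokens.append(token)
    if tokens:
        entities.append((" ".join(tokens), label))
    return entities


sample = "George/PERSON Miller/PERSON works/O at/O Princeton/ORGANIZATION ./O"
print(extract_entities(sample))
```

Adjacent tokens with the same label are merged into one mention, so two different people mentioned back to back would be joined; a real pipeline would need extra handling for that case.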

15
SVMLight
  • http://svmlight.joachims.org/
  • C support-vector-machine implementation by
    Thorsten Joachims
  • Does classification, regression and ranking
  • Many other functions
  • Estimate error-rate and precision and recall
    directly
  • Freely downloadable
  • Instructions on website

16
SVMLight
  • Contains 2 main executable files
  • svm_learn (learn model from training set)
  • svm_classify (classify test set)
  • Input file contains weighted term vectors
  • Strategy: index doc files using Lucene or SMART
    and obtain term vectors
  • Example: -1 1:0.43 3:0.12 9284:0.2
  • 1 1:0.20 3:0.14 9284:0.97
  • Use different kernel functions
  • Support for linear and non-linear kernels
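
The SVMLight input format puts one example per line: a target label followed by `index:value` pairs in increasing feature-index order. A minimal sketch of producing such lines from term vectors follows. The vocabulary, the raw term-frequency weighting, and the helper names are assumptions for illustration; in practice the weights would come from the indexing step (Lucene or SMART).

```python
from collections import Counter


def term_vector(tokens, vocabulary):
    """Map tokens to {feature_index: term_frequency}, skipping unknown terms."""
    counts = Counter(t for t in tokens if t in vocabulary)
    return {vocabulary[term]: freq for term, freq in counts.items()}


def to_svmlight_line(label, features):
    """Format one example; pairs are sorted by increasing feature index."""
    pairs = [
        f"{index}:{value:g}"
        for index, value in sorted(features.items())
        if value != 0  # zero-valued features are simply omitted
    ]
    return " ".join([str(label)] + pairs)


# Hypothetical vocabulary with 1-based feature indices.
vocabulary = {"mining": 1, "text": 2, "tools": 3}
print(to_svmlight_line(1, term_vector(["text", "mining", "text"], vocabulary)))
print(to_svmlight_line(-1, {3: 0.12, 1: 0.43, 9284: 0.2}))
```

Because the format is sparse, only non-zero features are written, which keeps files small for large vocabularies.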

17
SVMLight
  • Syntax
  • svm_learn [options] example_file model_file
  • svm_classify [options] example_file model_file
    output_file
  • Example data included in distribution
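
After svm_classify has run, the output file holds one decision value per test example, and the sign of that value is the predicted class. The sketch below compares those values against the labels in the example file to get accuracy, precision and recall. It is a hand-rolled check under those assumptions, not a replacement for the figures the tool itself reports, and `evaluate` is a hypothetical helper name.

```python
def evaluate(example_lines, prediction_lines):
    """Return (accuracy, precision, recall) for binary +1/-1 classification."""
    tp = fp = tn = fn = 0
    for example, prediction in zip(example_lines, prediction_lines):
        gold = float(example.split()[0]) > 0   # label is the first field
        predicted = float(prediction) > 0      # sign of the decision value
        if predicted and gold:
            tp += 1
        elif predicted and not gold:
            fp += 1
        elif not predicted and gold:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Made-up lines standing in for the contents of example_file and output_file.
examples = ["1 1:0.5", "-1 2:0.3", "1 3:0.1", "-1 1:0.2"]
predictions = ["0.8", "-0.4", "-0.1", "0.3"]
print(evaluate(examples, predictions))
```

In real use the two lists would be read from the test example file and the svm_classify output file, line by line.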

18
Rainbow Toolkit
  • http://www.cs.cmu.edu/~mccallum/bow/rainbow/
  • Part of the Bow toolkit
  • http://www.cs.cmu.edu/~mccallum/bow/
  • Text Classification tool
  • Supports 4 classification methods
  • Naïve Bayes (default)
  • TFIDF/Rocchio
  • K-nearest neighbor
  • Probabilistic Indexing
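
For readers who have not seen the default method, the sketch below is a textbook multinomial Naïve Bayes classifier with add-one smoothing. It illustrates the general technique only and makes no claim to match Rainbow's implementation or defaults. The function names and the tiny training set are invented for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """labeled_docs: iterable of (label, list_of_words) pairs."""
    class_doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for label, words in labeled_docs:
        class_doc_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_doc_counts, word_counts, vocabulary


def classify(words, model):
    """Return the label with the highest log posterior score."""
    class_doc_counts, word_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_doc_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(doc_count / total_docs)  # class prior
        for word in words:
            if word in vocabulary:  # words never seen in training are ignored
                smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
                score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy training data.
model = train_naive_bayes([
    ("sports", ["ball", "goal", "team"]),
    ("sports", ["team", "win"]),
    ("politics", ["vote", "election"]),
    ("politics", ["vote", "party", "win"]),
])
print(classify(["team", "goal"], model))
```

Working in log space avoids numeric underflow when documents contain many words.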

19
Rainbow Toolkit
  • Building a model
  • rainbow -d ./model --index <datadir>
    --use-stemming --skip-html
  • <datadir> contains individual folders (with text
    files) for each class
  • Model is stored in ./model
  • Test model
  • rainbow -d ~/model --test-set=0.4 --test=3
  • Train-test split is 0.6/0.4, 3 iterations
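
The idea of repeated random train-test splits can be sketched in a few lines. This is a generic illustration of the evaluation scheme with a 0.4 test fraction and 3 trials. How Rainbow itself samples documents is not described in the slides, so `random_splits` and its seed parameter are assumptions.

```python
import random


def random_splits(items, test_fraction=0.4, trials=3, seed=0):
    """Yield (train, test) pairs; each trial reshuffles the full item list."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    n_test = round(len(items) * test_fraction)
    for _ in range(trials):
        shuffled = list(items)
        rng.shuffle(shuffled)
        yield shuffled[n_test:], shuffled[:n_test]


# Hypothetical document names standing in for indexed files.
documents = [f"doc{i}.txt" for i in range(10)]
for train, test in random_splits(documents):
    print(len(train), "train /", len(test), "test")
```

Averaging results over several independent splits gives a steadier estimate than a single split would.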

20
Rainbow Toolkit
  • Test model
  • rainbow -d ~/model --test-set=0.5 --test=1
  • Specify test set
  • Half chosen randomly
  • rainbow -d ~/model --test-files testdir
  • Classify previously unseen files in testdir

21
Rainbow Toolkit
  • Formatted output
  • rainbow-stats
  • Example
  • rainbow -d ./model --test-set=0.4 --test=2 |
    rainbow-stats
  • Confusion matrix, Percent accuracy, Std. error,
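
The statistics named above can be computed by hand from (true class, predicted class) pairs, as sketched here. The standard error shown is the usual sample standard deviation of per-trial accuracies divided by the square root of the number of trials. Whether rainbow-stats uses exactly that formula is an assumption, and the helper names are hypothetical.

```python
import math
import statistics
from collections import Counter


def confusion_matrix(pairs):
    """Count how often each (true, predicted) combination occurs."""
    return Counter(pairs)


def accuracy(pairs):
    """Fraction of pairs where the prediction matches the true class."""
    return sum(1 for true, predicted in pairs if true == predicted) / len(pairs)


def standard_error(values):
    """Standard error of the mean over per-trial values (needs >= 2 values)."""
    return statistics.stdev(values) / math.sqrt(len(values))


# Invented classification results for one trial.
pairs = [("a", "a"), ("a", "b"), ("b", "b"), ("b", "b")]
matrix = confusion_matrix(pairs)
for (true, predicted), count in sorted(matrix.items()):
    print(f"true={true} predicted={predicted} count={count}")
print("accuracy:", accuracy(pairs))
```

Off-diagonal entries of the matrix show which classes get confused with each other, which a single accuracy figure hides.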

22
Manjal
  • Online demo