CIS392 Text Processing, Retrieval, and Mining - PowerPoint PPT Presentation

Provided by: wu8
1
CIS392 Text Processing, Retrieval, and Mining
  • Instructor: Dr. Y. F. Brook Wu
  • BOW toolkit
  • http://www.cs.cmu.edu/~mccallum/bow

2
Log in to AFS
  • On campus: go to a computer lab in GITC 2305.
  • At home: make sure the internet connection has
    been established.
  • Assume everyone has Windows at home. Click on
    Start → Run.
  • Type in "telnet afs1.njit.edu" (without the
    quotes; the first screen shows some useful
    information).
  • Enter your user name and password.
  • What if your account doesn't work? Call the help
    desk at 973.596.2900; they can reset your
    password for you.

3
Useful UNIX commands
  • Note: All filenames and commands in a UNIX system
    are case sensitive.
  • General syntax:
  • command -option argument
  • Options modify the way a command works, and they
    are optional.
  • Arguments are usually files; sometimes they are
    optional too.
  • Ex: rm -r directory_name
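
The general syntax can be tried out safely in a scratch directory. The sketch below is only an illustration, not part of the assignment; the names syntax_demo and Notes.txt are made up for the example, and the case-sensitivity check assumes a typical case-sensitive UNIX file system.

```shell
# Work in a throwaway directory so nothing important is touched.
demo=$(mktemp -d)
cd "$demo"

mkdir syntax_demo                 # command + argument (no option)
touch syntax_demo/Notes.txt       # create an empty file inside it

ls syntax_demo                    # command + argument: plain listing
ls -l syntax_demo                 # command + option + argument: long listing

# Names are case sensitive: notes.txt and Notes.txt are different names.
if [ -e syntax_demo/notes.txt ]; then
    case_result="found"
else
    case_result="not found"
fi
echo "notes.txt: $case_result"

rm -r syntax_demo                 # -r removes the directory and its contents
```

The same command (ls) behaves differently once an option is added, while the argument stays the same.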

4
Note
  • Typing two hyphens (--) next to each other in MS
    PowerPoint makes them look like a single long
    dash. The BOW and UNIX commands you see in these
    slides may therefore be confusing. Please refer
    to the BOW help file and the UNIX documentation
    for their actual usage.

5
Useful UNIX commands
  • man (for manual), ex: man ls (manual for the ls
    command)
  • cd (change directory)
  • ls (list files and attributes)
  • dir (list files)
  • mkdir (create a directory)
  • rm (delete a file)
  • rm -fr directory_name (delete the whole
    directory and the files inside it.)

6
Useful UNIX commands
  • rmdir (remove directory)
  • cp (copy)
  • pwd (current working directory)
  • pico (a text editor)
  • more filename (read a plain text file one screen
    at a time. Press the space bar to continue and q
    to quit.)
  • quota (disk space)
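
A short practice session that strings several of these commands together might look like the following. It is a sketch that runs entirely inside a temporary directory; the names practice, a.txt, b.txt, and empty_dir are invented for the example.

```shell
work=$(mktemp -d)      # scratch area; stands in for your home directory
cd "$work"
pwd                    # print the current working directory

mkdir practice         # create a directory
cd practice
echo "hello" > a.txt   # make a small file to play with
cp a.txt b.txt         # copy it
ls                     # list files: a.txt and b.txt
rm a.txt               # delete one file

cd ..
mkdir empty_dir
rmdir empty_dir        # rmdir only removes EMPTY directories

rm -fr practice        # removes practice and b.txt inside it without asking
remaining=$(ls)        # nothing should be left in the scratch area
```

Because rm -fr does not ask for confirmation, it is worth running pwd first to confirm where you are.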

7
More useful UNIX commands
  • http://www.njit.edu/CSD/Docs/unixcmds.html
  • http://www.njit.edu/Directory/Admin/CSD/Academic_Computing/Manuals/UNIX/UNIX.html

8
How to create your home page on AFS system?
  • Help info:
    http://www-ec.njit.edu/ec_info/newuser/web/web.html
  • Execute this command at the UNIX prompt:
    /usr/ec/bin/home.page.setup
  • Your URL:
    http://www-ec.njit.edu/~yourusername

9
Overview of Retrieval Experiment
  • Create a sub-directory for CIS392 assignments
    under ~your_user_name/public_html
  • Create 3 sub-directories under the above
    directory for the 3 automatic indexing options
  • Perform automatic indexing based on 3 different
    indexing options

10
Overview of Retrieval Experiment (cont)
  • Perform retrieval for each of the above 3
    automatic indexing options.
  • Analyze how the different indexing options
    affect retrieval.
  • Make an HTML page to present your results.

11
Creating sub directories
  • Change directory to public_html by typing:
    cd public_html
  • mkdir cis392 (now you've created a directory for
    your CIS392 retrieval assignments)
  • cd cis392 (go inside the cis392 directory)

12
Creating three sub-directories
  • mkdir model1 (this directory stores results from
    the default settings: no stemming, and stopped
    words removed.)
  • mkdir model2 (this directory stores results from
    the following settings: stemming, and stopped
    words removed.)
  • mkdir model3 (this directory stores results from
    the following settings: no stemming, and stopped
    words included.)
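
Put together, the directory setup from the last two slides can be sketched as below. ROOT is a stand-in introduced for this example so the commands can be tried anywhere; on AFS you would start from your own home directory, where public_html already exists once the home page has been set up.

```shell
# ROOT stands in for your AFS home directory (an assumption for this demo).
ROOT=$(mktemp -d)
mkdir "$ROOT/public_html"      # on AFS this directory should already exist

cd "$ROOT/public_html"
mkdir cis392                   # directory for the CIS392 assignments
cd cis392
mkdir model1 model2 model3     # mkdir accepts several names at once
ls                             # lists model1, model2, and model3
```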

13
URL of your retrieval experiment
  • http://web.njit.edu/~yourusername/cis392/cis392re.html
  • See a sample page created by Prof. Wu:
    http://web.njit.edu/~wu/cis392/cis392re.html

14
Getting Access to BOW and Test Collection
  • There are three directories under ~wu/IR_Tools:
  • bow (for the BOW system); to execute BOW, change
    directory to ~wu/IR_Tools/bow/bin
  • som (for the self-organizing map program. Do NOT
    use it now!)
  • tc (test collection, Library and Information
    Science Abstracts); the text is under
    ~wu/IR_Tools/tc/lisa/text/group0 to group5

15
Test Collection: LISA
  • The sample queries are stored in
    ~wu/IR_Tools/tc/lisa/LISA.QUE
  • The relevant documents corresponding to the
    queries are stored in
    ~wu/IR_Tools/tc/lisa/LISA.REL
  • (-1 marks the end of the entry.)
  • Use the more command to open the above two files.

16
Operating Arrow of BOW
  • Read the information on BOW's web site (again,
    the URL is listed in the Resources section of
    the class syllabus).
  • Read Arrow's help file (available on the syllabus
    page. You should print a copy of the help file.)

17
Automatic Indexing
  • To begin the retrieval tasks, first you need to
    index the whole document collection.
  • Specify lexing options (stopped words removal
    and/or stemming) at this time. For model1:
  • arrow -d ~yourusername/public_html/cis392/model1
    --index ~wu/IR_Tools/tc/lisa/text/*
  • The * sign is a wildcard that represents all
    files and directories under
    ~wu/IR_Tools/tc/lisa/text

18
Automatic Indexing
  • The -d parameter specifies where the statistics
    resulting from indexing will be stored. (You
    will have to specify this directory whenever you
    index or retrieve documents.)
  • The path after --index specifies the location of
    the text collection.
  • The default lexing settings for the above task
    are: NO stemming performed, and stopped words
    REMOVED.
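
The slides show the indexing command only for model1. A possible way to line up the commands for all three models is sketched below as a dry run: it prints the commands rather than executing them, so it can be tried without BOW installed. The option names --use-stemming and --no-stoplist are an assumption based on the lexing options documented for the BOW library; confirm them against Arrow's help file before use. The helper index_cmd and the variables base and collection are made up for this example, and the paths use the ~ home-directory shorthand.

```shell
# Dry run: print the three indexing commands instead of executing them.
# ASSUMPTION: --use-stemming turns stemming on and --no-stoplist keeps
# stopped words; check both against "arrow --help" or the help file.
collection='~wu/IR_Tools/tc/lisa/text/*'
base='~yourusername/public_html/cis392'   # replace yourusername with your own

index_cmd() {
    # $1 = model directory name; any remaining arguments = extra lexing options
    model=$1
    shift
    opts=$*
    echo "arrow -d $base/$model${opts:+ $opts} --index $collection"
}

cmd1=$(index_cmd model1)                  # defaults: no stemming, stopped words removed
cmd2=$(index_cmd model2 --use-stemming)   # stemming, stopped words removed
cmd3=$(index_cmd model3 --no-stoplist)    # no stemming, stopped words included
printf '%s\n' "$cmd1" "$cmd2" "$cmd3"
```

Each model gets its own -d directory, so the statistics from one set of lexing options never overwrite those of another.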

19
Query assigned for retrieval
  • Please refer to the retrieval experiment section
    of the online syllabus to see which query you
    get for the experiment.

20
Retrieval
  • First, specify where the indexing statistics are
    stored, and then the query to be performed.
  • arrow -d ~yourusername/public_html/cis392/model1
    --num-hits-to-show=25 --query >
    ~yourusername/public_html/cis392/model1/retrieved_docs
  • The greater-than sign (>) specifies the output
    filename and where it will be stored.
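
Output redirection works the same way for any UNIX command, so it can be practiced without BOW. In the sketch below, printf is a stand-in for the retrieval command, and the lines doc1 and doc2 are placeholders, not real Arrow output; the file path is a scratch location chosen for the example.

```shell
out=$(mktemp -d)/retrieved_docs   # scratch file standing in for model1/retrieved_docs

# Without >, the output goes to the screen.
printf 'doc1\ndoc2\n'

# With >, the same output is written to the file instead. The file is
# created if it is missing and overwritten if it already exists.
printf 'doc1\ndoc2\n' > "$out"

cat "$out"                        # show what was saved (more also works)
line_count=$(wc -l < "$out")      # count the saved lines
```

Because > overwrites the target file, rerunning a retrieval with the same output filename replaces the earlier results.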

21
Presenting your RE
  • Create a page named cis392re.html under your
    ~/public_html/cis392 directory.
  • This page should contain several pieces of
    information; see
    http://web.njit.edu/~wu/cis392/cis392re.html
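
If you prefer to start from the UNIX prompt, a bare-bones page can be written with a here-document, as sketched below. The page title, headings, and links are placeholders and not the required content, which is described on the sample page; the retrieved_docs filenames for model2 and model3 are assumed by analogy with model1, and site is a scratch location standing in for your cis392 directory.

```shell
site=$(mktemp -d)/public_html/cis392   # stand-in for your public_html/cis392 directory
mkdir -p "$site"

# Write a minimal page; everything between the EOF markers goes into the file.
cat > "$site/cis392re.html" <<'EOF'
<html>
<head><title>CIS392 Retrieval Experiment</title></head>
<body>
<h1>CIS392 Retrieval Experiment</h1>
<ul>
  <li><a href="model1/retrieved_docs">Model 1 results</a></li>
  <li><a href="model2/retrieved_docs">Model 2 results</a></li>
  <li><a href="model3/retrieved_docs">Model 3 results</a></li>
</ul>
</body>
</html>
EOF

ls -l "$site/cis392re.html"            # confirm the file was created
```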

22
Presenting your RE
  • You can create this HTML page with the pico
    editor in UNIX (if you know basic HTML tags),
    Microsoft Word (save the file in HTML format), or
    Netscape Composer.
  • If you use an HTML editor, you might need FTP
    software:
    http://www.zdnet.com/downloads/stories/info/0,10615,30994,00.html
  • Before the due date: please check all items on
    your HTML page and make sure all of them are
    displayed properly.
  • After the due date: do not make changes. I can
    check when the files were last updated.
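
Last-modified times are easy to inspect on UNIX, which is why late edits are visible. The sketch below shows one way this could be done; it is not necessarily how the instructor checks. The dates and the filenames due_date_marker, on_time.html, and late.html are hypothetical.

```shell
work=$(mktemp -d)
cd "$work"

# Stand-in for the due date: a marker file stamped 2024-01-01 00:00.
touch -t 202401010000 due_date_marker

touch -t 202312150900 on_time.html   # last modified before the due date
touch -t 202401050900 late.html      # last modified after the due date

ls -l on_time.html late.html         # the long listing shows each modification time

# List only the pages changed after the marker's timestamp.
changed_after_due=$(find . -name '*.html' -newer due_date_marker)
echo "$changed_after_due"
```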