LING 438/538 Computational Linguistics

1
LING 438/538 Computational Linguistics
  • Sandiway Fong
  • Lecture 21 11/7

2
Administrivia
  • Short Lecture Today
  • Homework 5
  • out today
  • due next Tuesday
  • usual rules

3
Homework 5
4
Britney Spears
  • Webpage
  • http://news.bbc.co.uk/cbbcnews/hi/music/newsid_1953000/1953614.stm

5
  • from the BBC website
  • typos: 2 is wrong, and 3 and 5 shouldn't be the same

use this list for the homework
http://www.google.com/jobs/britney.html
  • 1. Brittany
  • 2. Brittney
  • 3. Britany
  • 4. Britny
  • 5. Briteny
  • 6. Britteny
  • 7. Briney
  • 8. Brittny
  • 9. Brintey
  • 9. Britanny

6
Question 1: Britney Spears
  • Question 1
  • Part 1 (3pts)
  • Compute the edit distances for the misspellings
    of Britney (Spears)
  • use insert = delete = 1, substitute = 2
  • Part 2 (3pts)
  • Compute the edit distances for the misspellings
    of Britney (Spears)
  • use insert = delete = 1, substitute = 1
  • Part 3 (4pts)
  • Come up with a metric that correctly ranks the
    top 7 misspellings for either Part 1 or Part 2
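
The two cost settings in Parts 1 and 2 can be tried out with a small dynamic-programming routine. The following is a minimal sketch, not a reference solution: the function name `edit_distance` and its parameter names are made up for illustration, and it assumes the standard minimum edit distance with no transposition operation.

```python
def edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=2):
    """Minimum edit distance from source to target (illustrative sketch)."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] = cheapest way to turn source[:i] into target[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = dist[i - 1][0] + del_cost   # delete everything
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + ins_cost   # insert everything
    for i in range(1, rows):
        for j in range(1, cols):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + del_cost,                       # deletion
                dist[i][j - 1] + ins_cost,                       # insertion
                dist[i - 1][j - 1] + (0 if same else sub_cost),  # match or substitution
            )
    return dist[-1][-1]


# The misspelling list from the slide above.
MISSPELLINGS = ["Brittany", "Brittney", "Britany", "Britny", "Briteny",
                "Britteny", "Briney", "Brittny", "Brintey", "Britanny"]

for word in MISSPELLINGS:
    part1 = edit_distance(word, "Britney", sub_cost=2)  # Part 1 costs
    part2 = edit_distance(word, "Britney", sub_cost=1)  # Part 2 costs
    print(word, part1, part2)
```

Keeping the three costs as parameters means the same table-filling code covers both parts; a custom metric for Part 3 would need further changes that this sketch does not attempt.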

7
Making Money from Misspellings
  • Webpage
  • http://news.bbc.co.uk/1/hi/sci/tech/1575060.stm

8
Making Money from Misspellings
  • Excerpts
  • US legal authorities are appealing for help in
    tracking down John Zuccarini, who they say is
    making more than a million dollars a year from a
    collection of misspelled domain names.
  • The Federal Trade Commission is now looking for
    ways to recover the cash Mr Zuccarini has made
    from the domain names.
  • Excerpts
  • Mr Zuccarini has been practising a novel
    variation of cybersquatting which usually
    involves gaining control of a website that you
    have no real claim to, and then offering it for
    sale to the rightful owner at a premium.
  • The domains registered by Mr Zuccarini were
    typically misspellings of well-known names. Mr
    Zuccarini has reportedly registered 15 variations
    of the spelling of Cartoon Network TV channel,
    and 41 of pop star Britney Spears.

9
Question 2
10
Corpus
  • homework corpus
  • WSJ9_041.txt
  • from the course homepage
  • Wall Street Journal articles (July 26-28, 1989)
  • this is the text file you will use
  • contains almost 22,000 lines and 150,000 words
  • use only the text between the SGML markers
  • example...

11
Question 2
  • Sun Microsystems Inc. said it will post a
    larger-than-expected fourth-quarter loss of as
    much as $26 million and may show a loss in the
    current first quarter, raising further troubling
    questions about the once high-flying computer
    workstation maker.
  • Sun reported last month that management
    errors, rather than a weakness in the market for
    computer workstations, would result in lower
    earnings or a "slight loss" in the quarter ended
    June 30.
  • But the amount cited in yesterday's disclosure
    was far greater than analysts had suspected, and
    suggested deepening troubles.
  • "It is extremely disconcerting," said Peter
    Rogers, computer analyst at Robertson Stephens
    & Co. in San Francisco.
  • "Many of us had been led to believe that most (of
    the management-systems problems) had been put
    behind them.
  • It looks like there is another layer."
  • "On the surface, it would lead one to conclude
    that Sun has at least temporarily completely lost
    control of its operations," Mr. Rogers added.
  • The maker of high-performance desktop
    computers now says the loss was probably between
    $20 million and $26 million, compared with
    year-ago net income of $25.3 million, or 66 cents
    a share.
  • The huge fourth-quarter loss will bring
    year-end earnings to between $55 million and $61
    million, or between 72 cents and 78 cents a
    share, compared with year-ago net of $66.4
    million, or 89 cents a share.
  • Sun said it expects to report fourth-quarter
    revenue of $425 million to $435 million, up 16%
    to 19% from a year earlier.
  • That would contrast sharply with Sun's third
    quarter, when revenue surged 92%, and would put
    full-year revenue in the $1.75 billion to $1.77
    billion range, up from $1.05 billion a year
    earlier.

12
Question 2
  • Sun said the problems that led to the loss have
    "largely been resolved," and that it received
    record bookings in its fourth quarter.
  • Still, Sun said profitability in the current
    quarter, ending Sept. 30, can't be assured.
  • The company added that a return to profitability
    will depend on the effectiveness of cost-cutting
    measures and its ability to obtain parts.
  • A spokeswoman said the company still faces a
    shortage of certain parts.
  • In June, Sun said operations were disrupted by
    a change to a new system for getting information
    to management.
  • The company also cited faulty forecasting of
    demand, problems in manufacturing new machines
    and a shortage of certain parts.
  • The spokeswoman reiterated that the company sees
    strong demand for its products, and believes the
    market for computer workstations remains healthy.
  • Sun said it has imposed a hiring freeze in all
    areas except sales and customer service,
    postponed moving into new facilities and
    curtailed other expenses.
  • The spokeswoman wouldn't say how much the
    cost-cutting measures will save.
  • The announcement was made after the market
    closed.
  • Sun's stock closed at $16.25, up 62.5 cents, in
    national over-the-counter trading.

13
Question 2
  • edit out other stuff...
  • WSJ890728-0079
  • 890728
  • 890728-0079.
  • Major Deficit
  • @ Signaled by Sun
  • @ Microsystems
  • @ ---
  • @ Firm to Post Quarterly Loss
  • @ As Much as $26 Million
  • @ Deepening Trouble Seen
  • @ ----
  • @ By Carrie Dolan
  • @ Staff Reporter of The Wall Street Journal

  • 07/28/89
  • WALL STREET JOURNAL (J)
  • SUNW
  • COMPUTERS AND INFORMATION TECHNOLOGY (CPR)
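
Keeping only the article body and dropping the header material can be scripted. The sketch below is one possible approach, under loud assumptions: the slides do not name the SGML markers, so the tag name `TEXT` (and the `DOC` and `HL` tags in the sample) are guesses to be checked against the actual file, and the sample string is made up for illustration.

```python
import re


def extract_text(sgml, tag="TEXT"):
    """Return the contents of every <tag>...</tag> span, in document order.

    The default tag name is an assumption; inspect the corpus file and
    pass whatever marker actually wraps the article body.
    """
    pattern = re.compile(
        r"<{0}>(.*?)</{0}>".format(re.escape(tag)),
        re.DOTALL | re.IGNORECASE,
    )
    return [body.strip() for body in pattern.findall(sgml)]


# Made-up miniature document, not copied from the corpus.
sample = """<DOC>
<HL> Major Deficit </HL>
<TEXT>
Sun Microsystems Inc. said it will post a loss.
</TEXT>
</DOC>"""

print(extract_text(sample))
```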


14
(Ngram Statistics Package) NSP
  • for homework question 2
  • suggest you use the NSP software package,
  • brew your own, or
  • any other package you want to use...
  • (Ngram Statistics Package) NSP
  • Ted Pedersen's Perl-based Ngram Statistics
    Package (NSP)
  • http://www.d.umn.edu/~tpederse/nsp.html
  • you need to install a free Perl on your system if
    not already available
  • e.g. ActiveState Perl

15
(Ngram Statistics Package) NSP
  • you only need to use the Perl program file
  • count.pl
  • NSP on Windows
  • command line options
  • perl count.pl --help
  • Usage: count.pl [OPTIONS] DESTINATION SOURCE [[, SOURCE] ...]
  • Counts up the frequency of all n-grams occurring
    in SOURCE.
  • Sends to DESTINATION the list of n-grams found,
    along with the frequencies of combinations of the
    n tokens that the n-gram is composed of. If SOURCE
    is a directory, all text files in it are counted.
  • OPTIONS
  • --ngram N Creates n-grams of N tokens
    each. N = 2 by default.
  • --newLine Prevents n-grams from spanning
    across the new-line character.
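
For anyone taking the "brew your own" route instead of NSP, n-gram counting can be sketched in a few lines. This is not NSP's implementation: the function `count_ngrams` and its `new_line` flag are hypothetical names that only imitate the two options described above, and splitting on whitespace is an assumption (NSP's own tokenization may differ, for example in how it treats punctuation).

```python
from collections import Counter


def count_ngrams(lines, n=2, new_line=True):
    """Count n-grams over whitespace-separated tokens (illustrative sketch).

    new_line=True keeps n-grams from spanning line boundaries, loosely
    mirroring the --newLine option; n mirrors --ngram N.
    """
    if new_line:
        streams = [line.split() for line in lines]
    else:
        # Treat the whole input as one long token stream.
        streams = [[tok for line in lines for tok in line.split()]]
    counts = Counter()
    for tokens in streams:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Made-up two-line input for illustration.
demo = ["Sun said it", "Sun said"]
for ngram, freq in count_ngrams(demo, n=2).most_common():
    print(" ".join(ngram), freq)
```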

16
Question 2
  • (4pts)
  • List the most frequent closed-class word (for
    each class) in the corpus
  • use the definition of closed-classes (and your
    judgment) listed in section 8.1 of the textbook
  • (2pts)
  • What is the most frequent proper noun?
  • What is the most frequent (non-auxiliary) verb?

17
Question 2
  • (8pts)
  • compute the probability of the (similar)
    sentences
  • Bristol-Myers agreed to merge with Sun
    Microsystems
  • Bristol-Myers and Sun Microsystems agreed to
    merge
  • using both the bigram and trigram approximations
  • use add-one smoothing where relevant
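
A sketch of the bigram case with add-one smoothing may help fix the arithmetic. It assumes the usual add-one estimate, (count(prev word) + 1) / (count(prev) + V) with V the vocabulary size, and it treats a START token as the first word with probability 1. The function names and the toy corpus are made up for illustration; the real counts have to come from the homework corpus.

```python
from collections import Counter


def bigram_model(sentences):
    """Collect unigram and bigram counts from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def add_one_bigram_prob(prev, word, unigrams, bigrams):
    """Add-one smoothed estimate of p(word | prev)."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def sentence_prob(tokens, unigrams, bigrams):
    """Bigram approximation: product of p(w_i | w_{i-1}), with p(tokens[0]) = 1."""
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= add_one_bigram_prob(prev, word, unigrams, bigrams)
    return prob


# Toy corpus, invented for illustration only.
toy = [["START", "Sun", "said", "it"], ["START", "Sun", "agreed"]]
unigrams, bigrams = bigram_model(toy)
print(sentence_prob(["START", "Sun", "agreed"], unigrams, bigrams))
```

The trigram version follows the same pattern with two words of context; on a corpus of this size the products get very small, so summing log probabilities is a common alternative.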

18
Question 2
  • Note
  • given the chain rule
  • $p(w_1 w_2 w_3 \ldots w_n) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1 w_2) \cdots p(w_n \mid w_1 \ldots w_{n-2} w_{n-1})$
  • what is $w_1$?
  • if we're talking about a sentence, $w_1 = \text{START}$
  • Example
  • sentence begins with Sun ...
  • (see opposite column)
  • $p(\text{Sun} \mid \text{START})$
  • assume
  • $p(\text{START}) = 1$
  • Note
  • Pedersen's program does not take into account
    START
  • you'll have to calculate this separately or
    modify the corpus before running NSP...
  • sentence start symbol START
  • file
  • START Sun Microsystems Inc. said it will post a
    larger-than-expected fourth-quarter loss of as
    much as $26 million and may show a loss in the
    current first quarter, raising further troubling
    questions about the once high-flying computer
    workstation maker.
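
Modifying the corpus as described could look like the sketch below. It assumes one sentence per line, which has to be checked against the real file, and the helper names `add_start_symbol` and `start_bigram_prob` as well as the sample lines are made up for illustration. The probability returned is the plain relative frequency, without smoothing.

```python
from collections import Counter


def add_start_symbol(lines, start="START"):
    """Prefix every non-empty line with the sentence-start token."""
    return [f"{start} {line.strip()}" for line in lines if line.strip()]


def start_bigram_prob(word, marked_lines, start="START"):
    """Unsmoothed p(word | START) = count(START word) / count(START)."""
    followers = Counter()
    for line in marked_lines:
        tokens = line.split()
        if len(tokens) > 1 and tokens[0] == start:
            followers[tokens[1]] += 1
    total = sum(followers.values())
    return followers[word] / total if total else 0.0


# Made-up lines for illustration; blank lines are skipped.
raw = ["Sun said it.", "", "Sun agreed.", "  Still, Sun said."]
marked = add_start_symbol(raw)
print(marked)
print(start_bigram_prob("Sun", marked))
```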

19
Summary
  • for both 438/538
  • Question 1: 10 pts
  • Question 2: 14 pts
  • Total: 24 pts