1
CS 277 Data Mining, Lecture 3: Simple Language
Models
  • David Newman
  • Department of Computer Science
  • University of California, Irvine

2
Notices
  • Homework 1 is due Tuesday, Oct 9. Any questions?
  • Project proposal is due Tuesday, Oct 16
  • Today and next Tuesday we'll look through
    suggested projects on the class website
  • If you miss class, please check the class website
    for information about homework and the project
  • The class website has the schedule of what is due
    when

3
Today
  • More on Information Retrieval
  • Simple Language Models

4
Recap: Toy example of a document-term matrix
D documents × W words (sparse)
5
Find similar documents?
  • Distance metric: cosine similarity
  • simil(x, y) = x^T y / (||x|| ||y||)
  • Let's find similar documents
  • What is the time complexity?
  • What is the space complexity?
  • → whiteboard (a sketch follows below)
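A minimal Python sketch of the whiteboard computation, assuming a sparse D × W document-term matrix; the function name cosine_similarities and the toy data are illustrative, not from the slides. The dot products cost time proportional to the number of stored nonzeros, and the result adds O(D) space on top of the matrix itself.

  import numpy as np
  from scipy.sparse import csr_matrix

  def cosine_similarities(X, query_idx):
      # Cosine similarity between document query_idx and every document.
      # X is a sparse D x W document-term matrix (CSR).
      norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())  # ||x|| per doc
      dots = X.dot(X[query_idx].T).toarray().ravel()                  # x^T y for all docs
      return dots / (norms * norms[query_idx] + 1e-12)                # guard zero rows

  # Toy example: 3 documents, 4 words
  X = csr_matrix(np.array([[1, 2, 0, 0],
                           [0, 1, 1, 0],
                           [1, 0, 0, 3]]))
  print(cosine_similarities(X, 0))  # doc 0 vs. each doc; first entry is 1.0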

6
Layout of 15,000 Neuroscience abstracts
7
Simple language models
  • The unigram model
  • p(x = w) = f_w
  • Alternative notation
  • Estimating f (a sketch follows after this list)
  • What is a likely document?
  • Limitations of the unigram model
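The slide does not name the estimator, but the standard choice is maximum likelihood, f_w = count(w) / N; a minimal sketch under that assumption, with illustrative names:

  import math
  from collections import Counter

  def estimate_unigram(docs):
      # MLE unigram model: f_w = count(w) / total number of words.
      counts = Counter(w for doc in docs for w in doc)
      total = sum(counts.values())
      return {w: c / total for w, c in counts.items()}

  def doc_log_prob(doc, f):
      # log p(doc) = sum_i log f_{w_i}; word order is ignored.
      return sum(math.log(f[w]) for w in doc)

  docs = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"]]
  f = estimate_unigram(docs)
  print(f["sat"])                        # 3/7: "sat" occurs 3 times out of 7
  print(doc_log_prob(["the", "sat"], f))

Note that under the unigram model the most likely document of length n simply repeats the most frequent word, which points at one of the model's limitations.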

8
Extending the unigram model
  • Each document belongs to class c
  • Notation
  • Estimating f (a sketch follows after this list)
  • What is a likely document?
  • Uses
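One natural reading of this slide is a per-class unigram model, p(x = w | c) = f_cw, estimated by maximum likelihood within each class; a sketch under that assumption (names and data are illustrative):

  from collections import Counter, defaultdict

  def estimate_class_unigrams(labeled_docs):
      # Per-class MLE: f[c][w] = count of w in class c / total words in class c.
      counts = defaultdict(Counter)
      for c, doc in labeled_docs:
          counts[c].update(doc)
      models = {}
      for c, cnt in counts.items():
          total = sum(cnt.values())
          models[c] = {w: n / total for w, n in cnt.items()}
      return models

  labeled_docs = [("sports", ["goal", "team", "win"]),
                  ("politics", ["vote", "law", "team"])]
  f = estimate_class_unigrams(labeled_docs)
  print(f["sports"]["team"])  # 1/3

Combined with a class prior p(c), this is the generative model behind naive Bayes text classification, one common use of class-conditional unigrams.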

9
Simple Markov Models
  • Probability of document d, a sequence of words
  • doc = (w1, w2, w3, w4, ..., wn)
  • Chain rule for the joint probability of doc:
    p(doc) = p(w1) p(w2|w1) p(w3|w1,w2) ... p(wn|w1,...,wn-1)
  • First-order Markov assumption:
    p(wi|w1,...,wi-1) ≈ p(wi|wi-1)
  • Uses (a sketch follows below)
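A minimal first-order Markov (bigram) sketch with maximum-likelihood estimates; the function names and toy corpus are illustrative:

  import math
  from collections import Counter, defaultdict

  def estimate_models(docs):
      # MLE unigram p(w) and bigram p(w2 | w1) from lists of tokens.
      uni = Counter(w for doc in docs for w in doc)
      total = sum(uni.values())
      big = defaultdict(Counter)
      for doc in docs:
          for w1, w2 in zip(doc, doc[1:]):
              big[w1][w2] += 1
      p_uni = {w: c / total for w, c in uni.items()}
      p_big = {w1: {w2: n / sum(cnt.values()) for w2, n in cnt.items()}
               for w1, cnt in big.items()}
      return p_uni, p_big

  def doc_log_prob(doc, p_uni, p_big):
      # Chain rule + first-order Markov: p(w1) * prod_i p(w_i | w_{i-1}).
      lp = math.log(p_uni[doc[0]])
      for w1, w2 in zip(doc, doc[1:]):
          lp += math.log(p_big[w1][w2])
      return lp

  docs = [["the", "cat", "sat"], ["the", "cat", "ran"]]
  p_uni, p_big = estimate_models(docs)
  print(math.exp(doc_log_prob(["the", "cat", "sat"], p_uni, p_big)))  # 1/3 * 1 * 1/2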

10
Andrey Andreyevich Markov
The following slides are adapted from Andrew
McCallum's lecture slides
11
Modeling sequences
12
Bi-gram models
13
First-order Markov model
14
n-gram models
15
Perplexity
  • Compare probabilistic models using log
    probability
  • In text/speech, use perplexity
  • Perplexity is the inverse of the geometric mean
    of the per-word likelihood:
    perplexity = p(w1, ..., wN)^(-1/N)
  • A perplexity of K means that you are as surprised
    on average as you would have been if you had to
    guess between K equiprobable choices
  • Examples (a sketch follows below)
  • Fair 6-sided die: perplexity = ?
  • Unigram model (W-sided weighted die): perplexity = ?
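A minimal sketch that computes perplexity from per-outcome probabilities; it confirms the die example, since a fair 6-sided die assigns likelihood 1/6 to every roll:

  import math

  def perplexity(probs):
      # perplexity = exp(-(1/N) * sum_i log p_i):
      # the inverse geometric mean of the per-outcome likelihoods p_i.
      return math.exp(-sum(math.log(p) for p in probs) / len(probs))

  print(perplexity([1/6] * 12))         # 6.0 for a fair die, for any number of rolls
  print(perplexity([0.5, 0.25, 0.25]))  # ~3.17: a skewed model is less surprised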

16
Collocations (in homework 1)
  • Andrew McCallum's slides on Collocations (UMass
    Amherst), based on Chapter 5 of Manning & Schütze

17
Collocations (cont.)
  • Finding collocations by hypothesis testing
  • Student's t-test (homework)
  • Pearson's chi-square test
  • Likelihood ratios
  • Student's t-test hint (a sketch follows after
    this list):
  • the t-test compares the sample mean and variance
    against a hypothesized mean
  • the null hypothesis is that the sample is drawn
    with mean μ
  • you need to compute the t-statistic
  • think of the text corpus as a long sequence of N
    bigrams
  • treat each bigram as a Bernoulli trial, with
    sample mean p (the observed relative frequency of
    the bigram)
  • then the sample variance is p(1 - p)
  • compare with μ = Pr(w1 w2) = Pr(w1) Pr(w2), the
    mean under the null hypothesis of independence
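A minimal sketch of the bigram t-statistic following the hint above; the function name and toy corpus are illustrative:

  import math
  from collections import Counter

  def bigram_t_statistic(tokens, w1, w2):
      # t-test for the collocation (w1, w2), as in Manning & Schutze ch. 5.
      # Null hypothesis: Pr(w1 w2) = Pr(w1) * Pr(w2) (independence).
      n = len(tokens) - 1                   # N bigrams in the corpus
      uni = Counter(tokens)
      bi = Counter(zip(tokens, tokens[1:]))
      p = bi[(w1, w2)] / n                  # sample mean
      mu = (uni[w1] / len(tokens)) * (uni[w2] / len(tokens))  # null mean
      var = p * (1 - p)                     # Bernoulli sample variance
      return (p - mu) / math.sqrt(var / n)  # t-statistic

  tokens = "new york is a new city in new york state".split()
  print(bigram_t_statistic(tokens, "new", "york"))  # ~1.17 on this toy corpus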

18
Projects
  • Review the project suggestions on the course website