Transcript and Presenter's Notes

Title: Word Weighting based on Users' Browsing History


1
Word Weighting based on Users' Browsing History
  • Yutaka Matsuo
  • National Institute of Advanced Industrial Science
    and Technology (JPN)
  • Presenter: Junichiro Mori
  • University of Tokyo (JPN)

2
Outline of the talk
  • Introduction
  • Context-based word weighting
  • Proposed measure
  • System architecture
  • Evaluation
  • Conclusion

3
Introduction
Introduction
  • Many NLP-based information support systems use
    tf-idf to measure the weight of words.
  • Tf-idf is based on statistics of word occurrence
    in a target document and in a corpus (a minimal
    sketch follows below).
  • It is effective in many practical systems,
    including summarization and retrieval systems.
  • However, a word that is important to one user is
    sometimes not important to others.
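
As a reference point for the tf-idf baseline mentioned above, here is a
minimal, self-contained sketch; the toy corpus, naive tokenization, and
smoothed idf variant are illustrative assumptions, not the authors'
implementation.

```python
import math
from collections import Counter

def tfidf(document, corpus):
    """Minimal tf-idf: term frequency in `document` times a smoothed
    inverse document frequency over `corpus` (a list of token lists)."""
    tf = Counter(document)
    n_docs = len(corpus)
    weights = {}
    for term, count in tf.items():
        df = sum(1 for doc in corpus if term in doc)    # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1     # one common smoothing variant
        weights[term] = count * idf
    return weights

# Toy usage: every user gets the same weights for the same document,
# which is exactly the limitation this talk addresses.
corpus = [["suzuki", "hitting", "streak", "ends", "mariners"],
          ["mariners", "game", "seattle", "baseball"],
          ["machine", "think", "imitation", "game"]]
print(tfidf(corpus[0], corpus))
```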

4
Example
Introduction
  • "Suzuki hitting streak ends at 23 games"
  • Ichiro Suzuki is a Japanese MLB player, MVP in
    2001.
  • Those who are greatly interested in MLB would
    regard "hitting streak ends" as important,
  • while a user who has no interest in MLB would
    note words such as "game" or "Seattle Mariners"
    as informative, because those words indicate
    that the subject of the article is baseball.
  • If a user is not familiar with the topic, he/she
    may think general words related to the topic are
    important.
  • On the other hand, if a user is familiar with the
    topic, he/she may think more detailed words are
    important.

(Our main hypothesis)
5
Goal of this research
Introduction
  • This research addresses context-based word
    weighting, focusing on the statistical features
    of word co-occurrence.
  • To measure the weight of words more accurately,
    contextual information about a user (what we
    call familiar words) is used.

6
Outline of the talk
  • Introduction
  • Context-based word weighting
  • Proposed measure
  • Previous work
  • IRM (Interest Relevance Measure)
  • System architecture
  • Evaluation
  • Conclusion

7
IRM
  • A new measure, IRM, is based on a word-weighting
    algorithm applied to a single document:
  • [Matsuo 03] "Keyword Extraction from a Single
    Document using Word Co-occurrence Statistical
    Information", FLAIRS 2003.

8
We take a paper as an example.
Previous work [Matsuo 03]
COMPUTING MACHINERY AND INTELLIGENCE
A. M. TURING  1. The Imitation Game  I PROPOSE to
consider the question, 'Can machines think?' This
should begin with definitions of the meaning of
the terms 'machine' and 'think'. The definitions
might be framed so as to reflect so far as
possible the normal use of the words, but this
attitude is dangerous. If the meaning of the
words 'machine' and 'think' are to be found by
examining how they are commonly used it is
difficult to escape the conclusion that the
meaning and the answer to the question, 'Can
machines think?' is to be sought in a statistical
survey such as a Gallup poll. But this is absurd.
Instead of attempting such a definition I shall
replace the question by another, which is closely
related to it and is expressed in relatively
unambiguous words. The new form of the problem
can be described in terms of a game which we
call the 'imitation game'. It is played with
three people, a man (A), a woman (B), and an
interrogator (C) who may be of either ...
9
Distribution of frequent terms
Previous work [Matsuo 03]
10
Next, count co-occurrences
Previous work [Matsuo 03]
  • "The new form of the problem can be described in
    terms of a game which we call the 'imitation
    game'."
  • After stemming, stop-word elimination, and
    phrase extraction (see the sketch below):
  • "new" and "form" co-occur once.
  • "new" and "problem" co-occur once.
  • ...
  • "call" and "imitation game" co-occur once.
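
A minimal sketch of the sentence-level co-occurrence counting described
above; the tokenization and the assumption that stemming, stop-word
elimination, and phrase extraction have already been applied are
simplifications, not the exact preprocessing of [Matsuo 03].

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_counts(sentences):
    """Count, for each unordered pair of terms, the number of
    sentences in which the two terms appear together."""
    counts = defaultdict(int)
    for sentence in sentences:
        for w1, w2 in combinations(sorted(set(sentence)), 2):
            counts[(w1, w2)] += 1
    return counts

# Toy usage on one already-preprocessed sentence.
sentences = [["new", "form", "problem", "describe", "game",
              "call", "imitation game"]]
print(cooccurrence_counts(sentences)[("form", "new")])   # -> 1
```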

11
Co-occurrence matrix
Previous work [Matsuo 03]
12
Co-occurrences of "kind" with frequent terms,
and of "make" with frequent terms
Previous work [Matsuo 03]
  • A general term such as "kind" or "make" is used
    relatively impartially with each frequent term,
    but ...

13
Co-occurrence matrix
Previous work [Matsuo 03]
(Matrix figure: both axes list the frequent terms.)
14
Co-occurrences of "imitation" with frequent terms,
and of "digital computer" with frequent terms
Previous work [Matsuo 03]
  • ... while a term such as "imitation" or "digital
    computer" shows co-occurrence especially with
    particular terms.

15
Biases of co-occurrence
Previous work [Matsuo 03]
  • A general term such as "kind" or "make" is used
    relatively impartially with each frequent term,
    while a term such as "imitation" or "digital
    computer" shows co-occurrence especially with
    particular terms.
  • Therefore, the degree of bias of co-occurrence
    can be used as a surrogate for term importance.

16
χ²-measure
Previous work [Matsuo 03]
  • We use the χ²-test, which is very common for
    evaluating biases between expected and observed
    frequencies.
  • G: the set of frequent terms.
  • freq(w, g): frequency of co-occurrence of term w
    and term g.
  • p_g: the unconditional (expected) probability
    of g.
  • f(w): the total number of co-occurrences of term
    w with the frequent terms G.
  • A large bias of co-occurrence indicates the
    importance of a word. (The statistic itself is
    reconstructed below.)
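
The χ² statistic itself appears only as an image in this transcript; a
reconstruction from the definitions above, using the standard χ² form of
[Matsuo 03], would be:

```latex
\chi^2(w) \;=\; \sum_{g \in G} \frac{\bigl(\mathrm{freq}(w,g) - f(w)\,p_g\bigr)^2}{f(w)\,p_g}
```

Here f(w)·p_g is the expected number of co-occurrences of w with g if w
co-occurred with the frequent terms impartially; large deviations from
this expectation give w a high χ² value.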

17
Sort by χ²-value
Previous work [Matsuo 03]
We can get important words based on co-occurrence
information in a single document.
18
Outline of the talk
  • Introduction
  • Context-based word weighting
  • Proposed measure
  • Previous work
  • IRM (Interest Relevance Measure)
  • System architecture
  • Evaluation
  • Conclusion

19
Personalize the calculation of word importance
IRM, proposed measure
  • The previous method is useful for extracting
    reader-independent important words from a
    document.
  • However, the importance of words depends not only
    on the document itself but also on the reader.

20
If we change the columns to picked-up words
IRM, proposed measure
Matrix legend — a: machine, b: computer, c: question,
d: digital, e: answer, f: game, g: argument, h: make,
i: state, j: number; u: imitation, v: digital computer,
w: kind, x: make
21
If we change the columns to picked-up words
IRM, proposed measure
(Matrix figure: the frequent-term columns are replaced
by picked-up words such as "logic" and "God".)
Words relevant to the selected words get a high χ²
value, because they co-occur with them often.
22
Familiarity instead of frequency
IRM, proposed measure
  • We focus on words familiar to the user, instead
    of words frequent in the document.
  • Definition: familiar words are words that a
    user has frequently seen in the past.

23
Interest Relevance Measure (IRM)
IRM, proposed measure
  • where H_k is the set of familiar words of user k
    (the formula itself is an image in the original
    slide; a reconstruction follows below).
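
Given slide 22 (familiar words replace frequent words) and the χ²
measure above, the IRM formula presumably mirrors that statistic with
H_k in place of G; a hedged reconstruction:

```latex
\mathrm{IRM}_k(w) \;=\; \sum_{h \in H_k} \frac{\bigl(\mathrm{freq}(w,h) - f(w)\,p_h\bigr)^2}{f(w)\,p_h}
```

where freq(w, h) is the co-occurrence frequency of w with the familiar
word h, p_h is the expected probability of h, and f(w) is the total
number of co-occurrences of w with words in H_k.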

24
IRM
IRM, proposed measure
  • If the value of IRM is large, word w_ij is
    relevant to the user's familiar words.
  • The word is relevant to the user's interests, so
    it is a keyword for the user.
  • Conversely, if the value of IRM is small, word
    w_ij is not specifically relevant to any of the
    familiar words (a toy computation is sketched below).
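
A toy end-to-end sketch of this personalization, following the IRM
reconstruction above; the way p_h and f(w) are estimated from sentence
co-occurrences, and the familiar-word set, are illustrative assumptions
rather than the authors' exact implementation.

```python
from collections import defaultdict

def irm(word, sentences, familiar_words):
    """Chi-square-style bias of `word`'s co-occurrence with the
    user's familiar words, computed over tokenized sentences."""
    freq = defaultdict(int)     # co-occurrences of `word` with each familiar word
    totals = defaultdict(int)   # total co-occurrence slots around each familiar word
    for sent in sentences:
        s = set(sent)
        for h in familiar_words & s:
            totals[h] += len(s) - 1
            if word in s and word != h:
                freq[h] += 1
    f_w = sum(freq.values())                 # total co-occurrence of `word` with H_k
    grand = sum(totals.values()) or 1
    score = 0.0
    for h in familiar_words:
        p_h = totals[h] / grand              # estimated (expected) probability of h
        expected = f_w * p_h
        if expected > 0:
            score += (freq[h] - expected) ** 2 / expected
    return score

# Toy usage: a baseball fan's familiar words pull different keywords
# to the top than a generic frequency-based weighting would.
sentences = [["suzuki", "hitting", "streak", "ends", "mariners"],
             ["mariners", "game", "seattle", "baseball"],
             ["suzuki", "mvp", "baseball"]]
fan = {"mariners", "mvp", "baseball"}
print(irm("suzuki", sentences, fan), irm("game", sentences, fan))
```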

25
Outline of the talk
  • Introduction
  • Context-based word weighting
  • Proposed measure
  • Previous work
  • IRM (Interest Relevance Measure)
  • System architecture
  • Evaluation
  • Conclusion

26
Browsing support system
  • It is difficult to evaluate IRM objectively
    because the weight of words depends on a user's
    familiar words, and therefore varies among users.
  • Therefore, we evaluate IRM by constructing a Web
    browsing support system.
  • Web pages accessed by a user are monitored by a
    proxy server.
  • The count of each word is stored in a database
    (a minimal counting sketch follows below).
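
A minimal sketch of the word-count bookkeeping described above, assuming
browsed pages have already been reduced to plain text; the proxy
integration, the naive tokenizer, the SQLite schema, and the top-N
definition of familiar words are hypothetical details, not the authors'
system.

```python
import re
import sqlite3
from collections import Counter

def update_word_counts(db_path, page_text):
    """Add the word counts of one browsed page to the user's history store."""
    tokens = re.findall(r"[a-z]+", page_text.lower())   # naive tokenizer (assumption)
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS word_counts"
                 " (word TEXT PRIMARY KEY, count INTEGER)")
    for word, n in Counter(tokens).items():
        conn.execute("INSERT OR IGNORE INTO word_counts (word, count) VALUES (?, 0)",
                     (word,))
        conn.execute("UPDATE word_counts SET count = count + ? WHERE word = ?",
                     (n, word))
    conn.commit()
    conn.close()

def familiar_words(db_path, top_n=100):
    """One possible reading of 'familiar words': the top-N most seen words."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT word FROM word_counts"
                        " ORDER BY count DESC LIMIT ?", (top_n,)).fetchall()
    conn.close()
    return {w for (w,) in rows}
```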

27
System architecture of the browsing support system
(Diagram: browser, proxy server, and word-count database.)
28
Sample screenshot
30
Outline of the talk
  • Introduction
  • Context-based word weighting
  • Proposed measure
  • Previous work
  • IRM (Interest Relevance Measure)
  • System architecture
  • Evaluation
  • Conclusion

31
Evaluation
  • For the evaluation, ten people used the system
    for more than one hour each.
  • Three methods were implemented for comparison:
  • (I) word frequency
  • (II) tf-idf
  • (III) IRM

32
Evaluation Result (1)
  • After using each system (blind), we asked the
    following questions on a 5-point Likert scale
    from 1 (not at all) to 5 (very much).
  • Q1: Does this system help you browse the Web?
  • (I) 2.8  (II) 3.2  (III) 3.2
  • Q2: Are the red-colored words (high-IRM words)
    interesting to you?
  • (I) 3.2  (II) 4.0  (III) 4.1
  • Q3: Are the interesting words colored red?
  • (I) 2.9  (II) 3.3  (III) 3.8
  • Q4: Are the blue-colored words (familiar words)
    interesting to you?
  • (I) 2.7  (II) 2.5  (III) 2.0
  • Q5: Are the interesting words colored blue?
  • (I) 2.7  (II) 2.5  (III) 2.4

(I) word frequency  (II) tf-idf  (III) IRM
33
Evaluation Result (2)
  • After evaluating all three systems, we asked the
    following two questions.
  • Q6: Which one helps your browsing the most?
  • (I) 1 person  (II) 3  (III) 6
  • Q7: Which one detects your interests the most?
  • (I) 0 people  (II) 2  (III) 8
  • Overall, IRM detects words matching the user's
    interests best.

(I) word frequency  (II) tf-idf  (III) IRM
34
Outline of the talk
  • Introduction
  • Context-based word weighting
  • Proposed measure
  • Previous work
  • IRM (Interest Relevance Measure)
  • System architecture
  • Evaluation
  • Conclusion

35
Conclusion
  • We developed a context-based word-weighting
    measure (IRM) based on the relevance (i.e., the
    co-occurrence) of words to a user's familiar words.
  • If a user is not familiar with the topic, he/she
    may think general words related to the topic are
    important.
  • On the other hand, if a user is familiar with the
    topic, he/she may think more detailed words are
    important.
  • We implemented IRM in a browsing support system
    and demonstrated its effectiveness.