Instance-based mapping between thesauri and folksonomies


1
Instance-based mapping between thesauri and
folksonomies

Christian Wartena
Rogier Brussee
Telematica Instituut

2
Outline

Interoperability of Keywords
Wikipedia and del.icio.us
Keyword similarity
Experiment
Conclusion

3
Interoperability of Keywords

Documents (pictures, movies, ...) are annotated
with keywords for organization and retrieval.
In different collections/communities different
sets of keywords are used.
The set of selectable keywords is often organized
in and delimited by a thesaurus.
The set of freely generated end-user keywords
(tags) forms a folksonomy.
Align keywords/tags by comparing usage.
Tested on del.icio.us tags and Wikipedia
categories.

4
del.icio.us and Wikipedia

Del.icio.us
Social bookmarking site
Bookmarks in most cases can be interpreted as
labels or tags for the bookmarked URL.
Many Wikipedia articles are tagged by del.icio.us
users
Wikipedia
Articles are labeled with one or more categories
by the article authors.
Categories are organized hierarchically.
Categories are organized consciously like in a
thesaurus
New categories are introduced after discussions
between active Wikipedians.

5
Keyword alignment

Problem
Given a keyword k in system A, what is the most
similar keyword k' in system B?
Given a tag from del.icio.us, what is the most
similar Wikipedia category (or vice versa)?
Approach
Interpret similarity as similarity of usage.
Compute similarity of usage on a common
sub-collection.
Evaluation
Compare results to human judgment of similarity.

6
Keyword similarity

Basic assumption: similarity is similarity of
usage.
If two keywords have similar usage they will give
similar results in retrieval tasks.
Two keywords have similar usage if they
Have a similar distribution over documents
Divergence (relative entropy) of distributions
Cosine
Often co-occur
Jaccard coefficient

7
New measure for keyword similarity

Keywords have similar usage if they co-occur with
similar frequency with all other keywords.
We use the frequency with which a tag/keyword is
assigned to a document.
We include co-occurrence information with other
terms.
Helps to cope with sparse data
In other words
Terms are similar if they have similar
co-occurrence patterns
Similar to the Tag Context Similarity in Cattuto et
al.'s presentation tomorrow (Semantic Social
Networks session)

8
(No Transcript)
9
Formalization: distribution of co-occurring terms

$\bar{p}_z(t) = \sum_d q(t|d)\, Q(d|z)$
where
q(t|d) is the keyword distribution of d
Q(d|z) is the document distribution of z
The fraction of the occurrences of z that is found in d
Weighted average of the keyword distributions of
documents
The weight is the relevance of d for z given by
the probability Q(d|z)
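
A minimal sketch of this weighted average: the co-occurrence distribution of a keyword z is the average of the keyword distributions q(t|d) of all documents, each weighted by Q(d|z). The function name and the toy counts are hypothetical. Estimating q(t|d) and Q(d|z) as plain relative frequencies of the assignment counts is an assumption here, not something the slide states explicitly.

```python
from collections import defaultdict

def cooccurrence_distribution(counts, z):
    """Distribution over terms co-occurring with keyword z.

    counts: dict doc id -> dict term -> number of assignments n(d, t).
    Assumes q(t|d) = n(d,t) / N(d) and Q(d|z) = n(d,z) / n(z).
    """
    n_z = sum(terms.get(z, 0) for terms in counts.values())
    if n_z == 0:
        raise ValueError(f"keyword {z!r} does not occur in the collection")
    p_z = defaultdict(float)
    for terms in counts.values():
        n_dz = terms.get(z, 0)
        if n_dz == 0:
            continue  # documents without z get weight Q(d|z) = 0
        weight = n_dz / n_z        # Q(d|z)
        n_d = sum(terms.values())  # N(d): all assignments on document d
        for t, n_dt in terms.items():
            p_z[t] += weight * n_dt / n_d  # Q(d|z) * q(t|d)
    return dict(p_z)

# Hypothetical toy collection.
counts = {
    "d1": {"python": 2, "programming": 2},
    "d2": {"python": 1, "snake": 3},
    "d3": {"programming": 1, "software": 1},
}
p = cooccurrence_distribution(counts, "python")
print(p)
```

In this sketch the resulting distribution also assigns probability to z itself; whether z should be excluded is left open.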

10
Distance of keywords

For each keyword there is a distribution over all
(other) keywords.
Similarity is expressed by divergence of these
distributions
Kullback-Leibler divergence: $D_{KL}(p \| q) = \sum_t p(t) \log_2 \frac{p(t)}{q(t)}$
Bits per keyword saved by compressing a
subcollection with keyword distribution p using p
instead of a general distribution q.
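
A minimal sketch of the Kullback-Leibler divergence in bits, for distributions stored as dictionaries from term to probability. The function name and the example distributions are hypothetical; returning infinity when q gives zero probability to a term that p uses follows the usual convention.

```python
import math

def kl_divergence(p, q):
    """D(p || q) in bits for distributions given as dicts term -> probability."""
    total = 0.0
    for t, p_t in p.items():
        if p_t == 0:
            continue  # terms with p(t) = 0 contribute nothing
        q_t = q.get(t, 0.0)
        if q_t == 0:
            return math.inf  # p uses a term that q rules out
        total += p_t * math.log2(p_t / q_t)
    return total

# Hypothetical example distributions.
p = {"a": 0.5, "b": 0.5}
q = {"a": 0.25, "b": 0.75}

print(kl_divergence(p, q))
print(kl_divergence(q, p))
```

The two printed values differ, which illustrates that this divergence is not symmetric and can be infinite for sparse distributions.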

11
Distance of keywords (cont'd)

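Jensen-Shannon divergence: $JSD(p \| q) = \tfrac{1}{2} D_{KL}(p \| m) + \tfrac{1}{2} D_{KL}(q \| m)$
Mean distribution: $m = \tfrac{1}{2}(p + q)$
Jensen-Shannon divergence is symmetric.
Jensen-Shannon divergence is the square of a
non-negative distance satisfying the triangle
inequality.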
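
A minimal sketch of the Jensen-Shannon divergence and the distance obtained as its square root, again for distributions stored as dictionaries from term to probability. The function names and the example distributions are hypothetical. Base-2 logarithms are assumed, so the divergence lies between 0 and 1 bit.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits of two dicts term -> probability."""
    terms = set(p) | set(q)
    # Mean distribution m = (p + q) / 2 over the union of the supports.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in terms}

    def kl_to_mean(r):
        # m(t) > 0 wherever r(t) > 0, so the ratio is always defined.
        return sum(r_t * math.log2(r_t / m[t]) for t, r_t in r.items() if r_t > 0)

    return 0.5 * kl_to_mean(p) + 0.5 * kl_to_mean(q)

def js_distance(p, q):
    """Square root of the divergence; clamps tiny negative rounding errors."""
    return math.sqrt(max(js_divergence(p, q), 0.0))

# Hypothetical example distributions.
p = {"a": 0.5, "b": 0.5}
q = {"a": 0.25, "b": 0.75}
r = {"c": 1.0}

print(js_divergence(p, q), js_divergence(q, p))
print(js_distance(p, r))
```

Unlike the Kullback-Leibler divergence, the result stays finite even when the two distributions have disjoint supports, which matters for sparse tag data.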

12
Alignment

Consider a collection of documents annotated with
different sets of keywords.
Represent a keyword by a distribution over terms
from both collections.
For each term find the closest term from the
other collection.
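
The alignment step can be sketched as a nearest-neighbour search. This is an illustration under assumptions: every term is assumed to be already represented by a distribution over terms from both collections, the Jensen-Shannon divergence is used as the distance, and all names and toy distributions are hypothetical.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits of two dicts term -> probability."""
    terms = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in terms}

    def kl_to_mean(r):
        return sum(r_t * math.log2(r_t / m[t]) for t, r_t in r.items() if r_t > 0)

    return 0.5 * kl_to_mean(p) + 0.5 * kl_to_mean(q)

def align(source, target):
    """Map each source term to (closest target term, divergence)."""
    mapping = {}
    for term, p in source.items():
        best = min(target, key=lambda name: js_divergence(p, target[name]))
        mapping[term] = (best, js_divergence(p, target[best]))
    return mapping

# Hypothetical co-occurrence distributions over terms from both collections.
tags = {
    "python": {"python": 0.4, "code": 0.3, "Programming languages": 0.3},
    "recipes": {"recipes": 0.5, "food": 0.2, "Cuisine": 0.3},
}
categories = {
    "Programming languages": {"python": 0.3, "code": 0.2, "Programming languages": 0.5},
    "Cuisine": {"recipes": 0.3, "food": 0.3, "Cuisine": 0.4},
}

for tag, (category, distance) in align(tags, categories).items():
    print(tag, "->", category, round(distance, 3))
```

Calling `align(categories, tags)` gives the mapping in the other direction; the two directions need not be each other's inverse.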

13
Experiment I

Mapping between Teleblik keywords and User Tags
Educational videos.
Professional keywords from public broadcasting
archive.
Keywords assigned in an experiment by high school
students.
Data
100 videos
12,414 tags
4,348 different tags
269 different keywords

14
Experiment II

Mapping between del.icio.us tags and Wikipedia
categories
Del.icio.us tags collected by Mathias Lux
(Klagenfurt Univ.)
Data
58,345 Wikipedia articles
500,618 tags and category annotations
42,425 different Wikipedia categories
49,603 different tags
Mappings computed for tags occurring on at least
10 documents.
Mappings for 2355 tags
Mappings for 1827 categories
Using co-occurrence data with all 49,603
tags/categories
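
The frequency threshold mentioned above can be sketched as a document-frequency filter. The function name, the data layout, and the toy annotations are hypothetical; only the threshold of 10 documents comes from the slide.

```python
from collections import Counter

def frequent_terms(annotations, min_docs=10):
    """Return the terms assigned to at least min_docs different documents.

    annotations: dict doc id -> iterable of terms assigned to that document.
    """
    doc_freq = Counter()
    for terms in annotations.values():
        doc_freq.update(set(terms))  # count each term once per document
    return {t for t, n in doc_freq.items() if n >= min_docs}

# Hypothetical toy annotations, filtered with a lower threshold for illustration.
annotations = {
    "d1": ["python", "python", "code"],
    "d2": ["python", "snake"],
    "d3": ["code", "software"],
}
print(frequent_terms(annotations, min_docs=2))
```

Only the terms to be mapped are filtered this way; the co-occurrence distributions themselves can still be computed over all terms.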

15
Evaluation of mapping

Manual evaluation
Classification of a sample of mappings into
Broader term
Narrower term
Related term
Unrelated
Source term is not a keyword (e.g. "to read")
Meaning unknown

16
Evaluation of aligning Wikipedia and del.icio.us
17
Distance vs. mapping quality

Pairs with a small distance are evaluated better
than pairs with large distance.
Evaluation of mappings with smallest and largest
distance
a) Categories to tags
b) Tags to categories

18
Effect of keyword frequency

No correlation between keyword frequency and
the divergence from the best mapping found.

19
Comparison with Jaccard coefficient

Evaluation of mapping using two different
distance measures.
The categories "broader", "narrower" and "related"
are merged.
Results for
a) Categories to tags
b) Tags to categories

20
Discussion of results

Method works very well in test
Good mapping results
Distance is good indication of quality
Insensitive to frequency (up to a certain degree)
Better than Jaccard, because it uses
co-occurrence with other tags (tag context)
the frequency with which a tag is assigned to a
document.
Frequency information is typical for user
generated tags.
We expect this method to perform less well for
aligning keywords with other keywords (without
assignment frequencies).
Distance measure also works well for clustering
tags.

21
Future work

Evaluating relatedness using external sources
(e.g. WordNet)
Compare to other distance measures
We used documents annotated completely according
to two annotation schemes.
How large does the overlap have to be to obtain
decent results?
We can create partial overlap of disjoint
document sets by a partial identification of the
keywords.
Detect asymmetry in relations (broader vs.
narrower term)

22
Conclusion

Using co-occurrence patterns is a fruitful
approach.
Frequent terms from folksonomies behave
similarly to carefully assigned keywords,
because the usage-based similarity measure yields
good mappings.
Folksonomy seems to work!
