Title: TagSense: Marrying Folksonomy and Ontology
By Zixin Wu
Advisor: Amit P. Sheth
Committee: John A. Miller, Prashant Doshi
Outline
- Background and Motivation
- Approach Overview
- Tag Normalization
- Sense Indexing
- Utilizing ontologies
- Semantic Search and Ranking
- Implementation and Evaluations
- Conclusions
- Demo
Folksonomy
(Screenshots: a web page and photos from Flickr.com; a web page from del.icio.us)
Folksonomy: Definitions
- The behavior of massive tagging in a social context and its product: tags for Web resources. It is collaborative metadata extraction and annotation.
- (Thomas Vander Wal) "Folksonomy is the result of personal free tagging of information and objects (anything with a URL) for one's own retrieval. The tagging is done in a social environment (usually shared and open to others)." [1]
- (Tom Gruber) "the emergent labeling of lots of things by people in a social context." [2]
Features of Folksonomy
- Makes metadata extraction from multimedia Web resources easier.
- Extracts information from the perspective of the information consumer, e.g. tags about the house in a photo but not the dog in it.
- Popular tags prevail, and the tags for a Web resource converge over time.
The Long Tail
Power Law Distribution of Tags [3]
Folksonomy Triad [4,5]
- The person tagging
- The Web resource being tagged
- The tag(s) used on that Web resource
- Given any two of these elements, we can find the third.
- e.g. find persons with similar interests by comparing the Web resources they tagged and the tags they used.
Motivation Scenarios: Ambiguous Words
- Search for "turkey"
- Search for "apple"
Disambiguation
- What people usually do: add more keywords for disambiguation
- Trade-off between precision and recall rates
Motivation Scenarios: Background Knowledge
- Task: find photos about cities in Europe
- Solution 1: search "city Europe"
- Solution 2: try the names of European cities one by one
- Could be improved if the system knew:
  - which term/concept is a city
  - which city is in Europe
Significant Drawbacks of Folksonomy
- Keyword ambiguity
- Lack of background knowledge
Ontology
- Ontology is an important term in Knowledge Representation and a key enabler of the Semantic Web
- "A formal specification of a conceptualization" [6]
- Ontologies state knowledge explicitly by using URIs and relationships, e.g. Paris is_located_in Europe
- Current specifications: RDF(S) [7,8], OWL [9], etc.
Semantic Annotation
(Figure from [10])
Multiple Ontologies
- One ontology cannot always be comprehensive enough
- Ontologies may be incompatible
- If multiple ontologies are used, we need to select and rank ontologies for a query
Objectives
- Shorten the time and effort of information retrieval in a folksonomy
- Improve recall rates by considering synonyms and enabling semantic search
- Improve result ranking by putting the most appropriate items at the top of query results
Approach Overview
- Do not add any burden to our users: they should be able to use only tags to describe and search Web resources
- Do not expect our users to have a Semantic Web background
- Utilize ontologies as background knowledge in information retrieval
Approach Overview
(Diagram: Folksonomy and Ontologies)
Some Terms
- Web resource: anything with a URL
- Label: one or more keywords, e.g. "air ticket"
- Tag: a label attached to a Web resource. Two different tags may have the same label
- Sense cluster (or cluster): a grouping of tags with similar meanings. Ideally, a cluster corresponds to a meaning, but often a meaning is represented by multiple clusters together
- Semantic annotation: associating a cluster with ontological concepts
Approach Overview
(Diagram with Ontology 1 and Ontology 2: a dot is a tag, a blue circle is a sense cluster, a yellow circle is an actual meaning)
Data Cleanup: Dirty Tags
- "bird" and "birds"
- "ebook" and "e-book"; "air-ticket", "airticket", and "air ticket"
- "freephotos" should be "free, photo"
- "travelagent" should be "travel agent"
- "sculture" should be "sculpture"
- "@pub-travel"
- "Europe2005"
Tag Normalization
- Check two online dictionaries: Webster.com and Dict.cn
- Webster.com: stemming and misspellings
  - swimming -> swim, dogs -> dog
  - sculture -> sculpture
- Dict.cn: more words and compound words
  - "ibm" is not in Webster, but is in Dict.cn
  - "open source"
- Try to split tags
  - freephotos -> "free" and "photo"
- Ignore pure numbers, such as 2005, 07_01_2005
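The normalization pipeline above can be sketched roughly as follows; the dictionary and stem tables are tiny hard-coded stand-ins for the Webster.com and Dict.cn lookups the system actually performs.

```python
# Sketch of the tag-normalization steps: drop pure numbers, stem or correct
# via a dictionary, and greedily split run-together compound tags.
import re

DICTIONARY = {"free", "photo", "travel", "agent", "swim", "dog", "open", "source"}
STEMS = {"swimming": "swim", "dogs": "dog", "sculture": "sculpture"}

def split_compound(tag):
    """Greedy two-way split of a run-together tag into dictionary words."""
    tag = tag.rstrip("s")  # crude plural handling, e.g. freephotos -> freephoto
    for i in range(1, len(tag)):
        if tag[:i] in DICTIONARY and tag[i:] in DICTIONARY:
            return [tag[:i], tag[i:]]
    return None

def normalize(tag):
    tag = tag.lower().strip()
    if re.fullmatch(r"[\d_\-]+", tag):  # ignore pure numbers like 2005, 07_01_2005
        return []
    if tag in STEMS:                    # stemming / misspelling correction
        return [STEMS[tag]]
    if tag in DICTIONARY:
        return [tag]
    parts = split_compound(tag)
    return parts if parts else [tag]
```

For example, `normalize("freephotos")` yields `["free", "photo"]` and `normalize("2005")` yields an empty list.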
Sense Indexing
(Diagram: keywords vs. senses. The keyword "ticket" maps to the sense "access permit"; the keyword "fine" maps to both the sense "for an offender" and the sense "good".)
Sense Indexing
- The mappings between keywords and senses are n:m
- Index Web resources by senses instead of keywords; put tags with similar meanings into the same cluster
- Need to disambiguate each tag when indexing
Differences from Word Sense Disambiguation [11-15]
- No sentences: no sentence structure, no part-of-speech analysis
- The order of the labels on a Web resource is not necessarily relevant
- Produced in a social context: a significant number of terms are not in lexicons, and terms change more frequently, so we need to create senses for those terms
- Relatively less noise
Why Clustering? (1)
- Since we will match the clusters to ontological concepts, why not annotate each tag individually?
- Some terms are not in any ontology
- By aggregating the contexts of the tags in the same cluster, we learn which contexts are important and which are noise (especially in a narrow folksonomy)
(Diagram: two resources tagged "apple", one with context powerbook, mac, ajax, web, design and the other with powerbook, mac, light, paint, long; the shared contexts "powerbook" and "mac" stand out as important)
Why Clustering? (2)
- We get more context for semantic annotation
(Diagram: a tag "Athens" with context "University" alone is ambiguous; clustering yields the richer contexts "University, Georgia" vs. "University, Greece")
Synonymy
- It seems impossible to automatically detect synonyms based ONLY on the contexts of tags
- Reason: sufficiently similar contexts do not imply synonymy
- Solution: use WordNet's [16] synsets as synonym lists
Polysemy
- Cluster tags that have the same label (or synonymous labels) into sense clusters based on the similarity of their contexts
Context of Tags
- The context of a tag T:
  - the other tags that co-occur with T on a Web resource
  - the co-occurrence frequencies
- e.g. User1: turkey, istanbul, mosque; User2: turkey, istanbul, tour
- In a narrow folksonomy, all co-occurrence frequencies are 1
(Diagram: "turkey" co-occurs with "istanbul" twice, and with "tour" and "mosque" once each)
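Building these contexts from per-resource tag sets is straightforward; this sketch reproduces the slide's two-user example.

```python
# Build each tag's context (co-occurring tags with frequencies) from
# per-resource tag sets.
from collections import defaultdict
from itertools import combinations

resources = [
    {"turkey", "istanbul", "mosque"},  # User1's photo
    {"turkey", "istanbul", "tour"},    # User2's photo
]

context = defaultdict(lambda: defaultdict(int))
for tags in resources:
    for a, b in combinations(sorted(tags), 2):
        context[a][b] += 1
        context[b][a] += 1

# "turkey" co-occurs with "istanbul" twice, "mosque" and "tour" once each
```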
Relatedness of Tags
(Diagram: on the left, the raw co-occurrence graph of "turkey" with istanbul (2), tour (1), and mosque (1); on the right, the TF-normalized graph with weights such as 2/4, 1/2, and 1/4. The TF values are then multiplied by IDF.)
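One plausible reading of this TF × IDF measure, sketched in Python; the normalization direction and the IDF smoothing are assumptions on my part (the slides only note that the measure is asymmetric), and the counts are the slide's example.

```python
# Tag relatedness as TF * IDF: TF(A -> B) is A's co-occurrence with B
# normalized by A's total co-occurrences; IDF discounts ubiquitous tags.
import math

cooccur = {
    "turkey": {"istanbul": 2, "tour": 1, "mosque": 1},
    "istanbul": {"turkey": 2},
    "tour": {"turkey": 1},
    "mosque": {"turkey": 1},
}
tag_doc_freq = {"turkey": 2, "istanbul": 2, "tour": 1, "mosque": 1}
num_resources = 2

def relatedness(a, b):
    tf = cooccur[a].get(b, 0) / sum(cooccur[a].values())
    idf = math.log((num_resources + 1) / (tag_doc_freq[b] + 1)) + 1  # smoothed
    return tf * idf

# relatedness("turkey", "istanbul") != relatedness("istanbul", "turkey"):
# the measure is asymmetric, as the slides note.
```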
Context of a Cluster
- The other clusters whose tags connect (co-occur) with the tags in this cluster
- The co-occurrence frequency of two clusters is the aggregation of the co-occurrence frequencies of the tags in the clusters
(Diagram: tag-level co-occurrence counts of 2 and 3 aggregate into a cluster-level count of 5)
Relatedness of Clusters
- The same calculation as the relatedness of tags
Important Context of a Cluster
(Diagram: a cluster's context clusters ordered by relatedness and partitioned into Important Context Level 1, Level 2, and Level 3)
Motivation for Building Senses
- To search for photos about turkey the bird, some people use "bird" alongside "turkey", some use "animal", "food", "wild", etc.
- Can we include all of these tags, and then use them to build a sense?
- The clue for recognizing these tags is that they co-occur with each other more often than with other tags (which are also in the context of "turkey")
Tag Disambiguation Process
- Put all tags with the same label (or synonymous labels) into one cluster
- Run the following three phases to build senses
Tag Disambiguation: Phase 1
- Identify Important Context Level 1
- Create an undirected weighted graph called the Context Graph
  - Each node in the graph is a cluster in Important Context Level 1
  - The weight of an edge is the relatedness of the two clusters (relatedness is asymmetric; we take the larger value)
- Apply a threshold to the edges of the Context Graph, so that the graph breaks into one or more disconnected components
- Create a sense for each component, and use the clusters in the component as the context of the corresponding sense
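Phase 1's thresholding and component-finding can be sketched as follows; the relatedness weights are made up for illustration.

```python
# Phase 1 sketch: drop Context Graph edges below a threshold, then treat
# each remaining connected component as one sense.
from collections import defaultdict

edges = {  # relatedness between context clusters of "turkey" (illustrative)
    ("istanbul", "mosque"): 0.6,
    ("istanbul", "tour"): 0.5,
    ("bird", "animal"): 0.7,
    ("mosque", "bird"): 0.1,   # weak cross-sense edge, cut by the threshold
}
THRESHOLD = 0.3

adj = defaultdict(set)
for (a, b), w in edges.items():
    if w >= THRESHOLD:
        adj[a].add(b)
        adj[b].add(a)

def components(nodes, adj):
    """Connected components via depth-first search."""
    seen, comps = set(), []
    for n in nodes:
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:
            x = stack.pop()
            if x in comp:
                continue
            comp.add(x)
            stack.extend(adj[x] - comp)
        seen |= comp
        comps.append(comp)
    return comps

nodes = {n for e in edges for n in e}
senses = components(nodes, adj)
# Two senses emerge: {istanbul, mosque, tour} and {bird, animal}
```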
Tag Disambiguation: Phase 1 (illustration)
We are disambiguating "turkey", so the "turkey" cluster itself is hidden for clarity.
Tag Disambiguation: Phase 2
- The purpose of this phase is to find senses missed in Phase 1 because they are not used often in the dataset
- Identify Important Context Level 2
- For each cluster in Important Context Level 2, find the most related sense built in Phase 1 (above a threshold)
- If there is such a sense, merge the cluster into that sense's context
- Otherwise, build a new sense and use the cluster as its context
Tag Disambiguation: Phase 2 (illustration)
The red clusters are the ones newly discovered in Phase 2.
Tag Disambiguation: Phase 3
- Identify Important Context Level 3
- Similar to Phase 2, but do not create any new senses; just enrich the contexts of the senses built in Phases 1 and 2
Tag Disambiguation Process (continued)
- Compare each tag under consideration with the senses. Select the best-matching sense and assign the tag to it
- Repeat steps 2 and 3 when the number of tags under consideration has grown by a certain percentage
Tag Disambiguation Process (illustration)
(Diagram: a tag x with context tags such as "istanbul" and "turkish" is compared with each sense y of "turkey" via MatchScore(x, y))
Utilizing Ontologies
- Match each cluster to ontological concepts where appropriate
- But there are no named relationships between tags
  - That means we cannot compare by the names of relationships
- We will need the relatedness of ontological concepts
- We will also need the similarity of ontological concepts for semantic search
Relatedness of Ontological Concepts
- Basic idea: TF-IDF
- 0 for any pair of concepts without a relationship
- TF-IDF(c1, c2) = TF(c1, c2) × IDF(c1)
Relatedness of Ontological Concepts
- TF(c1 to c2):
  - issue the query "c1 c2" to the Yahoo! search engine and get the hit count h
  - issue the query "cx c2" for each concept cx connected to c2, and get the hit counts hx
  - TF(c1, c2) = h / Σhx
- IDF(c):
  - issue the query "c" to the Yahoo! search engine and get the hit count h
  - Yahoo!'s current index size: 20 billion pages
  - IDF(c) = -log(h / 20 billion)
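A sketch of these two formulas with hard-coded hit counts standing in for the Yahoo! queries; the pair counts and the set of concepts connected to "car" are invented for illustration.

```python
# Concept relatedness as TF * IDF. Hit counts are hard-coded stand-ins
# for search-engine queries; pair counts are illustrative.
import math

INDEX_SIZE = 20e9  # Yahoo!'s index size quoted on the slide

hits_single = {"sedan": 58e6, "coupe": 76e6, "car": 1040e6}
hits_pair = {("sedan", "car"): 40e6, ("coupe", "car"): 25e6, ("truck", "car"): 35e6}
connected_to = {"car": ["sedan", "coupe", "truck"]}

def tf(c1, c2):
    """TF(c1, c2) = h / sum of hx over concepts cx connected to c2."""
    total = sum(hits_pair[(cx, c2)] for cx in connected_to[c2])
    return hits_pair[(c1, c2)] / total

def idf(c):
    """IDF(c) = -log(h / index size); a frequent concept gets a low IDF."""
    return -math.log10(hits_single[c] / INDEX_SIZE)

def concept_relatedness(c1, c2):
    return tf(c1, c2) * idf(c1)
```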
Similarity of Ontological Concepts
- First, consider only the taxonomy in the ontology
- Information Content [17]: IC(c) = -log(prob(c))
- Sim(c1, c2) = 2 × IC(ancestor) / (IC(c1) + IC(c2)) [18]
Example (Yahoo! hit counts; probabilities and IC values as printed on the slide):

  Concept   Hit count                          Probability   Information Content
  Car       1040 M (1174 M incl. subclasses)   0.0587        2.23
  Sedan     58 M                               0.0029        2.54
  Coupe     76 M                               0.0038        2.42

Sim(Sedan, Coupe) = 2 × 2.23 / (2.54 + 2.42) = 0.899
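The Lin-style similarity on the slide's example, using the IC values as printed:

```python
# Sim(c1, c2) = 2 * IC(ancestor) / (IC(c1) + IC(c2)), with the IC values
# from the slide's Car/Sedan/Coupe example.
ic = {"car": 2.23, "sedan": 2.54, "coupe": 2.42}

def sim(c1, c2, ancestor):
    return 2 * ic[ancestor] / (ic[c1] + ic[c2])

print(round(sim("sedan", "coupe", "car"), 3))  # 0.899, matching the slide
```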
Similarity of Ontological Concepts
- Also consider other types of relationships by using the Jaccard (cosine) similarity coefficient
(Diagram: Athens Is_located_in Georgia, and Atlanta Is_located_in Georgia)
Matching Clusters to Ontologies
- Compare the important context of a cluster with the context (concepts) of an ontological concept
- Sum up the relatedness of the matched context clusters
- Select the ontological concept with the best matching score that is also above a threshold
Matching Clusters to Ontologies
- A context cluster x is considered matched to a context concept y if:
  - they have the same label (or synonymous labels), or
  - x is matched to another concept y', and the relatedness (or similarity) of y' to y is above a threshold, or
  - another cluster x' (which is matched to y) has relatedness to x above a threshold
Matching Clusters to Ontologies: Example
(Diagram, three cases of matching a "turkey" cluster whose context contains "bird" to an ontological concept "turkey":
- Case 1: the concept's context also contains "bird" (same label)
- Case 2: the concept's context contains "animal", and Rel(bird, animal) > threshold or Sim(bird, animal) > threshold
- Case 3: the concept's context contains "animal", and Rel(bird, animal) > threshold)
Semantic Search [19,20]
- Search by ontological relationships
- Currently we only consider subclass and type relationships
- Map concepts to the corresponding clusters via semantic annotations
- Expand the corresponding clusters by including other clusters with the same label, because some clusters that should have a semantic annotation do not
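A minimal sketch of this search step under assumed ontology content: resolve a class to its instances through type/subclass relationships, then collect the annotated clusters plus same-label clusters that lack annotations.

```python
# Semantic-search sketch: class -> instances via type/subclass, then
# instances -> sense clusters, including same-label unannotated clusters.
# The ontology facts and clusters here are invented for illustration.
type_of = {"Madrid": "EuropeanCity", "Seoul": "AsianCity", "Ottawa": "NorthAmericanCity"}
subclass_of = {"EuropeanCity": "City", "AsianCity": "City", "NorthAmericanCity": "City"}

clusters = [  # (cluster id, label, annotated concept or None)
    ("c1", "madrid", "Madrid"),
    ("c2", "madrid", None),  # same label, no annotation: still included
    ("c3", "seoul", "Seoul"),
]

def instances_of(cls):
    return [i for i, t in type_of.items()
            if t == cls or subclass_of.get(t) == cls]

def search(cls):
    labels = {i.lower() for i in instances_of(cls)}
    return [cid for cid, label, _ in clusters if label in labels]
```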
Semantic Search (illustration)
(Diagram: a query resolved against a Geography domain ontology and a Politics domain ontology; the Seoul, Ottawa, and Madrid concepts map to the corresponding photo clusters)
Most-Desired Senses Ranking
- We need to rank the candidate clusters
- The system shows one photo for each candidate cluster
- The user selects the best photo from the samples
- The system ranks the other clusters based on the selection
Most-Desired Senses Ranking
- The basic idea is finding shortest paths in a graph from a single source
- Put a constant amount of energy on the source cluster, and distribute the energy to the other clusters
- The weight of an edge is the similarity of the two clusters
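One plausible implementation of this energy distribution is a max-product variant of Dijkstra's algorithm: each cluster's energy is the best path product of edge similarities from the source. This reading, and the similarity values, are assumptions for illustration.

```python
# Energy spreading as max-product Dijkstra: start with energy 1.0 at the
# user's chosen cluster and propagate along best paths, multiplying by
# edge similarity at each hop.
import heapq

sim = {  # illustrative cluster similarities
    "madrid": {"ottawa": 0.4, "seoul": 0.2},
    "ottawa": {"madrid": 0.4, "seoul": 0.5},
    "seoul": {"madrid": 0.2, "ottawa": 0.5},
}

def spread_energy(source):
    energy = {source: 1.0}
    heap = [(-1.0, source)]
    while heap:
        e, u = heapq.heappop(heap)
        e = -e
        if e < energy.get(u, 0):
            continue  # stale queue entry
        for v, s in sim[u].items():
            if e * s > energy.get(v, 0):
                energy[v] = e * s
                heapq.heappush(heap, (-e * s, v))
    return energy
```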
Most-Desired Senses Ranking (illustration)
(Diagram: Geography and Politics domain ontologies with Seoul, Ottawa, and Madrid clusters. Starting from a source energy of 1 on the selected cluster, the other clusters receive energies such as 0.464, 0.371, 0.322, 0.243, 0.216, 0.194, 0.094, 0.085, and 0.053.)
Cluster Similarity
- If the semantic annotations of two clusters refer to the same ontology, use the similarity of the corresponding ontological concepts
- Otherwise, calculate the cluster similarity from the contexts of the two clusters
Cluster Similarity by Context
- A modified version of the Dice similarity
- Say we are comparing cluster1 and cluster2
- Compare only the important contexts of cluster1 and cluster2
- Calculate the percentage of overlapping context
- Decide whether a context cluster c1 of cluster1 and a context cluster c2 of cluster2 match in the same way as in matching clusters to ontologies
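A sketch of the modified Dice measure; here "matching" is simplified to label equality, whereas the full system also accepts synonym and relatedness matches, and the contexts are invented for illustration.

```python
# Modified Dice similarity over two clusters' important contexts: count
# matched context items on each side, divided by the total context size.
def dice_similarity(context1, context2, matches=lambda a, b: a == b):
    overlap = sum(1 for c1 in context1 if any(matches(c1, c2) for c2 in context2))
    overlap += sum(1 for c2 in context2 if any(matches(c1, c2) for c1 in context1))
    return overlap / (len(context1) + len(context2))

# Illustrative important contexts of two "Athens" clusters
athens_ga = ["university", "georgia", "football"]
athens_gr = ["university", "greece", "acropolis"]
```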
Ontology Ranking [21,23]
- Ontologies come from a repository
- If multiple ontologies are used for a query, we need to give a weight to each ontology
- The ontology with the higher weight has more power to decide the similarity/relatedness of two ontological concepts
- Rank ontologies using the 4 most recent queries of the same user
Ontology Ranking (illustration)
(Diagram: a concept C in a taxonomy rooted at Thing, annotated with the quantities D(c) and H(c) used by the ranking measure)
System Overview
(Architecture diagram. Modules: Tag Cleanup, Sense Indexing, Ontology Mapping, Ontology Measuring, Ontology Ranking, and Semantic Query, plus the Search Engine. Data: Photos with Tags, Sense Index, Ontology Mapping, Ontologies, Ontology Measures, Ontology Ranks, Query History, Queries, and Query Results.)
Evaluation Measures
- Compare with Google Desktop on the same datasets:
- How much time a user has to spend to find the required photos
- How many mouse clicks a user needs to find the required photos
- How many different queries a user has to issue to find the required photos (the user may change the query at any time)
Evaluations (1)
- Experiment set 1: disambiguation
- Datasets:
  - 500 photos with the tag "apple"
  - 500 photos with the tag "turkey"
Use Case 1
- Task 1: find 50 photos about Apple electronic products
Use Case 2
- Task 2: find 30 photos about the fruit apple
Use Case 3
- Task 3: find 50 photos about the country Turkey
Use Case 4
- Task 4: find 10 photos about turkey birds
Evaluations (2)
- Experiment set 2: semantic search
- Datasets:
  - about 300 photos for each of the following tags: Beijing, Madrid, Ottawa, Rome, Seoul, Tokyo, Baltimore, New York, Pittsburgh, Washington D.C., Amsterdam, Florence, Venice, Athens (Greece), Athens (Georgia)
- Ontologies:
  - an ontology in the travel domain (partially from Realtravel.com)
  - a modified AKTiveSA [24] project ontology in the geography domain
  - an ontology in the politics domain (partially from SWETO [25])
Use Case 5
- Task 5: find up to 5 photos for 5 cities in Europe
Evaluation
- The Most-Desired Senses Ranking approach may add time overhead for selecting the most-wanted photo (sense)
- Changing a query adds time overhead for thinking and typing
- Overall, users spent significantly less time and effort finding the information they wanted
Conclusions
- We proposed an approach to combining folksonomies and ontologies:
  - index Web resources by senses, into sense clusters
  - match sense clusters to ontological concepts
  - semantic search based on ontological relationships
  - the Most-Desired Senses Ranking approach
  - ranking of multiple ontologies
- Evaluation: users spent significantly less time and effort finding the information they wanted
Demo

Questions and Comments
References (1)
- [1] Wal, T.V. Folksonomy Coinage and Definition. 2004. Available from: http://vanderwal.net/folksonomy.html.
- [2] Gruber, T. Ontology of Folksonomy: A Mash-up of Apples and Oranges. International Journal on Semantic Web and Information Systems, 2007. 3(1).
- [3] Halpin, H., V. Robu, and H. Shepherd. The Complex Dynamics of Collaborative Tagging. In WWW '07: Proceedings of the 16th International Conference on World Wide Web. 2007. ACM.
- [4] Wal, T.V. Folksonomy Definition and Wikipedia. 2005. Available from: http://www.vanderwal.net/random/entrysel.php?blog=1750.
- [5] Mika, P. Ontologies are us: A unified model of social networks and semantics. Journal of Web Semantics, 2007. 5(1): p. 5-15.
- [6] Gruber, T.R. A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition, 1993. 5(2): p. 199-220.
- [7] Resource Description Framework (RDF). Available from: http://www.w3.org/RDF/.
- [8] RDF Vocabulary Description Language 1.0: RDF Schema. 2004. Available from: http://www.w3.org/TR/rdf-schema/.
References (2)
- [9] McGuinness, D.L. and F. van Harmelen. OWL Web Ontology Language. 2004. Available from: http://www.w3.org/TR/owl-features/.
- [10] Kiryakov, A., et al. Semantic annotation, indexing, and retrieval. Web Semantics: Science, Services and Agents on the World Wide Web, 2004. 2(1): p. 49-79.
- [11] Ide, N. and J. Véronis. Word sense disambiguation: The state of the art. Computational Linguistics, 1998. 1(24): p. 1-40.
- [12] Wilks, Y. and M. Stevenson. Sense Tagging: Semantic Tagging with a Lexicon. In the SIGLEX Workshop "Tagging Text with Lexical Semantics: What, why and how?" 1997. Washington, D.C.
- [13] Diab, M. and P. Resnik. An Unsupervised Method for Word Sense Tagging using Parallel Corpora. In the 40th Annual Meeting of the Association for Computational Linguistics. 2002. Philadelphia, Pennsylvania.
- [14] Molina, A., et al. Word Sense Disambiguation using Statistical Models and WordNet. In the 3rd International Conference on Language Resources and Evaluation. 2002. Las Palmas de Gran Canaria, Spain.
- [15] Banerjee, S. and B.P. Mullick. Word Sense Disambiguation and WordNet Technology. Literary and Linguistic Computing, 2007. 22(1): p. 1-15.
- [16] Fellbaum, C. WordNet: An Electronic Lexical Database. 1998. The MIT Press.
References (3)
- [17] Resnik, P. Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. Journal of Artificial Intelligence Research, 1999. 11: p. 95-130.
- [18] Lin, D. An Information-Theoretic Definition of Similarity. In the International Conference on Machine Learning (ICML). 1998. Madison, Wisconsin, USA.
- [19] Sheth, A., et al. Managing Semantic Content for the Web. IEEE Internet Computing, 2002. 6(4): p. 80-87.
- [20] Guha, R., R. McCool, and E. Miller. Semantic Search. In the 12th International Conference on World Wide Web. 2003.
- [21] Arumugam, M., A. Sheth, and I.B. Arpinar. Towards Peer-to-Peer Semantic Web: A Distributed Environment for Sharing Semantic Knowledge on the Web. In the International Workshop on Real World RDF and Semantic Web Applications. 2002. Hawaii, USA.
- [22] Alani, H. and C. Brewster. Ontology ranking based on the analysis of concept structures. In the 3rd International Conference on Knowledge Capture. 2005.
- [23] Zhang, Y., W. Vasconcelos, and D. Sleeman. OntoSearch: An Ontology Search Engine. In The Twenty-fourth SGAI International Conference on Innovative Techniques and Applications of Artificial Intelligence. 2004. Cambridge, UK.
- [24] AKTiveSA. Available from: http://sa.aktivespace.org/.
- [25] Aleman-Meza, B., et al. SWETO: Large-Scale Semantic Web Test-bed. In the 16th Int'l Conf. Software Eng. & Knowledge Eng., Workshop on Ontology in Action, Knowledge Systems Inst. 2004.