Title: Making the Web searchable, or the Future of Web Search
1Making the Web searchable, or the Future of Web Search
- Peter Mika
- Yahoo! Research Barcelona
2About Yahoo!
- Yahoo!'s mission is to connect people to their
passions, their communities, and the world's
knowledge
- Yahoo! Research Barcelona
- Established January, 2006
- Led by Ricardo Baeza-Yates
- Research areas
- Web Mining
- content, structure, usage
- Distributed Web retrieval
- Multimedia retrieval
- NLP and Semantics
3Yahoo! by numbers (April, 2007)
- There are approximately 500 million users of
Yahoo! branded services, meaning we reach 50
percent or 1 out of every 2 users online, the
largest audience on the Internet (Yahoo! Internal
Data).
- Yahoo! is the most visited site online with
nearly 4 billion visits and an average of 30
visits per user per month in the U.S. and leads
all competitors in audience reach, frequency and
engagement (comScore Media Metrix, US, Feb.
2007).
- Yahoo! accounts for the largest share of time
Americans spend on the Internet with 12 percent
(comScore Media Metrix, US, Feb. 2007) and
approximately 8 percent of the world's online
time (comScore WorldMetrix, Feb. 2007).
- Yahoo! is the #1 home page with 85 million
average daily visitors on Yahoo! homepages around
the world, an increase of nearly 5 million
visitors in a month (comScore WorldMetrix, Feb.
2007).
- Yahoo!'s social media properties (Flickr,
delicious, Answers, 360, Video, MyBlogLog,
Jumpcut and Bix) have 115 million unique visitors
worldwide (comScore WorldMetrix, Feb. 2007).
- Yahoo! Answers is the largest collection of human
knowledge on the Web with more than 90 million
unique users and 250 million answers worldwide
(Yahoo! Internal Data).
- There are more than 450 million photos in Flickr
in total and 1 million photos are uploaded daily.
80 percent of the photos are public (Yahoo!
Internal Data).
- Del.icio.us reached 2 million users in February,
growing to more than six times its size of 300,000
users in December 2005 (Yahoo! Internal Data).
- Yahoo! Mail is the #1 Web mail provider in the
world with 243 million users (comScore
WorldMetrix, Feb. 2007) and nearly 80 million
users in the U.S. (comScore Media Metrix, US,
Feb. 2007).
- Interoperability between Yahoo! Messenger and
Windows Live Messenger has formed the largest IM
community, approaching 350 million user accounts
(Yahoo! Internal Data).
- Yahoo! Messenger is the most popular in time
spent, with an average of 50 minutes per user per
day (comScore WorldMetrix, Feb. 2007).
- Nearly 1 in 10 Internet users is a member of a
Yahoo! Group (Yahoo! Internal Data).
- Yahoo! News is the #1 online news destination and
reached a new audience high in February with
36.2 million users, 10 million more users than
its nearest competitor MSNBC (comScore Media
Metrix, US, Feb. 2007).
- Yahoo! is one of only 26 companies to be on both
the Fortune 500 list and Fortune's Best
Place to Work list (2006).
4Overview
- Why reconsider search?
- Context
- Semantic Web metadata infrastructure
- Web 2.0 user-generated metadata
- Thesis: making the Web searchable
- Research challenges (SW IR)
- Conclusion
5Motivation
- State of Web search
- Picked the low hanging fruit
- Heavy investments, marginal returns
- High hanging fruits
- Hard searches remain
- The Web and its technology have changed
- Semantic Web
- Web 2.0
6Hard searches
- Ambiguous searches
- Paris Hilton
- Multimedia search
- Images of Paris Hilton
- Imprecise or overly precise searches
- Publications by Jim Hendler
- Find images of strong and adventurous people
(Lenat)
- Searches for descriptions
- Search for yourself without using your name
- Product search (ads!)
- Searches that require aggregation
- Size of the Eiffel Tower (Lenat)
- Public opinion on Britney Spears
- Queries that require a deeper understanding of
the query, the content and/or the world at large
- Note: some of these are so hard that users don't
even try them any more
7Example
8The Semantic Web (1996-)
- Making the content of the Web machine processable
through metadata
- Documents, databases, Web services
- Active research, standardization, startups
- Ontology languages (RDF, OWL family), query
language for RDF (SPARQL)
- Software support (metadata stores, reasoners,
APIs)
9Problem: difficulties in deployment
- Not enough take-up in the Web community at large
- Technological challenges
- Discovery
- Ontology learning
- Ontology mapping
- Lack of attention to the social side
- Over-estimating complexity for users
- Need for supporting ontology creation and sharing
- Focus shifts from documents to databases: the
Web of Data
- Task/domain-specific applications
10Web 2.0 (2003-)
- Simple, nimble, socially transparent interfaces
- Simplified KR
- e.g. tagging, microformats, Wikipedia infoboxes
- In exchange for a better experience, users are willing to
- Provide content, markup and metadata
- Provide data on themselves and their networks
- Rank, rate, filter, forward
- Develop software and improve your site
- User-generated content
- Content that users actually care about!
11Example: Microformats
- Agreements on the way to encode certain kinds of
metadata in HTML
- Reuse of semantic-bearing HTML elements
- Based on existing standards
- Each microformat defines a vocabulary for
describing a given type of resource
- Persons, Events, but also syntactic metadata:
licenses, tags
- Not ontologies
- No formal descriptions of schema, only text
- No namespaces, unique identifiers (URIs)
- → no interlinking, reuse among schemas
- No datatypes
- Widely used in millions of hand-authored
documents
- And in hundreds of millions of dynamically generated
ones
12Example: hCard
href="mailto:jfriday@host.com" Joe Friday
1-919-555-7878 class="title" Area Administrator, Assistant
rel="friend colleague met" href="http://meyerweb.com/" Eric Meyer wrote a post
( /2005/12/16/tax-relief/" Tax Relief)
about an unintentionally humorous letter he
received from the class="fn org url" href="http://irs.gov/" Internal Revenue Service.
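A minimal sketch of how a consumer might read hCard properties out of markup like the fragment above, using only Python's standard library. The sample markup, the `HCardParser` class and the `parse_hcard` helper are illustrative assumptions and not part of the original slides or of any microformats toolkit. The sketch assumes well-balanced markup without void elements and ignores nested cards and the value rules a real hCard parser would apply.

```python
from html.parser import HTMLParser

# Hypothetical sample markup, loosely modelled on the hCard fragment above.
SAMPLE = """
<div class="vcard">
  <a class="email fn" href="mailto:jfriday@host.com">Joe Friday</a>
  <div class="tel">1-919-555-7878</div>
  <div class="title">Area Administrator, Assistant</div>
</div>
"""

# A small subset of hCard property names (carried as class values).
HCARD_PROPERTIES = {"fn", "tel", "title", "org"}


class HCardParser(HTMLParser):
    """Collects the text of elements whose class names an hCard property."""

    def __init__(self):
        super().__init__()
        self.card = {}
        # One entry per open element: the hCard properties named in its class.
        self._stack = []

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        self._stack.append(classes & HCARD_PROPERTIES)

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Attribute the text to every property named on an enclosing element.
        for props in self._stack:
            for prop in props:
                self.card[prop] = (self.card.get(prop, "") + " " + text).strip()


def parse_hcard(html):
    """Return a dict mapping hCard property names to their text values."""
    parser = HCardParser()
    parser.feed(html)
    parser.close()
    return parser.card


if __name__ == "__main__":
    print(parse_hcard(SAMPLE))
```

The point of the sketch is that the metadata rides on ordinary `class` attributes, so no namespaces or extra syntax are needed to recover it.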
13Example: Wikipedia infoboxes
- Templates for common types of objects
14Example: Wikipedia infoboxes
15Example: Wikipedia infoboxes
- cf. microformats
- Similar level of representation
- Infoboxes are never annotations
- Largely uncontrolled growth
- Niche templates
- Templates in several languages
- → overlapping domains
- Infoboxes to RDF
- dbPedia
- Compare also to Semantic Wikis
- Semantic MediaWiki, OntoWiki etc.
16Web 2.0 bottleneck: lack of foundations
- Tags
- No shared syntax (TagCommons? A microformat for
tags?)
- Mapping problems due to lack of semantics
- flickr:ajax = del.icio.us:ajax ?
- flickr:ajax:Peter = flickr:ajax:John ?
- flickr:ajax:Peter:1990 = flickr:ajax:Peter:2006 ?
- Microformats
- You cannot make a vocabulary for everything in a
centralized way
- Serious validation and mapping problems on the
instance level
- Wikipedia
- Serious validation and mapping problems on both the
instance and the schema level
17Thesis: making the Web searchable
- The Web has changed
- Content owners want their content to
be found (Web 2.0)
- Cf. findability (Peter Morville), reusability
(mashups), open data movement
- Foundations are laid for a Semantic Web
- We need to
- Combine the best of Web 2.0 and the Semantic Web
- Reconsider Web search in this new world
18Semantic Web and Web 2.0
- Focus on user-generated content
- Getting the representation right
- RDF
- Embedded RDF
- GRDDL, RDFa
- Innovations on the interface side
- Capture semantics while authoring
- New methods of reasoning
- Semantics = syntax + statistics
- Bottom-up, emergent semantics
- Methods of logical reasoning combined with
methods of graph mining, statistics
- Scalability
- Giving up soundness and/or completeness
- Dealing with the mess
- Social engineering
- Collaborative spaces for creating and sharing
ontologies, data
- Connecting islands of semantics
- Best practices, documentation, advocacy
19Example: GRDDL
- Bridges the world of microformats and RDF
- Associate RDF-producing XSLT transformations with
XML and (X)HTML documents
- One page may contain different microformats (e.g.
persons and events described in the same page)
- One microformat may be mapped to multiple
ontologies
ta-view" Joe Friday's Home page
href="http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokFOAF.xsl" /
- Note: of course it is possible to extract
non-RDF data through XSLT,
- e.g. extract vCard from an HTML fragment,
but that's not called GRDDL
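A minimal sketch of the discovery step that a GRDDL-aware agent performs before running any XSLT: it checks that the page declares the GRDDL profile and then collects the stylesheets linked with `rel="transformation"`. The sample page is a hypothetical reconstruction around the stylesheet URL shown above, and `TransformationFinder` and `find_transformations` are illustrative names. Applying the stylesheets needs an XSLT processor, which Python's standard library does not provide, so that step is left out.

```python
from html.parser import HTMLParser

# Profile URI that marks an XHTML page as carrying GRDDL transformations.
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"

# Hypothetical page; only the stylesheet URL is taken from the slide.
SAMPLE_PAGE = """
<html>
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>Joe Friday's Home page</title>
    <link rel="transformation"
          href="http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokFOAF.xsl" />
  </head>
  <body></body>
</html>
"""


class TransformationFinder(HTMLParser):
    """Records the GRDDL profile declaration and transformation links."""

    def __init__(self):
        super().__init__()
        self.has_profile = False
        self.transformations = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "head":
            # The profile attribute is a space-separated list of URIs.
            profiles = (attrs.get("profile") or "").split()
            self.has_profile = GRDDL_PROFILE in profiles
        elif tag in ("link", "a"):
            rels = (attrs.get("rel") or "").split()
            if "transformation" in rels and attrs.get("href"):
                self.transformations.append(attrs["href"])


def find_transformations(html):
    """Return the XSLT URLs a GRDDL agent would apply to this page."""
    finder = TransformationFinder()
    finder.feed(html)
    finder.close()
    # Without the profile, the links carry no GRDDL meaning.
    return finder.transformations if finder.has_profile else []


if __name__ == "__main__":
    print(find_transformations(SAMPLE_PAGE))
```

Because a page can list several transformation links, the same document can yield RDF for different microformats or in different target ontologies.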
20Example: RDFa
- Embedding RDF into (X)HTML
- Increased complexity, e.g. namespaces
- Reuse of semantic-bearing HTML elements is not
possible any more
- No need for XSLT any more
- You can use XSLT to extract RDFa, but don't have
to
- Not much track record
- Big question: user complexity (→ data quality)
ons.bib" about="mika06jws"
class="swrc:Article hbib article" class="vcard" property="foaf:name"
rev="swrc:author" href="mika06jws"
class="foaf:Person author fn" Peter Mika
21Example: openacademia.org and RDFa
@INBOOK{ens03ontoknowledge,
  AUTHOR = "Victor Iosif and Peter Mika and Rikard Larsson and Hans Akkermans",
  TITLE = "Towards the Semantic Web: Ontology-Driven Knowledge Management",
  CHAPTER = "Field Experimenting With Semantic Web Tools In A Virtual Organization",
  PUBLISHER = "John Wiley \& Sons",
  YEAR = "2003"
}
22Example: machine tags
23Example: ZoneTag project (Yahoo! Research Berkeley)
24Example: Freebase
25Web Search 2.0
- In an ideal world
- Plenty of metadata to harvest
- Metadata is unambiguous, described using a single
ontology or a set of carefully designed
ontologies
- User intent can be captured directly as a formal
query
- Query and the knowledge base use the same
ontology
- Query is executed on a single knowledge base,
gives the correct, single answer
- And all this very fast
26Web Search 2.0
- In reality
- Many lightweight ontologies or just tags
- Tags are mostly personal, not social
- Intent is unclear, matching is a problem
- Poor quality of annotations
- Everyone's a librarian
- 99% of the Web is Web 1.0
- Input/output interface
- Keywords for searching
- Very limited interaction
- Not everything scales
27Web Search 2.0
- Keep on improving machine technology
- NLP
- Information Extraction
- Exploit the users for the tasks that are hard for
the machine
- Encourage and support users
- Exploit user-generated metadata in any shape or
form
- Support standards of the SW architecture
28Example: folksonomies
- Simplified view: tags are just anchor text
- Can be used to generate simple co-occurrence
graphs
[Figure: graph linking the tags hilton, paris and eiffel to url1, url2 and url3]
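A minimal sketch of the co-occurrence idea in the simplified view: two tags are linked whenever they annotate the same URL, and the edge weight counts the URLs they share. The tag and URL names come from the slide, but which tag is attached to which URL is not recoverable from it, so the `ANNOTATIONS` list below is an assumption, as is the `cooccurrence` helper.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (tag, url) annotations; the pairing is assumed, not from the slide.
ANNOTATIONS = [
    ("hilton", "url1"),
    ("paris", "url1"),
    ("paris", "url2"),
    ("eiffel", "url2"),
    ("eiffel", "url3"),
]


def cooccurrence(annotations):
    """Return {(tag_a, tag_b): number of URLs carrying both tags}."""
    tags_by_url = defaultdict(set)
    for tag, url in annotations:
        tags_by_url[url].add(tag)

    weights = defaultdict(int)
    for tags in tags_by_url.values():
        # Sort so each unordered pair has a single canonical key.
        for pair in combinations(sorted(tags), 2):
            weights[pair] += 1
    return dict(weights)


if __name__ == "__main__":
    for (a, b), weight in sorted(cooccurrence(ANNOTATIONS).items()):
        print(a, b, weight)
```

With these assumed annotations, "paris" ends up linked to both "hilton" and "eiffel", which is exactly the ambiguity that a plain co-occurrence graph cannot resolve on its own.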
30The more complete picture
- Folksonomies as tripartite graphs of users, URLs
and tags
[Figure: tripartite graph of users (user1, user2, user3), tags (hilton, paris, eiffel) and URLs (url1, url2, url3)]
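A minimal sketch of what the user dimension adds: the same tagging data can be folded into a tag network either through shared URLs or through shared users, and the two projections need not agree. The `TRIPLES` below reuse the node names from the slide, but the assignments themselves are assumed, and `project_tags` is an illustrative helper rather than the method of any particular paper.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (user, tag, url) tagging events; the assignments are assumed.
TRIPLES = [
    ("user1", "hilton", "url1"),
    ("user1", "eiffel", "url3"),
    ("user2", "hilton", "url1"),
    ("user2", "paris", "url1"),
    ("user3", "paris", "url2"),
    ("user3", "eiffel", "url2"),
    ("user3", "eiffel", "url3"),
]


def project_tags(triples, via="user"):
    """Fold the tripartite graph into a weighted tag-tag network.

    via="user": two tags are linked once per user who applied both.
    via="url":  two tags are linked once per URL that carries both.
    """
    if via not in ("user", "url"):
        raise ValueError("via must be 'user' or 'url'")
    position = 0 if via == "user" else 2

    tags_by_node = defaultdict(set)
    for triple in triples:
        tags_by_node[triple[position]].add(triple[1])

    weights = defaultdict(int)
    for tags in tags_by_node.values():
        for pair in combinations(sorted(tags), 2):
            weights[pair] += 1
    return dict(weights)


if __name__ == "__main__":
    print("via users:", project_tags(TRIPLES, via="user"))
    print("via URLs: ", project_tags(TRIPLES, via="url"))
```

In this assumed data, "hilton" and "eiffel" never share a URL but are linked through user1, so the user-based projection reflects the interests of the community rather than the content of the pages.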
31Mining and modelling folksonomies
- Opportunities for mining community-specific
interpretations of the world
- Peter Mika. Ontologies are us: A unified model of
social networks and semantics. Journal of Web
Semantics 5 (1), pages 5-15, 2007
- Related work at
- Social and Collaborative Construction of
Structured Knowledge (CKC2007)
- Bridging the gap between Semantic Web and Web 2.0
(ESWC2007)
- Journal of Web Semantics special issue on
Semantic Web and Web 2.0 (upcoming, Q4 2007)
- TAGORA project
32Vision: ontology-based search
- Query at the knowledge level
- Partial description of a class/instance
- Mapping of queries and resources in the
conceptual space
- Computing relevance in semantic terms
- Novel user interfaces
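A minimal sketch of a query posed as a partial description of an instance, with relevance computed as the share of the described properties that a resource satisfies. The `INSTANCES` data, the property names and the scoring rule are all assumptions made for illustration; the slides do not prescribe a particular relevance measure.

```python
# Hypothetical knowledge base: instance name -> property/value description.
INSTANCES = {
    "pub1": {"type": "Article", "author": "Author A", "year": 2007},
    "pub2": {"type": "Book", "author": "Author A", "year": 2007},
    "pub3": {"type": "Article", "author": "Author B", "year": 2001},
}


def match_score(description, instance):
    """Fraction of the query's property/value pairs that the instance satisfies."""
    if not description:
        return 0.0
    matched = sum(
        1 for prop, value in description.items() if instance.get(prop) == value
    )
    return matched / len(description)


def search(description, instances):
    """Rank instance names by match score, dropping complete mismatches."""
    scored = [
        (match_score(description, props), name) for name, props in instances.items()
    ]
    # Highest score first; ties are broken by name to keep the order stable.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in ranked if score > 0]


if __name__ == "__main__":
    query = {"type": "Article", "author": "Author A"}
    print(search(query, INSTANCES))
```

The sketch presumes that the query and the resources already share one vocabulary, which is the ideal case; mapping between vocabularies is where the hard work lies.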
33Technical challenges
- Improving NLP and IE
- Query interface
- Data quality
- Cleaning up metadata, tags
- Spam
- Ontology mapping and entity resolution
- Ranking across types
- Results display
- How do you avoid information overload?
- How do you display information you partially
understand?
34Social challenges
- Getting the users on your side
- Users are unwilling to submit large amounts of
structured data to a commercial entity (Google
Base)
- Provide a clear motivation and/or instant
gratification
- Trust them but not too much (Mahalo)
35Example: Technorati and microformats
http://technorati.com/posts/tag/semanticweb
36Conclusion
- Why a new vision?
- The opportunity: convergence
- Semantic Web metadata infrastructure
- Web 2.0 user-generated metadata
- Thesis: making the Web searchable
- Semantic Web and Web 2.0
- Web Search 2.0
37What is there to gain?
- Knowledge-based search
- Sorting out hard searches
- Creating new information needs
- Beyond search
- Analysis, design, diagnosis etc. on top of
aggregated data
- Personalization
- Rich user profiles
- Monetization
- No more "buy virgins on eBay"
38Questions?
- Peter Mika. Social Networks and the Semantic Web.
Springer, July, 2007.
- Special Issue on the Semantic Web and Web 2.0,
Journal of Web Semantics, December, 2007.