Making the Web searchable, or the Future of Web Search

1
Making the Web searchable, or the Future of Web
Search
  • Peter Mika
  • Yahoo! Research Barcelona

2
About Yahoo!
  • Yahoo!'s mission is to connect people to their
    passions, their communities, and the world's
    knowledge
  • Yahoo! Research Barcelona
  • Established January, 2006
  • Led by Ricardo Baeza-Yates
  • Research areas
  • Web Mining
  • content, structure, usage
  • Distributed Web retrieval
  • Multimedia retrieval
  • NLP and Semantics

3
Yahoo! by numbers (April, 2007)
  • There are approximately 500 million users of
    Yahoo! branded services, meaning we reach 50
    percent or 1 out of every 2 users online, the
    largest audience on the Internet (Yahoo! Internal
    Data).
  • Yahoo! is the most visited site online with
    nearly 4 billion visits and an average of 30
    visits per user per month in the U.S. and leads
    all competitors in audience reach, frequency and
    engagement (comScore Media Metrix, US, Feb.
    2007).
  • Yahoo! accounts for the largest share of time
    Americans spend on the Internet with 12 percent
    (comScore Media Metrix, US, Feb. 2007) and
    approximately 8 percent of the world's online
    time (comScore WorldMetrix, Feb. 2007).
  • Yahoo! is the #1 home page with 85 million
    average daily visitors on Yahoo! homepages around
    the world, an increase of nearly 5 million
    visitors in a month (comScore WorldMetrix, Feb.
    2007).
  • Yahoo!'s social media properties (Flickr,
    delicious, Answers, 360, Video, MyBlogLog,
    Jumpcut and Bix) have 115 million unique visitors
    worldwide (comScore WorldMetrix, Feb. 2007).
  • Yahoo! Answers is the largest collection of human
    knowledge on the Web with more than 90 million
    unique users and 250 million answers worldwide
    (Yahoo! Internal Data).
  • There are more than 450 million photos in Flickr
    in total and 1 million photos are uploaded daily.
    80 percent of the photos are public (Yahoo!
    Internal Data).
  • Del.icio.us hits 2 million users in February,
    growing more than six times its size from 300,000
    users in December 2005 (Yahoo! Internal Data).
  • Yahoo! Mail is the #1 Web mail provider in the
    world with 243 million users (comScore
    WorldMetrix, Feb. 2007) and nearly 80 million
    users in the U.S. (comScore Media Metrix, US,
    Feb. 2007)
  • Interoperability between Yahoo! Messenger and
    Windows Live Messenger has formed the largest IM
    community approaching 350 million user accounts
    (Yahoo! Internal Data).
  • Yahoo! Messenger is the most popular in time
    spent with an average of 50 minutes per user, per
    day (comScore WorldMetrix, Feb. 2007).
  • Nearly 1 in 10 Internet users is a member of a
    Yahoo! Group (Yahoo! Internal Data).
  • Yahoo! News is the #1 online news destination and
    has reached a new audience high in February with
    36.2 million users, 10 million more users than
    its nearest competitor MSNBC (comScore Media
    Metrix, US, Feb. 2007).
  • Yahoo! is one of only 26 companies to be on both
    the Fortune 500 list and Fortune's Best
    Place to Work list (2006).

4
Overview
  • Why reconsider search?
  • Context
  • Semantic Web metadata infrastructure
  • Web 2.0 user-generated metadata
  • Thesis: making the Web searchable
  • Research challenges (SW + IR)
  • Conclusion

5
Motivation
  • State of Web search
  • Picked the low hanging fruit
  • Heavy investments, marginal returns
  • High hanging fruits
  • Hard searches remain
  • The Web and its technology have changed
  • Semantic Web
  • Web 2.0

6
Hard searches
  • Ambiguous searches
  • Paris Hilton
  • Multimedia search
  • Images of Paris Hilton
  • Imprecise or overly precise searches
  • Publications by Jim Hendler
  • Find images of strong and adventurous people
    (Lenat)
  • Searches for descriptions
  • Search for yourself without using your name
  • Product search (ads!)
  • Searches that require aggregation
  • Size of the Eiffel Tower (Lenat)
  • Public opinion on Britney Spears
  • Queries that require a deeper understanding of
    the query, the content and/or the world at large
  • Note: some of these are so hard that users don't
    even try them any more

7
Example
8
The Semantic Web (1996-)
  • Making the content of the Web machine processable
    through metadata
  • Documents, databases, Web services
  • Active research, standardization, startups
  • Ontology languages (RDF, OWL family), query
    language for RDF (SPARQL)
  • Software support (metadata stores, reasoners,
    APIs)

9
Problem: difficulties in deployment
  • Not enough take-up in the Web community at large
  • Technological challenges
  • Discovery
  • Ontology learning
  • Ontology mapping
  • Lack of attention to the social side
  • Over-estimating complexity for users
  • Need for supporting ontology creation and sharing
  • Focus shifts from documents to databases -- the
    Web of Data
  • Task/domain-specific applications

10
Web 2.0 (2003-)
  • Simple, nimble, socially transparent interfaces
  • Simplified KR
  • e.g. tagging, microformats, Wikipedia infoboxes
  • In exchange for a better experience,
  • users are willing to
  • Provide content, markup and metadata
  • Provide data on themselves and their networks
  • Rank, rate, filter, forward
  • Develop software and improve your site
  • User-generated content
  • Content that users actually care about!

11
Example: Microformats
  • Agreements on the way to encode certain kinds
    of metadata in HTML
  • Reuse of semantic-bearing HTML elements
  • Based on existing standards
  • Each microformat defines a vocabulary for
    describing a given type of resource
  • Persons, Events, but also syntactic metadata:
    licenses, tags
  • Not ontologies
  • No formal descriptions of schema, only text
  • No namespaces, unique identifiers (URIs)
  • → no interlinking, reuse among schemas
  • No datatypes
  • Widely used in millions of hand-authored
    documents
  • And in hundreds of millions of dynamically generated
    ones

12
Example: hCard
  <a href="mailto:jfriday@host.com">Joe Friday</a>
  1-919-555-7878
  <… class="title">Area Administrator, Assistant</…>

  <a rel="friend colleague met" href="http://meyerweb.com/">Eric Meyer</a>
  wrote a post (<a href="…/2005/12/16/tax-relief/">Tax Relief</a>)
  about an unintentionally humorous letter he received from the
  <a class="fn org url" href="http://irs.gov/">Internal Revenue Service</a>.
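The class-based encoding in the hCard fragment above can be read back with very little machinery. The following is a minimal sketch using only the Python standard library, not a conforming hCard parser: the property set is limited to names that appear on this slide, and the sample markup (the `vcard` wrapper, `class="fn"` on the link, the `div` elements) is assumed for illustration.

```python
from html.parser import HTMLParser

# Property names seen on this slide; the real hCard vocabulary is larger.
HCARD_PROPERTIES = {"fn", "org", "url", "title"}


class HCardSketch(HTMLParser):
    """Collect the text of elements whose class names an hCard property."""

    def __init__(self):
        super().__init__()
        self.found = {}    # property name -> list of text values
        self._stack = []   # hCard properties of each currently open element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        self._stack.append([c for c in classes if c in HCARD_PROPERTIES])

    def handle_endtag(self, tag):
        # Assumes well-formed markup without unclosed void elements such as <br>.
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            # Attribute the text to the innermost open element only.
            for prop in self._stack[-1]:
                self.found.setdefault(prop, []).append(text)


# Hypothetical markup: the wrapper, element names and class="fn" are assumed.
SAMPLE = (
    '<div class="vcard">'
    '<a class="fn" href="mailto:jfriday@host.com">Joe Friday</a>'
    '<div class="title">Area Administrator, Assistant</div>'
    '</div>'
)

parser = HCardSketch()
parser.feed(SAMPLE)
parser.close()
print(parser.found)
```

Because the vocabulary is only a set of agreed class names, a consumer has to hard-code each microformat it understands, which is the "not ontologies" limitation listed on the previous slide.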
13
Example: Wikipedia infoboxes
  • Templates for common types of objects

14
Example: Wikipedia infoboxes

15
Example: Wikipedia infoboxes
  • cf. microformats
  • Similar level of representation
  • Infoboxes are never annotations
  • Largely uncontrolled growth
  • Niche templates
  • Templates in several languages
  • → overlapping domains
  • Infoboxes to RDF
  • DBpedia
  • Compare also to Semantic Wikis
  • Semantic MediaWiki, OntoWiki etc.

16
Web 2.0 bottleneck: lack of foundations
  • Tags
  • No shared syntax (TagCommons? A microformat for
    tags?)
  • Mapping problems due to lack of semantics
  • flickr:ajax = del.icio.us:ajax ?
  • flickr:ajax:Peter = flickr:ajax:John ?
  • flickr:ajax:Peter:1990 = flickr:ajax:Peter:2006 ?
  • Microformats
  • You cannot make a vocabulary for everything in a
    centralized way
  • Serious validation and mapping problems on the
    instance level
  • Wikipedia
  • Serious validation and mapping problems on both the
    instance and the schema level

17
Thesis: making the Web searchable
  • The Web has changed
  • Content owners want their content to be
    found (Web 2.0)
  • Cf. findability (Peter Morville), reusability
    (mashups), open data movement
  • Foundations are laid for a Semantic Web
  • We need to
  • Combine the best of Web 2.0 and the Semantic Web
  • Reconsider Web search in this new world

18
Semantic Web and Web 2.0
  • Focus on user-generated content
  • Getting the representation right
  • RDF
  • Embedded RDF
  • GRDDL, RDFa
  • Innovations on the interface side
  • Capture semantics while authoring
  • New methods of reasoning
  • Semantics = syntax + statistics
  • Bottom-up, emergent semantics
  • Methods of logical reasoning combined with
    methods of graph mining, statistics
  • Scalability
  • Giving up soundness and/or completeness
  • Dealing with the mess
  • Social engineering
  • Collaborative spaces for creating and sharing
    ontologies, data
  • Connecting islands of semantics
  • Best practices, documentation, advocacy

19
Example: GRDDL
  • Bridges the world of microformats and RDF
  • Associate RDF-producing XSLT transformations to
    XML and (X)HTML documents
  • One page may contain different microformats (e.g.
    persons and events described in the same page)
  • One microformat may be mapped to multiple
    ontologies

  …ta-view"> Joe Friday's Home page
  <… href="http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokFOAF.xsl" />
  • Note: of course it is possible to extract
    non-RDF data through XSLT,
  • e.g. extract vCard from an HTML fragment,
    but that's not called GRDDL
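The discovery half of GRDDL, finding which XSLT transformations a page declares, can be sketched with the Python standard library. This is an illustration under stated assumptions, not a GRDDL implementation: it only collects `profile` values and links marked `rel="transformation"`, the profile URI in the sample is a hypothetical placeholder, and actually running the stylesheet would need an XSLT processor, which the standard library does not provide.

```python
from html.parser import HTMLParser


class TransformationFinder(HTMLParser):
    """Collect declared profiles and rel="transformation" links from a page."""

    def __init__(self):
        super().__init__()
        self.profiles = []
        self.transformations = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "head":
            # The profile attribute may hold several space-separated URIs.
            self.profiles.extend((attributes.get("profile") or "").split())
        elif tag in ("link", "a"):
            rels = (attributes.get("rel") or "").split()
            href = attributes.get("href")
            if "transformation" in rels and href:
                self.transformations.append(href)


# Sample page: the profile URI is a hypothetical placeholder; the stylesheet
# URL is the one shown on the slide.
PAGE = """
<html>
  <head profile="http://example.org/hypothetical-profile">
    <title>Joe Friday's Home page</title>
    <link rel="transformation"
          href="http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokFOAF.xsl" />
  </head>
  <body></body>
</html>
"""

finder = TransformationFinder()
finder.feed(PAGE)
finder.close()
print(finder.profiles)
print(finder.transformations)
```

A page that mixes several microformats would simply declare several transformation links, one per vocabulary, and the consumer would merge the RDF each one produces.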

20
Example: RDFa
  • Embedding RDF into (X)HTML
  • Increased complexity, e.g. namespaces
  • Reuse of semantic-bearing HTML elements is not
    possible any more
  • No need for XSLT any more
  • You can use XSLT to extract RDFa, but don't have
    to
  • Not much track record
  • Big question: user complexity (→ data quality)

  …ons.bib" about="mika06jws"
    class="swrc:Article hbib article"
    class="vcard"
    property="foaf:name"
    rev="swrc:author" href="mika06jws"
    class="foaf:Person author fn"> Peter Mika
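The namespace complexity mentioned above comes down to prefixed names that must be expanded to full URIs before they mean anything. The sketch below shows that expansion step only, not RDFa processing as a whole. The `foaf` namespace is the published FOAF one; the `swrc` namespace URI and both helper names are assumptions made for illustration.

```python
# Prefix table. The swrc URI is an assumption made for this sketch.
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "swrc": "http://swrc.ontoware.org/ontology#",
}


def expand_curie(curie, prefixes=PREFIXES):
    """Turn a prefixed name such as 'foaf:name' into a full URI."""
    prefix, sep, local = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"no known prefix in {curie!r}")
    return prefixes[prefix] + local


def rdf_types(class_attribute, prefixes=PREFIXES):
    """Keep only the prefixed tokens of a class attribute, expanded to URIs.

    Plain tokens such as 'hbib' or 'article' stay ordinary CSS/microformat
    class names and are ignored here.
    """
    uris = []
    for token in class_attribute.split():
        prefix, sep, _ = token.partition(":")
        if sep and prefix in prefixes:
            uris.append(expand_curie(token, prefixes))
    return uris


print(expand_curie("foaf:name"))
print(rdf_types("swrc:Article hbib article"))
print(rdf_types("foaf:Person author fn"))
```

The same attribute value can therefore carry both a microformat reading (`hbib`, `author`, `fn`) and an RDF reading (`swrc:Article`, `foaf:Person`), as the fragment on this slide does.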
           

    21
    Example: openacademia.org and RDFa
    @INBOOK{ens03ontoknowledge,
      AUTHOR = "Victor Iosif and Peter Mika and Rikard Larsson and Hans Akkermans",
      TITLE = "Towards the Semantic Web: Ontology-Driven Knowledge Management",
      CHAPTER = "Field Experimenting With Semantic Web Tools In A Virtual Organization",
      PUBLISHER = "John Wiley \& Sons",
      YEAR = "2003"
    }

    22
    Example: machine tags
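Machine tags, as introduced on Flickr, follow the shape `namespace:predicate=value`, which makes an ordinary tag parseable as a small statement. The sketch below splits such a tag into its three parts. The regular expression is an approximation of the syntax rather than Flickr's exact rule, and the sample tags are made up for illustration.

```python
import re
from typing import NamedTuple, Optional


class MachineTag(NamedTuple):
    namespace: str
    predicate: str
    value: str


# Approximate syntax: namespace and predicate start with a letter and continue
# with letters, digits or underscores; the value is whatever follows "=".
_MACHINE_TAG = re.compile(r"^([A-Za-z][A-Za-z0-9_]*):([A-Za-z][A-Za-z0-9_]*)=(.*)$")


def parse_machine_tag(raw: str) -> Optional[MachineTag]:
    """Return the parts of a machine tag, or None for an ordinary tag."""
    match = _MACHINE_TAG.match(raw)
    return MachineTag(*match.groups()) if match else None


# Hypothetical sample tags.
for raw in ["geo:lat=41.39", "upcoming:event=12345", "barcelona"]:
    print(raw, "->", parse_machine_tag(raw))
```

Ordinary tags fall through as `None`, so a tagging system can store both kinds in the same field and only interpret the ones that match.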
    23
    Example: ZoneTag project (Yahoo! Research
    Berkeley)
    24
    Example: Freebase
    25
    Web Search 2.0
    • In an ideal world
    • Plenty of metadata to harvest
    • Metadata is unambiguous, described using a single
      ontology or a set of carefully designed
      ontologies
    • User intent can be captured directly as a formal
      query
    • Query and the knowledge base use the same
      ontology
    • Query is executed on a single knowledge base,
      gives the correct, single answer
    • And all this very fast

    26
    Web Search 2.0
    • In reality
    • Many lightweight ontologies or just tags
    • Tags are mostly personal, not social
    • Intent is unclear, matching is a problem
    • Poor quality of annotations
    • Everyone's a librarian
    • 99% of the Web is Web 1.0
    • Input/output interface
    • Keywords for searching
    • Very limited interaction
    • Not everything scales

    27
    Web Search 2.0
    • Keep on improving machine technology
    • NLP
    • Information Extraction
    • Exploit the users for the tasks that are hard for
      the machine
    • Encourage and support users
    • Exploit user-generated metadata in any shape or
      form
    • Support standards of the SW architecture

    28
    Example: folksonomies
    • Simplified view: tags are just anchor text
    • Can be used to generate simple co-occurrence
      graphs

    [Figure: graph linking the tags hilton, paris and eiffel to url1, url2 and url3]

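In the simplified view, two tags co-occur when they are attached to the same URL, and the number of shared URLs gives the edge weight. The sketch below builds such a graph from (tag, URL) pairs. The tag and URL names come from the slide, but which tag points to which URL is assumed here, since the figure's edges did not survive in the transcript.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(annotations):
    """Weight each tag pair by the number of URLs that carry both tags."""
    tags_by_url = defaultdict(set)
    for tag, url in annotations:
        tags_by_url[url].add(tag)

    weights = defaultdict(int)
    for tags in tags_by_url.values():
        # Sorting gives every unordered pair one canonical key.
        for first, second in combinations(sorted(tags), 2):
            weights[(first, second)] += 1
    return dict(weights)


# Assumed tag-to-URL assignments, for illustration only.
ANNOTATIONS = [
    ("hilton", "url1"),
    ("paris", "url1"),
    ("paris", "url2"),
    ("eiffel", "url2"),
    ("eiffel", "url3"),
]

graph = cooccurrence(ANNOTATIONS)
for (first, second), weight in sorted(graph.items()):
    print(first, second, weight)
```

With these assignments "paris" links to both "hilton" and "eiffel", which is the ambiguity from the hard-searches slide showing up as two separate neighbourhoods in the graph.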
    30
    The more complete picture
    • Folksonomies as tripartite graphs of users, urls
      and tags

    [Figure: tripartite graph of users (user1, user2, user3), tags (hilton, paris, eiffel) and URLs (url1, url2, url3)]

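Once users are part of the picture, the same tagging data can be folded into a tag graph in two ways: tags related because they share URLs, or tags related because they share users. The sketch below does both from (user, tag, URL) triples. The node names come from the slide; the triples themselves and the function name `fold` are assumptions for illustration, not data or code from the talk.

```python
from collections import defaultdict
from itertools import combinations


def fold(triples, shared):
    """Fold (user, tag, url) triples into weighted tag pairs.

    shared="url" counts URLs that carry both tags; shared="user" counts
    users who have applied both tags.
    """
    position = {"user": 0, "url": 2}[shared]
    tags_by_node = defaultdict(set)
    for triple in triples:
        tags_by_node[triple[position]].add(triple[1])

    weights = defaultdict(int)
    for tags in tags_by_node.values():
        for first, second in combinations(sorted(tags), 2):
            weights[(first, second)] += 1
    return dict(weights)


# Assumed tagging events, for illustration only.
TRIPLES = [
    ("user1", "hilton", "url1"),
    ("user1", "paris", "url1"),
    ("user1", "eiffel", "url3"),
    ("user2", "paris", "url2"),
    ("user3", "paris", "url2"),
    ("user3", "eiffel", "url2"),
    ("user3", "eiffel", "url3"),
]

by_url = fold(TRIPLES, shared="url")
by_user = fold(TRIPLES, shared="user")
print("shared URLs: ", by_url)
print("shared users:", by_user)
```

The two foldings disagree on this data: "eiffel" and "hilton" never share a URL but do share a user, which is the kind of community-specific association the next slide refers to.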
    31
    Mining and modelling folksonomies
    • Opportunities for mining community-specific
      interpretations of the world
    • Peter Mika. Ontologies are us: A unified model of
      social networks and semantics. Journal of Web
      Semantics 5 (1), pages 5-15, 2007
    • Related works at
    • Social and Collaborative Construction of
      Structured Knowledge (CKC2007)
    • Bridging the gap between Semantic Web and Web 2.0
      (ESWC2007)
    • Journal of Web Semantics special issue on
      Semantic Web and Web 2.0 (upcoming, Q4 2007)
    • TAGORA project

    32
    Vision: ontology-based search
    • Query at the knowledge level
    • Partial description of a class/instance
    • Mapping of queries and resources in the
      conceptual space
    • Computing relevance in semantic terms
    • Novel user interfaces

    33
    Technical challenges
    • Improving NLP and IE
    • Query interface
    • Data quality
    • Cleaning up metadata, tags
    • Spam
    • Ontology mapping and entity resolution
    • Ranking across types
    • Results display
    • How do you avoid information overload?
    • How do you display information you partially
      understand?

    34
    Social challenges
    • Getting the users on your side
    • Users are unwilling to submit large amounts of
      structured data to a commercial entity (Google
      Base)
    • Provide a clear motivation and/or instant
      gratification
    • Trust them but not too much (Mahalo)

    35
    Example: Technorati and microformats
    http://technorati.com/posts/tag/semanticweb
    <a … rel="tag">Semantic Web</a>
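In the rel-tag microformat the tag is carried by the last path segment of the link target, not by the visible link text. The sketch below collects tags from `rel="tag"` links on that basis. It assumes, for illustration, that the link on this slide points at the Technorati URL shown above it; the transcript does not preserve the link's `href`.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse


class RelTagFinder(HTMLParser):
    """Collect tags from <a rel="tag" href="..."> links."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rels = (attributes.get("rel") or "").split()
        href = attributes.get("href")
        if tag == "a" and "tag" in rels and href:
            # The tag is the last segment of the URL path.
            path = urlparse(href).path.rstrip("/")
            self.tags.append(unquote(path.rsplit("/", 1)[-1]))


# Assumed markup: the href value is a guess based on the URL on the slide.
SNIPPET = (
    '<a href="http://technorati.com/posts/tag/semanticweb" rel="tag">'
    "Semantic Web</a>"
)

finder = RelTagFinder()
finder.feed(SNIPPET)
finder.close()
print(finder.tags)
```

Here the extracted tag is "semanticweb" while the reader sees "Semantic Web", so the URL acts as the shared identifier and the link text stays free-form.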

    36
    Conclusion
    • Why a new vision?
    • The opportunity: convergence
    • Semantic Web metadata infrastructure
    • Web 2.0 user-generated metadata
    • Thesis: making the Web searchable
    • Semantic Web and Web 2.0
    • Web Search 2.0

    37
    What is there to gain?
    • Knowledge-based search
    • Sorting out hard searches
    • Creating new information needs
    • Beyond search
    • Analysis, design, diagnosis etc. on top of
      aggregated data
    • Personalization
    • Rich user profiles
    • Monetization
    • No more "buy virgins on eBay"

    38
    Questions?
    • Peter Mika. Social Networks and the Semantic Web.
      Springer, July, 2007.
    • Special Issue on the Semantic Web and Web 2.0,
      Journal of Web Semantics, December, 2007.