Search and the Net at 2004: Trends, Challenges and Cutting-Edge Developments in Internet Search Services

Learn more at: http://people.hws.edu

1
Search and the Net at 2004: Trends, Challenges
and Cutting-Edge Developments in Internet Search
Services
  • Michael Hunter
  • Reference Librarian
  • Hobart and William Smith Colleges
  • for Rochester Regional Library Council
  • Member Libraries Staff
  • Sponsored by the
  • Rochester Regional Library Council
  • Supported by Library Services and Technology Act
    (LSTA) and/or
  • Regional Bibliographic Databases and Resources
    Sharing (RBDB) funds granted by the
  • New York State Library 2003

2
For Today ...
  • State of the Net and its Users
  • Search Industry Overview
  • Recent Developments in Established Services
  • New Services
  • The Deep Web at 2004
  • Tracking the Living Web: Weblogs and RSS
  • Cutting-edge Developments
  • Trends and Challenges to Today's Search Services

3
The Internet and its Users at 2004
4
How large is the Web?
  • What do you mean by the Web?
  • The totality of all Web sites
  • Sounds simple ...
  • BUT IS IT?

5
UC Berkeley's How Much Information Project
http://www.sims.berkeley.edu/research/projects/how-much-info-2003/internet.htm
NOTE: 10 terabytes = total print collections of the
Library of Congress
6
Internet Use Worldwide
7
Internet Use in the US
http://www.pewinternet.org
8
Internet Use in the US
http://www.pewinternet.org
9
Top Ten things our users do online
http://www.pewinternet.org
10
Top Ten things our users do online
http://www.pewinternet.org
11
Undergraduates and Search Engines
Colaric, S. "Instruction for Web Searching: An Empirical
Study." College and Research Libraries 64 (2),
March 2003, p. 111-116
12
The Internet Search Industry
Consolidation
Performance Measures
Popularity
13
The Shrinking Search Industry
Editorial control of search is shared among few
  • Yahoo owns:
  • AlltheWeb, Altavista, Inktomi, Overture (paid
    listings)
  • Google
  • MSN
  • AskJeeves owns Teoma
  • LookSmart owns Wisenut
  • Gigablast
  • NOTE: Ownership is different from database
    affiliation

14
Google
Database Affiliates
16
Database Freshness
http://www.searchengineshowdown.com/stats/freshness.shtml
  • Based on a series of 6 current topic searches
  • Pages that are updated daily
  • AND report that date on the page
  • Queries submitted May 17, 2003

18
Database Freshness
http://www.searchengineshowdown.com/stats/freshness.shtml
  • Most have some results indexed in the last few
    days
  • The bulk of most of the databases is about 1
    month old
  • Some pages may not have been re-indexed for much
    longer

19
Popularity: Searches per day (self-reported
data, as of 2/28/03)
http://searchenginewatch.com/reports/article.php/2156461
20
Recent Developments among Established Services
21
Google
  • Froogle
  • Phonebook
  • Wildcard Words
  • Info
  • Synonym feature
  • Supplemental Index
  • Search by location
  • News Advanced Search and News Alerts
  • ???

22
Froogle
  • Locates information about products for sale
    online
  • Gives URLs of sites offering the item
  • Provides links to exact page in the site where
    you can make the purchase

23
Froogle
  • Ranking follows normal Google ranking processes
  • Paid placements always clearly marked
  • Price range limits available
  • Access at http://froogle.google.com or via Google
    Advanced Search

24
Phonebook Command Search
  • Searches US residential (rphonebook:) and
    business (bphonebook:) listings of Yahoo,
    MapQuest and other services
  • rphonebook:
  • MUST INCLUDE
  • Last name, plus City and/or State
  • MAY INCLUDE
  • First name
  • bphonebook:
  • MUST INCLUDE
  • Business name (min. 1 word), plus City and/or State
  • MAY INCLUDE
  • Full Business name

25
Wildcard Words
  • Google offers a word-sized asterisk to function
    as a wildcard
  • Stands for a whole word
  • Cannot be used for part of a word
  • three * mice: 22,000 results
  • three bl* mice: 0 results

26
Wildcard Words
  • Several can be used together
  • milosevic International * * Hague
  • Retrieves "military tribunal" OR
  • "military court" OR "war tribunal" OR "military
    tribunal"
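
The whole-word wildcard can be imitated with a regular expression. This is only a sketch of the behavior described on these slides, not Google's implementation; the function name `wildcard_to_regex` is made up for illustration.

```python
import re

def wildcard_to_regex(query: str) -> re.Pattern:
    """Turn a phrase containing whole-word '*' wildcards into a regex.

    A token that is exactly '*' stands for one whole word. A '*' glued to
    part of a word (e.g. 'bl*') is treated as a literal character, so it
    does not act as a wildcard.
    """
    parts = []
    for token in query.split():
        if token == "*":
            parts.append(r"\w+")            # any single whole word
        else:
            parts.append(re.escape(token))  # literal token
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)

pattern = wildcard_to_regex("three * mice")
print(bool(pattern.search("Three blind mice, see how they run")))  # True
print(bool(pattern.search("three mice")))                          # False
```

Several wildcards in a row work the same way: each `*` token consumes exactly one word.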

27
info:
  • Not exactly hidden, but not well-known
  • Searches for any information Google has about a
    site
  • Convenient way to monitor linkage
  • Typing a URL in the search box will give the same
    results

29
Synonym Feature
  • Place a tilde (~) immediately before a term to
    retrieve synonyms or related terms from the
    Google Index
  • Eliminate the original term by placing a minus
    sign before it.
  • ~hiking -hiking

31
Google's Supplemental Index
  • For obscure or unusual searches
  • Queried when Google fails to find good matches
    within its main web index.
  • Live 9/9/03
  • Sample queries
  • St. Andrews United Methodist Church Homewood
    IL
  • nalanda residential junior college alumni
  • illegal access error jdk 1.2b4
  • supercilious supernovas

33
Search by Location (beta)
  • http://labs.google.com/location
  • U.S. only
  • Keyword(s) combined with address, city, state or
    zip
  • Search results appear on a map

34
News Advanced Search
and News Alerts
  • Advanced News Search added this Fall
  • News Alerts
  • Requires a (free) account
  • One query per alert; limit of 50 alerts per
    e-mail address
  • Alerts contain links to news containing your
    alert keywords
  • Cannot edit a query; delete it and create a new one
    instead
  • Alerts sent once a day or "as it happens"

35
More about Google ...
  • Google World: http://indicateur.com
  • Maintained by a French search engine site and
    listed under Guides. (Use the Google translator, see
    Language Tools, to translate the site.)
  • Google Lab: http://labs.google.com
  • Place for cutting-edge developments, many in beta
    awaiting user feedback and testing.

36
Beyond Google: AskJeeves
  • Simpler, cleaner interface
  • Teoma crawler-based results blended with AJ
    answers
  • Improved image database
  • Smart Answers
  • Popular queries mapped to news, image and other
    sources appropriate to the query

37
ATW (FAST)
http://alltheweb.com
  • Continued commitment to a large database (2nd to
    Google)
  • Powerful, new advanced search capabilities
  • Extensive page customization options
  • Results clustered by topic (Folders)
  • Both HTML and Multimedia given, when available
  • NOTE: Folders located at the BOTTOM of each
    results screen

38
Altavista
  • Simpler interface
  • More language options
  • Expanded image and multimedia collections
  • Results labeled "Refreshed in last 48 hours"
  • Includes PDF files
  • US and Local search options
  • Prisma query refinement

39
Altavista
Prisma Query Refinement
  • Offers a maximum of 12 terms having the strongest
    associations with the original query term(s)
  • Selected from the top 50 results of the original
    query
  • NOTE: Clicking on a Prisma term adds it to your
    original query, creating a new set of Prisma
    terms.
  • Similar to Refine (1997) but less graphic
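
The general idea of picking refinement terms from the top results can be sketched in a few lines. How Altavista actually measures "strongest association" is not stated here, so this sketch simply counts how many of the top results contain each word; the stopword list and the function name `refinement_terms` are assumptions for illustration.

```python
from collections import Counter

# Small, assumed stopword list; a real system would use a much larger one.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "for", "is", "on"}

def refinement_terms(query, top_results, max_terms=12):
    """Suggest up to max_terms words that co-occur with the query terms."""
    query_words = set(query.lower().split())
    counts = Counter()
    for text in top_results[:50]:             # only the top 50 results are examined
        words = set(text.lower().split())     # count each word once per result
        counts.update(words - query_words - STOPWORDS)
    return [term for term, _ in counts.most_common(max_terms)]

results = [
    "jaguar car dealer",
    "jaguar car reviews",
    "jaguar habitat rainforest",
]
suggestions = refinement_terms("jaguar", results, max_terms=3)
print(suggestions[0])  # car
```

Adding a suggested term to the query and running the function again on the new results gives the "new set of terms" behavior described in the NOTE.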

40
Teoma
  • Ranking: Includes a site's relationship to other
    sites with similar content
  • Results
  • Ranked database results, with Related Pages
  • Refine
  • Clustering of your results and other related
    sites based on term relationships and web
    community linkages derived from your original
    results
  • Resources
  • Link Collections from experts and enthusiasts
  • (Subject metasites)

41
Hotbot
  • Searches Hotbot (Inktomi) OR Google OR Lycos OR
    AskJeeves
  • Not a true metaengine
  • Advanced features operable only if supported by
    source engines

43
Metacrawler
  • Along with Dogpile and Webcrawler, owned by
    Infospace
  • Simpler interface
  • Offers the following customizations:
  • Selection of sources searched
  • Total number of results retrieved
  • Length of search (time-out period)
  • Offers a wide range of vertical searches: Images,
    MP3, Shopping, Subject Directory, Multimedia,
    News, Message Boards

45
New Services Attracting Attention
46
Gigablast
  • Launched April, 2002
  • Smaller database than others
  • Over 200 million on 10/4/03
  • pope canterbury: Google 83,200;
    Gigablast 24,919
  • Created and maintained by Matt Wells (alone)
  • Only search engine continuously updated with
    index refreshed in real time (Site submissions
    are immediately searchable)
  • Ranking depends less on linkage than Google's
    ranking, to avoid penalizing newer pages.
  • No advertising (to date)

47
Gigablast Search Features
  • Basic search: Full Boolean
  • Advanced Search: Full Boolean and 2 (!) phrase
    boxes
  • Limit by site
  • Limit by domain (URL)
  • Links to a page available
  • Most generic html metatags indexed, searched
    and made available for display
  • Unique to Gigablast!!!

48
Gigablast Search Features
  • Field searches include title, IP address and
    non-html filetypes
  • PDF, Word, Excel, PPT, PostScript, Ascii Text
  • Results from one site clustered
  • Cached version available
  • Results include date indexed and last modified
    (!!)
  • Linking to Gigablast improves ranking there

49
KillerInfo
http://www.killerinfo.com
  • Metaengine searching Google, AOL, Lycos,
    Gigablast, MSN, Altavista, LookSmart and Open
    Directory
  • 9 topical Deep Web channels offered
  • Boolean and phrase search
  • No other Advanced Search features
  • Results clustering (a la Vivisimo)
  • Number of results not given
  • Adult content filter

50
Surfwax
http://surfwax.com
  • Demo site for federated search software
  • Simultaneous search of Deep Web, Intranets, Web
    and more
  • Metaengine searches Wisenut, AOL, MSN, Yahoo,
    Encarta, CNN, LookSmart
  • FOCUS search refinement feature
  • Online thesaurus of related terms and
    definitions

51
Surfwax
http://surfwax.com
  • Site SNAP of a result offers:
  • Author summary (from metatags)
  • Related sites
  • Site's FOCUS words
  • Key Points (query-related sections)
  • Results ranking options: Relevance, Alpha and
    Source
  • Preferences and Advanced Features require a
    (free) account; more options available to
    fee-based accounts

52
Nutch
http://nutch.org
  • Project to implement an open source web search
    engine
  • Why open source?
  • With open source, search results processing is
    transparent, not hidden. Bias (if any) can be
    examined by anyone.
  • Open source applications are free and available
    for use, modification or for-profit use. Users
    are asked to contribute their innovations back to
    the code base
  • Nutch is seeking volunteer developers and
    donations

53
The Deep Web at 2004
54
The Topography of the Internet
or The Layers of the Web
  • Mapping the web is challenging
  • Unregulated in nature
  • Influences from all over the globe
  • Fulfills many purposes, from personal to
    commercial
  • Changes rapidly and unexpectedly
  • Divisions and terminology are inherently
    ambiguous, e.g. Deep vs. Invisible Web

55
May I suggest a biological, nautical metaphor,
perhaps the ocean?
  • SURFACE WEB
  • SHALLOW WEB
  • OPAQUE WEB
  • DEEP WEB
  • DARK WEB

56
Surface Web
  • Static html documents
  • Crawler-accessible

57
Shallow Web
  • Static html documents loaded on servers that use
    ColdFusion or Lotus Domino or other similar
    software
  • A different URL for the same page is created each
    time it is served.
  • Crawlers skip these to avoid multiple copies of
    the same page in their database
  • Technically human accessible via directories,
    Deep Web gateways or links from other sites
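
Why a changing URL produces duplicates, and one way a crawler could recognize them, can be sketched as follows. The slide says crawlers of the day simply skipped such pages; this sketch instead strips session-style parameters so that two URLs for the same page compare equal. `CFID` and `CFTOKEN` are ColdFusion's session parameters; `sessionid` and the example.com addresses are assumptions for illustration.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters assumed to carry per-visit session data, not content.
SESSION_PARAMS = {"cfid", "cftoken", "sessionid"}

def canonical_url(url: str) -> str:
    """Drop session parameters so repeat visits map to one URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )

first = canonical_url("http://www.example.com/page.cfm?id=7&CFID=111&CFTOKEN=aaa")
second = canonical_url("http://www.example.com/page.cfm?id=7&CFID=222&CFTOKEN=bbb")
print(first)            # http://www.example.com/page.cfm?id=7
print(first == second)  # True
```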

58
Opaque Web
  • Static html documents
  • Technically crawler accessible
  • 2 types
  • Downloaded and indexed by crawler
  • Not downloaded or indexed by crawler

59
Opaque Web
  • Downloaded and indexed by crawler
  • Buried in search results you never look at
  • A casualty of relevance ranking
  • Not downloaded or indexed by crawler due to
    programmed download limits
  • Document buried deep in the site
  • Part of a large document that did not get
    downloaded (Typical crawl per page is 110 K or
    less)
  • Document added since last crawler visit (Even the
    best revisit on an average of every 2 weeks,
    depending on amount of change at a site)

60
Opaque Web
  • Access to the Opaque Web
  • Specialized search engines
  • General and specialized directories
  • Subject metasites
  • These services typically index more thoroughly
    and more often than large, general search
    engines

61
Deep Web
  • Technically inaccessible to crawlers
  • Dynamically created pages
  • Databases
  • Non-textual files
  • Password protected sites
  • Sites prohibiting crawlers
  • Technically accessible to crawlers
  • Textual files in non-html formats

62
Dark Web
http://research.arbornetwords.com
  • Up to 5% of the web is completely unreachable due
    to:
  • Misconfigured routers
  • Contractual disputes between ISPs
  • Broadband users with personal or corporate
    firewalls
  • US Military sites

63
UC Berkeley's How Much Information Project
http://www.sims.berkeley.edu/research/projects/how-much-info-2003/internet.htm
NOTE: 10 terabytes = total print collections of the
Library of Congress
64
http://www.sims.berkeley.edu/research/projects/how-much-info-2003/internet.htm
65
Reducing the Deep Web
mod_rewrite: Making dynamic pages available to crawlers
  • Mod_rewrite software loaded onto a web server
    containing dynamic pages (databases, etc.)
  • Crawler follows a link to a stable URL on the
    server: www.mydomain.com/dvdplayers.html
  • Mod_rewrite searches all the server's dynamic
    pages containing dvdplayers and creates temporary
    pages with stable URLs.
  • These pages are linked to each other, creating a
    stream of virtual pages that can be crawled by
    any of the search engines
  • Search engines often check the stream for spam or
    duplicate pages
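
mod_rewrite is an Apache module that maps one URL pattern onto another. The mapping idea can be sketched in Python; this is not Apache configuration, and the script name `catalog.php` and the parameter `category` are hypothetical.

```python
import re

# Assumed rule: a stable path like /dvdplayers.html is really served by a
# dynamic script. The script and parameter names are made up for this sketch.
RULE = re.compile(r"^/([a-z0-9]+)\.html$")

def rewrite(path: str) -> str:
    """Map a stable, crawler-friendly path to the dynamic URL behind it."""
    match = RULE.match(path)
    if match is None:
        return path  # no rule applies; leave the path unchanged
    return f"/catalog.php?category={match.group(1)}"

print(rewrite("/dvdplayers.html"))  # /catalog.php?category=dvdplayers
print(rewrite("/images/logo.gif"))  # /images/logo.gif
```

The crawler only ever sees the stable `.html` address; the translation to the database-backed page happens on the server.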

66
Mining the Deep Web
Directed Query Engines or
Intelligent Agents
  • Designed to access distributed Deep Web
    resources
  • Some can be configured to search specific URLs
  • Databases
  • Subject metasites
  • report collections
  • dynamic pages
  • online newsletters

67
Directed Query Engines for purchase
  • Simultaneous search of Deep Web and other
    resources with many additional features
  • Lexibot: http://www.lexibot.com
  • If you complete survey: $189, upgrades $15
  • If you don't: $289, upgrades $50
  • BullsEye: http://info.intelliseek.com
  • BullsEye Pro: $199 with free upgrades for 6
    months

68
Hunter's Maxim
for the Deep Web
  • Plan to first locate the category of information
    you want, then browse.
  • Don't be too specific in your searches.
  • Cast a wide net.

69
TRACKING THE LIVING WEB
WEBLOGS AND RSS FEEDS
70
Blogs: What are they?
  • Online diaries or journals, usually by one
    person, though many invite comments
  • First developed in 1997
  • Within the same blog, tone can range from personal
    musings to discussion of recent issues in
    technology and research
  • High link-to-word ratio
  • Often link to other weblogs of similar content

71
Blogs: What are they?
  • Can contain rumor, inside information,
    speculation, blatant errors as well as:
  • Breaking news: political and technical/research
  • Commentary on new software or websites
  • Consumer reaction to products or services
  • Blog authoring tools are basic content management
    software, useful in ways other than online
    diaries
  • Typify the spirit of information sharing that has
    fueled the Internet since its beginnings

72
How large is the blogosphere?
2.4 to 2.9 million active blogs (est.)
73
Who's blogging?
Jupiter Research
  • 2% of Internet users have created a blog
  • About 50% women, 50% men
  • Over 50% are in English; remaining languages, in
    order of prevalence:
  • Portuguese, Polish, Farsi, French, Spanish,
    German, Italian, Dutch and Icelandic

74
More
  • About 4% of Internet users read blogs: 60% men,
    40% women
  • On average, blogs are updated every 3 days
  • About 4% of online Americans have gone to blogs
    for information about the Iraq War
  • LiveJournal (large blog host) was the 650th most
    popular site on the Internet (May, 2003)
  • 184,000 readers every 10 days
  • Spend average of 22 minutes at the site

75
Creating a Blog: Blogger
http://new.blogger.com
  • Free, automated Web publishing tool
  • Requires no new software
  • Send posts to an existing website or create a
    free blog at Blogger
  • Provide a site template and where you want the
    postings to appear
  • To update, create a posting, submit the permission form
    and Blogger will send it via FTP
  • Advanced options available

76
Locating Blogs
  • Blog Hosting Sites
  • www.livejournal.com
  • diaryland.com
  • radio.userland.com ($39.95 with added
    features)
  • Blog metasites
  • www.lights.com (library-related, world-wide)
  • www.blogrunner.com
  • www.llrx.com/columns/notes46.htm
  • portal.eatonweb.com/

77
Locating Blogs
  • Subject Directories
  • dmoz.org/Computers/Internet/On_the_Web
  • General Search Engines
  • "Blog" + keyword(s), or URL:(bloghost) +
    keyword(s)
  • Professional Association homepages
  • Subject Metasites
  • Use Teoma.com Resources

78
Searching Blog Content
  • Blog hosting sites
  • www.livejournal.com
  • Blog Search Engines
  • Feedster.com (includes RSS feeds also)
  • Daypop.com (current events)
  • Blogdex.media.mit
  • www.technorati.com
  • blogging-news.info
  • Topical Blog Search Engines
  • Detod (blawgs.detod.com): Exclusively legal
    weblogs

79
Blogs and General Search Engines
  • Blog-rich sites are increasingly visited by major
    crawler-based search services
  • HOWEVER
  • ANY rapidly-changing content can easily be missed
    by crawlers

80
Obstacles to Crawling and Indexing Blog Content
  • Only the most recent postings appear on the blog
    homepage (older are archived, and inaccessible to
    crawlers)
  • Many bloggers post dozens of times a day
  • Frequent postings may contain critical
    information to time-sensitive topics
  • Even a daily crawl would miss these postings
    (typical crawl is about once every 3 weeks)

81
Obstacles to Crawling and Indexing Blog Content:
Page Design
  • Several postings usually appear on the blog
    homepage
  • Postings are NOT indexed separately, as crawler
    indexes the page as a whole
  • Retrieval of an individual posting on a topic is
    unreliable

82
Blogs and Libraries
  • Blogs can offer an opportunity to post content on
    the Web quickly: no delay of FTP uploading or
    submission to a webmaster
  • What's New
  • Favorite Books
  • Recent Acquisitions
  • Program Changes due to the Weather

83
Blogs and Libraries
  • Get more people involved in posting content on
    the Library (or library-sponsored) website
  • No knowledge of html, RSS or XML needed
  • Log onto the blog hosting website, create
    content, and update the page
  • Current awareness without the annoyance of
    un-wanted e-mails
  • Choose when YOU want that information by visiting
    your blogs of choice

84
Blogs and Libraries
Metasites
  • Blogs and Libraries: A Bibliography (online)
  • http://www.etches-johnson.com/nolibrary/bib.html
  • Library Weblog Directory
  • http://www.libdex.com/weblogs.html
  • Blogs at the University of Minnesota Libraries
  • http://www.lib.umn.edu/san/mt/
  • Fichter, D. (2003). Why and how to use blogs to
    promote your library's services. Marketing
    Library Services 17(6).
  • http://www.infotoday.com/mls/nov03/fichter.shtml

85
RSS
  • Rich Site Summaries
  • Really Simple Syndication
  • Really Stops Spam

86
Before RSS
Tracking latest news and site updates
  • Software packages that monitored and reported
    changes at sites of your choosing
  • News alert services, free and fee
  • Manual checking of your bookmarks
  • Hit or miss Listserv and Usenet postings

87
RSS: What is it?
  • XML filetype with content that is:
  • Structured (tags, standard and/or
    author-defined)
  • Re-useable (can be integrated into web,
    e-mail, multimedia and many other formats)
  • Originally developed by Netscape as a content
    management tool for personalizing home pages
  • My News, My Sports, My Weather
  • RSS in detail: http://blogs.law.harvard.edu/tech/rss
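
To make "structured and re-useable" concrete, here is a tiny RSS 2.0 feed read with Python's standard library. The feed content and the example.org addresses are invented for illustration; the element names (`channel`, `item`, `title`, `link`, `description`) are the standard RSS 2.0 ones.

```python
import xml.etree.ElementTree as ET

# A minimal, made-up feed with one channel and one item.
FEED = """<rss version="2.0">
  <channel>
    <title>Library News</title>
    <link>http://www.example.org/</link>
    <description>What's new at the library</description>
    <item>
      <title>Recent Acquisitions</title>
      <link>http://www.example.org/new-books</link>
      <description>New titles added this week.</description>
    </item>
  </channel>
</rss>"""

def read_items(feed_xml: str):
    """Return (title, link) pairs for every item in an RSS 2.0 feed."""
    channel = ET.fromstring(feed_xml).find("channel")
    return [
        (item.findtext("title"), item.findtext("link"))
        for item in channel.findall("item")
    ]

for title, link in read_items(FEED):
    print(title, "->", link)
```

Because every feed uses the same tags, the same few lines can pull headlines into a web page, an e-mail digest or an aggregator.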

88
RSS: What can it do?
  • Creates a broadcast version of frequently updated
    content from a website, blog, news page or other
    source
  • Authors can:
  • Summarize new content
  • Broadcast new content, e.g. online newsletters
  • Can be used as a way to distribute content to
    subscribers (syndication) independent of e-mail.
    Subscribers log on or access via aggregators.

89
How do I access them?
  • As RSS is in XML, may require downloading reader
    software (older versions of browsers cannot read
    XML). Sources for reader software include
  • www.lights.com
  • blogspace.com
  • Sites with RSS feeds display a small icon
    (usually orange) labeled RSS or XML
  • General search engines (limited, but worth a
    try)
  • filetype:xml keyword(s)

90
RSS Directories and Search Engines
  • Syndic8: syndic8.com
  • Directory of available syndicated news feeds
  • Provides no reading area
  • Uses Open Directory classification
  • Feedster: www.feedster.com
  • The best search engine for blogs and RSS feeds
  • Yahoo: news.yahoo.com/rss
  • Canadian Government: tinyurl.com/vrh7
  • Often found in Blog Directories and Engines

91
RSS aggregators
  • Receive general or topical RSS feeds and blog
    postings
  • Many are focused on news only
  • Present content in compact form
  • Combine multiple sources in one interface
  • Provide links to full content
  • In personal desktop versions or online
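
What an aggregator does with several feeds can be sketched as a merge and sort. This assumes the items have already been pulled out of each feed as (title, link, publication date) tuples, with dates in the format RSS uses; the feed names, headlines and addresses are invented.

```python
from email.utils import parsedate_to_datetime

def aggregate(feeds):
    """Merge items from several feeds into one list, newest first."""
    merged = []
    for source, items in feeds.items():
        for title, link, pub_date in items:
            # RSS dates follow the e-mail (RFC 822) date format.
            merged.append((parsedate_to_datetime(pub_date), source, title, link))
    merged.sort(key=lambda entry: entry[0], reverse=True)
    return [(source, title, link) for _, source, title, link in merged]

feeds = {
    "Library News": [
        ("Recent Acquisitions", "http://www.example.org/a",
         "Mon, 01 Dec 2003 09:00:00 GMT"),
    ],
    "Search Blog": [
        ("New toolbar released", "http://www.example.com/b",
         "Tue, 02 Dec 2003 08:30:00 GMT"),
    ],
}

for source, title, link in aggregate(feeds):
    print(f"[{source}] {title} ({link})")
```

The output is the compact, combined view described above: one line per item, each linking back to the full content.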

92
Personal desktop aggregators
  • Lets you specify any feeds you want access to
  • AmphetaDesk: www.disobey.com/amphetadesk/
  • Radio UserLand: radio.userland.com ($)
  • Feedreader: feedreader.com

93
Feedreader.com
94
Online aggregators
  • Selection of feeds may be limited
  • NewsIsFree: NewsIsFree.com
  • 7379 sources grouped into 16 channels
  • Create custom pages
  • offers more Premium options
  • Many RSS sites include links to other aggregators

95
Authoring and Producing RSS
  • Lockergnome
  • rss.lockergnome.com
  • Documents, tools, developers, aggregators, free
    feed generator for your site
  • RSS Primer for Publishers
  • www.eevl.ac.uk/rss_primer/
  • Producing RSS feeds
  • Technical information
  • Feed promotion
  • Feedster: www.feedster.com

96
Blogs and RSS
  • Blogs may offer some or all of their content as
    RSS feeds, or not
  • Blogs can exist as pure html documents, updated
    frequently
  • Making content available in RSS increases a
    blog's access and exposure via aggregators and
    other RSS-based search services

97
The Living Web
What can blogs and RSS feeds tell
us about an author's point of view?
  • Which ones does an author list on their
    blog/homepage?
  • Which ones does an author visit/subscribe to?
  • Sometimes I want to know what the world thinks
  • GOOGLE
  • Sometimes I want to know what I think
  • MY WEBLOG
  • Sometimes I want to know what those I respect
    think
  • BLOGS AND FEEDS I READ

98
Beyond today's (free) search engines
Cutting-edge developments
99
Including Context in System Design
  • Context matters (!!??!)
  • Textual context
  • Query context: Who is asking and why?
  • Traditional approaches to retrieval have been
    deductive
  • Data organized and mapped to anticipated query
    terms (controlled vocabularies, taxonomies)
  • Human created and maintained
  • Too slow for rapid data streams

100
Bayesian approaches
  • Uses statistical inference based on Bayes'
    Theorem of Probability (Thomas Bayes,
    1702-1761)
  • Inductive approach (adaptive processing)
  • Take the user's information environment
  • Infer structures, relationships, likely queries
  • Inferred structures and relationships can then be
    mapped to a human-created classification scheme
  • Currently used in corporate intranet and
    fee-based content management software
  • Will be used more in general information systems
    of the future
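
Bayes' Theorem itself is short: P(H|E) = P(E|H) x P(H) / P(E). A small worked example shows how observing a query word shifts the system's belief about what a user is interested in. All the numbers are invented for illustration.

```python
def posterior(prior, likelihood, likelihood_if_not):
    """Bayes' Theorem: P(H|E) = P(E|H) * P(H) / P(E)."""
    # P(E) is the total probability of the evidence, with or without H.
    evidence = likelihood * prior + likelihood_if_not * (1.0 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers: 20% of a user's past queries were about cooking;
# the word "salsa" appears in 30% of cooking queries and 5% of other queries.
p_cooking = posterior(prior=0.20, likelihood=0.30, likelihood_if_not=0.05)
print(round(p_cooking, 2))  # 0.6
```

After seeing the word "salsa", the estimate that this is a cooking query rises from 20% to 60%. Repeating this update with each new observation is the inductive, adaptive processing described above.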

101
Adaptive Processing
Learning the searcher's
interests
  • What term(s) did you search?
  • What did you select?
  • How long did you look at it?
  • What is its source?
  • How old was it?
  • Direct input from searcher
  • Rank the sources
  • Rate individual results
  • Eliminate certain sources, sites

102
Inquirus
http://inquirus.nj.nec.com
  • Query interface research project
  • Attempts to improve precision of results
  • Monitors user's search behavior to infer intent
    of queries
  • Re-formulates queries to increase likelihood of
    desired answers

103
Inquirus
http://inquirus.nj.nec.com
  • USER: How do you make salsa?
  • SYSTEM: salsa and (recipe or ingredients or
    food)
  • Eliminates pages on salsa dancing
  • Ranking relies heavily on proximity of query
    terms and system-provided cognates to each other
    in the document
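
The salsa example can be reproduced with a very small rule-based rewriter. This is only a sketch of the idea of query reformulation, not how Inquirus works internally; the stopword list and the cue table linking "make" to recipe terms are assumptions.

```python
# Words dropped from the question because they carry no topic (assumed list).
STOPWORDS = {"how", "do", "you", "make", "a", "the", "to", "is", "what"}

# Hypothetical cue table: a trigger word in the question -> expansion terms.
EXPANSIONS = {"make": ["recipe", "ingredients", "food"]}

def reformulate(question: str) -> str:
    """Rewrite a natural-language question as a Boolean query."""
    words = question.lower().strip("?!. ").split()
    topic = [word for word in words if word not in STOPWORDS]
    extra = []
    for word in words:
        extra.extend(EXPANSIONS.get(word, []))
    query = " and ".join(topic)
    if extra:
        query += " and (" + " or ".join(extra) + ")"
    return query

print(reformulate("How do you make salsa?"))
# salsa and (recipe or ingredients or food)
```

Requiring one of the added terms is what pushes pages about salsa dancing out of the results.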

104
Vector-Space Model
3-dimensional retrieval
  • A way of ordering documents by word
    frequency/context in a term space, and matching
    them to queries
  • Documents are assigned coordinates
  • One document may be in many term spaces, or
    vectors
  • Queries that fall within a given vector are
    likely to be answered by documents located in
    that vector

105
A Multi-dimensional Boolean
  • Boolean limited to term matches
  • Vector-space model
  • More complex relationships can be mapped
  • Degrees of relatedness of document to query
  • Query and document weights based on length and
    direction of their vectors

(Figure: term space with axes labeled terrier, female and puppy)
106
Documents in Vector Space
What do you have on movie stars' diets?
(Figure: documents plotted against STAR and DIET axes: doc about movie stars, doc about astronomy, doc about mammal behavior)
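
The STAR/DIET picture can be turned into numbers. In this sketch each document gets a coordinate on the STAR axis and the DIET axis, and documents are ranked by the cosine of the angle between their vector and the query vector, a common vector-space measure. The weights are invented; where each document actually sits on the slide's figure is an assumption.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Coordinates are (STAR weight, DIET weight); the values are hypothetical.
docs = {
    "movie stars": (0.9, 0.6),
    "astronomy": (1.0, 0.0),
    "mammal behavior": (0.0, 1.0),
}
query = (1.0, 1.0)  # "movie stars' diets" mentions both terms

ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
print(ranked[0])  # movie stars
```

A plain Boolean match on "star" would treat the astronomy page and the movie-star page alike; the vector model gives each a degree of relatedness instead.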
107
Phibot
http://phibot.org
  • Project of the Univ. of Mainz and German
    Institute of Artificial Intelligence
  • Crawls science, medicine and news web sites
  • 200 million general science sites
  • 70 million medical sites
  • Traditional Google-like processing
  • Vector-Space
  • Optimization: greater vector-space processing

108
Digital Video Search
  • Searches actual visual content
  • Project of Dublin City University
  • http://www.cdvp.dcu.ie
  • Determine structure of the video by identifying
    shots with the greatest degree of change
    (keyframes)
  • Use these to create a structure, and allow user
    to refine query based on these
  • Needed by journalists, governments and airport
    security
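
Finding the points of greatest change can be sketched very simply. Here each frame is reduced to a short list of brightness values and a boundary is flagged wherever a frame differs sharply from the one before it. This is an illustration of the general idea only, not the Dublin City University method; the frame data and the threshold are made up.

```python
def find_shot_boundaries(frames, threshold):
    """Return indices of frames that differ sharply from the preceding frame."""
    boundaries = []
    for i in range(1, len(frames)):
        # Sum of absolute differences between corresponding values.
        change = sum(abs(a - b) for a, b in zip(frames[i], frames[i - 1]))
        if change > threshold:
            boundaries.append(i)
    return boundaries

# Each inner list stands in for one frame (e.g. average brightness per region).
frames = [
    [10, 10, 10],
    [11, 10, 9],
    [90, 80, 85],   # sharp change: new shot starts here
    [91, 79, 85],
    [12, 200, 40],  # sharp change: another new shot
]
print(find_shot_boundaries(frames, threshold=50))  # [2, 4]
```

The frames at those indices would serve as the keyframes shown to the user for refining a query.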

109
Current Trends in and Challenges to Today's
Search Industry
110
User Interface Trends
  • Toolbars, Toolbars, everywhere
  • Review site: searchenginewatch.com/links/article.php/2156381
  • Search by Location: Major engines with local
    search options and local specialized ones
  • Makes the haystack smaller; important in
    e-commerce
  • P2P networks (Peer-to-peer)
  • File-sharing networks, a la Napster
  • KaZaA - most popular download EVER!
  • Shares any filetype
  • 90% of files shared are audio-visual in nature

111
User Interface Trends
  • Application Program Interface (API)
  • Published set of programming hooks that lets you
    interact directly with a company's open servers
  • You can mine the company's databases for free
  • WHY? To attract more traffic to the site
  • Example: http://www.googlerace.com
  • Enter 1 or 2 terms/phrases and see how Bush and
    Democratic candidates stack up!
  • Created by Tara Calishain

112
Search in Corporate Settings Drives Search Engine
R&D
  • Uniform, seamless access to all information
  • Internal & external, data & content
  • XML
  • More natural language processing
  • Hybrid systems to search structured AND
    unstructured data
  • Adaptive processing (Bayesian)
  • Use of intelligent agent software
  • Easier user interfaces
  • Personalization

113
Industry-wide Trends
  • Distributed Crawling
  • Volunteer your PC when not in use
  • Grub.com, Looksmart
  • Search continues to be driven by advertising and
    revenue
  • Fewer services maintain their own crawler-created
    database
  • Increased crawling of non-html filetypes

114
Challenges to the Industry
  • Revenue
  • E-content providers have cut into search software
    sales with their proprietary engines
  • Fighting fraud
  • Cloaking, ranking manipulation
  • Scalability
  • Size of surface Web increases
  • Over 300 million queries a day to all Web S.E.s

115
Challenges to the Industry
  • Freshness
  • Competitive edge demands recent crawls
  • Deep Web
  • Embedded databases
  • Non-html filetypes
  • Real-time information
  • Growing importance of the Living Web

116
Challenges to the Industry
  • Ambiguous query refinement
  • Not very hopeful among general search engines
  • User group too large
  • User profiling difficult
  • Indexing the smaller, newer sites
  • Google's link-based PageRank penalizes these sites

117
The Biggest Challenge
Just what are you looking for?
  • A known needle in a known haystack
  • A known needle in an unknown haystack
  • An unknown needle in an unknown haystack
  • Any needle in a haystack
  • The sharpest needle in a haystack
  • Most of the sharpest needles in a haystack
  • All the needles in a haystack

118
The Biggest Challenge
Just what are you looking for?
  • Affirmation of no needles in the haystack
  • Things like needles in any haystack
  • Let me know if any new needles show up
  • Where are the haystacks?
  • Needles, haystacks, ... whatever

119
Thank You and
Happy Holidays!
Michael Hunter Reference Librarian Hobart and
William Smith Colleges Geneva, NY 14456 (315)
781-3552 hunter_at_hws.edu