1
The Invisible Web
  • David Boudinot
  • Heather De Forest
  • Joan Pries
  • Lindsay Ure
  • Created for LIBR 557 Advanced Information
    Retrieval
  • Dr. Mary Sue Stephenson
  • November 29, 2004

2
The Invisible Web
  • "Searching on the Internet today can be compared
    to dragging a net across the surface of the
    ocean. While a great deal may be caught in the
    net, there is still a wealth of information that
    is deep, and therefore, missed. The reason is
    simple: Most of the Web's information is buried
    far down on dynamically generated sites, and
    standard search engines never find it."
  • Deep Web White Paper, BrightPlanet.com (2000)

3
The Invisible Web
  • Part I: The Invisible Web Explained
  • What is the Invisible Web?
  • Why can't I access the Invisible Web using
    regular search engines?
  • How deep is the Invisible Web and what does it
    contain?
  • Where do I start with searching the Invisible Web?

4
What is the Invisible Web? Background and
definitions.
5
Background
  • The phrase "Invisible Web" was first used in the
    mid-1990s to describe web content that is not
    indexed by regular search engines
  • 2000: Deep Web White Paper, published by the
    BrightPlanet Corporation, discusses the nature
    and scope of the Invisible Web
  • 2001: publication of The Invisible Web:
    Uncovering Information Sources Search Engines
    Can't See, by Chris Sherman and Gary Price

6
The Visible Web
  • The visible (or "surface") web is the part of the
    Web that can be retrieved using standard search
    engines such as Google or AltaVista, or subject
    directories
  • In order for search engines to find them, web
    pages must be static and either linked to other
    web pages or submitted for indexing by Webmasters

7
The Invisible Web
  • Also known as the "deep web," "dark web," or
    "hidden web"
  • The Invisible Web is what standard search tools
    either cannot or will not crawl and index
  • A large part of the Invisible Web consists of
    authoritative and pertinent information

8
The Invisible Web
  • The term "invisible" is somewhat misleading. It
    is possible to retrieve this content, but not
    using the same methods as for visible web content
  • Various reasons why certain web pages are not
    indexed by standard search engines or directories
  • The Invisible Web is hard to define precisely,
    and definitions of its exact nature vary

9
Definitions
  • "The Deep Web is content that resides in
    searchable databases, the results from which can
    only be discovered by a direct query. Without the
    directed query, the database does not publish the
    result. When queried, deep Web sites post their
    results as dynamic Web pages in real-time. Though
    these dynamic pages have a unique URL address
    that allows them to be retrieved again later,
    they are not persistent."
  • BrightPlanet Corporation, BrightPlanet.com

10
Definitions
  • "Text pages, files, or other often high-quality
    or authoritative information available via the
    World Wide Web that general-purpose search
    engines cannot, due to technical limitations, or
    will not, due to deliberate choice, add to their
    indices of Web pages. Sometimes also referred to
    as the Deep Web or dark matter."
  • Chris Sherman and Gary Price, The Invisible
    Web: Uncovering Information Sources Search
    Engines Can't See (2001)

11
Types of Invisibility
  • Opaque Web
  • Web pages that have not yet been crawled by
    search engines for various reasons, but could
    become part of the Visible Web at any time
  • Private Web
  • Sites that could be indexed, but that Webmasters
    have chosen to exclude from search engines or to
    restrict access to, using password protection,
    the Robots Exclusion Protocol, or robots meta tags
  • Proprietary Web
  • Sites that are available only to those who have
    agreed to terms or conditions to access the
    content. May be free registration or paid
    subscription.
  • Adapted from Sherman and Price, The Invisible
    Web: Uncovering Information Sources Search
    Engines Can't See (2001)

12
Types of Invisibility
  • Truly Invisible Web
  • Sites that search engines cannot, or will not,
    crawl for technical reasons
  • Certain file types
  • Real-time information, such as flight arrivals
    and weather reports, that is relevant only for a
    very short time
  • Pages generated by scripts, which can trap
    spiders
  • Dynamic pages that are created in response to a
    user query, namely the content of relational
    databases
  • Adapted from Sherman and Price, The Invisible
    Web: Uncovering Information Sources Search
    Engines Can't See (2001)

13
Why can't I access content from the Invisible Web
using regular search engines?
14
  • Disclaimer
  • As technologies grow and search engines develop,
    parts of the Invisible Web are becoming visible,
    so what may be invisible today might become
    visible tomorrow.

15
Why can't I access content from the Invisible Web
using search engines?
  • There are 4 main reasons you can't access
    Invisible Web content using search engines:
  • Search engines were originally designed to index
    HTML pages
  • Search engine can't find the content
  • Search engine is blocked from the content
  • Search engine purposely ignores Invisible Web
    site

16
Search engines were originally designed to index
HTML pages
  • Anything outside of HTML (such as Flash,
    Shockwave, or mp3s) has traditionally remained
    invisible.
  • If this type of content is described in meta tags
    within the HTML document, a web crawler can index
    it (see the sketch below).
  • Companies like Google have been developing
    technology to search non-HTML content on the
    Internet.
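  • A rough sketch of the idea (the page, file name,
    and descriptions below are invented): a crawler
    cannot read a Flash .swf file itself, but it can
    index the descriptive text in the meta tags of
    the HTML page that embeds it.

    <html>
      <head>
        <title>Library Orientation Tour (Flash)</title>
        <meta name="description" content="A five-minute Flash tour of the library's branches and services.">
        <meta name="keywords" content="library tour, orientation, branches, services">
      </head>
      <body>
        <object data="tour.swf" type="application/x-shockwave-flash"></object>
      </body>
    </html>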

17
Search engine can't find the content
  • Web crawlers work by following links on websites
    and reporting back home what was found. If a
    webpage is not linked from any other page, the
    Web crawler will not be able to find it.

18
Adapted from Chris Sherman and Gary Price, The
Invisible Web: Uncovering Information Sources
Search Engines Can't See (2001)
19
Search engine is blocked from the content
  • Problem: You don't want a search engine to index
    parts of your website.
  • Solution: Include the Robots Exclusion Protocol
    or a Robots META tag in your website.

20
How the Robots Exclusion Protocol works
  • When a web crawler visits a website, it first
    checks for a robots.txt file, which tells the
    crawler what parts of the site it is allowed to
    index.
  • For the SLAIS site, this would look like:
  • http://www.slais.ubc.ca/robots.txt

21
How the Robots Exclusion Protocol works, Part II
  • Code in a simple text document tells the crawler
    what to do. For example:
  • To exclude all crawlers from part of the server:
  • User-agent: *
  • Disallow: /cgi-bin
  • Disallow: /tmp/
  • Disallow: /private/
  • To exclude a single crawler:
  • User-agent: BadBot
  • Disallow: /

Source: Web Server Administrator's Guide to the
Robots Exclusion Protocol
22
How the Robots META tag works
  • The Robots META tag is inserted into an
    individual HTML document to inform web crawlers
    to buzz off.
  • Unfortunately, some crawlers ignore this tag,
    and index your webpage anyway.

23
How the Robots META tag works, Part II
  • Here are some examples of what Robots META tag
    code looks like (see the sketch below):
  • INDEX or NOINDEX tells the crawler to index the
    page or not.
  • FOLLOW or NOFOLLOW instructs the crawler to
    follow (or not follow) the links on the page.
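  • A minimal sketch of such tags as they would
    appear in the head of an HTML document (standard
    robots META tag syntax):

    <!-- default behaviour: index this page and follow its links -->
    <meta name="robots" content="INDEX, FOLLOW">

    <!-- ask crawlers to skip this page and ignore its links -->
    <meta name="robots" content="NOINDEX, NOFOLLOW">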

24
Search engine purposely ignores Invisible Web
site
  • Due to budget constraints or technical issues,
    some search engines choose not to index non-HTML
    files.
  • Spammers tend to use script commands to trap web
    crawlers. Some search engines opt out of
    indexing sites with any script commands.
  • Web crawlers are not programmed to understand
    database structures; therefore, information in
    relational databases remains invisible.

25
Databases and HTML
  • Online databases generate web pages dynamically
    and respond to commands issued from an HTML form
    (see the sketch below).
  • Some databases are proprietary.
  • In many instances web crawlers and databases are
    incompatible.
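  • A rough illustration (the site and field names
    below are invented): a crawler does not fill in
    and submit forms like this one, so the result
    pages the database builds in response are never
    reached.

    <!-- hypothetical catalogue search form -->
    <form action="http://catalogue.example.org/search" method="get">
      <input type="text" name="query">
      <input type="submit" value="Search">
    </form>

    <!-- submitting "invisible web" would produce a dynamically
         generated result page such as
         http://catalogue.example.org/search?query=invisible+web -->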

26
How it all works
27
How deep is the Invisible Web and what does it
contain?
28
How deep is the Invisible Web?
  • BrightPlanet's study of the Deep Web (2000)
  • estimated approximately 400-550 times more
    information than in the surface Web (or World
    Wide Web)
  • Sherman and Price (2001) dispute this claim
  • estimate the IW is somewhere between 2 and 50
    times larger, since much of the information is
    from ephemeral data (such as weather)

29
How fast is the Deep Web growing?
  • "significantly faster than the visible Web"
    (Sherman and Price)
  • "The Deep Web is the fastest growing category of
    new information on the Internet. All signs point
    to the Deep Web as the dominant paradigm for the
    next-generation Internet." (BrightPlanet)

30
Quality and Content: Invisible Web vs.
surface web?
  • Many IW sites are first-rate content sites
  • tend to be narrower in focus, with more content
    in their subject area
  • Often use a variety of media and file types, many
    of which are not easily indexed
  • Largest part of the IW is information contained
    in databases
  • More than half of the content resides in
    subject-specific databases
  • Mostly human indexed

31
Content
  • Invisible Web sources are critical because they
    provide users with specific, targeted
    information, not just static text or HTML pages
  • However, general search engines are becoming much
    more sophisticated and capable
  • E.g., Google's new Google Scholar for scholarly
    resources opens up the invisible web by allowing
    access to some material that wouldn't ordinarily
    be available to search spiders (Search Engine
    Watch, November 18, 2004)
  • What is invisible today may be visible tomorrow

32
Content
  • At the time Sherman and Price's book was first
    written in 2001, PDF and Microsoft Office
    documents were among those which could not be
    indexed by general search engines
  • Google became the first to index PDF and Office
    documents, a search capability that is now widely
    accepted

33
Content
  • A number of other file formats are still not
    being searched well by most search engines
  • PostScript
  • Flash
  • Shockwave
  • Executables (programs)
  • Compressed files (.zip, .tar, etc.)

34
Why aren't these formats searched?
  • Although the above formats can be indexed, they
    often are not, because it is expensive to index
    non-HTML pages
  • In other words, "the major web engines are not in
    business to meet every need of information
    professionals and researchers." (Sherman and
    Price, 2003)

35
  • These difficult file types are becoming more
    prevalent, especially in some kinds of
    high-quality, authoritative information
  • E.g., official government documents, or scholarly
    papers stored on the Web using PostScript or
    compressed PostScript files
  • (PostScript is a page description language
    first used by Adobe in 1985. It is a programming
    language optimized for printing graphics and
    text.)

36
What's NOT on the Web
  • Proprietary databases and information services
  • Dialog, LexisNexis, etc.
  • Government and public records
  • Some coverage of government docs but too much
    information to ever have complete coverage
  • Privacy issues come into play for public records
  • Scholarly journals
  • Publishers have tight control
  • There are a few scholarly free e-journals,
    usually found via library websites
  • Full Text of newspapers and magazines
  • Some limited archive content is available, but
    the information is often still valuable, so
    publishers want to retain control of it
  • Authors' rights are also a concern; many retain
    re-use rights
  • Millions of documents that will never be
    available on the Web
  • libraries are still important!

37
Why use the Invisible Web?
  • There are thousands of databases with
    high-quality information accessible via the Web,
    many from libraries, universities, businesses,
    government agencies, etc.
  • Previously, this type of information was
    available only in proprietary information systems
  • Although these databases may be accessible
    through the Web, they may not be on the Web

38
Why use the Invisible Web?
  • More comprehensive results
  • resources are more subject specific
  • More control
  • more specialized tools for searching, thus easier
    retrieval of subject-specific information
  • Increased precision and recall (see the
    definitions below)
  • smaller databases: better recall
  • subject-specific resources: better precision
  • Authoritative
  • High quality content from reputable institutions
    or organizations
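  • For reference, the standard definitions behind
    these two measures:

    precision = relevant documents retrieved / total documents retrieved
    recall    = relevant documents retrieved / total relevant documents in the collection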

39
WHERE
  • can I find Invisible Web resources?

40
Q. How do I search the invisible web?
41
A. You already do!
42
Top 25 types of content on the invisible web.
  • 1. Public company filings
  • 2. Telephone numbers
  • 3. Customized maps and driving directions
  • 4. Clinical trials
  • 5. Patents
  • 6. Out of Print Books
  • 7. Library catalogues
  • 8. Authoritative dictionaries
  • 9. Environmental information
  • 10. Historical stock quotes
  • 11. Historical documents and images
  • 12. Company directories
  • 13. Searchable subject bibliographies
  • 14. Economic information
  • 15. Award Winnings
  • 16. Job Postings
  • 17. Philanthropy and grant information
  • 18. Translation tools
  • 19. Postal codes
  • 20. Basic demographic information
  • 21. Interactive school finders
  • 22. Campaign financing information
  • 23. Weather data
  • 24. Product catalogues
  • 25. Art Gallery holdings

43
Attitude Shift
  • Remember that the invisible web is there.
  • Change your expectations of what you'll find.
    Look for entryways to the invisible web, not for
    the content.
  • Develop a toolkit now for later consultation

44
Searching the Invisible Web
  • 1. Adopt the mindset of a hunter
  • -Tools (weapons) are important
  • -Reading the environment and looking for clues
    is more important.
  • Adapted from Price, G., & Sherman, C. (2001).
    Exploring the invisible web: Seven essential
    strategies. Online, 25(4), 32-35.

45
Searching the Invisible Web
  • 2. Use search engines
  • Use a general-purpose engine like Teoma to search
    for your term with "database" or
    "interactive tool"
  • Adapted from Price, G., & Sherman, C. (2001).
    Exploring the invisible web: Seven essential
    strategies. Online, 25(4), 32-35.

46
Searching the Invisible Web
  • 3. Use Site Maps and Site Searches
  • Big sites like the Library of Congress and
    Library and Archives Canada are often hybrids:
    part visible, part invisible.
  • Use the site map, search for "database", and
    see what you get!
  • Adapted from Price, G., & Sherman, C. (2001).
    Exploring the invisible web: Seven essential
    strategies. Online, 25(4), 32-35.

47
Searching the Invisible Web
  • 4. Rely on Baker Street Irregulars
  • Sherlock Holmes had key informants. You can too.
  • Early Warning Systems:
  • -the "Search Stuff from Susie" list
  • -Search Engine Watch newsletters and
    blog feeds
  • -Gary Price's www.resourceshelf.com
  • Adapted from Price, G., & Sherman, C. (2001).
    Exploring the invisible web: Seven essential
    strategies. Online, 25(4), 32-35.

48
Searching the Invisible Web
  • 5. Use Invisible Web Directories
  • Directories like the Librarians' Index to the
    Internet and the Invisible Web Directory have the
    advantage of presenting resources that have been
    hand-selected.
  • Adapted from Price, G., & Sherman, C. (2001).
    Exploring the invisible web: Seven essential
    strategies. Online, 25(4), 32-35.

49
Searching the Invisible Web
  • 6. Use offline finding aids
  • Handbooks
  • The Invisible Web, Sherman & Price
  • Best of the Web Geography, Leftley
  • Website Reviews
  • Adapted from Price, G., & Sherman, C. (2001).
    Exploring the invisible web: Seven essential
    strategies. Online, 25(4), 32-35.

50
Searching the Invisible Web
  • 7. Create your own monitoring service
  • Some specialized search engines, like InfoMine
    and ProFusion, have alert services that will let
    you know when new resources have been added.
  • Adapted from Price, G., & Sherman, C. (2001).
    Exploring the invisible web: Seven essential
    strategies. Online, 25(4), 32-35.

51
Searching the Invisible Web
  • What about these so-called
    "Invisible Web Search Engines"?
  • E.g. ProFusion, Incy-Wincy, Complete Planet

52
The Invisible Web
  • Part II: Demonstrations of Invisible Web Search
    Tools
  • ProFusion
  • Complete Planet
  • The Invisible Web Directory

53
ProFusion
  • Claims:
  • "ProFusion is very dynamic and an extremely
    exciting search site that makes it easy to
    intelligently search and find information from
    the very deep and invisible parts of the web."
  • Press release from Intelliseek:
    http://www.intelliseek.com/releases2.asp?id=41

54
ProFusion
  • Advantages
  • Vertical search fields
  • Clean interface
  • May highlight resources that you haven't seen
    before.
  • Can retrieve some items which are inaccessible
    through Google.

55
ProFusion
  • Disadvantages
  • Can't log in
  • No real help section
  • Categories/Resources are "mystery meat"
  • Not very effective

56
Complete Planet
  • Strengths
  • A lot of good information about the site as well
    as the invisible web
  • Help/FAQ link very useful
  • Good categories to choose from for searching
  • Advanced search provides date limiters and allows
    for either natural language or Boolean searching

57
Complete Planet
  • Weaknesses
  • Searches have to be quite broad
  • No results if search is too specific
  • Necessitates searching through individual
    databases
  • Results not always relevant
  • Advanced search is not as useful as it appears
  • Using the Basic Search and then individual
    databases gives better results

58
The Invisible Web Directory
  • A companion to the book by Sherman and Price
  • Directory of Invisible Web resources arranged
    into broad categories and subcategories covering
    a wide range of topics
  • Browse only
  • Focus on free sites
  • Emphasis on quality over quantity: authoritative
    resources that all contain some Invisible content

59
The Invisible Web Directory
  • Strengths
  • High quality, authoritative information
  • Resources contain Invisible Web content
  • Simple interface
  • Provides annotations
  • Weaknesses
  • Small number of resources (Sherman & Price argue
    it is intended as a starting point)
  • Browse only; cannot search by keyword
  • Must know which broad category your search fits
    into
  • Not good for the general searcher; more useful
    for those who have read the book
  • Several broken links
  • No information about frequency of updates

60
Invisible Web
  • What?
  • How?
  • Why?
  • Where?
  • Who?

61
  • For more resources, look at our website:
  • http://www.slais.ubc.ca/boudinot/links.htm