1
Internet search engines: Fluctuations in
document accessibility
  • Wouter Mettrop (CWI, Amsterdam, The Netherlands)
  • Paul Nieuwenhuysen (Vrije Universiteit Brussel,
    and Universitaire Instelling Antwerpen, Belgium)
  • Hanneke Smulders (Infomare Consultancy, The
    Netherlands)
  • http://www.cwi.nl/cwi/projects/IRT
  • Presented at Internet Librarian International
    2000 in London, England, March 2000

2
Fluctuations in document accessibility - summary
  • Search engines are often compared on the basis of
    their size, i.e. the number of documents indexed
    in their databases. However, searchers should be
    aware that documents cannot be retrieved
    reliably: unexpected and annoying fluctuations
    exist in the result sets of documents retrieved
    by most search engines.
  • Ideally, fluctuations are caused only by
    alterations in the Web (documents come and go).
    However, in some cases they are caused by changes
    in indexing policy (indexing fluctuations), and
    in some cases the origin is more obscure:
    documents are expected but not retrieved.
  • We have investigated these obscure fluctuations
    by searching repeatedly during a year for several
    identical test documents. The documents were
    placed on different sites and remained unchanged.
    The influence of changes in the indexing policy
    of the engines is excluded.
  • We consider two kinds of obscure fluctuations:
  • 1. Document fluctuations appear when test
    documents disappear from the database of
    indexed documents (for whatever reason).
  • 2. Element fluctuations appear when test
    documents that still exist in the database do
    not show up in result sets even when they should.
  • This presentation reports the results of our
    tests from October 1998 until December 1999. We
    evaluated 13 engines: AltaVista, EuroFerret,
    Excite, HotBot, InfoSeek, Lycos, MSN,
    NorthernLight, Snap, WebCrawler and 3 national
    Dutch engines: Ilse, Search.nl and Vindex.
  • The outcome of our investigation is particularly
    important for known-item searches.

3
WWW: growing number of WWW servers
4
Internet-based information sources: how many?
how much?
  • In 2000:
  • about 1 billion (1000 million) unique URLs in
    the total Internet
  • about 10 terabytes (10 000 gigabytes) of text
    data

5
Internet information retrieval systems in 2000
  • Several types of systems exist to retrieve
    information:
  • Directories of selected sources categorised by
    subject, made by humans, mainly for browsing.
  • Search systems, based on databases with
    machine-made indexes, for word-based searching!
  • Meta-search or multi-threaded search
    systems.
  • We have studied and compared several well-known
    international (and a few national) word-based
    Internet search engines.

6
Internet information retrieval systems:
evaluation criteria
  • Many aspects/criteria can be considered in the
    evaluation of an Internet search engine,
    including:
  • coverage of documents present on the WWW
    (studies exist)
  • number of elements of a document that are
    indexed to make them usable for retrieval
  • fluctuations over time in the result sets
    offered by a search engine
  • We started by studying the depth of indexing and
    were soon confronted with fluctuations in
    performance.

7
Internet information retrieval systems: our
research group
  • The following persons have been involved in the
    research:
  • Louise Beijer (Hogeschool van Amsterdam, The
    Netherlands)
  • Hans de Bruin (Unilever Research Laboratorium,
    Vlaardingen, The Netherlands)
  • Hans de Man (JdM Documentaire Informatie,
    Vlaardingen, The Netherlands)
  • Rudy Dokter (PNO Consultants, Hengelo, The
    Netherlands)
  • Marten Hofstede (Rijksuniversiteit Leiden, The
    Netherlands)
  • Wouter Mettrop (CWI, Amsterdam, The Netherlands)
  • Paul Nieuwenhuysen (Vrije Universiteit Brussel,
    Belgium)
  • Eric Sieverts (Hogeschool van Amsterdam, and RUU,
    The Netherlands)
  • Hanneke Smulders (Infomare, Terneuzen, The
    Netherlands)
  • Hans van der Laan (Consultant, Leiderdorp, The
    Netherlands)
  • Ditmer Weertman (ADLIB, Utrecht, The Netherlands)

8
Internet search engines: research on indexing
functionality
  • assessing the indexing functionality
  • test document
  • test method
  • conclusions concerning indexing functionality

9
Number of our test documents that were retrieved
10
Internet search engines: elements of test
document studied
  • title tag
  • META tags: keywords, description and author
  • comment tag
  • ALT tag
  • text/URL of a link to a document
  • H3 tag
  • table header
  • text of an internal link, a reference anchor,
    a link to a sound file
  • name of a sound file (au/wav/aiff/ra)
  • text of a link to an image
  • name of an image file (gif or jpg; inline or
    linked to)
  • name of a Java applet (with or without the
    class extension)
  • terms after the first 100 lines in a document
    (200//700)
  • the URL of a document

11
Internet search engines: part of the test
document source code
  • <HTML> <HEAD>
  • <TITLE>Test pagina</TITLE>
  • <META NAME="keywords"
    CONTENT="een, twee, drie">
  • <META NAME="description"
    CONTENT="This test page, containig a small part
    of the Secret Garden (by Frances Hodgson Burnett)
    is part of a larger site about the IRT project.
    vier, vijf, zes">
  • <META NAME="Subject" CONTENT="zeven">
  • <META NAME="Subject" CONTENT="acht">
  • <META NAME="Subject" CONTENT="negen">
  • <META NAME="Title" CONTENT="tien hoofdstukken uit
    The Secret Garden">
  • <META NAME="TitleSubtitle" content="elf">
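
The Dutch number words (een, twee, drie, ...) in this source appear to serve as marker terms, one group per element, so that a query on a given term shows whether that element is indexed. This is our reading of the slides, not something they state explicitly. As a minimal sketch under that assumption, the following Python uses the standard-library `html.parser` module to list the META elements and their marker terms from a shortened copy of the head section. The class name `MetaCollector` is hypothetical.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect (name, content) pairs from META tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.meta = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values stay as written.
        if tag == "meta":
            attributes = dict(attrs)
            self.meta.append((attributes.get("name"), attributes.get("content")))


# Shortened copy of the test document head shown on the slide.
SOURCE = """<HTML> <HEAD>
<TITLE>Test pagina</TITLE>
<META NAME="keywords" CONTENT="een, twee, drie">
<META NAME="Subject" CONTENT="zeven">
<META NAME="TitleSubtitle" content="elf">
</HEAD></HTML>"""

collector = MetaCollector()
collector.feed(SOURCE)
for name, content in collector.meta:
    print(f"{name}: {content}")
```

Each printed pair names one element and the term that would be queried to test whether an engine indexes it.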

12
Number of the studied document elements that were
indexed
13
Internet search engines: reachability
  • 14 528 queries sent to 13 search engines
  • 721 times unreachable
  • The percentage of unreachability varies from
    nearly 0% to nearly 15% per engine.
  • The studied search engines were reachable for 95%
    of the queries.
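
The overall reachability figure follows directly from the two counts on this slide. A quick check in Python (the variable names are ours, not from the study):

```python
# Counts reported on the slide.
total_queries = 14_528
unreachable = 721

unreachable_pct = 100 * unreachable / total_queries
reachable_pct = 100 - unreachable_pct

print(f"unreachable: {unreachable_pct:.1f}%")  # roughly 5 percent
print(f"reachable:   {reachable_pct:.1f}%")    # roughly 95 percent
```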

14
Search engine indexing functionality: conclusions
  • Not all of the web is indexed.
  • Not all of our test documents are indexed.
  • Not all HTML elements of our test document are
    indexed.
  • Some of the studied search engines showed changes
    in the indexing policy.
  • No relation between the number of indexed test
    documents or HTML elements and the size of a
    search engine was found during our study.

15
Internet search engines: fluctuations -
definition
  • A fluctuation appears when the result set of an
    observation (i.e. one query or a set of queries)
    misses documents with respect to a frame of
    reference (i.e. other observations and knowledge
    about Web reality).

16
Internet search engines: detecting fluctuations
  • Through time: comparing result sets of one
    observation, repeatedly performed
  • Observation: one query or set of queries
  • Frame of reference: other observations and
    web knowledge
  • One moment: consistency of result sets
  • Observation: one query in a set of queries
  • Frame of reference: other observations
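
Both detection modes amount to comparing sets of retrieved documents against a frame of reference. The following Python is a minimal sketch of that idea, not the procedure used in the study; the function names, the document identifiers and the assumption that every query in a set should retrieve the same documents are ours.

```python
def through_time_misses(result_sets, still_on_web):
    """For each repeated run of one observation, list the documents that
    earlier runs retrieved and that still exist on the Web, but that this
    run misses."""
    seen = set()
    misses = []
    for results in result_sets:
        results = set(results)
        expected = seen & set(still_on_web)  # frame of reference
        misses.append(expected - results)
        seen |= results
    return misses


def one_moment_misses(results_by_query):
    """Within one set of queries run at the same moment, list per query the
    documents that other queries retrieved but this one did not.
    Assumes every query in the set should retrieve the same documents."""
    everything = set().union(*results_by_query.values())
    misses = {}
    for query, results in results_by_query.items():
        missing = everything - set(results)
        if missing:
            misses[query] = missing
    return misses


# Hypothetical data: three runs of the same query, then one set of queries.
runs = [{"doc1", "doc2", "doc3"}, {"doc1", "doc3"}, {"doc1", "doc2", "doc3"}]
print(through_time_misses(runs, still_on_web={"doc1", "doc2", "doc3"}))
print(one_moment_misses({"title term": {"doc1", "doc2"}, "meta term": {"doc1"}}))
```

In the first call the second run misses `doc2` although it was retrieved before and still exists; in the second call the query on the hypothetical "meta term" misses a document that the "title term" query retrieves.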

17
Internet search engines: types of fluctuations
  • Through time: comparing result sets of one
    observation, repeatedly performed
  • Document fluctuations
  • Indexing fluctuations
  • One moment: consistency of result sets
  • Element fluctuations
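
One way to read this taxonomy is as a decision rule for a single missed query on an unchanged test document. The sketch below is our interpretation, not a rule given in the slides: the function name, its two boolean inputs and the order of the checks are all assumptions. `found_by_any_query` says whether any query in the set still retrieves the document; `element_still_indexed` says whether the engine still indexes the queried element at all (judged, for instance, from other test documents).

```python
def classify_miss(found_by_any_query, element_still_indexed):
    """Label one missed (document, element) query, given that the document is
    unchanged on the Web and was retrieved through this element before."""
    if not found_by_any_query:
        # The document appears to have left the engine's database.
        return "document fluctuation"
    if not element_still_indexed:
        # The engine seems to have changed what it indexes.
        return "indexing fluctuation"
    # The document is in the database and the element is indexed,
    # yet this query does not retrieve it.
    return "element fluctuation"


print(classify_miss(found_by_any_query=False, element_still_indexed=True))
print(classify_miss(found_by_any_query=True, element_still_indexed=False))
print(classify_miss(found_by_any_query=True, element_still_indexed=True))
```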

19
Document fluctuations: example 1
20
Document fluctuations: example 2
21
Document fluctuations: experimental results
23
Indexing fluctuations: experimental results
25
Element fluctuations: example
26
Element fluctuations: experimental results
27
Percentage of documents missed due to
fluctuations
28
Internet search engines: fluctuations -
quantitative conclusions
  • Many element fluctuations → many document and
    indexing fluctuations and many document elements
    indexed
  • Many document fluctuations → not always many
    element fluctuations
  • Few document elements indexed → few element
    fluctuations

29
Fluctuations: remarks on correctness
  • Fluctuations can be seen as correct if they
    are reflections of alterations in:
  • (web) reality: then document, indexing and
    element fluctuations are incorrect
  • the indexed database of a search engine: then
    only element fluctuations are incorrect
  • Users do not care about this distinction: they
    miss documents.

30
Fluctuations: remarks on size
  • No relation between document / element
    fluctuations and the size of an engine
  • The percentage of missed documents determines
    (together with other reducing effects, such as
    depth of indexing) the effective size of an
    engine
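
The slides give no formula for effective size. As a rough sketch, assuming the only reduction modelled is the percentage of documents missed due to fluctuations (other effects such as depth of indexing are left out), the nominal size could be scaled as follows. The function name and the figures in the example are hypothetical.

```python
def effective_size(nominal_size, missed_percent):
    """Scale an engine's nominal size by the share of documents it misses.

    Assumption: fluctuations are the only reducing effect considered here;
    depth of indexing and other effects are not modelled.
    """
    if not 0 <= missed_percent <= 100:
        raise ValueError("missed_percent must be between 0 and 100")
    return nominal_size * (100 - missed_percent) / 100


# Hypothetical engine: 100 million indexed documents, 10 percent missed.
print(effective_size(100_000_000, 10))
```

Under this simple model, a large engine with many fluctuations can end up with a smaller effective size than a smaller but more stable one.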

31
Internet search engines: conclusions of our
research
  • Search engines differ in depth of indexing.
  • Search engines show fluctuations in their result
    sets:
  • They are subject to changes in indexing policy
    (indexing fluctuations).
  • They forget documents completely (document
    fluctuations).
  • They miss documents in their result sets
    (element fluctuations).

32
Internet search engines: recommendations related
to fluctuations
  • Fluctuations are normal: do not be surprised,
    do not worry.
  • Do not try to find a simple explanation to fully
    understand what happens.
  • Known-item searchers should repeat the search:
  • when using an engine with many element
    fluctuations: use other search terms
  • when using an engine with many document
    fluctuations: repeat later.
  • Further research on effective size.
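
The advice to retry a known-item search with other terms can be sketched as a small loop. This is an illustration only: `find_known_item`, the `search` callable and the stand-in index are hypothetical, and no real search engine API is assumed.

```python
def find_known_item(search, queries, is_target):
    """Try alternative queries in turn until one retrieves the known item.

    search: callable mapping a query string to an iterable of URLs
            (hypothetical stand-in for an engine).
    Returns (query, url) for the first hit, or None if every query misses.
    """
    for query in queries:
        for url in search(query):
            if is_target(url):
                return query, url
    return None


# Stand-in for a real engine: a fixed mapping from query to results.
fake_index = {
    "secret garden burnett": [],
    "tien hoofdstukken": ["http://example.org/test.html"],
}
hit = find_known_item(
    lambda query: fake_index.get(query, []),
    ["secret garden burnett", "tien hoofdstukken"],
    lambda url: url.endswith("/test.html"),
)
print(hit)
```

If the function returns `None` for every variant, the second recommendation applies: repeat the whole search later, since the document may have dropped out of the database temporarily.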

33
Element and indexing fluctuations: example