Introduction to Integrated Search

1
Introduction to Integrated Search
  • Thomas Place
  • Presentation for Digital Libraries à la Carte
    2009
  • Friday, 31 July 2009

2
Overview
  • A Personal History
  • Users
  • Common Architecture
  • Concluding remarks

3
A Personal History
4
History: search @ Tilburg University
  • 1988: Searching only in the library
  • 1992: Searching moves to the desktop
  • 1995-1997: Homogeneous search interface
  • 2001: Metasearch plus dynamic linking
  • 2009: Integrated search
  • 201?: Searching completely in the cloud

5
1988 Searching only in the library
  • Psychological Abstracts: print index (appeared
    1927-2006)
  • Social Sciences Citation Index: print index
    (appeared 1973/4 - ????)
  • OPAC terminals Online in the library ? Public
  • Stand-alone PC in the library with CD-ROMs:
    PsycLit, SSCI
  • (In 1985 my first PC)
  • Each week: Current Contents, a print journal with
    the tables of contents of journals
  • Each month: exhibition of newly acquired books
  • Browsing the shelves

6
1992 searching moves to the desktop
  • New library building
  • LBS3 of Pica (now OCLC)
  • new-generation ILS; the core is still
    operational in 2009
  • building the database from the union catalogue
    took weeks (transfer by tapes)
  • updates online
  • OPAC accessible via the Internet (telnet)
  • Tilburg: the first Dutch university with a Campus
    Wide Information System (1991), with entry points
    for the local bibliographical databases:
  • Catalogue
  • Excerpta Informatica
  • Online Contents: journal articles
  • Student theses
  • Attent: reports in economics
  • Brabant database
  • and for external databases on the Internet.
  • CD-ROMs available via campus network

7
1995-1997 homogeneous search interface
  • All local databases (Trip) have a Z39.50
    interface; the exception is the catalogue
  • Z39.50 MS Windows client (Kwik)
  • Soon replaced by a Web application (Trix)
  • Homogeneous access to internal and external
    Z39.50 databases via a Web browser (Netscape)
  • Each database was, however, searched separately,
    as in 1988 with the print indexes.
  • Users didn't understand that Catalogue is for
    books and journals, Online Contents is for
    articles, etc. The default selection was the
    first database in the list.

8
One Interface
  • Diagram labels: Z39.50, homogeneous user
    interface
9
One Interface
  • Diagram labels: XML, Federator, Z39.50, SRU,
    Metasearch
10
2001 metasearch plus dynamic linking
  • European project Decomate II
    ?
  • commercialization by OCLC PICA
  • software development by Tilburg University
  • no longer on the market; other products
    are
  • First Dutch implementation of metasearch; still
    running.
  • Database lists, homogeneous user interface for
    SRU/Z39.50 databases, metasearch, de-duplication,
    dynamic linking to full text (OpenURL resolving),
    book shelves, current awareness services
  • Local databases only available via the user
    interface of iPort
  • User interface conforms to house style (demo)

11
Problems with metasearching
  • Performance is sometimes disappointing (no
    Google-like performance).
  • The presentation of the information is not
    optimal (merging, sorting).
  • Users find it difficult to select the right
    databases for a federated search (as a solution
    they select all databases, which has a negative
    effect on performance and increases the noise
    in the search results).
  • Users don't know how to formulate the best
    queries for the databases they have chosen (in
    many cases this is not possible anyway, because a
    query that is optimal for one database is not
    optimal for another database in which the user
    also wants to search: indexes differ across
    databases).

12
One Interface
  • Diagram labels: Z39.50, homogeneous user
    interface
13
One Interface
  • Diagram labels: XML, Federator, Z39.50, SRU,
    Metasearch
14
One Interface
  • Diagram labels: SRU, XML, OAI-PMH, Integrated
    search
15
2009 Integrated search
  • The start is no longer the page with databases,
    but the search box.
  • No database selection, just search (demo)
  • Technical solution: Meresco of CQ2
  • Open Source
  • We work together with TU Delft, which also
    implements Meresco Discover
  • Meresco infrastructure is also used for special
    services, e.g., Economists Online

16
What are our goals?
  • To be THE one and only search engine of Tilburg
    University
  • Searching scientific information (library) AND
    non-scientific information (website, learning
    material)
  • A query leads (in the future) to:
  • Relevant documents and web pages (Meresco)
  • Experts (expert finding system developed by a
    master student)
  • Specialised databases (Purple search, metasearch
    application of the University of Groningen)
  • Finding documents: no more clicking through to
    the full record display; the most important
    information is presented directly in the result
    list
  • Informing the user about the search results:
    facets, clusters
  • Added value: add-ons / mash-ups, integration in
    the workflow

18
Components
  • Information resources
  • Ingest
  • Search engine
  • Presentation and integration of external services

19
  • Architecture diagram labels: NEEO institutional
    repositories, other economics repositories, logs,
    metadata, objects, OAI-PMH, HTTP, crawler,
    harvester, metadata enrichment server, gateway,
    SRU, search engine, RePEc, RSS/Atom, portlet,
    portal, publication list generator, Ajax server,
    service component, data, subcomponent, protocol
20
Information resources
  • Repositories with OAI-PMH interface
  • Local databases (IR, Student theses, Online
    Contents, ...)
  • SHARED repositories with metadata of publishers
  • Elsevier repository @ UvT
  • External repositories
  • RePEc
  • IRs (e.g., NEEO)
  • ...
  • GGC: Dutch Shared Cataloguing System with
    OUF/SRU interface (catalogue records)
  • ...

21
Ingest
  • Meresco harvester
  • OAI-PMH: from repositories to harvester
  • SRUUpdate: from harvester to search engine
  • Inbox
  • Pica records from GGC go into the inbox
  • Records are fetched from the inbox by the search
    engine
  • Records are stored in their original format in
    the database of the search engine
  • If the original is not MODS, a MODS conversion is
    stored alongside the original record, so no
    dynamic conversion is needed for indexing and
    presentation
  • Parts (e.g., ratings, annotations or fulltext)
    can be added to the record
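
The harvesting step above can be sketched in a few lines of Python. This is a minimal illustration of an OAI-PMH ListRecords request and response, not Meresco's harvester: the base URL, the sample response and the function names are made up for the example, and only the protocol arguments (verb, metadataPrefix, resumptionToken) come from OAI-PMH itself.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it replaces metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (records, resumption_token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", OAI_NS):
        identifier = rec.findtext("oai:header/oai:identifier", namespaces=OAI_NS)
        metadata = rec.find("oai:metadata", OAI_NS)
        original = None  # deleted records carry a header but no metadata
        if metadata is not None and len(metadata):
            # Keep the payload in its original format, as on the slide.
            original = ET.tostring(metadata[0], encoding="unicode")
        records.append({"id": identifier, "original": original})
    token = root.findtext(".//oai:resumptionToken", namespaces=OAI_NS)
    return records, (token or None)


# Made-up response, shortened for illustration.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <dc xmlns="http://purl.org/dc/elements/1.1/"><title>Test</title></dc>
      </metadata>
    </record>
    <resumptionToken>abc123</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, token = parse_list_records(SAMPLE_RESPONSE)
next_url = list_records_url("http://example.org/oai", resumption_token=token)
```

A harvester would repeat the request with each returned resumption token until the token comes back empty, and pass every record on to the search engine.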

22
Meresco search engine
  • Lucene
  • XML-based: all paths in the tree can be indexed
  • Powerful faceting engine (not Lucene)
  • Search term suggestions
  • Clustering (sort of)
  • Indexing of fulltext
  • Has its own GUI
  • But integration via SRU with other front ends
    (e.g., Economists Online) is possible
  • Flexible: write your own pluggable components in
    Python
  • UvT develops tools for configuration by
    Functional Application Managers (information
    specialists)
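
To make "all paths in the tree can be indexed" concrete, here is a small plain-Python sketch of a path-based inverted index. It does not use Meresco or Lucene; the function name, the dotted path notation and the sample record are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def index_paths(record_id, xml_text, index=None):
    """Add every text-bearing path of an XML record to an inverted index.

    The index maps (path, lowercased token) to a set of record ids.
    """
    index = defaultdict(set) if index is None else index

    def walk(element, prefix):
        # Drop a '{namespace}' prefix so that paths stay readable.
        tag = element.tag.split("}")[-1]
        path = f"{prefix}.{tag}" if prefix else tag
        for token in (element.text or "").split():
            index[(path, token.lower())].add(record_id)
        for child in element:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return index


# Made-up, MODS-like record.
record = (
    "<mods><titleInfo><title>Integrated Search</title></titleInfo>"
    "<name><namePart>Place</namePart></name></mods>"
)
index = index_paths("rec1", record)
hits = index[("mods.titleInfo.title", "search")]
```

Because each path is its own field, a query can target the title only, the author only, or any other element, without defining those fields in advance.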

23
Integration of external services (UvT)
  • place locator
  • OpenURL resolver
  • No menus: OpenURL in, XML out
  • Info about location as specific as possible
  • Connection with ILS for availability info (need
    for standards: DLF)
  • Is called from results list (Ajax)
  • Journal covers (local server)
  • Book covers (Syndetics)
  • More to come
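
The "OpenURL in" half of such a service can be illustrated with the sketch below, which builds an OpenURL 1.0 (Z39.88-2004) key/encoded-value request for a journal article. The resolver address and the function name are hypothetical; the keys are the standard journal-format keys, and the XML answer of the place locator is not shown because the slides do not describe it.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def article_openurl(resolver_base, *, issn, volume, spage, date, atitle=None):
    """Build an OpenURL 1.0 (KEV) request for a journal article."""
    params = {
        "url_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
        "rft.genre": "article",
        "rft.issn": issn,
        "rft.volume": volume,
        "rft.spage": spage,
        "rft.date": date,
    }
    if atitle:
        params["rft.atitle"] = atitle
    return resolver_base + "?" + urlencode(params)


# Hypothetical resolver address and made-up citation data.
url = article_openurl(
    "http://resolver.example.org/locate",
    issn="1234-5678",
    volume="12",
    spage="45",
    date="2009",
    atitle="Integrated search",
)
sent = parse_qs(urlsplit(url).query)
```

A results list could issue one such request per record from the browser (Ajax) and render the returned location and availability next to the record.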

24
What is in the search engine now (June 2009)?
25
What will be added?
26
Users (Delft)
  • Students lack an overview of the domain in which
    they search. They are inexperienced searchers and
    don't know the terminology of the disciplines in
    which they search. The challenge for students is
    to find structure in the chaos of information.
  • Students search without a clear plan. They want
    to be able to revisit earlier search paths. This
    is not well supported by present systems.
  • When students start searching, they have no clear
    idea of what they are searching for. During the
    search process their information need gradually
    becomes clearer and they discover the relevant
    search terms.
  • For students it is difficult to verify the
    trustworthiness of the information that they find
    during searching.
  • Students don't know RSS.
  • The way students search is not very well
    organised. They change strategies and goals. They
    are very receptive to unexpected results
    (serendipity) which give them new leads for
    searching more information.

27
Metalib statistics of the University of Groningen
50 zero or false results
  • Misspellings and typos in search terms
  • Picking databases at random
  • Unable to understand QuickSearch, MetaSearch,
    Find Database
  • Using the wrong search keys
  • Using search keys wrong
  • Using Dutch search terms in English language
    databases
  • Using non-specific terms, phrases that are too
    broad
  • Lack of understanding of Boolean logic or
    database peculiarities

28
Common architecture
  • Data layer
  • Search layer: in most cases Lucene as core
    (Omega of Utrecht University: Autonomy)
  • Presentation layer

30
  • Diagram labels: Primo Search Engine, Import/Data
    APIs, data, Publishing Platform, data, Harvesting
    (OAI-PMH,..), Source Repository
32
Data layer
  • Collection of metadata and documents from
    external sources by:
  • OAI-PMH harvesting
  • downloading from CDs or DVDs
  • FTP get
  • SRU/Z39.50 requests
  • Cleaning of the metadata (e.g., repairing invalid
    XML)
  • Adding metadata elements: local data, subject
    info. Also availability info?
  • Merging metadata (e-holdings and print holdings
    of the same journals; expressions and
    manifestations of the same work: FRBR)
  • Conversion to a standard XML format (PNX, MODS,
    MARC21): proprietary vs. standardized formats.
    What is stored? Original and/or converted
    records? Or nothing, or only the external record
    location?
  • Adding admin info: source, ingest date, access
    rights
  • Fetching documents and adding (ASCII or XML
    version of) fulltext to the records
  • Processing of data generated by users: tags,
    annotations, ratings. User-generated data:
    external (shareable) or internal (non-shareable)
    data
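
The merging step in the list above can be sketched for the simplest case: print and electronic holdings of the same journal, matched on ISSN. The record layout, field names and source names are assumptions for the example; real merging (for instance FRBR grouping of expressions and manifestations) needs far more than one match key.

```python
def merge_holdings(records):
    """Merge journal records that share an ISSN into one record per journal."""
    merged = {}
    for rec in records:
        # Normalise the ISSN so that '1234-567x' and '1234567X' match.
        key = rec["issn"].replace("-", "").upper()
        journal = merged.setdefault(
            key,
            {"issn": rec["issn"], "title": rec["title"], "holdings": [], "sources": []},
        )
        journal["holdings"].extend(rec["holdings"])
        # Admin info: remember which source each part came from.
        journal["sources"].append(rec["source"])
    return list(merged.values())


# Made-up records from two hypothetical sources.
incoming = [
    {"issn": "1234-567X", "title": "Journal A", "source": "print-catalogue",
     "holdings": ["print 1990-2005"]},
    {"issn": "1234567x", "title": "Journal A", "source": "e-journals",
     "holdings": ["online 1998-"]},
    {"issn": "8765-4321", "title": "Journal B", "source": "e-journals",
     "holdings": ["online 2001-"]},
]
journals = merge_holdings(incoming)
```

Doing this once in the data layer means the merged record can be indexed as a whole, which is not possible when the merge happens only at presentation time.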

33
Data layer
  • Sharing of data
  • Processing of publisher data at one place,
    indexing at many places
  • Sharing of annotations, tags and ratings (?)
  • Issues
  • What is stored?
  • Pre-processing in the data layer versus
    post-processing in the presentation layer: static
    data versus dynamic data; data generated during
    post-processing can't be indexed

34
Search layer
  • Indexing of records from the data layer
  • Loading: SRUUpdate or batch mechanism
  • Filters, analyzers
  • Index definitions (Lucene document format)
  • Separate indexes for faceting, search
    suggestions and/or clustering? Or use Solr?
  • Searching in the index(es)
  • Search results including facets, clusters in XML
  • SRU
  • RSS
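
As an illustration of the SRU interface named above, the following sketch builds an SRU 1.1 searchRetrieve request and reads the hit count from a response. The server address, the sample response and the function names are made up; facets and clusters are left out because SRU 1.1 has no standard element for them, so how a given engine returns them is an open assumption.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

SRW_NS = {"srw": "http://www.loc.gov/zing/srw/"}


def search_retrieve_url(base_url, cql_query, start_record=1,
                        maximum_records=10, record_schema=None):
    """Build an SRU 1.1 searchRetrieve request URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "startRecord": start_record,
        "maximumRecords": maximum_records,
    }
    if record_schema:
        params["recordSchema"] = record_schema
    return base_url + "?" + urlencode(params)


def number_of_records(xml_text):
    """Read the total hit count from a searchRetrieve response."""
    root = ET.fromstring(xml_text)
    return int(root.findtext("srw:numberOfRecords", default="0", namespaces=SRW_NS))


# Hypothetical server address and a made-up, shortened response.
url = search_retrieve_url("http://search.example.org/sru",
                          'title = "integrated search"', maximum_records=5)
SAMPLE_RESPONSE = (
    '<searchRetrieveResponse xmlns="http://www.loc.gov/zing/srw/">'
    "<version>1.1</version><numberOfRecords>42</numberOfRecords>"
    "</searchRetrieveResponse>"
)
total = number_of_records(SAMPLE_RESPONSE)
```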

35
Search layer
  • Sharing of indexes
  • 1. One central index with subcollections
  • 2. Distributed index: standardization of index
    definitions
  • 3. Exchanging of indexes
  • 1 is possible but requires organisation
  • 2 and 3 are probably technically possible, but I
    don't know of successful examples
  • Issues
  • Standard search interfaces: SRU, ...

36
Presentation layer
  • Web application that sends (converted) user query
    (HTTP request) to search engine and receives
    search results in XML
  • Processing of XML and returning HTTP response to
    the browser
  • For dynamic content the browser is responsible
    (Ajax), e.g., availability info
  • Possible modules
  • Query parser: Google-like queries to CQL
  • OpenURL generator
  • Tag cloud builder
  • Authorisation module with access rules;
    authentication is external: support of SAML
    (A-Select, Shibboleth)
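
A query parser module of the kind listed above could look roughly like this. It is a deliberately small sketch under stated assumptions: space-separated terms are combined with "and", a leading minus becomes CQL "not", and quoted phrases stay phrases. The function name is made up, and a production parser would also need to handle apostrophes, field prefixes and unbalanced quotes.

```python
import shlex

CQL_RESERVED = {"and", "or", "not", "prox", "sortby"}


def google_like_to_cql(user_query):
    """Translate a simple Google-like query into a CQL query string."""
    positive, negative = [], []
    for token in shlex.split(user_query):
        excluded = token.startswith("-") and len(token) > 1
        term = token[1:] if excluded else token
        # Quote phrases, reserved words and anything with special characters.
        if not term.isalnum() or term.lower() in CQL_RESERVED:
            term = '"' + term.replace("\\", "\\\\").replace('"', '\\"') + '"'
        (negative if excluded else positive).append(term)
    if not positive:
        # CQL 'not' is a binary operator, so something must precede it.
        raise ValueError("query needs at least one term that is not excluded")
    cql = " and ".join(positive)
    for term in negative:
        cql += " not " + term
    return cql


cql = google_like_to_cql('integrated search -metasearch "digital library"')
```

The resulting string can be sent as the query argument of an SRU request, so the user never has to see or learn CQL.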

37
Presentation layer
  • Integration of external services: the application
    must allow for easy integration of external web
    services
  • Recommender systems like Purple search,
    metasearch application of the University of
    Groningen
  • Personalised services, e.g.,
  • Current awareness service: storage of profiles
    (or is RSS sufficient?)
  • E-shelves, shopping cart: permanent storage?
    Sharing of e-shelves.
  • Tagging, annotations, ratings. Sharing
  • Location services: integrating OpenURL resolver
    and Circulation Control of ILS. Issue:
    standardized access to availability information
    of ILS.
  • Federated search server
  • Amazon (book covers, book reviews) / Syndetics
    (book covers, reviews, tables of content)
  • Google books
  • Web of Science: impact factor (or new service of
    Ex Libris)
  • Export services
  • xISSN (OCLC): get all related ISSNs. Can also be
    used during preprocessing in the data layer
  • TicTOCs: Journal Tables of Contents Service

38
Concluding remarks
  • Just search, no database selection
  • Integrated search systems must give guidance to
    the user: facets, clusters, suggestions,
    recommendations, ...
  • Sharing of resources requires a common
    architecture, common APIs, common standards