Information Access for a Digital Library: Cheshire II and the Berkeley Environmental Digital Library Ray R. Larson School of Information Management - PowerPoint PPT Presentation


1
Information Access for a Digital Library:
Cheshire II and the Berkeley Environmental Digital Library
Ray R. Larson
School of Information Management & Systems
University of California, Berkeley
ray@sherlock.berkeley.edu
Chad Carson
Computer Science Division, EECS
University of California, Berkeley
carson@eecs.berkeley.edu
2
UCB Digital Library Project: Research Agenda
  • Funded by NSF/NASA/DARPA Digital Library
    Initiative (Phases I and II)
  • Research agenda
  • Understand user needs.
  • Extend functionality of documents.
  • Enliven legacy documents.
  • Improve access to information.
  • Scale to large systems.
  • Re-Invent Scholarly Information Access and Use

3
Testbed: An Environmental Digital Library
  • Collection: Diverse material relevant to
    California's key habitats.
  • Users: A consortium of state agencies,
    development corporations, private corporations,
    regional government alliances, educational
    institutions, and libraries.
  • Potential: Impact on state-wide environmental
    system (CERES)

4
The Environmental Library - Users/Contributors
  • California Resources Agency, California
    Environmental Resources Evaluation System (CERES)
  • California Department of Water Resources
  • The California Department of Fish and Game
  • SANDAG
  • UC Water Resources Center Archives
  • New Partners: CDL and SDSC

5
The Environmental Library - Contents
  • Environmental technical reports, bulletins, etc.
  • County general plans
  • Aerial and ground photography
  • USGS topographic maps
  • Land use and other special purpose maps
  • Sensor data
  • Derived information
  • Collection data bases for the classification and
    distribution of the California biota (e.g.,
    SMASCH)
  • Supporting 3-D, economic, traffic, etc. models
  • Videos collected by the California Resources
    Agency

6
The Environmental Library - Contents
  • As of mid 1999, the collection represents about
    three quarters of a terabyte of data, including
    over 70,000 digital images, over 300,000 pages of
    environmental documents, and over a million
    records in geographical and botanical databases.

7
Botanical Data
  • The CalFlora Database contains taxonomical and
    distribution information for more than 8000
    native California plants. The Occurrence Database
    includes over 300,000 records of California plant
    sightings from many federal, state, and private
    sources. The botanical databases are linked to
    our CalPhotos collection of California plants, and
    are also linked to external collections of data,
    maps, and photos.

8
Geographical Data
  • Much of the geographical data in our collection
    is being used to develop our web-based GIS
    Viewer. The Street Finder uses 500,000 Tiger
    records of S.F. Bay Area streets along with the
    70,000 records from the USGS GNIS database.
    California Dams is a database of information
    about the 1395 dams under state jurisdiction. An
    additional 11 GB of geographical data represents
    maps and imagery that have been processed for
    inclusion as layers in our GIS Viewer. This
    includes Digital Ortho Quads and DRG maps for the
    S.F. Bay Area.

9
Documents
  • Most of the 300,000 pages of digital documents
    are environmental reports and plans that were
    provided by California state agencies. This
    collection includes documents, maps, articles,
    and reports on the California environment
    including Environmental Impact Reports (EIRs),
    educational pamphlets, water usage bulletins, and
    county plans. Documents in this collection come
    from the California Department of Water Resources
    (DWR), California Department of Fish and Game
    (DFG), San Diego Association of Governments
    (SANDAG), and many other agencies. Among the most
    frequently accessed documents are County General
    Plans for every California county and a survey of
    125 Sacramento Delta fish species.

10
Documents - cont.
  • The collection also includes about 20 MB of
    full-text (HTML) documents from the World
    Conservation Digital Library. In addition to
    providing online access to important
    environmental documents, the document collection
    is the testbed for our Multivalent Document
    research.

11
Photographs
  • The photo collection includes 17,000 images of
    California natural resources from the state
    Department of Water Resources, several hundred
    aerial photos, 17,000 photos of California native
    plants from St. Mary's College, the California
    Academy of Science, and others, a small
    collection of California animals, and 40,000
    Corel stock photos.

12
Testbed Success Stories
  • LUPIN: CERES Land Use Planning Information
    Network
  • California County General Plans and other
    environmental documents.
  • Enter at Resources Agency Server, documents
    stored at and retrieved from UCB DLIB server.
  • California flood relief efforts
  • High demand for some data sets only available on
    our server (created by document recognition).
  • CalFlora: Creation and interoperation of
    repositories pertaining to plant biology.
  • Cloning of services at Cal State Library, FBI

13
Research Highlights
  • Documents
  • Multivalent Document prototype
  • Page images, structured documents, GIS data,
    photographs
  • Intelligent Access to Content
  • Document recognition
  • Vision-based Image Retrieval: stuff, thing, scene
    retrieval
  • Natural Language Processing: categorizing the
    web, Cheshire II, TileBar Interfaces

14
User Interface Paradigms: Multivalent Documents
  • An approach to new document types and their
    authoring.
  • Supports active, distributed, composable
    transformations of multimedia documents.
  • Enables sophisticated annotations, intelligent
    result handling, user-modifiable interface,
    composite documents.

15
Multivalent Documents
17
GIS in the MVD Framework
  • Layers are georeferenced data sets.
  • Behaviors are
  • display semi-transparently
  • pan
  • zoom
  • issue query
  • display context
  • spatial hyperlinks
  • annotations
  • Written in Java (to be merged with MVD-1 code
    line?)

18
GIS Viewer Example: http://elib.cs.berkeley.edu/annotations/gis/buildings.html
19
Overview of Cheshire II
  • The Cheshire II system is intended to provide an
    easy-to-use, standards-compliant system capable
    of retrieving any type of information in a wide
    variety of settings.

20
Overview of Cheshire II
  • It supports SGML and XML.
  • It is a client/server application.
  • Uses the Z39.50 Information Retrieval Protocol.
  • Server supports a Relational Database Gateway.
  • Supports Boolean searching of all servers.
  • Supports probabilistic ranked retrieval in the
    Cheshire search engine.
  • Search engine supports "nearest neighbor"
    searches and relevance feedback.
  • GUI interface on X window displays.
  • WWW/CGI forms interface for DL, using combined
    client/server CGI scripting via WebCheshire.
  • Image Content retrieval using BlobWorld
  • Support for the SDLIP (Simple Digital Library
    Interoperability Protocol) for search and as
    Z39.50 Gateway

21
Cheshire II Searching
22
Current Usage of Cheshire II
  • Web clients for
  • NSF/NASA/ARPA Digital Library
  • Includes support for full-text and page-level
    search.
  • Experimental Blob-World image search
  • SunSite
  • University of Liverpool.
  • University of Essex, HDS (part of AHDS)
  • California Sheet Music Project
  • Cha-Cha (Berkeley Intranet Search Engine)
  • Univ. of Virginia
  • Cheshire ranking algorithm is basis for Inktomi
    (i.e., Yahoo, Hotbot, MSN? and others)

23
Image Retrieval Research
  • Finding Stuff vs Things
  • BlobWorld
  • Other Vision Research

24
Blobworld: use regions for retrieval
  • We want to find general objects → represent
    images based on coherent regions

25
Outline
  • Why regions?
  • Creating Blobworld: segmentation and description
  • Using Blobworld: query experiments
  • Indexing blobs for faster querying
  • Conclusions

26
Creating and using Blobworld
Create
Use
27
Extract features for each pixel
  • Color
  • Take average color (L*a*b*) at the selected
    scale → ignore local color variations due to
    texture
  • "zebra = gray horse + stripes"
  • Texture
  • Find contrast, anisotropy, polarity at the
    selected scale
  • Position

28
Find groups in feature space
  • Model feature distribution as a mixture of
    Gaussians using Expectation-Maximization (EM)

29
Find regions in the image
  • Label each pixel based on its Gaussian cluster
  • Find connected components → regions
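
The two steps on this slide can be sketched in a few lines. This is a minimal illustration only, assuming the per-pixel cluster labels are already available as a small 2-D grid and that regions are 4-connected; the function name `connected_components` and the toy grid are hypothetical and are not taken from the Blobworld code.

```python
from collections import deque

def connected_components(labels):
    """Split a 2-D grid of cluster labels into 4-connected regions.

    `labels` is a list of rows. Returns a grid of region ids (0, 1, 2, ...)
    and the number of regions found.
    """
    rows, cols = len(labels), len(labels[0])
    region = [[-1] * cols for _ in range(rows)]
    next_id = 0
    for r in range(rows):
        for c in range(cols):
            if region[r][c] != -1:
                continue  # pixel already belongs to a region
            # Flood-fill every pixel reachable through same-cluster neighbours.
            region[r][c] = next_id
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and region[ny][nx] == -1
                            and labels[ny][nx] == labels[y][x]):
                        region[ny][nx] = next_id
                        queue.append((ny, nx))
            next_id += 1
    return region, next_id

# Toy example: two Gaussian clusters (0 and 1) that form four separate regions.
cluster_map = [
    [0, 0, 1],
    [1, 0, 1],
    [1, 1, 0],
]
regions, count = connected_components(cluster_map)
```

Pixels that share a cluster label but are not spatially adjacent end up in different regions, which is why one Gaussian cluster can yield several blobs.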

30
Describe regions by color, texture, shape
  • Color
  • Color histogram within region
  • Quadratic distance: encode similarity between
    color bins
  • $d^2_{\text{hist}}(x, y) = (x - y)^\top A (x - y)$
  • Texture
  • Mean contrast and anisotropy
  • → stripes vs. spots vs. smooth
  • (Basic) Shape
  • Fourier descriptors of contour

31
Select appropriate scale for processing
  • Polarity: do all the gradient vectors point in
    the same direction?
  • Choose scale where polarity stabilizes →
    include one approximate period

32
Initialize means using image data
  • Before, we picked random initialization
  • Now, choose initial means based on image tiles
  • Add noise to means and restart EM (4 runs per K)

[Figure panels: K = 2, K = 3, K = 4, K = 5]
33
Grouping: Expectation-Maximization
  • Given class characteristics (μ, Σ), find class
    membership
  • Given class membership, find class
    characteristics (μ, Σ)
  • Iterate

[Diagram: update μ, Σ → update labels → update μ, Σ → update labels]
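
The alternation described on this slide can be illustrated with a small sketch. It is deliberately simplified: it assumes one-dimensional data (so each class has a scalar mean and variance rather than a full μ and Σ over color, texture, and position features), and the function names and sample values are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, n_iter=50, min_var=1e-6):
    """Fit a 1-D mixture of Gaussians with EM, starting from `means`."""
    k = len(means)
    means = list(means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: given class characteristics, find (soft) class membership.
        resp = []
        for x in data:
            p = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p) or 1e-300  # guard against underflow to zero
            resp.append([pj / total for pj in p])
        # M-step: given class membership, re-estimate class characteristics.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0:
                continue  # empty class: leave its parameters unchanged
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, min_var)
    # Hard labels: the most probable class for each sample.
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return weights, means, variances, labels

samples = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
weights, means, variances, labels = em_gmm_1d(samples, means=[0.0, 6.0])
```

On this toy input the two estimated means settle near 1.0 and 5.0, and the first three samples receive one label while the last three receive the other.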
34
How many Gaussians?
  • Model selection: Minimum Description Length
  • Prefer fewer Gaussians if performance is
    comparable
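
As a rough illustration of this trade-off, the sketch below scores each candidate K with one common form of the MDL criterion for Gaussian mixtures: the log-likelihood minus half the number of free parameters times log n. The slide does not give the exact formula, so this form, the function names, the feature dimension of 8, and the log-likelihood values are all assumptions made for the example.

```python
import math

def num_parameters(k, dim):
    """Free parameters of a k-component full-covariance Gaussian mixture:
    (k - 1) mixing weights, k mean vectors, and k symmetric covariances."""
    return (k - 1) + k * dim + k * dim * (dim + 1) // 2

def mdl_score(log_likelihood, k, dim, n_samples):
    """Penalised log-likelihood; a higher score is better."""
    return log_likelihood - 0.5 * num_parameters(k, dim) * math.log(n_samples)

def select_k(log_likelihoods, dim, n_samples):
    """Pick K from a {K: log-likelihood} mapping by the best MDL score."""
    return max(log_likelihoods,
               key=lambda k: mdl_score(log_likelihoods[k], k, dim, n_samples))

# Made-up log-likelihoods: the fit improves sharply from K=2 to K=3,
# then only slightly for K=4 and K=5.
fits = {2: -5200.0, 3: -5000.0, 4: -4990.0, 5: -4985.0}
best_k = select_k(fits, dim=8, n_samples=1000)
```

Here the small likelihood gains at K = 4 and K = 5 do not pay for the extra parameters, so the criterion prefers K = 3.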

35
Find groups in feature space
  • Model feature distribution as a mixture of
    Gaussians using Expectation-Maximization (EM)

36
EM math
  • Probability density
  • Update equations
  • where
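
The equations for this slide are not present in the transcript. For reference, the standard textbook form of the Gaussian mixture density and its EM updates is given here; the notation (mixing weights $\alpha_i$, feature dimension $d$, $N$ samples $x_j$) is an assumption and may differ from what the slide showed.

```latex
% Probability density: mixture of K Gaussians
f(x \mid \Theta) = \sum_{i=1}^{K} \alpha_i \, f_i(x \mid \theta_i),
\qquad
f_i(x \mid \theta_i) = \frac{1}{(2\pi)^{d/2} \, |\Sigma_i|^{1/2}}
  \exp\!\left( -\tfrac{1}{2} (x - \mu_i)^\top \Sigma_i^{-1} (x - \mu_i) \right)

% Update equations
\alpha_i^{\text{new}} = \frac{1}{N} \sum_{j=1}^{N} p(i \mid x_j, \Theta^{\text{old}})
\qquad
\mu_i^{\text{new}} = \frac{\sum_{j=1}^{N} x_j \, p(i \mid x_j, \Theta^{\text{old}})}
                          {\sum_{j=1}^{N} p(i \mid x_j, \Theta^{\text{old}})}

\Sigma_i^{\text{new}} =
  \frac{\sum_{j=1}^{N} p(i \mid x_j, \Theta^{\text{old}})
        (x_j - \mu_i^{\text{new}})(x_j - \mu_i^{\text{new}})^\top}
       {\sum_{j=1}^{N} p(i \mid x_j, \Theta^{\text{old}})}

% where
p(i \mid x_j, \Theta) =
  \frac{\alpha_i \, f_i(x_j \mid \theta_i)}
       {\sum_{k=1}^{K} \alpha_k \, f_k(x_j \mid \theta_k)}
```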

37
Encode similarity between color bins
  • Quadratic distance
  • Distance between histograms x and y:
  • $d^2_{\text{hist}}(x, y) = (x - y)^\top A (x - y)$
  • $A_{ij}$ is based on the similarity between bins i and
    j
  • Neighboring bins have $A_{ij} = 0.5$
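
A small sketch shows why the similarity matrix matters. It assumes a toy histogram of three bins laid out in a line, with 1 on the diagonal of A, 0.5 for adjacent bins, and 0 elsewhere; the real color bins live in a 3-D color space, so this layout and the function name are illustrative only.

```python
def quadratic_distance(x, y, A):
    """Return (x - y)' A (x - y) for histograms x, y and similarity matrix A."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    n = len(diff)
    return sum(diff[i] * A[i][j] * diff[j] for i in range(n) for j in range(n))

# Toy similarity matrix: each bin is fully similar to itself,
# half similar to its neighbour, and dissimilar to everything else.
A = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.5],
    [0.0, 0.5, 1.0],
]

all_in_bin0 = [1.0, 0.0, 0.0]
all_in_bin1 = [0.0, 1.0, 0.0]  # neighbouring bin
all_in_bin2 = [0.0, 0.0, 1.0]  # non-neighbouring bin

near = quadratic_distance(all_in_bin0, all_in_bin1, A)  # 1.0
far = quadratic_distance(all_in_bin0, all_in_bin2, A)   # 2.0
```

With an identity matrix in place of A both pairs would be equally far apart (2.0); the off-diagonal weights are what make a shift into a neighboring bin count as a smaller change.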

38
Fourier descriptors for shape
  • Zahn & Roskies '72, Kuhl & Giardina '82
  • Find (x,y) representation of outer contour
  • Find Fourier series of (x,y)
  • Coefficients specify an ellipse (4 parameters)
  • major axis, minor axis, orientation, starting
    point
  • Remove starting point ambiguity
  • Store first ten Fourier coefficients
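
The general idea can be sketched with a simpler variant than the elliptic descriptors cited on this slide: treat each contour point as a complex number x + iy, take a discrete Fourier transform, and keep only the coefficient magnitudes. This is a swapped-in technique for illustration, not the method the slide describes; note that taking magnitudes discards rotation as well as the starting point. The function name and the toy contour are hypothetical.

```python
import cmath
import math

def fourier_descriptors(contour, n_coeffs=10):
    """Magnitudes of the first `n_coeffs` non-DC Fourier coefficients.

    `contour` is a list of (x, y) points in order around a closed outline.
    """
    n = len(contour)
    z = [complex(x, y) for x, y in contour]
    descriptors = []
    for k in range(1, n_coeffs + 1):
        c = sum(z[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)) / n
        # Starting the trace at a different point only changes the phase of c,
        # so the magnitude is unaffected by the choice of starting point.
        descriptors.append(abs(c))
    return descriptors

# Outline of a 2x2 square, traced from two different starting points.
square = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
shifted = square[3:] + square[:3]

a = fourier_descriptors(square, n_coeffs=3)
b = fourier_descriptors(shifted, n_coeffs=3)
```

The two descriptor lists agree up to floating-point error, which is the property the "remove starting point ambiguity" step is after.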

39
Creating and using Blobworld
Create
Use
40
Querying: let user see the representation
  • Current systems are unsatisfying
  • User can't see what the computer sees
  • Unclear how parameters relate to the image
  • User should interact with the representation
  • Helps in query formulation
  • Makes results understandable
  • Minimizes disappointment
  • http://elib.cs.berkeley.edu/photos/blobworld

47
Query experiments
  • Collection of 10,000 Corel stock photos
  • Five query images in each of ten categories
    (e.g., cheetahs, polar bears, airplanes)
  • Compare Blobworld to global histogram queries
  • Precision (% of retrieved images that are
    correct) vs. Recall (% of correct images that are
    retrieved)
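
The two measures can be computed as follows. This is a minimal sketch; the image identifiers are made up, and the function returns fractions between 0 and 1 rather than percentages.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = fraction of retrieved images that are correct
    recall    = fraction of correct images that are retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: four images returned, five images truly in the category.
returned = ["img1", "img2", "img3", "img4"]
correct = ["img2", "img4", "img7", "img8", "img9"]
precision, recall = precision_recall(returned, correct)  # (0.5, 0.4)
```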

48
Distinctive objects
  • Tigers, cheetahs, and zebras
  • Blobworld does better than global histograms

49
Distinctive objects and backgrounds
  • Eagles and black bears
  • Blobworld does better than global histograms

50
Distinctive scenes
  • Airplanes and brown bears
  • Global histograms do better than Blobworld
  • But Blobworld has room to grow (shape, etc.)

51
Index to search huge collections
  • Indexing is trickier than for traditional data
  • We can afford some mistakes: even with full
    search, we'll miss some tigers and include some
    pumpkins
  • Two approaches we have tried
  • Store terms and treat image as a document
  • Store features and index using a tree
  • Final (correct) ranking of images from index

52
Index using conventional IR methods
  • Treat each database blob as a document
  • Store terms (bins) for color, texture,
    location, and shape
  • Repeat color terms based on histogram weights
  • Index using Cheshire II
  • Treat each query blob as a document
  • Repeat terms according to query weights
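
One way to picture the "blob as document" idea is the sketch below, which turns a blob description into a bag of index terms and repeats each color term in proportion to its histogram weight. The term naming scheme, the `repeats` scale factor, and the function name are assumptions for illustration; the actual term format used with Cheshire II is not shown on the slide.

```python
def blob_to_terms(color_hist, texture_bin, location_bin, shape_bin, repeats=10):
    """Turn one blob's description into a list of index terms.

    Each color bin contributes its term round(weight * repeats) times, so
    heavier bins carry more weight in a term-frequency based index.
    """
    terms = []
    for bin_id, weight in enumerate(color_hist):
        terms.extend([f"color{bin_id}"] * round(weight * repeats))
    terms.append(f"texture{texture_bin}")
    terms.append(f"location{location_bin}")
    terms.append(f"shape{shape_bin}")
    return terms

# Hypothetical blob: 60% of its pixels fall in color bin 0, 40% in bin 2.
terms = blob_to_terms([0.6, 0.0, 0.4], texture_bin=3, location_bin=1, shape_bin=2)
```

A query blob would be passed through the same function, so that database blobs and query blobs share one vocabulary of terms.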

53
Indexing and Retrieval with Cheshire II
  • Originally used the same probabilistic algorithm
    used for text
  • Blobs are not distributed like text words or
    stems
  • Now using a weighting based on coordination-level
    match with a minimum threshold (must have at
    least half of the characteristics of the query
    cluster).
  • Still eyeballing data, but seems much better for
    many types of queries
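
A coordination-level match with a minimum threshold can be sketched as below: each blob is scored by how many distinct query terms it shares, and blobs matching fewer than half of them are dropped. This is an illustrative sketch under those assumptions, not the Cheshire II implementation; the index layout, term names, and blob identifiers are hypothetical.

```python
def coordination_level_search(query_terms, index, min_fraction=0.5):
    """Rank blobs by the number of distinct query terms they contain.

    `index` maps a blob id to its list of terms. Blobs sharing fewer than
    `min_fraction` of the query's distinct terms are excluded.
    """
    query = set(query_terms)
    needed = min_fraction * len(query)
    scored = []
    for blob_id, terms in index.items():
        matches = len(query & set(terms))
        if matches >= needed:
            scored.append((matches, blob_id))
    # Most matches first; ties broken by blob id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [blob_id for _, blob_id in scored]

index = {
    "tiger1": ["color0", "color2", "texture3", "shape2"],
    "tiger2": ["color0", "texture3", "shape5"],
    "pumpkin": ["color0", "texture7", "shape9"],
}
query = ["color0", "color2", "texture3", "shape2"]
ranking = coordination_level_search(query, index)  # ["tiger1", "tiger2"]
```

The pumpkin blob shares only one of the four query terms, so it falls below the threshold and is left out of the ranking.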

58
Conclusions
  • Image retrieval in general collections requires
    region segmentation and description
  • Blobworld yields high precision in queries for
    distinctive objects
  • Blobworld can be indexed to allow fast querying

59
Further Information
  • Full Cheshire II client and server source is
    available: ftp://sherlock.berkeley.edu/pub/cheshire/
  • Includes HTML and Troff documentation
  • http://cheshire.lib.berkeley.edu/
  • UC Berkeley Digital Library Project
  • http://elib.cs.berkeley.edu