Title: Information Access for a Digital Library: Cheshire II and the Berkeley Environmental Digital Library Ray R. Larson School of Information Management
1. Information Access for a Digital Library: Cheshire II and the Berkeley Environmental Digital Library
Ray R. Larson, School of Information Management and Systems, University of California, Berkeley, ray_at_sherlock.berkeley.edu
Chad Carson, Computer Science Division, EECS, University of California, Berkeley, carson_at_eecs.berkeley.edu
2. UCB Digital Library Project: Research Agenda
- Funded by the NSF/NASA/DARPA Digital Library Initiative (Phases I and II)
- Research agenda
  - Understand user needs.
  - Extend functionality of documents.
  - Enliven legacy documents.
  - Improve access to information.
  - Scale to large systems.
- Re-invent scholarly information access and use
3. Testbed: An Environmental Digital Library
- Collection: Diverse material relevant to California's key habitats.
- Users: A consortium of state agencies, development corporations, private corporations, regional government alliances, educational institutions, and libraries.
- Potential: Impact on state-wide environmental system (CERES).
4. The Environmental Library: Users/Contributors
- California Resources Agency, California Environmental Resources Evaluation System (CERES)
- California Department of Water Resources
- The California Department of Fish and Game
- SANDAG
- UC Water Resources Center Archives
- New partners: CDL and SDSC
5. The Environmental Library: Contents
- Environmental technical reports, bulletins, etc.
- County general plans
- Aerial and ground photography
- USGS topographic maps
- Land use and other special purpose maps
- Sensor data
- Derived information
- Collection databases for the classification and distribution of the California biota (e.g., SMASCH)
- Supporting 3-D, economic, traffic, etc. models
- Videos collected by the California Resources Agency
6. The Environmental Library: Contents
- As of mid-1999, the collection represents about three quarters of a terabyte of data, including over 70,000 digital images, over 300,000 pages of environmental documents, and over a million records in geographical and botanical databases.
7. Botanical Data
- The CalFlora Database contains taxonomical and distribution information for more than 8,000 native California plants. The Occurrence Database includes over 300,000 records of California plant sightings from many federal, state, and private sources. The botanical databases are linked to our CalPhotos collection of California plants, and are also linked to external collections of data, maps, and photos.
8. Geographical Data
- Much of the geographical data in our collection is being used to develop our web-based GIS Viewer. The Street Finder uses 500,000 TIGER records of S.F. Bay Area streets along with 70,000 records from the USGS GNIS database. California Dams is a database of information about the 1,395 dams under state jurisdiction. An additional 11 GB of geographical data represents maps and imagery that have been processed for inclusion as layers in our GIS Viewer. This includes Digital Ortho Quads and DRG maps for the S.F. Bay Area.
9. Documents
- Most of the 300,000 pages of digital documents
are environmental reports and plans that were
provided by California state agencies. This
collection includes documents, maps, articles,
and reports on the California environment
including Environmental Impact Reports (EIRs),
educational pamphlets, water usage bulletins, and
county plans. Documents in this collection come
from the California Department of Water Resources
(DWR), California Department of Fish and Game
(DFG), San Diego Association of Governments
(SANDAG), and many other agencies. Among the most
frequently accessed documents are County General
Plans for every California county and a survey of
125 Sacramento Delta fish species.
10. Documents (cont.)
- The collection also includes about 20 MB of full-text (HTML) documents from the World Conservation Digital Library. In addition to providing online access to important environmental documents, the document collection is the testbed for our Multivalent Document research.
11. Photographs
- The photo collection includes 17,000 images of California natural resources from the state Department of Water Resources; several hundred aerial photos; 17,000 photos of California native plants from St. Mary's College, the California Academy of Sciences, and others; a small collection of California animals; and 40,000 Corel stock photos.
12. Testbed Success Stories
- LUPIN: CERES Land Use Planning Information Network
  - California County General Plans and other environmental documents.
  - Enter at Resources Agency server; documents stored at and retrieved from UCB DLIB server.
- California flood relief efforts
  - High demand for some data sets only available on our server (created by document recognition).
- CalFlora: Creation and interoperation of repositories pertaining to plant biology.
- Cloning of services at Cal State Library, FBI
13. Research Highlights
- Documents
  - Multivalent Document prototype
  - Page images, structured documents, GIS data, photographs
- Intelligent Access to Content
  - Document recognition
  - Vision-based image retrieval: stuff, thing, scene retrieval
  - Natural language processing: categorizing the web, Cheshire II, TileBar interfaces
14. User Interface Paradigms: Multivalent Documents
- An approach to new document types and their authoring.
- Supports active, distributed, composable transformations of multimedia documents.
- Enables sophisticated annotations, intelligent result handling, user-modifiable interface, composite documents.
15. Multivalent Documents
17. GIS in the MVD Framework
- Layers are georeferenced data sets.
- Behaviors are
  - display semi-transparently
  - pan
  - zoom
  - issue query
  - display context
  - spatial hyperlinks
  - annotations
- Written in Java (to be merged with MVD-1 code line?)
18. GIS Viewer Example: http://elib.cs.berkeley.edu/annotations/gis/buildings.html
19. Overview of Cheshire II
- The Cheshire II system is intended to provide an
easy-to-use, standards-compliant system capable
of retrieving any type of information in a wide
variety of settings.
20. Overview of Cheshire II
- It supports SGML and XML.
- It is a client/server application.
- Uses the Z39.50 Information Retrieval Protocol.
- Server supports a Relational Database Gateway.
- Supports Boolean searching of all servers.
- Supports probabilistic ranked retrieval in the Cheshire search engine.
- Search engine supports "nearest neighbor" searches and relevance feedback.
- GUI interface on X window displays.
- WWW/CGI forms interface for DL, using combined client/server CGI scripting via WebCheshire.
- Image content retrieval using Blobworld.
- Support for the SDLIP (Simple Digital Library Interoperability Protocol) for search and as Z39.50 gateway.
21. Cheshire II Searching
22. Current Usage of Cheshire II
- Web clients for
  - NSF/NASA/ARPA Digital Library
    - Includes support for full-text and page-level search.
    - Experimental Blobworld image search
  - SunSite
  - University of Liverpool
  - University of Essex, HDS (part of AHDS)
  - California Sheet Music Project
  - Cha-Cha (Berkeley Intranet Search Engine)
  - Univ. of Virginia
- Cheshire ranking algorithm is basis for Inktomi (i.e., Yahoo, Hotbot, MSN? and others)
23. Image Retrieval Research
- Finding Stuff vs. Things
- Blobworld
- Other Vision Research
24. Blobworld: use regions for retrieval
- We want to find general objects, so represent images based on coherent regions
25. Outline
- Why regions?
- Creating Blobworld: segmentation and description
- Using Blobworld: query experiments
- Indexing blobs for faster querying
- Conclusions
26. Creating and using Blobworld
[Diagram: two stages, Create and Use]
27. Extract features for each pixel
- Color
  - Take average color (L*a*b*) at the selected scale, which ignores local color variations due to texture
  - zebra = gray horse + stripes
- Texture
  - Find contrast, anisotropy, polarity at the selected scale
- Position
28. Find groups in feature space
- Model feature distribution as a mixture of
Gaussians using Expectation-Maximization (EM)
29. Find regions in the image
- Label each pixel based on its Gaussian cluster
- Find connected components, which become the regions
[Figure: segmented image with regions labeled 1 to 4]
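
The two steps on this slide can be sketched as a flood fill over the per-pixel cluster labels. This is a minimal illustration, not the Blobworld implementation: the 4-connectivity choice, the function name `connected_components`, and the toy `cluster_map` are all assumptions made for the example.

```python
from collections import deque


def connected_components(labels):
    """Split a 2-D grid of cluster labels into 4-connected regions.

    `labels` is a list of equal-length rows; each entry is the cluster index
    assigned to that pixel. Returns a grid of region ids (0, 1, 2, ...) and
    the number of regions found.
    """
    rows, cols = len(labels), len(labels[0])
    region = [[-1] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if region[r][c] != -1:
                continue  # pixel already belongs to a region
            # Breadth-first flood fill over neighbors with the same cluster label.
            region[r][c] = count
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and region[ny][nx] == -1
                            and labels[ny][nx] == labels[y][x]):
                        region[ny][nx] = count
                        queue.append((ny, nx))
            count += 1
    return region, count


# Toy cluster map (hypothetical): cluster 0 appears in two separate places,
# so it yields two distinct regions.
cluster_map = [
    [0, 0, 1, 1],
    [0, 1, 1, 0],
    [2, 2, 1, 0],
]
regions, n_regions = connected_components(cluster_map)
```

Note that pixels sharing a Gaussian cluster but lying in different parts of the image end up in different regions, which is why the cluster label alone is not the final segmentation.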
30. Describe regions by color, texture, shape
- Color
  - Color histogram within region
  - Quadratic distance: encode similarity between color bins
  - $d^2_{\mathrm{hist}}(x, y) = (x - y)^{T} A (x - y)$
- Texture
  - Mean contrast and anisotropy
  - Distinguishes stripes vs. spots vs. smooth
- (Basic) Shape
  - Fourier descriptors of contour
31. Select appropriate scale for processing
- Polarity: do all the gradient vectors point in the same direction?
- Choose the scale where polarity stabilizes, so that the window includes one approximate period
32. Initialize means using image data
- Before, we picked random initialization
- Now, choose initial means based on image tiles
- Add noise to means and restart EM (4 runs per K)
[Figure: segmentation results for K = 2, 3, 4, and 5]
33. Grouping: Expectation-Maximization
- Given class characteristics (μ, Σ), find class membership
- Given class membership, find class characteristics (μ, Σ)
- Iterate
[Diagram: loop alternating "update μ, Σ" and "update labels"]
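
The alternation described on this slide can be sketched for the simplest case, a one-dimensional mixture of Gaussians. This is an illustrative sketch only: the real features are multi-dimensional (color, texture, position), and the function names, the fixed iteration count, the variance floor `min_var`, and the toy data are assumptions introduced here.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, means, n_iter=50, min_var=1e-6):
    """Fit a 1-D mixture of Gaussians by EM, starting from the given means."""
    k = len(means)
    means = list(means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: given class characteristics, find soft class membership.
        resp = []
        for x in data:
            p = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(p) or 1e-300  # guard against division by zero
            resp.append([pj / total for pj in p])
        # M-step: given class membership, re-estimate class characteristics.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # empty component: leave its parameters unchanged
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(min_var, var)
    return weights, means, variances


# Toy data (hypothetical): two well-separated groups around 1.0 and 5.0.
data = [0.9, 1.0, 1.1, 1.2, 0.8, 5.0, 5.1, 4.9, 5.2, 4.8]
weights, means, variances = em_gmm_1d(data, means=[0.0, 6.0])
```

The initial means matter: EM only finds a local optimum, which is the motivation for seeding from image data and restarting several times per K.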
34. How many Gaussians?
- Model selection: Minimum Description Length
- Prefer fewer Gaussians if performance is comparable
[Figure: three alternative groupings compared side by side]
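
One common way to express the MDL trade-off is to penalize the fitted log-likelihood by half the number of free parameters times the log of the sample size. The slide does not give the exact criterion used, so the formula below is an assumption, and the log-likelihood values in `fits` are made-up numbers chosen only to show the mechanics.

```python
import math


def mdl_score(log_likelihood, n_params, n_points):
    """Two-part description length (assumed form): lower is better."""
    return -log_likelihood + 0.5 * n_params * math.log(n_points)


def choose_k(log_likelihoods, n_points, dim):
    """Pick the number of Gaussians K with the smallest MDL score.

    `log_likelihoods` maps each candidate K to the log-likelihood of the
    fitted K-component mixture.
    """
    def n_params(k):
        # (K - 1) free mixing weights, K mean vectors of length dim,
        # and K symmetric dim x dim covariance matrices.
        return (k - 1) + k * dim + k * dim * (dim + 1) // 2

    return min(
        log_likelihoods,
        key=lambda k: mdl_score(log_likelihoods[k], n_params(k), n_points),
    )


# Made-up log-likelihoods: the fit improves a lot from K=2 to K=3,
# then only marginally, so the penalty favors K=3.
fits = {2: -5200.0, 3: -5000.0, 4: -4990.0, 5: -4985.0}
best_k = choose_k(fits, n_points=1000, dim=2)
```

With these inputs the small likelihood gains at K=4 and K=5 do not pay for the extra parameters, which is the "prefer fewer Gaussians if performance is comparable" rule in numeric form.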
35. Find groups in feature space
- Model feature distribution as a mixture of
Gaussians using Expectation-Maximization (EM)
36. EM math
- Probability density
- Update equations
- where
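
The equations themselves did not survive in this transcript. For orientation, the standard textbook form of EM for a mixture of K Gaussians over N feature vectors of dimension d is given below; the notation (α for mixing weights, Θ for the full parameter set) is an assumption and may differ from what the slide showed.

```latex
% Probability density: mixture of K Gaussians
f(x \mid \Theta) = \sum_{i=1}^{K} \alpha_i \, f_i(x \mid \mu_i, \Sigma_i),
\qquad
f_i(x \mid \mu_i, \Sigma_i) =
  \frac{1}{(2\pi)^{d/2} \, |\Sigma_i|^{1/2}}
  \exp\!\left( -\tfrac{1}{2} (x - \mu_i)^{T} \Sigma_i^{-1} (x - \mu_i) \right)

% Update equations
\alpha_i^{\mathrm{new}} = \frac{1}{N} \sum_{j=1}^{N} p(i \mid x_j, \Theta^{\mathrm{old}})

\mu_i^{\mathrm{new}} =
  \frac{\sum_{j=1}^{N} x_j \, p(i \mid x_j, \Theta^{\mathrm{old}})}
       {\sum_{j=1}^{N} p(i \mid x_j, \Theta^{\mathrm{old}})}

\Sigma_i^{\mathrm{new}} =
  \frac{\sum_{j=1}^{N} p(i \mid x_j, \Theta^{\mathrm{old}})
        \, (x_j - \mu_i^{\mathrm{new}}) (x_j - \mu_i^{\mathrm{new}})^{T}}
       {\sum_{j=1}^{N} p(i \mid x_j, \Theta^{\mathrm{old}})}

% where the posterior class membership is
p(i \mid x_j, \Theta) =
  \frac{\alpha_i \, f_i(x_j \mid \mu_i, \Sigma_i)}
       {\sum_{k=1}^{K} \alpha_k \, f_k(x_j \mid \mu_k, \Sigma_k)}
```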
37. Encode similarity between color bins
- Quadratic distance
- Distance between histograms x and y:
  $d^2_{\mathrm{hist}}(x, y) = (x - y)^{T} A (x - y)$
- $A_{ij}$ is based on the similarity between bins i and j
- Neighboring bins have $A_{ij} = 0.5$
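
A small numeric sketch shows why the quadratic form is useful: mass that moves to a neighboring bin costs less than mass that moves to a distant bin. The slide only states the value for neighboring bins; the diagonal of 1, the value 0 for non-neighbors, and the one-dimensional bin layout used here are assumptions for illustration (real color bins are neighbors in a 3-D color space).

```python
def similarity_matrix(n_bins):
    """Assumed A: 1 on the diagonal, 0.5 for adjacent bins, 0 elsewhere."""
    return [
        [1.0 if i == j else 0.5 if abs(i - j) == 1 else 0.0 for j in range(n_bins)]
        for i in range(n_bins)
    ]


def quadratic_distance(x, y, a):
    """Compute (x - y)' A (x - y) for two histograms of equal length."""
    d = [xi - yi for xi, yi in zip(x, y)]
    n = len(d)
    return sum(d[i] * a[i][j] * d[j] for i in range(n) for j in range(n))


a = similarity_matrix(4)
base = [1.0, 0.0, 0.0, 0.0]       # all mass in bin 0
neighbor = [0.0, 1.0, 0.0, 0.0]   # all mass in the adjacent bin
distant = [0.0, 0.0, 0.0, 1.0]    # all mass in a far-away bin

near = quadratic_distance(base, neighbor, a)
far = quadratic_distance(base, distant, a)
```

Under a plain Euclidean distance both comparisons would score the same; the off-diagonal entries of A are what make similar colors count as partially matching.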
38. Fourier descriptors for shape
- Zahn & Roskies '72, Kuhl & Giardina '82
- Find (x, y) representation of outer contour
- Find Fourier series of (x, y)
- Coefficients specify an ellipse (4 parameters)
  - major axis, minor axis, orientation, starting point
- Remove starting point ambiguity
- Store first ten Fourier coefficients
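
The idea of a starting-point-independent contour descriptor can be sketched with a simpler variant than the elliptic descriptors cited on the slide: treat each contour point as a complex number x + iy, take its discrete Fourier transform, and keep only coefficient magnitudes. This swapped-in technique is an assumption chosen for brevity, and the function name and the square test contour are hypothetical.

```python
import cmath


def fourier_descriptors(contour, n_coeffs=10):
    """Magnitudes of DFT coefficients 1..n_coeffs of a closed contour.

    `contour` is a list of (x, y) points in order around the boundary.
    Skipping coefficient 0 discards translation; taking magnitudes discards
    the phase shift caused by choosing a different starting point.
    """
    n = len(contour)
    z = [complex(x, y) for x, y in contour]
    mags = []
    for k in range(1, n_coeffs + 1):
        coeff = sum(
            z[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)
        ) / n
        mags.append(abs(coeff))
    return mags


# A 16-point square boundary, traversed once.
square = [
    (0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (4, 1), (4, 2), (4, 3),
    (4, 4), (3, 4), (2, 4), (1, 4), (0, 4), (0, 3), (0, 2), (0, 1),
]
# Same boundary, but starting five points later.
shifted = square[5:] + square[:5]

desc_a = fourier_descriptors(square)
desc_b = fourier_descriptors(shifted)
```

The two descriptor lists agree up to floating-point error, which is the property the "remove starting point ambiguity" step is after.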
39. Creating and using Blobworld
[Diagram: two stages, Create and Use]
40. Querying: let user see the representation
- Current systems are unsatisfying
  - User can't see what the computer sees
  - Unclear how parameters relate to the image
- User should interact with the representation
  - Helps in query formulation
  - Makes results understandable
  - Minimizes disappointment
- http://elib.cs.berkeley.edu/photos/blobworld
47. Query experiments
- Collection of 10,000 Corel stock photos
- Five query images in each of ten categories (e.g., cheetahs, polar bears, airplanes)
- Compare Blobworld to global histogram queries
- Precision (percentage of retrieved images that are correct) vs. Recall (percentage of correct images that are retrieved)
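
The two measures can be computed directly from a result list and the set of correct images. The sketch below returns fractions rather than percentages, and the image identifiers are hypothetical examples, not data from the experiments.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) as fractions between 0 and 1.

    precision = correct retrieved / all retrieved
    recall    = correct retrieved / all correct
    """
    retrieved = list(retrieved)
    relevant = set(relevant)
    hits = sum(1 for item in retrieved if item in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query for tigers: four images returned, five tigers exist.
retrieved = ["tiger_01", "tiger_07", "pumpkin_03", "tiger_12"]
relevant = {"tiger_01", "tiger_07", "tiger_12", "tiger_20", "tiger_31"}
precision, recall = precision_recall(retrieved, relevant)
```

Here three of the four returned images are tigers (precision 3/4) and three of the five tigers were found (recall 3/5).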
48. Distinctive objects
- Tigers, cheetahs, and zebras
- Blobworld does better than global histograms
49. Distinctive objects and backgrounds
- Eagles and black bears
- Blobworld does better than global histograms
50. Distinctive scenes
- Airplanes and brown bears
- Global histograms do better than Blobworld
- But Blobworld has room to grow (shape, etc.)
51. Index to search huge collections
- Indexing is trickier than for traditional data
- We can afford some mistakes: even with full search, we'll miss some tigers and include some pumpkins
- Two approaches we have tried
  - Store terms and treat image as a document
  - Store features and index using a tree
- Final (correct) ranking of images from index
52. Index using conventional IR methods
- Treat each database blob as a document
  - Store terms (bins) for color, texture, location, and shape
  - Repeat color terms based on histogram weights
  - Index using Cheshire II
- Treat each query blob as a document
  - Repeat terms according to query weights
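
The "blob as document" idea can be sketched by turning a blob's quantized features into a string of pseudo-words, repeating each color term in proportion to its histogram weight so that a text engine's term frequency reflects the weight. The term naming scheme (`color3`, `texture2`, and so on), the `max_repeats` scaling, and the function name are hypothetical; the slides do not specify the actual encoding passed to Cheshire II.

```python
def blob_to_document(color_hist, texture_bin, location_bin, shape_bin, max_repeats=10):
    """Encode one blob as a whitespace-separated string of index terms.

    `color_hist` holds normalized histogram weights (summing to about 1).
    Each color bin's term is repeated round(weight * max_repeats) times.
    """
    terms = []
    for bin_index, weight in enumerate(color_hist):
        repeats = round(weight * max_repeats)
        terms.extend([f"color{bin_index}"] * repeats)
    # Texture, location, and shape each contribute a single quantized term.
    terms.append(f"texture{texture_bin}")
    terms.append(f"location{location_bin}")
    terms.append(f"shape{shape_bin}")
    return " ".join(terms)


# Hypothetical blob: mostly color bin 0, some bin 1, a little bin 2.
doc = blob_to_document([0.6, 0.3, 0.1, 0.0], texture_bin=2, location_bin=5, shape_bin=1)
tokens = doc.split()
```

A query blob would be encoded the same way, with repetition driven by the user's query weights instead of histogram weights.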
53. Indexing and Retrieval with Cheshire II
- Originally used the same probabilistic algorithm used for text
  - Blobs are not distributed like text words or stems
- Now using a weighting based on coordination level match with a minimum threshold (must have at least half of the characteristics of the query cluster).
- Still eyeballing data, but seems much better for many types of queries
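
Coordination level matching scores a candidate by how many distinct query terms it shares with the query. A minimal sketch with the "at least half" threshold follows; the function names, the tie-breaking by blob id, and the example terms are assumptions, and this is not the Cheshire II code.

```python
def coordination_match(query_terms, blob_terms, min_fraction=0.5):
    """Count distinct query terms present in the blob.

    Returns None when the blob shares fewer than min_fraction of the
    query's distinct terms, so it is excluded from the ranking.
    """
    query = set(query_terms)
    overlap = len(query & set(blob_terms))
    if not query or overlap < min_fraction * len(query):
        return None
    return overlap


def rank_blobs(query_terms, blobs):
    """Rank blob ids by descending coordination level (ties by id)."""
    scored = []
    for blob_id, terms in blobs.items():
        score = coordination_match(query_terms, terms)
        if score is not None:
            scored.append((score, blob_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [blob_id for _, blob_id in scored]


# Hypothetical query and database blobs.
query = ["color3", "color4", "texture2", "location5"]
blobs = {
    "a": ["color3", "color4", "texture2", "location1"],  # shares 3 of 4
    "b": ["color3", "texture7", "location0"],            # shares 1 of 4
    "c": ["color3", "color4", "texture2", "location5"],  # shares 4 of 4
}
ranking = rank_blobs(query, blobs)
```

Blob "b" falls below the half-of-the-query threshold and is dropped, while "c" outranks "a" because it matches more of the query's characteristics.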
58. Conclusions
- Image retrieval in general collections requires region segmentation and description
- Blobworld yields high precision in queries for distinctive objects
- Blobworld can be indexed to allow fast querying
59. Further Information
- Full Cheshire II client and server source is available: ftp://sherlock.berkeley.edu/pub/cheshire/
  - Includes HTML and Troff documentation
- http://cheshire.lib.berkeley.edu/
- UC Berkeley Digital Library Project
  - http://elib.cs.berkeley.edu