Extending SDARTS: Extracting Metadata from Web Databases and Interfacing with Open Archives Initiative

1
Extending SDARTS: Extracting Metadata from Web
Databases and Interfacing with Open Archives
Initiative
  • Panagiotis G. Ipeirotis
  • Tom Barry
  • Luis Gravano

Computer Science Dept., Columbia University
2
Metasearching? Why? Surface Web vs. Hidden Web
  • Surface Web
    • Link structure
    • Crawlable
  • Hidden Web
    • Documents hidden in databases
    • No link structure
    • Search engines do not index them
    • Need to query each collection individually

3
Metasearching Challenges
  • Select good databases for a given query
  • Evaluate the query at these databases
  • Merge the results from these databases

[Figure: a Metasearcher in front of the Hidden Web: existing web databases holding non-indexed documents (relational database, library, etc.)]
4
Outline
  • Background: SDARTS, SDLIP, STARTS
  • Extracting content summaries from remote web
    databases
  • Interfacing with Open Archives Initiative

5
SDARTS = SDLIP + STARTS
NOT yet another protocol
SDLIP interfaces + STARTS metadata
[Figure: a Metasearcher over wrapped sources, each labeled S and M, with back ends labeled grep, cat, select, and http://...]
6
STARTS: A Metasearching Protocol
  • Defines
    • Query language
    • Results format
    • Metadata for the collection
  • Complements SDLIP for metasearching purposes
    • Provides metadata for individual documents
    • Provides content summaries for databases

7
SDARTS: The Toolkit
  • SDARTS architecture makes new-wrapper
    implementation easy
  • SDARTS toolkit includes reference implementations
    for common types of text databases
    • Local text databases
    • Local XML databases
    • Remote web databases
  • Customization requires just editing configuration
    files, no programming

8
SDARTS: Content Summaries
  • Detailed content summaries easily extracted from
    locally available (plain-text or XML) databases
  • Detailed content summaries so far not available
    for remote web databases
    • No access to full contents

9
Extracting Content Summaries from Remote Web
Databases
  • No direct access to remote documents
  • Resort to document sampling
    • Send queries to the database
    • Retrieve a representative document sample
    • Use the sample to create an approximation of the
      content summary
  • Database selection algorithms work well even with
    approximate content summaries

VLDB 2002
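
The last sampling step above, turning a document sample into an approximate content summary, can be sketched as follows. This is a minimal illustration rather than the SDARTS implementation: the function name, the tokenizer, and the three sample documents are all made up for the example.

```python
import re
from collections import Counter


def sample_document_frequencies(documents):
    """Count, for each word, how many sampled documents contain it."""
    df = Counter()
    for text in documents:
        # Use a set so that a word is counted once per document.
        df.update(set(re.findall(r"[a-z0-9]+", text.lower())))
    return df


# Hypothetical sample retrieved from a remote database.
sample = [
    "Hepatitis affects the liver.",
    "Liver and kidney function tests.",
    "Kidney stones: causes and treatment.",
]
summary = sample_document_frequencies(sample)
print(summary["liver"], summary["kidney"], summary["hepatitis"])
```

The counts are relative to the sample only; relating them to the size of the actual database is a separate step.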
10
Topic-based Sampling: Training
  • Start with a predefined hierarchy and associated,
    pre-classified documents
  • Train rule-based document classifiers for each
    node
  • The output is a set of rules like
    • ibm AND computers → Computers
    • lung AND cancer → Health
    • hepatitis AND liver → Hepatitis
    • angina → Heart

[Figure: fragment of the topic hierarchy, with nodes Root and Health]
11
Topic-based Sampling: Probing
  • Transform each rule into a query
  • For each query
    • Send query to database
    • Record number of matches
    • Retrieve top-k documents for query
  • At the end of the round
    • Analyze matches for each category
    • Choose category to focus on
  • The result is a representative document sample

Sampling proceeds in rounds. In each round, the
rules associated with each node are turned into
queries to the database
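
One probing round can be sketched roughly as below. The slides do not say how the category to focus on is chosen, so this sketch simply picks the category with the most matches, which is an assumption. The `search` callable, the toy index, and its match counts are hypothetical stand-ins for a real database wrapper.

```python
from collections import defaultdict


def probe_round(rules, search, k=4):
    """Run one sampling round.

    rules:  list of (query, category) pairs derived from classifier rules.
    search: callable(query, k) -> (match_count, top_k_documents).
    Returns (matches per category, retrieved documents, category to focus on).
    """
    matches = defaultdict(int)
    sample = []
    for query, category in rules:
        count, docs = search(query, k)
        matches[category] += count   # record number of matches
        sample.extend(docs)          # keep the top-k documents
    # Assumed criterion: focus on the category with the most matches.
    focus = max(matches, key=matches.get) if matches else None
    return dict(matches), sample, focus


def toy_search(query, k):
    """Hypothetical database: invented match counts and document ids."""
    index = {
        "ibm AND computers": (3, ["d3"]),
        "lung AND cancer": (120, ["d1", "d2"]),
    }
    count, docs = index.get(query, (0, []))
    return count, docs[:k]


rules = [("ibm AND computers", "Computers"), ("lung AND cancer", "Health")]
matches, sample, focus = probe_round(rules, toy_search)
print(matches, sample, focus)
```

In a following round, the same function would be called with the rules attached to the chosen node (for example, the rules under Health).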
12
Sample Contains Relative Word Frequencies
  • "liver" appears in 200 out of 300 documents in
    sample
  • "kidney" appears in 100 out of 300 documents in
    sample
  • "hepatitis" appears in 30 out of 300 documents in
    sample

Document frequencies in actual database?
  • Query "liver" returned 140,000 matches
  • Query "hepatitis" returned 20,000 matches
  • "kidney" was not a query probe

Can exploit number of matches from one-word
queries
13
Adjusting Document Frequencies
  • We know absolute document frequency f of words
    from one-word queries
  • We know ranking r of words according to document
    frequency in sample
  • Mandelbrot's formula connects word frequency f
    and ranking r
  • We use curve-fitting to estimate the absolute
    frequency of all words in sample

14
Implementing Content-Summary Extraction in SDARTS
Toolkit
  • Implemented content-summary extraction module as
    J2EE-compliant servlet
  • First, build SDARTS wrapper for remote web
    database
  • Then, trigger extraction process to generate
    content summary automatically
  • Module customizable with any classification
    scheme
  • Toolkit provides 72-node hierarchical scheme and
    associated classifiers
  • To add a new scheme, define the hierarchy
    and provide classifiers for the internal nodes

15
Fraction of PubMed Content Summary
PubMed content summary
number of documents: 3,868,552
cancer → 1,398,178
aids → 106,512
heart → 281,506
angina → 26,775
hepatitis → 23,481
basketball → 907
cpu → 487
  • Extracted automatically
  • 27,500 words in the extracted content summary
  • Less than 200 queries sent
  • Retrieved 4 documents per query

The extracted content summary accurately
represents size and contents of the database
16
Topic-based Sampling: Conclusions
  • SDARTS now supports extraction of detailed
    content summaries from any database, local or
    remote
  • Sophisticated database selection algorithms can
    now be implemented on top of SDARTS

Implemented and available for download:
Database Selection Module;
SDARTS Client with Database Selection
17
Interfacing with Open Archives Initiative (OAI)
"No man is an island, entire of itself; every
man is a piece of the continent, a part of the
main..." (John Donne)
  • Export SDARTS metadata under OAI
  • Transparently access any OAI collection through
    SDARTS

[Figure: OAI Service Provider, SDARTS/SDLIP Server, OAI Data Provider, SDARTS Client]
18
Exporting SDARTS Metadata under OAI
  • SDARTS supports detailed, record-level metadata
    for each document, for XML and plain-text
    collections
  • Easy mapping to Dublin Core
  • SDARTS also exports content summaries under OAI
  • Each SDARTS collection is mapped to an OAI set
  • We export the content summaries under OAI, as
    metadata about the set

<PAPER>
  <TITLE>The threat of vancomycin resistance</TITLE>
  <AUTHORS>Trish M. Perl MD, MSc</AUTHORS>
  <FILENO>ajm_106_05_0489</FILENO>
  <APPEARED>
    <JRNL>American Journal of Medicine</JRNL>
    <VOL>106</VOL><ISS>5</ISS>
    <DATE>3 May </DATE><YEAR>1999</YEAR>
  </APPEARED>
  <ABSTRACT> </ABSTRACT>
  <BODY> </BODY>
</PAPER>
  • COLUMBIA SDARTS Server
    • PubMed Publications
    • Aides Medical Collection
    • NOAH: New York Online Access to Health
    • Cardiovascular Institute of the South
    • Columbia's DLI2 Medical Corpus
    • Harrison's Online
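
To make the "easy mapping to Dublin Core" concrete, here is a small sketch that maps the PAPER record from this slide onto Dublin Core element names. The slides do not list the actual field mapping, so the `FIELD_TO_DC` table is a hypothetical choice made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from record fields to Dublin Core elements;
# the real SDARTS mapping is not given in the slides.
FIELD_TO_DC = {
    "TITLE": "title",
    "AUTHORS": "creator",
    "APPEARED/JRNL": "source",
    "APPEARED/YEAR": "date",
    "FILENO": "identifier",
}


def to_dublin_core(record_xml):
    """Return a dict of Dublin Core element -> value for one record."""
    root = ET.fromstring(record_xml)
    dc = {}
    for path, element in FIELD_TO_DC.items():
        node = root.find(path)
        # Skip fields that are missing or empty in the record.
        if node is not None and node.text and node.text.strip():
            dc[element] = node.text.strip()
    return dc


paper = """<PAPER>
  <TITLE>The threat of vancomycin resistance</TITLE>
  <AUTHORS>Trish M. Perl MD, MSc</AUTHORS>
  <FILENO>ajm_106_05_0489</FILENO>
  <APPEARED>
    <JRNL>American Journal of Medicine</JRNL>
    <VOL>106</VOL><ISS>5</ISS>
    <DATE>3 May </DATE><YEAR>1999</YEAR>
  </APPEARED>
</PAPER>"""
dc_record = to_dublin_core(paper)
print(dc_record)
```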

19
SDARTS OAI Server Details
  • Uses OCLC OAI Server
  • Uses MySQL via JDBC to store OAI records
  • Records materialized after first request for
    space efficiency
  • Distributed as WAR file
  • Simple configuration: specify SDARTS/MySQL address

[Figure: an OAI Service Provider talks to the SDARTS OAI Interface, which connects to the SDARTS Server and, via JDBC, to a MySQL RDBMS]
20
Searching OAI Collections
  • OAI is not designed for searching
  • Requests can be restricted only by Date and Set
  • Need to search OAI collections
  • Users want to specify Title, Author, etc.

[Figure: User, OAI Service Provider, and OAI Data Provider (e.g., Library of Congress); the query "Author: F. Douglass" is marked with "?"]
21
Harvesting and Searching OAI within SDARTS
  • OAI exports metadata records in XML
  • SDARTS can index and search XML collections
  • Solution
    • Harvest OAI records (by Date, Set)
    • Store records locally as XML documents
    • Use SDARTS XML wrapper to index them

[Figure: the SDARTS/SDLIP Server harvests OAI/XML records from an OAI Data Provider (e.g., Library of Congress) and indexes them]
The OAI collection is searchable as an SDARTS XML
database
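
The "store records locally as XML documents" step could look something like the sketch below, which splits one harvested response into standalone records keyed by identifier. It assumes the OAI-PMH 2.0 response layout and namespace; the protocol version used by the original tools is not stated in the slides, and the sample response is invented for the example.

```python
import xml.etree.ElementTree as ET

# Assumed namespace: OAI-PMH 2.0.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def split_records(list_records_xml):
    """Return {identifier: standalone XML string} for each harvested record."""
    root = ET.fromstring(list_records_xml)
    documents = {}
    for record in root.iter(OAI_NS + "record"):
        identifier = record.findtext(f"{OAI_NS}header/{OAI_NS}identifier")
        documents[identifier] = ET.tostring(record, encoding="unicode")
    return documents


# Invented response with two records, for illustration only.
response = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata><title>First record</title></metadata>
    </record>
    <record>
      <header><identifier>oai:example.org:2</identifier></header>
      <metadata><title>Second record</title></metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""
docs = split_records(response)
print(sorted(docs))
```

Each value could then be written to its own file in the directory that the XML wrapper is configured to index.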
22
Adding an OAI Collection in SDARTS
http://memory.loc.gov/cgi-bin/oai
loc
2002-01-01
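
For orientation, a harvest request built from values like these might be formed as sketched below. The slide has lost its field labels, so this sketch assumes that the URL is the data provider's base URL and that 2002-01-01 is the harvest start date; "loc" is treated as a local collection name and is not sent to the server. The verb and parameter names follow OAI-PMH; the helper function itself is hypothetical.

```python
from urllib.parse import urlencode


def list_records_url(base_url, from_date=None, set_spec=None,
                     metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL (no network access)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date   # restrict by Date
    if set_spec:
        params["set"] = set_spec     # restrict by Set
    return base_url + "?" + urlencode(params)


url = list_records_url("http://memory.loc.gov/cgi-bin/oai",
                       from_date="2002-01-01")
print(url)
```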
23
Distributed Search over OAI
VT Electronic Thesis Dissertation
number of documents: 2,948
study → 1,479
thesis → 493
cancer → 13
basketball → 2
  • SDARTS treats OAI collections as simple, local
    XML databases
  • Exact content summaries are exported for OAI
    collections
  • Possible to build sophisticated distributed
    search over OAI using SDARTS

SDARTS Content Summary for an OAI collection
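
As an illustration of what a distributed search layer can do with such summaries, the sketch below ranks two collections for a query using the content summaries shown in these slides. The scoring rule, which estimates the number of matching documents as |D| times the product of df(w)/|D| under a word-independence assumption (in the style of bGlOSS), is chosen for the example; the slides do not say which algorithm the SDARTS database selection module uses.

```python
def estimated_matches(summary, query_words):
    """Estimate matching documents as |D| * prod(df(w) / |D|)."""
    size = summary["size"]
    estimate = float(size)
    for word in query_words:
        estimate *= summary["df"].get(word, 0) / size
    return estimate


def rank_databases(summaries, query_words):
    """Return database names, best estimated match count first."""
    scores = {name: estimated_matches(s, query_words)
              for name, s in summaries.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Content summaries as reported on the slides.
summaries = {
    "PubMed": {
        "size": 3_868_552,
        "df": {"cancer": 1_398_178, "aids": 106_512, "heart": 281_506,
               "angina": 26_775, "hepatitis": 23_481,
               "basketball": 907, "cpu": 487},
    },
    "VT ETD": {
        "size": 2_948,
        "df": {"study": 1_479, "thesis": 493, "cancer": 13, "basketball": 2},
    },
}
print(rank_databases(summaries, ["cancer"]))
print(rank_databases(summaries, ["thesis"]))
```

Words missing from a summary count as zero here, which is harsh for sampled summaries; a real selection algorithm would likely smooth those estimates.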
24
Conclusions
  • SDARTS can now extract rich content summaries
    from
    • Local text and XML databases
    • Remote web databases
    • OAI-compliant collections
  • SDARTS is now OAI-compliant
  • SDARTS allows easy integration of any OAI
    collection into SDARTS
  • SDARTS supports searching transparently over a
    wide range of heterogeneous collections

No programming required for any of the tasks
25
We are on the Web :-)
  • SDARTS executables and documentation
  • SDARTS source code with documentation
  • SDARTS web client
  • SDARTS database selection module
  • SDARTS-OAI interface tools
  • Sample SDARTS-compliant databases

http://sdarts.cs.columbia.edu/