Integrating Data Mining and Data Management Technologies for Scholarly Inquiry


Slides: 16
Transcript and Presenter's Notes

Title: Integrating Data Mining and Data Management Technologies for Scholarly Inquiry


1
Integrating Data Mining and Data Management
Technologies for Scholarly Inquiry
  • Ray R. Larson, University of California, Berkeley
  • Paul Watry, University of Liverpool
  • Richard Marciano, University of North Carolina,
    Chapel Hill

2
  • Integrating Data Mining and Data Management
    Technologies for Scholarly Inquiry
  • Goals:
  • Text mining and NLP techniques to extract content
    (named Persons, Places, Time Periods/Events) and
    associate context
  • Data:
  • Internet Archive Books Collection (with
    associated MARC records where available): 7.2 TB
  • JSTOR: 1 TB
  • Context sources: SNAC archival and library
    authority records
  • Tools:
  • Cheshire3: DL search and retrieval framework
  • iRODS: policy-driven distributed data storage
  • Amazon S3 storage and EC2 computing
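
The extraction goal above can be illustrated with a deliberately small sketch. It is not the project's pipeline: the gazetteer, the regular expressions, and the sample sentence are all assumptions made for illustration, and a real system would rely on trained NLP models and authority data rather than pattern matching.

```python
import re

# Hypothetical toy gazetteer; a real pipeline would draw on authority data.
KNOWN_PLACES = {"Liverpool", "Berkeley", "Chapel Hill"}

# Four-digit years from 1000 to 2099, used as crude time-period evidence.
YEAR_PATTERN = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")
# Two or more consecutive capitalized words, treated as a candidate proper name.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def extract_entities(text):
    """Return candidate places, names, and years found in the text."""
    years = sorted(set(YEAR_PATTERN.findall(text)))
    candidates = set(NAME_PATTERN.findall(text))
    # A place counts as found if it appears as a whole phrase in the text.
    places = sorted(
        place for place in KNOWN_PLACES
        if re.search(r"\b" + re.escape(place) + r"\b", text)
    )
    # Capitalized phrases that are not known places are kept as name candidates.
    names = sorted(c for c in candidates if c not in KNOWN_PLACES)
    return {"places": places, "names": names, "years": years}


sample = "In 1854 John Smith sailed from Liverpool to meet Mary Jones in Chapel Hill."
entities = extract_entities(sample)
print(entities)
```

The output separates the three content types named in the goals (persons, places, time periods), which is the shape of record that can later be associated with context sources.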

3
Grid-Based Digital Libraries: Needs
  • Large-scale distributed storage requirements and
    technologies
  • Organizing distributed digital collections
  • Shared Metadata standards and requirements
  • Managing distributed digital collections
  • Security and access control
  • Collection Replication and backup
  • Distributed Information Retrieval support and
    algorithms

4
But...
  • Hasn't Hadoop and its menagerie already solved
    everything?
  • Yes: many tasks can now be done with great
    scale-up
  • And no: most Hadoop solutions are batch-oriented
    and geared towards summarization rather than
    information access
  • Maybe we are looking at replacing or
    supplementing the low-level data management with
    Hadoop or Spark tools

5
Grid/Cloud IR Issues
  • Want to preserve the same retrieval performance
    (precision/recall) while hopefully increasing
    efficiency (i.e., speed)
  • Very large-scale distribution of resources is
    (still) a challenge for sub-second retrieval
  • Unlike most typical Grid/Cloud processes, IR is
    potentially less computing-intensive and more
    data-intensive
  • In many ways Grid IR replicates the process (and
    problems) of metasearch or distributed search
  • We have developed the Cheshire3 system to
    evaluate and manage these issues. The Cheshire3
    system is actually one component in a larger
    Grid-based environment
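
The metasearch analogy and the precision/recall goal above can be sketched as follows. This is an illustrative assumption, not Cheshire3 code: each node is imagined to return a ranked list of (document id, score) pairs, and the scores are assumed to be directly comparable across nodes, which is exactly the hard part of distributed search in practice. All function names and data are hypothetical.

```python
import heapq


def merge_ranked(shard_results, k):
    """Merge per-shard (doc_id, score) lists into a global top-k by score."""
    all_hits = (hit for hits in shard_results for hit in hits)
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[1])


def precision_recall(retrieved_ids, relevant_ids):
    """Compute set-based precision and recall for one query."""
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical results from two nodes for the same query.
shard_a = [("a1", 0.9), ("a2", 0.4)]
shard_b = [("b1", 0.7), ("b2", 0.6)]

top = merge_ranked([shard_a, shard_b], k=3)
top_ids = [doc_id for doc_id, _ in top]

# Hypothetical relevance judgments for the query.
p, r = precision_recall(top_ids, {"a1", "b2", "a2", "x9"})
print(top_ids, p, r)
```

Comparing precision and recall of the merged list against a single-node run over the same collection is one way to check that distribution has not changed retrieval quality.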

6
Cheshire3 Environment
(Diagram not transcribed; one surviving label reads "or iRODS")
7
Cheshire3 IR Overview
  • XML Information Retrieval Engine
  • 3rd Generation of the UC Berkeley Cheshire
    system, as co-developed at the University of
    Liverpool
  • Uses Python for flexibility and extensibility,
    but uses C/C++-based libraries for processing
    speed
  • Standards-based: XML, XSLT, CQL, SRW/U, Z39.50,
    OAI, to name a few
  • Grid/Cloud capable. Uses distributed
    configuration files, workflow definitions and PVM
    or MPI to scale from one machine to thousands of
    parallel nodes
  • Free and Open Source Software
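
To make the standards list above concrete, here is a minimal sketch of how a client might form an SRU searchRetrieve request carrying a CQL query. The base URL and the helper name are hypothetical; only the request parameter names come from the SRU 1.2 specification, and nothing here is taken from Cheshire3's own API.

```python
from urllib.parse import urlencode


def build_sru_url(base_url, cql_query, maximum_records=10, start_record=1):
    """Build an SRU 1.2 searchRetrieve URL for the given CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql_query,
        "startRecord": start_record,
        "maximumRecords": maximum_records,
    }
    # urlencode percent-encodes the quotes and turns spaces into '+'.
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint; substitute the address of a real SRU server.
url = build_sru_url("http://example.org/sru/books", 'dc.title all "data mining"')
print(url)
```

Because the query language and the transport are both standardized, the same request could be sent to any SRU-compliant server without client changes.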

8
Cheshire3 Object Model
9
Current Version
  • iRODS and C3 on Amazon EC2 and S3

10
Sample demo
11
(No Transcript)
12
(No Transcript)
13
(No Transcript)
14
Summary
  • Indexing and IR work very well in the Grid/Cloud
    environment, with the expected scaling behavior
    for multiple processes
  • Still in progress:
  • We are still collecting and processing the books
    collection from the Internet Archive
  • We are still extracting place names, personal
    names, corporate names and linking with reference
    sources (such as GeoNames, VIAF, and SNAC)
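
The linking step mentioned above can be sketched under strong simplifying assumptions: the authority file is a small in-memory table, the identifiers are made-up placeholders rather than real GeoNames, VIAF, or SNAC identifiers, and matching is by normalized string equality only. Real linking would also need disambiguation by dates, context, and record type.

```python
import unicodedata


def normalize(name):
    """Fold case, strip diacritics and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in stripped)
    return " ".join(cleaned.lower().split())


# Hypothetical authority table keyed by normalized heading; IDs are placeholders.
AUTHORITY = {
    normalize("Brontë, Charlotte"): "example:person/1",
    normalize("Liverpool"): "example:place/7",
}


def link(names):
    """Map each extracted name to an authority ID, or None when unmatched."""
    return {name: AUTHORITY.get(normalize(name)) for name in names}


links = link(["Bronte, Charlotte", "LIVERPOOL", "Unknown Person"])
print(links)
```

Unmatched names are kept with a value of None so they can be reviewed or retried against another reference source.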

15
Thank you!
Special thanks to John Harrison (Liverpool),
Chien-Yi Hou (UNC), Shreyas and Luis Aguilar
(UCB)
Cheshire3 available via https://github.com/cheshire3
iRODS available via https://www.irods.org
Project web site: http://diggingintodata.web.unc.edu