1
The INFOMINE project
  • Gordon Paynter
  • Infomine Lead Programmer
  • and the Infomine team
  • Steve Mitchell, Margaret Mooney, Julie Mason et
    al.
  • at the University of California, Riverside

2
The Infomine Project
  • Introduction to Infomine
  • The core Infomine system
  • Automation: finding and describing resources
  • Collaboration: the Fiat Lux portals
  • Conclusions

3
Introduction to Infomine
  • Infomine is a virtual library
  • Infomine's goal is to provide organised access to
    the Internet in the same way that we do for
    printed works
  • Library catalogs focus on books and periodicals
  • Infomine focuses on web sites (mostly, now)
  • There are many differences between books and web
    sites

4
Books vs. Web sites
  • Books
    • Easily-defined, physical objects
    • Static
    • Permanent
    • LC: 119 million items
  • Web sites
    • What is a web site anyway?
    • Continually changing
    • Frequently disappear
    • Google: 2 billion pages

5
Books vs. Web sites
  • Books
    • Limited number of publishers
    • Existing, coordinated cataloging effort
    • Text not usually electronically available
  • Web sites
    • Anyone can publish
    • Few indexers: Infomine, LII, IPL, BUBL, MEL,
      Scout; all are post-hoc
    • Can be downloaded and processed

6
Simplifying the problem
  • Editorial standards
    • Only select the best Web sites
  • Automated assistance
    • Collection building
    • Automated and semi-automated resource description
    • Catalog maintenance
  • Wide collaboration
    • More contributors
    • Less redundant effort

7
The core Infomine system
  • Infomine for patrons
  • Behind the scenes: Infomine for content builders
  • Open source inputs: what the community gives us
  • Open source outputs: what we're distributing

12
Infomine core: behind the scenes
17
Infomine core: open source inputs
  • The Linux operating system
    • Debian GNU/Linux
  • Infrastructure
    • The Apache webserver
    • MySQL and Berkeley DB databases
  • Programming tools
    • The GNU Compiler (gcc) and libraries, emacs
    • Common libraries

18
Infomine core: open source outputs
  • The Infomine general-purpose library
    • http://infomine.ucr.edu/iVia/
  • The full libInfomine library
    • Available in August (as documentation is completed)
  • The full Infomine source
    • Available Fall 2002

19
Automation: finding and describing resources
  • Discovering new resources
  • The Infomine record builder
  • Extracting useful metadata
  • Automatically classifying records
  • Open source inputs
  • Open source outputs

20
Discovering new resources
  • The semi-automatic focused web crawler
    • You suggest a topic or search term
    • The crawler searches for web pages and clusters
      them
    • You identify useful clusters of documents
      (optional)
    • The crawler reports the top 20 hubs and
      authorities
    • You choose from the list of URLs
    • The automatic record builder helps generate
      metadata
  • The fully-automatic focused web crawler
    • Coming soon!
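
"Hubs" and "authorities" are the vocabulary of link-analysis ranking in the style of Kleinberg's HITS algorithm. The slides do not say which algorithm the crawler actually uses, so the following is only a minimal sketch under that assumption. The function names (`hits`, `top_n`) and the toy link graph are hypothetical and not part of Infomine.

```python
def hits(links, iterations=50):
    """Score pages as hubs and authorities.

    links: dict mapping a page to the list of pages it links to.
    Returns (hub_scores, authority_scores) as dicts.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # A page's hub score is the sum of the authority scores it links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalise both score sets so the values stay bounded.
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


def top_n(scores, n=20):
    """Return the n best-scoring pages, ties broken alphabetically."""
    return sorted(scores, key=lambda p: (-scores[p], p))[:n]


# Toy link graph (hypothetical page names).
links = {"hub1": ["a", "b"], "hub2": ["a", "b", "c"], "a": [], "x": ["c"]}
hub_scores, auth_scores = hits(links)
print("hubs:", top_n(hub_scores, 3))
print("authorities:", top_n(auth_scores, 3))
```

In this toy graph `hub2` points at the most widely cited pages, so it ranks as the best hub. A crawl over real pages would call `top_n` with the default `n=20` to produce a list like the one the slide describes.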

21
The Infomine record builder
  • Input: a URL or list of URLs
    • From the focused crawler
    • From the record builder interface
  • The record builder creates a new record
    • Fully-automatic operation
      • The builder creates new records on its own
    • Semi-automatic operation
      • The builder interacts with you at each stage
  • Output: new records in the pending database
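
The difference between the two modes of operation can be pictured as an optional review step after each stage. This is a hypothetical sketch, not the Infomine record builder itself: `build_records`, `guess_title`, the `review` callback, and the use of a plain list in place of the pending database are all assumptions made for illustration.

```python
def build_records(urls, stages, review=None):
    """Run every metadata stage over every URL and collect pending records.

    stages: list of (field_name, function(url) -> value) pairs.
    review: optional callback (field_name, value) -> value. When given, it
            is consulted after every stage (semi-automatic operation);
            when omitted, the builder runs on its own (fully-automatic).
    """
    pending = []  # stands in for the pending database
    for url in urls:
        record = {"url": url}
        for field, stage in stages:
            value = stage(url)
            if review is not None:
                value = review(field, value)
            record[field] = value
        pending.append(record)
    return pending


def guess_title(url):
    # Hypothetical stage: use the last path segment as a placeholder title.
    return url.rstrip("/").rsplit("/", 1)[-1]


stages = [("title", guess_title)]
urls = ["http://example.org/forest-insects"]

automatic = build_records(urls, stages)
reviewed = build_records(urls, stages, review=lambda field, value: value.upper())
print(automatic)
print(reviewed)
```

Here the `review` lambda merely upper-cases each value; in an interactive setting it would be the point where a person accepts or edits the suggestion.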

32
New research: LCSH assignment
  • Dr. Steve Jones, of the University of Waikato
  • Aim: assign LCSH (Library of Congress Subject
    Headings) based on document content
  • Use training data to build a model
    • Training data: documents with keyphrases and LCSH
    • Model: based on keyphrase and LCSH co-occurrence
  • Use model to assign LCSH to new documents
    • Extract keyphrases with Kea
    • Similarity measures identify the best LCSH
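
A co-occurrence model of this kind could look roughly like the sketch below. It is not the KPtoLCSH implementation: the scoring rule (summing, for each keyphrase, the fraction of its training co-occurrences that fall on a heading) is an assumed stand-in for the similarity measures mentioned on the slide, the training documents are toy data, and the keyphrases are supplied by hand rather than extracted with Kea.

```python
from collections import Counter, defaultdict


def train(docs):
    """Count how often each keyphrase co-occurs with each LCSH heading.

    docs: iterable of (keyphrases, lcsh_headings) pairs.
    """
    cooc = defaultdict(Counter)  # keyphrase -> Counter of headings seen with it
    for keyphrases, headings in docs:
        for kp in set(keyphrases):
            for heading in set(headings):
                cooc[kp][heading] += 1
    return cooc


def assign_lcsh(cooc, keyphrases, top=3):
    """Rank headings for a new document by summed co-occurrence share."""
    scores = Counter()
    for kp in set(keyphrases):
        seen = cooc.get(kp)
        if not seen:
            continue  # keyphrase never appeared in the training data
        total = sum(seen.values())
        for heading, count in seen.items():
            scores[heading] += count / total
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [heading for heading, _ in ranked[:top]]


# Toy training data (hypothetical documents).
training = [
    (["bark beetle", "pine forest"], ["forest insects", "bark beetles"]),
    (["bark beetle", "wood borer"], ["bark beetles", "borers (insects)"]),
    (["rainfall", "aerosol"], ["meteorology", "aerosols"]),
]
model = train(training)
suggested = assign_lcsh(model, ["bark beetle", "pine forest"], top=2)
print(suggested)
```

Keyphrases that never occurred in the training data contribute nothing, so a document made up entirely of unseen phrases gets no headings at all.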

33
New research: LCSH assignment
  • forest insects
  • forest insects
  • bark beetles
  • borers (insects)
  • tobacco hornworm
  • scolytidae
  • greenhouse whitefly
  • agriculture in literature
  • mountain pine beetle

34
New research: LCSH assignment
  • BRASSICA
  • CROPS
  • PLANT BREEDING
  • cruciferae
  • Buriats
  • brassica
  • phytophagous insects
  • plants, effect of metals on
  • blood groups in animals
  • rapeseed
  • hybridization, vegetable

35
New research: LCSH assignment
  • CLIMATOLOGY
  • ENVIRONMENTAL SCIENCES
  • POLLUTION
  • atmospheric chemistry
  • meteorology
  • continentality (meteorology)
  • chemical oceanography
  • multidimensional chromatography
  • turbulent diffusion (meteorology)
  • aerosols
  • precipitation scavenging

36
New research: LCC assignment
  • Dr. Eibe Frank, of the University of Waikato
  • Aim: assign LCC based on a set of LCSH
    • Infomine has LCSH but no LCC
    • Use with LCSH classifier for new documents
  • Use training data to build a model
    • Training data: documents with LCSH and LCC
    • Model: LCC-hierarchy of Support Vector Machines
  • Use model to assign LCC to new documents

37
New research: LCC assignment
  • Performance (preliminary)
    • Absolute accuracy around 58% (pleasing)
    • Also 4% are too specific, 3% too general
    • Top-level accuracy around 80%
  • What to do if we encounter completely new LCSH?
  • QA1-43: Science, Mathematics, General
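
The slide separates exact matches from predictions that land in the right branch at the wrong depth. The slides do not define these measures precisely, so the sketch below is one plausible reading. It assumes each LCC class is written as a path from the top-level class downwards, that "too specific" means the prediction lies below the true class, that "too general" means it lies above, and that "top-level accuracy" compares only the first element. The `evaluate` function and the sample pairs are hypothetical.

```python
def evaluate(pairs):
    """Summarise hierarchical classification results.

    pairs: list of (predicted_path, true_path); each path is a tuple that
    runs from the top-level class down, e.g. ("Q", "QA", "QA1-43").
    Returns the fraction of pairs falling into each category.
    """
    counts = {"exact": 0, "too_specific": 0, "too_general": 0, "top_level": 0}
    for predicted, true in pairs:
        if predicted == true:
            counts["exact"] += 1
        elif predicted[:len(true)] == true:
            counts["too_specific"] += 1  # prediction is below the true class
        elif true[:len(predicted)] == predicted:
            counts["too_general"] += 1   # prediction is above the true class
        if predicted[:1] == true[:1]:
            counts["top_level"] += 1     # same top-level class
    total = len(pairs) or 1  # avoid dividing by zero on empty input
    return {name: count / total for name, count in counts.items()}


# Hypothetical predictions, for illustration only.
pairs = [
    (("Q", "QA", "QA1-43"), ("Q", "QA", "QA1-43")),  # exact match
    (("Q", "QA"), ("Q", "QA", "QA1-43")),            # too general
    (("Q", "QA", "QA1-43"), ("Q", "QA")),            # too specific
    (("S", "SB"), ("Q", "QK")),                      # wrong branch
]
results = evaluate(pairs)
print(results)
```

Under this reading, exact, too-specific and too-general predictions all count towards top-level accuracy, which is why that figure is the highest of the four.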

41
Automation: open source inputs
  • General and C tools
    • Linux, Apache, gcc, flex, curl, etc.
  • Java tools
    • The Java MARC Events (James) toolkit
    • The Waikato Environment for Knowledge Analysis
      (WEKA) machine learning toolkit
    • The Kea keyphrase extraction program

42
Automation: open source outputs
  • LCSHtoLCC: LCC assignment
    • http://infomine.ucr.edu/iVia/
  • KPtoLCSH: LCSH assignment
    • Available August
  • PhraseRate: keyphrase extractor
    • http://infomine.ucr.edu/iVia/
  • Artur's Automatic Annotator
    • Available in Fall 2002 (with Infomine)

43
Collaboration: the Fiat Lux Portals
  • Fiat Lux
  • Advantages of collaboration
  • MyI: research guides and pathfinders
  • Themes: co-branding for collaborators
  • Open standards, protocols and source code
  • Challenges of collaboration

44
Fiat Lux
  • Established at ALA Midwinter 2002
  • Prominent, librarian-built, public portals
    • BUBL, Infomine, IPL, lii.org, MEL, VRL
  • Goal: resource sharing through collaboration
  • Fiat Lux represents
    • 170 librarians
    • 100,000 records
    • 30 million searches/year

45
Advantages of collaboration
  • Greater sustainability and scalability
  • Reduced redundant effort
  • Shared cataloging effort
  • More resources cataloged
  • Everyone gets a bigger (better) dataset
  • Shared systems development
  • Scalability of systems
  • Preserving institutional identity

46
Themes: co-branding through iVia
  • Co-branding for institutional cooperators
  • Many data views can be themed
    • The data is the same
    • The appearance is altered
  • http://infomine.ucr.edu/cgi-bin/canned_search?querytreethemewfu

48
MyI: custom collections
  • Create research guides / pathfinders
  • Create a MyI category
  • Add records to categories in the record editor
  • Batch add to your category
  • Create searches for your records
  • Examples
  • CSUF-MC, CSUF-MC-NATAM, CSUF-MC-ASIAM...
  • UCR-DB-MUSIC, UCR-ACCESS-CDL-PASSWORD
  • UDM-edu459

50
Challenges of collaboration
  • Investigating lii.org integration
    • Granularity of metadata
    • Different editorial processes
    • Collection focus and audience level
    • Scholarly vs. K-12 vs. public library
  • How do you merge duplicate records?
    • LCSH, keywords: easy to combine
    • Annotation: not sure yet
  • These are editorial issues rather than technical
    ones
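
The contrast between the easy and the unresolved parts of merging can be made concrete with a small sketch. The record layout (`url`, `lcsh`, `keywords`, `annotation`) and the `merge_records` function are hypothetical, not the Infomine schema. Subject headings and keywords are combined as set unions, while the two annotations are simply kept side by side for an editor, since the slide leaves that question open.

```python
def merge_records(a, b):
    """Merge two records that describe the same URL.

    LCSH and keywords are unioned; annotations are not reconciled, only
    collected (duplicates dropped, order kept) for editorial review.
    """
    return {
        "url": a["url"],
        "lcsh": sorted(set(a["lcsh"]) | set(b["lcsh"])),
        "keywords": sorted(set(a["keywords"]) | set(b["keywords"])),
        "annotations_for_review": list(
            dict.fromkeys([a["annotation"], b["annotation"]])
        ),
    }


# Two hypothetical records from different portals.
rec_a = {
    "url": "http://example.org/",
    "lcsh": ["forest insects"],
    "keywords": ["beetles", "forestry"],
    "annotation": "A guide to forest pests.",
}
rec_b = {
    "url": "http://example.org/",
    "lcsh": ["forest insects", "bark beetles"],
    "keywords": ["forestry"],
    "annotation": "Covers bark beetles in detail.",
}
merged = merge_records(rec_a, rec_b)
print(merged)
```

Choosing between the two annotations, or rewriting them into one, is exactly the kind of editorial decision the slide says is still open.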

51
Current collaborators
  • Fiat Lux
  • LITA Internet Portals Interest Group
  • NSF / NSDL
  • Contributing to the content-building effort

52
Current collaborators
  • University of California contributors
  • University of Detroit Mercy
  • Wake Forest University
  • California State University contributors
  • Cornell University
  • Library of Congress / BEOnline

53
Financial support
  • FIPSE: Fund for the Improvement of Post-secondary
    Education (U.S. Department of Education)
  • IMLS National Leadership Grant
  • The Library of the University of California at
    Riverside

54
Join us!
  • Cataloging the Internet (or even just the best
    bits) requires collaboration, but we recognise
    that most collaborators have different needs.
  • Open standards and software offer us scalability
    and flexibility, and provide us with a base of
    work on which to build, and ensure our work will
    continue to be free.
  • http://infomine.ucr.edu/iVia/