How (Not) to Use a Semi-automated Clustering Tool


1
How (Not) to Use a Semi-automated Clustering Tool
  • Kat Hagedorn
  • University of Michigan
  • April 11, 2006

2
Update on UM's efforts
  • Built three research portals
      • DLF <http://www.hti.umich.edu/cgi/b/bib-idx?cimls>
      • MODS <http://www.hti.umich.edu/m/mods>
      • Aquifer <http://www.hti.umich.edu/a/aquifer>
  • Improvements for search / display
      • Integration of MODS format records
      • Simple vs. advanced searching
      • Inclusion of thumbnails

3
The need to cluster
  • Want to offer more than search within a generic,
    large corpus of data
  • How to partition the data?
  • Emory's MetaCombine tool promising as a topical
    clustering agent
  • (Also interested in clustering by format, access
    restriction, OAI software used, etc.)

4
Clustering vs. classification
  • Clustering is main focus
      • Huge amount of data
      • Needed a tool to find the topic
      • Preferably a disjunctive tool (placing files
        under more than one topic)
  • Classification is secondary focus
      • Have potential classification (UM's browse)
      • Marrying to current system nigh on impossible

5
Results: duration
  • First tried with small repository of 5,500
    records (amnh)
      • Took around 25 minutes
  • Multiple tries with larger repository of 270K
    records (dlps)
      • Took around 12 hours
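
As a rough check on the timings above, the following Python sketch converts them to records per minute. It treats "around 25 minutes" and "around 12 hours" as exact values, which they are not, so the results are approximate only.

```python
# Approximate throughput from the reported timings.
# The durations are the rounded figures from the slide, treated as exact here.
runs = {
    "amnh": {"records": 5_500, "minutes": 25},
    "dlps": {"records": 270_000, "minutes": 12 * 60},
}


def records_per_minute(records: int, minutes: float) -> float:
    """Return the average number of records clustered per minute."""
    return records / minutes


throughput = {name: records_per_minute(**run) for name, run in runs.items()}

for name, rate in throughput.items():
    print(f"{name}: about {rate:.0f} records per minute")
```

On these rounded figures the larger run averaged roughly 375 records per minute against roughly 220 for the smaller one, which suggests the 12 hours came mostly from corpus size rather than from a per-record slowdown.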

6
Results: cluster names
  • Examples of set names from clustering UM's
    metadata
      • Good: europe, mechanical, architecture
      • Not so good: general, michigan, build
      • Favorite: southern literari literature fine
        messenger
  • Granted:
      • Only asked for 20 clusters
      • Didn't cluster hierarchically

7
Caveats
  • Metadata will always be difficult to cluster
  • Using a tool developed as a Web service, with
    obvious benefits
  • Expect necessity of mapping set names to real
    topical cluster names
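
The mapping step in the last bullet could be as simple as a hand-maintained lookup table. The sketch below is a minimal illustration, not part of MetaCombine or the UM portals: the table `DISPLAY_NAMES`, the function `display_name`, and the curated names on the right-hand side are all hypothetical.

```python
# Hypothetical lookup table: raw set names produced by a clustering run,
# mapped to display names chosen by a person. The curated values are
# illustrative only.
DISPLAY_NAMES = {
    "europe": "Europe",
    "mechanical": "Mechanical Engineering",
    "architecture": "Architecture",
}


def display_name(set_name: str) -> str:
    """Return the curated name for a set, or the raw name if none is mapped."""
    key = set_name.strip().lower()
    return DISPLAY_NAMES.get(key, set_name)


print(display_name("europe"))
print(display_name("build"))
```

Falling back to the raw name keeps unmapped sets visible, so weak labels can be spotted and either renamed or dropped.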

8
What we need
  • Running the tool locally, with a local WSDL
    instance, would save lots (and lots) of time
  • Better set names: does this mean a better
    algorithm?
  • Ability to cluster by any criteria, not just
    topic, i.e., a post-processing module
  • Disjunctive clustering, meaning (so as not to hog
    storage) filename (not file) clustering
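
The last two bullets can be sketched together in Python: a post-processing step that groups record identifiers, not copies of the records, under every label a chosen criterion returns. This is an assumption about how such a module might look, not the MetaCombine interface. The record fields, the identifiers, and the function `cluster_ids` are invented for illustration.

```python
from collections import defaultdict
from typing import Callable, Iterable

# Hypothetical records: field names and values are invented for illustration.
records = [
    {"id": "rec-001", "format": "image", "topics": ["architecture", "europe"]},
    {"id": "rec-002", "format": "text", "topics": ["europe"]},
    {"id": "rec-003", "format": "image", "topics": ["mechanical"]},
]


def cluster_ids(
    records: Iterable[dict],
    labels_for: Callable[[dict], Iterable[str]],
) -> dict[str, list[str]]:
    """Group record identifiers under every label that labels_for returns."""
    clusters: dict[str, list[str]] = defaultdict(list)
    for record in records:
        for label in labels_for(record):
            # Store only the identifier, never a copy of the record.
            clusters[label].append(record["id"])
    return dict(clusters)


# Disjunctive: rec-001 appears under two topics.
by_topic = cluster_ids(records, lambda r: r["topics"])
# Same function, different criterion: one label per record.
by_format = cluster_ids(records, lambda r: [r["format"]])
```

Because each cluster holds identifiers only, placing a record under several topics costs one extra string per placement, not a duplicate of the file.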