A Framework for Relationship Discovery Among Files of Different Types

Michal Ondrejcek, Jason Kastner and Peter Bajcsy
National Center for Supercomputing Applications
(NCSA), University of Illinois at
Urbana-Champaign (UIUC)
{mondrejc, jkastner, pbajcsy}@ncsa.uiuc.edu
Abstract: We present a framework for relationship
discovery from heterogeneous data systems. The
framework consists of modules for automated file
system analysis, file content analysis,
integration of the analysis results, storage
of metadata, and data-driven decision support for
discovering relationships among files. The file
content analysis includes filtering for file type
detection (e.g., file format identification using
DROID and PRONOM) and type-specific content
analysis (such as information extraction from 2D
engineering drawings using Optical Character
Recognition (OCR) and keyword-based extraction
of information from 3D CAD models). The
integration component consolidates metadata
extracted from the file system and from the file
content using Resource Description Framework
(RDF)-based metadata representations. These are
stored using Tupelo in an underlying content
repository. We report the preliminary design of
the framework and the performance of prototype
modules on a test collection of electronic
records documenting the Torpedo Weapon Retriever
(TWR 841). This test collection presents a
problem of unknown relationships among files that
currently include 784 2D image drawings and 22
CAD models.
Framework Design
Figure: An overall design for discovering relationships
among multiple sources of electronic records.
Content Information Extraction
We study the extraction of content information to
discover relationships between engineering
drawings (TIFF files) that contain a title block and
the corresponding AutoCAD 3D models (DWG files) of
the TWR841 ship deck.
Figure: Engineering Drawing, RELATIONSHIP, 3D CAD Model.
  • Information in engineering drawings: The title
    block is cropped. Information is extracted using
    optical character recognition (OCR) software. The
    extracted information is corrected and encoded
    into about 15-20 RDF triples using an ontology
    developed for this purpose.
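The encoding step described above can be sketched as follows. This is a minimal illustration, not the poster's implementation: the namespace `http://example.org/titleBlockRDF/` and the drawing URI are hypothetical stand-ins (the poster elides the real namespace), while the property names and values are taken from the title block example on this poster.

```python
# Hypothetical namespace; the poster shows only "path/NARA/titleBlockRDF/".
TDRW = "http://example.org/titleBlockRDF/"


def escape_literal(value):
    """Escape backslashes, double quotes and newlines for an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n"))


def title_block_to_ntriples(drawing_uri, fields):
    """Encode corrected title-block fields as N-Triples lines.

    fields maps a property name (e.g. "drawingTitle") to its text value.
    Properties are emitted in alphabetical order so the output is stable.
    """
    lines = []
    for name, value in sorted(fields.items()):
        lines.append('<%s> <%s%s> "%s" .'
                     % (drawing_uri, TDRW, name, escape_literal(value)))
    return lines


# Field values as they appear in the OCR example on this poster.
fields = {
    "drawingTitle": "120 TORPEDO WEAPONS RETRIEVER TRANSVERSE BULKHEADS BELOW MAIN DECK",
    "drawingScale": '1/2"-1\'-0" AS SHOWN',
    "drawingSize": "H",
}
triples = title_block_to_ntriples(TDRW + "drawing/1", fields)
for line in triples:
    print(line)
```

Plain string formatting is used here only to keep the sketch dependency-free; an RDF library would normally handle serialization.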

File System Information Extraction
  • Aperture, a Java framework, was used for
    metadata extraction from file systems. It saves
    the metadata following the NEPOMUK ontology.
  • We studied the size of extracted metadata and
    developed prediction capabilities to estimate
    additional storage requirements.
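As a rough illustration of the two bullets above, the sketch below walks a directory tree, records a few per-file properties, and extrapolates the serialized metadata size to a larger file count. It is written in Python rather than with Aperture, the property names (`fileName`, `byteSize`, `lastModified`) are hypothetical and not NEPOMUK terms, and the linear scaling is an assumption: the poster's prediction is based on a simulated file system topology, which is not reproduced here.

```python
import os
import tempfile


def extract_file_metadata(root):
    """Walk root and return (path, property, value) records for each file."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            records.append((path, "fileName", name))
            records.append((path, "byteSize", str(info.st_size)))
            records.append((path, "lastModified", str(int(info.st_mtime))))
    return records


def metadata_bytes(records):
    """Approximate storage cost as the UTF-8 size of the serialized records."""
    return sum(len(" ".join(r).encode("utf-8")) + 1 for r in records)


def predict_metadata_bytes(sample_bytes, sample_files, target_files):
    """Linearly scale the measured per-file cost to a larger file count."""
    if sample_files == 0:
        raise ValueError("need at least one sampled file")
    return sample_bytes / sample_files * target_files


# Demonstration on a throwaway directory with two small files.
with tempfile.TemporaryDirectory() as root:
    for name in ("a.txt", "b.txt"):
        with open(os.path.join(root, name), "w") as handle:
            handle.write("sample")
    records = extract_file_metadata(root)
    sample_bytes = metadata_bytes(records)

print(len(records), "records,", sample_bytes, "bytes")
print(predict_metadata_bytes(sample_bytes, 2, 1000), "bytes predicted for 1000 files")
```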

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:tdrw="path/NARA/titleBlockRDF/">
  <rdf:Description rdf:about="path/NARA/titleBlockRDF/">
    <tdrw:drawingTitle>120 TORPEDO WEAPONS RETRIEVER TRANSVERSE BULKHEADS BELOW MAIN DECK</tdrw:drawingTitle>
    <tdrw:isPreparingActivity>OfPAO'MtN Of NE v NAVAL SEA SYSTEMS COMMAND</tdrw:isPreparingActivity>
    <tdrw:drawingScale>1/2"-1'-0" AS SHOWN</tdrw:drawingScale>
    <tdrw:drawingSize>H</tdrw:drawingSize>
    <tdrw:drawingNumber>117-62 00895</tdrw:drawingNumber>
    <tdrw:drawingNumber>A</tdrw:drawingNumber>
    <tdrw:isDrawnBy>
      <rdf:Description rdf:about="http://purl.org/dc/elements/1.1/">
        <dc:creator>LDOBSON</dc:creator>
      </rdf:Description>
    </tdrw:isDrawnBy>
    <tdrw:isDrawnDate>
      <rdf:Description rdf:about="http://purl.org/dc/elements/1.1/">
        <dc:date>4-I0-86</dc:date>
      </rdf:Description>
    </tdrw:isDrawnDate>
  </rdf:Description>
</rdf:RDF>
Figure panels: Cropped Title Block; Information from OCR;
Editing and Ontology Definition; RDF representation of
information extracted.
Figure: Metadata size as a function of the number of files
in a file system. The test systems were divided
by operating system (OS) type into (c1) a
Linux-based system with 8 Intel Xeon CPUs at 2.5 GHz
and 8 GB RAM and (c2) a Windows XP system with one
2 GHz Intel CPU and 2 GB RAM. The dots correspond to
actual file systems; the blue line represents the
metadata size predicted from a simulated file system
topology.
  • Information in 3D CAD files: The 3D CAD models in
    STEP file format are searched for ASCII
    strings that match words in an English dictionary.
    The information is then encoded in about 8-10 RDF
    triples.
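The string search described above might look like the following sketch. It assumes that candidate strings are the single-quoted strings of the STEP text and that a string is kept when at least one of its words appears in the dictionary; the poster does not specify either rule. The tiny `DICTIONARY` set is a hypothetical stand-in for a full English word list, and the sample header reuses values shown on this poster.

```python
import re

# Hypothetical stand-in for a full English dictionary.
DICTIONARY = {"torpedo", "weapons", "retriever", "naval", "sea", "systems", "command"}


def extract_ascii_strings(step_text):
    """Return the contents of all single-quoted strings in STEP text.

    In STEP files a doubled apostrophe ('') inside a string is an escaped quote.
    """
    raw = re.findall(r"'((?:[^']|'')*)'", step_text)
    return [s.replace("''", "'") for s in raw]


def dictionary_matches(strings, dictionary):
    """Keep strings that contain at least one dictionary word (case-insensitive)."""
    kept = []
    for s in strings:
        words = re.findall(r"[A-Za-z]+", s)
        if any(w.lower() in dictionary for w in words):
            kept.append(s)
    return kept


header = (
    "FILE_NAME('120 TORPEDO WEAPONS RETRIEVER', '04-10-86', ('LDOBSON'), "
    "('NAVAL SEA SYSTEMS COMMAND'), ' ', 'IDA-STEP', ' ');"
)
strings = extract_ascii_strings(header)
matches = dictionary_matches(strings, DICTIONARY)
print(matches)
```

Names such as LDOBSON are not dictionary words, so a dictionary filter alone would drop them; recovering author fields would need the positional structure of FILE_NAME as well.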

STEP METADATA SPECIFICATION
  FILE_DESCRIPTION(/* description */ (''), /* implementation_level */ '2;1');
  FILE_NAME(/* name */ '', /* time_stamp */ '', /* author */ (''), /* organization */ (''), /* preprocessor_version */ ' ', /* originating_system */ '', /* authorization */ ' ');

EXPECTED STEP METADATA
  FILE_DESCRIPTION((''), /* implementation_level */ '2;1');
  FILE_NAME('120 TORPEDO WEAPONS RETRIEVER, TRANSVERSE BULKHEADS BELOW, MAIN DECK', '04-10-86', ('LDOBSON'), ('NAVAL SEA SYSTEMS COMMAND'), ' ', 'IDA-STEP', ' ');

PARSED STEP METADATA
  FILE_DESCRIPTION((''), '2;1');
  FILE_NAME('D:\\NARA\\Archieve_data_samples\\BHD_FR12\\U2110_BHD12_2007_05_09.stp', '2007-05-10T13:45:37', ('rakowpj'), (''), 'Autodesk Inventor 11', 'Autodesk Inventor 11', '');
The table above shows an example of information
extracted from a 3D CAD model of the TWR841 ship
deck stored in the STEP file format.
File Format Identification
This component calls DROID, a file format
identification program. The results are metadata
about each file, including the registered PRONOM
unique ID. PRONOM is a registry of information
about file formats, software products and other
technical components.
Several 3D file formats are not supported by
PRONOM, and for these DROID returns the
unidentified file format flag. Those files are
then checked against an internal list of 3D file
types. The results are converted into RDF triples
and stored in a metadata context repository.
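The fallback logic above can be sketched as follows. The function does not call DROID; it takes whatever identifier and format name a DROID run reported (or `None` for an unidentified file) as plain arguments. The predicate base `http://example.org/droid/IdFile/`, the subject URI scheme, the contents of `INTERNAL_3D_TYPES`, and the quality values other than "Positive" are all hypothetical; only the predicate names and the AutoCAD example values come from this poster.

```python
import os

# Hypothetical predicate base; the poster shows only ".../2009/droid/IdFile/".
DROID_NS = "http://example.org/droid/IdFile/"

# Hypothetical internal list of 3D file types, keyed by file extension.
INTERNAL_3D_TYPES = {
    ".stp": "STEP 3D model",
    ".step": "STEP 3D model",
    ".igs": "IGES 3D model",
}


def identification_triples(file_key, file_name, puid=None, format_name=None):
    """Build (subject, predicate, object) triples for one file.

    puid and format_name are the values reported by a format identification
    run, or None when the file was flagged as unidentified.
    """
    # One key per file groups all triples about that file (hypothetical scheme).
    subject = "tag:example.org,2009:file/" + file_key
    triples = [(subject, DROID_NS + "hasFileName", file_name)]
    if puid is not None:
        triples.append((subject, DROID_NS + "hasIdentQuality", "Positive"))
        triples.append((subject, DROID_NS + "FileFormatHit/hasFormatName", format_name))
        triples.append((subject, DROID_NS + "FileFormatHit/hasPronomId", puid))
        return triples
    # Unidentified by the registry: fall back to the internal 3D list.
    extension = os.path.splitext(file_name)[1].lower()
    fallback = INTERNAL_3D_TYPES.get(extension)
    if fallback is None:
        triples.append((subject, DROID_NS + "hasIdentQuality", "Unidentified"))
    else:
        triples.append((subject, DROID_NS + "hasIdentQuality", "InternalList"))
        triples.append((subject, DROID_NS + "FileFormatHit/hasFormatName", fallback))
    return triples


dwg = identification_triples(
    "UUID1", "U2110_BHD12_Autocad.dwg",
    puid="http://www.nationalarchives.gov.uk/pronom/fmt/36",
    format_name="AutoCAD Drawing",
)
stp = identification_triples("UUID2", "U2110_BHD12_2007_05_09.stp")
unknown = identification_triples("UUID3", "notes.xyz")
```

In practice the key would be a generated UUID (for example from Python's `uuid.uuid4()`); fixed keys are used here so the output is reproducible.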
Conclusions
  • We have prototyped a framework for file system
    and file content metadata extraction. The
    relationship discovery from metadata is in
    progress.
  • We developed a metadata size prediction
    capability for file systems.
  • We empirically observed that the number of RDF
    triples generated for relationship discovery is,
    on average, about 20-30 per file, leading to a
    total of 8-12 million RDF triples for an
    average-size server.
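The arithmetic behind the last estimate can be checked with a short calculation. Pairing the lower bounds (20 triples per file, 8 million triples) and the upper bounds (30 per file, 12 million) is an assumption; under it, both ends imply the same file count, which the poster does not state explicitly.

```python
def implied_file_count(total_triples, triples_per_file):
    """Number of files implied by a triple total and a per-file average."""
    return total_triples // triples_per_file


low = implied_file_count(8_000_000, 20)    # lower bounds paired
high = implied_file_count(12_000_000, 30)  # upper bounds paired
print(low, high)
```

Both values come out to 400,000 files, so the stated range is consistent with an "average-size server" of roughly that many files.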

<tag://path,2009:file/UUID1> <http://path/2009/droid/IdFile/hasFileName> U2110_BHD12_Autocad.dwg
<tag://path,2009:file/UUID1> <http://path/2009/droid/IdFile/hasIdentQuality> Positive
<tag://path,2009:file/UUID1> <http://path/2009/droid/IdFile/FileFormatHit/hasFormatName> AutoCAD Drawing
<tag://path,2009:file/UUID1> <http://path/2009/droid/IdFile/FileFormatHit/hasFormatVersion> 2004-2005
<tag://path,2009:file/UUID1> <http://path/2009/droid/IdFile/FileFormatHit/hasPronomId> http://www.nationalarchives.gov.uk/pronom/fmt/36
<tag://path,2009:file/UUID1> <http://path/2009/droid/IdFile/FileFormatHit/hasMimeType> image/vnd.dwg
RDF triples generated for two engineering
drawings in TIFF and AutoCAD formats with PRONOM
unique IDs highlighted. A UUID is used as a key
for storing a set of triples about the same file.
Acknowledgments: This research was partially
supported by a National Archives and Records
Administration supplement to NSF PACI cooperative
agreement CA SCI-9619019.