Semantic Integration in Heterogeneous Databases Using Neural Networks

1
Semantic Integration in Heterogeneous Databases
Using Neural Networks
  • Wen-Syan Li, Chris Clifton
  • Presentation by Jeff Roth

2
Introduction
  • Basic schema matching problem
  • GTE's data integration project included 27,000
    data elements
  • This took 4 hours per data element, or 25
    full-time employees 2 years, to complete
  • This method takes 0.1 seconds per data element,
    144,000 times faster
  • Knowledge of how to match attributes is discovered
    automatically
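
The figures on this slide are consistent with one another, as a quick check shows. The 2,000 working hours per employee-year is an assumption made for this check, not a number from the slides.

```python
# Sanity check of the effort figures quoted on the slide.
DATA_ELEMENTS = 27_000
MANUAL_HOURS_PER_ELEMENT = 4
AUTOMATED_SECONDS_PER_ELEMENT = 0.1
HOURS_PER_EMPLOYEE_YEAR = 2_000  # assumption, not from the slides

# Total manual effort implied by 4 hours per data element.
manual_hours = DATA_ELEMENTS * MANUAL_HOURS_PER_ELEMENT

# Total effort implied by 25 full-time employees working for 2 years.
staff_hours = 25 * 2 * HOURS_PER_EMPLOYEE_YEAR

# Speedup per data element: 4 hours expressed in seconds, divided by 0.1 s.
speedup = MANUAL_HOURS_PER_ELEMENT * 3600 / AUTOMATED_SECONDS_PER_ELEMENT

print(manual_hours)    # 108000
print(staff_hours)     # 100000
print(round(speedup))  # 144000
```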

3
Method Outline
  • The end user is able to distinguish between
    unreasonable and reasonable answers, and exact
    results aren't critical. This method allows a
    user to obtain reasonable answers to queries that
    require database integration, at a low cost

4
Automated semantic integration methods
  • Attribute name comparison
  • Not used in this paper
  • Attribute values and domains comparison
  • Equal, Contains, Overlap, Contained-in, and
    Disjoint
  • Used in this paper, but not with the above measures
  • Field specifications
  • Data type, field length, constraints, and others
  • Also used in this method

5
Field Specifications
  • The following measures are used
  • data types
  • Each possible data type has a network input,
    with the field's data type having a value of 1 and
    all the others having a value of 0
  • field length
  • Length = 2 * (1/(1 + k^(-length)) - 0.5)
  • format specifications
  • similar to data type
  • constraints (primary key, foreign key,
    disallowing nulls, access restrictions, etc.)
  • similar to data type
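
The encoding described on this slide can be sketched as follows. This is a minimal illustration, not the authors' C implementation: the `DATA_TYPES` list, the two constraint flags, the function names, and the default `k = 1.01` are all assumptions chosen for the example.

```python
# Hypothetical list of data types; a real DBMS-specific parser would supply its own.
DATA_TYPES = ["char", "integer", "float", "date"]


def one_hot(value, choices):
    """One network input per choice: 1.0 for the matching choice, 0.0 for the rest."""
    return [1.0 if value == choice else 0.0 for choice in choices]


def normalize_length(length, k=1.01):
    """Squash a field length from [0, inf) into [0, 1).

    Uses 2 * (1 / (1 + k**-length) - 0.5); the value of k is an assumption.
    """
    return 2.0 * (1.0 / (1.0 + k ** (-length)) - 0.5)


def encode_field(data_type, length, not_null, is_key):
    """Build the network input vector for one field specification."""
    return (
        one_hot(data_type, DATA_TYPES)
        + [normalize_length(length)]
        + [float(not_null), float(is_key)]  # constraints encoded like data types
    )


# A 20-character, non-null, non-key text field.
print(encode_field("char", 20, not_null=True, is_key=False))
```

Every input ends up in the range 0 to 1, so no single measure dominates the others simply because of its scale.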

6
Attribute Values and Domains
  • Divide measures into character fields and numeric
    fields
  • Patterns for Character fields
  • 1. Ratio of numerical characters
  • Address "146 South 920 West" would score 6/18
  • 2. Ratio of white space
  • Address "146 South 920 West" would score 3/18
  • 3. Length statistics
  • Average, variance, and coefficient of variation of
    the used length relative to the maximum length
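
A minimal sketch of these character-field measures, reproducing the address example above. Pooling the ratios over all values in the column, the function name, and the declared width of 30 are assumptions made for this example.

```python
from statistics import mean, pvariance


def character_patterns(values, max_length):
    """Summarize a character column by digit ratio, white-space ratio, and length statistics."""
    total = sum(len(v) for v in values)
    if total == 0:
        return {"digit_ratio": 0.0, "space_ratio": 0.0,
                "length_mean": 0.0, "length_variance": 0.0, "length_cv": 0.0}

    digits = sum(ch.isdigit() for v in values for ch in v)
    spaces = sum(ch.isspace() for v in values for ch in v)

    # Used length of each value relative to the declared maximum length.
    used = [len(v) / max_length for v in values]
    avg = mean(used)
    var = pvariance(used)

    return {
        "digit_ratio": digits / total,
        "space_ratio": spaces / total,
        "length_mean": avg,
        "length_variance": var,
        "length_cv": (var ** 0.5) / avg,  # coefficient of variation
    }


# The address from the slide: 18 characters, 6 digits, 3 spaces.
patterns = character_patterns(["146 South 920 West"], max_length=30)
print(patterns["digit_ratio"], patterns["space_ratio"])
```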

7
Attribute Values and Domains cont.
  • Patterns for numeric fields
  • 1. Average (mean)
  • 2. Variance
  • 3. Coefficient of variation
  • Recognizes similarity between values of different
    units and granularity
  • This can also help recognize which fields may
    need unit conversions
  • 4. Grouping
  • For example, area code, zip code, or the first
    three digits of an SSN
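
The point about units can be seen in a small sketch: the mean and variance change when the same quantity is stored in different units, but the coefficient of variation does not. The sample weights and the function name are invented for illustration.

```python
from statistics import mean, pvariance


def numeric_patterns(values):
    """Return (mean, variance, coefficient of variation) for a numeric column."""
    avg = mean(values)
    var = pvariance(values)
    cv = (var ** 0.5) / abs(avg) if avg else 0.0
    return avg, var, cv


# The same hypothetical weights stored in kilograms and in pounds.
weights_kg = [50.0, 60.0, 70.0, 80.0]
weights_lb = [w * 2.20462 for w in weights_kg]

avg_kg, var_kg, cv_kg = numeric_patterns(weights_kg)
avg_lb, var_lb, cv_lb = numeric_patterns(weights_lb)

# Means and variances differ by the conversion factor (squared for variance),
# while the coefficients of variation agree up to rounding.
print(avg_kg, avg_lb)
print(cv_kg, cv_lb)
```

Two fields with matching coefficients of variation but different means are candidates for the same attribute recorded in different units.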

8
Self-Organizing Grouping algorithm
  • N = number of possible discriminators
  • M = number of categories; this can be adjusted by
    the user. Ideally this is |attributes| - |foreign
    keys|
  • This is unsupervised, i.e. you don't have to
    provide a correct classification; it simply
    groups based on similarity
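
To illustrate what grouping by similarity without labels looks like, here is a simple leader-style clustering. It is swapped in for illustration only and is not the self-organizing algorithm from the paper; the `radius` threshold stands in for the user's control over M, and all names are hypothetical.

```python
import math


def group_by_similarity(vectors, radius):
    """Put each vector in the first cluster whose center is within `radius`,
    or start a new cluster if none is close enough. No labels are needed."""
    centers = []  # running mean of each cluster
    members = []  # indices of the vectors assigned to each cluster
    for index, vector in enumerate(vectors):
        for c, center in enumerate(centers):
            if math.dist(vector, center) <= radius:
                members[c].append(index)
                n = len(members[c])
                # Move the center toward the new member (incremental mean).
                centers[c] = [m + (x - m) / n for m, x in zip(center, vector)]
                break
        else:
            centers.append(list(vector))
            members.append([index])
    return centers, members


# Four attributes described by N = 2 discriminators each.
vectors = [[0.0, 0.1], [0.05, 0.1], [0.9, 0.8], [1.0, 0.9]]
centers, members = group_by_similarity(vectors, radius=0.2)
print(members)  # [[0, 1], [2, 3]]
```

A larger `radius` merges more attributes into each category and so lowers M; a smaller one raises it.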

9
Training the Back-Prop Network
  • Inputs (N) are identical to the classifier's
  • Outputs (M) are trained using back-propagation
    and the classifier's results
  • Categories are labeled with the attributes they
    grouped together
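
A minimal sketch of this training step, assuming a single hidden layer of sigmoid units, squared-error loss, and per-example updates. The hidden-layer size, learning rate, epoch count, and the two sample cluster centers are assumptions for illustration, not values from the paper.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_backprop(inputs, targets, hidden=4, epochs=2000, rate=0.5, seed=0):
    """Train an N-input, M-output network and return a prediction function."""
    rng = random.Random(seed)
    n_in, n_out = len(inputs[0]), len(targets[0])
    # Each row holds the weights of one unit; the last entry is its bias.
    w_hid = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(hidden)]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(hidden + 1)] for _ in range(n_out)]

    def forward(x):
        xb = list(x) + [1.0]
        h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hid]
        hb = h + [1.0]
        y = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in w_out]
        return h, y

    for _ in range(epochs):
        for x, t in zip(inputs, targets):
            h, y = forward(x)
            # Output deltas for squared error with sigmoid units.
            d_out = [(yk - tk) * yk * (1.0 - yk) for yk, tk in zip(y, t)]
            # Hidden deltas: propagate the output deltas back through w_out.
            d_hid = [hj * (1.0 - hj) * sum(d_out[k] * w_out[k][j] for k in range(n_out))
                     for j, hj in enumerate(h)]
            for k in range(n_out):
                for j, v in enumerate(h + [1.0]):
                    w_out[k][j] -= rate * d_out[k] * v
            for j in range(hidden):
                for i, v in enumerate(list(x) + [1.0]):
                    w_hid[j][i] -= rate * d_hid[j] * v

    return lambda x: forward(x)[1]


# Cluster centers from the classifier (N = 2) and one output per category (M = 2).
cluster_centers = [[0.025, 0.1], [0.95, 0.85]]
category_targets = [[1.0, 0.0], [0.0, 1.0]]
predict = train_backprop(cluster_centers, category_targets)

# Each output can be read as the similarity of a new attribute to one category.
print(predict([0.9, 0.8]))
```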

10
What is the classifier for?
  • Ease of training
  • Ideally M is |attributes| - |foreign keys|,
    and it is less computationally expensive to train
    M classifications where M < |attributes| -
    |foreign keys|
  • It is less computationally complex to compare new
    elements to the M classifications than to every
    attribute of the training database, or
    |attributes| - |foreign keys|
  • Networks can be trained even when some
    attributes are identical

11
Integration Procedure
  • 1. DBMS Specific Parser
  • 2. Classify (Categorize) Training Data
  • 3. Train Neural Network
  • 4. DBMS Specific Parser
  • 5. Classification by Neural Network
  • 6. User Checks Results

12
Results

13
Conclusion and Future Work
  • Human Effort needed for semantic integration is
    minimized
  • Different Systems have different attribute
    properties available - automated solution
  • Extend to automated information integration
  • C source code available at eecs.nwu.edu/pub/semint