L. Padmasree - PowerPoint PPT Presentation

Learn more at: http://www.bibalex.org

Transcript and Presenter's Notes

1
Signature Based Duplicate Detection in Digital
Libraries
  • L. Padmasree
  • Vamshi Ambati
  • J. Anand Chandulal
  • M. Sreenivasa Rao

School of Information Technology, JNT University,
Hyderabad, 500 072, India. srmeda_at_gmail.com
2
Motivation
  • Books scanned in Digital Libraries are
    procured from varied sources.
  • Scanning centers are distributed across the
    country.
  • Duplicates could arise between scanning points.
  • Pre-scanning duplicate detection is required

3
Challenges
  • Duplicate detection relies on metadata (title,
    author, publishing year, edition, etc.).
  • Metadata is entered by different operators, so it
    may be incorrect or incomplete.
  • Errors include typographical mistakes, word
    disorder, inconsistent abbreviations, and even
    missing words.
  • These errors make duplicate detection more
    difficult.
  • Duplicate detection must be both fast and
    accurate.

4
RELATED WORK
  • Most traditional methods based on string
    similarity are either character-based or
    vector-space-based techniques.
  • Character-based techniques rely on character edit
    operations, such as deletions, insertions, and
    substitutions, and on subsequence comparison.
  • Vector-space-based techniques transform strings
    into vector representations on which similarity
    computations are conducted.
  • The present work uses an efficient and fast
    duplicate detection technique based on
    similarity search.
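The two traditional families described above can be illustrated with a minimal Python sketch. It is not part of the presented approach: `edit_distance` is a standard Levenshtein implementation, and `cosine_similarity` uses word-count vectors, which is only one of several possible vector space models. The function names and the sample strings are chosen for illustration.

```python
from collections import Counter
from math import sqrt


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between word-count vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


title = "The Arts of Japan"
misspelled = "The Ars of Japa"
jumbled = "Japan of The Arts"

print(edit_distance(title, misspelled))      # character-based view
print(cosine_similarity(title, jumbled))     # vector-space view
```

The character-based measure is sensitive to word order, while the word-count vector ignores order but treats every misspelled word as a different word.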

5
Our Approach
  • Uses the signature file method
  • Uses similarity search to find duplicates by
    close-proximity match
  • Language independent
  • Fast and accurate
  • Uses an online tool for customization

6
The Process
  • Metadata is created at the scanning centers.
  • A signature is computed for each metadata record
    using the superimposed coding technique and a
    hashing method.
  • The signature is stored in a central repository.
  • Before a book is scanned, its metadata is
    submitted as a query.
  • The query signature is computed with the same
    technique.
  • Similarity search returns the closest match as
    the duplicate.

7
Duplicate Detection in Digital Library System
(Figure: Duplicate Detection Technique)
8
Example of the process
Books data in the central repository:
Metadata of books                                          Signature
The Meaning And Teaching Of Music -Will Earhart            011111110000101111100011111011
Some Famous Singers Of The 19th Century -Francis Rogers    111001010000001001111110110110
A Dictionary of Musical Terms - Dr.th.baker                111100101000110100000111111111
The Arts of Japan - Edward Dillon                          111101100000000000000011001111
Example query: The Arts of Japan - Edward Dillon
Query type        Query text                          Signature
Spell mistakes    The Ars of Japa Edward Dilon        111101100001110000000011001111
Missing words     The of Japan - Edward Dillon        111101100000011000000011001111
Jumbled words     Dillon Edward -The Japan of Arts    111101100000100000000011001111
Result: The Arts of Japan - Edward Dillon
9
Superimposed Coding Technique
  • In the superimposed coding technique, each record
    is mapped to its own binary signature.
  • A record is the title of the book, the author
    name, or a combination of the two.
  • The records in both the training data and the
    testing data are encoded as binary signatures.
  • The signature of a record is obtained by
    superimposing the signatures of its words with
    the OR operation.

Word           Signature
Computer       1100 1000 0100
Programming    0001 0101 0100
Signature of the book (bitwise OR): 1101 1101 0100
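The superimposing step can be sketched in a few lines of Python. The word signatures below assume that the six 4-bit groups on the slide are read column-wise (Computer = 1100 1000 0100, Programming = 0001 0101 0100), which is the reading under which the OR reproduces the stated book signature. The function name `superimpose` is hypothetical.

```python
def superimpose(word_signatures):
    """OR the word signatures together to obtain the record signature."""
    record = 0
    for sig in word_signatures:
        record |= sig
    return record


# Word signatures from the slide example, read column-wise (an assumption).
computer = 0b110010000100      # 1100 1000 0100
programming = 0b000101010100   # 0001 0101 0100

book = superimpose([computer, programming])
print(format(book, "012b"))    # 110111010100, i.e. 1101 1101 0100
```

Because OR is commutative, a record made of the same words in a different order gets the same signature, which is what lets the method tolerate word disorder.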
10
The Hashing method
  • The signature of each word is obtained by a
    hashing method.
  • The hash function H(w) computes a hash value for
    the word w and maps the word to one of the
    available bit patterns.
  • The hash function uses a shift-and-add strategy:
    the ASCII values of the characters in the word
    are added and shifted to compute the hash value.
  • The final hash value is obtained by taking the
    result modulo nCr.
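A minimal sketch of how such a word signature could be computed, assuming n is the signature length in bits and r is the number of bits set per word (the slides use n and r without defining them). The shift width of 5 and the mapping from hash value to bit pattern (the combinatorial number system) are assumptions made for illustration. They are not details taken from the presentation, and all function names are hypothetical.

```python
from math import comb


def shift_add_hash(word, n, r, shift=5):
    """Shift-and-add over character codes, reduced modulo nCr.

    The shift width is an assumption; the slides do not state it.
    """
    h = 0
    for ch in word:
        h = (h << shift) + ord(ch)
    return h % comb(n, r)


def pattern_from_index(index, n, r):
    """Return the index-th n-bit pattern that has exactly r bits set.

    Assumed mapping: unrank `index` in the combinatorial number system,
    deciding bit positions from the highest down to the lowest.
    """
    pattern, k = 0, r
    for pos in range(n - 1, -1, -1):
        if k and index >= comb(pos, k):
            pattern |= 1 << pos
            index -= comb(pos, k)
            k -= 1
    return pattern


def word_signature(word, n=12, r=4):
    """Signature of one word: hash it, then pick the matching bit pattern."""
    return pattern_from_index(shift_add_hash(word, n, r), n, r)


for w in ("Computer", "Programming"):
    print(w, format(word_signature(w), "012b"))
```

With n = 12 and r = 4 there are 495 possible patterns, and every index below that count maps to a distinct pattern with exactly four bits set. The patterns printed here will generally differ from those on the slides, since the actual hash parameters are not given.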

11
Duplicate Detection in Digital Library System
  • The Similarity Match Algorithm for Library
    Database
  • Input: L, a library database consisting of
    documents D1, D2, ..., Dm, and a query Q.
  • Output: B, the book corresponding to query Q.
  • Procedure Library (D1, D2, ..., Dm, Q: in; B:
    out)
  • for i = 1 to m do
  • Si = superimposed-coding (Di)
  • end do
  • X = superimposed-coding (Q)
  • O = Jaccard (S1, S2, ..., Sm, X)
  • Look up in library database L the book B
    (document) whose signature has the minimum
    Jaccard distance to X.
  • End
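The procedure can be sketched in Python using the records and signatures from the worked example earlier in the deck. To keep the sketch short, a lookup table of those published signatures stands in for the superimposed-coding step, which is an assumption. The distance used is the standard Jaccard distance, and the names `SIGNATURES`, `DOCUMENTS`, `QUERIES`, and `library` are hypothetical.

```python
# Signatures copied from the worked example; this table stands in for
# superimposed-coding (an assumption made to keep the sketch short).
SIGNATURES = {
    "The Meaning And Teaching Of Music -Will Earhart": "011111110000101111100011111011",
    "Some Famous Singers Of The 19th Century -Francis Rogers": "111001010000001001111110110110",
    "A Dictionary of Musical Terms - Dr.th.baker": "111100101000110100000111111111",
    "The Arts of Japan - Edward Dillon": "111101100000000000000011001111",
    # Query signatures from the same example.
    "The Ars of Japa Edward Dilon": "111101100001110000000011001111",
    "The of Japan - Edward Dillon": "111101100000011000000011001111",
    "Dillon Edward -The Japan of Arts": "111101100000100000000011001111",
}
DOCUMENTS = list(SIGNATURES)[:4]
QUERIES = list(SIGNATURES)[4:]


def superimposed_coding(text):
    """Stand-in: return the precomputed signature as an integer."""
    return int(SIGNATURES[text], 2)


def jaccard(target, query):
    """Standard Jaccard distance between two signatures held as integers."""
    common = bin(target & query).count("1")     # bits set in both
    differing = bin(target ^ query).count("1")  # bits set in exactly one
    union = common + differing
    return differing / union if union else 0.0


def library(documents, query):
    """Return the document whose signature is closest to the query's."""
    signatures = [superimposed_coding(d) for d in documents]  # S1..Sm
    x = superimposed_coding(query)                            # X
    distances = [jaccard(s, x) for s in signatures]           # O
    best = min(range(len(documents)), key=lambda i: distances[i])
    return documents[best]


for q in QUERIES:
    print(q, "->", library(DOCUMENTS, q))
```

This linear scan compares the query against every stored signature, which is the simplest way to realize the lookup step. The slides do not say how the central repository is indexed.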

12
Jaccard Distance
  • The Jaccard distance between the query signature
    and the target signature is obtained from the
    expression
  • d = (r + s) / (q + r + s + t)
  • q - the number of bits equal to 1 in both the
    target and query signatures.
  • r - the number of bits equal to 1 in the target
    signature but 0 in the query signature.
  • s - the number of bits equal to 0 in the target
    signature but 1 in the query signature.
  • t - the number of bits equal to 0 in both the
    target and query signatures.
  • Note that the standard Jaccard distance leaves t
    out of the denominator: d = (r + s) / (q + r + s).
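The four counts and the resulting distance can be checked with a short Python sketch. Since the expression on the slide is garbled and appears to include t, the sketch computes both the standard Jaccard distance (t omitted) and the variant with t in the denominator, which is the simple matching distance. The 8-bit signatures are made up for illustration, and the function names are hypothetical.

```python
def bit_counts(target: str, query: str):
    """Count q, r, s, t over two equal-length bit strings."""
    if len(target) != len(query):
        raise ValueError("signatures must have the same length")
    q = r = s = t = 0
    for a, b in zip(target, query):
        if a == "1" and b == "1":
            q += 1      # 1 in both
        elif a == "1":
            r += 1      # 1 in target only
        elif b == "1":
            s += 1      # 1 in query only
        else:
            t += 1      # 0 in both
    return q, r, s, t


def jaccard_distance(target: str, query: str) -> float:
    """Standard Jaccard distance: bits that are 0 in both are ignored."""
    q, r, s, _ = bit_counts(target, query)
    return (r + s) / (q + r + s) if q + r + s else 0.0


def simple_matching_distance(target: str, query: str) -> float:
    """Variant with t in the denominator (fraction of differing bits)."""
    q, r, s, t = bit_counts(target, query)
    total = q + r + s + t
    return (r + s) / total if total else 0.0


target, query = "11010100", "11000110"    # made-up example signatures
print(bit_counts(target, query))           # (3, 1, 1, 3)
print(jaccard_distance(target, query))     # 0.4
print(simple_matching_distance(target, query))  # 0.25
```

For sparse signatures the two measures can rank candidates differently, because t grows with the signature length while q, r, and s do not.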

13
False drops
  • False drops are minimized by an appropriate
    choice of the two parameters n and r.
  • Online Tool
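One way to see why n and r matter is to estimate how densely a record signature fills with 1 bits. The sketch below assumes n is the signature length, r is the number of bits set per word, and each word's pattern is drawn uniformly and independently. These are modeling assumptions, not statements from the slides. Under them, a given bit stays 0 after one word with probability 1 - r/n.

```python
def expected_density(n: int, r: int, words: int) -> float:
    """Expected fraction of 1 bits after OR-ing `words` random r-of-n patterns."""
    return 1.0 - (1.0 - r / n) ** words


# Hypothetical parameter choices for a record of 7 words.
for n, r in [(12, 4), (30, 4), (64, 4)]:
    print(n, r, round(expected_density(n, r, words=7), 3))
```

A signature that is nearly all 1s carries little information, so unrelated records start to look alike. Longer signatures or fewer bits per word lower the density, at the cost of storage or discriminating power.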

14
EXPERIMENTAL RESULTS
Metadata   Query: spell mistakes     Query: missing words      Query: jumbled words
records    False drop (%)  DR (%)    False drop (%)  DR (%)    False drop (%)  DR (%)
1000       7               93        9               91        3               97
5000       8               92        10              90        5               95
23000      10              90        12              88        5               95
DR = Detection Rate
15
Scalability and accuracy of duplicate detection
system
18
CONCLUSION
  • An effective and efficient duplicate detection
    technique is proposed.
  • Duplicates are detected by similarity search
    using the signature file method, which finds
    duplicates despite typographical mistakes, word
    disorder, inconsistent abbreviations, and even
    missing words.
  • The technique is language independent and
    achieves high performance, with 95% accuracy.

19
Questions?