Multimedia Databases - PowerPoint PPT Presentation

Title:

Multimedia Databases

Provided by: GeorgeK159
Learn more at: https://www.cs.bu.edu

1
Multimedia Databases
  • Text I

2
Outline
  • Spatial Databases
  • Temporal Databases
  • Spatio-temporal Databases
  • Data Mining
  • Multimedia Databases
  • Text databases
  • Image and video databases
  • Time Series databases

3
Multimedia Data Management
  • The need to query and analyze vast amounts of
    multimedia data (e.g., images, sound tracks,
    video tracks) has increased in recent years.
  • Joint research from Database Management,
    Computer Vision, Signal Processing and Pattern
    Recognition aims to solve problems related to
    multimedia data management.

4
Multimedia Data
  • There are four major types of multimedia data:
    images, video sequences, sound tracks, and text.
  • Of these, the easiest type to manage is
    text, since we can order, index, and search text
    using string management techniques, etc.
  • Management of simple sounds is also possible by
    representing audio as signal sequences over
    different channels.
  • Image retrieval has received a lot of attention
    in the last decade (CV and DBs). The main
    techniques can be extended and applied also for
    video retrieval.

5
Content-based Image Retrieval
  • Images were traditionally managed by first
    annotating their contents and then using
    text-retrieval techniques to index them.
  • However, with the increase of information in
    digital image format, some drawbacks of this
    technique were revealed:
  • Manual annotation requires a vast amount of labor
  • Different people may perceive the contents of
    an image differently, so no objective keywords
    for search are defined
  • A new research field was born in the 90s:
    Content-based Image Retrieval aims at indexing
    and retrieving images based on their visual
    contents.

6
Feature Extraction
  • The basis of Content-based Image Retrieval is to
    extract and index some visual features of the
    images.
  • There are general features (e.g., color,
    texture, shape, etc.) and domain-specific
    features (e.g., objects contained in the image).
  • Domain-specific feature extraction can vary with
    the application domain and is based on pattern
    recognition
  • On the other hand, general features can be used
    independently from the image domain.

7
Color Features
  • To represent the color of an image compactly, a
    color histogram is used. Colors are partitioned
    into k groups according to their similarity, and the
    percentage of each group in the image is
    measured.
  • Images are transformed to k-dimensional points
    and a distance metric (e.g., Euclidean distance)
    is used to measure the similarity between them.
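
The histogram idea above can be sketched as follows. This is a minimal illustration, assuming images are given as non-empty lists of (r, g, b) tuples in 0-255 and that colors are grouped by uniform quantization of each channel; the function names and the grouping rule are assumptions, not part of the slides.

```python
import math

def color_histogram(pixels, bins_per_channel=2):
    """Map an image to a k-dimensional point, k = bins_per_channel ** 3.

    Each coordinate is the fraction of pixels that fall into one color group.
    Uniform per-channel quantization is an assumed grouping rule.
    """
    k = bins_per_channel ** 3
    step = 256 / bins_per_channel
    counts = [0] * k
    for r, g, b in pixels:
        # Combine the three per-channel bin numbers into one group index.
        idx = (int(r // step) * bins_per_channel + int(g // step)) * bins_per_channel + int(b // step)
        counts[idx] += 1
    return [c / len(pixels) for c in counts]

def euclidean(p, q):
    """Euclidean distance between two k-dimensional points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

# Two tiny hypothetical "images": mostly red vs. mostly blue.
reddish = [(250, 10, 10)] * 3 + [(10, 10, 250)]
bluish = [(10, 10, 250)] * 3 + [(250, 10, 10)]
h_red, h_blue = color_histogram(reddish), color_histogram(bluish)
print(euclidean(h_red, h_red), euclidean(h_red, h_blue))
```

Identical images land on the same point (distance 0), while images with different color mixes end up further apart.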

[Figure: an image mapped to k histogram bins, i.e., a point in k-dimensional space]

8
Using Transformations to Reduce Dimensionality
  • In many cases the embedded dimensionality of a
    search problem is much lower than the actual
    dimensionality
  • Some methods apply transformations on the data
    and approximate them with low-dimensional vectors
  • The aim is to reduce dimensionality and at the
    same time maintain the data characteristics
  • If d(a,b) is the distance between two objects a,
    b in the real (high-dimensional) space and d'(a,b) is
    their distance in the transformed low-dimensional
    space, we want d'(a,b) ≈ d(a,b).
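
One possible transformation, shown only as a sketch: replace each adjacent pair of coordinates by their sum scaled by 1/sqrt(2) (the coarse coefficients of an orthonormal, Haar-like transform). This particular transform is an assumption chosen for illustration; the slides do not name one. With it, the reduced distance never exceeds the original Euclidean distance, and it stays close to it when neighboring coordinates differ by similar amounts.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def reduce_pairs(v):
    """Halve the dimensionality: each adjacent pair (x, y) becomes (x + y) / sqrt(2)."""
    return [(v[i] + v[i + 1]) / math.sqrt(2) for i in range(0, len(v) - 1, 2)]

# Hypothetical 8-dimensional feature vectors.
a = [1.0, 1.1, 4.0, 4.2, 2.0, 2.1, 0.5, 0.4]
b = [1.2, 1.0, 3.5, 3.9, 2.5, 2.4, 0.1, 0.3]

d_high = euclidean(a, b)                              # distance in the original space
d_low = euclidean(reduce_pairs(a), reduce_pairs(b))   # distance in the 4-dimensional space
print(d_high, d_low)
```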

[Figure: d(a,b) in the original space vs. d'(a,b) in the transformed space]

9
Text - Detailed outline
  • Text databases
  • problem
  • full text scanning
  • inversion
  • signature files
  • clustering
  • information filtering and LSI

10
Problem - Motivation
  • Given a database of documents, find documents
    containing 'data', 'retrieval'
  • Applications:
  • Web
  • law + patent offices
  • digital libraries
  • information filtering

11
Problem - Motivation
  • Types of queries:
  • boolean ('data' AND 'retrieval' AND NOT ...)
  • additional features ('data' ADJACENT 'retrieval')
  • keyword queries ('data', 'retrieval')
  • How to search a large collection of documents?

12
Full-text scanning
  • for single term
  • (naive: O(NM))
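
A minimal sketch of the naive scan, assuming N is the text length and M the pattern length: every alignment is tried, and each one compares up to M characters.

```python
def naive_search(text, pattern):
    """Return the start positions of pattern in text, trying every alignment."""
    n, m = len(text), len(pattern)
    hits = []
    for start in range(n - m + 1):
        j = 0
        # Compare left to right until a mismatch or the end of the pattern.
        while j < m and text[start + j] == pattern[j]:
            j += 1
        if j == m:
            hits.append(start)
    return hits

print(naive_search("ABRACADABRA", "CAB"))   # pattern from the slide
print(naive_search("ABRACADABRA", "ABRA"))
```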

text: ABRACADABRA
pattern: CAB

13
Full-text scanning
  • for single term
  • (naive: O(NM))
  • Knuth, Morris and Pratt (77)
  • build a small FSA; visit every text letter once
    only, by carefully shifting more than one step
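
The Knuth-Morris-Pratt idea can be sketched as below. The failure table plays the role of the small automaton: after a mismatch it says how far the pattern can shift without re-reading text letters. Function names are illustrative.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1] that is also its suffix."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(text, pattern):
    """Return the start positions of pattern in text; each text letter is read once."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    hits, k = [], 0          # k = number of pattern letters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]  # shift the pattern, keep the reusable prefix
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits

print(kmp_search("ABRACADABRA", "ABRA"))
```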

text: ABRACADABRA
pattern: CAB

14
Full-text scanning
text: ABRACADABRA
pattern: CAB
[Figure: the pattern CAB aligned at successive positions under the text]

15
Full-text scanning
  • for single term
  • (naive: O(NM))
  • Knuth, Morris and Pratt (77)
  • Boyer and Moore (77)
  • preprocess pattern; start from right to left;
    skip!
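
A sketch of the right-to-left, skip-ahead idea. Note that this is the Horspool simplification (bad-character shifts only), not the full Boyer-Moore algorithm with its good-suffix rule; it is swapped in here because it is shorter and shows the same skipping behavior.

```python
def horspool_search(text, pattern):
    """Return the start positions of pattern in text (Boyer-Moore-Horspool variant)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Preprocess the pattern: for each letter except the last one, the distance
    # from its last occurrence to the end of the pattern.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    hits, pos = [], 0
    while pos <= n - m:
        j = m - 1
        # Compare from right to left.
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            hits.append(pos)
        # Skip: letters that do not occur in the pattern allow a jump of m.
        pos += shift.get(text[pos + m - 1], m)
    return hits

print(horspool_search("ABRACADABRA", "CAB"))
print(horspool_search("ABRACADABRA", "ABRA"))
```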

text: ABRACADABRA
pattern: CAB

16
Full-text scanning
  • Approximate matching - string editing distance
  • d('survey', 'surgery') = 2
  • = min # of insertions, deletions,
    substitutions to transform the first string
    into the second
  • SURVEY
  • SURGERY

17
Full-text scanning
  • string editing distance - how to compute?
  • A: dynamic programming
  • Let s and t be two strings
  • Then, start from the end and try to find the
    minimum number of operations
  • cost(i, j) = cost to match the prefix
    of length i of the first string s with the prefix of
    length j of the second string t

18
Full-text scanning
  • if s[i] = t[j] then
  •   cost(i, j) = cost(i-1, j-1)
  • else
  •   cost(i, j) = min (
  •     1 + cost(i, j-1),     // insertion (of t[j])
  •     1 + cost(i-1, j-1),   // substitution
  •     1 + cost(i-1, j)      // deletion (of s[i])
  •   )
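
The recurrence can be filled in bottom-up as a table. This sketch adds the base cases, which the slide leaves implicit: matching a prefix against the empty string costs as many operations as the prefix has letters.

```python
def edit_distance(s, t):
    """Minimum number of insertions, deletions and substitutions to turn s into t."""
    n, m = len(s), len(t)
    # cost[i][j] = cost to match the prefix of length i of s with the prefix of length j of t
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i            # delete all i letters of s
    for j in range(1, m + 1):
        cost[0][j] = j            # insert all j letters of t
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                cost[i][j] = cost[i - 1][j - 1]
            else:
                cost[i][j] = 1 + min(
                    cost[i][j - 1],      # insertion of t[j]
                    cost[i - 1][j - 1],  # substitution
                    cost[i - 1][j],      # deletion of s[i]
                )
    return cost[n][m]

print(edit_distance("survey", "surgery"))  # 2, as on the earlier slide
```

The table has (N+1) x (M+1) cells and each cell takes constant time to fill.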

19
Full-text scanning
  • Complexity: O(MN)
  • More in your algorithms book
  • Conclusions:
  • Full text scanning needs no space overhead, but
    is slow for large datasets

20
Text - Detailed outline
  • text
  • problem
  • full text scanning
  • inversion
  • signature files
  • clustering
  • information filtering and LSI

21
Text: Inverted Files

22
Text: Inverted Files
  • Q: space overhead?
  • A: mainly, the postings lists

23
Text: Inverted Files
  • how to organize dictionary?
  • stemming Y/N?
  • Keep only the root of each word, e.g., inverted,
    inversion -> invert
  • insertions?

24
Text: Inverted Files
  • how to organize dictionary?
  • B-tree, hashing, TRIEs, PATRICIA trees, ...
  • stemming Y/N?
  • insertions?
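
A minimal sketch of an inverted file, assuming the dictionary is organized by hashing (a Python dict) and that each postings list is a sorted list of document ids. The sample documents and the whitespace tokenizer are hypothetical; no stemming is applied.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict of doc_id -> text. Returns word -> sorted postings list of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

docs = {
    1: "data retrieval systems",
    2: "signature files for text retrieval",
    3: "data mining",
}
index = build_inverted_index(docs)
print(index["data"], index["retrieval"])
```

A lookup touches only the postings list of the query word, which is why inversion is fast; the price is the space for the dictionary plus the postings lists.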

25
Text: Inverted Files
  • more on postings lists: Zipf distr., e.g., the
    rank-frequency plot of the Bible
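
As a quick numeric check of the Zipf form freq ~ 1 / (rank * ln(1.78 V)), the sketch below sums the predicted frequencies over all ranks. It reads V as the vocabulary size, which is an assumption about the slide's notation; the value V = 10000 is arbitrary.

```python
import math

def zipf_frequency(rank, vocabulary_size):
    """Predicted relative frequency of the word with the given rank."""
    return 1.0 / (rank * math.log(1.78 * vocabulary_size))

V = 10000  # hypothetical vocabulary size
total = sum(zipf_frequency(r, V) for r in range(1, V + 1))
print(zipf_frequency(1, V), zipf_frequency(2, V), total)
```

The frequencies sum to roughly 1 because the harmonic sum 1 + 1/2 + ... + 1/V is close to ln(V) + 0.577, and e^0.577 is about 1.78. A few very frequent words therefore own very long postings lists, while most words have short ones.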

[Figure: log(freq) vs. log(rank) plot; freq ~ 1 / (rank * ln(1.78 V))]

26
Text: Inverted Files
  • postings lists
  • [Cutting and Pedersen]
  • (keep first 4 in B-tree leaves)
  • how to allocate space: [Faloutsos92]
  • geometric progression
  • compression (Elias codes) [Zobel]: down to 2%
    overhead!
  • Conclusions: needs space overhead (2%-300%), but
    it is the fastest
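
To illustrate why Elias codes help, here is a sketch that stores a postings list as gaps between consecutive document ids and writes each gap in Elias gamma code. The choice of the gamma variant and of gap encoding is an assumption; the slide only says "Elias codes". Bits are kept as a string for readability.

```python
def elias_gamma(n):
    """Elias gamma code of a positive integer, as a string of '0'/'1'."""
    if n < 1:
        raise ValueError("Elias gamma codes are defined for n >= 1")
    binary = bin(n)[2:]
    # (length - 1) zeros announce how many bits follow the leading 1.
    return "0" * (len(binary) - 1) + binary

def encode_postings(doc_ids):
    """Encode a sorted list of distinct doc ids (>= 1) as gamma-coded gaps."""
    bits, previous = [], 0
    for doc_id in doc_ids:
        bits.append(elias_gamma(doc_id - previous))
        previous = doc_id
    return "".join(bits)

def decode_postings(bits):
    """Inverse of encode_postings."""
    doc_ids, pos, current = [], 0, 0
    while pos < len(bits):
        zeros = 0
        while bits[pos] == "0":
            zeros += 1
            pos += 1
        current += int(bits[pos:pos + zeros + 1], 2)
        pos += zeros + 1
        doc_ids.append(current)
    return doc_ids

encoded = encode_postings([3, 4, 9])
print(encoded, decode_postings(encoded))
```

Small gaps, which dominate the long postings lists of frequent words, get the shortest codes.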

27
Text - Detailed outline
  • text
  • problem
  • full text scanning
  • inversion
  • signature files
  • Information Retrieval
  • Vector Model and clustering
  • information filtering and LSI

28
Signature files
  • idea: 'quick & dirty' filter

29
Signature files
  • idea: 'quick & dirty' filter
  • then, do seq. scan on sign. file and discard
    false alarms
  • Adv.: easy insertions; faster than seq. scan
  • Disadv.: O(N) search (with small constant)
  • Q: how to extract signatures?

30
Signature files
  • A: superimposed coding!! [Mooers49], ...

m = 4 bits/word, F = 12 bits signature size

31
Signature files
  • A: superimposed coding!! [Mooers49], ...

query 'data': actual match

32
Signature files
  • A: superimposed coding!! [Mooers49], ...

query 'retrieval': actual dismissal

33
Signature files
  • A: superimposed coding!! [Mooers49], ...

query 'nucleotic': false alarm (false drop)

34
Signature files
  • A: superimposed coding!! [Mooers49], ...
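
Superimposed coding can be sketched as follows, using the slide's parameters m = 4 and F = 12. How a word is mapped to bit positions is not given in the transcript; hashing with SHA-256 is an assumption made here only so that the sketch is deterministic.

```python
import hashlib

F = 12  # signature size in bits
M = 4   # bit positions chosen per word (positions may coincide)

def word_signature(word, f=F, m=M):
    """Word signature: m hashed bit positions out of f, set to 1 (hash choice is assumed)."""
    sig = 0
    for i in range(m):
        digest = hashlib.sha256(f"{i}:{word}".encode()).digest()
        sig |= 1 << (int.from_bytes(digest[:4], "big") % f)
    return sig

def doc_signature(words):
    """Document signature: the word signatures OR-ed (superimposed) together."""
    sig = 0
    for w in words:
        sig |= word_signature(w)
    return sig

def maybe_contains(doc_sig, word):
    """True means MAYBE (possible false drop); False means the word is definitely absent."""
    q = word_signature(word)
    return (doc_sig & q) == q

doc = doc_signature(["data", "base"])
print(format(doc, "012b"), maybe_contains(doc, "data"), maybe_contains(doc, "retrieval"))
```

A word that is in the document always passes the test; a word that is not may still pass if its bits happen to be covered by other words, which is why qualifying documents must be checked afterwards.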

'YES' is 'MAYBE'; 'NO' is 'NO'

35
Signature files
  • Q1: How to choose F and m?
  • Q2: Why is it called 'false drop'?
  • Q3: other apps of signature files?

36
Signature files
  • Q1: How to choose F and m?
  • A: so that the doc. signature is 50% full
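
The 50% rule can be checked numerically. The sketch assumes each of the m bit positions of a word is chosen independently and uniformly among the F positions, and that a document has D distinct words (D is a symbol introduced here). Under that assumption the expected fraction of 1s is 1 - (1 - 1/F)^(m*D), and it is about one half when F * ln 2 = m * D.

```python
import math

def expected_fill(f, m, distinct_words):
    """Expected fraction of 1 bits in a document signature (independent uniform bits assumed)."""
    return 1.0 - (1.0 - 1.0 / f) ** (m * distinct_words)

def bits_per_word_for_half_full(f, distinct_words):
    """Value of m that makes the signature about 50% full: m = F * ln 2 / D."""
    return f * math.log(2) / distinct_words

# The slide's parameters (m = 4, F = 12) with a hypothetical 2-word document.
print(expected_fill(12, 4, 2))

# A larger hypothetical setting: F = 1024 bits, D = 40 distinct words.
m_opt = bits_per_word_for_half_full(1024, 40)
print(m_opt, expected_fill(1024, m_opt, 40))
```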

m = 4 bits/word, F = 12 bits signature size

37
Signature files
  • Q1: How to choose F and m?
  • Q2: Why is it called 'false drop'?
  • Q3: other apps of signature files?

38
Signature files
  • Q2: Why is it called 'false drop'?
  • Old, but fascinating story
  • how to find qualifying books (by title word,
    and/or author, and/or keyword)
  • in O(1) time? (without computers...)

39
Signature files
  • Solution: edge-notched cards
  • each title word is mapped to m numbers (how?)
  • and the corresponding holes are cut out

[Figure: card with edge holes numbered 1, 2, ..., 40]

40
Signature files
  • Solution: edge-notched cards

[Figure: card with edge holes numbered 1, 2, ..., 40; 'data' -> holes 1, 39]

41
Signature files
  • Search, e.g., for 'data': activate needles 1 and
    39, and shake the stack of cards!

[Figure: card with edge holes numbered 1, 2, ..., 40; 'data' -> holes 1, 39]

42
Signature files
  • Q3: other apps of signature files?
  • A: anything that has to do with membership
    testing: does 'data' belong to the set of words
    of the document?

43
Signature files
  • Another name: Bloom filters
  • Bloom-joins in System R* [Mackert] and 'active
    disks' [Riedel99]
  • differential files [Severance and Lohman]
  • Many other applications in networks and databases
  • http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/BloomFilterSurvey.pdf

44
Information Retrieval
  • What is the goal of IR?
  • Build a system that retrieves documents that
    users are likely to find relevant to their
    queries.
  • This set of assumptions underlies the field of
    Information Retrieval.

45
Some IR History
  • Roots in the scientific Information Explosion
    following WWII
  • Interest in computer-based IR from mid 1950s
  • Probabilistic models at Rand (Maron & Kuhns)
    (1960)
  • Boolean system development at Lockheed (60s)
  • Vector Space Model (Salton at Cornell 1965)
  • Statistical Weighting methods and theoretical
    advances (70s)
  • User Interfaces, Large-scale testing and
    application (90s)

46
Query Languages
  • A way to express the question (information need)
  • Types
  • Boolean
  • Natural Language
  • Stylized Natural Language
  • Form-Based (GUI)

47
Simple query language Boolean
  • Terms + Connectors (or operators)
  • terms:
  • words
  • normalized (stemmed) words
  • phrases
  • thesaurus terms
  • connectors:
  • AND
  • OR
  • NOT
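
The three connectors map directly onto set operations over the documents that contain each term. A small sketch, with hypothetical postings and document ids:

```python
# Hypothetical postings: term -> set of ids of the documents that contain it.
postings = {
    "data": {1, 3, 4},
    "retrieval": {1, 2, 4},
    "mining": {3, 4},
}
all_docs = {1, 2, 3, 4}

def docs_with(term):
    """Documents containing the term (empty set for unknown terms)."""
    return postings.get(term, set())

and_result = docs_with("data") & docs_with("retrieval")      # data AND retrieval
or_result = docs_with("data") | docs_with("mining")          # data OR mining
not_result = all_docs - docs_with("mining")                  # NOT mining
combined = and_result - docs_with("mining")                  # data AND retrieval AND NOT mining
print(and_result, or_result, not_result, combined)
```

NOT needs the full set of document ids, which is why it is usually combined with AND, as in the last query.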

48
Boolean Logic
[Figure: diagram with regions A and B]

49
[Figure labels: User Query; Collections of documents, files, etc.; text input]

50
[Figure labels: Information need; Collections; text input; Reformulated Query]