A quick tour of the datasets for VLDB 2008 (does not include datasets already in the UCR archive)

1
A quick tour of the datasets for VLDB 2008 (does
not include datasets already in the UCR archive)
2
Formatting Note
Some Name
I measured the accuracy of 1NN-ED on the training
set (only). This was to make sure we do not
have any formatting misunderstandings. You should
test 1NN-ED on the training set (only), and
see if you get the same answers. Do this first,
otherwise we may waste time.
  • The dataset came from blah blah blah blah
  • Why is it difficult?
  • Blah blah
  • Blah blah
  • Blah blah

This is the one-nearest-neighbor, Euclidean
distance accuracy for just the training set,
measured using leave-one-out.
Number of training objects: 80
Number of testing objects: 2320
Number of classes: 8
Length of time series: 1024
Euclidean Distance accuracy: 95.05%
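The sanity check described on this slide can be sketched as below. This is a minimal sketch, not the script used to produce the numbers on these slides: the function names and the toy data are invented for illustration, and it assumes each series is a plain list of floats with one class label per series.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def loo_1nn_accuracy(series, labels):
    """Leave-one-out accuracy (in percent) of 1NN with Euclidean distance."""
    correct = 0
    for i, query in enumerate(series):
        best_dist, best_label = math.inf, None
        for j, candidate in enumerate(series):
            if i == j:
                continue  # hold out the query itself
            d = euclidean(query, candidate)
            if d < best_dist:
                best_dist, best_label = d, labels[j]
        correct += best_label == labels[i]
    return 100.0 * correct / len(series)

# Toy data (invented for illustration, not taken from the contest files).
toy_series = [[0.0, 0.1, 0.0], [0.1, 0.0, 0.1],
              [1.0, 1.1, 1.0], [1.1, 1.0, 0.9]]
toy_labels = [1, 1, 2, 2]
toy_accuracy = loo_1nn_accuracy(toy_series, toy_labels)
```

Running the same function on a training file and comparing the result with the "Euclidean Distance accuracy" line of the matching slide is the check requested above.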
3
MALLAT TECHNOMETRICS
  • Why is it difficult?
  • Many classes
  • Some classes are globally similar, and have only
    local differences.
  • Small training set (In [a], using 1024 instances
    for training, a decision tree got 96.87%
    accuracy. Since this was too easy, we reduced the
    size of the training set significantly).

This figure is from [a]. The only changes we made
were to flip the data left to right and to
z-normalize it.
This dataset is described in Mallat, S. G.
(1998), A Wavelet Tour of Signal Processing, San
Diego: Academic Press. However, the data we used
was donated by Jeong [a]. The data was obtained
by randomly choosing 55 objects for the training
set and choosing the rest for the testing set.
Each time series was also reversed.
[a] M. K. Jeong, J. C. Lu, X. Huo, B. Vidakovic, and D.
Chen (2006), "Wavelet-based Data Reduction
Techniques for Process Fault Detection,"
Technometrics, 48(1), 26-40. http://web.utk.edu/mjeong/
Number of training objects: 55
Number of testing objects: 2345
Number of classes: 8
Length of time series: 1024
Euclidean Distance accuracy: 98.18%
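The two preprocessing steps mentioned for this dataset, reversing each series and z-normalizing it, might look like the sketch below. The function names are invented here, and using the population standard deviation (dividing by n) is an assumption; the slides do not say which variant was used.

```python
import math

def z_normalize(ts):
    """Rescale a series to zero mean and unit (population) standard deviation."""
    n = len(ts)
    mean = sum(ts) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in ts) / n)
    if std == 0:
        return [0.0] * n  # assumption: a constant series maps to all zeros
    return [(x - mean) / std for x in ts]

def flip_and_normalize(ts):
    """Reverse the series left to right, then z-normalize it."""
    return z_normalize(ts[::-1])

example = flip_and_normalize([1.0, 2.0, 3.0])
```

Because reversing a series does not change its mean or standard deviation, the two steps can be applied in either order.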
4
ItalyPowerDemand (3 years)
  • Task
  • Distinguish days from Oct to March (inclusive)
    from days from April to September
  • Why is it difficult?
  • Borderline days (late Sep vs early Oct)
  • Unusual days (soccer games etc.)
  • Undersampled data?
  • August is radically different to the rest of the
    summer months.

Figure from [Keogh ICDM06] (x-axis ticks 1 to 23)
Number of training objects: 67
Number of testing objects: 1029
Number of classes: 2
Length of time series: 24
Euclidean Distance accuracy: 95.522%
See [Keogh ICDM06]: Eamonn Keogh, Li Wei, Xiaopeng
Xi, Stefano Lonardi, Jin Shieh, Scott Sirowy
(2006). Intelligent Icons: Integrating
Lite-Weight Data Mining and Visualization into
GUI Operating Systems. ICDM 2006.
5
CinC_ECG_torso
  • Task
  • Data is taken from ECG data for multiple
    torso-surface sites. There are 4 classes (4
    different people)
  • Why is it difficult?
  • See gray strip on figure. Depending on location
    on the body, the peak can be positive, neutral or
    negative. Similar remarks apply to all features.
  • The figure shows aligned data, but the challenge
    data is slightly out of alignment.

Number of training objects: 40
Number of testing objects: 1380
Number of classes: 4
Length of time series: 1639
Euclidean Distance accuracy: 85.00%
6
Haptics
  • Task
  • Data is taken from 5 people entering their
    passgraph on a touchscreen. We only consider
    the X axis.
  • Why is it difficult?
  • Small training set
  • I think (but have not checked this) that the
    high variability at the beginning and end of the
    time series is just noise.
  • We are just looking at the X-axis for
    simplicity; we should also be looking at the Y-axis,
    pen pressure, and pen acceleration.

[Figure: 4 sample time series (before normalizing)]
Number of training objects: 155
Number of testing objects: 308
Number of classes: 5
Length of time series: 1092
Euclidean Distance accuracy: 51.61%
Novel Shoulder-Surfing Resistant Haptic-based
Graphical Password. Behzad Malek, Mauricio Orozco,
Abdulmotaleb El Saddik.
7
Symbols
  • Task
  • Thirteen people participated in this experiment.
    They were asked to copy the randomly appearing
    symbol as best they could. There were 3 possible
    symbols; each person contributed about 30
    attempts.
  • Why is it difficult?
  • Individuality of the 13 individuals
  • Each of the 6 classes looks only at the X or Y
    axis; we really should have 3 classes looking at
    both the X and Y axes.
  • Two of the symbols are very similar on the
    Y-axis
  • Small training set

[Figure panels: X-axis, Y-axis]
Number of training objects: 25
Number of testing objects: 995
Number of classes: 6
Length of time series: 398
Euclidean Distance accuracy: 84.0%
This dataset was created for the contest by Jill
Brady, a grad student at UCR. We gratefully
acknowledge her.
8
MedicalImages
  • Task
  • The data are histograms of pixel intensity of
    medical images. The classes are different human
    body regions.
  • Why is it difficult?
  • It is not clear that treating the raw data as
    time series is the best overall approach for this
    problem, but the original authors do report
    success with a time warping measure.
  • Original time series are of different lengths,
    and some are very short. Making them all the same
    length may have introduced artifacts.
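The slides do not say how the series were brought to a common length. As one plausible illustration (an assumption, not the method actually used), linear-interpolation resampling can be sketched as follows; `resample_linear` is a name made up for this example.

```python
def resample_linear(ts, new_len):
    """Resample a series to new_len points by linear interpolation."""
    if new_len < 2 or len(ts) < 2:
        raise ValueError("need at least two points in and out")
    scale = (len(ts) - 1) / (new_len - 1)
    out = []
    for i in range(new_len):
        pos = i * scale                    # fractional index into ts
        lo = int(pos)
        hi = min(lo + 1, len(ts) - 1)
        frac = pos - lo
        out.append(ts[lo] * (1 - frac) + ts[hi] * frac)
    return out

# A very short series stretched to 5 points: three of the five
# values are interpolated, not measured.
stretched = resample_linear([0.0, 10.0], 5)
```

The stretched example shows where artifacts can come from: the shorter the original series, the larger the share of values that were never observed.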

Number of training objects: 381
Number of testing objects: 760
Number of classes: 10
Length of time series: 99
Euclidean Distance accuracy: 72.178%
This dataset was donated by Joaquim C. Felipe,
Agma J. M. Traina and Caetano Traina Jr.
9
SonyAIBORobotSurface
  • Task
  • The robot has roll/pitch/yaw accelerometers; here
    we looked at just the X-axis.
  • The task is to detect the surface being walked
    on.
  • Why is it difficult?
  • Noisy data
  • Small training set. See figure at left; with
    enough data it looks easy.

Red: Cement. Blue: Carpet
Number of training objects: 20
Number of testing objects: 601
Number of classes: 2
Length of time series: 70
Euclidean Distance accuracy: 90.0%
This dataset was donated by Manuela Veloso and
Douglas Vail of Carnegie Mellon University
10
SonyAIBORobotSurfaceII
  • Task
  • The robot has roll/pitch/yaw accelerometers; here
    we looked at just the Z-axis.
  • The task is to detect the surface being walked
    on.
  • Why is it difficult?
  • Noisy data
  • Small training set. See figure at left; with
    enough data it looks easier.

Red: Cement. Blue: Carpet or Field
Number of training objects: 27
Number of testing objects: 953
Number of classes: 2
Length of time series: 65
Euclidean Distance accuracy: 85.185%
This dataset was donated by Manuela Veloso and
Douglas Vail of Carnegie Mellon University
11
TwoLeadECG
  • Task
  • Time series is taken from MIT-BIH Long-Term ECG
    Database (ltdb), Record ltdb/15814, beginning at time
    420 and ending at 1019. The task is to distinguish
    between signal 0 and signal 1.
  • Why is it difficult?
  • Subtle distinctions
  • Small training set
  • Beat extractor does not produce perfect
    alignment, but after using EM to align the signal
    (figure at left) it is clear that certain parts
    of the signal are more informative.

Number of training objects: 23
Number of testing objects: 1139
Number of classes: 2
Length of time series: 82
Euclidean Distance accuracy: 78.261%
12
StarLightCurves
  • Task
  • Time series are star light curves falling into
    three classes.
  • Why is it difficult?
  • Two of the three classes are quite similar.
  • Large dataset (but the real datasets have
    billions of these!)
  • Phase was aligned using standard astronomy
    tricks. However, when we tried circular-shift-invariant
    Euclidean distance (see [a]), our accuracy
    improved, suggesting the alignment is not perfect.
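Circular-shift-invariant Euclidean distance can be illustrated with a brute-force sketch: compare one series against every rotation of the other and keep the smallest distance. This is only the definition written out in quadratic time, not the indexed method of the cited paper; the function names and toy series are invented for the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def circular_shift_ed(a, b):
    """Minimum Euclidean distance between a and every circular shift of b."""
    return min(euclidean(a, b[k:] + b[:k]) for k in range(len(b)))

# Two copies of the same pulse, one step out of phase.
a = [0.0, 1.0, 0.0, 0.0]
b = [0.0, 0.0, 1.0, 0.0]
plain = euclidean(a, b)            # penalizes the phase offset
shifted = circular_shift_ed(a, b)  # ignores the phase offset
```

If the phase alignment were perfect, the two distances would rank neighbors almost identically, which is why an accuracy gain from the shift-invariant version points to residual misalignment.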

Number of training objects: 1000
Number of testing objects: 8236
Number of classes: 3
Length of time series: 1024
Euclidean Distance accuracy: 86.00%
Classes: 1 - CEPH, 2 - EB, 3 - RRL
[a] Eamonn Keogh, Li Wei, Xiaopeng Xi, Sang-Hee
Lee and Michail Vlachos (2006). LB_Keogh Supports
Exact Indexing of Shapes under Rotation
Invariance with Arbitrary Representations and
Distance Measures. VLDB 2006.
13
DiatomSizeReduction
Gomphonema augur
  • Task
  • Each successive generation of a clonally
    reproducing diatom is slightly smaller than its
    forebears [a].
  • Why is it difficult?
  • Small training set
  • Possible errors caused by image processing step.
  • Change in scale of diatoms shows up as
    warping.

[Figure labels, from [b]: (many omitted), Fragilariforma bicapitata, Eunotia tenella, Stauroneis smithii]
Number of training objects: 16
Number of testing objects: 306
Number of classes: 4
Length of time series: 345
Euclidean Distance accuracy: 93.75%
[a] http://rbg-web2.rbge.org.uk/DIADIST/index.htm?srseries.htmmain
[b] Xiaopeng Xi, et al. (2007).
Finding Motifs in Database of Shapes. SDM'07
14
Motes
  • Task
  • Sensor data used in paper [b].
  • Here the task is to distinguish between sensor
    q8calibHumid and sensor q8calibHumTemp.
  • The raw data has dropouts, which I left in.
  • Why is it difficult?
  • Small training set.
  • Lots of dropouts (however, when the noise is
    removed, it should be very easy).
  • Here the dropouts had value zero. But after
    z-normalization these values changed. It would
    have been easier to do smart smoothing if the
    data was not normalized.
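A small sketch of why z-normalization hides the dropouts: a raw reading of zero maps to a different value in each series, because each series has its own mean and standard deviation. The readings below are invented for illustration, and population standard deviation is an assumption.

```python
import math

def z_normalize(ts):
    """Rescale a series to zero mean and unit (population) standard deviation."""
    n = len(ts)
    mean = sum(ts) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in ts) / n)
    return [(x - mean) / std for x in ts]

# Invented readings; 0.0 marks a dropout in the raw data.
raw_a = [20.0, 21.0, 0.0, 22.0]
raw_b = [40.0, 0.0, 0.0, 42.0]

norm_a = z_normalize(raw_a)
norm_b = z_normalize(raw_b)

# The same raw value (0.0) lands on a different level in each series,
# so "value == 0" no longer identifies a dropout after normalization.
dropout_a = norm_a[2]
dropout_b = norm_b[1]
```

In the raw data a dropout detector could simply test for zero; after normalization it would have to infer the dropout level separately for every series.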

Number of training objects: 20
Number of testing objects: 1252
Number of classes: 2
Length of time series: 84
Euclidean Distance accuracy: 75.00%
[a] Raw data from Carlos Guestrin (CMU).
Classification version by Keogh.
[b] Jimeng Sun, Spiros Papadimitriou, Christos Faloutsos.
Online Latent Variable Detection in Sensor Networks.
ICDE 2005: 1126-1127
15
ChlorineConcentration
  • Task
  • Sensor data used in paper [b].
  • Multiple sensors have spatial correlation, which
    I arbitrarily divided into 3 sets
  • Why is it difficult?
  • The borderline cases are hard to classify.
    However, with more data it would be easy. For
    example, when I randomly sample k items from the
    labeled test set, and do 1NN-ED classification, I
    get:
  • 1000 -> 76.5% accuracy
  • 2000 -> 89.85% accuracy
  • 3000 -> 96.8% accuracy
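The slide does not spell out how the sampled items were scored. One possible reading, assumed here, is that the k sampled items serve as the 1NN reference set and the remaining labeled items are classified against them. The sketch below follows that reading on invented toy data; it will not reproduce the figures above.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nn_label(query, ref_series, ref_labels):
    """Label of the reference series closest to the query."""
    best = min(range(len(ref_series)),
               key=lambda j: euclidean(query, ref_series[j]))
    return ref_labels[best]

def accuracy_with_k_references(series, labels, k, seed=0):
    """Sample k labeled items as the 1NN reference set and score
    the remaining items against it. Returns accuracy in percent."""
    if not 0 < k < len(series):
        raise ValueError("k must leave at least one item to classify")
    order = list(range(len(series)))
    random.Random(seed).shuffle(order)
    ref, rest = order[:k], order[k:]
    ref_series = [series[i] for i in ref]
    ref_labels = [labels[i] for i in ref]
    hits = sum(nn_label(series[i], ref_series, ref_labels) == labels[i]
               for i in rest)
    return 100.0 * hits / len(rest)

# Toy stand-in data (invented): class 1 near 0, class 2 near 10.
toy_series = ([[0.1 * i, 0.0] for i in range(10)]
              + [[10.0 + 0.1 * i, 0.0] for i in range(10)])
toy_labels = [1] * 10 + [2] * 10
for k in (2, 5, 10):
    print(k, "->", accuracy_with_k_references(toy_series, toy_labels, k))
```

Repeating the call with several seeds and averaging would give a smoother picture of how accuracy grows with k.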

Number of training objects: 487
Number of testing objects: 3840
Number of classes: 3
Length of time series: 166
Euclidean Distance accuracy: 63.383%
[a] Stacia Thompson and Jeanne M. VanBriesen
(CMU). Classification version by Keogh.
[b] Jimeng Sun, Spiros Papadimitriou, Christos Faloutsos.
Online Latent Variable Detection in Sensor
Networks. ICDE 2005: 1126-1127
16
ECGFiveDays
  • Task
  • Data is from a 67-year-old male. The two classes
    are simply:
  • ECG date 12/11/1990
  • ECG date 17/11/1990
  • Why is it difficult?
  • Wandering baseline was not removed; this shows up
    as linear drift.
  • Beat extractor does not produce perfect
    alignment, but after using EM to align the signal
    (figure at left) it is clear that certain parts
    of the signal are more informative.
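Since the wandering baseline appears as linear drift, one thing a contestant might try (an assumption, not something the dataset authors did) is subtracting a least-squares straight line from each series before classification. A minimal sketch with invented names and toy data:

```python
def remove_linear_trend(ts):
    """Subtract the least-squares straight line from a series."""
    n = len(ts)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    t_mean = (n - 1) / 2
    y_mean = sum(ts) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(ts))
    slope = sxy / sxx
    return [y - (y_mean + slope * (t - t_mean)) for t, y in enumerate(ts)]

# Toy beat pattern (invented) with an added linear drift of 0.5 per sample.
beat = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
drifting = [y + 0.5 * t for t, y in enumerate(beat)]
detrended = remove_linear_trend(drifting)
```

The detrended series has zero mean and no overall slope; whether this helps 1NN-ED on the contest data is not something the slides report.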

[Figure labels: Wandering baseline; Excerpt of Class 1]
Number of training objects: 23
Number of testing objects: 861
Number of classes: 2
Length of time series: 136
Euclidean Distance accuracy: 82.609%
17
InlineSkate
  • Task
  • This data was collected from experiments
    with inline speed skaters on a treadmill.
  • Each time series represents an angular
    measurement of the ankle during one movement
    cycle.
  • Cycles were of different lengths; we made them
    all the same length.
  • Why is it difficult?
  • Lots of warping
  • Long time series (for algorithms that scale
    poorly in dimensionality).
  • The cycle extraction algorithm might not be
    perfect (this was done before we saw the data)

Number of training objects: 100
Number of testing objects: 550
Number of classes: 7
Length of time series: 1882
Euclidean Distance accuracy: 30.00%
The data was provided by Fabian Moerchen and Olaf
Hoos.
18
FacesUCR
  • Task
  • This data consists of faces of grad students
    transformed into time series
  • Why is it difficult?
  • Variation of head angle and expression.
  • Some have glasses/no glasses versions
  • All grad students look alike (well, some do).
  • The transformation algorithm is a little brittle
    (we have since found more robust techniques).

Number of training objects: 200
Number of testing objects: 2050
Number of classes: 14
Length of time series: 131
Euclidean Distance accuracy: 75.50%
Photographs by Chotirat "Ann" Ratanamahatana,
image conversion by Xiaopeng Xi and Eamonn Keogh
19
WordsSynonyms
  • Task
  • This dataset consists of word profiles for George
    Washington's manuscripts.
  • This dataset is the 50-words dataset, remapped
    to 25 classes.
  • The data was flipped left-right so that it would
    not be recognized.
  • Why is it difficult?
  • There are two ways to be a member of each class.
  • In this case, length normalization clearly does
    throw away useful info.
  • Errors from the difficult task of OCR on old
    documents

The time series representation of words is known
to be very competitive with other representations
[a]. Here the results might not be competitive
because we are only using one (of four) time
series per word, we are normalizing, and we have
small training sets.
[a] Word spotting for
historical documents. Toni M. Rath and R.
Manmatha. International Journal on Document
Analysis and Recognition. Volume 9, Numbers 2-4 /
April, 2007
Number of training objects: 267
Number of testing objects: 638
Number of classes: 25
Length of time series: 270
Euclidean Distance accuracy: 58.80%
The data was provided by Toni M. Rath and R.
Manmatha.