Mining Massive RFID, Trajectory, and Traffic Data Sets (presentation transcript)
1
Mining Massive RFID, Trajectory, and Traffic
Data Sets
KDD08 Tutorial
  • Jiawei Han, Jae-Gil Lee, Hector Gonzalez, Xiaolei
    Li
  • ACM SIGKDD'08 Conference Tutorial
  • Las Vegas, NV
  • August 24, 2008

2
Tutorial Outline
  • Part I. RFID Data Mining
  • Part II. Trajectory Data Mining
  • Part III. Traffic Data Mining
  • Part IV. Conclusions

3
Part 1. RFID Data Mining
  • Introduction to RFID Data
  • Why RFID Data Warehousing and Mining?
  • RFID Data Warehousing
  • Mining RFID Data Sets
  • Conclusions

4
RFID Technology
  • Radio Frequency Identification (RFID)
  • Technology that allows a sensor (reader) to read,
    from a distance, and without line of sight, a
    unique electronic product code (EPC) associated
    with a tag

5
Broad Applications of RFID Technology
  • Supply Chain Management: real-time inventory tracking
  • Retail: active shelves monitor product availability
  • Access control: toll collection, credit cards, building access
  • Airline luggage management: reduce lost/misplaced luggage
  • Medical: implant patients with a tag that contains their medical history
  • Pet identification: implant RFID tag with pet owner information

6
Inventory Management
How many pens should we reorder?
7
Asset Tracking
British Airways loses 20 million bags a year
8
Electronic Toll Collection
Illinois: 1 million drivers a day use I-Pass
9
RFID System (Tag, Reader, Database)
Source: www.belgravium.com
10
RFID Data Warehousing and Mining
(System overview: RFID data from sites 1..k feeds a Data Cleaning layer and Warehousing Engine that build the RFID Warehouse, on top of which a Mining Engine supports Flow Mining, Traffic Mining, and other analyses)
  • Data Cleaning: Cost-Conscious Cleaning of Massive RFID Data Sets (Gonzalez et al., ICDE'07)
  • Warehousing Engine: Warehousing and Analyzing Massive RFID Data Sets (Gonzalez et al., ICDE'06, Best Student Paper)
  • RFID Warehouse: Mining Compressed Commodity Workflows From Massive RFID Data Sets (Gonzalez et al., CIKM'06)
  • Flow Mining: FlowCube: Constructing RFID FlowCubes for Multi-Dimensional Analysis of Commodity Flows (Gonzalez et al., VLDB'06)
  • Traffic Mining: Adaptive Fastest Path Computation on a Road Network: A Traffic Mining Approach (Gonzalez et al., VLDB'07)

11
Part 1. RFID Data Mining
  • Introduction to RFID Data
  • Why RFID Data Warehousing and Mining?
  • RFID Data Warehousing
  • Mining RFID Data Sets
  • Conclusions

12
Challenges of RFID Data Sets
  • Data generated by RFID systems is enormous (petabytes in scale!) due to redundancy and a low level of abstraction
  • Walmart is expected to generate 7 terabytes of
    RFID data per day
  • Data analysis requirements
  • Highly compact summary of the data
  • OLAP operations on multi-dimensional view of the
    data
  • Preserving the path structures of RFID data for
    analysis
  • Efficiently drilling down to individual tags when
    an interesting pattern is discovered

13
Example Trajectory
(Factory, T1,T2)
(Shipping,T3,T4)
(Warehouse, T5,T6)
(Shelf, T7,T8)
(Checkout,T9,T10)
14
Data Generation
EPC: (L1,T1)(L2,T2)...(Ln,Tn)
EPC, Location, Time_in, Time_out
EPC, Location, Time
15
RFID Data Warehouse Modeling
  • Three models in typical RFID applications
  • Bulky movements: supply-chain management
  • Scattered movements: E-pass tollway system
  • No movements: fixed-location sensor networks
  • Different applications may require different data
    warehouse systems
  • Our discussion will focus on bulky movements

16
Why RFID-Warehousing?
  • Lossless compression for bulky movement data
  • Significantly reduces the size of the RFID data set by removing redundancy and grouping objects that move and stay together
  • Data cleaning: reasoning based on more complete information
  • Multi-reading, miss-reading, error-reading, bulky movement, etc.
  • Multi-dimensional summary, multiple views
  • Multi-dimensional view: product, location, time, etc.
  • Store manager: check item movements from the backroom to different shelves in his store
  • Region manager: collapse intra-store movements and look at distribution centers, warehouses, and stores

17
Example A Supply Chain Store
  • A retailer with 3,000 stores, selling 10,000
    items a day per store
  • Each item moves 10 times on average before being
    sold
  • Movement recorded as (EPC, location, second)
  • Data volume: 300 million tuples per day (after redundancy removal)
  • OLAP query: costly to answer if it requires scanning a billion tuples
  • e.g., avg time for outerwear items to move from warehouse to checkout counter in March 2006?
  • Mining query
  • e.g., is there a correlation between the time spent in transportation and milk in store S going rotten?

18
Part 1. RFID Data Mining
  • Introduction to RFID Data
  • Why RFID Data Warehousing and Mining?
  • RFID Data Warehousing
  • Mining RFID Data Sets
  • Conclusions

19
Cleaning of RFID Data Records
  • Raw Data
  • (EPC, location, time)
  • Duplicate records due to multiple readings of a
    product at the same location
  • (r1, l1, t1) (r1, l1, t2) ... (r1, l1, t10)
  • Cleansed Data: minimal information to store; raw data removed
  • (EPC, Location, time_in, time_out)
  • (r1, l1, t1, t10)
  • Warehousing can help fill in missing records and correct wrongly registered information
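The duplicate-elimination step above can be sketched in a few lines. This is a hypothetical illustration (function and field names are assumptions, not from the tutorial): repeated (EPC, location, time) readings collapse into one (EPC, location, time_in, time_out) stay record per visit.

```python
def to_stay_records(readings):
    """Collapse raw RFID readings (epc, location, time) into stay records
    (epc, location, time_in, time_out), one per consecutive visit."""
    stays = []
    # Process each tag's readings in time order.
    for epc, loc, t in sorted(readings, key=lambda r: (r[0], r[2])):
        if stays and stays[-1][0] == epc and stays[-1][1] == loc:
            stays[-1][3] = t  # same tag, same location: extend the open stay
        else:
            stays.append([epc, loc, t, t])  # new visit: open a new stay record
    return [tuple(s) for s in stays]
```

For the slide's example, (r1, l1, t1) ... (r1, l1, t10) becomes the single record (r1, l1, t1, t10).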

20
What is a Data Warehouse?
(Diagram: operational data from sites 1..k is Extracted, Transformed, and Loaded into warehouse cubes 1..N, which serve OLAP queries over all possible groupings of (Product, Location, Time), e.g., (DVD, Chicago, All), (Q1, All, All), (All, All, All))
21
Why Do We Need a New Design?
  • Ex.: avg time that milk coming from farm A via truck 1 stays at the Champaign Walmart?

Paths are lost in the aggregation
22
Data Compression
Raw Data (EPC, Reader, Time)
  ↓ Redundancy Elimination (lossless)
Cleansed Data (EPC, Reader, T_in, T_out)
  ↓ Bulky Movement Compression (lossless)
Stay (GID, Reader, T_in, T_out)
  ↓ Path and Item Abstraction (lossy)
Stay (GID, Locale, Day 1, Day 2)
23
Bulky Object Movements
(Diagram: GID hierarchy 1 → {1.1, 1.2}; 1.1 → {1.1.1, 1.1.2}; 1.1.1 → {1.1.1.1, 1.1.1.2}; e.g., items i1, i2, ..., i10000 at Dist Center 1 from 01/01/08 to 01/03/08 are recorded once under GID 1.1)
24
Data Compression with GID
  • Bulky object movements
  • Objects often move and stay together
  • If 1000 packs of soda stay together at the
    distribution center, register a single record
  • (GID, distribution center, time_in, time_out)
  • GID is a generalized identifier that represents
    the 1000 packs that stayed together at the
    distribution center

(Diagram: Factory → Dist. Centers 1 and 2 → Stores 1 and 2 → Shelves 1 and 2; 10 packs (12 sodas each), 20 cases (1000 packs), and 10 pallets (1000 cases) move together)
25
Movement Graph Producer Consumer
Configurations
26
Non-Spatial Generalization
  • Category level: Clothing
  • Type level (the interesting level): Outerwear, Shoes
  • SKU level: Shirt, Jacket, ...
  • Cleansed RFID database (EPC level): Shirt 1, ..., Shirt n
27
Path Generalization
  • Store view: collapse dist. center → truck into "Transportation"; keep backroom → shelf → checkout
  • Transportation view: keep dist. center → truck; collapse backroom → shelf → checkout into "Store"
28
RFID-Cube Architecture
29
RFID Cuboid
30
Example RFID Cuboid
Cleansed RFID Database
Stay Table
Map Table
31
Design Decisions: Stay vs. Transition

(Diagram: locations l1, ..., ln flow into location l, which flows out to ln1, ..., lnm)

  • Measure of items at location l
  • Transition: n + m retrievals
  • Stay: 1 retrieval
  • Measure of items from li to lj
  • Transition: 1 retrieval
  • Stay: 2 retrievals
32
Design Decisions: EPC vs. GID Lists
How many pallets traveled path l1, l7, l13?
(r1,l1,t1,t2) (r1,l2,t3,t4), (r2,l1,t1,t2) (r2,l2,t3,t4), ..., (rk,l1,t1,t2) (rk,l2,t3,t4)
  • With EPC lists
  • Retrieve all EPCs with location in l1, l7, l13
  • With GID lists
  • Retrieve all GIDs with location in l1, l7, l13
  • Savings
  • #GIDs << #EPCs

(g1,l1,t1,t2) (g2,l2,t3,t4)
33
GID Naming
0.1
0.0
  • GID name encodes path
  • Benefit: speed (reduces GID intersections)
  • Cost: space (grows with the number of locations and path length)

l2
l1
0.1.1
0.1.0
0.0.0
l4
l3
0.0.0.0
0.1.0.1
l5
l6
34
RFID Cuboid Construction
  • Build Path Prefix-tree
  • For each node:
  • GID = parent GID + unique id
  • Aggregate measure for items at each leaf under the node
  • Generate stay records, merging if necessary

35
Compression by Data/Path Generalization
  • Data generalization
  • Analysis usually takes place at a much higher
    level of abstraction than the one present in raw
    RFID data
  • Aggregate object movements into fewer records
  • If interested in time at the day level, merge records registered at the minute level into records at the day level
  • Path generalization Merge and/or collapse path
    segments
  • Uninteresting path segments can be ignored or
    merged
  • Multiple item movements within the same store may
    be uninteresting to a regional manager and thus
    merged

36
Three RFID-Cuboids
  • Stay Table: (GIDs, location, time_in, time_out, measures)
  • Records information on items that stay together at a given location
  • If recording transitions instead: difficult to answer queries, lots of intersections needed
  • Map Table: (GID, <GID1, ..., GIDn>)
  • Links together stages that belong to the same path; provides additional compression and query processing efficiency
  • A high-level GID points to lower-level GIDs
  • If saving complete EPC lists instead: high I/O cost to retrieve long lists, costly query processing
  • Information Table: (EPC list, attribute 1, ..., attribute n)
  • Records path-independent attributes of the items, e.g., color, manufacturer, price

37
Algorithm Example
(Diagram: a path prefix tree over locations l1..l6, with item groups r1,r2,r3 / r5,r6 / r7 / r8,r9 at the leaves, and the stay records it generates: (0.0: t1,t10: 3), (0.1: t1,t8: 3), (0.0.0: t20,t30: 3), (0.1.0: t20,t30: 3), (0.1.1: t10,t20: 2), (0.0.0.0: t40,t60: 3), (0.1.0.0: t40,t60: 2), (0.1.0.1: t35,t50: 1))
38
RFID-Cuboid Construction Algorithm
  • Build a prefix tree for the paths in the cleansed
    database
  • For each node, record a separate measure for each
    group of items that share the same leaf and
    information record
  • Assign GIDs to each node:
  • GID = parent GID + unique id
  • Each node generates a stay record for each
    distinct measure
  • If multiple nodes share the same location, time,
    and measure, generate a single record with
    multiple GIDs
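A minimal sketch of this construction, under assumed names (not the authors' code): each cleansed path is inserted into a prefix tree, and every node gets a GID of the form parent GID + "." + unique id, so the GID itself encodes the path. The measure here is simplified to a per-node item count; the algorithm on the slide keeps a separate measure per leaf/information group.

```python
def build_gid_tree(paths, root_gid="0"):
    """paths: list of location sequences, e.g. [["l1", "l3"], ...].
    Returns {gid: (location, item_count)} for every tree node."""
    tree = {}        # (parent_gid, location) -> child gid
    counts = {}      # gid -> number of items whose path passes through it
    next_child = {}  # parent_gid -> next unique child id
    out = {}
    for path in paths:
        parent = root_gid
        for loc in path:
            key = (parent, loc)
            if key not in tree:
                cid = next_child.get(parent, 0)   # fresh unique id under parent
                next_child[parent] = cid + 1
                tree[key] = parent + "." + str(cid)
            gid = tree[key]
            counts[gid] = counts.get(gid, 0) + 1  # one more item through this node
            out[gid] = (loc, counts[gid])
            parent = gid                          # descend: GID grows along the path
    return out
```

Two items sharing a location prefix end up under the same node, which is exactly the bulky-movement compression.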

39
Algorithm Properties
  • Construction Time
  • Single scan of cleansed data
  • Compression
  • Lossless compression at a given abstraction level
  • In our experiments we get 80% lossless compression at the level of abstraction of the raw data

40
From RFID-Cuboids to RFID-Warehouse
  • Which cuboids to materialize?
  • Minimum interesting level
  • Popular RFID-Cuboids
  • How?
  • Run the construction algorithm
  • Input: the smallest materialized RFID-Cuboid at a lower level of abstraction

41
Query Processing
  • Traditional OLAP operations
  • Roll up, drill down, slice, and dice
  • Implemented efficiently with traditional
    techniques, e.g., what is the avg time spent by
    milk at the shelf

σ_{stay.location = 'shelf', info.product = 'milk'} (stay ⋈_GID info)
  • Path selection (New operation)
  • Compute an aggregate measure on the tags that
    travel through a set of locations and that match
    a selection criteria on path-independent
    dimensions

q = ⟨ σ_c info, (σ_c1 stage1, ..., σ_ck stagek) ⟩
42
Query Processing (II)
  • Query What is the average time spent from l3 to
    l5?
  • GIDs for l3: <0.0.0>, <0.1.0>
  • GIDs for l5: <0.0.0.0>, <0.1.0.1>
  • Prefix pairs: p1 = (<0.0.0>, <0.0.0.0>), p2 = (<0.1.0>, <0.1.0.1>)
  • Retrieve stay records for each pair (including intermediate steps) and compute the measure
  • Savings: no EPC-list intersection; each EPC list may contain millions of different tags, and retrieving them is a significant I/O cost
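Because GIDs encode their paths, "stage g1 precedes stage g2 on the same path" reduces to a string-prefix test, which is what makes the prefix pairs above cheap to find. A hypothetical helper (names are assumptions, not from the paper):

```python
def is_stage_prefix(g1, g2):
    """True if the node with GID g1 lies on the path encoded by GID g2."""
    return g2.startswith(g1 + ".")

def prefix_pairs(gids_from, gids_to):
    """All (g_from, g_to) pairs where g_to's path passes through g_from,
    i.e., items at g_to earlier visited the stage g_from."""
    return [(a, b) for a in gids_from for b in gids_to if is_stage_prefix(a, b)]
```

No EPC list is touched: the pairing is decided purely on the GID strings.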

43
Performance Study RFID-Cube Compression
Compression vs. cleansed data size: P = 1000, B = (500, 150, 40, 8, 1), k = 5. Lossless compression; the cuboid is at the same level of abstraction as the cleansed RFID database.
Compression vs. data bulkiness: P = 1000, N = 1,000,000, k = 5. The map table gives significant benefits for bulky data; for data where items move individually we are better off using tag lists.
44
From Distribution Center Model to Gateway-Based
Movement Model
  • Gateway-based movement model
  • Supply-chain movement is a merge-shuffle-split process
  • Three types of gateways: Out, In, In/Out
  • Multiple hierarchies for compression and exploration: Location, Time, Path, Gateway, Info

45
Part 1. RFID Data Mining
  • Introduction to RFID Data
  • Why RFID Data Warehousing and Mining?
  • RFID Data Warehousing
  • Mining RFID Data Sets
  • Conclusions

46
Mining RFID Data Sets
  • Data cleaning by data mining
  • RFID data flow analysis
  • Path-based classification and cluster analysis
  • Frequent pattern and sequential pattern analysis
  • Outlier analysis in RFID data
  • Linking RFID data mining with others

47
Data Cleaning by Data Mining
  • The RFID data warehouse substantially compresses the RFID data and facilitates efficient and systematic data analysis
  • Data cleaning is essential to RFID applications
  • Multiple readings, missed readings, errors in reading, etc.
  • How does the RFID warehouse facilitate data cleaning?
  • Multiple readings: automatically resolved during compression
  • Missed readings: gaps can be stitched by simple look-around
  • Error readings: use future positions to resolve discrepancies
  • Data mining helps data cleaning
  • Multiple cleaning methods can be cross-validated
  • Cost-sensitive method selection by data mining

48
Cost-Conscious Cleaning of RFID Data (Gonzalez et
al. 07)
  • Unreliable system
  • 50% loss rate
  • Interference: water, metal, speed
  • Large data volume
  • Thousands of readers
  • Millions of tags
  • Key idea
  • Use inexpensive cleaning methods first; escalate only when necessary

49
DBN-Based Cleaning (DBNs: Dynamic Bayesian Networks)
  • No need to remember recent tag readings, we just
    update our belief in the item being present given
    the readings
  • Dynamically give more weight to recent
    observations
  • Differentiate between these two cases

(Diagram: a two-slice DBN where the hidden state Present_{t-1} → Present_t follows the transition model, the observed readings Detect_{t-1}, Detect_t follow the observation model, and new belief state ∝ observation model × transition model × old belief state)
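The belief update sketched on this slide can be written as a small Bayes filter over a single hidden bit ("tag present?"). This is an illustrative sketch, not the authors' DBN: the probabilities p_stay, p_arrive, p_detect, and p_false are assumed values, not numbers from the tutorial.

```python
def update_belief(belief, detected,
                  p_stay=0.9,     # P(present_t | present_{t-1}), assumed
                  p_arrive=0.05,  # P(present_t | absent_{t-1}), assumed
                  p_detect=0.5,   # P(read | present): noisy readers, assumed
                  p_false=0.01):  # P(read | absent), assumed
    """One filtering step: old belief -> new belief given this cycle's reading."""
    # Predict: propagate the old belief through the transition model.
    prior = belief * p_stay + (1.0 - belief) * p_arrive
    # Update: weight by the observation model and renormalize.
    like_present = p_detect if detected else (1.0 - p_detect)
    like_absent = p_false if detected else (1.0 - p_false)
    num = like_present * prior
    return num / (num + like_absent * (1.0 - prior))
```

A single detection pushes the belief sharply up, while one missed reading only decays it gradually, which is exactly the "no need to remember recent readings" point above.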
50
Cleaning Sequence
  • A cleaning method is a classifier
  • For a tag case (EPC, time, history of readings)
    it assigns a label (location), and gives a
    confidence value for the prediction
  • Cost of applying method is proportional to CPU,
    memory, and amortized training costs
  • Given a set of tag cases and cleaning methods,
    determine best method application order to
    maximize accuracy and minimize costs
  • C(M1) = 1, C(M2) = 1.5, C(M3) = 0.5
  • C(Error) = 0.5
  • S_D,M = M1 → M3 → M2
  • Greedy algorithm At each iteration choose
    cheapest cleaning method (including error cost)
    for the set of tag cases still not correctly
    classified
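A toy sketch of the greedy selection just described (the cost model and which cases each method resolves are illustrative assumptions, not the paper's setup): at each iteration, pick the method with the lowest cost per newly resolved tag case.

```python
def greedy_sequence(methods, cases):
    """methods: {name: (cost, set_of_cases_it_cleans)}.
    Returns the order in which to apply the cleaning methods."""
    remaining = set(cases)
    order = []
    avail = dict(methods)
    while remaining and avail:
        # Cheapest cost per tag case it would newly resolve.
        name = min(avail, key=lambda m: avail[m][0] /
                   max(1, len(avail[m][1] & remaining)))
        cost, cleaned = avail.pop(name)
        if cleaned & remaining:          # skip methods that resolve nothing new
            order.append(name)
            remaining -= cleaned
    return order
```

With the slide's costs, the order the toy model picks depends on which tag cases each method is assumed to resolve; the real algorithm also folds in the error cost C(Error).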

51
Cleaning Plan
  • The cleaning plan is a decision tree
  • Internal nodes are tag features
  • Leaves contain all tag cases matching the
    conditions on the branch, and define the optimal
    cleaning sequence to use on such cases
  • Tag cases have features that can be used to
    segment them

Induction Algorithm
  • Traditional top down induction of decision trees
    Quinlan, ML 86
  • Split the tag cases as long as we can reduce
    cleaning costs

(Split criterion: compare the cleaning sequence cost before the split against the average cost of the cleaning sequences after the split)
52
Experimental Result
  • Setup
  • Diverse environment, different levels of noise,
    tag speed, and reader locations
  • Results
  • Cleaning plan wins in accuracy and cost
  • In general, DBN outperforms smoothing window
    methods

53
Mining RFID Data Sets
  • Data cleaning by data mining
  • RFID data flow analysis
  • Path-based classification and cluster analysis
  • Frequent pattern and sequential pattern analysis
  • Outlier analysis in RFID data
  • Linking RFID data mining with others

54
RFID Data: A Path Database View
  • From raw tuples to cleansed data: a Stay Table view
  • Raw tuples: <EPC, location, time>
  • Stay view: (EPC, location, time_in, time_out)
  • A data flow view of RFID data: path form
  • <EPC, (l1,t1), (l2,t2), ..., (lk,tk)>, where li = location i, ti = duration at i
  • The paths can be augmented with path-independent dimensions to get a Path Database of the form
  • <Product, Manufacturer, Price, Color, (l1,t1), ..., (lk,tk)>

(path-independent dimensions | path stages)
55
What Can Product Flows Tell?
Why was the Milk discarded?
Correlation between operator and returns?
56
Summarizing Flows: FlowGraph
  • Tree-shaped workflow
  • Nodes: locations
  • Edges: transitions
  • Each node is annotated with
  • Distribution of durations at the node
  • Distribution of transition probabilities
  • Significant duration and transition exceptions

storage
shelf
factory
backroom
truck
warehouse
57
FlowGraph Example
(Diagram: a FlowGraph over factory, dist. center, truck, warehouse, shelf, and checkout, with transition probabilities on the edges, e.g., 1.00, 0.65, 0.35, 0.67, 0.33, 0.60, 0.20)
Node annotations, e.g.:
  • Duration dist: 1: 0.2, 2: 0.8; duration exceptions: given (f,5): 1: 0.0, 2: 1.0; given (f,10): 1: 0.5, 2: 0.5
  • Duration dist: 1: 0.67, 2: 0.33; transition dist: shelf: 0.67, warehouse: 0.33; transition exceptions: given (t,1): shelf: 0.5, warehouse: 0.5; given (t,2): shelf: 1.0, warehouse: 0.0
58
FlowCube
  • Data cube computed on the path database, by
    grouping entries that share the same values on
    the path independent dimensions.
  • Each cuboid has an associated level in the item
    and path abstraction lattices.
  • Level in the item lattice.
  • (product category, country, price)
  • Level in the path lattice.
  • (lttransportation, factory, backroom, shelf,
    checkoutgt, hour)
  • The measure for each cell in the FlowCube is a
    FlowGraph computed on the paths aggregated in the
    cell.

59
FlowCube Example
Cuboid for <product type, brand>
FlowGraph for cell 3
shelf
checkout
1.0
0.67
factory
truck
1.0
0.33
warehouse
60
Cubing FlowGraphs: FlowCube
  • Fact Table Path Table (EPC, path)
  • Dimensions
  • Path independent dimensions
  • Product, Vendor, Price, etc
  • Abstraction Level
  • Each dimension has a concept hierarchy
  • Paths aggregated according to location, time
  • Measure
  • FlowGraph

61
FlowCube Example
Cuboid for <product type, brand>
FlowGraph for cell 3
shelf
checkout
factory
truck
warehouse
62
Cells to Compute
  • Frequent Cells (Iceberg FlowCube)
  • Min support: number of paths in the cell
  • FlowGraph is statistically significant
  • Non-redundant cells
  • Redundant cell: can be inferred from others
  • Flow patterns for Milk the same as for Milk 2
  • Compression: keep non-redundant general cells

63
FlowCube Computation - Ideas
  • Transform Paths into Transaction Database
  • Mine frequent path segments
  • Mine frequent dimension combinations
  • Cross Pruning
  • Infrequent (Factory → Shelf) for NorthEast
  • Has to be infrequent in MA
  • Infrequent (Laptop, MN)
  • Has to be infrequent for (Factory → Shelf)

64
Transaction Encoding
(Example: in the item lattice Product → Clothing → Outerwear → Jacket, "Jacket" is encoded as (1 1 1 2); the path (factory,10)(dist,2)(truck,1)(shelf,5)(checkout,0) is encoded as transaction items at several abstraction levels, e.g., (factory-dist-truck, 1) and (factory-Transportation, 1))
65
One Step Algorithm
Path DB
Freq. Cells Freq. Paths
FlowCube
Encode Transactions
Freq. Pattern Mining
Build FlowGraphs
  • Integrated pruning
  • Pre-counting at level k+1
  • Prune non-related stages
  • Prune parent-child
66
Two Step Algorithm
Path DB
Cube
Freq Path Mining cell 1
Freq Path Mining cell 2
Build FlowGraphs
Cubing Non-Spatial

Freq Path Mining cell n
Drawbacks: FlowGraph is a holistic measure, so there is no shared computation; wasted effort; no cross pruning; one cell at a time; high I/O cost
67
Mining RFID Data Sets
  • Data cleaning by data mining
  • RFID data flow analysis
  • Path-based classification and cluster analysis
  • Frequent pattern and sequential pattern analysis
  • Outlier analysis in RFID data
  • Linking RFID data mining with others

68
Path- or Segment-Based Classification and Cluster Analysis
  • Classification: given class labels (e.g., broken goods vs. quality ones), construct path-related predictive models
  • Take paths or segments as motifs and perform motif-based classification over the high-dimensional feature space
  • Clustering: group similar paths, or similar stays or movements of RFID objects, together with other multi-dimensional information, into clusters
  • It is essential to define new distance measures and constraints for effective clustering

69
Mining RFID Data Sets
  • Data cleaning by data mining
  • RFID data flow analysis
  • Path-based classification and cluster analysis
  • Frequent pattern and sequential pattern analysis
  • Outlier analysis in RFID data
  • Linking RFID data mining with others

70
Frequent Pattern and Sequential Pattern Analysis
  • Frequent patterns and sequential patterns can be
    related to movement segments and paths
  • Taking movement segments and paths as base units, one can perform multi-dimensional frequent pattern and sequential pattern analysis
  • Correlation analysis can be performed in a similar way
  • Correlation components can be stay, move
    segments, and paths
  • Efficient and scalable algorithms can be
    developed using the warehouse modeling

71
Mining RFID Data Sets
  • Data cleaning by data mining
  • RFID data flow analysis
  • Path-based classification and cluster analysis
  • Frequent pattern and sequential pattern analysis
  • Outlier analysis in RFID data
  • Linking RFID data mining with others

72
Outlier Analysis in RFID Data
  • Outlier detection in RFID data is a by-product of other mining tasks
  • Data flow analysis: detect those not in the major flows
  • Classification: treat outliers and normal data as different class labels
  • Cluster analysis: identify those that deviate substantially from major clusters
  • Trend analysis: those not following the major trend
  • Frequent pattern and sequential pattern analysis: anomaly patterns

73
Mining RFID Data Sets
  • Data cleaning by data mining
  • RFID data flow analysis
  • Path-based classification and cluster analysis
  • Frequent pattern and sequential pattern analysis
  • Outlier analysis in RFID data
  • Linking RFID data mining with others

74
Linking RFID Mining with Others
  • RFID warehouse and cube model makes the data
    mining better organized and more efficient
  • Real time RFID data mining will need further
    development of stream data mining methods
  • Stream cubing and high-dimensional OLAP are two key methods that will benefit RFID mining
  • RFID data mining is still a young, largely unexplored field
  • RFID data mining has close links with sensor data mining, moving object data mining, and stream data mining
  • Thus it will benefit from the rich studies in those fields

75
Part 1. RFID Data Mining
  • Introduction to RFID Data
  • Why RFID Data Warehousing and Mining?
  • RFID Data Warehousing
  • Mining RFID Data Sets
  • Conclusions

76
Part I Conclusions
  • A new RFID warehouse model
  • Allows efficient and flexible analysis of RFID
    data in multidimensional space
  • Preserves the structure of the data
  • Compresses data by exploiting bulky movements,
    concept hierarchies, and path collapsing
  • Mining RFID data
  • Powerful mining mechanisms can be constructed
    with RFID data warehouse
  • Flowgraph analysis, data cleaning,
    classification, clustering, trend analysis,
    frequent/sequential pattern analysis, outlier
    analysis
  • Lots can be done in RFID data analysis

77
Part II. Trajectory Data Mining
  • Introduction to Trajectory Data
  • Pattern Mining
  • Clustering
  • Classification
  • Outlier Detection

78
Trajectory Data
  • A trajectory is a sequence of the location and
    timestamp of a moving object

Hurricanes
Turtles
Vehicles
Vessels
79
Importance of Analysis on Trajectory Data
  • The world is becoming more and more mobile
  • Prevalence of mobile devices such as cell phones, smart phones, and PDAs
  • Satellite, sensor, RFID, and wireless technologies have improved rapidly
  • Tremendous amounts of trajectory data of moving objects are being collected

80
Research Impacts
  • Trajectory data mining has many important,
    real-world applications driven by the real need
  • Homeland security (e.g., border monitoring)
  • Law enforcement (e.g., video surveillance)
  • Weather forecast
  • Traffic control
  • Location-based service

81
Part II. Trajectory Data Mining
  • Introduction to Trajectory Data
  • Pattern Mining
  • Clustering
  • Classification
  • Outlier Detection

82
Trajectory Pattern (Giannotti et al. 07)
  • A trajectory pattern should describe the
    movements of objects both in space and in time

83
Definition of Trajectory Patterns
  • A Trajectory Pattern (T-pattern) is a pair (s, a)
  • s = <(x0,y0), ..., (xk,yk)> is a sequence of k+1 locations
  • a = <a1, ..., ak> are the transition times (annotations)
  • Also written as: (x0,y0) --a1--> (x1,y1) --a2--> ... --ak--> (xk,yk)
  • A T-pattern Tp occurs in a trajectory if the trajectory contains a subsequence S such that
  • Each (xi,yi) in Tp matches a point in S, and
  • The transition times in Tp are similar to those in S
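The occurrence test above can be sketched as a brute-force subsequence search. This is a hypothetical checker, not the TAS/PrefixSpan-based mining algorithm: region_of (mapping a point to its spatial region) and the tolerance tau on transition times are assumed inputs.

```python
def occurs(pattern_regions, annotations, trajectory, region_of, tau=1.0):
    """trajectory: list of (point, t). True if some subsequence visits
    pattern_regions in order with transition times within tau of annotations."""
    def search(pi, start, prev_t):
        if pi == len(pattern_regions):
            return True                      # every pattern element matched
        for i in range(start, len(trajectory)):
            point, t = trajectory[i]
            if region_of(point) != pattern_regions[pi]:
                continue
            # For non-first elements, the transition time must be similar.
            if pi > 0 and abs((t - prev_t) - annotations[pi - 1]) > tau:
                continue
            if search(pi + 1, i + 1, t):
                return True
        return False
    return search(0, 0, None)
```

In the test, points already are region ids, so region_of is the identity.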
84
Characteristics of T-Patterns
  • Routes between two consecutive regions are not
    relevant
  • Absolute times are not relevant

These two movements are not discriminated
1 hour
A
B
1 hour
These two movements are not discriminated
1 hour at 5 p.m.
A
B
1 hour at 9 a.m.
85
T-Pattern Mining
  • 1. Convert each trajectory to a sequence, i.e.,
    by converting a location (x,y) into a region

86
  • 2. Execute the TAS (temporally annotated
    sequence) algorithm, developed by the same
    authors, over the set of converted trajectories
  • A TAS is a sequential pattern annotated with
    typical transition times between its elements
  • The algorithm of TAS mining is an extension of
    PrefixSpan so as to accommodate transition times

87
Sample T-Patterns
Data source: trucks in Athens (273 trajectories)
88
Periodic Pattern (Mamoulis et al. 04)
  • In many applications, objects follow the same
    routes (approximately) over regular time
    intervals
  • e.g., Bob wakes up at the same time and then follows, more or less, the same route to his work every day

89
Definition of Periodic Patterns
  • Let S be a sequence of n spatial locations l0, l1, ..., l(n-1), representing the movement of an object over a long history
  • Let T << n be an integer called the period
  • A periodic pattern P is defined by a sequence r0 r1 ... r(T-1) of length T that appears in S more than min_sup times
  • For every ri in P, either ri = * (wildcard) or the location l(jT+i) is inside region ri

90
Periodic Pattern Mining
  • 1. Obtain frequent 1-patterns
  • Divide the sequence S of locations into T spatial datasets, one for each offset of the period T, i.e., locations li, l(i+T), ..., l(i+(m-1)T) go to a set Ri
  • Perform DBSCAN on each dataset
  • e.g.,

Five clusters discovered in datasets R1, R2, R3,
R4, and R6
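Step 1 can be sketched as below. The offset partitioning follows the definition above; the density step is a toy 1-D grid count standing in for DBSCAN (cell width and min_sup are assumed parameters), just to keep the example self-contained.

```python
def partition_by_offset(locations, period):
    """Split l0, l1, ... into `period` datasets R_0..R_{T-1}:
    R_i collects l_i, l_{i+T}, l_{i+2T}, ..."""
    sets = [[] for _ in range(period)]
    for j, loc in enumerate(locations):
        sets[j % period].append(loc)
    return sets

def dense_cells(points, cell=10, min_sup=2):
    """Toy density step: return grid cells holding at least min_sup points."""
    counts = {}
    for p in points:
        counts[p // cell] = counts.get(p // cell, 0) + 1
    return sorted(c for c, n in counts.items() if n >= min_sup)
```

Each dense region found in R_i becomes a candidate symbol ri for position i of a periodic pattern.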
91
  • 2. Find longer patterns Two methods
  • Bottom-up, level-wise technique
  • Generate k-patterns using a pair of (k-1)-patterns whose first k-2 non-* regions are in the same position
  • Use a variant of the Apriori-TID algorithm

(Figure: 2-length patterns over clusters r1a, r2b, r3c, r1d, r2e, r3f, r1w, r1x, r2y, r3z generate 3-length patterns such as r1a r2b r3c and r1d r2e r3f)
92
  • Faster top-down approach
  • Replace each location in S with the id of the cluster it belongs to, or with * if the location belongs to no cluster
  • Use a sequence mining algorithm to quickly discover all frequent patterns of the form r0 r1 ... r(T-1), where each ri is a cluster in a set Ri or *
  • Create a max-subpattern tree and traverse the tree in a top-down, breadth-first order

93
Four Kinds of Relative Motion Patterns (Laube et
al. 04, Gudmundsson et al. 07)
  • Flock (parameters m > 1 and r > 0): at least m entities are within a circular region of radius r and they move in the same direction
  • Leadership (parameters m > 1, r > 0, and s > 0): at least m entities are within a circular region of radius r, they move in the same direction, and at least one of the entities was already heading in this direction for at least s time steps
  • Convergence (parameters m > 1 and r > 0): at least m entities will pass through the same circular region of radius r (assuming they keep their direction)
  • Encounter (parameters m > 1 and r > 0): at least m entities will be simultaneously inside the same circular region of radius r (assuming they keep their speed and direction)
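A naive single-time-step flock test can be sketched as below. This is an assumed illustration, not the Voronoi-based algorithm from the papers: it approximates "some disk of radius r" by centering candidate disks on the entities themselves, and treats "same direction" as headings within an assumed angular tolerance.

```python
from math import hypot, pi

def is_flock(positions, headings, m, r, angle_tol=pi / 8):
    """positions: list of (x, y); headings: angles in radians.
    True if >= m entities sit in some entity-centered disk of radius r
    and share (approximately) one heading."""
    n = len(positions)
    for i in range(n):
        count = 0
        for j in range(n):
            close = hypot(positions[j][0] - positions[i][0],
                          positions[j][1] - positions[i][1]) <= r
            # Smallest signed angle between the two headings.
            diff = abs((headings[j] - headings[i] + pi) % (2 * pi) - pi)
            if close and diff <= angle_tol:
                count += 1
        if count >= m:
            return True
    return False
```

The exact algorithms avoid this O(n^2) scan; the sketch only illustrates the predicate being tested.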

94
  • Examples

An example of a flock pattern for p1, p2, and p3 at the 8th time step; also a leadership pattern with p2 as the leader
A convergence pattern if m = 4, for p2, p3, p4, and p5
95
  • Algorithms Exact and approximate algorithms are
    developed
  • Flock: use the higher-order Voronoi diagram
  • Leadership: additionally check the leader condition

(t is a multiplicative factor in all time bounds)
96
An Extension of Flock Patterns (Benkert et al.
06, Gudmundsson and Kreveld 07)
  • A new definition considers multiple time steps, whereas the previous definition considers only one time step
  • Flock: a flock in a time interval I, where the duration of I is at least k, consists of at least m entities such that for every point in time within I there is a disk of radius r that contains all the m entities
  • e.g.,

A flock through 3 time steps
97
Computing Flock Patterns
  • Approximate flocks
  • Convert overlapping segments of length k to
    points in a 2k-dimensional space
  • Find 2k-d pipes that contain at least m points
  • Longest-duration flocks
  • For every entity v, compute a cylindrical region and the intervals from the intersection of the cylinders
  • Pick the longest one

98
An Extension of Leadership Patterns (Andersson
et al. 07)
  • Leadership We have a leadership pattern if there
    is an entity that is a leader of at least m
    entities for at least k time units
  • An entity ej is said to be a leader during [tx, ty], for time-points tx ≤ ty, if and only if ej does not follow anyone during [tx, ty], and ej is followed by sufficiently many entities during [tx, ty]

(Figure: ei follows ej when their headings di and dj differ by at most β)
99
Reporting Leadership Patterns
  • Algorithm Build and use the follow-arrays

e.g., store nonnegative integers specifying for how many past consecutive unit time intervals ej has been following ei (ej → ei)
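The follow-array bookkeeping can be sketched as below. This is an assumed illustration (the follows predicate per interval is given as input), and the leaders() helper is simplified: unlike the definition above, it does not check that the leader itself follows no one.

```python
from collections import defaultdict

def update_follow_array(F, follows_now):
    """F: dict (j, i) -> count of past consecutive intervals in which
    e_j followed e_i. follows_now: set of (j, i) pairs following in the
    current unit interval. Pairs absent from follows_now reset to 0."""
    return {pair: F.get(pair, 0) + 1 for pair in follows_now}

def leaders(F, m, k):
    """Entities followed by at least m entities for at least k consecutive
    intervals (simplified: ignores the 'follows no one' condition)."""
    count = defaultdict(int)
    for (j, i), c in F.items():
        if c >= k:
            count[i] += 1
    return {i for i, n in count.items() if n >= m}
```

One pass per unit interval keeps the counters current, which is what makes the reporting incremental.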
100
Trajectory Join (Bakalov et al. 05)
  • Identify all pairs of similar trajectories between two datasets; deal with the restricted version of the problem where a temporal predicate is specified by the query
  • e.g., identify the pairs of trucks that were never apart from each other by more than 1 mile this morning
  • Definition: given two sets of object trajectories R and S, a threshold e and a time interval dt, the result of the trajectory join query is a subset V of pairs <Ri, Sj> (where Ri ∈ R, Sj ∈ S), such that during the time interval dt the distance Ddt(Ri, Sj) ≤ e for any pair in V
  • Ri and Sj are sub-trajectories for the time interval dt

101
Evaluation of Trajectory Join
  • Use the Piecewise Aggregate Approximation (PAA) and then reduce trajectories to strings
  • e.g., the string a4 a3 a2 a1 a2
  • Introduce a distance function for strings that appropriately lower-bounds the distance function Ddt for trajectories
  • Propose a pruning heuristic for reducing the number of trajectory pairs that need to be examined
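The PAA-then-symbolize reduction can be sketched for one coordinate as below. This is an assumed 1-D illustration (the paper works on 2-D trajectories, and the breakpoints here are made-up parameters): average the series over w equal frames, then map each average to a letter, giving a string like "a4 a3 a2 a1 a2".

```python
def paa(series, w):
    """Piecewise Aggregate Approximation: mean of each of w equal segments."""
    n = len(series)
    return [sum(series[i * n // w:(i + 1) * n // w]) /
            (((i + 1) * n // w) - (i * n // w)) for i in range(w)]

def symbolize(values, breakpoints, alphabet="abcd"):
    """Map each PAA value to a symbol by the number of breakpoints it exceeds."""
    out = []
    for v in values:
        k = sum(1 for b in breakpoints if v > b)
        out.append(alphabet[k])
    return "".join(out)
```

Joining then compares short strings first, and only verifies candidate pairs against the raw trajectories, because the string distance lower-bounds Ddt.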
102
Time-Relaxed Trajectory Join (Bakalov et al. 05)
  • Here, the interval dt can be anywhere in each
    trajectory
  • Definition Two trajectories match if there exist
    time intervals of the same length dt such that
    the distance between the locations of the two
    trajectories during these intervals is no more
    than the spatial threshold ε

103
Evaluation of Time-Relaxed Trajectory Join
  • Approximate raw trajectories using symbolic
    representations each trajectory is represented
    as a string
  • Generate all subsequences of length k for each
    string (Assume dt covers completely a total of k
    frames)
  • Compare all pairs of subsequences and obtain the
    candidates where the distance is no more than k·ε
  • Verify the candidates by accessing raw
    trajectories
  • Provide two heuristics for reducing false
    positives
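The subsequence-generation and candidate steps above can be sketched as follows; `sym_dist` is assumed to lower-bound the true distance between the frames the symbols stand for, as the method requires, so no true match is pruned:

```python
def subseqs(s, k):
    # All contiguous subsequences (substrings) of length k, with offsets
    return [(i, s[i:i + k]) for i in range(len(s) - k + 1)]

def candidate_pairs(r_str, s_str, k, sym_dist, eps):
    """Candidate matches for the time-relaxed join: pairs of k-length
    substrings whose summed symbol distance is at most k * eps."""
    cands = []
    for i, a in subseqs(r_str, k):
        for j, b in subseqs(s_str, k):
            if sum(sym_dist(x, y) for x, y in zip(a, b)) <= k * eps:
                cands.append((i, j))
    return cands
```

Each surviving pair would then be verified against the raw trajectories.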

104
Hot Motion Path (Sacharidis et al. 08)
  • Identify hot motion paths followed by moving
    objects over a sliding window with guarantees
  • Motion path A directed line segment
    approximating objects movement
  • Hotness The number of objects crossing a motion
    path within the window
  • Guarantees User-defined tolerance ε for
    approximating the location of an object at a
    given time

105
Example of Hot Motion Paths
  • Consider 4 moving objects and their trajectories
  • 1. Extract motion paths
  • 2. Calculate hotness
  • 3. Select the hottest (hotness 2)

106
Finding Hot Motion Paths
  • System setting
  • Objects communicate with the central coordinator
  • Two-tiered approach
  • Object side RayTrace algorithm
  • Update locations only when the object falls
    outside a filter
  • Coordinator side SinglePath strategy
  • Discover motion paths using lightweight indexes

107
Part II. Trajectory Data Mining
  • Introduction to Trajectory Data
  • Pattern Mining
  • Clustering
  • Classification
  • Outlier Detection

108
Moving Object Clustering
  • A moving cluster is a set of objects that move
    close to each other for a long time interval
  • Note Moving clusters and flock patterns are
    essentially the same
  • Formal Definition Kalnis et al. 05
  • A moving cluster is a sequence of (snapshot)
    clusters c1, c2, …, ck such that for each
    timestamp i (1 ≤ i < k), |ci ∩ ci+1| / |ci ∪
    ci+1| ≥ θ (0 < θ ≤ 1)
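A simplified chaining sketch of this definition (snapshot clusters are assumed given, e.g. produced by DBSCAN per time slice; unlike the full MC1 algorithm, new chains are opened only at the first snapshot):

```python
def moving_clusters(snapshots, theta):
    """snapshots: list of lists of clusters; each cluster is a set of
    object ids. Chains snapshot clusters c1..ck into moving clusters
    whenever |ci & ci+1| / |ci | ci+1| >= theta (Jaccard test)."""
    current = [[c] for c in snapshots[0]]   # open chains
    results = []
    for clusters in snapshots[1:]:
        nxt = []
        for chain in current:
            last = chain[-1]
            extended = False
            for c in clusters:
                if len(last & c) / len(last | c) >= theta:
                    nxt.append(chain + [c])
                    extended = True
            if not extended:
                results.append(chain)   # chain ends, report it
        current = nxt
    return results + current
```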

109
Retrieval of Moving Clusters (Kalnis et al. 05)
  • Basic algorithm (MC1)
  • 1. Perform DBSCAN for each time slice
  • 2. For each pair of a cluster c and a moving
    cluster g, check if g can be extended by c
  • If yes, g is used at the next iteration
  • If no, g is returned as a result
  • Improvements
  • MC2 Avoid redundant checks (Improve Step 2)
  • MC3 Reduce the number of executing DBSCAN
    (Improve Step 1)

110
Moving Micro-Clusters (Li et al. 04)
  • A group of objects that are not only close to
    each other, but also likely to move together for
    a while
  • It is desirable to provide multi-level data
    analysis for prohibitively large datasets: a
    moving micro-cluster could be viewed as a moving
    object
  • Initial moving micro-clusters are obtained using
    a generic clustering algorithm; then, split and
    collision events are identified

111
Trajectory Clustering
  • Group similar trajectories into the same cluster
  • 1. Whole Trajectory Clustering
  • Probabilistic Clustering
  • Density-Based Clustering TF-OPTICS
  • 2. Partial Trajectory Clustering
  • The Partition-and-Group Framework

112
Probabilistic Trajectory Clustering (Gaffney and
Smyth 99)
  • Basic assumption The data are being produced in
    the following generative manner
  • An individual is drawn randomly from the
    population of interest
  • The individual has been assigned to a cluster k
    with probability wk; these are the prior weights
    on the K clusters
  • Given that an individual belongs to a cluster k,
    there is a density function fk(yj | θk) which
    generates an observed data item yj for the
    individual j

113
  • The probability density function of observed
    trajectories is a mixture density P(yj | xj) =
    Σk wk fk(yj | xj, θk)
  • fk(yj | xj, θk) is the density component
  • wk is the weight, and θk is the set of parameters
    for the k-th component
  • θk and wk can be estimated from the trajectory
    data using the Expectation-Maximization (EM)
    algorithm

114
Clustering Results For Hurricanes (Camargo et al.
06)
Figure: mean regression trajectories of the clusters
for the tracks of Atlantic named tropical cyclones,
1970-2003
115
Density-Based Trajectory Clustering (Nanni and
Pedreschi 06)
  • Define the distance between whole trajectories
  • A trajectory is represented as a sequence of
    location and timestamp
  • The distance between trajectories is the average
    distance between objects for every timestamp
  • Use the OPTICS algorithm for trajectories
  • e.g.,

Figure: trajectories in (X, Y, time) and the
corresponding reachability plot showing four clusters
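The distance defined above, the average of the per-timestamp object distances, can be sketched as:

```python
import math

def trajectory_distance(t1, t2):
    """Average Euclidean distance between two trajectories sampled at
    the same timestamps; t1, t2 are lists of (x, y) positions."""
    assert len(t1) == len(t2)
    total = 0.0
    for (x1, y1), (x2, y2) in zip(t1, t2):
        total += math.hypot(x1 - x2, y1 - y2)
    return total / len(t1)
```

OPTICS is then run with this function as the pairwise distance.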
116
Temporal Focusing TF-OPTICS
  • In a real environment, not all time intervals
    have the same importance
  • e.g., urban traffic In rush hours, many people
    move from home to work, and vice versa
  • Clustering trajectories only in meaningful time
    intervals can produce more interesting results
  • TF-OPTICS aims at searching the most meaningful
    time intervals, which allows us to isolate the
    clusters of higher quality

117
TF-OPTICS
  • Define the quality of a clustering
  • Take account of both high-density clusters and
    low-density noise
  • Can be computed directly from the reachability
    plot
  • Find the time interval that maximizes the quality
  • 1. Choose an initial random time interval
  • 2. Calculate the quality of neighborhood
    intervals generated by increasing or decreasing
    the starting or ending times
  • 3. Repeat Step 2 as long as the quality increases
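Steps 1-3 amount to a greedy hill climb over time intervals; a sketch, assuming a `quality(s, e)` callback that runs OPTICS on the interval [s, e] and scores its reachability plot (the callback is an illustrative stand-in):

```python
def best_interval(quality, start, end, t_min, t_max, step=1):
    """Move to the neighbouring interval (start/end shifted by one
    step) with the highest quality; stop when no neighbour improves."""
    best, best_q = (start, end), quality(start, end)
    improved = True
    while improved:
        improved = False
        s, e = best
        for ns, ne in [(s - step, e), (s + step, e),
                       (s, e - step), (s, e + step)]:
            if t_min <= ns < ne <= t_max:
                q = quality(ns, ne)
                if q > best_q:
                    best, best_q = (ns, ne), q
                    improved = True
    return best
```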

118
Partition-and-Group Framework (Lee et al. 07)
  • Existing algorithms group trajectories as a
    whole, so they might not be able to find similar
    portions of trajectories
  • e.g., the common behavior cannot be discovered
    since TR1-TR5 move in totally different
    directions
  • The partition-and-group framework is proposed to
    discover common sub-trajectories

Figure: trajectories TR1-TR5 sharing a common
sub-trajectory
119
Usefulness of Common Sub-Trajectories
  • Discovering common sub-trajectories is very
    useful, especially if we have regions of special
    interest
  • Hurricane Landfall Forecasts
  • Meteorologists will be interested in the common
    behaviors of hurricanes near the coastline or at
    sea (i.e., before landing)
  • Effects of Roads and Traffic on Animal Movements
  • Zoologists will be interested in the common
    behaviors of animals near the road where the
    traffic rate has been varied

120
Overall Procedure
  • Two phases partitioning and grouping

Figure: (1) a set of trajectories is partitioned
into a set of line segments; (2) the line segments
are grouped into clusters, each summarized by a
representative trajectory
Note a representative trajectory is a common
sub-trajectory
121
Partitioning Phase
  • Identify the points where the behavior of a
    trajectory changes rapidly ? characteristic
    points
  • An optimal set of characteristic points is found
    by using the minimum description length (MDL)
    principle
  • Partition a trajectory at every characteristic
    point

Figure: characteristic points and the resulting
trajectory partitions
122
Overview of the MDL Principle
  • The MDL principle has been widely used in
    information theory
  • The MDL cost consists of two components, L(H)
    and L(D|H), where H means the hypothesis, and D
    the data
  • L(H) is the length, in bits, of the description
    of the hypothesis
  • L(D|H) is the length, in bits, of the description
    of the data when encoded with the help of the
    hypothesis
  • The best hypothesis H to explain D is the one
    that minimizes the sum of L(H) and L(D|H)
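A simplified illustration of trading off L(H) against L(D|H) for trajectory partitioning: only perpendicular deviation is encoded here, whereas the actual formulation also uses angular distance, and the log2(1 + x) form is an assumption made to avoid taking the log of zero:

```python
import math

def perp_dist(p, a, b):
    # Perpendicular distance from point p to the line through a and b
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg = math.hypot(dx, dy)
    if seg == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dx * (ay - py) - (ax - px) * dy) / seg

def mdl_par(points, i, j):
    # Cost of encoding points[i..j] as the single segment (p_i, p_j):
    # L(H) = log length of the segment; L(D|H) = log summed deviation
    l_h = math.log2(1 + math.hypot(points[j][0] - points[i][0],
                                   points[j][1] - points[i][1]))
    l_d_h = math.log2(1 + sum(perp_dist(points[k], points[i], points[j])
                              for k in range(i + 1, j)))
    return l_h + l_d_h

def mdl_nopar(points, i, j):
    # Cost of keeping every original segment (L(D|H) = 0 by definition)
    return sum(math.log2(1 + math.hypot(points[k + 1][0] - points[k][0],
                                        points[k + 1][1] - points[k][1]))
               for k in range(i, j))
```

A characteristic point is chosen where `mdl_par` would first exceed `mdl_nopar`.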

123
MDL Formulation
  • Finding the optimal partitioning translates to
    finding the best hypothesis using the MDL
    principle
  • H → a set of trajectory partitions, D → a
    trajectory
  • L(H) → the sum of the lengths of all trajectory
    partitions
  • L(D|H) → the sum of the differences between a
    trajectory and its set of trajectory partitions
  • L(H) measures conciseness; L(D|H) measures
    preciseness

124
Grouping Phase (1/2)
  • Find the clusters of trajectory partitions using
    density-based clustering (i.e., DBSCAN)
  • A density-connect component forms a cluster,
    e.g., L1, L2, L3, L4, L5, L6

125
Grouping Phase (2/2)
  • Describe the overall movement of the trajectory
    partitions that belong to the cluster

A red line: a representative trajectory; a blue
line: an average direction vector; pink lines: line
segments in a density-connected set
126
Sample Clustering Results
7 Clusters from Hurricane Data
570 Hurricanes (1950-2004)
A red line a representative trajectory
127
2 Clusters from Deer Data
128
Part II. Trajectory Data Mining
  • Introduction to Trajectory Data
  • Pattern Mining
  • Clustering
  • Classification
  • Outlier Detection

129
Trajectory Classification
  • Predict the class labels of moving objects based
    on their trajectories and other features
  • 1. Machine learning techniques
  • Studied mostly in pattern recognition,
    bioengineering, and video surveillance
  • The hidden Markov model (HMM)
  • 2. TraClass Trajectory classification using
    hierarchical region-based and trajectory-based
    clustering

130
Machine Learning for Trajectory Classification
(Sbalzarini et al. 02)
  • Compare various machine learning techniques for
    biological trajectory classification
  • Data encoding
  • For the hidden Markov model, a whole trajectory
    is encoded as a sequence of momentary speeds
  • For other techniques, a whole trajectory is
    encoded as the mean and the minimum of the speed
    of the trajectory, thus a vector in R²
  • Two 3-class datasets Trajectories of living
    cells taken from the scales of the fish
    Gillichthys mirabilis
  • Temperature dataset 10°C, 20°C, and 30°C
  • Acclimation dataset Three different fish
    populations

131
Machine Learning Techniques Used
  • k-nearest neighbors (KNN)
  • A previously unseen pattern x is simply assigned
    to the same class to which the majority of its
    k-nearest neighbors belongs
  • Gaussian mixtures with expectation maximization
    (GMM)
  • Support vector machines (SVM)
  • Hidden Markov models (HMM)
  • Training Determine the model parameters λ = (A,
    B, π) to maximize P[x | λ] for a given
    observation x
  • Evaluation Given an observation x = O1, …, OT
    and a model λ = (A, B, π), compute the
    probability P[x | λ] that the observation x has
    been produced by a source described by λ
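As a minimal illustration of the k-NN route, assuming each trajectory is already encoded as a (mean speed, minimum speed) vector as described above (the helper names are illustrative):

```python
import math
from collections import Counter

def encode(traj):
    # Encode a trajectory of (x, y) points as (mean speed, min speed)
    speeds = [math.hypot(x2 - x1, y2 - y1)
              for (x1, y1), (x2, y2) in zip(traj, traj[1:])]
    return (sum(speeds) / len(speeds), min(speeds))

def knn_predict(train, labels, x, k=3):
    # Majority vote among the k nearest encoded trajectories
    nearest = sorted(range(len(train)),
                     key=lambda i: math.dist(train[i], x))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]
```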

132
Temperature data set
Acclimation data set
133
Vehicle Trajectory Classification (Fraile and
Maybank 98)
  • 1. The measurement sequence is divided into
    overlapping segments
  • 2. In each segment, the trajectory of the car is
    approximated by a smooth function and then
    assigned to one of four categories ahead, left,
    right, or stop
  • 3. In this way, the list of segments is reduced
    to a string of symbols drawn from the set {a, l,
    r, s}
  • 4. The string of symbols is classified using the
    hidden Markov model (HMM)

134
Use of the HMM for Classification
  • Classification of the global motions of a car is
    carried out using an HMM
  • The HMM contains four states, in order A, L, R,
    S, which are the true states of the car: ahead,
    turning left, turning right, stopped
  • The HMM has four output symbols, in order a, l,
    r, s, which are the symbols obtained from the
    measurement segments
  • The Viterbi algorithm is used to obtain the
    sequence of internal states
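A standard Viterbi sketch over the four states A, L, R, S and output symbols a, l, r, s; the probabilities used in the test below are invented for illustration, not taken from the paper:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden state sequence for an observed symbol string.
    V[t][s] = (probability of the best path ending in state s, path)."""
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        prev = V[-1]
        V.append({s: max(((prev[r][0] * trans_p[r][s] * emit_p[s][o],
                           prev[r][1] + [s]) for r in states),
                         key=lambda t: t[0])
                  for s in states})
    return max(V[-1].values(), key=lambda t: t[0])[1]
```

With sticky transitions and mostly faithful emissions, the inferred state track follows the observed symbols while smoothing over occasional noise.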

135
Experimental Result
Figure: a measurement sequence, its observed
symbols, and the sequence of inferred states; this
measurement sequence means the driver stops and then
turns to the right
136
Motion Trajectory Classification (Bashir et al.
07)
  • Motion trajectories
  • Tracking results from video trackers, sign
    language data measurements gathered from wired
    glove interfaces, and so on
  • Application scenarios
  • Sport video (e.g., soccer video) analysis Player
    movements ? A strategy
  • Sign and gesture recognition Hand movements ? A
    particular word

137
The HMM-Based Algorithm
  • 1. Trajectories are segmented at points of change
    in curvature
  • 2. Sub-trajectories are represented by their
    Principal Component Analysis (PCA) coefficients
  • 3. The PCA coefficients are represented using a
    GMM for each class
  • 4. An HMM is built for each class, where the
    state of the HMM is a sub-trajectory and is
    modeled by a mixture of Gaussians

138
Use of the HMM for Classification
  • Training and parameter estimation
  • The Baum-Welch algorithm is used to estimate the
    parameters
  • Classification
  • The PCA coefficient vectors of input
    trajectories after segmentation are posed as an
    observation sequence to each HMM (i.e., the one
    constructed for each class)
  • The maximum likelihood (ML) estimate of the test
    trajectory for each HMM is computed
  • The class is determined to be the one that has
    the largest maximum likelihood

139
Experimental Result
  • Datasets
  • The Australian Sign Language dataset (ASL)
  • 83 classes (words), 5,727 trajectories
  • A sport video data set (HJSL)
  • 2 classes, 40 trajectories of high jump and 68
    trajectories of slalom skiing objects
  • Accuracy

140
Common Characteristics of Previous Methods
  • Use the shapes of whole trajectories to do
    classification
  • Encode a whole trajectory into a feature vector
  • Convert a whole trajectory into a string or a
    sequence of the momentary speed or
  • Model a whole trajectory using the HMM
  • Note Although a few methods segment
    trajectories, the main purpose is to approximate
    or smooth trajectories before using the HMM

141
TraClass Trajectory Classification Based on
Clustering
  • Motivation
  • Discriminative features are likely to appear at
    parts of trajectories, not at whole trajectories
  • Discriminative features appear not only as common
    movement patterns, but also as regions
  • Solution
  • Extract features in a top-down fashion, first by
    region-based clustering and then by
    trajectory-based clustering

142
Intuition and Working Example
  • Parts of trajectories near the container port and
    near the refinery enable us to distinguish
    between container ships and tankers even if they
    share common long paths
  • Those in the fishery enable us to recognize
    fishing boats even if they have no common path
    there

143
Figure: trajectory partitions are processed by
region-based clustering and then trajectory-based
clustering to produce the features
144
Class-Conscious Trajectory Partitioning
  • 1. Trajectories are partitioned based on their
    shapes as in the partition-and-group framework
  • 2. Trajectory partitions are further partitioned
    by the class labels
  • The real interest here is to guarantee that
    trajectory partitions do not span the class
    boundaries

Figure: non-discriminative vs. discriminative
partitioning for classes A and B; additional
partitioning points are inserted at class boundaries
145
Region-Based Clustering
  • Objective Discover regions that have
    trajectories mostly of one class regardless of
    their movement patterns
  • Algorithm Find a better partitioning alternately
    for the X and Y axes as long as the MDL cost
    decreases
  • The MDL cost is formulated to achieve both
    homogeneity and conciseness

146
Trajectory-Based Clustering
  • Objective Discover sub-trajectories that
    indicate common movement patterns of each class
  • Algorithm Extend the partition-and-group
    framework for classification purposes so that the
    class labels are incorporated into trajectory
    clustering
  • If an ε-neighborhood contains trajectory
    partitions mostly of the same class, it is used
    for clustering; otherwise, it is discarded
    immediately

147
Selection of Trajectory-Based Clusters
  • After trajectory-based clusters are found, highly
    discriminative clusters are selected for
    effective classification
  • If the average distance from a specific cluster
    to other clusters of different classes is high,
    the discriminative power of the cluster is high
  • e.g.,

Figure: for classes A and B, cluster C1 is more
discriminative than C2
148
Overall Procedure of TraClass
  • 1. Partition trajectories
  • 2. Perform region-based clustering
  • 3. Perform trajectory-based clustering
  • 4. Select discriminative trajectory-based
    clusters
  • 5. Convert each trajectory into a feature vector
  • Each feature is either a region-based cluster or
    a trajectory-based cluster
  • The i-th entry of a feature vector is the
    frequency that the i-th feature occurs in the
    trajectory
  • 6. Feed feature vectors to the SVM
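Step 5 can be sketched as follows, with each feature represented by a membership predicate over trajectory partitions (the predicate form is an illustrative stand-in for cluster membership):

```python
def to_feature_vector(trajectory_partitions, features):
    """features: one membership predicate per region-based or
    trajectory-based cluster. Entry i of the result counts how many
    partitions of the trajectory fall into feature i; these frequency
    vectors are what gets fed to the SVM in step 6."""
    return [sum(1 for p in trajectory_partitions if f(p))
            for f in features]
```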

149
Classification Results
  • Datasets
  • Animal Three classes, i.e., three species: elk,
    deer, and cattle
  • Vessel Two classes, i.e., two vessels
  • Hurricane Two classes, i.e., category 2 and 3
    hurricanes
  • Methods
  • TB-ONLY Perform trajectory-based clustering only
  • RB-TB Perform both types of clustering
  • Results

150
Extracted Features
Features: 10 region-based clusters and 37
trajectory-based clusters
Data: three classes
Accuracy: 83.3%
151
Part II. Trajectory Data Mining
  • Introduction to Trajectory Data
  • Pattern Mining
  • Clustering
  • Classification
  • Outlier Detection

152
Trajectory Outlier Detection
  • Detect trajectory outliers that are grossly
    different from or inconsistent with the remaining
    set of trajectories
  • 1. Whole Trajectory Outlier Detection
  • An unsupervised method
  • A supervised method based on classification
  • 2. Integration with multi-dimensional information
  • 3. Partial Trajectory Outlier Detection
  • The Partition-and-Detect Framework

153
A Distance-Based Approach (Knorr Ng00)
  • Define the distance between two whole
    trajectories
  • A whole trajectory is represented by a summary
    feature vector
  • The distance between two whole trajectories is
    defined as a weighted sum of the distances
    between the corresponding features
154
  • Apply a distance-based approach to detection of
    trajectory outliers
  • An object O in a dataset T is a DB(p, D)-outlier
    if at least fraction p of the objects in T lies
    greater than distance D from O
  • Unsupervised learning
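A direct sketch of the DB(p, D)-outlier test, brute-force over all pairs; `dist` would be the whole-trajectory distance defined above:

```python
def db_outliers(objs, dist, p, D):
    """Return the DB(p, D)-outliers: objects for which at least
    fraction p of the other objects lie farther than distance D."""
    out = []
    for i, o in enumerate(objs):
        far = sum(1 for j, q in enumerate(objs)
                  if j != i and dist(o, q) > D)
        if far >= p * (len(objs) - 1):
            out.append(o)
    return out
```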

155
Sample Trajectory Outliers
  • Detect outliers from person trajectories in a room

The entire data set
The outliers only
156
Use of the Neural Network (Owens and Hunter 00)
  • A whole trajectory is encod