Geographic Privacy-aware Knowledge Discovery - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Geographic Privacy-aware Knowledge Discovery


1
Geographic Privacy-aware Knowledge Discovery and
Delivery
  • Fosca Giannotti (1), Dino Pedreschi (1), Yannis
    Theodoridis (2)
  • (1) KDD Lab, University of Pisa and ISTI-CNR, Italy
    www-kdd.isti.cnr.it
  • (2) InfoLab, University of Piraeus, Greece
    infolab.cs.unipi.gr

Tutorial at EDBT 2009, St Petersburg, 25th March
2009
2
Mobile devices and services
  • Large diffusion of mobile devices, mobile
    services and location-based services

3
Wireless networks as mobility data collectors
  • Wireless network infrastructures are the nerves
    of our territory
  • Besides offering their services, they gather
    highly informative traces of human mobile
    activities
  • UbiComp infrastructure will further push this
    phenomenon
  • Miniaturization, wearability and pervasiveness
    will produce traces of increasing
  • positioning accuracy
  • semantic richness

4
Which mobility data?
  • Location data from mobile phones, i.e. cell
    positions in the GSM/UMTS network.
  • Location data from GPS-equipped devices; Galileo
    in the (near?) future
  • Nokia (and other) mobile phones have an on-board
    GPS receiver, and can transmit GPS tracks by
    SMS/MMS
  • Location data from
  • peer-to-peer mobile networks
  • intelligent transportation environments (VANET)
  • ad hoc sensor networks, RFIDs (radio-frequency
    ids)

5
What can we learn from mobility data ...
6
Real-time density estimation in urban areas
The senseable project: http://senseable.mit.edu/grazrealtime/
7
(No Transcript)
8
More ambitiously: mobility patterns
9
From mobility data to mobility patterns
10
GeoPKDD
  • (for) Geographic Privacy-aware Knowledge
    Discovery and Delivery
  • Towards an archaeology of the present?
  • A scenario of great opportunities and risks
  • mining mobility data can yield useful knowledge
  • but individual privacy is at risk.
  • A new multidisciplinary research area is emerging
    at this crossroads, with potential for broad
    social and economic impact
  • F. Giannotti and D. Pedreschi (Eds.). Mobility,
    Data Mining and Privacy. Springer, 2008.

11
  • A paradigmatic project: GeoPKDD
  • http://www.geopkdd.eu
  • A European FP6/FET project
  • (Dec. 2005 - Mar. 2009)
  • 30 researchers involved (18 young researchers)
  • 80 conference papers, 34 journal papers, 2 books,
    7 workshops
  • 30 specific algorithms, 2 integration project
    platforms

12
The GeoPKDD scenario
  • From the analysis of the traces of our mobile
    phones it is possible to reconstruct our mobile
    behaviour, the way we collectively move
  • This knowledge may help us improve
    decision-making in many mobility-related issues
  • Planning traffic and public mobility systems in
    metropolitan areas
  • Planning physical communication networks
  • Localizing new services in our towns
  • Forecasting traffic-related phenomena
  • Organizing logistics systems
  • Avoiding repeated mistakes
  • Detecting changes in a timely manner.

13
The big picture
14
GSM network, WSN, GPS
End user
Mobility manager
Mobility Patterns
Mobility Data
Privacy and anonymity protection
Raw data
15
GeoPKDD research issues
Spatio-temporal patterns
  • Trajectory Warehouse
  • Privacy-preserving OLAP
  • Spatio-temporal models for moving objects
  • Moving Object DB
  • Geographic reasoning
  • Visual Analytics
  • ST data mining methods
  • Data mining query languages
  • Privacy-preserving data mining

16
Key questions
  • How to reconstruct a trajectory from raw logs,
    how to store and query trajectory data?
  • How to classify trajectories according to means
    of transportation (pedestrian, private vehicle,
    public transportation vehicle, ...)?
  • Which spatio-temporal patterns and/or models are
    useful abstractions of mobility data?
  • How to compute such patterns and models
    efficiently?
  • Privacy protection and anonymity: how to make
    such concepts formally precise and measurable?
  • How to find an optimal trade-off between privacy
    protection and quality of the analysis?

17
A guided tour on GeoPKDD
  • Mobility data management
  • Acquiring and storing trajectories in MODs
  • Location-aware querying
  • Trajectory indexing
  • Mobility data mining
  • Trajectory warehousing and OLAP
  • Mobility data mining
  • Visual analytics for mobility data
  • Privacy aspects on mobility data
  • Preserving anonymity
  • (Semantic enriched) Geographic KDD process
  • Combining Mining and Querying
  • Ontological framework for end-user querying and
    reasoning
  • Outlook

18
A guided tour on GeoPKDD
  • Mobility data management
  • Acquiring and storing trajectories in MODs
  • Location-aware querying
  • Trajectory indexing
  • Mobility data mining and the geographic KDD
    process
  • Trajectory warehousing and OLAP
  • Mobility data mining
  • Visual analytics for mobility data
  • Privacy aspects on mobility data
  • Preserving anonymity
  • (Semantic enriched) Geographic KDD process
  • Combining Mining and Querying
  • Ontological framework for end-user querying and
    reasoning
  • Outlook

19
Acquiring and Storing Trajectories in MODs
  • About mobility data
  • The trajectory reconstruction problem
  • MOD engines

20
Mobility Data
  • Typical structure and size

 N  Time               Lat        Lon       Height  Course  Speed   PDOP  State  NSat
 8  22/03/07 08:51:52  50.777132  7.205580  67.6    345.4   21.817  3.8   1808   4
 9  22/03/07 08:51:56  50.777352  7.205435  68.4     35.6   14.223  3.8   1808   4
10  22/03/07 08:51:59  50.777415  7.205543  68.3    112.7   25.298  3.8   1808   4
11  22/03/07 08:52:03  50.777317  7.205877  68.8    119.8   32.447  3.8   1808   4
12  22/03/07 08:52:06  50.777185  7.206202  68.1    124.1   30.058  3.8   1808   4
13  22/03/07 08:52:09  50.777057  7.206522  67.9    117.7   34.003  3.8   1808   4
14  22/03/07 08:52:12  50.776925  7.206858  66.9    117.5   37.151  3.8   1808   4
15  22/03/07 08:52:15  50.776813  7.207263  67.0     99.2   39.188  3.8   1808   4
16  22/03/07 08:52:18  50.776780  7.207745  68.8     90.6   41.170  3.8   1808   4
17  22/03/07 08:52:21  50.776803  7.208262  71.1     82.0   35.058  3.8   1808   4
18  22/03/07 08:52:24  50.776832  7.208682  68.6    117.1   11.371  3.8   1808   4
21
Location data producers GSM, GPS, WiFi
Streaming data manager Trajectory reconstructor
Streaming location data are received
Trajectory data are reconstructed
Moving Object Database
22
The trajectory reconstruction problem
  • From raw location data (obj-id, x, y, t)
  • To trajectory data (obj-id, traj-id, (x, y, t))

a sample of a user's movement (GPS recordings)
a sample of reconstructed trajectories
23
Reconstructing trajectories
  • Collected raw data represent time-stamped
    geo-locations
  • Raw points arrive in bulk sets
  • We need a filter that decides if the new series
    of data is to be appended to an existing
    trajectory or not
  • Prospective parameters
  • Tolerance distance
  • Temporal gap
  • Spatial gap
  • Maximum speed
  • Maximum noise duration
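
The filter described on this slide could look roughly like the sketch below. It is only an illustration under stated assumptions: the function name reconstruct, the default thresholds and the planar (x, y, t) units are hypothetical, and only three of the listed parameters (temporal gap, spatial gap, maximum speed) are used; tolerance distance and maximum noise duration are left out.

```python
from math import hypot

def reconstruct(points, temporal_gap=300.0, spatial_gap=500.0, max_speed=50.0):
    """Split one object's raw (x, y, t) points into trajectories.

    Hypothetical thresholds: seconds, metres and metres per second.
    """
    trajectories = []
    current = []
    for x, y, t in sorted(points, key=lambda p: p[2]):
        if current:
            px, py, pt = current[-1]
            dt = t - pt
            dist = hypot(x - px, y - py)
            if dt <= 0:
                continue  # duplicate timestamp: ignore the point
            if dist / dt > max_speed:
                continue  # implausible jump: treat the point as noise
            if dt > temporal_gap or dist > spatial_gap:
                # Gap too large: close the current trajectory, start a new one.
                trajectories.append(current)
                current = []
        current.append((x, y, t))
    if current:
        trajectories.append(current)
    return trajectories

raw = [(0, 0, 0), (10, 0, 10), (20, 0, 20), (5000, 0, 21),
       (30, 0, 30), (40, 0, 1000), (50, 0, 1010)]
trips = reconstruct(raw)
```

In this toy input the point at t = 21 is dropped as noise, and the long pause before t = 1000 starts a second trajectory.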







24
Moving Objects Database Systems
  • The traditional database technology has been
    extended into Moving Object Databases (MODs) that
    handle modeling, indexing and query processing
    issues for trajectories
  • Spatial and temporal dimensions are considered as
    first-class citizens.
  • Both past and current (as well as anticipated
    future) positions of moving objects are of
    interest.
  • Several prototype MODs
  • DOMINO (Wolfson et al.): NGITS'02, EDBT'02,
    ICDT'05
  • PLACE (Aref et al.): SSDBM'04, VLDB'04
  • SECONDO (Güting et al.): IDEAS'00, ICDE'05,
    MDM'06
  • HERMES (Pelekis et al.): EDBT'06, SIGMOD'08

25
DOMINO
  • Databases for Moving Objects tracking
    (http://www.cs.uic.edu/wolfson/html/mobile.html)
  • Built on top of a DBMS using a three-layer
    approach
  • Utilize dynamic attributes for future predicted
    locations
  • Manage uncertainty that is inherent in future
    motion plans
  • Support various location models
  • Exact point location
  • An area in which the object is located
  • An approximate motion plan
  • A complete motion plan

26
SECONDO
  • An Extensible DBMS Architecture and Prototype
    (http://dna.fernuni-hagen.de/Secondo.html/index.html)
  • A generic DBMS framework that can be filled with
    implementation of various data models
    (relational, object-oriented, or XML) and data
    types (spatial data, moving objects)
  • A database is a set of SECONDO objects of the
    form (name, type, value), where type is one of
    (about 20) implemented algebras
  • Query optimizer includes
  • optimization of conjunctive queries
  • selectivity estimation
  • implementation of an SQL-like query language
  • Built on top of Berkeley DB.

Command Manager
GUI
Query Processor Catalog

Op 1
Op 2
Op n
Optimizer
Storage Manager Tools
Kernel
27
The PLACE Server
  • Pervasive Location-Aware Computing Environments
    (http://www.cs.purdue.edu/place/)
  • Continuous evaluation of queries over
    spatio-temporal data streams
  • Shared execution among concurrent continuous
    queries
  • Built inside a DB engine
  • Incremental evaluation of continuous queries
  • Spatio-temporal query operators

28
The Hermes MOD engine
  • Data model
  • Current location as a function in time over the
    starting location
  • linear vs. arc movement
  • A palette of ADTs
  • Moving Point, Moving Rectangle, Moving Polygon,
    etc.
  • on top of Oracle Spatial Cartridge
  • Indexing support
  • TB-tree for Trajectories, R-tree for stationary
    spatial data

29
Location-aware querying
  • Traditional vs. advanced queries

30
What kind of queries?
  • The nature of trajectory data provides us with
    the ability to query them with a variety of
    operators
  • Coordinate-based
  • Range
  • Nearest Neighbor
  • Similarity-based
  • Trajectory-based
  • Topological
  • Derived information
  • Combined

31
Coordinate-based queries
  • Spatial (range or NN) search
  • Find all trajectories that were inside area A at
    time instant t (or time interval I) or
  • Find the trajectory that was closest to point B
    at time instant t (or time interval I)
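
As a rough illustration of the two coordinate-based queries above, the sketch below answers them by a linear scan, with no index at all. The names position_at, range_query and nn_query, and the representation of a trajectory as a time-ordered list of (x, y, t) samples with linear interpolation in between, are assumptions made for this example only.

```python
from bisect import bisect_right
from math import hypot

def position_at(traj, t):
    """Linearly interpolated (x, y) at time t, or None outside the lifespan."""
    times = [p[2] for p in traj]
    if not traj or t < times[0] or t > times[-1]:
        return None
    i = bisect_right(times, t)
    if i == len(traj):
        return traj[-1][0], traj[-1][1]
    (x0, y0, t0), (x1, y1, t1) = traj[i - 1], traj[i]
    w = (t - t0) / (t1 - t0)
    return x0 + w * (x1 - x0), y0 + w * (y1 - y0)

def range_query(trajs, rect, t):
    """Ids of trajectories inside rect = (xmin, ymin, xmax, ymax) at time t."""
    xmin, ymin, xmax, ymax = rect
    hits = []
    for tid, traj in trajs.items():
        pos = position_at(traj, t)
        if pos and xmin <= pos[0] <= xmax and ymin <= pos[1] <= ymax:
            hits.append(tid)
    return hits

def nn_query(trajs, point, t):
    """Id of the trajectory closest to point at time t (None if none is alive)."""
    best, best_d = None, float("inf")
    for tid, traj in trajs.items():
        pos = position_at(traj, t)
        if pos is not None:
            d = hypot(pos[0] - point[0], pos[1] - point[1])
            if d < best_d:
                best, best_d = tid, d
    return best

trajs = {
    "a": [(0, 0, 0), (10, 0, 10)],
    "b": [(0, 5, 0), (0, 15, 10)],
    "c": [(100, 100, 20), (110, 100, 30)],
}
inside = range_query(trajs, (4, -1, 6, 1), 5)
closest = nn_query(trajs, (0, 9), 5)
```

A MOD engine would replace the linear scan with a spatio-temporal index; the scan only makes the query semantics explicit.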

32
Trajectory-based queries
  • Topological / directional search
  • Find all trajectories that entered (crossed,
    left, bypassed, etc.) or were located west
    (south, etc.) of an area or
  • Find all trajectories that crossed (met, etc.)
    or were located left of (right of, in front of,
    etc.) a query trajectory TQ

33
but even more advanced queries
  • Most-similar-trajectories
  • [Frentzos et al. 2007] Given a query trajectory
    TQ, show me the k most similar trajectories to
    TQ (perhaps constrained in space and/or time)
  • Motion patterns
  • [Hadjieleftheriou et al. 2005] Find objects that
    crossed through region A at time t1, came as
    close as possible to point B at a later time t2
    and then stopped inside circle C during interval
    (t3, t4)

34
Trajectory Similarity Queries
  • Issues
  • How do we measure (dis-)similarity between two
    trajectories?
  • Similarity variations
  • Similarity in space, in time, in derived info
    (e.g. speed, acceleration, direction)
  • Similarity queries have been studied extensively
    in time-series literature
  • Here, things are different! Where you are and at
    what time are important.
  • In time series, by contrast, there is no spatial
    component, and we typically start with
    normalization

35
Motion Pattern Queries
  • Trajectories represent behavior over time: they
    capture the evolution of a movement
  • Can we query the behavior / motion of
    trajectories?
  • Yes! We can use complex motion patterns
  • A Motion Pattern (MP) query is actually a
    time-ordered sequence of primitive queries
  • Qmp = Q1 → Q2 → Q3 ..., where Qi is a primitive
    query
  • The time-ordering of the spatial predicates may
    be explicit or implicit
  • MP queries are different!
  • They are not typical similarity queries
  • It is not the same predicate that holds for the
    duration of an interval
  • They are not typical range/NN queries
  • We can now choose separate predicates at
    different times
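
A motion pattern query can be pictured as a list of predicates that must be satisfied one after the other along the trajectory. The sketch below is a simplified illustration, not the evaluation strategy of any cited system: the time ordering is implicit (each predicate must hold at a strictly later sample than the previous one), and the helper names in_rect, in_circle and matches_motion_pattern are hypothetical.

```python
from math import hypot

def in_rect(xmin, ymin, xmax, ymax):
    """Primitive query: the sample lies inside an axis-parallel rectangle."""
    return lambda p: xmin <= p[0] <= xmax and ymin <= p[1] <= ymax

def in_circle(cx, cy, r):
    """Primitive query: the sample lies inside a circle."""
    return lambda p: hypot(p[0] - cx, p[1] - cy) <= r

def matches_motion_pattern(traj, predicates):
    """True if samples of traj satisfy the predicates in time order."""
    samples = iter(sorted(traj, key=lambda p: p[2]))
    # Each predicate consumes samples until it finds a match, so the next
    # predicate can only match a strictly later sample.
    return all(any(pred(p) for p in samples) for pred in predicates)

traj = [(0, 0, 0), (5, 5, 1), (10, 10, 2), (20, 20, 3)]
pattern = [in_rect(-1, -1, 1, 1), in_circle(10, 10, 1), in_rect(19, 19, 21, 21)]
forward = matches_motion_pattern(traj, pattern)
backward = matches_motion_pattern(traj, list(reversed(pattern)))
```

The same trajectory matches the pattern in one order and not in the reverse order, which is what separates an MP query from a plain range query.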

36
Trajectory Indexing
  • Indexing in native vs. parametric space
  • Trajectory-oriented indexing techniques

37
Two approaches
  • Indexing in the Native Space
  • Typically approximate using MBRs, then index
    these MBRs
  • Advantage: we can use R-trees etc.; can also
    index other moving objects (areas etc.)
  • Disadvantage: trajectories are lines, thus MBRs
    add extensive empty space
  • Indexing in the Parametric Space
  • Approximate each trajectory by a function
    (typically a polynomial), then index the
    function's coefficients
  • Advantage: better approximation
  • Disadvantage: need to translate between Native
    and Parametric spaces; better approximation means
    more coefficients

38
Indexing in the Native Space
  • Traditional approaches
  • One MBR per trajectory
  • Too much empty space
  • vs. one MBR per segment
  • Too many objects
  • Can we do any better?
  • Q: Where can we cut for MBRs?
  • A: Balance this tradeoff [Hadjieleftheriou et
    al. 2002]

39
Trajectory-oriented indexing techniques
  • Indexing movement in free space
  • The multi-version R-tree (MVR-tree)
  • The trajectory-bundle tree (TB-tree)
  • Indexing network-constrained movement
  • The fixed-network-restricted tree (FNR-tree)

40
MVR-tree
  • (Kumar et al. 1998), (Kollios et al. 2001), (Tao
    and Papadias, 2001)
  • The idea is to use a multi-version structure
    (MVR-tree) to index the trajectory approximations
  • Why multi-versioning?
  • A traditional R-tree considers time as another
    dimension
  • for example, (x, y, t) creates a 3D R-tree
  • Instead, an MVR-tree effectively provides a
    separate 2D R-tree, indexing each time-slice

41
TB-tree
  • (Pfoser et al. 2000) Maintains the trajectory
    concept
  • Each node consists of segments of a single
    trajectory
  • nodes corresponding to the same trajectory are
    linked together in a chain
  • Effective for trajectory-oriented queries
  • Implemented in the Hermes MOD engine using
    Oracle's indexing extensibility

42
FNR-tree
  • FNR-tree (Frentzos, 2003)
  • a forest of 1D (temporal) R-trees on top of a 2D
    (spatial) R-tree
  • There is an additional Parent 1D R-tree which
    indexes the temporal intervals of the 1D R-trees'
    leaf nodes

43
Indexing in the Parametric Space
  • Each trajectory is a collection of functions
  • Recall the definition of a trajectory
  • A trajectory T is defined as T = {oid, t0,
    (f1(t), t1), (f2(t), t2), ...}, where
  • f1(t), f2(t), ... are functions of time, fj
    representing movement during time interval
    [tj-1, tj],
  • t1, t2, ... are time stamps in chronological order
  • If fi(t) is a linear function, the trajectory
    becomes a polyline
44
Indexing in the Parametric Space (cont.)
  • A first approach in [Porkaew et al. 2001]: use
    the parameters of each function fi(t) as the
    keys in the index structure.
  • Problem: hundreds of functions per trajectory
  • Result: large storage overhead, reduced
    efficiency
  • [Cai & Ng, 2004]: approximate each trajectory with
    a Chebyshev polynomial and use its coefficients
    for indexing
  • Easy to compute
  • Almost identical to optimal minmax polynomial
  • Focused on similarity queries
  • Over entire trajectories of equal length
  • Same degree polynomials for all trajectories
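
To make the Chebyshev idea concrete, the sketch below computes the coefficients of one coordinate of a trajectory with the textbook Chebyshev-node formula. It is an illustration under assumptions, not the cited method's implementation: the function names are hypothetical, samples are (t, value) pairs with strictly increasing t, and values between samples are obtained by linear interpolation.

```python
from math import cos, pi

def interpolate(samples, t):
    """Piecewise-linear value at time t from time-ordered (t, value) samples."""
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    raise ValueError("t outside the sampled time span")

def chebyshev_coefficients(samples, degree):
    """Coefficients c0..c_degree such that value(s) ~ sum(c_k * T_k(s)),
    where s in [-1, 1] is the time rescaled over the trajectory lifespan."""
    n = degree + 1
    t_min, t_max = samples[0][0], samples[-1][0]
    # Chebyshev nodes in (-1, 1), mapped back to the original time axis.
    nodes = [cos(pi * (j + 0.5) / n) for j in range(n)]
    values = [interpolate(samples, t_min + (s + 1) / 2 * (t_max - t_min))
              for s in nodes]
    coeffs = []
    for k in range(n):
        c = 2.0 / n * sum(v * cos(k * pi * (j + 0.5) / n)
                          for j, v in enumerate(values))
        coeffs.append(c / 2 if k == 0 else c)
    return coeffs

# x(t) = t for t in [0, 10], i.e. x = 5 + 5 s in rescaled time.
x_samples = [(float(i), float(i)) for i in range(11)]
key = chebyshev_coefficients(x_samples, 3)
```

A full index key would concatenate the coefficient vectors of the x and y coordinates, using the same degree for every trajectory.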

45
Summary on Mobility Data Management
  • From spatial and spatio-temporal to moving object
    databases
  • Research has touched almost all aspects, from
    data modeling to efficient storage and retrieval
  • Open issues
  • Efficient query processing for location-based
    services (LBS)
  • Indexing both archived and prospective locations
  • MOD architecture: centralized vs. distributed
    vs. stream-oriented
  • Exotic applications: mobile computer vision, etc.

46
A guided tour on GeoPKDD
  • Mobility data management
  • Acquiring and storing trajectories in MODs
  • Location-aware querying
  • Trajectory indexing
  • Mobility data mining and the geographic KDD
    process
  • Trajectory warehousing and OLAP
  • Mobility data mining and reasoning
  • Visual analytics for mobility data
  • Privacy aspects on mobility data
  • Preserving anonymity
  • (Semantic enriched) Geographic KDD process
  • Combining Mining and Querying
  • Ontological framework for end-user querying and
    reasoning
  • Outlook

47
Trajectory Warehousing and OLAP
  • From DW to TDW
  • Trajectory-based OLAP
  • ETL and the distinct count problem

48
Data warehousing (DW)
  • Widely investigated for conventional, non-spatial
    data.
  • Some research on spatial DW, pioneering work by
    Han et al. in 1998.
  • Spatial and non-spatial dimensions and measures.
  • OLAP operations in a spatial data cube.
  • Recent research direction: developing
    spatio-temporal DW and supporting spatio-temporal
    OLAP operations in order to extract summarized
    spatio-temporal information.
  • Useful for traffic supervision systems,
    transportation and supply chain management,
    mobile e-commerce.
  • Focus on methods for an efficient implementation
    of spatio-temporal aggregate queries.

49
Trajectory data warehousing (TDW)
  • TDW should
  • extract aggregate information from MOD
  • support a variety of dimensions (temporal,
    spatial, thematic, ...) and measures (about space,
    time and their derivatives)
  • Storing measures associated with facts,
    concerning the set of trajectories crossing the
    cell → aggregate information in base cells
  • Challenges
  • design of a trajectory-oriented data cube
  • high volume and complex nature of data; special
    query processing requirements
  • extensions of traditional aggregation techniques
    to produce summary information for OLAP analysis

50
A trajectory warehouse system architecture
data analyst (desktop)
data producers (mobile)
location data (obj-id, x, y, t)(not
trajectories) are generated
trajectory stream manager
moving object database
trajectory data cube
geo- layers
trajectory data (obj-id, traj-id, (x, y, t))
are reconstructed
Geographical context is considered
aggregated trajectory data are computed (ETL
procedure)
51
Basic definitions and schemas
  • Trajectory
  • Moving Object Database D = {T1, T2, ..., TN}
  • Trajectory Data Warehouse
  • Dimensions: Spatial, Temporal, Object Profile
  • Measures: count (trajectories), count (users),
    avg (distance traveled), avg (travel duration),
    avg (speed), avg (abs (acceleration))

52
OLAP (spatio-temporal aggregation)
53
ETL processing: loading
  • Loading data into the dimension tables →
    straightforward
  • Loading data into the fact table → complex
  • Fill in the measures with the appropriate numeric
    values
  • In order to calculate the measures, we have to
    extract the portions of the trajectories that fit
    into the base cells of the cube
  • alternative solutions
  • cell-oriented
  • trajectory-oriented

54
Aggregating measures in the cube
[Figure: regions R, R1, R2, R3, R4, R5, R6]
At the lowest hierarchy level: count of
trajectories in R4 = 3, count of trajectories in
R5 = 2, count of trajectories in R6 = 1.
Count of trajectories in R = 6 (according to
traditional roll-up). Correct answer: 3 (!!), due
to the fact that the contents (trajectories) of
the partitions are overlapping.
  • A naïve solution is to query back the raw data.
  • Can we do something better?

55
The distinct count problem
  • During the ETL process, measures can be computed
    in an accurate way by executing MOD queries
  • Once the fact table has been fed, aggregate-only
    information is stored inside the TDW (no
    trajectory / user ids)
  • When rolling up, COUNT_USERS, COUNT_TRAJECTORIES
    and, hence, all other measures defined over
    COUNT_TRAJECTORIES are subject to the distinct
    count problem [Tao et al. 2004]
  • if an object remains in the query region for
    several timestamps during the query interval,
    instead of counting this object once, it is
    counted multiple times in the result
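
The distinct count problem can be reproduced in a few lines. The sketch below is a toy example: the trajectory ids t1..t3 and the visit list are made up so that the per-cell counts agree with the R4, R5, R6 example on the previous slide; it is not taken from any real dataset.

```python
# (trajectory id, base cell) pairs, as they would be seen during ETL.
visits = [("t1", "R4"), ("t2", "R4"), ("t3", "R4"),
          ("t1", "R5"), ("t2", "R5"),
          ("t3", "R6")]

# During ETL the ids are still available, so per-cell counts are exact.
ids_per_cell = {}
for traj_id, cell in visits:
    ids_per_cell.setdefault(cell, set()).add(traj_id)
count_per_cell = {cell: len(ids) for cell, ids in ids_per_cell.items()}

# Once only the counts are stored, rolling up can only add them.
naive_rollup = sum(count_per_cell.values())

# The exact answer needs the ids, which the warehouse no longer holds.
exact_rollup = len(set().union(*ids_per_cell.values()))
```

Here the naive roll-up yields 6 while only 3 distinct trajectories exist, because the same trajectory crosses several cells.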

56
Mobility Data Mining
  • Trajectory pattern mining
  • Trajectory clustering

57
Examples of mobility patterns
  • Trajectory clustering
  • Cluster trajectories
  • For each cluster, find the mean trajectory to
    represent/classify
  • Frequent pattern mining
  • Discover frequent routes, etc.

58
Trajectory pattern mining
59
Q: What is a trajectory pattern?
60
A: A spatio-temporal sequential pattern
  • [Giannotti et al. 2007] A sequence of visited
    regions, frequently visited in the specified
    order with similar transition times

61
T-Pattern discovery
1- Find Regions of Interest
2- Find similar Trajectories in space and time
3- Extract patterns
62
T-Patterns for trajectories
  • A Trajectory Pattern (T-pattern) is a pair (s,
    α)
  • s = <(x0,y0), ..., (xk,yk)> is a sequence of k+1
    locations
  • α = <α1, ..., αk> are the transition times
    (annotations)
  • also written as (x0,y0) -α1→ (x1,y1) -α2→ ...
    -αk→ (xk,yk)
  • A T-pattern Tp occurs in a trajectory if it
    contains a sub-sequence S such that
  • each (xi,yi) in Tp matches a point (xi',yi') in
    S, and
  • the transition times in Tp are similar to those
    in S

63
Continuity issues (space and time)
  • What does "matches" mean in space/time?
  • The same exact spatial location (x,y) usually
    never occurs twice
  • The same exact transition times usually do not
    occur twice
  • Solution allow approximation
  • a notion of spatial neighborhood
  • a notion of temporal tolerance

64
T-Pattern approximate occurrence
  • Two points match if one falls within a spatial
    neighborhood N() of the other
  • Two transition times match if their temporal
    difference is ≤ τ
  • Example
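
The approximate occurrence test above can be sketched as a small backtracking search. This is an illustration under assumptions: the neighborhood N() is modelled as a circle of a given radius, the function name occurs and its parameters are hypothetical, and a trajectory is a time-ordered list of (x, y, t) samples.

```python
from math import hypot

def occurs(pattern_points, transition_times, traj, radius, tau):
    """True if the T-pattern approximately occurs in traj.

    pattern_points: [(x0, y0), ..., (xk, yk)]
    transition_times: [alpha_1, ..., alpha_k]
    """
    def near(p, q):
        return hypot(p[0] - q[0], p[1] - q[1]) <= radius

    def search(i, prev):
        if i == len(pattern_points):
            return True
        for j in range(prev + 1, len(traj)):
            if not near(pattern_points[i], traj[j]):
                continue
            if i > 0:
                dt = traj[j][2] - traj[prev][2]
                if abs(dt - transition_times[i - 1]) > tau:
                    continue
            if search(i + 1, j):
                return True
        return False

    return search(0, -1)

traj = [(0, 0, 0), (1, 0, 5), (10, 0, 20), (10.2, 0, 32), (20, 0, 50)]
points = [(0, 0), (10, 0), (20, 0)]
alphas = [30, 20]
loose = occurs(points, alphas, traj, radius=0.5, tau=3)
strict = occurs(points, alphas, traj, radius=0.5, tau=1)
```

With a tolerance of 3 time units the pattern occurs (observed transitions of 32 and 18 against 30 and 20); with a tolerance of 1 it does not.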

65
T-Pattern approximate occurrence
  • Two points match if one falls within a spatial
    neighborhood N() of the other
  • Two transition times match if their temporal
    difference is ≤ τ
  • Example

66
T-Pattern approximate occurrence
  • Two points match if one falls within a spatial
    neighborhood N() of the other
  • Two transition times match if their temporal
    difference is ≤ τ
  • Example

67
Computing general T-Patterns
  • T-pattern mining can be mapped to a density
    estimation problem over R^(3n-1)
  • 2 dimensions for each (x,y) in the pattern (2n)
  • 1 dimension for each transition (n-1)
  • Density computed by
  • mapping each sub-sequence of n points of each
    input trajectory to R^(3n-1)
  • drawing an influence area for each point
    (composition of N() and τ)
  • Too computationally expensive, heuristics
    needed!

68
Approach 1: predefined regions
  • Fix a set of pre-defined regions of interest
  • Map each (x,y) of the trajectory to its region
  • Sample pattern

69
Approach 2: static discovered regions
  • Detect significant regions through spatial
    clustering
  • Map each (x,y) of the trajectory to its region
  • Sample pattern

70
Approach 3: dynamic discovered regions
  • Dynamic discovery of dense regions
  • Regions are located at each step of the pattern
    generation
  • Sample pattern

1. Considering all trajectories, A is a cluster /
   dense region
2. Considering only trajectories that visit A, B is
   a cluster
3. 20 mins is a typical time for pattern A → B
71
Sample T-patterns
Data source: Athens trucks, 273 trajectories
(source: www.rtreeportal.org)
72
Ongoing work on T-pattern mining
  • Application-oriented assessments on large, real
    datasets show that T-patterns are many and
    difficult to evaluate
  • A starting point for further model construction,
    rather than a final product
  • Simplification of output: transition times
  • The most complex info for end users
  • Study relations with
  • Geographic background knowledge, such as points
    of interests and road network
  • Privacy issues: are T-patterns safe? Can we use
    T-patterns to protect (anonymize) original data?
  • Reasoning on trajectories and patterns

73
Location prediction based on T-patterns
  • T-Pattern extracts a set of local patterns from a
    global set of data.
  • Can we use these patterns to build a global model
    to predict the next location? Yes! [Pinelli et
    al. 2008]

Global model (Ptree)
Local patterns (T-pattern)
74
Trajectory Clustering
75
Trajectory Clustering
  • Questions
  • Which distance between trajectories?
  • Which kind of clustering?
  • What is a cluster mean in our case?
  • A representative trajectory?

76
Which distance?
  • Average Euclidean distance
  • Synchronized behaviour distance
  • Similar objects almost always in the same place
    at the same time
  • Computed on the whole trajectory
  • Computational aspects
  • Cost = O(|τ1| + |τ2|) (|τ| = number of points
    in τ)
  • It is a metric => efficient indexing methods
    allowed, e.g. [Frentzos et al. 2007]
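
A minimal sketch of such a synchronized, time-averaged distance is given below. It is an approximation under assumptions and not the cited definition's exact evaluation: both trajectories are time-ordered lists of (x, y, t) samples covering the interval [t_from, t_to], positions are linearly interpolated, and the integral of the distance is approximated with the trapezoid rule at the union of the timestamps. The helper names are hypothetical.

```python
from bisect import bisect_right
from math import hypot

def position_at(traj, t):
    """Linearly interpolated (x, y) of traj at time t (t within its lifespan)."""
    times = [p[2] for p in traj]
    i = bisect_right(times, t)
    if i == len(traj):
        return traj[-1][0], traj[-1][1]
    (x0, y0, t0), (x1, y1, t1) = traj[i - 1], traj[i]
    w = (t - t0) / (t1 - t0)
    return x0 + w * (x1 - x0), y0 + w * (y1 - y0)

def average_distance(a, b, t_from, t_to):
    """Time-averaged Euclidean distance between a and b over [t_from, t_to]."""
    inner = {p[2] for p in a + b if t_from < p[2] < t_to}
    times = sorted(inner | {t_from, t_to})

    def dist(t):
        (ax, ay), (bx, by) = position_at(a, t), position_at(b, t)
        return hypot(ax - bx, ay - by)

    total = 0.0
    for t0, t1 in zip(times, times[1:]):
        total += 0.5 * (dist(t0) + dist(t1)) * (t1 - t0)
    return total / (t_to - t_from)

a = [(0, 0, 0), (10, 0, 10)]
parallel = [(0, 3, 0), (10, 3, 10)]
diverging = [(0, 0, 0), (10, 10, 10)]
d_parallel = average_distance(a, parallel, 0, 10)
d_diverging = average_distance(a, diverging, 0, 10)
```

Restricting t_from and t_to to a sub-interval gives the time-focused variant of the same distance.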

77
Which kind of clustering?
  • General requirements
  • Non-spherical clusters should be allowed
  • E.g. a traffic jam along a road = snake-shaped
    cluster
  • Tolerance to noise
  • Low computational cost
  • Applicability to complex, possibly non-vectorial
    data
  • A suitable candidate: Density-based clustering
  • OPTICS (Ankerst et al., 1999)
  • → T(rajectory)-OPTICS

78
A sample dataset
  • A set of trajectories forming 4 clusters + noise
    (synthetic)

79
T-OPTICS vs. HAC and K-means
K-means
HAC-average
Reachability plot (= objects reordering for
distance distribution)
T-OPTICS
ε threshold
80
Extension 1: Temporal focusing
  • Different time intervals can show different
    behaviours
  • E.g. objects that are close to each other within
    a time interval can be much distant in other
    periods of time
  • The time interval becomes a parameter
  • E.g. rush hours vs. low traffic times
  • Already supported by the distance measure
  • Just compute D(τ1, τ2)|T' on a time interval T'
    ⊆ T
  • Problem: significant T' are not always known a
    priori
  • An automated mechanism is needed to find them
  • Nanni, Pedreschi. Time-focused clustering of
    trajectories of moving objects. J. of
    Intelligent Information Systems, 2006

81
Extension 2: visually-driven clustering
  • Progressive refinement through visually-driven
    exploration
  • Progressively complex similarity functions
  • Scalability
  • Index structures to support efficient
    neighborhood queries for trajectory clustering
  • Progressive clustering by sampling
  • Incremental clustering and concept drift

82
Interactive density-based trajectory clustering
  • Rinzivillo, Pedreschi, Nanni, Giannotti,
    Andrienko, Andrienko. Visually-driven analysis of
    movement data by progressive clustering. J. of
    Information Visualization, 2008

83
Progressive clustering
  • First, create large clusters of trajectories
    using the common ends distance function
  • Concentrate on the (big) cluster of inward
    trajectories (routes towards the city center)
  • Refine by creating subclusters using a more
    sophisticated distance function (route similarity)

84
Looking for frequent stops and moves
85
Clusters of typical trips
86
Cluster 1: from work to home
Observation: the eastern route is chosen more
often
87
Cluster 2: from home to work
Observation: the eastern route is chosen much
more often
88
MILANO data on the map
89
5 biggest (sub-)clusters of trajectories towards
the city centre
Dark grey: moves occurring in trajectories from
several clusters
90
Clustering trajectories on route similarity
Left: peripheral routes; middle: inward routes;
right: outward routes.
  • Rinzivillo, Pedreschi, Nanni, Giannotti,
    Andrienko, Andrienko. Visually-driven analysis of
    movement data by progressive clustering. J. of
    Information Visualization, 2008

91
Cluster-based Classification of Large Trajectory
Datasets
  • Gennady Andrienko, Natalia Andrienko: Fraunhofer
    IAIS, Sankt Augustin, Germany
  • Salvatore Rinzivillo, Mirco Nanni, Fosca
    Giannotti: ISTI - CNR, Pisa, Italy
  • Dino Pedreschi: Università di Pisa, Pisa, Italy

92
Motivation
  • Massive collections of GPS tracks are rough
    approximations of complex human activities
  • Challenge
  • develop analysis techniques capable of mastering
    the complexity of the data and extracting
    meaningful abstractions
  • A trajectory clustering problem
  • find, for the spatial area and the time interval
    under analysis, the natural clusters of similar
    trajectories and attach semantics to them

93
The process at a glance
  • Given a trajectory dataset D, extract a sample D'
    of trajectories from D
  • Apply OPTICS with a suitable distance function d
    and get a set of density-based clusters C1, C2,
    ..., Cm
  • For each cluster Ci
  • Select s specimens as its representatives
  • Visually inspect and refine the selected
    specimens. The set of the specimens for all
    clusters forms a classifier
  • Apply the classifier to the remaining
    trajectories, attaching each new trajectory to
    the closest specimen. The trajectories with no
    close specimen remain unclassified
  • Repeat the whole process for the unclassified
    trajectories
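
The classification step of this process can be sketched as a nearest-specimen rule with a distance threshold. Everything named below is an assumption made for illustration: classify, the threshold parameter, and the toy common_ends_distance (sum of the distances between start points and between end points), which merely stands in for whichever distance function d was used for clustering.

```python
from math import hypot

def common_ends_distance(a, b):
    """Toy distance: gap between the starts plus gap between the ends."""
    start = hypot(a[0][0] - b[0][0], a[0][1] - b[0][1])
    end = hypot(a[-1][0] - b[-1][0], a[-1][1] - b[-1][1])
    return start + end

def classify(trajectories, specimens, distance, threshold):
    """Attach each trajectory to the cluster of its closest specimen.

    specimens: {cluster label: [specimen trajectory, ...]}
    Returns (labels, unclassified ids).
    """
    labels, unclassified = {}, []
    for tid, traj in trajectories.items():
        best_label, best_d = None, float("inf")
        for label, group in specimens.items():
            for specimen in group:
                d = distance(traj, specimen)
                if d < best_d:
                    best_label, best_d = label, d
        if best_d <= threshold:
            labels[tid] = best_label
        else:
            unclassified.append(tid)  # left for the next round of the process
    return labels, unclassified

specimens = {
    "home_to_work": [[(0, 0), (10, 10)]],
    "work_to_home": [[(10, 10), (0, 0)]],
}
remaining = {
    "u1": [(0.5, 0), (5, 5), (10, 9.5)],
    "u2": [(9, 10), (1, 0)],
    "u3": [(50, 50), (60, 60)],
}
labels, unclassified = classify(remaining, specimens, common_ends_distance, 3.0)
```

The unclassified trajectories (here u3) would be fed back into the sampling and clustering steps.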

94
Classification of the original DB: for each
candidate trajectory, find the closest specimen
and attach it to the corresponding cluster
Sampling
Find Specimens
The trajectories attached to the cluster after
the classification
T-OPTICS
95
Summary on Mobility Data Mining
  • Data Analysis and Knowledge Discovery in MODs is
    here to stay
  • It is the opportunity to discover, from the
    digital traces of human activity, the knowledge
    that makes us comprehend timely and precisely the
    way we live, the way we use our time and our
    land.
  • Open issues
  • Integrating mining process into MODs
  • Interactive, progressive KDD customized to users'
    needs

96
A guided tour on GeoPKDD
  • Mobility data management
  • Acquiring and storing trajectories in MODs
  • Location-aware querying
  • Trajectory indexing
  • Mobility data mining and the geographic KDD
    process
  • Trajectory warehousing and OLAP
  • Mobility data mining and reasoning
  • Visual analytics for mobility data
  • Privacy aspects on mobility data
  • Preserving anonymity
  • (Semantic enriched) Geographic KDD process
  • Combining Mining and Querying
  • Ontological framework for end-user querying and
    reasoning
  • Outlook

97
From opportunities to threats
  • Personal mobility data, as gathered by the
    wireless networks, are extremely sensitive
  • Their disclosure may represent a brutal violation
    of the privacy protection rights, i.e., to keep
    confidential
  • the places we visit
  • the places we live or work at
  • the people we meet

98
The naïve scientist's view
  • Knowing the exact identity of individuals is not
    needed for analytical purposes
  • De-identified mobility data are enough to
    reconstruct aggregate movement behaviour,
    pertaining to groups of people.
  • Reasoning coherent with European data protection
    laws: personal data, once made anonymous, are not
    subject to privacy law restrictions
  • Is this reasoning correct?

99
Unfortunately not!
  • Making data (reasonably) anonymous is not easy.
  • Sometimes, it is possible to reconstruct the
    exact identities from the de-identified data.
  • Many famous examples of re-identification
  • Dalenius
  • Governor of Massachusetts' clinical records
    (Sweeney's experiment, 2001)
  • AOL August 2006 crisis: user re-identified from
    search logs
  • Two main sources of danger
  • Many observations on the same anonymous subject
  • Linking data, after joining separate datasets

100
Spatio-temporal linkage in Mobility Data
almost every day Mon-Fri between 7:45 and 8:15
Id 34567
A
B
almost every day Mon-Fri between 17:45 and 18:15
A
B
  • By intersecting the phone directories of
    locations A and B we find that only one
    individual lives in A and works in B.
  • Id 34567 = Prof. Smith
  • Then you discover that on Saturday night Id 34567
    usually drives to the city's red-light district
101
Preserving anonymity
  • Anonymity-preserving mobility mining

102
How do people (try to) stay anonymous?
  • either by camouflage
  • pretending to be someone else or somewhere else
  • or by hiding in the crowd
  • becoming indistinguishable among many others

103
Concepts for Location Privacy
  • Location Perturbation / Randomization
  • The user location is represented with a fake
    value
  • Privacy protection is achieved from the fact that
    the reported location is false
  • The accuracy and the amount of privacy mainly
    depend on how far the reported location is from
    the exact location
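A minimal sketch of location perturbation, assuming planar coordinates in metres and a uniform random displacement inside a disc. The function name `perturb` and the parameter `max_radius` are illustrative choices, not taken from the tutorial.

```python
import math
import random


def perturb(x, y, max_radius, rng=random):
    """Report a fake location drawn uniformly from a disc around (x, y)."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    # The square root spreads points uniformly over the disc area
    # instead of concentrating them near the centre.
    radius = max_radius * math.sqrt(rng.random())
    return x + radius * math.cos(angle), y + radius * math.sin(angle)


rng = random.Random(0)  # fixed seed so the run is repeatable
true_location = (100.0, 200.0)
reported = perturb(*true_location, max_radius=50.0, rng=rng)
error = math.dist(true_location, reported)
print(reported, error)
```

A larger `max_radius` gives more privacy and less accuracy: the service only ever sees `reported`, and `error` is the price paid in accuracy.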

104
Concepts for Location Privacy
  • Spatial Cloaking / Generalization
  • The user's exact location is represented as a
    region that includes the exact user location
  • An adversary does know that the user is located
    in the region, but has no clue where exactly the
    user is located
  • The area of the region determines the trade-off
    between user privacy and accuracy
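As an illustration, spatial cloaking can be sketched by snapping a location to a fixed grid cell. The grid-based approach and the names `cloak` and `cell_size` are assumptions made for this example; the slides do not prescribe how the region is built.

```python
import math


def cloak(x, y, cell_size):
    """Return the grid cell (min_x, min_y, max_x, max_y) that contains (x, y)."""
    min_x = math.floor(x / cell_size) * cell_size
    min_y = math.floor(y / cell_size) * cell_size
    return (min_x, min_y, min_x + cell_size, min_y + cell_size)


print(cloak(1234.0, 987.0, cell_size=500.0))  # (1000.0, 500.0, 1500.0, 1000.0)
print(cloak(1234.0, 987.0, cell_size=100.0))  # (1200.0, 900.0, 1300.0, 1000.0)
```

The cell size is the privacy/accuracy knob: the 500 m cell hides the user among more possible positions than the 100 m cell, at the cost of a coarser answer.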

105
Concepts for Location Privacy
  • Spatio-temporal generalization
  • In addition to the spatial dimension, also
    generalize the temporal dimension

[Figure: axes X, Y and T]
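A small sketch of spatio-temporal generalization, assuming a regular grid in space and fixed-length windows in time. The names `generalize`, `cell_size` and `time_window` are illustrative, as are the units (metres and seconds).

```python
import math


def generalize(x, y, t, cell_size, time_window):
    """Map an (x, y, t) observation to the index of its space-time box."""
    return (
        math.floor(x / cell_size),
        math.floor(y / cell_size),
        math.floor(t / time_window),
    )


# Two observations taken 4 minutes apart, a few hundred metres apart.
a = generalize(1234.0, 987.0, 7 * 3600 + 46 * 60, cell_size=500.0, time_window=1800)
b = generalize(1300.0, 900.0, 7 * 3600 + 50 * 60, cell_size=500.0, time_window=1800)
print(a, b, a == b)
```

After generalization both observations fall into the same box, so they can no longer be told apart by position or by time of day within the half-hour window.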
106
Concepts for Location Privacy
  • k-anonymity
  • The user's position is generalized to a region
    containing at least k users
  • The user is indistinguishable among the other k-1
    users
  • The area largely depends on the surrounding
    environment.
  • A value of k = 100 may result in a very small
    area in downtown Hong Kong, or a very large area
    in the desert.
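The dependence of the region size on user density can be illustrated with a deliberately simple sketch: grow a square around the user until it covers k users. The function `k_anonymous_region`, its `step` parameter and the sample coordinates are assumptions for illustration, not a scheme described in the tutorial.

```python
def k_anonymous_region(user, others, k, step=100.0):
    """Grow a square centred on `user` until it covers at least k users
    (the user plus k-1 others). Returns (half_width, region)."""
    if len(others) + 1 < k:
        raise ValueError("fewer than k users in total")
    ux, uy = user
    half = step
    while True:
        inside = sum(
            1 for (x, y) in others
            if abs(x - ux) <= half and abs(y - uy) <= half
        )
        if inside + 1 >= k:
            return half, (ux - half, uy - half, ux + half, uy + half)
        half += step


downtown = [(10.0 * i, 0.0) for i in range(1, 10)]   # 9 neighbours, 10 m apart
desert = [(1000.0 * i, 0.0) for i in range(1, 10)]   # 9 neighbours, 1 km apart

print(k_anonymous_region((0.0, 0.0), downtown, k=10)[0])
print(k_anonymous_region((0.0, 0.0), desert, k=10)[0])
```

With the same k = 10, the dense setting is satisfied by the first, smallest square, while the sparse one needs a region 90 times wider. Centring the square on the user is a simplification: the centre of such a region gives the exact position away, so a practical scheme would not place the user in the middle.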

[Figure: 10-anonymity]
107
Trajectory anonymization
  • Several variants developed in GeoPKDD:
  • Abul, Bonchi, Nanni (Pisa KDD Lab). Int. Conf.
    on Data Engineering, ICDE 2008
  • Nergiz, Atzori, Saygin (Sabanci Univ. and Pisa KDD
    Lab). ACM GIS 2008
  • Gkoulalas-Divanis, Verykios (Univ. Thessaly).
    2007 (submitted)
  • Monreale, Pensa, Pedreschi, Pinelli. PILBA 2008
  • Yarovoy, Bonchi, Lakshmanan, Wang. EDBT 2009
  • Common goal: construct an anonymized version of a
    trajectory dataset, preserving some target
    analytical properties
  • Different techniques adopted

108
Anonymity preserving mobility mining
  • Never Walk Alone (Bonchi et al., 2008)
  • Trade uncertainty for anonymity: trajectories
    that are close up to the uncertainty threshold are
    indistinguishable
  • Combine k-anonymity and perturbation
  • Two steps:
  • Cluster trajectories into groups of k similar
    ones (removing outliers)
  • Perturb trajectories in a cluster so that they
    are all close to each other up to the uncertainty
    threshold
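The two steps can be sketched as follows. This is a simplified stand-in and not the Never Walk Alone algorithm from the cited paper: it assumes all trajectories are sampled at the same timestamps, uses a greedy clustering, and the names `cluster`, `perturb_cluster`, `max_radius` and `delta` are hypothetical.

```python
import math


def distance(t1, t2):
    """Mean point-wise Euclidean distance between two synchronized trajectories."""
    return sum(math.dist(p, q) for p, q in zip(t1, t2)) / len(t1)


def cluster(trajectories, k, max_radius):
    """Step 1: greedily form groups of k similar trajectories.

    A pivot whose (k-1)-th nearest neighbour is farther than max_radius
    is dropped as an outlier; fewer than k leftovers are dropped too.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    remaining = list(range(len(trajectories)))
    clusters = []
    while len(remaining) >= k:
        pivot = remaining.pop(0)
        remaining.sort(key=lambda i: distance(trajectories[pivot], trajectories[i]))
        nearest = remaining[:k - 1]
        if distance(trajectories[pivot], trajectories[nearest[-1]]) <= max_radius:
            clusters.append([pivot] + nearest)
            remaining = remaining[k - 1:]
    return clusters


def perturb_cluster(trajectories, members, delta):
    """Step 2: pull each point towards the cluster centroid until it lies
    within delta / 2 of it, so any two members are at most delta apart."""
    result = {m: [] for m in members}
    for step in range(len(trajectories[members[0]])):
        points = [trajectories[m][step] for m in members]
        cx = sum(p[0] for p in points) / len(points)
        cy = sum(p[1] for p in points) / len(points)
        for m, (x, y) in zip(members, points):
            d = math.hypot(x - cx, y - cy)
            scale = 1.0 if d <= delta / 2 else (delta / 2) / d
            result[m].append((cx + (x - cx) * scale, cy + (y - cy) * scale))
    return result


trajs = [
    [(0.0, 0.0), (10.0, 0.0)],
    [(0.0, 4.0), (10.0, 4.0)],
    [(0.0, 8.0), (10.0, 8.0)],
    [(500.0, 500.0), (510.0, 500.0)],  # far from the others: an outlier
]
groups = cluster(trajs, k=3, max_radius=20.0)
anonymized = perturb_cluster(trajs, groups[0], delta=2.0)
print(groups)
print(anonymized)
```

In this toy run the three nearby trajectories form one group and the distant one is discarded; after perturbation every pair of group members stays within `delta` of each other at each timestamp, which is what makes them indistinguishable under the uncertainty threshold.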

109
Sample results (dataset: Oldenburg, synthetic)
[Figure panels: original data; anonymized data]
110
Key open challenges
  • Define an acceptable formal measure of anonymity
    protection
  • Probability of re-identification (in a given
    context)
  • A (technically supported) juridical issue!
  • Sampling: a necessity and an opportunity!
  • Necessary for performance/feasibility of data
    mining from massive mobility datasets
  • Good for anonymity (re-identification probability
    decreases)

111
Summary on Privacy Aspects
  • Today, tracking is an anytime / anywhere
    process
  • Therefore, privacy preservation is a must!
  • What is required:
  • Privacy-aware KDD process
  • Much already in the literature
  • Privacy-aware MOD management
  • Not so much!

112
A guided tour on GeoPKDD
  • Mobility data management
  • Acquiring and storing trajectories in MODs
  • Location-aware querying
  • Trajectory indexing
  • Mobility data mining and the geographic KDD
    process
  • Trajectory warehousing and OLAP
  • Mobility data mining and reasoning
  • Visual analytics for mobility data
  • Privacy aspects on mobility data
  • Preserving anonymity
  • (Semantic enriched) Geographic KDD process
  • Combining Mining and Querying
  • Ontological framework for end-user querying and
    reasoning
  • Outlook

113
Incorporating semantics: a step towards the user
  • Can data and patterns be re-combined and queried?
  • Can data mining tasks be more accurate if data
    are semantically enriched?
  • Can we deduce something new from data and
    patterns?

114
Why a DMQL?
  • Data, Patterns/models and background knowledge
    need to be combined
  • Find the patterns that involve trajectories
    crossing a polluted area during rush hours

115
A unifying framework: DEDALUS
The queries between data and/or models can be
expressed with an object-relational language using
the Hermes package and the Tas package. For
example:
1. Select all TASs belonging to a certain
trajectory (e.g. id = 3)
    SELECT Patterns.id
    FROM Patterns, Trajectories
    WHERE Trajectories.id = 3
      AND Patterns.TAS.f_memberships(Trajectories.trajectory)
2. Select all trajectories belonging to a TAS
included in a polluted area.
    SELECT Trajectories.id
    FROM Patterns, Trajectories, Polluted_Areas
    WHERE Patterns.f_membership(Trajectories.trajectory)
      AND Polluted_Areas.geometry.include(Patterns.TAS.get_geometry())
[Figure: table Patterns (Id Number, Object Pattern_TAS)]
116
Building mobility data mining applications
  • requires reasoning on a richer form of knowledge
    about mobility
  • Geographic semantics
  • Landmarks and interesting places
  • Road network
  • Landscape
  • Movement semantics
  • stops and moves
  • Purposes of movement
  • means of transportation

117
[Figure: the End user asks "Where should I go next?";
components shown: GSM network, Multimedia Geo,
Mobility Database, Mobility models]
118
Semantic Trajectory Data
  • Physical Trajectory
  • e.g. GPS recording over some period of time
  • Semantic Trajectory
  • places where a person stayed
  • means of transportation
  • combination of above elements for higher-level
    description
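To make the step from a physical to a semantic trajectory concrete, here is a minimal sketch that detects stops in a GPS track and names them after the nearest known place. The stop rule (staying within `max_radius` of an anchor point for at least `min_duration`), the function names and the tiny gazetteer are all assumptions made for this example, not the method used in GeoPKDD.

```python
import math


def detect_stops(points, max_radius, min_duration):
    """Find stops in a GPS track [(t, x, y), ...].

    A stop is a run of consecutive points that stay within max_radius of
    the run's first point and spans at least min_duration.
    Returns a list of (t_start, t_end, mean_x, mean_y).
    """
    stops = []
    i = 0
    while i < len(points):
        t0, x0, y0 = points[i]
        j = i
        while j + 1 < len(points) and math.dist((x0, y0), points[j + 1][1:]) <= max_radius:
            j += 1
        if points[j][0] - t0 >= min_duration:
            run = points[i:j + 1]
            mean_x = sum(p[1] for p in run) / len(run)
            mean_y = sum(p[2] for p in run) / len(run)
            stops.append((t0, points[j][0], mean_x, mean_y))
            i = j + 1
        else:
            i += 1  # this point belongs to a move, not a stop
    return stops


def label(stop, places):
    """Name a stop after the nearest place in a (hypothetical) gazetteer."""
    _, _, x, y = stop
    return min(places, key=lambda name: math.dist((x, y), places[name]))


track = [
    (0, 0.0, 0.0), (300, 1.0, 0.0), (600, 0.0, 1.0),             # staying put
    (660, 200.0, 0.0), (720, 400.0, 0.0),                        # moving
    (780, 600.0, 0.0), (1080, 601.0, 0.0), (1380, 600.0, 1.0),   # staying put
]
places = {"home": (0.0, 0.0), "work": (600.0, 0.0)}

stops = detect_stops(track, max_radius=10.0, min_duration=600)
print([label(s, places) for s in stops])  # ['home', 'work']
```

The points between two stops form a move; attaching a means of transportation to each move would give the higher-level description mentioned on the slide.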

[Figure labels: home, bus stop, way to work, work]
119
Semantic (frequent) patterns
120
How to enrich?
  • An ontological framework enables a progressive
    semantic enrichment of mobility data and patterns

121
ATHENA: the ontological framework
  • How should a movement ontology be developed?
  • How should a data ontology be mapped onto a
    database?
[Figure labels: Query; Movement Ontology, Application
Ontology, Data Ontology; patterns, trajectories,
geography]
122
Athena: Querying and Reasoning
[Figure labels: Data Ontology, Application Ontology,
ONTOLOGY SYSTEM]
123
A guided tour on GeoPKDD
  • Mobility data management
  • Acquiring and storing trajectories in MODs
  • Location-aware querying
  • Trajectory indexing
  • Mobility data mining and the geographic KDD
    process
  • Trajectory warehousing and OLAP
  • Mobility data mining and reasoning
  • Visual analytics for mobility data
  • Privacy aspects on mobility data
  • Preserving anonymity
  • (Semantic enriched) Geographic KDD process
  • Combining Mining and Querying
  • Ontological framework for end-user querying and
    reasoning
  • Outlook

124
Outlook
  • (Privacy-preserving) Mobility Data Acquisition,
    Querying, and Mining strives for a win-win
    situation
  • Obtaining the advantages of collective mobility
    knowledge without inadvertently disclosing any
    individual mobility knowledge.
  • A word of wisdom: solutions can only be obtained
    via an alliance of technology, legal regulations,
    and social norms (Rakesh Agrawal)
  • GeoPKDD.eu is in the mix, shaping up the area of
    PP mobility data mining
  • Challenge: UbiComp will flood us with new complex
    data (in a decentralized setting)
  • data miners have only begun to scratch the
    surface of this problem

125
trying to accomplish a long-time dream
126
Acknowledgements
  • We are grateful to all the GeoPKDD researchers,
    who made the project successful through their
    results and contributed actively to this tutorial
  • They are too many to be listed here; their work
    has been cited throughout these notes

GeoPKDD is a project under the FP6 / FET
Programme of the European Commission, FET-Open
contract nr 014915 (Dec. 2005 - Mar. 2009)
127
Selected literature on
  • Mobility Data Modeling and MOD engines
  • de Almeida, V.T. et al. (2006) Querying Moving
    Objects in SECONDO. Proceedings of MDM.
  • Behr, T. and Güting, R.H. (2005) Fuzzy Spatial
    Objects: An Algebra Implementation in SECONDO.
    Proceedings of ICDE.
  • Cao, H. and Wolfson, O. (2005) Nonmaterialized
    Motion Information in Transport Networks.
    Proceedings of ICDT.
  • Chen, C.X. and Zaniolo, C. (2000) SQLST: A
    Spatio-Temporal Data Model and Query Language.
    Proceedings of ER.
  • Cheng, R. et al. (2004) Efficient Indexing
    Methods for Probabilistic Threshold Queries over
    Uncertain Data. Proceedings of VLDB.
  • Dieker, S. and Güting, R.H. (2000) Plug and Play
    with Query Algebras: SECONDO - A Generic DBMS
    Development Environment. Proceedings of IDEAS.
  • Güting, R.H. et al. (2000) A Foundation for
    Representing and Querying Moving Objects. ACM
    Transactions on Database Systems, 25(1): 1-42.
  • Güting, R.H. et al. (2006) Modeling and querying
    moving objects in networks. VLDB Journal, 15(2):
    165-190.
  • Karimi, H. and Liu, X. (2003) A Predictive
    Location Model for Location-Based Services.
    Proceedings of ACM-GIS.
  • Mokbel, M.F. et al. (2004a) Continuous Query
    Processing of Spatio-temporal Data Streams in
    PLACE. Proceedings of SSDBM.
  • Mokbel, M.F. et al. (2004b) PLACE: A Query
    Processor for Handling Real-time Spatio-temporal
    Data Streams. Proceedings of VLDB.

128
Selected literature on
  • Mobility Data Modeling and MOD engines (cont.)
  • Mokhtar, H., and Su, J. (2005) A Query Language
    for Moving Object Trajectories. Proceedings of
    SSDBM.
  • Patroumpas, K. and Sellis, T.K. (2004) Managing
    Trajectories of Moving Objects as Data Streams.
    Proceedings of STDBM.
  • Pelekis, N. and Theodoridis, Y. (2007) An Oracle
    Data Cartridge for Moving Objects. Technical
    Report, TR-2007-04, University of Piraeus.
  • Pelekis, N. et al. (2004) Literature Review of
    Spatio-temporal Database Models. Knowledge
    Engineering Review, 19(3): 235-274.
  • Pelekis, N. et al. (2006) Hermes - A Framework
    for Location-Based Data Management. Proceedings
    of EDBT.
  • Pelekis, N. et al. (2008) HERMES: aggregative LBS
    via a trajectory DB engine. Proceedings of ACM
    SIGMOD.
  • Pfoser, D. and Jensen, C.S. (1999) Capturing the
    Uncertainty of Moving-Object Representations.
    Proceedings of SSD.
  • Schlieder, C. et al. (2001) Location Modeling for
    Intentional Behavior in Spatial Partonomies.
    Proceedings of Location Modeling for Ubiquitous
    Computing Workshop.
  • Sistla, P. et al. (1997) Modeling and Querying
    Moving Objects. Proceedings of ICDE.
  • Trajcevski, G. et al. (2002) The geometry of
    uncertainty in moving objects databases.
    Proceedings of EDBT.
  • Trajcevski, G. et al. (2004) Managing uncertainty
    in moving objects databases. ACM Transactions on
    Database Systems, 29(3): 463-507.

129
Selected literature on
  • Mobility Data Modeling and MOD engines (cont.)
  • Wolfson, O. (2002) Moving Objects Information
    Management: The Database Challenge. Proceedings
    of NGITS.
  • Wolfson, O. et al. (1998) Moving Objects
    Databases: Issues and Solutions. Proceedings of
    SSDBM.
  • Wolfson, O. et al. (1999) Updating and Querying
    Databases that Track Mobile Units. Distributed
    and Parallel Databases, 7(3): 257-287.

130
Selected literature on
  • MOD Query Processing
  • Benetis, R. et al. (2002) Nearest Neighbor and
    Reverse Nearest Neighbor Queries for Moving
    Objects. Proceedings of IDEAS.
  • Frentzos, E. et al. (2005) Nearest Neighbor
    Search on Moving Object Trajectories. Proceedings
    of SSTD.
  • Frentzos, E. et al. (2007) Index-based Most
    Similar Trajectory Search. Proceedings of ICDE.
  • Gedik, B. and Liu, L. (2004) MobiEyes:
    Distributed Processing of Continuously Moving
    Queries on Moving Objects in a Mobile System.
    Proceedings of EDBT.
  • Jensen, C.S. et al. (2003) Nearest Neighbor
    Queries in Road Networks. Proceedings of ACM-GIS.
  • Lema, J.A.C. et al. (2003) Algorithms for Moving
    Objects Databases. The Computer Journal,
    46(6): 680-712.
  • Li, F. et al. (2005) On Trip Planning Queries in
    Spatial Databases. Proceedings of SSTD.
  • Mokbel, M.F. et al. (2004) SINA: Scalable
    Incremental Processing of Continuous Queries in
    Spatio-temporal Databases. Proceedings of ACM
    SIGMOD.
  • Papadias, D. et al. (2003) Query Processing in
    Spatial Network Databases. Proceedings of VLDB.
  • Pelekis, N. et al. (2007) Similarity Search in
    Trajectory Databases. Proceedings of TIME.
  • Pfoser, D. and Jensen, C.S. (2001) Querying the
    Trajectories of On-line Mobile Objects.
    Proceedings of MobiDE.
  • Porkaew, K. et al. (2001) Querying Mobile Objects
    in Spatio-Temporal Databases. Proceedings of
    SSTD.
  • Sankaranarayanan, J. et al. (2005) Efficient
    Query Processing on Spatial Networks. Proceedings
    of ACM-GIS.
  • Seydim, A.V. et al. (2001) Location Dependent
    Query Processing. Proceedings of MobiDE.
  • Tao, Y. et al. (2002) Continuous Nearest Neighbor
    Search. Proceedings of VLDB.
  • Xia, T. and Zhang, D. (2006) Continuous Reverse
    Nearest Neighbor Monitoring. Proceedings of ICDE.

131
Selected literature on
  • MOD Indexing
  • Cai, Y. and Ng, R.T. (2004) Indexing
    Spatio-Temporal Trajectories with Chebyshev
    Polynomials. Proceedings of ACM SIGMOD.
  • Frentzos, E. (2003) Indexing Objects Moving on
    Fixed Networks. Proceedings of SSTD.
  • Hadjieleftheriou, M. et al. (2006) Indexing
    Spatio-temporal Archives. VLDB Journal, 15(2):
    143-164.
  • Kollios, G. et al. (2001) Indexing Animated
    Objects Using Spatiotemporal Access Methods. IEEE
    Transactions on Knowledge and Data Engineering,
    13(5): 758-777.
  • Myllymaki, J. and Kaufman, J. (2003)
    High-Performance Spatial Indexing for
    Location-Based Services. Proceedings of WWW.
  • Ni, J. and Ravishankar, C.V. (2007) Indexing
    Spatio-Temporal Trajectories with Efficient
    Polynomial Approximations. IEEE Transactions on
    Knowledge and Data Engineering, 19(5): 663-678.
  • Pfoser, D. et al. (2000) Novel Approaches to the
    Indexing of Moving Object Trajectories.
    Proceedings of VLDB.
  • Rasetic, S. et al. (2005) A Trajectory Splitting
    Model for Efficient Spatio-Temporal Indexing.
    Proceedings of VLDB.
  • Saltenis, S. et al. (2000) Indexing the Positions
    of Continuously Moving Objects. Proceedings of
    ACM SIGMOD.
  • Saltenis, S. and Jensen, C.S. (2002) Indexing of
    Moving Objects for Location-Based Services.
    Proceedings of ICDE.
  • Tao, Y. and Papadias, D. (2001) MV3R-Tree: A
    Spatio-Temporal Access Method for Timestamp and
    Interval Queries. Proceedings of VLDB.

132
Selected literature on
  • Mobility Data Warehousing
  • Han, J. et al. (1998) Selective Materialization:
    An Efficient Method for Spatial Data Cube
    Construction. Proceedings of PAKDD.
  • Jensen, C.S. et al. (2004) Multidimensional data
    modeling for location-based services. The VLDB
    Journal, 13: 1-21.
  • Leonardi, L. et al. (2009) Frequent
    Spatio-Temporal Patterns in Trajectory Data
    Warehouses. Proceedings of ACM SAC.
  • Marketos, G. et al. (2008) Building Real World
    Trajectory Warehouses. Proceedings of MobiDE.
  • Orlando, S. et al. (2007) Spatio-Temporal
    Aggregations in Trajectory Data Warehouses.
    Proceedings of DaWaK.
  • Pelekis, N. et al. (2008) Towards Trajectory Data
    Warehouses. Chapter in Mobility, Data Mining and
    Privacy: Geographic Knowledge Discovery.
    Springer-Verlag.
  • Shekhar, S. et al. (2001) Map Cube: A
    Visualization Tool for Spatial Data Warehouses.
    Chapter in Geographic Data Mining and Knowledge
    Discovery. Taylor and Francis.
  • Tao, Y. et al. (2004) Spatio-Temporal Aggregation
    Using Sketches. Proceedings of ICDE.

133
Selected literature on
  • Mobility Pattern Querying and Mining
  • Cao, H. et al. (2005) Mining frequent
    spatio-temporal sequential patterns. Proceedings
    of ICDM.
  • Djafri, N. et al. (2002) Spatio-temporal
    evolution: querying patterns of change in
    databases. Proceedings of ACM-GIS.
  • Giannotti, F. et al. (2006) Efficient Mining of
    Temporally Annotated Sequences. Proceedings of
    SDM.
  • Giannotti, F. et al. (2007) Trajectory Pattern
    Mining. Proceedings of KDD.
  • Hadjieleftheriou, M. et al. (2005) Complex
    Spatio-Temporal Pattern Queries. Proceedings of
    VLDB.
  • Horvitz, E. et al. (2005) Prediction,
    expectation, and surprise: Methods, designs, and
    study of a deployed traffic forecasting service.
    Proceedings of Conference on Uncertainty in
    Artificial Intelligence.
  • Kalnis, P. et al. (2005) On discovering moving
    clusters in spatio-temporal data. Proceedings of
    SSTD.
  • van Kreveld, M. et al. (2007) Efficient Detection
    of Motion Patterns in Spatio-Temporal Data Sets.
    GeoInformatica, 11(2): 195-215.
  • Laube, P. et al. (2005) Discovering relative
    motion patterns in groups of moving point
    objects. Int. Journal of Geographical Information
    Science, 19(6): 639-668.
  • Li, X. et al. (2007) Traffic density-based
    discovery of hot routes in road networks.
    Proceedings of SSTD.
  • Liu, Y. et al. (2006) A scalable distributed
    stream mining system for highway traffic data.
    Proceedings of PKDD.
  • Mamoulis, N. et al. (2004) Mining, indexing, and
    querying historical spatiotemporal data.
    Proceedings of KDD.
  • du Mouza, C. and Rigaux, P. (2005) Mobility
    Patterns. GeoInformatica, 9(4): 297-319.

134
Selected literature on
  • Mobility Pattern Querying and Mining (cont.)
  • Nakata, T. and Takeuchi, J. (2004) Mining traffic
    data from probe-car system for travel time
    prediction. Proceedings of KDD.
  • Qu, Y. et al. (2003) Supporting Movement Pattern
    Queries in User-Specified Scales. IEEE
    Transacti