Title: Data Mining for Sensor Networks - Opportunities and Challenges
1Data Mining for Sensor Networks - Opportunities and Challenges
Vipin Kumar, University of Minnesota
kumar_at_cs.umn.edu, www.cs.umn.edu/kumar
Research funded by NSF, ARL, NASA, ARDA, and DHS
2What is Data Mining?
- Many Definitions
- Non-trivial extraction of implicit, previously unknown and potentially useful information from data
- Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
3Why Mine Data? Commercial Viewpoint
- Lots of data is being collected and warehoused
- Web data, e-commerce
- purchases at department/grocery stores
- Bank/Credit Card transactions
- Computers have become cheaper and more powerful
- Competitive Pressure is Strong
- Provide better, customized services for an edge
(e.g. in Customer Relationship Management)
4Why Mine Data? Scientific Viewpoint
- Data collected and stored at enormous speeds (GB/hour)
- Remote sensors on a satellite
- Telescopes scanning the skies
- Microarrays generating gene expression data
- Scientific simulations generating terabytes of data
- Data mining may help scientists
- In classifying and segmenting data
- In hypothesis formation
5Why mine data? Sensor Networks Viewpoint
- Potentially massive streams of sensor data
- Data mining offers the hope of real-time delivery
of actionable knowledge
Figure: application areas of sensor networks (interactive VR games, wearable computing, disaster recovery, environmental monitoring, Earth science, space exploration, context-aware computing, immersive environments, biological monitoring, hazard detection, smart environments, RFID-based systems, traffic monitoring, mobile data stream mining)
Courtesy Tian He, Hillol Kargupta, Shashi
Shekhar and CENS/UCLA
6Multi-Robot Teams and Sensor Networks
- Nikos Papanikolopoulos, University of Minnesota
- Goal: A sensor network of distributed robots with various mobility and sensory modes for exploration of structures in an urban scenario.
- Some Applications
- Emergency response
- Law enforcement
- Monitoring/surveillance
- NASA space exploration programs
7Gateway Change Detection Project
- Robert Grossman, University of Illinois, Chicago
- Goal: Monitor, integrate, analyze, detect changes, and send alerts for data streaming from a distributed highway sensor system.
System detects changes from baselines in real time and distributes them as alerts.
- 830 traffic sensors, 170,000 new sensor readings per day
- Also image, text, and semi-structured data (about 1 TB)
- Video feeds: unstructured
- Sensor data: structured
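As a rough illustration of detecting changes from a baseline, the sketch below flags readings that fall more than k standard deviations from a sensor's historical mean. The function name, the threshold k, and the speed values are assumptions made for illustration; the project's actual detection method is not described on this slide.

```python
from statistics import mean, stdev

def detect_changes(baseline, stream, k=3.0):
    """Return (index, value) pairs in `stream` that deviate from the
    baseline mean by more than k baseline standard deviations."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    alerts = []
    for i, x in enumerate(stream):
        if sigma > 0 and abs(x - mu) > k * sigma:
            alerts.append((i, x))
    return alerts

# Hypothetical speed readings (mph) from a single traffic sensor.
baseline_speeds = [61, 59, 60, 62, 58, 60, 61, 59]
new_speeds = [60, 58, 23, 61]

alerts = detect_changes(baseline_speeds, new_speeds)
print(alerts)
```

A real deployment would likely keep one baseline per sensor and per time of day, since normal traffic differs between rush hour and midnight.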
8Health Monitoring
- Real-time Health Monitoring
- Smart shirts that collect many attributes in real time
- Health Monitoring for fire fighters for safety evaluation
Objective: Improve the health assessment of elders/disabled
- Monitor sleep quality of elders
- Monitor gait, falls, and other movement disorders
- Reduce risks and improve the efficiency of caregivers
Courtesy Tian He, Hillol Kargupta
http://www.cs.virginia.edu/wsn/medical/
9Structure monitoring
Objective: Understand the interaction between ground motions and structure/foundation response.
N. Xu, S. Rangwala, K. Chintalapudi, D. Ganesan,
A. Broad, R. Govindan, and D. Estrin, "A wireless
sensor network for structural monitoring,"
SenSys, 2004.
Courtesy Tian He
10MineFleet: A Vehicle Data Stream Management and Mining Software System
- On-board Module
- Continuous data streams from the vehicle data bus
- Onboard data stream mining
- Communicates with a remote control station
- Privacy management
- Central control station
- Data Management
- Data mining
- Communicates with the on-board modules over wireless networks
- Privacy management
Hillol Kargupta, UMBC/AGNIK, LLC
Courtesy Hillol Kargupta
- Applications
- Vehicle Health Monitoring and Maintenance
- Fuel Consumption Analysis
- Driver Behavior Monitoring
11Origins of Data Mining
- Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
- Traditional techniques may be unsuitable due to data that is
- Large-scale
- High dimensional
- Heterogeneous
- Complex
- Distributed
12Data Mining Tasks
- Clustering
- Predictive Modeling
- Anomaly Detection
- Association Rules
13Predictive Modeling
- Find a model for the class attribute as a function of the values of the other attributes
Example: model for predicting credit worthiness
- Applications
- Targeted Marketing
- Customer Attrition/Churn
- Predicting damage in complex structures by using
sensor values for mode shapes and frequencies
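To make "a model for the class attribute as a function of the other attributes" concrete, here is a minimal nearest-centroid classifier. It is only a sketch: the feature choice (first two natural frequencies in Hz), the labels, and every number are hypothetical, and real structural damage prediction would use far richer models.

```python
import math

def fit_centroids(rows, labels):
    """Compute the mean feature vector (centroid) of each class."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, row):
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], row))

# Hypothetical training data: (mode-1 frequency, mode-2 frequency) in Hz.
rows = [(10.1, 25.2), (10.0, 25.0), (9.9, 24.9),
        (9.1, 23.0), (9.0, 22.8), (9.2, 23.1)]
labels = ["intact", "intact", "intact", "damaged", "damaged", "damaged"]

centroids = fit_centroids(rows, labels)
print(predict(centroids, (9.15, 23.2)))
print(predict(centroids, (10.05, 25.1)))
```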
14Clustering
- Finding groupings in data such that objects
within the same cluster are more similar to each
other than objects in different clusters
- Applications
- Market Segmentation
- Gene expression clustering
- Document Clustering
- Finding groups of driver behaviors based on patterns of automobile motion (normal, drunken, sleepy, rush-hour driving, etc.)
Courtesy Michael Eisen
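The grouping idea can be sketched with a small k-means (Lloyd's algorithm) implementation. The two features per trip (lane-deviation score, hard-braking score) and all values are hypothetical, and the initial centroids are supplied by the caller to keep the run deterministic; this is one common clustering method, not necessarily the one used in the applications listed.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm with caller-supplied initial centroids."""
    centroids = [list(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids

# Hypothetical per-trip features: (lane-deviation score, hard-braking score).
trips = [(0.2, 0.1), (0.3, 0.2), (0.25, 0.15),
         (1.5, 1.2), (1.6, 1.1), (1.4, 1.3)]

cluster_ids, centers = kmeans(trips, centroids=[trips[0], trips[3]])
print(cluster_ids)
```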
15Association Rule Discovery
- Given a set of records, find dependency rules which will predict occurrence of an item based on occurrences of other items in the record
- Applications
- Marketing and Sales Promotion
- Supermarket shelf management
- Traffic pattern analysis (e.g., rules such as
"high congestion on Intersection 58 implies high
accident rates for left turning traffic")
Rules Discovered:
Milk --> Coke (s=0.6, c=0.75)
Diaper, Milk --> Beer (s=0.4, c=0.67)
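The support (s) and confidence (c) values attached to the rules can be computed as below. The five transactions are an assumption: they are not given on the slide and were chosen so that the two rules come out with the reported values.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Of the transactions containing `lhs`, the fraction that also contain `rhs`."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

# Hypothetical market-basket records (assumed, not from the slide).
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

s1 = support(transactions, {"Milk", "Coke"})
c1 = confidence(transactions, {"Milk"}, {"Coke"})
s2 = support(transactions, {"Diaper", "Milk", "Beer"})
c2 = confidence(transactions, {"Diaper", "Milk"}, {"Beer"})
print(round(s1, 2), round(c1, 2), round(s2, 2), round(c2, 2))
```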
16Deviation/Anomaly Detection
- Detect significant deviations from normal behavior
- Applications
- Credit Card Fraud Detection
- Network Intrusion Detection
- Identify anomalous behavior from sensor networks
for monitoring and surveillance.
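One simple way to score deviation from normal behavior is the distance from each reading to its k-th nearest neighbor: isolated readings get large scores. The sketch below assumes hypothetical (temperature, humidity) readings and k=2; it illustrates the idea only and is not a method attributed to any system in this talk.

```python
import math

def knn_scores(points, k=2):
    """Anomaly score of each point = distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical (temperature in C, relative humidity in %) sensor readings.
readings = [(20.1, 40.0), (20.3, 41.0), (19.9, 39.5), (20.0, 40.5), (35.0, 10.0)]

scores = knn_scores(readings, k=2)
most_anomalous = scores.index(max(scores))
print(most_anomalous, readings[most_anomalous])
```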
17Data Mining Challenges: General
- Scale
- High-dimensional
- Heterogeneous
- Spatio-temporal
- Privacy
- Streaming
- Distributed
18Specific Issues and Challenges in Data Mining for
Sensor Networks
- Spatio-temporal
- Streaming
- Distributed
- Privacy
- Security
- Uncertainty/noise/missing values
- Low bandwidth communication
- Resource/power considerations
- Online feature extraction/mining
Courtesy Hillol Kargupta
Courtesy Robert Grossman
Courtesy Nikos Papanikolopoulos
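Streaming data and tight resource budgets favor one-pass summaries that use constant memory. A minimal sketch, assuming a sensor node only needs a running mean and variance, is Welford's online algorithm; the class name and sample values are illustrative.

```python
class RunningStats:
    """Welford's online algorithm: O(1) memory, one pass over the stream."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance (0.0 until at least two readings have arrived)."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for reading in [2, 4, 4, 4, 5, 5, 7, 9]:  # hypothetical sensor readings
    stats.update(reading)
print(stats.mean, stats.variance)
```

Because only three numbers are stored, a node could transmit the summary instead of the raw stream, which speaks to the low-bandwidth and power constraints listed above.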
19Discovery of Climate Patterns from Global Data
Sets
Questions that can be answered using
spatio-temporal data mining
- General Questions
- How is the global climate changing?
- What are the consequences of changes in the climate?
- How well can we predict future climate changes?
- Illustrative Specific Questions
- How is the frequency and intensity of ecosystem disturbance on land related to variability in surface climate?
- How is land surface precipitation and temperature affected by ocean temperature?
- How is the frequency and extent of wildfires related to variability in surface climate (precipitation, temperature, and wind speed)?
- Data sources
- Sensors
- Ground-based remote atmospheric measurements
- In-situ measurements from sensors in coastal waters and atmosphere
- Weather observation stations
- High-resolution EOS (Earth Observing System) satellites
- Model-based data from forecast and other models
- Data sets created by data fusion
20Detection of Ecosystem Disturbances
Detection of sudden changes in greenness over
extensive areas from these large global satellite
data sets allowed Earth Science researchers to
gain a deeper insight into the interplay among
natural disasters, human activities and the rise
of carbon dioxide in Earth's atmosphere during
two recent decades.
Release 03-51AR: NASA Data Mining Reveals a New History of Natural Disasters
NASA is using satellite data to paint a detailed global picture of the interplay among natural disasters, human activities and the rise of carbon dioxide in the Earth's atmosphere during the past 20 years.
http://www.nasa.gov/centers/ames/news/releases/2003/03_51AR.html
21Climate Indices: Connecting the Ocean/Atmosphere and the Land
- A climate index is a time series of sea surface temperature or sea level pressure
- Climate indices capture teleconnections
- The simultaneous variation in climate and related processes over widely separated points on the Earth
El Nino Events
NINO 1+2 Index
22Discovery of Climate Indices Using Clustering
A novel clustering technique was developed to identify regions of uniform behavior in spatio-temporal data. The use of clustering for discovering climate indices is driven by the intuition that a climate phenomenon is expected to involve a significant region of the ocean or atmosphere where the behavior is relatively uniform over the entire area.
A cluster-based approach for discovering climate indices provides better physical interpretation than those based on the SVD/EOF paradigm, and provides candidate indices with better predictive power than known indices for some land areas.
Some SST clusters reproduce well-known climate indices. In particular, we were able to replicate the four El Nino SST-based indices: cluster 94 corresponds to NINO 1+2, 67 to NINO 3, 78 to NINO 3.4, and 75 to NINO 4. The correlations of these clusters to their corresponding indices are higher than 0.9.
Some SST clusters, e.g., cluster 29, are significantly different from known indices, but provide better correlation with land climate variables than known indices for many parts of the globe. The bottom figure shows the difference in correlation to land temperature between cluster 29 and the El Nino indices. Areas in yellow indicate where cluster 29 has higher correlation.
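Comparing a cluster to a known index can be sketched as follows: average the time series of the grid cells in the cluster to get a candidate index, then compute its Pearson correlation with the known index. All series below are made-up SST anomalies, and treating the cluster centroid as the candidate index is an assumption of this sketch.

```python
import math

def centroid_series(series_list):
    """Average several equal-length time series point by point."""
    return [sum(vals) / len(vals) for vals in zip(*series_list)]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical monthly SST anomalies at three grid cells in one cluster.
cluster_cells = [
    [0.1, 0.5, 1.2, 0.8, -0.2, -0.6],
    [0.2, 0.6, 1.0, 0.9, -0.1, -0.7],
    [0.0, 0.4, 1.1, 0.7, -0.3, -0.5],
]
# Hypothetical values of a known index over the same months.
known_index = [0.15, 0.45, 1.15, 0.75, -0.25, -0.55]

candidate = centroid_series(cluster_cells)
r = pearson(candidate, known_index)
print(round(r, 3))
```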
23Moving Clusters in Space and Time
- Most well-known indices are based on data collected at fixed land stations.
- NAO is computed as the normalized difference between SLP at a pair of land stations in the Arctic and the subtropical Atlantic regions of the North Atlantic Ocean
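A station-based index of this kind can be sketched by standardizing each station's SLP series and taking the difference. The sign convention (subtropical minus Arctic), the use of the population standard deviation, and the pressure values in hPa are assumptions of this sketch.

```python
from statistics import mean, pstdev

def standardize(series):
    """Convert a series to z-scores (zero mean, unit variance)."""
    mu, sigma = mean(series), pstdev(series)
    return [(x - mu) / sigma for x in series]

def pressure_index(subtropical_slp, arctic_slp):
    """Difference of standardized SLP: subtropical station minus Arctic station."""
    return [s - a for s, a in zip(standardize(subtropical_slp),
                                  standardize(arctic_slp))]

# Hypothetical seasonal-mean SLP (hPa) at the two stations.
subtropical = [1020, 1024, 1018, 1022]
arctic = [1005, 998, 1008, 1001]

index = pressure_index(subtropical, arctic)
print([round(v, 2) for v in index])
```

A positive value means the pressure gradient between the two stations is stronger than usual; a negative value means it is weaker.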
24Moving Clusters in Space and Time (contd.)
- However, the underlying phenomenon may not occur at the exact location of the land station (e.g., NAO).
- Challenge: Given sensor readings for SLP at different points in the ocean, how to identify clusters of low/high pressure points that may move in space and time.
25Spatio-temporal Associations in Climate Data
Ref: Tan et al. 2001
FPAR-Hi => NPP-Hi (sup=5.9, conf=55.7)
Grassland/Shrubland areas
The association rule is interesting because it appears mainly in regions with grassland/shrubland vegetation type
26Data Mining for Cyber Intrusion Detection
Incidents Reported to Computer Emergency Response
Team/Coordination Center (CERT/CC)
- Due to the proliferation of the Internet, more and more organizations are becoming vulnerable to cyber attacks
- Sophistication of cyber attacks as well as their severity is also increasing
- Cyber strategies can be a major force multiplier and equalizer
- Security mechanisms always have inevitable vulnerabilities
- Firewalls are not sufficient to ensure security in computer networks
- Insider attacks
- Traditional signature-based intrusion detection systems (IDSs) (e.g. SNORT) cannot detect emerging cyber threats
- Data Mining can alleviate this limitation
Example of SNORT rule (MS-SQL Slammer worm):
any -> udp port 1434 (content:"81 F1 03 01 04 9B 81 F1 01" content:"sock" content:"send")
www.snort.org
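The limitation of signature matching can be seen in a toy matcher that mirrors the rule above. This is not Snort's rule engine or syntax; the dictionary layout, function name, and packet payloads are hypothetical. A packet matches only if it carries the exact byte patterns, so a variant that changes a single byte goes undetected.

```python
# Toy signature modeled on the rule above (layout is an assumption).
SLAMMER_SIGNATURE = {
    "protocol": "udp",
    "dst_port": 1434,
    "contents": [bytes.fromhex("81F10301049B81F101"), b"sock", b"send"],
}

def matches(signature, packet):
    """True if protocol and port agree and every content pattern is in the payload."""
    return (
        packet["protocol"] == signature["protocol"]
        and packet["dst_port"] == signature["dst_port"]
        and all(pattern in packet["payload"] for pattern in signature["contents"])
    )

# Hypothetical payloads: one carries the known pattern, one alters a single byte.
known_payload = b"\x00" + bytes.fromhex("81F10301049B81F101") + b"..sock..send"
variant_payload = known_payload.replace(b"\x81\xf1\x03", b"\x82\xf1\x03")

known = {"protocol": "udp", "dst_port": 1434, "payload": known_payload}
variant = {"protocol": "udp", "dst_port": 1434, "payload": variant_payload}

print(matches(SLAMMER_SIGNATURE, known), matches(SLAMMER_SIGNATURE, variant))
```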
27MINDS: Minnesota INtrusion Detection System
- Data mining based anomaly detection system
- Incorporated into Interrogator architecture at ARL Center for Intrusion Monitoring and Protection (CIMP)
- Helps analyze data from multiple sensors at DoD sites around the country
- MINDS anomalies are used as the primary key when viewing related alerts from other tools (SNORT, Jids, etc.)
- MINDS is the first effective anomaly intrusion detection system used by ARL
- Routinely detects attacks and intrusive behavior not detected by widely used intrusion detection systems
- Insider Abuse / Policy Violations / Worms / Scans
28Typical MINDS Output
- UMN computer connecting to a remote FTP server, running on port 5002
- Summarized TCP reset packets received from 64.156.X.74, which is a victim of DoS attack, and we were observing backscatter, i.e. replies to spoofed packets
- Summarization of FTP scan from a computer in Columbia, 200.75.X.2
- Summary of IDENT lookups, where a remote computer tries to get user name
- Summarization of a USENET server transferring a large amount of data
29NSF Press Release
Just because an event occurs rarely doesn't mean it won't have dramatic impacts. Consider heart attacks, power blackouts, credit card frauds or computer virus infections. Vipin Kumar and colleagues at the University of Minnesota are developing data-mining techniques to detect rare events, such as computer break-ins, that are difficult to detect using traditional methods that recognize attacks only through pre-defined patterns.
The new techniques have been incorporated in the Minnesota Intrusion Detection System (MINDS) software, which helps cybersecurity analysts detect computer break-ins and other undesirable activity in real-world networks, potentially while the break-in is underway. "MINDS allows cybersecurity experts to quickly analyze massive amounts of network traffic," Kumar said. "They only need to evaluate the most anomalous connections identified by the system."
The data-mining research on rare event analysis is supported by a $300,000 award from the National Science Foundation. MINDS is currently being used to monitor over 40,000 computers at the University of Minnesota. In addition, it is an integral part of the Army's Interrogator architecture, used at the Army Research Laboratory's Center for Intrusion Monitoring and Protection to analyze network traffic from Defense Department sites around the country. MINDS routinely detects novel intrusions, policy violations and insider abuse that are missed by other widely used tools.
Detecting computer intrusions is only the first application for the Minnesota team's new data-mining methods. The underlying techniques could be applied to many areas beyond cybersecurity, such as detecting financial or health-care fraud.
http://www.nsf.gov/discoveries/disc_summ.jsp?cntn_id=100488
30Correlation of suspicious events across network
sites
- Needed to detect sophisticated attacks not identifiable by single site analyses
- Distributed correlation algorithms
- Grids middleware
Data Mining Middleware for Grids NSF/ITR funded
project jointly with B. Grossman, S. Ranka, and
J. Weissman
How to detect a distributed network attack?
31Map of the Global IP Space
32Attack Traffic on Port 445
Destination IPs of suspicious connections within
the 3 class B networks at the U of M
Source IPs of suspicious connections in the
global IP space
7982 unique sources, 6184 unique destinations, 9930 total flows involved
Legend: failed connections; O = successful connections
33Spatio-temporal Data Mining on Network Zombies
- Spatial Attack Distribution of IPs on the Same Day
(Left) IPs attacking the UFL network on 12/09/04 (712 scanners).
(Middle) IPs attacking the UMN network on 12/09/04 (14,938 scanners).
(Right) Intersection of the IPs attacking UFL and UMN (201 scanners).
- Challenge: Given distribution of attackers at many different sensors (i.e., Internet sites), how to find attack patterns in space and time.
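The "intersection" view in the right-hand panel amounts to a set intersection over the attacker lists observed at each site. A minimal sketch, with hypothetical site names as keys and addresses taken from the reserved documentation ranges rather than real traffic:

```python
def shared_attackers(attackers_by_site):
    """Return the source IPs observed attacking every site on the same day."""
    sites = list(attackers_by_site.values())
    return set.intersection(*sites) if sites else set()

# Hypothetical per-site scanner lists for one day (documentation-range IPs).
observed = {
    "site_a": {"192.0.2.10", "198.51.100.7", "203.0.113.5"},
    "site_b": {"198.51.100.7", "203.0.113.99"},
}

common = shared_attackers(observed)
print(sorted(common))
```

Repeating the intersection across days, or across more than two sites, gives a first cut at the space-and-time patterns the challenge asks for.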
34Resources
- Workshop Proceedings
- Data Mining and Wireless Sensor Networks (DM-WSN'06), IEEE International Conference on Data Mining, ICDM'06, Hong Kong, December 18-22, 2006.
- 2nd Workshop on Geosensor Networks (GSN2.0), Boston, MA, Oct 1-3 2006.
- Data Mining in Sensor Networks, SIAM Data Mining Conference, April 23, 2005. http://www.public.asu.edu/huanliu/dmml_presentation/sdm-Sensor-Networks.pdf
- ISSNIP 2004 Workshop on Machine Learning for Signal Processing
- Data Mining in Resource Constrained Environments, SIAM Data Mining Conference (SDM 2004).
- 1st Geo Sensor Networks Workshop, Portland, ME, Oct 9-11 2003.
- Pervasive, Distributed, and Stream Data Mining, a session at the National Science Foundation Workshop on Next Generation Data Mining (NGDM'02).
- PKDD Workshop on Ubiquitous Data Mining for Mobile and Distributed Environments, 2001. http://www.cs.umbc.edu/hillol/pkdd2001/udm.html
- IJCAI-01 Workshop on Knowledge Discovery From Distributed, Dynamic, Autonomous, Heterogeneous Data and Knowledge Sources, August 6, 2001, Seattle, WA
35Resources (contd.)
- Journal Special Issues
- Special Issue on Distributed Sensing for Quality and Productivity Improvement, IEEE Transactions on Automation Science and Engineering, July 2006.
- Special Issue on Distributed and Mobile Data Mining, IEEE Transactions on Systems, Man, and Cybernetics, Part B, December 2004.
- Special Issue on Signal Processing for Mining Information, IEEE Signal Processing Magazine, May 2004.
- Special Issue on Sensor Network Technology Infrastructure, Security, Data Processing, and Deployment, SIGMOD Record, March 2004.
- Special Section on Sensor Network Technology and Sensor Data Management, SIGMOD Record, December 2003.
36Resources (contd.)
- Overview Articles
- Pang-Ning Tan, Knowledge Discovery from Sensor Data, Sensors Magazine, March 2006.
- B. Thuraisingham, Secure sensor information management and mining, IEEE Signal Processing Magazine, May 2004.
- I.F. Akyildiz, W. Su, Y. Sankarasubramaniam, E. Cayirci, A survey on sensor networks, IEEE Communications Magazine, August 2002.
- Bibliography
- Distributed Data Mining Bibliography (by Hillol Kargupta) http://www.cs.umbc.edu/hillol/DDMBIB/
37Resources (contd.)
- Books
- P-N Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Addison-Wesley, 2005.
- K. Chakrabarty and S.S. Iyengar. Scalable Infrastructure for Distributed Sensor Networks, Springer, 2005.
- Hillol Kargupta and Philip Chan. Advances in Distributed and Parallel Knowledge Discovery, xv--xxvi, MIT/AAAI Press, 2000.