VIRTUAL PRESENCE

Title: VIRTUAL PRESENCE

1
VIRTUAL PRESENCE
Authors
Voislav Galić, vgalic_at_bitsyu.net
Dušan Zečević, zdusan_at_softhome.net
Đorđe Đurđević, madcat_at_tesla.rcub.bg.ac.yu
Veljko Milutinović, vm_at_etf.bg.ac.yu
http://galeb.etf.bg.ac.yu/~vm/tutorial
2
DEFINITION
Virtual presence is a term with various shades
of meaning in different industries, but its
essence remains constant: it is a new tool that
enables some form of telecommunication in which
individuals may substitute their physical
presence with an alternate, typically
electronic, presence
3
SUMMARY
- Introduction to Virtual Presence
- Data Mining for Virtual Presence
- A New Software Paradigm
- Selected Case Studies
4
INTRODUCTION TO VP

- Definitions
VP applications
Psychological aspects

5
DATA MINING FOR VP

Why Data Mining?
What can Data Mining do?
Growing popularity of Data Mining
- Algorithms

6
SOFTWARE AGENTS

A new software paradigm
Standardization
FIPA specifications
Agent management
Agent Communication Language

7
GoodNews (CMU)

Categorization of financial news articles
Co-located phrases
Domain Experts
Implementation and results

Carnegie Mellon University, Pittsburgh, USA
8
iMatch (MIT)

The idea
- associate MIT students and staff in order to
ease their cooperation
- help students find resources they need
Implementation
advanced, agent-based system architecture
- Tomorrow?

Massachusetts Institute of Technology, USA
9
Tourist city (ETF)

A qualitative step forward in the domain of
maximization of customer satisfaction
Technologies
Data Mining
Software Agents (mobile)

Faculty of Electrical Engineering, University
of Belgrade, Serbia and Montenegro
10
CONCLUSION

This tutorial will attempt to familiarize you
with:
The concept of VP (Virtual Presence) as a
new technological challenge
The new paradigms and technologies that will
bring VP to everyday life
- Data Mining
- Software Agents

11
INTRODUCTION

Virtual presence will arguably be one of the
most important aspects of personal communication
in the twenty-first century

12
Essence of VP

The usefulness and reliability of virtual
presence
The ability to conduct everyday tasks by being
virtually or electronically present

13
How to Accomplish it?

Presence is accomplished through the
Internet, video, or other communications,
perhaps even psychically one day
Technological advances will make virtual
presence more sophisticated, altering the very
meaning of the word "presence"

14
VP Applications

VP in government
Sunshine laws
Voting

15
VP Applications

VP in business
Online board meetings
Shareholder voting online

16
VP Applications

VP in education
interactive lectures and courses

17
VP Applications

VP in medicine
Telemedicine
Diagnostics
Remote surgery
Risks
Privacy

18
VP Applications

VP in everyday life
Telecommuting/Telework
Software agents as our virtual shadows

19
Psychological Aspects

Cyberspace and Mind
Presence in Virtual Space
Communal Mind and Virtual Community

20
DATA MINING

Knowledge discovery is a non-trivial process of
identifying valid, novel, potentially useful, and
ultimately understandable patterns in data

21
Many Definitions

Data mining is also called data or knowledge
discovery
It is a process of inferring knowledge from
large oceans of data
Search for valuable information in large volumes
of data
Analyzing data from different perspectives and
summarizing it into useful information

22
Why Data Mining ?

DM allows you to extract knowledge from
historical data and predict outcomes of future
situations
Optimize business decisions and improve
customer satisfaction with your services
Analyze data from many different angles,
categorize it, and summarize the relationships
identified
Reveal knowledge hidden in data and turn this
knowledge into a crucial competitive advantage

23
What Can Data Mining Do?

Identify your best prospects and then retain
them as customers
Predict cross-sell opportunities and make
recommendations
Learn parameters influencing trends in sales and
margins
Segment markets and personalize communications
etc.

24
The Power of Data Mining

Having a database is one thing, making sense of
it is quite another
It does not rely on narrow human queries to
produce results, but instead uses AI-related
technology and algorithms
Inductive reasoning
Using more than one type of algorithm to search
for patterns in data
Data mining usually produces more general (more
powerful) results than those obtained by
traditional techniques
Relational DB storage and management technology
is adequate for data mining applications smaller
than 50 gigabytes

25
Reasons for the Growing Popularity of Data Mining

Growing Data Volume
Low Cost of Machine Learning
Limitations of Human Analysis

26
Tasks Solved by Data Mining

Predicting
Classification
Detection of relations
Explicit modeling
Clustering
Market basket analysis
Deviation detection

27
Algorithms

Generally, their complexity is around O(n log n),
where n is the number of records
Data mining includes three major components,
with corresponding algorithms:
Clustering (Classification)
Association Rules
Sequential Analysis

28
Classification Algorithms

The aim is to develop a description or model for
each class in a database, based on the features
present in a set of class-labeled training
data
Data Classification Methods
Statistical algorithms
Neural networks
Genetic algorithms
Nearest neighbor method
Rule induction
Data visualization

29
Classification-rule Learning

Data abstraction
Classification-rule learning: finding rules or
decision trees that partition given data into
predefined classes
Hunt's method
Decision tree building algorithms
ID3 / C4.5 algorithm
SLIQ / SPRINT algorithm (IBM)
Other algorithms

30
Parallel Algorithms

Basic idea: N training data items are randomly
distributed to P processors. All the processors
cooperate to expand the root node of the
decision tree
There are two approaches to the further progress
(expanding the remaining nodes)
Synchronous approach
Partitioned approach

31
Association Rule Algorithms

An association rule implies a certain association
relationship among a set of objects in a
database
These objects occur together, or one implies
the other
Formally: $X \Rightarrow Y$, where X and Y are sets of items
(itemsets)
Key terms
Confidence
Support
The goal: to find all association rules that
satisfy user-specified minimum support and
minimum confidence constraints
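
The two key terms can be made concrete with a small sketch. It assumes the usual definitions, which the slide does not spell out: the support of an itemset is the fraction of transactions that contain it, and the confidence of X => Y is support(X ∪ Y) / support(X). The basket data and the function names are hypothetical and serve only as an illustration.

```python
from typing import FrozenSet, Iterable, List

Itemset = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Itemset]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str],
               transactions: List[Itemset]) -> float:
    """Confidence of the rule X => Y: support(X u Y) / support(X)."""
    x_set, y_set = frozenset(x), frozenset(y)
    return support(x_set | y_set, transactions) / support(x_set, transactions)


# Hypothetical market-basket data, for illustration only.
baskets = [frozenset(b) for b in (
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
)]

print(support({"bread", "milk"}, baskets))       # 2 of the 4 baskets
print(confidence({"bread"}, {"milk"}, baskets))  # 2 of the 3 bread baskets
```

A rule would be reported only if both values reach the user-specified minimum thresholds.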

32
Association Rule Algorithms

Apriori algorithm and its variations
AprioriTid
AprioriHybrid
FT (Fault-tolerant) Apriori
Distributed / Parallel algorithms (FDM, ...)

33
Sequential Analysis

Sequential Patterns
The problem: finding all sequential patterns
with user-specified minimum support
Elements of a sequential pattern need not be
consecutive
simple items
Algorithms for finding sequential patterns
count-all algorithms
count-some algorithms

34
Conclusion

Drawbacks of existing algorithms
Data size
Data noise
There are two critical technological drivers
Size of the database
Query complexity
The infrastructure has to be significantly
enhanced to support larger applications
Solutions
Adding extensive indexing capabilities
Using new HW architectures to achieve
improvements in query time

35
THE NEW SOFTWARE PARADIGM

All software agents are programs, but not all
programs are agents

36
Many Definitions

Computational systems that inhabit some dynamic
environment, sense and act autonomously and
realize a set of goals or tasks for which they
are designed
A hardware or (more usually) software-based
computer system that enjoys the following
properties:

- Reactive (sensing and acting)
- Autonomous
- Goal-oriented (pro-active, purposeful)
- Temporally continuous
- Communicative (socially able)
- Learning (adaptive)
- Mobile
- Flexible
- Character
37
Interesting Topic of Study

They draw on and integrate many diverse
disciplines of computer science and other areas
objects and distributed object architectures
adaptive learning systems
artificial intelligence and expert systems
collaborative online social environments
security
knowledge based systems, databases
communications networks
cognitive science and psychology

38
What Problems do Agents Solve ?

Client/server network bandwidth problem
In the design of a client/server architecture
The problems created by intermittent or
unreliable network connections
Attempts to get computers to do real thinking for
us

39
The New Software Paradigm

Unless special care has been taken in the design
of the code, two software programs cannot
interoperate
The promise of agent technology is to move the
burden of interoperability from software
programmers to programs themselves
This can happen if two conditions are met
A common language (Agent Communication Language,
ACL)
An appropriate architecture

40
The Need for Standards

Anywhere, anytime consumer access to the
universal bouquet of information and services is
the new goal of the information revolution
The scope of Internet standards makes the range
of choices extreme
The Foundation for Intelligent Physical Agents
(FIPA), established in 1996 in Geneva
An international non-profit association of companies
and organizations
Produces specifications of generic agent technologies

41
FIPA Specifications

Agent Management
Agent Communication Language
Agent/Software Integration
Agent Management Support for Mobility
Human-Agent Interaction
Agent Security Management
Agent Naming
FIPA Architecture
Agent Message Transport
etc.

42
Agent Management

Provides the normative framework within which
FIPA agents exist and operate
Establishes the logical reference model for the
creation, registration, location, communication,
migration and retirement of agents

The entities contained in the reference model are
logical capability sets and do not imply any
physical configuration
- Additionally, the implementation details of
individual APs (Agent Platforms) and agents are the design choices
of the individual agent system developers

43
Components of the Model

Agent

- computational process
- fundamental actor on an AP
- as a physical software process, has a life
cycle that has to be managed by the AP

Directory Facilitator

- yellow pages to other agents
- supported functions are
register
deregister
modify
search

Agent Management System

- white pages services to other agents
- maintains a directory of AIDs (agent identifiers), which contain
transport addresses
- supported functions are
register
deregister
modify
search
get-description
operations for underlying AP

Message Transport Service

- communication method between agents

Agent Platform

- physical infrastructure in which agents can be
deployed

Software

- all non-agent, executable collections of
instructions accessible through an agent
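
As a rough illustration of the yellow-pages role, the sketch below implements the four Directory Facilitator functions named above (register, deregister, modify, search) as an in-memory registry. It is not the FIPA interface: the class names, the `AgentDescription` record and the idea of searching by a service name are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class AgentDescription:
    """Hypothetical record an agent registers: its name and offered services."""
    name: str
    services: Set[str] = field(default_factory=set)


class DirectoryFacilitator:
    """Toy yellow-pages registry with the four functions listed on the slide."""

    def __init__(self) -> None:
        self._entries: Dict[str, AgentDescription] = {}

    def register(self, desc: AgentDescription) -> None:
        if desc.name in self._entries:
            raise ValueError(f"{desc.name} is already registered")
        self._entries[desc.name] = desc

    def deregister(self, name: str) -> None:
        del self._entries[name]  # raises KeyError if the agent is unknown

    def modify(self, desc: AgentDescription) -> None:
        if desc.name not in self._entries:
            raise KeyError(desc.name)
        self._entries[desc.name] = desc

    def search(self, service: str) -> List[str]:
        """Names of all registered agents offering `service`, sorted."""
        return sorted(name for name, desc in self._entries.items()
                      if service in desc.services)


df = DirectoryFacilitator()
df.register(AgentDescription("j", {"reserve-ticket"}))
df.register(AgentDescription("k", {"weather"}))
print(df.search("reserve-ticket"))
```

An Agent Management System could be sketched the same way, keyed by agent identifier and returning transport addresses instead of service descriptions.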
44
Agent Life Cycle

FIPA agents exist physically on an AP and utilize
the facilities offered by the AP for realising
their functionalities
In this context, an agent, as a physical software
process, has a physical life cycle that has to
be managed by the AP

The state transitions of agents can be described
as
- create
- invoke
- destroy
- quit
- suspend
- resume
- wait
- wake up
- move
- execute
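
The transitions listed above can be read as a small state machine. The sketch below is one possible reading: the transition names come from the slide, while the state names (unknown, initiated, active, suspended, waiting, transit) and the rule that destroy and quit are accepted from any living state are assumptions added for the example.

```python
from typing import Dict, Iterable, Tuple

# (current state, transition) -> next state.
# State names are assumptions; only the transition names come from the slide.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("unknown", "create"): "initiated",
    ("initiated", "invoke"): "active",
    ("active", "suspend"): "suspended",
    ("suspended", "resume"): "active",
    ("active", "wait"): "waiting",
    ("waiting", "wake up"): "active",
    ("active", "move"): "transit",
    ("transit", "execute"): "active",
}
# Assumption: both endings are allowed from any state except "unknown".
ENDING = {"destroy", "quit"}


def step(state: str, transition: str) -> str:
    """Apply one transition, rejecting those not allowed in `state`."""
    if transition in ENDING and state != "unknown":
        return "unknown"
    if (state, transition) not in TRANSITIONS:
        raise ValueError(f"'{transition}' is not allowed in state '{state}'")
    return TRANSITIONS[(state, transition)]


def run(transitions: Iterable[str], state: str = "unknown") -> str:
    """Apply a sequence of transitions and return the final state."""
    for t in transitions:
        state = step(state, t)
    return state


print(run(["create", "invoke", "move", "execute", "wait", "wake up"]))  # active
```

Keeping the table as data makes it easy for a platform to validate a requested transition before acting on it.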
45
Agent Communication Language

The specification consists of a set of message
types and the description of their meanings
Requirements
Implementing a subset of the pre-defined message
types and protocols
Sending and receiving the not-understood message
Correct implementation of communicative acts
defined in the specification
Freedom to use communicative acts with other
names, not defined in the specification
Obligation of correctly generating messages in
the transport form
Language must be able to express propositions,
objects and actions
The use of Agent Management Content Language and
ontology

46
ACL Syntax Elements

Pre-defined message parameters

sender, receiver, content, reply-with, in-reply-to,
envelope, language, ontology, reply-by, protocol,
conversation-id

Communicative acts

accept-proposal, agree, cancel, cfp, confirm,
disconfirm, failure, inform, inform-if, inform-ref,
not-understood, propose, query-if, query-ref, refuse,
reject-proposal, request, request-when,
request-whenever, subscribe
47
Communication Examples

- Agent i confirms to agent j that it is,
in fact, true that it is snowing today
(confirm
    :sender i
    :receiver j
    :content "weather( today, snowing )"
    :language Prolog)

- Agent i asks agent j if j is registered with
domain server d1
(query-if
    :sender i
    :receiver j
    :content (registered (server d1) (agent j))
    :reply-with r09)
...
(inform
    :sender j
    :receiver i
    :content (not (registered (server d1) (agent j)))
    :in-reply-to r09)

- Agent j replies that it can reserve trains,
planes and automobiles
(inform
    :sender j
    :receiver i
    :content
        (= (iota ?x (available-services j ?x))
           ((reserve-ticket train)
            (reserve-ticket plane)
            (reserve automobile))))

- Agent i, believing that agent j thinks that a
shark is a mammal, attempts to change j's belief
(disconfirm
    :sender i
    :receiver j
    :content (mammal shark))

- Agent j refuses to reserve a ticket for i,
since there are insufficient funds in i's
account
(refuse
    :sender j
    :receiver i
    :content
        ((action j (reserve-ticket LHR, MUC, 27-sept-97))
         (insufficient-funds ac12345))
    :language sl)

- Auction bid
(inform
    :sender agent_X
    :receiver auction_server_Y
    :content (price (bid good02) 150)
    :in-reply-to round-4
    :reply-with bid04
    :language sl
    :ontology auction)

- Agent i did not understand a query-if message
because it did not recognize the ontology
(not-understood
    :sender i
    :receiver j
    :content ((query-if :sender j :receiver i ...)
              (unknown (ontology www)))
    :language sl)

- Agent i asks agent j for its available
services
(query-ref
    :sender i
    :receiver j
    :content (iota ?x (available-services j ?x)))
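
The message layout used in these examples, a communicative act followed by `:parameter value` pairs inside parentheses, can be produced mechanically. The sketch below is a minimal illustration and not a FIPA library: `render_acl` is a hypothetical helper, the content expression is passed through as an opaque string, and the only checks are against the act and parameter names listed on the ACL Syntax Elements slide.

```python
from typing import Mapping

# Names as listed on the "ACL Syntax Elements" slide.
PARAMETERS = {
    "sender", "receiver", "content", "reply-with", "in-reply-to", "envelope",
    "language", "ontology", "reply-by", "protocol", "conversation-id",
}
ACTS = {
    "accept-proposal", "agree", "cancel", "cfp", "confirm", "disconfirm",
    "failure", "inform", "inform-if", "inform-ref", "not-understood",
    "propose", "query-if", "query-ref", "refuse", "reject-proposal",
    "request", "request-when", "request-whenever", "subscribe",
}


def render_acl(act: str, params: Mapping[str, str]) -> str:
    """Render '(act :name value ...)', keeping the given parameter order."""
    if act not in ACTS:
        raise ValueError(f"unknown communicative act: {act}")
    unknown = set(params) - PARAMETERS
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    fields = " ".join(f":{name} {value}" for name, value in params.items())
    return f"({act} {fields})"


message = render_acl("confirm", {
    "sender": "i",
    "receiver": "j",
    "content": '"weather( today, snowing )"',
    "language": "Prolog",
})
print(message)
```

A receiving agent that cannot interpret such a message would answer with a not-understood act, as in the example above.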
48
Agent/Software Integration

Integration of services provided by non-agent
software into a multi-agent community
Definition of the relationship between agents
and software systems
Allowing agents to describe, broker and negotiate
over software systems
Allowing new software services to be dynamically
introduced into an agent community
Defining how software resources can be described,
shared and dynamically controlled in an agent
community

49
New Agent Roles

To support specification, two new agent roles
have been identified
Agent Resource Broker (ARB)
WRAPPER Agent

50
GoodNews

A system that automatically categorizes news
reports that reflect positively or negatively on
a company's financial outlook

51
Introduction

Correlation between news reports on a company's
financial outlook and its attractiveness as an
investment
The volume of such reports is huge
A new text classification algorithm: Domain
Experts with a self-confident sampling
technique
Two types of data
(Human-)labeled
Unlabeled
The algorithm classifies financial news into
five predefined categories
(good), (good, uncertain), (neutral), (bad,
uncertain), (bad)

52
Introduction

Text categorization task
FCP (Frequently Co-located Phrase): the building
element for the categorization algorithm
Text categorization: a very difficult domain for
the use of machine learning
Very large number of input features
High level of attribute and class noise
Large percentage of irrelevant features
Very expensive labeled data, while unlabeled
data are cheaply available

53
Categorization

The algorithm categorizes each given news article
into the predefined categories in terms of the
referenced company's financial well-being
GOOD: strong and explicit evidence of the
company's financial status
"shares of ABC company rose 2 percent to
24-15/16"
GOOD, UNCERTAIN: predictions and forecasts of
future profitability
"ABC company predicts fourth-quarter earnings
will be high"

54
Categorization

NEUTRAL: nothing is mentioned about the
financial well-being of the company
"ABC announced plans to focus on products based
on recycled materials"
BAD, UNCERTAIN: predictions of future losses
"ABC announced today that fourth-quarter results
could fall short of expectations"
BAD: explicitly bad evidence
"shares of ABC fell 0.57 to 44.65 in early NY
trading"
Problems with construction of the training (i.e.
labeled) data set: inter-indexer inconsistency

55
Co-located Phrase

The proposed algorithm labels the unlabeled
news articles through a voting process among
experts that are FCPs
Definition: a co-located phrase is a sequence of
nearby, but not necessarily consecutive, words
"shares of ABC rose 8.5": (shares, rose), GOOD
"ABC presented its new product": (present,
product), NEUTRAL
Contextual information
The use of heuristics to cope with the enormous
phrase space (number of possible phrases)
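
The definition of a co-located phrase can be illustrated by enumerating candidate word pairs. The sketch below is only a guess at the candidate-generation step: the window size of 3, the restriction to pairs, and the function name are assumptions, and neither the stemming implied by (present, product) nor the voting among experts is modelled.

```python
from itertools import combinations
from typing import List, Set, Tuple


def colocated_pairs(sentence: str, window: int = 3) -> Set[Tuple[str, str]]:
    """Ordered word pairs whose positions differ by at most `window`."""
    words: List[str] = sentence.lower().split()
    pairs: Set[Tuple[str, str]] = set()
    # combinations() over enumerate() yields index pairs with i < j only.
    for (i, first), (j, second) in combinations(enumerate(words), 2):
        if j - i <= window:
            pairs.add((first, second))
    return pairs


pairs = colocated_pairs("shares of ABC rose 8.5")
print(("shares", "rose") in pairs)
```

Even for this five-word sentence the function returns nine pairs, which hints at why heuristics are needed to keep the phrase space manageable.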

56
Naive-Bayes vs. Domain Experts

Naive-Bayes with EM (Expectation Maximization)
Problems with small sets of labeled (training)
data
EM (Expectation Maximization): a class of
iterative algorithms for maximum likelihood
estimation in problems with incomplete data
Domain Experts algorithm is able to deal with
inconsistent hypotheses
Iterative building of the training set

57
Implementation and Results

The experiment focused on two performance
criteria
Using unlabeled data for improving categorization
accuracy
The categorization itself
The accuracy is around 75% (total of 2000 news
articles)
Comparison of a few different methods (picture)

58
Conclusions

Domain Experts with SC (self-confident) sampling outperform naive
Bayes with EM
The collocation property and vote entropy are
appropriate to such a domain
An accuracy of around 75% is the limit with the
techniques used
Better performance could be achieved by using
some natural language processing techniques
Such techniques are still fairly rudimentary today

59
iMatch

The vision of each MIT student
having a personal software agent,
which helps to manage its owner's academic life

60
Introduction

The aim: bring together MIT students and staff
who may usefully collaborate with each other
This collaboration can have several goals
completing final projects
studying for exams
tutoring one another
iMATCH agents are supposed to facilitate the
matching of students and faculty for
Research
Teaching
Internship
opportunities within and across campuses

61
iMatch Agent Architecture

iMatch agents are situated within an environment
Sensors of the agent convert environmental
inputs into representations that can be
manipulated within the agent
Effectors translate actions planned by the
agent into executable statements for the
environment
The action planner selects the action with the
highest utility according to the owner's
preference specification
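
The action planner's selection rule can be sketched in a few lines. This is an illustration only: the candidate actions, the numeric utilities and the function name are hypothetical, and the real system derives utilities from the owner's preference specification rather than from a fixed table.

```python
from typing import Callable, Iterable, TypeVar

Action = TypeVar("Action")


def plan_action(candidates: Iterable[Action],
                utility: Callable[[Action], float]) -> Action:
    """Return the candidate action with the highest utility."""
    return max(candidates, key=utility)


# Hypothetical utilities standing in for the owner's preference specification.
owner_utilities = {
    "join study group": 0.8,
    "email tutor": 0.5,
    "do nothing": 0.0,
}

best = plan_action(owner_utilities, lambda action: owner_utilities[action])
print(best)  # join study group
```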

62
Impacts and Benefits

MIT
Benefit MIT students by matching them to
appropriate resources
Aid the recruitment of student researchers
Help students manage their lives
Use iMATCH in Medical Computing
GLOBAL
Facilitate Cross Community Collaboration

63
Research Topics

Knowledge representation
preference specification
Multi-agent systems
reputation management system
static interest matching
dynamic interest matching
Infrastructure
distributed security infrastructure

64
Ceteris Paribus Preference

Ceteris paribus relations express a preference
over sets of possible outcomes
All possible outcomes are considered to be
describable by some (large) set of binary
features (true or false)
The specified features are instantiated to either
true or false
Other features are ignored

65
CPP Agent Configuration

Specify a domain for preference
Agent methods of communication and notification
Different security settings of different servers
Preference statements themselves
How to get users to easily adjust C.P. rules
(graphical interface)
Pose hypothetical preference questions to user to
help complete the preferences of an ambivalent
user
People will only put down their true profile, if
they know that the system is secure

66
Static Interest Matching

Group together similar users for a specific context
This enables viewing a human user as a
resource for dynamic resource discovery
(locate experts, enthusiasts, ...)
The approach
Keyword matching
Ontological matching using Kullback-Leibler (KL)
distance
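
As a hedged sketch of how a KL distance could rank users by similarity of interest: the code below assumes each user is summarised as a probability distribution over topics, which the slides do not specify. The symmetrised form and the smoothing constant `eps` are additional assumptions, introduced because plain KL divergence is asymmetric and undefined when the second distribution assigns zero probability.

```python
import math
from typing import Dict

Profile = Dict[str, float]  # topic -> probability (assumed representation)


def kl_distance(p: Profile, q: Profile, eps: float = 1e-9) -> float:
    """D(p || q) = sum over x of p(x) * log(p(x) / q(x)), q floored at eps."""
    total = 0.0
    for topic, p_x in p.items():
        if p_x > 0.0:
            total += p_x * math.log(p_x / max(q.get(topic, 0.0), eps))
    return total


def symmetric_kl(p: Profile, q: Profile) -> float:
    """Symmetrised variant, so the order of the two users does not matter."""
    return kl_distance(p, q) + kl_distance(q, p)


# Hypothetical interest profiles.
alice = {"agents": 0.6, "data mining": 0.3, "security": 0.1}
bob = {"agents": 0.5, "data mining": 0.4, "security": 0.1}
carol = {"agents": 0.1, "data mining": 0.1, "security": 0.8}

print(symmetric_kl(alice, bob) < symmetric_kl(alice, carol))  # True
```

Under these assumptions alice would be grouped with bob rather than carol for this context.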

67
Dynamic Interest Matching

Location and/or temporal specific resource
matching
As students and their agents move from one
physical location to another, iMatch services for
matching the closest resources can be offered
The idea: anything worthwhile is locatable
The approach
Intentional naming scheme
Reputation based resource discovery

68
Technology

Components
Distributed Multi-Agent Infrastructures
Ceteris Paribus preference-based Interest
Matching
Reputation Management Infrastructure
Technology
Microsoft.Net
Bluetooth
IEEE 802.11
Smartcards (PC/SC)
INS (Intentional Naming System)

69
Conclusion

Benefit MIT students by matching them to
appropriate resources
Static interest matching
Group together similar users for a specific context
This enables viewing a human user as a resource
for dynamic resource discovery (locate experts,
enthusiasts, ...)
Dynamic interest matching
Location and/or temporal specific resource
matching
As students and their agents move from
one physical location to another, iMatch
services for matching the closest resources can
be offered
Help students manage their lives

70
The near future

The focus of the research is on e-tourism after
the year 2005, but the applications of the
proposed infrastructure are multifold

71
Introduction

The assumptions
after the year 2005, each tourist in Europe will
be equipped with a cell phone as powerful as, or
more powerful than, a Pentium IV
whenever a tourism-based service or product is
purchased, a mobile agent is assigned to that
cell phone PC, to monitor the behaviour of the
customer
all tourist cell phone PCs create an ad hoc
network around the points of tourist
attraction, and link to a data mine that
collects all information of interest

72
How to accomplish it?

The information of interest is not collected by
asking the customer to fill out forms, but by
monitoring the behaviour of the customer
The collected information, sorted in the data
mine, is made available to other tourists, as an
on-line owner-independent source of information
about the given services and/or products

73
What can be done

If a tourist would like to know, at that very
moment, what restaurant has good food/atmosphere
and happy customers, he/she can access the data
mine (via the Internet) and obtain the
information that is linked to that very moment,
and is not created by the owner of the business,
but by the customers themselves
Accessing the given restaurant's website has two
drawbacks
the information is not fresh (it is only
periodically updated)
the information is provided by the owner of the
restaurant, and is therefore not completely objective

74
Conclusion

Consequently, the proposed approach works much
better, and represents a qualitative step
forward in the domain of maximization of
customer satisfaction
This may mean that the privacy of the person is
jeopardized; however, if the monitored behaviour
is non-personalized, and if the customer obtains
a discount in exchange for welcoming mobile
agents, privacy ceases to be an issue,
and people will sign up voluntarily

75
Appendix

A Survey of the Data Mining Algorithms

76
Apriori Algorithm

The task: mining association rules by finding
large itemsets and translating them to the
corresponding association rules
$A \Rightarrow B$, or $A_1 \wedge A_2 \wedge \dots \wedge A_m \Rightarrow B_1 \wedge B_2 \wedge \dots \wedge B_n$, where
$A \cap B = \emptyset$
The terminology
Confidence
Support
k-itemset: a set of k items
Large itemsets: the large itemset $\{A, B\}$
corresponds to the following rules
(implications): $A \Rightarrow B$ and $B \Rightarrow A$

77
Apriori Algorithm

The $\otimes$ operator: definition
$n = 1$: $S_2 = S_1 \otimes S_1 = \{A, B, C\} \otimes \{A, B, C\} = \{AB, AC, BC\}$
$n = k$: $S_{k+1} = S_k \otimes S_k = \{X \cup Y \mid X, Y \in S_k,\ |X \cap Y| = k-1\}$
X and Y must have the same number of elements,
and must have exactly k-1 identical elements
Every k-element subset of any resulting set
element (an element is actually a (k+1)-element
set) has to belong to the original set of
itemsets
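
The join defined above, together with the subset condition, can be written directly as a short function. This is a sketch under the definition given on this slide; the function name `join` and the use of frozensets of single-letter items are choices made for the example.

```python
from itertools import combinations
from typing import FrozenSet, Set

Itemset = FrozenSet[str]


def join(s_k: Set[Itemset]) -> Set[Itemset]:
    """Build S_(k+1) from S_k: unite pairs sharing k-1 items, then prune."""
    result: Set[Itemset] = set()
    for x, y in combinations(s_k, 2):
        k = len(x)
        if len(y) == k and len(x & y) == k - 1:
            candidate = x | y
            # Keep the candidate only if every k-element subset is in S_k.
            if all(frozenset(sub) in s_k
                   for sub in combinations(candidate, k)):
                result.add(candidate)
    return result


s1 = {frozenset("A"), frozenset("B"), frozenset("C")}
print(sorted("".join(sorted(s)) for s in join(s1)))  # ['AB', 'AC', 'BC']
```

For k = 1 the pruning step never removes anything, because every single item of a candidate pair is already in the input set.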

78
Apriori Algorithm

Example

79
Apriori Algorithm

Step 1: generate a candidate set of 1-itemsets,
$C_1$
Every possible 1-element set from the database is
potentially a large itemset, because we don't
know the number of its appearances in the
database in advance (a priori)
The task amounts to identifying (counting) all
the different elements in the database; every
such element forms a 1-element candidate set
$C_1 = \{A, B, C, D, E\}$
Now, we are going to scan the entire database, to
count the number of appearances for each one of
these elements (i.e. one-element sets)

80
Apriori Algorithm

Now, we are going to scan the entire database, to
count the number of appearances for each one of
these elements (i.e. one-element sets)

81
Apriori Algorithm

Step 2: generate a set of large 1-itemsets, $L_1$
Each element in $C_1$ whose support is at least the
adopted minimum support (for example, 50%) becomes
a member of $L_1$
$L_1 = \{A, B, C, E\}$, and we can omit D in
further steps (if D doesn't have enough support
alone, there is no way it could satisfy the
requested support in combination with some
other element(s))

82
Apriori Algorithm

Step 3: generate a candidate set of large
2-itemsets, $C_2$
$C_2 = L_1 \otimes L_1 = \{AB, AC, AE, BC, BE, CE\}$
Count the corresponding appearances
Step 4: generate a set of large 2-itemsets, $L_2$
Eliminate the candidates without minimum
support
$L_2 = \{AC, BC, BE, CE\}$

83
Apriori Algorithm

Step 5 ($C_3$)
$C_3 = L_2 \otimes L_2 = \{BCE\}$
Why not ABC and ACE? Because their 2-element
subsets AB and AE are not elements of the
set of large 2-itemsets $L_2$ (the calculation is made
according to the definition of the $\otimes$ operator)
Step 6 ($L_3$)
$L_3 = \{BCE\}$, since BCE satisfies the required
support of 50% (two appearances)
There can be no further steps in this particular
case, because $L_3 \otimes L_3 = \emptyset$
Answer: $L_1 \cup L_2 \cup L_3$

84
Apriori Algorithm

L1 = {large 1-itemsets};
for (k = 2; L(k-1) != {}; k++) do begin
    Ck = apriori-gen(L(k-1));
    forall transactions t in D do begin
        Ct = subset(Ck, t);
        forall candidates c in Ct do
            c.count++;
    end
    Lk = {c in Ck | c.count >= minsup}
end
Answer = union of all Lk;
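
The pseudocode above translates almost line for line into runnable code. The sketch below is one such translation, not the authors' implementation. The example database from the slides did not survive in this transcript, so the four transactions used here are an assumption: they are chosen because, with a minimum support of two transactions (50%), they reproduce the itemsets quoted in Steps 2 to 6.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set

Itemset = FrozenSet[str]


def apriori_gen(prev: Set[Itemset]) -> Set[Itemset]:
    """Join step plus pruning: candidates one item larger than those in prev."""
    out: Set[Itemset] = set()
    for x, y in combinations(prev, 2):
        cand = x | y
        if len(cand) == len(x) + 1 and all(
            frozenset(sub) in prev for sub in combinations(cand, len(x))
        ):
            out.add(cand)
    return out


def apriori(db: List[Itemset], minsup: int) -> Dict[int, Set[Itemset]]:
    """Return {k: L_k}; minsup is an absolute number of transactions."""
    counts: Dict[Itemset, int] = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {1: {c for c, n in counts.items() if n >= minsup}}
    k = 2
    while large[k - 1]:
        counts = {c: 0 for c in apriori_gen(large[k - 1])}
        for t in db:                      # one database scan per level
            for c in counts:
                if c <= t:
                    counts[c] += 1
        large[k] = {c for c, n in counts.items() if n >= minsup}
        k += 1
    del large[k - 1]                      # the last level is always empty
    return large


# Assumed example database (not recoverable from the slides).
DB = [frozenset("ACD"), frozenset("BCE"), frozenset("ABCE"), frozenset("BE")]
result = apriori(DB, minsup=2)
for size, itemsets in result.items():
    print(size, sorted("".join(sorted(s)) for s in itemsets))
```

The answer is the union of the returned levels, matching the last line of the pseudocode.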

85
Apriori Algorithm

Enhancements to the basic algorithm
Scan-reduction
The most time-consuming operation in the Apriori
algorithm is the database scan; it is originally
performed after each candidate set generation, to
determine the frequency of each candidate in the
database
Scan number reduction: counting candidates of
multiple sizes in one pass
Rather than counting only candidates of size k in
the kth pass, we can also calculate the
candidates $C'_{k+1}$, where $C'_{k+1}$ is generated from
$C_k$ (instead of $L_k$), using the $\otimes$ operator

86
Apriori Algorithm

Compare: $C'_{k+1} = C_k \otimes C_k$ and $C_{k+1} = L_k \otimes L_k$
Note that $C'_{k+1} \supseteq C_{k+1}$
This variation can pay off in later passes, when
the cost of counting and keeping in memory the
additional $|C'_{k+1}| - |C_{k+1}|$ candidates becomes less
than the cost of scanning the database
There has to be enough space in main memory for
both $C_k$ and $C'_{k+1}$
Following this idea, we can reduce the number of
scans further
$C'_{k+1}$ is calculated from $C'_k$ for $k > 1$ (with $C'_2 = C_2$)
There must be enough memory space for all $C'_k$ ($k > 1$)
Consequently, only two database scans need to be
performed (the first to determine $L_1$, and the
second to determine all the other $L_k$)

87
Apriori Algorithm

Abstraction levels
Higher level associations are stronger (more
powerful), but also less certain
A good practice would be adopting different
thresholds for different abstraction levels
(higher thresholds for higher levels of
abstraction)

88
DHP Algorithm

DHP (Direct Hashing and Pruning): another
algorithm for mining association rules
Based on the Apriori algorithm ($C_k$/$L_k$ generation
in the kth step)
Empirical analysis of the Apriori algorithm shows
that candidate sets ($C_k$) are much larger than the
corresponding sets of large itemsets ($L_k$),
especially in the first few iterations
DHP introduces a more efficient candidate set
generation method
The idea is to insert into $C_k$ only those
candidate sets that are likely to become large
itemsets

89
DHP Algorithm

Additional improvement is accomplished through
two-dimensional search base reduction:
length (number of records in the search base)
and width (number of relevant attributes in a
record)
Characteristics of large itemsets
Every non-empty subset of a large itemset is a
large itemset as well, for example, $BCD \in L_3 \Rightarrow BC, CD, BD \in L_2$
It implies that a record is relevant for
discovering large (k+1)-itemsets only if it
contains at least k+1 large k-itemsets

90
DHP Algorithm

During the $C_k \rightarrow L_k$ phase we can count the large
k-itemsets in each record; if their number in a
particular record is less than k+1, we omit that
record during the $C_{k+1}$ generation
Similarly, if a record contains one or more large
(k+1)-itemsets, each element (item) of these
itemsets appears in at least k candidates from
$C_k$
Hashing
Hashing boosts the performance of the DHP
algorithm
The algorithm does not specify any particular hash
function; it depends on the application
Likewise, it does not specify the size of the
hash table (number of groups/addresses)

91
DHP Algorithm

Application example

92
DHP Algorithm

Step 1: generate a candidate set of 1-itemsets,
$C_1$
$C_1 = \{A, B, C, D, E\}$
Simultaneously with counting each element's
support, a hash tree is generated that contains
all the elements from the database, in order to
improve the counting performance
For each new element, DHP checks whether the
element is already in the tree or not
If yes, DHP increments the current number of
appearances for that element; otherwise, the
element is added to the hash tree, and the number
of its appearances is set to 1

93
DHP Algorithm

Having counted the appearances of each $C_1$ element, all
possible 2-element subsets are generated and
inserted into the $H_2$ hash table
The address of a particular subset can be
calculated with respect to the position of its
elements in the $C_1$ candidate set, using a chosen hash
function $h(x, y)$

94
DHP Algorithm

For example, let's adopt the following hash
function: $h(x, y) = (10 \cdot pos_{C_1}(x) + pos_{C_1}(y)) \bmod 7$
The corresponding $H_2$ hash table is shown below

95
DHP Algorithm

Whenever a new element is added to the hash
table, the weight of the particular address is
increased by one
$C_2$ is generated out of $L_1$ (just like in the Apriori
case)
Besides that, only those elements that map to
addresses whose weight is greater than or equal to the
specified minimum support (let the minimum
support be 50%) will be taken into consideration
during the $C_2$ generation
$C_2 = \{AC, BC, BE, CE\}$
It contains two elements fewer (!) than the $C_2$ set
generated by the Apriori algorithm for the same
example database
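
The hash-based filtering of the candidate set can be traced in code. The sketch below uses a hash function of the form adopted in this example, 10 times the position of x plus the position of y, modulo 7, with positions taken from the order A, B, C, D, E. The four-transaction database is an assumption, since the slide's table is not available in this transcript; it is chosen because it yields the candidate set quoted above. The function names are hypothetical.

```python
from itertools import combinations
from typing import Iterable, List

ORDER = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}  # position of each item in C1
BUCKETS = 7


def h(x: str, y: str) -> int:
    """Hash address of the 2-itemset {x, y}, with x before y in C1."""
    return (ORDER[x] * 10 + ORDER[y]) % BUCKETS


def build_h2(db: Iterable[str]) -> List[int]:
    """Weight of every H2 address after hashing all 2-subsets of all records."""
    weights = [0] * BUCKETS
    for record in db:
        for x, y in combinations(sorted(record, key=ORDER.get), 2):
            weights[h(x, y)] += 1
    return weights


def dhp_c2(l1: Iterable[str], weights: List[int], minsup: int) -> List[str]:
    """Pairs from L1 whose H2 address weight reaches the minimum support."""
    items = sorted(l1, key=ORDER.get)
    return [x + y for x, y in combinations(items, 2)
            if weights[h(x, y)] >= minsup]


# Assumed example database (not recoverable from the slides).
DB = ["ACD", "BCE", "ABCE", "BE"]
h2 = build_h2(DB)
print(h2)                                # weight per address 0..6
print(dhp_c2("ABCE", h2, minsup=2))      # ['AC', 'BC', 'BE', 'CE']
```

In this run AB and AE map to addresses with weight 1, so they never enter the candidate set, whereas plain Apriori would have counted them in the next scan.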

96
DHP Algorithm

In general, the $H_k$ hash table is used for the $C_k$
candidate set generation in the kth step of the
algorithm; $H_k$ is created in the previous, (k-1)th,
step
Each address of the $H_k$ hash table contains a
number of k-element subsets as elements; its
weight denotes the number of elements
The fact that an address doesn't satisfy the minimum
support requirement means that none of the elements
(sets) mapped to the address can satisfy
the requirement alone, so all the elements (sets)
at such $H_k$ addresses are omitted from the $C_k$
generation
During the kth step, $C_k$ is generated starting
from $L_{k-1}$, with the restrictions described above

97
DHP Algorithm

Conclusions
DHP outperforms Apriori for the same input data
The time spent on hash table generation
(especially $H_2$) is outweighed by the greatly reduced
candidate sets ($C_2$, ...)
The same improvements applied to Apriori may as
well be applied here (scan reduction, abstraction
levels, ...)

98
References

http://www.marconi.com
http://www.blueyed.com
http://www.fipa.org
http://www.rpi.edu
http://research.microsoft.com
http://imatch.lcs.mit.edu

99
THE END

Quatenus nobis denegatum diu vivere, relinquamus
aliquid, quo nos vixisse testemur

