Why Not Store Everything in Main Memory? Why use disks? — PowerPoint PPT presentation

Description: Entity Tables, Relationship Tables. We classify using any table (as the training table) on any of its columns, the class-label column.

Transcript and Presenter's Notes

Title: Why Not Store Everything in Main Memory? Why use disks?

1
Entity Tables, Relationship Tables
We classify using any table (as the training table) on any of its columns, the class-label column.
Medical Expert System: Using the entity training table Patients(PID, Symptom1, Symptom2, Symptom3, Disease) and class label Disease, we can classify new patients based on their symptoms using Nearest Neighbor or model-based (Decision Tree, Neural Network, SVM, SVD, etc.) classification.
Netflix Contest: Using the relationship training table Rents(UID, MID, Rating, Date) and class label Rating, classify new (UID, MID, Date) tuples. How (since there are really no feature columns)?
We can let near-neighbor users vote. What makes a user, uid, "near" to UID? That uid rates similarly to UID on movies mid1, ..., midk. So we use correlation for "near" instead of distance. This is typical when classifying on a relationship table.
We can let near-neighbor movies vote. What makes a movie, mid, "near" to MID? That mid is rated similarly to MID by users uid1, ..., uidk. Here again we use correlation for "near" instead of an actual distance.
We cluster on a table also, but usually just to "prepare" a classification training table (to move up a semantic hierarchy, e.g., in Items, cluster on PriceRanges in 100-dollar intervals, or to determine the classes in the first place, with each cluster then declared a separate class). We do Association Rule Mining (ARM) on relationships (e.g., Market Basket Research does ARM on the "buys" relationship). (A minimal nearest-neighbor sketch on such a training table follows below.)
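As a minimal illustrative sketch (not part of the original slides): plain k-nearest-neighbor voting on a tiny, made-up Patients table with binary symptom columns. The row values and the choice of Hamming distance as the notion of "near" are assumptions for the example.

```python
from collections import Counter

# Hypothetical training table: (PID, Symptom1, Symptom2, Symptom3, Disease)
patients = [
    ("p1", 1, 0, 1, "flu"),
    ("p2", 1, 1, 1, "flu"),
    ("p3", 0, 1, 0, "cold"),
    ("p4", 0, 0, 1, "cold"),
]

def hamming(u, v):
    """Number of feature positions on which two tuples disagree."""
    return sum(a != b for a, b in zip(u, v))

def knn_classify(sample, table, k=3):
    """Let the k training tuples nearest to the sample vote on the class label."""
    ranked = sorted(table, key=lambda row: hamming(sample, row[1:-1]))
    votes = Counter(row[-1] for row in ranked[:k])
    return votes.most_common(1)[0][0]

print(knn_classify((1, 0, 1), patients))   # -> 'flu'
```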
2
WALK THRU a kNN classification example: 3NN classification of an unclassified sample
a = (a5, a6, a11, a12, a13, a14) = (0, 0, 0, 0, 0, 0).

[Slide table: the first-scan 3NN candidates t12, t13, t53 and t15 with their values on a5, a6, a11, a12, a13, a14 and their distances from a (t13 and t53 at distance 1; t12 and t15 at distance 2).]
Key  a1 a2 a3 a4 a5 a6 a7 a8 a9 a10(C) a11 a12 a13 a14 a15 a16 a17 a18 a19 a20
t12   1  0  1  0  0  0  1  1  0    1     0   1   1   0   1   1   0   0   0   1
t13   1  0  1  0  0  0  1  1  0    1     0   1   0   0   1   0   0   0   1   1
t15   1  0  1  0  0  0  1  1  0    1     0   1   0   1   0   0   1   1   0   0
t16   1  0  1  0  0  0  1  1  0    1     1   0   1   0   1   0   0   0   1   0
t21   0  1  1  0  1  1  0  0  0    1     1   0   1   0   0   0   1   1   0   1
t27   0  1  1  0  1  1  0  0  0    1     0   0   1   1   0   0   1   1   0   0
t31   0  1  0  0  1  0  0  0  1    1     1   0   1   0   0   0   1   1   0   1
t32   0  1  0  0  1  0  0  0  1    1     0   1   1   0   1   1   0   0   0   1
t33   0  1  0  0  1  0  0  0  1    1     0   1   0   0   1   0   0   0   1   1
t35   0  1  0  0  1  0  0  0  1    1     0   1   0   1   0   0   1   1   0   0
t51   0  1  0  1  0  0  1  1  0    0     1   0   1   0   0   0   1   1   0   1
t53   0  1  0  1  0  0  1  1  0    0     0   1   0   0   1   0   0   0   1   1
t55   0  1  0  1  0  0  1  1  0    0     0   1   0   1   0   0   1   1   0   0
t57   0  1  0  1  0  0  1  1  0    0     0   0   1   1   0   0   1   1   0   0
t61   1  0  1  0  1  0  0  0  1    0     1   0   1   0   0   0   1   1   0   1
t72   0  0  1  1  0  0  1  1  0    0     0   1   1   0   1   1   0   0   0   1
t75   0  0  1  1  0  0  1  1  0    0     0   1   0   1   0   0   1   1   0   0
3
WALK THRU of the required 2nd scan to find the Closed 3NN set. Does it change the vote?
YES! C=0 wins now!
(The slide also shows the vote tally after the 1st scan.)
(The training table from the previous slide is repeated here for the 2nd-scan walk-through.)
4
WALK THRU C3NN using P-trees.
First let all training points at distance 0 vote, then distance 1, then distance 2, ... until ≥ 3 votes have been collected. For distance 0 (exact matches), construct the P-tree Ps, then AND it with PC and PC' (the class column and its complement) to compute the vote.
(Black denotes a complemented column, red an uncomplemented one. A bit-column sketch of this AND step appears after the slide graphic below.)
[Slide graphic: the bit columns (P-trees) for the attributes and the class label C over the 17 training tuples t12, t13, t15, t16, t21, t27, t31, t32, t33, t35, t51, t53, t55, t57, t61, t72, t75, shown in complemented (black) and uncomplemented (red) form. The resulting exact-match P-tree is Ps = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0: no neighbors at distance 0.]
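As a rough illustration of this distance-0 step (plain Python bit lists standing in for actual P-trees, which would be compressed tree structures), the six relevant bit columns below are transcribed from the training table above; ANDing their complements reproduces the slide's result that no training tuple matches the sample exactly.

```python
# The six relevant bit columns over the 17 training tuples t12..t75,
# transcribed from the training table above (same key order).
keys = ["t12","t13","t15","t16","t21","t27","t31","t32","t33","t35",
        "t51","t53","t55","t57","t61","t72","t75"]
cols = {
    "a5":  [0,0,0,0,1,1,1,1,1,1,0,0,0,0,1,0,0],
    "a6":  [0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0],
    "a11": [0,0,0,1,1,0,1,0,0,0,1,0,0,0,1,0,0],
    "a12": [1,1,1,0,0,0,0,1,1,1,0,1,1,0,0,1,1],
    "a13": [1,0,0,1,1,1,1,1,0,0,1,0,0,1,1,1,0],
    "a14": [0,0,1,0,0,1,0,0,0,1,0,0,1,1,0,0,1],
}

def AND(*columns):
    """Bitwise AND of several bit columns."""
    return [int(all(bits)) for bits in zip(*columns)]

def NOT(column):
    """Bitwise complement of a bit column."""
    return [1 - b for b in column]

# The sample is 0 on all six attributes, so an exact (distance-0) match must
# be 0 on all six columns: AND together the complemented columns.
P_s = AND(*[NOT(c) for c in cols.values()])
print(sum(P_s))   # 0 -> no tuple matches the sample exactly (no distance-0 neighbors)
```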
5
WALK THRU C3NNC distance = 1 nbrs.
Construct the P-tree of the distance-1 neighborhood, P_S(s,1), as an OR of P-trees P_i, where P_i selects the tuples with |s_i − t_i| = 1 and s_j = t_j for every other relevant j:

P_S(s,1) = OR over i in {5,6,11,12,13,14} of [ P(s_i,1) AND ( AND over j in {5,6,11,12,13,14}, j ≠ i, of P(s_j,0) ) ]

(A bit-column sketch of this OR-of-ANDs step follows after the slide graphic below.)
[Slide graphic: the attribute and class bit columns over the 17 training tuples t12–t75. The resulting distance-1 mask is PD(s,1) = 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0, i.e., the two distance-1 neighbors are t13 and t53.]
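Continuing the bit-column sketch from the distance-0 step (it reuses keys, cols, AND and NOT defined there), the OR-of-ANDs formula above can be evaluated directly; it reproduces the slide's result that t13 and t53 are the two distance-1 neighbors.

```python
def OR(*columns):
    """Bitwise OR of several bit columns."""
    return [int(any(bits)) for bits in zip(*columns)]

# P_S(s,1): tuples differing from the sample on exactly one of the six
# attributes -- OR over i of (column_i AND the complements of all the others).
terms = []
for i in cols:
    others = [NOT(cols[j]) for j in cols if j != i]
    terms.append(AND(cols[i], *others))

P_d1 = OR(*terms)
print([k for k, bit in zip(keys, P_d1) if bit])   # -> ['t13', 't53']
```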
6
WALK THRU C3NNC distance = 2 nbrs.
We now have 3 nearest nbrs. We could quit and declare C=1 the winner?
But once all the distance-2 neighbors are included we have the closed 3NN set, and we can declare C=0 the winner!
The pairwise masks P_{i,j} used for the distance-2 step:
P5,6  P5,11  P5,12  P5,13  P5,14  P6,11  P6,12  P6,13  P6,14  P11,12  P11,13  P11,14  P12,13  P12,14  P13,14

[Slide graphic: the bit columns for C, a5, a6, a11, a12, a13, a14 over the 17 training tuples t12–t75, from which the pairwise masks above are computed.]
7
In this example, there were no exact matches (distance=0, or similarity=6, neighbors) for the sample. There were two neighbors found at a distance of 1 (dis=1 or sim=5) and nine dis=2, sim=4 neighbors. All 11 neighbors got an equal vote even though the two sim=5 neighbors are much closer than the nine sim=4 neighbors. Also, processing for the 9 is costly. A better approach would be to weight each vote by the similarity of the voter to the sample. (We will use a vote-weight function which is linear in the similarity; admittedly, a better choice would be a function which is Gaussian in the similarity, but, so far, it has been too hard to compute.) As long as we are weighting votes by similarity, we might as well also weight attributes by relevance (assuming some attributes are more relevant than others; e.g., the relevance weight of a feature attribute could be the correlation of that attribute to the class label). P-trees accommodate this method very well (in fact, a variation on this theme won the KDD-Cup competition in '02, http://www.biostat.wisc.edu/craven/kddcup/, and is published in the so-called Podium Classification methods). Notice, though, that the P-tree method (Horizontal Processing of Vertical Data, or HPVD) really relies on a bona fide distance, in this case Hamming distance or L1. However, the Vertical Processing of Horizontal Data (VPHD) doesn't. VPHD therefore works even when we use a correlation as the notion of "near". If you have been involved in the DataSURG Netflix Prize efforts, you will note that we used P-tree technology to calculate correlations but not to find near neighbors through correlations. (A sketch of similarity-weighted voting follows below.)
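A minimal sketch of the similarity-weighted voting idea described above (a generic illustration, not the Podium Classification code): each neighbor votes with a weight linear in its similarity (number of features minus Hamming distance); attribute-relevance weights are omitted for brevity, and the table rows in the example are made up.

```python
from collections import defaultdict

def weighted_vote(sample, table, max_dist=2, n_features=6):
    """Each training tuple within max_dist of the sample votes for its class
    with a weight linear in its similarity (n_features - Hamming distance)."""
    tallies = defaultdict(float)
    for features, label in table:
        d = sum(a != b for a, b in zip(sample, features))
        if d <= max_dist:
            tallies[label] += n_features - d   # linear vote weight
    return max(tallies, key=tallies.get)

# e.g. two sim=5 voters for class 1 outweigh two sim=4 voters for class 0 (10 > 8)
table = [((0,0,0,1,0,0), 1), ((0,0,1,0,0,0), 1),
         ((1,1,0,0,0,0), 0), ((0,0,0,0,1,1), 0)]
print(weighted_vote((0,0,0,0,0,0), table))   # -> 1
```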
8
Netflix
Again, Rents is also a table (Rents(TID, UID, MID, Rating, Date)); using the class-label column, Rating, we can classify potential new ratings. Using these predicted ratings, we can recommend Rating=5 rentals.
Does ARM require a binary relationship? YES! But since every table gives a binary relationship between any two of its columns, we can do ARM on them. E.g., we could do ARM on Patients(Symptom1, Disease) to try to determine which diseases Symptom1 alone implies (with high confidence).
AND in some cases, the columns of a table are really instances of another entity. E.g., images, such as Landsat satellite images, LS(R, G, B, NIR, MIR, TIR), with wavelength intervals in micrometers: B = (.45, .52), G = (.52, .60), R = (.63, .69), NIR = (.76, .90), MIR = (2.08, 2.35), TIR = (10.4, 12.5). Known recording instruments (which record the number of photons detected in a small interval of time in a given wavelength range) seem to be very limited in capability (e.g., our eyes, CCD cameras...). If we could record all consecutive bands (e.g., of radius .025 µm) then we would have a relationship between pixels (given by latitude-longitude) and wavelengths. Then we could do ARM on imagery!!!!
9
Intro to Association Rule Mining (ARM)
  • Given any relationship between entities,
  • T (e.g., a set of Transactions an enterprise performs) and
  • I (e.g., a set of Items which are acted upon by those transactions).
  • E.g., in Market Basket Research (MBR) the transactions, T, are the checkout transactions (a customer going thru checkout) and the items, I, are the items available for purchase in that store.
  • The itemset, T(I), associated with (or related to) a particular transaction, T, is the subset of the items found in the shopping cart or market basket that the customer is bringing through checkout at that time.
  • An Association Rule, A→C, associates two disjoint subsets of I (called itemsets). (A is called the antecedent, C is called the consequent.)

The support set of itemset A, supp(A), is the set of t's that are related to every a ∈ A; e.g., if A = {i1,i2} and C = {i4} then supp(A) = {t2,t4} (support ratio = |{t2,t4}| / |{t1,t2,t3,t4,t5}| = 2/5). Note | | means set size, or count of elements in the set.
The support ratio of rule A→C, supp(A→C), is the support of A∪C: |{t2,t4}| / |{t1,t2,t3,t4,t5}| = 2/5.
The confidence of rule A→C, conf(A→C), is supp(A→C) / supp(A) = (2/5) / (2/5) = 1.
Data miners typically want to find all STRONG RULES, A→C, with supp(A→C) ≥ minsupp and conf(A→C) ≥ minconf (minsupp and minconf are threshold levels). Note that conf(A→C) is also just the conditional probability of t being related to C, given that t is related to A. (A short sketch of these computations follows below.)
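The support-set, support-ratio and confidence definitions above, in a short sketch. The five-transaction relation is hypothetical, chosen only so that the slide's numbers (supp(A) = {t2,t4}, support ratio 2/5, confidence 1) come out.

```python
# Hypothetical transaction -> itemset relation matching the slide's numbers.
baskets = {
    "t1": {"i3"},
    "t2": {"i1", "i2", "i4"},
    "t3": {"i2", "i5"},
    "t4": {"i1", "i2", "i4"},
    "t5": {"i5"},
}

def supp_set(itemset):
    """Support set: transactions whose basket contains every item in itemset."""
    return {t for t, items in baskets.items() if itemset <= items}

def supp(itemset):
    """Support ratio of an itemset."""
    return len(supp_set(itemset)) / len(baskets)

def conf(antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    return supp(antecedent | consequent) / supp(antecedent)

A, C = {"i1", "i2"}, {"i4"}
print(sorted(supp_set(A)))   # ['t2', 't4']
print(supp(A | C))           # 0.4  (= 2/5)
print(conf(A, C))            # 1.0
```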
10
APRIORI Association Rule Mining: Given a Transaction-Item Relationship, the APRIORI algorithm for finding all strong I-rules can be done by
1. vertical processing of a Horizontal Transaction Table (HTT), or
2. horizontal processing of a Vertical Transaction Table (VTT).
In 1., a Horizontal Transaction Table (HTT) is processed through vertical scans to find all frequent I-sets (I-sets with support ≥ minsupp, e.g., I-sets "frequently" found in transaction market baskets).
In 2., a Vertical Transaction Table (VTT) is processed thru horizontal operations to find all frequent I-sets.
Then each frequent I-set found is analyzed to determine if it is the support set of a strong rule.
Finding all frequent I-sets is the hard part. To do this efficiently, the APRIORI algorithm takes advantage of the "downward closure" property of frequent I-sets: if an I-set is frequent, then all its subsets are also frequent. E.g., in the Market Basket example, if A is an I-subset of B and all of B is in a given transaction's basket, then certainly all of A is in that basket too. Therefore Supp(A) ⊇ Supp(B) whenever A ⊆ B.
First, APRIORI scans to determine all frequent 1-item I-sets (they contain 1 item and are therefore called 1-Itemsets);
next, APRIORI uses downward closure to efficiently find candidates for frequent 2-Itemsets;
next, APRIORI scans to determine which of those candidate 2-Itemsets are actually frequent;
next, APRIORI uses downward closure to efficiently find candidates for frequent 3-Itemsets;
next, APRIORI scans to determine which of those candidate 3-Itemsets are actually frequent; ...
until there are no candidates remaining. (On the next slide we walk through an example using both an HTT and a VTT; a code sketch of the candidate-generation step follows below.)
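A minimal sketch of the downward-closure (join-and-prune) candidate-generation step just described; a generic illustration, not the presentation's code. The example call uses the frequent 2-itemsets from the walk-through on the following slides.

```python
from itertools import combinations

def candidates(freq_k):
    """Join frequent k-itemsets into (k+1)-candidates, pruning by downward closure."""
    freq_k = set(map(frozenset, freq_k))
    k = len(next(iter(freq_k)))
    joined = {a | b for a in freq_k for b in freq_k if len(a | b) == k + 1}
    # drop any candidate that has an infrequent k-subset
    return [c for c in joined
            if all(frozenset(s) in freq_k for s in combinations(c, k))]

# From the frequent 2-itemsets {1,3},{2,3},{2,5},{3,5} the only surviving
# 3-candidate is {2,3,5}; {1,2,3} and {1,3,5} are pruned ({1,2}, {1,5} not frequent).
print(candidates([{1, 3}, {2, 3}, {2, 5}, {3, 5}]))   # [frozenset({2, 3, 5})]
```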
11
ARM: The relationship between Transactions and Items can be expressed in a
  • Horizontal Transaction Table (HTT), or
  • a Vertical Transaction Table (VTT)
TID 1 2 3 4 5
100 1 0 1 1 0
200 0 1 1 0 1
300 1 1 1 0 1
400 0 1 0 0 1
minsupp is set by the querier at 1/2 and minconf at 3/4 (note: minsupp and minconf can be expressed as counts rather than as ratios; if so, since there are 4 transactions, then as counts minsupp = 2 and minconf = 3).
(Downward closure property of "frequent"): any subset of a frequent itemset is frequent.
APRIORI METHOD: Iteratively find the frequent k-itemsets, k = 1, 2, ... Then find all strong association rules supported by each frequent itemset. (Ck will denote the candidate k-itemsets generated at each step; Fk will denote the frequent k-itemsets. A sketch of this level-wise loop, run on the table above, follows below.)
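A sketch of the whole level-wise APRIORI loop run on the HTT above (minsupp as a count = 2); an illustrative implementation, not the presentation's code, and it scans the baskets directly rather than using P-trees.

```python
from itertools import combinations

def apriori(baskets, minsupp_count):
    """Level-wise frequent-itemset mining with downward-closure pruning.
    baskets maps a transaction id to its set of items; returns itemset -> support count."""
    def count(itemset):
        return sum(1 for items in baskets.values() if itemset <= items)

    frequent = {}
    level = [frozenset([i]) for i in sorted({i for items in baskets.values() for i in items})]
    while level:
        freq_level = [c for c in level if count(c) >= minsupp_count]   # scan step
        frequent.update({c: count(c) for c in freq_level})
        freq_set = set(freq_level)
        joined = {a | b for a in freq_level for b in freq_level if len(a | b) == len(a) + 1}
        # prune any candidate that has an infrequent k-subset (downward closure)
        level = [c for c in joined
                 if all(frozenset(s) in freq_set for s in combinations(c, len(c) - 1))]
    return frequent

# The HTT above: TIDs 100-400 over items 1-5; minsupp as a count is 2.
baskets = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}
result = apriori(baskets, 2)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
# [1] 2, [2] 3, [3] 3, [5] 3, [1,3] 2, [2,3] 2, [2,5] 3, [3,5] 2, [2,3,5] 2
```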
12
ARM
HTT


{1,2,3} is pruned since {1,2} is not frequent; {1,3,5} is pruned since {1,5} is not frequent.
Example: ARM using uncompressed P-trees (note: I have placed the 1-count at the root of each P-tree; a support-counting sketch follows the table below).

TID 1 2 3 4 5
100 1 0 1 1 0
200 0 1 1 0 1
300 1 1 1 0 1
400 0 1 0 0 1
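The slide computes itemset supports by ANDing item columns and reading the 1-count at the root of the resulting P-tree. A minimal sketch of that idea on the table above, with each item column kept as a plain bit list over the four transactions (no actual P-tree compression):

```python
# Vertical item columns for TIDs 100, 200, 300, 400 (from the table above).
item = {
    1: [1, 0, 1, 0],
    2: [0, 1, 1, 1],
    3: [1, 1, 1, 0],
    4: [1, 0, 0, 0],
    5: [0, 1, 1, 1],
}

def support_count(itemset):
    """AND the item columns together and return the root 1-count."""
    bits = [1, 1, 1, 1]
    for i in itemset:
        bits = [a & b for a, b in zip(bits, item[i])]
    return sum(bits)

print(support_count({1, 3}))      # 2  (frequent at minsupp = 2)
print(support_count({2, 3, 5}))   # 2  (frequent)
print(support_count({1, 5}))      # 1  (infrequent, so {1,3,5} is pruned)
```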

13
ARM (L1, L2, L3 denote the frequent 1-, 2- and 3-itemset lists from the example above)
1-ItemSets don't support association rules (they would have no antecedent or no consequent). 2-ItemSets do support ARs.
Are there any strong rules supported by Frequent (Large) 2-ItemSets (at minconf = .75)?
{1,3}: conf(1→3) = supp{1,3}/supp{1} = 2/2 = 1 ≥ .75  STRONG
       conf(3→1) = supp{1,3}/supp{3} = 2/3 = .67 < .75
{2,3}: conf(2→3) = supp{2,3}/supp{2} = 2/3 = .67 < .75
       conf(3→2) = supp{2,3}/supp{3} = 2/3 = .67 < .75
{2,5}: conf(2→5) = supp{2,5}/supp{2} = 3/3 = 1 ≥ .75  STRONG!
       conf(5→2) = supp{2,5}/supp{5} = 3/3 = 1 ≥ .75  STRONG!
{3,5}: conf(3→5) = supp{3,5}/supp{3} = 2/3 = .67 < .75
       conf(5→3) = supp{3,5}/supp{5} = 2/3 = .67 < .75
Are there any strong rules supported by Frequent (Large) 3-ItemSets?
{2,3,5}: conf(2,3→5) = supp{2,3,5}/supp{2,3} = 2/2 = 1 ≥ .75  STRONG!
         conf(2,5→3) = supp{2,3,5}/supp{2,5} = 2/3 = .67 < .75
No subset antecedent can yield a strong rule either (i.e., no need to check conf(2→3,5) or conf(5→2,3), since both denominators will be at least as large and therefore both confidences will be at least as low).
         conf(3,5→2) = supp{2,3,5}/supp{3,5} = 2/3 = .67 < .75
No need to check conf(3→2,5) or conf(5→2,3).
DONE! (A sketch of this rule check follows below.)
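A small sketch of the strong-rule check carried out above: for each frequent itemset, every proper non-empty subset is tried as an antecedent and kept if its confidence meets minconf. The support counts are the ones found in the example; the function itself is an illustrative assumption, not the presentation's code.

```python
from itertools import combinations

# Support counts found above (4 transactions, minsupp = 2, minconf = 0.75).
supp = {
    frozenset({1}): 2, frozenset({2}): 3, frozenset({3}): 3, frozenset({5}): 3,
    frozenset({1, 3}): 2, frozenset({2, 3}): 2, frozenset({2, 5}): 3,
    frozenset({3, 5}): 2, frozenset({2, 3, 5}): 2,
}

def strong_rules(supports, minconf):
    """Antecedent -> consequent pairs from each frequent itemset with conf >= minconf."""
    rules = []
    for itemset in supports:
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for ante in map(frozenset, combinations(itemset, r)):
                conf = supports[itemset] / supports[ante]
                if conf >= minconf:
                    rules.append((set(ante), set(itemset - ante), round(conf, 2)))
    return rules

for a, c, conf in strong_rules(supp, 0.75):
    print(f"{a} -> {c}  conf = {conf}")
# prints the four strong rules found above:
# {1} -> {3}, {2} -> {5}, {5} -> {2}, {2, 3} -> {5}   (conf = 1.0 each)
```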
14

Collaborative Filtering is the prediction of likes and dislikes (retail or rental) from the history of previously expressed purchase or rental satisfactions (filtering new likes thru the historical filter of collaborator likes).
E.g., the $1,000,000 Netflix Contest was to develop a ratings-prediction program that can beat the one Netflix currently uses (called Cinematch) by 10% in predicting what rating users gave to movies. I.e., predict rating(M,U) where (M,U) ∈ QUALIFYING(MovieID, UserID). Netflix uses Cinematch to decide which movies a user will probably like next (based on all past rating history). All ratings are "5-star" ratings (5 is highest, 1 is lowest; caution: 0 means did not rate). Unfortunately, rating = 0 does not mean that the user "disliked" that movie, but that it wasn't rated at all. Most ratings are 0. Therefore, the ratings data sets are NOT vector spaces!
One can approach the Netflix contest problem as a data mining classification/prediction problem. A history of ratings given by users to movies, TRAINING(MovieID, UserID, Rating, Date), is provided, with which to train your predictor, which will predict the ratings given to QUALIFYING movie-user pairs (Netflix knows the rating given to QUALIFYING pairs, but we don't). Since TRAINING is very large, Netflix also provides a smaller but representative subset of TRAINING, PROBE(MovieID, UserID) (2 orders of magnitude smaller than TRAINING). Netflix gave 5 years to submit QUALIFYING predictions. That contest was won in the late summer of 2009, when the submission window was about 1/2 gone.
The Netflix Contest Problem is an example of the Collaborative Filtering Problem, which is ubiquitous in the retail business world (how do you filter out what a customer will want to buy or rent next, based on similar customers?). (A minimal correlation-based prediction sketch follows below.)
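A minimal collaborative-filtering sketch in the spirit described above (not Cinematch and not the DataSURG code): predict a user's rating of a movie as a correlation-weighted average of the ratings given to that movie by positively correlated users. The ratings dictionary and the Pearson-over-co-rated-movies choice are assumptions for illustration.

```python
from math import sqrt

# Hypothetical ratings: user -> {movie: rating}; absent means "not rated".
ratings = {
    "u1": {"m1": 5, "m2": 1, "m4": 4},
    "u2": {"m1": 5, "m2": 2, "m3": 5, "m4": 4},
    "u3": {"m1": 1, "m2": 5, "m3": 2, "m4": 2},
}

def pearson(a, b):
    """Correlation of two users over the movies they have both rated."""
    common = set(a) & set(b)
    if len(common) < 2:
        return 0.0
    xa, xb = [a[m] for m in common], [b[m] for m in common]
    ma, mb = sum(xa) / len(xa), sum(xb) / len(xb)
    num = sum((x - ma) * (y - mb) for x, y in zip(xa, xb))
    den = sqrt(sum((x - ma) ** 2 for x in xa) * sum((y - mb) ** 2 for y in xb))
    return num / den if den else 0.0

def predict(user, movie, min_corr=0.0):
    """Weighted average of ratings of `movie` by positively correlated users."""
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or movie not in theirs:
            continue
        w = pearson(ratings[user], theirs)
        if w > min_corr:
            num += w * theirs[movie]
            den += w
    return num / den if den else None

print(predict("u1", "m3"))   # -> 5.0 (only the positively correlated u2 contributes)
```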
15
ARM in Netflix?
ARM used for pattern mining in the Netflix data? In general we look for rating patterns (RP) that have the property: RP true ⇒ MID = r (movie MID is rated r).
Singleton Homogeneous Rating Patterns:    RP: N = r
Multiple Homogeneous Rating Patterns:     RP: N1 = r AND ... AND Nk = r
Multiple Heterogeneous Rating Patterns:   RP: N1 = r1 AND ... AND Nk = rk
(A small sketch of checking such a pattern follows below.)
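A small sketch of what checking one of these rating patterns against a user's ratings might look like; the dict-based pattern representation and the data are assumptions for illustration, not the presentation's notation.

```python
# A heterogeneous rating pattern: {N1: r1, ..., Nk: rk}  =>  MID is rated r.
pattern = {"N1": 5, "N2": 3}      # antecedent ratings (hypothetical movie ids)
consequent = ("MID", 4)           # movie MID is then predicted to be rated 4

def pattern_holds(user_ratings, pattern):
    """True if the user gave exactly the required rating to every movie in the pattern."""
    return all(user_ratings.get(movie) == r for movie, r in pattern.items())

user = {"N1": 5, "N2": 3, "N7": 1}
if pattern_holds(user, pattern):
    movie, r = consequent
    print(f"predict rating({movie}) = {r}")
```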