# Why Not Store Everything in Main Memory? Why use disks? - PowerPoint PPT Presentation

Title:

## Why Not Store Everything in Main Memory? Why use disks?

Description:

### FAUST Analytics: X(X1..Xn) ⊆ R^n, |X|=N. If X is a classified training set with classes C={C1..CK}, then X=X(X1..Xn,C). In either case d=(d1..dn), p=(p1..pn) ∈ R^n.

Slides: 13
Provided by: William1248
Transcript and Presenter's Notes

Title: Why Not Store Everything in Main Memory? Why use disks?

1
FAUST Analytics: $X(X_1..X_n) \subseteq R^n$, $|X|=N$. If X is a
classified training set with classes $C=\{C_1..C_K\}$,
then $X=X(X_1..X_n,C)$. In either case $d=(d_1..d_n)$,
$p=(p_1..p_n) \in R^n$. We have functionals $F: R^n \to R$,
$F = L, S, R$. (We think of these as mapping n-vectors
to 1-vectors of numbers or, in terms of bit
columns (compressed or not), as mappings from a
PTS to a SPTS.)
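$L_{d,p} \equiv (X-p)\circ d = X\circ d - p\circ d$. Letting $L_d \equiv X\circ d$, $L_{d,p} = L_d - p\circ d$.
$S_p \equiv (X-p)\circ(X-p) = X\circ X + X\circ(-2p) + p\circ p = L_{-2p} + X\circ X + p\circ p$.
$R_{d,p} \equiv S_p - L_{d,p}^2 = X\circ X + L_{-2p} + p\circ p - (L_d)^2 + 2(p\circ d)(X\circ d) - (p\circ d)^2 = L_{-2p+2(p\circ d)d} - (L_d)^2 + X\circ X + p\circ p - (p\circ d)^2$.
$Fmin_{d,p,k} \equiv \min(F_{d,p}\,\&\,C_k) = \min F_{d,p,k}$, where
$F_{d,p,k} = F_{d,p}\,\&\,C_k$ ($F_{d,p}$ masked to class $C_k$). $Fmax_{d,p,k} \equiv \max(F_{d,p}\,\&\,C_k) = \max F_{d,p,k}$.
$X\circ X$ can be pre-computed, one time.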
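The three functionals can be checked numerically on a tiny example. The following is a minimal sketch, not the pTree implementation: it uses plain Python lists, a hypothetical 2-D data set `X`, and it assumes `d` is a unit vector so that the radial value equals the squared distance from the d-line through p.

```python
# Hypothetical 2-D data set X (one point per row); d is assumed to be a unit vector.
X = [(1.0, 2.0), (3.0, 1.0), (4.0, 5.0)]
d = (0.6, 0.8)
p = (1.0, 1.0)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

XoX = [dot(x, x) for x in X]      # pre-computed once, reused for every (d, p)
Ld = [dot(x, d) for x in X]       # L_d = X o d
pod, pop = dot(p, d), dot(p, p)

# L_{d,p} = L_d - p o d
L_dp = [l - pod for l in Ld]
# S_p = XoX + X o (-2p) + p o p
S_p = [xx - 2 * dot(x, p) + pop for x, xx in zip(X, XoX)]
# R_{d,p} = S_p - L_{d,p}^2
R_dp = [s - l * l for s, l in zip(S_p, L_dp)]

print(L_dp, S_p, R_dp)
```

Only `Ld` and the scalars `pod` and `pop` depend on the chosen (d, p); `XoX` is shared across all choices, which is why pre-computing it pays off.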
$FPCC_{d,p,k,j} \equiv$ the j-th precipitous count change
(from left to right) of $F_{d,p,k}$. The same notation is used
for PCIs and PCDs (precipitous count increases/decreases).
Then the main tools are the inequality mask
pTrees, e.g., $P_{F(x) > Fmin}$, which has a 1 iff the
vector x lies on the "positive" side of the (n-1)-dimensional
hyperplane through Fmin which is
perpendicular to the "d-line through p".
GAP (Gap Clusterer): If the DensityThreshold, DT,
isn't reached, cut C mid-gap of $L_{d,p}\,\&\,C$ using the
next (d,p) from dpSet.
PCC (Precipitous Count Change Clusterer): If DT
isn't reached, cut C at the PCCs of $L_{d,p}\,\&\,C$ using the
next (d,p) from dpSet.
A fusion step may be required; use density,
proximity, or Pillar pkMeans (next slide).
TKO (Top K Outlier Detector): Use
$\mathrm{rank}_{n-1} S_x$ for the TopKOutlier-slider.
LIN (Linear Classifier): $y \in C_k$ iff $y \in LH_k \equiv \{z \mid \min L_{d,p,k} \le L_{d,p,k}(z) \le \max L_{d,p,k}\ \forall (d,p) \in dpSet\}$.
$LH_k$ is a linear hull around $C_k$. dpSet is a set of (d,p)
pairs, e.g., (Diag, DiagStartPt).
LSR (Linear Spherical Radial Classifier): $y \in C_k$
iff $y \in LSRH_k \equiv \{z \mid \min F_{d,p,k} \le F_{d,p,k}(z) \le \max F_{d,p,k}\ \forall (d,p) \in dpSet,\ \forall F = L, S, R\}$.
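The LIN hull test can be sketched in a few lines. This is an illustrative sketch only: the two classes, the dpSet (the two coordinate directions through the origin) and the helper names `linear_hull`, `in_hull` and `classify` are hypothetical, and it works on plain tuples rather than pTrees.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear_hull(points, dp_set):
    """For each (d, p), record (min, max) of L_{d,p} = (x - p) o d over the class."""
    hull = []
    for d, p in dp_set:
        values = [dot(x, d) - dot(p, d) for x in points]
        hull.append((min(values), max(values)))
    return hull

def in_hull(y, hull, dp_set):
    """y is in the hull iff min <= L_{d,p}(y) <= max for every (d, p) in dpSet."""
    return all(lo <= dot(y, d) - dot(p, d) <= hi
               for (d, p), (lo, hi) in zip(dp_set, hull))

# Hypothetical training classes and dpSet.
classes = {
    "C1": [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0)],
    "C2": [(5.0, 5.0), (6.0, 5.5), (5.5, 6.5)],
}
dp_set = [((1.0, 0.0), (0.0, 0.0)), ((0.0, 1.0), (0.0, 0.0))]
hulls = {k: linear_hull(pts, dp_set) for k, pts in classes.items()}

def classify(y):
    """Return every class whose linear hull contains y (possibly none)."""
    return [k for k, hull in hulls.items() if in_hull(y, hull, dp_set)]

print(classify((1.5, 1.5)), classify((5.5, 6.0)), classify((3.0, 3.0)))
```

A point that falls in no hull is left unclassified here; adding more (d,p) pairs to dpSet tightens each hull.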
What should we pre-compute besides $X\circ X$?
stats (min/avg/max/std); $X\circ p$ for p = class Avg/Med;
$X\circ d$; $X\circ x$; $d^2(X,x)$; $Rk_i\, d^2(X,x)$; $L_{d,p}$, $R_{d,p}$.
We need a "Basic pTree Operations Timing Manual"
to show users the cost of various pTree
computations.
2
MG44d60w: 44 MOTHER GOOSE RHYMES with a
synonymized vocabulary of 60 WORDS
1. Three blind
mice! See how they run! They all ran after the
farmer's wife, who cut off their tails with a
carving knife. Did you ever see such a thing in
your life as three blind mice? 2. This little pig
went to market. This little pig stayed at home.
This little pig had roast beef. This little pig
had none. This little pig said Wee, wee. I can't
find my way home. 3. Diddle diddle dumpling, my
son John. Went to bed with his breeches on, one
stocking off, and one stocking on. Diddle diddle
dumpling, my son John. 4. Little Miss Muffet sat
on a tuffet, eating of curds and whey. There came
a big spider and sat down beside her and
frightened Miss Muffet away. 5. Humpty Dumpty sat
on a wall. Humpty Dumpty had a great fall. All
the King's horses, and all the King's men cannot
put Humpty Dumpty together again. 6. See a pin
and pick it up. All the day you will have good
luck. See a pin and let it lay. Bad luck you will
have all the day. 7. Old Mother Hubbard went to
the cupboard to give her poor dog a bone. When
she got there cupboard was bare and so the poor
dog had none. She went to baker to buy him some
bread. When she came back dog was dead. 8. Jack
Sprat could eat no fat. His wife could eat no
lean. And so between them both they licked the
platter clean. 9. Hush baby. Daddy is near. Mamma
is a lady and that is very clear. 10. Jack and
Jill went up the hill to fetch a pail of water.
Jack fell down, and broke his crown and Jill came
tumbling after. When up Jack got and off did trot
as fast as he could caper, to old Dame Dob who
patched his nob with vinegar and brown paper. 11.
One misty moisty morning when cloudy was the
weather, I chanced to meet an old man clothed all
in leather. He began to compliment and I began to
grin. How do you do? And how do you do? And how do
you do again? 12. There came an old woman from
France who taught grown-up children to dance. But
they were so stiff she sent them home in a sniff.
This sprightly old woman from France. 13. A robin
and a robin's son once went to town to buy a bun.
They could not decide on plum or plain. And so
they went back home again. 14. If all the seas
were one sea, what a great sea that would be! And
if all the trees were one tree, what a great tree
that would be! And if all the axes were one axe,
what a great axe that would be! And if all the
men were one man what a great man he would be!
And if the great man took the great axe and cut
down the great tree and let it fall into the
great sea, what a splish splash that would
be! 15. Great A. little a. This is pancake day.
Toss the ball high. Throw the ball low. Those
that come after may sing heigh ho! 16. Flour of
England, fruit of Spain, met together in a shower
of rain. Put in a bag tied round with a string.
If you'll tell me this riddle, I will give you a
ring. 17. Here sits the Lord Mayor. Here sit his
two men. Here sits the cock. Here sits the hen.
Here sit the little chickens. Here they run in.
Chin chopper, chin chopper, chin chopper,
chin! 18. I had two pigeons bright and gay. They
flew from me the other day. What was the reason
they did go? I can not tell, for I do not
know. 21. The Lion and the Unicorn were fighting
for the crown. The Lion beat the Unicorn all
around the town. Some gave them white bread and
some gave them brown. Some gave them plum cake,
and sent them out of town. 22. I had a little
husband no bigger than my thumb. I put him in a
pint pot, and there I bid him drum. I bought a
little handkerchief to wipe his little nose and a
pair of little garters to tie his little
hose. 23. How many miles is it to Babylon? Three
score miles and ten. Can I get there by candle
light? Yes, and back again. If your heels are
nimble and light, you may get there by candle
light. 25. There was an old woman, and what do
you think? She lived upon nothing but victuals,
and drink. Victuals and drink were the chief of
her diet, and yet this old woman could never be
quiet. 26. Sleep baby sleep. Our cottage valley
is deep. The little lamb is on the green with
woolly fleece so soft and clean. Sleep baby
sleep. Sleep baby sleep, down where the woodbines
creep. Be always like the lamb so mild, a kind
and sweet and gentle child. Sleep baby sleep. 27.
Cry baby cry. Put your finger in your eye and
tell your mother it was not I. 28. Baa baa black
sheep, have you any wool? Yes sir yes sir, three
bags full. One for my master and one for my dame,
but none for the little boy who cries in the
lane. 29. When little Fred went to bed, he always
said his prayers. He kissed his mamma and then
his papa, and straight away went upstairs. 30.
Hey diddle diddle! The cat and the fiddle. The
cow jumped over the moon. The little dog laughed
to see such sport, and the dish ran away with the
spoon. 32. Jack come and give me your fiddle, if
ever you mean to thrive. No I will not give my
fiddle to any man alive. If I should give my
fiddle they will think that I've gone mad. For
many a joyous day my fiddle and I have had. 33.
Buttons, a farthing a pair! Come, who will buy
them of me? They are round and sound and pretty
and fit for girls of the city. Come, who will buy
them of me? Buttons, a farthing a pair! 35. Sing
a song of sixpence, a pocket full of rye. Four
and twenty blackbirds, baked in a pie. When the
pie was opened, the birds began to sing. Was not
that a dainty dish to set before the king? The
king was in his counting house, counting out his
money. The queen was in the parlor, eating bread
and honey. The maid was in the garden, hanging
out the clothes. When down came a blackbird and
snapped off her nose. 36. Little Tommy
Tittlemouse lived in a little house. He caught
fishes in other men's ditches. 37. Here we go
round mulberry bush, mulberry bush, mulberry
bush. Here we go round mulberry bush, on a cold
and frosty morning. This is way we wash our
hands, wash our hands, wash our hands. This is
way we wash our hands, on a cold and frosty
morning. This is way we wash our clothes, wash
our clothes, wash our clothes. This is way we
wash our clothes, on a cold and frosty morning.
This is way we go to school, go to school, go to
school. This is the way we go to school, on a
cold and frosty morning. This is the way we come
out of school, come out of school, come out of
school. This is the way we come out of school, on
a cold and frosty morning. 38. If I had as much
money as I could tell, I never would cry young
lambs to sell. Young lambs to sell, young lambs
to sell. I never would cry young lambs to sell.
39. A little cock sparrow sat on a green tree.
And he chirped and chirped, so merry was he. A
naughty boy with his bow and arrow, determined to
shoot this little cock sparrow. This little cock
sparrow shall make me a stew, and his giblets
shall make me a little pie, too. Oh no, says the
sparrow, I will not make a stew. So he flapped
his wings and away he flew. 41. Old King Cole was
a merry old soul. And a merry old soul was he. He
called for his pipe and he called for his bowl
and he called for his fiddlers three. And every
fiddler, he had a fine fiddle and a very fine
fiddle had he. There is none so rare as can
compare with King Cole and his fiddlers
three. 42. Bat bat, come under my hat and I will
give you a slice of bacon. And when I bake I will
give you a cake, if I am not mistaken. 43. Hark
hark, the dogs do bark! Beggars are coming to
town. Some in jags and some in rags and some in
velvet gowns. 44. The hart he loves the high
wood. The hare she loves the hill. The Knight he
loves his bright sword. The Lady loves her
will. 45. Bye baby bunting. Father has gone
hunting. Mother has gone milking. Sister has gone
silking. And brother has gone to buy a skin to
wrap the baby bunting in. 46. Tom Tom the piper's
son, stole a pig and away he run. The pig was eat
and Tom was beat and Tom ran crying down the
street. 47. Cocks crow in the morn to tell us to
rise and he who lies late will never be wise. For
early to bed and early to rise, is the way to be
healthy and wealthy and wise. 48. One two, buckle
my shoe. Three four, knock at the door. Five six,
pick up sticks. Seven eight, lay them straight.
Nine ten, a good fat hen. Eleven twelve, dig and
delve. Thirteen fourteen, maids a courting.
Fifteen sixteen, maids in the kitchen. Seventeen
eighteen, maids a waiting. Nineteen twenty, my
plate is empty. 49. There was a little girl who
had a little curl right in the middle of her
forehead. When she was good she was very very
good and when she was bad she was horrid. 50.
Little Jack Horner sat in the corner, eating of
Christmas pie. He put in his thumb and pulled out
a plum and said What a good boy am I!
3
Relationships and ARM: In Market Basket Research
(MBR), we introduced the relationship,
cash-register transactions, T, between customers,
C, and purchasable items, I, and briefly
discussed what strong rules tell us in that
context. In Software Engineering (SE), there is the
relationship between Aspects, T, and Code
Modules, I (t is related to i iff module i is
part of aspect t). In Bioinformatics, the
relationship between experiments, T, and genes, I
(t is related to i iff gene i expresses at a
threshold level during experiment t). In Text
Mining, the relationship between Documents, D,
and Words, W (w is related to d iff w∈d). A strong
D-rule means two things: the DSets A and C have
many common words, and if a word occurs in every
document of the DSet A, it occurs in every doc
of C with high probability. In any Entity
Relationship diagram, there is a "part of" relationship in
which i∈I is part of t∈T (t is related to i iff
i is part of t) and an "ISA" relationship in
which i∈I ISA t∈T (t is related to i iff i
IS A t) . . .
Given any relationship between two entities
(e.g., between customers and items) there are
always two ARM problems to analyze. E.g., we
analyzed itemset rules, A⇒C (call them I-rules),
using info recorded on which customer
transactions contained those itemsets. With this,
we can intelligently shelve items, accurately
order items (Supply Chain Management), etc.
There are also C-rules.
The support ratio of itemset A, supp(A), is the
fraction of Ts such that A ⊆ T(I), e.g., if
A={i1,i2} and C={i4} then supp(A) = |{t2,t4}|
/ |{t1,t2,t3,t4,t5}| = 2/5. Note: | |
means set size, i.e., the count of elements in the
set. The support ratio of rule A⇒C, supp(A⇒C),
is the support of A∪C = |{T2,T4}| / |{T1,T2,T3,T4,T5}|
= 2/5. The confidence of rule A⇒C, conf(A⇒C),
is supp(A⇒C) / supp(A) = (2/5) /
(2/5) = 1. Data miners typically want
to find all STRONG RULES, A⇒C, with supp(A⇒C) ≥
minsupp and conf(A⇒C) ≥ minconf (minsupp and
minconf are threshold levels). A strong rule
indicates two things: high support means it's
non-trivial (A and C are found in many market
baskets at checkout), and high confidence means
that the implication rule is highly likely to be
true. Note that conf(A⇒C) is also just the
conditional probability of t being related to C,
given that t is related to A (e.g., the
conditional probability that the market basket
contents, T(I), contain C, given that T(I)
contains A).
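The support and confidence arithmetic above can be reproduced with a short sketch. The slide does not list the contents of t1..t5, so the five baskets below are hypothetical, chosen only so that A={i1,i2} and A∪C occur in exactly t2 and t4.

```python
from fractions import Fraction

# Hypothetical baskets; only t2 and t4 contain both A = {i1, i2} and C = {i4}.
transactions = {
    "t1": {"i1", "i3"},
    "t2": {"i1", "i2", "i4"},
    "t3": {"i2", "i3"},
    "t4": {"i1", "i2", "i3", "i4"},
    "t5": {"i3", "i5"},
}

def supp(itemset):
    """Fraction of transactions whose basket contains every item of itemset."""
    hits = [t for t, items in transactions.items() if itemset <= items]
    return Fraction(len(hits), len(transactions))

def conf(A, C):
    """conf(A => C) = supp(A u C) / supp(A)."""
    return supp(A | C) / supp(A)

A, C = {"i1", "i2"}, {"i4"}
print(supp(A), supp(A | C), conf(A, C))   # prints: 2/5 2/5 1
```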
4
APRIORI Association Rule Mining: Given a
Transaction-Item Relationship, the APRIORI
algorithm for finding all strong I-rules can be
done by 1. vertically processing a Horizontal
Transaction Table (HTT) or 2. horizontally
processing a Vertical Transaction Table
(VTT). In 1., a HTT is processed through vertical
scans to find all Frequent I-sets (I-sets with
support ≥ minsupp, e.g., I-sets "frequently"
found in transaction market baskets). In 2., a VTT
is processed through horizontal operations to find
all Frequent I-sets. Then each Frequent I-set
found is analyzed to determine if it is the
support set of a strong rule. Finding all
Frequent I-sets is the hard part. The APRIORI
algorithm takes advantage of the "downward
closure" property for Frequent I-sets: if an
I-set is frequent, then all its subsets are also
frequent. E.g., in the MBR example, if A is an
I-subset of B and if all of B is in a given
transaction's basket, then certainly all of A is
in that basket too. Therefore supp(A) ≥ supp(B)
whenever A⊆B (downward closure). First, APRIORI
scans to determine all Frequent 1-item sets
(they contain 1 item and are therefore called
1-Itemsets); next, APRIORI uses downward closure
to efficiently find candidates for Frequent
2-Itemsets; next, APRIORI scans to determine which
of those candidate 2-Itemsets are actually
frequent; and so on, until there are no candidates
remaining (on the next slide we walk through an
example using both a HTT and a VTT).
minsupp is set by the querier at 1/2 and minconf
at 3/4 (note that minsupp and minconf can be expressed
as counts rather than as ratios; if so, since
there are 4 transactions, then as counts,
minsupp=2 and minconf=3).
Downward closure property of "frequent": any
subset of a frequent itemset is frequent.
APRIORI METHOD: Iteratively find the frequent
k-itemsets, k=1,2,... Then find all strong rules
supported by each frequent itemset.
(Ck = candidate k-itemsets; Fk = frequent k-itemsets.)
• Other ARM methods: FP-Growth builds a large
linked data structure precounting the counts used
in APRIORI. Hash-based itemset counting: a
k-itemset whose corresponding hashing bucket
count is below the threshold cannot be frequent.
Transaction reduction: a transaction that does
not contain any frequent k-itemset is useless in
subsequent scans. Partitioning: any itemset that
is potentially frequent in DB must be frequent in
at least one of the partitions of DB. Sampling:
mining on a subset of the given data, with a lower support
threshold and a method to determine completeness.
Dynamic itemset counting: add new candidate
itemsets only when all of their subsets are
estimated to be frequent.
• The core of the Apriori algorithm:
• Use only large (k-1)-itemsets to generate
candidate large k-itemsets
• Use database scan and pattern matching to collect
counts for the candidate itemsets
• The bottleneck of Apriori: candidate generation
• 1. Huge candidate sets:
• 10^4 large 1-itemsets may generate 10^7
candidate 2-itemsets
• To discover a large pattern of size 100, e.g.,
{a1, ..., a100}, we need to generate 2^100 ≈ 10^30
candidates.
• 2. Multiple scans of the database (needs (n+1)
scans, where n is the length of the longest pattern)
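The orders of magnitude in the bottleneck bullets can be checked directly. This is a small arithmetic sketch; it counts unordered pairs for the 2-itemset candidates and non-empty subsets for the size-100 pattern, which is one reasonable reading of the bullets rather than something the slide states.

```python
import math

pairs = math.comb(10**4, 2)    # unordered pairs drawn from 10^4 frequent 1-itemsets
subsets = 2**100 - 1           # non-empty subsets of a 100-item pattern

print(pairs)                   # 49995000, i.e. on the order of 10^7
print(f"{subsets:.3e}")        # 1.268e+30
```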

A supplemental text document on ARM (with
additional topics and discussions) is at
http://www.cs.ndsu.nodak.edu/perrizo/classes/785/hk6.html
5
ARM-6
HTT

{1,2,3} need not be scanned for since {1,2} is not
frequent. {1,3,5} need not be scanned for since
{1,5} is not frequent.
Example ARM using uncompressed Ptrees (note: I
have placed the 1-count at the root of each Ptree)

TID 1 2 3 4 5
100 1 0 1 1 0
200 0 1 1 0 1
300 1 1 1 0 1
400 0 1 0 0 1
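The frequent-itemset search on this table can be sketched with one bit column per item, in the spirit of the uncompressed Ptree example. This is a hedged sketch, not the slide's implementation: the columns are plain Python integers, the bit order (most significant bit = TID 100) is an assumption, and minsupp is taken as the count 2.

```python
from itertools import combinations

# Vertical view of the table above: one bit column per item, one bit per TID
# (assumed order: most significant bit = TID 100, least significant = TID 400).
columns = {1: 0b1010, 2: 0b0111, 3: 0b1110, 4: 0b1000, 5: 0b0111}
ALL_TIDS = 0b1111
MINSUPP = 2  # minsupp as a count (1/2 of 4 transactions)

def support(itemset):
    """AND the item columns together and count the 1-bits."""
    mask = ALL_TIDS
    for item in itemset:
        mask &= columns[item]
    return bin(mask).count("1")

def apriori():
    frequent = {}
    level = [(i,) for i in sorted(columns) if support((i,)) >= MINSUPP]
    k = 1
    while level:
        for s in level:
            frequent[s] = support(s)
        k += 1
        items = sorted({i for s in level for i in s})
        prev = set(level)
        # Downward closure: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = [c for c in combinations(items, k)
                      if all(sub in prev for sub in combinations(c, k - 1))]
        level = [c for c in candidates if support(c) >= MINSUPP]
    return frequent

frequent = apriori()
for itemset, count in frequent.items():
    print(itemset, count)
```

On this table the only 3-item candidate that survives the closure check is (2, 3, 5); (1, 2, 3) and (1, 3, 5) are never counted.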

6
ARM-7
1-ItemSets don't support Association Rules (they
either have no antecedent or no consequent).
2-ItemSets do support ARs.
Are there any Strong Rules supported
by Frequent (Large) 2-ItemSets
(at minconf = .75)?
{1,3}: conf(1⇒3) = supp{1,3}/supp{1} = 2/2 = 1 ≥ .75 STRONG!
       conf(3⇒1) = supp{1,3}/supp{3} = 2/3 = .67 < .75
{2,3}: conf(2⇒3) = supp{2,3}/supp{2} = 2/3 = .67 < .75
       conf(3⇒2) = supp{2,3}/supp{3} = 2/3 = .67 < .75
{2,5}: conf(2⇒5) = supp{2,5}/supp{2} = 3/3 = 1 ≥ .75 STRONG!
       conf(5⇒2) = supp{2,5}/supp{5} = 3/3 = 1 ≥ .75 STRONG!
{3,5}: conf(3⇒5) = supp{3,5}/supp{3} = 2/3 = .67 < .75
       conf(5⇒3) = supp{3,5}/supp{5} = 2/3 = .67 < .75
Are there any Strong Rules supported by Frequent
(Large) 3-ItemSets?
{2,3,5}: conf(2,3⇒5) = supp{2,3,5}/supp{2,3} = 2/2 = 1 ≥ .75 STRONG!
         conf(2,5⇒3) = supp{2,3,5}/supp{2,5} = 2/3 = .67 < .75
         No subset antecedent can yield a strong rule
         either (i.e., no need to check conf(2⇒3,5) or
         conf(5⇒2,3), since both denominators will be
         at least as large and therefore both confidences
         will be at least as low).
         conf(3,5⇒2) = supp{2,3,5}/supp{3,5} = 2/2 = 1 ≥ .75 STRONG!
         conf(3⇒2,5) = supp{2,3,5}/supp{3} = 2/3 = .67 < .75
DONE!
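The rule check above can be replayed mechanically. The sketch below is illustrative: it takes the support counts from the worked example as given, and it brute-forces every non-empty proper subset as an antecedent instead of applying the "no subset antecedent" shortcut described on the slide.

```python
from fractions import Fraction
from itertools import combinations

# Support counts of the frequent itemsets from the worked example.
supp = {
    frozenset({1}): 2, frozenset({2}): 3, frozenset({3}): 3, frozenset({5}): 3,
    frozenset({1, 3}): 2, frozenset({2, 3}): 2, frozenset({2, 5}): 3,
    frozenset({3, 5}): 2, frozenset({2, 3, 5}): 2,
}
MINCONF = Fraction(3, 4)

def strong_rules():
    """Return (antecedent, consequent, confidence) for every rule with conf >= MINCONF."""
    rules = []
    for itemset, count in supp.items():
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                a = frozenset(antecedent)
                confidence = Fraction(count, supp[a])
                if confidence >= MINCONF:
                    rules.append((tuple(sorted(a)),
                                  tuple(sorted(itemset - a)),
                                  confidence))
    return rules

for a, c, confidence in strong_rules():
    print(a, "=>", c, "conf =", confidence)
```

The five rules it reports are the ones marked STRONG above: 1⇒3, 2⇒5, 5⇒2, 2,3⇒5 and 3,5⇒2.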
7
ARM-9: Find all frequent 1-sets, F1DSs and F1WSs,
given minsupp=5 (supp>5) and minconf=.5 (conf>.5)
F1DS (doc: word count): 7: 7, 21: 6, 26: 7, 28: 6, 35: 13, 39: 7, 46: 6
Bit vectors of the F1DS documents (60 word positions each):
07OMH 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0
21LAU 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 1 0 0 0 0 0 0
26SBS 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1
28BBB 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 0 0 1
35SSS 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0
1 0 1 0 0 1 0 0 0 0 1 1 0 0 1 0 0 1 0 0 1 0 1 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0
39LCS 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 1 0 0 0 0 0
46TTP 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
0 1 0 1 0 0 0 0 0 0 0 0 0
Count of words common to each pair of F1DS documents
(the 1-count of the AND of the two bit vectors):
DOC  21  26  28  35  39  46
07    1   0   0   2   0   0
21        0   0   1   0   0
26            1   0   1   0
28                1   1   1
35                    1   1
39                        1
No frequent 2-sets, either F2WSs or F2DSs. We
would have to lower minsupp down to 1 to get just
one C2DS, {D7, D35} (we might get more because
there would be additional F1DSs, but they would be
far less likely to combine for a large count than
these "largest" ones), and still no F2WSs.
F1WS (word: doc count): 38: 6, 44: 6
Word count (WC) of each document (DOC):
DOC   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 21 22 23 25
WC    4  2  2  2  3  2  7  3  3  5  3  3  5  4  3  2  4  2  6  2  2  2
DOC  26 27 28 29 30 32 33 35 36 37 38 39 41 42 43 44 45 46 47 48 49 50
WC    7  3  6  4  5  3  3 13  2  4  3  7  5  2  2  4  3  6  4  3  2  5
Looking at Doc7 and Doc35, we can see that
doc7⇒doc35 has confidence 2/7 and
doc35⇒doc7 has confidence 2/13, neither of which is a
confident rule (unless we lowered minconf to
2/8, in which case we pick up doc7⇒doc35 as confident).
This says "if a word is contained in doc7 then
we have at least a 1/4 confidence that it will be
contained in doc35".
7. Old Mother Hubbard went to the cupboard to
give her poor dog a bone. When she got there
cupboard was bare and so the poor dog had none.
She went to baker to buy him some bread. When she
came back dog was dead. 35. Sing a song of
sixpence, a pocket full of rye. Four and twenty
blackbirds, baked in a pie. When the pie was
opened, the birds began to sing. Was not that a
dainty dish to set before the king? The king was
in his counting house, counting out his money.
The queen was in the parlor, eating bread and
honey. The maid was in the garden, hanging out
the clothes. When down came a blackbird and
snapped off her nose.
We would have to lower minconf to 2/14 to pick up
doc35⇒doc7 as a confident DS-rule also.
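The two document-rule confidences can be recomputed from the bit vectors. In this sketch each document is written as the set of word positions that hold a 1 in its 60-bit vector; the 1-based numbering of positions is an assumption made for readability.

```python
from fractions import Fraction

# Word positions (1-based, out of 60) that hold a 1 in each document's bit vector.
doc7 = {4, 7, 10, 13, 24, 42, 44}
doc35 = {7, 10, 17, 23, 25, 28, 33, 34, 37, 40, 43, 45, 50}

common = doc7 & doc35                            # AND of the two bit vectors
conf_7_35 = Fraction(len(common), len(doc7))     # conf(doc7 => doc35)
conf_35_7 = Fraction(len(common), len(doc35))    # conf(doc35 => doc7)

print(sorted(common), conf_7_35, conf_35_7)      # prints: [7, 10] 2/7 2/13
```

With the strict test conf > minconf, 2/7 passes once minconf drops to 2/8, and 2/13 passes once it drops to 2/14.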
8
WHAT I'M WORKING ON NOW
I am creating a bigger text corpus consisting of
50 Mother Goose Rhymes, 10 short emails (actual,
but anonymized) and 10 text messages (anonymized)
for a total of 70 documents. The vocab is the
full vocab (no content word extraction) with very
minor synonymizing. I am about done creating
the word pTrees (300 pTrees). I am using the
list version for those words that occur in just 1
document and the bit map version for those that
occur in more than 1. For the document pTrees I
will use only bit maps. I will retry clustering
and ARM on this expanded corpus of documents
(with a complete vocab). The idea is that the
articles and other so-called non-content words
may be important. We will see. The next
important thing I will do is work on how to do
this for large volatile corpuses of documents
(such as emails, texts and tweets). How should we
handle "new" documents and "new" vocab as they
come in?
9
01TBM 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 1 0 0 0 0 0 1 0 0
02TLP 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0 1 0 0 0
03DDD 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0 0 0 0
04LMM 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0
05HDS 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0
06SPP 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0
07OMH 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0
08JSC 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 0 0
09HBD 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0
10JAJ 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0
0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0
11OMM 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0
12OWF 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 1 0
13RRS 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 1 0 0 1 0 0 0 0 0 0
14ASO 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0 0 0 0 0
15PCD 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0
16PPG 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0
17FEC 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 1 0 0 0 0
18HTP 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0
21LAU 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 1 0 0 0 0 0 0
22HLH 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 1 0 0 0 0 0 0 0
23MTB 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 0 0 0
25WOW 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 1 0
26SBS 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1
27CBC 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0
28BBB 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 0 0 1
29LFW 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0
30HDD 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0 0 0 0
32JGF 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0
33BFP 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0
35SSS 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0
1 0 1 0 0 1 0 0 0 0 1 1 0 0 1 0 0 1 0 0 1 0 1 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0
36LTT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0
37MBB 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 1 0 0 0
38YLS 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0
39LCS 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 1 0 0 0 0 0
41OKC 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0
0 0 0 0 1 0 0 0 0 0 0 0 0
42BBC 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0
43HHD 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0 0 0 0
44HLH 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0
45BBB 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0
46TTP 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
0 1 0 1 0 0 0 0 0 0 0 0 0
47CCM 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 1 0 0 0
48OTB 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 1 0 0 0 0
49WLG 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0
50LJH 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1
0 0 0 0 0 1 0 0 0 0 0 0 0
10
APPENDIX
Let $L \equiv L_{KWL,origin}$. $L^{-1}(k)$ = {docs with exactly
k KeyWord matches}.
FAUST KWL Clustering (with or without replacement;
KWL = KeyWordList)
It should scale to big text corpuses (billions of
docs, thousands of words), because it uses 1 dot
product SPTS over the KWL, not the Vocab, then UDR.
Thanksgiving Clustering (no replacement) carves
off clusters as we go, so the sequence of KWLs is ...
(GVO) may be useful in determining the "next best
KWL"? (To keep the KWL small we might take it to
be the nearest actual doc to the GVO vector.)
KWL w/o repl uses a sequence of KWLs without
replacement, e.g., using the words in the pillar
documents? (E.g., initial pillar = FFA; the "next"
pillar maximizes the distance from PillarSet (or the
sum of Pillar distances), until the distance to
PillarSet (or sum/min) falls below a threshold.)
How far away from each other are KWLs?
Let L = L_100: L(100) = 1, L(010) = 0, L(011) = 0, gap = 1,
i.e., |L(010) - L(100)| = 1 and |L(011) - L(100)| = 1.
Euclidean distances are E(100, 010) = √2 ≈ 1.41 and
E(100, 011) = √3 ≈ 1.73.
Manhattan distances are M(100, 010) = 2 and M(100,
011) = 3.
(We also note incidentally that L(010) - L(011) = 0,
whereas ED(010,011) = MD(010,011) = 1.)
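The numbers in this comparison can be checked with a short script. It is a minimal sketch: the helper names `dot`, `euclid` and `manhattan` are illustrative, and L is taken to be the dot product with the KWL vector 100.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


kwl = (1, 0, 0)                      # the KWL vector 100
v010, v011 = (0, 1, 0), (0, 1, 1)


def L(x):
    """Keyword-match count of x against the KWL 100."""
    return dot(x, kwl)


print(L(kwl), L(v010), L(v011))                          # L values
print(round(euclid(kwl, v010), 2), round(euclid(kwl, v011), 2))
print(manhattan(kwl, v010), manhattan(kwl, v011))
print(L(v010) - L(v011), euclid(v010, v011), manhattan(v010, v011))
```

The last line shows the point of the incidental note: L cannot separate 010 from 011, although both distance measures do.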
The KWL approach is Hub And SPoke (HASP)
clustering, where the affinity is to the hub = KWL
(there may be less or no spoke-spoke affinity).
The client might want a more Uniform Affinity
(UA) clustering, assuming affinity(x,y) =
Count(KeyWordMatch(x,y)). E.g., define UA_{doc,T}
to contain doc and have uniform affinity ≥ T, i.e.,
Affinity(x,y) ≥ T ∀x,y ∈ UA_{doc,T}. (We need
T ≤ |KWL_doc|. Each x in UA_{doc,T} must
contain at least T KeyWords_doc, and therefore
x ∈ HASP_doc.) So to find UA_{doc,T} we can first
construct HASP_doc and then search internal to it
for UA_{doc,T}. But this is also Hub and Spoke,
since the result will depend upon the search
order. Let RING_{doc,S} be the HASP_doc docs with
exactly S docwords. Carve off from HASP_doc
RING_{doc,T}, then carve off RING_{doc,T+1}, ...
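A small sketch may make the HASP, RING and UA ideas concrete. Everything here is an assumption made for illustration: the toy docs, the function names, the choice to give `hasp` an explicit threshold `t`, the reading of "exactly S docwords" as "exactly S of the hub doc's words", and the greedy search used for UA.

```python
# Hypothetical toy docs, each given as its set of words.
docs = {
    "hub": {"a", "b", "c", "d"},
    "x1": {"a", "b", "e"},
    "x2": {"c", "d", "f"},
    "x3": {"a", "b", "c"},
    "x4": {"a", "g"},
}


def affinity(x, y):
    """Count of words shared by two docs."""
    return len(x & y)


def hasp(docs, hub, t):
    """Docs sharing at least t words with the hub doc (hub included)."""
    return {d for d, w in docs.items() if affinity(w, docs[hub]) >= t}


def ring(docs, hub, t, s):
    """Members of the hub-and-spoke cluster sharing exactly s hub words."""
    return {d for d in hasp(docs, hub, t) if affinity(docs[d], docs[hub]) == s}


def greedy_ua(docs, hub, t, order):
    """Greedy uniform-affinity search inside hasp(docs, hub, t).

    A doc is admitted only if it shares >= t words with every member so far,
    so the result depends on the visiting order.
    """
    candidates = hasp(docs, hub, t)
    members = [hub]
    for d in order:
        if d in candidates and d != hub and all(
            affinity(docs[d], docs[m]) >= t for m in members
        ):
            members.append(d)
        # docs outside the hub-and-spoke cluster are skipped
    return set(members)


print(sorted(hasp(docs, "hub", 2)))
print(sorted(ring(docs, "hub", 2, 2)), sorted(ring(docs, "hub", 2, 3)))
print(sorted(greedy_ua(docs, "hub", 2, ["x1", "x2", "x3"])))
print(sorted(greedy_ua(docs, "hub", 2, ["x2", "x1", "x3"])))
```

The two greedy runs visit the same candidates in different orders and return different clusters, which is the order dependence noted above.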
11
A document-word matrix can be viewed as a labeled
bipartite graph
[Figure: bipartite graph with doc nodes doc1, doc2, ..., docN
and word nodes word1, word2, ..., wordn; example word-node label dc = df = 2]
There are at least three common labels. The
first is term_frequency (tf), which labels each
edge with the number of times the word occurs in
the doc.
[Figure label: example doc-node label wc = 3]
The 2nd and 3rd are incidence counts which are
node labels. The second is doc_frequency (df)
(or dc for doc count) which labels each word
with the number of docs the word occurs in.
This is just the incidence count of each word as
a node in the graph (i.e., edge count).
[Figure label: example edge label tf = 8]
The third is word_count (wc) which labels each
doc with the number of words it contains. This
is just the incidence count of each doc as a node
in the graph (i.e., edge count).
At this point we note that the two incidence
counts, as actual node labels, are just a
convenience, since that count can be made from
the bipartite graph any time. However, tf is a
necessary label (not a convenience) since it
cannot be discerned from the bipartite graph
itself.
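As a minimal sketch of the three labels, assume the corpus is stored as a nested dict of term frequencies (the data and the function names below are hypothetical). The two incidence counts are then derived from the edges alone, while tf has to be stored.

```python
# Hypothetical edge labels: tf[doc][word] = number of occurrences.
tf = {
    "doc1": {"lamb": 8, "snow": 1},
    "doc2": {"lamb": 2, "school": 1, "snow": 1},
    "doc3": {"school": 3},
}


def word_count(tf):
    """wc: number of distinct words per doc (edge count at each doc node)."""
    return {doc: sum(1 for n in words.values() if n >= 1)
            for doc, words in tf.items()}


def doc_frequency(tf):
    """df: number of docs containing each word (edge count at each word node)."""
    df = {}
    for words in tf.values():
        for w, n in words.items():
            if n >= 1:
                df[w] = df.get(w, 0) + 1
    return df


def existential_edges(tf):
    """Edges of the unlabeled bipartite graph, i.e. the corpus under tf >= 1."""
    return {(d, w) for d, words in tf.items() for w, n in words.items() if n >= 1}


wc = word_count(tf)
df = doc_frequency(tf)
edges = existential_edges(tf)
print(wc, df, len(edges))
```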
We typically lower-bound threshold each of these
labels. First, we transform the corpus using the
lower bound tf ≥ 1, which effectively removes the
need for an edge label.
Second, we lower bound and/or upper bound df
(e.g., df ≥ 2 requires each word to occur in at
least 2 docs).
Third, we could lower bound wc (e.g., wc ≥ 2
requires each doc to contain at least 2 words).
In the previous slides, the original corpus was
first transformed by tf ≥ 1 into a "word existential"
corpus, called MGd44w60. Then each round
was a matter of identifying the
sub-bipartite-graph satisfying one of the node
label inequalities: doc-node label
wc ≥ 2, word-node label dc ≥ 2, doc-node label
wc ≥ 2, word-node label dc ≥ 2, etc., until stability
was achieved (it converged to a
sub-bipartite-graph).
What happens if we reverse that order (word-node
label dc ≥ 2 first)? Word-node label dc ≥ 2, doc-node
label wc ≥ 2, word-node label dc ≥ 2, etc., until
stability is achieved.
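The alternating rounds can be sketched as follows. This is an illustration under assumptions, not the implementation used for MGd44w60: the edge list is a hypothetical toy graph, a set of (doc, word) pairs stands in for the pTreeSets, and `stabilize` and its parameters are made-up names.

```python
from collections import Counter


def drop_docs(edges, wc_min):
    """Keep only edges whose doc node has at least wc_min edges."""
    wc = Counter(d for d, _ in edges)
    return {(d, w) for d, w in edges if wc[d] >= wc_min}


def drop_words(edges, dc_min):
    """Keep only edges whose word node has at least dc_min edges."""
    dc = Counter(w for _, w in edges)
    return {(d, w) for d, w in edges if dc[w] >= dc_min}


def stabilize(edges, wc_min=2, dc_min=2, docs_first=True):
    """Alternate the two node-label filters until a full round changes nothing."""
    edges = set(edges)
    steps = [lambda e: drop_docs(e, wc_min), lambda e: drop_words(e, dc_min)]
    if not docs_first:
        steps.reverse()
    while True:
        before = edges
        for step in steps:
            edges = step(edges)
        if edges == before:   # filters only remove edges, so equal means stable
            return edges


# Hypothetical toy graph as (doc, word) edges.
toy = [("d1", "a"), ("d1", "b"), ("d2", "a"), ("d2", "b"),
       ("d2", "c"), ("d3", "c"), ("d4", "d"), ("d4", "a")]

docs_first = stabilize(toy, docs_first=True)
words_first = stabilize(toy, docs_first=False)
print(sorted(docs_first))
print(sorted(words_first))
```

On this toy input both orders converge to the same four edges between d1, d2 and the words a, b.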
12
What do we have to describe relationships
graphically?
The incidence counts (as well as any other entity
attribute) can be used to define sub-graphs, and
then we can search for stable (convergent)
sub-graphs under that definition. For the
doc-word relationship we used wc ≥ 2, dc ≥ 2.
Next we will try wc ≥ 2, dc ≥ 1.
After that we will try wc ≥ 1, dc ≥ 2.
In all cases we implement as two (redundant)
pTreeSets, one for each entity, with 1 iff there
is an edge. One PTS is the rotation of the
other. If there is a numeric edge label (e.g.,
tf), each SPTS is made up of its bitslices;
otherwise each SPTS is one bit map.
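A rough sketch of the two redundant layouts follows. It assumes uncompressed Python lists as stand-ins for pTrees, a hypothetical 2-doc by 3-word tf matrix, and illustrative helper names; "rotation" is taken to mean the transpose.

```python
# Hypothetical tf matrix: rows are docs, columns are words.
tf = [
    [0, 3, 1],   # doc0
    [2, 0, 0],   # doc1
]


def existential(m):
    """1 iff there is an edge (tf >= 1)."""
    return [[1 if v >= 1 else 0 for v in row] for row in m]


def rotate(m):
    """Rotation of a bit matrix: rows become columns."""
    return [list(col) for col in zip(*m)]


def bitslices(column, nbits):
    """Split a numeric column into nbits bit maps, high-order slice first."""
    return [[(v >> b) & 1 for v in column] for b in range(nbits - 1, -1, -1)]


by_doc = existential(tf)          # one bit map per doc, over the words
by_word = rotate(by_doc)          # one bit map per word, over the docs

word1_tf = [row[1] for row in tf]     # tf values of word1 across docs
slices = bitslices(word1_tf, 2)       # two bit slices cover values 0..3

# Recombining the slices recovers the original tf values.
rebuilt = [sum(bit << (len(slices) - 1 - i) for i, bit in
               enumerate(col)) for col in zip(*slices)]
print(by_doc, by_word, slices, rebuilt)
```

Rotating twice returns the original bit matrix, which is why the two sets are redundant.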