
Web Data Mining (Mining di dati web)

- Clustering of Web Documents
- Similarity metrics
- Academic year 2006/2007

Learning: What Does It Mean?

- Definition from Wikipedia
- Machine learning is an area of artificial intelligence concerned with the development of techniques which allow computers to "learn".
- More specifically, machine learning is a method for creating computer programs by the analysis of data sets.
- Machine learning overlaps heavily with statistics, since both fields study the analysis of data, but unlike statistics, machine learning is concerned with the algorithmic complexity of computational implementations.

And What about Clustering?

- Again from Wikipedia
- Data clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. Clustering is the classification of similar objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait, often proximity according to some defined distance measure.
- Machine learning typically regards data clustering as a form of unsupervised learning.

What Is Clustering?

Clustering

- Example where Euclidean distance is the distance metric
- Hierarchical clustering dendrogram

Clustering: K-Means

- Randomly generate k clusters and determine the cluster centers, or directly generate k seed points as cluster centers.
- Assign each point to the nearest cluster center.
- Recompute the new cluster centers.
- Repeat until some convergence criterion is met (usually that the assignment hasn't changed).

K-means Clustering

- Each point is assigned to the cluster with the closest center point.
- The number K must be specified.
- Basic algorithm:
- 1. Select K points as the initial center points.
- Repeat
- 2. Form K clusters by assigning all points to the closest center point.
- 3. Recompute the center point of each cluster.
- Until the center points don't change.
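
The basic algorithm above can be sketched in a few lines of NumPy. This is a minimal illustration, not the presenter's implementation: the function name `kmeans`, the `seed` and `max_iter` parameters, and the choice to keep the old center when a cluster becomes empty are all assumptions made for the example.

```python
import numpy as np

def kmeans(points, k, seed=0, max_iter=100):
    """Basic K-means: returns (centers, labels). Names and defaults are illustrative."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Step 1: select K distinct data points as the initial center points.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to its closest center (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned points.
        # Assumption: an empty cluster keeps its previous center.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop when the center points no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

if __name__ == "__main__":
    data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
    centers, labels = kmeans(data, k=2)
    print(centers)
    print(labels)
```

On two well-separated groups like the ones in the usage example, the labels split the points into the two groups regardless of which points are drawn as seeds.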

Example 1: data points

Imagine the k-means clustering result.

K-means clustering result

Importance of Choosing Initial Guesses (1)

Importance of Choosing Initial Guesses (2)

Local optima of K-means

Supplement to K-means

- The local optima of K-means
- K-means tries to minimize the sum of squared point-centroid distances. The optimization is difficult (finding the global optimum is NP-hard in general).
- After each iteration of K-means the MSE (mean square error) does not increase. But K-means may converge to a local optimum, so K-means is sensitive to initial guesses.
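
The sensitivity to initial guesses can be reproduced on a tiny, hand-made data set. The sketch below is only an illustration under stated assumptions: the helper `lloyd`, the four rectangle corners, and the two sets of starting centers are invented for this example. Both runs converge, but to different sums of squared errors.

```python
import numpy as np

def lloyd(points, centers, max_iter=100):
    """K-means iterations from explicit starting centers; returns (centers, labels, sse)."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    for _ in range(max_iter):
        # Squared distance from every point to every center.
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Sum of squared point-centroid distances for the final assignment.
    sse = float(((points - centers[labels]) ** 2).sum())
    return centers, labels, sse

# Four corners of a 4 x 1 rectangle (hypothetical data).
points = [[0, 0], [0, 1], [4, 0], [4, 1]]

# Good guess: one center per short side, giving left and right clusters.
_, _, sse_good = lloyd(points, [[0, 0.5], [4, 0.5]])
# Bad guess: centers stacked in the middle, giving top and bottom clusters.
_, _, sse_bad = lloyd(points, [[2, 0], [2, 1]])

print(sse_good, sse_bad)
```

The second run is stuck: its assignment is stable, so the iterations stop, yet its error is much larger than that of the first run.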

Example 2: data points

Imagine the clustering result

Example 2: K-means clustering result

Limitations of K-means

- K-means has problems when clusters are of differing
- Sizes
- Densities
- Non-globular shapes
- K-means has problems when the data contains outliers
- One solution is to use many clusters
- This finds parts of clusters, which then need to be put together

Clustering: K-Means

- The main advantages of this algorithm are its simplicity and speed, which allow it to run on large datasets. Its disadvantage is that it does not yield the same result with each run, since the resulting clusters depend on the initial random assignments.
- It maximizes inter-cluster (or minimizes intra-cluster) variance, but does not ensure that the result has a global minimum of variance.

d_intra or d_inter? That is the question!

- d_intra is the distance among elements (points or objects, or whatever) of the same cluster.
- d_inter is the distance among clusters.
- Questions
- Should we use distance or similarity?
- Should we care about inter-cluster distance?
- Should we care about cluster shape?
- Should we care about clustering?

Distance Functions

- Informal definition
- The distance between two points is the length of a straight line segment between them.
- A more formal definition
- A distance between two points P and Q in a metric space is d(P,Q), where d is the distance function that defines the given metric space.
- We can also define the distance between two sets A and B in a metric space as being the minimum (or infimum) of distances between any two points P in A and Q in B.

Distance or Similarity?

- In a very straightforward way we can define the similarity function $sim: S \times S \to [0,1]$ as $sim(o_1,o_2) = 1 - d(o_1,o_2)$, where $o_1$ and $o_2$ are elements of the space S (this requires d to be normalized to [0,1]).

What does similar (or distant) really mean?

- Learning (either supervised or unsupervised) is impossible without ASSUMPTIONS
- Watanabe's Ugly Duckling theorem
- Wolpert's No Free Lunch theorem
- Learning is impossible without some sort of bias.

The Ugly Duckling Theorem

The theorem gets its fanciful name from the following counter-intuitive statement: assuming similarity is based on the number of shared predicates, an ugly duckling is as similar to a beautiful swan A as a beautiful swan B is to A, given that A and B differ at all. It was proposed and proved by Satosi Watanabe in 1969.

Satosi's Theorem

- Let n be the cardinality of the universal set S.
- We have to classify its objects without prior knowledge on the essence of categories.
- The number of different classes, i.e. the different ways to group the objects into clusters, is given by the cardinality of the power set of S:
- $|Pow(S)| = 2^n$
- Without any prior information, the most natural way to measure the similarity between two distinct objects is to count the number of classes they share.
- Oooops... They share exactly the same number of classes, namely $2^{n-2}$.

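The ugly duckling and 3 beautiful swans

- $S = \{o_1, o_2, o_3, o_4\}$
- $Pow(S) = \{\emptyset, \{o_1\}, \{o_2\}, \{o_3\}, \{o_4\}, \{o_1,o_2\}, \{o_1,o_3\}, \{o_1,o_4\}, \{o_2,o_3\}, \{o_2,o_4\}, \{o_3,o_4\}, \{o_1,o_2,o_3\}, \{o_1,o_2,o_4\}, \{o_1,o_3,o_4\}, \{o_2,o_3,o_4\}, \{o_1,o_2,o_3,o_4\}\}$
- How many classes do $o_i$ and $o_j$ ($i \ne j$) have in common?
- $o_1$ and $o_3$
- 4
- $o_1$ and $o_4$
- 4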

The ugly duckling and 3 beautiful swans

- In binary: 0000, 0001, 0010, 0100, 1000, ...
- Choose two objects.
- Reorder the bits so that the chosen objects are represented by the first two bits.
- How many strings have the first two bits both set to 1?
- $2^{n-2}$
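
The counting argument can be checked by brute force for small n. The sketch below enumerates every subset ("class") of a set of n objects and counts those containing a given pair; the function name `shared_classes` and the encoding of objects as integers 0..n-1 are assumptions made for the illustration.

```python
from itertools import combinations

def shared_classes(n, i, j):
    """Count the subsets of {0, ..., n-1} that contain both object i and object j."""
    objects = range(n)
    count = 0
    for size in range(n + 1):
        for subset in combinations(objects, size):
            if i in subset and j in subset:
                count += 1
    return count

if __name__ == "__main__":
    n = 4
    for i, j in combinations(range(n), 2):
        # Every pair shares the same number of classes, 2 ** (n - 2).
        print(i, j, shared_classes(n, i, j))
```

For n = 4 every pair shares 4 classes, which agrees with the count on the slide; no pair is more similar than any other under this measure.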

Wolpert's No Free Lunch Theorem

- For any two algorithms, A and B, there exist datasets for which algorithm A outperforms algorithm B in prediction accuracy on unseen instances.
- Proof: Take any Boolean concept. If A outperforms B on unseen instances, reverse the labels and B will outperform A.

So Let's Get Back to Distances

- In a metric space a distance is a function $d: S \times S \to \mathbb{R}$ such that, if a, b, c are elements of S:
- $d(a,b) \ge 0$
- $d(a,b) = 0$ iff $a = b$
- $d(a,b) = d(b,a)$
- $d(a,c) \le d(a,b) + d(b,c)$
- The fourth property (triangular inequality) holds only if we are in a metric space.

Minkowski Distance

- Let's consider two elements of a set S described by their feature vectors:
- $x = (x_1, x_2, \dots, x_n)$
- $y = (y_1, y_2, \dots, y_n)$
- The Minkowski Distance is parametric in $p \ge 1$:
- $d_p(x,y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$

p = 1. Manhattan Distance

- If p = 1 the distance is called Manhattan Distance.
- It is also called taxicab distance because it is the distance a car would drive in a city laid out in square blocks (if there are no one-way streets).

p = 2. Euclidean Distance

- If p = 2 the distance is the well known Euclidean Distance.

p = ∞. Chebyshev Distance

- If p = ∞ then we must take the limit, which gives $d_\infty(x,y) = \max_i |x_i - y_i|$.
- It is also called chessboard Distance.
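
The three special cases of the Minkowski distance can be computed with one small function. This is a sketch under the usual definition, the p-th root of the sum of |x_i - y_i|^p, with the maximum absolute difference as the limit for infinite p; the function name `minkowski` is an assumption.

```python
import math

def minkowski(x, y, p):
    """Minkowski distance between equal-length vectors; p may be math.inf."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        # Limit for p -> infinity: Chebyshev (chessboard) distance.
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

if __name__ == "__main__":
    x, y = (0, 0), (3, 4)
    print(minkowski(x, y, 1))         # Manhattan
    print(minkowski(x, y, 2))         # Euclidean
    print(minkowski(x, y, math.inf))  # Chebyshev
```

For the points (0,0) and (3,4) the three distances are 7, 5 and 4: the value shrinks as p grows, because large p lets the largest coordinate difference dominate.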

2D Cosine Similarity

- It's easy to explain in 2D.
- Let's consider a = (x1, y1) and b = (x2, y2).

[Figure: vectors a and b separated by the angle θ]

Cosine Similarity

- Let's consider two points x, y in $\mathbb{R}^n$.
- $\cos\theta = \frac{x \cdot y}{\|x\| \, \|y\|}$
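
As a small illustration, cosine similarity in n dimensions is the dot product of the two vectors divided by the product of their Euclidean norms. The function name `cosine_similarity` and the decision to reject zero vectors are assumptions of this sketch.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        # The angle is undefined for a zero vector.
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_x * norm_y)

if __name__ == "__main__":
    print(cosine_similarity((1, 0), (0, 1)))        # orthogonal vectors
    print(cosine_similarity((1, 2, 3), (2, 4, 6)))  # same direction
```

Because only the angle matters, a document and a copy of it repeated twice (which doubles every term count) have similarity 1.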

Jaccard Distance

- Another commonly used distance is the Jaccard Distance.

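Binary Jaccard Distance

- In the case of binary feature vectors the Jaccard Distance can be simplified to $d_J(x,y) = \frac{M_{01} + M_{10}}{M_{01} + M_{10} + M_{11}}$, where $M_{ab}$ is the number of positions in which x has value a and y has value b.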
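
A short sketch of both forms follows, assuming the standard set definition (one minus the size of the intersection over the size of the union) and its binary-vector counterpart. The function names, and the convention that two empty sets are at distance 0, are assumptions of the example.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty sets are considered identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)

def binary_jaccard_distance(x, y):
    """Jaccard distance between two equal-length 0/1 vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    both = sum(1 for a, b in zip(x, y) if a and b)    # positions set in both
    either = sum(1 for a, b in zip(x, y) if a or b)   # positions set in at least one
    return 0.0 if either == 0 else 1.0 - both / either

if __name__ == "__main__":
    print(jaccard_distance({"apple", "banana"}, {"banana", "coffee"}))
    print(binary_jaccard_distance((1, 1, 0), (0, 1, 1)))
```

The two calls in the usage example describe the same pair of documents, once as term sets and once as bitvectors, so they return the same value.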

Edit Distance

- The Levenshtein distance or edit distance between two strings is given by the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character.
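
The definition above is usually computed with dynamic programming. The following is a minimal sketch of that approach, keeping only two rows of the table; the function name `levenshtein` is an assumption.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions or substitutions."""
    # prev[j] holds the distance between the first i-1 characters of s
    # and the first j characters of t.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(
                prev[j] + 1,         # delete cs
                curr[j - 1] + 1,     # insert ct
                prev[j - 1] + cost,  # substitute cs with ct (free if equal)
            ))
        prev = curr
    return prev[-1]

if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

The running time is proportional to the product of the two string lengths, which is one reason edit distance is rarely applied to whole web documents.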

Binary Edit Distance

- The binary edit distance, d(x,y), from a binary vector x to a binary vector y is the minimum number of simple flips required to transform one vector to the other.

x = (0,1,0,0,1,1), y = (1,1,0,1,0,1), d(x,y) = 3

The binary edit distance is equivalent to the Manhattan distance (Minkowski p = 1) for binary feature vectors.
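
The equivalence can be checked directly on the vectors from the slide. In this sketch the function names are assumptions; the binary edit distance is computed as the number of differing positions, and the Manhattan distance as the sum of absolute differences.

```python
def binary_edit_distance(x, y):
    """Number of bit flips needed to turn binary vector x into binary vector y."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(1 for a, b in zip(x, y) if a != b)

def manhattan(x, y):
    """Minkowski distance with p = 1."""
    return sum(abs(a - b) for a, b in zip(x, y))

if __name__ == "__main__":
    x = (0, 1, 0, 0, 1, 1)
    y = (1, 1, 0, 1, 0, 1)
    print(binary_edit_distance(x, y), manhattan(x, y))
```

Both functions return 3 for this pair, because for 0/1 entries each differing position contributes exactly 1 to the sum of absolute differences.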

The Curse of High Dimensionality

- Dimensionality is one of the main problems to face when clustering data.
- Roughly speaking, the higher the dimensionality, the lower the power of recognizing similar objects.

Volume of the Unit-Radius Sphere

Sphere/Cube Volume Ratio

- Unit-radius sphere / cube whose edge length is 2.

Sphere/Sphere Volume Ratio

- Two embedded spheres. Radii 1 and 0.9.
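
The plots for these slides are not preserved in the transcript, but the quantities they name can be recomputed from the closed-form volume of an n-dimensional ball, V_n(r) = pi^(n/2) / Gamma(n/2 + 1) * r^n. The sketch below is an illustration under that formula; the function names are assumptions.

```python
import math

def ball_volume(n, r=1.0):
    """Volume of the n-dimensional ball of radius r."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1) * r ** n

def sphere_to_cube_ratio(n):
    """Unit-radius ball volume divided by the volume of the cube with edge length 2."""
    return ball_volume(n) / 2.0 ** n

def inner_to_outer_ratio(n, inner=0.9, outer=1.0):
    """Fraction of the outer ball's volume occupied by the embedded inner ball."""
    return (inner / outer) ** n

if __name__ == "__main__":
    for n in (1, 2, 3, 5, 10, 20):
        print(n, ball_volume(n), sphere_to_cube_ratio(n), inner_to_outer_ratio(n))
```

Both ratios shrink toward zero as n grows: almost all of the cube's volume lies in its corners, and almost all of the ball's volume lies in the thin shell between radius 0.9 and radius 1.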

Concentration of the Norm Phenomenon

- Gaussian distributions of points (std. dev. 1). Dimensions (1, 2, 3, 5, 10, and 20).
- Probability density functions of finding a point, drawn according to a Gaussian distribution, at distance r from the center of that distribution.
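
The curves described above can be recomputed under one assumption: that the points follow a standard Gaussian with independent coordinates. In that case the distance r from the center follows the chi distribution with n degrees of freedom, whose density is r^(n-1) exp(-r^2/2) / (2^(n/2-1) Gamma(n/2)). The function names below are assumptions of this sketch.

```python
import math

def radius_pdf(r, n):
    """Density of the distance from the center for an n-dimensional standard Gaussian."""
    return r ** (n - 1) * math.exp(-r * r / 2) / (2 ** (n / 2 - 1) * math.gamma(n / 2))

def most_likely_radius(n, step=0.001, r_max=15.0):
    """Grid search for the radius at which the density peaks."""
    grid = [i * step for i in range(1, int(r_max / step))]
    return max(grid, key=lambda r: radius_pdf(r, n))

if __name__ == "__main__":
    for n in (1, 2, 3, 5, 10, 20):
        print(n, round(most_likely_radius(n), 3))
```

For n greater than 1 the peak sits at sqrt(n - 1), so it moves away from the center as the dimension grows while the width of the curve stays roughly constant. Relative to their magnitude, all distances from the center therefore look alike in high dimensions.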

Web Document Representation

- The Web can be characterized in three different ways:
- Content.
- Structure.
- Usage.
- We are concerned with Web Content information.

Bag-of-Words vs. Vector-Space

- Let C be a collection of N documents d1, d2, ..., dN.
- Each document is composed of terms drawn from a term-set of dimension T.
- Each document can be represented in two different ways:
- Bag-of-Words.
- Vector-Space.

Bag-of-Words

- In the bag-of-words model the document is represented as d = {Apple, Banana, Coffee, Peach}
- Each term is represented.
- No information on frequency.
- Binary encoding: t-dimensional bitvector.

Apple Peach Apple Banana Apple Banana Coffee Apple Coffee

d = [1,0,1,1,0,0,0,1].

Vector-Space

- In the vector-space model the document is represented as d = {<Apple,4>, <Banana,2>, <Coffee,2>, <Peach,1>}
- Information about frequency is recorded.
- t-dimensional vectors.

Apple Peach Apple Banana Apple Banana Coffee Apple Coffee

d = [4,0,2,2,0,0,0,1].
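
Both encodings can be derived from the example document with a few lines of code. The slides do not list the full term-set, so the eight-term vocabulary below is hypothetical: only the positions of Apple, Banana, Coffee and Peach are taken from the vectors on the slides, and the other four terms are fillers invented for the sketch.

```python
from collections import Counter

# Hypothetical term-set: Avocado, Grape, Lemon and Mango are made-up fillers.
VOCAB = ["Apple", "Avocado", "Banana", "Coffee", "Grape", "Lemon", "Mango", "Peach"]

def bag_of_words(tokens, vocab):
    """Binary encoding: 1 if the term occurs in the document, 0 otherwise."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocab]

def term_frequencies(tokens, vocab):
    """Vector-space encoding: number of occurrences of each term."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

document = "Apple Peach Apple Banana Apple Banana Coffee Apple Coffee".split()
bits = bag_of_words(document, VOCAB)
freqs = term_frequencies(document, VOCAB)

print(bits)
print(freqs)
```

With this vocabulary order the two results match the vectors shown on the slides. Real collections would store both forms sparsely, since almost every entry is zero.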

Typical Web Collection Dimensions

- No. of documents: 20B
- No. of terms is approx. 150,000,000!!!
- Very high dimensionality
- Each document contains from 100 to 1000 terms.
- Classical clustering algorithms cannot be used.
- Each document is very similar to the others due to the geometric properties just seen.

We Need to Cope with High Dimensionality

- Possible solutions
- JUST ONE: Reduce Dimensions!!!!

Dimensionality reduction

- Dimensionality reduction approaches can be divided into two categories:
- Feature selection approaches try to find a subset of the original features. Optimal feature selection for supervised learning problems requires an exhaustive search of all possible subsets of features of the chosen cardinality.
- Feature extraction applies a mapping of the multidimensional space into a space of fewer dimensions. This means that the original feature space is transformed by applying e.g. a linear transformation via a principal components analysis.

PCA - Principal Component Analysis

- In statistics, principal components analysis (PCA) is a technique that can be used to simplify a dataset.
- More formally it is a linear transformation that chooses a new coordinate system for the data set such that the greatest variance by any projection of the data set comes to lie on the first axis (then called the first principal component), the second greatest variance on the second axis, and so on.
- PCA can be used for reducing dimensionality in a dataset while retaining those characteristics of the dataset that contribute most to its variance by eliminating the later principal components (by a more or less heuristic decision).

The Method

- Suppose you have a random vector population x
- $x = (x_1, x_2, \dots, x_n)^T$
- and the mean
- $\mu_x = E\{x\}$
- and the covariance matrix
- $C_x = E\{(x - \mu_x)(x - \mu_x)^T\}$

The Method

- The components of $C_x$, denoted by $c_{ij}$, represent the covariances between the random variable components $x_i$ and $x_j$. The component $c_{ii}$ is the variance of the component $x_i$. The variance of a component indicates the spread of the component values around its mean value. If two components $x_i$ and $x_j$ of the data are uncorrelated, their covariance is zero ($c_{ij} = c_{ji} = 0$).
- The covariance matrix is, by definition, always symmetric.
- From a sample of vectors $x_1, x_2, \dots, x_M$ we can calculate the sample mean and the sample covariance matrix as the estimates of the mean and the covariance matrix.

The Method

- From a symmetric matrix such as the covariance matrix, we can calculate an orthogonal basis by finding its eigenvalues and eigenvectors. The eigenvectors $e_i$ and the corresponding eigenvalues $\lambda_i$ are the solutions of the equation $C_x e_i = \lambda_i e_i \Rightarrow |C_x - \lambda I| = 0$.
- By ordering the eigenvectors in the order of descending eigenvalues (largest first), one can create an ordered orthogonal basis with the first eigenvector having the direction of largest variance of the data. In this way, we can find directions in which the data set has the most significant amounts of energy.

The Method

- To reduce the dimensionality from n to k, take the first k eigenvectors of the basis ordered by descending eigenvalue.

A Graphical Example

PCA Eigenvectors Projection

- Project the data onto the selected eigenvectors
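
The whole method (sample mean, sample covariance matrix, eigen-decomposition, ordering by descending eigenvalue, projection onto the first k eigenvectors) can be sketched with NumPy as follows. The function name `pca`, the rows-as-samples layout and the M - 1 normalization of the sample covariance are assumptions of this illustration.

```python
import numpy as np

def pca(samples, k):
    """Project M samples (rows) with n features onto the first k principal components."""
    X = np.asarray(samples, dtype=float)
    mean = X.mean(axis=0)                           # sample mean
    centered = X - mean
    cov = centered.T @ centered / (len(X) - 1)      # sample covariance matrix (n x n)
    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]               # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :k]                     # first k eigenvectors as columns
    projected = centered @ components               # coordinates in the reduced space
    return projected, components, eigvals

if __name__ == "__main__":
    # Points lying exactly on the line y = 2x: one direction carries all the variance.
    data = [[0, 0], [1, 2], [2, 4], [3, 6]]
    projected, components, eigvals = pca(data, k=1)
    print(components.ravel())
    print(eigvals)
```

In the usage example the first eigenvector is parallel to (1, 2) up to sign and the second eigenvalue is numerically zero, so reducing from two dimensions to one loses nothing.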

Another Example

Singular Value Decomposition

- Is a technique used for reducing dimensionality based on some properties of symmetric matrices.
- Will be the subject of a talk given by one of you!!!!

Locality-Sensitive Hashing

- The key idea of this approach is to create a small signature for each document, to ensure that similar documents have similar signatures.
- There exists a family H of hash functions such that for each pair of pages u, v we have $\Pr[mh(u) = mh(v)] = sim(u,v)$, where the hash function mh is chosen at random from the family H.
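
The slide does not say which similarity or which family H is meant. One well-known instance, assumed here, is min-hashing: documents are sets of terms, sim is the Jaccard similarity, and each hash function corresponds to a permutation of the term universe, mapping a document to its first term under that permutation. For a tiny universe the collision probability can be computed exactly by enumerating every permutation; all names below are hypothetical.

```python
from itertools import permutations

def jaccard(u, v):
    """Jaccard similarity of two non-empty sets."""
    return len(u & v) / len(u | v)

def minhash(doc, rank):
    """Term of `doc` that comes first under the permutation described by `rank`."""
    return min(doc, key=rank.__getitem__)

def collision_probability(u, v, universe):
    """Exact Pr[mh(u) == mh(v)] over all permutations of a small universe."""
    agree = total = 0
    for perm in permutations(sorted(universe)):
        rank = {term: pos for pos, term in enumerate(perm)}
        agree += minhash(u, rank) == minhash(v, rank)
        total += 1
    return agree / total

u = {"a", "b", "c"}
v = {"b", "c", "d"}
universe = {"a", "b", "c", "d", "e"}

print(jaccard(u, v), collision_probability(u, v, universe))
```

The two printed values coincide. In practice one would not enumerate permutations but draw a modest number of random hash functions and use the fraction of agreeing min-hashes as an estimate of the similarity, which is what makes the signatures small.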

Locality-Sensitive Hashing

- Will be the subject of a talk given by one of you!!!!!!!
