Algorithms (wait, Math?) Everywhere - PowerPoint PPT Presentation

About This Presentation

Description:

Title: The Multi-Disciplinary Nature of Technology Author: kruse Last modified by: Kruse, Gerald (KRUSE) Created Date: 5/18/2004 2:59:12 PM Document presentation format – PowerPoint PPT presentation

Slides: 59

Provided by: kru3

Learn more at: https://jcsites.juniata.edu

Transcript and Presenter's Notes

1
Algorithms (wait, Math?) Everywhere

Gerald Kruse, Ph.D.
John '54 and Irene '58 Dale Professor of MA, CS, and IT
Interim Assistant Provost 2013-14
Juniata College
Huntingdon, PA
kruse_at_juniata.edu
http://faculty.juniata.edu/kruse

2
Some Context / Confessions

Prepare to be underwhelmed. I can't return the hour or so you spend here.
I am impressed by the elegance of the algorithms I will present today, and I will probably try too hard to explain the underlying math (but it's so cool).
We like and depend on many automated processes; we just have issues implementing or interacting with them.
But when we understand an algorithm, we can manipulate it. (My CS 315 students Google Bombed Juniata, in a good way.)
Are we really surprised to learn that a Google search isn't free?

3
What movie should we pick?
$1,000,000 to the first algorithm that was 10% better than Netflix's original algorithm
5
The first 8% improvement was easy
"Just A Guy In A Garage": psychiatrist father and hacker daughter team
6
The first 8% improvement was easy
Team from Bell Labs ended up winning
7
Here's an interesting billboard, from a few years ago in Silicon Valley
8
First 70 digits of e:
2.718281828459045235360287471352662497757247093699959574966967627724077
11
What happened for those who found the answer?

The answer is 7427466391
Those who typed in the URL, http://7427466391.com, ended up getting another puzzle. Solving that led them to a page with a job application for Google!

13
(1) Just what does it take to solve that problem?
Calculations (most probably on a computer), knowledge of number theory, and a general aptitude for and interest in problem solving.
First Question
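
The billboard itself is not captured in this transcript. It is widely reported to have asked for the first 10-digit prime found in consecutive digits of e. Assuming that reading of the puzzle, a minimal Python sketch of the calculation might look like this. The function names and the 150-digit search window are illustrative choices, not part of the talk.

```python
from decimal import Decimal, getcontext


def is_prime(n: int) -> bool:
    """Trial division; fast enough for 10-digit candidates."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def first_prime_in_e(width: int = 10, digits: int = 150) -> int:
    """Return the first `width`-digit prime in the leading digits of e."""
    getcontext().prec = digits + 10  # a few guard digits beyond the window
    e_digits = str(Decimal(1).exp()).replace(".", "")[:digits]
    for start in range(len(e_digits) - width + 1):
        window = e_digits[start:start + width]
        # A window with a leading zero is not a true 10-digit number.
        if window[0] != "0" and is_prime(int(window)):
            return int(window)
    raise ValueError("no prime found; try more digits")


print(first_prime_in_e())
```

If the assumption about the puzzle is right, this prints the answer given on the slide, 7427466391.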
16
(2) Why does Google want to hire people who know how to find that number, and what does it have to do with a search engine?
Hmmm... Google wants you to choose it for your web searches.
Maybe their algorithms are mathematically based?
Second Question
17

Google-ing Google

Results in an early paper from Page, Brin, et al., written while they were in graduate school

19
Search Engines: We've all used them, but what is under the hood?

Crawl the web and locate all public pages
Index the crawled data so it can be searched
Rank the pages for more effective searching (the math part of this talk)
Each word which is searched on is linked with a list of pages (just URLs) which contain it. The pages with the highest rank are returned first.
- can't get a snapshot of the web at a particular instant
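
The word-to-pages lookup described on this slide can be sketched in a few lines of Python. Everything here is hypothetical: the URLs, the page text, and the rank scores are made up for illustration, and a real search engine's index is far more elaborate.

```python
from collections import defaultdict

# Hypothetical crawled pages (URL -> text) and made-up rank scores.
pages = {
    "a.example/home": "linear algebra notes",
    "b.example/blog": "algebra homework help",
    "c.example/faq": "search engine notes",
}
rank = {"a.example/home": 0.5, "b.example/blog": 0.2, "c.example/faq": 0.3}


def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, rank, word):
    """Return the URLs containing `word`, highest rank first."""
    hits = index.get(word.lower(), ())
    return sorted(hits, key=lambda url: rank[url], reverse=True)


index = build_index(pages)
print(search(index, rank, "algebra"))
print(search(index, rank, "notes"))
```

The index is built once, so answering a query is only a dictionary lookup followed by a sort on the precomputed ranks.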

20
Note: Google's PageRank uses the link structure (crowd sourcing) of the World Wide Web to determine a page's rank; it doesn't grade the content of a page.
23
PageRank is NOT a simple citation index
Which is the more popular page below, A or B?
What if the links to A were from unpopular pages, and the one link to B was from www.yahoo.com? (High School)
[Figure: pages A and B with their incoming links]

NOTE: Rankings based on a citation index would be very easy to manipulate.
PageRank has evolved to be a minor part of Google's search results.

26
Intuitively, PageRank is analogous to popularity

The web as a graph: each page is a vertex, each hyperlink a directed edge.
A page is popular if a few very popular pages point (via hyperlinks) to it.
A page could be popular if many not-necessarily popular pages point (via hyperlinks) to it.

[Figure: a web graph of three pages, Page A, Page B, and Page C]
Which of these three would have the highest page rank?
27
So what is the mathematical definition of PageRank?

In particular, a page's rank is equal to the sum of the ranks of all the pages pointing to it.
Note the scaling of each page rank.
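
The formula itself did not survive in this transcript. A reconstruction in the style of Bryan and Leise is given below. The symbol names are assumptions rather than the slide's own: $x_k$ is the rank of page $k$, $L_k$ is the set of pages that link to page $k$, and $n_j$ is the number of outgoing links on page $j$.

```latex
x_k = \sum_{j \in L_k} \frac{x_j}{n_j}
```

Dividing by $n_j$ is the scaling: each page splits its own rank evenly among the pages it links to.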

28
Writing out the equation for each web-page in our example gives
[Equations for Page A, Page B, and Page C, shown on the slide but not captured in this transcript]
31
Even though this is a circular definition, we can calculate the ranks.
Re-write the system of equations as a Matrix-Vector product.
The PageRank vector is simply an eigenvector of the coefficient matrix, with eigenvalue $\lambda = 1$.
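
The example graph is not recoverable from this transcript, so as an illustration assume a hypothetical three-page web in which A links to B and C, B links to C, and C links to A. With $x_A$, $x_B$, $x_C$ as the ranks (assumed symbol names), the per-page equations and their matrix-vector form would be:

```latex
\begin{aligned}
x_A &= x_C \\
x_B &= \tfrac{1}{2}\, x_A \\
x_C &= \tfrac{1}{2}\, x_A + x_B
\end{aligned}
\qquad\Longleftrightarrow\qquad
\begin{pmatrix} x_A \\ x_B \\ x_C \end{pmatrix}
=
\begin{pmatrix}
0 & 0 & 1 \\
\tfrac{1}{2} & 0 & 0 \\
\tfrac{1}{2} & 1 & 0
\end{pmatrix}
\begin{pmatrix} x_A \\ x_B \\ x_C \end{pmatrix}
```

Column $j$ of the matrix describes where page $j$ sends its rank, so each column sums to 1. Solving the system is the same as finding a vector that the matrix leaves unchanged.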
32
Wait, what's an eigenvector?
33
[Figure: the graph of Page A, Page B, and Page C, labeled with PageRanks 0.4, 0.2, and 0.4]
Note: we choose the eigenvector whose entries sum to 1.
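
As a sanity check on ranks like these, here is a minimal Python sketch that repeatedly multiplies a guess by the link matrix until it settles. The three-page graph used (A links to B and C, B links to C, C links to A) is a hypothetical stand-in chosen because it yields ranks of 0.4, 0.2, and 0.4. It is not necessarily the graph drawn on the slide.

```python
# Column-stochastic link matrix for the assumed graph:
# column j holds 1/(number of out-links of page j) in the rows page j links to.
A = [
    [0.0, 0.0, 1.0],  # A is linked from C
    [0.5, 0.0, 0.0],  # B is linked from A
    [0.5, 1.0, 0.0],  # C is linked from A and B
]


def mat_vec(matrix, vec):
    """Plain matrix-vector product."""
    return [sum(a * x for a, x in zip(row, vec)) for row in matrix]


def iterate(matrix, steps=200):
    """Start from equal ranks and apply the matrix `steps` times."""
    n = len(matrix)
    x = [1.0 / n] * n
    for _ in range(steps):
        x = mat_vec(matrix, x)
    return x


ranks = iterate(A)
print([round(r, 3) for r in ranks])
```

Because the starting entries sum to 1 and every column of the matrix sums to 1, the entries keep summing to 1, so no separate normalization step is needed in this sketch.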
34
Implementation Details

Billions of web-pages would make a huge matrix
The matrix (in theory) is column-stochastic,
which allows for iterative calculation
Previous PageRank is used as an initial guess
Random-Surfer term handles computational
difficulties associated with a disconnected
graph

35
Wait, what else gets searched?
40

Attempts to Manipulate Search Results Via a Google Bomb

Liberals vs. Conservatives!
In 2007, Google addressed Google Bombs because too many people thought the results were intentional and not merely a function of the structure of the web.

Juniata's own Google Bomb

At Juniata, CS 315 is my Analysis and Algorithms course.

Miscellaneous points

Try a search in Google on PigeonRank.
What types of sites would Google NOT give good results on?
PageRank has been deprecated. Google is continuously trying new ranking algorithms.

SPAM filters

A rules approach: filter out all messages with things like "Dear Friend" or "Click".
The first 80% is captured easily, with few false positives.
But the last few percent (remember Netflix) will be difficult to catch, the rules will produce many more false positives, and the SPAMmers can adapt.
A statistical approach, called a Bayesian filter, is much more effective.
It learns from a given set of SPAM and non-SPAM emails, automatically counting the frequency of words.
Some words are incriminating, like "Madam"; others almost guarantee the email is non-SPAM, like "describe" or "example".
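
To make the word-counting idea concrete, here is a minimal naive Bayes sketch in Python. The training messages are tiny and made up, the add-one smoothing is one common choice among several, and all names are illustrative. It is not a description of any particular production filter.

```python
import math
from collections import Counter

# Tiny made-up training sets; a real filter learns from thousands of emails.
spam = ["dear friend click here", "click now dear madam", "madam claim your prize"]
ham = ["please describe the example", "for example see the notes", "describe your question"]


def word_counts(messages):
    """Count how often each word appears across a list of messages."""
    return Counter(word for msg in messages for word in msg.lower().split())


def train(spam, ham):
    """Collect word frequencies for both classes."""
    spam_counts, ham_counts = word_counts(spam), word_counts(ham)
    vocab = set(spam_counts) | set(ham_counts)
    return spam_counts, ham_counts, vocab, len(spam), len(ham)


def spam_probability(message, model):
    """Naive Bayes estimate of P(spam | words), with add-one smoothing."""
    spam_counts, ham_counts, vocab, n_spam, n_ham = model
    log_spam = math.log(n_spam / (n_spam + n_ham))
    log_ham = math.log(n_ham / (n_spam + n_ham))
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())
    for word in message.lower().split():
        if word not in vocab:
            continue  # ignore words never seen in training
        log_spam += math.log((spam_counts[word] + 1) / (spam_total + len(vocab)))
        log_ham += math.log((ham_counts[word] + 1) / (ham_total + len(vocab)))
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


model = train(spam, ham)
print(spam_probability("dear madam click", model) > 0.5)
print(spam_probability("describe an example", model) > 0.5)
```

Unlike a fixed rule list, the word frequencies are relearned whenever the training sets are updated, which is what lets this kind of filter keep up as the SPAM changes.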

Bibliography

[1] S. Brin, L. Page, et al., The PageRank Citation Ranking: Bringing Order to the Web, http://dbpubs.stanford.edu/pub/1999-66, Stanford Digital Libraries Project (January 29, 1998).
[2] K. Bryan and T. Leise, The $25,000,000,000 Eigenvector: The Linear Algebra behind Google, SIAM Review, 48 (2006), pp. 569-581.
[3] G. Strang, Linear Algebra and Its Applications, Brooks-Cole, Boston, MA, 2005.
[4] D. Poole, Linear Algebra: A Modern Introduction, Brooks-Cole, Boston, MA, 2005.
47
Any Questions?

Slides available at http://faculty.juniata.edu/kruse

48
The following slides give some of the more
in-depth mathematics behind Google
50
A Graphical Interpretation of a 2-Dimensional Eigenvector
http://cnx.org/content/m10736/latest/
If we have some 2-D vector $x$ and some 2 x 2 matrix $A$, generally their product, $Ax = b$, will result in a new vector, $b$, which points in a different direction and has a different length than $x$. But if the vector ($v$ in the image at the left) is an eigenvector of $A$, then $Av$ gives a vector which points in the same direction as $v$, just scaled to a different length, by $\lambda$. Note that $\lambda$ is called an eigenvalue of $A$.
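
A small worked example may help. The matrix and vectors below are chosen for illustration and are not from the slides.

```latex
\begin{gathered}
A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad
x = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \qquad
v = \begin{pmatrix} 1 \\ 1 \end{pmatrix} \\[4pt]
Ax = \begin{pmatrix} 2 \\ 1 \end{pmatrix}
\quad \text{(a new direction)}, \qquad
Av = \begin{pmatrix} 3 \\ 3 \end{pmatrix} = 3v
\quad \text{(same direction, scaled by } \lambda = 3 \text{)}
\end{gathered}
```

So $v$ is an eigenvector of this $A$ with eigenvalue 3, while $x$ is not an eigenvector.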
51
Note that the coefficient matrix is column-stochastic
Every column-stochastic matrix has 1 as an eigenvalue. The coefficient matrix is column-stochastic as long as there are no dangling nodes; a disconnected graph causes a different problem, non-unique rankings.
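
One way to see why 1 is always an eigenvalue is the following standard argument, sketched here rather than taken from the slide. Let $A$ be an $n \times n$ column-stochastic matrix and let $\mathbf{1}$ denote the column vector of all ones (notation assumed). Each column of $A$ sums to 1, which is the same as saying each row of $A^{T}$ sums to 1, so:

```latex
A^{T}\mathbf{1} = \mathbf{1}
\quad\Longrightarrow\quad
\det\!\left(A^{T} - I\right) = 0
\quad\Longrightarrow\quad
\det\!\left(A - I\right) = 0
```

The last step uses the fact that a matrix and its transpose have the same determinant. Hence $\lambda = 1$ is an eigenvalue of $A$ as well.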
52
Dangling Nodes have no outgoing links
In this example, Page C is a dangling node. Note that its associated column in the coefficient matrix is all 0. Matrices like these are called column-substochastic.
[Figure: a web graph of Page A, Page B, and Page C, in which Page C has no outgoing links]
In Page, Brin, et al. [1], they suggest dangling nodes most likely would occur from pages which haven't been crawled yet, and so they simply remove them from the system until all the PageRanks are calculated.
It is interesting to note that a column-substochastic matrix does have a positive eigenvalue and a corresponding eigenvector with non-negative entries, which is called the Perron eigenvector, as detailed in Bryan and Leise [2].
53
A disconnected graph could lead to non-unique rankings
Notice the block diagonal structure of the coefficient matrix. Note: re-ordering via permutation doesn't change the ranking, as in [2].
[Figure: a disconnected web graph of Page A, Page B, Page C, Page D, and Page E]
In this example, the eigenspace associated with eigenvalue $\lambda = 1$ is two-dimensional. Which eigenvector should be used for ranking?
54
Add a random-surfer term to the simple PageRank formula.
Let $S$ be an $n \times n$ matrix with all entries $1/n$. $S$ is column-stochastic, and we consider the matrix $M$, which is a weighted average of $A$ and $S$:

$M = (1 - m)A + mS$

This models the behavior of a real web-surfer, who might jump to another page by directly typing in a URL or by choosing a bookmark, rather than clicking on a hyperlink. Originally, $m = 0.15$ in Google, according to [2].
$x = Mx$ can also be written as

$x = (1 - m)Ax + ms$

Important Note: We will use this formulation with $A$ when computing $x$. Here $s$ is a column vector with all entries $1/n$, where $Sx = s$ if $\sum_i x_i = 1$.
55
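$M$ for our previous disconnected graph, with $m = 0.15$
[Figure: the disconnected web graph of Page A, Page B, Page C, Page D, and Page E, with its matrix $M$]
The eigenspace associated with $\lambda = 1$ is one-dimensional, and the normalized eigenvector is shown on the slide (not captured in this transcript).
So the addition of the random surfer term permits comparison between pages in different subwebs.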
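
The slide's graph and eigenvector are not captured in this transcript, so the Python sketch below uses an assumed five-page web with two subwebs: pages 0 and 1 link to each other, pages 2 and 3 link to each other, and page 4 links to pages 2 and 3. It first shows that the plain link matrix admits more than one ranking, then that the random-surfer matrix $M = (1 - m)A + mS$ gives a single one. Function names are illustrative.

```python
M_WEIGHT = 0.15  # the value of m quoted on the slide

# Assumed link matrix; no links cross between the two subwebs.
A = [
    [0, 1, 0, 0, 0.0],
    [1, 0, 0, 0, 0.0],
    [0, 0, 0, 1, 0.5],
    [0, 0, 1, 0, 0.5],
    [0, 0, 0, 0, 0.0],
]


def mat_vec(matrix, vec):
    """Plain matrix-vector product."""
    return [sum(a * x for a, x in zip(row, vec)) for row in matrix]


def random_surfer_matrix(A, m):
    """M = (1 - m) A + m S, where every entry of S is 1/n."""
    n = len(A)
    return [[(1 - m) * a + m / n for a in row] for row in A]


def power_method(M, x, tol=1e-12, max_steps=1000):
    """Iterate x <- M x until successive guesses differ by less than `tol`."""
    for _ in range(max_steps):
        new_x = mat_vec(M, x)
        if sum(abs(a - b) for a, b in zip(new_x, x)) < tol:
            return new_x
        x = new_x
    return x


# Without the random surfer, two different vectors both satisfy A x = x.
v1 = [0.5, 0.5, 0.0, 0.0, 0.0]
v2 = [0.0, 0.0, 0.5, 0.5, 0.0]
print(mat_vec(A, v1) == v1, mat_vec(A, v2) == v2)

# With it, different starting guesses reach the same ranking.
M = random_surfer_matrix(A, M_WEIGHT)
from_uniform = power_method(M, [0.2] * 5)
from_skewed = power_method(M, [0.6, 0.1, 0.1, 0.1, 0.1])
print([round(r, 3) for r in from_uniform])
print([round(r, 3) for r in from_skewed])
```

For this assumed graph both runs settle on roughly (0.2, 0.2, 0.285, 0.285, 0.03). Page 4, which nothing links to, keeps only the $m/n = 0.03$ contributed by the random-surfer term.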
56
Iterative Calculation
By many estimates, the web currently contains at least 8 billion pages. How does Google compute an eigenvector for something this large?
One possibility is the power method.
In [2], it is shown that every positive (all entries are > 0) column-stochastic matrix $M$ has a unique vector $q$ with positive components such that $Mq = q$, with $\|q\|_1 = 1$, and it can be computed as $q = \lim_{k \to \infty} M^k x_0$, for any initial guess $x_0$ with positive components and $\|x_0\|_1 = 1$.
57
Iterative Calculation continued
Rather than calculating the powers of $M$ directly, we could use the iteration $x_k = M x_{k-1}$.
Since $M$ is positive, $M x_{k-1}$ would be an $O(n^2)$ calculation. As we mentioned previously, Google uses the equivalent expression $x_k = (1 - m)A x_{k-1} + ms$ in the computation.
These products can be calculated without explicitly creating the huge coefficient matrix, since $A$ contains mostly 0s.
The iteration is guaranteed to converge, and it will converge more quickly with a better first guess, so the previous PageRank vector is used as the initial vector.
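
A minimal Python sketch of this iteration, working directly from link lists so the matrix is never built, might look like the following. The four-page web, the function name, and the stopping tolerance are assumptions for illustration, and the sketch assumes every page has at least one outgoing link.

```python
def pagerank(out_links, m=0.15, start=None, tol=1e-10, max_steps=1000):
    """Iterate x <- (1 - m) A x + m s using only the link lists.

    `out_links` maps each page to the list of pages it links to; every page
    is assumed to have at least one out-link (no dangling nodes).
    Returns (ranks, number_of_steps_taken).
    """
    pages = list(out_links)
    n = len(pages)
    x = dict(start) if start else {p: 1.0 / n for p in pages}
    for step in range(1, max_steps + 1):
        new_x = {p: m / n for p in pages}          # the m * s term
        for page, targets in out_links.items():    # the (1 - m) A x term
            share = (1 - m) * x[page] / len(targets)
            for target in targets:
                new_x[target] += share
        change = sum(abs(new_x[p] - x[p]) for p in pages)
        x = new_x
        if change < tol:
            return x, step
    return x, max_steps


# Hypothetical four-page web.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["A", "C"]}

ranks, cold_steps = pagerank(links)               # start from equal ranks
_, warm_steps = pagerank(links, start=ranks)      # start from the old answer
print(cold_steps, warm_steps)
```

Each pass touches every link once, so the work grows with the number of links rather than with $n^2$. Restarting from the previous ranks on an unchanged web needs only a single confirming step. When the web has changed only slightly, the previous ranks are usually still a closer starting guess than equal ranks.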
58
This gives a regular matrix

In matrix notation we have $x = (1 - m)Ax + ms$.
Since $Sx = s$ when the entries of $x$ sum to 1, we can rewrite this as $x = ((1 - m)A + mS)x = Mx$.
The new coefficient matrix is regular, so we can calculate the eigenvector iteratively.
This iterative process is a series of matrix-vector products, beginning with an initial vector (typically the previous PageRank vector). These products can be calculated without explicitly creating the huge coefficient matrix.