Algorithms (wait, Math?) Everywhere

Title: The Multi-Disciplinary Nature of Technology
Author: kruse
Last modified by: Kruse, Gerald (KRUSE)
Created Date: 5/18/2004 2:59:12 PM
Document presentation format: PowerPoint PPT presentation


Transcript and Presenter's Notes


1
Algorithms (wait, Math?) Everywhere
  • Gerald Kruse, Ph.D.
  • John '54 and Irene '58 Dale Professor of MA, CS,
    and IT
  • Interim Assistant Provost 2013-14
  • Juniata College
  • Huntingdon, PA
  • kruse_at_juniata.edu
  • http://faculty.juniata.edu/kruse

2
Some Context / Confessions
  • Prepare to be underwhelmed. I can't return the
    hour or so you spend here.
  • I am impressed by the elegance of the algorithms
    I will present today, and I will probably try too
    hard to explain the underlying math (but it's so
    cool).
  • We like and depend on many automated processes;
    we just have issues implementing or interacting
    with them.
  • But, when we understand an algorithm, we can
    manipulate it. (My CS 315 students "Google
    Bombed" Juniata in a good way.)
  • Are we really surprised to learn that a Google
    search isn't free?

3
What movie should we pick?
$1,000,000 to the first algorithm that was 10% better than
Netflix's original algorithm
4
The first 8% improvement was easy
5
The first 8% improvement was easy
"Just A Guy In A Garage": psychiatrist father and
hacker daughter team
6
The first 8% improvement was easy
Team from Bell Labs ended up winning
7
Here's an interesting billboard, from a few years
ago in Silicon Valley
8
First 70 digits of e:
2.718281828459045235360287471352662497757247093699959574966967627724077
9
What happened for those who found the answer?
  • The answer is 7427466391

10
What happened for those who found the answer?
  • The answer is 7427466391
  • Those who typed in the URL, http://7427466391.com,
    ended up getting another puzzle. Solving that
    led them to a page with a job application for

11
What happened for those who found the answer?
  • The answer is 7427466391
  • Those who typed in the URL, http://7427466391.com,
    ended up getting another puzzle. Solving that
    led them to a page with a job application for
  • Google!

12
(1) Just what does it take to solve that
problem?
First Question
13
(1) Just what does it take to solve that
problem? Calculations (most probably on a
computer), knowledge of number theory, a general
aptitude and interest in problem solving.
First Question
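The billboard itself is not in this transcript; it is widely reported as asking for the first 10-digit prime found in consecutive digits of e. Assuming that wording, the "calculations on a computer" might be sketched as below. The function names are made up for this illustration, and the digits of e are generated from the series for e with integer arithmetic rather than typed in.

```python
def e_digits(n):
    """Return the first n digits of e as a string (leading 2 included)."""
    guard = 10                       # extra digits to absorb truncation error
    scale = 10 ** (n + guard)
    total, term, k = 0, scale, 0     # term holds scale / k!
    while term:                      # sum of 1/k! until terms vanish
        total += term
        k += 1
        term //= k
    return str(total)[:n]


def is_prime(m):
    """Trial division; cheap enough for 10-digit numbers."""
    if m < 2:
        return False
    if m % 2 == 0:
        return m == 2
    d = 3
    while d * d <= m:
        if m % d == 0:
            return False
        d += 2
    return True


def first_prime_window(digits, width=10):
    """First width-digit prime among consecutive digits (no leading zero)."""
    for i in range(len(digits) - width + 1):
        window = digits[i:i + width]
        if window[0] != "0" and is_prime(int(window)):
            return int(window)
    return None


print(first_prime_window(e_digits(200)))  # 7427466391
```

The number theory comes in through the primality test; everything else is bookkeeping over a sliding window of digits.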
14
(2) Why does Google want to hire people who know
how to find that number, and what does it have to
do with a search engine?
Second Question
15
(2) Why does Google want to hire people who know
how to find that number, and what does it have to
do with a search engine? Hmmm... Google wants you
to choose it for your web searches.
Second Question
16
(2) Why does Google want to hire people who know
how to find that number, and what does it have to
do with a search engine? Hmmm... Google wants you
to choose it for your web searches. Maybe their
algorithms are mathematically based?
Second Question
17
  • Google-ing Google

18
  • Results in an early paper from Page, Brin et al.
    while in graduate school

19
Search Engines: We've all used them, but what is
under the hood?
  • Crawl the web and locate all public pages
  • Index the crawled data so it can be searched
  • Rank the pages for more effective searching
    (the math part of this talk)
  • Each word which is searched on is linked with a
    list of pages (just URLs) which contain it. The
    pages with the highest rank are returned first.
  • Can't get a snapshot of the web at a
    particular instant
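The index-then-rank steps above can be sketched with a tiny inverted index. Everything here is hypothetical: the URLs, the page text, and the rank values are invented for illustration, and a real engine's index is far more elaborate.

```python
from collections import defaultdict

# Hypothetical crawled pages (URL -> text) and precomputed ranks.
pages = {
    "a.example/home": "linear algebra notes",
    "b.example/blog": "algebra homework help",
    "c.example/wiki": "search engine algebra",
}
rank = {"a.example/home": 0.2, "b.example/blog": 0.5, "c.example/wiki": 0.3}


def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, rank, word):
    """Return the URLs containing the word, highest rank first."""
    hits = index.get(word.lower(), set())
    return sorted(hits, key=lambda url: rank[url], reverse=True)


index = build_index(pages)
print(search(index, rank, "algebra"))
# ['b.example/blog', 'c.example/wiki', 'a.example/home']
```

The lookup only finds candidate pages; the order in which they come back depends entirely on the rank values, which is where the rest of the talk comes in.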

20
Note: Google's PageRank uses the link structure
(crowd sourcing) of the World Wide Web to
determine a page's rank; it doesn't grade the content
of a page.
21
PageRank is NOT a simple citation index
Which is the more popular page below, A or B?
A
B
22
PageRank is NOT a simple citation index
Which is the more popular page below, A or
B? What if the links to A were from unpopular
pages, and the one link to B was from
www.yahoo.com? (High School)
A
B
  • NOTE
  • Rankings based on citation index would be very
    easy to manipulate

23
PageRank is NOT a simple citation index
Which is the more popular page below, A or
B? What if the links to A were from unpopular
pages, and the one link to B was from
www.yahoo.com? (High School)
A
B
  • NOTE
  • Rankings based on citation index would be very
    easy to manipulate
  • PageRank has evolved to be a minor part of
    Google's search results.

24
Intuitively PageRank is analogous to popularity
  • The web as a graph: each page is a vertex, each
    hyperlink a directed edge.

Page A
Page B
Which of these three would have the highest page
rank?
Page C
25
Intuitively PageRank is analogous to popularity
  • The web as a graph: each page is a vertex, each
    hyperlink a directed edge.
  • A page is popular if a few very popular pages
    point (via hyperlinks) to it.

Page A
Page B
Which of these three would have the highest page
rank?
Page C
26
Intuitively PageRank is analogous to popularity
  • The web as a graph: each page is a vertex, each
    hyperlink a directed edge.
  • A page is popular if a few very popular pages
    point (via hyperlinks) to it.
  • A page could be popular if many not-necessarily
    popular pages point (via hyperlinks) to it.

Page A
Page B
Which of these three would have the highest page
rank?
Page C
27
So what is the mathematical definition of
PageRank?
  • In particular, a page's rank is equal to the sum
    of the ranks of all the pages pointing to it.
  • Note the scaling of each page rank
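The slide's formula is not in the transcript. In the simplified formulation used by the PageRank paper and by Bryan and Leise, the scaling divides each linking page's rank by its number of outgoing links. The symbols below are chosen for this sketch: B_P is the set of pages linking to P, and |Q| is the number of links leaving page Q.

```latex
r(P) = \sum_{Q \in B_P} \frac{r(Q)}{|Q|}
```

So a page that links to many others passes only a small share of its rank to each of them.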

28
Writing out the equation for each web-page in our
example gives
Page A
Page B
Page C
29
Even though this is a circular definition we can
calculate the ranks.
30
Even though this is a circular definition we can
calculate the ranks. Re-write the system of
equations as a Matrix-Vector product.
31
Even though this is a circular definition we can
calculate the ranks. Re-write the system of
equations as a Matrix-Vector product.
The PageRank vector is simply an eigenvector of
the coefficient matrix, with $\lambda = 1$
32
Wait, what's an eigenvector?
33
PageRank 0.4
PageRank 0.2
Page A
Page B
Page C
PageRank 0.4
Note: we choose the eigenvector whose entries sum to 1
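The figure for this example is not in the transcript, so the link structure below is an assumption: A links to B and C, B links to C, and C links to A. That hypothetical graph happens to give ranks of 0.4, 0.2 and 0.4. The sketch repeatedly applies the "sum of scaled ranks" rule until the values settle, which is one way to find the eigenvector whose entries sum to 1.

```python
# Hypothetical link structure (page -> pages it links to); assumed, not from the slide.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}


def simple_pagerank(links, iterations=200):
    """Apply the rank rule repeatedly, starting from equal ranks."""
    pages = sorted(links)
    x = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        nxt = {p: 0.0 for p in pages}
        for src, outs in links.items():
            share = x[src] / len(outs)   # rank scaled by number of out-links
            for dst in outs:
                nxt[dst] += share
        x = nxt
    return x


ranks = simple_pagerank(links)
print({p: round(r, 3) for p, r in ranks.items()})
# {'A': 0.4, 'B': 0.2, 'C': 0.4}
```

Because every page passes on all of its rank, the total stays at 1 throughout, so no separate normalization step is needed here.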
34
Implementation Details
  • Billions of web-pages would make a huge matrix
  • The matrix (in theory) is column-stochastic,
    which allows for iterative calculation
  • Previous PageRank is used as an initial guess
  • Random-Surfer term handles computational
    difficulties associated with a disconnected
    graph

35
Wait, what else gets searched?
40
  • Attempts to Manipulate Search Results
  • Via a Google Bomb

41
  • Liberals vs. Conservatives!
  • In 2007, Google addressed Google Bombs; too many
    people thought the results were intentional and
    not merely a function of the structure of the web

42
  • Juniata's own Google Bomb

43
  • At Juniata, CS 315 is my Analysis and
    Algorithms course

44
  • Miscellaneous points
  • Try a search in Google on PigeonRank.
  • What types of sites would Google NOT give good
    results on?
  • PageRank has been deprecated. Google is
    continuously trying new ranking algorithms.

45
  • SPAM filters
  • A rules approach: filter out all messages with
    things like "Dear Friend" or "Click".
  • The first 80% is captured easily, with few
    false-positives.
  • But the last few percent (remember Netflix) will be
    difficult to catch, the rules will offer many
    more false-positives, and the SPAMMers can
    adapt.
  • A statistical approach, called a Bayesian filter,
    is much more effective.
  • It learns from a given set of SPAM and non-SPAM
    emails, automatically counting the frequency of
    words.
  • Some words are incriminating, like "Madam";
    others almost guarantee the email is non-SPAM,
    like "describe" or "example".
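The statistical approach above can be sketched as a small naive-Bayes-style score. The training messages are invented for illustration, the smoothing constants are one common choice rather than anything from the slides, and a real filter would learn from thousands of emails.

```python
import math
from collections import Counter

# Hypothetical training sets, far smaller than anything realistic.
spam = ["dear friend click here", "madam click now", "dear madam free offer"]
ham = ["for example describe the algorithm",
       "please describe your example",
       "click the example link"]


def word_counts(messages):
    """Count, for each word, how many messages contain it."""
    counts = Counter()
    for msg in messages:
        counts.update(set(msg.lower().split()))
    return counts


spam_counts, ham_counts = word_counts(spam), word_counts(ham)


def spam_probability(message):
    """Combine per-word evidence, treating words as independent."""
    log_odds = math.log(len(spam) / len(ham))          # prior odds
    for word in set(message.lower().split()):
        # Add-one smoothing so unseen words do not give zero probabilities.
        p_spam = (spam_counts[word] + 1) / (len(spam) + 2)
        p_ham = (ham_counts[word] + 1) / (len(ham) + 2)
        log_odds += math.log(p_spam / p_ham)
    return 1 / (1 + math.exp(-log_odds))


print(round(spam_probability("dear madam"), 3))           # 0.9
print(round(spam_probability("describe an example"), 3))  # 0.077
```

Nobody wrote a rule about "Madam" or "describe"; the scores fall out of the word counts, which is why the filter can be retrained as the SPAMMers adapt.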

46
  • Bibliography

[1] S. Brin, L. Page, et al., The PageRank
    Citation Ranking: Bringing Order to the Web,
    http://dbpubs.stanford.edu/pub/1999-66, Stanford
    Digital Libraries Project (January 29, 1998).
[2] K. Bryan and T. Leise, The $25,000,000,000
    Eigenvector: The Linear Algebra behind Google,
    SIAM Review, 48 (2006), pp. 569-581.
[3] G. Strang, Linear Algebra and Its
    Applications, Brooks-Cole, Boston, MA, 2005.
[4] D. Poole, Linear Algebra: A Modern Introduction,
    Brooks-Cole, Boston, MA, 2005.
47
Any Questions?
  • Slides available at http://faculty.juniata.edu/kruse

48
The following slides give some of the more
in-depth mathematics behind Google
49
A Graphical Interpretation of a 2-Dimensional
Eigenvector
http://cnx.org/content/m10736/latest/
If we have some 2-D vector x, and some 2 x 2
matrix A, generally their product, Ax = b, will
result in a new vector, b, which is pointing in
a different direction and having a different
length than x.
50
A Graphical Interpretation of a 2-Dimensional
Eigenvector
http://cnx.org/content/m10736/latest/
If we have some 2-D vector x, and some 2 x 2
matrix A, generally their product, Ax = b, will
result in a new vector, b, which is pointing in
a different direction and having a different
length than x. But, if the vector (v in the
image at the left) is an eigenvector of A, then
Av will give a vector which is in the same direction
as v, but just scaled to a different length, by
$\lambda$. Note that $\lambda$ is called an eigenvalue of A.
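A small numeric check may help here. The matrix and vectors below are chosen for illustration and are not the ones in the slide's image: a generic vector changes direction when multiplied by A, while an eigenvector is only rescaled.

```python
def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


A = [[2, 1],
     [1, 2]]          # hypothetical 2 x 2 matrix

x = [1, 0]            # a generic vector
v = [1, 1]            # an eigenvector of A

print(matvec(A, x))   # [2, 1]  -> new direction
print(matvec(A, v))   # [3, 3]  -> same direction, scaled by 3
```

Here Av = 3v, so 3 is the eigenvalue that goes with the eigenvector v.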
51
Note that the coefficient matrix is
column-stochastic
Every column-stochastic matrix has 1 as an
eigenvalue. The link matrix is column-stochastic as
long as there are no dangling nodes, and a connected
graph is needed for the ranking to be unique.
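One short argument for the eigenvalue claim, written out as a sketch. The notation is chosen here: the bold 1 is the column vector of all ones, and A is any n x n column-stochastic matrix.

```latex
\begin{align*}
\text{Each column of } A \text{ sums to } 1
  &\;\Longrightarrow\; \mathbf{1}^{T} A = \mathbf{1}^{T}
  \;\Longrightarrow\; A^{T}\mathbf{1} = \mathbf{1}, \\
\text{so } 1 \text{ is an eigenvalue of } A^{T}. \quad
\det(A - \lambda I) &= \det\!\left((A - \lambda I)^{T}\right) = \det(A^{T} - \lambda I),
\end{align*}
```

Since A and its transpose share a characteristic polynomial, they share eigenvalues, so 1 is an eigenvalue of A as well.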
52
Dangling Nodes have no outgoing links
In this example, Page C is a dangling node. Note
that its associated column in the coefficient
matrix is all 0. Matrices like these are called
column-substochastic.
Page A
Page C
Page B
In Page, Brin, et al. [1], they suggest dangling
nodes most likely would occur from pages which
haven't been crawled yet, and so they simply
remove them from the system until all the
PageRanks are calculated. It is interesting to
note that a column-substochastic matrix does have a
positive eigenvalue and corresponding
eigenvector with non-negative entries, which is
called the Perron eigenvector, as detailed in
Bryan and Leise [2].
53
A disconnected graph could lead to non-unique
rankings
Notice the block diagonal structure of the
coefficient matrix. Note: Re-ordering via
permutation doesn't change the ranking, as in [2].
Page C
Page A
Page E
Page D
Page B
In this example, the eigenspace associated with
eigenvalue $\lambda = 1$ is two-dimensional. Which
eigenvector should be used for ranking?
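The slide's graph is not in the transcript, so the five-page link structure below is an assumption: A and B link only to each other, C and D link to each other, and E links to C and D. On that hypothetical graph, several different rank vectors all satisfy the simple PageRank rule, which is the non-uniqueness problem in miniature.

```python
# Hypothetical disconnected web: {A, B} and {C, D, E} never link to each other.
links = {"A": ["B"], "B": ["A"], "C": ["D"], "D": ["C"], "E": ["C", "D"]}


def step(links, x):
    """One application of the simple rule: pass scaled rank along each link."""
    nxt = {p: 0.0 for p in links}
    for src, outs in links.items():
        for dst in outs:
            nxt[dst] += x[src] / len(outs)
    return nxt


x1 = {"A": 0.5, "B": 0.5, "C": 0.0, "D": 0.0, "E": 0.0}
x2 = {"A": 0.0, "B": 0.0, "C": 0.5, "D": 0.5, "E": 0.0}
mix = {p: 0.75 * x1[p] + 0.25 * x2[p] for p in links}

for x in (x1, x2, mix):
    print(step(links, x) == x)   # True each time: all three are valid rankings
```

One ranking puts all the importance in the first subweb, another puts it all in the second, and any blend of the two also works, so the simple formula cannot compare pages across subwebs.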
54
Add a random-surfer term to the simple PageRank
formula.
Let S be an n x n matrix with all entries 1/n. S
is column-stochastic, and we consider the matrix
$M = (1 - m)A + mS$, which is a weighted average of A and S.
  • This models the behavior of a real web-surfer,
    who might jump to another page by directly typing
    in a URL or by choosing a bookmark, rather than
    clicking on a hyperlink. Originally, m = 0.15 in
    Google, according to [2].
  • $x = Mx$ can also be written as $x = (1 - m)Ax + ms$

Important Note: We will use this formulation
with A when computing x. Here s is a
column vector with all entries 1/n, where
$Sx = s$ if the entries of x sum to 1.
55
M for our previous disconnected graph, with
m = 0.15
Page C
Page A
Page E
Page D
Page B
The eigenspace associated with $\lambda = 1$ is
one-dimensional, and the normalized eigenvector is
[vector not captured in the transcript].
So the addition of the random surfer term permits
comparison between pages in different subwebs.
56
Iterative Calculation
By many estimates, the web currently contains at
least 8 billion pages. How does Google compute
an eigenvector for something this large? One
possibility is the power method. In [2], it is
shown that every positive (all entries are > 0)
column-stochastic matrix M has a unique vector q
with positive components such that $Mq = q$, with
$\|q\|_1 = 1$, and it can be computed as
$q = \lim_{k \to \infty} M^k x_0$, for any initial guess $x_0$
with positive components and $\|x_0\|_1 = 1$.
57
Iterative Calculation continued
Rather than calculating the powers of M directly,
we could use the iteration
$x_k = M x_{k-1}$. Since M is positive, $M x_{k-1}$ would be an
$O(n^2)$ calculation. As we mentioned
previously, Google uses the equivalent expression
$x_k = (1 - m) A x_{k-1} + m s$ in the computation. These products can be
calculated without explicitly creating the huge
coefficient matrix, since A contains mostly 0s.
The iteration is guaranteed to converge, and it
will converge quicker with a better first guess,
so the previous PageRank vector is used as the
initial vector.
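A minimal sketch of that iteration, assuming the update x ← (1 − m)Ax + ms with s holding 1/n in every entry, as in Bryan and Leise. The five-page graph is hypothetical, the function name and tolerance are choices made here, and the sketch assumes every page has at least one outgoing link. Only the link lists are stored, so the mostly-zero matrix A is never built.

```python
def pagerank(links, m=0.15, tol=1e-12, x0=None):
    """Iterate x <- (1 - m) * A x + m * s using only the link lists."""
    pages = list(links)
    n = len(pages)
    # A previous rank vector can be passed as x0 for a better first guess.
    x = dict(x0) if x0 else {p: 1.0 / n for p in pages}
    while True:
        nxt = {p: m / n for p in pages}             # the m * s term
        for src, outs in links.items():             # the (1 - m) * A x term
            share = (1 - m) * x[src] / len(outs)
            for dst in outs:
                nxt[dst] += share
        if sum(abs(nxt[p] - x[p]) for p in pages) < tol:
            return nxt
        x = nxt


# Hypothetical disconnected web: {A, B} and {C, D, E} never link to each other.
links = {"A": ["B"], "B": ["A"], "C": ["D"], "D": ["C"], "E": ["C", "D"]}
ranks = pagerank(links)
print({p: round(r, 3) for p, r in ranks.items()})
# {'A': 0.2, 'B': 0.2, 'C': 0.285, 'D': 0.285, 'E': 0.03}
```

Each pass touches every link once, so the work grows with the number of links rather than with n squared, and the random-surfer term gives a single ranking even though the graph is disconnected.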
58
This gives a regular matrix
  • In matrix notation we have $x = (1 - m)Ax + ms$
  • Since $Sx = s$ when the entries of x sum to 1, we
    can rewrite as $x = ((1 - m)A + mS)x = Mx$
  • The new coefficient matrix is regular, so we can
    calculate the eigenvector iteratively.
  • This iterative process is a series of
    matrix-vector products, beginning with an
    initial vector (typically the previous PageRank
    vector). These products can be calculated
    without explicitly creating the huge coefficient
    matrix.