1
Web Crawlers and Page Scraping
2
Web Crawlers
  • Start with 'seed' URLs
  • Download the pages
  • Process the pages (for links)
  • Update the list of URLs
  • Repeat with the new (updated, longer) list of URLs
    (see the sketch below)
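
A minimal sketch of that loop in PHP, using an in-memory queue and the link
regex from slide 5; the seed URL, variable names, and the 10-second wait are
illustrative choices, not part of the slides:

  <?php
  // Sketch of the basic crawl loop with an in-memory queue.
  $queue = array('http://www.rpi.edu/');          // start with 'seed' URLs
  $seen  = array();

  while (($url = array_shift($queue)) !== null) {
      if (isset($seen[$url])) { continue; }
      $seen[$url] = true;
      $page = @file_get_contents($url);           // download the page
      if ($page === false) { continue; }
      preg_match_all('/https?:\/\/[^" ]+/i', $page, $matches);   // process for links
      foreach ($matches[0] as $link) {
          $queue[] = $link;                       // update the list of URLs
      }
      sleep(10);                                  // wait between requests
  }
  ?>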

3
A very simple crawl
  • wget -r -w 10 http://blah.blah.com
  • -r : recursive
  • -w : wait time in seconds
  • wget will not leave the domain
  • wget will save the files in a new folder named
    blah.blah.com

4
Parsing URLs
  • Open each downloaded file and read its contents
  • Use regular expressions to find all the links
  • Loop over the links, inserting them into the
    database (sketched below)
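
One way to sketch this in PHP, assuming the pages sit in the blah.blah.com
folder wget created and that links go into a one-column SQLite table named
urls (the database and table are assumptions, not from the slides):

  <?php
  // Walk the folder wget saved, extract links, store them in a table.
  $db = new PDO('sqlite:crawler.db');
  $db->exec('CREATE TABLE IF NOT EXISTS urls (url TEXT UNIQUE)');
  $insert = $db->prepare('INSERT OR IGNORE INTO urls (url) VALUES (?)');

  $files = new RecursiveIteratorIterator(
      new RecursiveDirectoryIterator('blah.blah.com')      // folder created by wget
  );
  foreach ($files as $file) {
      if (!$file->isFile()) { continue; }
      $page = file_get_contents($file->getPathname());     // open each downloaded file
      preg_match_all('/https?:\/\/[^" ]+/i', $page, $matches);   // find all links
      foreach ($matches[0] as $link) {
          $insert->execute(array($link));                  // insert into the database
      }
  }
  ?>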

5
REGEX to get links
  <?php
  $page = file_get_contents('http://www.rpi.edu/');
  echo $page;
  $regex = '/https?:\/\/[^" ]+/i';
  preg_match_all($regex, $page, $matches);
  print_r($matches[0]);
  ?>

6
Last Step
  • Run the crawler again (and again)
  • Schedule runs with cron
  • Select the distinct domains from the db
  • Run wget on each distinct domain (sketched below)
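
A sketch of that last step, reusing the assumed urls table from earlier and
shelling out to wget:

  <?php
  // recrawl.php -- sketch of the scheduled re-crawl step.
  $db = new PDO('sqlite:crawler.db');

  // Reduce the stored URLs to their distinct domains.
  $domains = array();
  foreach ($db->query('SELECT DISTINCT url FROM urls') as $row) {
      $host = parse_url($row['url'], PHP_URL_HOST);
      if ($host) { $domains[$host] = true; }
  }

  // Run wget on each distinct domain.
  foreach (array_keys($domains) as $host) {
      system('wget -r -w 10 ' . escapeshellarg('http://' . $host));
  }
  ?>

A cron entry such as 0 3 * * * php /path/to/recrawl.php would run it nightly
at 3 AM; the schedule and script path are placeholders.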

7
Why Crawling is Hard
  • Huge storage / bandwidth issues
  • The web is always changing - it is hard to keep up
  • An advanced scheduling algorithm would take into
    account how often a site is updated, and crawl it
    accordingly
  • Different URLs might point to the same resource
  • Huge amounts of the web are behind forms / database
    queries; crawlers cannot populate the forms, so
    they cannot see the data

8
Friendly roBOTS
  • Your bot should have a user-agent properly set in
    the HTTP GET request
  • ini_set('user_agent', '<bot name>');
  • Or construct your own HTTP GET request with
    fsockopen() (see the sketch below)
  • http://us3.php.net/fsockopen
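
A sketch of the hand-built request using fsockopen(), roughly following the
linked manual page; the bot name 'ExampleBot/1.0' and the contact URL are
placeholders for your own bot's identity:

  <?php
  // Build the GET request by hand so the User-Agent header is fully under our control.
  $fp = fsockopen('www.rpi.edu', 80, $errno, $errstr, 30);
  if ($fp) {
      $request  = "GET / HTTP/1.1\r\n";
      $request .= "Host: www.rpi.edu\r\n";
      $request .= "User-Agent: ExampleBot/1.0 (http://example.com/bot)\r\n";
      $request .= "Connection: Close\r\n\r\n";
      fwrite($fp, $request);
      $response = '';
      while (!feof($fp)) {
          $response .= fgets($fp, 1024);      // read the raw HTTP response
      }
      fclose($fp);
      echo $response;
  }
  ?>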

9
Friendly roBOTS
  • Obey robots.txt
  • Check the root of each host you crawl for
    robots.txt
  • It contains rules (suggestions) for who can read
    what (a simplified check is sketched below)
  • EXAMPLE - block all robots from all folders:
  • User-agent: *
  • Disallow: /
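
A simplified check in PHP that only honours the blanket User-agent: * block;
a real crawler would also match its own bot name, cache the file, and handle
more directives (the function name is illustrative):

  <?php
  // Simplified robots.txt check: only honours rules under 'User-agent: *'.
  function path_is_allowed($host, $path) {
      $robots = @file_get_contents('http://' . $host . '/robots.txt');
      if ($robots === false) { return true; }               // no robots.txt found
      $applies = false;
      foreach (preg_split('/\r?\n/', $robots) as $line) {
          $line = trim(preg_replace('/#.*/', '', $line));    // strip comments
          if (preg_match('/^User-agent:\s*(.+)$/i', $line, $m)) {
              $applies = (trim($m[1]) === '*');
          } elseif ($applies && preg_match('/^Disallow:\s*(.*)$/i', $line, $m)) {
              $rule = trim($m[1]);
              if ($rule !== '' && strpos($path, $rule) === 0) {
                  return false;                              // path is disallowed
              }
          }
      }
      return true;
  }

  var_dump(path_is_allowed('www.rpi.edu', '/'));
  ?>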

10
Friendly roBOTS
  • Leave an interval between requests
  • Arbitrary: e.g. 10 seconds
  • Smart: last download time × a coefficient
    (see the sketch below)
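
A sketch of the 'smart' interval, pausing for a multiple of however long the
last download took; the coefficient of 10 is an arbitrary example value:

  <?php
  // Pause in proportion to how long the last download took.
  $coefficient = 10;
  $start = microtime(true);
  $page = file_get_contents('http://www.rpi.edu/');
  $elapsed = microtime(true) - $start;                 // last download time, in seconds
  usleep((int) ($elapsed * $coefficient * 1000000));   // wait before the next request
  ?>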

11
Page Scraping
  • Identify where what you want is on the page
  • Get it

12
Means of Scraping
  • Use strpos() and substr() to locate the start
    and end of what you want and return it
  • (works great for data that appears only once on a
    page)
  • Use SimpleXML to access nodes on the page
  • (works great if you're dealing with XHTML)
  • http://us3.php.net/manual/en/book.simplexml.php
  • Use regular expressions
  • requires learning another language
  • very flexible
  • http://us.php.net/manual/en/regexp.reference.php

13
String Searches (everything inside the body tag)
  <?php
  $page = file_get_contents('http://www.rpi.edu/');
  echo $page;
  $start = strpos($page, '<body');
  $end = strpos($page, '</body>');
  $len = $end - $start;
  $out = substr($page, $start, $len);
  echo $out;
  ?>

14
SimpleXML (everything inside the body tag)
  <?php
  $page = simplexml_load_file('http://www.rpi.edu/');
  $body = $page->body;
  echo $body->asXML();
  ?>
  • Due to sloppy HTML, this will work approximately
    never, but it is great for parsing XML documents.

15
REGEX (everything inside the body tag)
  <?php
  $page = file_get_contents('http://www.rpi.edu/');
  echo $page;
  $regex = "/<body(.*)<\/body>/s";
  preg_match_all($regex, $page, $matches);
  print_r($matches[0]);
  ?>

16
Your Own Crawler
  • Try:
  • Creating a database table with one field (url)
  • Manually enter one url into that table
  • Take the code from slide 5 and add (one possible
    solution is sketched below):
  • A connection to the db
  • A select of the distinct urls from the table
  • Read the file at each url
  • Get all the urls on that page
  • Insert those URLs into the table
  • Run your script TWICE
  • Look at the pile of URLs in your table and be
    glad you didn't run your script three times
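
One way the exercise might look, following the steps above; the MySQL
credentials, the 'crawler' database, and the urls table are placeholders for
your own setup, and the table is assumed to already hold one seed url:

  <?php
  // Exercise sketch: slide 5's code plus a database of URLs.
  $db = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'password');
  $insert = $db->prepare('INSERT INTO urls (url) VALUES (?)');

  // Select the distinct urls already in the table.
  $urls = $db->query('SELECT DISTINCT url FROM urls')->fetchAll(PDO::FETCH_COLUMN);

  foreach ($urls as $url) {
      $page = @file_get_contents($url);                // read the file at each url
      if ($page === false) { continue; }
      preg_match_all('/https?:\/\/[^" ]+/i', $page, $matches);   // get all the urls
      foreach ($matches[0] as $link) {
          $insert->execute(array($link));              // insert them into the table
      }
  }
  ?>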