Building Your Own Web Spider

Transcript and Presenter's Notes

1
Building Your Own Web Spider
  • Thoughts, Considerations and Problems

2
Who Am I?
  • Graduate: Computer Systems Technology, Fanshawe
    College, London, ON
  • Occupation: Security Research Engineer, nCircle
    Network Security, Toronto, ON
  • Current Primary Focus: Web Security Research
  • Past Focus: OS X Security, Reverse Engineering
  • Blogger: ComputerDefense.org

3
Why Discuss This?
  • Spiders are becoming more common, and everyone is
    making use of them.
  • Web spidering is the backbone of many Web
    Application Security Scanners.
  • It's actually a pretty cool topic!

4
What Will We Talk About?
  • Why Build a Spider?
  • Current Products
  • Design Considerations
  • Hurdles
  • Sample Spider Code

5
Why Build a Spider?
  • Create the base for a larger web-based product.
  • Monitor Websites for Changes.
  • Mirror a Website.
  • Download specific types of files.
  • Create a Dynamic Search Engine.

6
Current Products
  • The most well known: wget
  • Others include:
  • Softbyte Labs: Black Widow
  • BurpSuite: Burp Spider
  • JSpider
  • Robots for every major search engine
  • Others?

7
Design Considerations (aka Spider Dos and Don'ts)
  • What do I want to spider?
  • Do I want specific pages?
  • Following on that, do I want specific page
    extensions?
  • Do I want to submit forms?
  • Do I want to submit valid data?
  • Do I want to reach authenticated portions of the
    website?
  • Do I want to support SSL?
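
The choices above can be made concrete as a small options structure the rest of the spider consults. A minimal sketch in modern Python (the slides' sample code is Python 2; all names here are illustrative, not from the slides):

```python
from dataclasses import dataclass

@dataclass
class SpiderOptions:
    # Extensions we are willing to fetch; extension-less paths pass too.
    allowed_extensions: tuple = ('.html', '.htm', '.php')
    submit_forms: bool = False      # do we POST forms at all?
    submit_valid_data: bool = False # use realistic field values?
    follow_auth: bool = False       # attempt authenticated areas?
    support_ssl: bool = True        # fetch https:// URLs?

    def wants(self, url):
        """True if the URL's extension is allowed (or it has none)."""
        path = url.split('?', 1)[0]
        last = path.rsplit('/', 1)[-1]
        return path.endswith(self.allowed_extensions) or '.' not in last
```

Each "do I want…?" question from the slide becomes one field, so the answers live in one place instead of being scattered through the crawl loop.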

8
Dos and Don'ts 2
  • What don't I want to spider?
  • Do I NOT want to spider external links?
  • Do I NOT want to download files over X bytes?
  • Do I NOT want to follow links on error pages?
  • What other don'ts can you think of?
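
The first two exclusions are easy to sketch: compare hosts before following a link, and check a Content-Length header (e.g. from a HEAD request) before a full download. A hedged sketch, with the size cap chosen arbitrarily for illustration:

```python
from urllib.parse import urlparse

MAX_BYTES = 1024 * 1024  # illustrative cap: skip files over ~1 MB

def same_host(base, url):
    """Treat a link as internal only if it stays on the base host
    (relative links have an empty netloc and count as internal)."""
    return urlparse(url).netloc in ('', urlparse(base).netloc)

def small_enough(content_length):
    """Check a Content-Length header value before committing to a GET.
    A missing header (None) is allowed through optimistically."""
    return content_length is None or int(content_length) <= MAX_BYTES
```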

9
Dos and Don'ts 3 & 4
  • Not all web servers are created equal.
  • How do you prepare yourself for potential
    idiosyncrasies of non-compliant web servers?
  • Do you want to handle non-compliant web servers?
  • How far do I spider (maximum recursion depth)?
  • This one is interesting, and potentially unique
    to the individual and the task.

10
Dos and Don'ts 5 & 6
  • What are the user-definable options?
  • Part of this falls back to our included options.
  • Do we allow the user to specify authentication?
  • Do we allow the user to provide default
    credentials (e.g. valid data is required for a
    blog comment)?
  • Does the user define the recursion level and
    other tasks laid out in your includes/excludes?
  • Which raises the question: who are you designing
    this for?
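
For a command-line audience, these knobs map naturally onto flags. A sketch using the standard library's argparse (the flag names are invented for illustration, not part of the presented spider):

```python
import argparse

def build_parser():
    p = argparse.ArgumentParser(description='toy web spider')
    p.add_argument('start_url', help='page to start spidering from')
    p.add_argument('--depth', type=int, default=3,
                   help='maximum recursion depth')
    p.add_argument('--auth', metavar='USER:PASS',
                   help='credentials for authenticated areas')
    p.add_argument('--exclude-external', action='store_true',
                   help='do not follow links off the start host')
    return p
```

Who you are designing for decides which of these get defaults and which are mandatory.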

11
Hurdles
  • (X)HTML is unstructured.
  • Whitespace is insignificant, as are certain
    design requirements.
  • Forms can have a name but don't require one; they
    can have an action but don't need one.
  • Image (img) tags can be closed with a trailing
    "/>", can omit the "/", and can even be followed
    by a "</img>" tag.
  • How do you deal with all these differences when
    parsing HTML?
  • Build your own parser or integrate a freely
    available parser?
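
One "freely available parser" is already in the standard library. A sketch using Python 3's html.parser, which tolerates the messy markup above (the slides' own sample uses a regex instead; this is an alternative, not the presenter's code):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href attributes from anchor tags, tolerating
    unquoted values, mixed case, and unclosed tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

collector = LinkCollector()
# Unquoted attribute, upper case, unclosed tag: easy to miss with a regex.
collector.feed('<A HREF=/one><a href="/two">two')
```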

12
Hurdles 2
  • Client-side technologies are a pain.
  • Is it sufficient today for a spider to simply
    parse (X)HTML? Nope.
  • We have to consider client-side technologies:
    links are built via JavaScript now, and entire
    sites are developed in Flash.
  • Do we ignore these client-side technologies?
  • Do we include a JavaScript engine?
  • Do we decompile the Flash and parse the
    ActionScript?
  • What do these actions do to our development time?
    Is it justifiable?

13
Hurdles 3
  • What do I consider to be a link?
  • Spidering is about finding and following links,
    but what is considered to be a link?
  • Does it require an anchor?
  • If so, do I follow URI anchors or any anchor,
    including one that points to a local name?
  • Does a frame src attribute count?
  • Does an iframe count?
  • Do you parse each of these separately or include
    them in a single group?

14
Simple Spider Sample

  #!/usr/bin/python
  import urllib
  import urlparse
  import sys
  import re

  RECURSION_LEVEL = 3
15
Simple Spider Sample Continued

  def getLinks ( start_page, page_data ):
      url_list = []
      anchor_href_regex = '<\s*a\s+href\s*=\s*[\x27\x22]?([a-zA-Z0-9/\\\\._ -]+)[\x27\x22]?\s*>'
      urls = re.findall(anchor_href_regex, page_data)
      for url in urls:
          url_list.append(urlparse.urljoin( start_page, url ))
      return url_list

  def getPage ( url ):
      page_data = urllib.urlopen(url).read()
      return page_data

16
Simple Spider Sample Continued (2)

  if __name__ == '__main__':
      end_results = []
      recursion_count = 0
      try:
          page_array = [ sys.argv[1] ]
      except IndexError:
          print 'Please provide a valid url.'
          sys.exit()
      while recursion_count < RECURSION_LEVEL:
          results = []
          for current_page in page_array:
              page_data = getPage( current_page )
              link_list = getLinks(current_page, page_data)
              for item in link_list:
                  if item.find( current_page ) != -1:
                      results.append( item )
              results = list(set(results))
          page_array = results
          end_results += results
          end_results = list(set(end_results))
          recursion_count += 1
      for item in end_results:
          print item

17
Q & A
Questions, Comments, Concerns? Bring them up now
or email me: ht@computerdefense.org / treguly@ncircle.com