Building Your Own Web Spider

Transcript and Presenter's Notes

1
Building Your Own Web Spider
  • Thoughts, Considerations and Problems

2
Who Am I?
  • Graduate: Computer Systems Technology, Fanshawe
    College, London, ON
  • Occupation: Security Research Engineer, nCircle
    Network Security, Toronto, ON
  • Current Primary Focus: Web Security Research
  • Past Focus: OS X Security, Reverse Engineering
  • Blogger: ComputerDefense.org

3
Why Discuss This?
  • Spiders are becoming more common, and everyone is
    making use of them.
  • Web spidering is the backbone of many Web
    Application Security Scanners.
  • It's actually a pretty cool topic!

4
What Will We Talk About?
  • Why Build a Spider?
  • Current Products
  • Design Considerations
  • Hurdles
  • Sample Spider Code

5
Why Build a Spider?
  • Create the base for a larger web-based product.
  • Monitor Websites for Changes.
  • Mirror a Website.
  • Download specific types of files.
  • Create a Dynamic Search Engine.

6
Current Products
  • The most well known: wget
  • Others include:
  • Softbyte Labs: Black Widow
  • BurpSuite: Burp Spider
  • JSpider
  • Robots for every major search engine
  • Others?

7
Design Considerations (aka Spider Dos and Don'ts)
  • What do I want to spider?
  • Do I want specific pages?
  • Following on that, do I want specific page
    extensions?
  • Do I want to submit forms?
  • Do I want to submit valid data?
  • Do I want to reach authenticated portions of the
    website?
  • Do I want to support SSL?
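
The choices above can be made concrete as a small options structure the rest of the spider consults. A minimal sketch in modern Python (the slides' sample code is Python 2; all names here are illustrative, not from the slides):

```python
from dataclasses import dataclass

@dataclass
class SpiderOptions:
    # Extensions we are willing to fetch; extension-less paths pass too.
    allowed_extensions: tuple = ('.html', '.htm', '.php')
    submit_forms: bool = False      # do we POST forms at all?
    submit_valid_data: bool = False # use realistic field values?
    follow_auth: bool = False       # attempt authenticated areas?
    support_ssl: bool = True        # fetch https:// URLs?

    def wants(self, url):
        """True if the URL's extension is allowed (or it has none)."""
        path = url.split('?', 1)[0]
        last = path.rsplit('/', 1)[-1]
        return path.endswith(self.allowed_extensions) or '.' not in last
```

Each "do I want…?" question from the slide becomes one field, so the answers live in one place instead of being scattered through the crawl loop.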

8
Dos and Don'ts 2
  • What don't I want to spider?
  • Do I NOT want to spider external links?
  • Do I NOT want to download files over X bytes?
  • Do I NOT want to follow links on error pages?
  • What other don'ts can you think of?
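
The first two exclusions are easy to sketch: compare hosts before following a link, and check a Content-Length header (e.g. from a HEAD request) before a full download. A hedged sketch, with the size cap chosen arbitrarily for illustration:

```python
from urllib.parse import urlparse

MAX_BYTES = 1024 * 1024  # illustrative cap: skip files over ~1 MB

def same_host(base, url):
    """Treat a link as internal only if it stays on the base host
    (relative links have an empty netloc and count as internal)."""
    return urlparse(url).netloc in ('', urlparse(base).netloc)

def small_enough(content_length):
    """Check a Content-Length header value before committing to a GET.
    A missing header (None) is allowed through optimistically."""
    return content_length is None or int(content_length) <= MAX_BYTES
```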

9
Dos and Don'ts 3 & 4
  • Not all web servers are created equal.
  • How do you prepare yourself for potential
    idiosyncrasies of non-compliant web servers?
  • Do you want to handle non-compliant web servers?
  • How far do I spider (maximum recursion depth)?
  • This one is interesting, and potentially unique
    to the individual and the task.

10
Dos and Don'ts 5 & 6
  • What are the user-definable options?
  • Part of this falls back to our included options.
  • Do we allow the user to specify authentication?
  • Do we allow the user to provide default
    credentials (e.g. valid data is required for a
    blog comment)?
  • Does the user define the recursion level and
    other tasks laid out in your includes/excludes?
  • Which raises the question: who are you designing
    this for?
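
For a command-line audience, these knobs map naturally onto flags. A sketch using the standard library's argparse (the flag names are invented for illustration, not part of the presented spider):

```python
import argparse

def build_parser():
    p = argparse.ArgumentParser(description='toy web spider')
    p.add_argument('start_url', help='page to start spidering from')
    p.add_argument('--depth', type=int, default=3,
                   help='maximum recursion depth')
    p.add_argument('--auth', metavar='USER:PASS',
                   help='credentials for authenticated areas')
    p.add_argument('--exclude-external', action='store_true',
                   help='do not follow links off the start host')
    return p
```

Who you are designing for decides which of these get defaults and which are mandatory.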

11
Hurdles
  • (X)HTML is unstructured.
  • Whitespace is insignificant, as are certain
    design requirements.
  • Forms can have a name but don't require one; they
    can have an action but don't need one.
  • Image (img) tags can be closed with a trailing
    "/>", can omit the "/", and can even be followed
    by a "</img>" tag.
  • How do you deal with all these differences when
    parsing HTML?
  • Build your own parser or integrate a freely
    available parser?
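
One "freely available parser" is already in the standard library. A sketch using Python 3's html.parser, which tolerates the messy markup above (the slides' own sample uses a regex instead; this is an alternative, not the presenter's code):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href attributes from anchor tags, tolerating
    unquoted values, mixed case, and unclosed tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

collector = LinkCollector()
# Unquoted attribute, upper case, unclosed tag: easy to miss with a regex.
collector.feed('<A HREF=/one><a href="/two">two')
```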

12
Hurdles 2
  • Client-side technologies are a pain.
  • Is it sufficient today for a spider to simply
    parse (X)HTML? Nope.
  • We have to consider client-side technologies:
    links are built via JavaScript now, and entire
    sites are developed in Flash.
  • Do we ignore these client-side technologies?
  • Do we include a JavaScript engine?
  • Do we decompile the Flash and parse the
    ActionScript?
  • What do these actions do to our development time?
    Is it justifiable?

13
Hurdles 3
  • What do I consider to be a link?
  • Spidering is about finding and following links,
    but what is considered to be a link?
  • Does it require an anchor?
  • If so, do I follow URI anchors or any anchor,
    including one that points to a local name?
  • Does a frame src attribute count?
  • Does an iframe count?
  • Do you parse each of these separately or include
    them in a single group?

14
Simple Spider Sample

  #!/usr/bin/python
  import urllib
  import urlparse
  import sys
  import re

  RECURSION_LEVEL = 3
15
Simple Spider Sample Continued

  def getLinks ( start_page, page_data ):
      url_list = []
      anchor_href_regex = '<\s*a\s+href\s*=\s*[\x27\x22]?([a-zA-Z0-9/\\\\._ -]+)[\x27\x22]?\s*>'
      urls = re.findall(anchor_href_regex, page_data)
      for url in urls:
          url_list.append(urlparse.urljoin( start_page, url ))
      return url_list

  def getPage ( url ):
      page_data = urllib.urlopen(url).read()
      return page_data

16
Simple Spider Sample Continued (2)

  if __name__ == '__main__':
      end_results = []
      recursion_count = 0
      try:
          page_array = [ sys.argv[1] ]
      except IndexError:
          print 'Please provide a valid url.'
          sys.exit()
      while recursion_count < RECURSION_LEVEL:
          results = []
          for current_page in page_array:
              page_data = getPage( current_page )
              link_list = getLinks(current_page, page_data)
              for item in link_list:
                  if item.find( current_page ) != -1:
                      results.append( item )
              results = list(set(results))
          page_array = results
          end_results += results
          end_results = list(set(end_results))
          recursion_count += 1
      for item in end_results:
          print item

17
Q & A
Questions, Comments, Concerns? Bring them up now
or email me: ht@computerdefense.org / treguly@ncircle.com