Web log analysis - PowerPoint PPT Presentation

Slides: 27
Provided by: wuz

Transcript and Presenter's Notes

Title: Web log analysis


1
Web log analysis
  • Presented by Zhan Wu
  • Guided by Dr. Bettina Berendt
  • Seminar Web Mining

2
What is a log file?
  • Log file: a file that lists all the actions
    that have occurred.
  • Every time you visit a site, the web server
    writes a record of the HTTP transaction
    to a log file.

3
Why web log analysis?
  • Is anyone looking at your web site?
  • Do they like what they see?
  • Do all your links work?
  • How much traffic does your web site get?

4
Why web log analysis? (cont.)
  • Web designers: what motivates visitors, what makes
    them stay, and what makes them leave
  • Web administrators: whether all clicks lead to
    documents, images, multimedia files, scripts, and
    applets that load and display properly
  • Companies that place advertisements: invest
    effectively and avoid wasting money

5
Log file types
  • Access log
  • Referrer log
  • Agent log
  • Error log

6
Log file from www.eduserver.de
  • pd9e0e981.dip.t-dialin.net - -
    [01/Dec/2001:00:17:42 +0100]
    "GET /db/stellenliste.html HTTP/1.1" 200 8038
    Mozilla/4.0 (compatible; MSIE 5.5; Windows NT
    5.0) http://www.jobs.zeit.de/akad.html
  • The access log, agent log, and referrer log
    are usually written together; the combined file
    is called the extended log file. However, some
    servers turn off the agent log and referrer log
    and keep only the access log, which is called
    the common log file.
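As a rough illustration, the access-log part of such a line can be split into fields with a regular expression. This is a minimal sketch, not the method used in the presentation. It assumes the usual Common Log Format layout with a bracketed timestamp, and the names `CLF_PATTERN` and `parse_access_line` are made up for this example. Whatever follows the byte count (agent, referrer) is returned unparsed as `extra`, because servers differ in how they append those fields.

```python
import re
from datetime import datetime

# Common log format: host ident authuser [timestamp] "request" status bytes
# Anything after the byte count (agent, referrer) is kept as raw text.
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?P<extra>.*)'
)

def parse_access_line(line):
    """Return a dict of access-log fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A "-" in the size column means no body was transferred.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    fields["timestamp"] = datetime.strptime(
        fields["timestamp"], "%d/%b/%Y:%H:%M:%S %z"
    )
    fields["extra"] = fields["extra"].strip()
    return fields

# Access-log part of the sample line, with the assumed bracketed timestamp.
sample = (
    'pd9e0e981.dip.t-dialin.net - - [01/Dec/2001:00:17:42 +0100] '
    '"GET /db/stellenliste.html HTTP/1.1" 200 8038'
)
entry = parse_access_line(sample)
print(entry["host"], entry["status"], entry["size"])
```

Returning `None` for lines that do not match lets the caller count and inspect malformed entries instead of stopping at the first one.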

7
Access log

  • Address / DNS: pd9e0e981.dip.t-dialin.net
  • identification: -
  • authuser: -
  • timestamp: [01/Dec/2001:00:17:42 +0100]
  • Request: "GET /db/stellenliste.html HTTP/1.1"
  • Status code: 200
  • Transfer volume: 8038 bytes

8
Four series of status codes
  • Success (200 series)
  • Redirect (300 series)
  • Failure (400 series)
  • Server Error (500 series)
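The series is given by the first digit of the status code, so a log entry can be classified with integer division. The sketch below is illustrative only; the function name `status_series` is an assumption, and codes outside the four listed series are labeled "Other".

```python
def status_series(code):
    """Map an HTTP status code to the series names listed above."""
    series = {2: "Success", 3: "Redirect", 4: "Failure", 5: "Server Error"}
    # Integer division by 100 yields the leading digit of a three-digit code.
    return series.get(code // 100, "Other")

for code in (200, 302, 404, 500):
    print(code, status_series(code))
```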

9
Agent log
  • The agent log records the browser, its
    version, and the operating system of the visitor.
  • Mozilla was the original code name of
    Netscape. Almost all Netscape-compatible
    browsers now identify themselves as Mozilla.
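A simple way to look inside an agent string is to separate the leading product token from the parenthesized details. The following is a sketch under assumptions: the details are taken to be separated by semicolons, as in typical browser agent strings, and `split_agent` is a hypothetical helper rather than part of any log tool.

```python
import re

def split_agent(agent):
    """Split an agent string into its product token and parenthesized details."""
    match = re.match(r'(?P<product>\S+)\s*(?:\((?P<details>[^)]*)\))?', agent)
    if match is None:
        return "", []
    details = match.group("details") or ""
    # Details such as browser and operating system are assumed to be ';'-separated.
    return match.group("product"), [
        part.strip() for part in details.split(";") if part.strip()
    ]

product, details = split_agent("Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)")
print(product)
print(details)
```

For this agent string the product token is `Mozilla/4.0`, while the actual browser (MSIE 5.5) and the operating system (Windows NT 5.0) only appear in the details.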

10
Referrer log
  • The referrer log records the page the
    visitor was on when making the next request.
  • How is your site categorized in search engines?
  • http://de.dir.yahoo.com/Bildung_und_Ausbildung/Portale_und_Linksammlungen/Bildungsserver
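To see which sites send visitors, the referrer URLs can be reduced to their host names and counted. This is a minimal sketch using Python's standard `urllib.parse`; the function name `referrer_hosts` is an assumption, and entries without a usable URL (for example a bare `-`) are skipped.

```python
from collections import Counter
from urllib.parse import urlsplit

def referrer_hosts(referrers):
    """Count how many requests were referred by each host."""
    counts = Counter()
    for url in referrers:
        host = urlsplit(url).hostname  # None when the entry is not a URL
        if host:
            counts[host] += 1
    return counts

referrers = [
    "http://www.jobs.zeit.de/akad.html",
    "http://de.dir.yahoo.com/Bildung_und_Ausbildung/"
    "Portale_und_Linksammlungen/Bildungsserver",
    "-",
]
for host, count in referrer_hosts(referrers).most_common():
    print(host, count)
```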

11
Referrer log (cont.)
  • What path does the visitor take through your
    site?
  • pd9e0e981.dip.t-dialin.net "GET
    /db/set.html?Id221KATEGORIEstellenangebotsaea
    df3e55f209b8c73ba53df99dc574a HTTP/1.1"
    http://www.bildungsserver.de/db/stellenliste.html
  • www.job.zeit.de → job list →
    a specific job posting
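One way to reconstruct such a path is to chain the requests of a single visitor: each request extends the path when its referrer is the page reached last. The sketch below is an assumption-laden illustration. `navigation_path` is a hypothetical helper, the site prefix is taken from the referrer shown above, and the query string of the second request is omitted.

```python
def navigation_path(site, entries):
    """Chain (request_path, referrer_url) pairs of one visitor into a click path.

    Entries must be in time order; entries whose referrer does not match the
    last page of the path are ignored.
    """
    path = []
    for request_path, referrer in entries:
        page = site + request_path
        if not path:
            path.append(referrer)  # the page the visitor came from
        if path[-1] == referrer:   # the click started on the last page of the path
            path.append(page)
    return path

site = "http://www.bildungsserver.de"
entries = [
    ("/db/stellenliste.html", "http://www.jobs.zeit.de/akad.html"),
    ("/db/set.html", "http://www.bildungsserver.de/db/stellenliste.html"),
]
for step in navigation_path(site, entries):
    print(step)
```

This only follows a single linear chain; a visitor who opens several pages from the same list would need a tree rather than a list.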

12
Error log
  • Another standard and important log, kept
    separate from the other three logs
  • Example from www.schulweb.de:
  • [Wed Jan 16 13:40:45 2002] [error] [client
    194.51.47.214] File does not exist:
    /home/schulweb/html/images/dot_so.gif
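An error-log entry of this kind can be taken apart in the same way as an access-log entry. The sketch below assumes the bracketed layout `[time] [level] [client address] message` that Apache-style servers commonly write; `ERROR_PATTERN` and `parse_error_line` are names invented for this example.

```python
import re
from datetime import datetime

# Assumed layout: [time] [level] [client address] message
ERROR_PATTERN = re.compile(
    r'\[(?P<time>[^\]]+)\] \[(?P<level>\w+)\] '
    r'(?:\[client (?P<client>[^\]]+)\] )?(?P<message>.*)'
)

def parse_error_line(line):
    """Return a dict with time, level, client and message, or None."""
    match = ERROR_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["time"] = datetime.strptime(fields["time"], "%a %b %d %H:%M:%S %Y")
    return fields

line = (
    "[Wed Jan 16 13:40:45 2002] [error] [client 194.51.47.214] "
    "File does not exist: /home/schulweb/html/images/dot_so.gif"
)
entry = parse_error_line(line)
print(entry["level"], entry["client"], entry["message"])
```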

13
Overview of log analysis software
  • Writing your own program
  • Free software (top 3 by Google PageRank):
    eXTReMe Tracking, The Webalizer, Analog
  • Commercial software and solution packages
    (top 3 by Google PageRank):
    Wusage, WebTrends, AccessWatch

14
Three steps of web log analysis
  • Decide what we need
  • Choose a log analysis program
  • Analyze the output of the program

15
Step 1: what we need
  • The traffic of the site
  • The distribution of the domains
  • The referring sites

16
Step 1 (cont.): what we don't need
  • We don't care about the error log; that problem
    is left to the web administrator.
  • We don't care about the browser or operating
    system of the visitors.
  • User sessions are not important either.

17
Step 2: which way should I choose?
  • A limited budget and little background in computer
    science mean that I have to choose free
    software!
  • I chose Analog.
  • There are versions for Macintosh,
    Unix, DOS, and Windows.
  • Also, while the default configuration gives a
    great report, Analog is easily customizable to
    produce exactly the report you want.

18
Step 3: get the output (traffic)
  • All the data come from the results of Analog.
  • The average number of requests is 912,615 per
    month and 30,420 per day; traffic increased
    month by month last year.
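The two averages above are consistent with a 30-day month, since 912,615 / 30 is about 30,420. For readers who write their own program instead of using Analog, the following sketch shows one way to derive a daily average from request timestamps. The function names and the sample timestamps are hypothetical, and the average is taken only over days that have at least one request.

```python
from collections import Counter
from datetime import datetime

def requests_per_day(timestamps):
    """Count requests for each calendar day."""
    return Counter(ts.date() for ts in timestamps)

def average_per_day(timestamps):
    """Average over the days that have at least one request."""
    per_day = requests_per_day(timestamps)
    return sum(per_day.values()) / len(per_day) if per_day else 0.0

# Hypothetical timestamps, only to exercise the functions.
stamps = [
    datetime(2001, 12, 1, 0, 17, 42),
    datetime(2001, 12, 1, 9, 30, 0),
    datetime(2001, 12, 2, 14, 5, 10),
]
print(average_per_day(stamps))

# Consistency check on the reported figures, assuming a 30-day month.
print(912615 // 30)
```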

19
Step 3 (cont.): Domain distribution

20
Data cleaning
  • Why doesn't the eduserver domain appear?
  • In-house requests are separated from external ones.
  • Thanks to Dr. Berendt for filtering out all the
    entries from the eduserver itself.
21
Limitations of log analysis
  • User sessions
  • Not all information is captured
  • Confusion of domains

22
User Sessions
  • Popular methods for identifying user sessions:
  • 1. Authenticated users
  • 2. Cookies
  • 3. IP address of the visitor
  • All of the above have problems!
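The third method can be sketched as follows: requests from the same address belong to one session until a gap longer than a timeout occurs. This is a heuristic only. The 30-minute timeout is an assumption that does not come from the presentation, `sessionize` is a hypothetical name, and visitors behind a shared proxy or with changing dial-up addresses will be grouped incorrectly, which is one of the problems noted above.

```python
from datetime import datetime, timedelta

def sessionize(requests, timeout=timedelta(minutes=30)):
    """Group (host, timestamp) pairs into sessions per host.

    A new session starts when a host is seen for the first time or after
    it has been idle for longer than the timeout.
    """
    sessions = []
    open_session = {}  # host -> the session list currently open for it
    for host, ts in sorted(requests, key=lambda r: r[1]):
        current = open_session.get(host)
        if current is None or ts - current[-1][1] > timeout:
            current = []
            sessions.append(current)
            open_session[host] = current
        current.append((host, ts))
    return sessions

# Hypothetical requests from two addresses.
requests = [
    ("194.51.47.214", datetime(2002, 1, 16, 10, 0)),
    ("10.0.0.7", datetime(2002, 1, 16, 10, 5)),
    ("194.51.47.214", datetime(2002, 1, 16, 10, 10)),
    ("194.51.47.214", datetime(2002, 1, 16, 11, 0)),
]
for session in sessionize(requests):
    print(session[0][0], len(session))
```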

23
Not all entries are captured
  • ISPs cache certain pages.
  • Web browsers also have their own local caches,
    so requests served from a cache never reach the
    server log.

24
Confusion of domains
  • Nothing stops a commercial entity from
    registering a site in the .org domain.
  • Sites in the .com domain and other domains can
    also be located in foreign countries, so you
    cannot tell exactly which requests come
    from users in other countries.
  • .edu domains are used almost exclusively in the
    USA. We cannot identify a German educational
    site from the last label of its domain name.
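A domain distribution is typically computed from the last label of each host name, which is exactly where this confusion arises. The sketch below illustrates the idea under assumptions: the function names are invented, numeric addresses that were never resolved are counted as "unresolved", and the two `example` hosts are hypothetical.

```python
from collections import Counter

def top_level_domain(host):
    """Return the last label of a host name, or 'unresolved' for numeric addresses."""
    host = host.rstrip(".").lower()
    last = host.rsplit(".", 1)[-1]
    return "unresolved" if last.isdigit() or not last else last

def domain_distribution(hosts):
    """Count requests per top-level domain."""
    return Counter(top_level_domain(h) for h in hosts)

hosts = [
    "pd9e0e981.dip.t-dialin.net",
    "194.51.47.214",
    "host.example.de",   # hypothetical host
    "host.example.org",  # hypothetical host
]
for domain, count in domain_distribution(hosts).most_common():
    print(domain, count)
```

The counts say nothing about who is behind a label: a German university and a German company both end up under `de`, and a dial-up customer in Germany appears under `net`.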

25
Conclusion
  • There is a great deal of useful information you
    can get from web logs.
  • There is still a lot to do in this
    field in the future.

26
  • Thanks!