Title: CS403: Online Network Exploration
1CS403 Online Network Exploration
- The World Wide Web
- Fall, 2007
- Modified by Linda Kenney
- 9/16/07
2Defining the World Wide Web
- Using the Web, it's possible for anyone to
publish their own Web pages on a host running a
Web server and have those pages available to any
Internet user with a Web browser.
3Defining the World Wide Web (cont.)
- Remember -- Along with e-mail, the Web is the
Internet service that accounts for the vast
majority of what people routinely do with the
Internet.
4A conceptual network
- Although the Web is most definitely not a
computer network, we could argue that it is a
conceptual network of distributed resources.
- Most commonly those resources are Web pages.
- They may also include images, sounds, videos and
more.
- Most Web pages are connected to other resources
via hyperlinks.
- Visualizing the links between several separate
resources provides some idea of where the term
Web comes from.
5Hypertext
- The Web was invented in 1990.
- But it was based on the concept of hypertext,
which had been around for decades.
- The basic idea of hypertext is to take the
passive cross-references that are common in
printed text and make them active.
- When reading a book, a cross-reference passively
informs the reader where to turn for additional
info and the reader must manually perform the
actions necessary to obtain that additional info
if it is desired.
- Examples?
6Hypertext
- On a computer, it's easy to make cross-references
active. You notify the reader that additional
info is available, but let the computer take the
actions necessary to obtain that info if the
reader desires it.
- Such an active cross-reference is called a
hyperlink (or just link) and text that contains
such links is called hypertext.
- This concept is fundamental to the Web as we know
it.
7Web presentations
- Most Web pages do not exist in isolation.
- The vast majority of them are grouped together
into collections of pages with a common purpose
or theme.
- Such a collection of Web pages is called a Web
presentation or Web site.
- Typically, all the pages within a given
presentation are under the editorial control of a
single individual or organization.
8Web presentations (cont.)
- A given Web page is likely to contain several
links to other pages.
- Often, those links will lead to other resources
within the same presentation. These links are
called local links or links to local
resources.
- Some of those links may lead to other resources
which are part of a different presentation. These
links are called remote links or links to
remote resources.
9Clients and servers on the Web
- Like most Internet services, the Web is based on
the client/server model.
- A Web browser is just a specific example of a
client program.
10Clients and servers on the Web (cont.)
- The browser can't accomplish much without the
cooperation of a server.
- A Web server is a program that makes files
available to Web browsers upon request.
- In general, the files a Web server makes
available contain Web pages and the images,
sounds, videos and other media that supplement
them.
- And all the files a Web server has access to are
generally stored in the secondary storage of the
host on which the server runs.
11Hypertext Transfer Protocol
- Hypertext Transfer Protocol (HTTP) is the
protocol that Web browsers and Web servers use to
communicate with one another.
- As a protocol, it carefully defines the range of
possibilities, determining precisely what a
browser may say to a server and when.
- Of course, it also dictates what servers can say
to browsers and when.
[Diagram: the Browser says to the Server, "I need the file page.html"; the Server replies, "Here is the file page.html".]
12HTTP requests and responses
- When speaking HTTP, a Web browser generally
sends an HTTP GET request to the Web server on a
specific host requesting a specific resource.
- When it receives an HTTP GET request from a
browser, a Web server, in turn, sends some sort
of HTTP response back to the browser.
- Most commonly, the response will consist of the
file and some information about the file.
- But on occasion, the response will consist of an
error message of some sort.
- Note that HTTP requests and responses rely on TCP
and IP to get across the Internet. (see p. 72-74)
- In other words, HTTP is layered on top of TCP and
IP.
[Diagram: the Browser sends the Server an HTTP GET request for /page.html. The Server answers with one of two responses:
HTTP response. Status code: 200. Content-type: text/html. Content-length: 4370. Contents of /page.html.
HTTP response. Status code: 404 Not Found. Content-type: text/html. Content-length: 1634. Contents of error status page.]
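The request and responses on this slide can be made concrete with a small Python sketch. It is only an illustration: the function name parse_response, the HTTP/1.1 version string and the placeholder body are assumptions, while the status codes and header values are the ones shown on the slide.

```python
def parse_response(raw: str):
    """Split a raw HTTP response into status code, reason, headers and body."""
    # A blank line separates the status line and headers from the contents.
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    # The status line looks like: HTTP/1.1 404 Not Found
    _version, code, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers, body


# What a browser might send (the hostname is taken from a later slide).
get_request = "GET /page.html HTTP/1.1\r\nHost: www.sample.com\r\n\r\n"

# What a server might send back on success. The body here is a short
# placeholder, not the real 4370-byte file.
ok_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "Content-Length: 4370\r\n"
    "\r\n"
    "<html>...</html>"
)

code, reason, headers, body = parse_response(ok_response)
print(code, reason, headers["content-type"])
```

The same function handles the error case, since a 404 response has the same shape and differs only in its status line and contents.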
13The server's responsibilities
- When it receives an HTTP GET request, a Web
server must prepare an appropriate HTTP response
message.
- The request will specify the file it is
requesting.
- The server must first locate the requested file
within the file system of its host.
- If the file cannot be located, the server sends
back a 404 Not Found response message.
14The server's responsibilities (cont.)
- Having found the file, however, the server must
also verify that the file permissions allow it to
access the file.
- If the server is not able to access the file, it
will typically return a 403 Forbidden response
message.
- If the requested file is located and accessible,
the server generates a 200 OK response message
that includes the contents of the file as well as
a variety of headers that provide information
about the file, such as its type, size and last
modified date.
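The locate-then-check-permissions sequence described on these two slides might be sketched in Python as follows. This is a simplified illustration, not how any particular Web server is implemented: the names choose_status, describe_file and document_root are hypothetical, and the sketch ignores details a real server must handle (for example, pathnames that try to climb out of the document root).

```python
import os
from email.utils import formatdate


def choose_status(document_root: str, pathname: str) -> int:
    """Return the status code a simple file server might send for pathname."""
    # Map the requested pathname onto the host's file system.
    file_path = os.path.join(document_root, pathname.lstrip("/"))
    if not os.path.isfile(file_path):
        return 404  # Not Found: the file cannot be located
    if not os.access(file_path, os.R_OK):
        return 403  # Forbidden: located, but permissions deny access
    return 200      # OK: located and readable


def describe_file(file_path: str) -> dict:
    """Collect the size and last-modified date a 200 response reports."""
    info = os.stat(file_path)
    return {
        "Content-Length": info.st_size,
        # HTTP dates are written in GMT, e.g. "Sun, 16 Sep 2007 12:00:00 GMT".
        "Last-Modified": formatdate(info.st_mtime, usegmt=True),
    }
```

For example, choose_status("/var/www", "/page.html") would return 200 only if /var/www/page.html exists and is readable; the directory /var/www is an assumed document root.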
15Locating files
- A typical host stores thousands of files, all of
which must be uniquely identified.
- It's impractical to give 100,000 files unique
names.
- Instead, a host uses a file system consisting of
a hierarchy of directories to create uniquely
identified locations in which files may be
stored.
16Locating files (cont.)
- Each location can be uniquely identified by the
sequence of steps necessary to reach it from the
top of the hierarchy.
- The list of steps needed to reach a location from
the top of the hierarchy is called the absolute
path to that location, and every location has a
unique absolute path.
17Locating files (cont.)
- All items in a given location must have unique
names.
- So each item in the hierarchy can be uniquely
identified by combining its absolute path with
its filename to form an absolute pathname.
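As a small illustration of combining an absolute path with a filename, here is a Python sketch. The function name absolute_pathname is hypothetical, and the directory and file names are borrowed from the sample URL used later in these slides.

```python
import posixpath


def absolute_pathname(steps, filename):
    """Join the steps from the top of the hierarchy with a filename."""
    # The absolute path is the list of steps, starting from the top ("/").
    absolute_path = "/" + "/".join(steps)
    return posixpath.join(absolute_path, filename)


pathname = absolute_pathname(["products", "catalog"], "prod1.html")
print(pathname)

# Going the other way: split a pathname back into its path and filename.
path, filename = posixpath.split(pathname)
print(path, filename)
```

Two files may share the name prod1.html as long as they live in different locations, because their absolute pathnames still differ.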
18Uniform Resource Locators
- Before a browser can request a resource, it needs
to know where it can find that resource and what
type of server will be providing it.
- To find a specific resource, the browser must be
told not only the name of the file containing
that resource, but also what host it is on and
where it is in the file system of that host.
- Fortunately, all the information needed to find a
specific resource, out of the billions available
on the Web, is contained in that resource's
Uniform Resource Locator (URL).
- Each resource available on the Web is identified
by a unique URL that contains all the information
necessary for a browser to retrieve that resource.
19Uniform Resource Locators (cont.)
- Regardless of how the URL is provided, the
browser always does the same thing with it: it
requests the resource and renders it on the
screen.
- In computer science, we use the term render to
refer to the process of producing an image by
interpreting some data.
- A browser renders a Web resource by determining
what to display on the screen based upon what it
finds in the HTTP response that contains the
contents of that resource.
20The anatomy of a URL
- Consider a typical URL:
http://www.sample.com/products/catalog/prod1.html
- A URL typically begins with the protocol to use
when accessing the resource.
- The remainder of the URL is the identifier that
tells the browser how to locate the resource.
- The identifier starts with a hostname that
uniquely identifies the host on which the
resource is stored.
- The rest of the identifier is the pathname that
uniquely locates the resource in that host's file
system.
- The pathname, as we've discussed, consists of a
path and a file name.
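The parts of the sample URL can be pulled apart with Python's standard urllib.parse module. This is a minimal sketch; the function name anatomy and the dictionary keys are illustrative choices that follow the vocabulary of this slide.

```python
import posixpath
from urllib.parse import urlsplit


def anatomy(url: str) -> dict:
    """Break a URL into the parts named on this slide."""
    parts = urlsplit(url)
    # The pathname itself consists of a path and a file name.
    path, filename = posixpath.split(parts.path)
    return {
        "protocol": parts.scheme,
        "hostname": parts.hostname,
        "pathname": parts.path,
        "path": path,
        "filename": filename,
    }


pieces = anatomy("http://www.sample.com/products/catalog/prod1.html")
for name, value in pieces.items():
    print(f"{name}: {value}")
```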
21The Web step-by-step: step 1
- The process of displaying a Web resource begins
when the browser is given the URL of that
resource by the user.
- The browser examines that URL to find out what it
needs to do next.
- The first part (e.g., http://) tells the browser
what protocol to use, and indirectly what type of
server to contact.
- The identifier tells the browser where the
resource is located.
- The hostname in the identifier tells the browser
which host is running the server responsible for
the resource.
- The pathname in the identifier tells the browser
precisely where the desired resource is stored in
that host's file system.
- Using this information, the browser composes an
HTTP GET request message.
- The GET request contains the pathname of the
desired resource as well as the hostname of the
server's host and various other information.
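Step 1 could be sketched in Python roughly like this. The function name build_get_request is hypothetical, and the sketch makes simplifying assumptions: it handles only http:// URLs, ignores query strings, and includes just one piece of "other information" (a Connection header) where a real browser sends many more.

```python
from urllib.parse import urlsplit


def build_get_request(url: str) -> str:
    """Compose the text of an HTTP GET request for the given URL."""
    parts = urlsplit(url)
    if parts.scheme != "http":
        raise ValueError("this sketch only handles http:// URLs")
    pathname = parts.path or "/"  # an empty pathname means the top level
    lines = [
        f"GET {pathname} HTTP/1.1",  # the pathname of the desired resource
        f"Host: {parts.hostname}",   # the hostname of the server's host
        "Connection: close",         # an example of other information
        "",                          # a blank line ends the request
        "",
    ]
    return "\r\n".join(lines)


request = build_get_request("http://www.sample.com/products/catalog/prod1.html")
print(request)
```

Note how short the resulting message is: a few lines of text, regardless of how large the requested resource turns out to be.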
22The Web step-by-step: step 2
- The HTTP GET request must be sent to the
appropriate server.
- Since it must arrive in its entirety at a
specific host, the request gets sent over the
Internet using TCP and IP.
- To establish a TCP connection with the server,
the browser needs to know the IP address of the
host running the server.
- To get the IP address of the server's host, the
browser resolves the hostname in the URL's
identifier using DNS.
- Using the IP address of the server's host, the
browser establishes a connection with the server.
- The HTTP GET request message is sent to the
server over this connection. Since the request
message is small, it takes little time to send.
23The Web step-by-step: step 3
- When a Web server receives an HTTP GET request,
it composes an HTTP response.
- Using the pathname specified in the request, the
server attempts to locate the file containing the
resource within the file system of its host.
- Once the resource's file has been located, the
server verifies that it has permission to access
that file.
24The Web step-by-step: step 3 (cont.)
- If the server is able to locate and access the
file, the HTTP response will indicate success.
- The response will also indicate the date and time
at which the file was last modified, the type of
resource the file contains and how big it is.
- And the server will include the contents of the
resource's file in the response message.
- Note that this means the size of the response
message is primarily determined by the size of
the resource being requested.
- If the server is unable to locate or access the
file, the HTTP response will indicate the nature
of the problem.
- The response may also contain some content for
the browser to use in lieu of the requested
resource.
25The Web step-by-step: step 4
- Having composed an HTTP response, the server must
now send it back to the requesting browser.
- The server uses TCP over IP for this purpose.
- It gets the IP address for the browser from the
packet that carried the HTTP request.
- Because they typically contain the contents of
the requested resource, HTTP response messages
tend to be significantly larger than HTTP request
messages.
- Responses generally take much longer to send over
the Internet than requests.
- This is generally the source of the derogatory
term "The World Wide Wait."
- To minimize the time a user must wait to receive
a requested resource, it's up to the creator of
that resource to minimize the size of the file
containing the resource.
26The Web step-by-step: step 5
- Upon receiving an HTTP response message, the
browser is responsible for rendering the resource
it contains.
- Many resources will be Web pages, which are
written in Extensible Hypertext Markup Language
(XHTML).
- Rendering a Web page involves interpreting the
XHTML to determine what the page should look
like.
- Other resources, however, will be other forms of
media such as images, sounds and video.
- Rendering multimedia resources involves
interpreting the data those resources contain and
producing the image, sound or video that data
represents.
- Browsers therefore need to understand a range of
resource types.
27The Web step-by-step: step 5 (cont.)
- It's also useful to note at this stage that even
though a Web page may appear to contain images,
sounds and videos, each of those resources must
be stored separately in its own file.
- And each of those resources must therefore be
retrieved from a server with a separate HTTP
transaction.
- As a result, the time it takes to retrieve a Web
page is the sum of the time it takes to retrieve
all of its component parts.
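The "sum of its parts" idea can be worked through with a little arithmetic in Python. All of the numbers below are assumptions chosen for illustration (a small XHTML file plus two images, over a slow connection of 7,000 bytes per second), and the sketch deliberately ignores the fixed overhead of each HTTP transaction.

```python
def page_load_seconds(component_sizes_bytes, bytes_per_second):
    """Estimate retrieval time as the sum of the times for each component."""
    return sum(size / bytes_per_second for size in component_sizes_bytes)


# Hypothetical page: the XHTML file itself plus two image files.
sizes = [4000, 20000, 60000]
estimate = page_load_seconds(sizes, bytes_per_second=7000)
print(f"about {estimate:.1f} seconds")
```

In this made-up example the XHTML file accounts for well under a second of the total; the images dominate, which is why keeping media files small matters so much.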
28The browser lends a hand
- Browsers can also play a role in minimizing the
time the user must wait for a page to load.
- A user often revisits the same resources
repeatedly.
- Imagine waiting five minutes to retrieve a
resource.
- Then, after the resource loads, you activate a
link within it and go to another resource.
- Examining the second resource, you realize it's
not what you expected and decide to use your
browser's back button to return to the previous
resource.
- Now the browser needs to retrieve the same
resource it just rendered all over again.
- Obviously, you don't want to wait five minutes
again.
- What you want is for the browser to have saved
that resource so that you can return to it
without having to request it from the server
again.
- That's exactly what browsers do.
29The browser cache
- As a browser receives each requested resource, it
stores a copy of that resource in a special place
called the browser cache.
- Along with the contents of the resource it stores
the current date and time and the URL used to
retrieve the resource.
- Each time a resource is requested, the browser
checks to see if that resource is already stored
in its cache.
- If it's not, then the browser goes about
retrieving the resource as we've already
described.
30The browser cache (cont.)
- If the resource is in the cache, however, the
browser may be able to use it.
- To find out if it's usable, the browser sends an
HTTP HEAD request for that resource to the
server.
- This causes the server to send back only the
information about the resource, which will
include the date and time it was last modified.
- If the resource on the server has not been
modified since the copy of that resource was
stored in the browser's cache, the browser can
use the cached copy.
- Otherwise, the browser must retrieve (and cache)
a fresh copy from the server.
- This requires more HTTP messages, but they're
smaller on average.
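The freshness check at the heart of the browser cache comes down to comparing two dates. A minimal Python sketch follows; the function name cached_copy_is_usable is hypothetical and the dates are made up. It assumes the server reports the modification time in the standard HTTP date format. (Browsers in practice often fold this check into a conditional GET request, but the comparison is the same.)

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def cached_copy_is_usable(cached_at: datetime, last_modified_header: str) -> bool:
    """True if the resource has not changed since the copy was cached."""
    # Parse a header value such as "Sat, 15 Sep 2007 08:30:00 GMT".
    last_modified = parsedate_to_datetime(last_modified_header)
    return last_modified <= cached_at


# The date and time stored alongside the cached copy (assumed to be in UTC).
cached_at = datetime(2007, 9, 16, 12, 0, 0, tzinfo=timezone.utc)

print(cached_copy_is_usable(cached_at, "Sat, 15 Sep 2007 08:30:00 GMT"))
print(cached_copy_is_usable(cached_at, "Mon, 17 Sep 2007 08:30:00 GMT"))
```

The first check succeeds because the file was last modified before the copy was cached; the second fails because the file changed afterward, so a fresh copy is needed.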
31When things go wrong
- Although an HTTP transaction often goes off
without a hitch, there are places where problems
can occur.
- Knowing what might go wrong can help us make
sense of otherwise cryptic or confusing error
messages we may get from our browser.
- Of course, different browsers and servers are
free to use different error messages as they see
fit, so the wording may differ.
32When things go wrong (cont.)
- If the hostname in the URL cannot be resolved to
an IP address using DNS, there's no way to
establish the necessary TCP connection to the
server.
- In this case, we'll get an error to the effect of
"Unable to locate server."
33When things go wrong (cont.)
- The hostname may resolve, but it may not be
possible to establish the TCP connection for a
variety of other reasons.
- In this case, we'll get an error to the effect of
"No response."
34When things go wrong (cont.)
- If we're able to get a TCP connection and send an
HTTP request to the server, there's no guarantee
it will be successful.
- If the server is unable to locate the requested
file, we'll get an error to the effect of
"Not found."
- If the server locates the file but does not have
permission to access it, we'll get an error to
the effect of "Forbidden" or "Access denied."
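The four failure points from the last three slides line up with distinct error conditions in a program. The Python sketch below shows one way to map them to the messages used here; the function names are hypothetical and, as noted earlier, real browsers word their messages differently.

```python
import socket


def describe_failure(error: Exception) -> str:
    """Map a connection-stage error to the messages used on these slides."""
    # socket.gaierror is raised when DNS cannot resolve the hostname.
    if isinstance(error, socket.gaierror):
        return "Unable to locate server"
    # Other OSErrors (refused, timed out, unreachable) mean the hostname
    # resolved but the TCP connection could not be established.
    if isinstance(error, OSError):
        return "No response"
    return "Unknown error"


def describe_status(status_code: int) -> str:
    """Map an HTTP status code to the messages used on these slides."""
    messages = {200: "OK", 403: "Forbidden", 404: "Not found"}
    return messages.get(status_code, f"Status {status_code}")


print(describe_failure(socket.gaierror("hostname not found")))
print(describe_failure(ConnectionRefusedError()))
print(describe_status(404))
```

The order of the checks in describe_failure matters: socket.gaierror is itself a kind of OSError, so it has to be tested first.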
35And how to fix it
- Understanding the root cause of an error can
often help you devise a solution to the problem.
36And how to fix it (cont.)
- If you get an "Unable to locate server" error,
you know there's a problem with the hostname in
the URL.
- Double-check your typing of the hostname.
- Make sure your network connection is still
working.
- Ensure that your DNS server is functioning in
general.
37And how to fix it (cont.)
- If you get a "No response" error, you know the
hostname is okay but the server is not able to
respond.
- Often, there's nothing you can do about this
yourself.
- However, since this is often a temporary problem,
try again a little later.
38And how to fix it (cont.)
- If you get a "Not found" error, you know there's
a problem with the pathname in the URL.
- Again, double-check your typing, paying attention
to case.
- Try eliminating steps from the pathname one at a
time, moving from right to left.
- How?
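One way to picture "eliminating steps from right to left" is to list the progressively shorter URLs you would try. The Python sketch below does that for the sample URL from the earlier slide; the function name shorter_urls is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def shorter_urls(url: str) -> list:
    """List the URLs obtained by dropping pathname steps from the right."""
    parts = urlsplit(url)
    steps = [step for step in parts.path.split("/") if step]
    candidates = []
    while steps:
        steps.pop()  # eliminate the rightmost remaining step
        path = "/" + "/".join(steps)
        if steps:
            path += "/"  # what remains is a directory
        candidates.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return candidates


for candidate in shorter_urls("http://www.sample.com/products/catalog/prod1.html"):
    print(candidate)
```

Trying each of these in turn shows how far down the hierarchy the pathname is still valid, which often reveals where the typing mistake or moved file is.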
39And how to fix it (cont.)
- If you get a "Forbidden" error, the problem is
with the permissions on the file containing the
requested resource.
- If the file belongs to you, simply adjust the
permissions.
- Otherwise, there's little you can do about this
problem yourself except contact the owner of the
resource.
40Resource types
- As we've seen, the Web consists of a variety of
resource types.
- In each HTTP response, the server includes an
indicator of the resource's type so the browser
knows how to render it.
- Since servers and browsers must agree on the
meaning of this type info, it needs to be
standardized.
41Resource types (cont.)
- The standard used for this purpose is called
Multipurpose Internet Mail Extensions (MIME).
- As you can tell from its name, MIME was
originally designed for use with e-mail.
- A MIME type consists of an indicator of the
general resource type (text, image, audio, etc.)
followed by a / followed by an indicator of the
specific resource type (html, jpeg, mpeg, etc.).
- For example, XHTML files are assigned a MIME type
of text/html.
- JPEG image files are assigned a MIME type of
image/jpeg.
- MP3 sound files are assigned a MIME type of
audio/mpeg.
42Filename extensions
- The server needs to know the type of each
resource for which it is responsible.
- Otherwise, it wouldn't know what MIME type to
list in the HTTP response message.
- To avoid having to explicitly tell the server the
type of each resource, servers are set up to use
the extension of the resource's filename to
determine its type.
- A filename extension is part of the actual
filename, but it comes at the end and starts with
a dot.
- Examples?
- The server is configured to associate certain
filename extensions with specific MIME types.
43Filename extensions (cont.)
- For this reason, it's important to name all of
the files containing your Web resources with
appropriate filename extensions.
- We'll generally use only a small number of
resource types in this course.
- XHTML files are given .html (or .htm) extensions.
- JPEG images are given .jpg (or .jpeg)
extensions.
- GIF images are given .gif extensions.
- CSS files are given .css extensions.
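A server's association between filename extensions and MIME types can be pictured as a simple lookup table. The Python sketch below covers only the extensions listed above; the names MIME_TYPES and mime_type_for are hypothetical, and the fallback type application/octet-stream (a generic type for unknown data) is an assumption about what a server might report for unrecognized extensions.

```python
import os

# Extension-to-MIME-type table, as a server might be configured.
MIME_TYPES = {
    ".html": "text/html",
    ".htm": "text/html",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".css": "text/css",
}


def mime_type_for(filename: str, default: str = "application/octet-stream") -> str:
    """Pick a MIME type from the filename's extension."""
    # The extension is the last part of the filename, starting with a dot.
    _, extension = os.path.splitext(filename)
    return MIME_TYPES.get(extension.lower(), default)


for name in ["prod1.html", "photo.JPG", "style.css", "notes"]:
    print(name, "->", mime_type_for(name))
```

A file saved without an extension, or with the wrong one, gets the wrong type in the response, and the browser may then fail to render it properly.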
44What Browsers Understand
- A browser understands the HTTP protocol for
retrieving Web pages.
- Most browsers also understand protocols for other
Web services like file transfer, instant
messaging, e-mail and network news.
- A browser understands XHTML and HTML and can
interpret them in order to render Web pages.
- Many also understand other popular languages like
CSS, JavaScript and XML.
45What Browsers Understand (cont.)
- Most browsers understand common image file
formats like JPEG and GIF and can render images
stored in these formats.
- Some also understand image file formats like BMP
and PNG.
- Many browsers understand other forms of media as
well.
- Flash presentations are used for interactive
animations.
- MP3 is a file format commonly used for storing
sounds and music.
- MPEG and AVI are common file formats for storing
video.
46What Browsers Understand (cont.)
- A good browser is designed to provide the
functionality most Web users are likely to need.
- Browser designers, however, realize that people
use the Web in many different ways.
- For this reason, most browsers are designed to
accept two different types of add-ons that extend
their capabilities.
47Add-Ons: Helpers and Plug-Ins (p. 76-83)
- An application is a program you run on your
computer to accomplish specific tasks.
- You can obtain applications from retail software
stores or the Internet.
- A browser often uses other applications to view
the Web.
- You can customize what applications your browser
uses.
48Helpers
- A helper application is an application a browser
can launch. It can be any application on your
computer.
- Examples?
- When your browser encounters a file that requires
special handling, it looks for an appropriate
helper application and opens the file in that
application.
- When browsers were first introduced, helper
applications were the only option.
49Plug-Ins
- A browser plug-in is an application that expands
the capabilities of a Web browser.
- When you install a plug-in, you extend the
capabilities of your browser to handle a file
type that it wasn't originally designed to
handle.
- Any file requiring that plug-in will be displayed
inside the browser window, with the plug-in
working as if it were a part of your browser.
50Plug-Ins (cont.)
- Plug-ins support everything from audio to
animation to documents.
- Plug-ins increase your browser's memory
requirements and launch time.
- You can find Web pages to help you locate
plug-ins for your browser.
51Common plug-ins and helper applications
52Review questions
- Define the World Wide Web and explain its
relationship with the Internet.
- Explain what is meant by referring to the Web as
a conceptual network of distributed resources.
- Explain the concept of hypermedia.
- What type of information does a Web server
typically include in the header of an HTTP
response, and how might it be useful to a Web
browser?
- Explain the usefulness of a file system in the
context of the Web.
- Describe three ways in which a user might specify
a desired URL to their browser.
- Explain how the Web works behind the scenes. What
roles do Hypertext Transfer Protocol (HTTP),
Uniform Resource Locators (URLs), and the
browser's cache play in this process?
- What are some common errors that can occur when
requesting a Web page and what do they mean?
- Explain the relationship between resource types
and filename extensions on the Web. Why is it
important?
- Compare and contrast a plug-in with a helper app.
53Key terms
- Absolute path
- Absolute pathname
- Browser cache
- Browsing
- Conceptual network
- File system
- Filename extension
- Helper app
- Hostname
- HTTP
- HTTP GET request
- HTTP HEAD request
- HTTP response
- Hyperlink
- Hypermedia
- Hypertext
- Identifier
- Link
- Local link
- MIME
- MIME type
- Pathname
- Permissions
- Plug-in
- Remote link
- Render
- Scheme
- URL
- Web browser
- Web presentation
- Web server
- Web site
- World Wide Web
- XHTML
54- Some information used from
- Web 101 by Lehnert and Kopec