How The Web Works

The World Wide Web

The familiar "web" of connected web pages runs on top of the basic TCP/IP phone system. Chat programs, email, .. these are other services, distinct from the web, which also run on top of the basic connectivity provided by TCP/IP. The web was created by Tim Berners-Lee working at the physics research facility CERN in Switzerland (now Sir Tim Berners-Lee). Browsers were available in 1993, and the web, urls etc. were becoming broadly popular by 1995.

Study question: why did something as important as the web not come out of a computer company like IBM or Microsoft or Apple or whatever? The web is a free and open standard (like TCP/IP), and for the most part is not locked-in to any particular vendor, and this freedom is a vital part of the web's success. Vendors would love to get their proprietary technology into the web (e.g. Microsoft, Adobe). Adobe Flash is currently the one proprietary, vendor-lock-in component that is widely used on the Internet, although it seems to be slowly dying off in favor of open standards.

1. A URL

http://www.stanford.edu/class/cs101

A visit to a web page begins with a URL (Uniform Resource Locator) that points to that web page. Of course you've seen a million URLs over the years, but we'll look at the parts:

The http at the start is the networking scheme to use, and "http" and its secure variant "https" are by far the most common. In the future, if there were some new networking scheme, the URL syntax could still support it by starting with a different word before the colon.

After the // we have the www.stanford.edu which is the domain name of the computer on the internet that hosts this web page -- the web server. For the browser to request this web page, it will make a TCP/IP connection to that machine.

After the domain name we have the /class/cs101 path which indicates essentially which directory and file we want specifically from that web server.

2. Web Browser "Client"

The Web Browser is the familiar computer program, such as Firefox, that you run on your local computer to access the web. In short, you type URLs into your browser or click a link, and the browser requests and displays those pages for you. The browser also keeps track of your history of web pages so it can implement the back-button for you.

In networking terminology, the browser is the "client" which makes requests and displays what it gets back. The "server" is the other side of the request/response, servicing requests it gets.

3. Web Server

The other side of the conversation is the web server -- a machine which hosts a set of web pages, and waits for requests to come in for those pages. The phrase "web server" can refer to the physical machine, or it can mean the program that responds to requests. Below I'll use the phrases "web server machine" or "web server program" to distinguish those two cases.

The web server machine needs to be switched on, ready, and connected to the internet at all times. It is essentially waiting for an incoming request which could happen at any time. In contrast, you can switch on your laptop, do some browsing, and switch it off.

The web server program runs all the time, handling any incoming requests. For simple web pages, the web server program identifies a directory (aka a folder) as the web-root of the files to serve. The "path" part of the url maps into the web-root directory. So the url http://example.com/a.html means to get the file a.html from the web-root directory. The web-root can itself contain directories, so http://example.com/class/cs101/b.html refers to a "class" directory in the web-root directory, in turn containing a "cs101" directory, which contains a b.html file.

Question: what does it mean for someone to edit a web site? One simple case is that they change the files located in the web-root, such as editing the class/cs101/b.html file in the example above. After the edit, someone out on the internet who requests http://example.com/class/cs101/b.html will get the new version. In reality, the HTML documents which make up a web site may be stored in something more complex than the simple web-root directory arrangement.

Put It Together -- HTTP Request / Response

The HTTP (Hyper Text Transfer Protocol) standard describes how a browser makes a request to the web server program. ("Hypertext" is the idea of links within documents pointing to other documents. This idea long predates the web.) If HTML describes the code for a web page, HTTP is the protocol for getting a web page from the server.

http request/response

The server just sends back back the HTML or whatever data to your browser. Your browser then "renders" this data into a window. This is why the View Source command works -- it just shows you the HTML of the server response, which is what the browser was using anyway.

The approximate appearance of the HTML is specified in the standard, but not the exact details; the appearance can vary with how wide your browser window is, what fonts your machine has etc. If you want to send a document and specify exactly how it looks, where the line breaks are etc. use PDF (Portable Document Format, owned by Adobe but also a free standard).

Q: On Your Laptop

You have some HTML on your laptop. In what ways does your laptop not going to work to serve these files? (i.e. what does the server do?)

Web Applications

The above HTTP request/response sequence is the major pattern of the web. For the simple case outlined above, we have static, unchanging content. Each web page corresponds basically to an HTML file stored on the server, and the contents of the file do not change quickly.

A more complex web site will have some pages which are "dynamic" -- the HTML for them is computed, producing HTML on the fly. With a static web site, each web page corresponds roughly to a file stored on the server. With a dynamic web site, a web page corresponds to a program on the server. A request for that web page runs the corresponding program code on the server. That code, essentially runs a series of print statements (like the print we we have used) to dynamically produce the HTML which is sent back as the response. The program could do anything -- look at various data sources, putting together any sort of HTML response page.

Dynamic Web Site - Google Trends

A dynamic web site with a very simple interface is www.google.com/trends -- which graphs the frequency that different words appear in google searches. The front page shows an HTML form which includes fields where you can type in information. Clicking the button in the form (or sometimes typing return) "submits" the form to the server -- sending a request to the server which includes the values typed into the fields. This runs a program on the server which takes in the inputs from the form, and looks up information stored in files or databases on the server. The program puts all this information together and dynamically produces the HTML and images etc. for the result -- basically using print to produce HTML. Note that this still fits within the request/response pattern, but now the response is a one-off, computed on the fly just for this request.

The site sfbay.craigslist.org/ is another nice simple example of a dynamic form/response website. Here's how it works at a very high level. On the craigslist servers are files that store all the current listings .. say 1 million listings. When you submit your search term, code runs on the craigslist server, pulling out the listings that include that word, and printing out HTML to show you the first 100 matches. This program will use a for-loop, if-statement, and print, just as we have been doing.

Web Application Advantages

Web Application Disadvantages

Web Page - HTML

Here is simple web page with a few elements...


A Heading

This is the first paragraph.

This is a second paragraph, including a link to the codingbat site

An image is done with "img" tag which includes a "src" url pointing to the image data file, like this


<html> starts the whole thing. The <head> section with <title> sets the title used at the top of the window. Inside <body> is the regular HTML content of the page.




<html>


<head>
<title>PAGE TITLE HERE</title>
</head>

<body>

<h2>A Heading</h2>

<p>This is the first <b>paragraph</b>.

<p>This is a second paragraph, including a link to the
<a href="http://codingbat.com">codingbat site</a>

<p>An image is done with "img" tag which includes a "src" url
pointing to the image data file, like this 
<br><img src="http://www.stanford.edu/class/cs101/abby.jpg">

</body>
</html>


A web page is written is written in a plain text code called HTML (Hyper Text Markup Language). Basically, HTML adds "markup" commands <...> within plain text. The markup indicates that parts of text should be a heading, or bold, or a url, and so on.

(You do not need to memorize HTML mark up for this class, just know what it does broadly.) You should not be intimidated about producing an HTML page to show some information. Creating a basic looking HTML page is not difficult, although of course a complex page like the nytimes.com front page is a lot of work. You can write HTML by hand, just typing in the text including the HTML tags, or use a program that looks more like a word-processor to you, but which then generates HTML for you.

HTML Edit Demo

Nick edits the file network-html-sample.html. You can View Source on this page to see its underlying html.

View Source

When you are visiting any web page, you can use the View Source command in your browser to see the underlying HTML code for the web page you see -- you will see <p> tags for paragraphs, <a href=...< tags for links, and <img src=...> tags for images. Check out the source code NYTimes.com .. it's quite a mess, but it's just HTML, rendered by your browser.

HTML 5

The latest version of HTML, HTML5 is becoming very popular, adding needed features to make better web pages and better web "applications" (below). Older versions of HTML lacked some features, so web pages did not look or work so well, but that's been largely fixed.

Web Design Philosophy

The most important thing to know about web design is in this comic: XKCD on web design

There are two points of view: the users of your site have interests, common questions. Very often a user visits a site with a question they want answered. The organization creating the web site has a different set of interests. The creators might care about the org-chart of what division is providing what, and who runs it, or just generally promoting how shiny and awesome their organization and their management are. The old joke is that lame web sites end up looking like the org-charts of their producing organizations.

Advice: do not treat visitors to your site as captives, to be shown little graphics or videos that are essentially advertisements or advocacy, telling the visitor how great this product is, but not actually answering their questions. Instead, the most important question for the web design is: what are the most common questions/interests visitors will have and how can we make those answers conveniently available. If you put to much pointless stuff in their way, they will just get annoyed and leave. The most common error for a web site is to treat the user more like a semi-captive TV viewer, to be subjected to glossy propaganda. Notice that really successful sites like craigslist and gmail put the user's interests first. GMail includes advertisements, but they are out of the way, giving the main display always to the user's data. There's a theory that Yahoo is having problems because they gave their advertisers big, flashy ads, but in the longer term annoyed and drove away their users. Note how google ads are small and non-flashy ... again, always aligning the user's interests, as they can always switch to another site.