How The Web Works
The World Wide Web
- World Wide Web
- Q: why did Microsoft, IBM, Apple .. not invent the internet?
- 1993 Sir Tim Berners-Lee
- Set of free and open, vendor-neutral standards
- --TCP/IP -- underlying networking
- --HTML -- web page
- --HTTP -- web connection
- --JPEG, PNG -- image formats
- Web Page = HTML - Hypertext Markup Language
The familiar "web" of connected web pages runs on top of the basic TCP/IP phone system. Chat programs, email, .. these are other services, distinct from the web, which also run on top of the basic connectivity provided by TCP/IP. The web was created by Tim Berners-Lee working at the physics research facility CERN in Switzerland (now Sir Tim Berners-Lee). Browsers were available in 1993, and the web, urls etc. were becoming broadly popular by 1995.
Study question: why did something as important as the web not come out of a computer company like IBM or Microsoft or Apple or whatever? The web is a free and open standard (like TCP/IP), and for the most part is not locked-in to any particular vendor, and this freedom is a vital part of the web's success. Vendors would love to get their proprietary technology into the web (e.g. Microsoft, Adobe). Adobe Flash is currently the one proprietary, vendor-lock-in component that is widely used on the Internet, although it seems to be slowly dying off in favor of open standards.
1. A URL
- URL Uniform Resource Locator
- e.g. http://www.stanford.edu/class/cs101
- http: -- scheme
- www.stanford.edu -- domain name of server computer
- /class/cs101 -- "path" particular page on that server
A visit to a web page begins with a URL (Uniform Resource Locator) that points to that web page. Of course you've seen a million URLs over the years, but we'll look at the parts:
The http at the start is the networking scheme to use, and "http" and its secure variant "https" are by far the most common. In the future, if there were some new networking scheme, the URL syntax could still support it by starting with a different word before the colon.
After the // we have the www.stanford.edu which is the domain name of the computer on the internet that hosts this web page -- the web server. For the browser to request this web page, it will make a TCP/IP connection to that machine.
After the domain name we have the /class/cs101 path which indicates essentially which directory and file we want specifically from that web server.
2. Web Browser "Client"
- Web Browser, e.g. Firefox
- User types urls, clicks
- Browser requests data over internet
- Displays resulting HTML etc.
- History, back-button
The Web Browser is the familiar computer program, such as Firefox, that you run on your local computer to access the web. In short, you type URLs into your browser or click a link, and the browser requests and displays those pages for you. The browser also keeps track of your history of web pages so it can implement the back-button for you.
In networking terminology, the browser is the "client" which makes requests and displays what it gets back. The "server" is the other side of the request/response, servicing requests it gets.
3. Web Server
- Web Server, e.g. apache (open source)
- Web page is implemented by a web server
- Server machine, on internet
- Running all the time
- Needs a fixed, known IP address
- Stores a bunch of files (HTML, JPEG, ..)
- Gets requests (from browsers)
- Sends back response (HTML, JPEG, ..)
The other side of the conversation is the web server -- a machine which hosts a set of web pages, and waits for requests to come in for those pages. The phrase "web server" can refer to the physical machine, or it can mean the program that responds to requests. Below I'll use the phrases "web server machine" or "web server program" to distinguish those two cases.
The web server machine needs to be switched on, ready, and connected to the internet at all times. It is essentially waiting for an incoming request which could happen at any time. In contrast, you can switch on your laptop, do some browsing, and switch it off.
The web server program runs all the time, handling any incoming requests. For simple web pages, the web server program identifies a directory (aka a folder) as the web-root of the files to serve. The "path" part of the url maps into the web-root directory. So the url http://example.com/a.html means to get the file a.html from the web-root directory. The web-root can itself contain directories, so http://example.com/class/cs101/b.html refers to a "class" directory in the web-root directory, in turn containing a "cs101" directory, which contains a b.html file.
Question: what does it mean for someone to edit a web site? One simple case is that they change the files located in the web-root, such as editing the class/cs101/b.html file in the example above. After the edit, someone out on the internet who requests http://example.com/class/cs101/b.html will get the new version. In reality, the HTML documents which make up a web site may be stored in something more complex than the simple web-root directory arrangement.
Put It Together -- HTTP Request / Response
The HTTP (Hyper Text Transfer Protocol) standard describes how a browser makes a request to the web server program. ("Hypertext" is the idea of links within documents pointing to other documents. This idea long predates the web.) If HTML describes the code for a web page, HTTP is the protocol for getting a web page from the server.
- HTTP centers on a simple request/response dialog between the browser client and the sever.
- You give your browser a URL to display -- either by typing it in, or by clicking a link in a displayed web page.
- e.g. http://www.stanford.edu/class/cs101/syllabus.html
- The browser opens a TCP/IP connection to the machine with the domain name from the url, e.g. www.stanford.edu
- On this connection, the browser sends a request for the path from the url, e.g. /class/cs101/syllabus.html
- The web server program takes in this request. The server typically has many files, organized in folders. There server finds the file corresponding to the request path, and sends back the response data, in this case HTML from the file syllabus.html.
- The browser gets back the HTML code and displays it.
- The response could be other types of data, such as a JPEG image, or the response could say that the request failed, such as with the "404 not found" error response.
The server just sends back back the HTML or whatever data to your browser. Your browser then "renders" this data into a window. This is why the View Source command works -- it just shows you the HTML of the server response, which is what the browser was using anyway.
The approximate appearance of the HTML is specified in the standard, but not the exact details; the appearance can vary with how wide your browser window is, what fonts your machine has etc. If you want to send a document and specify exactly how it looks, where the line breaks are etc. use PDF (Portable Document Format, owned by Adobe but also a free standard).
Q: On Your Laptop
You have some HTML on your laptop. In what ways does your laptop not going to work to serve these files? (i.e. what does the server do?)
- Simplest case -- server holds unchanging files
- "Web application" .. page contents are dynamic
- e.g. GMail inbox
- A program on the server runs to produce the page
The above HTTP request/response sequence is the major pattern of the web. For the simple case outlined above, we have static, unchanging content. Each web page corresponds basically to an HTML file stored on the server, and the contents of the file do not change quickly.
A more complex web site will have some pages which are "dynamic" -- the HTML for them is computed, producing HTML on the fly. With a static web site, each web page corresponds roughly to a file stored on the server. With a dynamic web site, a web page corresponds to a program on the server. A request for that web page runs the corresponding program code on the server. That code, essentially runs a series of print statements (like the print we we have used) to dynamically produce the HTML which is sent back as the response. The program could do anything -- look at various data sources, putting together any sort of HTML response page.
Dynamic Web Site - Google Trends
- Web application example
- Type in a word or two: volcano, primary (4 year cycle), oscar (1 year cycle)
- HTTP request as usual, "submit" button
- Server runs a program to compute the result page on the fly
- Program basically uses print as we have seen .. producing HTML
A dynamic web site with a very simple interface is www.google.com/trends -- which graphs the frequency that different words appear in google searches. The front page shows an HTML form which includes fields where you can type in information. Clicking the button in the form (or sometimes typing return) "submits" the form to the server -- sending a request to the server which includes the values typed into the fields. This runs a program on the server which takes in the inputs from the form, and looks up information stored in files or databases on the server. The program puts all this information together and dynamically produces the HTML and images etc. for the result -- basically using print to produce HTML. Note that this still fits within the request/response pattern, but now the response is a one-off, computed on the fly just for this request.
The site sfbay.craigslist.org/ is another nice simple example of a dynamic form/response website. Here's how it works at a very high level. On the craigslist servers are files that store all the current listings .. say 1 million listings. When you submit your search term, code runs on the craigslist server, pulling out the listings that include that word, and printing out HTML to show you the first 100 matches. This program will use a for-loop, if-statement, and print, just as we have been doing.
Web Application Advantages
- Platform independent/neutral -- you can use a browser running on Mac or Linux or Windows or whatever (also known as "cross-platform"). HTML is deliberately platform-independent. There is zero lock-in to what sort of computer you use. If you hold a monopoly in operating systems, you naturally dislike platform neutrality. That's what has happened with Microsoft -- initially holding out against the web, destroying Netscape, providing incredibly buggy web browsers, etc. We'll look at this later.
In fact, the web is a platform that apps can target and run upon. It's just a platform that is not controlled by any one vendor.
- Data in the cloud -- you can go up to any browser and access your data. The cloud servers will do a better job of backing up the data than your in-home attempts.
- Nothing to install -- once you have the browser working, that's all you need. Installing software can be a pain, so this "zero-barrier" quality is nice. For example, a high school teacher using codingbat.com is so easy .. they just go to the url and it works.
- Software Updates -- analogously to software installation, the programs now reside on the server. The server people can monitor and upgrade the software continuously. The users just use the service, and get the latest.
Web Application Disadvantages
- Limited to HTML -- HTML running in the browser is not as capable of a full program running on the computer. HTML and the browser have limitations about what can be drawn and how computation can be done, although HTML is gradually becoming more and more powerful. For example, think of a program as complex as Photoshop with its many little windows, support of plugins and external drawing tablets etc. HTML does not support those things yet. You can build simple interfaces in HTML, and for many people that's good enough. Since HTML is a committee driven standard, progress can be slow.
- Data in the Cloud -- if you have some top secret data, you may prefer to just store it on your own hardware. However, in my opinion, the security practices at a cloud company like google are much better than the workaday security of, say, your laptop locked up in your house. This perhaps more of an emotional issue than a practical one.
- Lock-in -- as above, your data is up in the cloud, there is the potential for a lock-in situation, where it's hard for you to get your own data (actually this can happen with regular files too if they are in some proprietary format). So this is not a web feature, it's just a standard effect of getting software from a vendor which might lock you in by storing your data in their lock-in format.
Web Page - HTML
Here is simple web page with a few elements...
This is the first paragraph.
This is a second paragraph, including a link to the codingbat site
An image is done with "img" tag which includes a "src" url pointing to the image data file, like this
- HTML text code describes a web page
- Plain text with "tags" to mark text as bold etc. within brackets < ... >
- "h1" or "h2" or "h3" are headlines, e.g. <h2>A Heading</h2>
- "b" tags to mark bold: <b>like this</b>
- "a" tags to mark a url:
<a href="http://codingbat.com">codingbat site</a>
- "img" tag to load in image:
- Below is the HTML code/tags to produce the page above:
<html> starts the whole thing. The <head> section with <title> sets the title used at the top of the window. Inside <body> is the regular HTML content of the page.
<html> <head> <title>PAGE TITLE HERE</title> </head> <body> <h2>A Heading</h2> <p>This is the first <b>paragraph</b>. <p>This is a second paragraph, including a link to the <a href="http://codingbat.com">codingbat site</a> <p>An image is done with "img" tag which includes a "src" url pointing to the image data file, like this <br><img src="http://www.stanford.edu/class/cs101/abby.jpg"> </body> </html>
A web page is written is written in a plain text code called HTML (Hyper Text Markup Language). Basically, HTML adds "markup" commands <...> within plain text. The markup indicates that parts of text should be a heading, or bold, or a url, and so on.
(You do not need to memorize HTML mark up for this class, just know what it does broadly.) You should not be intimidated about producing an HTML page to show some information. Creating a basic looking HTML page is not difficult, although of course a complex page like the nytimes.com front page is a lot of work. You can write HTML by hand, just typing in the text including the HTML tags, or use a program that looks more like a word-processor to you, but which then generates HTML for you.
HTML Edit Demo
Nick edits the file network-html-sample.html. You can View Source on this page to see its underlying html.
When you are visiting any web page, you can use the View Source command in your browser to see the underlying HTML code for the web page you see -- you will see <p> tags for paragraphs, <a href=...< tags for links, and <img src=...> tags for images. Check out the source code NYTimes.com .. it's quite a mess, but it's just HTML, rendered by your browser.
The latest version of HTML, HTML5 is becoming very popular, adding needed features to make better web pages and better web "applications" (below). Older versions of HTML lacked some features, so web pages did not look or work so well, but that's been largely fixed.
Web Design Philosophy
The most important thing to know about web design is in this comic: XKCD on web design
There are two points of view: the users of your site have interests, common questions. Very often a user visits a site with a question they want answered. The organization creating the web site has a different set of interests. The creators might care about the org-chart of what division is providing what, and who runs it, or just generally promoting how shiny and awesome their organization and their management are. The old joke is that lame web sites end up looking like the org-charts of their producing organizations.
Advice: do not treat visitors to your site as captives, to be shown little graphics or videos that are essentially advertisements or advocacy, telling the visitor how great this product is, but not actually answering their questions. Instead, the most important question for the web design is: what are the most common questions/interests visitors will have and how can we make those answers conveniently available. If you put to much pointless stuff in their way, they will just get annoyed and leave. The most common error for a web site is to treat the user more like a semi-captive TV viewer, to be subjected to glossy propaganda. Notice that really successful sites like craigslist and gmail put the user's interests first. GMail includes advertisements, but they are out of the way, giving the main display always to the user's data. There's a theory that Yahoo is having problems because they gave their advertisers big, flashy ads, but in the longer term annoyed and drove away their users. Note how google ads are small and non-flashy ... again, always aligning the user's interests, as they can always switch to another site.