URLs and Links

Lecture Notes for CS 142
Winter 2014
John Ousterhout

  • Additional reading for this topic: none.
  • Hypertext:
    • Documents containing fields that are links
    • Clicking on a link takes you someplace else (in the same document or a different document).
    • Hypertext has been around since the 1960s:
      • Ted Nelson coined the term (early '60s), built Xanadu system
      • Doug Engelbart: "Mother of all demos" in 1968
      • HyperCard for the Macintosh: 1987

URLs

  • URLs: Uniform Resource Locators:
    • Provide names for Web content
  • Example URL:
    http://www.company.com:81/a/b/c.html?user=Alice&year=2008#p2
    
    • Scheme (http:): identifies protocol used to fetch the content.
      • http: is the most common scheme; it means use the HTTP protocol, which we will discuss soon.
      • https: is similar to http: except that it encrypts communication with the server using TLS (the successor to SSL) for greater security.
      • file: means read a file from the local disk.
      • There are several other schemes, such as ftp:, but they aren't used much anymore.
    • Host name (//www.company.com): name of a machine running an HTTP server.
    • Server's port number (81): allows multiple servers to run on the same machine. If the port is omitted, the scheme's default is used (80 for http:, 443 for https:); servers almost always run on the default port.
    • Hierarchical portion (/a/b/c.html): used by server to find content. Server can use this field however it wishes:
      • Path name for a static HTML file.
      • Path name for file containing code which, when executed, will generate a page (e.g., foo.php).
      • className/method, identifying a particular method in a particular class, which will generate HTML (e.g., controller/action in Ruby on Rails).
      • When you set up a Web server you provide routing information that tells how to interpret the hierarchical portion of a URL.
    • Query info (?user=Alice&year=2008): provides additional parameters that can be used by the server to select dynamic content. For example:
      http://www.company.com/showOrder.php?order=4621047
      
    • Fragment (#p2): selects a particular location in the resulting page. Instead of displaying the top of the page, the browser scrolls the window so that the named fragment appears at the top. The fragment is used by the browser only; it is not sent to the server.
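
The components listed above can be pulled apart programmatically. The following is a minimal sketch using Python's standard urllib.parse module on the example URL from these notes; the variable names (url, parts, params) are illustrative assumptions, not part of the original notes.

```python
from urllib.parse import urlsplit, parse_qs

# The example URL from the notes.
url = "http://www.company.com:81/a/b/c.html?user=Alice&year=2008#p2"

# urlsplit breaks the URL into its structural components.
parts = urlsplit(url)

print(parts.scheme)    # protocol used to fetch the content
print(parts.hostname)  # machine running the HTTP server
print(parts.port)      # explicit port number (None if omitted)
print(parts.path)      # hierarchical portion, interpreted by the server
print(parts.query)     # raw query info, without the leading "?"
print(parts.fragment)  # location in the page; never sent to the server

# parse_qs turns the query info into a dict mapping each name
# to a list of values (a name may appear more than once).
params = parse_qs(parts.query)
print(params)
```

Note that a browser would send only the path and query to the server; the fragment stays on the client.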

Links

  • Links: content in a page which, when clicked on, causes the browser to display another page.
  • Links are implemented with the <a> tag:
    <a href="http://www.company.com/news/2009.html">2009 News</a>
    
  • <a> elements can be used in other ways:
    • Relative URL:
      <a href="2008/March.html">
      
    • Go to a different place in the same page:
      <a href="#sec3">
      
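    • Define an anchor point (a position that can be referenced with # notation). The name attribute shown here is the older form; in HTML5 an id attribute on any element serves the same purpose: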
      <a name="sec3">
      
  • Other uses for URLs:
    • Loading a page: type the URL into your browser.
    • Nested content within a page:
      • Images:
        <img src="icon.gif" />
        
      • Stylesheets:
        <link rel="stylesheet" type="text/css" href="...">
        
      • Embedded page:
        <iframe src="http://www.google.com">
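
A relative URL or a bare fragment is resolved against the URL of the page that contains it. The sketch below illustrates this with urljoin from Python's standard library. It assumes, purely for illustration, that the links appear in the 2009 News page from the earlier example; the variable names are hypothetical.

```python
from urllib.parse import urljoin

# Assumed URL of the page containing the links (the base URL).
base = "http://www.company.com/news/2009.html"

# A relative URL replaces the last path component of the base.
relative = urljoin(base, "2008/March.html")
print(relative)

# A bare fragment keeps the current page and only changes the position.
same_page = urljoin(base, "#sec3")
print(same_page)

# A URL starting with "/" keeps the scheme and host but replaces the path.
absolute_path = urljoin(base, "/a/b/c.html")
print(absolute_path)
```

Browsers apply the same resolution rules to the src and href attributes of img, link, and iframe elements.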
        

URL Encoding

  • What if you want to include a punctuation character in a query value?
    http://www.stats.com/companyInfo?name=C&H Sugar
    
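  • Within a URL component such as a query value, any character other than A-Z, a-z, 0-9, or any of -_.~ must be represented as %xx, where xx is the hexadecimal value of the character (non-ASCII characters become one %xx per byte of their UTF-8 encoding):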
    http://www.stats.com/companyInfo?name=C%26H%20Sugar
    
  • This is yet another example of escaping, which is an issue whenever information is encoded textually:
    • Some characters in the encoding have special structural significance, while other characters are just literal data.
    • What if I want to include a special character in my data?
    • Must introduce some escaping (or quoting) mechanism for handling special characters in data; typically this involves additional special characters for the escaping mechanism, such as & in HTML/XML or % in URLs.
    • The escaping mechanism must also escape the escape characters.
    • If you ever receive data whose content is uncontrolled (e.g., typed by a user) and you want to incorporate it into a text encoding, you must check the data for special characters and add appropriate escaping.
    • If you forget to escape special characters, unexpected user data could change the structure of the encoded value. Malicious users can capitalize on such mistakes to violate the security of the system. Example: SQL injection.
  • Multi-level escaping: when does unescaping occur?
    • Unescaping occurs at the point where each encoding is interpreted:
      • HTML entities are processed whenever an HTML document is parsed (in the browser).
      • % escapes in URLs are decoded whenever the URL is used as a URL:
        • In browser when deciding which server to contact.
        • In server when decoding query values.
    • Example:
      <a href="/a/b/c?name=C%26H%20Sugar&amp;year=2006">
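
The encoding and multi-level unescaping steps above can be sketched with Python's standard library. This is an illustration under stated assumptions, not a definitive implementation: it assumes the & separating the two query parameters is written as the entity &amp; inside the HTML attribute, and all variable names are hypothetical.

```python
import html
from urllib.parse import quote, unquote, urlsplit, parse_qs

# Encoding: escape a query value before placing it in a URL.
# safe="" means no reserved characters are left unescaped.
name = "C&H Sugar"
encoded = quote(name, safe="")
print(encoded)

# The escape character itself must also be escaped.
print(quote("100% juice", safe=""))

# Decoding reverses the process.
assert unquote(encoded) == name

# Without escaping, the & is treated as a parameter separator
# and the value is cut off.
broken = parse_qs("name=C&H Sugar")
print(broken)

# Multi-level unescaping, level 1: the HTML parser decodes entities
# in the attribute value (done by the browser).
attr = "/a/b/c?name=C%26H%20Sugar&amp;year=2006"
url = html.unescape(attr)
print(url)

# Level 2: the server decodes % escapes when it extracts query values.
params = parse_qs(urlsplit(url).query)
print(params)
```

Each level is undone only by the component that interprets that encoding, which is why the %26 survives HTML parsing and is decoded only after the query has been split at its structural & characters.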
      

Miscellaneous Topics

  • A key (and controversial) element of the Web: referential integrity is not required, so broken links are allowed.
  • URI (Uniform Resource Identifier) vs. URL