
Communications of the ACM

Digital village

Responsible Web Caching







The overwhelming majority of IT professionals would sooner have their teeth scraped than review the literature on Web caching.


HTTP 1.1's treatment of Web caching is much more sophisticated. Considerable flexibility has been added for making finer-grained distinctions between "fresh" and "stale" documents, including the use of heuristics rather than server-supplied file time-and-date stamp comparisons. A validation model was included that added a "validator token" to the cached document. With a cache-resident validator, the proxy cache can confirm document currency simply by checking the validator against the originating server. The result is reported by the response status code. In fact, there are two types of validators, and correspondingly, two degrees of validation: strong and weak, depending upon the degree of "freshness" one needs. But by far the most important difference in the HTTP 1.1 treatment of caches was the addition of the cache-control header field.
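
To make the validation model concrete, here is a minimal sketch in Python. It assumes the validator token is an entity tag, that a weak tag is marked with the `W/` prefix, and that the origin answers with status 304 when the cached copy is still current and 200 otherwise. The function names and the toy `origin_validate` stand-in for the originating server are hypothetical, chosen for illustration only.

```python
def is_weak(etag: str) -> bool:
    """A weak validator is written with a W/ prefix, e.g. W/"v1"."""
    return etag.startswith("W/")


def opaque(etag: str) -> str:
    """Strip the weakness marker, leaving only the quoted tag itself."""
    return etag[2:] if is_weak(etag) else etag


def strong_match(a: str, b: str) -> bool:
    """Strong validation: both tags must be strong and identical."""
    return not is_weak(a) and not is_weak(b) and a == b


def weak_match(a: str, b: str) -> bool:
    """Weak validation: the tags match once weakness markers are ignored."""
    return opaque(a) == opaque(b)


def origin_validate(current_etag: str, cached_etag: str) -> int:
    """Hypothetical origin server: compare the cache's validator with the
    current one and report currency through the status code."""
    return 304 if weak_match(current_etag, cached_etag) else 200


# The cache holds a copy tagged "v1"; the origin has since moved to "v2".
status = origin_validate('"v2"', '"v1"')  # 200: cached copy is out of date
```

The sketch uses the weak comparison for the cache's check, which is enough to decide whether a whole cached document may be reused; a stricter cache could substitute `strong_match`.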

The idea behind the cache-control header field was to provide a general mechanism for caches to communicate efficiently and effectively with other computers. The point that should not be overlooked is that, by design, the cache-control header field provides directives focused on regulating requests and responses along the caching food chain. The general structure of the header is:

Cache-control: directive [optional parameters]

where the directive consists of keywords relating to aspects of requests and responses. Since the more controversial aspect of cache-control deals with responses, we'll limit our discussion accordingly:

  • Cache-control = "public"
    This means responses from this server may be stored without restriction.
  • Cache-control = "max-age=43200"
    This means the document contained in this response should be considered "stale" after 43,200 seconds (12 hours).
  • Cache-control = "must-revalidate"
    This means that once the document becomes "stale," the caching service (proxy or browser) must revalidate it with the originating server or report an error. It must not pass on the "stale" version.
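
As a rough illustration of how a cache might act on these three directives, the following Python sketch parses a cache-control value and decides what to do with a stored copy. It is a simplification under stated assumptions: `parse_cache_control` and `decide` are hypothetical helper names, the age of the cached copy is supplied by the caller, and a response with no max-age is simply treated as stale rather than given a heuristic lifetime.

```python
def parse_cache_control(value: str) -> dict:
    """Split a cache-control value into {directive: argument-or-None}."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') or None
    return directives


def is_fresh(directives: dict, age_seconds: int) -> bool:
    """Fresh while the copy's age is below max-age (assumption: a missing
    max-age means the copy is treated as stale in this sketch)."""
    max_age = directives.get("max-age")
    return max_age is not None and age_seconds < int(max_age)


def decide(directives: dict, age_seconds: int, origin_reachable: bool) -> str:
    """Return the action a simplified cache would take for a stored copy."""
    if is_fresh(directives, age_seconds):
        return "serve-from-cache"
    if origin_reachable:
        return "revalidate"
    # Stale and the origin cannot be reached: must-revalidate forbids
    # passing on the stale copy, so the cache reports an error instead.
    if "must-revalidate" in directives:
        return "error"
    return "serve-stale"


header = parse_cache_control("public, max-age=43200, must-revalidate")
lifetime_hours = int(header["max-age"]) / 3600  # 43,200 seconds is 12 hours
```

With this header, a copy that is one hour old is served from the cache, while a copy older than twelve hours is either revalidated or, if the originating server cannot be reached, replaced by an error.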

You get the idea. Now, the worrisome directives:


  • Does the cache operator own the resource/document being accessed?
  • Is this resource/document copyrighted, and if so, what is the nature of the license for subsequent redistribution?
  • Is there a "definitive" version of this document (the published and copyrighted version)? If so, where is it located?
  • Does the cached document have a digital object identifier? If so, what is it?
  • Did the owner/copyright holder restrict the use of this resource/document (for example, for classroom use, for use by nonprofit corporations, for use other than commercial purposes)?
  • Is it realistic to expect the owner/author to familiarize himself or herself with the nuances of the cache-control directives before placing anything on the Web?
  • Should the owner/author be expected to define caching parameters that take into account all possible effects (for example, would the typical author interpret "no-cache" to mean "don't cache without revalidating")?
  • Did the owner/author, by placing a resource/document on the Web, relinquish control over when it may be withdrawn from public circulation? (Once the document is cached, there's no simple and immediate way of withdrawing all cached copies from the Web.)
  • Is the owner/author entitled to maintain accurate usage statistics and logs regarding resources of their creation? (These statistics are more difficult to gather from caches, demand additional work, may (as in the case of embedded Web bugs) have privacy implications, carry a definite cost to the owner/author/information provider, and require the universal cooperation of cache maintainers to work. Cache metering is not catching on for these reasons.)
  • Is it the responsibility of the owner/author to ensure version control mechanisms are in place to prevent outdated versions from being circulated by cache services?
  • Is it the responsibility of the owner/author to continuously monitor every proxy cache operation and issue a "takedown" notice each time some of his or her intellectual property is discovered in some cache?
  • Is the owner/author of a Web document/resource entitled to reasonable royalties or profit-sharing that accrue to downstream Web caching services that host his or her work?

I predict answers to these questions will determine how sympathetic one is to the objectives of proxy cache services. Regrettably, few people seem to even ask these questions, much less try to develop viable answers.





  • Generic information about Web caching, including recent news flashes, may be found on the Internet Caching Resource Center Web site, www.caching.com.
  • ICP specifications may be found at icp.ircache.net.
  • The International Web Content Caching and Distribution Workshops have been held annually since 1996. Program information and proceedings can be found at www.iwcw.org. The focus of these workshops is technical, with little evidence that much attention is paid to underlying ethical issues.
  • Duane Wessels' Web cache site can be found at www.web-cache.com. Wessels penned what is to my knowledge the most complete reference book on the subject, Web Caching (O'Reilly, 2001). He also coauthored Version 2 of the RFC specifying the Internet Cache Protocol. Wessels pays lip service in both his book and Web site to the ethical and privacy issues of Web caching.
  • Many caching programs for a wide variety of Unix and Windows NT/2000/XP servers are available, including:
    • The open source Squid Web Proxy Cache (www.squid-cache.org);
    • Netscape's Proxy Server (wp.netscape.com/proxy/v3.5/);
    • Microsoft's caching software is built into their Internet security and acceleration server product (see www.microsoft.com/isaserver); and
    • The Cisco Cache Engine Series (www.cisco.com/warp/public/cc/pd/cxsr/500/index.shtml).
  • The ACM copyright policy is online at www.acm.org/pubs/copyright_policy.
  • Digital Object Identifiers are described at www.doi.org.


 


 
