Web Archiving: The Next Phase In The Evolution Of Archiving

(This is an excerpt of an Osterman Research Inc. White Paper sponsored by Reed Tech Web Archiving Services and LexisNexis. Please click here to download the entire White Paper.)

Executive Summary


The web has become the primary communication and commerce channel for businesses and government agencies.  Digital media (web sites and other web-based content) has all but replaced print media as the primary mode of communication with customers, constituents, prospects, investors and others.  The web is also becoming the primary channel for transacting business, managing commerce for everything from online purchases to tax payments. 

However, businesses and governments do not yet understand that they are liable for everything they publish online.  Organizations that do not archive web content run the risk of not preserving a record of their claims, offers and other content posted on their web sites.  Retaining this content has become both a legal and regulatory requirement, and so the question is not if web content should be retained, but only how much and for how long it should be preserved. 

Web archiving has been going on for quite some time, but enterprise-class solutions have only recently become available.  New, state-of-the-art technology is now available to manage web archiving and it has the power and flexibility to meet existing and emerging web archiving requirements.  As a result, any organization that uses the web to communicate or manage commerce should consider developing a web archiving policy and deploying the appropriate technology to support that policy. 


The fundamental message of this white paper is: 

  • Web archiving is, without question, a best practice for virtually any organization. Organizations that do not archive web content are placing themselves at unnecessary risk from both a legal and regulatory viewpoint, and they are denying themselves the use of capabilities that can provide a distinct competitive advantage.


  • Web archiving is fundamentally identical to what many organizations have already implemented in the context of email archiving, file archiving and long-term retention of other types of important business content. In essence, web archiving is merely a superset of traditional types of archiving that are already well established in business and government.


  • Many current web archiving technologies are not designed with enterprise-class capabilities that will retain web content of evidentiary value.


  • Organizations should consider developing a web archiving policy, particularly as more content migrates to the web and web-based applications. 


This white paper discusses the importance and benefits of web archiving and various use cases for it.  It also briefly discusses the sponsor of this white paper and their relevant offerings in the space. 

Why the Web Represents the Next Phase of Archiving


Web archiving is what its name implies:  the capture and archival storage of web-based content.  This can include individual web pages, entire web sites, content from web 2.0 applications like social networking sites, and other web-based content that is important to capture and retain, normally for long periods. 

The concept of web archiving is not new.  For example, the Wayback Machine - a web archiving service maintained by the non-profit organization Internet Archive based in San Francisco, California - has been archiving web content since 1996[i].  However, the Wayback Machine has several limitations for use in a business context: 

  • Web content is captured only sporadically, not on a regular schedule. This can prevent the capture of a large proportion of web content, particularly for sites that update content frequently. Further, changes to a web page or web site may not be captured if the change occurs between content "snapshots", the frequency of which is determined by Internet Archive.


  • There is no guarantee that all web content will be captured.


  • Web content is not necessarily captured in a way that will satisfy evidentiary rules during legal or regulatory proceedings.

 As a result, while the Wayback Machine is a good first step toward archiving web content, more sophisticated - and enterprise-class - web archiving is becoming a necessity for a growing number of applications, as discussed below. 
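The first limitation above, changes that fall between snapshots, can be illustrated with a minimal sketch. The function name `missed_versions` and the day-numbered timeline are hypothetical and chosen only for illustration; they do not describe how the Wayback Machine or any particular product works.

```python
from bisect import bisect_left


def missed_versions(change_times, snapshot_times):
    """Return the change times whose page version was never captured.

    A version published at change_times[i] stays live until the next change.
    It counts as captured only if at least one snapshot falls in that window.
    """
    changes = sorted(change_times)
    snapshots = sorted(snapshot_times)
    missed = []
    for i, start in enumerate(changes):
        # The version is live from `start` until the next change
        # (or indefinitely if it is the most recent version).
        end = changes[i + 1] if i + 1 < len(changes) else float("inf")
        # Index of the first snapshot taken at or after `start`.
        j = bisect_left(snapshots, start)
        if j == len(snapshots) or snapshots[j] >= end:
            missed.append(start)
    return missed


# Hypothetical timeline in days: the page changes on days 1, 3, 4 and 20,
# while snapshots are taken on days 0, 10 and 30.
print(missed_versions([1, 3, 4, 20], [0, 10, 30]))  # prints [1, 3]
```

In this example the versions published on days 1 and 3 were replaced before the next snapshot on day 10, so no record of them exists in the archive.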


Many of the drivers for web archiving are fundamentally the same as those for email and other electronic content archiving: 

  • Web content can be required for e-discovery and other litigation support requirements in much the same way that emails, word processing files, PDF files and other content are required.


  • Similarly, web content can be required to demonstrate an organization's compliance (or lack thereof) with regulatory requirements in the context of advertising, forward-looking statements, claims of suitability and other content that must - or must not - be posted to web sites.


  • Many organizations have a requirement, often driven by a need to reduce risk or maintain adequate records, to preserve web site content as part of their overall records retention and records management strategy.


  • Unlike more traditional forms of archiving, web archiving can actually be used as a competitive and/or investigative tool to understand content posted on competitors' web sites.


There are some significant differences between web backups and web archives: 

  • Although both a backup and an archive of a web site can reproduce content at a later date for forensic, e-discovery or data mining purposes, a web archive will do so more quickly, more affordably and more easily.


  • Because of the ubiquity of database-driven web sites, a backup must retain archives of all of the files, as well as all of the databases that control the web site.


  • Searching through backups of a web site is much more difficult and more time-consuming than searching through an archive.


Web archiving can rightly be considered the next logical extension of an organization's traditional archiving of email, files and other electronic content.  While email and other types of electronic content archiving tend to focus on internal content - emails sent to and from employees and business, word processing files and presentations created for internal uses, and so forth - web archiving tends to focus much more on publicly available content.  Because the web - including static web sites, web applications, social networking content, etc. - is primarily public-facing in nature, web archiving focuses primarily on content that the public has already seen or has had the opportunity to see. 

As a result, web archiving is focused to a greater degree than traditional electronic content archiving on issues like brand protection; reputation management; policy enforcement; protection of content based on when it is created, posted and taken down; business continuity and corporate memory. 

Archiving Is Already an Established Best Practice


The amount of content on the web has ballooned exponentially in recent years.  For example, as of December 2009, there were 234 million web sites, 47 million of which were added just in 2009[ii] - an average of nearly 129,000 web sites added every day.  Further, even as far back as 2008 there were well in excess of one trillion unique URLs on the web and the number continues to grow at a rapid pace. 

Growth of the web is being driven by a number of factors, including the ubiquity of web access, the ease and low cost with which content can be published and updated, and greater cultural acceptance of the web as a medium of information-sharing and commerce.  For these reasons, both business and government are increasingly reliant on the web as their primary means of communication and process management. 

Consequently, the market for web archiving - as well as archiving of email, files, SharePoint content and other information - is growing at a healthy pace.  Web archiving, currently a small segment of the total content archiving market, is poised to become an enormous area of growth, driven by the issues discussed in this white paper. 


For just about any company, government agency or educational institution, there are four primary drivers for archiving their electronic content.  However, the importance of these drivers will vary by an organization's size, the industry(ies) in which it participates, the advice of its internal and external legal counsel or compliance officers, and the locales in which it operates: 

  • Driver #1: Litigation

Electronic content stores, including web sites, contain a growing proportion of business records that must be preserved for long periods of time.  Further, this content is frequently requested during discovery proceedings because of the Federal Rules of Civil Procedure (FRCP) and state versions of the FRCP.  As a result, it is critical that all relevant electronic content be made available for e-discovery purposes. 

Further, when a hold on data is required, it is imperative that an organization immediately be able to begin preserving all relevant data.  For example, if a dispute arises because of a claim made on a page of a company's web site, that content must be preserved for as long as a court, regulator or other authorized entity may deem necessary.  An enterprise-class web archiving system allows organizations to immediately place a hold on data when requested by a court or on the advice of legal counsel. 

If an organization is not able to adequately place a hold on data when it is obligated to do so, it can suffer a variety of serious consequences, ranging from embarrassment to major legal sanctions or heavy fines.  Litigants that fail to preserve electronic content properly are subject to a wide variety of consequences, including brand damage, additional costs for third parties to review or search for data, court sanctions, directed verdicts or instructions to a jury that it can view a defendant's failure to produce data as evidence of culpability. 
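The interaction between a retention period and a legal hold can be sketched in a few lines. This is only an illustration under stated assumptions: the class `ArchivedPage`, its fields, the example URL and the matter name are hypothetical and do not describe any specific archiving product.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class ArchivedPage:
    """Hypothetical record for one captured page in a web archive."""
    url: str
    captured_on: date
    retention_days: int
    holds: set = field(default_factory=set)  # names of active legal holds

    def place_hold(self, matter: str) -> None:
        self.holds.add(matter)

    def release_hold(self, matter: str) -> None:
        self.holds.discard(matter)

    def can_delete(self, today: date) -> bool:
        # Deletion is allowed only when the retention period has expired
        # AND no legal hold is active on the page.
        expired = today >= self.captured_on + timedelta(days=self.retention_days)
        return expired and not self.holds


page = ArchivedPage("https://example.com/offer", date(2010, 1, 1), retention_days=365)
page.place_hold("matter-001")
print(page.can_delete(date(2012, 1, 1)))  # False: retention expired, but a hold is active
page.release_hold("matter-001")
print(page.can_delete(date(2012, 1, 1)))  # True: retention expired and no hold remains
```

The design choice worth noting is that a hold overrides the normal retention schedule rather than extending it, so content under hold is preserved for as long as the hold remains in place.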

In addition to the e-discovery and legal hold benefits, an enterprise-class web archiving system allows an organization to perform either formal or informal early case assessment activities.  For example, if a customer makes a claim against a company based on a statement made on the company's web site, senior managers can search the archive for information that will help them determine the potential liability they face.  If this assessment of the potential lawsuit results in a determination that the company was indeed wrong in making the claim, they can instruct legal counsel to pursue a quick legal settlement.  If, on the other hand, the assessment results in the discovery of information that supports the company's position, that information can be used to convince the customer to drop the case or it can help win the case if it goes to trial.  In either case, an archiving system can help the organization to understand its position early on, either avoiding unnecessary legal fees or an adverse judgment, or reducing its costs by proving the sufficiency of its case. 

  • Driver #2: Regulatory Compliance

For just about every organization, there is a large and growing number of regulatory obligations to preserve electronic content.  Some of the more important requirements are: 

  • Sarbanes-Oxley Act of 2002
    The Sarbanes-Oxley Act of 2002 requires all public companies and their auditors to retain such relevant records as audit workpapers, memoranda, correspondence and electronic records for a period of seven years. Further, Section 403 of Sarbanes-Oxley amended Section 16 of the Securities Exchange Act of 1934 to include a requirement for public companies to post certain types of content on their web sites.

    Under Sarbanes-Oxley, company officers are obliged to report internal controls and procedures for financial reporting and auditors are required to test the internal control structures. Businesses have to ensure that information is preserved - whether paper or electronic - that would be relevant to the company's financial reporting.


  • Health Insurance Portability and Accountability Act of 1996 (HIPAA)
    All organizations operating in the healthcare field need to comply with HIPAA to ensure the safety of Protected Health Information.  Organizations are required to protect the data from unauthorized users, as well as to retain for six years a broad range of documentation regarding their compliance.

    As part of the American Recovery and Reinvestment Act of 2009 (ARRA), the provisions of HIPAA have been significantly expanded.  A key component of ARRA is the Health Information Technology for Economic and Clinical Health Act (HITECH).  Now, business partners of entities already covered by HIPAA, such as pharmacies, healthcare providers and others, are required to comply with HIPAA provisions.  This includes attorneys, accounting firms, external billing companies and others that do business with covered entities.  While these business associates were accountable to the covered entities with which they did business under the old HIPAA, these associates are now liable for governmental penalties under the new law.

    The scope of what constitutes a HIPAA violation has also been expanded dramatically.  For example, if a covered entity or one of its business associates loses 500 or more patient records, it must notify HHS and a "prominent media outlet" to let them know what has occurred.  Section 13402 of HITECH requires that if a "covered entity has insufficient or out-of-date contact information for 10 or more individuals, the covered entity must provide substitute individual notice by either posting the notice on the home page of its web site or by providing the notice in major print or broadcast media where the affected individuals likely reside."

    Fines for HIPAA violations can now reach as high as $1.5 million per calendar year.

Recent FINRA Disciplinary Actions Related to Web Content


  • An individual posted false and misleading information on a Google Finance bulletin board relating to securities recommendations. The posting contained predictions and projections of future prices for the securities that were recommended, but the posting was made without approval. FINRA fined the individual $10,000 and suspended him from associating with any FINRA member for six months.


  • A company made false and misleading statements on its web site related to low cost commission rates and direct access to traders. The company was censured and fined $20,000.


  • An affiliate of a company participated in and won CD auctions without disclosing it was an auction participant. Further, the company's advertising materials contained misleading, unwarranted and exaggerated statements, and the company published misleading market clearing yields on its web site. The company was found to have violated Rule 2210 and fined $225,000. 

  • Securities and Exchange Commission Rules
    Members of national securities exchanges, brokers and dealers are obliged to preserve all records for a minimum of six years, the first two years in an easily accessible place (SEC Rule 17a-4).  The affected records are broad and encompass originals of communications generated and received by individuals within financial institutions, including inter-office memoranda and internal audit working papers. Also included are automated messages sent to all customers, which could include email blasts. The records may be "immediately produced or reproduced on 'micrographic media' [microfilm, microfiche or similar] or by means of 'electronic storage media'".  As noted above, the Securities Exchange Act of 1934 has been amended to specifically include the requirement to post certain types of content on the web. 

  • Financial Industry Regulatory Authority (FINRA)
    FINRA is a non-governmental regulator formed in 2007 by the merger of various functions of the New York Stock Exchange and the National Association of Securities Dealers. FINRA manages a wide variety of rules that are imposed upon the more than 5,000 brokerage firms and nearly 675,000 registered representatives it oversees.

    FINRA requires that various types of communications with the public must be filed prior to their use, including content that often would be posted on web sites[iii]. This includes CMO advertisements, sales literature and investment analysis tools.


  • Model Requirements for the Management of Electronic Records (MoReq)
    MoReq is a specification, originally developed in 2001, that defines the functional requirements for the manner in which electronic records are managed in an Electronic Records Management System. MoReq has been used widely in Europe and has been updated with MoReq2.
