Creating a Billion-scale Searchable Web Archive

Proceedings of the 22nd International Conference on World Wide Web, pages 1059-1066. New York, NY, USA, ACM, (2013)
DOI: 10.1145/2487788.2488118

Abstract

Web information is ephemeral. Several organizations around the world are struggling to archive information from the web before it vanishes. However, users demand efficient and effective search mechanisms to access the already vast collections of historical information held by web archives. The Portuguese Web Archive is the largest full-text searchable web archive publicly available. It supports search over 1.2 billion files archived from the web since 1996. This study contributes an overview of the lessons learned while developing the Portuguese Web Archive, focusing on web data acquisition, ranking of search results, and user interface design. The developed software is freely available as an open source project. We believe that sharing the experience we gained while developing and operating a running service will enable other organizations to start or improve their web archives.
