Long-term information preservation and access
by Song, Sang Chul, Ph.D., UNIVERSITY OF MARYLAND, COLLEGE PARK, 2010, 155 pages; 3426394

Abstract:

An unprecedented amount of information encompassing almost every facet of human activities across the world is generated daily in the form of zeros and ones, and that is often the only form in which such information is recorded. A good fraction of this information needs to be preserved for periods of time ranging from a few years to centuries. Consequently, the problem of preserving digital information over a long-term has attracted the attention of many organizations, including libraries, government agencies, scientific communities, and individual researchers. In this dissertation, we address three issues that are critical to ensure long-term information preservation and access.

The first concerns the core requirement of how to guarantee the integrity of preserved contents. Digital information is in general very fragile because of the many ways errors can be introduced, such as errors introduced because of hardware and media degradation, hardware and software malfunction, operational errors, security breaches, and malicious alterations. To address this problem, we develop a new approach based on efficient and rigorous cryptographic techniques, which will guarantee the integrity of preserved contents with extremely high probability even in the presence of malicious attacks. Our prototype implementation of this approach has been deployed and actively used in the past years in several organizations, including the San Diego Super Computer Center, the Chronopolis Consortium, North Carolina State University, and more recently the Government Printing Office.

Second, we consider another crucial component in any preservation system – searching and locating information. The ever-growing size of a long-term archive and the temporality of each preserved item introduce a new set of challenges to providing a fast retrieval of content based on a temporal query. The widely-used cataloguing scheme has serious scalability problems. The standard full-text search approach has serious limitations since it does not deal appropriately with the temporal dimension, and, in particular, is incapable of performing relevancy scoring according to the temporal context. To address these problems, we introduce two types of indexing schemes – a location indexing scheme, and a full-text search indexing scheme. Our location indexing scheme provides optimal operations for inserting and locating a specific version of a preserved item given an item ID and a time point, and our full-text search indexing scheme efficiently handles the scalability problem, supporting relevancy scoring within the temporal context at the same time.

Finally, we address the problem of organizing inter-related data, so that future accesses and data exploration can be quickly performed. We, in particular, consider web contents, where we combine a link-analysis scheme with a graph partitioning scheme to put together more closely related contents in the same standard web archive container. We conduct experiments that simulate random browsing of preserved contents, and show that our data organization scheme greatly minimizes the number of containers needed to be accessed for a random browsing session.

Our schemes have been tested against real-world data of significant scale, and validated through extensive empirical evaluations.

 
AdviserJoseph F. JaJa
SchoolUNIVERSITY OF MARYLAND, COLLEGE PARK
SourceDAI/B 71-12, p. , Nov 2010
Source TypeDissertation
SubjectsComputer engineering; Information science; Computer science
Publication Number3426394
Adobe PDF Access the complete dissertation:
 

» Find an electronic copy at your library.
  Use the link below to access a full citation record of this graduate work:
  http://gateway.proquest.com/openurl%3furl_ver=Z39.88-2004%26res_dat=xri:pqdiss%26rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation%26rft_dat=xri:pqdiss:3426394
  If your library subscribes to the ProQuest Dissertations & Theses (PQDT) database, you may be entitled to a free electronic version of this graduate work. If not, you will have the option to purchase one, and access a 24 page preview for free (if available).

About ProQuest Dissertations & Theses
With over 2.3 million records, the ProQuest Dissertations & Theses (PQDT) database is the most comprehensive collection of dissertations and theses in the world. It is the database of record for graduate research.

The database includes citations of graduate works ranging from the first U.S. dissertation, accepted in 1861, to those accepted as recently as last semester. Of the 2.3 million graduate works included in the database, ProQuest offers more than 1.9 million in full text formats. Of those, over 860,000 are available in PDF format. More than 60,000 dissertations and theses are added to the database each year.

If you have questions, please feel free to visit the ProQuest Web site - http://www.proquest.com - or call ProQuest Hotline Customer Support at 1-800-521-3042.