Data analytics for networked and possibly private sources
by Wang, Ting, Ph.D., GEORGIA INSTITUTE OF TECHNOLOGY, 2011, 231 pages; 3464126

Abstract:

The past decade has witnessed unprecedented growth in the complexity and variety of information, driven in part by advances in three areas: first, advanced sensing and monitoring technologies; second, pervasive network connectivity and ubiquitous computing platforms; and third, social media and Web 2.0 technologies. We now face data that come from multiple sources and carry rich context information. For example, operators today have at their disposal myriad measurements (e.g., NetFlow, SNMP, syslog) collected from all routers of large-scale enterprise networks. Existing analytical tools, in contrast, lag far behind this growth in the complexity and variety of data. For example, even though analyzing routing data holds the promise of exposing important network failures, this promise is largely unfulfilled due to the complex, noisy and voluminous nature of the data. The lack of general design models and formal methods to effectively weave together context-rich information from multiple sources motivates this thesis.

More specifically, this thesis focuses on two grand challenges facing system designers and operators. First, how can we fuse information from multiple autonomous yet correlated sources and provide consistent views of the underlying phenomena? Second, how can we respect externally imposed constraints (privacy concerns in particular) without compromising the efficacy of analysis?

In the first scenario, the correlation (e.g., dependency) among the data sources is usually reflected in the collected data in the form of spatial and/or temporal relevance. For example, the symptoms caused by a given network failure typically exhibit significant patterns in terms of where and when they are observed. This motivates us to design data analytical frameworks that can effectively incorporate the relationships among the underlying data sources.

In the second scenario, because the data may be sensitive, the data sources expect the entire process of data collection, processing and dissemination to provide sufficient privacy protection for their contributed data, even though the expected level of protection may vary from one source to another. This raises the question of how to ensure privacy protection (e.g., via information sanitization) while still guaranteeing the utility of the information for its intended purposes.

To address the first challenge, we apply a general correlation network model to capture the relationships among data sources, and propose Network-Aware Analysis (NAA), a library of novel inference models, to capture (i) how the correlation of the underlying sources is reflected as the spatial and/or temporal relevance of the collected data, and (ii) how to track causality in the data that arises from dependencies among the data sources. We have also developed a set of space- and time-efficient algorithms to address (i) how to correlate relevant data and (ii) how to forecast future data.
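As a rough illustration of spatial and temporal relevance over a correlation network, the following minimal Python sketch groups events whose sources are close in the network and whose timestamps are close in time. It is not the NAA library or any algorithm from the thesis. The function names, the hop-count and time-window thresholds, the undirected adjacency-list graph, and the sample router names are all assumptions made for this example.

```python
from collections import deque


def hop_distances(graph, start, max_hops):
    """Breadth-first search: hop count from `start` to every source
    reachable within `max_hops` (hypothetical helper)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def correlate_events(graph, events, max_hops=1, window=60.0):
    """Group (source, timestamp) events that are close in both the
    network and in time.

    Two events are linked when their sources are within `max_hops` of
    each other and their timestamps differ by at most `window`. The
    groups are the connected components of that link relation.
    Assumes `graph` is an undirected adjacency list (symmetric).
    """
    parent = list(range(len(events)))  # union-find forest over event indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Precompute the network neighborhood of every source that emitted an event.
    near = {src: hop_distances(graph, src, max_hops) for src, _ in events}

    for i, (src_i, t_i) in enumerate(events):
        for j in range(i + 1, len(events)):
            src_j, t_j = events[j]
            if src_j in near[src_i] and abs(t_i - t_j) <= window:
                parent[find(i)] = find(j)

    groups = {}
    for i, event in enumerate(events):
        groups.setdefault(find(i), []).append(event)
    return sorted(groups.values())


# Hypothetical topology: r1 - r2 - r3 form a chain, r4 is isolated.
graph = {"r1": ["r2"], "r2": ["r1", "r3"], "r3": ["r2"], "r4": []}
events = [("r1", 0.0), ("r2", 10.0), ("r3", 500.0), ("r4", 5.0)]
print(correlate_events(graph, events))
```

In this example the events on r1 and r2 fall into one group because the routers are adjacent and the events are 10 seconds apart. The event on r3 is adjacent in the network but too far away in time, and the event on r4 is close in time but unconnected, so each stays in its own group. The pairwise comparison is quadratic in the number of events and is meant only to make the idea concrete, not to be space- or time-efficient.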

To address the second challenge, we further extend the concept of the correlation network to encode the semantic (possibly virtual) dependencies and constraints among the entities in question (e.g., medical records). We show through a set of concrete cases that correlation networks convey significant utility for the intended applications, yet are also often used as a stepping stone by adversaries to perform inference attacks. Using correlation networks as the pivot for analyzing privacy-utility trade-offs, we propose Privacy-Aware Analysis (PAA), a general design paradigm for constructing analytical solutions with theoretical backing for both privacy and utility.

The general design models and formal methods shown in this thesis can help improve existing data analytical systems by making them more capable of weaving local observations (from multiple sources) into globally consistent pictures, and more privacy-preserving with respect to sensitive information.

 
Adviser: Ling Liu
School: GEORGIA INSTITUTE OF TECHNOLOGY
Source: DAI/B 72-10, p. , Aug 2011
Source Type: Dissertation
Subjects: Computer science
Publication Number: 3464126
  Use the link below to access a full citation record of this graduate work:
  http://gateway.proquest.com/openurl%3furl_ver=Z39.88-2004%26res_dat=xri:pqdiss%26rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation%26rft_dat=xri:pqdiss:3464126