The past decade has witnessed unprecedented growth in the complexity and variety of information, driven in part by advances in three areas: first, advanced sensing and monitoring technologies; second, pervasive network connectivity and ubiquitous computing platforms; and third, social media and Web 2.0 technologies. We now face data that comes from multiple sources and features rich context information. For example, operators today have at their disposal myriad measurements (e.g., NetFlow, SNMP, syslog) collected from all routers of large-scale enterprise networks. In contrast, existing analytical tools lag far behind this astonishing growth in the complexity and variety of data. For example, even though analyzing routing data holds the promise of exposing important network failures, this promise remains largely unfulfilled due to the complex, noisy, and voluminous nature of the data. The lack of general design models and formal methods for effectively weaving together context-rich information from multiple sources motivates this thesis.
More specifically, in this thesis we focus on two grand challenges facing system designers and operators. First, how can we fuse information from multiple autonomous yet correlated sources into consistent views of the underlying phenomena? Second, how can we respect externally imposed constraints (privacy concerns in particular) without compromising the efficacy of analysis?
In the first scenario, the correlation (e.g., dependency) among the data sources is usually reflected in the collected data as spatial and/or temporal relevance. For example, the symptoms caused by a given network failure typically exhibit pronounced patterns in where and when they are observed. This motivates us to design data analysis frameworks that can effectively incorporate the relationships among the underlying data sources.
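As a minimal illustration of temporal relevance, the sketch below buckets symptom records by timestamp so that symptoms close in time become candidates for a shared root cause. The record format, router names, and 60-second window are illustrative assumptions, not part of the frameworks developed in this thesis.

```python
# Hypothetical symptom records as (router_id, timestamp_sec) pairs,
# e.g., parsed from syslog; the values below are invented for exposition.
symptoms = [
    ("r1", 100.0), ("r2", 102.5), ("r3", 103.1),  # plausibly one failure
    ("r7", 900.0),                                 # an unrelated symptom
]

def group_by_time(events, window=60.0):
    """Greedily bucket symptoms whose timestamps fall within `window`
    seconds of the bucket's first symptom, a crude proxy for the
    temporal relevance induced by a shared root cause."""
    buckets = []
    for router, ts in sorted(events, key=lambda e: e[1]):
        if buckets and ts - buckets[-1][0][1] <= window:
            buckets[-1].append((router, ts))
        else:
            buckets.append([(router, ts)])
    return buckets

for bucket in group_by_time(symptoms):
    print([router for router, _ in bucket])  # ['r1', 'r2', 'r3'] then ['r7']
```

Temporal proximity alone is, of course, a weak signal; the correlation network model introduced below supplies the complementary spatial dimension.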
In the second scenario, due to the potentially sensitive nature of the data, the data sources expect the entire process of data collection, processing, and dissemination to provide sufficient privacy protection for their contributed data, even though the expected level of protection may vary from one source to another. This raises the question of how to ensure privacy protection (e.g., via information sanitization) while guaranteeing the utility of the information for its intended purposes.
To address the first challenge, we apply a general correlation network model to capture the relationships among data sources, and propose Network-Aware Analysis (NAA), a library of novel inference models, to capture (i) how the correlation of the underlying sources is reflected as the spatial and/or temporal relevance of the collected data, and (ii) how dependencies among the data sources induce causality in the data. We have also developed a set of space- and time-efficient algorithms that address (i) how to correlate relevant data and (ii) how to forecast future data.
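For concreteness, the sketch below shows how a correlation network, represented here as a plain adjacency map, can drive the spatial side of such an analysis by grouping symptomatic sources that are connected in the network. The topology, event set, and function name are illustrative assumptions, not the actual NAA interface.

```python
# A hypothetical correlation network over data sources (e.g., routers),
# given as an adjacency map; edges mark sources known to be dependent.
correlation_network = {
    "r1": {"r2", "r3"},
    "r2": {"r1"},
    "r3": {"r1"},
    "r7": set(),
}

def correlated_groups(sources, network):
    """Partition the sources that raised symptoms into groups that are
    connected in the correlation network; each group is a candidate set
    of symptoms sharing one underlying cause."""
    remaining, groups = set(sources), []
    while remaining:
        frontier = {remaining.pop()}
        group = set()
        while frontier:
            node = frontier.pop()
            group.add(node)
            neighbors = network.get(node, set()) & remaining
            remaining -= neighbors
            frontier |= neighbors
        groups.append(group)
    return groups

print(correlated_groups({"r1", "r2", "r7"}, correlation_network))
# e.g., [{'r1', 'r2'}, {'r7'}]: r1 and r2 are adjacent, r7 stands alone
```

In practice, such spatial grouping must be combined with the temporal bucketing sketched earlier and must tolerate the noise inherent in real measurement data, which the inference models above are intended to handle.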
To address the second challenge, we further extend the concept of a correlation network to encode the semantic (possibly virtual) dependencies and constraints among the entities in question (e.g., medical records). We show through a set of concrete cases that correlation networks provide significant utility for intended applications, while often serving as a stepping stone for adversaries to perform inference attacks. Using correlation networks as the pivot for analyzing privacy-utility trade-offs, we propose Privacy-Aware Analysis (PAA), a general design paradigm for constructing analytical solutions with theoretical backing for both privacy and utility.
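The toy example below illustrates the attack side of this trade-off, assuming hypothetical records with one sensitive attribute and a correlation network linking records known to be similar (e.g., relatives sharing a hereditary condition). All names are invented for exposition; this is not the actual PAA construction.

```python
from collections import Counter

# Published sensitive attributes and a correlation network over records;
# both are hypothetical and chosen only to make the attack visible.
attribute = {"alice": "positive", "bob": "positive", "carol": "negative"}
correlation_network = {"dave": {"alice", "bob"}}  # dave's relatives

def infer_suppressed(target, network, published):
    """The adversary's inference attack: guess a suppressed value as the
    majority value among the target's correlated neighbors."""
    votes = Counter(published[n] for n in network[target] if n in published)
    return votes.most_common(1)[0][0] if votes else None

# Suppressing only dave's own record is not enough once correlations are known:
print(infer_suppressed("dave", correlation_network, attribute))  # 'positive'

# A correlation-aware sanitizer must also hide (or perturb) enough of
# dave's neighborhood to break the inference channel:
redacted = {k: v for k, v in attribute.items() if k not in {"alice", "bob"}}
print(infer_suppressed("dave", correlation_network, redacted))   # None
```

This is precisely the sense in which correlation networks serve as the pivot: the same edges that give the data its utility also define the adversary's inference channels, and PAA reasons about both effects over the network at once.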
The general design models and formal methods presented in this thesis can help improve existing data analysis systems by making them more capable of weaving local observations from multiple sources into globally consistent pictures, and more protective of sensitive information.