The past decade has witnessed a large number of novel applications that generate imprecise, uncertain and incomplete data. Examples include monitoring infrastructures such as RFIDs, sensor networks and web-based applications such as information extraction, data integration, social networking and so on. In my dissertation, I addressed several challenges in managing such data and developed algorithms for efficiently executing queries over large volumes of such data. Specifically, I focused on the following challenges.
First, for meaningful analysis of such data, we need the ability to remove noise and infer useful information from uncertain data. To address this challenge, I first developed a declarative system for applying dynamic probabilistic models to databases and data streams. The output of such probabilistic modeling is probabilistic data, i.e., data annotated with probabilities of correctness/existence. Often, the data also exhibits strong correlations. Although there is prior work in managing and querying such probabilistic data using probabilistic databases, those approaches largely assume independence and cannot handle probabilistic data with rich correlation structures. Hence, I built a probabilistic database system that can manage large-scale correlations and developed algorithms for efficient query evaluation.
Our system allows users to provide uncertain data as input and to specify arbitrary correlations among the entries in the database. In the back end, we represent correlations as a forest of junction trees, an alternative representation for probabilistic graphical models (PGM). We execute queries over the probabilistic database by transforming them into message passing algorithms (inference) over the junction tree. However, traditional algorithms over junction trees typically require accessing the entire tree, even for small queries. Hence, I developed an index data structure over the junction tree called INDSEP that allows us to circumvent this process and thereby scalably evaluate inference queries, aggregation queries and SQL queries over the probabilistic database.
Finally, query evaluation in probabilistic databases typically returns output tuples along with their probability values. However, the existing query evaluation model provides very little intuition to the users: for instance, a user might want to know “Why is this tuple in my result?” or “Why does this output tuple have such high probability?” or “Which are the most influential input tuples for my query?” Hence, I designed a query evaluation model, and a suite of algorithms, that provide users with explanations for query results, and enable users to perform sensitivity analysis to better understand the query results.