Efficient Processing of Set-Similarity Joins on Large Clusters

by Vernica, Rares, Ph.D., UNIVERSITY OF CALIFORNIA, IRVINE, 2011, 160 pages; 3473568


Joining two datasets where the join condition is based on set similarity instead of equality is very important in many applications, such as social networks, master data management, plagiarism detection, and query refinement based on users' similar preferences. For instance, finding Web site members with similar interests and removing duplicate customers after company acquisitions are two examples. Additionally, many applications need to deal with a large amount of data that cannot be processed by a single machine. Using clusters of machines where algorithms are executed in parallel is a more scalable approach. Recently, scalable, shared-nothing, data-intensive computation platforms have gained attention from both industry and academia. This trend started with the MapReduce framework, and more general frameworks have been proposed recently.

In this thesis we study how to efficiently perform set-similarity joins, a.k.a. fuzzy-joins, on large shared-nothing clusters. We propose a three-stage approach for computing end-to-end set-similarity joins. The input to the approach is a set of records (or two sets of records), and the output is a set of joined records based on the desired set-similarity condition. The proposed approach efficiently partitions fuzzy-join processing across many nodes, balancing the workload and minimizing the need for data replication.

This thesis makes three contributions related to the proposed three-stage approach to compute fuzzy-joins. First, we study fuzzy-joins in the context of the popular MapReduce framework. Next, we propose a set of improvements to the MapReduce runtime model to improve its handling of fuzzy-join queries as well as other types of queries. Finally, we study fuzzy-joins in the context of a new data-intensive computing platform being developed at UC Irvine, ASTERIX. We show how to integrate fuzzy-joins into the system starting at the query language level and ending in efficient execution. Along with each of the contributions, we provide performance results from experiments on real datasets, synthetically increased in size, to study the speedup and scaleup properties of our techniques.

AdvisersChen Li; Michael J. Carey
Source TypeDissertation
SubjectsComputer science
Publication Number3473568

About ProQuest Dissertations & Theses
With nearly 4 million records, the ProQuest Dissertations & Theses (PQDT) Global database is the most comprehensive collection of dissertations and theses in the world. It is the database of record for graduate research.

PQDT Global combines content from a range of the world's premier universities - from the Ivy League to the Russell Group. Of the nearly 4 million graduate works included in the database, ProQuest offers more than 2.5 million in full text formats. Of those, over 1.7 million are available in PDF format. More than 90,000 dissertations and theses are added to the database each year.

If you have questions, please feel free to visit the ProQuest Web site - http://www.proquest.com - or contact ProQuest Support.