Topics in massive data summarization
by Zheng, Xuan, Ph.D., UNIVERSITY OF MICHIGAN, 2008, 91 pages; 3329011

Abstract:

We consider three problems in this thesis. Two of them are about building near optimal histograms in non-uniform workload model and in probabilistic data streaming model respectively. The former has an associated weight vector with each streaming data vector, while the latter has a density function instead of a deterministic value as each data entry in the streaming data vector. The third one is to find approximate heavy hitters between two parties in a private model, in which, for every party no information is leaked unless necessary. More specifically, first, we want to construct a nearly workload-optimal histogram. Given B, we want to find the near optimal B bucket histogram under associated workload w within 1 + &epsis; error tolerance. In the cash register model where data is streamed as a series of updates, we can build a histogram using polylogarithmic space, polylogarithmic time to process each item, and polylogarithmic post-processing time to build the histogram. All these results need the workload to be explicitly stored since we show that if the workload is summarized in small space lossily, algorithmic results such as above do not exist.

Then, we consider the problem of private computation of approximate Heavy Hitters. Alice and Bob each hold a vector and, in the vector sum, they want to find the B largest values along with their indices. We show how to solve the problem privately with polylogarithmic communication, polynomial work and constantly many rounds in the sense that nothing is learned by Alice and Bob beyond what is implied by their input, the ideal top-B output, and goodness of approximation (equivalently, the Euclidean norm of the vector sum). We give lower bounds showing that the Euclidean norm must leak by any efficient algorithm.

In the third problem, we want to build a near optimal histogram on probabilistic data streams. Given B, we want to find the near optimal B bucket histogram on probabilistic data streams under both L1 measurement and L2 measurement. We give deterministic algorithms without sampling. We can build histograms using poly-logarithmic space, polylogarithmic time to process each item, and polylogarithmic post-processing time to build the histogram. The result we give under L2 measurement is within (1 + &epsis;)-error tolerance, and the result under L1 measurement is heuristic. We also give a direction to give guarantees to the heuristic.

 
AdviserMartin J. Strauss
SchoolUNIVERSITY OF MICHIGAN
SourceDAI/B 69-09, p. , Dec 2008
Source TypeDissertation
SubjectsComputer science
Publication Number3329011
Adobe PDF Access the complete dissertation:
 

» Find an electronic copy at your library.
  Use the link below to access a full citation record of this graduate work:
  http://gateway.proquest.com/openurl%3furl_ver=Z39.88-2004%26res_dat=xri:pqdiss%26rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation%26rft_dat=xri:pqdiss:3329011
  If your library subscribes to the ProQuest Dissertations & Theses (PQDT) database, you may be entitled to a free electronic version of this graduate work. If not, you will have the option to purchase one, and access a 24 page preview for free (if available).

About ProQuest Dissertations & Theses
With over 2.3 million records, the ProQuest Dissertations & Theses (PQDT) database is the most comprehensive collection of dissertations and theses in the world. It is the database of record for graduate research.

The database includes citations of graduate works ranging from the first U.S. dissertation, accepted in 1861, to those accepted as recently as last semester. Of the 2.3 million graduate works included in the database, ProQuest offers more than 1.9 million in full text formats. Of those, over 860,000 are available in PDF format. More than 60,000 dissertations and theses are added to the database each year.

If you have questions, please feel free to visit the ProQuest Web site - http://www.proquest.com - or call ProQuest Hotline Customer Support at 1-800-521-3042.