Cache-aware application parallelization and optimization for multicores
by Zhang, Yuanrui, Ph.D., THE PENNSYLVANIA STATE UNIVERSITY, 2011, 119 pages; 3483828

Abstract:

All trends clearly show that multicore machines will become the next generation mainstream computer architecture. One reason for this is the fact that it is not possible to continuously increase clock frequency on single processor machines, since, beyond a point, power dissipation becomes a major bottleneck. In addition, with large numbers of transistors on microprocessor chips, it is becoming too costly to design and debug ever-larger and ever-more-complex processors in every two or three years. Consequently, all major chip manufacturers have turned to multicore machines, which exhibit different architectural designs, in terms of the number of cores, on-chip cache topologies and interconnecting structures.

One of the critical issues regarding these multicores is the application software and system software support. In execution environments where the cores allocated (by the OS) to an application are managed by the application itself, without proper compiler support, it will not be easy to exploit the full potential of these emerging systems. Many parallel compilers and source-level code parallelizers have been developed to provide automatic parallelization and optimization. However, most of these techniques focus on the intrinsic parallelism and data reuse patterns exhibited by the application itself, while ignoring the underlying target multicore architectures. Being cache topology-aware is important for application parallelization and optimization, since cache topology determines the core connection and thus the data sharing between computations on different cores.

Motivated by this, in this thesis research, we propose three different cacheaware parallelization and optimization frameworks targeting multicores. While the first two frameworks focus on the computation and the data layout optimizations, respectively, for general multithreaded applications, targeting Network-on-Chip (NoC)-based multicores with Non-Uniform Cache Access (NUCA), the third one, optimizing both computation and data layouts, is designed for a domain-specific application, called the Non-Uniform FFT (NuFFT) data translation computation. In particular, our first framework takes a sequential program as input, applies tiling to each nested loop, and then employs a cache topology-aware tile mapping and scheduling strategy (for each nested loop). It considers temporal and spatial data reuses at the same time during mapping and scheduling, and generates a Pthread parallel code automatically for the user. Our experimental evaluation shows that, by applying cache topology-aware tile mapping and scheduling, application performance can be improved significantly, up to 21.6%, compared to a topology oblivious strategy. Our second framework takes an alternate view and optimizes the data layouts for arrays in an application. It includes three steps: array tiling, computation-to-core mapping and layout customization. The first of these tries to identify the affinity between data and computation taking into account parallelization information, with the goal of minimizing on-chip remote accesses. The second step maps computations (and their associated data) to cores with the goal of minimizing the average number of links traversed by remote accesses (distance-to-data), and the last step further customizes the memory layout taking into account the data placement policy adopted by the underlying architectures. Our experimental results show that the proposed approach improves, on average, data access latency and execution time by 24.7% and 18.4%, respectively, in the case of static NUCA, and 18.1% and 12.7%, respectively, in the case of dynamic NUCA. Compared to the first framework, which optimizes all parallel loop nests in an application at the same time, the second framework optimizes data layouts with respect to only one loop nest at a time, for example, the most time-consuming loop nest in an application. Combining these two frameworks into an integrated one is on our future research agenda. Our third framework, called NUDT, generates optimized parallel code for NuFFT data translation computation on multicores, which takes algorithmic and architectural descriptions as input, performs the data layout transformation, i.e., geometric tiling and binning, and then applies one of the two parallelization strategies we developed. These two parallelization strategies ensure load balance across the cores through dynamic task allocation, and guarantee mutual exclusion in data updates via different partitioning and scheduling schemes. They can be adapted to different target architectures as well as different input sample distributions. Our experimental evaluation shows that, our NUDT tool improves the NuFFT data translation significantly on multicores, and makes it no longer the bottleneck in the NuFFT computation, even for large data set sizes.

 
AdviserMahmut Kandemir
SchoolTHE PENNSYLVANIA STATE UNIVERSITY
SourceDAI/B 73-02, p. , Nov 2011
Source TypeDissertation
SubjectsComputer engineering; Computer science
Publication Number3483828
Adobe PDF Access the complete dissertation:
 

» Find an electronic copy at your library.
  Use the link below to access a full citation record of this graduate work:
  http://gateway.proquest.com/openurl%3furl_ver=Z39.88-2004%26res_dat=xri:pqdiss%26rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation%26rft_dat=xri:pqdiss:3483828
  If your library subscribes to the ProQuest Dissertations & Theses (PQDT) database, you may be entitled to a free electronic version of this graduate work. If not, you will have the option to purchase one, and access a 24 page preview for free (if available).

About ProQuest Dissertations & Theses
With over 2.3 million records, the ProQuest Dissertations & Theses (PQDT) database is the most comprehensive collection of dissertations and theses in the world. It is the database of record for graduate research.

The database includes citations of graduate works ranging from the first U.S. dissertation, accepted in 1861, to those accepted as recently as last semester. Of the 2.3 million graduate works included in the database, ProQuest offers more than 1.9 million in full text formats. Of those, over 860,000 are available in PDF format. More than 60,000 dissertations and theses are added to the database each year.

If you have questions, please feel free to visit the ProQuest Web site - http://www.proquest.com - or call ProQuest Hotline Customer Support at 1-800-521-3042.