Current trends clearly indicate that multicore machines will become the mainstream computer architecture of the next generation. One reason is that clock frequency cannot be increased indefinitely on single-processor machines: beyond a point, power dissipation becomes a major bottleneck. In addition, with the large numbers of transistors now placed on microprocessor chips, it has become too costly to design and debug ever-larger and ever-more-complex processors every two or three years. Consequently, all major chip manufacturers have turned to multicore machines, whose architectural designs differ in the number of cores, on-chip cache topologies, and interconnect structures.
A critical issue for these multicores is application and system software support. In execution environments where the cores allocated (by the OS) to an application are managed by the application itself, it is difficult to exploit the full potential of these emerging systems without proper compiler support. Many parallel compilers and source-level code parallelizers have been developed to provide automatic parallelization and optimization. However, most of these techniques focus on the intrinsic parallelism and data reuse patterns exhibited by the application itself, while ignoring the underlying target multicore architecture. Cache topology awareness is important for application parallelization and optimization, since the cache topology determines how cores are connected and, consequently, how data is shared among computations running on different cores.
Motivated by this, in this thesis research, we propose three different cache-aware parallelization and optimization frameworks targeting multicores. The first two frameworks focus on computation and data layout optimizations, respectively, for general multithreaded applications, targeting Network-on-Chip (NoC)-based multicores with Non-Uniform Cache Access (NUCA); the third, which optimizes both computation and data layouts, is designed for a domain-specific application, the Non-Uniform FFT (NuFFT) data translation computation. In particular, our first framework takes a sequential program as input, applies tiling to each nested loop, and then employs a cache topology-aware tile mapping and scheduling strategy for each nested loop. It considers temporal and spatial data reuse simultaneously during mapping and scheduling, and automatically generates Pthread-based parallel code for the user. Our experimental evaluation shows that applying cache topology-aware tile mapping and scheduling improves application performance significantly, by up to 21.6%, compared to a topology-oblivious strategy. Our second framework takes an alternative view and optimizes the data layouts of the arrays in an application. It consists of three steps: array tiling, computation-to-core mapping, and layout customization. The first step identifies the affinity between data and computation, taking parallelization information into account, with the goal of minimizing on-chip remote accesses. The second step maps computations (and their associated data) to cores with the goal of minimizing the average number of links traversed by remote accesses (distance-to-data), and the last step further customizes the memory layout to match the data placement policy adopted by the underlying architecture.
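To make the idea of topology-aware mapping concrete, the sketch below shows a simple greedy tile-to-core placement on a 2D mesh NoC. This is an illustrative toy, not the thesis's actual algorithm: the greedy heuristic, the function names, and the data-sharing representation are all assumptions, and a real framework would derive the sharing sets from compiler reuse analysis rather than take them as input.

```python
# Illustrative sketch (assumed, not the thesis's algorithm): greedily
# place each tile on the core that minimizes total NoC distance to the
# cores already holding tiles it shares data with.

def mesh_distance(core_a, core_b, mesh_width):
    """Manhattan (hop) distance between two cores on a 2D mesh NoC."""
    ax, ay = core_a % mesh_width, core_a // mesh_width
    bx, by = core_b % mesh_width, core_b // mesh_width
    return abs(ax - bx) + abs(ay - by)

def map_tiles_to_cores(tiles, sharing, num_cores, mesh_width):
    """Greedy tile-to-core mapping.

    tiles   -- list of tile ids, in scheduling order
    sharing -- dict: tile id -> set of tile ids it reuses data with
    """
    placement = {}
    load = [0] * num_cores  # tiles per core, used for tie-breaking
    for tile in tiles:
        best_core, best_cost = None, None
        for core in range(num_cores):
            # Total hop distance to already-placed sharing partners.
            dist = sum(mesh_distance(core, placement[p], mesh_width)
                       for p in sharing.get(tile, ()) if p in placement)
            cost = (dist, load[core])  # prefer nearby, then lighter load
            if best_cost is None or cost < best_cost:
                best_core, best_cost = core, cost
        placement[tile] = best_core
        load[best_core] += 1
    return placement
```

Under this heuristic, tiles that share data end up on the same core (or nearby cores), which is the intuition behind minimizing distance-to-data; the actual framework additionally balances temporal against spatial reuse during scheduling.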
Our experimental results show that the proposed approach improves data access latency and execution time, on average, by 24.7% and 18.4%, respectively, for static NUCA, and by 18.1% and 12.7%, respectively, for dynamic NUCA. Compared to the first framework, which optimizes all parallel loop nests in an application at the same time, the second framework optimizes data layouts with respect to only one loop nest at a time, for example, the most time-consuming loop nest in the application. Combining these two frameworks into an integrated one is on our future research agenda. Our third framework, called NUDT, generates optimized parallel code for the NuFFT data translation computation on multicores. It takes algorithmic and architectural descriptions as input, performs a data layout transformation, i.e., geometric tiling and binning, and then applies one of the two parallelization strategies we developed. These two strategies ensure load balance across the cores through dynamic task allocation and guarantee mutual exclusion in data updates via different partitioning and scheduling schemes. They can be adapted to different target architectures as well as different input sample distributions. Our experimental evaluation shows that NUDT improves NuFFT data translation performance significantly on multicores, making it no longer the bottleneck in the overall NuFFT computation, even for large data set sizes.
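The geometric binning idea described above can be sketched as follows. This is a minimal, assumed illustration, not NUDT's implementation: the bin layout, function names, and the longest-processing-time partitioning rule are placeholders. The key property it demonstrates is that each nonuniform sample only affects grid points near its own location, so assigning whole spatial bins to cores makes per-core updates mutually exclusive without fine-grained locking.

```python
# Illustrative sketch (assumed, not NUDT's code): bin nonuniform 2D
# samples into rectangular spatial bins, then deal bins out to cores
# so per-core grid updates are disjoint.
from collections import defaultdict

def bin_samples(samples, grid_size, bins_per_dim):
    """Group (x, y, value) sample points into spatial bins."""
    bin_w = grid_size / bins_per_dim
    bins = defaultdict(list)
    for x, y, value in samples:
        bx = min(int(x // bin_w), bins_per_dim - 1)
        by = min(int(y // bin_w), bins_per_dim - 1)
        bins[(bx, by)].append((x, y, value))
    return bins

def partition_bins(bins, num_cores):
    """Assign bins to cores, largest first, balancing per-core sample
    counts (a simple longest-processing-time heuristic standing in for
    dynamic task allocation)."""
    work = [0] * num_cores
    assignment = [[] for _ in range(num_cores)]
    for key, pts in sorted(bins.items(), key=lambda kv: -len(kv[1])):
        core = min(range(num_cores), key=lambda c: work[c])
        assignment[core].append(key)
        work[core] += len(pts)
    return assignment
```

Because a core only ever touches the grid region covered by its own bins (plus a halo that a real scheme must handle, e.g., by staged scheduling), no two cores write the same grid point concurrently, which is the mutual-exclusion guarantee the parallelization strategies provide.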