In recent years, the microprocessor industry has been revolutionized by the introduction of the chip multiprocessor (CMP). Created as an alternative to single-core designs, CMPs promise to mitigate two of the most serious challenges of modern high-performance single-core processors: design complexity and power consumption.
Throughput-oriented workloads are likely to benefit from CMP architectures with modest effort. However, extending the performance potential of CMPs to sequential applications remains a difficult problem. Conventional compiler approaches have largely failed to extract enough thread-level parallelism from single-threaded applications to take advantage of many cores, leaving it to the programmer to extract cost-effective parallelism.
To make parallel applications easier to develop, industry and academia have created parallel runtime systems and libraries that let programmers focus on identifying parallelism rather than on how that parallelism is managed and mapped to the underlying architecture. Dynamic management of parallelism, the ability to take created parallelism and assign it to available execution resources at run time, is used by many runtime libraries, such as OpenMP and Intel Threading Building Blocks, to improve performance. While parallel runtime libraries make it easier for programmers to develop parallel code, software-based dynamic management of parallelism imposes a performance cost, since the runtime library must be invoked to make scheduling decisions. Aggressively annotated parallel code is especially likely to expose these software management overheads, which at significant levels can render the parallelization cost-ineffective. Moreover, because management cost grows with core count, the performance portability of applications to large core counts is severely affected.
This dissertation proposes a low-overhead, low-latency dynamic parallelism management solution aimed at improving parallel performance. The proposed solution not only allows parallel applications to make effective use of large core counts, but also allows them to gracefully adapt to dynamic changes in system characteristics such as core-speed and core-count variations. To this end, this work sets forth four overarching goals: (1) perform an in-depth characterization of two popular parallel runtime libraries to identify the benefits and shortcomings of their dynamic management of parallelism; (2) provide a detailed study of how software-based approaches succeed, or fail, in mitigating performance heterogeneity caused by technology variations; (3) develop parallelism redistribution policies that use global information to improve load balancing and performance scalability; and (4) describe Squadron, a comprehensive framework that provides low-overhead, low-latency dynamic management of parallelism and achieves performance improvements ranging from 18% to 13X over existing software-based solutions.
The end result of this dissertation is a detailed study of dynamic management of parallelism in software, as well as of its performance potential under hardware support. The characterization results presented in this work can help runtime system designers create better designs by offering valuable insights into some of the major sources of overhead currently limiting the scalability of software solutions. Squadron serves as a first step toward an attractive solution for future CMP architectures seeking to offer superior parallel performance through specialized hardware support.