The continued scaling of silicon fabrication technologies is enabling the integration of dozens of processing cores on a single chip in the next generation of computers. Our ability to exploit such computational power, however, is limited not only by the shortcomings of parallelism extraction techniques, but also by increasing levels of execution uncertainty within the system. As device feature sizes scale below 45 nm, reliability has rapidly moved to the forefront of concerns for leading semiconductor companies, with the main challenge being to scale system performance while meeting power and reliability budgets. To make matters worse, this unreliable computational fabric must concurrently execute an increasing number of applications that constantly vie for execution resources, making the execution environment even more dynamic and unpredictable.
The unreliability of the electronic fabric, together with the unpredictability of the execution process, has motivated the incorporation of execution adaptivity in future multicore systems, so that computational resources can be frequently renegotiated at run time. The challenge, however, is to attain adaptivity alongside the goals that designers already face, such as computational efficiency, power and thermal management, and predictability of worst-case performance. Traditional approaches that provide adaptivity purely dynamically at run time will fail to scale as we move to systems with dozens of cores. Static techniques that rely solely on compiler analysis do not deliver efficient adaptivity either. Instead, I propose a set of compiler-directed run-time optimization techniques that combine the advantages of both: they can react to unpredictable events while exploiting detailed program information to guide run-time decisions.
Technically, this thesis addresses the increasing levels of execution uncertainty in future multicore systems, induced by device failures, heat buildup, or resource competition, from three angles. It presents several tightly coupled techniques that either (1) mitigate a source of uncertainty, such as thermal stress, as far as possible, or (2) precisely detect resource variations, especially those induced by device failures, and then (3) quickly reconfigure the execution in a predictable manner without relying on spare units. These techniques are designed to minimize power and performance impact, to localize communication and migration so as to satisfy interconnect constraints, and to ensure high predictability so as to meet the worst-case performance constraints of mission-critical applications. The successful incorporation of these techniques in future multicore systems, I believe, will engender adaptive, scalable architectures that can seamlessly reshape execution paths and schedules in an amortizable, high-volume, fixed-silicon fabric.