As Stanford professor John Hennessy once said, “[P]arallelism and ease of use of truly parallel computers … [is] a problem that’s as hard as any that computer science has faced.” As my Ph.D. advisor, Hennessy instilled in me a desire to conquer this problem, and I have been working at it for the last 30 years. Specifically, my research focuses on parallel computer architecture, with an emphasis on shared-memory organizations and their software layers.
Early in my research career, I was fortunate to be part of the teams that worked on the Stanford DASH and the Illinois Cedar multiprocessors, two academic computer prototypes of the late 1980s. Those were exhilarating times when parallel architectures seemed poised to take over the computer industry. Later, interest in parallelism waned as processor designers competed in the frequency race, only to crash into the power and complexity walls. After the turn of the millennium, parallelism became hot again, this time spreading in an inexorable manner, dominating all markets including handheld devices. Unfortunately, while researchers have made major advances in parallel architectures and systems, true ease of programming is still elusive. Further, the slowdown of Moore’s law is about to accentuate the tension between programmability and performance as “easy parallelism” is being exhausted. Overall, these trends ensure that research in programmable parallel computer architecture will remain vibrant and highly relevant.
Throughout my career, I have tried to work with computer companies, pushing ideas for programmable parallel architectures. For example, in the early 2000s, I participated in the design team of the IBM PERCS multiprocessor. This was an experimental multiprocessor funded by DARPA’s High-Productivity Computing Systems program, whose goal was to develop technologies for productive and efficient parallelism. This machine introduced several novel architecture ideas, including new multithreading support and processing-in-memory engines. PERCS was slated to become the nodes of the University of Illinois at Urbana-Champaign’s Blue Waters supercomputer.
A few years later, as the energy wall became omnipresent, I worked with Intel researchers to design Intel’s Runnemede Extreme-Scale Manycore. This was a highly energy-efficient manycore, with hundreds of streamlined cores, and novel architecture and circuit techniques to save energy. It was supported by DARPA’s Ubiquitous High Performance Computing program and DOE’s X-Stack program as the building block of an exascale multiprocessor. We designed the manycore’s extreme low power cores and cache hierarchy, energy-management mechanisms, and a programming model that minimized data movement.
To further enhance programmability, I collaborated with Intel researchers to build QuickRec, the first multicore x86 prototype of deterministic Record and Replay (RnR). RnR is a primitive that consists of recording into a log all the non-deterministic events that occur during a workload execution. Then, during a re-execution of the same workload, the logged inputs are dynamically provided at the correct times, enforcing the exact reproduction of the initial execution. QuickRec uses field-programmable gate arrays to extend the core and cache hierarchy of an Intel multiprocessor, and a modified Linux operating system to manage the hardware. We showed that RnR can debug non-deterministic software and even detect security intrusions.
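To make the RnR idea concrete, here is a minimal software sketch in Python. It is not QuickRec, which works in hardware and in the operating system. The `Recorder`, `Replayer`, and `workload` names are hypothetical, and a random number generator stands in for the non-deterministic events of a real execution.

```python
import random
from typing import Callable, List, Tuple


class Recorder:
    """Wraps a non-deterministic source and logs every value it returns."""

    def __init__(self, source: Callable[[], int]) -> None:
        self.source = source
        self.log: List[int] = []

    def next_event(self) -> int:
        value = self.source()
        self.log.append(value)  # record the event in arrival order
        return value


class Replayer:
    """Feeds previously logged values back in the original order."""

    def __init__(self, log: List[int]) -> None:
        self.log = list(log)
        self.position = 0

    def next_event(self) -> int:
        if self.position >= len(self.log):
            raise RuntimeError("replay diverged: log exhausted")
        value = self.log[self.position]
        self.position += 1
        return value


def workload(env) -> int:
    """Hypothetical workload whose result depends on non-deterministic inputs."""
    total = 0
    for _ in range(5):
        event = env.next_event()
        if event % 2:
            total = total * 31 + event
        else:
            total = total - event
    return total


def record_and_replay() -> Tuple[int, int, List[int]]:
    """Runs the workload once while recording, then once more from the log."""
    recorder = Recorder(lambda: random.randint(0, 1000))
    recorded_result = workload(recorder)
    replayed_result = workload(Replayer(recorder.log))
    return recorded_result, replayed_result, recorder.log
```

Because the replayed run sees the logged values in the same order, it follows the same control flow and produces the same result as the recorded run, which is the property that makes debugging non-deterministic software tractable.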
In a joint effort with several other University of Illinois faculty, we created the Illinois-Intel Parallelism Center (I2PC). In this center, which I had the honor to lead, we performed interdisciplinary research on programmable parallel systems. In collaboration with Intel, my research group developed the Bulk Multicore. This is an out-of-the-box multiprocessor concept, where cores continuously execute atomic blocks of instructions called Chunks, and cache coherence is maintained in a novel way with Bloom filter-based signature operations on sets of data. Bulk delivers high performance and is easy to program as its transactional memory-like extensions are transparent to users.
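The signature idea can be illustrated with a small Python sketch. This is a simplified software model under stated assumptions, not the Bulk hardware: the field count, field width, hash function, and the `Signature` and `must_squash` names are all hypothetical. Each chunk summarizes the addresses it read and wrote in a Bloom filter-style signature, and a committing chunk's write signature is intersected with the signatures of other chunks to detect possible conflicts.

```python
from typing import Iterable, List

NUM_FIELDS = 4    # hypothetical: number of independent bit-fields per signature
FIELD_BITS = 64   # hypothetical: number of bits in each field


def _field_index(address: int, field: int) -> int:
    """Hypothetical hash that mixes the address differently for each field."""
    mixed = (address * (2 * field + 3) + (address >> (field + 1))) & 0xFFFFFFFF
    return mixed % FIELD_BITS


class Signature:
    """Bloom filter-style summary of a set of addresses."""

    def __init__(self, addresses: Iterable[int] = ()) -> None:
        self.fields: List[int] = [0] * NUM_FIELDS
        for address in addresses:
            self.insert(address)

    def insert(self, address: int) -> None:
        # Set one bit per field for this address.
        for f in range(NUM_FIELDS):
            self.fields[f] |= 1 << _field_index(address, f)

    def may_contain(self, address: int) -> bool:
        # True for every inserted address; may also be True for others.
        return all(
            (self.fields[f] >> _field_index(address, f)) & 1
            for f in range(NUM_FIELDS)
        )

    def intersects(self, other: "Signature") -> bool:
        # A shared address sets a common bit in every field, so if any
        # field has no common bit the two sets are certainly disjoint.
        return all(a & b for a, b in zip(self.fields, other.fields))


def must_squash(committing_writes: Signature,
                other_reads: Signature,
                other_writes: Signature) -> bool:
    """Conservative conflict check between a committing chunk and another chunk."""
    return (committing_writes.intersects(other_reads)
            or committing_writes.intersects(other_writes))
```

The check is conservative: it never misses a real overlap, but hash aliasing can report a conflict between chunks that touched different addresses. In this model, such a false positive would cost an unnecessary re-execution rather than correctness.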
Working with companies has been a great learning experience, and has allowed me to seed some ideas that have impacted commercial products. However, I also enjoy thinking about new concepts that can change the way we design and program computers in the long run. For example, I devoted many years to developing the area of thread-level speculation, where the dependences between concurrently executing threads are monitored by the hardware, hence relieving the programmer of the burden of proving thread independence. Some of these ideas have contributed to the current popularity of hardware transactional memory. Currently, I am pushing the frontiers of the interaction between programming and machine design by developing very low overhead synchronization and communication primitives for highly integrated computers.
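As an illustration of thread-level speculation, the following Python sketch simulates in software what the hardware would monitor. It is a toy model under stated assumptions, not an actual design: the `SpecView` and `run_speculatively` names are hypothetical, tasks run one after another rather than truly in parallel, and memory is a plain dictionary. Each task executes against a common snapshot while its reads and writes are tracked; tasks then commit in program order, and a task that read a location written by an earlier task is squashed and re-executed.

```python
from typing import Callable, Dict, List, Set

Memory = Dict[str, int]


class SpecView:
    """Per-task view of memory that buffers writes and tracks reads."""

    def __init__(self, backing: Memory) -> None:
        self.backing = backing
        self.read_set: Set[str] = set()
        self.write_buffer: Memory = {}

    def load(self, key: str) -> int:
        if key in self.write_buffer:      # forward the task's own writes
            return self.write_buffer[key]
        self.read_set.add(key)
        return self.backing[key]

    def store(self, key: str, value: int) -> None:
        self.write_buffer[key] = value    # not visible to others until commit


Task = Callable[[SpecView], None]


def run_speculatively(tasks: List[Task], memory: Memory) -> int:
    """Executes tasks speculatively, commits in order, returns the squash count."""
    snapshot = dict(memory)
    views = []
    for task in tasks:                    # stands in for parallel execution
        view = SpecView(snapshot)
        task(view)
        views.append(view)

    squashes = 0
    committed_writes: Set[str] = set()
    for task, view in zip(tasks, views):
        if view.read_set & committed_writes:
            # The task read data that an earlier task later overwrote:
            # squash it and re-execute against the up-to-date memory.
            squashes += 1
            view = SpecView(memory)
            task(view)
        memory.update(view.write_buffer)
        committed_writes.update(view.write_buffer)
    return squashes


def example() -> Memory:
    """Three tasks; the second depends on the first, the third is independent."""
    memory = {"a": 1, "b": 2, "c": 0, "d": 0}
    tasks: List[Task] = [
        lambda v: v.store("c", v.load("a") + 1),
        lambda v: v.store("d", v.load("c") * 10),
        lambda v: v.store("b", v.load("b") + 5),
    ]
    memory["squashes"] = run_speculatively(tasks, memory)
    return memory
```

The programmer writes the three tasks without stating which ones are independent. The dependence tracking discovers that only the second task consumed stale data, so only that task is re-executed, and the final memory matches a sequential run.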
With the sunsetting of Moore’s law, parallel computer architecture is about to take center stage in computer science and engineering. As the remaining semiconductor generations fail to deliver historic improvements, architecture innovations will become strategic to computer companies. Many research opportunities will appear. These are indeed exciting times for computer architects.
I have been blessed with the privilege of working with outstanding Ph.D. students, who have driven the work described. Many of them have taken up faculty jobs at CRA member institutions such as Cornell University, University of Washington, Georgia Tech, and the University of Southern California. They and other young minds will ensure that our field continues its dramatic advances.
I have also enjoyed serving the computing community in a variety of ways, some of them enabled by the CRA. For example, I co-organized two CCC-sponsored visioning workshops that contributed to an NSF funding program. For several years, I had the honor to serve as chair of the IEEE Technical Committee on Computer Architecture (TCCA) where, together with other colleagues, we created a yearly ACM-IEEE meeting that co-locates multiple architecture and software conferences (HPCA-PPoPP-CGO) to enhance interdisciplinary interactions. It is through this interdisciplinary work that we will be able to tame the challenge of parallelism and usable parallel computers.
About the author
Josep Torrellas is the Saburo Muroga Professor of Computer Science at the University of Illinois at Urbana-Champaign (UIUC). He is the director of the Center for Programmable Extreme-Scale Computing, and past director of the Illinois-Intel Parallelism Center. Torrellas received the IEEE Computer Society Technical Achievement Award for “Pioneering contributions to shared-memory multiprocessor architectures and thread-level speculation” and the UIUC Award for Excellence in Graduate Student Mentoring, and he has been a Willett Faculty Scholar at Illinois. He is a member of the CRA board of directors, and of the International Roadmap for Devices and Systems (IRDS). He has served as a CCC Council Member (2011-2014) and as the chair of the IEEE TCCA (2005-2010). He is a fellow of IEEE, ACM, and AAAS. He has graduated 36 Ph.D. students, who are now leaders in academia and industry. He received a Ph.D. from Stanford University.