This article is published in the August 2024 issue.

CCC Q&A: A High Performance Computing Researcher Explains Sustainable AI


By Petruce Jean-Charles, Communications Associate, CCC

CCC spoke with one of its council members, Michela Taufer, about her work in high performance computing (HPC) and her contributions to sustainable AI.

Taufer has profoundly shaped the landscape of HPC through pioneering contributions that transcend traditional boundaries. Her career spans pivotal areas including volunteer computing, large-scale data management and analytics workflows, and accelerator-based supercomputing. Taufer introduced groundbreaking techniques to ensure computational accuracy in unpredictable volunteer computing environments, laying a foundation for reproducible outcomes. She also championed principles to enhance data FAIRness long before their widespread adoption, significantly influencing modern data management practices in HPC.

Taufer’s innovative solutions, such as the homogeneous redundancy algorithm and composite precision approach, have advanced reproducibility in diverse computing systems, fostering trust in scientific computing and underlining her enduring impact on HPC’s evolution and credibility worldwide.

 

What interests you about sustainable AI?

My interest lies in sustainable AI for science. Two key challenges intrigue me these days: the power consumption required to train AI models and the opacity of the models themselves. The two challenges are linked.

With power consumption, AI models are increasingly deployed in new research areas, such as high-throughput data analytics, to understand how our planet changes. I am referring to aspects like modeling the ocean, predicting soil moisture patterns, and tracking changes in atmospheric composition — all aspects that significantly impact our lives. With the growing deployment of satellites and sensors, the amount of data available is increasing, enabling the training of more powerful AI models. However, training these models on High Performance Computing (HPC) resources is extremely power-intensive, highlighting the urgent need to minimize the power consumption of HPC resources. I often hear alarming statistics about the power required to train even simple models. 

Estimates for the energy consumption involved in training are often compared to the yearly energy use of multiple households, and these numbers are pretty scary. Unfortunately, many of these figures are speculative and fail to provide a clear picture to the general public. I want the HPC community to come together to define straightforward methods to assess these numbers and make them understandable to the public.
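For illustration, a back-of-the-envelope sketch of such an assessment might look like the following. Every input here (accelerator count, average power draw, training time, datacenter overhead, and household consumption) is an assumption chosen purely for illustration, not a measurement of any particular model.

```python
# Back-of-the-envelope estimate of training energy, for illustration only.
# All inputs below are assumptions, not measurements from any specific model.

NUM_GPUS = 512                   # assumed number of accelerators used for training
AVG_POWER_W = 400                # assumed average draw per accelerator, in watts
TRAINING_HOURS = 24 * 14         # assumed two weeks of wall-clock training time
PUE = 1.2                        # assumed datacenter power usage effectiveness
HOUSEHOLD_KWH_PER_YEAR = 10_000  # rough figure for an average household's annual use

energy_kwh = NUM_GPUS * AVG_POWER_W * TRAINING_HOURS * PUE / 1000
households = energy_kwh / HOUSEHOLD_KWH_PER_YEAR

print(f"Estimated training energy: {energy_kwh:,.0f} kWh")
print(f"Equivalent to ~{households:.1f} households' annual electricity use")
```

Even a transparent estimate like this, with its assumptions stated, gives the public something concrete to reason about rather than a headline number with no context.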

With opacity, ensuring that AI models are explainable and reproducible is another critical challenge that intrigues me. Why should we trust the outcomes of AI models? Are these outcomes reproducible? I believe that to answer these questions, we need to learn from the training processes of AI models. However, to save energy, we often store only the trained models and their final fitness values without capturing the training lifespan in a shareable and searchable format. This limits our ability to learn, reuse, and reduce resource use.

We should address this chicken-and-egg problem: should we use less energy and discard information, or should we save the information, even at a higher cost, so that we have the tools to trust and reason about the AI model? I prefer the second option because, while it might consume more power in the short term, it builds public trust in AI models by providing the information needed to reason about them. Furthermore, lessons learned can help reduce training in the long term, thus saving energy in the future.

I want the HPC community to come together and define a sustainable tradeoff: one that supports long-term reasoning about AI models and their outcomes while mitigating power consumption by promoting the reuse of models based on lessons learned from past training.

 

How can high-performance computing positively address climate challenges?

The HPC community can deliver methods, workflows, and data commons to reduce training costs. Efforts like modeling an AI model's fitness during training and terminating training early based on that model are crucial. By recording and annotating the generated parameter values that describe AI models, we can make this knowledge searchable and reusable. We have come a long way in dealing with the large amounts of performance data generated by supercomputers and in scaling problems to hundreds of thousands of cores. Now, the community needs to repurpose some of that expertise for contexts that can significantly impact society.
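As a concrete illustration, a minimal sketch of this "model the fitness, then cut the training" idea might look like the following. The learning-curve model, the accuracy threshold, and the `train_one_epoch`/`validate` interface are all assumptions for illustration, not the method of any specific project.

```python
import numpy as np
from scipy.optimize import curve_fit

def saturating_curve(epoch, a, b, c):
    """Simple learning-curve model: accuracy approaches the asymptote 'a'."""
    return a - b * np.exp(-c * epoch)

def predicted_final_accuracy(val_accuracies):
    """Fit the curve to the accuracies observed so far and return its asymptote."""
    epochs = np.arange(1, len(val_accuracies) + 1, dtype=float)
    (a, _, _), _ = curve_fit(saturating_curve, epochs, val_accuracies,
                             p0=[val_accuracies[-1], 0.5, 0.1], maxfev=10_000)
    return a

def train_with_early_cutoff(candidate, max_epochs=50, warmup_epochs=5, threshold=0.80):
    """Train a candidate network, stopping early if its predicted final accuracy
    falls below the threshold. 'candidate' is a hypothetical object exposing
    train_one_epoch() and validate() methods."""
    history = []
    for epoch in range(1, max_epochs + 1):
        candidate.train_one_epoch()
        history.append(candidate.validate())
        if epoch >= warmup_epochs and predicted_final_accuracy(history) < threshold:
            return history, "terminated early"   # the remaining epochs' energy is saved
    return history, "trained to completion"
```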

It is not just about computing. We must keep in mind the need for efficient data management and security, which are critical components of sustainable AI. We should come together to standardize data-sharing practices, ensuring that data is easily accessible and transferable among users. By working together, we can create systems that support seamless data management and protect sensitive information, thereby promoting an environment of shared discovery and collaboration.

 

So what can researchers do to improve the sustainability of these AI models?

Collaboration is essential! The multifaceted challenges we face, from reducing energy consumption to improving the transparency of AI models, require a collective effort from diverse experts, encompassing software developers, energy-efficient hardware designers, data managers, and more. No single organization possesses the skills and resources necessary to comprehensively address sustainable AI. By collaborating, we can pool our resources and knowledge, leading to robust and practical solutions. For instance, merging the strengths of HPC systems with the scalability and flexibility of cloud computing can create more efficient platforms for AI inference.

Collaboration can also significantly reduce the costs associated with sustainable AI. We can lower individual expenses by sharing infrastructure and resources, making HPC and cloud resources more accessible. Community clouds and shared investments can help break down financial barriers. This shared approach not only makes advanced technologies more affordable but also lowers the entry barriers for early-career and disadvantaged colleagues.

 

Can you give us an example of a successful collaborative project?

I am currently engaged in an NSF-funded project to reduce neural network training costs through parametric modeling, workflow optimization, and the establishment of a data commons. This initiative addresses the substantial computational demands of developing accurate neural networks for varied scientific datasets and applications. By introducing methods that allow for the early termination of neural network training, we significantly accelerate the search process and decrease resource usage.

A key collaborative success of this project is the development of a modular workflow that separates the search for neural architectures from the prediction of their accuracy. This separation provides the flexibility needed to tailor fitness predictions to different datasets and problems, thus enhancing the efficiency of neural architecture searches. Additionally, we have established a neural network data commons that meticulously records the lifecycle of neural networks, from generation through training and validation. This data commons is invaluable because it enables other researchers to use these neural networks in their own studies, promoting reproducibility and facilitating analysis of how network architectures affect performance on specific tasks.
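A minimal sketch of what one record in such a data commons might capture, assuming a simple JSON-serializable schema; the field names and values below are illustrative, not the project's actual schema.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Dict, List

@dataclass
class NetworkLifecycleRecord:
    """Illustrative record of one neural network's lifecycle, from generation
    through training and validation, stored in a searchable data commons."""
    network_id: str
    dataset: str
    architecture: Dict[str, object]          # e.g., layer types, widths, depths
    hyperparameters: Dict[str, float]        # e.g., learning rate, batch size
    epoch_val_accuracy: List[float] = field(default_factory=list)
    final_accuracy: float = 0.0
    terminated_early: bool = False
    gpu_hours: float = 0.0                   # resource use, for energy accounting

# Hypothetical example record for one candidate network.
record = NetworkLifecycleRecord(
    network_id="nas-000123",
    dataset="soil-moisture-v1",
    architecture={"type": "cnn", "depth": 8, "width": 64},
    hyperparameters={"learning_rate": 1e-3, "batch_size": 128},
    epoch_val_accuracy=[0.61, 0.70, 0.74, 0.76],
    final_accuracy=0.76,
    terminated_early=True,
    gpu_hours=3.5,
)

# Serialize to JSON so other researchers can search, compare, and reuse the record.
print(json.dumps(asdict(record), indent=2))
```

Capturing the per-epoch history alongside the final accuracy and resource use is what makes such records useful for reasoning about reproducibility and for deciding when future training runs can be shortened or skipped.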

The project also includes a robust educational component. Through the Systers program at the University of Tennessee, Knoxville, we actively mentor underrepresented groups, particularly women in electrical engineering and computer science. The mentorship and curriculum development designed for a diverse student body broaden the project's impact beyond its scientific and technical achievements.

Power consumption and explainable AI are big challenges, but they are worth researching and are necessary to addressing climate change.