ACM SIGPLAN BLOG: Prompts are Programs
The following is a repost from the ACM SIGPLAN Blog: PL Perspectives. It was written by Tommy Guy, Peli de Halleux, Reshabh K Sharma, and past CRA-I Co-Chair Ben Zorn.
In this post, we highlight just how important it is to understand that an AI model prompt has much in common with a traditional software program. Taking this perspective creates important opportunities and challenges for the programming language and software engineering communities, and we urge these communities to undertake new research agendas to address them.
Moving Beyond Chat
ChatGPT, released in November 2022, had a huge impact on our understanding of what large language models (LLMs) can do and how we can use them. The millions of people who have used it understand what a prompt is and how powerful it can be. We marvel at the breadth and depth of the model's ability to understand and respond to what we say, and at its ability to hold an informed conversation that lets us refine its responses as needed.
Having said that, many chatbot users have experienced challenges in getting LLMs to do what they want. Skill is required to phrase the input so that the chatbot correctly interprets the user's intent. Similarly, the user may have very specific expectations about what the chatbot produces (for example, data formatted in a particular way, such as a JSON object), and it is important to capture those expectations in the prompt.
Also, chat interactions with LLMs have significant limitations beyond the challenge of phrasing a prompt. Unlike writing and debugging a piece of code, an interactive chat session does not produce an artifact that can then be reused, shared, parameterized, etc. So, for one-off uses, chat is a good experience, but for repeated application of a solution, chat falls short.
Prompts are Programs
The shortcomings of chatbots are overcome when LLM interactions are embedded into software systems that support automation, reuse, etc. We call such systems AI Software systems (AISW) to distinguish them from software that does not leverage an LLM at runtime (which we call Plain Ordinary Software, POSW). In this context, LLM prompts have to be considered part of the broader software system and are subject to the same robustness, security, and other requirements that any software has. In a related blog, we’ve outlined how much the evolution of AISW will impact the entire system stack. In this post, we focus on how important prompts are in this new software ecosystem and what new challenges they present to our existing approaches to creating robust software.
Before proceeding, we clarify what we mean by a “prompt”. Our most familiar experience with prompting is what we type into a chatbot; we call this direct input the user prompt. A second, more complex prompt is the one written to process the user prompt, often called the system prompt. The system prompt contains application-specific directions (such as “You are a chatbot…”) and is combined with other inputs (such as the user prompt, documents, etc.) before being sent to the LLM. It is a fixed set of instructions that defines the nature of the task to be completed, what other inputs are expected, and how the output should be generated. In that way, the system prompt guides the execution of the LLM to compute a specific result, much as any software function does. In the following discussion, our focus is mainly on thinking of system prompts as programs, but many of the observations apply directly to user prompts as well.
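As a concrete illustration, here is a minimal sketch of how the two kinds of prompt are typically combined. It assumes the OpenAI Python client and a placeholder model name; any chat API with system and user roles looks similar.

from openai import OpenAI

client = OpenAI()  # assumes an API key in the environment

SYSTEM_PROMPT = "You are a chatbot that answers questions about our product documentation."

def answer(user_prompt: str) -> str:
    # The fixed system prompt acts as the "program"; the user prompt is its input.
    # The application combines the two before the request ever reaches the LLM.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content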
An Example of a Prompt
To illustrate our discussion, we use the following prompt as an example, loosely adapted from a recent paper on prompt optimization.
You are given two items: 1) a sentence and 2) a word contained in that sentence.
Return the part of speech tag for the given word in the sentence.
This system prompt describes the input it expects (in this case, a pair consisting of a sentence, such as “The cat ate the hat.”, and a word from that sentence, such as “hat”), the transformation to perform, and the expected structure of the output. With this example, it is easy to see that all the approaches we take to creating robust software should now be rethought in terms of how they apply to prompts.
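Viewed as a program, this prompt is essentially a function from a (sentence, word) pair to a part-of-speech tag. The following sketch (again assuming an OpenAI-style chat client; the model name is a placeholder) makes that reading explicit.

from openai import OpenAI

client = OpenAI()

POS_PROMPT = (
    "You are given two items: 1) a sentence and 2) a word contained in that sentence.\n"
    "Return the part of speech tag for the given word in the sentence."
)

def pos_tag(sentence: str, word: str) -> str:
    # The system prompt is the "function body"; the (sentence, word) pair is its argument.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": POS_PROMPT},
            {"role": "user", "content": f"Sentence: {sentence}\nWord: {word}"},
        ],
    )
    return response.choices[0].message.content.strip()

print(pos_tag("The cat ate the hat.", "hat"))  # e.g. "NN" or "noun", depending on the model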
If Prompts are Programs, What is the Programming Language?
There are many open questions about the best way to prompt language models, and this is a topic of active PL and AI research. Expressing prompts purely in natural language can be effective in practice. In addition, best-practice guidelines for writing prompts often recommend structuring prompts using traditional document structuring mechanisms (like markdown) and clearly delineating sections, such as a section of examples, output specifications, etc. Templating, where parts of a prompt can be substituted programmatically, is also popular. Approaches to controlling the structure and content of model output, both through model training and through external specifications such as OpenAI's JSON mode or Pydantic validators, have been effective.
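The sketch below illustrates two of these practices together, templating and output checking. It assumes the OpenAI client's JSON mode and Pydantic v2; the schema and field name are ours.

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

# A simple template: the {sentence} and {word} slots are filled in programmatically.
TEMPLATE = (
    "You are given a sentence and a word contained in that sentence.\n"
    'Return a JSON object with a single field "pos" holding the part-of-speech tag '
    "for the given word.\n"
    "\n"
    "Sentence: {sentence}\n"
    "Word: {word}"
)

class PosResult(BaseModel):
    pos: str  # an explicit output specification that can be checked mechanically

def pos_tag(sentence: str, word: str) -> PosResult:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": TEMPLATE.format(sentence=sentence, word=word)}],
        response_format={"type": "json_object"},  # OpenAI JSON mode
    )
    # Systematic checking: validation fails loudly if the output does not match the schema.
    return PosResult.model_validate_json(response.choices[0].message.content)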
Efforts have also been made to integrate programming language constructs more deeply into the prompts themselves, including the Guidance and LMQL languages, which allow additional specifications. All of these methods (1) recognize the value of more explicit and precise specifications in the prompt and (2) leverage any opportunity to apply systematic checking to the resulting model output.
Prompting in natural language will evolve as a richer set of infrastructure that LLMs can interact with becomes available. Tools that extend the ability of LLMs to take actions (such as retrieval-augmented generation, search, or code execution) become abstractions that the LLM can use, but they must be expressed in the prompt so that the user's intent to leverage them is clear. Much PL research is required to define such tool abstractions, help LLMs choose them effectively, and help prompt writers express their intent effectively.
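For instance, in today's tool-calling interfaces a tool is described declaratively, and the prompt must make clear when the model should reach for it. The sketch below assumes the OpenAI chat API's tools parameter; the search_docs tool itself is hypothetical.

from openai import OpenAI

client = OpenAI()

# A hypothetical documentation-search tool, described as a JSON schema the model may call.
tools = [{
    "type": "function",
    "function": {
        "name": "search_docs",
        "description": "Search the product documentation and return relevant passages.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system",
         "content": "Answer questions about the product. Use the search_docs tool "
                    "whenever the answer is not in the conversation so far."},
        {"role": "user", "content": "How do I rotate an API key?"},
    ],
    tools=tools,
)
# The model may return a tool call rather than text; the application runs the tool
# and sends the result back in a follow-up message.
print(response.choices[0].message.tool_calls)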
Software Engineering for Prompts
If we understand that prompts are programs, then how do we transition our knowledge and tools for building POSW so that we can create robust and effective prompts? Tooling for authoring, debugging, deploying, and maintaining prompts is required, and existing tools for POSW do not directly transfer.
One major difference between prompts and traditional software is that the underlying engine that interprets prompts, the LLM, is not deterministic, so the same prompt can produce different results on different calls, even with the same LLM. Because the types and varieties of LLMs are proliferating, it is even harder to ensure that the same prompt produces the same result across different LLMs. Moreover, LLMs are evolving rapidly, and there are important tradeoffs among inference cost, output quality, and deployment (local models versus cloud-hosted models). The implication is that when the underlying model changes, the prompt may need to change as well; prompts will require continuous tweaking as models evolve.
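A lightweight way to make this nondeterminism visible, and to track it as models change, is simply to run the same prompt repeatedly and tally the answers; lowering the sampling temperature narrows the spread but does not remove the problem. The sketch below reuses the hypothetical pos_tag function from the earlier example.

from collections import Counter

def agreement(sentence: str, word: str, runs: int = 10) -> Counter:
    # The same prompt, the same model, the same input: the tally shows how often
    # the answers actually agree across repeated calls.
    return Counter(pos_tag(sentence, word) for _ in range(runs))

print(agreement("The cat ate the hat.", "hat"))
# e.g. Counter({'NN': 8, 'noun': 2}); the spread itself is a useful quality signal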
There are a number of existing research approaches to automatically optimizing and updating prompts, such as DSPy, but such technologies are still in their infancy. Also, a given AI software application may choose to use different models at different times for efficiency, so, much like binary formats that support multiple ISAs (e.g., the Apple universal binary format), prompts may require structure that supports multiple target LLMs.
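One lightweight way to structure prompts for multiple targets today (a sketch rather than an established standard; the model names and wording are illustrative) is to keep per-model variants side by side and select one at run time, much as a universal binary carries code for several ISAs.

# Illustrative per-model prompt variants kept together and selected at run time.
POS_PROMPTS = {
    "gpt-4o-mini": (
        "Return the Penn Treebank part-of-speech tag for the given word "
        "in the given sentence. Answer with the tag only."
    ),
    "local-small-model": (
        "You are a part-of-speech tagger. Given a sentence and a word from it, "
        "output exactly one Penn Treebank tag, such as NN, VBD, or JJ, and nothing else."
    ),
}

def prompt_for(model: str) -> str:
    # Fall back to the most explicit variant for models the prompt has not been tuned for.
    return POS_PROMPTS.get(model, POS_PROMPTS["local-small-model"])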
Ultimately, tools that support testing, debugging, and optimizing the prompt/model pairing will be necessary and will become widely used. Because standards for representing prompts, or even for how prompts are integrated into existing software applications, have not yet been adopted, research into the most effective approaches to these problems is needed.
Next Steps for Prompt Research
Because prompts are programs, the software engineering and programming languages communities have much to offer in improving our understanding of, and ability to create, expressive, effective, efficient, and easy-to-write prompts. There are incredible research opportunities to explore, and the impact will inform the next generation of software systems built on AISW. Moreover, because writing prompts is far more accessible to non-programmers, an entirely new set of challenges concerns how our research can help individuals who are not professional developers leverage LLMs by writing effective, expressive, robust, and reusable prompts.
In this post, we’ve argued that a single prompt should be considered a program, but, in practice, many applications that leverage AI contain multiple prompts chained together with traditional software. Multi-prompt systems introduce even greater software engineering challenges, such as how to ensure that a composition of prompts is robust and predictable. And this field is moving very fast. Agentic systems, such as AutoGen and Swarm, where AI-based agents are defined and interact with each other, are already widely available. How does our existing understanding of building robust software translate to these new scenarios? Learning what such systems are capable of, and how we can construct them robustly, is increasingly important for the research community to explore.
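As a toy illustration of such chaining (a sketch assuming a generic complete() helper that sends one prompt to some LLM), the second prompt's behavior depends entirely on whatever the first one happened to produce, which is where the robustness questions begin.

def complete(prompt: str) -> str:
    # Placeholder for a single LLM call (e.g., via the OpenAI client shown earlier).
    raise NotImplementedError

def answer_from_report(report: str, question: str) -> str:
    # Step 1: one prompt condenses a long report into a few key findings.
    summary = complete(
        "Summarize the key findings of the following report in five bullet points.\n\n" + report
    )
    # Step 2: a second prompt consumes the first prompt's output as its input, so any
    # drift or formatting change in step 1 propagates silently into step 2.
    return complete(
        f"Using only these findings:\n{summary}\n\nAnswer this question: {question}"
    )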
The challenges and effective strategies for creating robust prompts are not yet well understood and will evolve as rapidly as the underlying LLMs and systems do. The PL and SE communities must be agile and eager to bring decades of research and experience in building languages and tools for robust software development to this new and important domain.
Biographies:
Tommy Guy is a Principal Architect on the Copilot AI team at Microsoft. His research interests include AI-assisted data mining, large-scale A/B testing, and the productization of AI.
Peli de Halleux is a Principal Research Software Development Engineer at Microsoft Research in Redmond, Washington, working in the Research in Software Engineering (RiSE) group. His research interests include empowering individuals to build LLM-powered applications more efficiently.
Reshabh K Sharma is a PhD student at the University of Washington. His research lies at the intersection of programming languages and security, focusing on developing infrastructure for creating secure systems and improving existing systems using software-based mitigations to address various vulnerabilities, including those in LLM-based systems.
Ben Zorn is a Partner Researcher at Microsoft Research in Redmond, Washington, working in (and previously having managed) the Research in Software Engineering (RiSE) group. His research interests include programming language design and implementation, end-user programming, and empowering individuals with responsible uses of artificial intelligence.
Disclaimer: These posts are written by individual contributors to share their thoughts on the SIGPLAN blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGPLAN or its parent organization, ACM.