Effective Prompt Engineering

I've always believed that the emergence of large language models (LLMs) such as GPT has significantly lowered the barriers to entry in many fields. Recently, on my strong recommendation, many people around me started experimenting with LLMs for more complex tasks. However, after discussing their experiences, it became clear that many of them were running into various difficulties, often leading to poor results. Analyzing their use cases, I realized a key issue: even though interacting with LLMs only requires natural-language dialogue, it's not yet so straightforward that one can use it effectively without any learning. In particular, crafting the right questions still requires some experience, or what is known as prompt engineering. This post shares some insights into prompt engineering.

The most critical principle of prompt engineering is to match the tasks you assign to the LLM with its capabilities. More concretely, if your question is overly complex and exceeds the AI's abilities, you're likely to be disappointed with the outcome. AI's proficiency varies across domains, and if you happen to hit one of its weak spots, you might end up with incorrect or useless results. The essence of prompt engineering, then, lies in playing to the LLM's strengths and, in areas where it's less capable, guiding it to solve problems with auxiliary tools rather than its own knowledge. Moreover, breaking a large problem down into several smaller ones helps the AI provide more effective answers.

It's important to note that the prompt engineering we're discussing differs from what's often seen in the media. The latter usually attempts to endow AI with some form of personality, such as pretending it's an editor for The New York Times. However, from my observation, this approach isn't necessarily better than using straightforward prompts. Often, simply asking the AI to "edit the text in the style of The New York Times" works just as well without assigning it a fictional persona. In this article, we're discussing higher-level strategies rather than basic templates like crafting personalities.

To illustrate our point, let's explore how AI can currently assist in our daily lives and analyze how to apply prompt engineering across various domains.

One of the most common uses of AI is treating it like an encyclopedia or a smarter search engine, asking it knowledge-based or factual questions. Unfortunately, this isn't AI's strongest suit. For common general knowledge or information that appears frequently in its training data, like the number of days in a year, AI performs quite well. However, when it comes to more specialized knowledge, AI sometimes gives incorrect answers. This would be less problematic if it could simply respond with "I don't know." Yet current AI/LLMs have a fundamental technical flaw: they cannot assign a reliable confidence level to their answers. When unsure, they may deliver a response with unwarranted confidence, which is worse than a straightforward "I don't know." To navigate this, staying vigilant is key, and the following tips might help:

  1. Use experience to judge which knowledge is general and trustworthy. For instance, questions about Python's standard library, such as "I want to output the current time in the format of 2024-01-23 03:22:33, how should I write the program?", fall under general knowledge, since this information appears all over Stack Overflow and the Python documentation (see the sketch after this list). On the other hand, specialized knowledge, like the precise celestial coordinates of the NGC 6888 nebula, might be beyond AI's accurate recall.
  2. For areas you're unfamiliar with, employ trial and error to verify the reliability of AI's answers. Initially, it's crucial to carefully check its responses for accuracy and fabrication. After several rounds of verification, you'll gradually get a feel for which domains are specialized and whether the provided information is credible. For example, I initially thought Python asynchronous programming was a fairly niche subdomain that ChatGPT might not understand well. However, after a few tests, I found its answers to be correct, which helped me adjust and gradually build trust.
  3. AI's capabilities aren't limited to its own knowledge. Models like GPT-4 or New Bing can also search the internet and summarize the results. So if we know a certain domain might be challenging for LLMs, we can prompt them to rely on search results instead of their internal knowledge. This typically increases the credibility of the answers: not only do we get the information itself, we also get links to each source, making verification easier. For instance, when comparing car models such as the Mini Countryman and the Mazda CX-30, asking New Bing to "list a table comparing the sizes of the Mini Countryman and Mazda CX-30" will produce a table and cite its sources, allowing further verification on the specific web pages.

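As a quick illustration of the first point, the datetime question above sits squarely in general-knowledge territory, and an LLM will almost always produce something like this standard-library answer:

```python
from datetime import datetime

# Format the current time as e.g. 2024-01-23 03:22:33,
# using only the Python standard library.
now = datetime.now()
print(now.strftime("%Y-%m-%d %H:%M:%S"))
```
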
When it comes to programming, the situation is quite similar to the knowledge questions discussed above. When programming against Python's standard library, we typically don't need any additional prompting for the AI to generate quality code. However, when we dive into niche libraries, like using the ASCOM library from Python, the AI may generate code that looks plausible but is entirely wrong, calling functions that don't actually exist. A practical fix is to supply the relevant background knowledge yourself, making the task manageable for the AI: for example, paste ASCOM's documentation or sample programs into the prompt. GPT-4 Turbo, with its 128,000-token context window, is usually adept at providing solid answers even for documents spanning ten to twenty pages, so don't hesitate to throw an entire library's sample programs or documentation at it.
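
Here is a minimal sketch of this "supply the background" approach using the OpenAI Python client; the file name `ascom_docs.md` is a placeholder for whatever reference material you actually have, and the model string is just one reasonable choice:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Load the library documentation we want the model to rely on.
# "ascom_docs.md" is a placeholder for your own reference material.
with open("ascom_docs.md", encoding="utf-8") as f:
    docs = f.read()

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {"role": "system",
         "content": "Answer using only the documentation provided below.\n\n" + docs},
        {"role": "user",
         "content": "Write a Python function that connects to an ASCOM telescope "
                    "and slews it to a given RA/Dec."},
    ],
)
print(response.choices[0].message.content)
```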

Unlike one-off queries, programming conversations tend to run long. The generated code itself is lengthy, and after a few exchanges a significant portion of the context window is consumed, which can degrade answer quality: the AI starts forgetting details mentioned earlier. To counter this, maintain a "memory bank" to refresh its memory, for example by pasting the latest version of the code back into the conversation, however long it is, so earlier details aren't lost.
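
To get a feel for how quickly code-heavy conversations consume the window, you can count tokens locally. A sketch using the tiktoken library (cl100k_base is the encoding used by GPT-4-family models; the messages are placeholders):

```python
import tiktoken

# cl100k_base is the tokenizer used by GPT-4-family models.
enc = tiktoken.get_encoding("cl100k_base")

conversation = [
    "Here is my latest version of the script: ...",
    "Now add error handling to the download function.",
]
total_tokens = sum(len(enc.encode(message)) for message in conversation)
print(f"~{total_tokens} tokens used so far")
```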

Another handy tip for managing the context window is ChatGPT's editing feature. Say you had it write a code snippet, asked several follow-up questions for clarity, and now want to continue the coding discussion. If the next piece of code is unrelated to those follow-ups, editing the earlier prompts to remove or tighten them prevents cluttering the context window and confusing the AI, keeping the dialogue focused on a single theme.

Another common application scenario is writing work: drafting articles and emails, and summarizing texts. AI/LLMs excel in this domain, likely because they were trained extensively on such tasks, and usually no special tricks are needed. However, avoid overly vague requests; spelling out your needs significantly improves the results. For instance, "translate the following text into English" yields a straightforward, literal translation, but "read this Chinese blog, understand its style and tone, and translate it into an English blog with a similar style" leads to a more nuanced and faithful rendering.

A pitfall to watch out for is the AI's tendency to cut corners on lengthy inputs. For example, it might start strong when translating a long article, then begin skipping examples and jumping straight to the main arguments, eventually dropping large sections before returning to normal near the end, much like a smart person slacking off. For tasks requiring extensive input and output, manage the context window by breaking the text into smaller sections and handling them one at a time; this keeps the AI diligent throughout, as sketched below.
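
A rough sketch of that chunking strategy. The split here is naive (on blank lines); a real version might split on headings or token counts, and `article.txt` is a placeholder file name:

```python
from openai import OpenAI

client = OpenAI()

def translate_chunk(text: str) -> str:
    """Translate one manageable chunk so the model stays diligent."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system",
             "content": "Translate the user's Chinese text into English, "
                        "faithfully and completely. Do not summarize or skip."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

with open("article.txt", encoding="utf-8") as f:
    article = f.read()

# Naive split on blank lines; each paragraph is translated separately.
chunks = [p for p in article.split("\n\n") if p.strip()]
translation = "\n\n".join(translate_chunk(c) for c in chunks)
print(translation)
```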

In the realm of mathematics, AI doesn't exactly shine. This limitation stems from the very nature of LLMs like GPT, which are trained to predict the next piece of text from the given context; they don't inherently perform actual mathematical computation. Faced with "3688×2688," an LLM understands that the expected output is a number, but nothing in its architecture guarantees the calculation is correct. In areas like this, where the AI's own proficiency is known to be lacking, external tools such as a calculator or Python can significantly boost its capabilities. A prompt like "write a Python program to calculate 3688×2688" guides the AI to draft a program, which can then be executed in OpenAI's code sandbox to yield the correct result. Similarly, for queries requiring precise calculation, such as the angular distance between the Andromeda Galaxy and the Triangulum Galaxy on the celestial sphere, relying solely on the AI's generative capabilities invites inaccuracy, whereas prompting it to write a Python script for the calculation can produce remarkably accurate answers.
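
Both examples are easy to reproduce yourself. The multiplication is a one-liner, and the angular-separation question can be handed to the astropy library; the J2000 coordinates below are approximate values quoted from memory, so treat them as illustrative:

```python
from astropy.coordinates import SkyCoord

# The multiplication itself: trivial for Python, unreliable for an LLM.
print(3688 * 2688)  # 9913344

# Angular separation between M31 (Andromeda) and M33 (Triangulum)
# on the celestial sphere, using approximate J2000 coordinates.
m31 = SkyCoord(ra="00h42m44s", dec="+41d16m09s", frame="icrs")
m33 = SkyCoord(ra="01h33m51s", dec="+30d39m37s", frame="icrs")
print(m31.separation(m33).deg)  # roughly 14.8 degrees
```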

Another domain where AI falls short is work requiring deep insight or thorough understanding. This may be due to alignment issues or limitations in the training data; the responses, while technically accurate, often read as circular and not particularly insightful or useful. There is currently no good strategy for overcoming this, mainly because comprehension, unlike computation, is hard to offload to an external tool.

One of AI's most significant current limitations is that its output is essentially text, with few exceptions such as generating images with DALL-E. AI therefore excels when the desired output can be expressed textually, such as code, and struggles when it can't, as with UX designs or 3D models. Yet this isn't an insurmountable problem: you can build specialized agents to extend LLMs' capabilities into these areas. For instance, I recently developed an agent with interfaces that let an LLM control robotic systems. Operations like slewing a telescope or setting a camera's exposure time aren't inherently text, but by wrapping them in an agent, I enabled the AI to drive the hardware through Python code, expanding its reach. This advanced application does require some development experience, though; a simplified sketch follows.
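
A heavily simplified sketch of the idea using OpenAI's function-calling interface. The telescope functions here (`slew_to`, `set_exposure`) are hypothetical stand-ins for whatever your hardware driver actually exposes; a real agent would also loop the results back to the model:

```python
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical hardware wrappers; real ones would talk to a driver such as ASCOM.
def slew_to(ra_hours: float, dec_degrees: float) -> str:
    return f"Telescope slewing to RA {ra_hours}h, Dec {dec_degrees} deg"

def set_exposure(seconds: float) -> str:
    return f"Camera exposure set to {seconds}s"

tools = [
    {"type": "function", "function": {
        "name": "slew_to",
        "description": "Point the telescope at the given equatorial coordinates.",
        "parameters": {"type": "object",
                       "properties": {"ra_hours": {"type": "number"},
                                      "dec_degrees": {"type": "number"}},
                       "required": ["ra_hours", "dec_degrees"]}}},
    {"type": "function", "function": {
        "name": "set_exposure",
        "description": "Set the camera exposure time in seconds.",
        "parameters": {"type": "object",
                       "properties": {"seconds": {"type": "number"}},
                       "required": ["seconds"]}}},
]

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user",
               "content": "Point the telescope at M31 and take a 120-second exposure."}],
    tools=tools,
)

# Dispatch whatever tool calls the model requested to the local functions.
for call in response.choices[0].message.tool_calls or []:
    args = json.loads(call.function.arguments)
    result = {"slew_to": slew_to, "set_exposure": set_exposure}[call.function.name](**args)
    print(result)
```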

The essence of prompt engineering lies in breaking a problem down into segments the AI can handle adeptly, focusing on its strengths, or using agents to broaden its capabilities. A helpful analogy is to treat the AI as an intern: motivated, willing to work extra hours, with a solid basic education, but lacking deep insight and specialized domain knowledge. Crafting prompts with this intuition usually yields satisfactory results. Over time, as we gather experience, we develop a sharper sense of which questions and domains suit the AI, and we can organize prompts so that it generates answers meeting our requirements directly.
