The Shortcomings Behind Exquisite Artwork
Recently, OpenAI released a new text-to-image model. There are plenty of style-transfer examples online, but this article focuses on another application scenario that may be more useful commercially: generating vivid diagrams that can go straight into slides.
Refined, sophisticated illustrations and visualizations are key to making slides stand out. However, creating an illustration from scratch has a high barrier: unless you are trained in design, have an eye for aesthetics, and know the specialized tools, it is tough to produce striking visuals. The new GPT model, though, has demonstrated very fine-grained control over text, accurately and reliably weaving given text into an image in a natural way. So if we can combine GPT's exquisite style with its precise handling of text in image generation, we might solve a huge pain point in business applications and greatly enhance slide presentations.
For example, suppose we want to draw a flowchart. If we create it using the basic PPT approach, it might look like the following figure. It has a few basic modules arranged in a simple linear relationship, connected by arrows.
While this simplest form of visualization is easier to understand than a pure bullet list, it is hard to call it truly intuitive; fundamentally, it is still a textual outline. Yet if we want to make it more lively, we will have to incorporate many design elements. For instance, beyond text, we might add a variety of icons. When the icons become plentiful, we must consider color schemes to keep them pleasant. If we aim even higher, we might want to seamlessly blend it into an exquisite visual scene with some storytelling elements, such as animated characters, to make it more narrative, friendlier, and natural. But it is clear that while these ideas are great, they are essentially out of reach for most people. So a natural thought is: can GPT help us achieve something similar and produce effects like the figure below? (Spoiler: the figure below was generated by GPT.)
However, if you actually try it, you will find that although GPT's text control has greatly improved compared to previous models, it is still not quite controllable enough for truly commercial-grade applications. For example, if we use prompts like the following:
I want to generate a chart in the style of "Shaun the Sheep." It will essentially look like a path in the order of:
Text in mind -> Muscle instructions -> Hand movement -> Keyboard -> Text in computer/phone
In order! And there are arrows connecting each element.
We can indeed get the correct text, but almost every time, there are various issues with the diagram. As shown below, maybe the arrows are wrong, maybe the order is off, or sometimes there are terrifying artifacts.
The main goal of this article is to share my experience: how to elevate GPT's flowchart and visualization capabilities to a genuinely commercial-grade level in four steps.
Four Steps for Precision and Control at a Commercial Level
1. Use Visuals to Activate AI Thinking
Of these four steps, the first (and perhaps most crucial) principle is that one picture is worth a thousand words. Keep in mind that GPT, as a multimodal LLM, doesn't just accept text prompts; it can also accept one or even multiple images to guide its image generation. Often, the reason a model's output is poor is that we haven't provided it with enough information. Just as with hallucinations in text, the same issue arises when generating images. For instance, in the prompt above, we never explicitly stated that there is no connection between "Text in mind" and "Hand movement." So it is not entirely unreasonable that the model connected them.
Sure, one could argue that if the model were smart enough, it wouldn't make such mistakes. That's true. But on the one hand, the model will always have its boundaries; there will always be scenarios where it is not smart enough. On the other hand, we actually have a simple way to eliminate such misunderstandings: adopt a "visual-first" principle. Don't rely solely on textual descriptions of your intentions to GPT; make the image itself the core component of your prompt. This process can be very simple. For example, you can ask a tool like Cursor to render an HTML representation of the five elements mentioned above (the box-and-arrow visualization), then take a screenshot of that page and include it as part of your prompt to GPT. It can then easily generate the correct diagram.
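If it helps to see what that scaffold can look like, here is a minimal sketch in Python; the script, file name, and styling are my own illustrative choices, not part of any prescribed workflow. It simply writes the five boxes and arrows to a plain HTML file you can open and screenshot.

```python
# A throwaway page with the five boxes and arrows; screenshot it and attach
# the screenshot to your image-generation prompt.
import webbrowser
from pathlib import Path

steps = [
    "Text in mind",
    "Muscle instructions",
    "Hand movement",
    "Keyboard",
    "Text in computer/phone",
]

box = (
    '<span style="display:inline-block;border:2px solid #333;border-radius:6px;'
    'padding:10px 16px;margin:4px;font-family:sans-serif;">{}</span>'
)
arrow = '<span style="font-size:24px;margin:0 6px;">&#8594;</span>'

html = (
    "<html><body style='padding:40px;white-space:nowrap'>"
    + arrow.join(box.format(s) for s in steps)
    + "</body></html>"
)

out = Path("flow.html")
out.write_text(html, encoding="utf-8")
webbrowser.open(out.resolve().as_uri())  # open in the default browser, then screenshot
```

The specific markup does not matter; any quick render that makes the layout and the connections unambiguous will do.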
The key here is that an image carries much more information than text. For example, the fact that "Text in mind" and "Hand movement" are not connected might be ambiguous in a text description, but in a screenshot it is crystal clear: there is no arrow between those elements. Using images as prompts essentially saves the model's cognitive effort by clearing up many possible misunderstandings in a way that is direct and intuitive for the model. And this clarification doesn't even require much effort on our part; you just let a text-based AI or a more "clever" AI sketch the diagram for you.
From another perspective, that is exactly why we humans do visualization in the first place: we process images more intuitively and with less mental load than text. The same observation appears to hold for AI. So if we communicate our creative intentions to AI through images (even simple ones), the output often improves significantly.
After providing both images and text as our mixed prompts, we might get a result like this:
2. Use Masking for Iterative Generation
In the previous example, we can see there are still some minor flaws in the generated image; for instance, the word "Keyboard" in the bottom-right corner is missing a letter. If we regenerate with the same prompt, we are effectively starting over: the previous work is wasted and there is no guarantee the detail gets fixed. In ComfyUI and MidJourney, partial fixes can be done via local inpainting. ChatGPT also provides this feature, but it is somewhat hidden. At least for now, neither the desktop client nor the web version exposes it. You have to open the generated image in the mobile app, where there is a "Select" option in the bottom menu. Tap it to paint a mask over a specific area, then continue the prompt with something like "fix the spelling of 'Keyboard'", and it will modify only that part while keeping everything else intact.
This is a huge leap in controllability. However, there is a minor caveat: unlike Stable Diffusion or MidJourney, which enforce the mask through a strict computational process, GPT's masking still seems to be achieved by the LLM itself. Therefore, you might notice subtle changes outside the masked area, but they are usually minor and don't affect the main image.
This greatly impacts our workflow. Now we donât have to expect the model to get everything right in one go. We can proceed step by step, refining iteratively: first confirm the framework, then modify specific parts, so we can methodically hone the details.
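Incidentally, if you would rather script this mask-and-fix step than tap through the mobile app, the OpenAI Images API offers a comparable edit-with-mask call. The sketch below is an assumption-laden illustration: it presumes API access to the image model (written here as gpt-image-1), the official openai Python SDK, and made-up file names.

```python
# Sketch: regenerate only the masked region of an existing image via the Images API.
# The mask is a PNG the same size as the image; fully transparent pixels mark the
# area that may be redrawn, everything else should be preserved.
from base64 import b64decode
from pathlib import Path
from openai import OpenAI

client = OpenAI()

result = client.images.edit(
    model="gpt-image-1",                       # assumed model name
    image=open("flowchart_v1.png", "rb"),      # made-up file names
    mask=open("keyboard_label_mask.png", "rb"),
    prompt='Fix the spelling of the word "Keyboard" on the bottom-right box; '
           "keep every other element exactly as it is.",
)

# In this setup the API returns base64-encoded image data.
Path("flowchart_v2.png").write_bytes(b64decode(result.data[0].b64_json))
```

As with the in-app feature, treat the masked edit as one step in an iterative loop rather than a guaranteed pixel-perfect patch.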
3. Let AI Be Your Creative Partner
The third tip is to leverage GPT's text-to-text capabilities for brainstorming. This is somewhat similar to the first tip; both aim to minimize ambiguity in prompts. The first tip uses images, while this one uses more detailed text. Often, when we type prompts by hand, because we are lazy or can't predict the AI's quirks, our instructions contain ambiguities. But if you consult GPT first, it can often fill these gaps or offer better suggestions.
For instance, suppose I want to visualize the "Dunning-Kruger curve." Most demonstrations online, as well as my early attempts, simply draw the curve on a blackboard with some animated characters nearby to make it less dull.
Sure, that is better than just placing an XY axis in PPT, but it doesn't fundamentally alter the curve itself; it merely presents it in a friendlier format. A more advanced approach would be to blend visual elements into the structure of the curve itself, so that it is no longer a cold coordinate plane but is tightly integrated with characters that tell a story, making the deeper significance of the curve easier to grasp. However, on the one hand, creating such a concept requires a high level of creativity, and on the other, it entails a lot of typing. So I asked GPT how it might design such a visualization, and it provided a very detailed response. I then fed it my own original character "Ducko," along with some background information, and it generated a highly engaging illustrated image. I believe this image would be another step up from the blackboard example in a slide presentation.
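For completeness, the same brainstorming step can also be run through the API instead of the chat window. This is only a rough sketch; the model name, system instructions, and wording are illustrative assumptions, not the prompts used here.

```python
# Rough sketch of the brainstorming step via the API: turn a terse visualization
# idea into a detailed, unambiguous image prompt before any image is generated.
from openai import OpenAI

client = OpenAI()

idea = (
    "I want to visualize the Dunning-Kruger curve in a slide. Instead of drawing it "
    "on a blackboard, weave my original character 'Ducko' into the curve itself so "
    "the shape tells a story."
)

response = client.chat.completions.create(
    model="gpt-4o",  # assumed model name; use whichever text model you have access to
    messages=[
        {
            "role": "system",
            "content": (
                "You are an illustration art director. Expand rough visualization ideas "
                "into a detailed image prompt: layout, characters, exact text to render, "
                "and style notes. Call out anything that could be ambiguous."
            ),
        },
        {"role": "user", "content": idea},
    ],
)

print(response.choices[0].message.content)  # refine this before generating the image
```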
4. Use Document Management to Build an AI Asset Library
This leads naturally to the final tip: document management. We've mentioned this concept several times before when discussing programming. If you want to do AI Native Development, be it text-based software development or visual content creation, you need to keep systematic, reusable documentation. Only then can you continue to deepen and build on it.
In our case, we can leverage document management to ensure a high degree of consistency in visual elements and style. For instance, for reasons of copyright or corporate branding, we may not be able to use something like "Shaun the Sheep" and instead need an original character and art style. In that case, we can first brainstorm with GPT (as in the previous step) about the effect we want to achieve, and then have it generate the character design and visual references we need. Here's a sample image:
With that in place, we can apply the first and second tips: provide both the text prompt and the previously generated character designs and sample images to GPT, so it can create new images, without ambiguity, based on these two elements. Note that both the character design and the visual references can be reused and continually improved. Whenever GPT makes a mistake, we can adjust the character-design documentation so that it won't make the same mistake next time. This entire process is what we call document management. With this principle, we can easily generate a unified visual style and incorporate it into our presentations. On the one hand, this makes our message more attractive and easier to understand; on the other, it helps build personal or corporate branding by forming an AI asset library.
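To make the asset-library idea slightly more tangible, here is a hypothetical sketch of how the reusable documents might be stitched into each new prompt; the file names and folder layout are invented for illustration.

```python
# Hypothetical asset library: the character design and style guide live in plain
# files, and every new image prompt is assembled from them so the wording (and
# therefore the style) stays consistent across generations.
from pathlib import Path

ASSETS = Path("ai_assets")  # invented folder name

def build_prompt(scene_request: str) -> str:
    character_sheet = (ASSETS / "ducko_character_design.md").read_text(encoding="utf-8")
    style_guide = (ASSETS / "style_guide.md").read_text(encoding="utf-8")
    return (
        "Follow the character design and style guide below exactly.\n\n"
        f"--- CHARACTER DESIGN ---\n{character_sheet}\n\n"
        f"--- STYLE GUIDE ---\n{style_guide}\n\n"
        f"--- SCENE TO GENERATE ---\n{scene_request}\n\n"
        "Also match the attached character reference images."
    )

if __name__ == "__main__":
    print(build_prompt(
        "A flowchart in the established style: Text in mind -> Muscle instructions -> "
        "Hand movement -> Keyboard -> Text in computer/phone, connected by arrows."
    ))
```

Because the character sheet and style guide live in files, any correction made after a bad generation benefits every future prompt rather than just the next one.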
Why Do Your Prompt Tips Always Become Obsolete?
After the release of GPT's text-to-image model, there has been a sentiment online: "The experience I gained from Stable Diffusion and ComfyUI no longer applies, and many related startups are doomed." But looking at our discussion above, there is no need to be so pessimistic.
If, in using SD or other open-source models, all you did was memorize mechanical fixes, like adding a negative tag when a character has three legs, adding a reference image, or applying local inpainting for incorrect text, then yes, those solutions were merely temporary patches for specific shortcomings of older models. When a new model emerges and fixes some of the old problems, those patches indeed lose their value.
However, looking deeper, the underlying principles behind these fixes still transfer to new models. For instance, we used both partial inpainting and reference images above to solve GPT-related issues. In fact, it was precisely because I had experience with SD that I could quickly pick up the higher-level GPT techniques. The critical point is: after you learn the solutions for SD, do you ask why? Why do some models generate six fingers? Why does local masking increase controllability? Why does providing a reference image reduce ambiguity? How does the model parse textual information and map it onto the canvas? The answers to these questions usually point to more universal mechanisms: ambiguous user descriptions, spatial layout details, the model's handling of text placement, conceptual semantics, and so on. Insights of this kind don't suddenly become irrelevant when a new model arrives; they carry over to the next, more advanced model. In other words, the so-called "tips" that only target an older model's specific quirks are bound to become outdated, because they are like special screwdrivers made for one particular screw. But if you understand why the screw turns the way it does and how the threads are cut, then switching to a new kind of screwdriver is easy to handle.
Hence the real key is this: rather than positioning yourself as a mere User, passively waiting for model upgrades and learning new prompts, you should act as a Builder, continually asking why and accumulating reusable experience and documentation each time a problem is solved. Once you develop this habit of proactive thinking and iteration, you won't be dragged along by the model's rapid evolution, perpetually playing catch-up. Instead, you can apply your existing deeper insights to explore new features, taking the initiative to master or even predict future upgrades. That is the true meaning of being Future-Proof: not betting everything on any specific tool, but possessing enough understanding and methodology to handle whatever technological updates and iterations come next.