Our GPT-based voice input tool has gone through two iterations since its initial development. The original concept was a simple, fast input tool that let us produce large amounts of text by voice in a short time, at roughly 300 Chinese characters per minute. The previous iteration shifted the application scenario towards knowledge management, because we discovered that the tool is useful not only for input but also for processing and outputting knowledge. To support this change in direction, we moved the user interface from a web app to a Telegram bot. This brought many benefits: sessions and users come built in, each user has an individual chat history, and data is easy to record. We also found, unexpectedly, that these records could serve as excellent training data for fine-tuning existing LLMs, with the aim of achieving functionality similar to GPT-4.
However, after communicating with more users, we found that the application was still quite niche. Very few users truly liked it, and most people were not interested in such an application. This means that our thinking about this tool still has some limitations, as we have not truly considered the needs of the entire IT-related community. In order to maximize the impact of the GPT approach, my class representative and I had more discussions and interviewed some early users, conducting a more thorough analysis of the entire issue.
We believe that, at a more abstract level, the potential of this tool goes far beyond knowledge management. In knowledge management, it mainly does two things. First, it acts as a point: it turns one idea into another, whether through voice recognition, GPT-4 rewriting, or a subtle change of tone. Second, it acts as a line: it connects different points into a sequence to complete more complex operations. For example, fast input is a line with two points: voice recognition first, then a GPT call for logical paraphrasing. This pattern applies to many other areas. Some users may only need punctuation added and typos corrected in the voice recognition output, while others may need GPT-4 paraphrasing and structured summarization, with the text output in Markdown. Some users may even need a network structure, in which points connect in more than a single sequence, so that GPT can complete highly complex tasks.
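To make the point-and-line idea concrete, here is a minimal sketch in Python. It assumes a point can be modeled as a function from text to text and a line as the composition of such functions. The names `line`, `recognize_speech`, and `paraphrase` are hypothetical, and the two point bodies are stand-ins rather than real calls to a speech recognizer or to GPT.

```python
from functools import reduce
from typing import Callable

# Assumption: a point turns one piece of text into another.
Point = Callable[[str], str]


def line(*points: Point) -> Point:
    """Connect points into a line: each point's output feeds the next point."""
    def run(text: str) -> str:
        return reduce(lambda acc, point: point(acc), points, text)
    return run


# Hypothetical stand-ins; a real system would call a speech recognizer and GPT.
def recognize_speech(audio_ref: str) -> str:
    return "um so the tool turns voice into text"


def paraphrase(text: str) -> str:
    return text.replace("um so ", "").capitalize() + "."


# Fast input as a line with two points: recognition, then paraphrasing.
fast_input = line(recognize_speech, paraphrase)
print(fast_input("memo.ogg"))  # prints: The tool turns voice into text.
```

A user who only wants punctuation and typo fixes would build a different line from the same kind of points, which is the flexibility the pattern is meant to offer.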
Therefore, our value may lie not only in providing voice recognition and GPT paraphrasing as two specific points, but in providing a platform that helps users build systems composed of points and lines. This concept is the product of long discussion and a high level of abstraction, so it may be difficult to grasp. From a more practical perspective, GPT is often not capable enough to complete a real-world task end-to-end, so we have to guide it step by step. For example, when writing an article, we may first give GPT an outline made of keywords and let it expand the outline into a short paragraph, then edit that paragraph and dictate it back to GPT for further polishing, and finally change the text style to make it easier to share. The core difficulties lie in two aspects. On one hand, we need to understand the model's capabilities and limitations and decompose the complex task accordingly. On the other hand, we need to do the engineering work that the decomposition implies, such as writing a prompt for each step and using the result of the first step as the input to the second.
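The engineering half of this, writing a prompt per step and passing each result to the next step, can be sketched as follows. This is an illustration under stated assumptions, not our implementation: the prompt wording in `STEPS` is invented for the article-writing example, `run_chain` is a hypothetical helper, and `fake_llm` is a stand-in for a real model call. The manual editing step between expansion and polishing is left out for brevity.

```python
from typing import Callable

# Assumption: a model call is modeled as a function from prompt to completion.
LLM = Callable[[str], str]

# Hypothetical prompt templates, one per step of the article-writing example.
STEPS = [
    "Expand this keyword outline into a short paragraph:\n{input}",
    "Polish the following edited draft:\n{input}",
    "Rewrite this text in a style that is easy to share:\n{input}",
]


def run_chain(llm: LLM, steps: list[str], first_input: str) -> list[str]:
    """Run each prompt in order, feeding the previous result into the next."""
    results = []
    current = first_input
    for template in steps:
        current = llm(template.format(input=current))
        results.append(current)
    return results


def fake_llm(prompt: str) -> str:
    """Stand-in model: tags the input with the first word of the instruction."""
    instruction, _, body = prompt.partition(":\n")
    return f"[{instruction.split()[0]}] {body}"


outputs = run_chain(fake_llm, STEPS, "voice, GPT")
print(outputs[-1])  # prints: [Rewrite] [Polish] [Expand] voice, GPT
```

Keeping every intermediate result, rather than only the last one, makes it easier to see which step of a decomposition is producing poor output.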
This tool can support both of these processes, the latter being more of an engineering matter. We can expand the existing tool into an execution engine based on points and lines: users simply tell us what they want to do, and we handle data transformation, transport, storage, and serialization for them. As for the model's boundaries and how to decompose tasks, if more people use the tool, we can draw on collective intelligence to help newcomers to GPT. They could learn from the templates of experienced users, quickly build a system, and focus their effort on the parts that require domain-specific knowledge, rather than on building a system that moves GPT results around, optimizing prompts, or decomposing tasks. Of course, at this point, task decomposition remains challenging.
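One way to picture such an engine, and the sharing of templates between users, is to store a line as plain data and look up its points by name at run time. The sketch below is only an illustration under that assumption. The JSON layout, the `POINTS` registry, and `run_template` are all hypothetical, and the point bodies are simple placeholders rather than model calls.

```python
import json

# Registry of available points; the bodies are placeholders, not model calls.
POINTS = {
    "add_punctuation": lambda text: text.strip() + ".",
    "to_markdown_bullets": lambda text: "\n".join(
        f"- {item.strip()}" for item in text.split(",")
    ),
}

# A line stored as data, so it can be saved, shared, and reused as a template.
TEMPLATE_JSON = json.dumps(
    {"name": "notes", "points": ["to_markdown_bullets", "add_punctuation"]}
)


def run_template(template_json: str, text: str) -> str:
    """Deserialize a template and run its points in the listed order."""
    template = json.loads(template_json)
    for point_name in template["points"]:
        text = POINTS[point_name](text)
    return text


print(run_template(TEMPLATE_JSON, "buy milk, call Bob"))
```

Because the template is just JSON, a newcomer could import an experienced user's template and change only the parts that need domain knowledge.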
After determining the value of the tool, we need an appropriate UI. The main purpose of this UI is to let users freely combine points and lines. We considered several possible designs, such as a conversational UI built on a Telegram bot or a drag-and-drop graphical interface similar to children's programming tools. After weighing engineering cost against how intuitive the experience would be, we decided to use Apple's Shortcuts app. It is a no-code programming tool built into Macs, iPhones, and iPads that offers the basic programming constructs, such as loops, branches, and variables, behind a very intuitive interface. Users don't need to write any code; they simply drag and drop basic modules and connect them to create a program that runs on a mobile device. This meets our needs well and lets us build on Apple's extensive work in this field instead of reinventing the wheel. We therefore encapsulated the core functionality in a Web API that is called from a shortcut, so users can connect the points we provide into lines and build highly flexible applications. Beyond flexibility and broad applicability, this approach integrates deeply with mobile devices and can interact with various apps on the phone, for example by writing results to Notes or sending messages via WeChat. A shortcut can also be invoked in many ways, such as through Siri or by double-tapping the back of the phone. This makes our product highly flexible, very lightweight, and easy to integrate into users' workflows.
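As a rough idea of what exposing a point through a Web API can look like, here is a minimal sketch using only the Python standard library. It is not our actual service: the `/paraphrase` path, the JSON request shape, and the `paraphrase` placeholder are assumptions made for illustration. In Shortcuts, an endpoint of this kind would typically be reached with the "Get Contents of URL" action set to send a POST request with a JSON body.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def paraphrase(text: str) -> str:
    """Placeholder for the real point (speech recognition or a GPT call)."""
    return " ".join(text.split())


def handle(path: str, body: bytes) -> tuple[int, dict]:
    """Pure request logic, kept separate from the server so it is easy to test."""
    if path != "/paraphrase":
        return 404, {"error": "unknown point"}
    try:
        text = json.loads(body)["text"]
    except (ValueError, KeyError, TypeError):
        text = None
    if not isinstance(text, str):
        return 400, {"error": "expected a JSON object with a string field named text"}
    return 200, {"text": paraphrase(text)}


class PointHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle(self.path, self.rfile.read(length))
        data = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port: int = 8000) -> None:
    """Start the server; call this manually, it blocks until interrupted."""
    HTTPServer(("", port), PointHandler).serve_forever()
```

Each point gets its own path in this sketch, so a shortcut can chain several requests and pass the `text` field of one response into the next request.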
We have now created a sample Shortcut that you can view or add to your own Shortcuts via this link: https://gpty.ai/example. We welcome everyone to try it out and provide feedback.