Technical Insights on Using AI for Speed Reading Videos

In the previous article, we explored why informational videos often have low information density, yet many content creators continue to produce them. Inspired by our discussions, I developed a small tool that allows you to perform speech recognition on videos without leaving the YouTube or Bilibili app, enabling quick reading of video content.

After using this tool, I found that even longer videos no longer seemed daunting. I even dared to revisit the videos that had been sitting idle in my favorites folder. Essentially, for videos lasting 30 to 60 minutes, you can spend a minute or two identifying which parts are worth a deep dive and which parts can be skipped. This not only boosts efficiency but also lowers the barrier to watching informational videos. This product is now available for free on my website.

I haven’t turned it into an AI summary tool yet. In my experience, AI-generated summaries are still quite poor and often miss key insights, so a human still needs to read the original script. However, you can manually copy the recognized script and paste it into GPT for further Q&A. Additionally, I created a shortcut specifically for iOS and Mac, making it even more convenient to integrate with GPT and other third-party apps. This article mainly shares the technical insights from building all of this.

  1. How to Download YouTube Videos: Here I use a Python library called yt_dlp, which can download videos, audio, subtitles, and other metadata directly from Bilibili and YouTube. A recent challenge, however, is that YouTube has started restricting such libraries and even its official client, requiring users to log in to download videos. A workaround suggested online is to pass logged-in browser cookies to the tool to simulate a login. In my experience, though, while this works for one or two videos, the cookie expires after a few hours or several downloads, and Google again prompts for a login. The ultimate solution is to log in via OAuth2: after logging in once, you won’t encounter similar warnings on later downloads.
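As a rough sketch of the download step: the option names below (`format`, `outtmpl`, `cookiefile`) are real yt_dlp options, but the helper functions themselves are illustrative, not the tool's actual code.

```python
def build_ydl_opts(out_dir, cookie_file=None):
    """yt_dlp options: fetch only the best audio-only stream."""
    opts = {
        "format": "bestaudio/best",  # audio is all we need for transcription
        "outtmpl": f"{out_dir}/%(id)s.%(ext)s",
        "quiet": True,
    }
    if cookie_file:
        # Exported browser cookies; in practice these expire after a few
        # hours or several downloads, which is why OAuth2 login works better.
        opts["cookiefile"] = cookie_file
    return opts

def download_audio(url, out_dir="."):
    """Download the audio track and return the local file path."""
    import yt_dlp  # pip install yt-dlp

    with yt_dlp.YoutubeDL(build_ydl_opts(out_dir)) as ydl:
        info = ydl.extract_info(url, download=True)
        return ydl.prepare_filename(info)
```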

  2. Gray Market: Due to Google's recent restrictions mentioned above, many YouTube download sites no longer work. An alternative is browser extensions, which can access the site's cookies and simulate a login to parse the video URL directly. However, this also poses a security risk: rumor has it that gray-market operators are paying $10 per user to buy ownership of such extensions.

  3. Functionality and Performance Requirements: The tool first calls OpenAI’s Whisper API for speech recognition on the audio, then calls the GPT API to add punctuation and paragraph breaks to the recognized text to improve readability. The program’s main job is therefore to expose a fast API and orchestrate the various OpenAI APIs in the background, so the server itself doesn’t need high performance.
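The pipeline is essentially two API calls. Here is a minimal sketch using the official `openai` Python client; the chat model name and the exact prompt wording are my assumptions, since the article doesn't specify them.

```python
PUNCTUATE_PROMPT = (
    "Add punctuation and paragraph breaks to the following transcript. "
    "Do not change the wording."
)

def punctuation_messages(transcript):
    """Chat messages for the GPT reformatting step."""
    return [
        {"role": "system", "content": PUNCTUATE_PROMPT},
        {"role": "user", "content": transcript},
    ]

def transcribe_and_format(audio_path):
    """Whisper transcription followed by GPT punctuation/segmentation."""
    from openai import OpenAI  # pip install openai

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=f
        ).text
    reply = client.chat.completions.create(
        model="gpt-4-turbo",  # assumed model; the article doesn't name one
        messages=punctuation_messages(transcript),
    )
    return reply.choices[0].message.content
```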

  4. Enhancing User Experience Through an Asynchronous Model: We mostly need such a tool for longer videos, since short ones can simply be skimmed at high playback speed. But for long videos, both speech recognition and GPT reformatting take considerable time, which makes a traditional blocking web API particularly bad for user experience: the user has to wait idly in front of a webpage or client. For a 20-minute video, the whole process might take about 5 minutes, with no progress feedback along the way, so users can’t tell whether the program has stalled or the network connection has dropped.

    One possible solution is to design the backend API as task-based, providing several different APIs. The client can submit a new task and get a task ID in return, which can then be used to query the task's progress. When the progress reaches 100%, another API can be used to get the speech recognition and GPT reformatted results. This allows for frequent status updates and enables asynchronous processing on the client side, significantly improving the user experience.
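A minimal in-memory version of that task lifecycle might look like the following; the real service wraps these operations in HTTP endpoints, and the field names here are illustrative.

```python
import uuid

class TaskStore:
    """Tracks submitted jobs so clients can poll progress by task ID."""

    def __init__(self):
        self._tasks = {}

    def submit(self, video_url):
        """Register a new job and return its ID (the 'submit task' API)."""
        task_id = uuid.uuid4().hex
        self._tasks[task_id] = {"url": video_url, "progress": 0, "result": None}
        return task_id

    def update(self, task_id, progress, result=None):
        """Called by the background worker as the job advances."""
        task = self._tasks[task_id]
        task["progress"] = progress
        if result is not None:
            task["result"] = result

    def progress(self, task_id):
        """The 'query progress' API: 0-100."""
        return self._tasks[task_id]["progress"]

    def result(self, task_id):
        """The 'fetch result' API; only available once progress hits 100."""
        task = self._tasks[task_id]
        return task["result"] if task["progress"] >= 100 else None
```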

  5. Concurrent Web Services with Python: Since my backend is implemented in Python, it comes with some language-specific challenges. As of now, Python's interpreter still has a global interpreter lock (GIL): threads exist, but only one of them can execute Python bytecode at a time, so true parallelism requires multiprocessing. This is awkward for web services, because a single synchronous process can only handle one request at a time.

    There are two ways around this. One is asynchronous programming: while we wait for a response from OpenAI's server, control yields back to the event loop, which can handle another request in the meantime. The other is to use a process manager like Gunicorn to launch multiple worker processes. Both achieve the goal, but each has its own drawbacks.

    The first method suits I/O-intensive tasks, i.e., those that spend most of their time waiting on disk or network operations. But our pipeline includes a step that transcodes the downloaded audio files, and while it runs the CPU is fully occupied, blocking the event loop and preventing any concurrency. The second method's problem is that processes don't share memory, so we cannot manage tasks through shared variables. For tasks with very short lifespans, possibly only a few minutes, keeping state in in-process memory would be ideal; but to let different Python processes share that task data, we must use a heavier solution such as a database or a key-value store backend, like MongoDB or Redis.
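One common pattern for keeping a single async process responsive despite a CPU-bound step (a sketch of the general technique, not necessarily what this service does) is to push the blocking call off the event loop:

```python
import asyncio
import time

def transcode(path):
    """Stand-in for the CPU-bound ffmpeg step; it blocks its thread."""
    time.sleep(0.1)  # simulate heavy work
    return path + ".m4a"

async def handle_request(path):
    # Off-load the blocking call to a worker thread so the event loop
    # can keep answering other requests (e.g. progress polls) meanwhile.
    return await asyncio.to_thread(transcode, path)

async def main():
    # Two "requests" served concurrently by one process.
    return await asyncio.gather(handle_request("a"), handle_request("b"))
```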

    As a result, our final API architecture is: asynchronous programming combined with Gunicorn, with data consistency ensured through Redis, plus a layer of caching. As a side note, the reason we need to transcode the audio at all is that OpenAI's API has a 25 MB file size limit. For longer audio, such as recordings lasting tens of minutes, we must convert to a lower bitrate to keep the file under 25 MB.
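The transcoding target follows directly from the size cap: pick the highest audio bitrate that still fits in 25 MB for the given duration. A sketch, where the 5% headroom is my own fudge factor rather than anything from the article:

```python
def target_bitrate_kbps(duration_s, limit_mb=25, headroom=0.95):
    """Highest audio bitrate (kbit/s) that keeps the file under the API cap."""
    limit_bits = limit_mb * 1024 * 1024 * 8 * headroom
    return int(limit_bits / duration_s / 1000)

def ffmpeg_cmd(src, dst, duration_s):
    """ffmpeg invocation: drop the video track, re-encode audio at the
    computed bitrate (-vn = no video, -b:a = audio bitrate)."""
    kbps = target_bitrate_kbps(duration_s)
    return ["ffmpeg", "-i", src, "-vn", "-b:a", f"{kbps}k", dst]
```

For example, a one-hour recording comes out to roughly 55 kbit/s, which is still adequate for speech recognition.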

  6. Experiments with GPTs: After building the web version, I also tried extending this service to GPTs. The appeal was that GPTs bring their own traffic, and the GPT calls would be covered by each user's own subscription instead of being billed to my API account.

    However, after some time I decided to abandon the GPTs direction. The main reason is that while GPTs are excellent for distributing prompts, especially ones you or your team use frequently, they fall short when you try to build in more specialized capabilities.

    One significant issue is how GPTs interact with third-party services, and two problems in particular were critical for me. First, because the interaction is conversational, a GPT cannot poll our task-status API every few seconds the way the web version does. It behaves more like a turn-based game: you poke it to ask whether the task is done, it checks once and reports back, and control returns to you; all you can do is poke it again and again until it finally hands over the results.

    This is fundamentally incompatible with our task-based API architecture. To fit the GPTs interaction style, I had to collapse the API into a single blocking endpoint that runs recognition and transcription to completion before returning. For shorter videos this integrates well with GPTs and produces good results. But when testing longer videos, the call constantly times out. The timeout isn't on our web server's side: GPTs strictly limit how long an interaction with a third-party API may take, and once it exceeds roughly 30 seconds to a minute, the GPT declares the API down and reports an error.

    These two limitations directly crippled the envisioned product. After struggling for a while, I had no choice but to abandon the GPTs direction, since long videos are exactly where our product holds the most value.

  7. Lazy GPT: After speech recognition, the text goes through a GPT post-processing phase, and developing it was quite interesting. The most serious issue was that the model tends to be lazy. Especially when the recognized text is long, GPT often works diligently on the first and second paragraphs, but by the third it starts heavily condensing the output; it skips the fourth and fifth entirely and only briefly processes the last two. This laziness is not unique to GPT; it is common in many open-source models as well.

    Claude 3 performs best in this regard: neither the cheapest Haiku nor the most expensive Opus exhibits the laziness issue. However, it has another significant problem:

    The problem is the context-window output limitation. Although modern LLMs advertise long context windows, such as 200K for Claude 3 and 128K for GPT-4, both quietly cap the output length at a maximum of 4096 tokens. This means that even if your input is very long, say a video transcription of 15,000 tokens, you cannot process it all in one pass; the API truncates the output at 4096 tokens.

    To keep each output within the 4096-token limit, we must divide the input into chunks of three to four thousand tokens, feed the chunks to the API one by one, and then stitch the results back together somehow. This is quite ironic: in the era of 4K and 8K context windows we constantly resorted to this tedious method, and we thought 128K and 200K windows would finally let us abandon the ugly, inefficient approach. Yet because of the output limit, we are still forced into the same split-and-merge hack, which is painful and often causes consistency issues, such as the first part coming back in Simplified Chinese and the second in Traditional Chinese; the merged result looks quite odd.
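A sketch of the split step, using a crude characters-per-token estimate in place of a real tokenizer (the defaults here are illustrative, not the tool's actual parameters) and splitting on sentence-ending punctuation so chunks don't break mid-sentence:

```python
import re

def split_transcript(text, max_tokens=3500, chars_per_token=3):
    """Greedily pack sentences into chunks small enough that the
    reformatted output stays under the 4096-token cap."""
    limit = max_tokens * chars_per_token
    # Split after sentence-ending punctuation, keeping the delimiter
    # attached to its sentence (works for Chinese and English).
    sentences = re.split(r"(?<=[。！？.!?])", text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > limit:
            chunks.append(current)
            current = ""
        current += sentence  # an over-long single sentence still becomes one chunk
    if current:
        chunks.append(current)
    return chunks
```

Each chunk is then sent through the GPT reformatting prompt separately, and the outputs are concatenated, which is exactly where the Simplified/Traditional inconsistencies creep in.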

  8. Shortcuts: Beyond the web version and GPTs, we also explored other product forms, one of which is Apple's Shortcuts. Shortcuts is an automation tool on iOS that lets you call various app functions and distribute your programs directly through iCloud, so I essentially re-implemented the web version's functionality as a shortcut. Interested users can give it a try. It is a very user-friendly programming environment, though if possible, try it on an iPad, where the experience is better than on an iPhone. Its capabilities are quite strong, allowing you to reproduce almost all the functionality of our client, including task management and polling.

    Additionally, it has two particularly useful features. First, it can integrate with other iPhone components. For example, it can act as a sharing option. When watching a video on Bilibili and wanting to perform speech recognition on it, you can directly click share, then our Shortcut, to start the speech recognition. You don't need to copy the URL to our web page, paste it, and press Enter. The entire process is very smooth, and you don't even need to leave the Bilibili app. While it's recognizing, you can still use your phone normally. There will be a progress bar displayed on the Dynamic Island, and the whole process is carried out in the background, making it very seamless.

    The second advantage is that the results can be used in various ways. For instance, my Shortcut not only displays the results but also copies them to the clipboard. This makes it easy to paste into other apps like Notes or Notion. Additionally, you can interact with other apps directly. You can edit the Shortcut to add a custom prompt and send it to ChatGPT for Q&A, or insert the result into Notion by creating a new page, and so on.

In summary, these are the technical insights I gained while working on this project. I've rambled on quite a bit, but I wanted to share my development process and how I juggled coding, playing around, and learning throughout. I'm also curious about your thoughts on this tool. My web version is available at this URL, and you can download the Shortcut here. Both are free tools, and I'm looking forward to your feedback.