I've always been a staunch supporter of AI. When friends ask what AI can do, I give them very specific answers: it can help plan schedules, research information, and even handle many trivial but time-consuming tasks in life, like negotiating a discount over email. I provide prompt templates tailored to their scenarios to lower the barrier to entry. Yet after recommending AI to hundreds of people over the past year or two, I've observed that most of them come away lukewarm. The most common feedback is: "It's okay, I don't find it that useful." Sometimes they even add, "I'd rather do it myself."
This contrast hasn't dampened my enthusiasm for promoting AI, but it has indeed puzzled me: why is it that as models become increasingly powerful, ordinary users still find it difficult to perceive this improvement? Besides the notion that "AI is a tool that requires learning," are there any deeper reasons? Recently, I've systematically compared the user experience of several mainstream AI clients (Claude, Gemini, and ChatGPT), and I've slowly realized that this disconnect in the perception of AI's usefulness stems from a problem in the product design that connects users and models.
The capabilities of AI models are indeed evolving at an astonishing pace. As we'll discuss below, they started with multi-turn conversation, then added multi-modal abilities; now the most advanced models can drive a whole range of tools and work autonomously for hours. Most current apps, however, are still stuck in the multi-turn chat paradigm, leaving a huge gap between what the model can do and what the app lets it do. When AI's intelligence is delivered through an unsuitable interactive medium, users get frustrated. For example, the Claude app is designed for short conversations; if it's moved to the background, task execution is interrupted. In that case, even if Claude 4 is incredibly powerful and can work on tasks for hours, it's useless. It's like putting an F1 engine into a Santana: is it powerful? Yes. Is it pleasant to use? It drives pretty much like a regular Santana.
Unfortunately, the details of how these apps are used constitute the entirety of the user's perception. The vast majority of users don't know it's an app problem; they just think AI isn't good. This creates a systemic disconnect in AI products, which is the topic this article aims to discuss in detail.
From Multi-Turn, to Multi-Modal, to Multi-Hour Agency
In the last two years, AI model capabilities have undergone three leaps. First, they learned to remember context and engage in multi-turn conversations. Then, they could see images, hear sounds, and analyze videos. Now, the latest models can even operate autonomously for several hours, completing complex tasks, automatically invoking tools, and providing periodic summaries and feedback.
These three leaps—from "being able to speak" to "being able to see and hear," and then to "being able to do"—have gradually transformed AI from a Q&A tool into an intelligent assistant. OpenAI is heading in this direction, as are Google and Anthropic. But the problem is, most of the AI apps we use today are still stuck in the interactive logic of two years ago. It's like a Santana whose engine has been gradually upgraded to an F1 engine, but the brakes and suspension remain unchanged. This is the fundamental reason why many people don't feel how powerful AI has become: the models are evolving, but the apps haven't kept up.
Multi-Turn: The Dawn of Chatbots
Multi-turn conversation is the most basic capability of all mainstream models today.
A significant reason for ChatGPT's success is that it isn't a search box that answers one query at a time, the way Google's search does, but a system capable of sustained dialogue around a task. A key technology behind this is Supervised Fine-Tuning (SFT), which uses human-annotated multi-turn conversation data to teach the model to carry context across turns and answer within it.
Claude also performs well in this regard. It's adept at summarizing and referencing context, such as helping you read papers, summarize long documents, or iteratively refine an article.
At this stage, creating an app was a simple matter: basically, just maintain chat history and wrap an API with a shell. The experience was similar across different providers. The only thing to pay attention to was support for large context windows. For instance, the Gemini 2.5 series models can support a 1M token context window, which is crucial for many applications. However, its web and client versions freeze when users input a few thousand tokens (using less than 1% of the model's capacity), making it almost unusable. This is a classic example of an app failing to keep pace.
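To be concrete, the "wrap an API with a shell" era really was this simple. Below is a minimal sketch of such a chat loop using the OpenAI Python SDK; the model name and system prompt are placeholders:

```python
# Minimal chatbot shell: keep the history, replay it on every call.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
history = [{"role": "system", "content": "You are a helpful assistant."}]

while True:
    user_input = input("> ")
    history.append({"role": "user", "content": user_input})
    # "Multi-turn" on the app side is nothing deeper than resending
    # the entire accumulated history with every request.
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any chat model works here
        messages=history,
    )
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    print(reply)
```

Everything beyond this loop (rendering, storage, sync) was interchangeable plumbing, which is why early clients from different providers felt so alike.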
Designing products as chatbots was fine in 2023 when people were just starting to use AI for chatting. But now, models are no longer just chatbots; they are copilot systems capable of handling structured tasks. If apps remain stuck in the old mindset, they will significantly waste the model's potential.
Multi-Modal: From Speaking to Seeing and Hearing
The second leap was multi-modality.
Today, mainstream models all claim to support multi-modality, but there are significant differences. Gemini 2.5 is currently the most thorough in this aspect. It can natively view images, listen to audio, and understand video. And it's not just simple viewing; it can truly reason, combine, analyze, and summarize. The underlying technology involves using different tokenizers combined with projection layers to map information from different modalities (images, sound, text) into a shared representation space, allowing the model to process actions in videos and tones in speech as if they were text.
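Conceptually, the projection step can be sketched in a few lines. This is illustrative PyTorch, not any vendor's actual architecture, and all dimensions are invented:

```python
import torch
import torch.nn as nn

IMAGE_EMBED_DIM = 1024  # toy size for a vision encoder's output
TEXT_EMBED_DIM = 4096   # toy size for the LLM's token embeddings

# The projection layer maps vision-encoder features into the same
# representation space as text token embeddings, so the transformer
# can attend over image "tokens" exactly as it does over text.
image_projection = nn.Linear(IMAGE_EMBED_DIM, TEXT_EMBED_DIM)

image_features = torch.randn(1, 256, IMAGE_EMBED_DIM)  # 256 image patches
image_tokens = image_projection(image_features)        # (1, 256, 4096)

text_tokens = torch.randn(1, 32, TEXT_EMBED_DIM)       # 32 text tokens
# One shared sequence: image and text now live in the same space.
model_input = torch.cat([image_tokens, text_tokens], dim=1)
print(model_input.shape)  # torch.Size([1, 288, 4096])
```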
OpenAI's approach is similar, but it lacks a unified model that can both perform reasoning (like o3) and process video, audio, and images (like gpt-4o-realtime). Its highlight is allowing images to be objects of tool invocation. For example, o3 can write Python code to crop, enlarge, and identify details in images, then pass the processed results back to the model for further tokenization and analysis. This method has significantly enhanced its multi-modal capabilities, even enabling scenarios like "guessing the location from a picture," which only o3 can achieve.
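To illustrate, the tool-side code in such a call can be utterly mundane. Here is a hypothetical example of the kind of script the model might generate; the file name and crop coordinates are made up:

```python
# Crop a region of interest and upscale it so fine details survive
# re-tokenization when the result is passed back to the model.
from PIL import Image

img = Image.open("photo.jpg")
# (left, upper, right, lower) bounding box of the suspected detail
detail = img.crop((800, 600, 1200, 900))
detail = detail.resize((detail.width * 4, detail.height * 4), Image.LANCZOS)
detail.save("detail_zoomed.jpg")  # fed back into the model's context
```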
Claude's multi-modal support is currently quite basic, limited to image recognition and unable to process audio or video.
However, from an experience perspective, the most advanced Gemini offers the worst experience. Its web and client versions don't support uploading videos or audio, only images. This is a typical example of a model living in 2026 while the product is still in 2023. If the product doesn't adapt to the model's competitive advantages, it's naturally difficult to differentiate the user experience.
Multi-Hour Agency: AI Truly Becomes an Assistant
The third change is that AI models have begun to acquire the ability to run continuously and complete tasks autonomously. We can call this stage Multi-Hour Agency, meaning AI can maintain context, dispatch tools, and continuously complete a task that takes tens of minutes or even hours, without requiring you to intervene each time.
This is actually a prerequisite for AI to become truly usable. Many important tasks, such as researching news in a specific field, planning a complete trip, analyzing a database, or generating well-structured code, are beyond the scope of a Q&A robot. They essentially require a system that can think, call tools to supplement information, execute step-by-step autonomously, and even dynamically adjust plans.
Anthropic claims Claude 4 can run continuously for seven hours to complete a particularly complex task, and o3 can call many tools and execute very complex tasks in stages. These capabilities rest on continuous optimization of mechanisms like RLHF (Reinforcement Learning from Human Feedback), function calling, external tool access, and long context. The model itself is ready to take over complex processes, but the apps are not. For example, no matter how powerful the Claude model is, its iOS and Mac apps disconnect as soon as the screen turns off, and the chat history is lost.
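For a sense of what the app side has to do, here is a minimal function-calling sketch in the OpenAI chat-completions format; the get_weather tool is a made-up example, not a real API:

```python
from openai import OpenAI

client = OpenAI()
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": "Do I need an umbrella in Tokyo?"}],
    tools=tools,
)
# If the model chooses to call the tool, the app must execute it and
# send the result back for another round of reasoning.
print(response.choices[0].message.tool_calls)
```

Multi-hour agency is this execute-and-resume loop repeated hundreds of times, which is exactly why an app that drops the stream when the screen locks breaks the whole chain.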
From multi-turn conversations to multi-modal understanding, and then to long-duration task execution, the model's capabilities have been layered on top of each other. Meanwhile, app capabilities have remained almost stagnant. The model is no longer a simple question-answering robot but a digital assistant that can complete tasks collaboratively with you. However, at the product design level, clients still treat it merely as a search engine with lower latency and a more natural tone. So the issue isn't whether AI is smart enough, but whether we have built a product structure capable of accommodating that intelligence. Most of the time, users aren't evaluating the model itself, but the shell in which the model is packaged. And that shell is something many companies (including large ones) haven't bothered to create.
Comparing OpenAI, Claude, and Gemini Product Platforms
Ultimately, AI model capabilities have become highly convergent: large models + tool systems + long context + multi-modal encoding. What truly differentiates the vendors isn't model capability but how their products connect those capabilities to users' actual scenarios. Over the past few months, I've continuously used Claude, ChatGPT, and Gemini: not just their APIs but their GUIs/apps, and not just the web versions but the iOS and desktop apps too. My overall impression is that all three companies emphasize how powerful their models are, but the consumer-grade products (except OpenAI's) sit somewhere between a semi-finished product and an experiment. In this chapter, we'll look at the pros and cons of the three companies' clients from a user's perspective.
Claude: Model Solid, App a Semi-Finished Product
Claude 3.7/4 series models are inherently strong, especially in areas like long-text reading, code writing, and not slacking off. They often outperform o3 in stability, earning numerous accolades on Cursor and becoming the go-to model for many. However, the consumer-grade product experience of Claude.ai is truly hard to describe.
Claude's client has a critical flaw: if you switch apps, the inference process breaks. The task doesn't pause and reconnect; the in-progress response is simply lost. The app doesn't notify you of the interruption; the task status just goes blank, and the conversation shows up in history as "Untitled." This happens whether you lock the screen on iOS or close the laptop lid on a Mac. Fundamentally, the issue is that Claude's consumer product hasn't moved beyond the chatbot paradigm and treats the app as a mere API wrapper. Its architecture leans heavily on the client side, placing the burden of maintaining the stream and preserving session state entirely on the user's device. That's fine for short Q&A, but it completely fails when running complex tasks. The iOS implementation is also rudimentary: long model outputs make the phone overheat. So no matter how powerful the model is, users will simply say: "It's not good."
The only real differentiator here is that Claude's desktop app is currently the only mainstream client integrated with the Model Context Protocol (MCP). It can use MCP to connect local resources to a consumer-grade AI product, on a subscription model rather than token-based billing, which is quite practical.
Gemini: Strong Model, App Experience Like a Demo
Google's Gemini presents an even more extreme case: the model's capabilities are absurdly strong, while the app is absurdly bad.
AI Studio is Google's debugging suite for developers. Within this tool, Gemini is the model I've seen with the largest context window and the most robust mixed analysis of video, audio, images, and text. Uploading a million-word document is effortless, and a 10-minute paper-summarization run doesn't disconnect. Give it 100 repetitive, tedious tasks and it completes them diligently without cutting corners. Its multi-modal capabilities, tool use, and especially its instruction following are top-tier in the industry; I personally believe it sits well ahead of the second tier, which includes Claude and GPT.
The problem is, all of this can only be experienced within the web version of AI Studio. This is, after all, a tool for developers. You have to constantly keep the webpage in the foreground; locking your phone screen causes a disconnect. The system prompt automatically clears with each round, offering no personalization. Chat history saving and sharing rely entirely on Google Drive and are also very basic.
For consumer users, Google primarily promotes the Gemini App. But this app... is a truly bizarre product. It feels like the product department intentionally created it to frustrate the AI department. Your Gemini 2.5 model has a 1M token context window? Fine, I'll make the UI freeze when users input around 10k tokens in their prompt, bringing you down to the same level as other AIs. Your Gemini 2.5 is exceptionally good at processing video and audio, a feature other AIs lack? Fine, I won't allow users to upload video and audio files in the UI, making its functionality identical to other AI products. It wasn't until mid-2025 that users were allowed to set system prompts for Gemini 2.5 (by the way, the web version still has bugs, and the mobile version hasn't been released yet). Even when I finally find a scenario where I can use the Gemini App, I discover that the intelligence it demonstrates is significantly inferior to that in AI Studio. It shows a greater aversion to using search to broaden answers and a stronger tendency to make things up, and who knows what negative optimizations were made in the system prompt.
So many people, myself included, have the same initial reaction to the Gemini App: "Is this it?" But they've likely experienced only a fraction of the model's true capabilities. You have to research prompts yourself and dig into AI Studio to even begin to unearth its underlying potential, which is impossible for 99% of users.
ChatGPT: The Most Mature Product Team
In contrast, OpenAI's product experience trounces the other two. This is counterintuitive: when we think of GPT, our first thought is of the most established LLM with industry-leading model capability, so we subconsciously assume OpenAI leads through its model and perhaps hasn't had time to refine the product. In reality, OpenAI's top position in models is becoming precarious. o3's tool use is still top-notch, but its instruction following lags behind the other two, and there are considerable gaps in context window length, multi-modal capability (audio and video understanding), and price. The product experience is the reverse: ChatGPT leads the other two by several lengths, and it may currently be the only product that lets users access 70-80% of the underlying model's capabilities.
Let's look at a few specific scenarios:
- Asynchronous Task Execution: An important AI use case is when you're on the go, using your phone, and suddenly remember you need AI to do some research. So, you type something like "research XXX" into the app. Then you minimize the app and lock the screen (or simulate this by killing the app). ChatGPT will continue the research in the background. When you unlock the screen and reopen the app, you'll find the research completed, with the latest results displayed. In this scenario, Claude will fail 100% of the time. You might find the chat, but the title will be "Untitled," and the content will be empty. The Gemini app will most likely fail, with the entire chat disappearing. However, there's a small chance the chat conversation might reappear an hour later with the correct content. This highlights a difference in product design philosophy. Only OpenAI positions ChatGPT as a tool that can assist users with long-running tasks in the background. Although Claude emphasizes this in its API, its consumer product doesn't seem to believe in it. Gemini's approach is similar.
- iPhone Photo Analysis: If a user enables RAW capture on an iPhone, the resulting image is a DNG file, not JPEG or HEIC. Intentional or accidental, this is a very common scenario, and the difference is hard to spot in the iPhone photo album. Upload such an image directly and Gemini reports a "connection to server disconnected" error (what on earth?), while Claude reports that the file type is unsupported. Not perfect, but at least Claude's error message is accurate. ChatGPT, however, knows to convert the file to JPG first and then upload it. This processing is very simple and cheap to engineer (a conversion sketch follows this list); whether it gets implemented depends entirely on product strength: whether the team actually uses its own app, hits the common pitfalls, and refines the details.
- Massive Text Input: Select a large amount of text (e.g., 150,000 words) and paste it into the AI app. Gemini will freeze immediately after you hit send. If you're patient enough to wait a minute or two, it might recover. If you're not patient and put the phone app in the background, the entire chat will disappear, as in the previous test. Both Claude and ChatGPT will report that the text is too long and refuse to process it, even though it's within the context window limits of the underlying LLM. This might be to reserve space for "thinking tokens."
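On the DNG case above: the fix really is cheap. Here is one way a client could convert before uploading, a sketch that assumes the rawpy library for RAW decoding; the paths are placeholders, and HEIC would need its own decoder:

```python
# Convert an iPhone RAW capture to JPEG before upload, so the
# backend only ever sees a format every model accepts.
import rawpy
from PIL import Image

with rawpy.imread("IMG_0001.DNG") as raw:
    rgb = raw.postprocess()  # demosaic into an RGB numpy array

Image.fromarray(rgb).save("IMG_0001.jpg", quality=90)
```

A dozen lines, yet only one of the three apps bothers.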
There are many other details, such as whether you can set system prompts on the mobile client, whether deep research provides live activity progress updates, the depth of personalization, and so on, which we won't analyze one by one.
However, OpenAI is not without its issues. For example, there are still discrepancies between web and app functionalities. Features like deep research based on GitHub and SharePoint are only supported on the web. Additionally, as of now, there is no MCP support. But overall, OpenAI is currently the only company that gives equal importance to product design and model capability. The experience has no major flaws.
Is It Just That the Product Is Still Iterating?
Of course, I can understand why some products are developed more cautiously. Some might argue that the Gemini App lacks video analysis and the Claude App doesn't provide notifications for task interruptions because they are still in the MVP (Minimum Viable Product) stage. The strategy might be to launch the model first and let users try it out while the product is still being completed. This explanation sounds plausible at first, but the problem is, if the MVP phase lasts for over a year, core features are consistently delayed, and even basic functionalities like system prompts, uninterrupted tasks, and correct error reporting for file uploads are not properly implemented, then it's no longer an MVP; it's a sign that the product isn't being taken seriously. Users can distinguish between strategic restraint and resource-driven neglect.
Another argument is that most people don't use complex features anyway, so adding too many would clutter the product, and keeping it simple is the right call. This underestimates what AI products are for. The true value of AI lies not in replacing a search engine or a knowledge Q&A tool, but in helping users handle tasks they can't manage themselves or don't have time for: long documents, cross-modal materials, complex planning. If a product can't even accommodate these tasks, it's destined to be seen by users as unremarkable, or even a gimmick.
In short, regardless of whether a task is simple or complex, users don't want their input to be wasted, nor do they want the app to silently fail. This isn't a high-level feature issue; it's a fundamental reliability problem. And many apps today can't even get this right.
Reasons and Opportunities
Looking back, the ability of AI models themselves to support various life scenarios today is no longer the issue. The problem is that similar models, when integrated into different companies, departments, and even budget processes, end up being presented to users in vastly different ways. This is why the same Gemini model exhibits astonishing video understanding and instruction-following capabilities in AI Studio but performs poorly in the app. This isn't a technical problem; it's an organizational one.
We are likely dealing with two products created by two different organizations, each reporting to different VPs [source]. In such a structure, the Gemini App's product manager might be completely unaware of the model's greatest strengths. After some research, they might find that ChatGPT and Claude both support image uploads but not video, leading to the conclusion: "Then we don't need it either." Little do they know that video understanding is precisely Gemini's biggest advantage.
What's even more peculiar is that AI Studio actually performs better. Why? Because it's designed for developers, often built by engineers themselves, making it closer to the model. You could call it a product, but it's more like a debugging tool. This "design by no design" approach, ironically, unleashes the model's capabilities more effectively than the app version, which has a product manager but lacks resource support.
Claude's problem is a different kind of structural issue. It's fundamentally a B2B-oriented company. APIs are its core business, accounting for 85% of its revenue [source]. The B2C client is merely a "we need to have one too because others do" feature-parity showcase. Consequently, the Claude App feels very unpolished: as long as it runs, it's fine. Users aren't notified of disconnections, tasks aren't saved if they crash, and long outputs on iOS cause the phone to overheat. No one truly cares if users use it for actual work, as long as it allows them to test the model and see that it's good.
Conversely, OpenAI is the only company that must firmly establish itself in both B2C and B2B. ChatGPT is its flagship product, contributing 73% of its revenue [source]. More importantly, it's a comparatively small company with a short reporting chain, and its product and model teams are tightly integrated. It's hard to imagine an OpenAI product manager being unaware that their model can recognize videos. OpenAI can integrate such capabilities effectively simply because its organizational structure allows it.
So, returning to the main theme of this article—why is it that AI models are getting stronger, yet users don't feel they're getting better? One of the most sobering answers might be: it's not that the products themselves are difficult to make, but that they are constrained by company structures. However, this also means that opportunities still exist.
Currently, major model vendors are competing on who has the largest, most multi-modal, and lowest-cost model, but there's hardly any competition in terms of product experience. This is due to structural obstacles and blind spots in their roadmaps. They assume that if the model is strong, the product experience will naturally improve; as long as the capability is high, users will stay. This assumption has already been somewhat disproven by the experience gap between ChatGPT and the Gemini App.
Not all teams can integrate a capability well, nor will all capabilities automatically lead to a good experience. This is a structural misconception in the industry that hasn't been sufficiently discussed, and it paradoxically provides a very realistic entry point for third-party teams.
- If we know Claude 4's model is very stable, but its app is prone to crashing, could we build a more stable asynchronous task app using its API?
- If we know Gemini 2.5 dominates video analysis, but its app doesn't even support video uploads, could we use AI Studio's sample code to wrap a lightweight client and target a vertical market? (A sketch of such a client follows this list.)
- If we know all current apps are still stuck in the chatbox paradigm, could we directly move beyond the dialogue format and design a new front-end structure based on multi-hour task orchestration?
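As one example of the second path, a lightweight video-analysis client is almost trivial to sketch. This assumes the google-genai SDK; exact parameter and field names may differ by version, and the file path and model name are placeholders:

```python
# Minimal "video in, answer out" client on the Gemini API, roughly
# the shape of AI Studio's own sample code.
import time
from google import genai

client = genai.Client()  # reads the API key from the environment

video = client.files.upload(file="lecture.mp4")  # placeholder path
while video.state.name == "PROCESSING":          # wait for server-side ingest
    time.sleep(5)
    video = client.files.get(name=video.name)

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[video, "Summarize this video and list its key claims."],
)
print(response.text)
```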
These are innovation paths that can be pursued without building models from scratch. Moreover, they are not just potentially promising products; they represent existing user needs that no one has seriously addressed yet.
So, let's return to the beginning of the article: AI isn't bad to use; it's just that the AI most people encounter is packaged in the wrong shape. The model is smart, but the app hasn't kept up. This experiential gap is not a technological disparity but the result of a long-standing disconnect between product design and organizational decision-making.
We have now entered an era where models are not scarce, but experiences are. The next watershed moment for AI products may lie in whether you can identify the opportunities hidden within these disconnects.