Computing Life

纽约路窄车位小为什么大家反而开大SUV——十年后重读纽约

2024-10-06T22:30:00-07:00

我以前在纽约住了五年。但因为是学生，基本上是11x7搞学术，所以一直没有精力和阅历对这个城市进行太多的思考和感悟。最近毕业快十年，因为家人来访，终于以游客的身份重游纽约，有机会重新解读了这座城市，包括纽约的人，对此有很多新的体会。

其中最令我诧异的一点是，一进纽约，很容易发现街上大SUV的比例比其他城市高很多，比如GMC Yukon XL，Lincoln Navigator，Cadillac Escalade等等。这些车子甚至有一个专门的名字叫Towncar。这实在是太诡异了，按理来说纽约的车道窄，停车难，不是应该像欧洲一样，大家都开小车吗？为什么反而街上跑的都是大车呢？

在做了更多的观察之后，我发现这些车有一个共同特点，他们的车牌号都是T打头，C结尾。在调研之后，我发现这个是TLC（Taxi and Limo Commission）监管的车辆。换言之，这些都是商业运营车辆，而不是本地人自驾的车辆。想想也是，本地New Yorker出行都坐地铁，省时省心。

坐这些车的人一般是有一些额外的需求，比如我们这次旅游就包了一辆车，从JFK直达宾馆。这是因为虽然上学的时候一个人坐地铁很方便，但一家几口人拖着箱子上下电梯实在是太折腾了，不如包个车。时间上虽然没有节省，但是人要省心多了。纽约市内开车停车虽然难，但反正是司机烦心，我不用管。

这个东西仔细想想还是很有意思的。包车其实也不便宜。地铁只要2块9的东西，包车一下把它上升到175刀。我拎着箱子走几步路，省了160刀，它不香吗？毕竟包车这种服务提供的价值非常特殊，它在把人从A送到B方面相比于地铁没有任何优势，但如果想要在省心、私密、装逼方面有任何要求，这基本上是唯一的选择。纽约街头的大SUV乌泱乌泱（目测有一半），说明对这三点比“从A到B”更高端的需求非常大，拥有一个巨大的头部市场，这是我这次发现的纽约的一个非常非典型的东西。

再深入观察下去，这个头部市场不仅限于交通，它同时也在宾馆和餐厅方面有所体现。不说阿曼这种顶级品牌，就是Marriott集团内部几千上万美元一晚的美景套房甚至客房并不罕见。餐厅上，不说米其林餐厅或者是需要正装的老牌餐厅，哪怕是Tao这样的美式中餐馆，一顿饭每人消费几百上千美元也很正常。

那它的支撑到底来自哪里呢？我觉得主要来自三个地方：人多、富人比例高和旅游心态。

一进纽约，最明显的体会就是人多，people mountain people sea，跟国内没什么区别。人车互不相让，喇叭声震天。从另一个角度来看，这一下也就让我懂了为什么旗舰店都开在纽约。第五大道的人流，每天平均有50-100万。从互联网的类比来看，对于每一家店来说，这相当于每天1M次的impression。哪怕是1%的CTR，也有每天几千上万次的click through。这个数字即使在互联网领域也是非常惊人的，更何况转化率更高的实体店。从这个角度看，一下就理解了纽约确实是商业之都，有无穷的机会。

纽约的富人比例也是惊人地高。根据统计，纽约市的居民里，平均每24个人就有一个百万（美元）富翁。再加上人多这个因素的支撑，纽约市一共有34.5万的百万富翁。而且近年来还在继续增长。纽约居民的财富总和甚至比G20除了中美以外的所有国家加起来还要多。

这些人的消费习惯和普通人是很不一样的。比如，普通人可能从来没想过包车甚至包机，但这些富人虽然平时可能也坐地铁，但当有需求的时候也会把包车作为一个选项，这是出现大规模头部市场的重要因素。

而旅游心态则主要指的是一种体验的心态。俗话说穷家富路，在旅游的时候人的心态会很不一样，对于平时不会触碰的选项，接受度往往会更高。比如，平时消费的舒适区域在三五百块钱，对于这种级别的消费会觉得不用太纠结，但旅游的时候会更愿意咬咬牙接受一千块钱一晚的套间。

正是这三个因素造成了纽约的消费层级这么高，也支撑了第五大道上那么多的旗舰店，支撑了路上的大SUV。纽约的市场确实和其他地方大不相同，存在着独特的细分却庞大的客户群。这么看，其他只有在纽约才流行的产品也可以理解，比如Blade这种在JFK和Downtown之间直升机通勤的方式，或者类似迪士尼的VIP插队票。因为这些服务提供了时间和装逼属性，而这些属性是只有有钱人才会注重和承受的。

感觉就像成年以后重读《三国》和《西游》，看到的东西和小时候很不相同一样。十年后重游纽约也看到了很多不同的侧面。这一方面是因为毕业了，从学生的视角转变为消费者的视角，比如对车的型号和类别有了一定的认识。另一方面也是因为教AI课，所以注册了一家公司。在合伙人课代表的帮助和教导下，接受到了很多商业和市场方面的教育。从学生到business owner，这是一个非常重要的角色转变，提供了非常多的不同的视角。我很赞同另一个影响我的influencer陈然的观点：成为一个business owner是一个蜕变，只有真正想着并且试着把自己的技能拿来卖钱，才可以在认知上再上一个高度。如果有类似想法的同学，不妨就从注册一个公司开始尝试。在注册以后，自然就会被迫去做很多税务、licensing、市场方面的事情。就算暂时赚不到钱，得到的教育也是非常值得的。

Why Are There So Many Large SUVs in NYC? Revisiting NYC a Decade Later

2024-10-06T22:29:00-07:00

I previously lived in New York City for five years, but as a student deeply immersed in academics with an 11x7 schedule, I never had the time or experience to ponder and appreciate the city. Recently, almost ten years after graduating, my family came to visit, and I finally had the opportunity to revisit New York City as a tourist. This allowed me to reinterpret the city, including its residents, and I gained many new insights.

One of the most surprising observations was the high proportion of large SUVs on the streets compared to other cities. For example, GMC Yukon XLs, Lincoln Navigators, and Cadillac Escalades were everywhere. These vehicles even have a specific name called "Towncar." It's truly bizarre—given that NYC's lanes are narrow and parking is difficult, shouldn't most people be driving smaller cars, much like in Europe? So why are there predominantly large vehicles on the streets?

Upon further observation, I noticed a common characteristic among these vehicles: their license plates start with a "T" and end with a "C." After some research, I discovered that these are regulated by the Taxi and Limousine Commission (TLC). In other words, these are commercial operation vehicles, not personal cars driven by locals. It makes sense—New Yorkers primarily use the subway, saving time and hassle.

People who use these cars typically have additional needs. For instance, during our trip, we hired a vehicle to go directly from JFK to our hotel. While it was convenient for a student to navigate the subway alone, it was too much of a hassle for a family with luggage to drag suitcases up and down elevators. Hiring a car didn't save time, but it was much more mentally convenient. Driving and parking in New York City is difficult, but since the driver handles the parking hassles, I didn't have to worry about it.

This phenomenon is quite interesting upon closer thought. Hiring a car isn't cheap—while a subway ride costs about $2.90, hiring a car can jump to $175. Carrying luggage a few steps to the hotel directly saves around $160. Isn't that a good deal? After all, the value provided by car services is quite unique. They don't offer any advantages over the subway in terms of getting from point A to point B, but they are the only option if you desire convenience, privacy, or a status symbol. The abundance of large SUVs on New York streets (by my estimate, about half of them) indicates a significant demand for these higher-end needs beyond mere transportation. This constitutes a massive head market, which is a very atypical aspect of New York that I discovered during this visit.

Digging deeper, this head market isn't limited to transportation; it also manifests in the hotel and restaurant sectors. Not to mention top-tier brands like Aman, within the Marriott group itself, stunning suites and even guest rooms priced in the tens of thousands of dollars per night are not uncommon. As for restaurants, whether Michelin-starred establishments or longstanding places that require formal attire, even American-Chinese restaurants like Tao typically see individual dining expenses ranging from several hundred to over a thousand dollars per person.

What supports all this market? I believe it primarily stems from three factors: high population density, a high proportion of wealthy individuals, and a tourist mindset.

Upon entering NYC, the most immediate impression is the sheer number of people—people mountain, people sea—much like in China. People and vehicles are uncompromising with each other, and the blaring horns are deafening. From another perspective, this instantly made me understand why flagship stores are all located in New York. Fifth Avenue sees daily foot traffic averaging between 500k to 1M people. Using an internet analogy, for each store, this equates to about 1 million impressions daily. Even with a 1% click-through rate (CTR), that results in thousands to tens of thousands of "click-throughs" each day. These numbers are astonishing even in the realm of the internet, not to mention in physical stores with higher conversion rates. From this angle, it becomes clear why New York is indeed a commercial powerhouse with endless opportunities.

The proportion of wealthy individuals in New York is also remarkably high. According to statistics, on average, one in every 24 residents in New York City is a millionaire (in USD). Coupled with the high population density, New York City has a total of 345,000 millionaires, and this number continues to grow in recent years. The total wealth of New York residents even surpasses that of all G20 countries combined, excluding China and the United States.

These individuals have consumption habits vastly different from the average person. For example, while ordinary people might never consider hiring a car or even a helicopter, these wealthy individuals, although they might also use the subway for regular commutes, will contemplate hiring cars when the need arises. This is a significant factor contributing to the emergence of a massive head market.

The thirds factor, the tourist mindset primarily refers to an experiential attitude. As the saying goes, "A poor family makes rich roads," implying that people's mindsets change while traveling, making them more open to options they wouldn't usually consider. For instance, one may spend three to five hundred USD comfortably without much hesitation for regular consumption levels. However, when traveling, people are more willing to bite the bullet and pay for a suite costing a thousand USD per night.

These three factors create such a high consumption tier in New York, supporting the numerous flagship stores on Fifth Avenue and the large SUVs on the streets. New York's market is indeed distinct from other places, with a unique yet expansive customer base. This explains why products that are only popular in New York make sense, such as Blade—a helicopter commute between JFK and Downtown—or VIP fast-track tickets similar to Disney's. These services offer time-saving and status-enhancing attributes, which only affluent individuals prioritize and can afford.

It feels similar to rereading "Three Kingdoms" or "Journey to the West" as an adult compared to as a child, noticing different aspects. Revisiting New York after a decade, I saw many different facets. On one hand, because I had graduated and shifted from a student perspective to that of a consumer, gaining certain insights into vehicle models and categories. On the other hand, due to teaching AI classes, I registered a company. With the help and guidance of my business partner Kedaibiao, I received substantial education in business and market. Transitioning from a student to a business owner is a significant role change, providing many different perspectives. I strongly agree with the viewpoint of another influencer who inspired me, Chen Ran: becoming a business owner is a transformation. Only by truly thinking about and trying to monetize your skills can you elevate your cognition to a new level. If any readers have similar ideas, they might as well start by registering a company. After registration, you'll naturally be forced to handle many aspects such as taxes, licensing, and marketing. Even if you don't make money initially, the education you receive is immensely valuable.

通过Prompt Engineering提升对人类求助的效果

2024-09-11T12:45:00-07:00

我在各个社区里潜水，经常能看见有人求助，大家互相帮助的情况。但遗憾的是，我发现绝大多数的帮助都是无效的，最终能够真正解决问题的很少。比如一个人在群里问："我遇到了问题X，怎么去解决？"大家可能会给出很多建议："你试试A行不行，试试B行不行。"但最终试了之后，发现都没什么用，整个求助也就不了了之。

这里面的原因很复杂，但在这里我只想从一个很小的角度来讨论。在很多情况下，我们只要用一些简单的面向人类的Prompt Engineering的技巧，稍微改一下问问题的方式，就可以极大地提升求助的效果，可能可以很快得到真正有用的回应。

在看这个小技巧之前，我们先去观察一下大家求助和互相帮助的时候遇到的最大的问题是什么。我的观察是，这个最大的问题是大家只是单纯的拍脑袋去想有什么原因能够解释求助的人遇到的问题。换言之，A可以导致X，你应该去试试改正A这个事情。

举个最简单的例子，比如有个小白求助："我家里灯泡不亮，应该怎么办？"一个很自然的回复就是："如果灯泡坏了，会导致灯不亮，所以你应该去买个灯泡换上。"这样的思路看起来没什么问题，但如果观察得多了就会发现它是一种非常低效的解决问题的方法。在编程里面叫做random debugging，或者叫拍脑袋解决问题。这是因为导致X的原因可能有很多，一个个试过去需要很长的时间。

还是用刚才的灯泡不亮的例子来看，可能是因为停电了，可能是因为插线板坏了，可能是因为电灯的开关坏了，或者某个电线断了，当然也有可能是灯泡坏了。一个一个可能的原因排查过去的话要查到猴年马月。而且还有一种可能是人自己的问题，比如可能在换了灯泡换了灯之后还是没用，最后一个星期以后发现是空气开关没打开。不要笑，这个在求助的情况下非常常见。

那如果我们不想用Random debugging，有没有什么方法能够高效地找到问题的原因呢？这其实是一个很值钱很根本的问题，做好了这一点基本上就能在工作和生活中上一个巨大的台阶。说穿了它也并不复杂，就是我们不要从正面拍脑袋去凑X可能的原因，而是通过设计一系列的实验，逆向一步步的缩小范围。

对这个灯泡的例子而言，比如说可以先试一下同一个插线板的其他灯亮不亮。如果不亮，可能就是插线板或者空气开关或者停电的问题。如果亮，就说明是灯本身的问题。那灯本身又可能是灯泡的问题、线路的问题、开关的问题等等。我们可以直接用一个测电笔在灯泡附近测一下看看有没有电。如果有电但是灯泡不亮，就说明是灯泡的问题，我们就去换灯泡就好了。如果没电，就说明可能是线路包括开关的问题。通过这一系列非常简单的实验，我们可以在一两分钟内就可以非常精确的判断出问题的真因，然后去加以解决。这个过程比比如去换一个插线板、去买一个新灯泡要省时间多了。

但是为什么大家在网上不这么帮助别人呢？这一方面是因为拥有这种逆向debug mindset的人非常少，但另一个非常重要的原因是提问的问法不对。绝大多数求助的帖子问的问题都是"我应该怎么办"，那大家自然而然就会给你提你应该怎么办的建议，比如你应该去换个灯泡，你应该去换个插线板。这种提问的方式就暗示了你想要的是一个直接的最终的解决方法，求仁得仁，自然你得到的也是零散且无章法的答案。

如何解决这个问题呢？在AI领域有一个Prompt Engineering的小技巧叫做Chain of Thought，其实正好可以用在这里。与其问最终的解决方法，更好的问题是问大家"这可能是什么原因导致的，并且我怎样判断是这个原因"。在灯泡的例子里，比如我们可以说："现在我的灯不亮，有没有什么方法能够帮助我找到是哪个环节出了问题？"这样的问法就可以逼着大家去思考可能有很多种不同的问题，而不一定是灯泡坏了，从而可以给你更有效的答案。

当然从另一个角度来说，有可能你这么思考下去会发现我根本不需要上网去问别人，而可能自己就解决了。或者另一种情况是，等到你真的发现自己解决不了上网要问别人的时候，你会自然而然把你的思路给写出来，这样对你得到更有效的帮助也是非常有力的。

所以总的来说，这种逆向debug的mindset是工作和生活中非常关键的一个技能。它可以帮我们节约大量的时间，同时让我们能做到以前做不到的事情。而且通过一些简单的prompt engineering的技巧，我们也可以给别人带来这样的mindset，帮助他们一方面更有效的解决我们自己求助的问题，另外一方面也可以给他们启发，让他们更高效的解决他们自己的问题。这也是我写这篇文章的目的。

Seeking Help Effectively through Human Oriented Prompt Engineering

2024-09-11T12:44:00-07:00

I often lurk in various communities and frequently see people seeking help and others offering assistance. Unfortunately, I find that the majority of this help is ineffective, with very few actually resolving the issue. For example, someone might ask in a group, "I'm facing problem X, how do I solve it?" and people might offer suggestions like, "Try A, see if it works, or try B." But ultimately, after trying them, nothing seems to work, and the whole request for help just fades away.

The reasons behind this are complex, but here I want to discuss a small angle. In many cases, simply using some straightforward human-oriented prompt engineering techniques and slightly altering the way we phrase our questions can greatly enhance the effectiveness of seeking help, potentially leading to more useful responses quickly.

Before looking at this small technique, let's first observe the biggest problem people face when seeking help and assisting each other. My observation is that the biggest problem is people just guessing what could explain the problem the person is facing. In other words, A could lead to X, so you should try fixing A.

Take a simple example: someone asks, "The light bulb at my home isn't lighting up, what should I do?" A natural response might be, "If the bulb is broken, it won't light up, so you should buy a new bulb and replace it." This line of thinking seems fine, but if you observe more, you'll find it's a very inefficient way to solve problems. In programming, it's called random debugging, or just guesswork. The reason is that there might be many causes for X, and testing them one by one takes a long time.

Using the previous light bulb example: it could be due to a power outage, a broken power strip, a faulty light switch, or a broken wire, not just a broken bulb. Checking each possible cause could take forever. There's also a possibility of user error, such as replacing the bulb and the lamp, only to discover a week later that the circuit breaker was off. Don't laugh, this is quite common when seeking help.

So, if we want to avoid random debugging, is there a method to efficiently identify the cause of the problem? This is actually a very valuable and fundamental question. Getting this right can significantly elevate your work and life. In essence, it's not complicated; we shouldn't just guess the possible causes of X, but rather design a series of experiments to narrow down the range in reverse.

For the light bulb example, you might first test if other lights on the same power strip work. If they don't, it could be a power strip or circuit breaker issue, or a power outage. If they do, it indicates a problem with the light itself. The light could have issues with the bulb, wiring, switch, etc. You could use a test lamp near the bulb to see if there's power. If there's power but the bulb doesn't light, it's a bulb issue, so replace the bulb. If there's no power, it could be a wiring or switch issue. Through these simple experiments, you can accurately identify the root cause in a couple of minutes and address it. This process saves much more time than buying a new power strip or bulb.

But why don't people help others this way online? Part of it is because few people have this reverse debugging mindset, but another crucial reason is the way questions are asked. Most help-seeking posts ask, "What should I do?" leading others to suggest what you should try, like replacing the bulb or power strip. This way of asking implies you're looking for a direct final solution, and consequently, you receive scattered and unmethodical answers.

How to solve this problem? In AI, there's a prompt engineering technique called Chain of Thought, which can be applied here. Rather than asking for the final solution, a better question is to ask, "What could be causing this, and how can I verify it's the reason?" In the light bulb example, you might say, "My light isn't working, are there any methods to help me identify which part is the problem?" This way of questioning encourages people to consider multiple possible issues, not just a broken bulb, leading to more effective answers.

From another perspective, thinking this way might reveal you don't need to ask others online and could solve it yourself. Or, when you realize you truly can't solve it and need to ask others, you'll naturally write out your thought process, which greatly helps in receiving more effective assistance.

In summary, this reverse debugging mindset is a crucial skill in work and life. It can save us considerable time and enable us to achieve things previously unattainable. Using simple prompt engineering techniques, we can also instill this mindset in others, helping them solve our problems more effectively and inspiring them to solve their own issues more efficiently. This is also the purpose of writing this article.

AssemblyAI语音识别API试用感受

2024-09-02T12:45:00-07:00

在之前我们做了很多尝试，使用AI语音识别和AI来辅助进行表达和输入。经过一年多的时间，一方面我觉得这个工具特别有用，可以极大地拓展了我思维的深度，让我不用浪费时间在修改打错的字上；另外一方面，整个系统也经历了很多改动，比如后台的AI从GPT-4换成了Claude 3.5。

最近另外一个问题变得越来越严重，就是我用的语音识别引擎是OpenAI的Whisper API，它变得越来越不稳定，经常显示Timeout需要重试。为了解决这个问题，我尝试过在本地host一个开源的Whisper模型，但是速度比OpenAI的还是要慢很多，他们确实做了很多infrastructure方面的优化。

为了解决这个问题，我开始探索有没有其他公司的语音识别API可以使用。在搜索一段时间之后，发现了AssemblyAI这个公司。这个公司蛮有意思的，不同于别的公司比如TurboScribe那样就是Whisper API换壳，这个公司真的有自己的模型开发能力，发布了自己的论文和白皮书。所以我花了一段时间把AssemblyAI集成到了我的系统里，和OpenAI的Whisper API做了一些对比。一些感受：

在速度上，OpenAI还是更快一些的。比如对一个3分15秒的文件进行识别，OpenAI只用了9.8秒就返回了结果，AssemblyAI在用Nano模型的情况下花了10.6秒，在用Best模型的情况下花了19.8秒，耗时大约是OpenAI的两倍左右。
但是准确率AssemblyAI确实更高，尤其是使用Best模型的情况下。对一些关于咖啡烘焙的术语，它可以非常准确地表达出来，让我完全不用做任何修改就可以直接使用。这点非常impressive。但是Nano模型达不到这样的效果，还是不如OpenAI的。
从稳定性的角度暂时还没有数据。我只是对OpenAI经常发生outage或者需要重试感觉不满，但也不知道AssemblyAI会不会有同样的情况。

所以从性能的角度来说，感觉OpenAI的API还是更好一些的。AssemblyAI是用两倍的处理时间换取稍微更高的识别准确率，我不确定对我的这种用场景是不是划算，现在暂时使用Assembly的API做一做长期测试。

在功能上，AssemblyAI提供了挺多蛮诱人的功能：

它有类似OpenAI的Batch Mode，叫做Synchronous Mode，但OpenAI的Batch Mode只对LLM有效，对于语音识别是没有这个功能的。
AssemblyAI最大支持5GB的文件，对于大文件比如电影和YouTube视频的识别特别友好。OpenAI这方面做的很差，一方面没有Synchronous Mode，一方面文件大小的限制也特别死。而且我甚至觉得这不一定是模型的限制，而是前端工程师和后端没有沟通好。当你上传一个大文件的时候，OpenAI给的错误甚至不是API的错误，而是一个Nginx的错误，说Entity too Large，是个HTTP错误，这比较搞笑。
AssemblyAI还有Streaming recognition，让用户可以对着话筒说话识别，OpenAI也没有这个功能。
AssemblyAI还有Speaker recognition的功能，类似Zoom AI companion，OpenAI也没有这个功能。

总的来说，我觉得AssemblyAI看上去是一个相当靠谱的而且很容易集成的语音识别API，而且集成起来也特别简单，感觉不妨作为OpenAI的一个Alternative。

Comparison of AssemblyAI and OpenAI Whisper API

2024-09-02T12:44:00-07:00

Previously, we made many attempts using AI voice recognition and AI to aid in expression and input. After more than a year, on one hand, I find this tool particularly useful as it greatly expands the depth of my thinking and saves me time from correcting typos. On the other hand, the entire system has undergone many changes, such as the backend AI switching from GPT-4 to Claude 3.5.

Recently, another issue has become increasingly serious: the voice recognition engine I use, OpenAI’s Whisper API, has become increasingly unstable, often showing timeouts that require retries. To solve this problem, I tried hosting an open-source Whisper model locally, but it was much slower than OpenAI's; they have indeed optimized a lot in terms of infrastructure.

To solve this problem, I started exploring whether other companies’ voice recognition APIs could be used. After searching for a while, I discovered a company called AssemblyAI. This company is quite interesting. Unlike other companies like TurboScribe, which are just rebranded Whisper APIs, this company genuinely has its own model development capabilities and has published its own papers and white papers. So I spent some time integrating AssemblyAI into my system and did some comparisons with OpenAI's Whisper API. Some observations:

In terms of speed, OpenAI is still faster. For example, recognizing a 3-minute 15-second file, OpenAI took only 9.8 seconds to return the result, while AssemblyAI took 10.6 seconds using the Nano model and 19.8 seconds using the Best model, roughly double the time of OpenAI.
However, AssemblyAI is indeed more accurate, especially when using the Best model. For some coffee roasting terms, it can express them very accurately, allowing me to use them directly without any modifications, which is very impressive. But the Nano model does not achieve this effect and is still not as good as OpenAI’s.
From the perspective of stability, there is no data yet. I am just dissatisfied with OpenAI frequently experiencing outages or needing retries, but I don't know if AssemblyAI will have the same issues.

So from a performance perspective, OpenAI's API still seems better. AssemblyAI uses twice the processing time for slightly higher recognition accuracy, and I'm not sure if it’s worth it for my use case. For now, I’ll use Assembly's API for some long-term testing.

In terms of functionality, AssemblyAI offers quite a few enticing features:

It has a Synchronous Mode similar to OpenAI’s Batch Mode, but OpenAI's Batch Mode is only effective for LLM, not for voice recognition.
AssemblyAI supports files up to 5GB, which is particularly friendly for recognizing large files like movies and YouTube videos. OpenAI performs poorly in this regard, with no Synchronous Mode and very strict file size limits. Furthermore, I even think this is not necessarily a model limit, but rather a lack of communication between frontend engineers and the backend. When you upload a large file, the error given by OpenAI is not even an API error but an Nginx error saying "Entity too Large," which is quite amusing.
AssemblyAI also offers Streaming recognition, allowing users to speak into the microphone for recognition, which OpenAI does not have.
AssemblyAI also has Speaker recognition functionality, similar to Zoom AI companion, which OpenAI does not have.

Overall, I think AssemblyAI seems to be a fairly reliable and easily integrable voice recognition API, and integrating it is particularly simple, making it a viable alternative to OpenAI.

使用Builder’s Mindset重定义AI工具

2024-08-10T23:00:00-07:00

在ChatGPT刚刚推出的时候，我就尝试过用它来规划旅行，但整个体验非常糟糕。比如它经常出现幻觉，把地址搞错，或者推荐一个并不存在的景点。由于交互界面完全是文本的，我也看不到每个景点在地图上的位置。最后不得不回归地图手动安排行程和确认住宿。因此感觉这种应用更多是噱头大于实用。

因为计划去温哥华旅游，在我花了两年时间深入研究如何用AI改进生活和工作后，我又重新尝试了一次这类工具。出乎意料的是，这次的体验有了显著提升。我把这次的聊天记录分享在这里。

简单来说，我首先让AI推荐了温哥华的景点类型和适合游玩的区域，然后告诉它我们有老人和小孩，进一步缩小了选择范围。接着让它把这些景点标注在地图上，使我们可以直观地感受到景点的分布位置，从而确定最终的行程顺序和交通方式。最终，在AI的帮助下，我们生成了一个总体的行程。

整个聊天看似平淡无奇，但如果两年前甚至一年前让我来做，恐怕都无法这么顺利地完成。在这个过程中，我主要运用了以下几个技巧：

使用不同的AI来解决不同的问题，比如利用网络搜索的AI来减少幻觉，而在需要阅读文档或生成代码时，则使用更智能的AI比如Claude。
使用文档管理来高效地教会AI以前它不会的内容。比如通过粘贴Bing Maps的文档教AI生成一个URL，让我们能够直接在Bing Maps上查看所有景点的位置。
使用Chain of Thought来规避潜在的陷阱。比如，引导AI使用地址而非经纬度来在地图上标注地点。

但其实这些技术上的奇技淫巧都只是表面现象。现在的AI体验好功能强，并不是因为AI比以前更聪明，或者我使用的工具Poe比其他客户端更好，而是因为我具备了Builder's Mindset。

所谓Builder's Mindset，就是我们对待工具的态度从被动转变为主动。我们不再只是工具的User，而是工具的Builder。当没有现成的工具时，我们构建工具；当现有的工具不好用时，我们改进工具。在AI时代，这种构建和改进变得特别简单。比如，虽然到现在为止，AI还没有一个功能能将推荐的景点直接显示在地图上，但我们只需要粘贴Bing Maps帮助文档的URL，就可以立即教会AI实现这个功能。阅读文档、理解参数含义、构建正确的URL，这些以前需要我们自己动手的体力活，现在都可以交给AI来完成。

因此，具备这种Builder's Mindset的人，可能是AI时代最大的受益者，并且会与那些被动的User拉开明显的差距。不过，这种Mindset的转变并不是一蹴而就的。一方面，它是一个专门的领域，有许多陷阱和技巧；另一方面，它需要时间和经验的积累。比如，我之所以知道Bing Maps有这个功能，也是因为在过去构建其他工具时，学到了相关知识。

如果要我从过去两年甚至更长时间的AI使用经验中只教给你一样东西，我会毫不犹豫地选择Builder's Mindset。这也是课代表和我一起创建了From Users to Builders这个AI课程的原因之一。感兴趣的同学可以参加我们最新的免费试听课程，或者去现场课程看看学员们的评价。

Redefine AI tools with a Builder's Mindset

2024-08-10T22:59:00-07:00

When ChatGPT was first launched, I tried using it to plan a trip, but the experience was quite poor. It often hallucinated, got addresses wrong, or recommended non-existent attractions. Since the interface was entirely text-based, I couldn't see the locations of the attractions on a map. I eventually had to resort to manually arranging the itinerary and confirming accommodations using maps. It felt like these applications were more gimmicky than practical.

As I was planning a trip to Vancouver and had spent two years delving into how AI could improve my life and work, I decided to give these tools another try. Surprisingly, the experience was significantly better this time. I shared this chat history here.

In short, I first asked the AI to recommend types of attractions and suitable areas to visit in Vancouver. I then informed it that we had seniors and children, which helped narrow down the options. Next, I asked it to mark these attractions on a map, allowing us to visually grasp the distribution of the sites and determine the final itinerary sequence and transportation methods. Ultimately, with the AI's help, we generated an overall itinerary.

The entire chat might seem unremarkable, but if I had attempted it two years, or even a year ago, I likely wouldn't have been able to complete it so smoothly. During this process, I primarily used the following techniques:

Using different AI for different tasks, such as using web search AI to reduce hallucinations and employing more intelligent AI like Claude for reading documents or generating code.
Using document management to efficiently teach the AI new content it previously didn't know. For example, by pasting Bing Maps documentation to teach the AI how to generate a URL, allowing us to view all the attractions' locations directly on Bing Maps.
Using Chain of Thought to avoid potential pitfalls. For instance, guiding the AI to use addresses rather than coordinates to mark locations on the map.

But in reality, these technical tricks just scratches the surface. The improved AI experience and powerful functionality are not because AI is smarter than before, or because the tool I use, Poe, is better than other clients, but because I have adopted a Builder's Mindset.

The Builder's Mindset refers to our attitude towards tools shifting from passive to active. We are no longer just users of tools; we are builders of tools. When there are no ready-made tools, we build them; when existing tools aren't user-friendly, we improve them. In the AI era, this building and improvement have become particularly simple. For example, although AI currently lacks a feature to directly display recommended attractions on a map, we can instantly teach AI to perform this function by merely pasting the URL of Bing Maps' help documentation. Reading documents, understanding parameter meanings, constructing the correct URL—these tasks that previously required our manual effort can now be handed over to AI.

Therefore, those with a Builder's Mindset may be the biggest beneficiaries in the AI era and will likely distance themselves significantly from passive users. However, this shift in mindset is not achieved overnight. On one hand, it's a specialized field with many pitfalls and tricks; on the other hand, it requires time and experience to accumulate. For instance, the reason I know about this feature of Bing Maps is because I learned related knowledge while building other tools in the past.

If I were to teach you just one thing from my AI usage experience over the past two years or even longer, I would unhesitatingly choose the Builder's Mindset. This is also one of the reasons why Kedaibiao and I co-created the From Users to Builders AI course. Interested students can join our latest free trial lesson, or check out the feedback from participants in our live course.

从历史的角度看基础大模型基础在何处

2024-08-06T21:30:00-07:00

在ChatGPT发布以后，基础大模型（Foundation Model）这个词一下子火了起来。不论是各个自媒体，还是科研单位和企业，都在热火朝天地探讨基础大模型。腾讯研究院曾经做过一个不完全统计，现在中国有200多个基础大模型正在研发。但是，到底什么是基础大模型呢？一个普通的大规模模型要想成为基础大模型，需要满足什么样的条件呢？如果它的核心标准是“大”的话，需要多大才能称为基础大模型呢？这些问题讨论的人却不多，因此我写了这篇文章，想从回顾历史的角度来解释一下我的观点。

在2017年以前，深度学习已经成为机器学习的主流。相比于传统的机器学习算法，深度学习有一个核心的优势：它一直没有出现数据饱和的现象。传统的模型，即使不断增加模型的复杂程度，到了一定规模的数据之后，它也无法再有效地利用这些数据。具体表现就是，随着训练数据的增多，它的精度不再提高。这是整个机器学习领域面临的一个老大难问题。

但深度学习则完全没有这个问题。只要你给它更多的GPU，更多的训练数据，让它训练更复杂的模型，它就能保证一定会给你更好的精度。直到如今，我们也没有发现深度学习有任何饱和的迹象。这是为什么深度学习如此受各个公司欢迎，尤其是大公司的欢迎的原因。毕竟像这种只要砸钱就能拿到更好精度的事情，可以为他们构筑一个天生的护城河。

但深度学习的这一切都有一个前提。就是神经网络或者说Feedforward的Neural Network是一个特别适合并行的算法，所以它可以在GPU上高效地进行推理和优化。但直到2017年，在NLP领域流行的方法仍然是LSTM等Recurrent Neural Network。它的特点是要想计算下一个词的结果，必须先把前一个词的结果全部计算完，这就造成了一个问题：它是一个串行处理的算法，对GPU非常不友好，很难高效地并行优化。这是为什么虽然Auto Regressive Task很早就出现了——比如把一段话中间的一个词遮住，从上下文来猜这个词是什么——但一直没有在大规模的数据上推行开的原因。比如Word2Vec这种遮住一个词猜这个词是什么的模型只是一个两层的神经网络，并没有真正地和深度学习结合起来。

2017年，Google发布了Transformer这个架构，终于改变了这一切。Transformer通过全局的注意力机制，将LSTM串行的计算结构变成了一种可以进行并行处理的结构。因此它可以在GPU上进行高效的训练。这为NLP领域带来了新的思路。因此，仅仅在几个月之后，OpenAI和Google分别独立发布了GPT和BERT两种算法，通过引入大规模预训练，极大提高了所有NLP任务的算法表现。这些算法的使用一般分为两个步骤：

第一步是用之前提到的在一段话中间遮住一个词来预测这个词是什么，在大规模的数据上面做无监督训练，从而得到一个预训练的神经网络。
在此基础上，第二步是针对具体的任务，比如对一段话进行情感分析，进行进一步的模型微调。让这个基准模型能够适配到特定的任务上，从而实现优异的性能。

发布之后，GPT和BERT迅速成为了NLP领域的标准算法。由于每加入一个新的任务都需要用大量的GPU进行微调，这也为许多科学家和工程师提供了大量工作机会（笑）。

2020年，GPT-3出现了。虽然这个模型在公众眼中没有特别引人注目的地方，但它完成了一件特别重要的事情，就是去掉了模型微调这个步骤。我们只需要使用GPT-3这个预训练过的模型，用简单的语言描述一下要完成的任务，比如“下面是一条大众点评网的用户评测，告诉我它的感情是正面的、中性的还是负面的”，并给出几个可选的例子，它就可以在这个任务上取得相当优异的成绩。GPT-3和最初版的GPT在结构上并没有太大不同，主要区别在于模型的规模和预训练的数据量增大了500倍。这种方法与深度学习一直以来的暴力风格非常吻合，再次证明了深度学习的强大潜力。

在2022年，OpenAI发布了另一个当时并未引起广泛关注的成果——GPT Instruct。这个成果首次引入了RLHF，通过强化学习的方法实现了近乎无限规模的对齐训练过程。这个过程首先使用强化学习从有限的人类标注中学习人类对两个输出的偏好程度，然后用训练出来的模型代替人类来对模型的输出打分，参与模型的训练。这相当于获得了无穷多的训练数据，进一步增加了对齐过程的数据量。在这种新方法的支持下，OpenAI首次实现了让一个语言模型理解人类意图并根据人类指令作出回答（Instruction Following）。同年11月，基于GPT-3和GPT Instruct两个成果的结合，OpenAI推出了ChatGPT。后来的事情大家都比较熟悉了。

从这个历史来看，基础大模型到底是什么呢？我认为有两个标志性的能力：

不通过模型微调就可以进行few-shot learning或者in-context learning的能力。更具体地说，一个基础大模型如果想要适配到一个新的场景中去，它不需要科学家或者工程师的参与，不需要大量的GPU来支持训练和微调，也不需要改变这个模型的权重。而只需要普通用户与其交互，就可以完成这个适配过程。这对于实际产品的形式来说是革命性的，是一种破坏性的创新（disruptive innovation）。原因是，它允许我们只需部署一个模型，就能完成许多不同的任务，这也是基础大模型中“基础”二字的由来。

一个有趣的观察是，至少从目前来看，这些基础大模型似乎反而不适合用于模型微调。一个例子是BERT经过模型微调往往可以得到很好的结果，但CLIP这种具有Open Vocabulary跨模态理解能力的基础大模型，其模型微调的难度却大很多，在学界是臭名昭著的难以通过微调来利用。要想有效地利用CLIP，更好的办法往往是进行embedding backtracking，在不改变模型权重的情况下修改输入token。
对话式UI。在这种UI出现之前，如果想利用机器学习或者AI模型的能力，我们必须理解如何编写Python程序来调用一个模型，比如需要学习什么是Tokenization，如何调用PyTorch，以及如何管理CUDA和CPU之间的内存转移。但是有了对话式AI后，任何人只要会说话，就可以通过直接对话来调用其能力。这是另一个明显的产品形式的破坏性创新。

当这两者结合在一起时，就可以构成类似ChatGPT这样的革命性产品。一方面，我们不需要机器学习科学家的参与，就可以将AI应用到自己的任务上；另一方面，也不需要懂编程，就可以使用AI的各种能力。

而实现Few-shot Learning和Conversational UI这两个关键特质的技术手段则是大规模的预训练。一个非常有趣的观察是，从Transformer到GPT、到GPT-3、再到GPT-Instruct和ChatGPT，其中模型的基本单元，也就是Transformer和Self-Attention机制，从来没有改变过。变化的主要是预训练的数据规模。BERT只用了33亿个单词，但到了GPT-3，这个规模增大了500倍，而GPT-Instruct通过RLHF实现了几乎无限的对齐数据。更多的训练数据让我们能更好地覆盖非常复杂的问题解决空间，从而更好地支持各种不同的任务。

所以总的来说，我认为基础大模型的核心并不在于“大小”，“大”只是用来实现其两种核心特质的技术手段，而并非目标。从机器学习的角度来看，它的核心特征是Few-shot Learning，不需要科学家的参与，仅凭最终用户的力量就可实现任务的适配。从UI的角度来看，另一个可选的特质是对话式UI，这种界面进一步极大地降低了使用的门槛，让更多人能够受益。

What Makes a Large Model a Foundation Model: a Historical Perspective

2024-08-06T21:29:00-07:00

After the release of ChatGPT, the term "Foundation Model" suddenly became a hot topic. Both independent media and research institutions and enterprises are enthusiastically researching and launching their own foundation models. Tencent Research once conducted an incomplete survey, showing that there are over 200 foundation models under development in China. But what exactly is a foundation model? For an ordinary large-scale model to become a foundation model, what conditions must it meet? If the core criterion is "large," how large does it need to be to qualify as a foundation model? Few people discuss these questions, so I wrote this article to explain my viewpoint from a historical perspective.

Before 2017, deep learning had already become mainstream in machine learning. Compared to traditional machine learning algorithms, deep learning has a core advantage: it has never encountered data saturation. Traditional models, even when increasing in complexity, cannot effectively utilize data beyond a certain scale. The specific manifestation is that as training data increases, their accuracy no longer improves. This is a longstanding problem in the field of machine learning.

However, deep learning does not have this problem. As long as you provide it with more GPUs, more training data, and allow it to train more complex models, it will ensure better accuracy. To this day, we have not seen any signs of saturation in deep learning. This is why deep learning is so popular among companies, especially large ones. After all, being able to achieve better accuracy simply by investing money can create a natural moat for them.

But all of this in deep learning has a prerequisite. The neural network, or feedforward neural network, is particularly suitable for parallel processing, allowing efficient inference and optimization on GPUs. However, until 2017, the popular methods in the NLP field were still LSTM and other recurrent neural networks. Their characteristic is that to calculate the result of the next word, you must first completely calculate the result of the previous word. This creates a problem: it is a serial processing algorithm, unfriendly to GPUs and difficult to optimize efficiently in parallel. This is why, even though auto-regressive tasks appeared early—like masking a word in the middle of a sentence and guessing what it is from the context—they were never widely implemented on large-scale data. For example, models like Word2Vec that involve guessing a masked word are only two-layer neural networks and were not truly combined with deep learning.

In 2017, Google released the Transformer architecture, finally changing everything. Transformer, through a global attention mechanism, transforms the serial computation structure of LSTMs into one that can be processed in parallel. This allows efficient training on GPUs, bringing new perspectives to the NLP field. Consequently, just a few months later, OpenAI and Google independently released the GPT and BERT algorithms, which greatly improved the performance of all NLP tasks by introducing large-scale pre-training. The use of these algorithms generally follows two steps:

The first step involves using the aforementioned technique of masking a word in a sentence to predict what it is, training on large-scale data to obtain a pre-trained neural network.
Based on this, the second step involves further fine-tuning the model for specific tasks, such as sentiment analysis of a passage, allowing the foundational model to adapt to specific tasks and achieve excellent performance.

After their release, GPT and BERT quickly became standard algorithms in the NLP field. Since adding a new task required significant GPU resources for fine-tuning, this provided many job opportunities for scientists and engineers :D

In 2020, GPT-3 emerged. Although this model didn't seem particularly eye-catching to the public, it accomplished something crucial: it eliminated the model fine-tuning step. We only need to use the pre-trained GPT-3 model and provide a simple language description of the task, such as "Below is a user review from a public review site, tell me if the sentiment is positive, neutral, or negative," along with a few examples. It can achieve quite remarkable results on the task. Structurally, GPT-3 is not significantly different from the original GPT; the main differences are the scale of the model and the pre-training data being 500 times larger. This method aligns with the brute-force style of deep learning, once again proving the immense potential of deep learning.

In 2022, OpenAI released another achievement that initially did not attract widespread attention—GPT Instruct. This work first introduced RLHF, implementing an almost limitless scale of alignment training through reinforcement learning. This process first uses reinforcement learning to learn human preferences between two outputs from limited human annotations, then replaces humans with the trained model to score outputs and participate in model training. This is equivalent to obtaining an infinite amount of training data, further increasing the data volume for the alignment process. Supported by this new method, OpenAI achieved, for the first time, a language model that understands human intent and responds to human instructions (Instruction Following). In November of the same year, combining the results of GPT-3 and GPT Instruct, OpenAI launched ChatGPT. The subsequent developments are well known to everyone.

From this historical perspective, what exactly is a foundation model? I believe it has two hallmark capabilities:

The ability to perform few-shot learning or in-context learning without model fine-tuning. More specifically, for a foundation model to adapt to a new scenario, it doesn't require the involvement of scientists or engineers, nor does it need a large amount of GPU support for training and fine-tuning, and there's no need to alter the model's weights. Instead, this adaptation process can be completed simply through interaction with ordinary users. This is revolutionary for actual product forms and represents a disruptive innovation. The reason is that it allows us to deploy just one model to accomplish many different tasks, which is where the "foundation" in foundation models comes from.

An interesting observation is that, at least for now, these foundation models seem to be unsuitable for model fine-tuning. An example is BERT, which often yields excellent results after fine-tuning, but foundation models like CLIP, with open vocabulary cross-modal understanding capabilities, are infamous in the academic community for being difficult to leverage through fine-tuning. To effectively utilize CLIP, a better approach is often embedding backtracking, modifying the input tokens without changing the model's weights.
Conversational UI. Before the emergence of this UI, leveraging the capabilities of machine learning or AI models required understanding how to write Python programs to call a model, such as learning what tokenization is, how to use PyTorch, and how to manage memory transfers between CUDA and CPU. However, with conversational AI, anyone who can communicate verbally can access its capabilities through direct dialogue. This is another clear example of disruptive innovation in product form.

When these two aspects are combined, they can create revolutionary products like ChatGPT. On one hand, AI can be applied to individual tasks without the involvement of machine learning scientists; on the other hand, no programming knowledge is needed to use the diverse capabilities of AI.

The technical means to achieve the two key traits of few-shot learning and conversational UI is the large-scale training. An interesting observation is that from Transformer to GPT, to GPT-3, and then to GPT-Instruct and ChatGPT, the fundamental unit of the model, namely the Transformer and self-attention mechanism, has never changed. What has primarily changed is the scale of pre-training data. BERT used only 3.3 billion words, but by the time of GPT-3, this scale had increased 500 times, and GPT-Instruct achieved nearly limitless alignment data through RLHF. More training data allows us to better cover the very complex problem-solving space, thereby better supporting a variety of different tasks.

So in summary, I believe the core of foundation models is not "large"; "large" is merely a technical means to achieve its two core traits and not the goal. From a machine learning perspective, its core feature is few-shot learning, enabling task adaptation without the involvement of scientists and solely through the power of end-users. From a UI perspective, another optional trait is the conversational UI, which further greatly reduces the barrier to entry, allowing more people to benefit.