47 Comments

The Internet hype (which, lest we forget, was at times as crazy as this one) took about 5-6 years (1994-2000). After that we got a huge correction (but the internet stayed). GenAI will stay too, though most likely at nowhere near the valuation it is now given. While GPT is roughly 5 years old, ChatGPT-*fever* is now only 2 years old. It might easily take a few more years for the correction to happen. After all, a lot of bullshit is reported *about* the models *by humans* too. And human convictions change slowly (which is to be expected from a biological and evolutionary standpoint).

The biggest problem with calling ChatGPT and friends Large Language Models is that they aren't language models at all. There is nothing resembling 'language' in the models; it is 'plausible token selection'. A better name would be "Language Approximation Model". And good grammar is simply easier to approximate from token statistics than good semantics.
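To make 'plausible token selection' concrete, here is a minimal Python sketch of the selection step. The token scores are made up for illustration (a real LLM computes them with a large neural network), but the point survives: nothing in the selection checks whether the chosen token is *true*, only whether it is plausible given the scores.

```python
import math, random

# Made-up scores (logits) for whatever token might come next; a real LLM
# computes these with a neural network, but the selection step is the same.
next_token_logits = {"Paris": 4.1, "London": 2.3, "fig": 0.7, "the": 1.5}

def sample_next_token(logits, temperature=1.0):
    # Softmax turns the scores into a probability distribution...
    exp_scores = {t: math.exp(v / temperature) for t, v in logits.items()}
    total = sum(exp_scores.values())
    probs = {t: s / total for t, s in exp_scores.items()}
    # ...and the next token is drawn from that distribution. Nothing here
    # checks whether the token is true, only whether it is plausible.
    r, cumulative = random.random(), 0.0
    for token, p in probs.items():
        cumulative += p
        if r <= cumulative:
            return token
    return token  # fallback for floating-point rounding

print(sample_next_token(next_token_logits))
```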

The relation between token distributions and language is not unlike the relation between 'ink distribution on paper' and language.

Both successful (non-bullshit, non-'hallucinating') and failed (bullshit, hallucinating) approximations are correct results from the standpoint of 'plausible next token selection'. Hence, LLMs do not make errors; even a 'hallucination' or 'bullshit' is exactly what must be expected, given how they work. Labeling them 'errors' implicitly suggests that 'correct' (from a standpoint of understanding) is the norm.

But as there is no understanding, there also cannot be an error (of understanding).

These systems can show skills without intelligence. We're not used to that in humans, so we tend, sloppily (because that is how our minds mostly work), to mistake skills (like perfect grammar) for intelligence.


> After all, a lot of bullshit is reported *about* the models *by humans* too.

Humans, however, are not automated bullshit. And in the fields that matter we use experts and teams that cross-check each other, the very experts AI was supposed to replace (like the X-ray-reading radiologists who were said to be made obsolete by now).


Actually, most of our intelligence is 'mental automation'. That is the only way we can reach the levels of performance and speed we have with the little energy we spend (20W or so).

See about 30 minutes in at https://youtu.be/3riSN5TCuoE


The intelligence that matters, however, is deliberate thinking and cross-checking and revision.

The remaining "most of our intelligence" for "performance and speed" is for tying our shoelaces, chewing gum, buying groceries, and playing tennis. Those things we can automate without ChatGPT.


GPT's functionality cannot be replicated by "classical" means. Neither can perception.

As AI does more complex work, sampling and imitation will have an even bigger role to play.

Integration with world models will have to be on top of that. It will provide understanding.


Wow, you and Gary are in agreement here (world models).

Actually, a non-discrete method (I did not say 'classical') is already doing much better than GPT: human brains. More reliable results at a fraction of the (energy) cost.

That 'integration with world models' sounds interesting. Do we have a practical and scalable idea of how to do this yet? My suspicion is that if someone thinks of a flexible, working way to do this, we will see serious scaling issues. But that is already a step too far; first someone needs to come up with a working way at all. Saying 'neurosymbolic' is easy, just like saying 'let's turn base metals into gold' is. Either the actual result, or a believable way of getting there, must be shown.


I don't think we can avoid massive scale and compute, at least at the level of correctly classifying the problems and getting the context.

The industry seems to want to add more stuff on top, such as the chatbot calling relevant models via code, or reasoning by imitation. I think it will scale well.


'Relevant' is the word that begs the question: "How?"

Secondly, on a scale of 0-10 (whole numbers), how certain are you it will 'scale well'?


We, people, know what is relevant based on experience. Experience can be taught to a bot via examples, with hints for what to do. For example, if a bot solves a physics problem, it helps to run a simulation rather than just repeat formulas; the simulation gives meaning to the words. For math problems, use a verifier, like AlphaGo does, one that knows what a group is, for example. I am not sure how far this can go.
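As a hedged illustration of the "simulation gives meaning" idea, here is a minimal sketch in which a proposed answer to a simple physics question is accepted only if a crude numerical simulation agrees with it. The function names and numbers are illustrative only, not taken from any real system.

```python
# Sketch of "simulation as verifier": a stand-in generator proposes an
# answer, and a brute-force simulation checks whether it matches the physics.

def proposed_fall_time(height_m: float) -> float:
    # Stand-in for a model's answer to "how long does an object take to
    # fall height_m metres?" (closed form: sqrt(2h/g)).
    return (2 * height_m / 9.81) ** 0.5

def simulate_fall_time(height_m: float, dt: float = 1e-4) -> float:
    # Brute-force check: integrate the fall step by step.
    t, v, y = 0.0, 0.0, 0.0
    while y < height_m:
        v += 9.81 * dt
        y += v * dt
        t += dt
    return t

answer = proposed_fall_time(20.0)
check = simulate_fall_time(20.0)
print("accepted" if abs(answer - check) < 0.01 else "rejected", answer, check)
```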


It seems the Substack site no longer works with Safari: the 'reply' button gets mixed up with the item below it, so longer answers can't be given. I'll go snippet by snippet.


To add, people learn principles after many concrete interactions with the world. Principles without underlying experience mean nothing. That's why sampling-based approaches are working better than giving machines priors.


"It's not easy to stand apart from mass hysteria" - Lewis, M. 2010. The Big Short. Penguin.


I’ve put together a timeline of the BS that OpenAI has served the world to date. It’s an eye-opening resource for anyone interested in facts over hype.

https://ai-cosmos.hashnode.dev/the-transformer-rebranding-from-language-model-to-ai-intelligence


The transformer was a twist on RNNs (which were already 20-30 years old). It changed that existing architecture by replacing state-passing between NN runs, and attention (also a rather old idea) to earlier states, with attention to the context instead. That was an inspired engineering insight, but not a fundamental change.

What transformers enabled was basically one thing: massive parallelism during training (because the serial dependency between runs was gone). This allowed massive growth of the until then relatively puny RNN-style models. Enough to get output with perfect grammar. Not enough (nor will scale ever bring it) to get output from actual 'understanding'.
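A rough toy contrast of the two architectures, just to show where that parallelism comes from. This is a NumPy sketch with illustrative shapes only: a single weight matrix stands in for the separate query/key/value projections, and the causal mask and multi-head attention are omitted.

```python
import numpy as np

T, d = 6, 4                       # toy sequence length and hidden size
x = np.random.randn(T, d)         # token embeddings
W = np.random.randn(d, d) * 0.1   # one toy weight matrix

# RNN-style: each hidden state depends on the previous one, so training
# has to walk the sequence one position at a time.
h = np.zeros(d)
rnn_states = []
for t in range(T):
    h = np.tanh(x[t] + h @ W)
    rnn_states.append(h)

# Attention-style: scores between all pairs of positions are computed in
# one batch of matrix multiplications, so every position is processed in
# parallel during training.
scores = (x @ W) @ x.T / np.sqrt(d)                        # (T, T)
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
attn_states = weights @ x                                  # (T, d)
```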


Thanks for doing this. "PhD level intelligence" is an interesting bar. Paper submissions by eminent PhDs are reviewed; they make more reasoning errors than one might think. Political discourse and election returns suggest we may reason less and regurgitate more than we realize. Why does almost everyone worldwide end up with the religion of their community despite easy access to alternatives? Human intelligence may be a somewhat lower bar than what we are demanding. "Confabulating" was not first a technical term. That said, reporting that "fig" is a four-letter name is weak.


I’ve been baffled by how many people have said, ‘ChatGPT 3.5 is great, but just wait until ChatGPT 10 comes out… it will change the wOrLD, no humans needed anymore.’ You can’t simply extrapolate the current trend. The law of diminishing returns is real. This image explains it best:

https://preview.redd.it/8ggg9nwli2061.png?width=1080&crop=smart&auto=webp&s=d0a84def4c8ae19f356f400b2ab25487bc702c73


This is a well-written piece. "Truth" does not exist in the training sets, as far as I am aware!


In this context, truth is a label and we recently saw that any label, repeated long and often enough, will be received as truth by someone.


Yes, but to be fair, pre-processing and post-processing now filter out many confabulations, e.g. by adding a search process. In the early days, asking for a person's co-authors, it was not unusual to get people working in the same area but zero actual co-authors. Now it usually gets it right, although it may list a one-time co-author and miss one who co-authored many times. Early GenAI tools failed abysmally on my test question, "Provide a list of fruit with four letters in their names." Now they only fail occasionally. ("Fig" often appears, including from Copilot a couple of minutes ago; "strawberry" no longer comes up.) But used sensibly, the tools are good for search. What was a desktop online game popular in 1980? I can't recall, but if it says Rogue, I recognize it. When it cites decent sources you can check, a reference may not support the point (it confabulated), but more often it does, which is a good first step for the person searching. This is not about the LLM itself, but about how much a new Lenat-style AI undertaking is needed; and post-processing adds to the cost and the toll on electricity and groundwater. Would I pay as much for improved search as it costs to deliver? Maybe not.
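For what it's worth, the four-letter test is exactly the kind of claim a trivial post-processing check can verify outside the model. A minimal sketch, where the candidate list is a made-up model output:

```python
# Made-up model output for "list fruit with four-letter names".
model_output = ["kiwi", "fig", "pear", "plum", "lime", "strawberry"]

kept = [name for name in model_output if len(name) == 4]
dropped = [name for name in model_output if len(name) != 4]

print("kept:", kept)        # ['kiwi', 'pear', 'plum', 'lime']
print("dropped:", dropped)  # ['fig', 'strawberry']
```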


I find ChatGPT and Claude very good for search, for certain purposes. For most searches, the traditional search engines suit me better. But the AIs are very handy for summarizing a topic. They are also great for generating suggestions.


I cAn'T DO mATh! Numbers is too hard!

1, 10, 2, 5, 3, 7, 19, 20. There! All the numbers!


"I would bet you any sum of money you can get the hallucinations right down into the line of human-expert rate within months."

Is he still accepting this bet? If so, I'm in for $1G. On second thought, make it $1T!


I have been playing with Claude writing psychiatric fitness-for-duty reports. It wouldn't give me the template unless I said (not proved) that I was a psychiatrist or psychologist, but that was better than ChatGPT, which didn't ask about credentials to write the report. Moreover, Claude prompted me not to use an employee's real name or date of birth in working with Claude, because that would be a privacy (HIPAA) violation. It suggested I use a pseudonym and placeholder dates instead and put the identifying data only in my final report, which Claude would not see. Claude's template was better than the one I usually use and much better than the template ChatGPT offered. I was impressed by the format and the legal and privacy warnings in Claude, enough so that I subscribed. But I will use it more for a kind of proofreading without the identifying data and see if that speeds up my processing. I do not intend to use ChatGPT again without major improvements.


Yes, we've seen a lot of "engineering the hell out of it" (or around it).

Regarding 'search', OpenAI looked (as safety research) at the difference between professionals and amateurs using search either with GPT-4 or without. It turned out the amateurs got slowed down (they went in unproductive directions because they did not recognise probable bullshit), but the professionals got sped up. Slightly.


Yes. I find that properly and precisely defining a role is critical in both models to obtain proper output.


"At the end of the Ezra Klein interview, I called for a massive investment in new approaches to AI."

The problem is that there is no reason to believe that "massive investment" in new approaches will have any more of a payoff. The necessary breakthrough from "new approaches" has not yet happened, and there is no particular reason to believe that it will happen any time soon.


This is Chomsky's poverty of the stimulus: there simply is not enough information in language to learn anything about the world. When you have a system that runs on a 9-volt battery and can learn any human language after a few months of natural exposure, it will not have these problems.


I just had an interesting discussion with Claude 3.5 Sonnet, which I initiated by asking if it was familiar with the work of David Hays (my teacher). It began by acknowledging that "this involves fairly obscure historical details from computational linguistics, I want to be careful about potential inaccuracies in my knowledge." It hallucinated the title of a 1960s book on machine translation, but acknowledged that it might not have gotten it right and urged me to check the citation. But it got some things right, mentioning that Hays had worked at RAND and that he had championed dependency grammar. We went on to have an interesting conversation involving Quillian, Schank and Abelson, the perceptual control theory of William Powers, the symbol grounding problem, and the conceptual spaces of Gärdenfors.

It occurred to me, ironically, that 3.5, or some later iteration, might be able to give useful advice about moving beyond LLMs, because it is working from a much larger information base than the current crew in Silly-con Valley.


There are hallucinations, and there are hallucinations. Yes, it still hallucinates, but the point is whether those hallucinations are absolutely useless, and the answer is no; at times they are really suggestive. The question is whether you want a completely truth-telling AI or a useful one. Also, LLMs are already moving on to the next architectural paradigm, and by the early looks of it, it's going to be a game changer.


Bullshit is not good but is it worse than typical human motivated reasoning?

I've agreed from the start that LLMs are not AGI and will not get us to AGI. Another approach is needed.


One must understand that the system-level prompt is to be “a helpful AI assistant.” Wrapped into the word “helpful” is the expectation that it doesn’t say “no” or “I can’t.”
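A minimal sketch of what that looks like in a chat-style request; the model name and fields here are placeholders, not a specific vendor's API.

```python
# Placeholder request structure; the point is only where the system-level
# instruction sits and how "helpful" frames everything that follows.
request = {
    "model": "some-chat-model",  # placeholder
    "messages": [
        # "Helpful" quietly biases the model against answering "no" or
        # "I can't", even when that would be the honest response.
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Summarize this 500-page report in one sentence."},
    ],
}
```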


"At the end of the Ezra Klein interview, I called for a massive investment in new approaches to AI. I still think that is the way to go."

I would think that a fairly massive investment is underway in various skunkworks, but without any light at the end of any path conceived so far, we're not hearing about it.

Besides, right now the money is in the stock prices. Big tech is rich beyond reason without having to produce anything but Elon's "AGI by 2026" and the like.


I am all in favor of calling out dishonest marketing, but today is a day for gratitude. Is there truly nothing you would like to express gratitude for? It doesn’t even have to be directed at OpenAI. 🫣
