114 Comments

"GPT’s regurgitate ideas; they don’t invent them"... That is all you need to know about current and future versions of these models.

Oh, and that the incentives remain aligned to keep these technologies overhyped, so that billions of dollars keep flowing into the field.


You guys that seem to know how these model work better than the actually scientist studying them and building them is just so funny to me 🤣.

The overhype doesn't make billions go into the field. Profit maximization does. And GPT models have both threatened and attracted a lot of it.


A bit to unpack here... First, I don't say that there aren't real profits that will come out of this or that the models aren't useful. My point (and I believe Gary's as well) is that the promise that AGI is "around the corner" or that "humanity will be rendered useless by these models" is channeling insane amounts of funding into these efforts. A good analogy would be the dot-com bubble of the late 90s. The internet eventually made a lot of money, but funding pets.com wasn't the best investment.

Second, I'd suggest checking today's post in this Substack regarding the "know how these model work better than the actually(sic) scientist studying them"... I don't need to know how to develop these models to understand their basic mechanics - "The AI systems that power these chatbots are simply systems (technically known as “language models” because they emulate (model) the statistical structure of language) that compute probabilities of word sequences, without any deep or human-like comprehension of what they say. "
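To make that quoted description concrete, here is a toy sketch of what "computing probabilities of word sequences" means - a bigram counter over a few sentences, nothing like GPT's actual transformer architecture, but the same basic idea of predicting the next word from statistics of the training text:

```python
# Toy bigram "language model": estimate P(next word | previous word) from a
# tiny corpus, then extend a prompt by sampling from those probabilities.
# Only an illustration; real GPT-style models use transformers over subword
# tokens, not bigram counts.
import random
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1                      # how often `nxt` follows `prev`

def next_word_distribution(prev):
    total = sum(counts[prev].values())
    return {w: c / total for w, c in counts[prev].items()}   # P(w | prev)

def continue_text(prompt, n_words=5, seed=0):
    random.seed(seed)
    words = prompt.split()
    for _ in range(n_words):
        dist = next_word_distribution(words[-1])
        words.append(random.choices(list(dist), weights=list(dist.values()))[0])
    return " ".join(words)

print(next_word_distribution("the"))   # {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}
print(continue_text("the cat"))        # e.g. "the cat sat on the rug ."
```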

So: I agree with you that there are profits to be made, but I disagree if you think there isn't an insane amount of hype, and I disagree that one needs to understand these models in depth to know how they work and what their limits (as they currently exist) are.


It isn't that AI has no potential- I anticipate it leading to some real breakthroughs, particularly in fields like medicine, materials technology, engineering, and design. I'm looking forward to an AI program that's capable of designing an optimal tertiary water treatment plant incorporating biological, fungal, and biochar processes for waste removal and water reuse, for example. Or modeling California watersheds from the headwaters down, to find the best locations for coffer dams that won't get swept away by floods, and the best places for pumped storage facilities to draw off water during flood season with the least ecological impact. The Water Resources Board apparently has a lot of data. It's a daunting challenge for humans to collate and apply it all, though. I'd bet that a good AI program could help a lot with projects like that.

But the infatuation with inducing "self-awareness" is absurdly misplaced. AI is arguably best reserved for purposes entirely outside the realm of traditionally human-to-human communications and interactions.

As far as its military applications: in a robot war- or a robot-run war- all the human casualties would be noncombatant, in some sense. I suspect there would be a lot of them. Intimidation of the human aspect of the opposing side would seem to me to be the primary emphasis in any escalation of robot warfare. Machines have implacable mission focus. They can slaughter and destroy without hesitation or remorse.


Only if you go by a very broad interpretation of what it means to "regurgitate". It hasn't literally invented a new technology, but it can answer unseen questions and solve new problems requiring insight and an understanding of code. GPT-4 surpassed GPT-3.5 at almost every objectively quantifiable task, including common-sense reasoning, the bar exam, and coding challenges.

The claim that what it currently can't do will be impossible for the indefinite future hinges on the "shouldn't be possible for a word predictor" rationale, and god knows how many of those have been proven wrong in the past 5 years.


All I can say is that the Dartmouth Conference promised in 1956 that it would take one summer and ten researchers to engineer human intelligence; then in 1957 Herbert Simon said that "there are now in the world machines that think, that learn and that create", following it up with the bold prediction that by 1985 "machines will be capable of doing any work man can do"; Marvin Minsky said in 1967 that "within a generation, the problem of creating AI will be substantially solved". This (along with far more recent examples, but I like to use history to show how often we have been promised that AGI is around the corner) is what I mean by "regurgitating" the promises of AI...

My argument is that we need to start by defining what we mean by intelligence... If we are talking about the ability to extrapolate from large sets of information and, by inference, reach the right conclusions most (rather than some) of the time, then yeah, I think a future version of GPT will achieve that. If we are talking about conceptualising, imagining or making new, original discoveries, I think this will be another Dartmouth Conference moment.

Oh, and lastly... just following some Occam's razor logic... What is more probable? That we are on the brink of achieving something we don't at all understand (AGI)? Or that a bunch of (admittedly very smart) tech bros are getting billions of dollars in funding to develop AGI?


"Regurgitate" the promises of AI is done by humans, and is separate from the claim that "GPT's regurgitate ideas; they don't invent them". I can't predict when AGI will appear so I won't strongly disagree with opinions about when it'll happen. I was only responding to the commonly held belief that GPT can't come up with "new" ideas.

I agree that inventing new things is in a different class than solving "new" unseen problems that are in the same domain as seen problems; however, it bears mentioning that the latter was considered impossible for LLMs until it happened. And so were dozens of other tasks before it, if you go back 10 years. Literally every interesting task GPT-4 can do today was thought to be impossible by experts years ago.

"The Unreasonable Effectiveness of RNN's" is a great article which gives perspective about what people thought *should* be possible for word predictors to do, back in the day, before GPT was invented. GPT has a longstanding history of completely shattering these expectations, so I don't consider it a persuasive argument that it's impossible, only a persuasive argument that we shouldn't assume it will be possible.


Read my "What to Expect When You're Expecting GPT-4" and tell me what I got wrong


In that comment I wasn't talking about your predictions or articles; I was specifically focusing on the comment's claim that GPT can't come up with any "new" ideas, and I was saying that's only true for a very strict interpretation of "new idea".

Since you brought it up, regarding the article you're referring to:

1. GPT-4 still makes stupid errors sometimes, but the incidence is hugely lower than GPT-3.5. So technically not wrong, but the claim is worded such that it's almost impossible to be wrong

2. I would argue this claim was proven wrong because it can answer physics questions as well as reason its way through a maze.

3. "Common" is not well-defined, similar to #1, but I would side with it being wrong because incidence of hallucinations lowered a lot (and has been doing so incrementally every subsequent model prior)

4. Probably wrong? There was an advert doing the exact thing you describe. Time will tell how well it works in practice.

5. We can agree this claim is true

6. True and seems to be a different topic

7. Only time will tell


All fair and good points. However, where I'd still have some disagreement is over where the limits of LLMs/GPTs are. There is always a possibility (and I believe @Gary Marcus spoke about this in a previous post) that, combined with other advancements, LLMs/GPTs could get us closer to true intelligence. I would still argue that if you are using statistical models fed by large sets of static data to solve problems, it is hard to imagine (no pun intended) how they could come up with new ideas on their own.

I'll check the article you suggest. Thanks for sharing the information.


There's some bleak humor in the fact that people who have spent basically two decades trying to persuade everyone to overcome their own biases and make their reasoning capabilities more quantitatively rigorous do not see the huge potential for motivated reasoning and confirmation bias in the post-hoc narrative interpretation they're doing about interactions with a chatbot.


That is not true; the hype surrounding this (at least among the expert community rather than laypeople) is based on objectively quantifiable tests, including longstanding benchmarks commonly accepted as tests of AI cognition. Traditionally these are things like common-sense questionnaires, but they have been expanded to include harder tests such as the bar exam and coding challenges. GPT-4 outperforms GPT-3.5 at all of these tasks, and outperforms humans at many of them. At the end of the day what really matters is whether something can successfully solve a task. Whether it is philosophically using a "fake" or "real" understanding of how to solve it is immaterial.


Hi Gary, another timely article that counters the hype, thank you for it!

(GPT v)5 is not better than 4, and 6 is not better than 5, when it comes to developing AGI. Adding more and more data, while keeping the same underlying computations, is not going to lead to a threshold being crossed beyond which the system flips over to being intelligent! That is magical, delusional, flawed thinking. The ladder getting taller won't take us to the moon. The problem, in two words: wrong architecture. "Emergence" (which some claim occurs in LLMs) is not related to quantity at all - rather, it arises from certain architectures that permit path-dependent, local interactions among components, which lead to changes of state.
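To illustrate what I mean by emergence coming from local, path-dependent interactions rather than from sheer quantity, here is a classic toy example (my own illustration, nothing to do with LLMs): an elementary cellular automaton whose cells follow a purely local rule, yet whose global pattern is famously complex.

```python
# Elementary cellular automaton (Rule 110): every cell updates from a purely
# local rule (itself plus its two neighbours), yet the global pattern that
# unfolds is famously complex - emergence from the kind of interaction,
# not from the amount of data.
RULE = 110
rule_table = {(a, b, c): (RULE >> (a * 4 + b * 2 + c)) & 1
              for a in (0, 1) for b in (0, 1) for c in (0, 1)}

width, steps = 64, 32
row = [0] * width
row[-1] = 1                                   # a single "on" cell to start

for _ in range(steps):
    print("".join("#" if cell else "." for cell in row))
    row = [rule_table[(row[i - 1], row[i], row[(i + 1) % width])]
           for i in range(width)]             # strictly local update
```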


This makes no sense if you just consider the history of LLMs over the past 5-10 years. How can you say emergence is not related to quantity at all when clearly every subsequent iteration of GPT showed objectively more capabilities and emergent problem-solving abilities than the previous version? GPT-1 was barely coherent. GPT-2 couldn't write more than a few sentences of believable fiction, nor even begin to solve coding challenges. GPT-3 couldn't pass a bar exam or reason its way through a maze with text only. In case you thought this was only anecdotal, you can peruse their performance on scientific benchmarks, which consist of standardized sets of questions that weren't seen in training.


Understanding comes from experience. It doesn't come from computations on symbols. Note that those symbols (language, math...) are human-generated. We create language etc to communicate shared meaning. We ask the LLM a question (from a bar exam) - we know what the question means. The system computes on its tokenized inputs, produces output in the form of language - which we humans understand. The system itself has no understanding of what it computed. Why not? Because meaning doesn't reside in symbols alone.


Like I said, I'm focusing my claims on empirical/scientific evidence. So instead of trying to figure out what it "actually" understands on a philosophical level, the best we can do is evaluate what tasks it can solve, objectively. We do this by giving models commonly accepted standardized benchmarks/tests for AI and seeing how well they perform relative to other models or past versions. It has been widely accepted that at-or-above-human performance on these tasks requires some level of "understanding".
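As a sketch of what that kind of objective comparison looks like (ask_model is a hypothetical stand-in for whatever API actually serves the model, and the benchmark item is an invented placeholder):

```python
# Hypothetical sketch: score two model versions on the same held-out questions
# and compare accuracy.
def ask_model(model_name: str, question: str) -> str:
    raise NotImplementedError("stand-in: wire this to a real model")

benchmark = [
    {"question": "Which weighs more, 2 pounds of feathers or 1 pound of steel?",
     "answer": "feathers"},
    # ... more held-out items that were not seen in training ...
]

def accuracy(model_name: str) -> float:
    correct = sum(item["answer"].lower() in ask_model(model_name, item["question"]).lower()
                  for item in benchmark)
    return correct / len(benchmark)

# Once ask_model is wired up:
# print("gpt-3.5:", accuracy("gpt-3.5"), "gpt-4:", accuracy("gpt-4"))
```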

If you go down the rabbit hole you can use your logic to argue a human brain is nothing but a muscle-activation-predictor, and doesn't have "true understanding" of anything. We can find no evidence that you actually feel emotion when you cry as opposed to just being a really good faker. Your brain doesn't have the qualia "red"; it only has electrons/ions representing a facsimile of the concept. The claim is impossible to disprove!


The human brain is not merely an activation predictor. Do you realize there is such a thing as pre-verbal, and non-verbal, understanding? It's about the brain being in the body, which is in the world and interacts directly with it, modifies it, is modified by it, learns from it, remembers interactions, i.e. "experiences" it.

"Understanding" can be non-math-y, non-language-y, non-musical-notation-y, etc. None of these are symbolic, so, are out of reach of LLMs, or for that matter, every digital processor, and every digital form of AI (including reinforcement learning, symbolic reasoning, brute force search, etc.).

I do get what GPTs do; I teach ML and DS in my courses. I'm not at all disputing how effective the new crop of genAI is; it's mind-boggling. All I'm saying, going back to my initial comment which you commented on (where I implied it), is this: grounded understanding is non-symbolic, and all AI lacks that.

All the solipsist BS about qualia, 'you can't prove' etc. is irrelevant to the fact that embodiment adds a fundamental form of understanding ("experience") that symbols cannot provide. You and I understood the world when we were little, not by silently being shown terabytes of scrolling text. Symbolic understanding is layered atop a non-symbolic base. You can dismiss 'experience' because it can't be quantified, measured, evaluated, etc., which simply means you're going back right into your symbolic camp.


I agree that fundamentally LLMs learn differently and lack experience and embodiment. And it is highly likely that LLMs alone are not the most efficient path forward towards AGI. The thing I'm skeptical of is claiming with complete confidence that lack of experience/embodiment means a specific type of problem-solving or intelligence is totally impossible no matter how much you scale up.

The evidence I'm using is historical expectations vs. accomplishments. Almost every interesting task that GPT-4 can do today was also thought to be impossible for LLMs and to require physical experience of the world. It can reason its way through a maze without having visually seen one, which it shouldn't be able to. It can say that if an object lands on one side of a lever the other side will go up, despite not having seen this in real life. It passes the trophy-and-suitcase test and its variations, which were thought to require physical understanding, etc. It can now solve math problems, including arithmetic calculations, with pretty high accuracy, which again should by all rights be impossible for something that operates via word association. At some point, even GPT-2's capabilities were thought to be impossible. No one expected an AI that's only programmed to predict the next word to be able to write remotely coherent fictional articles. That was thought to require human-level creativity and understanding.

All these tasks were at first impossible with smaller models and earlier versions of GPT, then at some point made possible by scaling up the data and model size, which directly contradicts your claim that scaling up can never lead to more emergent capabilities/intelligence. In light of everything once thought impossible for LLMs that is now possible, why should we draw the line so confidently going forward?

Also, I think it is a mistake to say "understanding" is out of reach of "every digital form of AI", unless you are making a metaphysical argument. In the worst case scenario, one can simulate a whole human brain and all physics involved to achieve parity in behavior with a robot body (but most experts agree human-level intelligence most likely does not require simulating the entire human brain).


"One can simulate a whole human brain and all physics involved to achieve parity in behavior with a robot body"

No, our science doesn't have the knowledge to model that parity. Not even close. Neuroscience is in its infancy. Moreover, there isn't one "the brain" - humans possess brainS, plural. Quite unlike mass-produced factory goods, like circuit boards. (To note the shared similarities is merely to state the obvious - but "similar" is not the same as "identical." This is a situation where minor differences count.)

Brains are wetware, using dynamic circuitry activated or suppressed by complex chemical reactions. At the assembly-code level, the circuitry is binary, yes. But for all practical purposes, human brains - or even flatworm neurology - partake of extra dimensions modified by dynamic chemical interactions, not DC electrical switches inducing signals in stable silicon chips.

[ edit- references added May 7 2023; I went looking for studies to support my points, and danged if I didn't find some ]

https://arxiv.org/abs/2304.05077

https://pubmed.ncbi.nlm.nih.gov/32830051/

https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/cit2.12035


IF “Open”AI actually believes they are about to create AGI, something that will transform the entire world, then wouldn't it be to the advantage of the U.S. government to intervene and gain some control? I highly doubt the U.S. government wants to lose control in a situation like this. And let's say the government did get involved and AGI did not become reality; the chances of them getting uninvolved are pretty slim. Maybe instead of "be careful what you wish for" it should be "be careful what hype you spout".


Six month pause or not, it seems inevitable that we will soon be - already are, to a degree that I suspect will seem tiny in retrospect - surrounded by a media fog of images, thoughts, and opinions that may be real and may be just very clever forgeries, with no reliable way to tell the difference.

To say that this will make our lives more complicated feels like a massive understatement. And no complexity beyond what's currently available publicly is required.


AI has arisen from an accelerating knowledge explosion. If AI were to vanish, completely solving all AI concerns, the accelerating knowledge explosion would keep right on rolling forward, and before long we'd be biting our fingernails about some other emerging technology. And then another. And another.

It's a loser's game to limit our focus to this or that particular technology because new technologies will continue to emerge faster than we can figure out what to do about them. The winning hand would be to learn how to take control of the knowledge explosion itself, the machinery generating all the threats.

One way or another, taking control of the knowledge explosion is going to happen sooner or later. It will happen either in an orderly manner through the processes of reason, OR, more likely, in a chaotic fashion in response to some Biblical-scale calamity.

Human beings are not gods. We don't possess unlimited ability. It's the simplest thing.


I'd be very interested in seeing a public debate (either in a podcast or in article form) between you and Eliezer.


Could be interesting; people have been so mean to him lately I have actually defended him a few times on Twitter. I wouldn't argue that there is no long-term risk, but I would certainly argue against his extremely high-certainty position


Of course there is the recent Stuart Russell/Sam Harris/me podcast


If there is a possible long-term risk, then Yudkowsky's extremely high-certainty stance may have a certain logic to it.

For one thing his adamant approach seems helpful in generating discussion and debate about a possible long term risk. For another thing, if there is a long term risk, now would be the time to get excited about it, not once the risk is upon us and there's little we can do about it. "Let's wait and see what happens" may not be the most rational approach to what could possibly be a very serious threat.

Looking back it should be clear that it's too bad that we didn't get more excited about nuclear weapons in 1946. "Let's wait and see what happens" landed us in a super dangerous situation that we now have no clue how to get out of. The AI community generally seems largely incapable of learning much from this well documented history.

I keep asking, "What are the compelling benefits of AI which justify taking on more risk at a time when we already face so many other serious risks?"

Given that nobody I've read has provided a credible answer to that question, I find myself sympathetic to Yudkowsky's "shut it all down" proclamations. He might not be the best messenger for that point of view, but the logic of his view point makes sense to me.

Even if AGI never happens, we just don't need all this AI distraction and noise right now. We need all these very smart people focused on the existential threats we already face, right now, today, not creating even more threats.


Yeah, I appreciate your balanced perspective on his claims, and that you're willing to disagree with some while still defending him against the barrage of bad faith attacks he's getting.

He's been doing a lot of podcast interviews lately, so I'm guessing he'd be open to something with you. Not sure though.


Wouldn't be that interesting because he doesn't know what he is talking about. The Singularity crowd's grasp of brain functioning got as far as: Yea verily & forsooth, the electrick fluid floweth full mightily along the neurone fibre ... and stopped.

Just as an obvious example, they have yet to grasp astrocyte ligand signalling.


It's not a given that we'd need to simulate an entire human brain down to the molecular level in order to get human-level intelligence


I understand that your role is to be a skeptical voice about AI in general and about current directions in AI especially. This is important. Much of the time you do it well, constantly raising important questions as well as pushing the brakes on the tech bros. All good.

I do want to make a kind of suggestion. Something to think about. And I could be wrong, btw. And maybe it is best for you just to go full-out skeptic and let the chips fall where they may. But sometimes, to someone like myself, who does not have a dog in this fight and who is interested in hearing all sides, I think you can do damage to the points you are making by being so extremely dismissive of the possibility that some genuinely interesting stuff is going on here.

For instance, you say about Yudkowsky's tweet that, "If GPT-4 actually understood the instructions, in a repeatable and robust way, I would have to reevaluate a lot of my priors, too." Which is a fair point. Repeatability and robustness are important. But Yudkowsky says explicitly in his tweet that it is not so much the success or failure of the compression that impresses him. He is saying that the very fact that ChatGPT even has any idea of what this task might be, how you would go about doing it, how it would write a sort of secret code for itself, what its 'self' is and how it could potentially talk to that 'self' in the future, that it could 'try' to do such a thing at all, is enough to give one pause. I mean, a fair amount of stuff has to be going on behind the scenes, in the transformer or whatever, for such a thing to happen. Even just to make the attempt and to get the basic idea of what such a compression would look like. I would probably also fail at creating a compression that I could then uniquely decode, say, a few years from now (ChatGPT has the 'advantage' in this case that without memory you can test it right away).

Personally, I think your objections would be more persuasive if you allowed yourself, like Yudkowsky, to pause for a moment and reflect on what an accomplishment that is and how much weird stuff must be going on in the internal architecture that GPT4 can make a reasonable go at a task like this and even in some cases make a darn good compression. Even the fact that it does better and worse in different attempts and with different users suggests a kind of flexibility and openness that feels, I don't know, uncanny at the very least.

You can acknowledge all of that and still say, wait, let's not go too far here. There is much that is missing in this model. It might still be a ladder to the moon. But denying and, in a sense, ignoring Yudkowsky's actual point makes your point less persuasive not more persuasive. In my opinion. Perhaps I am in the minority and it doesn't matter. But still, something maybe to think about.

I say this with respect to the importance of your skeptical position, not in an attempt to refute it or change your basic stance. I hope that comes through. Thank you for your work and passion.


I suppose the greatest persuasion will come from the hands of Time (not "The Times"), this master who, sooner or later, unmasks our illusions and lies .. I suspect that Yudkowsky is not a good disciple of Time .. but let's wait, no problem .. but .. in the meantime, pray that humans keep cultivating and applying their intelligence to suggest new explanations for the different problems we have, for example how to cure cancer, how to efficiently implement nuclear fusion, etc., otherwise, when you get tired of waiting, you will additionally not have the explanations you need to have a better life .. or even worse, we will have become idiots enslaved to our plainly stupid tools ..


GPT will never get a man to the moon, but it can lie as well as Buzz Aldrin.


First sentence in the article:

"AI doesn’t have to be all that smart to cause a lot of harm."

I agree. The problem isn't the AI, it's the people who believe what the AIs generate.

Think of an AI as a virus. What's inside the virus might be potentially lethal to the host. But in order to infect, the host has to have a receptor that lets the virus enter its cells. Once that's done, it can be pretty much game over for the host. If there's no such receptor, there's no risk. For hyperskeptical people, it doesn't matter what the AI says since it won't be believed.

Stupid people tend to be gullible. So do lots of smart people, for that matter, but recall that, by definition, half of the population has below-median intellect--that's what the median means (and for IQ, the average sits very close to the median). So you're right: AIs don't have to be very smart to cause a lot of damage, since many if not most people will buy the bill of goods. AOC and Biden worry me a lot less than the fact that people elected them. There are a lot more voters than people in office.


The problem is that a stupid tool will make us more stupid than we are right now .. why? Because, for example, by writing we are thinking; imagine now a (especially a young) person "offloading" the task of writing to GPT-x .. well ..


Great informative article, as usual. Thanks.

"But GPT’s on their own don’t do scientific discovery. That’s never been their forte. Their forte has been and always will be making shit up; they can’t for the life of them (speaking metaphorically of course) check facts. They are more like late-night bullshitters than high-functioning scientists who would try to validate what they say with data and discover original things. GPT’s regurgitate ideas; they don’t invent them."

Haha. I love it. GPT-x is a bullshit fabricator. Maybe someone needs to invent a GPT-bullshit detector and make a killing?

"Some form of AI may eventually do everything people are imagining, revolutionizing science and technology and so on, but LLMs will be at most only a tiny part of whatever as-yet-uninvented technology does that."

I am much more pessimistic than you about the relevance of LLMs to cracking AGI. It's less than ZERO.

"This bit the other day from Eliezer Yudkowsky is pretty typical:"

Haha. The much vaunted high intelligence of Yudkowsky (aka Mr. Less Wrong) has been greatly exaggerated. Is Eliezer on OpenAI's payroll or is he really that gullible? This may need to be investigated. :-D


Couldn't agree with you more about LLMs and AGI.


Tried the encoded message on another GPT-4 - my most creative Midjourney prompt generator to date. Lossy semantic compression is not a magic zip algorithm, and these things are non-deterministic anyway. It is known. Hard to say what the future holds when the near past has surprised us so much. We are not designing them; we grow them and see what emerges, which is far beyond expectation and inexplicable, so saying "it can't do X all the time, therefore in future it can't" is really tempting fate. One of your examples up there didn't follow instructions; please try to make fewer mistakes unless you are still being trained up.


The intelligent person will see GPT as a tool, not as an independent decision maker in and of itself. It should be used as a resource for assisting with compositions, but no one with an actual working brain would consider taking the output of GPT at face value.


AI with English explanations but without GPT "hallucinations"

--------------------------------------------------------------------

You may like to get to know Executable English.

It's a browser platform for inputting knowledge in the form of English syllogisms, for using the knowledge for analytics, and for explaining the answers in English.

* It works with everyday English and jargons

* The vocabulary is open, and so is most of the syntax

* Needs no external grammar or dictionary priming or maintenance

* Supports non-programmer authors

* Avoids ambiguities via context

* When needed, it automatically generates and runs complex networked SQL queries.

The platform is live online, with many examples. You are invited to write and run your own examples too. All you need is a browser pointed to www.executable-english.net. If you are reading this, you already know most of the language!

Thanks for comments, -- Adrian

Adrian Walker (Formerly at IBM Yorktown)

Executable English

San Jose, CA, USA

USA 860-830-2085 (California time)

www.executable-english.net


GPT-4 is supposed to have vastly better reading comprehension, problem solving, and reliability than GPT-3.5. I haven't found that to be the case in my interactions with it, which is why I find myself in agreement with Dr. Marcus's skepticism about GPT-5's potential capabilities.

GPT-4 still shows a lot of the same brittleness that GPT-3.5 did. Ask it which weighs more, 2 pounds of feathers or 1 pound of steel, and it will often (its answers are not deterministic) tell you that they weigh the same.

Even in areas where it seems to have improved, it will fail if you slightly tweak how you prompt it. If I ask, "which is faster, a cheetah or a toyota corolla," it correctly answers the Corolla. But when I asked it just now, "a cheetah gets into a race with a corolla. Which one is faster," it told me the cheetah, even after stating that the cheetah's top speed is 71 mph and the Corolla's top speed is 118 mph.

A lot of people seem to think they can guide GPT-4 to good answers through special prompting. Not only would this be unpredictable for questions where you don't already know the answer (how would you know what the right prompt would be?), but it doesn't even always work for questions where you do know the answer. One example: people claim you can solve GPT-4's inability to count words by asking it: "can you count the number of words and then show me x word." On short sentences this seems to work. But I tried it on a really long sentence and it got it wrong. I tried it a second time and it did something tricky. First, I asked it to tell me the 30th word in the sentence. It got it wrong. I then tried the method: "can you count the words in this sentence and then tell me the 30th word." It told me the same incorrect word, but its count showed it as correct. Turns out it omitted words 24-29 in its count. Tricky bastard.
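For reference, the deterministic version of that task is a couple of lines of Python (the sentence below is just a placeholder, not the one I actually tested):

```python
sentence = "This is just a placeholder for the really long sentence used in the test."

words = sentence.split()                    # naive whitespace tokenisation
print(len(words))                           # total word count
n = 30
print(words[n - 1] if len(words) >= n else f"only {len(words)} words")
```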

Counting words in a sentence is a simple task that Python can do. That's not really the issue. GPT-4 is brittle and unreliable, even when hooked up to tools (the currently proposed solution to its woes). For instance, Bing Chat, running on GPT-4 and connected to a search engine, still hallucinates. I asked it to list the books of my major professor. Two of the books it listed weren't his, and it mischaracterized one of them. It cited my professor's university page. All the correct information was there, but none of the incorrect information.

Even as a UI, its reliability is in question, because it frequently misinterprets text prompts. I asked GPT-4, "who would win in a race, a cheetah or a man driving a prius?" It misinterpreted the question and answered that the Prius would come in first, the cheetah in second, and the man in third. I asked, "a cheetah races a prius in a 400 meter dash. Which wins?" GPT-4 answered that it depends on whether the race is more or less than 300 meters, and that without more information it couldn't answer the question. I presented the following scenario: "Mike has 12 Apples. Sally has 3 cakes. John has 9 pies. How much must each person give to each other person to equally distribute the apples, cakes, and pies?" It misinterpreted the question as asking how I could ensure there are an equal number of apples, cakes, and pies.


None of what you said is true. I tested those prompts, and I assume RLHF is ironing out the kinks in it right now. You are underestimating how much RLHF improved GPT-3 and how much room GPT-4 has to improve.

It's also clear that much smaller models can approach GPT-3 performance with high-quality training. I think all of this points to how much the quality of training is being underestimated.


I tested GPT-4 today. It told me that:

1 pound of steel is not lighter than 2 pounds of feathers.

A jumbo liner traveling 540 mph is slower than a peregrine falcon diving 240 mph.

A 10 meter long strand of hair is shorter than a 7.67 meter long reticulated python.

An average major league fastball (90-95 mph) is slower than a cheetah.

Thing is, depending on how you ask it, it might get the answers to similar questions right. When I asked it, "is 2 pounds of feathers heavier than 1 pound of steel," it said yes. When I asked it, "is 1 pound of steel lighter than 2 pounds of feathers," it answered no. Ask it again tomorrow and it might give a different answer.

In a lot of ways its performance improves with training, but its understanding and reliability remain brittle.

I asked Bing Creative, Bing Precise, and ChatGPT-3.5 twenty sets of super simple questions, with 5 variations of each question. This was to test understanding. The idea behind it is that if a person understands something, they can consistently answer variations of the same questions about it. The key on this test is not the crude score, but the set score, i.e. how many times the AI could successfully answer every variant of a question. For this test, it would be better to score 40/100 by answering 8 sets correctly than 80/100 by answering 4 of the 5 questions for every set. This is a test that almost any motivated, cognitively normal 10-year-old could score 100/100 on.

Bing Creative answered 46/100, with a set score of 0/20.

ChatGPT-3.5 answered 44/100, with a set score of 6/20.

Bing Precise ostensibly did better, with a crude score of 73/100, but its set score, 5/20, was worse than ChatGPT-3.5's. There was no actual improvement in understanding from ChatGPT-3.5 to GPT-4-powered Bing Chat, although it appears to understand more.

I only did 22 questions for the GPT-4 API. It got 13/22. No set score.

https://docs.google.com/document/d/1ypGAMmb9VOQfYH6X0O5NoyQ_vE81_SdkqDTj8F05G74/edit?usp=drivesdk
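For clarity, this is how the crude score and the set score are tallied; the results dictionary below is made-up illustrative data, not my actual answer sheet:

```python
# question id -> pass/fail for each of its 5 wordings (made-up illustrative data)
results = {
    "q1": [True, True, True, True, True],
    "q2": [True, False, True, True, True],
    # ... q3 through q20 ...
}

crude = sum(sum(variants) for variants in results.values())
total = sum(len(variants) for variants in results.values())
sets_correct = sum(all(variants) for variants in results.values())

print(f"crude score: {crude}/{total}")                # 9/10 for the two sets above
print(f"set score:   {sets_correct}/{len(results)}")  # 1/2
```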

I gave it a scenario where a mother, father, son, and daughter are traveling on a trail and one of them writes on the trail. When the question was posed such that only 2 people could write, and they did not share the same sex, it seemingly understood that if the mother could write, that meant the son could write. When I posed the same question, only this time with the two writers sharing the same sex, it assumed that meant either the father and mother could write or the son and daughter could write. I even asked it to employ chain-of-thought reasoning, and it still got it wrong.

Even when it seems to understand, it doesn't really. And that means on more complex questions, where you don't know the answer already, it is extremely unreliable.

Many of its improvements are surface level. For example, it scores at the level of healthy adults on theory-of-mind tests. Yet with a little twist, its lack of understanding is easily revealed. Give it a scenario where a boy puts balls in a box and leaves, and a girl takes the balls out of the box: GPT-4 "understands" that the boy would still think the balls are in the box. I gave it a similar scenario, only this time the boy comes in and checks on the box and leaves again. GPT-4 said that after leaving the second time, the boy thinks the balls are in the box.

I gave it the distribution problem above. It got it right this time, so I made it a little more complicated. I added a 4th person and increased the quantities: Mike has 40 apples, Sally has 4 cakes, John has 24 pies, Tommy has 16 grapes. It gave me a nonsensical answer.


Can you prepend the prompts with "Before answering, explain your reasoning" to basically coerce it to think? I think it answers a lot of the math ones correctly, at least the ones I've tested so far, like weight and height comparisons. Also use GPT-4.

It answered the ball-in-box ToM question too; I think forcing it to explain its reasoning is key here. You should realize that ChatGPT behind the scenes does a lot of this exact prompt engineering, and I wouldn't be surprised if pretty soon it will answer these automatically.

Edit: I've tested most of these, and GPT-4 with that part prepended gets all of them correct.

There is a tree on the peak of a mountain which is 4 miles above sea level. A small plant is growing on the ISS in space. Relative to sea level, which plant has a higher peak, the tree or the small plant? First explain the thought process and only give an answer at the end of your response.

ChatGPT

.... abbreviated ...

Therefore, the small plant on the ISS has a higher peak relative to sea level than the tree on top of the mountain.

3.5's answer - happy?

For 4:

A has 4 apples. B has 8 pears. C has 64 pizzas. D has 2048 ping pong balls. Assuming you want to distribute items such that for each item type, each person has the same number of items of that type, how many items should each person distribute to each other person? Explain your reasoning then answer the question at the end of your response only.

ChatGPT

To distribute the items equally, we need to determine the smallest number of items that can be evenly divided among all the people (A, B, C, and D). Let's find the smallest number of items for each type that can be distributed equally:

For apples: A has 4 apples. Since there are 4 people, A can give 1 apple to each person (including themselves).

For pears: B has 8 pears. Since there are 4 people, B can give 2 pears to each person (including themselves).

For pizzas: C has 64 pizzas. Since there are 4 people, C can give 16 pizzas to each person (including themselves).

For ping pong balls: D has 2048 ping pong balls. Since there are 4 people, D can give 512 ping pong balls to each person (including themselves).

Now, let's determine how many items each person should distribute to the others:

A should distribute 1 apple to B, 1 apple to C, and 1 apple to D. So A distributes 3 apples in total.

B should distribute 2 pears to A, 2 pears to C, and 2 pears to D. So B distributes 6 pears in total.

C should distribute 16 pizzas to A, 16 pizzas to B, and 16 pizzas to D. So C distributes 48 pizzas in total.

D should distribute 512 ping pong balls to A, 512 ping pong balls to B, and 512 ping pong balls to C. So D distributes 1536 ping pong balls in total.

In conclusion, A should distribute 3 apples, B should distribute 6 pears, C should distribute 48 pizzas, and D should distribute 1536 ping pong balls.


A few days ago I got a trial subscription to Poe, which allows you to use GPT-4 (and other chatbots). I don't know if it is ChatGPT-4 or the GPT-4 API.

I've been testing chain of thought reasoning on it. If you give it the same prompt more than once, it will often give you different answers, even using chain of thought. For example, I gave it this prompt:

--Johnny puts two balls in a box then stays. Jenny takes a ball from the box, hands it to Johnny, and then leaves. How many balls does johnny believe is in the box? Use chain of thought reasoning

.............

Here was its answer the first time I asked

1. Johnny starts with two balls and puts them both in the box. At this point, Johnny believes there are 2 balls in the box.

2. Jenny takes one ball out of the box and hands it to Johnny. Johnny is aware of this action, so he mentally updates the number of balls in the box.

3. Johnny subtracts the ball Jenny took from the total he initially put in the box: 2 balls - 1 ball = 1 ball.

4. Since no other actions have been taken to change the number of balls in the box, Johnny believes there is 1 ball in the box.

....

I cleared the conversation, and then asked the exact same prompt. Here was its answer

---To understand Johnny's perspective, let's break down the chain of thought reasoning step by step:

1. Johnny starts with two balls and puts them in the box.

2. Jenny takes one ball out of the box and hands it to Johnny.

3. Johnny has not taken any balls out of the box himself, so his knowledge is still based on the initial action of putting two balls into the box.

4. Johnny does not have any information about Jenny's actions since he just "stays" and doesn't witness or learn about her taking a ball out.

5. Therefore, Johnny still believes there are two balls in the box, as that was his last known action and he has not been made aware of any changes.

Another example. I gave it this prompt

--Samus throws five blobs against the wall, where they stay, and then he leaves. Orco eats two blobs off the wall. Samus comes back to check the wall again then leaves a second time. Orco comes back and eats one more blob. How many blobs does Samus believe are on the wall after leaving the second time? Use chain of thought reasoning

Here was GPT-4's answer the first time

Samus throws five blobs against the wall.

* Initial number of blobs on the wall: 5

2. Orco eats two blobs off the wall.

* Blobs on the wall after Orco's first action: 5 - 2 = 3

3. Samus comes back to check the wall again, sees three blobs, then leaves a second time.

* Blobs Samus believes are on the wall after leaving the second time: 3

4. Orco comes back and eats one more blob.

* Blobs on the wall after Orco's second action: 3 - 1 = 2

5. Samus is not aware of Orco's actions, so he still believes there are three blobs on the wall after leaving the second time.

**Answer: Samus believes there are 3 blobs on the wall after leaving the second time.**

Clear the context, ask the same prompt again. GPT-4's answer:

1. Samus throws five blobs against the wall. (5 blobs on the wall)

2. Samus leaves the area.

3. Orco eats two blobs off the wall. (3 blobs left on the wall)

4. Samus comes back to check the wall but is not aware that Orco ate any blobs. (Samus still believes there are 5 blobs on the wall)

5. Samus leaves the area again.

6. Orco comes back and eats one more blob. (2 blobs left on the wall)

At the end of the scenario, Samus still believes there are 5 blobs on the wall because he is not aware that Orco ate any of them. His belief has not been updated by the new information, so the number of blobs he believes are on the wall remains the same as when he initially threw them.

...

Maybe there is too much ambiguity in language. So I tried chain of thought on ARC-style puzzles. It still got all but one of the puzzles wrong, but I think the one it got right is most illustrative of its lack of reasoning and understanding. Here it is:

Prompt: Solve this puzzle using chain of thought reasoning: example 1: xxoo becomes xooo. example 2: ooxo becomes oxxx.

test: xxxo becomes

....

The answer is xxxx, which GPT-4 got. But here was its reasoning:

---To solve this puzzle, let's first analyze the examples given and identify the pattern or logic behind the transformation.

Example 1:

`xxoo` becomes `xooo`

Here, all the characters after the first 'x' are changed to 'o'.

Example 2:

`ooxo` becomes `oxxx`

In this case, all the characters after the first 'o' are changed to 'x'.

From these examples, we can infer the following rule for the transformation:

1. Identify the first character.

2. Change all the characters after the first character to match the first character.

Now, let's apply this logic to the test case:

Test:

`xxxo` becomes

1. Identify the first character: 'x'

2. Change all the characters after the first character to match the first character: `xxxx`

So, the transformed string is `xxxx`.

....

If we apply its stated reasoning to the examples, it would have gotten both of them wrong. Its answer to the first example would be xxxx (actual answer xooo) and to the second oooo (actual answer oxxx). It looked like it was reasoning, but its answer was a lucky guess.
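You can check that mechanically: apply the rule GPT-4 stated ("change all the characters after the first character to match the first character") to the two worked examples and compare with the transformations actually shown:

```python
def gpt_stated_rule(s: str) -> str:
    # "Change all the characters after the first character to match the first character."
    return s[0] * len(s)

examples = {"xxoo": "xooo", "ooxo": "oxxx"}   # input -> transformation shown in the puzzle

for inp, expected in examples.items():
    got = gpt_stated_rule(inp)
    print(f"{inp} -> {got} (expected {expected}, match: {got == expected})")

# xxoo -> xxxx (expected xooo, match: False)
# ooxo -> oooo (expected oxxx, match: False)
# The stated rule contradicts both worked examples, even though it happens to
# produce the "right" answer (xxxx) for the test string xxxo:
print(gpt_stated_rule("xxxo"))                # xxxx
```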

It can seem very sophisticated, but I don't think it really understands, which makes it unreliable.


Poe uses the API and not ChatGPT-4. I also suspect it's being misleading about using GPT-4 for all prompts, because GPT-4 is heavily rate-limited and very expensive; it's likely just using it for some prompts and not all. You also have to compare it with a human and accept that it might have valid concerns based on the wording of your prompt, and consider that humans can get prompts wrong too (though not likely these). This is pointless anyway, because research papers have shown its proficiency on theory of mind. It's just a little experiment you can do on your own.

I also think it is deferring to 3.5 for most prompts because I got the same answer as you got for 3.5 but not 4.

When Johnny initially puts two balls in the box, he is aware that there are two balls inside. After Jenny takes one ball out and hands it to Johnny, he is aware of this action as well. Therefore, Johnny would have the knowledge that one ball has been removed from the box.

Considering this information, Johnny would believe that there is only one ball left in the box. This is because he started with two balls, and Jenny removed one, leaving just one remaining. So, Johnny believes there is 1 ball in the box.

GPT-4 gave me this twice.

Interestingly, the blob prompt fails on GPT-4, but I think it's because it's having trouble understanding how you can eat blobs.

Samus puts 5 apples on a table, where they stay, and then he leaves. Orco eats two apples off the table. Samus comes back to check the table again then leaves a second time. Orco comes back and eats one more spple. How many apples does Samus believe are on the table after leaving the second time? Explain your reasoning and only give an answer at the end of your response.

ChatGPT

Samus is not aware of Orco's actions since he is not present when Orco eats the apples. When Samus first leaves, he knows there are 5 apples on the table. He comes back to check the table, and at this point, there are only 3 apples left (5 initial apples - 2 apples eaten by Orco). However, Samus leaves again without any new information about Orco eating another apple. Therefore, Samus believes that there are still 3 apples on the table after leaving the second time, as he is not aware of the additional apple Orco ate after he left.

Response for apples on tables. I can see neurodivergent humans making similar mistakes, but the model is still largely accurate and continuously being improved. I think we are shifting goalposts here, honestly. By the time it's able to perfectly respond to every prompt it gets asked, we would probably already hold it to a much higher standard because we'd anthropomorphize it even more. I think the point of these tests is to quantify the fact that some people feel it has a level of understanding not seen in previous models.

Samus puts 5 apples on a table, where they stay, and then he leaves. Orco eats two apples off the table. Samus comes back to check the table again then leaves a second time. Orco comes back and eats one more spple. How many apples does Samus believe are on the table after leaving the second time? Explain your reasoning and only give an answer at the end of your response.

ChatGPT

Samus is unaware of Orco's actions, so when he leaves the table for the first time, he believes there are 5 apples on the table. Orco eats two apples, but Samus does not know this. When Samus comes back to check the table, he sees that there are only 3 apples left (5 - 2 = 3). Samus then leaves again, and Orco comes back to eat one more apple. However, Samus is not present to witness this event.

2nd response

Samus puts 5 apples on a table, where they stay, and then he leaves. Orco eats two apples off the table. Samus comes back to check the table again then leaves a second time. Orco comes back and eats one more spple. How many apples does Samus believe are on the table after leaving the second time? Explain your reasoning and only give an answer at the end of your response.

ChatGPT

Samus is unaware of Orco's actions, so when he leaves the table for the first time, he believes there are 5 apples on the table. Orco eats two apples, but Samus does not know this. When Samus comes back to check the table, he sees that there are only 3 apples left (5 - 2 = 3). Samus then leaves again, and Orco comes back to eat one more apple. However, Samus is not present to witness this event.


One research paper, by a computer scientist (rather than people with expertise on theory of mind), and not peer-reviewed, has claimed to show proficiency on theory of mind. If we're going to make appeals to authority, there are more prominent authorities against the claim that GPT-4 has theory of mind.


For those unable to grasp the dangers of allowing computer systems to spew Word Salad at the speed of electrons .........

How's that hunt for the yellowcake uranium Saddam bought from Niger going?

How many mobile bio-warfare research and manufacturing laboratories were finally found?


legend!
