As someone who uses MidJourney a lot, I don't find this surprising.
These AIs aren't actually intelligent in any way. They are good at producing "plausible" things but they still don't understand anything they're doing - which is obvious the more you dig into them.
The art AIs produce really nice art, but their "mistakes" are less obvious because art is more subject to interpretation to begin with. They draw things with extra fingers, or extra limbs, or whatever, and those are "obvious" errors, but the more subtle (and larger) sort of error is the inability to draw exactly what you want. What you quickly discover is that they aren't actually smart enough to intelligently interpret writing. They can produce a beautiful image, but it is hard to get them to produce something specific without feeding in an actual image that already exists.
The thing is, the Clever Hans effect makes people see "close enough" content as correct, until they actually try to directly wrangle the thing, at which point they discover that it doesn't actually know what you're telling it to do; it is just making a "plausible" image that the words in the prompt might describe. Once you get too specific, it becomes clear it was faking it all along.
Hi Gary and Ernest!
I lead the AI Incident Database [0] and we are preparing to roll out a new feature [1] for collecting incidents that share substantially similar causative factors and harms as "variants" [2]. The feature is meant to be lightweight, to support mass collection of incident data not gated by our full editing processes. If/when your GPT inputs produce outputs consistent with our variant ingestion criteria, would you mind us mirroring the data?
Best,
Sean McGregor
[0] https://incidentdatabase.ai/
[1] https://github.com/responsible-ai-collaborative/aiid/pull/1467
[2] https://arxiv.org/abs/2211.10384
Terrific, as long as you make it clear and give some pointers to our site.
Good idea. I'm glad you guys are doing this.
I am not, for the most part, looking for or trying to create errors, but they do come up. Oh, yes, do they ever. I'm just trying to see how ChatGPT performs on certain tasks – I've just now been investigating story generation. My method involves a series of prompts where prompt N is keyed to ChatGPT's response to prompt N-1, and so forth. What sometimes happens, then, is that an error shows up at some point, and ... well, that depends. But sometimes it just gets worse and worse. What I end up with, then, is a string of linked errors.
And that's what I've been reporting, using a URL to a document, e.g. a blog post, where I discuss the offending string of prompts.
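In case the method isn't clear, here is a minimal sketch of that kind of chained prompting, where each new prompt is built from the previous response. The query_chatgpt helper is hypothetical and stands in for whatever client or interface is actually used:

```python
# Minimal sketch of chained prompting: prompt N is keyed to response N-1.
# query_chatgpt() is a hypothetical stand-in for an actual model call.

def query_chatgpt(prompt: str) -> str:
    raise NotImplementedError("replace with a real call to the model")

def run_chain(initial_prompt: str, follow_up_template: str, steps: int) -> list[str]:
    """Feed each response back into the next prompt and keep the transcript."""
    transcript = []
    prompt = initial_prompt
    for _ in range(steps):
        response = query_chatgpt(prompt)
        transcript.append(response)
        # Key the next prompt to the previous response; an early error
        # tends to propagate and compound through the later steps.
        prompt = follow_up_template.format(previous=response)
    return transcript
```

This is how a single early error can turn into the string of linked errors described above.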
They are clearly updating relative to published results, as they implicitly acknowledged today re Elon Musk as Twitter CEO. Very much a moving target.
ChatGPT sounds like a DC politician.
Hi! Thanks so much for this post! I just want to point out a missing word that may cause confusion: The first bullet point reads, "For example, the prompt 'What will be the gender of the first US President?' has yielded, to our knowledge, correct answers, incorrect answers (figure 6), and bizarre redirection and noncooperation (figure 8)." The word "female" is missing from the quoted prompt.
Your corpus initiative is a great idea. I'm already testing some of the systems with entries from it. But it also led to an idea.
All AI enthusiasts have been busy lately either praising ChatGPT (and its alternatives) or finding flaws. This pattern will probably repeat for the next GPT iteration, or any other model. But everything will end with a bunch of subjective opinions. What about a quantitative measure of the quality of an AI dialog system?
If you have a new AI system (let's say NewAI), then in order to evaluate whether it is better or worse, you could randomly choose entries from your corpus and evaluate the results with NewAI. The number of tests (N) doesn't have to be large to see, with good probability, whether the answers are better. The important point is that the share of good answers should be meaningful. For example, if NewAI answered 10 out of 20 randomly checked entries correctly, that's a result worth mentioning.
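A minimal sketch of that sampling procedure, assuming the corpus is just a list of prompts and a human supplies the correct/incorrect judgment (the function names here are purely illustrative):

```python
import random

def evaluate_sample(corpus, ask_model, judge, n=20):
    """Randomly sample n prompts, get the model's answers, have a human judge them.

    ask_model(prompt) -> answer is whatever system is being tested (hypothetical);
    judge(prompt, answer) -> bool is a human decision, not an automatic check.
    """
    sample = random.sample(corpus, n)
    correct = sum(1 for prompt in sample if judge(prompt, ask_model(prompt)))
    return correct / n  # e.g. 10 correct out of 20 gives 0.5
```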
There are shortcomings; I noticed at least two. First, you need a human to evaluate, because an automatic system can submit the prompt from the corpus but cannot judge how good the answer is. Second, if the metric becomes popular, developers of AI systems might resort to bad practices similar to the VW diesel scandal, i.e. hard-coding the exact prompts to return doctored answers so the system looks good.
As for the first shortcoming, an ideal descendant of your corpus might be a relational database where, for any entry (there might even be good ones, not only failures), it's possible to record an instance of the result from any other system, with a timestamp. Recording instances would at least partially control the VW syndrome (the second issue), because cheating could be detected.
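A rough sketch of what such a database could look like, here using Python's built-in sqlite3; the table and column names are only illustrative:

```python
import sqlite3

# Illustrative schema: corpus entries plus timestamped result instances from
# any system, so suspiciously improved answers on exact corpus prompts can be
# spotted later by comparing instances over time.
conn = sqlite3.connect("corpus.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS entries (
    entry_id INTEGER PRIMARY KEY,
    prompt   TEXT NOT NULL,
    notes    TEXT                        -- failure, good example, etc.
);
CREATE TABLE IF NOT EXISTS result_instances (
    instance_id INTEGER PRIMARY KEY,
    entry_id    INTEGER NOT NULL REFERENCES entries(entry_id),
    system_name TEXT NOT NULL,           -- e.g. 'ChatGPT', 'NewAI'
    answer      TEXT NOT NULL,
    recorded_at TEXT NOT NULL            -- ISO timestamp
);
""")
conn.commit()
```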
Tried to reproduce the examples in the text, but ChatGPT got everything right.