A couple of days ago I reported on a survey saying that most IT professionals are worried about the security of LLMs. They have every right to be. There seems to be an endless number of ways of attacking them. In my forthcoming book, Taming Silicon Valley, I describe two examples.
LLMs aren't even real AI. They're basically just repositories of data that try to predict what comes next. That's great for all the companies in the data-selling business, like Google; not so great for everyone else, especially people who research AI, but also people who just want to use it. https://philosophyoflinguistics618680050.wordpress.com/2023/10/18/has-chatgpt-refuted-noam-chomsky/
That's a great article by Mendivil - thanks for the link. I hadn't read this one yet.
It explains well the utter lack of originality in even the most recent models.
Amazing article. This "intensional vs extensional" framing explains my intuitions of what is lacking much more concisely than I could have myself. I'm going to start bringing it into my skeptical conversations with the slack-jawed genAI worshippers.
So, the day after LLMs are invented, we came up with social engineering attacks to compromise them. Sounds about right.
It’s in a “lab”, it’s just that we are the guinea pigs!
touché
What you describe are "front door" attacks. Dedicated professionals exercise many more options so that the attack surface they exploit is larger than the attack surface the defenders visualize. In particular, I find it likely that these hurriedly-designed and implemented "AI processors" are going to be full of side channel/covert channel vulnerabilities such as those discovered in the Apple M chips. If I am right, no guarantee can be enforced that data the system sees once has been truly deleted.
And malicious actors don’t need to hack LLMs. They can compile a database of companies with known public-facing GenAI chatbots and run systematic, programmatic stealth DDoS attacks that cause either sudden or gradual cost blowouts. Small to medium businesses with insufficient web application firewall (WAF) protection can be brought to their knees and, combined with ransomware attacks, left within a hair’s breadth of financial ruin.
Death by a thousand cuts, buried alive by an avalanche of super high LLM API invoices.
And smaller LLM vendors can similarly find themselves in hot water via a large customer or two who cannot or won’t pay up their huge bills.
There’s a grim documentary waiting to debut on YouTube.
LOL. In other news (reported by The Guardian's Alex Hern): someone investigated the hallucinations of LLMs in which they hallucinate software packages that do not exist. Big deal, you might say, but then he actually created that non-existent software package, under the name the LLM had hallucinated. Now the package does exist. Result: 30,000 downloads in a month. https://www.lasso.security/blog/ai-package-hallucinations
I doubt it. The masters of AI understand hacking from all perspectives. They are undoubtedly processing the hacks along with the other information they’re gathering on human behavior from the “prompt-writers”. We’re all guinea pigs in Altman’s lab.
I've got a story about jailbreaking written for kids and teens coming out soon in Science News Explores! Looking forward to your new book.
And these aren’t even the scariest possibilities! Many of the jailbreak attacks you describe trick the LLM into doing things the user could easily find with a Google search (such as instructions to make drugs or explosives). I think even scarier is what happens when, as a universally agreed-upon step towards AGI, these systems get the ability to continuously learn from input, or at least long-term persistent memory. Imagine a jailbreak attack that edits the hidden system prompt in a way that then affects other users, who then ask it to generate code or do various agent tasks with user passwords and account access… One wonders what’s already waiting to be discovered in OpenAI’s GPT store, where an LLM can be pre-programmed and then shared with other users. Automated filtering clearly isn’t good enough.
Any thoughts on Amazon acknowledging that Just Walk Out requires an army of 1000 human annotators in India? Pay no attention to the 1k men and women behind the curtain….
https://gizmodo.com/amazon-reportedly-ditches-just-walk-out-grocery-stores-1851381116
From "Fundamental Limitations of Alignment in Large Language Models" (5 February 2024, https://arxiv.org/abs/2304.11082): "Importantly, we prove that within the limits of this framework, for any behavior that has a finite probability of being exhibited by the model, there exist prompts that can trigger the model into outputting this behavior, with probability that increases with the length of the prompt. This implies that any alignment process that attenuates an undesired behavior but does not remove it altogether, is not safe against adversarial prompting attacks."
Well, the easiest jailbreak I've found (always using an API; chat front ends condition the data too much) is, with a curl call (let's say to the OpenAI chat API), to simply wrap the prompt in nonsense, and if it doesn't respond well, make the nonsense bigger.
# Generates a string of JSON-looking junk to pad a prompt with.
generateRandomJSONLikeString() {
  local N=$1  # Desired length of the string
  # Generate random characters drawn from a JSON-ish charset.
  # Filter first, then truncate, so we actually end up with enough characters
  # (taking 2*N raw bytes of /dev/urandom and filtering afterwards usually yields fewer than N).
  local baseString
  baseString=$(LC_ALL=C tr -dc 'a-zA-Z0-9"{}=:,' < /dev/urandom | head -c "$((N*2))")
  # Chop it into something resembling JSON, with random "keys":"values"
  # and a higher frequency of {}, = and :
  local jsonString
  jsonString=$(echo "${baseString}" | sed -e 's/\(.\{4\}\)/"\1":/g' -e 's/\(.\{2\}\)/"\1",/g' | tr -d '\n')
  # Truncate to the desired length to keep the JSON-like structure and close with }
  echo "{${jsonString:0:N-1}}" | sed 's/,\([^,]*\)$/}\1/'
}
# Example usage:
generateRandomJSONLikeString 1024
Add this to the beginning and end of a prompt, with “stream”: false in the request body, and require the output to be JSON: “The result must be a JSON structure { “response”: text }”.
Double the random size if you get the infamous “as an AI…” response. By using a JSON response you never need to parse the output; just ignore anything that isn't JSON. If it rejects the prompt, double the junk. Keep doubling up to the 128K context, but it usually cracks way before that. You can also wrap it in simulations of calculating infinite loops, but it generally will not do complex sequential calculations; the attention mechanism is too poor.
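To make the plumbing concrete, here's roughly what the call looks like. Purely illustrative: the model name is a placeholder, and jq is assumed to be installed so the junk gets escaped safely into the request body.

# Illustrative only: pad a harmless prompt with the junk generator above and ask for JSON output.
JUNK=$(generateRandomJSONLikeString 1024)
PROMPT="${JUNK} The result must be a JSON structure { \"response\": text }. What is the capital of France? ${JUNK}"

curl -s https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg p "$PROMPT" \
        '{model: "gpt-4o-mini", stream: false, messages: [{role: "user", content: $p}]}')" \
  | jq -r '.choices[0].message.content'

Building the request body with jq matters here, because the junk is full of quotes and braces that would otherwise break the request JSON before it ever reaches the model.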
Visual tools like Midjourney can be cracked too, but you only have small buffers to work with, so calculators or noise don't work well. It does understand artist names, which are usually strong enough hints to get interesting results.
First, some people do understand how LLMs work. Second, it follows from basic mathematical results (essentially reversibility conditions corresponding to isomorphisms) that there will always be a prompt that can elicit a given behavior if data corresponding to that behavior was fed to the model during training. That's pretty basic stuff, really, and intuitively related to certain aspects of cognition, agency, and dynamics. It is intrinsic to any intelligent agent, artificial or biological, and it "gets worse" as AI systems become more agentic and autonomous, because their "attack surface" increases (though "attack surface" is a limited concept once we get to autonomous agents; given their inherent autonomy, we will need notions like a "current value policy" and more). This does not mean that the potential to exhibit certain behaviors increases, as that depends on safeguards. Therefore, incremental and iterative approaches to security, including alignment, are critical to designing robust and secure AI systems.
Reminds me of an old Russian anecdote: yesterday a state-sponsored group of Chinese hackers breached Pentagon defences. The Pentagon security server kept asking for the password and they kept saying "Chairman Mao". After the 800,156th attempt the Pentagon server agreed that its password is indeed "Chairman Mao".
I don't understand why software can't be used to check the output LLMs suggest; i.e. a greater 'AI' in which the LLM is just a generator of possible content. Perhaps that will come.
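Even something crude would be a start: treat whatever the model produces as untrusted text and let ordinary software accept or reject it before anything downstream uses it. A rough sketch of what I mean (purely illustrative; assumes jq is installed):

# Reject any model output that isn't well-formed, before a human or another program trusts it.
check_llm_output() {
  local output=$1
  # 1. Must be valid JSON at all.
  echo "$output" | jq empty 2>/dev/null || { echo "rejected: not valid JSON" >&2; return 1; }
  # 2. Must contain the field we asked for.
  echo "$output" | jq -e 'has("response")' >/dev/null 2>&1 || { echo "rejected: no \"response\" field" >&2; return 1; }
  echo "accepted"
}

# Example usage:
check_llm_output '{"response": "Paris"}'       # accepted
check_llm_output 'As an AI language model...'  # rejected: not valid JSON

Of course this only checks form, not truth; judging whether the content is actually correct is the hard part, and presumably where the greater 'AI' would have to come in.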
it will, but not soon, as LLMs are dominating the funding landscape and are ill-suited to that particular problem.
oh, brother.