Gary, this is huge. For background, I used to work for the CIA's Counter-proliferation Division, which existed to stop the creation and spread of Weapons of Mass Destruction like nukes, chemical weapons, and biological weapons. The development of chemical and nuclear weapons requires chemicals, elements, and machinery that are distinctive, and thus easier to discover and shut down via treaty, sanctions, or covert avenues.
Biological weapons always were and always will be the toughest nut to crack, in terms of stopping their development, because so much of biological weapons development is identical to legitimate biological research. This means LLMs will make that already-hard task even harder. It also points to something every AI company will be loath to admit: if you are improving a lethal technology like bioweapons, what you are developing is inherently dual-use, i.e. it can be used for civilian AND military ends. The most serious dual-use technology always faces export restrictions for exactly that reason. I suspect one reason OpenAI's evaluation was, "Oh, this isn't statistically significant" is that if it WERE statistically significant, they'd have put LLMs in an entirely different regulatory category, and despite what they claim, IMO they do NOT want any meaningful regulation. Their valuation would PLUMMET if Uncle Sam said "Oh, hey, this is export restricted."
(of course, trying to enforce that would be a nightmare)
The fact that this study used GPT-4 with no safety guardrails in place (a model version the public can't access) is not a reason to disregard the threat here. Meta's open-source LLAMA is only 6 months to a year behind OpenAI, but because they've made their weights public, they've made the safety guardrails trivially easy to shut down. We cannot pretend safety guardrails on ChatGPT will save us when LLAMA WILL catch up and LLAMA's guardrails can be disabled in an hour. That's one reason open-source models are potentially very dangerous. Meta will never admit that, any more than OpenAI will admit LLMs can be dual-use. Their whole business model depends on them never being classified that way. I posted something related to this a couple of weeks back. https://technoskeptic.substack.com/p/ai-safety-meme-of-the-week-d9e
And now there's OLMo: https://allenai.org/olmo
Scott, I'd never heard of it. Thanks for the heads up.
1. The result is still derivative, it’s just that the LLM groups had readier access to the *contents* of the sources on the Internet.
2. Nonetheless, this is the information equivalent of easy access to firearms. Do we really want a more efficient predictive-analytics model that enables *more* antisocial behavior? I hope not…
3. Never screw with an experimental psychologist. The study of behavior has so much more noise and so little signal relative to the older sciences that we’ve created what are among the most formidable experimental designs and statistical analyses in all of scientific inquiry. Gary is right. The chances of this study being published in a peer-reviewed journal - esp. a tier 4 or 5 one - are zero.
What I found noticeable is that experts profited more from GPT4 than amateurs. This is in line with my expectations in, for instance, coding. Beginners do not profit as much from these systems as expert coders. You cannot create expert coders out of amateurs by using LLMs, but you can improve the productivity of experts. It fits with the fact that to profit from these systems, *you* have to bring the understanding to the table (as the LLMs do not have any).
I agree that the degree of benefit provided to the experts is quite striking. But to digress, your assertion that the same pattern is found in programming is not consistent with my own experience (LLMs are useless to me) or with reports I've heard from others. Certainly, some experience is necessary to get much benefit, but past a certain level, one is unlikely to be working on anything that is sufficiently well-trod ground to be well covered by the training data. I would say it's intermediate programmers who benefit the most.
I used to program a lot and I still do intermittently (once every 1-2 years). But I write more complex sh or python scripts only rarely, for instance. So, when I am at "how did one do this again?", I find ChatGPT useful. It often generates nonsense (endless loops, expansions that do not work), but I can spot nonsense most of the time. It fits with your 'intermediate' point in that I am a very experienced programmer who can profit in areas/languages where my experience is meagre. I suspect the better one is at programming overall, the more 'starting in a new area' is a good use case.
Another thing that is useful is researching LLMs themselves. The model has been extensively trained on that subject, so it is more reliable than average there. It still confabulates, but I can easily spot things that are wrong, and when I point one out, I get the constant "my apologies, I was unclear" phrase and then a (slightly) better answer.
I have come to the provisional conclusion that 'the' use case for LLMs is 'living documentation'. Instead of having to wade through a sh man page or search on stackexchange, I can interact in natural language with GPT4. It is 'satisficing' my needs. Not enough to pay $20/month for, though; I only do that when I am researching GPT itself...
Agreed, this is an effective use case (and Microsoft is going all-in on that). However, watch out for 'use it or lose it'. The risk is small, though, probably.
It is also somewhat problematic in use cases where there are all kinds of requirements inside an organisation (e.g. from security, risk, and control) that companies like Microsoft cannot provide for and that are not 'boilerplate'. E.g. according to Microsoft, the only security aspect in Azure might be what can be done via policies. But that isn't nearly enough for the regulators in more strongly regulated industries like finance, where central banks may set harder and broader requirements. So, organisations run the risk that they will get a lot of LLM-generated code in their landscapes that in the end might start to look like the mess of all those Excel spreadsheets of two decades ago.
Fully agree. But humans are humans. So the 'should' often doesn't become 'is/does'.
It is understandable that people are worried about the possibility of a future pandemic after the last one killed more than 7m people worldwide. You would think that humanity would be wiser and try really hard to prevent the next one. Then why are wet markets still operating in parts of the world while nobody seems to care? https://www.telegraph.co.uk/global-health/science-and-disease/why-wet-markets-will-never-close-despite-global-threat-human/
Agreed. Also, why is GoF research still going on around the world? We need an international treaty banning the intentional creation of novel dangerous viruses which would not otherwise exist.
What irks me, by the way, is that more and more we're taking these non-peer-reviewed documents seriously, and that their authors are getting air time for them. Somehow I get the feeling this undermines the scientific method.
I'm getting real sick of AI companies weaponizing Schrodinger's Apocalypse against the rest of the world. When the GPTs first became powerful Altman was wringing his hands over the end of the world but still preaching the promise of utopia. Now ClosedAI is publishing papers that downplay the risks of bioweapon development. Instead of society being forced to sit on the sidelines and watch these power-tripping mouth-breathers play with fire, we should have a mechanism to drag them before a court with actual power to ask "what the ever-loving fuck is wrong with you?"
I'm not seeing the big deal here. Any tool that makes knowledge more available can be used for ill purposes. I'm sure building a bioweapon would be harder without other information dissemination tools: scihub, Wikipedia, Google search, bookstores, etc.
It made experts somewhat more effective in a time-limited setting, but a bioterrorist would have years of obsessive time to put into their demented project. I think it's fairly unclear that GPT would be a big lift IRL, and even if it is, there are a ton of legitimate use cases for researching disease-causing microbes.
It is like you didn’t engage with the actual data at all
I read your article and OpenAI's blog post. The absolute values of improvements seem small. The results seem pretty consistent with my experience using GPT4 for the past 8 months: it is good at the "early project" stage on something new, especially ideation, but breaks down as you get more into details. It can make you more productive, but generally not open up totally new capabilities without a fair amount of "traditional" learning.
Your article seems to be assuming that the ability to construct bioweapons has been the limiting factor in bioweapon deployment, which I just don't think is true. The participants spent a few hours looking into these questions with presumably better things to do with their time, and were somewhat better at finding answers. Are there really a lot of people with biology PhDs who want to become bioterrorists, but just don't have much time to devote to it? I'm open to evidence that this is the case, but absent that evidence, the results don't seem as alarming as you're putting forward. Compare this to, e.g., explosive manufacturing, which is comparatively *much* easier technically, with easily purchasable precursors, but still there are very few attacks in the US. I personally know several people with knowledge about firearms and drones who could probably design a fairly deadly improvised weapon. They don't because they're not terrorists.
Isn’t a more relevant question: Is access to ChatGPT more dangerous than access to the internet in general. My guess is no, it is not.
Given that the study showed the opposite and you made no reference to those results, 🤷♂️
Well … no, it didn’t, though, did it? I thought that was the point 🤷
I wouldn't worry about it too much.
Based on my experience (which I may not disclose in detail, sorry), much of what is on the interweb regarding how to make a "bioweapon" (a toxin such as botulinum, or a bacterium such as anthrax, for example) is - according to Bayesian modeling - more likely to kill the operator than to result in a useable weapon.
Large Bayesian networks like GPT are more likely to come up with innovative chemical weapons that can be made by amateurs (e.g., by substituting a different chlorinating agent or alcohol).
Late to the post, because I was away. I always find these discussions of AI doom highly interesting but also frustrating, because there is rarely a clear mechanism of action. At the extreme, the likes of Yudkowsky argue that AI will turn us all into paperclips or design a virus that kills us all with 100% mortality within the week, and if asked why we wouldn't just turn off the AI's power button or how that virus would work biologically, one only gets the response that the AI will be so smart that it can do things we now consider impossible. Like zap us from the sky when we reach for the off-button, because that worked in that scifi horror story they read once. The problem is two-sided: for these people, intelligence is magic, so a superior intelligence is god-like; and they are perfectly ignorant of physics and biology and are therefore unencumbered by any understanding of what a virus can or cannot do, for example.
That does not apply to you, a cognitive scientist, but even with this piece I am puzzled how the danger is meant to manifest. As far as I understand, the 'tasks' here are intellectual, in fact comparable to googling and reading the scientific literature. There is an efficiency gain, yes, but what I don't understand is why the people who are expert enough to understand and make practical use of what ChatGPT summarises for them couldn't just as well do a literature search and arrive at exactly the same outcomes with a two week delay. Conversely, those users who cannot do the literature search right now are likely not competent enough to understand and make use of what they get from ChatGPT.
And then comes the actual bottleneck, having an extremely expensive, well equipped laboratory with all the right supplies from suppliers who have to follow strict regulations regarding who they sell certain items to. In that sense at least, there is an equivalent to the belief that a sufficiently smart AI can simply will us to Alpha Centauri by ignoring physical distances, radiation, micrometeorites, and most importantly funding and resource limitations to the building of spaceships, because that lab and those supplies and competent lab technicians do not manifest from thin air. Perhaps the problem with the study is that it looked at a small intellectual exercise in isolation instead of asking, "and now what?"
There are real dangers to the widespread use of AI, like drowning in spam, driving human creators out of business and thus impoverishing our culture, or making poor automated decisions, but it really doesn't click to me how this particular use case introduces a risk that hasn't existed since scientific journals were invented and made available in university libraries.
The problem is that the test is not real world. Lone-wolf scientists are quite rare and are dangerous even without LLM assistance. Take, for example, the anthrax attacks shortly after 9/11.
The attacks involved the mailing of letters containing powdered anthrax to various media outlets and two U.S. senators. Five people died and 17 others were infected as a result of the attacks.
The correct test would use non-scientists unversed in bioweapon technology. The danger is an LLM assisting a terrorist.
That opens a Pandora's box: any intelligent human could build weapons of mass destruction (WMD).
My experience was in the counterintelligence response to the 9/11 event, as a member of the technology group directorate that brought together all 15 agencies. We used a WMD knowledge-graph taxonomy of concepts to index threats.
I’d love help building a mental model here.
In some cases, you ridicule models for their faults, hallucinations, inconsistencies, and untrustworthiness, and argue that relying on LLMs is dumb because they’re not using abstract symbolic reasoning under the hood. Or that diffusion models constantly make obvious mistakes and will continue to do so with each generation.
In others, you describe the worry about how effective they are at helping people build bioweapons (and how they are likely to continue to get better with larger models). Or how audio/visual/text models are getting very persuasive and believable, convincing people through deepfakes, and posing grave threats.
While I have my own (hopefully!) consistent views here, I’m not sure I understand enough to describe your views on AI despite reading your substack since it started. I’d love to see an article that addresses these together holistically! Otherwise it feels like the common theme is “AI is bad!” but in contradictory dimensions each time in isolation (“it’s too capable! it’s too incapable!”). I need help putting those pieces together into a consistent framework of what you believe, and would love an essay fleshing that out at some point...thanks!
I think the overall argument he’s making is that deep learning generative AI models are quite powerful, but also unreliable and ethically challenged.
Ironically, a useful analogy might actually be humans, as it isn’t hard to imagine some combination of brilliance, delusion, psychopathy, and unreliability all within a single individual (people who fit this description surely exist).
Imagine we’re releasing into the world a vast population of such people; *extremely* bright (in many contexts), preternaturally knowledgeable, and highly unstable. They aren’t consistently rational, responsible, and scrupulous enough to be trusted in any important capacity, but it’s also the case that some of them - some of the time - may be capable of doing tremendous harm. (Sometimes they’ll be thwarted by their own intermittent hallucinations or their propensity for boneheaded errors… but not *every* time).
And anyone can grab one of these individuals off the street, take them home and enlist their help on whatever project they might be working on.
It’s far from a perfect analogy, but I think it points in the direction of how Gary sees things, and it accords well with my own observations of AI behavior.
Thanks! Yup, your argument makes sense, and I relate to and agree with it as well! Perhaps it’s the argument Gary holds, but I’m not sure I’ve seen him use words like “bright” or “knowledgeable” to describe LLMs or give them any credit on such matters (usually it’s pushing back against such characterizations and trying to tear down anyone who uses them). This is why I think an article addressing the nuance holistically would be valuable, vs needing to put words in his mouth…might also be a nice break from all the “refute the latest news item” we’ve had recently too!
https://x.com/garymarcus/status/1754981909216776307?s=61m addresses a bit
Thanks for sharing, though I’d still appreciate a more in depth post, if you’re taking requests. :)
My read of this tweet is that it takes the “AI is not capable but is being trusted too much” framing, which I can agree is certainly dangerous for the stability of civil systems. But that seems incompatible with AI actually being useful for creating the increased biorisks that you talk about in this substack essay. I assume it’s not untrustworthy, incorrect misinformation that is letting participants create increased biorisks, after all! If an LLM is not smart, not logical, not trustworthy, then how is it being used to help participants create biorisks? It seems like there are some “useful” qualities that are being leveraged, qualities that grow more valuable with more capable models, and it sounds like that is what you’re pointing at in this substack essay.
To be clear, I totally don’t look for or expect nuance on tweets, and understand it’s hard to present a holistic phrasing when responding to “the latest news piece”. That’s what prompted my suggestion of a separate post above…
Whatever about the appropriateness of the statistical evaluation - what do we know about the ecological validity of the measurements? What does an “increase in accuracy of 0.8” actually mean about real world outcomes of people trying to produce bioweapons? My best guess is “very little”.
The small sample size, and thus insufficient statistical power, is a valid concern. However, you cannot infer what would have been statistically significant by extrapolating from a tiny sample (n = 25 per cell). Moreover, the paragraph about the Bonferroni correction is completely off the mark: they report dozens of dependent variables, so obviously they need to correct for multiple comparisons.
Tiny sample sizes and no correction for multiple comparisons led to the replication crisis in psychology. Quite disappointing to see you advocating for such shady practices.
These methodological problems should be solved by power analysis and a proper preregistration.
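To make the power point concrete, here is a minimal sketch of the kind of power analysis meant above, assuming a simple two-group comparison with 25 participants per cell. It uses the statsmodels library; the numbers are illustrative, not taken from the OpenAI report.

```python
# Minimal power-analysis sketch (assumes a standard two-sample t-test design).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Smallest standardized effect (Cohen's d) detectable with 80% power
# at alpha = 0.05, two-sided, with 25 participants per group.
min_detectable_d = analysis.solve_power(
    effect_size=None, nobs1=25, ratio=1.0, alpha=0.05, power=0.8
)
print(f"Minimum detectable effect size: d ~ {min_detectable_d:.2f}")  # roughly d ~ 0.8 (a large effect)

# Conversely: participants needed per group to detect a medium effect (d = 0.5).
needed_n = analysis.solve_power(
    effect_size=0.5, nobs1=None, ratio=1.0, alpha=0.05, power=0.8
)
print(f"Participants needed per group for d = 0.5: ~ {needed_n:.0f}")  # roughly 64
```

In other words, with 25 per cell only quite large effects are reliably detectable, which is exactly why a preregistered power analysis matters.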
If they had preregistered the only hypothesis of interest it would have been significant.
Call me shady again and you will be blocked.
But they did not preregister, and they ran dozens of tests because they had many dependent variables. So they definitely need Bonferroni corrections for these multiple tests. It is super sad and destructive to openly call for questionable research practices.
The tiny sample size makes the evidential value of the study meaningless in any case. But because it is meaningless, it does not support either the null or alternative hypothesis.
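As a concrete illustration of the multiple-comparisons point, here is a minimal sketch of a Bonferroni correction, assuming a handful of hypothetical, made-up p-values (not the study's actual numbers) and using statsmodels.

```python
# Minimal Bonferroni-correction sketch over hypothetical p-values.
from statsmodels.stats.multitest import multipletests

p_values = [0.011, 0.034, 0.049, 0.20, 0.62]  # made-up uncorrected p-values

reject, p_corrected, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

for p, p_adj, sig in zip(p_values, p_corrected, reject):
    print(f"raw p = {p:.3f} -> corrected p = {p_adj:.3f} -> significant: {sig}")
# With 5 tests the Bonferroni threshold is 0.05 / 5 = 0.01, so none of these survive correction.
```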
Apart from the unpersuasive study itself, it would take more than a handful of experiments to demonstrate that GenAI —or any AI technique— was not a force multiplier for weapons development. Clearly, it is a potentially useful timesaver. The question is more about the effect size.
I believe the 10^31 number in footnote 2 is an estimate of the number of virions, not the number of virus species, which is surely far smaller; apparently only in the millions.
You have a good source for the millions?
Wikipedia says "more than 11,000 of the millions of virus species have been described in detail". The abstract for reference 8 says "There are an estimated 5000 viral genotypes in 200 liters of seawater and possibly a million different viral genotypes in one kilogram of marine sediment. By contrast, some culturing and molecular studies have found that viruses move between different biomes. Together, these findings suggest that viral diversity could be high on a local scale but relatively limited globally."
I don't get the feeling that anybody has a really precise estimate, but 10^8 seems like a safe upper bound.