For me, after many years' experience in computer systems engineering but not much recent coding experience, it's not how much time I save, it's whether I can do it at all. I've been learning Python by doing for the past couple of years, and I've gotten pretty good at it. But it's only because of my many sessions with ChatGPT and Gemini that I have succeeded in putting together a sophisticated, flexible speech-to-text dictation and control application for Windows. There's simply no way I could have done it without their help, particularly when it came to tracking down the vagaries of imported Python modules and low-level Windows functions.
This isn't vibe coding: this is me examining every line of code produced, understanding whether it does what I want efficiently and accurately, and engaging in a dialogue with the LLMs to improve and extend it.
I realize that's not the point of the study you cite, but it is my experience.
Yes, even as a very experienced developer, I have occasionally found LLMs useful to get me started writing a new app on a platform I haven't previously coded for. For me, this seems to be their only successful use case. As you say, I still have to go over their output carefully to find bugs.
I haven't even tried using them on a large existing codebase. I'm not optimistic that they would be much use.
It can be very helpful indeed for that kind of thing, but are you sure "There's simply no way I could have done it without their help"? I mean, it's not like you have to figure it all out from O'Reilly books; there's Google, and much or all of the code the LLM was trained on is available. It might have taken a little longer, but your knowledge might have gained a little more depth. I'm not sure. I do agree that teaching new things to experienced people is an area where it's strong; I definitely use it for that.
Of course I could do it myself. I could also dig the foundation of my house with just a shovel. But would I? No.
Maybe I should have said it this way: the barrier to doing that research on my own, without help from the LLM, would have been too great for me to attempt it.
Well that's fair enough. I'm using it for stuff I'm obliged to do.
One thing I have noticed is that I can often find stuff faster using an LLM vs. a Google search, with a few exceptions such as looking for a specific website or the lowest price for some well-defined item. As you say, the code is out there, but finding it "manually" is not so easy.
It's definitely useful to teach yourself coding stuff if you're already a programmer. It's also good at providing an overview of a topic you already know something about, so you can spot errors and check important facts.
It's less good for things you know nothing about. I was asking it for months about how an area of federal funding was going to fare under Trump's big beautiful bill. It gave me very long, very detailed information, with breakdowns of how different people were likely to vote, and so on. The only trouble was that the thing I was asking about wasn't actually in the bill.
I find it is good at telling you what to research next, or the available options, for things I know little about (or have simply forgotten). For example, I recently knew that I needed a topological sort but was not up on all the algorithms. An LLM told me the names of the algorithms and their pluses and minuses. Was it exhaustive? Who knows. But I didn't need the best algorithm, just a good enough one. The important thing is that it was undoubtedly trained on a lot of relevant content.
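For readers curious what "good enough" looks like here, a minimal sketch of Kahn's algorithm, one of the topological-sort options an LLM would likely name. It is purely illustrative, not the commenter's actual code, and assumes the graph is a DAG given as an adjacency list:

```python
from collections import deque

def topological_sort(graph):
    # Count incoming edges for every node that appears anywhere in the graph.
    indegree = {node: 0 for node in graph}
    for node in graph:
        for succ in graph[node]:
            indegree[succ] = indegree.get(succ, 0) + 1

    # Start with every node that has no incoming edges.
    queue = deque(n for n, d in indegree.items() if d == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for succ in graph.get(node, []):
            indegree[succ] -= 1
            if indegree[succ] == 0:
                queue.append(succ)

    if len(order) != len(indegree):
        raise ValueError("graph has a cycle; no topological order exists")
    return order

print(topological_sort({"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}))
# ['a', 'b', 'c', 'd'] (or another valid order)
```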
But wouldn't Wikipedia have given you the exact same information? Of course it would, and it would be better structured, too!
But the real question is different: LLMs are quickly destroying people's will to participate in StackOverflow and Wikipedia… but will they be able to do anything without them?
The question may sound silly, but it's quite real: the more LLM-generated slop is out there, the harder it is to find the actual information, and if it's as bad as they say… this would just lead to the eventual collapse of the whole paradigm. It would simply destroy itself.
You are talking about different but very real problems.
First, no, Wikipedia would not have given me the same kind of information. Some of it would be there, but it would be spread over several pages, and without knowing the names of all the algorithms, I wouldn't have known which pages to visit. I would have gotten there eventually. Also, the LLM generated a lot more opinions about the advantages and disadvantages of each and gave me suggestions as to what I might look at next.
It is sad that the presence of LLMs is suppressing Wikipedia author participation. LLMs are shaking up lots of things, but new technology does that. We have to adjust.
By "LLM-generated slop", I assume you mean the phenomenon where people publish AI-generated pages containing bad information. That's a real problem but the blame is on the people posting the slop.
I fight against the hype and the distortions LLMs create but it's not going away. We have to learn to live with it.
That is a serious worry; I hadn't thought of that. I suppose it can train on GitHub, though?
It provides a lot of info, quickly, which is often but not always accurate, in response to a vague or brief prompt. That's useful, but not a replacement for human expertise.
That says more about how awful Google Search has become in the past several years than it does about the utility of the LLMs.
Sure, Google Search is pretty bad these days but LLMs are still very useful. Both takes are right. One of the benefits of LLMs is that they have caused Google to get off their collective butts and try to improve their search.
You absolutely could have done it without their help. Like everyone who came before you, you'd have had to study the problem in detail and come up with a plan, then implement it piece by piece.
You're a human. You can learn things and apply your knowledge. It's what we do.
As to whether you could have done it *as fast* without the chatbot's help, hard to say. You could certainly have done it better.
My experience playing with LLMs has been: they can do simple things for me, apparently faster than I can do them myself, but they can't do anything non-trivial without screwing it up and making a mess. But the studies in the OP indicate I may have been fooling myself about the "faster" part.
I think a key point you're making is "There's simply no other way I could have [written a new program for a specialised domain in an unknown coding language] without [the help of an LLM]"
As someone who has done this many times over a long career in technology, I would hard disagree.
Prior to LLMs, the Internet made that fairly simple to do, simply by looking through open GitHub (etc.) source repositories, Stack Exchange (etc.) Q&As, personal blogs, and so on. Prior to that, it was coding books.
On the cautionary side, we know from experience reports provided by some well-regarded elite developers that although LLM-generated code *can* work, that code may have fundamental problems in areas such as error handling, broader integration / architecture, scale / maintainability, etc.
There have also been problems caused by poor-quality coding patterns propagated in many instances through accessible code repositories (copy & paste); when those are used to build an underlying LLM code corpus, they become heavily weighted output bias: basically, the poor code patterns get carried forward as high-probability output cases.
My suspicion for a while has been that we tend to attribute to LLMs a high level of capability that simply isn't there, and this METR study would appear to support that anecdotal suspicion. We're "wowed" by the magic capability of it all, but the magic is at least in part illusion.
One of the ongoing paradoxical challenges is that it takes expertise to realise where LLMs fail. So, for example, an expert Python developer can see the errors in Python generated by an LLM; however, if you're using an LLM to teach you Python, you're at a distinct disadvantage.
You're at risk of being lulled into assumptions about the quality of the output, reinforced by factors such as the code being created quickly, compiling, and being able to execute and appear to work. That doesn't make it good code, fit for purpose, that will work as you expect it to under real-world conditions.
You assume that I am naive about these risks. I am not.
Let's look at an analogy. My table saw is dangerous. People lose fingers and hands all the time using table saws. Does that mean I should stop using it? Or should I mitigate the dangers by following a strict methodology?
The real danger is that I don't even attempt to do complicated stuff like I've been doing because, for example, it would take too long to research the foibles of UIA in Windows on my own.
Don't teach your granny to suck eggs.
I simply saw that you were making an absolutist claim that was easily refuted. From that limited sample, it seemed reasonable and helpful to note other subtle aspects and considerations that you, and more importantly others reading the problematic statement in your comment, might benefit from.
For example, it doesn't really matter whether you think you are not naive, are aware you are naive, or even whether you are actually naive and completely unaware of that consideration.
It doesn't matter, because you have, to a large extent, no way of knowing whether anything you're receiving as output from such tools is or is not problematic in any way that might be important, because you have no validated reference to compare against.
For instance, the code presented may compile, but that's a pretty rudimentary concern.
However, you have no way of knowing whether you are being presented with proprietary code that should not have been shared and that you might not legally be able to use. You might be opening yourself up to a lawsuit by using that output.
In your example of "learning Python by doing", where you state this learning could not have been done without LLM/GPT-based systems (implying this was your only learning resource), you have no way, through the LLM/GPT-based system alone, of knowing at the time the code is presented whether the generated code is: a) good code, i.e. code that expert developers in that language would consider good; b) compliant with commonly accepted standards and styles within the community of peers likely to interact with it; and c) free of logical (as opposed to syntactical) errors. You'd need to use the code to explore and, hopefully, discover this.
For any non-trivial code, particularly code that needs to be maintained or is intended to be shared, the question becomes how the community can verify generative-AI code: a second and/or third opinion (code reviews, peer reviews, audits, etc.), as it were.
So the key question I raise is: how can you, or anyone else, mitigate the inherent risks?
It would be interesting if, rather than seemingly attacking me defensively, you responded to the questions raised and simply explained *how* you mitigate the risks that you indicate you are aware of.
In your case you report a success, but how can we actually measure the success of your program? Did you get feedback from your users? Do the users of your app spend the time to make sure the speech-to-text translations are accurate? Are the translations accurate? Are they useful? Could this problem have been solved using a method other than AI? Did you use AI because you like AI and want to have AI on your resume to increase your employment chances in the future?
I find that people are not completely honest about these projects and it is hard to get to the truth 100% without exploring all the details.
Ooooh! Hadn't thought of that! Maybe I'm being manipulated by an evil AI into thinking that my code is good, just because I understand what it does, and it does what I want it to do, and it doesn't fail me.
Maybe I should submit all my code to you before I make an assessment like that. Then you can inform me whether I'm wrong or not. How would that be?
On the other hand, maybe I should just ignore you because I don't really care what you think about my honesty.
I think I'll go with that second choice …
While talkers talk, builders build and learn how to use these tools usefully and realistically (and they are, BTW, very useful).
It's a free world, at least so far in the West. Do as you please :-)
This is what I believe to be the best use for these LLMs: to quickly condense down a lot of info for an intelligent being (a human) to then assemble the parts that they need. The LLMs are more like a 'super' Google, and can explain and summarize documents and poorly documented libraries so that a person can work with them faster; however, this still requires a person to audit the LLM output (because they hallucinate way too much and make many poor code decisions, especially when given a lot of latitude).
As a non-coder, it works this way for me, too. I recently wanted to make a simple interactive HTML page to help kids learn their times tables, and GPT made it for me, pretty much exactly as I envisioned it.
That was great! For amateurs who want to potter around, it's lovely. But that is really a very different ballgame from assisting with professional work.
I'm testing both ChatGPT and aistudio, and I like aistudio more: more tokens and fewer mistakes.
My PiSelfhosting and pi-server-vm are both, in the end, created with no hand-written code by me. I use unit testing, also generated by aistudio. Just three iterations at most and it works. What is your experience in how many cycles before the code is correct?
What do you mean by correct? It is very easy to write bad Python code which appears to work in the high-probability use cases, but whose behaviour is non-deterministic in some lower-probability cases. It is exceptionally easy to write Python code that is grotesquely inefficient. It is very easy to write bad Python code that cannot be extended or maintained. Python is a complex and subtle language.
How do you know that your LLM is not doing precisely that, if you learned Python from that same LLM? Where are your quality criteria coming from? Is the code testable, or portable? And do the means to do so meet established standards?
As for your example: the Python package system can seem complex, especially if you go to random sites on the Internet, as much of what they say is only almost true. If you read the Python standard and think it through, the whole thing becomes obvious. As for native functions, ctypes is your friend, and it is very well documented in the official Python documentation. Neither of these is hard. If you want hard, try Python metaprogramming. Standards are easy to find. LLMs basically reward laziness with bad code of dubious quality.
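To make the ctypes point concrete, here is a minimal, Windows-only sketch of calling a Win32 function directly from Python. The particular function (user32.MessageBoxW) is just a stock example, not anything from the application discussed above:

```python
import ctypes
from ctypes import wintypes

# Load user32.dll; use_last_error lets us inspect GetLastError() on failure.
user32 = ctypes.WinDLL("user32", use_last_error=True)

# Declaring argtypes/restype lets ctypes type-check the call for us.
user32.MessageBoxW.argtypes = (wintypes.HWND, wintypes.LPCWSTR,
                               wintypes.LPCWSTR, wintypes.UINT)
user32.MessageBoxW.restype = ctypes.c_int

MB_OK = 0x0
user32.MessageBoxW(None, "Hello from ctypes", "Demo", MB_OK)
```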
That's been exactly my experience. I'm 62 and have been working with a proprietary Data Ops platform for many years, but have never learned to code in Python. ChatGPT has opened up all sorts of possibilities for me to create Python scripts that, say, use Google's OR-Tools CP-SAT solver as part of a data pipeline I have built with the software I know. I couldn't have done that without the LLM.
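For anyone wondering what such a script involves, a minimal CP-SAT sketch follows (assumes `pip install ortools`; the variables and constraint are invented for illustration, not the commenter's actual pipeline):

```python
from ortools.sat.python import cp_model

model = cp_model.CpModel()

# Two toy integer decision variables and one linear constraint.
x = model.NewIntVar(0, 10, "x")
y = model.NewIntVar(0, 10, "y")
model.Add(x + y <= 10)
model.Maximize(x + 2 * y)

solver = cp_model.CpSolver()
status = solver.Solve(model)
if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print("x =", solver.Value(x), "y =", solver.Value(y))
```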
Yeah it's really opened up possibilities for me too. I'm 77.
This substantiates my feeling about the tools. I try them out, and I understand how a certain feeling of velocity gives some dopamine hits, especially for devs with underdeveloped editors (not Emacs or Vim); the text-manipulation speed alone must be a hit. But then I go solve real-world problems and simply leave the thing alone, because I get the velocity from ordinary thinking through the problem. It's like having a child next to you, ranting and dreaming about how to get something done in front of you. You ignore it after it goes off into useless tangents that miss important aspects of what you're doing.
In that boat as well.
While I have toyed with coding assistants, I agree that they can get you started with unfamiliar languages far faster than trying to learn from scratch. Python is a good example. Writing interfaces between languages is another.
However, I think there is a difference between asking an LLM for help and using the new IDEs with AI built in, like Cursor, Windsurf, and Firebase.
A YouTube video I watched had the presenter comparing different AI IDEs and saying that one should pay at least at the $20/month level to use them. Maybe if your work depends on it (but shouldn't your employer pay?), but not for the casual programmer.
"It will be very interesting to see to how this evolves over time."
It will also be *very* interesting to observe some specific phenomena in the process.
1) People insisting that it saved *them* time — even if it didn't save time for those other shlubs who "don't know how to use it".
2) People continuing to make claims about time saving without actually measuring the amount of time that things took, but going on feeling. If I go by feeling, coding and debugging takes practically no time. If I actually look at the clock, though... The study seems to bear that out.
3) Very shallow testing of the quality of the code and of the product.
4) Oblivion about the differences between programs generated to help a non-programmer organize a recipe book vs. programs that help to run businesses.
5) The IKEA effect for software. I didn't make it, but I assembled it, which makes me feel like I made it, which bestows the endowment effect.
6) Persistent unawareness of the Large Language Mentalist Effect, wonderfully named and described here: https://softwarecrisis.dev/letters/llmentalist/
Very interesting article. Thanks for sharing the link.
This resembles my own experience using these tools. They might seem faster and cool, but if I pay attention and track my time while using them, it's pretty obvious that, while fun to work with, they slow me down.
https://nappisite.substack.com/p/ai-coding-assistance
It all depends on how the AI assistant is used.
Having it write code for you that then needs debugging is a very bad idea. Debugging code is hard.
I use GitHub CoPilot very intensely. It is really helpful at completing lines and small blobs, if one pays a lot of attention.
Then, Gemini is really great at reviewing code, catching issues, and offering ideas.
Just don't put it in driver's seat. Can waste a lot of time that way.
Empirically figuring out guidelines like these seems like the way to go.
I confirm that debugging, particularly for a novice, could become a nightmare.
The plural of anecdote is not data, but still. I have an object-detection task (tagging the moons of Uranus in astronomical images) with about 40 training datapoints. I asked o3 to write code to load a pretrained ResNet-18 and retrain the head only. After a few iterations the thing was running and spitting out predictions (the centers of the moons in image coordinates). Unfortunately, though, they looked significantly less accurate than I was expecting. It took a while to figure out what was wrong: as it turns out, the code o3 generated did not contain a crucial line where it should have set the weights to the pretrained values. So I was basically "fine-tuning" a net initialized with random weights. On the plus side, I would probably not have attempted any of this without ChatGPT.
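For concreteness, here is roughly what the intended setup looks like with torchvision (0.13 or later). This is a hedged sketch, not the commenter's actual code; the crucial line is the one requesting pretrained weights, which is exactly what the generated code omitted:

```python
import torch
import torch.nn as nn
from torchvision import models

# Load ResNet-18 *with* pretrained weights; omitting the weights argument
# leaves the network randomly initialized, which is the bug described above.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the backbone so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head with a small regression head,
# e.g. predicting an (x, y) image coordinate for a moon.
model.fc = nn.Linear(model.fc.in_features, 2)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
```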
"Debugging code you didn’t write can be hard."
Yeah, no shit! Especially if it doesn't even always make 100% sense.
Tried it in different scenarios.
Ironically, they benefit expert coders more. They do help with one part: procrastination and slow starts. A couple of times I needed a script for a small task. It's always a nuisance to get these out: not a huge challenge, but you waste a couple of hours ironing out the kinks and thinking about how to start. GenAI can produce the draft very quickly. Then you see the obvious issues, fix them, and voilà.
Another use is exploring bits you don't know and getting strategy advice. Basically, a faster way to research.
Where it's completely useless is in understanding a real-world system. You can't just give it a piece of code to analyze or find issues in. Same with looking up non-existent methods: it will confabulate the stuff you are looking for, and it'll take you some time to figure out it's a fiction. Writing a decent, working app from scratch without human intervention? I doubt it.
It's a bit like spellchecking: if you can't spell, you won't always be able to use it properly.
I have to admit, though, that there is another positive: Stack Overflow is much less cocky now. It used to be a pain to ask questions there. "What, you don't know?" "You haven't done your research." "Closing the question as off-topic." And so on. No longer.
I wonder if the programming language help from LLMs will degrade as more AI-generated code is ingested in future training sets. I also wonder whether we will damage the progression of skills from junior engineers to senior engineers, as working knowledge of complex systems and specific code constructs degrades. Studies are showing that skills are deteriorating with overuse (or incorrect use) of AI.
This is based on my observations, not a scientific study. The spectacular productivity increases with AI tools are driven by good programmers who are already several times more productive than average programmers. Average programmers + AI sometimes reach the level of good programmers without AI. And yes, debugging is a real issue, in fact a nightmare, with GenAI coding tools.
Novice coders + AI are able to "churn out" code quite quickly that they don't understand and that is very fragile. Worse, I think that most of them will never learn to code well and will be dependent on AI tools for life.
Moreover, I also observe a more or less rapid erosion of programming skills due to the overuse of AI tools and the law of least effort. The development and maintenance of complex cognitive capacities requires active work and cannot rely solely on technological assistance. When it comes to natural neural networks, it's "use it or lose it."
This is what happens when people assume engineering isn't a creative endeavor, and that all the problems it solves are superficial remixes of solved problems.
If this were true, Turing completeness would be irrelevant and RISC would be completely solved. We would be able to dramatically speed up computations by knowing everything we could possibly compute, thus defining the limits around the halting problem and eliminating the P vs. NP conundrum.
They still haven't solved training a model and getting a coherent database of structured insights that could be manually assembled into coherent outputs. You know, like a layer that understands individual characters, words, parts of speech, concepts, fields of study, trees of knowledge, distinct vocabularies, contexts, and the generalized patterns present across rational fields of study in self-consistent theories relying on as few axioms as possible. They're trying to solve the incompleteness theorem with "sympathetic magic" thinking, as if mandrake were a type of man and their intellectual homunculi will burst forth from mason jars like some all-powerful and immortal deity beaten out of Aleister Crowley's wettest fever dreams.
I mean, phonemes are a thing, but most of them carry zero meaning, and intonation changes the context of speech. From just this basic understanding of communication, it's easy to see that blindly dissolving text and generating associations will create massive layers of incomprehensible bias while completely missing the point. Structured language is an invention to eliminate intonation; because the things it's going to say run straight into the trap of incompleteness, it's trading one form of ambiguity for another, and that's not something that can be solved with any amount of computer science. It's likely the linguistic equivalent of the uncertainty principle: if you think you know, you don't.
Thank you for this post! It's extremely inspiring. 😀
I foresee job opportunities in the future to clean up the gigantic mess of AI generated code.
I don't. It's not feasible. Even with human-written code, it's often easier and simpler to throw it away and rewrite it from scratch if it wasn't written by a good programmer.
With AI we would have 99% of code in that state.
But what we WOULD get, 100% guaranteed, is lots of companies and tools that would PROMISE you that… and then fail to deliver.
They would sustain themselves on promises for quite a long time without actually delivering anything that works.
Like Tesla's Autopilot… how many years ago were the first cars released that were supposed to "eventually" be able to drive themselves?
The strongest case for AI is "whatever I'm bad at that the AI suddenly lets me pretend I'm good at."
Very few people are good at programming. It's a strange skill; tedious, and with little to show for your efforts. Not the sort of thing you can dabble in and quickly find rewarding. Not much praise to be had from others in "hello world" exercises.
So, everyone who isn't much good at programming thinks programming is the strong case for AI. People who are good at programming but bad at other things, like art, think that AI is really good at art but poor at programming.
Everyone is wrong. AI is bad at everything.
As a good programmer, I have said the whole time that it's bad at programming and depressingly enough not getting any better (I would love for it to relieve some of the most tedious parts of programming). As a mediocre artist, I have said the whole time that the "art" is bad and meaningless, because even I can do better. As a crappy writer but avid reader, I can tell the writing is bad but I can't tell you how to improve it.
Not only is it bad at everything, but all the ways it has "improved" have nothing to do with what makes it bad in the first place. The code has always lacked comprehension of the task and how the software is meant to be used; now it can write more bad code faster. The art has always been soulless; now it can make high detail soulless art. The writing is tedious and obnoxious; now it can write a whole tedious, obnoxious book where it could only do paragraphs at a time some years ago.
There is no case for LLMs. It doesn't exist. Whatever piece is missing, that piece that will make them useful, remains undiscovered. I would like for someone to find it, but that won't happen until everyone can admit that they're useless and stop trying to make them useless at larger scales.
Why do you think big tech is still plowing such massive amounts of money into developing these LLMs and all the surrounding infrastructure? Surely they can also see how ineffective they are at even the task they're supposed to be the most helpful with (programming).
I think tech runs on hype and hot air, and tech investors live in a constant state of FOMO, determined not to be the guy who passed up "the next google/facebook/etc." We've already seen several high profile tech pitches implode spectacularly that anyone with relevant expertise could have debunked (and did).
Theranos, WeWork, FTX are perfect examples. They were never going to work, they knew internally they were never going to work, and anyone on the outside who knew anything knew they were scams and tried to warn people. It didn't matter. As long as you can convince credulous investors and stock speculators that you're going to change the world, you get rich and rarely suffer consequences. It hasn't been about making things that work for a long time, only about telling a good story and making your exit before reality catches up with you.
LLMs are just the latest in the tech fraud hype cycle. Maybe the straw that finally breaks the camel's back and reintroduces healthy skepticism to tech investing. We'll see.
It sure seems like there's a lot of mass delusion going on. It's going to be very tough for most of these companies to make a profitable exit when they aren't making any profits and have no hope of making profits. OpenAI is totally worthless financially.
They seem to be pursuing market capture.
Part of it is the same kind of bid as with social networks, "the network effect": if everyone is on Facebook, then everyone has to be on Facebook, and you achieve vendor lock-in.
But I'm not sure this applies to LLMs at all. Imagine for a moment they were actually useful. Does it follow that if you capture the largest market share you "win" the AI game? Will your friends pressure you into using "chatty" instead of or in addition to "grok"? Do you lose anything by switching between vendors freely? Current early adopter behavior indicates the opposite - they're always going on about how you really have to consult multiple LLMs to get the most out of them.
Another huge part of it is regulatory capture. Because it's not really possible to maintain dominance by having "the best LLM" (innovations get reverse-engineered and replicated by competitors on weekly or monthly timescales), one strategy is to just make it effectively illegal to compete with you. This is what they were pursuing up until around last year, when DeepSeek proved that non-US AI research could compete, and so regulatory capture in the US was off the table. So now suddenly they've changed their tune on the whole "please regulate us" thing and will settle for nothing short of international treaties. Which they're not going to get.
Anyway I could keep ranting but tl;dr I think you're right. There's no plausible way for any of these companies to make a profit.
I have noticed a psychological phenomenon in my many decades of working with computers. People tend to count the time spent working positively toward a goal but not time spent fixing regressions: the mistakes or bugs. I suspect it is a carryover from practice at a task. How fast you can do something perfectly is what matters. Mistakes can be corrected and, therefore, shouldn't be counted when considering task speed. I suspect this is partly what throws off programmers' expectations of AI coding.
“… this 2023 cartoon (I couldn’t track down the artist) “
Originally posted on
r/ProgrammerHumor
By u/ElyeProj
https://www.reddit.com/r/ProgrammerHumor/s/hljMLqjmc0
A couple of things in play here:
- Extremely simple features and simple apps are the sweet spot for GenAI Coding
- Your role moves away from hands-on engineering/coding and into both Design and Director
I was able to use Gemini Canvas to build, debug, enhance, and deploy a JavaScript plugin with a hosted API within a couple of hours. I am not a developer, but a Product Designer of 28 years.
I did know what I wanted the end experience to be and knew the basics of Git and Netlify, but had zero ability to wire it all up. GenAI guided me through stepping through it all, as well as through fixing errors and extending the functionality.
Debugging was the process of me directing it to create the right outcome rather than to fix specific lines of code. All outcome-based.
Again, this seems plausible only for simple functions. I can't imagine it scaling, but when it does, it will be another abstraction of what it means to engineer.
So how did you check its behaviour in not-main-path conditions? Most of debugging is trying to fix what happens when everything does not go right; the main path is the easy bit.
This was a side-of-desk, simple API call with limited test cases. As such, I simply executed the cases as a user would by interacting with the feature. I'm a Principal Product Designer, not an engineer, so I know there are far more thorough approaches.
That being said, even the simplest adjustment to the feature sometimes caused the GenAI to revert or completely rewrite the stable parts of the feature. One step forward, two steps back.
This is not surprising since it’s the same behavior we observe from Image GenAI. While Google’s Nano Banana has come a long way to preserve the majority of the image and only change the requested portion, it still fails to understand the intent. And how could it?
It takes hours of design and communication between designers and engineers to arrive at a common understanding of what to build. I'm not shocked that the AI can't perform with a single line of directional text.
This recent online article is also relevant:
"AI models just don't understand what they're talking about"
https://www.theregister.com/AMP/2025/07/03/ai_models_potemkin_understanding/