102 Comments
Oct 1 · Liked by Gary Marcus

This aligns with my experience. I'm a Sr. Engineer and we use Copilot, Cursor, ChatGPT, etc. at my company.

Personally, I haven’t seen a meaningful uptick in feature velocity since we adopted GenAI coding assistants, but I am seeing more code volume from Jr. devs with bizarre bugs. My time digging through PRs has ticked up for sure.

In my dev work I find myself turning off Copilot half the time, because its hallucinated suggestions get pretty distracting.


Does it help you with debugging at all? I find it helps me with that, but I'm a complete noob so I have my doubts if it would be of any help to someone who knows what they're doing.


It's quite good at explaining code, writing tests, refactoring, writing comments, and documentation: basically anything that is based on already-written code. The context is lost most of the time, and for debugging I have a very low success rate; usually it's just repeating nonsense. But you can paste in an error message to get a good explanation, which certainly helps.


Programming is painstaking work. If you paste in some code and hope for the best you will encounter a lot of grief, much of it much later.

GenAI is an aid, to be used in small incremental doses. But then it is good for you.


AI-machinery makers are now gathering the input of users, in the form of the small doses you describe, to build better machines that can guess what the original request/task requires. The thing is, as Marcus already mentioned: when open-source code is already there, written in a properly reconfigurable way, then it is already reusable and the copilot is not needed. Right now a code copilot is just a non-transparent search engine, because it should point out the source it got the code from (including its license) instead of maintaining the illusion that it is "generating" code or "co-"programming. And on the other hand, if there is a truly new requirement that really needs a new piece of code, algorithm, or data structure, then that will need a person who understands the new requirement/problem and crafts a new solution.

There is a gap in the search for good, properly working code. But current AI code generators are designed (my opinion, from how I read the intentions of OpenAI and Microsoft) to imitate/replicate developers in order to replace them, with the copilot as an intermediate step toward closing their own gap (the current limitations of code generators), not toward closing the search-for-good-code gap (the problem whose solution keeps human agency in place and reduces technical debt).


There are many ways of looking at it. CoPilot is helpful in my work. There's always a need for custom code, even if the pattern is maybe already obvious in other code.

I anticipate there will be future tools for refactoring, debugging, etc.


"GenAI is an aid, to be used in small incremental doses. But then it is good for you." For some uses (that aren't critical if they fail or where errors are easily spotted — think getting code for a plot where you as a human can immediately see not the code but the plot is wrong — it may even be valuable for larger doses. That speeds you up because GenAI's 'estimation' is good enough to get a higher productivity.

But if what you say is true, the question of how GenAI companies are valued becomes important.

The field of coding is very wide. There will be areas with very high productivity growth, and areas with a low risk of negative effects from GenAI's lack of understanding of what it is doing. How much of either we will get, we do not really know, but I suspect a lot of IT won't benefit except on the surface and in isolated cases, where not much depends on very high correctness requirements.


This really depends on if chatbots can improve.

While the work people do is very complex and requires an immense amount of detail, the number of strategies people use is small, and very context-dependent.

Human labor is very, very expensive. Painstakingly training a chatbot to reliably do simple (but still tricky) jobs is very plausible, and the payoff will be immense. We will see more of that. It won't be quick or cheap.

Oct 1 · Liked by Gary Marcus

A software developer learns how to code. An LLM doesn't even know what code is. Throwing together a probabilistic sequence of vectors that appear many times in GitHub repos will only get you so far.

Oct 1 · Liked by Gary Marcus

Imagine if all this money were given to open-source libraries, frameworks, and higher-level language creators...

They actually raise the level of abstraction and let programmers do more with less code. And it's been happening since the beginning, without any hype.


More generally, we have not figured out how to fund public goods (FOSS being just one example).

Oct 1 · Liked by Gary Marcus

There was a great paper recently (SWE-bench) that found that, off the shelf, the best LLM models solve about 2% of a curated set of GitHub issues. Even if this can be 10x'd by fine-tuning, that still is not a replacement for a software engineer, especially since someone still needs to verify the solutions.

Oct 1 · Liked by Gary Marcus

The only way we'd see a 10x programming productivity gain is if AI could write entire apps reliably from some kind of easy to write description. Of course, that is exactly what some hype merchants have claimed. Assuming there was such a problem domain, management would quickly realize that this means it is so regular that they could write a single program that, with a few input parameters, could generate the target apps with greater reliability and maintainability, and fewer compute resources, than the AI solution.


"The only way we'd see a 10x programming productivity gain is if AI could write entire apps reliably from some kind of easy to write description." Another way could be if AI got 10x more people into programming.


I know you are joking but, in case you weren't, I believe that the productivity gain being sought is per-programmer while holding the quality and cost of the programmer constant. Of course, this is an unachievable ideal. For example, many programmers wouldn't want a job prompting a coding AI.


If ever there was a misnomer, “prompt engineering” is it.

"Prompt engineering" has nothing to do with real engineering.

Everyone wants to be called an engineer these days (even computer programmers).


I think it qualifies as an engineering task. I would stop short of calling someone an engineer if that's all they knew how to do.


Maybe something you are missing is that there are a lot of different reasons why people program ... and a lot of ways of being productive writing code ... maybe the kind of SE you are doing is only part of the whole picture?


Or perhaps a prompt engineer should no longer be called a programmer, nor what they do be called programming.


As an aside, there is also something called LLM programming ... programming languages to program prompts ... https://github.com/yakazimir/esslli_2024_llm_programming

Oct 1 · Liked by Gary Marcus

Sergey Brin recently commented at the All-In Summit that none of his devs are using AI. He thought they should be, and has been trying to encourage them to use it. He says he wowed them a few times when he used AI to quickly generate some demo apps. But that raises the question: why aren't devs, the people most amenable to AI, who readily use and adapt to new technology, adopting it, and instead have to be pushed into using it? Another data point in support of Gary's premise: experts find much less benefit from LLMs than non-experts, who can be happy with an almost-solution.


This, exactly. If I've run into a problem that I can't figure out, even after scouring places like stackoverflow, there's zero chance an LLM is gonna get me the answer. It's basically doing a stupider, less reliable version of searching the internet!


I found the opposite. I've had to search pages of Stack Overflow looking for ideas, trying to filter out all the answers that miss the point. I found that although Copilot can sometimes give non-working answers, it's nearly always applicable to the problem and can kickstart me toward the correct solution.


My experience as well, except when the LLM has been trained on deprecated documentation. Then there's no end to how annoying it is.


I've found it can sometimes get into an infinite loop: it gives me a non-working answer (although very close), I tell it that it is wrong, it acknowledges its error, and then gives me the "corrected" code... which is exactly the same as the previous answer.


Good to hear; maybe I'll try GenAI again next time I hit a wall. The few times I've tried, it would give me the kinds of suggestions I'd already seen and knew didn't work, or that were addressing a different problem.

Admittedly there's some confirmation bias on my part. I would expect LLMs, given how they work, to have a hard time distinguishing solutions to similar-sounding problems from solutions to the problem I'm facing, and the more niche the problem, the worse they'll perform. When I scan through pages on Stack Overflow, I'm using my domain knowledge to identify promising leads and discard others. LLMs don't have domain knowledge, but they're awesome at syntax.

I'm open to the possibility that I've sold them short. Programming is only a small part of my job, so I speak from limited experience.


I have found that it occasionally is unable to give me anything useful for some more niche C++ template meta-programming problems where there are few examples for it to train on. Sometimes I've had to make several attempts at rewording the problem until it stops repeating the same irrelevant answer.


Coming back to my distinction of shallow vs. deep SE, one can push this a bit further. Even in deep SE there are shallow problems with which LLMs can help. Missing documentation is an important one.


This doesn't match my experience. I know plenty of FAANG engineers who are using it voluntarily. I suppose it's more useful for some languages than others; I've heard it's terrible at Rust, for example.


I like to distinguish shallow and deep SE. Many devs are working on deep SE. LLMs are not very useful there. But being able to create your own apps instead of using corporate apps can be really empowering and LLMs are great for that. Different application area, different people ... this is where I see a potential for 10x.


So much funding wasted on Sisyphus. No modularity in AI equates to no clever design. Slow and buggy code crops up rather than correct and optimal. Sigh.

Oct 1 · Liked by Gary Marcus

People should be aware that using Copilot is risky from an intellectual-property standpoint. If you're writing code for your company, or for hire, you're potentially giving up your copyright to the code. Be careful.


Microsoft claims they will pay to fight any copyright infringement suits, but when it comes right down to it, I'm sure their lawyers will find some excuse not to do so (based on some claimed violation of the end-user agreement). How many people have the money to hire a lawyer to fight Microsoft AND a copyright holder claiming infringement? Good luck with that.

Despite assurances from Microsoft, programmers and others are really foolish to be mindlessly using the output of GenAIs before the copyright issues have been resolved in the courts because the potential fines for infringement can be very steep.


Well, if Microsoft is using their own in-house code to train their AI, that would actually be the best argument against using the code generated by the Microsoft AI (even better than the potential copyright-infringement argument).

To say that Microsoft is not known for reliable, secure, bug free software would not only be to state the obvious but also be an extreme understatement.


AFAIK, the big players train their LLMs on in-house code.

Oct 1 · edited Oct 1 · Liked by Gary Marcus

Even 2x would be hype. And don't forget the LLM terms-of-service forbid working on AI/ML code.

Oct 1 · edited Oct 1

As someone who experiences the 10x in real life (despite the cringe attached to it, I think it's an apt term), I think critics are missing the obvious in their criticism.

1. building software is mostly not about code

2. LLMs don't do all that well at code, but they can generate things that have the right code shape

3. there are many artifacts besides production code that are extremely useful for building good software

If you put this all together and focus on "what do humans need to build good software collaboratively", good uses of LLMs become apparent:

- good documentation / rfcs / knowledge bases / onboarding docs / mentoring / etc...

- logging, monitoring, error messages, visualizers, analysis tools, etc...

- prototypes prototypes prototypes. You don't even need to run them, but they are a sort of solo-adventure-whiteboard-brainstorming

I gave a workshop about the topic that hopefully gives a bit more insight into how I approach things: https://www.youtube.com/watch?v=zwItokY087U

Handout is here: https://github.com/go-go-golems/go-go-workshop

What this looks like in practice (in my open-source work, at least) is that I can build software like this: https://github.com/go-go-golems/go-go-labs/blob/main/web/voyage/app.md in an hour or two in the evening, after work, without feeling like I am really writing software.

For longer-term software: https://github.com/go-go-golems/go-go-labs/blob/main/pkg/zinelayout/parser/units_doc.md

I don't really care if I have to fill in the 10 lines that do the actual complicated thing, that's fun.

But I 100% stand behind 10x improvement in (productivity is maybe not the best word) quality. Faster "coding" means faster iteration/prototyping, and iteration is one of the key ingredients to building something that actually is useful.


"I don't really care if I have to fill in the 10 lines that do the actual complicated thing, that's fun." That is exactly my experience as well.


I just watched your "10x Development" YouTube video. Very interesting, and I appreciate the demos. I subsequently tried out Phind.

My first question: if the scenario is that you are developing and maintaining a large existing legacy in-house codebase, how do you use these LLM-based code-assistance tools in your day-to-day work when the LLM has no knowledge of the in-house codebase and pasting proprietary code into the LLM for questions/debugging is prohibited?

Another question: in your experience, does LLM hallucination come up as an issue in any generated code? As far as Phind goes, I just tried out its Ask tab (I guess the equivalent of a regular LLM chat session). I asked one arithmetic question, the division of a 4-digit number by a 5-digit number. The answer from Phind was wrong after 3 decimal places, causing further incorrect rounding in the elaborated part of its answer. Then I asked my usual *variation* of the wolf/goat/cabbage/farmer river-crossing problem, and again Phind got it wrong, with a nonsensical answer, even after a couple of rounds of clarification. I chalk it up to hallucination, but in effect it is just generating next tokens to best fit its training examples, which I imagine do not include my particular *variation*. So: in your daily LLM conversations, do you experience any form of hallucination? If yes, how do you deal with it?


I just tried to use Phind to generate some toy code to extract birthDate from JSON text, and after one prompt clarification I got the snippet below as part of the generated code. Are lines 5 and 6 a hallucination? In your experience, have you encountered generated code like this? Someone opined that any mention of hallucination can be attributed to a not-good-enough human prompt. Do you think an LLM should generate the code below as the result of a not-good-enough prompt?

1 for key, value in data.items():
2     if isinstance(value, dict):
3         if 'birthDate' in value:
4             birth_dates.append(value['birthDate'])
5         elif 'name' in value and 'birthDate' in value:
6             birth_dates.append(value['birthDate'])
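For what it's worth, the elif on line 5 can never run: when 'birthDate' is in value, the if on line 3 already catches it, so lines 5 and 6 are dead code. A minimal cleaned-up sketch of the same extraction (assuming a flat dict of dicts, which is my guess at the intended input shape) would be:

    # Collect every birthDate found in the top-level dict values.
    birth_dates = []
    for key, value in data.items():
        if isinstance(value, dict) and 'birthDate' in value:
            birth_dates.append(value['birthDate'])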


Sorry, the indent got stripped out after hitting Post. Lines 3 and 5 are on the same indent level, and lines 4 and 6 are on the same, further-indented level.


A lot of people are in denial.

It's not 10x; my calculations are 1000x.

Am I the only one who uses these tools?

I configured an SAP integration to create ANSI X.12 850 messages from IDOC02 documents to a gateway in 5 seconds. Python code to generate full coverage test vectors: 30 seconds. Installation script with SAP API: 5 seconds. Should a test fail, the failure is automatically combined with the code for a revision: 5 seconds. It was stable within a minute.

Tried more complex usage tests - generated a device driver for printing which intercepted an image, used a separate AI image API to upscale it to maximum printer resolution, then enqueued the result in 5 minutes. It took me longer to find an upscaling service.

When I generate a business book (200-page documents), as my system streams through the prompt matrix and I want an illustration, the generator requests 10 Python scripts to generate the diagram. They are tested automatically; the first one that works stops the testing and is stored in a library within the book source.

It took me 8 hours to reverse engineer a data structure compatible with SAP, ServiceNow, SalesForce, Teamcenter and Kinaxis. 770 tables, and I set it to auto-populate for a set of enterprise simulations I needed to do.

Created a specialized document-analysis tool in 1 hour that would have taken an ordinary development process a year; I know because many teams had not finished.

This is the limit of what I can share, but let me say, it's wonderful.


Your description of the process and the fabulous performance of "these tools" almost seems magical. You use durations like "5 seconds", "30 seconds", etc. Five seconds is what it takes a normal person to type a few words. A few words typed in, and your new system is integrated, onboarded, up and running? I would think corporate IT departments would be falling over each other to acquire such technology. Unfortunately, I don't see that at all.

"I configured an SAP integration to create ANSI X.12 850 messages from IDOC02 documents to a gateway in 5 seconds." Is the SAP integration an LLM-based tool? How did you specify your intended configuration to the tool: mouse-clicking selections, or typing English sentences? Either way seems to require more than 5 seconds if your configuration is at all larger or more complex than pre-canned selections, right? Or is the configuration already a pre-canned selection, and all you had to do was tick one checkbox? In that case 5 seconds is possible, but that is just pre-canned programming, not LLM capability, right? What am I missing here?

"Python code to generate full coverage test vectors: 30 seconds." How do you verify the correctness of the generated vectors? Does that take your time, or do you not really care?

"Installation script with SAP API: 5 seconds." Five seconds for an LLM-based tool to generate some script I can certainly imagine. But again, how much time did it take you to enter your requirements into the tool? How many times did you have to do that before a satisfactory result arrived? How do you know the generated scripts were correct? Do you have to spend time verifying and correcting mistakes, since LLMs do have this feature called hallucination? Or do you not really care?

The same kinds of questions apply to your other examples. Basically: who verifies the generated tests? Who determines whether an automatic test passes or fails? Is that human-configured or LLM-generated? If it is generated, who verifies the pass/fail judgement? How much time does it take a human to convey to the LLM tool how the tests and judges need to work? If no language input is required, then it is just pre-canned programming. If language input is required, then what is the verification process, and what effort ensures the final configured behavior is accurate and correct?

Your examples are either hard to imagine or the result of pre-canned features that probably have nothing to do with an LLM. Without more details it is very hard to picture! If everything presented is factual and all the magical performance can be attributed to LLMs, I seriously suggest you change careers and become an OpenAI enterprise marketing representative. I imagine they would value a 1000x LLM practitioner such as yourself to showcase their models to corporate USA for speedier inroads into industries. The financial rewards there? The sky is the limit.


They have some kind of serious placebo effect going on, I think.


We make similar use of it. For us it's unbelievably useful. Some of these articles seem... a bit agenda-driven.


Yep, denial and inexperience.

I finished up my OpenAI-written contract system today, as a kind of hobby. I'm glad others use these tools. I ran an IT department, worked for the CIO of a Fortune 50 company, and oversaw a budget of $1.2B; this would have collapsed my ERP staff within 6 months, because we would be finished with 100% of projects. Aside from endless SAP, the last major custom tool we built has lasted 23 years; it was so integrated into so many systems that it was hard to keep current.

For fun, I've specified how it can self-mutate to reduce the interface rework time from a month to an hour (or less). It's quite insane. It requires almost no people at all. That's 10% of my old supply-chain team gone.

I'm not talking about "Copilot"; I'm speaking of coders with 10 or more years of experience being made somewhat irrelevant.

The entire Ariba Network model becomes irrelevant for B2B integrations. Tools to "scrape" log files to reverse-engineer enterprise system flows: irrelevant. "No code" hideous graphical torture to build workflows: irrelevant.

I haven’t been able to find a single area where these tools - even at their simplest - don’t write code that’s better than highly experienced blue-chip programmers.


Then you should lay off all the programmers and run the project by yourself; after all, all work is now measured in 5 seconds here, 30 seconds there, sort of like the "hobby" items in your words. You will be the first single-person billion-dollar company/team that Sam imagined a while ago. You should seriously consider advertising your company's name and your team's name so we can all go learn from you. Just imagine the increase in productivity for the benefit of humanity.


If you have a 1000x speedup, then you are more productive than entire companies like Hugging Face, Mistral, and Anthropic.

They all have fewer than 1,000 employees. I look forward to using the amazing software you make.


The role of "programmer" is going to melt away. Programmers become somewhat irrelevant, since the problem moves from writing code to precisely specifying behavior.

I don’t hire or fire anyone. But it will be quite rare to need large teams to do detailed work.


I know the time because when I have teams build these things, it takes weeks if not months. I once had a quote to do an EDI 850 mapping setup in "only 2 months", for a single supplier/customer relationship.

ANSI X.12 standards have around 900 objects, of which around 20-30 are commonly used. They are quite old. Likewise, SAP interfaces, even R/4, are quite well known.

It literally took me 5 seconds to get the same SAP setup code.

Alarmed is not the word.

Hallucination is a word for the result of poor specification.

This will place a nuclear bomb at the center of the software development process.

You can deny it or you can leverage it.


I appreciate your reply, even though it is not directly attached below my response post. To illustrate your point, maybe you could demonstrate your use case (or some made-up example to show the concept, if corporate rules or trade secrets are in the way) in a YouTube video, like the links provided by GO GO GOLEMS in this same comment section. That way you could really help change minds, if it works such magic.

I am trying to figure out what your input to the LLM tool is, what its output is, and how you verify the output's correctness. I just looked up some X12 example messages, one of which looked like a medical-claim invoice. This reminds me of the FIX trading protocol and its associated parsers. Aren't these kinds of industry-standard protocols already equipped with standard parsers? The 2 months of work quoted by your programming team and the identical SAP setup code the LLM then produced in 5 seconds: are they program code, configuration files, or something else? Are they large or small? Is your input (your perfect prompt) a written spec of many volumes of text, or just a few words? All of these factor into one's consideration of applying an LLM to one's own workflow, and some working examples could really help people make the move.

Also, I just posted an example of hallucinated generated code in reply to GO GO GOLEMS's original post (just scroll down a few posts and you will see it). Those are the kinds of things (a benign form, in this case) that I asked about how to verify and correct. I don't think poor prompting is an excuse for such code generation, do you? (The code was generated by Phind, which according to some posts internally uses GPT-4; please correct me if I am wrong.)


I never use code directly; it is always verified. I have just automated the process heavily. I don't seek to change all minds; I merely point out that, for those of us willing to try new things, it's a radical change with staggering benefits.

Nobody writes assembly code.

Nobody writes Fortran-IV or COBOL code.

Nobody writes Pascal or C.

Advanced techniques tried to get people to “code” graphically. That’s all gone.

Nobody need write C# or Python or Java ever again.

OpenAI has every line of code written, just waiting to be conjured up by inference. It’s the biggest code library on earth.

OpenAI has every enterprise architecture ever created, nobody has to do solutions again.

OpenAI has every enterprise data model ever considered, nobody has to derive one again.

You just have to carefully ask for the correct one.

It’s not magic, it’s a library.


That explains a lot! Just wow!


You get it! 😉


As a C# developer, I believe ReSharper and their Rider IDE have done more to make my job easier than anything else.


GenAI can do things that take some people days if not weeks, and does so with more precision than even the best human programmer. It also makes the most insane and subtle bugs I've ever seen.

As someone who's been programming for 15 years, it feels like magic—and with all the same caveats. It can provide incredible value, but ultimately is only as good as the person using it.

I appreciate that you're looking at the broader picture, beyond my, and other people's anecdotal evidence. The overall net effect is going to ultimately reflect the "energy" put into it. It will be a reflection of what motivates those people using it.

What outcome are you hoping for, either for GenAI, or your study and writings about it?


Those types of claims utterly ignore technical debt up the wazoo that's gonna bite every "LLM-code" infested project out there: https://www.geekwire.com/2024/new-study-on-coding-behavior-raises-questions-about-impact-of-ai-on-software-development/

=====

But while AI may boost production, it could also be detrimental to overall code quality, according to a new research project from GitClear, a developer analytics tool built in Seattle.

The study analyzed 153 million changed lines of code, comparing changes done in 2023 versus prior years, when AI was not as relevant for code generation. Some of the findings include:

“Code churn,” or the percentage of lines thrown out less than two weeks after being authored, is on the rise and expected to double in 2024. The study notes that more churn means higher risk of mistakes being deployed into production.

The percentage of “copy/pasted code” is increasing faster than “updated,” “deleted,” or “moved” code. “In this regard, the composition of AI-generated code is similar to a short-term developer that doesn’t thoughtfully integrate their work into the broader project,” said GitClear founder Bill Harding.

The bottom line, per Harding: AI code assistants are very good at adding code, but they can cause “AI-induced tech debt.”

=====
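To make the churn definition quoted above concrete, here is a minimal sketch of how one might compute it from per-line history records; the record shape and the two-week window are assumptions drawn from the quoted definition, not GitClear's actual methodology:

    from datetime import datetime, timedelta

    # Hypothetical per-line history: when each line was authored and when (if ever) it was removed.
    line_history = [
        {"authored": datetime(2023, 3, 1), "removed": datetime(2023, 3, 8)},   # churned within two weeks
        {"authored": datetime(2023, 3, 1), "removed": None},                   # still in the codebase
        {"authored": datetime(2023, 3, 1), "removed": datetime(2023, 6, 1)},   # revised much later
    ]

    def churn_rate(history, window=timedelta(days=14)):
        # Share of authored lines thrown out within the window.
        churned = sum(1 for line in history
                      if line["removed"] is not None
                      and line["removed"] - line["authored"] <= window)
        return churned / len(history)

    print(f"{churn_rate(line_history):.0%}")  # -> 33%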


And thanks for the link, very useful.


"Those types of claims utterly ignore technical debt" On the other hand, LLMs can make some legacy code more maintainable.


I did a test today. Starting from zero code, I created a Python-based tool which, given a PDF, DOCX, or TXT file containing an arbitrarily complex contract of any of a few dozen types and subtypes, uses OpenAI to deconstruct it into a highly structured JSON document containing XML structures that verifiably comply with an arbitrary XML schema (XSD document), following conversion guidelines from major ERP vendors, and creates unified contract-classification keys at the contract, section, clause, entity-class, entity, datatype, and value levels. I basically just asked OpenAI GPT-4o to write code to do what I do when I analyze procurement contracts.

It supports clause linking and generation of a clause library, template library, type library, and variable library. It runs interactively or batched, self-instruments performance and prompt-accuracy profiling, tests regenerability of the original contract to ensure no data loss, and helped me discover errors in vendor documentation.

Tomorrow it will capture lightweight tabular text formatting in contracts into XML CDATA (I had to learn XML structure today) and embed image blobs (signatures), and I'll allow it to consume all available processing resources, either on a workstation or within cloud compute resources, to parallelize the effort (or until OpenAI shoots me).

The only reason I wasn't done in 4 hours was that I was given a choice between xmlschema and lxml, and xmlschema gave me inaccurate results, making me think the XML generation was faulty.
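For anyone curious what that schema-validation step looks like, here is a minimal sketch using the xmlschema library; the schema path and the sample clause XML are placeholders, not the actual artifacts from this project:

    import xmlschema

    # Load the vendor-supplied XSD (placeholder file name).
    schema = xmlschema.XMLSchema("contract.xsd")

    generated_xml = "<clause id='42'><text>Payment is due in 30 days.</text></clause>"

    # is_valid() returns a bool; validate() raises with the reason on failure.
    if schema.is_valid(generated_xml):
        print("clause XML complies with the schema")
    else:
        try:
            schema.validate(generated_xml)
        except xmlschema.XMLSchemaValidationError as err:
            print("validation failed:", err.reason)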

The intention is to digest a quarter-million documents for a huge conversion process. I've reduced the process from 5 years of farming the work out to India to 41,000 processing hours (sequential), which should parallelize down to 41 hours. I've never done distributed processing with modern tools, but I suspect it's pretty easy.

Had I not been sidelined by the xmlschema library, I would be done and ready for scaled testing. I will be pulling contracts (again, not me, but OpenAI-generated software) from the SEC EDGAR website, which holds some contracts that are part of 10-K filings for materiality; I should be able to find a few hundred.

Again, would this have been 1,000 hours of work? 800? 100? Would it have been possible without AI? I do know I had a tool in 6 hours that was not possible 5 years ago, that solves a general problem, and that could be used tomorrow. It's not a "copilot"; I just asked the right questions.


How do you verify the correctness of the output? By correctness I do not mean complying with a schema, as you mentioned, but 100% output fidelity to the input content. That is, how do you know a clause is not missed here or there, or a superfluous one added here or there?

An LLM produces grammatically correct text (whether the grammar is English, XML, or source code) because that is what it is trained to do, with token-level consistency, after consuming the entirety of the Internet. That is a solved problem, and has been for many years, even before GPT. That is not the contention. The key question is: how do you know the generated output content matches your intention? Or do you not care, to a certain degree? If you don't care, then yours is not a good example. If you do care, then how do you verify? Human eyeballing? Then what's the gain? Writing a script to verify? If you manually write a verification script, that is equivalent to writing the converter without using an LLM. If you use an LLM to generate the verification script, how do you know the verification script is correct? Any program that a human can easily verify is not hard to write manually in the first place, and if it is hard for a human to verify the correctness of the verification script, what gain did you just achieve?

You alluded to "tests regenerability of the original contract to ensure no data loss". That point is not clear to me.

Did you regenerate the original contract from the output that was generated from the original contract? That sounds clever, but I would still not trust the correctness by default. Do you require the regenerated contract to be verbatim identical to the original? I find that hard to believe, and even if that is the case, I would still have to verify the intermediate transformation steps (programs, generated or not) to be fully confident. If you don't require verbatim identity, how do you verify that they are content-wise identical without yet another verification script? And who generates/writes/verifies that verification script?

The fact that an LLM can generate any XML/JSON content or conversion script for you is because such data or scripts have more or less appeared in its training set (the entire Internet). (Have you ever typed a question into Google search that Google failed to auto-complete? That shows that whatever you just thought of, someone else has already asked.) That does not mean the LLM found the solution for you by itself. An LLM does not have a solution-finding mechanism; it is a text-completion mechanism. It is helpful to keep that in mind.

If the LLM helps you in your particular case, then all the best to you. All in all, though, your particular example does not seem to fit the general sense of "software engineering" (if we are generous enough to view programming as an engineering discipline). I was a programmer for many years for Wall Street banks and did the type of work you describe as side errands, apart from the main line of development work. I admit that I wish I had had an LLM back then to help me parse a data feed and find something quickly on an ad-hoc basis. But without it, writing a script to parse one way or another was not hard to begin with, and certainly not comparable to actual mission-critical program development, which requires imagination, discipline, design, integration, and verification, all of which today's LLMs lack, precisely because they are not a solution-finding mechanism but a word-completion mechanism.


Every output is normalized to lowercase and space-collapsed, then compared to the original for clause detection; that's level 1. The output can be 100% concatenated to regenerate the input text. Element detection in each clause extracts named entities: level 2. The entities are extracted, typed, and substituted as variables in each clause and stored for template use: level 3. The XML for each clause can be round-tripped back and forth, as can the XML for variables. The typing is summed over many contracts to settle on one small set of common variable types (start date and stop date, or quantity/unit-of-measure types, for example). The variable strings are then typed (date string, integer): level 4; then values are stored: level 5. Clauses are given named types over many contracts and standardized: level-6 metadata. Contract types and subtypes are extracted and standardized over collections of contracts: level-7 metadata. I have separate passes to generate document metadata and conversion statistics. Clauses stripped of enumeration and variabilized are put into a clause library for later reconciliation, and the same goes for variable types/names and contract types/names. The output is also locked to a hash of the file, with some other security.
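In simplified form, the level-1 check amounts to something like the sketch below (illustrative only, not the production code; the normalization is just the lowercasing and whitespace collapsing described above):

    import re

    def normalize(text: str) -> str:
        # Lowercase and collapse every whitespace run to a single space.
        return re.sub(r"\s+", " ", text.lower()).strip()

    def regenerates_input(original: str, clauses: list[str]) -> bool:
        # Level 1: concatenating the extracted clauses must reproduce the input text,
        # so no clause was dropped and none was invented.
        return normalize("".join(clauses)) == normalize(original)

    contract = "Section 1. Payment is due in 30 days.\nSection 2. Late fees apply."
    clauses = ["Section 1. Payment is due in 30 days.\n", "Section 2. Late fees apply."]
    print(regenerates_input(contract, clauses))  # True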

Oh, and I extract images, drop in a conversion placeholder pointing to the file, and run image classification/recognition for conversion annotation.

I asked OpenAI to write the verifiers as it wrote the extractions.

I have a tool I generated which works the other way: I can start with a title and use OpenAI to expand it into an outline, paragraphs, paragraph elements, and sentences. It generates a fictional-character schema, it can link multiple volumes, and it can generate a multi-volume series, a novel, or a short story; the sections can hold multiple content types: sentences, verse, generated images, tables, and other strongly typed content. It can write any kind of book besides fiction: movie scripts, white papers, research papers, software, training manuals, whatever. I write books for friends.

As for work experience: I worked in banking and commodities trading out of my dorm room at Caltech when I was 19, in the early '80s. I used the internet as a playground when it was still ARPANET and you had to have a login to a NIC node. I have run IT departments, worked for CIOs, and been in the industry for almost 45 years. These tools do data architecture, solution architecture, and coding better than any person or team I've worked with globally in those 45 years. Personally, I've designed chips, built kernels, drivers, and application packages, and delivered or directed code delivery up to ERP scale, in supply chain, finance, product design, sales, service, and marketing.

These tools do the work in 1/1000 the time.

You can deny it, or you can figure out how to leverage it.
