The paper by Ernest Davis is indeed worthwhile. I found the final paragraph of section 4.8 especially illustrative.
Using the LLM as a sort of 'loaded dice' for the mutation element of a genetic algorithm is a nice ('tinkering engineers') trick, but it also raises a question about the effect of the LLM's constraints — they are stochastically constrained confabulators, after all — on the mutations, and therefore on how effectively you can find genetic optima.
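To make the 'loaded dice' picture concrete, here is a minimal sketch (the loop and names are mine, not FunSearch's code): the LLM sits exactly where a classical mutation operator would, and an objective fitness function still does all the selecting.

```python
import random

def llm_mutate(program: str) -> str:
    # Stand-in for an LLM call that is prompted with the parent program and
    # returns a rewritten candidate. The proposal distribution is 'loaded'
    # by whatever the model absorbed during training; stubbed out here.
    return program

def evolve(population: list[str], fitness, generations: int = 1000) -> str:
    # Minimal genetic loop: tournament selection, LLM-biased mutation,
    # and an objective fitness score deciding what survives.
    for _ in range(generations):
        parent = max(random.sample(population, 2), key=fitness)
        child = llm_mutate(parent)
        if fitness(child) >= fitness(parent):
            population.append(child)
    return max(population, key=fitness)
```

Whether the loaded dice help or hurt then depends entirely on whether the model's biases happen to point toward the optima you care about.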
It seems we're in the 'engineering the hell out of a fundamentally limited approach' stage for transformer LLMs. And the overselling by Google PR is becoming a pattern (see Gemini).
An LLM can be used, at most, as an initial guess, yes. So there's going to be a lot of infrastructure built around it to make it reliable, which I think is quite doable.
Just started to read the paper, and found this (line 54):
"First, we sample best performing programs and feed them back into prompts for the LLM to improve on; we refer to this as best-shot prompting. Second, we start with a program in the form of a skeleton (containing boilerplate code and potentially prior structure about the problem), and only evolve the part governing the critical program logic. For example, by setting a greedy program skeleton, we evolve a priority function used to make decisions at every step. Third, we maintain a large pool of diverse programs by using an island-based evolutionary method that encourages exploration and avoids local optima."
This is already coming across to me as a likely case of stone soup — I'm referring to an old fable in which soup was allegedly made from a stone, but it becomes clear in the telling that there were lots of other ingredients that actually made it soup. Given the structure they've described above, they could very well have gotten the same result, I would expect, using a random program tree generator — this is just John Koza's genetic programming technique from the '90s. Does anyone seriously believe that there was information anywhere in the LLM's training corpus that bore on this problem?
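For readers who haven't opened the paper, the 'skeleton' setup amounts to something like this (my own reconstruction for a bin-packing-style problem, not their code): the greedy driver is fixed boilerplate, and only priority() is what the evolutionary search, whether LLM-driven or Koza-style, is allowed to touch.

```python
def priority(item: float, bins: list[float], capacity: float) -> list[float]:
    # The evolvable part: score each open bin for the current item.
    # A trivial hand-written baseline: prefer the fullest bin that still fits.
    return [level if level + item <= capacity else float("-inf") for level in bins]

def greedy_pack(items: list[float], capacity: float) -> list[float]:
    # The fixed skeleton: put each item in the highest-priority bin,
    # opening a new bin when nothing fits. This part is never evolved.
    bins: list[float] = []
    for item in items:
        scores = priority(item, bins, capacity)
        if bins and max(scores) > float("-inf"):
            bins[scores.index(max(scores))] += item
        else:
            bins.append(item)
    return bins
```

Koza-style genetic programming would mutate the body of priority() with a random program tree generator; here the LLM is asked for the rewrite instead.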
If you look at appendix A of the Supplementary Material, or at my review, you will see that they tried the experiment with standard genetic programming techniques, using a carefully tuned mutation engine. It worked better than they expected but not nearly as well as with the LLM. It is certainly possible that one could design a better genetic programming algorithm, but they did try the experiment.
Oh, okay. Thanks for pointing that out.
So what do you think is going on here? Is it just that the LLM is unusually good at coming up with code snippets that could be useful? Or do you think that the collective code-related knowledge of the Internet somehow predicted the answers to this particular problem?
I discuss this as best I can in my review, but it's really hard to know why this works, particularly since we have only the five examples. The problems are susceptible to greedy search using priority functions that in some sense resemble code snippets in the training set, and which, moreover, can be reached by searching upward through a population of code snippets.
https://cs.nyu.edu/~davise/papers/FunSearch.pdf
All the talk about symmetries that were found in the code makes me wonder whether the LLM is just better at detecting, combining and extending 'interesting small-scale' symmetries relevant to scoring algorithms in the currently most successful code outputs. Not in an 'I have good reason to believe this symmetry might work for the problem at hand' way, since the LLM doesn't even know what the problem is (and if it did, the PR claims would be much more justifiable), but in the kind of dumb 'fill in the pattern' way that would also enable a human programmer, with zero knowledge of what problem is being solved aside from it having something to do with linear algebra, to predict that, say, what follows after 'result = X.yz + Y.xz + ...' is 'Z.xy'.
For instance:
https://chat.openai.com/share/c3c2aa5f-4908-48f5-b117-434356448451
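Or, spelled out as code, the kind of 'dumb' completion I have in mind (a made-up fragment, not anything from the paper):

```python
from collections import namedtuple

# Hypothetical pieces of a scoring function over three symmetric objects.
Vec = namedtuple("Vec", ["xy", "xz", "yz"])

def score(X: Vec, Y: Vec, Z: Vec) -> float:
    # result = X.yz + Y.xz + ...
    # The continuation 'Z.xy' is predictable from the index pattern alone,
    # with no understanding of what the quantities actually mean.
    return X.yz + Y.xz + Z.xy
```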
Do those proposals reach the 'next best thing' faster than the hand-crafted mutation engine because the mutation engine is still worse at combining elements of scoring functions by looking at patterns? Hard to tell, but...maybe?
Okay, I'm reading your review now :-)
If an AI truly "went beyond human comprehension", how would we even know?
This just in... FunSearch almost sort of kinda solved the P vs. NP problem, but didn’t really.
😂
The BOT acronym refers to Bullshit On Tap?
LLMs reveal us to be strangely deferential to plausible-sounding language, even when the language-generator (human or statistical model) is completely disconnected from any understanding whatsoever.
No wonder fraud thrives ...
The flipside is that the "shallow" usage here of the LLM can be seen as a feature, not a bug. If you believe, as many here do, that an LLM alone will never be trustworthy on important problems, then it's wise to give it less responsibility. The argument in the response paper, that a perfect LLM would have no opportunity to use its abilities, is irrelevant if you believe that a perfect LLM can't exist. Here the LLM is given the same role in the evolutionary algorithm that random genetic mutation has in biological evolution. In life this method has produced, well, us, for example. The point is hand-wavy exploration plus objective fitness testing. It's a solid contribution that this system doesn't rely on the LLM for correctness.
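That division of labour is easy to state in code. A sketch of the gate (my own, assuming nothing about their implementation): nothing the LLM says is trusted, only what survives an objective evaluation.

```python
def maybe_keep(candidate_source: str, evaluate, population: list) -> None:
    # The LLM's reliability never matters here: a proposal enters the
    # population only if it actually runs and produces a valid score
    # under the problem's own evaluator.
    try:
        score = evaluate(candidate_source)
    except Exception:
        return  # broken proposals are simply discarded
    if score is not None:
        population.append((score, candidate_source))
```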
"The real problem is not whether machines think, but whether men do"
—Technology Liberation Front
I read these headlines as though they were using the LLM roughly the same way fiction authors are using it, as a creative aid. Prime the LLM with info and questions and see if it suggests anything that I might be able to use. The weighted dice analogy seems apt. The LLM will spit out a litany of random ideas and the human must be the one to have the aha moment and say, "That might be useful, let's try it."
This AI hype has reached criminal proportions. I think instead of reacting to it piecemeal it would be good to create something a bit more organised. Something along the lines of Skeptical Science perhaps. A website dedicated to providing explanations of hype myths by experts. This would make an excellent educational aid.
Hype is how all dishes are served nowadays.
This advance is incremental, but it shows an important truth: a system that generates things has value. Not just in math, but even for, say, an indoor robot that can figure out, given what you say, what you actually want. Such a robot can take into account what you said before, and the context in which it operates.
We will see more advances in chatbots. Simply adding more data does not work after a while. So companies will be forced to focus more on validation and architecture. The reward for a well-done chatbot that other companies are willing to pay for could be high.
Steelmanning test for the 'memes are over' (i.e., Gen AI) era
Let's assess whether Gary Marcus passes the Steelmanning Test:
https://themindcollection.com/steelmanning-how-to-discover-the-truth-by-helping-your-opponent/
As it stands, Gary Marcus hasn't.
Let's see which LLM is best at speeding up our steelmanning of DeepMind on FunSearch, to the standard of John Stuart Mill's "On Liberty".
#SpiceTradeAsia_Prompts
"OVERSOLD" is an understatement! It's like every other day another over-hyped, fake news AI claim comes up. I would hope these smart scientists understand the difference between pure and applied mathematics. How does solving an "unsolvable" pure mathematics problem help humanity exactly? "Scientists" have now become philosophers; that little objective evidence thing seems in the distant past. How can you go from pure mathematics...to making such irrationally exuberant claims...sounds like they're about to solve poverty, cure cancer, and all the other complex social problems in the world, and be home for dinner. Pump breaks, please!
Are you suggesting that the "terrific new paper" you cite would pass peer review?
It's a long (13 page) review of a single article. Journals don't usually publish those.
If you find any errors, by all means let me know. If they are minor, I will correct them, with an acknowledgement to you. If they invalidate the review, I will withdraw it.
I may say that I have shown it to some of the authors of the Nature papers. They don't agree with all my evaluations, but they have confirmed that, as far as they have checked, the content is accurate. -- Ernie Davis