An alternative view of what is happening is that we have been passing through three different phases of LLM-based development.
In Phase 1, "scaling is all you need" was the dominant view. As data, network size, and compute scaled, new capabilities (especially in-context learning) emerged. But each increment in performance required exponentially more data and compute.
In Phase 2, "scaling + external resources is all you need" became dominant. It started with RAG and Toolformer, but has rapidly moved to include invoking Python interpreters and external problem solvers (plan verifiers, Wikipedia fact-checking, etc.).
In Phase 3, the dominant view is becoming "scaling + external resources + inference compute is all you need". I would characterize this as the realization that the LLM only provides part of what is needed for a complete cognitive system. OpenAI doesn't call it this, but we could view o1 as adopting the impasse mechanism of SOAR-style architectures. If the LLM has high uncertainty after a single forward pass through the model, it decides to conduct some form of forward search combined with answer checking/verification to find the right answer. In SOAR, this generates a new chunk in memory, and perhaps at OpenAI they will salt the result away as a new training example for periodic retraining. The cognitive architecture community has a mature understanding of the components of the human cognitive architecture and how they work together to achieve human general intelligence. In my view, they give us the best operational definition of AGI. If they are correct, then building a cognitive architecture by combining LLMs with the other mechanisms of existing cognitive architectures is likely to produce "AGI" systems with capabilities close to human cognitive capabilities.
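As a purely hypothetical sketch of that impasse idea (this is not anything OpenAI has described; every function name and threshold below is made up for illustration), the control loop might look like:

```python
# Hypothetical "impasse"-style control loop: accept the single forward pass
# when the model is confident, otherwise spend inference compute on forward
# search plus verification. The callables are stand-ins for model and checker
# calls, not any real API.
from typing import Callable, Optional

def answer_with_impasse(
    question: str,
    generate: Callable[[str], str],           # one forward pass -> candidate answer
    confidence: Callable[[str, str], float],  # estimated confidence in a candidate
    verify: Callable[[str, str], bool],       # external checker (tests, solver, facts)
    threshold: float = 0.9,
    n_samples: int = 16,
) -> Optional[str]:
    first = generate(question)
    if confidence(question, first) >= threshold:
        return first                          # no impasse: accept the single pass
    # Impasse: sample more candidates and keep the first one that verifies.
    for _ in range(n_samples):
        candidate = generate(question)
        if verify(question, candidate):
            return candidate                  # could be salted away as a new training example
    return None                               # unresolved impasse
```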
Sounds like neurosymbolic AI in the end, no?
Well, someday someone may figure out how to do it all in a connectionist architecture. But either way, we are seeing more and more structure in these systems. I also think the pragmatic engineers in startups will be thinking: "I could try to do reasoning inside the net, but damn this SAT solver runs fast on my GPU." I'm on the lookout for interesting combinations of heavily optimized symbolic AI reasoning engines and strong contextual knowledge retrieved from the LLM. That would give us the soundness of the inference engine plus the rich context and world knowledge of the LLM. It's not how people work, but it is a great way to build an AI system.
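To make that combination concrete, here is a toy sketch of the division of labor I have in mind. The constraints are hand-written stand-ins for what an LLM might extract from a prose description, and the brute-force check stands in for a heavily optimized solver:

```python
# Toy division of labor: an LLM (stand-in here) turns messy prose into explicit
# constraints; a sound symbolic engine does the actual reasoning. A real system
# would call a model plus an optimized SAT/SMT solver; the brute-force search
# below just illustrates the interface.
from itertools import product

# Pretend the LLM extracted these scheduling constraints from a description.
# Variables: A, B, C are True when that meeting is scheduled in the morning.
constraints = [
    lambda A, B, C: A or B,         # "A or B must be in the morning"
    lambda A, B, C: not (A and C),  # "A and C cannot both be in the morning"
    lambda A, B, C: C or not B,     # "if B is in the morning, so is C"
]

def solve(constraints):
    """Exhaustive search over assignments; a real SAT solver does this efficiently."""
    for A, B, C in product([False, True], repeat=3):
        if all(c(A, B, C) for c in constraints):
            return {"A": A, "B": B, "C": C}
    return None  # unsatisfiable, and the inference is sound either way

print(solve(constraints))  # {'A': False, 'B': True, 'C': True}
```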
Mostly agreed, with the caveats that (a) we don't yet understand how to combine LLMs with those other mechanisms in a way that will work, and (b) even when we make some progress on that question, I think it's still going to be incremental; I would not yet use words like "likely ... close to human cognitive capabilities".
Yes, I'm speculating here (and will probably regret it quite soon). If past experience is a guide, we will discover yet more pieces that are needed.
Tom, a question from a neophyte: would all these additional systems ancillary to the LLM be helpful in terms of interpretability?
Maybe. The more structure that is exposed by the system, the more interpretable it can be. For example, RAG makes it possible to cite source documents. However, as the size of a search space scales up (e.g., in AlphaGo or in a SAT solver), the size of the "explanation" grows very large, and new techniques are needed to summarize it. That raises the long-standing challenge of discovering human-interpretable abstractions.
When LLMs cite “source documents,” are they actually citing the specific documents from which particular data came?
Or are they citing after-the-fact “best guesses” about where the data might have come from, e.g., based on a web search of keywords?
If they are citing the actual source documents, how does that work?
I've seen a GPT-based system cite correctly in some sense, but still hallucinate the details when forced (for example) to do arithmetic.
In Retrieval Augmented Generation, a collection of documents (e.g., Wikipedia) is pre-processed and indexed into a vector database. During generation, your question is matched against the vector database, and relevant passages from the documents are copied into the LLM's context buffer. Bing (and presumably Google) also does a web search and includes some results in the input buffer as well. My simple model is that it is these retrieved documents that are cited. But I imagine the commercial models have multiple strategies for determining which documents to cite. Studies have shown that the generated answers can mix retrieved material with information learned during the pre-training phase. You must check everything an LLM produces!
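For intuition, here is a deliberately crude sketch of that pipeline, with simple word overlap standing in for the embedding model and vector database. The point is the data flow: retrieve passages, paste them into the prompt, and cite what was retrieved.

```python
# Minimal retrieval-augmented-generation sketch. Word overlap stands in for a
# real embedding model and vector database; the documents are toy examples.

def score(query: str, passage: str) -> float:
    """Crude relevance score: fraction of query words appearing in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / max(len(q), 1)

documents = {
    "doc1": "Paris is the capital and largest city of France.",
    "doc2": "The Eiffel Tower was completed in 1889.",
    "doc3": "Mount Everest is the highest mountain above sea level.",
}

def retrieve(query: str, k: int = 2):
    """Return the top-k (doc_id, passage) pairs by the crude score."""
    ranked = sorted(documents.items(), key=lambda kv: score(query, kv[1]), reverse=True)
    return ranked[:k]

def build_prompt(query: str) -> str:
    """Copy retrieved passages into the context and keep their ids for citation."""
    hits = retrieve(query)
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in hits)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer with citations like [doc1]."

print(build_prompt("What is the capital of France?"))
```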
This is particularly evident when using Perplexity. It's also particularly frustrating: if the source is garbage, the RAG output will basically be garbage, or an interpretation based on garbage. Humans can quickly tell when a source is bad, but it seems difficult for their pipeline to do that. I am also still wondering whether they remain slaves to the SEO and PageRank algorithms used to retrieve those documents?
Thanks.
And I plan to check😊