In Retrieval Augmented Generation, a collection of documents (e.g., Wikipedia) is pre-processed and indexed into a vector database. During generation, your question is matched against the vector database, and the most relevant passages from the documents are copied into the LLM's context buffer. Bing (and presumably Google) also runs a web search and includes some of those results in the input buffer as well. My simple model is that it is these retrieved documents that get cited. But I imagine the commercial models have multiple strategies for deciding which documents to cite. Studies have shown that generated answers can be a mix of retrieved material and information learned during the pre-training phase. You must check everything an LLM produces!
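To make that pipeline concrete, here's a rough sketch of the retrieve-and-cite step. It uses a toy bag-of-words embedding and an in-memory corpus purely for illustration; real systems use learned dense embeddings, an approximate-nearest-neighbor index, and chunked documents, and the passage names below are made up.

```python
import math
from collections import Counter

# Toy corpus standing in for a vector database of chunked passages.
# The identifiers are hypothetical, for illustration only.
PASSAGES = {
    "wiki/RAG": "Retrieval augmented generation retrieves passages and feeds them to the model.",
    "wiki/Transformer": "The transformer architecture relies on self-attention over its context.",
    "wiki/PageRank": "PageRank scores pages by the link structure of the web.",
}

def embed(text):
    """Stand-in for a learned embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, k=2):
    """Return the top-k passages most similar to the question."""
    q = embed(question)
    ranked = sorted(PASSAGES.items(), key=lambda kv: cosine(q, embed(kv[1])), reverse=True)
    return ranked[:k]

def build_prompt(question):
    """Copy retrieved passages into the context, numbered so the model can cite them."""
    hits = retrieve(question)
    context = "\n".join(f"[{i + 1}] ({doc_id}) {text}" for i, (doc_id, text) in enumerate(hits))
    return (
        "Answer using only the sources below and cite them as [1], [2], ...\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

print(build_prompt("How does retrieval augmented generation cite its sources?"))
```

In this simple picture the "citation" is just the identifier of whichever passages happened to be stuffed into the context, which is also why the model can cite a real source while still blending in things it learned during pre-training.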
This is particularly evident when using Perplexity. It's also particularly frustrating, because if the source is garbage, the RAG pipeline will basically output garbage, or an interpretation based on garbage. Humans can quickly tell when a source is bad, but it seems difficult for their pipeline to do that. I am also still wondering whether they remain slaves to the SEO and PageRank algorithms when pulling up those documents.
When LLMs cite “source documents,” are they actually citing the specific documents from which particular data came?
Or are they citing after-the-fact “best guesses” about where the data likely came from, e.g., based on a web search of keywords?
If they are citing the actual source documents, how does that work?
I've seen a GPT-based system cite correctly in some sense, but still hallucinate the details when forced (for example) to do arithmetic.
Thanks.
And I plan to check😊