And then there is that "Lies, Big Lies, Statistics, Benchmarks" thing. For instance, most benchmark results are reported multiple-shot (i.e. give five examples of good answers in the prompt, then score the answer to the sixth), or, in the case of Gemini, multi-run (e.g. 32 complete 'drafts' from the LLM, followed by selecting one to present using non-LLM methods). See https://ea.rna.nl/2023/12/08/state-of-the-art-gemini-gpt-and-friends-take-a-shot-at-learning/
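To make the "multiple-shot" point concrete, here is a minimal sketch (illustrative names and toy questions, not any benchmark's actual harness) of how a 5-shot prompt is assembled: five worked examples are prepended, and only the model's answer to the sixth question is scored.

```python
# Toy sketch of few-shot benchmark prompting. The function name and the
# Q/A template are assumptions for illustration only.

def build_few_shot_prompt(examples, question):
    """Concatenate (question, answer) example pairs, then the test question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {question}\nA:")  # the model must complete this one
    return "\n\n".join(parts)

shots = [
    ("2+2?", "4"),
    ("3+5?", "8"),
    ("10-4?", "6"),
    ("7*3?", "21"),
    ("9/3?", "3"),
]
print(build_few_shot_prompt(shots, "6+7?"))
```

The point is that the reported score measures the model's performance *given* five solved examples in the prompt, which is a friendlier setting than a bare question.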
These systems with huge context (prompt) sizes provide options to 'engineer around' the LLM's fundamental limitations (but they also open up problems, such as that huge prompt being used to slowly jailbreak the safety fine-tuning: 'Crescendo', see https://ea.rna.nl/2024/04/12/microsoft-lays-the-limitations-of-chatgpt-and-friends-bare/).
It has been clear from the last actual paper by OpenAI on GPT that the scaling behaviour of the LLMs themselves is log, or even log-log, or even log-log-log: https://ea.rna.nl/2024/02/13/will-sam-altmans-7-trillion-ai-plan-rescue-ai/
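A toy illustration of what that kind of scaling implies (the constants below are made up, not from any paper): under a power law like loss = a · N^(-b), every 10x increase in scale shaves off only a fixed *fraction* of the loss, so returns diminish quickly.

```python
# Hypothetical power-law scaling: loss = a * N**-b.
# The constants a and b are invented for illustration only.
a, b = 10.0, 0.1

for n in [1e6, 1e7, 1e8, 1e9]:
    loss = a * n ** -b
    print(f"N={n:.0e}  loss={loss:.3f}")
```

Each 10x step multiplies the loss by the same factor (here 10^(-0.1), roughly 0.79), which is why squeezing out further gains takes ever more enormous amounts of scale.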