
One might define the capability of an LLM as the number of tasks it solves correctly. This number is finite due to the length restriction on the LLM's input (the "context"): over a finite token vocabulary, there are only finitely many inputs of bounded length. Many different input texts can describe the same task, so each task is an equivalence class of essentially interchangeable texts.
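A minimal sketch of the finiteness argument, assuming an illustrative vocabulary size and context length (the figures are hypothetical, not taken from the comment):

```python
# With a finite vocabulary and a bounded context length, the number of
# distinct LLM inputs is finite; tasks (equivalence classes of inputs)
# are therefore finite too.
VOCAB_SIZE = 50_000      # hypothetical tokenizer vocabulary size
CONTEXT_LENGTH = 8_192   # hypothetical maximum input length in tokens

# Number of token sequences of length 0..CONTEXT_LENGTH, via the
# closed form of the geometric series sum of VOCAB_SIZE**k.
num_inputs = (VOCAB_SIZE ** (CONTEXT_LENGTH + 1) - 1) // (VOCAB_SIZE - 1)

# Astronomically large (tens of thousands of digits) but finite, and the
# number of tasks is at most this number.
print(len(str(num_inputs)))
```

The bound is loose, since most sequences describe no task at all, but looseness does not matter for the finiteness claim.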

Of course, capability increases have no linear effect on benchmark scores. However, given what LLM benchmarks are meant to measure, the relationship should be close to monotonic: a more capable model should not score worse. Hence the approach Gary Marcus takes in this article, reading flat benchmark scores as evidence of a lack of recent LLM capability improvements, seems valid.
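The distinction can be sketched with a toy score function (a hypothetical saturating curve, not any real benchmark): the capability-to-score mapping is nonlinear, yet it preserves the ranking of models, which is all the argument needs.

```python
import math

def benchmark_score(capability: float) -> float:
    """Hypothetical saturating score: nonlinear but strictly increasing."""
    return 100 * (1 - math.exp(-capability / 50))

# Equally spaced capability levels for five imaginary models.
capabilities = [10, 20, 30, 40, 50]
scores = [benchmark_score(c) for c in capabilities]

# Nonlinearity: equal capability steps yield shrinking score gains.
gains = [b - a for a, b in zip(scores, scores[1:])]

# Monotonicity: the ranking by score matches the ranking by capability,
# so flat scores over time do indicate flat capability.
print(scores == sorted(scores))  # True
```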
