Absolutely one of the most helpful articles I have read on this subject. Good job!
Thanks for the kind words, happy you enjoyed it!
Thanks for clarifying this!
I am left with a question, though: why not call them "examples" instead of "shots"? Is it just so us fancy prompt engineer types can sneer at the rest of the population with our obviously superior knowledge, lording our power over the luddites of the world the way that the arcane academia of the ancient world controlled information?
I'm not saying it's aliens, but "Aliens."
I laughed pretty hard at this... had to read it 3-4 times to fully appreciate it (plus my luddite brain was already being entertained by the 1/3 point - could not intake multiple shots of the articulated, fluid, comical sarcasm...)
Good one sir. :D
Thank you! I'm here every day, commenting on whatever Daniel writes. We are basically those two grumpy Muppets up in the balcony, but with much less malice, and maybe not quite as good-looking.
Hey!
I may be grumpy, not good-looking, and a Muppet, but...what else did you say?
I've forgotten, and I'm hungry.
I dunno my man, that picture of you with the puppies is pretty swanky!!
I appear at my best when furbabies surround me. What can I say?
One thing, though: while I get that the article explains "how" n-shot works, a purpose-built system will (for the foreseeable future) be vastly superior to an LLM and the n-shot strategy.
The lack of contextual, semantic, cultural, and data boundaries on an LLM is dangerous, and it's only "not dangerous" for the people holding the keys to the doom machine - i.e., I can see Sam from OpenAI asking everyone, "Why would you not give us your data?"
I know why.
Because then AI becomes stuck in 2021.
And no amount of n-shot will render a useful answer to a human in 10 years from now.
I have no doubt that much of what you say holds true for business applications, enterprise-scale deployments, etc.
But the vast majority of regular people will likely continue interacting with LLMs via the chatbot interface and in a much more casual context.
My focus is on that average person and giving them some basic tools and concepts to carry into their interactions with AI.
Yes, that's true - agreed.
Trust me, I hesitated to comment on your article because it's very valid and I don't want to look as if I'm disagreeing. The article is spot on - and honestly, everything you write is of high caliber.
It's an interesting balance - how do you add to the conversation and offer a new point of view? I don't think we should avoid having the conversation, mostly because OpenAI and other such companies are for-profit businesses. So ultimately it all converges in that direction, IMO.
If they don't make money, they will take their toys away and ChatGPT goes away. In my work life, I see the massive $$$$ being spent on training competing models, so the likes of OpenAI will feel the heat soon. The swarm of domain-aware models, especially, will do really well in various vertical industries.
Maybe the market will diverge successfully into "consumer LLMs" and "business domain-aware LLMs"...
So Daniel - I don't mean to take away from your awesome work, which I not only enjoy but also find useful to me and many others in explaining how all this AI works to the regular person.
Cheers
I see things the same way: There's already a pretty wide gap between consumer-facing AI that does the trick for the average person and the purpose-built, fine-tuned, and otherwise specialized models that deliver business value to organizations.
The gap is only likely to grow further.
Also, while the consumer landscape is dominated by a few relatively large players, the business side is probably going to get increasingly fragmented with new, niche models being used.
I appreciate any form of constructive comments, and this one in particular complements my post nicely. I come from the "average Joe" perspective, while you bring in the "Yes, but" business take.
So keep them coming.
Simple and easy to understand.
Thanks, that's what I was aiming for. Happy it hit the mark!
I'm curious how tests like the ones shown in your initial chart allow the LLMs to have different numbers of shots for the same test. How can it be a fair or accurate analysis to give Claude 2 a 0-shot rating on the GSM8K, while Grok gets 8 shots?
Definitely.
What I always assumed is that these comparisons aren't done at the same time as part of the same broad test. So when e.g. Google tests Gemini against a benchmark, they just look at the best available comparison to another model at the time. So if Claude was only tested with a 0-shot approach by the Anthropic team back when it was benchmarked, that's what Google uses in their reference table. (So Google doesn't e.g. re-run all the other models through the same tests when compiling its table, only its own Gemini models.)
Now, why there isn't a set of firm criteria that every tested model must follow isn't something I know much about. But I agree that for an apples-to-apples comparison, we'd want every model to be prompted using the exact same method.
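To make the difference concrete, here's a rough sketch of what an n-shot prompt looks like under the hood. This is my own toy illustration in Python - the example questions, answers, and the build_prompt helper are made up for this comment, not taken from GSM8K or any real benchmark harness:

```python
# Rough illustration of 0-shot vs. k-shot prompting. The exemplars and the
# build_prompt helper below are invented for this sketch; real benchmark
# harnesses format their prompts differently.

EXEMPLARS = [
    ("If a pencil costs 2 dollars and a pen costs 3 dollars, how much do "
     "one pencil and two pens cost?",
     "Two pens cost 2 * 3 = 6 dollars, so 2 + 6 = 8 dollars. The answer is 8."),
    ("A train travels 60 miles per hour for 2 hours. How far does it go?",
     "Distance is 60 * 2 = 120 miles. The answer is 120."),
]

def build_prompt(question: str, k: int) -> str:
    """Prepend k solved examples ("shots") before the actual test question.

    k = 0 gives a 0-shot prompt (just the bare question); k = 8 would
    prepend eight worked examples instead.
    """
    shots = [f"Q: {q}\nA: {a}" for q, a in EXEMPLARS[:k]]
    shots.append(f"Q: {question}\nA:")
    return "\n\n".join(shots)

test_question = "Sara has 5 apples and buys 7 more. How many does she have?"
print(build_prompt(test_question, k=0))  # 0-shot: only the question
print("---")
print(build_prompt(test_question, k=2))  # 2-shot: two examples, then the question
```

Same question each time; the only thing that changes is how many solved examples the model gets to see first - which is exactly why a 0-shot score and an 8-shot score on the same benchmark aren't measuring the same thing.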
PS: But to be fair, Grok seems to need all the help it can get.