18 Comments
Feb 8 · Liked by Daniel Nest

Absolutely one of the most helpful articles I have read on this subject. Good job!

author

Thanks for the kind words, happy you enjoyed it!


Thanks for clarifying this!

I am left with a question, though: why not call them "examples" instead of "shots"? Is it just so us fancy prompt engineer types can sneer at the rest of the population with our obviously superior knowledge, lording our power over the luddites of the world the way that the arcane academia of the ancient world controlled information?

author

I'm not saying it's aliens, but "Aliens."


I laughed pretty hard at this... I had to read it 3-4 times to fully appreciate it (plus my luddite brain was already entertained by the 1/3 point and could not take in multiple shots of such fluid, articulate sarcasm...)

Good one sir. :D


Thank you! I'm here every day, commenting on whatever Daniel writes. We are basically those two grumpy Muppets up in the balcony, but with much less malice, and maybe not quite as good-looking.

author

Hey!

I may be grumpy, not good-looking, and a Muppet, but...what else did you say?


I've forgotten, and I'm hungry.


I dunno my man, that picture of you with the puppies is pretty swanky!!


I appear at my best when furbabies surround me. What can I say?


One thing, though: while I get that the article explains "how" n-shot prompting works, a purpose-built system will (for the foreseeable future) be vastly superior to an LLM with an n-shot strategy.

The lack of contextual, semantic, cultural, and data boundaries on an LLM is dangerous, and it's only "not dangerous" for the people holding the keys to the doom machine - i.e., I can see Sam from OpenAI asking everyone, "Why would you not give us your data?"

I know why.

Because then AI becomes stuck in 2021.

And no amount of n-shot prompting will render a useful answer to a human 10 years from now.

author
Feb 10 · edited Feb 10 · Author

I have no doubt that much of what you say holds true for business applications, enterprise-scale deployments, etc.

But the vast majority of regular people will likely continue interacting with LLMs via the chatbot interface and in a much more casual context.

My focus is on that average person and giving them some basic tools and concepts to carry into their interactions with AI.


Yes, that's true - agreed.

Trust me, I hesitated to comment on your article because it's very valid, and I don't want to look as if I'm disagreeing. The article is spot on - and honestly, all you write is of high caliber.

It's an interesting balance: how do you add to the conversation and offer a new point of view? I don't think we should avoid having the conversation, mostly because OpenAI and other such companies are for-profit companies. So ultimately it all converges in that direction, IMO.

If they don't make money, they will take their toys away and ChatGPT goes away. In my work life I see the massive $$$$ being spent on training competing models, so the likes of OpenAI will feel the heat soon. The swarm of domain-aware models, especially, will do really well in various vertical industries.

Maybe the market will diverge successfully into "consumer LLMs" and "business domain-aware LLMs"...

So Daniel - I don't mean to take away from your awesome work which I not only enjoy, I think it is useful to me and many others to better explain how all this AI works to the regular person.

Cheers

author
Feb 10 · edited Feb 10 · Author

I see things the same way: There's already a pretty wide gap between consumer-facing AI that does the trick for the average person and purpose-built, fine-tuned, etc. models that deliver business value to organizations.

The gap is only likely to grow further.

Also, while the consumer landscape is dominated by a few relatively large players, the business side is probably going to get increasingly fragmented with new, niche models being used.

I appreciate any form of constructive comments, and this one in particular complements my post nicely. I come from the "average Joe" perspective, while you bring in the "Yes, but" business take.

So keep them coming.


Simple and easy to understand.

author

Thanks, that's what I was aiming for. Happy it hit the mark!

Feb 8 · Liked by Daniel Nest

I'm curious how tests like the ones shown in your initial chart allow the LLMs to have different numbers of shots for the same test. How can it be a fair or accurate analysis to give Claude 2 a 0-shot rating on the GSM8K, while Grok gets 8 shots?

author
Feb 8 · edited Feb 8 · Author

Definitely.

What I always assumed is that these comparisons aren't done at the same time as part of the same broad test. So when e.g. Google tests Gemini against a benchmark, they just look at the best available comparison to another model at the time. So if Claude was only tested with a 0-shot approach by the Anthropic team back when it was benchmarked, that's what Google uses in their reference table. (So Google doesn't e.g. re-run all the other models through the same tests when compiling its table, only its own Gemini models.)

Now why there isn't a set of firm criteria that every tested model must apply isn't something I know much about. But I agree that for an apples-to-apples comparison, we'd want every model to be prompted using the same exact method.

PS: But to be fair, Grok seems to need all the help it can get.
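To make the 0-shot vs. 8-shot distinction from this exchange concrete, here's a minimal sketch of how the shot count changes the prompt a model is actually graded on. The helper function and the example questions are made up for illustration (they are not actual GSM8K items or any lab's benchmark harness):

```python
def build_prompt(question: str, examples: list[tuple[str, str]]) -> str:
    """Prepend n worked examples ("shots") before the real question.

    With an empty examples list, this degenerates to a 0-shot prompt:
    the model sees only the question itself.
    """
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# 0-shot: no solved examples, just the question.
zero_shot = build_prompt("If 3 pens cost $6, what do 7 pens cost?", [])

# 2-shot: two solved examples set the format and reasoning style first.
# (A GSM8K 8-shot run would do the same with eight examples.)
shots = [
    ("What is 2 + 2?", "2 + 2 = 4. The answer is 4."),
    ("A bag has 5 apples; 2 are eaten. How many remain?",
     "5 - 2 = 3. The answer is 3."),
]
few_shot = build_prompt("If 3 pens cost $6, what do 7 pens cost?", shots)

print(zero_shot.count("Q:"))  # 1 question, no examples
print(few_shot.count("Q:"))   # 2 examples + 1 question = 3
```

The point of the comparison-fairness question above: two models answering the same GSM8K question receive very different inputs depending on how many shots their respective harnesses prepend, so scores reported at different shot counts aren't directly comparable.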
