Will probably stick with just Claude and Gemini. Not sure I can support a product with limited details around safety and alignment work, but this definitely seems to make OpenAI's profitability outlook even worse than it was in 2024!
Yeah, I was trying to look up coverage about the safety implications of developing and launching models this quickly. Couldn't find much yet. Do you have any insights on this?
Great hot take, Daniel, and yes, that 2025 prophecy is starting to look a little shaky already, isn't it?! AI predictions in general are a tough proposition. Perhaps just stick to vague and oracular statements that are impossible to quantify, like Sam Altman? ;)
Damn, that's where I fucked up!
Is it too late to revise my prediction to "AGI will be here soon, give or take some time, unless it won't"?!
There goes that convergence we were talking about again!
It seems notable that, instead of pure frontier models where more intelligence and being "smarter" is the goal, we are starting to see models that are "also very good" or "just as good" or "almost as good", but those 2nd-tier models offer much more efficient energy use, more personalization, or more privacy, or do something that isn't "more compute."
I'm starting to see these companies as just as important, and some of the innovations that aren't at the frontier as being crucial. E.g., a billion people are being exposed to LLMs via "Apple Intelligence," where it's just stupid easy to start using them.
Yeah, it's actually one of the models I mentioned in our recent email correspondence.
There definitely seems to be a tendency towards rather quick convergence in LLMs and now also reasoning models.
Although I wouldn't call DeepSeek-R1 a 2nd-tier model. It's very much top-tier with o1 at this point, and I imagine it's ahead of Gemini 2.0 Flash Thinking Experimental.
But you're right, we also see lots of smaller models from AI labs that are "good enough" but much faster and cheaper to serve (e.g. GPT-4o mini, Gemini Flash, Microsoft Phi, etc.). For most "average Joe" cases, you might not even be able to tell the difference, and you'll get your answers faster.
Sorry - I meant 2nd tier only in terms of when they come out. I think of the pioneers who come out with the new model with all these exciting abilities, smashing benchmarks, etc., as being in a separate class from the clones, even if the clones are every bit as good (or even better). We need both types, and it looks like we're getting plenty more in that 2nd category now that the industry is slightly more mature, I reckon.
"Good enough" is amazing by 2023 standards.
I've done a handful of tests with R1 and it's better than anything else I've used across the tests. That includes other reasoning models, like o1. My tests are specific to me and not exhaustive (nor related to coding or mathematics), but it's certainly impressed me.
Whoa, that's quite an endorsement!
May I ask what kinds of tasks you found it most useful for? And when you say that it's better than anything else, is it the quality of reasoning, the tone, how complete the answers are, etc.?
A couple of big winners so far:
- Discussing chewy subjects that don't have a clear, or agreed, answer. To the point where I've told it to pick the best choice from options I've given, and it's helpfully refused an outright choice, giving alternatives that work within the remit of what I've allowed. Beyond that, I pushed the answer with a curveball I didn't elaborate on, yet the response grasped why I went in that direction, then offered credible advice (and even tactics) on how to work down that path. I haven't seen o1 get close to what I got in my incredibly limited testing from very simple prompting.
- Drafting ideas and analysing styles. The quality and understanding of writing when honed carefully was a big surprise to me. I'd not been expecting much—if any—improvement on what's out there any time soon. I'm genuinely surprised that this model has, in my opinion, made an improvement. Again, the relatively basic prompting makes it all the more impressive.
Good question on why it's better for me. Yep, the quality of reasoning took me by surprise. Yep, tone is varied, so it'll hint at light-hearted just enough to get away with it even in a serious response, or hit a tone of care where needed when the rest of the reply is more perfunctory. That said, the reliance on metaphor and simile is still massive, like other LLMs. So far, I've just asked for less emphasis on them and the second try is much better. And, yep, the answers are complete enough for my liking, without any fluff.
To be fair, my own reply here is like a bloomin' essay, so I asked R1 to condense it. Two tries with types of prompting I know work elsewhere... R1 wasn't the best I've experienced. Just goes to show the value of using multiple models for different tasks!
I appreciate the detailed response, so no need to R1 it ;)
Those sound like pretty solid non-coding, non-STEM use cases to me. I've been reading about o1 and the best way to approach it over the past few days, and it's clear that it's best for "meaty" subjects that require abstract thinking, long-term planning, etc. So your experience checks out - I guess there's lots of overlap between R1 and o1 in terms of applicability!
And speaking of "using multiple models for different tasks," that's exactly what my upcoming Thursday post is about: Using GPT-4o and o1 in tandem to supplement each other.
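As a teaser, the basic shape is something like this: a minimal sketch using the OpenAI Python SDK, where o1 handles the heavy reasoning step and GPT-4o polishes the result for tone and concision. The helper name, prompts, and workflow here are just illustrative, not lifted from the post itself:

```python
# Hypothetical sketch: o1 for the heavy planning/reasoning step,
# GPT-4o to rewrite the raw output into reader-friendly prose.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def plan_then_polish(task: str) -> str:
    # Step 1: let the reasoning model produce the plan/analysis.
    plan = client.chat.completions.create(
        model="o1-preview",
        messages=[{"role": "user", "content": task}],
    ).choices[0].message.content

    # Step 2: hand the raw output to GPT-4o for tone and concision.
    polished = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Rewrite this clearly and concisely."},
            {"role": "user", "content": plan},
        ],
    ).choices[0].message.content
    return polished

print(plan_then_polish("Outline a 6-month plan to learn conversational Spanish."))
```

The division of labor is the whole point: you pay the slow, expensive reasoning model only for the step that needs it.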
Thanks for your feedback!