I’m surprised to see this in my email inbox before I saw it here.
Yeah, I would have posted it but I have been running around like a madman today. My impression so far is that it is hard to tell the difference because 3.7 was so good already. What are others finding?
What tasks is this on?
There certainly are a bunch of things where Sonnet 3.7 was already quite saturated.
I’ve personally been seeing a step change: new things have become possible at new scales and levels of reliability.
Just the old “ask me questions about X until you have enough information to write Y”. It’s a prompt I use a lot.
I’m finding that 4.0 can “go a lot further”. I’m not sure it’s any “smarter” though, but still too early to make proper judgements.
But as a specific metric: I would seldom see previous models attempt more than a half dozen text replacements, whereas now it will happily do many dozen at once.
Further update: Sonnet 4.0 is smarter as well as more comprehensive. It is much more capable of understanding the nature of problems. Combine that with being more comprehensive in what it looks at and considers (thinks about), and it’s doing a fantastic job.
I’ve got a substantial refactor that I was really postponing due to the amount of effort I was going to have to put into it; I expected a few days of work. But after one afternoon (and evening), the refactor is complete.
And that’s with Sonnet. I’ll try Opus when I’ve got something that warrants the extra-expensive tokens for it to tackle.
I’ll take some credit for the refactor being finished so fast; Beyond Better is doing quite well too.
I’m finding it comparable to 3.7 except that it appears to be better at what I would call “intuiting intent” when given subtle or incomplete prompts. It’s also better at meta analysis of a thread of conversation as in “at what point did you come up with X idea and what was it I said prompted you to do so, or was it an idea that I gave to you?”
It might seem strange to ask this but I use Claude “socratically” when developing and testing my own ideas and I find 4 does a better job. 3.7 was already superior for this kind of use compared to the others I’ve tried and I’m happy that 4 works even better for this.
The more recent training cut-off (January this year) is also a plus for what I do.
Further update on Claude 4…
For quite a while I’ve been putting up with a nuisance bug in BB (involving app state, UI interactions, and race/timing conflicts). I didn’t have the patience to solve it myself and it was too many moving bits for previous Claude to solve.
So I handed the problem to Sonnet 4. I ended up with hacks on top of bandaids and still didn’t have a solution.
I tried again with Sonnet 4 with an addition to the prompt to “refactor rather than hack at the solution”. That worked well and I got a nice looking refactor, but it still didn’t fix the problem.
I still hadn’t tried Opus 4 since Sonnet 4 had been performing so well with everything else I needed (& Opus is so expensive). I decided to give Opus 4 a go on this problem.
Opus 4 also did a nice looking refactor. While doing that, Opus noticed some performance issues with UI interactions (even though there was already ‘debounce’ handling to improve performance). Opus worked slowly for a while (10 minutes) across a few files, and produced a solution that fixed the long-standing nuisance bug along with some UI performance enhancements. (The enhancements are theoretical: performance wasn’t a real-world problem, so I’m not noticing a change, but metrics show an improvement.)
When the extra cost is warranted, Opus 4 does impressive work.
I’m not sure about Opus doing extra work I didn’t ask for, especially at increased cost, but the results were impressive.
Configuring BB to use Opus 4 for orchestrator tasks, Sonnet 4 for agent tasks, and Haiku 3.5 for admin tasks is a powerful combination.
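For anyone wanting to try a similar split in their own tooling, here is a minimal sketch of the idea: route each kind of task to a model tier by role. This is not BB's actual configuration format. The `ROLE_MODELS` table, the `pick_model` helper, and the model labels are all hypothetical placeholders, not real API model identifiers.

```python
# Hypothetical role-to-model routing table (illustration only).
# The labels are placeholders, not verified API model identifiers.
ROLE_MODELS = {
    "orchestrator": "opus-4",   # most capable, most expensive: planning and coordination
    "agent": "sonnet-4",        # mid-tier: the bulk of the actual work
    "admin": "haiku",           # cheapest: housekeeping such as titles and summaries
}


def pick_model(role: str, default: str = "sonnet-4") -> str:
    """Return the model label for a task role, falling back to the default."""
    return ROLE_MODELS.get(role, default)


if __name__ == "__main__":
    for role in ("orchestrator", "agent", "admin", "unknown"):
        print(f"{role}: {pick_model(role)}")
```

The point of the fallback is that an unrecognised role lands on the mid-tier model rather than silently using the most expensive one.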
PS. I want to prefer using Gemini in BB for its larger context window, but Claude continues to outperform when using tools.
This is a good example of a use case for Opus. I personally still haven’t found something like this.
I use Gemini a lot these days, and in my experience the difference between the two providers is reliability. Gemini hallucinates a lot more than Claude, so when I want something done right I use Claude, and when it doesn’t matter that much I use Gemini. For BB I’d agree: Claude is the right choice until this changes (which I fully expect it will).
Interesting, Alex! What is the main reason you’re using Gemini, then, if you feel Claude is more reliable and hallucinates less?
And also, forgive my ignorance, what’s BB?
Usage limits. Claude is great but you don’t get much bandwidth whereas Gemini has much higher usage limits and I can use it for throw-away questions.
BB is Beyond Better.