Thoughts on Deep Research from OpenAI ($200/month) vs Gemini vs Perplexity
GPT Deep Research (DR) with o3-mini-high has the best, most verbose writing. If you're learning a new topic from scratch, this is the best option. If you already know what you're looking for and you need more detail, the same verbosity cuts the other way.
If you're looking for simple prompting, this is also often the best option - the questions it asks before research starts are a simple but massively useful feature that helps guide things a lot better.
It's also the second slowest, and pretty comparable to the others in number of sources. Also the most expensive, by 10x - and it's definitely not 10x better.
Perplexity DR is surprisingly good. Pretty close in sources - and it picks up some things the others miss. I wish the output was longer (694 words compared to OpenAI's 2,462). It's also the fastest of the bunch - I had time to ask three more questions in the time the other tools took to finish. This is what I'd use when I'm inside my areas of expertise.
Perplexity (once you have the initial research) is also pretty good if you switch the model to o3 and ask for more elaboration. Improves the results.
Gemini is the surprising underperformer. You'd think Google would dominate this space simply by virtue of having direct access to the world's largest search index, while everyone else has to live with SERP summaries and Google queries - but somehow Gemini DR (perhaps because it's still running the 1.5 Pro model) falls quite short. Unless you're getting this one for free, there's not much point.
Makes sense why sonnet is still preferred. Made the same app with o3-mini, r1 and sonnet on Cursor.
Pics 1 and 2 are sonnet, 3 is r1, 4 is o3-mini.
Same prompts, sonnet was most fun with the best features + fastest, r1 - most creative. Deployed sonnet 👇
x.com/cursor_ai/status/1885415392677675337
This post from @davidcrawshaw hits pretty close to home. My internal repos have exploded almost 800% since GPT-3.5, for the same reason: It's much easier to test hypotheses, make new applications and run ideas now. It's not because LLMs are better than humans at code - they're just built different in a very, VERY useful way.
If you try to exploit the differences, you'll have a much better time.
LLMs have no long-term memory - humans do, and it's very difficult to get fresh eyes from humans on a problem.
LLMs have broader knowledge than any one human, and not all humans with the specific in-depth knowledge are accessible at any time.
LLMs have no problem doing repeated work. They are a practically renewable resource (like solar), unlike humans at the same level.
Modern software dev is designed for humans (incremental updates on increasingly large codebases). Rearchitecting this for LLMs means more tests on smaller packages - minirepos - that can be worked on independently. It means a lot more throwaway versions before you get to the final product. For example, I'll take an idea, build multiple small TypeScript scripts to test viability, add tests, make a quick CLI to test, launch it and give it to some friends, turn it into a GUI on @vercel for more testing, rewrite tests, before I start extracting the core logic to *completely rewrite* the whole thing for the actual intended purpose.
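To make that concrete, here's what one of those throwaway viability scripts might look like - a hypothetical example (the function and filename are made up), using node's built-in test runner:

```ts
// check-viability.ts - a hypothetical throwaway script: is naive fuzzy
// matching good enough for the idea before building anything real?
import { test } from "node:test";
import assert from "node:assert/strict";

// Candidate core logic, small enough to rewrite from scratch later.
function fuzzyMatch(query: string, target: string): number {
  const q = query.toLowerCase(), t = target.toLowerCase();
  let hits = 0, ti = 0;
  for (const ch of q) {
    const idx = t.indexOf(ch, ti);
    if (idx >= 0) { hits++; ti = idx + 1; }
  }
  return hits / q.length; // fraction of query chars found in order
}

test("close strings score high", () => {
  assert.ok(fuzzyMatch("helo wrld", "hello world") > 0.8);
});

test("unrelated strings score low", () => {
  assert.ok(fuzzyMatch("quantum", "banana") < 0.5);
});
```

Run it with `npx tsx check-viability.ts` - if the scores look wrong, the idea dies in minutes instead of days.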
I'll also use multiple LLMs and cross-post the outputs to get fewer holes in an analysis (e.g. youtube.com/watch?v=p948WOthRyg)
x.com/davidcrawshaw/status/1876407248500793710
Covering some of the papers this week with just the interesting bits (or just things I didn't know)
Starting with the BLT paper, we've now learned that @aiatmeta has some kind of food obsession
The most interesting one was the synchronous LLMs paper (at the end)
Among all the cool things at NeurIPS I wanted to call out this gem: CoCoNUT (not sure who's in charge of naming at @AIatMeta)
Direct latent-space reasoning by connecting the last layer back to the first, without collapsing the distribution into a single token.
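A toy sketch of the control flow as I understand it - stand-in stubs only, nothing like Meta's actual code:

```ts
type Vec = number[];
const DIM = 8;

// Stand-in stubs, just enough to show the control flow.
const transformerStep = (x: Vec): Vec => x.map((v) => Math.tanh(v + 0.1));
const embed = (tokenId: number): Vec =>
  Array.from({ length: DIM }, (_, i) => (i === tokenId % DIM ? 1 : 0));
const collapseToToken = (h: Vec): number => h.indexOf(Math.max(...h)); // argmax

function continuousThought(promptIds: number[], latentSteps: number): number {
  let hidden: Vec = Array(DIM).fill(0);
  for (const id of promptIds) hidden = transformerStep(embed(id));
  // The Coconut move: skip the vocab projection and feed the last hidden
  // state straight back in as the next input, so no distribution gets
  // collapsed into a single token mid-reasoning.
  for (let i = 0; i < latentSteps; i++) hidden = transformerStep(hidden);
  return collapseToToken(hidden); // collapse to a discrete token only at the end
}
```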
Friendship ended with transcription models
Releasing
github.com/southbridgeai/offmute
VLMs can:
- Transcribe
- Figure out who's speaking
- Look at the video itself
- Make a final report
for cheaper 🫡
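Not offmute's actual code, but the gist in one call - a minimal sketch with the @google/generative-ai SDK (inline upload only works for short clips; longer videos go through the File API):

```ts
import { readFileSync } from "node:fs";
import { GoogleGenerativeAI } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({ model: "gemini-1.5-pro" });

async function meetingReport(videoPath: string): Promise<string> {
  const video = readFileSync(videoPath).toString("base64");
  // One multimodal call covers all four jobs: transcription, speaker
  // identification, on-screen context, and the final write-up.
  const result = await model.generateContent([
    { inlineData: { data: video, mimeType: "video/mp4" } },
    {
      text:
        "Transcribe this meeting with speaker labels, note anything " +
        "visible on screen that matters, then write a final report.",
    },
  ]);
  return result.response.text();
}

meetingReport("meeting.mp4").then(console.log);
```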
In ~6 months there's a chance OCR and transcription models disappear entirely
Wrong twice this week!
I've been suggesting self-consistency as a way to trade compute for accuracy. Turns out it doesn't work, and there are better ways.
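For reference, self-consistency is just: sample the same prompt k times at temperature > 0 and majority-vote the answers. A minimal sketch (`sample` is a hypothetical stand-in for whatever LLM call you use):

```ts
// Self-consistency, the naive version I'd been recommending: sample k
// answers and take the most common one.
async function selfConsistency(
  prompt: string,
  sample: (p: string) => Promise<string>, // hypothetical LLM call, temperature > 0
  k = 8,
): Promise<string> {
  const answers = await Promise.all(
    Array.from({ length: k }, () => sample(prompt)),
  );
  const counts = new Map<string, number>();
  for (const a of answers) counts.set(a, (counts.get(a) ?? 0) + 1);
  // Most frequent answer wins; ties resolve arbitrarily.
  return [...counts.entries()].sort((x, y) => y[1] - x[1])[0][0];
}
```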
Way too many useful things in this frankly underrated paper that I wish I'd read sooner 👇
Released diagen yesterday, but how does it work?
1. Generate @terrastruct d2 diagrams with the model of your choice. Sonnet seems best, o1 seems needlessly expensive, gemini-flash is insane if you do a few rounds of visual reflection.
What's visual reflection? 👇
x.com/hrishioa/status/1843685800875266470
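Roughly: render the generated diagram, show the model its own output, and let it fix what looks off. A minimal sketch of that loop - not diagen's actual code; assumes the d2 CLI is installed and can emit PNG:

```ts
import { execFileSync } from "node:child_process";
import { readFileSync, writeFileSync } from "node:fs";
import { GoogleGenerativeAI } from "@google/generative-ai";

const model = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!)
  .getGenerativeModel({ model: "gemini-1.5-flash" });

async function visualReflection(d2Source: string, rounds = 3): Promise<string> {
  for (let i = 0; i < rounds; i++) {
    writeFileSync("diagram.d2", d2Source);
    execFileSync("d2", ["diagram.d2", "diagram.png"]); // render what the model wrote
    const png = readFileSync("diagram.png").toString("base64");
    // Show the model its own rendering and ask for a fixed version.
    const result = await model.generateContent([
      { inlineData: { data: png, mimeType: "image/png" } },
      {
        text:
          "Critique this diagram's layout and clarity, then output " +
          "improved d2 source only.",
      },
    ]);
    d2Source = result.response.text().replace(/```(d2)?/g, "").trim();
  }
  return d2Source;
}
```

This is also why gemini-flash shines here - cheap calls make a few rounds of reflection affordable where o1 wouldn't be.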