Thoughts on Deep Research from OpenAI ($200/month) vs Gemini vs Perplexity
GPT Deep Research (DR) with o3-mini-high has the best, most verbose writing. If you're learning a new topic from scratch, this is the best option. If you already know what you're looking for and you need more detail, the same verbosity cuts the other way.
If you're looking for simple prompting, this is also often the best option - the questions it asks before research starts are a simple but massively useful feature that helps guide things a lot better.
It's also the second slowest, and pretty comparable to the others in number of sources. Also the most expensive, by 10x - and it's definitely not 10x better.
Perplexity DR is surprisingly good. Pretty close in sources - and it picks up some things the others miss. I wish the output was longer (694 words compared to OpenAI's 2,462). It's also the fastest of the bunch - I had time to ask three more questions in the time the other tools took to finish. This is what I'd use when I'm inside my areas of expertise.
Perplexity (once you have the initial research) is also pretty good if you switch the model to o3 and ask for more elaboration. Improves the results.
Gemini is the surprising underperformer. You'd think Google would dominate this space simply by virtue of having direct access to the world's largest search index, while everyone else has to live with SERP summaries and Google queries - but somehow Gemini DR (perhaps because it's still running the 1.5 Pro model) falls quite short. Unless you're getting this one for free, there's not much point.
Makes sense why sonnet is still preferred. Made the same app with o3-mini, r1 and sonnet on Cursor.
Pics 1 and 2 are sonnet, 3 is r1, 4 is o3-mini.
Same prompts, sonnet was most fun with the best features + fastest, r1 - most creative. Deployed sonnet 👇
x.com/cursor_ai/status/1885415392677675337
This post from @davidcrawshaw hits pretty close to home. My internal repos have exploded almost 800% since GPT-3.5, for the same reason: It's much easier to test hypotheses, make new applications and run ideas now. It's not because LLMs are better than humans at code - they're just built different in a very, VERY useful way.
If you try to exploit the differences, you'll have a much better time.
LLMs have no long-term memory - humans do, and it's very difficult to get fresh eyes from humans on a problem.
LLMs have broader knowledge than any one human, and not all humans with the specific in-depth knowledge are accessible at any time.
LLMs have no problem doing repeated work. They are a practically renewable resource (like solar), unlike humans at the same level.
Modern software dev is designed for humans (incremental updates on increasingly large codebases). Rearchitecting this for LLMs means more tests on smaller packages - minirepos - that can be worked on independently. It means a lot more throwaway versions before you get to the final product. For example, I'll take an idea, build multiple small TypeScript scripts to test viability, add tests, make a quick CLI to test, launch it and give it to some friends, turn it into a GUI on @vercel for more testing, rewrite tests, before I start extracting the core logic to *completely rewrite* the whole thing for the actual intended purpose.
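To make that concrete, here's what one of those throwaway viability scripts might look like - a hypothetical example (the function and filename are made up), using node's built-in test runner:

```ts
// check-viability.ts - a hypothetical throwaway script: is naive fuzzy
// matching good enough for the idea before building anything real?
import { test } from "node:test";
import assert from "node:assert/strict";

// Candidate core logic, small enough to rewrite from scratch later.
function fuzzyMatch(query: string, target: string): number {
  const q = query.toLowerCase(), t = target.toLowerCase();
  let hits = 0, ti = 0;
  for (const ch of q) {
    const idx = t.indexOf(ch, ti);
    if (idx >= 0) { hits++; ti = idx + 1; }
  }
  return hits / q.length; // fraction of query chars found in order
}

test("close strings score high", () => {
  assert.ok(fuzzyMatch("helo wrld", "hello world") > 0.8);
});

test("unrelated strings score low", () => {
  assert.ok(fuzzyMatch("quantum", "banana") < 0.5);
});
```

Run it with `npx tsx check-viability.ts` - if the scores look wrong, the idea dies in minutes instead of days.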
I'll also use multiple LLMs and cross-post the outputs to get fewer holes in an analysis (e.g. youtube.com/watch?v=p948WOthRyg)
x.com/davidcrawshaw/status/1876407248500793710
Covering some of the papers this week with just the interesting bits (or just things I didn't know)
Starting with the BLT paper, we've now learned that @aiatmeta has some kind of food obsession
The most interesting one was the synchronous LLMs paper (at the end)
Among all the cool things at NeurIPS I wanted to call out this gem: CoCoNUT (not sure who's in charge of naming at @AIatMeta)
Direct latent-space reasoning by connecting the last layer back to the first, without collapsing the distribution into a single token.
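A toy sketch of the control flow as I understand it - stand-in stubs only, nothing like Meta's actual code:

```ts
type Vec = number[];
const DIM = 8;

// Stand-in stubs, just enough to show the control flow.
const transformerStep = (x: Vec): Vec => x.map((v) => Math.tanh(v + 0.1));
const embed = (tokenId: number): Vec =>
  Array.from({ length: DIM }, (_, i) => (i === tokenId % DIM ? 1 : 0));
const collapseToToken = (h: Vec): number => h.indexOf(Math.max(...h)); // argmax

function continuousThought(promptIds: number[], latentSteps: number): number {
  let hidden: Vec = Array(DIM).fill(0);
  for (const id of promptIds) hidden = transformerStep(embed(id));
  // The Coconut move: skip the vocab projection and feed the last hidden
  // state straight back in as the next input, so no distribution gets
  // collapsed into a single token mid-reasoning.
  for (let i = 0; i < latentSteps; i++) hidden = transformerStep(hidden);
  return collapseToToken(hidden); // collapse to a discrete token only at the end
}
```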
Friendship ended with transcription models
Releasing
github.com/southbridgeai/offmute
VLMs can:
- Transcribe
- Figure out who's speaking
- Look at the video itself
- Make a final report
for cheaper 🫡
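Not offmute's actual code, but the gist in one call - a minimal sketch with the @google/generative-ai SDK (inline upload only works for short clips; longer videos go through the File API):

```ts
import { readFileSync } from "node:fs";
import { GoogleGenerativeAI } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({ model: "gemini-1.5-pro" });

async function meetingReport(videoPath: string): Promise<string> {
  const video = readFileSync(videoPath).toString("base64");
  // One multimodal call covers all four jobs: transcription, speaker
  // identification, on-screen context, and the final write-up.
  const result = await model.generateContent([
    { inlineData: { data: video, mimeType: "video/mp4" } },
    {
      text:
        "Transcribe this meeting with speaker labels, note anything " +
        "visible on screen that matters, then write a final report.",
    },
  ]);
  return result.response.text();
}

meetingReport("meeting.mp4").then(console.log);
```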
In ~6 months there's a chance OCR and transcription models disappear entirely
Wrong twice this week!
I've been suggesting self-consistency as a way to trade compute for accuracy. Turns out it doesn't work, and there are better ways.
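For reference, self-consistency is just: sample the same prompt k times at temperature > 0 and majority-vote the answers. A minimal sketch (`sample` is a hypothetical stand-in for whatever LLM call you use):

```ts
// Self-consistency, the naive version I'd been recommending: sample k
// answers and take the most common one.
async function selfConsistency(
  prompt: string,
  sample: (p: string) => Promise<string>, // hypothetical LLM call, temperature > 0
  k = 8,
): Promise<string> {
  const answers = await Promise.all(
    Array.from({ length: k }, () => sample(prompt)),
  );
  const counts = new Map<string, number>();
  for (const a of answers) counts.set(a, (counts.get(a) ?? 0) + 1);
  // Most frequent answer wins; ties resolve arbitrarily.
  return [...counts.entries()].sort((x, y) => y[1] - x[1])[0][0];
}
```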
Way too many useful things in this frankly underrated paper that I wish I'd read sooner 👇
Released diagen yesterday, but how does it work?
1. Generate @terrastruct d2 diagrams with the model of your choice. Sonnet seems best, o1 seems needlessly expensive, gemini-flash is insane if you do a few rounds of visual reflection.
What's visual reflection? 👇
x.com/hrishioa/status/1843685800875266470
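Roughly: render the generated diagram, show the model its own output, and let it fix what looks off. A minimal sketch of that loop - not diagen's actual code; assumes the d2 CLI is installed and can emit PNG:

```ts
import { execFileSync } from "node:child_process";
import { readFileSync, writeFileSync } from "node:fs";
import { GoogleGenerativeAI } from "@google/generative-ai";

const model = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!)
  .getGenerativeModel({ model: "gemini-1.5-flash" });

async function visualReflection(d2Source: string, rounds = 3): Promise<string> {
  for (let i = 0; i < rounds; i++) {
    writeFileSync("diagram.d2", d2Source);
    execFileSync("d2", ["diagram.d2", "diagram.png"]); // render what the model wrote
    const png = readFileSync("diagram.png").toString("base64");
    // Show the model its own rendering and ask for a fixed version.
    const result = await model.generateContent([
      { inlineData: { data: png, mimeType: "image/png" } },
      {
        text:
          "Critique this diagram's layout and clarity, then output " +
          "improved d2 source only.",
      },
    ]);
    d2Source = result.response.text().replace(/```(d2)?/g, "").trim();
  }
  return d2Source;
}
```

This is also why gemini-flash shines here - cheap calls make a few rounds of reflection affordable where o1 wouldn't be.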