Explaining SeaGPT


🚣‍♂️Today we're launching the world's first multi-agent tool for shipping. Goal: remove 100s of emails a day per person. You can read the article, or we can think step-by-step and I can explain every part. techcrunch.com/2023/04/24/greywing-seagpt/

Full demo at the end. This has been two weeks of non-stop building, so excuse the potato microphone. TLDV: we have a system to automate back-and-forths with real parties in maritime, to reduce email overwhelm. How did we make it reliable? twitter.com/hrishioa/status/1647626783775399937

The demo includes everything we've built from scratch at Greywing. A department of agents works together to make sure email communication is 100% automated, with no human in the loop. So: a crew manager walks into a bar with 12 LLMs. twitter.com/hrishioa/status/1647626783775399937

Concierge says hello. From a single query, two subsequent questions are answered: 1. Who are they trying to reach? (FLIGHTS? DRAFT_EMAIL? DEVIATIONS? FUEL?) 2. What clean data can be handed over to the next Agent? twitter.com/hrishioa/status/1648368274072297473
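A minimal sketch of what a Concierge-style router could look like, assuming the model is asked to reply in JSON. Only the intent labels come from the thread; `Handoff`, `concierge`, the prompt wording and `fake_llm` are hypothetical names made up for illustration, and `fake_llm` stands in for a real model call so the example runs offline.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Intent labels from the thread; everything else in this sketch is hypothetical.
INTENTS = {"FLIGHTS", "DRAFT_EMAIL", "DEVIATIONS", "FUEL"}


@dataclass
class Handoff:
    intent: str  # which downstream agent should take over
    data: dict   # clean fields extracted from the query


def concierge(query: str, llm: Callable[[str], str]) -> Handoff:
    """Ask the model for an intent plus extracted fields, then validate the reply."""
    prompt = (
        'Reply with JSON of the form {"intent": ..., "data": {...}}. '
        f"Allowed intents: {sorted(INTENTS)}. Query: {query}"
    )
    reply = json.loads(llm(prompt))
    if reply.get("intent") not in INTENTS:
        # Refuse to route anything the downstream agents cannot handle.
        raise ValueError(f"unknown intent: {reply.get('intent')!r}")
    return Handoff(intent=reply["intent"], data=reply.get("data", {}))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (assumption, not Greywing's API)."""
    return json.dumps({"intent": "FLIGHTS", "data": {"origin": "Singapore"}})


handoff = concierge("Find flights from Singapore", fake_llm)
print(handoff.intent, handoff.data)
```

The point of the validation step is that the next agent only ever sees one of a fixed set of intents plus a plain dictionary, never free text.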

Next in line is Coerce. Coerce has one job: make sure that the information is in the exact, perfect right format for any automated system to take over. Dates? ISOString only. Locations? Name and International Codes. Every green ❔ above is Coerce doing its job.
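To make the idea concrete, here is a rough sketch of Coerce-style normalisation. It is not our production code: the accepted date formats, the tiny `LOCATION_CODES` table (using UN/LOCODE-style codes) and the function names are all assumptions for illustration, and it emits a date-only ISO 8601 string rather than a full timestamp.

```python
from datetime import datetime

# Hypothetical lookup table; a real system would query a full port database.
LOCATION_CODES = {
    "singapore": ("Singapore", "SGSIN"),
    "rotterdam": ("Rotterdam", "NLRTM"),
}

# Input formats this sketch is willing to accept (an assumption).
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d %b %Y", "%d %B %Y")


def coerce_date(text: str) -> str:
    """Return an ISO 8601 date string, or raise if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {text!r}")


def coerce_location(text: str) -> dict:
    """Return the canonical name and code for a location, or raise if unknown."""
    key = text.strip().lower()
    if key not in LOCATION_CODES:
        raise ValueError(f"unknown location: {text!r}")
    name, code = LOCATION_CODES[key]
    return {"name": name, "code": code}


print(coerce_date("24/04/2023"))       # 2023-04-24
print(coerce_location(" singapore "))  # {'name': 'Singapore', 'code': 'SGSIN'}
```

Raising on anything unrecognised, instead of guessing, is what lets a later agent ask the user about exactly the field that failed.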

The unsung hero here is AutoSuggest, our autocomplete on Steroids. I'll make a separate thread on AutoSuggest, but it provides useful suggestions to users that: ☝️Fit what our systems can do. ☝️Merge real information into varied sentences. ☝️Can be single-shot accepted.

Next-in-line is Viva, which fills in any missing information by asking the user. Coerce steps in to clean up for each pass, and Viva raises a hand when it's done. Want flights but didn't specify the target Airport? Viva. twitter.com/hrishioa/status/1645409935101149187
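The ask-then-clean loop can be sketched roughly like this. The function name `viva`, the `ask` and `coerce` callables, and the retry limit are hypothetical; in the real system both callables would be backed by agents instead of the scripted stand-ins used here.

```python
from typing import Callable, Sequence


def viva(
    fields: dict,
    required: Sequence[str],
    ask: Callable[[str], str],
    coerce: Callable[[str, str], str],
    max_attempts: int = 3,
) -> dict:
    """Fill every missing required field by asking, cleaning each answer."""
    filled = dict(fields)  # never mutate the caller's data
    for name in required:
        attempts = 0
        while not filled.get(name):
            if attempts == max_attempts:
                raise RuntimeError(f"gave up asking for {name}")
            attempts += 1
            answer = ask(f"Which {name} should I use?")
            filled[name] = coerce(name, answer)  # clean-up on every pass
    return filled  # returning is the "hand raised": nothing is missing


# Scripted stand-ins so the sketch runs without a user or a model.
scripted_answers = {"Which airport should I use?": " sin "}
result = viva(
    fields={"date": "2023-04-24"},
    required=["date", "airport"],
    ask=lambda question: scripted_answers[question],
    coerce=lambda name, value: value.strip().upper(),
)
print(result)  # {'date': '2023-04-24', 'airport': 'SIN'}
```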

Next is any number of agents that handle each task, but for our case let's look at agency emails. Every day, a crew manager sends dozens of emails to agencies asking for prices, restrictions, visas. Then for the missing information. Then again, and again...

With our Drafting bot, it's a single conversation. One agent drafts the body, another one personalises. Finally a third one adds all the high-token specifics like disclaimers and corollaries. Object-oriented LLMs, just like I promised twitter.com/hrishioa/status/1647626783775399937
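One way to picture the three drafting agents is as small stages that each take the text so far and return a new version. This is only a sketch of the shape: each stage below is a plain function standing in for a separate prompt, and the stage names, context keys and sample values are invented for illustration.

```python
from typing import Callable, Sequence

# A stage takes the draft so far plus shared context and returns a new draft.
Stage = Callable[[str, dict], str]


def draft_body(text: str, ctx: dict) -> str:
    """Stage 1: write the core request (stand-in for the drafting prompt)."""
    return f"We would like a quote for a crew change at {ctx['port']}."


def personalise(text: str, ctx: dict) -> str:
    """Stage 2: wrap the body with a greeting and sign-off."""
    return f"Dear {ctx['agent_name']},\n\n{text}\n\nRegards,\n{ctx['sender']}"


def add_specifics(text: str, ctx: dict) -> str:
    """Stage 3: append long boilerplate such as disclaimers last."""
    return f"{text}\n\n{ctx['disclaimer']}"


def run_pipeline(stages: Sequence[Stage], ctx: dict) -> str:
    text = ""
    for stage in stages:
        text = stage(text, ctx)
    return text


email = run_pipeline(
    [draft_body, personalise, add_specifics],
    {
        "port": "Singapore",
        "agent_name": "Ana",
        "sender": "Crew Desk",
        "disclaimer": "This request is not a binding order.",
    },
)
print(email)
```

Keeping the token-heavy specifics in their own final stage means the earlier stages never have to carry that text in their prompts.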

One final bot - and this is the simplest - takes the drafted email and waits patiently for any changes from the user. This is where we validate that requests are professional and check for jailbreaks, to make sure that the outgoing email is good.

So far this has all been GPT-3.5. When the responses come back, we bring out GPT-4, which is at least 0.5 better than the other one. But not yet. Gauntlet evaluates every email (for 1/30th the cost), to remove useless tokens and see if we even need the big guns.

Gauntlet uses multi-path validation (another thread for another time). We take multiple reasoning threads to the same solution, and check if they agree. GPT-4 with our best prompts was 85% accuracy, GPT-3.5 with multi-path validation was 100%. twitter.com/hrishioa/status/1646449274635579393?s=20
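Until that thread exists, here is a guess at the general shape of an agreement check, not Gauntlet itself. Each "path" is a callable standing in for a differently prompted model call; `multi_path`, `min_agreement` and the fake paths are hypothetical names. If the paths do not agree strongly enough, the function returns `None`, which a caller could treat as "escalate to a bigger model or a human".

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def multi_path(
    question: str,
    paths: Sequence[Callable[[str], str]],
    min_agreement: float = 1.0,
) -> Optional[str]:
    """Run every reasoning path and accept the answer only if enough agree."""
    # Normalise so trivial differences in case or spacing do not count.
    answers = [path(question).strip().lower() for path in paths]
    answer, votes = Counter(answers).most_common(1)[0]
    if votes / len(answers) >= min_agreement:
        return answer
    return None  # no consensus: the caller decides how to escalate


# Stand-ins for three differently prompted model calls (assumptions).
agreeing = [lambda q: "yes", lambda q: "Yes ", lambda q: "YES"]
split = [lambda q: "yes", lambda q: "no", lambda q: "yes"]

print(multi_path("Does this email contain a quote?", agreeing))  # yes
print(multi_path("Does this email contain a quote?", split))     # None
```

With the default `min_agreement=1.0` a single dissenting path blocks the answer; lowering it to `0.6` would accept the two-out-of-three case above.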

Time for the biggest brain. GPT-4 outlines a reasoning path to extract clean, per-person costs from each email and attachment, with categorization and additional corollaries extracted. This step also - in a single shot - extracts clarifications and questions as bullet points.

Our Costs agent can now come in and extract proper costs (into all the formats needed for delivery), and our Restrictions agent can extract restrictions at each port, segmented by nationality. The drafting bot (same as the one before) can turn the bullet points into a nice email.

Finally - and this is a win we can only earn if we've stuck to our principles of clean structured interfaces - the Merge prompt merges all the past information (costs, restrictions, etc) with the new data to provide the most up-to-date information.
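In the product the merge is done by a prompt. The sketch below swaps in a plain recursive dictionary merge purely to show why structured output makes this step tractable: when every agent emits the same shape, "newest value wins" becomes a few lines. The keys, port code and amounts are invented sample data.

```python
import copy


def merge(past: dict, new: dict) -> dict:
    """Return past updated with new; nested dicts merge, newer values win."""
    merged = copy.deepcopy(past)  # leave both inputs untouched
    for key, value in new.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Invented sample data in a hypothetical shared format.
past = {
    "costs": {"SGSIN": {"agency_fee": 150}},
    "restrictions": {"SGSIN": "Visa required"},
}
new = {"costs": {"SGSIN": {"agency_fee": 175, "launch": 400}}}

latest = merge(past, new)
print(latest)
```

Here the newer agency fee replaces the old one, the launch cost is added, and the restriction from the earlier email is carried over unchanged.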

I haven't covered multiple languages, push-to-talk, and how we manage in-stream processing of tokens to keep things as responsive as possible. Another time.

This took us three weeks of non-stop work. I'll keep you posted as we go for another three. We've been open-sourcing what we have, DMs are open if you have questions, want to help, or want us to help. We go far if we go together. Demo? Demo! 🚀 youtube.com/watch?v=fN_C2LPFbZk