1/40 Some technical notes on $OP airdrop #1 from yesterday. 🧵
2/ Obviously, it was a bumpy ride. We ran into a few different things that I think people will be interested to hear about though.
3/ Important disclaimers:
This is only my recollection / perspective on events. Not anything official or authoritative. I don't speak for OP Labs nor Optimism.
The official retrospective we will put out will have many more details, but I hope this gives you a taste.
3.5/ I'm going to use "we" a whole lot to save space, but while I followed along, I wasn't part of the rescue effort (having not been involved with the current code/infra in the first place).
I just got out of the way and focused on keeping Discord civil instead.
4/ The first thing that went wrong is that some people were able to claim in advance of an official announcement.
5/ This is because the contracts had been deployed earlier. To use them, you needed to generate a Merkle proof of inclusion in the Merkle tree containing all eligible addresses.
This Merkle proof could be obtained by querying an API endpoint on the claim backend server.
6/ Some people figured out how to do this early via API exploration.
In theory, it was possible to reconstruct this proof yourself by reverse engineering the (unverified) claims contract, but this is very technical, so we're not sure if anyone actually pulled it off.
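To make that concrete, here's roughly what "reconstructing the proof yourself" looks like, assuming you have the full list of eligible addresses and amounts. The leaf encoding and tree parameters below are assumptions on my part, not the claims contract's actual parameters, so treat this as a sketch of the idea:

```typescript
// Hypothetical sketch: rebuild a Merkle proof from an eligibility list.
// The leaf format (address + amount, packed and keccak256-hashed) and the
// sorted-pairs option are assumptions, not the real contract's parameters.
import { MerkleTree } from "merkletreejs";
import keccak256 from "keccak256";
import { utils } from "ethers";

type Claim = { account: string; amount: string };

const leaf = (c: Claim) =>
  Buffer.from(
    utils.solidityKeccak256(["address", "uint256"], [c.account, c.amount]).slice(2),
    "hex"
  );

export function proofFor(allClaims: Claim[], mine: Claim): string[] {
  const tree = new MerkleTree(allClaims.map(leaf), keccak256, { sortPairs: true });
  return tree.getHexProof(leaf(mine)); // what a claim function would expect on-chain
}
```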
7/ The takeaway here is that the claims should be pausable and should have started in the paused state.
8/ The second issue is that as people discovered the endpoint, the claims backend (for proof generation) fell over really fast.
Turns out it had a couple of issues that made it perform poorly under load: the disk filled up too fast because of a cache, and it made too many RPC requests.
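I don't know the details of the actual fix, but the general shape of the problem is familiar: bound the cache in memory instead of spilling to disk, and make sure repeated requests for the same address don't redo the RPC-heavy work. A rough sketch, where the cache size and the computeProof helper are made up:

```typescript
// General idea only: a bounded in-memory cache of per-address proofs.
const MAX_ENTRIES = 10_000; // assumed bound
const proofCache = new Map<string, string[]>();

export async function getProof(
  address: string,
  computeProof: (address: string) => Promise<string[]> // does the RPC-heavy work once
): Promise<string[]> {
  const key = address.toLowerCase();
  const hit = proofCache.get(key);
  if (hit) return hit;

  const proof = await computeProof(address);
  if (proofCache.size >= MAX_ENTRIES) {
    // crude eviction: drop the oldest insertion
    proofCache.delete(proofCache.keys().next().value!);
  }
  proofCache.set(key, proof);
  return proof;
}
```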
9/ This was quickly fixed, but we then ran into another issue: JSON-RPC endpoints were completely overloaded.
10/ What's JSON-RPC? It's the protocol used to (A) send transactions to the chain (e.g. via MetaMask) and (B) make read-only calls to the chain.
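In code, both of these go through the same JSON-RPC endpoint. A minimal ethers v5 sketch (the private key env var is a placeholder):

```typescript
import { ethers } from "ethers";

// The public Optimism endpoint; both reads and writes go through it.
const provider = new ethers.providers.JsonRpcProvider("https://mainnet.optimism.io");

// (B) a read-only call: just a query, answered by whichever node serves the endpoint.
const balance = await provider.getBalance("0x0000000000000000000000000000000000000000");

// (A) a transaction: signed locally, then relayed to the sequencer via the same endpoint.
const wallet = new ethers.Wallet(process.env.PRIVATE_KEY!, provider);
const tx = await wallet.sendTransaction({ to: wallet.address, value: 0 });
await tx.wait();
```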
11/ Transactions sent to the chain all need to hit the sequencer, which was chugging along just fine. We saw ~10-12 transactions per second (TPS) for a few hours. This is similar to Ethereum mainnet, so no big issues there.
12/ This was, however, a lot more TPS than Optimism has seen in the past.
(I know: TPS is a bad metric, gas/second is a better one, but it is the metric I have.)
13/ The issue was (B): the read-only calls. Each time you use a dapp, the dapp makes anywhere from a few to a shit ton of read-only calls to the chain in order to display useful data (e.g. the current price on Uniswap).
14/ This volume far exceeds the number of transactions sent (by something like 100x for a well-written dapp, to many orders of magnitude more for hackjobs).
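To see why, picture a simple portfolio page: every refresh fans out into one eth_call per token, plus block number, gas price, and so on, often on a timer. Purely illustrative; the token list is whatever the dapp happens to track:

```typescript
import { ethers } from "ethers";

const provider = new ethers.providers.JsonRpcProvider("https://mainnet.optimism.io");
const erc20Abi = ["function balanceOf(address) view returns (uint256)"];

// Every entry here costs one read-only call per refresh.
const tokens: string[] = []; // token addresses the dapp cares about
const user = "0x0000000000000000000000000000000000000000";

const balances = await Promise.all(
  tokens.map((t) => new ethers.Contract(t, erc20Abi, provider).balanceOf(user))
);
console.log(balances.map((b) => b.toString()));
// A dapp polling this every block multiplies the read volume again.
```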
15/ All the computation required to perform these calls is done not by the sequencer (or, on L1, by mining nodes) but by dedicated nodes. Most people use node providers like Infura or Alchemy for this (aka "infra providers" or "JSON-RPC providers").
16/ If you're using MetaMask, you can select the JSON-RPC provider for each chain. By default for Ethereum, it's Infura. On Optimism, you're probably using mainnet.optimism.io
17/ Note that any node can serve as a JSON-RPC endpoint; it's just that nobody makes theirs publicly available, given the costs involved. But if you run your own L1 node or L2 replica, you can use it as your endpoint, which would have avoided all the issues we're about to talk about!
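Concretely, "use your own node as the endpoint" is a one-line change in most tooling, assuming your replica exposes HTTP RPC on the usual port:

```typescript
import { ethers } from "ethers";

// Point at your own L1 node or L2 replica instead of a shared public endpoint.
const local = new ethers.providers.JsonRpcProvider("http://localhost:8545");

// Reads are now served entirely by your own node: no shared quota, no shared outage.
console.log("chain id:", (await local.getNetwork()).chainId);
```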
18/ Back to our issue. We have a record-breaking (for Optimism) level of TPS. What happens? Well, mainnet.optimism.io had recently been upgraded to load-balance between multiple endpoint providers (including in-house).
20/ ... but it wasn't enough. We quickly ran through our quotas on the various providers we use (Infura, Alchemy, Quicknode) and had to provision more. But that wasn't enough either, as conditions were crazy enough to exceed total provisioned capacity.
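For flavor, here's the same load-balancing idea expressed client-side with ethers v5's FallbackProvider. This is not the actual mainnet.optimism.io setup (that's dedicated proxy infrastructure), just an illustration of spreading traffic across several backends; the URLs and keys are placeholders:

```typescript
import { ethers } from "ethers";

const backends = [
  "https://opt-mainnet.g.alchemy.com/v2/<key>",
  "https://optimism-mainnet.infura.io/v3/<key>",
  "https://mainnet.optimism.io",
].map((url) => new ethers.providers.JsonRpcProvider(url));

// Queries are spread across backends and fail over when one stalls or errors.
const balanced = new ethers.providers.FallbackProvider(
  backends.map((provider, i) => ({ provider, priority: i, weight: 1, stallTimeout: 750 }))
);

console.log("latest block:", await balanced.getBlockNumber());
```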
21/ Node providers did us a real solid here and were able to increase capacity. We worked closely with them and even uncovered some issues in how certain transactions were handled, which had been greatly slowing down the system. Truly great work from the teams involved!
22/ All of this caused partial disruption for dapps (which couldn't query the chain anymore), and even sending transactions (which goes through the same endpoints) was affected.
23/ A sidebar: how could dapps avoid disruption? Well, if the issue had only been the free endpoint's quota, it would have sufficed for them to pay for a node provider service, which would have given them their own dedicated quota.
24/ On L1, dapps that want good reliability and want to avoid the throttling of free endpoints are expected to do this.
On L2, most dapps got by using the free endpoint. This was our fault for not throttling it from the get-go!
25/ We actually started doing this recently (and communicating with dapps about it), but we've only been ramping up the rate limiting gently, to avoid breaking the ecosystem all at once.
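The rate limiting itself is conceptually simple: a per-caller token bucket in front of the endpoint, returning 429s once a caller burns through their allowance. A toy sketch with made-up numbers (the real setup is dedicated proxy infrastructure, not this):

```typescript
// Minimal token-bucket sketch of per-caller rate limiting on a shared RPC endpoint.
type Bucket = { tokens: number; last: number };

const buckets = new Map<string, Bucket>();
const RATE = 10;  // assumed allowance: requests per second per caller
const BURST = 50; // assumed burst size

export function allowRequest(callerKey: string, now = Date.now()): boolean {
  const b = buckets.get(callerKey) ?? { tokens: BURST, last: now };
  b.tokens = Math.min(BURST, b.tokens + ((now - b.last) / 1000) * RATE);
  b.last = now;
  buckets.set(callerKey, b);
  if (b.tokens < 1) return false; // 429: the dapp should back off or get its own key
  b.tokens -= 1;
  return true;
}
```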
26/ But... in this case, even your own node provider quota would have been insufficient, because total JSON-RPC capacity fell short for a little while.
The only way for dapps to totally avoid disruption would have been for them to run their own L2 replica!
28/ A crucial step we took to reduce load back to something manageable was to temporarily filter out "archive" requests on the endpoints. In particular, we dropped any call that targeted the state of a block more than 64 blocks behind the tip of the chain.
29/ I'm a bit hazy on how that worked in practice, but my impression is that this ended up shedding a ton of load. The retrospective will have a more accurate recounting.
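My rough mental model of the filter (the exact method coverage and logic here are assumptions): look at the block tag on incoming read requests and reject anything targeting state more than 64 blocks behind the tip.

```typescript
// Rough sketch of dropping "archive" reads at a proxy. Simplified and hypothetical.
const ARCHIVE_DEPTH = 64;

export function isArchiveRequest(method: string, params: unknown[], tipBlock: number): boolean {
  if (!["eth_call", "eth_getBalance", "eth_getStorageAt"].includes(method)) return false;
  const blockTag = params[params.length - 1];
  // "latest", "pending", etc. are never archive requests.
  if (typeof blockTag !== "string" || !blockTag.startsWith("0x")) return false;
  return tipBlock - parseInt(blockTag, 16) > ARCHIVE_DEPTH;
}
```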
30/ This was not without consequences, however: e.g. it broke Uniswap until we re-enabled archive requests a bit later.
31/ Finally, a non-issue was that the batch submitter (which submits transaction batches to L1) started lagging behind L2 mainnet.
32/ Quick math: as our volume approached that of mainnet, it meant that, after compression, Optimism calldata approached 50% of mainnet calldata (excluding rollups).
33/ Because calldata is cheap in gas compared to compute, this is not a problem per se. But Ethereum transactions have a de facto size limit! This means that as L2 transactions kept piling in, the number of transactions we had to submit to L1 kept increasing.
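Back-of-the-envelope version of why this matters, with assumed numbers: the ~128 KB figure is geth's default txpool cap per transaction (presumably the de facto limit in question), and the per-tx compressed size is a guess for illustration only.

```typescript
// Back-of-the-envelope only; the per-transaction size is assumed, not measured.
const BATCH_TX_MAX_BYTES = 128 * 1024;  // geth's default txpool cap per transaction
const AVG_COMPRESSED_L2_TX_BYTES = 150; // illustrative guess
const L2_TPS = 10;

const txsPerL1Batch = Math.floor(BATCH_TX_MAX_BYTES / AVG_COMPRESSED_L2_TX_BYTES); // ~873
const secondsToFillOneBatch = txsPerL1Batch / L2_TPS;                               // ~87 s
console.log({ txsPerL1Batch, secondsToFillOneBatch });
```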
34/ Our batch submitter is currently stateless — it has no database, and will read all the info it needs from L1 and L2 nodes. This makes it simpler and more reliable.
35/ However, a consequence of this design is that we only ever have a single transaction in flight at any given time. And we wait for a certain number of L1 confirmations before we send the next transaction!
36/ This wasn't news to the team, and we were able to quickly deploy a fix that temporarily decreased the L1 confirmation depth.
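A conceptual sketch of that loop (the real batch submitter is a separate service and looks nothing like this; the names and the submitNextBatch helper are purely illustrative):

```typescript
import { ethers } from "ethers";

// Stateless, one-transaction-in-flight submit loop.
export async function submitLoop(
  submitNextBatch: () => Promise<ethers.providers.TransactionResponse>, // re-reads all state from L1/L2
  confirmationDepth: number // the temporary fix lowered this value
): Promise<never> {
  for (;;) {
    const tx = await submitNextBatch();
    // Only one batch in flight: block until it has `confirmationDepth` L1 confirmations.
    await tx.wait(confirmationDepth);
  }
}
```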
37/ At peak, the batch submitter was 1 hour behind L2 mainnet, and then started catching up. This didn't cause any disruption.
38/ All in all, I think we handled this pretty well. Disruptions only lasted a couple of hours, the network was never 100% down (even if you were using node providers), and the sequencer itself was running just fine.
39/ Avoiding the issues would have been better, and frankly some of those we should have foreseen, but at least we were able to react appropriately.
40/ Again, I can't claim any credit for all these heroics; all glory goes to the team. Too many people to tag everyone! (Full list here: twitter.com/optimismPBC/following)