# Purpose

<!-- This section is also sometimes called “Motivations” or “Goals”. -->

<!-- It is fine to remove this section from the final document,
but understanding the purpose of the doc when writing is very helpful. -->

This document exists to drive consensus and act as a reference for the preferred topology
of a cloud deployment of an interop-enabled OP Stack cluster.

# Summary

The creation of interop transactions opens Optimism networks to new forms of undesirable activity.
Specifically, including an interop transaction carries two distinct risks:
- If an interop transaction is included which is *invalid*, the block which contains it is invalid too,
and must be replaced, causing a reorg.
- If the block building and publishing system spends too much time validating an interop transaction,
callers may exploit this effort to create DoS conditions on the network, where the chain is stalled or slowed.

The new component `op-supervisor` serves to efficiently compute and index cross-safety information across all chains
in a dependency set. However, we still need to decide on the particular arrangement of components,
and the desired flow for a Tx to satisfy a high degree of correctness without risking network stalls.

In this document we will propose the desired locations and schedule for validating transactions for high correctness
and low impact. We will also propose a desired arrangement of hosts to maximize redundancy in the event that
some component *does* fail.

# Problem Statement + Context

Breaking the problem into two smaller parts:

## TX Flow - Design for Correctness, Latency
It is a goal to remove as many synchronous, blocking operations from the hot path of
the block builder as possible. Validating an interop cross-chain transaction
requires a remote RPC request to the supervisor. Having this as part of the hot
path introduces a denial-of-service risk. Specifically, we do not want to have so
many interop transactions in a block that it takes longer than the blocktime
to build a block.

To prevent this sort of issue, we want to move the validation of interop
transactions as early as possible in the process, so the hot path of the block builder
only needs to focus on including transactions.

For context, a cross-chain transaction is defined in the [specs](https://github.com/ethereum-optimism/specs/blob/85966e9b809e195d9c22002478222be9c1d3f562/specs/interop/overview.md#interop). Any reference
to the supervisor means [op-supervisor](https://github.com/ethereum-optimism/design-docs/blob/d732352c2b3e86e0c2110d345ce11a20a49d5966/protocol/supervisor-dataflow.md).

## Redundancy - Design for Maximum Redundancy
It is a goal to ensure that there are no single points of failure in the network infrastructure that runs an interop network.
To that end, we need to organize hosts such that sequencers and supervisors may go down without interruption.

This should include redundancy for both the Sequencers, arranged with Conductors, and the Supervisors themselves.

# Proposed Solutions

## TX Ingress Flow
There are multiple checks we can establish for incoming transactions to prevent excess work (a DoS vector) from reaching the Supervisor.

### `proxyd`

We can update `proxyd` to validate interop messages on cloud ingress. It should check against both the indexed
backend of `op-supervisor` and the alternative backend.
Because interop transactions are defined by their Access List,
`proxyd` does not have to execute any transactions to make this request.
This filter will eliminate all interop transactions made in bad faith, as they will be obviously invalid.

It may be prudent for `proxyd` to wait and re-test a transaction after a short timeout (`1s` for example)
to allow through transactions that are valid against the bleeding edge of chain content. `proxyd` can have its own
`op-supervisor` and `op-node` cluster (and implicitly, an `op-geth` per `op-node`), specifically to serve cross-safety queries without putting any load on other
parts of the network.
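
For illustration, here is a sketch (in Go, against go-ethereum's `rpc` and `core/types` packages) of what the `proxyd`-side check could look like. The `supervisor_validateAccessList` argument and result shapes, and the `validateInteropTx` helper itself, are assumptions made for this document, not the finalized supervisor interface.

```go
package ingress // hypothetical package for the ingress-side checks in this doc

import (
	"context"
	"time"

	"github.com/ethereum/go-ethereum/core/types"
	"github.com/ethereum/go-ethereum/rpc"
)

// validateInteropTx asks the supervisor whether the executing messages declared in the
// transaction's access list are backed by known initiating messages. Transactions with
// no access list entries are not interop transactions and pass through untouched.
func validateInteropTx(ctx context.Context, supervisor *rpc.Client, tx *types.Transaction) error {
	accessList := tx.AccessList()
	if len(accessList) == 0 {
		return nil // not an interop transaction; nothing to check
	}
	ctx, cancel := context.WithTimeout(ctx, 200*time.Millisecond)
	defer cancel()
	// Assumed call shape: hand the supervisor the access list and let it resolve the
	// declared executing messages against its index (or the getLogs-backed light backend).
	var result interface{}
	return supervisor.CallContext(ctx, &result, "supervisor_validateAccessList", accessList)
}
```

In practice the same call would be issued against both the indexed `op-supervisor` backend and the alternative backend described under Solution Side-Ideas, and the re-test-after-`1s` behavior would wrap this helper.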

### Sentry Node Mempool Ingress

We can update the EL clients to validate interop transactions on ingress to the mempool. This should be a different
instance of `op-supervisor` than the one used by `proxyd`, to reduce the likelihood that a nondeterministic
bug within `op-supervisor` affects both checks. See "Host Topology" below for a description of how to arrange this.

### All Nodes Mempool on Interval

We can update the EL clients to validate interop transactions on an interval in the mempool. Generally the mempool
will revalidate all transactions on each new block, but for an L2 with a 1-2s blocktime, that could be too frequent if the
RPC round-trip of an `op-supervisor` query is too costly.

Instead, the Sequencer (and all other nodes) should validate only on a low-frequency interval after ingress.
The *reasoning* for this is:

Let's say that it takes 100ms for a transaction to be checked at `proxyd`, checked at the mempool of the sentry node,
forwarded to the sequencer, and pulled into the block builder. The chance of the status of an initiating message
going from existing to not existing during that timeframe is extremely small. Even if we did check at the block builder,
it wouldn't capture the case of a future unsafe chain reorg that causes the message to become invalid.
Because it is most likely that the remote unsafe reorg comes after the local block is sealed, there is no real
reason to block the hot path of the chain with remote lookups. If anything, we would want to coordinate these checks
with the *remote block builders*, but of course we have no way to actually do this.
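
As a sketch of the low-frequency check (continuing the Go example above; `pendingInteropTxs` and `dropTx` are hypothetical hooks into the EL client's txpool, and the interval is a placeholder to be tuned by measurement):

```go
// A sketch of low-frequency mempool revalidation, decoupled from block building.
// pendingInteropTxs and dropTx are hypothetical hooks into the EL client's txpool.
func revalidateLoop(ctx context.Context, supervisor *rpc.Client, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, tx := range pendingInteropTxs() {
				if err := validateInteropTx(ctx, supervisor, tx); err != nil {
					// Evict transactions whose initiating messages no longer check out,
					// so the block builder never has to look at them.
					dropTx(tx.Hash())
				}
			}
		}
	}
}
```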

### Batching Supervisor Calls

During ingress, transactions are independent and must be checked independently. However, once they've reached the Sequencer
mempool, transactions can be grouped and batched by presumed block. Depending on the rate of the check, the Sequencer
can collect all the transactions in the mempool it believes will be in a block soon, and can perform a batch RPC call
to more effectively filter out transactions. This would allow the call to happen more often without increasing RPC overhead.
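
A sketch of such a batch call using go-ethereum's `rpc.BatchElem`, under the same assumed `supervisor_validateAccessList` method as above:

```go
// A sketch of batching supervisor checks for a presumed block's worth of transactions.
// One round-trip carries all access lists; per-element errors identify transactions to drop.
func batchValidate(ctx context.Context, supervisor *rpc.Client, txs []*types.Transaction) ([]error, error) {
	batch := make([]rpc.BatchElem, 0, len(txs))
	results := make([]interface{}, len(txs))
	for i, tx := range txs {
		batch = append(batch, rpc.BatchElem{
			Method: "supervisor_validateAccessList", // assumed method name, per this doc
			Args:   []interface{}{tx.AccessList()},
			Result: &results[i],
		})
	}
	if err := supervisor.BatchCallContext(ctx, batch); err != nil {
		return nil, err // transport-level failure; fail open or closed per policy
	}
	errs := make([]error, len(txs))
	for i, elem := range batch {
		errs[i] = elem.Error
	}
	return errs, nil
}
```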

### Note on Resource Usage

<!-- What is the resource usage of the proposed solution?
Does it consume a large amount of computational resources or time? -->

Doing a remote RPC request is always going to be an order of magnitude slower than doing a local lookup.
Therefore we want to ensure that we can parallelize our remote lookups as much as possible. Block building
is inherently a single-threaded process, given that the ordering of the transactions is very important.
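
Outside the block builder, the lookups themselves can be overlapped. A sketch using `golang.org/x/sync/errgroup` (the concurrency limit is an arbitrary placeholder; `validateInteropTx` is the assumed helper from the earlier sketch):

```go
// A sketch of overlapping supervisor round-trips so total latency approaches one
// RPC round-trip instead of one per transaction. Block building itself stays
// single-threaded; only the pre-validation fan-out is concurrent.
func validateConcurrently(ctx context.Context, supervisor *rpc.Client, txs []*types.Transaction) []error {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(16) // cap in-flight requests; the right limit would be measured
	errs := make([]error, len(txs))
	for i, tx := range txs {
		i, tx := i, tx
		g.Go(func() error {
			errs[i] = validateInteropTx(ctx, supervisor, tx)
			return nil // collect per-tx errors rather than aborting the group
		})
	}
	_ = g.Wait()
	return errs
}
```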

## Host Topology / Arrangement

In order to fully validate a Superchain, a Supervisor must be hooked up to one Node per chain (with one Executing Engine behind each).
We can call this group a "full validation stack" because it contains all the executing parts to validate a Superchain.

In order to have redundancy, we will need multiple Nodes, and also *multiple Supervisors*.
We should use Conductor to ensure the Sequencers have redundancy as well.
Therefore, we should arrange the nodes like so:

|              | Chain A | Chain B | Chain C |
|--------------|---------|---------|---------|
| Supervisor 1 | A1      | B1      | C1      |
| Supervisor 2 | A2      | B2      | C2      |
| Supervisor 3 | A3      | B3      | C3      |

In this model, each chain has one Conductor, which joins all the Sequencers for a given network. And each cross-chain group of Sequencers (one per chain) is joined by a Supervisor.
This model gives us redundancy for both Sequencers *and* Supervisors. If an entire Supervisor were to go down,
there are still two full validation stacks processing the chain correctly.

There may need to be additional considerations the Conductor makes in order to determine failover,
but these are not well defined yet. For example, if the Supervisor of the active Sequencer went down,
it may be prudent to switch the active Sequencer to one with a functional Supervisor.
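
Purely as an illustration of that idea, here is a health-check loop of the kind a Conductor (or a sidecar next to it) might run. `supervisorHealthy`, `isActiveSequencer`, and `transferLeadership` are hypothetical stand-ins, since op-conductor's actual failover inputs and RPC surface for this are not defined yet.

```go
// A hedged sketch of Supervisor-health-driven failover. supervisorHealthy,
// isActiveSequencer, and transferLeadership are hypothetical stand-ins for whatever
// health probe and Conductor leadership-transfer mechanism actually get wired in.
func watchSupervisorHealth(ctx context.Context, interval time.Duration, maxFailures int) {
	failures := 0
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if supervisorHealthy(ctx) {
				failures = 0
				continue
			}
			failures++
			if failures >= maxFailures && isActiveSequencer() {
				// Move leadership to a sequencer whose validation stack still has a
				// working Supervisor, rather than keep building blocks we cannot cross-validate.
				transferLeadership(ctx)
				failures = 0
			}
		}
	}
}
```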

## Solution Side-Ideas

Although they aren't strictly related to TX Flow or Redundancy, here are additional ideas to increase the stability
of a network. These ideas won't be brought forward into the Solution Summary or Action Items.

### `op-supervisor` alternative backend

We add a backend mode to `op-supervisor` that validates cross-chain messages by making dynamic calls to `eth_getLogs`
rather than consulting its local index. This could be accomplished by adding
new RPC endpoints that do this, or could be done with runtime config. When `op-supervisor` runs in this
mode, it is a "light mode" that only supports `supervisor_validateMessagesV2` and `supervisor_validateAccessList`
(potentially a subset of their behavior). This would give us a form of "client diversity" with respect
to validating cross-chain messages. This is a low-lift way to reduce the likelihood of a forged initiating
message. A forged initiating message would trick the caller into believing that an initiating
message exists when it actually doesn't, meaning that it could be possible for an invalid executing
message to finalize.
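
A rough sketch of what the `eth_getLogs`-backed path could look like: given an executing message that claims an initiating log at a particular block and log index on a remote chain, fetch that block's logs and compare. The `Identifier` struct and matching rules below are simplified placeholders for the spec's message identifier, not the real implementation (go-ethereum's `ethclient`, `common`, root `ethereum`, and `math/big` imports assumed).

```go
// A simplified sketch of validating an initiating message via eth_getLogs instead of
// the supervisor's local index. Identifier is a pared-down stand-in for the spec's
// message identifier; a real check also covers timestamp/chain fields and payload hashing.
type Identifier struct {
	Origin      common.Address // emitter of the initiating log
	BlockNumber uint64
	LogIndex    uint
}

func initiatingMessageExists(ctx context.Context, remote *ethclient.Client, id Identifier, wantTopic0 common.Hash) (bool, error) {
	blockNum := new(big.Int).SetUint64(id.BlockNumber)
	logs, err := remote.FilterLogs(ctx, ethereum.FilterQuery{
		FromBlock: blockNum,
		ToBlock:   blockNum,
		Addresses: []common.Address{id.Origin},
	})
	if err != nil {
		return false, err
	}
	for _, lg := range logs {
		if lg.Index == id.LogIndex && len(lg.Topics) > 0 && lg.Topics[0] == wantTopic0 {
			return true, nil
		}
	}
	return false, nil
}
```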

TODO: feedback to understand what capabilities are possible


# Solution Summary

We should establish `op-supervisor` checks of transactions at the following points:
- On cloud ingress to `proxyd`
- On ingress to sentry node mempools
- On a regular interval on all mempools

Additionally, regular interval checks should use batch calls which validate at least a block's worth of the mempool
at a time.

`op-supervisor` checks of transactions should *not* happen at the following points:
- Sequencer node mempool ingress
- Block building (as a synchronous activity)

When we deploy hosts, we should currently use a "full validation stack" of one Supervisor plus N Managed Nodes,
to maximize redundancy and independent operation of validators. When Sequencers are deployed, Conductors should manage
individual Managed Nodes *across* Supervisors.

# Alternatives Considered

## Checking at Block Building Time (Tx Flow Solution)

The main alternative to validating transactions ahead of the block builder is validating them
at the block builder itself. We would like to have this feature implemented, because it can work for simple networks
and act as an ultimate fallback to keep interop messaging live, but we do not want to run it as
part of the happy path.

## Multi-Node (Host Redundancy Solution)

One request that has been made previously is to have "Multi-Node" support. In this model,
multiple Nodes for a single chain are connected to the same Supervisor. To be clear, the Supervisor software
*generally* supports this behavior, with a few known edge cases where secondary Nodes won't sync fully.

The reason this solution is not the one being proposed is two-fold:
- Managing multiple Nodes' sync status from a single Supervisor is tricky -- you have to be able to replay
all the correct data on whatever node is behind, must be able to resolve conflicts between reported blocks,
and errors on one Node may or may not end up affecting the other Nodes. While this feature has some testing,
the wide range of possible interplay means we don't have high confidence in Multi-Node as a redundancy solution.
- Multi-Node is *only* a Node redundancy solution, and the Supervisor managing multiple Nodes is still a single
point of failure. If the Supervisor fails, *every* Node under it is also unable to sync, so there must *still*
be a diversification of Node:Supervisor. At the point where we split them up, it makes no sense to have higher quantities
than 1:1:1 Node:Chain:Supervisor.

# Risks & Uncertainties

To validate our hypothesis about the ideal architecture, we need to measure everything, and then try to break it.

Incorrect assumptions, or unexpected emergent behaviors in the network, could result in validation not happening at the right times,
causing excessive replacement blocks. Conversely, we could also fail to reduce load on the block builder, still leading to slow
block building or stalls.

Ultimately, this design represents a hypothesis which needs real testing before it can be challenged and updated.

# Action Items from this Document
- Put the `proxyd` check in place
- Put an interval check in place in the mempool
- Remove build-time checks
- Test RPC performance (data collectable via Grafana)
- Consider and add Supervisor-health as a trigger for Conductor Leadership Transfer