RFC-0166: Snowbridge Emergency Pause Pallet
| Start Date | 2026-05-28 |
| Description | A permissionless, deposit-gated emergency pause for Snowbridge that halts both sides of the bridge via best-effort calls with on-chain retry, resolved by Fellowship. |
| Authors | Snowbridge team |
Summary
At the moment, Snowbridge cannot be halted immediately. If an exploit is detected, the best available option is to halt the bridge through a whitelisted-caller proposal in OpenGov. This has an obvious drawback: even once an exploit is detected, there is no immediate way to halt the bridge on-chain (off-chain relayers can be switched off, but that is not a foolproof stopgap). This RFC proposes a permissionless, instant Snowbridge halt, available to any caller who deposits a large sum of DOT, which is slashed if the pause is judged malicious. This proposal is a reactive security measure (i.e. an exploit or vulnerability first needs to be visible for this functionality to be useful). Another proposal, Snowbridge Circuit Breakers, is proposed alongside this RFC for a more proactive approach.
Motivation
Snowbridge has no fast, broadly accessible halt path today. Existing governance halt routes require a referendum and Fellowship action, with hours-to-days latency. That is too slow both for stopping an active drainage exploit and for pausing activity during an investigation.
Two pieces of prior work in the Fellowship are relevant but do not solve Snowbridge's halt needs:
- polkadot-fellows/runtimes #1089, proposing deployment of pallet-tx-pause and pallet-safe-mode across system chains.
- polkadot-fellows/runtimes #1164 (draft), wiring pallet-safe-mode into AssetHub with a 100k DOT permissionless trigger.
Neither fits Snowbridge:
- Blast radius is different. pallet-safe-mode installs a chain-wide BaseCallFilter. On BridgeHub that would freeze Kusama-side bridging, identity, governance proxies, etc., not just Snowbridge.
- No outbound side-effects. Safe-mode is a passive filter. It cannot emit the XCM to AssetHub that halts snowbridgeSystemFrontend, and it cannot queue the outbound governance command that flips the Ethereum Gateway's operating mode.
- Different resume semantics. Safe-mode's exit model is auto-resume on expiry. For Snowbridge, deliberate Fellowship resolution is the expected exit, and the timeout is a quorum-failure backstop that should rarely fire, because auto-resuming under an active attack is more dangerous than staying paused.
- pallet-tx-pause is a privileged-origin gate with no currency support. Its pause call is guarded by a configured PauseOrigin (a privileged origin such as Fellowship or root), not a permissionless signed caller, and its Config has no Currency / ReservableCurrency type, so there isn't an easy way to attach the slashable deposit that makes a permissionless trigger safe.
Snowbridge requires a more specific implementation: a halt with a permissionless economic trigger that executes the different halting mechanisms, a longer halt window, and Fellowship-driven genuine/malicious classification.
Stakeholders
- Polkadot Fellowship, the ResolveOrigin and the body that decides between genuine vs malicious triggers.
- Snowbridge maintainers, who implement and operate the halt path.
- Snowbridge users and integrators, who experience a halt as the bridge being closed at submit time on both Ethereum and AssetHub.
- Polkadot Treasury, the destination of slashed deposits on malicious triggers.
- Security researchers and watchdog operators, the most likely callers of a permissionless trigger during an incident.
Explanation
Goal
A permissionless DOT deposit triggers a complete Snowbridge halt, in response to either a possible exploit (stop new activity while investigating) or an active exploit (an attacker is actively draining value).
Implementation
The pallet adds one new extrinsic to BridgeHub, trigger(). It is permissionless, gated only by a 100,000 DOT reservable deposit. On a successful call it reserves the deposit, transitions state to Triggered, and dispatches the seven halt calls below in priority order. Six of the seven are the same set_operating_mode extrinsics that root-level governance halts use today; the new pallet becomes an additional (deposit-gated, permissionless) caller of them. The seventh, EthereumOutboundQueueV2::set_operating_mode(Halted), is a new extrinsic this RFC requires on the V2 outbound queue. V2's pause architecture is single-chokepoint at AssetHub: snowbridge-pallet-system-frontend owns the operating mode, and the AH XCM router's PausableExporter checks it at validate() time, so a halted frontend means V2 messages never reach BridgeHub at all, making a BH-local mode redundant in the steady state. The emergency-pause use case introduces a new reason for it: during the ~1-2 min window between trigger() and the cross-chain AH frontend halt (call 6) actually landing, V2 messages users submit on AH still pass PausableExporter (the frontend is still Normal), wrap into ExportMessage, ride XCMP to BH, and queue in V2 outbound with no check. The new BH-local mode closes that window symmetrically with how call 3 does for V1.
None of the seven calls block the trigger: the pallet attempts each, logs successes and failures, and re-attempts pending legs in later blocks via on_initialize.
BridgeHub-local calls are fired first because they are synchronous local writes with no cross-chain dependency, so they typically land in the same block as the trigger. This serves as an immediate hard-stop of any asset flow through the bridge. It could cause asymmetry in bridge assets (e.g. funds locked on Ethereum but not minted on AssetHub), but this is preferable to more possible illegitimate transactions being processed through the bridge. The two cross-chain calls depend on HRMP delivery and BEEFY relay respectively, so they take ~1-2 min when everything is healthy and can take longer or fail entirely under congestion.
1. EthereumInboundQueue::set_operating_mode(Halted) (BH local), blocks E→P V1 dispatch on Polkadot. Covers inbound-queue exploits and any in-flight V1 message arriving before the Gateway halt lands.
2. EthereumInboundQueueV2::set_operating_mode(Halted) (BH local), same for V2.
3. EthereumOutboundQueue::set_operating_mode(Halted) (BH local), rejects V1 P→E exports when AH's XCM ExportMessage lands on BH. Closes the gap between trigger time and the AH frontend halt landing.
4. EthereumOutboundQueueV2::set_operating_mode(Halted) (BH local, new). Same as call 3 but for V2 P→E exports. Closes the symmetric in-flight gap before the Gateway halt lands on Ethereum.
5. EthereumBeaconClient::set_operating_mode(Halted) (BH local), stops new beacon-header ingestion. Doesn't gate dispatch directly (calls 1 and 2 already do) but it's a cheap defense-in-depth lever for the case where the exploit IS the beacon client.
6. AH snowbridgeSystemFrontend::set_operating_mode(Halted) (XCM Transact from BH). Blocks AH users and parachains from even attempting P→E sends.
7. Outbound governance command via EthereumSystem (V1 path only) flipping the Ethereum Gateway to OperatingMode::RejectingOutboundMessages. With Fiat-Shamir-assisted BEEFY verification this lands on Ethereum in ~1-2 min instead of ~20 min. The V1 path is used because its PRIMARY_GOVERNANCE_CHANNEL already bypasses the V1 outbound queue halt (pallets/outbound-queue/src/send_message_impl.rs:79), so call 3 does not block call 7. Routing this command through EthereumSystemV2 is not used because the V2 outbound queue has no governance bypass today, and adding one is more invasive than reusing the V1 governance path. The Gateway has a single shared mode storage (Gateway.sol:247) used by both V1 and V2 dispatch, so writing it via V1 flips it for both. See §"Compatibility" and §"Future Directions" for the V1-deprecation follow-up. No relayer change is required: the existing primary and secondary governance relayer instances (PRIMARY_GOVERNANCE_RELAY_* / SECONDARY_GOVERNANCE_RELAY_* in relayer/.env.mainnet.example) are pinned to PRIMARY_GOVERNANCE_CHANNEL_ID and run as independent services from per-parachain user-message relayers, so they pick up the SetOperatingMode commitment on their next scan without competing for compute or Ethereum gas budget; the dominant latency variable is BEEFY proof availability, not relayer prioritization.
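The ordering of the seven calls can be sketched as a plain Rust enum. This is an illustrative model only: the names HaltLeg, Transport, PRIORITY_ORDER, and call_number are hypothetical and do not appear in the RFC or in the Snowbridge pallets.

```rust
/// Hypothetical identifiers for the seven halt legs; the RFC does not name them.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HaltLeg {
    InboundQueueV1,   // call 1
    InboundQueueV2,   // call 2
    OutboundQueueV1,  // call 3
    OutboundQueueV2,  // call 4 (new extrinsic required by this RFC)
    BeaconClient,     // call 5
    AssetHubFrontend, // call 6
    EthereumGateway,  // call 7
}

/// How a leg reaches its target (hypothetical classification).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Transport {
    BridgeHubLocal,
    XcmTransact,
    OutboundGovernanceCommand,
}

impl HaltLeg {
    /// Dispatch order: BH-local writes first, cross-chain sends last.
    pub const PRIORITY_ORDER: [HaltLeg; 7] = [
        HaltLeg::InboundQueueV1,
        HaltLeg::InboundQueueV2,
        HaltLeg::OutboundQueueV1,
        HaltLeg::OutboundQueueV2,
        HaltLeg::BeaconClient,
        HaltLeg::AssetHubFrontend,
        HaltLeg::EthereumGateway,
    ];

    /// Which mechanism carries the halt for this leg.
    pub fn transport(self) -> Transport {
        match self {
            HaltLeg::AssetHubFrontend => Transport::XcmTransact,
            HaltLeg::EthereumGateway => Transport::OutboundGovernanceCommand,
            _ => Transport::BridgeHubLocal,
        }
    }

    /// 1-based call number as used in the text ("call 6", "call 7").
    pub fn call_number(self) -> usize {
        Self::PRIORITY_ORDER
            .iter()
            .position(|leg| *leg == self)
            .map(|index| index + 1)
            .expect("every leg is listed in PRIORITY_ORDER")
    }
}
```

Keeping the order in a single constant means a future leg (for example, a new queue version) is added in one place, which is the property the RFC wants from encoding both halt and resume in the pallet.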
Why best-effort everywhere
All seven calls are best-effort: any one of them failing does not abort trigger() or revert the others. The motivating failure mode is the XCM Transact to AH (call 6), which depends on the BH outbound HRMP queue having capacity. Under congestion, the same congestion an active incident might be causing, that send can fail. If the trigger were all-or-nothing, the whole halt would revert and the bridge would stay open at exactly the moment we need it closed. With best-effort, whichever calls land at trigger time take effect; the rest are retried by on_initialize. Worst case: "the bridge is closed via whatever subset landed, and the remaining levers retry on subsequent blocks".
The BH-local calls are synchronous storage writes on the same chain as the pallet, so they should always succeed; treating them as best-effort too doesn't change behavior in the common case, it just removes a class of "trigger reverted because of an edge case nobody anticipated" outcomes.
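A minimal sketch of the best-effort behaviour described above, written as plain Rust rather than FRAME code. The Pause struct, its pending set, and the dispatch closure are assumptions made for illustration; deposit reservation and event emission are deliberately left out.

```rust
use std::collections::BTreeSet;

/// Call numbers 1..=7, matching the numbered halt calls.
pub type Leg = u8;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Normal,
    Triggered,
}

/// Hypothetical in-memory stand-in for the pallet's storage.
pub struct Pause {
    pub state: State,
    /// Legs whose halt call has not landed yet.
    pub pending: BTreeSet<Leg>,
}

impl Pause {
    pub fn new() -> Self {
        Pause { state: State::Normal, pending: BTreeSet::new() }
    }

    /// Attempt every leg in priority order. A failing leg is recorded as
    /// pending; it never aborts the trigger or reverts the other legs.
    pub fn trigger(
        &mut self,
        mut dispatch: impl FnMut(Leg) -> Result<(), &'static str>,
    ) -> Result<(), &'static str> {
        if self.state != State::Normal {
            return Err("already triggered");
        }
        self.state = State::Triggered;
        for leg in 1..=7u8 {
            if dispatch(leg).is_err() {
                self.pending.insert(leg);
            }
        }
        Ok(())
    }

    /// Retry pass, as `on_initialize` might run it in a later block:
    /// legs that succeed are dropped, legs that fail again stay pending.
    pub fn retry_pending(&mut self, mut dispatch: impl FnMut(Leg) -> Result<(), &'static str>) {
        self.pending.retain(|leg| dispatch(*leg).is_err());
    }
}
```

For example, if the dispatch closure fails only for leg 6 (the XCM Transact to AH) while HRMP is congested, trigger() still returns Ok, the state is Triggered, and pending contains only 6 until a later retry_pending pass succeeds.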
Resuming the bridge
Resume is the symmetric inverse of the halt: a set_operating_mode(Normal) on each of the BH-local pallets that were halted, an XCM Transact to AH flipping snowbridge-pallet-system-frontend back to Normal, and an outbound governance command flipping the Ethereum Gateway back to OperatingMode::Normal. The pallet exposes this as a single resume() extrinsic, callable by ResolveOrigin (Fellowship XCM voice). It uses the same LocalOperatingMode / AssetHubXcmSender / GatewayOperatingModeSender plumbing as trigger(), just with Normal instead of Halted, including the same best-effort, on_initialize retry loop.
Encoding resume on-chain in this pallet (rather than relying solely on an external SDK preimage tool to assemble the seven calls) means the resume logic is versioned and tested alongside the halt logic. If the set of pallets that need flipping ever changes (e.g., a new outbound queue version is added), trigger() and resume() move together as one pallet upgrade; there is no second artifact to keep in sync.
The pallet has two resume paths, with different policies for who resumes the bridge. Fellowship resolution is a resolve(policy) extrinsic taking a ResolutionPolicy enum parameter:
- resolve(Genuine): the pallet refunds the deposit and transitions to Normal, but does not auto-resume. The trigger was a legitimate incident response, so resuming requires an explicit resume() call once the underlying issue is fixed. Fellowship issues resume() when ready.
- resolve(Malicious): slashes the deposit and transitions directly to Resuming. The bridge auto-resumes via on_initialize without waiting on a separate resume() call. The trigger was malicious and the bridge should be resumed immediately, so no separate decision is needed.
- Backstop timeout: refunds the deposit, transitions to Resuming, and auto-resumes (same shape as resolve(Malicious) for the resume side). The reasoning: if Fellowship has been unreachable for the full backstop window, the default policy is "the trigger was probably bad", since a real incident would have been resolved well before then. The per-asset Gateway-side velocity caps from the companion preventive layer remain in force regardless and continue to bound value-at-risk after the bridge resumes.
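The three exits from Triggered can be summarised as a small transition table. The sketch below is a plain-Rust model, not pallet code: DepositOutcome, the function signatures, and the now >= deadline comparison are assumptions, while the state names and the refund/slash outcomes follow the list above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Normal,
    Triggered,
    Resuming,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ResolutionPolicy {
    Genuine,
    Malicious,
}

/// What happens to the reserved deposit (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DepositOutcome {
    Refunded,
    Slashed,
}

/// Fellowship resolution. Returns None when there is nothing to resolve.
pub fn resolve(state: State, policy: ResolutionPolicy) -> Option<(State, DepositOutcome)> {
    if state != State::Triggered {
        return None;
    }
    Some(match policy {
        // Refund, but leave the bridge halted until an explicit resume().
        ResolutionPolicy::Genuine => (State::Normal, DepositOutcome::Refunded),
        // Slash and auto-resume via the retry loop.
        ResolutionPolicy::Malicious => (State::Resuming, DepositOutcome::Slashed),
    })
}

/// Backstop check. Assumes the deadline is an absolute block number and
/// that the backstop fires once `now` reaches it.
pub fn backstop_timeout(state: State, now: u32, deadline: u32) -> Option<(State, DepositOutcome)> {
    if state == State::Triggered && now >= deadline {
        Some((State::Resuming, DepositOutcome::Refunded))
    } else {
        None
    }
}
```

Note the asymmetry this makes visible: only resolve(Malicious) slashes, and only resolve(Genuine) leaves the resume decision to a separate call.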
The existing SDK governance resume preimage still works and remains the fallback if the pallet itself is somehow wedged (governance can always submit set_operating_mode(Normal) directly to the underlying Snowbridge pallets via root). The resume() extrinsic is the fast, on-chain path for the common case; the SDK preimage is the slow, out-of-band escape hatch.
What we reuse from prior work
From pallet-safe-mode and PR #1164's AssetHub wiring: the deposit / reserve / release pattern, the EnsureXcm<IsVoiceOfBody> origin pattern for ResolveOrigin, the 100k DOT deposit calibration, the duration / extension machinery (EnterDuration + ExtendDuration + on_initialize deadline check).
From pallet-tx-pause: the (pallet, call) addressing convention as a future direction if a softer per-extrinsic pause is ever wanted, and the runtime-level Fellowship-only call gating model for resolve.
Code reuse: the deposit/reserve/slash skeleton and the duration/extension/timeout machinery come from pallet-safe-mode. What we drop is the BaseCallFilter integration. What we add is the seven side-effect hooks (one of which requires a small accompanying change to the V2 outbound queue pallet, see §Compatibility) and the best-effort retry loop in on_initialize.
Threat model coverage
- E→P entry (Ethereum-side UX): Gateway halt (call 7).
- E→P in-flight + inbound-queue exploit: inbound V1 + V2 halts (calls 1, 2). Critically also covers exploits that bypass the Gateway entirely (malformed proofs, payload-decode bugs, MMR weaknesses).
- P→E new submissions during AH-halt-in-flight window, V1: outbound V1 halt (call 3).
- P→E new submissions during AH-halt-in-flight window, V2: outbound V2 halt (call 4).
- P→E entry (AH-side UX): AH frontend halt (call 6).
- Beacon-client exploit: beacon client halt (call 5).
What the halt does not cover
The seven calls block new submissions on both sides. They cannot stop a P→E message that has already passed deliver() and entered the BH MessageQueue before the halt lands. Once a message is in the MessageQueue, the outbound queue's do_process_message (pallets/outbound-queue/src/lib.rs:301, and the V2 equivalent) processes it, builds a merkle commitment, and the message becomes relayable. The Gateway's submitV1 / submitV2 inbound dispatch path does not check operating mode either: the Gateway's RejectingOutboundMessages mode only gates Ethereum-side senders, not inbound dispatch. Once relayed, the message therefore executes (releaseToken or mintForeignToken) regardless of the halt.
Concretely: an attacker who has already managed to enqueue a malicious P→E message before the halt lands will still see that message dispatch on Ethereum. The halt only stops future enqueues.
This residual gap is the explicit motivation for the companion preventive layer. The Snowbridge Circuit Breakers RFC specifies per-asset Gateway-side velocity caps that block releaseToken / mintForeignToken at the Gateway itself, regardless of whether the message came through normal flow or the in-flight window of a halt. The halt pallet is the reactive "stop new flow" layer; the circuit breakers are the preventive "stop value extraction" layer. Both are needed for full coverage.
Drawbacks
- Spam vector mitigated only by deposit. A determined adversary willing to forfeit 100k DOT can halt the bridge once. The Fellowship-resolved slash makes repeat abuse expensive but a single griefing event is unavoidable. The companion preventive layer reduces the value an attacker gains from the resulting investigation window.
- Per-leg outcome dispersion. Operators must inspect emitted events to know which legs actually landed. Worst case: a triggered pallet with the Gateway halt still pending for many blocks if the BH outbound queue is congested.
- Bridge recovery before the backstop fires requires a reachable Fellowship. The only paths back to Normal ahead of the 7-day backstop are resolve(Genuine) and resolve(Malicious), both Fellowship-only. If Fellowship is unreachable at the same time as a bad trigger (a troll halt, a false-alarm halt), the bridge stays halted until the backstop auto-unhalts, up to a week later. Funds aren't lost, but no new transfers move in or out during that window. Intentional bias: auto-resuming too early under an active attack is worse than a week of downtime for legitimate users, but worth noting as a real failure mode.
- Cross-chain calls remain a dependency. If HRMP between BH and AH is down, the AH frontend halt will never land via on_initialize. The pallet has no in-band escape hatch for this case; Fellowship would need to land a separate runtime call (e.g., re-routing the XCM through an alternate path) outside the pallet's API.
Testing, Security, and Privacy
- Per-leg pallet tests asserting each LocalOperatingMode leg flips the correct storage (in both Halted and Normal directions), and that on_initialize retries the right subset on partial failure for both trigger() and resume().
- XCM integration tests with a wedged AH HRMP queue, asserting trigger() still transitions to Triggered, records the AH leg as pending, retries on subsequent blocks, and clears once HRMP drains.
- End-to-end simulation (chopsticks fork): trigger with all seven legs healthy, with one failed leg, with the BH outbound queue at capacity, with AH unreachable. Each scenario should yield "bridge closed via whatever subset landed, remaining legs pending".
- Deposit reservation paths under sufficient/insufficient balance.
- Resolution paths (resolve(Genuine), resolve(Malicious), timeout, force_extend).
- No new privacy surface. All events are public; the deposit caller is already implicit in the transaction signer.
- Security posture: the pallet creates a new attack surface (anyone with 100k DOT can halt). This is the intended design, calibrated against the asymmetric harm of being unable to halt during an active drainage.
Performance, Ergonomics, and Compatibility
Performance
Trigger weight is dominated by the seven side-effect calls. BH-local legs are O(1) storage writes; the AH XCM and Gateway outbound legs queue messages but do not synchronously execute remote logic. on_initialize weight while Triggered is bounded by the per-leg retry cost gated by RetryBackoff, so the steady-state idle cost is reading State and a block-number comparison.
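The retry gate that bounds on_initialize weight might look like the following sketch. It is plain Rust with hypothetical names (should_retry, last_attempt, pending_legs); the RFC names RetryBackoff and State but does not specify how the gate is stored or compared, so the block-count comparison here is an assumption.

```rust
pub type BlockNumber = u32;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Normal,
    Triggered,
    Resuming,
}

/// Decide whether this block should spend weight re-attempting pending legs.
///
/// `retry_backoff` is measured in blocks; `last_attempt` is the block in
/// which legs were last dispatched. Both are illustrative parameters.
pub fn should_retry(
    state: State,
    pending_legs: usize,
    now: BlockNumber,
    last_attempt: BlockNumber,
    retry_backoff: BlockNumber,
) -> bool {
    // Idle path: a single state read, nothing else.
    if state == State::Normal {
        return false;
    }
    // Triggered or Resuming, but every leg has already landed.
    if pending_legs == 0 {
        return false;
    }
    // Only retry once the backoff window has elapsed.
    now.saturating_sub(last_attempt) >= retry_backoff
}
```

With a backoff of 5 blocks and a last attempt at block 100, the gate stays closed through block 104 and opens at block 105, so a congested HRMP queue is probed at most once per backoff window.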
Ergonomics
Trigger UX is a single extrinsic signed by an account holding at least 100k DOT. Operator UX during a triggered state is event-driven: the relayer indexer should add EmergencyPauseTriggered and per-leg HaltSucceeded / HaltFailed to its watch set. On the Fellowship side, resolve and force_extend follow the same XCM-voice pattern as existing Fellowship governance calls.
Compatibility
One small change is required to an existing Snowbridge pallet: EthereumOutboundQueueV2 gains an OperatingMode storage item and a set_operating_mode extrinsic, mirroring the V1 outbound queue but without a governance bypass on the send path. V2 deliberately omitted operating-mode plumbing because its pause architecture is single-chokepoint at the AH frontend's PausableExporter; in the steady state, a halted frontend means nothing reaches BH, so a BH-local check is redundant. This RFC adds the check as defense-in-depth for the brief window during which the cross-chain AH frontend halt is in flight (see §"The seven halt calls" for the full reasoning).
The V2 send path intentionally does not get a PRIMARY_GOVERNANCE_CHANNEL-style governance bypass. The bypass would only matter if call 7 routed through V2's outbound queue, but call 7 routes through V1's EthereumSystem instead, which already has the V1 bypass. The Gateway's shared mode storage means writing it once via V1 flips both V1 and V2 dispatch paths, so the V2 path is unused for governance commands today. This minimizes the V2 outbound queue diff and avoids inventing a new bypass mechanism (V2's Message type carries origin: H256 instead of V1's channel_id, so any V2 bypass would be net-new infrastructure). The V1-deprecation follow-up will need to add the V2 bypass at that point, since call 7 will then have to use V2; this is flagged in §"Future Directions".
The new storage defaults to Normal so the change is backwards-compatible. No other existing Snowbridge pallet interfaces change. The LocalOperatingMode trait is new and implemented in the runtime. The XCM Transact and Gateway outbound paths use existing pallets and channels. No migration required for the new pallet's State (initial value is Normal).
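The difference between the V1 and V2 send-path checks described in this section can be sketched as two predicates. This is a simplified model under stated assumptions: the function names and the boolean governance flag are hypothetical, and the real V1 check lives in send_message_impl.rs rather than in a free function.

```rust
/// Operating mode with `Normal` as the default, mirroring the
/// backwards-compatible storage default described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum OperatingMode {
    #[default]
    Normal,
    Halted,
}

/// V1 outbound queue: messages on the primary governance channel bypass
/// the halt, so the Gateway mode flip (call 7) is not blocked by call 3.
pub fn v1_accepts(mode: OperatingMode, is_primary_governance_channel: bool) -> bool {
    is_primary_governance_channel || mode == OperatingMode::Normal
}

/// V2 outbound queue as proposed here: no governance bypass, so a halted
/// queue rejects every message.
pub fn v2_accepts(mode: OperatingMode) -> bool {
    mode == OperatingMode::Normal
}
```

Under this model a halted V1 queue still accepts the governance command while rejecting user messages, and a halted V2 queue rejects everything, which is why call 7 is routed through V1.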
Prior Art and References
- polkadot-fellows/runtimes #1089, the chain-wide safe-mode and tx-pause deployment proposal.
- polkadot-fellows/runtimes #1164, the AssetHub safe-mode wiring.
- pallet-safe-mode and pallet-tx-pause in the Polkadot SDK.
- TBA Snowbridge Circuit Breakers RFC, the companion preventive layer.
Unresolved Questions
- RetryBackoff calibration. Too short hammers a congested HRMP queue; too long extends cross-chain halt latency. ~30-60 seconds (a few blocks) is a starting point.
- Cross-chain call ordering. Whether to fire the AH XCM and the Gateway outbound send in a fixed order or in parallel within the same block. They are independent in practice.
- Deposit calibration. 100k DOT matches the runtimes #1089 number, but Snowbridge halts more than a generic safe-mode would. Worth a separate Fellowship discussion on whether the deposit should be higher.
- BackstopDuration value. Default is 7 days. Open question whether a longer value (e.g. 2-3 weeks) is warranted as additional caution against premature auto-resume under a prolonged real incident, weighed against the longer worst-case downtime in the Fellowship-unreachable case. Set in runtime config; can be tuned without changing the pallet.
Future Directions and Related Material
- Per-extrinsic granular pause as a v2 of the pallet, using pallet-tx-pause's FullNameOf<T> addressing.
- Watchdog automation: off-chain monitors with funded accounts that auto-trigger on observed anomalies, with the deposit acting as their skin in the game.
- Companion RFC: the TBA Snowbridge Circuit Breakers RFC specifies the preventive layer (per-asset Gateway-side velocity caps, AH and BH secondary caps) that bounds value-at-risk during the detection-latency window this pallet does not cover.