Collator Protocol

NOTE: This module has undergone changes for the elastic scaling implementation. As a result, parts of this document may be out of date and will be updated at a later time. Issue tracking the update: https://github.com/paritytech/polkadot-sdk/issues/3699

The Collator Protocol implements the network protocol by which collators and validators communicate. Collators use it to distribute collations to validators, and validators use it to accept collations from collators.

Collator-to-Validator networking is more difficult than Validator-to-Validator networking because the set of possible collators for any given para is unbounded, unlike the validator set. Validator-to-Validator networking protocols can easily be implemented as gossip because the data can be bounded, and validators can authenticate each other by their PeerIds for the purposes of instantiating and accepting connections.

Since, at least at the level of the para abstraction, the collator-set for any given para is unbounded, validators need to make sure that they are receiving connections from capable and honest collators and that their bandwidth and time are not being wasted by attackers. Communicating across this trust-boundary is the most difficult part of this subsystem.

Validation of candidates is a heavy task, and furthermore, the PoV itself is a large piece of data. Empirically, PoVs are on the order of 10MB.

TODO: note the incremental validation function Ximin proposes at https://github.com/paritytech/polkadot/issues/1348

As this network protocol serves as a bridge between collators and validators, it communicates primarily with one subsystem on behalf of each. As a collator, it receives messages from the CollationGeneration subsystem. As a validator, it communicates only with the CandidateBacking subsystem.

Protocol

Input: CollatorProtocolMessage

Output:

Functionality

This network protocol uses the Collation peer-set of the NetworkBridge.

It uses the CollatorProtocolV1Message as its WireMessage.

Since this protocol functions both for validators and collators, it is easiest to go through the protocol actions for each of them separately.

Validators and collators.

[Figure: connection graph. Collator 1 connects to Validator 1 and Validator 2; Collator 2 connects to Validator 2.]

Collators

It is assumed that collators are only collating on a single parachain. Collations are generated by the Collation Generation subsystem. We will keep up to one local collation per relay-parent, based on DistributeCollation messages. If the para is not scheduled on any core at the relay-parent, or the relay-parent isn't in the active-leaves set, we ignore the message, as it must be invalid in that case. Note that such a message indicates a logic error elsewhere in the node.

We keep track of the Para ID we are collating on as a collator. This starts as None, and is updated with each CollateOn message received. If the ParaId of a collation requested to be distributed does not match the one we expect, we ignore the message.

As with most other subsystems, we track the active leaves set by following ActiveLeavesUpdate signals.
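The collator-side bookkeeping described above can be sketched roughly as follows. This is a minimal illustration, not the subsystem's actual code: the names `CollatorState`, `Outcome` and `Collation`, the `scheduled` flag (standing in for a runtime query), the choice to reject a second collation for the same relay-parent, and the pruning of collations when a leaf is deactivated are all assumptions made for the example.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical stand-ins for the real node types.
pub type ParaId = u32;
pub type Hash = [u8; 32];

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Collation {
    pub para_id: ParaId,
    pub pov: Vec<u8>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Stored,
    /// No `CollateOn` received yet, or the para does not match.
    WrongPara,
    /// The relay-parent is not in the active-leaves set.
    InactiveRelayParent,
    /// The para is not scheduled on any core at this relay-parent.
    NotScheduled,
    /// A collation for this relay-parent is already stored (assumed policy).
    AlreadyHaveCollation,
}

#[derive(Default)]
pub struct CollatorState {
    /// Starts as `None`; updated by each `CollateOn` message.
    collating_on: Option<ParaId>,
    active_leaves: HashSet<Hash>,
    /// At most one local collation per relay-parent.
    collations: HashMap<Hash, Collation>,
}

impl CollatorState {
    pub fn collate_on(&mut self, para: ParaId) {
        self.collating_on = Some(para);
    }

    /// Mirrors an `ActiveLeavesUpdate` signal.
    pub fn update_leaves(&mut self, activated: &[Hash], deactivated: &[Hash]) {
        for leaf in deactivated {
            self.active_leaves.remove(leaf);
            // Assumption: collations for deactivated leaves are dropped.
            self.collations.remove(leaf);
        }
        self.active_leaves.extend(activated.iter().copied());
    }

    /// Mirrors a `DistributeCollation` message. `scheduled` is a placeholder
    /// for "the para is scheduled on some core at `relay_parent`".
    pub fn distribute(&mut self, relay_parent: Hash, collation: Collation, scheduled: bool) -> Outcome {
        if self.collating_on != Some(collation.para_id) {
            return Outcome::WrongPara;
        }
        if !self.active_leaves.contains(&relay_parent) {
            return Outcome::InactiveRelayParent;
        }
        if !scheduled {
            return Outcome::NotScheduled;
        }
        if self.collations.contains_key(&relay_parent) {
            return Outcome::AlreadyHaveCollation;
        }
        self.collations.insert(relay_parent, collation);
        Outcome::Stored
    }
}
```

Every non-`Stored` outcome corresponds to a message that is ignored.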

For the purposes of actually distributing a collation, we need to be connected to the validators who are interested in collations on that ParaId at this point in time. We assume that there is a discovery API for connecting to a set of validators.

As seen in the Scheduler Module of the runtime, validator groups are fixed for an entire session and their rotations across cores are predictable. Collators will want to do these things when attempting to distribute collations at a given relay-parent:

  • Determine which core the para collated-on is assigned to.
  • Determine the group on that core.
  • Issue a discovery request for the validators of the current group with NetworkBridgeMessage::ConnectToValidators.
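The first two steps might look roughly like the sketch below. `SchedulerView` and its fields are hypothetical simplifications of the scheduler state, and the round-robin formula `(core + rotations) % groups` is an assumption used to illustrate predictable rotation; the real assignment comes from the runtime. The resulting validator list is what would be passed to the discovery request.

```rust
pub type ParaId = u32;
pub type ValidatorIndex = u32;

/// Hypothetical, simplified snapshot of scheduler state at one relay-parent.
pub struct SchedulerView {
    /// `core_assignments[core]` is the para scheduled on that core, if any.
    pub core_assignments: Vec<Option<ParaId>>,
    /// Validator groups, fixed for the whole session.
    pub groups: Vec<Vec<ValidatorIndex>>,
    /// Number of group rotations since the session started.
    pub rotations: usize,
}

impl SchedulerView {
    /// Step 1: determine which core the para is assigned to.
    pub fn core_for_para(&self, para: ParaId) -> Option<usize> {
        self.core_assignments.iter().position(|p| *p == Some(para))
    }

    /// Step 2: determine the group on that core.
    /// Assumes groups rotate round-robin across cores.
    pub fn group_for_core(&self, core: usize) -> Option<usize> {
        if self.groups.is_empty() {
            return None;
        }
        Some((core + self.rotations) % self.groups.len())
    }

    /// Validators to connect to for distributing a collation on `para`.
    pub fn validators_for_para(&self, para: ParaId) -> Option<&[ValidatorIndex]> {
        let core = self.core_for_para(para)?;
        let group = self.group_for_core(core)?;
        Some(self.groups[group].as_slice())
    }
}
```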

Once connected to the relevant peers for the current group assigned to the core (transitively, the para), advertise the collation to any of them which advertise the relay-parent in their view (as provided by the Network Bridge). If any respond with a request for the full collation, provide it. However, we only send one collation at a time per relay parent; other requests need to wait. This reduces the bandwidth requirements of a collator and increases the chance of fully sending the collation to at least one validator. Once one validator has received the collation and seconded it, it will also start to share this collation with other validators in its backing group. Upon receiving a view update from any of these peers which includes a relay-parent for which we have a collation that they will find relevant, advertise the collation to them if we haven't already.
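A rough sketch of this advertise-then-serve-one-at-a-time behaviour is given below. The `Distribution` type, its method names, and the FIFO ordering of waiting requests are assumptions for illustration only; the real subsystem drives these steps from network bridge events.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

pub type PeerId = u64;
pub type Hash = [u8; 32];

#[derive(Default)]
pub struct Distribution {
    /// (relay-parent, peer) pairs we have already advertised to.
    advertised: HashSet<(Hash, PeerId)>,
    /// Peer currently being sent the collation, per relay-parent.
    sending: HashMap<Hash, PeerId>,
    /// Requests waiting for the in-flight send to finish (assumed FIFO).
    waiting: HashMap<Hash, VecDeque<PeerId>>,
}

impl Distribution {
    /// Handle a view update from a connected validator. Returns `true` if we
    /// should now advertise our collation at `relay_parent` to `peer`.
    pub fn on_view_update(&mut self, peer: PeerId, view: &HashSet<Hash>, relay_parent: Hash) -> bool {
        // `insert` returns false if we already advertised to this peer.
        view.contains(&relay_parent) && self.advertised.insert((relay_parent, peer))
    }

    /// Handle a request for the full collation. Returns `true` if it should
    /// be sent now, `false` if the request was queued behind another send.
    pub fn on_request(&mut self, relay_parent: Hash, peer: PeerId) -> bool {
        if self.sending.contains_key(&relay_parent) {
            self.waiting.entry(relay_parent).or_default().push_back(peer);
            false
        } else {
            self.sending.insert(relay_parent, peer);
            true
        }
    }

    /// The in-flight send finished; returns the next peer to serve, if any.
    pub fn on_send_finished(&mut self, relay_parent: Hash) -> Option<PeerId> {
        self.sending.remove(&relay_parent);
        let next = self.waiting.get_mut(&relay_parent)?.pop_front()?;
        self.sending.insert(relay_parent, next);
        Some(next)
    }
}
```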

Validators

On the validator side of the protocol, validators need to accept incoming connections from collators. They should keep some peer slots open for accepting new speculative connections from collators and should disconnect from collators who are not relevant.

[Figure: Declaring, advertising, and providing collations. Collator to Validator: Declare and advertise. Validator to Collator: Request. Collator to Validator: Provide. Validator: Note Good/Bad.]

When peers connect to us, they can Declare that they represent a collator with a given public key and intend to collate on a specific para ID. Once they have declared that, and we have checked their signature, they can begin to send advertisements of collations. The peers should not send us any advertisements for collations that are on a relay-parent outside of our view or for a para outside of the one they've declared.

The protocol tracks advertisements received and the source of the advertisement. The advertisement source is the PeerId of the peer who sent the message. We accept one advertisement per collator per source per relay-parent.
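The declaration and advertisement rules above can be illustrated with the following sketch. The `Advertisements` type and `AdvertisementError` variants are hypothetical, signature verification is assumed to have been done by the caller, and the advertisement is modelled as carrying only a relay-parent, with the para implied by the earlier declaration.

```rust
use std::collections::{HashMap, HashSet};

pub type PeerId = u64;
pub type CollatorId = [u8; 32];
pub type Hash = [u8; 32];
pub type ParaId = u32;

#[derive(Debug, PartialEq, Eq)]
pub enum AdvertisementError {
    /// The peer has not declared itself as a collator.
    Undeclared,
    /// The relay-parent is outside of our view.
    OutOfView,
    /// Same collator, source and relay-parent was already seen.
    Duplicate,
}

#[derive(Default)]
pub struct Advertisements {
    declared: HashMap<PeerId, (CollatorId, ParaId)>,
    our_view: HashSet<Hash>,
    /// One advertisement per collator per source per relay-parent.
    seen: HashSet<(CollatorId, PeerId, Hash)>,
}

impl Advertisements {
    pub fn set_view(&mut self, view: Vec<Hash>) {
        self.our_view = view.into_iter().collect();
    }

    /// Assumes the caller has already verified the declaration signature.
    pub fn on_declare(&mut self, peer: PeerId, collator: CollatorId, para: ParaId) {
        self.declared.insert(peer, (collator, para));
    }

    /// Returns the declared para of the advertised collation if accepted.
    pub fn on_advertise(&mut self, peer: PeerId, relay_parent: Hash) -> Result<ParaId, AdvertisementError> {
        let (collator, para) = *self
            .declared
            .get(&peer)
            .ok_or(AdvertisementError::Undeclared)?;
        if !self.our_view.contains(&relay_parent) {
            return Err(AdvertisementError::OutOfView);
        }
        if !self.seen.insert((collator, peer, relay_parent)) {
            return Err(AdvertisementError::Duplicate);
        }
        Ok(para)
    }
}
```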

As a validator, we will handle requests from other subsystems to fetch a collation on a specific ParaId and relay-parent. Collations are fetched over the request-response protocol using a CollationFetchingRequest. Before fetching, we first need to check whether we have already gathered a collation on that ParaId and relay-parent. If not, we need to select one of the advertisements and issue a request for it. If we've already issued a request, we shouldn't issue another one until the first has returned.

When acting on an advertisement, we issue a Requests::CollationFetchingV1. However, we only request one collation at a time per relay parent. This reduces the bandwidth requirements and, since we can second only one candidate per relay parent, the others are probably not required anyway. If the request times out, we need to note the collator as being unreliable and reduce its priority relative to other collators.
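One way to picture the single in-flight request and the timeout penalty is the sketch below. The `Fetcher` type is hypothetical, and the specific priority rule (fewest recorded timeouts first, ties broken by arrival order) is an assumption; the text above only requires that unreliable collators lose priority.

```rust
use std::collections::HashMap;

pub type PeerId = u64;
pub type Hash = [u8; 32];

#[derive(Default)]
pub struct Fetcher {
    /// Advertisers per relay-parent that we have not yet asked, in arrival order.
    pending: HashMap<Hash, Vec<PeerId>>,
    /// The single in-flight request per relay-parent.
    in_flight: HashMap<Hash, PeerId>,
    /// Timeouts observed per collator peer; more timeouts means lower priority.
    timeouts: HashMap<PeerId, u32>,
}

impl Fetcher {
    /// Record an advertisement. Returns the peer to request from, if a
    /// request should be issued now.
    pub fn on_advertisement(&mut self, relay_parent: Hash, peer: PeerId) -> Option<PeerId> {
        self.pending.entry(relay_parent).or_default().push(peer);
        self.try_request(relay_parent)
    }

    /// The in-flight request timed out: penalize the collator and try the
    /// next advertiser, if any.
    pub fn on_timeout(&mut self, relay_parent: Hash) -> Option<PeerId> {
        if let Some(peer) = self.in_flight.remove(&relay_parent) {
            *self.timeouts.entry(peer).or_insert(0) += 1;
        }
        self.try_request(relay_parent)
    }

    fn try_request(&mut self, relay_parent: Hash) -> Option<PeerId> {
        // Only one request at a time per relay-parent.
        if self.in_flight.contains_key(&relay_parent) {
            return None;
        }
        let queue = self.pending.get_mut(&relay_parent)?;
        let timeouts = &self.timeouts;
        // Fewest timeouts wins; `min_by_key` keeps the earliest on ties.
        let idx = (0..queue.len())
            .min_by_key(|&i| timeouts.get(&queue[i]).copied().unwrap_or(0))?;
        let peer = queue.remove(idx);
        self.in_flight.insert(relay_parent, peer);
        Some(peer)
    }
}
```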

As a validator, once the collation has been fetched, some other subsystem will inspect and do deeper validation of the collation. If the collation turns out to be bad, that subsystem will report to this subsystem with a CollatorProtocolMessage::ReportCollator. In that case, if we are connected directly to the collator, we apply a cost to the PeerId associated with the collator and potentially disconnect or blacklist it. If the collation is seconded, we notify the collator and apply a benefit to the PeerId associated with the collator.
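The cost and benefit accounting could be sketched as below. All numeric values, the `Reputation` type, and the single disconnect threshold are hypothetical placeholders chosen for the example; the actual reputation changes and disconnection policy are defined by the node's networking layer.

```rust
use std::collections::HashMap;

pub type PeerId = u64;

// Hypothetical values, for illustration only.
const COST_REPORTED: i32 = -100;
const BENEFIT_SECONDED: i32 = 10;
const DISCONNECT_BELOW: i32 = -150;

#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    Keep,
    Disconnect,
}

#[derive(Default)]
pub struct Reputation {
    scores: HashMap<PeerId, i32>,
}

impl Reputation {
    fn apply(&mut self, peer: PeerId, delta: i32) -> Action {
        let score = self.scores.entry(peer).or_insert(0);
        *score = (*score).saturating_add(delta);
        if *score < DISCONNECT_BELOW {
            Action::Disconnect
        } else {
            Action::Keep
        }
    }

    /// Mirrors `CollatorProtocolMessage::ReportCollator`.
    pub fn on_report_collator(&mut self, peer: PeerId) -> Action {
        self.apply(peer, COST_REPORTED)
    }

    /// Mirrors a successful seconding of the peer's collation.
    pub fn on_seconded(&mut self, peer: PeerId) -> Action {
        self.apply(peer, BENEFIT_SECONDED)
    }
}
```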

Interaction with Candidate Backing

As collators advertise the availability of their collations, a validator will simply second the first valid parablock candidate per relay head by sending a CandidateBackingMessage::Second. Note that this message contains the relay parent of the advertised collation, the candidate receipt and the PoV.

Subsequently, once a valid parablock candidate has been seconded, the CandidateBacking subsystem will send a CollatorProtocolMessage::Seconded, which will trigger this subsystem to notify the collator at the PeerId that first advertised the parablock on the seconded relay head of their successful seconding.

Future Work

Several approaches have been discussed, but all have some issues:

  • The current approach is very straightforward. However, that protocol is vulnerable to a single collator which, as an attack or simply through chance, gets its block candidate to the node more often than its fair share of the time.
  • If collators produce blocks via Aura, BABE or in future Sassafras, it may be possible to choose an "Official" collator for the round, but it may be tricky to ensure that the PVF logic is enforced at collator leader election.
  • We could use relay-chain BABE randomness to generate some delay D on the order of 1 second, ± 1 second. The validator would then second the first valid parablock which arrives after D, or in case none has arrived by 2*D, the last valid parablock which has arrived. This makes it very hard for a collator to game the system to always get its block nominated, but it reduces the maximum throughput of the system by introducing delay into an already tight schedule.
  • A variation of that scheme would be to have a fixed acceptance window D for parablock candidates and keep track of count C: the number of parablock candidates received. At the end of the period D, we choose a random number I in the range [0, C) and second the block at index I. Its drawback is the same: the validator must wait the full D period before seconding any of its received candidates, reducing throughput.
  • In order to protect against DoS attacks, it may be prudent to throw out collations from collators that have behaved poorly (whether recently or historically) and subsequently only verify the PoV for the most suitable of collations.
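As an illustration of the fixed-window variation listed above, the selection step could be sketched as follows. The `Window` type is hypothetical, the `randomness` value is assumed to be supplied by the caller (for example, derived from relay-chain randomness), and reducing it modulo C is a simplification that introduces a slight bias when C does not divide the randomness range evenly.

```rust
/// Candidates received during one acceptance window `D` (hypothetical type).
pub struct Window<T> {
    candidates: Vec<T>,
}

impl<T> Window<T> {
    pub fn new() -> Self {
        Window { candidates: Vec::new() }
    }

    /// Record a candidate that arrived within the window.
    pub fn push(&mut self, candidate: T) {
        self.candidates.push(candidate);
    }

    /// Close the window and pick the candidate at index I = randomness mod C,
    /// where C is the number of candidates received. Returns `None` if no
    /// candidate arrived.
    pub fn close(self, randomness: u64) -> Option<T> {
        let count = self.candidates.len();
        if count == 0 {
            return None;
        }
        let index = (randomness % count as u64) as usize;
        self.candidates.into_iter().nth(index)
    }
}
```

Nothing is selected until `close` is called, which is the throughput cost the text describes.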