| Tag matching logic |
| |
| The MPI standard defines a set of rules, known as tag-matching, for matching |
| source send operations to destination receives. The following parameters must |
| match the following source and destination parameters: |
| * Communicator |
| * User tag - wild card may be specified by the receiver |
| * Source rank – wild car may be specified by the receiver |
| * Destination rank – wild |
| The ordering rules require that when more than one pair of send and receive |
| message envelopes may match, the pair that includes the earliest posted-send |
| and the earliest posted-receive is the pair that must be used to satisfy the |
| matching operation. However, this doesn’t imply that tags are consumed in |
| the order they are created, e.g., a later generated tag may be consumed, if |
| earlier tags can’t be used to satisfy the matching rules. |
| |
| When a message is sent from the sender to the receiver, the communication |
| library may attempt to process the operation either after or before the |
| corresponding matching receive is posted. If a matching receive is posted, |
| this is an expected message, otherwise it is called an unexpected message. |
| Implementations frequently use different matching schemes for these two |
| different matching instances. |
| |
| To keep MPI library memory footprint down, MPI implementations typically use |
| two different protocols for this purpose: |
| |
| 1. The Eager protocol- the complete message is sent when the send is |
| processed by the sender. A completion send is received in the send_cq |
| notifying that the buffer can be reused. |
| |
| 2. The Rendezvous Protocol - the sender sends the tag-matching header, |
| and perhaps a portion of data when first notifying the receiver. When the |
| corresponding buffer is posted, the responder will use the information from |
| the header to initiate an RDMA READ operation directly to the matching buffer. |
| A fin message needs to be received in order for the buffer to be reused. |
| |
| Tag matching implementation |
| |
| There are two types of matching objects used, the posted receive list and the |
| unexpected message list. The application posts receive buffers through calls |
| to the MPI receive routines in the posted receive list and posts send messages |
| using the MPI send routines. The head of the posted receive list may be |
| maintained by the hardware, with the software expected to shadow this list. |
| |
| When send is initiated and arrives at the receive side, if there is no |
| pre-posted receive for this arriving message, it is passed to the software and |
| placed in the unexpected message list. Otherwise the match is processed, |
| including rendezvous processing, if appropriate, delivering the data to the |
| specified receive buffer. This allows overlapping receive-side MPI tag |
| matching with computation. |
| |
| When a receive-message is posted, the communication library will first check |
| the software unexpected message list for a matching receive. If a match is |
| found, data is delivered to the user buffer, using a software controlled |
| protocol. The UCX implementation uses either an eager or rendezvous protocol, |
| depending on data size. If no match is found, the entire pre-posted receive |
| list is maintained by the hardware, and there is space to add one more |
| pre-posted receive to this list, this receive is passed to the hardware. |
| Software is expected to shadow this list, to help with processing MPI cancel |
| operations. In addition, because hardware and software are not expected to be |
| tightly synchronized with respect to the tag-matching operation, this shadow |
| list is used to detect the case that a pre-posted receive is passed to the |
| hardware, as the matching unexpected message is being passed from the hardware |
| to the software. |