Over the last few years, we have written a number of papers concerning the efficient implementation of collective communication operations on parallel architectures []. As part of that research, we have noticed that, given efficient implementations of scatter, gather, collect, and distributed reduction, one can build efficient implementations of broadcast, reduce-to-one, and reduce-to-all by making the following observations:

Broadcast: equivalent to a scatter followed by a collect.
Reduce-to-one: equivalent to a distributed reduction followed by a gather.
Reduce-to-all: equivalent to a distributed reduction followed by a collect.

Indeed, given optimal implementations of scatter, gather, collect, and distributed reduction, implementing the other operations as described can be shown to be, asymptotically (for long vectors of data), within a factor of two of optimal, or even optimal.
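To make these compositions concrete, the following is a minimal sketch in Python that simulates the data movement (list indices stand in for processes; the function names `scatter`, `collect`, `gather`, and `reduce_scatter` are ours for illustration, not calls into an actual message-passing library):

```python
# Simulation of the three compositions on p simulated "processes".
# collect is the allgather operation; reduce_scatter is the distributed
# reduction (element-wise sum, result scattered across processes).

def scatter(vec, p):
    """Split the root's vector into p equal pieces; piece i goes to process i."""
    n = len(vec) // p
    return [vec[i * n:(i + 1) * n] for i in range(p)]

def collect(pieces):
    """Allgather: every process ends up with the concatenation of all pieces."""
    full = [x for piece in pieces for x in piece]
    return [full[:] for _ in pieces]

def gather(pieces):
    """Concatenate all processes' pieces at the root."""
    return [x for piece in pieces for x in piece]

def reduce_scatter(vectors, p):
    """Element-wise sum across processes, result scattered piece-wise."""
    total = [sum(vals) for vals in zip(*vectors)]
    return scatter(total, p)

p = 4
root_vec = list(range(8))

# Broadcast = scatter followed by collect: every process gets root_vec.
bcast = collect(scatter(root_vec, p))
assert all(local == root_vec for local in bcast)

# Reduce-to-one = distributed reduction followed by a gather at the root.
vectors = [[i] * 8 for i in range(p)]      # process i contributes [i]*8
reduced_at_root = gather(reduce_scatter(vectors, p))
assert reduced_at_root == [6] * 8          # 0+1+2+3 = 6 in each element

# Reduce-to-all = distributed reduction followed by a collect.
allreduce = collect(reduce_scatter(vectors, p))
assert all(local == [6] * 8 for local in allreduce)
```

The point of the sketch is only the data flow: each composed operation moves each element through the network a constant number of times, which is why the compositions stay within a small constant factor of optimal for long vectors.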