Dense Linear Algebra Software project
- A short lead paragraph
We have used Janus to develop a New Algorithm for the Collective Communication All-to-One Operation Reduce.
We study the pipelined reduction on a uniform unidirectional network of distributed processes under the linear (Hockney) model. We present a new algorithm based on a greedy tree which we theoretically compare with three standard algorithms: binary, binomial, and pipeline. The segmentation of the messages is initially constrained to be uniform and each algorithm is studied with its own optimal block size. We recall well-known results: (1) for short size messages, the binomial algorithm is better than the binary and the pipeline ones; (2) for medium size messages, the binary algorithm is better than the binomial and the pipeline ones; (3) for long size messages, the pipeline algorithm is better than the binomial and the binary ones. Our new algorithm (greedy) outperforms all these algorithms for any message sizes (given appropriate tuning of the message segmentation). Indeed we can prove it is optimal for a given segmentation. For example, for 64 processes, neglecting computation, and with a latency ten times larger than the inverse of the bandwidth, (bandwidth in bytes per unit of time,) the greedy algorithm theoretically performs as binary for small messages, 50% better than binary for medium messages (which here is defined by 1 KB to 4 KB), and slightly better than pipeline for large messages. Secondly, we present experimental results where we compare MPI implementation of the four algorithms (greedy, binary, binomial and pipeline) based on MPI_Send and MPI_Recv protocols together with MPI_Reduce on 64 nodes for message of size 4 Bytes to 16 MB. We recover the trends explained previously. Our implementation of the greedy algorithm is in practice more efficient than MPI_Reduce except for very large messages (more than 4 MB). Finally we study theoretically nonuniform segmentations of the messages. We find that, for the greedy algorithm, for some configuration,nonuniform segmentations do perform better than any uniform segmentation. For example, for 12 processes, neglecting latency, and a message of 10 units, we find that the best uniform segmentation for greedy is (1,1,...,1,1) while the best nonuniform segmentation is (2,2,1,1,1,1,1,1). The nonuniform segmentation is 7% better than the best uniform segmentation.
- Links to award summaries for each source of funding
- NSF award "CAREER: Foundations for understanding and reaching the limits of standard numerical linear algebra," $400 000, CCF 1054864, 2011 -- 2016.
- NSF award "Improvement and Support of Community Based Dense Linear Algebra Software for Extreme Scale Computational Science," $400 000, OCI 1032861, 2010 -- 2013.
- Links to project's websites