Los Alamos Message Passing Interface 1.5.16 RC1 review

by rbytes.net

License: LGPL (GNU Lesser General Public License)
File size: 1291K
Developer: Advanced Computing Laboratory
0 stars award from rbytes.net

Los Alamos Message Passing Interface is an implementation of the Message Passing Interface (MPI) motivated by a growing need for fault tolerance at the software level in large high-performance computing (HPC) systems.

This need is caused by the vast number of components present in modern HPC systems, particularly clusters. The individual components -- processors, memory modules, network interface cards (NICs), etc. -- are typically manufactured to tolerances adequate for small or desktop systems.

When aggregated into a large HPC system, however, system-wide error rates may be too great to successfully complete a long application run. For example, a network device may have an error rate that is perfectly acceptable for a desktop system, but not for a cluster of thousands of nodes, which must run error-free for many hours or even days to complete a scientific calculation.
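
To make the aggregation effect concrete, here is a rough back-of-the-envelope sketch. It assumes independent devices with a constant (Poisson) error rate, and the rate itself is a made-up illustrative number, not a measurement from LA-MPI or any real hardware.

```python
import math

def failure_probability(errors_per_device_hour: float, devices: int, hours: float) -> float:
    """Probability of at least one error during a run, assuming independent
    devices that each fail at a constant rate (a Poisson model)."""
    expected_errors = errors_per_device_hour * devices * hours
    return 1.0 - math.exp(-expected_errors)

# Hypothetical rate: one error per device every 10,000 hours.
rate = 1e-4

desktop = failure_probability(rate, devices=1, hours=24)
cluster = failure_probability(rate, devices=4000, hours=24)

print(f"single device, 24 h run: {desktop:.2%}")
print(f"4000 devices, 24 h run:  {cluster:.2%}")
```

Under these assumptions the single device sees an error in well under 1% of day-long runs, while the 4000-device system is almost certain to see at least one.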

LA-MPI has two primary goals: network fault tolerance and high performance.
Network fault tolerance is achieved by implementing a highly efficient checksum/retransmission protocol. The integrity of delivered data is (optionally) verified at the user level using a checksum or CRC. Data that is corrupt (or never delivered) is retransmitted.
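
The general idea of a checksum/retransmission scheme can be sketched as follows. This is only an illustration of the technique, not LA-MPI's actual protocol or wire format: the frame layout, the function names, and the scripted test channel are all assumptions made for the example, and CRC-32 from Python's zlib module stands in for whatever checksum the library uses.

```python
import zlib
from typing import Callable, List, Optional, Tuple

def make_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload (hypothetical frame layout)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def check_frame(frame: bytes) -> Optional[bytes]:
    """Return the payload if its CRC matches, otherwise None."""
    if len(frame) < 4:
        return None
    payload, crc = frame[:-4], int.from_bytes(frame[-4:], "big")
    return payload if zlib.crc32(payload) == crc else None

def send_reliably(payload: bytes,
                  channel: Callable[[bytes], Optional[bytes]],
                  max_attempts: int = 5) -> Tuple[bytes, int]:
    """Retransmit until the received frame verifies; return (payload, attempts)."""
    frame = make_frame(payload)
    for attempt in range(1, max_attempts + 1):
        received = channel(frame)      # None models a frame that never arrived
        if received is not None:
            delivered = check_frame(received)
            if delivered is not None:
                return delivered, attempt   # a real receiver would acknowledge here
        # Corrupt or lost frame: loop around and retransmit.
    raise RuntimeError("delivery failed after retries")

def scripted_channel(faults: List[str]) -> Callable[[bytes], Optional[bytes]]:
    """Hypothetical test channel: applies the listed faults in order, then delivers cleanly."""
    pending = list(faults)
    def channel(frame: bytes) -> Optional[bytes]:
        fault = pending.pop(0) if pending else "ok"
        if fault == "drop":
            return None
        if fault == "corrupt":
            return bytes([frame[0] ^ 0x01]) + frame[1:]   # flip a single bit
        return frame
    return channel

data, attempts = send_reliably(b"halo exchange", scripted_channel(["corrupt", "drop"]))
print(data, attempts)   # b'halo exchange' 3
```

The first attempt is rejected by the CRC check, the second never arrives, and the third is delivered intact, so the payload gets through on attempt 3.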

As for high performance, LA-MPI's lightweight checksum/retransmission protocol allows us to achieve low-latency messaging. Furthermore, the flexible use of redundant data paths in a network-device-rich system yields high network bandwidth, since different messages and/or message fragments can be sent in parallel along different paths. Also, since LA-MPI is developed for use on the large systems at Los Alamos National Laboratory, we have verified that LA-MPI scales to over 3,500 processes.
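
Sending message fragments along different paths can be pictured with the small sketch below. It assumes a simple round-robin assignment and sequence-numbered fragments; how LA-MPI actually schedules fragments across devices is not described here, so the function names and the policy are illustrative only.

```python
from typing import List, Tuple

Fragment = Tuple[int, bytes]   # (sequence number, data)

def stripe(message: bytes, fragment_size: int, num_paths: int) -> List[List[Fragment]]:
    """Split a message into fragments and assign them round-robin to paths."""
    paths: List[List[Fragment]] = [[] for _ in range(num_paths)]
    for seq, offset in enumerate(range(0, len(message), fragment_size)):
        paths[seq % num_paths].append((seq, message[offset:offset + fragment_size]))
    return paths

def reassemble(arrivals: List[Fragment]) -> bytes:
    """Rebuild the message from fragments, whatever order they arrived in."""
    return b"".join(data for _, data in sorted(arrivals))

message = bytes(range(20))
paths = stripe(message, fragment_size=4, num_paths=3)

# Simulate out-of-order arrival by draining the paths in reverse order.
arrivals = [fragment for path in reversed(paths) for fragment in path]

assert reassemble(arrivals) == message
print([len(path) for path in paths])   # [2, 2, 1]
```

Because each fragment carries its sequence number, the receiver can rebuild the message regardless of which path delivers first, and that is what lets the paths be used in parallel.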

An alternative solution to the network fault tolerance problem is to use the TCP/IP protocol. We believe, however, that this protocol -- developed to handle unreliable, inhomogeneous and oversubscribed networks -- performs poorly and is overly complex for HPC system messaging, and that LA-MPI's lightweight checksum/retransmission protocol is a more appropriate choice.

Here are some key features of "Los Alamos Message Passing Interface":
Standards compliant (MPI version 1.2 integrated with ROMIO for MPI-IO)
Highly portable
Open source (LGPL)
Thread safe
Optimized for SMP systems, including NUMA architectures
Network fault tolerant (data integrity checked at user level)
Message-fragment striping across multiple network devices

What's New in This Release:
Namespace conflicts have been fixed.
Error detection and handling of fragments has been improved.
Bugs in memory barriers and spinlocks for x86 and x86_64 architectures have been fixed.
Profiling and backtracing support have been added.
Asynchronous I/O has been disabled by default as a workaround for problems with some filesystems.
Minor timeout bugs have been fixed.
