This page gives a sample MPI implementation of an ADMM-based solver for the Lasso problem, as described in sections 6.4 and 9.2 of our paper. This implementation is intended to be pedagogical, so it is heavily commented, is not performance optimized, and attempts to mirror the Matlab version of the solver.
The source code is available to view as syntax-highlighted HTML or for download.
The linear algebra operations in the solver use the GNU Scientific Library (GSL). This package is easy to install, well documented, and has a number of convenient data structures and functions for manipulating vectors and matrices, including a high-level BLAS interface. Users who are interested in better performance can link GSL to an optimized CBLAS library, such as one produced by ATLAS or provided by their hardware vendor (e.g., Accelerate.framework, ACML, or Intel MKL).
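For readers who have not used GSL before, the following self-contained snippet (not part of the package; the file name example.c is just a placeholder) illustrates the style of the high-level BLAS interface the solver relies on: it forms the matrix-vector product y = A*x with gsl_blas_dgemv.

    #include <stdio.h>
    #include <gsl/gsl_matrix.h>
    #include <gsl/gsl_vector.h>
    #include <gsl/gsl_blas.h>

    int main(void) {
        /* allocate a 2x3 matrix (zero-initialized) and two vectors */
        gsl_matrix *A = gsl_matrix_calloc(2, 3);
        gsl_vector *x = gsl_vector_alloc(3);
        gsl_vector *y = gsl_vector_alloc(2);

        gsl_matrix_set(A, 0, 0, 1.0);
        gsl_matrix_set(A, 1, 2, 2.0);
        gsl_vector_set_all(x, 1.0);

        /* y = 1.0*A*x + 0.0*y via GSL's high-level BLAS interface */
        gsl_blas_dgemv(CblasNoTrans, 1.0, A, x, 0.0, y);
        printf("y = (%g, %g)\n", gsl_vector_get(y, 0), gsl_vector_get(y, 1));

        gsl_matrix_free(A);
        gsl_vector_free(x);
        gsl_vector_free(y);
        return 0;
    }

A program like this can be built with something along the lines of gcc example.c -lgsl -lgslcblas -lm; swapping -lgslcblas for an optimized CBLAS is how the performance gain mentioned above is obtained.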
Download the tarball above. The package contains a Makefile, the solver, and a standard library for reading in matrix data.
Install GSL and an MPI implementation like OpenMPI or MPICH. This is straightforward using any standard package manager on Linux. OpenMPI is bundled with Mac OS X 10.5 and later, so no additional installation is needed there. Ensure that the mpicc command is on the path after installing, since some Linux distributions have mispackaged MPI libraries. The mpicc command is a wrapper for the C compiler that includes the necessary MPI libraries; a small test program for checking that it works is given after these steps.
Edit the Makefile to ensure that the GSLROOT variable points to the location where you installed GSL, and that the ARCH variable is set appropriately (most likely to i386 or x86_64). On some machines, it may be necessary to remove the use of the ARCH flag entirely.
Run make. This produces a binary called lasso.
This code has been tested on Mac OS X 10.6, Debian 6, and Ubuntu 10.04.
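Before building or running the solver on a new machine, it can be useful to confirm that mpicc and the MPI runtime work. The short test program below is not part of the package, and the file name mpi_test.c is just a placeholder; it should compile cleanly and print one line per process when launched.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);                /* start the MPI runtime */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */
        printf("process %d of %d is alive\n", rank, size);
        MPI_Finalize();
        return 0;
    }

Compile it with

$ mpicc mpi_test.c -o mpi_test

and launch it with mpirun as described in the next section.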
Here, we provide a brief tutorial on running MPI programs.
MPI programs can be run in single-machine mode, which is convenient for testing. In this case, each worker is just a separate process on the machine, and the processes communicate directly, with no server required. Simply run the following command:
$ mpirun -np 4 lasso
This will spawn 4 processes, all of which will run lasso (i.e., MPI follows the SPMD model). For ADMM, the first one will act as the master, and the remaining three will act as the slaves.
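Schematically, an SPMD program assigns roles by rank. The snippet below is a generic illustration of this master/slave pattern and is not taken from lasso.c; the value lambda = 0.5 is just a placeholder.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, size;
        double lambda = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* 0 for the master, 1..size-1 for slaves */
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            lambda = 0.5;  /* master picks the global parameters */
        }

        /* collective call: every process ends up with the master's lambda */
        MPI_Bcast(&lambda, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("master: broadcast lambda = %.4f to %d slaves\n", lambda, size - 1);
        else
            printf("slave %d: received lambda = %.4f\n", rank, lambda);

        MPI_Finalize();
        return 0;
    }

The broadcast is only there to show a collective operation; the point is that all processes run the same binary and differentiate their behavior solely through their rank.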
Running on multiple machines requires that all the relevant machines are set up to communicate with each other appropriately. A discussion of how to do so is outside the scope of this tutorial. We assume you either have access to an existing MPI cluster or will need to use a cloud platform like Amazon (see below).
First, create a hostfile. This file tells MPI which machines can be used to run processes and how many processes can run on each node. The number of slots should generally not be set higher than the number of cores on each machine. A sample hostfile looks like this:
foo.example.com slots=1
bar.example.com slots=1
baz.example.com slots=1
bat.example.com slots=1
This means MPI will run a single process on each machine. (This is not ideal computationally, since each machine likely has multiple CPUs, but it keeps the bookkeeping simple, since we don't need to think about which data each process on a given machine should use.)
To run a program using this hostfile, type the following:
$ mpirun --hostfile sample_hostfile -np 4 lasso
If you do not have access to an existing MPI cluster, it is possible to rent one from Amazon Web Services (AWS) for relatively low rates. Essentially, one can rent (virtual) machines from Amazon's Elastic Compute Cloud (EC2) for an hourly rate and use their cloud storage services to store datasets and experimental results. The primary downside is that it is necessary to set up the machines more or less from scratch. Fully explaining how AWS works is outside the scope of this document, but to get more familiar, we suggest the following steps.
Go through the EC2 Getting Started Guide. This will take you through setting up an EC2 account and launching and connecting to a test machine instance.
Install StarCluster, a suite of Python scripts that make it fairly straightforward to bring up a cluster of machines already set up with MPI and other scientific computing software.
GSL is not installed on the default StarCluster machine images, so customize the default image by installing GSL.
You can now launch a cluster using your custom image by following the StarCluster documentation, and make data available to the cluster as needed by following the appropriate EC2 documentation. It may suffice to just copy the local datasets onto each instance's local instance store, then copy any generated results that should be preserved back into an S3 bucket or attached EBS volume.
The package linked above includes a small dataset that can be used to verify that the code is working correctly. The dataset is split into four shards, A1.dat and b1.dat through A4.dat and b4.dat, located in a subdirectory called data in the source tree. Solving the full problem requires 5 processes (one master and four slaves), and the expected output is below:
$ mpirun -np 5 lasso
[0] reading A1.dat
[1] reading A1.dat
[2] reading A2.dat
[3] reading A3.dat
[4] reading A4.dat
[1] reading b1.dat
[3] reading b3.dat
[0] reading b1.dat
using lambda: 0.5000
[4] reading b4.dat
[2] reading b2.dat
  #    r norm   eps_pri    s norm  eps_dual  objective
  0    0.0000    0.0430    0.1692    0.0045    12.0262
  1    3.8267    0.0340    0.9591    0.0427    11.8101
  2    2.6698    0.0349    1.5638    0.0687    12.1617
  3    1.5666    0.0476    1.6647    0.0831    13.2944
  4    0.8126    0.0614    1.4461    0.0886    14.8081
  5    0.6825    0.0721    1.1210    0.0886    16.1636
  6    0.7332    0.0793    0.8389    0.0862    17.0764
  7    0.6889    0.0838    0.6616    0.0831    17.5325
  8    0.5750    0.0867    0.5551    0.0802    17.6658
  9    0.4539    0.0885    0.4675    0.0778    17.6560
 10    0.3842    0.0897    0.3936    0.0759    17.5914
 11    0.3121    0.0905    0.3389    0.0744    17.5154
 12    0.2606    0.0912    0.2913    0.0733    17.4330
 13    0.2245    0.0917    0.2558    0.0725    17.3519
 14    0.1847    0.0923    0.2276    0.0720    17.2874
 15    0.1622    0.0928    0.2076    0.0716    17.2312
 16    0.1335    0.0934    0.1858    0.0713    17.1980
 17    0.1214    0.0939    0.1689    0.0712    17.1803
 18    0.1045    0.0944    0.1548    0.0710    17.1723
 19    0.0931    0.0950    0.1344    0.0708    17.1768
 20    0.0919    0.0954    0.1243    0.0707    17.1824
 21    0.0723    0.0958    0.1152    0.0705    17.1867
 22    0.0638    0.0962    0.1079    0.0704    17.1896
 23    0.0570    0.0965    0.1019    0.0702    17.1900
 24    0.0507    0.0968    0.0964    0.0701    17.1898
 25    0.0460    0.0971    0.0917    0.0700    17.1885
 26    0.0416    0.0973    0.0874    0.0699    17.1866
 27    0.0382    0.0976    0.0834    0.0698    17.1846
 28    0.0354    0.0978    0.0798    0.0697    17.1827
 29    0.0329    0.0980    0.0762    0.0697    17.1815
 30    0.0311    0.0983    0.0701    0.0696    17.1858
 31    0.0355    0.0985    0.0667    0.0696    17.1890
The problem should take around a second to solve on an average laptop. The objective column is not the objective value of the full problem, since computing that would require an additional round of message passing (evaluating the objective requires all the data). Instead, the column reports the objective evaluated using only the first data shard, A1.dat and b1.dat. This allows comparison with results from other solvers without extra network communication. Modifying the source code above to compute the exact objective is straightforward; one possible approach is sketched below. Similarly, it is straightforward to modify the code to look somewhere other than the data subdirectory for the data files.
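As a sketch of the first modification, each process could evaluate the quadratic loss on its own shard and the master could sum the pieces with a reduction before adding the l1 term. The helper below is not taken from the solver; the argument names (A, b, z, lambda, rank) are assumptions about how the local shard and the current iterate are held, and it assumes each participating process holds a distinct shard.

    #include <gsl/gsl_matrix.h>
    #include <gsl/gsl_vector.h>
    #include <gsl/gsl_blas.h>
    #include <mpi.h>

    /* Hypothetical helper: each process passes its local shard (A, b) and the
       current iterate z; rank 0 returns the full lasso objective
       (1/2)*sum_i ||A_i z - b_i||_2^2 + lambda*||z||_1, other ranks return 0. */
    double full_objective(const gsl_matrix *A, const gsl_vector *b,
                          const gsl_vector *z, double lambda, int rank) {
        gsl_vector *Az = gsl_vector_alloc(A->size1);
        double nrm, local, total = 0.0;

        /* local residual A_i z - b_i and half its squared norm */
        gsl_blas_dgemv(CblasNoTrans, 1.0, A, z, 0.0, Az);
        gsl_vector_sub(Az, b);
        nrm = gsl_blas_dnrm2(Az);
        local = 0.5 * nrm * nrm;
        gsl_vector_free(Az);

        /* sum the per-shard losses on the master */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            total += lambda * gsl_blas_dasum(z);  /* add the l1 penalty once */
        return (rank == 0) ? total : 0.0;
    }

The master would then print this total in place of the per-shard objective, while the other ranks simply participate in the reduction.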