Getting OFED installed to run the IMB MPI Benchmark

Last Updated: 11/11/2008 - Phuong Nguyen
Last Updated: 9/16/2008 - Phuong Nguyen
Last Updated: 9/28/2007 - Steve Wise

---------------------------------------------------
The official OFED release is at:
http://www.openfabrics.org/
Click on the "Download OFED X.Y.Z" link.
The Chelsio supported release is OFED 1.3.1.
---------------------------------------------------

A. INSTALL OFED DRIVER

Use the latest development build of OFED; it includes all of our
latest fixes:

- Pull down the OFED distribution from the http://service.chelsio.com/ site, file OFED_1.3.1
dated 07/15/2008, or the latest release.
# tar -xzf path-to/OFED-X.Y.Z.tgz

- cd into ofed tree
# cd OFED-X.Y.Z

- Run install.pl
# ./install.pl
Choose option 2 to install OFED, then choose option 3 to build all libraries,
and accept the default options in the build script to build and install OFED.
OR
If you are familiar with OFED installation, you can choose option 2 and then
option 4 for a customized installation.

- Reboot system for changes to take effect.

- Set the Chelsio driver option required for MPI connection setup.
Issue the command below on all systems:
# echo 1 > /sys/module/iw_cxgb3/parameters/peer2peer
Or you can add the following line to /etc/modprobe.conf to set the option at
module load time:

options iw_cxgb3 peer2peer=1

- The option setting in /etc/modprobe.conf takes effect upon the next system reboot.
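- To confirm the setting took effect, read the parameter back (a quick check;
it assumes the iw_cxgb3 module is loaded, and should print 1):
# cat /sys/module/iw_cxgb3/parameters/peer2peer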

---------------------------------------------------
Installing the cxgb3toe driver is recommended. If you choose not to, skip to the "CONFIGURE T3" step.
---------------------------------------------------

B. INSTALL THE CHELSIO cxgb3toe KIT.

- Pull down the cxgb3toe distribution from service.chelsio.com

- Untar the cxgb3toe distribution in a shared directory
# tar -xzf path-to/cxgb3toe-X.Y.ZZZ.tgz

- Build/install:
# cd cxgb3toe-X.Y.ZZZ
# (cd src; make && make install)
# (cd tools/cxgbtool; make && make install)

- Add this to /etc/modprobe.conf:
options iw_cxgb3 peer2peer=1
install cxgb3 /sbin/modprobe -i cxgb3; /sbin/modprobe -f iw_cxgb3; /sbin/modprobe t3_tom; /sbin/modprobe rdma_ucm
alias eth2 cxgb3   # assumes eth2 is used by the Chelsio interface
install iw_cxgb3 /sbin/modprobe rdma_ucm; /sbin/modprobe --ignore-install --force iw_cxgb3; /sbin/modprobe t3_tom

- Reboot the system to load the new modules
- To manually load the rdma drivers:
# modprobe --force iw_cxgb3 && modprobe rdma_ucm
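- To confirm the modules are loaded (a quick check; the module names are the
ones loaded by the modprobe commands above):
# lsmod | grep -E 'cxgb3|t3_tom|rdma_ucm'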

---------------------------------------------------
CONFIGURE T3:
---------------------------------------------------

- Configure the T3 interfaces with IP addresses

- Use ping to verify the network is working

- Use rping to verify RDMA is working
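
A minimal sketch of the first two steps, assuming eth2 is the T3 interface and
the 192.168.1.x addresses used in the rping example below:

On system one:
# ifconfig eth2 192.168.1.123 netmask 255.255.255.0 up

On system two (192.168.1.124 is just an example peer address):
# ifconfig eth2 192.168.1.124 netmask 255.255.255.0 up
# ping -c 3 192.168.1.123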

For example, to run the rping check:

On system with T3 ipaddr 192.168.1.123:

# rping -s -a 192.168.1.123 -p 9999

On another system connected via the T3 network:
# rping -Vv -C10 -c -a 192.168.1.123 -p 9999

You should see output like this:

[mpi@r1-iw ~]$ rping -Vv -C10 -c -a 192.168.1.123 -p 9999
ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
ping data: rdma-ping-1: BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrs
ping data: rdma-ping-2: CDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrst
ping data: rdma-ping-3: DEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstu
ping data: rdma-ping-4: EFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuv
ping data: rdma-ping-5: FGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvw
ping data: rdma-ping-6: GHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwx
ping data: rdma-ping-7: HIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
ping data: rdma-ping-8: IJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz
ping data: rdma-ping-9: JKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA
client DISCONNECT EVENT...
[mpi@r1-iw ~]$


-------------------------------------------------
C. MVAPICH2
-------------------------------------------------

1. Installation and Setup
- MVAPICH2 should have been built as part of the OFED installation.
- add this to root's .bashrc file:

export MVAPICH2_HOME=/usr/mpi/gcc/mvapich2-1.0.3-1/
export MV2_USE_IWARP_MODE=1
export MV2_USE_RDMA_CM=1

- On each node, add 'ulimit -l 999999' to /etc/profile

- On each node, add 'ulimit -l 999999' to /etc/init.d/sshd and restart sshd
# service sshd restart

- Now, if you log in as user root, ulimit -l should report 999999 both
locally and when you ssh to another node. Make sure both of these show 999999:
# ulimit -l
# ssh <remote-host> ulimit -l
(For setting up ssh shell please see HP-MPI section)
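- A minimal sketch of the ulimit setup on one node (assumes bash and a Red Hat
style sshd init script; "r2-iw" is just an example remote node name):
# echo 'ulimit -l 999999' >> /etc/profile
(edit /etc/init.d/sshd by hand and add the same 'ulimit -l 999999' line near
the top of the script)
# service sshd restart
# ulimit -l
# ssh r2-iw ulimit -l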
- Create two files in root's home dir named .mpd.conf and .mpdpasswd. Both
need to contain one line like this:

secretword=f00bar

The secret word can be anything. 'f00bar' is just an example.
Both also need to be chmod 600:
# chmod 600 .mpd.conf
# chmod 600 .mpdpasswd
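For example, both files can be created with the same one-line content
("f00bar" again being only an example secret word):
# echo 'secretword=f00bar' > ~/.mpd.conf
# cp ~/.mpd.conf ~/.mpdpasswd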

- Create a hosts file named mpd.hosts in the mpi user's home dir containing the
hostnames or IP addresses of the nodes in the cluster, one per line. These can
be either the T3 names/addresses or any other hostname/address that you can
ssh to as user mpi without entering a password.

NOTE: Do NOT add the head node in the host file. Just the other nodes in the
cluster.

# cp .mpd.conf mpd.conf
# cp .mpd.conf /etc/.mpd.conf
# cp .mpd.conf /etc/mpd.conf
# cp mpd.hosts .mpd.hosts
# cp mpd.hosts /etc/mpd.hosts
# cp mpd.hosts /etc/.mpd.hosts

- On each node, create a file named /etc/mv2.conf with a single line
containing the IP address of the local T3 interface. This is how
MVAPICH2 picks which interface to use for RDMA traffic.
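
For example, on the node whose T3 interface was configured as 192.168.1.123 in
the rping example above, the file could be created with:
# echo 192.168.1.123 > /etc/mv2.conf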

2. Run MPI Application
- Start up mpd using mpdboot. On a two node cluster with hosts r1-iw and r2-iw:
[mpi@r1-iw ~]$ cat mpd.hosts
r2-iw
[mpi@r1-iw ~]$ mpdboot -n 2
[mpi@r1-iw ~]$ mpdtrace
r1-iw
r2-iw

- To run the IMB benchmarks on 2 nodes:
[mpi@r1-iw ~]$ mpiexec -n 2 $MVAPICH2_HOME/tests/IMB-3.0/IMB-MPI1

- To shut down mpd:
[mpi@r1-iw ~]$ mpdallexit

-------------------------------------------------
D. HP-MPI
-------------------------------------------------

1. Installation and Setup

The HP-MPI application and license can be obtained from the HP website.
HP-MPI is released as a Linux RPM package and can be installed with the command:
# rpm -ivh packagename.rpm

By default, HP-MPI is installed in the directory "/opt/hpmpi".
The application should be installed on all systems of the test cluster.
The license file granted by HP should be named "license.lic" (or
"your-preferred-name.lic") and copied to the directory "/opt/hpmpi/licenses" on all systems.

Installing the License File:
To start the license server, issue the command:
# /opt/hpmpi/bin/licensing//lmgrd -c /opt/hpmpi/licenses/license.lic

The command should return without any error for a valid license file.

Setting up the HP-MPI environment (applicable to all systems in the cluster):
- Set the MPI_ROOT environment variable to point to where HP-MPI was installed.
- Add $MPI_ROOT/bin to the PATH environment variable.
The environment setup can be done by modifying the contents of the user's
start-up file ".bashrc" or ".bash_profile".
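
A minimal sketch of those ".bashrc" additions, assuming the default install
directory /opt/hpmpi:

export MPI_ROOT=/opt/hpmpi
export PATH=$MPI_ROOT/bin:$PATH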

2. Setting the Shell for Remote Logon
By default, HP-MPI uses "ssh" on Linux for remote access to all compute nodes.
The user needs to set up authentication for the user account on all systems in
the cluster so that HP-MPI can log in or execute commands remotely without being
prompted for a password.

Quick steps to set up user authentication
Change to user home directory
# cd
Generate authentication key
# ssh-keygen -t rsa
Press Enter at each prompt to accept the default setup and an empty passphrase.
Create authorization file
# cd .ssh
# cat *.pub > authorized_keys
# chmod 600 authorized_keys
Copy directory .ssh to all systems in the cluster
# cd
# scp -r .ssh remotehostname-or-ipaddress:

3. Compiling an MPI program with GCC
# mpicc -o programname programname.c

4. Setting DAPL for MPI

Setting up the DAPL 1.2 environment on all systems in the target cluster:
To run HP-MPI over the RDMA interface, DAPL 1.2 must be set up.
Edit the file "/etc/dat.conf" and modify the line
OpenIB-cma0 u1.2 nonthreadsafe default libdaplcma.so mv_dapl.1.2 "ib0 0" ""
to
OpenIB-cma0 u1.2 nonthreadsafe default libdaplcma.so mv_dapl.1.2 "eth2 0" ""
assuming "eth2" is the Chelsio network device name on the system.
Comment out the other lines by adding # at the beginning of each line.

Setting DAPL 2.0 for MPI
Change the entry in file "/etc/dat.conf"
ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 mv_dapl.2.0 "ib0 0" ""
To
ofa-v2-cma u2.0 nonthreadsafe default libdaplofa.so.2 mv_dapl.2.0 "eth2 0" ""


5. Running MPI Application:

Run MPI test on DAPL 1.2
# mpirun -UDAPL -prot -e MPI_ICLIB_UDAPL=/usr/lib64/libdat.so -e MPI_HASIC_UDAPL=OpenIB-cma -f testfile

The -f option specifies a test file which should have the format of
-h remotehost -np numberofcpu-to-use /path-to-application-program
Example: Run the MPI application on hosts "ynode1" and "ynode2" with 2 CPU cores on each node.
-h ynode1 -np 2 /path-to-program/programname
-h ynode2 -np 2 /path-to-program/programname

Run MPI test on DAPL 2.0
(Please see section 7 for suppressing unwanted debug messages)
# mpirun -UDAPL -prot -e MPI_HASIC_UDAPL=ofa-v2-cma -f testfile


6. HP-MPI performance optimization:
Set HP-MPI parameter "-e MPI_RDMA_MSGSIZE=4096,8192,4194304"

Chelsio cards give the best performance with the NIC MTU set to 9000 bytes on all nodes.
The NIC MTU can be changed with the command "# ifconfig ethX mtu 9000".
Please also make sure the switch supports jumbo frames.
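
For example, the tuning option can be appended to the DAPL 1.2 mpirun command
from section 5, with the MTU set on the Chelsio interface of every node
(assuming "eth2" is the Chelsio interface):
# ifconfig eth2 mtu 9000
# mpirun -UDAPL -prot -e MPI_ICLIB_UDAPL=/usr/lib64/libdat.so -e MPI_HASIC_UDAPL=OpenIB-cma \
    -e MPI_RDMA_MSGSIZE=4096,8192,4194304 -f testfile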

7. UDAPL 2.0 and HP-MPI
DAPL 2.0 has its debug flag set to 1 by default. To suppress the debug messages, set and
export the environment variable "DAPL_DBG_TYPE=0" in the user logon file ".bashrc" or ".bash_profile".

-------------------------------------------------
E. Intel-MPI
-------------------------------------------------

1. Installation and Setup
Intel MPI and its license can be obtained from the Intel web site. Follow the instructions
on the Intel web site to install the MPI application and license.
Setup for Intel MPI is similar to MVAPICH2.

- Add the environment variables to the ".bashrc" file by running the scripts below,
where <installdir> is the Intel MPI installation directory
(applicable to all nodes in the target cluster)

In 64-bit mode
# <installdir>/bin64/mpivars.sh
In 32-bit mode
# <installdir>/bin32/mpivars.sh
If you are using the Intel Compiler, please also execute the compiler's environment
script for proper setup of the variable "LD_LIBRARY_PATH"
# <compiler-installdir>/bin/<compiler-vars-script>
- Set up ssh shell: (Please see section 2 of HP-MPI)
- Set up MPD password
Change to user home directory
# cd
Create a $HOME/.mpd.conf file and enter the following line into it
secretword=<your-secret-word>

Change the file's permissions to user read/write only.
# chmod 600 $HOME/.mpd.conf

Copy file .mpd.conf to all nodes
# scp .mpd.conf remote-host-name:

- Set up the mpd hosts file
Create an mpd.hosts text file that lists the compute nodes in the cluster,
one host name per line
Example
ynode1
ynode2

- Set up DAPL 1.2: (Please see section 4 of HP-MPI)

2. Run MPI Application

- Start MPD
% mpdboot -n <#nodes> -r ssh
The file $PWD/mpd.hosts will be used by default if it is present. The -n option
specifies the number of compute nodes to be used.
To check the status of the MPD daemons, issue the command
# mpdtrace -l

- Start MPI application on DAPL interface
# mpiexec -genv I_MPI_DEVICE rdma:OpenIB-cma -n <# of processes> /path-to-program-file/programname
Or
# mpiexec -genv I_MPI_DEVICE rdssm:OpenIB-cma -perhost <#of processes/node> -n <# of processes> /path-to-program-file/programname

NOTE: Use the -r option to specify "ssh" for remote access.
Use the -n option to specify the number of processes to be used in the test.
Performance is best with the NIC MTU set to 9000 bytes.

-------------------------------------------------
F. Open-MPI
-------------------------------------------------

1. Installation and Setup
The latest release of Open-MPI, packaged as a tarball, can be obtained from the web site
http://www.open-mpi.org/
Extract the tarball, change to the extracted Open-MPI source directory, and issue the command
# ./autogen.sh && ./configure --prefix=/opt/ompi/openmpi-install && make clean all install

2. Edit the selector files for use with mpi-selector
- Find and copy Open-MPI's mpivars.* files to /opt/ompi/openmpi-install/bin
- Edit mpivars.* and replace the text "/usr/mpi/gcc/openmpi-1.2.6/lib64/" with "/opt/ompi/openmpi-install/lib/".
This corrects the library path to point to the new location of the installed libraries.
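A minimal sketch of the copy and edit, assuming the prebuilt mpivars.* files
live under /usr/mpi/gcc/openmpi-1.2.6/bin:
# cp /usr/mpi/gcc/openmpi-1.2.6/bin/mpivars.* /opt/ompi/openmpi-install/bin/
# sed -i 's|/usr/mpi/gcc/openmpi-1.2.6/lib64/|/opt/ompi/openmpi-install/lib/|g' /opt/ompi/openmpi-install/bin/mpivars.*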

3. Distribute the compiled binaries to all nodes in the cluster (skip this step if installed on an NFS mount).
Example: the node's name is "r2"
# rsync -auxv --delete /opt/ompi/openmpi-install/ r2:/opt/ompi/openmpi-install/

4. Register OMPI with mpi-selector on each node
# mpi-selector --register openmpi --source-dir /opt/ompi/openmpi-install/bin
# mpi-selector --set openmpi

5. Recompile the MPI tests.
The MPI tests are located in the directory /opt/ompi/openmpi-install/tests

6. Run the MPI application

Example of the IMB-MPI1 test on hosts "r1" and "r2":
# mpirun --host r1,r2 -mca btl openib,sm /opt/ompi/openmpi-install/tests/IMB-3.0/IMB-MPI1

Example of IMB-MPI1 on Open-MPI 1.3 Beta 2 version:
# mpirun --host r1,r2 -mca btl openib,sm,self /opt/ompi/openmpi-cpc2-install/tests/IMB-3.1/IMB-MPI1 pingpong

NOTE: This assumes Open-MPI 1.3 Beta 2 was installed in the directory "/opt/ompi/openmpi-cpc2-install".


-------------------------------------------------
G. Scali MPI
-------------------------------------------------
1. Installation and Setup
Generally, Scali MPI can be installed with the command "smcinstall" from its install directory
# ./smcinstall -t -u
The application should be installed on all nodes. Please see the Scali MPI user guide for details.
To install on a node
# ./smcinstall -n <hostname>
where <hostname> is the name of the host server that holds the license for Scali MPI.

Scali MPI Environment Variables:
There are three variables that can be set and exported in the user account file ".bashrc" or ".bash_profile":
MPI_HOME          Points to the installation directory, which defaults to /opt/scali
LD_LIBRARY_PATH   Path to the dynamic link libraries, found at $MPI_HOME/lib
PATH              Must be updated to include $MPI_HOME/bin
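
A minimal sketch of the corresponding ".bashrc" additions, assuming the default
install directory /opt/scali:
export MPI_HOME=/opt/scali
export LD_LIBRARY_PATH=$MPI_HOME/lib:$LD_LIBRARY_PATH
export PATH=$MPI_HOME/bin:$PATH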

Setup "ssh" remote access shell (Please see section 2 of HP-MPI)

2. Running Scali MPI

The basic command for running Scali MPI is
# mpimon <options> <application> <application options> -- <node> <#proc> <node(n)> <#proc(n)>
<options>              Scali MPI environment variables. The environment variables
                       can also be specified in the file $MPI_HOME/etc/ScaMPI.conf
<application>          Application's name
<application options>  Application's options
--                     Separator to end the application's options
<node>                 First compute node, by host name
<#proc>                Number of processes to be run on <node>
<node(n)>              Other compute nodes, by host name
<#proc(n)>             Number of processes to be run on <node(n)>
Example:
# mpimon -channel_entry_count 32 /opt/scali/examples/bin/hello -- r1 1 r2 1

3. Running Scali MPI on the uDAPL network

Set up uDAPL (please see section 4 of HP-MPI)

To run Scali MPI over uDAPL, set the "networks" variable to
-networks smp,OpenIB-cma

Example
# mpimon -networks smp,OpenIB-cma -channel_entry_count:OpenIB-cma 32 /opt/scali/examples/bin/hello -- r1 1 r2 1