Getting OFED Installed to Run the IMB MPI Benchmark

Last Updated: 05/19/2010 - Vipul Pandya
Last Updated: 05/04/2010 - Shwetha Nagaraju
Last Updated: 02/03/2010 - Shwetha Nagaraju
Last Updated: 03/09/2009 - Phuong Nguyen
Last Updated: 11/11/2008 - Phuong Nguyen
Last Updated: 09/16/2008 - Phuong Nguyen
Last Updated: 09/28/2007 - Steve Wise

---------------------------------------------------
Official OFED Release is at:
http://www.openfabrics.org/
click on the "Download OFED X.Y.Z" link
Chelsio supported release is ofed-1.5.1
---------------------------------------------------
======================
1. INSTALL OFED DRIVER
======================
-> Pull down the OFED distribution from http://service.chelsio.com/,
file OFED_1.5.1 dated 05/04/2010, or the latest release
# tar -xzf path-to/OFED-X.Y.Z.tgz

-> cd into ofed tree
# cd OFED-X.Y.Z

-> Run install.pl
# ./install.pl

-> Choose option 2 to install the OFED package.
-> Then choose option 3 to install all OFED libraries.
-> Then accept the default options in the build script to build and
install OFED.
OR
If you are familiar with OFED installation, you can choose option 2 and
then option 4 for a customized installation.

-> After installation, reboot the system for the changes to take effect.

-> Set the Chelsio driver option needed for MPI connection setup.
Issue the command below on all systems:

# echo 1 > /sys/module/iw_cxgb3/parameters/peer2peer
OR
you can add the following line to /etc/modprobe.conf to set the
option at module load time:

options iw_cxgb3 peer2peer=1

-> The option set in /etc/modprobe.conf takes effect the next time the
module is loaded (e.g. upon system reboot).
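
-> To verify that the option took effect, read it back from sysfs (the
module must be loaded):

# cat /sys/module/iw_cxgb3/parameters/peer2peer
1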

===================================
2. INSTALL CHELSIO cxgb3toe KIT
===================================
You can also install Chelsio's latest cxgb3 driver after installing the
OFED-1.5.1 package as described in section 1. However, loading the
OFED-1.5.1 rdma drivers on top of Chelsio's latest cxgb3 driver requires
specific options. The following steps describe this in detail.

-> Pull down the cxgb3toe distribution from http://service.chelsio.com/

-> Untar the cxgb3toe distribution in a shared directory:
# tar -xzf path-to/cxgb3toe-W.X.YY.ZZZ.tgz

-> Build/install:
# cd cxgb3toe-W.X.YY.ZZZ
# (cd src; make && make install)
# (cd tools/cxgbtool; make && make install)

---------------------------------------------------------
NOTE: To automatically load the Chelsio rdma drivers on
different Linux platforms, do the following
---------------------------------------------------------
-> On RHEL 4.7, 4.8, 5.2, 5.3 and 5.4, add this to
/etc/modprobe.conf:

options iw_cxgb3 peer2peer=1
install cxgb3 /sbin/modprobe -i cxgb3; /sbin/modprobe -f iw_cxgb3; /sbin/modprobe rdma_ucm
alias eth2 cxgb3 # assume eth2 is used by the Chelsio interface.

-> On SLES 10 SP2, SLES 10 SP3, add this to
/etc/modprobe.conf:

options iw_cxgb3 peer2peer=1
install cxgb3 /sbin/modprobe -i cxgb3; /sbin/modprobe --force-modversion iw_cxgb3; /sbin/modprobe rdma_ucm
alias eth2 cxgb3 # assume eth2 is used by the Chelsio interface.

-> On SLES 11, add this to
/etc/modprobe.conf:

options iw_cxgb3 peer2peer=1
install cxgb3 /sbin/modprobe -i --allow cxgb3; /sbin/modprobe --force --allow iw_cxgb3; /sbin/modprobe --allow rdma_ucm
alias eth2 cxgb3 # assume eth2 is used by the Chelsio interface.

-> Reboot the system to load the new modules
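
-> To verify that the modules loaded after reboot:

# lsmod | grep -E 'cxgb3|rdma_ucm'

cxgb3, iw_cxgb3 and rdma_ucm should all be listed.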

-------------------------------------------------------------
NOTE: To manually load the Chelsio rdma drivers on different Linux
platforms, do the following
-------------------------------------------------------------
-> On RHEL 4.7, 4.8, 5.2, 5.3 and 5.4, run:
# modprobe cxgb3 && modprobe -f iw_cxgb3 && modprobe rdma_ucm

-> On SLES 10 SP2 and SLES 10 SP3, run:
# modprobe cxgb3 && modprobe --force-modversion iw_cxgb3 &&
modprobe rdma_ucm

-> On SLES 11, run:
# modprobe --allow cxgb3 && modprobe --allow --force iw_cxgb3 &&
modprobe --allow rdma_ucm
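
Once the modules are loaded, the T3 rdma device should be visible to the
verbs layer. One way to check, using ibv_devices from the libibverbs-utils
package:

# ibv_devices

The output should list a cxgb3 device (e.g. cxgb3_0; the exact name may
vary).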

NOTE: Installation of the cxgb3toe kit driver is recommended. It may
contain the latest features and bug fixes.

============================================
Updating Firmware:
============================================

This release requires firmware version 7.10.0 and Protocol SRAM version
1.1.0. The OFED-1.5.1 package ships with firmware 7.10.0 by default, and
this firmware is loaded automatically when you load the Chelsio iWARP
modules from the OFED-1.5.1 package. Please note that the Chelsio iWARP
modules from OFED-1.5.1 also work with firmware 7.8.0; for that
combination, the user must use the cxgb3toe-W.X.YY.ZZ.tar.gz driver from
http://service.chelsio.com. However, firmware 7.10.0 is recommended with
the Chelsio iWARP modules from OFED-1.5.1 for performance benefits.

If your distro/kernel supports firmware loading, you can place the
Chelsio firmware and psram images in /lib/firmware/cxgb3, then unload and
reload the cxgb3 module to get the new images loaded.
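For example (the image file names below are illustrative; use the files
supplied by Chelsio):

# cp t3fw-7.10.0.bin t3b_psram-1.1.0.bin /lib/firmware/cxgb3/
# rmmod iw_cxgb3 cxgb3
# modprobe cxgb3

If this does not work, then you can load the firmware images manually: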

Obtain the cxgbtool tool and the update_eeprom.sh script from Chelsio.

To build cxgbtool:

# cd <cxgbtool source directory>
# make && make install

Then load the cxgb3 driver:

# modprobe cxgb3

Now note the Ethernet interface name for the T3 device. This can be
done by typing 'ifconfig -a' and noting the interface name for the
interface with a HW address that begins with "00:07:43". Then load the
new firmware and eeprom files:

# cxgbtool ethxx loadfw firmware_file_name
# update_eeprom.sh ethxx eeprom_file_name
# reboot
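
To verify which firmware the adapter is now running, query the driver
with ethtool:

# ethtool -i ethxx

The firmware-version field should report the 7.10.0 firmware.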

============================================
Testing connectivity with ping and rping:
============================================

Configure the Ethernet interfaces for your cxgb3 device. After you
modprobe iw_cxgb3 you will see one or two Ethernet interfaces for the
T3 device. Configure them with an appropriate IP address, netmask, etc.
You can use the Linux ping command to test basic connectivity via the
T3 interface.
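
For example (the interface name and addresses are illustrative):

# ifconfig eth2 192.168.1.1 netmask 255.255.255.0 up
# ping 192.168.1.2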

To test RDMA, use the rping command that is included in the librdmacm-utils
rpm:

On the server machine:

# rping -s -a server_ip_addr -p 9999

On the client machine:

# rping -c -VvC10 -a server_ip_addr -p 9999

You should see ping data like this on the client:

ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
ping data: rdma-ping-1: BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrs
ping data: rdma-ping-2: CDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrst
ping data: rdma-ping-3: DEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstu
ping data: rdma-ping-4: EFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuv
ping data: rdma-ping-5: FGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvw
ping data: rdma-ping-6: GHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwx
ping data: rdma-ping-7: HIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
ping data: rdma-ping-8: IJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz
ping data: rdma-ping-9: JKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA
client DISCONNECT EVENT...
#

============================================
3. Enabling Various MPIs
============================================

-> For Intel MPI, HP MPI, and Scali MPI: you must set
the iw_cxgb3 module option peer2peer=1 on all systems. This can be done
by writing to the /sys/module file system during boot. EG:

# echo 1 > /sys/module/iw_cxgb3/parameters/peer2peer

OR

You can add the following line to /etc/modprobe.conf to set the option
at module load time:

options iw_cxgb3 peer2peer=1

-> To run Intel MPI, HP MPI, and Scali MPI over the RDMA interface, DAPL
1.2 or 2.0 should be set up as follows:

Enable the Chelsio device by adding entries at the beginning of the
/etc/dat.conf file for the Chelsio interface. For instance, if your
Chelsio interface name is eth2, the following lines add a DAT version 1.2
device named "chelsio" and a DAT version 2.0 device named "chelsio2" for
that interface:

chelsio u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth2 0" ""
chelsio2 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""

-> Setting Shell for Remote Logon
The user needs to set up authentication for the user account on all
systems in the cluster, to allow remote logins and remote command
execution without a password.

Quick steps to set up user authentication:
- Change to user home directory
# cd
- Generate authentication key
# ssh-keygen -t rsa
- Press Enter at each prompt to accept the default setup and an empty
passphrase.
- Create authorization file
# cd .ssh
# cat *.pub > authorized_keys
# chmod 600 authorized_keys
- Copy directory .ssh to all systems in the cluster
# cd
# scp -r .ssh remotehostname-or-ipaddress:
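- Verify passwordless login; this should print the remote host's name
without prompting for a password:
# ssh remotehostname-or-ipaddress hostname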

-------------------------------------------------
3.1 MVAPICH2
-------------------------------------------------
The following env vars enable MVAPICH2 version 1.4.1. Place these
in your user env after installing and setting up MVAPICH2 MPI:

export MVAPICH2_HOME=/usr/mpi/gcc/mvapich2-1.4.1/
export MV2_USE_IWARP_MODE=1
export MV2_USE_RDMA_CM=1

On each node, add this to the end of /etc/profile:

ulimit -l 999999

On each node, add this to the end of /etc/init.d/sshd, then restart sshd:

ulimit -l 999999
% service sshd restart

Verify the ulimit changes worked. These should show '999999':

% ulimit -l
% ssh <hostname> ulimit -l

Note: You may have to restart sshd a few times to get it to work.

In root's home directory, create .mpd.conf and .mpdpasswd, each with this
one line in them:

secretword=f00bar

Note: The secret word can be anything.

Restrict the permissions on both files:

% chmod 600 .mpd.conf
% chmod 600 .mpdpasswd

Create mpd.hosts with a list of the hostnames or IP addresses in the
cluster. They should be names/addresses that you can ssh to without a
password (see "Setting Shell for Remote Logon" above).
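For example, a four-node cluster's mpd.hosts might look like this
(hostnames are illustrative):

node1
node2
node3
node4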

% cp .mpd.conf mpd.conf
% cp .mpd.conf /etc/.mpd.conf
% cp .mpd.conf /etc/mpd.conf
% cp mpd.hosts .mpd.hosts
% cp mpd.hosts /etc/mpd.hosts
% cp mpd.hosts /etc/.mpd.hosts

On each node, create /etc/mv2.conf with a single line containing the IP
address of the local T3 interface. This is how MVAPICH2 picks which
interface to use for RDMA traffic.
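
For example, if the local T3 interface is configured as 192.168.1.1 (an
illustrative address):

# echo 192.168.1.1 > /etc/mv2.conf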

On each node, edit the /etc/hosts file. If there is an entry mapping
127.0.0.1 to the local host name, comment it out. Then add an entry
mapping the node's network IP address to its local host name (the name
you used in the mpd.hosts file).
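
For example, on a node named node1 with network address 10.1.1.1 (both
illustrative), /etc/hosts would contain:

# 127.0.0.1   node1          <- commented out
127.0.0.1     localhost
10.1.1.1      node1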

To run an MVAPICH2 application:

mpirun_rsh -ssh -np 8 -hostfile mpd.hosts <application>
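
For example, to run the IMB benchmark shipped with the OFED mpitests
package (the path below is an assumption; adjust it to wherever IMB is
installed on your systems):

mpirun_rsh -ssh -np 8 -hostfile mpd.hosts $MVAPICH2_HOME/tests/IMB-3.2/IMB-MPI1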

-------------------------------------------------
3.2 HP-MPI
-------------------------------------------------

1. Installation and Setup

The HP MPI application and license can be obtained from the HP website.
HP MPI is released as a Linux "rpm" package and can be installed with:
# rpm -ivh <hpmpi package>.rpm

Note: There are 2 doc files that explain a lot of this in /opt/hpmpi/doc
once the rpm is installed.

By default, HP-MPI is installed in the directory "/opt/hpmpi".
The application should be installed on all systems of a test cluster.
The license file granted by HP should be named "license.lic" (or
"your-preferred-name.lic") and copied to the directory
"/opt/hpmpi/licenses" on all systems.

Installing the License File:
To start the license server, issue:
# /opt/hpmpi/bin/licensing/lmgrd -c /opt/hpmpi/licenses/license.lic

NOTE: For lmgrd to work, uncomment the loopback address (127.0.0.1)
in the /etc/hosts file on the head node only.

The command should return without any error for a valid license file.

Setting Up HP-MPI environment (Applicable to all systems in the cluster):

The following env vars enable HP MPI version 2.03.01.00. Place these
in your user env after installing and setting up HP MPI:

export MPI_ROOT=/opt/hpmpi
export PATH=$MPI_ROOT/bin:/opt/bin:$PATH
export MANPATH=$MANPATH:$MPI_ROOT/share/man

Log out & log back in.

To run HP MPI applications, use these mpirun options:

-prot -e DAPL_MAX_INLINE=64 -UDAPL

EG:

$ mpirun -prot -e DAPL_MAX_INLINE=64 -UDAPL -hostlist r1-iw,r2-iw ~/tests/presta-1.4.0/glob

Where r1-iw and r2-iw are hostnames mapping to the chelsio interfaces.

Also this assumes your first entry in /etc/dat.conf is for the chelsio
device.

Contact HP for obtaining their MPI with DAPL support.

The performance is best with the NIC MTU set to 9000 bytes.
-------------------------------------------------
3.3 Intel-MPI
-------------------------------------------------
1. Installation and Setup

Download the latest Intel MPI from the Intel website.

Copy your license file (e.g. COM_L___CF8J-98P6MBWL.lic) into the
l_mpi_p_x.y.z directory.

Create machines.LINUX (a list of node names) in l_mpi_p_x.y.z.

Install the software on every node:
# ./install.sh

Register and set Intel MPI with mpi-selector (do this on all nodes):
# mpi-selector --register intelmpi --source-dir /opt/intel/impi/3.1/bin/
# mpi-selector --set intelmpi

Edit .bashrc and add these lines:

export RSH=ssh
export DAPL_MAX_INLINE=64
export I_MPI_DEVICE=rdssm:chelsio
export MPIEXEC_TIMEOUT=180
export MPI_BIT_MODE=64

Log out & log back in.

Populate mpd.hosts with node names.
Note: The hosts in this file should be Chelsio interface IP addresses.

Note: I_MPI_DEVICE=rdssm:chelsio assumes you have an entry in
/etc/dat.conf named "chelsio".

Note: The MPIEXEC_TIMEOUT value might need to be increased if heavy
traffic is going across the systems.

Contact Intel for obtaining their MPI with DAPL support.

To run Intel MPI applications:

mpdboot -n <number of nodes> -r ssh --ncpus=<number of cpus per node>
mpiexec -ppn <processes per node> -n <total processes> <application>
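
For example, a 16-process IMB run across two 8-cpu nodes (counts and path
are illustrative):

mpdboot -n 2 -r ssh --ncpus=8
mpiexec -ppn 8 -n 16 ./IMB-MPI1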

The performance is best with the NIC MTU set to 9000 bytes.

-------------------------------------------------
3.4 Open-MPI
-------------------------------------------------

1. Installation and Setup
Open MPI 1.4.1 is distributed only as part of the OFED package. Select
the Open-MPI package while installing OFED-1.5.1.

OpenMPI iWARP support is only available in OpenMPI version 1.3 or greater.

Open MPI will work without any specific configuration via the openib btl.
Users wishing to tune the configurable options may want to inspect the
receive queue values. Those can be found in the "Chelsio T3" section of
mca-btl-openib-hca-params.ini.

Note: OpenMPI version 1.3 does not support the newer Chelsio cards with
device IDs 0x0035 and 0x0036. To use those cards, add their device IDs to
the "Chelsio T3" section of the mca-btl-openib-hca-params.ini file.

To run OpenMPI applications:

mpirun --host <host1>,<host2> -mca btl openib,sm,self <application>
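
For example, a simple two-host IMB run (hostnames and path are
illustrative):

mpirun --host node1,node2 -np 2 -mca btl openib,sm,self ./IMB-MPI1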


-------------------------------------------------
3.5 Scali MPI
-------------------------------------------------
-> Installation and Setup
Install the license key on the head node:
# ./lminstall -p <path to license file>

Install scampi on each node (including the head node):
# ./smcinstall -a -b -n

Create mpivars files:
# cp /opt/scali/etc/scampivars.sh /opt/scali/etc/mpivars.sh
# cp /opt/scali/etc/scampivars.csh /opt/scali/etc/mpivars.csh

Register and set scampi with mpi-selector:
# mpi-selector --register scampi --source-dir /opt/scali/etc/
# mpi-selector --set scampi

-> Scali MPI Environment Variables:
The following env vars enable Scali MPI. Place these in your user env
after installing and setting up Scali MPI for running over iWARP:

export DAPL_MAX_INLINE=64
export SCAMPI_NETWORKS=chelsio
export SCAMPI_CHANNEL_ENTRY_COUNT="chelsio:128"

Log out & log back in.

Note: SCAMPI_NETWORKS=chelsio assumes you have an entry in /etc/dat.conf
named "chelsio".

Note: SCAMPI supports only the DAPL 1.2 library, not DAPL 2.0.

Contact Scali for obtaining their MPI with DAPL support.

To run SCALI MPI applications:

mpimon -networks chelsio <application> -- <host1> <numprocs1> <host2> <numprocs2>

Note: <numprocs> is the number of processes to run on the node.
Note: <host> should be the IP of Chelsio's interface.
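
For example, to run IMB with four processes on each of two nodes
(addresses and path are illustrative):

mpimon -networks chelsio ./IMB-MPI1 -- 192.168.1.1 4 192.168.1.2 4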


============================================
Additional Notes and Issues
============================================

1) To run uDAPL over the Chelsio device, you must export this environment
variable:

export DAPL_MAX_INLINE=64

2) If you have a multi-homed host and the physical Ethernet networks
are bridged, or if you have multiple Chelsio RNICs in the system, then
you need to configure ARP to only send replies on the interface with
the target IP address:

sysctl -w net.ipv4.conf.all.arp_ignore=2
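
To make the setting persistent across reboots, add it to
/etc/sysctl.conf as well:

net.ipv4.conf.all.arp_ignore = 2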

3) If you are building OFED against a kernel.org kernel later than
2.6.20, then make sure your kernel is configured with the cxgb3 and
iw_cxgb3 modules enabled. This forces the kernel to pull in the genalloc
allocator, which is required for the OFED iw_cxgb3 module. Make sure
these config options are included in your .config file:

CONFIG_CHELSIO_T3=m
CONFIG_INFINIBAND_CXGB3=m
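
A quick way to check your configured kernel tree:

# grep -E 'CONFIG_CHELSIO_T3|CONFIG_INFINIBAND_CXGB3' .config
CONFIG_CHELSIO_T3=m
CONFIG_INFINIBAND_CXGB3=m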

4) If you run the RDMA latency test using the ib_rdma_lat program, make
sure you use the following command lines to limit the amount of inline
data to 64:

server: ib_rdma_lat -c -I 64

client: ib_rdma_lat -c -I 64 server_ip_addr