Basic Cluster Instructions (for bigred)
---------------------------------------------------------------------

To log into the cluster, ssh into it like so:

[user@mypc]$ ssh @bigred.cacs.louisiana.edu

PROFILE SETUP
-------------

To set up your environment for Java and MPI compilation, add (or append to) the PATH variable in the .bash_profile file in your home directory:

-------------in ~/.bash_profile----------------
For mpich, add/append:
PATH=$PATH:/opt/mpich/gnu/bin

For lam-mpi, add/append:
PATH=$PATH:/opt/lam/gnu/bin

For the Intel distribution of mpi, add/append:
PATH=$PATH:/opt/mpich/intel/bin

For java (particularly RMI) to work, add:
CLASSPATH=~/ (e.g. ~/javaprogs)
Then add/append:
PATH=$PATH:/usr/java/jdk1.5.0_05/bin

For SGE (this is recommended), add:
echo -n "Setting SGE environment... "
source $SGE_ROOT/default/common/settings.sh
echo done!

Then add/append:
export PATH CLASSPATH
-----------------------------------------------

Remember to re-source the .bash_profile file after making your changes:

[user@bigred]$ source ~/.bash_profile

SGE
---

SGE v6 is the job-management system installed on bigred. You should use SGE to submit any medium/large jobs, especially parallel jobs. The major commands that users will need to be familiar with to use SGE are:

qsub  - To submit jobs.
qdel  - To delete jobs.
qstat - To monitor jobs.

Jobs are submitted to SGE via a shell script. Some examples of what the scripts should look like are available in /opt/gridengine/examples. In particular, see sge-qsub-test.sh. The script specifies which executable(s) will be run and what their environment should be.
Some environment examples that can be set in the script are:

#$ -cwd (tells SGE to put the output files in the current working directory)
#$ -j y (merges the error and output into one file)
#$ -S /usr/bin/bash (tells SGE to use the bash shell)
#$ -pe mpi 20 (tells the mpi environment how many processors to use)

The default output from SGE is in the format of .o and .e files for program output and errors, respectively. So if your script name is myscript.sh, SGE will output two files, myscript.o and myscript.e. The "-j y" option will merge these two into one file. For MPICH jobs, the -pe switch is particularly important.

Once you have created your script, submit it like so:

[user@bigred]$ qsub my-sge-script.sh
your job 1234 ("my-sge-script.sh") has been submitted
(sample output after running the command)

To delete it, you need the job-ID number (1234 in the example above), like so:

[user@bigred]$ qdel 1234
has deleted job 1234
(sample output after running the command)

To query the status of your SGE jobs use qstat, like so:

[user@bigred]$ qstat -u

Sample output:

job-ID  prior  name  user  state  submit/start at      queue       master  ja-task-ID
---------------------------------------------------------------------------------------------
7140    0                  r      11/14/2005 11:53:43  compute-0-  MASTER
7124    0                  r      11/14/2005 11:06:46  compute-0-  MASTER
7126    0                  r      11/14/2005 11:08:50  compute-0-  MASTER
7125    0                  r      11/14/2005 11:07:47  compute-0-  MASTER
7142    0                  r      11/14/2005 11:55:14  compute-0-  MASTER

There is also a host of other SGE commands, such as qalter, qrsh and qlogin. For more information, see the qsub/qstat/qdel man-pages.
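As a rough illustration of how the #$ options above fit together, here is a minimal sketch of a job script, written as a here-document so it can be pasted into a shell. The file name my-sge-script.sh matches the qsub example above, and "./test" is only a placeholder for your own MPI executable. NSLOTS is the variable SGE sets to the number of slots granted by -pe. Depending on how the "mpi" parallel environment is configured on bigred, mpirun may need further options (such as a machine file), so compare against /opt/gridengine/examples/sge-qsub-test.sh before relying on this.

```shell
# Sketch only: writes a hypothetical job script named my-sge-script.sh.
cat > my-sge-script.sh <<'EOF'
#$ -cwd
#$ -j y
#$ -S /usr/bin/bash
#$ -pe mpi 20

# NSLOTS is set by SGE to the number of slots granted by -pe.
# "./test" is a placeholder for your own MPI executable.
mpirun -np $NSLOTS ./test
EOF

# Show the generated script before submitting it with qsub.
cat my-sge-script.sh
```

Using $NSLOTS instead of repeating the number 20 keeps the mpirun line in step with the -pe request if you later change it.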
Full documentation on using SGE can be found at: http://gridengine.sunsource.net/documentation.html

MPI
---

To compile an MPI program:

[user@bigred]$ mpicc -o

e.g.:

[user@bigred]$ mpicc -o test test.c

To run an MPI program:

[user@bigred]$ mpirun -np

e.g.:

[user@bigred]$ mpirun -np 4 test
(will run a program on 4 processors)

Note that the GNU mpich environment is somewhat fragile, and a crashed program will probably fail to free up shared memory arrays and semaphores. For this reason, there is a clean IPC command located here:

/opt/mpich/gnu/clean/cleanipcs

This command must be run on every node on which a submitted program has been run. So, if SGE has allocated nodes compute-0-1 and compute-0-2 to run an 8-processor MPI job on, then following a forced job kill you would have to run:

[user@bigred]$ ssh compute-0-1 /opt/mpich/gnu/clean/cleanipcs
[user@bigred]$ ssh compute-0-2 /opt/mpich/gnu/clean/cleanipcs

You can use Ganglia to find out what nodes have been allocated for your MPI job. See the Ganglia section.

For more info, check the man-pages on mpirun (man mpirun). There are many more run-time options (such as what nodes to run your program on) that you can specify.

NOTE: Running MPICH jobs directly on the cluster is NOT recommended, particularly when the jobs are very large. You can directly run the job with one or two (-np 2) processors for testing, but for actual job-submission, use the SGE job-manager. See the SGE section.

Java
----

For Java, remember to put the programs you intend to compile in the directory specified in the CLASSPATH variable in the .bash_profile. For example, if in .bash_profile you have:

CLASSPATH=~/javaprogs

then create a directory called "javaprogs" in your home directory, and put all your .java files in it. See the Profile Setup section.

NOTE: Running Java jobs directly on the cluster is NOT recommended, particularly when the jobs are very large. For actual job-submission, use the SGE job-manager. See the SGE section.

RMI
---

Making an RMI program usually requires at a minimum:

1) An interface file
2) An implementation file
3) A client and/or server driver file

For example, if you create an interface file called "Test1.java", then you will need a file called "Test1Impl.java" with your server-side implementation code in it.

To compile java programs for RMI:

[user@bigred]$ javac
[user@bigred]$ rmic

RMI support has been included in the cluster, but to allow the RMI system to work, you will need to create a file called "policy" in your classpath directory, and put the following in it:

---------------policy file-----------------
grant {
    // allows permission for all
    permission java.security.AllPermission;
};
-------------------------------------------

Running RMI programs:

After compiling the programs, you need to start the RMI registry service to run RMI programs, like so:

[user@bigred]$ rmiregistry &
[1]
(this should be output after running the above command)

Remember to kill the RMI registry program after finishing:

[user@bigred]$ kill -9

For more info on programming with Java and RMI, see: http://java.sun.com/j2se/1.5.0/docs/guide/rmi/index.html

NOTE: Running RMI jobs directly on the cluster is NOT recommended, particularly when the jobs are very large. For actual job-submission, use the SGE job-manager. See the SGE section.

Ganglia
-------

You can check on the status of the cluster and of running jobs/processes via the Ganglia system. To do this, go to:

https://bigred.cacs.louisiana.edu

Then click on the "Cluster Status" link.

To see the status of SGE jobs, click on the "Job Queue" link on the cluster status page to see the list of currently submitted SGE jobs. Details of each job can then be viewed by selecting the relevant job ID. For each job, the nodes that have been assigned are listed. This feature is useful for performing post-MPI IPC cleanup (see the MPI section).

Note that new SGE jobs may take a short while to show up in Ganglia.
To monitor SGE jobs, it may be preferable to log into the cluster and use qstat. See the SGE section.
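
Once Ganglia (or qstat) has told you which nodes a killed MPI job ran on, the cleanup described in the MPI section can be looped over those nodes. The following is a minimal sketch only: the function name clean_nodes and the DRY_RUN variable are made up for this example and are not part of the cluster software. With DRY_RUN left at its default of 1 the ssh commands are only printed, so you can check them first. Set DRY_RUN=0 to actually run them.

```shell
# Sketch only: run the IPC cleanup on each node of a finished MPI job.
# With DRY_RUN=1 (the default here) the commands are only printed.
CLEANIPCS=/opt/mpich/gnu/clean/cleanipcs

# clean_nodes is a hypothetical helper, not a command installed on bigred.
clean_nodes() {
    for node in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "ssh $node $CLEANIPCS"
        else
            ssh "$node" "$CLEANIPCS"
        fi
    done
}

# Node names taken from the example in the MPI section.
clean_nodes compute-0-1 compute-0-2
```

Replace compute-0-1 and compute-0-2 with the nodes listed for your own job.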