Alan Kaminsky Department of Computer Science Rochester Institute of Technology 4531 + 2408 = 6939
Home Page

Parallel Java on the RIT CS Parallel Computers

Prof. Alan Kaminsky
Rochester Institute of Technology -- Department of Computer Science

Introduction
Setting Up Your Account to Access the Parallel Computers
Don't Run a Shell From Your .login or .cshrc File
You Must Use JDK 1.5
Developing Programs for the SMP Parallel Computers
Paragon Job Queue
Developing Programs for the Cluster Parallel Computer
Paranoia Job Queue
Developing Programs for the Hybrid SMP Cluster Parallel Computer
Tardis Job Queue
Running MPI Jobs
Running Non-PJ Jobs


Introduction

Parallel Java (PJ) is an API and middleware for parallel programming in 100% Java on shared memory multiprocessor (SMP) parallel computers, cluster parallel computers, and hybrid SMP cluster parallel computers. PJ was developed by Professor Alan Kaminsky and his student Luke McOmber in the Department of Computer Science at the Rochester Institute of Technology. For further information about PJ, see the "Parallel Java Library."

PJ is installed on each of the RIT Computer Science Department's parallel computers.

  • SMP parallel computers
  • Cluster parallel computer
    • Frontend computer -- paranoia.cs.rit.edu -- UltraSPARC-II CPU, 296 MHz clock, 192 MB main memory
    • 32 backend computers -- thug01 through thug32 -- each an UltraSPARC-IIe CPU, 650 MHz clock, 1 GB main memory
    • 100-Mbps switched Ethernet backend interconnection network
    • Aggregate 21 GHz clock, 32 GB main memory
       
  • Hybrid SMP cluster parallel computer
    • Frontend computer -- tardis.cs.rit.edu -- UltraSPARC-IIe CPU, 650 MHz clock, 512 MB main memory
    • 10 backend computers -- dr00 through dr09 -- each with two AMD Opteron 2218 dual-core CPUs, four processors, 2.6 GHz clock, 8 GB main memory
    • 1-Gbps switched Ethernet backend interconnection network
    • Aggregate 104 GHz clock, 80 GB main memory
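
The aggregate clock figures above appear to be the number of backend machines times the processors per machine times the clock rate. A minimal sketch of that arithmetic follows; the class and method names are illustrative only and are not part of PJ.

```java
// Illustrative arithmetic only; class and method names are hypothetical.
class AggregateClock {
    /** Returns the summed clock rate in MHz across all backend processors. */
    static long aggregateMHz(int nodes, int processorsPerNode, int clockMHz) {
        return (long) nodes * processorsPerNode * clockMHz;
    }

    public static void main(String[] args) {
        // Cluster: 32 thug machines, one 650 MHz CPU each.
        long cluster = aggregateMHz(32, 1, 650);
        // Hybrid: 10 dr machines, four 2.6 GHz processors each.
        long hybrid = aggregateMHz(10, 4, 2600);
        System.out.println("cluster: " + cluster + " MHz"); // 20800 MHz, about 21 GHz
        System.out.println("hybrid:  " + hybrid + " MHz");  // 104000 MHz = 104 GHz
    }
}
```

The memory totals follow the same pattern: 32 machines at 1 GB each give 32 GB, and 10 machines at 8 GB each give 80 GB.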


Setting Up Your Account to Access the Parallel Computers

Running Parallel Java programs on the parallel computers requires you to set up SSH public key authentication in your account. Log into your CS Department account and type the commands below.

  1. mkdir .ssh

    This command creates the directory where SSH configuration files are stored. Type this command only if you do not already have a .ssh directory in your home directory.

  2. cd .ssh

    This command changes the current directory to the directory where SSH configuration files are stored.

  3. cp /home/fac/ark/public_html/known_hosts .

    This command copies a file named known_hosts into your .ssh directory. Each line of this file contains the RSA public key for one of the parallel computers (hosts). SSH uses this public key to authenticate the host it is logging into.

  4. ssh-keygen -t rsa

    This command creates an RSA public/private key pair for your account. When it asks you for a file in which to save the key, hit return. When it asks you for a passphrase, hit return. When it asks you for the passphrase again, hit return. You now have two files in your .ssh directory: id_rsa.pub contains your public key, id_rsa contains your private key.

  5. cp id_rsa.pub authorized_keys

    This command puts a copy of your public key into the file authorized_keys. This tells SSH to use your public key for authentication when logging into your account.

  6. chmod 600 *

    This command makes all files in your .ssh directory readable and writable only by you. This command is critical! If you do not do this command, other users will be able to log into your account without needing to know your password!

  7. chmod 700 .

    This command makes the .ssh directory itself accessible only by you. This command is critical! If you do not do this command, other users will be able to log into your account without needing to know your password!

You still have to type your password when you first log into your account. But once you are logged in, you -- and the Parallel Java middleware -- will be able to SSH into your own account on any of the CS Department parallel computers without having to type your password. Instead, SSH will authenticate you using the information in the files in your .ssh directory. A final reminder: If these files are accessible to anyone other than yourself, other users will be able to log into your account without needing to know your password!
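
To see why steps 6 and 7 matter, it helps to read the octal modes as permission bits. The sketch below is an illustration only (the class and method names are hypothetical, and it is not part of PJ or SSH). It converts a three-digit mode into the familiar rwx notation, with one digit each for the owner, the group, and all other users.

```java
// Illustrative helper; not part of PJ or SSH. Names are hypothetical.
class ModeString {
    /** Converts a three-digit octal mode such as "600" to "rw-------". */
    static String toSymbolic(String octal) {
        if (octal.length() != 3) {
            throw new IllegalArgumentException("expected three octal digits: " + octal);
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 3; i++) {          // owner, group, others
            int bits = Character.digit(octal.charAt(i), 8);
            if (bits < 0) {
                throw new IllegalArgumentException("not an octal digit: " + octal.charAt(i));
            }
            sb.append((bits & 4) != 0 ? 'r' : '-');   // read
            sb.append((bits & 2) != 0 ? 'w' : '-');   // write
            sb.append((bits & 1) != 0 ? 'x' : '-');   // execute (search, for a directory)
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("600 -> " + toSymbolic("600")); // files in .ssh
        System.out.println("700 -> " + toSymbolic("700")); // the .ssh directory itself
    }
}
```

Mode 600 comes out as rw------- and mode 700 as rwx------, so in both cases the group and other users get no access at all.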


Don't Run a Shell From Your .login or .cshrc File

Some folks don't like the default shell they get when they log in, so they put a command in their .login or .cshrc file to run some other shell.

If you do this, Parallel Java jobs will not work. Take this command out of the .login or .cshrc file immediately!

The reason Parallel Java jobs will not work is that when the Parallel Java Job Scheduler starts a job, the Job Scheduler actually logs into your account using ssh (with public key authentication as described above) and tells the machine to execute your program. However, logging into your account causes the .login and .cshrc files to be executed before executing your program. If the .login or .cshrc file invokes a shell, the shell sits there waiting for input (which will never arrive because the standard input is not connected to anything), the .login or .cshrc file never finishes, and your program never gets executed. Eventually the job times out and prints an error message like this:

Job backend process failed, processor thug05, rank 0


You Must Use JDK 1.5

Parallel Java was developed using Java Development Kit (JDK) 1.5. When compiling and running Parallel Java programs, you must use JDK 1.5. Parallel Java uses features of the Java language and platform introduced in JDK 1.5 and will not compile with earlier JDK versions.

Note: Parallel Java will work with JDK 1.6 and 1.7. However, my tests have revealed serious performance issues when a multithreaded Parallel Java program is run on an SMP parallel computer with JDK 1.6 or 1.7. Due to some as-yet-unfathomed behavior of the JIT compiler and/or the thread scheduler, SMP parallel programs that experienced near-ideal speedups with JDK 1.5 experience far-less-than-ideal speedups with JDK 1.6 and 1.7 on the same machine.

JDK 1.5 is not the default on the CS Department computers. You must use the command below to compile a Parallel Java source file and create a JDK 1.5 compatible class file:

$ javac -source 1.5 -target 1.5 Foo.java

If you are compiling Parallel Java source files on your own machine and/or using an IDE such as Eclipse, you must figure out on your own how to create JDK 1.5 compatible class files.

The Parallel Java job queues described below are set up to run your programs using JDK 1.5. If you try to run a program that has been compiled with JDK 1.6 or 1.7 in the Parallel Java job queue, you will see this error message:

Exception in thread "main" java.lang.UnsupportedClassVersionError:
Bad version number in .class file
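
If you are unsure which JDK a class file was compiled for, you can read the version from the class file header. The sketch below is a stand-alone illustration, not a PJ tool, and the class name ClassVersion is hypothetical. It relies on the standard class file layout: a four-byte magic number 0xCAFEBABE, then a two-byte minor version, then a two-byte major version. Major version 49 corresponds to JDK 1.5, 50 to JDK 1.6, and 51 to JDK 1.7.

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

// Illustrative utility; not part of PJ. The class name is hypothetical.
class ClassVersion {
    /** Returns the major version stored in the first 8 bytes of a class file. */
    static int majorVersion(byte[] header) {
        if (header.length < 8) {
            throw new IllegalArgumentException("need at least 8 bytes");
        }
        int magic = ((header[0] & 0xFF) << 24) | ((header[1] & 0xFF) << 16)
                  | ((header[2] & 0xFF) << 8) | (header[3] & 0xFF);
        if (magic != 0xCAFEBABE) {
            throw new IllegalArgumentException("not a Java class file");
        }
        // Bytes 4-5 hold the minor version; bytes 6-7 hold the major version.
        return ((header[6] & 0xFF) << 8) | (header[7] & 0xFF);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java ClassVersion <file.class>");
            return;
        }
        byte[] header = new byte[8];
        DataInputStream in = new DataInputStream(new FileInputStream(args[0]));
        try {
            in.readFully(header);
        } finally {
            in.close();
        }
        int major = majorVersion(header);
        System.out.println(args[0] + ": major version " + major
            + (major <= 49 ? " (loadable by JDK 1.5)" : " (too new for JDK 1.5)"));
    }
}
```

For example, running java ClassVersion Foo.class on a file compiled with -target 1.5 should report major version 49.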


Developing Programs for the SMP Parallel Computers

When developing programs for the SMP parallel computers, please obey the rules below.

  1. In your PJ program (both sequential and parallel programs), include the import statement and the Comm.init() call shown below. Call the Comm.init() method first thing in your main program, before beginning to time the program. The purpose of the Comm.init() method is explained later.*
        import edu.rit.pj.Comm;
        public class Foo
            {
            public static void main
                (String[] args)
                throws Exception
                {
                Comm.init (args);
                long t = -System.currentTimeMillis();
                . . .
                }
            }
    
  2. Before compiling or running a PJ program on the CS Department machines, be sure to set your Java classpath to include the PJ distribution. Here is an example of a command for the bash shell to set the classpath to the current directory plus the PJ JAR file:
        export CLASSPATH=.:/home/fac/ark/public_html/pj.jar
    
    Here is an example of a command for the csh shell to set the classpath to the current directory plus the PJ JAR file:
        setenv CLASSPATH .:/home/fac/ark/public_html/pj.jar
    
  3. When editing, compiling, and debugging your PJ program, log into one of the CS Department machines other than the parallel machines, or use your own personal computer. PJ runs just fine on any computer, even when running multiple threads; you simply won't see any parallel speedups. Leave the parallel machines for final testing.
     
  4. You may log into the paradox and paradise machines for final testing of your PJ program. However, depending on how many users are logged in and what they are doing, you may not get accurate timing measurements.
     
  5. There is a job queue for running PJ programs on the parasite machine to do timing measurements. The job queue itself runs on the paragon machine, and the job queue executes one PJ program at a time on the parasite machine. This allows one PJ program at a time to have full use of the parasite machine for more accurate timing measurements. To use the parallel job queue, first log into the paragon machine, then run your program using this command:
        java -Dpj.nt=<K> Foo . . .
    
    replacing <K> with the number of parallel threads. (Logging into the parasite machine directly is not allowed.)
     
  6. The time limit for jobs in the job queue is one hour. If your program runs longer than that, the job queue will abort your program.

*Including the Comm.init() method call in your program causes your program to go through the job queue when you run your program on the paragon machine; the program itself actually executes on the parasite machine. When you run your program on the paradox or paradise machine, there is no job queue, and the program executes directly on the paradox or paradise machine.
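
The skeleton in rule 1 starts the timer with long t = -System.currentTimeMillis(); but leaves out the rest of the program. One common way to finish the idiom, offered here as an assumption rather than as the required form, is to add the current time back into t at the end, leaving the elapsed milliseconds. In the sketch below the Comm.init (args) call appears only as a comment so that the file compiles without pj.jar, and the sumTo() workload is a placeholder.

```java
// Sketch of the timing idiom only. The Comm.init (args) call is shown as a
// comment so this file compiles without pj.jar; the workload is a placeholder.
class TimingSketch {
    /** Placeholder computation standing in for the real program body. */
    static long sumTo(long n) {
        long sum = 0;
        for (long i = 1; i <= n; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) throws Exception {
        // Comm.init (args);                   // SMP rule 1: call before starting the clock
        long t = -System.currentTimeMillis();  // start: store minus the start time
        long result = sumTo(100000000L);
        t += System.currentTimeMillis();       // stop: t is now end time minus start time
        System.out.println("result = " + result);
        System.out.println(t + " msec");
    }
}
```

With Comm.init() called before the timer starts, any time the program spends waiting in the job queue presumably stays out of the measurement.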


Paragon Job Queue

Go to the following web page to view the paragon job queue status:

http://paragon.cs.rit.edu:8080/

The web page automatically refreshes itself every 20 seconds, or you can click the "Refresh" button to refresh immediately.

When you run a PJ program on the paragon machine using the job queue, your program may have to wait a while before the job queue lets it start running. If you get tired of waiting, you can kill your program (e.g., by typing CTRL-C). This removes your program from the job queue.


Developing Programs for the Cluster Parallel Computer

When developing programs for the cluster parallel computer, please obey the rules below.

  1. In your PJ program (both sequential and parallel programs), include the import statement and the Comm.init() call shown below. Call the Comm.init() method first thing in your main program, after beginning to time the program. The purpose of the Comm.init() method is explained later.*
        import edu.rit.pj.Comm;
        public class Foo
            {
            public static void main
                (String[] args)
                throws Exception
                {
                long t = -System.currentTimeMillis();
                Comm.init (args);
                . . .
                }
            }
    
  2. Before compiling or running a PJ program on the CS Department machines, be sure to set your Java classpath to include the PJ distribution. Here is an example of a command for the bash shell to set the classpath to the current directory plus the PJ JAR file:
        export CLASSPATH=.:/home/fac/ark/public_html/pj.jar
    
    Here is an example of a command for the csh shell to set the classpath to the current directory plus the PJ JAR file:
        setenv CLASSPATH .:/home/fac/ark/public_html/pj.jar
    
  3. When editing, compiling, and debugging your PJ program, log into one of the CS Department machines other than the paranoia machine, or use your own personal computer. PJ runs just fine on any computer, even when running a cluster parallel program; you simply won't see any parallel speedups. Leave the paranoia machine for final testing.
     
  4. You may log into the paranoia machine for final testing of your PJ program and for doing timing measurements.
     
  5. There is a job queue for running PJ programs on the paranoia machine. You must use the job queue to run a PJ program on the cluster. The parallel job queue runs each PJ program on a group of one or more thug machines. The parallel job queue runs multiple PJ programs simultaneously, in first-in-first-out order, as long as enough thug machines are available. However, the parallel job queue lets only one PJ program at a time run on each thug machine. This lets one PJ program at a time have full use of its group of thug machines for more accurate timing measurements. To use the parallel job queue, first log into the paranoia machine, then run your program using this command:
        java -Dpj.np=<K> Foo . . .
    
    replacing <K> with the number of parallel processes.
     
  6. Do not specify more than K = 8 parallel processes. This is to avoid having any one user tie up all the thug machines.
     
  7. The time limit for jobs in the job queue is one hour. If your program runs longer than that, the job queue will abort your program.

*Including the Comm.init() method call in your program causes your program to go through the job queue when you run your program on the paranoia machine; the program itself actually executes on the thug machines. The Comm.init() method also sets up the "world communicator" that the parallel processes use to send messages amongst themselves.


Paranoia Job Queue

Go to the following web page to view the paranoia job queue status:

http://paranoia.cs.rit.edu:8080/

The web page automatically refreshes itself every 20 seconds, or you can click the "Refresh" button to refresh immediately.

When you run a PJ program on the paranoia machine using the job queue, your program may have to wait a while before the job queue lets it start running. If you get tired of waiting, you can kill your program (e.g., by typing CTRL-C). This removes your program from the job queue.


Developing Programs for the Hybrid SMP Cluster Parallel Computer

When developing programs for the hybrid SMP cluster parallel computer, please obey the rules below.

  1. In your PJ program (both sequential and parallel programs), include the import statement and the Comm.init() call shown below. Call the Comm.init() method first thing in your main program, after beginning to time the program. The purpose of the Comm.init() method is explained later.*
        import edu.rit.pj.Comm;
        public class Foo
            {
            public static void main
                (String[] args)
                throws Exception
                {
                long t = -System.currentTimeMillis();
                Comm.init (args);
                . . .
                }
            }
    
  2. Before compiling or running a PJ program on the CS Department machines, be sure to set your Java classpath to include the PJ distribution. Here is an example of a command for the bash shell to set the classpath to the current directory plus the PJ JAR file:
        export CLASSPATH=.:/home/fac/ark/public_html/pj.jar
    
    Here is an example of a command for the csh shell to set the classpath to the current directory plus the PJ JAR file:
        setenv CLASSPATH .:/home/fac/ark/public_html/pj.jar
    
  3. When editing, compiling, and debugging your PJ program, log into one of the CS Department machines other than the tardis machine, or use your own personal computer. PJ runs just fine on any computer, even when running a cluster parallel program; you simply won't see any parallel speedups. Leave the tardis machine for final testing.
     
  4. You may log into the tardis machine for final testing of your PJ program and for doing timing measurements.
     
  5. There is a job queue for running PJ programs on the tardis machine. You must use the job queue to run a PJ program on the cluster. The parallel job queue runs each PJ program on a group of one or more dr machines. The parallel job queue runs multiple PJ programs simultaneously, in first-in-first-out order, as long as enough dr machines are available. However, the parallel job queue lets only one PJ program at a time run on each dr machine. This lets one PJ program at a time have full use of its group of dr machines for more accurate timing measurements. To use the parallel job queue, first log into the tardis machine, then run your program using this command:
        java -Dpj.np=<K> -Dpj.nt=<L> Foo . . .
    
    replacing <K> with the number of parallel processes and <L> with the number of parallel threads inside each parallel process.
     
  6. Do not specify more than K = 4 parallel processes. This is to avoid having any one user tie up all the dr machines.

*Including the Comm.init() method call in your program causes your program to go through the job queue when you run your program on the tardis machine; the program itself actually executes on the dr machines. The Comm.init() method also sets up the "world communicator" that the parallel processes use to send messages amongst themselves.
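
The -Dpj.np and -Dpj.nt options in rule 5 are ordinary Java system properties set on the command line. The sketch below only illustrates how such properties reach a program and how the two values combine. How PJ itself reads them is not described here, and the class name JobShape and the default of 1 for an unset property are assumptions made for the example.

```java
// Illustrative only: shows how -D options reach a Java program as system
// properties. The defaults of 1 are assumptions, not documented PJ behavior.
class JobShape {
    /** Total threads requested: K processes times L threads per process. */
    static int totalThreads(int np, int nt) {
        return np * nt;
    }

    public static void main(String[] args) {
        // Integer.getInteger returns the given default when the property is unset.
        int np = Integer.getInteger("pj.np", 1).intValue();
        int nt = Integer.getInteger("pj.nt", 1).intValue();
        System.out.println("processes = " + np + ", threads per process = " + nt
            + ", total threads = " + totalThreads(np, nt));
    }
}
```

Running java -Dpj.np=4 -Dpj.nt=4 JobShape reports 16 total threads. If each process lands on its own dr machine, that is one thread for each of the four processors on four machines.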


Tardis Job Queue

Go to the following web page to view the tardis job queue status:

http://tardis.cs.rit.edu:8080/

The web page automatically refreshes itself every 20 seconds, or you can click the "Refresh" button to refresh immediately.

When you run a PJ program on the tardis machine using the job queue, your program may have to wait a while before the job queue lets it start running. If you get tired of waiting, you can kill your program (e.g., by typing CTRL-C). This removes your program from the job queue.


Running MPI Jobs

There is a program named mprun that lets you run an MPI program on the paranoia.cs.rit.edu cluster parallel computer using the PJ job queue.

  1. Log into the paranoia machine.

  2. Set your Java classpath to include the PJ distribution. Here is an example of a command for the bash shell to set the classpath to the current directory plus the PJ JAR file:
        export CLASSPATH=.:/home/fac/ark/public_html/pj.jar
    
    Here is an example of a command for the csh shell to set the classpath to the current directory plus the PJ JAR file:
        setenv CLASSPATH .:/home/fac/ark/public_html/pj.jar
    
  3. Type this command:
        java mprun -np <K> <command> <args> . . .
    
    where <K> is the number of backend nodes you need, <command> is the MPI program to run, and <args> are the MPI program's command line arguments if any. If you omit the -np <K> option, the default is one backend node.

The mprun program contacts the PJ Job Scheduler Daemon and requests a job running on K nodes of the cluster. The job goes into the job queue and may sit in the job queue for some time until K nodes are available. Once K nodes are available, the mprun program prints their names on the standard error. For example:

    $ java mprun -np 4 <command> <args> . . .
    Job 42, thug01, thug02, thug03, thug04
    . . .

The MPI program then runs automatically on the assigned nodes. The standard input, standard output, and standard error of the MPI program come from and go to the terminal as usual.

You can put a time limit on the mprun program this way:

    java -Dpj.jobtime=<T> mprun -np <K> <command> <args> . . .

where <K> is the number of backend nodes you need and <T> is the time limit in seconds. In this case the mprun program will terminate itself automatically after T seconds. If you specify a time limit of more than one hour, the job queue will still terminate the mprun program after one hour. You can also kill the mprun program manually, e.g. by typing CTRL-C.
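
Because pj.jobtime is given in seconds and the job queue enforces its own one-hour limit, the time a job can actually run is the smaller of the two. A minimal sketch of that rule follows; the class and method names are hypothetical and not part of PJ.

```java
// Illustrative arithmetic only; names are hypothetical, not part of PJ.
class JobTime {
    /** The job queue's one-hour limit, expressed in seconds. */
    static final int QUEUE_LIMIT_SECONDS = 3600;

    /** Seconds a job may actually run when pj.jobtime is set to requestedSeconds. */
    static int effectiveLimit(int requestedSeconds) {
        return Math.min(requestedSeconds, QUEUE_LIMIT_SECONDS);
    }

    public static void main(String[] args) {
        System.out.println("30 minutes -> " + effectiveLimit(30 * 60) + " seconds");
        System.out.println("2 hours    -> " + effectiveLimit(2 * 60 * 60) + " seconds");
    }
}
```

So -Dpj.jobtime=1800 gives a 30-minute limit, while -Dpj.jobtime=7200 still ends after one hour.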

Jobs created using the mprun program show up in the PJ job queue along with other PJ jobs. See "Paranoia Job Queue" above for further information.

Note: To run MPI jobs, it is not necessary to set up your account for SSH public key authentication. To run Parallel Java jobs, it is necessary to set up your account for SSH public key authentication.


Running Non-PJ Jobs

There is a program named pjrun that lets you run a non-PJ program on one of the above parallel computers using the PJ job queue. (PJ programs interact with the PJ job queue directly and do not need to use the pjrun program.)

  1. Log into the paragon, paranoia, or tardis machine.

  2. Set your Java classpath to include the PJ distribution. Here is an example of a command for the bash shell to set the classpath to the current directory plus the PJ JAR file:
        export CLASSPATH=.:/home/fac/ark/public_html/pj.jar
    
    Here is an example of a command for the csh shell to set the classpath to the current directory plus the PJ JAR file:
        setenv CLASSPATH .:/home/fac/ark/public_html/pj.jar
    
  3. Type this command:
        java -Dpj.np=<K> pjrun
    
    where <K> is the number of backend nodes you need.

The pjrun program contacts the PJ Job Scheduler Daemon and requests a job running on K nodes of the cluster. The job goes into the job queue and may sit in the job queue for some time until K nodes are available. Once K nodes are available, the pjrun program prints their names on the standard output. For example:

    $ java -Dpj.np=4 pjrun
    thug01
    thug02
    thug03
    thug04

You can then do whatever you want with those nodes, such as log into them and run programs on them. Other PJ jobs will not be assigned those nodes as long as the pjrun program runs. The pjrun program continues to run until killed externally. To release the assigned nodes, kill the pjrun program, e.g. by typing CTRL-C.
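
If you want to drive the assigned nodes from a script or program, the node names printed by pjrun are easy to pick apart, since they arrive one per line. The sketch below parses text in that format into a list. It is an illustration only: the class name NodeList is hypothetical, and the sample text is copied from the example above rather than captured from a live pjrun.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative parser for text in the one-name-per-line format shown above.
// Not part of PJ; the class and method names are hypothetical.
class NodeList {
    /** Returns the non-blank lines of the given text, trimmed, in order. */
    static List<String> parseNodes(String output) {
        List<String> nodes = new ArrayList<String>();
        String[] lines = output.split("\\r?\\n");
        for (int i = 0; i < lines.length; i++) {
            String name = lines[i].trim();
            if (name.length() > 0) {
                nodes.add(name);
            }
        }
        return nodes;
    }

    public static void main(String[] args) {
        // Sample text copied from the pjrun example above.
        String sample = "thug01\nthug02\nthug03\nthug04\n";
        List<String> nodes = parseNodes(sample);
        System.out.println(nodes.size() + " nodes assigned: " + nodes);
    }
}
```

Keep in mind that pjrun keeps running after it prints the names, so a program reading its output should not wait for end of input before using the list.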

You can put a time limit on the pjrun program this way:

    java -Dpj.np=<K> -Dpj.jobtime=<T> pjrun

where <K> is the number of backend nodes you need and <T> is the time limit in seconds. In this case the pjrun program will terminate itself automatically after T seconds. If you specify a time limit of more than one hour, the job queue will still terminate the pjrun program after one hour. You can also kill the pjrun program manually.

Copyright © 2013 Alan Kaminsky. All rights reserved. Last updated 28-Feb-2013. Please send comments to ark@cs.rit.edu.