Parallel Java on the RIT CS Parallel Computers
Prof. Alan Kaminsky
Rochester Institute of Technology -- Department of Computer Science
Introduction
Setting Up Your Account to Access the Parallel Computers
Don't Run a Shell From Your .login or .cshrc File
You Must Use JDK 1.5
Developing Programs for the SMP Parallel Computers
Paragon Job Queue
Developing Programs for the Cluster Parallel Computer
Paranoia Job Queue
Developing Programs for the Hybrid SMP Cluster Parallel Computer
Tardis Job Queue
Running MPI Jobs
Running Non-PJ Jobs
Introduction
Parallel Java (PJ)
is an API and middleware for parallel programming
in 100% Java
on shared memory multiprocessor (SMP) parallel computers,
cluster parallel computers,
and hybrid SMP cluster parallel computers.
PJ was developed by Professor Alan Kaminsky
and his student Luke McOmber
in the Department of Computer Science
at the Rochester Institute of Technology.
For further information about PJ,
see the "Parallel Java Library."
PJ is installed on each of the RIT Computer Science Department's
parallel computers.
- SMP parallel computers
- Cluster parallel computer
- Frontend computer -- paranoia.cs.rit.edu -- UltraSPARC-II CPU, 296 MHz clock, 192 MB main memory
- 32 backend computers -- thug01 through thug32 -- each an UltraSPARC-IIe CPU, 650 MHz clock, 1 GB main memory
- 100-Mbps switched Ethernet backend interconnection network
- Aggregate 21 GHz clock, 32 GB main memory
- Hybrid SMP cluster parallel computer
- Frontend computer -- tardis.cs.rit.edu -- UltraSPARC-IIe CPU, 650 MHz clock, 512 MB main memory
- 10 backend computers -- dr00 through dr09 -- each with two AMD Opteron 2218 dual-core CPUs, four processors, 2.6 GHz clock, 8 GB main memory
- 1-Gbps switched Ethernet backend interconnection network
- Aggregate 104 GHz clock, 80 GB main memory
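The aggregate figures in the list above follow from multiplying the per-node numbers by the node counts. Here is a small sketch of that arithmetic, using only the figures listed above. The class and method names are made up for illustration and are not part of Parallel Java.

```java
// Hypothetical helper; the names are illustrative, the figures come from the list above.
final class AggregateSpecs {
    // Total clock rate in GHz: nodes x processors per node x clock per processor.
    static double aggregateGHz(int nodes, int processorsPerNode, double clockGHz) {
        return nodes * processorsPerNode * clockGHz;
    }

    // Total main memory in GB: nodes x memory per node.
    static int aggregateMemoryGB(int nodes, int memoryPerNodeGB) {
        return nodes * memoryPerNodeGB;
    }

    public static void main(String[] args) {
        // Cluster: 32 thug nodes, one 650 MHz processor and 1 GB each.
        System.out.printf("cluster: %.1f GHz, %d GB%n",
            aggregateGHz(32, 1, 0.65), aggregateMemoryGB(32, 1));
        // Hybrid: 10 dr nodes, four 2.6 GHz processors and 8 GB each.
        System.out.printf("hybrid:  %.1f GHz, %d GB%n",
            aggregateGHz(10, 4, 2.6), aggregateMemoryGB(10, 8));
    }
}
```

For the cluster, 32 x 650 MHz comes to 20.8 GHz, which the list rounds to 21 GHz.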
Setting Up Your Account to Access the Parallel Computers
Running Parallel Java programs on the parallel computers
requires you to set up SSH public key authentication
in your account.
Log into your CS Department account
and type the commands below.
-
mkdir .ssh
This command creates the directory
where SSH configuration files are stored.
Type this command only if you do not already have
a .ssh directory in your home directory.
-
cd .ssh
This command changes the current directory
to the directory where SSH configuration files are stored.
-
cp /home/fac/ark/public_html/known_hosts .
This command copies a file named known_hosts
into your .ssh directory.
Each line of this file contains the RSA public key
for one of the parallel computers (hosts).
SSH uses this public key
to authenticate the host it is logging into.
-
ssh-keygen -t rsa
This command creates an RSA public/private key pair
for your account.
When it asks you for a file in which to save the key, hit return.
When it asks you for a passphrase, hit return.
When it asks you for the passphrase again, hit return.
You now have two files in your .ssh directory:
id_rsa.pub contains your public key,
id_rsa contains your private key.
-
cp id_rsa.pub authorized_keys
This command puts a copy of your public key
into the file authorized_keys.
This tells SSH to use your public key for authentication
when logging into your account.
-
chmod 600 *
This command makes all files in your .ssh directory
readable and writable only by you.
This command is critical!
If you do not do this command,
other users will be able to log into your account
without needing to know your password!
-
chmod 700 .
This command makes the .ssh directory itself
accessible only by you.
This command is critical!
If you do not do this command,
other users will be able to log into your account
without needing to know your password!
You still have to type your password
when you first log into your account.
But once you are logged in,
you -- and the Parallel Java middleware --
will be able to SSH into your own account
on any of the CS Department parallel computers
without having to type your password.
Instead, SSH will authenticate you
using the information in the files in your .ssh directory.
A final reminder:
If these files are accessible to anyone other than yourself,
other users will be able to log into your account
without needing to know your password!
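If you want to double-check the result of the two chmod commands, the sketch below reports any entry in your .ssh directory that group or other users can access. It is a standalone illustration, not part of Parallel Java or SSH; the class name is made up, and it assumes a POSIX file system and Java 7 or later (for java.nio.file), so run it with a newer JDK than the one used for PJ programs.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFilePermission;
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical checker; assumes a POSIX file system.
final class SshPermissionCheck {
    // Every permission bit that belongs to someone other than the owner.
    private static final Set<PosixFilePermission> NON_OWNER = EnumSet.of(
        PosixFilePermission.GROUP_READ, PosixFilePermission.GROUP_WRITE,
        PosixFilePermission.GROUP_EXECUTE, PosixFilePermission.OTHERS_READ,
        PosixFilePermission.OTHERS_WRITE, PosixFilePermission.OTHERS_EXECUTE);

    // True if neither group nor others have any access to the path.
    static boolean ownerOnly(Path path) throws IOException {
        return Collections.disjoint(Files.getPosixFilePermissions(path), NON_OWNER);
    }

    public static void main(String[] args) throws IOException {
        Path sshDir = Paths.get(System.getProperty("user.home"), ".ssh");
        System.out.println(sshDir + (ownerOnly(sshDir) ? ": ok" : ": TOO OPEN"));
        // Check each file inside the directory as well.
        try (DirectoryStream<Path> files = Files.newDirectoryStream(sshDir)) {
            for (Path file : files) {
                System.out.println(file + (ownerOnly(file) ? ": ok" : ": TOO OPEN"));
            }
        }
    }
}
```

Any line marked TOO OPEN means the corresponding chmod command above still needs to be run.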
Don't Run a Shell From Your .login or .cshrc File
Some folks don't like the default shell
they get when they log in,
so they put a command
in their .login or .cshrc file
to run some other shell.
If you do this, Parallel Java jobs will not work.
Take this command out of
the .login or .cshrc file
immediately!
The reason Parallel Java jobs will not work
is that when the Parallel Java Job Scheduler
starts a job,
the Job Scheduler actually logs into your account using ssh
(with public key authentication as described above)
and tells the machine to execute your program.
However, logging into your account
causes the .login and .cshrc files
to be executed
before executing your program.
If the .login or .cshrc file
invokes a shell,
the shell sits there waiting for input
(which will never arrive because
the standard input is not connected to anything),
the .login or .cshrc file
never finishes,
and your program never gets executed.
Eventually the job times out
and prints an error message like this:
Job backend process failed, processor thug05, rank 0
You Must Use JDK 1.5
Parallel Java was developed using
Java Development Kit (JDK) 1.5.
When compiling and running Parallel Java programs,
you must use JDK 1.5.
Parallel Java uses features of the Java language and platform
introduced in JDK 1.5
and will not compile with earlier JDK versions.
Note:
Parallel Java will work with JDK 1.6 and 1.7.
However, my tests have revealed serious performance issues
when a multithreaded Parallel Java program
is run on an SMP parallel computer
with JDK 1.6 or 1.7.
Due to some as-yet-unfathomed behavior
of the JIT compiler
and/or the thread scheduler,
SMP parallel programs that experienced near-ideal speedups with JDK 1.5
experience far-less-than-ideal speedups with JDK 1.6 and 1.7
on the same machine.
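The note above compares speedups without defining the term. A minimal sketch, assuming the usual textbook definitions (speedup is the sequential running time divided by the parallel running time, and efficiency is the speedup divided by the number of threads or processes K), is shown below. The class name is made up, and you supply your own measured times on the command line.

```java
// Hypothetical helper; assumes the textbook definitions of speedup and efficiency.
final class SpeedupSketch {
    // Speedup = sequential running time / parallel running time.
    static double speedup(long seqMsec, long parMsec) {
        return (double) seqMsec / (double) parMsec;
    }

    // Efficiency = speedup / K; an ideal speedup gives an efficiency of 1.
    static double efficiency(long seqMsec, long parMsec, int k) {
        return speedup(seqMsec, parMsec) / k;
    }

    public static void main(String[] args) {
        if (args.length != 3) {
            System.err.println("Usage: java SpeedupSketch <seqMsec> <parMsec> <K>");
            return;
        }
        long seq = Long.parseLong(args[0]);
        long par = Long.parseLong(args[1]);
        int k = Integer.parseInt(args[2]);
        System.out.println("speedup    = " + speedup(seq, par));
        System.out.println("efficiency = " + efficiency(seq, par, k));
    }
}
```

A "near-ideal" speedup in this sense is one close to K, that is, an efficiency close to 1.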
JDK 1.5 is not the default on the CS Department computers.
You must use the command below
to compile a Parallel Java source file
and create a JDK 1.5 compatible class file:
$ javac -source 1.5 -target 1.5 Foo.java
If you are compiling Parallel Java source files
on your own machine
and/or using an IDE such as Eclipse,
you must figure out on your own
how to create JDK 1.5 compatible class files.
The Parallel Java job queues described below
are set up to run your programs using JDK 1.5.
If you try to run a program
that has been compiled with JDK 1.6 or 1.7
in the Parallel Java job queue,
you will see this error message:
Exception in thread "main" java.lang.UnsupportedClassVersionError:
Bad version number in .class file
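You can check a class file before submitting it to the job queue. Every class file begins with the magic number 0xCAFEBABE, followed by a two-byte minor version and a two-byte major version; major version 49 corresponds to JDK 1.5, 50 to JDK 1.6, and 51 to JDK 1.7. The sketch below reads that header. The class name is made up for illustration, and the code itself compiles with JDK 1.5.

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical checker; reads only the first eight bytes of a class file.
final class ClassVersionCheck {
    // Returns the class file's major version (49 = JDK 1.5, 50 = 1.6, 51 = 1.7).
    static int majorVersion(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        if (data.readInt() != 0xCAFEBABE) {
            throw new IOException("Not a Java class file");
        }
        data.readUnsignedShort();          // minor version, not needed here
        return data.readUnsignedShort();   // major version
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            InputStream in = new FileInputStream(name);
            try {
                int major = majorVersion(in);
                System.out.println(name + ": major version " + major
                    + (major <= 49 ? " (loads on JDK 1.5)" : " (too new for JDK 1.5)"));
            } finally {
                in.close();
            }
        }
    }
}
```

For example, java ClassVersionCheck Foo.class reports the version of Foo.class.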
Developing Programs for the SMP Parallel Computers
When developing programs for the SMP parallel computers,
please obey the rules below.
-
In your PJ program (both sequential and parallel programs),
include the import statement and the Comm.init() call shown below.
Call the Comm.init() method
first thing in your main program,
before beginning to time the program.
The purpose of the Comm.init() method is explained later.*
    import edu.rit.pj.Comm;

    public class Foo
    {
        public static void main
            (String[] args)
            throws Exception
        {
            Comm.init (args);
            long t = -System.currentTimeMillis();
            . . .
        }
    }
-
Before compiling or running a PJ program on the CS Department machines,
be sure to set your Java classpath to include the PJ distribution.
Here is an example of a command for the bash shell
to set the classpath to the current directory
plus the PJ JAR file:
export CLASSPATH=.:/home/fac/ark/public_html/pj.jar
Here is an example of a command for the csh shell
to set the classpath to the current directory
plus the PJ JAR file:
setenv CLASSPATH .:/home/fac/ark/public_html/pj.jar
-
When editing, compiling, and debugging your PJ program,
log into one of the CS Department machines
other than the parallel machines,
or use your own personal computer.
PJ runs just fine on any computer,
even when running multiple threads;
you simply won't see any parallel speedups.
Leave the parallel machines for final testing.
-
You may log into the paradox and paradise machines
for final testing of your PJ program.
However, depending on how many users are logged in
and what they are doing,
you may not get accurate timing measurements.
-
There is a job queue
for running PJ programs on the parasite machine
to do timing measurements.
The job queue itself runs on the paragon machine,
and the job queue executes one PJ program at a time
on the parasite machine.
This allows one PJ program at a time
to have full use of the parasite machine
for more accurate timing measurements.
To use the parallel job queue,
first log into the paragon machine,
then run your program using this command:
java -Dpj.nt=<K> Foo . . .
replacing <K>
with the number of parallel threads.
(Logging into the parasite machine directly
is not allowed.)
-
The time limit for jobs in the job queue is one hour.
If your program runs longer than that,
the job queue will abort your program.
*Including the Comm.init() method call in your program
causes your program to go through the job queue
when you run your program on the paragon machine;
the program itself actually executes on the parasite machine.
When you run your program
on the paradox or paradise machine,
there is no job queue,
and the program executes directly
on the paradox or paradise machine.
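The program skeleton in the rules above starts the clock with t = -System.currentTimeMillis() and leaves the rest elided. One common way to finish the idiom is to add the current time back in at the end, so that t holds the elapsed time in milliseconds. The sketch below shows this with a made-up stand-in workload; the Comm.init() call appears only as a comment so that the sketch compiles without the PJ JAR file.

```java
// Hypothetical example; the workload is a stand-in, not a real PJ program.
final class TimingSketch {
    // Stand-in computation: the sum of the integers 1 through n.
    static long work(int n) {
        long sum = 0;
        for (int i = 1; i <= n; ++i) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) throws Exception {
        // A PJ program for the SMP machines would call Comm.init (args) here,
        // before the clock starts.
        long t = -System.currentTimeMillis();   // t = -(start time)
        long result = work(100000000);
        t += System.currentTimeMillis();        // t = (end time) - (start time)
        System.out.println("result = " + result);
        System.out.println(t + " msec");
    }
}
```

Because Comm.init() comes before the clock starts on the SMP machines, time spent waiting in the job queue is not counted in the measurement.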
Paragon Job Queue
Go to the following web page
to view the paragon job queue status:
http://paragon.cs.rit.edu:8080/
The web page automatically refreshes itself every 20 seconds,
or you can click the "Refresh" button to refresh immediately.
When you run a PJ program
on the paragon machine
using the job queue,
your program may have to wait a while
before the job queue lets it start running.
If you get tired of waiting,
you can kill your program
(e.g., by typing CTRL-C).
This removes your program from the job queue.
Developing Programs for the Cluster Parallel Computer
When developing programs for the cluster parallel computer,
please obey the rules below.
-
In your PJ program (both sequential and parallel programs),
include the import statement and the Comm.init() call shown below.
Call the Comm.init() method
first thing in your main program,
after beginning to time the program.
The purpose of the Comm.init() method is explained later.*
    import edu.rit.pj.Comm;

    public class Foo
    {
        public static void main
            (String[] args)
            throws Exception
        {
            long t = -System.currentTimeMillis();
            Comm.init (args);
            . . .
        }
    }
-
Before compiling or running a PJ program on the CS Department machines,
be sure to set your Java classpath to include the PJ distribution.
Here is an example of a command for the bash shell
to set the classpath to the current directory
plus the PJ JAR file:
export CLASSPATH=.:/home/fac/ark/public_html/pj.jar
Here is an example of a command for the csh shell
to set the classpath to the current directory
plus the PJ JAR file:
setenv CLASSPATH .:/home/fac/ark/public_html/pj.jar
-
When editing, compiling, and debugging your PJ program,
log into one of the CS Department machines
other than the paranoia machine,
or use your own personal computer.
PJ runs just fine on any computer,
even when running a cluster parallel program;
you simply won't see any parallel speedups.
Leave the paranoia machine for final testing.
-
You may log into the paranoia machine
for final testing of your PJ program
and for doing timing measurements.
-
There is a job queue
for running PJ programs on the paranoia machine.
You must use the job queue to run a PJ program on the cluster.
The parallel job queue runs each PJ program
on a group of one or more thug machines.
The parallel job queue runs multiple PJ programs simultaneously,
in first-in-first-out order,
as long as enough thug machines are available.
However, the parallel job queue
lets only one PJ program at a time
run on each thug machine.
This lets one PJ program at a time
have full use of its group of thug machines
for more accurate timing measurements.
To use the parallel job queue,
first log into the paranoia machine,
then run your program using this command:
java -Dpj.np=<K> Foo . . .
replacing <K>
with the number of parallel processes.
-
Do not specify more than K = 8 parallel processes.
This is to avoid having any one user
tie up all the thug machines.
-
The time limit for jobs in the job queue is one hour.
If your program runs longer than that,
the job queue will abort your program.
*Including the Comm.init() method call in your program
causes your program to go through the job queue
when you run your program on the paranoia machine;
the program itself actually executes on the thug machines.
The Comm.init() method also sets up the "world communicator"
that the parallel processes use
to send messages amongst themselves.
Paranoia Job Queue
Go to the following web page
to view the paranoia job queue status:
http://paranoia.cs.rit.edu:8080/
The web page automatically refreshes itself every 20 seconds,
or you can click the "Refresh" button to refresh immediately.
When you run a PJ program
on the paranoia machine
using the job queue,
your program may have to wait a while
before the job queue lets it start running.
If you get tired of waiting,
you can kill your program
(e.g., by typing CTRL-C).
This removes your program from the job queue.
Developing Programs for the Hybrid SMP Cluster Parallel Computer
When developing programs for the hybrid SMP cluster parallel computer,
please obey the rules below.
-
In your PJ program (both sequential and parallel programs),
include the import statement and the Comm.init() call shown below.
Call the Comm.init() method
first thing in your main program,
after beginning to time the program.
The purpose of the Comm.init() method is explained later.*
    import edu.rit.pj.Comm;

    public class Foo
    {
        public static void main
            (String[] args)
            throws Exception
        {
            long t = -System.currentTimeMillis();
            Comm.init (args);
            . . .
        }
    }
-
Before compiling or running a PJ program on the CS Department machines,
be sure to set your Java classpath to include the PJ distribution.
Here is an example of a command for the bash shell
to set the classpath to the current directory
plus the PJ JAR file:
export CLASSPATH=.:/home/fac/ark/public_html/pj.jar
Here is an example of a command for the csh shell
to set the classpath to the current directory
plus the PJ JAR file:
setenv CLASSPATH .:/home/fac/ark/public_html/pj.jar
-
When editing, compiling, and debugging your PJ program,
log into one of the CS Department machines
other than the tardis machine,
or use your own personal computer.
PJ runs just fine on any computer,
even when running a cluster parallel program;
you simply won't see any parallel speedups.
Leave the tardis machine for final testing.
-
You may log into the tardis machine
for final testing of your PJ program
and for doing timing measurements.
-
There is a job queue
for running PJ programs on the tardis machine.
You must use the job queue to run a PJ program on the cluster.
The parallel job queue runs each PJ program
on a group of one or more dr machines.
The parallel job queue runs multiple PJ programs simultaneously,
in first-in-first-out order,
as long as enough dr machines are available.
However, the parallel job queue
lets only one PJ program at a time
run on each dr machine.
This lets one PJ program at a time
have full use of its group of dr machines
for more accurate timing measurements.
To use the parallel job queue,
first log into the tardis machine,
then run your program using this command:
java -Dpj.np=<K> -Dpj.nt=<L> Foo . . .
replacing <K>
with the number of parallel processes
and <L>
with the number of parallel threads inside each parallel process.
-
Do not specify more than K = 4 parallel processes.
This is to avoid having any one user
tie up all the dr machines.
*Including the Comm.init() method call in your program
causes your program to go through the job queue
when you run your program on the tardis machine;
the program itself actually executes on the dr machines.
The Comm.init() method also sets up the "world communicator"
that the parallel processes use
to send messages amongst themselves.
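To make the roles of K and L concrete, the sketch below assembles the command line described in the rules above and refuses any K larger than the stated limit of 4. The class and method names are made up for illustration; only the -Dpj.np and -Dpj.nt properties come from the text above.

```java
// Hypothetical helper; only the pj.np and pj.nt properties come from this document.
final class TardisCommand {
    static final int MAX_PROCESSES = 4;   // limit on K stated in the rules above

    // Builds "java -Dpj.np=<K> -Dpj.nt=<L> <mainClass>" after checking K and L.
    static String build(int np, int nt, String mainClass) {
        if (np < 1 || np > MAX_PROCESSES) {
            throw new IllegalArgumentException(
                "K must be between 1 and " + MAX_PROCESSES + ", got " + np);
        }
        if (nt < 1) {
            throw new IllegalArgumentException("L must be at least 1, got " + nt);
        }
        return "java -Dpj.np=" + np + " -Dpj.nt=" + nt + " " + mainClass;
    }

    public static void main(String[] args) {
        // Four processes with four threads each: 16 threads in total,
        // one thread per processor on four dr machines.
        System.out.println(build(4, 4, "Foo"));
    }
}
```

With K = 4 and L = 4 the program runs 16 threads in total, which matches the four processors in each of four dr machines.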
Tardis Job Queue
Go to the following web page
to view the tardis job queue status:
http://tardis.cs.rit.edu:8080/
The web page automatically refreshes itself every 20 seconds,
or you can click the "Refresh" button to refresh immediately.
When you run a PJ program
on the tardis machine
using the job queue,
your program may have to wait a while
before the job queue lets it start running.
If you get tired of waiting,
you can kill your program
(e.g., by typing CTRL-C).
This removes your program from the job queue.
Running MPI Jobs
There is a program named mprun
that lets you run an MPI program
on the paranoia.cs.rit.edu cluster parallel computer
using the PJ job queue.
-
Log into the paranoia machine.
-
Set your Java classpath to include the PJ distribution.
Here is an example of a command for the bash shell
to set the classpath to the current directory
plus the PJ JAR file:
export CLASSPATH=.:/home/fac/ark/public_html/pj.jar
Here is an example of a command for the csh shell
to set the classpath to the current directory
plus the PJ JAR file:
setenv CLASSPATH .:/home/fac/ark/public_html/pj.jar
-
Type this command:
java mprun -np <K> <command> <args> . . .
where <K> is the number of backend nodes you need,
<command> is the MPI program to run,
and <args> are the MPI program's command line arguments if any.
If you omit the -np <K> option,
the default is one backend node.
The mprun program contacts the PJ Job Scheduler Daemon and requests
a job running on K nodes of the cluster. The job goes into the job
queue and may sit in the job queue for some time until K nodes are
available. Once K nodes are available, the mprun program
prints their names on the standard error. For example:
$ java mprun -np 4 <command> <args> . . .
Job 42, thug01, thug02, thug03, thug04
. . .
The MPI program then runs automatically on the assigned nodes.
The standard input, standard output, and standard error
of the MPI program
come from and go to the terminal as usual.
You can put a time limit on the mprun program this way:
java -Dpj.jobtime=<T> mprun -np <K> <command> <args> . . .
where <K> is the number of backend nodes you need
and <T> is the time limit in seconds.
In this case the mprun program will terminate itself automatically
after T seconds.
If you specify a time limit of more than one hour,
the job queue will still terminate the mprun program after one hour.
You can also kill the mprun program
manually, e.g. by typing CTRL-C.
Jobs created using the mprun program
show up in the PJ job queue along with other PJ jobs.
See "Paranoia Job Queue" above
for further information.
Note:
To run MPI jobs,
it is not necessary
to set up your account for SSH public key authentication.
To run Parallel Java jobs,
it is necessary
to set up your account for SSH public key authentication.
Running Non-PJ Jobs
There is a program named pjrun
that lets you run a non-PJ program
on one of the above parallel computers
using the PJ job queue.
(PJ programs interact with the PJ job queue directly
and do not need to use the pjrun program.)
-
Log into the paragon,
paranoia,
or tardis machine.
-
Set your Java classpath to include the PJ distribution.
Here is an example of a command for the bash shell
to set the classpath to the current directory
plus the PJ JAR file:
export CLASSPATH=.:/home/fac/ark/public_html/pj.jar
Here is an example of a command for the csh shell
to set the classpath to the current directory
plus the PJ JAR file:
setenv CLASSPATH .:/home/fac/ark/public_html/pj.jar
-
Type this command:
java -Dpj.np=<K> pjrun
where <K> is the number of backend nodes you need.
The pjrun program contacts the PJ Job Scheduler Daemon and requests
a job running on K nodes of the cluster. The job goes into the job
queue and may sit in the job queue for some time until K nodes are
available. Once K nodes are available, the pjrun program
prints their names on the standard output. For example:
$ java -Dpj.np=4 pjrun
thug01
thug02
thug03
thug04
You can then do whatever you want with those nodes, such as log into them and
run programs on them. Other PJ jobs will not be assigned those nodes as long
as the pjrun program runs. The pjrun program continues to
run until killed externally. To release the assigned nodes, kill the
pjrun program, e.g. by typing CTRL-C.
You can put a time limit on the pjrun program this way:
java -Dpj.np=<K> -Dpj.jobtime=<T> pjrun
where <K> is the number of backend nodes you need
and <T> is the time limit in seconds.
In this case the pjrun program will terminate itself automatically
after T seconds.
If you specify a time limit of more than one hour,
the job queue will still terminate the pjrun program after one hour.
You can also kill the pjrun program
manually.
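If you drive pjrun from another program, that program has to collect the node names from pjrun's standard output. The sketch below assumes the output format shown in the example above, one node name per line, and reads exactly K names rather than reading to end-of-file, because pjrun keeps running after it prints them. The class and method names are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; assumes pjrun prints one node name per line.
final class PjrunNodes {
    // Reads up to 'count' non-blank lines from the given output stream.
    static List<String> readNodes(Reader output, int count) throws IOException {
        BufferedReader in = new BufferedReader(output);
        List<String> nodes = new ArrayList<String>();
        String line;
        while (nodes.size() < count && (line = in.readLine()) != null) {
            line = line.trim();
            if (line.length() > 0) {
                nodes.add(line);
            }
        }
        return nodes;
    }

    public static void main(String[] args) throws IOException {
        // Sample text copied from the pjrun example above.
        String sample = "thug01\nthug02\nthug03\nthug04\n";
        for (String node : readNodes(new StringReader(sample), 4)) {
            System.out.println("assigned node: " + node);
        }
    }
}
```

In real use the Reader would wrap the standard output of the pjrun process instead of a fixed string.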
Copyright © 2013 Alan Kaminsky.
All rights reserved.
Last updated 28-Feb-2013.
Please send comments to ark@cs.rit.edu.