Guide to High-Throughput Computing

Using Condor in the Computer Science Department at R.I.T.

Revision 1.2 -- 28 September 2004

Current Condor Version: 6.6.6

Condor Overview
What is Condor?
Condor is an execution environment that focuses on high-throughput, not high-performance. This means Condor's goal is to process as many jobs as possible in a given time period, as compared to processing a single job as fast as possible. This does not mean that Condor will sit idle with queued jobs, but rather will work to maximize resources and compute time utilization throughout the entire flock.

What Can Condor Do?
Condor can run just about any program in any modern programming language provided:
  1. During execution, the program requires no user interaction
  2. The program has no graphical interface, not even one that goes unused

What Can't Condor Do?
Condor cannot run a program that requires mid-processing user intervention.

How Does Condor Work?
Condor reads a set of configuration files that apply to the flock as a whole, and a set of per-system configuration files that define the behaviour of a single system.
Once a job is submitted, Condor works to place your job in the best possible place for execution. From here, the magic of Condor is revealed.

It is important to note that Condor executes your jobs as you. This means your paths and environment setup are the same for Condor as they are for you. Quite literally, Condor simply runs the jobs for you, and takes care of a great deal of optimal task scheduling.

Condor will always give preference to the console user. Say your job is chugging away on system C while you sit on system A. Someone sits down at system C and logs in. Condor is aware of this, and suspends your job for a few minutes. If this person is simply there to check mail and is gone in a few minutes, Condor resumes your job on system C. The only effect is a brief suspension. However, if this person does not leave after a few minutes, the job is migrated to a different system. What migration means for your job depends on the Condor environment (the universe) it runs in; see the Universes section below.

Once a job ends, whether through completion, termination, or error, an email is sent to you, the submitter. This email reports processor usage statistics as well as other pertinent information.

Condor Systems in Computer Science
Currently, all student computer labs and all SMP systems are part of the Condor flock. This breaks down as follows:

System Category    Condor Node Count
Student Labs       134
SMP Systems        20
Total              154

Each of the SMP systems (holly, hilly, queeg, parasite, paradise) has as many Condor nodes as it has processors. Each node is given one processor and a dedicated block of memory, essentially emulating a single-processor machine.

For example, hilly and holly each have four processors and 4 gigabytes of memory. Condor sees each of these machines as four single-processor machines, each with one gigabyte of memory.

This is shown via example in the 'Finding the Status of a Condor Job' section.


Getting Started With Condor


Setting Paths
Only a few changes need to be made to your path to use Condor successfully. In your shell's startup file (the exact name depends on your shell, e.g., Bash uses .bashrc, the C shell uses .cshrc), append ~condor/condor/bin to your path. It is important to make this change permanent, because changes you make only to your current session (via setenv or export) will not be carried over to other systems where your job is run.
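
One way to confirm the change took effect is to log in again and check that the Condor commands resolve on your path. The following is only a sketch in Python; the helper name missing_commands is hypothetical, and the command list is simply the set of commands mentioned in this guide.

```python
import shutil

# Commands used in this guide; all are expected in ~condor/condor/bin.
CONDOR_COMMANDS = [
    "condor_compile", "condor_run", "condor_submit",
    "condor_status", "condor_q", "condor_rm",
]

def missing_commands(commands):
    """Return the commands that cannot be found on the current PATH."""
    return [name for name in commands if shutil.which(name) is None]

if __name__ == "__main__":
    missing = missing_commands(CONDOR_COMMANDS)
    if missing:
        print("Not on PATH:", ", ".join(missing))
    else:
        print("All Condor commands found.")
```

If anything is reported as missing, recheck the path line in your shell's startup file.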

Brief Guide to Common Condor Commands

This is a simplified overview of the more common Condor commands. In no way is this intended to be a manual of operations.

For more, see the man page for each command and the Users Guide (see the Other Resources section, below).


The Condor executables are all located in ~condor/condor/bin/.
Using Condor

Universes
Universes define the Condor environment that a job is executed in. This is entirely disjoint from your personal operating environment.

Standard Universe
  • This universe supplies checkpointing (see Other Resources for more), and RPC support. To use this universe, a program must be relinked with condor_compile.
Checkpointing allows a process to be migrated (moved) from one system to another during execution and, upon arrival, resume where it left off.

Using condor_compile is very easy; say the current compilation command is something along the lines of:

gcc -o myprogram.condor file1.c file2.c ...

Simply change it as follows:

condor_compile gcc -o myprogram.condor file1.c file2.c ...

This will create the executable myprogram.condor with the Condor libraries linked in. Note that only C and C++ programs can take advantage of this universe.

Vanilla Universe
  • This universe is designed for programs that cannot be relinked with condor_compile. Jobs in this universe cannot be checkpointed or use RPC. This means that if a job is forced to move, Condor can either:
    1. Suspend the job and resume it later on the same system
    2. End execution, move to a different system, and restart the program from the beginning.
  • The Vanilla Universe presumes the use of a shared file system. In the Computer Science department, this is not a problem as everything is mounted via NFS.

Java Universe
  • This universe is designed specifically for Java applications. For all practical purposes, it is nearly identical to the Vanilla Universe.
By the very nature of the Java Virtual Machine, a job cannot be suspended on one machine, moved to a second, and continue processing there. Eviction from a host is a literal death sentence for a Java process; when it resettles, it restarts from the beginning.
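
The practical difference between the three universes shows up when a job is forced off a machine. The sketch below restates the descriptions above as a lookup; it is an illustration only, and the names CAN_CHECKPOINT and outcome_when_moved are made up for this example rather than part of Condor.

```python
# Whether a job in each universe can be checkpointed, per the text above.
CAN_CHECKPOINT = {
    "standard": True,   # relinked with condor_compile
    "vanilla": False,   # cannot be relinked
    "java": False,      # runs inside the Java Virtual Machine
}

def outcome_when_moved(universe):
    """Describe what happens to a job's progress when it changes machines."""
    if universe not in CAN_CHECKPOINT:
        raise ValueError(f"unknown universe: {universe}")
    if CAN_CHECKPOINT[universe]:
        return "resumes where it left off"
    return "restarts from the beginning"

if __name__ == "__main__":
    for name in CAN_CHECKPOINT:
        print(f"{name}: {outcome_when_moved(name)}")
```

For long-running C or C++ programs, this is the main argument for taking the extra step of relinking with condor_compile.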


Submitting a Job to Condor

Regardless of which technique you use to submit jobs to Condor, you must be on a Condor-submit system. These include all student labs and the SMP systems in the department, as described above in the 'Condor Systems in Computer Science' section.

Note to CS Department Faculty: For various reasons, faculty machines are NOT part of the Condor flock. This means that to submit jobs to Condor, you must connect to a Condor-submit system to enqueue tasks.

To submit a job to Condor, you have two choices:
  1. Use condor_run to enqueue a command-line-based program.
  2. Use condor_submit to enqueue a more complex configuration.
This technique allows you to define environmental parameters (e.g., run only on Solaris 9 machines with more than 512 megabytes of memory), as well as the execution universe, I/O destinations, and many others. This is the preferred method, as it allows maximum control over jobs.

A sample job-submission configuration file (arbitrarily named condor.tab):

Sample Job Configuration File
universe=       java
jar_files=      runme.jar
executable=     TempTest.class
arguments=      TempTest 64 -v
output=         tempTest.out
error=          tempTest.err

queue

This configuration selects the Java universe and runs the command "TempTest 64 -v" (the java command is used implicitly because this is the Java Universe). All output normally destined for standard out is placed in tempTest.out, and all error output normally destined for standard error is placed in tempTest.err. Both of these are ASCII text files.

Having created this file, simply enter condor_submit {config file name} {enter} and your job is enqueued and will begin processing.
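
If you generate job files from a script rather than by hand, the same file can be produced programmatically. The following Python sketch is one way to do it, not an official Condor tool; render_submit_file and submit are hypothetical helper names, and submit only works on a Condor-submit system with condor_submit on your path.

```python
import subprocess

def render_submit_file(settings, queue_count=1):
    """Build the text of a job configuration file from key/value settings."""
    lines = [f"{key} = {value}" for key, value in settings.items()]
    lines.append("")
    lines.append("queue" if queue_count == 1 else f"queue {queue_count}")
    return "\n".join(lines) + "\n"

def submit(path):
    """Hand an existing configuration file to condor_submit."""
    subprocess.run(["condor_submit", path], check=True)

# The same settings as the sample configuration file above.
SETTINGS = {
    "universe": "java",
    "jar_files": "runme.jar",
    "executable": "TempTest.class",
    "arguments": "TempTest 64 -v",
    "output": "tempTest.out",
    "error": "tempTest.err",
}

if __name__ == "__main__":
    # Print the file; save it as condor.tab and call submit("condor.tab")
    # once you are on a Condor-submit system.
    print(render_submit_file(SETTINGS), end="")
```

Passing queue_count=150 to render_submit_file produces a "queue 150" line in place of the bare "queue".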

Other Submission Techniques

Multiple Submissions of the Same Task

Suppose you have a task that you need to run 150 times. This is useful for computing averages, or for testing routines against random data to ensure code stability. You could have Condor do this for you with a configuration file similar to the following:

Sample Job Configuration File with Multiple Submissions
universe=       java
jar_files=      runme.jar
executable=     TempTest.class
arguments=      TempTest 64 -v
output=         tempTest.out.$(Process)
error=          tempTest.err.$(Process)

queue 150

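Another feature introduced in this configuration file is the $(Process) tag. Condor replaces it with the job's process number within the submission, counting from 0, so the 150 jobs here write tempTest.out.0 through tempTest.out.149 (and likewise for tempTest.err). Note that this is Condor's own numbering, not a UNIX process ID (PID). This is particularly useful when enqueueing many jobs as we have done here.

Other tags like $(Process) are available; see the Users Guide for more.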
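
To see which files a "queue 150" submission will leave behind, the substitution can be mimicked in a few lines of Python. This is a sketch of the naming only; expand_process_macro is a hypothetical helper, and Condor performs the real substitution itself at submit time.

```python
def expand_process_macro(template, count):
    """Return the file names produced by substituting $(Process) for 0..count-1."""
    return [template.replace("$(Process)", str(number)) for number in range(count)]

if __name__ == "__main__":
    names = expand_process_macro("tempTest.out.$(Process)", 150)
    print(names[0], "...", names[-1])
```

A list like this is handy when a follow-up script needs to collect the results of every run.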

Multiple Data Runs for a Single Executable

Suppose you have a set of jobs that all use a common executable. For example, if you have two Mathematica jobs, you could enqueue them for Condor as follows:

Sample Job Configuration File with Multiple Tasks for Common Executable
executable=     mathematica
universe=       vanilla

input=          test.data
output=         loop.out
error=          loop.err

Initialdir=     run_1
queue

Initialdir=     run_2
queue

With this configuration, the first job reads its input from and stores its output in directory run_1, and the second job uses run_2.
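
Because the file names in the configuration are relative, each one is resolved against the Initialdir in effect when its queue line is reached. The Python sketch below spells out the resulting paths; resolve_job_files is a hypothetical helper used only for illustration.

```python
from pathlib import PurePosixPath

# File settings shared by both jobs in the sample configuration.
FILES = {"input": "test.data", "output": "loop.out", "error": "loop.err"}

def resolve_job_files(initialdir, files):
    """Resolve each relative file name against the job's initial directory."""
    return {role: str(PurePosixPath(initialdir) / name) for role, name in files.items()}

if __name__ == "__main__":
    for directory in ("run_1", "run_2"):
        print(directory, resolve_job_files(directory, FILES))
```

In other words, each directory needs its own copy of test.data before the jobs are submitted, and each ends up with its own loop.out and loop.err.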

Specifying Hardware and Software Requirements for a Condor Task

Say Condor is running on a network of several different architectures and operating systems. You can restrict a task to machines that meet certain requirements, and state a preference among the machines that qualify, as follows:

Sample Job Configuration File with System Specifications
executable=     foo
Requirements=   Memory >= 128 && OpSys == "SOLARIS29" && Arch == "SUN4u"
Rank=           Memory >= 512

input=          dataIn
output=         foo.out
error=          foo.err

queue
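
To make the roles of Requirements and Rank concrete, here is a small Python sketch of the idea: Requirements filters out machines, and Rank orders the ones that remain. It uses the OpSys and Arch strings that condor_status reports in this flock (SOLARIS29 and SUN4u), a few machines taken from the status listing in the next section, and one made-up low-memory machine. It ignores machine state and everything else Condor's real matchmaking takes into account.

```python
MACHINES = [
    {"Name": "aeryn", "Memory": 256, "OpSys": "SOLARIS29", "Arch": "SUN4u"},
    {"Name": "achilles", "Memory": 512, "OpSys": "SOLARIS29", "Arch": "SUN4u"},
    {"Name": "vm1@hilly", "Memory": 1024, "OpSys": "SOLARIS29", "Arch": "SUN4u"},
    # Hypothetical machine, included only to show a rejection.
    {"Name": "tiny", "Memory": 64, "OpSys": "SOLARIS29", "Arch": "SUN4u"},
]

def meets_requirements(machine):
    """Memory >= 128 && OpSys == "SOLARIS29" && Arch == "SUN4u" """
    return (machine["Memory"] >= 128
            and machine["OpSys"] == "SOLARIS29"
            and machine["Arch"] == "SUN4u")

def rank(machine):
    """Rank = Memory >= 512: true counts as 1.0, false as 0.0."""
    return 1.0 if machine["Memory"] >= 512 else 0.0

def match(machines):
    """Keep machines that qualify, preferred (higher rank) machines first."""
    qualified = [m for m in machines if meets_requirements(m)]
    return sorted(qualified, key=rank, reverse=True)

if __name__ == "__main__":
    for machine in match(MACHINES):
        print(machine["Name"], rank(machine))
```

Here aeryn still qualifies with 256 megabytes, but the two machines with at least 512 megabytes are preferred.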


Finding the Status of a Condor Job
During execution, we can find out what system a job is running on via the condor_status command. Example output:

Condor System Status
->condor_status

Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime

achilles.cs.r SOLARIS29   SUN4u  Owner      Idle       0.043   512  0+00:00:04
aeryn.cs.rit. SOLARIS29   SUN4u  Unclaimed  Idle       0.000   256  0+00:35:05
agamemnon.cs. SOLARIS29   SUN4u  Owner      Idle       0.137   512  0+01:55:04
...
...
connecticut.c SOLARIS29   SUN4u  Owner      Idle       0.000   512  0+02:00:05
cotterpin.cs. SOLARIS29   SUN4u  Owner      Idle       0.035   256  0+02:05:04
cyclops.cs.ri SOLARIS29   SUN4u  Unclaimed  Idle       0.000   512  0+03:05:04
delaware.cs.r SOLARIS29   SUN4u  Owner      Idle       1.051   512  1+03:00:08
denethor.cs.r SOLARIS29   SUN4u  Unclaimed  Idle       0.000   512  0+02:25:06
dent.cs.rit.e SOLARIS29   SUN4u  Unclaimed  Idle       0.000   256  0+00:00:04
dione.cs.rit. SOLARIS29   SUN4u  Claimed    Suspended  204.613   512  0+00:00:04
domino.cs.rit SOLARIS29   SUN4u  Owner      Idle       0.000   512  0+01:55:04
doors.cs.rit. SOLARIS29   SUN4u  Unclaimed  Idle       0.000   512  0+00:00:04
drifters.cs.r SOLARIS29   SUN4u  Owner      Idle       0.000   512  0+02:00:04
elrond.cs.rit SOLARIS29   SUN4u  Unclaimed  Idle       0.000   512  0+02:30:04
elvis.cs.rit. SOLARIS29   SUN4u  Owner      Idle       0.000   512  0+00:05:04
eomer.cs.rit. SOLARIS29   SUN4u  Unclaimed  Idle       0.000   512  0+00:35:04
...
...
vm1@hilly.cs. SOLARIS29   SUN4u  Unclaimed  Idle      1.000   1024  1+04:01:12
vm2@hilly.cs. SOLARIS29   SUN4u  Unclaimed  Idle      1.000   1024  0+00:00:05
vm3@hilly.cs. SOLARIS29   SUN4u  Unclaimed  Idle      1.000   1024  0+00:00:06
vm4@hilly.cs. SOLARIS29   SUN4u  Unclaimed  Idle      0.723   1024  0+00:00:07
vm1@holly.cs. SOLARIS29   SUN4u  Unclaimed  Idle      1.000   1024  1+07:46:09
vm2@holly.cs. SOLARIS29   SUN4u  Unclaimed  Idle      1.000   1024  0+00:00:05
vm3@holly.cs. SOLARIS29   SUN4u  Unclaimed  Idle      1.000   1024  0+00:00:06
vm4@holly.cs. SOLARIS29   SUN4u  Unclaimed  Idle      0.875   1024  0+00:00:07
...
...
wembly.cs.rit SOLARIS29   SUN4u  Owner      Idle       0.008   512  0+02:00:04
wingnut.cs.ri SOLARIS29   SUN4u  Owner      Idle       0.000   256  0+02:00:04
wisconsin.cs. SOLARIS29   SUN4u  Owner      Idle       1.133   512  1+04:00:05
wormtongue.cs SOLARIS29   SUN4u  Unclaimed  Idle       0.000   512  0+01:55:04
wrench.cs.rit SOLARIS29   SUN4u  Owner      Idle       0.031   256  0+02:00:05
xenon.cs.rit. SOLARIS29   SUN4u  Owner      Idle       0.000   256  0+02:05:04
yes.cs.rit.ed SOLARIS29   SUN4u  Unclaimed  Idle       0.000   512  0+00:10:04
zaphod.cs.rit SOLARIS29   SUN4u  Owner      Idle       0.000   256  0+02:00:04

                     Machines Owner Claimed Unclaimed Matched Preempting

     SUN4u/SOLARIS29      161    89       1        71       0          0

               Total      161    89       1        71       0          0
->

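This output tells us that a job has been placed on dione.cs.rit.edu (the "Claimed" state), although at this moment it is suspended. Note the entries for both holly and hilly: each appears as four virtual machines (vm1 through vm4) with 1024 megabytes of memory apiece, as described in the 'Condor Systems in Computer Science' section.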
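
With well over a hundred lines of output, it can be easier to let a script pick out the interesting machines. The following Python sketch parses a few lines copied from the listing above; the field names and the parse_status helper are assumptions made for this example, and the column layout is simply the one shown here.

```python
from collections import Counter

SAMPLE = """\
Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime

achilles.cs.r SOLARIS29   SUN4u  Owner      Idle       0.043   512  0+00:00:04
aeryn.cs.rit. SOLARIS29   SUN4u  Unclaimed  Idle       0.000   256  0+00:35:05
dione.cs.rit. SOLARIS29   SUN4u  Claimed    Suspended  204.613   512  0+00:00:04
...
vm1@hilly.cs. SOLARIS29   SUN4u  Unclaimed  Idle      1.000   1024  1+04:01:12
"""

FIELDS = ("name", "opsys", "arch", "state", "activity", "loadav", "mem", "time")

def parse_status(text):
    """Turn machine lines into dictionaries, skipping headers and fillers."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != len(FIELDS) or parts[0] == "Name":
            continue
        row = dict(zip(FIELDS, parts))
        row["mem"] = int(row["mem"])
        rows.append(row)
    return rows

if __name__ == "__main__":
    rows = parse_status(SAMPLE)
    print(Counter(row["state"] for row in rows))
    print([row["name"] for row in rows if row["state"] == "Claimed"])
```

The same approach works on the full listing, since the summary lines at the bottom have a different number of columns and are skipped.
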
Additionally, we can find information about this specific job by using condor_q:

Terse Job Status Report
->condor_q           
-- Submitter: tin.cs.rit.edu : <129.21.37.74:32781> : tin.cs.rit.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD              
   3.0   username         9/8 14:40   0+00:06:59 R  0   0.0  java TempTest 64 -
1 jobs; 0 idle, 1 running, 0 held
->

This reports that the job was submitted from tin.cs.rit.edu on the 8th of September (the 9/8 in the SUBMITTED column), and that it is currently running (the R in the ST column). The verbose mode of this output (condor_q -long) reports:

Extended Job Status Report
->condor_q -long
-- Submitter: tin.cs.rit.edu : <129.21.37.74:32781> : tin.cs.rit.edu
MyType = "Job"
TargetType = "Machine"
ClusterId = 3
QDate = 1063209473
CompletionDate = 0
Owner = "username"
RemoteWallClockTime = 0.000000
LocalUserCpu = 0.000000
LocalSysCpu = 0.000000
RemoteUserCpu = 0.000000
RemoteSysCpu = 0.000000
ExitStatus = 0
NumCkpts = 0
NumRestarts = 0
NumSystemHolds = 0
CommittedTime = 0
TotalSuspensions = 0
LastSuspensionTime = 0
CumulativeSuspensionTime = 0
ExitBySignal = FALSE
CondorVersion = "$CondorVersion: 6.4.7 Jan 26 2003 $"
CondorPlatform = "$CondorPlatform: SUN4X-SOLARIS28 $"
RootDir = "/"
Iwd = "/home/stuX/sXX/username/csclass/temps"
JobUniverse = 10
MinHosts = 1
MaxHosts = 1
WantRemoteSyscalls = FALSE
WantCheckpoint = FALSE
JobPrio = 0
User = "username@cs.rit.edu"
NiceUser = FALSE
Env = ""
JobNotification = 2
CoreSize = 0
KillSig = "SIGTERM"
Rank = 0.000000
In = "/dev/null"
TransferIn = FALSE
Out = "tempTest.out"
Err = "tempTest.err"
BufferSize = 524288
BufferBlockSize = 32768
TransferFiles = "NEVER"
TransferInput = "TempTest.class,runme.jar"
Cmd = "java"
TransferExecutable = FALSE
ImageSize = 0
ExecutableSize = 0
DiskUsage = 24
Requirements = (HasJava) && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && (FileSystemDomain == "cs.rit.edu")
PeriodicHold = FALSE
PeriodicRelease = FALSE
PeriodicRemove = FALSE
OnExitHold = FALSE
OnExitRemove = TRUE
Args = "TempTest 64 -v"
JarFiles = "runme.jar"
ProcId = 0
WantMatchDiagnostics = TRUE
LastMatchTime = 1063209474
OrigMaxHosts = 1
JobStatus = 2
EnteredCurrentStatus = 1063209476
CurrentHosts = 1
RemoteHost = "dione"
RemoteVirtualMachineID = 1
ShadowBday = 1063209476
JobStartDate = 1063209476
JobCurrentStartDate = 1063209476
JobRunCount = 1
ServerTime = 1063210159

->

More information is available from the condor_q command -- investigate via the man pages and the Users Guide.
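
The long report is a list of "Name = value" lines, with times given as seconds since the UNIX epoch, so it is easy to pull apart with a script. The Python sketch below parses a handful of lines copied from the report above and works out how long the job had been running when the report was taken. The helper names are hypothetical, and the meanings of the numeric JobStatus and JobUniverse codes come from the Condor manual rather than from this guide, so treat them as assumptions to verify there.

```python
SAMPLE = """\
ClusterId = 3
ProcId = 0
JobUniverse = 10
JobStatus = 2
RemoteHost = "dione"
JobStartDate = 1063209476
ServerTime = 1063210159
"""

# Numeric codes as listed in the Condor manual (not defined in this guide).
JOB_STATUS = {1: "Idle", 2: "Running", 3: "Removed", 4: "Completed", 5: "Held"}
JOB_UNIVERSE = {1: "standard", 5: "vanilla", 10: "java"}

def parse_classad(text):
    """Parse 'Name = value' lines; quoted values become strings, digits ints."""
    ad = {}
    for line in text.splitlines():
        key, sep, raw = line.partition(" = ")
        if not sep:
            continue  # not an attribute line, e.g. the Submitter banner
        key, raw = key.strip(), raw.strip()
        if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
            ad[key] = raw[1:-1]
            continue
        try:
            ad[key] = int(raw)
        except ValueError:
            ad[key] = raw  # leave floats, booleans, and expressions as text
    return ad

def format_elapsed(seconds):
    """Format seconds in the days+HH:MM:SS style used by condor_q."""
    days, rest = divmod(seconds, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{days}+{hours:02d}:{minutes:02d}:{secs:02d}"

if __name__ == "__main__":
    ad = parse_classad(SAMPLE)
    print("Job", f'{ad["ClusterId"]}.{ad["ProcId"]}', "on", ad["RemoteHost"])
    print("Status:", JOB_STATUS[ad["JobStatus"]])
    print("Universe:", JOB_UNIVERSE[ad["JobUniverse"]])
    print("Running for:", format_elapsed(ad["ServerTime"] - ad["JobStartDate"]))
```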

Canceling a Condor Job
From the above output, we see this job was assigned a ClusterId of 3. This is analogous to a UNIX process ID (PID). When we need to remove this job from processing, we can simply enter the following command:

condor_rm {cluster id} {enter}

This will not, however, send you a final email message. Only you can cancel your jobs, so you should know what happened.

Finding Out About How a Condor Job Ended
Regardless of how a Condor job ends (error and crash, successful termination, or otherwise), you will get an email that resembles the message below. The one exception, as noted above, is a job you cancel yourself.

Status Report Email
Date: Wed, 10 Sept 2003 07:31:52 -0400 (EDT)
From: condor@cs.rit.edu
To: username@cs.rit.edu
Subject: [Condor] Condor Job 1.0

This is an automated email from the Condor system
on machine "dent.cs.rit.edu".  Do not reply.

Your Condor job 1.0
        java TempTest 64 -v
has exited normally with status 0.


Submitted at:        Mon Sep  8 14:40:33 2003
Completed at:        Wed Sep 10 07:31:52 2003
Real Time:           1 16:51:19

Virtual Image Size:  103248 Kilobytes

Statistics from last run:
Allocation/Run time:     0 15:46:52
Remote User CPU Time:    0 00:01:48
Remote System CPU Time:  0 00:00:06
Total Remote CPU Time:   0 00:01:54

Statistics totaled from all runs:
Allocation/Run time:     1 16:18:46

Network:
    0.0 B  Run Bytes Received By Job
    0.0 B  Run Bytes Sent By Job


-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Questions about this message or Condor in general?
Email address of the local Condor administrator: condor@cs.rit.edu
The Official Condor Homepage is http://www.cs.wisc.edu/condor


From this message, we can see the job command (java TempTest 64 -v), that it exited cleanly (exit status 0), and that it took 1 day and almost 17 hours (nearly 41 hours) of real time. Further analysis shows the job was moved: the run time under "Statistics from last run" differs from the run time under "Statistics totaled from all runs". It was apparently moved at least once, and on the last machine it ran on, it was working for just under 16 continuous hours.
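
The figures in the email can be cross-checked with a little arithmetic. The Python sketch below recomputes the real time from the two timestamps and works out how much allocated time was spent in runs before the last one; the values are copied from the sample message, and as_condor_duration is a hypothetical helper that mimics the "days HH:MM:SS" layout of the email.

```python
from datetime import datetime, timedelta

# Timestamps from the sample email.
submitted = datetime(2003, 9, 8, 14, 40, 33)
completed = datetime(2003, 9, 10, 7, 31, 52)
real_time = completed - submitted

# Allocation/Run time figures from the sample email.
last_run = timedelta(hours=15, minutes=46, seconds=52)
all_runs = timedelta(days=1, hours=16, minutes=18, seconds=46)
earlier_runs = all_runs - last_run

def as_condor_duration(delta):
    """Format a timedelta as 'days HH:MM:SS', like the email does."""
    total = int(delta.total_seconds())
    days, rest = divmod(total, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{days} {hours:02d}:{minutes:02d}:{seconds:02d}"

if __name__ == "__main__":
    print("Real time:   ", as_condor_duration(real_time))     # 1 16:51:19
    print("Earlier runs:", as_condor_duration(earlier_runs))  # 1 00:31:54
```

The computed real time matches the "Real Time" line of the email, and the difference between the two run times shows that a little over 24 hours of allocated time went to runs before the final one.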

Other Resources

Condor web page in Computer Science at R.I.T:
[ http://www.cs.rit.edu/~condor/ ]

Presentation from CIS department (in .pdf format)
[ http://www.cs.rit.edu/~condor/condor.pdf ]

Condor web site:
[ http://www.cs.wisc.edu/condor/ ]

Condor Reference Manual & Users Guide:

System Administration Group
Computer Science Department
Rochester Institute of Technology